Mirror of https://github.com/explosion/spaCy.git (synced 2025-11-08 20:07:51 +03:00)

Commit 9a34d38829 — Merge pull request #3413 from explosion/develop

💫 Merge develop (v2.1) into master
.buildkite/train.yml (new file, 11 lines)
@@ -0,0 +1,11 @@
steps:
  -
    command: "fab env clean make test wheel"
    label: ":dizzy: :python:"
    artifact_paths: "dist/*.whl"
  - wait
  - trigger: "spacy-train-from-wheel"
    label: ":dizzy: :train:"
    build:
      env:
        SPACY_VERSION: "{$SPACY_VERSION}"
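The pipeline above passes `SPACY_VERSION` through to the triggered `spacy-train-from-wheel` build. As a rough, hypothetical sketch (the downstream job is not part of this diff, so the details are assumed), the training job could read that variable along these lines:

```python
# Hypothetical sketch of how the triggered "spacy-train-from-wheel" job might
# pin the spaCy wheel it installs to the version passed via the build env.
# Only the variable name SPACY_VERSION comes from the pipeline above;
# everything else here is an assumption for illustration.
import os
import subprocess

spacy_version = os.environ.get("SPACY_VERSION", "")
requirement = f"spacy=={spacy_version}" if spacy_version else "spacy"
subprocess.run(["pip", "install", requirement], check=True)
```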
.github/CONTRIBUTOR_AGREEMENT.md (vendored, 2 lines changed)
@@ -5,7 +5,7 @@ This spaCy Contributor Agreement (**"SCA"**) is based on the
 The SCA applies to any contribution that you make to any product or project
 managed by us (the **"project"**), and sets out the intellectual property rights
 you grant to us in the contributed materials. The term **"us"** shall mean
-[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+[ExplosionAI UG GmbH](https://explosion.ai/legal). The term
 **"you"** shall mean the person or entity identified below.

 If you agree to be bound by these terms, fill in the information requested
.github/contributors/Poluglottos.md (vendored, new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

    * [X] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Ryan Ford            |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | Mar 13 2019          |
| GitHub username                | Poluglottos          |
| Website (optional)             |                      |
.github/contributors/clippered.md (vendored, new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

[Standard spaCy contributor agreement text, identical to `.github/contributors/Poluglottos.md` above, except that the term **"us"** is given as [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal); the "signing on behalf of myself as an individual" statement is marked.]

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Kenneth Cruz         |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2018-12-07           |
| GitHub username                | clippered            |
| Website (optional)             |                      |
.github/contributors/jarib.md (vendored, new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

[Standard spaCy contributor agreement text, identical to `.github/contributors/Poluglottos.md` above, except that the term **"us"** is given as [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal); the "signing on behalf of myself as an individual" statement is marked.]

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Jari Bakken          |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2018-12-21           |
| GitHub username                | jarib                |
| Website (optional)             |                      |
.github/contributors/moreymat.md (vendored, new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

[Standard spaCy contributor agreement text, identical to `.github/contributors/Poluglottos.md` above, except that the term **"us"** is given as [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal); the "signing on behalf of my employer or a legal entity" statement is marked.]

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Mathieu Morey        |
| Company name (if applicable)   | Datactivist          |
| Title or role (if applicable)  | Researcher           |
| Date                           | 2019-01-07           |
| GitHub username                | moreymat             |
| Website (optional)             |                      |
.gitignore (vendored, 11 lines changed)
@@ -5,11 +5,15 @@ corpora/
 keys/

 # Website
+website/.cache/
+website/public/
+website/node_modules
+website/.npm
+website/logs
+*.log
+npm-debug.log*
 website/www/
 website/_deploy.sh
-website/.gitignore
-website/public
-node_modules

 # Cython / C extensions
 cythonize.json
@@ -38,6 +42,7 @@ venv/
 .dev
 .denv
 .pypyenv
+.pytest_cache/

 # Distribution / packaging
 env/
.travis.yml (new file, 24 lines)
@@ -0,0 +1,24 @@
language: python
sudo: false
cache: pip
dist: trusty
group: edge
python:
  - "2.7"
os:
  - linux
install:
  - "pip install -r requirements.txt"
  - "python setup.py build_ext --inplace"
  - "pip install -e ."
script:
  - "cat /proc/cpuinfo | grep flags | head -n 1"
  - "pip install pytest pytest-timeout"
  - "python -m pytest --tb=native spacy"
branches:
  except:
    - spacy.io
notifications:
  slack:
    secure: F8GvqnweSdzImuLL64TpfG0i5rYl89liyr9tmFVsHl4c0DNiDuGhZivUz0M1broS8svE3OPOllLfQbACG/4KxD890qfF9MoHzvRDlp7U+RtwMV/YAkYn8MGWjPIbRbX0HpGdY7O2Rc9Qy4Kk0T8ZgiqXYIqAz2Eva9/9BlSmsJQ=
  email: false
CONTRIBUTING.md (109 lines changed)
@@ -55,7 +55,7 @@ even format them as Markdown to copy-paste into GitHub issues:
 `python -m spacy info --markdown`.

 * **Checking the model compatibility:** If you're having problems with a
-  [statistical model](https://spacy.io/models), it may be because to the
+  [statistical model](https://spacy.io/models), it may be because the
   model is incompatible with your spaCy installation. In spaCy v2.0+, you can check
   this on the command line by running `python -m spacy validate`.

@@ -186,13 +186,99 @@ sure your test passes and reference the issue in your commit message.
 ## Code conventions

 Code should loosely follow [pep8](https://www.python.org/dev/peps/pep-0008/).
-Regular line length is **80 characters**, with some tolerance for lines up to
-90 characters if the alternative would be worse — for instance, if your list
-comprehension comes to 82 characters, it's better not to split it over two lines.
-You can also use a linter like [`flake8`](https://pypi.python.org/pypi/flake8)
-or [`frosted`](https://pypi.python.org/pypi/frosted) – just keep in mind that
-it won't work very well for `.pyx` files and will complain about Cython syntax
-like `<int*>` or `cimport`.
+As of `v2.1.0`, spaCy uses [`black`](https://github.com/ambv/black) for code
+formatting and [`flake8`](http://flake8.pycqa.org/en/latest/) for linting its
+Python modules. If you've built spaCy from source, you'll already have both
+tools installed.
+
+**⚠️ Note that formatting and linting is currently only possible for Python
+modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
+
+### Code formatting
+
+[`black`](https://github.com/ambv/black) is an opinionated Python code
+formatter, optimised to produce readable code and small diffs. You can run
+`black` from the command-line, or via your code editor. For example, if you're
+using [Visual Studio Code](https://code.visualstudio.com/), you can add the
+following to your `settings.json` to use `black` for formatting and auto-format
+your files on save:
+
+```json
+{
+    "python.formatting.provider": "black",
+    "[python]": {
+        "editor.formatOnSave": true
+    }
+}
+```
+
+[See here](https://github.com/ambv/black#editor-integration) for the full
+list of available editor integrations.
+
+#### Disabling formatting
+
+There are a few cases where auto-formatting doesn't improve readability – for
+example, in some of the the language data files like the `tag_map.py`, or in
+the tests that construct `Doc` objects from lists of words and other labels.
+Wrapping a block in `# fmt: off` and `# fmt: on` lets you disable formatting
+for that particular code. Here's an example:
+
+```python
+# fmt: off
+text = "I look forward to using Thingamajig. I've been told it will make my life easier..."
+heads = [1, 0, -1, -2, -1, -1, -5, -1, 3, 2, 1, 0, 2, 1, -3, 1, 1, -3, -7]
+deps = ["nsubj", "ROOT", "advmod", "prep", "pcomp", "dobj", "punct", "",
+        "nsubjpass", "aux", "auxpass", "ROOT", "nsubj", "aux", "ccomp",
+        "poss", "nsubj", "ccomp", "punct"]
+# fmt: on
+```
+
+### Code linting
+
+[`flake8`](http://flake8.pycqa.org/en/latest/) is a tool for enforcing code
+style. It scans one or more files and outputs errors and warnings. This feedback
+can help you stick to general standards and conventions, and can be very useful
+for spotting potential mistakes and inconsistencies in your code. The most
+important things to watch out for are syntax errors and undefined names, but you
+also want to keep an eye on unused declared variables or repeated
+(i.e. overwritten) dictionary keys. If your code was formatted with `black`
+(see above), you shouldn't see any formatting-related warnings.
+
+The [`.flake8`](.flake8) config defines the configuration we use for this
+codebase. For example, we're not super strict about the line length, and we're
+excluding very large files like lemmatization and tokenizer exception tables.
+
+Ideally, running the following command from within the repo directory should
+not return any errors or warnings:
+
+```bash
+flake8 spacy
+```
+
+#### Disabling linting
+
+Sometimes, you explicitly want to write code that's not compatible with our
+rules. For example, a module's `__init__.py` might import a function so other
+modules can import it from there, but `flake8` will complain about an unused
+import. And although it's generally discouraged, there might be cases where it
+makes sense to use a bare `except`.
+
+To ignore a given line, you can add a comment like `# noqa: F401`, specifying
+the code of the error or warning we want to ignore. It's also possible to
+ignore several comma-separated codes at once, e.g. `# noqa: E731,E123`. Here
+are some examples:
+
+```python
+# The imported class isn't used in this file, but imported here, so it can be
+# imported *from* here by another module.
+from .submodule import SomeClass  # noqa: F401
+
+try:
+    do_something()
+except:  # noqa: E722
+    # This bare except is justified, for some specific reason
+    do_something_else()
+```

 ### Python conventions

@@ -206,10 +292,9 @@ for example to show more specific error messages, you can use the `is_config()`
 helper function.

 ```python
-from .compat import unicode_, json_dumps, is_config
+from .compat import unicode_, is_config

 compatible_unicode = unicode_('hello world')
-compatible_json = json_dumps({'key': 'value'})
 if is_config(windows=True, python2=True):
     print("You are using Python 2 on Windows.")
 ```

@@ -235,7 +320,7 @@ of other types these names. For instance, don't name a text string `doc` — you
 should usually call this `text`. Two general code style preferences further help
 with naming. First, **lean away from introducing temporary variables**, as these
 clutter your namespace. This is one reason why comprehension expressions are
-often preferred. Second, **keep your functions shortish**, so that can work in a
+often preferred. Second, **keep your functions shortish**, so they can work in a
 smaller scope. Of course, this is a question of trade-offs.

 ### Cython conventions

@@ -353,7 +438,7 @@ avoid unnecessary imports.
 Extensive tests that take a long time should be marked with `@pytest.mark.slow`.
 Tests that require the model to be loaded should be marked with
 `@pytest.mark.models`. Loading the models is expensive and not necessary if
-you're not actually testing the model performance. If all you needs ia a `Doc`
+you're not actually testing the model performance. If all you need is a `Doc`
 object with annotations like heads, POS tags or the dependency parse, you can
 use the `get_doc()` utility function to construct it manually.

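The conventions changed above favour comprehension expressions over temporary accumulator variables, and shortish functions that work in a small scope. A minimal illustrative sketch of that preference (not taken from the repository) might look like this:

```python
# Illustrative sketch of the naming/comprehension preference described above;
# hypothetical helpers, not part of the spaCy code base.

# Leaning on a temporary variable that clutters the namespace:
def get_token_texts_verbose(doc):
    texts = []
    for token in doc:
        texts.append(token.text)
    return texts

# Preferred: a comprehension, no temporary state to track.
def get_token_texts(doc):
    return [token.text for token in doc]
```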
LICENSE (2 lines changed)
@@ -1,6 +1,6 @@
 The MIT License (MIT)

-Copyright (C) 2016 ExplosionAI UG (haftungsbeschränkt), 2016 spaCy GmbH, 2015 Matthew Honnibal
+Copyright (C) 2016-2019 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
MANIFEST.in (2 lines changed)
@@ -1,5 +1,5 @@
 recursive-include include *.h
 include LICENSE
-include README.rst
+include README.md
 include pyproject.toml
 include bin/spacy
Makefile (new file, 22 lines)
@@ -0,0 +1,22 @@
SHELL := /bin/bash
sha = $(shell "git" "rev-parse" "--short" "HEAD")

dist/spacy.pex : spacy/*.py* spacy/*/*.py*
	python3.6 -m venv env3.6
	source env3.6/bin/activate
	env3.6/bin/pip install wheel
	env3.6/bin/pip install -r requirements.txt --no-cache-dir
	env3.6/bin/python setup.py build_ext --inplace
	env3.6/bin/python setup.py sdist
	env3.6/bin/python setup.py bdist_wheel
	env3.6/bin/python -m pip install pex==1.5.3
	env3.6/bin/pex pytest dist/*.whl -e spacy -o dist/spacy-$(sha).pex
	cp dist/spacy-$(sha).pex dist/spacy.pex
	chmod a+rx dist/spacy.pex

.PHONY : clean

clean : setup.py
	source env3.6/bin/activate
	rm -rf dist/*
	python setup.py clean --all
README.md (new file, 284 lines)
@@ -0,0 +1,284 @@
<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>

# spaCy: Industrial-strength NLP

spaCy is a library for advanced Natural Language Processing in Python and
Cython. It's built on the very latest research, and was designed from day one
to be used in real products. spaCy comes with
[pre-trained statistical models](https://spacy.io/models) and word vectors, and
currently supports tokenization for **45+ languages**. It features the
**fastest syntactic parser** in the world, convolutional
**neural network models** for tagging, parsing and **named entity recognition**
and easy **deep learning** integration. It's commercial open-source software,
released under the MIT license.

💫 **Version 2.1 out now!** [Check out the release notes here.](https://github.com/explosion/spaCy/releases)

[](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
[](https://travis-ci.org/explosion/spaCy)
[](https://github.com/explosion/spaCy/releases)
[](https://pypi.python.org/pypi/spacy)
[](https://anaconda.org/conda-forge/spacy)
[](https://github.com/explosion/wheelwright/releases)
[](https://github.com/ambv/black)
[](https://twitter.com/spacy_io)

## 📖 Documentation

| Documentation   |                                                                 |
| --------------- | --------------------------------------------------------------- |
| [spaCy 101]     | New to spaCy? Here's everything you need to know!               |
| [Usage Guides]  | How to use spaCy and its features.                              |
| [New in v2.1]   | New features, backwards incompatibilities and migration guide.  |
| [API Reference] | The detailed reference for spaCy's API.                         |
| [Models]        | Download statistical language models for spaCy.                 |
| [Universe]      | Libraries, extensions, demos, books and courses.                |
| [Changelog]     | Changes and version history.                                    |
| [Contribute]    | How to contribute to the spaCy project and code base.           |

[spacy 101]: https://spacy.io/usage/spacy-101
[new in v2.1]: https://spacy.io/usage/v2-1
[usage guides]: https://spacy.io/usage/
[api reference]: https://spacy.io/api/
[models]: https://spacy.io/models
[universe]: https://spacy.io/universe
[changelog]: https://spacy.io/usage/#changelog
[contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md

## 💬 Where to ask questions

The spaCy project is maintained by [@honnibal](https://github.com/honnibal)
and [@ines](https://github.com/ines). Please understand that we won't be able
to provide individual support via email. We also believe that help is much more
valuable if it's shared publicly, so that more people can benefit from it.

| Type                      | Platforms                                               |
| ------------------------- | ------------------------------------------------------- |
| 🚨 **Bug Reports**        | [GitHub Issue Tracker]                                  |
| 🎁 **Feature Requests**   | [GitHub Issue Tracker]                                  |
| 👩💻 **Usage Questions**    | [Stack Overflow] · [Gitter Chat] · [Reddit User Group]  |
| 🗯 **General Discussion**  | [Gitter Chat] · [Reddit User Group]                     |

[github issue tracker]: https://github.com/explosion/spaCy/issues
[stack overflow]: http://stackoverflow.com/questions/tagged/spacy
[gitter chat]: https://gitter.im/explosion/spaCy
[reddit user group]: https://www.reddit.com/r/spacynlp

## Features

- **Fastest syntactic parser** in the world
- **Named entity** recognition
- Non-destructive **tokenization**
- Support for **45+ languages**
- Pre-trained [statistical models](https://spacy.io/models) and word vectors
- Easy **deep learning** integration
- Part-of-speech tagging
- Labelled dependency parsing
- Syntax-driven sentence segmentation
- Built in **visualizers** for syntax and NER
- Convenient string-to-hash mapping
- Export to numpy data arrays
- Efficient binary serialization
- Easy **model packaging** and deployment
- State-of-the-art speed
- Robust, rigorously evaluated accuracy

📖 **For more details, see the
[facts, figures and benchmarks](https://spacy.io/usage/facts-figures).**

## Install spaCy

For detailed installation instructions, see the
[documentation](https://spacy.io/usage).

- **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
- **Python version**: Python 2.7, 3.4+ (only 64 bit)
- **Package managers**: [pip] · [conda] (via `conda-forge`)

[pip]: https://pypi.python.org/pypi/spacy
[conda]: https://anaconda.org/conda-forge/spacy

### pip

Using pip, spaCy releases are available as source packages and binary wheels
(as of `v2.0.13`).

```bash
pip install spacy
```

When using pip it is generally recommended to install packages in a virtual
environment to avoid modifying system state:

```bash
python -m venv .env
source .env/bin/activate
pip install spacy
```

### conda

Thanks to our great community, we've finally re-added conda support. You can now
install spaCy via `conda-forge`:

```bash
conda config --add channels conda-forge
conda install spacy
```

For the feedstock including the build recipe and configuration,
check out [this repository](https://github.com/conda-forge/spacy-feedstock).
Improvements and pull requests to the recipe and setup are always appreciated.

### Updating spaCy

Some updates to spaCy may require downloading new statistical models. If you're
running spaCy v2.0 or higher, you can use the `validate` command to check if
your installed models are compatible and if not, print details on how to update
them:

```bash
pip install -U spacy
python -m spacy validate
```

If you've trained your own models, keep in mind that your training and runtime
inputs must match. After updating spaCy, we recommend **retraining your models**
with the new version.

📖 **For details on upgrading from spaCy 1.x to spaCy 2.x, see the
[migration guide](https://spacy.io/usage/v2#migrating).**

## Download models

As of v1.7.0, models for spaCy can be installed as **Python packages**.
This means that they're a component of your application, just like any
other module. Models can be installed using spaCy's `download` command,
or manually by pointing pip to a path or URL.

| Documentation          |                                                                |
| ---------------------- | -------------------------------------------------------------- |
| [Available Models]     | Detailed model descriptions, accuracy figures and benchmarks.  |
| [Models Documentation] | Detailed usage instructions.                                   |

[available models]: https://spacy.io/models
[models documentation]: https://spacy.io/docs/usage/models

```bash
# out-of-the-box: download best-matching default model
python -m spacy download en

# download best-matching version of specific model for your spaCy installation
python -m spacy download en_core_web_lg

# pip install .tar.gz archive from path or URL
pip install /Users/you/en_core_web_sm-2.0.0.tar.gz
```

### Loading and using models

To load a model, use `spacy.load()` with the model's shortcut link:

```python
import spacy
nlp = spacy.load('en')
doc = nlp(u'This is a sentence.')
```

If you've installed a model via pip, you can also `import` it directly and
then call its `load()` method:

```python
import spacy
import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp(u'This is a sentence.')
```

📖 **For more info and examples, check out the
[models documentation](https://spacy.io/docs/usage/models).**

### Support for older versions

If you're using an older version (`v1.6.0` or below), you can still download
and install the old models from within spaCy using `python -m spacy.en.download all`
or `python -m spacy.de.download all`. The `.tar.gz` archives are also
[attached to the v1.6.0 release](https://github.com/explosion/spaCy/tree/v1.6.0).
To download and install the models manually, unpack the archive, drop the
contained directory into `spacy/data` and load the model via `spacy.load('en')`
or `spacy.load('de')`.

## Compile from source

The other way to install spaCy is to clone its
[GitHub repository](https://github.com/explosion/spaCy) and build it from
source. That is the common way if you want to make changes to the code base.
You'll need to make sure that you have a development environment consisting of a
Python distribution including header files, a compiler,
[pip](https://pip.pypa.io/en/latest/installing/),
[virtualenv](https://virtualenv.pypa.io/) and [git](https://git-scm.com)
installed. The compiler part is the trickiest. How to do that depends on your
system. See notes on Ubuntu, OS X and Windows for details.

```bash
# make sure you are using the latest pip
python -m pip install -U pip
git clone https://github.com/explosion/spaCy
cd spaCy

python -m venv .env
source .env/bin/activate
export PYTHONPATH=`pwd`
pip install -r requirements.txt
python setup.py build_ext --inplace
```

Compared to regular install via pip, [requirements.txt](requirements.txt)
additionally installs developer dependencies such as Cython. For more details
and instructions, see the documentation on
[compiling spaCy from source](https://spacy.io/usage/#source) and the
[quickstart widget](https://spacy.io/usage/#section-quickstart) to get
the right commands for your platform and Python version.

### Ubuntu

Install system-level dependencies via `apt-get`:

```bash
sudo apt-get install build-essential python-dev git
```

### macOS / OS X

Install a recent version of [XCode](https://developer.apple.com/xcode/),
including the so-called "Command Line Tools". macOS and OS X ship with Python
and git preinstalled.

### Windows

Install a version of the [Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/) or
[Visual Studio Express](https://www.visualstudio.com/vs/visual-studio-express/)
that matches the version that was used to compile your Python
interpreter. For official distributions these are VS 2008 (Python 2.7),
VS 2010 (Python 3.4) and VS 2015 (Python 3.5).

## Run tests

spaCy comes with an [extensive test suite](spacy/tests). In order to run the
tests, you'll usually want to clone the repository and build spaCy from source.
This will also install the required development dependencies and test utilities
defined in the `requirements.txt`.

Alternatively, you can find out where spaCy is installed and run `pytest` on
that directory. Don't forget to also install the test utilities via spaCy's
`requirements.txt`:

```bash
python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"
pip install -r path/to/requirements.txt
python -m pytest <spacy-directory>
```

See [the documentation](https://spacy.io/usage/#tests) for more details and
examples.
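The README's loading snippets above stop at creating a `Doc`. As a small usage sketch building on them (assuming the `en_core_web_sm` package mentioned in the README has been downloaded and installed), inspecting the annotations a loaded model provides looks roughly like this:

```python
# Usage sketch extending the README example above; assumes the en_core_web_sm
# model package has been installed via "python -m spacy download en_core_web_sm".
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")

# Token-level annotations: text, part-of-speech tag, syntactic dependency label.
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Named entities recognized in the text.
for ent in doc.ents:
    print(ent.text, ent.label_)
```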
README.rst (deleted, 324 lines; listing truncated)
@@ -1,324 +0,0 @@
spaCy: Industrial-strength NLP
******************************

spaCy is a library for advanced Natural Language Processing in Python and Cython.
It's built on the very latest research, and was designed from day one to be
used in real products. spaCy comes with
`pre-trained statistical models <https://spacy.io/models>`_ and word
vectors, and currently supports tokenization for **30+ languages**. It features
the **fastest syntactic parser** in the world, convolutional **neural network models**
for tagging, parsing and **named entity recognition** and easy **deep learning**
integration. It's commercial open-source software, released under the MIT license.

💫 **Version 2.0 out now!** `Check out the release notes here. <https://github.com/explosion/spaCy/releases>`_

.. image:: https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-devops&style=flat-square
    :target: https://dev.azure.com/explosion-ai/public/_build?definitionId=8
    :alt: Azure Pipelines

.. image:: https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square
    :target: https://github.com/explosion/spaCy/releases
    :alt: Current Release Version

.. image:: https://img.shields.io/pypi/v/spacy.svg?style=flat-square
    :target: https://pypi.python.org/pypi/spacy
    :alt: pypi Version

.. image:: https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square
    :target: https://anaconda.org/conda-forge/spacy
    :alt: conda Version

.. image:: https://img.shields.io/badge/wheels-%E2%9C%93-4c1.svg?longCache=true&style=flat-square&logo=python&logoColor=white
    :target: https://github.com/explosion/wheelwright/releases
    :alt: Python wheels

.. image:: https://img.shields.io/twitter/follow/spacy_io.svg?style=social&label=Follow
    :target: https://twitter.com/spacy_io
    :alt: spaCy on Twitter

📖 Documentation
================

=================== ===
`spaCy 101`_        New to spaCy? Here's everything you need to know!
`Usage Guides`_     How to use spaCy and its features.
`New in v2.0`_      New features, backwards incompatibilities and migration guide.
`API Reference`_    The detailed reference for spaCy's API.
`Models`_           Download statistical language models for spaCy.
`Universe`_         Libraries, extensions, demos, books and courses.
`Changelog`_        Changes and version history.
`Contribute`_       How to contribute to the spaCy project and code base.
=================== ===

.. _spaCy 101: https://spacy.io/usage/spacy-101
.. _New in v2.0: https://spacy.io/usage/v2#migrating
.. _Usage Guides: https://spacy.io/usage/
.. _API Reference: https://spacy.io/api/
.. _Models: https://spacy.io/models
.. _Universe: https://spacy.io/universe
.. _Changelog: https://spacy.io/usage/#changelog
.. _Contribute: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md

💬 Where to ask questions
==========================

The spaCy project is maintained by `@honnibal <https://github.com/honnibal>`_
and `@ines <https://github.com/ines>`_. Please understand that we won't be able
to provide individual support via email. We also believe that help is much more
valuable if it's shared publicly, so that more people can benefit from it.

====================== ===
**Bug Reports**        `GitHub Issue Tracker`_
**Usage Questions**    `Stack Overflow`_, `Gitter Chat`_, `Reddit User Group`_
**General Discussion** `Gitter Chat`_, `Reddit User Group`_
====================== ===

.. _GitHub Issue Tracker: https://github.com/explosion/spaCy/issues
.. _Stack Overflow: http://stackoverflow.com/questions/tagged/spacy
.. _Gitter Chat: https://gitter.im/explosion/spaCy
.. _Reddit User Group: https://www.reddit.com/r/spacynlp

Features
========

* **Fastest syntactic parser** in the world
* **Named entity** recognition
* Non-destructive **tokenization**
* Support for **30+ languages**
* Pre-trained `statistical models <https://spacy.io/models>`_ and word vectors
* Easy **deep learning** integration
* Part-of-speech tagging
* Labelled dependency parsing
* Syntax-driven sentence segmentation
* Built in **visualizers** for syntax and NER
* Convenient string-to-hash mapping
* Export to numpy data arrays
* Efficient binary serialization
* Easy **model packaging** and deployment
* State-of-the-art speed
* Robust, rigorously evaluated accuracy

📖 **For more details, see the** `facts, figures and benchmarks <https://spacy.io/usage/facts-figures>`_.

Install spaCy
=============

For detailed installation instructions, see
the `documentation <https://spacy.io/usage>`_.

==================== ===
**Operating system** macOS / OS X, Linux, Windows (Cygwin, MinGW, Visual Studio)
**Python version**   CPython 2.7, 3.4+. Only 64 bit.
**Package managers** `pip`_, `conda`_ (via ``conda-forge``)
==================== ===

.. _pip: https://pypi.python.org/pypi/spacy
.. _conda: https://anaconda.org/conda-forge/spacy

pip
---

Using pip, spaCy releases are available as source packages and binary wheels
(as of ``v2.0.13``).

.. code:: bash

    pip install spacy

When using pip it is generally recommended to install packages in a virtual
environment to avoid modifying system state:

.. code:: bash

    python -m venv .env
    source .env/bin/activate
    pip install spacy

conda
-----

Thanks to our great community, we've finally re-added conda support. You can now
install spaCy via ``conda-forge``:

.. code:: bash

    conda config --add channels conda-forge
    conda install spacy

For the feedstock including the build recipe and configuration,
|
|
||||||
check out `this repository <https://github.com/conda-forge/spacy-feedstock>`_.
|
|
||||||
Improvements and pull requests to the recipe and setup are always appreciated.
|
|
||||||
|
|
||||||
Updating spaCy
|
|
||||||
--------------
|
|
||||||
|
|
||||||
Some updates to spaCy may require downloading new statistical models. If you're
|
|
||||||
running spaCy v2.0 or higher, you can use the ``validate`` command to check if
|
|
||||||
your installed models are compatible and if not, print details on how to update
|
|
||||||
them:
|
|
||||||
|
|
||||||
.. code:: bash
|
|
||||||
|
|
||||||
pip install -U spacy
|
|
||||||
python -m spacy validate
|
|
||||||
|
|
||||||
If you've trained your own models, keep in mind that your training and runtime
|
|
||||||
inputs must match. After updating spaCy, we recommend **retraining your models**
|
|
||||||
with the new version.
|
|
||||||
|
|
||||||
📖 **For details on upgrading from spaCy 1.x to spaCy 2.x, see the**
|
|
||||||
`migration guide <https://spacy.io/usage/v2#migrating>`_.
|
|
||||||
|
|
||||||
Download models
|
|
||||||
===============
|
|
||||||
|
|
||||||
As of v1.7.0, models for spaCy can be installed as **Python packages**.
|
|
||||||
This means that they're a component of your application, just like any
|
|
||||||
other module. Models can be installed using spaCy's ``download`` command,
|
|
||||||
or manually by pointing pip to a path or URL.
|
|
||||||
|
|
||||||
======================= ===
|
|
||||||
`Available Models`_ Detailed model descriptions, accuracy figures and benchmarks.
|
|
||||||
`Models Documentation`_ Detailed usage instructions.
|
|
||||||
======================= ===
|
|
||||||
|
|
||||||
.. _Available Models: https://spacy.io/models
|
|
||||||
.. _Models Documentation: https://spacy.io/docs/usage/models
|
|
||||||
|
|
||||||
.. code:: bash
|
|
||||||
|
|
||||||
# out-of-the-box: download best-matching default model
|
|
||||||
python -m spacy download en
|
|
||||||
|
|
||||||
# download best-matching version of specific model for your spaCy installation
|
|
||||||
python -m spacy download en_core_web_lg
|
|
||||||
|
|
||||||
# pip install .tar.gz archive from path or URL
|
|
||||||
pip install /Users/you/en_core_web_sm-2.0.0.tar.gz
|
|
||||||
|
|
||||||
Loading and using models
|
|
||||||
------------------------
|
|
||||||
|
|
||||||
To load a model, use ``spacy.load()`` with the model's shortcut link:
|
|
||||||
|
|
||||||
.. code:: python
|
|
||||||
|
|
||||||
import spacy
|
|
||||||
nlp = spacy.load('en')
|
|
||||||
doc = nlp(u'This is a sentence.')
|
|
||||||
|
|
||||||
If you've installed a model via pip, you can also ``import`` it directly and
|
|
||||||
then call its ``load()`` method:
|
|
||||||
|
|
||||||
.. code:: python
|
|
||||||
|
|
||||||
import spacy
|
|
||||||
import en_core_web_sm
|
|
||||||
|
|
||||||
nlp = en_core_web_sm.load()
|
|
||||||
doc = nlp(u'This is a sentence.')
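
Once the model is loaded, the returned ``Doc`` object exposes the linguistic
annotations. A minimal sketch (it assumes a statistical English model such as
``en_core_web_sm`` is installed; the example text is made up):

.. code:: python

    import spacy

    nlp = spacy.load('en')
    doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
    for token in doc:
        print(token.text, token.pos_, token.dep_)   # tokenization, tagging, parsing
    for ent in doc.ents:
        print(ent.text, ent.label_)                 # named entity recognition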
|
|
||||||
|
|
||||||
📖 **For more info and examples, check out the**
|
|
||||||
`models documentation <https://spacy.io/docs/usage/models>`_.
|
|
||||||
|
|
||||||
Support for older versions
|
|
||||||
--------------------------
|
|
||||||
|
|
||||||
If you're using an older version (``v1.6.0`` or below), you can still download
|
|
||||||
and install the old models from within spaCy using ``python -m spacy.en.download all``
|
|
||||||
or ``python -m spacy.de.download all``. The ``.tar.gz`` archives are also
|
|
||||||
`attached to the v1.6.0 release <https://github.com/explosion/spaCy/tree/v1.6.0>`_.
|
|
||||||
To download and install the models manually, unpack the archive, drop the
|
|
||||||
contained directory into ``spacy/data`` and load the model via ``spacy.load('en')``
|
|
||||||
or ``spacy.load('de')``.
|
|
||||||
|
|
||||||
Compile from source
|
|
||||||
===================
|
|
||||||
|
|
||||||
The other way to install spaCy is to clone its
|
|
||||||
`GitHub repository <https://github.com/explosion/spaCy>`_ and build it from
|
|
||||||
source. That is the common way if you want to make changes to the code base.
|
|
||||||
You'll need to make sure that you have a development environment consisting of a
|
|
||||||
Python distribution including header files, a compiler,
|
|
||||||
`pip <https://pip.pypa.io/en/latest/installing/>`__, `virtualenv <https://virtualenv.pypa.io/>`_
|
|
||||||
and `git <https://git-scm.com>`_ installed. The compiler part is the trickiest.
|
|
||||||
How to do that depends on your system. See notes on Ubuntu, OS X and Windows for
|
|
||||||
details.
|
|
||||||
|
|
||||||
.. code:: bash
|
|
||||||
|
|
||||||
# make sure you are using the latest pip
|
|
||||||
python -m pip install -U pip
|
|
||||||
git clone https://github.com/explosion/spaCy
|
|
||||||
cd spaCy
|
|
||||||
|
|
||||||
python -m venv .env
|
|
||||||
source .env/bin/activate
|
|
||||||
export PYTHONPATH=`pwd`
|
|
||||||
pip install -r requirements.txt
|
|
||||||
python setup.py build_ext --inplace
|
|
||||||
|
|
||||||
Compared to regular install via pip, `requirements.txt <requirements.txt>`_
|
|
||||||
additionally installs developer dependencies such as Cython. For more details
|
|
||||||
and instructions, see the documentation on
|
|
||||||
`compiling spaCy from source <https://spacy.io/usage/#source>`_ and the
|
|
||||||
`quickstart widget <https://spacy.io/usage/#section-quickstart>`_ to get
|
|
||||||
the right commands for your platform and Python version.
|
|
||||||
|
|
||||||
Instead of the above verbose commands, you can also use the following
|
|
||||||
`Fabric <http://www.fabfile.org/>`_ commands. All commands assume that your
|
|
||||||
virtual environment is located in a directory ``.env``. If you're using a
|
|
||||||
different directory, you can change it via the environment variable ``VENV_DIR``,
|
|
||||||
for example ``VENV_DIR=".custom-env" fab clean make``.
|
|
||||||
|
|
||||||
============= ===
|
|
||||||
``fab env`` Create virtual environment and delete previous one, if it exists.
|
|
||||||
``fab make`` Compile the source.
|
|
||||||
``fab clean`` Remove compiled objects, including the generated C++.
|
|
||||||
``fab test`` Run basic tests, aborting after first failure.
|
|
||||||
============= ===
|
|
||||||
|
|
||||||
Ubuntu
|
|
||||||
------
|
|
||||||
|
|
||||||
Install system-level dependencies via ``apt-get``:
|
|
||||||
|
|
||||||
.. code:: bash
|
|
||||||
|
|
||||||
sudo apt-get install build-essential python-dev git
|
|
||||||
|
|
||||||
macOS / OS X
|
|
||||||
------------
|
|
||||||
|
|
||||||
Install a recent version of `XCode <https://developer.apple.com/xcode/>`_,
|
|
||||||
including the so-called "Command Line Tools". macOS and OS X ship with Python
|
|
||||||
and git preinstalled.
|
|
||||||
|
|
||||||
Windows
|
|
||||||
-------
|
|
||||||
|
|
||||||
Install a version of `Visual Studio Express <https://www.visualstudio.com/vs/visual-studio-express/>`_
|
|
||||||
or higher that matches the version that was used to compile your Python
|
|
||||||
interpreter. For official distributions these are VS 2008 (Python 2.7),
|
|
||||||
VS 2010 (Python 3.4) and VS 2015 (Python 3.5).
|
|
||||||
|
|
||||||
Run tests
|
|
||||||
=========
|
|
||||||
|
|
||||||
spaCy comes with an `extensive test suite <spacy/tests>`_. In order to run the
|
|
||||||
tests, you'll usually want to clone the repository and build spaCy from source.
|
|
||||||
This will also install the required development dependencies and test utilities
|
|
||||||
defined in the ``requirements.txt``.
|
|
||||||
|
|
||||||
Alternatively, you can find out where spaCy is installed and run ``pytest`` on
|
|
||||||
that directory. Don't forget to also install the test utilities via spaCy's
|
|
||||||
``requirements.txt``:
|
|
||||||
|
|
||||||
.. code:: bash
|
|
||||||
|
|
||||||
python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"
|
|
||||||
pip install -r path/to/requirements.txt
|
|
||||||
python -m pytest <spacy-directory>
|
|
||||||
|
|
||||||
See `the documentation <https://spacy.io/usage/#tests>`_ for more details and
examples.
@@ -5,6 +5,10 @@ trigger:
     - '*'
     exclude:
     - 'spacy.io'
+  paths:
+    exclude:
+    - 'website/*'
+    - '*.md'
 
 jobs:

@@ -26,12 +30,15 @@ jobs:
   dependsOn: 'Validate'
   strategy:
     matrix:
-      Python27Linux:
-        imageName: 'ubuntu-16.04'
-        python.version: '2.7'
-      Python27Mac:
-        imageName: 'macos-10.13'
-        python.version: '2.7'
+      # Python 2.7 currently doesn't work because it seems to be a narrow
+      # unicode build, which causes problems with the regular expressions
+      # Python27Linux:
+      #   imageName: 'ubuntu-16.04'
+      #   python.version: '2.7'
+      # Python27Mac:
+      #   imageName: 'macos-10.13'
+      #   python.version: '2.7'
       Python35Linux:
         imageName: 'ubuntu-16.04'
         python.version: '3.5'
|
|
@ -35,41 +35,49 @@ import subprocess
|
||||||
import argparse
|
import argparse
|
||||||
|
|
||||||
|
|
||||||
HASH_FILE = 'cythonize.json'
|
HASH_FILE = "cythonize.json"
|
||||||
|
|
||||||
|
|
||||||
def process_pyx(fromfile, tofile):
|
def process_pyx(fromfile, tofile, language_level="-2"):
|
||||||
print('Processing %s' % fromfile)
|
print("Processing %s" % fromfile)
|
||||||
try:
|
try:
|
||||||
from Cython.Compiler.Version import version as cython_version
|
from Cython.Compiler.Version import version as cython_version
|
||||||
from distutils.version import LooseVersion
|
from distutils.version import LooseVersion
|
||||||
if LooseVersion(cython_version) < LooseVersion('0.19'):
|
|
||||||
raise Exception('Require Cython >= 0.19')
|
if LooseVersion(cython_version) < LooseVersion("0.19"):
|
||||||
|
raise Exception("Require Cython >= 0.19")
|
||||||
|
|
||||||
except ImportError:
|
except ImportError:
|
||||||
pass
|
pass
|
||||||
|
|
||||||
flags = ['--fast-fail']
|
flags = ["--fast-fail", language_level]
|
||||||
if tofile.endswith('.cpp'):
|
if tofile.endswith(".cpp"):
|
||||||
flags += ['--cplus']
|
flags += ["--cplus"]
|
||||||
|
|
||||||
try:
|
try:
|
||||||
try:
|
try:
|
||||||
r = subprocess.call(['cython'] + flags + ['-o', tofile, fromfile],
|
r = subprocess.call(
|
||||||
env=os.environ) # See Issue #791
|
["cython"] + flags + ["-o", tofile, fromfile], env=os.environ
|
||||||
|
) # See Issue #791
|
||||||
if r != 0:
|
if r != 0:
|
||||||
raise Exception('Cython failed')
|
raise Exception("Cython failed")
|
||||||
except OSError:
|
except OSError:
|
||||||
# There are ways of installing Cython that don't result in a cython
|
# There are ways of installing Cython that don't result in a cython
|
||||||
# executable on the path, see gh-2397.
|
# executable on the path, see gh-2397.
|
||||||
r = subprocess.call([sys.executable, '-c',
|
r = subprocess.call(
|
||||||
'import sys; from Cython.Compiler.Main import '
|
[
|
||||||
'setuptools_main as main; sys.exit(main())'] + flags +
|
sys.executable,
|
||||||
['-o', tofile, fromfile])
|
"-c",
|
||||||
|
"import sys; from Cython.Compiler.Main import "
|
||||||
|
"setuptools_main as main; sys.exit(main())",
|
||||||
|
]
|
||||||
|
+ flags
|
||||||
|
+ ["-o", tofile, fromfile]
|
||||||
|
)
|
||||||
if r != 0:
|
if r != 0:
|
||||||
raise Exception('Cython failed')
|
raise Exception("Cython failed")
|
||||||
except OSError:
|
except OSError:
|
||||||
raise OSError('Cython needs to be installed')
|
raise OSError("Cython needs to be installed")
|
||||||
|
|
||||||
|
|
||||||
def preserve_cwd(path, func, *args):
|
def preserve_cwd(path, func, *args):
|
||||||
|
|
@ -89,12 +97,12 @@ def load_hashes(filename):
|
||||||
|
|
||||||
|
|
||||||
def save_hashes(hash_db, filename):
|
def save_hashes(hash_db, filename):
|
||||||
with open(filename, 'w') as f:
|
with open(filename, "w") as f:
|
||||||
f.write(json.dumps(hash_db))
|
f.write(json.dumps(hash_db))
|
||||||
|
|
||||||
|
|
||||||
def get_hash(path):
|
def get_hash(path):
|
||||||
return hashlib.md5(open(path, 'rb').read()).hexdigest()
|
return hashlib.md5(open(path, "rb").read()).hexdigest()
|
||||||
|
|
||||||
|
|
||||||
def hash_changed(base, path, db):
|
def hash_changed(base, path, db):
|
||||||
|
|
@ -109,25 +117,27 @@ def hash_add(base, path, db):
|
||||||
|
|
||||||
def process(base, filename, db):
|
def process(base, filename, db):
|
||||||
root, ext = os.path.splitext(filename)
|
root, ext = os.path.splitext(filename)
|
||||||
if ext in ['.pyx', '.cpp']:
|
if ext in [".pyx", ".cpp"]:
|
||||||
if hash_changed(base, filename, db) or not os.path.isfile(os.path.join(base, root + '.cpp')):
|
if hash_changed(base, filename, db) or not os.path.isfile(
|
||||||
preserve_cwd(base, process_pyx, root + '.pyx', root + '.cpp')
|
os.path.join(base, root + ".cpp")
|
||||||
hash_add(base, root + '.cpp', db)
|
):
|
||||||
hash_add(base, root + '.pyx', db)
|
preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp")
|
||||||
|
hash_add(base, root + ".cpp", db)
|
||||||
|
hash_add(base, root + ".pyx", db)
|
||||||
|
|
||||||
|
|
||||||
def check_changes(root, db):
|
def check_changes(root, db):
|
||||||
res = False
|
res = False
|
||||||
new_db = {}
|
new_db = {}
|
||||||
|
|
||||||
setup_filename = 'setup.py'
|
setup_filename = "setup.py"
|
||||||
hash_add('.', setup_filename, new_db)
|
hash_add(".", setup_filename, new_db)
|
||||||
if hash_changed('.', setup_filename, db):
|
if hash_changed(".", setup_filename, db):
|
||||||
res = True
|
res = True
|
||||||
|
|
||||||
for base, _, files in os.walk(root):
|
for base, _, files in os.walk(root):
|
||||||
for filename in files:
|
for filename in files:
|
||||||
if filename.endswith('.pxd'):
|
if filename.endswith(".pxd"):
|
||||||
hash_add(base, filename, new_db)
|
hash_add(base, filename, new_db)
|
||||||
if hash_changed(base, filename, db):
|
if hash_changed(base, filename, db):
|
||||||
res = True
|
res = True
|
||||||
|
|
@ -150,8 +160,10 @@ def run(root):
|
||||||
save_hashes(db, HASH_FILE)
|
save_hashes(db, HASH_FILE)
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == "__main__":
|
||||||
parser = argparse.ArgumentParser(description='Cythonize pyx files into C++ files as needed')
|
parser = argparse.ArgumentParser(
|
||||||
parser.add_argument('root', help='root directory')
|
description="Cythonize pyx files into C++ files as needed"
|
||||||
|
)
|
||||||
|
parser.add_argument("root", help="root directory")
|
||||||
args = parser.parse_args()
|
args = parser.parse_args()
|
||||||
run(args.root)
|
run(args.root)
|
||||||
|
|
|
||||||
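The script above skips re-running Cython by comparing MD5 hashes of the
``.pyx``/``.pxd`` sources against a JSON cache (``cythonize.json``). A minimal
standalone sketch of that caching pattern (the source path below is
hypothetical):

.. code:: python

    import hashlib
    import json
    import os

    HASH_FILE = "cythonize.json"

    def get_hash(path):
        # MD5 of the file contents, as in the script above
        with open(path, "rb") as f:
            return hashlib.md5(f.read()).hexdigest()

    def needs_rebuild(path, db):
        # Rebuild if there is no recorded hash or the file changed since last run
        return db.get(path) != get_hash(path)

    if __name__ == "__main__":
        db = json.load(open(HASH_FILE)) if os.path.exists(HASH_FILE) else {}
        for path in ["spacy/tokens/doc.pyx"]:  # hypothetical source file
            if os.path.exists(path) and needs_rebuild(path, db):
                print("would re-cythonize", path)
                db[path] = get_hash(path)
        with open(HASH_FILE, "w") as f:
            json.dump(db, f)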
97  bin/load_reddit.py  Normal file
@@ -0,0 +1,97 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import bz2
|
||||||
|
import re
|
||||||
|
import srsly
|
||||||
|
import sys
|
||||||
|
import random
|
||||||
|
import datetime
|
||||||
|
import plac
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
_unset = object()
|
||||||
|
|
||||||
|
|
||||||
|
class Reddit(object):
|
||||||
|
"""Stream cleaned comments from Reddit."""
|
||||||
|
|
||||||
|
pre_format_re = re.compile(r"^[`*~]")
|
||||||
|
post_format_re = re.compile(r"[`*~]$")
|
||||||
|
url_re = re.compile(r"\[([^]]+)\]\(%%URL\)")
|
||||||
|
link_re = re.compile(r"\[([^]]+)\]\(https?://[^\)]+\)")
|
||||||
|
|
||||||
|
def __init__(self, file_path, meta_keys={"subreddit": "section"}):
|
||||||
|
"""
|
||||||
|
file_path (unicode / Path): Path to archive or directory of archives.
|
||||||
|
meta_keys (dict): Meta data key included in the Reddit corpus, mapped
|
||||||
|
to display name in Prodigy meta.
|
||||||
|
RETURNS (Reddit): The Reddit loader.
|
||||||
|
"""
|
||||||
|
self.meta = meta_keys
|
||||||
|
file_path = Path(file_path)
|
||||||
|
if not file_path.exists():
|
||||||
|
raise IOError("Can't find file path: {}".format(file_path))
|
||||||
|
if not file_path.is_dir():
|
||||||
|
self.files = [file_path]
|
||||||
|
else:
|
||||||
|
self.files = list(file_path.iterdir())
|
||||||
|
|
||||||
|
def __iter__(self):
|
||||||
|
for file_path in self.iter_files():
|
||||||
|
with bz2.open(str(file_path)) as f:
|
||||||
|
for line in f:
|
||||||
|
line = line.strip()
|
||||||
|
if not line:
|
||||||
|
continue
|
||||||
|
comment = srsly.json_loads(line)
|
||||||
|
if self.is_valid(comment):
|
||||||
|
text = self.strip_tags(comment["body"])
|
||||||
|
yield {"text": text}
|
||||||
|
|
||||||
|
def get_meta(self, item):
|
||||||
|
return {name: item.get(key, "n/a") for key, name in self.meta.items()}
|
||||||
|
|
||||||
|
def iter_files(self):
|
||||||
|
for file_path in self.files:
|
||||||
|
yield file_path
|
||||||
|
|
||||||
|
def strip_tags(self, text):
|
||||||
|
text = self.link_re.sub(r"\1", text)
|
||||||
|
text = text.replace(">", ">").replace("<", "<")
|
||||||
|
text = self.pre_format_re.sub("", text)
|
||||||
|
text = self.post_format_re.sub("", text)
|
||||||
|
text = re.sub(r"\s+", " ", text)
|
||||||
|
return text.strip()
|
||||||
|
|
||||||
|
def is_valid(self, comment):
|
||||||
|
return (
|
||||||
|
comment["body"] is not None
|
||||||
|
and comment["body"] != "[deleted]"
|
||||||
|
and comment["body"] != "[removed]"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def main(path):
|
||||||
|
reddit = Reddit(path)
|
||||||
|
for comment in reddit:
|
||||||
|
print(srsly.json_dumps(comment))
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
import socket
|
||||||
|
|
||||||
|
try:
|
||||||
|
BrokenPipeError
|
||||||
|
except NameError:
|
||||||
|
BrokenPipeError = socket.error
|
||||||
|
try:
|
||||||
|
plac.call(main)
|
||||||
|
except BrokenPipeError:
|
||||||
|
import os, sys
|
||||||
|
|
||||||
|
# Python flushes standard streams on exit; redirect remaining output
|
||||||
|
# to devnull to avoid another BrokenPipeError at shutdown
|
||||||
|
devnull = os.open(os.devnull, os.O_WRONLY)
|
||||||
|
os.dup2(devnull, sys.stdout.fileno())
|
||||||
|
sys.exit(1) # Python exits with error code 1 on EPIPE
|
||||||
|
|
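A quick way to exercise the loader above from Python rather than the command
line (the module name and dump path are placeholders):

.. code:: python

    from load_reddit import Reddit  # assumes the script above is importable

    reddit = Reddit("RC_2015-01.bz2")  # placeholder path to a Reddit comments dump
    for i, comment in enumerate(reddit):
        print(comment["text"])  # cleaned comment text
        if i >= 4:  # only preview the first few comments
            break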
@@ -5,12 +5,15 @@ set -e
 # Insist repository is clean
 git diff-index --quiet HEAD
 
-git checkout master
-git pull origin master
-git push origin master
+git checkout $1
+git pull origin $1
+git push origin $1
 
 version=$(grep "__version__ = " spacy/about.py)
 version=${version/__version__ = }
 version=${version/\'/}
 version=${version/\'/}
+version=${version/\"/}
+version=${version/\"/}
 git tag "v$version"
-git push origin --tags
+git push origin "v$version" --tags
|
|
|
||||||
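For reference, a rough Python equivalent of the version extraction performed by
the shell script above (purely illustrative; it assumes it is run from the
repository root):

.. code:: python

    # Read the version string out of spacy/about.py, mirroring the grep and
    # parameter expansion in the shell script.
    with open("spacy/about.py") as f:
        for line in f:
            if line.startswith("__version__"):
                version = line.split("=", 1)[1].strip().strip("'\"")
                break

    print("v" + version)  # the tag name the script would create and push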
107  bin/train_word_vectors.py  Normal file
@@ -0,0 +1,107 @@
|
||||||
|
#!/usr/bin/env python
|
||||||
|
from __future__ import print_function, unicode_literals, division
|
||||||
|
|
||||||
|
import logging
|
||||||
|
from pathlib import Path
|
||||||
|
from collections import defaultdict
|
||||||
|
from gensim.models import Word2Vec
|
||||||
|
from preshed.counter import PreshCounter
|
||||||
|
import plac
|
||||||
|
import spacy
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class Corpus(object):
|
||||||
|
def __init__(self, directory, min_freq=10):
|
||||||
|
self.directory = directory
|
||||||
|
self.counts = PreshCounter()
|
||||||
|
self.strings = {}
|
||||||
|
self.min_freq = min_freq
|
||||||
|
|
||||||
|
def count_doc(self, doc):
|
||||||
|
# Get counts for this document
|
||||||
|
for word in doc:
|
||||||
|
self.counts.inc(word.orth, 1)
|
||||||
|
return len(doc)
|
||||||
|
|
||||||
|
def __iter__(self):
|
||||||
|
for text_loc in iter_dir(self.directory):
|
||||||
|
with text_loc.open("r", encoding="utf-8") as file_:
|
||||||
|
text = file_.read()
|
||||||
|
yield text
|
||||||
|
|
||||||
|
|
||||||
|
def iter_dir(loc):
|
||||||
|
dir_path = Path(loc)
|
||||||
|
for fn_path in dir_path.iterdir():
|
||||||
|
if fn_path.is_dir():
|
||||||
|
for sub_path in fn_path.iterdir():
|
||||||
|
yield sub_path
|
||||||
|
else:
|
||||||
|
yield fn_path
|
||||||
|
|
||||||
|
|
||||||
|
@plac.annotations(
|
||||||
|
lang=("ISO language code"),
|
||||||
|
in_dir=("Location of input directory"),
|
||||||
|
out_loc=("Location of output file"),
|
||||||
|
n_workers=("Number of workers", "option", "n", int),
|
||||||
|
size=("Dimension of the word vectors", "option", "d", int),
|
||||||
|
window=("Context window size", "option", "w", int),
|
||||||
|
min_count=("Min count", "option", "m", int),
|
||||||
|
negative=("Number of negative samples", "option", "g", int),
|
||||||
|
nr_iter=("Number of iterations", "option", "i", int),
|
||||||
|
)
|
||||||
|
def main(
|
||||||
|
lang,
|
||||||
|
in_dir,
|
||||||
|
out_loc,
|
||||||
|
negative=5,
|
||||||
|
n_workers=4,
|
||||||
|
window=5,
|
||||||
|
size=128,
|
||||||
|
min_count=10,
|
||||||
|
nr_iter=2,
|
||||||
|
):
|
||||||
|
logging.basicConfig(
|
||||||
|
format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
|
||||||
|
)
|
||||||
|
model = Word2Vec(
|
||||||
|
size=size,
|
||||||
|
window=window,
|
||||||
|
min_count=min_count,
|
||||||
|
workers=n_workers,
|
||||||
|
sample=1e-5,
|
||||||
|
negative=negative,
|
||||||
|
)
|
||||||
|
nlp = spacy.blank(lang)
|
||||||
|
corpus = Corpus(in_dir)
|
||||||
|
total_words = 0
|
||||||
|
total_sents = 0
|
||||||
|
for text_no, text_loc in enumerate(iter_dir(corpus.directory)):
|
||||||
|
with text_loc.open("r", encoding="utf-8") as file_:
|
||||||
|
text = file_.read()
|
||||||
|
total_sents += text.count("\n")
|
||||||
|
doc = nlp(text)
|
||||||
|
total_words += corpus.count_doc(doc)
|
||||||
|
logger.info(
|
||||||
|
"PROGRESS: at batch #%i, processed %i words, keeping %i word types",
|
||||||
|
text_no,
|
||||||
|
total_words,
|
||||||
|
len(corpus.strings),
|
||||||
|
)
|
||||||
|
model.corpus_count = total_sents
|
||||||
|
model.raw_vocab = defaultdict(int)
|
||||||
|
for orth, freq in corpus.counts:
|
||||||
|
if freq >= min_count:
|
||||||
|
model.raw_vocab[nlp.vocab.strings[orth]] = freq
|
||||||
|
model.scale_vocab()
|
||||||
|
model.finalize_vocab()
|
||||||
|
model.iter = nr_iter
|
||||||
|
model.train(corpus)
|
||||||
|
model.save(out_loc)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
plac.call(main)
|
||||||
|
|
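Once the script above has run, the saved model can be inspected with gensim; a
hedged sketch (the output path is a placeholder for whatever was passed as
``out_loc``):

.. code:: python

    from gensim.models import Word2Vec

    model = Word2Vec.load("/tmp/my_vectors.w2v")  # placeholder output path
    print(model.wv.most_similar("language", topn=5))  # nearest neighbours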
@ -1,5 +1,12 @@
|
||||||
"""
|
"""
|
||||||
This example shows how to use an LSTM sentiment classification model trained using Keras in spaCy. spaCy splits the document into sentences, and each sentence is classified using the LSTM. The scores for the sentences are then aggregated to give the document score. This kind of hierarchical model is quite difficult in "pure" Keras or Tensorflow, but it's very effective. The Keras example on this dataset performs quite poorly, because it cuts off the documents so that they're a fixed size. This hurts review accuracy a lot, because people often summarise their rating in the final sentence
|
This example shows how to use an LSTM sentiment classification model trained
|
||||||
|
using Keras in spaCy. spaCy splits the document into sentences, and each
|
||||||
|
sentence is classified using the LSTM. The scores for the sentences are then
|
||||||
|
aggregated to give the document score. This kind of hierarchical model is quite
|
||||||
|
difficult in "pure" Keras or Tensorflow, but it's very effective. The Keras
|
||||||
|
example on this dataset performs quite poorly, because it cuts off the documents
|
||||||
|
so that they're a fixed size. This hurts review accuracy a lot, because people
|
||||||
|
often summarise their rating in the final sentence
|
||||||
|
|
||||||
Prerequisites:
|
Prerequisites:
|
||||||
spacy download en_vectors_web_lg
|
spacy download en_vectors_web_lg
|
||||||
|
|
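As a stand-in for the aggregation step described above (per-sentence scores
pooled into one document score), a minimal sketch; the simple mean here is only
illustrative, the example's own pooling lives in ``SentimentAnalyser``:

.. code:: python

    import numpy

    def aggregate_doc_score(sentence_scores):
        # Pool per-sentence sentiment scores into a single document score
        return float(numpy.mean(sentence_scores))

    print(aggregate_doc_score([0.9, 0.2, 0.7]))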
@ -25,9 +32,9 @@ import spacy
|
||||||
class SentimentAnalyser(object):
|
class SentimentAnalyser(object):
|
||||||
@classmethod
|
@classmethod
|
||||||
def load(cls, path, nlp, max_length=100):
|
def load(cls, path, nlp, max_length=100):
|
||||||
with (path / 'config.json').open() as file_:
|
with (path / "config.json").open() as file_:
|
||||||
model = model_from_json(file_.read())
|
model = model_from_json(file_.read())
|
||||||
with (path / 'model').open('rb') as file_:
|
with (path / "model").open("rb") as file_:
|
||||||
lstm_weights = pickle.load(file_)
|
lstm_weights = pickle.load(file_)
|
||||||
embeddings = get_embeddings(nlp.vocab)
|
embeddings = get_embeddings(nlp.vocab)
|
||||||
model.set_weights([embeddings] + lstm_weights)
|
model.set_weights([embeddings] + lstm_weights)
|
||||||
|
|
@ -42,7 +49,7 @@ class SentimentAnalyser(object):
|
||||||
y = self._model.predict(X)
|
y = self._model.predict(X)
|
||||||
self.set_sentiment(doc, y)
|
self.set_sentiment(doc, y)
|
||||||
|
|
||||||
def pipe(self, docs, batch_size=1000, n_threads=2):
|
def pipe(self, docs, batch_size=1000):
|
||||||
for minibatch in cytoolz.partition_all(batch_size, docs):
|
for minibatch in cytoolz.partition_all(batch_size, docs):
|
||||||
minibatch = list(minibatch)
|
minibatch = list(minibatch)
|
||||||
sentences = []
|
sentences = []
|
||||||
|
|
@ -69,12 +76,12 @@ def get_labelled_sentences(docs, doc_labels):
|
||||||
for sent in doc.sents:
|
for sent in doc.sents:
|
||||||
sentences.append(sent)
|
sentences.append(sent)
|
||||||
labels.append(y)
|
labels.append(y)
|
||||||
return sentences, numpy.asarray(labels, dtype='int32')
|
return sentences, numpy.asarray(labels, dtype="int32")
|
||||||
|
|
||||||
|
|
||||||
def get_features(docs, max_length):
|
def get_features(docs, max_length):
|
||||||
docs = list(docs)
|
docs = list(docs)
|
||||||
Xs = numpy.zeros((len(docs), max_length), dtype='int32')
|
Xs = numpy.zeros((len(docs), max_length), dtype="int32")
|
||||||
for i, doc in enumerate(docs):
|
for i, doc in enumerate(docs):
|
||||||
j = 0
|
j = 0
|
||||||
for token in doc:
|
for token in doc:
|
||||||
|
|
@ -89,13 +96,22 @@ def get_features(docs, max_length):
|
||||||
return Xs
|
return Xs
|
||||||
|
|
||||||
|
|
||||||
def train(train_texts, train_labels, dev_texts, dev_labels,
|
def train(
|
||||||
lstm_shape, lstm_settings, lstm_optimizer, batch_size=100,
|
train_texts,
|
||||||
nb_epoch=5, by_sentence=True):
|
train_labels,
|
||||||
|
dev_texts,
|
||||||
|
dev_labels,
|
||||||
|
lstm_shape,
|
||||||
|
lstm_settings,
|
||||||
|
lstm_optimizer,
|
||||||
|
batch_size=100,
|
||||||
|
nb_epoch=5,
|
||||||
|
by_sentence=True,
|
||||||
|
):
|
||||||
|
|
||||||
print("Loading spaCy")
|
print("Loading spaCy")
|
||||||
nlp = spacy.load('en_vectors_web_lg')
|
nlp = spacy.load("en_vectors_web_lg")
|
||||||
nlp.add_pipe(nlp.create_pipe('sentencizer'))
|
nlp.add_pipe(nlp.create_pipe("sentencizer"))
|
||||||
embeddings = get_embeddings(nlp.vocab)
|
embeddings = get_embeddings(nlp.vocab)
|
||||||
model = compile_lstm(embeddings, lstm_shape, lstm_settings)
|
model = compile_lstm(embeddings, lstm_shape, lstm_settings)
|
||||||
|
|
||||||
|
|
@ -106,10 +122,15 @@ def train(train_texts, train_labels, dev_texts, dev_labels,
|
||||||
train_docs, train_labels = get_labelled_sentences(train_docs, train_labels)
|
train_docs, train_labels = get_labelled_sentences(train_docs, train_labels)
|
||||||
dev_docs, dev_labels = get_labelled_sentences(dev_docs, dev_labels)
|
dev_docs, dev_labels = get_labelled_sentences(dev_docs, dev_labels)
|
||||||
|
|
||||||
train_X = get_features(train_docs, lstm_shape['max_length'])
|
train_X = get_features(train_docs, lstm_shape["max_length"])
|
||||||
dev_X = get_features(dev_docs, lstm_shape['max_length'])
|
dev_X = get_features(dev_docs, lstm_shape["max_length"])
|
||||||
model.fit(train_X, train_labels, validation_data=(dev_X, dev_labels),
|
model.fit(
|
||||||
epochs=nb_epoch, batch_size=batch_size)
|
train_X,
|
||||||
|
train_labels,
|
||||||
|
validation_data=(dev_X, dev_labels),
|
||||||
|
epochs=nb_epoch,
|
||||||
|
batch_size=batch_size,
|
||||||
|
)
|
||||||
return model
|
return model
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -119,19 +140,28 @@ def compile_lstm(embeddings, shape, settings):
|
||||||
Embedding(
|
Embedding(
|
||||||
embeddings.shape[0],
|
embeddings.shape[0],
|
||||||
embeddings.shape[1],
|
embeddings.shape[1],
|
||||||
input_length=shape['max_length'],
|
input_length=shape["max_length"],
|
||||||
trainable=False,
|
trainable=False,
|
||||||
weights=[embeddings],
|
weights=[embeddings],
|
||||||
mask_zero=True
|
mask_zero=True,
|
||||||
)
|
)
|
||||||
)
|
)
|
||||||
model.add(TimeDistributed(Dense(shape['nr_hidden'], use_bias=False)))
|
model.add(TimeDistributed(Dense(shape["nr_hidden"], use_bias=False)))
|
||||||
model.add(Bidirectional(LSTM(shape['nr_hidden'],
|
model.add(
|
||||||
recurrent_dropout=settings['dropout'],
|
Bidirectional(
|
||||||
dropout=settings['dropout'])))
|
LSTM(
|
||||||
model.add(Dense(shape['nr_class'], activation='sigmoid'))
|
shape["nr_hidden"],
|
||||||
model.compile(optimizer=Adam(lr=settings['lr']), loss='binary_crossentropy',
|
recurrent_dropout=settings["dropout"],
|
||||||
metrics=['accuracy'])
|
dropout=settings["dropout"],
|
||||||
|
)
|
||||||
|
)
|
||||||
|
)
|
||||||
|
model.add(Dense(shape["nr_class"], activation="sigmoid"))
|
||||||
|
model.compile(
|
||||||
|
optimizer=Adam(lr=settings["lr"]),
|
||||||
|
loss="binary_crossentropy",
|
||||||
|
metrics=["accuracy"],
|
||||||
|
)
|
||||||
return model
|
return model
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -140,13 +170,13 @@ def get_embeddings(vocab):
|
||||||
|
|
||||||
|
|
||||||
def evaluate(model_dir, texts, labels, max_length=100):
|
def evaluate(model_dir, texts, labels, max_length=100):
|
||||||
nlp = spacy.load('en_vectors_web_lg')
|
nlp = spacy.load("en_vectors_web_lg")
|
||||||
nlp.add_pipe(nlp.create_pipe('sentencizer'))
|
nlp.add_pipe(nlp.create_pipe("sentencizer"))
|
||||||
nlp.add_pipe(SentimentAnalyser.load(model_dir, nlp, max_length=max_length))
|
nlp.add_pipe(SentimentAnalyser.load(model_dir, nlp, max_length=max_length))
|
||||||
|
|
||||||
correct = 0
|
correct = 0
|
||||||
i = 0
|
i = 0
|
||||||
for doc in nlp.pipe(texts, batch_size=1000, n_threads=4):
|
for doc in nlp.pipe(texts, batch_size=1000):
|
||||||
correct += bool(doc.sentiment >= 0.5) == bool(labels[i])
|
correct += bool(doc.sentiment >= 0.5) == bool(labels[i])
|
||||||
i += 1
|
i += 1
|
||||||
return float(correct) / i
|
return float(correct) / i
|
||||||
|
|
@ -154,7 +184,7 @@ def evaluate(model_dir, texts, labels, max_length=100):
|
||||||
|
|
||||||
def read_data(data_dir, limit=0):
|
def read_data(data_dir, limit=0):
|
||||||
examples = []
|
examples = []
|
||||||
for subdir, label in (('pos', 1), ('neg', 0)):
|
for subdir, label in (("pos", 1), ("neg", 0)):
|
||||||
for filename in (data_dir / subdir).iterdir():
|
for filename in (data_dir / subdir).iterdir():
|
||||||
with filename.open() as file_:
|
with filename.open() as file_:
|
||||||
text = file_.read()
|
text = file_.read()
|
||||||
|
|
@ -176,13 +206,21 @@ def read_data(data_dir, limit=0):
|
||||||
learn_rate=("Learn rate", "option", "e", float),
|
learn_rate=("Learn rate", "option", "e", float),
|
||||||
nb_epoch=("Number of training epochs", "option", "i", int),
|
nb_epoch=("Number of training epochs", "option", "i", int),
|
||||||
batch_size=("Size of minibatches for training LSTM", "option", "b", int),
|
batch_size=("Size of minibatches for training LSTM", "option", "b", int),
|
||||||
nr_examples=("Limit to N examples", "option", "n", int)
|
nr_examples=("Limit to N examples", "option", "n", int),
|
||||||
)
|
)
|
||||||
def main(model_dir=None, train_dir=None, dev_dir=None,
|
def main(
|
||||||
|
model_dir=None,
|
||||||
|
train_dir=None,
|
||||||
|
dev_dir=None,
|
||||||
is_runtime=False,
|
is_runtime=False,
|
||||||
nr_hidden=64, max_length=100, # Shape
|
nr_hidden=64,
|
||||||
dropout=0.5, learn_rate=0.001, # General NN config
|
max_length=100, # Shape
|
||||||
nb_epoch=5, batch_size=256, nr_examples=-1): # Training params
|
dropout=0.5,
|
||||||
|
learn_rate=0.001, # General NN config
|
||||||
|
nb_epoch=5,
|
||||||
|
batch_size=256,
|
||||||
|
nr_examples=-1,
|
||||||
|
): # Training params
|
||||||
if model_dir is not None:
|
if model_dir is not None:
|
||||||
model_dir = pathlib.Path(model_dir)
|
model_dir = pathlib.Path(model_dir)
|
||||||
if train_dir is None or dev_dir is None:
|
if train_dir is None or dev_dir is None:
|
||||||
|
|
@ -204,20 +242,26 @@ def main(model_dir=None, train_dir=None, dev_dir=None,
|
||||||
dev_texts, dev_labels = zip(*imdb_data[1])
|
dev_texts, dev_labels = zip(*imdb_data[1])
|
||||||
else:
|
else:
|
||||||
dev_texts, dev_labels = read_data(dev_dir, imdb_data, limit=nr_examples)
|
dev_texts, dev_labels = read_data(dev_dir, imdb_data, limit=nr_examples)
|
||||||
train_labels = numpy.asarray(train_labels, dtype='int32')
|
train_labels = numpy.asarray(train_labels, dtype="int32")
|
||||||
dev_labels = numpy.asarray(dev_labels, dtype='int32')
|
dev_labels = numpy.asarray(dev_labels, dtype="int32")
|
||||||
lstm = train(train_texts, train_labels, dev_texts, dev_labels,
|
lstm = train(
|
||||||
{'nr_hidden': nr_hidden, 'max_length': max_length, 'nr_class': 1},
|
train_texts,
|
||||||
{'dropout': dropout, 'lr': learn_rate},
|
train_labels,
|
||||||
|
dev_texts,
|
||||||
|
dev_labels,
|
||||||
|
{"nr_hidden": nr_hidden, "max_length": max_length, "nr_class": 1},
|
||||||
|
{"dropout": dropout, "lr": learn_rate},
|
||||||
{},
|
{},
|
||||||
nb_epoch=nb_epoch, batch_size=batch_size)
|
nb_epoch=nb_epoch,
|
||||||
|
batch_size=batch_size,
|
||||||
|
)
|
||||||
weights = lstm.get_weights()
|
weights = lstm.get_weights()
|
||||||
if model_dir is not None:
|
if model_dir is not None:
|
||||||
with (model_dir / 'model').open('wb') as file_:
|
with (model_dir / "model").open("wb") as file_:
|
||||||
pickle.dump(weights[1:], file_)
|
pickle.dump(weights[1:], file_)
|
||||||
with (model_dir / 'config.json').open('w') as file_:
|
with (model_dir / "config.json").open("w") as file_:
|
||||||
file_.write(lstm.to_json())
|
file_.write(lstm.to_json())
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == "__main__":
|
||||||
plac.call(main)
|
plac.call(main)
|
||||||
|
|
|
||||||
|
|
@ -15,14 +15,15 @@ import spacy
|
||||||
|
|
||||||
|
|
||||||
TEXTS = [
|
TEXTS = [
|
||||||
'Net income was $9.4 million compared to the prior year of $2.7 million.',
|
"Net income was $9.4 million compared to the prior year of $2.7 million.",
|
||||||
'Revenue exceeded twelve billion dollars, with a loss of $1b.',
|
"Revenue exceeded twelve billion dollars, with a loss of $1b.",
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
model=("Model to load (needs parser and NER)", "positional", None, str))
|
model=("Model to load (needs parser and NER)", "positional", None, str)
|
||||||
def main(model='en_core_web_sm'):
|
)
|
||||||
|
def main(model="en_core_web_sm"):
|
||||||
nlp = spacy.load(model)
|
nlp = spacy.load(model)
|
||||||
print("Loaded model '%s'" % model)
|
print("Loaded model '%s'" % model)
|
||||||
print("Processing %d texts" % len(TEXTS))
|
print("Processing %d texts" % len(TEXTS))
|
||||||
|
|
@ -31,7 +32,7 @@ def main(model='en_core_web_sm'):
|
||||||
doc = nlp(text)
|
doc = nlp(text)
|
||||||
relations = extract_currency_relations(doc)
|
relations = extract_currency_relations(doc)
|
||||||
for r1, r2 in relations:
|
for r1, r2 in relations:
|
||||||
print('{:<10}\t{}\t{}'.format(r1.text, r2.ent_type_, r2.text))
|
print("{:<10}\t{}\t{}".format(r1.text, r2.ent_type_, r2.text))
|
||||||
|
|
||||||
|
|
||||||
def extract_currency_relations(doc):
|
def extract_currency_relations(doc):
|
||||||
|
|
@ -41,18 +42,18 @@ def extract_currency_relations(doc):
|
||||||
span.merge()
|
span.merge()
|
||||||
|
|
||||||
relations = []
|
relations = []
|
||||||
for money in filter(lambda w: w.ent_type_ == 'MONEY', doc):
|
for money in filter(lambda w: w.ent_type_ == "MONEY", doc):
|
||||||
if money.dep_ in ('attr', 'dobj'):
|
if money.dep_ in ("attr", "dobj"):
|
||||||
subject = [w for w in money.head.lefts if w.dep_ == 'nsubj']
|
subject = [w for w in money.head.lefts if w.dep_ == "nsubj"]
|
||||||
if subject:
|
if subject:
|
||||||
subject = subject[0]
|
subject = subject[0]
|
||||||
relations.append((subject, money))
|
relations.append((subject, money))
|
||||||
elif money.dep_ == 'pobj' and money.head.dep_ == 'prep':
|
elif money.dep_ == "pobj" and money.head.dep_ == "prep":
|
||||||
relations.append((money.head.head, money))
|
relations.append((money.head.head, money))
|
||||||
return relations
|
return relations
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == "__main__":
|
||||||
plac.call(main)
|
plac.call(main)
|
||||||
|
|
||||||
# Expected output:
|
# Expected output:
|
||||||
|
|
|
||||||
|
|
@ -24,37 +24,39 @@ import plac
|
||||||
import spacy
|
import spacy
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(model=("Model to load", "positional", None, str))
|
||||||
model=("Model to load", "positional", None, str))
|
def main(model="en_core_web_sm"):
|
||||||
def main(model='en_core_web_sm'):
|
|
||||||
nlp = spacy.load(model)
|
nlp = spacy.load(model)
|
||||||
print("Loaded model '%s'" % model)
|
print("Loaded model '%s'" % model)
|
||||||
|
|
||||||
doc = nlp("displaCy uses CSS and JavaScript to show you how computers "
|
doc = nlp(
|
||||||
"understand language")
|
"displaCy uses CSS and JavaScript to show you how computers "
|
||||||
|
"understand language"
|
||||||
|
)
|
||||||
|
|
||||||
# The easiest way is to find the head of the subtree you want, and then use
|
# The easiest way is to find the head of the subtree you want, and then use
|
||||||
# the `.subtree`, `.children`, `.lefts` and `.rights` iterators. `.subtree`
|
# the `.subtree`, `.children`, `.lefts` and `.rights` iterators. `.subtree`
|
||||||
# is the one that does what you're asking for most directly:
|
# is the one that does what you're asking for most directly:
|
||||||
for word in doc:
|
for word in doc:
|
||||||
if word.dep_ in ('xcomp', 'ccomp'):
|
if word.dep_ in ("xcomp", "ccomp"):
|
||||||
print(''.join(w.text_with_ws for w in word.subtree))
|
print("".join(w.text_with_ws for w in word.subtree))
|
||||||
|
|
||||||
# It'd probably be better for `word.subtree` to return a `Span` object
|
# It'd probably be better for `word.subtree` to return a `Span` object
|
||||||
# instead of a generator over the tokens. If you want the `Span` you can
|
# instead of a generator over the tokens. If you want the `Span` you can
|
||||||
# get it via the `.right_edge` and `.left_edge` properties. The `Span`
|
# get it via the `.right_edge` and `.left_edge` properties. The `Span`
|
||||||
# object is nice because you can easily get a vector, merge it, etc.
|
# object is nice because you can easily get a vector, merge it, etc.
|
||||||
for word in doc:
|
for word in doc:
|
||||||
if word.dep_ in ('xcomp', 'ccomp'):
|
if word.dep_ in ("xcomp", "ccomp"):
|
||||||
subtree_span = doc[word.left_edge.i : word.right_edge.i + 1]
|
subtree_span = doc[word.left_edge.i : word.right_edge.i + 1]
|
||||||
print(subtree_span.text, '|', subtree_span.root.text)
|
print(subtree_span.text, "|", subtree_span.root.text)
|
||||||
|
|
||||||
# You might also want to select a head, and then select a start and end
|
# You might also want to select a head, and then select a start and end
|
||||||
# position by walking along its children. You could then take the
|
# position by walking along its children. You could then take the
|
||||||
# `.left_edge` and `.right_edge` of those tokens, and use it to calculate
|
# `.left_edge` and `.right_edge` of those tokens, and use it to calculate
|
||||||
# a span.
|
# a span.
|
||||||
|
|
||||||
if __name__ == '__main__':
|
|
||||||
|
if __name__ == "__main__":
|
||||||
plac.call(main)
|
plac.call(main)
|
||||||
|
|
||||||
# Expected output:
|
# Expected output:
|
||||||
|
|
|
||||||
|
|
@ -45,7 +45,7 @@ from __future__ import print_function, unicode_literals, division
|
||||||
from bz2 import BZ2File
|
from bz2 import BZ2File
|
||||||
import time
|
import time
|
||||||
import plac
|
import plac
|
||||||
import ujson
|
import json
|
||||||
|
|
||||||
from spacy.matcher import PhraseMatcher
|
from spacy.matcher import PhraseMatcher
|
||||||
import spacy
|
import spacy
|
||||||
|
|
@ -55,15 +55,15 @@ import spacy
|
||||||
patterns_loc=("Path to gazetteer", "positional", None, str),
|
patterns_loc=("Path to gazetteer", "positional", None, str),
|
||||||
text_loc=("Path to Reddit corpus file", "positional", None, str),
|
text_loc=("Path to Reddit corpus file", "positional", None, str),
|
||||||
n=("Number of texts to read", "option", "n", int),
|
n=("Number of texts to read", "option", "n", int),
|
||||||
lang=("Language class to initialise", "option", "l", str))
|
lang=("Language class to initialise", "option", "l", str),
|
||||||
def main(patterns_loc, text_loc, n=10000, lang='en'):
|
)
|
||||||
|
def main(patterns_loc, text_loc, n=10000, lang="en"):
|
||||||
nlp = spacy.blank(lang)
|
nlp = spacy.blank(lang)
|
||||||
nlp.vocab.lex_attr_getters = {}
|
nlp.vocab.lex_attr_getters = {}
|
||||||
phrases = read_gazetteer(nlp.tokenizer, patterns_loc)
|
phrases = read_gazetteer(nlp.tokenizer, patterns_loc)
|
||||||
count = 0
|
count = 0
|
||||||
t1 = time.time()
|
t1 = time.time()
|
||||||
for ent_id, text in get_matches(nlp.tokenizer, phrases,
|
for ent_id, text in get_matches(nlp.tokenizer, phrases, read_text(text_loc, n=n)):
|
||||||
read_text(text_loc, n=n)):
|
|
||||||
count += 1
|
count += 1
|
||||||
t2 = time.time()
|
t2 = time.time()
|
||||||
print("%d docs in %.3f s. %d matches" % (n, (t2 - t1), count))
|
print("%d docs in %.3f s. %d matches" % (n, (t2 - t1), count))
|
||||||
|
|
@ -71,8 +71,8 @@ def main(patterns_loc, text_loc, n=10000, lang='en'):
|
||||||
|
|
||||||
def read_gazetteer(tokenizer, loc, n=-1):
|
def read_gazetteer(tokenizer, loc, n=-1):
|
||||||
for i, line in enumerate(open(loc)):
|
for i, line in enumerate(open(loc)):
|
||||||
data = ujson.loads(line.strip())
|
data = json.loads(line.strip())
|
||||||
phrase = tokenizer(data['text'])
|
phrase = tokenizer(data["text"])
|
||||||
for w in phrase:
|
for w in phrase:
|
||||||
_ = tokenizer.vocab[w.text]
|
_ = tokenizer.vocab[w.text]
|
||||||
if len(phrase) >= 2:
|
if len(phrase) >= 2:
|
||||||
|
|
@ -82,15 +82,15 @@ def read_gazetteer(tokenizer, loc, n=-1):
|
||||||
def read_text(bz2_loc, n=10000):
|
def read_text(bz2_loc, n=10000):
|
||||||
with BZ2File(bz2_loc) as file_:
|
with BZ2File(bz2_loc) as file_:
|
||||||
for i, line in enumerate(file_):
|
for i, line in enumerate(file_):
|
||||||
data = ujson.loads(line)
|
data = json.loads(line)
|
||||||
yield data['body']
|
yield data["body"]
|
||||||
if i >= n:
|
if i >= n:
|
||||||
break
|
break
|
||||||
|
|
||||||
|
|
||||||
def get_matches(tokenizer, phrases, texts, max_length=6):
|
def get_matches(tokenizer, phrases, texts, max_length=6):
|
||||||
matcher = PhraseMatcher(tokenizer.vocab, max_length=max_length)
|
matcher = PhraseMatcher(tokenizer.vocab, max_length=max_length)
|
||||||
matcher.add('Phrase', None, *phrases)
|
matcher.add("Phrase", None, *phrases)
|
||||||
for text in texts:
|
for text in texts:
|
||||||
doc = tokenizer(text)
|
doc = tokenizer(text)
|
||||||
for w in doc:
|
for w in doc:
|
||||||
|
|
@ -100,10 +100,11 @@ def get_matches(tokenizer, phrases, texts, max_length=6):
|
||||||
yield (ent_id, doc[start:end].text)
|
yield (ent_id, doc[start:end].text)
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == "__main__":
|
||||||
if False:
|
if False:
|
||||||
import cProfile
|
import cProfile
|
||||||
import pstats
|
import pstats
|
||||||
|
|
||||||
cProfile.runctx("plac.call(main)", globals(), locals(), "Profile.prof")
|
cProfile.runctx("plac.call(main)", globals(), locals(), "Profile.prof")
|
||||||
s = pstats.Stats("Profile.prof")
|
s = pstats.Stats("Profile.prof")
|
||||||
s.strip_dirs().sort_stats("time").print_stats()
|
s.strip_dirs().sort_stats("time").print_stats()
|
||||||
|
|
|
||||||
|
|
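For reference, a minimal ``PhraseMatcher`` sketch along the same lines as the
script above (the gazetteer entries here are made up):

.. code:: python

    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.blank("en")
    matcher = PhraseMatcher(nlp.vocab)
    patterns = [nlp.make_doc(text) for text in ["Barack Obama", "New York"]]
    matcher.add("Phrase", None, *patterns)  # same add() signature as above

    doc = nlp.make_doc("Barack Obama visited New York last week.")
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)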
@ -1,5 +1,5 @@
|
||||||
import numpy as np
|
import numpy as np
|
||||||
import ujson as json
|
import json
|
||||||
from keras.utils import to_categorical
|
from keras.utils import to_categorical
|
||||||
import plac
|
import plac
|
||||||
import sys
|
import sys
|
||||||
|
|
@ -20,9 +20,10 @@ import os
|
||||||
import importlib
|
import importlib
|
||||||
from keras import backend as K
|
from keras import backend as K
|
||||||
|
|
||||||
|
|
||||||
def set_keras_backend(backend):
|
def set_keras_backend(backend):
|
||||||
if K.backend() != backend:
|
if K.backend() != backend:
|
||||||
os.environ['KERAS_BACKEND'] = backend
|
os.environ["KERAS_BACKEND"] = backend
|
||||||
importlib.reload(K)
|
importlib.reload(K)
|
||||||
assert K.backend() == backend
|
assert K.backend() == backend
|
||||||
if backend == "tensorflow":
|
if backend == "tensorflow":
|
||||||
|
|
@ -32,6 +33,7 @@ def set_keras_backend(backend):
|
||||||
K.set_session(K.tf.Session(config=cfg))
|
K.set_session(K.tf.Session(config=cfg))
|
||||||
K.clear_session()
|
K.clear_session()
|
||||||
|
|
||||||
|
|
||||||
set_keras_backend("tensorflow")
|
set_keras_backend("tensorflow")
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -40,9 +42,8 @@ def train(train_loc, dev_loc, shape, settings):
|
||||||
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
|
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
|
||||||
|
|
||||||
print("Loading spaCy")
|
print("Loading spaCy")
|
||||||
nlp = spacy.load('en_vectors_web_lg')
|
nlp = spacy.load("en_vectors_web_lg")
|
||||||
assert nlp.path is not None
|
assert nlp.path is not None
|
||||||
|
|
||||||
print("Processing texts...")
|
print("Processing texts...")
|
||||||
train_X = create_dataset(nlp, train_texts1, train_texts2, 100, shape[0])
|
train_X = create_dataset(nlp, train_texts1, train_texts2, 100, shape[0])
|
||||||
dev_X = create_dataset(nlp, dev_texts1, dev_texts2, 100, shape[0])
|
dev_X = create_dataset(nlp, dev_texts1, dev_texts2, 100, shape[0])
|
||||||
|
|
@ -55,28 +56,27 @@ def train(train_loc, dev_loc, shape, settings):
|
||||||
train_X,
|
train_X,
|
||||||
train_labels,
|
train_labels,
|
||||||
validation_data=(dev_X, dev_labels),
|
validation_data=(dev_X, dev_labels),
|
||||||
epochs = settings['nr_epoch'],
|
epochs=settings["nr_epoch"],
|
||||||
batch_size = settings['batch_size'])
|
batch_size=settings["batch_size"],
|
||||||
|
)
|
||||||
if not (nlp.path / 'similarity').exists():
|
if not (nlp.path / "similarity").exists():
|
||||||
(nlp.path / 'similarity').mkdir()
|
(nlp.path / "similarity").mkdir()
|
||||||
print("Saving to", nlp.path / 'similarity')
|
print("Saving to", nlp.path / "similarity")
|
||||||
weights = model.get_weights()
|
weights = model.get_weights()
|
||||||
# remove the embedding matrix. We can reconstruct it.
|
# remove the embedding matrix. We can reconstruct it.
|
||||||
del weights[1]
|
del weights[1]
|
||||||
with (nlp.path / 'similarity' / 'model').open('wb') as file_:
|
with (nlp.path / "similarity" / "model").open("wb") as file_:
|
||||||
pickle.dump(weights, file_)
|
pickle.dump(weights, file_)
|
||||||
with (nlp.path / 'similarity' / 'config.json').open('w') as file_:
|
with (nlp.path / "similarity" / "config.json").open("w") as file_:
|
||||||
file_.write(model.to_json())
|
file_.write(model.to_json())
|
||||||
|
|
||||||
|
|
||||||
def evaluate(dev_loc, shape):
|
def evaluate(dev_loc, shape):
|
||||||
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
|
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
|
||||||
nlp = spacy.load('en_vectors_web_lg')
|
nlp = spacy.load("en_vectors_web_lg")
|
||||||
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / 'similarity', nlp, shape[0]))
|
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / "similarity", nlp, shape[0]))
|
||||||
|
total = 0.0
|
||||||
total = 0.
|
correct = 0.0
|
||||||
correct = 0.
|
|
||||||
for text1, text2, label in zip(dev_texts1, dev_texts2, dev_labels):
|
for text1, text2, label in zip(dev_texts1, dev_texts2, dev_labels):
|
||||||
doc1 = nlp(text1)
|
doc1 = nlp(text1)
|
||||||
doc2 = nlp(text2)
|
doc2 = nlp(text2)
|
||||||
|
|
@ -88,11 +88,11 @@ def evaluate(dev_loc, shape):
|
||||||
|
|
||||||
|
|
||||||
def demo(shape):
|
def demo(shape):
|
||||||
nlp = spacy.load('en_vectors_web_lg')
|
nlp = spacy.load("en_vectors_web_lg")
|
||||||
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / 'similarity', nlp, shape[0]))
|
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / "similarity", nlp, shape[0]))
|
||||||
|
|
||||||
doc1 = nlp(u'The king of France is bald.')
|
doc1 = nlp(u"The king of France is bald.")
|
||||||
doc2 = nlp(u'France has no king.')
|
doc2 = nlp(u"France has no king.")
|
||||||
|
|
||||||
print("Sentence 1:", doc1)
|
print("Sentence 1:", doc1)
|
||||||
print("Sentence 2:", doc2)
|
print("Sentence 2:", doc2)
|
||||||
|
|
@ -101,30 +101,31 @@ def demo(shape):
|
||||||
print("Entailment type:", entailment_type, "(Confidence:", confidence, ")")
|
print("Entailment type:", entailment_type, "(Confidence:", confidence, ")")
|
||||||
|
|
||||||
|
|
||||||
LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2}
|
LABELS = {"entailment": 0, "contradiction": 1, "neutral": 2}
|
||||||
|
|
||||||
|
|
||||||
def read_snli(path):
|
def read_snli(path):
|
||||||
texts1 = []
|
texts1 = []
|
||||||
texts2 = []
|
texts2 = []
|
||||||
labels = []
|
labels = []
|
||||||
with open(path, 'r') as file_:
|
with open(path, "r") as file_:
|
||||||
for line in file_:
|
for line in file_:
|
||||||
eg = json.loads(line)
|
eg = json.loads(line)
|
||||||
label = eg['gold_label']
|
label = eg["gold_label"]
|
||||||
if label == '-': # per Parikh, ignore - SNLI entries
|
if label == "-": # per Parikh, ignore - SNLI entries
|
||||||
continue
|
continue
|
||||||
texts1.append(eg['sentence1'])
|
texts1.append(eg["sentence1"])
|
||||||
texts2.append(eg['sentence2'])
|
texts2.append(eg["sentence2"])
|
||||||
labels.append(LABELS[label])
|
labels.append(LABELS[label])
|
||||||
return texts1, texts2, to_categorical(np.asarray(labels, dtype='int32'))
|
return texts1, texts2, to_categorical(np.asarray(labels, dtype="int32"))
|
||||||
|
|
||||||
|
|
||||||
def create_dataset(nlp, texts, hypotheses, num_unk, max_length):
|
def create_dataset(nlp, texts, hypotheses, num_unk, max_length):
|
||||||
sents = texts + hypotheses
|
sents = texts + hypotheses
|
||||||
|
|
||||||
sents_as_ids = []
|
sents_as_ids = []
|
||||||
for sent in sents:
|
for sent in sents:
|
||||||
doc = nlp(sent)
|
doc = nlp(sent)
|
||||||
word_ids = []
|
word_ids = []
|
||||||
|
|
||||||
for i, token in enumerate(doc):
|
for i, token in enumerate(doc):
|
||||||
# skip odd spaces from tokenizer
|
# skip odd spaces from tokenizer
|
||||||
if token.has_vector and token.vector_norm == 0:
|
if token.has_vector and token.vector_norm == 0:
|
||||||
|
|
@ -140,12 +141,11 @@ def create_dataset(nlp, texts, hypotheses, num_unk, max_length):
|
||||||
word_ids.append(token.rank % num_unk + 1)
|
word_ids.append(token.rank % num_unk + 1)
|
||||||
|
|
||||||
# there must be a simpler way of generating padded arrays from lists...
|
# there must be a simpler way of generating padded arrays from lists...
|
||||||
word_id_vec = np.zeros((max_length), dtype='int')
|
word_id_vec = np.zeros((max_length), dtype="int")
|
||||||
clipped_len = min(max_length, len(word_ids))
|
clipped_len = min(max_length, len(word_ids))
|
||||||
word_id_vec[:clipped_len] = word_ids[:clipped_len]
|
word_id_vec[:clipped_len] = word_ids[:clipped_len]
|
||||||
sents_as_ids.append(word_id_vec)
|
sents_as_ids.append(word_id_vec)
|
||||||
|
|
||||||
|
|
||||||
return [np.array(sents_as_ids[: len(texts)]), np.array(sents_as_ids[len(texts) :])]
|
return [np.array(sents_as_ids[: len(texts)]), np.array(sents_as_ids[len(texts) :])]
|
||||||
|
|
||||||
|
|
||||||
|
|
@@ -159,39 +159,49 @@ def create_dataset(nlp, texts, hypotheses, num_unk, max_length):
     learn_rate=("Learning rate", "option", "r", float),
     batch_size=("Batch size for neural network training", "option", "b", int),
     nr_epoch=("Number of training epochs", "option", "e", int),
-    entail_dir=("Direction of entailment", "option", "D", str, ["both", "left", "right"])
+    entail_dir=(
+        "Direction of entailment",
+        "option",
+        "D",
+        str,
+        ["both", "left", "right"],
+    ),
 )
-def main(mode, train_loc, dev_loc,
+def main(
+    mode,
+    train_loc,
+    dev_loc,
     max_length=50,
     nr_hidden=200,
     dropout=0.2,
     learn_rate=0.001,
     batch_size=1024,
     nr_epoch=10,
-    entail_dir="both"):
+    entail_dir="both",
+):
     shape = (max_length, nr_hidden, 3)
     settings = {
-        'lr': learn_rate,
+        "lr": learn_rate,
-        'dropout': dropout,
+        "dropout": dropout,
-        'batch_size': batch_size,
+        "batch_size": batch_size,
-        'nr_epoch': nr_epoch,
+        "nr_epoch": nr_epoch,
-        'entail_dir': entail_dir
+        "entail_dir": entail_dir,
     }

-    if mode == 'train':
+    if mode == "train":
         if train_loc == None or dev_loc == None:
             print("Train mode requires paths to training and development data sets.")
             sys.exit(1)
         train(train_loc, dev_loc, shape, settings)
-    elif mode == 'evaluate':
+    elif mode == "evaluate":
         if dev_loc == None:
             print("Evaluate mode requires paths to test data set.")
             sys.exit(1)
         correct, total = evaluate(dev_loc, shape)
-        print(correct, '/', total, correct / total)
+        print(correct, "/", total, correct / total)
     else:
         demo(shape)

-if __name__ == '__main__':
+if __name__ == "__main__":
     plac.call(main)
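As a side note, the pad-or-clip step in create_dataset above is easy to see with a toy example. The sketch below is illustration only: pad_or_clip is a hypothetical helper that mirrors the np.zeros-plus-slicing trick, assuming a made-up max_length of 5.

import numpy as np


def pad_or_clip(word_ids, max_length):
    # hypothetical helper mirroring create_dataset: zeros act as the padding ID,
    # real word IDs are already shifted to start at 1
    vec = np.zeros((max_length,), dtype="int")
    clipped_len = min(max_length, len(word_ids))
    vec[:clipped_len] = word_ids[:clipped_len]
    return vec


print(pad_or_clip([7, 3, 9], 5))           # -> [7 3 9 0 0]  (padded)
print(pad_or_clip([7, 3, 9, 2, 8, 4], 5))  # -> [7 3 9 2 8]  (clipped)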
@ -5,11 +5,12 @@ import numpy as np
|
||||||
from keras import layers, Model, models, optimizers
|
from keras import layers, Model, models, optimizers
|
||||||
from keras import backend as K
|
from keras import backend as K
|
||||||
|
|
||||||
|
|
||||||
def build_model(vectors, shape, settings):
|
def build_model(vectors, shape, settings):
|
||||||
max_length, nr_hidden, nr_class = shape
|
max_length, nr_hidden, nr_class = shape
|
||||||
|
|
||||||
input1 = layers.Input(shape=(max_length,), dtype='int32', name='words1')
|
input1 = layers.Input(shape=(max_length,), dtype="int32", name="words1")
|
||||||
input2 = layers.Input(shape=(max_length,), dtype='int32', name='words2')
|
input2 = layers.Input(shape=(max_length,), dtype="int32", name="words2")
|
||||||
|
|
||||||
# embeddings (projected)
|
# embeddings (projected)
|
||||||
embed = create_embedding(vectors, max_length, nr_hidden)
|
embed = create_embedding(vectors, max_length, nr_hidden)
|
||||||
|
|
@ -23,7 +24,7 @@ def build_model(vectors, shape, settings):
|
||||||
|
|
||||||
G = create_feedforward(nr_hidden)
|
G = create_feedforward(nr_hidden)
|
||||||
|
|
||||||
if settings['entail_dir'] == 'both':
|
if settings["entail_dir"] == "both":
|
||||||
norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
|
norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
|
||||||
norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
|
norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
|
||||||
alpha = layers.dot([norm_weights_a, a], axes=1)
|
alpha = layers.dot([norm_weights_a, a], axes=1)
|
||||||
|
|
@ -40,7 +41,7 @@ def build_model(vectors, shape, settings):
|
||||||
v2_sum = layers.Lambda(sum_word)(v2)
|
v2_sum = layers.Lambda(sum_word)(v2)
|
||||||
concat = layers.concatenate([v1_sum, v2_sum])
|
concat = layers.concatenate([v1_sum, v2_sum])
|
||||||
|
|
||||||
elif settings['entail_dir'] == 'left':
|
elif settings["entail_dir"] == "left":
|
||||||
norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
|
norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
|
||||||
alpha = layers.dot([norm_weights_a, a], axes=1)
|
alpha = layers.dot([norm_weights_a, a], axes=1)
|
||||||
comp2 = layers.concatenate([b, alpha])
|
comp2 = layers.concatenate([b, alpha])
|
||||||
|
|
@ -58,40 +59,45 @@ def build_model(vectors, shape, settings):
|
||||||
|
|
||||||
H = create_feedforward(nr_hidden)
|
H = create_feedforward(nr_hidden)
|
||||||
out = H(concat)
|
out = H(concat)
|
||||||
out = layers.Dense(nr_class, activation='softmax')(out)
|
out = layers.Dense(nr_class, activation="softmax")(out)
|
||||||
|
|
||||||
model = Model([input1, input2], out)
|
model = Model([input1, input2], out)
|
||||||
|
|
||||||
model.compile(
|
model.compile(
|
||||||
optimizer=optimizers.Adam(lr=settings['lr']),
|
optimizer=optimizers.Adam(lr=settings["lr"]),
|
||||||
loss='categorical_crossentropy',
|
loss="categorical_crossentropy",
|
||||||
metrics=['accuracy'])
|
metrics=["accuracy"],
|
||||||
|
)
|
||||||
|
|
||||||
return model
|
return model
|
||||||
|
|
||||||
|
|
||||||
def create_embedding(vectors, max_length, projected_dim):
|
def create_embedding(vectors, max_length, projected_dim):
|
||||||
return models.Sequential([
|
return models.Sequential(
|
||||||
|
[
|
||||||
layers.Embedding(
|
layers.Embedding(
|
||||||
vectors.shape[0],
|
vectors.shape[0],
|
||||||
vectors.shape[1],
|
vectors.shape[1],
|
||||||
input_length=max_length,
|
input_length=max_length,
|
||||||
weights=[vectors],
|
weights=[vectors],
|
||||||
trainable=False),
|
trainable=False,
|
||||||
|
),
|
||||||
layers.TimeDistributed(
|
layers.TimeDistributed(
|
||||||
layers.Dense(projected_dim,
|
layers.Dense(projected_dim, activation=None, use_bias=False)
|
||||||
activation=None,
|
),
|
||||||
use_bias=False))
|
]
|
||||||
])
|
)
|
||||||
|
|
||||||
def create_feedforward(num_units=200, activation='relu', dropout_rate=0.2):
|
|
||||||
return models.Sequential([
|
def create_feedforward(num_units=200, activation="relu", dropout_rate=0.2):
|
||||||
|
return models.Sequential(
|
||||||
|
[
|
||||||
layers.Dense(num_units, activation=activation),
|
layers.Dense(num_units, activation=activation),
|
||||||
layers.Dropout(dropout_rate),
|
layers.Dropout(dropout_rate),
|
||||||
layers.Dense(num_units, activation=activation),
|
layers.Dense(num_units, activation=activation),
|
||||||
layers.Dropout(dropout_rate)
|
layers.Dropout(dropout_rate),
|
||||||
])
|
]
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
def normalizer(axis):
|
def normalizer(axis):
|
||||||
|
|
@ -99,39 +105,40 @@ def normalizer(axis):
|
||||||
exp_weights = K.exp(att_weights)
|
exp_weights = K.exp(att_weights)
|
||||||
sum_weights = K.sum(exp_weights, axis=axis, keepdims=True)
|
sum_weights = K.sum(exp_weights, axis=axis, keepdims=True)
|
||||||
return exp_weights / sum_weights
|
return exp_weights / sum_weights
|
||||||
|
|
||||||
return _normalize
|
return _normalize
|
||||||
|
|
||||||
|
|
||||||
def sum_word(x):
|
def sum_word(x):
|
||||||
return K.sum(x, axis=1)
|
return K.sum(x, axis=1)
|
||||||
|
|
||||||
|
|
||||||
def test_build_model():
|
def test_build_model():
|
||||||
vectors = np.ndarray((100, 8), dtype='float32')
|
vectors = np.ndarray((100, 8), dtype="float32")
|
||||||
shape = (10, 16, 3)
|
shape = (10, 16, 3)
|
||||||
settings = {'lr': 0.001, 'dropout': 0.2, 'gru_encode':True, 'entail_dir':'both'}
|
settings = {"lr": 0.001, "dropout": 0.2, "gru_encode": True, "entail_dir": "both"}
|
||||||
model = build_model(vectors, shape, settings)
|
model = build_model(vectors, shape, settings)
|
||||||
|
|
||||||
|
|
||||||
def test_fit_model():
|
def test_fit_model():
|
||||||
|
|
||||||
def _generate_X(nr_example, length, nr_vector):
|
def _generate_X(nr_example, length, nr_vector):
|
||||||
X1 = np.ndarray((nr_example, length), dtype='int32')
|
X1 = np.ndarray((nr_example, length), dtype="int32")
|
||||||
X1 *= X1 < nr_vector
|
X1 *= X1 < nr_vector
|
||||||
X1 *= 0 <= X1
|
X1 *= 0 <= X1
|
||||||
X2 = np.ndarray((nr_example, length), dtype='int32')
|
X2 = np.ndarray((nr_example, length), dtype="int32")
|
||||||
X2 *= X2 < nr_vector
|
X2 *= X2 < nr_vector
|
||||||
X2 *= 0 <= X2
|
X2 *= 0 <= X2
|
||||||
return [X1, X2]
|
return [X1, X2]
|
||||||
|
|
||||||
def _generate_Y(nr_example, nr_class):
|
def _generate_Y(nr_example, nr_class):
|
||||||
ys = np.zeros((nr_example, nr_class), dtype='int32')
|
ys = np.zeros((nr_example, nr_class), dtype="int32")
|
||||||
for i in range(nr_example):
|
for i in range(nr_example):
|
||||||
ys[i, i % nr_class] = 1
|
ys[i, i % nr_class] = 1
|
||||||
return ys
|
return ys
|
||||||
|
|
||||||
vectors = np.ndarray((100, 8), dtype='float32')
|
vectors = np.ndarray((100, 8), dtype="float32")
|
||||||
shape = (10, 16, 3)
|
shape = (10, 16, 3)
|
||||||
settings = {'lr': 0.001, 'dropout': 0.2, 'gru_encode':True, 'entail_dir':'both'}
|
settings = {"lr": 0.001, "dropout": 0.2, "gru_encode": True, "entail_dir": "both"}
|
||||||
model = build_model(vectors, shape, settings)
|
model = build_model(vectors, shape, settings)
|
||||||
|
|
||||||
train_X = _generate_X(20, shape[0], vectors.shape[0])
|
train_X = _generate_X(20, shape[0], vectors.shape[0])
|
||||||
|
|
|
||||||
|
|
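For reference, the normalizer(axis) helper in the decomposable attention model above is just a softmax over one axis of the attention weights. A minimal numpy sketch of the same arithmetic (illustration only; the real code uses the Keras backend):

import numpy as np


def normalize(att_weights, axis):
    # same steps as the Keras version: exponentiate, sum along the axis, divide
    exp_weights = np.exp(att_weights)
    sum_weights = exp_weights.sum(axis=axis, keepdims=True)
    return exp_weights / sum_weights


att = np.array([[[1.0, 2.0], [0.5, 0.5]]])   # (batch, len1, len2)
print(normalize(att, axis=2).sum(axis=2))    # each row now sums to 1.0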
@@ -77,7 +77,7 @@
     }
    ],
    "source": [
-    "import ujson as json\n",
+    "import json\n",
     "from keras.utils import to_categorical\n",
     "\n",
     "LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2}\n",
@ -19,39 +19,40 @@ from pathlib import Path
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
output_dir=("Output directory for saved HTML", "positional", None, Path))
|
output_dir=("Output directory for saved HTML", "positional", None, Path)
|
||||||
|
)
|
||||||
def main(output_dir=None):
|
def main(output_dir=None):
|
||||||
nlp = English() # start off with blank English class
|
nlp = English() # start off with blank English class
|
||||||
|
|
||||||
Doc.set_extension('overlap', method=overlap_tokens)
|
Doc.set_extension("overlap", method=overlap_tokens)
|
||||||
doc1 = nlp(u"Peach emoji is where it has always been.")
|
doc1 = nlp("Peach emoji is where it has always been.")
|
||||||
doc2 = nlp(u"Peach is the superior emoji.")
|
doc2 = nlp("Peach is the superior emoji.")
|
||||||
print("Text 1:", doc1.text)
|
print("Text 1:", doc1.text)
|
||||||
print("Text 2:", doc2.text)
|
print("Text 2:", doc2.text)
|
||||||
print("Overlapping tokens:", doc1._.overlap(doc2))
|
print("Overlapping tokens:", doc1._.overlap(doc2))
|
||||||
|
|
||||||
Doc.set_extension('to_html', method=to_html)
|
Doc.set_extension("to_html", method=to_html)
|
||||||
doc = nlp(u"This is a sentence about Apple.")
|
doc = nlp("This is a sentence about Apple.")
|
||||||
# add entity manually for demo purposes, to make it work without a model
|
# add entity manually for demo purposes, to make it work without a model
|
||||||
doc.ents = [Span(doc, 5, 6, label=nlp.vocab.strings['ORG'])]
|
doc.ents = [Span(doc, 5, 6, label=nlp.vocab.strings["ORG"])]
|
||||||
print("Text:", doc.text)
|
print("Text:", doc.text)
|
||||||
doc._.to_html(output=output_dir, style='ent')
|
doc._.to_html(output=output_dir, style="ent")
|
||||||
|
|
||||||
|
|
||||||
def to_html(doc, output='/tmp', style='dep'):
|
def to_html(doc, output="/tmp", style="dep"):
|
||||||
"""Doc method extension for saving the current state as a displaCy
|
"""Doc method extension for saving the current state as a displaCy
|
||||||
visualization.
|
visualization.
|
||||||
"""
|
"""
|
||||||
# generate filename from first six non-punct tokens
|
# generate filename from first six non-punct tokens
|
||||||
file_name = '-'.join([w.text for w in doc[:6] if not w.is_punct]) + '.html'
|
file_name = "-".join([w.text for w in doc[:6] if not w.is_punct]) + ".html"
|
||||||
html = displacy.render(doc, style=style, page=True) # render markup
|
html = displacy.render(doc, style=style, page=True) # render markup
|
||||||
if output is not None:
|
if output is not None:
|
||||||
output_path = Path(output)
|
output_path = Path(output)
|
||||||
if not output_path.exists():
|
if not output_path.exists():
|
||||||
output_path.mkdir()
|
output_path.mkdir()
|
||||||
output_file = Path(output) / file_name
|
output_file = Path(output) / file_name
|
||||||
output_file.open('w', encoding='utf-8').write(html) # save to file
|
output_file.open("w", encoding="utf-8").write(html) # save to file
|
||||||
print('Saved HTML to {}'.format(output_file))
|
print("Saved HTML to {}".format(output_file))
|
||||||
else:
|
else:
|
||||||
print(html)
|
print(html)
|
||||||
|
|
||||||
|
|
@ -67,7 +68,7 @@ def overlap_tokens(doc, other_doc):
|
||||||
return overlap
|
return overlap
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == "__main__":
|
||||||
plac.call(main)
|
plac.call(main)
|
||||||
|
|
||||||
# Expected output:
|
# Expected output:
|
||||||
|
|
|
||||||
|
|
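To recap the extension API exercised by the example above: register a method extension on Doc, then call it through the ._ proxy. The body of overlap_tokens below is an assumed, simplified variant (the original's body is not fully shown in this diff); only the registration and call pattern are taken from the example.

from spacy.lang.en import English
from spacy.tokens import Doc


def overlap_tokens(doc, other_doc):
    # assumed implementation: tokens of `doc` whose lowercased text
    # also occurs somewhere in `other_doc`
    other_words = set(token.lower_ for token in other_doc)
    return [token for token in doc if token.lower_ in other_words]


Doc.set_extension("overlap", method=overlap_tokens)

nlp = English()
doc1 = nlp("Peach emoji is where it has always been.")
doc2 = nlp("Peach is the superior emoji.")
print(doc1._.overlap(doc2))  # -> [Peach, emoji, is, .]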
@ -26,14 +26,18 @@ def main():
|
||||||
nlp = English()
|
nlp = English()
|
||||||
rest_countries = RESTCountriesComponent(nlp) # initialise component
|
rest_countries = RESTCountriesComponent(nlp) # initialise component
|
||||||
nlp.add_pipe(rest_countries) # add it to the pipeline
|
nlp.add_pipe(rest_countries) # add it to the pipeline
|
||||||
doc = nlp(u"Some text about Colombia and the Czech Republic")
|
doc = nlp("Some text about Colombia and the Czech Republic")
|
||||||
print('Pipeline', nlp.pipe_names) # pipeline contains component name
|
print("Pipeline", nlp.pipe_names) # pipeline contains component name
|
||||||
print('Doc has countries', doc._.has_country) # Doc contains countries
|
print("Doc has countries", doc._.has_country) # Doc contains countries
|
||||||
for token in doc:
|
for token in doc:
|
||||||
if token._.is_country:
|
if token._.is_country:
|
||||||
print(token.text, token._.country_capital, token._.country_latlng,
|
print(
|
||||||
token._.country_flag) # country data
|
token.text,
|
||||||
print('Entities', [(e.text, e.label_) for e in doc.ents]) # entities
|
token._.country_capital,
|
||||||
|
token._.country_latlng,
|
||||||
|
token._.country_flag,
|
||||||
|
) # country data
|
||||||
|
print("Entities", [(e.text, e.label_) for e in doc.ents]) # entities
|
||||||
|
|
||||||
|
|
||||||
class RESTCountriesComponent(object):
|
class RESTCountriesComponent(object):
|
||||||
|
|
@ -41,42 +45,42 @@ class RESTCountriesComponent(object):
|
||||||
the REST Countries API, merges country names into one token, assigns entity
|
the REST Countries API, merges country names into one token, assigns entity
|
||||||
labels and sets attributes on country tokens.
|
labels and sets attributes on country tokens.
|
||||||
"""
|
"""
|
||||||
name = 'rest_countries' # component name, will show up in the pipeline
|
|
||||||
|
|
||||||
def __init__(self, nlp, label='GPE'):
|
name = "rest_countries" # component name, will show up in the pipeline
|
||||||
|
|
||||||
|
def __init__(self, nlp, label="GPE"):
|
||||||
"""Initialise the pipeline component. The shared nlp instance is used
|
"""Initialise the pipeline component. The shared nlp instance is used
|
||||||
to initialise the matcher with the shared vocab, get the label ID and
|
to initialise the matcher with the shared vocab, get the label ID and
|
||||||
generate Doc objects as phrase match patterns.
|
generate Doc objects as phrase match patterns.
|
||||||
"""
|
"""
|
||||||
# Make request once on initialisation and store the data
|
# Make request once on initialisation and store the data
|
||||||
r = requests.get('https://restcountries.eu/rest/v2/all')
|
r = requests.get("https://restcountries.eu/rest/v2/all")
|
||||||
r.raise_for_status() # make sure requests raises an error if it fails
|
r.raise_for_status() # make sure requests raises an error if it fails
|
||||||
countries = r.json()
|
countries = r.json()
|
||||||
|
|
||||||
# Convert API response to dict keyed by country name for easy lookup
|
# Convert API response to dict keyed by country name for easy lookup
|
||||||
# This could also be extended using the alternative and foreign language
|
# This could also be extended using the alternative and foreign language
|
||||||
# names provided by the API
|
# names provided by the API
|
||||||
self.countries = {c['name']: c for c in countries}
|
self.countries = {c["name"]: c for c in countries}
|
||||||
self.label = nlp.vocab.strings[label] # get entity label ID
|
self.label = nlp.vocab.strings[label] # get entity label ID
|
||||||
|
|
||||||
# Set up the PhraseMatcher with Doc patterns for each country name
|
# Set up the PhraseMatcher with Doc patterns for each country name
|
||||||
patterns = [nlp(c) for c in self.countries.keys()]
|
patterns = [nlp(c) for c in self.countries.keys()]
|
||||||
self.matcher = PhraseMatcher(nlp.vocab)
|
self.matcher = PhraseMatcher(nlp.vocab)
|
||||||
self.matcher.add('COUNTRIES', None, *patterns)
|
self.matcher.add("COUNTRIES", None, *patterns)
|
||||||
|
|
||||||
# Register attribute on the Token. We'll be overwriting this based on
|
# Register attribute on the Token. We'll be overwriting this based on
|
||||||
# the matches, so we're only setting a default value, not a getter.
|
# the matches, so we're only setting a default value, not a getter.
|
||||||
# If no default value is set, it defaults to None.
|
# If no default value is set, it defaults to None.
|
||||||
Token.set_extension('is_country', default=False)
|
Token.set_extension("is_country", default=False)
|
||||||
Token.set_extension('country_capital', default=False)
|
Token.set_extension("country_capital", default=False)
|
||||||
Token.set_extension('country_latlng', default=False)
|
Token.set_extension("country_latlng", default=False)
|
||||||
Token.set_extension('country_flag', default=False)
|
Token.set_extension("country_flag", default=False)
|
||||||
|
|
||||||
# Register attributes on Doc and Span via a getter that checks if one of
|
# Register attributes on Doc and Span via a getter that checks if one of
|
||||||
# the contained tokens is set to is_country == True.
|
# the contained tokens is set to is_country == True.
|
||||||
Doc.set_extension('has_country', getter=self.has_country)
|
Doc.set_extension("has_country", getter=self.has_country)
|
||||||
Span.set_extension('has_country', getter=self.has_country)
|
Span.set_extension("has_country", getter=self.has_country)
|
||||||
|
|
||||||
|
|
||||||
def __call__(self, doc):
|
def __call__(self, doc):
|
||||||
"""Apply the pipeline component on a Doc object and modify it if matches
|
"""Apply the pipeline component on a Doc object and modify it if matches
|
||||||
|
|
@ -93,10 +97,10 @@ class RESTCountriesComponent(object):
|
||||||
# Can be extended with other data returned by the API, like
|
# Can be extended with other data returned by the API, like
|
||||||
# currencies, country code, flag, calling code etc.
|
# currencies, country code, flag, calling code etc.
|
||||||
for token in entity:
|
for token in entity:
|
||||||
token._.set('is_country', True)
|
token._.set("is_country", True)
|
||||||
token._.set('country_capital', self.countries[entity.text]['capital'])
|
token._.set("country_capital", self.countries[entity.text]["capital"])
|
||||||
token._.set('country_latlng', self.countries[entity.text]['latlng'])
|
token._.set("country_latlng", self.countries[entity.text]["latlng"])
|
||||||
token._.set('country_flag', self.countries[entity.text]['flag'])
|
token._.set("country_flag", self.countries[entity.text]["flag"])
|
||||||
# Overwrite doc.ents and add entity – be careful not to replace!
|
# Overwrite doc.ents and add entity – be careful not to replace!
|
||||||
doc.ents = list(doc.ents) + [entity]
|
doc.ents = list(doc.ents) + [entity]
|
||||||
for span in spans:
|
for span in spans:
|
||||||
|
|
@ -111,10 +115,10 @@ class RESTCountriesComponent(object):
|
||||||
is a country. Since the getter is only called when we access the
|
is a country. Since the getter is only called when we access the
|
||||||
attribute, we can refer to the Token's 'is_country' attribute here,
|
attribute, we can refer to the Token's 'is_country' attribute here,
|
||||||
which is already set in the processing step."""
|
which is already set in the processing step."""
|
||||||
return any([t._.get('is_country') for t in tokens])
|
return any([t._.get("is_country") for t in tokens])
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == "__main__":
|
||||||
plac.call(main)
|
plac.call(main)
|
||||||
|
|
||||||
# Expected output:
|
# Expected output:
|
||||||
|
|
|
||||||
|
|
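The same PhraseMatcher-plus-extensions pattern, boiled down to a short sketch. CountryMatcher, its component name and its fixed term list are made up for illustration and stand in for the REST-API-backed component above; the matcher setup and the doc.ents update follow the example.

from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span


class CountryMatcher(object):
    name = "country_matcher"  # hypothetical component name

    def __init__(self, nlp, terms, label="GPE"):
        self.label = nlp.vocab.strings[label]  # entity label ID
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add("COUNTRIES", None, *patterns)

    def __call__(self, doc):
        # turn each match into an entity span and append it to doc.ents
        spans = [Span(doc, start, end, label=self.label)
                 for _, start, end in self.matcher(doc)]
        doc.ents = list(doc.ents) + spans
        return doc


nlp = English()
nlp.add_pipe(CountryMatcher(nlp, ["Colombia", "Czech Republic"]))
doc = nlp("Some text about Colombia and the Czech Republic")
print([(ent.text, ent.label_) for ent in doc.ents])
# -> [('Colombia', 'GPE'), ('Czech Republic', 'GPE')]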
@ -20,23 +20,24 @@ from spacy.tokens import Doc, Span, Token
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
text=("Text to process", "positional", None, str),
|
text=("Text to process", "positional", None, str),
|
||||||
companies=("Names of technology companies", "positional", None, str))
|
companies=("Names of technology companies", "positional", None, str),
|
||||||
|
)
|
||||||
def main(text="Alphabet Inc. is the company behind Google.", *companies):
|
def main(text="Alphabet Inc. is the company behind Google.", *companies):
|
||||||
# For simplicity, we start off with only the blank English Language class
|
# For simplicity, we start off with only the blank English Language class
|
||||||
# and no model or pre-defined pipeline loaded.
|
# and no model or pre-defined pipeline loaded.
|
||||||
nlp = English()
|
nlp = English()
|
||||||
if not companies: # set default companies if none are set via args
|
if not companies: # set default companies if none are set via args
|
||||||
companies = ['Alphabet Inc.', 'Google', 'Netflix', 'Apple'] # etc.
|
companies = ["Alphabet Inc.", "Google", "Netflix", "Apple"] # etc.
|
||||||
component = TechCompanyRecognizer(nlp, companies) # initialise component
|
component = TechCompanyRecognizer(nlp, companies) # initialise component
|
||||||
nlp.add_pipe(component, last=True) # add last to the pipeline
|
nlp.add_pipe(component, last=True) # add last to the pipeline
|
||||||
|
|
||||||
doc = nlp(text)
|
doc = nlp(text)
|
||||||
print('Pipeline', nlp.pipe_names) # pipeline contains component name
|
print("Pipeline", nlp.pipe_names) # pipeline contains component name
|
||||||
print('Tokens', [t.text for t in doc]) # company names from the list are merged
|
print("Tokens", [t.text for t in doc]) # company names from the list are merged
|
||||||
print('Doc has_tech_org', doc._.has_tech_org) # Doc contains tech orgs
|
print("Doc has_tech_org", doc._.has_tech_org) # Doc contains tech orgs
|
||||||
print('Token 0 is_tech_org', doc[0]._.is_tech_org) # "Alphabet Inc." is a tech org
|
print("Token 0 is_tech_org", doc[0]._.is_tech_org) # "Alphabet Inc." is a tech org
|
||||||
print('Token 1 is_tech_org', doc[1]._.is_tech_org) # "is" is not
|
print("Token 1 is_tech_org", doc[1]._.is_tech_org) # "is" is not
|
||||||
print('Entities', [(e.text, e.label_) for e in doc.ents]) # all orgs are entities
|
print("Entities", [(e.text, e.label_) for e in doc.ents]) # all orgs are entities
|
||||||
|
|
||||||
|
|
||||||
class TechCompanyRecognizer(object):
|
class TechCompanyRecognizer(object):
|
||||||
|
|
@ -45,9 +46,10 @@ class TechCompanyRecognizer(object):
|
||||||
labelled as ORG and their spans are merged into one token. Additionally,
|
labelled as ORG and their spans are merged into one token. Additionally,
|
||||||
._.has_tech_org and ._.is_tech_org is set on the Doc/Span and Token
|
._.has_tech_org and ._.is_tech_org is set on the Doc/Span and Token
|
||||||
respectively."""
|
respectively."""
|
||||||
name = 'tech_companies' # component name, will show up in the pipeline
|
|
||||||
|
|
||||||
def __init__(self, nlp, companies=tuple(), label='ORG'):
|
name = "tech_companies" # component name, will show up in the pipeline
|
||||||
|
|
||||||
|
def __init__(self, nlp, companies=tuple(), label="ORG"):
|
||||||
"""Initialise the pipeline component. The shared nlp instance is used
|
"""Initialise the pipeline component. The shared nlp instance is used
|
||||||
to initialise the matcher with the shared vocab, get the label ID and
|
to initialise the matcher with the shared vocab, get the label ID and
|
||||||
generate Doc objects as phrase match patterns.
|
generate Doc objects as phrase match patterns.
|
||||||
|
|
@ -58,16 +60,16 @@ class TechCompanyRecognizer(object):
|
||||||
# so even if the list of companies is long, it's very efficient
|
# so even if the list of companies is long, it's very efficient
|
||||||
patterns = [nlp(org) for org in companies]
|
patterns = [nlp(org) for org in companies]
|
||||||
self.matcher = PhraseMatcher(nlp.vocab)
|
self.matcher = PhraseMatcher(nlp.vocab)
|
||||||
self.matcher.add('TECH_ORGS', None, *patterns)
|
self.matcher.add("TECH_ORGS", None, *patterns)
|
||||||
|
|
||||||
# Register attribute on the Token. We'll be overwriting this based on
|
# Register attribute on the Token. We'll be overwriting this based on
|
||||||
# the matches, so we're only setting a default value, not a getter.
|
# the matches, so we're only setting a default value, not a getter.
|
||||||
Token.set_extension('is_tech_org', default=False)
|
Token.set_extension("is_tech_org", default=False)
|
||||||
|
|
||||||
# Register attributes on Doc and Span via a getter that checks if one of
|
# Register attributes on Doc and Span via a getter that checks if one of
|
||||||
# the contained tokens is set to is_tech_org == True.
|
# the contained tokens is set to is_tech_org == True.
|
||||||
Doc.set_extension('has_tech_org', getter=self.has_tech_org)
|
Doc.set_extension("has_tech_org", getter=self.has_tech_org)
|
||||||
Span.set_extension('has_tech_org', getter=self.has_tech_org)
|
Span.set_extension("has_tech_org", getter=self.has_tech_org)
|
||||||
|
|
||||||
def __call__(self, doc):
|
def __call__(self, doc):
|
||||||
"""Apply the pipeline component on a Doc object and modify it if matches
|
"""Apply the pipeline component on a Doc object and modify it if matches
|
||||||
|
|
@ -82,7 +84,7 @@ class TechCompanyRecognizer(object):
|
||||||
spans.append(entity)
|
spans.append(entity)
|
||||||
# Set custom attribute on each token of the entity
|
# Set custom attribute on each token of the entity
|
||||||
for token in entity:
|
for token in entity:
|
||||||
token._.set('is_tech_org', True)
|
token._.set("is_tech_org", True)
|
||||||
# Overwrite doc.ents and add entity – be careful not to replace!
|
# Overwrite doc.ents and add entity – be careful not to replace!
|
||||||
doc.ents = list(doc.ents) + [entity]
|
doc.ents = list(doc.ents) + [entity]
|
||||||
for span in spans:
|
for span in spans:
|
||||||
|
|
@ -97,10 +99,10 @@ class TechCompanyRecognizer(object):
|
||||||
is a tech org. Since the getter is only called when we access the
|
is a tech org. Since the getter is only called when we access the
|
||||||
attribute, we can refer to the Token's 'is_tech_org' attribute here,
|
attribute, we can refer to the Token's 'is_tech_org' attribute here,
|
||||||
which is already set in the processing step."""
|
which is already set in the processing step."""
|
||||||
return any([t._.get('is_tech_org') for t in tokens])
|
return any([t._.get("is_tech_org") for t in tokens])
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == "__main__":
|
||||||
plac.call(main)
|
plac.call(main)
|
||||||
|
|
||||||
# Expected output:
|
# Expected output:
|
||||||
|
|
|
||||||
|
|
@@ -1,4 +1,4 @@
-'''Example of adding a pipeline component to prohibit sentence boundaries
+"""Example of adding a pipeline component to prohibit sentence boundaries
 before certain tokens.

 What we do is write to the token.is_sent_start attribute, which
@@ -10,16 +10,18 @@ should also improve the parse quality.
 The specific example here is drawn from https://github.com/explosion/spaCy/issues/2627
 Other versions of the model may not make the original mistake, so the specific
 example might not be apt for future versions.
-'''
+"""
 import plac
 import spacy


 def prevent_sentence_boundaries(doc):
     for token in doc:
         if not can_be_sentence_start(token):
             token.is_sent_start = False
     return doc


 def can_be_sentence_start(token):
     if token.i == 0:
         return True
@@ -32,17 +34,18 @@ def can_be_sentence_start(token):
     else:
         return False


 def main():
-    nlp = spacy.load('en_core_web_lg')
+    nlp = spacy.load("en_core_web_lg")
     raw_text = "Been here and I'm loving it."
     doc = nlp(raw_text)
     sentences = [sent.string.strip() for sent in doc.sents]
     print(sentences)
-    nlp.add_pipe(prevent_sentence_boundaries, before='parser')
+    nlp.add_pipe(prevent_sentence_boundaries, before="parser")
     doc = nlp(raw_text)
     sentences = [sent.string.strip() for sent in doc.sents]
     print(sentences)


-if __name__ == '__main__':
+if __name__ == "__main__":
     plac.call(main)
@@ -1,10 +1,11 @@
-'''Demonstrate adding a rule-based component that forces some tokens to not
+"""Demonstrate adding a rule-based component that forces some tokens to not
 be entities, before the NER tagger is applied. This is used to hotfix the issue
 in https://github.com/explosion/spaCy/issues/2870 , present as of spaCy v2.0.16.
-'''
+"""
 import spacy
 from spacy.attrs import ENT_IOB


 def fix_space_tags(doc):
     ent_iobs = doc.to_array([ENT_IOB])
     for i, token in enumerate(doc):
@@ -14,14 +15,16 @@ def fix_space_tags(doc):
     doc.from_array([ENT_IOB], ent_iobs.reshape((len(doc), 1)))
     return doc

-def main():
-    nlp = spacy.load('en_core_web_sm')
-    text = u'''This is some crazy test where I dont need an Apple Watch to make things bug'''
-    doc = nlp(text)
-    print('Before', doc.ents)
-    nlp.add_pipe(fix_space_tags, name='fix-ner', before='ner')
-    doc = nlp(text)
-    print('After', doc.ents)
-
-if __name__ == '__main__':
+
+def main():
+    nlp = spacy.load("en_core_web_sm")
+    text = u"""This is some crazy test where I dont need an Apple Watch to make things bug"""
+    doc = nlp(text)
+    print("Before", doc.ents)
+    nlp.add_pipe(fix_space_tags, name="fix-ner", before="ner")
+    doc = nlp(text)
+    print("After", doc.ents)
+
+
+if __name__ == "__main__":
     main()
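A hedged sketch of the array round-trip that fix_space_tags relies on: export the ENT_IOB column with Doc.to_array, edit it in numpy, and write it back with Doc.from_array. The value 2 is assumed to encode "O" (outside any entity) in spaCy's IOB scheme; the text and blank pipeline are made up for illustration.

import spacy
from spacy.attrs import ENT_IOB

nlp = spacy.blank("en")
doc = nlp("Apple Watch")
ent_iobs = doc.to_array([ENT_IOB])  # the ENT_IOB values as a numpy array
ent_iobs[:] = 2                     # assumption: 2 encodes "O" (outside any entity)
doc.from_array([ENT_IOB], ent_iobs.reshape((len(doc), 1)))
print([token.ent_iob_ for token in doc])  # expected: ['O', 'O']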
@ -9,12 +9,14 @@ built-in dataset loader.
|
||||||
Compatible with: spaCy v2.0.0+
|
Compatible with: spaCy v2.0.0+
|
||||||
"""
|
"""
|
||||||
from __future__ import print_function, unicode_literals
|
from __future__ import print_function, unicode_literals
|
||||||
from toolz import partition_all
|
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from joblib import Parallel, delayed
|
from joblib import Parallel, delayed
|
||||||
|
from functools import partial
|
||||||
import thinc.extra.datasets
|
import thinc.extra.datasets
|
||||||
import plac
|
import plac
|
||||||
import spacy
|
import spacy
|
||||||
|
from spacy.util import minibatch
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
|
|
@ -22,9 +24,9 @@ import spacy
|
||||||
model=("Model name (needs tagger)", "positional", None, str),
|
model=("Model name (needs tagger)", "positional", None, str),
|
||||||
n_jobs=("Number of workers", "option", "n", int),
|
n_jobs=("Number of workers", "option", "n", int),
|
||||||
batch_size=("Batch-size for each process", "option", "b", int),
|
batch_size=("Batch-size for each process", "option", "b", int),
|
||||||
limit=("Limit of entries from the dataset", "option", "l", int))
|
limit=("Limit of entries from the dataset", "option", "l", int),
|
||||||
def main(output_dir, model='en_core_web_sm', n_jobs=4, batch_size=1000,
|
)
|
||||||
limit=10000):
|
def main(output_dir, model="en_core_web_sm", n_jobs=4, batch_size=1000, limit=10000):
|
||||||
nlp = spacy.load(model) # load spaCy model
|
nlp = spacy.load(model) # load spaCy model
|
||||||
print("Loaded model '%s'" % model)
|
print("Loaded model '%s'" % model)
|
||||||
if not output_dir.exists():
|
if not output_dir.exists():
|
||||||
|
|
@ -34,45 +36,47 @@ def main(output_dir, model='en_core_web_sm', n_jobs=4, batch_size=1000,
|
||||||
data, _ = thinc.extra.datasets.imdb()
|
data, _ = thinc.extra.datasets.imdb()
|
||||||
texts, _ = zip(*data[-limit:])
|
texts, _ = zip(*data[-limit:])
|
||||||
print("Processing texts...")
|
print("Processing texts...")
|
||||||
partitions = partition_all(batch_size, texts)
|
partitions = minibatch(texts, size=batch_size)
|
||||||
executor = Parallel(n_jobs=n_jobs)
|
executor = Parallel(n_jobs=n_jobs, backend="multiprocessing", prefer="processes")
|
||||||
do = delayed(transform_texts)
|
do = delayed(partial(transform_texts, nlp))
|
||||||
tasks = (do(nlp, i, batch, output_dir)
|
tasks = (do(i, batch, output_dir) for i, batch in enumerate(partitions))
|
||||||
for i, batch in enumerate(partitions))
|
|
||||||
executor(tasks)
|
executor(tasks)
|
||||||
|
|
||||||
|
|
||||||
def transform_texts(nlp, batch_id, texts, output_dir):
|
def transform_texts(nlp, batch_id, texts, output_dir):
|
||||||
print(nlp.pipe_names)
|
print(nlp.pipe_names)
|
||||||
out_path = Path(output_dir) / ('%d.txt' % batch_id)
|
out_path = Path(output_dir) / ("%d.txt" % batch_id)
|
||||||
if out_path.exists(): # return None in case same batch is called again
|
if out_path.exists(): # return None in case same batch is called again
|
||||||
return None
|
return None
|
||||||
print('Processing batch', batch_id)
|
print("Processing batch", batch_id)
|
||||||
with out_path.open('w', encoding='utf8') as f:
|
with out_path.open("w", encoding="utf8") as f:
|
||||||
for doc in nlp.pipe(texts):
|
for doc in nlp.pipe(texts):
|
||||||
f.write(' '.join(represent_word(w) for w in doc if not w.is_space))
|
f.write(" ".join(represent_word(w) for w in doc if not w.is_space))
|
||||||
f.write('\n')
|
f.write("\n")
|
||||||
print('Saved {} texts to {}.txt'.format(len(texts), batch_id))
|
print("Saved {} texts to {}.txt".format(len(texts), batch_id))
|
||||||
|
|
||||||
|
|
||||||
def represent_word(word):
|
def represent_word(word):
|
||||||
text = word.text
|
text = word.text
|
||||||
# True-case, i.e. try to normalize sentence-initial capitals.
|
# True-case, i.e. try to normalize sentence-initial capitals.
|
||||||
# Only do this if the lower-cased form is more probable.
|
# Only do this if the lower-cased form is more probable.
|
||||||
if text.istitle() and is_sent_begin(word) \
|
if (
|
||||||
and word.prob < word.doc.vocab[text.lower()].prob:
|
text.istitle()
|
||||||
|
and is_sent_begin(word)
|
||||||
|
and word.prob < word.doc.vocab[text.lower()].prob
|
||||||
|
):
|
||||||
text = text.lower()
|
text = text.lower()
|
||||||
return text + '|' + word.tag_
|
return text + "|" + word.tag_
|
||||||
|
|
||||||
|
|
||||||
def is_sent_begin(word):
|
def is_sent_begin(word):
|
||||||
if word.i == 0:
|
if word.i == 0:
|
||||||
return True
|
return True
|
||||||
elif word.i >= 2 and word.nbor(-1).text in ('.', '!', '?', '...'):
|
elif word.i >= 2 and word.nbor(-1).text in (".", "!", "?", "..."):
|
||||||
return True
|
return True
|
||||||
else:
|
else:
|
||||||
return False
|
return False
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == "__main__":
|
||||||
plac.call(main)
|
plac.call(main)
|
||||||
|
|
|
||||||
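A minimal sketch of the batching-plus-joblib pattern that the multiprocessing example above switches to (spacy.util.minibatch instead of toolz.partition_all, functools.partial to bind the shared nlp object). The texts, batch size and process_batch job are made up; process_batch stands in for transform_texts.

from functools import partial
from joblib import Parallel, delayed
from spacy.util import minibatch


def process_batch(prefix, batch_id, texts):
    # stand-in for transform_texts: just report what this worker received
    return "{} batch {}: {} texts".format(prefix, batch_id, len(texts))


if __name__ == "__main__":
    texts = ["text number %d" % i for i in range(10)]
    partitions = minibatch(texts, size=4)  # lazy batches of up to 4 texts
    executor = Parallel(n_jobs=2, backend="multiprocessing", prefer="processes")
    do = delayed(partial(process_batch, "demo"))
    tasks = (do(i, list(batch)) for i, batch in enumerate(partitions))
    print(executor(tasks))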
437
examples/training/conllu.py
Normal file
|
|
@ -0,0 +1,437 @@
|
||||||
|
"""Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes
|
||||||
|
.conllu format for development data, allowing the official scorer to be used.
|
||||||
|
"""
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
import plac
|
||||||
|
import tqdm
|
||||||
|
import attr
|
||||||
|
from pathlib import Path
|
||||||
|
import re
|
||||||
|
import sys
|
||||||
|
import json
|
||||||
|
|
||||||
|
import spacy
|
||||||
|
import spacy.util
|
||||||
|
from spacy.tokens import Token, Doc
|
||||||
|
from spacy.gold import GoldParse
|
||||||
|
from spacy.syntax.nonproj import projectivize
|
||||||
|
from collections import defaultdict, Counter
|
||||||
|
from timeit import default_timer as timer
|
||||||
|
from spacy.matcher import Matcher
|
||||||
|
|
||||||
|
import itertools
|
||||||
|
import random
|
||||||
|
import numpy.random
|
||||||
|
|
||||||
|
import conll17_ud_eval
|
||||||
|
|
||||||
|
import spacy.lang.zh
|
||||||
|
import spacy.lang.ja
|
||||||
|
|
||||||
|
spacy.lang.zh.Chinese.Defaults.use_jieba = False
|
||||||
|
spacy.lang.ja.Japanese.Defaults.use_janome = False
|
||||||
|
|
||||||
|
random.seed(0)
|
||||||
|
numpy.random.seed(0)
|
||||||
|
|
||||||
|
|
||||||
|
def minibatch_by_words(items, size=5000):
|
||||||
|
random.shuffle(items)
|
||||||
|
if isinstance(size, int):
|
||||||
|
size_ = itertools.repeat(size)
|
||||||
|
else:
|
||||||
|
size_ = size
|
||||||
|
items = iter(items)
|
||||||
|
while True:
|
||||||
|
batch_size = next(size_)
|
||||||
|
batch = []
|
||||||
|
while batch_size >= 0:
|
||||||
|
try:
|
||||||
|
doc, gold = next(items)
|
||||||
|
except StopIteration:
|
||||||
|
if batch:
|
||||||
|
yield batch
|
||||||
|
return
|
||||||
|
batch_size -= len(doc)
|
||||||
|
batch.append((doc, gold))
|
||||||
|
if batch:
|
||||||
|
yield batch
|
||||||
|
else:
|
||||||
|
break
|
||||||
|
|
||||||
|
|
||||||
|
################
|
||||||
|
# Data reading #
|
||||||
|
################
|
||||||
|
|
||||||
|
space_re = re.compile("\s+")
|
||||||
|
|
||||||
|
|
||||||
|
def split_text(text):
|
||||||
|
return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")]
|
||||||
|
|
||||||
|
|
||||||
|
def read_data(
|
||||||
|
nlp,
|
||||||
|
conllu_file,
|
||||||
|
text_file,
|
||||||
|
raw_text=True,
|
||||||
|
oracle_segments=False,
|
||||||
|
max_doc_length=None,
|
||||||
|
limit=None,
|
||||||
|
):
|
||||||
|
"""Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True,
|
||||||
|
include Doc objects created using nlp.make_doc and then aligned against
|
||||||
|
the gold-standard sequences. If oracle_segments=True, include Doc objects
|
||||||
|
created from the gold-standard segments. At least one must be True."""
|
||||||
|
if not raw_text and not oracle_segments:
|
||||||
|
raise ValueError("At least one of raw_text or oracle_segments must be True")
|
||||||
|
paragraphs = split_text(text_file.read())
|
||||||
|
conllu = read_conllu(conllu_file)
|
||||||
|
# sd is spacy doc; cd is conllu doc
|
||||||
|
# cs is conllu sent, ct is conllu token
|
||||||
|
docs = []
|
||||||
|
golds = []
|
||||||
|
for doc_id, (text, cd) in enumerate(zip(paragraphs, conllu)):
|
||||||
|
sent_annots = []
|
||||||
|
for cs in cd:
|
||||||
|
sent = defaultdict(list)
|
||||||
|
for id_, word, lemma, pos, tag, morph, head, dep, _, space_after in cs:
|
||||||
|
if "." in id_:
|
||||||
|
continue
|
||||||
|
if "-" in id_:
|
||||||
|
continue
|
||||||
|
id_ = int(id_) - 1
|
||||||
|
head = int(head) - 1 if head != "0" else id_
|
||||||
|
sent["words"].append(word)
|
||||||
|
sent["tags"].append(tag)
|
||||||
|
sent["heads"].append(head)
|
||||||
|
sent["deps"].append("ROOT" if dep == "root" else dep)
|
||||||
|
sent["spaces"].append(space_after == "_")
|
||||||
|
sent["entities"] = ["-"] * len(sent["words"])
|
||||||
|
sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"])
|
||||||
|
if oracle_segments:
|
||||||
|
docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"]))
|
||||||
|
golds.append(GoldParse(docs[-1], **sent))
|
||||||
|
|
||||||
|
sent_annots.append(sent)
|
||||||
|
if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
|
||||||
|
doc, gold = _make_gold(nlp, None, sent_annots)
|
||||||
|
sent_annots = []
|
||||||
|
docs.append(doc)
|
||||||
|
golds.append(gold)
|
||||||
|
if limit and len(docs) >= limit:
|
||||||
|
return docs, golds
|
||||||
|
|
||||||
|
if raw_text and sent_annots:
|
||||||
|
doc, gold = _make_gold(nlp, None, sent_annots)
|
||||||
|
docs.append(doc)
|
||||||
|
golds.append(gold)
|
||||||
|
if limit and len(docs) >= limit:
|
||||||
|
return docs, golds
|
||||||
|
return docs, golds
|
||||||
|
|
||||||
|
|
||||||
|
def read_conllu(file_):
|
||||||
|
docs = []
|
||||||
|
sent = []
|
||||||
|
doc = []
|
||||||
|
for line in file_:
|
||||||
|
if line.startswith("# newdoc"):
|
||||||
|
if doc:
|
||||||
|
docs.append(doc)
|
||||||
|
doc = []
|
||||||
|
elif line.startswith("#"):
|
||||||
|
continue
|
||||||
|
elif not line.strip():
|
||||||
|
if sent:
|
||||||
|
doc.append(sent)
|
||||||
|
sent = []
|
||||||
|
else:
|
||||||
|
sent.append(list(line.strip().split("\t")))
|
||||||
|
if len(sent[-1]) != 10:
|
||||||
|
print(repr(line))
|
||||||
|
raise ValueError
|
||||||
|
if sent:
|
||||||
|
doc.append(sent)
|
||||||
|
if doc:
|
||||||
|
docs.append(doc)
|
||||||
|
return docs
|
||||||
|
|
||||||
|
|
||||||
|
def _make_gold(nlp, text, sent_annots):
|
||||||
|
# Flatten the conll annotations, and adjust the head indices
|
||||||
|
flat = defaultdict(list)
|
||||||
|
for sent in sent_annots:
|
||||||
|
flat["heads"].extend(len(flat["words"]) + head for head in sent["heads"])
|
||||||
|
for field in ["words", "tags", "deps", "entities", "spaces"]:
|
||||||
|
flat[field].extend(sent[field])
|
||||||
|
# Construct text if necessary
|
||||||
|
assert len(flat["words"]) == len(flat["spaces"])
|
||||||
|
if text is None:
|
||||||
|
text = "".join(
|
||||||
|
word + " " * space for word, space in zip(flat["words"], flat["spaces"])
|
||||||
|
)
|
||||||
|
doc = nlp.make_doc(text)
|
||||||
|
flat.pop("spaces")
|
||||||
|
gold = GoldParse(doc, **flat)
|
||||||
|
return doc, gold
|
||||||
|
|
||||||
|
|
||||||
|
#############################
|
||||||
|
# Data transforms for spaCy #
|
||||||
|
#############################
|
||||||
|
|
||||||
|
|
||||||
|
def golds_to_gold_tuples(docs, golds):
|
||||||
|
"""Get out the annoying 'tuples' format used by begin_training, given the
|
||||||
|
GoldParse objects."""
|
||||||
|
tuples = []
|
||||||
|
for doc, gold in zip(docs, golds):
|
||||||
|
text = doc.text
|
||||||
|
ids, words, tags, heads, labels, iob = zip(*gold.orig_annot)
|
||||||
|
sents = [((ids, words, tags, heads, labels, iob), [])]
|
||||||
|
tuples.append((text, sents))
|
||||||
|
return tuples
|
||||||
|
|
||||||
|
|
||||||
|
##############
|
||||||
|
# Evaluation #
|
||||||
|
##############
|
||||||
|
|
||||||
|
|
||||||
|
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
|
||||||
|
with text_loc.open("r", encoding="utf8") as text_file:
|
||||||
|
texts = split_text(text_file.read())
|
||||||
|
docs = list(nlp.pipe(texts))
|
||||||
|
with sys_loc.open("w", encoding="utf8") as out_file:
|
||||||
|
write_conllu(docs, out_file)
|
||||||
|
with gold_loc.open("r", encoding="utf8") as gold_file:
|
||||||
|
gold_ud = conll17_ud_eval.load_conllu(gold_file)
|
||||||
|
with sys_loc.open("r", encoding="utf8") as sys_file:
|
||||||
|
sys_ud = conll17_ud_eval.load_conllu(sys_file)
|
||||||
|
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
|
||||||
|
return scores
|
||||||
|
|
||||||
|
|
||||||
|
def write_conllu(docs, file_):
|
||||||
|
merger = Matcher(docs[0].vocab)
|
||||||
|
merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}])
|
||||||
|
for i, doc in enumerate(docs):
|
||||||
|
matches = merger(doc)
|
||||||
|
spans = [doc[start : end + 1] for _, start, end in matches]
|
||||||
|
offsets = [(span.start_char, span.end_char) for span in spans]
|
||||||
|
for start_char, end_char in offsets:
|
||||||
|
doc.merge(start_char, end_char)
|
||||||
|
file_.write("# newdoc id = {i}\n".format(i=i))
|
||||||
|
for j, sent in enumerate(doc.sents):
|
||||||
|
file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
|
||||||
|
file_.write("# text = {text}\n".format(text=sent.text))
|
||||||
|
for k, token in enumerate(sent):
|
||||||
|
file_.write(token._.get_conllu_lines(k) + "\n")
|
||||||
|
file_.write("\n")
|
||||||
|
|
||||||
|
|
||||||
|
def print_progress(itn, losses, ud_scores):
|
||||||
|
fields = {
|
||||||
|
"dep_loss": losses.get("parser", 0.0),
|
||||||
|
"tag_loss": losses.get("tagger", 0.0),
|
||||||
|
"words": ud_scores["Words"].f1 * 100,
|
||||||
|
"sents": ud_scores["Sentences"].f1 * 100,
|
||||||
|
"tags": ud_scores["XPOS"].f1 * 100,
|
||||||
|
"uas": ud_scores["UAS"].f1 * 100,
|
||||||
|
"las": ud_scores["LAS"].f1 * 100,
|
||||||
|
}
|
||||||
|
header = ["Epoch", "Loss", "LAS", "UAS", "TAG", "SENT", "WORD"]
|
||||||
|
if itn == 0:
|
||||||
|
print("\t".join(header))
|
||||||
|
tpl = "\t".join(
|
||||||
|
(
|
||||||
|
"{:d}",
|
||||||
|
"{dep_loss:.1f}",
|
||||||
|
"{las:.1f}",
|
||||||
|
"{uas:.1f}",
|
||||||
|
"{tags:.1f}",
|
||||||
|
"{sents:.1f}",
|
||||||
|
"{words:.1f}",
|
||||||
|
)
|
||||||
|
)
|
||||||
|
print(tpl.format(itn, **fields))
|
||||||
|
|
||||||
|
|
||||||
|
# def get_sent_conllu(sent, sent_id):
|
||||||
|
# lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)]
|
||||||
|
|
||||||
|
|
||||||
|
def get_token_conllu(token, i):
|
||||||
|
if token._.begins_fused:
|
||||||
|
n = 1
|
||||||
|
while token.nbor(n)._.inside_fused:
|
||||||
|
n += 1
|
||||||
|
id_ = "%d-%d" % (i, i + n)
|
||||||
|
lines = [id_, token.text, "_", "_", "_", "_", "_", "_", "_", "_"]
|
||||||
|
else:
|
||||||
|
lines = []
|
||||||
|
if token.head.i == token.i:
|
||||||
|
head = 0
|
||||||
|
else:
|
||||||
|
head = i + (token.head.i - token.i) + 1
|
||||||
|
fields = [
|
||||||
|
str(i + 1),
|
||||||
|
token.text,
|
||||||
|
token.lemma_,
|
||||||
|
token.pos_,
|
||||||
|
token.tag_,
|
||||||
|
"_",
|
||||||
|
str(head),
|
||||||
|
token.dep_.lower(),
|
||||||
|
"_",
|
||||||
|
"_",
|
||||||
|
]
|
||||||
|
lines.append("\t".join(fields))
|
||||||
|
return "\n".join(lines)
|
||||||
|
|
||||||
|
|
||||||
|
Token.set_extension("get_conllu_lines", method=get_token_conllu)
|
||||||
|
Token.set_extension("begins_fused", default=False)
|
||||||
|
Token.set_extension("inside_fused", default=False)
|
||||||
|
|
||||||
|
|
||||||
|
##################
|
||||||
|
# Initialization #
|
||||||
|
##################
|
||||||
|
|
||||||
|
|
||||||
|
def load_nlp(corpus, config):
|
||||||
|
lang = corpus.split("_")[0]
|
||||||
|
nlp = spacy.blank(lang)
|
||||||
|
if config.vectors:
|
||||||
|
nlp.vocab.from_disk(config.vectors / "vocab")
|
||||||
|
return nlp
|
||||||
|
|
||||||
|
|
||||||
|
def initialize_pipeline(nlp, docs, golds, config):
|
||||||
|
nlp.add_pipe(nlp.create_pipe("parser"))
|
||||||
|
if config.multitask_tag:
|
||||||
|
nlp.parser.add_multitask_objective("tag")
|
||||||
|
if config.multitask_sent:
|
||||||
|
nlp.parser.add_multitask_objective("sent_start")
|
||||||
|
nlp.parser.moves.add_action(2, "subtok")
|
||||||
|
nlp.add_pipe(nlp.create_pipe("tagger"))
|
||||||
|
for gold in golds:
|
||||||
|
for tag in gold.tags:
|
||||||
|
if tag is not None:
|
||||||
|
nlp.tagger.add_label(tag)
|
||||||
|
# Replace labels that didn't make the frequency cutoff
|
||||||
|
actions = set(nlp.parser.labels)
|
||||||
|
label_set = set([act.split("-")[1] for act in actions if "-" in act])
|
||||||
|
for gold in golds:
|
||||||
|
for i, label in enumerate(gold.labels):
|
||||||
|
if label is not None and label not in label_set:
|
||||||
|
gold.labels[i] = label.split("||")[0]
|
||||||
|
return nlp.begin_training(lambda: golds_to_gold_tuples(docs, golds))
|
||||||
|
|
||||||
|
|
||||||
|
########################
|
||||||
|
# Command line helpers #
|
||||||
|
########################
|
||||||
|
|
||||||
|
|
||||||
|
@attr.s
|
||||||
|
class Config(object):
|
||||||
|
vectors = attr.ib(default=None)
|
||||||
|
max_doc_length = attr.ib(default=10)
|
||||||
|
multitask_tag = attr.ib(default=True)
|
||||||
|
multitask_sent = attr.ib(default=True)
|
||||||
|
nr_epoch = attr.ib(default=30)
|
||||||
|
batch_size = attr.ib(default=1000)
|
||||||
|
dropout = attr.ib(default=0.2)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def load(cls, loc):
|
||||||
|
with Path(loc).open("r", encoding="utf8") as file_:
|
||||||
|
cfg = json.load(file_)
|
||||||
|
return cls(**cfg)
|
||||||
|
|
||||||
|
|
||||||
|
class Dataset(object):
|
||||||
|
def __init__(self, path, section):
|
||||||
|
self.path = path
|
||||||
|
self.section = section
|
||||||
|
self.conllu = None
|
||||||
|
self.text = None
|
||||||
|
for file_path in self.path.iterdir():
|
||||||
|
name = file_path.parts[-1]
|
||||||
|
if section in name and name.endswith("conllu"):
|
||||||
|
self.conllu = file_path
|
||||||
|
elif section in name and name.endswith("txt"):
|
||||||
|
self.text = file_path
|
||||||
|
if self.conllu is None:
|
||||||
|
msg = "Could not find .txt file in {path} for {section}"
|
||||||
|
raise IOError(msg.format(section=section, path=path))
|
||||||
|
if self.text is None:
|
||||||
|
msg = "Could not find .txt file in {path} for {section}"
|
||||||
|
self.lang = self.conllu.parts[-1].split("-")[0].split("_")[0]
|
||||||
|
|
||||||
|
|
||||||
|
class TreebankPaths(object):
|
||||||
|
def __init__(self, ud_path, treebank, **cfg):
|
||||||
|
self.train = Dataset(ud_path / treebank, "train")
|
||||||
|
self.dev = Dataset(ud_path / treebank, "dev")
|
||||||
|
self.lang = self.train.lang
|
||||||
|
|
||||||
|
|
||||||
|
@plac.annotations(
|
||||||
|
ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
|
||||||
|
corpus=(
|
||||||
|
"UD corpus to train and evaluate on, e.g. en, es_ancora, etc",
|
||||||
|
"positional",
|
||||||
|
None,
|
||||||
|
str,
|
||||||
|
),
|
||||||
|
parses_dir=("Directory to write the development parses", "positional", None, Path),
|
||||||
|
config=("Path to json formatted config file", "positional", None, Config.load),
|
||||||
|
limit=("Size limit", "option", "n", int),
|
||||||
|
)
|
||||||
|
def main(ud_dir, parses_dir, config, corpus, limit=0):
|
||||||
|
paths = TreebankPaths(ud_dir, corpus)
|
||||||
|
if not (parses_dir / corpus).exists():
|
||||||
|
(parses_dir / corpus).mkdir()
|
||||||
|
print("Train and evaluate", corpus, "using lang", paths.lang)
|
||||||
|
nlp = load_nlp(paths.lang, config)
|
||||||
|
|
||||||
|
docs, golds = read_data(
|
||||||
|
nlp,
|
||||||
|
paths.train.conllu.open(),
|
||||||
|
paths.train.text.open(),
|
||||||
|
max_doc_length=config.max_doc_length,
|
||||||
|
limit=limit,
|
||||||
|
)
|
||||||
|
|
||||||
|
optimizer = initialize_pipeline(nlp, docs, golds, config)
|
||||||
|
|
||||||
|
for i in range(config.nr_epoch):
|
||||||
|
docs = [nlp.make_doc(doc.text) for doc in docs]
|
||||||
|
batches = minibatch_by_words(list(zip(docs, golds)), size=config.batch_size)
|
||||||
|
losses = {}
|
||||||
|
n_train_words = sum(len(doc) for doc in docs)
|
||||||
|
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
|
||||||
|
for batch in batches:
|
||||||
|
batch_docs, batch_gold = zip(*batch)
|
||||||
|
pbar.update(sum(len(doc) for doc in batch_docs))
|
||||||
|
nlp.update(
|
||||||
|
batch_docs,
|
||||||
|
batch_gold,
|
||||||
|
sgd=optimizer,
|
||||||
|
drop=config.dropout,
|
||||||
|
losses=losses,
|
||||||
|
)
|
||||||
|
|
||||||
|
out_path = parses_dir / corpus / "epoch-{i}.conllu".format(i=i)
|
||||||
|
with nlp.use_params(optimizer.averages):
|
||||||
|
scores = evaluate(nlp, paths.dev.text, paths.dev.conllu, out_path)
|
||||||
|
print_progress(i, losses, scores)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
plac.call(main)
|
||||||
|
|
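Illustration only: how minibatch_by_words (defined near the top of conllu.py above) groups (doc, gold) pairs until roughly `size` words have been consumed. This sketch assumes it runs in the same module, since the helper depends on the module-level imports; the texts and the None gold values are stand-ins.

import spacy

nlp = spacy.blank("en")
texts = ["a b c", "d e", "f g h i", "j"]
docs = [nlp.make_doc(text) for text in texts]
golds = [None] * len(docs)  # stand-ins for GoldParse objects

for batch in minibatch_by_words(list(zip(docs, golds)), size=5):
    # items are shuffled first, so batch composition varies between runs
    print([len(doc) for doc, _ in batch])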
@ -1,4 +1,4 @@
|
||||||
'''This example shows how to add a multi-task objective that is trained
|
"""This example shows how to add a multi-task objective that is trained
|
||||||
alongside the entity recognizer. This is an alternative to adding features
|
alongside the entity recognizer. This is an alternative to adding features
|
||||||
to the model.
|
to the model.
|
||||||
|
|
||||||
|
|
@ -19,7 +19,7 @@ The specific example here is not necessarily a good idea --- but it shows
|
||||||
how an arbitrary objective function for some word can be used.
|
how an arbitrary objective function for some word can be used.
|
||||||
|
|
||||||
Developed and tested for spaCy 2.0.6
|
Developed and tested for spaCy 2.0.6
|
||||||
'''
|
"""
|
||||||
import random
|
import random
|
||||||
import plac
|
import plac
|
||||||
import spacy
|
import spacy
|
||||||
|
|
@ -30,30 +30,29 @@ random.seed(0)
|
||||||
|
|
||||||
PWD = os.path.dirname(__file__)
|
PWD = os.path.dirname(__file__)
|
||||||
|
|
||||||
TRAIN_DATA = list(read_json_file(os.path.join(PWD, 'training-data.json')))
|
TRAIN_DATA = list(read_json_file(os.path.join(PWD, "training-data.json")))
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
def get_position_label(i, words, tags, heads, labels, ents):
|
def get_position_label(i, words, tags, heads, labels, ents):
|
||||||
'''Return labels indicating the position of the word in the document.
|
"""Return labels indicating the position of the word in the document.
|
||||||
'''
|
"""
|
||||||
if len(words) < 20:
|
if len(words) < 20:
|
||||||
return 'short-doc'
|
return "short-doc"
|
||||||
elif i == 0:
|
elif i == 0:
|
||||||
return 'first-word'
|
return "first-word"
|
||||||
elif i < 10:
|
elif i < 10:
|
||||||
return 'early-word'
|
return "early-word"
|
||||||
elif i < 20:
|
elif i < 20:
|
||||||
return 'mid-word'
|
return "mid-word"
|
||||||
elif i == len(words) - 1:
|
elif i == len(words) - 1:
|
||||||
return 'last-word'
|
return "last-word"
|
||||||
else:
|
else:
|
||||||
return 'late-word'
|
return "late-word"
|
||||||
|
|
||||||
|
|
||||||
def main(n_iter=10):
|
def main(n_iter=10):
|
||||||
nlp = spacy.blank('en')
|
nlp = spacy.blank("en")
|
||||||
ner = nlp.create_pipe('ner')
|
ner = nlp.create_pipe("ner")
|
||||||
ner.add_multitask_objective(get_position_label)
|
ner.add_multitask_objective(get_position_label)
|
||||||
nlp.add_pipe(ner)
|
nlp.add_pipe(ner)
|
||||||
|
|
||||||
|
|
@ -71,15 +70,16 @@ def main(n_iter=10):
|
||||||
[gold], # batch of annotations
|
[gold], # batch of annotations
|
||||||
drop=0.2, # dropout - make it harder to memorise data
|
drop=0.2, # dropout - make it harder to memorise data
|
||||||
sgd=optimizer, # callable to update weights
|
sgd=optimizer, # callable to update weights
|
||||||
losses=losses)
|
losses=losses,
|
||||||
print(losses.get('nn_labeller', 0.0), losses['ner'])
|
)
|
||||||
|
print(losses.get("nn_labeller", 0.0), losses["ner"])
|
||||||
|
|
||||||
# test the trained model
|
# test the trained model
|
||||||
for text, _ in TRAIN_DATA:
|
for text, _ in TRAIN_DATA:
|
||||||
doc = nlp(text)
|
doc = nlp(text)
|
||||||
print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
|
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
|
||||||
print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])
|
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == "__main__":
|
||||||
plac.call(main)
|
plac.call(main)
|
||||||
|
|
|
||||||
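For clarity, any function with the (i, words, tags, heads, labels, ents) signature used by get_position_label above can be plugged into add_multitask_objective. A hypothetical alternative objective, sketched under that assumption:

def get_length_label(i, words, tags, heads, labels, ents):
    # hypothetical objective: bucket each token by its character length
    n = len(words[i])
    if n <= 3:
        return "short"
    elif n <= 7:
        return "medium"
    else:
        return "long"


# used exactly like get_position_label in the example above:
# ner = nlp.create_pipe("ner")
# ner.add_multitask_objective(get_length_label)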
216
examples/training/pretrain_textcat.py
Normal file
|
|
@ -0,0 +1,216 @@
|
||||||
|
"""This script is experimental.
|
||||||
|
|
||||||
|
Try pre-training the CNN component of the text categorizer using a cheap
|
||||||
|
language modelling-like objective. Specifically, we load pre-trained vectors
|
||||||
|
(from something like word2vec, GloVe, FastText etc), and use the CNN to
|
||||||
|
predict the tokens' pre-trained vectors. This isn't as easy as it sounds:
|
||||||
|
we're not merely doing compression here, because heavy dropout is applied,
|
||||||
|
including over the input words. This means the model must often (50% of the time)
|
||||||
|
use the context in order to predict the word.
|
||||||
|
|
||||||
|
To evaluate the technique, we're pre-training with the 50k texts from the IMDB
|
||||||
|
corpus, and then training with only 100 labels. Note that it's a bit dirty to
|
||||||
|
pre-train with the development data, but also not *so* terrible: we're not using
|
||||||
|
the development labels, after all --- only the unlabelled text.
|
||||||
|
"""
|
||||||
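
To make the objective above concrete, here is a toy, framework-free sketch of the training signal (an illustration only, not the code spaCy uses): each token's static pre-trained vector is the regression target, the "prediction" comes from a context that sometimes withholds the token's own vector, and an L2 loss measures how far the prediction is from the target.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "movie", "was", "great", "fun"]
vectors = {w: rng.normal(size=8) for w in vocab}  # stand-in for GloVe/fastText rows

def predict_from_context(words, i, drop=0.5):
    # Toy "encoder": average the neighbours' vectors; with probability `drop`
    # the centre word's own vector is withheld, so the context must do the work.
    ctx = np.mean([vectors[w] for j, w in enumerate(words) if j != i], axis=0)
    if rng.random() > drop:
        ctx = (ctx + vectors[words[i]]) / 2
    return ctx

def l2_loss(pred, target):
    diff = pred - target
    return float((diff ** 2).sum())

words = ["the", "movie", "was", "great", "fun"]
for i, w in enumerate(words):
    print(w, round(l2_loss(predict_from_context(words, i), vectors[w]), 3))

In the real script the CNN plays the role of the context encoder, and the gradient of this kind of loss is what updates its weights during pre-training.
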
|
import plac
|
||||||
|
import random
|
||||||
|
import spacy
|
||||||
|
import thinc.extra.datasets
|
||||||
|
from spacy.util import minibatch, use_gpu, compounding
|
||||||
|
import tqdm
|
||||||
|
from spacy._ml import Tok2Vec
|
||||||
|
from spacy.pipeline import TextCategorizer
|
||||||
|
import numpy
|
||||||
|
|
||||||
|
|
||||||
|
def load_texts(limit=0):
|
||||||
|
train, dev = thinc.extra.datasets.imdb()
|
||||||
|
train_texts, train_labels = zip(*train)
|
||||||
|
dev_texts, dev_labels = zip(*dev)
|
||||||
|
train_texts = list(train_texts)
|
||||||
|
dev_texts = list(dev_texts)
|
||||||
|
random.shuffle(train_texts)
|
||||||
|
random.shuffle(dev_texts)
|
||||||
|
if limit >= 1:
|
||||||
|
return train_texts[:limit]
|
||||||
|
else:
|
||||||
|
return list(train_texts) + list(dev_texts)
|
||||||
|
|
||||||
|
|
||||||
|
def load_textcat_data(limit=0):
|
||||||
|
"""Load data from the IMDB dataset."""
|
||||||
|
# Partition off part of the train data for evaluation
|
||||||
|
train_data, eval_data = thinc.extra.datasets.imdb()
|
||||||
|
random.shuffle(train_data)
|
||||||
|
train_data = train_data[-limit:]
|
||||||
|
texts, labels = zip(*train_data)
|
||||||
|
eval_texts, eval_labels = zip(*eval_data)
|
||||||
|
cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]
|
||||||
|
eval_cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in eval_labels]
|
||||||
|
return (texts, cats), (eval_texts, eval_cats)
|
||||||
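
The IMDB labels are 0/1 integers; each one is expanded into a pair of mutually exclusive categories so the textcat can score both classes:

labels = [1, 0]
cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]
print(cats)  # [{'POSITIVE': True, 'NEGATIVE': False}, {'POSITIVE': False, 'NEGATIVE': True}]
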
|
|
||||||
|
|
||||||
|
def prefer_gpu():
|
||||||
|
used = spacy.util.use_gpu(0)
|
||||||
|
if used is None:
|
||||||
|
return False
|
||||||
|
else:
|
||||||
|
import cupy.random
|
||||||
|
|
||||||
|
cupy.random.seed(0)
|
||||||
|
return True
|
||||||
|
|
||||||
|
|
||||||
|
def build_textcat_model(tok2vec, nr_class, width):
|
||||||
|
from thinc.v2v import Model, Softmax, Maxout
|
||||||
|
from thinc.api import flatten_add_lengths, chain
|
||||||
|
from thinc.t2v import Pooling, sum_pool, mean_pool, max_pool
|
||||||
|
from thinc.misc import Residual, LayerNorm
|
||||||
|
from spacy._ml import logistic, zero_init
|
||||||
|
|
||||||
|
with Model.define_operators({">>": chain}):
|
||||||
|
model = (
|
||||||
|
tok2vec
|
||||||
|
>> flatten_add_lengths
|
||||||
|
>> Pooling(mean_pool)
|
||||||
|
>> Softmax(nr_class, width)
|
||||||
|
)
|
||||||
|
model.tok2vec = tok2vec
|
||||||
|
return model
|
||||||
|
|
||||||
|
|
||||||
|
def block_gradients(model):
|
||||||
|
from thinc.api import wrap
|
||||||
|
|
||||||
|
def forward(X, drop=0.0):
|
||||||
|
Y, _ = model.begin_update(X, drop=drop)
|
||||||
|
return Y, None
|
||||||
|
|
||||||
|
return wrap(forward, model)
|
||||||
|
|
||||||
|
|
||||||
|
def create_pipeline(width, embed_size, vectors_model):
|
||||||
|
print("Load vectors")
|
||||||
|
nlp = spacy.load(vectors_model)
|
||||||
|
print("Start training")
|
||||||
|
textcat = TextCategorizer(
|
||||||
|
nlp.vocab,
|
||||||
|
labels=["POSITIVE", "NEGATIVE"],
|
||||||
|
model=build_textcat_model(
|
||||||
|
Tok2Vec(width=width, embed_size=embed_size), 2, width
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
nlp.add_pipe(textcat)
|
||||||
|
return nlp
|
||||||
|
|
||||||
|
|
||||||
|
def train_tensorizer(nlp, texts, dropout, n_iter):
|
||||||
|
tensorizer = nlp.create_pipe("tensorizer")
|
||||||
|
nlp.add_pipe(tensorizer)
|
||||||
|
optimizer = nlp.begin_training()
|
||||||
|
for i in range(n_iter):
|
||||||
|
losses = {}
|
||||||
|
for i, batch in enumerate(minibatch(tqdm.tqdm(texts))):
|
||||||
|
docs = [nlp.make_doc(text) for text in batch]
|
||||||
|
tensorizer.update(docs, None, losses=losses, sgd=optimizer, drop=dropout)
|
||||||
|
print(losses)
|
||||||
|
return optimizer
|
||||||
|
|
||||||
|
|
||||||
|
def train_textcat(nlp, n_texts, n_iter=10):
|
||||||
|
textcat = nlp.get_pipe("textcat")
|
||||||
|
tok2vec_weights = textcat.model.tok2vec.to_bytes()
|
||||||
|
(train_texts, train_cats), (dev_texts, dev_cats) = load_textcat_data(limit=n_texts)
|
||||||
|
print(
|
||||||
|
"Using {} examples ({} training, {} evaluation)".format(
|
||||||
|
n_texts, len(train_texts), len(dev_texts)
|
||||||
|
)
|
||||||
|
)
|
||||||
|
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
|
||||||
|
|
||||||
|
# get names of other pipes to disable them during training
|
||||||
|
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
|
||||||
|
with nlp.disable_pipes(*other_pipes): # only train textcat
|
||||||
|
optimizer = nlp.begin_training()
|
||||||
|
textcat.model.tok2vec.from_bytes(tok2vec_weights)
|
||||||
|
print("Training the model...")
|
||||||
|
print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F"))
|
||||||
|
for i in range(n_iter):
|
||||||
|
losses = {"textcat": 0.0}
|
||||||
|
# batch up the examples using spaCy's minibatch
|
||||||
|
batches = minibatch(tqdm.tqdm(train_data), size=2)
|
||||||
|
for batch in batches:
|
||||||
|
texts, annotations = zip(*batch)
|
||||||
|
nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
|
||||||
|
with textcat.model.use_params(optimizer.averages):
|
||||||
|
# evaluate on the dev data split off in load_data()
|
||||||
|
scores = evaluate_textcat(nlp.tokenizer, textcat, dev_texts, dev_cats)
|
||||||
|
print(
|
||||||
|
"{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}".format( # print a simple table
|
||||||
|
losses["textcat"],
|
||||||
|
scores["textcat_p"],
|
||||||
|
scores["textcat_r"],
|
||||||
|
scores["textcat_f"],
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def evaluate_textcat(tokenizer, textcat, texts, cats):
|
||||||
|
docs = (tokenizer(text) for text in texts)
|
||||||
|
tp = 1e-8
|
||||||
|
fp = 1e-8
|
||||||
|
tn = 1e-8
|
||||||
|
fn = 1e-8
|
||||||
|
for i, doc in enumerate(textcat.pipe(docs)):
|
||||||
|
gold = cats[i]
|
||||||
|
for label, score in doc.cats.items():
|
||||||
|
if label not in gold:
|
||||||
|
continue
|
||||||
|
if score >= 0.5 and gold[label] >= 0.5:
|
||||||
|
tp += 1.0
|
||||||
|
elif score >= 0.5 and gold[label] < 0.5:
|
||||||
|
fp += 1.0
|
||||||
|
elif score < 0.5 and gold[label] < 0.5:
|
||||||
|
tn += 1
|
||||||
|
elif score < 0.5 and gold[label] >= 0.5:
|
||||||
|
fn += 1
|
||||||
|
precision = tp / (tp + fp)
|
||||||
|
recall = tp / (tp + fn)
|
||||||
|
f_score = 2 * (precision * recall) / (precision + recall)
|
||||||
|
return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score}
|
||||||
|
|
||||||
|
|
||||||
|
@plac.annotations(
|
||||||
|
width=("Width of CNN layers", "positional", None, int),
|
||||||
|
embed_size=("Embedding rows", "positional", None, int),
|
||||||
|
pretrain_iters=("Number of iterations to pretrain", "option", "pn", int),
|
||||||
|
train_iters=("Number of iterations to pretrain", "option", "tn", int),
|
||||||
|
train_examples=("Number of labelled examples", "option", "eg", int),
|
||||||
|
vectors_model=("Name or path to vectors model to learn from"),
|
||||||
|
)
|
||||||
|
def main(
|
||||||
|
width,
|
||||||
|
embed_size,
|
||||||
|
vectors_model,
|
||||||
|
pretrain_iters=30,
|
||||||
|
train_iters=30,
|
||||||
|
train_examples=1000,
|
||||||
|
):
|
||||||
|
random.seed(0)
|
||||||
|
numpy.random.seed(0)
|
||||||
|
use_gpu = prefer_gpu()
|
||||||
|
print("Using GPU?", use_gpu)
|
||||||
|
|
||||||
|
nlp = create_pipeline(width, embed_size, vectors_model)
|
||||||
|
print("Load data")
|
||||||
|
texts = load_texts(limit=0)
|
||||||
|
print("Train tensorizer")
|
||||||
|
optimizer = train_tensorizer(nlp, texts, dropout=0.2, n_iter=pretrain_iters)
|
||||||
|
print("Train textcat")
|
||||||
|
train_textcat(nlp, train_examples, n_iter=train_iters)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
plac.call(main)
|
||||||
94
examples/training/rehearsal.py
Normal file
94
examples/training/rehearsal.py
Normal file
|
|
@@ -0,0 +1,94 @@
|
||||||
|
"""Prevent catastrophic forgetting with rehearsal updates."""
|
||||||
|
import plac
|
||||||
|
import random
|
||||||
|
import srsly
|
||||||
|
import spacy
|
||||||
|
from spacy.gold import GoldParse
|
||||||
|
from spacy.util import minibatch, compounding
|
||||||
|
|
||||||
|
|
||||||
|
LABEL = "ANIMAL"
|
||||||
|
TRAIN_DATA = [
|
||||||
|
(
|
||||||
|
"Horses are too tall and they pretend to care about your feelings",
|
||||||
|
{"entities": [(0, 6, "ANIMAL")]},
|
||||||
|
),
|
||||||
|
("Do they bite?", {"entities": []}),
|
||||||
|
(
|
||||||
|
"horses are too tall and they pretend to care about your feelings",
|
||||||
|
{"entities": [(0, 6, "ANIMAL")]},
|
||||||
|
),
|
||||||
|
("horses pretend to care about your feelings", {"entities": [(0, 6, "ANIMAL")]}),
|
||||||
|
(
|
||||||
|
"they pretend to care about your feelings, those horses",
|
||||||
|
{"entities": [(48, 54, "ANIMAL")]},
|
||||||
|
),
|
||||||
|
("horses?", {"entities": [(0, 6, "ANIMAL")]}),
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def read_raw_data(nlp, jsonl_loc):
|
||||||
|
for json_obj in srsly.read_jsonl(jsonl_loc):
|
||||||
|
if json_obj["text"].strip():
|
||||||
|
doc = nlp.make_doc(json_obj["text"])
|
||||||
|
yield doc
|
||||||
|
|
||||||
|
|
||||||
|
def read_gold_data(nlp, gold_loc):
|
||||||
|
docs = []
|
||||||
|
golds = []
|
||||||
|
for json_obj in srsly.read_jsonl(gold_loc):
|
||||||
|
doc = nlp.make_doc(json_obj["text"])
|
||||||
|
ents = [(ent["start"], ent["end"], ent["label"]) for ent in json_obj["spans"]]
|
||||||
|
gold = GoldParse(doc, entities=ents)
|
||||||
|
docs.append(doc)
|
||||||
|
golds.append(gold)
|
||||||
|
return list(zip(docs, golds))
|
||||||
|
|
||||||
|
|
||||||
|
def main(model_name, unlabelled_loc):
|
||||||
|
n_iter = 10
|
||||||
|
dropout = 0.2
|
||||||
|
batch_size = 4
|
||||||
|
nlp = spacy.load(model_name)
|
||||||
|
nlp.get_pipe("ner").add_label(LABEL)
|
||||||
|
raw_docs = list(read_raw_data(nlp, unlabelled_loc))
|
||||||
|
optimizer = nlp.resume_training()
|
||||||
|
# Avoid use of Adam when resuming training. I don't understand this well
|
||||||
|
# yet, but I'm getting weird results from Adam. Try commenting out the
|
||||||
|
# nlp.update(), and using Adam -- you'll find the models drift apart.
|
||||||
|
# I guess Adam is losing precision, introducing gradient noise?
|
||||||
|
optimizer.alpha = 0.1
|
||||||
|
optimizer.b1 = 0.0
|
||||||
|
optimizer.b2 = 0.0
|
||||||
|
|
||||||
|
# get names of other pipes to disable them during training
|
||||||
|
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
|
||||||
|
sizes = compounding(1.0, 4.0, 1.001)
|
||||||
|
with nlp.disable_pipes(*other_pipes):
|
||||||
|
for itn in range(n_iter):
|
||||||
|
random.shuffle(TRAIN_DATA)
|
||||||
|
random.shuffle(raw_docs)
|
||||||
|
losses = {}
|
||||||
|
r_losses = {}
|
||||||
|
# batch up the examples using spaCy's minibatch
|
||||||
|
raw_batches = minibatch(raw_docs, size=4)
|
||||||
|
for batch in minibatch(TRAIN_DATA, size=sizes):
|
||||||
|
docs, golds = zip(*batch)
|
||||||
|
nlp.update(docs, golds, sgd=optimizer, drop=dropout, losses=losses)
|
||||||
|
raw_batch = list(next(raw_batches))
|
||||||
|
nlp.rehearse(raw_batch, sgd=optimizer, losses=r_losses)
|
||||||
|
print("Losses", losses)
|
||||||
|
print("R. Losses", r_losses)
|
||||||
|
print(nlp.get_pipe("ner").model.unseen_classes)
|
||||||
|
test_text = "Do you like horses?"
|
||||||
|
doc = nlp(test_text)
|
||||||
|
print("Entities in '%s'" % test_text)
|
||||||
|
for ent in doc.ents:
|
||||||
|
print(ent.label_, ent.text)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
plac.call(main)
|
||||||
|
|
@@ -29,73 +29,113 @@ from spacy.util import minibatch, compounding
|
||||||
# training data: texts, heads and dependency labels
|
# training data: texts, heads and dependency labels
|
||||||
# for no relation, we simply chose an arbitrary dependency label, e.g. '-'
|
# for no relation, we simply chose an arbitrary dependency label, e.g. '-'
|
||||||
TRAIN_DATA = [
|
TRAIN_DATA = [
|
||||||
("find a cafe with great wifi", {
|
(
|
||||||
'heads': [0, 2, 0, 5, 5, 2], # index of token head
|
"find a cafe with great wifi",
|
||||||
'deps': ['ROOT', '-', 'PLACE', '-', 'QUALITY', 'ATTRIBUTE']
|
{
|
||||||
}),
|
"heads": [0, 2, 0, 5, 5, 2], # index of token head
|
||||||
("find a hotel near the beach", {
|
"deps": ["ROOT", "-", "PLACE", "-", "QUALITY", "ATTRIBUTE"],
|
||||||
'heads': [0, 2, 0, 5, 5, 2],
|
},
|
||||||
'deps': ['ROOT', '-', 'PLACE', 'QUALITY', '-', 'ATTRIBUTE']
|
),
|
||||||
}),
|
(
|
||||||
("find me the closest gym that's open late", {
|
"find a hotel near the beach",
|
||||||
'heads': [0, 0, 4, 4, 0, 6, 4, 6, 6],
|
{
|
||||||
'deps': ['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', '-', 'ATTRIBUTE', 'TIME']
|
"heads": [0, 2, 0, 5, 5, 2],
|
||||||
}),
|
"deps": ["ROOT", "-", "PLACE", "QUALITY", "-", "ATTRIBUTE"],
|
||||||
("show me the cheapest store that sells flowers", {
|
},
|
||||||
'heads': [0, 0, 4, 4, 0, 4, 4, 4], # attach "flowers" to store!
|
),
|
||||||
'deps': ['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', '-', 'PRODUCT']
|
(
|
||||||
}),
|
"find me the closest gym that's open late",
|
||||||
("find a nice restaurant in london", {
|
{
|
||||||
'heads': [0, 3, 3, 0, 3, 3],
|
"heads": [0, 0, 4, 4, 0, 6, 4, 6, 6],
|
||||||
'deps': ['ROOT', '-', 'QUALITY', 'PLACE', '-', 'LOCATION']
|
"deps": [
|
||||||
}),
|
"ROOT",
|
||||||
("show me the coolest hostel in berlin", {
|
"-",
|
||||||
'heads': [0, 0, 4, 4, 0, 4, 4],
|
"-",
|
||||||
'deps': ['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', 'LOCATION']
|
"QUALITY",
|
||||||
}),
|
"PLACE",
|
||||||
("find a good italian restaurant near work", {
|
"-",
|
||||||
'heads': [0, 4, 4, 4, 0, 4, 5],
|
"-",
|
||||||
'deps': ['ROOT', '-', 'QUALITY', 'ATTRIBUTE', 'PLACE', 'ATTRIBUTE', 'LOCATION']
|
"ATTRIBUTE",
|
||||||
})
|
"TIME",
|
||||||
|
],
|
||||||
|
},
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"show me the cheapest store that sells flowers",
|
||||||
|
{
|
||||||
|
"heads": [0, 0, 4, 4, 0, 4, 4, 4], # attach "flowers" to store!
|
||||||
|
"deps": ["ROOT", "-", "-", "QUALITY", "PLACE", "-", "-", "PRODUCT"],
|
||||||
|
},
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"find a nice restaurant in london",
|
||||||
|
{
|
||||||
|
"heads": [0, 3, 3, 0, 3, 3],
|
||||||
|
"deps": ["ROOT", "-", "QUALITY", "PLACE", "-", "LOCATION"],
|
||||||
|
},
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"show me the coolest hostel in berlin",
|
||||||
|
{
|
||||||
|
"heads": [0, 0, 4, 4, 0, 4, 4],
|
||||||
|
"deps": ["ROOT", "-", "-", "QUALITY", "PLACE", "-", "LOCATION"],
|
||||||
|
},
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"find a good italian restaurant near work",
|
||||||
|
{
|
||||||
|
"heads": [0, 4, 4, 4, 0, 4, 5],
|
||||||
|
"deps": [
|
||||||
|
"ROOT",
|
||||||
|
"-",
|
||||||
|
"QUALITY",
|
||||||
|
"ATTRIBUTE",
|
||||||
|
"PLACE",
|
||||||
|
"ATTRIBUTE",
|
||||||
|
"LOCATION",
|
||||||
|
],
|
||||||
|
},
|
||||||
|
),
|
||||||
]
|
]
|
||||||
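
To see how the annotations line up with the tokens, here is a minimal sketch (it assumes a simple whitespace split matches spaCy's tokenization for these particular texts): heads[i] is the index of token i's head, and deps[i] names the relation, with "-" standing for "no relation of interest".

words = "find a cafe with great wifi".split()
heads = [0, 2, 0, 5, 5, 2]
deps = ["ROOT", "-", "PLACE", "-", "QUALITY", "ATTRIBUTE"]
for word, head, dep in zip(words, heads, deps):
    print("{:<6} --{}--> {}".format(word, dep, words[head]))
# cafe --PLACE--> find, great --QUALITY--> wifi, wifi --ATTRIBUTE--> cafe, ...
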
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
||||||
output_dir=("Optional output directory", "option", "o", Path),
|
output_dir=("Optional output directory", "option", "o", Path),
|
||||||
n_iter=("Number of training iterations", "option", "n", int))
|
n_iter=("Number of training iterations", "option", "n", int),
|
||||||
|
)
|
||||||
def main(model=None, output_dir=None, n_iter=15):
|
def main(model=None, output_dir=None, n_iter=15):
|
||||||
"""Load the model, set up the pipeline and train the parser."""
|
"""Load the model, set up the pipeline and train the parser."""
|
||||||
if model is not None:
|
if model is not None:
|
||||||
nlp = spacy.load(model) # load existing spaCy model
|
nlp = spacy.load(model) # load existing spaCy model
|
||||||
print("Loaded model '%s'" % model)
|
print("Loaded model '%s'" % model)
|
||||||
else:
|
else:
|
||||||
nlp = spacy.blank('en') # create blank Language class
|
nlp = spacy.blank("en") # create blank Language class
|
||||||
print("Created blank 'en' model")
|
print("Created blank 'en' model")
|
||||||
|
|
||||||
# We'll use the built-in dependency parser class, but we want to create a
|
# We'll use the built-in dependency parser class, but we want to create a
|
||||||
# fresh instance – just in case.
|
# fresh instance – just in case.
|
||||||
if 'parser' in nlp.pipe_names:
|
if "parser" in nlp.pipe_names:
|
||||||
nlp.remove_pipe('parser')
|
nlp.remove_pipe("parser")
|
||||||
parser = nlp.create_pipe('parser')
|
parser = nlp.create_pipe("parser")
|
||||||
nlp.add_pipe(parser, first=True)
|
nlp.add_pipe(parser, first=True)
|
||||||
|
|
||||||
for text, annotations in TRAIN_DATA:
|
for text, annotations in TRAIN_DATA:
|
||||||
for dep in annotations.get('deps', []):
|
for dep in annotations.get("deps", []):
|
||||||
parser.add_label(dep)
|
parser.add_label(dep)
|
||||||
|
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'parser']
|
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "parser"]
|
||||||
with nlp.disable_pipes(*other_pipes): # only train parser
|
with nlp.disable_pipes(*other_pipes): # only train parser
|
||||||
optimizer = nlp.begin_training()
|
optimizer = nlp.begin_training()
|
||||||
for itn in range(n_iter):
|
for itn in range(n_iter):
|
||||||
random.shuffle(TRAIN_DATA)
|
random.shuffle(TRAIN_DATA)
|
||||||
losses = {}
|
losses = {}
|
||||||
# batch up the examples using spaCy's minibatch
|
# batch up the examples using spaCy's minibatch
|
||||||
batches = minibatch(TRAIN_DATA, size=compounding(4., 32., 1.001))
|
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
|
||||||
for batch in batches:
|
for batch in batches:
|
||||||
texts, annotations = zip(*batch)
|
texts, annotations = zip(*batch)
|
||||||
nlp.update(texts, annotations, sgd=optimizer, losses=losses)
|
nlp.update(texts, annotations, sgd=optimizer, losses=losses)
|
||||||
print('Losses', losses)
|
print("Losses", losses)
|
||||||
|
|
||||||
# test the trained model
|
# test the trained model
|
||||||
test_model(nlp)
|
test_model(nlp)
|
||||||
|
|
@@ -115,16 +155,18 @@ def main(model=None, output_dir=None, n_iter=15):
|
||||||
|
|
||||||
|
|
||||||
def test_model(nlp):
|
def test_model(nlp):
|
||||||
texts = ["find a hotel with good wifi",
|
texts = [
|
||||||
|
"find a hotel with good wifi",
|
||||||
"find me the cheapest gym near work",
|
"find me the cheapest gym near work",
|
||||||
"show me the best hotel in berlin"]
|
"show me the best hotel in berlin",
|
||||||
|
]
|
||||||
docs = nlp.pipe(texts)
|
docs = nlp.pipe(texts)
|
||||||
for doc in docs:
|
for doc in docs:
|
||||||
print(doc.text)
|
print(doc.text)
|
||||||
print([(t.text, t.dep_, t.head.text) for t in doc if t.dep_ != '-'])
|
print([(t.text, t.dep_, t.head.text) for t in doc if t.dep_ != "-"])
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == "__main__":
|
||||||
plac.call(main)
|
plac.call(main)
|
||||||
|
|
||||||
# Expected output:
|
# Expected output:
|
||||||
|
|
|
||||||
|
|
@@ -35,7 +35,7 @@ from spacy.util import minibatch, compounding
|
||||||
|
|
||||||
|
|
||||||
# new entity label
|
# new entity label
|
||||||
LABEL = 'ANIMAL'
|
LABEL = "ANIMAL"
|
||||||
|
|
||||||
# training data
|
# training data
|
||||||
# Note: If you're using an existing model, make sure to mix in examples of
|
# Note: If you're using an existing model, make sure to mix in examples of
|
||||||
|
|
@@ -43,29 +43,21 @@ LABEL = 'ANIMAL'
|
||||||
# model might learn the new type, but "forget" what it previously knew.
|
# model might learn the new type, but "forget" what it previously knew.
|
||||||
# https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting
|
# https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting
|
||||||
TRAIN_DATA = [
|
TRAIN_DATA = [
|
||||||
("Horses are too tall and they pretend to care about your feelings", {
|
(
|
||||||
'entities': [(0, 6, 'ANIMAL')]
|
"Horses are too tall and they pretend to care about your feelings",
|
||||||
}),
|
{"entities": [(0, 6, LABEL)]},
|
||||||
|
),
|
||||||
("Do they bite?", {
|
("Do they bite?", {"entities": []}),
|
||||||
'entities': []
|
(
|
||||||
}),
|
"horses are too tall and they pretend to care about your feelings",
|
||||||
|
{"entities": [(0, 6, LABEL)]},
|
||||||
("horses are too tall and they pretend to care about your feelings", {
|
),
|
||||||
'entities': [(0, 6, 'ANIMAL')]
|
("horses pretend to care about your feelings", {"entities": [(0, 6, LABEL)]}),
|
||||||
}),
|
(
|
||||||
|
"they pretend to care about your feelings, those horses",
|
||||||
("horses pretend to care about your feelings", {
|
{"entities": [(48, 54, LABEL)]},
|
||||||
'entities': [(0, 6, 'ANIMAL')]
|
),
|
||||||
}),
|
("horses?", {"entities": [(0, 6, LABEL)]}),
|
||||||
|
|
||||||
("they pretend to care about your feelings, those horses", {
|
|
||||||
'entities': [(48, 54, 'ANIMAL')]
|
|
||||||
}),
|
|
||||||
|
|
||||||
("horses?", {
|
|
||||||
'entities': [(0, 6, 'ANIMAL')]
|
|
||||||
})
|
|
||||||
]
|
]
|
||||||
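
The note above warns that training only on the new ANIMAL examples can make an existing model forget its other entity types. One hedged way to mix in such examples (a sketch in the spirit of the linked pseudo-rehearsal post, not code from this commit) is to run the existing model over raw text and reuse its own predictions as extra annotations:

# Sketch: build "revision" examples from an existing model's own predictions.
def make_revision_data(nlp, raw_texts):
    revision = []
    for doc in nlp.pipe(raw_texts):
        ents = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
        revision.append((doc.text, {"entities": ents}))
    return revision

The tuples have the same shape as TRAIN_DATA, so they can be concatenated with it and shuffled before the update loop.
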
|
|
||||||
|
|
||||||
|
|
@@ -73,48 +65,50 @@ TRAIN_DATA = [
|
||||||
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
||||||
new_model_name=("New model name for model meta.", "option", "nm", str),
|
new_model_name=("New model name for model meta.", "option", "nm", str),
|
||||||
output_dir=("Optional output directory", "option", "o", Path),
|
output_dir=("Optional output directory", "option", "o", Path),
|
||||||
n_iter=("Number of training iterations", "option", "n", int))
|
n_iter=("Number of training iterations", "option", "n", int),
|
||||||
def main(model=None, new_model_name='animal', output_dir=None, n_iter=10):
|
)
|
||||||
|
def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
|
||||||
"""Set up the pipeline and entity recognizer, and train the new entity."""
|
"""Set up the pipeline and entity recognizer, and train the new entity."""
|
||||||
|
random.seed(0)
|
||||||
if model is not None:
|
if model is not None:
|
||||||
nlp = spacy.load(model) # load existing spaCy model
|
nlp = spacy.load(model) # load existing spaCy model
|
||||||
print("Loaded model '%s'" % model)
|
print("Loaded model '%s'" % model)
|
||||||
else:
|
else:
|
||||||
nlp = spacy.blank('en') # create blank Language class
|
nlp = spacy.blank("en") # create blank Language class
|
||||||
print("Created blank 'en' model")
|
print("Created blank 'en' model")
|
||||||
# Add entity recognizer to model if it's not in the pipeline
|
# Add entity recognizer to model if it's not in the pipeline
|
||||||
# nlp.create_pipe works for built-ins that are registered with spaCy
|
# nlp.create_pipe works for built-ins that are registered with spaCy
|
||||||
if 'ner' not in nlp.pipe_names:
|
if "ner" not in nlp.pipe_names:
|
||||||
ner = nlp.create_pipe('ner')
|
ner = nlp.create_pipe("ner")
|
||||||
nlp.add_pipe(ner)
|
nlp.add_pipe(ner)
|
||||||
# otherwise, get it, so we can add labels to it
|
# otherwise, get it, so we can add labels to it
|
||||||
else:
|
else:
|
||||||
ner = nlp.get_pipe('ner')
|
ner = nlp.get_pipe("ner")
|
||||||
|
|
||||||
ner.add_label(LABEL) # add new entity label to entity recognizer
|
ner.add_label(LABEL) # add new entity label to entity recognizer
|
||||||
|
# Adding extraneous labels shouldn't mess anything up
|
||||||
|
ner.add_label("VEGETABLE")
|
||||||
if model is None:
|
if model is None:
|
||||||
optimizer = nlp.begin_training()
|
optimizer = nlp.begin_training()
|
||||||
else:
|
else:
|
||||||
# Note that 'begin_training' initializes the models, so it'll zero out
|
optimizer = nlp.resume_training()
|
||||||
# existing entity types.
|
move_names = list(ner.move_names)
|
||||||
optimizer = nlp.entity.create_optimizer()
|
|
||||||
|
|
||||||
# get names of other pipes to disable them during training
|
# get names of other pipes to disable them during training
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
|
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
|
||||||
with nlp.disable_pipes(*other_pipes): # only train NER
|
with nlp.disable_pipes(*other_pipes): # only train NER
|
||||||
|
sizes = compounding(1.0, 4.0, 1.001)
|
||||||
|
# batch up the examples using spaCy's minibatch
|
||||||
for itn in range(n_iter):
|
for itn in range(n_iter):
|
||||||
random.shuffle(TRAIN_DATA)
|
random.shuffle(TRAIN_DATA)
|
||||||
|
batches = minibatch(TRAIN_DATA, size=sizes)
|
||||||
losses = {}
|
losses = {}
|
||||||
# batch up the examples using spaCy's minibatch
|
|
||||||
batches = minibatch(TRAIN_DATA, size=compounding(4., 32., 1.001))
|
|
||||||
for batch in batches:
|
for batch in batches:
|
||||||
texts, annotations = zip(*batch)
|
texts, annotations = zip(*batch)
|
||||||
nlp.update(texts, annotations, sgd=optimizer, drop=0.35,
|
nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
|
||||||
losses=losses)
|
print("Losses", losses)
|
||||||
print('Losses', losses)
|
|
||||||
|
|
||||||
# test the trained model
|
# test the trained model
|
||||||
test_text = 'Do you like horses?'
|
test_text = "Do you like horses?"
|
||||||
doc = nlp(test_text)
|
doc = nlp(test_text)
|
||||||
print("Entities in '%s'" % test_text)
|
print("Entities in '%s'" % test_text)
|
||||||
for ent in doc.ents:
|
for ent in doc.ents:
|
||||||
|
|
@@ -125,17 +119,19 @@ def main(model=None, new_model_name='animal', output_dir=None, n_iter=10):
|
||||||
output_dir = Path(output_dir)
|
output_dir = Path(output_dir)
|
||||||
if not output_dir.exists():
|
if not output_dir.exists():
|
||||||
output_dir.mkdir()
|
output_dir.mkdir()
|
||||||
nlp.meta['name'] = new_model_name # rename model
|
nlp.meta["name"] = new_model_name # rename model
|
||||||
nlp.to_disk(output_dir)
|
nlp.to_disk(output_dir)
|
||||||
print("Saved model to", output_dir)
|
print("Saved model to", output_dir)
|
||||||
|
|
||||||
# test the saved model
|
# test the saved model
|
||||||
print("Loading from", output_dir)
|
print("Loading from", output_dir)
|
||||||
nlp2 = spacy.load(output_dir)
|
nlp2 = spacy.load(output_dir)
|
||||||
|
# Check the classes have loaded back consistently
|
||||||
|
assert nlp2.get_pipe("ner").move_names == move_names
|
||||||
doc2 = nlp2(test_text)
|
doc2 = nlp2(test_text)
|
||||||
for ent in doc2.ents:
|
for ent in doc2.ents:
|
||||||
print(ent.label_, ent.text)
|
print(ent.label_, ent.text)
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == "__main__":
|
||||||
plac.call(main)
|
plac.call(main)
|
||||||
|
|
|
||||||
|
|
@@ -18,62 +18,69 @@ from spacy.util import minibatch, compounding
|
||||||
|
|
||||||
# training data
|
# training data
|
||||||
TRAIN_DATA = [
|
TRAIN_DATA = [
|
||||||
("They trade mortgage-backed securities.", {
|
(
|
||||||
'heads': [1, 1, 4, 4, 5, 1, 1],
|
"They trade mortgage-backed securities.",
|
||||||
'deps': ['nsubj', 'ROOT', 'compound', 'punct', 'nmod', 'dobj', 'punct']
|
{
|
||||||
}),
|
"heads": [1, 1, 4, 4, 5, 1, 1],
|
||||||
("I like London and Berlin.", {
|
"deps": ["nsubj", "ROOT", "compound", "punct", "nmod", "dobj", "punct"],
|
||||||
'heads': [1, 1, 1, 2, 2, 1],
|
},
|
||||||
'deps': ['nsubj', 'ROOT', 'dobj', 'cc', 'conj', 'punct']
|
),
|
||||||
})
|
(
|
||||||
|
"I like London and Berlin.",
|
||||||
|
{
|
||||||
|
"heads": [1, 1, 1, 2, 2, 1],
|
||||||
|
"deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
|
||||||
|
},
|
||||||
|
),
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
||||||
output_dir=("Optional output directory", "option", "o", Path),
|
output_dir=("Optional output directory", "option", "o", Path),
|
||||||
n_iter=("Number of training iterations", "option", "n", int))
|
n_iter=("Number of training iterations", "option", "n", int),
|
||||||
|
)
|
||||||
def main(model=None, output_dir=None, n_iter=10):
|
def main(model=None, output_dir=None, n_iter=10):
|
||||||
"""Load the model, set up the pipeline and train the parser."""
|
"""Load the model, set up the pipeline and train the parser."""
|
||||||
if model is not None:
|
if model is not None:
|
||||||
nlp = spacy.load(model) # load existing spaCy model
|
nlp = spacy.load(model) # load existing spaCy model
|
||||||
print("Loaded model '%s'" % model)
|
print("Loaded model '%s'" % model)
|
||||||
else:
|
else:
|
||||||
nlp = spacy.blank('en') # create blank Language class
|
nlp = spacy.blank("en") # create blank Language class
|
||||||
print("Created blank 'en' model")
|
print("Created blank 'en' model")
|
||||||
|
|
||||||
# add the parser to the pipeline if it doesn't exist
|
# add the parser to the pipeline if it doesn't exist
|
||||||
# nlp.create_pipe works for built-ins that are registered with spaCy
|
# nlp.create_pipe works for built-ins that are registered with spaCy
|
||||||
if 'parser' not in nlp.pipe_names:
|
if "parser" not in nlp.pipe_names:
|
||||||
parser = nlp.create_pipe('parser')
|
parser = nlp.create_pipe("parser")
|
||||||
nlp.add_pipe(parser, first=True)
|
nlp.add_pipe(parser, first=True)
|
||||||
# otherwise, get it, so we can add labels to it
|
# otherwise, get it, so we can add labels to it
|
||||||
else:
|
else:
|
||||||
parser = nlp.get_pipe('parser')
|
parser = nlp.get_pipe("parser")
|
||||||
|
|
||||||
# add labels to the parser
|
# add labels to the parser
|
||||||
for _, annotations in TRAIN_DATA:
|
for _, annotations in TRAIN_DATA:
|
||||||
for dep in annotations.get('deps', []):
|
for dep in annotations.get("deps", []):
|
||||||
parser.add_label(dep)
|
parser.add_label(dep)
|
||||||
|
|
||||||
# get names of other pipes to disable them during training
|
# get names of other pipes to disable them during training
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'parser']
|
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "parser"]
|
||||||
with nlp.disable_pipes(*other_pipes): # only train parser
|
with nlp.disable_pipes(*other_pipes): # only train parser
|
||||||
optimizer = nlp.begin_training()
|
optimizer = nlp.begin_training()
|
||||||
for itn in range(n_iter):
|
for itn in range(n_iter):
|
||||||
random.shuffle(TRAIN_DATA)
|
random.shuffle(TRAIN_DATA)
|
||||||
losses = {}
|
losses = {}
|
||||||
# batch up the examples using spaCy's minibatch
|
# batch up the examples using spaCy's minibatch
|
||||||
batches = minibatch(TRAIN_DATA, size=compounding(4., 32., 1.001))
|
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
|
||||||
for batch in batches:
|
for batch in batches:
|
||||||
texts, annotations = zip(*batch)
|
texts, annotations = zip(*batch)
|
||||||
nlp.update(texts, annotations, sgd=optimizer, losses=losses)
|
nlp.update(texts, annotations, sgd=optimizer, losses=losses)
|
||||||
print('Losses', losses)
|
print("Losses", losses)
|
||||||
|
|
||||||
# test the trained model
|
# test the trained model
|
||||||
test_text = "I like securities."
|
test_text = "I like securities."
|
||||||
doc = nlp(test_text)
|
doc = nlp(test_text)
|
||||||
print('Dependencies', [(t.text, t.dep_, t.head.text) for t in doc])
|
print("Dependencies", [(t.text, t.dep_, t.head.text) for t in doc])
|
||||||
|
|
||||||
# save model to output directory
|
# save model to output directory
|
||||||
if output_dir is not None:
|
if output_dir is not None:
|
||||||
|
|
@@ -87,10 +94,10 @@ def main(model=None, output_dir=None, n_iter=10):
|
||||||
print("Loading from", output_dir)
|
print("Loading from", output_dir)
|
||||||
nlp2 = spacy.load(output_dir)
|
nlp2 = spacy.load(output_dir)
|
||||||
doc = nlp2(test_text)
|
doc = nlp2(test_text)
|
||||||
print('Dependencies', [(t.text, t.dep_, t.head.text) for t in doc])
|
print("Dependencies", [(t.text, t.dep_, t.head.text) for t in doc])
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == "__main__":
|
||||||
plac.call(main)
|
plac.call(main)
|
||||||
|
|
||||||
# expected result:
|
# expected result:
|
||||||
|
|
|
||||||
|
|
@@ -25,11 +25,7 @@ from spacy.util import minibatch, compounding
|
||||||
# http://universaldependencies.github.io/docs/u/pos/index.html
|
# http://universaldependencies.github.io/docs/u/pos/index.html
|
||||||
# You may also specify morphological features for your tags, from the universal
|
# You may also specify morphological features for your tags, from the universal
|
||||||
# scheme.
|
# scheme.
|
||||||
TAG_MAP = {
|
TAG_MAP = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}, "J": {"pos": "ADJ"}}
|
||||||
'N': {'pos': 'NOUN'},
|
|
||||||
'V': {'pos': 'VERB'},
|
|
||||||
'J': {'pos': 'ADJ'}
|
|
||||||
}
|
|
||||||
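
As the comment above says, entries can also carry morphological features from the universal scheme alongside the coarse part of speech. The exact feature names below are an illustrative assumption rather than part of this commit, but the shape of the entry is the same and it would be added with tagger.add_label(tag, values) exactly like the simple entries:

# Hypothetical tag-map entries with morphological features (illustrative only):
TAG_MAP_WITH_FEATURES = {
    "NNS": {"pos": "NOUN", "Number": "plur"},  # plural noun
    "VBD": {"pos": "VERB", "Tense": "past"},   # past-tense verb
}
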
|
|
||||||
# Usually you'll read this in, of course. Data formats vary. Ensure your
|
# Usually you'll read this in, of course. Data formats vary. Ensure your
|
||||||
# strings are unicode and that the number of tags assigned matches spaCy's
|
# strings are unicode and that the number of tags assigned matches spaCy's
|
||||||
|
|
@@ -37,16 +33,17 @@ TAG_MAP = {
|
||||||
# that specifies the gold-standard tokenization, e.g.:
|
# that specifies the gold-standard tokenization, e.g.:
|
||||||
# ("Eatblueham", {'words': ['Eat', 'blue', 'ham'], 'tags': ['V', 'J', 'N']})
|
# ("Eatblueham", {'words': ['Eat', 'blue', 'ham'], 'tags': ['V', 'J', 'N']})
|
||||||
TRAIN_DATA = [
|
TRAIN_DATA = [
|
||||||
("I like green eggs", {'tags': ['N', 'V', 'J', 'N']}),
|
("I like green eggs", {"tags": ["N", "V", "J", "N"]}),
|
||||||
("Eat blue ham", {'tags': ['V', 'J', 'N']})
|
("Eat blue ham", {"tags": ["V", "J", "N"]}),
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
lang=("ISO Code of language to use", "option", "l", str),
|
lang=("ISO Code of language to use", "option", "l", str),
|
||||||
output_dir=("Optional output directory", "option", "o", Path),
|
output_dir=("Optional output directory", "option", "o", Path),
|
||||||
n_iter=("Number of training iterations", "option", "n", int))
|
n_iter=("Number of training iterations", "option", "n", int),
|
||||||
def main(lang='en', output_dir=None, n_iter=25):
|
)
|
||||||
|
def main(lang="en", output_dir=None, n_iter=25):
|
||||||
"""Create a new model, set up the pipeline and train the tagger. In order to
|
"""Create a new model, set up the pipeline and train the tagger. In order to
|
||||||
train the tagger with a custom tag map, we're creating a new Language
|
train the tagger with a custom tag map, we're creating a new Language
|
||||||
instance with a custom vocab.
|
instance with a custom vocab.
|
||||||
|
|
@@ -54,7 +51,7 @@ def main(lang='en', output_dir=None, n_iter=25):
|
||||||
nlp = spacy.blank(lang)
|
nlp = spacy.blank(lang)
|
||||||
# add the tagger to the pipeline
|
# add the tagger to the pipeline
|
||||||
# nlp.create_pipe works for built-ins that are registered with spaCy
|
# nlp.create_pipe works for built-ins that are registered with spaCy
|
||||||
tagger = nlp.create_pipe('tagger')
|
tagger = nlp.create_pipe("tagger")
|
||||||
# Add the tags. This needs to be done before you start training.
|
# Add the tags. This needs to be done before you start training.
|
||||||
for tag, values in TAG_MAP.items():
|
for tag, values in TAG_MAP.items():
|
||||||
tagger.add_label(tag, values)
|
tagger.add_label(tag, values)
|
||||||
|
|
@@ -65,16 +62,16 @@ def main(lang='en', output_dir=None, n_iter=25):
|
||||||
random.shuffle(TRAIN_DATA)
|
random.shuffle(TRAIN_DATA)
|
||||||
losses = {}
|
losses = {}
|
||||||
# batch up the examples using spaCy's minibatch
|
# batch up the examples using spaCy's minibatch
|
||||||
batches = minibatch(TRAIN_DATA, size=compounding(4., 32., 1.001))
|
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
|
||||||
for batch in batches:
|
for batch in batches:
|
||||||
texts, annotations = zip(*batch)
|
texts, annotations = zip(*batch)
|
||||||
nlp.update(texts, annotations, sgd=optimizer, losses=losses)
|
nlp.update(texts, annotations, sgd=optimizer, losses=losses)
|
||||||
print('Losses', losses)
|
print("Losses", losses)
|
||||||
|
|
||||||
# test the trained model
|
# test the trained model
|
||||||
test_text = "I like blue eggs"
|
test_text = "I like blue eggs"
|
||||||
doc = nlp(test_text)
|
doc = nlp(test_text)
|
||||||
print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])
|
print("Tags", [(t.text, t.tag_, t.pos_) for t in doc])
|
||||||
|
|
||||||
# save model to output directory
|
# save model to output directory
|
||||||
if output_dir is not None:
|
if output_dir is not None:
|
||||||
|
|
@@ -88,10 +85,10 @@ def main(lang='en', output_dir=None, n_iter=25):
|
||||||
print("Loading from", output_dir)
|
print("Loading from", output_dir)
|
||||||
nlp2 = spacy.load(output_dir)
|
nlp2 = spacy.load(output_dir)
|
||||||
doc = nlp2(test_text)
|
doc = nlp2(test_text)
|
||||||
print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])
|
print("Tags", [(t.text, t.tag_, t.pos_) for t in doc])
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == "__main__":
|
||||||
plac.call(main)
|
plac.call(main)
|
||||||
|
|
||||||
# Expected output:
|
# Expected output:
|
||||||
|
|
|
||||||
|
|
@@ -23,7 +23,8 @@ from spacy.util import minibatch, compounding
|
||||||
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
||||||
output_dir=("Optional output directory", "option", "o", Path),
|
output_dir=("Optional output directory", "option", "o", Path),
|
||||||
n_texts=("Number of texts to train from", "option", "t", int),
|
n_texts=("Number of texts to train from", "option", "t", int),
|
||||||
n_iter=("Number of training iterations", "option", "n", int))
|
n_iter=("Number of training iterations", "option", "n", int),
|
||||||
|
)
|
||||||
def main(model=None, output_dir=None, n_iter=20, n_texts=2000):
|
def main(model=None, output_dir=None, n_iter=20, n_texts=2000):
|
||||||
if output_dir is not None:
|
if output_dir is not None:
|
||||||
output_dir = Path(output_dir)
|
output_dir = Path(output_dir)
|
||||||
|
|
@@ -34,49 +35,58 @@ def main(model=None, output_dir=None, n_iter=20, n_texts=2000):
|
||||||
nlp = spacy.load(model) # load existing spaCy model
|
nlp = spacy.load(model) # load existing spaCy model
|
||||||
print("Loaded model '%s'" % model)
|
print("Loaded model '%s'" % model)
|
||||||
else:
|
else:
|
||||||
nlp = spacy.blank('en') # create blank Language class
|
nlp = spacy.blank("en") # create blank Language class
|
||||||
print("Created blank 'en' model")
|
print("Created blank 'en' model")
|
||||||
|
|
||||||
# add the text classifier to the pipeline if it doesn't exist
|
# add the text classifier to the pipeline if it doesn't exist
|
||||||
# nlp.create_pipe works for built-ins that are registered with spaCy
|
# nlp.create_pipe works for built-ins that are registered with spaCy
|
||||||
if 'textcat' not in nlp.pipe_names:
|
if "textcat" not in nlp.pipe_names:
|
||||||
textcat = nlp.create_pipe('textcat')
|
textcat = nlp.create_pipe("textcat", config={
|
||||||
|
"architecture": "simple_cnn",
|
||||||
|
"exclusive_classes": True})
|
||||||
nlp.add_pipe(textcat, last=True)
|
nlp.add_pipe(textcat, last=True)
|
||||||
# otherwise, get it, so we can add labels to it
|
# otherwise, get it, so we can add labels to it
|
||||||
else:
|
else:
|
||||||
textcat = nlp.get_pipe('textcat')
|
textcat = nlp.get_pipe("textcat")
|
||||||
|
|
||||||
# add label to text classifier
|
# add label to text classifier
|
||||||
textcat.add_label('POSITIVE')
|
textcat.add_label("POSITIVE")
|
||||||
|
textcat.add_label("NEGATIVE")
|
||||||
|
|
||||||
# load the IMDB dataset
|
# load the IMDB dataset
|
||||||
print("Loading IMDB data...")
|
print("Loading IMDB data...")
|
||||||
(train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=n_texts)
|
(train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=n_texts)
|
||||||
print("Using {} examples ({} training, {} evaluation)"
|
print(
|
||||||
.format(n_texts, len(train_texts), len(dev_texts)))
|
"Using {} examples ({} training, {} evaluation)".format(
|
||||||
train_data = list(zip(train_texts,
|
n_texts, len(train_texts), len(dev_texts)
|
||||||
[{'cats': cats} for cats in train_cats]))
|
)
|
||||||
|
)
|
||||||
|
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
|
||||||
|
|
||||||
# get names of other pipes to disable them during training
|
# get names of other pipes to disable them during training
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
|
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
|
||||||
with nlp.disable_pipes(*other_pipes): # only train textcat
|
with nlp.disable_pipes(*other_pipes): # only train textcat
|
||||||
optimizer = nlp.begin_training()
|
optimizer = nlp.begin_training()
|
||||||
print("Training the model...")
|
print("Training the model...")
|
||||||
print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))
|
print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F"))
|
||||||
for i in range(n_iter):
|
for i in range(n_iter):
|
||||||
losses = {}
|
losses = {}
|
||||||
# batch up the examples using spaCy's minibatch
|
# batch up the examples using spaCy's minibatch
|
||||||
batches = minibatch(train_data, size=compounding(4., 32., 1.001))
|
batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
|
||||||
for batch in batches:
|
for batch in batches:
|
||||||
texts, annotations = zip(*batch)
|
texts, annotations = zip(*batch)
|
||||||
nlp.update(texts, annotations, sgd=optimizer, drop=0.2,
|
nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
|
||||||
losses=losses)
|
|
||||||
with textcat.model.use_params(optimizer.averages):
|
with textcat.model.use_params(optimizer.averages):
|
||||||
# evaluate on the dev data split off in load_data()
|
# evaluate on the dev data split off in load_data()
|
||||||
scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
|
scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
|
||||||
print('{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}' # print a simple table
|
print(
|
||||||
.format(losses['textcat'], scores['textcat_p'],
|
"{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}".format( # print a simple table
|
||||||
scores['textcat_r'], scores['textcat_f']))
|
losses["textcat"],
|
||||||
|
scores["textcat_p"],
|
||||||
|
scores["textcat_r"],
|
||||||
|
scores["textcat_f"],
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
# test the trained model
|
# test the trained model
|
||||||
test_text = "This movie sucked"
|
test_text = "This movie sucked"
|
||||||
|
|
@@ -102,7 +112,7 @@ def load_data(limit=0, split=0.8):
|
||||||
random.shuffle(train_data)
|
random.shuffle(train_data)
|
||||||
train_data = train_data[-limit:]
|
train_data = train_data[-limit:]
|
||||||
texts, labels = zip(*train_data)
|
texts, labels = zip(*train_data)
|
||||||
cats = [{'POSITIVE': bool(y)} for y in labels]
|
cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]
|
||||||
split = int(len(train_data) * split)
|
split = int(len(train_data) * split)
|
||||||
return (texts[:split], cats[:split]), (texts[split:], cats[split:])
|
return (texts[:split], cats[:split]), (texts[split:], cats[split:])
|
||||||
|
|
||||||
|
|
@@ -118,19 +128,24 @@ def evaluate(tokenizer, textcat, texts, cats):
|
||||||
for label, score in doc.cats.items():
|
for label, score in doc.cats.items():
|
||||||
if label not in gold:
|
if label not in gold:
|
||||||
continue
|
continue
|
||||||
|
if label == "NEGATIVE":
|
||||||
|
continue
|
||||||
if score >= 0.5 and gold[label] >= 0.5:
|
if score >= 0.5 and gold[label] >= 0.5:
|
||||||
tp += 1.
|
tp += 1.0
|
||||||
elif score >= 0.5 and gold[label] < 0.5:
|
elif score >= 0.5 and gold[label] < 0.5:
|
||||||
fp += 1.
|
fp += 1.0
|
||||||
elif score < 0.5 and gold[label] < 0.5:
|
elif score < 0.5 and gold[label] < 0.5:
|
||||||
tn += 1
|
tn += 1
|
||||||
elif score < 0.5 and gold[label] >= 0.5:
|
elif score < 0.5 and gold[label] >= 0.5:
|
||||||
fn += 1
|
fn += 1
|
||||||
precision = tp / (tp + fp)
|
precision = tp / (tp + fp)
|
||||||
recall = tp / (tp + fn)
|
recall = tp / (tp + fn)
|
||||||
|
if (precision + recall) == 0:
|
||||||
|
f_score = 0.0
|
||||||
|
else:
|
||||||
f_score = 2 * (precision * recall) / (precision + recall)
|
f_score = 2 * (precision * recall) / (precision + recall)
|
||||||
return {'textcat_p': precision, 'textcat_r': recall, 'textcat_f': f_score}
|
return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score}
|
||||||
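
A quick worked check of the formulas above, with made-up counts:

tp, fp, fn = 6.0, 2.0, 3.0
precision = tp / (tp + fp)  # 0.75
recall = tp / (tp + fn)  # ~0.667
f_score = 2 * (precision * recall) / (precision + recall)  # ~0.706
print(precision, recall, f_score)

The guard added above simply returns an F-score of 0.0 when precision and recall are both zero, which would otherwise divide by zero.
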
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == "__main__":
|
||||||
plac.call(main)
|
plac.call(main)
|
||||||
|
|
|
||||||
|
|
@@ -14,8 +14,13 @@ from spacy.language import Language
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
vectors_loc=("Path to .vec file", "positional", None, str),
|
vectors_loc=("Path to .vec file", "positional", None, str),
|
||||||
lang=("Optional language ID. If not set, blank Language() will be used.",
|
lang=(
|
||||||
"positional", None, str))
|
"Optional language ID. If not set, blank Language() will be used.",
|
||||||
|
"positional",
|
||||||
|
None,
|
||||||
|
str,
|
||||||
|
),
|
||||||
|
)
|
||||||
def main(vectors_loc, lang=None):
|
def main(vectors_loc, lang=None):
|
||||||
if lang is None:
|
if lang is None:
|
||||||
nlp = Language()
|
nlp = Language()
|
||||||
|
|
@@ -24,21 +29,21 @@ def main(vectors_loc, lang=None):
|
||||||
# save the model to disk and load it back later (models always need a
|
# save the model to disk and load it back later (models always need a
|
||||||
# "lang" setting). Use 'xx' for blank multi-language class.
|
# "lang" setting). Use 'xx' for blank multi-language class.
|
||||||
nlp = spacy.blank(lang)
|
nlp = spacy.blank(lang)
|
||||||
with open(vectors_loc, 'rb') as file_:
|
with open(vectors_loc, "rb") as file_:
|
||||||
header = file_.readline()
|
header = file_.readline()
|
||||||
nr_row, nr_dim = header.split()
|
nr_row, nr_dim = header.split()
|
||||||
nlp.vocab.reset_vectors(width=int(nr_dim))
|
nlp.vocab.reset_vectors(width=int(nr_dim))
|
||||||
for line in file_:
|
for line in file_:
|
||||||
line = line.rstrip().decode('utf8')
|
line = line.rstrip().decode("utf8")
|
||||||
pieces = line.rsplit(' ', int(nr_dim))
|
pieces = line.rsplit(" ", int(nr_dim))
|
||||||
word = pieces[0]
|
word = pieces[0]
|
||||||
vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
|
vector = numpy.asarray([float(v) for v in pieces[1:]], dtype="f")
|
||||||
nlp.vocab.set_vector(word, vector) # add the vectors to the vocab
|
nlp.vocab.set_vector(word, vector) # add the vectors to the vocab
|
||||||
# test the vectors and similarity
|
# test the vectors and similarity
|
||||||
text = 'class colspan'
|
text = "class colspan"
|
||||||
doc = nlp(text)
|
doc = nlp(text)
|
||||||
print(text, doc[0].similarity(doc[1]))
|
print(text, doc[0].similarity(doc[1]))
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == "__main__":
|
||||||
plac.call(main)
|
plac.call(main)
|
||||||
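
The loop above assumes the plain-text .vec format written by fastText and word2vec text exports: a header line with the number of rows and the vector width, then one line per word with whitespace-separated values. A tiny self-contained parse with made-up values (the same rsplit trick keeps the last nr_dim fields as the vector, so the word part may itself contain spaces):

import numpy

sample = b"2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n"  # miniature made-up .vec file
lines = sample.decode("utf8").strip().split("\n")
nr_row, nr_dim = (int(x) for x in lines[0].split())
for line in lines[1:]:
    pieces = line.rsplit(" ", nr_dim)  # last nr_dim fields are the vector
    word = pieces[0]
    vector = numpy.asarray([float(v) for v in pieces[1:]], dtype="f")
    print(word, vector.shape)  # hello (3,) then world (3,)
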
|
|
|
||||||
|
|
@@ -14,26 +14,45 @@ import plac
|
||||||
import spacy
|
import spacy
|
||||||
import tensorflow as tf
|
import tensorflow as tf
|
||||||
import tqdm
|
import tqdm
|
||||||
from tensorflow.contrib.tensorboard.plugins.projector import visualize_embeddings, ProjectorConfig
|
from tensorflow.contrib.tensorboard.plugins.projector import (
|
||||||
|
visualize_embeddings,
|
||||||
|
ProjectorConfig,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
vectors_loc=("Path to spaCy model that contains vectors", "positional", None, str),
|
vectors_loc=("Path to spaCy model that contains vectors", "positional", None, str),
|
||||||
out_loc=("Path to output folder for tensorboard session data", "positional", None, str),
|
out_loc=(
|
||||||
name=("Human readable name for tsv file and vectors tensor", "positional", None, str),
|
"Path to output folder for tensorboard session data",
|
||||||
|
"positional",
|
||||||
|
None,
|
||||||
|
str,
|
||||||
|
),
|
||||||
|
name=(
|
||||||
|
"Human readable name for tsv file and vectors tensor",
|
||||||
|
"positional",
|
||||||
|
None,
|
||||||
|
str,
|
||||||
|
),
|
||||||
)
|
)
|
||||||
def main(vectors_loc, out_loc, name="spaCy_vectors"):
|
def main(vectors_loc, out_loc, name="spaCy_vectors"):
|
||||||
meta_file = "{}.tsv".format(name)
|
meta_file = "{}.tsv".format(name)
|
||||||
out_meta_file = path.join(out_loc, meta_file)
|
out_meta_file = path.join(out_loc, meta_file)
|
||||||
|
|
||||||
print('Loading spaCy vectors model: {}'.format(vectors_loc))
|
print("Loading spaCy vectors model: {}".format(vectors_loc))
|
||||||
model = spacy.load(vectors_loc)
|
model = spacy.load(vectors_loc)
|
||||||
print('Finding lexemes with vectors attached: {}'.format(vectors_loc))
|
print("Finding lexemes with vectors attached: {}".format(vectors_loc))
|
||||||
strings_stream = tqdm.tqdm(model.vocab.strings, total=len(model.vocab.strings), leave=False)
|
strings_stream = tqdm.tqdm(
|
||||||
|
model.vocab.strings, total=len(model.vocab.strings), leave=False
|
||||||
|
)
|
||||||
queries = [w for w in strings_stream if model.vocab.has_vector(w)]
|
queries = [w for w in strings_stream if model.vocab.has_vector(w)]
|
||||||
vector_count = len(queries)
|
vector_count = len(queries)
|
||||||
|
|
||||||
print('Building Tensorboard Projector metadata for ({}) vectors: {}'.format(vector_count, out_meta_file))
|
print(
|
||||||
|
"Building Tensorboard Projector metadata for ({}) vectors: {}".format(
|
||||||
|
vector_count, out_meta_file
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
# Store vector data in a tensorflow variable
|
# Store vector data in a tensorflow variable
|
||||||
tf_vectors_variable = numpy.zeros((vector_count, model.vocab.vectors.shape[1]))
|
tf_vectors_variable = numpy.zeros((vector_count, model.vocab.vectors.shape[1]))
|
||||||
|
|
@@ -41,22 +60,26 @@ def main(vectors_loc, out_loc, name="spaCy_vectors"):
|
||||||
# Write a tab-separated file that contains information about the vectors for visualization
|
# Write a tab-separated file that contains information about the vectors for visualization
|
||||||
#
|
#
|
||||||
# Reference: https://www.tensorflow.org/programmers_guide/embedding#metadata
|
# Reference: https://www.tensorflow.org/programmers_guide/embedding#metadata
|
||||||
with open(out_meta_file, 'wb') as file_metadata:
|
with open(out_meta_file, "wb") as file_metadata:
|
||||||
# Define columns in the first row
|
# Define columns in the first row
|
||||||
file_metadata.write("Text\tFrequency\n".encode('utf-8'))
|
file_metadata.write("Text\tFrequency\n".encode("utf-8"))
|
||||||
# Write out a row for each vector that we add to the tensorflow variable we created
|
# Write out a row for each vector that we add to the tensorflow variable we created
|
||||||
vec_index = 0
|
vec_index = 0
|
||||||
for text in tqdm.tqdm(queries, total=len(queries), leave=False):
|
for text in tqdm.tqdm(queries, total=len(queries), leave=False):
|
||||||
# https://github.com/tensorflow/tensorflow/issues/9094
|
# https://github.com/tensorflow/tensorflow/issues/9094
|
||||||
text = '<Space>' if text.lstrip() == '' else text
|
text = "<Space>" if text.lstrip() == "" else text
|
||||||
lex = model.vocab[text]
|
lex = model.vocab[text]
|
||||||
|
|
||||||
# Store vector data and metadata
|
# Store vector data and metadata
|
||||||
tf_vectors_variable[vec_index] = model.vocab.get_vector(text)
|
tf_vectors_variable[vec_index] = model.vocab.get_vector(text)
|
||||||
file_metadata.write("{}\t{}\n".format(text, math.exp(lex.prob) * vector_count).encode('utf-8'))
|
file_metadata.write(
|
||||||
|
"{}\t{}\n".format(text, math.exp(lex.prob) * vector_count).encode(
|
||||||
|
"utf-8"
|
||||||
|
)
|
||||||
|
)
|
||||||
vec_index += 1
|
vec_index += 1
|
||||||
|
|
||||||
print('Running Tensorflow Session...')
|
print("Running Tensorflow Session...")
|
||||||
sess = tf.InteractiveSession()
|
sess = tf.InteractiveSession()
|
||||||
tf.Variable(tf_vectors_variable, trainable=False, name=name)
|
tf.Variable(tf_vectors_variable, trainable=False, name=name)
|
||||||
tf.global_variables_initializer().run()
|
tf.global_variables_initializer().run()
|
||||||
|
|
@@ -73,10 +96,10 @@ def main(vectors_loc, out_loc, name="spaCy_vectors"):
|
||||||
visualize_embeddings(writer, config)
|
visualize_embeddings(writer, config)
|
||||||
|
|
||||||
# Save session and print run command to the output
|
# Save session and print run command to the output
|
||||||
print('Saving Tensorboard Session...')
|
print("Saving Tensorboard Session...")
|
||||||
saver.save(sess, path.join(out_loc, '{}.ckpt'.format(name)))
|
saver.save(sess, path.join(out_loc, "{}.ckpt".format(name)))
|
||||||
print('Done. Run `tensorboard --logdir={0}` to view in Tensorboard'.format(out_loc))
|
print("Done. Run `tensorboard --logdir={0}` to view in Tensorboard".format(out_loc))
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == "__main__":
|
||||||
plac.call(main)
|
plac.call(main)
|
||||||
|
|
|
||||||
|
|
@@ -1,88 +0,0 @@
|
||||||
#!/usr/bin/env python
|
|
||||||
# coding: utf8
|
|
||||||
"""Export spaCy model vectors for use in TensorBoard's standalone embedding projector.
|
|
||||||
https://github.com/tensorflow/embedding-projector-standalone
|
|
||||||
|
|
||||||
Usage:
|
|
||||||
|
|
||||||
python vectors_tensorboard_standalone.py ./myVectorModel ./output [name]
|
|
||||||
|
|
||||||
This outputs two files that have to be copied into the "oss_data" of the standalone projector:
|
|
||||||
|
|
||||||
[name]_labels.tsv - metadata such as human readable labels for vectors
|
|
||||||
[name]_tensors.bytes - numpy.ndarray of numpy.float32 precision vectors
|
|
||||||
|
|
||||||
"""
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import json
|
|
||||||
import math
|
|
||||||
from os import path
|
|
||||||
|
|
||||||
import numpy
|
|
||||||
import plac
|
|
||||||
import spacy
|
|
||||||
import tqdm
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
|
||||||
vectors_loc=("Path to spaCy model that contains vectors", "positional", None, str),
|
|
||||||
out_loc=("Path to output folder writing tensors and labels data", "positional", None, str),
|
|
||||||
name=("Human readable name for tsv file and vectors tensor", "positional", None, str),
|
|
||||||
)
|
|
||||||
def main(vectors_loc, out_loc, name="spaCy_vectors"):
|
|
||||||
# A tab-separated file that contains information about the vectors for visualization
|
|
||||||
#
|
|
||||||
# Learn more: https://www.tensorflow.org/programmers_guide/embedding#metadata
|
|
||||||
meta_file = "{}_labels.tsv".format(name)
|
|
||||||
out_meta_file = path.join(out_loc, meta_file)
|
|
||||||
|
|
||||||
print('Loading spaCy vectors model: {}'.format(vectors_loc))
|
|
||||||
model = spacy.load(vectors_loc)
|
|
||||||
|
|
||||||
print('Finding lexemes with vectors attached: {}'.format(vectors_loc))
|
|
||||||
voacb_strings = [
|
|
||||||
w for w in tqdm.tqdm(model.vocab.strings, total=len(model.vocab.strings), leave=False)
|
|
||||||
if model.vocab.has_vector(w)
|
|
||||||
]
|
|
||||||
vector_count = len(voacb_strings)
|
|
||||||
|
|
||||||
print('Building Projector labels for {} vectors: {}'.format(vector_count, out_meta_file))
|
|
||||||
vector_dimensions = model.vocab.vectors.shape[1]
|
|
||||||
tf_vectors_variable = numpy.zeros((vector_count, vector_dimensions), dtype=numpy.float32)
|
|
||||||
|
|
||||||
# Write a tab-separated file that contains information about the vectors for visualization
|
|
||||||
#
|
|
||||||
# Reference: https://www.tensorflow.org/programmers_guide/embedding#metadata
|
|
||||||
with open(out_meta_file, 'wb') as file_metadata:
|
|
||||||
# Define columns in the first row
|
|
||||||
file_metadata.write("Text\tFrequency\n".encode('utf-8'))
|
|
||||||
# Write out a row for each vector that we add to the tensorflow variable we created
|
|
||||||
vec_index = 0
|
|
||||||
|
|
||||||
for text in tqdm.tqdm(voacb_strings, total=len(voacb_strings), leave=False):
|
|
||||||
# https://github.com/tensorflow/tensorflow/issues/9094
|
|
||||||
text = '<Space>' if text.lstrip() == '' else text
|
|
||||||
lex = model.vocab[text]
|
|
||||||
|
|
||||||
# Store vector data and metadata
|
|
||||||
tf_vectors_variable[vec_index] = numpy.float64(model.vocab.get_vector(text))
|
|
||||||
file_metadata.write("{}\t{}\n".format(text, math.exp(lex.prob) * len(voacb_strings)).encode('utf-8'))
|
|
||||||
vec_index += 1
|
|
||||||
|
|
||||||
# Write out "[name]_tensors.bytes" file for standalone embeddings projector to load
|
|
||||||
tensor_path = '{}_tensors.bytes'.format(name)
|
|
||||||
tf_vectors_variable.tofile(path.join(out_loc, tensor_path))
|
|
||||||
|
|
||||||
print('Done.')
|
|
||||||
print('Add the following entry to "oss_data/oss_demo_projector_config.json"')
|
|
||||||
print(json.dumps({
|
|
||||||
"tensorName": name,
|
|
||||||
"tensorShape": [vector_count, vector_dimensions],
|
|
||||||
"tensorPath": 'oss_data/{}'.format(tensor_path),
|
|
||||||
"metadataPath": 'oss_data/{}'.format(meta_file)
|
|
||||||
}, indent=2))
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
|
||||||
plac.call(main)
|
|
||||||
111 fabfile.py (vendored)
@@ -1,49 +1,122 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals, print_function
|
from __future__ import unicode_literals, print_function
|
||||||
|
|
||||||
|
import contextlib
|
||||||
|
from pathlib import Path
|
||||||
from fabric.api import local, lcd, env, settings, prefix
|
from fabric.api import local, lcd, env, settings, prefix
|
||||||
from fabtools.python import virtualenv
|
|
||||||
from os import path, environ
|
from os import path, environ
|
||||||
|
import shutil
|
||||||
|
import sys
|
||||||
|
|
||||||
|
|
||||||
PWD = path.dirname(__file__)
|
PWD = path.dirname(__file__)
|
||||||
ENV = environ['VENV_DIR'] if 'VENV_DIR' in environ else '.env'
|
ENV = environ['VENV_DIR'] if 'VENV_DIR' in environ else '.env'
|
||||||
VENV_DIR = path.join(PWD, ENV)
|
VENV_DIR = Path(PWD) / ENV
|
||||||
|
|
||||||
|
|
||||||
def env(lang='python2.7'):
|
@contextlib.contextmanager
|
||||||
if path.exists(VENV_DIR):
|
def virtualenv(name, create=False, python='/usr/bin/python3.6'):
|
||||||
|
python = Path(python).resolve()
|
||||||
|
env_path = VENV_DIR
|
||||||
|
if create:
|
||||||
|
if env_path.exists():
|
||||||
|
shutil.rmtree(str(env_path))
|
||||||
|
local('{python} -m venv {env_path}'.format(python=python, env_path=VENV_DIR))
|
||||||
|
def wrapped_local(cmd, env_vars=[], capture=False, direct=False):
|
||||||
|
return local('source {}/bin/activate && {}'.format(env_path, cmd),
|
||||||
|
shell='/bin/bash', capture=False)
|
||||||
|
yield wrapped_local
|
||||||
|
|
||||||
|
|
||||||
|
def env(lang='python3.6'):
|
||||||
|
if VENV_DIR.exists():
|
||||||
local('rm -rf {env}'.format(env=VENV_DIR))
|
local('rm -rf {env}'.format(env=VENV_DIR))
|
||||||
local('pip install virtualenv')
|
if lang.startswith('python3'):
|
||||||
local('python -m virtualenv -p {lang} {env}'.format(lang=lang, env=VENV_DIR))
|
local('{lang} -m venv {env}'.format(lang=lang, env=VENV_DIR))
|
||||||
|
else:
|
||||||
|
local('{lang} -m pip install virtualenv --no-cache-dir'.format(lang=lang))
|
||||||
|
local('{lang} -m virtualenv {env} --no-cache-dir'.format(lang=lang, env=VENV_DIR))
|
||||||
|
with virtualenv(VENV_DIR) as venv_local:
|
||||||
|
print(venv_local('python --version', capture=True))
|
||||||
|
venv_local('pip install --upgrade setuptools --no-cache-dir')
|
||||||
|
venv_local('pip install pytest --no-cache-dir')
|
||||||
|
venv_local('pip install wheel --no-cache-dir')
|
||||||
|
venv_local('pip install -r requirements.txt --no-cache-dir')
|
||||||
|
venv_local('pip install pex --no-cache-dir')
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
def install():
|
def install():
|
||||||
with virtualenv(VENV_DIR):
|
with virtualenv(VENV_DIR) as venv_local:
|
||||||
local('pip install --upgrade setuptools')
|
venv_local('pip install dist/*.tar.gz')
|
||||||
local('pip install dist/*.tar.gz')
|
|
||||||
local('pip install pytest')
|
|
||||||
|
|
||||||
|
|
||||||
def make():
|
def make():
|
||||||
with virtualenv(VENV_DIR):
|
|
||||||
with lcd(path.dirname(__file__)):
|
with lcd(path.dirname(__file__)):
|
||||||
local('pip install cython')
|
local('export PYTHONPATH=`pwd` && source .env/bin/activate && python setup.py build_ext --inplace',
|
||||||
local('pip install murmurhash')
|
shell='/bin/bash')
|
||||||
local('pip install -r requirements.txt')
|
|
||||||
local('python setup.py build_ext --inplace')
|
|
||||||
|
|
||||||
def sdist():
|
def sdist():
|
||||||
with virtualenv(VENV_DIR):
|
with virtualenv(VENV_DIR) as venv_local:
|
||||||
with lcd(path.dirname(__file__)):
|
with lcd(path.dirname(__file__)):
|
||||||
|
local('python -m pip install -U setuptools')
|
||||||
local('python setup.py sdist')
|
local('python setup.py sdist')
|
||||||
|
|
||||||
|
def wheel():
|
||||||
|
with virtualenv(VENV_DIR) as venv_local:
|
||||||
|
with lcd(path.dirname(__file__)):
|
||||||
|
venv_local('python setup.py bdist_wheel')
|
||||||
|
|
||||||
|
def pex():
|
||||||
|
with virtualenv(VENV_DIR) as venv_local:
|
||||||
|
with lcd(path.dirname(__file__)):
|
||||||
|
sha = local('git rev-parse --short HEAD', capture=True)
|
||||||
|
venv_local('pex dist/*.whl -e spacy -o dist/spacy-%s.pex' % sha,
|
||||||
|
direct=True)
|
||||||
|
|
||||||
|
|
||||||
def clean():
|
def clean():
|
||||||
with lcd(path.dirname(__file__)):
|
with lcd(path.dirname(__file__)):
|
||||||
local('python setup.py clean --all')
|
local('rm -f dist/*.whl')
|
||||||
|
local('rm -f dist/*.pex')
|
||||||
|
with virtualenv(VENV_DIR) as venv_local:
|
||||||
|
venv_local('python setup.py clean --all')
|
||||||
|
|
||||||
|
|
||||||
def test():
|
def test():
|
||||||
with virtualenv(VENV_DIR):
|
with virtualenv(VENV_DIR) as venv_local:
|
||||||
with lcd(path.dirname(__file__)):
|
with lcd(path.dirname(__file__)):
|
||||||
local('py.test -x spacy/tests')
|
venv_local('pytest -x spacy/tests')
|
||||||
|
|
||||||
|
def train():
|
||||||
|
args = environ.get('SPACY_TRAIN_ARGS', '')
|
||||||
|
with virtualenv(VENV_DIR) as venv_local:
|
||||||
|
venv_local('spacy train {args}'.format(args=args))
|
||||||
|
|
||||||
|
|
||||||
|
def conll17(treebank_dir, experiment_dir, vectors_dir, config, corpus=''):
|
||||||
|
is_not_clean = local('git status --porcelain', capture=True)
|
||||||
|
if is_not_clean:
|
||||||
|
print("Repository is not clean")
|
||||||
|
print(is_not_clean)
|
||||||
|
sys.exit(1)
|
||||||
|
git_sha = local('git rev-parse --short HEAD', capture=True)
|
||||||
|
config_checksum = local('sha256sum {config}'.format(config=config), capture=True)
|
||||||
|
experiment_dir = Path(experiment_dir) / '{}--{}'.format(config_checksum[:6], git_sha)
|
||||||
|
if not experiment_dir.exists():
|
||||||
|
experiment_dir.mkdir()
|
||||||
|
test_data_dir = Path(treebank_dir) / 'ud-test-v2.0-conll2017'
|
||||||
|
assert test_data_dir.exists()
|
||||||
|
assert test_data_dir.is_dir()
|
||||||
|
if corpus:
|
||||||
|
corpora = [corpus]
|
||||||
|
else:
|
||||||
|
corpora = ['UD_English', 'UD_Chinese', 'UD_Japanese', 'UD_Vietnamese']
|
||||||
|
|
||||||
|
local('cp {config} {experiment_dir}/config.json'.format(config=config, experiment_dir=experiment_dir))
|
||||||
|
with virtualenv(VENV_DIR) as venv_local:
|
||||||
|
for corpus in corpora:
|
||||||
|
venv_local('spacy ud-train {treebank_dir} {experiment_dir} {config} {corpus} -v {vectors_dir}'.format(
|
||||||
|
treebank_dir=treebank_dir, experiment_dir=experiment_dir, config=config, corpus=corpus, vectors_dir=vectors_dir))
|
||||||
|
venv_local('spacy ud-run-test {test_data_dir} {experiment_dir} {corpus}'.format(
|
||||||
|
test_data_dir=test_data_dir, experiment_dir=experiment_dir, config=config, corpus=corpus))
|
||||||
|
|
|
||||||
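The new fabfile drops the fabtools dependency and instead defines its own `virtualenv` context manager, which sources the env's activate script before every command it runs. Purely as a standalone illustration of that pattern (using `subprocess` in place of Fabric's `local`, with a hypothetical `VENV_DIR` default), the idea is roughly:

import contextlib
import shutil
import subprocess
from pathlib import Path

VENV_DIR = Path(".env")  # assumption: the same default the fabfile falls back to


@contextlib.contextmanager
def virtualenv(env_path, create=False, python="python3"):
    """Yield a runner that executes shell commands inside the virtualenv."""
    env_path = Path(env_path)
    if create:
        if env_path.exists():
            shutil.rmtree(str(env_path))
        subprocess.check_call([python, "-m", "venv", str(env_path)])

    def run(cmd):
        # Source the activate script, then run the command in a single bash shell.
        return subprocess.check_call(
            "source {}/bin/activate && {}".format(env_path, cmd),
            shell=True,
            executable="/bin/bash",
        )

    yield run


# Usage mirroring the fabfile's env()/test() tasks:
# with virtualenv(VENV_DIR, create=True) as venv_run:
#     venv_run("pip install -r requirements.txt --no-cache-dir")
#     venv_run("pytest -x spacy/tests")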
@@ -5,6 +5,6 @@ requires = ["setuptools",
     "cymem>=2.0.2,<2.1.0",
     "preshed>=2.0.1,<2.1.0",
     "murmurhash>=0.28.0,<1.1.0",
-    "thinc>=6.12.1,<6.13.0",
+    "thinc==7.0.0.dev6",
 ]
 build-backend = "setuptools.build_meta"
@@ -1,15 +1,20 @@
-cython>=0.24,<0.28.0
-numpy>=1.15.0
+# Our libraries
 cymem>=2.0.2,<2.1.0
 preshed>=2.0.1,<2.1.0
-thinc>=6.12.1,<6.13.0
+thinc>=7.0.2,<7.1.0
+blis>=0.2.2,<0.3.0
 murmurhash>=0.28.0,<1.1.0
-plac<1.0.0,>=0.9.6
-ujson>=1.35
-dill>=0.2,<0.3
-regex==2018.01.10
+wasabi>=0.0.12,<1.1.0
+srsly>=0.0.5,<1.1.0
+# Third party dependencies
+numpy>=1.15.0
 requests>=2.13.0,<3.0.0
-pytest>=4.0.0,<4.1.0
-mock>=2.0.0,<3.0.0
+jsonschema>=2.6.0,<3.0.0
+plac<1.0.0,>=0.9.6
 pathlib==1.0.1; python_version < "3.4"
+# Development dependencies
+cython>=0.25
+pytest>=4.0.0,<4.1.0
+pytest-timeout>=1.3.0,<2.0.0
+mock>=2.0.0,<3.0.0
 flake8>=3.5.0,<3.6.0
291 setup.py
@@ -7,83 +7,99 @@ import sys
|
||||||
import contextlib
|
import contextlib
|
||||||
from distutils.command.build_ext import build_ext
|
from distutils.command.build_ext import build_ext
|
||||||
from distutils.sysconfig import get_python_inc
|
from distutils.sysconfig import get_python_inc
|
||||||
|
import distutils.util
|
||||||
from distutils import ccompiler, msvccompiler
|
from distutils import ccompiler, msvccompiler
|
||||||
from setuptools import Extension, setup, find_packages
|
from setuptools import Extension, setup, find_packages
|
||||||
|
|
||||||
|
|
||||||
PACKAGE_DATA = {'': ['*.pyx', '*.pxd', '*.txt', '*.tokens']}
|
def is_new_osx():
|
||||||
|
"""Check whether we're on OSX >= 10.10"""
|
||||||
|
name = distutils.util.get_platform()
|
||||||
|
if sys.platform != "darwin":
|
||||||
|
return False
|
||||||
|
elif name.startswith("macosx-10"):
|
||||||
|
minor_version = int(name.split("-")[1].split(".")[1])
|
||||||
|
if minor_version >= 7:
|
||||||
|
return True
|
||||||
|
else:
|
||||||
|
return False
|
||||||
|
else:
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
PACKAGE_DATA = {"": ["*.pyx", "*.pxd", "*.txt", "*.tokens", "*.json"]}
|
||||||
|
|
||||||
|
|
||||||
PACKAGES = find_packages()
|
PACKAGES = find_packages()
|
||||||
|
|
||||||
|
|
||||||
MOD_NAMES = [
|
MOD_NAMES = [
|
||||||
'spacy.parts_of_speech',
|
"spacy._align",
|
||||||
'spacy.strings',
|
"spacy.parts_of_speech",
|
||||||
'spacy.lexeme',
|
"spacy.strings",
|
||||||
'spacy.vocab',
|
"spacy.lexeme",
|
||||||
'spacy.attrs',
|
"spacy.vocab",
|
||||||
'spacy.morphology',
|
"spacy.attrs",
|
||||||
'spacy.pipeline',
|
"spacy.morphology",
|
||||||
'spacy.syntax.stateclass',
|
"spacy.pipeline.pipes",
|
||||||
'spacy.syntax._state',
|
"spacy.syntax.stateclass",
|
||||||
'spacy.syntax._beam_utils',
|
"spacy.syntax._state",
|
||||||
'spacy.tokenizer',
|
"spacy.tokenizer",
|
||||||
'spacy.syntax.nn_parser',
|
"spacy.syntax.nn_parser",
|
||||||
'spacy.syntax.nonproj',
|
"spacy.syntax._parser_model",
|
||||||
'spacy.syntax.transition_system',
|
"spacy.syntax._beam_utils",
|
||||||
'spacy.syntax.arc_eager',
|
"spacy.syntax.nonproj",
|
||||||
'spacy.gold',
|
"spacy.syntax.transition_system",
|
||||||
'spacy.tokens.doc',
|
"spacy.syntax.arc_eager",
|
||||||
'spacy.tokens.span',
|
"spacy.gold",
|
||||||
'spacy.tokens.token',
|
"spacy.tokens.doc",
|
||||||
'spacy.tokens._retokenize',
|
"spacy.tokens.span",
|
||||||
'spacy.matcher',
|
"spacy.tokens.token",
|
||||||
'spacy.syntax.ner',
|
"spacy.tokens._retokenize",
|
||||||
'spacy.symbols',
|
"spacy.matcher.matcher",
|
||||||
'spacy.vectors',
|
"spacy.matcher.phrasematcher",
|
||||||
|
"spacy.matcher.dependencymatcher",
|
||||||
|
"spacy.syntax.ner",
|
||||||
|
"spacy.symbols",
|
||||||
|
"spacy.vectors",
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
COMPILE_OPTIONS = {
|
COMPILE_OPTIONS = {
|
||||||
'msvc': ['/Ox', '/EHsc'],
|
"msvc": ["/Ox", "/EHsc"],
|
||||||
'mingw32' : ['-O2', '-Wno-strict-prototypes', '-Wno-unused-function'],
|
"mingw32": ["-O2", "-Wno-strict-prototypes", "-Wno-unused-function"],
|
||||||
'other' : ['-O2', '-Wno-strict-prototypes', '-Wno-unused-function']
|
"other": ["-O2", "-Wno-strict-prototypes", "-Wno-unused-function"],
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
LINK_OPTIONS = {
|
LINK_OPTIONS = {"msvc": [], "mingw32": [], "other": []}
|
||||||
'msvc' : [],
|
|
||||||
'mingw32': [],
|
|
||||||
'other' : []
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
# I don't understand this very well yet. See Issue #267
|
if is_new_osx():
|
||||||
# Fingers crossed!
|
|
||||||
USE_OPENMP_DEFAULT = '0' if sys.platform != 'darwin' else None
|
|
||||||
if os.environ.get('USE_OPENMP', USE_OPENMP_DEFAULT) == '1':
|
|
||||||
if sys.platform == 'darwin':
|
|
||||||
COMPILE_OPTIONS['other'].append('-fopenmp')
|
|
||||||
LINK_OPTIONS['other'].append('-fopenmp')
|
|
||||||
PACKAGE_DATA['spacy.platform.darwin.lib'] = ['*.dylib']
|
|
||||||
PACKAGES.append('spacy.platform.darwin.lib')
|
|
||||||
|
|
||||||
elif sys.platform == 'win32':
|
|
||||||
COMPILE_OPTIONS['msvc'].append('/openmp')
|
|
||||||
|
|
||||||
else:
|
|
||||||
COMPILE_OPTIONS['other'].append('-fopenmp')
|
|
||||||
LINK_OPTIONS['other'].append('-fopenmp')
|
|
||||||
|
|
||||||
if sys.platform == 'darwin':
|
|
||||||
# On Mac, use libc++ because Apple deprecated use of
|
# On Mac, use libc++ because Apple deprecated use of
|
||||||
# libstdc
|
# libstdc
|
||||||
COMPILE_OPTIONS['other'].append('-stdlib=libc++')
|
COMPILE_OPTIONS["other"].append("-stdlib=libc++")
|
||||||
LINK_OPTIONS['other'].append('-lc++')
|
LINK_OPTIONS["other"].append("-lc++")
|
||||||
# g++ (used by unix compiler on mac) links to libstdc++ as a default lib.
|
# g++ (used by unix compiler on mac) links to libstdc++ as a default lib.
|
||||||
# See: https://stackoverflow.com/questions/1653047/avoid-linking-to-libstdc
|
# See: https://stackoverflow.com/questions/1653047/avoid-linking-to-libstdc
|
||||||
LINK_OPTIONS['other'].append('-nodefaultlibs')
|
LINK_OPTIONS["other"].append("-nodefaultlibs")
|
||||||
|
|
||||||
|
|
||||||
|
USE_OPENMP_DEFAULT = "0" if sys.platform != "darwin" else None
|
||||||
|
if os.environ.get("USE_OPENMP", USE_OPENMP_DEFAULT) == "1":
|
||||||
|
if sys.platform == "darwin":
|
||||||
|
COMPILE_OPTIONS["other"].append("-fopenmp")
|
||||||
|
LINK_OPTIONS["other"].append("-fopenmp")
|
||||||
|
PACKAGE_DATA["spacy.platform.darwin.lib"] = ["*.dylib"]
|
||||||
|
PACKAGES.append("spacy.platform.darwin.lib")
|
||||||
|
|
||||||
|
elif sys.platform == "win32":
|
||||||
|
COMPILE_OPTIONS["msvc"].append("/openmp")
|
||||||
|
|
||||||
|
else:
|
||||||
|
COMPILE_OPTIONS["other"].append("-fopenmp")
|
||||||
|
LINK_OPTIONS["other"].append("-fopenmp")
|
||||||
|
|
||||||
|
|
||||||
# By subclassing build_extensions we have the actual compiler that will be used which is really known only after finalize_options
|
# By subclassing build_extensions we have the actual compiler that will be used which is really known only after finalize_options
|
||||||
# http://stackoverflow.com/questions/724664/python-distutils-how-to-get-a-compiler-that-is-going-to-be-used
|
# http://stackoverflow.com/questions/724664/python-distutils-how-to-get-a-compiler-that-is-going-to-be-used
|
||||||
|
|
@ -91,10 +107,12 @@ class build_ext_options:
|
||||||
def build_options(self):
|
def build_options(self):
|
||||||
for e in self.extensions:
|
for e in self.extensions:
|
||||||
e.extra_compile_args += COMPILE_OPTIONS.get(
|
e.extra_compile_args += COMPILE_OPTIONS.get(
|
||||||
self.compiler.compiler_type, COMPILE_OPTIONS['other'])
|
self.compiler.compiler_type, COMPILE_OPTIONS["other"]
|
||||||
|
)
|
||||||
for e in self.extensions:
|
for e in self.extensions:
|
||||||
e.extra_link_args += LINK_OPTIONS.get(
|
e.extra_link_args += LINK_OPTIONS.get(
|
||||||
self.compiler.compiler_type, LINK_OPTIONS['other'])
|
self.compiler.compiler_type, LINK_OPTIONS["other"]
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
class build_ext_subclass(build_ext, build_ext_options):
|
class build_ext_subclass(build_ext, build_ext_options):
|
||||||
|
|
@ -104,22 +122,23 @@ class build_ext_subclass(build_ext, build_ext_options):
|
||||||
|
|
||||||
|
|
||||||
def generate_cython(root, source):
|
def generate_cython(root, source):
|
||||||
print('Cythonizing sources')
|
print("Cythonizing sources")
|
||||||
p = subprocess.call([sys.executable,
|
p = subprocess.call(
|
||||||
os.path.join(root, 'bin', 'cythonize.py'),
|
[sys.executable, os.path.join(root, "bin", "cythonize.py"), source],
|
||||||
source], env=os.environ)
|
env=os.environ,
|
||||||
|
)
|
||||||
if p != 0:
|
if p != 0:
|
||||||
raise RuntimeError('Running cythonize failed')
|
raise RuntimeError("Running cythonize failed")
|
||||||
|
|
||||||
|
|
||||||
def is_source_release(path):
|
def is_source_release(path):
|
||||||
return os.path.exists(os.path.join(path, 'PKG-INFO'))
|
return os.path.exists(os.path.join(path, "PKG-INFO"))
|
||||||
|
|
||||||
|
|
||||||
def clean(path):
|
def clean(path):
|
||||||
for name in MOD_NAMES:
|
for name in MOD_NAMES:
|
||||||
name = name.replace('.', '/')
|
name = name.replace(".", "/")
|
||||||
for ext in ['.so', '.html', '.cpp', '.c']:
|
for ext in [".so", ".html", ".cpp", ".c"]:
|
||||||
file_path = os.path.join(path, name + ext)
|
file_path = os.path.join(path, name + ext)
|
||||||
if os.path.exists(file_path):
|
if os.path.exists(file_path):
|
||||||
os.unlink(file_path)
|
os.unlink(file_path)
|
||||||
|
|
@ -140,109 +159,117 @@ def chdir(new_dir):
|
||||||
def setup_package():
|
def setup_package():
|
||||||
root = os.path.abspath(os.path.dirname(__file__))
|
root = os.path.abspath(os.path.dirname(__file__))
|
||||||
|
|
||||||
if len(sys.argv) > 1 and sys.argv[1] == 'clean':
|
if len(sys.argv) > 1 and sys.argv[1] == "clean":
|
||||||
return clean(root)
|
return clean(root)
|
||||||
|
|
||||||
with chdir(root):
|
with chdir(root):
|
||||||
with io.open(os.path.join(root, 'spacy', 'about.py'), encoding='utf8') as f:
|
with io.open(os.path.join(root, "spacy", "about.py"), encoding="utf8") as f:
|
||||||
about = {}
|
about = {}
|
||||||
exec(f.read(), about)
|
exec(f.read(), about)
|
||||||
|
|
||||||
with io.open(os.path.join(root, 'README.rst'), encoding='utf8') as f:
|
with io.open(os.path.join(root, "README.md"), encoding="utf8") as f:
|
||||||
readme = f.read()
|
readme = f.read()
|
||||||
|
|
||||||
include_dirs = [
|
include_dirs = [
|
||||||
get_python_inc(plat_specific=True),
|
get_python_inc(plat_specific=True),
|
||||||
os.path.join(root, 'include')]
|
os.path.join(root, "include"),
|
||||||
|
]
|
||||||
|
|
||||||
if (ccompiler.new_compiler().compiler_type == 'msvc'
|
if (
|
||||||
and msvccompiler.get_build_version() == 9):
|
ccompiler.new_compiler().compiler_type == "msvc"
|
||||||
include_dirs.append(os.path.join(root, 'include', 'msvc9'))
|
and msvccompiler.get_build_version() == 9
|
||||||
|
):
|
||||||
|
include_dirs.append(os.path.join(root, "include", "msvc9"))
|
||||||
|
|
||||||
ext_modules = []
|
ext_modules = []
|
||||||
for mod_name in MOD_NAMES:
|
for mod_name in MOD_NAMES:
|
||||||
mod_path = mod_name.replace('.', '/') + '.cpp'
|
mod_path = mod_name.replace(".", "/") + ".cpp"
|
||||||
extra_link_args = []
|
extra_link_args = []
|
||||||
extra_compile_args = []
|
extra_compile_args = []
|
||||||
# ???
|
# ???
|
||||||
# Imported from patch from @mikepb
|
# Imported from patch from @mikepb
|
||||||
# See Issue #267. Running blind here...
|
# See Issue #267. Running blind here...
|
||||||
if sys.platform == 'darwin':
|
if sys.platform == "darwin":
|
||||||
dylib_path = ['..' for _ in range(mod_name.count('.'))]
|
dylib_path = [".." for _ in range(mod_name.count("."))]
|
||||||
dylib_path = '/'.join(dylib_path)
|
dylib_path = "/".join(dylib_path)
|
||||||
dylib_path = '@loader_path/%s/spacy/platform/darwin/lib' % dylib_path
|
dylib_path = "@loader_path/%s/spacy/platform/darwin/lib" % dylib_path
|
||||||
extra_link_args.append('-Wl,-rpath,%s' % dylib_path)
|
extra_link_args.append("-Wl,-rpath,%s" % dylib_path)
|
||||||
# Try to fix OSX 10.7 problem. Running blind here too.
|
|
||||||
extra_compile_args.append('-std=c++11')
|
|
||||||
extra_link_args.append('-std=c++11')
|
|
||||||
ext_modules.append(
|
ext_modules.append(
|
||||||
Extension(mod_name, [mod_path],
|
Extension(
|
||||||
language='c++', include_dirs=include_dirs,
|
mod_name,
|
||||||
|
[mod_path],
|
||||||
|
language="c++",
|
||||||
|
include_dirs=include_dirs,
|
||||||
extra_link_args=extra_link_args,
|
extra_link_args=extra_link_args,
|
||||||
extra_compile_args=extra_compile_args))
|
)
|
||||||
|
)
|
||||||
|
|
||||||
if not is_source_release(root):
|
if not is_source_release(root):
|
||||||
generate_cython(root, 'spacy')
|
generate_cython(root, "spacy")
|
||||||
|
|
||||||
setup(
|
setup(
|
||||||
name=about['__title__'],
|
name=about["__title__"],
|
||||||
zip_safe=False,
|
zip_safe=False,
|
||||||
packages=PACKAGES,
|
packages=PACKAGES,
|
||||||
package_data=PACKAGE_DATA,
|
package_data=PACKAGE_DATA,
|
||||||
description=about['__summary__'],
|
description=about["__summary__"],
|
||||||
long_description=readme,
|
long_description=readme,
|
||||||
author=about['__author__'],
|
long_description_content_type="text/markdown",
|
||||||
author_email=about['__email__'],
|
author=about["__author__"],
|
||||||
version=about['__version__'],
|
author_email=about["__email__"],
|
||||||
url=about['__uri__'],
|
version=about["__version__"],
|
||||||
license=about['__license__'],
|
url=about["__uri__"],
|
||||||
|
license=about["__license__"],
|
||||||
ext_modules=ext_modules,
|
ext_modules=ext_modules,
|
||||||
scripts=['bin/spacy'],
|
scripts=["bin/spacy"],
|
||||||
setup_requires=['wheel>=0.32.0,<0.33.0'],
|
|
||||||
install_requires=[
|
install_requires=[
|
||||||
'numpy>=1.15.0',
|
"numpy>=1.15.0",
|
||||||
'murmurhash>=0.28.0,<1.1.0',
|
"murmurhash>=0.28.0,<1.1.0",
|
||||||
'cymem>=2.0.2,<2.1.0',
|
"cymem>=2.0.2,<2.1.0",
|
||||||
'preshed>=2.0.1,<2.1.0',
|
"preshed>=2.0.1,<2.1.0",
|
||||||
'thinc>=6.12.1,<6.13.0',
|
"thinc>=7.0.2,<7.1.0",
|
||||||
'plac<1.0.0,>=0.9.6',
|
"blis>=0.2.2,<0.3.0",
|
||||||
'ujson>=1.35',
|
"plac<1.0.0,>=0.9.6",
|
||||||
'dill>=0.2,<0.3',
|
"requests>=2.13.0,<3.0.0",
|
||||||
'regex==2018.01.10',
|
"jsonschema>=2.6.0,<3.0.0",
|
||||||
'requests>=2.13.0,<3.0.0',
|
"wasabi>=0.0.12,<1.1.0",
|
||||||
'pathlib==1.0.1; python_version < "3.4"'],
|
"srsly>=0.0.5,<1.1.0",
|
||||||
|
'pathlib==1.0.1; python_version < "3.4"',
|
||||||
|
],
|
||||||
|
setup_requires=["wheel"],
|
||||||
extras_require={
|
extras_require={
|
||||||
'cuda': ['cupy>=4.0'],
|
"cuda": ["cupy>=4.0"],
|
||||||
'cuda80': ['cupy-cuda80>=4.0', 'thinc_gpu_ops>=0.0.3,<0.1.0'],
|
"cuda80": ["cupy-cuda80>=4.0"],
|
||||||
'cuda90': ['cupy-cuda90>=4.0', 'thinc_gpu_ops>=0.0.3,<0.1.0'],
|
"cuda90": ["cupy-cuda90>=4.0"],
|
||||||
'cuda91': ['cupy-cuda91>=4.0', 'thinc_gpu_ops>=0.0.3,<0.1.0'],
|
"cuda91": ["cupy-cuda91>=4.0"],
|
||||||
'cuda92': ['cupy-cuda92>=4.0', 'thinc_gpu_ops>=0.0.3,<0.1.0'],
|
"cuda92": ["cupy-cuda92>=4.0"],
|
||||||
'cuda100': ['cupy-cuda100>=4.0', 'thinc_gpu_ops>=0.0.3,<0.1.0'],
|
"cuda100": ["cupy-cuda100>=4.0"],
|
||||||
'ja': ['mecab-python3==0.7']
|
# Language tokenizers with external dependencies
|
||||||
|
"ja": ["mecab-python3==0.7"],
|
||||||
},
|
},
|
||||||
python_requires='>=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*',
|
python_requires=">=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*",
|
||||||
classifiers=[
|
classifiers=[
|
||||||
'Development Status :: 5 - Production/Stable',
|
"Development Status :: 5 - Production/Stable",
|
||||||
'Environment :: Console',
|
"Environment :: Console",
|
||||||
'Intended Audience :: Developers',
|
"Intended Audience :: Developers",
|
||||||
'Intended Audience :: Science/Research',
|
"Intended Audience :: Science/Research",
|
||||||
'License :: OSI Approved :: MIT License',
|
"License :: OSI Approved :: MIT License",
|
||||||
'Operating System :: POSIX :: Linux',
|
"Operating System :: POSIX :: Linux",
|
||||||
'Operating System :: MacOS :: MacOS X',
|
"Operating System :: MacOS :: MacOS X",
|
||||||
'Operating System :: Microsoft :: Windows',
|
"Operating System :: Microsoft :: Windows",
|
||||||
'Programming Language :: Cython',
|
"Programming Language :: Cython",
|
||||||
'Programming Language :: Python :: 2',
|
"Programming Language :: Python :: 2",
|
||||||
'Programming Language :: Python :: 2.7',
|
"Programming Language :: Python :: 2.7",
|
||||||
'Programming Language :: Python :: 3',
|
"Programming Language :: Python :: 3",
|
||||||
'Programming Language :: Python :: 3.4',
|
"Programming Language :: Python :: 3.4",
|
||||||
'Programming Language :: Python :: 3.5',
|
"Programming Language :: Python :: 3.5",
|
||||||
'Programming Language :: Python :: 3.6',
|
"Programming Language :: Python :: 3.6",
|
||||||
'Programming Language :: Python :: 3.7',
|
"Programming Language :: Python :: 3.7",
|
||||||
'Topic :: Scientific/Engineering'],
|
"Topic :: Scientific/Engineering",
|
||||||
cmdclass = {
|
],
|
||||||
'build_ext': build_ext_subclass},
|
cmdclass={"build_ext": build_ext_subclass},
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == "__main__":
|
||||||
setup_package()
|
setup_package()
|
||||||
@@ -1,6 +1,7 @@
 # coding: utf8
 from __future__ import unicode_literals
 import warnings

 warnings.filterwarnings("ignore", message="numpy.dtype size changed")
 warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

@@ -15,7 +16,7 @@ from . import util


 def load(name, **overrides):
-    depr_path = overrides.get('path')
+    depr_path = overrides.get("path")
     if depr_path not in (True, False, None):
         deprecation_warning(Warnings.W001.format(path=depr_path))
     return util.load_model(name, **overrides)
@@ -1,36 +1,41 @@
 # coding: utf8
 from __future__ import print_function

 # NB! This breaks in plac on Python 2!!
 # from __future__ import unicode_literals

-if __name__ == '__main__':
+if __name__ == "__main__":
     import plac
     import sys
-    from spacy.cli import download, link, info, package, train, convert
-    from spacy.cli import vocab, init_model, profile, evaluate, validate
-    from spacy.util import prints
+    from wasabi import Printer
+    from spacy.cli import download, link, info, package, train, pretrain, convert
+    from spacy.cli import init_model, profile, evaluate, validate
+    from spacy.cli import ud_train, ud_evaluate, debug_data

+    msg = Printer()
+
     commands = {
-        'download': download,
-        'link': link,
-        'info': info,
-        'train': train,
-        'evaluate': evaluate,
-        'convert': convert,
-        'package': package,
-        'vocab': vocab,
-        'init-model': init_model,
-        'profile': profile,
-        'validate': validate
+        "download": download,
+        "link": link,
+        "info": info,
+        "train": train,
+        "pretrain": pretrain,
+        "debug-data": debug_data,
+        "ud-train": ud_train,
+        "evaluate": evaluate,
+        "ud-evaluate": ud_evaluate,
+        "convert": convert,
+        "package": package,
+        "init-model": init_model,
+        "profile": profile,
+        "validate": validate,
     }
     if len(sys.argv) == 1:
-        prints(', '.join(commands), title="Available commands", exits=1)
+        msg.info("Available commands", ", ".join(commands), exits=1)
     command = sys.argv.pop(1)
-    sys.argv[0] = 'spacy %s' % command
+    sys.argv[0] = "spacy %s" % command
     if command in commands:
         plac.call(commands[command], sys.argv[1:])
     else:
-        prints(
-            "Available: %s" % ', '.join(commands),
-            title="Unknown command: %s" % command,
-            exits=1)
+        available = "Available: {}".format(", ".join(commands))
+        msg.fail("Unknown command: {}".format(command), available, exits=1)
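For readers unfamiliar with plac, the dispatcher above simply pops the subcommand off `sys.argv` and hands the remaining arguments to `plac.call`, which maps them onto the chosen function's annotated parameters. A toy sketch of the same mechanism (the `greet` command is hypothetical and not part of spaCy):

import sys

import plac


@plac.annotations(
    name=("Name to greet", "positional", None, str),
    loud=("Shout the greeting", "flag", "l"),
)
def greet(name, loud=False):
    msg = "Hello, {}!".format(name)
    print(msg.upper() if loud else msg)


if __name__ == "__main__":
    commands = {"greet": greet}
    command = sys.argv.pop(1)  # e.g. "greet"
    plac.call(commands[command], sys.argv[1:])  # maps ["World", "-l"] onto greet()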
255 spacy/_align.pyx (new file)
@@ -0,0 +1,255 @@
|
||||||
|
# cython: infer_types=True
|
||||||
|
'''Do Levenshtein alignment, for evaluation of tokenized input.
|
||||||
|
|
||||||
|
Random notes:
|
||||||
|
|
||||||
|
r i n g
|
||||||
|
0 1 2 3 4
|
||||||
|
r 1 0 1 2 3
|
||||||
|
a 2 1 1 2 3
|
||||||
|
n 3 2 2 1 2
|
||||||
|
g 4 3 3 2 1
|
||||||
|
|
||||||
|
0,0: (1,1)=min(0+0,1+1,1+1)=0 S
|
||||||
|
1,0: (2,1)=min(1+1,0+1,2+1)=1 D
|
||||||
|
2,0: (3,1)=min(2+1,3+1,1+1)=2 D
|
||||||
|
3,0: (4,1)=min(3+1,4+1,2+1)=3 D
|
||||||
|
0,1: (1,2)=min(1+1,2+1,0+1)=1 D
|
||||||
|
1,1: (2,2)=min(0+1,1+1,1+1)=1 S
|
||||||
|
2,1: (3,2)=min(1+1,1+1,2+1)=2 S or I
|
||||||
|
3,1: (4,2)=min(2+1,2+1,3+1)=3 S or I
|
||||||
|
0,2: (1,3)=min(2+1,3+1,1+1)=2 I
|
||||||
|
1,2: (2,3)=min(1+1,2+1,1+1)=2 S or I
|
||||||
|
2,2: (3,3)
|
||||||
|
3,2: (4,3)
|
||||||
|
At state (i, j) we're asking "How do I transform S[:i+1] to T[:j+1]?"
|
||||||
|
|
||||||
|
We know the costs to transition:
|
||||||
|
|
||||||
|
S[:i] -> T[:j] (at D[i,j])
|
||||||
|
S[:i+1] -> T[:j] (at D[i+1,j])
|
||||||
|
S[:i] -> T[:j+1] (at D[i,j+1])
|
||||||
|
|
||||||
|
Further, we now we can tranform:
|
||||||
|
S[:i+1] -> S[:i] (DEL) for 1,
|
||||||
|
T[:j+1] -> T[:j] (INS) for 1.
|
||||||
|
S[i+1] -> T[j+1] (SUB) for 0 or 1
|
||||||
|
|
||||||
|
Therefore we have the costs:
|
||||||
|
SUB: Cost(S[:i]->T[:j]) + Cost(S[i]->S[j])
|
||||||
|
i.e. D[i, j] + S[i+1] != T[j+1]
|
||||||
|
INS: Cost(S[:i+1]->T[:j]) + Cost(T[:j+1]->T[:j])
|
||||||
|
i.e. D[i+1,j] + 1
|
||||||
|
DEL: Cost(S[:i]->T[:j+1]) + Cost(S[:i+1]->S[:i])
|
||||||
|
i.e. D[i,j+1] + 1
|
||||||
|
|
||||||
|
Source string S has length m, with index i
|
||||||
|
Target string T has length n, with index j
|
||||||
|
|
||||||
|
Output two alignment vectors: i2j (length m) and j2i (length n)
|
||||||
|
# function LevenshteinDistance(char s[1..m], char t[1..n]):
|
||||||
|
# for all i and j, d[i,j] will hold the Levenshtein distance between
|
||||||
|
# the first i characters of s and the first j characters of t
|
||||||
|
# note that d has (m+1)*(n+1) values
|
||||||
|
# set each element in d to zero
|
||||||
|
ring rang
|
||||||
|
- r i n g
|
||||||
|
- 0 0 0 0 0
|
||||||
|
r 0 0 0 0 0
|
||||||
|
a 0 0 0 0 0
|
||||||
|
n 0 0 0 0 0
|
||||||
|
g 0 0 0 0 0
|
||||||
|
|
||||||
|
# source prefixes can be transformed into empty string by
|
||||||
|
# dropping all characters
|
||||||
|
# d[i, 0] := i
|
||||||
|
ring rang
|
||||||
|
- r i n g
|
||||||
|
- 0 0 0 0 0
|
||||||
|
r 1 0 0 0 0
|
||||||
|
a 2 0 0 0 0
|
||||||
|
n 3 0 0 0 0
|
||||||
|
g 4 0 0 0 0
|
||||||
|
|
||||||
|
# target prefixes can be reached from empty source prefix
|
||||||
|
# by inserting every character
|
||||||
|
# d[0, j] := j
|
||||||
|
- r i n g
|
||||||
|
- 0 1 2 3 4
|
||||||
|
r 1 0 0 0 0
|
||||||
|
a 2 0 0 0 0
|
||||||
|
n 3 0 0 0 0
|
||||||
|
g 4 0 0 0 0
|
||||||
|
|
||||||
|
'''
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
from libc.stdint cimport uint32_t
|
||||||
|
import numpy
|
||||||
|
cimport numpy as np
|
||||||
|
from .compat import unicode_
|
||||||
|
from murmurhash.mrmr cimport hash32
|
||||||
|
|
||||||
|
|
||||||
|
def align(S, T):
|
||||||
|
cdef int m = len(S)
|
||||||
|
cdef int n = len(T)
|
||||||
|
cdef np.ndarray matrix = numpy.zeros((m+1, n+1), dtype='int32')
|
||||||
|
cdef np.ndarray i2j = numpy.zeros((m,), dtype='i')
|
||||||
|
cdef np.ndarray j2i = numpy.zeros((n,), dtype='i')
|
||||||
|
|
||||||
|
cdef np.ndarray S_arr = _convert_sequence(S)
|
||||||
|
cdef np.ndarray T_arr = _convert_sequence(T)
|
||||||
|
|
||||||
|
fill_matrix(<int*>matrix.data,
|
||||||
|
<const int*>S_arr.data, m, <const int*>T_arr.data, n)
|
||||||
|
fill_i2j(i2j, matrix)
|
||||||
|
fill_j2i(j2i, matrix)
|
||||||
|
for i in range(i2j.shape[0]):
|
||||||
|
if i2j[i] >= 0 and len(S[i]) != len(T[i2j[i]]):
|
||||||
|
i2j[i] = -1
|
||||||
|
for j in range(j2i.shape[0]):
|
||||||
|
if j2i[j] >= 0 and len(T[j]) != len(S[j2i[j]]):
|
||||||
|
j2i[j] = -1
|
||||||
|
return matrix[-1,-1], i2j, j2i, matrix
|
||||||
|
|
||||||
|
|
||||||
|
def multi_align(np.ndarray i2j, np.ndarray j2i, i_lengths, j_lengths):
|
||||||
|
'''Let's say we had:
|
||||||
|
|
||||||
|
Guess: [aa bb cc dd]
|
||||||
|
Truth: [aa bbcc dd]
|
||||||
|
i2j: [0, None, -2, 2]
|
||||||
|
j2i: [0, -2, 3]
|
||||||
|
|
||||||
|
We want:
|
||||||
|
|
||||||
|
i2j_multi: {1: 1, 2: 1}
|
||||||
|
j2i_multi: {}
|
||||||
|
'''
|
||||||
|
i2j_miss = _get_regions(i2j, i_lengths)
|
||||||
|
j2i_miss = _get_regions(j2i, j_lengths)
|
||||||
|
|
||||||
|
i2j_multi, j2i_multi = _get_mapping(i2j_miss, j2i_miss, i_lengths, j_lengths)
|
||||||
|
return i2j_multi, j2i_multi
|
||||||
|
|
||||||
|
|
||||||
|
def _get_regions(alignment, lengths):
|
||||||
|
regions = {}
|
||||||
|
start = None
|
||||||
|
offset = 0
|
||||||
|
for i in range(len(alignment)):
|
||||||
|
if alignment[i] < 0:
|
||||||
|
if start is None:
|
||||||
|
start = offset
|
||||||
|
regions.setdefault(start, [])
|
||||||
|
regions[start].append(i)
|
||||||
|
else:
|
||||||
|
start = None
|
||||||
|
offset += lengths[i]
|
||||||
|
return regions
|
||||||
|
|
||||||
|
|
||||||
|
def _get_mapping(miss1, miss2, lengths1, lengths2):
|
||||||
|
i2j = {}
|
||||||
|
j2i = {}
|
||||||
|
for start, region1 in miss1.items():
|
||||||
|
if not region1 or start not in miss2:
|
||||||
|
continue
|
||||||
|
region2 = miss2[start]
|
||||||
|
if sum(lengths1[i] for i in region1) == sum(lengths2[i] for i in region2):
|
||||||
|
j = region2.pop(0)
|
||||||
|
buff = []
|
||||||
|
# Consume tokens from region 1, until we meet the length of the
|
||||||
|
# first token in region2. If we do, align the tokens. If
|
||||||
|
# we exceed the length, break.
|
||||||
|
while region1:
|
||||||
|
buff.append(region1.pop(0))
|
||||||
|
if sum(lengths1[i] for i in buff) == lengths2[j]:
|
||||||
|
for i in buff:
|
||||||
|
i2j[i] = j
|
||||||
|
j2i[j] = buff[-1]
|
||||||
|
j += 1
|
||||||
|
buff = []
|
||||||
|
elif sum(lengths1[i] for i in buff) > lengths2[j]:
|
||||||
|
break
|
||||||
|
else:
|
||||||
|
if buff and sum(lengths1[i] for i in buff) == lengths2[j]:
|
||||||
|
for i in buff:
|
||||||
|
i2j[i] = j
|
||||||
|
j2i[j] = buff[-1]
|
||||||
|
return i2j, j2i
|
||||||
|
|
||||||
|
|
||||||
|
def _convert_sequence(seq):
|
||||||
|
if isinstance(seq, numpy.ndarray):
|
||||||
|
return numpy.ascontiguousarray(seq, dtype='uint32_t')
|
||||||
|
cdef np.ndarray output = numpy.zeros((len(seq),), dtype='uint32')
|
||||||
|
cdef bytes item_bytes
|
||||||
|
for i, item in enumerate(seq):
|
||||||
|
if item == "``":
|
||||||
|
item = '"'
|
||||||
|
elif item == "''":
|
||||||
|
item = '"'
|
||||||
|
if isinstance(item, unicode):
|
||||||
|
item_bytes = item.encode('utf8')
|
||||||
|
else:
|
||||||
|
item_bytes = item
|
||||||
|
output[i] = hash32(<void*><char*>item_bytes, len(item_bytes), 0)
|
||||||
|
return output
|
||||||
|
|
||||||
|
|
||||||
|
cdef void fill_matrix(int* D,
|
||||||
|
const int* S, int m, const int* T, int n) nogil:
|
||||||
|
m1 = m+1
|
||||||
|
n1 = n+1
|
||||||
|
for i in range(m1*n1):
|
||||||
|
D[i] = 0
|
||||||
|
|
||||||
|
for i in range(m1):
|
||||||
|
D[i*n1] = i
|
||||||
|
|
||||||
|
for j in range(n1):
|
||||||
|
D[j] = j
|
||||||
|
|
||||||
|
cdef int sub_cost, ins_cost, del_cost
|
||||||
|
for j in range(n):
|
||||||
|
for i in range(m):
|
||||||
|
i_j = i*n1 + j
|
||||||
|
i1_j1 = (i+1)*n1 + j+1
|
||||||
|
i1_j = (i+1)*n1 + j
|
||||||
|
i_j1 = i*n1 + j+1
|
||||||
|
if S[i] != T[j]:
|
||||||
|
sub_cost = D[i_j] + 1
|
||||||
|
else:
|
||||||
|
sub_cost = D[i_j]
|
||||||
|
del_cost = D[i_j1] + 1
|
||||||
|
ins_cost = D[i1_j] + 1
|
||||||
|
best = min(min(sub_cost, ins_cost), del_cost)
|
||||||
|
D[i1_j1] = best
|
||||||
|
|
||||||
|
|
||||||
|
cdef void fill_i2j(np.ndarray i2j, np.ndarray D) except *:
|
||||||
|
j = D.shape[1]-2
|
||||||
|
cdef int i = D.shape[0]-2
|
||||||
|
while i >= 0:
|
||||||
|
while D[i+1, j] < D[i+1, j+1]:
|
||||||
|
j -= 1
|
||||||
|
if D[i, j+1] < D[i+1, j+1]:
|
||||||
|
i2j[i] = -1
|
||||||
|
else:
|
||||||
|
i2j[i] = j
|
||||||
|
j -= 1
|
||||||
|
i -= 1
|
||||||
|
|
||||||
|
cdef void fill_j2i(np.ndarray j2i, np.ndarray D) except *:
|
||||||
|
i = D.shape[0]-2
|
||||||
|
cdef int j = D.shape[1]-2
|
||||||
|
while j >= 0:
|
||||||
|
while D[i, j+1] < D[i+1, j+1]:
|
||||||
|
i -= 1
|
||||||
|
if D[i+1, j] < D[i+1, j+1]:
|
||||||
|
j2i[j] = -1
|
||||||
|
else:
|
||||||
|
j2i[j] = i
|
||||||
|
i -= 1
|
||||||
|
j -= 1
|
||||||
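The long docstring at the top of the new `_align.pyx` above walks through the Levenshtein dynamic program in prose before the Cython implementation. Purely as an illustration of that recurrence (not part of the commit; the function name is made up), a pure-Python version of the matrix fill looks like this:

import numpy


def levenshtein_matrix(source, target):
    """Fill the (m+1) x (n+1) cost matrix described in the notes above."""
    m, n = len(source), len(target)
    D = numpy.zeros((m + 1, n + 1), dtype="int32")
    D[:, 0] = numpy.arange(m + 1)  # source prefixes -> empty string by deletion
    D[0, :] = numpy.arange(n + 1)  # empty source -> target prefixes by insertion
    for i in range(m):
        for j in range(n):
            sub_cost = D[i, j] + (source[i] != target[j])
            ins_cost = D[i + 1, j] + 1
            del_cost = D[i, j + 1] + 1
            D[i + 1, j + 1] = min(sub_cost, ins_cost, del_cost)
    return D


# The worked "ring" vs. "rang" example above ends with a total cost of 1:
assert levenshtein_matrix("ring", "rang")[-1, -1] == 1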
475 spacy/_ml.py
@@ -5,16 +5,17 @@ import numpy
|
||||||
from thinc.v2v import Model, Maxout, Softmax, Affine, ReLu
|
from thinc.v2v import Model, Maxout, Softmax, Affine, ReLu
|
||||||
from thinc.i2v import HashEmbed, StaticVectors
|
from thinc.i2v import HashEmbed, StaticVectors
|
||||||
from thinc.t2t import ExtractWindow, ParametricAttention
|
from thinc.t2t import ExtractWindow, ParametricAttention
|
||||||
from thinc.t2v import Pooling, sum_pool
|
from thinc.t2v import Pooling, sum_pool, mean_pool
|
||||||
from thinc.misc import Residual
|
from thinc.misc import Residual
|
||||||
from thinc.misc import LayerNorm as LN
|
from thinc.misc import LayerNorm as LN
|
||||||
|
from thinc.misc import FeatureExtracter
|
||||||
from thinc.api import add, layerize, chain, clone, concatenate, with_flatten
|
from thinc.api import add, layerize, chain, clone, concatenate, with_flatten
|
||||||
from thinc.api import FeatureExtracter, with_getitem, flatten_add_lengths
|
from thinc.api import with_getitem, flatten_add_lengths
|
||||||
from thinc.api import uniqued, wrap, noop
|
from thinc.api import uniqued, wrap, noop
|
||||||
|
from thinc.api import with_square_sequences
|
||||||
from thinc.linear.linear import LinearModel
|
from thinc.linear.linear import LinearModel
|
||||||
from thinc.neural.ops import NumpyOps, CupyOps
|
from thinc.neural.ops import NumpyOps, CupyOps
|
||||||
from thinc.neural.util import get_array_module, copy_array
|
from thinc.neural.util import get_array_module
|
||||||
from thinc.neural._lsuv import svd_orthonormal
|
|
||||||
from thinc.neural.optimizers import Adam
|
from thinc.neural.optimizers import Adam
|
||||||
|
|
||||||
from thinc import describe
|
from thinc import describe
|
||||||
|
|
@ -26,37 +27,42 @@ from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE
|
||||||
from .errors import Errors
|
from .errors import Errors
|
||||||
from . import util
|
from . import util
|
||||||
|
|
||||||
|
try:
|
||||||
|
import torch.nn
|
||||||
|
from thinc.extra.wrappers import PyTorchWrapperRNN
|
||||||
|
except ImportError:
|
||||||
|
torch = None
|
||||||
|
|
||||||
VECTORS_KEY = 'spacy_pretrained_vectors'
|
VECTORS_KEY = "spacy_pretrained_vectors"
|
||||||
|
|
||||||
|
|
||||||
def cosine(vec1, vec2):
|
def cosine(vec1, vec2):
|
||||||
xp = get_array_module(vec1)
|
xp = get_array_module(vec1)
|
||||||
norm1 = xp.linalg.norm(vec1)
|
norm1 = xp.linalg.norm(vec1)
|
||||||
norm2 = xp.linalg.norm(vec2)
|
norm2 = xp.linalg.norm(vec2)
|
||||||
if norm1 == 0. or norm2 == 0.:
|
if norm1 == 0.0 or norm2 == 0.0:
|
||||||
return 0
|
return 0
|
||||||
else:
|
else:
|
||||||
return vec1.dot(vec2) / (norm1 * norm2)
|
return vec1.dot(vec2) / (norm1 * norm2)
|
||||||
|
|
||||||
|
|
||||||
def create_default_optimizer(ops, **cfg):
|
def create_default_optimizer(ops, **cfg):
|
||||||
learn_rate = util.env_opt('learn_rate', 0.001)
|
learn_rate = util.env_opt("learn_rate", 0.001)
|
||||||
beta1 = util.env_opt('optimizer_B1', 0.9)
|
beta1 = util.env_opt("optimizer_B1", 0.8)
|
||||||
beta2 = util.env_opt('optimizer_B2', 0.999)
|
beta2 = util.env_opt("optimizer_B2", 0.8)
|
||||||
eps = util.env_opt('optimizer_eps', 1e-08)
|
eps = util.env_opt("optimizer_eps", 0.00001)
|
||||||
L2 = util.env_opt('L2_penalty', 1e-6)
|
L2 = util.env_opt("L2_penalty", 1e-6)
|
||||||
max_grad_norm = util.env_opt('grad_norm_clip', 1.)
|
max_grad_norm = util.env_opt("grad_norm_clip", 5.0)
|
||||||
optimizer = Adam(ops, learn_rate, L2=L2, beta1=beta1,
|
optimizer = Adam(ops, learn_rate, L2=L2, beta1=beta1, beta2=beta2, eps=eps)
|
||||||
beta2=beta2, eps=eps)
|
|
||||||
optimizer.max_grad_norm = max_grad_norm
|
optimizer.max_grad_norm = max_grad_norm
|
||||||
optimizer.device = ops.device
|
optimizer.device = ops.device
|
||||||
return optimizer
|
return optimizer
|
||||||
|
|
||||||
|
|
||||||
@layerize
|
@layerize
|
||||||
def _flatten_add_lengths(seqs, pad=0, drop=0.):
|
def _flatten_add_lengths(seqs, pad=0, drop=0.0):
|
||||||
ops = Model.ops
|
ops = Model.ops
|
||||||
lengths = ops.asarray([len(seq) for seq in seqs], dtype='i')
|
lengths = ops.asarray([len(seq) for seq in seqs], dtype="i")
|
||||||
|
|
||||||
def finish_update(d_X, sgd=None):
|
def finish_update(d_X, sgd=None):
|
||||||
return ops.unflatten(d_X, lengths, pad=pad)
|
return ops.unflatten(d_X, lengths, pad=pad)
|
||||||
|
|
@ -65,46 +71,70 @@ def _flatten_add_lengths(seqs, pad=0, drop=0.):
|
||||||
return (X, lengths), finish_update
|
return (X, lengths), finish_update
|
||||||
|
|
||||||
|
|
||||||
@layerize
|
|
||||||
def _logistic(X, drop=0.):
|
|
||||||
xp = get_array_module(X)
|
|
||||||
if not isinstance(X, xp.ndarray):
|
|
||||||
X = xp.asarray(X)
|
|
||||||
# Clip to range (-10, 10)
|
|
||||||
X = xp.minimum(X, 10., X)
|
|
||||||
X = xp.maximum(X, -10., X)
|
|
||||||
Y = 1. / (1. + xp.exp(-X))
|
|
||||||
|
|
||||||
def logistic_bwd(dY, sgd=None):
|
|
||||||
dX = dY * (Y * (1-Y))
|
|
||||||
return dX
|
|
||||||
|
|
||||||
return Y, logistic_bwd
|
|
||||||
|
|
||||||
|
|
||||||
def _zero_init(model):
|
def _zero_init(model):
|
||||||
def _zero_init_impl(self, X, y):
|
def _zero_init_impl(self, *args, **kwargs):
|
||||||
self.W.fill(0)
|
self.W.fill(0)
|
||||||
model.on_data_hooks.append(_zero_init_impl)
|
|
||||||
|
model.on_init_hooks.append(_zero_init_impl)
|
||||||
if model.W is not None:
|
if model.W is not None:
|
||||||
model.W.fill(0.)
|
model.W.fill(0.0)
|
||||||
return model
|
return model
|
||||||
|
|
||||||
|
|
||||||
@layerize
|
@layerize
|
||||||
def _preprocess_doc(docs, drop=0.):
|
def _preprocess_doc(docs, drop=0.0):
|
||||||
keys = [doc.to_array(LOWER) for doc in docs]
|
keys = [doc.to_array(LOWER) for doc in docs]
|
||||||
ops = Model.ops
|
|
||||||
# The dtype here matches what thinc is expecting -- which differs per
|
# The dtype here matches what thinc is expecting -- which differs per
|
||||||
# platform (by int definition). This should be fixed once the problem
|
# platform (by int definition). This should be fixed once the problem
|
||||||
# is fixed on Thinc's side.
|
# is fixed on Thinc's side.
|
||||||
lengths = ops.asarray([arr.shape[0] for arr in keys], dtype=numpy.int_)
|
lengths = numpy.array([arr.shape[0] for arr in keys], dtype=numpy.int_)
|
||||||
keys = ops.xp.concatenate(keys)
|
keys = numpy.concatenate(keys)
|
||||||
vals = ops.allocate(keys.shape) + 1.
|
vals = numpy.zeros(keys.shape, dtype='f')
|
||||||
return (keys, vals, lengths), None
|
return (keys, vals, lengths), None
|
||||||
|
|
||||||
|
|
||||||
|
def with_cpu(ops, model):
|
||||||
|
"""Wrap a model that should run on CPU, transferring inputs and outputs
|
||||||
|
as necessary."""
|
||||||
|
model.to_cpu()
|
||||||
|
|
||||||
|
def with_cpu_forward(inputs, drop=0.):
|
||||||
|
cpu_outputs, backprop = model.begin_update(_to_cpu(inputs), drop=drop)
|
||||||
|
gpu_outputs = _to_device(ops, cpu_outputs)
|
||||||
|
|
||||||
|
def with_cpu_backprop(d_outputs, sgd=None):
|
||||||
|
cpu_d_outputs = _to_cpu(d_outputs)
|
||||||
|
return backprop(cpu_d_outputs, sgd=sgd)
|
||||||
|
|
||||||
|
return gpu_outputs, with_cpu_backprop
|
||||||
|
|
||||||
|
return wrap(with_cpu_forward, model)
|
||||||
|
|
||||||
|
|
||||||
|
def _to_cpu(X):
|
||||||
|
if isinstance(X, numpy.ndarray):
|
||||||
|
return X
|
||||||
|
elif isinstance(X, tuple):
|
||||||
|
return tuple([_to_cpu(x) for x in X])
|
||||||
|
elif isinstance(X, list):
|
||||||
|
return [_to_cpu(x) for x in X]
|
||||||
|
elif hasattr(X, 'get'):
|
||||||
|
return X.get()
|
||||||
|
else:
|
||||||
|
return X
|
||||||
|
|
||||||
|
|
||||||
|
def _to_device(ops, X):
|
||||||
|
if isinstance(X, tuple):
|
||||||
|
return tuple([_to_device(ops, x) for x in X])
|
||||||
|
elif isinstance(X, list):
|
||||||
|
return [_to_device(ops, x) for x in X]
|
||||||
|
else:
|
||||||
|
return ops.asarray(X)
|
||||||
|
|
||||||
|
|
||||||
@layerize
|
@layerize
|
||||||
def _preprocess_doc_bigrams(docs, drop=0.):
|
def _preprocess_doc_bigrams(docs, drop=0.0):
|
||||||
unigrams = [doc.to_array(LOWER) for doc in docs]
|
unigrams = [doc.to_array(LOWER) for doc in docs]
|
||||||
ops = Model.ops
|
ops = Model.ops
|
||||||
bigrams = [ops.ngrams(2, doc_unis) for doc_unis in unigrams]
|
bigrams = [ops.ngrams(2, doc_unis) for doc_unis in unigrams]
|
||||||
|
|
@ -115,27 +145,29 @@ def _preprocess_doc_bigrams(docs, drop=0.):
|
||||||
# is fixed on Thinc's side.
|
# is fixed on Thinc's side.
|
||||||
lengths = ops.asarray([arr.shape[0] for arr in keys], dtype=numpy.int_)
|
lengths = ops.asarray([arr.shape[0] for arr in keys], dtype=numpy.int_)
|
||||||
keys = ops.xp.concatenate(keys)
|
keys = ops.xp.concatenate(keys)
|
||||||
vals = ops.asarray(ops.xp.concatenate(vals), dtype='f')
|
vals = ops.asarray(ops.xp.concatenate(vals), dtype="f")
|
||||||
return (keys, vals, lengths), None
|
return (keys, vals, lengths), None
|
||||||
|
|
||||||
|
|
||||||
@describe.on_data(_set_dimensions_if_needed,
|
@describe.on_data(
|
||||||
lambda model, X, y: model.init_weights(model))
|
_set_dimensions_if_needed, lambda model, X, y: model.init_weights(model)
|
||||||
|
)
|
||||||
@describe.attributes(
|
@describe.attributes(
|
||||||
nI=Dimension("Input size"),
|
nI=Dimension("Input size"),
|
||||||
nF=Dimension("Number of features"),
|
nF=Dimension("Number of features"),
|
||||||
nO=Dimension("Output size"),
|
nO=Dimension("Output size"),
|
||||||
nP=Dimension("Maxout pieces"),
|
nP=Dimension("Maxout pieces"),
|
||||||
W=Synapses("Weights matrix",
|
W=Synapses("Weights matrix", lambda obj: (obj.nF, obj.nO, obj.nP, obj.nI)),
|
||||||
lambda obj: (obj.nF, obj.nO, obj.nP, obj.nI)),
|
b=Biases("Bias vector", lambda obj: (obj.nO, obj.nP)),
|
||||||
b=Biases("Bias vector",
|
pad=Synapses(
|
||||||
lambda obj: (obj.nO, obj.nP)),
|
"Pad",
|
||||||
pad=Synapses("Pad",
|
|
||||||
lambda obj: (1, obj.nF, obj.nO, obj.nP),
|
lambda obj: (1, obj.nF, obj.nO, obj.nP),
|
||||||
lambda M, ops: ops.normal_init(M, 1.)),
|
lambda M, ops: ops.normal_init(M, 1.0),
|
||||||
|
),
|
||||||
d_W=Gradient("W"),
|
d_W=Gradient("W"),
|
||||||
d_pad=Gradient("pad"),
|
d_pad=Gradient("pad"),
|
||||||
d_b=Gradient("b"))
|
d_b=Gradient("b"),
|
||||||
|
)
|
||||||
class PrecomputableAffine(Model):
|
class PrecomputableAffine(Model):
|
||||||
def __init__(self, nO=None, nI=None, nF=None, nP=None, **kwargs):
|
def __init__(self, nO=None, nI=None, nF=None, nP=None, **kwargs):
|
||||||
Model.__init__(self, **kwargs)
|
Model.__init__(self, **kwargs)
|
||||||
|
|
@ -144,9 +176,10 @@ class PrecomputableAffine(Model):
|
||||||
self.nI = nI
|
self.nI = nI
|
||||||
self.nF = nF
|
self.nF = nF
|
||||||
|
|
||||||
def begin_update(self, X, drop=0.):
|
def begin_update(self, X, drop=0.0):
|
||||||
Yf = self.ops.xp.dot(X,
|
Yf = self.ops.gemm(
|
||||||
self.W.reshape((self.nF*self.nO*self.nP, self.nI)).T)
|
X, self.W.reshape((self.nF * self.nO * self.nP, self.nI)), trans2=True
|
||||||
|
)
|
||||||
Yf = Yf.reshape((Yf.shape[0], self.nF, self.nO, self.nP))
|
Yf = Yf.reshape((Yf.shape[0], self.nF, self.nO, self.nP))
|
||||||
Yf = self._add_padding(Yf)
|
Yf = self._add_padding(Yf)
|
||||||
|
|
||||||
|
|
@ -162,11 +195,12 @@ class PrecomputableAffine(Model):
|
||||||
Wopfi = self.W.transpose((1, 2, 0, 3))
|
Wopfi = self.W.transpose((1, 2, 0, 3))
|
||||||
Wopfi = self.ops.xp.ascontiguousarray(Wopfi)
|
Wopfi = self.ops.xp.ascontiguousarray(Wopfi)
|
||||||
Wopfi = Wopfi.reshape((self.nO * self.nP, self.nF * self.nI))
|
Wopfi = Wopfi.reshape((self.nO * self.nP, self.nF * self.nI))
|
||||||
dXf = self.ops.dot(dY.reshape((dY.shape[0], self.nO*self.nP)), Wopfi)
|
dXf = self.ops.gemm(dY.reshape((dY.shape[0], self.nO * self.nP)), Wopfi)
|
||||||
|
|
||||||
# Reuse the buffer
|
# Reuse the buffer
|
||||||
dWopfi = Wopfi; dWopfi.fill(0.)
|
dWopfi = Wopfi
|
||||||
self.ops.xp.dot(dY.T, Xf, out=dWopfi)
|
dWopfi.fill(0.0)
|
||||||
|
self.ops.gemm(dY, Xf, out=dWopfi, trans1=True)
|
||||||
dWopfi = dWopfi.reshape((self.nO, self.nP, self.nF, self.nI))
|
dWopfi = dWopfi.reshape((self.nO, self.nP, self.nF, self.nI))
|
||||||
# (o, p, f, i) --> (f, o, p, i)
|
# (o, p, f, i) --> (f, o, p, i)
|
||||||
self.d_W += dWopfi.transpose((2, 0, 1, 3))
|
self.d_W += dWopfi.transpose((2, 0, 1, 3))
|
||||||
|
|
@ -174,6 +208,7 @@ class PrecomputableAffine(Model):
|
||||||
if sgd is not None:
|
if sgd is not None:
|
||||||
sgd(self._mem.weights, self._mem.gradient, key=self.id)
|
sgd(self._mem.weights, self._mem.gradient, key=self.id)
|
||||||
return dXf.reshape((dXf.shape[0], self.nF, self.nI))
|
return dXf.reshape((dXf.shape[0], self.nF, self.nI))
|
||||||
|
|
||||||
return Yf, backward
|
return Yf, backward
|
||||||
|
|
||||||
def _add_padding(self, Yf):
|
def _add_padding(self, Yf):
|
||||||
|
|
@ -182,7 +217,7 @@ class PrecomputableAffine(Model):
|
||||||
|
|
||||||
def _backprop_padding(self, dY, ids):
|
def _backprop_padding(self, dY, ids):
|
||||||
# (1, nF, nO, nP) += (nN, nF, nO, nP) where IDs (nN, nF) < 0
|
# (1, nF, nO, nP) += (nN, nF, nO, nP) where IDs (nN, nF) < 0
|
||||||
mask = ids < 0.
|
mask = ids < 0.0
|
||||||
mask = mask.sum(axis=1)
|
mask = mask.sum(axis=1)
|
||||||
d_pad = dY * mask.reshape((ids.shape[0], 1, 1))
|
d_pad = dY * mask.reshape((ids.shape[0], 1, 1))
|
||||||
self.d_pad += d_pad.sum(axis=0)
|
self.d_pad += d_pad.sum(axis=0)
|
||||||
|
|
@ -190,36 +225,40 @@ class PrecomputableAffine(Model):
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def init_weights(model):
|
def init_weights(model):
|
||||||
'''This is like the 'layer sequential unit variance', but instead
|
"""This is like the 'layer sequential unit variance', but instead
|
||||||
of taking the actual inputs, we randomly generate whitened data.
|
of taking the actual inputs, we randomly generate whitened data.
|
||||||
|
|
||||||
Why's this all so complicated? We have a huge number of inputs,
|
Why's this all so complicated? We have a huge number of inputs,
|
||||||
and the maxout unit makes guessing the dynamics tricky. Instead
|
and the maxout unit makes guessing the dynamics tricky. Instead
|
||||||
we set the maxout weights to values that empirically result in
|
we set the maxout weights to values that empirically result in
|
||||||
whitened outputs given whitened inputs.
|
whitened outputs given whitened inputs.
|
||||||
'''
|
"""
|
||||||
if (model.W**2).sum() != 0.:
|
if (model.W ** 2).sum() != 0.0:
|
||||||
return
|
return
|
||||||
ops = model.ops
|
ops = model.ops
|
||||||
xp = ops.xp
|
xp = ops.xp
|
||||||
ops.normal_init(model.W, model.nF * model.nI, inplace=True)
|
ops.normal_init(model.W, model.nF * model.nI, inplace=True)
|
||||||
|
|
||||||
ids = ops.allocate((5000, model.nF), dtype='f')
|
ids = ops.allocate((5000, model.nF), dtype="f")
|
||||||
ids += xp.random.uniform(0, 1000, ids.shape)
|
ids += xp.random.uniform(0, 1000, ids.shape)
|
||||||
ids = ops.asarray(ids, dtype='i')
|
ids = ops.asarray(ids, dtype="i")
|
||||||
tokvecs = ops.allocate((5000, model.nI), dtype='f')
|
tokvecs = ops.allocate((5000, model.nI), dtype="f")
|
||||||
tokvecs += xp.random.normal(loc=0., scale=1.,
|
tokvecs += xp.random.normal(loc=0.0, scale=1.0, size=tokvecs.size).reshape(
|
||||||
size=tokvecs.size).reshape(tokvecs.shape)
|
tokvecs.shape
|
||||||
|
)
|
||||||
|
|
||||||
def predict(ids, tokvecs):
|
def predict(ids, tokvecs):
|
||||||
# nS ids. nW tokvecs
|
# nS ids. nW tokvecs. Exclude the padding array.
|
||||||
hiddens = model(tokvecs) # (nW, f, o, p)
|
hiddens = model(tokvecs[:-1]) # (nW, f, o, p)
|
||||||
|
vectors = model.ops.allocate((ids.shape[0], model.nO * model.nP), dtype="f")
|
||||||
# need nS vectors
|
# need nS vectors
|
||||||
vectors = model.ops.allocate((ids.shape[0], model.nO, model.nP))
|
hiddens = hiddens.reshape(
|
||||||
for i, feats in enumerate(ids):
|
(hiddens.shape[0] * model.nF, model.nO * model.nP)
|
||||||
for j, id_ in enumerate(feats):
|
)
|
||||||
vectors[i] += hiddens[id_, j]
|
model.ops.scatter_add(vectors, ids.flatten(), hiddens)
|
||||||
|
vectors = vectors.reshape((vectors.shape[0], model.nO, model.nP))
|
||||||
vectors += model.b
|
vectors += model.b
|
||||||
|
vectors = model.ops.asarray(vectors)
|
||||||
if model.nP >= 2:
|
if model.nP >= 2:
|
||||||
return model.ops.maxout(vectors)[0]
|
return model.ops.maxout(vectors)[0]
|
||||||
else:
|
else:
|
||||||
|
|
@ -245,9 +284,11 @@ def link_vectors_to_models(vocab):
|
||||||
vectors = vocab.vectors
|
vectors = vocab.vectors
|
||||||
if vectors.name is None:
|
if vectors.name is None:
|
||||||
vectors.name = VECTORS_KEY
|
vectors.name = VECTORS_KEY
|
||||||
|
if vectors.data.size != 0:
|
||||||
print(
|
print(
|
||||||
"Warning: Unnamed vectors -- this won't allow multiple vectors "
|
"Warning: Unnamed vectors -- this won't allow multiple vectors "
|
||||||
"models to be loaded. (Shape: (%d, %d))" % vectors.data.shape)
|
"models to be loaded. (Shape: (%d, %d))" % vectors.data.shape
|
||||||
|
)
|
||||||
ops = Model.ops
|
ops = Model.ops
|
||||||
for word in vocab:
|
for word in vocab:
|
||||||
if word.orth in vectors.key2row:
|
if word.orth in vectors.key2row:
|
||||||
|
|
@ -260,43 +301,68 @@ def link_vectors_to_models(vocab):
|
||||||
thinc.extra.load_nlp.VECTORS[(ops.device, vectors.name)] = data
|
thinc.extra.load_nlp.VECTORS[(ops.device, vectors.name)] = data
|
||||||
|
|
||||||
|
|
||||||
|
def PyTorchBiLSTM(nO, nI, depth, dropout=0.2):
|
||||||
|
if depth == 0:
|
||||||
|
return layerize(noop())
|
||||||
|
model = torch.nn.LSTM(nI, nO // 2, depth, bidirectional=True, dropout=dropout)
|
||||||
|
return with_square_sequences(PyTorchWrapperRNN(model))
|
||||||
|
|
||||||
|
|
||||||
def Tok2Vec(width, embed_size, **kwargs):
|
def Tok2Vec(width, embed_size, **kwargs):
|
||||||
pretrained_vectors = kwargs.get('pretrained_vectors', None)
|
pretrained_vectors = kwargs.get("pretrained_vectors", None)
|
||||||
cnn_maxout_pieces = kwargs.get('cnn_maxout_pieces', 2)
|
cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3)
|
||||||
|
subword_features = kwargs.get("subword_features", True)
|
||||||
|
conv_depth = kwargs.get("conv_depth", 4)
|
||||||
|
bilstm_depth = kwargs.get("bilstm_depth", 0)
|
||||||
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
|
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
|
||||||
with Model.define_operators({'>>': chain, '|': concatenate, '**': clone,
|
with Model.define_operators(
|
||||||
'+': add, '*': reapply}):
|
{">>": chain, "|": concatenate, "**": clone, "+": add, "*": reapply}
|
||||||
norm = HashEmbed(width, embed_size, column=cols.index(NORM),
|
):
|
||||||
name='embed_norm')
|
norm = HashEmbed(width, embed_size, column=cols.index(NORM), name="embed_norm")
|
||||||
prefix = HashEmbed(width, embed_size//2, column=cols.index(PREFIX),
|
if subword_features:
|
||||||
name='embed_prefix')
|
prefix = HashEmbed(
|
||||||
suffix = HashEmbed(width, embed_size//2, column=cols.index(SUFFIX),
|
width, embed_size // 2, column=cols.index(PREFIX), name="embed_prefix"
|
||||||
name='embed_suffix')
|
)
|
||||||
shape = HashEmbed(width, embed_size//2, column=cols.index(SHAPE),
|
suffix = HashEmbed(
|
||||||
name='embed_shape')
|
width, embed_size // 2, column=cols.index(SUFFIX), name="embed_suffix"
|
||||||
|
)
|
||||||
|
shape = HashEmbed(
|
||||||
|
width, embed_size // 2, column=cols.index(SHAPE), name="embed_shape"
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
prefix, suffix, shape = (None, None, None)
|
||||||
if pretrained_vectors is not None:
|
if pretrained_vectors is not None:
|
||||||
glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))
|
glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))
|
||||||
|
|
||||||
|
if subword_features:
|
||||||
embed = uniqued(
|
embed = uniqued(
|
||||||
(glove | norm | prefix | suffix | shape)
|
(glove | norm | prefix | suffix | shape)
|
||||||
>> LN(Maxout(width, width*5, pieces=3)), column=cols.index(ORTH))
|
>> LN(Maxout(width, width * 5, pieces=3)),
|
||||||
|
column=cols.index(ORTH),
|
||||||
|
)
|
||||||
else:
|
else:
|
||||||
|
embed = uniqued(
|
||||||
|
(glove | norm) >> LN(Maxout(width, width * 2, pieces=3)),
|
||||||
|
column=cols.index(ORTH),
|
||||||
|
)
|
||||||
|
elif subword_features:
|
||||||
embed = uniqued(
|
embed = uniqued(
|
||||||
(norm | prefix | suffix | shape)
|
(norm | prefix | suffix | shape)
|
||||||
>> LN(Maxout(width, width*4, pieces=3)), column=cols.index(ORTH))
|
>> LN(Maxout(width, width * 4, pieces=3)),
|
||||||
|
column=cols.index(ORTH),
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
embed = norm
|
||||||
|
|
||||||
convolution = Residual(
|
convolution = Residual(
|
||||||
ExtractWindow(nW=1)
|
ExtractWindow(nW=1)
|
||||||
>> LN(Maxout(width, width * 3, pieces=cnn_maxout_pieces))
|
>> LN(Maxout(width, width * 3, pieces=cnn_maxout_pieces))
|
||||||
)
|
)
|
||||||
|
tok2vec = FeatureExtracter(cols) >> with_flatten(
|
||||||
tok2vec = (
|
embed >> convolution ** conv_depth, pad=conv_depth
|
||||||
FeatureExtracter(cols)
|
|
||||||
>> with_flatten(
|
|
||||||
embed
|
|
||||||
>> convolution ** 4, pad=4
|
|
||||||
)
|
|
||||||
)
|
)
|
||||||
|
if bilstm_depth >= 1:
|
||||||
|
tok2vec = tok2vec >> PyTorchBiLSTM(width, width, bilstm_depth)
|
||||||
# Work around thinc API limitations :(. TODO: Revise in Thinc 7
|
# Work around thinc API limitations :(. TODO: Revise in Thinc 7
|
||||||
tok2vec.nO = width
|
tok2vec.nO = width
|
||||||
tok2vec.embed = embed
|
tok2vec.embed = embed
|
||||||
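For orientation, a rough usage sketch of the reworked Tok2Vec factory and its new keyword arguments follows. The import path and the concrete values are assumptions for illustration, not part of this diff.

    # Sketch only: assumes the factory lives in spacy._ml as in v2.x.
    from spacy._ml import Tok2Vec

    tok2vec = Tok2Vec(
        width=96,                 # output vector width per token
        embed_size=2000,          # rows per HashEmbed table
        subword_features=True,    # False drops the prefix/suffix/shape embeddings
        conv_depth=4,             # number of residual CNN blocks
        bilstm_depth=0,           # >= 1 appends PyTorchBiLSTM (needs torch installed)
        cnn_maxout_pieces=3,      # maxout pieces in each CNN block
    )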
|
|
@ -304,7 +370,7 @@ def Tok2Vec(width, embed_size, **kwargs):
|
||||||
|
|
||||||
|
|
||||||
def reapply(layer, n_times):
|
def reapply(layer, n_times):
|
||||||
def reapply_fwd(X, drop=0.):
|
def reapply_fwd(X, drop=0.0):
|
||||||
backprops = []
|
backprops = []
|
||||||
for i in range(n_times):
|
for i in range(n_times):
|
||||||
Y, backprop = layer.begin_update(X, drop=drop)
|
Y, backprop = layer.begin_update(X, drop=drop)
|
||||||
|
|
@ -322,12 +388,14 @@ def reapply(layer, n_times):
|
||||||
return dX
|
return dX
|
||||||
|
|
||||||
return Y, reapply_bwd
|
return Y, reapply_bwd
|
||||||
|
|
||||||
return wrap(reapply_fwd, layer)
|
return wrap(reapply_fwd, layer)
|
||||||
|
|
||||||
|
|
||||||
def asarray(ops, dtype):
|
def asarray(ops, dtype):
|
||||||
def forward(X, drop=0.):
|
def forward(X, drop=0.0):
|
||||||
return ops.asarray(X, dtype=dtype), None
|
return ops.asarray(X, dtype=dtype), None
|
||||||
|
|
||||||
return layerize(forward)
|
return layerize(forward)
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -344,7 +412,7 @@ def get_col(idx):
|
||||||
if idx < 0:
|
if idx < 0:
|
||||||
raise IndexError(Errors.E066.format(value=idx))
|
raise IndexError(Errors.E066.format(value=idx))
|
||||||
|
|
||||||
def forward(X, drop=0.):
|
def forward(X, drop=0.0):
|
||||||
if isinstance(X, numpy.ndarray):
|
if isinstance(X, numpy.ndarray):
|
||||||
ops = NumpyOps()
|
ops = NumpyOps()
|
||||||
else:
|
else:
|
||||||
|
|
@ -365,7 +433,7 @@ def doc2feats(cols=None):
|
||||||
if cols is None:
|
if cols is None:
|
||||||
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
|
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
|
||||||
|
|
||||||
def forward(docs, drop=0.):
|
def forward(docs, drop=0.0):
|
||||||
feats = []
|
feats = []
|
||||||
for doc in docs:
|
for doc in docs:
|
||||||
feats.append(doc.to_array(cols))
|
feats.append(doc.to_array(cols))
|
||||||
|
|
@ -377,13 +445,14 @@ def doc2feats(cols=None):
|
||||||
|
|
||||||
|
|
||||||
def print_shape(prefix):
|
def print_shape(prefix):
|
||||||
def forward(X, drop=0.):
|
def forward(X, drop=0.0):
|
||||||
return X, lambda dX, **kwargs: dX
|
return X, lambda dX, **kwargs: dX
|
||||||
|
|
||||||
return layerize(forward)
|
return layerize(forward)
|
||||||
|
|
||||||
|
|
||||||
@layerize
|
@layerize
|
||||||
def get_token_vectors(tokens_attrs_vectors, drop=0.):
|
def get_token_vectors(tokens_attrs_vectors, drop=0.0):
|
||||||
tokens, attrs, vectors = tokens_attrs_vectors
|
tokens, attrs, vectors = tokens_attrs_vectors
|
||||||
|
|
||||||
def backward(d_output, sgd=None):
|
def backward(d_output, sgd=None):
|
||||||
|
|
@ -393,14 +462,14 @@ def get_token_vectors(tokens_attrs_vectors, drop=0.):
|
||||||
|
|
||||||
|
|
||||||
@layerize
|
@layerize
|
||||||
def logistic(X, drop=0.):
|
def logistic(X, drop=0.0):
|
||||||
xp = get_array_module(X)
|
xp = get_array_module(X)
|
||||||
if not isinstance(X, xp.ndarray):
|
if not isinstance(X, xp.ndarray):
|
||||||
X = xp.asarray(X)
|
X = xp.asarray(X)
|
||||||
# Clip to range (-10, 10)
|
# Clip to range (-10, 10)
|
||||||
X = xp.minimum(X, 10., X)
|
X = xp.minimum(X, 10.0, X)
|
||||||
X = xp.maximum(X, -10., X)
|
X = xp.maximum(X, -10.0, X)
|
||||||
Y = 1. / (1. + xp.exp(-X))
|
Y = 1.0 / (1.0 + xp.exp(-X))
|
||||||
|
|
||||||
def logistic_bwd(dY, sgd=None):
|
def logistic_bwd(dY, sgd=None):
|
||||||
dX = dY * (Y * (1 - Y))
|
dX = dY * (Y * (1 - Y))
|
||||||
|
|
@ -412,12 +481,13 @@ def logistic(X, drop=0.):
|
||||||
def zero_init(model):
|
def zero_init(model):
|
||||||
def _zero_init_impl(self, X, y):
|
def _zero_init_impl(self, X, y):
|
||||||
self.W.fill(0)
|
self.W.fill(0)
|
||||||
|
|
||||||
model.on_data_hooks.append(_zero_init_impl)
|
model.on_data_hooks.append(_zero_init_impl)
|
||||||
return model
|
return model
|
||||||
|
|
||||||
|
|
||||||
@layerize
|
@layerize
|
||||||
def preprocess_doc(docs, drop=0.):
|
def preprocess_doc(docs, drop=0.0):
|
||||||
keys = [doc.to_array([LOWER]) for doc in docs]
|
keys = [doc.to_array([LOWER]) for doc in docs]
|
||||||
ops = Model.ops
|
ops = Model.ops
|
||||||
lengths = ops.asarray([arr.shape[0] for arr in keys])
|
lengths = ops.asarray([arr.shape[0] for arr in keys])
|
||||||
|
|
@ -427,29 +497,32 @@ def preprocess_doc(docs, drop=0.):
|
||||||
|
|
||||||
|
|
||||||
def getitem(i):
|
def getitem(i):
|
||||||
def getitem_fwd(X, drop=0.):
|
def getitem_fwd(X, drop=0.0):
|
||||||
return X[i], None
|
return X[i], None
|
||||||
|
|
||||||
return layerize(getitem_fwd)
|
return layerize(getitem_fwd)
|
||||||
|
|
||||||
|
|
||||||
def build_tagger_model(nr_class, **cfg):
|
def build_tagger_model(nr_class, **cfg):
|
||||||
embed_size = util.env_opt('embed_size', 7000)
|
embed_size = util.env_opt("embed_size", 2000)
|
||||||
if 'token_vector_width' in cfg:
|
if "token_vector_width" in cfg:
|
||||||
token_vector_width = cfg['token_vector_width']
|
token_vector_width = cfg["token_vector_width"]
|
||||||
else:
|
else:
|
||||||
token_vector_width = util.env_opt('token_vector_width', 128)
|
token_vector_width = util.env_opt("token_vector_width", 96)
|
||||||
pretrained_vectors = cfg.get('pretrained_vectors')
|
pretrained_vectors = cfg.get("pretrained_vectors")
|
||||||
with Model.define_operators({'>>': chain, '+': add}):
|
subword_features = cfg.get("subword_features", True)
|
||||||
if 'tok2vec' in cfg:
|
with Model.define_operators({">>": chain, "+": add}):
|
||||||
tok2vec = cfg['tok2vec']
|
if "tok2vec" in cfg:
|
||||||
|
tok2vec = cfg["tok2vec"]
|
||||||
else:
|
else:
|
||||||
tok2vec = Tok2Vec(token_vector_width, embed_size,
|
tok2vec = Tok2Vec(
|
||||||
pretrained_vectors=pretrained_vectors)
|
token_vector_width,
|
||||||
softmax = with_flatten(Softmax(nr_class, token_vector_width))
|
embed_size,
|
||||||
model = (
|
subword_features=subword_features,
|
||||||
tok2vec
|
pretrained_vectors=pretrained_vectors,
|
||||||
>> softmax
|
|
||||||
)
|
)
|
||||||
|
softmax = with_flatten(Softmax(nr_class, token_vector_width))
|
||||||
|
model = tok2vec >> softmax
|
||||||
model.nI = None
|
model.nI = None
|
||||||
model.tok2vec = tok2vec
|
model.tok2vec = tok2vec
|
||||||
model.softmax = softmax
|
model.softmax = softmax
|
||||||
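build_tagger_model's defaults change here (embed_size 7000 to 2000, token_vector_width 128 to 96). A hedged sketch of calling the factory directly, using the cfg keys read above; the class count and values are invented.

    # Sketch only: nr_class and the overrides are made-up example values.
    from spacy._ml import build_tagger_model

    model = build_tagger_model(
        50,                        # nr_class: number of tag labels
        token_vector_width=96,     # new default when not overridden
        subword_features=True,
        pretrained_vectors=None,   # or the name of a vectors package
    )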
|
|
@ -457,10 +530,10 @@ def build_tagger_model(nr_class, **cfg):
|
||||||
|
|
||||||
|
|
||||||
@layerize
|
@layerize
|
||||||
def SpacyVectors(docs, drop=0.):
|
def SpacyVectors(docs, drop=0.0):
|
||||||
batch = []
|
batch = []
|
||||||
for doc in docs:
|
for doc in docs:
|
||||||
indices = numpy.zeros((len(doc),), dtype='i')
|
indices = numpy.zeros((len(doc),), dtype="i")
|
||||||
for i, word in enumerate(doc):
|
for i, word in enumerate(doc):
|
||||||
if word.orth in doc.vocab.vectors.key2row:
|
if word.orth in doc.vocab.vectors.key2row:
|
||||||
indices[i] = doc.vocab.vectors.key2row[word.orth]
|
indices[i] = doc.vocab.vectors.key2row[word.orth]
|
||||||
|
|
@ -472,11 +545,11 @@ def SpacyVectors(docs, drop=0.):
|
||||||
|
|
||||||
|
|
||||||
def build_text_classifier(nr_class, width=64, **cfg):
|
def build_text_classifier(nr_class, width=64, **cfg):
|
||||||
nr_vector = cfg.get('nr_vector', 5000)
|
depth = cfg.get("depth", 2)
|
||||||
pretrained_dims = cfg.get('pretrained_dims', 0)
|
nr_vector = cfg.get("nr_vector", 5000)
|
||||||
with Model.define_operators({'>>': chain, '+': add, '|': concatenate,
|
pretrained_dims = cfg.get("pretrained_dims", 0)
|
||||||
'**': clone}):
|
with Model.define_operators({">>": chain, "+": add, "|": concatenate, "**": clone}):
|
||||||
if cfg.get('low_data') and pretrained_dims:
|
if cfg.get("low_data") and pretrained_dims:
|
||||||
model = (
|
model = (
|
||||||
SpacyVectors
|
SpacyVectors
|
||||||
>> flatten_add_lengths
|
>> flatten_add_lengths
|
||||||
|
|
@ -494,21 +567,19 @@ def build_text_classifier(nr_class, width=64, **cfg):
|
||||||
suffix = HashEmbed(width // 2, nr_vector, column=3)
|
suffix = HashEmbed(width // 2, nr_vector, column=3)
|
||||||
shape = HashEmbed(width // 2, nr_vector, column=4)
|
shape = HashEmbed(width // 2, nr_vector, column=4)
|
||||||
|
|
||||||
trained_vectors = (
|
trained_vectors = FeatureExtracter(
|
||||||
FeatureExtracter([ORTH, LOWER, PREFIX, SUFFIX, SHAPE, ID])
|
[ORTH, LOWER, PREFIX, SUFFIX, SHAPE, ID]
|
||||||
>> with_flatten(
|
) >> with_flatten(
|
||||||
uniqued(
|
uniqued(
|
||||||
(lower | prefix | suffix | shape)
|
(lower | prefix | suffix | shape)
|
||||||
>> LN(Maxout(width, width + (width // 2) * 3)),
|
>> LN(Maxout(width, width + (width // 2) * 3)),
|
||||||
column=0
|
column=0,
|
||||||
)
|
|
||||||
)
|
)
|
||||||
)
|
)
|
||||||
|
|
||||||
if pretrained_dims:
|
if pretrained_dims:
|
||||||
static_vectors = (
|
static_vectors = SpacyVectors >> with_flatten(
|
||||||
SpacyVectors
|
Affine(width, pretrained_dims)
|
||||||
>> with_flatten(Affine(width, pretrained_dims))
|
|
||||||
)
|
)
|
||||||
# TODO Make concatenate support lists
|
# TODO Make concatenate support lists
|
||||||
vectors = concatenate_lists(trained_vectors, static_vectors)
|
vectors = concatenate_lists(trained_vectors, static_vectors)
|
||||||
|
|
@ -517,14 +588,13 @@ def build_text_classifier(nr_class, width=64, **cfg):
|
||||||
vectors = trained_vectors
|
vectors = trained_vectors
|
||||||
vectors_width = width
|
vectors_width = width
|
||||||
static_vectors = None
|
static_vectors = None
|
||||||
cnn_model = (
|
tok2vec = vectors >> with_flatten(
|
||||||
vectors
|
|
||||||
>> with_flatten(
|
|
||||||
LN(Maxout(width, vectors_width))
|
LN(Maxout(width, vectors_width))
|
||||||
>> Residual(
|
>> Residual((ExtractWindow(nW=1) >> LN(Maxout(width, width * 3)))) ** depth,
|
||||||
(ExtractWindow(nW=1) >> LN(Maxout(width, width*3)))
|
pad=depth,
|
||||||
) ** 2, pad=2
|
|
||||||
)
|
)
|
||||||
|
cnn_model = (
|
||||||
|
tok2vec
|
||||||
>> flatten_add_lengths
|
>> flatten_add_lengths
|
||||||
>> ParametricAttention(width)
|
>> ParametricAttention(width)
|
||||||
>> Pooling(sum_pool)
|
>> Pooling(sum_pool)
|
||||||
|
|
@ -534,24 +604,47 @@ def build_text_classifier(nr_class, width=64, **cfg):
|
||||||
|
|
||||||
linear_model = (
|
linear_model = (
|
||||||
_preprocess_doc
|
_preprocess_doc
|
||||||
>> LinearModel(nr_class)
|
>> with_cpu(Model.ops, LinearModel(nr_class))
|
||||||
)
|
)
|
||||||
#model = linear_model >> logistic
|
if cfg.get('exclusive_classes'):
|
||||||
|
output_layer = Softmax(nr_class, nr_class * 2)
|
||||||
model = (
|
else:
|
||||||
(linear_model | cnn_model)
|
output_layer = (
|
||||||
>> zero_init(Affine(nr_class, nr_class*2, drop_factor=0.0))
|
zero_init(Affine(nr_class, nr_class * 2, drop_factor=0.0))
|
||||||
>> logistic
|
>> logistic
|
||||||
)
|
)
|
||||||
|
model = (
|
||||||
|
(linear_model | cnn_model)
|
||||||
|
>> output_layer
|
||||||
|
)
|
||||||
|
model.tok2vec = chain(tok2vec, flatten)
|
||||||
model.nO = nr_class
|
model.nO = nr_class
|
||||||
model.lsuv = False
|
model.lsuv = False
|
||||||
return model
|
return model
|
||||||
|
|
||||||
|
|
||||||
|
def build_simple_cnn_text_classifier(tok2vec, nr_class, exclusive_classes=False, **cfg):
|
||||||
|
"""
|
||||||
|
Build a simple CNN text classifier, given a token-to-vector model as input.
|
||||||
|
If exclusive_classes=True, a softmax non-linearity is applied, so that the
|
||||||
|
outputs sum to 1. If exclusive_classes=False, a logistic non-linearity
|
||||||
|
is applied instead, so that outputs are in the range [0, 1].
|
||||||
|
"""
|
||||||
|
with Model.define_operators({">>": chain}):
|
||||||
|
if exclusive_classes:
|
||||||
|
output_layer = Softmax(nr_class, tok2vec.nO)
|
||||||
|
else:
|
||||||
|
output_layer = zero_init(Affine(nr_class, tok2vec.nO, drop_factor=0.0)) >> logistic
|
||||||
|
model = tok2vec >> flatten_add_lengths >> Pooling(mean_pool) >> output_layer
|
||||||
|
model.tok2vec = chain(tok2vec, flatten)
|
||||||
|
model.nO = nr_class
|
||||||
|
return model
|
||||||
|
|
||||||
|
|
||||||
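To make the docstring above concrete, here is a hedged sketch of wiring the new helper to a Tok2Vec model; the import path and label count are assumptions.

    # Sketch only: three mutually exclusive classes, so the softmax head is used.
    from spacy._ml import Tok2Vec, build_simple_cnn_text_classifier

    tok2vec = Tok2Vec(width=96, embed_size=2000)
    model = build_simple_cnn_text_classifier(tok2vec, nr_class=3, exclusive_classes=True)
    # exclusive_classes=True  -> Softmax output, the three scores sum to 1
    # exclusive_classes=False -> zero-initialised Affine + logistic, each score in [0, 1]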
@layerize
|
@layerize
|
||||||
def flatten(seqs, drop=0.):
|
def flatten(seqs, drop=0.0):
|
||||||
ops = Model.ops
|
ops = Model.ops
|
||||||
lengths = ops.asarray([len(seq) for seq in seqs], dtype='i')
|
lengths = ops.asarray([len(seq) for seq in seqs], dtype="i")
|
||||||
|
|
||||||
def finish_update(d_X, sgd=None):
|
def finish_update(d_X, sgd=None):
|
||||||
return ops.unflatten(d_X, lengths, pad=0)
|
return ops.unflatten(d_X, lengths, pad=0)
|
||||||
|
|
@ -566,14 +659,14 @@ def concatenate_lists(*layers, **kwargs): # pragma: no cover
|
||||||
"""
|
"""
|
||||||
if not layers:
|
if not layers:
|
||||||
return noop()
|
return noop()
|
||||||
drop_factor = kwargs.get('drop_factor', 1.0)
|
drop_factor = kwargs.get("drop_factor", 1.0)
|
||||||
ops = layers[0].ops
|
ops = layers[0].ops
|
||||||
layers = [chain(layer, flatten) for layer in layers]
|
layers = [chain(layer, flatten) for layer in layers]
|
||||||
concat = concatenate(*layers)
|
concat = concatenate(*layers)
|
||||||
|
|
||||||
def concatenate_lists_fwd(Xs, drop=0.):
|
def concatenate_lists_fwd(Xs, drop=0.0):
|
||||||
drop *= drop_factor
|
drop *= drop_factor
|
||||||
lengths = ops.asarray([len(X) for X in Xs], dtype='i')
|
lengths = ops.asarray([len(X) for X in Xs], dtype="i")
|
||||||
flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop)
|
flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop)
|
||||||
ys = ops.unflatten(flat_y, lengths)
|
ys = ops.unflatten(flat_y, lengths)
|
||||||
|
|
||||||
|
|
@ -584,3 +677,79 @@ def concatenate_lists(*layers, **kwargs): # pragma: no cover
|
||||||
|
|
||||||
model = wrap(concatenate_lists_fwd, concat)
|
model = wrap(concatenate_lists_fwd, concat)
|
||||||
return model
|
return model
|
||||||
|
|
||||||
|
|
||||||
|
def masked_language_model(vocab, model, mask_prob=0.15):
|
||||||
|
"""Convert a model into a BERT-style masked language model"""
|
||||||
|
|
||||||
|
random_words = _RandomWords(vocab)
|
||||||
|
|
||||||
|
def mlm_forward(docs, drop=0.0):
|
||||||
|
mask, docs = _apply_mask(docs, random_words, mask_prob=mask_prob)
|
||||||
|
mask = model.ops.asarray(mask).reshape((mask.shape[0], 1))
|
||||||
|
output, backprop = model.begin_update(docs, drop=drop)
|
||||||
|
|
||||||
|
def mlm_backward(d_output, sgd=None):
|
||||||
|
d_output *= 1 - mask
|
||||||
|
return backprop(d_output, sgd=sgd)
|
||||||
|
|
||||||
|
return output, mlm_backward
|
||||||
|
|
||||||
|
return wrap(mlm_forward, model)
|
||||||
|
|
||||||
|
|
||||||
|
class _RandomWords(object):
|
||||||
|
def __init__(self, vocab):
|
||||||
|
self.words = [lex.text for lex in vocab if lex.prob != 0.0]
|
||||||
|
self.probs = [lex.prob for lex in vocab if lex.prob != 0.0]
|
||||||
|
self.words = self.words[:10000]
|
||||||
|
self.probs = self.probs[:10000]
|
||||||
|
self.probs = numpy.exp(numpy.array(self.probs, dtype="f"))
|
||||||
|
self.probs /= self.probs.sum()
|
||||||
|
self._cache = []
|
||||||
|
|
||||||
|
def next(self):
|
||||||
|
if not self._cache:
|
||||||
|
self._cache.extend(
|
||||||
|
numpy.random.choice(len(self.words), 10000, p=self.probs)
|
||||||
|
)
|
||||||
|
index = self._cache.pop()
|
||||||
|
return self.words[index]
|
||||||
|
|
||||||
|
|
||||||
|
def _apply_mask(docs, random_words, mask_prob=0.15):
|
||||||
|
# This needs to be here to avoid circular imports
|
||||||
|
from .tokens.doc import Doc
|
||||||
|
|
||||||
|
N = sum(len(doc) for doc in docs)
|
||||||
|
mask = numpy.random.uniform(0.0, 1.0, (N,))
|
||||||
|
mask = mask >= mask_prob
|
||||||
|
i = 0
|
||||||
|
masked_docs = []
|
||||||
|
for doc in docs:
|
||||||
|
words = []
|
||||||
|
for token in doc:
|
||||||
|
if not mask[i]:
|
||||||
|
word = _replace_word(token.text, random_words)
|
||||||
|
else:
|
||||||
|
word = token.text
|
||||||
|
words.append(word)
|
||||||
|
i += 1
|
||||||
|
spaces = [bool(w.whitespace_) for w in doc]
|
||||||
|
# NB: If you change this implementation to instead modify
|
||||||
|
# the docs in place, take care that the IDs reflect the original
|
||||||
|
# words. Currently we use the original docs to make the vectors
|
||||||
|
# for the target, so we don't lose the original tokens. But if
|
||||||
|
# you modified the docs in place here, you would.
|
||||||
|
masked_docs.append(Doc(doc.vocab, words=words, spaces=spaces))
|
||||||
|
return mask, masked_docs
|
||||||
|
|
||||||
|
|
||||||
|
def _replace_word(word, random_words, mask="[MASK]"):
|
||||||
|
roll = numpy.random.random()
|
||||||
|
if roll < 0.8:
|
||||||
|
return mask
|
||||||
|
elif roll < 0.9:
|
||||||
|
return random_words.next()
|
||||||
|
else:
|
||||||
|
return word
|
||||||
|
|
|
||||||
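A hedged sketch of what the masking helpers above amount to in use; `vocab` and the wrapped model are placeholders, not values from this diff.

    # Sketch only: wrap a token-to-vector model for masked-language-model pretraining.
    from spacy._ml import Tok2Vec, masked_language_model

    tok2vec = Tok2Vec(width=96, embed_size=2000)
    mlm = masked_language_model(vocab, tok2vec, mask_prob=0.15)  # vocab: a spaCy Vocab
    # Roughly 15% of tokens are altered per batch: of those, 80% become "[MASK]",
    # 10% a random frequent word, 10% stay as they were. Gradients for the untouched
    # 85% are zeroed in mlm_backward, so only the altered positions drive the update.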
|
|
@ -1,16 +1,17 @@
|
||||||
# inspired from:
|
# inspired from:
|
||||||
# https://python-packaging-user-guide.readthedocs.org/en/latest/single_source_version/
|
# https://python-packaging-user-guide.readthedocs.org/en/latest/single_source_version/
|
||||||
# https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py
|
# https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py
|
||||||
|
# fmt: off
|
||||||
|
|
||||||
__title__ = 'spacy'
|
__title__ = "spacy-nightly"
|
||||||
__version__ = '2.0.18'
|
__version__ = "2.1.0a13"
|
||||||
__summary__ = 'Industrial-strength Natural Language Processing (NLP) with Python and Cython'
|
__summary__ = "Industrial-strength Natural Language Processing (NLP) with Python and Cython"
|
||||||
__uri__ = 'https://spacy.io'
|
__uri__ = "https://spacy.io"
|
||||||
__author__ = 'Explosion AI'
|
__author__ = "Explosion AI"
|
||||||
__email__ = 'contact@explosion.ai'
|
__email__ = "contact@explosion.ai"
|
||||||
__license__ = 'MIT'
|
__license__ = "MIT"
|
||||||
__release__ = True
|
__release__ = True
|
||||||
|
|
||||||
__download_url__ = 'https://github.com/explosion/spacy-models/releases/download'
|
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
|
||||||
__compatibility__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json'
|
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
|
||||||
__shortcuts__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts-v2.json'
|
__shortcuts__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts-v2.json"
|
||||||
|
|
|
||||||
|
|
@ -131,7 +131,7 @@ def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
|
||||||
'NumValue', 'PartType', 'Polite', 'StyleVariant',
|
'NumValue', 'PartType', 'Polite', 'StyleVariant',
|
||||||
'PronType', 'AdjType', 'Person', 'Variant', 'AdpType',
|
'PronType', 'AdjType', 'Person', 'Variant', 'AdpType',
|
||||||
'Reflex', 'Negative', 'Mood', 'Aspect', 'Case',
|
'Reflex', 'Negative', 'Mood', 'Aspect', 'Case',
|
||||||
'Polarity', 'Animacy' # U20
|
'Polarity', 'PrepCase', 'Animacy' # U20
|
||||||
]
|
]
|
||||||
for key in morph_keys:
|
for key in morph_keys:
|
||||||
if key in stringy_attrs:
|
if key in stringy_attrs:
|
||||||
|
|
|
||||||
|
|
@ -1,11 +1,13 @@
|
||||||
from .download import download
|
from .download import download # noqa: F401
|
||||||
from .info import info
|
from .info import info # noqa: F401
|
||||||
from .link import link
|
from .link import link # noqa: F401
|
||||||
from .package import package
|
from .package import package # noqa: F401
|
||||||
from .profile import profile
|
from .profile import profile # noqa: F401
|
||||||
from .train import train
|
from .train import train # noqa: F401
|
||||||
from .evaluate import evaluate
|
from .pretrain import pretrain # noqa: F401
|
||||||
from .convert import convert
|
from .debug_data import debug_data # noqa: F401
|
||||||
from .vocab import make_vocab as vocab
|
from .evaluate import evaluate # noqa: F401
|
||||||
from .init_model import init_model
|
from .convert import convert # noqa: F401
|
||||||
from .validate import validate
|
from .init_model import init_model # noqa: F401
|
||||||
|
from .validate import validate # noqa: F401
|
||||||
|
from .ud import ud_train, ud_evaluate # noqa: F401
|
||||||
|
|
|
||||||
|
|
@ -1,74 +0,0 @@
|
||||||
# coding: utf8
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
|
|
||||||
class Messages(object):
|
|
||||||
M001 = ("Download successful but linking failed")
|
|
||||||
M002 = ("Creating a shortcut link for 'en' didn't work (maybe you "
|
|
||||||
"don't have admin permissions?), but you can still load the "
|
|
||||||
"model via its full package name: nlp = spacy.load('{name}')")
|
|
||||||
M003 = ("Server error ({code})")
|
|
||||||
M004 = ("Couldn't fetch {desc}. Please find a model for your spaCy "
|
|
||||||
"installation (v{version}), and download it manually. For more "
|
|
||||||
"details, see the documentation: https://spacy.io/usage/models")
|
|
||||||
M005 = ("Compatibility error")
|
|
||||||
M006 = ("No compatible models found for v{version} of spaCy.")
|
|
||||||
M007 = ("No compatible model found for '{name}' (spaCy v{version}).")
|
|
||||||
M008 = ("Can't locate model data")
|
|
||||||
M009 = ("The data should be located in {path}")
|
|
||||||
M010 = ("Can't find the spaCy data path to create model symlink")
|
|
||||||
M011 = ("Make sure a directory `/data` exists within your spaCy "
|
|
||||||
"installation and try again. The data directory should be "
|
|
||||||
"located here:")
|
|
||||||
M012 = ("Link '{name}' already exists")
|
|
||||||
M013 = ("To overwrite an existing link, use the --force flag.")
|
|
||||||
M014 = ("Can't overwrite symlink '{name}'")
|
|
||||||
M015 = ("This can happen if your data directory contains a directory or "
|
|
||||||
"file of the same name.")
|
|
||||||
M016 = ("Error: Couldn't link model to '{name}'")
|
|
||||||
M017 = ("Creating a symlink in spacy/data failed. Make sure you have the "
|
|
||||||
"required permissions and try re-running the command as admin, or "
|
|
||||||
"use a virtualenv. You can still import the model as a module and "
|
|
||||||
"call its load() method, or create the symlink manually.")
|
|
||||||
M018 = ("Linking successful")
|
|
||||||
M019 = ("You can now load the model via spacy.load('{name}')")
|
|
||||||
M020 = ("Can't find model meta.json")
|
|
||||||
M021 = ("Couldn't fetch compatibility table.")
|
|
||||||
M022 = ("Can't find spaCy v{version} in compatibility table")
|
|
||||||
M023 = ("Installed models (spaCy v{version})")
|
|
||||||
M024 = ("No models found in your current environment.")
|
|
||||||
M025 = ("Use the following commands to update the model packages:")
|
|
||||||
M026 = ("The following models are not available for spaCy "
|
|
||||||
"v{version}: {models}")
|
|
||||||
M027 = ("You may also want to overwrite the incompatible links using the "
|
|
||||||
"`python -m spacy link` command with `--force`, or remove them "
|
|
||||||
"from the data directory. Data path: {path}")
|
|
||||||
M028 = ("Input file not found")
|
|
||||||
M029 = ("Output directory not found")
|
|
||||||
M030 = ("Unknown format")
|
|
||||||
M031 = ("Can't find converter for {converter}")
|
|
||||||
M032 = ("Generated output file {name}")
|
|
||||||
M033 = ("Created {n_docs} documents")
|
|
||||||
M034 = ("Evaluation data not found")
|
|
||||||
M035 = ("Visualization output directory not found")
|
|
||||||
M036 = ("Generated {n} parses as HTML")
|
|
||||||
M037 = ("Can't find words frequencies file")
|
|
||||||
M038 = ("Sucessfully compiled vocab")
|
|
||||||
M039 = ("{entries} entries, {vectors} vectors")
|
|
||||||
M040 = ("Output directory not found")
|
|
||||||
M041 = ("Loaded meta.json from file")
|
|
||||||
M042 = ("Successfully created package '{name}'")
|
|
||||||
M043 = ("To build the package, run `python setup.py sdist` in this "
|
|
||||||
"directory.")
|
|
||||||
M044 = ("Package directory already exists")
|
|
||||||
M045 = ("Please delete the directory and try again, or use the `--force` "
|
|
||||||
"flag to overwrite existing directories.")
|
|
||||||
M046 = ("Generating meta.json")
|
|
||||||
M047 = ("Enter the package settings for your model. The following "
|
|
||||||
"information will be read from your model data: pipeline, vectors.")
|
|
||||||
M048 = ("No '{key}' setting found in meta.json")
|
|
||||||
M049 = ("This setting is required to build your package.")
|
|
||||||
M050 = ("Training data not found")
|
|
||||||
M051 = ("Development data not found")
|
|
||||||
M052 = ("Not a valid meta.json format")
|
|
||||||
M053 = ("Expected dict but got: {meta_type}")
|
|
||||||
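The deleted message catalogue above is superseded by wasabi's Printer, which the rewritten CLI modules later in this diff call directly. A minimal sketch of the two calls that appear below; the paths and counts are invented.

    # Sketch only: wasabi is a small console-formatting dependency.
    from wasabi import Printer

    msg = Printer()
    msg.good("Generated output file (3 documents)", "corpus/train.json")   # success message
    msg.fail("Input file not found", "missing.conllu", exits=1)            # error, then exit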
220
spacy/cli/_schemas.py
Normal file
|
|
@ -0,0 +1,220 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
|
# NB: This schema describes the new format of the training data, see #2928
|
||||||
|
TRAINING_SCHEMA = {
|
||||||
|
"$schema": "http://json-schema.org/draft-06/schema",
|
||||||
|
"title": "Training data for spaCy models",
|
||||||
|
"type": "array",
|
||||||
|
"items": {
|
||||||
|
"type": "object",
|
||||||
|
"properties": {
|
||||||
|
"text": {
|
||||||
|
"title": "The text of the training example",
|
||||||
|
"type": "string",
|
||||||
|
"minLength": 1,
|
||||||
|
},
|
||||||
|
"ents": {
|
||||||
|
"title": "Named entity spans in the text",
|
||||||
|
"type": "array",
|
||||||
|
"items": {
|
||||||
|
"type": "object",
|
||||||
|
"properties": {
|
||||||
|
"start": {
|
||||||
|
"title": "Start character offset of the span",
|
||||||
|
"type": "integer",
|
||||||
|
"minimum": 0,
|
||||||
|
},
|
||||||
|
"end": {
|
||||||
|
"title": "End character offset of the span",
|
||||||
|
"type": "integer",
|
||||||
|
"minimum": 0,
|
||||||
|
},
|
||||||
|
"label": {
|
||||||
|
"title": "Entity label",
|
||||||
|
"type": "string",
|
||||||
|
"minLength": 1,
|
||||||
|
"pattern": "^[A-Z0-9]*$",
|
||||||
|
},
|
||||||
|
},
|
||||||
|
"required": ["start", "end", "label"],
|
||||||
|
},
|
||||||
|
},
|
||||||
|
"sents": {
|
||||||
|
"title": "Sentence spans in the text",
|
||||||
|
"type": "array",
|
||||||
|
"items": {
|
||||||
|
"type": "object",
|
||||||
|
"properties": {
|
||||||
|
"start": {
|
||||||
|
"title": "Start character offset of the span",
|
||||||
|
"type": "integer",
|
||||||
|
"minimum": 0,
|
||||||
|
},
|
||||||
|
"end": {
|
||||||
|
"title": "End character offset of the span",
|
||||||
|
"type": "integer",
|
||||||
|
"minimum": 0,
|
||||||
|
},
|
||||||
|
},
|
||||||
|
"required": ["start", "end"],
|
||||||
|
},
|
||||||
|
},
|
||||||
|
"cats": {
|
||||||
|
"title": "Text categories for the text classifier",
|
||||||
|
"type": "object",
|
||||||
|
"patternProperties": {
|
||||||
|
"*": {
|
||||||
|
"title": "A text category",
|
||||||
|
"oneOf": [
|
||||||
|
{"type": "boolean"},
|
||||||
|
{"type": "number", "minimum": 0},
|
||||||
|
],
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"propertyNames": {"pattern": "^[A-Z0-9]*$", "minLength": 1},
|
||||||
|
},
|
||||||
|
"tokens": {
|
||||||
|
"title": "The tokens in the text",
|
||||||
|
"type": "array",
|
||||||
|
"items": {
|
||||||
|
"type": "object",
|
||||||
|
"minProperties": 1,
|
||||||
|
"properties": {
|
||||||
|
"id": {
|
||||||
|
"title": "Token ID, usually token index",
|
||||||
|
"type": "integer",
|
||||||
|
"minimum": 0,
|
||||||
|
},
|
||||||
|
"start": {
|
||||||
|
"title": "Start character offset of the token",
|
||||||
|
"type": "integer",
|
||||||
|
"minimum": 0,
|
||||||
|
},
|
||||||
|
"end": {
|
||||||
|
"title": "End character offset of the token",
|
||||||
|
"type": "integer",
|
||||||
|
"minimum": 0,
|
||||||
|
},
|
||||||
|
"pos": {
|
||||||
|
"title": "Coarse-grained part-of-speech tag",
|
||||||
|
"type": "string",
|
||||||
|
"minLength": 1,
|
||||||
|
},
|
||||||
|
"tag": {
|
||||||
|
"title": "Fine-grained part-of-speech tag",
|
||||||
|
"type": "string",
|
||||||
|
"minLength": 1,
|
||||||
|
},
|
||||||
|
"dep": {
|
||||||
|
"title": "Dependency label",
|
||||||
|
"type": "string",
|
||||||
|
"minLength": 1,
|
||||||
|
},
|
||||||
|
"head": {
|
||||||
|
"title": "Index of the token's head",
|
||||||
|
"type": "integer",
|
||||||
|
"minimum": 0,
|
||||||
|
},
|
||||||
|
},
|
||||||
|
"required": ["start", "end"],
|
||||||
|
},
|
||||||
|
},
|
||||||
|
"_": {"title": "Custom user space", "type": "object"},
|
||||||
|
},
|
||||||
|
"required": ["text"],
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
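An invented record that should satisfy TRAINING_SCHEMA above (the character offsets are checked against the sample text); it is illustrative only and not taken from spaCy's corpora.

    # Sketch only: one document in the new training format described by the schema.
    train_data = [
        {
            "text": "Apple is looking at buying U.K. startup for $1 billion",
            "ents": [
                {"start": 0, "end": 5, "label": "ORG"},
                {"start": 27, "end": 31, "label": "GPE"},
            ],
            "cats": {"BUSINESS": True},
        }
    ]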
|
META_SCHEMA = {
|
||||||
|
"$schema": "http://json-schema.org/draft-06/schema",
|
||||||
|
"type": "object",
|
||||||
|
"properties": {
|
||||||
|
"lang": {
|
||||||
|
"title": "Two-letter language code, e.g. 'en'",
|
||||||
|
"type": "string",
|
||||||
|
"minLength": 2,
|
||||||
|
"maxLength": 2,
|
||||||
|
"pattern": "^[a-z]*$",
|
||||||
|
},
|
||||||
|
"name": {
|
||||||
|
"title": "Model name",
|
||||||
|
"type": "string",
|
||||||
|
"minLength": 1,
|
||||||
|
"pattern": "^[a-z_]*$",
|
||||||
|
},
|
||||||
|
"version": {
|
||||||
|
"title": "Model version",
|
||||||
|
"type": "string",
|
||||||
|
"minLength": 1,
|
||||||
|
"pattern": "^[0-9a-z.-]*$",
|
||||||
|
},
|
||||||
|
"spacy_version": {
|
||||||
|
"title": "Compatible spaCy version identifier",
|
||||||
|
"type": "string",
|
||||||
|
"minLength": 1,
|
||||||
|
"pattern": "^[0-9a-z.-><=]*$",
|
||||||
|
},
|
||||||
|
"parent_package": {
|
||||||
|
"title": "Name of parent spaCy package, e.g. spacy or spacy-nightly",
|
||||||
|
"type": "string",
|
||||||
|
"minLength": 1,
|
||||||
|
"default": "spacy",
|
||||||
|
},
|
||||||
|
"pipeline": {
|
||||||
|
"title": "Names of pipeline components",
|
||||||
|
"type": "array",
|
||||||
|
"items": {"type": "string", "minLength": 1},
|
||||||
|
},
|
||||||
|
"description": {"title": "Model description", "type": "string"},
|
||||||
|
"license": {"title": "Model license", "type": "string"},
|
||||||
|
"author": {"title": "Model author name", "type": "string"},
|
||||||
|
"email": {"title": "Model author email", "type": "string", "format": "email"},
|
||||||
|
"url": {"title": "Model author URL", "type": "string", "format": "uri"},
|
||||||
|
"sources": {
|
||||||
|
"title": "Training data sources",
|
||||||
|
"type": "array",
|
||||||
|
"items": {"type": "string"},
|
||||||
|
},
|
||||||
|
"vectors": {
|
||||||
|
"title": "Included word vectors",
|
||||||
|
"type": "object",
|
||||||
|
"properties": {
|
||||||
|
"keys": {
|
||||||
|
"title": "Number of unique keys",
|
||||||
|
"type": "integer",
|
||||||
|
"minimum": 0,
|
||||||
|
},
|
||||||
|
"vectors": {
|
||||||
|
"title": "Number of unique vectors",
|
||||||
|
"type": "integer",
|
||||||
|
"minimum": 0,
|
||||||
|
},
|
||||||
|
"width": {
|
||||||
|
"title": "Number of dimensions",
|
||||||
|
"type": "integer",
|
||||||
|
"minimum": 0,
|
||||||
|
},
|
||||||
|
},
|
||||||
|
},
|
||||||
|
"accuracy": {
|
||||||
|
"title": "Accuracy numbers",
|
||||||
|
"type": "object",
|
||||||
|
"patternProperties": {"*": {"type": "number", "minimum": 0.0}},
|
||||||
|
},
|
||||||
|
"speed": {
|
||||||
|
"title": "Speed evaluation numbers",
|
||||||
|
"type": "object",
|
||||||
|
"patternProperties": {
|
||||||
|
"*": {
|
||||||
|
"oneOf": [
|
||||||
|
{"type": "number", "minimum": 0.0},
|
||||||
|
{"type": "integer", "minimum": 0},
|
||||||
|
]
|
||||||
|
}
|
||||||
|
},
|
||||||
|
},
|
||||||
|
},
|
||||||
|
"required": ["lang", "name", "version"],
|
||||||
|
}
|
||||||
|
|
@ -3,45 +3,95 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
import plac
|
import plac
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
from wasabi import Printer
|
||||||
|
import srsly
|
||||||
|
|
||||||
|
from .converters import conllu2json, iob2json, conll_ner2json
|
||||||
|
from .converters import ner_jsonl2json
|
||||||
|
|
||||||
from .converters import conllu2json, conllubio2json, iob2json, conll_ner2json
|
|
||||||
from ._messages import Messages
|
|
||||||
from ..util import prints
|
|
||||||
|
|
||||||
# Converters are matched by file extension. To add a converter, add a new
|
# Converters are matched by file extension. To add a converter, add a new
|
||||||
# entry to this dict with the file extension mapped to the converter function
|
# entry to this dict with the file extension mapped to the converter function
|
||||||
# imported from /converters.
|
# imported from /converters.
|
||||||
CONVERTERS = {
|
CONVERTERS = {
|
||||||
'conllubio': conllubio2json,
|
"conllubio": conllu2json,
|
||||||
'conllu': conllu2json,
|
"conllu": conllu2json,
|
||||||
'conll': conllu2json,
|
"conll": conllu2json,
|
||||||
'ner': conll_ner2json,
|
"ner": conll_ner2json,
|
||||||
'iob': iob2json,
|
"iob": iob2json,
|
||||||
|
"jsonl": ner_jsonl2json,
|
||||||
}
|
}
|
||||||
|
|
||||||
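Following the comment above, adding a converter is a one-line registration plus a function with the same shape as the converters in this diff; the "tsv" name and body here are hypothetical.

    # Sketch only: "tsv" and tsv2json are made up for illustration.
    def tsv2json(input_data, n_sents=10, use_morphology=False, lang=None, **kwargs):
        docs = []
        # ... parse the raw string in input_data and build spaCy's JSON training format ...
        return docs

    CONVERTERS["tsv"] = tsv2json  # files with a .tsv suffix now route here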
|
# File types
|
||||||
|
FILE_TYPES = ("json", "jsonl", "msg")
|
||||||
|
FILE_TYPES_STDOUT = ("json", "jsonl")
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
input_file=("input file", "positional", None, str),
|
input_file=("Input file", "positional", None, str),
|
||||||
output_dir=("output directory for converted file", "positional", None, str),
|
output_dir=("Output directory. '-' for stdout.", "positional", None, str),
|
||||||
|
file_type=("Type of data to produce: {}".format(FILE_TYPES), "option", "t", str),
|
||||||
n_sents=("Number of sentences per doc", "option", "n", int),
|
n_sents=("Number of sentences per doc", "option", "n", int),
|
||||||
converter=("Name of converter (auto, iob, conllu or ner)", "option", "c", str),
|
converter=("Converter: {}".format(tuple(CONVERTERS.keys())), "option", "c", str),
|
||||||
morphology=("Enable appending morphology to tags", "flag", "m", bool))
|
lang=("Language (if tokenizer required)", "option", "l", str),
|
||||||
def convert(input_file, output_dir, n_sents=1, morphology=False, converter='auto'):
|
morphology=("Enable appending morphology to tags", "flag", "m", bool),
|
||||||
|
)
|
||||||
|
def convert(
|
||||||
|
input_file,
|
||||||
|
output_dir="-",
|
||||||
|
file_type="jsonl",
|
||||||
|
n_sents=1,
|
||||||
|
morphology=False,
|
||||||
|
converter="auto",
|
||||||
|
lang=None,
|
||||||
|
):
|
||||||
"""
|
"""
|
||||||
Convert files into JSON format for use with train command and other
|
Convert files into JSON format for use with train command and other
|
||||||
experiment management functions.
|
experiment management functions. If no output_dir is specified, the data
|
||||||
|
is written to stdout, so you can pipe it forward to a JSONL file:
|
||||||
|
$ spacy convert some_file.conllu > some_file.jsonl
|
||||||
"""
|
"""
|
||||||
|
msg = Printer()
|
||||||
input_path = Path(input_file)
|
input_path = Path(input_file)
|
||||||
output_path = Path(output_dir)
|
if file_type not in FILE_TYPES:
|
||||||
|
msg.fail(
|
||||||
|
"Unknown file type: '{}'".format(file_type),
|
||||||
|
"Supported file types: '{}'".format(", ".join(FILE_TYPES)),
|
||||||
|
exits=1,
|
||||||
|
)
|
||||||
|
if file_type not in FILE_TYPES_STDOUT and output_dir == "-":
|
||||||
|
# TODO: support msgpack via stdout in srsly?
|
||||||
|
msg.fail(
|
||||||
|
"Can't write .{} data to stdout.".format(file_type),
|
||||||
|
"Please specify an output directory.",
|
||||||
|
exits=1,
|
||||||
|
)
|
||||||
if not input_path.exists():
|
if not input_path.exists():
|
||||||
prints(input_path, title=Messages.M028, exits=1)
|
msg.fail("Input file not found", input_path, exits=1)
|
||||||
if not output_path.exists():
|
if output_dir != "-" and not Path(output_dir).exists():
|
||||||
prints(output_path, title=Messages.M029, exits=1)
|
msg.fail("Output directory not found", output_dir, exits=1)
|
||||||
if converter == 'auto':
|
if converter == "auto":
|
||||||
converter = input_path.suffix[1:]
|
converter = input_path.suffix[1:]
|
||||||
if converter not in CONVERTERS:
|
if converter not in CONVERTERS:
|
||||||
prints(Messages.M031.format(converter=converter),
|
msg.fail("Can't find converter for {}".format(converter), exits=1)
|
||||||
title=Messages.M030, exits=1)
|
# Use converter function to convert data
|
||||||
func = CONVERTERS[converter]
|
func = CONVERTERS[converter]
|
||||||
func(input_path, output_path,
|
input_data = input_path.open("r", encoding="utf-8").read()
|
||||||
n_sents=n_sents, use_morphology=morphology)
|
data = func(input_data, n_sents=n_sents, use_morphology=morphology, lang=lang)
|
||||||
|
if output_dir != "-":
|
||||||
|
# Export data to a file
|
||||||
|
suffix = ".{}".format(file_type)
|
||||||
|
output_file = Path(output_dir) / Path(input_path.parts[-1]).with_suffix(suffix)
|
||||||
|
if file_type == "json":
|
||||||
|
srsly.write_json(output_file, data)
|
||||||
|
elif file_type == "jsonl":
|
||||||
|
srsly.write_jsonl(output_file, data)
|
||||||
|
elif file_type == "msg":
|
||||||
|
srsly.write_msgpack(output_file, data)
|
||||||
|
msg.good("Generated output file ({} documents)".format(len(data)), output_file)
|
||||||
|
else:
|
||||||
|
# Print to stdout
|
||||||
|
if file_type == "json":
|
||||||
|
srsly.write_json("-", data)
|
||||||
|
elif file_type == "jsonl":
|
||||||
|
srsly.write_jsonl("-", data)
|
||||||
|
|
|
||||||
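A hedged sketch of calling the rewritten command directly instead of via `python -m spacy convert`; the file paths are placeholders.

    # Sketch only: mirrors `python -m spacy convert train.conllu corpus -t json -n 10`.
    from spacy.cli import convert

    convert(
        "train.conllu",      # input file; with converter="auto" the .conllu suffix decides
        "corpus",            # output directory ("-" streams to stdout instead)
        file_type="json",    # one of FILE_TYPES: json, jsonl, msg
        n_sents=10,          # sentences merged per output document
    )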
|
|
@ -1,4 +1,4 @@
|
||||||
from .conllu2json import conllu2json
|
from .conllu2json import conllu2json # noqa: F401
|
||||||
from .conllubio2json import conllubio2json
|
from .iob2json import iob2json # noqa: F401
|
||||||
from .iob2json import iob2json
|
from .conll_ner2json import conll_ner2json # noqa: F401
|
||||||
from .conll_ner2json import conll_ner2json
|
from .jsonl2json import ner_jsonl2json # noqa: F401
|
||||||
|
|
|
||||||
|
|
@ -1,52 +1,38 @@
|
||||||
# coding: utf8
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from .._messages import Messages
|
|
||||||
from ...compat import json_dumps, path2str
|
|
||||||
from ...util import prints
|
|
||||||
from ...gold import iob_to_biluo
|
from ...gold import iob_to_biluo
|
||||||
|
|
||||||
|
|
||||||
def conll_ner2json(input_path, output_path, n_sents=10, use_morphology=False):
|
def conll_ner2json(input_data, **kwargs):
|
||||||
"""
|
"""
|
||||||
Convert files in the CoNLL-2003 NER format into JSON format for use with
|
Convert files in the CoNLL-2003 NER format into JSON format for use with
|
||||||
train cli.
|
train cli.
|
||||||
"""
|
"""
|
||||||
docs = read_conll_ner(input_path)
|
delimit_docs = "-DOCSTART- -X- O O"
|
||||||
|
|
||||||
output_filename = input_path.parts[-1].replace(".conll", "") + ".json"
|
|
||||||
output_filename = input_path.parts[-1].replace(".conll", "") + ".json"
|
|
||||||
output_file = output_path / output_filename
|
|
||||||
with output_file.open('w', encoding='utf-8') as f:
|
|
||||||
f.write(json_dumps(docs))
|
|
||||||
prints(Messages.M033.format(n_docs=len(docs)),
|
|
||||||
title=Messages.M032.format(name=path2str(output_file)))
|
|
||||||
|
|
||||||
|
|
||||||
def read_conll_ner(input_path):
|
|
||||||
text = input_path.open('r', encoding='utf-8').read()
|
|
||||||
i = 0
|
|
||||||
delimit_docs = '-DOCSTART- -X- O O'
|
|
||||||
output_docs = []
|
output_docs = []
|
||||||
for doc in text.strip().split(delimit_docs):
|
for doc in input_data.strip().split(delimit_docs):
|
||||||
doc = doc.strip()
|
doc = doc.strip()
|
||||||
if not doc:
|
if not doc:
|
||||||
continue
|
continue
|
||||||
output_doc = []
|
output_doc = []
|
||||||
for sent in doc.split('\n\n'):
|
for sent in doc.split("\n\n"):
|
||||||
sent = sent.strip()
|
sent = sent.strip()
|
||||||
if not sent:
|
if not sent:
|
||||||
continue
|
continue
|
||||||
lines = [line.strip() for line in sent.split('\n') if line.strip()]
|
lines = [line.strip() for line in sent.split("\n") if line.strip()]
|
||||||
words, tags, chunks, iob_ents = zip(*[line.split() for line in lines])
|
words, tags, chunks, iob_ents = zip(*[line.split() for line in lines])
|
||||||
biluo_ents = iob_to_biluo(iob_ents)
|
biluo_ents = iob_to_biluo(iob_ents)
|
||||||
output_doc.append({'tokens': [
|
output_doc.append(
|
||||||
{'orth': w, 'tag': tag, 'ner': ent} for (w, tag, ent) in
|
{
|
||||||
zip(words, tags, biluo_ents)
|
"tokens": [
|
||||||
]})
|
{"orth": w, "tag": tag, "ner": ent}
|
||||||
output_docs.append({
|
for (w, tag, ent) in zip(words, tags, biluo_ents)
|
||||||
'id': len(output_docs),
|
]
|
||||||
'paragraphs': [{'sentences': output_doc}]
|
}
|
||||||
})
|
)
|
||||||
|
output_docs.append(
|
||||||
|
{"id": len(output_docs), "paragraphs": [{"sentences": output_doc}]}
|
||||||
|
)
|
||||||
output_doc = []
|
output_doc = []
|
||||||
return output_docs
|
return output_docs
|
||||||
|
|
|
||||||
|
|
@ -1,34 +1,27 @@
|
||||||
# coding: utf8
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from .._messages import Messages
|
|
||||||
from ...compat import json_dumps, path2str
|
|
||||||
from ...util import prints
|
|
||||||
from ...gold import iob_to_biluo
|
|
||||||
import re
|
import re
|
||||||
|
|
||||||
|
from ...gold import iob_to_biluo
|
||||||
|
|
||||||
def conllu2json(input_path, output_path, n_sents=10, use_morphology=False):
|
|
||||||
|
|
||||||
|
def conllu2json(input_data, n_sents=10, use_morphology=False, lang=None):
|
||||||
"""
|
"""
|
||||||
Convert conllu files into JSON format for use with train cli.
|
Convert conllu files into JSON format for use with train cli.
|
||||||
use_morphology parameter enables appending morphology to tags, which is
|
use_morphology parameter enables appending morphology to tags, which is
|
||||||
useful for languages such as Spanish, where UD tags are not so rich.
|
useful for languages such as Spanish, where UD tags are not so rich.
|
||||||
"""
|
|
||||||
# by @dvsrepo, via #11 explosion/spacy-dev-resources
|
|
||||||
|
|
||||||
"""
|
|
||||||
Extract NER tags if available and convert them so that they follow
|
Extract NER tags if available and convert them so that they follow
|
||||||
BILUO and the Wikipedia scheme
|
BILUO and the Wikipedia scheme
|
||||||
"""
|
"""
|
||||||
|
# by @dvsrepo, via #11 explosion/spacy-dev-resources
|
||||||
# by @katarkor
|
# by @katarkor
|
||||||
|
|
||||||
docs = []
|
docs = []
|
||||||
sentences = []
|
sentences = []
|
||||||
conll_tuples = read_conllx(input_path, use_morphology=use_morphology)
|
conll_tuples = read_conllx(input_data, use_morphology=use_morphology)
|
||||||
checked_for_ner = False
|
checked_for_ner = False
|
||||||
has_ner_tags = False
|
has_ner_tags = False
|
||||||
|
|
||||||
for i, (raw_text, tokens) in enumerate(conll_tuples):
|
for i, (raw_text, tokens) in enumerate(conll_tuples):
|
||||||
sentence, brackets = tokens[0]
|
sentence, brackets = tokens[0]
|
||||||
if not checked_for_ner:
|
if not checked_for_ner:
|
||||||
|
|
@ -37,29 +30,19 @@ def conllu2json(input_path, output_path, n_sents=10, use_morphology=False):
|
||||||
sentences.append(generate_sentence(sentence, has_ner_tags))
|
sentences.append(generate_sentence(sentence, has_ner_tags))
|
||||||
# Real-sized documents could be extracted using the comments on the
|
# Real-sized documents could be extracted using the comments on the
|
||||||
# conllu document
|
# conllu document
|
||||||
|
if len(sentences) % n_sents == 0:
|
||||||
if(len(sentences) % n_sents == 0):
|
|
||||||
doc = create_doc(sentences, i)
|
doc = create_doc(sentences, i)
|
||||||
docs.append(doc)
|
docs.append(doc)
|
||||||
sentences = []
|
sentences = []
|
||||||
|
return docs
|
||||||
output_filename = input_path.parts[-1].replace(".conll", ".json")
|
|
||||||
output_filename = input_path.parts[-1].replace(".conllu", ".json")
|
|
||||||
output_file = output_path / output_filename
|
|
||||||
with output_file.open('w', encoding='utf-8') as f:
|
|
||||||
f.write(json_dumps(docs))
|
|
||||||
prints(Messages.M033.format(n_docs=len(docs)),
|
|
||||||
title=Messages.M032.format(name=path2str(output_file)))
|
|
||||||
|
|
||||||
|
|
||||||
def is_ner(tag):
|
def is_ner(tag):
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Check the 10th column of the first token to determine if the file contains
|
Check the 10th column of the first token to determine if the file contains
|
||||||
NER tags
|
NER tags
|
||||||
"""
|
"""
|
||||||
|
tag_match = re.match("([A-Z_]+)-([A-Z_]+)", tag)
|
||||||
tag_match = re.match('([A-Z_]+)-([A-Z_]+)', tag)
|
|
||||||
if tag_match:
|
if tag_match:
|
||||||
return True
|
return True
|
||||||
elif tag == "O":
|
elif tag == "O":
|
||||||
|
|
@ -67,29 +50,30 @@ def is_ner(tag):
|
||||||
else:
|
else:
|
||||||
return False
|
return False
|
||||||
|
|
||||||
def read_conllx(input_path, use_morphology=False, n=0):
|
|
||||||
text = input_path.open('r', encoding='utf-8').read()
|
def read_conllx(input_data, use_morphology=False, n=0):
|
||||||
i = 0
|
i = 0
|
||||||
for sent in text.strip().split('\n\n'):
|
for sent in input_data.strip().split("\n\n"):
|
||||||
lines = sent.strip().split('\n')
|
lines = sent.strip().split("\n")
|
||||||
if lines:
|
if lines:
|
||||||
while lines[0].startswith('#'):
|
while lines[0].startswith("#"):
|
||||||
lines.pop(0)
|
lines.pop(0)
|
||||||
tokens = []
|
tokens = []
|
||||||
for line in lines:
|
for line in lines:
|
||||||
|
|
||||||
parts = line.split('\t')
|
parts = line.split("\t")
|
||||||
id_, word, lemma, pos, tag, morph, head, dep, _1, iob = parts
|
id_, word, lemma, pos, tag, morph, head, dep, _1, iob = parts
|
||||||
if '-' in id_ or '.' in id_:
|
if "-" in id_ or "." in id_:
|
||||||
continue
|
continue
|
||||||
try:
|
try:
|
||||||
id_ = int(id_) - 1
|
id_ = int(id_) - 1
|
||||||
head = (int(head) - 1) if head != '0' else id_
|
head = (int(head) - 1) if head != "0" else id_
|
||||||
dep = 'ROOT' if dep == 'root' else dep
|
dep = "ROOT" if dep == "root" else dep
|
||||||
tag = pos if tag == '_' else tag
|
tag = pos if tag == "_" else tag
|
||||||
tag = tag+'__'+morph if use_morphology else tag
|
tag = tag + "__" + morph if use_morphology else tag
|
||||||
|
iob = iob if iob else "O"
|
||||||
tokens.append((id_, word, tag, head, dep, iob))
|
tokens.append((id_, word, tag, head, dep, iob))
|
||||||
except:
|
except: # noqa: E722
|
||||||
print(line)
|
print(line)
|
||||||
raise
|
raise
|
||||||
tuples = [list(t) for t in zip(*tokens)]
|
tuples = [list(t) for t in zip(*tokens)]
|
||||||
|
|
@ -98,31 +82,31 @@ def read_conllx(input_path, use_morphology=False, n=0):
|
||||||
if n >= 1 and i >= n:
|
if n >= 1 and i >= n:
|
||||||
break
|
break
|
||||||
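For reference, read_conllx above expects ten tab-separated columns per token line, with this converter reading NER tags from the tenth (MISC) column; the example token is invented.

    # Sketch only: one CoNLL-U token line and the unpacking used above.
    line = "1\tDer\tder\tDET\tART\tCase=Nom|Gender=Masc\t2\tdet\t_\tO"
    id_, word, lemma, pos, tag, morph, head, dep, _1, iob = line.split("\t")
    # id_ and head are 1-based in the file and shifted to 0-based indices above;
    # with use_morphology=True the tag becomes "ART__Case=Nom|Gender=Masc".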
|
|
||||||
def simplify_tags(iob):
|
|
||||||
|
|
||||||
|
def simplify_tags(iob):
|
||||||
"""
|
"""
|
||||||
Simplify tags obtained from the dataset in order to follow Wikipedia
|
Simplify tags obtained from the dataset in order to follow Wikipedia
|
||||||
scheme (PER, LOC, ORG, MISC). 'PER', 'LOC' and 'ORG' keep their tags, while
|
scheme (PER, LOC, ORG, MISC). 'PER', 'LOC' and 'ORG' keep their tags, while
|
||||||
'GPE_LOC' is simplified to 'LOC', 'GPE_ORG' to 'ORG' and all remaining tags to
|
'GPE_LOC' is simplified to 'LOC', 'GPE_ORG' to 'ORG' and all remaining tags to
|
||||||
'MISC'.
|
'MISC'.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
new_iob = []
|
new_iob = []
|
||||||
for tag in iob:
|
for tag in iob:
|
||||||
tag_match = re.match('([A-Z_]+)-([A-Z_]+)', tag)
|
tag_match = re.match("([A-Z_]+)-([A-Z_]+)", tag)
|
||||||
if tag_match:
|
if tag_match:
|
||||||
prefix = tag_match.group(1)
|
prefix = tag_match.group(1)
|
||||||
suffix = tag_match.group(2)
|
suffix = tag_match.group(2)
|
||||||
if suffix == 'GPE_LOC':
|
if suffix == "GPE_LOC":
|
||||||
suffix = 'LOC'
|
suffix = "LOC"
|
||||||
elif suffix == 'GPE_ORG':
|
elif suffix == "GPE_ORG":
|
||||||
suffix = 'ORG'
|
suffix = "ORG"
|
||||||
elif suffix != 'PER' and suffix != 'LOC' and suffix != 'ORG':
|
elif suffix != "PER" and suffix != "LOC" and suffix != "ORG":
|
||||||
suffix = 'MISC'
|
suffix = "MISC"
|
||||||
tag = prefix + '-' + suffix
|
tag = prefix + "-" + suffix
|
||||||
new_iob.append(tag)
|
new_iob.append(tag)
|
||||||
return new_iob
|
return new_iob
|
||||||
|
|
||||||
|
|
||||||
def generate_sentence(sent, has_ner_tags):
|
def generate_sentence(sent, has_ner_tags):
|
||||||
(id_, word, tag, head, dep, iob) = sent
|
(id_, word, tag, head, dep, iob) = sent
|
||||||
sentence = {}
|
sentence = {}
|
||||||
|
|
|
||||||
|
|
@ -1,95 +0,0 @@
|
||||||
# coding: utf8
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
from ...compat import json_dumps, path2str
|
|
||||||
from ...util import prints
|
|
||||||
from ...gold import iob_to_biluo
|
|
||||||
|
|
||||||
def conllubio2json(input_path, output_path, n_sents=10, use_morphology=False):
|
|
||||||
"""
|
|
||||||
Convert conllu files into JSON format for use with train cli.
|
|
||||||
use_morphology parameter enables appending morphology to tags, which is
|
|
||||||
useful for languages such as Spanish, where UD tags are not so rich.
|
|
||||||
"""
|
|
||||||
# by @dvsrepo, via #11 explosion/spacy-dev-resources
|
|
||||||
|
|
||||||
docs = []
|
|
||||||
sentences = []
|
|
||||||
conll_tuples = read_conllx(input_path, use_morphology=use_morphology)
|
|
||||||
|
|
||||||
for i, (raw_text, tokens) in enumerate(conll_tuples):
|
|
||||||
sentence, brackets = tokens[0]
|
|
||||||
sentences.append(generate_sentence(sentence))
|
|
||||||
# Real-sized documents could be extracted using the comments on the
|
|
||||||
# conluu document
|
|
||||||
if(len(sentences) % n_sents == 0):
|
|
||||||
doc = create_doc(sentences, i)
|
|
||||||
docs.append(doc)
|
|
||||||
sentences = []
|
|
||||||
|
|
||||||
output_filename = input_path.parts[-1].replace(".conll", ".json")
|
|
||||||
output_filename = input_path.parts[-1].replace(".conllu", ".json")
|
|
||||||
output_file = output_path / output_filename
|
|
||||||
with output_file.open('w', encoding='utf-8') as f:
|
|
||||||
f.write(json_dumps(docs))
|
|
||||||
prints("Created %d documents" % len(docs),
|
|
||||||
title="Generated output file %s" % path2str(output_file))
|
|
||||||
|
|
||||||
|
|
||||||
def read_conllx(input_path, use_morphology=False, n=0):
|
|
||||||
text = input_path.open('r', encoding='utf-8').read()
|
|
||||||
i = 0
|
|
||||||
for sent in text.strip().split('\n\n'):
|
|
||||||
lines = sent.strip().split('\n')
|
|
||||||
if lines:
|
|
||||||
while lines[0].startswith('#'):
|
|
||||||
lines.pop(0)
|
|
||||||
tokens = []
|
|
||||||
for line in lines:
|
|
||||||
|
|
||||||
parts = line.split('\t')
|
|
||||||
id_, word, lemma, pos, tag, morph, head, dep, _1, ner = parts
|
|
||||||
if '-' in id_ or '.' in id_:
|
|
||||||
continue
|
|
||||||
try:
|
|
||||||
id_ = int(id_) - 1
|
|
||||||
head = (int(head) - 1) if head != '0' else id_
|
|
||||||
dep = 'ROOT' if dep == 'root' else dep
|
|
||||||
tag = pos if tag == '_' else tag
|
|
||||||
tag = tag+'__'+morph if use_morphology else tag
|
|
||||||
ner = ner if ner else 'O'
|
|
||||||
tokens.append((id_, word, tag, head, dep, ner))
|
|
||||||
except:
|
|
||||||
print(line)
|
|
||||||
raise
|
|
||||||
tuples = [list(t) for t in zip(*tokens)]
|
|
||||||
yield (None, [[tuples, []]])
|
|
||||||
i += 1
|
|
||||||
if n >= 1 and i >= n:
|
|
||||||
break
|
|
||||||
|
|
||||||
def generate_sentence(sent):
|
|
||||||
(id_, word, tag, head, dep, ner) = sent
|
|
||||||
sentence = {}
|
|
||||||
tokens = []
|
|
||||||
ner = iob_to_biluo(ner)
|
|
||||||
for i, id in enumerate(id_):
|
|
||||||
token = {}
|
|
||||||
token["orth"] = word[i]
|
|
||||||
token["tag"] = tag[i]
|
|
||||||
token["head"] = head[i] - id
|
|
||||||
token["dep"] = dep[i]
|
|
||||||
token["ner"] = ner[i]
|
|
||||||
tokens.append(token)
|
|
||||||
sentence["tokens"] = tokens
|
|
||||||
return sentence
|
|
||||||
|
|
||||||
|
|
||||||
def create_doc(sentences,id):
|
|
||||||
doc = {}
|
|
||||||
paragraph = {}
|
|
||||||
doc["id"] = id
|
|
||||||
doc["paragraphs"] = []
|
|
||||||
paragraph["sentences"] = sentences
|
|
||||||
doc["paragraphs"].append(paragraph)
|
|
||||||
return doc
|
|
||||||
|
|
@ -1,30 +1,25 @@
|
||||||
# coding: utf8
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
from cytoolz import partition_all, concat
|
|
||||||
|
|
||||||
from .._messages import Messages
|
|
||||||
from ...compat import json_dumps, path2str
|
|
||||||
from ...util import prints
|
|
||||||
from ...gold import iob_to_biluo
|
|
||||||
|
|
||||||
import re
|
import re
|
||||||
|
|
||||||
|
from ...gold import iob_to_biluo
|
||||||
|
from ...util import minibatch
|
||||||
|
|
||||||
def iob2json(input_path, output_path, n_sents=10, *a, **k):
|
|
||||||
|
def iob2json(input_data, n_sents=10, *args, **kwargs):
|
||||||
"""
|
"""
|
||||||
Convert IOB files into JSON format for use with train cli.
|
Convert IOB files into JSON format for use with train cli.
|
||||||
"""
|
"""
|
||||||
with input_path.open('r', encoding='utf8') as file_:
|
docs = []
|
||||||
sentences = read_iob(file_)
|
for group in minibatch(docs, n_sents):
|
||||||
docs = merge_sentences(sentences, n_sents)
|
group = list(group)
|
||||||
output_filename = (input_path.parts[-1]
|
first = group.pop(0)
|
||||||
.replace(".iob2", ".json")
|
to_extend = first["paragraphs"][0]["sentences"]
|
||||||
.replace(".iob", ".json"))
|
for sent in group[1:]:
|
||||||
output_file = output_path / output_filename
|
to_extend.extend(sent["paragraphs"][0]["sentences"])
|
||||||
with output_file.open('w', encoding='utf-8') as f:
|
docs.append(first)
|
||||||
f.write(json_dumps(docs))
|
return docs
|
||||||
prints(Messages.M033.format(n_docs=len(docs)),
|
|
||||||
title=Messages.M032.format(name=path2str(output_file)))
|
|
||||||
|
|
||||||
|
|
||||||
def read_iob(raw_sents):
|
def read_iob(raw_sents):
|
||||||
|
|
@ -32,32 +27,25 @@ def read_iob(raw_sents):
|
||||||
for line in raw_sents:
|
for line in raw_sents:
|
||||||
if not line.strip():
|
if not line.strip():
|
||||||
continue
|
continue
|
||||||
tokens = [re.split('[^\w\-]', line.strip())]
|
# tokens = [t.split("|") for t in line.split()]
|
||||||
|
tokens = [re.split("[^\w\-]", line.strip())]
|
||||||
if len(tokens[0]) == 3:
|
if len(tokens[0]) == 3:
|
||||||
words, pos, iob = zip(*tokens)
|
words, pos, iob = zip(*tokens)
|
||||||
elif len(tokens[0]) == 2:
|
elif len(tokens[0]) == 2:
|
||||||
words, iob = zip(*tokens)
|
words, iob = zip(*tokens)
|
||||||
pos = ['-'] * len(words)
|
pos = ["-"] * len(words)
|
||||||
else:
|
else:
|
||||||
raise ValueError('The iob/iob2 file is not formatted correctly. Try checking whitespace and delimiters.')
|
raise ValueError(
|
||||||
|
"The iob/iob2 file is not formatted correctly. Try checking whitespace and delimiters."
|
||||||
|
)
|
||||||
biluo = iob_to_biluo(iob)
|
biluo = iob_to_biluo(iob)
|
||||||
sentences.append([
|
sentences.append(
|
||||||
{'orth': w, 'tag': p, 'ner': ent}
|
[
|
||||||
|
{"orth": w, "tag": p, "ner": ent}
|
||||||
for (w, p, ent) in zip(words, pos, biluo)
|
for (w, p, ent) in zip(words, pos, biluo)
|
||||||
])
|
]
|
||||||
sentences = [{'tokens': sent} for sent in sentences]
|
)
|
||||||
paragraphs = [{'sentences': [sent]} for sent in sentences]
|
sentences = [{"tokens": sent} for sent in sentences]
|
||||||
docs = [{'id': 0, 'paragraphs': [para]} for para in paragraphs]
|
paragraphs = [{"sentences": [sent]} for sent in sentences]
|
||||||
|
docs = [{"id": 0, "paragraphs": [para]} for para in paragraphs]
|
||||||
return docs
|
return docs
|
||||||
|
|
||||||
def merge_sentences(docs, n_sents):
|
|
||||||
counter = 0
|
|
||||||
merged = []
|
|
||||||
for group in partition_all(n_sents, docs):
|
|
||||||
group = list(group)
|
|
||||||
first = group.pop(0)
|
|
||||||
to_extend = first['paragraphs'][0]['sentences']
|
|
||||||
for sent in group[1:]:
|
|
||||||
to_extend.extend(sent['paragraphs'][0]['sentences'])
|
|
||||||
merged.append(first)
|
|
||||||
return merged
|
|
||||||
|
|
|
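As a reading aid, here is a standalone sketch of the sentence-grouping step used above; the helper name merge_docs is illustrative and not part of spaCy:

# Merge every n_sents single-sentence docs into one doc (standalone sketch).
def merge_docs(docs, n_sents=10):
    merged = []
    for i in range(0, len(docs), n_sents):
        group = docs[i:i + n_sents]
        first = group[0]
        sentences = first["paragraphs"][0]["sentences"]
        for other in group[1:]:
            sentences.extend(other["paragraphs"][0]["sentences"])
        merged.append(first)
    return merged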
||||||
20
spacy/cli/converters/jsonl2json.py
Normal file
|
|
@ -0,0 +1,20 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import srsly
|
||||||
|
|
||||||
|
from ...util import get_lang_class
|
||||||
|
|
||||||
|
|
||||||
|
def ner_jsonl2json(input_data, lang=None, n_sents=10, use_morphology=False):
|
||||||
|
if lang is None:
|
||||||
|
raise ValueError("No --lang specified, but tokenization required")
|
||||||
|
json_docs = []
|
||||||
|
input_tuples = [srsly.json_loads(line) for line in input_data]
|
||||||
|
nlp = get_lang_class(lang)()
|
||||||
|
for i, (raw_text, ents) in enumerate(input_tuples):
|
||||||
|
doc = nlp.make_doc(raw_text)
|
||||||
|
doc[0].is_sent_start = True
|
||||||
|
doc.ents = [doc.char_span(s, e, label=L) for s, e, L in ents["entities"]]
|
||||||
|
json_docs.append(doc.to_json())
|
||||||
|
return json_docs
|
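For context, each input line for this converter is a JSON-encoded (text, annotations) pair with character offsets; the example values below are invented:

import srsly  # same dependency the converter imports

line = '["Uber blew through $1 million", {"entities": [[0, 4, "ORG"]]}]'
raw_text, ents = srsly.json_loads(line)
print(raw_text)          # Uber blew through $1 million
print(ents["entities"])  # [[0, 4, 'ORG']]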
||||||
399
spacy/cli/debug_data.py
Normal file
|
|
@ -0,0 +1,399 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals, print_function
|
||||||
|
|
||||||
|
from pathlib import Path
|
||||||
|
from collections import Counter
|
||||||
|
import plac
|
||||||
|
import sys
|
||||||
|
import srsly
|
||||||
|
from wasabi import Printer, MESSAGES
|
||||||
|
|
||||||
|
from ..gold import GoldCorpus, read_json_object
|
||||||
|
from ..util import load_model, get_lang_class
|
||||||
|
|
||||||
|
|
||||||
|
# Minimum number of expected occurrences of label in data to train new label
|
||||||
|
NEW_LABEL_THRESHOLD = 50
|
||||||
|
# Minimum number of expected examples to train a blank model
|
||||||
|
BLANK_MODEL_MIN_THRESHOLD = 100
|
||||||
|
BLANK_MODEL_THRESHOLD = 2000
|
||||||
|
|
||||||
|
|
||||||
|
@plac.annotations(
|
||||||
|
lang=("model language", "positional", None, str),
|
||||||
|
train_path=("location of JSON-formatted training data", "positional", None, Path),
|
||||||
|
dev_path=("location of JSON-formatted development data", "positional", None, Path),
|
||||||
|
base_model=("name of model to update (optional)", "option", "b", str),
|
||||||
|
pipeline=(
|
||||||
|
"Comma-separated names of pipeline components to train",
|
||||||
|
"option",
|
||||||
|
"p",
|
||||||
|
str,
|
||||||
|
),
|
||||||
|
ignore_warnings=("Ignore warnings, only show stats and errors", "flag", "IW", bool),
|
||||||
|
ignore_validation=(
|
||||||
|
"Don't exit if JSON format validation fails",
|
||||||
|
"flag",
|
||||||
|
"IV",
|
||||||
|
bool,
|
||||||
|
),
|
||||||
|
verbose=("Print additional information and explanations", "flag", "V", bool),
|
||||||
|
no_format=("Don't pretty-print the results", "flag", "NF", bool),
|
||||||
|
)
|
||||||
|
def debug_data(
|
||||||
|
lang,
|
||||||
|
train_path,
|
||||||
|
dev_path,
|
||||||
|
base_model=None,
|
||||||
|
pipeline="tagger,parser,ner",
|
||||||
|
ignore_warnings=False,
|
||||||
|
ignore_validation=False,
|
||||||
|
verbose=False,
|
||||||
|
no_format=False,
|
||||||
|
):
|
||||||
|
msg = Printer(pretty=not no_format, ignore_warnings=ignore_warnings)
|
||||||
|
|
||||||
|
# Make sure all files and paths exists if they are needed
|
||||||
|
if not train_path.exists():
|
||||||
|
msg.fail("Training data not found", train_path, exits=1)
|
||||||
|
if not dev_path.exists():
|
||||||
|
msg.fail("Development data not found", dev_path, exits=1)
|
||||||
|
|
||||||
|
# Initialize the model and pipeline
|
||||||
|
pipeline = [p.strip() for p in pipeline.split(",")]
|
||||||
|
if base_model:
|
||||||
|
nlp = load_model(base_model)
|
||||||
|
else:
|
||||||
|
lang_cls = get_lang_class(lang)
|
||||||
|
nlp = lang_cls()
|
||||||
|
|
||||||
|
msg.divider("Data format validation")
|
||||||
|
# Load the data in one – might take a while but okay in this case
|
||||||
|
train_data = _load_file(train_path, msg)
|
||||||
|
dev_data = _load_file(dev_path, msg)
|
||||||
|
|
||||||
|
# Validate data format using the JSON schema
|
||||||
|
# TODO: update once the new format is ready
|
||||||
|
train_data_errors = [] # TODO: validate_json
|
||||||
|
dev_data_errors = [] # TODO: validate_json
|
||||||
|
if not train_data_errors:
|
||||||
|
msg.good("Training data JSON format is valid")
|
||||||
|
if not dev_data_errors:
|
||||||
|
msg.good("Development data JSON format is valid")
|
||||||
|
for error in train_data_errors:
|
||||||
|
msg.fail("Training data: {}".format(error))
|
||||||
|
for error in dev_data_errors:
|
||||||
|
msg.fail("Develoment data: {}".format(error))
|
||||||
|
if (train_data_errors or dev_data_errors) and not ignore_validation:
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
# Create the gold corpus to be able to better analyze data
|
||||||
|
with msg.loading("Analyzing corpus..."):
|
||||||
|
train_data = read_json_object(train_data)
|
||||||
|
dev_data = read_json_object(dev_data)
|
||||||
|
corpus = GoldCorpus(train_data, dev_data)
|
||||||
|
train_docs = list(corpus.train_docs(nlp))
|
||||||
|
dev_docs = list(corpus.dev_docs(nlp))
|
||||||
|
msg.good("Corpus is loadable")
|
||||||
|
|
||||||
|
# Create all gold data here to avoid iterating over the train_docs constantly
|
||||||
|
gold_data = _compile_gold(train_docs, pipeline)
|
||||||
|
train_texts = gold_data["texts"]
|
||||||
|
dev_texts = set([doc.text for doc, gold in dev_docs])
|
||||||
|
|
||||||
|
msg.divider("Training stats")
|
||||||
|
msg.text("Training pipeline: {}".format(", ".join(pipeline)))
|
||||||
|
for pipe in [p for p in pipeline if p not in nlp.factories]:
|
||||||
|
msg.fail("Pipeline component '{}' not available in factories".format(pipe))
|
||||||
|
if base_model:
|
||||||
|
msg.text("Starting with base model '{}'".format(base_model))
|
||||||
|
else:
|
||||||
|
msg.text("Starting with blank model '{}'".format(lang))
|
||||||
|
msg.text("{} training docs".format(len(train_docs)))
|
||||||
|
msg.text("{} evaluation docs".format(len(dev_docs)))
|
||||||
|
|
||||||
|
overlap = len(train_texts.intersection(dev_texts))
|
||||||
|
if overlap:
|
||||||
|
msg.warn("{} training examples also in evaluation data".format(overlap))
|
||||||
|
else:
|
||||||
|
msg.good("No overlap between training and evaluation data")
|
||||||
|
if not base_model and len(train_docs) < BLANK_MODEL_THRESHOLD:
|
||||||
|
text = "Low number of examples to train from a blank model ({})".format(
|
||||||
|
len(train_docs)
|
||||||
|
)
|
||||||
|
if len(train_docs) < BLANK_MODEL_MIN_THRESHOLD:
|
||||||
|
msg.fail(text)
|
||||||
|
else:
|
||||||
|
msg.warn(text)
|
||||||
|
msg.text(
|
||||||
|
"It's recommended to use at least {} examples (minimum {})".format(
|
||||||
|
BLANK_MODEL_THRESHOLD, BLANK_MODEL_MIN_THRESHOLD
|
||||||
|
),
|
||||||
|
show=verbose,
|
||||||
|
)
|
||||||
|
|
||||||
|
msg.divider("Vocab & Vectors")
|
||||||
|
n_words = gold_data["n_words"]
|
||||||
|
msg.info(
|
||||||
|
"{} total {} in the data ({} unique)".format(
|
||||||
|
n_words, "word" if n_words == 1 else "words", len(gold_data["words"])
|
||||||
|
)
|
||||||
|
)
|
||||||
|
most_common_words = gold_data["words"].most_common(10)
|
||||||
|
msg.text(
|
||||||
|
"10 most common words: {}".format(
|
||||||
|
_format_labels(most_common_words, counts=True)
|
||||||
|
),
|
||||||
|
show=verbose,
|
||||||
|
)
|
||||||
|
if len(nlp.vocab.vectors):
|
||||||
|
msg.info(
|
||||||
|
"{} vectors ({} unique keys, {} dimensions)".format(
|
||||||
|
len(nlp.vocab.vectors),
|
||||||
|
nlp.vocab.vectors.n_keys,
|
||||||
|
nlp.vocab.vectors_length,
|
||||||
|
)
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
msg.info("No word vectors present in the model")
|
||||||
|
|
||||||
|
if "ner" in pipeline:
|
||||||
|
# Get all unique NER labels present in the data
|
||||||
|
labels = set(label for label in gold_data["ner"] if label not in ("O", "-"))
|
||||||
|
label_counts = gold_data["ner"]
|
||||||
|
model_labels = _get_labels_from_model(nlp, "ner")
|
||||||
|
new_labels = [l for l in labels if l not in model_labels]
|
||||||
|
existing_labels = [l for l in labels if l in model_labels]
|
||||||
|
has_low_data_warning = False
|
||||||
|
has_no_neg_warning = False
|
||||||
|
has_ws_ents_error = False
|
||||||
|
|
||||||
|
msg.divider("Named Entity Recognition")
|
||||||
|
msg.info(
|
||||||
|
"{} new {}, {} existing {}".format(
|
||||||
|
len(new_labels),
|
||||||
|
"label" if len(new_labels) == 1 else "labels",
|
||||||
|
len(existing_labels),
|
||||||
|
"label" if len(existing_labels) == 1 else "labels",
|
||||||
|
)
|
||||||
|
)
|
||||||
|
missing_values = label_counts["-"]
|
||||||
|
msg.text(
|
||||||
|
"{} missing {} (tokens with '-' label)".format(
|
||||||
|
missing_values, "value" if missing_values == 1 else "values"
|
||||||
|
)
|
||||||
|
)
|
||||||
|
if new_labels:
|
||||||
|
labels_with_counts = [
|
||||||
|
(label, count)
|
||||||
|
for label, count in label_counts.most_common()
|
||||||
|
if label != "-"
|
||||||
|
]
|
||||||
|
labels_with_counts = _format_labels(labels_with_counts, counts=True)
|
||||||
|
msg.text("New: {}".format(labels_with_counts), show=verbose)
|
||||||
|
if existing_labels:
|
||||||
|
msg.text(
|
||||||
|
"Existing: {}".format(_format_labels(existing_labels)), show=verbose
|
||||||
|
)
|
||||||
|
|
||||||
|
if gold_data["ws_ents"]:
|
||||||
|
msg.fail("{} invalid whitespace entity spans".format(gold_data["ws_ents"]))
|
||||||
|
has_ws_ents_error = True
|
||||||
|
|
||||||
|
for label in new_labels:
|
||||||
|
if label_counts[label] <= NEW_LABEL_THRESHOLD:
|
||||||
|
msg.warn(
|
||||||
|
"Low number of examples for new label '{}' ({})".format(
|
||||||
|
label, label_counts[label]
|
||||||
|
)
|
||||||
|
)
|
||||||
|
has_low_data_warning = True
|
||||||
|
|
||||||
|
with msg.loading("Analyzing label distribution..."):
|
||||||
|
neg_docs = _get_examples_without_label(train_docs, label)
|
||||||
|
if neg_docs == 0:
|
||||||
|
msg.warn(
|
||||||
|
"No examples for texts WITHOUT new label '{}'".format(label)
|
||||||
|
)
|
||||||
|
has_no_neg_warning = True
|
||||||
|
|
||||||
|
if not has_low_data_warning:
|
||||||
|
msg.good("Good amount of examples for all labels")
|
||||||
|
if not has_no_neg_warning:
|
||||||
|
msg.good("Examples without occurences available for all labels")
|
||||||
|
if not has_ws_ents_error:
|
||||||
|
msg.good("No entities consisting of or starting/ending with whitespace")
|
||||||
|
|
||||||
|
if has_low_data_warning:
|
||||||
|
msg.text(
|
||||||
|
"To train a new entity type, your data should include at "
|
||||||
|
"least {} insteances of the new label".format(NEW_LABEL_THRESHOLD),
|
||||||
|
show=verbose,
|
||||||
|
)
|
||||||
|
if has_no_neg_warning:
|
||||||
|
msg.text(
|
||||||
|
"Training data should always include examples of entities "
|
||||||
|
"in context, as well as examples without a given entity "
|
||||||
|
"type.",
|
||||||
|
show=verbose,
|
||||||
|
)
|
||||||
|
if has_ws_ents_error:
|
||||||
|
msg.text(
|
||||||
|
"As of spaCy v2.1.0, entity spans consisting of or starting/ending "
|
||||||
|
"with whitespace characters are considered invalid."
|
||||||
|
)
|
||||||
|
|
||||||
|
if "textcat" in pipeline:
|
||||||
|
msg.divider("Text Classification")
|
||||||
|
labels = [label for label in gold_data["textcat"]]
|
||||||
|
model_labels = _get_labels_from_model(nlp, "textcat")
|
||||||
|
new_labels = [l for l in labels if l not in model_labels]
|
||||||
|
existing_labels = [l for l in labels if l in model_labels]
|
||||||
|
msg.info(
|
||||||
|
"Text Classification: {} new label(s), {} existing label(s)".format(
|
||||||
|
len(new_labels), len(existing_labels)
|
||||||
|
)
|
||||||
|
)
|
||||||
|
if new_labels:
|
||||||
|
labels_with_counts = _format_labels(
|
||||||
|
gold_data["textcat"].most_common(), counts=True
|
||||||
|
)
|
||||||
|
msg.text("New: {}".format(labels_with_counts), show=verbose)
|
||||||
|
if existing_labels:
|
||||||
|
msg.text(
|
||||||
|
"Existing: {}".format(_format_labels(existing_labels)), show=verbose
|
||||||
|
)
|
||||||
|
|
||||||
|
if "tagger" in pipeline:
|
||||||
|
msg.divider("Part-of-speech Tagging")
|
||||||
|
labels = [label for label in gold_data["tags"]]
|
||||||
|
tag_map = nlp.Defaults.tag_map
|
||||||
|
msg.info(
|
||||||
|
"{} {} in data ({} {} in tag map)".format(
|
||||||
|
len(labels),
|
||||||
|
"label" if len(labels) == 1 else "labels",
|
||||||
|
len(tag_map),
|
||||||
|
"label" if len(tag_map) == 1 else "labels",
|
||||||
|
)
|
||||||
|
)
|
||||||
|
labels_with_counts = _format_labels(
|
||||||
|
gold_data["tags"].most_common(), counts=True
|
||||||
|
)
|
||||||
|
msg.text(labels_with_counts, show=verbose)
|
||||||
|
non_tagmap = [l for l in labels if l not in tag_map]
|
||||||
|
if not non_tagmap:
|
||||||
|
msg.good("All labels present in tag map for language '{}'".format(nlp.lang))
|
||||||
|
for label in non_tagmap:
|
||||||
|
msg.fail(
|
||||||
|
"Label '{}' not found in tag map for language '{}'".format(
|
||||||
|
label, nlp.lang
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
if "parser" in pipeline:
|
||||||
|
msg.divider("Dependency Parsing")
|
||||||
|
labels = [label for label in gold_data["deps"]]
|
||||||
|
msg.info(
|
||||||
|
"{} {} in data".format(
|
||||||
|
len(labels), "label" if len(labels) == 1 else "labels"
|
||||||
|
)
|
||||||
|
)
|
||||||
|
labels_with_counts = _format_labels(
|
||||||
|
gold_data["deps"].most_common(), counts=True
|
||||||
|
)
|
||||||
|
msg.text(labels_with_counts, show=verbose)
|
||||||
|
|
||||||
|
msg.divider("Summary")
|
||||||
|
good_counts = msg.counts[MESSAGES.GOOD]
|
||||||
|
warn_counts = msg.counts[MESSAGES.WARN]
|
||||||
|
fail_counts = msg.counts[MESSAGES.FAIL]
|
||||||
|
if good_counts:
|
||||||
|
msg.good(
|
||||||
|
"{} {} passed".format(
|
||||||
|
good_counts, "check" if good_counts == 1 else "checks"
|
||||||
|
)
|
||||||
|
)
|
||||||
|
if warn_counts:
|
||||||
|
msg.warn(
|
||||||
|
"{} {}".format(warn_counts, "warning" if warn_counts == 1 else "warnings")
|
||||||
|
)
|
||||||
|
if fail_counts:
|
||||||
|
msg.fail("{} {}".format(fail_counts, "error" if fail_counts == 1 else "errors"))
|
||||||
|
|
||||||
|
if fail_counts:
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
|
||||||
|
def _load_file(file_path, msg):
|
||||||
|
file_name = file_path.parts[-1]
|
||||||
|
if file_path.suffix == ".json":
|
||||||
|
with msg.loading("Loading {}...".format(file_name)):
|
||||||
|
data = srsly.read_json(file_path)
|
||||||
|
msg.good("Loaded {}".format(file_name))
|
||||||
|
return data
|
||||||
|
elif file_path.suffix == ".jsonl":
|
||||||
|
with msg.loading("Loading {}...".format(file_name)):
|
||||||
|
data = srsly.read_jsonl(file_path)
|
||||||
|
msg.good("Loaded {}".format(file_name))
|
||||||
|
return data
|
||||||
|
msg.fail(
|
||||||
|
"Can't load file extension {}".format(file_path.suffix),
|
||||||
|
"Expected .json or .jsonl",
|
||||||
|
exits=1,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _compile_gold(train_docs, pipeline):
|
||||||
|
data = {
|
||||||
|
"ner": Counter(),
|
||||||
|
"cats": Counter(),
|
||||||
|
"tags": Counter(),
|
||||||
|
"deps": Counter(),
|
||||||
|
"words": Counter(),
|
||||||
|
"ws_ents": 0,
|
||||||
|
"n_words": 0,
|
||||||
|
"texts": set(),
|
||||||
|
}
|
||||||
|
for doc, gold in train_docs:
|
||||||
|
data["words"].update(gold.words)
|
||||||
|
data["n_words"] += len(gold.words)
|
||||||
|
data["texts"].add(doc.text)
|
||||||
|
if "ner" in pipeline:
|
||||||
|
for i, label in enumerate(gold.ner):
|
||||||
|
if label.startswith(("B-", "U-", "L-")) and doc[i].is_space:
|
||||||
|
# "Illegal" whitespace entity
|
||||||
|
data["ws_ents"] += 1
|
||||||
|
if label.startswith(("B-", "U-")):
|
||||||
|
combined_label = label.split("-")[1]
|
||||||
|
data["ner"][combined_label] += 1
|
||||||
|
elif label == "-":
|
||||||
|
data["ner"]["-"] += 1
|
||||||
|
if "textcat" in pipeline:
|
||||||
|
data["cats"].update(gold.cats)
|
||||||
|
if "tagger" in pipeline:
|
||||||
|
data["tags"].update(gold.tags)
|
||||||
|
if "parser" in pipeline:
|
||||||
|
data["deps"].update(gold.labels)
|
||||||
|
return data
|
||||||
|
|
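Roughly, the dictionary returned by _compile_gold looks like this (all counts are invented for illustration):

from collections import Counter

gold_data = {
    "ner": Counter({"PERSON": 120, "ORG": 80, "-": 15}),   # entity label counts, "-" = missing
    "cats": Counter(),                                      # textcat labels, if any
    "tags": Counter({"NN": 300, "VBZ": 150}),               # POS tags
    "deps": Counter({"nsubj": 90, "dobj": 60}),             # dependency labels
    "words": Counter({"the": 500, "of": 300}),              # token frequencies
    "ws_ents": 0,                                           # whitespace entity spans
    "n_words": 4200,                                        # total token count
    "texts": {"First training text ...", "Second training text ..."},
}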
||||||
|
|
||||||
|
def _format_labels(labels, counts=False):
|
||||||
|
if counts:
|
||||||
|
return ", ".join(["'{}' ({})".format(l, c) for l, c in labels])
|
||||||
|
return ", ".join(["'{}'".format(l) for l in labels])
|
||||||
|
|
||||||
|
|
||||||
|
def _get_examples_without_label(data, label):
|
||||||
|
count = 0
|
||||||
|
for doc, gold in data:
|
||||||
|
labels = [label.split("-")[1] for label in gold.ner if label not in ("O", "-")]
|
||||||
|
if label not in labels:
|
||||||
|
count += 1
|
||||||
|
return count
|
||||||
|
|
||||||
|
|
||||||
|
def _get_labels_from_model(nlp, pipe_name):
|
||||||
|
if pipe_name not in nlp.pipe_names:
|
||||||
|
return set()
|
||||||
|
pipe = nlp.get_pipe(pipe_name)
|
||||||
|
return pipe.labels
|
||||||
|
|
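For orientation, the small helpers defined at the end of this file behave roughly as follows (assuming they are in scope; the values are illustrative):

print(_format_labels(["PERSON", "ORG"]))
# 'PERSON', 'ORG'
print(_format_labels([("PERSON", 12), ("ORG", 3)], counts=True))
# 'PERSON' (12), 'ORG' (3)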
@ -6,41 +6,49 @@ import requests
|
||||||
import os
|
import os
|
||||||
import subprocess
|
import subprocess
|
||||||
import sys
|
import sys
|
||||||
|
from wasabi import Printer
|
||||||
|
|
||||||
from ._messages import Messages
|
|
||||||
from .link import link
|
from .link import link
|
||||||
from ..util import prints, get_package_path
|
from ..util import get_package_path
|
||||||
from .. import about
|
from .. import about
|
||||||
|
|
||||||
|
|
||||||
|
msg = Printer()
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
model=("model to download, shortcut or name", "positional", None, str),
|
model=("Model to download (shortcut or name)", "positional", None, str),
|
||||||
direct=("force direct download. Needs model name with version and won't "
|
direct=("Force direct download of name + version", "flag", "d", bool),
|
||||||
"perform compatibility check", "flag", "d", bool),
|
pip_args=("additional arguments to be passed to `pip install` on model install"),
|
||||||
pip_args=("additional arguments to be passed to `pip install` when "
|
)
|
||||||
"installing the model"))
|
|
||||||
def download(model, direct=False, *pip_args):
|
def download(model, direct=False, *pip_args):
|
||||||
"""
|
"""
|
||||||
Download compatible model from default download path using pip. Model
|
Download compatible model from default download path using pip. Model
|
||||||
can be shortcut, model name or, if --direct flag is set, full model name
|
can be shortcut, model name or, if --direct flag is set, full model name
|
||||||
with version.
|
with version. For direct downloads, the compatibility check will be skipped.
|
||||||
"""
|
"""
|
||||||
|
dl_tpl = "{m}-{v}/{m}-{v}.tar.gz#egg={m}=={v}"
|
||||||
if direct:
|
if direct:
|
||||||
components = model.split("-")
|
components = model.split("-")
|
||||||
model_name = "".join(components[:-1])
|
model_name = "".join(components[:-1])
|
||||||
version = components[-1]
|
version = components[-1]
|
||||||
dl = download_model(
|
dl = download_model(dl_tpl.format(m=model_name, v=version), pip_args)
|
||||||
'{m}-{v}/{m}-{v}.tar.gz#egg={m}=={v}'.format(
|
|
||||||
m=model_name, v=version), pip_args)
|
|
||||||
else:
|
else:
|
||||||
shortcuts = get_json(about.__shortcuts__, "available shortcuts")
|
shortcuts = get_json(about.__shortcuts__, "available shortcuts")
|
||||||
model_name = shortcuts.get(model, model)
|
model_name = shortcuts.get(model, model)
|
||||||
compatibility = get_compatibility()
|
compatibility = get_compatibility()
|
||||||
version = get_version(model_name, compatibility)
|
version = get_version(model_name, compatibility)
|
||||||
dl = download_model('{m}-{v}/{m}-{v}.tar.gz#egg={m}=={v}'
|
dl = download_model(dl_tpl.format(m=model_name, v=version), pip_args)
|
||||||
.format(m=model_name, v=version), pip_args)
|
|
||||||
if dl != 0: # if download subprocess doesn't return 0, exit
|
if dl != 0: # if download subprocess doesn't return 0, exit
|
||||||
sys.exit(dl)
|
sys.exit(dl)
|
||||||
|
msg.good(
|
||||||
|
"Download and installation successful",
|
||||||
|
"You can now load the model via spacy.load('{}')".format(model_name),
|
||||||
|
)
|
||||||
|
# Only create symlink if the model is installed via a shortcut like 'en'.
|
||||||
|
# There's no real advantage over an additional symlink for en_core_web_sm
|
||||||
|
# and if anything, it's more error prone and causes more confusion.
|
||||||
|
if model in shortcuts:
|
||||||
try:
|
try:
|
||||||
# Get package path here because link uses
|
# Get package path here because link uses
|
||||||
# pip.get_installed_distributions() to check if model is a
|
# pip.get_installed_distributions() to check if model is a
|
||||||
|
|
@ -48,44 +56,58 @@ def download(model, direct=False, *pip_args):
|
||||||
# subprocess
|
# subprocess
|
||||||
package_path = get_package_path(model_name)
|
package_path = get_package_path(model_name)
|
||||||
link(model_name, model, force=True, model_path=package_path)
|
link(model_name, model, force=True, model_path=package_path)
|
||||||
except:
|
except: # noqa: E722
|
||||||
# Dirty, but since spacy.download and the auto-linking is
|
# Dirty, but since spacy.download and the auto-linking is
|
||||||
# mostly a convenience wrapper, it's best to show a success
|
# mostly a convenience wrapper, it's best to show a success
|
||||||
# message and loading instructions, even if linking fails.
|
# message and loading instructions, even if linking fails.
|
||||||
prints(Messages.M001.format(name=model_name), title=Messages.M002)
|
msg.warn(
|
||||||
|
"Download successful but linking failed",
|
||||||
|
"Creating a shortcut link for '{}' didn't work (maybe you "
|
||||||
|
"don't have admin permissions?), but you can still load "
|
||||||
|
"the model via its full package name: "
|
||||||
|
"nlp = spacy.load('{}')".format(model, model_name),
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
def get_json(url, desc):
|
def get_json(url, desc):
|
||||||
r = requests.get(url)
|
r = requests.get(url)
|
||||||
if r.status_code != 200:
|
if r.status_code != 200:
|
||||||
prints(Messages.M004.format(desc=desc, version=about.__version__),
|
msg.fail(
|
||||||
title=Messages.M003.format(code=r.status_code), exits=1)
|
"Server error ({})".format(r.status_code),
|
||||||
|
"Couldn't fetch {}. Please find a model for your spaCy "
|
||||||
|
"installation (v{}), and download it manually. For more "
|
||||||
|
"details, see the documentation: "
|
||||||
|
"https://spacy.io/usage/models".format(desc, about.__version__),
|
||||||
|
exits=1,
|
||||||
|
)
|
||||||
return r.json()
|
return r.json()
|
||||||
|
|
||||||
|
|
||||||
def get_compatibility():
|
def get_compatibility():
|
||||||
version = about.__version__
|
version = about.__version__
|
||||||
version = version.rsplit('.dev', 1)[0]
|
version = version.rsplit(".dev", 1)[0]
|
||||||
comp_table = get_json(about.__compatibility__, "compatibility table")
|
comp_table = get_json(about.__compatibility__, "compatibility table")
|
||||||
comp = comp_table['spacy']
|
comp = comp_table["spacy"]
|
||||||
if version not in comp:
|
if version not in comp:
|
||||||
prints(Messages.M006.format(version=version), title=Messages.M005,
|
msg.fail("No compatible models found for v{} of spaCy".format(version), exits=1)
|
||||||
exits=1)
|
|
||||||
return comp[version]
|
return comp[version]
|
||||||
|
|
||||||
|
|
||||||
def get_version(model, comp):
|
def get_version(model, comp):
|
||||||
model = model.rsplit('.dev', 1)[0]
|
model = model.rsplit(".dev", 1)[0]
|
||||||
if model not in comp:
|
if model not in comp:
|
||||||
prints(Messages.M007.format(name=model, version=about.__version__),
|
msg.fail(
|
||||||
title=Messages.M005, exits=1)
|
"No compatible model found for '{}' "
|
||||||
|
"(spaCy v{}).".format(model, about.__version__),
|
||||||
|
exits=1,
|
||||||
|
)
|
||||||
return comp[model][0]
|
return comp[model][0]
|
||||||
|
|
||||||
|
|
||||||
def download_model(filename, user_pip_args=None):
|
def download_model(filename, user_pip_args=None):
|
||||||
download_url = about.__download_url__ + '/' + filename
|
download_url = about.__download_url__ + "/" + filename
|
||||||
pip_args = ['--no-cache-dir', '--no-deps']
|
pip_args = ["--no-cache-dir", "--no-deps"]
|
||||||
if user_pip_args:
|
if user_pip_args:
|
||||||
pip_args.extend(user_pip_args)
|
pip_args.extend(user_pip_args)
|
||||||
cmd = [sys.executable, '-m', 'pip', 'install'] + pip_args + [download_url]
|
cmd = [sys.executable, "-m", "pip", "install"] + pip_args + [download_url]
|
||||||
return subprocess.call(cmd, env=os.environ.copy())
|
return subprocess.call(cmd, env=os.environ.copy())
|
||||||
|
|
|
||||||
|
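For illustration, this is roughly the pip command the download path above ends up running; the base URL is a placeholder, the real one comes from about.__download_url__:

import sys

dl_tpl = "{m}-{v}/{m}-{v}.tar.gz#egg={m}=={v}"
filename = dl_tpl.format(m="en_core_web_sm", v="2.1.0")       # illustrative model and version
download_url = "https://example-release-host/" + filename     # placeholder base URL
cmd = [sys.executable, "-m", "pip", "install", "--no-cache-dir", "--no-deps", download_url]
print(" ".join(cmd))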
|
@ -3,30 +3,34 @@ from __future__ import unicode_literals, division, print_function
|
||||||
|
|
||||||
import plac
|
import plac
|
||||||
from timeit import default_timer as timer
|
from timeit import default_timer as timer
|
||||||
|
from wasabi import Printer
|
||||||
|
|
||||||
from ._messages import Messages
|
|
||||||
from ..gold import GoldCorpus
|
from ..gold import GoldCorpus
|
||||||
from ..util import prints
|
|
||||||
from .. import util
|
from .. import util
|
||||||
from .. import displacy
|
from .. import displacy
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
model=("model name or path", "positional", None, str),
|
model=("Model name or path", "positional", None, str),
|
||||||
data_path=("location of JSON-formatted evaluation data", "positional",
|
data_path=("Location of JSON-formatted evaluation data", "positional", None, str),
|
||||||
None, str),
|
gold_preproc=("Use gold preprocessing", "flag", "G", bool),
|
||||||
gold_preproc=("use gold preprocessing", "flag", "G", bool),
|
gpu_id=("Use GPU", "option", "g", int),
|
||||||
gpu_id=("use GPU", "option", "g", int),
|
displacy_path=("Directory to output rendered parses as HTML", "option", "dp", str),
|
||||||
displacy_path=("directory to output rendered parses as HTML", "option",
|
displacy_limit=("Limit of parses to render as HTML", "option", "dl", int),
|
||||||
"dp", str),
|
)
|
||||||
displacy_limit=("limit of parses to render as HTML", "option", "dl", int))
|
def evaluate(
|
||||||
def evaluate(model, data_path, gpu_id=-1, gold_preproc=False, displacy_path=None,
|
model,
|
||||||
displacy_limit=25):
|
data_path,
|
||||||
|
gpu_id=-1,
|
||||||
|
gold_preproc=False,
|
||||||
|
displacy_path=None,
|
||||||
|
displacy_limit=25,
|
||||||
|
):
|
||||||
"""
|
"""
|
||||||
Evaluate a model. To render a sample of parses in an HTML file, set an
|
Evaluate a model. To render a sample of parses in an HTML file, set an
|
||||||
output directory as the displacy_path argument.
|
output directory as the displacy_path argument.
|
||||||
"""
|
"""
|
||||||
|
msg = Printer()
|
||||||
util.fix_random_seed()
|
util.fix_random_seed()
|
||||||
if gpu_id >= 0:
|
if gpu_id >= 0:
|
||||||
util.use_gpu(gpu_id)
|
util.use_gpu(gpu_id)
|
||||||
|
|
@ -34,9 +38,9 @@ def evaluate(model, data_path, gpu_id=-1, gold_preproc=False, displacy_path=None
|
||||||
data_path = util.ensure_path(data_path)
|
data_path = util.ensure_path(data_path)
|
||||||
displacy_path = util.ensure_path(displacy_path)
|
displacy_path = util.ensure_path(displacy_path)
|
||||||
if not data_path.exists():
|
if not data_path.exists():
|
||||||
prints(data_path, title=Messages.M034, exits=1)
|
msg.fail("Evaluation data not found", data_path, exits=1)
|
||||||
if displacy_path and not displacy_path.exists():
|
if displacy_path and not displacy_path.exists():
|
||||||
prints(displacy_path, title=Messages.M035, exits=1)
|
msg.fail("Visualization output directory not found", displacy_path, exits=1)
|
||||||
corpus = GoldCorpus(data_path, data_path)
|
corpus = GoldCorpus(data_path, data_path)
|
||||||
nlp = util.load_model(model)
|
nlp = util.load_model(model)
|
||||||
dev_docs = list(corpus.dev_docs(nlp, gold_preproc=gold_preproc))
|
dev_docs = list(corpus.dev_docs(nlp, gold_preproc=gold_preproc))
|
||||||
|
|
@ -44,65 +48,44 @@ def evaluate(model, data_path, gpu_id=-1, gold_preproc=False, displacy_path=None
|
||||||
scorer = nlp.evaluate(dev_docs, verbose=False)
|
scorer = nlp.evaluate(dev_docs, verbose=False)
|
||||||
end = timer()
|
end = timer()
|
||||||
nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs)
|
nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs)
|
||||||
print_results(scorer, time=end - begin, words=nwords,
|
results = {
|
||||||
wps=nwords / (end - begin))
|
"Time": "%.2f s" % (end - begin),
|
||||||
|
"Words": nwords,
|
||||||
|
"Words/s": "%.0f" % (nwords / (end - begin)),
|
||||||
|
"TOK": "%.2f" % scorer.token_acc,
|
||||||
|
"POS": "%.2f" % scorer.tags_acc,
|
||||||
|
"UAS": "%.2f" % scorer.uas,
|
||||||
|
"LAS": "%.2f" % scorer.las,
|
||||||
|
"NER P": "%.2f" % scorer.ents_p,
|
||||||
|
"NER R": "%.2f" % scorer.ents_r,
|
||||||
|
"NER F": "%.2f" % scorer.ents_f,
|
||||||
|
}
|
||||||
|
msg.table(results, title="Results")
|
||||||
|
|
||||||
if displacy_path:
|
if displacy_path:
|
||||||
docs, golds = zip(*dev_docs)
|
docs, golds = zip(*dev_docs)
|
||||||
render_deps = 'parser' in nlp.meta.get('pipeline', [])
|
render_deps = "parser" in nlp.meta.get("pipeline", [])
|
||||||
render_ents = 'ner' in nlp.meta.get('pipeline', [])
|
render_ents = "ner" in nlp.meta.get("pipeline", [])
|
||||||
render_parses(docs, displacy_path, model_name=model,
|
render_parses(
|
||||||
limit=displacy_limit, deps=render_deps, ents=render_ents)
|
docs,
|
||||||
prints(displacy_path, title=Messages.M036.format(n=displacy_limit))
|
displacy_path,
|
||||||
|
model_name=model,
|
||||||
|
limit=displacy_limit,
|
||||||
|
deps=render_deps,
|
||||||
|
ents=render_ents,
|
||||||
|
)
|
||||||
|
msg.good("Generated {} parses as HTML".format(displacy_limit), displacy_path)
|
||||||
|
|
||||||
|
|
||||||
def render_parses(docs, output_path, model_name='', limit=250, deps=True,
|
def render_parses(docs, output_path, model_name="", limit=250, deps=True, ents=True):
|
||||||
ents=True):
|
docs[0].user_data["title"] = model_name
|
||||||
docs[0].user_data['title'] = model_name
|
|
||||||
if ents:
|
if ents:
|
||||||
with (output_path / 'entities.html').open('w') as file_:
|
with (output_path / "entities.html").open("w") as file_:
|
||||||
html = displacy.render(docs[:limit], style='ent', page=True)
|
html = displacy.render(docs[:limit], style="ent", page=True)
|
||||||
file_.write(html)
|
file_.write(html)
|
||||||
if deps:
|
if deps:
|
||||||
with (output_path / 'parses.html').open('w') as file_:
|
with (output_path / "parses.html").open("w") as file_:
|
||||||
html = displacy.render(docs[:limit], style='dep', page=True,
|
html = displacy.render(
|
||||||
options={'compact': True})
|
docs[:limit], style="dep", page=True, options={"compact": True}
|
||||||
|
)
|
||||||
file_.write(html)
|
file_.write(html)
|
||||||
|
|
||||||
|
|
||||||
def print_progress(itn, losses, dev_scores, wps=0.0):
|
|
||||||
scores = {}
|
|
||||||
for col in ['dep_loss', 'tag_loss', 'uas', 'tags_acc', 'token_acc',
|
|
||||||
'ents_p', 'ents_r', 'ents_f', 'wps']:
|
|
||||||
scores[col] = 0.0
|
|
||||||
scores['dep_loss'] = losses.get('parser', 0.0)
|
|
||||||
scores['ner_loss'] = losses.get('ner', 0.0)
|
|
||||||
scores['tag_loss'] = losses.get('tagger', 0.0)
|
|
||||||
scores.update(dev_scores)
|
|
||||||
scores['wps'] = wps
|
|
||||||
tpl = '\t'.join((
|
|
||||||
'{:d}',
|
|
||||||
'{dep_loss:.3f}',
|
|
||||||
'{ner_loss:.3f}',
|
|
||||||
'{uas:.3f}',
|
|
||||||
'{ents_p:.3f}',
|
|
||||||
'{ents_r:.3f}',
|
|
||||||
'{ents_f:.3f}',
|
|
||||||
'{tags_acc:.3f}',
|
|
||||||
'{token_acc:.3f}',
|
|
||||||
'{wps:.1f}'))
|
|
||||||
print(tpl.format(itn, **scores))
|
|
||||||
|
|
||||||
|
|
||||||
def print_results(scorer, time, words, wps):
|
|
||||||
results = {
|
|
||||||
'Time': '%.2f s' % time,
|
|
||||||
'Words': words,
|
|
||||||
'Words/s': '%.0f' % wps,
|
|
||||||
'TOK': '%.2f' % scorer.token_acc,
|
|
||||||
'POS': '%.2f' % scorer.tags_acc,
|
|
||||||
'UAS': '%.2f' % scorer.uas,
|
|
||||||
'LAS': '%.2f' % scorer.las,
|
|
||||||
'NER P': '%.2f' % scorer.ents_p,
|
|
||||||
'NER R': '%.2f' % scorer.ents_r,
|
|
||||||
'NER F': '%.2f' % scorer.ents_f}
|
|
||||||
util.print_table(results, title="Results")
|
|
||||||
|
|
|
||||||
|
|
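The Words/s figure in the results table above is plain throughput; a minimal sketch with invented numbers:

nwords = 100000     # illustrative token count in the evaluation data
elapsed = 2.5       # illustrative wall-clock seconds for nlp.evaluate()
print("Words/s: %.0f" % (nwords / elapsed))   # -> Words/s: 40000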
@ -4,64 +4,90 @@ from __future__ import unicode_literals
|
||||||
import plac
|
import plac
|
||||||
import platform
|
import platform
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
from wasabi import Printer
|
||||||
|
import srsly
|
||||||
|
|
||||||
from ._messages import Messages
|
from ..compat import path2str, basestring_, unicode_
|
||||||
from ..compat import path2str
|
|
||||||
from .. import util
|
from .. import util
|
||||||
from .. import about
|
from .. import about
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
model=("optional: shortcut link of model", "positional", None, str),
|
model=("Optional shortcut link of model", "positional", None, str),
|
||||||
markdown=("generate Markdown for GitHub issues", "flag", "md", str),
|
markdown=("Generate Markdown for GitHub issues", "flag", "md", str),
|
||||||
silent=("don't print anything (just return)", "flag", "s"))
|
silent=("Don't print anything (just return)", "flag", "s"),
|
||||||
|
)
|
||||||
def info(model=None, markdown=False, silent=False):
|
def info(model=None, markdown=False, silent=False):
|
||||||
"""Print info about spaCy installation. If a model shortcut link is
|
"""
|
||||||
|
Print info about spaCy installation. If a model shortcut link is
|
||||||
specified as an argument, print model information. Flag --markdown
|
specified as an argument, print model information. Flag --markdown
|
||||||
prints details in Markdown for easy copy-pasting to GitHub issues.
|
prints details in Markdown for easy copy-pasting to GitHub issues.
|
||||||
"""
|
"""
|
||||||
|
msg = Printer()
|
||||||
if model:
|
if model:
|
||||||
if util.is_package(model):
|
if util.is_package(model):
|
||||||
model_path = util.get_package_path(model)
|
model_path = util.get_package_path(model)
|
||||||
else:
|
else:
|
||||||
model_path = util.get_data_path() / model
|
model_path = util.get_data_path() / model
|
||||||
meta_path = model_path / 'meta.json'
|
meta_path = model_path / "meta.json"
|
||||||
if not meta_path.is_file():
|
if not meta_path.is_file():
|
||||||
util.prints(meta_path, title=Messages.M020, exits=1)
|
msg.fail("Can't find model meta.json", meta_path, exits=1)
|
||||||
meta = util.read_json(meta_path)
|
meta = srsly.read_json(meta_path)
|
||||||
if model_path.resolve() != model_path:
|
if model_path.resolve() != model_path:
|
||||||
meta['link'] = path2str(model_path)
|
meta["link"] = path2str(model_path)
|
||||||
meta['source'] = path2str(model_path.resolve())
|
meta["source"] = path2str(model_path.resolve())
|
||||||
else:
|
else:
|
||||||
meta['source'] = path2str(model_path)
|
meta["source"] = path2str(model_path)
|
||||||
if not silent:
|
if not silent:
|
||||||
print_info(meta, 'model %s' % model, markdown)
|
title = "Info about model '{}'".format(model)
|
||||||
return meta
|
model_meta = {
|
||||||
data = {'spaCy version': about.__version__,
|
k: v for k, v in meta.items() if k not in ("accuracy", "speed")
|
||||||
'Location': path2str(Path(__file__).parent.parent),
|
}
|
||||||
'Platform': platform.platform(),
|
|
||||||
'Python version': platform.python_version(),
|
|
||||||
'Models': list_models()}
|
|
||||||
if not silent:
|
|
||||||
print_info(data, 'spaCy', markdown)
|
|
||||||
return data
|
|
||||||
|
|
||||||
|
|
||||||
def print_info(data, title, markdown):
|
|
||||||
title = 'Info about %s' % title
|
|
||||||
if markdown:
|
if markdown:
|
||||||
util.print_markdown(data, title=title)
|
print_markdown(model_meta, title=title)
|
||||||
else:
|
else:
|
||||||
util.print_table(data, title=title)
|
msg.table(model_meta, title=title)
|
||||||
|
return meta
|
||||||
|
data = {
|
||||||
|
"spaCy version": about.__version__,
|
||||||
|
"Location": path2str(Path(__file__).parent.parent),
|
||||||
|
"Platform": platform.platform(),
|
||||||
|
"Python version": platform.python_version(),
|
||||||
|
"Models": list_models(),
|
||||||
|
}
|
||||||
|
if not silent:
|
||||||
|
title = "Info about spaCy"
|
||||||
|
if markdown:
|
||||||
|
print_markdown(data, title=title)
|
||||||
|
else:
|
||||||
|
msg.table(data, title=title)
|
||||||
|
return data
|
||||||
|
|
||||||
|
|
||||||
def list_models():
|
def list_models():
|
||||||
def exclude_dir(dir_name):
|
def exclude_dir(dir_name):
|
||||||
# exclude common cache directories and hidden directories
|
# exclude common cache directories and hidden directories
|
||||||
exclude = ['cache', 'pycache', '__pycache__']
|
exclude = ("cache", "pycache", "__pycache__")
|
||||||
return dir_name in exclude or dir_name.startswith('.')
|
return dir_name in exclude or dir_name.startswith(".")
|
||||||
|
|
||||||
data_path = util.get_data_path()
|
data_path = util.get_data_path()
|
||||||
if data_path:
|
if data_path:
|
||||||
models = [f.parts[-1] for f in data_path.iterdir() if f.is_dir()]
|
models = [f.parts[-1] for f in data_path.iterdir() if f.is_dir()]
|
||||||
return ', '.join([m for m in models if not exclude_dir(m)])
|
return ", ".join([m for m in models if not exclude_dir(m)])
|
||||||
return '-'
|
return "-"
|
||||||
|
|
||||||
|
|
||||||
|
def print_markdown(data, title=None):
|
||||||
|
"""Print data in GitHub-flavoured Markdown format for issues etc.
|
||||||
|
|
||||||
|
data (dict or list of tuples): Label/value pairs.
|
||||||
|
title (unicode or None): Title, will be rendered as headline 2.
|
||||||
|
"""
|
||||||
|
markdown = []
|
||||||
|
for key, value in data.items():
|
||||||
|
if isinstance(value, basestring_) and Path(value).exists():
|
||||||
|
continue
|
||||||
|
markdown.append("* **{}:** {}".format(key, unicode_(value)))
|
||||||
|
if title:
|
||||||
|
print("\n## {}".format(title))
|
||||||
|
print("\n{}\n".format("\n".join(markdown)))
|
||||||
|
|
|
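Usage sketch for the print_markdown helper defined above (assuming it is in scope; the values are invented):

info_data = {"spaCy version": "2.1.0", "Python version": "3.6.8"}
print_markdown(info_data, title="Info about spaCy")
# ## Info about spaCy
#
# * **spaCy version:** 2.1.0
# * **Python version:** 3.6.8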
||||||
|
|
@ -11,11 +11,12 @@ from preshed.counter import PreshCounter
|
||||||
import tarfile
|
import tarfile
|
||||||
import gzip
|
import gzip
|
||||||
import zipfile
|
import zipfile
|
||||||
|
import srsly
|
||||||
|
from wasabi import Printer
|
||||||
|
|
||||||
from ._messages import Messages
|
|
||||||
from ..vectors import Vectors
|
from ..vectors import Vectors
|
||||||
from ..errors import Errors, Warnings, user_warning
|
from ..errors import Errors, Warnings, user_warning
|
||||||
from ..util import prints, ensure_path, get_lang_class
|
from ..util import ensure_path, get_lang_class
|
||||||
|
|
||||||
try:
|
try:
|
||||||
import ftfy
|
import ftfy
|
||||||
|
|
@ -23,113 +24,178 @@ except ImportError:
|
||||||
ftfy = None
|
ftfy = None
|
||||||
|
|
||||||
|
|
||||||
|
msg = Printer()
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
lang=("model language", "positional", None, str),
|
lang=("Model language", "positional", None, str),
|
||||||
output_dir=("model output directory", "positional", None, Path),
|
output_dir=("Model output directory", "positional", None, Path),
|
||||||
freqs_loc=("location of words frequencies file", "positional", None, Path),
|
freqs_loc=("Location of words frequencies file", "option", "f", Path),
|
||||||
clusters_loc=("optional: location of brown clusters data",
|
jsonl_loc=("Location of JSONL-formatted attributes file", "option", "j", Path),
|
||||||
"option", "c", str),
|
clusters_loc=("Optional location of brown clusters data", "option", "c", str),
|
||||||
vectors_loc=("optional: location of vectors file in Word2Vec format "
|
vectors_loc=("Optional vectors file in Word2Vec format", "option", "v", str),
|
||||||
"(either as .txt or zipped as .zip or .tar.gz)", "option",
|
prune_vectors=("Optional number of vectors to prune to", "option", "V", int),
|
||||||
"v", str),
|
|
||||||
prune_vectors=("optional: number of vectors to prune to",
|
|
||||||
"option", "V", int)
|
|
||||||
)
|
)
|
||||||
def init_model(lang, output_dir, freqs_loc=None, clusters_loc=None,
|
def init_model(
|
||||||
vectors_loc=None, prune_vectors=-1):
|
lang,
|
||||||
|
output_dir,
|
||||||
|
freqs_loc=None,
|
||||||
|
clusters_loc=None,
|
||||||
|
jsonl_loc=None,
|
||||||
|
vectors_loc=None,
|
||||||
|
prune_vectors=-1,
|
||||||
|
):
|
||||||
"""
|
"""
|
||||||
Create a new model from raw data, like word frequencies, Brown clusters
|
Create a new model from raw data, like word frequencies, Brown clusters
|
||||||
and word vectors.
|
and word vectors. If vectors are provided in Word2Vec format, they can
|
||||||
|
be either a .txt or zipped as a .zip or .tar.gz.
|
||||||
"""
|
"""
|
||||||
if freqs_loc is not None and not freqs_loc.exists():
|
if jsonl_loc is not None:
|
||||||
prints(freqs_loc, title=Messages.M037, exits=1)
|
if freqs_loc is not None or clusters_loc is not None:
|
||||||
|
settings = ["-j"]
|
||||||
|
if freqs_loc:
|
||||||
|
settings.append("-f")
|
||||||
|
if clusters_loc:
|
||||||
|
settings.append("-c")
|
||||||
|
msg.warn(
|
||||||
|
"Incompatible arguments",
|
||||||
|
"The -f and -c arguments are deprecated, and not compatible "
|
||||||
|
"with the -j argument, which should specify the same "
|
||||||
|
"information. Either merge the frequencies and clusters data "
|
||||||
|
"into the JSONL-formatted file (recommended), or use only the "
|
||||||
|
"-f and -c files, without the other lexical attributes.",
|
||||||
|
)
|
||||||
|
jsonl_loc = ensure_path(jsonl_loc)
|
||||||
|
lex_attrs = srsly.read_jsonl(jsonl_loc)
|
||||||
|
else:
|
||||||
clusters_loc = ensure_path(clusters_loc)
|
clusters_loc = ensure_path(clusters_loc)
|
||||||
vectors_loc = ensure_path(vectors_loc)
|
freqs_loc = ensure_path(freqs_loc)
|
||||||
probs, oov_prob = read_freqs(freqs_loc) if freqs_loc is not None else ({}, -20)
|
if freqs_loc is not None and not freqs_loc.exists():
|
||||||
vectors_data, vector_keys = read_vectors(vectors_loc) if vectors_loc else (None, None)
|
msg.fail("Can't find words frequencies file", freqs_loc, exits=1)
|
||||||
clusters = read_clusters(clusters_loc) if clusters_loc else {}
|
lex_attrs = read_attrs_from_deprecated(freqs_loc, clusters_loc)
|
||||||
nlp = create_model(lang, probs, oov_prob, clusters, vectors_data, vector_keys, prune_vectors)
|
|
||||||
|
with msg.loading("Creating model..."):
|
||||||
|
nlp = create_model(lang, lex_attrs)
|
||||||
|
msg.good("Successfully created model")
|
||||||
|
if vectors_loc is not None:
|
||||||
|
add_vectors(nlp, vectors_loc, prune_vectors)
|
||||||
|
vec_added = len(nlp.vocab.vectors)
|
||||||
|
lex_added = len(nlp.vocab)
|
||||||
|
msg.good(
|
||||||
|
"Sucessfully compiled vocab",
|
||||||
|
"{} entries, {} vectors".format(lex_added, vec_added),
|
||||||
|
)
|
||||||
if not output_dir.exists():
|
if not output_dir.exists():
|
||||||
output_dir.mkdir()
|
output_dir.mkdir()
|
||||||
nlp.to_disk(output_dir)
|
nlp.to_disk(output_dir)
|
||||||
return nlp
|
return nlp
|
||||||
|
|
||||||
|
|
||||||
def open_file(loc):
|
def open_file(loc):
|
||||||
'''Handle .gz, .tar.gz or unzipped files'''
|
"""Handle .gz, .tar.gz or unzipped files"""
|
||||||
loc = ensure_path(loc)
|
loc = ensure_path(loc)
|
||||||
print("Open loc")
|
|
||||||
if tarfile.is_tarfile(str(loc)):
|
if tarfile.is_tarfile(str(loc)):
|
||||||
return tarfile.open(str(loc), 'r:gz')
|
return tarfile.open(str(loc), "r:gz")
|
||||||
elif loc.parts[-1].endswith('gz'):
|
elif loc.parts[-1].endswith("gz"):
|
||||||
return (line.decode('utf8') for line in gzip.open(str(loc), 'r'))
|
return (line.decode("utf8") for line in gzip.open(str(loc), "r"))
|
||||||
elif loc.parts[-1].endswith('zip'):
|
elif loc.parts[-1].endswith("zip"):
|
||||||
zip_file = zipfile.ZipFile(str(loc))
|
zip_file = zipfile.ZipFile(str(loc))
|
||||||
names = zip_file.namelist()
|
names = zip_file.namelist()
|
||||||
file_ = zip_file.open(names[0])
|
file_ = zip_file.open(names[0])
|
||||||
return (line.decode('utf8') for line in file_)
|
return (line.decode("utf8") for line in file_)
|
||||||
else:
|
else:
|
||||||
return loc.open('r', encoding='utf8')
|
return loc.open("r", encoding="utf8")
|
||||||
|
|
||||||
def create_model(lang, probs, oov_prob, clusters, vectors_data, vector_keys, prune_vectors):
|
|
||||||
print("Creating model...")
|
def read_attrs_from_deprecated(freqs_loc, clusters_loc):
|
||||||
|
with msg.loading("Counting frequencies..."):
|
||||||
|
probs, oov_prob = read_freqs(freqs_loc) if freqs_loc is not None else ({}, -20)
|
||||||
|
msg.good("Counted frequencies")
|
||||||
|
with msg.loading("Reading clusters..."):
|
||||||
|
clusters = read_clusters(clusters_loc) if clusters_loc else {}
|
||||||
|
msg.good("Read clusters")
|
||||||
|
lex_attrs = []
|
||||||
|
sorted_probs = sorted(probs.items(), key=lambda item: item[1], reverse=True)
|
||||||
|
for i, (word, prob) in tqdm(enumerate(sorted_probs)):
|
||||||
|
attrs = {"orth": word, "id": i, "prob": prob}
|
||||||
|
# Decode as a little-endian string, so that we can do & 15 to get
|
||||||
|
# the first 4 bits. See _parse_features.pyx
|
||||||
|
if word in clusters:
|
||||||
|
attrs["cluster"] = int(clusters[word][::-1], 2)
|
||||||
|
else:
|
||||||
|
attrs["cluster"] = 0
|
||||||
|
lex_attrs.append(attrs)
|
||||||
|
return lex_attrs
|
||||||
|
|
||||||
|
|
||||||
|
def create_model(lang, lex_attrs):
|
||||||
lang_class = get_lang_class(lang)
|
lang_class = get_lang_class(lang)
|
||||||
nlp = lang_class()
|
nlp = lang_class()
|
||||||
for lexeme in nlp.vocab:
|
for lexeme in nlp.vocab:
|
||||||
lexeme.rank = 0
|
lexeme.rank = 0
|
||||||
lex_added = 0
|
lex_added = 0
|
||||||
for i, (word, prob) in enumerate(tqdm(sorted(probs.items(), key=lambda item: item[1], reverse=True))):
|
for attrs in lex_attrs:
|
||||||
lexeme = nlp.vocab[word]
|
if "settings" in attrs:
|
||||||
lexeme.rank = i
|
continue
|
||||||
lexeme.prob = prob
|
lexeme = nlp.vocab[attrs["orth"]]
|
||||||
|
lexeme.set_attrs(**attrs)
|
||||||
lexeme.is_oov = False
|
lexeme.is_oov = False
|
||||||
# Decode as a little-endian string, so that we can do & 15 to get
|
|
||||||
# the first 4 bits. See _parse_features.pyx
|
|
||||||
if word in clusters:
|
|
||||||
lexeme.cluster = int(clusters[word][::-1], 2)
|
|
||||||
else:
|
|
||||||
lexeme.cluster = 0
|
|
||||||
lex_added += 1
|
lex_added += 1
|
||||||
nlp.vocab.cfg.update({'oov_prob': oov_prob})
|
lex_added += 1
|
||||||
|
oov_prob = min(lex.prob for lex in nlp.vocab)
|
||||||
|
nlp.vocab.cfg.update({"oov_prob": oov_prob - 1})
|
||||||
|
return nlp
|
||||||
|
|
||||||
|
|
||||||
|
def add_vectors(nlp, vectors_loc, prune_vectors):
|
||||||
|
vectors_loc = ensure_path(vectors_loc)
|
||||||
|
if vectors_loc and vectors_loc.parts[-1].endswith(".npz"):
|
||||||
|
nlp.vocab.vectors = Vectors(data=numpy.load(vectors_loc.open("rb")))
|
||||||
|
for lex in nlp.vocab:
|
||||||
|
if lex.rank:
|
||||||
|
nlp.vocab.vectors.add(lex.orth, row=lex.rank)
|
||||||
|
else:
|
||||||
|
if vectors_loc:
|
||||||
|
with msg.loading("Reading vectors from {}".format(vectors_loc)):
|
||||||
|
vectors_data, vector_keys = read_vectors(vectors_loc)
|
||||||
|
msg.good("Loaded vectors from {}".format(vectors_loc))
|
||||||
|
else:
|
||||||
|
vectors_data, vector_keys = (None, None)
|
||||||
if vector_keys is not None:
|
if vector_keys is not None:
|
||||||
for word in vector_keys:
|
for word in vector_keys:
|
||||||
if word not in nlp.vocab:
|
if word not in nlp.vocab:
|
||||||
lexeme = nlp.vocab[word]
|
lexeme = nlp.vocab[word]
|
||||||
lexeme.is_oov = False
|
lexeme.is_oov = False
|
||||||
lex_added += 1
|
if vectors_data is not None:
|
||||||
if len(vectors_data):
|
|
||||||
nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys)
|
nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys)
|
||||||
|
nlp.vocab.vectors.name = "%s_model.vectors" % nlp.meta["lang"]
|
||||||
|
nlp.meta["vectors"]["name"] = nlp.vocab.vectors.name
|
||||||
if prune_vectors >= 1:
|
if prune_vectors >= 1:
|
||||||
nlp.vocab.prune_vectors(prune_vectors)
|
nlp.vocab.prune_vectors(prune_vectors)
|
||||||
vec_added = len(nlp.vocab.vectors)
|
|
||||||
prints(Messages.M039.format(entries=lex_added, vectors=vec_added),
|
|
||||||
title=Messages.M038)
|
|
||||||
return nlp
|
|
||||||
|
|
||||||
|
|
||||||
def read_vectors(vectors_loc):
|
def read_vectors(vectors_loc):
|
||||||
print("Reading vectors from %s" % vectors_loc)
|
|
||||||
f = open_file(vectors_loc)
|
f = open_file(vectors_loc)
|
||||||
shape = tuple(int(size) for size in next(f).split())
|
shape = tuple(int(size) for size in next(f).split())
|
||||||
vectors_data = numpy.zeros(shape=shape, dtype='f')
|
vectors_data = numpy.zeros(shape=shape, dtype="f")
|
||||||
vectors_keys = []
|
vectors_keys = []
|
||||||
for i, line in enumerate(tqdm(f)):
|
for i, line in enumerate(tqdm(f)):
|
||||||
line = line.rstrip()
|
line = line.rstrip()
|
||||||
pieces = line.rsplit(' ', vectors_data.shape[1]+1)
|
pieces = line.rsplit(" ", vectors_data.shape[1] + 1)
|
||||||
word = pieces.pop(0)
|
word = pieces.pop(0)
|
||||||
if len(pieces) != vectors_data.shape[1]:
|
if len(pieces) != vectors_data.shape[1]:
|
||||||
raise ValueError(Errors.E094.format(line_num=i, loc=vectors_loc))
|
msg.fail(Errors.E094.format(line_num=i, loc=vectors_loc), exits=1)
|
||||||
vectors_data[i] = numpy.asarray(pieces, dtype='f')
|
vectors_data[i] = numpy.asarray(pieces, dtype="f")
|
||||||
vectors_keys.append(word)
|
vectors_keys.append(word)
|
||||||
return vectors_data, vectors_keys
|
return vectors_data, vectors_keys
|
||||||
|
|
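For reference, read_vectors() expects the plain-text Word2Vec layout: a header line with the row and dimension counts, then one whitespace-separated entry per word. A tiny illustrative file:

vectors_txt = "2 3\napple 0.1 0.2 0.3\nbanana 0.4 0.5 0.6\n"
with open("vectors.txt", "w", encoding="utf8") as f:   # file name is illustrative
    f.write(vectors_txt)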
||||||
|
|
||||||
def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50):
|
def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50):
|
||||||
print("Counting frequencies...")
|
|
||||||
counts = PreshCounter()
|
counts = PreshCounter()
|
||||||
total = 0
|
total = 0
|
||||||
with freqs_loc.open() as f:
|
with freqs_loc.open() as f:
|
||||||
for i, line in enumerate(f):
|
for i, line in enumerate(f):
|
||||||
freq, doc_freq, key = line.rstrip().split('\t', 2)
|
freq, doc_freq, key = line.rstrip().split("\t", 2)
|
||||||
freq = int(freq)
|
freq = int(freq)
|
||||||
counts.inc(i + 1, freq)
|
counts.inc(i + 1, freq)
|
||||||
total += freq
|
total += freq
|
||||||
|
|
@ -138,7 +204,7 @@ def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50):
|
||||||
probs = {}
|
probs = {}
|
||||||
with freqs_loc.open() as f:
|
with freqs_loc.open() as f:
|
||||||
for line in tqdm(f):
|
for line in tqdm(f):
|
||||||
freq, doc_freq, key = line.rstrip().split('\t', 2)
|
freq, doc_freq, key = line.rstrip().split("\t", 2)
|
||||||
doc_freq = int(doc_freq)
|
doc_freq = int(doc_freq)
|
||||||
freq = int(freq)
|
freq = int(freq)
|
||||||
if doc_freq >= min_doc_freq and freq >= min_freq and len(key) < max_length:
|
if doc_freq >= min_doc_freq and freq >= min_freq and len(key) < max_length:
|
||||||
|
|
@ -154,7 +220,6 @@ def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50):
|
||||||
|
|
||||||
|
|
||||||
def read_clusters(clusters_loc):
|
def read_clusters(clusters_loc):
|
||||||
print("Reading clusters...")
|
|
||||||
clusters = {}
|
clusters = {}
|
||||||
if ftfy is None:
|
if ftfy is None:
|
||||||
user_warning(Warnings.W004)
|
user_warning(Warnings.W004)
|
||||||
|
|
@ -171,7 +236,7 @@ def read_clusters(clusters_loc):
|
||||||
if int(freq) >= 3:
|
if int(freq) >= 3:
|
||||||
clusters[word] = cluster
|
clusters[word] = cluster
|
||||||
else:
|
else:
|
||||||
clusters[word] = '0'
|
clusters[word] = "0"
|
||||||
# Expand clusters with re-casing
|
# Expand clusters with re-casing
|
||||||
for word, cluster in list(clusters.items()):
|
for word, cluster in list(clusters.items()):
|
||||||
if word.lower() not in clusters:
|
if word.lower() not in clusters:
|
||||||
|
|
|
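For reference, the JSONL file accepted via -j holds one lexical-attribute object per line, mirroring what read_attrs_from_deprecated builds from the -f/-c inputs; entries carrying a "settings" key are skipped by create_model. The values below are invented:

lexeme_lines = [
    '{"orth": "the", "id": 0, "prob": -3.5, "cluster": 10}',
    '{"orth": "of", "id": 1, "prob": -3.9, "cluster": 14}',
]
with open("lex_attrs.jsonl", "w", encoding="utf8") as f:   # file name is illustrative
    f.write("\n".join(lexeme_lines) + "\n")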
||||||
|
|
@ -3,51 +3,76 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
import plac
|
import plac
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
from wasabi import Printer
|
||||||
|
|
||||||
from ._messages import Messages
|
|
||||||
from ..compat import symlink_to, path2str
|
from ..compat import symlink_to, path2str
|
||||||
from ..util import prints
|
|
||||||
from .. import util
|
from .. import util
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
origin=("package name or local path to model", "positional", None, str),
|
origin=("package name or local path to model", "positional", None, str),
|
||||||
link_name=("name of shortuct link to create", "positional", None, str),
|
link_name=("name of shortuct link to create", "positional", None, str),
|
||||||
force=("force overwriting of existing link", "flag", "f", bool))
|
force=("force overwriting of existing link", "flag", "f", bool),
|
||||||
|
)
|
||||||
def link(origin, link_name, force=False, model_path=None):
|
def link(origin, link_name, force=False, model_path=None):
|
||||||
"""
|
"""
|
||||||
Create a symlink for models within the spacy/data directory. Accepts
|
Create a symlink for models within the spacy/data directory. Accepts
|
||||||
either the name of a pip package, or the local path to the model data
|
either the name of a pip package, or the local path to the model data
|
||||||
directory. Linking models allows loading them via spacy.load(link_name).
|
directory. Linking models allows loading them via spacy.load(link_name).
|
||||||
"""
|
"""
|
||||||
|
msg = Printer()
|
||||||
if util.is_package(origin):
|
if util.is_package(origin):
|
||||||
model_path = util.get_package_path(origin)
|
model_path = util.get_package_path(origin)
|
||||||
else:
|
else:
|
||||||
model_path = Path(origin) if model_path is None else Path(model_path)
|
model_path = Path(origin) if model_path is None else Path(model_path)
|
||||||
if not model_path.exists():
|
if not model_path.exists():
|
||||||
prints(Messages.M009.format(path=path2str(model_path)),
|
msg.fail(
|
||||||
title=Messages.M008, exits=1)
|
"Can't locate model data",
|
||||||
|
"The data should be located in {}".format(path2str(model_path)),
|
||||||
|
exits=1,
|
||||||
|
)
|
||||||
data_path = util.get_data_path()
|
data_path = util.get_data_path()
|
||||||
if not data_path or not data_path.exists():
|
if not data_path or not data_path.exists():
|
||||||
spacy_loc = Path(__file__).parent.parent
|
spacy_loc = Path(__file__).parent.parent
|
||||||
prints(Messages.M011, spacy_loc, title=Messages.M010, exits=1)
|
msg.fail(
|
||||||
|
"Can't find the spaCy data path to create model symlink",
|
||||||
|
"Make sure a directory `/data` exists within your spaCy "
|
||||||
|
"installation and try again. The data directory should be located "
|
||||||
|
"here:".format(path=spacy_loc),
|
||||||
|
exits=1,
|
||||||
|
)
|
||||||
link_path = util.get_data_path() / link_name
|
link_path = util.get_data_path() / link_name
|
||||||
if link_path.is_symlink() and not force:
|
if link_path.is_symlink() and not force:
|
||||||
prints(Messages.M013, title=Messages.M012.format(name=link_name),
|
msg.fail(
|
||||||
exits=1)
|
"Link '{}' already exists".format(link_name),
|
||||||
|
"To overwrite an existing link, use the --force flag",
|
||||||
|
exits=1,
|
||||||
|
)
|
||||||
elif link_path.is_symlink(): # does a symlink exist?
|
elif link_path.is_symlink(): # does a symlink exist?
|
||||||
# NB: It's important to check for is_symlink here and not for exists,
|
# NB: It's important to check for is_symlink here and not for exists,
|
||||||
# because invalid/outdated symlinks would return False otherwise.
|
# because invalid/outdated symlinks would return False otherwise.
|
||||||
link_path.unlink()
|
link_path.unlink()
|
||||||
elif link_path.exists(): # does it exist otherwise?
|
elif link_path.exists(): # does it exist otherwise?
|
||||||
# NB: Check this last because valid symlinks also "exist".
|
# NB: Check this last because valid symlinks also "exist".
|
||||||
prints(Messages.M015, link_path,
|
msg.fail(
|
||||||
title=Messages.M014.format(name=link_name), exits=1)
|
"Can't overwrite symlink '{}'".format(link_name),
|
||||||
msg = "%s --> %s" % (path2str(model_path), path2str(link_path))
|
"This can happen if your data directory contains a directory or "
|
||||||
|
"file of the same name.",
|
||||||
|
exits=1,
|
||||||
|
)
|
||||||
|
details = "%s --> %s" % (path2str(model_path), path2str(link_path))
|
||||||
try:
|
try:
|
||||||
symlink_to(link_path, model_path)
|
symlink_to(link_path, model_path)
|
||||||
except:
|
except: # noqa: E722
|
||||||
# This is quite dirty, but just making sure other errors are caught.
|
# This is quite dirty, but just making sure other errors are caught.
|
||||||
prints(Messages.M017, msg, title=Messages.M016.format(name=link_name))
|
msg.fail(
|
||||||
|
"Couldn't link model to '{}'".format(link_name),
|
||||||
|
"Creating a symlink in spacy/data failed. Make sure you have the "
|
||||||
|
"required permissions and try re-running the command as admin, or "
|
||||||
|
"use a virtualenv. You can still import the model as a module and "
|
||||||
|
"call its load() method, or create the symlink manually.",
|
||||||
|
)
|
||||||
|
msg.text(details)
|
||||||
raise
|
raise
|
||||||
prints(msg, Messages.M019.format(name=link_name), title=Messages.M018)
|
msg.good("Linking successful", details)
|
||||||
|
msg.text("You can now load the model via spacy.load('{}')".format(link_name))
|
||||||
|
|
|
||||||
|
|
@@ -4,109 +4,116 @@ from __future__ import unicode_literals
|
||||||
import plac
|
import plac
|
||||||
import shutil
|
import shutil
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
from wasabi import Printer, get_raw_input
|
||||||
|
import srsly
|
||||||
|
|
||||||
from ._messages import Messages
|
from ..compat import path2str
|
||||||
from ..compat import path2str, json_dumps
|
|
||||||
from ..util import prints
|
|
||||||
from .. import util
|
from .. import util
|
||||||
from .. import about
|
from .. import about
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
input_dir=("directory with model data", "positional", None, str),
|
input_dir=("Directory with model data", "positional", None, str),
|
||||||
output_dir=("output parent directory", "positional", None, str),
|
output_dir=("Output parent directory", "positional", None, str),
|
||||||
meta_path=("path to meta.json", "option", "m", str),
|
meta_path=("Path to meta.json", "option", "m", str),
|
||||||
create_meta=("create meta.json, even if one exists in directory – if "
|
create_meta=("Create meta.json, even if one exists", "flag", "c", bool),
|
||||||
"existing meta is found, entries are shown as defaults in "
|
force=("Force overwriting existing model in output directory", "flag", "f", bool),
|
||||||
"the command line prompt", "flag", "c", bool),
|
)
|
||||||
force=("force overwriting of existing model directory in output directory",
|
def package(input_dir, output_dir, meta_path=None, create_meta=False, force=False):
|
||||||
"flag", "f", bool))
|
|
||||||
def package(input_dir, output_dir, meta_path=None, create_meta=False,
|
|
||||||
force=False):
|
|
||||||
"""
|
"""
|
||||||
Generate Python package for model data, including meta and required
|
Generate Python package for model data, including meta and required
|
||||||
installation files. A new directory will be created in the specified
|
installation files. A new directory will be created in the specified
|
||||||
output directory, and model data will be copied over.
|
output directory, and model data will be copied over. If --create-meta is
|
||||||
|
set and a meta.json already exists in the output directory, the existing
|
||||||
|
values will be used as the defaults in the command-line prompt.
|
||||||
"""
|
"""
|
||||||
|
msg = Printer()
|
||||||
input_path = util.ensure_path(input_dir)
|
input_path = util.ensure_path(input_dir)
|
||||||
output_path = util.ensure_path(output_dir)
|
output_path = util.ensure_path(output_dir)
|
||||||
meta_path = util.ensure_path(meta_path)
|
meta_path = util.ensure_path(meta_path)
|
||||||
if not input_path or not input_path.exists():
|
if not input_path or not input_path.exists():
|
||||||
prints(input_path, title=Messages.M008, exits=1)
|
msg.fail("Can't locate model data", input_path, exits=1)
|
||||||
if not output_path or not output_path.exists():
|
if not output_path or not output_path.exists():
|
||||||
prints(output_path, title=Messages.M040, exits=1)
|
msg.fail("Output directory not found", output_path, exits=1)
|
||||||
if meta_path and not meta_path.exists():
|
if meta_path and not meta_path.exists():
|
||||||
prints(meta_path, title=Messages.M020, exits=1)
|
msg.fail("Can't find model meta.json", meta_path, exits=1)
|
||||||
|
|
||||||
meta_path = meta_path or input_path / 'meta.json'
|
meta_path = meta_path or input_path / "meta.json"
|
||||||
if meta_path.is_file():
|
if meta_path.is_file():
|
||||||
meta = util.read_json(meta_path)
|
meta = srsly.read_json(meta_path)
|
||||||
if not create_meta: # only print this if user doesn't want to overwrite
|
if not create_meta: # only print if user doesn't want to overwrite
|
||||||
prints(meta_path, title=Messages.M041)
|
msg.good("Loaded meta.json from file", meta_path)
|
||||||
else:
|
else:
|
||||||
meta = generate_meta(input_dir, meta)
|
meta = generate_meta(input_dir, meta, msg)
|
||||||
meta = validate_meta(meta, ['lang', 'name', 'version'])
|
for key in ("lang", "name", "version"):
|
||||||
model_name = meta['lang'] + '_' + meta['name']
|
if key not in meta or meta[key] == "":
|
||||||
model_name_v = model_name + '-' + meta['version']
|
msg.fail(
|
||||||
|
"No '{}' setting found in meta.json".format(key),
|
||||||
|
"This setting is required to build your package.",
|
||||||
|
exits=1,
|
||||||
|
)
|
||||||
|
model_name = meta["lang"] + "_" + meta["name"]
|
||||||
|
model_name_v = model_name + "-" + meta["version"]
|
||||||
main_path = output_path / model_name_v
|
main_path = output_path / model_name_v
|
||||||
package_path = main_path / model_name
|
package_path = main_path / model_name
|
||||||
|
|
||||||
create_dirs(package_path, force)
|
|
||||||
shutil.copytree(path2str(input_path),
|
|
||||||
path2str(package_path / model_name_v))
|
|
||||||
create_file(main_path / 'meta.json', json_dumps(meta))
|
|
||||||
create_file(main_path / 'setup.py', TEMPLATE_SETUP)
|
|
||||||
create_file(main_path / 'MANIFEST.in', TEMPLATE_MANIFEST)
|
|
||||||
create_file(package_path / '__init__.py', TEMPLATE_INIT)
|
|
||||||
prints(main_path, Messages.M043,
|
|
||||||
title=Messages.M042.format(name=model_name_v))
|
|
||||||
|
|
||||||
|
|
||||||
def create_dirs(package_path, force):
|
|
||||||
if package_path.exists():
|
if package_path.exists():
|
||||||
if force:
|
if force:
|
||||||
shutil.rmtree(path2str(package_path))
|
shutil.rmtree(path2str(package_path))
|
||||||
else:
|
else:
|
||||||
prints(package_path, Messages.M045, title=Messages.M044, exits=1)
|
msg.fail(
|
||||||
|
"Package directory already exists",
|
||||||
|
"Please delete the directory and try again, or use the "
|
||||||
|
"`--force` flag to overwrite existing "
|
||||||
|
"directories.".format(path=path2str(package_path)),
|
||||||
|
exits=1,
|
||||||
|
)
|
||||||
Path.mkdir(package_path, parents=True)
|
Path.mkdir(package_path, parents=True)
|
||||||
|
shutil.copytree(path2str(input_path), path2str(package_path / model_name_v))
|
||||||
|
create_file(main_path / "meta.json", srsly.json_dumps(meta, indent=2))
|
||||||
|
create_file(main_path / "setup.py", TEMPLATE_SETUP)
|
||||||
|
create_file(main_path / "MANIFEST.in", TEMPLATE_MANIFEST)
|
||||||
|
create_file(package_path / "__init__.py", TEMPLATE_INIT)
|
||||||
|
msg.good("Successfully created package '{}'".format(model_name_v), main_path)
|
||||||
|
msg.text("To build the package, run `python setup.py sdist` in this directory.")
|
||||||
|
|
||||||
|
|
||||||
def create_file(file_path, contents):
|
def create_file(file_path, contents):
|
||||||
file_path.touch()
|
file_path.touch()
|
||||||
file_path.open('w', encoding='utf-8').write(contents)
|
file_path.open("w", encoding="utf-8").write(contents)
|
||||||
|
|
||||||
|
|
||||||
def generate_meta(model_path, existing_meta):
|
def generate_meta(model_path, existing_meta, msg):
|
||||||
meta = existing_meta or {}
|
meta = existing_meta or {}
|
||||||
settings = [('lang', 'Model language', meta.get('lang', 'en')),
|
settings = [
|
||||||
('name', 'Model name', meta.get('name', 'model')),
|
("lang", "Model language", meta.get("lang", "en")),
|
||||||
('version', 'Model version', meta.get('version', '0.0.0')),
|
("name", "Model name", meta.get("name", "model")),
|
||||||
('spacy_version', 'Required spaCy version',
|
("version", "Model version", meta.get("version", "0.0.0")),
|
||||||
'>=%s,<3.0.0' % about.__version__),
|
("spacy_version", "Required spaCy version", ">=%s,<3.0.0" % about.__version__),
|
||||||
('description', 'Model description',
|
("description", "Model description", meta.get("description", False)),
|
||||||
meta.get('description', False)),
|
("author", "Author", meta.get("author", False)),
|
||||||
('author', 'Author', meta.get('author', False)),
|
("email", "Author email", meta.get("email", False)),
|
||||||
('email', 'Author email', meta.get('email', False)),
|
("url", "Author website", meta.get("url", False)),
|
||||||
('url', 'Author website', meta.get('url', False)),
|
("license", "License", meta.get("license", "CC BY-SA 3.0")),
|
||||||
('license', 'License', meta.get('license', 'CC BY-SA 3.0'))]
|
]
|
||||||
nlp = util.load_model_from_path(Path(model_path))
|
nlp = util.load_model_from_path(Path(model_path))
|
||||||
meta['pipeline'] = nlp.pipe_names
|
meta["pipeline"] = nlp.pipe_names
|
||||||
meta['vectors'] = {'width': nlp.vocab.vectors_length,
|
meta["vectors"] = {
|
||||||
'vectors': len(nlp.vocab.vectors),
|
"width": nlp.vocab.vectors_length,
|
||||||
'keys': nlp.vocab.vectors.n_keys}
|
"vectors": len(nlp.vocab.vectors),
|
||||||
prints(Messages.M047, title=Messages.M046)
|
"keys": nlp.vocab.vectors.n_keys,
|
||||||
|
"name": nlp.vocab.vectors.name,
|
||||||
|
}
|
||||||
|
msg.divider("Generating meta.json")
|
||||||
|
msg.text(
|
||||||
|
"Enter the package settings for your model. The following information "
|
||||||
|
"will be read from your model data: pipeline, vectors."
|
||||||
|
)
|
||||||
for setting, desc, default in settings:
|
for setting, desc, default in settings:
|
||||||
response = util.get_raw_input(desc, default)
|
response = get_raw_input(desc, default)
|
||||||
meta[setting] = default if response == '' and default else response
|
meta[setting] = default if response == "" and default else response
|
||||||
if about.__title__ != 'spacy':
|
if about.__title__ != "spacy":
|
||||||
meta['parent_package'] = about.__title__
|
meta["parent_package"] = about.__title__
|
||||||
return meta
|
|
||||||
|
|
||||||
|
|
||||||
def validate_meta(meta, keys):
|
|
||||||
for key in keys:
|
|
||||||
if key not in meta or meta[key] == '':
|
|
||||||
prints(Messages.M049, title=Messages.M048.format(key=key), exits=1)
|
|
||||||
return meta
|
return meta
|
||||||
|
|
||||||
|
|
||||||
|
|
@@ -140,7 +147,7 @@ def list_files(data_dir):
|
||||||
|
|
||||||
def list_requirements(meta):
|
def list_requirements(meta):
|
||||||
parent_package = meta.get('parent_package', 'spacy')
|
parent_package = meta.get('parent_package', 'spacy')
|
||||||
requirements = [parent_package + ">=" + meta['spacy_version']]
|
requirements = [parent_package + meta['spacy_version']]
|
||||||
if 'setup_requires' in meta:
|
if 'setup_requires' in meta:
|
||||||
requirements += meta['setup_requires']
|
requirements += meta['setup_requires']
|
||||||
return requirements
|
return requirements
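A quick worked example of the change above, with hypothetical meta values: because spacy_version now already contains the comparison operators, the requirement string is built without the extra ">=" prefix:

meta = {"spacy_version": ">=2.1.0,<3.0.0", "setup_requires": ["wheel"]}
parent_package = meta.get("parent_package", "spacy")
requirements = [parent_package + meta["spacy_version"]]  # "spacy>=2.1.0,<3.0.0"
if "setup_requires" in meta:
    requirements += meta["setup_requires"]
print(requirements)  # ['spacy>=2.1.0,<3.0.0', 'wheel']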
|
||||||
|
|
|
||||||
251
spacy/cli/pretrain.py
Normal file
|
|
@@ -0,0 +1,251 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import print_function, unicode_literals
|
||||||
|
|
||||||
|
import plac
|
||||||
|
import random
|
||||||
|
import numpy
|
||||||
|
import time
|
||||||
|
from collections import Counter
|
||||||
|
from pathlib import Path
|
||||||
|
from thinc.v2v import Affine, Maxout
|
||||||
|
from thinc.misc import LayerNorm as LN
|
||||||
|
from thinc.neural.util import prefer_gpu
|
||||||
|
from wasabi import Printer
|
||||||
|
import srsly
|
||||||
|
|
||||||
|
from ..tokens import Doc
|
||||||
|
from ..attrs import ID, HEAD
|
||||||
|
from .._ml import Tok2Vec, flatten, chain, create_default_optimizer
|
||||||
|
from .._ml import masked_language_model
|
||||||
|
from .. import util
|
||||||
|
|
||||||
|
|
||||||
|
@plac.annotations(
|
||||||
|
texts_loc=("Path to jsonl file with texts to learn from", "positional", None, str),
|
||||||
|
vectors_model=("Name or path to vectors model to learn from"),
|
||||||
|
output_dir=("Directory to write models each epoch", "positional", None, str),
|
||||||
|
width=("Width of CNN layers", "option", "cw", int),
|
||||||
|
depth=("Depth of CNN layers", "option", "cd", int),
|
||||||
|
embed_rows=("Embedding rows", "option", "er", int),
|
||||||
|
use_vectors=("Whether to use the static vectors as input features", "flag", "uv"),
|
||||||
|
dropout=("Dropout", "option", "d", float),
|
||||||
|
seed=("Seed for random number generators", "option", "s", float),
|
||||||
|
nr_iter=("Number of iterations to pretrain", "option", "i", int),
|
||||||
|
)
|
||||||
|
def pretrain(
|
||||||
|
texts_loc,
|
||||||
|
vectors_model,
|
||||||
|
output_dir,
|
||||||
|
width=96,
|
||||||
|
depth=4,
|
||||||
|
embed_rows=2000,
|
||||||
|
use_vectors=False,
|
||||||
|
dropout=0.2,
|
||||||
|
nr_iter=1000,
|
||||||
|
seed=0,
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components,
|
||||||
|
using an approximate language-modelling objective. Specifically, we load
|
||||||
|
pre-trained vectors, and train a component like a CNN, BiLSTM, etc to predict
|
||||||
|
vectors which match the pre-trained ones. The weights are saved to a directory
|
||||||
|
after each epoch. You can then pass a path to one of these pre-trained weights
|
||||||
|
files to the 'spacy train' command.
|
||||||
|
|
||||||
|
This technique may be especially helpful if you have little labelled data.
|
||||||
|
However, it's still quite experimental, so your mileage may vary.
|
||||||
|
|
||||||
|
To load the weights back in during 'spacy train', you need to ensure
|
||||||
|
all settings are the same between pretraining and training. The API and
|
||||||
|
errors around this need some improvement.
|
||||||
|
"""
|
||||||
|
config = dict(locals())
|
||||||
|
msg = Printer()
|
||||||
|
util.fix_random_seed(seed)
|
||||||
|
|
||||||
|
has_gpu = prefer_gpu()
|
||||||
|
msg.info("Using GPU" if has_gpu else "Not using GPU")
|
||||||
|
|
||||||
|
output_dir = Path(output_dir)
|
||||||
|
if not output_dir.exists():
|
||||||
|
output_dir.mkdir()
|
||||||
|
msg.good("Created output directory")
|
||||||
|
srsly.write_json(output_dir / "config.json", config)
|
||||||
|
msg.good("Saved settings to config.json")
|
||||||
|
|
||||||
|
# Load texts from file or stdin
|
||||||
|
if texts_loc != "-": # reading from a file
|
||||||
|
texts_loc = Path(texts_loc)
|
||||||
|
if not texts_loc.exists():
|
||||||
|
msg.fail("Input text file doesn't exist", texts_loc, exits=1)
|
||||||
|
with msg.loading("Loading input texts..."):
|
||||||
|
texts = list(srsly.read_jsonl(texts_loc))
|
||||||
|
msg.good("Loaded input texts")
|
||||||
|
random.shuffle(texts)
|
||||||
|
else: # reading from stdin
|
||||||
|
msg.text("Reading input text from stdin...")
|
||||||
|
texts = srsly.read_jsonl("-")
|
||||||
|
|
||||||
|
with msg.loading("Loading model '{}'...".format(vectors_model)):
|
||||||
|
nlp = util.load_model(vectors_model)
|
||||||
|
msg.good("Loaded model '{}'".format(vectors_model))
|
||||||
|
pretrained_vectors = None if not use_vectors else nlp.vocab.vectors.name
|
||||||
|
model = create_pretraining_model(
|
||||||
|
nlp,
|
||||||
|
Tok2Vec(
|
||||||
|
width,
|
||||||
|
embed_rows,
|
||||||
|
conv_depth=depth,
|
||||||
|
pretrained_vectors=pretrained_vectors,
|
||||||
|
bilstm_depth=0, # Requires PyTorch. Experimental.
|
||||||
|
cnn_maxout_pieces=3, # You can try setting this higher
|
||||||
|
subword_features=True, # Set to False for Chinese etc
|
||||||
|
),
|
||||||
|
)
|
||||||
|
optimizer = create_default_optimizer(model.ops)
|
||||||
|
tracker = ProgressTracker(frequency=10000)
|
||||||
|
msg.divider("Pre-training tok2vec layer")
|
||||||
|
row_settings = {"widths": (3, 10, 10, 6, 4), "aligns": ("r", "r", "r", "r", "r")}
|
||||||
|
msg.row(("#", "# Words", "Total Loss", "Loss", "w/s"), **row_settings)
|
||||||
|
for epoch in range(nr_iter):
|
||||||
|
for batch in util.minibatch_by_words(
|
||||||
|
((text, None) for text in texts), size=3000
|
||||||
|
):
|
||||||
|
docs = make_docs(nlp, [text for (text, _) in batch])
|
||||||
|
loss = make_update(model, docs, optimizer, drop=dropout)
|
||||||
|
progress = tracker.update(epoch, loss, docs)
|
||||||
|
if progress:
|
||||||
|
msg.row(progress, **row_settings)
|
||||||
|
if texts_loc == "-" and tracker.words_per_epoch[epoch] >= 10 ** 7:
|
||||||
|
break
|
||||||
|
with model.use_params(optimizer.averages):
|
||||||
|
with (output_dir / ("model%d.bin" % epoch)).open("wb") as file_:
|
||||||
|
file_.write(model.tok2vec.to_bytes())
|
||||||
|
log = {
|
||||||
|
"nr_word": tracker.nr_word,
|
||||||
|
"loss": tracker.loss,
|
||||||
|
"epoch_loss": tracker.epoch_loss,
|
||||||
|
"epoch": epoch,
|
||||||
|
}
|
||||||
|
with (output_dir / "log.jsonl").open("a") as file_:
|
||||||
|
file_.write(srsly.json_dumps(log) + "\n")
|
||||||
|
tracker.epoch_loss = 0.0
|
||||||
|
if texts_loc != "-":
|
||||||
|
# Reshuffle the texts if texts were loaded from a file
|
||||||
|
random.shuffle(texts)
|
||||||
|
|
||||||
|
|
||||||
|
def make_update(model, docs, optimizer, drop=0.0, objective="L2"):
|
||||||
|
"""Perform an update over a single batch of documents.
|
||||||
|
|
||||||
|
docs (iterable): A batch of `Doc` objects.
|
||||||
|
drop (float): The dropout rate.
|
||||||
|
optimizer (callable): An optimizer.
|
||||||
|
RETURNS loss: A float for the loss.
|
||||||
|
"""
|
||||||
|
predictions, backprop = model.begin_update(docs, drop=drop)
|
||||||
|
loss, gradients = get_vectors_loss(model.ops, docs, predictions, objective)
|
||||||
|
backprop(gradients, sgd=optimizer)
|
||||||
|
# Don't want to return a cupy object here
|
||||||
|
# The gradients are modified in-place by the BERT MLM,
|
||||||
|
# so we get an accurate loss
|
||||||
|
return float(loss)
|
||||||
|
|
||||||
|
|
||||||
|
def make_docs(nlp, batch, min_length=1, max_length=500):
|
||||||
|
docs = []
|
||||||
|
for record in batch:
|
||||||
|
text = record["text"]
|
||||||
|
if "tokens" in record:
|
||||||
|
doc = Doc(nlp.vocab, words=record["tokens"])
|
||||||
|
else:
|
||||||
|
doc = nlp.make_doc(text)
|
||||||
|
if "heads" in record:
|
||||||
|
heads = record["heads"]
|
||||||
|
heads = numpy.asarray(heads, dtype="uint64")
|
||||||
|
heads = heads.reshape((len(doc), 1))
|
||||||
|
doc = doc.from_array([HEAD], heads)
|
||||||
|
if len(doc) >= min_length and len(doc) < max_length:
|
||||||
|
docs.append(doc)
|
||||||
|
return docs
|
||||||
|
|
||||||
|
|
||||||
|
def get_vectors_loss(ops, docs, prediction, objective="L2"):
|
||||||
|
"""Compute a mean-squared error loss between the documents' vectors and
|
||||||
|
the prediction.
|
||||||
|
|
||||||
|
Note that this is ripe for customization! We could compute the vectors
|
||||||
|
in some other way, e.g. with an LSTM language model, or use some other
|
||||||
|
type of objective.
|
||||||
|
"""
|
||||||
|
# The simplest way to implement this would be to vstack the
|
||||||
|
# token.vector values, but that's a bit inefficient, especially on GPU.
|
||||||
|
# Instead we fetch the index into the vectors table for each of our tokens,
|
||||||
|
# and look them up all at once. This prevents data copying.
|
||||||
|
ids = ops.flatten([doc.to_array(ID).ravel() for doc in docs])
|
||||||
|
target = docs[0].vocab.vectors.data[ids]
|
||||||
|
if objective == "L2":
|
||||||
|
d_scores = prediction - target
|
||||||
|
loss = (d_scores ** 2).sum()
|
||||||
|
else:
|
||||||
|
raise NotImplementedError(objective)
|
||||||
|
return loss, d_scores
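A numpy-only sketch of the L2 objective used above, with toy numbers to show the shapes: the gradient handed to backprop() is simply the difference between predicted and target vectors, and the loss is its squared sum.

import numpy

prediction = numpy.array([[0.5, 1.0], [0.0, -1.0]])  # model output, one row per token
target = numpy.array([[1.0, 1.0], [0.0, 0.0]])       # rows looked up from the vectors table
d_scores = prediction - target                       # gradient
loss = (d_scores ** 2).sum()                         # 0.25 + 0.0 + 0.0 + 1.0 = 1.25
print(loss, d_scores)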
|
||||||
|
|
||||||
|
|
||||||
|
def create_pretraining_model(nlp, tok2vec):
|
||||||
|
"""Define a network for the pretraining. We simply add an output layer onto
|
||||||
|
the tok2vec input model. The tok2vec input model needs to be a model that
|
||||||
|
takes a batch of Doc objects (as a list), and returns a list of arrays.
|
||||||
|
Each array in the output needs to have one row per token in the doc.
|
||||||
|
"""
|
||||||
|
output_size = nlp.vocab.vectors.data.shape[1]
|
||||||
|
output_layer = chain(
|
||||||
|
LN(Maxout(300, pieces=3)), Affine(output_size, drop_factor=0.0)
|
||||||
|
)
|
||||||
|
# This is annoying, but the parser etc have the flatten step after
|
||||||
|
# the tok2vec. To load the weights in cleanly, we need to match
|
||||||
|
# the shape of the models' components exactly. So what we call
|
||||||
|
# "tok2vec" has to be the same set of processes as what the components do.
|
||||||
|
tok2vec = chain(tok2vec, flatten)
|
||||||
|
model = chain(tok2vec, output_layer)
|
||||||
|
model = masked_language_model(nlp.vocab, model)
|
||||||
|
model.tok2vec = tok2vec
|
||||||
|
model.output_layer = output_layer
|
||||||
|
model.begin_training([nlp.make_doc("Give it a doc to infer shapes")])
|
||||||
|
return model
|
||||||
|
|
||||||
|
|
||||||
|
class ProgressTracker(object):
|
||||||
|
def __init__(self, frequency=1000000):
|
||||||
|
self.loss = 0.0
|
||||||
|
self.prev_loss = 0.0
|
||||||
|
self.nr_word = 0
|
||||||
|
self.words_per_epoch = Counter()
|
||||||
|
self.frequency = frequency
|
||||||
|
self.last_time = time.time()
|
||||||
|
self.last_update = 0
|
||||||
|
self.epoch_loss = 0.0
|
||||||
|
|
||||||
|
def update(self, epoch, loss, docs):
|
||||||
|
self.loss += loss
|
||||||
|
self.epoch_loss += loss
|
||||||
|
words_in_batch = sum(len(doc) for doc in docs)
|
||||||
|
self.words_per_epoch[epoch] += words_in_batch
|
||||||
|
self.nr_word += words_in_batch
|
||||||
|
words_since_update = self.nr_word - self.last_update
|
||||||
|
if words_since_update >= self.frequency:
|
||||||
|
wps = words_since_update / (time.time() - self.last_time)
|
||||||
|
self.last_update = self.nr_word
|
||||||
|
self.last_time = time.time()
|
||||||
|
loss_per_word = self.loss - self.prev_loss
|
||||||
|
status = (
|
||||||
|
epoch,
|
||||||
|
self.nr_word,
|
||||||
|
"%.8f" % self.loss,
|
||||||
|
"%.8f" % loss_per_word,
|
||||||
|
int(wps),
|
||||||
|
)
|
||||||
|
self.prev_loss = float(self.loss)
|
||||||
|
return status
|
||||||
|
else:
|
||||||
|
return None
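To make the reporting interval concrete, a toy calculation with made-up numbers: a status row is only produced once `frequency` words have been seen since the last report, and the w/s figure is the word count since that report divided by the elapsed time.

words_since_update = 12000   # hypothetical
elapsed_seconds = 1.5        # hypothetical
frequency = 10000            # pretrain() constructs ProgressTracker(frequency=10000)
if words_since_update >= frequency:
    wps = words_since_update / elapsed_seconds
    print(int(wps))          # 8000, shown in the "w/s" column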
|
||||||
|
|
@@ -3,48 +3,67 @@ from __future__ import unicode_literals, division, print_function
|
||||||
|
|
||||||
import plac
|
import plac
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
import ujson
|
import srsly
|
||||||
import cProfile
|
import cProfile
|
||||||
import pstats
|
import pstats
|
||||||
|
|
||||||
import spacy
|
|
||||||
import sys
|
import sys
|
||||||
import tqdm
|
import tqdm
|
||||||
import cytoolz
|
import itertools
|
||||||
import thinc.extra.datasets
|
import thinc.extra.datasets
|
||||||
|
from wasabi import Printer
|
||||||
|
|
||||||
|
from ..util import load_model
|
||||||
def read_inputs(loc):
|
|
||||||
if loc is None:
|
|
||||||
file_ = sys.stdin
|
|
||||||
file_ = (line.encode('utf8') for line in file_)
|
|
||||||
else:
|
|
||||||
file_ = Path(loc).open()
|
|
||||||
for line in file_:
|
|
||||||
data = ujson.loads(line)
|
|
||||||
text = data['text']
|
|
||||||
yield text
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
lang=("model/language", "positional", None, str),
|
model=("Model to load", "positional", None, str),
|
||||||
inputs=("Location of input file", "positional", None, read_inputs))
|
inputs=("Location of input file. '-' for stdin.", "positional", None, str),
|
||||||
def profile(lang, inputs=None):
|
n_texts=("Maximum number of texts to use if available", "option", "n", int),
|
||||||
|
)
|
||||||
|
def profile(model, inputs=None, n_texts=10000):
|
||||||
"""
|
"""
|
||||||
Profile a spaCy pipeline, to find out which functions take the most time.
|
Profile a spaCy pipeline, to find out which functions take the most time.
|
||||||
|
Input should be formatted as one JSON object per line with a key "text".
|
||||||
|
It can either be provided as a JSONL file, or be read from sys.stdin.
|
||||||
|
If no input file is specified, the IMDB dataset is loaded via Thinc.
|
||||||
"""
|
"""
|
||||||
|
msg = Printer()
|
||||||
|
if inputs is not None:
|
||||||
|
inputs = _read_inputs(inputs, msg)
|
||||||
if inputs is None:
|
if inputs is None:
|
||||||
|
n_inputs = 25000
|
||||||
|
with msg.loading("Loading IMDB dataset via Thinc..."):
|
||||||
imdb_train, _ = thinc.extra.datasets.imdb()
|
imdb_train, _ = thinc.extra.datasets.imdb()
|
||||||
inputs, _ = zip(*imdb_train)
|
inputs, _ = zip(*imdb_train)
|
||||||
inputs = inputs[:25000]
|
msg.info("Loaded IMDB dataset and using {} examples".format(n_inputs))
|
||||||
nlp = spacy.load(lang)
|
inputs = inputs[:n_inputs]
|
||||||
texts = list(cytoolz.take(10000, inputs))
|
with msg.loading("Loading model '{}'...".format(model)):
|
||||||
cProfile.runctx("parse_texts(nlp, texts)", globals(), locals(),
|
nlp = load_model(model)
|
||||||
"Profile.prof")
|
msg.good("Loaded model '{}'".format(model))
|
||||||
|
texts = list(itertools.islice(inputs, n_texts))
|
||||||
|
cProfile.runctx("parse_texts(nlp, texts)", globals(), locals(), "Profile.prof")
|
||||||
s = pstats.Stats("Profile.prof")
|
s = pstats.Stats("Profile.prof")
|
||||||
|
msg.divider("Profile stats")
|
||||||
s.strip_dirs().sort_stats("time").print_stats()
|
s.strip_dirs().sort_stats("time").print_stats()
|
||||||
|
|
||||||
|
|
||||||
def parse_texts(nlp, texts):
|
def parse_texts(nlp, texts):
|
||||||
for doc in nlp.pipe(tqdm.tqdm(texts), batch_size=16):
|
for doc in nlp.pipe(tqdm.tqdm(texts), batch_size=16):
|
||||||
pass
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
def _read_inputs(loc, msg):
|
||||||
|
if loc == "-":
|
||||||
|
msg.info("Reading input from sys.stdin")
|
||||||
|
file_ = sys.stdin
|
||||||
|
file_ = (line.encode("utf8") for line in file_)
|
||||||
|
else:
|
||||||
|
input_path = Path(loc)
|
||||||
|
if not input_path.exists() or not input_path.is_file():
|
||||||
|
msg.fail("Not a valid input data file", loc, exits=1)
|
||||||
|
msg.info("Using data from {}".format(input_path.parts[-1]))
|
||||||
|
file_ = input_path.open()
|
||||||
|
for line in file_:
|
||||||
|
data = srsly.json_loads(line)
|
||||||
|
text = data["text"]
|
||||||
|
yield text
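As the docstring above says, the expected input is one JSON object per line with a "text" key. A hypothetical two-line input and how this generator reads it:

import srsly

lines = [
    '{"text": "This is the first document."}',
    '{"text": "And this is the second one."}',
]
for line in lines:
    data = srsly.json_loads(line)
    print(data["text"])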
|
||||||
|
|
|
||||||
|
|
@@ -2,100 +2,316 @@
|
||||||
from __future__ import unicode_literals, division, print_function
|
from __future__ import unicode_literals, division, print_function
|
||||||
|
|
||||||
import plac
|
import plac
|
||||||
|
import os
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
import tqdm
|
import tqdm
|
||||||
from thinc.neural._classes.model import Model
|
from thinc.neural._classes.model import Model
|
||||||
from timeit import default_timer as timer
|
from timeit import default_timer as timer
|
||||||
|
import shutil
|
||||||
|
import srsly
|
||||||
|
from wasabi import Printer
|
||||||
|
import contextlib
|
||||||
|
import random
|
||||||
|
|
||||||
from ._messages import Messages
|
from .._ml import create_default_optimizer
|
||||||
from ..attrs import PROB, IS_OOV, CLUSTER, LANG
|
from ..attrs import PROB, IS_OOV, CLUSTER, LANG
|
||||||
from ..gold import GoldCorpus, minibatch
|
from ..gold import GoldCorpus
|
||||||
from ..util import prints
|
|
||||||
from .. import util
|
from .. import util
|
||||||
from .. import about
|
from .. import about
|
||||||
from .. import displacy
|
|
||||||
from ..compat import json_dumps
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
lang=("model language", "positional", None, str),
|
lang=("Model language", "positional", None, str),
|
||||||
output_dir=("output directory to store model in", "positional", None, str),
|
output_path=("Output directory to store model in", "positional", None, Path),
|
||||||
train_data=("location of JSON-formatted training data", "positional",
|
train_path=("Location of JSON-formatted training data", "positional", None, Path),
|
||||||
None, str),
|
dev_path=("Location of JSON-formatted development data", "positional", None, Path),
|
||||||
dev_data=("location of JSON-formatted development data (optional)",
|
raw_text=(
|
||||||
"positional", None, str),
|
"Path to jsonl file with unlabelled text documents.",
|
||||||
n_iter=("number of iterations", "option", "n", int),
|
"option",
|
||||||
n_sents=("number of sentences", "option", "ns", int),
|
"rt",
|
||||||
|
Path,
|
||||||
|
),
|
||||||
|
base_model=("Name of model to update (optional)", "option", "b", str),
|
||||||
|
pipeline=("Comma-separated names of pipeline components", "option", "p", str),
|
||||||
|
vectors=("Model to load vectors from", "option", "v", str),
|
||||||
|
n_iter=("Number of iterations", "option", "n", int),
|
||||||
|
n_examples=("Number of examples", "option", "ns", int),
|
||||||
use_gpu=("Use GPU", "option", "g", int),
|
use_gpu=("Use GPU", "option", "g", int),
|
||||||
vectors=("Model to load vectors from", "option", "v"),
|
|
||||||
no_tagger=("Don't train tagger", "flag", "T", bool),
|
|
||||||
no_parser=("Don't train parser", "flag", "P", bool),
|
|
||||||
no_entities=("Don't train NER", "flag", "N", bool),
|
|
||||||
parser_multitasks=("Side objectives for parser CNN, e.g. dep dep,tag", "option", "pt", str),
|
|
||||||
entity_multitasks=("Side objectives for ner CNN, e.g. dep dep,tag", "option", "et", str),
|
|
||||||
gold_preproc=("Use gold preprocessing", "flag", "G", bool),
|
|
||||||
version=("Model version", "option", "V", str),
|
version=("Model version", "option", "V", str),
|
||||||
meta_path=("Optional path to meta.json. All relevant properties will be "
|
meta_path=("Optional path to meta.json to use as base.", "option", "m", Path),
|
||||||
"overwritten.", "option", "m", Path),
|
init_tok2vec=(
|
||||||
verbose=("Display more information for debug", "option", None, bool))
|
"Path to pretrained weights for the token-to-vector parts of the models. See 'spacy pretrain'. Experimental.",
|
||||||
def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
|
"option",
|
||||||
parser_multitasks='', entity_multitasks='',
|
"t2v",
|
||||||
use_gpu=-1, vectors=None, no_tagger=False,
|
Path,
|
||||||
no_parser=False, no_entities=False, gold_preproc=False,
|
),
|
||||||
version="0.0.0", meta_path=None, verbose=False):
|
parser_multitasks=(
|
||||||
|
"Side objectives for parser CNN, e.g. 'dep' or 'dep,tag'",
|
||||||
|
"option",
|
||||||
|
"pt",
|
||||||
|
str,
|
||||||
|
),
|
||||||
|
entity_multitasks=(
|
||||||
|
"Side objectives for NER CNN, e.g. 'dep' or 'dep,tag'",
|
||||||
|
"option",
|
||||||
|
"et",
|
||||||
|
str,
|
||||||
|
),
|
||||||
|
noise_level=("Amount of corruption for data augmentation", "option", "nl", float),
|
||||||
|
gold_preproc=("Use gold preprocessing", "flag", "G", bool),
|
||||||
|
learn_tokens=("Make parser learn gold-standard tokenization", "flag", "T", bool),
|
||||||
|
verbose=("Display more information for debug", "flag", "VV", bool),
|
||||||
|
debug=("Run data diagnostics before training", "flag", "D", bool),
|
||||||
|
)
|
||||||
|
def train(
|
||||||
|
lang,
|
||||||
|
output_path,
|
||||||
|
train_path,
|
||||||
|
dev_path,
|
||||||
|
raw_text=None,
|
||||||
|
base_model=None,
|
||||||
|
pipeline="tagger,parser,ner",
|
||||||
|
vectors=None,
|
||||||
|
n_iter=30,
|
||||||
|
n_examples=0,
|
||||||
|
use_gpu=-1,
|
||||||
|
version="0.0.0",
|
||||||
|
meta_path=None,
|
||||||
|
init_tok2vec=None,
|
||||||
|
parser_multitasks="",
|
||||||
|
entity_multitasks="",
|
||||||
|
noise_level=0.0,
|
||||||
|
gold_preproc=False,
|
||||||
|
learn_tokens=False,
|
||||||
|
verbose=False,
|
||||||
|
debug=False,
|
||||||
|
):
|
||||||
"""
|
"""
|
||||||
Train a model. Expects data in spaCy's JSON format.
|
Train or update a spaCy model. Requires data to be formatted in spaCy's
|
||||||
|
JSON format. To convert data from other formats, use the `spacy convert`
|
||||||
|
command.
|
||||||
"""
|
"""
|
||||||
|
msg = Printer()
|
||||||
util.fix_random_seed()
|
util.fix_random_seed()
|
||||||
util.set_env_log(True)
|
util.set_env_log(verbose)
|
||||||
n_sents = n_sents or None
|
|
||||||
output_path = util.ensure_path(output_dir)
|
# Make sure all files and paths exists if they are needed
|
||||||
train_path = util.ensure_path(train_data)
|
train_path = util.ensure_path(train_path)
|
||||||
dev_path = util.ensure_path(dev_data)
|
dev_path = util.ensure_path(dev_path)
|
||||||
meta_path = util.ensure_path(meta_path)
|
meta_path = util.ensure_path(meta_path)
|
||||||
|
if raw_text is not None:
|
||||||
|
raw_text = list(srsly.read_jsonl(raw_text))
|
||||||
|
if not train_path or not train_path.exists():
|
||||||
|
msg.fail("Training data not found", train_path, exits=1)
|
||||||
|
if not dev_path or not dev_path.exists():
|
||||||
|
msg.fail("Development data not found", dev_path, exits=1)
|
||||||
|
if meta_path is not None and not meta_path.exists():
|
||||||
|
msg.fail("Can't find model meta.json", meta_path, exits=1)
|
||||||
|
meta = srsly.read_json(meta_path) if meta_path else {}
|
||||||
|
if output_path.exists() and [p for p in output_path.iterdir() if p.is_dir()]:
|
||||||
|
msg.warn(
|
||||||
|
"Output directory is not empty",
|
||||||
|
"This can lead to unintended side effects when saving the model. "
|
||||||
|
"Please use an empty directory or a different path instead. If "
|
||||||
|
"the specified output path doesn't exist, the directory will be "
|
||||||
|
"created for you.",
|
||||||
|
)
|
||||||
if not output_path.exists():
|
if not output_path.exists():
|
||||||
output_path.mkdir()
|
output_path.mkdir()
|
||||||
if not train_path.exists():
|
|
||||||
prints(train_path, title=Messages.M050, exits=1)
|
|
||||||
if dev_path and not dev_path.exists():
|
|
||||||
prints(dev_path, title=Messages.M051, exits=1)
|
|
||||||
if meta_path is not None and not meta_path.exists():
|
|
||||||
prints(meta_path, title=Messages.M020, exits=1)
|
|
||||||
meta = util.read_json(meta_path) if meta_path else {}
|
|
||||||
if not isinstance(meta, dict):
|
|
||||||
prints(Messages.M053.format(meta_type=type(meta)),
|
|
||||||
title=Messages.M052, exits=1)
|
|
||||||
meta.setdefault('lang', lang)
|
|
||||||
meta.setdefault('name', 'unnamed')
|
|
||||||
|
|
||||||
pipeline = ['tagger', 'parser', 'ner']
|
|
||||||
if no_tagger and 'tagger' in pipeline:
|
|
||||||
pipeline.remove('tagger')
|
|
||||||
if no_parser and 'parser' in pipeline:
|
|
||||||
pipeline.remove('parser')
|
|
||||||
if no_entities and 'ner' in pipeline:
|
|
||||||
pipeline.remove('ner')
|
|
||||||
|
|
||||||
# Take dropout and batch size as generators of values -- dropout
|
# Take dropout and batch size as generators of values -- dropout
|
||||||
# starts high and decays sharply, to force the optimizer to explore.
|
# starts high and decays sharply, to force the optimizer to explore.
|
||||||
# Batch size starts at 1 and grows, so that we make updates quickly
|
# Batch size starts at 1 and grows, so that we make updates quickly
|
||||||
# at the beginning of training.
|
# at the beginning of training.
|
||||||
dropout_rates = util.decaying(util.env_opt('dropout_from', 0.2),
|
dropout_rates = util.decaying(
|
||||||
util.env_opt('dropout_to', 0.2),
|
util.env_opt("dropout_from", 0.2),
|
||||||
util.env_opt('dropout_decay', 0.0))
|
util.env_opt("dropout_to", 0.2),
|
||||||
batch_sizes = util.compounding(util.env_opt('batch_from', 1),
|
util.env_opt("dropout_decay", 0.0),
|
||||||
util.env_opt('batch_to', 16),
|
)
|
||||||
util.env_opt('batch_compound', 1.001))
|
batch_sizes = util.compounding(
|
||||||
max_doc_len = util.env_opt('max_doc_len', 5000)
|
util.env_opt("batch_from", 100.0),
|
||||||
corpus = GoldCorpus(train_path, dev_path, limit=n_sents)
|
util.env_opt("batch_to", 1000.0),
|
||||||
|
util.env_opt("batch_compound", 1.001),
|
||||||
|
)
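To illustrate the two generators, a short sketch using spaCy's util helpers the same way as above (the printed values assume the defaults shown): compounding() multiplies the batch size by the compound factor at each step until it hits the ceiling, and decaying() moves the dropout rate from its start towards its end value.

from spacy.util import compounding, decaying

batch_sizes = compounding(100.0, 1000.0, 1.001)  # 100.0, 100.1, 100.2001, ... capped at 1000.0
dropouts = decaying(0.2, 0.2, 0.0)               # start == end here, so the rate stays at 0.2
for _ in range(3):
    print(next(batch_sizes), next(dropouts))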
|
||||||
|
|
||||||
|
# Set up the base model and pipeline. If a base model is specified, load
|
||||||
|
# the model and make sure the pipeline matches the pipeline setting. If
|
||||||
|
# training starts from a blank model, initialize the language class.
|
||||||
|
pipeline = [p.strip() for p in pipeline.split(",")]
|
||||||
|
msg.text("Training pipeline: {}".format(pipeline))
|
||||||
|
if base_model:
|
||||||
|
msg.text("Starting with base model '{}'".format(base_model))
|
||||||
|
nlp = util.load_model(base_model)
|
||||||
|
if nlp.lang != lang:
|
||||||
|
msg.fail(
|
||||||
|
"Model language ('{}') doesn't match language specified as "
|
||||||
|
"`lang` argument ('{}') ".format(nlp.lang, lang),
|
||||||
|
exits=1,
|
||||||
|
)
|
||||||
|
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipeline]
|
||||||
|
nlp.disable_pipes(*other_pipes)
|
||||||
|
for pipe in pipeline:
|
||||||
|
if pipe not in nlp.pipe_names:
|
||||||
|
nlp.add_pipe(nlp.create_pipe(pipe))
|
||||||
|
else:
|
||||||
|
msg.text("Starting with blank model '{}'".format(lang))
|
||||||
|
lang_cls = util.get_lang_class(lang)
|
||||||
|
nlp = lang_cls()
|
||||||
|
for pipe in pipeline:
|
||||||
|
nlp.add_pipe(nlp.create_pipe(pipe))
|
||||||
|
|
||||||
|
if learn_tokens:
|
||||||
|
nlp.add_pipe(nlp.create_pipe("merge_subtokens"))
|
||||||
|
|
||||||
|
if vectors:
|
||||||
|
msg.text("Loading vector from model '{}'".format(vectors))
|
||||||
|
_load_vectors(nlp, vectors)
|
||||||
|
|
||||||
|
# Multitask objectives
|
||||||
|
multitask_options = [("parser", parser_multitasks), ("ner", entity_multitasks)]
|
||||||
|
for pipe_name, multitasks in multitask_options:
|
||||||
|
if multitasks:
|
||||||
|
if pipe_name not in pipeline:
|
||||||
|
msg.fail(
|
||||||
|
"Can't use multitask objective without '{}' in the "
|
||||||
|
"pipeline".format(pipe_name)
|
||||||
|
)
|
||||||
|
pipe = nlp.get_pipe(pipe_name)
|
||||||
|
for objective in multitasks.split(","):
|
||||||
|
pipe.add_multitask_objective(objective)
|
||||||
|
|
||||||
|
# Prepare training corpus
|
||||||
|
msg.text("Counting training words (limit={})".format(n_examples))
|
||||||
|
corpus = GoldCorpus(train_path, dev_path, limit=n_examples)
|
||||||
n_train_words = corpus.count_train()
|
n_train_words = corpus.count_train()
|
||||||
|
|
||||||
lang_class = util.get_lang_class(lang)
|
if base_model:
|
||||||
nlp = lang_class()
|
# Start with an existing model, use default optimizer
|
||||||
meta['pipeline'] = pipeline
|
optimizer = create_default_optimizer(Model.ops)
|
||||||
nlp.meta.update(meta)
|
else:
|
||||||
if vectors:
|
# Start with a blank model, call begin_training
|
||||||
print("Load vectors model", vectors)
|
optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)
|
||||||
|
|
||||||
|
nlp._optimizer = None
|
||||||
|
|
||||||
|
# Load in pre-trained weights
|
||||||
|
if init_tok2vec is not None:
|
||||||
|
components = _load_pretrained_tok2vec(nlp, init_tok2vec)
|
||||||
|
msg.text("Loaded pretrained tok2vec for: {}".format(components))
|
||||||
|
|
||||||
|
# fmt: off
|
||||||
|
row_head = ("Itn", "Dep Loss", "NER Loss", "UAS", "NER P", "NER R", "NER F", "Tag %", "Token %", "CPU WPS", "GPU WPS")
|
||||||
|
row_settings = {
|
||||||
|
"widths": (3, 10, 10, 7, 7, 7, 7, 7, 7, 7, 7),
|
||||||
|
"aligns": tuple(["r" for i in row_head]),
|
||||||
|
"spacing": 2
|
||||||
|
}
|
||||||
|
# fmt: on
|
||||||
|
print("")
|
||||||
|
msg.row(row_head, **row_settings)
|
||||||
|
msg.row(["-" * width for width in row_settings["widths"]], **row_settings)
|
||||||
|
try:
|
||||||
|
for i in range(n_iter):
|
||||||
|
train_docs = corpus.train_docs(
|
||||||
|
nlp, noise_level=noise_level, gold_preproc=gold_preproc, max_length=0
|
||||||
|
)
|
||||||
|
if raw_text:
|
||||||
|
random.shuffle(raw_text)
|
||||||
|
raw_batches = util.minibatch(
|
||||||
|
(nlp.make_doc(rt["text"]) for rt in raw_text), size=8
|
||||||
|
)
|
||||||
|
words_seen = 0
|
||||||
|
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
|
||||||
|
losses = {}
|
||||||
|
for batch in util.minibatch_by_words(train_docs, size=batch_sizes):
|
||||||
|
if not batch:
|
||||||
|
continue
|
||||||
|
docs, golds = zip(*batch)
|
||||||
|
nlp.update(
|
||||||
|
docs,
|
||||||
|
golds,
|
||||||
|
sgd=optimizer,
|
||||||
|
drop=next(dropout_rates),
|
||||||
|
losses=losses,
|
||||||
|
)
|
||||||
|
if raw_text:
|
||||||
|
# If raw text is available, perform 'rehearsal' updates,
|
||||||
|
# which use unlabelled data to reduce overfitting.
|
||||||
|
raw_batch = list(next(raw_batches))
|
||||||
|
nlp.rehearse(raw_batch, sgd=optimizer, losses=losses)
|
||||||
|
if not int(os.environ.get("LOG_FRIENDLY", 0)):
|
||||||
|
pbar.update(sum(len(doc) for doc in docs))
|
||||||
|
words_seen += sum(len(doc) for doc in docs)
|
||||||
|
with nlp.use_params(optimizer.averages):
|
||||||
|
util.set_env_log(False)
|
||||||
|
epoch_model_path = output_path / ("model%d" % i)
|
||||||
|
nlp.to_disk(epoch_model_path)
|
||||||
|
nlp_loaded = util.load_model_from_path(epoch_model_path)
|
||||||
|
dev_docs = list(corpus.dev_docs(nlp_loaded, gold_preproc=gold_preproc))
|
||||||
|
nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs)
|
||||||
|
start_time = timer()
|
||||||
|
scorer = nlp_loaded.evaluate(dev_docs, debug)
|
||||||
|
end_time = timer()
|
||||||
|
if use_gpu < 0:
|
||||||
|
gpu_wps = None
|
||||||
|
cpu_wps = nwords / (end_time - start_time)
|
||||||
|
else:
|
||||||
|
gpu_wps = nwords / (end_time - start_time)
|
||||||
|
with Model.use_device("cpu"):
|
||||||
|
nlp_loaded = util.load_model_from_path(epoch_model_path)
|
||||||
|
dev_docs = list(
|
||||||
|
corpus.dev_docs(nlp_loaded, gold_preproc=gold_preproc)
|
||||||
|
)
|
||||||
|
start_time = timer()
|
||||||
|
scorer = nlp_loaded.evaluate(dev_docs)
|
||||||
|
end_time = timer()
|
||||||
|
cpu_wps = nwords / (end_time - start_time)
|
||||||
|
acc_loc = output_path / ("model%d" % i) / "accuracy.json"
|
||||||
|
srsly.write_json(acc_loc, scorer.scores)
|
||||||
|
|
||||||
|
# Update model meta.json
|
||||||
|
meta["lang"] = nlp.lang
|
||||||
|
meta["pipeline"] = nlp.pipe_names
|
||||||
|
meta["spacy_version"] = ">=%s" % about.__version__
|
||||||
|
meta["accuracy"] = scorer.scores
|
||||||
|
meta["speed"] = {"nwords": nwords, "cpu": cpu_wps, "gpu": gpu_wps}
|
||||||
|
meta["vectors"] = {
|
||||||
|
"width": nlp.vocab.vectors_length,
|
||||||
|
"vectors": len(nlp.vocab.vectors),
|
||||||
|
"keys": nlp.vocab.vectors.n_keys,
|
||||||
|
"name": nlp.vocab.vectors.name
|
||||||
|
}
|
||||||
|
meta.setdefault("name", "model%d" % i)
|
||||||
|
meta.setdefault("version", version)
|
||||||
|
meta_loc = output_path / ("model%d" % i) / "meta.json"
|
||||||
|
srsly.write_json(meta_loc, meta)
|
||||||
|
|
||||||
|
util.set_env_log(verbose)
|
||||||
|
|
||||||
|
progress = _get_progress(
|
||||||
|
i, losses, scorer.scores, cpu_wps=cpu_wps, gpu_wps=gpu_wps
|
||||||
|
)
|
||||||
|
msg.row(progress, **row_settings)
|
||||||
|
finally:
|
||||||
|
with nlp.use_params(optimizer.averages):
|
||||||
|
final_model_path = output_path / "model-final"
|
||||||
|
nlp.to_disk(final_model_path)
|
||||||
|
msg.good("Saved model to output directory", final_model_path)
|
||||||
|
with msg.loading("Creating best model..."):
|
||||||
|
best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names)
|
||||||
|
msg.good("Created best model", best_model_path)
|
||||||
|
|
||||||
|
|
||||||
|
@contextlib.contextmanager
|
||||||
|
def _create_progress_bar(total):
|
||||||
|
if int(os.environ.get("LOG_FRIENDLY", 0)):
|
||||||
|
yield
|
||||||
|
else:
|
||||||
|
pbar = tqdm.tqdm(total=total, leave=False)
|
||||||
|
yield pbar
|
||||||
|
|
||||||
|
|
||||||
|
def _load_vectors(nlp, vectors):
|
||||||
util.load_model(vectors, vocab=nlp.vocab)
|
util.load_model(vectors, vocab=nlp.vocab)
|
||||||
for lex in nlp.vocab:
|
for lex in nlp.vocab:
|
||||||
values = {}
|
values = {}
|
||||||
|
|
@@ -106,131 +322,92 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
|
||||||
values[lex.vocab.strings[attr]] = func(lex.orth_)
|
values[lex.vocab.strings[attr]] = func(lex.orth_)
|
||||||
lex.set_attrs(**values)
|
lex.set_attrs(**values)
|
||||||
lex.is_oov = False
|
lex.is_oov = False
|
||||||
for name in pipeline:
|
|
||||||
nlp.add_pipe(nlp.create_pipe(name), name=name)
|
|
||||||
if parser_multitasks:
|
|
||||||
for objective in parser_multitasks.split(','):
|
|
||||||
nlp.parser.add_multitask_objective(objective)
|
|
||||||
if entity_multitasks:
|
|
||||||
for objective in entity_multitasks.split(','):
|
|
||||||
nlp.entity.add_multitask_objective(objective)
|
|
||||||
optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)
|
|
||||||
nlp._optimizer = None
|
|
||||||
|
|
||||||
print("Itn. Dep Loss NER Loss UAS NER P. NER R. NER F. Tag % Token % CPU WPS GPU WPS")
|
|
||||||
try:
|
|
||||||
train_docs = corpus.train_docs(nlp, projectivize=True, noise_level=0.0,
|
|
||||||
gold_preproc=gold_preproc, max_length=0)
|
|
||||||
train_docs = list(train_docs)
|
|
||||||
for i in range(n_iter):
|
|
||||||
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
|
|
||||||
losses = {}
|
|
||||||
for batch in minibatch(train_docs, size=batch_sizes):
|
|
||||||
batch = [(d, g) for (d, g) in batch if len(d) < max_doc_len]
|
|
||||||
if not batch:
|
|
||||||
continue
|
|
||||||
docs, golds = zip(*batch)
|
|
||||||
nlp.update(docs, golds, sgd=optimizer,
|
|
||||||
drop=next(dropout_rates), losses=losses)
|
|
||||||
pbar.update(sum(len(doc) for doc in docs))
|
|
||||||
|
|
||||||
with nlp.use_params(optimizer.averages):
|
def _load_pretrained_tok2vec(nlp, loc):
|
||||||
util.set_env_log(False)
|
"""Load pre-trained weights for the 'token-to-vector' part of the component
|
||||||
epoch_model_path = output_path / ('model%d' % i)
|
models, which is typically a CNN. See 'spacy pretrain'. Experimental.
|
||||||
nlp.to_disk(epoch_model_path)
|
"""
|
||||||
nlp_loaded = util.load_model_from_path(epoch_model_path)
|
with loc.open("rb") as file_:
|
||||||
dev_docs = list(corpus.dev_docs(
|
weights_data = file_.read()
|
||||||
nlp_loaded,
|
loaded = []
|
||||||
gold_preproc=gold_preproc))
|
for name, component in nlp.pipeline:
|
||||||
nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs)
|
if hasattr(component, "model") and hasattr(component.model, "tok2vec"):
|
||||||
start_time = timer()
|
component.tok2vec.from_bytes(weights_data)
|
||||||
scorer = nlp_loaded.evaluate(dev_docs, verbose)
|
loaded.append(name)
|
||||||
end_time = timer()
|
return loaded
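Sketch of how the pretrained weights come back in, end to end; the file paths and model name are placeholders, and the -t2v shorthand corresponds to the init_tok2vec option defined above:

from pathlib import Path
from spacy.cli import pretrain, train

# 1) Pretrain the tok2vec layer on raw JSONL text; this writes model0.bin, model1.bin, ...
#    plus config.json and log.jsonl into the output directory.
pretrain("texts.jsonl", "en_vectors_web_lg", "pretrain-output", nr_iter=2)

# 2) Reuse those weights when training the pipeline, roughly the same as running
#    `spacy train en models train.json dev.json -t2v pretrain-output/model1.bin` on the CLI.
train("en", "models", "train.json", "dev.json",
      init_tok2vec=Path("pretrain-output/model1.bin"), n_iter=5)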
|
||||||
if use_gpu < 0:
|
|
||||||
gpu_wps = None
|
|
||||||
cpu_wps = nwords/(end_time-start_time)
|
def _collate_best_model(meta, output_path, components):
|
||||||
|
bests = {}
|
||||||
|
for component in components:
|
||||||
|
bests[component] = _find_best(output_path, component)
|
||||||
|
best_dest = output_path / "model-best"
|
||||||
|
shutil.copytree(output_path / "model-final", best_dest)
|
||||||
|
for component, best_component_src in bests.items():
|
||||||
|
shutil.rmtree(best_dest / component)
|
||||||
|
shutil.copytree(best_component_src / component, best_dest / component)
|
||||||
|
accs = srsly.read_json(best_component_src / "accuracy.json")
|
||||||
|
for metric in _get_metrics(component):
|
||||||
|
meta["accuracy"][metric] = accs[metric]
|
||||||
|
srsly.write_json(best_dest / "meta.json", meta)
|
||||||
|
return best_dest
|
||||||
|
|
||||||
|
|
||||||
|
def _find_best(experiment_dir, component):
|
||||||
|
accuracies = []
|
||||||
|
for epoch_model in experiment_dir.iterdir():
|
||||||
|
if epoch_model.is_dir() and epoch_model.parts[-1] != "model-final":
|
||||||
|
accs = srsly.read_json(epoch_model / "accuracy.json")
|
||||||
|
scores = [accs.get(metric, 0.0) for metric in _get_metrics(component)]
|
||||||
|
accuracies.append((scores, epoch_model))
|
||||||
|
if accuracies:
|
||||||
|
return max(accuracies)[1]
|
||||||
else:
|
else:
|
||||||
gpu_wps = nwords/(end_time-start_time)
|
return None
|
||||||
with Model.use_device('cpu'):
|
|
||||||
nlp_loaded = util.load_model_from_path(epoch_model_path)
|
|
||||||
dev_docs = list(corpus.dev_docs(
|
|
||||||
nlp_loaded, gold_preproc=gold_preproc))
|
|
||||||
start_time = timer()
|
|
||||||
scorer = nlp_loaded.evaluate(dev_docs)
|
|
||||||
end_time = timer()
|
|
||||||
cpu_wps = nwords/(end_time-start_time)
|
|
||||||
acc_loc = (output_path / ('model%d' % i) / 'accuracy.json')
|
|
||||||
with acc_loc.open('w') as file_:
|
|
||||||
file_.write(json_dumps(scorer.scores))
|
|
||||||
meta_loc = output_path / ('model%d' % i) / 'meta.json'
|
|
||||||
meta['accuracy'] = scorer.scores
|
|
||||||
meta['speed'] = {'nwords': nwords, 'cpu': cpu_wps,
|
|
||||||
'gpu': gpu_wps}
|
|
||||||
meta['vectors'] = {'width': nlp.vocab.vectors_length,
|
|
||||||
'vectors': len(nlp.vocab.vectors),
|
|
||||||
'keys': nlp.vocab.vectors.n_keys}
|
|
||||||
meta['lang'] = nlp.lang
|
|
||||||
meta['pipeline'] = pipeline
|
|
||||||
meta['spacy_version'] = '>=%s' % about.__version__
|
|
||||||
meta.setdefault('name', 'model%d' % i)
|
|
||||||
meta.setdefault('version', version)
|
|
||||||
|
|
||||||
with meta_loc.open('w') as file_:
|
|
||||||
file_.write(json_dumps(meta))
|
|
||||||
util.set_env_log(True)
|
|
||||||
print_progress(i, losses, scorer.scores, cpu_wps=cpu_wps,
|
|
||||||
gpu_wps=gpu_wps)
|
|
||||||
finally:
|
|
||||||
print("Saving model...")
|
|
||||||
with nlp.use_params(optimizer.averages):
|
|
||||||
final_model_path = output_path / 'model-final'
|
|
||||||
nlp.to_disk(final_model_path)
|
|
||||||
|
|
||||||
|
|
||||||
def _render_parses(i, to_render):
|
def _get_metrics(component):
|
||||||
to_render[0].user_data['title'] = "Batch %d" % i
|
if component == "parser":
|
||||||
with Path('/tmp/entities.html').open('w') as file_:
|
return ("las", "uas", "token_acc")
|
||||||
html = displacy.render(to_render[:5], style='ent', page=True)
|
elif component == "tagger":
|
||||||
file_.write(html)
|
return ("tags_acc",)
|
||||||
with Path('/tmp/parses.html').open('w') as file_:
|
elif component == "ner":
|
||||||
html = displacy.render(to_render[:5], style='dep', page=True)
|
return ("ents_f", "ents_p", "ents_r")
|
||||||
file_.write(html)
|
return ("token_acc",)
|
||||||
|
|
||||||
|
|
||||||
def print_progress(itn, losses, dev_scores, cpu_wps=0.0, gpu_wps=0.0):
|
def _get_progress(itn, losses, dev_scores, cpu_wps=0.0, gpu_wps=0.0):
|
||||||
scores = {}
|
scores = {}
|
||||||
for col in ['dep_loss', 'tag_loss', 'uas', 'tags_acc', 'token_acc',
|
for col in [
|
||||||
'ents_p', 'ents_r', 'ents_f', 'cpu_wps', 'gpu_wps']:
|
"dep_loss",
|
||||||
|
"tag_loss",
|
||||||
|
"uas",
|
||||||
|
"tags_acc",
|
||||||
|
"token_acc",
|
||||||
|
"ents_p",
|
||||||
|
"ents_r",
|
||||||
|
"ents_f",
|
||||||
|
"cpu_wps",
|
||||||
|
"gpu_wps",
|
||||||
|
]:
|
||||||
scores[col] = 0.0
|
scores[col] = 0.0
|
||||||
scores['dep_loss'] = losses.get('parser', 0.0)
|
scores["dep_loss"] = losses.get("parser", 0.0)
|
||||||
scores['ner_loss'] = losses.get('ner', 0.0)
|
scores["ner_loss"] = losses.get("ner", 0.0)
|
||||||
scores['tag_loss'] = losses.get('tagger', 0.0)
|
scores["tag_loss"] = losses.get("tagger", 0.0)
|
||||||
scores.update(dev_scores)
|
scores.update(dev_scores)
|
||||||
scores['cpu_wps'] = cpu_wps
|
scores["cpu_wps"] = cpu_wps
|
||||||
scores['gpu_wps'] = gpu_wps or 0.0
|
scores["gpu_wps"] = gpu_wps or 0.0
|
||||||
tpl = ''.join((
|
return [
|
||||||
'{:<6d}',
|
itn,
|
||||||
'{dep_loss:<10.3f}',
|
"{:.3f}".format(scores["dep_loss"]),
|
||||||
'{ner_loss:<10.3f}',
|
"{:.3f}".format(scores["ner_loss"]),
|
||||||
'{uas:<8.3f}',
|
"{:.3f}".format(scores["uas"]),
|
||||||
'{ents_p:<8.3f}',
|
"{:.3f}".format(scores["ents_p"]),
|
||||||
'{ents_r:<8.3f}',
|
"{:.3f}".format(scores["ents_r"]),
|
||||||
'{ents_f:<8.3f}',
|
"{:.3f}".format(scores["ents_f"]),
|
||||||
'{tags_acc:<8.3f}',
|
"{:.3f}".format(scores["tags_acc"]),
|
||||||
'{token_acc:<9.3f}',
|
"{:.3f}".format(scores["token_acc"]),
|
||||||
'{cpu_wps:<9.1f}',
|
"{:.0f}".format(scores["cpu_wps"]),
|
||||||
'{gpu_wps:.1f}',
|
"{:.0f}".format(scores["gpu_wps"]),
|
||||||
))
|
]
|
||||||
print(tpl.format(itn, **scores))
|
|
||||||
|
|
||||||
|
|
||||||
def print_results(scorer):
|
|
||||||
results = {
|
|
||||||
'TOK': '%.2f' % scorer.token_acc,
|
|
||||||
'POS': '%.2f' % scorer.tags_acc,
|
|
||||||
'UAS': '%.2f' % scorer.uas,
|
|
||||||
'LAS': '%.2f' % scorer.las,
|
|
||||||
'NER P': '%.2f' % scorer.ents_p,
|
|
||||||
'NER R': '%.2f' % scorer.ents_r,
|
|
||||||
'NER F': '%.2f' % scorer.ents_f}
|
|
||||||
util.print_table(results, title="Results")
|
|
||||||
|
|
|
||||||
2
spacy/cli/ud/__init__.py
Normal file
|
|
@@ -0,0 +1,2 @@
|
||||||
|
from .conll17_ud_eval import main as ud_evaluate # noqa: F401
|
||||||
|
from .ud_train import main as ud_train # noqa: F401
|
||||||
614
spacy/cli/ud/conll17_ud_eval.py
Normal file
|
|
@@ -0,0 +1,614 @@
|
||||||
|
#!/usr/bin/env python
|
||||||
|
# flake8: noqa
|
||||||
|
|
||||||
|
# CoNLL 2017 UD Parsing evaluation script.
|
||||||
|
#
|
||||||
|
# Compatible with Python 2.7 and 3.2+, can be used either as a module
|
||||||
|
# or a standalone executable.
|
||||||
|
#
|
||||||
|
# Copyright 2017 Institute of Formal and Applied Linguistics (UFAL),
|
||||||
|
# Faculty of Mathematics and Physics, Charles University, Czech Republic.
|
||||||
|
#
|
||||||
|
# Changelog:
|
||||||
|
# - [02 Jan 2017] Version 0.9: Initial release
|
||||||
|
# - [25 Jan 2017] Version 0.9.1: Fix bug in LCS alignment computation
|
||||||
|
# - [10 Mar 2017] Version 1.0: Add documentation and test
|
||||||
|
# Compare HEADs correctly using aligned words
|
||||||
|
# Allow evaluation with erroneous spaces in forms
|
||||||
|
# Compare forms in LCS case insensitively
|
||||||
|
# Detect cycles and multiple root nodes
|
||||||
|
# Compute AlignedAccuracy
|
||||||
|
|
||||||
|
# Command line usage
|
||||||
|
# ------------------
|
||||||
|
# conll17_ud_eval.py [-v] [-w weights_file] gold_conllu_file system_conllu_file
|
||||||
|
#
|
||||||
|
# - if no -v is given, only the CoNLL17 UD Shared Task evaluation LAS metric
|
||||||
|
# is printed
|
||||||
|
# - if -v is given, several metrics are printed (as precision, recall, F1 score,
|
||||||
|
# and in case the metric is computed on aligned words also accuracy on these):
|
||||||
|
# - Tokens: how well do the gold tokens match system tokens
|
||||||
|
# - Sentences: how well do the gold sentences match system sentences
|
||||||
|
# - Words: how well can the gold words be aligned to system words
|
||||||
|
# - UPOS: using aligned words, how well does UPOS match
|
||||||
|
# - XPOS: using aligned words, how well does XPOS match
|
||||||
|
# - Feats: using aligned words, how well does FEATS match
|
||||||
|
# - AllTags: using aligned words, how well does UPOS+XPOS+FEATS match
|
||||||
|
# - Lemmas: using aligned words, how well does LEMMA match
|
||||||
|
# - UAS: using aligned words, how well does HEAD match
|
||||||
|
# - LAS: using aligned words, how well does HEAD+DEPREL(ignoring subtypes) match
|
||||||
|
# - if weights_file is given (with lines containing deprel-weight pairs),
|
||||||
|
# one more metric is shown:
|
||||||
|
# - WeightedLAS: as LAS, but each deprel (ignoring subtypes) has different weight
|
||||||
|
|
||||||
|
# API usage
|
||||||
|
# ---------
|
||||||
|
# - load_conllu(file)
|
||||||
|
# - loads CoNLL-U file from given file object to an internal representation
|
||||||
|
# - the file object should return str on both Python 2 and Python 3
|
||||||
|
# - raises UDError exception if the given file cannot be loaded
|
||||||
|
# - evaluate(gold_ud, system_ud)
|
||||||
|
# - evaluate the given gold and system CoNLL-U files (loaded with load_conllu)
|
||||||
|
# - raises UDError if the concatenated tokens of gold and system file do not match
|
||||||
|
# - returns a dictionary with the metrics described above, each metrics having
|
||||||
|
# four fields: precision, recall, f1 and aligned_accuracy (when using aligned
|
||||||
|
# words, otherwise this is None)
|
||||||
|
|
||||||
|
# Description of token matching
|
||||||
|
# -----------------------------
|
||||||
|
# In order to match tokens of gold file and system file, we consider the text
|
||||||
|
# resulting from concatenation of gold tokens and text resulting from
|
||||||
|
# concatenation of system tokens. These texts should match -- if they do not,
|
||||||
|
# the evaluation fails.
|
||||||
|
#
|
||||||
|
# If the texts do match, every token is represented as a range in this original
|
||||||
|
# text, and tokens are equal only if their range is the same.
|
||||||
|
|
||||||
|
# Description of word matching
|
||||||
|
# ----------------------------
|
||||||
|
# When matching words of gold file and system file, we first match the tokens.
|
||||||
|
# The words which are also tokens are matched as tokens, but words in multi-word
|
||||||
|
# tokens have to be handled differently.
|
||||||
|
#
|
||||||
|
# To handle multi-word tokens, we start by finding "multi-word spans".
|
||||||
|
# A multi-word span is a span in the original text such that
|
||||||
|
# - it contains at least one multi-word token
|
||||||
|
# - all multi-word tokens in the span (considering both gold and system ones)
|
||||||
|
# are completely inside the span (i.e., they do not "stick out")
|
||||||
|
# - the multi-word span is as small as possible
|
||||||
|
#
|
||||||
|
# For every multi-word span, we align the gold and system words completely
|
||||||
|
# inside this span using LCS on their FORMs. The words not intersecting
|
||||||
|
# (even partially) any multi-word span are then aligned as tokens.
|
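#
# For example (illustrative): if the gold file contains the multi-word token
# "can't" with words "ca" and "n't", while the system file tokenizes the same
# characters as two plain tokens "ca" and "n't", the multi-word span covers
# "can't", and the LCS over FORMs aligns gold "ca"/"n't" with system "ca"/"n't".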
||||||
|
|
||||||
|
|
||||||
|
from __future__ import division
|
||||||
|
from __future__ import print_function
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import io
|
||||||
|
import sys
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
# CoNLL-U column names
|
||||||
|
ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC = range(10)
|
||||||
|
|
||||||
|
# UD Error is used when raising exceptions in this module
|
||||||
|
class UDError(Exception):
|
||||||
|
pass
|
||||||
|
|
||||||
|
# Load given CoNLL-U file into internal representation
|
||||||
|
def load_conllu(file, check_parse=True):
|
||||||
|
# Internal representation classes
|
||||||
|
class UDRepresentation:
|
||||||
|
def __init__(self):
|
||||||
|
# Characters of all the tokens in the whole file.
|
||||||
|
# Whitespace between tokens is not included.
|
||||||
|
self.characters = []
|
||||||
|
# List of UDSpan instances with start&end indices into `characters`.
|
||||||
|
self.tokens = []
|
||||||
|
# List of UDWord instances.
|
||||||
|
self.words = []
|
||||||
|
# List of UDSpan instances with start&end indices into `characters`.
|
||||||
|
self.sentences = []
|
||||||
|
class UDSpan:
|
||||||
|
def __init__(self, start, end, characters):
|
||||||
|
self.start = start
|
||||||
|
# Note that self.end marks the first position **after the end** of span,
|
||||||
|
# so we can use characters[start:end] or range(start, end).
|
||||||
|
self.end = end
|
||||||
|
self.characters = characters
|
||||||
|
|
||||||
|
@property
|
||||||
|
def text(self):
|
||||||
|
return ''.join(self.characters[self.start:self.end])
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
return self.text
|
||||||
|
|
||||||
|
def __repr__(self):
|
||||||
|
return self.text
|
||||||
|
class UDWord:
|
||||||
|
def __init__(self, span, columns, is_multiword):
|
||||||
|
# Span of this word (or MWT, see below) within ud_representation.characters.
|
||||||
|
self.span = span
|
||||||
|
# 10 columns of the CoNLL-U file: ID, FORM, LEMMA,...
|
||||||
|
self.columns = columns
|
||||||
|
# is_multiword==True means that this word is part of a multi-word token.
|
||||||
|
# In that case, self.span marks the span of the whole multi-word token.
|
||||||
|
self.is_multiword = is_multiword
|
||||||
|
# Reference to the UDWord instance representing the HEAD (or None if root).
|
||||||
|
self.parent = None
|
||||||
|
# Let's ignore language-specific deprel subtypes.
|
||||||
|
self.columns[DEPREL] = columns[DEPREL].split(':')[0]
|
||||||
|
|
||||||
|
ud = UDRepresentation()
|
||||||
|
|
||||||
|
# Load the CoNLL-U file
|
||||||
|
index, sentence_start = 0, None
|
||||||
|
linenum = 0
|
||||||
|
while True:
|
||||||
|
line = file.readline()
|
||||||
|
linenum += 1
|
||||||
|
if not line:
|
||||||
|
break
|
||||||
|
line = line.rstrip("\r\n")
|
||||||
|
|
||||||
|
# Handle sentence start boundaries
|
||||||
|
if sentence_start is None:
|
||||||
|
# Skip comments
|
||||||
|
if line.startswith("#"):
|
||||||
|
continue
|
||||||
|
# Start a new sentence
|
||||||
|
ud.sentences.append(UDSpan(index, 0, ud.characters))
|
||||||
|
sentence_start = len(ud.words)
|
||||||
|
if not line:
|
||||||
|
# Add parent UDWord links and check there are no cycles
|
||||||
|
def process_word(word):
|
||||||
|
if word.parent == "remapping":
|
||||||
|
raise UDError("There is a cycle in a sentence")
|
||||||
|
if word.parent is None:
|
||||||
|
head = int(word.columns[HEAD])
|
||||||
|
if head > len(ud.words) - sentence_start:
|
||||||
|
raise UDError("Line {}: HEAD '{}' points outside of the sentence".format(
|
||||||
|
linenum, word.columns[HEAD]))
|
||||||
|
if head:
|
||||||
|
parent = ud.words[sentence_start + head - 1]
|
||||||
|
word.parent = "remapping"
|
||||||
|
process_word(parent)
|
||||||
|
word.parent = parent
|
||||||
|
|
||||||
|
for word in ud.words[sentence_start:]:
|
||||||
|
process_word(word)
|
||||||
|
|
||||||
|
# Check there is a single root node
|
||||||
|
if check_parse:
|
||||||
|
if len([word for word in ud.words[sentence_start:] if word.parent is None]) != 1:
|
||||||
|
raise UDError("There are multiple roots in a sentence")
|
||||||
|
|
||||||
|
# End the sentence
|
||||||
|
ud.sentences[-1].end = index
|
||||||
|
sentence_start = None
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Read next token/word
|
||||||
|
columns = line.split("\t")
|
||||||
|
if len(columns) != 10:
|
||||||
|
raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, line))
|
||||||
|
|
||||||
|
# Skip empty nodes
|
||||||
|
if "." in columns[ID]:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Delete spaces from FORM so gold.characters == system.characters
|
||||||
|
# even if one of them tokenizes the space.
|
||||||
|
columns[FORM] = columns[FORM].replace(" ", "")
|
||||||
|
if not columns[FORM]:
|
||||||
|
raise UDError("There is an empty FORM in the CoNLL-U file -- line %d" % linenum)
|
||||||
|
|
||||||
|
# Save token
|
||||||
|
ud.characters.extend(columns[FORM])
|
||||||
|
ud.tokens.append(UDSpan(index, index + len(columns[FORM]), ud.characters))
|
||||||
|
index += len(columns[FORM])
|
||||||
|
|
||||||
|
# Handle multi-word tokens to save word(s)
|
||||||
|
if "-" in columns[ID]:
|
||||||
|
try:
|
||||||
|
start, end = map(int, columns[ID].split("-"))
|
||||||
|
except ValueError:
|
||||||
|
raise UDError("Cannot parse multi-word token ID '{}'".format(columns[ID]))
|
||||||
|
|
||||||
|
for _ in range(start, end + 1):
|
||||||
|
word_line = file.readline().rstrip("\r\n")
|
||||||
|
word_columns = word_line.split("\t")
|
||||||
|
if len(word_columns) != 10:
|
||||||
|
print(columns)
|
||||||
|
raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, word_line))
|
||||||
|
ud.words.append(UDWord(ud.tokens[-1], word_columns, is_multiword=True))
|
||||||
|
# Basic tokens/words
|
||||||
|
else:
|
||||||
|
try:
|
||||||
|
word_id = int(columns[ID])
|
||||||
|
except ValueError:
|
||||||
|
raise UDError("Cannot parse word ID '{}'".format(columns[ID]))
|
||||||
|
if word_id != len(ud.words) - sentence_start + 1:
|
||||||
|
raise UDError("Incorrect word ID '{}' for word '{}', expected '{}'".format(columns[ID], columns[FORM], len(ud.words) - sentence_start + 1))
|
||||||
|
|
||||||
|
try:
|
||||||
|
head_id = int(columns[HEAD])
|
||||||
|
except ValueError:
|
||||||
|
raise UDError("Cannot parse HEAD '{}'".format(columns[HEAD]))
|
||||||
|
if head_id < 0:
|
||||||
|
raise UDError("HEAD cannot be negative")
|
||||||
|
|
||||||
|
ud.words.append(UDWord(ud.tokens[-1], columns, is_multiword=False))
|
||||||
|
|
||||||
|
if sentence_start is not None:
|
||||||
|
raise UDError("The CoNLL-U file does not end with empty line")
|
||||||
|
|
||||||
|
return ud
|
||||||
|
|
||||||
|
# Evaluate the gold and system treebanks (loaded using load_conllu).
|
||||||
|
def evaluate(gold_ud, system_ud, deprel_weights=None, check_parse=True):
|
||||||
|
class Score:
|
||||||
|
def __init__(self, gold_total, system_total, correct, aligned_total=None, undersegmented=None, oversegmented=None):
|
||||||
|
self.precision = correct / system_total if system_total else 0.0
|
||||||
|
self.recall = correct / gold_total if gold_total else 0.0
|
||||||
|
self.f1 = 2 * correct / (system_total + gold_total) if system_total + gold_total else 0.0
|
||||||
|
self.aligned_accuracy = correct / aligned_total if aligned_total else aligned_total
|
||||||
|
self.undersegmented = undersegmented
|
||||||
|
self.oversegmented = oversegmented
|
||||||
|
self.under_perc = len(undersegmented) / gold_total if gold_total and undersegmented else 0.0
|
||||||
|
self.over_perc = len(oversegmented) / gold_total if gold_total and oversegmented else 0.0
|
||||||
|
class AlignmentWord:
|
||||||
|
def __init__(self, gold_word, system_word):
|
||||||
|
self.gold_word = gold_word
|
||||||
|
self.system_word = system_word
|
||||||
|
self.gold_parent = None
|
||||||
|
self.system_parent_gold_aligned = None
|
||||||
|
class Alignment:
|
||||||
|
def __init__(self, gold_words, system_words):
|
||||||
|
self.gold_words = gold_words
|
||||||
|
self.system_words = system_words
|
||||||
|
self.matched_words = []
|
||||||
|
self.matched_words_map = {}
|
||||||
|
def append_aligned_words(self, gold_word, system_word):
|
||||||
|
self.matched_words.append(AlignmentWord(gold_word, system_word))
|
||||||
|
self.matched_words_map[system_word] = gold_word
|
||||||
|
def fill_parents(self):
|
||||||
|
# We represent root parents in both gold and system data by '0'.
|
||||||
|
# For gold data, we represent non-root parent by corresponding gold word.
|
||||||
|
# For system data, we represent non-root parent by either gold word aligned
|
||||||
|
# to the parent system node, or by None if no gold word is aligned to the parent.
|
||||||
|
for words in self.matched_words:
|
||||||
|
words.gold_parent = words.gold_word.parent if words.gold_word.parent is not None else 0
|
||||||
|
words.system_parent_gold_aligned = self.matched_words_map.get(words.system_word.parent, None) \
|
||||||
|
if words.system_word.parent is not None else 0
|
||||||
|
|
||||||
|
def lower(text):
|
||||||
|
if sys.version_info < (3, 0) and isinstance(text, str):
|
||||||
|
return text.decode("utf-8").lower()
|
||||||
|
return text.lower()
|
||||||
|
|
||||||
|
def spans_score(gold_spans, system_spans):
|
||||||
|
correct, gi, si = 0, 0, 0
|
||||||
|
undersegmented = list()
|
||||||
|
oversegmented = list()
|
||||||
|
combo = 0
|
||||||
|
previous_end_si_earlier = False
|
||||||
|
previous_end_gi_earlier = False
|
||||||
|
while gi < len(gold_spans) and si < len(system_spans):
|
||||||
|
previous_si = system_spans[si-1] if si > 0 else None
|
||||||
|
previous_gi = gold_spans[gi-1] if gi > 0 else None
|
||||||
|
if system_spans[si].start < gold_spans[gi].start:
|
||||||
|
# avoid counting the same mistake twice
|
||||||
|
if not previous_end_si_earlier:
|
||||||
|
combo += 1
|
||||||
|
oversegmented.append(str(previous_gi).strip())
|
||||||
|
si += 1
|
||||||
|
elif gold_spans[gi].start < system_spans[si].start:
|
||||||
|
# avoid counting the same mistake twice
|
||||||
|
if not previous_end_gi_earlier:
|
||||||
|
combo += 1
|
||||||
|
undersegmented.append(str(previous_si).strip())
|
||||||
|
gi += 1
|
||||||
|
else:
|
||||||
|
correct += gold_spans[gi].end == system_spans[si].end
|
||||||
|
if gold_spans[gi].end < system_spans[si].end:
|
||||||
|
undersegmented.append(str(system_spans[si]).strip())
|
||||||
|
previous_end_gi_earlier = True
|
||||||
|
previous_end_si_earlier = False
|
||||||
|
elif gold_spans[gi].end > system_spans[si].end:
|
||||||
|
oversegmented.append(str(gold_spans[gi]).strip())
|
||||||
|
previous_end_si_earlier = True
|
||||||
|
previous_end_gi_earlier = False
|
||||||
|
else:
|
||||||
|
previous_end_gi_earlier = False
|
||||||
|
previous_end_si_earlier = False
|
||||||
|
si += 1
|
||||||
|
gi += 1
|
||||||
|
|
||||||
|
return Score(len(gold_spans), len(system_spans), correct, None, undersegmented, oversegmented)
|
||||||
|
|
||||||
|
def alignment_score(alignment, key_fn, weight_fn=lambda w: 1):
|
||||||
|
gold, system, aligned, correct = 0, 0, 0, 0
|
||||||
|
|
||||||
|
for word in alignment.gold_words:
|
||||||
|
gold += weight_fn(word)
|
||||||
|
|
||||||
|
for word in alignment.system_words:
|
||||||
|
system += weight_fn(word)
|
||||||
|
|
||||||
|
for words in alignment.matched_words:
|
||||||
|
aligned += weight_fn(words.gold_word)
|
||||||
|
|
||||||
|
if key_fn is None:
|
||||||
|
# Return score for whole aligned words
|
||||||
|
return Score(gold, system, aligned)
|
||||||
|
|
||||||
|
for words in alignment.matched_words:
|
||||||
|
if key_fn(words.gold_word, words.gold_parent) == key_fn(words.system_word, words.system_parent_gold_aligned):
|
||||||
|
correct += weight_fn(words.gold_word)
|
||||||
|
|
||||||
|
return Score(gold, system, correct, aligned)
|
||||||
|
|
||||||
|
def beyond_end(words, i, multiword_span_end):
|
||||||
|
if i >= len(words):
|
||||||
|
return True
|
||||||
|
if words[i].is_multiword:
|
||||||
|
return words[i].span.start >= multiword_span_end
|
||||||
|
return words[i].span.end > multiword_span_end
|
||||||
|
|
||||||
|
def extend_end(word, multiword_span_end):
|
||||||
|
if word.is_multiword and word.span.end > multiword_span_end:
|
||||||
|
return word.span.end
|
||||||
|
return multiword_span_end
|
||||||
|
|
||||||
|
def find_multiword_span(gold_words, system_words, gi, si):
|
||||||
|
# We know gold_words[gi].is_multiword or system_words[si].is_multiword.
|
||||||
|
# Find the start of the multiword span (gs, ss), so the multiword span is minimal.
|
||||||
|
# Initialize multiword_span_end characters index.
|
||||||
|
if gold_words[gi].is_multiword:
|
||||||
|
multiword_span_end = gold_words[gi].span.end
|
||||||
|
if not system_words[si].is_multiword and system_words[si].span.start < gold_words[gi].span.start:
|
||||||
|
si += 1
|
||||||
|
else: # if system_words[si].is_multiword
|
||||||
|
multiword_span_end = system_words[si].span.end
|
||||||
|
if not gold_words[gi].is_multiword and gold_words[gi].span.start < system_words[si].span.start:
|
||||||
|
gi += 1
|
||||||
|
gs, ss = gi, si
|
||||||
|
|
||||||
|
# Find the end of the multiword span
|
||||||
|
# (so both gi and si are pointing to the word following the multiword span end).
|
||||||
|
while not beyond_end(gold_words, gi, multiword_span_end) or \
|
||||||
|
not beyond_end(system_words, si, multiword_span_end):
|
||||||
|
if gi < len(gold_words) and (si >= len(system_words) or
|
||||||
|
gold_words[gi].span.start <= system_words[si].span.start):
|
||||||
|
multiword_span_end = extend_end(gold_words[gi], multiword_span_end)
|
||||||
|
gi += 1
|
||||||
|
else:
|
||||||
|
multiword_span_end = extend_end(system_words[si], multiword_span_end)
|
||||||
|
si += 1
|
||||||
|
return gs, ss, gi, si
|
||||||
|
|
||||||
|
def compute_lcs(gold_words, system_words, gi, si, gs, ss):
|
||||||
|
lcs = [[0] * (si - ss) for i in range(gi - gs)]
|
||||||
|
for g in reversed(range(gi - gs)):
|
||||||
|
for s in reversed(range(si - ss)):
|
||||||
|
if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]):
|
||||||
|
lcs[g][s] = 1 + (lcs[g+1][s+1] if g+1 < gi-gs and s+1 < si-ss else 0)
|
||||||
|
lcs[g][s] = max(lcs[g][s], lcs[g+1][s] if g+1 < gi-gs else 0)
|
||||||
|
lcs[g][s] = max(lcs[g][s], lcs[g][s+1] if s+1 < si-ss else 0)
|
||||||
|
return lcs
|
||||||
|
|
||||||
|
def align_words(gold_words, system_words):
|
||||||
|
alignment = Alignment(gold_words, system_words)
|
||||||
|
|
||||||
|
gi, si = 0, 0
|
||||||
|
while gi < len(gold_words) and si < len(system_words):
|
||||||
|
if gold_words[gi].is_multiword or system_words[si].is_multiword:
|
||||||
|
# A: Multi-word tokens => align via LCS within the whole "multiword span".
|
||||||
|
gs, ss, gi, si = find_multiword_span(gold_words, system_words, gi, si)
|
||||||
|
|
||||||
|
if si > ss and gi > gs:
|
||||||
|
lcs = compute_lcs(gold_words, system_words, gi, si, gs, ss)
|
||||||
|
|
||||||
|
# Store aligned words
|
||||||
|
s, g = 0, 0
|
||||||
|
while g < gi - gs and s < si - ss:
|
||||||
|
if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]):
|
||||||
|
alignment.append_aligned_words(gold_words[gs+g], system_words[ss+s])
|
||||||
|
g += 1
|
||||||
|
s += 1
|
||||||
|
elif lcs[g][s] == (lcs[g+1][s] if g+1 < gi-gs else 0):
|
||||||
|
g += 1
|
||||||
|
else:
|
||||||
|
s += 1
|
||||||
|
else:
|
||||||
|
# B: No multi-word token => align according to spans.
|
||||||
|
if (gold_words[gi].span.start, gold_words[gi].span.end) == (system_words[si].span.start, system_words[si].span.end):
|
||||||
|
alignment.append_aligned_words(gold_words[gi], system_words[si])
|
||||||
|
gi += 1
|
||||||
|
si += 1
|
||||||
|
elif gold_words[gi].span.start <= system_words[si].span.start:
|
||||||
|
gi += 1
|
||||||
|
else:
|
||||||
|
si += 1
|
||||||
|
|
||||||
|
alignment.fill_parents()
|
||||||
|
|
||||||
|
return alignment
|
||||||
|
|
||||||
|
# Check that underlying character sequences do match
|
||||||
|
if gold_ud.characters != system_ud.characters:
|
||||||
|
index = 0
|
||||||
|
while gold_ud.characters[index] == system_ud.characters[index]:
|
||||||
|
index += 1
|
||||||
|
|
||||||
|
raise UDError(
|
||||||
|
"The concatenation of tokens in gold file and in system file differ!\n" +
|
||||||
|
"First 20 differing characters in gold file: '{}' and system file: '{}'".format(
|
||||||
|
"".join(gold_ud.characters[index:index + 20]),
|
||||||
|
"".join(system_ud.characters[index:index + 20])
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
# Align words
|
||||||
|
alignment = align_words(gold_ud.words, system_ud.words)
|
||||||
|
|
||||||
|
# Compute the F1-scores
|
||||||
|
if check_parse:
|
||||||
|
result = {
|
||||||
|
"Tokens": spans_score(gold_ud.tokens, system_ud.tokens),
|
||||||
|
"Sentences": spans_score(gold_ud.sentences, system_ud.sentences),
|
||||||
|
"Words": alignment_score(alignment, None),
|
||||||
|
"UPOS": alignment_score(alignment, lambda w, parent: w.columns[UPOS]),
|
||||||
|
"XPOS": alignment_score(alignment, lambda w, parent: w.columns[XPOS]),
|
||||||
|
"Feats": alignment_score(alignment, lambda w, parent: w.columns[FEATS]),
|
||||||
|
"AllTags": alignment_score(alignment, lambda w, parent: (w.columns[UPOS], w.columns[XPOS], w.columns[FEATS])),
|
||||||
|
"Lemmas": alignment_score(alignment, lambda w, parent: w.columns[LEMMA]),
|
||||||
|
"UAS": alignment_score(alignment, lambda w, parent: parent),
|
||||||
|
"LAS": alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL])),
|
||||||
|
}
|
||||||
|
else:
|
||||||
|
result = {
|
||||||
|
"Tokens": spans_score(gold_ud.tokens, system_ud.tokens),
|
||||||
|
"Sentences": spans_score(gold_ud.sentences, system_ud.sentences),
|
||||||
|
"Words": alignment_score(alignment, None),
|
||||||
|
"Feats": alignment_score(alignment, lambda w, parent: w.columns[FEATS]),
|
||||||
|
"Lemmas": alignment_score(alignment, lambda w, parent: w.columns[LEMMA]),
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
# Add WeightedLAS if weights are given
|
||||||
|
if deprel_weights is not None:
|
||||||
|
def weighted_las(word):
|
||||||
|
return deprel_weights.get(word.columns[DEPREL], 1.0)
|
||||||
|
result["WeightedLAS"] = alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL]), weighted_las)
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
def load_deprel_weights(weights_file):
|
||||||
|
if weights_file is None:
|
||||||
|
return None
|
||||||
|
|
||||||
|
deprel_weights = {}
|
||||||
|
for line in weights_file:
|
||||||
|
# Ignore comments and empty lines
|
||||||
|
if line.startswith("#") or not line.strip():
|
||||||
|
continue
|
||||||
|
|
||||||
|
columns = line.rstrip("\r\n").split()
|
||||||
|
if len(columns) != 2:
|
||||||
|
raise ValueError("Expected two columns in the UD Relations weights file on line '{}'".format(line))
|
||||||
|
|
||||||
|
deprel_weights[columns[0]] = float(columns[1])
|
||||||
|
|
||||||
|
return deprel_weights
|
||||||
|
|
||||||
|
def load_conllu_file(path):
|
||||||
|
_file = open(path, mode="r", **({"encoding": "utf-8"} if sys.version_info >= (3, 0) else {}))
|
||||||
|
return load_conllu(_file)
|
||||||
|
|
||||||
|
def evaluate_wrapper(args):
|
||||||
|
# Load CoNLL-U files
|
||||||
|
gold_ud = load_conllu_file(args.gold_file)
|
||||||
|
system_ud = load_conllu_file(args.system_file)
|
||||||
|
|
||||||
|
# Load weights if requested
|
||||||
|
deprel_weights = load_deprel_weights(args.weights)
|
||||||
|
|
||||||
|
return evaluate(gold_ud, system_ud, deprel_weights)
|
||||||
|
|
||||||
|
def main():
|
||||||
|
# Parse arguments
|
||||||
|
parser = argparse.ArgumentParser()
|
||||||
|
parser.add_argument("gold_file", type=str,
|
||||||
|
help="Name of the CoNLL-U file with the gold data.")
|
||||||
|
parser.add_argument("system_file", type=str,
|
||||||
|
help="Name of the CoNLL-U file with the predicted data.")
|
||||||
|
parser.add_argument("--weights", "-w", type=argparse.FileType("r"), default=None,
|
||||||
|
metavar="deprel_weights_file",
|
||||||
|
help="Compute WeightedLAS using given weights for Universal Dependency Relations.")
|
||||||
|
parser.add_argument("--verbose", "-v", default=0, action="count",
|
||||||
|
help="Print all metrics.")
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
# Use verbose if weights are supplied
|
||||||
|
if args.weights is not None and not args.verbose:
|
||||||
|
args.verbose = 1
|
||||||
|
|
||||||
|
# Evaluate
|
||||||
|
evaluation = evaluate_wrapper(args)
|
||||||
|
|
||||||
|
# Print the evaluation
|
||||||
|
if not args.verbose:
|
||||||
|
print("LAS F1 Score: {:.2f}".format(100 * evaluation["LAS"].f1))
|
||||||
|
else:
|
||||||
|
metrics = ["Tokens", "Sentences", "Words", "UPOS", "XPOS", "Feats", "AllTags", "Lemmas", "UAS", "LAS"]
|
||||||
|
if args.weights is not None:
|
||||||
|
metrics.append("WeightedLAS")
|
||||||
|
|
||||||
|
print("Metrics | Precision | Recall | F1 Score | AligndAcc")
|
||||||
|
print("-----------+-----------+-----------+-----------+-----------")
|
||||||
|
for metric in metrics:
|
||||||
|
print("{:11}|{:10.2f} |{:10.2f} |{:10.2f} |{}".format(
|
||||||
|
metric,
|
||||||
|
100 * evaluation[metric].precision,
|
||||||
|
100 * evaluation[metric].recall,
|
||||||
|
100 * evaluation[metric].f1,
|
||||||
|
"{:10.2f}".format(100 * evaluation[metric].aligned_accuracy) if evaluation[metric].aligned_accuracy is not None else ""
|
||||||
|
))
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
|
|
||||||
|
# Tests, which can be executed with `python -m unittest conll17_ud_eval`.
|
||||||
|
class TestAlignment(unittest.TestCase):
|
||||||
|
@staticmethod
|
||||||
|
def _load_words(words):
|
||||||
|
"""Prepare fake CoNLL-U files with fake HEAD to prevent multiple roots errors."""
|
||||||
|
lines, num_words = [], 0
|
||||||
|
for w in words:
|
||||||
|
parts = w.split(" ")
|
||||||
|
if len(parts) == 1:
|
||||||
|
num_words += 1
|
||||||
|
lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, parts[0], int(num_words>1)))
|
||||||
|
else:
|
||||||
|
lines.append("{}-{}\t{}\t_\t_\t_\t_\t_\t_\t_\t_".format(num_words + 1, num_words + len(parts) - 1, parts[0]))
|
||||||
|
for part in parts[1:]:
|
||||||
|
num_words += 1
|
||||||
|
lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, part, int(num_words>1)))
|
||||||
|
return load_conllu((io.StringIO if sys.version_info >= (3, 0) else io.BytesIO)("\n".join(lines+["\n"])))
|
||||||
|
|
||||||
|
def _test_exception(self, gold, system):
|
||||||
|
self.assertRaises(UDError, evaluate, self._load_words(gold), self._load_words(system))
|
||||||
|
|
||||||
|
def _test_ok(self, gold, system, correct):
|
||||||
|
metrics = evaluate(self._load_words(gold), self._load_words(system))
|
||||||
|
gold_words = sum((max(1, len(word.split(" ")) - 1) for word in gold))
|
||||||
|
system_words = sum((max(1, len(word.split(" ")) - 1) for word in system))
|
||||||
|
self.assertEqual((metrics["Words"].precision, metrics["Words"].recall, metrics["Words"].f1),
|
||||||
|
(correct / system_words, correct / gold_words, 2 * correct / (gold_words + system_words)))
|
||||||
|
|
||||||
|
def test_exception(self):
|
||||||
|
self._test_exception(["a"], ["b"])
|
||||||
|
|
||||||
|
def test_equal(self):
|
||||||
|
self._test_ok(["a"], ["a"], 1)
|
||||||
|
self._test_ok(["a", "b", "c"], ["a", "b", "c"], 3)
|
||||||
|
|
||||||
|
def test_equal_with_multiword(self):
|
||||||
|
self._test_ok(["abc a b c"], ["a", "b", "c"], 3)
|
||||||
|
self._test_ok(["a", "bc b c", "d"], ["a", "b", "c", "d"], 4)
|
||||||
|
self._test_ok(["abcd a b c d"], ["ab a b", "cd c d"], 4)
|
||||||
|
self._test_ok(["abc a b c", "de d e"], ["a", "bcd b c d", "e"], 5)
|
||||||
|
|
||||||
|
def test_alignment(self):
|
||||||
|
self._test_ok(["abcd"], ["a", "b", "c", "d"], 0)
|
||||||
|
self._test_ok(["abc", "d"], ["a", "b", "c", "d"], 1)
|
||||||
|
self._test_ok(["a", "bc", "d"], ["a", "b", "c", "d"], 2)
|
||||||
|
self._test_ok(["a", "bc b c", "d"], ["a", "b", "cd"], 2)
|
||||||
|
self._test_ok(["abc a BX c", "def d EX f"], ["ab a b", "cd c d", "ef e f"], 4)
|
||||||
|
self._test_ok(["ab a b", "cd bc d"], ["a", "bc", "d"], 2)
|
||||||
|
self._test_ok(["a", "bc b c", "d"], ["ab AX BX", "cd CX a"], 1)
|
||||||
287
spacy/cli/ud/run_eval.py
Normal file
|
|
@ -0,0 +1,287 @@
|
||||||
|
import spacy
|
||||||
|
import time
|
||||||
|
import re
|
||||||
|
import plac
|
||||||
|
import operator
|
||||||
|
import datetime
|
||||||
|
from pathlib import Path
|
||||||
|
import xml.etree.ElementTree as ET
|
||||||
|
|
||||||
|
from spacy.cli.ud import conll17_ud_eval
|
||||||
|
from spacy.cli.ud.ud_train import write_conllu
|
||||||
|
from spacy.lang.lex_attrs import word_shape
|
||||||
|
from spacy.util import get_lang_class
|
||||||
|
|
||||||
|
# All languages in spaCy - in UD format (note that Norwegian is 'no' instead of 'nb')
|
||||||
|
ALL_LANGUAGES = "ar, ca, da, de, el, en, es, fa, fi, fr, ga, he, hi, hr, hu, id, " \
|
||||||
|
"it, ja, no, nl, pl, pt, ro, ru, sv, tr, ur, vi, zh"
|
||||||
|
|
||||||
|
# Non-parsing tasks that will be evaluated (works for default models)
|
||||||
|
EVAL_NO_PARSE = ['Tokens', 'Words', 'Lemmas', 'Sentences', 'Feats']
|
||||||
|
|
||||||
|
# Tasks that will be evaluated if check_parse=True (does not work for default models)
|
||||||
|
EVAL_PARSE = ['Tokens', 'Words', 'Lemmas', 'Sentences', 'Feats', 'UPOS', 'XPOS', 'AllTags', 'UAS', 'LAS']
|
||||||
|
|
||||||
|
# Minimum frequency an error should have to be printed
|
||||||
|
PRINT_FREQ = 20
|
||||||
|
|
||||||
|
# Maximum number of errors printed per category
|
||||||
|
PRINT_TOTAL = 10
|
||||||
|
|
||||||
|
space_re = re.compile(r"\s+")
|
||||||
|
|
||||||
|
|
||||||
|
def load_model(modelname, add_sentencizer=False):
|
||||||
|
""" Load a specific spaCy model """
|
||||||
|
loading_start = time.time()
|
||||||
|
nlp = spacy.load(modelname)
|
||||||
|
if add_sentencizer:
|
||||||
|
nlp.add_pipe(nlp.create_pipe('sentencizer'))
|
||||||
|
loading_end = time.time()
|
||||||
|
loading_time = loading_end - loading_start
|
||||||
|
if add_sentencizer:
|
||||||
|
return nlp, loading_time, modelname + '_sentencizer'
|
||||||
|
return nlp, loading_time, modelname
|
||||||
|
|
||||||
|
|
||||||
|
def load_default_model_sentencizer(lang):
|
||||||
|
""" Load a generic spaCy model and add the sentencizer for sentence tokenization"""
|
||||||
|
loading_start = time.time()
|
||||||
|
lang_class = get_lang_class(lang)
|
||||||
|
nlp = lang_class()
|
||||||
|
nlp.add_pipe(nlp.create_pipe('sentencizer'))
|
||||||
|
loading_end = time.time()
|
||||||
|
loading_time = loading_end - loading_start
|
||||||
|
return nlp, loading_time, lang + "_default_" + 'sentencizer'
|
||||||
|
|
||||||
|
|
||||||
|
def split_text(text):
|
||||||
|
return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")]
|
||||||
|
|
||||||
|
|
||||||
|
def get_freq_tuples(my_list, print_total_threshold):
|
||||||
|
""" Turn a list of errors into frequency-sorted tuples thresholded by a certain total number """
|
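# For example (illustrative): get_freq_tuples(["a", "b", "a", "a"], 2)
# returns [("a", 3), ("b", 1)].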
||||||
|
d = {}
|
||||||
|
for token in my_list:
|
||||||
|
d.setdefault(token, 0)
|
||||||
|
d[token] += 1
|
||||||
|
return sorted(d.items(), key=operator.itemgetter(1), reverse=True)[:print_total_threshold]
|
||||||
|
|
||||||
|
|
||||||
|
def _contains_blinded_text(stats_xml):
|
||||||
|
""" Heuristic to determine whether the treebank has blinded texts or not """
|
||||||
|
tree = ET.parse(stats_xml)
|
||||||
|
root = tree.getroot()
|
||||||
|
total_tokens = int(root.find('size/total/tokens').text)
|
||||||
|
unique_lemmas = int(root.find('lemmas').get('unique'))
|
||||||
|
|
||||||
|
# assume the corpus is largely blinded when unique lemmas make up less than 1% of the tokens
|
||||||
|
return (unique_lemmas / total_tokens) < 0.01
|
||||||
|
|
||||||
|
|
||||||
|
def fetch_all_treebanks(ud_dir, languages, corpus, best_per_language):
|
||||||
|
"""" Fetch the txt files for all treebanks for a given set of languages """
|
||||||
|
all_treebanks = dict()
|
||||||
|
treebank_size = dict()
|
||||||
|
for l in languages:
|
||||||
|
all_treebanks[l] = []
|
||||||
|
treebank_size[l] = 0
|
||||||
|
|
||||||
|
for treebank_dir in ud_dir.iterdir():
|
||||||
|
if treebank_dir.is_dir():
|
||||||
|
for txt_path in treebank_dir.iterdir():
|
||||||
|
if txt_path.name.endswith('-ud-' + corpus + '.txt'):
|
||||||
|
file_lang = txt_path.name.split('_')[0]
|
||||||
|
if file_lang in languages:
|
||||||
|
gold_path = treebank_dir / txt_path.name.replace('.txt', '.conllu')
|
||||||
|
stats_xml = treebank_dir / "stats.xml"
|
||||||
|
# ignore treebanks where the texts are not publicly available
|
||||||
|
if not _contains_blinded_text(stats_xml):
|
||||||
|
if not best_per_language:
|
||||||
|
all_treebanks[file_lang].append(txt_path)
|
||||||
|
# check the tokens in the gold annotation to keep only the biggest treebank per language
|
||||||
|
else:
|
||||||
|
with gold_path.open(mode='r', encoding='utf-8') as gold_file:
|
||||||
|
gold_ud = conll17_ud_eval.load_conllu(gold_file)
|
||||||
|
gold_tokens = len(gold_ud.tokens)
|
||||||
|
if treebank_size[file_lang] < gold_tokens:
|
||||||
|
all_treebanks[file_lang] = [txt_path]
|
||||||
|
treebank_size[file_lang] = gold_tokens
|
||||||
|
|
||||||
|
return all_treebanks
|
||||||
|
|
||||||
|
|
||||||
|
def run_single_eval(nlp, loading_time, print_name, text_path, gold_ud, tmp_output_path, out_file, print_header,
|
||||||
|
check_parse, print_freq_tasks):
|
||||||
|
"""" Run an evaluation of a model nlp on a certain specified treebank """
|
||||||
|
with text_path.open(mode='r', encoding='utf-8') as f:
|
||||||
|
flat_text = f.read()
|
||||||
|
|
||||||
|
# STEP 1: tokenize text
|
||||||
|
tokenization_start = time.time()
|
||||||
|
texts = split_text(flat_text)
|
||||||
|
docs = list(nlp.pipe(texts))
|
||||||
|
tokenization_end = time.time()
|
||||||
|
tokenization_time = tokenization_end - tokenization_start
|
||||||
|
|
||||||
|
# STEP 2: record stats and timings
|
||||||
|
tokens_per_s = int(len(gold_ud.tokens) / tokenization_time)
|
||||||
|
|
||||||
|
print_header_1 = ['date', 'text_path', 'gold_tokens', 'model', 'loading_time', 'tokenization_time', 'tokens_per_s']
|
||||||
|
print_string_1 = [str(datetime.date.today()), text_path.name, len(gold_ud.tokens),
|
||||||
|
print_name, "%.2f" % loading_time, "%.2f" % tokenization_time, tokens_per_s]
|
||||||
|
|
||||||
|
# STEP 3: evaluate predicted tokens and features
|
||||||
|
with tmp_output_path.open(mode="w", encoding="utf8") as tmp_out_file:
|
||||||
|
write_conllu(docs, tmp_out_file)
|
||||||
|
with tmp_output_path.open(mode="r", encoding="utf8") as sys_file:
|
||||||
|
sys_ud = conll17_ud_eval.load_conllu(sys_file, check_parse=check_parse)
|
||||||
|
tmp_output_path.unlink()
|
||||||
|
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud, check_parse=check_parse)
|
||||||
|
|
||||||
|
# STEP 4: format the scoring results
|
||||||
|
eval_headers = EVAL_PARSE
|
||||||
|
if not check_parse:
|
||||||
|
eval_headers = EVAL_NO_PARSE
|
||||||
|
|
||||||
|
for score_name in eval_headers:
|
||||||
|
score = scores[score_name]
|
||||||
|
print_string_1.extend(["%.2f" % score.precision,
|
||||||
|
"%.2f" % score.recall,
|
||||||
|
"%.2f" % score.f1])
|
||||||
|
print_string_1.append("-" if score.aligned_accuracy is None else "%.2f" % score.aligned_accuracy)
|
||||||
|
print_string_1.append("-" if score.undersegmented is None else "%.4f" % score.under_perc)
|
||||||
|
print_string_1.append("-" if score.oversegmented is None else "%.4f" % score.over_perc)
|
||||||
|
|
||||||
|
print_header_1.extend([score_name + '_p', score_name + '_r', score_name + '_F', score_name + '_acc',
|
||||||
|
score_name + '_under', score_name + '_over'])
|
||||||
|
|
||||||
|
if score_name in print_freq_tasks:
|
||||||
|
print_header_1.extend([score_name + '_word_under_ex', score_name + '_shape_under_ex',
|
||||||
|
score_name + '_word_over_ex', score_name + '_shape_over_ex'])
|
||||||
|
|
||||||
|
d_under_words = get_freq_tuples(score.undersegmented, PRINT_TOTAL)
|
||||||
|
d_under_shapes = get_freq_tuples([word_shape(x) for x in score.undersegmented], PRINT_TOTAL)
|
||||||
|
d_over_words = get_freq_tuples(score.oversegmented, PRINT_TOTAL)
|
||||||
|
d_over_shapes = get_freq_tuples([word_shape(x) for x in score.oversegmented], PRINT_TOTAL)
|
||||||
|
|
||||||
|
# saving to CSV with ';' as separator, so blind any ';' occurring in the example output
|
||||||
|
print_string_1.append(
|
||||||
|
str({k: v for k, v in d_under_words if v > PRINT_FREQ}).replace(";", "*SEMICOLON*"))
|
||||||
|
print_string_1.append(
|
||||||
|
str({k: v for k, v in d_under_shapes if v > PRINT_FREQ}).replace(";", "*SEMICOLON*"))
|
||||||
|
print_string_1.append(
|
||||||
|
str({k: v for k, v in d_over_words if v > PRINT_FREQ}).replace(";", "*SEMICOLON*"))
|
||||||
|
print_string_1.append(
|
||||||
|
str({k: v for k, v in d_over_shapes if v > PRINT_FREQ}).replace(";", "*SEMICOLON*"))
|
||||||
|
|
||||||
|
# STEP 5: print the formatted results to CSV
|
||||||
|
if print_header:
|
||||||
|
out_file.write(';'.join(map(str, print_header_1)) + '\n')
|
||||||
|
out_file.write(';'.join(map(str, print_string_1)) + '\n')
|
||||||
|
|
||||||
|
|
||||||
|
def run_all_evals(models, treebanks, out_file, check_parse, print_freq_tasks):
|
||||||
|
"""" Run an evaluation for each language with its specified models and treebanks """
|
||||||
|
print_header = True
|
||||||
|
|
||||||
|
for tb_lang, treebank_list in treebanks.items():
|
||||||
|
print()
|
||||||
|
print("Language", tb_lang)
|
||||||
|
for text_path in treebank_list:
|
||||||
|
print(" Evaluating on", text_path)
|
||||||
|
|
||||||
|
gold_path = text_path.parent / (text_path.stem + '.conllu')
|
||||||
|
print(" Gold data from ", gold_path)
|
||||||
|
|
||||||
|
# nested try blocks to ensure the code can continue with the next iteration after a failure
|
||||||
|
try:
|
||||||
|
with gold_path.open(mode='r', encoding='utf-8') as gold_file:
|
||||||
|
gold_ud = conll17_ud_eval.load_conllu(gold_file)
|
||||||
|
|
||||||
|
for nlp, nlp_loading_time, nlp_name in models[tb_lang]:
|
||||||
|
try:
|
||||||
|
print(" Benchmarking", nlp_name)
|
||||||
|
tmp_output_path = text_path.parent / str('tmp_' + nlp_name + '.conllu')
|
||||||
|
run_single_eval(nlp, nlp_loading_time, nlp_name, text_path, gold_ud, tmp_output_path, out_file,
|
||||||
|
print_header, check_parse, print_freq_tasks)
|
||||||
|
print_header = False
|
||||||
|
except Exception as e:
|
||||||
|
print(" Ran into trouble: ", str(e))
|
||||||
|
except Exception as e:
|
||||||
|
print(" Ran into trouble: ", str(e))
|
||||||
|
|
||||||
|
|
||||||
|
@plac.annotations(
|
||||||
|
out_path=("Path to output CSV file", "positional", None, Path),
|
||||||
|
ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
|
||||||
|
check_parse=("Set flag to evaluate parsing performance", "flag", "p", bool),
|
||||||
|
langs=("Enumeration of languages to evaluate (default: all)", "option", "l", str),
|
||||||
|
exclude_trained_models=("Set flag to exclude trained models", "flag", "t", bool),
|
||||||
|
exclude_multi=("Set flag to exclude the multi-language model as default baseline", "flag", "m", bool),
|
||||||
|
hide_freq=("Set flag to avoid printing out more detailed high-freq tokenization errors", "flag", "f", bool),
|
||||||
|
corpus=("Whether to run on train, dev or test", "option", "c", str),
|
||||||
|
best_per_language=("Set flag to only keep the largest treebank for each language", "flag", "b", bool)
|
||||||
|
)
|
||||||
|
def main(out_path, ud_dir, check_parse=False, langs=ALL_LANGUAGES, exclude_trained_models=False, exclude_multi=False,
|
||||||
|
hide_freq=False, corpus='train', best_per_language=False):
|
||||||
|
""""
|
||||||
|
Assemble all treebanks and models to run evaluations with.
|
||||||
|
When setting check_parse to True, the default models will not be evaluated as they don't have parsing functionality
|
||||||
|
"""
|
||||||
|
languages = [lang.strip() for lang in langs.split(",")]
|
||||||
|
|
||||||
|
print_freq_tasks = []
|
||||||
|
if not hide_freq:
|
||||||
|
print_freq_tasks = ['Tokens']
|
||||||
|
|
||||||
|
# fetching all relevant treebanks from the directory
|
||||||
|
treebanks = fetch_all_treebanks(ud_dir, languages, corpus, best_per_language)
|
||||||
|
|
||||||
|
print()
|
||||||
|
print("Loading all relevant models for", languages)
|
||||||
|
models = dict()
|
||||||
|
|
||||||
|
# multi-lang model
|
||||||
|
multi = None
|
||||||
|
if not exclude_multi and not check_parse:
|
||||||
|
multi = load_model('xx_ent_wiki_sm', add_sentencizer=True)
|
||||||
|
|
||||||
|
# initialize all models with the multi-lang model
|
||||||
|
for lang in languages:
|
||||||
|
models[lang] = [multi] if multi else []
|
||||||
|
# add default models if we don't want to evaluate parsing info
|
||||||
|
if not check_parse:
|
||||||
|
# Norwegian is 'nb' in spaCy but 'no' in the UD corpora
|
||||||
|
if lang == 'no':
|
||||||
|
models['no'].append(load_default_model_sentencizer('nb'))
|
||||||
|
else:
|
||||||
|
models[lang].append(load_default_model_sentencizer(lang))
|
||||||
|
|
||||||
|
# language-specific trained models
|
||||||
|
if not exclude_trained_models:
|
||||||
|
if 'de' in models:
|
||||||
|
models['de'].append(load_model('de_core_news_sm'))
|
||||||
|
if 'es' in models:
|
||||||
|
models['es'].append(load_model('es_core_news_sm'))
|
||||||
|
models['es'].append(load_model('es_core_news_md'))
|
||||||
|
if 'pt' in models:
|
||||||
|
models['pt'].append(load_model('pt_core_news_sm'))
|
||||||
|
if 'it' in models:
|
||||||
|
models['it'].append(load_model('it_core_news_sm'))
|
||||||
|
if 'nl' in models:
|
||||||
|
models['nl'].append(load_model('nl_core_news_sm'))
|
||||||
|
if 'en' in models:
|
||||||
|
models['en'].append(load_model('en_core_web_sm'))
|
||||||
|
models['en'].append(load_model('en_core_web_md'))
|
||||||
|
models['en'].append(load_model('en_core_web_lg'))
|
||||||
|
if 'fr' in models:
|
||||||
|
models['fr'].append(load_model('fr_core_news_sm'))
|
||||||
|
models['fr'].append(load_model('fr_core_news_md'))
|
||||||
|
|
||||||
|
with out_path.open(mode='w', encoding='utf-8') as out_file:
|
||||||
|
run_all_evals(models, treebanks, out_file, check_parse, print_freq_tasks)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
plac.call(main)
|
||||||
338
spacy/cli/ud/ud_run_test.py
Normal file
|
|
@ -0,0 +1,338 @@
|
||||||
|
# flake8: noqa
|
||||||
|
"""Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes
|
||||||
|
.conllu format for development data, allowing the official scorer to be used.
|
||||||
|
"""
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import plac
|
||||||
|
import tqdm
|
||||||
|
from pathlib import Path
|
||||||
|
import re
|
||||||
|
import sys
|
||||||
|
import srsly
|
||||||
|
|
||||||
|
import spacy
|
||||||
|
import spacy.util
|
||||||
|
from ...tokens import Token, Doc
|
||||||
|
from ...gold import GoldParse
|
||||||
|
from ...util import compounding, minibatch_by_words
|
||||||
|
from ...syntax.nonproj import projectivize
|
||||||
|
from ...matcher import Matcher
|
||||||
|
|
||||||
|
# from ...morphology import Fused_begin, Fused_inside
|
||||||
|
from ... import displacy
|
||||||
|
from collections import defaultdict, Counter
|
||||||
|
from timeit import default_timer as timer
|
||||||
|
|
||||||
|
Fused_begin = None
|
||||||
|
Fused_inside = None
|
||||||
|
|
||||||
|
import itertools
|
||||||
|
import random
|
||||||
|
import numpy.random
|
||||||
|
|
||||||
|
from . import conll17_ud_eval
|
||||||
|
|
||||||
|
from ... import lang
|
||||||
|
from ...lang import zh
|
||||||
|
from ...lang import ja
|
||||||
|
from ...lang import ru
|
||||||
|
|
||||||
|
|
||||||
|
################
|
||||||
|
# Data reading #
|
||||||
|
################
|
||||||
|
|
||||||
|
space_re = re.compile(r"\s+")
|
||||||
|
|
||||||
|
|
||||||
|
def split_text(text):
|
||||||
|
return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")]
|
||||||
|
|
||||||
|
|
||||||
|
##############
|
||||||
|
# Evaluation #
|
||||||
|
##############
|
||||||
|
|
||||||
|
|
||||||
|
def read_conllu(file_):
|
||||||
|
docs = []
|
||||||
|
sent = []
|
||||||
|
doc = []
|
||||||
|
for line in file_:
|
||||||
|
if line.startswith("# newdoc"):
|
||||||
|
if doc:
|
||||||
|
docs.append(doc)
|
||||||
|
doc = []
|
||||||
|
elif line.startswith("#"):
|
||||||
|
continue
|
||||||
|
elif not line.strip():
|
||||||
|
if sent:
|
||||||
|
doc.append(sent)
|
||||||
|
sent = []
|
||||||
|
else:
|
||||||
|
sent.append(list(line.strip().split("\t")))
|
||||||
|
if len(sent[-1]) != 10:
|
||||||
|
print(repr(line))
|
||||||
|
raise ValueError
|
||||||
|
if sent:
|
||||||
|
doc.append(sent)
|
||||||
|
if doc:
|
||||||
|
docs.append(doc)
|
||||||
|
return docs
|
||||||
|
|
||||||
|
|
||||||
|
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
|
||||||
|
if text_loc.parts[-1].endswith(".conllu"):
|
||||||
|
docs = []
|
||||||
|
with text_loc.open() as file_:
|
||||||
|
for conllu_doc in read_conllu(file_):
|
||||||
|
for conllu_sent in conllu_doc:
|
||||||
|
words = [line[1] for line in conllu_sent]
|
||||||
|
docs.append(Doc(nlp.vocab, words=words))
|
||||||
|
for name, component in nlp.pipeline:
|
||||||
|
docs = list(component.pipe(docs))
|
||||||
|
else:
|
||||||
|
with text_loc.open("r", encoding="utf8") as text_file:
|
||||||
|
texts = split_text(text_file.read())
|
||||||
|
docs = list(nlp.pipe(texts))
|
||||||
|
with sys_loc.open("w", encoding="utf8") as out_file:
|
||||||
|
write_conllu(docs, out_file)
|
||||||
|
with gold_loc.open("r", encoding="utf8") as gold_file:
|
||||||
|
gold_ud = conll17_ud_eval.load_conllu(gold_file)
|
||||||
|
with sys_loc.open("r", encoding="utf8") as sys_file:
|
||||||
|
sys_ud = conll17_ud_eval.load_conllu(sys_file)
|
||||||
|
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
|
||||||
|
return docs, scores
|
||||||
|
|
||||||
|
|
||||||
|
def write_conllu(docs, file_):
|
||||||
|
merger = Matcher(docs[0].vocab)
|
||||||
|
merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}])
|
||||||
|
for i, doc in enumerate(docs):
|
||||||
|
matches = merger(doc)
|
||||||
|
spans = [doc[start : end + 1] for _, start, end in matches]
|
||||||
|
with doc.retokenize() as retokenizer:
|
||||||
|
for span in spans:
|
||||||
|
retokenizer.merge(span)
|
||||||
|
# TODO: This shouldn't be necessary? Should be handled in merge
|
||||||
|
for word in doc:
|
||||||
|
if word.i == word.head.i:
|
||||||
|
word.dep_ = "ROOT"
|
||||||
|
file_.write("# newdoc id = {i}\n".format(i=i))
|
||||||
|
for j, sent in enumerate(doc.sents):
|
||||||
|
file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
|
||||||
|
file_.write("# text = {text}\n".format(text=sent.text))
|
||||||
|
for k, token in enumerate(sent):
|
||||||
|
file_.write(_get_token_conllu(token, k, len(sent)) + "\n")
|
||||||
|
file_.write("\n")
|
||||||
|
for word in sent:
|
||||||
|
if word.head.i == word.i and word.dep_ == "ROOT":
|
||||||
|
break
|
||||||
|
else:
|
||||||
|
print("Rootless sentence!")
|
||||||
|
print(sent)
|
||||||
|
print(i)
|
||||||
|
for w in sent:
|
||||||
|
print(w.i, w.text, w.head.text, w.head.i, w.dep_)
|
||||||
|
raise ValueError
|
||||||
|
|
||||||
|
|
||||||
|
def _get_token_conllu(token, k, sent_len):
|
||||||
|
if token.check_morph(Fused_begin) and (k + 1 < sent_len):
|
||||||
|
n = 1
|
||||||
|
text = [token.text]
|
||||||
|
while token.nbor(n).check_morph(Fused_inside):
|
||||||
|
text.append(token.nbor(n).text)
|
||||||
|
n += 1
|
||||||
|
id_ = "%d-%d" % (k + 1, (k + n))
|
||||||
|
fields = [id_, "".join(text)] + ["_"] * 8
|
||||||
|
lines = ["\t".join(fields)]
|
||||||
|
else:
|
||||||
|
lines = []
|
||||||
|
if token.head.i == token.i:
|
||||||
|
head = 0
|
||||||
|
else:
|
||||||
|
head = k + (token.head.i - token.i) + 1
|
||||||
|
fields = [
|
||||||
|
str(k + 1),
|
||||||
|
token.text,
|
||||||
|
token.lemma_,
|
||||||
|
token.pos_,
|
||||||
|
token.tag_,
|
||||||
|
"_",
|
||||||
|
str(head),
|
||||||
|
token.dep_.lower(),
|
||||||
|
"_",
|
||||||
|
"_",
|
||||||
|
]
|
||||||
|
if token.check_morph(Fused_begin) and (k + 1 < sent_len):
|
||||||
|
if k == 0:
|
||||||
|
fields[1] = token.norm_[0].upper() + token.norm_[1:]
|
||||||
|
else:
|
||||||
|
fields[1] = token.norm_
|
||||||
|
elif token.check_morph(Fused_inside):
|
||||||
|
fields[1] = token.norm_
|
||||||
|
elif token._.split_start is not None:
|
||||||
|
split_start = token._.split_start
|
||||||
|
split_end = token._.split_end
|
||||||
|
split_len = (split_end.i - split_start.i) + 1
|
||||||
|
n_in_split = token.i - split_start.i
|
||||||
|
subtokens = guess_fused_orths(split_start.text, [""] * split_len)
|
||||||
|
fields[1] = subtokens[n_in_split]
|
||||||
|
|
||||||
|
lines.append("\t".join(fields))
|
||||||
|
return "\n".join(lines)
|
||||||
|
|
||||||
|
|
||||||
|
def guess_fused_orths(word, ud_forms):
|
||||||
|
"""The UD data 'fused tokens' don't necessarily expand to keys that match
|
||||||
|
the form. We need orths that exactly match the string. Here we make a best
|
||||||
|
effort to divide up the word."""
|
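# For example (illustrative):
#   guess_fused_orths("won't", ["wo", "n't"])  returns ["wo", "n't"]  (exact match)
#   guess_fused_orths("della", ["di", "la"])   returns ["dell", "a"]  (fallback split)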
||||||
|
if word == "".join(ud_forms):
|
||||||
|
# Happy case: we get a perfect split, with each letter accounted for.
|
||||||
|
return ud_forms
|
||||||
|
elif len(word) == sum(len(subtoken) for subtoken in ud_forms):
|
||||||
|
# Unideal, but at least lengths match.
|
||||||
|
output = []
|
||||||
|
remain = word
|
||||||
|
for subtoken in ud_forms:
|
||||||
|
assert len(subtoken) >= 1
|
||||||
|
output.append(remain[: len(subtoken)])
|
||||||
|
remain = remain[len(subtoken) :]
|
||||||
|
assert len(remain) == 0, (word, ud_forms, remain)
|
||||||
|
return output
|
||||||
|
else:
|
||||||
|
# Let's say word is 6 long, and there are three subtokens. The orths
|
||||||
|
# *must* equal the original string. Arbitrarily, split [4, 1, 1]
|
||||||
|
first = word[: len(word) - (len(ud_forms) - 1)]
|
||||||
|
output = [first]
|
||||||
|
remain = word[len(first) :]
|
||||||
|
for i in range(1, len(ud_forms)):
|
||||||
|
assert remain
|
||||||
|
output.append(remain[:1])
|
||||||
|
remain = remain[1:]
|
||||||
|
assert len(remain) == 0, (word, output, remain)
|
||||||
|
return output
|
||||||
|
|
||||||
|
|
||||||
|
def print_results(name, ud_scores):
|
||||||
|
fields = {}
|
||||||
|
if ud_scores is not None:
|
||||||
|
fields.update(
|
||||||
|
{
|
||||||
|
"words": ud_scores["Words"].f1 * 100,
|
||||||
|
"sents": ud_scores["Sentences"].f1 * 100,
|
||||||
|
"tags": ud_scores["XPOS"].f1 * 100,
|
||||||
|
"uas": ud_scores["UAS"].f1 * 100,
|
||||||
|
"las": ud_scores["LAS"].f1 * 100,
|
||||||
|
}
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
fields.update({"words": 0.0, "sents": 0.0, "tags": 0.0, "uas": 0.0, "las": 0.0})
|
||||||
|
tpl = "\t".join(
|
||||||
|
(name, "{las:.1f}", "{uas:.1f}", "{tags:.1f}", "{sents:.1f}", "{words:.1f}")
|
||||||
|
)
|
||||||
|
print(tpl.format(**fields))
|
||||||
|
return fields
|
||||||
|
|
||||||
|
|
||||||
|
def get_token_split_start(token):
|
||||||
|
if token.text == "":
|
||||||
|
assert token.i != 0
|
||||||
|
i = -1
|
||||||
|
while token.nbor(i).text == "":
|
||||||
|
i -= 1
|
||||||
|
return token.nbor(i)
|
||||||
|
elif (token.i + 1) < len(token.doc) and token.nbor(1).text == "":
|
||||||
|
return token
|
||||||
|
else:
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def get_token_split_end(token):
|
||||||
|
if (token.i + 1) == len(token.doc):
|
||||||
|
return token if token.text == "" else None
|
||||||
|
elif token.text != "" and token.nbor(1).text != "":
|
||||||
|
return None
|
||||||
|
i = 1
|
||||||
|
while (token.i + i) < len(token.doc) and token.nbor(i).text == "":
|
||||||
|
i += 1
|
||||||
|
return token.nbor(i - 1)
|
||||||
|
|
||||||
|
|
||||||
|
##################
|
||||||
|
# Initialization #
|
||||||
|
##################
|
||||||
|
|
||||||
|
|
||||||
|
def load_nlp(experiments_dir, corpus):
|
||||||
|
nlp = spacy.load(experiments_dir / corpus / "best-model")
|
||||||
|
return nlp
|
||||||
|
|
||||||
|
|
||||||
|
def initialize_pipeline(nlp, docs, golds, config, device):
|
||||||
|
nlp.add_pipe(nlp.create_pipe("parser"))
|
||||||
|
return nlp
|
||||||
|
|
||||||
|
|
||||||
|
@plac.annotations(
|
||||||
|
test_data_dir=(
|
||||||
|
"Path to Universal Dependencies test data",
|
||||||
|
"positional",
|
||||||
|
None,
|
||||||
|
Path,
|
||||||
|
),
|
||||||
|
experiment_dir=("Parent directory with output model", "positional", None, Path),
|
||||||
|
corpus=(
|
||||||
|
"UD corpus to evaluate, e.g. UD_English, UD_Spanish, etc",
|
||||||
|
"positional",
|
||||||
|
None,
|
||||||
|
str,
|
||||||
|
),
|
||||||
|
)
|
||||||
|
def main(test_data_dir, experiment_dir, corpus):
|
||||||
|
Token.set_extension("split_start", getter=get_token_split_start)
|
||||||
|
Token.set_extension("split_end", getter=get_token_split_end)
|
||||||
|
Token.set_extension("begins_fused", default=False)
|
||||||
|
Token.set_extension("inside_fused", default=False)
|
||||||
|
lang.zh.Chinese.Defaults.use_jieba = False
|
||||||
|
lang.ja.Japanese.Defaults.use_janome = False
|
||||||
|
lang.ru.Russian.Defaults.use_pymorphy2 = False
|
||||||
|
|
||||||
|
nlp = load_nlp(experiment_dir, corpus)
|
||||||
|
|
||||||
|
treebank_code = nlp.meta["treebank"]
|
||||||
|
for section in ("test", "dev"):
|
||||||
|
if section == "dev":
|
||||||
|
section_dir = "conll17-ud-development-2017-03-19"
|
||||||
|
else:
|
||||||
|
section_dir = "conll17-ud-test-2017-05-09"
|
||||||
|
text_path = test_data_dir / "input" / section_dir / (treebank_code + ".txt")
|
||||||
|
udpipe_path = (
|
||||||
|
test_data_dir / "input" / section_dir / (treebank_code + "-udpipe.conllu")
|
||||||
|
)
|
||||||
|
gold_path = test_data_dir / "gold" / section_dir / (treebank_code + ".conllu")
|
||||||
|
|
||||||
|
header = [section, "LAS", "UAS", "TAG", "SENT", "WORD"]
|
||||||
|
print("\t".join(header))
|
||||||
|
inputs = {"gold": gold_path, "udp": udpipe_path, "raw": text_path}
|
||||||
|
for input_type in ("udp", "raw"):
|
||||||
|
input_path = inputs[input_type]
|
||||||
|
output_path = (
|
||||||
|
experiment_dir / corpus / "{section}.conllu".format(section=section)
|
||||||
|
)
|
||||||
|
|
||||||
|
parsed_docs, test_scores = evaluate(nlp, input_path, gold_path, output_path)
|
||||||
|
|
||||||
|
accuracy = print_results(input_type, test_scores)
|
||||||
|
acc_path = (
|
||||||
|
experiment_dir
|
||||||
|
/ corpus
|
||||||
|
/ "{section}-accuracy.json".format(section=section)
|
||||||
|
)
|
||||||
|
srsly.write_json(acc_path, accuracy)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
plac.call(main)
|
||||||
543
spacy/cli/ud/ud_train.py
Normal file
|
|
@ -0,0 +1,543 @@
|
||||||
|
# flake8: noqa
|
||||||
|
"""Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes
|
||||||
|
.conllu format for development data, allowing the official scorer to be used.
|
||||||
|
"""
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import plac
|
||||||
|
import tqdm
|
||||||
|
from pathlib import Path
|
||||||
|
import re
|
||||||
|
import sys
|
||||||
|
import json
|
||||||
|
|
||||||
|
import spacy
|
||||||
|
import spacy.util
|
||||||
|
from ...tokens import Token, Doc
|
||||||
|
from ...gold import GoldParse
|
||||||
|
from ...util import compounding, minibatch, minibatch_by_words
|
||||||
|
from ...syntax.nonproj import projectivize
|
||||||
|
from ...matcher import Matcher
|
||||||
|
from ... import displacy
|
||||||
|
from collections import defaultdict, Counter
|
||||||
|
from timeit import default_timer as timer
|
||||||
|
|
||||||
|
import itertools
|
||||||
|
import random
|
||||||
|
import numpy.random
|
||||||
|
|
||||||
|
from . import conll17_ud_eval
|
||||||
|
|
||||||
|
from ... import lang
|
||||||
|
from ...lang import zh
|
||||||
|
from ...lang import ja
|
||||||
|
|
||||||
|
try:
|
||||||
|
import torch
|
||||||
|
except ImportError:
|
||||||
|
torch = None
|
||||||
|
|
||||||
|
|
||||||
|
################
|
||||||
|
# Data reading #
|
||||||
|
################
|
||||||
|
|
||||||
|
space_re = re.compile(r"\s+")
|
||||||
|
|
||||||
|
|
||||||
|
def split_text(text):
|
||||||
|
return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")]
|
||||||
|
|
||||||
|
|
||||||
|
def read_data(
|
||||||
|
nlp,
|
||||||
|
conllu_file,
|
||||||
|
text_file,
|
||||||
|
raw_text=True,
|
||||||
|
oracle_segments=False,
|
||||||
|
max_doc_length=None,
|
||||||
|
limit=None,
|
||||||
|
):
|
||||||
|
"""Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True,
|
||||||
|
include Doc objects created using nlp.make_doc and then aligned against
|
||||||
|
the gold-standard sequences. If oracle_segments=True, include Doc objects
|
||||||
|
created from the gold-standard segments. At least one must be True."""
|
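# Illustrative usage (the file names are placeholders, not part of the original code):
#   with open("train.txt") as text_file, open("train.conllu") as conllu_file:
#       docs, golds = read_data(nlp, conllu_file, text_file, max_doc_length=10)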
||||||
|
if not raw_text and not oracle_segments:
|
||||||
|
raise ValueError("At least one of raw_text or oracle_segments must be True")
|
||||||
|
paragraphs = split_text(text_file.read())
|
||||||
|
conllu = read_conllu(conllu_file)
|
||||||
|
# sd is spacy doc; cd is conllu doc
|
||||||
|
# cs is conllu sent, ct is conllu token
|
||||||
|
docs = []
|
||||||
|
golds = []
|
||||||
|
for doc_id, (text, cd) in enumerate(zip(paragraphs, conllu)):
|
||||||
|
sent_annots = []
|
||||||
|
for cs in cd:
|
||||||
|
sent = defaultdict(list)
|
||||||
|
for id_, word, lemma, pos, tag, morph, head, dep, _, space_after in cs:
|
||||||
|
if "." in id_:
|
||||||
|
continue
|
||||||
|
if "-" in id_:
|
||||||
|
continue
|
||||||
|
id_ = int(id_) - 1
|
||||||
|
head = int(head) - 1 if head != "0" else id_
|
||||||
|
sent["words"].append(word)
|
||||||
|
sent["tags"].append(tag)
|
||||||
|
sent["heads"].append(head)
|
||||||
|
sent["deps"].append("ROOT" if dep == "root" else dep)
|
||||||
|
sent["spaces"].append(space_after == "_")
|
||||||
|
sent["entities"] = ["-"] * len(sent["words"])
|
||||||
|
sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"])
|
||||||
|
if oracle_segments:
|
||||||
|
docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"]))
|
||||||
|
golds.append(GoldParse(docs[-1], **sent))
|
||||||
|
|
||||||
|
sent_annots.append(sent)
|
||||||
|
if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
|
||||||
|
doc, gold = _make_gold(nlp, None, sent_annots)
|
||||||
|
sent_annots = []
|
||||||
|
docs.append(doc)
|
||||||
|
golds.append(gold)
|
||||||
|
if limit and len(docs) >= limit:
|
||||||
|
return docs, golds
|
||||||
|
|
||||||
|
if raw_text and sent_annots:
|
||||||
|
doc, gold = _make_gold(nlp, None, sent_annots)
|
||||||
|
docs.append(doc)
|
||||||
|
golds.append(gold)
|
||||||
|
if limit and len(docs) >= limit:
|
||||||
|
return docs, golds
|
||||||
|
return docs, golds
|
||||||
|
|
||||||
|
|
||||||
|
def read_conllu(file_):
|
||||||
|
docs = []
|
||||||
|
sent = []
|
||||||
|
doc = []
|
||||||
|
for line in file_:
|
||||||
|
if line.startswith("# newdoc"):
|
||||||
|
if doc:
|
||||||
|
docs.append(doc)
|
||||||
|
doc = []
|
||||||
|
elif line.startswith("#"):
|
||||||
|
continue
|
||||||
|
elif not line.strip():
|
||||||
|
if sent:
|
||||||
|
doc.append(sent)
|
||||||
|
sent = []
|
||||||
|
else:
|
||||||
|
sent.append(list(line.strip().split("\t")))
|
||||||
|
if len(sent[-1]) != 10:
|
||||||
|
print(repr(line))
|
||||||
|
raise ValueError
|
||||||
|
if sent:
|
||||||
|
doc.append(sent)
|
||||||
|
if doc:
|
||||||
|
docs.append(doc)
|
||||||
|
return docs
|
||||||
|
|
||||||
|
|
||||||
|


def _make_gold(nlp, text, sent_annots, drop_deps=0.0):
    # Flatten the conll annotations, and adjust the head indices
    flat = defaultdict(list)
    sent_starts = []
    for sent in sent_annots:
        flat["heads"].extend(len(flat["words"]) + head for head in sent["heads"])
        for field in ["words", "tags", "deps", "entities", "spaces"]:
            flat[field].extend(sent[field])
        sent_starts.append(True)
        sent_starts.extend([False] * (len(sent["words"]) - 1))
    # Construct text if necessary
    assert len(flat["words"]) == len(flat["spaces"])
    if text is None:
        text = "".join(
            word + " " * space for word, space in zip(flat["words"], flat["spaces"])
        )
    doc = nlp.make_doc(text)
    flat.pop("spaces")
    gold = GoldParse(doc, **flat)
    gold.sent_starts = sent_starts
    for i in range(len(gold.heads)):
        if random.random() < drop_deps:
            gold.heads[i] = None
            gold.labels[i] = None

    return doc, gold


#############################
# Data transforms for spaCy #
#############################


def golds_to_gold_tuples(docs, golds):
    """Get out the annoying 'tuples' format used by begin_training, given the
    GoldParse objects."""
    tuples = []
    for doc, gold in zip(docs, golds):
        text = doc.text
        ids, words, tags, heads, labels, iob = zip(*gold.orig_annot)
        sents = [((ids, words, tags, heads, labels, iob), [])]
        tuples.append((text, sents))
    return tuples
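
# Editor's note (illustrative, not in the original file): for a two-word doc,
# golds_to_gold_tuples() produces entries shaped roughly like
#   ("Hello world", [((ids, words, tags, heads, labels, iob), [])])
# with a single flattened sentence tuple per Doc, matching the format that
# nlp.begin_training() consumes below.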


##############
# Evaluation #
##############


def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
    if text_loc.parts[-1].endswith(".conllu"):
        docs = []
        with text_loc.open() as file_:
            for conllu_doc in read_conllu(file_):
                for conllu_sent in conllu_doc:
                    words = [line[1] for line in conllu_sent]
                    docs.append(Doc(nlp.vocab, words=words))
        for name, component in nlp.pipeline:
            docs = list(component.pipe(docs))
    else:
        with text_loc.open("r", encoding="utf8") as text_file:
            texts = split_text(text_file.read())
        docs = list(nlp.pipe(texts))
    with sys_loc.open("w", encoding="utf8") as out_file:
        write_conllu(docs, out_file)
    with gold_loc.open("r", encoding="utf8") as gold_file:
        gold_ud = conll17_ud_eval.load_conllu(gold_file)
    with sys_loc.open("r", encoding="utf8") as sys_file:
        sys_ud = conll17_ud_eval.load_conllu(sys_file)
    scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
    return docs, scores


def write_conllu(docs, file_):
    merger = Matcher(docs[0].vocab)
    merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}])
    for i, doc in enumerate(docs):
        matches = merger(doc)
        spans = [doc[start : end + 1] for _, start, end in matches]
        with doc.retokenize() as retokenizer:
            for span in spans:
                retokenizer.merge(span)
        file_.write("# newdoc id = {i}\n".format(i=i))
        for j, sent in enumerate(doc.sents):
            file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
            file_.write("# text = {text}\n".format(text=sent.text))
            for k, token in enumerate(sent):
                if token.head.i > sent[-1].i or token.head.i < sent[0].i:
                    for word in doc[sent[0].i - 10 : sent[0].i]:
                        print(word.i, word.head.i, word.text, word.dep_)
                    for word in sent:
                        print(word.i, word.head.i, word.text, word.dep_)
                    for word in doc[sent[-1].i : sent[-1].i + 10]:
                        print(word.i, word.head.i, word.text, word.dep_)
                    raise ValueError(
                        "Invalid parse: head outside sentence (%s)" % token.text
                    )
                file_.write(token._.get_conllu_lines(k) + "\n")
            file_.write("\n")


def print_progress(itn, losses, ud_scores):
    fields = {
        "dep_loss": losses.get("parser", 0.0),
        "tag_loss": losses.get("tagger", 0.0),
        "words": ud_scores["Words"].f1 * 100,
        "sents": ud_scores["Sentences"].f1 * 100,
        "tags": ud_scores["XPOS"].f1 * 100,
        "uas": ud_scores["UAS"].f1 * 100,
        "las": ud_scores["LAS"].f1 * 100,
    }
    header = ["Epoch", "Loss", "LAS", "UAS", "TAG", "SENT", "WORD"]
    if itn == 0:
        print("\t".join(header))
    tpl = "\t".join(
        (
            "{:d}",
            "{dep_loss:.1f}",
            "{las:.1f}",
            "{uas:.1f}",
            "{tags:.1f}",
            "{sents:.1f}",
            "{words:.1f}",
        )
    )
    print(tpl.format(itn, **fields))


# def get_sent_conllu(sent, sent_id):
#     lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)]


def get_token_conllu(token, i):
    if token._.begins_fused:
        n = 1
        while token.nbor(n)._.inside_fused:
            n += 1
        id_ = "%d-%d" % (i, i + n)
        lines = [id_, token.text, "_", "_", "_", "_", "_", "_", "_", "_"]
    else:
        lines = []
    if token.head.i == token.i:
        head = 0
    else:
        head = i + (token.head.i - token.i) + 1
    fields = [
        str(i + 1),
        token.text,
        token.lemma_,
        token.pos_,
        token.tag_,
        "_",
        str(head),
        token.dep_.lower(),
        "_",
        "_",
    ]
    lines.append("\t".join(fields))
    return "\n".join(lines)


Token.set_extension("get_conllu_lines", method=get_token_conllu)
Token.set_extension("begins_fused", default=False)
Token.set_extension("inside_fused", default=False)


##################
# Initialization #
##################


def load_nlp(corpus, config, vectors=None):
    lang = corpus.split("_")[0]
    nlp = spacy.blank(lang)
    if config.vectors:
        if not vectors:
            raise ValueError(
                "config asks for vectors, but no vectors "
                "directory set on command line (use -v)"
            )
        if (Path(vectors) / corpus).exists():
            nlp.vocab.from_disk(Path(vectors) / corpus / "vocab")
    nlp.meta["treebank"] = corpus
    return nlp


def initialize_pipeline(nlp, docs, golds, config, device):
    nlp.add_pipe(nlp.create_pipe("tagger"))
    nlp.add_pipe(nlp.create_pipe("parser"))
    if config.multitask_tag:
        nlp.parser.add_multitask_objective("tag")
    if config.multitask_sent:
        nlp.parser.add_multitask_objective("sent_start")
    for gold in golds:
        for tag in gold.tags:
            if tag is not None:
                nlp.tagger.add_label(tag)
    if torch is not None and device != -1:
        torch.set_default_tensor_type("torch.cuda.FloatTensor")
    optimizer = nlp.begin_training(
        lambda: golds_to_gold_tuples(docs, golds),
        device=device,
        subword_features=config.subword_features,
        conv_depth=config.conv_depth,
        bilstm_depth=config.bilstm_depth,
    )
    if config.pretrained_tok2vec:
        _load_pretrained_tok2vec(nlp, config.pretrained_tok2vec)
    return optimizer


def _load_pretrained_tok2vec(nlp, loc):
    """Load pre-trained weights for the 'token-to-vector' part of the component
    models, which is typically a CNN. See 'spacy pretrain'. Experimental.
    """
    with Path(loc).open("rb") as file_:
        weights_data = file_.read()
    loaded = []
    for name, component in nlp.pipeline:
        if hasattr(component, "model") and hasattr(component.model, "tok2vec"):
            component.tok2vec.from_bytes(weights_data)
            loaded.append(name)
    return loaded
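
# Editor's note (illustrative, not in the original file): the weights file passed to
# _load_pretrained_tok2vec() is the output of `python -m spacy pretrain`. Wiring it
# in via the config might look like the following; the path is hypothetical.
#
#     config = Config(pretrained_tok2vec="pretrain/model999.bin")
#     optimizer = initialize_pipeline(nlp, docs, golds, config, device=-1)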


########################
# Command line helpers #
########################


class Config(object):
    def __init__(
        self,
        vectors=None,
        max_doc_length=10,
        multitask_tag=False,
        multitask_sent=False,
        multitask_dep=False,
        multitask_vectors=None,
        bilstm_depth=0,
        nr_epoch=30,
        min_batch_size=100,
        max_batch_size=1000,
        batch_by_words=True,
        dropout=0.2,
        conv_depth=4,
        subword_features=True,
        vectors_dir=None,
        pretrained_tok2vec=None,
    ):
        if vectors_dir is not None:
            if vectors is None:
                vectors = True
            if multitask_vectors is None:
                multitask_vectors = True
        for key, value in locals().items():
            setattr(self, key, value)

    @classmethod
    def load(cls, loc, vectors_dir=None):
        with Path(loc).open("r", encoding="utf8") as file_:
            cfg = json.load(file_)
        if vectors_dir is not None:
            cfg["vectors_dir"] = vectors_dir
        return cls(**cfg)
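
# Editor's note (illustrative, not in the original file): Config.load() reads a flat
# JSON object whose keys mirror the __init__ arguments above. A hypothetical
# config.json such as
#   {"max_doc_length": 10, "nr_epoch": 15, "batch_by_words": true, "conv_depth": 4}
# would be loaded with:
#
#     config = Config.load("config.json", vectors_dir=Path("vectors"))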


class Dataset(object):
    def __init__(self, path, section):
        self.path = path
        self.section = section
        self.conllu = None
        self.text = None
        for file_path in self.path.iterdir():
            name = file_path.parts[-1]
            if section in name and name.endswith("conllu"):
                self.conllu = file_path
            elif section in name and name.endswith("txt"):
                self.text = file_path
        if self.conllu is None:
            msg = "Could not find .conllu file in {path} for {section}"
            raise IOError(msg.format(section=section, path=path))
        if self.text is None:
            msg = "Could not find .txt file in {path} for {section}"
            raise IOError(msg.format(section=section, path=path))
        self.lang = self.conllu.parts[-1].split("-")[0].split("_")[0]


class TreebankPaths(object):
    def __init__(self, ud_path, treebank, **cfg):
        self.train = Dataset(ud_path / treebank, "train")
        self.dev = Dataset(ud_path / treebank, "dev")
        self.lang = self.train.lang


@plac.annotations(
    ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
    corpus=(
        "UD corpus to train and evaluate on, e.g. en, es_ancora, etc",
        "positional",
        None,
        str,
    ),
    parses_dir=("Directory to write the development parses", "positional", None, Path),
    config=("Path to json formatted config file", "option", "C", Path),
    limit=("Size limit", "option", "n", int),
    gpu_device=("Use GPU", "option", "g", int),
    use_oracle_segments=("Use oracle segments", "flag", "G", int),
    vectors_dir=(
        "Path to directory with pre-trained vectors, named e.g. en/",
        "option",
        "v",
        Path,
    ),
)
def main(
    ud_dir,
    parses_dir,
    corpus,
    config=None,
    limit=0,
    gpu_device=-1,
    vectors_dir=None,
    use_oracle_segments=False,
):
    spacy.util.fix_random_seed()
    lang.zh.Chinese.Defaults.use_jieba = False
    lang.ja.Japanese.Defaults.use_janome = False

    if config is not None:
        config = Config.load(config, vectors_dir=vectors_dir)
    else:
        config = Config(vectors_dir=vectors_dir)
    paths = TreebankPaths(ud_dir, corpus)
    if not (parses_dir / corpus).exists():
        (parses_dir / corpus).mkdir()
    print("Train and evaluate", corpus, "using lang", paths.lang)
    nlp = load_nlp(paths.lang, config, vectors=vectors_dir)

    docs, golds = read_data(
        nlp,
        paths.train.conllu.open(),
        paths.train.text.open(),
        max_doc_length=config.max_doc_length,
        limit=limit,
    )

    optimizer = initialize_pipeline(nlp, docs, golds, config, gpu_device)

    batch_sizes = compounding(config.min_batch_size, config.max_batch_size, 1.001)
    beam_prob = compounding(0.2, 0.8, 1.001)
    for i in range(config.nr_epoch):
        docs, golds = read_data(
            nlp,
            paths.train.conllu.open(),
            paths.train.text.open(),
            max_doc_length=config.max_doc_length,
            limit=limit,
            oracle_segments=use_oracle_segments,
            raw_text=not use_oracle_segments,
        )
        Xs = list(zip(docs, golds))
        random.shuffle(Xs)
        if config.batch_by_words:
            batches = minibatch_by_words(Xs, size=batch_sizes)
        else:
            batches = minibatch(Xs, size=batch_sizes)
        losses = {}
        n_train_words = sum(len(doc) for doc in docs)
        with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
            for batch in batches:
                batch_docs, batch_gold = zip(*batch)
                pbar.update(sum(len(doc) for doc in batch_docs))
                nlp.parser.cfg["beam_update_prob"] = next(beam_prob)
                nlp.update(
                    batch_docs,
                    batch_gold,
                    sgd=optimizer,
                    drop=config.dropout,
                    losses=losses,
                )

        out_path = parses_dir / corpus / "epoch-{i}.conllu".format(i=i)
        with nlp.use_params(optimizer.averages):
            if use_oracle_segments:
                parsed_docs, scores = evaluate(
                    nlp, paths.dev.conllu, paths.dev.conllu, out_path
                )
            else:
                parsed_docs, scores = evaluate(
                    nlp, paths.dev.text, paths.dev.conllu, out_path
                )
            print_progress(i, losses, scores)


def _render_parses(i, to_render):
    to_render[0].user_data["title"] = "Batch %d" % i
    with Path("/tmp/parses.html").open("w") as file_:
        html = displacy.render(to_render[:5], style="dep", page=True)
        file_.write(html)


if __name__ == "__main__":
    plac.call(main)
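
# Editor's note (illustrative, not part of this diff): the script is driven by plac,
# so a typical invocation (script name, paths and treebank are all hypothetical)
# would pass the positional ud_dir, parses_dir and corpus arguments, with the
# -C/-n/-g/-G/-v options mapping to the @plac.annotations declared above:
#
#     python ud_train.py /data/ud-treebanks-v2.3 /tmp/parses en_ewt -C config.json -g 0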
@ -4,28 +4,38 @@ from __future__ import unicode_literals, print_function
|
||||||
import pkg_resources
|
import pkg_resources
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
import sys
|
import sys
|
||||||
import ujson
|
|
||||||
import requests
|
import requests
|
||||||
|
import srsly
|
||||||
|
from wasabi import Printer
|
||||||
|
|
||||||
from ._messages import Messages
|
from ..compat import path2str
|
||||||
from ..compat import path2str, locale_escape
|
from ..util import get_data_path
|
||||||
from ..util import prints, get_data_path, read_json
|
|
||||||
from .. import about
|
from .. import about
|
||||||
|
|
||||||
|
|
||||||
def validate():
|
def validate():
|
||||||
"""Validate that the currently installed version of spaCy is compatible
|
"""
|
||||||
|
Validate that the currently installed version of spaCy is compatible
|
||||||
with the installed models. Should be run after `pip install -U spacy`.
|
with the installed models. Should be run after `pip install -U spacy`.
|
||||||
"""
|
"""
|
||||||
|
msg = Printer()
|
||||||
|
with msg.loading("Loading compatibility table..."):
|
||||||
r = requests.get(about.__compatibility__)
|
r = requests.get(about.__compatibility__)
|
||||||
if r.status_code != 200:
|
if r.status_code != 200:
|
||||||
prints(Messages.M021, title=Messages.M003.format(code=r.status_code),
|
msg.fail(
|
||||||
exits=1)
|
"Server error ({})".format(r.status_code),
|
||||||
compat = r.json()['spacy']
|
"Couldn't fetch compatibility table.",
|
||||||
|
exits=1,
|
||||||
|
)
|
||||||
|
msg.good("Loaded compatibility table")
|
||||||
|
compat = r.json()["spacy"]
|
||||||
current_compat = compat.get(about.__version__)
|
current_compat = compat.get(about.__version__)
|
||||||
if not current_compat:
|
if not current_compat:
|
||||||
prints(about.__compatibility__, exits=1,
|
msg.fail(
|
||||||
title=Messages.M022.format(version=about.__version__))
|
"Can't find spaCy v{} in compatibility table".format(about.__version__),
|
||||||
|
about.__compatibility__,
|
||||||
|
exits=1,
|
||||||
|
)
|
||||||
all_models = set()
|
all_models = set()
|
||||||
for spacy_v, models in dict(compat).items():
|
for spacy_v, models in dict(compat).items():
|
||||||
all_models.update(models.keys())
|
all_models.update(models.keys())
|
||||||
|
|
@ -33,33 +43,45 @@ def validate():
|
||||||
compat[spacy_v][model] = [reformat_version(v) for v in model_vs]
|
compat[spacy_v][model] = [reformat_version(v) for v in model_vs]
|
||||||
model_links = get_model_links(current_compat)
|
model_links = get_model_links(current_compat)
|
||||||
model_pkgs = get_model_pkgs(current_compat, all_models)
|
model_pkgs = get_model_pkgs(current_compat, all_models)
|
||||||
incompat_links = {l for l, d in model_links.items() if not d['compat']}
|
incompat_links = {l for l, d in model_links.items() if not d["compat"]}
|
||||||
incompat_models = {d['name'] for _, d in model_pkgs.items()
|
incompat_models = {d["name"] for _, d in model_pkgs.items() if not d["compat"]}
|
||||||
if not d['compat']}
|
incompat_models.update(
|
||||||
incompat_models.update([d['name'] for _, d in model_links.items()
|
[d["name"] for _, d in model_links.items() if not d["compat"]]
|
||||||
if not d['compat']])
|
)
|
||||||
na_models = [m for m in incompat_models if m not in current_compat]
|
na_models = [m for m in incompat_models if m not in current_compat]
|
||||||
update_models = [m for m in incompat_models if m in current_compat]
|
update_models = [m for m in incompat_models if m in current_compat]
|
||||||
|
spacy_dir = Path(__file__).parent.parent
|
||||||
|
|
||||||
|
msg.divider("Installed models (spaCy v{})".format(about.__version__))
|
||||||
|
msg.info("spaCy installation: {}".format(path2str(spacy_dir)))
|
||||||
|
|
||||||
prints(path2str(Path(__file__).parent.parent),
|
|
||||||
title=Messages.M023.format(version=about.__version__))
|
|
||||||
if model_links or model_pkgs:
|
if model_links or model_pkgs:
|
||||||
print(get_row('TYPE', 'NAME', 'MODEL', 'VERSION', ''))
|
header = ("TYPE", "NAME", "MODEL", "VERSION", "")
|
||||||
|
rows = []
|
||||||
for name, data in model_pkgs.items():
|
for name, data in model_pkgs.items():
|
||||||
print(get_model_row(current_compat, name, data, 'package'))
|
rows.append(get_model_row(current_compat, name, data, msg))
|
||||||
for name, data in model_links.items():
|
for name, data in model_links.items():
|
||||||
print(get_model_row(current_compat, name, data, 'link'))
|
rows.append(get_model_row(current_compat, name, data, msg, "link"))
|
||||||
|
msg.table(rows, header=header)
|
||||||
else:
|
else:
|
||||||
prints(Messages.M024, exits=0)
|
msg.text("No models found in your current environment.", exits=0)
|
||||||
if update_models:
|
if update_models:
|
||||||
cmd = ' python -m spacy download {}'
|
msg.divider("Install updates")
|
||||||
print("\n " + Messages.M025)
|
msg.text("Use the following commands to update the model packages:")
|
||||||
print('\n'.join([cmd.format(pkg) for pkg in update_models]))
|
cmd = "python -m spacy download {}"
|
||||||
|
print("\n".join([cmd.format(pkg) for pkg in update_models]) + "\n")
|
||||||
if na_models:
|
if na_models:
|
||||||
prints(Messages.M025.format(version=about.__version__,
|
msg.text(
|
||||||
models=', '.join(na_models)))
|
"The following models are not available for spaCy "
|
||||||
|
"v{}: {}".format(about.__version__, ", ".join(na_models))
|
||||||
|
)
|
||||||
if incompat_links:
|
if incompat_links:
|
||||||
prints(Messages.M027.format(path=path2str(get_data_path())))
|
msg.text(
|
||||||
|
"You may also want to overwrite the incompatible links using the "
|
||||||
|
"`python -m spacy link` command with `--force`, or remove them "
|
||||||
|
"from the data directory. "
|
||||||
|
"Data path: {path}".format(path=path2str(get_data_path()))
|
||||||
|
)
|
||||||
if incompat_models or incompat_links:
|
if incompat_models or incompat_links:
|
||||||
sys.exit(1)
|
sys.exit(1)
|
||||||
|
|
||||||
|
|
@ -70,50 +92,48 @@ def get_model_links(compat):
|
||||||
if data_path:
|
if data_path:
|
||||||
models = [p for p in data_path.iterdir() if is_model_path(p)]
|
models = [p for p in data_path.iterdir() if is_model_path(p)]
|
||||||
for model in models:
|
for model in models:
|
||||||
meta_path = Path(model) / 'meta.json'
|
meta_path = Path(model) / "meta.json"
|
||||||
if not meta_path.exists():
|
if not meta_path.exists():
|
||||||
continue
|
continue
|
||||||
meta = read_json(meta_path)
|
meta = srsly.read_json(meta_path)
|
||||||
link = model.parts[-1]
|
link = model.parts[-1]
|
||||||
name = meta['lang'] + '_' + meta['name']
|
name = meta["lang"] + "_" + meta["name"]
|
||||||
links[link] = {'name': name, 'version': meta['version'],
|
links[link] = {
|
||||||
'compat': is_compat(compat, name, meta['version'])}
|
"name": name,
|
||||||
|
"version": meta["version"],
|
||||||
|
"compat": is_compat(compat, name, meta["version"]),
|
||||||
|
}
|
||||||
return links
|
return links
|
||||||
|
|
||||||
|
|
||||||
def get_model_pkgs(compat, all_models):
|
def get_model_pkgs(compat, all_models):
|
||||||
pkgs = {}
|
pkgs = {}
|
||||||
for pkg_name, pkg_data in pkg_resources.working_set.by_key.items():
|
for pkg_name, pkg_data in pkg_resources.working_set.by_key.items():
|
||||||
package = pkg_name.replace('-', '_')
|
package = pkg_name.replace("-", "_")
|
||||||
if package in all_models:
|
if package in all_models:
|
||||||
version = pkg_data.version
|
version = pkg_data.version
|
||||||
pkgs[pkg_name] = {'name': package, 'version': version,
|
pkgs[pkg_name] = {
|
||||||
'compat': is_compat(compat, package, version)}
|
"name": package,
|
||||||
|
"version": version,
|
||||||
|
"compat": is_compat(compat, package, version),
|
||||||
|
}
|
||||||
return pkgs
|
return pkgs
|
||||||
|
|
||||||
|
|
||||||
def get_model_row(compat, name, data, type='package'):
|
def get_model_row(compat, name, data, msg, model_type="package"):
|
||||||
tpl_red = '\x1b[38;5;1m{}\x1b[0m'
|
if data["compat"]:
|
||||||
tpl_green = '\x1b[38;5;2m{}\x1b[0m'
|
comp = msg.text("", color="green", icon="good", no_print=True)
|
||||||
if data['compat']:
|
version = msg.text(data["version"], color="green", no_print=True)
|
||||||
comp = tpl_green.format(locale_escape('✔', errors='ignore'))
|
|
||||||
version = tpl_green.format(data['version'])
|
|
||||||
else:
|
else:
|
||||||
comp = '--> {}'.format(compat.get(data['name'], ['n/a'])[0])
|
version = msg.text(data["version"], color="red", no_print=True)
|
||||||
version = tpl_red.format(data['version'])
|
comp = "--> {}".format(compat.get(data["name"], ["n/a"])[0])
|
||||||
return get_row(type, name, data['name'], version, comp)
|
return (model_type, name, data["name"], version, comp)
|
||||||
|
|
||||||
|
|
||||||
def get_row(*args):
|
|
||||||
tpl_row = ' {:<10}' + (' {:<20}' * 4)
|
|
||||||
return tpl_row.format(*args)
|
|
||||||
|
|
||||||
|
|
||||||
def is_model_path(model_path):
|
def is_model_path(model_path):
|
||||||
exclude = ['cache', 'pycache', '__pycache__']
|
exclude = ["cache", "pycache", "__pycache__"]
|
||||||
name = model_path.parts[-1]
|
name = model_path.parts[-1]
|
||||||
return (model_path.is_dir() and name not in exclude
|
return model_path.is_dir() and name not in exclude and not name.startswith(".")
|
||||||
and not name.startswith('.'))
|
|
||||||
|
|
||||||
|
|
||||||
def is_compat(compat, name, version):
|
def is_compat(compat, name, version):
|
||||||
|
|
@ -122,6 +142,6 @@ def is_compat(compat, name, version):
|
||||||
|
|
||||||
def reformat_version(version):
|
def reformat_version(version):
|
||||||
"""Hack to reformat old versions ending on '-alpha' to match pip format."""
|
"""Hack to reformat old versions ending on '-alpha' to match pip format."""
|
||||||
if version.endswith('-alpha'):
|
if version.endswith("-alpha"):
|
||||||
return version.replace('-alpha', 'a0')
|
return version.replace("-alpha", "a0")
|
||||||
return version.replace('-alpha', 'a')
|
return version.replace("-alpha", "a")
|
||||||
|
|
|
||||||
|
|
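
Editor's note: the hunks above migrate `spacy validate` from the old `prints()` helpers to wasabi's `Printer`. As a rough, hedged sketch of the new reporting pattern (not part of this diff; the model name and version are made up):

    from wasabi import Printer

    msg = Printer()
    msg.divider("Installed models (spaCy v2.1.0)")
    header = ("TYPE", "NAME", "MODEL", "VERSION", "")
    rows = [("package", "en_core_web_sm", "en_core_web_sm", "2.1.0", "")]
    msg.table(rows, header=header)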
@ -1,59 +0,0 @@
|
||||||
# coding: utf8
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import plac
|
|
||||||
import json
|
|
||||||
import spacy
|
|
||||||
import numpy
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
from ..vectors import Vectors
|
|
||||||
from ..util import prints, ensure_path
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
|
||||||
lang=("model language", "positional", None, str),
|
|
||||||
output_dir=("model output directory", "positional", None, Path),
|
|
||||||
lexemes_loc=("location of JSONL-formatted lexical data", "positional",
|
|
||||||
None, Path),
|
|
||||||
vectors_loc=("optional: location of vectors data, as numpy .npz",
|
|
||||||
"positional", None, str),
|
|
||||||
prune_vectors=("optional: number of vectors to prune to.",
|
|
||||||
"option", "V", int)
|
|
||||||
)
|
|
||||||
def make_vocab(lang, output_dir, lexemes_loc, vectors_loc=None, prune_vectors=-1):
|
|
||||||
"""Compile a vocabulary from a lexicon jsonl file and word vectors."""
|
|
||||||
if not lexemes_loc.exists():
|
|
||||||
prints(lexemes_loc, title="Can't find lexical data", exits=1)
|
|
||||||
vectors_loc = ensure_path(vectors_loc)
|
|
||||||
nlp = spacy.blank(lang)
|
|
||||||
for word in nlp.vocab:
|
|
||||||
word.rank = 0
|
|
||||||
lex_added = 0
|
|
||||||
with lexemes_loc.open() as file_:
|
|
||||||
for line in file_:
|
|
||||||
if line.strip():
|
|
||||||
attrs = json.loads(line)
|
|
||||||
if 'settings' in attrs:
|
|
||||||
nlp.vocab.cfg.update(attrs['settings'])
|
|
||||||
else:
|
|
||||||
lex = nlp.vocab[attrs['orth']]
|
|
||||||
lex.set_attrs(**attrs)
|
|
||||||
assert lex.rank == attrs['id']
|
|
||||||
lex_added += 1
|
|
||||||
if vectors_loc is not None:
|
|
||||||
vector_data = numpy.load(vectors_loc.open('rb'))
|
|
||||||
nlp.vocab.vectors = Vectors(data=vector_data)
|
|
||||||
for word in nlp.vocab:
|
|
||||||
if word.rank:
|
|
||||||
nlp.vocab.vectors.add(word.orth, row=word.rank)
|
|
||||||
|
|
||||||
if prune_vectors >= 1:
|
|
||||||
remap = nlp.vocab.prune_vectors(prune_vectors)
|
|
||||||
if not output_dir.exists():
|
|
||||||
output_dir.mkdir()
|
|
||||||
nlp.to_disk(output_dir)
|
|
||||||
vec_added = len(nlp.vocab.vectors)
|
|
||||||
prints("{} entries, {} vectors".format(lex_added, vec_added), output_dir,
|
|
||||||
title="Sucessfully compiled vocab and vectors, and saved model")
|
|
||||||
return nlp
|
|
||||||
|
|
@ -1,11 +1,16 @@
|
||||||
# coding: utf8
|
# coding: utf8
|
||||||
|
"""
|
||||||
|
Helpers for Python and platform compatibility. To distinguish them from
|
||||||
|
the builtin functions, replacement functions are suffixed with an underscore,
|
||||||
|
e.g. `unicode_`.
|
||||||
|
|
||||||
|
DOCS: https://spacy.io/api/top-level#compat
|
||||||
|
"""
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import os
|
import os
|
||||||
import sys
|
import sys
|
||||||
import ujson
|
|
||||||
import itertools
|
import itertools
|
||||||
import locale
|
|
||||||
|
|
||||||
from thinc.neural.util import copy_array
|
from thinc.neural.util import copy_array
|
||||||
|
|
||||||
|
|
@ -30,9 +35,9 @@ except ImportError:
|
||||||
cupy = None
|
cupy = None
|
||||||
|
|
||||||
try:
|
try:
|
||||||
from thinc.neural.optimizers import Optimizer
|
from thinc.neural.optimizers import Optimizer # noqa: F401
|
||||||
except ImportError:
|
except ImportError:
|
||||||
from thinc.neural.optimizers import Adam as Optimizer
|
from thinc.neural.optimizers import Adam as Optimizer # noqa: F401
|
||||||
|
|
||||||
pickle = pickle
|
pickle = pickle
|
||||||
copy_reg = copy_reg
|
copy_reg = copy_reg
|
||||||
|
|
@ -55,9 +60,6 @@ if is_python2:
|
||||||
unicode_ = unicode # noqa: F821
|
unicode_ = unicode # noqa: F821
|
||||||
basestring_ = basestring # noqa: F821
|
basestring_ = basestring # noqa: F821
|
||||||
input_ = raw_input # noqa: F821
|
input_ = raw_input # noqa: F821
|
||||||
json_dumps = lambda data: ujson.dumps(
|
|
||||||
data, indent=2, escape_forward_slashes=False
|
|
||||||
).decode("utf8")
|
|
||||||
path2str = lambda path: str(path).decode("utf8")
|
path2str = lambda path: str(path).decode("utf8")
|
||||||
|
|
||||||
elif is_python3:
|
elif is_python3:
|
||||||
|
|
@ -65,24 +67,27 @@ elif is_python3:
|
||||||
unicode_ = str
|
unicode_ = str
|
||||||
basestring_ = str
|
basestring_ = str
|
||||||
input_ = input
|
input_ = input
|
||||||
json_dumps = lambda data: ujson.dumps(data, indent=2, escape_forward_slashes=False)
|
|
||||||
path2str = lambda path: str(path)
|
path2str = lambda path: str(path)
|
||||||
|
|
||||||
|
|
||||||
def b_to_str(b_str):
|
def b_to_str(b_str):
|
||||||
|
"""Convert a bytes object to a string.
|
||||||
|
|
||||||
|
b_str (bytes): The object to convert.
|
||||||
|
RETURNS (unicode): The converted string.
|
||||||
|
"""
|
||||||
if is_python2:
|
if is_python2:
|
||||||
return b_str
|
return b_str
|
||||||
# important: if no encoding is set, string becomes "b'...'"
|
# Important: if no encoding is set, string becomes "b'...'"
|
||||||
return str(b_str, encoding="utf8")
|
return str(b_str, encoding="utf8")
|
||||||
|
|
||||||
|
|
||||||
def getattr_(obj, name, *default):
|
|
||||||
if is_python3 and isinstance(name, bytes):
|
|
||||||
name = name.decode("utf8")
|
|
||||||
return getattr(obj, name, *default)
|
|
||||||
|
|
||||||
|
|
||||||
def symlink_to(orig, dest):
|
def symlink_to(orig, dest):
|
||||||
|
"""Create a symlink. Used for model shortcut links.
|
||||||
|
|
||||||
|
orig (unicode / Path): The origin path.
|
||||||
|
dest (unicode / Path): The destination path of the symlink.
|
||||||
|
"""
|
||||||
if is_windows:
|
if is_windows:
|
||||||
import subprocess
|
import subprocess
|
||||||
|
|
||||||
|
|
@ -92,6 +97,10 @@ def symlink_to(orig, dest):
|
||||||
|
|
||||||
|
|
||||||
def symlink_remove(link):
|
def symlink_remove(link):
|
||||||
|
"""Remove a symlink. Used for model shortcut links.
|
||||||
|
|
||||||
|
link (unicode / Path): The path to the symlink.
|
||||||
|
"""
|
||||||
# https://stackoverflow.com/q/26554135/6400719
|
# https://stackoverflow.com/q/26554135/6400719
|
||||||
if os.path.isdir(path2str(link)) and is_windows:
|
if os.path.isdir(path2str(link)) and is_windows:
|
||||||
# this should only be on Py2.7 and windows
|
# this should only be on Py2.7 and windows
|
||||||
|
|
@ -101,6 +110,18 @@ def symlink_remove(link):
|
||||||
|
|
||||||
|
|
||||||
def is_config(python2=None, python3=None, windows=None, linux=None, osx=None):
|
def is_config(python2=None, python3=None, windows=None, linux=None, osx=None):
|
||||||
|
"""Check if a specific configuration of Python version and operating system
|
||||||
|
matches the user's setup. Mostly used to display targeted error messages.
|
||||||
|
|
||||||
|
python2 (bool): spaCy is executed with Python 2.x.
|
||||||
|
python3 (bool): spaCy is executed with Python 3.x.
|
||||||
|
windows (bool): spaCy is executed on Windows.
|
||||||
|
linux (bool): spaCy is executed on Linux.
|
||||||
|
osx (bool): spaCy is executed on OS X or macOS.
|
||||||
|
RETURNS (bool): Whether the configuration matches the user's platform.
|
||||||
|
|
||||||
|
DOCS: https://spacy.io/api/top-level#compat.is_config
|
||||||
|
"""
|
||||||
return (
|
return (
|
||||||
python2 in (None, is_python2)
|
python2 in (None, is_python2)
|
||||||
and python3 in (None, is_python3)
|
and python3 in (None, is_python3)
|
||||||
|
|
@ -110,19 +131,14 @@ def is_config(python2=None, python3=None, windows=None, linux=None, osx=None):
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
def normalize_string_keys(old):
|
|
||||||
"""Given a dictionary, make sure keys are unicode strings, not bytes."""
|
|
||||||
new = {}
|
|
||||||
for key, value in old.items():
|
|
||||||
if isinstance(key, bytes_):
|
|
||||||
new[key.decode("utf8")] = value
|
|
||||||
else:
|
|
||||||
new[key] = value
|
|
||||||
return new
|
|
||||||
|
|
||||||
|
|
||||||
def import_file(name, loc):
|
def import_file(name, loc):
|
||||||
loc = str(loc)
|
"""Import module from a file. Used to load models from a directory.
|
||||||
|
|
||||||
|
name (unicode): Name of module to load.
|
||||||
|
loc (unicode / Path): Path to the file.
|
||||||
|
RETURNS: The loaded module.
|
||||||
|
"""
|
||||||
|
loc = path2str(loc)
|
||||||
if is_python_pre_3_5:
|
if is_python_pre_3_5:
|
||||||
import imp
|
import imp
|
||||||
|
|
||||||
|
|
@ -134,12 +150,3 @@ def import_file(name, loc):
|
||||||
module = importlib.util.module_from_spec(spec)
|
module = importlib.util.module_from_spec(spec)
|
||||||
spec.loader.exec_module(module)
|
spec.loader.exec_module(module)
|
||||||
return module
|
return module
|
||||||
|
|
||||||
|
|
||||||
def locale_escape(string, errors="replace"):
|
|
||||||
"""
|
|
||||||
Mangle non-supported characters, for savages with ascii terminals.
|
|
||||||
"""
|
|
||||||
encoding = locale.getpreferredencoding()
|
|
||||||
string = string.encode(encoding, errors).decode("utf8")
|
|
||||||
return string
|
|
||||||
|
|
|
||||||
|
|
@ -1,18 +1,26 @@
|
||||||
# coding: utf8
|
# coding: utf8
|
||||||
|
"""
|
||||||
|
spaCy's built in visualization suite for dependencies and named entities.
|
||||||
|
|
||||||
|
DOCS: https://spacy.io/api/top-level#displacy
|
||||||
|
USAGE: https://spacy.io/usage/visualizers
|
||||||
|
"""
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from .render import DependencyRenderer, EntityRenderer
|
from .render import DependencyRenderer, EntityRenderer
|
||||||
from ..tokens import Doc, Span
|
from ..tokens import Doc, Span
|
||||||
from ..compat import b_to_str
|
from ..compat import b_to_str
|
||||||
from ..errors import Errors, Warnings, user_warning
|
from ..errors import Errors, Warnings, user_warning
|
||||||
from ..util import prints, is_in_jupyter
|
from ..util import is_in_jupyter
|
||||||
|
|
||||||
|
|
||||||
_html = {}
|
_html = {}
|
||||||
|
RENDER_WRAPPER = None
|
||||||
|
|
||||||
|
|
||||||
def render(docs, style='dep', page=False, minify=False, jupyter=False,
|
def render(
|
||||||
options={}, manual=False):
|
docs, style="dep", page=False, minify=False, jupyter=False, options={}, manual=False
|
||||||
|
):
|
||||||
"""Render displaCy visualisation.
|
"""Render displaCy visualisation.
|
||||||
|
|
||||||
docs (list or Doc): Document(s) to visualise.
|
docs (list or Doc): Document(s) to visualise.
|
||||||
|
|
@ -23,9 +31,14 @@ def render(docs, style='dep', page=False, minify=False, jupyter=False,
|
||||||
options (dict): Visualiser-specific options, e.g. colors.
|
options (dict): Visualiser-specific options, e.g. colors.
|
||||||
manual (bool): Don't parse `Doc` and instead expect a dict/list of dicts.
|
manual (bool): Don't parse `Doc` and instead expect a dict/list of dicts.
|
||||||
RETURNS (unicode): Rendered HTML markup.
|
RETURNS (unicode): Rendered HTML markup.
|
||||||
|
|
||||||
|
DOCS: https://spacy.io/api/top-level#displacy.render
|
||||||
|
USAGE: https://spacy.io/usage/visualizers
|
||||||
"""
|
"""
|
||||||
factories = {'dep': (DependencyRenderer, parse_deps),
|
factories = {
|
||||||
'ent': (EntityRenderer, parse_ents)}
|
"dep": (DependencyRenderer, parse_deps),
|
||||||
|
"ent": (EntityRenderer, parse_ents),
|
||||||
|
}
|
||||||
if style not in factories:
|
if style not in factories:
|
||||||
raise ValueError(Errors.E087.format(style=style))
|
raise ValueError(Errors.E087.format(style=style))
|
||||||
if isinstance(docs, (Doc, Span, dict)):
|
if isinstance(docs, (Doc, Span, dict)):
|
||||||
|
|
@ -36,16 +49,27 @@ def render(docs, style='dep', page=False, minify=False, jupyter=False,
|
||||||
renderer, converter = factories[style]
|
renderer, converter = factories[style]
|
||||||
renderer = renderer(options=options)
|
renderer = renderer(options=options)
|
||||||
parsed = [converter(doc, options) for doc in docs] if not manual else docs
|
parsed = [converter(doc, options) for doc in docs] if not manual else docs
|
||||||
_html['parsed'] = renderer.render(parsed, page=page, minify=minify).strip()
|
_html["parsed"] = renderer.render(parsed, page=page, minify=minify).strip()
|
||||||
html = _html['parsed']
|
html = _html["parsed"]
|
||||||
|
if RENDER_WRAPPER is not None:
|
||||||
|
html = RENDER_WRAPPER(html)
|
||||||
if jupyter or is_in_jupyter(): # return HTML rendered by IPython display()
|
if jupyter or is_in_jupyter(): # return HTML rendered by IPython display()
|
||||||
from IPython.core.display import display, HTML
|
from IPython.core.display import display, HTML
|
||||||
|
|
||||||
return display(HTML(html))
|
return display(HTML(html))
|
||||||
return html
|
return html
|
||||||
|
|
||||||
|
|
||||||
def serve(docs, style='dep', page=True, minify=False, options={}, manual=False,
|
def serve(
|
||||||
port=5000):
|
docs,
|
||||||
|
style="dep",
|
||||||
|
page=True,
|
||||||
|
minify=False,
|
||||||
|
options={},
|
||||||
|
manual=False,
|
||||||
|
port=5000,
|
||||||
|
host="0.0.0.0",
|
||||||
|
):
|
||||||
"""Serve displaCy visualisation.
|
"""Serve displaCy visualisation.
|
||||||
|
|
||||||
docs (list or Doc): Document(s) to visualise.
|
docs (list or Doc): Document(s) to visualise.
|
||||||
|
|
@ -55,27 +79,33 @@ def serve(docs, style='dep', page=True, minify=False, options={}, manual=False,
|
||||||
options (dict): Visualiser-specific options, e.g. colors.
|
options (dict): Visualiser-specific options, e.g. colors.
|
||||||
manual (bool): Don't parse `Doc` and instead expect a dict/list of dicts.
|
manual (bool): Don't parse `Doc` and instead expect a dict/list of dicts.
|
||||||
port (int): Port to serve visualisation.
|
port (int): Port to serve visualisation.
|
||||||
|
host (unicode): Host to serve visualisation.
|
||||||
|
|
||||||
|
DOCS: https://spacy.io/api/top-level#displacy.serve
|
||||||
|
USAGE: https://spacy.io/usage/visualizers
|
||||||
"""
|
"""
|
||||||
from wsgiref import simple_server
|
from wsgiref import simple_server
|
||||||
render(docs, style=style, page=page, minify=minify, options=options,
|
|
||||||
manual=manual)
|
if is_in_jupyter():
|
||||||
httpd = simple_server.make_server('0.0.0.0', port, app)
|
user_warning(Warnings.W011)
|
||||||
prints("Using the '{}' visualizer".format(style),
|
|
||||||
title="Serving on port {}...".format(port))
|
render(docs, style=style, page=page, minify=minify, options=options, manual=manual)
|
||||||
|
httpd = simple_server.make_server(host, port, app)
|
||||||
|
print("\nUsing the '{}' visualizer".format(style))
|
||||||
|
print("Serving on http://{}:{} ...\n".format(host, port))
|
||||||
try:
|
try:
|
||||||
httpd.serve_forever()
|
httpd.serve_forever()
|
||||||
except KeyboardInterrupt:
|
except KeyboardInterrupt:
|
||||||
prints("Shutting down server on port {}.".format(port))
|
print("Shutting down server on port {}.".format(port))
|
||||||
finally:
|
finally:
|
||||||
httpd.server_close()
|
httpd.server_close()
|
||||||
|
|
||||||
|
|
||||||
def app(environ, start_response):
|
def app(environ, start_response):
|
||||||
# headers and status need to be bytes in Python 2, see #1227
|
# Headers and status need to be bytes in Python 2, see #1227
|
||||||
headers = [(b_to_str(b'Content-type'),
|
headers = [(b_to_str(b"Content-type"), b_to_str(b"text/html; charset=utf-8"))]
|
||||||
b_to_str(b'text/html; charset=utf-8'))]
|
start_response(b_to_str(b"200 OK"), headers)
|
||||||
start_response(b_to_str(b'200 OK'), headers)
|
res = _html["parsed"].encode(encoding="utf-8")
|
||||||
res = _html['parsed'].encode(encoding='utf-8')
|
|
||||||
return [res]
|
return [res]
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -88,11 +118,16 @@ def parse_deps(orig_doc, options={}):
|
||||||
doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes())
|
doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes())
|
||||||
if not doc.is_parsed:
|
if not doc.is_parsed:
|
||||||
user_warning(Warnings.W005)
|
user_warning(Warnings.W005)
|
||||||
if options.get('collapse_phrases', False):
|
if options.get("collapse_phrases", False):
|
||||||
|
with doc.retokenize() as retokenizer:
|
||||||
for np in list(doc.noun_chunks):
|
for np in list(doc.noun_chunks):
|
||||||
np.merge(tag=np.root.tag_, lemma=np.root.lemma_,
|
attrs = {
|
||||||
ent_type=np.root.ent_type_)
|
"tag": np.root.tag_,
|
||||||
if options.get('collapse_punct', True):
|
"lemma": np.root.lemma_,
|
||||||
|
"ent_type": np.root.ent_type_,
|
||||||
|
}
|
||||||
|
retokenizer.merge(np, attrs=attrs)
|
||||||
|
if options.get("collapse_punct", True):
|
||||||
spans = []
|
spans = []
|
||||||
for word in doc[:-1]:
|
for word in doc[:-1]:
|
||||||
if word.is_punct or not word.nbor(1).is_punct:
|
if word.is_punct or not word.nbor(1).is_punct:
|
||||||
|
|
@ -102,23 +137,31 @@ def parse_deps(orig_doc, options={}):
|
||||||
while end < len(doc) and doc[end].is_punct:
|
while end < len(doc) and doc[end].is_punct:
|
||||||
end += 1
|
end += 1
|
||||||
span = doc[start:end]
|
span = doc[start:end]
|
||||||
spans.append((span.start_char, span.end_char, word.tag_,
|
spans.append((span, word.tag_, word.lemma_, word.ent_type_))
|
||||||
word.lemma_, word.ent_type_))
|
with doc.retokenize() as retokenizer:
|
||||||
for start, end, tag, lemma, ent_type in spans:
|
for span, tag, lemma, ent_type in spans:
|
||||||
doc.merge(start, end, tag=tag, lemma=lemma, ent_type=ent_type)
|
attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
|
||||||
if options.get('fine_grained'):
|
retokenizer.merge(span, attrs=attrs)
|
||||||
words = [{'text': w.text, 'tag': w.tag_} for w in doc]
|
if options.get("fine_grained"):
|
||||||
|
words = [{"text": w.text, "tag": w.tag_} for w in doc]
|
||||||
else:
|
else:
|
||||||
words = [{'text': w.text, 'tag': w.pos_} for w in doc]
|
words = [{"text": w.text, "tag": w.pos_} for w in doc]
|
||||||
arcs = []
|
arcs = []
|
||||||
for word in doc:
|
for word in doc:
|
||||||
if word.i < word.head.i:
|
if word.i < word.head.i:
|
||||||
arcs.append({'start': word.i, 'end': word.head.i,
|
arcs.append(
|
||||||
'label': word.dep_, 'dir': 'left'})
|
{"start": word.i, "end": word.head.i, "label": word.dep_, "dir": "left"}
|
||||||
|
)
|
||||||
elif word.i > word.head.i:
|
elif word.i > word.head.i:
|
||||||
arcs.append({'start': word.head.i, 'end': word.i,
|
arcs.append(
|
||||||
'label': word.dep_, 'dir': 'right'})
|
{
|
||||||
return {'words': words, 'arcs': arcs}
|
"start": word.head.i,
|
||||||
|
"end": word.i,
|
||||||
|
"label": word.dep_,
|
||||||
|
"dir": "right",
|
||||||
|
}
|
||||||
|
)
|
||||||
|
return {"words": words, "arcs": arcs, "settings": get_doc_settings(orig_doc)}
|
||||||
|
|
||||||
|
|
||||||
def parse_ents(doc, options={}):
|
def parse_ents(doc, options={}):
|
||||||
|
|
@ -127,10 +170,36 @@ def parse_ents(doc, options={}):
|
||||||
doc (Doc): Document do parse.
|
doc (Doc): Document do parse.
|
||||||
RETURNS (dict): Generated entities keyed by text (original text) and ents.
|
RETURNS (dict): Generated entities keyed by text (original text) and ents.
|
||||||
"""
|
"""
|
||||||
ents = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}
|
ents = [
|
||||||
for ent in doc.ents]
|
{"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
|
||||||
|
for ent in doc.ents
|
||||||
|
]
|
||||||
if not ents:
|
if not ents:
|
||||||
user_warning(Warnings.W006)
|
user_warning(Warnings.W006)
|
||||||
title = (doc.user_data.get('title', None)
|
title = doc.user_data.get("title", None) if hasattr(doc, "user_data") else None
|
||||||
if hasattr(doc, 'user_data') else None)
|
settings = get_doc_settings(doc)
|
||||||
return {'text': doc.text, 'ents': ents, 'title': title}
|
return {"text": doc.text, "ents": ents, "title": title, "settings": settings}
|
||||||
|
|
||||||
|
|
||||||
|
def set_render_wrapper(func):
|
||||||
|
"""Set an optional wrapper function that is called around the generated
|
||||||
|
HTML markup on displacy.render. This can be used to allow integration into
|
||||||
|
other platforms, similar to Jupyter Notebooks that require functions to be
|
||||||
|
called around the HTML. It can also be used to implement custom callbacks
|
||||||
|
on render, or to embed the visualization in a custom page.
|
||||||
|
|
||||||
|
func (callable): Function to call around markup before rendering it. Needs
|
||||||
|
to take one argument, the HTML markup, and should return the desired
|
||||||
|
output of displacy.render.
|
||||||
|
"""
|
||||||
|
global RENDER_WRAPPER
|
||||||
|
if not hasattr(func, "__call__"):
|
||||||
|
raise ValueError(Errors.E110.format(obj=type(func)))
|
||||||
|
RENDER_WRAPPER = func
|
||||||
|
|
||||||
|
|
||||||
|
def get_doc_settings(doc):
|
||||||
|
return {
|
||||||
|
"lang": doc.lang_,
|
||||||
|
"direction": doc.vocab.writing_system.get("direction", "ltr"),
|
||||||
|
}
|
||||||
|
|
|
||||||
|
|
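
Editor's note: a hedged usage sketch of the hooks added above (not part of this diff). A wrapper registered via set_render_wrapper() receives the rendered markup and returns whatever displacy.render() should hand back, and serve() now also accepts a host argument:

    from spacy import displacy

    def wrap(html):
        # e.g. embed the markup in a custom page or dashboard
        return "<div class='parse'>" + html + "</div>"

    displacy.set_render_wrapper(wrap)
    # displacy.serve(doc, style="dep", host="localhost", port=5000)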
@ -3,14 +3,18 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
import uuid
|
import uuid
|
||||||
|
|
||||||
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS
|
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
|
||||||
from .templates import TPL_ENT, TPL_ENTS, TPL_FIGURE, TPL_TITLE, TPL_PAGE
|
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
|
||||||
from ..util import minify_html, escape_html
|
from ..util import minify_html, escape_html
|
||||||
|
|
||||||
|
DEFAULT_LANG = "en"
|
||||||
|
DEFAULT_DIR = "ltr"
|
||||||
|
|
||||||
|
|
||||||
class DependencyRenderer(object):
|
class DependencyRenderer(object):
|
||||||
"""Render dependency parses as SVGs."""
|
"""Render dependency parses as SVGs."""
|
||||||
style = 'dep'
|
|
||||||
|
style = "dep"
|
||||||
|
|
||||||
def __init__(self, options={}):
|
def __init__(self, options={}):
|
||||||
"""Initialise dependency renderer.
|
"""Initialise dependency renderer.
|
||||||
|
|
@ -19,18 +23,18 @@ class DependencyRenderer(object):
|
||||||
arrow_spacing, arrow_width, arrow_stroke, distance, offset_x,
|
arrow_spacing, arrow_width, arrow_stroke, distance, offset_x,
|
||||||
color, bg, font)
|
color, bg, font)
|
||||||
"""
|
"""
|
||||||
self.compact = options.get('compact', False)
|
self.compact = options.get("compact", False)
|
||||||
self.word_spacing = options.get('word_spacing', 45)
|
self.word_spacing = options.get("word_spacing", 45)
|
||||||
self.arrow_spacing = options.get('arrow_spacing',
|
self.arrow_spacing = options.get("arrow_spacing", 12 if self.compact else 20)
|
||||||
12 if self.compact else 20)
|
self.arrow_width = options.get("arrow_width", 6 if self.compact else 10)
|
||||||
self.arrow_width = options.get('arrow_width',
|
self.arrow_stroke = options.get("arrow_stroke", 2)
|
||||||
6 if self.compact else 10)
|
self.distance = options.get("distance", 150 if self.compact else 175)
|
||||||
self.arrow_stroke = options.get('arrow_stroke', 2)
|
self.offset_x = options.get("offset_x", 50)
|
||||||
self.distance = options.get('distance', 150 if self.compact else 175)
|
self.color = options.get("color", "#000000")
|
||||||
self.offset_x = options.get('offset_x', 50)
|
self.bg = options.get("bg", "#ffffff")
|
||||||
self.color = options.get('color', '#000000')
|
self.font = options.get("font", "Arial")
|
||||||
self.bg = options.get('bg', '#ffffff')
|
self.direction = DEFAULT_DIR
|
||||||
self.font = options.get('font', 'Arial')
|
self.lang = DEFAULT_LANG
|
||||||
|
|
||||||
def render(self, parsed, page=False, minify=False):
|
def render(self, parsed, page=False, minify=False):
|
||||||
"""Render complete markup.
|
"""Render complete markup.
|
||||||
|
|
@ -43,14 +47,21 @@ class DependencyRenderer(object):
|
||||||
# Create a random ID prefix to make sure parses don't receive the
|
# Create a random ID prefix to make sure parses don't receive the
|
||||||
# same ID, even if they're identical
|
# same ID, even if they're identical
|
||||||
id_prefix = uuid.uuid4().hex
|
id_prefix = uuid.uuid4().hex
|
||||||
rendered = [self.render_svg('{}-{}'.format(id_prefix, i), p['words'], p['arcs'])
|
rendered = []
|
||||||
for i, p in enumerate(parsed)]
|
for i, p in enumerate(parsed):
|
||||||
|
if i == 0:
|
||||||
|
self.direction = p["settings"].get("direction", DEFAULT_DIR)
|
||||||
|
self.lang = p["settings"].get("lang", DEFAULT_LANG)
|
||||||
|
render_id = "{}-{}".format(id_prefix, i)
|
||||||
|
svg = self.render_svg(render_id, p["words"], p["arcs"])
|
||||||
|
rendered.append(svg)
|
||||||
if page:
|
if page:
|
||||||
content = ''.join([TPL_FIGURE.format(content=svg)
|
content = "".join([TPL_FIGURE.format(content=svg) for svg in rendered])
|
||||||
for svg in rendered])
|
markup = TPL_PAGE.format(
|
||||||
markup = TPL_PAGE.format(content=content)
|
content=content, lang=self.lang, dir=self.direction
|
||||||
|
)
|
||||||
else:
|
else:
|
||||||
markup = ''.join(rendered)
|
markup = "".join(rendered)
|
||||||
if minify:
|
if minify:
|
||||||
return minify_html(markup)
|
return minify_html(markup)
|
||||||
return markup
|
return markup
|
||||||
|
|
@ -69,15 +80,23 @@ class DependencyRenderer(object):
|
||||||
self.width = self.offset_x + len(words) * self.distance
|
self.width = self.offset_x + len(words) * self.distance
|
||||||
self.height = self.offset_y + 3 * self.word_spacing
|
self.height = self.offset_y + 3 * self.word_spacing
|
||||||
self.id = render_id
|
self.id = render_id
|
||||||
words = [self.render_word(w['text'], w['tag'], i)
|
words = [self.render_word(w["text"], w["tag"], i) for i, w in enumerate(words)]
|
||||||
for i, w in enumerate(words)]
|
arcs = [
|
||||||
arcs = [self.render_arrow(a['label'], a['start'],
|
self.render_arrow(a["label"], a["start"], a["end"], a["dir"], i)
|
||||||
a['end'], a['dir'], i)
|
for i, a in enumerate(arcs)
|
||||||
for i, a in enumerate(arcs)]
|
]
|
||||||
content = ''.join(words) + ''.join(arcs)
|
content = "".join(words) + "".join(arcs)
|
||||||
return TPL_DEP_SVG.format(id=self.id, width=self.width,
|
return TPL_DEP_SVG.format(
|
||||||
height=self.height, color=self.color,
|
id=self.id,
|
||||||
bg=self.bg, font=self.font, content=content)
|
width=self.width,
|
||||||
|
height=self.height,
|
||||||
|
color=self.color,
|
||||||
|
bg=self.bg,
|
||||||
|
font=self.font,
|
||||||
|
content=content,
|
||||||
|
dir=self.direction,
|
||||||
|
lang=self.lang,
|
||||||
|
)
|
||||||
|
|
||||||
def render_word(self, text, tag, i):
|
def render_word(self, text, tag, i):
|
||||||
"""Render individual word.
|
"""Render individual word.
|
||||||
|
|
@ -89,12 +108,13 @@ class DependencyRenderer(object):
|
||||||
"""
|
"""
|
||||||
y = self.offset_y + self.word_spacing
|
y = self.offset_y + self.word_spacing
|
||||||
x = self.offset_x + i * self.distance
|
x = self.offset_x + i * self.distance
|
||||||
|
if self.direction == "rtl":
|
||||||
|
x = self.width - x
|
||||||
html_text = escape_html(text)
|
html_text = escape_html(text)
|
||||||
return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y)
|
return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y)
|
||||||
|
|
||||||
|
|
||||||
def render_arrow(self, label, start, end, direction, i):
|
def render_arrow(self, label, start, end, direction, i):
|
||||||
"""Render indivicual arrow.
|
"""Render individual arrow.
|
||||||
|
|
||||||
label (unicode): Dependency label.
|
label (unicode): Dependency label.
|
||||||
start (int): Index of start word.
|
start (int): Index of start word.
|
||||||
|
|
@ -105,9 +125,17 @@ class DependencyRenderer(object):
|
||||||
"""
|
"""
|
||||||
level = self.levels.index(end - start) + 1
|
level = self.levels.index(end - start) + 1
|
||||||
x_start = self.offset_x + start * self.distance + self.arrow_spacing
|
x_start = self.offset_x + start * self.distance + self.arrow_spacing
|
||||||
|
if self.direction == "rtl":
|
||||||
|
x_start = self.width - x_start
|
||||||
y = self.offset_y
|
y = self.offset_y
|
||||||
x_end = (self.offset_x+(end-start)*self.distance+start*self.distance
|
x_end = (
|
||||||
- self.arrow_spacing*(self.highest_level-level)/4)
|
self.offset_x
|
||||||
|
+ (end - start) * self.distance
|
||||||
|
+ start * self.distance
|
||||||
|
- self.arrow_spacing * (self.highest_level - level) / 4
|
||||||
|
)
|
||||||
|
if self.direction == "rtl":
|
||||||
|
x_end = self.width - x_end
|
||||||
y_curve = self.offset_y - level * self.distance / 2
|
y_curve = self.offset_y - level * self.distance / 2
|
||||||
if self.compact:
|
if self.compact:
|
||||||
y_curve = self.offset_y - level * self.distance / 6
|
y_curve = self.offset_y - level * self.distance / 6
|
||||||
|
|
@ -115,8 +143,16 @@ class DependencyRenderer(object):
|
||||||
y_curve = -self.distance
|
y_curve = -self.distance
|
||||||
arrowhead = self.get_arrowhead(direction, x_start, y, x_end)
|
arrowhead = self.get_arrowhead(direction, x_start, y, x_end)
|
||||||
arc = self.get_arc(x_start, y, y_curve, x_end)
|
arc = self.get_arc(x_start, y, y_curve, x_end)
|
||||||
return TPL_DEP_ARCS.format(id=self.id, i=i, stroke=self.arrow_stroke,
|
label_side = "right" if self.direction == "rtl" else "left"
|
||||||
head=arrowhead, label=label, arc=arc)
|
return TPL_DEP_ARCS.format(
|
||||||
|
id=self.id,
|
||||||
|
i=i,
|
||||||
|
stroke=self.arrow_stroke,
|
||||||
|
head=arrowhead,
|
||||||
|
label=label,
|
||||||
|
label_side=label_side,
|
||||||
|
arc=arc,
|
||||||
|
)
|
||||||
|
|
||||||
def get_arc(self, x_start, y, y_curve, x_end):
|
def get_arc(self, x_start, y, y_curve, x_end):
|
||||||
"""Render individual arc.
|
"""Render individual arc.
|
||||||
|
|
@ -141,13 +177,22 @@ class DependencyRenderer(object):
|
||||||
end (int): X-coordinate of arrow end point.
|
end (int): X-coordinate of arrow end point.
|
||||||
RETURNS (unicode): Definition of the arrow head path ('d' attribute).
|
RETURNS (unicode): Definition of the arrow head path ('d' attribute).
|
||||||
"""
|
"""
|
||||||
if direction == 'left':
|
if direction == "left":
|
||||||
pos1, pos2, pos3 = (x, x - self.arrow_width + 2, x + self.arrow_width - 2)
|
pos1, pos2, pos3 = (x, x - self.arrow_width + 2, x + self.arrow_width - 2)
|
||||||
else:
|
else:
|
||||||
pos1, pos2, pos3 = (end, end+self.arrow_width-2,
|
pos1, pos2, pos3 = (
|
||||||
end-self.arrow_width+2)
|
end,
|
||||||
arrowhead = (pos1, y+2, pos2, y-self.arrow_width, pos3,
|
end + self.arrow_width - 2,
|
||||||
y-self.arrow_width)
|
end - self.arrow_width + 2,
|
||||||
|
)
|
||||||
|
arrowhead = (
|
||||||
|
pos1,
|
||||||
|
y + 2,
|
||||||
|
pos2,
|
||||||
|
y - self.arrow_width,
|
||||||
|
pos3,
|
||||||
|
y - self.arrow_width,
|
||||||
|
)
|
||||||
return "M{},{} L{},{} {},{}".format(*arrowhead)
|
return "M{},{} L{},{} {},{}".format(*arrowhead)
|
||||||
|
|
||||||
def get_levels(self, arcs):
|
def get_levels(self, arcs):
|
||||||
|
|
@ -157,30 +202,46 @@ class DependencyRenderer(object):
|
||||||
args (list): Individual arcs and their start, end, direction and label.
|
args (list): Individual arcs and their start, end, direction and label.
|
||||||
RETURNS (list): Arc levels sorted from lowest to highest.
|
RETURNS (list): Arc levels sorted from lowest to highest.
|
||||||
"""
|
"""
|
||||||
levels = set(map(lambda arc: arc['end'] - arc['start'], arcs))
|
levels = set(map(lambda arc: arc["end"] - arc["start"], arcs))
|
||||||
return sorted(list(levels))
|
return sorted(list(levels))
|
||||||
|
|
||||||
|
|
||||||
class EntityRenderer(object):
|
class EntityRenderer(object):
|
||||||
"""Render named entities as HTML."""
|
"""Render named entities as HTML."""
|
||||||
style = 'ent'
|
|
||||||
|
style = "ent"
|
||||||
|
|
||||||
def __init__(self, options={}):
|
def __init__(self, options={}):
|
||||||
"""Initialise dependency renderer.
|
"""Initialise dependency renderer.
|
||||||
|
|
||||||
options (dict): Visualiser-specific options (colors, ents)
|
options (dict): Visualiser-specific options (colors, ents)
|
||||||
"""
|
"""
|
||||||
colors = {'ORG': '#7aecec', 'PRODUCT': '#bfeeb7', 'GPE': '#feca74',
|
colors = {
|
||||||
'LOC': '#ff9561', 'PERSON': '#aa9cfc', 'NORP': '#c887fb',
|
"ORG": "#7aecec",
|
||||||
'FACILITY': '#9cc9cc', 'EVENT': '#ffeb80', 'LAW': '#ff8197',
|
"PRODUCT": "#bfeeb7",
|
||||||
'LANGUAGE': '#ff8197', 'WORK_OF_ART': '#f0d0ff',
|
"GPE": "#feca74",
|
||||||
'DATE': '#bfe1d9', 'TIME': '#bfe1d9', 'MONEY': '#e4e7d2',
|
"LOC": "#ff9561",
|
||||||
'QUANTITY': '#e4e7d2', 'ORDINAL': '#e4e7d2',
|
"PERSON": "#aa9cfc",
|
||||||
'CARDINAL': '#e4e7d2', 'PERCENT': '#e4e7d2'}
|
"NORP": "#c887fb",
|
||||||
colors.update(options.get('colors', {}))
|
"FACILITY": "#9cc9cc",
|
||||||
self.default_color = '#ddd'
|
"EVENT": "#ffeb80",
|
||||||
|
"LAW": "#ff8197",
|
||||||
|
"LANGUAGE": "#ff8197",
|
||||||
|
"WORK_OF_ART": "#f0d0ff",
|
||||||
|
"DATE": "#bfe1d9",
|
||||||
|
"TIME": "#bfe1d9",
|
||||||
|
"MONEY": "#e4e7d2",
|
||||||
|
"QUANTITY": "#e4e7d2",
|
||||||
|
"ORDINAL": "#e4e7d2",
|
||||||
|
"CARDINAL": "#e4e7d2",
|
||||||
|
"PERCENT": "#e4e7d2",
|
||||||
|
}
|
||||||
|
colors.update(options.get("colors", {}))
|
||||||
|
self.default_color = "#ddd"
|
||||||
self.colors = colors
|
self.colors = colors
|
||||||
self.ents = options.get('ents', None)
|
self.ents = options.get("ents", None)
|
||||||
|
self.direction = DEFAULT_DIR
|
||||||
|
self.lang = DEFAULT_LANG
|
||||||
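The options handled in this constructor are the same ones that can be passed to displacy.render for the "ent" style; a hedged usage sketch (the model name is an assumption and has to be installed separately):

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # assumed model, install it first
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
options = {"ents": ["ORG", "GPE"], "colors": {"ORG": "#ffd700"}}
html = displacy.render(doc, style="ent", options=options)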
|
|
||||||
def render(self, parsed, page=False, minify=False):
|
def render(self, parsed, page=False, minify=False):
|
||||||
"""Render complete markup.
|
"""Render complete markup.
|
||||||
|
|
@ -190,14 +251,17 @@ class EntityRenderer(object):
|
||||||
minify (bool): Minify HTML markup.
|
minify (bool): Minify HTML markup.
|
||||||
RETURNS (unicode): Rendered HTML markup.
|
RETURNS (unicode): Rendered HTML markup.
|
||||||
"""
|
"""
|
||||||
rendered = [self.render_ents(p['text'], p['ents'],
|
rendered = []
|
||||||
p.get('title', None)) for p in parsed]
|
for i, p in enumerate(parsed):
|
||||||
|
if i == 0:
|
||||||
|
self.direction = p["settings"].get("direction", DEFAULT_DIR)
|
||||||
|
self.lang = p["settings"].get("lang", DEFAULT_LANG)
|
||||||
|
rendered.append(self.render_ents(p["text"], p["ents"], p["title"]))
|
||||||
if page:
|
if page:
|
||||||
docs = ''.join([TPL_FIGURE.format(content=doc)
|
docs = "".join([TPL_FIGURE.format(content=doc) for doc in rendered])
|
||||||
for doc in rendered])
|
markup = TPL_PAGE.format(content=docs, lang=self.lang, dir=self.direction)
|
||||||
markup = TPL_PAGE.format(content=docs)
|
|
||||||
else:
|
else:
|
||||||
markup = ''.join(rendered)
|
markup = "".join(rendered)
|
||||||
if minify:
|
if minify:
|
||||||
return minify_html(markup)
|
return minify_html(markup)
|
||||||
return markup
|
return markup
|
||||||
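The parsed argument consumed here is a list of dicts in displaCy's "manual" format; a minimal example matching the keys this method reads (text, ents, title, settings):

parsed = [{
    "text": "But Google is starting from behind.",
    "ents": [{"start": 4, "end": 10, "label": "ORG"}],
    "title": None,
    "settings": {"lang": "en", "direction": "ltr"},
}]
# html = EntityRenderer().render(parsed, page=True)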
|
|
@ -209,26 +273,30 @@ class EntityRenderer(object):
|
||||||
spans (list): Individual entity spans and their start, end and label.
|
spans (list): Individual entity spans and their start, end and label.
|
||||||
title (unicode or None): Document title set in Doc.user_data['title'].
|
title (unicode or None): Document title set in Doc.user_data['title'].
|
||||||
"""
|
"""
|
||||||
markup = ''
|
markup = ""
|
||||||
offset = 0
|
offset = 0
|
||||||
for span in spans:
|
for span in spans:
|
||||||
label = span['label']
|
label = span["label"]
|
||||||
start = span['start']
|
start = span["start"]
|
||||||
end = span['end']
|
end = span["end"]
|
||||||
entity = text[start:end]
|
entity = escape_html(text[start:end])
|
||||||
fragments = text[offset:start].split('\n')
|
fragments = text[offset:start].split("\n")
|
||||||
for i, fragment in enumerate(fragments):
|
for i, fragment in enumerate(fragments):
|
||||||
markup += fragment
|
markup += escape_html(fragment)
|
||||||
if len(fragments) > 1 and i != len(fragments) - 1:
|
if len(fragments) > 1 and i != len(fragments) - 1:
|
||||||
markup += '</br>'
|
markup += "</br>"
|
||||||
if self.ents is None or label.upper() in self.ents:
|
if self.ents is None or label.upper() in self.ents:
|
||||||
color = self.colors.get(label.upper(), self.default_color)
|
color = self.colors.get(label.upper(), self.default_color)
|
||||||
markup += TPL_ENT.format(label=label, text=entity, bg=color)
|
ent_settings = {"label": label, "text": entity, "bg": color}
|
||||||
|
if self.direction == "rtl":
|
||||||
|
markup += TPL_ENT_RTL.format(**ent_settings)
|
||||||
|
else:
|
||||||
|
markup += TPL_ENT.format(**ent_settings)
|
||||||
else:
|
else:
|
||||||
markup += entity
|
markup += entity
|
||||||
offset = end
|
offset = end
|
||||||
markup += text[offset:]
|
markup += escape_html(text[offset:])
|
||||||
markup = TPL_ENTS.format(content=markup, colors=self.colors)
|
markup = TPL_ENTS.format(content=markup, dir=self.direction)
|
||||||
if title:
|
if title:
|
||||||
markup = TPL_TITLE.format(title=title) + markup
|
markup = TPL_TITLE.format(title=title) + markup
|
||||||
return markup
|
return markup
|
||||||
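The escape_html calls added above keep entity text from being interpreted as markup; a rough sketch of the effect, assuming the helper behaves like the standard library's html.escape (the import below is a stand-in, not the helper used in this module):

from html import escape as escape_html

text = "Send <b>$100</b> to Acme Corp"
print(escape_html(text))  # -> Send &lt;b&gt;$100&lt;/b&gt; to Acme Corp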
@ -2,11 +2,11 @@
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
# setting explicit height and max-width: none on the SVG is required for
|
# Setting explicit height and max-width: none on the SVG is required for
|
||||||
# Jupyter to render it properly in a cell
|
# Jupyter to render it properly in a cell
|
||||||
|
|
||||||
TPL_DEP_SVG = """
|
TPL_DEP_SVG = """
|
||||||
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" id="{id}" class="displacy" width="{width}" height="{height}" style="max-width: none; height: {height}px; color: {color}; background: {bg}; font-family: {font}">{content}</svg>
|
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="{lang}" id="{id}" class="displacy" width="{width}" height="{height}" direction="{dir}" style="max-width: none; height: {height}px; color: {color}; background: {bg}; font-family: {font}; direction: {dir}">{content}</svg>
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -22,7 +22,7 @@ TPL_DEP_ARCS = """
|
||||||
<g class="displacy-arrow">
|
<g class="displacy-arrow">
|
||||||
<path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>
|
<path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>
|
||||||
<text dy="1.25em" style="font-size: 0.8em; letter-spacing: 1px">
|
<text dy="1.25em" style="font-size: 0.8em; letter-spacing: 1px">
|
||||||
<textPath xlink:href="#arrow-{id}-{i}" class="displacy-label" startOffset="50%" fill="currentColor" text-anchor="middle">{label}</textPath>
|
<textPath xlink:href="#arrow-{id}-{i}" class="displacy-label" startOffset="50%" side="{label_side}" fill="currentColor" text-anchor="middle">{label}</textPath>
|
||||||
</text>
|
</text>
|
||||||
<path class="displacy-arrowhead" d="{head}" fill="currentColor"/>
|
<path class="displacy-arrowhead" d="{head}" fill="currentColor"/>
|
||||||
</g>
|
</g>
|
||||||
|
|
@ -39,7 +39,7 @@ TPL_TITLE = """
|
||||||
|
|
||||||
|
|
||||||
TPL_ENTS = """
|
TPL_ENTS = """
|
||||||
<div class="entities" style="line-height: 2.5">{content}</div>
|
<div class="entities" style="line-height: 2.5; direction: {dir}">{content}</div>
|
||||||
"""
|
"""
|
||||||
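To see how the new {dir} placeholder flows through, here is the template above filled in by hand (the content string is illustrative):

TPL_ENTS = """
<div class="entities" style="line-height: 2.5; direction: {dir}">{content}</div>
"""
print(TPL_ENTS.format(content="<mark>Google</mark> is an ORG", dir="rtl"))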
|
|
||||||
|
|
||||||
|
|
@ -50,14 +50,21 @@ TPL_ENT = """
|
||||||
</mark>
|
</mark>
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
TPL_ENT_RTL = """
|
||||||
|
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
|
||||||
|
{text}
|
||||||
|
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-right: 0.5rem">{label}</span>
|
||||||
|
</mark>
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
TPL_PAGE = """
|
TPL_PAGE = """
|
||||||
<!DOCTYPE html>
|
<!DOCTYPE html>
|
||||||
<html>
|
<html lang="{lang}">
|
||||||
<head>
|
<head>
|
||||||
<title>displaCy</title>
|
<title>displaCy</title>
|
||||||
</head>
|
</head>
|
||||||
|
|
||||||
<body style="font-size: 16px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; padding: 4rem 2rem;">{content}</body>
|
<body style="font-size: 16px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; padding: 4rem 2rem; direction: {dir}">{content}</body>
|
||||||
</html>
|
</html>
|
||||||
"""
|
"""
|
||||||
186
spacy/errors.py
@ -8,13 +8,17 @@ import inspect
|
||||||
|
|
||||||
def add_codes(err_cls):
|
def add_codes(err_cls):
|
||||||
"""Add error codes to string messages via class attribute names."""
|
"""Add error codes to string messages via class attribute names."""
|
||||||
|
|
||||||
class ErrorsWithCodes(object):
|
class ErrorsWithCodes(object):
|
||||||
def __getattribute__(self, code):
|
def __getattribute__(self, code):
|
||||||
msg = getattr(err_cls, code)
|
msg = getattr(err_cls, code)
|
||||||
return '[{code}] {msg}'.format(code=code, msg=msg)
|
return "[{code}] {msg}".format(code=code, msg=msg)
|
||||||
|
|
||||||
return ErrorsWithCodes()
|
return ErrorsWithCodes()
|
||||||
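A self-contained sketch of what the decorator produces (the Demo class and its message are made up for illustration): attribute access returns the message with its code prepended, ready for str.format.

def add_codes(err_cls):
    class ErrorsWithCodes(object):
        def __getattribute__(self, code):
            msg = getattr(err_cls, code)
            return "[{code}] {msg}".format(code=code, msg=msg)
    return ErrorsWithCodes()

@add_codes
class Demo(object):
    E001 = "Couldn't find model '{name}'."

print(Demo.E001.format(name="en_core_web_sm"))
# -> [E001] Couldn't find model 'en_core_web_sm'.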
|
|
||||||
|
|
||||||
|
# fmt: off
|
||||||
|
|
||||||
@add_codes
|
@add_codes
|
||||||
class Warnings(object):
|
class Warnings(object):
|
||||||
W001 = ("As of spaCy v2.0, the keyword argument `path=` is deprecated. "
|
W001 = ("As of spaCy v2.0, the keyword argument `path=` is deprecated. "
|
||||||
|
|
@ -38,6 +42,44 @@ class Warnings(object):
|
||||||
"surprising to you, make sure the Doc was processed using a model "
|
"surprising to you, make sure the Doc was processed using a model "
|
||||||
"that supports named entity recognition, and check the `doc.ents` "
|
"that supports named entity recognition, and check the `doc.ents` "
|
||||||
"property manually if necessary.")
|
"property manually if necessary.")
|
||||||
|
W007 = ("The model you're using has no word vectors loaded, so the result "
|
||||||
|
"of the {obj}.similarity method will be based on the tagger, "
|
||||||
|
"parser and NER, which may not give useful similarity judgements. "
|
||||||
|
"This may happen if you're using one of the small models, e.g. "
|
||||||
|
"`en_core_web_sm`, which don't ship with word vectors and only "
|
||||||
|
"use context-sensitive tensors. You can always add your own word "
|
||||||
|
"vectors, or use one of the larger models instead if available.")
|
||||||
|
W008 = ("Evaluating {obj}.similarity based on empty vectors.")
|
||||||
|
W009 = ("Custom factory '{name}' provided by entry points of another "
|
||||||
|
"package overwrites built-in factory.")
|
||||||
|
W010 = ("As of v2.1.0, the PhraseMatcher doesn't have a phrase length "
|
||||||
|
"limit anymore, so the max_length argument is now deprecated.")
|
||||||
|
W011 = ("It looks like you're calling displacy.serve from within a "
|
||||||
|
"Jupyter notebook or a similar environment. This likely means "
|
||||||
|
"you're already running a local web server, so there's no need to "
|
||||||
|
"make displaCy start another one. Instead, you should be able to "
|
||||||
|
"replace displacy.serve with displacy.render to show the "
|
||||||
|
"visualization.")
|
||||||
|
W012 = ("A Doc object you're adding to the PhraseMatcher for pattern "
|
||||||
|
"'{key}' is parsed and/or tagged, but to match on '{attr}', you "
|
||||||
|
"don't actually need this information. This means that creating "
|
||||||
|
"the patterns is potentially much slower, because all pipeline "
|
||||||
|
"components are applied. To only create tokenized Doc objects, "
|
||||||
|
"try using `nlp.make_doc(text)` or process all texts as a stream "
|
||||||
|
"using `list(nlp.tokenizer.pipe(all_texts))`.")
|
||||||
|
W013 = ("As of v2.1.0, {obj}.merge is deprecated. Please use the more "
|
||||||
|
"efficient and less error-prone Doc.retokenize context manager "
|
||||||
|
"instead.")
|
||||||
|
W014 = ("As of v2.1.0, the `disable` keyword argument on the serialization "
|
||||||
|
"methods is and should be replaced with `exclude`. This makes it "
|
||||||
|
"consistent with the other objects serializable.")
|
||||||
|
W015 = ("As of v2.1.0, the use of keyword arguments to exclude fields from "
|
||||||
|
"being serialized or deserialized is deprecated. Please use the "
|
||||||
|
"`exclude` argument instead. For example: exclude=['{arg}'].")
|
||||||
|
W016 = ("The keyword argument `n_threads` on the is now deprecated, as "
|
||||||
|
"the v2.x models cannot release the global interpreter lock. "
|
||||||
|
"Future versions may introduce a `n_process` argument for "
|
||||||
|
"parallel inference via multiprocessing.")
|
||||||
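A hedged sketch of how these warning strings are typically raised from elsewhere in the code base (the call site below is illustrative, not part of this diff):

from spacy.errors import Warnings, user_warning

user_warning(Warnings.W008.format(obj="Doc"))
# UserWarning: [W008] Evaluating Doc.similarity based on empty vectors.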
|
|
||||||
|
|
||||||
@add_codes
|
@add_codes
|
||||||
|
|
@ -148,7 +190,7 @@ class Errors(object):
|
||||||
"you forget to call the `set_extension` method?")
|
"you forget to call the `set_extension` method?")
|
||||||
E047 = ("Can't assign a value to unregistered extension attribute "
|
E047 = ("Can't assign a value to unregistered extension attribute "
|
||||||
"'{name}'. Did you forget to call the `set_extension` method?")
|
"'{name}'. Did you forget to call the `set_extension` method?")
|
||||||
E048 = ("Can't import language {lang} from spacy.lang.")
|
E048 = ("Can't import language {lang} from spacy.lang: {err}")
|
||||||
E049 = ("Can't find spaCy data directory: '{path}'. Check your "
|
E049 = ("Can't find spaCy data directory: '{path}'. Check your "
|
||||||
"installation and permissions, or use spacy.util.set_data_path "
|
"installation and permissions, or use spacy.util.set_data_path "
|
||||||
"to customise the location if necessary.")
|
"to customise the location if necessary.")
|
||||||
|
|
@ -249,23 +291,88 @@ class Errors(object):
|
||||||
"error. Are you writing to a default function argument?")
|
"error. Are you writing to a default function argument?")
|
||||||
E096 = ("Invalid object passed to displaCy: Can only visualize Doc or "
|
E096 = ("Invalid object passed to displaCy: Can only visualize Doc or "
|
||||||
"Span objects, or dicts if set to manual=True.")
|
"Span objects, or dicts if set to manual=True.")
|
||||||
E097 = ("Can't merge non-disjoint spans. '{token}' is already part of tokens to merge")
|
E097 = ("Invalid pattern: expected token pattern (list of dicts) or "
|
||||||
E098 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A token"
|
"phrase pattern (string) but got:\n{pattern}")
|
||||||
|
E098 = ("Invalid pattern specified: expected both SPEC and PATTERN.")
|
||||||
|
E099 = ("First node of pattern should be a root node. The root should "
|
||||||
|
"only contain NODE_NAME.")
|
||||||
|
E100 = ("Nodes apart from the root should contain NODE_NAME, NBOR_NAME and "
|
||||||
|
"NBOR_RELOP.")
|
||||||
|
E101 = ("NODE_NAME should be a new node and NBOR_NAME should already have "
|
||||||
|
"have been declared in previous edges.")
|
||||||
|
E102 = ("Can't merge non-disjoint spans. '{token}' is already part of "
|
||||||
|
"tokens to merge.")
|
||||||
|
E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A token"
|
||||||
" can only be part of one entity, so make sure the entities you're "
|
" can only be part of one entity, so make sure the entities you're "
|
||||||
"setting don't overlap.")
|
"setting don't overlap.")
|
||||||
E099 = ("The newly split token can only have one root (head = 0).")
|
E104 = ("Can't find JSON schema for '{name}'.")
|
||||||
E100 = ("The newly split token needs to have a root (head = 0)")
|
E105 = ("The Doc.print_tree() method is now deprecated. Please use "
|
||||||
E101 = ("All subtokens must have associated heads")
|
"Doc.to_json() instead or write your own function.")
|
||||||
|
E106 = ("Can't find doc._.{attr} attribute specified in the underscore "
|
||||||
|
"settings: {opts}")
|
||||||
|
E107 = ("Value of doc._.{attr} is not JSON-serializable: {value}")
|
||||||
|
E108 = ("As of spaCy v2.1, the pipe name `sbd` has been deprecated "
|
||||||
|
"in favor of the pipe name `sentencizer`, which does the same "
|
||||||
|
"thing. For example, use `nlp.create_pipeline('sentencizer')`")
|
||||||
|
E109 = ("Model for component '{name}' not initialized. Did you forget to load "
|
||||||
|
"a model, or forget to call begin_training()?")
|
||||||
|
E110 = ("Invalid displaCy render wrapper. Expected callable, got: {obj}")
|
||||||
|
E111 = ("Pickling a token is not supported, because tokens are only views "
|
||||||
|
"of the parent Doc and can't exist on their own. A pickled token "
|
||||||
|
"would always have to include its Doc and Vocab, which has "
|
||||||
|
"practically no advantage over pickling the parent Doc directly. "
|
||||||
|
"So instead of pickling the token, pickle the Doc it belongs to.")
|
||||||
|
E112 = ("Pickling a span is not supported, because spans are only views "
|
||||||
|
"of the parent Doc and can't exist on their own. A pickled span "
|
||||||
|
"would always have to include its Doc and Vocab, which has "
|
||||||
|
"practically no advantage over pickling the parent Doc directly. "
|
||||||
|
"So instead of pickling the span, pickle the Doc it belongs to or "
|
||||||
|
"use Span.as_doc to convert the span to a standalone Doc object.")
|
||||||
|
E113 = ("The newly split token can only have one root (head = 0).")
|
||||||
|
E114 = ("The newly split token needs to have a root (head = 0).")
|
||||||
|
E115 = ("All subtokens must have associated heads.")
|
||||||
|
E116 = ("Cannot currently add labels to pre-trained text classifier. Add "
|
||||||
|
"labels before training begins. This functionality was available "
|
||||||
|
"in previous versions, but had significant bugs that led to poor "
|
||||||
|
"performance.")
|
||||||
|
E117 = ("The newly split tokens must match the text of the original token. "
|
||||||
|
"New orths: {new}. Old text: {old}.")
|
||||||
|
E118 = ("The custom extension attribute '{attr}' is not registered on the "
|
||||||
|
"Token object so it can't be set during retokenization. To "
|
||||||
|
"register an attribute, use the Token.set_extension classmethod.")
|
||||||
|
E119 = ("Can't set custom extension attribute '{attr}' during retokenization "
|
||||||
|
"because it's not writable. This usually means it was registered "
|
||||||
|
"with a getter function (and no setter) or as a method extension, "
|
||||||
|
"so the value is computed dynamically. To overwrite a custom "
|
||||||
|
"attribute manually, it should be registered with a default value "
|
||||||
|
"or with a getter AND setter.")
|
||||||
|
E120 = ("Can't set custom extension attributes during retokenization. "
|
||||||
|
"Expected dict mapping attribute names to values, but got: {value}")
|
||||||
|
E121 = ("Can't bulk merge spans. Attribute length {attr_len} should be "
|
||||||
|
"equal to span length ({span_len}).")
|
||||||
|
E122 = ("Cannot find token to be split. Did it get merged?")
|
||||||
|
E123 = ("Cannot find head of token to be split. Did it get merged?")
|
||||||
|
E124 = ("Cannot read from file: {path}. Supported formats: {formats}")
|
||||||
|
E125 = ("Unexpected value: {value}")
|
||||||
|
E126 = ("Unexpected matcher predicate: '{bad}'. Expected one of: {good}. "
|
||||||
|
"This is likely a bug in spaCy, so feel free to open an issue.")
|
||||||
|
E127 = ("Cannot create phrase pattern representation for length 0. This "
|
||||||
|
"is likely a bug in spaCy.")
|
||||||
|
E128 = ("Unsupported serialization argument: '{arg}'. The use of keyword "
|
||||||
|
"arguments to exclude fields from being serialized or deserialized "
|
||||||
|
"is now deprecated. Please use the `exclude` argument instead. "
|
||||||
|
"For example: exclude=['{arg}'].")
|
||||||
|
E129 = ("Cannot write the label of an existing Span object because a Span "
|
||||||
|
"is a read-only view of the underlying Token objects stored in the Doc. "
|
||||||
|
"Instead, create a new Span object and specify the `label` keyword argument, "
|
||||||
|
"for example:\nfrom spacy.tokens import Span\n"
|
||||||
|
"span = Span(doc, start={start}, end={end}, label='{label}')")
|
||||||
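The workaround that E129 describes, written out as a short sketch (the text, token indices and label are placeholders):

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Apple is opening a new office in Berlin")
doc.ents = [Span(doc, 0, 1, label="ORG")]  # create a fresh Span instead of relabelling one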
|
|
||||||
|
|
||||||
@add_codes
|
@add_codes
|
||||||
class TempErrors(object):
|
class TempErrors(object):
|
||||||
T001 = ("Max length currently 10 for phrase matching")
|
|
||||||
T002 = ("Pattern length ({doc_len}) >= phrase_matcher.max_length "
|
|
||||||
"({max_len}). Length can be set on initialization, up to 10.")
|
|
||||||
T003 = ("Resizing pre-trained Tagger models is not currently supported.")
|
T003 = ("Resizing pre-trained Tagger models is not currently supported.")
|
||||||
T004 = ("Currently parser depth is hard-coded to 1. Received: {value}.")
|
T004 = ("Currently parser depth is hard-coded to 1. Received: {value}.")
|
||||||
T005 = ("Currently history size is hard-coded to 0. Received: {value}.")
|
|
||||||
T006 = ("Currently history width is hard-coded to 0. Received: {value}.")
|
|
||||||
T007 = ("Can't yet set {attr} from Span. Vote for this feature on the "
|
T007 = ("Can't yet set {attr} from Span. Vote for this feature on the "
|
||||||
"issue tracker: http://github.com/explosion/spaCy/issues")
|
"issue tracker: http://github.com/explosion/spaCy/issues")
|
||||||
T008 = ("Bad configuration of Tagger. This is probably a bug within "
|
T008 = ("Bad configuration of Tagger. This is probably a bug within "
|
||||||
|
|
@ -274,56 +381,77 @@ class TempErrors(object):
|
||||||
"(pretrained_dims) but not the new name (pretrained_vectors).")
|
"(pretrained_dims) but not the new name (pretrained_vectors).")
|
||||||
|
|
||||||
|
|
||||||
|
# fmt: on
|
||||||
|
|
||||||
|
|
||||||
|
class MatchPatternError(ValueError):
|
||||||
|
def __init__(self, key, errors):
|
||||||
|
"""Custom error for validating match patterns.
|
||||||
|
|
||||||
|
key (unicode): The name of the matcher rule.
|
||||||
|
errors (dict): Validation errors (sequence of strings) mapped to pattern
|
||||||
|
ID, i.e. the index of the added pattern.
|
||||||
|
"""
|
||||||
|
msg = "Invalid token patterns for matcher rule '{}'\n".format(key)
|
||||||
|
for pattern_idx, error_msgs in errors.items():
|
||||||
|
pattern_errors = "\n".join(["- {}".format(e) for e in error_msgs])
|
||||||
|
msg += "\nPattern {}:\n{}\n".format(pattern_idx, pattern_errors)
|
||||||
|
ValueError.__init__(self, msg)
|
||||||
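A small sketch of the message this exception builds, with hypothetical validation output:

errors = {0: ["value of TEXT must be a string", "extra fields not permitted"]}
# raise MatchPatternError("my_rule", errors)
# -> Invalid token patterns for matcher rule 'my_rule'
#
#    Pattern 0:
#    - value of TEXT must be a string
#    - extra fields not permitted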
|
|
||||||
|
|
||||||
class ModelsWarning(UserWarning):
|
class ModelsWarning(UserWarning):
|
||||||
pass
|
pass
|
||||||
|
|
||||||
|
|
||||||
WARNINGS = {
|
WARNINGS = {
|
||||||
'user': UserWarning,
|
"user": UserWarning,
|
||||||
'deprecation': DeprecationWarning,
|
"deprecation": DeprecationWarning,
|
||||||
'models': ModelsWarning,
|
"models": ModelsWarning,
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
def _get_warn_types(arg):
|
def _get_warn_types(arg):
|
||||||
if arg == '': # don't show any warnings
|
if arg == "": # don't show any warnings
|
||||||
return []
|
return []
|
||||||
if not arg or arg == 'all': # show all available warnings
|
if not arg or arg == "all": # show all available warnings
|
||||||
return WARNINGS.keys()
|
return WARNINGS.keys()
|
||||||
return [w_type.strip() for w_type in arg.split(',')
|
return [w_type.strip() for w_type in arg.split(",") if w_type.strip() in WARNINGS]
|
||||||
if w_type.strip() in WARNINGS]
|
|
||||||
|
|
||||||
|
|
||||||
def _get_warn_excl(arg):
|
def _get_warn_excl(arg):
|
||||||
if not arg:
|
if not arg:
|
||||||
return []
|
return []
|
||||||
return [w_id.strip() for w_id in arg.split(',')]
|
return [w_id.strip() for w_id in arg.split(",")]
|
||||||
|
|
||||||
|
|
||||||
SPACY_WARNING_FILTER = os.environ.get('SPACY_WARNING_FILTER')
|
SPACY_WARNING_FILTER = os.environ.get("SPACY_WARNING_FILTER")
|
||||||
SPACY_WARNING_TYPES = _get_warn_types(os.environ.get('SPACY_WARNING_TYPES'))
|
SPACY_WARNING_TYPES = _get_warn_types(os.environ.get("SPACY_WARNING_TYPES"))
|
||||||
SPACY_WARNING_IGNORE = _get_warn_excl(os.environ.get('SPACY_WARNING_IGNORE'))
|
SPACY_WARNING_IGNORE = _get_warn_excl(os.environ.get("SPACY_WARNING_IGNORE"))
|
||||||
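These three environment variables are read once at import time, so they have to be set before spaCy is imported; a small sketch (the warning ID and type list are just examples):

import os

os.environ["SPACY_WARNING_IGNORE"] = "W008"        # silence one warning ID
os.environ["SPACY_WARNING_TYPES"] = "user,models"  # only emit these warning types
# import spacy  # must happen after the variables are set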
|
|
||||||
|
|
||||||
def user_warning(message):
|
def user_warning(message):
|
||||||
_warn(message, 'user')
|
_warn(message, "user")
|
||||||
|
|
||||||
|
|
||||||
def deprecation_warning(message):
|
def deprecation_warning(message):
|
||||||
_warn(message, 'deprecation')
|
_warn(message, "deprecation")
|
||||||
|
|
||||||
|
|
||||||
def models_warning(message):
|
def models_warning(message):
|
||||||
_warn(message, 'models')
|
_warn(message, "models")
|
||||||
|
|
||||||
|
|
||||||
def _warn(message, warn_type='user'):
|
def _warn(message, warn_type="user"):
|
||||||
"""
|
"""
|
||||||
message (unicode): The message to display.
|
message (unicode): The message to display.
|
||||||
category (Warning): The Warning to show.
|
category (Warning): The Warning to show.
|
||||||
"""
|
"""
|
||||||
w_id = message.split('[', 1)[1].split(']', 1)[0] # get ID from string
|
if message.startswith("["):
|
||||||
if warn_type in SPACY_WARNING_TYPES and w_id not in SPACY_WARNING_IGNORE:
|
w_id = message.split("[", 1)[1].split("]", 1)[0] # get ID from string
|
||||||
|
else:
|
||||||
|
w_id = None
|
||||||
|
ignore_warning = w_id and w_id in SPACY_WARNING_IGNORE
|
||||||
|
if warn_type in SPACY_WARNING_TYPES and not ignore_warning:
|
||||||
category = WARNINGS[warn_type]
|
category = WARNINGS[warn_type]
|
||||||
stack = inspect.stack()[-1]
|
stack = inspect.stack()[-1]
|
||||||
with warnings.catch_warnings():
|
with warnings.catch_warnings():
|
||||||
@ -21,295 +21,272 @@ GLOSSARY = {
|
||||||
# POS tags
|
# POS tags
|
||||||
# Universal POS Tags
|
# Universal POS Tags
|
||||||
# http://universaldependencies.org/u/pos/
|
# http://universaldependencies.org/u/pos/
|
||||||
|
"ADJ": "adjective",
|
||||||
'ADJ': 'adjective',
|
"ADP": "adposition",
|
||||||
'ADP': 'adposition',
|
"ADV": "adverb",
|
||||||
'ADV': 'adverb',
|
"AUX": "auxiliary",
|
||||||
'AUX': 'auxiliary',
|
"CONJ": "conjunction",
|
||||||
'CONJ': 'conjunction',
|
"CCONJ": "coordinating conjunction",
|
||||||
'CCONJ': 'coordinating conjunction',
|
"DET": "determiner",
|
||||||
'DET': 'determiner',
|
"INTJ": "interjection",
|
||||||
'INTJ': 'interjection',
|
"NOUN": "noun",
|
||||||
'NOUN': 'noun',
|
"NUM": "numeral",
|
||||||
'NUM': 'numeral',
|
"PART": "particle",
|
||||||
'PART': 'particle',
|
"PRON": "pronoun",
|
||||||
'PRON': 'pronoun',
|
"PROPN": "proper noun",
|
||||||
'PROPN': 'proper noun',
|
"PUNCT": "punctuation",
|
||||||
'PUNCT': 'punctuation',
|
"SCONJ": "subordinating conjunction",
|
||||||
'SCONJ': 'subordinating conjunction',
|
"SYM": "symbol",
|
||||||
'SYM': 'symbol',
|
"VERB": "verb",
|
||||||
'VERB': 'verb',
|
"X": "other",
|
||||||
'X': 'other',
|
"EOL": "end of line",
|
||||||
'EOL': 'end of line',
|
"SPACE": "space",
|
||||||
'SPACE': 'space',
|
|
||||||
|
|
||||||
|
|
||||||
# POS tags (English)
|
# POS tags (English)
|
||||||
# OntoNotes 5 / Penn Treebank
|
# OntoNotes 5 / Penn Treebank
|
||||||
# https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
|
# https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
|
||||||
|
".": "punctuation mark, sentence closer",
|
||||||
'.': 'punctuation mark, sentence closer',
|
",": "punctuation mark, comma",
|
||||||
',': 'punctuation mark, comma',
|
"-LRB-": "left round bracket",
|
||||||
'-LRB-': 'left round bracket',
|
"-RRB-": "right round bracket",
|
||||||
'-RRB-': 'right round bracket',
|
"``": "opening quotation mark",
|
||||||
'``': 'opening quotation mark',
|
'""': "closing quotation mark",
|
||||||
'""': 'closing quotation mark',
|
"''": "closing quotation mark",
|
||||||
"''": 'closing quotation mark',
|
":": "punctuation mark, colon or ellipsis",
|
||||||
':': 'punctuation mark, colon or ellipsis',
|
"$": "symbol, currency",
|
||||||
'$': 'symbol, currency',
|
"#": "symbol, number sign",
|
||||||
'#': 'symbol, number sign',
|
"AFX": "affix",
|
||||||
'AFX': 'affix',
|
"CC": "conjunction, coordinating",
|
||||||
'CC': 'conjunction, coordinating',
|
"CD": "cardinal number",
|
||||||
'CD': 'cardinal number',
|
"DT": "determiner",
|
||||||
'DT': 'determiner',
|
"EX": "existential there",
|
||||||
'EX': 'existential there',
|
"FW": "foreign word",
|
||||||
'FW': 'foreign word',
|
"HYPH": "punctuation mark, hyphen",
|
||||||
'HYPH': 'punctuation mark, hyphen',
|
"IN": "conjunction, subordinating or preposition",
|
||||||
'IN': 'conjunction, subordinating or preposition',
|
"JJ": "adjective",
|
||||||
'JJ': 'adjective',
|
"JJR": "adjective, comparative",
|
||||||
'JJR': 'adjective, comparative',
|
"JJS": "adjective, superlative",
|
||||||
'JJS': 'adjective, superlative',
|
"LS": "list item marker",
|
||||||
'LS': 'list item marker',
|
"MD": "verb, modal auxiliary",
|
||||||
'MD': 'verb, modal auxiliary',
|
"NIL": "missing tag",
|
||||||
'NIL': 'missing tag',
|
"NN": "noun, singular or mass",
|
||||||
'NN': 'noun, singular or mass',
|
"NNP": "noun, proper singular",
|
||||||
'NNP': 'noun, proper singular',
|
"NNPS": "noun, proper plural",
|
||||||
'NNPS': 'noun, proper plural',
|
"NNS": "noun, plural",
|
||||||
'NNS': 'noun, plural',
|
"PDT": "predeterminer",
|
||||||
'PDT': 'predeterminer',
|
"POS": "possessive ending",
|
||||||
'POS': 'possessive ending',
|
"PRP": "pronoun, personal",
|
||||||
'PRP': 'pronoun, personal',
|
"PRP$": "pronoun, possessive",
|
||||||
'PRP$': 'pronoun, possessive',
|
"RB": "adverb",
|
||||||
'RB': 'adverb',
|
"RBR": "adverb, comparative",
|
||||||
'RBR': 'adverb, comparative',
|
"RBS": "adverb, superlative",
|
||||||
'RBS': 'adverb, superlative',
|
"RP": "adverb, particle",
|
||||||
'RP': 'adverb, particle',
|
"TO": "infinitival to",
|
||||||
'TO': 'infinitival to',
|
"UH": "interjection",
|
||||||
'UH': 'interjection',
|
"VB": "verb, base form",
|
||||||
'VB': 'verb, base form',
|
"VBD": "verb, past tense",
|
||||||
'VBD': 'verb, past tense',
|
"VBG": "verb, gerund or present participle",
|
||||||
'VBG': 'verb, gerund or present participle',
|
"VBN": "verb, past participle",
|
||||||
'VBN': 'verb, past participle',
|
"VBP": "verb, non-3rd person singular present",
|
||||||
'VBP': 'verb, non-3rd person singular present',
|
"VBZ": "verb, 3rd person singular present",
|
||||||
'VBZ': 'verb, 3rd person singular present',
|
"WDT": "wh-determiner",
|
||||||
'WDT': 'wh-determiner',
|
"WP": "wh-pronoun, personal",
|
||||||
'WP': 'wh-pronoun, personal',
|
"WP$": "wh-pronoun, possessive",
|
||||||
'WP$': 'wh-pronoun, possessive',
|
"WRB": "wh-adverb",
|
||||||
'WRB': 'wh-adverb',
|
"SP": "space",
|
||||||
'SP': 'space',
|
"ADD": "email",
|
||||||
'ADD': 'email',
|
"NFP": "superfluous punctuation",
|
||||||
'NFP': 'superfluous punctuation',
|
"GW": "additional word in multi-word expression",
|
||||||
'GW': 'additional word in multi-word expression',
|
"XX": "unknown",
|
||||||
'XX': 'unknown',
|
"BES": 'auxiliary "be"',
|
||||||
'BES': 'auxiliary "be"',
|
"HVS": 'forms of "have"',
|
||||||
'HVS': 'forms of "have"',
|
|
||||||
|
|
||||||
|
|
||||||
# POS Tags (German)
|
# POS Tags (German)
|
||||||
# TIGER Treebank
|
# TIGER Treebank
|
||||||
# http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
|
# http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
|
||||||
|
"$(": "other sentence-internal punctuation mark",
|
||||||
'$(': 'other sentence-internal punctuation mark',
|
"$,": "comma",
|
||||||
'$,': 'comma',
|
"$.": "sentence-final punctuation mark",
|
||||||
'$.': 'sentence-final punctuation mark',
|
"ADJA": "adjective, attributive",
|
||||||
'ADJA': 'adjective, attributive',
|
"ADJD": "adjective, adverbial or predicative",
|
||||||
'ADJD': 'adjective, adverbial or predicative',
|
"APPO": "postposition",
|
||||||
'APPO': 'postposition',
|
"APPR": "preposition; circumposition left",
|
||||||
'APPR': 'preposition; circumposition left',
|
"APPRART": "preposition with article",
|
||||||
'APPRART': 'preposition with article',
|
"APZR": "circumposition right",
|
||||||
'APZR': 'circumposition right',
|
"ART": "definite or indefinite article",
|
||||||
'ART': 'definite or indefinite article',
|
"CARD": "cardinal number",
|
||||||
'CARD': 'cardinal number',
|
"FM": "foreign language material",
|
||||||
'FM': 'foreign language material',
|
"ITJ": "interjection",
|
||||||
'ITJ': 'interjection',
|
"KOKOM": "comparative conjunction",
|
||||||
'KOKOM': 'comparative conjunction',
|
"KON": "coordinate conjunction",
|
||||||
'KON': 'coordinate conjunction',
|
"KOUI": 'subordinate conjunction with "zu" and infinitive',
|
||||||
'KOUI': 'subordinate conjunction with "zu" and infinitive',
|
"KOUS": "subordinate conjunction with sentence",
|
||||||
'KOUS': 'subordinate conjunction with sentence',
|
"NE": "proper noun",
|
||||||
'NE': 'proper noun',
|
"NNE": "proper noun",
|
||||||
'NNE': 'proper noun',
|
"PAV": "pronominal adverb",
|
||||||
'PAV': 'pronominal adverb',
|
"PROAV": "pronominal adverb",
|
||||||
'PROAV': 'pronominal adverb',
|
"PDAT": "attributive demonstrative pronoun",
|
||||||
'PDAT': 'attributive demonstrative pronoun',
|
"PDS": "substituting demonstrative pronoun",
|
||||||
'PDS': 'substituting demonstrative pronoun',
|
"PIAT": "attributive indefinite pronoun without determiner",
|
||||||
'PIAT': 'attributive indefinite pronoun without determiner',
|
"PIDAT": "attributive indefinite pronoun with determiner",
|
||||||
'PIDAT': 'attributive indefinite pronoun with determiner',
|
"PIS": "substituting indefinite pronoun",
|
||||||
'PIS': 'substituting indefinite pronoun',
|
"PPER": "non-reflexive personal pronoun",
|
||||||
'PPER': 'non-reflexive personal pronoun',
|
"PPOSAT": "attributive possessive pronoun",
|
||||||
'PPOSAT': 'attributive possessive pronoun',
|
"PPOSS": "substituting possessive pronoun",
|
||||||
'PPOSS': 'substituting possessive pronoun',
|
"PRELAT": "attributive relative pronoun",
|
||||||
'PRELAT': 'attributive relative pronoun',
|
"PRELS": "substituting relative pronoun",
|
||||||
'PRELS': 'substituting relative pronoun',
|
"PRF": "reflexive personal pronoun",
|
||||||
'PRF': 'reflexive personal pronoun',
|
"PTKA": "particle with adjective or adverb",
|
||||||
'PTKA': 'particle with adjective or adverb',
|
"PTKANT": "answer particle",
|
||||||
'PTKANT': 'answer particle',
|
"PTKNEG": "negative particle",
|
||||||
'PTKNEG': 'negative particle',
|
"PTKVZ": "separable verbal particle",
|
||||||
'PTKVZ': 'separable verbal particle',
|
"PTKZU": '"zu" before infinitive',
|
||||||
'PTKZU': '"zu" before infinitive',
|
"PWAT": "attributive interrogative pronoun",
|
||||||
'PWAT': 'attributive interrogative pronoun',
|
"PWAV": "adverbial interrogative or relative pronoun",
|
||||||
'PWAV': 'adverbial interrogative or relative pronoun',
|
"PWS": "substituting interrogative pronoun",
|
||||||
'PWS': 'substituting interrogative pronoun',
|
"TRUNC": "word remnant",
|
||||||
'TRUNC': 'word remnant',
|
"VAFIN": "finite verb, auxiliary",
|
||||||
'VAFIN': 'finite verb, auxiliary',
|
"VAIMP": "imperative, auxiliary",
|
||||||
'VAIMP': 'imperative, auxiliary',
|
"VAINF": "infinitive, auxiliary",
|
||||||
'VAINF': 'infinitive, auxiliary',
|
"VAPP": "perfect participle, auxiliary",
|
||||||
'VAPP': 'perfect participle, auxiliary',
|
"VMFIN": "finite verb, modal",
|
||||||
'VMFIN': 'finite verb, modal',
|
"VMINF": "infinitive, modal",
|
||||||
'VMINF': 'infinitive, modal',
|
"VMPP": "perfect participle, modal",
|
||||||
'VMPP': 'perfect participle, modal',
|
"VVFIN": "finite verb, full",
|
||||||
'VVFIN': 'finite verb, full',
|
"VVIMP": "imperative, full",
|
||||||
'VVIMP': 'imperative, full',
|
"VVINF": "infinitive, full",
|
||||||
'VVINF': 'infinitive, full',
|
"VVIZU": 'infinitive with "zu", full',
|
||||||
'VVIZU': 'infinitive with "zu", full',
|
"VVPP": "perfect participle, full",
|
||||||
'VVPP': 'perfect participle, full',
|
"XY": "non-word containing non-letter",
|
||||||
'XY': 'non-word containing non-letter',
|
|
||||||
|
|
||||||
|
|
||||||
# Noun chunks
|
# Noun chunks
|
||||||
|
"NP": "noun phrase",
|
||||||
'NP': 'noun phrase',
|
"PP": "prepositional phrase",
|
||||||
'PP': 'prepositional phrase',
|
"VP": "verb phrase",
|
||||||
'VP': 'verb phrase',
|
"ADVP": "adverb phrase",
|
||||||
'ADVP': 'adverb phrase',
|
"ADJP": "adjective phrase",
|
||||||
'ADJP': 'adjective phrase',
|
"SBAR": "subordinating conjunction",
|
||||||
'SBAR': 'subordinating conjunction',
|
"PRT": "particle",
|
||||||
'PRT': 'particle',
|
"PNP": "prepositional noun phrase",
|
||||||
'PNP': 'prepositional noun phrase',
|
|
||||||
|
|
||||||
|
|
||||||
# Dependency Labels (English)
|
# Dependency Labels (English)
|
||||||
# ClearNLP / Universal Dependencies
|
# ClearNLP / Universal Dependencies
|
||||||
# https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md
|
# https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md
|
||||||
|
"acomp": "adjectival complement",
|
||||||
'acomp': 'adjectival complement',
|
"advcl": "adverbial clause modifier",
|
||||||
'advcl': 'adverbial clause modifier',
|
"advmod": "adverbial modifier",
|
||||||
'advmod': 'adverbial modifier',
|
"agent": "agent",
|
||||||
'agent': 'agent',
|
"amod": "adjectival modifier",
|
||||||
'amod': 'adjectival modifier',
|
"appos": "appositional modifier",
|
||||||
'appos': 'appositional modifier',
|
"attr": "attribute",
|
||||||
'attr': 'attribute',
|
"aux": "auxiliary",
|
||||||
'aux': 'auxiliary',
|
"auxpass": "auxiliary (passive)",
|
||||||
'auxpass': 'auxiliary (passive)',
|
"cc": "coordinating conjunction",
|
||||||
'cc': 'coordinating conjunction',
|
"ccomp": "clausal complement",
|
||||||
'ccomp': 'clausal complement',
|
"complm": "complementizer",
|
||||||
'complm': 'complementizer',
|
"conj": "conjunct",
|
||||||
'conj': 'conjunct',
|
"cop": "copula",
|
||||||
'cop': 'copula',
|
"csubj": "clausal subject",
|
||||||
'csubj': 'clausal subject',
|
"csubjpass": "clausal subject (passive)",
|
||||||
'csubjpass': 'clausal subject (passive)',
|
"dep": "unclassified dependent",
|
||||||
'dep': 'unclassified dependent',
|
"det": "determiner",
|
||||||
'det': 'determiner',
|
"dobj": "direct object",
|
||||||
'dobj': 'direct object',
|
"expl": "expletive",
|
||||||
'expl': 'expletive',
|
"hmod": "modifier in hyphenation",
|
||||||
'hmod': 'modifier in hyphenation',
|
"hyph": "hyphen",
|
||||||
'hyph': 'hyphen',
|
"infmod": "infinitival modifier",
|
||||||
'infmod': 'infinitival modifier',
|
"intj": "interjection",
|
||||||
'intj': 'interjection',
|
"iobj": "indirect object",
|
||||||
'iobj': 'indirect object',
|
"mark": "marker",
|
||||||
'mark': 'marker',
|
"meta": "meta modifier",
|
||||||
'meta': 'meta modifier',
|
"neg": "negation modifier",
|
||||||
'neg': 'negation modifier',
|
"nmod": "modifier of nominal",
|
||||||
'nmod': 'modifier of nominal',
|
"nn": "noun compound modifier",
|
||||||
'nn': 'noun compound modifier',
|
"npadvmod": "noun phrase as adverbial modifier",
|
||||||
'npadvmod': 'noun phrase as adverbial modifier',
|
"nsubj": "nominal subject",
|
||||||
'nsubj': 'nominal subject',
|
"nsubjpass": "nominal subject (passive)",
|
||||||
'nsubjpass': 'nominal subject (passive)',
|
"num": "number modifier",
|
||||||
'num': 'number modifier',
|
"number": "number compound modifier",
|
||||||
'number': 'number compound modifier',
|
"oprd": "object predicate",
|
||||||
'oprd': 'object predicate',
|
"obj": "object",
|
||||||
'obj': 'object',
|
"obl": "oblique nominal",
|
||||||
'obl': 'oblique nominal',
|
"parataxis": "parataxis",
|
||||||
'parataxis': 'parataxis',
|
"partmod": "participal modifier",
|
||||||
'partmod': 'participal modifier',
|
"pcomp": "complement of preposition",
|
||||||
'pcomp': 'complement of preposition',
|
"pobj": "object of preposition",
|
||||||
'pobj': 'object of preposition',
|
"poss": "possession modifier",
|
||||||
'poss': 'possession modifier',
|
"possessive": "possessive modifier",
|
||||||
'possessive': 'possessive modifier',
|
"preconj": "pre-correlative conjunction",
|
||||||
'preconj': 'pre-correlative conjunction',
|
"prep": "prepositional modifier",
|
||||||
'prep': 'prepositional modifier',
|
"prt": "particle",
|
||||||
'prt': 'particle',
|
"punct": "punctuation",
|
||||||
'punct': 'punctuation',
|
"quantmod": "modifier of quantifier",
|
||||||
'quantmod': 'modifier of quantifier',
|
"rcmod": "relative clause modifier",
|
||||||
'rcmod': 'relative clause modifier',
|
"root": "root",
|
||||||
'root': 'root',
|
"xcomp": "open clausal complement",
|
||||||
'xcomp': 'open clausal complement',
|
|
||||||
|
|
||||||
|
|
||||||
# Dependency labels (German)
|
# Dependency labels (German)
|
||||||
# TIGER Treebank
|
# TIGER Treebank
|
||||||
# http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
|
# http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
|
||||||
# currently missing: 'cc' (comparative complement) because of conflict
|
# currently missing: 'cc' (comparative complement) because of conflict
|
||||||
# with English labels
|
# with English labels
|
||||||
|
"ac": "adpositional case marker",
|
||||||
'ac': 'adpositional case marker',
|
"adc": "adjective component",
|
||||||
'adc': 'adjective component',
|
"ag": "genitive attribute",
|
||||||
'ag': 'genitive attribute',
|
"ams": "measure argument of adjective",
|
||||||
'ams': 'measure argument of adjective',
|
"app": "apposition",
|
||||||
'app': 'apposition',
|
"avc": "adverbial phrase component",
|
||||||
'avc': 'adverbial phrase component',
|
"cd": "coordinating conjunction",
|
||||||
'cd': 'coordinating conjunction',
|
"cj": "conjunct",
|
||||||
'cj': 'conjunct',
|
"cm": "comparative conjunction",
|
||||||
'cm': 'comparative conjunction',
|
"cp": "complementizer",
|
||||||
'cp': 'complementizer',
|
"cvc": "collocational verb construction",
|
||||||
'cvc': 'collocational verb construction',
|
"da": "dative",
|
||||||
'da': 'dative',
|
"dh": "discourse-level head",
|
||||||
'dh': 'discourse-level head',
|
"dm": "discourse marker",
|
||||||
'dm': 'discourse marker',
|
"ep": "expletive es",
|
||||||
'ep': 'expletive es',
|
"hd": "head",
|
||||||
'hd': 'head',
|
"ju": "junctor",
|
||||||
'ju': 'junctor',
|
"mnr": "postnominal modifier",
|
||||||
'mnr': 'postnominal modifier',
|
"mo": "modifier",
|
||||||
'mo': 'modifier',
|
"ng": "negation",
|
||||||
'ng': 'negation',
|
"nk": "noun kernel element",
|
||||||
'nk': 'noun kernel element',
|
"nmc": "numerical component",
|
||||||
'nmc': 'numerical component',
|
"oa": "accusative object",
|
||||||
'oa': 'accusative object',
|
"oc": "clausal object",
|
||||||
'oc': 'clausal object',
|
"og": "genitive object",
|
||||||
'og': 'genitive object',
|
"op": "prepositional object",
|
||||||
'op': 'prepositional object',
|
"par": "parenthetical element",
|
||||||
'par': 'parenthetical element',
|
"pd": "predicate",
|
||||||
'pd': 'predicate',
|
"pg": "phrasal genitive",
|
||||||
'pg': 'phrasal genitive',
|
"ph": "placeholder",
|
||||||
'ph': 'placeholder',
|
"pm": "morphological particle",
|
||||||
'pm': 'morphological particle',
|
"pnc": "proper noun component",
|
||||||
'pnc': 'proper noun component',
|
"rc": "relative clause",
|
||||||
'rc': 'relative clause',
|
"re": "repeated element",
|
||||||
're': 'repeated element',
|
"rs": "reported speech",
|
||||||
'rs': 'reported speech',
|
"sb": "subject",
|
||||||
'sb': 'subject',
|
|
||||||
|
|
||||||
|
|
||||||
# Named Entity Recognition
|
# Named Entity Recognition
|
||||||
# OntoNotes 5
|
# OntoNotes 5
|
||||||
# https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf
|
# https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf
|
||||||
|
"PERSON": "People, including fictional",
|
||||||
'PERSON': 'People, including fictional',
|
"NORP": "Nationalities or religious or political groups",
|
||||||
'NORP': 'Nationalities or religious or political groups',
|
"FACILITY": "Buildings, airports, highways, bridges, etc.",
|
||||||
'FACILITY': 'Buildings, airports, highways, bridges, etc.',
|
"FAC": "Buildings, airports, highways, bridges, etc.",
|
||||||
'FAC': 'Buildings, airports, highways, bridges, etc.',
|
"ORG": "Companies, agencies, institutions, etc.",
|
||||||
'ORG': 'Companies, agencies, institutions, etc.',
|
"GPE": "Countries, cities, states",
|
||||||
'GPE': 'Countries, cities, states',
|
"LOC": "Non-GPE locations, mountain ranges, bodies of water",
|
||||||
'LOC': 'Non-GPE locations, mountain ranges, bodies of water',
|
"PRODUCT": "Objects, vehicles, foods, etc. (not services)",
|
||||||
'PRODUCT': 'Objects, vehicles, foods, etc. (not services)',
|
"EVENT": "Named hurricanes, battles, wars, sports events, etc.",
|
||||||
'EVENT': 'Named hurricanes, battles, wars, sports events, etc.',
|
"WORK_OF_ART": "Titles of books, songs, etc.",
|
||||||
'WORK_OF_ART': 'Titles of books, songs, etc.',
|
"LAW": "Named documents made into laws.",
|
||||||
'LAW': 'Named documents made into laws.',
|
"LANGUAGE": "Any named language",
|
||||||
'LANGUAGE': 'Any named language',
|
"DATE": "Absolute or relative dates or periods",
|
||||||
'DATE': 'Absolute or relative dates or periods',
|
"TIME": "Times smaller than a day",
|
||||||
'TIME': 'Times smaller than a day',
|
"PERCENT": 'Percentage, including "%"',
|
||||||
'PERCENT': 'Percentage, including "%"',
|
"MONEY": "Monetary values, including unit",
|
||||||
'MONEY': 'Monetary values, including unit',
|
"QUANTITY": "Measurements, as of weight or distance",
|
||||||
'QUANTITY': 'Measurements, as of weight or distance',
|
"ORDINAL": '"first", "second", etc.',
|
||||||
'ORDINAL': '"first", "second", etc.',
|
"CARDINAL": "Numerals that do not fall under another type",
|
||||||
'CARDINAL': 'Numerals that do not fall under another type',
|
|
||||||
|
|
||||||
|
|
||||||
# Named Entity Recognition
|
# Named Entity Recognition
|
||||||
# Wikipedia
|
# Wikipedia
|
||||||
# http://www.sciencedirect.com/science/article/pii/S0004370212000276
|
# http://www.sciencedirect.com/science/article/pii/S0004370212000276
|
||||||
# https://pdfs.semanticscholar.org/5744/578cc243d92287f47448870bb426c66cc941.pdf
|
# https://pdfs.semanticscholar.org/5744/578cc243d92287f47448870bb426c66cc941.pdf
|
||||||
|
"PER": "Named person or family.",
|
||||||
'PER': 'Named person or family.',
|
"MISC": "Miscellaneous entities, e.g. events, nationalities, products or works of art",
|
||||||
'MISC': ('Miscellaneous entities, e.g. events, nationalities, '
|
|
||||||
'products or works of art'),
|
|
||||||
}
|
}
|
||||||
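The GLOSSARY table above is what spacy.explain() looks up, so reformatting it does not change behaviour; a quick usage sketch:

import spacy

print(spacy.explain("nsubj"))  # -> nominal subject
print(spacy.explain("GPE"))    # -> Countries, cities, states
print(spacy.explain("RB"))     # -> adverb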
655
spacy/gold.pyx
@ -3,16 +3,25 @@
|
||||||
from __future__ import unicode_literals, print_function
|
from __future__ import unicode_literals, print_function
|
||||||
|
|
||||||
import re
|
import re
|
||||||
import ujson
|
|
||||||
import random
|
import random
|
||||||
import cytoolz
|
import numpy
|
||||||
import itertools
|
import tempfile
|
||||||
|
import shutil
|
||||||
|
from pathlib import Path
|
||||||
|
import srsly
|
||||||
|
|
||||||
|
from . import _align
|
||||||
from .syntax import nonproj
|
from .syntax import nonproj
|
||||||
from .tokens import Doc
|
from .tokens import Doc, Span
|
||||||
from .errors import Errors
|
from .errors import Errors
|
||||||
|
from .compat import path2str
|
||||||
from . import util
|
from . import util
|
||||||
from .util import minibatch
|
from .util import minibatch, itershuffle
|
||||||
|
|
||||||
|
from libc.stdio cimport FILE, fopen, fclose, fread, fwrite, feof, fseek
|
||||||
|
|
||||||
|
|
||||||
|
punct_re = re.compile(r"\W")
|
||||||
|
|
||||||
|
|
||||||
def tags_to_entities(tags):
|
def tags_to_entities(tags):
|
||||||
|
|
@ -21,22 +30,22 @@ def tags_to_entities(tags):
|
||||||
for i, tag in enumerate(tags):
|
for i, tag in enumerate(tags):
|
||||||
if tag is None:
|
if tag is None:
|
||||||
continue
|
continue
|
||||||
if tag.startswith('O'):
|
if tag.startswith("O"):
|
||||||
# TODO: We shouldn't be getting these malformed inputs. Fix this.
|
# TODO: We shouldn't be getting these malformed inputs. Fix this.
|
||||||
if start is not None:
|
if start is not None:
|
||||||
start = None
|
start = None
|
||||||
continue
|
continue
|
||||||
elif tag == '-':
|
elif tag == "-":
|
||||||
continue
|
continue
|
||||||
elif tag.startswith('I'):
|
elif tag.startswith("I"):
|
||||||
if start is None:
|
if start is None:
|
||||||
raise ValueError(Errors.E067.format(tags=tags[:i + 1]))
|
raise ValueError(Errors.E067.format(tags=tags[:i + 1]))
|
||||||
continue
|
continue
|
||||||
if tag.startswith('U'):
|
if tag.startswith("U"):
|
||||||
entities.append((tag[2:], i, i))
|
entities.append((tag[2:], i, i))
|
||||||
elif tag.startswith('B'):
|
elif tag.startswith("B"):
|
||||||
start = i
|
start = i
|
||||||
elif tag.startswith('L'):
|
elif tag.startswith("L"):
|
||||||
entities.append((tag[2:], start, i))
|
entities.append((tag[2:], start, i))
|
||||||
start = None
|
start = None
|
||||||
else:
|
else:
|
||||||
|
|
@ -55,204 +64,71 @@ def merge_sents(sents):
|
||||||
m_deps[3].extend(head + i for head in heads)
|
m_deps[3].extend(head + i for head in heads)
|
||||||
m_deps[4].extend(labels)
|
m_deps[4].extend(labels)
|
||||||
m_deps[5].extend(ner)
|
m_deps[5].extend(ner)
|
||||||
m_brackets.extend((b['first'] + i, b['last'] + i, b['label'])
|
m_brackets.extend((b["first"] + i, b["last"] + i, b["label"])
|
||||||
for b in brackets)
|
for b in brackets)
|
||||||
i += len(ids)
|
i += len(ids)
|
||||||
return [(m_deps, m_brackets)]
|
return [(m_deps, m_brackets)]
|
||||||
|
|
||||||
|
|
||||||
def align(cand_words, gold_words):
|
def align(cand_words, gold_words):
|
||||||
cost, edit_path = _min_edit_path(cand_words, gold_words)
|
|
||||||
alignment = []
|
|
||||||
i_of_gold = 0
|
|
||||||
for move in edit_path:
|
|
||||||
if move == 'M':
|
|
||||||
alignment.append(i_of_gold)
|
|
||||||
i_of_gold += 1
|
|
||||||
elif move == 'S':
|
|
||||||
alignment.append(None)
|
|
||||||
i_of_gold += 1
|
|
||||||
elif move == 'D':
|
|
||||||
alignment.append(None)
|
|
||||||
elif move == 'I':
|
|
||||||
i_of_gold += 1
|
|
||||||
else:
|
|
||||||
raise Exception(move)
|
|
||||||
return alignment
|
|
||||||
|
|
||||||
|
|
||||||
punct_re = re.compile(r'\W')
|
|
||||||
|
|
||||||
|
|
||||||
def _min_edit_path(cand_words, gold_words):
|
|
||||||
cdef:
|
|
||||||
Pool mem
|
|
||||||
int i, j, n_cand, n_gold
|
|
||||||
int* curr_costs
|
|
||||||
int* prev_costs
|
|
||||||
|
|
||||||
# TODO: Fix this --- just do it properly, make the full edit matrix and
|
|
||||||
# then walk back over it...
|
|
||||||
# Preprocess inputs
|
|
||||||
cand_words = [punct_re.sub('', w).lower() for w in cand_words]
|
|
||||||
gold_words = [punct_re.sub('', w).lower() for w in gold_words]
|
|
||||||
|
|
||||||
if cand_words == gold_words:
|
if cand_words == gold_words:
|
||||||
return 0, ''.join(['M' for _ in gold_words])
|
alignment = numpy.arange(len(cand_words))
|
||||||
mem = Pool()
|
return 0, alignment, alignment, {}, {}
|
||||||
n_cand = len(cand_words)
|
cand_words = [w.replace(" ", "").lower() for w in cand_words]
|
||||||
n_gold = len(gold_words)
|
gold_words = [w.replace(" ", "").lower() for w in gold_words]
|
||||||
# Levenshtein distance, except we need the history, and we may want
|
cost, i2j, j2i, matrix = _align.align(cand_words, gold_words)
|
||||||
# different costs. Mark operations with a string, and score the history
|
i2j_multi, j2i_multi = _align.multi_align(i2j, j2i, [len(w) for w in cand_words],
|
||||||
# using _edit_cost.
|
[len(w) for w in gold_words])
|
||||||
previous_row = []
|
for i, j in list(i2j_multi.items()):
|
||||||
prev_costs = <int*>mem.alloc(n_gold + 1, sizeof(int))
|
if i2j_multi.get(i+1) != j and i2j_multi.get(i-1) != j:
|
||||||
curr_costs = <int*>mem.alloc(n_gold + 1, sizeof(int))
|
i2j[i] = j
|
||||||
for i in range(n_gold + 1):
|
i2j_multi.pop(i)
|
||||||
cell = ''
|
for j, i in list(j2i_multi.items()):
|
||||||
for j in range(i):
|
if j2i_multi.get(j+1) != i and j2i_multi.get(j-1) != i:
|
||||||
cell += 'I'
|
j2i[j] = i
|
||||||
previous_row.append('I' * i)
|
j2i_multi.pop(j)
|
||||||
prev_costs[i] = i
|
return cost, i2j, j2i, i2j_multi, j2i_multi
|
||||||
for i, cand in enumerate(cand_words):
|
|
||||||
current_row = ['D' * (i + 1)]
|
|
||||||
curr_costs[0] = i+1
|
|
||||||
for j, gold in enumerate(gold_words):
|
|
||||||
if gold.lower() == cand.lower():
|
|
||||||
s_cost = prev_costs[j]
|
|
||||||
i_cost = curr_costs[j] + 1
|
|
||||||
d_cost = prev_costs[j + 1] + 1
|
|
||||||
else:
|
|
||||||
s_cost = prev_costs[j] + 1
|
|
||||||
i_cost = curr_costs[j] + 1
|
|
||||||
d_cost = prev_costs[j + 1] + (1 if cand else 0)
|
|
||||||
|
|
||||||
if s_cost <= i_cost and s_cost <= d_cost:
|
|
||||||
best_cost = s_cost
|
|
||||||
best_hist = previous_row[j] + ('M' if gold == cand else 'S')
|
|
||||||
elif i_cost <= s_cost and i_cost <= d_cost:
|
|
||||||
best_cost = i_cost
|
|
||||||
best_hist = current_row[j] + 'I'
|
|
||||||
else:
|
|
||||||
best_cost = d_cost
|
|
||||||
best_hist = previous_row[j + 1] + 'D'
|
|
||||||
|
|
||||||
current_row.append(best_hist)
|
|
||||||
curr_costs[j+1] = best_cost
|
|
||||||
previous_row = current_row
|
|
||||||
for j in range(len(gold_words) + 1):
|
|
||||||
prev_costs[j] = curr_costs[j]
|
|
||||||
curr_costs[j] = 0
|
|
||||||
|
|
||||||
return prev_costs[n_gold], previous_row[-1]
|
|
||||||
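A hedged sketch of calling the new align() (the token lists are illustrative): i2j and j2i map one-to-one aligned token indices between the candidate and gold tokenizations, while i2j_multi and j2i_multi record tokens that only align as part of a multi-token group, such as the hyphen split below.

from spacy.gold import align

cand = ["her", "over", "gold", "-", "plated", "house"]
gold = ["her", "over", "gold-plated", "house"]
cost, i2j, j2i, i2j_multi, j2i_multi = align(cand, gold)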
|
|
||||||
|
|
||||||
class GoldCorpus(object):
|
class GoldCorpus(object):
|
||||||
"""An annotated corpus, using the JSON file format. Manages
|
"""An annotated corpus, using the JSON file format. Manages
|
||||||
annotations for tagging, dependency parsing and NER."""
|
annotations for tagging, dependency parsing and NER.
|
||||||
def __init__(self, train_path, dev_path, gold_preproc=True, limit=None):
|
|
||||||
|
DOCS: https://spacy.io/api/goldcorpus
|
||||||
|
"""
|
||||||
|
def __init__(self, train, dev, gold_preproc=False, limit=None):
|
||||||
"""Create a GoldCorpus.
|
"""Create a GoldCorpus.
|
||||||
|
|
||||||
train_path (unicode or Path): File or directory of training data.
|
train (unicode or Path): File or directory of training data.
|
||||||
dev_path (unicode or Path): File or directory of development data.
|
dev (unicode or Path): File or directory of development data.
|
||||||
RETURNS (GoldCorpus): The newly created object.
|
RETURNS (GoldCorpus): The newly created object.
|
||||||
"""
|
"""
|
||||||
self.train_path = util.ensure_path(train_path)
|
|
||||||
self.dev_path = util.ensure_path(dev_path)
|
|
||||||
self.limit = limit
|
self.limit = limit
|
||||||
self.train_locs = self.walk_corpus(self.train_path)
|
if isinstance(train, str) or isinstance(train, Path):
|
||||||
self.dev_locs = self.walk_corpus(self.dev_path)
|
train = self.read_tuples(self.walk_corpus(train))
|
||||||
|
dev = self.read_tuples(self.walk_corpus(dev))
|
||||||
|
# Write temp directory with one doc per file, so we can shuffle and stream
|
||||||
|
self.tmp_dir = Path(tempfile.mkdtemp())
|
||||||
|
self.write_msgpack(self.tmp_dir / "train", train, limit=self.limit)
|
||||||
|
self.write_msgpack(self.tmp_dir / "dev", dev, limit=self.limit)
|
||||||
|
|
||||||
@property
|
def __del__(self):
|
||||||
def train_tuples(self):
|
shutil.rmtree(self.tmp_dir)
|
||||||
i = 0
|
|
||||||
for loc in self.train_locs:
|
|
||||||
gold_tuples = read_json_file(loc)
|
|
||||||
for item in gold_tuples:
|
|
||||||
yield item
|
|
||||||
i += len(item[1])
|
|
||||||
if self.limit and i >= self.limit:
|
|
||||||
break
|
|
||||||
|
|
||||||
@property
|
@staticmethod
|
||||||
def dev_tuples(self):
|
def write_msgpack(directory, doc_tuples, limit=0):
|
||||||
i = 0
|
if not directory.exists():
|
||||||
for loc in self.dev_locs:
|
directory.mkdir()
|
||||||
gold_tuples = read_json_file(loc)
|
|
||||||
for item in gold_tuples:
|
|
||||||
yield item
|
|
||||||
i += len(item[1])
|
|
||||||
if self.limit and i >= self.limit:
|
|
||||||
break
|
|
||||||
|
|
||||||
def count_train(self):
|
|
||||||
n = 0
|
n = 0
|
||||||
i = 0
|
for i, doc_tuple in enumerate(doc_tuples):
|
||||||
for raw_text, paragraph_tuples in self.train_tuples:
|
srsly.write_msgpack(directory / "{}.msg".format(i), [doc_tuple])
|
||||||
n += sum([len(s[0][1]) for s in paragraph_tuples])
|
n += len(doc_tuple[1])
|
||||||
if self.limit and i >= self.limit:
|
if limit and n >= limit:
|
||||||
break
|
break
|
||||||
i += len(paragraph_tuples)
|
|
||||||
return n
|
|
||||||
|
|
||||||
def train_docs(self, nlp, gold_preproc=False,
|
|
||||||
projectivize=False, max_length=None,
|
|
||||||
noise_level=0.0):
|
|
||||||
train_tuples = self.train_tuples
|
|
||||||
if projectivize:
|
|
||||||
train_tuples = nonproj.preprocess_training_data(
|
|
||||||
self.train_tuples, label_freq_cutoff=100)
|
|
||||||
random.shuffle(train_tuples)
|
|
||||||
gold_docs = self.iter_gold_docs(nlp, train_tuples, gold_preproc,
|
|
||||||
max_length=max_length,
|
|
||||||
noise_level=noise_level)
|
|
||||||
yield from gold_docs
|
|
||||||
|
|
||||||
def dev_docs(self, nlp, gold_preproc=False):
|
|
||||||
gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc)
|
|
||||||
yield from gold_docs
|
|
||||||
|
|
||||||
@classmethod
|
|
||||||
def iter_gold_docs(cls, nlp, tuples, gold_preproc, max_length=None,
|
|
||||||
noise_level=0.0):
|
|
||||||
for raw_text, paragraph_tuples in tuples:
|
|
||||||
if gold_preproc:
|
|
||||||
raw_text = None
|
|
||||||
else:
|
|
||||||
paragraph_tuples = merge_sents(paragraph_tuples)
|
|
||||||
docs = cls._make_docs(nlp, raw_text, paragraph_tuples,
|
|
||||||
gold_preproc, noise_level=noise_level)
|
|
||||||
golds = cls._make_golds(docs, paragraph_tuples)
|
|
||||||
for doc, gold in zip(docs, golds):
|
|
||||||
if (not max_length) or len(doc) < max_length:
|
|
||||||
yield doc, gold
|
|
||||||
|
|
||||||
@classmethod
|
|
||||||
def _make_docs(cls, nlp, raw_text, paragraph_tuples, gold_preproc,
|
|
||||||
noise_level=0.0):
|
|
||||||
if raw_text is not None:
|
|
||||||
raw_text = add_noise(raw_text, noise_level)
|
|
||||||
return [nlp.make_doc(raw_text)]
|
|
||||||
else:
|
|
||||||
return [Doc(nlp.vocab,
|
|
||||||
words=add_noise(sent_tuples[1], noise_level))
|
|
||||||
for (sent_tuples, brackets) in paragraph_tuples]
|
|
||||||
|
|
||||||
@classmethod
|
|
||||||
def _make_golds(cls, docs, paragraph_tuples):
|
|
||||||
if len(docs) != len(paragraph_tuples):
|
|
||||||
raise ValueError(Errors.E070.format(n_docs=len(docs),
|
|
||||||
n_annots=len(paragraph_tuples)))
|
|
||||||
if len(docs) == 1:
|
|
||||||
return [GoldParse.from_annot_tuples(docs[0],
|
|
||||||
paragraph_tuples[0][0])]
|
|
||||||
else:
|
|
||||||
return [GoldParse.from_annot_tuples(doc, sent_tuples)
|
|
||||||
for doc, (sent_tuples, brackets)
|
|
||||||
in zip(docs, paragraph_tuples)]
|
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def walk_corpus(path):
|
def walk_corpus(path):
|
||||||
|
path = util.ensure_path(path)
|
||||||
if not path.is_dir():
|
if not path.is_dir():
|
||||||
return [path]
|
return [path]
|
||||||
paths = [path]
|
paths = [path]
|
||||||
|
|
@ -262,14 +138,108 @@ class GoldCorpus(object):
|
||||||
if str(path) in seen:
|
if str(path) in seen:
|
||||||
continue
|
continue
|
||||||
seen.add(str(path))
|
seen.add(str(path))
|
||||||
if path.parts[-1].startswith('.'):
|
if path.parts[-1].startswith("."):
|
||||||
continue
|
continue
|
||||||
elif path.is_dir():
|
elif path.is_dir():
|
||||||
paths.extend(path.iterdir())
|
paths.extend(path.iterdir())
|
||||||
elif path.parts[-1].endswith('.json'):
|
elif path.parts[-1].endswith(".json"):
|
||||||
locs.append(path)
|
locs.append(path)
|
||||||
return locs
|
return locs
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def read_tuples(locs, limit=0):
|
||||||
|
i = 0
|
||||||
|
for loc in locs:
|
||||||
|
loc = util.ensure_path(loc)
|
||||||
|
if loc.parts[-1].endswith("json"):
|
||||||
|
gold_tuples = read_json_file(loc)
|
||||||
|
elif loc.parts[-1].endswith("jsonl"):
|
||||||
|
gold_tuples = srsly.read_jsonl(loc)
|
||||||
|
elif loc.parts[-1].endswith("msg"):
|
||||||
|
gold_tuples = srsly.read_msgpack(loc)
|
||||||
|
else:
|
||||||
|
supported = ("json", "jsonl", "msg")
|
||||||
|
raise ValueError(Errors.E124.format(path=path2str(loc), formats=supported))
|
||||||
|
for item in gold_tuples:
|
||||||
|
yield item
|
||||||
|
i += len(item[1])
|
||||||
|
if limit and i >= limit:
|
||||||
|
return
|
||||||
|
|
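For reference, the per-extension dispatch above (.json, .jsonl and .msg) can be exercised directly; a small sketch, assuming spaCy v2.1 and a placeholder training file.

from pathlib import Path
from spacy.gold import GoldCorpus

loc = Path("train.json")                      # placeholder path
for raw_text, paragraph_tuples in GoldCorpus.read_tuples([loc], limit=10):
    # Each item is (raw_text, [[(ids, words, tags, heads, labels, ner), brackets], ...])
    print(raw_text)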
||||||
|
@property
|
||||||
|
def dev_tuples(self):
|
||||||
|
locs = (self.tmp_dir / "dev").iterdir()
|
||||||
|
yield from self.read_tuples(locs, limit=self.limit)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def train_tuples(self):
|
||||||
|
locs = (self.tmp_dir / "train").iterdir()
|
||||||
|
yield from self.read_tuples(locs, limit=self.limit)
|
||||||
|
|
||||||
|
def count_train(self):
|
||||||
|
n = 0
|
||||||
|
i = 0
|
||||||
|
for raw_text, paragraph_tuples in self.train_tuples:
|
||||||
|
for sent_tuples, brackets in paragraph_tuples:
|
||||||
|
n += len(sent_tuples[1])
|
||||||
|
if self.limit and i >= self.limit:
|
||||||
|
break
|
||||||
|
i += 1
|
||||||
|
return n
|
||||||
|
|
||||||
|
def train_docs(self, nlp, gold_preproc=False, max_length=None,
|
||||||
|
noise_level=0.0):
|
||||||
|
locs = list((self.tmp_dir / 'train').iterdir())
|
||||||
|
random.shuffle(locs)
|
||||||
|
train_tuples = self.read_tuples(locs, limit=self.limit)
|
||||||
|
gold_docs = self.iter_gold_docs(nlp, train_tuples, gold_preproc,
|
||||||
|
max_length=max_length,
|
||||||
|
noise_level=noise_level,
|
||||||
|
make_projective=True)
|
||||||
|
yield from gold_docs
|
||||||
|
|
||||||
|
def dev_docs(self, nlp, gold_preproc=False):
|
||||||
|
gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc=gold_preproc)
|
||||||
|
yield from gold_docs
|
||||||
|
|
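A rough sketch of how train_docs and dev_docs feed a training loop, assuming spaCy v2.1; the paths, pipeline setup and hyperparameters are placeholders, not a recommended configuration.

import spacy
from spacy.gold import GoldCorpus

nlp = spacy.blank("en")
nlp.add_pipe(nlp.create_pipe("ner"))
corpus = GoldCorpus("train.json", "dev.json")            # placeholder paths
optimizer = nlp.begin_training(lambda: corpus.train_tuples)
for itn in range(10):
    losses = {}
    for doc, gold in corpus.train_docs(nlp, max_length=200, noise_level=0.1):
        nlp.update([doc], [gold], sgd=optimizer, drop=0.2, losses=losses)
    print(itn, losses)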
||||||
|
@classmethod
|
||||||
|
def iter_gold_docs(cls, nlp, tuples, gold_preproc, max_length=None,
|
||||||
|
noise_level=0.0, make_projective=False):
|
||||||
|
for raw_text, paragraph_tuples in tuples:
|
||||||
|
if gold_preproc:
|
||||||
|
raw_text = None
|
||||||
|
else:
|
||||||
|
paragraph_tuples = merge_sents(paragraph_tuples)
|
||||||
|
docs = cls._make_docs(nlp, raw_text, paragraph_tuples, gold_preproc,
|
||||||
|
noise_level=noise_level)
|
||||||
|
golds = cls._make_golds(docs, paragraph_tuples, make_projective)
|
||||||
|
for doc, gold in zip(docs, golds):
|
||||||
|
if (not max_length) or len(doc) < max_length:
|
||||||
|
yield doc, gold
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def _make_docs(cls, nlp, raw_text, paragraph_tuples, gold_preproc, noise_level=0.0):
|
||||||
|
if raw_text is not None:
|
||||||
|
raw_text = add_noise(raw_text, noise_level)
|
||||||
|
return [nlp.make_doc(raw_text)]
|
||||||
|
else:
|
||||||
|
return [Doc(nlp.vocab, words=add_noise(sent_tuples[1], noise_level))
|
||||||
|
for (sent_tuples, brackets) in paragraph_tuples]
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def _make_golds(cls, docs, paragraph_tuples, make_projective):
|
||||||
|
if len(docs) != len(paragraph_tuples):
|
||||||
|
n_annots = len(paragraph_tuples)
|
||||||
|
raise ValueError(Errors.E070.format(n_docs=len(docs), n_annots=n_annots))
|
||||||
|
if len(docs) == 1:
|
||||||
|
return [GoldParse.from_annot_tuples(docs[0], paragraph_tuples[0][0],
|
||||||
|
make_projective=make_projective)]
|
||||||
|
else:
|
||||||
|
return [GoldParse.from_annot_tuples(doc, sent_tuples,
|
||||||
|
make_projective=make_projective)
|
||||||
|
for doc, (sent_tuples, brackets)
|
||||||
|
in zip(docs, paragraph_tuples)]
|
||||||
|
|
||||||
|
|
||||||
def add_noise(orig, noise_level):
|
def add_noise(orig, noise_level):
|
||||||
if random.random() >= noise_level:
|
if random.random() >= noise_level:
|
||||||
|
|
@ -279,60 +249,134 @@ def add_noise(orig, noise_level):
|
||||||
corrupted = [w for w in corrupted if w]
|
corrupted = [w for w in corrupted if w]
|
||||||
return corrupted
|
return corrupted
|
||||||
else:
|
else:
|
||||||
return ''.join(_corrupt(c, noise_level) for c in orig)
|
return "".join(_corrupt(c, noise_level) for c in orig)
|
||||||
|
|
||||||
|
|
||||||
def _corrupt(c, noise_level):
|
def _corrupt(c, noise_level):
|
||||||
if random.random() >= noise_level:
|
if random.random() >= noise_level:
|
||||||
return c
|
return c
|
||||||
elif c == ' ':
|
elif c == " ":
|
||||||
return '\n'
|
return "\n"
|
||||||
elif c == '\n':
|
elif c == "\n":
|
||||||
return ' '
|
return " "
|
||||||
elif c in ['.', "'", "!", "?"]:
|
elif c in [".", "'", "!", "?", ","]:
|
||||||
return ''
|
return ""
|
||||||
else:
|
else:
|
||||||
return c.lower()
|
return c.lower()
|
||||||
|
|
||||||
|
|
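The character-level corruption above is easiest to see on a short string; a small sketch, assuming spaCy v2.1 (add_noise is an internal helper in spacy.gold, so importing it directly is for illustration only).

import random
from spacy.gold import add_noise

random.seed(0)
# With a non-zero noise level, a string is corrupted with probability noise_level,
# and each character of a corrupted string is then altered with the same
# probability: spaces may become newlines, ".", "'", "!", "?" and "," may be
# dropped, and other characters may be lowercased.
print(add_noise("Hello, World. How are you?", 0.5))
print(add_noise("Hello, World.", 0.0))   # noise_level 0.0 returns the text unchanged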
||||||
|
def read_json_object(json_corpus_section):
|
||||||
|
"""Take a list of JSON-formatted documents (e.g. from an already loaded
|
||||||
|
training data file) and yield tuples in the GoldParse format.
|
||||||
|
|
||||||
|
json_corpus_section (list): The data.
|
||||||
|
YIELDS (tuple): The reformatted data.
|
||||||
|
"""
|
||||||
|
for json_doc in json_corpus_section:
|
||||||
|
tuple_doc = json_to_tuple(json_doc)
|
||||||
|
for tuple_paragraph in tuple_doc:
|
||||||
|
yield tuple_paragraph
|
||||||
|
|
||||||
|
|
||||||
|
def json_to_tuple(doc):
|
||||||
|
"""Convert an item in the JSON-formatted training data to the tuple format
|
||||||
|
used by GoldParse.
|
||||||
|
|
||||||
|
doc (dict): One entry in the training data.
|
||||||
|
YIELDS (tuple): The reformatted data.
|
||||||
|
"""
|
||||||
|
paragraphs = []
|
||||||
|
for paragraph in doc["paragraphs"]:
|
||||||
|
sents = []
|
||||||
|
for sent in paragraph["sentences"]:
|
||||||
|
words = []
|
||||||
|
ids = []
|
||||||
|
tags = []
|
||||||
|
heads = []
|
||||||
|
labels = []
|
||||||
|
ner = []
|
||||||
|
for i, token in enumerate(sent["tokens"]):
|
||||||
|
words.append(token["orth"])
|
||||||
|
ids.append(i)
|
||||||
|
tags.append(token.get('tag', "-"))
|
||||||
|
heads.append(token.get("head", 0) + i)
|
||||||
|
labels.append(token.get("dep", ""))
|
||||||
|
# Ensure ROOT label is case-insensitive
|
||||||
|
if labels[-1].lower() == "root":
|
||||||
|
labels[-1] = "ROOT"
|
||||||
|
ner.append(token.get("ner", "-"))
|
||||||
|
sents.append([
|
||||||
|
[ids, words, tags, heads, labels, ner],
|
||||||
|
sent.get("brackets", [])])
|
||||||
|
if sents:
|
||||||
|
yield [paragraph.get("raw", None), sents]
|
||||||
|
|
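The tuple layout produced above is easiest to see on a tiny hand-built document in the JSON training format; a small sketch, assuming spaCy v2.1.

from spacy.gold import json_to_tuple

doc = {
    "paragraphs": [{
        "raw": "I like London.",
        "sentences": [{
            "tokens": [
                {"orth": "I", "tag": "PRP", "head": 1, "dep": "nsubj", "ner": "O"},
                {"orth": "like", "tag": "VBP", "head": 0, "dep": "ROOT", "ner": "O"},
                {"orth": "London", "tag": "NNP", "head": -1, "dep": "dobj", "ner": "U-GPE"},
                {"orth": ".", "tag": ".", "head": -2, "dep": "punct", "ner": "O"},
            ]
        }]
    }]
}
for raw, sents in json_to_tuple(doc):
    (ids, words, tags, heads, labels, ner), brackets = sents[0]
    print(words)    # ['I', 'like', 'London', '.']
    print(heads)    # [1, 1, 1, 1] -- relative heads converted to absolute indices
    print(ner)      # ['O', 'O', 'U-GPE', 'O']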
||||||
|
|
||||||
def read_json_file(loc, docs_filter=None, limit=None):
|
def read_json_file(loc, docs_filter=None, limit=None):
|
||||||
loc = util.ensure_path(loc)
|
loc = util.ensure_path(loc)
|
||||||
if loc.is_dir():
|
if loc.is_dir():
|
||||||
for filename in loc.iterdir():
|
for filename in loc.iterdir():
|
||||||
yield from read_json_file(loc / filename, limit=limit)
|
yield from read_json_file(loc / filename, limit=limit)
|
||||||
else:
|
else:
|
||||||
with loc.open('r', encoding='utf8') as file_:
|
for doc in _json_iterate(loc):
|
||||||
docs = ujson.load(file_)
|
|
||||||
if limit is not None:
|
|
||||||
docs = docs[:limit]
|
|
||||||
for doc in docs:
|
|
||||||
if docs_filter is not None and not docs_filter(doc):
|
if docs_filter is not None and not docs_filter(doc):
|
||||||
continue
|
continue
|
||||||
paragraphs = []
|
for json_tuple in json_to_tuple(doc):
|
||||||
for paragraph in doc['paragraphs']:
|
yield json_tuple
|
||||||
sents = []
|
|
||||||
for sent in paragraph['sentences']:
|
|
||||||
words = []
|
def _json_iterate(loc):
|
||||||
ids = []
|
# We should've made these files jsonl...But since we didn't, parse out
|
||||||
tags = []
|
# the docs one-by-one to reduce memory usage.
|
||||||
heads = []
|
# It's okay to read in the whole file -- just don't parse it into JSON.
|
||||||
labels = []
|
cdef bytes py_raw
|
||||||
ner = []
|
loc = util.ensure_path(loc)
|
||||||
for i, token in enumerate(sent['tokens']):
|
with loc.open("rb") as file_:
|
||||||
words.append(token['orth'])
|
py_raw = file_.read()
|
||||||
ids.append(i)
|
raw = <char*>py_raw
|
||||||
tags.append(token.get('tag', '-'))
|
cdef int square_depth = 0
|
||||||
heads.append(token.get('head', 0) + i)
|
cdef int curly_depth = 0
|
||||||
labels.append(token.get('dep', ''))
|
cdef int inside_string = 0
|
||||||
# Ensure ROOT label is case-insensitive
|
cdef int escape = 0
|
||||||
if labels[-1].lower() == 'root':
|
cdef int start = -1
|
||||||
labels[-1] = 'ROOT'
|
cdef char c
|
||||||
ner.append(token.get('ner', '-'))
|
cdef char quote = ord('"')
|
||||||
sents.append([
|
cdef char backslash = ord("\\")
|
||||||
[ids, words, tags, heads, labels, ner],
|
cdef char open_square = ord("[")
|
||||||
sent.get('brackets', [])])
|
cdef char close_square = ord("]")
|
||||||
if sents:
|
cdef char open_curly = ord("{")
|
||||||
yield [paragraph.get('raw', None), sents]
|
cdef char close_curly = ord("}")
|
||||||
|
for i in range(len(py_raw)):
|
||||||
|
c = raw[i]
|
||||||
|
if escape:
|
||||||
|
escape = False
|
||||||
|
continue
|
||||||
|
if c == backslash:
|
||||||
|
escape = True
|
||||||
|
continue
|
||||||
|
if c == quote:
|
||||||
|
inside_string = not inside_string
|
||||||
|
continue
|
||||||
|
if inside_string:
|
||||||
|
continue
|
||||||
|
if c == open_square:
|
||||||
|
square_depth += 1
|
||||||
|
elif c == close_square:
|
||||||
|
square_depth -= 1
|
||||||
|
elif c == open_curly:
|
||||||
|
if square_depth == 1 and curly_depth == 0:
|
||||||
|
start = i
|
||||||
|
curly_depth += 1
|
||||||
|
elif c == close_curly:
|
||||||
|
curly_depth -= 1
|
||||||
|
if square_depth == 1 and curly_depth == 0:
|
||||||
|
py_str = py_raw[start : i + 1].decode("utf8")
|
||||||
|
try:
|
||||||
|
yield srsly.json_loads(py_str)
|
||||||
|
except Exception:
|
||||||
|
print(py_str)
|
||||||
|
raise
|
||||||
|
start = -1
|
||||||
|
|
||||||
|
|
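The Cython helper above streams one document at a time by tracking bracket depth instead of parsing the whole array up front; a rough pure-Python equivalent, shown only to illustrate the idea and not part of spaCy's API.

import json

def iterate_json_docs(raw):
    # Yield each top-level {...} object of a JSON array given as a string.
    square = curly = 0
    inside_string = escape = False
    start = -1
    for i, c in enumerate(raw):
        if escape:
            escape = False
            continue
        if c == "\\":
            escape = True
            continue
        if c == '"':
            inside_string = not inside_string
            continue
        if inside_string:
            continue
        if c == "[":
            square += 1
        elif c == "]":
            square -= 1
        elif c == "{":
            if square == 1 and curly == 0:
                start = i
            curly += 1
        elif c == "}":
            curly -= 1
            if square == 1 and curly == 0:
                yield json.loads(raw[start:i + 1])

print(list(iterate_json_docs('[{"a": 1}, {"b": "x[y]"}]')))   # [{'a': 1}, {'b': 'x[y]'}]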
||||||
def iob_to_biluo(tags):
|
def iob_to_biluo(tags):
|
||||||
|
|
@ -346,7 +390,7 @@ def iob_to_biluo(tags):
|
||||||
|
|
||||||
|
|
||||||
def _consume_os(tags):
|
def _consume_os(tags):
|
||||||
while tags and tags[0] == 'O':
|
while tags and tags[0] == "O":
|
||||||
yield tags.pop(0)
|
yield tags.pop(0)
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -354,24 +398,27 @@ def _consume_ent(tags):
|
||||||
if not tags:
|
if not tags:
|
||||||
return []
|
return []
|
||||||
tag = tags.pop(0)
|
tag = tags.pop(0)
|
||||||
target_in = 'I' + tag[1:]
|
target_in = "I" + tag[1:]
|
||||||
target_last = 'L' + tag[1:]
|
target_last = "L" + tag[1:]
|
||||||
length = 1
|
length = 1
|
||||||
while tags and tags[0] in {target_in, target_last}:
|
while tags and tags[0] in {target_in, target_last}:
|
||||||
length += 1
|
length += 1
|
||||||
tags.pop(0)
|
tags.pop(0)
|
||||||
label = tag[2:]
|
label = tag[2:]
|
||||||
if length == 1:
|
if length == 1:
|
||||||
return ['U-' + label]
|
return ["U-" + label]
|
||||||
else:
|
else:
|
||||||
start = 'B-' + label
|
start = "B-" + label
|
||||||
end = 'L-' + label
|
end = "L-" + label
|
||||||
middle = ['I-%s' % label for _ in range(1, length - 1)]
|
middle = ["I-%s" % label for _ in range(1, length - 1)]
|
||||||
return [start] + middle + [end]
|
return [start] + middle + [end]
|
||||||
|
|
||||||
|
|
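A quick sketch of the conversion these helpers implement, assuming the public wrapper is importable as spacy.gold.iob_to_biluo (spaCy v2.1).

from spacy.gold import iob_to_biluo

print(iob_to_biluo(["O", "I-ORG", "I-ORG", "O", "I-PER"]))
# ['O', 'B-ORG', 'L-ORG', 'O', 'U-PER']
print(iob_to_biluo(["B-LOC", "I-LOC", "I-LOC", "O"]))
# ['B-LOC', 'I-LOC', 'L-LOC', 'O']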
||||||
cdef class GoldParse:
|
cdef class GoldParse:
|
||||||
"""Collection for training annotations."""
|
"""Collection for training annotations.
|
||||||
|
|
||||||
|
DOCS: https://spacy.io/api/goldparse
|
||||||
|
"""
|
||||||
@classmethod
|
@classmethod
|
||||||
def from_annot_tuples(cls, doc, annot_tuples, make_projective=False):
|
def from_annot_tuples(cls, doc, annot_tuples, make_projective=False):
|
||||||
_, words, tags, heads, deps, entities = annot_tuples
|
_, words, tags, heads, deps, entities = annot_tuples
|
||||||
|
|
@ -380,7 +427,7 @@ cdef class GoldParse:
|
||||||
|
|
||||||
def __init__(self, doc, annot_tuples=None, words=None, tags=None,
|
def __init__(self, doc, annot_tuples=None, words=None, tags=None,
|
||||||
heads=None, deps=None, entities=None, make_projective=False,
|
heads=None, deps=None, entities=None, make_projective=False,
|
||||||
cats=None):
|
cats=None, **_):
|
||||||
"""Create a GoldParse.
|
"""Create a GoldParse.
|
||||||
|
|
||||||
doc (Doc): The document the annotations refer to.
|
doc (Doc): The document the annotations refer to.
|
||||||
|
|
@ -414,10 +461,14 @@ cdef class GoldParse:
|
||||||
if deps is None:
|
if deps is None:
|
||||||
deps = [None for _ in doc]
|
deps = [None for _ in doc]
|
||||||
if entities is None:
|
if entities is None:
|
||||||
entities = [None for _ in doc]
|
entities = ["-" for _ in doc]
|
||||||
elif len(entities) == 0:
|
elif len(entities) == 0:
|
||||||
entities = ['O' for _ in doc]
|
entities = ["O" for _ in doc]
|
||||||
elif not isinstance(entities[0], basestring):
|
else:
|
||||||
|
# Translate the None values to '-', to make processing easier.
|
||||||
|
# See Issue #2603
|
||||||
|
entities = [(ent if ent is not None else "-") for ent in entities]
|
||||||
|
if not isinstance(entities[0], basestring):
|
||||||
# Assume we have entities specified by character offset.
|
# Assume we have entities specified by character offset.
|
||||||
entities = biluo_tags_from_offsets(doc, entities)
|
entities = biluo_tags_from_offsets(doc, entities)
|
||||||
|
|
||||||
|
|
@ -440,8 +491,21 @@ cdef class GoldParse:
|
||||||
self.labels = [None] * len(doc)
|
self.labels = [None] * len(doc)
|
||||||
self.ner = [None] * len(doc)
|
self.ner = [None] * len(doc)
|
||||||
|
|
||||||
self.cand_to_gold = align([t.orth_ for t in doc], words)
|
# This needs to be done before we align the words
|
||||||
self.gold_to_cand = align(words, [t.orth_ for t in doc])
|
if make_projective and heads is not None and deps is not None:
|
||||||
|
heads, deps = nonproj.projectivize(heads, deps)
|
||||||
|
|
||||||
|
# Do many-to-one alignment for misaligned tokens.
|
||||||
|
# If we over-segment, we'll have one gold word that covers a sequence
|
||||||
|
# of predicted words
|
||||||
|
# If we under-segment, we'll have one predicted word that covers a
|
||||||
|
# sequence of gold words.
|
||||||
|
# If we "mis-segment", we'll have a sequence of predicted words covering
|
||||||
|
# a sequence of gold words. That's many-to-many -- we don't do that.
|
||||||
|
cost, i2j, j2i, i2j_multi, j2i_multi = align([t.orth_ for t in doc], words)
|
||||||
|
|
||||||
|
self.cand_to_gold = [(j if j >= 0 else None) for j in i2j]
|
||||||
|
self.gold_to_cand = [(i if i >= 0 else None) for i in j2i]
|
||||||
|
|
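# Hypothetical illustration of the mapping above: if the predicted tokens are
# ["I", "can", "not", "fly"] and the gold tokens are ["I", "cannot", "fly"],
# the aligner reports a many-to-one match for the over-segmented word, so
# cand_to_gold would look roughly like [0, None, None, 2] while i2j_multi maps
# both predicted indices 1 and 2 to gold index 1, which is what the i2j_multi
# branch below relies on to copy the gold annotation onto each piece.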
||||||
annot_tuples = (range(len(words)), words, tags, heads, deps, entities)
|
annot_tuples = (range(len(words)), words, tags, heads, deps, entities)
|
||||||
self.orig_annot = list(zip(*annot_tuples))
|
self.orig_annot = list(zip(*annot_tuples))
|
||||||
|
|
@ -449,12 +513,47 @@ cdef class GoldParse:
|
||||||
for i, gold_i in enumerate(self.cand_to_gold):
|
for i, gold_i in enumerate(self.cand_to_gold):
|
||||||
if doc[i].text.isspace():
|
if doc[i].text.isspace():
|
||||||
self.words[i] = doc[i].text
|
self.words[i] = doc[i].text
|
||||||
self.tags[i] = '_SP'
|
self.tags[i] = "_SP"
|
||||||
self.heads[i] = None
|
self.heads[i] = None
|
||||||
self.labels[i] = None
|
self.labels[i] = None
|
||||||
self.ner[i] = 'O'
|
self.ner[i] = "O"
|
||||||
if gold_i is None:
|
if gold_i is None:
|
||||||
pass
|
if i in i2j_multi:
|
||||||
|
self.words[i] = words[i2j_multi[i]]
|
||||||
|
self.tags[i] = tags[i2j_multi[i]]
|
||||||
|
is_last = i2j_multi[i] != i2j_multi.get(i+1)
|
||||||
|
is_first = i2j_multi[i] != i2j_multi.get(i-1)
|
||||||
|
# Set next word in multi-token span as head, until last
|
||||||
|
if not is_last:
|
||||||
|
self.heads[i] = i+1
|
||||||
|
self.labels[i] = "subtok"
|
||||||
|
else:
|
||||||
|
self.heads[i] = self.gold_to_cand[heads[i2j_multi[i]]]
|
||||||
|
self.labels[i] = deps[i2j_multi[i]]
|
||||||
|
# Now set NER...This is annoying because if we've split
|
||||||
|
# got an entity word split into two, we need to adjust the
|
||||||
|
# BILOU tags. We can't have BB or LL etc.
|
||||||
|
# Case 1: O -- easy.
|
||||||
|
ner_tag = entities[i2j_multi[i]]
|
||||||
|
if ner_tag == "O":
|
||||||
|
self.ner[i] = "O"
|
||||||
|
# Case 2: U. This has to become a B I* L sequence.
|
||||||
|
elif ner_tag.startswith("U-"):
|
||||||
|
if is_first:
|
||||||
|
self.ner[i] = ner_tag.replace("U-", "B-", 1)
|
||||||
|
elif is_last:
|
||||||
|
self.ner[i] = ner_tag.replace("U-", "L-", 1)
|
||||||
|
else:
|
||||||
|
self.ner[i] = ner_tag.replace("U-", "I-", 1)
|
||||||
|
# Case 3: L. If not last, change to I.
|
||||||
|
elif ner_tag.startswith("L-"):
|
||||||
|
if is_last:
|
||||||
|
self.ner[i] = ner_tag
|
||||||
|
else:
|
||||||
|
self.ner[i] = ner_tag.replace("L-", "I-", 1)
|
||||||
|
# Case 4: I. Stays correct
|
||||||
|
elif ner_tag.startswith("I-"):
|
||||||
|
self.ner[i] = ner_tag
|
||||||
else:
|
else:
|
||||||
self.words[i] = words[gold_i]
|
self.words[i] = words[gold_i]
|
||||||
self.tags[i] = tags[gold_i]
|
self.tags[i] = tags[gold_i]
|
||||||
|
|
@ -469,10 +568,6 @@ cdef class GoldParse:
|
||||||
if cycle is not None:
|
if cycle is not None:
|
||||||
raise ValueError(Errors.E069.format(cycle=cycle))
|
raise ValueError(Errors.E069.format(cycle=cycle))
|
||||||
|
|
||||||
if make_projective:
|
|
||||||
proj_heads, _ = nonproj.projectivize(self.heads, self.labels)
|
|
||||||
self.heads = proj_heads
|
|
||||||
|
|
||||||
def __len__(self):
|
def __len__(self):
|
||||||
"""Get the number of gold-standard tokens.
|
"""Get the number of gold-standard tokens.
|
||||||
|
|
||||||
|
|
@ -487,12 +582,38 @@ cdef class GoldParse:
|
||||||
"""
|
"""
|
||||||
return not nonproj.is_nonproj_tree(self.heads)
|
return not nonproj.is_nonproj_tree(self.heads)
|
||||||
|
|
||||||
@property
|
property sent_starts:
|
||||||
def sent_starts(self):
|
def __get__(self):
|
||||||
return [self.c.sent_start[i] for i in range(self.length)]
|
return [self.c.sent_start[i] for i in range(self.length)]
|
||||||
|
|
||||||
|
def __set__(self, sent_starts):
|
||||||
|
for gold_i, is_sent_start in enumerate(sent_starts):
|
||||||
|
i = self.gold_to_cand[gold_i]
|
||||||
|
if i is not None:
|
||||||
|
if is_sent_start in (1, True):
|
||||||
|
self.c.sent_start[i] = 1
|
||||||
|
elif is_sent_start in (-1, False):
|
||||||
|
self.c.sent_start[i] = -1
|
||||||
|
else:
|
||||||
|
self.c.sent_start[i] = 0
|
||||||
|
|
||||||
def biluo_tags_from_offsets(doc, entities, missing='O'):
|
|
||||||
|
def docs_to_json(docs, underscore=None):
|
||||||
|
"""Convert a list of Doc objects into the JSON-serializable format used by
|
||||||
|
the spacy train command.
|
||||||
|
|
||||||
|
docs (iterable / Doc): The Doc object(s) to convert.
|
||||||
|
underscore (list): Optional list of string names of custom doc._.
|
||||||
|
attributes. Attribute values need to be JSON-serializable. Values will
|
||||||
|
be added to an "_" key in the data, e.g. "_": {"foo": "bar"}.
|
||||||
|
RETURNS (list): The data in spaCy's JSON format.
|
||||||
|
"""
|
||||||
|
if isinstance(docs, Doc):
|
||||||
|
docs = [docs]
|
||||||
|
return [doc.to_json(underscore=underscore) for doc in docs]
|
||||||
|
|
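A short sketch of how the new converter is typically used, assuming spaCy v2.1; the sentencizer is added only so the blank pipeline has sentence boundaries to serialize.

import spacy
from spacy.gold import docs_to_json

nlp = spacy.blank("en")
nlp.add_pipe(nlp.create_pipe("sentencizer"))
doc = nlp("Apple is looking at buying a U.K. startup.")
json_data = docs_to_json(doc)      # a single Doc is accepted as well as a list
print(len(json_data))              # 1, one JSON-serializable dict per Doc
print(sorted(json_data[0]))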
||||||
|
|
||||||
|
def biluo_tags_from_offsets(doc, entities, missing="O"):
|
||||||
"""Encode labelled spans into per-token tags, using the
|
"""Encode labelled spans into per-token tags, using the
|
||||||
Begin/In/Last/Unit/Out scheme (BILUO).
|
Begin/In/Last/Unit/Out scheme (BILUO).
|
||||||
|
|
||||||
|
|
@ -515,11 +636,11 @@ def biluo_tags_from_offsets(doc, entities, missing='O'):
|
||||||
>>> entities = [(len('I like '), len('I like London'), 'LOC')]
|
>>> entities = [(len('I like '), len('I like London'), 'LOC')]
|
||||||
>>> doc = nlp.tokenizer(text)
|
>>> doc = nlp.tokenizer(text)
|
||||||
>>> tags = biluo_tags_from_offsets(doc, entities)
|
>>> tags = biluo_tags_from_offsets(doc, entities)
|
||||||
>>> assert tags == ['O', 'O', 'U-LOC', 'O']
|
>>> assert tags == ["O", "O", 'U-LOC', "O"]
|
||||||
"""
|
"""
|
||||||
starts = {token.idx: token.i for token in doc}
|
starts = {token.idx: token.i for token in doc}
|
||||||
ends = {token.idx + len(token): token.i for token in doc}
|
ends = {token.idx + len(token): token.i for token in doc}
|
||||||
biluo = ['-' for _ in doc]
|
biluo = ["-" for _ in doc]
|
||||||
# Handle entity cases
|
# Handle entity cases
|
||||||
for start_char, end_char, label in entities:
|
for start_char, end_char, label in entities:
|
||||||
start_token = starts.get(start_char)
|
start_token = starts.get(start_char)
|
||||||
|
|
@ -527,12 +648,12 @@ def biluo_tags_from_offsets(doc, entities, missing='O'):
|
||||||
# Only interested if the tokenization is correct
|
# Only interested if the tokenization is correct
|
||||||
if start_token is not None and end_token is not None:
|
if start_token is not None and end_token is not None:
|
||||||
if start_token == end_token:
|
if start_token == end_token:
|
||||||
biluo[start_token] = 'U-%s' % label
|
biluo[start_token] = "U-%s" % label
|
||||||
else:
|
else:
|
||||||
biluo[start_token] = 'B-%s' % label
|
biluo[start_token] = "B-%s" % label
|
||||||
for i in range(start_token+1, end_token):
|
for i in range(start_token+1, end_token):
|
||||||
biluo[i] = 'I-%s' % label
|
biluo[i] = "I-%s" % label
|
||||||
biluo[end_token] = 'L-%s' % label
|
biluo[end_token] = "L-%s" % label
|
||||||
# Now distinguish the O cases from ones where we miss the tokenization
|
# Now distinguish the O cases from ones where we miss the tokenization
|
||||||
entity_chars = set()
|
entity_chars = set()
|
||||||
for start_char, end_char, label in entities:
|
for start_char, end_char, label in entities:
|
||||||
|
|
@ -547,6 +668,24 @@ def biluo_tags_from_offsets(doc, entities, missing='O'):
|
||||||
return biluo
|
return biluo
|
||||||
|
|
||||||
|
|
||||||
|
def spans_from_biluo_tags(doc, tags):
|
||||||
|
"""Encode per-token tags following the BILUO scheme into Span object, e.g.
|
||||||
|
to overwrite the doc.ents.
|
||||||
|
|
||||||
|
doc (Doc): The document that the BILUO tags refer to.
|
||||||
|
tags (iterable): A sequence of BILUO tags with each tag describing one
|
||||||
|
token. Each tag string will be of the form of either "", "O" or
|
||||||
|
"{action}-{label}", where action is one of "B", "I", "L", "U".
|
||||||
|
RETURNS (list): A sequence of Span objects.
|
||||||
|
"""
|
||||||
|
token_offsets = tags_to_entities(tags)
|
||||||
|
spans = []
|
||||||
|
for label, start_idx, end_idx in token_offsets:
|
||||||
|
span = Span(doc, start_idx, end_idx + 1, label=label)
|
||||||
|
spans.append(span)
|
||||||
|
return spans
|
||||||
|
|
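A small sketch of the typical use of the new helper to set entities from predicted tags, assuming spaCy v2.1.

import spacy
from spacy.gold import spans_from_biluo_tags

nlp = spacy.blank("en")
doc = nlp("I like London and Berlin")
tags = ["O", "O", "U-LOC", "O", "U-LOC"]
doc.ents = spans_from_biluo_tags(doc, tags)
print([(ent.text, ent.label_) for ent in doc.ents])
# [('London', 'LOC'), ('Berlin', 'LOC')]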
||||||
|
|
||||||
def offsets_from_biluo_tags(doc, tags):
|
def offsets_from_biluo_tags(doc, tags):
|
||||||
"""Encode per-token tags following the BILUO scheme into entity offsets.
|
"""Encode per-token tags following the BILUO scheme into entity offsets.
|
||||||
|
|
||||||
|
|
@ -558,13 +697,9 @@ def offsets_from_biluo_tags(doc, tags):
|
||||||
`end` will be character-offset integers denoting the slice into the
|
`end` will be character-offset integers denoting the slice into the
|
||||||
original string.
|
original string.
|
||||||
"""
|
"""
|
||||||
token_offsets = tags_to_entities(tags)
|
spans = spans_from_biluo_tags(doc, tags)
|
||||||
offsets = []
|
return [(span.start_char, span.end_char, span.label_) for span in spans]
|
||||||
for label, start_idx, end_idx in token_offsets:
|
|
||||||
span = doc[start_idx : end_idx + 1]
|
|
||||||
offsets.append((span.start_char, span.end_char, label))
|
|
||||||
return offsets
|
|
||||||
|
|
||||||
|
|
||||||
def is_punct_label(label):
|
def is_punct_label(label):
|
||||||
return label == 'P' or label.lower() == 'punct'
|
return label == "P" or label.lower() == "punct"
|
||||||
|
|
|
||||||
20
spacy/lang/af/__init__.py
Normal file
20
spacy/lang/af/__init__.py
Normal file
|
|
@ -0,0 +1,20 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from .stop_words import STOP_WORDS
|
||||||
|
from ...language import Language
|
||||||
|
from ...attrs import LANG
|
||||||
|
|
||||||
|
|
||||||
|
class AfrikaansDefaults(Language.Defaults):
|
||||||
|
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||||
|
lex_attr_getters[LANG] = lambda text: "af"
|
||||||
|
stop_words = STOP_WORDS
|
||||||
|
|
||||||
|
|
||||||
|
class Afrikaans(Language):
|
||||||
|
lang = "af"
|
||||||
|
Defaults = AfrikaansDefaults
|
||||||
|
|
||||||
|
|
||||||
|
__all__ = ["Afrikaans"]
|
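With this module in place, the new language can be loaded like any other blank pipeline; a small sketch, assuming spaCy v2.1 (the sample sentence is only illustrative).

import spacy

nlp = spacy.blank("af")                    # uses the AfrikaansDefaults above
doc = nlp("Ek hou baie van Kaapstad.")
print([token.text for token in doc])
print([token.is_stop for token in doc])    # stop words come from STOP_WORDS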
||||||
61
spacy/lang/af/stop_words.py
Normal file
61
spacy/lang/af/stop_words.py
Normal file
|
|
@ -0,0 +1,61 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
|
# Source: https://github.com/stopwords-iso/stopwords-af
|
||||||
|
|
||||||
|
STOP_WORDS = set(
|
||||||
|
"""
|
||||||
|
'n
|
||||||
|
aan
|
||||||
|
af
|
||||||
|
al
|
||||||
|
as
|
||||||
|
baie
|
||||||
|
by
|
||||||
|
daar
|
||||||
|
dag
|
||||||
|
dat
|
||||||
|
die
|
||||||
|
dit
|
||||||
|
een
|
||||||
|
ek
|
||||||
|
en
|
||||||
|
gaan
|
||||||
|
gesê
|
||||||
|
haar
|
||||||
|
het
|
||||||
|
hom
|
||||||
|
hulle
|
||||||
|
hy
|
||||||
|
in
|
||||||
|
is
|
||||||
|
jou
|
||||||
|
jy
|
||||||
|
kan
|
||||||
|
kom
|
||||||
|
ma
|
||||||
|
maar
|
||||||
|
met
|
||||||
|
my
|
||||||
|
na
|
||||||
|
nie
|
||||||
|
om
|
||||||
|
ons
|
||||||
|
op
|
||||||
|
saam
|
||||||
|
sal
|
||||||
|
se
|
||||||
|
sien
|
||||||
|
so
|
||||||
|
sy
|
||||||
|
te
|
||||||
|
toe
|
||||||
|
uit
|
||||||
|
van
|
||||||
|
vir
|
||||||
|
was
|
||||||
|
wat
|
||||||
|
ʼn
|
||||||
|
""".split()
|
||||||
|
)
|
||||||
|
|
@ -16,16 +16,19 @@ from ...util import update_exc, add_lookups
|
||||||
class ArabicDefaults(Language.Defaults):
|
class ArabicDefaults(Language.Defaults):
|
||||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||||
lex_attr_getters.update(LEX_ATTRS)
|
lex_attr_getters.update(LEX_ATTRS)
|
||||||
lex_attr_getters[LANG] = lambda text: 'ar'
|
lex_attr_getters[LANG] = lambda text: "ar"
|
||||||
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
|
lex_attr_getters[NORM] = add_lookups(
|
||||||
|
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
|
||||||
|
)
|
||||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||||
stop_words = STOP_WORDS
|
stop_words = STOP_WORDS
|
||||||
suffixes = TOKENIZER_SUFFIXES
|
suffixes = TOKENIZER_SUFFIXES
|
||||||
|
writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}
|
||||||
|
|
||||||
|
|
||||||
class Arabic(Language):
|
class Arabic(Language):
|
||||||
lang = 'ar'
|
lang = "ar"
|
||||||
Defaults = ArabicDefaults
|
Defaults = ArabicDefaults
|
||||||
|
|
||||||
|
|
||||||
__all__ = ['Arabic']
|
__all__ = ["Arabic"]
|
||||||
|
|
|
||||||
|
|
@ -10,11 +10,11 @@ Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
sentences = [
|
sentences = [
|
||||||
"نال الكاتب خالد توفيق جائزة الرواية العربية في معرض الشارقة الدولي للكتاب",
|
"نال الكاتب خالد توفيق جائزة الرواية العربية في معرض الشارقة الدولي للكتاب",
|
||||||
"أين تقع دمشق ؟"
|
"أين تقع دمشق ؟",
|
||||||
"كيف حالك ؟",
|
"كيف حالك ؟",
|
||||||
"هل يمكن ان نلتقي على الساعة الثانية عشرة ظهرا ؟",
|
"هل يمكن ان نلتقي على الساعة الثانية عشرة ظهرا ؟",
|
||||||
"ماهي أبرز التطورات السياسية، الأمنية والاجتماعية في العالم ؟",
|
"ماهي أبرز التطورات السياسية، الأمنية والاجتماعية في العالم ؟",
|
||||||
"هل بالإمكان أن نلتقي غدا؟",
|
"هل بالإمكان أن نلتقي غدا؟",
|
||||||
"هناك نحو 382 مليون شخص مصاب بداء السكَّري في العالم",
|
"هناك نحو 382 مليون شخص مصاب بداء السكَّري في العالم",
|
||||||
"كشفت دراسة حديثة أن الخيل تقرأ تعبيرات الوجه وتستطيع أن تتذكر مشاعر الناس وعواطفهم"
|
"كشفت دراسة حديثة أن الخيل تقرأ تعبيرات الوجه وتستطيع أن تتذكر مشاعر الناس وعواطفهم",
|
||||||
]
|
]
|
||||||
|
|
|
||||||
|
|
@ -2,7 +2,8 @@
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
from ...attrs import LIKE_NUM
|
from ...attrs import LIKE_NUM
|
||||||
|
|
||||||
_num_words = set("""
|
_num_words = set(
|
||||||
|
"""
|
||||||
صفر
|
صفر
|
||||||
واحد
|
واحد
|
||||||
إثنان
|
إثنان
|
||||||
|
|
@ -52,9 +53,11 @@ _num_words = set("""
|
||||||
مليون
|
مليون
|
||||||
مليار
|
مليار
|
||||||
مليارات
|
مليارات
|
||||||
""".split())
|
""".split()
|
||||||
|
)
|
||||||
|
|
||||||
_ordinal_words = set("""
|
_ordinal_words = set(
|
||||||
|
"""
|
||||||
اول
|
اول
|
||||||
أول
|
أول
|
||||||
حاد
|
حاد
|
||||||
|
|
@ -69,18 +72,21 @@ _ordinal_words = set("""
|
||||||
ثامن
|
ثامن
|
||||||
تاسع
|
تاسع
|
||||||
عاشر
|
عاشر
|
||||||
""".split())
|
""".split()
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
def like_num(text):
|
def like_num(text):
|
||||||
"""
|
"""
|
||||||
check if text resembles a number
|
Check if text resembles a number
|
||||||
"""
|
"""
|
||||||
text = text.replace(',', '').replace('.', '')
|
if text.startswith(("+", "-", "±", "~")):
|
||||||
|
text = text[1:]
|
||||||
|
text = text.replace(",", "").replace(".", "")
|
||||||
if text.isdigit():
|
if text.isdigit():
|
||||||
return True
|
return True
|
||||||
if text.count('/') == 1:
|
if text.count("/") == 1:
|
||||||
num, denom = text.split('/')
|
num, denom = text.split("/")
|
||||||
if num.isdigit() and denom.isdigit():
|
if num.isdigit() and denom.isdigit():
|
||||||
return True
|
return True
|
||||||
if text in _num_words:
|
if text in _num_words:
|
||||||
|
|
@ -90,6 +96,4 @@ def like_num(text):
|
||||||
return False
|
return False
|
||||||
|
|
||||||
|
|
||||||
LEX_ATTRS = {
|
LEX_ATTRS = {LIKE_NUM: like_num}
|
||||||
LIKE_NUM: like_num
|
|
||||||
}
|
|
||||||
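The sign-stripping added above changes what counts as number-like; a small sketch, assuming spaCy v2.1 and the module path shown in this diff.

from spacy.lang.ar.lex_attrs import like_num

print(like_num("10,000"))   # True: separators are stripped before isdigit()
print(like_num("+10"))      # True: a leading sign is now stripped as well
print(like_num("3/4"))      # True: simple fractions are recognised
print(like_num("مليون"))    # True: listed Arabic number word
print(like_num("كتاب"))     # False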
|
|
|
||||||
|
|
@ -1,15 +1,20 @@
|
||||||
# coding: utf8
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from ..punctuation import TOKENIZER_INFIXES
|
|
||||||
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
|
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
|
||||||
from ..char_classes import QUOTES, UNITS, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
from ..char_classes import UNITS, ALPHA_UPPER
|
||||||
|
|
||||||
_suffixes = (LIST_PUNCT + LIST_ELLIPSES + LIST_QUOTES +
|
_suffixes = (
|
||||||
[r'(?<=[0-9])\+',
|
LIST_PUNCT
|
||||||
|
+ LIST_ELLIPSES
|
||||||
|
+ LIST_QUOTES
|
||||||
|
+ [
|
||||||
|
r"(?<=[0-9])\+",
|
||||||
# Arabic is written from Right-To-Left
|
# Arabic is written from Right-To-Left
|
||||||
r'(?<=[0-9])(?:{})'.format(CURRENCY),
|
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
|
||||||
r'(?<=[0-9])(?:{})'.format(UNITS),
|
r"(?<=[0-9])(?:{u})".format(u=UNITS),
|
||||||
r'(?<=[{au}][{au}])\.'.format(au=ALPHA_UPPER)])
|
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
|
||||||
|
]
|
||||||
|
)
|
||||||
|
|
||||||
TOKENIZER_SUFFIXES = _suffixes
|
TOKENIZER_SUFFIXES = _suffixes
|
||||||
|
|
|
||||||
|
|
@ -1,7 +1,8 @@
|
||||||
# coding: utf8
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
STOP_WORDS = set("""
|
STOP_WORDS = set(
|
||||||
|
"""
|
||||||
من
|
من
|
||||||
نحو
|
نحو
|
||||||
لعل
|
لعل
|
||||||
|
|
@ -388,4 +389,5 @@ STOP_WORDS = set("""
|
||||||
وإن
|
وإن
|
||||||
ولو
|
ولو
|
||||||
يا
|
يا
|
||||||
""".split())
|
""".split()
|
||||||
|
)
|
||||||
|
|
|
||||||
|
|
@ -1,21 +1,23 @@
|
||||||
# coding: utf8
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA
|
from ...symbols import ORTH, LEMMA
|
||||||
import re
|
|
||||||
|
|
||||||
_exc = {}
|
_exc = {}
|
||||||
|
|
||||||
# time
|
|
||||||
|
# Time
|
||||||
for exc_data in [
|
for exc_data in [
|
||||||
{LEMMA: "قبل الميلاد", ORTH: "ق.م"},
|
{LEMMA: "قبل الميلاد", ORTH: "ق.م"},
|
||||||
{LEMMA: "بعد الميلاد", ORTH: "ب. م"},
|
{LEMMA: "بعد الميلاد", ORTH: "ب. م"},
|
||||||
{LEMMA: "ميلادي", ORTH: ".م"},
|
{LEMMA: "ميلادي", ORTH: ".م"},
|
||||||
{LEMMA: "هجري", ORTH: ".هـ"},
|
{LEMMA: "هجري", ORTH: ".هـ"},
|
||||||
{LEMMA: "توفي", ORTH: ".ت"}]:
|
{LEMMA: "توفي", ORTH: ".ت"},
|
||||||
|
]:
|
||||||
_exc[exc_data[ORTH]] = [exc_data]
|
_exc[exc_data[ORTH]] = [exc_data]
|
||||||
|
|
||||||
# scientific abv.
|
# Scientific abv.
|
||||||
for exc_data in [
|
for exc_data in [
|
||||||
{LEMMA: "صلى الله عليه وسلم", ORTH: "صلعم"},
|
{LEMMA: "صلى الله عليه وسلم", ORTH: "صلعم"},
|
||||||
{LEMMA: "الشارح", ORTH: "الشـ"},
|
{LEMMA: "الشارح", ORTH: "الشـ"},
|
||||||
|
|
@ -28,20 +30,20 @@ for exc_data in [
|
||||||
{LEMMA: "أنبأنا", ORTH: "أنا"},
|
{LEMMA: "أنبأنا", ORTH: "أنا"},
|
||||||
{LEMMA: "أخبرنا", ORTH: "نا"},
|
{LEMMA: "أخبرنا", ORTH: "نا"},
|
||||||
{LEMMA: "مصدر سابق", ORTH: "م. س"},
|
{LEMMA: "مصدر سابق", ORTH: "م. س"},
|
||||||
{LEMMA: "مصدر نفسه", ORTH: "م. ن"}]:
|
{LEMMA: "مصدر نفسه", ORTH: "م. ن"},
|
||||||
|
]:
|
||||||
_exc[exc_data[ORTH]] = [exc_data]
|
_exc[exc_data[ORTH]] = [exc_data]
|
||||||
|
|
||||||
# other abv.
|
# Other abv.
|
||||||
for exc_data in [
|
for exc_data in [
|
||||||
{LEMMA: "دكتور", ORTH: "د."},
|
{LEMMA: "دكتور", ORTH: "د."},
|
||||||
{LEMMA: "أستاذ دكتور", ORTH: "أ.د"},
|
{LEMMA: "أستاذ دكتور", ORTH: "أ.د"},
|
||||||
{LEMMA: "أستاذ", ORTH: "أ."},
|
{LEMMA: "أستاذ", ORTH: "أ."},
|
||||||
{LEMMA: "بروفيسور", ORTH: "ب."}]:
|
{LEMMA: "بروفيسور", ORTH: "ب."},
|
||||||
|
]:
|
||||||
_exc[exc_data[ORTH]] = [exc_data]
|
_exc[exc_data[ORTH]] = [exc_data]
|
||||||
|
|
||||||
for exc_data in [
|
for exc_data in [{LEMMA: "تلفون", ORTH: "ت."}, {LEMMA: "صندوق بريد", ORTH: "ص.ب"}]:
|
||||||
{LEMMA: "تلفون", ORTH: "ت."},
|
|
||||||
{LEMMA: "صندوق بريد", ORTH: "ص.ب"}]:
|
|
||||||
_exc[exc_data[ORTH]] = [exc_data]
|
_exc[exc_data[ORTH]] = [exc_data]
|
||||||
|
|
||||||
TOKENIZER_EXCEPTIONS = _exc
|
TOKENIZER_EXCEPTIONS = _exc
|
||||||
|
|
|
||||||
20
spacy/lang/bg/__init__.py
Normal file
20
spacy/lang/bg/__init__.py
Normal file
|
|
@ -0,0 +1,20 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from .stop_words import STOP_WORDS
|
||||||
|
from ...language import Language
|
||||||
|
from ...attrs import LANG
|
||||||
|
|
||||||
|
|
||||||
|
class BulgarianDefaults(Language.Defaults):
|
||||||
|
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||||
|
lex_attr_getters[LANG] = lambda text: "bg"
|
||||||
|
stop_words = STOP_WORDS
|
||||||
|
|
||||||
|
|
||||||
|
class Bulgarian(Language):
|
||||||
|
lang = "bg"
|
||||||
|
Defaults = BulgarianDefaults
|
||||||
|
|
||||||
|
|
||||||
|
__all__ = ["Bulgarian"]
|
||||||
269
spacy/lang/bg/stop_words.py
Normal file
269
spacy/lang/bg/stop_words.py
Normal file
|
|
@ -0,0 +1,269 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
|
# Source: https://github.com/Alir3z4/stop-words
|
||||||
|
|
||||||
|
STOP_WORDS = set(
|
||||||
|
"""
|
||||||
|
а
|
||||||
|
автентичен
|
||||||
|
аз
|
||||||
|
ако
|
||||||
|
ала
|
||||||
|
бе
|
||||||
|
без
|
||||||
|
беше
|
||||||
|
би
|
||||||
|
бивш
|
||||||
|
бивша
|
||||||
|
бившо
|
||||||
|
бил
|
||||||
|
била
|
||||||
|
били
|
||||||
|
било
|
||||||
|
благодаря
|
||||||
|
близо
|
||||||
|
бъдат
|
||||||
|
бъде
|
||||||
|
бяха
|
||||||
|
в
|
||||||
|
вас
|
||||||
|
ваш
|
||||||
|
ваша
|
||||||
|
вероятно
|
||||||
|
вече
|
||||||
|
взема
|
||||||
|
ви
|
||||||
|
вие
|
||||||
|
винаги
|
||||||
|
внимава
|
||||||
|
време
|
||||||
|
все
|
||||||
|
всеки
|
||||||
|
всички
|
||||||
|
всичко
|
||||||
|
всяка
|
||||||
|
във
|
||||||
|
въпреки
|
||||||
|
върху
|
||||||
|
г
|
||||||
|
ги
|
||||||
|
главен
|
||||||
|
главна
|
||||||
|
главно
|
||||||
|
глас
|
||||||
|
го
|
||||||
|
година
|
||||||
|
години
|
||||||
|
годишен
|
||||||
|
д
|
||||||
|
да
|
||||||
|
дали
|
||||||
|
два
|
||||||
|
двама
|
||||||
|
двамата
|
||||||
|
две
|
||||||
|
двете
|
||||||
|
ден
|
||||||
|
днес
|
||||||
|
дни
|
||||||
|
до
|
||||||
|
добра
|
||||||
|
добре
|
||||||
|
добро
|
||||||
|
добър
|
||||||
|
докато
|
||||||
|
докога
|
||||||
|
дори
|
||||||
|
досега
|
||||||
|
доста
|
||||||
|
друг
|
||||||
|
друга
|
||||||
|
други
|
||||||
|
е
|
||||||
|
евтин
|
||||||
|
едва
|
||||||
|
един
|
||||||
|
една
|
||||||
|
еднаква
|
||||||
|
еднакви
|
||||||
|
еднакъв
|
||||||
|
едно
|
||||||
|
екип
|
||||||
|
ето
|
||||||
|
живот
|
||||||
|
за
|
||||||
|
забавям
|
||||||
|
зад
|
||||||
|
заедно
|
||||||
|
заради
|
||||||
|
засега
|
||||||
|
заспал
|
||||||
|
затова
|
||||||
|
защо
|
||||||
|
защото
|
||||||
|
и
|
||||||
|
из
|
||||||
|
или
|
||||||
|
им
|
||||||
|
има
|
||||||
|
имат
|
||||||
|
иска
|
||||||
|
й
|
||||||
|
каза
|
||||||
|
как
|
||||||
|
каква
|
||||||
|
какво
|
||||||
|
както
|
||||||
|
какъв
|
||||||
|
като
|
||||||
|
кога
|
||||||
|
когато
|
||||||
|
което
|
||||||
|
които
|
||||||
|
кой
|
||||||
|
който
|
||||||
|
колко
|
||||||
|
която
|
||||||
|
къде
|
||||||
|
където
|
||||||
|
към
|
||||||
|
лесен
|
||||||
|
лесно
|
||||||
|
ли
|
||||||
|
лош
|
||||||
|
м
|
||||||
|
май
|
||||||
|
малко
|
||||||
|
ме
|
||||||
|
между
|
||||||
|
мек
|
||||||
|
мен
|
||||||
|
месец
|
||||||
|
ми
|
||||||
|
много
|
||||||
|
мнозина
|
||||||
|
мога
|
||||||
|
могат
|
||||||
|
може
|
||||||
|
мокър
|
||||||
|
моля
|
||||||
|
момента
|
||||||
|
му
|
||||||
|
н
|
||||||
|
на
|
||||||
|
над
|
||||||
|
назад
|
||||||
|
най
|
||||||
|
направи
|
||||||
|
напред
|
||||||
|
например
|
||||||
|
нас
|
||||||
|
не
|
||||||
|
него
|
||||||
|
нещо
|
||||||
|
нея
|
||||||
|
ни
|
||||||
|
ние
|
||||||
|
никой
|
||||||
|
нито
|
||||||
|
нищо
|
||||||
|
но
|
||||||
|
нов
|
||||||
|
нова
|
||||||
|
нови
|
||||||
|
новина
|
||||||
|
някои
|
||||||
|
някой
|
||||||
|
няколко
|
||||||
|
няма
|
||||||
|
обаче
|
||||||
|
около
|
||||||
|
освен
|
||||||
|
особено
|
||||||
|
от
|
||||||
|
отгоре
|
||||||
|
отново
|
||||||
|
още
|
||||||
|
пак
|
||||||
|
по
|
||||||
|
повече
|
||||||
|
повечето
|
||||||
|
под
|
||||||
|
поне
|
||||||
|
поради
|
||||||
|
после
|
||||||
|
почти
|
||||||
|
прави
|
||||||
|
пред
|
||||||
|
преди
|
||||||
|
през
|
||||||
|
при
|
||||||
|
пък
|
||||||
|
първата
|
||||||
|
първи
|
||||||
|
първо
|
||||||
|
пъти
|
||||||
|
равен
|
||||||
|
равна
|
||||||
|
с
|
||||||
|
са
|
||||||
|
сам
|
||||||
|
само
|
||||||
|
се
|
||||||
|
сега
|
||||||
|
си
|
||||||
|
син
|
||||||
|
скоро
|
||||||
|
след
|
||||||
|
следващ
|
||||||
|
сме
|
||||||
|
смях
|
||||||
|
според
|
||||||
|
сред
|
||||||
|
срещу
|
||||||
|
сте
|
||||||
|
съм
|
||||||
|
със
|
||||||
|
също
|
||||||
|
т
|
||||||
|
тази
|
||||||
|
така
|
||||||
|
такива
|
||||||
|
такъв
|
||||||
|
там
|
||||||
|
твой
|
||||||
|
те
|
||||||
|
тези
|
||||||
|
ти
|
||||||
|
т.н.
|
||||||
|
то
|
||||||
|
това
|
||||||
|
тогава
|
||||||
|
този
|
||||||
|
той
|
||||||
|
толкова
|
||||||
|
точно
|
||||||
|
три
|
||||||
|
трябва
|
||||||
|
тук
|
||||||
|
тъй
|
||||||
|
тя
|
||||||
|
тях
|
||||||
|
у
|
||||||
|
утре
|
||||||
|
харесва
|
||||||
|
хиляди
|
||||||
|
ч
|
||||||
|
часа
|
||||||
|
че
|
||||||
|
често
|
||||||
|
чрез
|
||||||
|
ще
|
||||||
|
щом
|
||||||
|
юмрук
|
||||||
|
я
|
||||||
|
як
|
||||||
|
""".split()
|
||||||
|
)
|
||||||
|
|
@ -15,7 +15,7 @@ from ...util import update_exc
|
||||||
|
|
||||||
class BengaliDefaults(Language.Defaults):
|
class BengaliDefaults(Language.Defaults):
|
||||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||||
lex_attr_getters[LANG] = lambda text: 'bn'
|
lex_attr_getters[LANG] = lambda text: "bn"
|
||||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||||
tag_map = TAG_MAP
|
tag_map = TAG_MAP
|
||||||
stop_words = STOP_WORDS
|
stop_words = STOP_WORDS
|
||||||
|
|
@ -26,8 +26,8 @@ class BengaliDefaults(Language.Defaults):
|
||||||
|
|
||||||
|
|
||||||
class Bengali(Language):
|
class Bengali(Language):
|
||||||
lang = 'bn'
|
lang = "bn"
|
||||||
Defaults = BengaliDefaults
|
Defaults = BengaliDefaults
|
||||||
|
|
||||||
|
|
||||||
__all__ = ['Bengali']
|
__all__ = ["Bengali"]
|
||||||
|
|
|
||||||
|
|
@ -13,11 +13,9 @@ LEMMA_RULES = {
|
||||||
["গাছা", ""],
|
["গাছা", ""],
|
||||||
["গাছি", ""],
|
["গাছি", ""],
|
||||||
["ছড়া", ""],
|
["ছড়া", ""],
|
||||||
|
|
||||||
["কে", ""],
|
["কে", ""],
|
||||||
["ে", ""],
|
["ে", ""],
|
||||||
["তে", ""],
|
["তে", ""],
|
||||||
|
|
||||||
["র", ""],
|
["র", ""],
|
||||||
["রা", ""],
|
["রা", ""],
|
||||||
["রে", ""],
|
["রে", ""],
|
||||||
|
|
@ -28,7 +26,6 @@ LEMMA_RULES = {
|
||||||
["গুলা", ""],
|
["গুলা", ""],
|
||||||
["গুলো", ""],
|
["গুলো", ""],
|
||||||
["গুলি", ""],
|
["গুলি", ""],
|
||||||
|
|
||||||
["কুল", ""],
|
["কুল", ""],
|
||||||
["গণ", ""],
|
["গণ", ""],
|
||||||
["দল", ""],
|
["দল", ""],
|
||||||
|
|
@ -45,7 +42,6 @@ LEMMA_RULES = {
|
||||||
["সকল", ""],
|
["সকল", ""],
|
||||||
["মহল", ""],
|
["মহল", ""],
|
||||||
["াবলি", ""], # আবলি
|
["াবলি", ""], # আবলি
|
||||||
|
|
||||||
# Bengali digit representations
|
# Bengali digit representations
|
||||||
["০", "0"],
|
["০", "0"],
|
||||||
["১", "1"],
|
["১", "1"],
|
||||||
|
|
@ -58,11 +54,5 @@ LEMMA_RULES = {
|
||||||
["৮", "8"],
|
["৮", "8"],
|
||||||
["৯", "9"],
|
["৯", "9"],
|
||||||
],
|
],
|
||||||
|
"punct": [["“", '"'], ["”", '"'], ["\u2018", "'"], ["\u2019", "'"]],
|
||||||
"punct": [
|
|
||||||
["“", "\""],
|
|
||||||
["”", "\""],
|
|
||||||
["\u2018", "'"],
|
|
||||||
["\u2019", "'"]
|
|
||||||
]
|
|
||||||
}
|
}
|
||||||
|
|
|
||||||