Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-30 19:24:07 +03:00)

Commit 610dfd85c2: Merge remote-tracking branch 'origin/develop' into rliaw-develop

.github/contributors/Arvindcheenu.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Arvind Srinivasan    |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2020-06-13           |
| GitHub username                | arvindcheenu         |
| Website (optional)             |                      |

.github/contributors/JannisTriesToCode.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
(Standard spaCy contributor agreement text, identical to the copy shown above for Arvindcheenu.md.)

## Contributor Details

| Field                          | Entry                         |
|------------------------------- | ----------------------------- |
| Name                           | Jannis Rauschke               |
| Company name (if applicable)   |                               |
| Title or role (if applicable)  |                               |
| Date                           | 22.05.2020                    |
| GitHub username                | JannisTriesToCode             |
| Website (optional)             | https://twitter.com/JRauschke |

.github/contributors/MartinoMensio.md (modified, 4 lines changed, vendored)
@@ -99,8 +99,8 @@ mark both statements:
 | Field                          | Entry                |
 |------------------------------- | -------------------- |
 | Name                           | Martino Mensio       |
-| Company name (if applicable)   | Polytechnic University of Turin |
-| Title or role (if applicable)  | Student              |
+| Company name (if applicable)   | The Open University  |
+| Title or role (if applicable)  | PhD Student          |
 | Date                           | 17 November 2017     |
 | GitHub username                | MartinoMensio        |
 | Website (optional)             | https://martinomensio.github.io/ |

.github/contributors/R1j1t.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
(Standard spaCy contributor agreement text, identical to the copy shown above for Arvindcheenu.md.)

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Rajat                |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 24 May 2020          |
| GitHub username                | R1j1t                |
| Website (optional)             |                      |

.github/contributors/hiroshi-matsuda-rit.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
(Standard spaCy contributor agreement text, identical to the copy shown above for Arvindcheenu.md.)

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Hiroshi Matsuda      |
| Company name (if applicable)   | Megagon Labs, Tokyo  |
| Title or role (if applicable)  | Research Scientist   |
| Date                           | June 6, 2020         |
| GitHub username                | hiroshi-matsuda-rit  |
| Website (optional)             |                      |

.github/contributors/jonesmartins.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
(Standard spaCy contributor agreement text, identical to the copy shown above for Arvindcheenu.md.)

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Jones Martins        |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2020-06-10           |
| GitHub username                | jonesmartins         |
| Website (optional)             |                      |

.github/contributors/leomrocha.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
(Standard spaCy contributor agreement text, identical to the copy shown above for Arvindcheenu.md.)

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Leonardo M. Rocha    |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  | Eng.                 |
| Date                           | 31/05/2020           |
| GitHub username                | leomrocha            |
| Website (optional)             |                      |

.github/contributors/lfiedler.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
(Standard spaCy contributor agreement text, identical to the copy shown above for Arvindcheenu.md.)

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Leander Fiedler      |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 06 April 2020        |
| GitHub username                | lfiedler             |
| Website (optional)             |                      |

.github/contributors/mahnerak.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
(Standard spaCy contributor agreement text, identical to the copy shown above for Arvindcheenu.md.)

## Contributor Details

| Field                          | Entry                 |
|------------------------------- | --------------------- |
| Name                           | Karen Hambardzumyan   |
| Company name (if applicable)   | YerevaNN              |
| Title or role (if applicable)  | Researcher            |
| Date                           | 2020-06-19            |
| GitHub username                | mahnerak              |
| Website (optional)             | https://mahnerak.com/ |

.github/contributors/myavrum.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
(Standard spaCy contributor agreement text, identical to the copy shown above for Arvindcheenu.md.)

## Contributor Details

| Field                          | Entry                            |
|------------------------------- | -------------------------------- |
| Name                           | Marat M. Yavrumyan               |
| Company name (if applicable)   | YSU, UD_Armenian Project         |
| Title or role (if applicable)  | Dr., Principal Investigator      |
| Date                           | 2020-06-19                       |
| GitHub username                | myavrum                          |
| Website (optional)             | http://armtreebank.yerevann.com/ |

.github/contributors/theudas.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
(Standard spaCy contributor agreement text as above, except that the granting entity is named as [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal) rather than ExplosionAI GmbH.)
## Contributor Details

| Field                          | Entry           |
| ------------------------------ | --------------- |
| Name                           | Philipp Sodmann |
| Company name (if applicable)   | Empolis         |
| Title or role (if applicable)  |                 |
| Date                           | 2017-05-06      |
| GitHub username                | theudas         |
| Website (optional)             |                 |
29
.github/workflows/issue-manager.yml
vendored
Normal file
@@ -0,0 +1,29 @@
name: Issue Manager

on:
  schedule:
    - cron: "0 0 * * *"
  issue_comment:
    types:
      - created
      - edited
  issues:
    types:
      - labeled

jobs:
  issue-manager:
    runs-on: ubuntu-latest
    steps:
      - uses: tiangolo/issue-manager@0.2.1
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          config: >
            {
              "resolved": {
                "delay": "P7D",
                "message": "This issue has been automatically closed because it was answered and there was no follow-up discussion.",
                "remove_label_on_comment": true,
                "remove_label_on_close": true
              }
            }
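As an aside on the workflow above: the `config` input handed to `tiangolo/issue-manager` is a JSON object keyed by label name, and `"delay": "P7D"` is an ISO-8601 duration (seven days). A minimal sketch of those settings as plain Python data, purely illustrative and not part of the workflow itself:

```python
import json
from datetime import timedelta

# The same settings the workflow passes through the `config:` input, as Python data.
config = {
    "resolved": {
        "delay": "P7D",  # ISO-8601 duration: wait seven days before acting
        "message": (
            "This issue has been automatically closed because it was answered "
            "and there was no follow-up discussion."
        ),
        "remove_label_on_comment": True,
        "remove_label_on_close": True,
    }
}
seven_days = timedelta(days=7)       # what "P7D" amounts to
print(json.dumps(config, indent=2))  # the JSON string the action receives
```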
5
Makefile

@@ -5,8 +5,9 @@ VENV := ./env$(PYVER)
 version := $(shell "bin/get-version.sh")
 
 dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
-	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) spacy_lookups_data
+	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) spacy-lookups-data jieba pkuseg==0.0.22 sudachipy sudachidict_core
 	chmod a+rx $@
+	cp $@ dist/spacy.pex
 
 dist/pytest.pex : wheelhouse/pytest-*.whl
 	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock

@@ -14,7 +15,7 @@ dist/pytest.pex : wheelhouse/pytest-*.whl
 
 wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
 	$(VENV)/bin/pip wheel . -w ./wheelhouse
-	$(VENV)/bin/pip wheel spacy_lookups_data -w ./wheelhouse
+	$(VENV)/bin/pip wheel spacy-lookups-data jieba pkuseg==0.0.22 sudachipy sudachidict_core -w ./wheelhouse
 	touch $@
 
 wheelhouse/pytest-%.whl : $(VENV)/bin/pex
17
README.md

@@ -6,12 +6,12 @@ spaCy is a library for advanced Natural Language Processing in Python and
 Cython. It's built on the very latest research, and was designed from day one to
 be used in real products. spaCy comes with
 [pretrained statistical models](https://spacy.io/models) and word vectors, and
-currently supports tokenization for **50+ languages**. It features
+currently supports tokenization for **60+ languages**. It features
 state-of-the-art speed, convolutional **neural network models** for tagging,
 parsing and **named entity recognition** and easy **deep learning** integration.
 It's commercial open-source software, released under the MIT license.
 
-💫 **Version 2.2 out now!**
+💫 **Version 2.3 out now!**
 [Check out the release notes here.](https://github.com/explosion/spaCy/releases)
 
 [![Azure Pipelines](<https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build+(3.x)>)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)

@@ -31,7 +31,7 @@ It's commercial open-source software, released under the MIT license.
 | --------------- | -------------------------------------------------------------- |
 | [spaCy 101]     | New to spaCy? Here's everything you need to know!               |
 | [Usage Guides]  | How to use spaCy and its features.                              |
-| [New in v2.2]   | New features, backwards incompatibilities and migration guide.  |
+| [New in v2.3]   | New features, backwards incompatibilities and migration guide.  |
 | [API Reference] | The detailed reference for spaCy's API.                         |
 | [Models]        | Download statistical language models for spaCy.                 |
 | [Universe]      | Libraries, extensions, demos, books and courses.                |

@@ -39,7 +39,7 @@ It's commercial open-source software, released under the MIT license.
 | [Contribute]    | How to contribute to the spaCy project and code base.           |
 
 [spacy 101]: https://spacy.io/usage/spacy-101
-[new in v2.2]: https://spacy.io/usage/v2-2
+[new in v2.3]: https://spacy.io/usage/v2-3
 [usage guides]: https://spacy.io/usage/
 [api reference]: https://spacy.io/api/
 [models]: https://spacy.io/models
@@ -119,12 +119,13 @@ of `v2.0.13`).
 pip install spacy
 ```
 
-To install additional data tables for lemmatization in **spaCy v2.2+** you can
-run `pip install spacy[lookups]` or install
+To install additional data tables for lemmatization and normalization in
+**spaCy v2.2+** you can run `pip install spacy[lookups]` or install
 [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
 separately. The lookups package is needed to create blank models with
-lemmatization data, and to lemmatize in languages that don't yet come with
-pretrained models and aren't powered by third-party libraries.
+lemmatization data for v2.2+ plus normalization data for v2.3+, and to
+lemmatize in languages that don't yet come with pretrained models and aren't
+powered by third-party libraries.
 
 When using pip it is generally recommended to install packages in a virtual
 environment to avoid modifying system state:
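To make the lookups note above concrete, here is a minimal sketch of what the extra tables enable. This is an editor's illustration rather than part of the README diff, and assumes spaCy v2.2/v2.3 with `spacy-lookups-data` installed:

```python
import spacy

# A blank model has no pretrained pipeline; lemmatization and normalization
# then rely on the tables shipped in spacy-lookups-data (if it is installed).
nlp = spacy.blank("en")
doc = nlp("The cats were running")
print([token.lemma_ for token in doc])  # lookup-based lemmas, e.g. "cat", "run"
print(list(nlp.vocab.lookups.tables))   # table names, e.g. "lemma_lookup", "lexeme_norm"
```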
@ -14,7 +14,7 @@ import spacy
|
|||
import spacy.util
|
||||
from bin.ud import conll17_ud_eval
|
||||
from spacy.tokens import Token, Doc
|
||||
from spacy.gold import GoldParse, Example
|
||||
from spacy.gold import Example
|
||||
from spacy.util import compounding, minibatch, minibatch_by_words
|
||||
from spacy.syntax.nonproj import projectivize
|
||||
from spacy.matcher import Matcher
|
||||
|
@ -78,22 +78,21 @@ def read_data(
|
|||
head = int(head) - 1 if head != "0" else id_
|
||||
sent["words"].append(word)
|
||||
sent["tags"].append(tag)
|
||||
sent["morphology"].append(_parse_morph_string(morph))
|
||||
sent["morphology"][-1].add("POS_%s" % pos)
|
||||
sent["morphs"].append(_compile_morph_string(morph, pos))
|
||||
sent["heads"].append(head)
|
||||
sent["deps"].append("ROOT" if dep == "root" else dep)
|
||||
sent["spaces"].append(space_after == "_")
|
||||
sent["entities"] = ["-"] * len(sent["words"])
|
||||
sent["entities"] = ["-"] * len(sent["words"]) # TODO: doc-level format
|
||||
sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"])
|
||||
if oracle_segments:
|
||||
docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"]))
|
||||
golds.append(GoldParse(docs[-1], **sent))
|
||||
assert golds[-1].morphology is not None
|
||||
golds.append(sent)
|
||||
assert golds[-1]["morphs"] is not None
|
||||
|
||||
sent_annots.append(sent)
|
||||
if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
|
||||
doc, gold = _make_gold(nlp, None, sent_annots)
|
||||
assert gold.morphology is not None
|
||||
assert gold["morphs"] is not None
|
||||
sent_annots = []
|
||||
docs.append(doc)
|
||||
golds.append(gold)
|
||||
|
@ -109,17 +108,10 @@ def read_data(
|
|||
return golds_to_gold_data(docs, golds)
|
||||
|
||||
|
||||
def _parse_morph_string(morph_string):
|
||||
def _compile_morph_string(morph_string, pos):
|
||||
if morph_string == '_':
|
||||
return set()
|
||||
output = []
|
||||
replacements = {'1': 'one', '2': 'two', '3': 'three'}
|
||||
for feature in morph_string.split('|'):
|
||||
key, value = feature.split('=')
|
||||
value = replacements.get(value, value)
|
||||
value = value.split(',')[0]
|
||||
output.append('%s_%s' % (key, value.lower()))
|
||||
return set(output)
|
||||
return f"POS={pos}"
|
||||
return morph_string + f"|POS={pos}"
|
||||
|
||||
|
||||
def read_conllu(file_):
|
||||
|
@ -151,28 +143,27 @@ def read_conllu(file_):
|
|||
|
||||
def _make_gold(nlp, text, sent_annots, drop_deps=0.0):
|
||||
# Flatten the conll annotations, and adjust the head indices
|
||||
flat = defaultdict(list)
|
||||
gold = defaultdict(list)
|
||||
sent_starts = []
|
||||
for sent in sent_annots:
|
||||
flat["heads"].extend(len(flat["words"])+head for head in sent["heads"])
|
||||
for field in ["words", "tags", "deps", "morphology", "entities", "spaces"]:
|
||||
flat[field].extend(sent[field])
|
||||
gold["heads"].extend(len(gold["words"])+head for head in sent["heads"])
|
||||
for field in ["words", "tags", "deps", "morphs", "entities", "spaces"]:
|
||||
gold[field].extend(sent[field])
|
||||
sent_starts.append(True)
|
||||
sent_starts.extend([False] * (len(sent["words"]) - 1))
|
||||
# Construct text if necessary
|
||||
assert len(flat["words"]) == len(flat["spaces"])
|
||||
assert len(gold["words"]) == len(gold["spaces"])
|
||||
if text is None:
|
||||
text = "".join(
|
||||
word + " " * space for word, space in zip(flat["words"], flat["spaces"])
|
||||
word + " " * space for word, space in zip(gold["words"], gold["spaces"])
|
||||
)
|
||||
doc = nlp.make_doc(text)
|
||||
flat.pop("spaces")
|
||||
gold = GoldParse(doc, **flat)
|
||||
gold.sent_starts = sent_starts
|
||||
for i in range(len(gold.heads)):
|
||||
gold.pop("spaces")
|
||||
gold["sent_starts"] = sent_starts
|
||||
for i in range(len(gold["heads"])):
|
||||
if random.random() < drop_deps:
|
||||
gold.heads[i] = None
|
||||
gold.labels[i] = None
|
||||
gold["heads"][i] = None
|
||||
gold["labels"][i] = None
|
||||
|
||||
return doc, gold
|
||||
|
||||
|
@ -183,15 +174,10 @@ def _make_gold(nlp, text, sent_annots, drop_deps=0.0):
|
|||
|
||||
|
||||
def golds_to_gold_data(docs, golds):
|
||||
"""Get out the training data format used by begin_training, given the
|
||||
GoldParse objects."""
|
||||
"""Get out the training data format used by begin_training"""
|
||||
data = []
|
||||
for doc, gold in zip(docs, golds):
|
||||
example = Example(doc=doc)
|
||||
example.add_doc_annotation(cats=gold.cats)
|
||||
token_annotation_dict = gold.orig.to_dict()
|
||||
example.add_token_annotation(**token_annotation_dict)
|
||||
example.goldparse = gold
|
||||
example = Example.from_dict(doc, dict(gold))
|
||||
data.append(example)
|
||||
return data
|
||||
|
||||
|
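The hunk above swaps the old GoldParse plumbing for the new `Example.from_dict` API. A minimal sketch of that pattern as it is used throughout this commit; the toy annotation dict is illustrative and assumes the v3 development version being built here:

```python
import spacy
from spacy.gold import Example  # location of Example in this development version

nlp = spacy.blank("en")
doc = nlp.make_doc("I like London")
# Keys mirror the flattened per-document dicts assembled by _make_gold() above.
annotations = {
    "words": ["I", "like", "London"],
    "tags": ["PRP", "VBP", "NNP"],
    "heads": [1, 1, 1],
    "deps": ["nsubj", "ROOT", "dobj"],
    "entities": [(7, 13, "GPE")],
}
example = Example.from_dict(doc, annotations)  # predicted doc + reference annotations
```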
@ -359,9 +345,8 @@ def initialize_pipeline(nlp, examples, config, device):
|
|||
nlp.parser.add_multitask_objective("tag")
|
||||
if config.multitask_sent:
|
||||
nlp.parser.add_multitask_objective("sent_start")
|
||||
for ex in examples:
|
||||
gold = ex.gold
|
||||
for tag in gold.tags:
|
||||
for eg in examples:
|
||||
for tag in eg.get_aligned("TAG", as_string=True):
|
||||
if tag is not None:
|
||||
nlp.tagger.add_label(tag)
|
||||
if torch is not None and device != -1:
|
||||
|
@ -495,10 +480,6 @@ def main(
|
|||
Token.set_extension("begins_fused", default=False)
|
||||
Token.set_extension("inside_fused", default=False)
|
||||
|
||||
Token.set_extension("get_conllu_lines", method=get_token_conllu)
|
||||
Token.set_extension("begins_fused", default=False)
|
||||
Token.set_extension("inside_fused", default=False)
|
||||
|
||||
spacy.util.fix_random_seed()
|
||||
lang.zh.Chinese.Defaults.use_jieba = False
|
||||
lang.ja.Japanese.Defaults.use_janome = False
|
||||
|
@ -541,10 +522,10 @@ def main(
|
|||
else:
|
||||
batches = minibatch(examples, size=batch_sizes)
|
||||
losses = {}
|
||||
n_train_words = sum(len(ex.doc) for ex in examples)
|
||||
n_train_words = sum(len(eg.predicted) for eg in examples)
|
||||
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
|
||||
for batch in batches:
|
||||
pbar.update(sum(len(ex.doc) for ex in batch))
|
||||
pbar.update(sum(len(ex.predicted) for ex in batch))
|
||||
nlp.parser.cfg["beam_update_prob"] = next(beam_prob)
|
||||
nlp.update(
|
||||
batch,
|
||||
|
|
|
@ -5,17 +5,16 @@
|
|||
# data is passed in sentence-by-sentence via some prior preprocessing.
|
||||
gold_preproc = false
|
||||
# Limitations on training document length or number of examples.
|
||||
max_length = 0
|
||||
max_length = 5000
|
||||
limit = 0
|
||||
# Data augmentation
|
||||
orth_variant_level = 0.0
|
||||
noise_level = 0.0
|
||||
dropout = 0.1
|
||||
# Controls early-stopping. 0 or -1 mean unlimited.
|
||||
patience = 1600
|
||||
max_epochs = 0
|
||||
max_steps = 20000
|
||||
eval_frequency = 400
|
||||
eval_frequency = 200
|
||||
# Other settings
|
||||
seed = 0
|
||||
accumulate_gradient = 1
|
||||
|
@ -41,15 +40,15 @@ beta2 = 0.999
|
|||
L2_is_weight_decay = true
|
||||
L2 = 0.01
|
||||
grad_clip = 1.0
|
||||
use_averages = true
|
||||
use_averages = false
|
||||
eps = 1e-8
|
||||
learn_rate = 0.001
|
||||
#learn_rate = 0.001
|
||||
|
||||
#[optimizer.learn_rate]
|
||||
#@schedules = "warmup_linear.v1"
|
||||
#warmup_steps = 250
|
||||
#total_steps = 20000
|
||||
#initial_rate = 0.001
|
||||
[optimizer.learn_rate]
|
||||
@schedules = "warmup_linear.v1"
|
||||
warmup_steps = 250
|
||||
total_steps = 20000
|
||||
initial_rate = 0.001
|
||||
|
||||
[nlp]
|
||||
lang = "en"
|
||||
|
@ -58,15 +57,11 @@ vectors = null
|
|||
[nlp.pipeline.tok2vec]
|
||||
factory = "tok2vec"
|
||||
|
||||
[nlp.pipeline.senter]
|
||||
factory = "senter"
|
||||
|
||||
[nlp.pipeline.ner]
|
||||
factory = "ner"
|
||||
learn_tokens = false
|
||||
min_action_freq = 1
|
||||
beam_width = 1
|
||||
beam_update_prob = 1.0
|
||||
|
||||
[nlp.pipeline.tagger]
|
||||
factory = "tagger"
|
||||
|
@ -74,16 +69,7 @@ factory = "tagger"
|
|||
[nlp.pipeline.parser]
|
||||
factory = "parser"
|
||||
learn_tokens = false
|
||||
min_action_freq = 1
|
||||
beam_width = 1
|
||||
beam_update_prob = 1.0
|
||||
|
||||
[nlp.pipeline.senter.model]
|
||||
@architectures = "spacy.Tagger.v1"
|
||||
|
||||
[nlp.pipeline.senter.model.tok2vec]
|
||||
@architectures = "spacy.Tok2VecTensors.v1"
|
||||
width = ${nlp.pipeline.tok2vec.model:width}
|
||||
min_action_freq = 30
|
||||
|
||||
[nlp.pipeline.tagger.model]
|
||||
@architectures = "spacy.Tagger.v1"
|
||||
|
@ -96,8 +82,8 @@ width = ${nlp.pipeline.tok2vec.model:width}
|
|||
@architectures = "spacy.TransitionBasedParser.v1"
|
||||
nr_feature_tokens = 8
|
||||
hidden_width = 128
|
||||
maxout_pieces = 3
|
||||
use_upper = false
|
||||
maxout_pieces = 2
|
||||
use_upper = true
|
||||
|
||||
[nlp.pipeline.parser.model.tok2vec]
|
||||
@architectures = "spacy.Tok2VecTensors.v1"
|
||||
|
@ -107,8 +93,8 @@ width = ${nlp.pipeline.tok2vec.model:width}
|
|||
@architectures = "spacy.TransitionBasedParser.v1"
|
||||
nr_feature_tokens = 3
|
||||
hidden_width = 128
|
||||
maxout_pieces = 3
|
||||
use_upper = false
|
||||
maxout_pieces = 2
|
||||
use_upper = true
|
||||
|
||||
[nlp.pipeline.ner.model.tok2vec]
|
||||
@architectures = "spacy.Tok2VecTensors.v1"
|
||||
|
@ -117,10 +103,10 @@ width = ${nlp.pipeline.tok2vec.model:width}
|
|||
[nlp.pipeline.tok2vec.model]
|
||||
@architectures = "spacy.HashEmbedCNN.v1"
|
||||
pretrained_vectors = ${nlp:vectors}
|
||||
width = 256
|
||||
depth = 6
|
||||
width = 128
|
||||
depth = 4
|
||||
window_size = 1
|
||||
embed_size = 10000
|
||||
embed_size = 7000
|
||||
maxout_pieces = 3
|
||||
subword_features = true
|
||||
dropout = null
|
||||
dropout = ${training:dropout}
|
||||
|
|
|
@ -9,7 +9,6 @@ max_length = 0
|
|||
limit = 0
|
||||
# Data augmentation
|
||||
orth_variant_level = 0.0
|
||||
noise_level = 0.0
|
||||
dropout = 0.1
|
||||
# Controls early-stopping. 0 or -1 mean unlimited.
|
||||
patience = 1600
|
||||
|
|
80
examples/experiments/onto-ner.cfg
Normal file
80
examples/experiments/onto-ner.cfg
Normal file
|
@ -0,0 +1,80 @@
|
|||
# Training hyper-parameters and additional features.
|
||||
[training]
|
||||
# Whether to train on sequences with 'gold standard' sentence boundaries
|
||||
# and tokens. If you set this to true, take care to ensure your run-time
|
||||
# data is passed in sentence-by-sentence via some prior preprocessing.
|
||||
gold_preproc = false
|
||||
# Limitations on training document length or number of examples.
|
||||
max_length = 5000
|
||||
limit = 0
|
||||
# Data augmentation
|
||||
orth_variant_level = 0.0
|
||||
dropout = 0.2
|
||||
# Controls early-stopping. 0 or -1 mean unlimited.
|
||||
patience = 1600
|
||||
max_epochs = 0
|
||||
max_steps = 20000
|
||||
eval_frequency = 500
|
||||
# Other settings
|
||||
seed = 0
|
||||
accumulate_gradient = 1
|
||||
use_pytorch_for_gpu_memory = false
|
||||
# Control how scores are printed and checkpoints are evaluated.
|
||||
scores = ["speed", "ents_p", "ents_r", "ents_f"]
|
||||
score_weights = {"ents_f": 1.0}
|
||||
# These settings are invalid for the transformer models.
|
||||
init_tok2vec = null
|
||||
discard_oversize = false
|
||||
omit_extra_lookups = false
|
||||
|
||||
[training.batch_size]
|
||||
@schedules = "compounding.v1"
|
||||
start = 100
|
||||
stop = 1000
|
||||
compound = 1.001
|
||||
|
||||
[training.optimizer]
|
||||
@optimizers = "Adam.v1"
|
||||
beta1 = 0.9
|
||||
beta2 = 0.999
|
||||
L2_is_weight_decay = false
|
||||
L2 = 1e-6
|
||||
grad_clip = 1.0
|
||||
use_averages = true
|
||||
eps = 1e-8
|
||||
learn_rate = 0.001
|
||||
|
||||
#[optimizer.learn_rate]
|
||||
#@schedules = "warmup_linear.v1"
|
||||
#warmup_steps = 250
|
||||
#total_steps = 20000
|
||||
#initial_rate = 0.001
|
||||
|
||||
[nlp]
|
||||
lang = "en"
|
||||
vectors = null
|
||||
|
||||
[nlp.pipeline.ner]
|
||||
factory = "ner"
|
||||
learn_tokens = false
|
||||
min_action_freq = 1
|
||||
beam_width = 1
|
||||
beam_update_prob = 1.0
|
||||
|
||||
[nlp.pipeline.ner.model]
|
||||
@architectures = "spacy.TransitionBasedParser.v1"
|
||||
nr_feature_tokens = 3
|
||||
hidden_width = 64
|
||||
maxout_pieces = 2
|
||||
use_upper = true
|
||||
|
||||
[nlp.pipeline.ner.model.tok2vec]
|
||||
@architectures = "spacy.HashEmbedCNN.v1"
|
||||
pretrained_vectors = ${nlp:vectors}
|
||||
width = 96
|
||||
depth = 4
|
||||
window_size = 1
|
||||
embed_size = 2000
|
||||
maxout_pieces = 3
|
||||
subword_features = true
|
||||
dropout = ${training:dropout}
|
|
@ -6,7 +6,6 @@ init_tok2vec = null
|
|||
vectors = null
|
||||
max_epochs = 100
|
||||
orth_variant_level = 0.0
|
||||
noise_level = 0.0
|
||||
gold_preproc = true
|
||||
max_length = 0
|
||||
use_gpu = 0
|
||||
|
|
|
@ -6,7 +6,6 @@ init_tok2vec = null
|
|||
vectors = null
|
||||
max_epochs = 100
|
||||
orth_variant_level = 0.0
|
||||
noise_level = 0.0
|
||||
gold_preproc = true
|
||||
max_length = 0
|
||||
use_gpu = -1
|
||||
|
|
|
@ -12,7 +12,7 @@ import tqdm
|
|||
import spacy
|
||||
import spacy.util
|
||||
from spacy.tokens import Token, Doc
|
||||
from spacy.gold import GoldParse, Example
|
||||
from spacy.gold import Example
|
||||
from spacy.syntax.nonproj import projectivize
|
||||
from collections import defaultdict
|
||||
from spacy.matcher import Matcher
|
||||
|
@ -33,31 +33,6 @@ random.seed(0)
|
|||
numpy.random.seed(0)
|
||||
|
||||
|
||||
def minibatch_by_words(examples, size=5000):
|
||||
random.shuffle(examples)
|
||||
if isinstance(size, int):
|
||||
size_ = itertools.repeat(size)
|
||||
else:
|
||||
size_ = size
|
||||
examples = iter(examples)
|
||||
while True:
|
||||
batch_size = next(size_)
|
||||
batch = []
|
||||
while batch_size >= 0:
|
||||
try:
|
||||
example = next(examples)
|
||||
except StopIteration:
|
||||
if batch:
|
||||
yield batch
|
||||
return
|
||||
batch_size -= len(example.doc)
|
||||
batch.append(example)
|
||||
if batch:
|
||||
yield batch
|
||||
else:
|
||||
break
|
||||
|
||||
|
||||
################
|
||||
# Data reading #
|
||||
################
|
||||
|
@ -110,7 +85,7 @@ def read_data(
|
|||
sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"])
|
||||
if oracle_segments:
|
||||
docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"]))
|
||||
golds.append(GoldParse(docs[-1], **sent))
|
||||
golds.append(sent)
|
||||
|
||||
sent_annots.append(sent)
|
||||
if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
|
||||
|
@ -159,20 +134,19 @@ def read_conllu(file_):
|
|||
|
||||
def _make_gold(nlp, text, sent_annots):
|
||||
# Flatten the conll annotations, and adjust the head indices
|
||||
flat = defaultdict(list)
|
||||
gold = defaultdict(list)
|
||||
for sent in sent_annots:
|
||||
flat["heads"].extend(len(flat["words"]) + head for head in sent["heads"])
|
||||
gold["heads"].extend(len(gold["words"]) + head for head in sent["heads"])
|
||||
for field in ["words", "tags", "deps", "entities", "spaces"]:
|
||||
flat[field].extend(sent[field])
|
||||
gold[field].extend(sent[field])
|
||||
# Construct text if necessary
|
||||
assert len(flat["words"]) == len(flat["spaces"])
|
||||
assert len(gold["words"]) == len(gold["spaces"])
|
||||
if text is None:
|
||||
text = "".join(
|
||||
word + " " * space for word, space in zip(flat["words"], flat["spaces"])
|
||||
word + " " * space for word, space in zip(gold["words"], gold["spaces"])
|
||||
)
|
||||
doc = nlp.make_doc(text)
|
||||
flat.pop("spaces")
|
||||
gold = GoldParse(doc, **flat)
|
||||
gold.pop("spaces")
|
||||
return doc, gold
|
||||
|
||||
|
||||
|
@ -182,15 +156,10 @@ def _make_gold(nlp, text, sent_annots):
|
|||
|
||||
|
||||
def golds_to_gold_data(docs, golds):
|
||||
"""Get out the training data format used by begin_training, given the
|
||||
GoldParse objects."""
|
||||
"""Get out the training data format used by begin_training."""
|
||||
data = []
|
||||
for doc, gold in zip(docs, golds):
|
||||
example = Example(doc=doc)
|
||||
example.add_doc_annotation(cats=gold.cats)
|
||||
token_annotation_dict = gold.orig.to_dict()
|
||||
example.add_token_annotation(**token_annotation_dict)
|
||||
example.goldparse = gold
|
||||
example = Example.from_dict(doc, gold)
|
||||
data.append(example)
|
||||
return data
|
||||
|
||||
|
@ -313,15 +282,15 @@ def initialize_pipeline(nlp, examples, config):
|
|||
nlp.parser.add_multitask_objective("sent_start")
|
||||
nlp.parser.moves.add_action(2, "subtok")
|
||||
nlp.add_pipe(nlp.create_pipe("tagger"))
|
||||
for ex in examples:
|
||||
for tag in ex.gold.tags:
|
||||
for eg in examples:
|
||||
for tag in eg.get_aligned("TAG", as_string=True):
|
||||
if tag is not None:
|
||||
nlp.tagger.add_label(tag)
|
||||
# Replace labels that didn't make the frequency cutoff
|
||||
actions = set(nlp.parser.labels)
|
||||
label_set = set([act.split("-")[1] for act in actions if "-" in act])
|
||||
for ex in examples:
|
||||
gold = ex.gold
|
||||
for eg in examples:
|
||||
gold = eg.gold
|
||||
for i, label in enumerate(gold.labels):
|
||||
if label is not None and label not in label_set:
|
||||
gold.labels[i] = label.split("||")[0]
|
||||
|
@ -415,13 +384,12 @@ def main(ud_dir, parses_dir, config, corpus, limit=0):
|
|||
optimizer = initialize_pipeline(nlp, examples, config)
|
||||
|
||||
for i in range(config.nr_epoch):
|
||||
docs = [nlp.make_doc(example.doc.text) for example in examples]
|
||||
batches = minibatch_by_words(examples, size=config.batch_size)
|
||||
batches = spacy.minibatch_by_words(examples, size=config.batch_size)
|
||||
losses = {}
|
||||
n_train_words = sum(len(doc) for doc in docs)
|
||||
n_train_words = sum(len(eg.reference.doc) for eg in examples)
|
||||
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
|
||||
for batch in batches:
|
||||
pbar.update(sum(len(ex.doc) for ex in batch))
|
||||
pbar.update(sum(len(eg.reference.doc) for eg in batch))
|
||||
nlp.update(
|
||||
examples=batch, sgd=optimizer, drop=config.dropout, losses=losses,
|
||||
)
|
||||
|
|
|
@ -30,7 +30,7 @@ ENTITIES = {"Q2146908": ("American golfer", 342), "Q7381115": ("publisher", 17)}
|
|||
model=("Model name, should have pretrained word embeddings", "positional", None, str),
|
||||
output_dir=("Optional output directory", "option", "o", Path),
|
||||
)
|
||||
def main(model=None, output_dir=None):
|
||||
def main(model, output_dir=None):
|
||||
"""Load the model and create the KB with pre-defined entity encodings.
|
||||
If an output_dir is provided, the KB will be stored there in a file 'kb'.
|
||||
The updated vocab will also be written to a directory in the output_dir."""
|
||||
|
|
|
@ -24,8 +24,10 @@ import random
|
|||
import plac
|
||||
import spacy
|
||||
import os.path
|
||||
|
||||
from spacy.gold.example import Example
|
||||
from spacy.tokens import Doc
|
||||
from spacy.gold import read_json_file, GoldParse
|
||||
from spacy.gold import read_json_file
|
||||
|
||||
random.seed(0)
|
||||
|
||||
|
@ -59,27 +61,25 @@ def main(n_iter=10):
|
|||
print(nlp.pipeline)
|
||||
|
||||
print("Create data", len(TRAIN_DATA))
|
||||
optimizer = nlp.begin_training(get_examples=lambda: TRAIN_DATA)
|
||||
optimizer = nlp.begin_training()
|
||||
for itn in range(n_iter):
|
||||
random.shuffle(TRAIN_DATA)
|
||||
losses = {}
|
||||
for example in TRAIN_DATA:
|
||||
for token_annotation in example.token_annotations:
|
||||
doc = Doc(nlp.vocab, words=token_annotation.words)
|
||||
gold = GoldParse.from_annotation(doc, example.doc_annotation, token_annotation)
|
||||
|
||||
nlp.update(
|
||||
examples=[(doc, gold)], # 1 example
|
||||
drop=0.2, # dropout - make it harder to memorise data
|
||||
sgd=optimizer, # callable to update weights
|
||||
losses=losses,
|
||||
)
|
||||
for example_dict in TRAIN_DATA:
|
||||
doc = Doc(nlp.vocab, words=example_dict["words"])
|
||||
example = Example.from_dict(doc, example_dict)
|
||||
nlp.update(
|
||||
examples=[example], # 1 example
|
||||
drop=0.2, # dropout - make it harder to memorise data
|
||||
sgd=optimizer, # callable to update weights
|
||||
losses=losses,
|
||||
)
|
||||
print(losses.get("nn_labeller", 0.0), losses["ner"])
|
||||
|
||||
# test the trained model
|
||||
for example in TRAIN_DATA:
|
||||
if example.text is not None:
|
||||
doc = nlp(example.text)
|
||||
for example_dict in TRAIN_DATA:
|
||||
if "text" in example_dict:
|
||||
doc = nlp(example_dict["text"])
|
||||
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
|
||||
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
|
||||
|
||||
|
|
|
@ -4,9 +4,10 @@ import random
|
|||
import warnings
|
||||
import srsly
|
||||
import spacy
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.gold import Example
|
||||
from spacy.util import minibatch, compounding
|
||||
|
||||
# TODO: further fix & test this script for v.3 ? (read_gold_data is never called)
|
||||
|
||||
LABEL = "ANIMAL"
|
||||
TRAIN_DATA = [
|
||||
|
@ -36,15 +37,13 @@ def read_raw_data(nlp, jsonl_loc):
|
|||
|
||||
|
||||
def read_gold_data(nlp, gold_loc):
|
||||
docs = []
|
||||
golds = []
|
||||
examples = []
|
||||
for json_obj in srsly.read_jsonl(gold_loc):
|
||||
doc = nlp.make_doc(json_obj["text"])
|
||||
ents = [(ent["start"], ent["end"], ent["label"]) for ent in json_obj["spans"]]
|
||||
gold = GoldParse(doc, entities=ents)
|
||||
docs.append(doc)
|
||||
golds.append(gold)
|
||||
return list(zip(docs, golds))
|
||||
example = Example.from_dict(doc, {"entities": ents})
|
||||
examples.append(example)
|
||||
return examples
|
||||
|
||||
|
||||
def main(model_name, unlabelled_loc):
|
||||
|
|
|
@ -2,7 +2,7 @@
|
|||
# coding: utf-8
|
||||
"""Using the parser to recognise your own semantics
|
||||
|
||||
spaCy's parser component can be used to trained to predict any type of tree
|
||||
spaCy's parser component can be trained to predict any type of tree
|
||||
structure over your input text. You can also predict trees over whole documents
|
||||
or chat logs, with connections between the sentence-roots used to annotate
|
||||
discourse structure. In this example, we'll build a message parser for a common
|
||||
|
|
|
@ -56,7 +56,7 @@ def main(model=None, output_dir=None, n_iter=100):
|
|||
print("Add label", ent[2])
|
||||
ner.add_label(ent[2])
|
||||
|
||||
with nlp.select_pipes(enable="ner") and warnings.catch_warnings():
|
||||
with nlp.select_pipes(enable="simple_ner") and warnings.catch_warnings():
|
||||
# show warnings for misaligned entity spans once
|
||||
warnings.filterwarnings("once", category=UserWarning, module="spacy")
|
||||
|
||||
|
|
|
@ -19,7 +19,7 @@ from ml_datasets import loaders
|
|||
import spacy
|
||||
from spacy import util
|
||||
from spacy.util import minibatch, compounding
|
||||
from spacy.gold import Example, GoldParse
|
||||
from spacy.gold import Example
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
|
@ -62,11 +62,10 @@ def main(config_path, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=Non
|
|||
train_examples = []
|
||||
for text, cats in zip(train_texts, train_cats):
|
||||
doc = nlp.make_doc(text)
|
||||
gold = GoldParse(doc, cats=cats)
|
||||
example = Example.from_dict(doc, {"cats": cats})
|
||||
for cat in cats:
|
||||
textcat.add_label(cat)
|
||||
ex = Example.from_gold(gold, doc=doc)
|
||||
train_examples.append(ex)
|
||||
train_examples.append(example)
|
||||
|
||||
with nlp.select_pipes(enable="textcat"): # only train textcat
|
||||
optimizer = nlp.begin_training()
|
||||
|
|
|
@@ -6,7 +6,7 @@ requires = [
     "cymem>=2.0.2,<2.1.0",
     "preshed>=3.0.2,<3.1.0",
     "murmurhash>=0.28.0,<1.1.0",
-    "thinc==8.0.0a9",
+    "thinc==8.0.0a11",
     "blis>=0.4.0,<0.5.0"
 ]
 build-backend = "setuptools.build_meta"
|
||||
|
|
|
@ -1,17 +1,17 @@
|
|||
# Our libraries
|
||||
cymem>=2.0.2,<2.1.0
|
||||
preshed>=3.0.2,<3.1.0
|
||||
thinc==8.0.0a9
|
||||
thinc==8.0.0a11
|
||||
blis>=0.4.0,<0.5.0
|
||||
ml_datasets>=0.1.1
|
||||
murmurhash>=0.28.0,<1.1.0
|
||||
wasabi>=0.4.0,<1.1.0
|
||||
srsly>=2.0.0,<3.0.0
|
||||
wasabi>=0.7.0,<1.1.0
|
||||
srsly>=2.1.0,<3.0.0
|
||||
catalogue>=0.0.7,<1.1.0
|
||||
typer>=0.3.0,<1.0.0
|
||||
# Third party dependencies
|
||||
numpy>=1.15.0
|
||||
requests>=2.13.0,<3.0.0
|
||||
plac>=0.9.6,<1.2.0
|
||||
tqdm>=4.38.0,<5.0.0
|
||||
pydantic>=1.3.0,<2.0.0
|
||||
# Official Python utilities
|
||||
|
|
16
setup.cfg
16
setup.cfg
|
@ -36,22 +36,21 @@ setup_requires =
|
|||
cymem>=2.0.2,<2.1.0
|
||||
preshed>=3.0.2,<3.1.0
|
||||
murmurhash>=0.28.0,<1.1.0
|
||||
thinc==8.0.0a9
|
||||
thinc==8.0.0a11
|
||||
install_requires =
|
||||
# Our libraries
|
||||
murmurhash>=0.28.0,<1.1.0
|
||||
cymem>=2.0.2,<2.1.0
|
||||
preshed>=3.0.2,<3.1.0
|
||||
thinc==8.0.0a9
|
||||
thinc==8.0.0a11
|
||||
blis>=0.4.0,<0.5.0
|
||||
wasabi>=0.4.0,<1.1.0
|
||||
srsly>=2.0.0,<3.0.0
|
||||
wasabi>=0.7.0,<1.1.0
|
||||
srsly>=2.1.0,<3.0.0
|
||||
catalogue>=0.0.7,<1.1.0
|
||||
ml_datasets>=0.1.1
|
||||
typer>=0.3.0,<1.0.0
|
||||
# Third-party dependencies
|
||||
tqdm>=4.38.0,<5.0.0
|
||||
numpy>=1.15.0
|
||||
plac>=0.9.6,<1.2.0
|
||||
requests>=2.13.0,<3.0.0
|
||||
pydantic>=1.3.0,<2.0.0
|
||||
# Official Python utilities
|
||||
|
@ -61,7 +60,7 @@ install_requires =
|
|||
|
||||
[options.extras_require]
|
||||
lookups =
|
||||
spacy_lookups_data>=0.3.1,<0.4.0
|
||||
spacy_lookups_data>=0.3.2,<0.4.0
|
||||
cuda =
|
||||
cupy>=5.0.0b4,<9.0.0
|
||||
cuda80 =
|
||||
|
@ -80,7 +79,8 @@ cuda102 =
|
|||
cupy-cuda102>=5.0.0b4,<9.0.0
|
||||
# Language tokenizers with external dependencies
|
||||
ja =
|
||||
fugashi>=0.1.3
|
||||
sudachipy>=0.4.5
|
||||
sudachidict_core>=20200330
|
||||
ko =
|
||||
natto-py==0.9.0
|
||||
th =
|
||||
|
|
7
setup.py
7
setup.py
|
@ -23,6 +23,8 @@ Options.docstrings = True
|
|||
|
||||
PACKAGES = find_packages()
|
||||
MOD_NAMES = [
|
||||
"spacy.gold.align",
|
||||
"spacy.gold.example",
|
||||
"spacy.parts_of_speech",
|
||||
"spacy.strings",
|
||||
"spacy.lexeme",
|
||||
|
@ -37,11 +39,10 @@ MOD_NAMES = [
|
|||
"spacy.tokenizer",
|
||||
"spacy.syntax.nn_parser",
|
||||
"spacy.syntax._parser_model",
|
||||
"spacy.syntax._beam_utils",
|
||||
"spacy.syntax.nonproj",
|
||||
"spacy.syntax.transition_system",
|
||||
"spacy.syntax.arc_eager",
|
||||
"spacy.gold",
|
||||
"spacy.gold.gold_io",
|
||||
"spacy.tokens.doc",
|
||||
"spacy.tokens.span",
|
||||
"spacy.tokens.token",
|
||||
|
@ -120,7 +121,7 @@ class build_ext_subclass(build_ext, build_ext_options):
|
|||
|
||||
def clean(path):
|
||||
for path in path.glob("**/*"):
|
||||
if path.is_file() and path.suffix in (".so", ".cpp"):
|
||||
if path.is_file() and path.suffix in (".so", ".cpp", ".html"):
|
||||
print(f"Deleting {path.name}")
|
||||
path.unlink()
|
||||
|
||||
|
|
|
@ -8,7 +8,7 @@ warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
|
|||
from thinc.api import prefer_gpu, require_gpu
|
||||
|
||||
from . import pipeline
|
||||
from .cli.info import info as cli_info
|
||||
from .cli.info import info
|
||||
from .glossary import explain
|
||||
from .about import __version__
|
||||
from .errors import Errors, Warnings
|
||||
|
@ -34,7 +34,3 @@ def load(name, **overrides):
|
|||
def blank(name, **kwargs):
|
||||
LangClass = util.get_lang_class(name)
|
||||
return LangClass(**kwargs)
|
||||
|
||||
|
||||
def info(model=None, markdown=False, silent=False):
|
||||
return cli_info(model, markdown, silent)
|
||||
|
|
|
@ -1,31 +1,4 @@
|
|||
if __name__ == "__main__":
|
||||
import plac
|
||||
import sys
|
||||
from wasabi import msg
|
||||
from spacy.cli import download, link, info, package, pretrain, convert
|
||||
from spacy.cli import init_model, profile, evaluate, validate, debug_data
|
||||
from spacy.cli import train_cli
|
||||
from spacy.cli import setup_cli
|
||||
|
||||
commands = {
|
||||
"download": download,
|
||||
"link": link,
|
||||
"info": info,
|
||||
"train": train_cli,
|
||||
"pretrain": pretrain,
|
||||
"debug-data": debug_data,
|
||||
"evaluate": evaluate,
|
||||
"convert": convert,
|
||||
"package": package,
|
||||
"init-model": init_model,
|
||||
"profile": profile,
|
||||
"validate": validate,
|
||||
}
|
||||
if len(sys.argv) == 1:
|
||||
msg.info("Available commands", ", ".join(commands), exits=1)
|
||||
command = sys.argv.pop(1)
|
||||
sys.argv[0] = f"spacy {command}"
|
||||
if command in commands:
|
||||
plac.call(commands[command], sys.argv[1:])
|
||||
else:
|
||||
available = f"Available: {', '.join(commands)}"
|
||||
msg.fail(f"Unknown command: {command}", available, exits=1)
|
||||
setup_cli()
|
||||
|
|
|
@@ -1,7 +1,8 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "3.0.0.dev9"
+__version__ = "3.0.0.dev12"
 __release__ = True
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
 __shortcuts__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts-v2.json"
+__projects__ = "https://github.com/explosion/spacy-boilerplates"
|
||||
|
|
|
@ -1,19 +1,28 @@
|
|||
from wasabi import msg
|
||||
|
||||
from ._app import app, setup_cli # noqa: F401
|
||||
|
||||
# These are the actual functions, NOT the wrapped CLI commands. The CLI commands
|
||||
# are registered automatically and won't have to be imported here.
|
||||
from .download import download # noqa: F401
|
||||
from .info import info # noqa: F401
|
||||
from .package import package # noqa: F401
|
||||
from .profile import profile # noqa: F401
|
||||
from .train_from_config import train_cli # noqa: F401
|
||||
from .train import train_cli # noqa: F401
|
||||
from .pretrain import pretrain # noqa: F401
|
||||
from .debug_data import debug_data # noqa: F401
|
||||
from .evaluate import evaluate # noqa: F401
|
||||
from .convert import convert # noqa: F401
|
||||
from .init_model import init_model # noqa: F401
|
||||
from .validate import validate # noqa: F401
|
||||
from .project import project_clone, project_assets, project_run # noqa: F401
|
||||
from .project import project_run_all # noqa: F401
|
||||
|
||||
|
||||
@app.command("link", no_args_is_help=True, deprecated=True, hidden=True)
|
||||
def link(*args, **kwargs):
|
||||
"""As of spaCy v3.0, model symlinks are deprecated. You can load models
|
||||
using their full names or from a directory path."""
|
||||
msg.warn(
|
||||
"As of spaCy v3.0, model symlinks are deprecated. You can load models "
|
||||
"using their full names or from a directory path."
|
||||
|
|
24
spacy/cli/_app.py
Normal file

@@ -0,0 +1,24 @@
import typer
from typer.main import get_command


COMMAND = "python -m spacy"
NAME = "spacy"
HELP = """spaCy Command-line Interface

DOCS: https://spacy.io/api/cli
"""


app = typer.Typer(name=NAME, help=HELP)

# Wrappers for Typer's annotations. Initially created to set defaults and to
# keep the names short, but not needed at the moment.
Arg = typer.Argument
Opt = typer.Option


def setup_cli() -> None:
    # Ensure that the help messages always display the correct prompt
    command = get_command(app)
    command(prog_name=COMMAND)
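For orientation, the `Arg`/`Opt` wrappers and `setup_cli()` defined above replace the plac-based dispatch removed from `__main__.py`. A minimal sketch of how a subcommand hooks into this Typer app, mirroring the pattern `convert` and `debug-data` use later in the diff; the `greet` command and its options are hypothetical and shown only to illustrate the registration pattern:

```python
from pathlib import Path
from typing import Optional

from spacy.cli._app import app, Arg, Opt, setup_cli


@app.command("greet")
def greet_cli(
    # fmt: off
    name: str = Arg(..., help="Name to greet"),
    output_dir: Optional[Path] = Opt(None, "--output-dir", "-o", help="Optional directory to write to"),
    # fmt: on
):
    """Print a greeting (hypothetical command, for illustration only)."""
    print(f"Hello, {name}!")


if __name__ == "__main__":
    setup_cli()  # builds the Click command from `app` and dispatches to registered subcommands
```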
@ -1,118 +1,157 @@
|
|||
from typing import Optional
|
||||
from enum import Enum
|
||||
from pathlib import Path
|
||||
from wasabi import Printer
|
||||
import srsly
|
||||
import re
|
||||
import sys
|
||||
|
||||
from .converters import conllu2json, iob2json, conll_ner2json
|
||||
from .converters import ner_jsonl2json
|
||||
from ._app import app, Arg, Opt
|
||||
from ..gold import docs_to_json
|
||||
from ..tokens import DocBin
|
||||
from ..gold.converters import iob2docs, conll_ner2docs, json2docs
|
||||
|
||||
|
||||
# Converters are matched by file extension except for ner/iob, which are
|
||||
# matched by file extension and content. To add a converter, add a new
|
||||
# entry to this dict with the file extension mapped to the converter function
|
||||
# imported from /converters.
|
||||
|
||||
CONVERTERS = {
|
||||
"conllubio": conllu2json,
|
||||
"conllu": conllu2json,
|
||||
"conll": conllu2json,
|
||||
"ner": conll_ner2json,
|
||||
"iob": iob2json,
|
||||
"jsonl": ner_jsonl2json,
|
||||
# "conllubio": conllu2docs, TODO
|
||||
# "conllu": conllu2docs, TODO
|
||||
# "conll": conllu2docs, TODO
|
||||
"ner": conll_ner2docs,
|
||||
"iob": iob2docs,
|
||||
"json": json2docs,
|
||||
}
|
||||
|
||||
# File types
|
||||
FILE_TYPES = ("json", "jsonl", "msg")
|
||||
FILE_TYPES_STDOUT = ("json", "jsonl")
|
||||
|
||||
# File types that can be written to stdout
|
||||
FILE_TYPES_STDOUT = ("json")
|
||||
|
||||
|
||||
def convert(
|
||||
class FileTypes(str, Enum):
|
||||
json = "json"
|
||||
spacy = "spacy"
|
||||
|
||||
|
||||
@app.command("convert")
|
||||
def convert_cli(
|
||||
# fmt: off
|
||||
input_file: ("Input file", "positional", None, str),
|
||||
output_dir: ("Output directory. '-' for stdout.", "positional", None, str) = "-",
|
||||
file_type: (f"Type of data to produce: {FILE_TYPES}", "option", "t", str, FILE_TYPES) = "json",
|
||||
n_sents: ("Number of sentences per doc (0 to disable)", "option", "n", int) = 1,
|
||||
seg_sents: ("Segment sentences (for -c ner)", "flag", "s") = False,
|
||||
model: ("Model for sentence segmentation (for -s)", "option", "b", str) = None,
|
||||
morphology: ("Enable appending morphology to tags", "flag", "m", bool) = False,
|
||||
merge_subtokens: ("Merge CoNLL-U subtokens", "flag", "T", bool) = False,
|
||||
converter: (f"Converter: {tuple(CONVERTERS.keys())}", "option", "c", str) = "auto",
|
||||
ner_map_path: ("NER tag mapping (as JSON-encoded dict of entity types)", "option", "N", Path) = None,
|
||||
lang: ("Language (if tokenizer required)", "option", "l", str) = None,
|
||||
input_path: str = Arg(..., help="Input file or directory", exists=True),
|
||||
output_dir: Path = Arg("-", help="Output directory. '-' for stdout.", allow_dash=True, exists=True),
|
||||
file_type: FileTypes = Opt("spacy", "--file-type", "-t", help="Type of data to produce"),
|
||||
n_sents: int = Opt(1, "--n-sents", "-n", help="Number of sentences per doc (0 to disable)"),
|
||||
seg_sents: bool = Opt(False, "--seg-sents", "-s", help="Segment sentences (for -c ner)"),
|
||||
model: Optional[str] = Opt(None, "--model", "-b", help="Model for sentence segmentation (for -s)"),
|
||||
morphology: bool = Opt(False, "--morphology", "-m", help="Enable appending morphology to tags"),
|
||||
merge_subtokens: bool = Opt(False, "--merge-subtokens", "-T", help="Merge CoNLL-U subtokens"),
|
||||
converter: str = Opt("auto", "--converter", "-c", help=f"Converter: {tuple(CONVERTERS.keys())}"),
|
||||
ner_map: Optional[Path] = Opt(None, "--ner-map", "-N", help="NER tag mapping (as JSON-encoded dict of entity types)", exists=True),
|
||||
lang: Optional[str] = Opt(None, "--lang", "-l", help="Language (if tokenizer required)"),
|
||||
# fmt: on
|
||||
):
|
||||
"""
|
||||
Convert files into JSON format for use with train command and other
|
||||
Convert files into json or DocBin format for use with train command and other
|
||||
experiment management functions. If no output_dir is specified, the data
|
||||
is written to stdout, so you can pipe them forward to a JSON file:
|
||||
$ spacy convert some_file.conllu > some_file.json
|
||||
"""
|
||||
no_print = output_dir == "-"
|
||||
msg = Printer(no_print=no_print)
|
||||
input_path = Path(input_file)
|
||||
if file_type not in FILE_TYPES_STDOUT and output_dir == "-":
|
||||
# TODO: support msgpack via stdout in srsly?
|
||||
msg.fail(
|
||||
f"Can't write .{file_type} data to stdout",
|
||||
"Please specify an output directory.",
|
||||
exits=1,
|
||||
)
|
||||
if not input_path.exists():
|
||||
msg.fail("Input file not found", input_path, exits=1)
|
||||
if output_dir != "-" and not Path(output_dir).exists():
|
||||
msg.fail("Output directory not found", output_dir, exits=1)
|
||||
input_data = input_path.open("r", encoding="utf-8").read()
|
||||
if converter == "auto":
|
||||
converter = input_path.suffix[1:]
|
||||
if converter == "ner" or converter == "iob":
|
||||
converter_autodetect = autodetect_ner_format(input_data)
|
||||
if converter_autodetect == "ner":
|
||||
msg.info("Auto-detected token-per-line NER format")
|
||||
converter = converter_autodetect
|
||||
elif converter_autodetect == "iob":
|
||||
msg.info("Auto-detected sentence-per-line NER format")
|
||||
converter = converter_autodetect
|
||||
else:
|
||||
msg.warn(
|
||||
"Can't automatically detect NER format. Conversion may not succeed. See https://spacy.io/api/cli#convert"
|
||||
)
|
||||
if converter not in CONVERTERS:
|
||||
msg.fail(f"Can't find converter for {converter}", exits=1)
|
||||
ner_map = None
|
||||
if ner_map_path is not None:
|
||||
ner_map = srsly.read_json(ner_map_path)
|
||||
# Use converter function to convert data
|
||||
func = CONVERTERS[converter]
|
||||
data = func(
|
||||
input_data,
|
||||
if isinstance(file_type, FileTypes):
|
||||
# We get an instance of the FileTypes from the CLI so we need its string value
|
||||
file_type = file_type.value
|
||||
input_path = Path(input_path)
|
||||
output_dir = "-" if output_dir == Path("-") else output_dir
|
||||
cli_args = locals()
|
||||
silent = output_dir == "-"
|
||||
msg = Printer(no_print=silent)
|
||||
verify_cli_args(msg, **cli_args)
|
||||
converter = _get_converter(msg, converter, input_path)
|
||||
convert(
|
||||
input_path,
|
||||
output_dir,
|
||||
file_type=file_type,
|
||||
n_sents=n_sents,
|
||||
seg_sents=seg_sents,
|
||||
append_morphology=morphology,
|
||||
merge_subtokens=merge_subtokens,
|
||||
lang=lang,
|
||||
model=model,
|
||||
no_print=no_print,
|
||||
morphology=morphology,
|
||||
merge_subtokens=merge_subtokens,
|
||||
converter=converter,
|
||||
ner_map=ner_map,
|
||||
lang=lang,
|
||||
silent=silent,
|
||||
msg=msg,
|
||||
)
|
||||
if output_dir != "-":
|
||||
# Export data to a file
|
||||
suffix = f".{file_type}"
|
||||
output_file = Path(output_dir) / Path(input_path.parts[-1]).with_suffix(suffix)
|
||||
if file_type == "json":
|
||||
srsly.write_json(output_file, data)
|
||||
elif file_type == "jsonl":
|
||||
srsly.write_jsonl(output_file, data)
|
||||
elif file_type == "msg":
|
||||
srsly.write_msgpack(output_file, data)
|
||||
msg.good(f"Generated output file ({len(data)} documents): {output_file}")
|
||||
|
||||
|
||||
def convert(
|
||||
input_path: Path,
|
||||
output_dir: Path,
|
||||
*,
|
||||
file_type: str = "json",
|
||||
n_sents: int = 1,
|
||||
seg_sents: bool = False,
|
||||
model: Optional[str] = None,
|
||||
morphology: bool = False,
|
||||
merge_subtokens: bool = False,
|
||||
converter: str = "auto",
|
||||
ner_map: Optional[Path] = None,
|
||||
lang: Optional[str] = None,
|
||||
silent: bool = True,
|
||||
msg: Optional[Path] = None,
|
||||
) -> None:
|
||||
if not msg:
|
||||
msg = Printer(no_print=silent)
|
||||
ner_map = srsly.read_json(ner_map) if ner_map is not None else None
|
||||
|
||||
for input_loc in walk_directory(input_path):
|
||||
input_data = input_loc.open("r", encoding="utf-8").read()
|
||||
# Use converter function to convert data
|
||||
func = CONVERTERS[converter]
|
||||
docs = func(
|
||||
input_data,
|
||||
n_sents=n_sents,
|
||||
seg_sents=seg_sents,
|
||||
append_morphology=morphology,
|
||||
merge_subtokens=merge_subtokens,
|
||||
lang=lang,
|
||||
model=model,
|
||||
no_print=silent,
|
||||
ner_map=ner_map,
|
||||
)
|
||||
if output_dir == "-":
|
||||
_print_docs_to_stdout(docs, file_type)
|
||||
else:
|
||||
if input_loc != input_path:
|
||||
subpath = input_loc.relative_to(input_path)
|
||||
output_file = Path(output_dir) / subpath.with_suffix(f".{file_type}")
|
||||
else:
|
||||
output_file = Path(output_dir) / input_loc.parts[-1]
|
||||
output_file = output_file.with_suffix(f".{file_type}")
|
||||
_write_docs_to_file(docs, output_file, file_type)
|
||||
msg.good(f"Generated output file ({len(docs)} documents): {output_file}")
|
||||
|
||||
|
||||
def _print_docs_to_stdout(docs, output_type):
|
||||
if output_type == "json":
|
||||
srsly.write_json("-", docs_to_json(docs))
|
||||
else:
|
||||
# Print to stdout
|
||||
if file_type == "json":
|
||||
srsly.write_json("-", data)
|
||||
elif file_type == "jsonl":
|
||||
srsly.write_jsonl("-", data)
|
||||
sys.stdout.buffer.write(DocBin(docs=docs).to_bytes())
|
||||
|
||||
|
||||
def autodetect_ner_format(input_data):
|
||||
def _write_docs_to_file(docs, output_file, output_type):
|
||||
if not output_file.parent.exists():
|
||||
output_file.parent.mkdir(parents=True)
|
||||
if output_type == "json":
|
||||
srsly.write_json(output_file, docs_to_json(docs))
|
||||
else:
|
||||
data = DocBin(docs=docs).to_bytes()
|
||||
with output_file.open("wb") as file_:
|
||||
file_.write(data)
|
||||
|
||||
|
||||
def autodetect_ner_format(input_data: str) -> str:
|
||||
# guess format from the first 20 lines
|
||||
lines = input_data.split("\n")[:20]
|
||||
format_guesses = {"ner": 0, "iob": 0}
|
||||
|
@ -129,3 +168,86 @@ def autodetect_ner_format(input_data):
|
|||
if format_guesses["ner"] == 0 and format_guesses["iob"] > 0:
|
||||
return "iob"
|
||||
return None
|
||||
|
||||
|
||||
def walk_directory(path):
|
||||
if not path.is_dir():
|
||||
return [path]
|
||||
paths = [path]
|
||||
locs = []
|
||||
seen = set()
|
||||
for path in paths:
|
||||
if str(path) in seen:
|
||||
continue
|
||||
seen.add(str(path))
|
||||
if path.parts[-1].startswith("."):
|
||||
continue
|
||||
elif path.is_dir():
|
||||
paths.extend(path.iterdir())
|
||||
else:
|
||||
locs.append(path)
|
||||
return locs
|
||||
|
||||
|
||||
def verify_cli_args(
|
||||
msg,
|
||||
input_path,
|
||||
output_dir,
|
||||
file_type,
|
||||
n_sents,
|
||||
seg_sents,
|
||||
model,
|
||||
morphology,
|
||||
merge_subtokens,
|
||||
converter,
|
||||
ner_map,
|
||||
lang,
|
||||
):
|
||||
input_path = Path(input_path)
|
||||
if file_type not in FILE_TYPES_STDOUT and output_dir == "-":
|
||||
# TODO: support msgpack via stdout in srsly?
|
||||
msg.fail(
|
||||
f"Can't write .{file_type} data to stdout",
|
||||
"Please specify an output directory.",
|
||||
exits=1,
|
||||
)
|
||||
if not input_path.exists():
|
||||
msg.fail("Input file not found", input_path, exits=1)
|
||||
if output_dir != "-" and not Path(output_dir).exists():
|
||||
msg.fail("Output directory not found", output_dir, exits=1)
|
||||
if input_path.is_dir():
|
||||
input_locs = walk_directory(input_path)
|
||||
if len(input_locs) == 0:
|
||||
msg.fail("No input files in directory", input_path, exits=1)
|
||||
file_types = list(set([loc.suffix[1:] for loc in input_locs]))
|
||||
if len(file_types) >= 2:
|
||||
file_types = ",".join(file_types)
|
||||
msg.fail("All input files must be same type", file_types, exits=1)
|
||||
converter = _get_converter(msg, converter, input_path)
|
||||
if converter not in CONVERTERS:
|
||||
msg.fail(f"Can't find converter for {converter}", exits=1)
|
||||
return converter
|
||||
|
||||
|
||||
def _get_converter(msg, converter, input_path):
|
||||
if input_path.is_dir():
|
||||
input_path = walk_directory(input_path)[0]
|
||||
if converter == "auto":
|
||||
converter = input_path.suffix[1:]
|
||||
if converter == "ner" or converter == "iob":
|
||||
with input_path.open() as file_:
|
||||
input_data = file_.read()
|
||||
converter_autodetect = autodetect_ner_format(input_data)
|
||||
if converter_autodetect == "ner":
|
||||
msg.info("Auto-detected token-per-line NER format")
|
||||
converter = converter_autodetect
|
||||
elif converter_autodetect == "iob":
|
||||
msg.info("Auto-detected sentence-per-line NER format")
|
||||
converter = converter_autodetect
|
||||
else:
|
||||
msg.warn(
|
||||
"Can't automatically detect NER format. "
|
||||
"Conversion may not succeed. "
|
||||
"See https://spacy.io/api/cli#convert"
|
||||
)
|
||||
return converter
|
||||
|
|
|
@ -1,4 +0,0 @@
|
|||
from .conllu2json import conllu2json # noqa: F401
|
||||
from .iob2json import iob2json # noqa: F401
|
||||
from .conll_ner2json import conll_ner2json # noqa: F401
|
||||
from .jsonl2json import ner_jsonl2json # noqa: F401
|
|
@ -1,65 +0,0 @@
|
|||
from wasabi import Printer
|
||||
|
||||
from ...gold import iob_to_biluo
|
||||
from ...util import minibatch
|
||||
from .conll_ner2json import n_sents_info
|
||||
|
||||
|
||||
def iob2json(input_data, n_sents=10, no_print=False, *args, **kwargs):
|
||||
"""
|
||||
Convert IOB files with one sentence per line and tags separated with '|'
|
||||
into JSON format for use with train cli. IOB and IOB2 are accepted.
|
||||
|
||||
Sample formats:
|
||||
|
||||
I|O like|O London|I-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
|
||||
I|O like|O London|B-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
|
||||
I|PRP|O like|VBP|O London|NNP|I-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
|
||||
I|PRP|O like|VBP|O London|NNP|B-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
|
||||
"""
|
||||
msg = Printer(no_print=no_print)
|
||||
docs = read_iob(input_data.split("\n"))
|
||||
if n_sents > 0:
|
||||
n_sents_info(msg, n_sents)
|
||||
docs = merge_sentences(docs, n_sents)
|
||||
return docs
|
||||
|
||||
|
||||
def read_iob(raw_sents):
|
||||
sentences = []
|
||||
for line in raw_sents:
|
||||
if not line.strip():
|
||||
continue
|
||||
tokens = [t.split("|") for t in line.split()]
|
||||
if len(tokens[0]) == 3:
|
||||
words, pos, iob = zip(*tokens)
|
||||
elif len(tokens[0]) == 2:
|
||||
words, iob = zip(*tokens)
|
||||
pos = ["-"] * len(words)
|
||||
else:
|
||||
raise ValueError(
|
||||
"The sentence-per-line IOB/IOB2 file is not formatted correctly. Try checking whitespace and delimiters. See https://spacy.io/api/cli#convert"
|
||||
)
|
||||
biluo = iob_to_biluo(iob)
|
||||
sentences.append(
|
||||
[
|
||||
{"orth": w, "tag": p, "ner": ent}
|
||||
for (w, p, ent) in zip(words, pos, biluo)
|
||||
]
|
||||
)
|
||||
sentences = [{"tokens": sent} for sent in sentences]
|
||||
paragraphs = [{"sentences": [sent]} for sent in sentences]
|
||||
docs = [{"id": i, "paragraphs": [para]} for i, para in enumerate(paragraphs)]
|
||||
return docs
|
||||
|
||||
|
||||
def merge_sentences(docs, n_sents):
|
||||
merged = []
|
||||
for group in minibatch(docs, size=n_sents):
|
||||
group = list(group)
|
||||
first = group.pop(0)
|
||||
to_extend = first["paragraphs"][0]["sentences"]
|
||||
for sent in group:
|
||||
to_extend.extend(sent["paragraphs"][0]["sentences"])
|
||||
merged.append(first)
|
||||
return merged
|
|
@ -1,50 +0,0 @@
|
|||
import srsly
|
||||
|
||||
from ...gold import docs_to_json
|
||||
from ...util import get_lang_class, minibatch
|
||||
|
||||
|
||||
def ner_jsonl2json(input_data, lang=None, n_sents=10, use_morphology=False, **_):
|
||||
if lang is None:
|
||||
raise ValueError("No --lang specified, but tokenization required")
|
||||
json_docs = []
|
||||
input_examples = [srsly.json_loads(line) for line in input_data.strip().split("\n")]
|
||||
nlp = get_lang_class(lang)()
|
||||
sentencizer = nlp.create_pipe("sentencizer")
|
||||
for i, batch in enumerate(minibatch(input_examples, size=n_sents)):
|
||||
docs = []
|
||||
for record in batch:
|
||||
raw_text = record["text"]
|
||||
if "entities" in record:
|
||||
ents = record["entities"]
|
||||
else:
|
||||
ents = record["spans"]
|
||||
ents = [(e["start"], e["end"], e["label"]) for e in ents]
|
||||
doc = nlp.make_doc(raw_text)
|
||||
sentencizer(doc)
|
||||
spans = [doc.char_span(s, e, label=L) for s, e, L in ents]
|
||||
doc.ents = _cleanup_spans(spans)
|
||||
docs.append(doc)
|
||||
json_docs.append(docs_to_json(docs, id=i))
|
||||
return json_docs
|
||||
|
||||
|
||||
def _cleanup_spans(spans):
|
||||
output = []
|
||||
seen = set()
|
||||
for span in spans:
|
||||
if span is not None:
|
||||
# Trim whitespace
|
||||
while len(span) and span[0].is_space:
|
||||
span = span[1:]
|
||||
while len(span) and span[-1].is_space:
|
||||
span = span[:-1]
|
||||
if not len(span):
|
||||
continue
|
||||
for i in range(span.start, span.end):
|
||||
if i in seen:
|
||||
break
|
||||
else:
|
||||
output.append(span)
|
||||
seen.update(range(span.start, span.end))
|
||||
return output
|
|
@ -1,11 +1,14 @@
|
|||
from typing import Optional, List, Sequence, Dict, Any, Tuple
|
||||
from pathlib import Path
|
||||
from collections import Counter
|
||||
import sys
|
||||
import srsly
|
||||
from wasabi import Printer, MESSAGES
|
||||
|
||||
from ..gold import GoldCorpus
|
||||
from ._app import app, Arg, Opt
|
||||
from ..gold import Corpus, Example
|
||||
from ..syntax import nonproj
|
||||
from ..language import Language
|
||||
from ..util import load_model, get_lang_class
|
||||
|
||||
|
||||
|
@ -18,17 +21,18 @@ BLANK_MODEL_MIN_THRESHOLD = 100
|
|||
BLANK_MODEL_THRESHOLD = 2000
|
||||
|
||||
|
||||
def debug_data(
|
||||
@app.command("debug-data")
|
||||
def debug_data_cli(
|
||||
# fmt: off
|
||||
lang: ("Model language", "positional", None, str),
|
||||
train_path: ("Location of JSON-formatted training data", "positional", None, Path),
|
||||
dev_path: ("Location of JSON-formatted development data", "positional", None, Path),
|
||||
tag_map_path: ("Location of JSON-formatted tag map", "option", "tm", Path) = None,
|
||||
base_model: ("Name of model to update (optional)", "option", "b", str) = None,
|
||||
pipeline: ("Comma-separated names of pipeline components to train", "option", "p", str) = "tagger,parser,ner",
|
||||
ignore_warnings: ("Ignore warnings, only show stats and errors", "flag", "IW", bool) = False,
|
||||
verbose: ("Print additional information and explanations", "flag", "V", bool) = False,
|
||||
no_format: ("Don't pretty-print the results", "flag", "NF", bool) = False,
|
||||
lang: str = Arg(..., help="Model language"),
|
||||
train_path: Path = Arg(..., help="Location of JSON-formatted training data", exists=True),
|
||||
dev_path: Path = Arg(..., help="Location of JSON-formatted development data", exists=True),
|
||||
tag_map_path: Optional[Path] = Opt(None, "--tag-map-path", "-tm", help="Location of JSON-formatted tag map", exists=True, dir_okay=False),
|
||||
base_model: Optional[str] = Opt(None, "--base-model", "-b", help="Name of model to update (optional)"),
|
||||
pipeline: str = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of pipeline components to train"),
|
||||
ignore_warnings: bool = Opt(False, "--ignore-warnings", "-IW", help="Ignore warnings, only show stats and errors"),
|
||||
verbose: bool = Opt(False, "--verbose", "-V", help="Print additional information and explanations"),
|
||||
no_format: bool = Opt(False, "--no-format", "-NF", help="Don't pretty-print the results"),
|
||||
# fmt: on
|
||||
):
|
||||
"""
|
||||
|
@ -36,8 +40,36 @@ def debug_data(
|
|||
stats, and find problems like invalid entity annotations, cyclic
|
||||
dependencies, low data labels and more.
|
||||
"""
|
||||
msg = Printer(pretty=not no_format, ignore_warnings=ignore_warnings)
|
||||
debug_data(
|
||||
lang,
|
||||
train_path,
|
||||
dev_path,
|
||||
tag_map_path=tag_map_path,
|
||||
base_model=base_model,
|
||||
pipeline=[p.strip() for p in pipeline.split(",")],
|
||||
ignore_warnings=ignore_warnings,
|
||||
verbose=verbose,
|
||||
no_format=no_format,
|
||||
silent=False,
|
||||
)
|
||||
|
||||
|
||||
def debug_data(
|
||||
lang: str,
|
||||
train_path: Path,
|
||||
dev_path: Path,
|
||||
*,
|
||||
tag_map_path: Optional[Path] = None,
|
||||
base_model: Optional[str] = None,
|
||||
pipeline: List[str] = ["tagger", "parser", "ner"],
|
||||
ignore_warnings: bool = False,
|
||||
verbose: bool = False,
|
||||
no_format: bool = True,
|
||||
silent: bool = True,
|
||||
):
|
||||
msg = Printer(
|
||||
no_print=silent, pretty=not no_format, ignore_warnings=ignore_warnings
|
||||
)
|
||||
# Make sure all files and paths exists if they are needed
|
||||
if not train_path.exists():
|
||||
msg.fail("Training data not found", train_path, exits=1)
|
||||
|
@ -49,7 +81,6 @@ def debug_data(
|
|||
tag_map = srsly.read_json(tag_map_path)
|
||||
|
||||
# Initialize the model and pipeline
|
||||
pipeline = [p.strip() for p in pipeline.split(",")]
|
||||
if base_model:
|
||||
nlp = load_model(base_model)
|
||||
else:
|
||||
|
@ -68,12 +99,9 @@ def debug_data(
|
|||
loading_train_error_message = ""
|
||||
loading_dev_error_message = ""
|
||||
with msg.loading("Loading corpus..."):
|
||||
corpus = GoldCorpus(train_path, dev_path)
|
||||
corpus = Corpus(train_path, dev_path)
|
||||
try:
|
||||
train_dataset = list(corpus.train_dataset(nlp))
|
||||
train_dataset_unpreprocessed = list(
|
||||
corpus.train_dataset_without_preprocessing(nlp)
|
||||
)
|
||||
except ValueError as e:
|
||||
loading_train_error_message = f"Training data cannot be loaded: {e}"
|
||||
try:
|
||||
|
@ -89,11 +117,9 @@ def debug_data(
|
|||
msg.good("Corpus is loadable")
|
||||
|
||||
# Create all gold data here to avoid iterating over the train_dataset constantly
|
||||
gold_train_data = _compile_gold(train_dataset, pipeline, nlp)
|
||||
gold_train_unpreprocessed_data = _compile_gold(
|
||||
train_dataset_unpreprocessed, pipeline
|
||||
)
|
||||
gold_dev_data = _compile_gold(dev_dataset, pipeline, nlp)
|
||||
gold_train_data = _compile_gold(train_dataset, pipeline, nlp, make_proj=True)
|
||||
gold_train_unpreprocessed_data = _compile_gold(train_dataset, pipeline, nlp, make_proj=False)
|
||||
gold_dev_data = _compile_gold(dev_dataset, pipeline, nlp, make_proj=True)
|
||||
|
||||
train_texts = gold_train_data["texts"]
|
||||
dev_texts = gold_dev_data["texts"]
|
||||
|
@ -446,7 +472,7 @@ def debug_data(
|
|||
sys.exit(1)
|
||||
|
||||
|
||||
def _load_file(file_path, msg):
|
||||
def _load_file(file_path: Path, msg: Printer) -> None:
|
||||
file_name = file_path.parts[-1]
|
||||
if file_path.suffix == ".json":
|
||||
with msg.loading(f"Loading {file_name}..."):
|
||||
|
@ -465,7 +491,9 @@ def _load_file(file_path, msg):
|
|||
)
|
||||
|
||||
|
||||
def _compile_gold(examples, pipeline, nlp):
|
||||
def _compile_gold(
|
||||
examples: Sequence[Example], pipeline: List[str], nlp: Language, make_proj: bool
|
||||
) -> Dict[str, Any]:
|
||||
data = {
|
||||
"ner": Counter(),
|
||||
"cats": Counter(),
|
||||
|
@ -484,20 +512,20 @@ def _compile_gold(examples, pipeline, nlp):
|
|||
"n_cats_multilabel": 0,
|
||||
"texts": set(),
|
||||
}
|
||||
for example in examples:
|
||||
gold = example.gold
|
||||
doc = example.doc
|
||||
valid_words = [x for x in gold.words if x is not None]
|
||||
for eg in examples:
|
||||
gold = eg.reference
|
||||
doc = eg.predicted
|
||||
valid_words = [x for x in gold if x is not None]
|
||||
data["words"].update(valid_words)
|
||||
data["n_words"] += len(valid_words)
|
||||
data["n_misaligned_words"] += len(gold.words) - len(valid_words)
|
||||
data["n_misaligned_words"] += len(gold) - len(valid_words)
|
||||
data["texts"].add(doc.text)
|
||||
if len(nlp.vocab.vectors):
|
||||
for word in valid_words:
|
||||
if nlp.vocab.strings[word] not in nlp.vocab.vectors:
|
||||
data["words_missing_vectors"].update([word])
|
||||
if "ner" in pipeline:
|
||||
for i, label in enumerate(gold.ner):
|
||||
for i, label in enumerate(eg.get_aligned_ner()):
|
||||
if label is None:
|
||||
continue
|
||||
if label.startswith(("B-", "U-", "L-")) and doc[i].is_space:
|
||||
|
@ -523,32 +551,34 @@ def _compile_gold(examples, pipeline, nlp):
|
|||
if list(gold.cats.values()).count(1.0) != 1:
|
||||
data["n_cats_multilabel"] += 1
|
||||
if "tagger" in pipeline:
|
||||
data["tags"].update([x for x in gold.tags if x is not None])
|
||||
tags = eg.get_aligned("TAG", as_string=True)
|
||||
data["tags"].update([x for x in tags if x is not None])
|
||||
if "parser" in pipeline:
|
||||
data["deps"].update([x for x in gold.labels if x is not None])
|
||||
for i, (dep, head) in enumerate(zip(gold.labels, gold.heads)):
|
||||
aligned_heads, aligned_deps = eg.get_aligned_parse(projectivize=make_proj)
|
||||
data["deps"].update([x for x in aligned_deps if x is not None])
|
||||
for i, (dep, head) in enumerate(zip(aligned_deps, aligned_heads)):
|
||||
if head == i:
|
||||
data["roots"].update([dep])
|
||||
data["n_sents"] += 1
|
||||
if nonproj.is_nonproj_tree(gold.heads):
|
||||
if nonproj.is_nonproj_tree(aligned_heads):
|
||||
data["n_nonproj"] += 1
|
||||
if nonproj.contains_cycle(gold.heads):
|
||||
if nonproj.contains_cycle(aligned_heads):
|
||||
data["n_cycles"] += 1
|
||||
return data
|
||||
|
||||
|
||||
def _format_labels(labels, counts=False):
|
||||
def _format_labels(labels: List[Tuple[str, int]], counts: bool = False) -> str:
|
||||
if counts:
|
||||
return ", ".join([f"'{l}' ({c})" for l, c in labels])
|
||||
return ", ".join([f"'{l}'" for l in labels])
|
||||
|
||||
|
||||
def _get_examples_without_label(data, label):
|
||||
def _get_examples_without_label(data: Sequence[Example], label: str) -> int:
|
||||
count = 0
|
||||
for ex in data:
|
||||
for eg in data:
|
||||
labels = [
|
||||
label.split("-")[1]
|
||||
for label in ex.gold.ner
|
||||
for label in eg.get_aligned_ner()
|
||||
if label not in ("O", "-", None)
|
||||
]
|
||||
if label not in labels:
|
||||
|
@ -556,7 +586,7 @@ def _get_examples_without_label(data, label):
|
|||
return count
|
||||
|
||||
|
||||
def _get_labels_from_model(nlp, pipe_name):
|
||||
def _get_labels_from_model(nlp: Language, pipe_name: str) -> Sequence[str]:
|
||||
if pipe_name not in nlp.pipe_names:
|
||||
return set()
|
||||
pipe = nlp.get_pipe(pipe_name)
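The recurring pattern in this refactor is a thin typer command (e.g. debug_data_cli) that only parses arguments and delegates to a plain, importable function (debug_data). A hedged sketch of that split, using typer directly and invented names rather than spaCy's internal Arg/Opt helpers:

from pathlib import Path
import typer

app = typer.Typer()

def do_work(path: Path, *, verbose: bool = False, silent: bool = True) -> dict:
    # Plain function: can be imported and tested without the CLI layer.
    result = {"path": str(path), "verbose": verbose}
    if not silent:
        print(result)
    return result

@app.command("do-work")
def do_work_cli(
    path: Path = typer.Argument(..., exists=True),
    verbose: bool = typer.Option(False, "--verbose", "-V"),
):
    # CLI wrapper: hand the parsed arguments straight to the plain function.
    do_work(path, verbose=verbose, silent=False)

if __name__ == "__main__":
    app()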
|
||||
|
|
|
@ -1,23 +1,36 @@
|
|||
from typing import Optional, Sequence, Union
|
||||
import requests
|
||||
import os
|
||||
import subprocess
|
||||
import sys
|
||||
from wasabi import msg
|
||||
import typer
|
||||
|
||||
from ._app import app, Arg, Opt
|
||||
from .. import about
|
||||
from ..util import is_package, get_base_version
|
||||
from ..util import is_package, get_base_version, run_command
|
||||
|
||||
|
||||
def download(
|
||||
model: ("Model to download (shortcut or name)", "positional", None, str),
|
||||
direct: ("Force direct download of name + version", "flag", "d", bool) = False,
|
||||
*pip_args: ("Additional arguments to be passed to `pip install` on model install"),
|
||||
@app.command(
|
||||
"download",
|
||||
context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
|
||||
)
|
||||
def download_cli(
|
||||
# fmt: off
|
||||
ctx: typer.Context,
|
||||
model: str = Arg(..., help="Model to download (shortcut or name)"),
|
||||
direct: bool = Opt(False, "--direct", "-d", "-D", help="Force direct download of name + version"),
|
||||
# fmt: on
|
||||
):
|
||||
"""
|
||||
Download compatible model from default download path using pip. If --direct
|
||||
flag is set, the command expects the full model name with version.
|
||||
For direct downloads, the compatibility check will be skipped.
|
||||
For direct downloads, the compatibility check will be skipped. All
|
||||
additional arguments provided to this command will be passed to `pip install`
|
||||
on model installation.
|
||||
"""
|
||||
download(model, direct, *ctx.args)
|
||||
|
||||
|
||||
def download(model: str, direct: bool = False, *pip_args) -> None:
|
||||
if not is_package("spacy") and "--no-deps" not in pip_args:
|
||||
msg.warn(
|
||||
"Skipping model package dependencies and setting `--no-deps`. "
|
||||
|
@ -33,22 +46,20 @@ def download(
|
|||
components = model.split("-")
|
||||
model_name = "".join(components[:-1])
|
||||
version = components[-1]
|
||||
dl = download_model(dl_tpl.format(m=model_name, v=version), pip_args)
|
||||
download_model(dl_tpl.format(m=model_name, v=version), pip_args)
|
||||
else:
|
||||
shortcuts = get_json(about.__shortcuts__, "available shortcuts")
|
||||
model_name = shortcuts.get(model, model)
|
||||
compatibility = get_compatibility()
|
||||
version = get_version(model_name, compatibility)
|
||||
dl = download_model(dl_tpl.format(m=model_name, v=version), pip_args)
|
||||
if dl != 0: # if download subprocess doesn't return 0, exit
|
||||
sys.exit(dl)
|
||||
msg.good(
|
||||
"Download and installation successful",
|
||||
f"You can now load the model via spacy.load('{model_name}')",
|
||||
)
|
||||
download_model(dl_tpl.format(m=model_name, v=version), pip_args)
|
||||
msg.good(
|
||||
"Download and installation successful",
|
||||
f"You can now load the model via spacy.load('{model_name}')",
|
||||
)
|
||||
|
||||
|
||||
def get_json(url, desc):
|
||||
def get_json(url: str, desc: str) -> Union[dict, list]:
|
||||
r = requests.get(url)
|
||||
if r.status_code != 200:
|
||||
msg.fail(
|
||||
|
@ -62,7 +73,7 @@ def get_json(url, desc):
|
|||
return r.json()
|
||||
|
||||
|
||||
def get_compatibility():
|
||||
def get_compatibility() -> dict:
|
||||
version = get_base_version(about.__version__)
|
||||
comp_table = get_json(about.__compatibility__, "compatibility table")
|
||||
comp = comp_table["spacy"]
|
||||
|
@ -71,7 +82,7 @@ def get_compatibility():
|
|||
return comp[version]
|
||||
|
||||
|
||||
def get_version(model, comp):
|
||||
def get_version(model: str, comp: dict) -> str:
|
||||
model = get_base_version(model)
|
||||
if model not in comp:
|
||||
msg.fail(
|
||||
|
@ -81,10 +92,12 @@ def get_version(model, comp):
|
|||
return comp[model][0]
|
||||
|
||||
|
||||
def download_model(filename, user_pip_args=None):
|
||||
def download_model(
|
||||
filename: str, user_pip_args: Optional[Sequence[str]] = None
|
||||
) -> None:
|
||||
download_url = about.__download_url__ + "/" + filename
|
||||
pip_args = ["--no-cache-dir"]
|
||||
if user_pip_args:
|
||||
pip_args.extend(user_pip_args)
|
||||
cmd = [sys.executable, "-m", "pip", "install"] + pip_args + [download_url]
|
||||
return subprocess.call(cmd, env=os.environ.copy())
|
||||
run_command(cmd)
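A hedged usage sketch for the download() function above (the model name and version are placeholders; the import path assumes the upstream spacy.cli layout implied by the relative imports):

from spacy.cli.download import download  # module path assumed

download("en_core_web_sm")                     # resolves a compatible version, then pip-installs it
download("en_core_web_sm-2.3.0", direct=True)  # skips the compatibility check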
|
||||
|
|
|
@ -1,46 +1,75 @@
|
|||
from typing import Optional, List, Dict
|
||||
from timeit import default_timer as timer
|
||||
from wasabi import msg
|
||||
from wasabi import Printer
|
||||
from pathlib import Path
|
||||
import re
|
||||
import srsly
|
||||
|
||||
from ..gold import GoldCorpus
|
||||
from ..gold import Corpus
|
||||
from ..tokens import Doc
|
||||
from ._app import app, Arg, Opt
|
||||
from ..scorer import Scorer
|
||||
from .. import util
|
||||
from .. import displacy
|
||||
|
||||
|
||||
def evaluate(
|
||||
@app.command("evaluate")
|
||||
def evaluate_cli(
|
||||
# fmt: off
|
||||
model: ("Model name or path", "positional", None, str),
|
||||
data_path: ("Location of JSON-formatted evaluation data", "positional", None, str),
|
||||
gpu_id: ("Use GPU", "option", "g", int) = -1,
|
||||
gold_preproc: ("Use gold preprocessing", "flag", "G", bool) = False,
|
||||
displacy_path: ("Directory to output rendered parses as HTML", "option", "dp", str) = None,
|
||||
displacy_limit: ("Limit of parses to render as HTML", "option", "dl", int) = 25,
|
||||
return_scores: ("Return dict containing model scores", "flag", "R", bool) = False,
|
||||
model: str = Arg(..., help="Model name or path"),
|
||||
data_path: Path = Arg(..., help="Location of JSON-formatted evaluation data", exists=True),
|
||||
output: Optional[Path] = Opt(None, "--output", "-o", help="Output JSON file for metrics", dir_okay=False),
|
||||
gpu_id: int = Opt(-1, "--gpu-id", "-g", help="Use GPU"),
|
||||
gold_preproc: bool = Opt(False, "--gold-preproc", "-G", help="Use gold preprocessing"),
|
||||
displacy_path: Optional[Path] = Opt(None, "--displacy-path", "-dp", help="Directory to output rendered parses as HTML", exists=True, file_okay=False),
|
||||
displacy_limit: int = Opt(25, "--displacy-limit", "-dl", help="Limit of parses to render as HTML"),
|
||||
# fmt: on
|
||||
):
|
||||
"""
|
||||
Evaluate a model. To render a sample of parses in an HTML file, set an
|
||||
output directory as the displacy_path argument.
|
||||
"""
|
||||
evaluate(
|
||||
model,
|
||||
data_path,
|
||||
output=output,
|
||||
gpu_id=gpu_id,
|
||||
gold_preproc=gold_preproc,
|
||||
displacy_path=displacy_path,
|
||||
displacy_limit=displacy_limit,
|
||||
silent=False,
|
||||
)
|
||||
|
||||
|
||||
def evaluate(
|
||||
model: str,
|
||||
data_path: Path,
|
||||
output: Optional[Path],
|
||||
gpu_id: int = -1,
|
||||
gold_preproc: bool = False,
|
||||
displacy_path: Optional[Path] = None,
|
||||
displacy_limit: int = 25,
|
||||
silent: bool = True,
|
||||
) -> Scorer:
|
||||
msg = Printer(no_print=silent, pretty=not silent)
|
||||
util.fix_random_seed()
|
||||
if gpu_id >= 0:
|
||||
util.use_gpu(gpu_id)
|
||||
util.set_env_log(False)
|
||||
data_path = util.ensure_path(data_path)
|
||||
output_path = util.ensure_path(output)
|
||||
displacy_path = util.ensure_path(displacy_path)
|
||||
if not data_path.exists():
|
||||
msg.fail("Evaluation data not found", data_path, exits=1)
|
||||
if displacy_path and not displacy_path.exists():
|
||||
msg.fail("Visualization output directory not found", displacy_path, exits=1)
|
||||
corpus = GoldCorpus(data_path, data_path)
|
||||
if model.startswith("blank:"):
|
||||
nlp = util.get_lang_class(model.replace("blank:", ""))()
|
||||
else:
|
||||
nlp = util.load_model(model)
|
||||
corpus = Corpus(data_path, data_path)
|
||||
nlp = util.load_model(model)
|
||||
dev_dataset = list(corpus.dev_dataset(nlp, gold_preproc=gold_preproc))
|
||||
begin = timer()
|
||||
scorer = nlp.evaluate(dev_dataset, verbose=False)
|
||||
end = timer()
|
||||
nwords = sum(len(ex.doc) for ex in dev_dataset)
|
||||
nwords = sum(len(ex.predicted) for ex in dev_dataset)
|
||||
results = {
|
||||
"Time": f"{end - begin:.2f} s",
|
||||
"Words": nwords,
|
||||
|
@ -60,10 +89,22 @@ def evaluate(
|
|||
"Sent R": f"{scorer.sent_r:.2f}",
|
||||
"Sent F": f"{scorer.sent_f:.2f}",
|
||||
}
|
||||
data = {re.sub(r"[\s/]", "_", k.lower()): v for k, v in results.items()}
|
||||
|
||||
msg.table(results, title="Results")
|
||||
|
||||
if scorer.ents_per_type:
|
||||
data["ents_per_type"] = scorer.ents_per_type
|
||||
print_ents_per_type(msg, scorer.ents_per_type)
|
||||
if scorer.textcats_f_per_cat:
|
||||
data["textcats_f_per_cat"] = scorer.textcats_f_per_cat
|
||||
print_textcats_f_per_cat(msg, scorer.textcats_f_per_cat)
|
||||
if scorer.textcats_auc_per_cat:
|
||||
data["textcats_auc_per_cat"] = scorer.textcats_auc_per_cat
|
||||
print_textcats_auc_per_cat(msg, scorer.textcats_auc_per_cat)
|
||||
|
||||
if displacy_path:
|
||||
docs = [ex.doc for ex in dev_dataset]
|
||||
docs = [ex.predicted for ex in dev_dataset]
|
||||
render_deps = "parser" in nlp.meta.get("pipeline", [])
|
||||
render_ents = "ner" in nlp.meta.get("pipeline", [])
|
||||
render_parses(
|
||||
|
@ -75,11 +116,21 @@ def evaluate(
|
|||
ents=render_ents,
|
||||
)
|
||||
msg.good(f"Generated {displacy_limit} parses as HTML", displacy_path)
|
||||
if return_scores:
|
||||
return scorer.scores
|
||||
|
||||
if output_path is not None:
|
||||
srsly.write_json(output_path, data)
|
||||
msg.good(f"Saved results to {output_path}")
|
||||
return data
|
||||
|
||||
|
||||
def render_parses(docs, output_path, model_name="", limit=250, deps=True, ents=True):
|
||||
def render_parses(
|
||||
docs: List[Doc],
|
||||
output_path: Path,
|
||||
model_name: str = "",
|
||||
limit: int = 250,
|
||||
deps: bool = True,
|
||||
ents: bool = True,
|
||||
):
|
||||
docs[0].user_data["title"] = model_name
|
||||
if ents:
|
||||
html = displacy.render(docs[:limit], style="ent", page=True)
|
||||
|
@ -91,3 +142,40 @@ def render_parses(docs, output_path, model_name="", limit=250, deps=True, ents=T
|
|||
)
|
||||
with (output_path / "parses.html").open("w", encoding="utf8") as file_:
|
||||
file_.write(html)
|
||||
|
||||
|
||||
def print_ents_per_type(msg: Printer, scores: Dict[str, Dict[str, float]]) -> None:
|
||||
data = [
|
||||
(k, f"{v['p']:.2f}", f"{v['r']:.2f}", f"{v['f']:.2f}")
|
||||
for k, v in scores.items()
|
||||
]
|
||||
msg.table(
|
||||
data,
|
||||
header=("", "P", "R", "F"),
|
||||
aligns=("l", "r", "r", "r"),
|
||||
title="NER (per type)",
|
||||
)
|
||||
|
||||
|
||||
def print_textcats_f_per_cat(msg: Printer, scores: Dict[str, Dict[str, float]]) -> None:
|
||||
data = [
|
||||
(k, f"{v['p']:.2f}", f"{v['r']:.2f}", f"{v['f']:.2f}")
|
||||
for k, v in scores.items()
|
||||
]
|
||||
msg.table(
|
||||
data,
|
||||
header=("", "P", "R", "F"),
|
||||
aligns=("l", "r", "r", "r"),
|
||||
title="Textcat F (per type)",
|
||||
)
|
||||
|
||||
|
||||
def print_textcats_auc_per_cat(
|
||||
msg: Printer, scores: Dict[str, Dict[str, float]]
|
||||
) -> None:
|
||||
msg.table(
|
||||
[(k, f"{v['roc_auc_score']:.2f}") for k, v in scores.items()],
|
||||
header=("", "ROC AUC"),
|
||||
aligns=("l", "r"),
|
||||
title="Textcat ROC AUC (per label)",
|
||||
)
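A hedged sketch of calling the refactored evaluate() above from Python (model name and paths are placeholders; module path assumed as spacy.cli.evaluate). With silent=True it returns the metrics dict instead of printing, and also writes it to disk when output is given:

from pathlib import Path
from spacy.cli.evaluate import evaluate  # module path assumed

scores = evaluate(
    "en_core_web_sm",             # placeholder model
    Path("dev.json"),             # placeholder evaluation data
    output=Path("metrics.json"),  # also written as JSON because output is set
    gold_preproc=False,
    silent=True,
)
print(scores["words"], scores["time"])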
|
||||
|
|
|
@ -1,77 +1,109 @@
|
|||
from typing import Optional, Dict, Any, Union
|
||||
import platform
|
||||
from pathlib import Path
|
||||
from wasabi import msg
|
||||
from wasabi import Printer
|
||||
import srsly
|
||||
|
||||
from .validate import get_model_pkgs
|
||||
from ._app import app, Arg, Opt
|
||||
from .. import util
|
||||
from .. import about
|
||||
|
||||
|
||||
def info(
|
||||
model: ("Optional model name", "positional", None, str) = None,
|
||||
markdown: ("Generate Markdown for GitHub issues", "flag", "md", str) = False,
|
||||
silent: ("Don't print anything (just return)", "flag", "s") = False,
|
||||
@app.command("info")
|
||||
def info_cli(
|
||||
# fmt: off
|
||||
model: Optional[str] = Arg(None, help="Optional model name"),
|
||||
markdown: bool = Opt(False, "--markdown", "-md", help="Generate Markdown for GitHub issues"),
|
||||
silent: bool = Opt(False, "--silent", "-s", "-S", help="Don't print anything (just return)"),
|
||||
# fmt: on
|
||||
):
|
||||
"""
|
||||
Print info about spaCy installation. If a model is specified as an argument,
|
||||
print model information. Flag --markdown prints details in Markdown for easy
|
||||
copy-pasting to GitHub issues.
|
||||
"""
|
||||
info(model, markdown=markdown, silent=silent)
|
||||
|
||||
|
||||
def info(
|
||||
model: Optional[str] = None, *, markdown: bool = False, silent: bool = True
|
||||
) -> Union[str, dict]:
|
||||
msg = Printer(no_print=silent, pretty=not silent)
|
||||
if model:
|
||||
if util.is_package(model):
|
||||
model_path = util.get_package_path(model)
|
||||
else:
|
||||
model_path = model
|
||||
meta_path = model_path / "meta.json"
|
||||
if not meta_path.is_file():
|
||||
msg.fail("Can't find model meta.json", meta_path, exits=1)
|
||||
meta = srsly.read_json(meta_path)
|
||||
if model_path.resolve() != model_path:
|
||||
meta["link"] = str(model_path)
|
||||
meta["source"] = str(model_path.resolve())
|
||||
else:
|
||||
meta["source"] = str(model_path)
|
||||
title = f"Info about model '{model}'"
|
||||
data = info_model(model, silent=silent)
|
||||
else:
|
||||
title = "Info about spaCy"
|
||||
data = info_spacy()
|
||||
raw_data = {k.lower().replace(" ", "_"): v for k, v in data.items()}
|
||||
if "Models" in data and isinstance(data["Models"], dict):
|
||||
data["Models"] = ", ".join(f"{n} ({v})" for n, v in data["Models"].items())
|
||||
markdown_data = get_markdown(data, title=title)
|
||||
if markdown:
|
||||
if not silent:
|
||||
title = f"Info about model '{model}'"
|
||||
model_meta = {
|
||||
k: v for k, v in meta.items() if k not in ("accuracy", "speed")
|
||||
}
|
||||
if markdown:
|
||||
print_markdown(model_meta, title=title)
|
||||
else:
|
||||
msg.table(model_meta, title=title)
|
||||
return meta
|
||||
all_models, _ = get_model_pkgs()
|
||||
data = {
|
||||
print(markdown_data)
|
||||
return markdown_data
|
||||
if not silent:
|
||||
table_data = dict(data)
|
||||
msg.table(table_data, title=title)
|
||||
return raw_data
|
||||
|
||||
|
||||
def info_spacy() -> Dict[str, Any]:
|
||||
"""Generate info about the current spaCy intallation.
|
||||
|
||||
RETURNS (dict): The spaCy info.
|
||||
"""
|
||||
all_models = {}
|
||||
for pkg_name in util.get_installed_models():
|
||||
package = pkg_name.replace("-", "_")
|
||||
all_models[package] = util.get_package_version(pkg_name)
|
||||
return {
|
||||
"spaCy version": about.__version__,
|
||||
"Location": str(Path(__file__).parent.parent),
|
||||
"Platform": platform.platform(),
|
||||
"Python version": platform.python_version(),
|
||||
"Models": ", ".join(
|
||||
f"{m['name']} ({m['version']})" for m in all_models.values()
|
||||
),
|
||||
"Models": all_models,
|
||||
}
|
||||
if not silent:
|
||||
title = "Info about spaCy"
|
||||
if markdown:
|
||||
print_markdown(data, title=title)
|
||||
else:
|
||||
msg.table(data, title=title)
|
||||
return data
|
||||
|
||||
|
||||
def print_markdown(data, title=None):
|
||||
"""Print data in GitHub-flavoured Markdown format for issues etc.
|
||||
def info_model(model: str, *, silent: bool = True) -> Dict[str, Any]:
|
||||
"""Generate info about a specific model.
|
||||
|
||||
model (str): Model name or path.
|
||||
silent (bool): Don't print anything, just return.
|
||||
RETURNS (dict): The model meta.
|
||||
"""
|
||||
msg = Printer(no_print=silent, pretty=not silent)
|
||||
if util.is_package(model):
|
||||
model_path = util.get_package_path(model)
|
||||
else:
|
||||
model_path = model
|
||||
meta_path = model_path / "meta.json"
|
||||
if not meta_path.is_file():
|
||||
msg.fail("Can't find model meta.json", meta_path, exits=1)
|
||||
meta = srsly.read_json(meta_path)
|
||||
if model_path.resolve() != model_path:
|
||||
meta["link"] = str(model_path)
|
||||
meta["source"] = str(model_path.resolve())
|
||||
else:
|
||||
meta["source"] = str(model_path)
|
||||
return {k: v for k, v in meta.items() if k not in ("accuracy", "speed")}
|
||||
|
||||
|
||||
def get_markdown(data: Dict[str, Any], title: Optional[str] = None) -> str:
|
||||
"""Get data in GitHub-flavoured Markdown format for issues etc.
|
||||
|
||||
data (dict or list of tuples): Label/value pairs.
|
||||
title (str / None): Title, will be rendered as headline 2.
|
||||
RETURNS (str): The Markdown string.
|
||||
"""
|
||||
markdown = []
|
||||
for key, value in data.items():
|
||||
if isinstance(value, str) and Path(value).exists():
|
||||
continue
|
||||
markdown.append(f"* **{key}:** {value}")
|
||||
result = "\n{}\n".format("\n".join(markdown))
|
||||
if title:
|
||||
print(f"\n## {title}")
|
||||
print("\n{}\n".format("\n".join(markdown)))
|
||||
result = f"\n## {title}\n{result}"
|
||||
return result
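A hedged sketch of the refactored info()/get_markdown() split above (module path assumed as spacy.cli.info; no model argument, so only installation info is returned):

from spacy.cli.info import info, get_markdown  # module path assumed

data = info(silent=True)               # dict: spaCy version, location, platform, models
md = info(markdown=True, silent=True)  # same info as a Markdown string, nothing printed
print(get_markdown({"spaCy version": data["spacy_version"]}, title="Example"))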
|
||||
|
|
|
@ -1,3 +1,4 @@
|
|||
from typing import Optional, List, Dict, Any, Union, IO
|
||||
import math
|
||||
from tqdm import tqdm
|
||||
import numpy
|
||||
|
@ -9,10 +10,12 @@ import gzip
|
|||
import zipfile
|
||||
import srsly
|
||||
import warnings
|
||||
from wasabi import msg
|
||||
from wasabi import Printer
|
||||
|
||||
from ._app import app, Arg, Opt
|
||||
from ..vectors import Vectors
|
||||
from ..errors import Errors, Warnings
|
||||
from ..language import Language
|
||||
from ..util import ensure_path, get_lang_class, load_model, OOV_RANK
|
||||
from ..lookups import Lookups
|
||||
|
||||
|
@ -25,20 +28,21 @@ except ImportError:
|
|||
DEFAULT_OOV_PROB = -20
|
||||
|
||||
|
||||
def init_model(
|
||||
@app.command("init-model")
|
||||
def init_model_cli(
|
||||
# fmt: off
|
||||
lang: ("Model language", "positional", None, str),
|
||||
output_dir: ("Model output directory", "positional", None, Path),
|
||||
freqs_loc: ("Location of words frequencies file", "option", "f", Path) = None,
|
||||
clusters_loc: ("Optional location of brown clusters data", "option", "c", str) = None,
|
||||
jsonl_loc: ("Location of JSONL-formatted attributes file", "option", "j", Path) = None,
|
||||
vectors_loc: ("Optional vectors file in Word2Vec format", "option", "v", str) = None,
|
||||
prune_vectors: ("Optional number of vectors to prune to", "option", "V", int) = -1,
|
||||
truncate_vectors: ("Optional number of vectors to truncate to when reading in vectors file", "option", "t", int) = 0,
|
||||
vectors_name: ("Optional name for the word vectors, e.g. en_core_web_lg.vectors", "option", "vn", str) = None,
|
||||
model_name: ("Optional name for the model meta", "option", "mn", str) = None,
|
||||
omit_extra_lookups: ("Don't include extra lookups in model", "flag", "OEL", bool) = False,
|
||||
base_model: ("Base model (for languages with custom tokenizers)", "option", "b", str) = None
|
||||
lang: str = Arg(..., help="Model language"),
|
||||
output_dir: Path = Arg(..., help="Model output directory"),
|
||||
freqs_loc: Optional[Path] = Arg(None, help="Location of words frequencies file", exists=True),
|
||||
clusters_loc: Optional[Path] = Opt(None, "--clusters-loc", "-c", help="Optional location of brown clusters data", exists=True),
|
||||
jsonl_loc: Optional[Path] = Opt(None, "--jsonl-loc", "-j", help="Location of JSONL-formatted attributes file", exists=True),
|
||||
vectors_loc: Optional[Path] = Opt(None, "--vectors-loc", "-v", help="Optional vectors file in Word2Vec format", exists=True),
|
||||
prune_vectors: int = Opt(-1 , "--prune-vectors", "-V", help="Optional number of vectors to prune to"),
|
||||
truncate_vectors: int = Opt(0, "--truncate-vectors", "-t", help="Optional number of vectors to truncate to when reading in vectors file"),
|
||||
vectors_name: Optional[str] = Opt(None, "--vectors-name", "-vn", help="Optional name for the word vectors, e.g. en_core_web_lg.vectors"),
|
||||
model_name: Optional[str] = Opt(None, "--model-name", "-mn", help="Optional name for the model meta"),
|
||||
omit_extra_lookups: bool = Opt(False, "--omit-extra-lookups", "-OEL", help="Don't include extra lookups in model"),
|
||||
base_model: Optional[str] = Opt(None, "--base-model", "-b", help="Base model (for languages with custom tokenizers)")
|
||||
# fmt: on
|
||||
):
|
||||
"""
|
||||
|
@ -46,6 +50,38 @@ def init_model(
|
|||
and word vectors. If vectors are provided in Word2Vec format, they can
|
||||
be either a .txt or zipped as a .zip or .tar.gz.
|
||||
"""
|
||||
init_model(
|
||||
lang,
|
||||
output_dir,
|
||||
freqs_loc=freqs_loc,
|
||||
clusters_loc=clusters_loc,
|
||||
jsonl_loc=jsonl_loc,
|
||||
prune_vectors=prune_vectors,
|
||||
truncate_vectors=truncate_vectors,
|
||||
vectors_name=vectors_name,
|
||||
model_name=model_name,
|
||||
omit_extra_lookups=omit_extra_lookups,
|
||||
base_model=base_model,
|
||||
silent=False,
|
||||
)
|
||||
|
||||
|
||||
def init_model(
|
||||
lang: str,
|
||||
output_dir: Path,
|
||||
freqs_loc: Optional[Path] = None,
|
||||
clusters_loc: Optional[Path] = None,
|
||||
jsonl_loc: Optional[Path] = None,
|
||||
vectors_loc: Optional[Path] = None,
|
||||
prune_vectors: int = -1,
|
||||
truncate_vectors: int = 0,
|
||||
vectors_name: Optional[str] = None,
|
||||
model_name: Optional[str] = None,
|
||||
omit_extra_lookups: bool = False,
|
||||
base_model: Optional[str] = None,
|
||||
silent: bool = True,
|
||||
) -> Language:
|
||||
msg = Printer(no_print=silent, pretty=not silent)
|
||||
if jsonl_loc is not None:
|
||||
if freqs_loc is not None or clusters_loc is not None:
|
||||
settings = ["-j"]
|
||||
|
@ -68,7 +104,7 @@ def init_model(
|
|||
freqs_loc = ensure_path(freqs_loc)
|
||||
if freqs_loc is not None and not freqs_loc.exists():
|
||||
msg.fail("Can't find words frequencies file", freqs_loc, exits=1)
|
||||
lex_attrs = read_attrs_from_deprecated(freqs_loc, clusters_loc)
|
||||
lex_attrs = read_attrs_from_deprecated(msg, freqs_loc, clusters_loc)
|
||||
|
||||
with msg.loading("Creating model..."):
|
||||
nlp = create_model(lang, lex_attrs, name=model_name, base_model=base_model)
|
||||
|
@ -83,7 +119,9 @@ def init_model(
|
|||
|
||||
msg.good("Successfully created model")
|
||||
if vectors_loc is not None:
|
||||
add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, vectors_name)
|
||||
add_vectors(
|
||||
msg, nlp, vectors_loc, truncate_vectors, prune_vectors, vectors_name
|
||||
)
|
||||
vec_added = len(nlp.vocab.vectors)
|
||||
lex_added = len(nlp.vocab)
|
||||
msg.good(
|
||||
|
@ -95,7 +133,7 @@ def init_model(
|
|||
return nlp
|
||||
|
||||
|
||||
def open_file(loc):
|
||||
def open_file(loc: Union[str, Path]) -> IO:
|
||||
"""Handle .gz, .tar.gz or unzipped files"""
|
||||
loc = ensure_path(loc)
|
||||
if tarfile.is_tarfile(str(loc)):
|
||||
|
@ -111,7 +149,9 @@ def open_file(loc):
|
|||
return loc.open("r", encoding="utf8")
|
||||
|
||||
|
||||
def read_attrs_from_deprecated(freqs_loc, clusters_loc):
|
||||
def read_attrs_from_deprecated(
|
||||
msg: Printer, freqs_loc: Optional[Path], clusters_loc: Optional[Path]
|
||||
) -> List[Dict[str, Any]]:
|
||||
if freqs_loc is not None:
|
||||
with msg.loading("Counting frequencies..."):
|
||||
probs, _ = read_freqs(freqs_loc)
|
||||
|
@ -139,7 +179,12 @@ def read_attrs_from_deprecated(freqs_loc, clusters_loc):
|
|||
return lex_attrs
|
||||
|
||||
|
||||
def create_model(lang, lex_attrs, name=None, base_model=None):
|
||||
def create_model(
|
||||
lang: str,
|
||||
lex_attrs: List[Dict[str, Any]],
|
||||
name: Optional[str] = None,
|
||||
base_model: Optional[Union[str, Path]] = None,
|
||||
) -> Language:
|
||||
if base_model:
|
||||
nlp = load_model(base_model)
|
||||
# keep the tokenizer but remove any existing pipeline components due to
|
||||
|
@ -166,7 +211,14 @@ def create_model(lang, lex_attrs, name=None, base_model=None):
|
|||
return nlp
|
||||
|
||||
|
||||
def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None):
|
||||
def add_vectors(
|
||||
msg: Printer,
|
||||
nlp: Language,
|
||||
vectors_loc: Optional[Path],
|
||||
truncate_vectors: int,
|
||||
prune_vectors: int,
|
||||
name: Optional[str] = None,
|
||||
) -> None:
|
||||
vectors_loc = ensure_path(vectors_loc)
|
||||
if vectors_loc and vectors_loc.parts[-1].endswith(".npz"):
|
||||
nlp.vocab.vectors = Vectors(data=numpy.load(vectors_loc.open("rb")))
|
||||
|
@ -176,7 +228,7 @@ def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None):
|
|||
else:
|
||||
if vectors_loc:
|
||||
with msg.loading(f"Reading vectors from {vectors_loc}"):
|
||||
vectors_data, vector_keys = read_vectors(vectors_loc)
|
||||
vectors_data, vector_keys = read_vectors(msg, vectors_loc)
|
||||
msg.good(f"Loaded vectors from {vectors_loc}")
|
||||
else:
|
||||
vectors_data, vector_keys = (None, None)
|
||||
|
@ -195,7 +247,7 @@ def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None):
|
|||
nlp.vocab.prune_vectors(prune_vectors)
|
||||
|
||||
|
||||
def read_vectors(vectors_loc, truncate_vectors=0):
|
||||
def read_vectors(msg: Printer, vectors_loc: Path, truncate_vectors: int = 0):
|
||||
f = open_file(vectors_loc)
|
||||
shape = tuple(int(size) for size in next(f).split())
|
||||
if truncate_vectors >= 1:
|
||||
|
@ -215,7 +267,9 @@ def read_vectors(vectors_loc, truncate_vectors=0):
|
|||
return vectors_data, vectors_keys
|
||||
|
||||
|
||||
def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50):
|
||||
def read_freqs(
|
||||
freqs_loc: Path, max_length: int = 100, min_doc_freq: int = 5, min_freq: int = 50
|
||||
):
|
||||
counts = PreshCounter()
|
||||
total = 0
|
||||
with freqs_loc.open() as f:
|
||||
|
@ -244,7 +298,7 @@ def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50):
|
|||
return probs, oov_prob
|
||||
|
||||
|
||||
def read_clusters(clusters_loc):
|
||||
def read_clusters(clusters_loc: Path) -> dict:
|
||||
clusters = {}
|
||||
if ftfy is None:
|
||||
warnings.warn(Warnings.W004)
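A hedged sketch of calling init_model() above directly (language code, paths and pruning size are placeholders; module path assumed as spacy.cli.init_model):

from pathlib import Path
from spacy.cli.init_model import init_model  # module path assumed

nlp = init_model(
    "en",                                # placeholder language
    Path("./blank_en_model"),            # output directory
    vectors_loc=Path("vectors.txt.gz"),  # optional Word2Vec-format vectors
    prune_vectors=20000,                 # keep the 20k most frequent vector rows
    silent=True,
)
print(len(nlp.vocab), len(nlp.vocab.vectors))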
|
||||
|
|
|
@ -1,19 +1,25 @@
|
|||
from typing import Optional, Union, Any, Dict
|
||||
import shutil
|
||||
from pathlib import Path
|
||||
from wasabi import msg, get_raw_input
|
||||
from wasabi import Printer, get_raw_input
|
||||
import srsly
|
||||
import sys
|
||||
|
||||
from ._app import app, Arg, Opt
|
||||
from ..schemas import validate, ModelMetaSchema
|
||||
from .. import util
|
||||
from .. import about
|
||||
|
||||
|
||||
def package(
|
||||
@app.command("package")
|
||||
def package_cli(
|
||||
# fmt: off
|
||||
input_dir: ("Directory with model data", "positional", None, str),
|
||||
output_dir: ("Output parent directory", "positional", None, str),
|
||||
meta_path: ("Path to meta.json", "option", "m", str) = None,
|
||||
create_meta: ("Create meta.json, even if one exists", "flag", "c", bool) = False,
|
||||
force: ("Force overwriting existing model in output directory", "flag", "f", bool) = False,
|
||||
input_dir: Path = Arg(..., help="Directory with model data", exists=True, file_okay=False),
|
||||
output_dir: Path = Arg(..., help="Output parent directory", exists=True, file_okay=False),
|
||||
meta_path: Optional[Path] = Opt(None, "--meta-path", "--meta", "-m", help="Path to meta.json", exists=True, dir_okay=False),
|
||||
create_meta: bool = Opt(False, "--create-meta", "-c", "-C", help="Create meta.json, even if one exists"),
|
||||
version: Optional[str] = Opt(None, "--version", "-v", help="Package version to override meta"),
|
||||
force: bool = Opt(False, "--force", "-f", "-F", help="Force overwriting existing model in output directory"),
|
||||
# fmt: on
|
||||
):
|
||||
"""
|
||||
|
@ -23,6 +29,27 @@ def package(
|
|||
set and a meta.json already exists in the output directory, the existing
|
||||
values will be used as the defaults in the command-line prompt.
|
||||
"""
|
||||
package(
|
||||
input_dir,
|
||||
output_dir,
|
||||
meta_path=meta_path,
|
||||
version=version,
|
||||
create_meta=create_meta,
|
||||
force=force,
|
||||
silent=False,
|
||||
)
|
||||
|
||||
|
||||
def package(
|
||||
input_dir: Path,
|
||||
output_dir: Path,
|
||||
meta_path: Optional[Path] = None,
|
||||
version: Optional[str] = None,
|
||||
create_meta: bool = False,
|
||||
force: bool = False,
|
||||
silent: bool = True,
|
||||
) -> None:
|
||||
msg = Printer(no_print=silent, pretty=not silent)
|
||||
input_path = util.ensure_path(input_dir)
|
||||
output_path = util.ensure_path(output_dir)
|
||||
meta_path = util.ensure_path(meta_path)
|
||||
|
@ -33,23 +60,23 @@ def package(
|
|||
if meta_path and not meta_path.exists():
|
||||
msg.fail("Can't find model meta.json", meta_path, exits=1)
|
||||
|
||||
meta_path = meta_path or input_path / "meta.json"
|
||||
if meta_path.is_file():
|
||||
meta = srsly.read_json(meta_path)
|
||||
if not create_meta: # only print if user doesn't want to overwrite
|
||||
msg.good("Loaded meta.json from file", meta_path)
|
||||
else:
|
||||
meta = generate_meta(input_dir, meta, msg)
|
||||
for key in ("lang", "name", "version"):
|
||||
if key not in meta or meta[key] == "":
|
||||
msg.fail(
|
||||
f"No '{key}' setting found in meta.json",
|
||||
"This setting is required to build your package.",
|
||||
exits=1,
|
||||
)
|
||||
meta_path = meta_path or input_dir / "meta.json"
|
||||
if not meta_path.exists() or not meta_path.is_file():
|
||||
msg.fail("Can't load model meta.json", meta_path, exits=1)
|
||||
meta = srsly.read_json(meta_path)
|
||||
meta = get_meta(input_dir, meta)
|
||||
if version is not None:
|
||||
meta["version"] = version
|
||||
if not create_meta: # only print if user doesn't want to overwrite
|
||||
msg.good("Loaded meta.json from file", meta_path)
|
||||
else:
|
||||
meta = generate_meta(meta, msg)
|
||||
errors = validate(ModelMetaSchema, meta)
|
||||
if errors:
|
||||
msg.fail("Invalid model meta.json", "\n".join(errors), exits=1)
|
||||
model_name = meta["lang"] + "_" + meta["name"]
|
||||
model_name_v = model_name + "-" + meta["version"]
|
||||
main_path = output_path / model_name_v
|
||||
main_path = output_dir / model_name_v
|
||||
package_path = main_path / model_name
|
||||
|
||||
if package_path.exists():
|
||||
|
@ -63,32 +90,37 @@ def package(
|
|||
exits=1,
|
||||
)
|
||||
Path.mkdir(package_path, parents=True)
|
||||
shutil.copytree(str(input_path), str(package_path / model_name_v))
|
||||
shutil.copytree(str(input_dir), str(package_path / model_name_v))
|
||||
create_file(main_path / "meta.json", srsly.json_dumps(meta, indent=2))
|
||||
create_file(main_path / "setup.py", TEMPLATE_SETUP)
|
||||
create_file(main_path / "MANIFEST.in", TEMPLATE_MANIFEST)
|
||||
create_file(package_path / "__init__.py", TEMPLATE_INIT)
|
||||
msg.good(f"Successfully created package '{model_name_v}'", main_path)
|
||||
msg.text("To build the package, run `python setup.py sdist` in this directory.")
|
||||
with util.working_dir(main_path):
|
||||
util.run_command([sys.executable, "setup.py", "sdist"])
|
||||
zip_file = main_path / "dist" / f"{model_name_v}.tar.gz"
|
||||
msg.good(f"Successfully created zipped Python package", zip_file)
|
||||
|
||||
|
||||
def create_file(file_path, contents):
|
||||
def create_file(file_path: Path, contents: str) -> None:
|
||||
file_path.touch()
|
||||
file_path.open("w", encoding="utf-8").write(contents)
|
||||
|
||||
|
||||
def generate_meta(model_path, existing_meta, msg):
|
||||
meta = existing_meta or {}
|
||||
settings = [
|
||||
("lang", "Model language", meta.get("lang", "en")),
|
||||
("name", "Model name", meta.get("name", "model")),
|
||||
("version", "Model version", meta.get("version", "0.0.0")),
|
||||
("description", "Model description", meta.get("description", False)),
|
||||
("author", "Author", meta.get("author", False)),
|
||||
("email", "Author email", meta.get("email", False)),
|
||||
("url", "Author website", meta.get("url", False)),
|
||||
("license", "License", meta.get("license", "MIT")),
|
||||
]
|
||||
def get_meta(
|
||||
model_path: Union[str, Path], existing_meta: Dict[str, Any]
|
||||
) -> Dict[str, Any]:
|
||||
meta = {
|
||||
"lang": "en",
|
||||
"name": "model",
|
||||
"version": "0.0.0",
|
||||
"description": None,
|
||||
"author": None,
|
||||
"email": None,
|
||||
"url": None,
|
||||
"license": "MIT",
|
||||
}
|
||||
meta.update(existing_meta)
|
||||
nlp = util.load_model_from_path(Path(model_path))
|
||||
meta["spacy_version"] = util.get_model_version_range(about.__version__)
|
||||
meta["pipeline"] = nlp.pipe_names
|
||||
|
@ -98,6 +130,23 @@ def generate_meta(model_path, existing_meta, msg):
|
|||
"keys": nlp.vocab.vectors.n_keys,
|
||||
"name": nlp.vocab.vectors.name,
|
||||
}
|
||||
if about.__title__ != "spacy":
|
||||
meta["parent_package"] = about.__title__
|
||||
return meta
|
||||
|
||||
|
||||
def generate_meta(existing_meta: Dict[str, Any], msg: Printer) -> Dict[str, Any]:
|
||||
meta = existing_meta or {}
|
||||
settings = [
|
||||
("lang", "Model language", meta.get("lang", "en")),
|
||||
("name", "Model name", meta.get("name", "model")),
|
||||
("version", "Model version", meta.get("version", "0.0.0")),
|
||||
("description", "Model description", meta.get("description", None)),
|
||||
("author", "Author", meta.get("author", None)),
|
||||
("email", "Author email", meta.get("email", None)),
|
||||
("url", "Author website", meta.get("url", None)),
|
||||
("license", "License", meta.get("license", "MIT")),
|
||||
]
|
||||
msg.divider("Generating meta.json")
|
||||
msg.text(
|
||||
"Enter the package settings for your model. The following information "
|
||||
|
@ -106,8 +155,6 @@ def generate_meta(model_path, existing_meta, msg):
|
|||
for setting, desc, default in settings:
|
||||
response = get_raw_input(desc, default)
|
||||
meta[setting] = default if response == "" and default else response
|
||||
if about.__title__ != "spacy":
|
||||
meta["parent_package"] = about.__title__
|
||||
return meta
|
||||
|
||||
|
||||
|
@ -158,12 +205,12 @@ def setup_package():
|
|||
|
||||
setup(
|
||||
name=model_name,
|
||||
description=meta['description'],
|
||||
author=meta['author'],
|
||||
author_email=meta['email'],
|
||||
url=meta['url'],
|
||||
description=meta.get('description'),
|
||||
author=meta.get('author'),
|
||||
author_email=meta.get('email'),
|
||||
url=meta.get('url'),
|
||||
version=meta['version'],
|
||||
license=meta['license'],
|
||||
license=meta.get('license'),
|
||||
packages=[model_name],
|
||||
package_data={model_name: list_files(model_dir)},
|
||||
install_requires=list_requirements(meta),
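A hedged sketch of the refactored package() call above (directories and version are placeholders; module path assumed as spacy.cli.package). Note the new --version option, which overrides meta["version"] before validation:

from pathlib import Path
from spacy.cli.package import package  # module path assumed

package(
    Path("./trained_model"),  # input_dir containing meta.json and model data
    Path("./packages"),       # existing output parent directory
    version="0.0.1",          # overrides meta["version"], per the new option above
    create_meta=False,        # reuse the existing meta.json
    force=True,               # overwrite an existing package directory
    silent=True,
)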
|
||||
|
|
|
@ -1,14 +1,15 @@
|
|||
from typing import Optional
|
||||
import random
|
||||
import numpy
|
||||
import time
|
||||
import re
|
||||
from collections import Counter
|
||||
import plac
|
||||
from pathlib import Path
|
||||
from thinc.api import Linear, Maxout, chain, list2array, use_pytorch_for_gpu_memory
|
||||
from wasabi import msg
|
||||
import srsly
|
||||
|
||||
from ._app import app, Arg, Opt
|
||||
from ..errors import Errors
|
||||
from ..ml.models.multi_task import build_masked_language_model
|
||||
from ..tokens import Doc
|
||||
|
@ -17,25 +18,17 @@ from .. import util
|
|||
from ..gold import Example
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
@app.command("pretrain")
|
||||
def pretrain_cli(
|
||||
# fmt: off
|
||||
texts_loc=("Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the key 'tokens'", "positional", None, str),
|
||||
vectors_model=("Name or path to spaCy model with vectors to learn from", "positional", None, str),
|
||||
output_dir=("Directory to write models to on each epoch", "positional", None, Path),
|
||||
config_path=("Path to config file", "positional", None, Path),
|
||||
use_gpu=("Use GPU", "option", "g", int),
|
||||
resume_path=("Path to pretrained weights from which to resume pretraining", "option","r", Path),
|
||||
epoch_resume=("The epoch to resume counting from when using '--resume_path'. Prevents unintended overwriting of existing weight files.","option", "er", int),
|
||||
texts_loc: Path = Arg(..., help="Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the key 'tokens'", exists=True),
|
||||
vectors_model: str = Arg(..., help="Name or path to spaCy model with vectors to learn from"),
|
||||
output_dir: Path = Arg(..., help="Directory to write models to on each epoch"),
|
||||
config_path: Path = Arg(..., help="Path to config file", exists=True, dir_okay=False),
|
||||
use_gpu: int = Opt(-1, "--use-gpu", "-g", help="Use GPU"),
|
||||
resume_path: Optional[Path] = Opt(None, "--resume-path", "-r", help="Path to pretrained weights from which to resume pretraining"),
|
||||
epoch_resume: Optional[int] = Opt(None, "--epoch-resume", "-er", help="The epoch to resume counting from when using '--resume_path'. Prevents unintended overwriting of existing weight files."),
|
||||
# fmt: on
|
||||
)
|
||||
def pretrain(
|
||||
texts_loc,
|
||||
vectors_model,
|
||||
config_path,
|
||||
output_dir,
|
||||
use_gpu=-1,
|
||||
resume_path=None,
|
||||
epoch_resume=None,
|
||||
):
|
||||
"""
|
||||
Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components,
|
||||
|
@ -52,6 +45,26 @@ def pretrain(
|
|||
all settings are the same between pretraining and training. Ideally,
|
||||
this is done by using the same config file for both commands.
|
||||
"""
|
||||
pretrain(
|
||||
texts_loc,
|
||||
vectors_model,
|
||||
output_dir,
|
||||
config_path,
|
||||
use_gpu=use_gpu,
|
||||
resume_path=resume_path,
|
||||
epoch_resume=epoch_resume,
|
||||
)
|
||||
|
||||
|
||||
def pretrain(
|
||||
texts_loc: Path,
|
||||
vectors_model: str,
|
||||
output_dir: Path,
|
||||
config_path: Path,
|
||||
use_gpu: int = -1,
|
||||
resume_path: Optional[Path] = None,
|
||||
epoch_resume: Optional[int] = None,
|
||||
):
|
||||
if not config_path or not config_path.exists():
|
||||
msg.fail("Config file not found", config_path, exits=1)
|
||||
|
||||
|
@ -166,8 +179,7 @@ def pretrain(
|
|||
skip_counter = 0
|
||||
loss_func = pretrain_config["loss_func"]
|
||||
for epoch in range(epoch_resume, pretrain_config["max_epochs"]):
|
||||
examples = [Example(doc=text) for text in texts]
|
||||
batches = util.minibatch_by_words(examples, size=pretrain_config["batch_size"])
|
||||
batches = util.minibatch_by_words(texts, size=pretrain_config["batch_size"])
|
||||
for batch_id, batch in enumerate(batches):
|
||||
docs, count = make_docs(
|
||||
nlp,
|
||||
|
|
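A hedged sketch of calling the refactored pretrain() above (all paths and the vectors model are placeholders; module path assumed as spacy.cli.pretrain):

from pathlib import Path
from spacy.cli.pretrain import pretrain  # module path assumed

pretrain(
    Path("raw_texts.jsonl"),  # one {"text": ...} or {"tokens": [...]} record per line
    "en_core_web_md",         # placeholder model supplying the vectors to learn from
    Path("pretrain_output"),  # weights are written here on each epoch
    Path("config.cfg"),       # ideally the same config file later used for training
    use_gpu=-1,
)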
|
@ -1,3 +1,4 @@
|
|||
from typing import Optional, Sequence, Union, Iterator
|
||||
import tqdm
|
||||
from pathlib import Path
|
||||
import srsly
|
||||
|
@ -5,17 +6,19 @@ import cProfile
|
|||
import pstats
|
||||
import sys
|
||||
import itertools
|
||||
import ml_datasets
|
||||
from wasabi import msg
|
||||
from wasabi import msg, Printer
|
||||
|
||||
from ._app import app, Arg, Opt
|
||||
from ..language import Language
|
||||
from ..util import load_model
|
||||
|
||||
|
||||
def profile(
|
||||
@app.command("profile")
|
||||
def profile_cli(
|
||||
# fmt: off
|
||||
model: ("Model to load", "positional", None, str),
|
||||
inputs: ("Location of input file. '-' for stdin.", "positional", None, str) = None,
|
||||
n_texts: ("Maximum number of texts to use if available", "option", "n", int) = 10000,
|
||||
model: str = Arg(..., help="Model to load"),
|
||||
inputs: Optional[Path] = Arg(None, help="Location of input file. '-' for stdin.", exists=True, allow_dash=True),
|
||||
n_texts: int = Opt(10000, "--n-texts", "-n", help="Maximum number of texts to use if available"),
|
||||
# fmt: on
|
||||
):
|
||||
"""
|
||||
|
@ -24,6 +27,18 @@ def profile(
|
|||
It can either be provided as a JSONL file, or be read from sys.stdin.
|
||||
If no input file is specified, the IMDB dataset is loaded via ml_datasets.
|
||||
"""
|
||||
profile(model, inputs=inputs, n_texts=n_texts)
|
||||
|
||||
|
||||
def profile(model: str, inputs: Optional[Path] = None, n_texts: int = 10000) -> None:
|
||||
try:
|
||||
import ml_datasets
|
||||
except ImportError:
|
||||
msg.fail(
|
||||
"This command requires the ml_datasets library to be installed:"
|
||||
"pip install ml_datasets",
|
||||
exits=1,
|
||||
)
|
||||
if inputs is not None:
|
||||
inputs = _read_inputs(inputs, msg)
|
||||
if inputs is None:
|
||||
|
@ -43,12 +58,12 @@ def profile(
|
|||
s.strip_dirs().sort_stats("time").print_stats()
|
||||
|
||||
|
||||
def parse_texts(nlp, texts):
|
||||
def parse_texts(nlp: Language, texts: Sequence[str]) -> None:
|
||||
for doc in nlp.pipe(tqdm.tqdm(texts), batch_size=16):
|
||||
pass
|
||||
|
||||
|
||||
def _read_inputs(loc, msg):
|
||||
def _read_inputs(loc: Union[Path, str], msg: Printer) -> Iterator[str]:
|
||||
if loc == "-":
|
||||
msg.info("Reading input from sys.stdin")
|
||||
file_ = sys.stdin
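A hedged sketch of the refactored profile() above (model and input file are placeholders; module path assumed as spacy.cli.profile; ml_datasets is only needed when no input file is given):

from pathlib import Path
from spacy.cli.profile import profile  # module path assumed

profile("en_core_web_sm", inputs=Path("texts.jsonl"), n_texts=1000)  # placeholders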
|
||||
|
|
704
spacy/cli/project.py
Normal file
704
spacy/cli/project.py
Normal file
|
@ -0,0 +1,704 @@
|
|||
from typing import List, Dict, Any, Optional, Sequence
|
||||
import typer
|
||||
import srsly
|
||||
from pathlib import Path
|
||||
from wasabi import msg
|
||||
import subprocess
|
||||
import os
|
||||
import re
|
||||
import shutil
|
||||
import sys
|
||||
import requests
|
||||
import tqdm
|
||||
|
||||
from ._app import app, Arg, Opt, COMMAND, NAME
|
||||
from .. import about
|
||||
from ..schemas import ProjectConfigSchema, validate
|
||||
from ..util import ensure_path, run_command, make_tempdir, working_dir
|
||||
from ..util import get_hash, get_checksum, split_command
|
||||
|
||||
|
||||
CONFIG_FILE = "project.yml"
|
||||
DVC_CONFIG = "dvc.yaml"
|
||||
DVC_DIR = ".dvc"
|
||||
DIRS = [
|
||||
"assets",
|
||||
"metas",
|
||||
"configs",
|
||||
"packages",
|
||||
"metrics",
|
||||
"scripts",
|
||||
"notebooks",
|
||||
"training",
|
||||
"corpus",
|
||||
]
|
||||
CACHES = [
|
||||
Path.home() / ".torch",
|
||||
Path.home() / ".caches" / "torch",
|
||||
os.environ.get("TORCH_HOME"),
|
||||
Path.home() / ".keras",
|
||||
]
|
||||
DVC_CONFIG_COMMENT = """# This file is auto-generated by spaCy based on your project.yml. Do not edit
|
||||
# it directly and edit the project.yml instead and re-run the project."""
|
||||
CLI_HELP = f"""Command-line interface for spaCy projects and working with project
|
||||
templates. You'd typically start by cloning a project template to a local
|
||||
directory and fetching its assets like datasets etc. See the project's
|
||||
{CONFIG_FILE} for the available commands. Under the hood, spaCy uses DVC (Data
|
||||
Version Control) to manage input and output files and to ensure steps are only
|
||||
re-run if their inputs change.
|
||||
"""
|
||||
|
||||
project_cli = typer.Typer(help=CLI_HELP, no_args_is_help=True)
|
||||
|
||||
|
||||
@project_cli.callback(invoke_without_command=True)
|
||||
def callback(ctx: typer.Context):
|
||||
"""This runs before every project command and ensures DVC is installed."""
|
||||
ensure_dvc()
|
||||
|
||||
|
||||
################
|
||||
# CLI COMMANDS #
|
||||
################
|
||||
|
||||
|
||||
@project_cli.command("clone")
|
||||
def project_clone_cli(
|
||||
# fmt: off
|
||||
name: str = Arg(..., help="The name of the template to fetch"),
|
||||
dest: Path = Arg(Path.cwd(), help="Where to download and work. Defaults to current working directory.", exists=False),
|
||||
repo: str = Opt(about.__projects__, "--repo", "-r", help="The repository to look in."),
|
||||
git: bool = Opt(False, "--git", "-G", help="Initialize project as a Git repo"),
|
||||
no_init: bool = Opt(False, "--no-init", "-NI", help="Don't initialize the project with DVC"),
|
||||
# fmt: on
|
||||
):
|
||||
"""Clone a project template from a repository. Calls into "git" and will
|
||||
only download the files from the given subdirectory. The GitHub repo
|
||||
defaults to the official spaCy template repo, but can be customized
|
||||
(including using a private repo). Setting the --git flag will also
|
||||
initialize the project directory as a Git repo. If the project is intended
|
||||
to be a Git repo, it should be initialized with Git first, before
|
||||
initializing DVC (Data Version Control). This allows DVC to integrate with
|
||||
Git.
|
||||
"""
|
||||
if dest == Path.cwd():
|
||||
dest = dest / name
|
||||
project_clone(name, dest, repo=repo, git=git, no_init=no_init)
|
||||
|
||||
|
||||
@project_cli.command("init")
|
||||
def project_init_cli(
|
||||
# fmt: off
|
||||
path: Path = Arg(Path.cwd(), help="Path to cloned project. Defaults to current working directory.", exists=True, file_okay=False),
|
||||
git: bool = Opt(False, "--git", "-G", help="Initialize project as a Git repo"),
|
||||
force: bool = Opt(False, "--force", "-F", help="Force initialization"),
|
||||
# fmt: on
|
||||
):
|
||||
"""Initialize a project directory with DVC and optionally Git. This should
|
||||
typically be taken care of automatically when you run the "project clone"
|
||||
command, but you can also run it separately. If the project is intended to
|
||||
be a Git repo, it should be initialized with Git first, before initializing
|
||||
DVC. This allows DVC to integrate with Git.
|
||||
"""
|
||||
project_init(path, git=git, force=force, silent=True)
|
||||
|
||||
|
||||
@project_cli.command("assets")
|
||||
def project_assets_cli(
|
||||
# fmt: off
|
||||
project_dir: Path = Arg(Path.cwd(), help="Path to cloned project. Defaults to current working directory.", exists=True, file_okay=False),
|
||||
# fmt: on
|
||||
):
|
||||
"""Use DVC (Data Version Control) to fetch project assets. Assets are
|
||||
defined in the "assets" section of the project config. If possible, DVC
|
||||
will try to track the files so you can pull changes from upstream. It will
|
||||
also try and store the checksum so the assets are versioned. If the file
|
||||
can't be tracked or checked, it will be downloaded without DVC. If a checksum
|
||||
is provided in the project config, the file is only downloaded if no local
|
||||
file with the same checksum exists.
|
||||
"""
|
||||
project_assets(project_dir)
|
||||
|
||||
|
||||
@project_cli.command(
|
||||
"run-all",
|
||||
context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
|
||||
)
|
||||
def project_run_all_cli(
|
||||
# fmt: off
|
||||
ctx: typer.Context,
|
||||
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
|
||||
show_help: bool = Opt(False, "--help", help="Show help message and available subcommands")
|
||||
# fmt: on
|
||||
):
|
||||
"""Run all commands defined in the project. This command will use DVC and
|
||||
the defined outputs and dependencies in the project config to determine
|
||||
which steps need to be re-run and where to start. This means you're only
|
||||
re-generating data if the inputs have changed.
|
||||
|
||||
This command calls into "dvc repro" and all additional arguments are passed
|
||||
to the "dvc repro" command: https://dvc.org/doc/command-reference/repro
|
||||
"""
|
||||
if show_help:
|
||||
print_run_help(project_dir)
|
||||
else:
|
||||
project_run_all(project_dir, *ctx.args)
|
||||
|
||||
|
||||
@project_cli.command(
|
||||
"run", context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
|
||||
)
|
||||
def project_run_cli(
|
||||
# fmt: off
|
||||
ctx: typer.Context,
|
||||
subcommand: str = Arg(None, help="Name of command defined in project config"),
|
||||
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
|
||||
show_help: bool = Opt(False, "--help", help="Show help message and available subcommands")
|
||||
# fmt: on
|
||||
):
|
||||
"""Run a named script defined in the project config. If the command is
|
||||
part of the default pipeline defined in the "run" section, DVC is used to
|
||||
determine whether the step should re-run if its inputs have changed, or
|
||||
whether everything is up to date. If the script is not part of the default
|
||||
pipeline, it will be called separately without DVC.
|
||||
|
||||
If DVC is used, the command calls into "dvc repro" and all additional
|
||||
arguments are passed to the "dvc repro" command:
|
||||
https://dvc.org/doc/command-reference/repro
|
||||
"""
|
||||
if show_help or not subcommand:
|
||||
print_run_help(project_dir, subcommand)
|
||||
else:
|
||||
project_run(project_dir, subcommand, *ctx.args)
|
||||
|
||||
|
||||
@project_cli.command("exec", hidden=True)
|
||||
def project_exec_cli(
|
||||
# fmt: off
|
||||
subcommand: str = Arg(..., help="Name of command defined in project config"),
|
||||
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
|
||||
# fmt: on
|
||||
):
|
||||
"""Execute a command defined in the project config. This CLI command is
|
||||
only called internally in auto-generated DVC pipelines, as a shortcut for
|
||||
multi-step commands in the project config. You typically shouldn't have to
|
||||
call it yourself. To run a command, call "run" or "run-all".
|
||||
"""
|
||||
project_exec(project_dir, subcommand)
|
||||
|
||||
|
||||
@project_cli.command("update-dvc")
|
||||
def project_update_dvc_cli(
|
||||
# fmt: off
|
||||
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
|
||||
verbose: bool = Opt(False, "--verbose", "-V", help="Print more info"),
|
||||
force: bool = Opt(False, "--force", "-F", help="Force update DVC config"),
|
||||
# fmt: on
|
||||
):
|
||||
"""Update the auto-generated DVC config file. Uses the steps defined in the
|
||||
"run" section of the project config. This typically happens automatically
|
||||
when running a command, but can also be triggered manually if needed.
|
||||
"""
|
||||
config = load_project_config(project_dir)
|
||||
updated = update_dvc_config(project_dir, config, verbose=verbose, force=force)
|
||||
if updated:
|
||||
msg.good(f"Updated DVC config from {CONFIG_FILE}")
|
||||
else:
|
||||
msg.info(f"No changes found in {CONFIG_FILE}, no update needed")
|
||||
|
||||
|
||||
app.add_typer(project_cli, name="project")
|
||||
|
||||
|
||||
#################
|
||||
# CLI FUNCTIONS #
|
||||
#################
|
||||
|
||||
|
||||
def project_clone(
|
||||
name: str,
|
||||
dest: Path,
|
||||
*,
|
||||
repo: str = about.__projects__,
|
||||
git: bool = False,
|
||||
no_init: bool = False,
|
||||
) -> None:
|
||||
"""Clone a project template from a repository.
|
||||
|
||||
name (str): Name of subdirectory to clone.
|
||||
dest (Path): Destination path of cloned project.
|
||||
repo (str): URL of Git repo containing project templates.
|
||||
git (bool): Initialize project as Git repo. Should be set to True if project
|
||||
is intended as a repo, since it will allow DVC to integrate with Git.
|
||||
no_init (bool): Don't initialize DVC and Git automatically. If True, the
|
||||
"init" command or "git init" and "dvc init" need to be run manually.
|
||||
"""
|
||||
dest = ensure_path(dest)
|
||||
check_clone(name, dest, repo)
|
||||
project_dir = dest.resolve()
|
||||
# We're using Git and sparse checkout to only clone the files we need
|
||||
with make_tempdir() as tmp_dir:
|
||||
cmd = f"git clone {repo} {tmp_dir} --no-checkout --depth 1 --config core.sparseCheckout=true"
|
||||
try:
|
||||
run_command(cmd)
|
||||
except SystemExit:
|
||||
err = f"Could not clone the repo '{repo}' into the temp dir '{tmp_dir}'"
|
||||
msg.fail(err)
|
||||
with (tmp_dir / ".git" / "info" / "sparse-checkout").open("w") as f:
|
||||
f.write(name)
|
||||
run_command(["git", "-C", str(tmp_dir), "fetch"])
|
||||
run_command(["git", "-C", str(tmp_dir), "checkout"])
|
||||
shutil.move(str(tmp_dir / Path(name).name), str(project_dir))
|
||||
msg.good(f"Cloned project '{name}' from {repo} into {project_dir}")
|
||||
for sub_dir in DIRS:
|
||||
dir_path = project_dir / sub_dir
|
||||
if not dir_path.exists():
|
||||
dir_path.mkdir(parents=True)
|
||||
if not no_init:
|
||||
project_init(project_dir, git=git, force=True, silent=True)
|
||||
msg.good(f"Your project is now ready!", dest)
|
||||
print(f"To fetch the assets, run:\n{COMMAND} project assets {dest}")
|
||||
|
||||
|
||||
def project_init(
|
||||
project_dir: Path,
|
||||
*,
|
||||
git: bool = False,
|
||||
force: bool = False,
|
||||
silent: bool = False,
|
||||
analytics: bool = False,
|
||||
):
|
||||
"""Initialize a project as a DVC and (optionally) as a Git repo.
|
||||
|
||||
project_dir (Path): Path to project directory.
|
||||
git (bool): Also call "git init" to initialize directory as a Git repo.
force (bool): Pass "--force" to "dvc init" to reinitialize if DVC metadata already exists.
silent (bool): Don't print any output (via DVC).
|
||||
analytics (bool): Opt-in to DVC analytics (defaults to False).
|
||||
"""
|
||||
with working_dir(project_dir) as cwd:
|
||||
if git:
|
||||
run_command(["git", "init"])
|
||||
init_cmd = ["dvc", "init"]
|
||||
if silent:
|
||||
init_cmd.append("--quiet")
|
||||
if not git:
|
||||
init_cmd.append("--no-scm")
|
||||
if force:
|
||||
init_cmd.append("--force")
|
||||
run_command(init_cmd)
|
||||
# We don't want to have analytics on by default – our users should
|
||||
# opt-in explicitly. If they want it, they can always enable it.
|
||||
if not analytics:
|
||||
run_command(["dvc", "config", "core.analytics", "false"])
|
||||
# Remove unused and confusing plot templates from .dvc directory
|
||||
# TODO: maybe we shouldn't do this, but it's otherwise super confusing
|
||||
# once you commit your changes via Git and it creates a bunch of files
|
||||
# that have no purpose
|
||||
plots_dir = cwd / DVC_DIR / "plots"
|
||||
if plots_dir.exists():
|
||||
shutil.rmtree(str(plots_dir))
|
||||
config = load_project_config(cwd)
|
||||
setup_check_dvc(cwd, config)
|
||||
|
||||
|
||||
def project_assets(project_dir: Path) -> None:
|
||||
"""Fetch assets for a project using DVC if possible.
|
||||
|
||||
project_dir (Path): Path to project directory.
|
||||
"""
|
||||
project_path = ensure_path(project_dir)
|
||||
config = load_project_config(project_path)
|
||||
setup_check_dvc(project_path, config)
|
||||
assets = config.get("assets", {})
|
||||
if not assets:
|
||||
msg.warn(f"No assets specified in {CONFIG_FILE}", exits=0)
|
||||
msg.info(f"Fetching {len(assets)} asset(s)")
|
||||
variables = config.get("variables", {})
|
||||
fetched_assets = []
|
||||
for asset in assets:
|
||||
url = asset["url"].format(**variables)
|
||||
dest = asset["dest"].format(**variables)
|
||||
fetched_path = fetch_asset(project_path, url, dest, asset.get("checksum"))
|
||||
if fetched_path:
|
||||
fetched_assets.append(str(fetched_path))
|
||||
if fetched_assets:
|
||||
with working_dir(project_path):
|
||||
run_command(["dvc", "add", *fetched_assets, "--external"])
|
||||
|
||||
|
||||
def fetch_asset(
|
||||
project_path: Path, url: str, dest: Path, checksum: Optional[str] = None
|
||||
) -> Optional[Path]:
|
||||
"""Fetch an asset from a given URL or path. Will try to import the file
|
||||
using DVC's import-url if possible (fully tracked and versioned) and falls
|
||||
back to get-url (versioned) and a non-DVC download if necessary. If a
|
||||
checksum is provided and a local file exists, it's only re-downloaded if the
|
||||
checksum doesn't match.
|
||||
|
||||
project_path (Path): Path to project directory.
|
||||
url (str): URL or path to asset.
dest (Path): Destination path of the asset, relative to the project directory.
checksum (Optional[str]): Optional expected checksum of local file.
|
||||
RETURNS (Optional[Path]): The path to the fetched asset or None if fetching
|
||||
the asset failed.
|
||||
"""
|
||||
url = convert_asset_url(url)
|
||||
dest_path = (project_path / dest).resolve()
|
||||
if dest_path.exists() and checksum:
|
||||
# If there's already a file, check for checksum
|
||||
# TODO: add support for caches (dvc import-url with local path)
|
||||
if checksum == get_checksum(dest_path):
|
||||
msg.good(f"Skipping download with matching checksum: {dest}")
|
||||
return dest_path
|
||||
with working_dir(project_path):
|
||||
try:
|
||||
# If these fail, we don't want to output an error or info message.
|
||||
# Try with tracking the source first, then just downloading with
|
||||
# DVC, then a regular non-DVC download.
|
||||
try:
|
||||
dvc_cmd = ["dvc", "import-url", url, str(dest_path)]
|
||||
print(subprocess.check_output(dvc_cmd, stderr=subprocess.DEVNULL))
|
||||
except subprocess.CalledProcessError:
|
||||
dvc_cmd = ["dvc", "get-url", url, str(dest_path)]
|
||||
print(subprocess.check_output(dvc_cmd, stderr=subprocess.DEVNULL))
|
||||
except subprocess.CalledProcessError:
|
||||
try:
|
||||
download_file(url, dest_path)
|
||||
except requests.exceptions.HTTPError as e:
|
||||
msg.fail(f"Download failed: {dest}", e)
|
||||
return None
|
||||
if checksum and checksum != get_checksum(dest_path):
|
||||
msg.warn(f"Checksum doesn't match value defined in {CONFIG_FILE}: {dest}")
|
||||
msg.good(f"Fetched asset {dest}")
|
||||
return dest_path
|
||||
|
||||
|
||||
def project_run_all(project_dir: Path, *dvc_args) -> None:
|
||||
"""Run all commands defined in the project using DVC.
|
||||
|
||||
project_dir (Path): Path to project directory.
|
||||
*dvc_args: Other arguments passed to "dvc repro".
|
||||
"""
|
||||
config = load_project_config(project_dir)
|
||||
setup_check_dvc(project_dir, config)
|
||||
dvc_cmd = ["dvc", "repro", *dvc_args]
|
||||
with working_dir(project_dir):
|
||||
run_command(dvc_cmd)
|
||||
|
||||
|
||||
def print_run_help(project_dir: Path, subcommand: Optional[str] = None) -> None:
|
||||
"""Simulate a CLI help prompt using the info available in the project config.
|
||||
|
||||
project_dir (Path): The project directory.
|
||||
subcommand (Optional[str]): The subcommand or None. If a subcommand is
|
||||
provided, the subcommand help is shown. Otherwise, the top-level help
|
||||
and a list of available commands is printed.
|
||||
"""
|
||||
config = load_project_config(project_dir)
|
||||
setup_check_dvc(project_dir, config)
|
||||
config_commands = config.get("commands", [])
|
||||
commands = {cmd["name"]: cmd for cmd in config_commands}
|
||||
if subcommand:
|
||||
validate_subcommand(commands.keys(), subcommand)
|
||||
print(f"Usage: {COMMAND} project run {subcommand} {project_dir}")
|
||||
help_text = commands[subcommand].get("help")
|
||||
if help_text:
|
||||
msg.text(f"\n{help_text}\n")
|
||||
else:
|
||||
print(f"\nAvailable commands in {CONFIG_FILE}")
|
||||
print(f"Usage: {COMMAND} project run [COMMAND] {project_dir}")
|
||||
msg.table([(cmd["name"], cmd.get("help", "")) for cmd in config_commands])
|
||||
msg.text("Run all commands defined in the 'run' block of the project config:")
|
||||
print(f"{COMMAND} project run-all {project_dir}")
|
||||
|
||||
|
||||
def project_run(project_dir: Path, subcommand: str, *dvc_args) -> None:
|
||||
"""Run a named script defined in the project config. If the script is part
|
||||
of the default pipeline (defined in the "run" section), DVC is used to
|
||||
execute the command, so it can determine whether it needs to be re-run. The
generated DVC stage then calls back into the "exec" command to run the script.
|
||||
|
||||
project_dir (Path): Path to project directory.
|
||||
subcommand (str): Name of command to run.
|
||||
*dvc_args: Other arguments passed to "dvc repro".
|
||||
"""
|
||||
config = load_project_config(project_dir)
|
||||
setup_check_dvc(project_dir, config)
|
||||
config_commands = config.get("commands", [])
|
||||
variables = config.get("variables", {})
|
||||
commands = {cmd["name"]: cmd for cmd in config_commands}
|
||||
validate_subcommand(commands.keys(), subcommand)
|
||||
if subcommand in config.get("run", []):
|
||||
# This is one of the pipeline commands tracked in DVC
|
||||
dvc_cmd = ["dvc", "repro", subcommand, *dvc_args]
|
||||
with working_dir(project_dir):
|
||||
run_command(dvc_cmd)
|
||||
else:
|
||||
cmd = commands[subcommand]
|
||||
# Deps in non-DVC commands aren't tracked, but if they're defined,
|
||||
# make sure they exist before running the command
|
||||
for dep in cmd.get("deps", []):
|
||||
if not (project_dir / dep).exists():
|
||||
err = f"Missing dependency specified by command '{subcommand}': {dep}"
|
||||
msg.fail(err, exits=1)
|
||||
with working_dir(project_dir):
|
||||
run_commands(cmd["script"], variables)
|
||||
|
||||
|
||||
def project_exec(project_dir: Path, subcommand: str):
|
||||
"""Execute a command defined in the project config.
|
||||
|
||||
project_dir (Path): Path to project directory.
|
||||
subcommand (str): Name of command to run.
|
||||
"""
|
||||
config = load_project_config(project_dir)
|
||||
config_commands = config.get("commands", [])
|
||||
variables = config.get("variables", {})
|
||||
commands = {cmd["name"]: cmd for cmd in config_commands}
|
||||
with working_dir(project_dir):
|
||||
run_commands(commands[subcommand]["script"], variables)
|
||||
|
||||
|
||||
###########
|
||||
# HELPERS #
|
||||
###########
|
||||
|
||||
|
||||
def load_project_config(path: Path) -> Dict[str, Any]:
|
||||
"""Load the project config file from a directory and validate it.
|
||||
|
||||
path (Path): The path to the project directory.
|
||||
RETURNS (Dict[str, Any]): The loaded project config.
|
||||
"""
|
||||
config_path = path / CONFIG_FILE
|
||||
if not config_path.exists():
|
||||
msg.fail("Can't find project config", config_path, exits=1)
|
||||
invalid_err = f"Invalid project config in {CONFIG_FILE}"
|
||||
try:
|
||||
config = srsly.read_yaml(config_path)
|
||||
except ValueError as e:
|
||||
msg.fail(invalid_err, e, exits=1)
|
||||
errors = validate(ProjectConfigSchema, config)
|
||||
if errors:
|
||||
msg.fail(invalid_err, "\n".join(errors), exits=1)
|
||||
return config
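# Illustrative sketch of the "commands" and "run" sections consumed by the
# functions in this module (names, scripts and paths are hypothetical).
# Commands listed under "run" form the DVC pipeline; their "deps", "outputs"
# and "outputs_no_cache" entries drive the auto-generated DVC config:
#   commands:
#     - name: "preprocess"
#       help: "Convert the raw data"
#       script:
#         - "python scripts/preprocess.py assets/raw.json corpus/train.json"
#       deps: ["assets/raw.json"]
#       outputs: ["corpus/train.json"]
#   run:
#     - "preprocess"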
|
||||
|
||||
|
||||
def update_dvc_config(
|
||||
path: Path,
|
||||
config: Dict[str, Any],
|
||||
verbose: bool = False,
|
||||
silent: bool = False,
|
||||
force: bool = False,
|
||||
) -> bool:
|
||||
"""Re-run the DVC commands in dry mode and update dvc.yaml file in the
|
||||
project directory. The file is auto-generated based on the config. The
|
||||
first line of the auto-generated file specifies the hash of the config
|
||||
dict, so if any of the config values change, the DVC config is regenerated.
|
||||
|
||||
path (Path): The path to the project directory.
|
||||
config (Dict[str, Any]): The loaded project config.
|
||||
verbose (bool): Whether to print additional info (via DVC).
|
||||
silent (bool): Don't output anything (via DVC).
|
||||
force (bool): Force update, even if hashes match.
|
||||
RETURNS (bool): Whether the DVC config file was updated.
|
||||
"""
|
||||
config_hash = get_hash(config)
|
||||
path = path.resolve()
|
||||
dvc_config_path = path / DVC_CONFIG
|
||||
if dvc_config_path.exists():
|
||||
# Check if the file was generated using the current config, if not, redo
|
||||
with dvc_config_path.open("r", encoding="utf8") as f:
|
||||
ref_hash = f.readline().strip().replace("# ", "")
|
||||
if ref_hash == config_hash and not force:
|
||||
return False # Nothing has changed in project config, don't need to update
|
||||
dvc_config_path.unlink()
|
||||
variables = config.get("variables", {})
|
||||
commands = []
|
||||
# We only want to include commands that are part of the main list of "run"
|
||||
# commands in project.yml and should be run in sequence
|
||||
config_commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
|
||||
for name in config.get("run", []):
|
||||
validate_subcommand(config_commands.keys(), name)
|
||||
command = config_commands[name]
|
||||
deps = command.get("deps", [])
|
||||
outputs = command.get("outputs", [])
|
||||
outputs_no_cache = command.get("outputs_no_cache", [])
|
||||
if not deps and not outputs and not outputs_no_cache:
|
||||
continue
|
||||
# Default to "." as the project path since dvc.yaml is auto-generated
|
||||
# and we don't want arbitrary paths in there
|
||||
project_cmd = ["python", "-m", NAME, "project", ".", "exec", name]
|
||||
deps_cmd = [c for cl in [["-d", p] for p in deps] for c in cl]
|
||||
outputs_cmd = [c for cl in [["-o", p] for p in outputs] for c in cl]
|
||||
outputs_nc_cmd = [c for cl in [["-O", p] for p in outputs_no_cache] for c in cl]
|
||||
dvc_cmd = ["dvc", "run", "-n", name, "-w", str(path), "--no-exec"]
|
||||
if verbose:
|
||||
dvc_cmd.append("--verbose")
|
||||
if silent:
|
||||
dvc_cmd.append("--quiet")
|
||||
full_cmd = [*dvc_cmd, *deps_cmd, *outputs_cmd, *outputs_nc_cmd, *project_cmd]
|
||||
commands.append(" ".join(full_cmd))
|
||||
with working_dir(path):
|
||||
run_commands(commands, variables, silent=True)
|
||||
with dvc_config_path.open("r+", encoding="utf8") as f:
|
||||
content = f.read()
|
||||
f.seek(0, 0)
|
||||
f.write(f"# {config_hash}\n{DVC_CONFIG_COMMENT}\n{content}")
|
||||
return True
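# Illustrative sketch of the header this function writes to the auto-generated
# DVC config (the hash value is hypothetical). The first line stores the hash
# of the project config, which the check above compares to decide whether the
# file needs to be regenerated:
#   # 1a79a4d60de6718e8e5b326e338ae533
#   <DVC_CONFIG_COMMENT>
#   <pipeline stages generated by "dvc run --no-exec">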
|
||||
|
||||
|
||||
def ensure_dvc() -> None:
|
||||
"""Ensure that the "dvc" command is available and show an error if not."""
|
||||
try:
|
||||
subprocess.run(["dvc", "--version"], stdout=subprocess.DEVNULL)
|
||||
except Exception:
|
||||
msg.fail(
|
||||
"spaCy projects require DVC (Data Version Control) and the 'dvc' command",
|
||||
"You can install the Python package from pip (pip install dvc) or "
|
||||
"conda (conda install -c conda-forge dvc). For more details, see the "
|
||||
"documentation: https://dvc.org/doc/install",
|
||||
exits=1,
|
||||
)
|
||||
|
||||
|
||||
def setup_check_dvc(project_dir: Path, config: Dict[str, Any]) -> None:
|
||||
"""Check that the project is set up correctly with DVC and update its
|
||||
config if needed. Will raise an error if the project is not an initialized
|
||||
DVC project.
|
||||
|
||||
project_dir (Path): The path to the project directory.
|
||||
config (Dict[str, Any]): The loaded project config.
|
||||
"""
|
||||
if not project_dir.exists():
|
||||
msg.fail(f"Can't find project directory: {project_dir}")
|
||||
if not (project_dir / ".dvc").exists():
|
||||
msg.fail(
|
||||
"Project not initialized as a DVC project.",
|
||||
f"Make sure that the project template was cloned correctly. To "
|
||||
f"initialize the project directory manually, you can run: "
|
||||
f"{COMMAND} project init {project_dir}",
|
||||
exits=1,
|
||||
)
|
||||
with msg.loading("Updating DVC config..."):
|
||||
updated = update_dvc_config(project_dir, config, silent=True)
|
||||
if updated:
|
||||
msg.good(f"Updated DVC config from changed {CONFIG_FILE}")
|
||||
|
||||
|
||||
def run_commands(
|
||||
commands: List[str] = tuple(), variables: Dict[str, str] = {}, silent: bool = False
|
||||
) -> None:
|
||||
"""Run a sequence of commands in a subprocess, in order.
|
||||
|
||||
commands (List[str]): The string commands.
|
||||
variables (Dict[str, str]): Dictionary of variable names, mapped to their
|
||||
values. Will be used to substitute format string variables in the
|
||||
commands.
|
||||
silent (bool): Don't print the commands.
|
||||
"""
|
||||
for command in commands:
|
||||
# Substitute variables, e.g. "./{NAME}.json"
|
||||
command = command.format(**variables)
|
||||
command = split_command(command)
|
||||
# Not sure if this is needed or a good idea. Motivation: users may often
|
||||
# use commands in their config that reference "python" and we want to
|
||||
# make sure that it's always executing the same Python that spaCy is
|
||||
# executed with and the pip in the same env, not some other Python/pip.
|
||||
# Also ensures cross-compatibility if user 1 writes "python3" (because
|
||||
# that's how it's set up on their system), and user 2 without the
|
||||
# shortcut tries to re-run the command.
|
||||
if len(command) and command[0] in ("python", "python3"):
|
||||
command[0] = sys.executable
|
||||
elif len(command) and command[0] in ("pip", "pip3"):
|
||||
command = [sys.executable, "-m", "pip", *command[1:]]
|
||||
if not silent:
|
||||
print(f"Running command: {' '.join(command)}")
|
||||
run_command(command)
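# Illustrative example of the substitutions above (command strings and
# variables are hypothetical): with variables = {"name": "model"}, the config
# command "python scripts/train.py ./{name}.json" is formatted, split and
# rewritten to use the current interpreter, i.e.
#   [sys.executable, "scripts/train.py", "./model.json"]
# and "pip install -r requirements.txt" becomes
#   [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"]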
|
||||
|
||||
|
||||
def convert_asset_url(url: str) -> str:
|
||||
"""Check and convert the asset URL if needed.
|
||||
|
||||
url (str): The asset URL.
|
||||
RETURNS (str): The converted URL.
|
||||
"""
|
||||
# If the asset URL is a regular GitHub URL it's likely a mistake
|
||||
if re.match("(http(s?)):\/\/github.com", url):
|
||||
converted = url.replace("github.com", "raw.githubusercontent.com")
|
||||
converted = re.sub(r"/(tree|blob)/", "/", converted)
|
||||
msg.warn(
|
||||
"Downloading from a regular GitHub URL. This will only download "
|
||||
"the source of the page, not the actual file. Converting the URL "
|
||||
"to a raw URL.",
|
||||
converted,
|
||||
)
|
||||
return converted
|
||||
return url
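# Illustrative example (hypothetical URL): a regular GitHub page URL such as
#   https://github.com/user/repo/blob/master/assets/data.json
# is rewritten to the raw content URL
#   https://raw.githubusercontent.com/user/repo/master/assets/data.json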
|
||||
|
||||
|
||||
def check_clone(name: str, dest: Path, repo: str) -> None:
|
||||
"""Check and validate that the destination path can be used to clone. Will
|
||||
check that Git is available and that the destination path is suitable.
|
||||
|
||||
name (str): Name of the directory to clone from the repo.
|
||||
dest (Path): Local destination of cloned directory.
|
||||
repo (str): URL of the repo to clone from.
|
||||
"""
|
||||
try:
|
||||
subprocess.run(["git", "--version"], stdout=subprocess.DEVNULL)
|
||||
except Exception:
|
||||
msg.fail(
|
||||
f"Cloning spaCy project templates requires Git and the 'git' command. ",
|
||||
f"To clone a project without Git, copy the files from the '{name}' "
|
||||
f"directory in the {repo} to {dest} manually and then run:",
|
||||
f"{COMMAND} project init {dest}",
|
||||
exits=1,
|
||||
)
|
||||
if not dest:
|
||||
msg.fail(f"Not a valid directory to clone project: {dest}", exits=1)
|
||||
if dest.exists():
|
||||
# Directory already exists (not allowed, clone needs to create it)
|
||||
msg.fail(f"Can't clone project, directory already exists: {dest}", exits=1)
|
||||
if not dest.parent.exists():
|
||||
# We're not creating parents, parent dir should exist
|
||||
msg.fail(
|
||||
f"Can't clone project, parent directory doesn't exist: {dest.parent}",
|
||||
exits=1,
|
||||
)
|
||||
|
||||
|
||||
def validate_subcommand(commands: Sequence[str], subcommand: str) -> None:
|
||||
"""Check that a subcommand is valid and defined. Raises an error otherwise.
|
||||
|
||||
commands (Sequence[str]): The available commands.
|
||||
subcommand (str): The subcommand.
|
||||
"""
|
||||
if subcommand not in commands:
|
||||
msg.fail(
|
||||
f"Can't find command '{subcommand}' in {CONFIG_FILE}. "
|
||||
f"Available commands: {', '.join(commands)}",
|
||||
exits=1,
|
||||
)
|
||||
|
||||
|
||||
def download_file(url: str, dest: Path, chunk_size: int = 1024) -> None:
|
||||
"""Download a file using requests.
|
||||
|
||||
url (str): The URL of the file.
|
||||
dest (Path): The destination path.
|
||||
chunk_size (int): The size of chunks to read/write.
|
||||
"""
|
||||
response = requests.get(url, stream=True)
|
||||
response.raise_for_status()
|
||||
total = int(response.headers.get("content-length", 0))
|
||||
progress_settings = {
|
||||
"total": total,
|
||||
"unit": "iB",
|
||||
"unit_scale": True,
|
||||
"unit_divisor": chunk_size,
|
||||
"leave": False,
|
||||
}
|
||||
with dest.open("wb") as f, tqdm.tqdm(**progress_settings) as bar:
|
||||
for data in response.iter_content(chunk_size=chunk_size):
|
||||
size = f.write(data)
|
||||
bar.update(size)
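# Illustrative usage (URL and destination are hypothetical): stream the file
# in 1024-byte chunks and report progress via tqdm:
#   download_file("https://example.com/model.bin", Path("assets/model.bin"))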
|
|
@ -2,9 +2,8 @@ from typing import Optional, Dict, List, Union, Sequence
|
|||
from timeit import default_timer as timer
|
||||
|
||||
import srsly
|
||||
from pydantic import BaseModel, FilePath
|
||||
import plac
|
||||
import tqdm
|
||||
from pydantic import BaseModel, FilePath
|
||||
from pathlib import Path
|
||||
from wasabi import msg
|
||||
import thinc
|
||||
|
@ -12,11 +11,17 @@ import thinc.schedules
|
|||
from thinc.api import Model, use_pytorch_for_gpu_memory
|
||||
import random
|
||||
|
||||
from ..gold import GoldCorpus
|
||||
from ._app import app, Arg, Opt
|
||||
from ..gold import Corpus
|
||||
from ..lookups import Lookups
|
||||
from .. import util
|
||||
from ..errors import Errors
|
||||
from ..ml import models # don't remove - required to load the built-in architectures
|
||||
|
||||
# Don't remove - required to load the built-in architectures
|
||||
from ..ml import models # noqa: F401
|
||||
|
||||
# from ..schemas import ConfigSchema # TODO: include?
|
||||
|
||||
|
||||
registry = util.registry
|
||||
|
||||
|
@ -114,41 +119,24 @@ class ConfigSchema(BaseModel):
|
|||
extra = "allow"
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
# fmt: off
|
||||
train_path=("Location of JSON-formatted training data", "positional", None, Path),
|
||||
dev_path=("Location of JSON-formatted development data", "positional", None, Path),
|
||||
config_path=("Path to config file", "positional", None, Path),
|
||||
output_path=("Output directory to store model in", "option", "o", Path),
|
||||
init_tok2vec=(
|
||||
"Path to pretrained weights for the tok2vec components. See 'spacy pretrain'. Experimental.", "option", "t2v",
|
||||
Path),
|
||||
raw_text=("Path to jsonl file with unlabelled text documents.", "option", "rt", Path),
|
||||
verbose=("Display more information for debugging purposes", "flag", "VV", bool),
|
||||
use_gpu=("Use GPU", "option", "g", int),
|
||||
num_workers=("Parallel Workers", "option", "j", int),
|
||||
strategy=("Distributed training strategy (requires spacy_ray)", "option", "strategy", str),
|
||||
ray_address=(
|
||||
"Address of the Ray cluster. Multi-node training (requires spacy_ray)",
|
||||
"option", "address", str),
|
||||
tag_map_path=("Location of JSON-formatted tag map", "option", "tm", Path),
|
||||
omit_extra_lookups=("Don't include extra lookups in model", "flag", "OEL", bool),
|
||||
# fmt: on
|
||||
)
|
||||
@app.command("train")
|
||||
def train_cli(
|
||||
train_path,
|
||||
dev_path,
|
||||
config_path,
|
||||
output_path=None,
|
||||
init_tok2vec=None,
|
||||
raw_text=None,
|
||||
verbose=False,
|
||||
use_gpu=-1,
|
||||
num_workers=1,
|
||||
strategy="allreduce",
|
||||
ray_address=None,
|
||||
tag_map_path=None,
|
||||
omit_extra_lookups=False,
|
||||
# fmt: off
|
||||
train_path: Path = Arg(..., help="Location of JSON-formatted training data", exists=True),
|
||||
dev_path: Path = Arg(..., help="Location of JSON-formatted development data", exists=True),
|
||||
config_path: Path = Arg(..., help="Path to config file", exists=True),
|
||||
output_path: Optional[Path] = Opt(None, "--output-path", "-o", help="Output directory to store model in"),
|
||||
code_path: Optional[Path] = Opt(None, "--code-path", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
|
||||
init_tok2vec: Optional[Path] = Opt(None, "--init-tok2vec", "-t2v", help="Path to pretrained weights for the tok2vec components. See 'spacy pretrain'. Experimental."),
|
||||
raw_text: Optional[Path] = Opt(None, "--raw-text", "-rt", help="Path to jsonl file with unlabelled text documents."),
|
||||
verbose: bool = Opt(False, "--verbose", "-VV", help="Display more information for debugging purposes"),
|
||||
use_gpu: int = Opt(-1, "--use-gpu", "-g", help="Use GPU"),
|
||||
num_workers: int = Opt(None, "-j", help="Parallel Workers"),
|
||||
strategy: str = Opt(None, "--strategy", help="Distributed training strategy (requires spacy_ray)"),
|
||||
ray_address: str = Opt(None, "--address", help="Address of the Ray cluster. Multi-node training (requires spacy_ray)"),
|
||||
tag_map_path: Optional[Path] = Opt(None, "--tag-map-path", "-tm", help="Location of JSON-formatted tag map"),
|
||||
omit_extra_lookups: bool = Opt(False, "--omit-extra-lookups", "-OEL", help="Don't include extra lookups in model"),
|
||||
# fmt: on
|
||||
):
|
||||
"""
|
||||
Train or update a spaCy model. Requires data to be formatted in spaCy's
|
||||
|
@ -156,26 +144,8 @@ def train_cli(
|
|||
command.
|
||||
"""
|
||||
util.set_env_log(verbose)
|
||||
verify_cli_args(**locals())
|
||||
|
||||
# Make sure all files and paths exists if they are needed
|
||||
if not config_path or not config_path.exists():
|
||||
msg.fail("Config file not found", config_path, exits=1)
|
||||
if not train_path or not train_path.exists():
|
||||
msg.fail("Training data not found", train_path, exits=1)
|
||||
if not dev_path or not dev_path.exists():
|
||||
msg.fail("Development data not found", dev_path, exits=1)
|
||||
if output_path is not None:
|
||||
if not output_path.exists():
|
||||
output_path.mkdir()
|
||||
msg.good(f"Created output directory: {output_path}")
|
||||
elif output_path.exists() and [p for p in output_path.iterdir() if p.is_dir()]:
|
||||
msg.warn(
|
||||
"Output directory is not empty.",
|
||||
"This can lead to unintended side effects when saving the model. "
|
||||
"Please use an empty directory or a different path instead. If "
|
||||
"the specified output path doesn't exist, the directory will be "
|
||||
"created for you.",
|
||||
)
|
||||
if raw_text is not None:
|
||||
raw_text = list(srsly.read_jsonl(raw_text))
|
||||
tag_map = {}
|
||||
|
@ -184,8 +154,6 @@ def train_cli(
|
|||
|
||||
weights_data = None
|
||||
if init_tok2vec is not None:
|
||||
if not init_tok2vec.exists():
|
||||
msg.fail("Can't find pretrained tok2vec", init_tok2vec, exits=1)
|
||||
with init_tok2vec.open("rb") as file_:
|
||||
weights_data = file_.read()
|
||||
|
||||
|
@ -214,17 +182,17 @@ def train_cli(
|
|||
train(**train_args)
|
||||
|
||||
def train(
|
||||
config_path,
|
||||
data_paths,
|
||||
raw_text=None,
|
||||
output_path=None,
|
||||
tag_map=None,
|
||||
weights_data=None,
|
||||
omit_extra_lookups=False,
|
||||
disable_tqdm=False,
|
||||
remote_optimizer=None,
|
||||
randomization_index=0
|
||||
):
|
||||
config_path: Path,
|
||||
data_paths: Dict[str, Path],
|
||||
raw_text: Optional[Path] = None,
|
||||
output_path: Optional[Path] = None,
|
||||
tag_map: Optional[Path] = None,
|
||||
weights_data: Optional[bytes] = None,
|
||||
omit_extra_lookups: bool = False,
|
||||
disable_tqdm: bool = False,
|
||||
remote_optimizer: Optimizer = None,
|
||||
randomization_index: int = 0
|
||||
) -> None:
|
||||
msg.info(f"Loading config from: {config_path}")
|
||||
# Read the config first without creating objects, to get to the original nlp_config
|
||||
config = util.load_config(config_path, create_objects=False)
|
||||
|
@ -243,69 +211,20 @@ def train(
|
|||
if remote_optimizer:
|
||||
optimizer = remote_optimizer
|
||||
limit = training["limit"]
|
||||
msg.info("Loading training corpus")
|
||||
corpus = GoldCorpus(data_paths["train"], data_paths["dev"], limit=limit)
|
||||
|
||||
# verify textcat config
|
||||
corpus = Corpus(data_paths["train"], data_paths["dev"], limit=limit)
|
||||
if "textcat" in nlp_config["pipeline"]:
|
||||
textcat_labels = set(nlp.get_pipe("textcat").labels)
|
||||
textcat_multilabel = not nlp_config["pipeline"]["textcat"]["model"]["exclusive_classes"]
|
||||
|
||||
# check whether the setting 'exclusive_classes' corresponds to the provided training data
|
||||
if textcat_multilabel:
|
||||
multilabel_found = False
|
||||
for ex in corpus.train_examples:
|
||||
cats = ex.doc_annotation.cats
|
||||
textcat_labels.update(cats.keys())
|
||||
if list(cats.values()).count(1.0) != 1:
|
||||
multilabel_found = True
|
||||
if not multilabel_found:
|
||||
msg.warn(
|
||||
"The textcat training instances look like they have "
|
||||
"mutually exclusive classes. Set 'exclusive_classes' "
|
||||
"to 'true' in the config to train a classifier with "
|
||||
"mutually exclusive classes more accurately."
|
||||
)
|
||||
else:
|
||||
for ex in corpus.train_examples:
|
||||
cats = ex.doc_annotation.cats
|
||||
textcat_labels.update(cats.keys())
|
||||
if list(cats.values()).count(1.0) != 1:
|
||||
msg.fail(
|
||||
"Some textcat training instances do not have exactly "
|
||||
"one positive label. Set 'exclusive_classes' "
|
||||
"to 'false' in the config to train a classifier with classes "
|
||||
"that are not mutually exclusive."
|
||||
)
|
||||
msg.info(f"Initialized textcat component for {len(textcat_labels)} unique labels")
|
||||
nlp.get_pipe("textcat").labels = tuple(textcat_labels)
|
||||
|
||||
# if 'positive_label' is provided: double check whether it's in the data and the task is binary
|
||||
if nlp_config["pipeline"]["textcat"].get("positive_label", None):
|
||||
textcat_labels = nlp.get_pipe("textcat").cfg.get("labels", [])
|
||||
pos_label = nlp_config["pipeline"]["textcat"]["positive_label"]
|
||||
if pos_label not in textcat_labels:
|
||||
msg.fail(
|
||||
f"The textcat's 'positive_label' config setting '{pos_label}' "
|
||||
f"does not match any label in the training data.",
|
||||
exits=1,
|
||||
)
|
||||
if len(textcat_labels) != 2:
|
||||
msg.fail(
|
||||
f"A textcat 'positive_label' '{pos_label}' was "
|
||||
f"provided for training data that does not appear to be a "
|
||||
f"binary classification problem with two labels.",
|
||||
exits=1,
|
||||
)
|
||||
|
||||
verify_textcat_config(nlp, nlp_config)
|
||||
if training.get("resume", False):
|
||||
msg.info("Resuming training")
|
||||
nlp.resume_training()
|
||||
else:
|
||||
msg.info(f"Initializing the nlp pipeline: {nlp.pipe_names}")
|
||||
nlp.begin_training(
|
||||
lambda: corpus.train_examples
|
||||
)
|
||||
train_examples = list(corpus.train_dataset(
|
||||
nlp,
|
||||
shuffle=False,
|
||||
gold_preproc=training["gold_preproc"]
|
||||
))
|
||||
nlp.begin_training(lambda: train_examples)
|
||||
|
||||
# Update tag map with provided mapping
|
||||
nlp.vocab.morphology.tag_map.update(tag_map)
|
||||
|
@ -332,11 +251,11 @@ def train(
|
|||
tok2vec = tok2vec.get(subpath)
|
||||
if not tok2vec:
|
||||
msg.fail(
|
||||
f"Could not locate the tok2vec model at {tok2vec_path}.",
|
||||
exits=1,
|
||||
f"Could not locate the tok2vec model at {tok2vec_path}.", exits=1,
|
||||
)
|
||||
tok2vec.from_bytes(weights_data)
|
||||
|
||||
msg.info("Loading training corpus")
|
||||
train_batches = create_train_batches(nlp, corpus, training, randomization_index)
|
||||
evaluate = create_evaluation_callback(nlp, optimizer, corpus, training)
|
||||
|
||||
|
@ -369,18 +288,15 @@ def train(
|
|||
update_meta(training, nlp, info)
|
||||
nlp.to_disk(output_path / "model-best")
|
||||
progress = tqdm.tqdm(**tqdm_args)
|
||||
# Clean up the objects to facilitate garbage collection.
|
||||
for eg in batch:
|
||||
eg.doc = None
|
||||
eg.goldparse = None
|
||||
eg.doc_annotation = None
|
||||
eg.token_annotation = None
|
||||
except Exception as e:
|
||||
msg.warn(
|
||||
f"Aborting and saving the final best model. "
|
||||
f"Encountered exception: {str(e)}",
|
||||
exits=1,
|
||||
)
|
||||
if output_path is not None:
|
||||
msg.warn(
|
||||
f"Aborting and saving the final best model. "
|
||||
f"Encountered exception: {str(e)}",
|
||||
exits=1,
|
||||
)
|
||||
else:
|
||||
raise e
|
||||
finally:
|
||||
if output_path is not None:
|
||||
final_model_path = output_path / "model-final"
|
||||
|
@ -393,23 +309,22 @@ def train(
|
|||
|
||||
|
||||
def create_train_batches(nlp, corpus, cfg, randomization_index):
|
||||
epochs_todo = cfg.get("max_epochs", 0)
|
||||
max_epochs = cfg.get("max_epochs", 0)
|
||||
train_examples = list(corpus.train_dataset(
|
||||
nlp,
|
||||
shuffle=True,
|
||||
gold_preproc=cfg["gold_preproc"],
|
||||
max_length=cfg["max_length"]
|
||||
))
|
||||
|
||||
epoch = 0
|
||||
while True:
|
||||
train_examples = list(
|
||||
corpus.train_dataset(
|
||||
nlp,
|
||||
noise_level=0.0, # I think this is deprecated?
|
||||
orth_variant_level=cfg["orth_variant_level"],
|
||||
gold_preproc=cfg["gold_preproc"],
|
||||
max_length=cfg["max_length"],
|
||||
ignore_misaligned=True,
|
||||
)
|
||||
)
|
||||
if len(train_examples) == 0:
|
||||
raise ValueError(Errors.E988)
|
||||
for _ in range(randomization_index):
|
||||
random.random()
|
||||
random.shuffle(train_examples)
|
||||
epoch += 1
|
||||
batches = util.minibatch_by_words(
|
||||
train_examples,
|
||||
size=cfg["batch_size"],
|
||||
|
@ -418,15 +333,12 @@ def create_train_batches(nlp, corpus, cfg, randomization_index):
|
|||
# make sure the minibatch_by_words result is not empty, or we'll have an infinite training loop
|
||||
try:
|
||||
first = next(batches)
|
||||
yield first
|
||||
yield epoch, first
|
||||
except StopIteration:
|
||||
raise ValueError(Errors.E986)
|
||||
for batch in batches:
|
||||
yield batch
|
||||
epochs_todo -= 1
|
||||
# We intentionally compare exactly to 0 here, so that max_epochs < 1
|
||||
# will not break.
|
||||
if epochs_todo == 0:
|
||||
yield epoch, batch
|
||||
if max_epochs >= 1 and epoch >= max_epochs:
|
||||
break
|
||||
|
||||
|
||||
|
@ -437,7 +349,8 @@ def create_evaluation_callback(nlp, optimizer, corpus, cfg):
|
|||
nlp, gold_preproc=cfg["gold_preproc"], ignore_misaligned=True
|
||||
)
|
||||
)
|
||||
n_words = sum(len(ex.doc) for ex in dev_examples)
|
||||
|
||||
n_words = sum(len(ex.predicted) for ex in dev_examples)
|
||||
start_time = timer()
|
||||
|
||||
if optimizer.averages:
|
||||
|
@ -453,7 +366,11 @@ def create_evaluation_callback(nlp, optimizer, corpus, cfg):
|
|||
try:
|
||||
weighted_score = sum(scores[s] * weights.get(s, 0.0) for s in weights)
|
||||
except KeyError as e:
|
||||
raise KeyError(Errors.E983.format(dict_name='score_weights', key=str(e), keys=list(scores.keys())))
|
||||
raise KeyError(
|
||||
Errors.E983.format(
|
||||
dict="score_weights", key=str(e), keys=list(scores.keys())
|
||||
)
|
||||
)
|
||||
|
||||
scores["speed"] = wps
|
||||
return weighted_score, scores
|
||||
|
@ -494,7 +411,7 @@ def train_while_improving(
|
|||
|
||||
Every iteration, the function yields out a tuple with:
|
||||
|
||||
* batch: A zipped sequence of Tuple[Doc, GoldParse] pairs.
|
||||
* batch: A list of Example objects.
|
||||
* info: A dict with various information about the last update (see below).
|
||||
* is_best_checkpoint: A value in None, False, True, indicating whether this
|
||||
was the best evaluation so far. You should use this to save the model
|
||||
|
@ -526,7 +443,7 @@ def train_while_improving(
|
|||
(nlp.make_doc(rt["text"]) for rt in raw_text), size=8
|
||||
)
|
||||
|
||||
for step, batch in enumerate(train_data):
|
||||
for step, (epoch, batch) in enumerate(train_data):
|
||||
dropout = next(dropouts)
|
||||
with nlp.select_pipes(enable=to_enable):
|
||||
for subbatch in subdivide_batch(batch, accumulate_gradient):
|
||||
|
@ -548,6 +465,7 @@ def train_while_improving(
|
|||
score, other_scores = (None, None)
|
||||
is_best_checkpoint = None
|
||||
info = {
|
||||
"epoch": epoch,
|
||||
"step": step,
|
||||
"score": score,
|
||||
"other_scores": other_scores,
|
||||
|
@ -568,7 +486,7 @@ def train_while_improving(
|
|||
|
||||
def subdivide_batch(batch, accumulate_gradient):
|
||||
batch = list(batch)
|
||||
batch.sort(key=lambda eg: len(eg.doc))
|
||||
batch.sort(key=lambda eg: len(eg.predicted))
|
||||
sub_len = len(batch) // accumulate_gradient
|
||||
start = 0
|
||||
for i in range(accumulate_gradient):
|
||||
|
@ -586,9 +504,9 @@ def setup_printer(training, nlp):
|
|||
score_widths = [max(len(col), 6) for col in score_cols]
|
||||
loss_cols = [f"Loss {pipe}" for pipe in nlp.pipe_names]
|
||||
loss_widths = [max(len(col), 8) for col in loss_cols]
|
||||
table_header = ["#"] + loss_cols + score_cols + ["Score"]
|
||||
table_header = ["E", "#"] + loss_cols + score_cols + ["Score"]
|
||||
table_header = [col.upper() for col in table_header]
|
||||
table_widths = [6] + loss_widths + score_widths + [6]
|
||||
table_widths = [3, 6] + loss_widths + score_widths + [6]
|
||||
table_aligns = ["r" for _ in table_widths]
|
||||
|
||||
msg.row(table_header, widths=table_widths)
|
||||
|
@ -602,17 +520,25 @@ def setup_printer(training, nlp):
|
|||
]
|
||||
except KeyError as e:
|
||||
raise KeyError(
|
||||
Errors.E983.format(dict_name='scores (losses)', key=str(e), keys=list(info["losses"].keys())))
|
||||
Errors.E983.format(
|
||||
dict="scores (losses)", key=str(e), keys=list(info["losses"].keys())
|
||||
)
|
||||
)
|
||||
|
||||
try:
|
||||
scores = [
|
||||
"{0:.2f}".format(float(info["other_scores"][col]))
|
||||
for col in score_cols
|
||||
"{0:.2f}".format(float(info["other_scores"][col])) for col in score_cols
|
||||
]
|
||||
except KeyError as e:
|
||||
raise KeyError(Errors.E983.format(dict_name='scores (other)', key=str(e), keys=list(info["other_scores"].keys())))
|
||||
raise KeyError(
|
||||
Errors.E983.format(
|
||||
dict="scores (other)",
|
||||
key=str(e),
|
||||
keys=list(info["other_scores"].keys()),
|
||||
)
|
||||
)
|
||||
data = (
|
||||
[info["step"]] + losses + scores + ["{0:.2f}".format(float(info["score"]))]
|
||||
[info["epoch"], info["step"]] + losses + scores + ["{0:.2f}".format(float(info["score"]))]
|
||||
)
|
||||
msg.row(data, widths=table_widths, aligns=table_aligns)
|
||||
|
||||
|
@ -626,3 +552,67 @@ def update_meta(training, nlp, info):
|
|||
nlp.meta["performance"][metric] = info["other_scores"][metric]
|
||||
for pipe_name in nlp.pipe_names:
|
||||
nlp.meta["performance"][f"{pipe_name}_loss"] = info["losses"][pipe_name]
|
||||
|
||||
|
||||
def verify_cli_args(
|
||||
train_path,
|
||||
dev_path,
|
||||
config_path,
|
||||
output_path=None,
|
||||
code_path=None,
|
||||
init_tok2vec=None,
|
||||
raw_text=None,
|
||||
verbose=False,
|
||||
use_gpu=-1,
|
||||
tag_map_path=None,
|
||||
omit_extra_lookups=False,
|
||||
):
|
||||
# Make sure all files and paths exists if they are needed
|
||||
if not config_path or not config_path.exists():
|
||||
msg.fail("Config file not found", config_path, exits=1)
|
||||
if not train_path or not train_path.exists():
|
||||
msg.fail("Training data not found", train_path, exits=1)
|
||||
if not dev_path or not dev_path.exists():
|
||||
msg.fail("Development data not found", dev_path, exits=1)
|
||||
if output_path is not None:
|
||||
if not output_path.exists():
|
||||
output_path.mkdir()
|
||||
msg.good(f"Created output directory: {output_path}")
|
||||
elif output_path.exists() and [p for p in output_path.iterdir() if p.is_dir()]:
|
||||
msg.warn(
|
||||
"Output directory is not empty.",
|
||||
"This can lead to unintended side effects when saving the model. "
|
||||
"Please use an empty directory or a different path instead. If "
|
||||
"the specified output path doesn't exist, the directory will be "
|
||||
"created for you.",
|
||||
)
|
||||
if code_path is not None:
|
||||
if not code_path.exists():
|
||||
msg.fail("Path to Python code not found", code_path, exits=1)
|
||||
try:
|
||||
util.import_file("python_code", code_path)
|
||||
except Exception as e:
|
||||
msg.fail(f"Couldn't load Python code: {code_path}", e, exits=1)
|
||||
if init_tok2vec is not None and not init_tok2vec.exists():
|
||||
msg.fail("Can't find pretrained tok2vec", init_tok2vec, exits=1)
|
||||
|
||||
|
||||
def verify_textcat_config(nlp, nlp_config):
|
||||
# if 'positive_label' is provided: double check whether it's in the data and
|
||||
# the task is binary
|
||||
if nlp_config["pipeline"]["textcat"].get("positive_label", None):
|
||||
textcat_labels = nlp.get_pipe("textcat").cfg.get("labels", [])
|
||||
pos_label = nlp_config["pipeline"]["textcat"]["positive_label"]
|
||||
if pos_label not in textcat_labels:
|
||||
msg.fail(
|
||||
f"The textcat's 'positive_label' config setting '{pos_label}' "
|
||||
f"does not match any label in the training data.",
|
||||
exits=1,
|
||||
)
|
||||
if len(textcat_labels) != 2:
|
||||
msg.fail(
|
||||
f"A textcat 'positive_label' '{pos_label}' was "
|
||||
f"provided for training data that does not appear to be a "
|
||||
f"binary classification problem with two labels.",
|
||||
exits=1,
|
||||
)
|
|
@ -1,18 +1,25 @@
|
|||
from typing import Tuple
|
||||
from pathlib import Path
|
||||
import sys
|
||||
import requests
|
||||
from wasabi import msg
|
||||
from wasabi import msg, Printer
|
||||
|
||||
from ._app import app
|
||||
from .. import about
|
||||
from ..util import get_package_version, get_installed_models, get_base_version
|
||||
from ..util import get_package_path, get_model_meta, is_compatible_version
|
||||
|
||||
|
||||
def validate():
|
||||
@app.command("validate")
|
||||
def validate_cli():
|
||||
"""
|
||||
Validate that the currently installed version of spaCy is compatible
|
||||
with the installed models. Should be run after `pip install -U spacy`.
|
||||
"""
|
||||
validate()
|
||||
|
||||
|
||||
def validate() -> None:
|
||||
model_pkgs, compat = get_model_pkgs()
|
||||
spacy_version = get_base_version(about.__version__)
|
||||
current_compat = compat.get(spacy_version, {})
|
||||
|
@ -55,7 +62,8 @@ def validate():
|
|||
sys.exit(1)
|
||||
|
||||
|
||||
def get_model_pkgs():
|
||||
def get_model_pkgs(silent: bool = False) -> Tuple[dict, dict]:
|
||||
msg = Printer(no_print=silent, pretty=not silent)
|
||||
with msg.loading("Loading compatibility table..."):
|
||||
r = requests.get(about.__compatibility__)
|
||||
if r.status_code != 200:
|
||||
|
@ -93,7 +101,7 @@ def get_model_pkgs():
|
|||
return pkgs, compat
|
||||
|
||||
|
||||
def reformat_version(version):
|
||||
def reformat_version(version: str) -> str:
|
||||
"""Hack to reformat old versions ending on '-alpha' to match pip format."""
|
||||
if version.endswith("-alpha"):
|
||||
return version.replace("-alpha", "a0")
|
||||
|
|
129
spacy/errors.py
|
@ -3,7 +3,7 @@ def add_codes(err_cls):
|
|||
|
||||
class ErrorsWithCodes(err_cls):
|
||||
def __getattribute__(self, code):
|
||||
msg = super().__getattribute__(code)
|
||||
msg = super(ErrorsWithCodes, self).__getattribute__(code)
|
||||
if code.startswith("__"): # python system attributes like __class__
|
||||
return msg
|
||||
else:
|
||||
|
@ -111,8 +111,31 @@ class Warnings(object):
|
|||
"`spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)`"
|
||||
" to check the alignment. Misaligned entities ('-') will be "
|
||||
"ignored during training.")
|
||||
W031 = ("Model '{model}' ({model_version}) requires spaCy {version} and "
|
||||
"is incompatible with the current spaCy version ({current}). This "
|
||||
"may lead to unexpected results or runtime errors. To resolve "
|
||||
"this, download a newer compatible model or retrain your custom "
|
||||
"model with the current spaCy version. For more details and "
|
||||
"available updates, run: python -m spacy validate")
|
||||
W032 = ("Unable to determine model compatibility for model '{model}' "
|
||||
"({model_version}) with the current spaCy version ({current}). "
|
||||
"This may lead to unexpected results or runtime errors. To resolve "
|
||||
"this, download a newer compatible model or retrain your custom "
|
||||
"model with the current spaCy version. For more details and "
|
||||
"available updates, run: python -m spacy validate")
|
||||
W033 = ("Training a new {model} using a model with no lexeme normalization "
|
||||
"table. This may degrade the performance of the model to some "
|
||||
"degree. If this is intentional or the language you're using "
|
||||
"doesn't have a normalization table, please ignore this warning. "
|
||||
"If this is surprising, make sure you have the spacy-lookups-data "
|
||||
"package installed. The languages with lexeme normalization tables "
|
||||
"are currently: da, de, el, en, id, lb, pt, ru, sr, ta, th.")
|
||||
|
||||
# TODO: fix numbering after merging develop into master
|
||||
W091 = ("Could not clean/remove the temp directory at {dir}: {msg}.")
|
||||
W092 = ("Ignoring annotations for sentence starts, as dependency heads are set.")
|
||||
W093 = ("Could not find any data to train the {name} on. Is your "
|
||||
"input data correctly formatted ?")
|
||||
W094 = ("Model '{model}' ({model_version}) specifies an under-constrained "
|
||||
"spaCy version requirement: {version}. This can lead to compatibility "
|
||||
"problems with older versions, or as new spaCy versions are "
|
||||
|
@ -133,7 +156,7 @@ class Warnings(object):
|
|||
"so a default configuration was used.")
|
||||
W099 = ("Expected 'dict' type for the 'model' argument of pipe '{pipe}', "
|
||||
"but got '{type}' instead, so ignoring it.")
|
||||
W100 = ("Skipping unsupported morphological feature(s): {feature}. "
|
||||
W100 = ("Skipping unsupported morphological feature(s): '{feature}'. "
|
||||
"Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or "
|
||||
"string \"Field1=Value1,Value2|Field2=Value3\".")
|
||||
|
||||
|
@ -161,18 +184,13 @@ class Errors(object):
|
|||
"`nlp.select_pipes()`, you should remove them explicitly with "
|
||||
"`nlp.remove_pipe()` before the pipeline is restored. Names of "
|
||||
"the new components: {names}")
|
||||
E009 = ("The `update` method expects same number of docs and golds, but "
|
||||
"got: {n_docs} docs, {n_golds} golds.")
|
||||
E010 = ("Word vectors set to length 0. This may be because you don't have "
|
||||
"a model installed or loaded, or because your model doesn't "
|
||||
"include word vectors. For more info, see the docs:\n"
|
||||
"https://spacy.io/usage/models")
|
||||
E011 = ("Unknown operator: '{op}'. Options: {opts}")
|
||||
E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}")
|
||||
E013 = ("Error selecting action in matcher")
|
||||
E014 = ("Unknown tag ID: {tag}")
|
||||
E015 = ("Conflicting morphology exception for ({tag}, {orth}). Use "
|
||||
"`force=True` to overwrite.")
|
||||
E016 = ("MultitaskObjective target should be function or one of: dep, "
|
||||
"tag, ent, dep_tag_offset, ent_tag.")
|
||||
E017 = ("Can only add unicode or bytes. Got type: {value_type}")
|
||||
|
@ -180,21 +198,8 @@ class Errors(object):
|
|||
"refers to an issue with the `Vocab` or `StringStore`.")
|
||||
E019 = ("Can't create transition with unknown action ID: {action}. Action "
|
||||
"IDs are enumerated in spacy/syntax/{src}.pyx.")
|
||||
E020 = ("Could not find a gold-standard action to supervise the "
|
||||
"dependency parser. The tree is non-projective (i.e. it has "
|
||||
"crossing arcs - see spacy/syntax/nonproj.pyx for definitions). "
|
||||
"The ArcEager transition system only supports projective trees. "
|
||||
"To learn non-projective representations, transform the data "
|
||||
"before training and after parsing. Either pass "
|
||||
"`make_projective=True` to the GoldParse class, or use "
|
||||
"spacy.syntax.nonproj.preprocess_training_data.")
|
||||
E021 = ("Could not find a gold-standard action to supervise the "
|
||||
"dependency parser. The GoldParse was projective. The transition "
|
||||
"system has {n_actions} actions. State at failure: {state}")
|
||||
E022 = ("Could not find a transition with the name '{name}' in the NER "
|
||||
"model.")
|
||||
E023 = ("Error cleaning up beam: The same state occurred twice at "
|
||||
"memory address {addr} and position {i}.")
|
||||
E024 = ("Could not find an optimal move to supervise the parser. Usually, "
|
||||
"this means that the model can't be updated in a way that's valid "
|
||||
"and satisfies the correct annotations specified in the GoldParse. "
|
||||
|
@ -238,7 +243,6 @@ class Errors(object):
|
|||
"offset {start}.")
|
||||
E037 = ("Error calculating span: Can't find a token ending at character "
|
||||
"offset {end}.")
|
||||
E038 = ("Error finding sentence for span. Infinite loop detected.")
|
||||
E039 = ("Array bounds exceeded while searching for root word. This likely "
|
||||
"means the parse tree is in an invalid state. Please report this "
|
||||
"issue here: http://github.com/explosion/spaCy/issues")
|
||||
|
@ -269,8 +273,6 @@ class Errors(object):
|
|||
E059 = ("One (and only one) keyword arg must be set. Got: {kwargs}")
|
||||
E060 = ("Cannot add new key to vectors: the table is full. Current shape: "
|
||||
"({rows}, {cols}).")
|
||||
E061 = ("Bad file name: {filename}. Example of a valid file name: "
|
||||
"'vectors.128.f.bin'")
|
||||
E062 = ("Cannot find empty bit for new lexical flag. All bits between 0 "
|
||||
"and 63 are occupied. You can replace one by specifying the "
|
||||
"`flag_id` explicitly, e.g. "
|
||||
|
@ -284,39 +286,17 @@ class Errors(object):
|
|||
"Query string: {string}\nOrth cached: {orth}\nOrth ID: {orth_id}")
|
||||
E065 = ("Only one of the vector table's width and shape can be specified. "
|
||||
"Got width {width} and shape {shape}.")
|
||||
E066 = ("Error creating model helper for extracting columns. Can only "
|
||||
"extract columns by positive integer. Got: {value}.")
|
||||
E067 = ("Invalid BILUO tag sequence: Got a tag starting with 'I' (inside "
|
||||
"an entity) without a preceding 'B' (beginning of an entity). "
|
||||
"Tag sequence:\n{tags}")
|
||||
E068 = ("Invalid BILUO tag: '{tag}'.")
|
||||
E069 = ("Invalid gold-standard parse tree. Found cycle between word "
|
||||
"IDs: {cycle} (tokens: {cycle_tokens}) in the document starting "
|
||||
"with tokens: {doc_tokens}.")
|
||||
E070 = ("Invalid gold-standard data. Number of documents ({n_docs}) "
|
||||
"does not align with number of annotations ({n_annots}).")
|
||||
E071 = ("Error creating lexeme: specified orth ID ({orth}) does not "
|
||||
"match the one in the vocab ({vocab_orth}).")
|
||||
E072 = ("Error serializing lexeme: expected data length {length}, "
|
||||
"got {bad_length}.")
|
||||
E073 = ("Cannot assign vector of length {new_length}. Existing vectors "
|
||||
"are of length {length}. You can use `vocab.reset_vectors` to "
|
||||
"clear the existing vectors and resize the table.")
|
||||
E074 = ("Error interpreting compiled match pattern: patterns are expected "
|
||||
"to end with the attribute {attr}. Got: {bad_attr}.")
|
||||
E075 = ("Error accepting match: length ({length}) > maximum length "
|
||||
"({max_len}).")
|
||||
E076 = ("Error setting tensor on Doc: tensor has {rows} rows, while Doc "
|
||||
"has {words} words.")
|
||||
E077 = ("Error computing {value}: number of Docs ({n_docs}) does not "
|
||||
"equal number of GoldParse objects ({n_golds}) in batch.")
|
||||
E078 = ("Error computing score: number of words in Doc ({words_doc}) does "
|
||||
"not equal number of words in GoldParse ({words_gold}).")
|
||||
E079 = ("Error computing states in beam: number of predicted beams "
|
||||
"({pbeams}) does not equal number of gold beams ({gbeams}).")
|
||||
E080 = ("Duplicate state found in beam: {key}.")
|
||||
E081 = ("Error getting gradient in beam: number of histories ({n_hist}) "
|
||||
"does not equal number of losses ({losses}).")
|
||||
E082 = ("Error deprojectivizing parse: number of heads ({n_heads}), "
|
||||
"projective heads ({n_proj_heads}) and labels ({n_labels}) do not "
|
||||
"match.")
|
||||
|
@ -324,8 +304,6 @@ class Errors(object):
|
|||
"`getter` (plus optional `setter`) is allowed. Got: {nr_defined}")
|
||||
E084 = ("Error assigning label ID {label} to span: not in StringStore.")
|
||||
E085 = ("Can't create lexeme for string '{string}'.")
|
||||
E086 = ("Error deserializing lexeme '{string}': orth ID {orth_id} does "
|
||||
"not match hash {hash_id} in StringStore.")
|
||||
E087 = ("Unknown displaCy style: {style}.")
|
||||
E088 = ("Text of length {length} exceeds maximum of {max_length}. The "
|
||||
"v2.x parser and NER models require roughly 1GB of temporary "
|
||||
|
@ -367,7 +345,6 @@ class Errors(object):
|
|||
E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A "
|
||||
"token can only be part of one entity, so make sure the entities "
|
||||
"you're setting don't overlap.")
|
||||
E104 = ("Can't find JSON schema for '{name}'.")
|
||||
E105 = ("The Doc.print_tree() method is now deprecated. Please use "
|
||||
"Doc.to_json() instead or write your own function.")
|
||||
E106 = ("Can't find doc._.{attr} attribute specified in the underscore "
|
||||
|
@ -390,8 +367,6 @@ class Errors(object):
|
|||
"practically no advantage over pickling the parent Doc directly. "
|
||||
"So instead of pickling the span, pickle the Doc it belongs to or "
|
||||
"use Span.as_doc to convert the span to a standalone Doc object.")
|
||||
E113 = ("The newly split token can only have one root (head = 0).")
|
||||
E114 = ("The newly split token needs to have a root (head = 0).")
|
||||
E115 = ("All subtokens must have associated heads.")
|
||||
E116 = ("Cannot currently add labels to pretrained text classifier. Add "
|
||||
"labels before training begins. This functionality was available "
|
||||
|
@ -414,12 +389,9 @@ class Errors(object):
|
|||
"equal to span length ({span_len}).")
|
||||
E122 = ("Cannot find token to be split. Did it get merged?")
|
||||
E123 = ("Cannot find head of token to be split. Did it get merged?")
|
||||
E124 = ("Cannot read from file: {path}. Supported formats: {formats}")
|
||||
E125 = ("Unexpected value: {value}")
|
||||
E126 = ("Unexpected matcher predicate: '{bad}'. Expected one of: {good}. "
|
||||
"This is likely a bug in spaCy, so feel free to open an issue.")
|
||||
E127 = ("Cannot create phrase pattern representation for length 0. This "
|
||||
"is likely a bug in spaCy.")
|
||||
E128 = ("Unsupported serialization argument: '{arg}'. The use of keyword "
|
||||
"arguments to exclude fields from being serialized or deserialized "
|
||||
"is now deprecated. Please use the `exclude` argument instead. "
|
||||
|
@ -461,8 +433,6 @@ class Errors(object):
|
|||
"provided {found}.")
|
||||
E143 = ("Labels for component '{name}' not initialized. Did you forget to "
|
||||
"call add_label()?")
|
||||
E144 = ("Could not find parameter `{param}` when building the entity "
|
||||
"linker model.")
|
||||
E145 = ("Error reading `{param}` from input file.")
|
||||
E146 = ("Could not access `{path}`.")
|
||||
E147 = ("Unexpected error in the {method} functionality of the "
|
||||
|
@ -474,8 +444,6 @@ class Errors(object):
|
|||
"the component matches the model being loaded.")
|
||||
E150 = ("The language of the `nlp` object and the `vocab` should be the "
|
||||
"same, but found '{nlp}' and '{vocab}' respectively.")
|
||||
E151 = ("Trying to call nlp.update without required annotation types. "
|
||||
"Expected top-level keys: {exp}. Got: {unexp}.")
|
||||
E152 = ("The attribute {attr} is not supported for token patterns. "
|
||||
"Please use the option validate=True with Matcher, PhraseMatcher, "
|
||||
"or EntityRuler for more details.")
|
||||
|
@ -512,11 +480,6 @@ class Errors(object):
|
|||
"that case.")
|
||||
E166 = ("Can only merge DocBins with the same pre-defined attributes.\n"
|
||||
"Current DocBin: {current}\nOther DocBin: {other}")
|
||||
E167 = ("Unknown morphological feature: '{feat}' ({feat_id}). This can "
|
||||
"happen if the tagger was trained with a different set of "
|
||||
"morphological features. If you're using a pretrained model, make "
|
||||
"sure that your models are up to date:\npython -m spacy validate")
|
||||
E168 = ("Unknown field: {field}")
|
||||
E169 = ("Can't find module: {module}")
|
||||
E170 = ("Cannot apply transition {name}: invalid for the current state.")
|
||||
E171 = ("Matcher.add received invalid on_match callback argument: expected "
|
||||
|
@ -527,8 +490,6 @@ class Errors(object):
|
|||
E173 = ("As of v2.2, the Lemmatizer is initialized with an instance of "
|
||||
"Lookups containing the lemmatization tables. See the docs for "
|
||||
"details: https://spacy.io/api/lemmatizer#init")
|
||||
E174 = ("Architecture '{name}' not found in registry. Available "
|
||||
"names: {names}")
|
||||
E175 = ("Can't remove rule for unknown match pattern ID: {key}")
|
||||
E176 = ("Alias '{alias}' is not defined in the Knowledge Base.")
|
||||
E177 = ("Ill-formed IOB input detected: {tag}")
|
||||
|
@ -556,9 +517,6 @@ class Errors(object):
|
|||
"{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.")
|
||||
E186 = ("'{tok_a}' and '{tok_b}' are different texts.")
|
||||
E187 = ("Only unicode strings are supported as labels.")
|
||||
E188 = ("Could not match the gold entity links to entities in the doc - "
|
||||
"make sure the gold EL data refers to valid results of the "
|
||||
"named entity recognizer in the `nlp` pipeline.")
|
||||
E189 = ("Each argument to `get_doc` should be of equal length.")
|
||||
E190 = ("Token head out of range in `Doc.from_array()` for token index "
|
||||
"'{index}' with value '{value}' (equivalent to relative head "
|
||||
|
@ -578,12 +536,32 @@ class Errors(object):
|
|||
E197 = ("Row out of bounds, unable to add row {row} for key {key}.")
|
||||
E198 = ("Unable to return {n} most similar vectors for the current vectors "
|
||||
"table, which contains {n_rows} vectors.")
|
||||
E199 = ("Unable to merge 0-length span at doc[{start}:{end}].")
|
||||
|
||||
# TODO: fix numbering after merging develop into master
|
||||
E983 = ("Invalid key for '{dict_name}': {key}. Available keys: "
|
||||
E970 = ("Can not execute command '{str_command}'. Do you have '{tool}' installed?")
|
||||
E971 = ("Found incompatible lengths in Doc.from_array: {array_length} for the "
|
||||
"array and {doc_length} for the Doc itself.")
|
||||
E972 = ("Example.__init__ got None for '{arg}'. Requires Doc.")
|
||||
E973 = ("Unexpected type for NER data")
|
||||
E974 = ("Unknown {obj} attribute: {key}")
|
||||
E975 = ("The method Example.from_dict expects a Doc as first argument, "
|
||||
"but got {type}")
|
||||
E976 = ("The method Example.from_dict expects a dict as second argument, "
|
||||
"but received None.")
|
||||
E977 = ("Can not compare a MorphAnalysis with a string object. "
|
||||
"This is likely a bug in spaCy, so feel free to open an issue.")
|
||||
E978 = ("The {method} method of component {name} takes a list of Example objects, "
|
||||
"but found {types} instead.")
|
||||
E979 = ("Cannot convert {type} to an Example object.")
|
||||
E980 = ("Each link annotation should refer to a dictionary with at most one "
|
||||
"identifier mapping to 1.0, and all others to 0.0.")
|
||||
E981 = ("The offsets of the annotations for 'links' need to refer exactly "
|
||||
"to the offsets of the 'entities' annotations.")
|
||||
E982 = ("The 'ent_iob' attribute of a Token should be an integer indexing "
|
||||
"into {values}, but found {value}.")
|
||||
E983 = ("Invalid key for '{dict}': {key}. Available keys: "
|
||||
"{keys}")
|
||||
E984 = ("Could not parse the {input} - double check the data is written "
|
||||
"in the correct format as expected by spaCy.")
|
||||
E985 = ("The pipeline component '{component}' is already available in the base "
|
||||
"model. The settings in the component block in the config file are "
|
||||
"being ignored. If you want to replace this component instead, set "
|
||||
|
@ -615,22 +593,13 @@ class Errors(object):
|
|||
E997 = ("Tokenizer special cases are not allowed to modify the text. "
|
||||
"This would map '{chunk}' to '{orth}' given token attributes "
|
||||
"'{token_attrs}'.")
|
||||
E998 = ("To create GoldParse objects from Example objects without a "
|
||||
"Doc, get_gold_parses() should be called with a Vocab object.")
|
||||
E999 = ("Encountered an unexpected format for the dictionary holding "
|
||||
"gold annotations: {gold_dict}")
|
||||
|
||||
|
||||
|
||||
@add_codes
|
||||
class TempErrors(object):
|
||||
T003 = ("Resizing pretrained Tagger models is not currently supported.")
|
||||
T004 = ("Currently parser depth is hard-coded to 1. Received: {value}.")
|
||||
T007 = ("Can't yet set {attr} from Span. Vote for this feature on the "
|
||||
"issue tracker: http://github.com/explosion/spaCy/issues")
|
||||
T008 = ("Bad configuration of Tagger. This is probably a bug within "
|
||||
"spaCy. We changed the name of an internal attribute for loading "
|
||||
"pretrained vectors, and the class has been passed the old name "
|
||||
"(pretrained_dims) but not the new name (pretrained_vectors).")
|
||||
|
||||
|
||||
# fmt: on
|
||||
|
|
|
@@ -1,68 +0,0 @@
from cymem.cymem cimport Pool

from .typedefs cimport attr_t
from .syntax.transition_system cimport Transition

from .tokens import Doc


cdef struct GoldParseC:
    int* tags
    int* heads
    int* has_dep
    int* sent_start
    attr_t* labels
    int** brackets
    Transition* ner


cdef class GoldParse:
    cdef Pool mem

    cdef GoldParseC c
    cdef readonly TokenAnnotation orig

    cdef int length
    cdef public int loss
    cdef public list words
    cdef public list tags
    cdef public list pos
    cdef public list morphs
    cdef public list lemmas
    cdef public list sent_starts
    cdef public list heads
    cdef public list labels
    cdef public dict orths
    cdef public list ner
    cdef public dict brackets
    cdef public dict cats
    cdef public dict links

    cdef readonly list cand_to_gold
    cdef readonly list gold_to_cand


cdef class TokenAnnotation:
    cdef public list ids
    cdef public list words
    cdef public list tags
    cdef public list pos
    cdef public list morphs
    cdef public list lemmas
    cdef public list heads
    cdef public list deps
    cdef public list entities
    cdef public list sent_starts
    cdef public dict brackets_by_start


cdef class DocAnnotation:
    cdef public object cats
    cdef public object links


cdef class Example:
    cdef public object doc
    cdef public TokenAnnotation token_annotation
    cdef public DocAnnotation doc_annotation
    cdef public object goldparse
1419  spacy/gold.pyx
File diff suppressed because it is too large.
0  spacy/gold/__init__.pxd  Normal file
11  spacy/gold/__init__.py  Normal file
@@ -0,0 +1,11 @@
from .corpus import Corpus
from .example import Example
from .align import align

from .iob_utils import iob_to_biluo, biluo_to_iob
from .iob_utils import biluo_tags_from_offsets, offsets_from_biluo_tags
from .iob_utils import spans_from_biluo_tags
from .iob_utils import tags_to_entities

from .gold_io import docs_to_json
from .gold_io import read_json_file
8  spacy/gold/align.pxd  Normal file
@@ -0,0 +1,8 @@
cdef class Alignment:
    cdef public object cost
    cdef public object i2j
    cdef public object j2i
    cdef public object i2j_multi
    cdef public object j2i_multi
    cdef public object cand_to_gold
    cdef public object gold_to_cand
101  spacy/gold/align.pyx  Normal file
@@ -0,0 +1,101 @@
import numpy
from ..errors import Errors, AlignmentError


cdef class Alignment:
    def __init__(self, spacy_words, gold_words):
        # Do many-to-one alignment for misaligned tokens.
        # If we over-segment, we'll have one gold word that covers a sequence
        # of predicted words
        # If we under-segment, we'll have one predicted word that covers a
        # sequence of gold words.
        # If we "mis-segment", we'll have a sequence of predicted words covering
        # a sequence of gold words. That's many-to-many -- we don't do that
        # except for NER spans where the start and end can be aligned.
        cost, i2j, j2i, i2j_multi, j2i_multi = align(spacy_words, gold_words)
        self.cost = cost
        self.i2j = i2j
        self.j2i = j2i
        self.i2j_multi = i2j_multi
        self.j2i_multi = j2i_multi
        self.cand_to_gold = [(j if j >= 0 else None) for j in i2j]
        self.gold_to_cand = [(i if i >= 0 else None) for i in j2i]


def align(tokens_a, tokens_b):
    """Calculate alignment tables between two tokenizations.

    tokens_a (List[str]): The candidate tokenization.
    tokens_b (List[str]): The reference tokenization.
    RETURNS: (tuple): A 5-tuple consisting of the following information:
      * cost (int): The number of misaligned tokens.
      * a2b (List[int]): Mapping of indices in `tokens_a` to indices in `tokens_b`.
        For instance, if `a2b[4] == 6`, that means that `tokens_a[4]` aligns
        to `tokens_b[6]`. If there's no one-to-one alignment for a token,
        it has the value -1.
      * b2a (List[int]): The same as `a2b`, but mapping the other direction.
      * a2b_multi (Dict[int, int]): A dictionary mapping indices in `tokens_a`
        to indices in `tokens_b`, where multiple tokens of `tokens_a` align to
        the same token of `tokens_b`.
      * b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other
        direction.
    """
    tokens_a = _normalize_for_alignment(tokens_a)
    tokens_b = _normalize_for_alignment(tokens_b)
    cost = 0
    a2b = numpy.empty(len(tokens_a), dtype="i")
    b2a = numpy.empty(len(tokens_b), dtype="i")
    a2b.fill(-1)
    b2a.fill(-1)
    a2b_multi = {}
    b2a_multi = {}
    i = 0
    j = 0
    offset_a = 0
    offset_b = 0
    while i < len(tokens_a) and j < len(tokens_b):
        a = tokens_a[i][offset_a:]
        b = tokens_b[j][offset_b:]
        if a == b:
            if offset_a == offset_b == 0:
                a2b[i] = j
                b2a[j] = i
            elif offset_a == 0:
                cost += 2
                a2b_multi[i] = j
            elif offset_b == 0:
                cost += 2
                b2a_multi[j] = i
            offset_a = offset_b = 0
            i += 1
            j += 1
        elif a == "":
            assert offset_a == 0
            cost += 1
            i += 1
        elif b == "":
            assert offset_b == 0
            cost += 1
            j += 1
        elif b.startswith(a):
            cost += 1
            if offset_a == 0:
                a2b_multi[i] = j
            i += 1
            offset_a = 0
            offset_b += len(a)
        elif a.startswith(b):
            cost += 1
            if offset_b == 0:
                b2a_multi[j] = i
            j += 1
            offset_b = 0
            offset_a += len(b)
        else:
            assert "".join(tokens_a) != "".join(tokens_b)
            raise AlignmentError(Errors.E186.format(tok_a=tokens_a, tok_b=tokens_b))
    return cost, a2b, b2a, a2b_multi, b2a_multi


def _normalize_for_alignment(tokens):
    return [w.replace(" ", "").lower() for w in tokens]
111  spacy/gold/augment.py  Normal file
@@ -0,0 +1,111 @@
import random
import itertools


def make_orth_variants_example(nlp, example, orth_variant_level=0.0):  # TODO: naming
    raw_text = example.text
    orig_dict = example.to_dict()
    variant_text, variant_token_annot = make_orth_variants(
        nlp, raw_text, orig_dict["token_annotation"], orth_variant_level
    )
    doc = nlp.make_doc(variant_text)
    orig_dict["token_annotation"] = variant_token_annot
    return example.from_dict(doc, orig_dict)


def make_orth_variants(nlp, raw_text, orig_token_dict, orth_variant_level=0.0):
    if random.random() >= orth_variant_level:
        return raw_text, orig_token_dict
    if not orig_token_dict:
        return raw_text, orig_token_dict
    raw = raw_text
    token_dict = orig_token_dict
    lower = False
    if random.random() >= 0.5:
        lower = True
        if raw is not None:
            raw = raw.lower()
    ndsv = nlp.Defaults.single_orth_variants
    ndpv = nlp.Defaults.paired_orth_variants
    words = token_dict.get("words", [])
    tags = token_dict.get("tags", [])
    # keep unmodified if words or tags are not defined
    if words and tags:
        if lower:
            words = [w.lower() for w in words]
        # single variants
        punct_choices = [random.choice(x["variants"]) for x in ndsv]
        for word_idx in range(len(words)):
            for punct_idx in range(len(ndsv)):
                if (
                    tags[word_idx] in ndsv[punct_idx]["tags"]
                    and words[word_idx] in ndsv[punct_idx]["variants"]
                ):
                    words[word_idx] = punct_choices[punct_idx]
        # paired variants
        punct_choices = [random.choice(x["variants"]) for x in ndpv]
        for word_idx in range(len(words)):
            for punct_idx in range(len(ndpv)):
                if tags[word_idx] in ndpv[punct_idx]["tags"] and words[
                    word_idx
                ] in itertools.chain.from_iterable(ndpv[punct_idx]["variants"]):
                    # backup option: random left vs. right from pair
                    pair_idx = random.choice([0, 1])
                    # best option: rely on paired POS tags like `` / ''
                    if len(ndpv[punct_idx]["tags"]) == 2:
                        pair_idx = ndpv[punct_idx]["tags"].index(tags[word_idx])
                    # next best option: rely on position in variants
                    # (may not be unambiguous, so order of variants matters)
                    else:
                        for pair in ndpv[punct_idx]["variants"]:
                            if words[word_idx] in pair:
                                pair_idx = pair.index(words[word_idx])
                    words[word_idx] = punct_choices[punct_idx][pair_idx]
        token_dict["words"] = words
        token_dict["tags"] = tags
    # modify raw
    if raw is not None:
        variants = []
        for single_variants in ndsv:
            variants.extend(single_variants["variants"])
        for paired_variants in ndpv:
            variants.extend(
                list(itertools.chain.from_iterable(paired_variants["variants"]))
            )
        # store variants in reverse length order to be able to prioritize
        # longer matches (e.g., "---" before "--")
        variants = sorted(variants, key=lambda x: len(x))
        variants.reverse()
        variant_raw = ""
        raw_idx = 0
        # add initial whitespace
        while raw_idx < len(raw) and raw[raw_idx].isspace():
            variant_raw += raw[raw_idx]
            raw_idx += 1
        for word in words:
            match_found = False
            # skip whitespace words
            if word.isspace():
                match_found = True
            # add identical word
            elif word not in variants and raw[raw_idx:].startswith(word):
                variant_raw += word
                raw_idx += len(word)
                match_found = True
            # add variant word
            else:
                for variant in variants:
                    if not match_found and raw[raw_idx:].startswith(variant):
                        raw_idx += len(variant)
                        variant_raw += word
                        match_found = True
            # something went wrong, abort
            # (add a warning message?)
            if not match_found:
                return raw_text, orig_token_dict
            # add following whitespace
            while raw_idx < len(raw) and raw[raw_idx].isspace():
                variant_raw += raw[raw_idx]
                raw_idx += 1
        raw = variant_raw
    return raw, token_dict
6  spacy/gold/converters/__init__.py  Normal file
@@ -0,0 +1,6 @@
from .iob2docs import iob2docs  # noqa: F401
from .conll_ner2docs import conll_ner2docs  # noqa: F401
from .json2docs import json2docs

# TODO: Update this one
# from .conllu2docs import conllu2docs  # noqa: F401
@ -1,17 +1,18 @@
|
|||
from wasabi import Printer
|
||||
|
||||
from .. import tags_to_entities
|
||||
from ...gold import iob_to_biluo
|
||||
from ...lang.xx import MultiLanguage
|
||||
from ...tokens.doc import Doc
|
||||
from ...tokens import Doc, Span
|
||||
from ...util import load_model
|
||||
|
||||
|
||||
def conll_ner2json(
|
||||
def conll_ner2docs(
|
||||
input_data, n_sents=10, seg_sents=False, model=None, no_print=False, **kwargs
|
||||
):
|
||||
"""
|
||||
Convert files in the CoNLL-2003 NER format and similar
|
||||
whitespace-separated columns into JSON format for use with train cli.
|
||||
whitespace-separated columns into Doc objects.
|
||||
|
||||
The first column is the tokens, the final column is the IOB tags. If an
|
||||
additional second column is present, the second column is the tags.
|
||||
|
@ -81,17 +82,25 @@ def conll_ner2json(
|
|||
"No document delimiters found. Use `-n` to automatically group "
|
||||
"sentences into documents."
|
||||
)
|
||||
|
||||
if model:
|
||||
nlp = load_model(model)
|
||||
else:
|
||||
nlp = MultiLanguage()
|
||||
output_docs = []
|
||||
for doc in input_data.strip().split(doc_delimiter):
|
||||
doc = doc.strip()
|
||||
if not doc:
|
||||
for conll_doc in input_data.strip().split(doc_delimiter):
|
||||
conll_doc = conll_doc.strip()
|
||||
if not conll_doc:
|
||||
continue
|
||||
output_doc = []
|
||||
for sent in doc.split("\n\n"):
|
||||
sent = sent.strip()
|
||||
if not sent:
|
||||
words = []
|
||||
sent_starts = []
|
||||
pos_tags = []
|
||||
biluo_tags = []
|
||||
for conll_sent in conll_doc.split("\n\n"):
|
||||
conll_sent = conll_sent.strip()
|
||||
if not conll_sent:
|
||||
continue
|
||||
lines = [line.strip() for line in sent.split("\n") if line.strip()]
|
||||
lines = [line.strip() for line in conll_sent.split("\n") if line.strip()]
|
||||
cols = list(zip(*[line.split() for line in lines]))
|
||||
if len(cols) < 2:
|
||||
raise ValueError(
|
||||
|
@ -99,25 +108,19 @@ def conll_ner2json(
|
|||
"Try checking whitespace and delimiters. See "
|
||||
"https://spacy.io/api/cli#convert"
|
||||
)
|
||||
words = cols[0]
|
||||
iob_ents = cols[-1]
|
||||
if len(cols) > 2:
|
||||
tags = cols[1]
|
||||
else:
|
||||
tags = ["-"] * len(words)
|
||||
biluo_ents = iob_to_biluo(iob_ents)
|
||||
output_doc.append(
|
||||
{
|
||||
"tokens": [
|
||||
{"orth": w, "tag": tag, "ner": ent}
|
||||
for (w, tag, ent) in zip(words, tags, biluo_ents)
|
||||
]
|
||||
}
|
||||
)
|
||||
output_docs.append(
|
||||
{"id": len(output_docs), "paragraphs": [{"sentences": output_doc}]}
|
||||
)
|
||||
output_doc = []
|
||||
length = len(cols[0])
|
||||
words.extend(cols[0])
|
||||
sent_starts.extend([True] + [False] * (length - 1))
|
||||
biluo_tags.extend(iob_to_biluo(cols[-1]))
|
||||
pos_tags.extend(cols[1] if len(cols) > 2 else ["-"] * length)
|
||||
|
||||
doc = Doc(nlp.vocab, words=words)
|
||||
for i, token in enumerate(doc):
|
||||
token.tag_ = pos_tags[i]
|
||||
token.is_sent_start = sent_starts[i]
|
||||
entities = tags_to_entities(biluo_tags)
|
||||
doc.ents = [Span(doc, start=s, end=e + 1, label=L) for L, s, e in entities]
|
||||
output_docs.append(doc)
|
||||
return output_docs
|
||||
|
||||
|
|
@ -1,10 +1,10 @@
|
|||
import re
|
||||
|
||||
from .conll_ner2docs import n_sents_info
|
||||
from ...gold import Example
|
||||
from ...gold import iob_to_biluo, spans_from_biluo_tags, biluo_tags_from_offsets
|
||||
from ...gold import iob_to_biluo, spans_from_biluo_tags
|
||||
from ...language import Language
|
||||
from ...tokens import Doc, Token
|
||||
from .conll_ner2json import n_sents_info
|
||||
from wasabi import Printer
|
||||
|
||||
|
||||
|
@ -12,7 +12,6 @@ def conllu2json(
|
|||
input_data,
|
||||
n_sents=10,
|
||||
append_morphology=False,
|
||||
lang=None,
|
||||
ner_map=None,
|
||||
merge_subtokens=False,
|
||||
no_print=False,
|
||||
|
@ -44,10 +43,7 @@ def conllu2json(
|
|||
raw += example.text
|
||||
sentences.append(
|
||||
generate_sentence(
|
||||
example.token_annotation,
|
||||
has_ner_tags,
|
||||
MISC_NER_PATTERN,
|
||||
ner_map=ner_map,
|
||||
example.to_dict(), has_ner_tags, MISC_NER_PATTERN, ner_map=ner_map,
|
||||
)
|
||||
)
|
||||
# Real-sized documents could be extracted using the comments on the
|
||||
|
@ -145,21 +141,22 @@ def get_entities(lines, tag_pattern, ner_map=None):
|
|||
return iob_to_biluo(iob)
|
||||
|
||||
|
||||
def generate_sentence(token_annotation, has_ner_tags, tag_pattern, ner_map=None):
|
||||
def generate_sentence(example_dict, has_ner_tags, tag_pattern, ner_map=None):
|
||||
sentence = {}
|
||||
tokens = []
|
||||
for i, id_ in enumerate(token_annotation.ids):
|
||||
token_annotation = example_dict["token_annotation"]
|
||||
for i, id_ in enumerate(token_annotation["ids"]):
|
||||
token = {}
|
||||
token["id"] = id_
|
||||
token["orth"] = token_annotation.get_word(i)
|
||||
token["tag"] = token_annotation.get_tag(i)
|
||||
token["pos"] = token_annotation.get_pos(i)
|
||||
token["lemma"] = token_annotation.get_lemma(i)
|
||||
token["morph"] = token_annotation.get_morph(i)
|
||||
token["head"] = token_annotation.get_head(i) - id_
|
||||
token["dep"] = token_annotation.get_dep(i)
|
||||
token["orth"] = token_annotation["words"][i]
|
||||
token["tag"] = token_annotation["tags"][i]
|
||||
token["pos"] = token_annotation["pos"][i]
|
||||
token["lemma"] = token_annotation["lemmas"][i]
|
||||
token["morph"] = token_annotation["morphs"][i]
|
||||
token["head"] = token_annotation["heads"][i] - i
|
||||
token["dep"] = token_annotation["deps"][i]
|
||||
if has_ner_tags:
|
||||
token["ner"] = token_annotation.get_entity(i)
|
||||
token["ner"] = example_dict["doc_annotation"]["entities"][i]
|
||||
tokens.append(token)
|
||||
sentence["tokens"] = tokens
|
||||
return sentence
|
||||
|
@ -267,40 +264,25 @@ def example_from_conllu_sentence(
|
|||
doc = merge_conllu_subtokens(lines, doc)
|
||||
|
||||
# create Example from custom Doc annotation
|
||||
ids, words, tags, heads, deps = [], [], [], [], []
|
||||
pos, lemmas, morphs, spaces = [], [], [], []
|
||||
words, spaces, tags, morphs, lemmas = [], [], [], [], []
|
||||
for i, t in enumerate(doc):
|
||||
ids.append(i)
|
||||
words.append(t._.merged_orth)
|
||||
lemmas.append(t._.merged_lemma)
|
||||
spaces.append(t._.merged_spaceafter)
|
||||
morphs.append(t._.merged_morph)
|
||||
if append_morphology and t._.merged_morph:
|
||||
tags.append(t.tag_ + "__" + t._.merged_morph)
|
||||
else:
|
||||
tags.append(t.tag_)
|
||||
pos.append(t.pos_)
|
||||
morphs.append(t._.merged_morph)
|
||||
lemmas.append(t._.merged_lemma)
|
||||
heads.append(t.head.i)
|
||||
deps.append(t.dep_)
|
||||
spaces.append(t._.merged_spaceafter)
|
||||
ent_offsets = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
|
||||
ents = biluo_tags_from_offsets(doc, ent_offsets)
|
||||
raw = ""
|
||||
for word, space in zip(words, spaces):
|
||||
raw += word
|
||||
if space:
|
||||
raw += " "
|
||||
example = Example(doc=raw)
|
||||
example.set_token_annotation(
|
||||
ids=ids,
|
||||
words=words,
|
||||
tags=tags,
|
||||
pos=pos,
|
||||
morphs=morphs,
|
||||
lemmas=lemmas,
|
||||
heads=heads,
|
||||
deps=deps,
|
||||
entities=ents,
|
||||
)
|
||||
|
||||
doc_x = Doc(vocab, words=words, spaces=spaces)
|
||||
ref_dict = Example(doc_x, reference=doc).to_dict()
|
||||
ref_dict["words"] = words
|
||||
ref_dict["lemmas"] = lemmas
|
||||
ref_dict["spaces"] = spaces
|
||||
ref_dict["tags"] = tags
|
||||
ref_dict["morphs"] = morphs
|
||||
example = Example.from_dict(doc_x, ref_dict)
|
||||
return example
|
||||
|
||||
|
64  spacy/gold/converters/iob2docs.py  Normal file
@@ -0,0 +1,64 @@
from wasabi import Printer

from .conll_ner2docs import n_sents_info
from ...gold import iob_to_biluo, tags_to_entities
from ...tokens import Doc, Span
from ...util import minibatch


def iob2docs(input_data, vocab, n_sents=10, no_print=False, *args, **kwargs):
    """
    Convert IOB files with one sentence per line and tags separated with '|'
    into Doc objects so they can be saved. IOB and IOB2 are accepted.

    Sample formats:

    I|O like|O London|I-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
    I|O like|O London|B-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
    I|PRP|O like|VBP|O London|NNP|I-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
    I|PRP|O like|VBP|O London|NNP|B-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
    """
    msg = Printer(no_print=no_print)
    if n_sents > 0:
        n_sents_info(msg, n_sents)
    docs = read_iob(input_data.split("\n"), vocab, n_sents)
    return docs


def read_iob(raw_sents, vocab, n_sents):
    docs = []
    for group in minibatch(raw_sents, size=n_sents):
        tokens = []
        words = []
        tags = []
        iob = []
        sent_starts = []
        for line in group:
            if not line.strip():
                continue
            sent_tokens = [t.split("|") for t in line.split()]
            if len(sent_tokens[0]) == 3:
                sent_words, sent_tags, sent_iob = zip(*sent_tokens)
            elif len(sent_tokens[0]) == 2:
                sent_words, sent_iob = zip(*sent_tokens)
                sent_tags = ["-"] * len(sent_words)
            else:
                raise ValueError(
                    "The sentence-per-line IOB/IOB2 file is not formatted correctly. Try checking whitespace and delimiters. See https://spacy.io/api/cli#convert"
                )
            words.extend(sent_words)
            tags.extend(sent_tags)
            iob.extend(sent_iob)
            tokens.extend(sent_tokens)
            sent_starts.append(True)
            sent_starts.extend([False for _ in sent_words[1:]])
        doc = Doc(vocab, words=words)
        for i, tag in enumerate(tags):
            doc[i].tag_ = tag
        for i, sent_start in enumerate(sent_starts):
            doc[i].is_sent_start = sent_start
        biluo = iob_to_biluo(iob)
        entities = tags_to_entities(biluo)
        doc.ents = [Span(doc, start=s, end=e+1, label=L) for (L, s, e) in entities]
        docs.append(doc)
    return docs
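
Not part of the diff: a minimal sketch of converting a sentence-per-line IOB string with the new `iob2docs()` converter; the input string follows the sample formats in the docstring above.

    # Illustrative sketch only -- the IOB line below is invented.
    from spacy.vocab import Vocab
    from spacy.gold.converters import iob2docs

    iob_data = "I|O like|O London|B-GPE and|O New|B-GPE York|I-GPE .|O"
    docs = iob2docs(iob_data, Vocab(), n_sents=10, no_print=True)
    # The decoded entities end up on the Doc, e.g. ("London", "GPE"), ("New York", "GPE").
    print([(ent.text, ent.label_) for ent in docs[0].ents])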
24  spacy/gold/converters/json2docs.py  Normal file
@@ -0,0 +1,24 @@
import srsly
from ..gold_io import json_iterate, json_to_annotations
from ..example import annotations2doc
from ..example import _fix_legacy_dict_data, _parse_example_dict_data
from ...util import load_model
from ...lang.xx import MultiLanguage


def json2docs(input_data, model=None, **kwargs):
    nlp = load_model(model) if model is not None else MultiLanguage()
    if not isinstance(input_data, bytes):
        if not isinstance(input_data, str):
            input_data = srsly.json_dumps(input_data)
        input_data = input_data.encode("utf8")
    docs = []
    for json_doc in json_iterate(input_data):
        for json_para in json_to_annotations(json_doc):
            example_dict = _fix_legacy_dict_data(json_para)
            tok_dict, doc_dict = _parse_example_dict_data(example_dict)
            if json_para.get("raw"):
                assert tok_dict.get("SPACY")
            doc = annotations2doc(nlp.vocab, tok_dict, doc_dict)
            docs.append(doc)
    return docs
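
Not part of the diff: a small sketch of feeding spaCy's v2 JSON training format through the new `json2docs()` converter; the tiny JSON document below is invented for the example.

    # Illustrative sketch only -- the JSON document is made up.
    from spacy.gold.converters import json2docs

    json_data = [{
        "id": 0,
        "paragraphs": [{
            "raw": "Hello world",
            "sentences": [{
                "tokens": [
                    {"id": 0, "orth": "Hello", "space": True, "ner": "O"},
                    {"id": 1, "orth": "world", "space": False, "ner": "O"},
                ],
                "brackets": [],
            }],
        }],
    }]
    docs = json2docs(json_data)  # falls back to MultiLanguage() when no model is given
    print([t.text for t in docs[0]])  # ["Hello", "world"]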
122  spacy/gold/corpus.py  Normal file
@@ -0,0 +1,122 @@
import random
from .. import util
from .example import Example
from ..tokens import DocBin, Doc


class Corpus:
    """An annotated corpus, reading train and dev datasets from
    the DocBin (.spacy) format.

    DOCS: https://spacy.io/api/goldcorpus
    """

    def __init__(self, train_loc, dev_loc, limit=0):
        """Create a Corpus.

        train (str / Path): File or directory of training data.
        dev (str / Path): File or directory of development data.
        limit (int): Max. number of examples returned
        RETURNS (Corpus): The newly created object.
        """
        self.train_loc = train_loc
        self.dev_loc = dev_loc
        self.limit = limit

    @staticmethod
    def walk_corpus(path):
        path = util.ensure_path(path)
        if not path.is_dir():
            return [path]
        paths = [path]
        locs = []
        seen = set()
        for path in paths:
            if str(path) in seen:
                continue
            seen.add(str(path))
            if path.parts[-1].startswith("."):
                continue
            elif path.is_dir():
                paths.extend(path.iterdir())
            elif path.parts[-1].endswith(".spacy"):
                locs.append(path)
        return locs

    def make_examples(self, nlp, reference_docs, max_length=0):
        for reference in reference_docs:
            if len(reference) >= max_length >= 1:
                if reference.is_sentenced:
                    for ref_sent in reference.sents:
                        yield Example(
                            nlp.make_doc(ref_sent.text),
                            ref_sent.as_doc()
                        )
            else:
                yield Example(
                    nlp.make_doc(reference.text),
                    reference
                )

    def make_examples_gold_preproc(self, nlp, reference_docs):
        for reference in reference_docs:
            if reference.is_sentenced:
                ref_sents = [sent.as_doc() for sent in reference.sents]
            else:
                ref_sents = [reference]
            for ref_sent in ref_sents:
                yield Example(
                    Doc(
                        nlp.vocab,
                        words=[w.text for w in ref_sent],
                        spaces=[bool(w.whitespace_) for w in ref_sent]
                    ),
                    ref_sent
                )

    def read_docbin(self, vocab, locs):
        """ Yield training examples as example dicts """
        i = 0
        for loc in locs:
            loc = util.ensure_path(loc)
            if loc.parts[-1].endswith(".spacy"):
                with loc.open("rb") as file_:
                    doc_bin = DocBin().from_bytes(file_.read())
                docs = doc_bin.get_docs(vocab)
                for doc in docs:
                    if len(doc):
                        yield doc
                        i += 1
                        if self.limit >= 1 and i >= self.limit:
                            break

    def count_train(self, nlp):
        """Returns count of words in train examples"""
        n = 0
        i = 0
        for example in self.train_dataset(nlp):
            n += len(example.predicted)
            if self.limit >= 0 and i >= self.limit:
                break
            i += 1
        return n

    def train_dataset(self, nlp, *, shuffle=True, gold_preproc=False,
                      max_length=0, **kwargs):
        ref_docs = self.read_docbin(nlp.vocab, self.walk_corpus(self.train_loc))
        if gold_preproc:
            examples = self.make_examples_gold_preproc(nlp, ref_docs)
        else:
            examples = self.make_examples(nlp, ref_docs, max_length)
        if shuffle:
            examples = list(examples)
            random.shuffle(examples)
        yield from examples

    def dev_dataset(self, nlp, *, gold_preproc=False, **kwargs):
        ref_docs = self.read_docbin(nlp.vocab, self.walk_corpus(self.dev_loc))
        if gold_preproc:
            examples = self.make_examples_gold_preproc(nlp, ref_docs)
        else:
            examples = self.make_examples(nlp, ref_docs, max_length=0)
        yield from examples
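
Not part of the diff: an illustrative sketch of reading training Examples from DocBin (.spacy) files with the new Corpus class; the file paths are placeholders.

    # Illustrative sketch only -- ./train.spacy and ./dev.spacy are placeholder paths.
    import spacy
    from spacy.gold import Corpus

    nlp = spacy.blank("en")
    corpus = Corpus("./train.spacy", "./dev.spacy", limit=0)
    # train_dataset() pairs a freshly tokenized prediction Doc with each gold Doc.
    train_examples = list(corpus.train_dataset(nlp, gold_preproc=False, max_length=0))
    print(corpus.count_train(nlp), "training words")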
8  spacy/gold/example.pxd  Normal file
@@ -0,0 +1,8 @@
from ..tokens.doc cimport Doc
from .align cimport Alignment


cdef class Example:
    cdef readonly Doc x
    cdef readonly Doc y
    cdef readonly Alignment _alignment
432  spacy/gold/example.pyx  Normal file
@@ -0,0 +1,432 @@
|
|||
import warnings
|
||||
|
||||
import numpy
|
||||
|
||||
from ..tokens.doc cimport Doc
|
||||
from ..tokens.span cimport Span
|
||||
from ..tokens.span import Span
|
||||
from ..attrs import IDS
|
||||
from .align cimport Alignment
|
||||
from .iob_utils import biluo_to_iob, biluo_tags_from_offsets, biluo_tags_from_doc
|
||||
from .iob_utils import spans_from_biluo_tags
|
||||
from .align import Alignment
|
||||
from ..errors import Errors, Warnings
|
||||
from ..syntax import nonproj
|
||||
|
||||
|
||||
cpdef Doc annotations2doc(vocab, tok_annot, doc_annot):
|
||||
""" Create a Doc from dictionaries with token and doc annotations. Assumes ORTH & SPACY are set. """
|
||||
attrs, array = _annot2array(vocab, tok_annot, doc_annot)
|
||||
output = Doc(vocab, words=tok_annot["ORTH"], spaces=tok_annot["SPACY"])
|
||||
if "entities" in doc_annot:
|
||||
_add_entities_to_doc(output, doc_annot["entities"])
|
||||
if array.size:
|
||||
output = output.from_array(attrs, array)
|
||||
# links are currently added with ENT_KB_ID on the token level
|
||||
output.cats.update(doc_annot.get("cats", {}))
|
||||
return output
|
||||
|
||||
|
||||
cdef class Example:
|
||||
def __init__(self, Doc predicted, Doc reference, *, Alignment alignment=None):
|
||||
""" Doc can either be text, or an actual Doc """
|
||||
if predicted is None:
|
||||
raise TypeError(Errors.E972.format(arg="predicted"))
|
||||
if reference is None:
|
||||
raise TypeError(Errors.E972.format(arg="reference"))
|
||||
self.x = predicted
|
||||
self.y = reference
|
||||
self._alignment = alignment
|
||||
|
||||
property predicted:
|
||||
def __get__(self):
|
||||
return self.x
|
||||
|
||||
def __set__(self, doc):
|
||||
self.x = doc
|
||||
|
||||
property reference:
|
||||
def __get__(self):
|
||||
return self.y
|
||||
|
||||
def __set__(self, doc):
|
||||
self.y = doc
|
||||
|
||||
def copy(self):
|
||||
return Example(
|
||||
self.x.copy(),
|
||||
self.y.copy()
|
||||
)
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, Doc predicted, dict example_dict):
|
||||
if example_dict is None:
|
||||
raise ValueError(Errors.E976)
|
||||
if not isinstance(predicted, Doc):
|
||||
raise TypeError(Errors.E975.format(type=type(predicted)))
|
||||
example_dict = _fix_legacy_dict_data(example_dict)
|
||||
tok_dict, doc_dict = _parse_example_dict_data(example_dict)
|
||||
if "ORTH" not in tok_dict:
|
||||
tok_dict["ORTH"] = [tok.text for tok in predicted]
|
||||
tok_dict["SPACY"] = [tok.whitespace_ for tok in predicted]
|
||||
if not _has_field(tok_dict, "SPACY"):
|
||||
spaces = _guess_spaces(predicted.text, tok_dict["ORTH"])
|
||||
return Example(
|
||||
predicted,
|
||||
annotations2doc(predicted.vocab, tok_dict, doc_dict)
|
||||
)
|
||||
|
||||
@property
|
||||
def alignment(self):
|
||||
if self._alignment is None:
|
||||
spacy_words = [token.orth_ for token in self.predicted]
|
||||
gold_words = [token.orth_ for token in self.reference]
|
||||
if gold_words == []:
|
||||
gold_words = spacy_words
|
||||
self._alignment = Alignment(spacy_words, gold_words)
|
||||
return self._alignment
|
||||
|
||||
def get_aligned(self, field, as_string=False):
|
||||
"""Return an aligned array for a token attribute."""
|
||||
i2j_multi = self.alignment.i2j_multi
|
||||
cand_to_gold = self.alignment.cand_to_gold
|
||||
|
||||
vocab = self.reference.vocab
|
||||
gold_values = self.reference.to_array([field])
|
||||
output = [None] * len(self.predicted)
|
||||
for i, gold_i in enumerate(cand_to_gold):
|
||||
if self.predicted[i].text.isspace():
|
||||
output[i] = None
|
||||
if gold_i is None:
|
||||
if i in i2j_multi:
|
||||
output[i] = gold_values[i2j_multi[i]]
|
||||
else:
|
||||
output[i] = None
|
||||
else:
|
||||
output[i] = gold_values[gold_i]
|
||||
if as_string and field not in ["ENT_IOB", "SENT_START"]:
|
||||
output = [vocab.strings[o] if o is not None else o for o in output]
|
||||
return output
|
||||
|
||||
def get_aligned_parse(self, projectivize=True):
|
||||
cand_to_gold = self.alignment.cand_to_gold
|
||||
gold_to_cand = self.alignment.gold_to_cand
|
||||
aligned_heads = [None] * self.x.length
|
||||
aligned_deps = [None] * self.x.length
|
||||
heads = [token.head.i for token in self.y]
|
||||
deps = [token.dep_ for token in self.y]
|
||||
if projectivize:
|
||||
heads, deps = nonproj.projectivize(heads, deps)
|
||||
for cand_i in range(self.x.length):
|
||||
gold_i = cand_to_gold[cand_i]
|
||||
if gold_i is not None: # Alignment found
|
||||
gold_head = gold_to_cand[heads[gold_i]]
|
||||
if gold_head is not None:
|
||||
aligned_heads[cand_i] = gold_head
|
||||
aligned_deps[cand_i] = deps[gold_i]
|
||||
return aligned_heads, aligned_deps
|
||||
|
||||
def get_aligned_ner(self):
|
||||
if not self.y.is_nered:
|
||||
return [None] * len(self.x) # should this be 'missing' instead of 'None' ?
|
||||
x_text = self.x.text
|
||||
# Get a list of entities, and make spans for non-entity tokens.
|
||||
# We then work through the spans in order, trying to find them in
|
||||
# the text and using that to get the offset. Any token that doesn't
|
||||
# get a tag set this way is tagged None.
|
||||
# This could maybe be improved? It at least feels easy to reason about.
|
||||
y_spans = list(self.y.ents)
|
||||
y_spans.sort()
|
||||
x_text_offset = 0
|
||||
x_spans = []
|
||||
for y_span in y_spans:
|
||||
if x_text.count(y_span.text) >= 1:
|
||||
start_char = x_text.index(y_span.text) + x_text_offset
|
||||
end_char = start_char + len(y_span.text)
|
||||
x_span = self.x.char_span(start_char, end_char, label=y_span.label)
|
||||
if x_span is not None:
|
||||
x_spans.append(x_span)
|
||||
x_text = self.x.text[end_char:]
|
||||
x_text_offset = end_char
|
||||
x_tags = biluo_tags_from_offsets(
|
||||
self.x,
|
||||
[(e.start_char, e.end_char, e.label_) for e in x_spans],
|
||||
missing=None
|
||||
)
|
||||
gold_to_cand = self.alignment.gold_to_cand
|
||||
for token in self.y:
|
||||
if token.ent_iob_ == "O":
|
||||
cand_i = gold_to_cand[token.i]
|
||||
if cand_i is not None and x_tags[cand_i] is None:
|
||||
x_tags[cand_i] = "O"
|
||||
i2j_multi = self.alignment.i2j_multi
|
||||
for i, tag in enumerate(x_tags):
|
||||
if tag is None and i in i2j_multi:
|
||||
gold_i = i2j_multi[i]
|
||||
if gold_i is not None and self.y[gold_i].ent_iob_ == "O":
|
||||
x_tags[i] = "O"
|
||||
return x_tags
|
||||
|
||||
def to_dict(self):
|
||||
return {
|
||||
"doc_annotation": {
|
||||
"cats": dict(self.reference.cats),
|
||||
"entities": biluo_tags_from_doc(self.reference),
|
||||
"links": self._links_to_dict()
|
||||
},
|
||||
"token_annotation": {
|
||||
"ids": [t.i+1 for t in self.reference],
|
||||
"words": [t.text for t in self.reference],
|
||||
"tags": [t.tag_ for t in self.reference],
|
||||
"lemmas": [t.lemma_ for t in self.reference],
|
||||
"pos": [t.pos_ for t in self.reference],
|
||||
"morphs": [t.morph_ for t in self.reference],
|
||||
"heads": [t.head.i for t in self.reference],
|
||||
"deps": [t.dep_ for t in self.reference],
|
||||
"sent_starts": [int(bool(t.is_sent_start)) for t in self.reference]
|
||||
}
|
||||
}
|
||||
|
||||
def _links_to_dict(self):
|
||||
links = {}
|
||||
for ent in self.reference.ents:
|
||||
if ent.kb_id_:
|
||||
links[(ent.start_char, ent.end_char)] = {ent.kb_id_: 1.0}
|
||||
return links
|
||||
|
||||
|
||||
def split_sents(self):
|
||||
""" Split the token annotations into multiple Examples based on
|
||||
sent_starts and return a list of the new Examples"""
|
||||
if not self.reference.is_sentenced:
|
||||
return [self]
|
||||
|
||||
sent_starts = self.get_aligned("SENT_START")
|
||||
sent_starts.append(1) # appending virtual start of a next sentence to facilitate search
|
||||
|
||||
output = []
|
||||
pred_start = 0
|
||||
for sent in self.reference.sents:
|
||||
new_ref = sent.as_doc()
|
||||
pred_end = sent_starts.index(1, pred_start+1) # find where the next sentence starts
|
||||
new_pred = self.predicted[pred_start : pred_end].as_doc()
|
||||
output.append(Example(new_pred, new_ref))
|
||||
pred_start = pred_end
|
||||
|
||||
return output
|
||||
|
||||
property text:
|
||||
def __get__(self):
|
||||
return self.x.text
|
||||
|
||||
def __str__(self):
|
||||
return str(self.to_dict())
|
||||
|
||||
def __repr__(self):
|
||||
return str(self.to_dict())
|
||||
|
||||
|
||||
def _annot2array(vocab, tok_annot, doc_annot):
|
||||
attrs = []
|
||||
values = []
|
||||
|
||||
for key, value in doc_annot.items():
|
||||
if value:
|
||||
if key == "entities":
|
||||
pass
|
||||
elif key == "links":
|
||||
entities = doc_annot.get("entities", {})
|
||||
if not entities:
|
||||
raise ValueError(Errors.E981)
|
||||
ent_kb_ids = _parse_links(vocab, tok_annot["ORTH"], value, entities)
|
||||
tok_annot["ENT_KB_ID"] = ent_kb_ids
|
||||
elif key == "cats":
|
||||
pass
|
||||
else:
|
||||
raise ValueError(Errors.E974.format(obj="doc", key=key))
|
||||
|
||||
for key, value in tok_annot.items():
|
||||
if key not in IDS:
|
||||
raise ValueError(Errors.E974.format(obj="token", key=key))
|
||||
elif key in ["ORTH", "SPACY"]:
|
||||
pass
|
||||
elif key == "HEAD":
|
||||
attrs.append(key)
|
||||
values.append([h-i for i, h in enumerate(value)])
|
||||
elif key == "SENT_START":
|
||||
attrs.append(key)
|
||||
values.append(value)
|
||||
elif key == "MORPH":
|
||||
attrs.append(key)
|
||||
values.append([vocab.morphology.add(v) for v in value])
|
||||
else:
|
||||
attrs.append(key)
|
||||
values.append([vocab.strings.add(v) for v in value])
|
||||
|
||||
array = numpy.asarray(values, dtype="uint64")
|
||||
return attrs, array.T
|
||||
|
||||
|
||||
def _add_entities_to_doc(doc, ner_data):
|
||||
if ner_data is None:
|
||||
return
|
||||
elif ner_data == []:
|
||||
doc.ents = []
|
||||
elif isinstance(ner_data[0], tuple):
|
||||
return _add_entities_to_doc(
|
||||
doc,
|
||||
biluo_tags_from_offsets(doc, ner_data)
|
||||
)
|
||||
elif isinstance(ner_data[0], str) or ner_data[0] is None:
|
||||
return _add_entities_to_doc(
|
||||
doc,
|
||||
spans_from_biluo_tags(doc, ner_data)
|
||||
)
|
||||
elif isinstance(ner_data[0], Span):
|
||||
# Ugh, this is super messy. Really hard to set O entities
|
||||
doc.ents = ner_data
|
||||
doc.ents = [span for span in ner_data if span.label_]
|
||||
else:
|
||||
raise ValueError(Errors.E973)
|
||||
|
||||
|
||||
def _parse_example_dict_data(example_dict):
|
||||
return (
|
||||
example_dict["token_annotation"],
|
||||
example_dict["doc_annotation"]
|
||||
)
|
||||
|
||||
|
||||
def _fix_legacy_dict_data(example_dict):
|
||||
token_dict = example_dict.get("token_annotation", {})
|
||||
doc_dict = example_dict.get("doc_annotation", {})
|
||||
for key, value in example_dict.items():
|
||||
if value:
|
||||
if key in ("token_annotation", "doc_annotation"):
|
||||
pass
|
||||
elif key == "ids":
|
||||
pass
|
||||
elif key in ("cats", "links"):
|
||||
doc_dict[key] = value
|
||||
elif key in ("ner", "entities"):
|
||||
doc_dict["entities"] = value
|
||||
else:
|
||||
token_dict[key] = value
|
||||
# Remap keys
|
||||
remapping = {
|
||||
"words": "ORTH",
|
||||
"tags": "TAG",
|
||||
"pos": "POS",
|
||||
"lemmas": "LEMMA",
|
||||
"deps": "DEP",
|
||||
"heads": "HEAD",
|
||||
"sent_starts": "SENT_START",
|
||||
"morphs": "MORPH",
|
||||
"spaces": "SPACY",
|
||||
}
|
||||
old_token_dict = token_dict
|
||||
token_dict = {}
|
||||
for key, value in old_token_dict.items():
|
||||
if key in ("text", "ids", "brackets"):
|
||||
pass
|
||||
elif key in remapping:
|
||||
token_dict[remapping[key]] = value
|
||||
else:
|
||||
raise KeyError(Errors.E983.format(key=key, dict="token_annotation", keys=remapping.keys()))
|
||||
text = example_dict.get("text", example_dict.get("raw"))
|
||||
if _has_field(token_dict, "ORTH") and not _has_field(token_dict, "SPACY"):
|
||||
token_dict["SPACY"] = _guess_spaces(text, token_dict["ORTH"])
|
||||
if "HEAD" in token_dict and "SENT_START" in token_dict:
|
||||
# If heads are set, we don't also redundantly specify SENT_START.
|
||||
token_dict.pop("SENT_START")
|
||||
warnings.warn(Warnings.W092)
|
||||
return {
|
||||
"token_annotation": token_dict,
|
||||
"doc_annotation": doc_dict
|
||||
}
|
||||
|
||||
def _has_field(annot, field):
|
||||
if field not in annot:
|
||||
return False
|
||||
elif annot[field] is None:
|
||||
return False
|
||||
elif len(annot[field]) == 0:
|
||||
return False
|
||||
elif all([value is None for value in annot[field]]):
|
||||
return False
|
||||
else:
|
||||
return True
|
||||
|
||||
|
||||
def _parse_ner_tags(biluo_or_offsets, vocab, words, spaces):
|
||||
if isinstance(biluo_or_offsets[0], (list, tuple)):
|
||||
# Convert to biluo if necessary
|
||||
# This is annoying but to convert the offsets we need a Doc
|
||||
# that has the target tokenization.
|
||||
reference = Doc(vocab, words=words, spaces=spaces)
|
||||
biluo = biluo_tags_from_offsets(reference, biluo_or_offsets)
|
||||
else:
|
||||
biluo = biluo_or_offsets
|
||||
ent_iobs = []
|
||||
ent_types = []
|
||||
for iob_tag in biluo_to_iob(biluo):
|
||||
if iob_tag in (None, "-"):
|
||||
ent_iobs.append("")
|
||||
ent_types.append("")
|
||||
else:
|
||||
ent_iobs.append(iob_tag.split("-")[0])
|
||||
if iob_tag.startswith("I") or iob_tag.startswith("B"):
|
||||
ent_types.append(iob_tag.split("-", 1)[1])
|
||||
else:
|
||||
ent_types.append("")
|
||||
return ent_iobs, ent_types
|
||||
|
||||
def _parse_links(vocab, words, links, entities):
|
||||
reference = Doc(vocab, words=words)
|
||||
starts = {token.idx: token.i for token in reference}
|
||||
ends = {token.idx + len(token): token.i for token in reference}
|
||||
ent_kb_ids = ["" for _ in reference]
|
||||
entity_map = [(ent[0], ent[1]) for ent in entities]
|
||||
|
||||
# links annotations need to refer 1-1 to entity annotations - throw error otherwise
|
||||
for index, annot_dict in links.items():
|
||||
start_char, end_char = index
|
||||
if (start_char, end_char) not in entity_map:
|
||||
raise ValueError(Errors.E981)
|
||||
|
||||
for index, annot_dict in links.items():
|
||||
true_kb_ids = []
|
||||
for key, value in annot_dict.items():
|
||||
if value == 1.0:
|
||||
true_kb_ids.append(key)
|
||||
if len(true_kb_ids) > 1:
|
||||
raise ValueError(Errors.E980)
|
||||
|
||||
if len(true_kb_ids) == 1:
|
||||
start_char, end_char = index
|
||||
start_token = starts.get(start_char)
|
||||
end_token = ends.get(end_char)
|
||||
for i in range(start_token, end_token+1):
|
||||
ent_kb_ids[i] = true_kb_ids[0]
|
||||
|
||||
return ent_kb_ids
|
||||
|
||||
|
||||
def _guess_spaces(text, words):
|
||||
if text is None:
|
||||
return [True] * len(words)
|
||||
spaces = []
|
||||
text_pos = 0
|
||||
# align words with text
|
||||
for word in words:
|
||||
try:
|
||||
word_start = text[text_pos:].index(word)
|
||||
except ValueError:
|
||||
spaces.append(True)
|
||||
continue
|
||||
text_pos += word_start + len(word)
|
||||
if text_pos < len(text) and text[text_pos] == " ":
|
||||
spaces.append(True)
|
||||
else:
|
||||
spaces.append(False)
|
||||
return spaces
|
199  spacy/gold/gold_io.pyx  Normal file
@@ -0,0 +1,199 @@
|
|||
import warnings
|
||||
import srsly
|
||||
from .. import util
|
||||
from ..errors import Warnings
|
||||
from ..tokens import Doc
|
||||
from .iob_utils import biluo_tags_from_offsets, tags_to_entities
|
||||
import json
|
||||
|
||||
|
||||
def docs_to_json(docs, doc_id=0, ner_missing_tag="O"):
|
||||
"""Convert a list of Doc objects into the JSON-serializable format used by
|
||||
the spacy train command.
|
||||
|
||||
docs (iterable / Doc): The Doc object(s) to convert.
|
||||
doc_id (int): Id for the JSON.
|
||||
RETURNS (dict): The data in spaCy's JSON format
|
||||
- each input doc will be treated as a paragraph in the output doc
|
||||
"""
|
||||
if isinstance(docs, Doc):
|
||||
docs = [docs]
|
||||
json_doc = {"id": doc_id, "paragraphs": []}
|
||||
for i, doc in enumerate(docs):
|
||||
json_para = {'raw': doc.text, "sentences": [], "cats": [], "entities": [], "links": []}
|
||||
for cat, val in doc.cats.items():
|
||||
json_cat = {"label": cat, "value": val}
|
||||
json_para["cats"].append(json_cat)
|
||||
for ent in doc.ents:
|
||||
ent_tuple = (ent.start_char, ent.end_char, ent.label_)
|
||||
json_para["entities"].append(ent_tuple)
|
||||
if ent.kb_id_:
|
||||
link_dict = {(ent.start_char, ent.end_char): {ent.kb_id_: 1.0}}
|
||||
json_para["links"].append(link_dict)
|
||||
ent_offsets = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
|
||||
biluo_tags = biluo_tags_from_offsets(doc, ent_offsets, missing=ner_missing_tag)
|
||||
for j, sent in enumerate(doc.sents):
|
||||
json_sent = {"tokens": [], "brackets": []}
|
||||
for token in sent:
|
||||
json_token = {"id": token.i, "orth": token.text, "space": token.whitespace_}
|
||||
if doc.is_tagged:
|
||||
json_token["tag"] = token.tag_
|
||||
json_token["pos"] = token.pos_
|
||||
json_token["morph"] = token.morph_
|
||||
json_token["lemma"] = token.lemma_
|
||||
if doc.is_parsed:
|
||||
json_token["head"] = token.head.i-token.i
|
||||
json_token["dep"] = token.dep_
|
||||
json_sent["tokens"].append(json_token)
|
||||
json_para["sentences"].append(json_sent)
|
||||
json_doc["paragraphs"].append(json_para)
|
||||
return json_doc
|
||||
|
||||
|
||||
def read_json_file(loc, docs_filter=None, limit=None):
|
||||
"""Read Example dictionaries from a json file or directory."""
|
||||
loc = util.ensure_path(loc)
|
||||
if loc.is_dir():
|
||||
for filename in loc.iterdir():
|
||||
yield from read_json_file(loc / filename, limit=limit)
|
||||
else:
|
||||
with loc.open("rb") as file_:
|
||||
utf8_str = file_.read()
|
||||
for json_doc in json_iterate(utf8_str):
|
||||
if docs_filter is not None and not docs_filter(json_doc):
|
||||
continue
|
||||
for json_paragraph in json_to_annotations(json_doc):
|
||||
yield json_paragraph
|
||||
|
||||
|
||||
def json_to_annotations(doc):
|
||||
"""Convert an item in the JSON-formatted training data to the format
|
||||
used by Example.
|
||||
|
||||
doc (dict): One entry in the training data.
|
||||
YIELDS (tuple): The reformatted data - one training example per paragraph
|
||||
"""
|
||||
for paragraph in doc["paragraphs"]:
|
||||
example = {"text": paragraph.get("raw", None)}
|
||||
words = []
|
||||
spaces = []
|
||||
ids = []
|
||||
tags = []
|
||||
ner_tags = []
|
||||
pos = []
|
||||
morphs = []
|
||||
lemmas = []
|
||||
heads = []
|
||||
labels = []
|
||||
sent_starts = []
|
||||
brackets = []
|
||||
for sent in paragraph["sentences"]:
|
||||
sent_start_i = len(words)
|
||||
for i, token in enumerate(sent["tokens"]):
|
||||
words.append(token["orth"])
|
||||
spaces.append(token.get("space", None))
|
||||
ids.append(token.get('id', sent_start_i + i))
|
||||
tags.append(token.get("tag", None))
|
||||
pos.append(token.get("pos", None))
|
||||
morphs.append(token.get("morph", None))
|
||||
lemmas.append(token.get("lemma", None))
|
||||
if "head" in token:
|
||||
heads.append(token["head"] + sent_start_i + i)
|
||||
else:
|
||||
heads.append(None)
|
||||
if "dep" in token:
|
||||
labels.append(token["dep"])
|
||||
# Ensure ROOT label is case-insensitive
|
||||
if labels[-1].lower() == "root":
|
||||
labels[-1] = "ROOT"
|
||||
else:
|
||||
labels.append(None)
|
||||
ner_tags.append(token.get("ner", None))
|
||||
if i == 0:
|
||||
sent_starts.append(1)
|
||||
else:
|
||||
sent_starts.append(0)
|
||||
if "brackets" in sent:
|
||||
brackets.extend((b["first"] + sent_start_i,
|
||||
b["last"] + sent_start_i, b["label"])
|
||||
for b in sent["brackets"])
|
||||
|
||||
example["token_annotation"] = dict(
|
||||
ids=ids,
|
||||
words=words,
|
||||
spaces=spaces,
|
||||
sent_starts=sent_starts,
|
||||
brackets=brackets
|
||||
)
|
||||
# avoid including dummy values that looks like gold info was present
|
||||
if any(tags):
|
||||
example["token_annotation"]["tags"] = tags
|
||||
if any(pos):
|
||||
example["token_annotation"]["pos"] = pos
|
||||
if any(morphs):
|
||||
example["token_annotation"]["morphs"] = morphs
|
||||
if any(lemmas):
|
||||
example["token_annotation"]["lemmas"] = lemmas
|
||||
if any(head is not None for head in heads):
|
||||
example["token_annotation"]["heads"] = heads
|
||||
if any(labels):
|
||||
example["token_annotation"]["deps"] = labels
|
||||
|
||||
cats = {}
|
||||
for cat in paragraph.get("cats", {}):
|
||||
cats[cat["label"]] = cat["value"]
|
||||
example["doc_annotation"] = dict(
|
||||
cats=cats,
|
||||
entities=ner_tags,
|
||||
links=paragraph.get("links", [])
|
||||
)
|
||||
yield example
|
||||
|
||||
def json_iterate(bytes utf8_str):
|
||||
# We should've made these files jsonl...But since we didn't, parse out
|
||||
# the docs one-by-one to reduce memory usage.
|
||||
# It's okay to read in the whole file -- just don't parse it into JSON.
|
||||
cdef long file_length = len(utf8_str)
|
||||
if file_length > 2 ** 30:
|
||||
warnings.warn(Warnings.W027.format(size=file_length))
|
||||
|
||||
raw = <char*>utf8_str
|
||||
cdef int square_depth = 0
|
||||
cdef int curly_depth = 0
|
||||
cdef int inside_string = 0
|
||||
cdef int escape = 0
|
||||
cdef long start = -1
|
||||
cdef char c
|
||||
cdef char quote = ord('"')
|
||||
cdef char backslash = ord("\\")
|
||||
cdef char open_square = ord("[")
|
||||
cdef char close_square = ord("]")
|
||||
cdef char open_curly = ord("{")
|
||||
cdef char close_curly = ord("}")
|
||||
for i in range(file_length):
|
||||
c = raw[i]
|
||||
if escape:
|
||||
escape = False
|
||||
continue
|
||||
if c == backslash:
|
||||
escape = True
|
||||
continue
|
||||
if c == quote:
|
||||
inside_string = not inside_string
|
||||
continue
|
||||
if inside_string:
|
||||
continue
|
||||
if c == open_square:
|
||||
square_depth += 1
|
||||
elif c == close_square:
|
||||
square_depth -= 1
|
||||
elif c == open_curly:
|
||||
if square_depth == 1 and curly_depth == 0:
|
||||
start = i
|
||||
curly_depth += 1
|
||||
elif c == close_curly:
|
||||
curly_depth -= 1
|
||||
if square_depth == 1 and curly_depth == 0:
|
||||
substr = utf8_str[start : i + 1].decode("utf8")
|
||||
yield srsly.json_loads(substr)
|
||||
start = -1
|
209  spacy/gold/iob_utils.py  Normal file
@@ -0,0 +1,209 @@
|
|||
import warnings

from ..errors import Errors, Warnings
from ..tokens import Span


def iob_to_biluo(tags):
    out = []
    tags = list(tags)
    while tags:
        out.extend(_consume_os(tags))
        out.extend(_consume_ent(tags))
    return out


def biluo_to_iob(tags):
    out = []
    for tag in tags:
        if tag is None:
            out.append(tag)
        else:
            tag = tag.replace("U-", "B-", 1).replace("L-", "I-", 1)
            out.append(tag)
    return out


def _consume_os(tags):
    while tags and tags[0] == "O":
        yield tags.pop(0)


def _consume_ent(tags):
    if not tags:
        return []
    tag = tags.pop(0)
    target_in = "I" + tag[1:]
    target_last = "L" + tag[1:]
    length = 1
    while tags and tags[0] in {target_in, target_last}:
        length += 1
        tags.pop(0)
    label = tag[2:]
    if length == 1:
        if len(label) == 0:
            raise ValueError(Errors.E177.format(tag=tag))
        return ["U-" + label]
    else:
        start = "B-" + label
        end = "L-" + label
        middle = [f"I-{label}" for _ in range(1, length - 1)]
        return [start] + middle + [end]


def biluo_tags_from_doc(doc, missing="O"):
    return biluo_tags_from_offsets(
        doc,
        [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents],
        missing=missing,
    )


def biluo_tags_from_offsets(doc, entities, missing="O"):
    """Encode labelled spans into per-token tags, using the
    Begin/In/Last/Unit/Out scheme (BILUO).

    doc (Doc): The document that the entity offsets refer to. The output tags
        will refer to the token boundaries within the document.
    entities (iterable): A sequence of `(start, end, label)` triples. `start`
        and `end` should be character-offset integers denoting the slice into
        the original string.
    RETURNS (list): A list of unicode strings, describing the tags. Each tag
        string will be of the form either "", "O" or "{action}-{label}", where
        action is one of "B", "I", "L", "U". The string "-" is used where the
        entity offsets don't align with the tokenization in the `Doc` object.
        The training algorithm will view these as missing values. "O" denotes a
        non-entity token. "B" denotes the beginning of a multi-token entity,
        "I" the inside of an entity of three or more tokens, and "L" the end
        of an entity of two or more tokens. "U" denotes a single-token entity.

    EXAMPLE:
        >>> text = 'I like London.'
        >>> entities = [(len('I like '), len('I like London'), 'LOC')]
        >>> doc = nlp.tokenizer(text)
        >>> tags = biluo_tags_from_offsets(doc, entities)
        >>> assert tags == ["O", "O", 'U-LOC', "O"]
    """
    # Ensure no overlapping entity labels exist
    tokens_in_ents = {}

    starts = {token.idx: token.i for token in doc}
    ends = {token.idx + len(token): token.i for token in doc}
    biluo = ["-" for _ in doc]
    # Handle entity cases
    for start_char, end_char, label in entities:
        if not label:
            for s in starts:  # account for many-to-one
                if s >= start_char and s < end_char:
                    biluo[starts[s]] = "O"
        else:
            for token_index in range(start_char, end_char):
                if token_index in tokens_in_ents.keys():
                    raise ValueError(
                        Errors.E103.format(
                            span1=(
                                tokens_in_ents[token_index][0],
                                tokens_in_ents[token_index][1],
                                tokens_in_ents[token_index][2],
                            ),
                            span2=(start_char, end_char, label),
                        )
                    )
                tokens_in_ents[token_index] = (start_char, end_char, label)

            start_token = starts.get(start_char)
            end_token = ends.get(end_char)
            # Only interested if the tokenization is correct
            if start_token is not None and end_token is not None:
                if start_token == end_token:
                    biluo[start_token] = f"U-{label}"
                else:
                    biluo[start_token] = f"B-{label}"
                    for i in range(start_token + 1, end_token):
                        biluo[i] = f"I-{label}"
                    biluo[end_token] = f"L-{label}"
    # Now distinguish the O cases from ones where we miss the tokenization
    entity_chars = set()
    for start_char, end_char, label in entities:
        for i in range(start_char, end_char):
            entity_chars.add(i)
    for token in doc:
        for i in range(token.idx, token.idx + len(token)):
            if i in entity_chars:
                break
        else:
            biluo[token.i] = missing
    if "-" in biluo and missing != "-":
        ent_str = str(entities)
        warnings.warn(
            Warnings.W030.format(
                text=doc.text[:50] + "..." if len(doc.text) > 50 else doc.text,
                entities=ent_str[:50] + "..." if len(ent_str) > 50 else ent_str,
            )
        )
    return biluo


def spans_from_biluo_tags(doc, tags):
    """Encode per-token tags following the BILUO scheme into Span objects, e.g.
    to overwrite the doc.ents.

    doc (Doc): The document that the BILUO tags refer to.
    entities (iterable): A sequence of BILUO tags with each tag describing one
        token. Each tag string will be of the form of either "", "O" or
        "{action}-{label}", where action is one of "B", "I", "L", "U".
    RETURNS (list): A sequence of Span objects.
    """
    token_offsets = tags_to_entities(tags)
    spans = []
    for label, start_idx, end_idx in token_offsets:
        span = Span(doc, start_idx, end_idx + 1, label=label)
        spans.append(span)
    return spans


def offsets_from_biluo_tags(doc, tags):
    """Encode per-token tags following the BILUO scheme into entity offsets.

    doc (Doc): The document that the BILUO tags refer to.
    entities (iterable): A sequence of BILUO tags with each tag describing one
        token. Each tag string will be of the form of either "", "O" or
        "{action}-{label}", where action is one of "B", "I", "L", "U".
    RETURNS (list): A sequence of `(start, end, label)` triples. `start` and
        `end` will be character-offset integers denoting the slice into the
        original string.
    """
    spans = spans_from_biluo_tags(doc, tags)
    return [(span.start_char, span.end_char, span.label_) for span in spans]


def tags_to_entities(tags):
    """Note that the end index returned by this function is inclusive.
    To use it for Span creation, increment the end by 1."""
    entities = []
    start = None
    for i, tag in enumerate(tags):
        if tag is None:
            continue
        if tag.startswith("O"):
            # TODO: We shouldn't be getting these malformed inputs. Fix this.
            if start is not None:
                start = None
            else:
                entities.append(("", i, i))
            continue
        elif tag == "-":
            continue
        elif tag.startswith("I"):
            if start is None:
                raise ValueError(Errors.E067.format(tags=tags[: i + 1]))
            continue
        if tag.startswith("U"):
            entities.append((tag[2:], i, i))
        elif tag.startswith("B"):
            start = i
        elif tag.startswith("L"):
            entities.append((tag[2:], start, i))
            start = None
        else:
            raise ValueError(Errors.E068.format(tag=tag))
    return entities
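The helpers in this new module are round-trip consistent: character offsets can be encoded to per-token BILUO tags and decoded back, and IOB input can be normalised to BILUO first. A short sketch, assuming the module is importable as `spacy.gold.iob_utils` as added above and using a blank English pipeline; the sentence and offsets are illustrative:

```python
import spacy
from spacy.gold.iob_utils import (
    biluo_tags_from_offsets,
    offsets_from_biluo_tags,
    iob_to_biluo,
)

nlp = spacy.blank("en")
doc = nlp("I like London.")

tags = biluo_tags_from_offsets(doc, [(7, 13, "LOC")])
assert tags == ["O", "O", "U-LOC", "O"]  # "London" is a single-token entity

assert offsets_from_biluo_tags(doc, tags) == [(7, 13, "LOC")]  # round trip

assert iob_to_biluo(["O", "B-LOC", "I-LOC"]) == ["O", "B-LOC", "L-LOC"]
```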
|
@ -446,6 +446,8 @@ cdef class Writer:
|
|||
assert not path.isdir(loc), f"{loc} is directory"
|
||||
if isinstance(loc, Path):
|
||||
loc = bytes(loc)
|
||||
if path.exists(loc):
|
||||
assert not path.isdir(loc), "%s is directory." % loc
|
||||
cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
|
||||
self._fp = fopen(<char*>bytes_loc, 'wb')
|
||||
if not self._fp:
|
||||
|
@ -487,10 +489,10 @@ cdef class Writer:
|
|||
|
||||
cdef class Reader:
|
||||
def __init__(self, object loc):
|
||||
assert path.exists(loc)
|
||||
assert not path.isdir(loc)
|
||||
if isinstance(loc, Path):
|
||||
loc = bytes(loc)
|
||||
assert path.exists(loc)
|
||||
assert not path.isdir(loc)
|
||||
cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
|
||||
self._fp = fopen(<char*>bytes_loc, 'rb')
|
||||
if not self._fp:
|
||||
|
|
|
@ -20,29 +20,25 @@ def noun_chunks(doclike):
|
|||
conj = doc.vocab.strings.add("conj")
|
||||
nmod = doc.vocab.strings.add("nmod")
|
||||
np_label = doc.vocab.strings.add("NP")
|
||||
seen = set()
|
||||
prev_end = -1
|
||||
for i, word in enumerate(doclike):
|
||||
if word.pos not in (NOUN, PROPN, PRON):
|
||||
continue
|
||||
# Prevent nested chunks from being produced
|
||||
if word.i in seen:
|
||||
if word.left_edge.i <= prev_end:
|
||||
continue
|
||||
if word.dep in np_deps:
|
||||
if any(w.i in seen for w in word.subtree):
|
||||
continue
|
||||
flag = False
|
||||
if word.pos == NOUN:
|
||||
# check for patterns such as γραμμή παραγωγής
|
||||
for potential_nmod in word.rights:
|
||||
if potential_nmod.dep == nmod:
|
||||
seen.update(
|
||||
j for j in range(word.left_edge.i, potential_nmod.i + 1)
|
||||
)
|
||||
prev_end = potential_nmod.i
|
||||
yield word.left_edge.i, potential_nmod.i + 1, np_label
|
||||
flag = True
|
||||
break
|
||||
if flag is False:
|
||||
seen.update(j for j in range(word.left_edge.i, word.i + 1))
|
||||
prev_end = word.i
|
||||
yield word.left_edge.i, word.i + 1, np_label
|
||||
elif word.dep == conj:
|
||||
# covers the case: έχει όμορφα και έξυπνα παιδιά
|
||||
|
@ -51,9 +47,7 @@ def noun_chunks(doclike):
|
|||
head = head.head
|
||||
# If the head is an NP, and we're coordinated to it, we're an NP
|
||||
if head.dep in np_deps:
|
||||
if any(w.i in seen for w in word.subtree):
|
||||
continue
|
||||
seen.update(j for j in range(word.left_edge.i, word.i + 1))
|
||||
prev_end = word.i
|
||||
yield word.left_edge.i, word.i + 1, np_label
|
||||
|
||||
|
||||
|
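The reworked iterators above all follow the same pattern: remember the token indices already covered (`seen`, `prev_end`) and skip any word whose left edge falls inside a chunk that was already yielded, so nested chunks are never produced. How this surfaces to users, as a sketch; the Greek model name and sentence are illustrative only:

```python
# Sketch only: the syntax iterators feed doc.noun_chunks.
# "el_core_news_sm" is an assumed model name; any parsed Doc works.
import spacy

nlp = spacy.load("el_core_news_sm")
doc = nlp("Η γραμμή παραγωγής σταμάτησε.")
for chunk in doc.noun_chunks:
    # each yielded chunk spans word.left_edge.i up to (but excluding) the end index
    print(chunk.text, chunk.root.dep_)
```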
|
|
@ -25,17 +25,15 @@ def noun_chunks(doclike):
|
|||
np_deps = [doc.vocab.strings.add(label) for label in labels]
|
||||
conj = doc.vocab.strings.add("conj")
|
||||
np_label = doc.vocab.strings.add("NP")
|
||||
seen = set()
|
||||
prev_end = -1
|
||||
for i, word in enumerate(doclike):
|
||||
if word.pos not in (NOUN, PROPN, PRON):
|
||||
continue
|
||||
# Prevent nested chunks from being produced
|
||||
if word.i in seen:
|
||||
if word.left_edge.i <= prev_end:
|
||||
continue
|
||||
if word.dep in np_deps:
|
||||
if any(w.i in seen for w in word.subtree):
|
||||
continue
|
||||
seen.update(j for j in range(word.left_edge.i, word.i + 1))
|
||||
prev_end = word.i
|
||||
yield word.left_edge.i, word.i + 1, np_label
|
||||
elif word.dep == conj:
|
||||
head = word.head
|
||||
|
@ -43,9 +41,7 @@ def noun_chunks(doclike):
|
|||
head = head.head
|
||||
# If the head is an NP, and we're coordinated to it, we're an NP
|
||||
if head.dep in np_deps:
|
||||
if any(w.i in seen for w in word.subtree):
|
||||
continue
|
||||
seen.update(j for j in range(word.left_edge.i, word.i + 1))
|
||||
prev_end = word.i
|
||||
yield word.left_edge.i, word.i + 1, np_label
|
||||
|
||||
|
||||
|
|
|
@ -136,7 +136,19 @@ for pron in ["he", "she", "it"]:
|
|||
|
||||
# W-words, relative pronouns, prepositions etc.
|
||||
|
||||
for word in ["who", "what", "when", "where", "why", "how", "there", "that"]:
|
||||
for word in [
|
||||
"who",
|
||||
"what",
|
||||
"when",
|
||||
"where",
|
||||
"why",
|
||||
"how",
|
||||
"there",
|
||||
"that",
|
||||
"this",
|
||||
"these",
|
||||
"those",
|
||||
]:
|
||||
for orth in [word, word.title()]:
|
||||
_exc[orth + "'s"] = [
|
||||
{ORTH: orth, LEMMA: word, NORM: word},
|
||||
|
@ -396,6 +408,8 @@ _other_exc = {
|
|||
{ORTH: "Let", LEMMA: "let", NORM: "let"},
|
||||
{ORTH: "'s", LEMMA: PRON_LEMMA, NORM: "us"},
|
||||
],
|
||||
"c'mon": [{ORTH: "c'm", NORM: "come", LEMMA: "come"}, {ORTH: "on"}],
|
||||
"C'mon": [{ORTH: "C'm", NORM: "come", LEMMA: "come"}, {ORTH: "on"}],
|
||||
}
|
||||
|
||||
_exc.update(_other_exc)
|
||||
|
|
|
@ -14,5 +14,9 @@ sentences = [
|
|||
"El gato come pescado.",
|
||||
"Veo al hombre con el telescopio.",
|
||||
"La araña come moscas.",
|
||||
"El pingüino incuba en su nido.",
|
||||
"El pingüino incuba en su nido sobre el hielo.",
|
||||
"¿Dónde estais?",
|
||||
"¿Quién es el presidente Francés?",
|
||||
"¿Dónde está encuentra la capital de Argentina?",
|
||||
"¿Cuándo nació José de San Martín?",
|
||||
]
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES
|
||||
from ..char_classes import LIST_ICONS, CURRENCY, LIST_UNITS, PUNCT
|
||||
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA
|
||||
|
|
|
@ -7,8 +7,12 @@ _exc = {
|
|||
|
||||
|
||||
for exc_data in [
|
||||
{ORTH: "n°", LEMMA: "número"},
|
||||
{ORTH: "°C", LEMMA: "grados Celcius"},
|
||||
{ORTH: "aprox.", LEMMA: "aproximadamente"},
|
||||
{ORTH: "dna.", LEMMA: "docena"},
|
||||
{ORTH: "dpto.", LEMMA: "departamento"},
|
||||
{ORTH: "ej.", LEMMA: "ejemplo"},
|
||||
{ORTH: "esq.", LEMMA: "esquina"},
|
||||
{ORTH: "pág.", LEMMA: "página"},
|
||||
{ORTH: "p.ej.", LEMMA: "por ejemplo"},
|
||||
|
@ -16,6 +20,7 @@ for exc_data in [
|
|||
{ORTH: "Vd.", LEMMA: PRON_LEMMA, NORM: "usted"},
|
||||
{ORTH: "Uds.", LEMMA: PRON_LEMMA, NORM: "ustedes"},
|
||||
{ORTH: "Vds.", LEMMA: PRON_LEMMA, NORM: "ustedes"},
|
||||
{ORTH: "vol.", NORM: "volúmen"},
|
||||
]:
|
||||
_exc[exc_data[ORTH]] = [exc_data]
|
||||
|
||||
|
@ -35,10 +40,14 @@ for h in range(1, 12 + 1):
|
|||
for orth in [
|
||||
"a.C.",
|
||||
"a.J.C.",
|
||||
"d.C.",
|
||||
"d.J.C.",
|
||||
"apdo.",
|
||||
"Av.",
|
||||
"Avda.",
|
||||
"Cía.",
|
||||
"Dr.",
|
||||
"Dra.",
|
||||
"EE.UU.",
|
||||
"etc.",
|
||||
"fig.",
|
||||
|
@ -54,9 +63,9 @@ for orth in [
|
|||
"Prof.",
|
||||
"Profa.",
|
||||
"q.e.p.d.",
|
||||
"S.A.",
|
||||
"Q.E.P.D." "S.A.",
|
||||
"S.L.",
|
||||
"s.s.s.",
|
||||
"S.R.L." "s.s.s.",
|
||||
"Sr.",
|
||||
"Sra.",
|
||||
"Srta.",
|
||||
|
|
|
@ -25,17 +25,15 @@ def noun_chunks(doclike):
|
|||
np_deps = [doc.vocab.strings.add(label) for label in labels]
|
||||
conj = doc.vocab.strings.add("conj")
|
||||
np_label = doc.vocab.strings.add("NP")
|
||||
seen = set()
|
||||
prev_end = -1
|
||||
for i, word in enumerate(doclike):
|
||||
if word.pos not in (NOUN, PROPN, PRON):
|
||||
continue
|
||||
# Prevent nested chunks from being produced
|
||||
if word.i in seen:
|
||||
if word.left_edge.i <= prev_end:
|
||||
continue
|
||||
if word.dep in np_deps:
|
||||
if any(w.i in seen for w in word.subtree):
|
||||
continue
|
||||
seen.update(j for j in range(word.left_edge.i, word.i + 1))
|
||||
prev_end = word.i
|
||||
yield word.left_edge.i, word.i + 1, np_label
|
||||
elif word.dep == conj:
|
||||
head = word.head
|
||||
|
@ -43,9 +41,7 @@ def noun_chunks(doclike):
|
|||
head = head.head
|
||||
# If the head is an NP, and we're coordinated to it, we're an NP
|
||||
if head.dep in np_deps:
|
||||
if any(w.i in seen for w in word.subtree):
|
||||
continue
|
||||
seen.update(j for j in range(word.left_edge.i, word.i + 1))
|
||||
prev_end = word.i
|
||||
yield word.left_edge.i, word.i + 1, np_label
|
||||
|
||||
|
||||
|
|
|
@ -531,7 +531,6 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Beaumont-Hamel",
|
||||
"Beaumont-Louestault",
|
||||
"Beaumont-Monteux",
|
||||
"Beaumont-Pied-de-Buf",
|
||||
"Beaumont-Pied-de-Bœuf",
|
||||
"Beaumont-Sardolles",
|
||||
"Beaumont-Village",
|
||||
|
@ -948,7 +947,7 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Buxières-sous-les-Côtes",
|
||||
"Buzy-Darmont",
|
||||
"Byhleguhre-Byhlen",
|
||||
"Burs-en-Othe",
|
||||
"Bœurs-en-Othe",
|
||||
"Bâle-Campagne",
|
||||
"Bâle-Ville",
|
||||
"Béard-Géovreissiat",
|
||||
|
@ -1586,11 +1585,11 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Cruci-Falgardiens",
|
||||
"Cruquius-Oost",
|
||||
"Cruviers-Lascours",
|
||||
"Crèvecur-en-Auge",
|
||||
"Crèvecur-en-Brie",
|
||||
"Crèvecur-le-Grand",
|
||||
"Crèvecur-le-Petit",
|
||||
"Crèvecur-sur-l'Escaut",
|
||||
"Crèvecœur-en-Auge",
|
||||
"Crèvecœur-en-Brie",
|
||||
"Crèvecœur-le-Grand",
|
||||
"Crèvecœur-le-Petit",
|
||||
"Crèvecœur-sur-l'Escaut",
|
||||
"Crécy-Couvé",
|
||||
"Créon-d'Armagnac",
|
||||
"Cubjac-Auvézère-Val-d'Ans",
|
||||
|
@ -1616,7 +1615,7 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Cuxac-Cabardès",
|
||||
"Cuxac-d'Aude",
|
||||
"Cuyk-Sainte-Agathe",
|
||||
"Cuvres-et-Valsery",
|
||||
"Cœuvres-et-Valsery",
|
||||
"Céaux-d'Allègre",
|
||||
"Céleste-Empire",
|
||||
"Cénac-et-Saint-Julien",
|
||||
|
@ -1679,7 +1678,7 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Devrai-Gondragnières",
|
||||
"Dhuys et Morin-en-Brie",
|
||||
"Diane-Capelle",
|
||||
"Dieffenbach-lès-Wrth",
|
||||
"Dieffenbach-lès-Wœrth",
|
||||
"Diekhusen-Fahrstedt",
|
||||
"Diennes-Aubigny",
|
||||
"Diensdorf-Radlow",
|
||||
|
@ -1752,7 +1751,7 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Durdat-Larequille",
|
||||
"Durfort-Lacapelette",
|
||||
"Durfort-et-Saint-Martin-de-Sossenac",
|
||||
"Duil-sur-le-Mignon",
|
||||
"Dœuil-sur-le-Mignon",
|
||||
"Dão-Lafões",
|
||||
"Débats-Rivière-d'Orpra",
|
||||
"Décines-Charpieu",
|
||||
|
@ -2687,8 +2686,8 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Kuhlen-Wendorf",
|
||||
"KwaZulu-Natal",
|
||||
"Kyzyl-Arvat",
|
||||
"Kur-la-Grande",
|
||||
"Kur-la-Petite",
|
||||
"Kœur-la-Grande",
|
||||
"Kœur-la-Petite",
|
||||
"Kölln-Reisiek",
|
||||
"Königsbach-Stein",
|
||||
"Königshain-Wiederau",
|
||||
|
@ -4024,7 +4023,7 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Marcilly-d'Azergues",
|
||||
"Marcillé-Raoul",
|
||||
"Marcillé-Robert",
|
||||
"Marcq-en-Barul",
|
||||
"Marcq-en-Barœul",
|
||||
"Marcy-l'Etoile",
|
||||
"Marcy-l'Étoile",
|
||||
"Mareil-Marly",
|
||||
|
@ -4258,7 +4257,7 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Monlezun-d'Armagnac",
|
||||
"Monléon-Magnoac",
|
||||
"Monnetier-Mornex",
|
||||
"Mons-en-Barul",
|
||||
"Mons-en-Barœul",
|
||||
"Monsempron-Libos",
|
||||
"Monsteroux-Milieu",
|
||||
"Montacher-Villegardin",
|
||||
|
@ -4348,7 +4347,7 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Mornay-Berry",
|
||||
"Mortain-Bocage",
|
||||
"Morteaux-Couliboeuf",
|
||||
"Morteaux-Coulibuf",
|
||||
"Morteaux-Coulibœuf",
|
||||
"Morteaux-Coulibœuf",
|
||||
"Mortes-Frontières",
|
||||
"Mory-Montcrux",
|
||||
|
@ -4391,7 +4390,7 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Muncq-Nieurlet",
|
||||
"Murtin-Bogny",
|
||||
"Murtin-et-le-Châtelet",
|
||||
"Murs-Verdey",
|
||||
"Mœurs-Verdey",
|
||||
"Ménestérol-Montignac",
|
||||
"Ménil'muche",
|
||||
"Ménil-Annelles",
|
||||
|
@ -4612,7 +4611,7 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Neuves-Maisons",
|
||||
"Neuvic-Entier",
|
||||
"Neuvicq-Montguyon",
|
||||
"Neuville-lès-Luilly",
|
||||
"Neuville-lès-Lœuilly",
|
||||
"Neuvy-Bouin",
|
||||
"Neuvy-Deux-Clochers",
|
||||
"Neuvy-Grandchamp",
|
||||
|
@ -4773,8 +4772,8 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Nuncq-Hautecôte",
|
||||
"Nurieux-Volognat",
|
||||
"Nuthe-Urstromtal",
|
||||
"Nux-les-Mines",
|
||||
"Nux-lès-Auxi",
|
||||
"Nœux-les-Mines",
|
||||
"Nœux-lès-Auxi",
|
||||
"Nâves-Parmelan",
|
||||
"Nézignan-l'Evêque",
|
||||
"Nézignan-l'Évêque",
|
||||
|
@ -5343,7 +5342,7 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Quincy-Voisins",
|
||||
"Quincy-sous-le-Mont",
|
||||
"Quint-Fonsegrives",
|
||||
"Quux-Haut-Maînil",
|
||||
"Quœux-Haut-Maînil",
|
||||
"Quœux-Haut-Maînil",
|
||||
"Qwa-Qwa",
|
||||
"R.-V.",
|
||||
|
@ -5631,12 +5630,12 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Saint Aulaye-Puymangou",
|
||||
"Saint Geniez d'Olt et d'Aubrac",
|
||||
"Saint Martin de l'If",
|
||||
"Saint-Denux",
|
||||
"Saint-Jean-de-Buf",
|
||||
"Saint-Martin-le-Nud",
|
||||
"Saint-Michel-Tubuf",
|
||||
"Saint-Denœux",
|
||||
"Saint-Jean-de-Bœuf",
|
||||
"Saint-Martin-le-Nœud",
|
||||
"Saint-Michel-Tubœuf",
|
||||
"Saint-Paul - Flaugnac",
|
||||
"Saint-Pierre-de-Buf",
|
||||
"Saint-Pierre-de-Bœuf",
|
||||
"Saint-Thegonnec Loc-Eguiner",
|
||||
"Sainte-Alvère-Saint-Laurent Les Bâtons",
|
||||
"Salignac-Eyvignes",
|
||||
|
@ -6208,7 +6207,7 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Tite-Live",
|
||||
"Titisee-Neustadt",
|
||||
"Tobel-Tägerschen",
|
||||
"Togny-aux-Bufs",
|
||||
"Togny-aux-Bœufs",
|
||||
"Tongre-Notre-Dame",
|
||||
"Tonnay-Boutonne",
|
||||
"Tonnay-Charente",
|
||||
|
@ -6336,7 +6335,7 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Vals-près-le-Puy",
|
||||
"Valverde-Enrique",
|
||||
"Valzin-en-Petite-Montagne",
|
||||
"Vanduvre-lès-Nancy",
|
||||
"Vandœuvre-lès-Nancy",
|
||||
"Varces-Allières-et-Risset",
|
||||
"Varenne-l'Arconce",
|
||||
"Varenne-sur-le-Doubs",
|
||||
|
@ -6457,9 +6456,9 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Villenave-d'Ornon",
|
||||
"Villequier-Aumont",
|
||||
"Villerouge-Termenès",
|
||||
"Villers-aux-Nuds",
|
||||
"Villers-aux-Nœuds",
|
||||
"Villez-sur-le-Neubourg",
|
||||
"Villiers-en-Désuvre",
|
||||
"Villiers-en-Désœuvre",
|
||||
"Villieu-Loyes-Mollon",
|
||||
"Villingen-Schwenningen",
|
||||
"Villié-Morgon",
|
||||
|
@ -6467,7 +6466,7 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Vilosnes-Haraumont",
|
||||
"Vilters-Wangs",
|
||||
"Vincent-Froideville",
|
||||
"Vincy-Manuvre",
|
||||
"Vincy-Manœuvre",
|
||||
"Vincy-Manœuvre",
|
||||
"Vincy-Reuil-et-Magny",
|
||||
"Vindrac-Alayrac",
|
||||
|
@ -6511,8 +6510,8 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Vrigne-Meusiens",
|
||||
"Vrijhoeve-Capelle",
|
||||
"Vuisternens-devant-Romont",
|
||||
"Vlfling-lès-Bouzonville",
|
||||
"Vuil-et-Giget",
|
||||
"Vœlfling-lès-Bouzonville",
|
||||
"Vœuil-et-Giget",
|
||||
"Vélez-Blanco",
|
||||
"Vélez-Málaga",
|
||||
"Vélez-Rubio",
|
||||
|
@ -6615,7 +6614,7 @@ FR_BASE_EXCEPTIONS = [
|
|||
"Wust-Fischbeck",
|
||||
"Wutha-Farnroda",
|
||||
"Wy-dit-Joli-Village",
|
||||
"Wlfling-lès-Sarreguemines",
|
||||
"Wœlfling-lès-Sarreguemines",
|
||||
"Wünnewil-Flamatt",
|
||||
"X-SAMPA",
|
||||
"X-arbre",
|
||||
|
|
|
@ -24,17 +24,15 @@ def noun_chunks(doclike):
|
|||
np_deps = [doc.vocab.strings[label] for label in labels]
|
||||
conj = doc.vocab.strings.add("conj")
|
||||
np_label = doc.vocab.strings.add("NP")
|
||||
seen = set()
|
||||
prev_end = -1
|
||||
for i, word in enumerate(doclike):
|
||||
if word.pos not in (NOUN, PROPN, PRON):
|
||||
continue
|
||||
# Prevent nested chunks from being produced
|
||||
if word.i in seen:
|
||||
if word.left_edge.i <= prev_end:
|
||||
continue
|
||||
if word.dep in np_deps:
|
||||
if any(w.i in seen for w in word.subtree):
|
||||
continue
|
||||
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
|
||||
prev_end = word.right_edge.i
|
||||
yield word.left_edge.i, word.right_edge.i + 1, np_label
|
||||
elif word.dep == conj:
|
||||
head = word.head
|
||||
|
@ -42,9 +40,7 @@ def noun_chunks(doclike):
|
|||
head = head.head
|
||||
# If the head is an NP, and we're coordinated to it, we're an NP
|
||||
if head.dep in np_deps:
|
||||
if any(w.i in seen for w in word.subtree):
|
||||
continue
|
||||
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
|
||||
prev_end = word.right_edge.i
|
||||
yield word.left_edge.i, word.right_edge.i + 1, np_label
|
||||
|
||||
|
||||
|
|
|
@ -1,7 +1,6 @@
|
|||
import re
|
||||
|
||||
from .punctuation import ELISION, HYPHENS
|
||||
from ..tokenizer_exceptions import URL_PATTERN
|
||||
from ..char_classes import ALPHA_LOWER, ALPHA
|
||||
from ...symbols import ORTH, LEMMA
|
||||
|
||||
|
@ -452,9 +451,6 @@ _regular_exp += [
|
|||
for hc in _hyphen_combination
|
||||
]
|
||||
|
||||
# URLs
|
||||
_regular_exp.append(URL_PATTERN)
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = _exc
|
||||
TOKEN_MATCH = re.compile(
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .stop_words import STOP_WORDS
|
||||
|
||||
from ...language import Language
|
||||
|
|
|
@ -1,7 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
STOP_WORDS = set(
|
||||
"""
|
||||
એમ
|
||||
|
|
|
@ -7,7 +7,6 @@ _concat_icons = CONCAT_ICONS.replace("\u00B0", "")
|
|||
|
||||
_currency = r"\$¢£€¥฿"
|
||||
_quotes = CONCAT_QUOTES.replace("'", "")
|
||||
_units = UNITS.replace("%", "")
|
||||
|
||||
_prefixes = (
|
||||
LIST_PUNCT
|
||||
|
@ -18,7 +17,8 @@ _prefixes = (
|
|||
)
|
||||
|
||||
_suffixes = (
|
||||
LIST_PUNCT
|
||||
[r"\+"]
|
||||
+ LIST_PUNCT
|
||||
+ LIST_ELLIPSES
|
||||
+ LIST_QUOTES
|
||||
+ [_concat_icons]
|
||||
|
@ -26,7 +26,7 @@ _suffixes = (
|
|||
r"(?<=[0-9])\+",
|
||||
r"(?<=°[FfCcKk])\.",
|
||||
r"(?<=[0-9])(?:[{c}])".format(c=_currency),
|
||||
r"(?<=[0-9])(?:{u})".format(u=_units),
|
||||
r"(?<=[0-9])(?:{u})".format(u=UNITS),
|
||||
r"(?<=[{al}{e}{q}(?:{c})])\.".format(
|
||||
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, c=_currency
|
||||
),
|
||||
|
|
|
@ -1,7 +1,6 @@
|
|||
import re
|
||||
|
||||
from ..punctuation import ALPHA_LOWER, CURRENCY
|
||||
from ..tokenizer_exceptions import URL_PATTERN
|
||||
from ...symbols import ORTH
|
||||
|
||||
|
||||
|
@ -646,4 +645,4 @@ _nums = r"(({ne})|({t})|({on})|({c}))({s})?".format(
|
|||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = _exc
|
||||
TOKEN_MATCH = re.compile(r"^({u})|({n})$".format(u=URL_PATTERN, n=_nums)).match
|
||||
TOKEN_MATCH = re.compile(r"^{n}$".format(n=_nums)).match
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from .tag_map import TAG_MAP
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
>>> from spacy.lang.hy.examples import sentences
|
||||
|
|
|
@ -1,12 +1,9 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...attrs import LIKE_NUM
|
||||
|
||||
|
||||
_num_words = [
|
||||
"զրօ",
|
||||
"մէկ",
|
||||
"զրո",
|
||||
"մեկ",
|
||||
"երկու",
|
||||
"երեք",
|
||||
"չորս",
|
||||
|
@ -28,10 +25,10 @@ _num_words = [
|
|||
"քսան" "երեսուն",
|
||||
"քառասուն",
|
||||
"հիսուն",
|
||||
"վաթցսուն",
|
||||
"վաթսուն",
|
||||
"յոթանասուն",
|
||||
"ութսուն",
|
||||
"ինիսուն",
|
||||
"իննսուն",
|
||||
"հարյուր",
|
||||
"հազար",
|
||||
"միլիոն",
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
STOP_WORDS = set(
|
||||
"""
|
||||
նա
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import POS, ADJ, NUM, DET, ADV, ADP, X, VERB, NOUN
|
||||
from ...symbols import PROPN, PART, INTJ, PRON, SCONJ, AUX, CCONJ
|
||||
|
||||
|
|
|
@ -24,17 +24,15 @@ def noun_chunks(doclike):
|
|||
np_deps = [doc.vocab.strings[label] for label in labels]
|
||||
conj = doc.vocab.strings.add("conj")
|
||||
np_label = doc.vocab.strings.add("NP")
|
||||
seen = set()
|
||||
prev_end = -1
|
||||
for i, word in enumerate(doclike):
|
||||
if word.pos not in (NOUN, PROPN, PRON):
|
||||
continue
|
||||
# Prevent nested chunks from being produced
|
||||
if word.i in seen:
|
||||
if word.left_edge.i <= prev_end:
|
||||
continue
|
||||
if word.dep in np_deps:
|
||||
if any(w.i in seen for w in word.subtree):
|
||||
continue
|
||||
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
|
||||
prev_end = word.right_edge.i
|
||||
yield word.left_edge.i, word.right_edge.i + 1, np_label
|
||||
elif word.dep == conj:
|
||||
head = word.head
|
||||
|
@ -42,9 +40,7 @@ def noun_chunks(doclike):
|
|||
head = head.head
|
||||
# If the head is an NP, and we're coordinated to it, we're an NP
|
||||
if head.dep in np_deps:
|
||||
if any(w.i in seen for w in word.subtree):
|
||||
continue
|
||||
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
|
||||
prev_end = word.right_edge.i
|
||||
yield word.left_edge.i, word.right_edge.i + 1, np_label
|
||||
|
||||
|
||||
|
|
|
@ -1,111 +1,266 @@
|
|||
import re
|
||||
from collections import namedtuple
|
||||
import srsly
|
||||
from collections import namedtuple, OrderedDict
|
||||
|
||||
from .stop_words import STOP_WORDS
|
||||
from .syntax_iterators import SYNTAX_ITERATORS
|
||||
from .tag_map import TAG_MAP
|
||||
from .tag_orth_map import TAG_ORTH_MAP
|
||||
from .tag_bigram_map import TAG_BIGRAM_MAP
|
||||
from ...attrs import LANG
|
||||
from ...language import Language
|
||||
from ...tokens import Doc
|
||||
from ...compat import copy_reg
|
||||
from ...errors import Errors
|
||||
from ...language import Language
|
||||
from ...symbols import POS
|
||||
from ...tokens import Doc
|
||||
from ...util import DummyTokenizer
|
||||
from ... import util
|
||||
|
||||
|
||||
# Hold the attributes we need with convenient names
|
||||
DetailedToken = namedtuple("DetailedToken", ["surface", "pos", "lemma"])
|
||||
|
||||
# Handling for multiple spaces in a row is somewhat awkward, this simplifies
|
||||
# the flow by creating a dummy with the same interface.
|
||||
DummyNode = namedtuple("DummyNode", ["surface", "pos", "feature"])
|
||||
DummyNodeFeatures = namedtuple("DummyNodeFeatures", ["lemma"])
|
||||
DummySpace = DummyNode(" ", " ", DummyNodeFeatures(" "))
|
||||
DummyNode = namedtuple("DummyNode", ["surface", "pos", "lemma"])
|
||||
DummySpace = DummyNode(" ", " ", " ")
|
||||
|
||||
|
||||
def try_fugashi_import():
|
||||
"""Fugashi is required for Japanese support, so check for it.
|
||||
It it's not available blow up and explain how to fix it."""
|
||||
def try_sudachi_import(split_mode="A"):
|
||||
"""SudachiPy is required for Japanese support, so check for it.
|
||||
It it's not available blow up and explain how to fix it.
|
||||
split_mode should be one of these values: "A", "B", "C", None->"A"."""
|
||||
try:
|
||||
import fugashi
|
||||
from sudachipy import dictionary, tokenizer
|
||||
|
||||
return fugashi
|
||||
split_mode = {
|
||||
None: tokenizer.Tokenizer.SplitMode.A,
|
||||
"A": tokenizer.Tokenizer.SplitMode.A,
|
||||
"B": tokenizer.Tokenizer.SplitMode.B,
|
||||
"C": tokenizer.Tokenizer.SplitMode.C,
|
||||
}[split_mode]
|
||||
tok = dictionary.Dictionary().create(mode=split_mode)
|
||||
return tok
|
||||
except ImportError:
|
||||
raise ImportError(
|
||||
"Japanese support requires Fugashi: " "https://github.com/polm/fugashi"
|
||||
"Japanese support requires SudachiPy and SudachiDict-core "
|
||||
"(https://github.com/WorksApplications/SudachiPy). "
|
||||
"Install with `pip install sudachipy sudachidict_core` or "
|
||||
"install spaCy with `pip install spacy[ja]`."
|
||||
)
|
||||
|
||||
|
||||
def resolve_pos(token):
|
||||
def resolve_pos(orth, pos, next_pos):
|
||||
"""If necessary, add a field to the POS tag for UD mapping.
|
||||
Under Universal Dependencies, sometimes the same Unidic POS tag can
|
||||
be mapped differently depending on the literal token or its context
|
||||
in the sentence. This function adds information to the POS tag to
|
||||
resolve ambiguous mappings.
|
||||
in the sentence. This function returns resolved POSs for both token
|
||||
and next_token by tuple.
|
||||
"""
|
||||
|
||||
# this is only used for consecutive ascii spaces
|
||||
if token.surface == " ":
|
||||
return "空白"
|
||||
# Some tokens have their UD tag decided based on the POS of the following
|
||||
# token.
|
||||
|
||||
# TODO: This is a first take. The rules here are crude approximations.
|
||||
# For many of these, full dependencies are needed to properly resolve
|
||||
# PoS mappings.
|
||||
if token.pos == "連体詞,*,*,*":
|
||||
if re.match(r"[こそあど此其彼]の", token.surface):
|
||||
return token.pos + ",DET"
|
||||
if re.match(r"[こそあど此其彼]", token.surface):
|
||||
return token.pos + ",PRON"
|
||||
return token.pos + ",ADJ"
|
||||
return token.pos
|
||||
# orth based rules
|
||||
if pos[0] in TAG_ORTH_MAP:
|
||||
orth_map = TAG_ORTH_MAP[pos[0]]
|
||||
if orth in orth_map:
|
||||
return orth_map[orth], None
|
||||
|
||||
# tag bi-gram mapping
|
||||
if next_pos:
|
||||
tag_bigram = pos[0], next_pos[0]
|
||||
if tag_bigram in TAG_BIGRAM_MAP:
|
||||
bipos = TAG_BIGRAM_MAP[tag_bigram]
|
||||
if bipos[0] is None:
|
||||
return TAG_MAP[pos[0]][POS], bipos[1]
|
||||
else:
|
||||
return bipos
|
||||
|
||||
return TAG_MAP[pos[0]][POS], None
|
||||
|
||||
|
||||
def get_words_and_spaces(tokenizer, text):
|
||||
"""Get the individual tokens that make up the sentence and handle white space.
|
||||
# Use a mapping of paired punctuation to avoid splitting quoted sentences.
|
||||
pairpunct = {"「": "」", "『": "』", "【": "】"}
|
||||
|
||||
Japanese doesn't usually use white space, and MeCab's handling of it for
|
||||
multiple spaces in a row is somewhat awkward.
|
||||
|
||||
def separate_sentences(doc):
|
||||
"""Given a doc, mark tokens that start sentences based on Unidic tags.
|
||||
"""
|
||||
|
||||
tokens = tokenizer.parseToNodeList(text)
|
||||
stack = [] # save paired punctuation
|
||||
|
||||
for i, token in enumerate(doc[:-2]):
|
||||
# Set all tokens after the first to false by default. This is necessary
|
||||
# for the doc code to be aware we've done sentencization, see
|
||||
# `is_sentenced`.
|
||||
token.sent_start = i == 0
|
||||
if token.tag_:
|
||||
if token.tag_ == "補助記号-括弧開":
|
||||
ts = str(token)
|
||||
if ts in pairpunct:
|
||||
stack.append(pairpunct[ts])
|
||||
elif stack and ts == stack[-1]:
|
||||
stack.pop()
|
||||
|
||||
if token.tag_ == "補助記号-句点":
|
||||
next_token = doc[i + 1]
|
||||
if next_token.tag_ != token.tag_ and not stack:
|
||||
next_token.sent_start = True
|
||||
|
||||
|
||||
def get_dtokens(tokenizer, text):
|
||||
tokens = tokenizer.tokenize(text)
|
||||
words = []
|
||||
spaces = []
|
||||
for token in tokens:
|
||||
# If there's more than one space, spaces after the first become tokens
|
||||
for ii in range(len(token.white_space) - 1):
|
||||
words.append(DummySpace)
|
||||
spaces.append(False)
|
||||
for ti, token in enumerate(tokens):
|
||||
tag = "-".join([xx for xx in token.part_of_speech()[:4] if xx != "*"])
|
||||
inf = "-".join([xx for xx in token.part_of_speech()[4:] if xx != "*"])
|
||||
dtoken = DetailedToken(token.surface(), (tag, inf), token.dictionary_form())
|
||||
if ti > 0 and words[-1].pos[0] == "空白" and tag == "空白":
|
||||
# don't add multiple space tokens in a row
|
||||
continue
|
||||
words.append(dtoken)
|
||||
|
||||
words.append(token)
|
||||
spaces.append(bool(token.white_space))
|
||||
return words, spaces
|
||||
# remove empty tokens. These can be produced with characters like … that
|
||||
# Sudachi normalizes internally.
|
||||
words = [ww for ww in words if len(ww.surface) > 0]
|
||||
return words
|
||||
|
||||
|
||||
def get_words_lemmas_tags_spaces(dtokens, text, gap_tag=("空白", "")):
|
||||
words = [x.surface for x in dtokens]
|
||||
if "".join("".join(words).split()) != "".join(text.split()):
|
||||
raise ValueError(Errors.E194.format(text=text, words=words))
|
||||
text_words = []
|
||||
text_lemmas = []
|
||||
text_tags = []
|
||||
text_spaces = []
|
||||
text_pos = 0
|
||||
# handle empty and whitespace-only texts
|
||||
if len(words) == 0:
|
||||
return text_words, text_lemmas, text_tags, text_spaces
|
||||
elif len([word for word in words if not word.isspace()]) == 0:
|
||||
assert text.isspace()
|
||||
text_words = [text]
|
||||
text_lemmas = [text]
|
||||
text_tags = [gap_tag]
|
||||
text_spaces = [False]
|
||||
return text_words, text_lemmas, text_tags, text_spaces
|
||||
# normalize words to remove all whitespace tokens
|
||||
norm_words, norm_dtokens = zip(
|
||||
*[
|
||||
(word, dtokens)
|
||||
for word, dtokens in zip(words, dtokens)
|
||||
if not word.isspace()
|
||||
]
|
||||
)
|
||||
# align words with text
|
||||
for word, dtoken in zip(norm_words, norm_dtokens):
|
||||
try:
|
||||
word_start = text[text_pos:].index(word)
|
||||
except ValueError:
|
||||
raise ValueError(Errors.E194.format(text=text, words=words))
|
||||
if word_start > 0:
|
||||
w = text[text_pos : text_pos + word_start]
|
||||
text_words.append(w)
|
||||
text_lemmas.append(w)
|
||||
text_tags.append(gap_tag)
|
||||
text_spaces.append(False)
|
||||
text_pos += word_start
|
||||
text_words.append(word)
|
||||
text_lemmas.append(dtoken.lemma)
|
||||
text_tags.append(dtoken.pos)
|
||||
text_spaces.append(False)
|
||||
text_pos += len(word)
|
||||
if text_pos < len(text) and text[text_pos] == " ":
|
||||
text_spaces[-1] = True
|
||||
text_pos += 1
|
||||
if text_pos < len(text):
|
||||
w = text[text_pos:]
|
||||
text_words.append(w)
|
||||
text_lemmas.append(w)
|
||||
text_tags.append(gap_tag)
|
||||
text_spaces.append(False)
|
||||
return text_words, text_lemmas, text_tags, text_spaces
|
||||
|
||||
|
||||
class JapaneseTokenizer(DummyTokenizer):
|
||||
def __init__(self, cls, nlp=None):
|
||||
def __init__(self, cls, nlp=None, config={}):
|
||||
self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
|
||||
self.tokenizer = try_fugashi_import().Tagger()
|
||||
self.tokenizer.parseToNodeList("") # see #2901
|
||||
self.split_mode = config.get("split_mode", None)
|
||||
self.tokenizer = try_sudachi_import(self.split_mode)
|
||||
|
||||
def __call__(self, text):
|
||||
dtokens, spaces = get_words_and_spaces(self.tokenizer, text)
|
||||
words = [x.surface for x in dtokens]
|
||||
dtokens = get_dtokens(self.tokenizer, text)
|
||||
|
||||
words, lemmas, unidic_tags, spaces = get_words_lemmas_tags_spaces(dtokens, text)
|
||||
doc = Doc(self.vocab, words=words, spaces=spaces)
|
||||
unidic_tags = []
|
||||
for token, dtoken in zip(doc, dtokens):
|
||||
unidic_tags.append(dtoken.pos)
|
||||
token.tag_ = resolve_pos(dtoken)
|
||||
next_pos = None
|
||||
for idx, (token, lemma, unidic_tag) in enumerate(zip(doc, lemmas, unidic_tags)):
|
||||
token.tag_ = unidic_tag[0]
|
||||
if next_pos:
|
||||
token.pos = next_pos
|
||||
next_pos = None
|
||||
else:
|
||||
token.pos, next_pos = resolve_pos(
|
||||
token.orth_,
|
||||
unidic_tag,
|
||||
unidic_tags[idx + 1] if idx + 1 < len(unidic_tags) else None,
|
||||
)
|
||||
|
||||
# if there's no lemma info (it's an unk) just use the surface
|
||||
token.lemma_ = dtoken.feature.lemma or dtoken.surface
|
||||
token.lemma_ = lemma
|
||||
doc.user_data["unidic_tags"] = unidic_tags
|
||||
|
||||
return doc
|
||||
|
||||
def _get_config(self):
|
||||
config = OrderedDict((("split_mode", self.split_mode),))
|
||||
return config
|
||||
|
||||
def _set_config(self, config={}):
|
||||
self.split_mode = config.get("split_mode", None)
|
||||
|
||||
def to_bytes(self, **kwargs):
|
||||
serializers = OrderedDict(
|
||||
(("cfg", lambda: srsly.json_dumps(self._get_config())),)
|
||||
)
|
||||
return util.to_bytes(serializers, [])
|
||||
|
||||
def from_bytes(self, data, **kwargs):
|
||||
deserializers = OrderedDict(
|
||||
(("cfg", lambda b: self._set_config(srsly.json_loads(b))),)
|
||||
)
|
||||
util.from_bytes(data, deserializers, [])
|
||||
self.tokenizer = try_sudachi_import(self.split_mode)
|
||||
return self
|
||||
|
||||
def to_disk(self, path, **kwargs):
|
||||
path = util.ensure_path(path)
|
||||
serializers = OrderedDict(
|
||||
(("cfg", lambda p: srsly.write_json(p, self._get_config())),)
|
||||
)
|
||||
return util.to_disk(path, serializers, [])
|
||||
|
||||
def from_disk(self, path, **kwargs):
|
||||
path = util.ensure_path(path)
|
||||
serializers = OrderedDict(
|
||||
(("cfg", lambda p: self._set_config(srsly.read_json(p))),)
|
||||
)
|
||||
util.from_disk(path, serializers, [])
|
||||
self.tokenizer = try_sudachi_import(self.split_mode)
|
||||
|
||||
|
||||
class JapaneseDefaults(Language.Defaults):
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters[LANG] = lambda _text: "ja"
|
||||
stop_words = STOP_WORDS
|
||||
tag_map = TAG_MAP
|
||||
syntax_iterators = SYNTAX_ITERATORS
|
||||
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
|
||||
|
||||
@classmethod
|
||||
def create_tokenizer(cls, nlp=None):
|
||||
return JapaneseTokenizer(cls, nlp)
|
||||
def create_tokenizer(cls, nlp=None, config={}):
|
||||
return JapaneseTokenizer(cls, nlp, config)
|
||||
|
||||
|
||||
class Japanese(Language):
|
||||
|
|
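With the switch from fugashi/MeCab to SudachiPy, the tokenizer gains a `split_mode` setting ("A", "B" or "C") that is persisted through `to_bytes`/`to_disk`. A sketch of how it might be used; the exact way `config` is threaded through `create_tokenizer` is an assumption based on the signatures above:

```python
# Sketch only: Japanese tokenization with an explicit Sudachi split mode.
# Requires `pip install sudachipy sudachidict_core` (or `pip install spacy[ja]`).
import spacy
from spacy.lang.ja import JapaneseDefaults

nlp = spacy.blank("ja")  # default split mode is "A"
print([t.text for t in nlp("選挙管理委員会")])

# Hypothetical direct construction with a coarser split mode.
tokenizer_c = JapaneseDefaults.create_tokenizer(config={"split_mode": "C"})
print([t.text for t in tokenizer_c("選挙管理委員会")])
```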
spacy/lang/ja/bunsetu.py (new file, 176 lines)
@@ -0,0 +1,176 @@
POS_PHRASE_MAP = {
|
||||
"NOUN": "NP",
|
||||
"NUM": "NP",
|
||||
"PRON": "NP",
|
||||
"PROPN": "NP",
|
||||
"VERB": "VP",
|
||||
"ADJ": "ADJP",
|
||||
"ADV": "ADVP",
|
||||
"CCONJ": "CCONJP",
|
||||
}
|
||||
|
||||
|
||||
# return value: [(bunsetu_tokens, phrase_type={'NP', 'VP', 'ADJP', 'ADVP'}, phrase_tokens)]
|
||||
def yield_bunsetu(doc, debug=False):
|
||||
bunsetu = []
|
||||
bunsetu_may_end = False
|
||||
phrase_type = None
|
||||
phrase = None
|
||||
prev = None
|
||||
prev_tag = None
|
||||
prev_dep = None
|
||||
prev_head = None
|
||||
for t in doc:
|
||||
pos = t.pos_
|
||||
pos_type = POS_PHRASE_MAP.get(pos, None)
|
||||
tag = t.tag_
|
||||
dep = t.dep_
|
||||
head = t.head.i
|
||||
if debug:
|
||||
print(
|
||||
t.i,
|
||||
t.orth_,
|
||||
pos,
|
||||
pos_type,
|
||||
dep,
|
||||
head,
|
||||
bunsetu_may_end,
|
||||
phrase_type,
|
||||
phrase,
|
||||
bunsetu,
|
||||
)
|
||||
|
||||
# DET is always an individual bunsetu
|
||||
if pos == "DET":
|
||||
if bunsetu:
|
||||
yield bunsetu, phrase_type, phrase
|
||||
yield [t], None, None
|
||||
bunsetu = []
|
||||
bunsetu_may_end = False
|
||||
phrase_type = None
|
||||
phrase = None
|
||||
|
||||
# PRON or Open PUNCT always splits bunsetu
|
||||
elif tag == "補助記号-括弧開":
|
||||
if bunsetu:
|
||||
yield bunsetu, phrase_type, phrase
|
||||
bunsetu = [t]
|
||||
bunsetu_may_end = True
|
||||
phrase_type = None
|
||||
phrase = None
|
||||
|
||||
# bunsetu head not appeared
|
||||
elif phrase_type is None:
|
||||
if bunsetu and prev_tag == "補助記号-読点":
|
||||
yield bunsetu, phrase_type, phrase
|
||||
bunsetu = []
|
||||
bunsetu_may_end = False
|
||||
phrase_type = None
|
||||
phrase = None
|
||||
bunsetu.append(t)
|
||||
if pos_type: # begin phrase
|
||||
phrase = [t]
|
||||
phrase_type = pos_type
|
||||
if pos_type in {"ADVP", "CCONJP"}:
|
||||
bunsetu_may_end = True
|
||||
|
||||
# entering new bunsetu
|
||||
elif pos_type and (
|
||||
pos_type != phrase_type
|
||||
or bunsetu_may_end # different phrase type arises # same phrase type but bunsetu already ended
|
||||
):
|
||||
# exceptional case: NOUN to VERB
|
||||
if (
|
||||
phrase_type == "NP"
|
||||
and pos_type == "VP"
|
||||
and prev_dep == "compound"
|
||||
and prev_head == t.i
|
||||
):
|
||||
bunsetu.append(t)
|
||||
phrase_type = "VP"
|
||||
phrase.append(t)
|
||||
# exceptional case: VERB to NOUN
|
||||
elif (
|
||||
phrase_type == "VP"
|
||||
and pos_type == "NP"
|
||||
and (
|
||||
prev_dep == "compound"
|
||||
and prev_head == t.i
|
||||
or dep == "compound"
|
||||
and prev == head
|
||||
or prev_dep == "nmod"
|
||||
and prev_head == t.i
|
||||
)
|
||||
):
|
||||
bunsetu.append(t)
|
||||
phrase_type = "NP"
|
||||
phrase.append(t)
|
||||
else:
|
||||
yield bunsetu, phrase_type, phrase
|
||||
bunsetu = [t]
|
||||
bunsetu_may_end = False
|
||||
phrase_type = pos_type
|
||||
phrase = [t]
|
||||
|
||||
# NOUN bunsetu
|
||||
elif phrase_type == "NP":
|
||||
bunsetu.append(t)
|
||||
if not bunsetu_may_end and (
|
||||
(
|
||||
(pos_type == "NP" or pos == "SYM")
|
||||
and (prev_head == t.i or prev_head == head)
|
||||
and prev_dep in {"compound", "nummod"}
|
||||
)
|
||||
or (
|
||||
pos == "PART"
|
||||
and (prev == head or prev_head == head)
|
||||
and dep == "mark"
|
||||
)
|
||||
):
|
||||
phrase.append(t)
|
||||
else:
|
||||
bunsetu_may_end = True
|
||||
|
||||
# VERB bunsetu
|
||||
elif phrase_type == "VP":
|
||||
bunsetu.append(t)
|
||||
if (
|
||||
not bunsetu_may_end
|
||||
and pos == "VERB"
|
||||
and prev_head == t.i
|
||||
and prev_dep == "compound"
|
||||
):
|
||||
phrase.append(t)
|
||||
else:
|
||||
bunsetu_may_end = True
|
||||
|
||||
# ADJ bunsetu
|
||||
elif phrase_type == "ADJP" and tag != "連体詞":
|
||||
bunsetu.append(t)
|
||||
if not bunsetu_may_end and (
|
||||
(
|
||||
pos == "NOUN"
|
||||
and (prev_head == t.i or prev_head == head)
|
||||
and prev_dep in {"amod", "compound"}
|
||||
)
|
||||
or (
|
||||
pos == "PART"
|
||||
and (prev == head or prev_head == head)
|
||||
and dep == "mark"
|
||||
)
|
||||
):
|
||||
phrase.append(t)
|
||||
else:
|
||||
bunsetu_may_end = True
|
||||
|
||||
# other bunsetu
|
||||
else:
|
||||
bunsetu.append(t)
|
||||
|
||||
prev = t.i
|
||||
prev_tag = t.tag_
|
||||
prev_dep = t.dep_
|
||||
prev_head = head
|
||||
|
||||
if bunsetu:
|
||||
yield bunsetu, phrase_type, phrase
|
spacy/lang/ja/syntax_iterators.py (new file, 54 lines)
@@ -0,0 +1,54 @@
from ...symbols import NOUN, PROPN, PRON, VERB

# XXX this can probably be pruned a bit
labels = [
    "nsubj",
    "nmod",
    "dobj",
    "nsubjpass",
    "pcomp",
    "pobj",
    "obj",
    "obl",
    "dative",
    "appos",
    "attr",
    "ROOT",
]


def noun_chunks(obj):
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    """
    doc = obj.doc  # Ensure works on both Doc and Span.
    np_deps = [doc.vocab.strings.add(label) for label in labels]
    doc.vocab.strings.add("conj")
    np_label = doc.vocab.strings.add("NP")
    seen = set()
    for i, word in enumerate(obj):
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
        if word.i in seen:
            continue
        if word.dep in np_deps:
            unseen = [w.i for w in word.subtree if w.i not in seen]
            if not unseen:
                continue

            # this takes care of particles etc.
            seen.update(j.i for j in word.subtree)
            # This avoids duplicating embedded clauses
            seen.update(range(word.i + 1))

            # if the head of this is a verb, mark that and rights seen
            # Don't do the subtree as that can hide other phrases
            if word.head.pos == VERB:
                seen.add(word.head.i)
                seen.update(w.i for w in word.head.rights)
            yield unseen[0], word.i + 1, np_label


SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}
spacy/lang/ja/tag_bigram_map.py (new file, 28 lines)
@@ -0,0 +1,28 @@
from ...symbols import ADJ, AUX, NOUN, PART, VERB

# mapping from tag bi-gram to pos of previous token
TAG_BIGRAM_MAP = {
    # This covers only small part of AUX.
    ("形容詞-非自立可能", "助詞-終助詞"): (AUX, None),
    ("名詞-普通名詞-形状詞可能", "助動詞"): (ADJ, None),
    # ("副詞", "名詞-普通名詞-形状詞可能"): (None, ADJ),
    # This covers acl, advcl, obl and root, but has side effect for compound.
    ("名詞-普通名詞-サ変可能", "動詞-非自立可能"): (VERB, AUX),
    # This covers almost all of the deps
    ("名詞-普通名詞-サ変形状詞可能", "動詞-非自立可能"): (VERB, AUX),
    ("名詞-普通名詞-副詞可能", "動詞-非自立可能"): (None, VERB),
    ("副詞", "動詞-非自立可能"): (None, VERB),
    ("形容詞-一般", "動詞-非自立可能"): (None, VERB),
    ("形容詞-非自立可能", "動詞-非自立可能"): (None, VERB),
    ("接頭辞", "動詞-非自立可能"): (None, VERB),
    ("助詞-係助詞", "動詞-非自立可能"): (None, VERB),
    ("助詞-副助詞", "動詞-非自立可能"): (None, VERB),
    ("助詞-格助詞", "動詞-非自立可能"): (None, VERB),
    ("補助記号-読点", "動詞-非自立可能"): (None, VERB),
    ("形容詞-一般", "接尾辞-名詞的-一般"): (None, PART),
    ("助詞-格助詞", "形状詞-助動詞語幹"): (None, NOUN),
    ("連体詞", "形状詞-助動詞語幹"): (None, NOUN),
    ("動詞-一般", "助詞-副助詞"): (None, PART),
    ("動詞-非自立可能", "助詞-副助詞"): (None, PART),
    ("助動詞", "助詞-副助詞"): (None, PART),
}
|
|
@ -1,79 +1,68 @@
|
|||
from ...symbols import POS, PUNCT, INTJ, X, ADJ, AUX, ADP, PART, SCONJ, NOUN
|
||||
from ...symbols import SYM, PRON, VERB, ADV, PROPN, NUM, DET, SPACE
|
||||
from ...symbols import POS, PUNCT, INTJ, ADJ, AUX, ADP, PART, SCONJ, NOUN
|
||||
from ...symbols import SYM, PRON, VERB, ADV, PROPN, NUM, DET, SPACE, CCONJ
|
||||
|
||||
|
||||
TAG_MAP = {
|
||||
# Explanation of Unidic tags:
|
||||
# https://www.gavo.t.u-tokyo.ac.jp/~mine/japanese/nlp+slp/UNIDIC_manual.pdf
|
||||
# Universal Dependencies Mapping:
|
||||
# Universal Dependencies Mapping: (Some of the entries in this mapping are updated to v2.6 in the list below)
|
||||
# http://universaldependencies.org/ja/overview/morphology.html
|
||||
# http://universaldependencies.org/ja/pos/all.html
|
||||
"記号,一般,*,*": {
|
||||
POS: PUNCT
|
||||
}, # this includes characters used to represent sounds like ドレミ
|
||||
"記号,文字,*,*": {
|
||||
POS: PUNCT
|
||||
}, # this is for Greek and Latin characters used as sumbols, as in math
|
||||
"感動詞,フィラー,*,*": {POS: INTJ},
|
||||
"感動詞,一般,*,*": {POS: INTJ},
|
||||
# this is specifically for unicode full-width space
|
||||
"空白,*,*,*": {POS: X},
|
||||
# This is used when sequential half-width spaces are present
|
||||
"記号-一般": {POS: NOUN}, # this includes characters used to represent sounds like ドレミ
|
||||
"記号-文字": {
|
||||
POS: NOUN
|
||||
}, # this is for Greek and Latin characters having some meanings, or used as symbols, as in math
|
||||
"感動詞-フィラー": {POS: INTJ},
|
||||
"感動詞-一般": {POS: INTJ},
|
||||
"空白": {POS: SPACE},
|
||||
"形状詞,一般,*,*": {POS: ADJ},
|
||||
"形状詞,タリ,*,*": {POS: ADJ},
|
||||
"形状詞,助動詞語幹,*,*": {POS: ADJ},
|
||||
"形容詞,一般,*,*": {POS: ADJ},
|
||||
"形容詞,非自立可能,*,*": {POS: AUX}, # XXX ADJ if alone, AUX otherwise
|
||||
"助詞,格助詞,*,*": {POS: ADP},
|
||||
"助詞,係助詞,*,*": {POS: ADP},
|
||||
"助詞,終助詞,*,*": {POS: PART},
|
||||
"助詞,準体助詞,*,*": {POS: SCONJ}, # の as in 走るのが速い
|
||||
"助詞,接続助詞,*,*": {POS: SCONJ}, # verb ending て
|
||||
"助詞,副助詞,*,*": {POS: PART}, # ばかり, つつ after a verb
|
||||
"助動詞,*,*,*": {POS: AUX},
|
||||
"接続詞,*,*,*": {POS: SCONJ}, # XXX: might need refinement
|
||||
"接頭辞,*,*,*": {POS: NOUN},
|
||||
"接尾辞,形状詞的,*,*": {POS: ADJ}, # がち, チック
|
||||
"接尾辞,形容詞的,*,*": {POS: ADJ}, # -らしい
|
||||
"接尾辞,動詞的,*,*": {POS: NOUN}, # -じみ
|
||||
"接尾辞,名詞的,サ変可能,*": {POS: NOUN}, # XXX see 名詞,普通名詞,サ変可能,*
|
||||
"接尾辞,名詞的,一般,*": {POS: NOUN},
|
||||
"接尾辞,名詞的,助数詞,*": {POS: NOUN},
|
||||
"接尾辞,名詞的,副詞可能,*": {POS: NOUN}, # -後, -過ぎ
|
||||
"代名詞,*,*,*": {POS: PRON},
|
||||
"動詞,一般,*,*": {POS: VERB},
|
||||
"動詞,非自立可能,*,*": {POS: VERB}, # XXX VERB if alone, AUX otherwise
|
||||
"動詞,非自立可能,*,*,AUX": {POS: AUX},
|
||||
"動詞,非自立可能,*,*,VERB": {POS: VERB},
|
||||
"副詞,*,*,*": {POS: ADV},
|
||||
"補助記号,AA,一般,*": {POS: SYM}, # text art
|
||||
"補助記号,AA,顔文字,*": {POS: SYM}, # kaomoji
|
||||
"補助記号,一般,*,*": {POS: SYM},
|
||||
"補助記号,括弧開,*,*": {POS: PUNCT}, # open bracket
|
||||
"補助記号,括弧閉,*,*": {POS: PUNCT}, # close bracket
|
||||
"補助記号,句点,*,*": {POS: PUNCT}, # period or other EOS marker
|
||||
"補助記号,読点,*,*": {POS: PUNCT}, # comma
|
||||
"名詞,固有名詞,一般,*": {POS: PROPN}, # general proper noun
|
||||
"名詞,固有名詞,人名,一般": {POS: PROPN}, # person's name
|
||||
"名詞,固有名詞,人名,姓": {POS: PROPN}, # surname
|
||||
"名詞,固有名詞,人名,名": {POS: PROPN}, # first name
|
||||
"名詞,固有名詞,地名,一般": {POS: PROPN}, # place name
|
||||
"名詞,固有名詞,地名,国": {POS: PROPN}, # country name
|
||||
"名詞,助動詞語幹,*,*": {POS: AUX},
|
||||
"名詞,数詞,*,*": {POS: NUM}, # includes Chinese numerals
|
||||
"名詞,普通名詞,サ変可能,*": {POS: NOUN}, # XXX: sometimes VERB in UDv2; suru-verb noun
|
||||
"名詞,普通名詞,サ変可能,*,NOUN": {POS: NOUN},
|
||||
"名詞,普通名詞,サ変可能,*,VERB": {POS: VERB},
|
||||
"名詞,普通名詞,サ変形状詞可能,*": {POS: NOUN}, # ex: 下手
|
||||
"名詞,普通名詞,一般,*": {POS: NOUN},
|
||||
"名詞,普通名詞,形状詞可能,*": {POS: NOUN}, # XXX: sometimes ADJ in UDv2
|
||||
"名詞,普通名詞,形状詞可能,*,NOUN": {POS: NOUN},
|
||||
"名詞,普通名詞,形状詞可能,*,ADJ": {POS: ADJ},
|
||||
"名詞,普通名詞,助数詞可能,*": {POS: NOUN}, # counter / unit
|
||||
"名詞,普通名詞,副詞可能,*": {POS: NOUN},
|
||||
"連体詞,*,*,*": {POS: ADJ}, # XXX this has exceptions based on literal token
|
||||
"連体詞,*,*,*,ADJ": {POS: ADJ},
|
||||
"連体詞,*,*,*,PRON": {POS: PRON},
|
||||
"連体詞,*,*,*,DET": {POS: DET},
|
||||
"形状詞-一般": {POS: ADJ},
|
||||
"形状詞-タリ": {POS: ADJ},
|
||||
"形状詞-助動詞語幹": {POS: AUX},
|
||||
"形容詞-一般": {POS: ADJ},
|
||||
"形容詞-非自立可能": {POS: ADJ}, # XXX ADJ if alone, AUX otherwise
|
||||
"助詞-格助詞": {POS: ADP},
|
||||
"助詞-係助詞": {POS: ADP},
|
||||
"助詞-終助詞": {POS: PART},
|
||||
"助詞-準体助詞": {POS: SCONJ}, # の as in 走るのが速い
|
||||
"助詞-接続助詞": {POS: SCONJ}, # verb ending て0
|
||||
"助詞-副助詞": {POS: ADP}, # ばかり, つつ after a verb
|
||||
"助動詞": {POS: AUX},
|
||||
"接続詞": {POS: CCONJ}, # XXX: might need refinement
|
||||
"接頭辞": {POS: NOUN},
|
||||
"接尾辞-形状詞的": {POS: PART}, # がち, チック
|
||||
"接尾辞-形容詞的": {POS: AUX}, # -らしい
|
||||
"接尾辞-動詞的": {POS: PART}, # -じみ
|
||||
"接尾辞-名詞的-サ変可能": {POS: NOUN}, # XXX see 名詞,普通名詞,サ変可能,*
|
||||
"接尾辞-名詞的-一般": {POS: NOUN},
|
||||
"接尾辞-名詞的-助数詞": {POS: NOUN},
|
||||
"接尾辞-名詞的-副詞可能": {POS: NOUN}, # -後, -過ぎ
|
||||
"代名詞": {POS: PRON},
|
||||
"動詞-一般": {POS: VERB},
|
||||
"動詞-非自立可能": {POS: AUX}, # XXX VERB if alone, AUX otherwise
|
||||
"副詞": {POS: ADV},
|
||||
"補助記号-AA-一般": {POS: SYM}, # text art
|
||||
"補助記号-AA-顔文字": {POS: PUNCT}, # kaomoji
|
||||
"補助記号-一般": {POS: SYM},
|
||||
"補助記号-括弧開": {POS: PUNCT}, # open bracket
|
||||
"補助記号-括弧閉": {POS: PUNCT}, # close bracket
|
||||
"補助記号-句点": {POS: PUNCT}, # period or other EOS marker
|
||||
"補助記号-読点": {POS: PUNCT}, # comma
|
||||
"名詞-固有名詞-一般": {POS: PROPN}, # general proper noun
|
||||
"名詞-固有名詞-人名-一般": {POS: PROPN}, # person's name
|
||||
"名詞-固有名詞-人名-姓": {POS: PROPN}, # surname
|
||||
"名詞-固有名詞-人名-名": {POS: PROPN}, # first name
|
||||
"名詞-固有名詞-地名-一般": {POS: PROPN}, # place name
|
||||
"名詞-固有名詞-地名-国": {POS: PROPN}, # country name
|
||||
"名詞-助動詞語幹": {POS: AUX},
|
||||
"名詞-数詞": {POS: NUM}, # includes Chinese numerals
|
||||
"名詞-普通名詞-サ変可能": {POS: NOUN}, # XXX: sometimes VERB in UDv2; suru-verb noun
|
||||
"名詞-普通名詞-サ変形状詞可能": {POS: NOUN},
|
||||
"名詞-普通名詞-一般": {POS: NOUN},
|
||||
"名詞-普通名詞-形状詞可能": {POS: NOUN}, # XXX: sometimes ADJ in UDv2
|
||||
"名詞-普通名詞-助数詞可能": {POS: NOUN}, # counter / unit
|
||||
"名詞-普通名詞-副詞可能": {POS: NOUN},
|
||||
"連体詞": {POS: DET}, # XXX this has exceptions based on literal token
|
||||
# GSD tags. These aren't in Unidic, but we need them for the GSD data.
|
||||
"外国語": {POS: PROPN}, # Foreign words
|
||||
"絵文字・記号等": {POS: SYM}, # emoji / kaomoji ^^;
|
||||
}
|
||||
|
|
spacy/lang/ja/tag_orth_map.py (new file, 22 lines)
@@ -0,0 +1,22 @@
from ...symbols import DET, PART, PRON, SPACE, X

# mapping from (tag, orth) to the POS of the token
TAG_ORTH_MAP = {
    "空白": {" ": SPACE, "　": X},
    "助詞-副助詞": {"たり": PART},
    "連体詞": {
        "あの": DET,
        "かの": DET,
        "この": DET,
        "その": DET,
        "どの": DET,
        "彼の": DET,
        "此の": DET,
        "其の": DET,
        "ある": PRON,
        "こんな": PRON,
        "そんな": PRON,
        "どんな": PRON,
        "あらゆる": PRON,
    },
}
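Together with `TAG_BIGRAM_MAP`, this table drives `resolve_pos` in `spacy/lang/ja/__init__.py`: the orth-based lookup is tried first, then the tag bi-gram, then the plain `TAG_MAP` entry. A condensed sketch of that lookup order, not the actual function; the tag and orth values are illustrative:

```python
from spacy.lang.ja.tag_orth_map import TAG_ORTH_MAP
from spacy.lang.ja.tag_bigram_map import TAG_BIGRAM_MAP
from spacy.lang.ja.tag_map import TAG_MAP
from spacy.symbols import POS

def lookup_pos(orth, tag, next_tag=None):
    # 1) orth-specific override, e.g. ("連体詞", "この") -> DET
    if tag in TAG_ORTH_MAP and orth in TAG_ORTH_MAP[tag]:
        return TAG_ORTH_MAP[tag][orth]
    # 2) tag bi-gram override for the current token
    if next_tag and (tag, next_tag) in TAG_BIGRAM_MAP:
        pos = TAG_BIGRAM_MAP[(tag, next_tag)][0]
        if pos is not None:
            return pos
    # 3) fall back to the static tag map
    return TAG_MAP[tag][POS]
```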
|
|
@ -1,7 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .stop_words import STOP_WORDS
|
||||
|
||||
from ...language import Language
|
||||
|
|
|
@ -1,7 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
|
|
Some files were not shown because too many files have changed in this diff.