Mirror of https://github.com/explosion/spaCy.git
Synced 2025-08-04 20:30:24 +03:00

Merge remote-tracking branch 'upstream/master' into refactor/parser-gpu

This commit is contained in: commit 4f9c54001b
.github/contributors/Pantalaymon.md (vendored, new file, 106 lines)

@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                   |
| ------------------------------ | ----------------------- |
| Name                           | Valentin-Gabriel Soumah |
| Company name (if applicable)   |                         |
| Title or role (if applicable)  |                         |
| Date                           | 2021-11-23              |
| GitHub username                | Pantalaymon             |
| Website (optional)             |                         |
.github/contributors/avi197.md (vendored, new file, 106 lines)

@@ -0,0 +1,106 @@

Same spaCy contributor agreement text as in `.github/contributors/Pantalaymon.md` above, with the individual statement checked.

## Contributor Details

| Field                          | Entry      |
| ------------------------------ | ---------- |
| Name                           | Son Pham   |
| Company name (if applicable)   |            |
| Title or role (if applicable)  |            |
| Date                           | 09/10/2021 |
| GitHub username                | Avi197     |
| Website (optional)             |            |
.github/contributors/fgaim.md (vendored, new file, 106 lines)

@@ -0,0 +1,106 @@

Same spaCy contributor agreement text as in `.github/contributors/Pantalaymon.md` above, with the individual statement checked.

## Contributor Details

| Field                          | Entry       |
| ------------------------------ | ----------- |
| Name                           | Fitsum Gaim |
| Company name (if applicable)   |             |
| Title or role (if applicable)  |             |
| Date                           | 2021-08-07  |
| GitHub username                | fgaim       |
| Website (optional)             |             |
.github/contributors/syrull.md (vendored, new file, 106 lines)

@@ -0,0 +1,106 @@

Same spaCy contributor agreement text as in `.github/contributors/Pantalaymon.md` above, with the individual statement checked.

## Contributor Details

| Field                          | Entry         |
| ------------------------------ | ------------- |
| Name                           | Dimitar Ganev |
| Company name (if applicable)   |               |
| Title or role (if applicable)  |               |
| Date                           | 2021/8/2      |
| GitHub username                | syrull        |
| Website (optional)             |               |
.gitignore (vendored, 1 line added)

@@ -9,6 +9,7 @@ keys/
 spacy/tests/package/setup.cfg
 spacy/tests/package/pyproject.toml
 spacy/tests/package/requirements.txt
+spacy/tests/universe/universe.json

 # Website
 website/.cache/
@@ -143,15 +143,25 @@ Changes to `.py` files will be effective immediately.

 ### Fixing bugs

 When fixing a bug, first create an
-[issue](https://github.com/explosion/spaCy/issues) if one does not already exist.
-The description text can be very short – we don't want to make this too
+[issue](https://github.com/explosion/spaCy/issues) if one does not already
+exist. The description text can be very short – we don't want to make this too
 bureaucratic.

-Next, create a test file named `test_issue[ISSUE NUMBER].py` in the
-[`spacy/tests/regression`](spacy/tests/regression) folder. Test for the bug
-you're fixing, and make sure the test fails. Next, add and commit your test file
-referencing the issue number in the commit message. Finally, fix the bug, make
-sure your test passes and reference the issue in your commit message.
+Next, add a test to the relevant file in the
+[`spacy/tests`](spacy/tests) folder. Then add a [pytest
+mark](https://docs.pytest.org/en/6.2.x/example/markers.html#working-with-custom-markers),
+`@pytest.mark.issue(NUMBER)`, to reference the issue number.
+
+```python
+# Assume you're fixing Issue #1234
+@pytest.mark.issue(1234)
+def test_issue1234():
+    ...
+```
+
+Test for the bug you're fixing, and make sure the test fails. Next, add and
+commit your test file. Finally, fix the bug, make sure your test passes and
+reference the issue number in your pull request description.

 📖 **For more information on how to add tests, check out the [tests README](spacy/tests/README.md).**
LICENSE (2 lines changed)

@@ -1,6 +1,6 @@
 The MIT License (MIT)

-Copyright (C) 2016-2021 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
+Copyright (C) 2016-2022 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
@@ -1,11 +1,8 @@
-recursive-include include *.h
 recursive-include spacy *.pyi *.pyx *.pxd *.txt *.cfg *.jinja *.toml
 include LICENSE
 include README.md
 include pyproject.toml
 include spacy/py.typed
-recursive-exclude spacy/lang *.json
-recursive-include spacy/lang *.json.gz
-recursive-include spacy/cli *.json *.yml
+recursive-include spacy/cli *.yml
 recursive-include licenses *
 recursive-exclude spacy *.cpp
@@ -16,7 +16,7 @@ production-ready [**training system**](https://spacy.io/usage/training) and easy
 model packaging, deployment and workflow management. spaCy is commercial
 open-source software, released under the MIT license.

-💫 **Version 3.0 out now!**
+💫 **Version 3.2 out now!**
 [Check out the release notes here.](https://github.com/explosion/spaCy/releases)

 [](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
@@ -23,7 +23,7 @@ jobs:
   # defined in .flake8 and overwrites the selected codes.
   - job: "Validate"
     pool:
-      vmImage: "ubuntu-18.04"
+      vmImage: "ubuntu-latest"
     steps:
       - task: UsePythonVersion@0
         inputs:
@@ -39,49 +39,49 @@ jobs:
       matrix:
         # We're only running one platform per Python version to speed up builds
         Python36Linux:
-          imageName: "ubuntu-18.04"
+          imageName: "ubuntu-latest"
           python.version: "3.6"
         # Python36Windows:
-        # imageName: "windows-2019"
+        # imageName: "windows-latest"
         # python.version: "3.6"
         # Python36Mac:
-        # imageName: "macos-10.14"
+        # imageName: "macos-latest"
         # python.version: "3.6"
         # Python37Linux:
-        # imageName: "ubuntu-18.04"
+        # imageName: "ubuntu-latest"
         # python.version: "3.7"
         Python37Windows:
-          imageName: "windows-2019"
+          imageName: "windows-latest"
           python.version: "3.7"
         # Python37Mac:
-        # imageName: "macos-10.14"
+        # imageName: "macos-latest"
         # python.version: "3.7"
         # Python38Linux:
-        # imageName: "ubuntu-18.04"
+        # imageName: "ubuntu-latest"
         # python.version: "3.8"
         # Python38Windows:
-        # imageName: "windows-2019"
+        # imageName: "windows-latest"
         # python.version: "3.8"
         Python38Mac:
-          imageName: "macos-10.14"
+          imageName: "macos-latest"
           python.version: "3.8"
         Python39Linux:
-          imageName: "ubuntu-18.04"
+          imageName: "ubuntu-latest"
           python.version: "3.9"
         # Python39Windows:
-        # imageName: "windows-2019"
+        # imageName: "windows-latest"
         # python.version: "3.9"
         # Python39Mac:
-        # imageName: "macos-10.14"
+        # imageName: "macos-latest"
         # python.version: "3.9"
         Python310Linux:
-          imageName: "ubuntu-20.04"
+          imageName: "ubuntu-latest"
           python.version: "3.10"
         Python310Windows:
-          imageName: "windows-2019"
+          imageName: "windows-latest"
           python.version: "3.10"
         Python310Mac:
-          imageName: "macos-10.15"
+          imageName: "macos-latest"
           python.version: "3.10"
       maxParallel: 4
     pool:
@@ -444,7 +444,7 @@ spaCy uses the [`pytest`](http://doc.pytest.org/) framework for testing. Tests f

 When adding tests, make sure to use descriptive names and only test for one behavior at a time. Tests should be grouped into modules dedicated to the same type of functionality and some test modules are organized as directories of test files related to the same larger area of the library, e.g. `matcher` or `tokenizer`.

-Regression tests are tests that refer to bugs reported in specific issues. They should live in the `regression` module and are named according to the issue number (e.g. `test_issue1234.py`). This system allows us to relate tests for specific bugs back to the original reported issue, which is especially useful if we introduce a regression and a previously passing regression tests suddenly fails again. When fixing a bug, it's often useful to create a regression test for it first. Every once in a while, we go through the `regression` module and group tests together into larger files by issue number, in groups of 500 to 1000 numbers. This prevents us from ending up with too many individual files over time.
+Regression tests are tests that refer to bugs reported in specific issues. They should live in the relevant module of the test suite, named according to the issue number (e.g., `test_issue1234.py`), and [marked](https://docs.pytest.org/en/6.2.x/example/markers.html#working-with-custom-markers) appropriately (e.g. `@pytest.mark.issue(1234)`). This system allows us to relate tests for specific bugs back to the original reported issue, which is especially useful if we introduce a regression and a previously passing regression test suddenly fails again. When fixing a bug, it's often useful to create a regression test for it first.

 The test suite also provides [fixtures](https://github.com/explosion/spaCy/blob/master/spacy/tests/conftest.py) for different language tokenizers that can be used as function arguments of the same name and will be passed in automatically. Those should only be used for tests related to those specific languages. We also have [test utility functions](https://github.com/explosion/spaCy/blob/master/spacy/tests/util.py) for common operations, like creating a temporary file.
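As a concrete sketch of the convention described above (for illustration only: the `en_tokenizer` fixture name comes from the test suite's conftest, and the issue number and assertion are made up), a marked regression test could look like this:

```python
import pytest


@pytest.mark.issue(1234)
def test_issue1234(en_tokenizer):
    # Hypothetical regression check: tokenizing a hyphenated word should
    # preserve all characters of the original string.
    doc = en_tokenizer("well-known")
    assert "".join(token.text for token in doc) == "well-known"
```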
@@ -1,5 +1,6 @@
 # Our libraries
 spacy-legacy>=3.0.8,<3.1.0
+spacy-loggers>=1.0.0,<2.0.0
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
 thinc>=8.0.12,<8.1.0
@@ -17,6 +18,7 @@ requests>=2.13.0,<3.0.0
 tqdm>=4.38.0,<5.0.0
 pydantic>=1.7.4,!=1.8,!=1.8.1,<1.9.0
 jinja2
+langcodes>=3.2.0,<4.0.0
 # Official Python utilities
 setuptools
 packaging>=20.0
@@ -29,7 +31,7 @@ pytest-timeout>=1.3.0,<2.0.0
 mock>=2.0.0,<3.0.0
 flake8>=3.8.0,<3.10.0
 hypothesis>=3.27.0,<7.0.0
-mypy>=0.910
+mypy==0.910
 types-dataclasses>=0.1.3; python_version < "3.7"
 types-mock>=0.1.1
 types-requests
setup.cfg (38 lines changed)

@@ -42,6 +42,7 @@ setup_requires =
 install_requires =
     # Our libraries
     spacy-legacy>=3.0.8,<3.1.0
+    spacy-loggers>=1.0.0,<2.0.0
     murmurhash>=0.28.0,<1.1.0
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
@@ -62,6 +63,7 @@ install_requires =
     setuptools
     packaging>=20.0
     typing_extensions>=3.7.4,<4.0.0.0; python_version < "3.8"
+    langcodes>=3.2.0,<4.0.0

 [options.entry_points]
 console_scripts =
@@ -69,43 +71,45 @@ console_scripts =

 [options.extras_require]
 lookups =
-    spacy_lookups_data>=1.0.2,<1.1.0
+    spacy_lookups_data>=1.0.3,<1.1.0
 transformers =
-    spacy_transformers>=1.0.1,<1.2.0
+    spacy_transformers>=1.1.2,<1.2.0
 ray =
     spacy_ray>=0.1.0,<1.0.0
 cuda =
-    cupy>=5.0.0b4,<10.0.0
+    cupy>=5.0.0b4,<11.0.0
 cuda80 =
-    cupy-cuda80>=5.0.0b4,<10.0.0
+    cupy-cuda80>=5.0.0b4,<11.0.0
 cuda90 =
-    cupy-cuda90>=5.0.0b4,<10.0.0
+    cupy-cuda90>=5.0.0b4,<11.0.0
 cuda91 =
-    cupy-cuda91>=5.0.0b4,<10.0.0
+    cupy-cuda91>=5.0.0b4,<11.0.0
 cuda92 =
-    cupy-cuda92>=5.0.0b4,<10.0.0
+    cupy-cuda92>=5.0.0b4,<11.0.0
 cuda100 =
-    cupy-cuda100>=5.0.0b4,<10.0.0
+    cupy-cuda100>=5.0.0b4,<11.0.0
 cuda101 =
-    cupy-cuda101>=5.0.0b4,<10.0.0
+    cupy-cuda101>=5.0.0b4,<11.0.0
 cuda102 =
-    cupy-cuda102>=5.0.0b4,<10.0.0
+    cupy-cuda102>=5.0.0b4,<11.0.0
 cuda110 =
-    cupy-cuda110>=5.0.0b4,<10.0.0
+    cupy-cuda110>=5.0.0b4,<11.0.0
 cuda111 =
-    cupy-cuda111>=5.0.0b4,<10.0.0
+    cupy-cuda111>=5.0.0b4,<11.0.0
 cuda112 =
-    cupy-cuda112>=5.0.0b4,<10.0.0
+    cupy-cuda112>=5.0.0b4,<11.0.0
 cuda113 =
-    cupy-cuda113>=5.0.0b4,<10.0.0
+    cupy-cuda113>=5.0.0b4,<11.0.0
 cuda114 =
-    cupy-cuda114>=5.0.0b4,<10.0.0
+    cupy-cuda114>=5.0.0b4,<11.0.0
+cuda115 =
+    cupy-cuda115>=5.0.0b4,<11.0.0
 apple =
     thinc-apple-ops>=0.0.4,<1.0.0
 # Language tokenizers with external dependencies
 ja =
-    sudachipy>=0.4.9
-    sudachidict_core>=20200330
+    sudachipy>=0.5.2,!=0.6.1
+    sudachidict_core>=20211220
 ko =
     natto-py==0.9.0
 th =
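A small aside on the extras naming used above (illustrative only, not part of this commit): the `cudaXYZ` extras encode the CUDA version with the dot removed, so the matching extra name for a given CUDA version can be derived mechanically:

```python
def cuda_extra_name(cuda_version: str) -> str:
    """Map a CUDA version string to the extra name listed above,
    e.g. "11.5" -> "cuda115" (which installs cupy-cuda115)."""
    major, minor = cuda_version.split(".")[:2]
    return f"cuda{major}{minor}"


assert cuda_extra_name("11.5") == "cuda115"
assert cuda_extra_name("10.2") == "cuda102"
assert cuda_extra_name("8.0") == "cuda80"
```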
setup.py (1 line added)

@@ -78,6 +78,7 @@ COPY_FILES = {
     ROOT / "setup.cfg": PACKAGE_ROOT / "tests" / "package",
     ROOT / "pyproject.toml": PACKAGE_ROOT / "tests" / "package",
     ROOT / "requirements.txt": PACKAGE_ROOT / "tests" / "package",
+    ROOT / "website" / "meta" / "universe.json": PACKAGE_ROOT / "tests" / "universe",
 }
|
||||||
# fmt: off
|
# fmt: off
|
||||||
__title__ = "spacy"
|
__title__ = "spacy"
|
||||||
__version__ = "3.1.4"
|
__version__ = "3.2.1"
|
||||||
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
|
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
|
||||||
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
|
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
|
||||||
__projects__ = "https://github.com/explosion/projects"
|
__projects__ = "https://github.com/explosion/projects"
|
||||||
|
|
|
@@ -1,3 +1,6 @@
+from .errors import Errors
+
+IOB_STRINGS = ("", "I", "O", "B")
+
 IDS = {
     "": NULL_ATTR,
@@ -64,7 +67,6 @@ IDS = {
     "FLAG61": FLAG61,
     "FLAG62": FLAG62,
     "FLAG63": FLAG63,
-
     "ID": ID,
     "ORTH": ORTH,
     "LOWER": LOWER,
@@ -72,7 +74,6 @@ IDS = {
     "SHAPE": SHAPE,
     "PREFIX": PREFIX,
     "SUFFIX": SUFFIX,
-
     "LENGTH": LENGTH,
     "LEMMA": LEMMA,
     "POS": POS,
@@ -87,7 +88,7 @@ IDS = {
     "SPACY": SPACY,
     "LANG": LANG,
     "MORPH": MORPH,
-    "IDX": IDX
+    "IDX": IDX,
 }

@@ -109,28 +110,66 @@ def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
     """
     inty_attrs = {}
     if _do_deprecated:
-        if 'F' in stringy_attrs:
+        if "F" in stringy_attrs:
             stringy_attrs["ORTH"] = stringy_attrs.pop("F")
-        if 'L' in stringy_attrs:
+        if "L" in stringy_attrs:
             stringy_attrs["LEMMA"] = stringy_attrs.pop("L")
-        if 'pos' in stringy_attrs:
+        if "pos" in stringy_attrs:
             stringy_attrs["TAG"] = stringy_attrs.pop("pos")
-        if 'morph' in stringy_attrs:
-            morphs = stringy_attrs.pop('morph')
-        if 'number' in stringy_attrs:
-            stringy_attrs.pop('number')
-        if 'tenspect' in stringy_attrs:
-            stringy_attrs.pop('tenspect')
+        if "morph" in stringy_attrs:
+            morphs = stringy_attrs.pop("morph")
+        if "number" in stringy_attrs:
+            stringy_attrs.pop("number")
+        if "tenspect" in stringy_attrs:
+            stringy_attrs.pop("tenspect")
         morph_keys = [
-            'PunctType', 'PunctSide', 'Other', 'Degree', 'AdvType', 'Number',
-            'VerbForm', 'PronType', 'Aspect', 'Tense', 'PartType', 'Poss',
-            'Hyph', 'ConjType', 'NumType', 'Foreign', 'VerbType', 'NounType',
-            'Gender', 'Mood', 'Negative', 'Tense', 'Voice', 'Abbr',
-            'Derivation', 'Echo', 'Foreign', 'NameType', 'NounType', 'NumForm',
-            'NumValue', 'PartType', 'Polite', 'StyleVariant',
-            'PronType', 'AdjType', 'Person', 'Variant', 'AdpType',
-            'Reflex', 'Negative', 'Mood', 'Aspect', 'Case',
-            'Polarity', 'PrepCase', 'Animacy'  # U20
+            "PunctType",
+            "PunctSide",
+            "Other",
+            "Degree",
+            "AdvType",
+            "Number",
+            "VerbForm",
+            "PronType",
+            "Aspect",
+            "Tense",
+            "PartType",
+            "Poss",
+            "Hyph",
+            "ConjType",
+            "NumType",
+            "Foreign",
+            "VerbType",
+            "NounType",
+            "Gender",
+            "Mood",
+            "Negative",
+            "Tense",
+            "Voice",
+            "Abbr",
+            "Derivation",
+            "Echo",
+            "Foreign",
+            "NameType",
+            "NounType",
+            "NumForm",
+            "NumValue",
+            "PartType",
+            "Polite",
+            "StyleVariant",
+            "PronType",
+            "AdjType",
+            "Person",
+            "Variant",
+            "AdpType",
+            "Reflex",
+            "Negative",
+            "Mood",
+            "Aspect",
+            "Case",
+            "Polarity",
+            "PrepCase",
+            "Animacy",  # U20
         ]
         for key in morph_keys:
             if key in stringy_attrs:
@@ -142,8 +181,13 @@ def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
     for name, value in stringy_attrs.items():
         int_key = intify_attr(name)
         if int_key is not None:
-            if strings_map is not None and isinstance(value, basestring):
-                if hasattr(strings_map, 'add'):
+            if int_key == ENT_IOB:
+                if value in IOB_STRINGS:
+                    value = IOB_STRINGS.index(value)
+                elif isinstance(value, str):
+                    raise ValueError(Errors.E1025.format(value=value))
+            if strings_map is not None and isinstance(value, str):
+                if hasattr(strings_map, "add"):
                     value = strings_map.add(value)
                 else:
                     value = strings_map[value]
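A condensed, stand-alone sketch of what the new `ENT_IOB` branch does (stand-in names, not the actual Cython implementation): recognised IOB strings are converted to their integer index in `IOB_STRINGS`, and any other string is rejected.

```python
IOB_STRINGS = ("", "I", "O", "B")  # mirrors the tuple added above


def convert_ent_iob(value):
    # Recognised IOB strings map to their index, e.g. "B" -> 3, "O" -> 2.
    if value in IOB_STRINGS:
        return IOB_STRINGS.index(value)
    # Any other string is an error (the real code raises Errors.E1025).
    if isinstance(value, str):
        raise ValueError(f"Unexpected ENT_IOB value: {value!r}")
    return value  # already an integer


assert convert_ent_iob("B") == 3
assert convert_ent_iob("") == 0
assert convert_ent_iob(1) == 1
```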
@@ -25,7 +25,7 @@ def debug_config_cli(
     show_vars: bool = Opt(False, "--show-variables", "-V", help="Show an overview of all variables referenced in the config and their values. This will also reflect variables overwritten on the CLI.")
     # fmt: on
 ):
-    """Debug a config.cfg file and show validation errors. The command will
+    """Debug a config file and show validation errors. The command will
     create all objects in the tree and validate them. Note that some config
     validation errors are blocking and will prevent the rest of the config from
     being resolved. This means that you may not see all validation errors at
@@ -14,7 +14,7 @@ from ..training.initialize import get_sourced_components
 from ..schemas import ConfigSchemaTraining
 from ..pipeline._parser_internals import nonproj
 from ..pipeline._parser_internals.nonproj import DELIMITER
-from ..pipeline import Morphologizer
+from ..pipeline import Morphologizer, SpanCategorizer
 from ..morphology import Morphology
 from ..language import Language
 from ..util import registry, resolve_dot_names
@@ -203,6 +203,7 @@ def debug_data(
     has_low_data_warning = False
     has_no_neg_warning = False
     has_ws_ents_error = False
+    has_boundary_cross_ents_warning = False

     msg.divider("Named Entity Recognition")
     msg.info(f"{len(model_labels)} label(s)")
@@ -242,12 +243,20 @@
             msg.warn(f"No examples for texts WITHOUT new label '{label}'")
             has_no_neg_warning = True

+    if gold_train_data["boundary_cross_ents"]:
+        msg.warn(
+            f"{gold_train_data['boundary_cross_ents']} entity span(s) crossing sentence boundaries"
+        )
+        has_boundary_cross_ents_warning = True
+
     if not has_low_data_warning:
         msg.good("Good amount of examples for all labels")
     if not has_no_neg_warning:
         msg.good("Examples without occurrences available for all labels")
     if not has_ws_ents_error:
         msg.good("No entities consisting of or starting/ending with whitespace")
+    if not has_boundary_cross_ents_warning:
+        msg.good("No entities crossing sentence boundaries")

     if has_low_data_warning:
         msg.text(
@@ -565,6 +574,7 @@ def _compile_gold(
         "words": Counter(),
         "roots": Counter(),
         "ws_ents": 0,
+        "boundary_cross_ents": 0,
         "n_words": 0,
         "n_misaligned_words": 0,
         "words_missing_vectors": Counter(),
@@ -602,6 +612,8 @@ def _compile_gold(
                 if label.startswith(("B-", "U-")):
                     combined_label = label.split("-")[1]
                     data["ner"][combined_label] += 1
+                if gold[i].is_sent_start and label.startswith(("I-", "L-")):
+                    data["boundary_cross_ents"] += 1
                 elif label == "-":
                     data["ner"]["-"] += 1
         if "textcat" in factory_names or "textcat_multilabel" in factory_names:
@@ -687,8 +699,34 @@ def _get_examples_without_label(data: Sequence[Example], label: str) -> int:
     return count


-def _get_labels_from_model(nlp: Language, pipe_name: str) -> Set[str]:
-    if pipe_name not in nlp.pipe_names:
-        return set()
-    pipe = nlp.get_pipe(pipe_name)
-    return set(pipe.labels)
+def _get_labels_from_model(
+    nlp: Language, factory_name: str
+) -> Set[str]:
+    pipe_names = [
+        pipe_name
+        for pipe_name in nlp.pipe_names
+        if nlp.get_pipe_meta(pipe_name).factory == factory_name
+    ]
+    labels: Set[str] = set()
+    for pipe_name in pipe_names:
+        pipe = nlp.get_pipe(pipe_name)
+        labels.update(pipe.labels)
+    return labels
+
+
+def _get_labels_from_spancat(
+    nlp: Language
+) -> Dict[str, Set[str]]:
+    pipe_names = [
+        pipe_name
+        for pipe_name in nlp.pipe_names
+        if nlp.get_pipe_meta(pipe_name).factory == "spancat"
+    ]
+    labels: Dict[str, Set[str]] = {}
+    for pipe_name in pipe_names:
+        pipe = nlp.get_pipe(pipe_name)
+        assert isinstance(pipe, SpanCategorizer)
+        if pipe.key not in labels:
+            labels[pipe.key] = set()
+        labels[pipe.key].update(pipe.labels)
+    return labels
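The sentence-boundary check added to `_compile_gold` above can be sketched in isolation as follows (function and variable names are illustrative, not spaCy internals): a token that starts a new sentence but carries an `I-` or `L-` BILUO tag belongs to an entity span that crosses a sentence boundary.

```python
def count_boundary_cross_ents(biluo_tags, sent_starts):
    # Count tokens that open a sentence while sitting inside, or at the end
    # of, an entity span that started in the previous sentence.
    count = 0
    for tag, is_sent_start in zip(biluo_tags, sent_starts):
        if is_sent_start and tag.startswith(("I-", "L-")):
            count += 1
    return count


# "New York" is annotated as a single entity, but a sentence break was
# predicted between "New" and "York".
tags = ["O", "B-GPE", "L-GPE", "O"]
starts = [True, False, True, False]
assert count_boundary_cross_ents(tags, starts) == 1
```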
@@ -27,7 +27,7 @@ class Optimizations(str, Enum):
 @init_cli.command("config")
 def init_config_cli(
     # fmt: off
-    output_file: Path = Arg(..., help="File to save config.cfg to or - for stdout (will only output config and no additional logging info)", allow_dash=True),
+    output_file: Path = Arg(..., help="File to save the config to or - for stdout (will only output config and no additional logging info)", allow_dash=True),
     lang: str = Opt("en", "--lang", "-l", help="Two-letter code of the language to use"),
     pipeline: str = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of trainable pipeline components to include (without 'tok2vec' or 'transformer')"),
     optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."),
@@ -37,7 +37,7 @@ def init_config_cli(
     # fmt: on
 ):
     """
-    Generate a starter config.cfg for training. Based on your requirements
+    Generate a starter config file for training. Based on your requirements
     specified via the CLI arguments, this command generates a config with the
     optimal settings for your use case. This includes the choice of architecture,
     pretrained weights and related hyperparameters.
@@ -66,15 +66,15 @@ def init_config_cli(
 @init_cli.command("fill-config")
 def init_fill_config_cli(
     # fmt: off
-    base_path: Path = Arg(..., help="Base config to fill", exists=True, dir_okay=False),
-    output_file: Path = Arg("-", help="File to save config.cfg to (or - for stdout)", allow_dash=True),
+    base_path: Path = Arg(..., help="Path to base config to fill", exists=True, dir_okay=False),
+    output_file: Path = Arg("-", help="Path to output .cfg file (or - for stdout)", allow_dash=True),
     pretraining: bool = Opt(False, "--pretraining", "-pt", help="Include config for pretraining (with 'spacy pretrain')"),
     diff: bool = Opt(False, "--diff", "-D", help="Print a visual diff highlighting the changes"),
     code_path: Optional[Path] = Opt(None, "--code-path", "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
     # fmt: on
 ):
     """
-    Fill partial config.cfg with default values. Will add all missing settings
+    Fill partial config file with default values. Will add all missing settings
     from the default config and will create all objects, check the registered
     functions for their default values and update the base config. This command
     can be used with a config generated via the training quickstart widget:
@@ -20,6 +20,7 @@ def init_vectors_cli(
     output_dir: Path = Arg(..., help="Pipeline output directory"),
     prune: int = Opt(-1, "--prune", "-p", help="Optional number of vectors to prune to"),
     truncate: int = Opt(0, "--truncate", "-t", help="Optional number of vectors to truncate to when reading in vectors file"),
+    mode: str = Opt("default", "--mode", "-m", help="Vectors mode: default or floret"),
     name: Optional[str] = Opt(None, "--name", "-n", help="Optional name for the word vectors, e.g. en_core_web_lg.vectors"),
     verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"),
     jsonl_loc: Optional[Path] = Opt(None, "--lexemes-jsonl", "-j", help="Location of JSONL-formatted attributes file", hidden=True),

@@ -34,7 +35,14 @@ def init_vectors_cli(
     nlp = util.get_lang_class(lang)()
     if jsonl_loc is not None:
         update_lexemes(nlp, jsonl_loc)
-    convert_vectors(nlp, vectors_loc, truncate=truncate, prune=prune, name=name)
+    convert_vectors(
+        nlp,
+        vectors_loc,
+        truncate=truncate,
+        prune=prune,
+        name=name,
+        mode=mode,
+    )
     msg.good(f"Successfully converted {len(nlp.vocab.vectors)} vectors")
     nlp.to_disk(output_dir)
     msg.good(
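
For context, a minimal sketch of how the new vectors mode might be used once this change lands. The flag name comes from the option added above; the paths, the blank "en" pipeline and the import location are assumptions, not part of this diff.

# Hypothetical invocation of the updated CLI (paths are placeholders):
#   python -m spacy init vectors en ./my_vectors.floret ./output --mode floret
# Roughly equivalent programmatic call, mirroring the signature used above:
import spacy
from spacy.training.initialize import convert_vectors  # assumed import path

nlp = spacy.blank("en")
convert_vectors(nlp, "./my_vectors.floret", truncate=0, prune=-1, name=None, mode="floret")
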
@@ -4,6 +4,7 @@ from pathlib import Path
 from wasabi import Printer, MarkdownRenderer, get_raw_input
 from thinc.api import Config
 from collections import defaultdict
+from catalogue import RegistryError
 import srsly
 import sys
 

@@ -212,9 +213,18 @@ def get_third_party_dependencies(
         if "factory" in component:
             funcs["factories"].add(component["factory"])
     modules = set()
+    lang = config["nlp"]["lang"]
     for reg_name, func_names in funcs.items():
         for func_name in func_names:
-            func_info = util.registry.find(reg_name, func_name)
+            # Try the lang-specific version and fall back
+            try:
+                func_info = util.registry.find(reg_name, lang + "." + func_name)
+            except RegistryError:
+                try:
+                    func_info = util.registry.find(reg_name, func_name)
+                except RegistryError as regerr:
+                    # lang-specific version being absent is not actually an issue
+                    raise regerr from None
             module_name = func_info.get("module")  # type: ignore[attr-defined]
             if module_name:  # the code is part of a module, not a --code file
                 modules.add(func_info["module"].split(".")[0])  # type: ignore[index]
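
A minimal sketch of the lookup order introduced above, extracted from the CLI context; the function and registry names passed in are illustrative, the fallback logic mirrors the diff.

from catalogue import RegistryError
from spacy import util

def find_with_lang_fallback(reg_name: str, func_name: str, lang: str):
    # Prefer a language-specific entry such as "nl.lemmatizer_scorer",
    # then fall back to the generic name; re-raise if neither exists.
    try:
        return util.registry.find(reg_name, lang + "." + func_name)
    except RegistryError:
        return util.registry.find(reg_name, func_name)
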
@@ -397,7 +407,7 @@ def _format_label_scheme(data: Dict[str, Any]) -> str:
             continue
         col1 = md.bold(md.code(pipe))
         col2 = ", ".join(
-            [md.code(label.replace("|", "\\|")) for label in labels]
+            [md.code(str(label).replace("|", "\\|")) for label in labels]
         )  # noqa: W605
         label_data.append((col1, col2))
         n_labels += len(labels)
@@ -1,6 +1,7 @@
 from typing import Any, Dict, Optional
 from pathlib import Path
 from wasabi import msg
+import os
 import re
 import shutil
 import requests

@@ -129,10 +130,17 @@ def fetch_asset(
         the asset failed.
     """
     dest_path = (project_path / dest).resolve()
-    if dest_path.exists() and checksum:
+    if dest_path.exists():
         # If there's already a file, check for checksum
-        if checksum == get_checksum(dest_path):
-            msg.good(f"Skipping download with matching checksum: {dest}")
+        if checksum:
+            if checksum == get_checksum(dest_path):
+                msg.good(f"Skipping download with matching checksum: {dest}")
+                return
+        else:
+            # If there's not a checksum, make sure the file is a possibly valid size
+            if os.path.getsize(dest_path) == 0:
+                msg.warn(f"Asset exists but with size of 0 bytes, deleting: {dest}")
+                os.remove(dest_path)
     # We might as well support the user here and create parent directories in
     # case the asset dir isn't listed as a dir to create in the project.yml
     if not dest_path.parent.exists():
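
The same existence check, sketched as standalone logic under stated assumptions: get_checksum stands in for spaCy's helper of the same name, and the function name itself is made up; everything else is standard library.

import os
from pathlib import Path
from typing import Callable, Optional

def should_skip_download(dest_path: Path, checksum: Optional[str], get_checksum: Callable) -> bool:
    # Mirrors the branch above: a matching checksum skips the download,
    # while an existing zero-byte file without a checksum is deleted and re-fetched.
    if not dest_path.exists():
        return False
    if checksum:
        return checksum == get_checksum(dest_path)
    if os.path.getsize(dest_path) == 0:
        os.remove(dest_path)
    return False
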
@@ -16,8 +16,10 @@ gpu_allocator = null
 
 [nlp]
 lang = "{{ lang }}"
-{%- set no_tok2vec = components|length == 1 and (("textcat" in components or "textcat_multilabel" in components) and optimize == "efficiency")-%}
-{%- if not no_tok2vec and ("tagger" in components or "morphologizer" in components or "parser" in components or "ner" in components or "entity_linker" in components or "textcat" in components or "textcat_multilabel" in components) -%}
+{%- set has_textcat = ("textcat" in components or "textcat_multilabel" in components) -%}
+{%- set with_accuracy = optimize == "accuracy" -%}
+{%- set has_accurate_textcat = has_textcat and with_accuracy -%}
+{%- if ("tagger" in components or "morphologizer" in components or "parser" in components or "ner" in components or "entity_linker" in components or has_accurate_textcat) -%}
 {%- set full_pipeline = ["transformer" if use_transformer else "tok2vec"] + components %}
 {%- else -%}
 {%- set full_pipeline = components %}
@@ -197,7 +199,7 @@ no_output_layer = false
 
 {# NON-TRANSFORMER PIPELINE #}
 {% else -%}
-{% if not no_tok2vec-%}
+{% if "tok2vec" in full_pipeline -%}
 [components.tok2vec]
 factory = "tok2vec"
 
@@ -68,12 +68,14 @@ seed = ${system.seed}
 gpu_allocator = ${system.gpu_allocator}
 dropout = 0.1
 accumulate_gradient = 1
-# Controls early-stopping. 0 disables early stopping.
+# Controls early-stopping, i.e., the number of steps to continue without
+# improvement before stopping. 0 disables early stopping.
 patience = 1600
 # Number of epochs. 0 means unlimited. If >= 0, train corpus is loaded once in
 # memory and shuffled within the training loop. -1 means stream train corpus
 # rather than loading in memory with no shuffling within the training loop.
 max_epochs = 0
+# Maximum number of update steps to train for. 0 means an unlimited number of steps.
 max_steps = 20000
 eval_frequency = 200
 # Control how scores are printed and checkpoints are evaluated.
@@ -5,6 +5,7 @@ raw_text = null
 max_epochs = 1000
 dropout = 0.2
 n_save_every = null
+n_save_epoch = null
 component = "tok2vec"
 layer = ""
 corpus = "corpora.pretrain"
@@ -181,11 +181,19 @@ def parse_deps(orig_doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]:
 def parse_ents(doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]:
     """Generate named entities in [{start: i, end: i, label: 'label'}] format.
 
-    doc (Doc): Document do parse.
+    doc (Doc): Document to parse.
+    options (Dict[str, Any]): NER-specific visualisation options.
     RETURNS (dict): Generated entities keyed by text (original text) and ents.
     """
+    kb_url_template = options.get("kb_url_template", None)
     ents = [
-        {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
+        {
+            "start": ent.start_char,
+            "end": ent.end_char,
+            "label": ent.label_,
+            "kb_id": ent.kb_id_ if ent.kb_id_ else "",
+            "kb_url": kb_url_template.format(ent.kb_id_) if kb_url_template else "#",
+        }
         for ent in doc.ents
     ]
     if not ents:
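
A small usage sketch of the new option: kb_url_template is the option key read above, while the Wikidata URL, the entity span and the kb_id value are only example values, not part of this diff.

import spacy
from spacy import displacy

nlp = spacy.blank("en")
doc = nlp("Douglas Adams wrote books.")
doc.ents = [doc.char_span(0, 13, label="PERSON", kb_id="Q42")]
# Entities carrying a kb_id can be rendered with a link built from this template.
html = displacy.render(doc, style="ent", options={"kb_url_template": "https://www.wikidata.org/wiki/{}"})
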
@@ -18,7 +18,7 @@ DEFAULT_LABEL_COLORS = {
     "LOC": "#ff9561",
     "PERSON": "#aa9cfc",
     "NORP": "#c887fb",
-    "FACILITY": "#9cc9cc",
+    "FAC": "#9cc9cc",
     "EVENT": "#ffeb80",
     "LAW": "#ff8197",
     "LANGUAGE": "#ff8197",
@@ -1,18 +1,13 @@
 import warnings
 
 
-def add_codes(err_cls):
-    """Add error codes to string messages via class attribute names."""
-
-    class ErrorsWithCodes(err_cls):
-        def __getattribute__(self, code):
-            msg = super(ErrorsWithCodes, self).__getattribute__(code)
-            if code.startswith("__"):  # python system attributes like __class__
-                return msg
-            else:
-                return "[{code}] {msg}".format(code=code, msg=msg)
-
-    return ErrorsWithCodes()
+class ErrorsWithCodes(type):
+    def __getattribute__(self, code):
+        msg = super().__getattribute__(code)
+        if code.startswith("__"):  # python system attributes like __class__
+            return msg
+        else:
+            return "[{code}] {msg}".format(code=code, msg=msg)
 
 
 def setup_default_warnings():
@@ -27,6 +22,9 @@ def setup_default_warnings():
     # warn once about lemmatizer without required POS
     filter_warning("once", error_msg=Warnings.W108)
 
+    # floret vector table cannot be modified
+    filter_warning("once", error_msg="[W114]")
+
 
 def filter_warning(action: str, error_msg: str):
     """Customize how spaCy should handle a certain warning.
@@ -44,8 +42,7 @@ def _escape_warning_msg(msg):
 
 # fmt: off
 
-@add_codes
-class Warnings:
+class Warnings(metaclass=ErrorsWithCodes):
     W005 = ("Doc object not parsed. This means displaCy won't be able to "
             "generate a dependency visualization for it. Make sure the Doc "
             "was processed with a model that supports dependency parsing, and "
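
To illustrate what the metaclass swap above does to attribute access, a small self-contained sketch; Demo is a made-up class, not part of spaCy, and the metaclass body is copied from the diff.

class ErrorsWithCodes(type):
    def __getattribute__(self, code):
        msg = super().__getattribute__(code)
        if code.startswith("__"):  # python system attributes like __class__
            return msg
        else:
            return "[{code}] {msg}".format(code=code, msg=msg)

class Demo(metaclass=ErrorsWithCodes):
    E001 = "Something went wrong: {detail}"

# Class-level attribute access goes through the metaclass, so the code is prepended:
assert Demo.E001 == "[E001] Something went wrong: {detail}"
print(Demo.E001.format(detail="missing table"))
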
@@ -192,10 +189,12 @@ class Warnings:
             "vectors are not identical to current pipeline vectors.")
     W114 = ("Using multiprocessing with GPU models is not recommended and may "
             "lead to errors.")
+    W115 = ("Skipping {method}: the floret vector table cannot be modified. "
+            "Vectors are calculated from character ngrams.")
+    W116 = ("Unable to clean attribute '{attr}'.")
 
 
-@add_codes
-class Errors:
+class Errors(metaclass=ErrorsWithCodes):
     E001 = ("No component '{name}' found in pipeline. Available names: {opts}")
     E002 = ("Can't find factory for '{name}' for language {lang} ({lang_code}). "
             "This usually happens when spaCy calls `nlp.{method}` with a custom "
@@ -284,7 +283,7 @@ class Errors:
             "you forget to call the `set_extension` method?")
     E047 = ("Can't assign a value to unregistered extension attribute "
             "'{name}'. Did you forget to call the `set_extension` method?")
-    E048 = ("Can't import language {lang} from spacy.lang: {err}")
+    E048 = ("Can't import language {lang} or any matching language from spacy.lang: {err}")
     E050 = ("Can't find model '{name}'. It doesn't seem to be a Python "
             "package or a valid path to a data directory.")
     E052 = ("Can't find model directory: {path}")
@@ -518,13 +517,24 @@ class Errors:
     E199 = ("Unable to merge 0-length span at `doc[{start}:{end}]`.")
     E200 = ("Can't yet set {attr} from Span. Vote for this feature on the "
             "issue tracker: http://github.com/explosion/spaCy/issues")
-    E202 = ("Unsupported alignment mode '{mode}'. Supported modes: {modes}.")
+    E202 = ("Unsupported {name} mode '{mode}'. Supported modes: {modes}.")
 
     # New errors added in v3.x
-    E866 = ("A SpanGroup is not functional after the corresponding Doc has "
+    E858 = ("The {mode} vector table does not support this operation. "
+            "{alternative}")
+    E859 = ("The floret vector table cannot be modified.")
+    E860 = ("Can't truncate fasttext-bloom vectors.")
+    E861 = ("No 'keys' should be provided when initializing floret vectors "
+            "with 'minn' and 'maxn'.")
+    E862 = ("'hash_count' must be between 1-4 for floret vectors.")
+    E863 = ("'maxn' must be greater than or equal to 'minn'.")
+    E864 = ("The complete vector table 'data' is required to initialize floret "
+            "vectors.")
+    E865 = ("A SpanGroup is not functional after the corresponding Doc has "
             "been garbage collected. To keep using the spans, make sure that "
             "the corresponding Doc object is still available in the scope of "
             "your function.")
+    E866 = ("Expected a string or 'Doc' as input, but got: {type}.")
     E867 = ("The 'textcat' component requires at least two labels because it "
             "uses mutually exclusive classes where exactly one label is True "
             "for each doc. For binary classification tasks, you can use two "
@@ -632,7 +642,7 @@ class Errors:
     E912 = ("Failed to initialize lemmatizer. Missing lemmatizer table(s) found "
             "for mode '{mode}'. Required tables: {tables}. Found: {found}.")
     E913 = ("Corpus path can't be None. Maybe you forgot to define it in your "
-            "config.cfg or override it on the CLI?")
+            ".cfg file or override it on the CLI?")
     E914 = ("Executing {name} callback failed. Expected the function to "
             "return the nlp object but got: {value}. Maybe you forgot to return "
             "the modified object in your function?")

@@ -878,7 +888,13 @@ class Errors:
     E1021 = ("`pos` value \"{pp}\" is not a valid Universal Dependencies tag. "
              "Non-UD tags should use the `tag` property.")
     E1022 = ("Words must be of type str or int, but input is of type '{wtype}'")
+    E1023 = ("Couldn't read EntityRuler from the {path}. This file doesn't "
+             "exist.")
+    E1024 = ("A pattern with ID \"{ent_id}\" is not present in EntityRuler "
+             "patterns.")
+    E1025 = ("Cannot intify the value '{value}' as an IOB string. The only "
+             "supported values are: 'I', 'O', 'B' and ''")
 
 
 # Deprecated model shortcuts, only used in errors and warnings
 OLD_MODEL_SHORTCUTS = {
20
spacy/kb.pyx

@@ -124,7 +124,7 @@ cdef class KnowledgeBase:
     def get_alias_strings(self):
         return [self.vocab.strings[x] for x in self._alias_index]
 
-    def add_entity(self, unicode entity, float freq, vector[float] entity_vector):
+    def add_entity(self, str entity, float freq, vector[float] entity_vector):
         """
         Add an entity to the KB, optionally specifying its log probability based on corpus frequency
         Return the hash of the entity ID/name at the end.

@@ -185,15 +185,15 @@ cdef class KnowledgeBase:
 
             i += 1
 
-    def contains_entity(self, unicode entity):
+    def contains_entity(self, str entity):
         cdef hash_t entity_hash = self.vocab.strings.add(entity)
         return entity_hash in self._entry_index
 
-    def contains_alias(self, unicode alias):
+    def contains_alias(self, str alias):
         cdef hash_t alias_hash = self.vocab.strings.add(alias)
         return alias_hash in self._alias_index
 
-    def add_alias(self, unicode alias, entities, probabilities):
+    def add_alias(self, str alias, entities, probabilities):
         """
         For a given alias, add its potential entities and prior probabilies to the KB.
         Return the alias_hash at the end

@@ -239,7 +239,7 @@ cdef class KnowledgeBase:
             raise RuntimeError(Errors.E891.format(alias=alias))
         return alias_hash
 
-    def append_alias(self, unicode alias, unicode entity, float prior_prob, ignore_warnings=False):
+    def append_alias(self, str alias, str entity, float prior_prob, ignore_warnings=False):
         """
         For an alias already existing in the KB, extend its potential entities with one more.
         Throw a warning if either the alias or the entity is unknown,

@@ -286,7 +286,7 @@ cdef class KnowledgeBase:
         alias_entry.probs = probs
         self._aliases_table[alias_index] = alias_entry
 
-    def get_alias_candidates(self, unicode alias) -> Iterator[Candidate]:
+    def get_alias_candidates(self, str alias) -> Iterator[Candidate]:
         """
         Return candidate entities for an alias. Each candidate defines the entity, the original alias,
         and the prior probability of that alias resolving to that entity.

@@ -307,7 +307,7 @@ cdef class KnowledgeBase:
                 for (entry_index, prior_prob) in zip(alias_entry.entry_indices, alias_entry.probs)
                 if entry_index != 0]
 
-    def get_vector(self, unicode entity):
+    def get_vector(self, str entity):
         cdef hash_t entity_hash = self.vocab.strings[entity]
 
         # Return an empty list if this entity is unknown in this KB

@@ -317,7 +317,7 @@ cdef class KnowledgeBase:
 
         return self._vectors_table[self._entries[entry_index].vector_index]
 
-    def get_prior_prob(self, unicode entity, unicode alias):
+    def get_prior_prob(self, str entity, str alias):
         """ Return the prior probability of a given alias being linked to a given entity,
         or return 0.0 when this combination is not known in the knowledge base"""
         cdef hash_t alias_hash = self.vocab.strings[alias]

@@ -587,7 +587,7 @@ cdef class Writer:
     def __init__(self, path):
         assert isinstance(path, Path)
         content = bytes(path)
-        cdef bytes bytes_loc = content.encode('utf8') if type(content) == unicode else content
+        cdef bytes bytes_loc = content.encode('utf8') if type(content) == str else content
         self._fp = fopen(<char*>bytes_loc, 'wb')
         if not self._fp:
             raise IOError(Errors.E146.format(path=path))

@@ -629,7 +629,7 @@ cdef class Writer:
 cdef class Reader:
     def __init__(self, path):
         content = bytes(path)
-        cdef bytes bytes_loc = content.encode('utf8') if type(content) == unicode else content
+        cdef bytes bytes_loc = content.encode('utf8') if type(content) == str else content
         self._fp = fopen(<char*>bytes_loc, 'rb')
         if not self._fp:
             PyErr_SetFromErrno(IOError)
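
With the unicode annotations replaced by str, these KB methods are simply called with plain Python strings. A brief sketch of that usage; the entity ID, alias and vector values below are invented for illustration.

import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.blank("en")
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
# All string arguments are ordinary str objects.
kb.add_entity(entity="Q42", freq=12, entity_vector=[1.0, 2.0, 3.0])
kb.add_alias(alias="Douglas", entities=["Q42"], probabilities=[0.8])
assert kb.contains_entity("Q42")
candidates = kb.get_alias_candidates("Douglas")
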
@@ -1,7 +1,7 @@
 from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
 from ..char_classes import UNITS, ALPHA_UPPER
 
-_list_punct = LIST_PUNCT + "፡ ። ፣ ፤ ፥ ፦ ፧".strip().split()
+_list_punct = LIST_PUNCT + "፡ ። ፣ ፤ ፥ ፦ ፧ ፠ ፨".strip().split()
 
 _suffixes = (
     _list_punct
@ -1,265 +1,79 @@
|
||||||
# Source: https://github.com/Alir3z4/stop-words
|
"""
|
||||||
|
References:
|
||||||
|
https://github.com/Alir3z4/stop-words - Original list, serves as a base.
|
||||||
|
https://postvai.com/books/stop-dumi.pdf - Additions to the original list in order to improve it.
|
||||||
|
"""
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
"""
|
"""
|
||||||
а
|
а автентичен аз ако ала
|
||||||
автентичен
|
|
||||||
аз
|
бе без беше би бивш бивша бившо бивши бил била били било благодаря близо бъдат
|
||||||
ако
|
бъде бъда бяха
|
||||||
ала
|
|
||||||
бе
|
в вас ваш ваша вашата вашият вероятно вече взема ви вие винаги внимава време все
|
||||||
без
|
всеки всички вместо всичко вследствие всъщност всяка втори във въпреки върху
|
||||||
беше
|
вътре веднъж
|
||||||
би
|
|
||||||
бивш
|
г ги главен главна главно глас го годно година години годишен
|
||||||
бивша
|
|
||||||
бившо
|
д да дали далеч далече два двама двамата две двете ден днес дни до добра добре
|
||||||
бил
|
добро добър достатъчно докато докога дори досега доста друг друга другаде други
|
||||||
била
|
|
||||||
били
|
е евтин едва един една еднаква еднакви еднакъв едно екип ето
|
||||||
било
|
|
||||||
благодаря
|
живот жив
|
||||||
близо
|
|
||||||
бъдат
|
за здравей здрасти знае зная забавям зад зададени заедно заради засега заспал
|
||||||
бъде
|
затова запазва започвам защо защото завинаги
|
||||||
бяха
|
|
||||||
в
|
и из или им има имат иска искам използвайки изглежда изглеждаше изглеждайки
|
||||||
вас
|
извън имайки
|
||||||
ваш
|
|
||||||
ваша
|
й йо
|
||||||
вероятно
|
|
||||||
вече
|
каза казва казвайки казвам как каква какво както какъв като кога кауза каузи
|
||||||
взема
|
когато когото което които кой който колко която къде където към край кратък
|
||||||
ви
|
кръгъл
|
||||||
вие
|
|
||||||
винаги
|
лесен лесно ли летя летиш летим лош
|
||||||
внимава
|
|
||||||
време
|
м май малко макар малцина междувременно минус ме между мек мен месец ми мис
|
||||||
все
|
мисля много мнозина мога могат може мой можем мокър моля момента му
|
||||||
всеки
|
|
||||||
всички
|
н на над назад най наш навсякъде навътре нагоре направи напред надолу наистина
|
||||||
всичко
|
например наопаки наполовина напоследък нека независимо нас насам наскоро
|
||||||
всяка
|
настрана необходимо него негов нещо нея ни ние никой нито нищо но нов някак нова
|
||||||
във
|
нови новина някои някой някога някъде няколко няма
|
||||||
въпреки
|
|
||||||
върху
|
о обаче около описан опитах опитва опитвайки опитвам определен определено освен
|
||||||
г
|
обикновено осигурява обратно означава особен особено от ох отвъд отгоре отдолу
|
||||||
ги
|
отново отива отивам отидох отсега отделно отколкото откъдето очевидно оттам
|
||||||
главен
|
относно още
|
||||||
главна
|
|
||||||
главно
|
п пак по повече повечето под поне просто пряко поради после последен последно
|
||||||
глас
|
посочен почти прави прав прави правя пред преди през при пък първата първи първо
|
||||||
го
|
път пъти плюс
|
||||||
година
|
|
||||||
години
|
равен равна различен различни разумен разумно
|
||||||
годишен
|
|
||||||
д
|
с са сам само себе сериозно сигурен сигурно се сега си син скоро скорошен след
|
||||||
да
|
следващ следващия следва следното следователно случва сме смях собствен
|
||||||
дали
|
сравнително смея според сред става срещу съвсем съдържа съдържащ съжалявам
|
||||||
два
|
съответен съответно сте съм със също
|
||||||
двама
|
|
||||||
двамата
|
т така техен техни такива такъв твърде там трета твой те тези ти то това
|
||||||
две
|
тогава този той търси толкова точно три трябва тук тъй тя тях
|
||||||
двете
|
|
||||||
ден
|
у утре ужасно употреба успоредно уточнен уточняване
|
||||||
днес
|
|
||||||
дни
|
харесва харесали хиляди
|
||||||
до
|
|
||||||
добра
|
ч часа ценя цяло цялостен че често чрез чудя
|
||||||
добре
|
|
||||||
добро
|
ще щеше щом щяха
|
||||||
добър
|
|
||||||
докато
|
|
||||||
докога
|
|
||||||
дори
|
|
||||||
досега
|
|
||||||
доста
|
|
||||||
друг
|
|
||||||
друга
|
|
||||||
други
|
|
||||||
е
|
|
||||||
евтин
|
|
||||||
едва
|
|
||||||
един
|
|
||||||
една
|
|
||||||
еднаква
|
|
||||||
еднакви
|
|
||||||
еднакъв
|
|
||||||
едно
|
|
||||||
екип
|
|
||||||
ето
|
|
||||||
живот
|
|
||||||
за
|
|
||||||
забавям
|
|
||||||
зад
|
|
||||||
заедно
|
|
||||||
заради
|
|
||||||
засега
|
|
||||||
заспал
|
|
||||||
затова
|
|
||||||
защо
|
|
||||||
защото
|
|
||||||
и
|
|
||||||
из
|
|
||||||
или
|
|
||||||
им
|
|
||||||
има
|
|
||||||
имат
|
|
||||||
иска
|
|
||||||
й
|
|
||||||
каза
|
|
||||||
как
|
|
||||||
каква
|
|
||||||
какво
|
|
||||||
както
|
|
||||||
какъв
|
|
||||||
като
|
|
||||||
кога
|
|
||||||
когато
|
|
||||||
което
|
|
||||||
които
|
|
||||||
кой
|
|
||||||
който
|
|
||||||
колко
|
|
||||||
която
|
|
||||||
къде
|
|
||||||
където
|
|
||||||
към
|
|
||||||
лесен
|
|
||||||
лесно
|
|
||||||
ли
|
|
||||||
лош
|
|
||||||
м
|
|
||||||
май
|
|
||||||
малко
|
|
||||||
ме
|
|
||||||
между
|
|
||||||
мек
|
|
||||||
мен
|
|
||||||
месец
|
|
||||||
ми
|
|
||||||
много
|
|
||||||
мнозина
|
|
||||||
мога
|
|
||||||
могат
|
|
||||||
може
|
|
||||||
мокър
|
|
||||||
моля
|
|
||||||
момента
|
|
||||||
му
|
|
||||||
н
|
|
||||||
на
|
|
||||||
над
|
|
||||||
назад
|
|
||||||
най
|
|
||||||
направи
|
|
||||||
напред
|
|
||||||
например
|
|
||||||
нас
|
|
||||||
не
|
|
||||||
него
|
|
||||||
нещо
|
|
||||||
нея
|
|
||||||
ни
|
|
||||||
ние
|
|
||||||
никой
|
|
||||||
нито
|
|
||||||
нищо
|
|
||||||
но
|
|
||||||
нов
|
|
||||||
нова
|
|
||||||
нови
|
|
||||||
новина
|
|
||||||
някои
|
|
||||||
някой
|
|
||||||
няколко
|
|
||||||
няма
|
|
||||||
обаче
|
|
||||||
около
|
|
||||||
освен
|
|
||||||
особено
|
|
||||||
от
|
|
||||||
отгоре
|
|
||||||
отново
|
|
||||||
още
|
|
||||||
пак
|
|
||||||
по
|
|
||||||
повече
|
|
||||||
повечето
|
|
||||||
под
|
|
||||||
поне
|
|
||||||
поради
|
|
||||||
после
|
|
||||||
почти
|
|
||||||
прави
|
|
||||||
пред
|
|
||||||
преди
|
|
||||||
през
|
|
||||||
при
|
|
||||||
пък
|
|
||||||
първата
|
|
||||||
първи
|
|
||||||
първо
|
|
||||||
пъти
|
|
||||||
равен
|
|
||||||
равна
|
|
||||||
с
|
|
||||||
са
|
|
||||||
сам
|
|
||||||
само
|
|
||||||
се
|
|
||||||
сега
|
|
||||||
си
|
|
||||||
син
|
|
||||||
скоро
|
|
||||||
след
|
|
||||||
следващ
|
|
||||||
сме
|
|
||||||
смях
|
|
||||||
според
|
|
||||||
сред
|
|
||||||
срещу
|
|
||||||
сте
|
|
||||||
съм
|
|
||||||
със
|
|
||||||
също
|
|
||||||
т
|
|
||||||
тази
|
|
||||||
така
|
|
||||||
такива
|
|
||||||
такъв
|
|
||||||
там
|
|
||||||
твой
|
|
||||||
те
|
|
||||||
тези
|
|
||||||
ти
|
|
||||||
т.н.
|
|
||||||
то
|
|
||||||
това
|
|
||||||
тогава
|
|
||||||
този
|
|
||||||
той
|
|
||||||
толкова
|
|
||||||
точно
|
|
||||||
три
|
|
||||||
трябва
|
|
||||||
тук
|
|
||||||
тъй
|
|
||||||
тя
|
|
||||||
тях
|
|
||||||
у
|
|
||||||
утре
|
|
||||||
харесва
|
|
||||||
хиляди
|
|
||||||
ч
|
|
||||||
часа
|
|
||||||
че
|
|
||||||
често
|
|
||||||
чрез
|
|
||||||
ще
|
|
||||||
щом
|
|
||||||
юмрук
|
юмрук
|
||||||
я
|
|
||||||
як
|
я як
|
||||||
""".split()
|
""".split()
|
||||||
)
|
)
|
||||||
|
|
|
@ -1,10 +1,16 @@
|
||||||
|
"""
|
||||||
|
References:
|
||||||
|
https://slovored.com/bg/abbr/grammar/ - Additional refs for abbreviations
|
||||||
|
(countries, occupations, fields of studies and more).
|
||||||
|
"""
|
||||||
|
|
||||||
from ...symbols import ORTH, NORM
|
from ...symbols import ORTH, NORM
|
||||||
|
|
||||||
|
|
||||||
_exc = {}
|
_exc = {}
|
||||||
|
|
||||||
|
# measurements
|
||||||
_abbr_exc = [
|
for abbr in [
|
||||||
{ORTH: "м", NORM: "метър"},
|
{ORTH: "м", NORM: "метър"},
|
||||||
{ORTH: "мм", NORM: "милиметър"},
|
{ORTH: "мм", NORM: "милиметър"},
|
||||||
{ORTH: "см", NORM: "сантиметър"},
|
{ORTH: "см", NORM: "сантиметър"},
|
||||||
|
@ -17,51 +23,191 @@ _abbr_exc = [
|
||||||
{ORTH: "хл", NORM: "хектолиър"},
|
{ORTH: "хл", NORM: "хектолиър"},
|
||||||
{ORTH: "дкл", NORM: "декалитър"},
|
{ORTH: "дкл", NORM: "декалитър"},
|
||||||
{ORTH: "л", NORM: "литър"},
|
{ORTH: "л", NORM: "литър"},
|
||||||
]
|
]:
|
||||||
for abbr in _abbr_exc:
|
|
||||||
_exc[abbr[ORTH]] = [abbr]
|
_exc[abbr[ORTH]] = [abbr]
|
||||||
|
|
||||||
_abbr_line_exc = [
|
# line abbreviations
|
||||||
|
for abbr in [
|
||||||
{ORTH: "г-жа", NORM: "госпожа"},
|
{ORTH: "г-жа", NORM: "госпожа"},
|
||||||
{ORTH: "г-н", NORM: "господин"},
|
{ORTH: "г-н", NORM: "господин"},
|
||||||
{ORTH: "г-ца", NORM: "госпожица"},
|
{ORTH: "г-ца", NORM: "госпожица"},
|
||||||
{ORTH: "д-р", NORM: "доктор"},
|
{ORTH: "д-р", NORM: "доктор"},
|
||||||
{ORTH: "о-в", NORM: "остров"},
|
{ORTH: "о-в", NORM: "остров"},
|
||||||
{ORTH: "п-в", NORM: "полуостров"},
|
{ORTH: "п-в", NORM: "полуостров"},
|
||||||
]
|
{ORTH: "с-у", NORM: "срещу"},
|
||||||
|
{ORTH: "в-у", NORM: "върху"},
|
||||||
for abbr in _abbr_line_exc:
|
{ORTH: "м-у", NORM: "между"},
|
||||||
|
]:
|
||||||
_exc[abbr[ORTH]] = [abbr]
|
_exc[abbr[ORTH]] = [abbr]
|
||||||
|
|
||||||
_abbr_dot_exc = [
|
# foreign language related abbreviations
|
||||||
|
for abbr in [
|
||||||
|
{ORTH: "англ.", NORM: "английски"},
|
||||||
|
{ORTH: "ан.", NORM: "английски термин"},
|
||||||
|
{ORTH: "араб.", NORM: "арабски"},
|
||||||
|
{ORTH: "афр.", NORM: "африкански"},
|
||||||
|
{ORTH: "гр.", NORM: "гръцки"},
|
||||||
|
{ORTH: "лат.", NORM: "латински"},
|
||||||
|
{ORTH: "рим.", NORM: "римски"},
|
||||||
|
{ORTH: "старогр.", NORM: "старогръцки"},
|
||||||
|
{ORTH: "староевр.", NORM: "староеврейски"},
|
||||||
|
{ORTH: "фр.", NORM: "френски"},
|
||||||
|
{ORTH: "хол.", NORM: "холандски"},
|
||||||
|
{ORTH: "швед.", NORM: "шведски"},
|
||||||
|
{ORTH: "шотл.", NORM: "шотландски"},
|
||||||
|
{ORTH: "яп.", NORM: "японски"},
|
||||||
|
]:
|
||||||
|
_exc[abbr[ORTH]] = [abbr]
|
||||||
|
|
||||||
|
# profession and academic titles abbreviations
|
||||||
|
for abbr in [
|
||||||
{ORTH: "акад.", NORM: "академик"},
|
{ORTH: "акад.", NORM: "академик"},
|
||||||
{ORTH: "ал.", NORM: "алинея"},
|
|
||||||
{ORTH: "арх.", NORM: "архитект"},
|
{ORTH: "арх.", NORM: "архитект"},
|
||||||
|
{ORTH: "инж.", NORM: "инженер"},
|
||||||
|
{ORTH: "канц.", NORM: "канцлер"},
|
||||||
|
{ORTH: "проф.", NORM: "професор"},
|
||||||
|
{ORTH: "св.", NORM: "свети"},
|
||||||
|
]:
|
||||||
|
_exc[abbr[ORTH]] = [abbr]
|
||||||
|
|
||||||
|
# fields of studies
|
||||||
|
for abbr in [
|
||||||
|
{ORTH: "агр.", NORM: "агрономия"},
|
||||||
|
{ORTH: "ав.", NORM: "авиация"},
|
||||||
|
{ORTH: "агр.", NORM: "агрономия"},
|
||||||
|
{ORTH: "археол.", NORM: "археология"},
|
||||||
|
{ORTH: "астр.", NORM: "астрономия"},
|
||||||
|
{ORTH: "геод.", NORM: "геодезия"},
|
||||||
|
{ORTH: "геол.", NORM: "геология"},
|
||||||
|
{ORTH: "геом.", NORM: "геометрия"},
|
||||||
|
{ORTH: "гимн.", NORM: "гимнастика"},
|
||||||
|
{ORTH: "грам.", NORM: "граматика"},
|
||||||
|
{ORTH: "жур.", NORM: "журналистика"},
|
||||||
|
{ORTH: "журн.", NORM: "журналистика"},
|
||||||
|
{ORTH: "зем.", NORM: "земеделие"},
|
||||||
|
{ORTH: "икон.", NORM: "икономика"},
|
||||||
|
{ORTH: "лит.", NORM: "литература"},
|
||||||
|
{ORTH: "мат.", NORM: "математика"},
|
||||||
|
{ORTH: "мед.", NORM: "медицина"},
|
||||||
|
{ORTH: "муз.", NORM: "музика"},
|
||||||
|
{ORTH: "печ.", NORM: "печатарство"},
|
||||||
|
{ORTH: "пол.", NORM: "политика"},
|
||||||
|
{ORTH: "псих.", NORM: "психология"},
|
||||||
|
{ORTH: "соц.", NORM: "социология"},
|
||||||
|
{ORTH: "стат.", NORM: "статистика"},
|
||||||
|
{ORTH: "стил.", NORM: "стилистика"},
|
||||||
|
{ORTH: "топогр.", NORM: "топография"},
|
||||||
|
{ORTH: "търг.", NORM: "търговия"},
|
||||||
|
{ORTH: "фарм.", NORM: "фармацевтика"},
|
||||||
|
{ORTH: "фехт.", NORM: "фехтовка"},
|
||||||
|
{ORTH: "физиол.", NORM: "физиология"},
|
||||||
|
{ORTH: "физ.", NORM: "физика"},
|
||||||
|
{ORTH: "фил.", NORM: "философия"},
|
||||||
|
{ORTH: "фин.", NORM: "финанси"},
|
||||||
|
{ORTH: "фолкл.", NORM: "фолклор"},
|
||||||
|
{ORTH: "фон.", NORM: "фонетика"},
|
||||||
|
{ORTH: "фот.", NORM: "фотография"},
|
||||||
|
{ORTH: "футб.", NORM: "футбол"},
|
||||||
|
{ORTH: "хим.", NORM: "химия"},
|
||||||
|
{ORTH: "хир.", NORM: "хирургия"},
|
||||||
|
{ORTH: "ел.", NORM: "електротехника"},
|
||||||
|
]:
|
||||||
|
_exc[abbr[ORTH]] = [abbr]
|
||||||
|
|
||||||
|
for abbr in [
|
||||||
|
{ORTH: "ал.", NORM: "алинея"},
|
||||||
|
{ORTH: "авт.", NORM: "автоматично"},
|
||||||
|
{ORTH: "адм.", NORM: "администрация"},
|
||||||
|
{ORTH: "арт.", NORM: "артилерия"},
|
||||||
{ORTH: "бл.", NORM: "блок"},
|
{ORTH: "бл.", NORM: "блок"},
|
||||||
{ORTH: "бр.", NORM: "брой"},
|
{ORTH: "бр.", NORM: "брой"},
|
||||||
{ORTH: "бул.", NORM: "булевард"},
|
{ORTH: "бул.", NORM: "булевард"},
|
||||||
|
{ORTH: "букв.", NORM: "буквално"},
|
||||||
{ORTH: "в.", NORM: "век"},
|
{ORTH: "в.", NORM: "век"},
|
||||||
|
{ORTH: "вр.", NORM: "време"},
|
||||||
|
{ORTH: "вм.", NORM: "вместо"},
|
||||||
|
{ORTH: "воен.", NORM: "военен термин"},
|
||||||
{ORTH: "г.", NORM: "година"},
|
{ORTH: "г.", NORM: "година"},
|
||||||
{ORTH: "гр.", NORM: "град"},
|
{ORTH: "гр.", NORM: "град"},
|
||||||
|
{ORTH: "гл.", NORM: "глагол"},
|
||||||
|
{ORTH: "др.", NORM: "други"},
|
||||||
|
{ORTH: "ез.", NORM: "езеро"},
|
||||||
{ORTH: "ж.р.", NORM: "женски род"},
|
{ORTH: "ж.р.", NORM: "женски род"},
|
||||||
{ORTH: "инж.", NORM: "инженер"},
|
{ORTH: "жп.", NORM: "железопът"},
|
||||||
|
{ORTH: "застр.", NORM: "застрахователно дело"},
|
||||||
|
{ORTH: "знач.", NORM: "значение"},
|
||||||
|
{ORTH: "и др.", NORM: "и други"},
|
||||||
|
{ORTH: "и под.", NORM: "и подобни"},
|
||||||
|
{ORTH: "и пр.", NORM: "и прочие"},
|
||||||
|
{ORTH: "изр.", NORM: "изречение"},
|
||||||
|
{ORTH: "изт.", NORM: "източен"},
|
||||||
|
{ORTH: "конкр.", NORM: "конкретно"},
|
||||||
{ORTH: "лв.", NORM: "лев"},
|
{ORTH: "лв.", NORM: "лев"},
|
||||||
|
{ORTH: "л.", NORM: "лице"},
|
||||||
{ORTH: "м.р.", NORM: "мъжки род"},
|
{ORTH: "м.р.", NORM: "мъжки род"},
|
||||||
{ORTH: "мат.", NORM: "математика"},
|
{ORTH: "мин.вр.", NORM: "минало време"},
|
||||||
{ORTH: "мед.", NORM: "медицина"},
|
{ORTH: "мн.ч.", NORM: "множествено число"},
|
||||||
|
{ORTH: "напр.", NORM: "например"},
|
||||||
|
{ORTH: "нар.", NORM: "наречие"},
|
||||||
|
{ORTH: "науч.", NORM: "научен термин"},
|
||||||
|
{ORTH: "непр.", NORM: "неправилно"},
|
||||||
|
{ORTH: "обик.", NORM: "обикновено"},
|
||||||
|
{ORTH: "опред.", NORM: "определение"},
|
||||||
|
{ORTH: "особ.", NORM: "особено"},
|
||||||
|
{ORTH: "ост.", NORM: "остаряло"},
|
||||||
|
{ORTH: "относ.", NORM: "относително"},
|
||||||
|
{ORTH: "отр.", NORM: "отрицателно"},
|
||||||
{ORTH: "пл.", NORM: "площад"},
|
{ORTH: "пл.", NORM: "площад"},
|
||||||
{ORTH: "проф.", NORM: "професор"},
|
{ORTH: "пад.", NORM: "падеж"},
|
||||||
|
{ORTH: "парл.", NORM: "парламентарен"},
|
||||||
|
{ORTH: "погов.", NORM: "поговорка"},
|
||||||
|
{ORTH: "пон.", NORM: "понякога"},
|
||||||
|
{ORTH: "правосл.", NORM: "православен"},
|
||||||
|
{ORTH: "прибл.", NORM: "приблизително"},
|
||||||
|
{ORTH: "прил.", NORM: "прилагателно име"},
|
||||||
|
{ORTH: "пр.", NORM: "прочие"},
|
||||||
{ORTH: "с.", NORM: "село"},
|
{ORTH: "с.", NORM: "село"},
|
||||||
{ORTH: "с.р.", NORM: "среден род"},
|
{ORTH: "с.р.", NORM: "среден род"},
|
||||||
{ORTH: "св.", NORM: "свети"},
|
|
||||||
{ORTH: "сп.", NORM: "списание"},
|
{ORTH: "сп.", NORM: "списание"},
|
||||||
{ORTH: "стр.", NORM: "страница"},
|
{ORTH: "стр.", NORM: "страница"},
|
||||||
|
{ORTH: "сз.", NORM: "съюз"},
|
||||||
|
{ORTH: "сег.", NORM: "сегашно"},
|
||||||
|
{ORTH: "сп.", NORM: "спорт"},
|
||||||
|
{ORTH: "срв.", NORM: "сравни"},
|
||||||
|
{ORTH: "с.ст.", NORM: "селскостопанска техника"},
|
||||||
|
{ORTH: "счет.", NORM: "счетоводство"},
|
||||||
|
{ORTH: "съкр.", NORM: "съкратено"},
|
||||||
|
{ORTH: "съобщ.", NORM: "съобщение"},
|
||||||
|
{ORTH: "същ.", NORM: "съществително"},
|
||||||
|
{ORTH: "текст.", NORM: "текстилен"},
|
||||||
|
{ORTH: "телев.", NORM: "телевизия"},
|
||||||
|
{ORTH: "тел.", NORM: "телефон"},
|
||||||
|
{ORTH: "т.е.", NORM: "тоест"},
|
||||||
|
{ORTH: "т.н.", NORM: "така нататък"},
|
||||||
|
{ORTH: "т.нар.", NORM: "така наречен"},
|
||||||
|
{ORTH: "търж.", NORM: "тържествено"},
|
||||||
{ORTH: "ул.", NORM: "улица"},
|
{ORTH: "ул.", NORM: "улица"},
|
||||||
|
{ORTH: "уч.", NORM: "училище"},
|
||||||
|
{ORTH: "унив.", NORM: "университет"},
|
||||||
|
{ORTH: "харт.", NORM: "хартия"},
|
||||||
|
{ORTH: "хидр.", NORM: "хидравлика"},
|
||||||
|
{ORTH: "хран.", NORM: "хранителна"},
|
||||||
|
{ORTH: "църк.", NORM: "църковен термин"},
|
||||||
|
{ORTH: "числ.", NORM: "числително"},
|
||||||
{ORTH: "чл.", NORM: "член"},
|
{ORTH: "чл.", NORM: "член"},
|
||||||
]
|
{ORTH: "ч.", NORM: "число"},
|
||||||
|
{ORTH: "числ.", NORM: "числително"},
|
||||||
for abbr in _abbr_dot_exc:
|
{ORTH: "шахм.", NORM: "шахмат"},
|
||||||
|
{ORTH: "шах.", NORM: "шахмат"},
|
||||||
|
{ORTH: "юр.", NORM: "юридически"},
|
||||||
|
]:
|
||||||
_exc[abbr[ORTH]] = [abbr]
|
_exc[abbr[ORTH]] = [abbr]
|
||||||
|
|
||||||
|
# slash abbreviations
|
||||||
|
for abbr in [
|
||||||
|
{ORTH: "м/у", NORM: "между"},
|
||||||
|
{ORTH: "с/у", NORM: "срещу"},
|
||||||
|
]:
|
||||||
|
_exc[abbr[ORTH]] = [abbr]
|
||||||
|
|
||||||
TOKENIZER_EXCEPTIONS = _exc
|
TOKENIZER_EXCEPTIONS = _exc
|
||||||
|
|
|
@@ -1,4 +1,4 @@
-from typing import Optional
+from typing import Optional, Callable
 from thinc.api import Model
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES

@@ -23,13 +23,25 @@ class Bengali(Language):
 @Bengali.factory(
     "lemmatizer",
     assigns=["token.lemma"],
-    default_config={"model": None, "mode": "rule", "overwrite": False},
+    default_config={
+        "model": None,
+        "mode": "rule",
+        "overwrite": False,
+        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
+    },
     default_score_weights={"lemma_acc": 1.0},
 )
 def make_lemmatizer(
-    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
+    nlp: Language,
+    model: Optional[Model],
+    name: str,
+    mode: str,
+    overwrite: bool,
+    scorer: Optional[Callable],
 ):
-    return Lemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
+    return Lemmatizer(
+        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
+    )
 
 
 __all__ = ["Bengali"]
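
The same pattern is applied to the other language-specific lemmatizer factories below: each now takes a scorer argument whose default is the registered spacy.lemmatizer_scorer.v1 function. A hedged sketch of overriding that default with a custom registered scorer; the name my_lemma_scorer.v1 and the scoring body are made up.

import spacy
from spacy import registry
from spacy.scorer import Scorer

@registry.scorers("my_lemma_scorer.v1")
def make_my_lemma_scorer():
    def score(examples, **kwargs):
        # Score the token-level lemma attribute, as the built-in scorer does.
        return Scorer.score_token_attr(examples, "lemma", **kwargs)
    return score

nlp = spacy.blank("en")
nlp.add_pipe("lemmatizer", config={"scorer": {"@scorers": "my_lemma_scorer.v1"}})
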
23
spacy/lang/ca/__init__.py
Normal file → Executable file

@@ -1,9 +1,9 @@
-from typing import Optional
+from typing import Optional, Callable
 
 from thinc.api import Model
 
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
+from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES, TOKENIZER_PREFIXES
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .syntax_iterators import SYNTAX_ITERATORS

@@ -15,6 +15,7 @@ class CatalanDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     infixes = TOKENIZER_INFIXES
     suffixes = TOKENIZER_SUFFIXES
+    prefixes = TOKENIZER_PREFIXES
     stop_words = STOP_WORDS
     lex_attr_getters = LEX_ATTRS
     syntax_iterators = SYNTAX_ITERATORS

@@ -28,13 +29,25 @@ class Catalan(Language):
 @Catalan.factory(
     "lemmatizer",
     assigns=["token.lemma"],
-    default_config={"model": None, "mode": "rule", "overwrite": False},
+    default_config={
+        "model": None,
+        "mode": "rule",
+        "overwrite": False,
+        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
+    },
     default_score_weights={"lemma_acc": 1.0},
 )
 def make_lemmatizer(
-    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
+    nlp: Language,
+    model: Optional[Model],
+    name: str,
+    mode: str,
+    overwrite: bool,
+    scorer: Optional[Callable],
 ):
-    return CatalanLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
+    return CatalanLemmatizer(
+        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
+    )
 
 
 __all__ = ["Catalan"]
11
spacy/lang/ca/punctuation.py
Normal file → Executable file

@@ -1,4 +1,5 @@
 from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_ICONS
+from ..char_classes import LIST_CURRENCY
 from ..char_classes import CURRENCY
 from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA, PUNCT
 from ..char_classes import merge_chars, _units

@@ -6,6 +7,14 @@ from ..char_classes import merge_chars, _units
 
 ELISION = " ' ’ ".strip().replace(" ", "").replace("\n", "")
 
+_prefixes = (
+    ["§", "%", "=", "—", "–", "-", r"\+(?![0-9])"]
+    + LIST_PUNCT
+    + LIST_ELLIPSES
+    + LIST_QUOTES
+    + LIST_CURRENCY
+    + LIST_ICONS
+)
 
 _infixes = (
     LIST_ELLIPSES

@@ -18,6 +27,7 @@ _infixes = (
         r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
         r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
         r"(?<=[{a}][{el}])(?=[{a}0-9])".format(a=ALPHA, el=ELISION),
+        r"('ls|'l|'ns|'t|'m|'n|-les|-la|-lo|-li|-los|-me|-nos|-te|-vos|-se|-hi|-ne|-ho)(?![A-Za-z])|(-l'|-m'|-t'|-n')",
     ]
 )
 

@@ -44,3 +54,4 @@ _suffixes = (
 
 TOKENIZER_INFIXES = _infixes
 TOKENIZER_SUFFIXES = _suffixes
+TOKENIZER_PREFIXES = _prefixes
21
spacy/lang/ca/tokenizer_exceptions.py
Normal file → Executable file

@@ -18,12 +18,21 @@ for exc_data in [
     {ORTH: "nov.", NORM: "novembre"},
     {ORTH: "dec.", NORM: "desembre"},
     {ORTH: "Dr.", NORM: "doctor"},
+    {ORTH: "Dra.", NORM: "doctora"},
     {ORTH: "Sr.", NORM: "senyor"},
     {ORTH: "Sra.", NORM: "senyora"},
     {ORTH: "Srta.", NORM: "senyoreta"},
     {ORTH: "núm", NORM: "número"},
     {ORTH: "St.", NORM: "sant"},
     {ORTH: "Sta.", NORM: "santa"},
+    {ORTH: "pl.", NORM: "plaça"},
+    {ORTH: "à."},
+    {ORTH: "è."},
+    {ORTH: "é."},
+    {ORTH: "í."},
+    {ORTH: "ò."},
+    {ORTH: "ó."},
+    {ORTH: "ú."},
     {ORTH: "'l"},
     {ORTH: "'ls"},
     {ORTH: "'m"},

@@ -34,6 +43,18 @@ for exc_data in [
 ]:
     _exc[exc_data[ORTH]] = [exc_data]
 
+_exc["del"] = [{ORTH: "d", NORM: "de"}, {ORTH: "el"}]
+_exc["dels"] = [{ORTH: "d", NORM: "de"}, {ORTH: "els"}]
+
+_exc["al"] = [{ORTH: "a"}, {ORTH: "l", NORM: "el"}]
+_exc["als"] = [{ORTH: "a"}, {ORTH: "ls", NORM: "els"}]
+
+_exc["pel"] = [{ORTH: "p", NORM: "per"}, {ORTH: "el"}]
+_exc["pels"] = [{ORTH: "p", NORM: "per"}, {ORTH: "els"}]
+
+_exc["holahola"] = [{ORTH: "holahola", NORM: "cocacola"}]
+
+
 # Times
 _exc["12m."] = [{ORTH: "12"}, {ORTH: "m.", NORM: "p.m."}]
 
@@ -1,4 +1,4 @@
-from typing import Optional
+from typing import Optional, Callable
 from thinc.api import Model
 
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS

@@ -28,13 +28,25 @@ class Greek(Language):
 @Greek.factory(
     "lemmatizer",
     assigns=["token.lemma"],
-    default_config={"model": None, "mode": "rule", "overwrite": False},
+    default_config={
+        "model": None,
+        "mode": "rule",
+        "overwrite": False,
+        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
+    },
     default_score_weights={"lemma_acc": 1.0},
 )
 def make_lemmatizer(
-    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
+    nlp: Language,
+    model: Optional[Model],
+    name: str,
+    mode: str,
+    overwrite: bool,
+    scorer: Optional[Callable],
 ):
-    return GreekLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
+    return GreekLemmatizer(
+        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
+    )
 
 
 __all__ = ["Greek"]
@@ -1,4 +1,4 @@
-from typing import Optional
+from typing import Optional, Callable
 from thinc.api import Model
 
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS

@@ -26,13 +26,25 @@ class English(Language):
 @English.factory(
     "lemmatizer",
     assigns=["token.lemma"],
-    default_config={"model": None, "mode": "rule", "overwrite": False},
+    default_config={
+        "model": None,
+        "mode": "rule",
+        "overwrite": False,
+        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
+    },
     default_score_weights={"lemma_acc": 1.0},
 )
 def make_lemmatizer(
-    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
+    nlp: Language,
+    model: Optional[Model],
+    name: str,
+    mode: str,
+    overwrite: bool,
+    scorer: Optional[Callable],
 ):
-    return EnglishLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
+    return EnglishLemmatizer(
+        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
+    )
 
 
 __all__ = ["English"]
@@ -10,7 +10,7 @@ class EnglishLemmatizer(Lemmatizer):
         Check whether we're dealing with an uninflected paradigm, so we can
         avoid lemmatization entirely.
 
-        univ_pos (unicode / int): The token's universal part-of-speech tag.
+        univ_pos (str / int): The token's universal part-of-speech tag.
         morphology (dict): The token's morphological features following the
         Universal Dependencies scheme.
         """
@@ -1,4 +1,4 @@
-from typing import Optional
+from typing import Optional, Callable
 from thinc.api import Model
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS

@@ -26,13 +26,25 @@ class Spanish(Language):
 @Spanish.factory(
     "lemmatizer",
     assigns=["token.lemma"],
-    default_config={"model": None, "mode": "rule", "overwrite": False},
+    default_config={
+        "model": None,
+        "mode": "rule",
+        "overwrite": False,
+        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
+    },
     default_score_weights={"lemma_acc": 1.0},
 )
 def make_lemmatizer(
-    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
+    nlp: Language,
+    model: Optional[Model],
+    name: str,
+    mode: str,
+    overwrite: bool,
+    scorer: Optional[Callable],
 ):
-    return SpanishLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
+    return SpanishLemmatizer(
+        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
+    )
 
 
 __all__ = ["Spanish"]
@@ -1,58 +1,76 @@
 from typing import Union, Iterator, Tuple

-from ...symbols import NOUN, PROPN, PRON, VERB, AUX
+from ...symbols import NOUN, PROPN, PRON
 from ...errors import Errors
-from ...tokens import Doc, Span, Token
+from ...tokens import Doc, Span


 def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
-    """Detect base noun phrases from a dependency parse. Works on Doc and Span."""
-    doc = doclike.doc
+    """
+    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
+    """
+    labels = [
+        "nsubj",
+        "nsubj:pass",
+        "obj",
+        "obl",
+        "nmod",
+        "pcomp",
+        "appos",
+        "ROOT",
+    ]
+    post_modifiers = ["flat", "fixed", "compound"]
+    doc = doclike.doc  # Ensure works on both Doc and Span.
     if not doc.has_annotation("DEP"):
         raise ValueError(Errors.E029)
-    if not len(doc):
-        return
+    np_deps = {doc.vocab.strings.add(label) for label in labels}
+    np_modifs = {doc.vocab.strings.add(modifier) for modifier in post_modifiers}
     np_label = doc.vocab.strings.add("NP")
-    left_labels = ["det", "fixed", "neg"]  # ['nunmod', 'det', 'appos', 'fixed']
-    right_labels = ["flat", "fixed", "compound", "neg"]
-    stop_labels = ["punct"]
-    np_left_deps = [doc.vocab.strings.add(label) for label in left_labels]
-    np_right_deps = [doc.vocab.strings.add(label) for label in right_labels]
-    stop_deps = [doc.vocab.strings.add(label) for label in stop_labels]
-    prev_right = -1
-    for token in doclike:
-        if token.pos in [PROPN, NOUN, PRON]:
-            left, right = noun_bounds(
-                doc, token, np_left_deps, np_right_deps, stop_deps
-            )
-            if left.i <= prev_right:
-                continue
-            yield left.i, right.i + 1, np_label
-            prev_right = right.i
-
-
-def is_verb_token(token: Token) -> bool:
-    return token.pos in [VERB, AUX]
-
-
-def noun_bounds(doc, root, np_left_deps, np_right_deps, stop_deps):
-    left_bound = root
-    for token in reversed(list(root.lefts)):
-        if token.dep in np_left_deps:
-            left_bound = token
-    right_bound = root
-    for token in root.rights:
-        if token.dep in np_right_deps:
-            left, right = noun_bounds(
-                doc, token, np_left_deps, np_right_deps, stop_deps
-            )
-            filter_func = lambda t: is_verb_token(t) or t.dep in stop_deps
-            if list(filter(filter_func, doc[left_bound.i : right.i])):
-                break
-            else:
-                right_bound = right
-    return left_bound, right_bound
+    adj_label = doc.vocab.strings.add("amod")
+    adp_label = doc.vocab.strings.add("ADP")
+    conj = doc.vocab.strings.add("conj")
+    conj_pos = doc.vocab.strings.add("CCONJ")
+    prev_end = -1
+    for i, word in enumerate(doclike):
+        if word.pos not in (NOUN, PROPN, PRON):
+            continue
+        # Prevent nested chunks from being produced
+        if word.left_edge.i <= prev_end:
+            continue
+        if word.dep in np_deps:
+            right_childs = list(word.rights)
+            right_child = right_childs[0] if right_childs else None
+
+            if right_child:
+                if right_child.dep == adj_label:
+                    right_end = right_child.right_edge
+                elif right_child.dep in np_modifs:  # Check if we can expand to right
+                    right_end = word.right_edge
+                else:
+                    right_end = word
+            else:
+                right_end = word
+            prev_end = right_end.i
+
+            left_index = word.left_edge.i
+            left_index = (
+                left_index + 1 if word.left_edge.pos == adp_label else left_index
+            )  # Eliminate left attached de, del
+
+            yield left_index, right_end.i + 1, np_label
+        elif word.dep == conj:
+            head = word.head
+            while head.dep == conj and head.head.i < head.i:
+                head = head.head
+            # If the head is an NP, and we're coordinated to it, we're an NP
+            if head.dep in np_deps:
+                prev_end = word.i

+                left_index = word.left_edge.i  # eliminate left attached conjunction
+                left_index = (
+                    left_index + 1 if word.left_edge.pos == conj_pos else left_index
+                )
+                yield left_index, word.i + 1, np_label


 SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}
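Note: this iterator is what backs Doc.noun_chunks for Spanish pipelines. A rough usage sketch (assumes a trained Spanish pipeline with a parser, e.g. es_core_news_sm, is installed; the sentence and output are illustrative only):

import spacy

# python -m spacy download es_core_news_sm  (assumed to be available)
nlp = spacy.load("es_core_news_sm")
doc = nlp("Los estudiantes de la universidad leyeron el libro y el artículo.")
for chunk in doc.noun_chunks:
    # Each chunk is a Span produced by the syntax iterator above.
    print(chunk.text, chunk.root.dep_)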
@@ -1,4 +1,4 @@
-from typing import Optional
+from typing import Optional, Callable
 from thinc.api import Model
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
@@ -26,13 +26,25 @@ class Persian(Language):
 @Persian.factory(
     "lemmatizer",
     assigns=["token.lemma"],
-    default_config={"model": None, "mode": "rule", "overwrite": False},
+    default_config={
+        "model": None,
+        "mode": "rule",
+        "overwrite": False,
+        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
+    },
     default_score_weights={"lemma_acc": 1.0},
 )
 def make_lemmatizer(
-    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
+    nlp: Language,
+    model: Optional[Model],
+    name: str,
+    mode: str,
+    overwrite: bool,
+    scorer: Optional[Callable],
 ):
-    return Lemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
+    return Lemmatizer(
+        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
+    )


 __all__ = ["Persian"]
@@ -1,4 +1,4 @@
-from typing import Optional
+from typing import Optional, Callable

 from thinc.api import Model

@@ -31,13 +31,25 @@ class French(Language):
 @French.factory(
     "lemmatizer",
     assigns=["token.lemma"],
-    default_config={"model": None, "mode": "rule", "overwrite": False},
+    default_config={
+        "model": None,
+        "mode": "rule",
+        "overwrite": False,
+        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
+    },
     default_score_weights={"lemma_acc": 1.0},
 )
 def make_lemmatizer(
-    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
+    nlp: Language,
+    model: Optional[Model],
+    name: str,
+    mode: str,
+    overwrite: bool,
+    scorer: Optional[Callable],
 ):
-    return FrenchLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
+    return FrenchLemmatizer(
+        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
+    )


 __all__ = ["French"]
@@ -1,6 +1,11 @@
+from typing import Optional
+
+from thinc.api import Model
+
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
 from ...language import Language, BaseDefaults
+from .lemmatizer import IrishLemmatizer


 class IrishDefaults(BaseDefaults):
@@ -13,4 +18,16 @@ class Irish(Language):
     Defaults = IrishDefaults


+@Irish.factory(
+    "lemmatizer",
+    assigns=["token.lemma"],
+    default_config={"model": None, "mode": "pos_lookup", "overwrite": False},
+    default_score_weights={"lemma_acc": 1.0},
+)
+def make_lemmatizer(
+    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
+):
+    return IrishLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
+
+
 __all__ = ["Irish"]
@@ -1,35 +0,0 @@
-# fmt: off
-consonants = ["b", "c", "d", "f", "g", "h", "j", "k", "l", "m", "n", "p", "q", "r", "s", "t", "v", "w", "x", "z"]
-broad_vowels = ["a", "á", "o", "ó", "u", "ú"]
-slender_vowels = ["e", "é", "i", "í"]
-vowels = broad_vowels + slender_vowels
-# fmt: on
-
-
-def ends_dentals(word):
-    if word != "" and word[-1] in ["d", "n", "t", "s"]:
-        return True
-    else:
-        return False
-
-
-def devoice(word):
-    if len(word) > 2 and word[-2] == "s" and word[-1] == "d":
-        return word[:-1] + "t"
-    else:
-        return word
-
-
-def ends_with_vowel(word):
-    return word != "" and word[-1] in vowels
-
-
-def starts_with_vowel(word):
-    return word != "" and word[0] in vowels
-
-
-def deduplicate(word):
-    if len(word) > 2 and word[-2] == word[-1] and word[-1] in consonants:
-        return word[:-1]
-    else:
-        return word
162 spacy/lang/ga/lemmatizer.py Normal file
@@ -0,0 +1,162 @@
+from typing import List, Dict, Tuple
+
+from ...pipeline import Lemmatizer
+from ...tokens import Token
+
+
+class IrishLemmatizer(Lemmatizer):
+    # This is a lookup-based lemmatiser using data extracted from
+    # BuNaMo (https://github.com/michmech/BuNaMo)
+
+    @classmethod
+    def get_lookups_config(cls, mode: str) -> Tuple[List[str], List[str]]:
+        if mode == "pos_lookup":
+            # fmt: off
+            required = [
+                "lemma_lookup_adj", "lemma_lookup_adp",
+                "lemma_lookup_noun", "lemma_lookup_verb"
+            ]
+            # fmt: on
+            return (required, [])
+        else:
+            return super().get_lookups_config(mode)
+
+    def pos_lookup_lemmatize(self, token: Token) -> List[str]:
+        univ_pos = token.pos_
+        string = unponc(token.text)
+        if univ_pos not in ["PROPN", "ADP", "ADJ", "NOUN", "VERB"]:
+            return [string.lower()]
+        demutated = demutate(string)
+        secondary = ""
+        if string[0:1].lower() == "h" and string[1:2].lower() in "aáeéiíoóuú":
+            secondary = string[1:]
+        lookup_pos = univ_pos.lower()
+        if univ_pos == "PROPN":
+            lookup_pos = "noun"
+        if token.has_morph():
+            # TODO: lookup is actually required for the genitive forms, but
+            # this is not in BuNaMo, and would not be of use with IDT.
+            if univ_pos == "NOUN" and (
+                "VerbForm=Vnoun" in token.morph or "VerbForm=Inf" in token.morph
+            ):
+                hpref = "Form=HPref" in token.morph
+                return [demutate(string, hpref).lower()]
+            elif univ_pos == "ADJ" and "VerbForm=Part" in token.morph:
+                return [demutate(string).lower()]
+        lookup_table = self.lookups.get_table("lemma_lookup_" + lookup_pos, {})
+
+        def to_list(value):
+            if value is None:
+                value = []
+            elif not isinstance(value, list):
+                value = [value]
+            return value
+
+        if univ_pos == "ADP":
+            return to_list(lookup_table.get(string, string.lower()))
+        ret = []
+        if univ_pos == "PROPN":
+            ret.extend(to_list(lookup_table.get(demutated)))
+            ret.extend(to_list(lookup_table.get(secondary)))
+        else:
+            ret.extend(to_list(lookup_table.get(demutated.lower())))
+            ret.extend(to_list(lookup_table.get(secondary.lower())))
+        if len(ret) == 0:
+            ret = [string.lower()]
+        return ret
+
+
+def demutate(word: str, is_hpref: bool = False) -> str:
+    UVOWELS = "AÁEÉIÍOÓUÚ"
+    LVOWELS = "aáeéiíoóuú"
+    lc = word.lower()
+    # remove eclipsis
+    if lc.startswith("bhf"):
+        word = word[2:]
+    elif lc.startswith("mb"):
+        word = word[1:]
+    elif lc.startswith("gc"):
+        word = word[1:]
+    elif lc.startswith("nd"):
+        word = word[1:]
+    elif lc.startswith("ng"):
+        word = word[1:]
+    elif lc.startswith("bp"):
+        word = word[1:]
+    elif lc.startswith("dt"):
+        word = word[1:]
+    elif word[0:1] == "n" and word[1:2] in UVOWELS:
+        word = word[1:]
+    elif lc.startswith("n-") and word[2:3] in LVOWELS:
+        word = word[2:]
+    # non-standard eclipsis
+    elif lc.startswith("bh-f"):
+        word = word[3:]
+    elif lc.startswith("m-b"):
+        word = word[2:]
+    elif lc.startswith("g-c"):
+        word = word[2:]
+    elif lc.startswith("n-d"):
+        word = word[2:]
+    elif lc.startswith("n-g"):
+        word = word[2:]
+    elif lc.startswith("b-p"):
+        word = word[2:]
+    elif lc.startswith("d-t"):
+        word = word[2:]
+
+    # t-prothesis
+    elif lc.startswith("ts"):
+        word = word[1:]
+    elif lc.startswith("t-s"):
+        word = word[2:]
+
+    # h-prothesis, if known to be present
+    elif is_hpref and word[0:1] == "h":
+        word = word[1:]
+    # h-prothesis, simple case
+    # words can also begin with 'h', but unlike eclipsis,
+    # a hyphen is not used, so that needs to be handled
+    # elsewhere
+    elif word[0:1] == "h" and word[1:2] in UVOWELS:
+        word = word[1:]
+
+    # lenition
+    # this breaks the previous if, to handle super-non-standard
+    # text where both eclipsis and lenition were used.
+    if lc[0:1] in "bcdfgmpst" and lc[1:2] == "h":
+        word = word[0:1] + word[2:]
+
+    return word
+
+
+def unponc(word: str) -> str:
+    # fmt: off
+    PONC = {
+        "ḃ": "bh",
+        "ċ": "ch",
+        "ḋ": "dh",
+        "ḟ": "fh",
+        "ġ": "gh",
+        "ṁ": "mh",
+        "ṗ": "ph",
+        "ṡ": "sh",
+        "ṫ": "th",
+        "Ḃ": "BH",
+        "Ċ": "CH",
+        "Ḋ": "DH",
+        "Ḟ": "FH",
+        "Ġ": "GH",
+        "Ṁ": "MH",
+        "Ṗ": "PH",
+        "Ṡ": "SH",
+        "Ṫ": "TH"
+    }
+    # fmt: on
+    buf = []
+    for ch in word:
+        if ch in PONC:
+            buf.append(PONC[ch])
+        else:
+            buf.append(ch)
+    return "".join(buf)
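Note: to make the mutation handling concrete, here is a small illustrative sketch of the module-level helpers added above. The example words and expected outputs are chosen for this sketch, not taken from the diff or its tests:

from spacy.lang.ga.lemmatizer import demutate, unponc

print(demutate("gcat"))    # "cat"   - eclipsis g-c stripped
print(demutate("ghlas"))   # "glas"  - lenition (consonant + h) stripped
print(demutate("tsráid"))  # "sráid" - t-prothesis stripped
print(unponc("ḃí"))        # "bhí"   - overdot (ponc séimhithe) expanded to -h-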
@@ -9,6 +9,8 @@ _exc = {
     "ded'": [{ORTH: "de", NORM: "de"}, {ORTH: "d'", NORM: "do"}],
     "lem'": [{ORTH: "le", NORM: "le"}, {ORTH: "m'", NORM: "mo"}],
     "led'": [{ORTH: "le", NORM: "le"}, {ORTH: "d'", NORM: "do"}],
+    "théis": [{ORTH: "th", NORM: "tar"}, {ORTH: "éis", NORM: "éis"}],
+    "tréis": [{ORTH: "tr", NORM: "tar"}, {ORTH: "éis", NORM: "éis"}],
 }

 for exc_data in [
@@ -646,5 +646,10 @@ _nums = r"(({ne})|({t})|({on})|({c}))({s})?".format(
 )


+for u in "cfkCFK":
+    _exc[f"°{u}"] = [{ORTH: f"°{u}"}]
+    _exc[f"°{u}."] = [{ORTH: f"°{u}"}, {ORTH: "."}]
+
+
 TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
 TOKEN_MATCH = re.compile(r"^{n}$".format(n=_nums)).match
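Note: these special cases keep degree units such as "°C" together as a single token and split only the trailing period. A minimal sketch of the intended effect (assumes these exceptions are loaded by the blank pipeline; the sentence is illustrative):

import spacy

nlp = spacy.blank("fr")
# Intended result: the "°C." special case yields ["°C", "."] rather than splitting "°".
print([t.text for t in nlp("Il fait 22 °C.")])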
@@ -1,4 +1,4 @@
-from typing import Optional
+from typing import Optional, Callable
 from thinc.api import Model

 from .stop_words import STOP_WORDS
@@ -23,13 +23,25 @@ class Italian(Language):
 @Italian.factory(
     "lemmatizer",
     assigns=["token.lemma"],
-    default_config={"model": None, "mode": "pos_lookup", "overwrite": False},
+    default_config={
+        "model": None,
+        "mode": "pos_lookup",
+        "overwrite": False,
+        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
+    },
     default_score_weights={"lemma_acc": 1.0},
 )
 def make_lemmatizer(
-    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
+    nlp: Language,
+    model: Optional[Model],
+    name: str,
+    mode: str,
+    overwrite: bool,
+    scorer: Optional[Callable],
 ):
-    return ItalianLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
+    return ItalianLemmatizer(
+        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
+    )


 __all__ = ["Italian"]
@@ -1,21 +1,25 @@
-from typing import Optional, Union, Dict, Any
+from typing import Optional, Union, Dict, Any, Callable
 from pathlib import Path
 import srsly
 from collections import namedtuple
+from thinc.api import Model
+import re

 from .stop_words import STOP_WORDS
 from .syntax_iterators import SYNTAX_ITERATORS
 from .tag_map import TAG_MAP
 from .tag_orth_map import TAG_ORTH_MAP
 from .tag_bigram_map import TAG_BIGRAM_MAP
-from ...compat import copy_reg
 from ...errors import Errors
 from ...language import Language, BaseDefaults
+from ...pipeline import Morphologizer
+from ...pipeline.morphologizer import DEFAULT_MORPH_MODEL
 from ...scorer import Scorer
 from ...symbols import POS
-from ...tokens import Doc
+from ...tokens import Doc, MorphAnalysis
 from ...training import validate_examples
 from ...util import DummyTokenizer, registry, load_config_from_str
+from ...vocab import Vocab
 from ... import util


@@ -31,16 +35,21 @@ split_mode = null
 @registry.tokenizers("spacy.ja.JapaneseTokenizer")
 def create_tokenizer(split_mode: Optional[str] = None):
     def japanese_tokenizer_factory(nlp):
-        return JapaneseTokenizer(nlp, split_mode=split_mode)
+        return JapaneseTokenizer(nlp.vocab, split_mode=split_mode)

     return japanese_tokenizer_factory


 class JapaneseTokenizer(DummyTokenizer):
-    def __init__(self, nlp: Language, split_mode: Optional[str] = None) -> None:
-        self.vocab = nlp.vocab
+    def __init__(self, vocab: Vocab, split_mode: Optional[str] = None) -> None:
+        self.vocab = vocab
         self.split_mode = split_mode
         self.tokenizer = try_sudachi_import(self.split_mode)
+        # if we're using split mode A we don't need subtokens
+        self.need_subtokens = not (split_mode is None or split_mode == "A")
+
+    def __reduce__(self):
+        return JapaneseTokenizer, (self.vocab, self.split_mode)

     def __call__(self, text: str) -> Doc:
         # convert sudachipy.morpheme.Morpheme to DetailedToken and merge continuous spaces
@@ -49,8 +58,8 @@ class JapaneseTokenizer(DummyTokenizer):
         dtokens, spaces = get_dtokens_and_spaces(dtokens, text)

         # create Doc with tag bi-gram based part-of-speech identification rules
-        words, tags, inflections, lemmas, readings, sub_tokens_list = (
-            zip(*dtokens) if dtokens else [[]] * 6
+        words, tags, inflections, lemmas, norms, readings, sub_tokens_list = (
+            zip(*dtokens) if dtokens else [[]] * 7
         )
         sub_tokens_list = list(sub_tokens_list)
         doc = Doc(self.vocab, words=words, spaces=spaces)
@@ -68,9 +77,18 @@ class JapaneseTokenizer(DummyTokenizer):
             )
             # if there's no lemma info (it's an unk) just use the surface
             token.lemma_ = dtoken.lemma if dtoken.lemma else dtoken.surface
-        doc.user_data["inflections"] = inflections
-        doc.user_data["reading_forms"] = readings
-        doc.user_data["sub_tokens"] = sub_tokens_list
+            morph = {}
+            if dtoken.inf:
+                # it's normal for this to be empty for non-inflecting types
+                morph["Inflection"] = dtoken.inf
+            token.norm_ = dtoken.norm
+            if dtoken.reading:
+                # punctuation is its own reading, but we don't want values like
+                # "=" here
+                morph["Reading"] = re.sub("[=|]", "_", dtoken.reading)
+            token.morph = MorphAnalysis(self.vocab, morph)
+        if self.need_subtokens:
+            doc.user_data["sub_tokens"] = sub_tokens_list
         return doc

     def _get_dtokens(self, sudachipy_tokens, need_sub_tokens: bool = True):
@@ -81,9 +99,10 @@ class JapaneseTokenizer(DummyTokenizer):
             DetailedToken(
                 token.surface(),  # orth
                 "-".join([xx for xx in token.part_of_speech()[:4] if xx != "*"]),  # tag
-                ",".join([xx for xx in token.part_of_speech()[4:] if xx != "*"]),  # inf
+                ";".join([xx for xx in token.part_of_speech()[4:] if xx != "*"]),  # inf
                 token.dictionary_form(),  # lemma
-                token.reading_form(),  # user_data['reading_forms']
+                token.normalized_form(),
+                token.reading_form(),
                 sub_tokens_list[idx]
                 if sub_tokens_list
                 else None,  # user_data['sub_tokens']
@@ -105,9 +124,8 @@ class JapaneseTokenizer(DummyTokenizer):
         ]

     def _get_sub_tokens(self, sudachipy_tokens):
-        if (
-            self.split_mode is None or self.split_mode == "A"
-        ):  # do nothing for default split mode
+        # do nothing for default split mode
+        if not self.need_subtokens:
             return None

         sub_tokens_list = []  # list of (list of list of DetailedToken | None)
@@ -176,9 +194,37 @@ class Japanese(Language):
     Defaults = JapaneseDefaults


+@Japanese.factory(
+    "morphologizer",
+    assigns=["token.morph", "token.pos"],
+    default_config={
+        "model": DEFAULT_MORPH_MODEL,
+        "overwrite": True,
+        "extend": True,
+        "scorer": {"@scorers": "spacy.morphologizer_scorer.v1"},
+    },
+    default_score_weights={
+        "pos_acc": 0.5,
+        "morph_micro_f": 0.5,
+        "morph_per_feat": None,
+    },
+)
+def make_morphologizer(
+    nlp: Language,
+    model: Model,
+    name: str,
+    overwrite: bool,
+    extend: bool,
+    scorer: Optional[Callable],
+):
+    return Morphologizer(
+        nlp.vocab, model, name, overwrite=overwrite, extend=extend, scorer=scorer
+    )
+
+
 # Hold the attributes we need with convenient names
 DetailedToken = namedtuple(
-    "DetailedToken", ["surface", "tag", "inf", "lemma", "reading", "sub_tokens"]
+    "DetailedToken", ["surface", "tag", "inf", "lemma", "norm", "reading", "sub_tokens"]
 )


@@ -254,7 +300,7 @@ def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"):
         return text_dtokens, text_spaces
     elif len([word for word in words if not word.isspace()]) == 0:
         assert text.isspace()
-        text_dtokens = [DetailedToken(text, gap_tag, "", text, None, None)]
+        text_dtokens = [DetailedToken(text, gap_tag, "", text, text, None, None)]
         text_spaces = [False]
         return text_dtokens, text_spaces

@@ -271,7 +317,7 @@ def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"):
         # space token
         if word_start > 0:
             w = text[text_pos : text_pos + word_start]
-            text_dtokens.append(DetailedToken(w, gap_tag, "", w, None, None))
+            text_dtokens.append(DetailedToken(w, gap_tag, "", w, w, None, None))
             text_spaces.append(False)
             text_pos += word_start

@@ -287,16 +333,10 @@ def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"):
     # trailing space token
     if text_pos < len(text):
         w = text[text_pos:]
-        text_dtokens.append(DetailedToken(w, gap_tag, "", w, None, None))
+        text_dtokens.append(DetailedToken(w, gap_tag, "", w, w, None, None))
         text_spaces.append(False)

     return text_dtokens, text_spaces


-def pickle_japanese(instance):
-    return Japanese, tuple()
-
-
-copy_reg.pickle(Japanese, pickle_japanese)
-
-
 __all__ = ["Japanese"]
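Note: with the tokenizer now writing Inflection and Reading into Token.morph (and the normalized form into Token.norm_) instead of Doc.user_data, the annotations can be read per token. A rough sketch, assuming SudachiPy and its dictionary are installed so the blank Japanese pipeline can load; the sample sentence is arbitrary:

import spacy

nlp = spacy.blank("ja")  # uses the JapaneseTokenizer defined above
doc = nlp("私は本を読みました。")
for token in doc:
    # Reading/Inflection now live on token.morph rather than doc.user_data
    print(token.text, token.norm_, token.morph.get("Reading"), token.morph.get("Inflection"))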
@@ -5,11 +5,11 @@ from .tag_map import TAG_MAP
 from .lex_attrs import LEX_ATTRS
 from ...language import Language, BaseDefaults
 from ...tokens import Doc
-from ...compat import copy_reg
 from ...scorer import Scorer
 from ...symbols import POS
 from ...training import validate_examples
 from ...util import DummyTokenizer, registry, load_config_from_str
+from ...vocab import Vocab


 DEFAULT_CONFIG = """
@@ -23,17 +23,20 @@ DEFAULT_CONFIG = """
 @registry.tokenizers("spacy.ko.KoreanTokenizer")
 def create_tokenizer():
     def korean_tokenizer_factory(nlp):
-        return KoreanTokenizer(nlp)
+        return KoreanTokenizer(nlp.vocab)

     return korean_tokenizer_factory


 class KoreanTokenizer(DummyTokenizer):
-    def __init__(self, nlp: Language):
-        self.vocab = nlp.vocab
+    def __init__(self, vocab: Vocab):
+        self.vocab = vocab
         MeCab = try_mecab_import()  # type: ignore[func-returns-value]
         self.mecab_tokenizer = MeCab("-F%f[0],%f[7]")

+    def __reduce__(self):
+        return KoreanTokenizer, (self.vocab,)
+
     def __del__(self):
         self.mecab_tokenizer.__del__()

@@ -106,10 +109,4 @@ def check_spaces(text, tokens):
         yield False


-def pickle_korean(instance):
-    return Korean, tuple()
-
-
-copy_reg.pickle(Korean, pickle_korean)
-
-
 __all__ = ["Korean"]
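Note: the added __reduce__ makes the custom tokenizer picklable, which multiprocessing with nlp.pipe(..., n_process=...) relies on. A minimal sketch, assuming the Korean MeCab dependencies (natto-py and a MeCab dictionary) are installed:

import pickle
import spacy

nlp = spacy.blank("ko")
# Round-trip the tokenizer; __reduce__ reconstructs it from (vocab,).
tok2 = pickle.loads(pickle.dumps(nlp.tokenizer))
print([t.text for t in tok2("안녕하세요.")])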
@@ -3,6 +3,7 @@ import unicodedata
 import re

 from .. import attrs
+from .tokenizer_exceptions import URL_MATCH


 _like_email = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)").match
@@ -109,6 +110,8 @@ def like_url(text: str) -> bool:
         return True
     if tld.isalpha() and tld in _tlds:
         return True
+    if URL_MATCH(text):
+        return True
     return False

@@ -1,4 +1,4 @@
-from typing import Optional
+from typing import Optional, Callable
 from thinc.api import Model
 from .lemmatizer import MacedonianLemmatizer
 from .stop_words import STOP_WORDS
@@ -38,13 +38,25 @@ class Macedonian(Language):
 @Macedonian.factory(
     "lemmatizer",
     assigns=["token.lemma"],
-    default_config={"model": None, "mode": "rule", "overwrite": False},
+    default_config={
+        "model": None,
+        "mode": "rule",
+        "overwrite": False,
+        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
+    },
     default_score_weights={"lemma_acc": 1.0},
 )
 def make_lemmatizer(
-    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
+    nlp: Language,
+    model: Optional[Model],
+    name: str,
+    mode: str,
+    overwrite: bool,
+    scorer: Optional[Callable],
 ):
-    return MacedonianLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
+    return MacedonianLemmatizer(
+        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
+    )


 __all__ = ["Macedonian"]
@@ -1,4 +1,4 @@
-from typing import Optional
+from typing import Optional, Callable
 from thinc.api import Model
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
@@ -26,13 +26,25 @@ class Norwegian(Language):
 @Norwegian.factory(
     "lemmatizer",
     assigns=["token.lemma"],
-    default_config={"model": None, "mode": "rule", "overwrite": False},
+    default_config={
+        "model": None,
+        "mode": "rule",
+        "overwrite": False,
+        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
+    },
     default_score_weights={"lemma_acc": 1.0},
 )
 def make_lemmatizer(
-    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
+    nlp: Language,
+    model: Optional[Model],
+    name: str,
+    mode: str,
+    overwrite: bool,
+    scorer: Optional[Callable],
 ):
-    return Lemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
+    return Lemmatizer(
+        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
+    )


 __all__ = ["Norwegian"]
@@ -1,4 +1,4 @@
-from typing import Optional
+from typing import Optional, Callable

 from thinc.api import Model

@@ -30,13 +30,25 @@ class Dutch(Language):
 @Dutch.factory(
     "lemmatizer",
     assigns=["token.lemma"],
-    default_config={"model": None, "mode": "rule", "overwrite": False},
+    default_config={
+        "model": None,
+        "mode": "rule",
+        "overwrite": False,
+        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
+    },
     default_score_weights={"lemma_acc": 1.0},
 )
 def make_lemmatizer(
-    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
+    nlp: Language,
+    model: Optional[Model],
+    name: str,
+    mode: str,
+    overwrite: bool,
+    scorer: Optional[Callable],
 ):
-    return DutchLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
+    return DutchLemmatizer(
+        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
+    )


 __all__ = ["Dutch"]
@@ -1,4 +1,4 @@
-from typing import Optional
+from typing import Optional, Callable

 from thinc.api import Model

@@ -33,13 +33,25 @@ class Polish(Language):
 @Polish.factory(
     "lemmatizer",
     assigns=["token.lemma"],
-    default_config={"model": None, "mode": "pos_lookup", "overwrite": False},
+    default_config={
+        "model": None,
+        "mode": "pos_lookup",
+        "overwrite": False,
+        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
+    },
     default_score_weights={"lemma_acc": 1.0},
 )
 def make_lemmatizer(
-    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
+    nlp: Language,
+    model: Optional[Model],
+    name: str,
+    mode: str,
+    overwrite: bool,
+    scorer: Optional[Callable],
 ):
-    return PolishLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
+    return PolishLemmatizer(
+        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
+    )


 __all__ = ["Polish"]
@@ -1,6 +1,7 @@
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
+from .syntax_iterators import SYNTAX_ITERATORS
 from .punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES
 from ...language import Language, BaseDefaults

@@ -10,6 +11,7 @@ class PortugueseDefaults(BaseDefaults):
     infixes = TOKENIZER_INFIXES
     prefixes = TOKENIZER_PREFIXES
     lex_attr_getters = LEX_ATTRS
+    syntax_iterators = SYNTAX_ITERATORS
     stop_words = STOP_WORDS


85 spacy/lang/pt/syntax_iterators.py Normal file
@@ -0,0 +1,85 @@
+from typing import Union, Iterator, Tuple
+
+from ...symbols import NOUN, PROPN, PRON
+from ...errors import Errors
+from ...tokens import Doc, Span
+
+
+def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
+    """
+    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
+    """
+    labels = [
+        "nsubj",
+        "nsubj:pass",
+        "obj",
+        "obl",
+        "obl:agent",
+        "nmod",
+        "pcomp",
+        "appos",
+        "ROOT",
+    ]
+    post_modifiers = ["flat", "flat:name", "fixed", "compound"]
+    doc = doclike.doc  # Ensure works on both Doc and Span.
+    if not doc.has_annotation("DEP"):
+        raise ValueError(Errors.E029)
+    np_deps = {doc.vocab.strings.add(label) for label in labels}
+    np_modifs = {doc.vocab.strings.add(modifier) for modifier in post_modifiers}
+    np_label = doc.vocab.strings.add("NP")
+    adj_label = doc.vocab.strings.add("amod")
+    det_label = doc.vocab.strings.add("det")
+    det_pos = doc.vocab.strings.add("DET")
+    adp_label = doc.vocab.strings.add("ADP")
+    conj = doc.vocab.strings.add("conj")
+    conj_pos = doc.vocab.strings.add("CCONJ")
+    prev_end = -1
+    for i, word in enumerate(doclike):
+        if word.pos not in (NOUN, PROPN, PRON):
+            continue
+        # Prevent nested chunks from being produced
+        if word.left_edge.i <= prev_end:
+            continue
+        if word.dep in np_deps:
+            right_childs = list(word.rights)
+            right_child = right_childs[0] if right_childs else None
+
+            if right_child:
+                if (
+                    right_child.dep == adj_label
+                ):  # allow chain of adjectives by expanding to right
+                    right_end = right_child.right_edge
+                elif (
+                    right_child.dep == det_label and right_child.pos == det_pos
+                ):  # cut relative pronouns here
+                    right_end = right_child
+                elif right_child.dep in np_modifs:  # Check if we can expand to right
+                    right_end = word.right_edge
+                else:
+                    right_end = word
+            else:
+                right_end = word
+            prev_end = right_end.i
+
+            left_index = word.left_edge.i
+            left_index = (
+                left_index + 1 if word.left_edge.pos == adp_label else left_index
+            )
+
+            yield left_index, right_end.i + 1, np_label
+        elif word.dep == conj:
+            head = word.head
+            while head.dep == conj and head.head.i < head.i:
+                head = head.head
+            # If the head is an NP, and we're coordinated to it, we're an NP
+            if head.dep in np_deps:
+                prev_end = word.i
+
+                left_index = word.left_edge.i  # eliminate left attached conjunction
+                left_index = (
+                    left_index + 1 if word.left_edge.pos == conj_pos else left_index
+                )
+                yield left_index, word.i + 1, np_label
+
+
+SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}
@@ -1,4 +1,4 @@
-from typing import Optional
+from typing import Optional, Callable
 from thinc.api import Model

 from .stop_words import STOP_WORDS
@@ -22,7 +22,12 @@ class Russian(Language):
 @Russian.factory(
     "lemmatizer",
     assigns=["token.lemma"],
-    default_config={"model": None, "mode": "pymorphy2", "overwrite": False},
+    default_config={
+        "model": None,
+        "mode": "pymorphy2",
+        "overwrite": False,
+        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
+    },
     default_score_weights={"lemma_acc": 1.0},
 )
 def make_lemmatizer(
@@ -31,8 +36,11 @@ def make_lemmatizer(
     name: str,
     mode: str,
     overwrite: bool,
+    scorer: Optional[Callable],
 ):
-    return RussianLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
+    return RussianLemmatizer(
+        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
+    )


 __all__ = ["Russian"]
@@ -1,8 +1,9 @@
-from typing import Optional, List, Dict, Tuple
+from typing import Optional, List, Dict, Tuple, Callable

 from thinc.api import Model

 from ...pipeline import Lemmatizer
+from ...pipeline.lemmatizer import lemmatizer_score
 from ...symbols import POS
 from ...tokens import Token
 from ...vocab import Vocab
@@ -20,6 +21,7 @@ class RussianLemmatizer(Lemmatizer):
         *,
         mode: str = "pymorphy2",
         overwrite: bool = False,
+        scorer: Optional[Callable] = lemmatizer_score,
     ) -> None:
         if mode == "pymorphy2":
             try:
@@ -31,7 +33,9 @@ class RussianLemmatizer(Lemmatizer):
             ) from None
         if getattr(self, "_morph", None) is None:
             self._morph = MorphAnalyzer()
-        super().__init__(vocab, model, name, mode=mode, overwrite=overwrite)
+        super().__init__(
+            vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
+        )

     def pymorphy2_lemmatize(self, token: Token) -> List[str]:
         string = token.text
@@ -1,47 +1,195 @@
 STOP_WORDS = set(
     """
-අතර එච්චර එපමණ එලෙස එවිට ඒ කට කදී කින් ක් ට තුර ත් ද නමුත් නොහොත්
-පමණ පමණි ම මෙච්චර මෙපමණ මෙලෙස මෙවිට මේ ය යි ලදී ලෙස වගේ වන විට විටෙක
-විතර විය වුව වුවත් වුවද වූ සමඟ සහ හා හෙවත් හෝ
+සහ සමග සමඟ අහා ආහ් ආ ඕහෝ අනේ අඳෝ අපොයි අපෝ අයියෝ ආයි ඌයි චී චිහ් චික්
+හෝ දෝ දෝහෝ මෙන් සේ වැනි බඳු වන් අයුරු අයුරින් ලෙස වැඩි ශ්රී හා ය
+නිසා නිසාවෙන් බවට බව බවෙන් නම් වැඩි සිට දී මහා මහ පමණ පමණින් පමන
+වන විට විටින් මේ මෙලෙස මෙයින් ඇති ලෙස සිදු වශයෙන් යන සඳහා මගින් හෝ ඉතා
+ඒ එම ද අතර විසින් සමග පිළිබඳව පිළිබඳ තුළ බව වැනි මහ මෙම මෙහි මේ
+වෙත වෙතින් වෙතට වෙනුවෙන් වෙනුවට වෙන ගැන නෑ අනුව නව පිළිබඳ විශේෂ දැනට
+එහෙන් මෙහෙන් එහේ මෙහේ ම තවත් තව සහ දක්වා ට ගේ එ ක ක් බවත් බවද මත
+ඇතුලු ඇතුළු මෙසේ වඩා වඩාත්ම නිති නිතිත් නිතොර නිතර ඉක්බිති දැන් යලි පුන
+ඉතින් සිට සිටන් පටන් තෙක් දක්වා සා තාක් තුවක් පවා ද හෝ වත් විනා හැර
+මිස මුත් කිම කිම් ඇයි මන්ද හෙවත් නොහොත් පතා පාසා ගානෙ තව ඉතා බොහෝ
+වහා සෙද සැනින් හනික එම්බා එම්බල බොල නම් වනාහි කලී ඉඳුරා අන්න ඔන්න මෙන්න
+උදෙසා පිණිස සඳහා අරබයා නිසා එනිසා එබැවින් බැවින් හෙයින් සේක් සේක ගැන
+අනුව පරිදි විට තෙක් මෙතෙක් මේතාක් තුරු තුරා තුරාවට තුලින් නමුත් එනමුත්
+වස් මෙන් ලෙස පරිදි එහෙත්
 """.split()
 )
@@ -1,4 +1,4 @@
-from typing import Optional
+from typing import Optional, Callable
 from thinc.api import Model
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
@@ -29,13 +29,25 @@ class Swedish(Language):
 @Swedish.factory(
     "lemmatizer",
     assigns=["token.lemma"],
-    default_config={"model": None, "mode": "rule", "overwrite": False},
+    default_config={
+        "model": None,
+        "mode": "rule",
+        "overwrite": False,
+        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
+    },
     default_score_weights={"lemma_acc": 1.0},
 )
 def make_lemmatizer(
-    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
+    nlp: Language,
+    model: Optional[Model],
+    name: str,
+    mode: str,
+    overwrite: bool,
+    scorer: Optional[Callable],
 ):
-    return Lemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
+    return Lemmatizer(
+        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
+    )


 __all__ = ["Swedish"]
@@ -3,6 +3,7 @@ from .lex_attrs import LEX_ATTRS
 from ...language import Language, BaseDefaults
 from ...tokens import Doc
 from ...util import DummyTokenizer, registry, load_config_from_str
+from ...vocab import Vocab


 DEFAULT_CONFIG = """
@@ -16,13 +17,13 @@ DEFAULT_CONFIG = """
 @registry.tokenizers("spacy.th.ThaiTokenizer")
 def create_thai_tokenizer():
     def thai_tokenizer_factory(nlp):
-        return ThaiTokenizer(nlp)
+        return ThaiTokenizer(nlp.vocab)

     return thai_tokenizer_factory


 class ThaiTokenizer(DummyTokenizer):
-    def __init__(self, nlp: Language) -> None:
+    def __init__(self, vocab: Vocab) -> None:
         try:
             from pythainlp.tokenize import word_tokenize
         except ImportError:
@@ -31,7 +32,7 @@ class ThaiTokenizer(DummyTokenizer):
                 "https://github.com/PyThaiNLP/pythainlp"
             ) from None
         self.word_tokenize = word_tokenize
-        self.vocab = nlp.vocab
+        self.vocab = vocab

     def __call__(self, text: str) -> Doc:
         words = list(self.word_tokenize(text))
@@ -2,7 +2,7 @@ from ...attrs import LIKE_NUM

 _num_words = [
     "ዜሮ",
-    "ሐደ",
+    "ሓደ",
     "ክልተ",
     "ሰለስተ",
     "ኣርባዕተ",
@@ -11,66 +11,37 @@ _num_words = [
     "ሸውዓተ",
     "ሽሞንተ",
     "ትሽዓተ",
-    "ኣሰርተ",
-    "ኣሰርተ ሐደ",
-    "ኣሰርተ ክልተ",
-    "ኣሰርተ ሰለስተ",
-    "ኣሰርተ ኣርባዕተ",
-    "ኣሰርተ ሓሙሽተ",
-    "ኣሰርተ ሽድሽተ",
-    "ኣሰርተ ሸውዓተ",
-    "ኣሰርተ ሽሞንተ",
-    "ኣሰርተ ትሽዓተ",
+    "ዓሰርተ",
     "ዕስራ",
     "ሰላሳ",
     "ኣርብዓ",
-    "ሃምሳ",
-    "ስልሳ",
+    "ሓምሳ",
+    "ሱሳ",
     "ሰብዓ",
     "ሰማንያ",
-    "ተስዓ",
+    "ቴስዓ",
     "ሚእቲ",
     "ሺሕ",
     "ሚልዮን",
     "ቢልዮን",
     "ትሪልዮን",
     "ኳድሪልዮን",
-    "ገጅልዮን",
-    "ባዝልዮን",
+    "ጋዚልዮን",
+    "ባዚልዮን",
 ]

+# Tigrinya ordinals above 10 are the same as _num_words but start with "መበል "
 _ordinal_words = [
     "ቀዳማይ",
     "ካልኣይ",
     "ሳልሳይ",
-    "ራብኣይ",
+    "ራብዓይ",
     "ሓምሻይ",
     "ሻድሻይ",
     "ሻውዓይ",
     "ሻምናይ",
-    "ዘጠነኛ",
-    "አስረኛ",
-    "ኣሰርተ አንደኛ",
-    "ኣሰርተ ሁለተኛ",
-    "ኣሰርተ ሶስተኛ",
-    "ኣሰርተ አራተኛ",
-    "ኣሰርተ አምስተኛ",
-    "ኣሰርተ ስድስተኛ",
-    "ኣሰርተ ሰባተኛ",
-    "ኣሰርተ ስምንተኛ",
-    "ኣሰርተ ዘጠነኛ",
-    "ሃያኛ",
-    "ሰላሳኛ" "አርባኛ",
-    "አምሳኛ",
-    "ስድሳኛ",
-    "ሰባኛ",
-    "ሰማንያኛ",
-    "ዘጠናኛ",
-    "መቶኛ",
-    "ሺኛ",
-    "ሚሊዮንኛ",
-    "ቢሊዮንኛ",
-    "ትሪሊዮንኛ",
+    "ታሽዓይ",
+    "ዓስራይ",
 ]


@@ -92,7 +63,7 @@ def like_num(text):
     # Check ordinal number
     if text_lower in _ordinal_words:
         return True
-    if text_lower.endswith("ኛ"):
+    if text_lower.endswith("ይ"):
         if text_lower[:-2].isdigit():
             return True

@@ -1,7 +1,7 @@
 from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
 from ..char_classes import UNITS, ALPHA_UPPER

-_list_punct = LIST_PUNCT + "፡ ። ፣ ፤ ፥ ፦ ፧".strip().split()
+_list_punct = LIST_PUNCT + "፡ ። ፣ ፤ ፥ ፦ ፧ ፠ ፨".strip().split()

 _suffixes = (
     _list_punct
@@ -1,6 +1,27 @@
+# Stop words from Tigrinya Wordcount: https://github.com/fgaim/Tigrinya-WordCount/blob/main/ti_stop_words.txt
+
 # Stop words
 STOP_WORDS = set(
     """
-ግን ግና ንስኻ ንስኺ ንስኻትክን ንስኻትኩም ናትካ ናትኪ ናትክን ናትኩም
+'ምበር 'ሞ 'ቲ 'ታ 'ኳ 'ውን 'ዚ 'የ 'ዩ 'ያ 'ዮም 'ዮን
+ልዕሊ ሒዙ ሒዛ ሕጂ መበል መን መንጎ መጠን ማለት ምስ ምባል
+ምእንቲ ምኽንያቱ ምኽንያት ምዃኑ ምዃንና ምዃኖም
+ስለ ስለዚ ስለዝበላ ሽዑ ቅድሚ በለ በቲ በዚ ብምባል ብተወሳኺ ብኸመይ
+ብዘይ ብዘይካ ብዙሕ ብዛዕባ ብፍላይ ተባሂሉ ነበረ ነቲ ነታ ነቶም
+ነዚ ነይሩ ነገራት ነገር ናብ ናብቲ ናትኩም ናትኪ ናትካ ናትክን
+ናይ ናይቲ ንሕና ንሱ ንሳ ንሳቶም ንስኺ ንስኻ ንስኻትኩም ንስኻትክን ንዓይ
+ኢለ ኢሉ ኢላ ኢልካ ኢሎም ኢና ኢኻ ኢዩ ኣለኹ
+ኣለዉ ኣለዎ ኣሎ ኣብ ኣብቲ ኣብታ ኣብኡ ኣብዚ ኣነ ኣዝዩ ኣይኮነን ኣይኰነን
+እምበር እሞ እተን እቲ እታ እቶም እንተ እንተሎ
+ኣላ እንተኾነ እንታይ እንከሎ እኳ እዋን እውን እዚ እዛ እዞም
+እየ እየን እዩ እያ እዮም
+ከሎ ከመይ ከም ከምቲ ከምኡ ከምዘሎ
+ከምዚ ከኣ ኩሉ ካልእ ካልኦት ካብ ካብቲ ካብቶም ክሳብ ክሳዕ ክብል
+ክንደይ ክንዲ ክኸውን ኮይኑ ኰይኑ ኵሉ ኸም ኸኣ ወይ
+ዋላ ዘለና ዘለዉ ዘለዋ ዘለዎ ዘለዎም ዘላ ዘሎ ዘይብሉ
+ዝርከብ ዝበሃል ዝበለ ዝብል ዝተባህለ ዝተኻየደ ዝተፈላለየ ዝተፈላለዩ
+ዝነበረ ዝነበረት ዝነበሩ ዝካየድ ዝኸውን ዝኽእል ዝኾነ ዝዀነ
+የለን ይቕረብ ይብል ይኸውን ይኹን ይኽእል ደኣ ድሕሪ ድማ
+ገለ ገሊጹ ገና ገይሩ ግና ግን ጥራይ
 """.split()
 )
@@ -250,3 +250,9 @@ o.0
 for orth in emoticons:
     BASE_EXCEPTIONS[orth] = [{ORTH: orth}]
+
+
+# Moved from a suffix setting due to #9155 removing prefixes from consideration
+# for lookbehinds
+for u in "cfkCFK":
+    BASE_EXCEPTIONS[f"°{u}."] = [{ORTH: "°"}, {ORTH: f"{u}"}, {ORTH: "."}]
@@ -1,4 +1,4 @@
-from typing import Optional
+from typing import Optional, Callable

 from thinc.api import Model

@@ -23,13 +23,25 @@ class Ukrainian(Language):
 @Ukrainian.factory(
     "lemmatizer",
     assigns=["token.lemma"],
-    default_config={"model": None, "mode": "pymorphy2", "overwrite": False},
+    default_config={
+        "model": None,
+        "mode": "pymorphy2",
+        "overwrite": False,
+        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
+    },
     default_score_weights={"lemma_acc": 1.0},
 )
 def make_lemmatizer(
-    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
+    nlp: Language,
+    model: Optional[Model],
+    name: str,
+    mode: str,
+    overwrite: bool,
+    scorer: Optional[Callable],
 ):
-    return UkrainianLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
+    return UkrainianLemmatizer(
+        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
+    )


 __all__ = ["Ukrainian"]
@@ -1,8 +1,9 @@
-from typing import Optional
+from typing import Optional, Callable

 from thinc.api import Model

 from ..ru.lemmatizer import RussianLemmatizer
+from ...pipeline.lemmatizer import lemmatizer_score
 from ...vocab import Vocab

@@ -15,6 +16,7 @@ class UkrainianLemmatizer(RussianLemmatizer):
         *,
         mode: str = "pymorphy2",
         overwrite: bool = False,
+        scorer: Optional[Callable] = lemmatizer_score,
     ) -> None:
         if mode == "pymorphy2":
             try:
@@ -27,4 +29,6 @@ class UkrainianLemmatizer(RussianLemmatizer):
                 ) from None
             if getattr(self, "_morph", None) is None:
                 self._morph = MorphAnalyzer(lang="uk")
-        super().__init__(vocab, model, name, mode=mode, overwrite=overwrite)
+        super().__init__(
+            vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
+        )
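A hedged sketch of what the new `scorer` setting buys (assumes spaCy v3.2+ and the `pymorphy2` package installed, since the Ukrainian lemmatizer needs it): the scoring callable is resolved from the scorers registry via the `{"@scorers": "spacy.lemmatizer_scorer.v1"}` entry in `default_config` and ends up on the component as `.scorer`.

    import spacy

    # pymorphy2 must be installed, otherwise add_pipe raises ImportError
    nlp = spacy.blank("uk")
    lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "pymorphy2"})
    # the default config resolves spacy.lemmatizer_scorer.v1 into a callable
    print(callable(lemmatizer.scorer))  # True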
@@ -9,6 +9,7 @@ from .lex_attrs import LEX_ATTRS
 from ...language import Language, BaseDefaults
 from ...tokens import Doc
 from ...util import DummyTokenizer, registry, load_config_from_str
+from ...vocab import Vocab
 from ... import util

@@ -24,14 +25,14 @@ use_pyvi = true
 @registry.tokenizers("spacy.vi.VietnameseTokenizer")
 def create_vietnamese_tokenizer(use_pyvi: bool = True):
     def vietnamese_tokenizer_factory(nlp):
-        return VietnameseTokenizer(nlp, use_pyvi=use_pyvi)
+        return VietnameseTokenizer(nlp.vocab, use_pyvi=use_pyvi)

     return vietnamese_tokenizer_factory


 class VietnameseTokenizer(DummyTokenizer):
-    def __init__(self, nlp: Language, use_pyvi: bool = False):
-        self.vocab = nlp.vocab
+    def __init__(self, vocab: Vocab, use_pyvi: bool = False):
+        self.vocab = vocab
         self.use_pyvi = use_pyvi
         if self.use_pyvi:
             try:
@@ -45,6 +46,9 @@ class VietnameseTokenizer(DummyTokenizer):
             )
             raise ImportError(msg) from None

+    def __reduce__(self):
+        return VietnameseTokenizer, (self.vocab, self.use_pyvi)
+
     def __call__(self, text: str) -> Doc:
         if self.use_pyvi:
             words = self.pyvi_tokenize(text)
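The added `__reduce__` makes the wrapper tokenizer picklable, which matters for multiprocessing. A minimal sketch (assumes the `pyvi` package is installed, since the default Vietnamese config enables it):

    import pickle
    import spacy

    nlp = spacy.blank("vi")  # uses VietnameseTokenizer with use_pyvi=True
    restored = pickle.loads(pickle.dumps(nlp.tokenizer))
    print(type(restored).__name__)  # VietnameseTokenizer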
17 spacy/lang/vi/examples.py Normal file
@@ -0,0 +1,17 @@
+"""
+Example sentences to test spaCy and its language models.
+>>> from spacy.lang.vi.examples import sentences
+>>> docs = nlp.pipe(sentences)
+"""
+
+
+sentences = [
+    "Đây là đâu, tôi là ai?",
+    "Căn phòng có nhiều cửa sổ nên nó khá sáng",
+    "Đại dịch COVID vừa qua đã gây ảnh hưởng rất lớn tới nhiều doanh nghiệp lớn nhỏ.",
+    "Thành phố Hồ Chí Minh đã bị ảnh hưởng nặng nề trong thời gian vừa qua.",
+    "Ông bạn đang ở đâu thế?",
+    "Ai là người giải phóng đất nước Việt Nam khỏi ách đô hộ?",
+    "Vị tướng nào là người đã làm nên chiến thắng lịch sử Điện Biên Phủ?",
+    "Làm việc nhiều chán quá, đi chơi đâu đi?",
+]
@@ -9,11 +9,14 @@ _num_words = [
     "bốn",
     "năm",
     "sáu",
+    "bảy",
     "bẩy",
     "tám",
     "chín",
     "mười",
+    "chục",
     "trăm",
+    "nghìn",
     "tỷ",
 ]
@@ -11,6 +11,7 @@ from ...scorer import Scorer
 from ...tokens import Doc
 from ...training import validate_examples, Example
 from ...util import DummyTokenizer, registry, load_config_from_str
+from ...vocab import Vocab
 from .lex_attrs import LEX_ATTRS
 from .stop_words import STOP_WORDS
 from ... import util

@@ -48,14 +49,14 @@ class Segmenter(str, Enum):
 @registry.tokenizers("spacy.zh.ChineseTokenizer")
 def create_chinese_tokenizer(segmenter: Segmenter = Segmenter.char):
     def chinese_tokenizer_factory(nlp):
-        return ChineseTokenizer(nlp, segmenter=segmenter)
+        return ChineseTokenizer(nlp.vocab, segmenter=segmenter)

     return chinese_tokenizer_factory


 class ChineseTokenizer(DummyTokenizer):
-    def __init__(self, nlp: Language, segmenter: Segmenter = Segmenter.char):
-        self.vocab = nlp.vocab
+    def __init__(self, vocab: Vocab, segmenter: Segmenter = Segmenter.char):
+        self.vocab = vocab
         self.segmenter = (
             segmenter.value if isinstance(segmenter, Segmenter) else segmenter
         )
@@ -115,7 +115,7 @@ class Language:

    Defaults (class): Settings, data and factory methods for creating the `nlp`
        object and processing pipeline.
-   lang (str): Two-letter language ID, i.e. ISO code.
+   lang (str): IETF language code, such as 'en'.

    DOCS: https://spacy.io/api/language
    """
@@ -228,6 +228,7 @@ class Language:
                "vectors": len(self.vocab.vectors),
                "keys": self.vocab.vectors.n_keys,
                "name": self.vocab.vectors.name,
+               "mode": self.vocab.vectors.mode,
            }
            self._meta["labels"] = dict(self.pipe_labels)
        # TODO: Adding this back to prevent breaking people's code etc., but
@@ -700,7 +701,8 @@ class Language:
        if (
            self.vocab.vectors.shape != source.vocab.vectors.shape
            or self.vocab.vectors.key2row != source.vocab.vectors.key2row
-           or self.vocab.vectors.to_bytes() != source.vocab.vectors.to_bytes()
+           or self.vocab.vectors.to_bytes(exclude=["strings"])
+           != source.vocab.vectors.to_bytes(exclude=["strings"])
        ):
            warnings.warn(Warnings.W113.format(name=source_name))
        if source_name not in source.component_names:
@@ -978,7 +980,7 @@ class Language:

    def __call__(
        self,
-       text: str,
+       text: Union[str, Doc],
        *,
        disable: Iterable[str] = SimpleFrozenList(),
        component_cfg: Optional[Dict[str, Dict[str, Any]]] = None,
@@ -987,7 +989,9 @@ class Language:
        and can contain arbitrary whitespace. Alignment into the original string
        is preserved.

-       text (str): The text to be processed.
+       text (Union[str, Doc]): If `str`, the text to be processed. If `Doc`,
+           the doc will be passed directly to the pipeline, skipping
+           `Language.make_doc`.
        disable (List[str]): Names of the pipeline components to disable.
        component_cfg (Dict[str, dict]): An optional dictionary with extra
            keyword arguments for specific components.
@@ -995,7 +999,7 @@ class Language:

        DOCS: https://spacy.io/api/language#call
        """
-       doc = self.make_doc(text)
+       doc = self._ensure_doc(text)
        if component_cfg is None:
            component_cfg = {}
        for name, proc in self.pipeline:
@@ -1080,6 +1084,20 @@ class Language:
            )
        return self.tokenizer(text)

+   def _ensure_doc(self, doc_like: Union[str, Doc]) -> Doc:
+       """Create a Doc if need be, or raise an error if the input is not a Doc or a string."""
+       if isinstance(doc_like, Doc):
+           return doc_like
+       if isinstance(doc_like, str):
+           return self.make_doc(doc_like)
+       raise ValueError(Errors.E866.format(type=type(doc_like)))
+
+   def _ensure_doc_with_context(self, doc_like: Union[str, Doc], context: Any) -> Doc:
+       """Create a Doc if need be and add as_tuples context, or raise an error if the input is not a Doc or a string."""
+       doc = self._ensure_doc(doc_like)
+       doc._context = context
+       return doc
+
    def update(
        self,
        examples: Iterable[Example],
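With `_ensure_doc` in place, `nlp()` accepts an already-constructed `Doc` and skips `make_doc`. A minimal sketch (blank English pipeline, no trained components needed):

    import spacy
    from spacy.tokens import Doc

    nlp = spacy.blank("en")
    doc = Doc(nlp.vocab, words=["Berlin", "is", "nice"])
    processed = nlp(doc)     # the Doc goes straight to the pipeline components
    assert processed is doc  # no components in a blank pipeline, same Doc back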
@@ -1267,9 +1285,9 @@ class Language:
            )
        except IOError:
            raise IOError(Errors.E884.format(vectors=I["vectors"]))
-       if self.vocab.vectors.data.shape[1] >= 1:
+       if self.vocab.vectors.shape[1] >= 1:
            ops = get_current_ops()
-           self.vocab.vectors.data = ops.asarray(self.vocab.vectors.data)
+           self.vocab.vectors.to_ops(ops)
        if hasattr(self.tokenizer, "initialize"):
            tok_settings = validate_init_settings(
                self.tokenizer.initialize,  # type: ignore[union-attr]
@@ -1314,8 +1332,8 @@ class Language:
        DOCS: https://spacy.io/api/language#resume_training
        """
        ops = get_current_ops()
-       if self.vocab.vectors.data.shape[1] >= 1:
-           self.vocab.vectors.data = ops.asarray(self.vocab.vectors.data)
+       if self.vocab.vectors.shape[1] >= 1:
+           self.vocab.vectors.to_ops(ops)
        for name, proc in self.pipeline:
            if hasattr(proc, "_rehearsal_model"):
                proc._rehearsal_model = deepcopy(proc.model)  # type: ignore[attr-defined]
@@ -1386,20 +1404,13 @@ class Language:
        for eg in examples:
            self.make_doc(eg.reference.text)
        # apply all pipeline components
-       for name, pipe in self.pipeline:
-           kwargs = component_cfg.get(name, {})
-           kwargs.setdefault("batch_size", batch_size)
-           for doc, eg in zip(
-               _pipe(
-                   (eg.predicted for eg in examples),
-                   proc=pipe,
-                   name=name,
-                   default_error_handler=self.default_error_handler,
-                   kwargs=kwargs,
-               ),
-               examples,
-           ):
-               eg.predicted = doc
+       docs = self.pipe(
+           (eg.predicted for eg in examples),
+           batch_size=batch_size,
+           component_cfg=component_cfg,
+       )
+       for eg, doc in zip(examples, docs):
+           eg.predicted = doc
        end_time = timer()
        results = scorer.score(examples)
        n_words = sum(len(eg.predicted) for eg in examples)
@@ -1450,7 +1461,7 @@ class Language:
    @overload
    def pipe(
        self,
-       texts: Iterable[str],
+       texts: Iterable[Union[str, Doc]],
        *,
        as_tuples: Literal[False] = ...,
        batch_size: Optional[int] = ...,
@@ -1463,7 +1474,7 @@ class Language:
    @overload
    def pipe(  # noqa: F811
        self,
-       texts: Iterable[Tuple[str, _AnyContext]],
+       texts: Iterable[Tuple[Union[str, Doc], _AnyContext]],
        *,
        as_tuples: Literal[True] = ...,
        batch_size: Optional[int] = ...,
@@ -1475,7 +1486,9 @@ class Language:

    def pipe(  # noqa: F811
        self,
-       texts: Union[Iterable[str], Iterable[Tuple[str, _AnyContext]]],
+       texts: Union[
+           Iterable[Union[str, Doc]], Iterable[Tuple[Union[str, Doc], _AnyContext]]
+       ],
        *,
        as_tuples: bool = False,
        batch_size: Optional[int] = None,
@@ -1485,7 +1498,8 @@ class Language:
    ) -> Union[Iterator[Doc], Iterator[Tuple[Doc, _AnyContext]]]:
        """Process texts as a stream, and yield `Doc` objects in order.

-       texts (Iterable[str]): A sequence of texts to process.
+       texts (Iterable[Union[str, Doc]]): A sequence of texts or docs to
+           process.
        as_tuples (bool): If set to True, inputs should be a sequence of
            (text, context) tuples. Output will then be a sequence of
            (doc, context) tuples. Defaults to False.
@@ -1500,23 +1514,24 @@ class Language:
        """
        # Handle texts with context as tuples
        if as_tuples:
-           texts = cast(Iterable[Tuple[str, _AnyContext]], texts)
-           text_context1, text_context2 = itertools.tee(texts)
-           texts = (tc[0] for tc in text_context1)
-           contexts = (tc[1] for tc in text_context2)
+           texts = cast(Iterable[Tuple[Union[str, Doc], _AnyContext]], texts)
+           docs_with_contexts = (
+               self._ensure_doc_with_context(text, context) for text, context in texts
+           )
            docs = self.pipe(
-               texts,
+               docs_with_contexts,
                batch_size=batch_size,
                disable=disable,
                n_process=n_process,
                component_cfg=component_cfg,
            )
-           for doc, context in zip(docs, contexts):
+           for doc in docs:
+               context = doc._context
+               doc._context = None
                yield (doc, context)
            return

-       # At this point, we know that we're dealing with an iterable of plain texts
-       texts = cast(Iterable[str], texts)
+       texts = cast(Iterable[Union[str, Doc]], texts)

        # Set argument defaults
        if n_process == -1:
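Because the context now rides along on `doc._context`, `nlp.pipe(..., as_tuples=True)` works uniformly for string and `Doc` inputs and across worker processes. A small sketch (blank English pipeline):

    import spacy

    nlp = spacy.blank("en")
    data = [("A first text", {"id": 1}), ("A second text", {"id": 2})]
    for doc, ctx in nlp.pipe(data, as_tuples=True):
        print(ctx["id"], doc[0].text)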
@@ -1551,7 +1566,7 @@ class Language:
            docs = self._multiprocessing_pipe(texts, pipes, n_process, batch_size)
        else:
            # if n_process == 1, no processes are forked.
-           docs = (self.make_doc(text) for text in texts)
+           docs = (self._ensure_doc(text) for text in texts)
            for pipe in pipes:
                docs = pipe(docs)
        for doc in docs:
@@ -1570,7 +1585,7 @@ class Language:

    def _multiprocessing_pipe(
        self,
-       texts: Iterable[str],
+       texts: Iterable[Union[str, Doc]],
        pipes: Iterable[Callable[..., Iterator[Doc]]],
        n_process: int,
        batch_size: int,
@@ -1596,7 +1611,7 @@ class Language:
        procs = [
            mp.Process(
                target=_apply_pipes,
-               args=(self.make_doc, pipes, rch, sch, Underscore.get_state()),
+               args=(self._ensure_doc, pipes, rch, sch, Underscore.get_state()),
            )
            for rch, sch in zip(texts_q, bytedocs_send_ch)
        ]
@@ -1609,11 +1624,12 @@ class Language:
            recv.recv() for recv in cycle(bytedocs_recv_ch)
        )
        try:
-           for i, (_, (byte_doc, byte_error)) in enumerate(
+           for i, (_, (byte_doc, byte_context, byte_error)) in enumerate(
                zip(raw_texts, byte_tuples), 1
            ):
                if byte_doc is not None:
                    doc = Doc(self.vocab).from_bytes(byte_doc)
+                   doc._context = byte_context
                    yield doc
                elif byte_error is not None:
                    error = srsly.msgpack_loads(byte_error)
@@ -1800,7 +1816,9 @@ class Language:
            )
            if model not in source_nlp_vectors_hashes:
                source_nlp_vectors_hashes[model] = hash(
-                   source_nlps[model].vocab.vectors.to_bytes()
+                   source_nlps[model].vocab.vectors.to_bytes(
+                       exclude=["strings"]
+                   )
                )
            if "_sourced_vectors_hashes" not in nlp.meta:
                nlp.meta["_sourced_vectors_hashes"] = {}
@@ -2138,7 +2156,7 @@ def _copy_examples(examples: Iterable[Example]) -> List[Example]:


 def _apply_pipes(
-    make_doc: Callable[[str], Doc],
+    ensure_doc: Callable[[Union[str, Doc]], Doc],
     pipes: Iterable[Callable[..., Iterator[Doc]]],
     receiver,
     sender,
@@ -2146,7 +2164,8 @@ def _apply_pipes(
 ) -> None:
     """Worker for Language.pipe

-    make_doc (Callable[[str,] Doc]): Function to create Doc from text.
+    ensure_doc (Callable[[Union[str, Doc]], Doc]): Function to create Doc from text
+        or raise an error if the input is neither a Doc nor a string.
     pipes (Iterable[Pipe]): The components to apply.
     receiver (multiprocessing.Connection): Pipe to receive text. Usually
         created by `multiprocessing.Pipe()`
@@ -2159,16 +2178,16 @@ def _apply_pipes(
     while True:
         try:
             texts = receiver.get()
-            docs = (make_doc(text) for text in texts)
+            docs = (ensure_doc(text) for text in texts)
             for pipe in pipes:
                 docs = pipe(docs)  # type: ignore[arg-type, assignment]
             # Connection does not accept unpickable objects, so send list.
-            byte_docs = [(doc.to_bytes(), None) for doc in docs]
-            padding = [(None, None)] * (len(texts) - len(byte_docs))
+            byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]
+            padding = [(None, None, None)] * (len(texts) - len(byte_docs))
             sender.send(byte_docs + padding)  # type: ignore[operator]
         except Exception:
-            error_msg = [(None, srsly.msgpack_dumps(traceback.format_exc()))]
-            padding = [(None, None)] * (len(texts) - 1)
+            error_msg = [(None, None, srsly.msgpack_dumps(traceback.format_exc()))]
+            padding = [(None, None, None)] * (len(texts) - 1)
             sender.send(error_msg + padding)
@@ -19,7 +19,7 @@ class Lexeme:
    @property
    def vector_norm(self) -> float: ...
    vector: Floats1d
-   rank: str
+   rank: int
    sentiment: float
    @property
    def orth_(self) -> str: ...

@@ -130,8 +130,10 @@ cdef class Lexeme:
            return 0.0
        vector = self.vector
        xp = get_array_module(vector)
-       return (xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm))
+       result = xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm)
+       # ensure we get a scalar back (numpy does this automatically but cupy doesn't)
+       return result.item()

    @property
    def has_vector(self):
        """RETURNS (bool): Whether a word vector is associated with the object.

@@ -284,7 +286,7 @@ cdef class Lexeme:
        def __get__(self):
            return self.vocab.strings[self.c.lower]

-       def __set__(self, unicode x):
+       def __set__(self, str x):
            self.c.lower = self.vocab.strings.add(x)

    property norm_:
@@ -294,7 +296,7 @@ cdef class Lexeme:
        def __get__(self):
            return self.vocab.strings[self.c.norm]

-       def __set__(self, unicode x):
+       def __set__(self, str x):
            self.norm = self.vocab.strings.add(x)

    property shape_:
@@ -304,7 +306,7 @@ cdef class Lexeme:
        def __get__(self):
            return self.vocab.strings[self.c.shape]

-       def __set__(self, unicode x):
+       def __set__(self, str x):
            self.c.shape = self.vocab.strings.add(x)

    property prefix_:
@@ -314,7 +316,7 @@ cdef class Lexeme:
        def __get__(self):
            return self.vocab.strings[self.c.prefix]

-       def __set__(self, unicode x):
+       def __set__(self, str x):
            self.c.prefix = self.vocab.strings.add(x)

    property suffix_:
@@ -324,7 +326,7 @@ cdef class Lexeme:
        def __get__(self):
            return self.vocab.strings[self.c.suffix]

-       def __set__(self, unicode x):
+       def __set__(self, str x):
            self.c.suffix = self.vocab.strings.add(x)

    property lang_:
@@ -332,7 +334,7 @@ cdef class Lexeme:
        def __get__(self):
            return self.vocab.strings[self.c.lang]

-       def __set__(self, unicode x):
+       def __set__(self, str x):
            self.c.lang = self.vocab.strings.add(x)

    property flags:
@@ -148,9 +148,9 @@ cdef class DependencyMatcher:
        Creates a token key to be used by the matcher
        """
        return self._normalize_key(
-           unicode(key) + DELIMITER +
-           unicode(pattern_idx) + DELIMITER +
-           unicode(token_idx)
+           str(key) + DELIMITER +
+           str(pattern_idx) + DELIMITER +
+           str(token_idx)
        )

    def add(self, key, patterns, *, on_match=None):
@@ -424,7 +424,7 @@ cdef class DependencyMatcher:
        return [doc[child.i] for child in doc[node].head.children if child.i < node]

    def _normalize_key(self, key):
-       if isinstance(key, basestring):
+       if isinstance(key, str):
            return self.vocab.strings.add(key)
        else:
            return key
@@ -18,7 +18,7 @@ from ..tokens.doc cimport Doc, get_token_attr_for_matcher
 from ..tokens.span cimport Span
 from ..tokens.token cimport Token
 from ..tokens.morphanalysis cimport MorphAnalysis
-from ..attrs cimport ID, attr_id_t, NULL_ATTR, ORTH, POS, TAG, DEP, LEMMA, MORPH
+from ..attrs cimport ID, attr_id_t, NULL_ATTR, ORTH, POS, TAG, DEP, LEMMA, MORPH, ENT_IOB

 from ..schemas import validate_token_pattern
 from ..errors import Errors, MatchPatternError, Warnings
@@ -312,7 +312,7 @@ cdef class Matcher:
        return final_results

    def _normalize_key(self, key):
-       if isinstance(key, basestring):
+       if isinstance(key, str):
            return self.vocab.strings.add(key)
        else:
            return key
@@ -360,7 +360,7 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
    for i, token in enumerate(doclike):
        for name, index in extensions.items():
            value = token._.get(name)
-           if isinstance(value, basestring):
+           if isinstance(value, str):
                value = token.vocab.strings[value]
            extra_attr_values[i * nr_extra_attr + index] = value
    # Main loop
@@ -786,7 +786,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
 def _get_attr_values(spec, string_store):
     attr_values = []
     for attr, value in spec.items():
-        if isinstance(attr, basestring):
+        if isinstance(attr, str):
             attr = attr.upper()
             if attr == '_':
                 continue
@@ -797,8 +797,11 @@ def _get_attr_values(spec, string_store):
             if attr == "IS_SENT_START":
                 attr = "SENT_START"
             attr = IDS.get(attr)
-        if isinstance(value, basestring):
-            value = string_store.add(value)
+        if isinstance(value, str):
+            if attr == ENT_IOB and value in Token.iob_strings():
+                value = Token.iob_strings().index(value)
+            else:
+                value = string_store.add(value)
         elif isinstance(value, bool):
             value = int(value)
         elif isinstance(value, int):
@@ -938,7 +941,7 @@ def _get_extra_predicates(spec, extra_predicates, vocab):
     seen_predicates = {pred.key: pred.i for pred in extra_predicates}
     output = []
     for attr, value in spec.items():
-        if isinstance(attr, basestring):
+        if isinstance(attr, str):
             if attr == "_":
                 output.extend(
                     _get_extension_extra_predicates(
@@ -995,7 +998,7 @@ def _get_operators(spec):
              "?": (ZERO_ONE,), "1": (ONE,), "!": (ZERO,)}
    # Fix casing
    spec = {key.upper(): values for key, values in spec.items()
-           if isinstance(key, basestring)}
+           if isinstance(key, str)}
    if "OP" not in spec:
        return (ONE,)
    elif spec["OP"] in lookup:
@@ -1013,7 +1016,7 @@ def _get_extensions(spec, string_store, name2index):
        if isinstance(value, dict):
            # Handle predicates (e.g. "IN", in the extra_predicates, not here.
            continue
-       if isinstance(value, basestring):
+       if isinstance(value, str):
            value = string_store.add(value)
        if name not in name2index:
            name2index[name] = len(name2index)
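With `ENT_IOB` wired into the Matcher, token patterns can use the string IOB codes directly; the matcher translates them to the internal integer values via `Token.iob_strings()`. A small hedged sketch:

    import spacy
    from spacy.matcher import Matcher
    from spacy.tokens import Span

    nlp = spacy.blank("en")
    doc = nlp("I like London")
    doc.ents = [Span(doc, 2, 3, label="GPE")]

    matcher = Matcher(nlp.vocab)
    # "B" is mapped to the matching entry of Token.iob_strings() internally
    matcher.add("ENT_START", [[{"ENT_IOB": "B"}]])
    print([doc[start:end].text for _, start, end in matcher(doc)])  # ['London']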
@@ -8,12 +8,9 @@ class PhraseMatcher:
     def __init__(
         self, vocab: Vocab, attr: Optional[Union[int, str]], validate: bool = ...
     ) -> None: ...
-    def __call__(
-        self,
-        doclike: Union[Doc, Span],
-        *,
-        as_spans: bool = ...,
-    ) -> Union[List[Tuple[int, int, int]], List[Span]]: ...
+    def __reduce__(self) -> Any: ...
+    def __len__(self) -> int: ...
+    def __contains__(self, key: str) -> bool: ...
     def add(
         self,
         key: str,
@@ -23,3 +20,10 @@ class PhraseMatcher:
             Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any]
         ] = ...,
     ) -> None: ...
+    def remove(self, key: str) -> None: ...
+    def __call__(
+        self,
+        doclike: Union[Doc, Span],
+        *,
+        as_spans: bool = ...,
+    ) -> Union[List[Tuple[int, int, int]], List[Span]]: ...
@@ -28,7 +28,13 @@ def forward(
     X, spans = source_spans
     assert spans.dataXd.ndim == 2
     indices = _get_span_indices(ops, spans, X.lengths)
-    Y = Ragged(X.dataXd[indices], spans.dataXd[:, 1] - spans.dataXd[:, 0])  # type: ignore[arg-type, index]
+    if len(indices) > 0:
+        Y = Ragged(X.dataXd[indices], spans.dataXd[:, 1] - spans.dataXd[:, 0])  # type: ignore[arg-type, index]
+    else:
+        Y = Ragged(
+            ops.xp.zeros(X.dataXd.shape, dtype=X.dataXd.dtype),
+            ops.xp.zeros((len(X.lengths),), dtype="i"),
+        )
     x_shape = X.dataXd.shape
     x_lengths = X.lengths

@@ -53,7 +59,7 @@ def _get_span_indices(ops, spans: Ragged, lengths: Ints1d) -> Ints1d:
         for j in range(spans_i.shape[0]):
             indices.append(ops.xp.arange(spans_i[j, 0], spans_i[j, 1]))  # type: ignore[call-overload, index]
         offset += length
-    return ops.flatten(indices)
+    return ops.flatten(indices, dtype="i", ndim_if_empty=1)


 def _ensure_cpu(spans: Ragged, lengths: Ints1d) -> Tuple[Ragged, Ints1d]:
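The `ndim_if_empty=1` argument matters exactly for the case the new branch guards against: flattening an empty list of index arrays. A tiny illustrative sketch of the thinc call (assumes a recent thinc 8.x):

    from thinc.api import NumpyOps

    ops = NumpyOps()
    empty = ops.flatten([], dtype="i", ndim_if_empty=1)
    print(empty.shape)  # (0,) -- a 1d integer array instead of the default 2d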
@@ -23,7 +23,7 @@ def create_pretrain_vectors(
     maxout_pieces: int, hidden_size: int, loss: str
 ) -> Callable[["Vocab", Model], Model]:
     def create_vectors_objective(vocab: "Vocab", tok2vec: Model) -> Model:
-        if vocab.vectors.data.shape[1] == 0:
+        if vocab.vectors.shape[1] == 0:
             raise ValueError(Errors.E875)
         model = build_cloze_multi_task_model(
             vocab, tok2vec, hidden_size=hidden_size, maxout_pieces=maxout_pieces
@@ -116,7 +116,7 @@ def build_multi_task_model(
 def build_cloze_multi_task_model(
     vocab: "Vocab", tok2vec: Model, maxout_pieces: int, hidden_size: int
 ) -> Model:
-    nO = vocab.vectors.data.shape[1]
+    nO = vocab.vectors.shape[1]
     output_layer = chain(
         cast(Model[List["Floats2d"], Floats2d], list2array()),
         Maxout(
@@ -53,7 +53,7 @@ def build_hash_embed_cnn_tok2vec(
     window_size (int): The number of tokens on either side to concatenate during
         the convolutions. The receptive field of the CNN will be
         depth * (window_size * 2 + 1), so a 4-layer network with window_size of
-        2 will be sensitive to 17 words at a time. Recommended value is 1.
+        2 will be sensitive to 20 words at a time. Recommended value is 1.
     embed_size (int): The number of rows in the hash embedding tables. This can
         be surprisingly small, due to the use of the hash embeddings. Recommended
         values are between 2000 and 10000.
@@ -123,7 +123,7 @@ def MultiHashEmbed(
     attributes are NORM, PREFIX, SUFFIX and SHAPE. This lets the model take into
     account some subword information, without constructing a fully character-based
     representation. If pretrained vectors are available, they can be included in
-    the representation as well, with the vectors table will be kept static
+    the representation as well, with the vectors table kept static
     (i.e. it's not updated).

     The `width` parameter specifies the output width of the layer and the widths
@@ -1,11 +1,13 @@
-from typing import List, Tuple, Callable, Optional, cast
+from typing import List, Tuple, Callable, Optional, Sequence, cast
 from thinc.initializers import glorot_uniform_init
 from thinc.util import partial
-from thinc.types import Ragged, Floats2d, Floats1d
+from thinc.types import Ragged, Floats2d, Floats1d, Ints1d
 from thinc.api import Model, Ops, registry

 from ..tokens import Doc
 from ..errors import Errors
+from ..vectors import Mode
+from ..vocab import Vocab


 @registry.layers("spacy.StaticVectors.v2")
@@ -34,20 +36,32 @@ def StaticVectors(
 def forward(
     model: Model[List[Doc], Ragged], docs: List[Doc], is_train: bool
 ) -> Tuple[Ragged, Callable]:
-    if not sum(len(doc) for doc in docs):
+    token_count = sum(len(doc) for doc in docs)
+    if not token_count:
         return _handle_empty(model.ops, model.get_dim("nO"))
-    key_attr = model.attrs["key_attr"]
-    W = cast(Floats2d, model.ops.as_contig(model.get_param("W")))
-    V = cast(Floats2d, model.ops.asarray(docs[0].vocab.vectors.data))
-    rows = model.ops.flatten(
-        [doc.vocab.vectors.find(keys=doc.to_array(key_attr)) for doc in docs]
+    key_attr: int = model.attrs["key_attr"]
+    keys: Ints1d = model.ops.flatten(
+        cast(Sequence, [doc.to_array(key_attr) for doc in docs])
     )
+    vocab: Vocab = docs[0].vocab
+    W = cast(Floats2d, model.ops.as_contig(model.get_param("W")))
+    if vocab.vectors.mode == Mode.default:
+        V = cast(Floats2d, model.ops.asarray(vocab.vectors.data))
+        rows = vocab.vectors.find(keys=keys)
+        V = model.ops.as_contig(V[rows])
+    elif vocab.vectors.mode == Mode.floret:
+        V = cast(Floats2d, vocab.vectors.get_batch(keys))
+        V = model.ops.as_contig(V)
+    else:
+        raise RuntimeError(Errors.E896)
     try:
-        vectors_data = model.ops.gemm(model.ops.as_contig(V[rows]), W, trans2=True)
+        vectors_data = model.ops.gemm(V, W, trans2=True)
     except ValueError:
         raise RuntimeError(Errors.E896)
-    # Convert negative indices to 0-vectors (TODO: more options for UNK tokens)
-    vectors_data[rows < 0] = 0
+    if vocab.vectors.mode == Mode.default:
+        # Convert negative indices to 0-vectors
+        # TODO: more options for UNK tokens
+        vectors_data[rows < 0] = 0
     output = Ragged(
         vectors_data, model.ops.asarray([len(doc) for doc in docs], dtype="i")  # type: ignore
     )
@@ -63,7 +77,7 @@ def forward(
         model.inc_grad(
             "W",
             model.ops.gemm(
-                cast(Floats2d, d_output.data), model.ops.as_contig(V[rows]), trans1=True
+                cast(Floats2d, d_output.data), model.ops.as_contig(V), trans1=True
             ),
         )
         return []
@@ -80,7 +94,7 @@ def init(
     nM = model.get_dim("nM") if model.has_dim("nM") else None
     nO = model.get_dim("nO") if model.has_dim("nO") else None
     if X is not None and len(X):
-        nM = X[0].vocab.vectors.data.shape[1]
+        nM = X[0].vocab.vectors.shape[1]
     if Y is not None:
         nO = Y.data.shape[1]
|
||||||
|
from cython.operator cimport dereference as deref, preincrement as incr
|
||||||
from libc.string cimport memcpy, memset
|
from libc.string cimport memcpy, memset
|
||||||
from libc.stdlib cimport calloc, free
|
from libc.stdlib cimport calloc, free
|
||||||
from libc.stdint cimport uint32_t, uint64_t
|
from libc.stdint cimport uint32_t, uint64_t
|
||||||
|
@ -185,16 +186,20 @@ cdef cppclass StateC:
|
||||||
int L(int head, int idx) nogil const:
|
int L(int head, int idx) nogil const:
|
||||||
if idx < 1 or this._left_arcs.size() == 0:
|
if idx < 1 or this._left_arcs.size() == 0:
|
||||||
return -1
|
return -1
|
||||||
cdef vector[int] lefts
|
|
||||||
for i in range(this._left_arcs.size()):
|
# Work backwards through left-arcs to find the arc at the
|
||||||
arc = this._left_arcs.at(i)
|
# requested index more quickly.
|
||||||
|
cdef size_t child_index = 0
|
||||||
|
it = this._left_arcs.const_rbegin()
|
||||||
|
while it != this._left_arcs.rend():
|
||||||
|
arc = deref(it)
|
||||||
if arc.head == head and arc.child != -1 and arc.child < head:
|
if arc.head == head and arc.child != -1 and arc.child < head:
|
||||||
lefts.push_back(arc.child)
|
child_index += 1
|
||||||
idx = (<int>lefts.size()) - idx
|
if child_index == idx:
|
||||||
if idx < 0:
|
return arc.child
|
||||||
return -1
|
incr(it)
|
||||||
else:
|
|
||||||
return lefts.at(idx)
|
return -1
|
||||||
|
|
||||||
int R(int head, int idx) nogil const:
|
int R(int head, int idx) nogil const:
|
||||||
if idx < 1 or this._right_arcs.size() == 0:
|
if idx < 1 or this._right_arcs.size() == 0:
|
||||||
|
|
|
@@ -17,7 +17,7 @@ from ...errors import Errors
 from thinc.extra.search cimport Beam

 cdef weight_t MIN_SCORE = -90000
-cdef attr_t SUBTOK_LABEL = hash_string(u'subtok')
+cdef attr_t SUBTOK_LABEL = hash_string('subtok')

 DEF NON_MONOTONIC = True

@@ -585,7 +585,10 @@ cdef class ArcEager(TransitionSystem):
            actions[RIGHT][label] = 1
            actions[REDUCE][label] = 1
        for example in kwargs.get('examples', []):
-           heads, labels = example.get_aligned_parse(projectivize=True)
+           # use heads and labels from the reference parse (without regard to
+           # misalignments between the predicted and reference)
+           example_gold_preproc = Example(example.reference, example.reference)
+           heads, labels = example_gold_preproc.get_aligned_parse(projectivize=True)
            for child, (head, label) in enumerate(zip(heads, labels)):
                if head is None or label is None:
                    continue
@@ -601,7 +604,7 @@ cdef class ArcEager(TransitionSystem):
        actions[SHIFT][''] += 1
        if min_freq is not None:
            for action, label_freqs in actions.items():
-               for label, freq in list(label_freqs.items()):
+               for label, freq in label_freqs.copy().items():
                    if freq < min_freq:
                        label_freqs.pop(label)
        # Ensure these actions are present
@@ -5,15 +5,15 @@ from pathlib import Path

 from .pipe import Pipe
 from ..errors import Errors
-from ..training import validate_examples, Example
+from ..training import Example
 from ..language import Language
 from ..matcher import Matcher
 from ..scorer import Scorer
-from ..symbols import IDS, TAG, POS, MORPH, LEMMA
+from ..symbols import IDS
 from ..tokens import Doc, Span
 from ..tokens._retokenize import normalize_token_attrs, set_token_attrs
 from ..vocab import Vocab
-from ..util import SimpleFrozenList
+from ..util import SimpleFrozenList, registry
 from .. import util


@@ -23,9 +23,41 @@ TagMapType = Dict[str, Dict[Union[int, str], Union[int, str]]]
 MorphRulesType = Dict[str, Dict[str, Dict[Union[int, str], Union[int, str]]]]


-@Language.factory("attribute_ruler", default_config={"validate": False})
-def make_attribute_ruler(nlp: Language, name: str, validate: bool):
-    return AttributeRuler(nlp.vocab, name, validate=validate)
+@Language.factory(
+    "attribute_ruler",
+    default_config={
+        "validate": False,
+        "scorer": {"@scorers": "spacy.attribute_ruler_scorer.v1"},
+    },
+)
+def make_attribute_ruler(
+    nlp: Language, name: str, validate: bool, scorer: Optional[Callable]
+):
+    return AttributeRuler(nlp.vocab, name, validate=validate, scorer=scorer)
+
+
+def attribute_ruler_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
+    def morph_key_getter(token, attr):
+        return getattr(token, attr).key
+
+    results = {}
+    results.update(Scorer.score_token_attr(examples, "tag", **kwargs))
+    results.update(Scorer.score_token_attr(examples, "pos", **kwargs))
+    results.update(
+        Scorer.score_token_attr(examples, "morph", getter=morph_key_getter, **kwargs)
+    )
+    results.update(
+        Scorer.score_token_attr_per_feat(
+            examples, "morph", getter=morph_key_getter, **kwargs
+        )
+    )
+    results.update(Scorer.score_token_attr(examples, "lemma", **kwargs))
+    return results
+
+
+@registry.scorers("spacy.attribute_ruler_scorer.v1")
+def make_attribute_ruler_scorer():
+    return attribute_ruler_score


 class AttributeRuler(Pipe):
@@ -36,7 +68,12 @@ class AttributeRuler(Pipe):
    """

    def __init__(
-       self, vocab: Vocab, name: str = "attribute_ruler", *, validate: bool = False
+       self,
+       vocab: Vocab,
+       name: str = "attribute_ruler",
+       *,
+       validate: bool = False,
+       scorer: Optional[Callable] = attribute_ruler_score,
    ) -> None:
        """Create the AttributeRuler. After creation, you can add patterns
        with the `.initialize()` or `.add_patterns()` methods, or load patterns
@@ -45,6 +82,10 @@ class AttributeRuler(Pipe):

        vocab (Vocab): The vocab.
        name (str): The pipe name. Defaults to "attribute_ruler".
+       scorer (Optional[Callable]): The scoring method. Defaults to
+           Scorer.score_token_attr for the attributes "tag", "pos", "morph" and
+           "lemma" and Scorer.score_token_attr_per_feat for the attribute
+           "morph".

        RETURNS (AttributeRuler): The AttributeRuler component.
@@ -57,6 +98,7 @@ class AttributeRuler(Pipe):
        self.attrs: List[Dict] = []
        self._attrs_unnormed: List[Dict] = []  # store for reference
        self.indices: List[int] = []
+       self.scorer = scorer

    def clear(self) -> None:
        """Reset all patterns."""
@@ -228,45 +270,6 @@ class AttributeRuler(Pipe):
            all_patterns.append(p)
        return all_patterns  # type: ignore[return-value]

-   def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
-       """Score a batch of examples.
-
-       examples (Iterable[Example]): The examples to score.
-       RETURNS (Dict[str, Any]): The scores, produced by
-           Scorer.score_token_attr for the attributes "tag", "pos", "morph"
-           and "lemma" for the target token attributes.
-
-       DOCS: https://spacy.io/api/tagger#score
-       """
-
-       def morph_key_getter(token, attr):
-           return getattr(token, attr).key
-
-       validate_examples(examples, "AttributeRuler.score")
-       results = {}
-       attrs = set()  # type: ignore
-       for token_attrs in self.attrs:
-           attrs.update(token_attrs)
-       for attr in attrs:
-           if attr == TAG:
-               results.update(Scorer.score_token_attr(examples, "tag", **kwargs))
-           elif attr == POS:
-               results.update(Scorer.score_token_attr(examples, "pos", **kwargs))
-           elif attr == MORPH:
-               results.update(
-                   Scorer.score_token_attr(
-                       examples, "morph", getter=morph_key_getter, **kwargs
-                   )
-               )
-               results.update(
-                   Scorer.score_token_attr_per_feat(
-                       examples, "morph", getter=morph_key_getter, **kwargs
-                   )
-               )
-           elif attr == LEMMA:
-               results.update(Scorer.score_token_attr(examples, "lemma", **kwargs))
-       return results
-
    def to_bytes(self, exclude: Iterable[str] = SimpleFrozenList()) -> bytes:
        """Serialize the AttributeRuler to a bytestring.
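The scoring logic that used to live in `AttributeRuler.score` is now a plain function wired up through the scorers registry, so it can be replaced from config. A hedged sketch of that pattern (the registry name `my_tag_only_scorer.v1` below is made up for illustration):

    from spacy.scorer import Scorer
    from spacy.util import registry

    @registry.scorers("my_tag_only_scorer.v1")
    def make_tag_only_scorer():
        def score(examples, **kwargs):
            # only report tag accuracy, ignoring pos/morph/lemma
            return Scorer.score_token_attr(examples, "tag", **kwargs)
        return score

    # referenced from a training config as:
    # [components.attribute_ruler.scorer]
    # @scorers = "my_tag_only_scorer.v1"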
@@ -1,6 +1,6 @@
 # cython: infer_types=True, profile=True, binding=True
 from collections import defaultdict
-from typing import Optional, Iterable
+from typing import Optional, Iterable, Callable
 from thinc.api import Model, Config
 
 from ._parser_internals.transition_system import TransitionSystem
@@ -12,7 +12,7 @@ from ..language import Language
 from ._parser_internals import nonproj
 from ._parser_internals.nonproj import DELIMITER
 from ..scorer import Scorer
-from ..training import validate_examples
+from ..util import registry
 
 
 default_model_config = """
@@ -45,6 +45,7 @@ DEFAULT_PARSER_MODEL = Config().from_str(default_model_config)["model"]
         "learn_tokens": False,
         "min_action_freq": 30,
         "model": DEFAULT_PARSER_MODEL,
+        "scorer": {"@scorers": "spacy.parser_scorer.v1"},
     },
     default_score_weights={
         "dep_uas": 0.5,
@@ -63,6 +64,7 @@ def make_parser(
     update_with_oracle_cut_size: int,
     learn_tokens: bool,
     min_action_freq: int,
+    scorer: Optional[Callable],
 ):
     """Create a transition-based DependencyParser component. The dependency parser
     jointly learns sentence segmentation and labelled dependency parsing, and can
@@ -99,6 +101,7 @@ def make_parser(
         primarily affects the label accuracy, it can also affect the attachment
         structure, as the labels are used to represent the pseudo-projectivity
         transformation.
+    scorer (Optional[Callable]): The scoring method.
     """
     return DependencyParser(
         nlp.vocab,
@@ -115,6 +118,7 @@ def make_parser(
         # At some point in the future we can try to implement support for
         # partial annotations, perhaps only in the beam objective.
         incorrect_spans_key=None,
+        scorer=scorer,
     )
 
 
@@ -130,6 +134,7 @@ def make_parser(
         "learn_tokens": False,
         "min_action_freq": 30,
         "model": DEFAULT_PARSER_MODEL,
+        "scorer": {"@scorers": "spacy.parser_scorer.v1"},
     },
     default_score_weights={
         "dep_uas": 0.5,
@@ -151,6 +156,7 @@ def make_beam_parser(
     beam_width: int,
     beam_density: float,
     beam_update_prob: float,
+    scorer: Optional[Callable],
 ):
     """Create a transition-based DependencyParser component that uses beam-search.
     The dependency parser jointly learns sentence segmentation and labelled
@@ -208,9 +214,40 @@ def make_beam_parser(
         # At some point in the future we can try to implement support for
         # partial annotations, perhaps only in the beam objective.
         incorrect_spans_key=None,
+        scorer=scorer,
     )
 
 
+def parser_score(examples, **kwargs):
+    """Score a batch of examples.
+
+    examples (Iterable[Example]): The examples to score.
+    RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_spans
+        and Scorer.score_deps.
+
+    DOCS: https://spacy.io/api/dependencyparser#score
+    """
+    def has_sents(doc):
+        return doc.has_annotation("SENT_START")
+
+    def dep_getter(token, attr):
+        dep = getattr(token, attr)
+        dep = token.vocab.strings.as_string(dep).lower()
+        return dep
+
+    results = {}
+    results.update(Scorer.score_spans(examples, "sents", has_annotation=has_sents, **kwargs))
+    kwargs.setdefault("getter", dep_getter)
+    kwargs.setdefault("ignore_labels", ("p", "punct"))
+    results.update(Scorer.score_deps(examples, "dep", **kwargs))
+    del results["sents_per_type"]
+    return results
+
+
+@registry.scorers("spacy.parser_scorer.v1")
+def make_parser_scorer():
+    return parser_score
+
+
 class DependencyParser(Parser):
     """Pipeline component for dependency parsing.
 
@@ -234,6 +271,7 @@ class DependencyParser(Parser):
         beam_update_prob=0.0,
         multitasks=tuple(),
         incorrect_spans_key=None,
+        scorer=parser_score,
     ):
         """Create a DependencyParser."""
         super().__init__(
@@ -249,6 +287,7 @@ class DependencyParser(Parser):
             beam_update_prob=beam_update_prob,
             multitasks=multitasks,
             incorrect_spans_key=incorrect_spans_key,
+            scorer=scorer,
         )
 
     @property
@@ -281,36 +320,6 @@ class DependencyParser(Parser):
                 labels.add(label)
         return tuple(sorted(labels))
 
-    def score(self, examples, **kwargs):
-        """Score a batch of examples.
-
-        examples (Iterable[Example]): The examples to score.
-        RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_spans
-            and Scorer.score_deps.
-
-        DOCS: https://spacy.io/api/dependencyparser#score
-        """
-        def has_sents(doc):
-            return doc.has_annotation("SENT_START")
-
-        validate_examples(examples, "DependencyParser.score")
-
-        def dep_getter(token, attr):
-            dep = getattr(token, attr)
-            dep = token.vocab.strings.as_string(dep).lower()
-            return dep
-
-        results = {}
-        results.update(
-            Scorer.score_spans(examples, "sents", has_annotation=has_sents, **kwargs)
-        )
-        kwargs.setdefault("getter", dep_getter)
-        kwargs.setdefault("ignore_labels", ("p", "punct"))
-        results.update(Scorer.score_deps(examples, "dep", **kwargs))
-        del results["sents_per_type"]
-        return results
-
     def scored_parses(self, beams):
         """Return two dictionaries with scores for each beam/doc that was processed:
         one containing (i, head) keys, and another containing (i, label) keys.
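
The hunks above route the parser's evaluation through the scorer registry: the factory takes a scorer callable resolved from {"@scorers": "spacy.parser_scorer.v1"}, and the old DependencyParser.score body moves into the registered parser_score function. A minimal sketch of plugging a custom scorer into the same slot follows; the registry name "custom_parser_scorer.v1" is made up for illustration, and the getter mirrors the dep_getter shown in the diff.

import spacy
from spacy.scorer import Scorer
from spacy.util import registry


@registry.scorers("custom_parser_scorer.v1")  # hypothetical name
def make_custom_parser_scorer():
    def dep_getter(token, attr):
        # resolve the dep label hash to a lowercase string, as parser_score does
        dep = getattr(token, attr)
        return token.vocab.strings.as_string(dep).lower()

    def custom_parser_score(examples, **kwargs):
        kwargs.setdefault("getter", dep_getter)
        kwargs.setdefault("ignore_labels", ("p", "punct"))
        # score only dependency attachment, skipping the sentence-span scores
        return Scorer.score_deps(examples, "dep", **kwargs)

    return custom_parser_score


nlp = spacy.blank("en")
nlp.add_pipe("parser", config={"scorer": {"@scorers": "custom_parser_scorer.v1"}})
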
@@ -17,10 +17,12 @@ from ..language import Language
 from ..vocab import Vocab
 from ..training import Example, validate_examples, validate_get_examples
 from ..errors import Errors, Warnings
-from ..util import SimpleFrozenList
+from ..util import SimpleFrozenList, registry
 from .. import util
 from ..scorer import Scorer
 
+# See #9050
+BACKWARD_OVERWRITE = True
+
 default_model_config = """
 [model]
@@ -51,6 +53,8 @@ DEFAULT_NEL_MODEL = Config().from_str(default_model_config)["model"]
         "incl_context": True,
         "entity_vector_length": 64,
         "get_candidates": {"@misc": "spacy.CandidateGenerator.v1"},
+        "overwrite": True,
+        "scorer": {"@scorers": "spacy.entity_linker_scorer.v1"},
     },
     default_score_weights={
         "nel_micro_f": 1.0,
@@ -69,6 +73,8 @@ def make_entity_linker(
     incl_context: bool,
     entity_vector_length: int,
     get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]],
+    overwrite: bool,
+    scorer: Optional[Callable],
 ):
     """Construct an EntityLinker component.
 
@@ -82,6 +88,7 @@ def make_entity_linker(
     entity_vector_length (int): Size of encoding vectors in the KB.
     get_candidates (Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]): Function that
         produces a list of candidates, given a certain knowledge base and a textual mention.
+    scorer (Optional[Callable]): The scoring method.
     """
     return EntityLinker(
         nlp.vocab,
@@ -93,9 +100,20 @@ def make_entity_linker(
         incl_context=incl_context,
         entity_vector_length=entity_vector_length,
         get_candidates=get_candidates,
+        overwrite=overwrite,
+        scorer=scorer,
     )
 
 
+def entity_linker_score(examples, **kwargs):
+    return Scorer.score_links(examples, negative_labels=[EntityLinker.NIL], **kwargs)
+
+
+@registry.scorers("spacy.entity_linker_scorer.v1")
+def make_entity_linker_scorer():
+    return entity_linker_score
+
+
 class EntityLinker(TrainablePipe):
     """Pipeline component for named entity linking.
 
@@ -116,6 +134,8 @@ class EntityLinker(TrainablePipe):
         incl_context: bool,
         entity_vector_length: int,
         get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]],
+        overwrite: bool = BACKWARD_OVERWRITE,
+        scorer: Optional[Callable] = entity_linker_score,
     ) -> None:
         """Initialize an entity linker.
 
@@ -130,6 +150,8 @@ class EntityLinker(TrainablePipe):
         entity_vector_length (int): Size of encoding vectors in the KB.
         get_candidates (Callable[[KnowledgeBase, Span], Iterable[Candidate]]): Function that
             produces a list of candidates, given a certain knowledge base and a textual mention.
+        scorer (Optional[Callable]): The scoring method. Defaults to
+            Scorer.score_links.
 
         DOCS: https://spacy.io/api/entitylinker#init
         """
@@ -141,11 +163,12 @@ class EntityLinker(TrainablePipe):
         self.incl_prior = incl_prior
         self.incl_context = incl_context
         self.get_candidates = get_candidates
-        self.cfg: Dict[str, Any] = {}
+        self.cfg: Dict[str, Any] = {"overwrite": overwrite}
         self.distance = CosineDistance(normalize=False)
         # how many neighbour sentences to take into account
         # create an empty KB by default. If you want to load a predefined one, specify it in 'initialize'.
         self.kb = empty_kb(entity_vector_length)(self.vocab)
+        self.scorer = scorer
 
     def set_kb(self, kb_loader: Callable[[Vocab], KnowledgeBase]):
         """Define the KB of this pipe by providing a function that will
@@ -384,23 +407,14 @@ class EntityLinker(TrainablePipe):
         if count_ents != len(kb_ids):
             raise ValueError(Errors.E148.format(ents=count_ents, ids=len(kb_ids)))
         i = 0
+        overwrite = self.cfg["overwrite"]
         for doc in docs:
             for ent in doc.ents:
                 kb_id = kb_ids[i]
                 i += 1
                 for token in ent:
-                    token.ent_kb_id_ = kb_id
-
-    def score(self, examples, **kwargs):
-        """Score a batch of examples.
-
-        examples (Iterable[Example]): The examples to score.
-        RETURNS (Dict[str, Any]): The scores.
-
-        DOCS TODO: https://spacy.io/api/entity_linker#score
-        """
-        validate_examples(examples, "EntityLinker.score")
-        return Scorer.score_links(examples, negative_labels=[self.NIL])
+                    if token.ent_kb_id == 0 or overwrite:
+                        token.ent_kb_id_ = kb_id
 
     def to_bytes(self, *, exclude=tuple()):
         """Serialize the pipe to a bytestring.
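
Besides the registered scorer, the entity linker gains an overwrite setting: predicted KB IDs are only written over an existing token.ent_kb_id when overwriting is enabled. A hedged usage sketch, assuming a spaCy build that includes this change and a pipeline that is later given a knowledge base via initialize:

import spacy

nlp = spacy.blank("en")
# keep KB IDs that an earlier component already set; the default in this
# change is overwrite=True, which replaces them
nlp.add_pipe("entity_linker", config={"overwrite": False})
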
@@ -9,11 +9,10 @@ from .pipe import Pipe
 from ..training import Example
 from ..language import Language
 from ..errors import Errors, Warnings
-from ..util import ensure_path, to_disk, from_disk, SimpleFrozenList
+from ..util import ensure_path, to_disk, from_disk, SimpleFrozenList, registry
 from ..tokens import Doc, Span
 from ..matcher import Matcher, PhraseMatcher
 from ..scorer import get_ner_prf
-from ..training import validate_examples
 
 
 DEFAULT_ENT_ID_SEP = "||"
@@ -28,6 +27,7 @@ PatternType = Dict[str, Union[str, List[Dict[str, Any]]]]
         "validate": False,
         "overwrite_ents": False,
         "ent_id_sep": DEFAULT_ENT_ID_SEP,
+        "scorer": {"@scorers": "spacy.entity_ruler_scorer.v1"},
     },
     default_score_weights={
         "ents_f": 1.0,
@@ -43,6 +43,7 @@ def make_entity_ruler(
     validate: bool,
     overwrite_ents: bool,
     ent_id_sep: str,
+    scorer: Optional[Callable],
 ):
     return EntityRuler(
         nlp,
@@ -51,9 +52,19 @@ def make_entity_ruler(
         validate=validate,
         overwrite_ents=overwrite_ents,
         ent_id_sep=ent_id_sep,
+        scorer=scorer,
     )
 
 
+def entity_ruler_score(examples, **kwargs):
+    return get_ner_prf(examples)
+
+
+@registry.scorers("spacy.entity_ruler_scorer.v1")
+def make_entity_ruler_scorer():
+    return entity_ruler_score
+
+
 class EntityRuler(Pipe):
     """The EntityRuler lets you add spans to the `Doc.ents` using token-based
     rules or exact phrase matches. It can be combined with the statistical
@@ -75,6 +86,7 @@ class EntityRuler(Pipe):
         overwrite_ents: bool = False,
         ent_id_sep: str = DEFAULT_ENT_ID_SEP,
         patterns: Optional[List[PatternType]] = None,
+        scorer: Optional[Callable] = entity_ruler_score,
     ) -> None:
         """Initialize the entity ruler. If patterns are supplied here, they
         need to be a list of dictionaries with a `"label"` and `"pattern"`
@@ -95,6 +107,8 @@ class EntityRuler(Pipe):
         overwrite_ents (bool): If existing entities are present, e.g. entities
             added by the model, overwrite them by matches if necessary.
         ent_id_sep (str): Separator used internally for entity IDs.
+        scorer (Optional[Callable]): The scoring method. Defaults to
+            spacy.scorer.get_ner_prf.
 
         DOCS: https://spacy.io/api/entityruler#init
         """
@@ -113,6 +127,7 @@ class EntityRuler(Pipe):
         self._ent_ids = defaultdict(tuple)  # type: ignore
         if patterns is not None:
             self.add_patterns(patterns)
+        self.scorer = scorer
 
     def __len__(self) -> int:
         """The number of all patterns added to the entity ruler."""
@@ -333,6 +348,46 @@ class EntityRuler(Pipe):
             self.nlp.vocab, attr=self.phrase_matcher_attr, validate=self._validate
         )
 
+    def remove(self, ent_id: str) -> None:
+        """Remove a pattern by its ent_id if a pattern with this ent_id was added before
+
+        ent_id (str): id of the pattern to be removed
+        RETURNS: None
+        DOCS: https://spacy.io/api/entityruler#remove
+        """
+        label_id_pairs = [
+            (label, eid) for (label, eid) in self._ent_ids.values() if eid == ent_id
+        ]
+        if not label_id_pairs:
+            raise ValueError(Errors.E1024.format(ent_id=ent_id))
+        created_labels = [
+            self._create_label(label, eid) for (label, eid) in label_id_pairs
+        ]
+        # remove the patterns from self.phrase_patterns
+        self.phrase_patterns = defaultdict(
+            list,
+            {
+                label: val
+                for (label, val) in self.phrase_patterns.items()
+                if label not in created_labels
+            },
+        )
+        # remove the patterns from self.token_pattern
+        self.token_patterns = defaultdict(
+            list,
+            {
+                label: val
+                for (label, val) in self.token_patterns.items()
+                if label not in created_labels
+            },
+        )
+        # remove the patterns from self.token_pattern
+        for label in created_labels:
+            if label in self.phrase_matcher:
+                self.phrase_matcher.remove(label)
+            else:
+                self.matcher.remove(label)
+
     def _require_patterns(self) -> None:
         """Raise a warning if this component has no patterns defined."""
         if len(self) == 0:
@@ -363,10 +418,6 @@ class EntityRuler(Pipe):
             label = f"{label}{self.ent_id_sep}{ent_id}"
         return label
 
-    def score(self, examples, **kwargs):
-        validate_examples(examples, "EntityRuler.score")
-        return get_ner_prf(examples)
-
     def from_bytes(
         self, patterns_bytes: bytes, *, exclude: Iterable[str] = SimpleFrozenList()
     ) -> "EntityRuler":
@@ -420,10 +471,16 @@ class EntityRuler(Pipe):
         path = ensure_path(path)
         self.clear()
         depr_patterns_path = path.with_suffix(".jsonl")
-        if depr_patterns_path.is_file():
+        if path.suffix == ".jsonl":  # user provides a jsonl
+            if path.is_file:
+                patterns = srsly.read_jsonl(path)
+                self.add_patterns(patterns)
+            else:
+                raise ValueError(Errors.E1023.format(path=path))
+        elif depr_patterns_path.is_file():
             patterns = srsly.read_jsonl(depr_patterns_path)
             self.add_patterns(patterns)
-        else:
+        elif path.is_dir():  # path is a valid directory
             cfg = {}
             deserializers_patterns = {
                 "patterns": lambda p: self.add_patterns(
@@ -440,6 +497,8 @@ class EntityRuler(Pipe):
                 self.nlp.vocab, attr=self.phrase_matcher_attr
             )
             from_disk(path, deserializers_patterns, {})
+        else:  # path is not a valid directory or file
+            raise ValueError(Errors.E146.format(path=path))
         return self
 
     def to_disk(
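
Two behavioural additions sit alongside the scorer change in the entity ruler: a remove(ent_id) method that drops every pattern added under a given ID, and from_disk support for a plain JSONL patterns file in addition to a serialized directory. A small sketch of both, assuming a spaCy build with these changes; the patterns.jsonl path and the pattern contents are only examples.

import spacy
import srsly

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Explosion", "id": "explosion-ai"}])

# drop every pattern that was registered under this ID
ruler.remove("explosion-ai")

# load patterns straight from a JSONL file instead of a serialized directory
srsly.write_jsonl("patterns.jsonl", [{"label": "ORG", "pattern": "Explosion"}])
ruler.from_disk("patterns.jsonl")
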
@@ -1,6 +1,8 @@
 from typing import Dict, Any
 import srsly
+import warnings
 
+from ..errors import Warnings
 from ..language import Language
 from ..matcher import Matcher
 from ..tokens import Doc
@@ -136,3 +138,65 @@ class TokenSplitter:
             "cfg": lambda p: self._set_config(srsly.read_json(p)),
         }
         util.from_disk(path, serializers, [])
+
+
+@Language.factory(
+    "doc_cleaner",
+    default_config={"attrs": {"tensor": None, "_.trf_data": None}, "silent": True},
+)
+def make_doc_cleaner(nlp: Language, name: str, *, attrs: Dict[str, Any], silent: bool):
+    return DocCleaner(attrs, silent=silent)
+
+
+class DocCleaner:
+    def __init__(self, attrs: Dict[str, Any], *, silent: bool = True):
+        self.cfg: Dict[str, Any] = {"attrs": dict(attrs), "silent": silent}
+
+    def __call__(self, doc: Doc) -> Doc:
+        attrs: dict = self.cfg["attrs"]
+        silent: bool = self.cfg["silent"]
+        for attr, value in attrs.items():
+            obj = doc
+            parts = attr.split(".")
+            skip = False
+            for part in parts[:-1]:
+                if hasattr(obj, part):
+                    obj = getattr(obj, part)
+                else:
+                    skip = True
+                    if not silent:
+                        warnings.warn(Warnings.W116.format(attr=attr))
+            if not skip:
+                if hasattr(obj, parts[-1]):
+                    setattr(obj, parts[-1], value)
+                else:
+                    if not silent:
+                        warnings.warn(Warnings.W116.format(attr=attr))
+        return doc
+
+    def to_bytes(self, **kwargs):
+        serializers = {
+            "cfg": lambda: srsly.json_dumps(self.cfg),
+        }
+        return util.to_bytes(serializers, [])
+
+    def from_bytes(self, data, **kwargs):
+        deserializers = {
+            "cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
+        }
+        util.from_bytes(data, deserializers, [])
+        return self
+
+    def to_disk(self, path, **kwargs):
+        path = util.ensure_path(path)
+        serializers = {
+            "cfg": lambda p: srsly.write_json(p, self.cfg),
+        }
+        return util.to_disk(path, serializers, [])
+
+    def from_disk(self, path, **kwargs):
+        path = util.ensure_path(path)
+        serializers = {
+            "cfg": lambda p: self.cfg.update(srsly.read_json(p)),
+        }
+        util.from_disk(path, serializers, [])
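
DocCleaner is a new stateless component for dropping bulky per-doc attributes (for example doc.tensor, or the _.trf_data extension set by transformer pipelines) once they are no longer needed. A hedged sketch of wiring it into a pipeline, assuming a spaCy build that registers the doc_cleaner factory shown above:

import spacy

nlp = spacy.blank("en")
# in a real pipeline a trained component fills doc.tensor first; the cleaner
# simply resets the listed attributes after the doc has been processed
nlp.add_pipe("doc_cleaner", config={"attrs": {"tensor": None}})

doc = nlp("The tensor attribute is cleared after processing.")
print(doc.tensor)  # expected to be None once the cleaner has run
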
@@ -12,21 +12,41 @@ from ..lookups import Lookups, load_lookups
 from ..scorer import Scorer
 from ..tokens import Doc, Token
 from ..vocab import Vocab
-from ..training import validate_examples
-from ..util import logger, SimpleFrozenList
+from ..util import logger, SimpleFrozenList, registry
 from .. import util
 
 
 @Language.factory(
     "lemmatizer",
     assigns=["token.lemma"],
-    default_config={"model": None, "mode": "lookup", "overwrite": False},
+    default_config={
+        "model": None,
+        "mode": "lookup",
+        "overwrite": False,
+        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
+    },
     default_score_weights={"lemma_acc": 1.0},
 )
 def make_lemmatizer(
-    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool = False
+    nlp: Language,
+    model: Optional[Model],
+    name: str,
+    mode: str,
+    overwrite: bool,
+    scorer: Optional[Callable],
 ):
-    return Lemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
+    return Lemmatizer(
+        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
+    )
+
+
+def lemmatizer_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
+    return Scorer.score_token_attr(examples, "lemma", **kwargs)
+
+
+@registry.scorers("spacy.lemmatizer_scorer.v1")
+def make_lemmatizer_scorer():
+    return lemmatizer_score
+
+
 class Lemmatizer(Pipe):
@@ -60,6 +80,7 @@ class Lemmatizer(Pipe):
         *,
         mode: str = "lookup",
         overwrite: bool = False,
+        scorer: Optional[Callable] = lemmatizer_score,
     ) -> None:
         """Initialize a Lemmatizer.
 
@@ -69,6 +90,8 @@ class Lemmatizer(Pipe):
         mode (str): The lemmatizer mode: "lookup", "rule". Defaults to "lookup".
         overwrite (bool): Whether to overwrite existing lemmas. Defaults to
             `False`.
+        scorer (Optional[Callable]): The scoring method. Defaults to
+            Scorer.score_token_attr for the attribute "lemma".
 
         DOCS: https://spacy.io/api/lemmatizer#init
         """
@@ -89,6 +112,7 @@ class Lemmatizer(Pipe):
             raise ValueError(Errors.E1003.format(mode=mode))
         self.lemmatize = getattr(self, mode_attr)
         self.cache = {}  # type: ignore[var-annotated]
+        self.scorer = scorer
 
     @property
     def mode(self):
@@ -247,17 +271,6 @@ class Lemmatizer(Pipe):
         """
         return False
 
-    def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
-        """Score a batch of examples.
-
-        examples (Iterable[Example]): The examples to score.
-        RETURNS (Dict[str, Any]): The scores.
-
-        DOCS: https://spacy.io/api/lemmatizer#score
-        """
-        validate_examples(examples, "Lemmatizer.score")
-        return Scorer.score_token_attr(examples, "lemma", **kwargs)
-
     def to_disk(
         self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList()
     ):
@@ -1,5 +1,5 @@
 # cython: infer_types=True, profile=True, binding=True
-from typing import Optional, Union, Dict
+from typing import Optional, Union, Dict, Callable
 import srsly
 from thinc.api import SequenceCategoricalCrossentropy, Model, Config
 from itertools import islice
@@ -17,7 +17,11 @@ from .tagger import Tagger
 from .. import util
 from ..scorer import Scorer
 from ..training import validate_examples, validate_get_examples
+from ..util import registry
 
+# See #9050
+BACKWARD_OVERWRITE = True
+BACKWARD_EXTEND = False
 
 default_model_config = """
 [model]
@@ -48,15 +52,35 @@ DEFAULT_MORPH_MODEL = Config().from_str(default_model_config)["model"]
 @Language.factory(
     "morphologizer",
     assigns=["token.morph", "token.pos"],
-    default_config={"model": DEFAULT_MORPH_MODEL},
+    default_config={"model": DEFAULT_MORPH_MODEL, "overwrite": True, "extend": False, "scorer": {"@scorers": "spacy.morphologizer_scorer.v1"}},
     default_score_weights={"pos_acc": 0.5, "morph_acc": 0.5, "morph_per_feat": None},
 )
 def make_morphologizer(
     nlp: Language,
     model: Model,
     name: str,
+    overwrite: bool,
+    extend: bool,
+    scorer: Optional[Callable],
 ):
-    return Morphologizer(nlp.vocab, model, name)
+    return Morphologizer(nlp.vocab, model, name, overwrite=overwrite, extend=extend, scorer=scorer)
+
+
+def morphologizer_score(examples, **kwargs):
+    def morph_key_getter(token, attr):
+        return getattr(token, attr).key
+
+    results = {}
+    results.update(Scorer.score_token_attr(examples, "pos", **kwargs))
+    results.update(Scorer.score_token_attr(examples, "morph", getter=morph_key_getter, **kwargs))
+    results.update(Scorer.score_token_attr_per_feat(examples,
+        "morph", getter=morph_key_getter, **kwargs))
+    return results
+
+
+@registry.scorers("spacy.morphologizer_scorer.v1")
+def make_morphologizer_scorer():
+    return morphologizer_score
+
+
 class Morphologizer(Tagger):
@@ -67,6 +91,10 @@ class Morphologizer(Tagger):
         vocab: Vocab,
         model: Model,
         name: str = "morphologizer",
+        *,
+        overwrite: bool = BACKWARD_OVERWRITE,
+        extend: bool = BACKWARD_EXTEND,
+        scorer: Optional[Callable] = morphologizer_score,
     ):
         """Initialize a morphologizer.
 
@@ -74,6 +102,9 @@ class Morphologizer(Tagger):
         model (thinc.api.Model): The Thinc Model powering the pipeline component.
         name (str): The component instance name, used to add entries to the
             losses during training.
+        scorer (Optional[Callable]): The scoring method. Defaults to
+            Scorer.score_token_attr for the attributes "pos" and "morph" and
+            Scorer.score_token_attr_per_feat for the attribute "morph".
 
         DOCS: https://spacy.io/api/morphologizer#init
         """
@@ -85,8 +116,14 @@ class Morphologizer(Tagger):
         # store mappings from morph+POS labels to token-level annotations:
         # 1) labels_morph stores a mapping from morph+POS->morph
         # 2) labels_pos stores a mapping from morph+POS->POS
-        cfg = {"labels_morph": {}, "labels_pos": {}}
+        cfg = {
+            "labels_morph": {},
+            "labels_pos": {},
+            "overwrite": overwrite,
+            "extend": extend,
+        }
         self.cfg = dict(sorted(cfg.items()))
+        self.scorer = scorer
 
     @property
     def labels(self):
@@ -192,14 +229,35 @@ class Morphologizer(Tagger):
             docs = [docs]
         cdef Doc doc
         cdef Vocab vocab = self.vocab
+        cdef bint overwrite = self.cfg["overwrite"]
+        cdef bint extend = self.cfg["extend"]
+        labels = self.labels
         for i, doc in enumerate(docs):
             doc_tag_ids = batch_tag_ids[i]
             if hasattr(doc_tag_ids, "get"):
                 doc_tag_ids = doc_tag_ids.get()
             for j, tag_id in enumerate(doc_tag_ids):
-                morph = self.labels[tag_id]
-                doc.c[j].morph = self.vocab.morphology.add(self.cfg["labels_morph"].get(morph, 0))
-                doc.c[j].pos = self.cfg["labels_pos"].get(morph, 0)
+                morph = labels[tag_id]
+                # set morph
+                if doc.c[j].morph == 0 or overwrite or extend:
+                    if overwrite and extend:
+                        # morphologizer morph overwrites any existing features
+                        # while extending
+                        extended_morph = Morphology.feats_to_dict(self.vocab.strings[doc.c[j].morph])
+                        extended_morph.update(Morphology.feats_to_dict(self.cfg["labels_morph"].get(morph, 0)))
+                        doc.c[j].morph = self.vocab.morphology.add(extended_morph)
+                    elif extend:
+                        # existing features are preserved and any new features
+                        # are added
+                        extended_morph = Morphology.feats_to_dict(self.cfg["labels_morph"].get(morph, 0))
+                        extended_morph.update(Morphology.feats_to_dict(self.vocab.strings[doc.c[j].morph]))
+                        doc.c[j].morph = self.vocab.morphology.add(extended_morph)
+                    else:
+                        # clobber
+                        doc.c[j].morph = self.vocab.morphology.add(self.cfg["labels_morph"].get(morph, 0))
+                # set POS
+                if doc.c[j].pos == 0 or overwrite:
+                    doc.c[j].pos = self.cfg["labels_pos"].get(morph, 0)
 
     def get_loss(self, examples, scores):
         """Find the loss and gradient of loss for the batch of documents and
@@ -246,24 +304,3 @@ class Morphologizer(Tagger):
         if self.model.ops.xp.isnan(loss):
             raise ValueError(Errors.E910.format(name=self.name))
         return float(loss), d_scores
-
-    def score(self, examples, **kwargs):
-        """Score a batch of examples.
-
-        examples (Iterable[Example]): The examples to score.
-        RETURNS (Dict[str, Any]): The scores, produced by
-            Scorer.score_token_attr for the attributes "pos" and "morph" and
-            Scorer.score_token_attr_per_feat for the attribute "morph".
-
-        DOCS: https://spacy.io/api/morphologizer#score
-        """
-        def morph_key_getter(token, attr):
-            return getattr(token, attr).key
-
-        validate_examples(examples, "Morphologizer.score")
-        results = {}
-        results.update(Scorer.score_token_attr(examples, "pos", **kwargs))
-        results.update(Scorer.score_token_attr(examples, "morph", getter=morph_key_getter, **kwargs))
-        results.update(Scorer.score_token_attr_per_feat(examples,
-            "morph", getter=morph_key_getter, **kwargs))
-        return results
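
The overwrite/extend branches above reduce to merging two morphological feature dicts in different orders. A standalone sketch with plain dicts standing in for the Morphology.feats_to_dict output; the feature values are illustrative only.

# features already set on the token vs. features predicted by the morphologizer
existing = {"Number": "Sing", "Case": "Nom"}
predicted = {"Number": "Plur", "Gender": "Fem"}

# overwrite and extend: predicted features win, unrelated existing ones survive
merged = dict(existing)
merged.update(predicted)
assert merged == {"Number": "Plur", "Case": "Nom", "Gender": "Fem"}

# extend only: existing features win, predicted ones fill in the gaps
merged = dict(predicted)
merged.update(existing)
assert merged == {"Number": "Sing", "Case": "Nom", "Gender": "Fem"}

# overwrite only (the final "clobber" branch): predicted features replace everything
merged = dict(predicted)
assert merged == {"Number": "Plur", "Gender": "Fem"}
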
@@ -1,6 +1,6 @@
 # cython: infer_types=True, profile=True, binding=True
 from collections import defaultdict
-from typing import Optional, Iterable
+from typing import Optional, Iterable, Callable
 from thinc.api import Model, Config
 
 from ._parser_internals.transition_system import TransitionSystem
@@ -10,6 +10,7 @@ from ._parser_internals.ner import BiluoPushDown
 from ..language import Language
 from ..scorer import get_ner_prf, PRFScore
 from ..training import validate_examples
+from ..util import registry
 
 
 default_model_config = """
@@ -41,6 +42,7 @@ DEFAULT_NER_MODEL = Config().from_str(default_model_config)["model"]
         "update_with_oracle_cut_size": 100,
         "model": DEFAULT_NER_MODEL,
         "incorrect_spans_key": None,
+        "scorer": {"@scorers": "spacy.ner_scorer.v1"},
     },
     default_score_weights={
         "ents_f": 1.0,
@@ -55,7 +57,8 @@ def make_ner(
     model: Model,
     moves: Optional[TransitionSystem],
     update_with_oracle_cut_size: int,
-    incorrect_spans_key: Optional[str] = None,
+    incorrect_spans_key: Optional[str],
+    scorer: Optional[Callable],
 ):
     """Create a transition-based EntityRecognizer component. The entity recognizer
     identifies non-overlapping labelled spans of tokens.
@@ -83,6 +86,7 @@ def make_ner(
     incorrect_spans_key (Optional[str]): Identifies spans that are known
         to be incorrect entity annotations. The incorrect entity annotations
         can be stored in the span group, under this key.
+    scorer (Optional[Callable]): The scoring method.
     """
     return EntityRecognizer(
         nlp.vocab,
@@ -95,6 +99,7 @@ def make_ner(
         beam_width=1,
         beam_density=0.0,
         beam_update_prob=0.0,
+        scorer=scorer,
     )
 
 
@@ -109,6 +114,7 @@ def make_ner(
         "beam_update_prob": 0.5,
         "beam_width": 32,
         "incorrect_spans_key": None,
+        "scorer": None,
     },
     default_score_weights={
         "ents_f": 1.0,
@@ -126,7 +132,8 @@ def make_beam_ner(
     beam_width: int,
     beam_density: float,
     beam_update_prob: float,
-    incorrect_spans_key: Optional[str] = None,
+    incorrect_spans_key: Optional[str],
+    scorer: Optional[Callable],
 ):
     """Create a transition-based EntityRecognizer component that uses beam-search.
     The entity recognizer identifies non-overlapping labelled spans of tokens.
@@ -162,6 +169,7 @@ def make_beam_ner(
         and are faster to compute.
     incorrect_spans_key (Optional[str]): Optional key into span groups of
         entities known to be non-entities.
+    scorer (Optional[Callable]): The scoring method.
     """
     return EntityRecognizer(
         nlp.vocab,
@@ -174,9 +182,19 @@ def make_beam_ner(
         beam_density=beam_density,
         beam_update_prob=beam_update_prob,
         incorrect_spans_key=incorrect_spans_key,
+        scorer=scorer,
     )
 
 
+def ner_score(examples, **kwargs):
+    return get_ner_prf(examples, **kwargs)
+
+
+@registry.scorers("spacy.ner_scorer.v1")
+def make_ner_scorer():
+    return ner_score
+
+
 class EntityRecognizer(Parser):
     """Pipeline component for named entity recognition.
 
@@ -198,6 +216,7 @@ class EntityRecognizer(Parser):
         beam_update_prob=0.0,
         multitasks=tuple(),
         incorrect_spans_key=None,
+        scorer=ner_score,
     ):
         """Create an EntityRecognizer."""
         super().__init__(
@@ -213,6 +232,7 @@ class EntityRecognizer(Parser):
             beam_update_prob=beam_update_prob,
             multitasks=multitasks,
             incorrect_spans_key=incorrect_spans_key,
+            scorer=scorer,
         )
 
     def add_multitask_objective(self, mt_component):
@@ -239,17 +259,6 @@ class EntityRecognizer(Parser):
         )
         return tuple(sorted(labels))
 
-    def score(self, examples, **kwargs):
-        """Score a batch of examples.
-
-        examples (Iterable[Example]): The examples to score.
-        RETURNS (Dict[str, Any]): The NER precision, recall and f-scores.
-
-        DOCS: https://spacy.io/api/entityrecognizer#score
-        """
-        validate_examples(examples, "EntityRecognizer.score")
-        return get_ner_prf(examples)
-
     def scored_ents(self, beams):
         """Return a dictionary of (start, end, label) tuples with corresponding scores
        for each beam/doc that was processed.
@@ -81,6 +81,17 @@ cdef class Pipe:
 
         DOCS: https://spacy.io/api/pipe#score
         """
+        if hasattr(self, "scorer") and self.scorer is not None:
+            scorer_kwargs = {}
+            # use default settings from cfg (e.g., threshold)
+            if hasattr(self, "cfg") and isinstance(self.cfg, dict):
+                scorer_kwargs.update(self.cfg)
+            # override self.cfg["labels"] with self.labels
+            if hasattr(self, "labels"):
+                scorer_kwargs["labels"] = self.labels
+            # override with kwargs settings
+            scorer_kwargs.update(kwargs)
+            return self.scorer(examples, **scorer_kwargs)
         return {}
 
     @property
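
With this change the base Pipe.score defers to whatever callable is stored on self.scorer, layering the component's cfg, then its labels, then the call-time kwargs. A plain-Python sketch of that precedence; the DummyComponent and echo_scorer are hypothetical stand-ins, not spaCy API.

def echo_scorer(examples, **kwargs):
    # a stand-in scorer that just reports what it was given
    return {"received": kwargs}


class DummyComponent:
    def __init__(self):
        self.cfg = {"threshold": 0.5}
        self.labels = ("A", "B")
        self.scorer = echo_scorer

    def score(self, examples, **kwargs):
        if self.scorer is not None:
            scorer_kwargs = {}
            scorer_kwargs.update(self.cfg)           # defaults from cfg
            scorer_kwargs["labels"] = self.labels    # labels override cfg
            scorer_kwargs.update(kwargs)             # call-time kwargs win
            return self.scorer(examples, **scorer_kwargs)
        return {}


scores = DummyComponent().score([], threshold=0.9)
assert scores["received"] == {"threshold": 0.9, "labels": ("A", "B")}
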
@@ -1,26 +1,32 @@
 # cython: infer_types=True, profile=True, binding=True
-from typing import Optional, List
+from typing import Optional, List, Callable
 import srsly
 
 from ..tokens.doc cimport Doc
 
 from .pipe import Pipe
+from .senter import senter_score
 from ..language import Language
 from ..scorer import Scorer
-from ..training import validate_examples
 from .. import util
 
+# see #9050
+BACKWARD_OVERWRITE = False
+
 @Language.factory(
     "sentencizer",
     assigns=["token.is_sent_start", "doc.sents"],
-    default_config={"punct_chars": None},
+    default_config={"punct_chars": None, "overwrite": False, "scorer": {"@scorers": "spacy.senter_scorer.v1"}},
     default_score_weights={"sents_f": 1.0, "sents_p": 0.0, "sents_r": 0.0},
 )
 def make_sentencizer(
     nlp: Language,
     name: str,
-    punct_chars: Optional[List[str]]
+    punct_chars: Optional[List[str]],
+    overwrite: bool,
+    scorer: Optional[Callable],
 ):
-    return Sentencizer(name, punct_chars=punct_chars)
+    return Sentencizer(name, punct_chars=punct_chars, overwrite=overwrite, scorer=scorer)
 
 
 class Sentencizer(Pipe):
@@ -41,12 +47,20 @@ class Sentencizer(Pipe):
             '𑩃', '𑪛', '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈',
             '。', '。']
 
-    def __init__(self, name="sentencizer", *, punct_chars=None):
+    def __init__(
+        self,
+        name="sentencizer",
+        *,
+        punct_chars=None,
+        overwrite=BACKWARD_OVERWRITE,
+        scorer=senter_score,
+    ):
         """Initialize the sentencizer.
 
         punct_chars (list): Punctuation characters to split on. Will be
             serialized with the nlp object.
-        RETURNS (Sentencizer): The sentencizer component.
+        scorer (Optional[Callable]): The scoring method. Defaults to
+            Scorer.score_spans for the attribute "sents".
 
         DOCS: https://spacy.io/api/sentencizer#init
         """
@@ -55,6 +69,8 @@ class Sentencizer(Pipe):
             self.punct_chars = set(punct_chars)
         else:
             self.punct_chars = set(self.default_punct_chars)
+        self.overwrite = overwrite
+        self.scorer = scorer
 
     def __call__(self, doc):
         """Apply the sentencizer to a Doc and set Token.is_sent_start.
@@ -115,29 +131,12 @@ class Sentencizer(Pipe):
         for i, doc in enumerate(docs):
             doc_tag_ids = batch_tag_ids[i]
             for j, tag_id in enumerate(doc_tag_ids):
-                # Don't clobber existing sentence boundaries
-                if doc.c[j].sent_start == 0:
+                if doc.c[j].sent_start == 0 or self.overwrite:
                     if tag_id:
                         doc.c[j].sent_start = 1
                     else:
                         doc.c[j].sent_start = -1
 
-    def score(self, examples, **kwargs):
-        """Score a batch of examples.
-
-        examples (Iterable[Example]): The examples to score.
-        RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_spans.
-
-        DOCS: https://spacy.io/api/sentencizer#score
-        """
-        def has_sents(doc):
-            return doc.has_annotation("SENT_START")
-
-        validate_examples(examples, "Sentencizer.score")
-        results = Scorer.score_spans(examples, "sents", has_annotation=has_sents, **kwargs)
-        del results["sents_per_type"]
-        return results
-
     def to_bytes(self, *, exclude=tuple()):
         """Serialize the sentencizer to a bytestring.
@@ -145,7 +144,7 @@ class Sentencizer(Pipe):
 
         DOCS: https://spacy.io/api/sentencizer#to_bytes
         """
-        return srsly.msgpack_dumps({"punct_chars": list(self.punct_chars)})
+        return srsly.msgpack_dumps({"punct_chars": list(self.punct_chars), "overwrite": self.overwrite})
 
     def from_bytes(self, bytes_data, *, exclude=tuple()):
         """Load the sentencizer from a bytestring.
@@ -157,6 +156,7 @@ class Sentencizer(Pipe):
         """
         cfg = srsly.msgpack_loads(bytes_data)
         self.punct_chars = set(cfg.get("punct_chars", self.default_punct_chars))
+        self.overwrite = cfg.get("overwrite", self.overwrite)
         return self
 
     def to_disk(self, path, *, exclude=tuple()):
@@ -166,7 +166,7 @@ class Sentencizer(Pipe):
         """
         path = util.ensure_path(path)
         path = path.with_suffix(".json")
-        srsly.write_json(path, {"punct_chars": list(self.punct_chars)})
+        srsly.write_json(path, {"punct_chars": list(self.punct_chars), "overwrite": self.overwrite})
 
 
     def from_disk(self, path, *, exclude=tuple()):
@@ -178,4 +178,5 @@ class Sentencizer(Pipe):
         path = path.with_suffix(".json")
         cfg = srsly.read_json(path)
         self.punct_chars = set(cfg.get("punct_chars", self.default_punct_chars))
+        self.overwrite = cfg.get("overwrite", self.overwrite)
         return self
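
The sentencizer now reuses the registered senter scorer and gains an overwrite flag that controls whether boundaries set by an earlier component are replaced; the flag also round-trips through to_bytes/to_disk. A short hedged usage sketch, assuming a spaCy build with these changes:

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer", config={"overwrite": True})
doc = nlp("First sentence. Second sentence.")
print([sent.text for sent in doc.sents])

# the overwrite setting is included when the component is serialized
data = nlp.get_pipe("sentencizer").to_bytes()
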
@ -1,5 +1,6 @@
|
||||||
# cython: infer_types=True, profile=True, binding=True
|
# cython: infer_types=True, profile=True, binding=True
|
||||||
from itertools import islice
|
diff --git a/spacy/pipeline/senter.pyx b/spacy/pipeline/senter.pyx
--- a/spacy/pipeline/senter.pyx
+++ b/spacy/pipeline/senter.pyx
 from itertools import islice
+from typing import Optional, Callable
 
 import srsly
 from thinc.api import Model, SequenceCategoricalCrossentropy, Config
@@ -11,8 +12,11 @@ from ..language import Language
 from ..errors import Errors
 from ..scorer import Scorer
 from ..training import validate_examples, validate_get_examples
+from ..util import registry
 from .. import util
 
 
+# See #9050
+BACKWARD_OVERWRITE = False
 
 default_model_config = """
 [model]
@@ -34,11 +38,25 @@ DEFAULT_SENTER_MODEL = Config().from_str(default_model_config)["model"]
 @Language.factory(
     "senter",
     assigns=["token.is_sent_start"],
-    default_config={"model": DEFAULT_SENTER_MODEL},
+    default_config={"model": DEFAULT_SENTER_MODEL, "overwrite": False, "scorer": {"@scorers": "spacy.senter_scorer.v1"}},
     default_score_weights={"sents_f": 1.0, "sents_p": 0.0, "sents_r": 0.0},
 )
-def make_senter(nlp: Language, name: str, model: Model):
-    return SentenceRecognizer(nlp.vocab, model, name)
+def make_senter(nlp: Language, name: str, model: Model, overwrite: bool, scorer: Optional[Callable]):
+    return SentenceRecognizer(nlp.vocab, model, name, overwrite=overwrite, scorer=scorer)
 
 
+def senter_score(examples, **kwargs):
+    def has_sents(doc):
+        return doc.has_annotation("SENT_START")
+
+    results = Scorer.score_spans(examples, "sents", has_annotation=has_sents, **kwargs)
+    del results["sents_per_type"]
+    return results
+
+
+@registry.scorers("spacy.senter_scorer.v1")
+def make_senter_scorer():
+    return senter_score
+
+
 class SentenceRecognizer(Tagger):
@@ -46,13 +64,23 @@ class SentenceRecognizer(Tagger):
 
     DOCS: https://spacy.io/api/sentencerecognizer
     """
-    def __init__(self, vocab, model, name="senter"):
+    def __init__(
+        self,
+        vocab,
+        model,
+        name="senter",
+        *,
+        overwrite=BACKWARD_OVERWRITE,
+        scorer=senter_score,
+    ):
         """Initialize a sentence recognizer.
 
         vocab (Vocab): The shared vocabulary.
         model (thinc.api.Model): The Thinc Model powering the pipeline component.
         name (str): The component instance name, used to add entries to the
             losses during training.
+        scorer (Optional[Callable]): The scoring method. Defaults to
+            Scorer.score_spans for the attribute "sents".
 
         DOCS: https://spacy.io/api/sentencerecognizer#init
         """
@@ -60,7 +88,8 @@ class SentenceRecognizer(Tagger):
         self.model = model
         self.name = name
         self._rehearsal_model = None
-        self.cfg = {}
+        self.cfg = {"overwrite": overwrite}
+        self.scorer = scorer
 
     @property
     def labels(self):
@@ -85,13 +114,13 @@ class SentenceRecognizer(Tagger):
         if isinstance(docs, Doc):
             docs = [docs]
         cdef Doc doc
+        cdef bint overwrite = self.cfg["overwrite"]
         for i, doc in enumerate(docs):
             doc_tag_ids = batch_tag_ids[i]
             if hasattr(doc_tag_ids, "get"):
                 doc_tag_ids = doc_tag_ids.get()
             for j, tag_id in enumerate(doc_tag_ids):
-                # Don't clobber existing sentence boundaries
-                if doc.c[j].sent_start == 0:
+                if doc.c[j].sent_start == 0 or overwrite:
                     if tag_id == 1:
                         doc.c[j].sent_start = 1
                     else:
@@ -153,18 +182,3 @@ class SentenceRecognizer(Tagger):
 
     def add_label(self, label, values=None):
         raise NotImplementedError
-
-    def score(self, examples, **kwargs):
-        """Score a batch of examples.
-
-        examples (Iterable[Example]): The examples to score.
-        RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_spans.
-        DOCS: https://spacy.io/api/sentencerecognizer#score
-        """
-        def has_sents(doc):
-            return doc.has_annotation("SENT_START")
-
-        validate_examples(examples, "SentenceRecognizer.score")
-        results = Scorer.score_spans(examples, "sents", has_annotation=has_sents, **kwargs)
-        del results["sents_per_type"]
-        return results
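The senter diff above moves scoring into a registered function ("spacy.senter_scorer.v1") and adds an "overwrite" setting that decides whether predicted boundaries may replace ones that are already set. A minimal usage sketch, assuming the factory settings introduced in this diff (the pipeline is untrained, so this only demonstrates the wiring):

import spacy

# Build a blank pipeline and allow the senter to overwrite pre-set sentence starts.
nlp = spacy.blank("en")
senter = nlp.add_pipe("senter", config={"overwrite": True})

# The setting ends up on the component's cfg, as in __init__ above.
assert senter.cfg["overwrite"] is True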
diff --git a/spacy/pipeline/spancat.py b/spacy/pipeline/spancat.py
--- a/spacy/pipeline/spancat.py
+++ b/spacy/pipeline/spancat.py
@@ -78,7 +78,7 @@ def build_ngram_suggester(sizes: List[int]) -> Suggester:
         if len(spans) > 0:
             output = Ragged(ops.xp.vstack(spans), lengths_array)
         else:
-            output = Ragged(ops.xp.zeros((0, 0)), lengths_array)
+            output = Ragged(ops.xp.zeros((0, 0), dtype="i"), lengths_array)
 
         assert output.dataXd.ndim == 2
         return output
@@ -104,6 +104,7 @@ def build_ngram_range_suggester(min_size: int, max_size: int) -> Suggester:
         "max_positive": None,
         "model": DEFAULT_SPANCAT_MODEL,
         "suggester": {"@misc": "spacy.ngram_suggester.v1", "sizes": [1, 2, 3]},
+        "scorer": {"@scorers": "spacy.spancat_scorer.v1"},
     },
     default_score_weights={"spans_sc_f": 1.0, "spans_sc_p": 0.0, "spans_sc_r": 0.0},
 )
@@ -113,8 +114,9 @@ def make_spancat(
     suggester: Suggester,
     model: Model[Tuple[List[Doc], Ragged], Floats2d],
     spans_key: str,
-    threshold: float = 0.5,
-    max_positive: Optional[int] = None,
+    scorer: Optional[Callable],
+    threshold: float,
+    max_positive: Optional[int],
 ) -> "SpanCategorizer":
     """Create a SpanCategorizer component. The span categorizer consists of two
     parts: a suggester function that proposes candidate spans, and a labeller
@@ -144,9 +146,28 @@ def make_spancat(
         threshold=threshold,
         max_positive=max_positive,
         name=name,
+        scorer=scorer,
     )
 
 
+def spancat_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
+    kwargs = dict(kwargs)
+    attr_prefix = "spans_"
+    key = kwargs["spans_key"]
+    kwargs.setdefault("attr", f"{attr_prefix}{key}")
+    kwargs.setdefault("allow_overlap", True)
+    kwargs.setdefault(
+        "getter", lambda doc, key: doc.spans.get(key[len(attr_prefix) :], [])
+    )
+    kwargs.setdefault("has_annotation", lambda doc: key in doc.spans)
+    return Scorer.score_spans(examples, **kwargs)
+
+
+@registry.scorers("spacy.spancat_scorer.v1")
+def make_spancat_scorer():
+    return spancat_score
+
+
 class SpanCategorizer(TrainablePipe):
     """Pipeline component to label spans of text.
 
@@ -163,8 +184,25 @@ class SpanCategorizer(TrainablePipe):
         spans_key: str = "spans",
         threshold: float = 0.5,
         max_positive: Optional[int] = None,
+        scorer: Optional[Callable] = spancat_score,
     ) -> None:
         """Initialize the span categorizer.
+        vocab (Vocab): The shared vocabulary.
+        model (thinc.api.Model): The Thinc Model powering the pipeline component.
+        name (str): The component instance name, used to add entries to the
+            losses during training.
+        spans_key (str): Key of the Doc.spans dict to save the spans under.
+            During initialization and training, the component will look for
+            spans on the reference document under the same key. Defaults to
+            `"spans"`.
+        threshold (float): Minimum probability to consider a prediction
+            positive. Spans with a positive prediction will be saved on the Doc.
+            Defaults to 0.5.
+        max_positive (Optional[int]): Maximum number of labels to consider
+            positive per span. Defaults to None, indicating no limit.
+        scorer (Optional[Callable]): The scoring method. Defaults to
+            Scorer.score_spans for the Doc.spans[spans_key] with overlapping
+            spans allowed.
 
         DOCS: https://spacy.io/api/spancategorizer#init
         """
@@ -178,6 +216,7 @@ class SpanCategorizer(TrainablePipe):
         self.suggester = suggester
         self.model = model
         self.name = name
+        self.scorer = scorer
 
     @property
     def key(self) -> str:
@@ -379,26 +418,6 @@ class SpanCategorizer(TrainablePipe):
         else:
             self.model.initialize()
 
-    def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
-        """Score a batch of examples.
-
-        examples (Iterable[Example]): The examples to score.
-        RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_cats.
-
-        DOCS: https://spacy.io/api/spancategorizer#score
-        """
-        validate_examples(examples, "SpanCategorizer.score")
-        self._validate_categories(examples)
-        kwargs = dict(kwargs)
-        attr_prefix = "spans_"
-        kwargs.setdefault("attr", f"{attr_prefix}{self.key}")
-        kwargs.setdefault("allow_overlap", True)
-        kwargs.setdefault(
-            "getter", lambda doc, key: doc.spans.get(key[len(attr_prefix) :], [])
-        )
-        kwargs.setdefault("has_annotation", lambda doc: self.key in doc.spans)
-        return Scorer.score_spans(examples, **kwargs)
-
     def _validate_categories(self, examples: Iterable[Example]):
         # TODO
         pass
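The spancat changes register the default scorer as "spacy.spancat_scorer.v1" and route it through the new "scorer" setting, so an alternative can be swapped in from the config. A sketch of that wiring under a made-up registry name ("custom_spancat_scorer.v1"); the wrapper only flips the allow_overlap default before delegating to the spancat_score function added above, and note that spancat_score expects a spans_key keyword (see key = kwargs["spans_key"] in the diff):

import spacy
from spacy.pipeline.spancat import spancat_score  # added in this diff

# "custom_spancat_scorer.v1" is a hypothetical name used only for this sketch.
@spacy.registry.scorers("custom_spancat_scorer.v1")
def make_custom_spancat_scorer():
    def scorer(examples, **kwargs):
        # spancat_score only applies defaults it doesn't find, so this setting wins.
        kwargs.setdefault("allow_overlap", False)
        return spancat_score(examples, **kwargs)
    return scorer

nlp = spacy.blank("en")
nlp.add_pipe("spancat", config={"scorer": {"@scorers": "custom_spancat_scorer.v1"}})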
diff --git a/spacy/pipeline/tagger.pyx b/spacy/pipeline/tagger.pyx
--- a/spacy/pipeline/tagger.pyx
+++ b/spacy/pipeline/tagger.pyx
@@ -1,4 +1,5 @@
 # cython: infer_types=True, profile=True, binding=True
+from typing import Callable, Optional
 import numpy
 import srsly
 from thinc.api import Model, set_dropout_rate, SequenceCategoricalCrossentropy, Config
@@ -18,8 +19,11 @@ from ..parts_of_speech import X
 from ..errors import Errors, Warnings
 from ..scorer import Scorer
 from ..training import validate_examples, validate_get_examples
+from ..util import registry
 from .. import util
 
 
+# See #9050
+BACKWARD_OVERWRITE = False
 
 default_model_config = """
 [model]
@@ -41,10 +45,17 @@ DEFAULT_TAGGER_MODEL = Config().from_str(default_model_config)["model"]
 @Language.factory(
     "tagger",
     assigns=["token.tag"],
-    default_config={"model": DEFAULT_TAGGER_MODEL},
+    default_config={"model": DEFAULT_TAGGER_MODEL, "overwrite": False, "scorer": {"@scorers": "spacy.tagger_scorer.v1"}, "neg_prefix": "!"},
     default_score_weights={"tag_acc": 1.0},
 )
-def make_tagger(nlp: Language, name: str, model: Model):
+def make_tagger(
+    nlp: Language,
+    name: str,
+    model: Model,
+    overwrite: bool,
+    scorer: Optional[Callable],
+    neg_prefix: str,
+):
     """Construct a part-of-speech tagger component.
 
     model (Model[List[Doc], List[Floats2d]]): A model instance that predicts
@@ -52,7 +63,16 @@ def make_tagger(nlp: Language, name: str, model: Model):
     in size, and be normalized as probabilities (all scores between 0 and 1,
     with the rows summing to 1).
     """
-    return Tagger(nlp.vocab, model, name)
+    return Tagger(nlp.vocab, model, name, overwrite=overwrite, scorer=scorer, neg_prefix=neg_prefix)
 
 
+def tagger_score(examples, **kwargs):
+    return Scorer.score_token_attr(examples, "tag", **kwargs)
+
+
+@registry.scorers("spacy.tagger_scorer.v1")
+def make_tagger_scorer():
+    return tagger_score
+
+
 class Tagger(TrainablePipe):
@@ -60,13 +80,24 @@ class Tagger(TrainablePipe):
 
     DOCS: https://spacy.io/api/tagger
     """
-    def __init__(self, vocab, model, name="tagger"):
+    def __init__(
+        self,
+        vocab,
+        model,
+        name="tagger",
+        *,
+        overwrite=BACKWARD_OVERWRITE,
+        scorer=tagger_score,
+        neg_prefix="!",
+    ):
         """Initialize a part-of-speech tagger.
 
         vocab (Vocab): The shared vocabulary.
         model (thinc.api.Model): The Thinc Model powering the pipeline component.
         name (str): The component instance name, used to add entries to the
             losses during training.
+        scorer (Optional[Callable]): The scoring method. Defaults to
+            Scorer.score_token_attr for the attribute "tag".
 
         DOCS: https://spacy.io/api/tagger#init
         """
@@ -74,8 +105,9 @@ class Tagger(TrainablePipe):
         self.model = model
         self.name = name
         self._rehearsal_model = None
-        cfg = {"labels": []}
+        cfg = {"labels": [], "overwrite": overwrite, "neg_prefix": neg_prefix}
         self.cfg = dict(sorted(cfg.items()))
+        self.scorer = scorer
 
     @property
     def labels(self):
@@ -135,14 +167,15 @@ class Tagger(TrainablePipe):
             docs = [docs]
         cdef Doc doc
         cdef Vocab vocab = self.vocab
+        cdef bint overwrite = self.cfg["overwrite"]
+        labels = self.labels
         for i, doc in enumerate(docs):
             doc_tag_ids = batch_tag_ids[i]
             if hasattr(doc_tag_ids, "get"):
                 doc_tag_ids = doc_tag_ids.get()
             for j, tag_id in enumerate(doc_tag_ids):
-                # Don't clobber preset POS tags
-                if doc.c[j].tag == 0:
-                    doc.c[j].tag = self.vocab.strings[self.labels[tag_id]]
+                if doc.c[j].tag == 0 or overwrite:
+                    doc.c[j].tag = self.vocab.strings[labels[tag_id]]
 
     def update(self, examples, *, drop=0., sgd=None, losses=None):
         """Learn from a batch of documents and gold-standard information,
@@ -222,7 +255,7 @@ class Tagger(TrainablePipe):
         DOCS: https://spacy.io/api/tagger#get_loss
         """
         validate_examples(examples, "Tagger.get_loss")
-        loss_func = SequenceCategoricalCrossentropy(names=self.labels, normalize=False, neg_prefix="!")
+        loss_func = SequenceCategoricalCrossentropy(names=self.labels, normalize=False, neg_prefix=self.cfg["neg_prefix"])
         # Convert empty tag "" to missing value None so that both misaligned
         # tokens and tokens with missing annotation have the default missing
         # value None.
@@ -289,15 +322,3 @@ class Tagger(TrainablePipe):
         self.cfg["labels"].append(label)
         self.vocab.strings.add(label)
         return 1
-
-    def score(self, examples, **kwargs):
-        """Score a batch of examples.
-
-        examples (Iterable[Example]): The examples to score.
-        RETURNS (Dict[str, Any]): The scores, produced by
-            Scorer.score_token_attr for the attributes "tag".
-
-        DOCS: https://spacy.io/api/tagger#score
-        """
-        validate_examples(examples, "Tagger.score")
-        return Scorer.score_token_attr(examples, "tag", **kwargs)
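For the tagger, the same pattern applies, plus a "neg_prefix" setting that is forwarded to SequenceCategoricalCrossentropy in get_loss (roughly, a gold tag written as "!VERB" marks a token that should not be tagged VERB). A short configuration sketch, assuming a blank English pipeline:

import spacy

nlp = spacy.blank("en")
tagger = nlp.add_pipe(
    "tagger",
    # Both keys come from the factory's default_config shown in the diff above.
    config={"overwrite": True, "neg_prefix": "!"},
)

# Both settings are stored on the component's cfg, see __init__ above.
assert tagger.cfg["overwrite"] is True
assert tagger.cfg["neg_prefix"] == "!"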
diff --git a/spacy/pipeline/textcat.py b/spacy/pipeline/textcat.py
--- a/spacy/pipeline/textcat.py
+++ b/spacy/pipeline/textcat.py
@@ -10,6 +10,7 @@ from ..training import Example, validate_examples, validate_get_examples
 from ..errors import Errors
 from ..scorer import Scorer
 from ..tokens import Doc
+from ..util import registry
 from ..vocab import Vocab
 
 
@@ -70,7 +71,11 @@ subword_features = true
 @Language.factory(
     "textcat",
     assigns=["doc.cats"],
-    default_config={"threshold": 0.5, "model": DEFAULT_SINGLE_TEXTCAT_MODEL},
+    default_config={
+        "threshold": 0.5,
+        "model": DEFAULT_SINGLE_TEXTCAT_MODEL,
+        "scorer": {"@scorers": "spacy.textcat_scorer.v1"},
+    },
     default_score_weights={
         "cats_score": 1.0,
         "cats_score_desc": None,
@@ -86,7 +91,11 @@ subword_features = true
     },
 )
 def make_textcat(
-    nlp: Language, name: str, model: Model[List[Doc], List[Floats2d]], threshold: float
+    nlp: Language,
+    name: str,
+    model: Model[List[Doc], List[Floats2d]],
+    threshold: float,
+    scorer: Optional[Callable],
 ) -> "TextCategorizer":
     """Create a TextCategorizer component. The text categorizer predicts categories
     over a whole document. It can learn one or more labels, and the labels are considered
@@ -95,8 +104,23 @@ def make_textcat(
     model (Model[List[Doc], List[Floats2d]]): A model instance that predicts
         scores for each category.
     threshold (float): Cutoff to consider a prediction "positive".
+    scorer (Optional[Callable]): The scoring method.
     """
-    return TextCategorizer(nlp.vocab, model, name, threshold=threshold)
+    return TextCategorizer(nlp.vocab, model, name, threshold=threshold, scorer=scorer)
 
 
+def textcat_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
+    return Scorer.score_cats(
+        examples,
+        "cats",
+        multi_label=False,
+        **kwargs,
+    )
+
+
+@registry.scorers("spacy.textcat_scorer.v1")
+def make_textcat_scorer():
+    return textcat_score
+
+
 class TextCategorizer(TrainablePipe):
@@ -106,7 +130,13 @@ class TextCategorizer(TrainablePipe):
     """
 
     def __init__(
-        self, vocab: Vocab, model: Model, name: str = "textcat", *, threshold: float
+        self,
+        vocab: Vocab,
+        model: Model,
+        name: str = "textcat",
+        *,
+        threshold: float,
+        scorer: Optional[Callable] = textcat_score,
     ) -> None:
         """Initialize a text categorizer for single-label classification.
 
@@ -115,6 +145,8 @@ class TextCategorizer(TrainablePipe):
         name (str): The component instance name, used to add entries to the
             losses during training.
         threshold (float): Cutoff to consider a prediction "positive".
+        scorer (Optional[Callable]): The scoring method. Defaults to
+            Scorer.score_cats for the attribute "cats".
 
         DOCS: https://spacy.io/api/textcategorizer#init
         """
@@ -124,6 +156,7 @@ class TextCategorizer(TrainablePipe):
         self._rehearsal_model = None
         cfg = {"labels": [], "threshold": threshold, "positive_label": None}
         self.cfg = dict(cfg)
+        self.scorer = scorer
 
     @property
     def labels(self) -> Tuple[str]:
@@ -353,26 +386,6 @@ class TextCategorizer(TrainablePipe):
         assert len(label_sample) > 0, Errors.E923.format(name=self.name)
         self.model.initialize(X=doc_sample, Y=label_sample)
 
-    def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
-        """Score a batch of examples.
-
-        examples (Iterable[Example]): The examples to score.
-        RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_cats.
-
-        DOCS: https://spacy.io/api/textcategorizer#score
-        """
-        validate_examples(examples, "TextCategorizer.score")
-        self._validate_categories(examples)
-        kwargs.setdefault("threshold", self.cfg["threshold"])
-        kwargs.setdefault("positive_label", self.cfg["positive_label"])
-        return Scorer.score_cats(
-            examples,
-            "cats",
-            labels=self.labels,
-            multi_label=False,
-            **kwargs,
-        )
-
     def _validate_categories(self, examples: Iterable[Example]):
         """Check whether the provided examples all have single-label cats annotations."""
         for ex in examples:
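The textcat factory now resolves its scorer from the registry ("spacy.textcat_scorer.v1" by default). Replacing it follows the same recipe as for spancat; "custom_textcat_scorer.v1" below is a made-up name for illustration, and the wrapper simply delegates to Scorer.score_cats like the default textcat_score above:

import spacy
from spacy.scorer import Scorer

# Hypothetical registry name, used only for this sketch.
@spacy.registry.scorers("custom_textcat_scorer.v1")
def make_custom_textcat_scorer():
    def scorer(examples, **kwargs):
        # Same call as textcat_score in the diff; adjust kwargs here as needed.
        return Scorer.score_cats(examples, "cats", multi_label=False, **kwargs)
    return scorer

nlp = spacy.blank("en")
nlp.add_pipe("textcat", config={"scorer": {"@scorers": "custom_textcat_scorer.v1"}})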
diff --git a/spacy/pipeline/textcat_multilabel.py b/spacy/pipeline/textcat_multilabel.py
--- a/spacy/pipeline/textcat_multilabel.py
+++ b/spacy/pipeline/textcat_multilabel.py
@@ -5,10 +5,11 @@ from thinc.api import Model, Config
 from thinc.types import Floats2d
 
 from ..language import Language
-from ..training import Example, validate_examples, validate_get_examples
+from ..training import Example, validate_get_examples
 from ..errors import Errors
 from ..scorer import Scorer
 from ..tokens import Doc
+from ..util import registry
 from ..vocab import Vocab
 from .textcat import TextCategorizer
 
@@ -70,7 +71,11 @@ subword_features = true
 @Language.factory(
     "textcat_multilabel",
     assigns=["doc.cats"],
-    default_config={"threshold": 0.5, "model": DEFAULT_MULTI_TEXTCAT_MODEL},
+    default_config={
+        "threshold": 0.5,
+        "model": DEFAULT_MULTI_TEXTCAT_MODEL,
+        "scorer": {"@scorers": "spacy.textcat_multilabel_scorer.v1"},
+    },
     default_score_weights={
         "cats_score": 1.0,
         "cats_score_desc": None,
@@ -86,7 +91,11 @@ subword_features = true
     },
 )
 def make_multilabel_textcat(
-    nlp: Language, name: str, model: Model[List[Doc], List[Floats2d]], threshold: float
+    nlp: Language,
+    name: str,
+    model: Model[List[Doc], List[Floats2d]],
+    threshold: float,
+    scorer: Optional[Callable],
 ) -> "TextCategorizer":
     """Create a TextCategorizer component. The text categorizer predicts categories
     over a whole document. It can learn one or more labels, and the labels are considered
@@ -97,7 +106,23 @@ def make_multilabel_textcat(
         scores for each category.
     threshold (float): Cutoff to consider a prediction "positive".
     """
-    return MultiLabel_TextCategorizer(nlp.vocab, model, name, threshold=threshold)
+    return MultiLabel_TextCategorizer(
+        nlp.vocab, model, name, threshold=threshold, scorer=scorer
+    )
 
 
+def textcat_multilabel_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
+    return Scorer.score_cats(
+        examples,
+        "cats",
+        multi_label=True,
+        **kwargs,
+    )
+
+
+@registry.scorers("spacy.textcat_multilabel_scorer.v1")
+def make_textcat_multilabel_scorer():
+    return textcat_multilabel_score
+
+
 class MultiLabel_TextCategorizer(TextCategorizer):
@@ -113,6 +138,7 @@ class MultiLabel_TextCategorizer(TextCategorizer):
         name: str = "textcat_multilabel",
         *,
         threshold: float,
+        scorer: Optional[Callable] = textcat_multilabel_score,
     ) -> None:
         """Initialize a text categorizer for multi-label classification.
 
@@ -130,6 +156,7 @@ class MultiLabel_TextCategorizer(TextCategorizer):
         self._rehearsal_model = None
         cfg = {"labels": [], "threshold": threshold}
         self.cfg = dict(cfg)
+        self.scorer = scorer
 
     def initialize(  # type: ignore[override]
         self,
@@ -166,24 +193,6 @@ class MultiLabel_TextCategorizer(TextCategorizer):
         assert len(label_sample) > 0, Errors.E923.format(name=self.name)
         self.model.initialize(X=doc_sample, Y=label_sample)
 
-    def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
-        """Score a batch of examples.
-
-        examples (Iterable[Example]): The examples to score.
-        RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_cats.
-
-        DOCS: https://spacy.io/api/textcategorizer#score
-        """
-        validate_examples(examples, "MultiLabel_TextCategorizer.score")
-        kwargs.setdefault("threshold", self.cfg["threshold"])
-        return Scorer.score_cats(
-            examples,
-            "cats",
-            labels=self.labels,
-            multi_label=True,
-            **kwargs,
-        )
-
     def _validate_categories(self, examples: Iterable[Example]):
         """This component allows any type of single- or multi-label annotations.
         This method overwrites the more strict one from 'textcat'."""
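textcat_multilabel gets the same treatment, with multi_label=True baked into its registered default. As a rough check of the wiring (assuming the registry entry from this diff is available once spaCy is imported), the default scorer can be looked up in the registry and compared with what the component stores:

import spacy

nlp = spacy.blank("en")
pipe = nlp.add_pipe("textcat_multilabel")

# The factory default {"@scorers": "spacy.textcat_multilabel_scorer.v1"} should
# resolve to textcat_multilabel_score, which __init__ stores on the component.
make_scorer = spacy.registry.scorers.get("spacy.textcat_multilabel_scorer.v1")
print(make_scorer())  # the registered default scorer function
print(pipe.scorer)    # set by __init__, see the diff above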