Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-12 18:26:30 +03:00)

Commit a3335d36b8: Merge branch 'develop' into refactor/remove-symlinks
106
.github/contributors/AlJohri.md
vendored
Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Al Johri |
| Company name (if applicable) | N/A |
| Title or role (if applicable) | N/A |
| Date | December 27th, 2019 |
| GitHub username | AlJohri |
| Website (optional) | http://aljohri.com/ |

106
.github/contributors/Jan-711.md
vendored
Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jan Jessewitsch |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 16.02.2020 |
| GitHub username | Jan-711 |
| Website (optional) | |

106
.github/contributors/ceteri.md
vendored
Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | ---------------------- |
| Name | Paco Nathan |
| Company name (if applicable) | Derwen, Inc. |
| Title or role (if applicable) | Managing Partner |
| Date | 2020-01-25 |
| GitHub username | ceteri |
| Website (optional) | https://derwen.ai/paco |

106
.github/contributors/drndos.md
vendored
Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Filip Bednárik |
| Company name (if applicable) | Ardevop, s. r. o. |
| Title or role (if applicable) | IT Consultant |
| Date | 2020-01-26 |
| GitHub username | drndos |
| Website (optional) | https://ardevop.sk |

106
.github/contributors/iechevarria.md
vendored
Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | --------------------- |
| Name | Ivan Echevarria |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-12-24 |
| GitHub username | iechevarria |
| Website (optional) | https://echevarria.io |

106
.github/contributors/iurshina.md
vendored
Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
|
||||||
|
|
||||||
|
* you agree that each of us can do all things in relation to your
|
||||||
|
contribution as if each of us were the sole owners, and if one of us makes
|
||||||
|
a derivative work of your contribution, the one who makes the derivative
|
||||||
|
work (or has it made) will be the sole owner of that derivative work;
|
||||||
|
|
||||||
|
* you agree that you will not assert any moral rights in your contribution
|
||||||
|
against us, our licensees or transferees;
|
||||||
|
|
||||||
|
* you agree that we may register a copyright in your contribution and
|
||||||
|
exercise all ownership rights associated with it; and
|
||||||
|
|
||||||
|
* you agree that neither of us has any duty to consult with, obtain the
|
||||||
|
consent of, pay or render an accounting to the other for any use or
|
||||||
|
distribution of your contribution.
|
||||||
|
|
||||||
|
3. With respect to any patents you own, or that you can license without payment
|
||||||
|
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||||
|
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||||
|
|
||||||
|
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||||
|
your contribution in whole or in part, alone or in combination with or
|
||||||
|
included in any product, work or materials arising out of the project to
|
||||||
|
which your contribution was submitted, and
|
||||||
|
|
||||||
|
* at our option, to sublicense these same rights to third parties through
|
||||||
|
multiple levels of sublicensees or other licensing arrangements.
|
||||||
|
|
||||||
|
4. Except as set out above, you keep all right, title, and interest in your
|
||||||
|
contribution. The rights that you grant to us under these terms are effective
|
||||||
|
on the date you first submitted a contribution to us, even if your submission
|
||||||
|
took place before the date you sign these terms.
|
||||||
|
|
||||||
|
5. You covenant, represent, warrant and agree that:
|
||||||
|
|
||||||
|
* Each contribution that you submit is and shall be an original work of
|
||||||
|
authorship and you can legally grant the rights set out in this SCA;
|
||||||
|
|
||||||
|
* to the best of your knowledge, each contribution will not violate any
|
||||||
|
third party's copyrights, trademarks, patents, or other intellectual
|
||||||
|
property rights; and
|
||||||
|
|
||||||
|
* each contribution shall be in compliance with U.S. export control laws and
|
||||||
|
other applicable export and import laws. You agree to notify us if you
|
||||||
|
become aware of any circumstance which would make any of the foregoing
|
||||||
|
representations inaccurate in any respect. We may publicly disclose your
|
||||||
|
participation in the project, including the fact that you have signed the SCA.
|
||||||
|
|
||||||
|
6. This SCA is governed by the laws of the State of California and applicable
|
||||||
|
U.S. Federal law. Any choice of law rules will not apply.
|
||||||
|
|
||||||
|
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||||
|
mark both statements:
|
||||||
|
|
||||||
|
* [ ] I am signing on behalf of myself as an individual and no other person
|
||||||
|
or entity, including my employer, has or will have rights with respect to my
|
||||||
|
contributions.
|
||||||
|
|
||||||
|
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||||
|
actual authority to contractually bind that entity.
|
||||||
|
|
||||||
|
## Contributor Details
|
||||||
|
|
||||||
|
| Field | Entry |
|
||||||
|
|------------------------------- | -------------------- |
|
||||||
|
| Name | Anastasiia Iurshina |
|
||||||
|
| Company name (if applicable) | |
|
||||||
|
| Title or role (if applicable) | |
|
||||||
|
| Date | 28.12.2019 |
|
||||||
|
| GitHub username | iurshina |
|
||||||
|
| Website (optional) | |
|
106
.github/contributors/onlyanegg.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
||||||
|
# spaCy contributor agreement
|
||||||
|
|
||||||
|
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||||
|
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||||
|
The SCA applies to any contribution that you make to any product or project
|
||||||
|
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||||
|
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||||
|
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||||
|
**"you"** shall mean the person or entity identified below.
|
||||||
|
|
||||||
|
If you agree to be bound by these terms, fill in the information requested
|
||||||
|
below and include the filled-in version with your first pull request, under the
|
||||||
|
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||||
|
should be your GitHub username, with the extension `.md`. For example, the user
|
||||||
|
example_user would create the file `.github/contributors/example_user.md`.
|
||||||
|
|
||||||
|
Read this agreement carefully before signing. These terms and conditions
|
||||||
|
constitute a binding legal agreement.
|
||||||
|
|
||||||
|
## Contributor Agreement
|
||||||
|
|
||||||
|
1. The term "contribution" or "contributed materials" means any source code,
|
||||||
|
object code, patch, tool, sample, graphic, specification, manual,
|
||||||
|
documentation, or any other material posted or submitted by you to the project.
|
||||||
|
|
||||||
|
2. With respect to any worldwide copyrights, or copyright applications and
|
||||||
|
registrations, in your contribution:
|
||||||
|
|
||||||
|
* you hereby assign to us joint ownership, and to the extent that such
|
||||||
|
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||||
|
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||||
|
royalty-free, unrestricted license to exercise all rights under those
|
||||||
|
copyrights. This includes, at our option, the right to sublicense these same
|
||||||
|
rights to third parties through multiple levels of sublicensees or other
|
||||||
|
licensing arrangements;
|
||||||
|
|
||||||
|
* you agree that each of us can do all things in relation to your
|
||||||
|
contribution as if each of us were the sole owners, and if one of us makes
|
||||||
|
a derivative work of your contribution, the one who makes the derivative
|
||||||
|
work (or has it made) will be the sole owner of that derivative work;
|
||||||
|
|
||||||
|
* you agree that you will not assert any moral rights in your contribution
|
||||||
|
against us, our licensees or transferees;
|
||||||
|
|
||||||
|
* you agree that we may register a copyright in your contribution and
|
||||||
|
exercise all ownership rights associated with it; and
|
||||||
|
|
||||||
|
* you agree that neither of us has any duty to consult with, obtain the
|
||||||
|
consent of, pay or render an accounting to the other for any use or
|
||||||
|
distribution of your contribution.
|
||||||
|
|
||||||
|
3. With respect to any patents you own, or that you can license without payment
|
||||||
|
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||||
|
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||||
|
|
||||||
|
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||||
|
your contribution in whole or in part, alone or in combination with or
|
||||||
|
included in any product, work or materials arising out of the project to
|
||||||
|
which your contribution was submitted, and
|
||||||
|
|
||||||
|
* at our option, to sublicense these same rights to third parties through
|
||||||
|
multiple levels of sublicensees or other licensing arrangements.
|
||||||
|
|
||||||
|
4. Except as set out above, you keep all right, title, and interest in your
|
||||||
|
contribution. The rights that you grant to us under these terms are effective
|
||||||
|
on the date you first submitted a contribution to us, even if your submission
|
||||||
|
took place before the date you sign these terms.
|
||||||
|
|
||||||
|
5. You covenant, represent, warrant and agree that:
|
||||||
|
|
||||||
|
- Each contribution that you submit is and shall be an original work of
|
||||||
|
authorship and you can legally grant the rights set out in this SCA;
|
||||||
|
|
||||||
|
- to the best of your knowledge, each contribution will not violate any
|
||||||
|
third party's copyrights, trademarks, patents, or other intellectual
|
||||||
|
property rights; and
|
||||||
|
|
||||||
|
- each contribution shall be in compliance with U.S. export control laws and
|
||||||
|
other applicable export and import laws. You agree to notify us if you
|
||||||
|
become aware of any circumstance which would make any of the foregoing
|
||||||
|
representations inaccurate in any respect. We may publicly disclose your
|
||||||
|
participation in the project, including the fact that you have signed the SCA.
|
||||||
|
|
||||||
|
6. This SCA is governed by the laws of the State of California and applicable
|
||||||
|
U.S. Federal law. Any choice of law rules will not apply.
|
||||||
|
|
||||||
|
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||||
|
mark both statements:
|
||||||
|
|
||||||
|
* [x] I am signing on behalf of myself as an individual and no other person
|
||||||
|
or entity, including my employer, has or will have rights with respect to my
|
||||||
|
contributions.
|
||||||
|
|
||||||
|
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||||
|
actual authority to contractually bind that entity.
|
||||||
|
|
||||||
|
## Contributor Details
|
||||||
|
|
||||||
|
| Field | Entry |
|
||||||
|
| ----------------------------- | ---------------- |
|
||||||
|
| Name | Tyler Couto |
|
||||||
|
| Company name (if applicable) | |
|
||||||
|
| Title or role (if applicable) | |
|
||||||
|
| Date | January 29, 2020 |
|
||||||
|
| GitHub username | onlyanegg |
|
||||||
|
| Website (optional) | |
|
|
@ -1,5 +1,5 @@
|
||||||
recursive-include include *.h
|
recursive-include include *.h
|
||||||
recursive-include spacy *.pyx *.pxd *.txt
|
recursive-include spacy *.txt *.pyx *.pxd
|
||||||
include LICENSE
|
include LICENSE
|
||||||
include README.md
|
include README.md
|
||||||
include bin/spacy
|
include bin/spacy
|
||||||
|
|
|
@ -7,16 +7,17 @@ Run `wikipedia_pretrain_kb.py`
|
||||||
* WikiData: get `latest-all.json.bz2` from https://dumps.wikimedia.org/wikidatawiki/entities/
|
* WikiData: get `latest-all.json.bz2` from https://dumps.wikimedia.org/wikidatawiki/entities/
|
||||||
* Wikipedia: get `enwiki-latest-pages-articles-multistream.xml.bz2` from https://dumps.wikimedia.org/enwiki/latest/ (or for any other language)
|
* Wikipedia: get `enwiki-latest-pages-articles-multistream.xml.bz2` from https://dumps.wikimedia.org/enwiki/latest/ (or for any other language)
|
||||||
* You can set the filtering parameters for KB construction:
|
* You can set the filtering parameters for KB construction:
|
||||||
* `max_per_alias`: (max) number of candidate entities in the KB per alias/synonym
|
* `max_per_alias` (`-a`): (max) number of candidate entities in the KB per alias/synonym
|
||||||
* `min_freq`: threshold of number of times an entity should occur in the corpus to be included in the KB
|
* `min_freq` (`-f`): threshold of number of times an entity should occur in the corpus to be included in the KB
|
||||||
* `min_pair`: threshold of number of times an entity+alias combination should occur in the corpus to be included in the KB
|
* `min_pair` (`-c`): threshold of number of times an entity+alias combination should occur in the corpus to be included in the KB
|
||||||
* Further parameters to set:
|
* Further parameters to set:
|
||||||
* `descriptions_from_wikipedia`: whether to parse descriptions from Wikipedia (`True`) or Wikidata (`False`)
|
* `descriptions_from_wikipedia` (`-wp`): whether to parse descriptions from Wikipedia (`True`) or Wikidata (`False`)
|
||||||
* `entity_vector_length`: length of the pre-trained entity description vectors
|
* `entity_vector_length` (`-v`): length of the pre-trained entity description vectors
|
||||||
* `lang`: language for which to fetch Wikidata information (as the dump contains all languages)
|
* `lang` (`-la`): language for which to fetch Wikidata information (as the dump contains all languages)
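
To make the three filtering thresholds concrete, here is a minimal sketch of how they might interact. It does not reproduce the script's actual implementation: the helper `filter_candidates`, the entity IDs and the toy counts are all hypothetical, and the thresholds are assumed to be inclusive.

```python
from collections import defaultdict


def filter_candidates(entity_freq, pair_counts, max_per_alias, min_freq, min_pair):
    """Hypothetical helper illustrating min_freq, min_pair and max_per_alias.

    entity_freq: dict mapping entity ID -> number of occurrences in the corpus
    pair_counts: dict mapping (alias, entity ID) -> number of co-occurrences
    """
    per_alias = defaultdict(list)
    for (alias, entity), count in pair_counts.items():
        # drop entities that occur too rarely in the corpus
        if entity_freq.get(entity, 0) < min_freq:
            continue
        # drop alias+entity combinations that occur too rarely
        if count < min_pair:
            continue
        per_alias[alias].append((entity, count))

    kb_aliases = {}
    for alias, candidates in per_alias.items():
        # most frequent pairs first, then cut off at max_per_alias
        candidates.sort(key=lambda pair: pair[1], reverse=True)
        kb_aliases[alias] = [entity for entity, _ in candidates[:max_per_alias]]
    return kb_aliases


# toy data, not taken from any real dump
entity_freq = {"E1": 50, "E2": 30, "E3": 2}
pair_counts = {
    ("Paris", "E1"): 40,
    ("Paris", "E2"): 6,
    ("Paris", "E3"): 2,
    ("City of Light", "E1"): 3,
}
result = filter_candidates(entity_freq, pair_counts, max_per_alias=1, min_freq=20, min_pair=5)
print(result)
```

With these toy counts, only the most frequent candidate for the alias "Paris" survives; raising `max_per_alias` to 2 would also keep `E2`.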
|
||||||
|
|
||||||
Quick testing and rerunning:
|
Quick testing and rerunning:
|
||||||
* When trying out the pipeline for a quick test, set `limit_prior`, `limit_train` and/or `limit_wd` to read only parts of the dumps instead of everything.
|
* When trying out the pipeline for a quick test, set `limit_prior` (`-lp`), `limit_train` (`-lt`) and/or `limit_wd` (`-lw`) to read only parts of the dumps instead of everything.
|
||||||
|
* e.g. set `-lt 20000 -lp 2000 -lw 3000 -f 1`
|
||||||
* If you only want to (re)run certain parts of the pipeline, just remove the corresponding files and they will be recalculated or reparsed.
|
* If you only want to (re)run certain parts of the pipeline, just remove the corresponding files and they will be recalculated or reparsed.
|
||||||
|
|
||||||
|
|
||||||
|
@ -24,11 +25,13 @@ Quick testing and rerunning:
|
||||||
|
|
||||||
Run `wikidata_train_entity_linker.py`
|
Run `wikidata_train_entity_linker.py`
|
||||||
* This takes the **KB directory** produced by Step 1, and trains an **Entity Linking model**
|
* This takes the **KB directory** produced by Step 1, and trains an **Entity Linking model**
|
||||||
|
* Specify the output directory (`-o`) in which the final, trained model will be saved
|
||||||
* You can set the learning parameters for the EL training:
|
* You can set the learning parameters for the EL training:
|
||||||
* `epochs`: number of training iterations
|
* `epochs` (`-e`): number of training iterations
|
||||||
* `dropout`: dropout rate
|
* `dropout` (`-p`): dropout rate
|
||||||
* `lr`: learning rate
|
* `lr` (`-n`): learning rate
|
||||||
* `l2`: L2 regularization
|
* `l2` (`-r`): L2 regularization
|
||||||
* Specify the number of training and dev testing entities with `train_inst` and `dev_inst` respectively
|
* Specify the number of training and dev testing articles with `train_articles` (`-t`) and `dev_articles` (`-d`) respectively
|
||||||
|
* If not specified, the full dataset will be processed, which may take a LONG time!
|
||||||
* Further parameters to set:
|
* Further parameters to set:
|
||||||
* `labels_discard`: NER label types to discard during training
|
* `labels_discard` (`-l`): NER label types to discard during training
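
As a rough illustration of how the short flags above map onto parameters, the sketch below rebuilds them with the standard library's `argparse`. This is a stand-in, not the script's own `plac` setup: the long option names and the missing default for `epochs` are assumptions, while the defaults for `dropout`, `lr` and `l2` are the ones shown in the script's signature.

```python
import argparse


def build_parser():
    """Stand-in parser mirroring the short flags listed above (assumed long names)."""
    parser = argparse.ArgumentParser(description="Sketch of the entity linker training options")
    parser.add_argument("-o", "--output_dir", default=None, help="Directory for the trained model")
    parser.add_argument("-e", "--epochs", type=int, default=None, help="Number of training iterations")
    parser.add_argument("-p", "--dropout", type=float, default=0.5, help="Dropout rate")
    parser.add_argument("-n", "--lr", type=float, default=0.005, help="Learning rate")
    parser.add_argument("-r", "--l2", type=float, default=1e-6, help="L2 regularization")
    parser.add_argument("-t", "--train_articles", type=int, default=None, help="# training articles")
    parser.add_argument("-d", "--dev_articles", type=int, default=None, help="# dev test articles")
    parser.add_argument("-l", "--labels_discard", default=None, help="Comma-separated NER labels to discard")
    return parser


# a small run for quick testing: few articles, two NER labels discarded
args = build_parser().parse_args(
    ["-o", "output", "-e", "5", "-t", "1000", "-d", "200", "-l", "DATE, MONEY"]
)

# same splitting logic as in the training script
labels = [x.strip() for x in args.labels_discard.split(",")] if args.labels_discard else []
print(args.train_articles, args.dev_articles, labels)
```

Leaving `-t` and `-d` unset keeps both values at `None`, which corresponds to processing the full dataset.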
|
||||||
|
|
|
@ -1,6 +1,8 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import logging
|
import logging
|
||||||
import random
|
import random
|
||||||
|
|
||||||
from tqdm import tqdm
|
from tqdm import tqdm
|
||||||
from collections import defaultdict
|
from collections import defaultdict
|
||||||
|
|
||||||
|
@ -92,102 +94,81 @@ class BaselineResults(object):
|
||||||
self.random.update_metrics(ent_label, true_entity, random_candidate)
|
self.random.update_metrics(ent_label, true_entity, random_candidate)
|
||||||
|
|
||||||
|
|
||||||
def measure_performance(dev_data, kb, el_pipe, baseline=True, context=True):
|
def measure_performance(dev_data, kb, el_pipe, baseline=True, context=True, dev_limit=None):
|
||||||
|
counts = dict()
|
||||||
|
baseline_results = BaselineResults()
|
||||||
|
context_results = EvaluationResults()
|
||||||
|
combo_results = EvaluationResults()
|
||||||
|
|
||||||
|
for doc, gold in tqdm(dev_data, total=dev_limit, leave=False, desc='Processing dev data'):
|
||||||
|
if len(doc) > 0:
|
||||||
|
correct_ents = dict()
|
||||||
|
for entity, kb_dict in gold.links.items():
|
||||||
|
start, end = entity
|
||||||
|
for gold_kb, value in kb_dict.items():
|
||||||
|
if value:
|
||||||
|
# only evaluating on positive examples
|
||||||
|
offset = _offset(start, end)
|
||||||
|
correct_ents[offset] = gold_kb
|
||||||
|
|
||||||
if baseline:
|
if baseline:
|
||||||
baseline_accuracies, counts = measure_baselines(dev_data, kb)
|
_add_baseline(baseline_results, counts, doc, correct_ents, kb)
|
||||||
logger.info("Counts: {}".format({k: v for k, v in sorted(counts.items())}))
|
|
||||||
logger.info(baseline_accuracies.report_performance("random"))
|
|
||||||
logger.info(baseline_accuracies.report_performance("prior"))
|
|
||||||
logger.info(baseline_accuracies.report_performance("oracle"))
|
|
||||||
|
|
||||||
if context:
|
if context:
|
||||||
# using only context
|
# using only context
|
||||||
el_pipe.cfg["incl_context"] = True
|
el_pipe.cfg["incl_context"] = True
|
||||||
el_pipe.cfg["incl_prior"] = False
|
el_pipe.cfg["incl_prior"] = False
|
||||||
results = get_eval_results(dev_data, el_pipe)
|
_add_eval_result(context_results, doc, correct_ents, el_pipe)
|
||||||
logger.info(results.report_metrics("context only"))
|
|
||||||
|
|
||||||
# measuring combined accuracy (prior + context)
|
# measuring combined accuracy (prior + context)
|
||||||
el_pipe.cfg["incl_context"] = True
|
el_pipe.cfg["incl_context"] = True
|
||||||
el_pipe.cfg["incl_prior"] = True
|
el_pipe.cfg["incl_prior"] = True
|
||||||
results = get_eval_results(dev_data, el_pipe)
|
_add_eval_result(combo_results, doc, correct_ents, el_pipe)
|
||||||
logger.info(results.report_metrics("context and prior"))
|
|
||||||
|
if baseline:
|
||||||
|
logger.info("Counts: {}".format({k: v for k, v in sorted(counts.items())}))
|
||||||
|
logger.info(baseline_results.report_performance("random"))
|
||||||
|
logger.info(baseline_results.report_performance("prior"))
|
||||||
|
logger.info(baseline_results.report_performance("oracle"))
|
||||||
|
|
||||||
|
if context:
|
||||||
|
logger.info(context_results.report_metrics("context only"))
|
||||||
|
logger.info(combo_results.report_metrics("context and prior"))
|
||||||
|
|
||||||
|
|
||||||
def get_eval_results(data, el_pipe=None):
|
def _add_eval_result(results, doc, correct_ents, el_pipe):
|
||||||
"""
|
"""
|
||||||
Evaluate the ent.kb_id_ annotations against the gold standard.
|
Evaluate the ent.kb_id_ annotations against the gold standard.
|
||||||
Only evaluate entities that overlap between gold and NER, to isolate the performance of the NEL.
|
Only evaluate entities that overlap between gold and NER, to isolate the performance of the NEL.
|
||||||
If the docs in the data require further processing with an entity linker, set el_pipe.
|
|
||||||
"""
|
"""
|
||||||
docs = []
|
|
||||||
golds = []
|
|
||||||
for d, g in tqdm(data, leave=False):
|
|
||||||
if len(d) > 0:
|
|
||||||
golds.append(g)
|
|
||||||
if el_pipe is not None:
|
|
||||||
docs.append(el_pipe(d))
|
|
||||||
else:
|
|
||||||
docs.append(d)
|
|
||||||
|
|
||||||
results = EvaluationResults()
|
|
||||||
for doc, gold in zip(docs, golds):
|
|
||||||
try:
|
try:
|
||||||
correct_entries_per_article = dict()
|
doc = el_pipe(doc)
|
||||||
for entity, kb_dict in gold.links.items():
|
|
||||||
start, end = entity
|
|
||||||
for gold_kb, value in kb_dict.items():
|
|
||||||
if value:
|
|
||||||
# only evaluating on positive examples
|
|
||||||
offset = _offset(start, end)
|
|
||||||
correct_entries_per_article[offset] = gold_kb
|
|
||||||
|
|
||||||
for ent in doc.ents:
|
for ent in doc.ents:
|
||||||
ent_label = ent.label_
|
ent_label = ent.label_
|
||||||
pred_entity = ent.kb_id_
|
|
||||||
start = ent.start_char
|
start = ent.start_char
|
||||||
end = ent.end_char
|
end = ent.end_char
|
||||||
offset = _offset(start, end)
|
offset = _offset(start, end)
|
||||||
gold_entity = correct_entries_per_article.get(offset, None)
|
gold_entity = correct_ents.get(offset, None)
|
||||||
# the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
|
# the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
|
||||||
if gold_entity is not None:
|
if gold_entity is not None:
|
||||||
|
pred_entity = ent.kb_id_
|
||||||
results.update_metrics(ent_label, gold_entity, pred_entity)
|
results.update_metrics(ent_label, gold_entity, pred_entity)
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logging.error("Error assessing accuracy " + str(e))
|
logging.error("Error assessing accuracy " + str(e))
|
||||||
|
|
||||||
return results
|
|
||||||
|
|
||||||
|
def _add_baseline(baseline_results, counts, doc, correct_ents, kb):
|
||||||
def measure_baselines(data, kb):
|
|
||||||
"""
|
"""
|
||||||
Measure 3 performance baselines: random selection, prior probabilities, and 'oracle' prediction for upper bound.
|
Measure 3 performance baselines: random selection, prior probabilities, and 'oracle' prediction for upper bound.
|
||||||
Only evaluate entities that overlap between gold and NER, to isolate the performance of the NEL.
|
Only evaluate entities that overlap between gold and NER, to isolate the performance of the NEL.
|
||||||
Also return a dictionary of counts by entity label.
|
|
||||||
"""
|
"""
|
||||||
counts_d = dict()
|
|
||||||
|
|
||||||
baseline_results = BaselineResults()
|
|
||||||
|
|
||||||
docs = [d for d, g in data if len(d) > 0]
|
|
||||||
golds = [g for d, g in data if len(d) > 0]
|
|
||||||
|
|
||||||
for doc, gold in zip(docs, golds):
|
|
||||||
correct_entries_per_article = dict()
|
|
||||||
for entity, kb_dict in gold.links.items():
|
|
||||||
start, end = entity
|
|
||||||
for gold_kb, value in kb_dict.items():
|
|
||||||
# only evaluating on positive examples
|
|
||||||
if value:
|
|
||||||
offset = _offset(start, end)
|
|
||||||
correct_entries_per_article[offset] = gold_kb
|
|
||||||
|
|
||||||
for ent in doc.ents:
|
for ent in doc.ents:
|
||||||
ent_label = ent.label_
|
ent_label = ent.label_
|
||||||
start = ent.start_char
|
start = ent.start_char
|
||||||
end = ent.end_char
|
end = ent.end_char
|
||||||
offset = _offset(start, end)
|
offset = _offset(start, end)
|
||||||
gold_entity = correct_entries_per_article.get(offset, None)
|
gold_entity = correct_ents.get(offset, None)
|
||||||
|
|
||||||
# the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
|
# the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
|
||||||
if gold_entity is not None:
|
if gold_entity is not None:
|
||||||
|
@ -207,8 +188,8 @@ def measure_baselines(data, kb):
|
||||||
prior_candidate = candidates[best_index].entity_
|
prior_candidate = candidates[best_index].entity_
|
||||||
random_candidate = random.choice(candidates).entity_
|
random_candidate = random.choice(candidates).entity_
|
||||||
|
|
||||||
current_count = counts_d.get(ent_label, 0)
|
current_count = counts.get(ent_label, 0)
|
||||||
counts_d[ent_label] = current_count+1
|
counts[ent_label] = current_count+1
|
||||||
|
|
||||||
baseline_results.update_baselines(
|
baseline_results.update_baselines(
|
||||||
gold_entity,
|
gold_entity,
|
||||||
|
@ -218,8 +199,6 @@ def measure_baselines(data, kb):
|
||||||
oracle_candidate,
|
oracle_candidate,
|
||||||
)
|
)
|
||||||
|
|
||||||
return baseline_results, counts_d
|
|
||||||
|
|
||||||
|
|
||||||
def _offset(start, end):
|
def _offset(start, end):
|
||||||
return "{}_{}".format(start, end)
|
return "{}_{}".format(start, end)
|
||||||
|
|
|
@ -40,7 +40,7 @@ logger = logging.getLogger(__name__)
|
||||||
loc_prior_prob=("Location to file with prior probabilities", "option", "p", Path),
|
loc_prior_prob=("Location to file with prior probabilities", "option", "p", Path),
|
||||||
loc_entity_defs=("Location to file with entity definitions", "option", "d", Path),
|
loc_entity_defs=("Location to file with entity definitions", "option", "d", Path),
|
||||||
loc_entity_desc=("Location to file with entity descriptions", "option", "s", Path),
|
loc_entity_desc=("Location to file with entity descriptions", "option", "s", Path),
|
||||||
descr_from_wp=("Flag for using wp descriptions not wd", "flag", "wp"),
|
descr_from_wp=("Flag for using descriptions from WP instead of WD (default False)", "flag", "wp"),
|
||||||
limit_prior=("Threshold to limit lines read from WP for prior probabilities", "option", "lp", int),
|
limit_prior=("Threshold to limit lines read from WP for prior probabilities", "option", "lp", int),
|
||||||
limit_train=("Threshold to limit lines read from WP for training set", "option", "lt", int),
|
limit_train=("Threshold to limit lines read from WP for training set", "option", "lt", int),
|
||||||
limit_wd=("Threshold to limit lines read from WD", "option", "lw", int),
|
limit_wd=("Threshold to limit lines read from WD", "option", "lw", int),
|
||||||
|
|
|
@ -1,5 +1,5 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
"""Script to take a previously created Knowledge Base and train an entity linking
|
"""Script that takes a previously created Knowledge Base and trains an entity linking
|
||||||
pipeline. The provided KB directory should hold the kb, the original nlp object and
|
pipeline. The provided KB directory should hold the kb, the original nlp object and
|
||||||
its vocab used to create the KB, and a few auxiliary files such as the entity definitions,
|
its vocab used to create the KB, and a few auxiliary files such as the entity definitions,
|
||||||
as created by the script `wikidata_create_kb`.
|
as created by the script `wikidata_create_kb`.
|
||||||
|
@ -14,9 +14,16 @@ import logging
|
||||||
import spacy
|
import spacy
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
import plac
|
import plac
|
||||||
|
from tqdm import tqdm
|
||||||
|
|
||||||
from bin.wiki_entity_linking import wikipedia_processor
|
from bin.wiki_entity_linking import wikipedia_processor
|
||||||
from bin.wiki_entity_linking import TRAINING_DATA_FILE, KB_MODEL_DIR, KB_FILE, LOG_FORMAT, OUTPUT_MODEL_DIR
|
from bin.wiki_entity_linking import (
|
||||||
|
TRAINING_DATA_FILE,
|
||||||
|
KB_MODEL_DIR,
|
||||||
|
KB_FILE,
|
||||||
|
LOG_FORMAT,
|
||||||
|
OUTPUT_MODEL_DIR,
|
||||||
|
)
|
||||||
from bin.wiki_entity_linking.entity_linker_evaluation import measure_performance
|
from bin.wiki_entity_linking.entity_linker_evaluation import measure_performance
|
||||||
from bin.wiki_entity_linking.kb_creator import read_kb
|
from bin.wiki_entity_linking.kb_creator import read_kb
|
||||||
|
|
||||||
|
@ -33,8 +40,8 @@ logger = logging.getLogger(__name__)
|
||||||
dropout=("Dropout to prevent overfitting (default 0.5)", "option", "p", float),
|
dropout=("Dropout to prevent overfitting (default 0.5)", "option", "p", float),
|
||||||
lr=("Learning rate (default 0.005)", "option", "n", float),
|
lr=("Learning rate (default 0.005)", "option", "n", float),
|
||||||
l2=("L2 regularization", "option", "r", float),
|
l2=("L2 regularization", "option", "r", float),
|
||||||
train_inst=("# training instances (default 90% of all)", "option", "t", int),
|
train_articles=("# training articles (default 90% of all)", "option", "t", int),
|
||||||
dev_inst=("# test instances (default 10% of all)", "option", "d", int),
|
dev_articles=("# dev test articles (default 10% of all)", "option", "d", int),
|
||||||
labels_discard=("NER labels to discard (default None)", "option", "l", str),
|
labels_discard=("NER labels to discard (default None)", "option", "l", str),
|
||||||
)
|
)
|
||||||
def main(
|
def main(
|
||||||
|
@ -45,10 +52,15 @@ def main(
|
||||||
dropout=0.5,
|
dropout=0.5,
|
||||||
lr=0.005,
|
lr=0.005,
|
||||||
l2=1e-6,
|
l2=1e-6,
|
||||||
train_inst=None,
|
train_articles=None,
|
||||||
dev_inst=None,
|
dev_articles=None,
|
||||||
labels_discard=None
|
labels_discard=None,
|
||||||
):
|
):
|
||||||
|
if not output_dir:
|
||||||
|
logger.warning(
|
||||||
|
"No output dir specified so no results will be written, are you sure about this?"
|
||||||
|
)
|
||||||
|
|
||||||
logger.info("Creating Entity Linker with Wikipedia and WikiData")
|
logger.info("Creating Entity Linker with Wikipedia and WikiData")
|
||||||
|
|
||||||
output_dir = Path(output_dir) if output_dir else dir_kb
|
output_dir = Path(output_dir) if output_dir else dir_kb
|
||||||
|
@ -64,47 +76,57 @@ def main(
|
||||||
# STEP 1 : load the NLP object
|
# STEP 1 : load the NLP object
|
||||||
logger.info("STEP 1a: Loading model from {}".format(nlp_dir))
|
logger.info("STEP 1a: Loading model from {}".format(nlp_dir))
|
||||||
nlp = spacy.load(nlp_dir)
|
nlp = spacy.load(nlp_dir)
|
||||||
logger.info("STEP 1b: Loading KB from {}".format(kb_path))
|
logger.info(
|
||||||
kb = read_kb(nlp, kb_path)
|
"Original NLP pipeline has following pipeline components: {}".format(
|
||||||
|
nlp.pipe_names
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
# check that there is a NER component in the pipeline
|
# check that there is a NER component in the pipeline
|
||||||
if "ner" not in nlp.pipe_names:
|
if "ner" not in nlp.pipe_names:
|
||||||
raise ValueError("The `nlp` object should have a pretrained `ner` component.")
|
raise ValueError("The `nlp` object should have a pretrained `ner` component.")
|
||||||
|
|
||||||
# STEP 2: read the training dataset previously created from WP
|
logger.info("STEP 1b: Loading KB from {}".format(kb_path))
|
||||||
logger.info("STEP 2: Reading training dataset from {}".format(training_path))
|
kb = read_kb(nlp, kb_path)
|
||||||
|
|
||||||
|
# STEP 2: read the training dataset previously created from WP
|
||||||
|
logger.info("STEP 2: Reading training & dev dataset from {}".format(training_path))
|
||||||
|
train_indices, dev_indices = wikipedia_processor.read_training_indices(
|
||||||
|
training_path
|
||||||
|
)
|
||||||
|
logger.info(
|
||||||
|
"Training set has {} articles, limit set to roughly {} articles per epoch".format(
|
||||||
|
len(train_indices), train_articles if train_articles else "all"
|
||||||
|
)
|
||||||
|
)
|
||||||
|
logger.info(
|
||||||
|
"Dev set has {} articles, limit set to roughly {} articles for evaluation".format(
|
||||||
|
len(dev_indices), dev_articles if dev_articles else "all"
|
||||||
|
)
|
||||||
|
)
|
||||||
|
if dev_articles:
|
||||||
|
dev_indices = dev_indices[0:dev_articles]
|
||||||
|
|
||||||
|
# STEP 3: create and train an entity linking pipe
|
||||||
|
logger.info(
|
||||||
|
"STEP 3: Creating and training an Entity Linking pipe for {} epochs".format(
|
||||||
|
epochs
|
||||||
|
)
|
||||||
|
)
|
||||||
if labels_discard:
|
if labels_discard:
|
||||||
labels_discard = [x.strip() for x in labels_discard.split(",")]
|
labels_discard = [x.strip() for x in labels_discard.split(",")]
|
||||||
logger.info("Discarding {} NER types: {}".format(len(labels_discard), labels_discard))
|
logger.info(
|
||||||
|
"Discarding {} NER types: {}".format(len(labels_discard), labels_discard)
|
||||||
|
)
|
||||||
else:
|
else:
|
||||||
labels_discard = []
|
labels_discard = []
|
||||||
|
|
||||||
train_data = wikipedia_processor.read_training(
|
|
||||||
nlp=nlp,
|
|
||||||
entity_file_path=training_path,
|
|
||||||
dev=False,
|
|
||||||
limit=train_inst,
|
|
||||||
kb=kb,
|
|
||||||
labels_discard=labels_discard
|
|
||||||
)
|
|
||||||
|
|
||||||
# for testing, get all pos instances (independently of KB)
|
|
||||||
dev_data = wikipedia_processor.read_training(
|
|
||||||
nlp=nlp,
|
|
||||||
entity_file_path=training_path,
|
|
||||||
dev=True,
|
|
||||||
limit=dev_inst,
|
|
||||||
kb=None,
|
|
||||||
labels_discard=labels_discard
|
|
||||||
)
|
|
||||||
|
|
||||||
# STEP 3: create and train an entity linking pipe
|
|
||||||
logger.info("STEP 3: Creating and training an Entity Linking pipe")
|
|
||||||
|
|
||||||
el_pipe = nlp.create_pipe(
|
el_pipe = nlp.create_pipe(
|
||||||
name="entity_linker", config={"pretrained_vectors": nlp.vocab.vectors,
|
name="entity_linker",
|
||||||
"labels_discard": labels_discard}
|
config={
|
||||||
|
"pretrained_vectors": nlp.vocab.vectors,
|
||||||
|
"labels_discard": labels_discard,
|
||||||
|
},
|
||||||
)
|
)
|
||||||
el_pipe.set_kb(kb)
|
el_pipe.set_kb(kb)
|
||||||
nlp.add_pipe(el_pipe, last=True)
|
nlp.add_pipe(el_pipe, last=True)
|
||||||
|
@ -115,78 +137,96 @@ def main(
|
||||||
optimizer.learn_rate = lr
|
optimizer.learn_rate = lr
|
||||||
optimizer.L2 = l2
|
optimizer.L2 = l2
|
||||||
|
|
||||||
logger.info("Training on {} articles".format(len(train_data)))
|
|
||||||
logger.info("Dev testing on {} articles".format(len(dev_data)))
|
|
||||||
|
|
||||||
# baseline performance on dev data
|
|
||||||
logger.info("Dev Baseline Accuracies:")
|
logger.info("Dev Baseline Accuracies:")
|
||||||
measure_performance(dev_data, kb, el_pipe, baseline=True, context=False)
|
dev_data = wikipedia_processor.read_el_docs_golds(
|
||||||
|
nlp=nlp,
|
||||||
|
entity_file_path=training_path,
|
||||||
|
dev=True,
|
||||||
|
line_ids=dev_indices,
|
||||||
|
kb=kb,
|
||||||
|
labels_discard=labels_discard,
|
||||||
|
)
|
||||||
|
|
||||||
|
measure_performance(
|
||||||
|
dev_data, kb, el_pipe, baseline=True, context=False, dev_limit=len(dev_indices)
|
||||||
|
)
|
||||||
|
|
||||||
for itn in range(epochs):
|
for itn in range(epochs):
|
||||||
random.shuffle(train_data)
|
random.shuffle(train_indices)
|
||||||
losses = {}
|
losses = {}
|
||||||
batches = minibatch(train_data, size=compounding(4.0, 128.0, 1.001))
|
batches = minibatch(train_indices, size=compounding(8.0, 128.0, 1.001))
|
||||||
batchnr = 0
|
batchnr = 0
|
||||||
|
articles_processed = 0
|
||||||
|
|
||||||
with nlp.disable_pipes(*other_pipes):
|
# we either process the whole training file, or just a part each epoch
|
||||||
|
bar_total = len(train_indices)
|
||||||
|
if train_articles:
|
||||||
|
bar_total = train_articles
|
||||||
|
|
||||||
|
with tqdm(total=bar_total, leave=False, desc=f"Epoch {itn}") as pbar:
|
||||||
for batch in batches:
|
for batch in batches:
|
||||||
|
if not train_articles or articles_processed < train_articles:
|
||||||
|
with nlp.disable_pipes("entity_linker"):
|
||||||
|
train_batch = wikipedia_processor.read_el_docs_golds(
|
||||||
|
nlp=nlp,
|
||||||
|
entity_file_path=training_path,
|
||||||
|
dev=False,
|
||||||
|
line_ids=batch,
|
||||||
|
kb=kb,
|
||||||
|
labels_discard=labels_discard,
|
||||||
|
)
|
||||||
|
docs, golds = zip(*train_batch)
|
||||||
try:
|
try:
|
||||||
|
with nlp.disable_pipes(*other_pipes):
|
||||||
nlp.update(
|
nlp.update(
|
||||||
examples=batch,
|
docs=docs,
|
||||||
|
golds=golds,
|
||||||
sgd=optimizer,
|
sgd=optimizer,
|
||||||
drop=dropout,
|
drop=dropout,
|
||||||
losses=losses,
|
losses=losses,
|
||||||
)
|
)
|
||||||
batchnr += 1
|
batchnr += 1
|
||||||
|
articles_processed += len(docs)
|
||||||
|
pbar.update(len(docs))
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.error("Error updating batch:" + str(e))
|
logger.error("Error updating batch:" + str(e))
|
||||||
if batchnr > 0:
|
if batchnr > 0:
|
||||||
logging.info("Epoch {}, train loss {}".format(itn, round(losses["entity_linker"] / batchnr, 2)))
|
logging.info(
|
||||||
measure_performance(dev_data, kb, el_pipe, baseline=False, context=True)
|
"Epoch {} trained on {} articles, train loss {}".format(
|
||||||
|
itn, articles_processed, round(losses["entity_linker"] / batchnr, 2)
|
||||||
# STEP 4: measure the performance of our trained pipe on an independent dev set
|
)
|
||||||
logger.info("STEP 4: Final performance measurement of Entity Linking pipe")
|
)
|
||||||
measure_performance(dev_data, kb, el_pipe)
|
# re-read the dev_data (data is returned as a generator)
|
||||||
|
dev_data = wikipedia_processor.read_el_docs_golds(
|
||||||
# STEP 5: apply the EL pipe on a toy example
|
nlp=nlp,
|
||||||
logger.info("STEP 5: Applying Entity Linking to toy example")
|
entity_file_path=training_path,
|
||||||
run_el_toy_example(nlp=nlp)
|
dev=True,
|
||||||
|
line_ids=dev_indices,
|
||||||
|
kb=kb,
|
||||||
|
labels_discard=labels_discard,
|
||||||
|
)
|
||||||
|
measure_performance(
|
||||||
|
dev_data,
|
||||||
|
kb,
|
||||||
|
el_pipe,
|
||||||
|
baseline=False,
|
||||||
|
context=True,
|
||||||
|
dev_limit=len(dev_indices),
|
||||||
|
)
|
||||||
|
|
||||||
if output_dir:
|
if output_dir:
|
||||||
# STEP 6: write the NLP pipeline (now including an EL model) to file
|
# STEP 4: write the NLP pipeline (now including an EL model) to file
|
||||||
logger.info("STEP 6: Writing trained NLP to {}".format(nlp_output_dir))
|
logger.info(
|
||||||
|
"Final NLP pipeline has following pipeline components: {}".format(
|
||||||
|
nlp.pipe_names
|
||||||
|
)
|
||||||
|
)
|
||||||
|
logger.info("STEP 4: Writing trained NLP to {}".format(nlp_output_dir))
|
||||||
nlp.to_disk(nlp_output_dir)
|
nlp.to_disk(nlp_output_dir)
|
||||||
|
|
||||||
logger.info("Done!")
|
logger.info("Done!")
|
||||||
|
|
||||||
|
|
||||||
def check_kb(kb):
|
|
||||||
for mention in ("Bush", "Douglas Adams", "Homer", "Brazil", "China"):
|
|
||||||
candidates = kb.get_candidates(mention)
|
|
||||||
|
|
||||||
logger.info("generating candidates for " + mention + " :")
|
|
||||||
for c in candidates:
|
|
||||||
logger.info(" ".join[
|
|
||||||
str(c.prior_prob),
|
|
||||||
c.alias_,
|
|
||||||
"-->",
|
|
||||||
c.entity_ + " (freq=" + str(c.entity_freq) + ")"
|
|
||||||
])
|
|
||||||
|
|
||||||
|
|
||||||
def run_el_toy_example(nlp):
|
|
||||||
text = (
|
|
||||||
"In The Hitchhiker's Guide to the Galaxy, written by Douglas Adams, "
|
|
||||||
"Douglas reminds us to always bring our towel, even in China or Brazil. "
|
|
||||||
"The main character in Doug's novel is the man Arthur Dent, "
|
|
||||||
"but Dougledydoug doesn't write about George Washington or Homer Simpson."
|
|
||||||
)
|
|
||||||
doc = nlp(text)
|
|
||||||
logger.info(text)
|
|
||||||
for ent in doc.ents:
|
|
||||||
logger.info(" ".join(["ent", ent.text, ent.label_, ent.kb_id_]))
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)
|
logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)
|
||||||
plac.call(main)
|
plac.call(main)
|
||||||
|
|
|
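The reworked training loop above shuffles article indices, batches them, and stops feeding batches once roughly `train_articles` articles have been processed. A minimal sketch of that guard in plain Python follows. The helper names `batched` and `run_epoch` are hypothetical, and a fixed batch size stands in for spaCy's `minibatch`/`compounding`. The real loop counts `len(docs)` after filtering, whereas this sketch simply counts indices.

```python
import random


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_epoch(train_indices, batch_size, train_articles=None):
    """Return the batches consumed in one epoch and the number of articles seen.

    Hypothetical stand-in for the loop in the diff: the cap is checked before
    each batch, so the epoch can overshoot `train_articles` by up to one batch.
    """
    random.shuffle(train_indices)  # shuffles in place, as the diff does
    articles_processed = 0
    consumed = []
    for batch in batched(train_indices, batch_size):
        # same guard as in the diff: no cap set, or cap not yet reached
        if not train_articles or articles_processed < train_articles:
            consumed.append(batch)
            articles_processed += len(batch)
    return consumed, articles_processed


capped_batches, capped_count = run_epoch(list(range(10)), batch_size=4, train_articles=5)
full_batches, full_count = run_epoch(list(range(10)), batch_size=4)
```

Because the check happens per batch rather than per article, the limit is approximate, which matches the "roughly {} articles per epoch" wording in the log message.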
@ -6,9 +6,6 @@ import bz2
|
||||||
import logging
|
import logging
|
||||||
import random
|
import random
|
||||||
import json
|
import json
|
||||||
from tqdm import tqdm
|
|
||||||
|
|
||||||
from functools import partial
|
|
||||||
|
|
||||||
from spacy.gold import GoldParse
|
from spacy.gold import GoldParse
|
||||||
from bin.wiki_entity_linking import wiki_io as io
|
from bin.wiki_entity_linking import wiki_io as io
|
||||||
|
@ -454,25 +451,40 @@ def _write_training_entities(outputfile, article_id, clean_text, entities):
|
||||||
outputfile.write(line)
|
outputfile.write(line)
|
||||||
|
|
||||||
|
|
||||||
def read_training(nlp, entity_file_path, dev, limit, kb, labels_discard=None):
|
def read_training_indices(entity_file_path):
|
||||||
""" This method provides training examples that correspond to the entity annotations found by the nlp object.
|
""" This method creates two lists of indices into the training file: one with indices for the
|
||||||
|
training examples, and one for the dev examples."""
|
||||||
|
train_indices = []
|
||||||
|
dev_indices = []
|
||||||
|
|
||||||
|
with entity_file_path.open("r", encoding="utf8") as file:
|
||||||
|
for i, line in enumerate(file):
|
||||||
|
example = json.loads(line)
|
||||||
|
article_id = example["article_id"]
|
||||||
|
clean_text = example["clean_text"]
|
||||||
|
|
||||||
|
if is_valid_article(clean_text):
|
||||||
|
if is_dev(article_id):
|
||||||
|
dev_indices.append(i)
|
||||||
|
else:
|
||||||
|
train_indices.append(i)
|
||||||
|
|
||||||
|
return train_indices, dev_indices
|
||||||
|
|
||||||
|
|
||||||
|
def read_el_docs_golds(nlp, entity_file_path, dev, line_ids, kb, labels_discard=None):
|
||||||
|
""" This method provides training/dev examples that correspond to the entity annotations found by the nlp object.
|
||||||
For training, it will include both positive and negative examples by using the candidate generator from the kb.
|
For training, it will include both positive and negative examples by using the candidate generator from the kb.
|
||||||
For testing (kb=None), it will include all positive examples only."""
|
For testing (kb=None), it will include all positive examples only."""
|
||||||
if not labels_discard:
|
if not labels_discard:
|
||||||
labels_discard = []
|
labels_discard = []
|
||||||
|
|
||||||
data = []
|
texts = []
|
||||||
num_entities = 0
|
entities_list = []
|
||||||
get_gold_parse = partial(
|
|
||||||
_get_gold_parse, dev=dev, kb=kb, labels_discard=labels_discard
|
|
||||||
)
|
|
||||||
|
|
||||||
logger.info(
|
|
||||||
"Reading {} data with limit {}".format("dev" if dev else "train", limit)
|
|
||||||
)
|
|
||||||
with entity_file_path.open("r", encoding="utf8") as file:
|
with entity_file_path.open("r", encoding="utf8") as file:
|
||||||
with tqdm(total=limit, leave=False) as pbar:
|
|
||||||
for i, line in enumerate(file):
|
for i, line in enumerate(file):
|
||||||
|
if i in line_ids:
|
||||||
example = json.loads(line)
|
example = json.loads(line)
|
||||||
article_id = example["article_id"]
|
article_id = example["article_id"]
|
||||||
clean_text = example["clean_text"]
|
clean_text = example["clean_text"]
|
||||||
|
@ -481,16 +493,15 @@ def read_training(nlp, entity_file_path, dev, limit, kb, labels_discard=None):
|
||||||
if dev != is_dev(article_id) or not is_valid_article(clean_text):
|
if dev != is_dev(article_id) or not is_valid_article(clean_text):
|
||||||
continue
|
continue
|
||||||
|
|
||||||
doc = nlp(clean_text)
|
texts.append(clean_text)
|
||||||
gold = get_gold_parse(doc, entities)
|
entities_list.append(entities)
|
||||||
|
|
||||||
|
docs = nlp.pipe(texts, batch_size=50)
|
||||||
|
|
||||||
|
for doc, entities in zip(docs, entities_list):
|
||||||
|
gold = _get_gold_parse(doc, entities, dev=dev, kb=kb, labels_discard=labels_discard)
|
||||||
if gold and len(gold.links) > 0:
|
if gold and len(gold.links) > 0:
|
||||||
data.append((doc, gold))
|
yield doc, gold
|
||||||
num_entities += len(gold.links)
|
|
||||||
pbar.update(len(gold.links))
|
|
||||||
if limit and num_entities >= limit:
|
|
||||||
break
|
|
||||||
logger.info("Read {} entities in {} articles".format(num_entities, len(data)))
|
|
||||||
return data
|
|
||||||
|
|
||||||
|
|
||||||
def _get_gold_parse(doc, entities, dev, kb, labels_discard):
|
def _get_gold_parse(doc, entities, dev, kb, labels_discard):
|
||||||
|
|
|
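The change above replaces one eager `read_training` call with two steps: `read_training_indices` records which line numbers belong to train and dev, and `read_el_docs_golds` lazily yields examples for a given set of line ids. The sketch below illustrates that pattern on a tiny JSONL file without spaCy. The function names, the file contents, and the dev split rule are all hypothetical and chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path


def read_indices(path, is_dev):
    """Split the line numbers of a JSONL file into train and dev lists."""
    train, dev = [], []
    with path.open("r", encoding="utf8") as f:
        for i, line in enumerate(f):
            record = json.loads(line)
            (dev if is_dev(record["article_id"]) else train).append(i)
    return train, dev


def read_records(path, line_ids):
    """Lazily yield the records whose line number is in `line_ids`."""
    wanted = set(line_ids)  # set lookup avoids rescanning a list for every line
    with path.open("r", encoding="utf8") as f:
        for i, line in enumerate(f):
            if i in wanted:
                yield json.loads(line)


def demo():
    """Write a small hypothetical file and read it back by index."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "articles.jsonl"
        rows = [{"article_id": str(n), "clean_text": f"text {n}"} for n in range(6)]
        path.write_text("\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf8")
        # hypothetical split rule: ids ending in "3" go to dev
        train, dev = read_indices(path, is_dev=lambda a: a.endswith("3"))
        dev_gen = read_records(path, dev)
        first_pass = [r["article_id"] for r in dev_gen]
        second_pass = [r["article_id"] for r in dev_gen]  # generator is exhausted
        return train, dev, first_pass, second_pass


train_ids, dev_ids, first_pass, second_pass = demo()
```

The empty second pass shows why the training script re-creates `dev_data` before each evaluation: a generator can only be consumed once.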
@ -26,12 +26,12 @@ DEFAULT_TEXT = "Mark Zuckerberg is the CEO of Facebook."
|
||||||
HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""
|
HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""
|
||||||
|
|
||||||
|
|
||||||
@st.cache(ignore_hash=True)
|
@st.cache(allow_output_mutation=True)
|
||||||
def load_model(name):
|
def load_model(name):
|
||||||
return spacy.load(name)
|
return spacy.load(name)
|
||||||
|
|
||||||
|
|
||||||
@st.cache(ignore_hash=True)
|
@st.cache(allow_output_mutation=True)
|
||||||
def process_text(model_name, text):
|
def process_text(model_name, text):
|
||||||
nlp = load_model(model_name)
|
nlp = load_model(model_name)
|
||||||
return nlp(text)
|
return nlp(text)
|
||||||
|
@ -79,7 +79,9 @@ if "ner" in nlp.pipe_names:
|
||||||
st.header("Named Entities")
|
st.header("Named Entities")
|
||||||
st.sidebar.header("Named Entities")
|
st.sidebar.header("Named Entities")
|
||||||
label_set = nlp.get_pipe("ner").labels
|
label_set = nlp.get_pipe("ner").labels
|
||||||
labels = st.sidebar.multiselect("Entity labels", label_set, label_set)
|
labels = st.sidebar.multiselect(
|
||||||
|
"Entity labels", options=label_set, default=list(label_set)
|
||||||
|
)
|
||||||
html = displacy.render(doc, style="ent", options={"ents": labels})
|
html = displacy.render(doc, style="ent", options={"ents": labels})
|
||||||
# Newlines seem to mess with the rendering
|
# Newlines seem to mess with the rendering
|
||||||
html = html.replace("\n", " ")
|
html = html.replace("\n", " ")
|
||||||
|
|
|
@ -32,27 +32,24 @@ DESC_WIDTH = 64 # dimension of output entity vectors
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
vocab_path=("Path to the vocab for the kb", "option", "v", Path),
|
model=("Model name, should have pretrained word embeddings", "positional", None, str),
|
||||||
model=("Model name, should have pretrained word embeddings", "option", "m", str),
|
|
||||||
output_dir=("Optional output directory", "option", "o", Path),
|
output_dir=("Optional output directory", "option", "o", Path),
|
||||||
n_iter=("Number of training iterations", "option", "n", int),
|
n_iter=("Number of training iterations", "option", "n", int),
|
||||||
)
|
)
|
||||||
def main(vocab_path=None, model=None, output_dir=None, n_iter=50):
|
def main(model=None, output_dir=None, n_iter=50):
|
||||||
"""Load the model, create the KB and pretrain the entity encodings.
|
"""Load the model, create the KB and pretrain the entity encodings.
|
||||||
Either an nlp model or a vocab is needed to provide access to pretrained word embeddings.
|
|
||||||
If an output_dir is provided, the KB will be stored there in a file 'kb'.
|
If an output_dir is provided, the KB will be stored there in a file 'kb'.
|
||||||
When providing an nlp model, the updated vocab will also be written to a directory in the output_dir."""
|
The updated vocab will also be written to a directory in the output_dir."""
|
||||||
if model is None and vocab_path is None:
|
|
||||||
raise ValueError("Either the `nlp` model or the `vocab` should be specified.")
|
|
||||||
|
|
||||||
if model is not None:
|
|
||||||
nlp = spacy.load(model) # load existing spaCy model
|
nlp = spacy.load(model) # load existing spaCy model
|
||||||
print("Loaded model '%s'" % model)
|
print("Loaded model '%s'" % model)
|
||||||
else:
|
|
||||||
vocab = Vocab().from_disk(vocab_path)
|
# check the length of the nlp vectors
|
||||||
# create blank Language class with specified vocab
|
if "vectors" not in nlp.meta or not nlp.vocab.vectors.size:
|
||||||
nlp = spacy.blank("en", vocab=vocab)
|
raise ValueError(
|
||||||
print("Created blank 'en' model with vocab from '%s'" % vocab_path)
|
"The `nlp` object should have access to pretrained word vectors, "
|
||||||
|
" cf. https://spacy.io/usage/models#languages."

|
||||||
|
)
|
||||||
|
|
||||||
kb = KnowledgeBase(vocab=nlp.vocab)
|
kb = KnowledgeBase(vocab=nlp.vocab)
|
||||||
|
|
||||||
|
@ -103,8 +100,6 @@ def main(vocab_path=None, model=None, output_dir=None, n_iter=50):
|
||||||
print()
|
print()
|
||||||
print("Saved KB to", kb_path)
|
print("Saved KB to", kb_path)
|
||||||
|
|
||||||
# only storing the vocab if we weren't already reading it from file
|
|
||||||
if not vocab_path:
|
|
||||||
vocab_path = output_dir / "vocab"
|
vocab_path = output_dir / "vocab"
|
||||||
kb.vocab.to_disk(vocab_path)
|
kb.vocab.to_disk(vocab_path)
|
||||||
print("Saved vocab to", vocab_path)
|
print("Saved vocab to", vocab_path)
|
||||||
|
|
|
@ -131,7 +131,8 @@ def train_textcat(nlp, n_texts, n_iter=10):
|
||||||
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
|
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
|
||||||
|
|
||||||
# get names of other pipes to disable them during training
|
# get names of other pipes to disable them during training
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
|
pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
|
||||||
|
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||||
with nlp.disable_pipes(*other_pipes): # only train textcat
|
with nlp.disable_pipes(*other_pipes): # only train textcat
|
||||||
optimizer = nlp.begin_training()
|
optimizer = nlp.begin_training()
|
||||||
textcat.model.tok2vec.from_bytes(tok2vec_weights)
|
textcat.model.tok2vec.from_bytes(tok2vec_weights)
|
||||||
|
|
|
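The hunk above, like the matching hunks in the other example scripts, swaps a single-name comparison for a `pipe_exceptions` list so that the transformer components stay enabled during training. A minimal sketch of the list logic, using a hypothetical helper name and a made-up pipeline for illustration:

```python
def pipes_to_disable(pipe_names, keep):
    """Return the component names that are not listed in `keep`, in pipeline order."""
    return [name for name in pipe_names if name not in keep]


# hypothetical pipeline; in the scripts this would be nlp.pipe_names
pipe_names = ["trf_wordpiecer", "trf_tok2vec", "tagger", "parser", "textcat"]
pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = pipes_to_disable(pipe_names, pipe_exceptions)
```

Names in the exception list that are absent from the pipeline are simply ignored, so the same list works for models with and without the `trf_*` components.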
@ -63,7 +63,8 @@ def main(model_name, unlabelled_loc):
|
||||||
optimizer.b2 = 0.0
|
optimizer.b2 = 0.0
|
||||||
|
|
||||||
# get names of other pipes to disable them during training
|
# get names of other pipes to disable them during training
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
|
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
|
||||||
|
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||||
sizes = compounding(1.0, 4.0, 1.001)
|
sizes = compounding(1.0, 4.0, 1.001)
|
||||||
with nlp.disable_pipes(*other_pipes):
|
with nlp.disable_pipes(*other_pipes):
|
||||||
for itn in range(n_iter):
|
for itn in range(n_iter):
|
||||||
|
|
|
@ -113,7 +113,8 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
|
||||||
TRAIN_DOCS.append((doc, annotation_clean))
|
TRAIN_DOCS.append((doc, annotation_clean))
|
||||||
|
|
||||||
# get names of other pipes to disable them during training
|
# get names of other pipes to disable them during training
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
|
pipe_exceptions = ["entity_linker", "trf_wordpiecer", "trf_tok2vec"]
|
||||||
|
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||||
with nlp.disable_pipes(*other_pipes): # only train entity linker
|
with nlp.disable_pipes(*other_pipes): # only train entity linker
|
||||||
# reset and initialize the weights randomly
|
# reset and initialize the weights randomly
|
||||||
optimizer = nlp.begin_training()
|
optimizer = nlp.begin_training()
|
||||||
|
|
|
@ -124,7 +124,8 @@ def main(model=None, output_dir=None, n_iter=15):
|
||||||
for dep in annotations.get("deps", []):
|
for dep in annotations.get("deps", []):
|
||||||
parser.add_label(dep)
|
parser.add_label(dep)
|
||||||
|
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "parser"]
|
pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
|
||||||
|
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||||
with nlp.disable_pipes(*other_pipes): # only train parser
|
with nlp.disable_pipes(*other_pipes): # only train parser
|
||||||
optimizer = nlp.begin_training()
|
optimizer = nlp.begin_training()
|
||||||
for itn in range(n_iter):
|
for itn in range(n_iter):
|
||||||
|
|
|
@ -55,7 +55,8 @@ def main(model=None, output_dir=None, n_iter=100):
|
||||||
ner.add_label(ent[2])
|
ner.add_label(ent[2])
|
||||||
|
|
||||||
# get names of other pipes to disable them during training
|
# get names of other pipes to disable them during training
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
|
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
|
||||||
|
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||||
with nlp.disable_pipes(*other_pipes): # only train NER
|
with nlp.disable_pipes(*other_pipes): # only train NER
|
||||||
# reset and initialize the weights randomly – but only if we're
|
# reset and initialize the weights randomly – but only if we're
|
||||||
# training a new model
|
# training a new model
|
||||||
|
|
|
@ -95,7 +95,8 @@ def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
|
||||||
optimizer = nlp.resume_training()
|
optimizer = nlp.resume_training()
|
||||||
move_names = list(ner.move_names)
|
move_names = list(ner.move_names)
|
||||||
# get names of other pipes to disable them during training
|
# get names of other pipes to disable them during training
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
|
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
|
||||||
|
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||||
with nlp.disable_pipes(*other_pipes): # only train NER
|
with nlp.disable_pipes(*other_pipes): # only train NER
|
||||||
sizes = compounding(1.0, 4.0, 1.001)
|
sizes = compounding(1.0, 4.0, 1.001)
|
||||||
# batch up the examples using spaCy's minibatch
|
# batch up the examples using spaCy's minibatch
|
||||||
|
|
|
@ -65,7 +65,8 @@ def main(model=None, output_dir=None, n_iter=15):
|
||||||
parser.add_label(dep)
|
parser.add_label(dep)
|
||||||
|
|
||||||
# get names of other pipes to disable them during training
|
# get names of other pipes to disable them during training
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "parser"]
|
pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
|
||||||
|
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||||
with nlp.disable_pipes(*other_pipes): # only train parser
|
with nlp.disable_pipes(*other_pipes): # only train parser
|
||||||
optimizer = nlp.begin_training()
|
optimizer = nlp.begin_training()
|
||||||
for itn in range(n_iter):
|
for itn in range(n_iter):
|
||||||
|
|
|
@ -68,7 +68,8 @@ def main(model=None, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=None
|
||||||
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
|
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
|
||||||
|
|
||||||
# get names of other pipes to disable them during training
|
# get names of other pipes to disable them during training
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
|
pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
|
||||||
|
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||||
with nlp.disable_pipes(*other_pipes): # only train textcat
|
with nlp.disable_pipes(*other_pipes): # only train textcat
|
||||||
optimizer = nlp.begin_training()
|
optimizer = nlp.begin_training()
|
||||||
if init_tok2vec is not None:
|
if init_tok2vec is not None:
|
||||||
|
|
|
@ -49,6 +49,7 @@ install_requires =
|
||||||
catalogue>=0.0.7,<1.1.0
|
catalogue>=0.0.7,<1.1.0
|
||||||
ml_datasets
|
ml_datasets
|
||||||
# Third-party dependencies
|
# Third-party dependencies
|
||||||
|
tqdm>=4.38.0,<5.0.0
|
||||||
setuptools
|
setuptools
|
||||||
numpy>=1.15.0
|
numpy>=1.15.0
|
||||||
plac>=0.9.6,<1.2.0
|
plac>=0.9.6,<1.2.0
|
||||||
|
|
|
@ -5,7 +5,7 @@ warnings.filterwarnings("ignore", message="numpy.dtype size changed")
|
||||||
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
|
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
|
||||||
|
|
||||||
# These are imported as part of the API
|
# These are imported as part of the API
|
||||||
from thinc.util import prefer_gpu, require_gpu
|
from thinc.api import prefer_gpu, require_gpu
|
||||||
|
|
||||||
from . import pipeline
|
from . import pipeline
|
||||||
from .cli.info import info as cli_info
|
from .cli.info import info as cli_info
|
||||||
|
|
|
@ -92,3 +92,4 @@ cdef enum attr_id_t:
|
||||||
LANG
|
LANG
|
||||||
ENT_KB_ID = symbols.ENT_KB_ID
|
ENT_KB_ID = symbols.ENT_KB_ID
|
||||||
MORPH
|
MORPH
|
||||||
|
ENT_ID = symbols.ENT_ID
|
||||||
|
|
|
@ -81,6 +81,7 @@ IDS = {
|
||||||
"DEP": DEP,
|
"DEP": DEP,
|
||||||
"ENT_IOB": ENT_IOB,
|
"ENT_IOB": ENT_IOB,
|
||||||
"ENT_TYPE": ENT_TYPE,
|
"ENT_TYPE": ENT_TYPE,
|
||||||
|
"ENT_ID": ENT_ID,
|
||||||
"ENT_KB_ID": ENT_KB_ID,
|
"ENT_KB_ID": ENT_KB_ID,
|
||||||
"HEAD": HEAD,
|
"HEAD": HEAD,
|
||||||
"SENT_START": SENT_START,
|
"SENT_START": SENT_START,
|
||||||
|
|
|
@ -9,8 +9,14 @@ from wasabi import Printer
|
||||||
|
|
||||||
|
|
||||||
def conllu2json(
|
def conllu2json(
|
||||||
input_data, n_sents=10, append_morphology=False, lang=None, ner_map=None,
|
input_data,
|
||||||
merge_subtokens=False, no_print=False, **_
|
n_sents=10,
|
||||||
|
append_morphology=False,
|
||||||
|
lang=None,
|
||||||
|
ner_map=None,
|
||||||
|
merge_subtokens=False,
|
||||||
|
no_print=False,
|
||||||
|
**_
|
||||||
):
|
):
|
||||||
"""
|
"""
|
||||||
Convert conllu files into JSON format for use with train cli.
|
Convert conllu files into JSON format for use with train cli.
|
||||||
|
@ -26,9 +32,13 @@ def conllu2json(
|
||||||
docs = []
|
docs = []
|
||||||
raw = ""
|
raw = ""
|
||||||
sentences = []
|
sentences = []
|
||||||
conll_data = read_conllx(input_data, append_morphology=append_morphology,
|
conll_data = read_conllx(
|
||||||
ner_tag_pattern=MISC_NER_PATTERN, ner_map=ner_map,
|
input_data,
|
||||||
merge_subtokens=merge_subtokens)
|
append_morphology=append_morphology,
|
||||||
|
ner_tag_pattern=MISC_NER_PATTERN,
|
||||||
|
ner_map=ner_map,
|
||||||
|
merge_subtokens=merge_subtokens,
|
||||||
|
)
|
||||||
has_ner_tags = has_ner(input_data, ner_tag_pattern=MISC_NER_PATTERN)
|
has_ner_tags = has_ner(input_data, ner_tag_pattern=MISC_NER_PATTERN)
|
||||||
for i, example in enumerate(conll_data):
|
for i, example in enumerate(conll_data):
|
||||||
raw += example.text
|
raw += example.text
|
||||||
|
@ -72,20 +82,28 @@ def has_ner(input_data, ner_tag_pattern):
|
||||||
return False
|
return False
|
||||||
|
|
||||||
|
|
||||||
def read_conllx(input_data, append_morphology=False, merge_subtokens=False,
|
def read_conllx(
|
||||||
ner_tag_pattern="", ner_map=None):
|
input_data,
|
||||||
|
append_morphology=False,
|
||||||
|
merge_subtokens=False,
|
||||||
|
ner_tag_pattern="",
|
||||||
|
ner_map=None,
|
||||||
|
):
|
||||||
""" Yield examples, one for each sentence """
|
""" Yield examples, one for each sentence """
|
||||||
vocab = Language.Defaults.create_vocab() # need vocab to make a minimal Doc
|
vocab = Language.Defaults.create_vocab() # need vocab to make a minimal Doc
|
||||||
i = 0
|
|
||||||
for sent in input_data.strip().split("\n\n"):
|
for sent in input_data.strip().split("\n\n"):
|
||||||
lines = sent.strip().split("\n")
|
lines = sent.strip().split("\n")
|
||||||
if lines:
|
if lines:
|
||||||
while lines[0].startswith("#"):
|
while lines[0].startswith("#"):
|
||||||
lines.pop(0)
|
lines.pop(0)
|
||||||
example = example_from_conllu_sentence(vocab, lines,
|
example = example_from_conllu_sentence(
|
||||||
ner_tag_pattern, merge_subtokens=merge_subtokens,
|
vocab,
|
||||||
|
lines,
|
||||||
|
ner_tag_pattern,
|
||||||
|
merge_subtokens=merge_subtokens,
|
||||||
append_morphology=append_morphology,
|
append_morphology=append_morphology,
|
||||||
ner_map=ner_map)
|
ner_map=ner_map,
|
||||||
|
)
|
||||||
yield example
|
yield example
|
||||||
|
|
||||||
|
|
||||||
|
@ -157,8 +175,14 @@ def create_json_doc(raw, sentences, id_):
|
||||||
return doc
|
return doc
|
||||||
|
|
||||||
|
|
||||||
def example_from_conllu_sentence(vocab, lines, ner_tag_pattern,
|
def example_from_conllu_sentence(
|
||||||
merge_subtokens=False, append_morphology=False, ner_map=None):
|
vocab,
|
||||||
|
lines,
|
||||||
|
ner_tag_pattern,
|
||||||
|
merge_subtokens=False,
|
||||||
|
append_morphology=False,
|
||||||
|
ner_map=None,
|
||||||
|
):
|
||||||
"""Create an Example from the lines for one CoNLL-U sentence, merging
|
"""Create an Example from the lines for one CoNLL-U sentence, merging
|
||||||
subtokens and appending morphology to tags if required.
|
subtokens and appending morphology to tags if required.
|
||||||
|
|
||||||
|
@ -182,7 +206,6 @@ def example_from_conllu_sentence(vocab, lines, ner_tag_pattern,
|
||||||
in_subtok = False
|
in_subtok = False
|
||||||
for i in range(len(lines)):
|
for i in range(len(lines)):
|
||||||
line = lines[i]
|
line = lines[i]
|
||||||
subtok_lines = []
|
|
||||||
parts = line.split("\t")
|
parts = line.split("\t")
|
||||||
id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts
|
id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts
|
||||||
if "." in id_:
|
if "." in id_:
|
||||||
|
@ -212,7 +235,7 @@ def example_from_conllu_sentence(vocab, lines, ner_tag_pattern,
|
||||||
subtok_word = ""
|
subtok_word = ""
|
||||||
in_subtok = False
|
in_subtok = False
|
||||||
id_ = int(id_) - 1
|
id_ = int(id_) - 1
|
||||||
head = (int(head) - 1) if head != "0" else id_
|
head = (int(head) - 1) if head not in ("0", "_") else id_
|
||||||
tag = pos if tag == "_" else tag
|
tag = pos if tag == "_" else tag
|
||||||
morph = morph if morph != "_" else ""
|
morph = morph if morph != "_" else ""
|
||||||
dep = "ROOT" if dep == "root" else dep
|
dep = "ROOT" if dep == "root" else dep
|
||||||
|
@ -266,9 +289,17 @@ def example_from_conllu_sentence(vocab, lines, ner_tag_pattern,
|
||||||
if space:
|
if space:
|
||||||
raw += " "
|
raw += " "
|
||||||
example = Example(doc=raw)
|
example = Example(doc=raw)
|
||||||
example.set_token_annotation(ids=ids, words=words, tags=tags, pos=pos,
|
example.set_token_annotation(
|
||||||
morphs=morphs, lemmas=lemmas, heads=heads,
|
ids=ids,
|
||||||
deps=deps, entities=ents)
|
words=words,
|
||||||
|
tags=tags,
|
||||||
|
pos=pos,
|
||||||
|
morphs=morphs,
|
||||||
|
lemmas=lemmas,
|
||||||
|
heads=heads,
|
||||||
|
deps=deps,
|
||||||
|
entities=ents,
|
||||||
|
)
|
||||||
return example
|
return example
|
||||||
|
|
||||||
|
|
||||||
|
@ -292,7 +323,7 @@ def merge_conllu_subtokens(lines, doc):
|
||||||
if token._.merged_morph:
|
if token._.merged_morph:
|
||||||
for feature in token._.merged_morph.split("|"):
|
for feature in token._.merged_morph.split("|"):
|
||||||
field, values = feature.split("=", 1)
|
field, values = feature.split("=", 1)
|
||||||
if not field in morphs:
|
if field not in morphs:
|
||||||
morphs[field] = set()
|
morphs[field] = set()
|
||||||
for value in values.split(","):
|
for value in values.split(","):
|
||||||
morphs[field].add(value)
|
morphs[field].add(value)
|
||||||
|
@ -306,7 +337,9 @@ def merge_conllu_subtokens(lines, doc):
|
||||||
token._.merged_lemma = " ".join(lemmas)
|
token._.merged_lemma = " ".join(lemmas)
|
||||||
token.tag_ = "_".join(tags)
|
token.tag_ = "_".join(tags)
|
||||||
token._.merged_morph = "|".join(sorted(morphs.values()))
|
token._.merged_morph = "|".join(sorted(morphs.values()))
|
||||||
token._.merged_spaceafter = True if subtok_span[-1].whitespace_ else False
|
token._.merged_spaceafter = (
|
||||||
|
True if subtok_span[-1].whitespace_ else False
|
||||||
|
)
|
||||||
|
|
||||||
with doc.retokenize() as retokenizer:
|
with doc.retokenize() as retokenizer:
|
||||||
for span in subtok_spans:
|
for span in subtok_spans:
|
||||||
|
|
|
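The `merge_conllu_subtokens` hunks above collect the morphological features of merged subtokens into a mapping from field name to a set of values. A minimal standalone sketch of that idea follows. The helper name `merge_morph_features` and the serialization (fields sorted, values sorted and comma-joined) are assumptions made for illustration; this is not the commit's code and it does not depend on spaCy.

```python
def merge_morph_features(subtoken_morphs):
    """Combine CoNLL-U style feature strings from several subtokens.

    Hypothetical helper: each input looks like "Case=Nom|Number=Sing";
    empty strings and "_" (no features) are skipped.
    """
    morphs = {}
    for morph in subtoken_morphs:
        if not morph or morph == "_":
            continue
        for feature in morph.split("|"):
            field, values = feature.split("=", 1)
            if field not in morphs:
                morphs[field] = set()
            for value in values.split(","):
                morphs[field].add(value)
    # Assumed output format: sorted fields, sorted comma-joined values.
    return "|".join(
        f"{field}={','.join(sorted(values))}"
        for field, values in sorted(morphs.items())
    )


merged = merge_morph_features(["Case=Nom|Number=Sing", "Number=Plur", "_"])
print(merged)
```

Sorting both fields and values keeps the merged string deterministic, which makes it easier to compare across runs.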
@ -166,6 +166,7 @@ def debug_data(
|
||||||
has_low_data_warning = False
|
has_low_data_warning = False
|
||||||
has_no_neg_warning = False
|
has_no_neg_warning = False
|
||||||
has_ws_ents_error = False
|
has_ws_ents_error = False
|
||||||
|
has_punct_ents_warning = False
|
||||||
|
|
||||||
msg.divider("Named Entity Recognition")
|
msg.divider("Named Entity Recognition")
|
||||||
msg.info(
|
msg.info(
|
||||||
|
@ -190,6 +191,10 @@ def debug_data(
|
||||||
msg.fail(f"{gold_train_data['ws_ents']} invalid whitespace entity spans")
|
msg.fail(f"{gold_train_data['ws_ents']} invalid whitespace entity spans")
|
||||||
has_ws_ents_error = True
|
has_ws_ents_error = True
|
||||||
|
|
||||||
|
if gold_train_data["punct_ents"]:
|
||||||
|
msg.warn(f"{gold_train_data['punct_ents']} entity span(s) with punctuation")
|
||||||
|
has_punct_ents_warning = True
|
||||||
|
|
||||||
for label in new_labels:
|
for label in new_labels:
|
||||||
if label_counts[label] <= NEW_LABEL_THRESHOLD:
|
if label_counts[label] <= NEW_LABEL_THRESHOLD:
|
||||||
msg.warn(
|
msg.warn(
|
||||||
|
@ -209,6 +214,8 @@ def debug_data(
|
||||||
msg.good("Examples without occurrences available for all labels")
|
msg.good("Examples without occurrences available for all labels")
|
||||||
if not has_ws_ents_error:
|
if not has_ws_ents_error:
|
||||||
msg.good("No entities consisting of or starting/ending with whitespace")
|
msg.good("No entities consisting of or starting/ending with whitespace")
|
||||||
|
if not has_punct_ents_warning:
|
||||||
|
msg.good("No entities consisting of or starting/ending with punctuation")
|
||||||
|
|
||||||
if has_low_data_warning:
|
if has_low_data_warning:
|
||||||
msg.text(
|
msg.text(
|
||||||
|
@ -229,6 +236,12 @@ def debug_data(
|
||||||
"with whitespace characters are considered invalid."
|
"with whitespace characters are considered invalid."
|
||||||
)
|
)
|
||||||
|
|
||||||
|
if has_punct_ents_warning:
|
||||||
|
msg.text(
|
||||||
|
"Entity spans consisting of or starting/ending "
|
||||||
|
"with punctuation can not be trained with a noise level > 0."
|
||||||
|
)
|
||||||
|
|
||||||
if "textcat" in pipeline:
|
if "textcat" in pipeline:
|
||||||
msg.divider("Text Classification")
|
msg.divider("Text Classification")
|
||||||
labels = [label for label in gold_train_data["cats"]]
|
labels = [label for label in gold_train_data["cats"]]
|
||||||
|
@ -446,6 +459,7 @@ def _compile_gold(examples, pipeline):
|
||||||
"words": Counter(),
|
"words": Counter(),
|
||||||
"roots": Counter(),
|
"roots": Counter(),
|
||||||
"ws_ents": 0,
|
"ws_ents": 0,
|
||||||
|
"punct_ents": 0,
|
||||||
"n_words": 0,
|
"n_words": 0,
|
||||||
"n_misaligned_words": 0,
|
"n_misaligned_words": 0,
|
||||||
"n_sents": 0,
|
"n_sents": 0,
|
||||||
|
@ -469,6 +483,16 @@ def _compile_gold(examples, pipeline):
|
||||||
if label.startswith(("B-", "U-", "L-")) and doc[i].is_space:
|
if label.startswith(("B-", "U-", "L-")) and doc[i].is_space:
|
||||||
# "Illegal" whitespace entity
|
# "Illegal" whitespace entity
|
||||||
data["ws_ents"] += 1
|
data["ws_ents"] += 1
|
||||||
|
if label.startswith(("B-", "U-", "L-")) and doc[i].text in [
|
||||||
|
".",
|
||||||
|
"'",
|
||||||
|
"!",
|
||||||
|
"?",
|
||||||
|
",",
|
||||||
|
]:
|
||||||
|
# punctuation entity: could be replaced by whitespace when training with noise,
|
||||||
|
# so add a warning to alert the user to this unexpected side effect.
|
||||||
|
data["punct_ents"] += 1
|
||||||
if label.startswith(("B-", "U-")):
|
if label.startswith(("B-", "U-")):
|
||||||
combined_label = label.split("-")[1]
|
combined_label = label.split("-")[1]
|
||||||
data["ner"][combined_label] += 1
|
data["ner"][combined_label] += 1
|
||||||
|
|
|
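The `_compile_gold` hunk above counts entity spans whose boundary token (a `B-`, `U-` or `L-` tag in the BILUO scheme) is one of a small set of punctuation characters. The sketch below reproduces that check on plain Python lists, without spaCy. The function name `count_punct_boundary_ents` and the sample sentence are hypothetical; the punctuation set is copied from the diff.

```python
# Same characters as the list literal in the diff.
PUNCT_CHARS = {".", "'", "!", "?", ","}


def count_punct_boundary_ents(words, biluo_tags):
    """Count entity boundary tokens (B-, U-, L- tags) that are punctuation."""
    count = 0
    for word, tag in zip(words, biluo_tags):
        if tag.startswith(("B-", "U-", "L-")) and word in PUNCT_CHARS:
            count += 1
    return count


# Illustrative sample: the ORG span ends on ".", so one boundary is flagged.
words = ["Apple", "Inc", ".", "hired", "Bob", "!"]
tags = ["B-ORG", "I-ORG", "L-ORG", "O", "U-PERSON", "O"]
n_punct = count_punct_boundary_ents(words, tags)
print(n_punct)
```

Only boundary tokens are inspected, so punctuation inside a span (an `I-` tag) or outside any entity (`O`) is not counted.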
@ -4,14 +4,12 @@ import time
|
||||||
import re
|
import re
|
||||||
from collections import Counter
|
from collections import Counter
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from thinc.layers import Linear, Maxout
|
from thinc.api import Linear, Maxout, chain, list2array, prefer_gpu
|
||||||
from thinc.util import prefer_gpu
|
from thinc.api import CosineDistance, L2Distance
|
||||||
from wasabi import msg
|
from wasabi import msg
|
||||||
import srsly
|
import srsly
|
||||||
from thinc.layers import chain, list2array
|
|
||||||
from thinc.loss import CosineDistance, L2Distance
|
|
||||||
|
|
||||||
from spacy.gold import Example
|
from ..gold import Example
|
||||||
from ..errors import Errors
|
from ..errors import Errors
|
||||||
from ..tokens import Doc
|
from ..tokens import Doc
|
||||||
from ..attrs import ID, HEAD
|
from ..attrs import ID, HEAD
|
||||||
|
@ -28,7 +26,7 @@ def pretrain(
|
||||||
vectors_model: ("Name or path to spaCy model with vectors to learn from", "positional", None, str),
|
vectors_model: ("Name or path to spaCy model with vectors to learn from", "positional", None, str),
|
||||||
output_dir: ("Directory to write models to on each epoch", "positional", None, str),
|
output_dir: ("Directory to write models to on each epoch", "positional", None, str),
|
||||||
width: ("Width of CNN layers", "option", "cw", int) = 96,
|
width: ("Width of CNN layers", "option", "cw", int) = 96,
|
||||||
depth: ("Depth of CNN layers", "option", "cd", int) = 4,
|
conv_depth: ("Depth of CNN layers", "option", "cd", int) = 4,
|
||||||
bilstm_depth: ("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int) = 0,
|
bilstm_depth: ("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int) = 0,
|
||||||
cnn_pieces: ("Maxout size for CNN layers. 1 for Mish", "option", "cP", int) = 3,
|
cnn_pieces: ("Maxout size for CNN layers. 1 for Mish", "option", "cP", int) = 3,
|
||||||
sa_depth: ("Depth of self-attention layers", "option", "sa", int) = 0,
|
sa_depth: ("Depth of self-attention layers", "option", "sa", int) = 0,
|
||||||
|
@ -77,9 +75,15 @@ def pretrain(
|
||||||
msg.info("Using GPU" if has_gpu else "Not using GPU")
|
msg.info("Using GPU" if has_gpu else "Not using GPU")
|
||||||
|
|
||||||
output_dir = Path(output_dir)
|
output_dir = Path(output_dir)
|
||||||
|
if output_dir.exists() and [p for p in output_dir.iterdir()]:
|
||||||
|
msg.warn(
|
||||||
|
"Output directory is not empty",
|
||||||
|
"It is better to use an empty directory or refer to a new output path, "
|
||||||
|
"then the new directory will be created for you.",
|
||||||
|
)
|
||||||
if not output_dir.exists():
|
if not output_dir.exists():
|
||||||
output_dir.mkdir()
|
output_dir.mkdir()
|
||||||
msg.good("Created output directory")
|
msg.good(f"Created output directory: {output_dir}")
|
||||||
srsly.write_json(output_dir / "config.json", config)
|
srsly.write_json(output_dir / "config.json", config)
|
||||||
msg.good("Saved settings to config.json")
|
msg.good("Saved settings to config.json")
|
||||||
|
|
||||||
|
@ -107,7 +111,7 @@ def pretrain(
|
||||||
Tok2Vec(
|
Tok2Vec(
|
||||||
width,
|
width,
|
||||||
embed_rows,
|
embed_rows,
|
||||||
conv_depth=depth,
|
conv_depth=conv_depth,
|
||||||
pretrained_vectors=pretrained_vectors,
|
pretrained_vectors=pretrained_vectors,
|
||||||
bilstm_depth=bilstm_depth, # Requires PyTorch. Experimental.
|
bilstm_depth=bilstm_depth, # Requires PyTorch. Experimental.
|
||||||
subword_features=not use_chars, # Set to False for Chinese etc
|
subword_features=not use_chars, # Set to False for Chinese etc
|
||||||
|
|
|
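The `pretrain` hunk above warns when the output directory already contains files and creates the directory when it is missing. A minimal sketch of the same check using only the standard library is shown here. The helper name `prepare_output_dir` and its return value are assumptions for illustration, and the sketch reports the non-empty state to the caller rather than printing a warning.

```python
import tempfile
from pathlib import Path


def prepare_output_dir(output_dir):
    """Return (path, was_non_empty); create the directory if it is missing."""
    output_dir = Path(output_dir)
    # any() stops at the first entry, so large directories are not fully listed.
    was_non_empty = output_dir.exists() and any(output_dir.iterdir())
    if not output_dir.exists():
        output_dir.mkdir()
    return output_dir, was_non_empty


with tempfile.TemporaryDirectory() as tmp:
    # First call: the directory does not exist yet, so it is created.
    new_dir, non_empty_new = prepare_output_dir(Path(tmp) / "pretrain-out")
    created = new_dir.is_dir()
    # Second call: the directory now holds a file and is reported as non-empty.
    (new_dir / "config.json").write_text("{}")
    _, non_empty_again = prepare_output_dir(new_dir)
```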
@ -1,7 +1,7 @@
|
||||||
import os
|
import os
|
||||||
import tqdm
|
import tqdm
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from thinc.backends import use_ops
|
from thinc.api import use_ops
|
||||||
from timeit import default_timer as timer
|
from timeit import default_timer as timer
|
||||||
import shutil
|
import shutil
|
||||||
import srsly
|
import srsly
|
||||||
|
@ -10,6 +10,7 @@ import contextlib
|
||||||
import random
|
import random
|
||||||
|
|
||||||
from ..util import create_default_optimizer
|
from ..util import create_default_optimizer
|
||||||
|
from ..util import use_gpu as set_gpu
|
||||||
from ..attrs import PROB, IS_OOV, CLUSTER, LANG
|
from ..attrs import PROB, IS_OOV, CLUSTER, LANG
|
||||||
from ..gold import GoldCorpus
|
from ..gold import GoldCorpus
|
||||||
from .. import util
|
from .. import util
|
||||||
|
@ -26,6 +27,14 @@ def train(
|
||||||
base_model: ("Name of model to update (optional)", "option", "b", str) = None,
|
base_model: ("Name of model to update (optional)", "option", "b", str) = None,
|
||||||
pipeline: ("Comma-separated names of pipeline components", "option", "p", str) = "tagger,parser,ner",
|
pipeline: ("Comma-separated names of pipeline components", "option", "p", str) = "tagger,parser,ner",
|
||||||
vectors: ("Model to load vectors from", "option", "v", str) = None,
|
vectors: ("Model to load vectors from", "option", "v", str) = None,
|
||||||
|
replace_components: ("Replace components from base model", "flag", "R", bool) = False,
|
||||||
|
width: ("Width of CNN layers of Tok2Vec component", "option", "cw", int) = 96,
|
||||||
|
conv_depth: ("Depth of CNN layers of Tok2Vec component", "option", "cd", int) = 4,
|
||||||
|
cnn_window: ("Window size for CNN layers of Tok2Vec component", "option", "cW", int) = 1,
|
||||||
|
cnn_pieces: ("Maxout size for CNN layers of Tok2Vec component. 1 for Mish", "option", "cP", int) = 3,
|
||||||
|
use_chars: ("Whether to use character-based embedding of Tok2Vec component", "flag", "chr", bool) = False,
|
||||||
|
bilstm_depth: ("Depth of BiLSTM layers of Tok2Vec component (requires PyTorch)", "option", "lstm", int) = 0,
|
||||||
|
embed_rows: ("Number of embedding rows of Tok2Vec component", "option", "er", int) = 2000,
|
||||||
n_iter: ("Number of iterations", "option", "n", int) = 30,
|
n_iter: ("Number of iterations", "option", "n", int) = 30,
|
||||||
n_early_stopping: ("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int) = None,
|
n_early_stopping: ("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int) = None,
|
||||||
n_examples: ("Number of examples", "option", "ns", int) = 0,
|
n_examples: ("Number of examples", "option", "ns", int) = 0,
|
||||||
|
@ -80,6 +89,7 @@ def train(
|
||||||
)
|
)
|
||||||
if not output_path.exists():
|
if not output_path.exists():
|
||||||
output_path.mkdir()
|
output_path.mkdir()
|
||||||
|
msg.good(f"Created output directory: {output_path}")
|
||||||
|
|
||||||
tag_map = {}
|
tag_map = {}
|
||||||
if tag_map_path is not None:
|
if tag_map_path is not None:
|
||||||
|
@ -113,6 +123,21 @@ def train(
|
||||||
# training starts from a blank model, initialize the language class.
|
# training starts from a blank model, initialize the language class.
|
||||||
pipeline = [p.strip() for p in pipeline.split(",")]
|
pipeline = [p.strip() for p in pipeline.split(",")]
|
||||||
msg.text(f"Training pipeline: {pipeline}")
|
msg.text(f"Training pipeline: {pipeline}")
|
||||||
|
disabled_pipes = None
|
||||||
|
pipes_added = False
|
||||||
|
msg.text(f"Training pipeline: {pipeline}")
|
||||||
|
if use_gpu >= 0:
|
||||||
|
activated_gpu = None
|
||||||
|
try:
|
||||||
|
activated_gpu = set_gpu(use_gpu)
|
||||||
|
except Exception as e:
|
||||||
|
msg.warn(f"Exception: {e}")
|
||||||
|
if activated_gpu is not None:
|
||||||
|
msg.text(f"Using GPU: {use_gpu}")
|
||||||
|
else:
|
||||||
|
msg.warn(f"Unable to activate GPU: {use_gpu}")
|
||||||
|
msg.text("Using CPU only")
|
||||||
|
use_gpu = -1
|
||||||
if base_model:
|
if base_model:
|
||||||
msg.text(f"Starting with base model '{base_model}'")
|
msg.text(f"Starting with base model '{base_model}'")
|
||||||
nlp = util.load_model(base_model)
|
nlp = util.load_model(base_model)
|
||||||
|
@ -122,9 +147,8 @@ def train(
|
||||||
f"specified as `lang` argument ('{lang}') ",
|
f"specified as `lang` argument ('{lang}') ",
|
||||||
exits=1,
|
exits=1,
|
||||||
)
|
)
|
||||||
nlp.disable_pipes([p for p in nlp.pipe_names if p not in pipeline])
|
|
||||||
for pipe in pipeline:
|
for pipe in pipeline:
|
||||||
if pipe not in nlp.pipe_names:
|
pipe_cfg = {}
|
||||||
if pipe == "parser":
|
if pipe == "parser":
|
||||||
pipe_cfg = {"learn_tokens": learn_tokens}
|
pipe_cfg = {"learn_tokens": learn_tokens}
|
||||||
elif pipe == "textcat":
|
elif pipe == "textcat":
|
||||||
|
@ -133,9 +157,14 @@ def train(
|
||||||
"architecture": textcat_arch,
|
"architecture": textcat_arch,
|
||||||
"positive_label": textcat_positive_label,
|
"positive_label": textcat_positive_label,
|
||||||
}
|
}
|
||||||
else:
|
if pipe not in nlp.pipe_names:
|
||||||
pipe_cfg = {}
|
msg.text(f"Adding component to base model '{pipe}'")
|
||||||
nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg))
|
nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg))
|
||||||
|
pipes_added = True
|
||||||
|
elif replace_components:
|
||||||
|
msg.text(f"Replacing component from base model '{pipe}'")
|
||||||
|
nlp.replace_pipe(pipe, nlp.create_pipe(pipe, config=pipe_cfg))
|
||||||
|
pipes_added = True
|
||||||
else:
|
else:
|
||||||
if pipe == "textcat":
|
if pipe == "textcat":
|
||||||
textcat_cfg = nlp.get_pipe("textcat").cfg
|
textcat_cfg = nlp.get_pipe("textcat").cfg
|
||||||
|
@ -144,11 +173,6 @@ def train(
|
||||||
"architecture": textcat_cfg["architecture"],
|
"architecture": textcat_cfg["architecture"],
|
||||||
"positive_label": textcat_cfg["positive_label"],
|
"positive_label": textcat_cfg["positive_label"],
|
||||||
}
|
}
|
||||||
pipe_cfg = {
|
|
||||||
"exclusive_classes": not textcat_multilabel,
|
|
||||||
"architecture": textcat_arch,
|
|
||||||
"positive_label": textcat_positive_label,
|
|
||||||
}
|
|
||||||
if base_cfg != pipe_cfg:
|
if base_cfg != pipe_cfg:
|
||||||
msg.fail(
|
msg.fail(
|
||||||
f"The base textcat model configuration does"
|
f"The base textcat model configuration does"
|
||||||
|
@ -156,6 +180,10 @@ def train(
|
||||||
f"Existing cfg: {base_cfg}, provided cfg: {pipe_cfg}",
|
f"Existing cfg: {base_cfg}, provided cfg: {pipe_cfg}",
|
||||||
exits=1,
|
exits=1,
|
||||||
)
|
)
|
||||||
|
msg.text(f"Extending component from base model '{pipe}'")
|
||||||
|
disabled_pipes = nlp.disable_pipes(
|
||||||
|
[p for p in nlp.pipe_names if p not in pipeline]
|
||||||
|
)
|
||||||
else:
|
else:
|
||||||
msg.text(f"Starting with blank model '{lang}'")
|
msg.text(f"Starting with blank model '{lang}'")
|
||||||
lang_cls = util.get_lang_class(lang)
|
lang_cls = util.get_lang_class(lang)
|
||||||
|
@ -198,13 +226,20 @@ def train(
|
||||||
corpus = GoldCorpus(train_path, dev_path, limit=n_examples)
|
corpus = GoldCorpus(train_path, dev_path, limit=n_examples)
|
||||||
n_train_words = corpus.count_train()
|
n_train_words = corpus.count_train()
|
||||||
|
|
||||||
if base_model:
|
if base_model and not pipes_added:
|
||||||
# Start with an existing model, use default optimizer
|
# Start with an existing model, use default optimizer
|
||||||
optimizer = create_default_optimizer()
|
optimizer = create_default_optimizer()
|
||||||
else:
|
else:
|
||||||
# Start with a blank model, call begin_training
|
# Start with a blank model, call begin_training
|
||||||
optimizer = nlp.begin_training(lambda: corpus.train_examples, device=use_gpu)
|
cfg = {"device": use_gpu}
|
||||||
|
cfg["conv_depth"] = conv_depth
|
||||||
|
cfg["token_vector_width"] = width
|
||||||
|
cfg["bilstm_depth"] = bilstm_depth
|
||||||
|
cfg["cnn_maxout_pieces"] = cnn_pieces
|
||||||
|
cfg["embed_size"] = embed_rows
|
||||||
|
cfg["conv_window"] = cnn_window
|
||||||
|
cfg["subword_features"] = not use_chars
|
||||||
|
optimizer = nlp.begin_training(lambda: corpus.train_tuples, **cfg)
|
||||||
nlp._optimizer = None
|
nlp._optimizer = None
|
||||||
|
|
||||||
# Load in pretrained weights
|
# Load in pretrained weights
|
||||||
|
@ -214,7 +249,7 @@ def train(
|
||||||
|
|
||||||
# Verify textcat config
|
# Verify textcat config
|
||||||
if "textcat" in pipeline:
|
if "textcat" in pipeline:
|
||||||
textcat_labels = nlp.get_pipe("textcat").cfg["labels"]
|
textcat_labels = nlp.get_pipe("textcat").cfg.get("labels", [])
|
||||||
if textcat_positive_label and textcat_positive_label not in textcat_labels:
|
if textcat_positive_label and textcat_positive_label not in textcat_labels:
|
||||||
msg.fail(
|
msg.fail(
|
||||||
f"The textcat_positive_label (tpl) '{textcat_positive_label}' "
|
f"The textcat_positive_label (tpl) '{textcat_positive_label}' "
|
||||||
|
@ -327,12 +362,22 @@ def train(
|
||||||
for batch in util.minibatch_by_words(train_data, size=batch_sizes):
|
for batch in util.minibatch_by_words(train_data, size=batch_sizes):
|
||||||
if not batch:
|
if not batch:
|
||||||
continue
|
continue
|
||||||
|
docs, golds = zip(*batch)
|
||||||
|
try:
|
||||||
nlp.update(
|
nlp.update(
|
||||||
batch,
|
docs,
|
||||||
|
golds,
|
||||||
sgd=optimizer,
|
sgd=optimizer,
|
||||||
drop=next(dropout_rates),
|
drop=next(dropout_rates),
|
||||||
losses=losses,
|
losses=losses,
|
||||||
)
|
)
|
||||||
|
except ValueError as e:
|
||||||
|
msg.warn("Error during training")
|
||||||
|
if init_tok2vec:
|
||||||
|
msg.warn(
|
||||||
|
"Did you provide the same parameters during 'train' as during 'pretrain'?"
|
||||||
|
)
|
||||||
|
msg.fail(f"Original error message: {e}", exits=1)
|
||||||
if raw_text:
|
if raw_text:
|
||||||
# If raw text is available, perform 'rehearsal' updates,
|
# If raw text is available, perform 'rehearsal' updates,
|
||||||
# which use unlabelled data to reduce overfitting.
|
# which use unlabelled data to reduce overfitting.
|
||||||
|
@ -396,11 +441,16 @@ def train(
|
||||||
"cpu": cpu_wps,
|
"cpu": cpu_wps,
|
||||||
"gpu": gpu_wps,
|
"gpu": gpu_wps,
|
||||||
}
|
}
|
||||||
meta["accuracy"] = scorer.scores
|
meta.setdefault("accuracy", {})
|
||||||
|
for component in nlp.pipe_names:
|
||||||
|
for metric in _get_metrics(component):
|
||||||
|
meta["accuracy"][metric] = scorer.scores[metric]
|
||||||
else:
|
else:
|
||||||
meta.setdefault("beam_accuracy", {})
|
meta.setdefault("beam_accuracy", {})
|
||||||
meta.setdefault("beam_speed", {})
|
meta.setdefault("beam_speed", {})
|
||||||
meta["beam_accuracy"][beam_width] = scorer.scores
|
for component in nlp.pipe_names:
|
||||||
|
for metric in _get_metrics(component):
|
||||||
|
meta["beam_accuracy"][metric] = scorer.scores[metric]
|
||||||
meta["beam_speed"][beam_width] = {
|
meta["beam_speed"][beam_width] = {
|
||||||
"nwords": nwords,
|
"nwords": nwords,
|
||||||
"cpu": cpu_wps,
|
"cpu": cpu_wps,
|
||||||
|
@ -453,13 +503,19 @@ def train(
|
||||||
f"Best score = {best_score}; Final iteration score = {current_score}"
|
f"Best score = {best_score}; Final iteration score = {current_score}"
|
||||||
)
|
)
|
||||||
break
|
break
|
||||||
|
except Exception as e:
|
||||||
|
msg.warn(f"Aborting and saving final best model. Encountered exception: {e}")
|
||||||
finally:
|
finally:
|
||||||
|
best_pipes = nlp.pipe_names
|
||||||
|
if disabled_pipes:
|
||||||
|
disabled_pipes.restore()
|
||||||
with nlp.use_params(optimizer.averages):
|
with nlp.use_params(optimizer.averages):
|
||||||
final_model_path = output_path / "model-final"
|
final_model_path = output_path / "model-final"
|
||||||
nlp.to_disk(final_model_path)
|
nlp.to_disk(final_model_path)
|
||||||
|
final_meta = srsly.read_json(output_path / "model-final" / "meta.json")
|
||||||
msg.good("Saved model to output directory", final_model_path)
|
msg.good("Saved model to output directory", final_model_path)
|
||||||
with msg.loading("Creating best model..."):
|
with msg.loading("Creating best model..."):
|
||||||
best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names)
|
best_model_path = _collate_best_model(final_meta, output_path, best_pipes)
|
||||||
msg.good("Created best model", best_model_path)
|
msg.good("Created best model", best_model_path)
|
||||||
|
|
||||||
|
|
||||||
|
@ -519,15 +575,14 @@ def _load_pretrained_tok2vec(nlp, loc):
|
||||||
|
|
||||||
def _collate_best_model(meta, output_path, components):
|
def _collate_best_model(meta, output_path, components):
|
||||||
bests = {}
|
bests = {}
|
||||||
|
meta.setdefault("accuracy", {})
|
||||||
for component in components:
|
for component in components:
|
||||||
bests[component] = _find_best(output_path, component)
|
bests[component] = _find_best(output_path, component)
|
||||||
best_dest = output_path / "model-best"
|
best_dest = output_path / "model-best"
|
||||||
shutil.copytree(str(output_path / "model-final"), str(best_dest))
|
shutil.copytree(str(output_path / "model-final"), str(best_dest))
|
||||||
for component, best_component_src in bests.items():
|
for component, best_component_src in bests.items():
|
||||||
shutil.rmtree(str(best_dest / component))
|
shutil.rmtree(str(best_dest / component))
|
||||||
shutil.copytree(
|
shutil.copytree(str(best_component_src / component), str(best_dest / component))
|
||||||
str(best_component_src / component), str(best_dest / component)
|
|
||||||
)
|
|
||||||
accs = srsly.read_json(best_component_src / "accuracy.json")
|
accs = srsly.read_json(best_component_src / "accuracy.json")
|
||||||
for metric in _get_metrics(component):
|
for metric in _get_metrics(component):
|
||||||
meta["accuracy"][metric] = accs[metric]
|
meta["accuracy"][metric] = accs[metric]
|
||||||
|
@ -550,13 +605,15 @@ def _find_best(experiment_dir, component):
|
||||||
|
|
||||||
def _get_metrics(component):
|
def _get_metrics(component):
|
||||||
if component == "parser":
|
if component == "parser":
|
||||||
return ("las", "uas", "token_acc", "sent_f")
|
return ("las", "uas", "las_per_type", "token_acc", "sent_f")
|
||||||
elif component == "tagger":
|
elif component == "tagger":
|
||||||
return ("tags_acc",)
|
return ("tags_acc",)
|
||||||
elif component == "ner":
|
elif component == "ner":
|
||||||
return ("ents_f", "ents_p", "ents_r")
|
return ("ents_f", "ents_p", "ents_r", "enty_per_type")
|
||||||
elif component == "sentrec":
|
elif component == "sentrec":
|
||||||
return ("sent_f", "sent_p", "sent_r")
|
return ("sent_f", "sent_p", "sent_r")
|
||||||
|
elif component == "textcat":
|
||||||
|
return ("textcat_score",)
|
||||||
return ("token_acc",)
|
return ("token_acc",)
|
||||||
|
|
||||||
|
|
||||||
|
@ -568,8 +625,12 @@ def _configure_training_output(pipeline, use_gpu, has_beam_widths):
|
||||||
row_head.extend(["Tag Loss ", " Tag % "])
|
row_head.extend(["Tag Loss ", " Tag % "])
|
||||||
output_stats.extend(["tag_loss", "tags_acc"])
|
output_stats.extend(["tag_loss", "tags_acc"])
|
||||||
elif pipe == "parser":
|
elif pipe == "parser":
|
||||||
row_head.extend(["Dep Loss ", " UAS ", " LAS ", "Sent P", "Sent R", "Sent F"])
|
row_head.extend(
|
||||||
output_stats.extend(["dep_loss", "uas", "las", "sent_p", "sent_r", "sent_f"])
|
["Dep Loss ", " UAS ", " LAS ", "Sent P", "Sent R", "Sent F"]
|
||||||
|
)
|
||||||
|
output_stats.extend(
|
||||||
|
["dep_loss", "uas", "las", "sent_p", "sent_r", "sent_f"]
|
||||||
|
)
|
||||||
elif pipe == "ner":
|
elif pipe == "ner":
|
||||||
row_head.extend(["NER Loss ", "NER P ", "NER R ", "NER F "])
|
row_head.extend(["NER Loss ", "NER P ", "NER R ", "NER F "])
|
||||||
output_stats.extend(["ner_loss", "ents_p", "ents_r", "ents_f"])
|
output_stats.extend(["ner_loss", "ents_p", "ents_r", "ents_f"])
|
||||||
|
|
|
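The `train` hunks above stop copying the whole `scorer.scores` dictionary into `meta["accuracy"]` and keep only the metrics that belong to components in the pipeline, as listed by `_get_metrics`. The sketch below shows that filtering in isolation. `get_metrics` mirrors the new side of the diff; `select_accuracy` is a hypothetical helper, and the score values are invented for illustration only.

```python
def get_metrics(component):
    """Metric names per component, mirroring the new side of the diff."""
    if component == "parser":
        return ("las", "uas", "las_per_type", "token_acc", "sent_f")
    elif component == "tagger":
        return ("tags_acc",)
    elif component == "ner":
        return ("ents_f", "ents_p", "ents_r", "enty_per_type")
    elif component == "sentrec":
        return ("sent_f", "sent_p", "sent_r")
    elif component == "textcat":
        return ("textcat_score",)
    return ("token_acc",)


def select_accuracy(pipe_names, scores):
    """Keep only the scores that belong to the given pipeline components."""
    accuracy = {}
    for component in pipe_names:
        for metric in get_metrics(component):
            accuracy[metric] = scores[metric]
    return accuracy


# Invented scores, for illustration only.
scores = {
    "tags_acc": 97.1,
    "ents_f": 85.0,
    "ents_p": 86.0,
    "ents_r": 84.0,
    "enty_per_type": {},
    "uas": 0.0,
    "las": 0.0,
}
accuracy = select_accuracy(["tagger", "ner"], scores)
print(sorted(accuracy))
```

With a tagger and NER pipeline, the parser metrics `uas` and `las` are left out of the result even though the scorer reported them.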
@ -1,19 +1,20 @@
|
||||||
|
from typing import Optional, Dict, List, Union, Sequence
|
||||||
import plac
|
import plac
|
||||||
from thinc.util import require_gpu
|
|
||||||
from wasabi import msg
|
from wasabi import msg
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
import thinc
|
import thinc
|
||||||
import thinc.schedules
|
import thinc.schedules
|
||||||
from thinc.model import Model
|
from thinc.api import Model
|
||||||
from spacy.gold import GoldCorpus
|
|
||||||
import spacy
|
|
||||||
from spacy.pipeline.tok2vec import Tok2VecListener
|
|
||||||
from typing import Optional, Dict, List, Union, Sequence
|
|
||||||
from pydantic import BaseModel, FilePath, StrictInt
|
from pydantic import BaseModel, FilePath, StrictInt
|
||||||
import tqdm
|
import tqdm
|
||||||
|
|
||||||
from ..ml import component_models
|
# TODO: relative imports?
|
||||||
from .. import util
|
import spacy
|
||||||
|
from spacy.gold import GoldCorpus
|
||||||
|
from spacy.pipeline.tok2vec import Tok2VecListener
|
||||||
|
from spacy.ml import component_models
|
||||||
|
from spacy import util
|
||||||
|
|
||||||
|
|
||||||
registry = util.registry
|
registry = util.registry
|
||||||
|
|
||||||
|
@ -153,10 +154,9 @@ def create_tb_parser_model(
|
||||||
hidden_width: StrictInt = 64,
|
hidden_width: StrictInt = 64,
|
||||||
maxout_pieces: StrictInt = 3,
|
maxout_pieces: StrictInt = 3,
|
||||||
):
|
):
|
||||||
from thinc.layers import Linear, chain, list2array
|
from thinc.api import Linear, chain, list2array, use_ops, zero_init
|
||||||
from spacy.ml._layers import PrecomputableAffine
|
from spacy.ml._layers import PrecomputableAffine
|
||||||
from spacy.syntax._parser_model import ParserModel
|
from spacy.syntax._parser_model import ParserModel
|
||||||
from thinc.api import use_ops, zero_init
|
|
||||||
|
|
||||||
token_vector_width = tok2vec.get_dim("nO")
|
token_vector_width = tok2vec.get_dim("nO")
|
||||||
tok2vec = chain(tok2vec, list2array())
|
tok2vec = chain(tok2vec, list2array())
|
||||||
|
@ -221,13 +221,9 @@ def train_from_config_cli(
|
||||||
|
|
||||||
|
|
||||||
def train_from_config(
|
def train_from_config(
|
||||||
config_path,
|
config_path, data_paths, raw_text=None, meta_path=None, output_path=None,
|
||||||
data_paths,
|
|
||||||
raw_text=None,
|
|
||||||
meta_path=None,
|
|
||||||
output_path=None,
|
|
||||||
):
|
):
|
||||||
msg.info("Loading config from: {}".format(config_path))
|
msg.info(f"Loading config from: {config_path}")
|
||||||
config = util.load_from_config(config_path, create_objects=True)
|
config = util.load_from_config(config_path, create_objects=True)
|
||||||
use_gpu = config["training"]["use_gpu"]
|
use_gpu = config["training"]["use_gpu"]
|
||||||
if use_gpu >= 0:
|
if use_gpu >= 0:
|
||||||
|
@ -241,9 +237,7 @@ def train_from_config(
|
||||||
msg.info("Loading training corpus")
|
msg.info("Loading training corpus")
|
||||||
corpus = GoldCorpus(data_paths["train"], data_paths["dev"], limit=limit)
|
corpus = GoldCorpus(data_paths["train"], data_paths["dev"], limit=limit)
|
||||||
msg.info("Initializing the nlp pipeline")
|
msg.info("Initializing the nlp pipeline")
|
||||||
nlp.begin_training(
|
nlp.begin_training(lambda: corpus.train_examples, device=use_gpu)
|
||||||
lambda: corpus.train_examples, device=use_gpu
|
|
||||||
)
|
|
||||||
|
|
||||||
train_batches = create_train_batches(nlp, corpus, config["training"])
|
train_batches = create_train_batches(nlp, corpus, config["training"])
|
||||||
evaluate = create_evaluation_callback(nlp, optimizer, corpus, config["training"])
|
evaluate = create_evaluation_callback(nlp, optimizer, corpus, config["training"])
|
||||||
|
@ -260,7 +254,7 @@ def train_from_config(
|
||||||
config["training"]["eval_frequency"],
|
config["training"]["eval_frequency"],
|
||||||
)
|
)
|
||||||
|
|
||||||
msg.info("Training. Initial learn rate: {}".format(optimizer.learn_rate))
|
msg.info(f"Training. Initial learn rate: {optimizer.learn_rate}")
|
||||||
print_row = setup_printer(config)
|
print_row = setup_printer(config)
|
||||||
|
|
||||||
try:
|
try:
|
||||||
|
@ -414,7 +408,7 @@ def subdivide_batch(batch):
|
||||||
def setup_printer(config):
|
def setup_printer(config):
|
||||||
score_cols = config["training"]["scores"]
|
score_cols = config["training"]["scores"]
|
||||||
score_widths = [max(len(col), 6) for col in score_cols]
|
score_widths = [max(len(col), 6) for col in score_cols]
|
||||||
loss_cols = ["Loss {}".format(pipe) for pipe in config["nlp"]["pipeline"]]
|
loss_cols = [f"Loss {pipe}" for pipe in config["nlp"]["pipeline"]]
|
||||||
loss_widths = [max(len(col), 8) for col in loss_cols]
|
loss_widths = [max(len(col), 8) for col in loss_cols]
|
||||||
table_header = ["#"] + loss_cols + score_cols + ["Score"]
|
table_header = ["#"] + loss_cols + score_cols + ["Score"]
|
||||||
table_header = [col.upper() for col in table_header]
|
table_header = [col.upper() for col in table_header]
|
||||||
|
|
|
@ -29,7 +29,7 @@ try:
|
||||||
except ImportError:
|
except ImportError:
|
||||||
cupy = None
|
cupy = None
|
||||||
|
|
||||||
from thinc.optimizers import Optimizer # noqa: F401
|
from thinc.api import Optimizer # noqa: F401
|
||||||
|
|
||||||
pickle = pickle
|
pickle = pickle
|
||||||
copy_reg = copy_reg
|
copy_reg = copy_reg
|
||||||
|
|
|
@ -51,9 +51,10 @@ def render(
|
||||||
html = RENDER_WRAPPER(html)
|
html = RENDER_WRAPPER(html)
|
||||||
if jupyter or (jupyter is None and is_in_jupyter()):
|
if jupyter or (jupyter is None and is_in_jupyter()):
|
||||||
# return HTML rendered by IPython display()
|
# return HTML rendered by IPython display()
|
||||||
|
# See #4840 for details on span wrapper to disable mathjax
|
||||||
from IPython.core.display import display, HTML
|
from IPython.core.display import display, HTML
|
||||||
|
|
||||||
return display(HTML(html))
|
return display(HTML('<span class="tex2jax_ignore">{}</span>'.format(html)))
|
||||||
return html
|
return html
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
# Setting explicit height and max-width: none on the SVG is required for
|
# Setting explicit height and max-width: none on the SVG is required for
|
||||||
# Jupyter to render it properly in a cell
|
# Jupyter to render it properly in a cell
|
||||||
|
|
||||||
|
|
|
@ -75,10 +75,9 @@ class Warnings(object):
|
||||||
W015 = ("As of v2.1.0, the use of keyword arguments to exclude fields from "
|
W015 = ("As of v2.1.0, the use of keyword arguments to exclude fields from "
|
||||||
"being serialized or deserialized is deprecated. Please use the "
|
"being serialized or deserialized is deprecated. Please use the "
|
||||||
"`exclude` argument instead. For example: exclude=['{arg}'].")
|
"`exclude` argument instead. For example: exclude=['{arg}'].")
|
||||||
W016 = ("The keyword argument `n_threads` on the is now deprecated, as "
|
W016 = ("The keyword argument `n_threads` is now deprecated. As of v2.2.2, "
|
||||||
"the v2.x models cannot release the global interpreter lock. "
|
"the argument `n_process` controls parallel inference via "
|
||||||
"Future versions may introduce a `n_process` argument for "
|
"multiprocessing.")
|
||||||
"parallel inference via multiprocessing.")
|
|
||||||
W017 = ("Alias '{alias}' already exists in the Knowledge Base.")
|
W017 = ("Alias '{alias}' already exists in the Knowledge Base.")
|
||||||
W018 = ("Entity '{entity}' already exists in the Knowledge Base - "
|
W018 = ("Entity '{entity}' already exists in the Knowledge Base - "
|
||||||
"ignoring the duplicate entry.")
|
"ignoring the duplicate entry.")
|
||||||
|
@ -170,7 +169,8 @@ class Errors(object):
|
||||||
"and satisfies the correct annotations specified in the GoldParse. "
|
"and satisfies the correct annotations specified in the GoldParse. "
|
||||||
"For example, are all labels added to the model? If you're "
|
"For example, are all labels added to the model? If you're "
|
||||||
"training a named entity recognizer, also make sure that none of "
|
"training a named entity recognizer, also make sure that none of "
|
||||||
"your annotated entity spans have leading or trailing whitespace. "
|
"your annotated entity spans have leading or trailing whitespace "
|
||||||
|
"or punctuation. "
|
||||||
"You can also use the experimental `debug-data` command to "
|
"You can also use the experimental `debug-data` command to "
|
||||||
"validate your JSON-formatted training data. For details, run:\n"
|
"validate your JSON-formatted training data. For details, run:\n"
|
||||||
"python -m spacy debug-data --help")
|
"python -m spacy debug-data --help")
|
||||||
|
@ -536,8 +536,8 @@ class Errors(object):
|
||||||
E997 = ("Tokenizer special cases are not allowed to modify the text. "
|
E997 = ("Tokenizer special cases are not allowed to modify the text. "
|
||||||
"This would map '{chunk}' to '{orth}' given token attributes "
|
"This would map '{chunk}' to '{orth}' given token attributes "
|
||||||
"'{token_attrs}'.")
|
"'{token_attrs}'.")
|
||||||
E998 = ("Can only create GoldParse's from Example's without a Doc, "
|
E998 = ("Can only create GoldParse objects from Example objects without a "
|
||||||
"if get_gold_parses() is called with a Vocab object.")
|
"Doc if get_gold_parses() is called with a Vocab object.")
|
||||||
E999 = ("Encountered an unexpected format for the dictionary holding "
|
E999 = ("Encountered an unexpected format for the dictionary holding "
|
||||||
"gold annotations: {gold_dict}")
|
"gold annotations: {gold_dict}")
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
def explain(term):
|
def explain(term):
|
||||||
"""Get a description for a given POS tag, dependency label or entity type.
|
"""Get a description for a given POS tag, dependency label or entity type.
|
||||||
|
|
||||||
|
|
|
@ -1,6 +1,6 @@
|
||||||
from cymem.cymem cimport Pool
|
from cymem.cymem cimport Pool
|
||||||
|
|
||||||
from spacy.tokens import Doc
|
from .tokens import Doc
|
||||||
from .typedefs cimport attr_t
|
from .typedefs cimport attr_t
|
||||||
from .syntax.transition_system cimport Transition
|
from .syntax.transition_system cimport Transition
|
||||||
|
|
||||||
|
@ -65,5 +65,3 @@ cdef class Example:
|
||||||
cdef public TokenAnnotation token_annotation
|
cdef public TokenAnnotation token_annotation
|
||||||
cdef public DocAnnotation doc_annotation
|
cdef public DocAnnotation doc_annotation
|
||||||
cdef public object goldparse
|
cdef public object goldparse
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -6,7 +6,7 @@ from libcpp.vector cimport vector
|
||||||
from libc.stdint cimport int32_t, int64_t
|
from libc.stdint cimport int32_t, int64_t
|
||||||
from libc.stdio cimport FILE
|
from libc.stdio cimport FILE
|
||||||
|
|
||||||
from spacy.vocab cimport Vocab
|
from .vocab cimport Vocab
|
||||||
from .typedefs cimport hash_t
|
from .typedefs cimport hash_t
|
||||||
|
|
||||||
from .structs cimport KBEntryC, AliasC
|
from .structs cimport KBEntryC, AliasC
|
||||||
|
@ -169,4 +169,3 @@ cdef class Reader:
|
||||||
cdef int read_alias(self, int64_t* entry_index, float* prob) except -1
|
cdef int read_alias(self, int64_t* entry_index, float* prob) except -1
|
||||||
|
|
||||||
cdef int _read(self, void* value, size_t size) except -1
|
cdef int _read(self, void* value, size_t size) except -1
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
# Source: https://github.com/stopwords-iso/stopwords-af
|
# Source: https://github.com/stopwords-iso/stopwords-af
|
||||||
|
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
# Source: https://github.com/Alir3z4/stop-words
|
# Source: https://github.com/Alir3z4/stop-words
|
||||||
|
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
"""
|
"""
|
||||||
অতএব অথচ অথবা অনুযায়ী অনেক অনেকে অনেকেই অন্তত অবধি অবশ্য অর্থাৎ অন্য অনুযায়ী অর্ধভাগে
|
অতএব অথচ অথবা অনুযায়ী অনেক অনেকে অনেকেই অন্তত অবধি অবশ্য অর্থাৎ অন্য অনুযায়ী অর্ধভাগে
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -14,6 +14,17 @@ _tamil = r"\u0B80-\u0BFF"
|
||||||
|
|
||||||
_telugu = r"\u0C00-\u0C7F"
|
_telugu = r"\u0C00-\u0C7F"
|
||||||
|
|
||||||
|
# from the final table in: https://en.wikipedia.org/wiki/CJK_Unified_Ideographs
|
||||||
|
_cjk = (
|
||||||
|
r"\u4E00-\u62FF\u6300-\u77FF\u7800-\u8CFF\u8D00-\u9FFF\u3400-\u4DBF"
|
||||||
|
r"\U00020000-\U000215FF\U00021600-\U000230FF\U00023100-\U000245FF"
|
||||||
|
r"\U00024600-\U000260FF\U00026100-\U000275FF\U00027600-\U000290FF"
|
||||||
|
r"\U00029100-\U0002A6DF\U0002A700-\U0002B73F\U0002B740-\U0002B81F"
|
||||||
|
r"\U0002B820-\U0002CEAF\U0002CEB0-\U0002EBEF\u2E80-\u2EFF\u2F00-\u2FDF"
|
||||||
|
r"\u2FF0-\u2FFF\u3000-\u303F\u31C0-\u31EF\u3200-\u32FF\u3300-\u33FF"
|
||||||
|
r"\uF900-\uFAFF\uFE30-\uFE4F\U0001F200-\U0001F2FF\U0002F800-\U0002FA1F"
|
||||||
|
)
|
||||||
|
|
||||||
# Latin standard
|
# Latin standard
|
||||||
_latin_u_standard = r"A-Z"
|
_latin_u_standard = r"A-Z"
|
||||||
_latin_l_standard = r"a-z"
|
_latin_l_standard = r"a-z"
|
||||||
|
@ -212,6 +223,7 @@ _uncased = (
|
||||||
+ _tamil
|
+ _tamil
|
||||||
+ _telugu
|
+ _telugu
|
||||||
+ _hangul
|
+ _hangul
|
||||||
|
+ _cjk
|
||||||
)
|
)
|
||||||
|
|
||||||
ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased)
|
ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased)
|
||||||
|
|
|
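The `_cjk` ranges added in the hunk above are written as raw strings so they can be dropped straight into a regex character class. A minimal sketch of that idea follows, using only a subset of the ranges; the names `_cjk_subset`, `cjk_re` and `has_cjk` are hypothetical and not part of spaCy.

```python
import re

# Subset of the ranges from the diff (illustrative, not the full `_cjk` value).
# Raw strings keep the backslashes, so the `re` module itself interprets the
# \uXXXX and \UXXXXXXXX escapes when the pattern is compiled.
_cjk_subset = r"\u4E00-\u9FFF\u3400-\u4DBF\U00020000-\U0002A6DF"

# Wrap the ranges in a character class, as a helper like group_chars
# is assumed to enable downstream.
cjk_re = re.compile("[{}]".format(_cjk_subset))


def has_cjk(text):
    """Return True if `text` contains at least one character in the ranges."""
    return cjk_re.search(text) is not None
```

Appending such ranges to `_uncased` means these characters count as alphabetic wherever `ALPHA` is interpolated into a pattern.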
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
# Source: https://github.com/Alir3z4/stop-words
|
# Source: https://github.com/Alir3z4/stop-words
|
||||||
|
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
"""
|
"""
|
||||||
á a ab aber ach acht achte achten achter achtes ag alle allein allem allen
|
á a ab aber ach acht achte achten achter achtes ag alle allein allem allen
|
||||||
|
@ -26,7 +25,7 @@ früher fünf fünfte fünften fünfter fünftes für
|
||||||
|
|
||||||
gab ganz ganze ganzen ganzer ganzes gar gedurft gegen gegenüber gehabt gehen
|
gab ganz ganze ganzen ganzer ganzes gar gedurft gegen gegenüber gehabt gehen
|
||||||
geht gekannt gekonnt gemacht gemocht gemusst genug gerade gern gesagt geschweige
|
geht gekannt gekonnt gemacht gemocht gemusst genug gerade gern gesagt geschweige
|
||||||
gewesen gewollt geworden gibt ging gleich gott gross groß grosse große grossen
|
gewesen gewollt geworden gibt ging gleich gross groß grosse große grossen
|
||||||
großen grosser großer grosses großes gut gute guter gutes
|
großen grosser großer grosses großes gut gute guter gutes
|
||||||
|
|
||||||
habe haben habt hast hat hatte hätte hatten hätten heisst heißt her heute hier
|
habe haben habt hast hat hatte hätte hatten hätten heisst heißt her heute hier
|
||||||
|
@ -44,9 +43,8 @@ kleines kommen kommt können könnt konnte könnte konnten kurz
|
||||||
lang lange leicht leider lieber los
|
lang lange leicht leider lieber los
|
||||||
|
|
||||||
machen macht machte mag magst man manche manchem manchen mancher manches mehr
|
machen macht machte mag magst man manche manchem manchen mancher manches mehr
|
||||||
mein meine meinem meinen meiner meines mensch menschen mich mir mit mittel
|
mein meine meinem meinen meiner meines mich mir mit mittel mochte möchte mochten
|
||||||
mochte möchte mochten mögen möglich mögt morgen muss muß müssen musst müsst
|
mögen möglich mögt morgen muss muß müssen musst müsst musste mussten
|
||||||
musste mussten
|
|
||||||
|
|
||||||
na nach nachdem nahm natürlich neben nein neue neuen neun neunte neunten neunter
|
na nach nachdem nahm natürlich neben nein neue neuen neun neunte neunten neunter
|
||||||
neuntes nicht nichts nie niemand niemandem niemanden noch nun nur
|
neuntes nicht nichts nie niemand niemandem niemanden noch nun nur
|
||||||
|
|
|
@ -1,5 +1,5 @@
|
||||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||||
from .tag_map_general import TAG_MAP
|
from ..tag_map import TAG_MAP
|
||||||
from .stop_words import STOP_WORDS
|
from .stop_words import STOP_WORDS
|
||||||
from .lex_attrs import LEX_ATTRS
|
from .lex_attrs import LEX_ATTRS
|
||||||
from .lemmatizer import GreekLemmatizer
|
from .lemmatizer import GreekLemmatizer
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
def get_pos_from_wiktionary():
|
def get_pos_from_wiktionary():
|
||||||
import re
|
import re
|
||||||
from gensim.corpora.wikicorpus import extract_pages
|
from gensim.corpora.wikicorpus import extract_pages
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
# These exceptions are used to add NORM values based on a token's ORTH value.
|
# These exceptions are used to add NORM values based on a token's ORTH value.
|
||||||
# Norms are only set if no alternative is provided in the tokenizer exceptions.
|
# Norms are only set if no alternative is provided in the tokenizer exceptions.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
# Stop words
|
# Stop words
|
||||||
# Link to greek stop words: https://www.translatum.gr/forum/index.php?topic=3550.0?topic=3550.0
|
# Link to greek stop words: https://www.translatum.gr/forum/index.php?topic=3550.0?topic=3550.0
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
|
|
|
@ -1,24 +0,0 @@
|
||||||
from ...symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
|
|
||||||
from ...symbols import PUNCT, NUM, AUX, X, ADJ, VERB, PART, SPACE, CCONJ
|
|
||||||
|
|
||||||
|
|
||||||
TAG_MAP = {
|
|
||||||
"ADJ": {POS: ADJ},
|
|
||||||
"ADV": {POS: ADV},
|
|
||||||
"INTJ": {POS: INTJ},
|
|
||||||
"NOUN": {POS: NOUN},
|
|
||||||
"PROPN": {POS: PROPN},
|
|
||||||
"VERB": {POS: VERB},
|
|
||||||
"ADP": {POS: ADP},
|
|
||||||
"CCONJ": {POS: CCONJ},
|
|
||||||
"SCONJ": {POS: SCONJ},
|
|
||||||
"PART": {POS: PART},
|
|
||||||
"PUNCT": {POS: PUNCT},
|
|
||||||
"SYM": {POS: SYM},
|
|
||||||
"NUM": {POS: NUM},
|
|
||||||
"PRON": {POS: PRON},
|
|
||||||
"AUX": {POS: AUX},
|
|
||||||
"SPACE": {POS: SPACE},
|
|
||||||
"DET": {POS: DET},
|
|
||||||
"X": {POS: X},
|
|
||||||
}
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
_exc = {
|
_exc = {
|
||||||
# Slang and abbreviations
|
# Slang and abbreviations
|
||||||
"cos": "because",
|
"cos": "because",
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
# Stop words
|
# Stop words
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
"""
|
"""
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
"""
|
"""
|
||||||
actualmente acuerdo adelante ademas además adrede afirmó agregó ahi ahora ahí
|
actualmente acuerdo adelante ademas además adrede afirmó agregó ahi ahora ahí
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
# Source: https://github.com/stopwords-iso/stopwords-et
|
# Source: https://github.com/stopwords-iso/stopwords-et
|
||||||
|
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
verb_roots = """
|
verb_roots = """
|
||||||
#هست
|
#هست
|
||||||
آخت#آهنج
|
آخت#آهنج
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
# Stop words from HAZM package
|
# Stop words from HAZM package
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
"""
|
"""
|
||||||
|
|
|
@ -1,9 +1,10 @@
|
||||||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
|
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_HYPHENS
|
||||||
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
||||||
from ..punctuation import TOKENIZER_SUFFIXES
|
from ..punctuation import TOKENIZER_SUFFIXES
|
||||||
|
|
||||||
|
|
||||||
_quotes = CONCAT_QUOTES.replace("'", "")
|
_quotes = CONCAT_QUOTES.replace("'", "")
|
||||||
|
DASHES = "|".join(x for x in LIST_HYPHENS if x != "-")
|
||||||
|
|
||||||
_infixes = (
|
_infixes = (
|
||||||
LIST_ELLIPSES
|
LIST_ELLIPSES
|
||||||
|
@ -11,11 +12,9 @@ _infixes = (
|
||||||
+ [
|
+ [
|
||||||
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
|
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
|
||||||
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
||||||
r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
|
|
||||||
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
|
||||||
r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
|
r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
|
||||||
r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
|
r"(?<=[{a}])(?:{d})(?=[{a}])".format(a=ALPHA, d=DASHES),
|
||||||
r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
|
r"(?<=[{a}0-9])[<>=/](?=[{a}])".format(a=ALPHA),
|
||||||
]
|
]
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
|
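The Finnish infix change above replaces the hard-coded `--` pattern with an alternation built from every hyphen variant except the plain `-`, so hyphenated compounds stay in one token. A rough sketch of the effect, assuming a small stand-in list for `LIST_HYPHENS` and a simplified `ALPHA` class (both are assumptions, not spaCy's real values):

```python
import re

# Assumption: stand-in values; spaCy's real LIST_HYPHENS and ALPHA are larger.
LIST_HYPHENS = ["-", "–", "—", "--", "---", "~"]
ALPHA = "a-zA-ZäöåÄÖÅ"

# Same construction as in the diff: every dash variant except the plain hyphen.
DASHES = "|".join(x for x in LIST_HYPHENS if x != "-")

# A dash only counts as an infix when it sits between two letters.
infix_re = re.compile(r"(?<=[{a}])(?:{d})(?=[{a}])".format(a=ALPHA, d=DASHES))


def find_infixes(text):
    """Return the dash sequences that would be split off as infixes."""
    return [m.group(0) for m in infix_re.finditer(text)]
```

With these stand-in values, a plain hyphen between letters is not matched, while an en dash or a double hyphen is.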
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
# Source https://github.com/stopwords-iso/stopwords-fi/blob/master/stopwords-fi.txt
|
# Source https://github.com/stopwords-iso/stopwords-fi/blob/master/stopwords-fi.txt
|
||||||
# Reformatted with some minor corrections
|
# Reformatted with some minor corrections
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
|
|
|
@ -28,6 +28,9 @@ for exc_data in [
|
||||||
{ORTH: "myöh.", LEMMA: "myöhempi"},
|
{ORTH: "myöh.", LEMMA: "myöhempi"},
|
||||||
{ORTH: "n.", LEMMA: "noin"},
|
{ORTH: "n.", LEMMA: "noin"},
|
||||||
{ORTH: "nimim.", LEMMA: "nimimerkki"},
|
{ORTH: "nimim.", LEMMA: "nimimerkki"},
|
||||||
|
{ORTH: "n:o", LEMMA: "numero"},
|
||||||
|
{ORTH: "N:o", LEMMA: "numero"},
|
||||||
|
{ORTH: "nro", LEMMA: "numero"},
|
||||||
{ORTH: "ns.", LEMMA: "niin sanottu"},
|
{ORTH: "ns.", LEMMA: "niin sanottu"},
|
||||||
{ORTH: "nyk.", LEMMA: "nykyinen"},
|
{ORTH: "nyk.", LEMMA: "nykyinen"},
|
||||||
{ORTH: "oik.", LEMMA: "oikealla"},
|
{ORTH: "oik.", LEMMA: "oikealla"},
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
"""
|
"""
|
||||||
a à â abord absolument afin ah ai aie ailleurs ainsi ait allaient allo allons
|
a à â abord absolument afin ah ai aie ailleurs ainsi ait allaient allo allons
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
# fmt: off
|
# fmt: off
|
||||||
consonants = ["b", "c", "d", "f", "g", "h", "j", "k", "l", "m", "n", "p", "q", "r", "s", "t", "v", "w", "x", "z"]
|
consonants = ["b", "c", "d", "f", "g", "h", "j", "k", "l", "m", "n", "p", "q", "r", "s", "t", "v", "w", "x", "z"]
|
||||||
broad_vowels = ["a", "á", "o", "ó", "u", "ú"]
|
broad_vowels = ["a", "á", "o", "ó", "u", "ú"]
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
# Source: https://github.com/taranjeet/hindi-tokenizer/blob/master/stopwords.txt, https://data.mendeley.com/datasets/bsr3frvvjc/1#file-a21d5092-99d7-45d8-b044-3ae9edd391c6
|
# Source: https://github.com/taranjeet/hindi-tokenizer/blob/master/stopwords.txt, https://data.mendeley.com/datasets/bsr3frvvjc/1#file-a21d5092-99d7-45d8-b044-3ae9edd391c6
|
||||||
|
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
"""
|
"""
|
||||||
a abban ahhoz ahogy ahol aki akik akkor akár alatt amely amelyek amelyekben
|
a abban ahhoz ahogy ahol aki akik akkor akár alatt amely amelyek amelyekben
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
# Source: https://github.com/Xangis/extra-stopwords
|
# Source: https://github.com/Xangis/extra-stopwords
|
||||||
|
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
"""
|
"""
|
||||||
a abbastanza abbia abbiamo abbiano abbiate accidenti ad adesso affinche agl
|
a abbastanza abbia abbiamo abbiano abbiate accidenti ad adesso affinche agl
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
"""
|
"""
|
||||||
ಹಲವು
|
ಹಲವು
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
# Source: https://github.com/stopwords-iso/stopwords-lv
|
# Source: https://github.com/stopwords-iso/stopwords-lv
|
||||||
|
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
# Source: https://github.com/stopwords-iso/stopwords-mr/blob/master/stopwords-mr.txt, https://github.com/6/stopwords-json/edit/master/dist/mr.json
|
# Source: https://github.com/stopwords-iso/stopwords-mr/blob/master/stopwords-mr.txt, https://github.com/6/stopwords-json/edit/master/dist/mr.json
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
"""
|
"""
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
# These exceptions are used to add NORM values based on a token's ORTH value.
|
# These exceptions are used to add NORM values based on a token's ORTH value.
|
||||||
# Individual languages can also add their own exceptions and overwrite them -
|
# Individual languages can also add their own exceptions and overwrite them -
|
||||||
# for example, British vs. American spelling in English.
|
# for example, British vs. American spelling in English.
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
"""
|
"""
|
||||||
à às área acerca ademais adeus agora ainda algo algumas alguns ali além ambas ambos antes
|
à às área acerca ademais adeus agora ainda algo algumas alguns ali além ambas ambos antes
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
_exc = {
|
_exc = {
|
||||||
# Slang
|
# Slang
|
||||||
"прив": "привет",
|
"прив": "привет",
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
"""
|
"""
|
||||||
Example sentences to test spaCy and its language models.
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,3 @@
|
||||||
|
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
"""
|
"""
|
||||||
අතර
|
අතර
|
||||||
|
|
|
@ -1,11 +1,16 @@
|
||||||
from .stop_words import STOP_WORDS
|
from .stop_words import STOP_WORDS
|
||||||
|
from .tag_map import TAG_MAP
|
||||||
|
from .lex_attrs import LEX_ATTRS
|
||||||
|
|
||||||
from ...language import Language
|
from ...language import Language
|
||||||
from ...attrs import LANG
|
from ...attrs import LANG
|
||||||
|
|
||||||
|
|
||||||
class SlovakDefaults(Language.Defaults):
|
class SlovakDefaults(Language.Defaults):
|
||||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||||
|
lex_attr_getters.update(LEX_ATTRS)
|
||||||
lex_attr_getters[LANG] = lambda text: "sk"
|
lex_attr_getters[LANG] = lambda text: "sk"
|
||||||
|
tag_map = TAG_MAP
|
||||||
stop_words = STOP_WORDS
|
stop_words = STOP_WORDS
|
||||||
|
|
||||||
|
|
||||||
|
|
23
spacy/lang/sk/examples.py
Normal file
|
@ -0,0 +1,23 @@
|
||||||
|
"""
|
||||||
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
>>> from spacy.lang.sk.examples import sentences
|
||||||
|
>>> docs = nlp.pipe(sentences)
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
sentences = [
|
||||||
|
"Ardevop, s.r.o. je malá startup firma na území SR.",
|
||||||
|
"Samojazdiace autá presúvajú poistnú zodpovednosť na výrobcov automobilov.",
|
||||||
|
"Košice sú na východe.",
|
||||||
|
"Bratislava je hlavné mesto Slovenskej republiky.",
|
||||||
|
"Kde si?",
|
||||||
|
"Kto je prezidentom Francúzska?",
|
||||||
|
"Aké je hlavné mesto Slovenska?",
|
||||||
|
"Kedy sa narodil Andrej Kiska?",
|
||||||
|
"Včera som dostal 100€ na ruku.",
|
||||||
|
"Dnes je nedeľa 26.1.2020.",
|
||||||
|
"Narodil sa 15.4.1998 v Ružomberku.",
|
||||||
|
"Niekto mi povedal, že 500 eur je veľa peňazí.",
|
||||||
|
"Podaj mi ruku!",
|
||||||
|
]
|
59
spacy/lang/sk/lex_attrs.py
Normal file
|
@ -0,0 +1,59 @@
|
||||||
|
from ...attrs import LIKE_NUM
|
||||||
|
|
||||||
|
_num_words = [
|
||||||
|
"nula",
|
||||||
|
"jeden",
|
||||||
|
"dva",
|
||||||
|
"tri",
|
||||||
|
"štyri",
|
||||||
|
"päť",
|
||||||
|
"šesť",
|
||||||
|
"sedem",
|
||||||
|
"osem",
|
||||||
|
"deväť",
|
||||||
|
"desať",
|
||||||
|
"jedenásť",
|
||||||
|
"dvanásť",
|
||||||
|
"trinásť",
|
||||||
|
"štrnásť",
|
||||||
|
"pätnásť",
|
||||||
|
"šestnásť",
|
||||||
|
"sedemnásť",
|
||||||
|
"osemnásť",
|
||||||
|
"devätnásť",
|
||||||
|
"dvadsať",
|
||||||
|
"tridsať",
|
||||||
|
"štyridsať",
|
||||||
|
"päťdesiat",
|
||||||
|
"šesťdesiat",
|
||||||
|
"sedemdesiat",
|
||||||
|
"osemdesiat",
|
||||||
|
"deväťdesiat",
|
||||||
|
"sto",
|
||||||
|
"tisíc",
|
||||||
|
"milión",
|
||||||
|
"miliarda",
|
||||||
|
"bilión",
|
||||||
|
"biliarda",
|
||||||
|
"trilión",
|
||||||
|
"triliarda",
|
||||||
|
"kvadrilión",
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def like_num(text):
|
||||||
|
if text.startswith(("+", "-", "±", "~")):
|
||||||
|
text = text[1:]
|
||||||
|
text = text.replace(",", "").replace(".", "")
|
||||||
|
if text.isdigit():
|
||||||
|
return True
|
||||||
|
if text.count("/") == 1:
|
||||||
|
num, denom = text.split("/")
|
||||||
|
if num.isdigit() and denom.isdigit():
|
||||||
|
return True
|
||||||
|
if text.lower() in _num_words:
|
||||||
|
return True
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
LEX_ATTRS = {LIKE_NUM: like_num}
|
|
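To see how the new Slovak `like_num` behaves, the function can be run on its own, outside spaCy. The sketch below copies the logic from the file above but shortens `_num_words` to a handful of entries for brevity; that shortened list is an assumption made for illustration only.

```python
# Shortened word list for illustration; the real file lists many more numerals.
_num_words = ["nula", "jeden", "dva", "tri", "sto", "tisíc"]


def like_num(text):
    # Drop a single leading sign-like character.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands and decimal separators.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as "3/4".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Spelled-out numerals, matched case-insensitively.
    if text.lower() in _num_words:
        return True
    return False
```

Note that because both `,` and `.` are stripped before the digit check, a date-like token such as `26.1.2020` is also treated as number-like.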
@ -1,5 +1,4 @@
|
||||||
|
# Source: https://github.com/Ardevop-sk/stopwords-sk
|
||||||
# Source: https://github.com/stopwords-iso/stopwords-sk
|
|
||||||
|
|
||||||
STOP_WORDS = set(
|
STOP_WORDS = set(
|
||||||
"""
|
"""
|
||||||
|
@ -7,17 +6,41 @@ a
|
||||||
aby
|
aby
|
||||||
aj
|
aj
|
||||||
ak
|
ak
|
||||||
|
akej
|
||||||
|
akejže
|
||||||
ako
|
ako
|
||||||
|
akom
|
||||||
|
akomže
|
||||||
|
akou
|
||||||
|
akouže
|
||||||
|
akože
|
||||||
|
aká
|
||||||
|
akáže
|
||||||
|
aké
|
||||||
|
akého
|
||||||
|
akéhože
|
||||||
|
akému
|
||||||
|
akémuže
|
||||||
|
akéže
|
||||||
|
akú
|
||||||
|
akúže
|
||||||
aký
|
aký
|
||||||
|
akých
|
||||||
|
akýchže
|
||||||
|
akým
|
||||||
|
akými
|
||||||
|
akýmiže
|
||||||
|
akýmže
|
||||||
|
akýže
|
||||||
ale
|
ale
|
||||||
alebo
|
alebo
|
||||||
and
|
|
||||||
ani
|
ani
|
||||||
asi
|
asi
|
||||||
avšak
|
avšak
|
||||||
až
|
až
|
||||||
ba
|
ba
|
||||||
bez
|
bez
|
||||||
|
bezo
|
||||||
bol
|
bol
|
||||||
bola
|
bola
|
||||||
boli
|
boli
|
||||||
|
@ -28,23 +51,32 @@ budeme
|
||||||
budete
|
budete
|
||||||
budeš
|
budeš
|
||||||
budú
|
budú
|
||||||
buï
|
|
||||||
buď
|
buď
|
||||||
by
|
by
|
||||||
byť
|
byť
|
||||||
cez
|
cez
|
||||||
|
cezo
|
||||||
dnes
|
dnes
|
||||||
do
|
do
|
||||||
ešte
|
ešte
|
||||||
for
|
|
||||||
ho
|
ho
|
||||||
hoci
|
hoci
|
||||||
i
|
i
|
||||||
iba
|
iba
|
||||||
ich
|
ich
|
||||||
im
|
im
|
||||||
|
inej
|
||||||
|
inom
|
||||||
|
iná
|
||||||
iné
|
iné
|
||||||
|
iného
|
||||||
|
inému
|
||||||
|
iní
|
||||||
|
inú
|
||||||
iný
|
iný
|
||||||
|
iných
|
||||||
|
iným
|
||||||
|
inými
|
||||||
ja
|
ja
|
||||||
je
|
je
|
||||||
jeho
|
jeho
|
||||||
|
@ -53,80 +85,185 @@ jemu
|
||||||
ju
|
ju
|
||||||
k
|
k
|
||||||
kam
|
kam
|
||||||
|
kamže
|
||||||
|
každou
|
||||||
každá
|
každá
|
||||||
každé
|
každé
|
||||||
|
každého
|
||||||
|
každému
|
||||||
každí
|
každí
|
||||||
|
každú
|
||||||
každý
|
každý
|
||||||
|
každých
|
||||||
|
každým
|
||||||
|
každými
|
||||||
kde
|
kde
|
||||||
kedže
|
kej
|
||||||
keï
|
kejže
|
||||||
keď
|
keď
|
||||||
|
keďže
|
||||||
|
kie
|
||||||
|
kieho
|
||||||
|
kiehože
|
||||||
|
kiemu
|
||||||
|
kiemuže
|
||||||
|
kieže
|
||||||
|
koho
|
||||||
|
kom
|
||||||
|
komu
|
||||||
|
kou
|
||||||
|
kouže
|
||||||
kto
|
kto
|
||||||
|
ktorej
|
||||||
ktorou
|
ktorou
|
||||||
ktorá
|
ktorá
|
||||||
ktoré
|
ktoré
|
||||||
ktorí
|
ktorí
|
||||||
|
ktorú
|
||||||
ktorý
|
ktorý
|
||||||
|
ktorých
|
||||||
|
ktorým
|
||||||
|
ktorými
|
||||||
ku
|
ku
|
||||||
|
ká
|
||||||
|
káže
|
||||||
|
ké
|
||||||
|
kéže
|
||||||
|
kú
|
||||||
|
kúže
|
||||||
|
ký
|
||||||
|
kýho
|
||||||
|
kýhože
|
||||||
|
kým
|
||||||
|
kýmu
|
||||||
|
kýmuže
|
||||||
|
kýže
|
||||||
lebo
|
lebo
|
||||||
|
leda
|
||||||
|
ledaže
|
||||||
len
|
len
|
||||||
ma
|
ma
|
||||||
|
majú
|
||||||
|
mal
|
||||||
|
mala
|
||||||
|
mali
|
||||||
mať
|
mať
|
||||||
medzi
|
medzi
|
||||||
menej
|
|
||||||
mi
|
mi
|
||||||
mna
|
|
||||||
mne
|
mne
|
||||||
mnou
|
mnou
|
||||||
moja
|
moja
|
||||||
moje
|
moje
|
||||||
|
mojej
|
||||||
|
mojich
|
||||||
|
mojim
|
||||||
|
mojimi
|
||||||
|
mojou
|
||||||
|
moju
|
||||||
|
možno
|
||||||
mu
|
mu
|
||||||
|
musia
|
||||||
musieť
|
musieť
|
||||||
|
musí
|
||||||
|
musím
|
||||||
|
musíme
|
||||||
|
musíte
|
||||||
|
musíš
|
||||||
my
|
my
|
||||||
má
|
má
|
||||||
|
mám
|
||||||
|
máme
|
||||||
máte
|
máte
|
||||||
mòa
|
máš
|
||||||
môcť
|
môcť
|
||||||
môj
|
môj
|
||||||
|
môjho
|
||||||
môže
|
môže
|
||||||
|
môžem
|
||||||
|
môžeme
|
||||||
|
môžete
|
||||||
|
môžeš
|
||||||
|
môžu
|
||||||
|
mňa
|
||||||
na
|
na
|
||||||
nad
|
nad
|
||||||
|
nado
|
||||||
|
najmä
|
||||||
nami
|
nami
|
||||||
|
naša
|
||||||
|
naše
|
||||||
|
našej
|
||||||
naši
|
naši
|
||||||
|
našich
|
||||||
|
našim
|
||||||
|
našimi
|
||||||
|
našou
|
||||||
|
ne
|
||||||
nech
|
nech
|
||||||
neho
|
neho
|
||||||
nej
|
nej
|
||||||
|
nejakej
|
||||||
|
nejakom
|
||||||
|
nejakou
|
||||||
|
nejaká
|
||||||
|
nejaké
|
||||||
|
nejakého
|
||||||
|
nejakému
|
||||||
|
nejakú
|
||||||
|
nejaký
|
||||||
|
nejakých
|
||||||
|
nejakým
|
||||||
|
nejakými
|
||||||
nemu
|
nemu
|
||||||
než
|
než
|
||||||
nich
|
nich
|
||||||
nie
|
nie
|
||||||
|
niektorej
|
||||||
|
niektorom
|
||||||
|
niektorou
|
||||||
|
niektorá
|
||||||
|
niektoré
|
||||||
|
niektorého
|
||||||
|
niektorému
|
||||||
|
niektorú
|
||||||
niektorý
|
niektorý
|
||||||
|
niektorých
|
||||||
|
niektorým
|
||||||
|
niektorými
|
||||||
nielen
|
nielen
|
||||||
|
niečo
|
||||||
nim
|
nim
|
||||||
|
nimi
|
||||||
nič
|
nič
|
||||||
|
ničoho
|
||||||
|
ničom
|
||||||
|
ničomu
|
||||||
|
ničím
|
||||||
no
|
no
|
||||||
nová
|
|
||||||
nové
|
|
||||||
noví
|
|
||||||
nový
|
|
||||||
nám
|
nám
|
||||||
nás
|
nás
|
||||||
náš
|
náš
|
||||||
|
nášho
|
||||||
ním
|
ním
|
||||||
o
|
o
|
||||||
od
|
od
|
||||||
odo
|
odo
|
||||||
of
|
|
||||||
on
|
on
|
||||||
ona
|
ona
|
||||||
oni
|
oni
|
||||||
ono
|
ono
|
||||||
ony
|
ony
|
||||||
|
oň
|
||||||
|
oňho
|
||||||
po
|
po
|
||||||
pod
|
pod
|
||||||
|
podo
|
||||||
podľa
|
podľa
|
||||||
pokiaľ
|
pokiaľ
|
||||||
|
popod
|
||||||
|
popri
|
||||||
potom
|
potom
|
||||||
|
poza
|
||||||
pre
|
pre
|
||||||
pred
|
pred
|
||||||
predo
|
predo
|
||||||
|
@ -134,42 +271,56 @@ preto
|
||||||
pretože
|
pretože
|
||||||
prečo
|
prečo
|
||||||
pri
|
pri
|
||||||
prvá
|
|
||||||
prvé
|
|
||||||
prví
|
|
||||||
prvý
|
|
||||||
práve
|
práve
|
||||||
pýta
|
|
||||||
s
|
s
|
||||||
sa
|
sa
|
||||||
seba
|
seba
|
||||||
|
sebe
|
||||||
|
sebou
|
||||||
sem
|
sem
|
||||||
si
|
si
|
||||||
sme
|
sme
|
||||||
so
|
so
|
||||||
som
|
som
|
||||||
späť
|
|
||||||
ste
|
ste
|
||||||
svoj
|
svoj
|
||||||
|
svoja
|
||||||
svoje
|
svoje
|
||||||
|
svojho
|
||||||
svojich
|
svojich
|
||||||
|
svojim
|
||||||
|
svojimi
|
||||||
|
svojou
|
||||||
|
svoju
|
||||||
svojím
|
svojím
|
||||||
svojími
|
|
||||||
sú
|
sú
|
||||||
ta
|
ta
|
||||||
tak
|
tak
|
||||||
|
takej
|
||||||
|
takejto
|
||||||
|
taká
|
||||||
|
takáto
|
||||||
|
také
|
||||||
|
takého
|
||||||
|
takéhoto
|
||||||
|
takému
|
||||||
|
takémuto
|
||||||
|
takéto
|
||||||
|
takí
|
||||||
|
takú
|
||||||
|
takúto
|
||||||
taký
|
taký
|
||||||
|
takýto
|
||||||
takže
|
takže
|
||||||
tam
|
tam
|
||||||
te
|
|
||||||
teba
|
teba
|
||||||
tebe
|
tebe
|
||||||
tebou
|
tebou
|
||||||
teda
|
teda
|
||||||
tej
|
tej
|
||||||
|
tejto
|
||||||
ten
|
ten
|
||||||
tento
|
tento
|
||||||
the
|
|
||||||
ti
|
ti
|
||||||
tie
|
tie
|
||||||
tieto
|
tieto
|
||||||
|
@ -177,52 +328,97 @@ tiež
|
||||||
to
|
to
|
||||||
toho
|
toho
|
||||||
tohoto
|
tohoto
|
||||||
|
tohto
|
||||||
tom
|
tom
|
||||||
tomto
|
tomto
|
||||||
tomu
|
tomu
|
||||||
tomuto
|
tomuto
|
||||||
toto
|
toto
|
||||||
tou
|
tou
|
||||||
|
touto
|
||||||
tu
|
tu
|
||||||
tvoj
|
tvoj
|
||||||
tvojími
|
tvoja
|
||||||
|
tvoje
|
||||||
|
tvojej
|
||||||
|
tvojho
|
||||||
|
tvoji
|
||||||
|
tvojich
|
||||||
|
tvojim
|
||||||
|
tvojimi
|
||||||
|
tvojím
|
||||||
ty
|
ty
|
||||||
tá
|
tá
|
||||||
táto
|
táto
|
||||||
|
tí
|
||||||
|
títo
|
||||||
tú
|
tú
|
||||||
túto
|
túto
|
||||||
|
tých
|
||||||
tým
|
tým
|
||||||
|
tými
|
||||||
týmto
|
týmto
|
||||||
tě
|
u
|
||||||
už
|
už
|
||||||
v
|
v
|
||||||
vami
|
vami
|
||||||
|
vaša
|
||||||
vaše
|
vaše
|
||||||
veï
|
vašej
|
||||||
|
vaši
|
||||||
|
vašich
|
||||||
|
vašim
|
||||||
|
vaším
|
||||||
|
veď
|
||||||
viac
|
viac
|
||||||
vo
|
vo
|
||||||
vy
|
vy
|
||||||
vám
|
vám
|
||||||
vás
|
vás
|
||||||
váš
|
váš
|
||||||
|
vášho
|
||||||
však
|
však
|
||||||
|
všetci
|
||||||
|
všetka
|
||||||
|
všetko
|
||||||
|
všetky
|
||||||
všetok
|
všetok
|
||||||
z
|
z
|
||||||
za
|
za
|
||||||
|
začo
|
||||||
|
začože
|
||||||
zo
|
zo
|
||||||
a
|
|
||||||
áno
|
áno
|
||||||
èi
|
čej
|
||||||
èo
|
|
||||||
èí
|
|
||||||
òom
|
|
||||||
òou
|
|
||||||
òu
|
|
||||||
či
|
či
|
||||||
|
čia
|
||||||
|
čie
|
||||||
|
čieho
|
||||||
|
čiemu
|
||||||
|
čiu
|
||||||
čo
|
čo
|
||||||
|
čoho
|
||||||
|
čom
|
||||||
|
čomu
|
||||||
|
čou
|
||||||
|
čože
|
||||||
|
čí
|
||||||
|
čím
|
||||||
|
čími
|
||||||
ďalšia
|
ďalšia
|
||||||
ďalšie
|
ďalšie
|
||||||
|
ďalšieho
|
||||||
|
ďalšiemu
|
||||||
|
ďalšiu
|
||||||
|
ďalšom
|
||||||
|
ďalšou
|
||||||
ďalší
|
ďalší
|
||||||
|
ďalších
|
||||||
|
ďalším
|
||||||
|
ďalšími
|
||||||
|
ňom
|
||||||
|
ňou
|
||||||
|
ňu
|
||||||
že
|
že
|
||||||
""".split()
|
""".split()
|
||||||
)
|
)
|
||||||
|
|
Some files were not shown because too many files have changed in this diff