mirror of https://github.com/explosion/spaCy.git (synced 2025-01-12 10:16:27 +03:00)

Merge branch 'develop' into refactor/remove-symlinks

This commit is contained in: commit a3335d36b8
106  .github/contributors/AlJohri.md  vendored  Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Al Johri |
| Company name (if applicable) | N/A |
| Title or role (if applicable) | N/A |
| Date | December 27th, 2019 |
| GitHub username | AlJohri |
| Website (optional) | http://aljohri.com/ |
106  .github/contributors/Jan-711.md  vendored  Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jan Jessewitsch |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 16.02.2020 |
| GitHub username | Jan-711 |
| Website (optional) | |
106  .github/contributors/ceteri.md  vendored  Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | ---------------------- |
| Name | Paco Nathan |
| Company name (if applicable) | Derwen, Inc. |
| Title or role (if applicable) | Managing Partner |
| Date | 2020-01-25 |
| GitHub username | ceteri |
| Website (optional) | https://derwen.ai/paco |
106  .github/contributors/drndos.md  vendored  Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Filip Bednárik |
| Company name (if applicable) | Ardevop, s. r. o. |
| Title or role (if applicable) | IT Consultant |
| Date | 2020-01-26 |
| GitHub username | drndos |
| Website (optional) | https://ardevop.sk |
106  .github/contributors/iechevarria.md  vendored  Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | --------------------- |
| Name | Ivan Echevarria |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-12-24 |
| GitHub username | iechevarria |
| Website (optional) | https://echevarria.io |
106  .github/contributors/iurshina.md  vendored  Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Anastasiia Iurshina |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 28.12.2019 |
| GitHub username | iurshina |
| Website (optional) | |
106  .github/contributors/onlyanegg.md  vendored  Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

- Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

- to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

- each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
| ----------------------------- | ---------------- |
| Name | Tyler Couto |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | January 29, 2020 |
| GitHub username | onlyanegg |
| Website (optional) | |
@@ -1,5 +1,5 @@
recursive-include include *.h
recursive-include spacy *.pyx *.pxd *.txt
recursive-include spacy *.txt *.pyx *.pxd
include LICENSE
include README.md
include bin/spacy
@@ -7,16 +7,17 @@ Run `wikipedia_pretrain_kb.py`
* WikiData: get `latest-all.json.bz2` from https://dumps.wikimedia.org/wikidatawiki/entities/
* Wikipedia: get `enwiki-latest-pages-articles-multistream.xml.bz2` from https://dumps.wikimedia.org/enwiki/latest/ (or for any other language)
* You can set the filtering parameters for KB construction:
  * `max_per_alias`: (max) number of candidate entities in the KB per alias/synonym
  * `min_freq`: threshold of number of times an entity should occur in the corpus to be included in the KB
  * `min_pair`: threshold of number of times an entity+alias combination should occur in the corpus to be included in the KB
  * `max_per_alias` (`-a`): (max) number of candidate entities in the KB per alias/synonym
  * `min_freq` (`-f`): threshold of number of times an entity should occur in the corpus to be included in the KB
  * `min_pair` (`-c`): threshold of number of times an entity+alias combination should occur in the corpus to be included in the KB
* Further parameters to set:
  * `descriptions_from_wikipedia`: whether to parse descriptions from Wikipedia (`True`) or Wikidata (`False`)
  * `entity_vector_length`: length of the pre-trained entity description vectors
  * `lang`: language for which to fetch Wikidata information (as the dump contains all languages)
  * `descriptions_from_wikipedia` (`-wp`): whether to parse descriptions from Wikipedia (`True`) or Wikidata (`False`)
  * `entity_vector_length` (`-v`): length of the pre-trained entity description vectors
  * `lang` (`-la`): language for which to fetch Wikidata information (as the dump contains all languages)

Quick testing and rerunning:
* When trying out the pipeline for a quick test, set `limit_prior`, `limit_train` and/or `limit_wd` to read only parts of the dumps instead of everything.
* When trying out the pipeline for a quick test, set `limit_prior` (`-lp`), `limit_train` (`-lt`) and/or `limit_wd` (`-lw`) to read only parts of the dumps instead of everything.
  * e.g. set `-lt 20000 -lp 2000 -lw 3000 -f 1`
* If you only want to (re)run certain parts of the pipeline, just remove the corresponding files and they will be recalculated or reparsed.
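As a rough illustration (not part of this diff), the flags documented above could be assembled for a quick test run like this; the script's positional arguments (dump locations, output directory, base model) are omitted and every value is a placeholder rather than a recommendation:

```python
# Hypothetical quick-test invocation built only from the flags documented above.
# Replace "<positional args>" with your own dump paths and output directory.
quick_test_flags = [
    "-a", "10",      # max_per_alias: keep at most 10 candidate entities per alias
    "-f", "20",      # min_freq: an entity must occur at least 20 times in the corpus
    "-c", "5",       # min_pair: an entity+alias pair must occur at least 5 times
    "-wp",           # descriptions_from_wikipedia: parse descriptions from Wikipedia
    "-v", "64",      # entity_vector_length: 64-dimensional description vectors
    "-la", "en",     # lang: only fetch Wikidata information for English
    "-lp", "2000",   # limit_prior: read only part of WP for prior probabilities
    "-lt", "20000",  # limit_train: read only part of WP for the training set
    "-lw", "3000",   # limit_wd: read only part of the WikiData dump
]
print("python wikipedia_pretrain_kb.py <positional args> " + " ".join(quick_test_flags))
```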
@@ -24,11 +25,13 @@ Quick testing and rerunning:

Run `wikidata_train_entity_linker.py`
* This takes the **KB directory** produced by Step 1, and trains an **Entity Linking model**
* Specify the output directory (`-o`) in which the final, trained model will be saved
* You can set the learning parameters for the EL training:
  * `epochs`: number of training iterations
  * `dropout`: dropout rate
  * `lr`: learning rate
  * `l2`: L2 regularization
* Specify the number of training and dev testing entities with `train_inst` and `dev_inst` respectively
  * `epochs` (`-e`): number of training iterations
  * `dropout` (`-p`): dropout rate
  * `lr` (`-n`): learning rate
  * `l2` (`-r`): L2 regularization
* Specify the number of training and dev testing articles with `train_articles` (`-t`) and `dev_articles` (`-d`) respectively
  * If not specified, the full dataset will be processed - this may take a LONG time !
* Further parameters to set:
  * `labels_discard`: NER label types to discard during training
  * `labels_discard` (`-l`): NER label types to discard during training
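Similarly, a minimal sketch (not part of this diff) of the training flags documented above, set for a small experimental run; the KB directory and other positional arguments are omitted and the values are illustrative only:

```python
# Hypothetical training invocation using only the flags documented above.
training_flags = [
    "-o", "trained_nel/",      # output directory for the trained pipeline (placeholder)
    "-e", "10",                # epochs: number of training iterations
    "-p", "0.5",               # dropout rate
    "-n", "0.005",             # lr: learning rate
    "-r", "1e-6",              # l2: L2 regularization
    "-t", "2000",              # train_articles: cap each epoch at 2000 training articles
    "-d", "500",               # dev_articles: evaluate on 500 dev articles
    "-l", "ORDINAL,CARDINAL",  # labels_discard: NER label types ignored during training
]
print("python wikidata_train_entity_linker.py <KB directory> " + " ".join(training_flags))
```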
@@ -1,6 +1,8 @@
# coding: utf-8
from __future__ import unicode_literals

import logging
import random

from tqdm import tqdm
from collections import defaultdict

@@ -92,102 +94,81 @@ class BaselineResults(object):
self.random.update_metrics(ent_label, true_entity, random_candidate)

def measure_performance(dev_data, kb, el_pipe, baseline=True, context=True):
def measure_performance(dev_data, kb, el_pipe, baseline=True, context=True, dev_limit=None):
counts = dict()
baseline_results = BaselineResults()
context_results = EvaluationResults()
combo_results = EvaluationResults()

for doc, gold in tqdm(dev_data, total=dev_limit, leave=False, desc='Processing dev data'):
if len(doc) > 0:
correct_ents = dict()
for entity, kb_dict in gold.links.items():
start, end = entity
for gold_kb, value in kb_dict.items():
if value:
# only evaluating on positive examples
offset = _offset(start, end)
correct_ents[offset] = gold_kb

if baseline:
baseline_accuracies, counts = measure_baselines(dev_data, kb)
logger.info("Counts: {}".format({k: v for k, v in sorted(counts.items())}))
logger.info(baseline_accuracies.report_performance("random"))
logger.info(baseline_accuracies.report_performance("prior"))
logger.info(baseline_accuracies.report_performance("oracle"))
_add_baseline(baseline_results, counts, doc, correct_ents, kb)

if context:
# using only context
el_pipe.cfg["incl_context"] = True
el_pipe.cfg["incl_prior"] = False
results = get_eval_results(dev_data, el_pipe)
logger.info(results.report_metrics("context only"))
_add_eval_result(context_results, doc, correct_ents, el_pipe)

# measuring combined accuracy (prior + context)
el_pipe.cfg["incl_context"] = True
el_pipe.cfg["incl_prior"] = True
results = get_eval_results(dev_data, el_pipe)
logger.info(results.report_metrics("context and prior"))
_add_eval_result(combo_results, doc, correct_ents, el_pipe)

if baseline:
logger.info("Counts: {}".format({k: v for k, v in sorted(counts.items())}))
logger.info(baseline_results.report_performance("random"))
logger.info(baseline_results.report_performance("prior"))
logger.info(baseline_results.report_performance("oracle"))

if context:
logger.info(context_results.report_metrics("context only"))
logger.info(combo_results.report_metrics("context and prior"))


def get_eval_results(data, el_pipe=None):
def _add_eval_result(results, doc, correct_ents, el_pipe):
"""
Evaluate the ent.kb_id_ annotations against the gold standard.
Only evaluate entities that overlap between gold and NER, to isolate the performance of the NEL.
If the docs in the data require further processing with an entity linker, set el_pipe.
"""
docs = []
golds = []
for d, g in tqdm(data, leave=False):
if len(d) > 0:
golds.append(g)
if el_pipe is not None:
docs.append(el_pipe(d))
else:
docs.append(d)

results = EvaluationResults()
for doc, gold in zip(docs, golds):
try:
correct_entries_per_article = dict()
for entity, kb_dict in gold.links.items():
start, end = entity
for gold_kb, value in kb_dict.items():
if value:
# only evaluating on positive examples
offset = _offset(start, end)
correct_entries_per_article[offset] = gold_kb

doc = el_pipe(doc)
for ent in doc.ents:
ent_label = ent.label_
pred_entity = ent.kb_id_
start = ent.start_char
end = ent.end_char
offset = _offset(start, end)
gold_entity = correct_entries_per_article.get(offset, None)
gold_entity = correct_ents.get(offset, None)
# the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
if gold_entity is not None:
pred_entity = ent.kb_id_
results.update_metrics(ent_label, gold_entity, pred_entity)

except Exception as e:
logging.error("Error assessing accuracy " + str(e))

return results


def measure_baselines(data, kb):
def _add_baseline(baseline_results, counts, doc, correct_ents, kb):
"""
Measure 3 performance baselines: random selection, prior probabilities, and 'oracle' prediction for upper bound.
Only evaluate entities that overlap between gold and NER, to isolate the performance of the NEL.
Also return a dictionary of counts by entity label.
"""
counts_d = dict()

baseline_results = BaselineResults()

docs = [d for d, g in data if len(d) > 0]
golds = [g for d, g in data if len(d) > 0]

for doc, gold in zip(docs, golds):
correct_entries_per_article = dict()
for entity, kb_dict in gold.links.items():
start, end = entity
for gold_kb, value in kb_dict.items():
# only evaluating on positive examples
if value:
offset = _offset(start, end)
correct_entries_per_article[offset] = gold_kb

for ent in doc.ents:
ent_label = ent.label_
start = ent.start_char
end = ent.end_char
offset = _offset(start, end)
gold_entity = correct_entries_per_article.get(offset, None)
gold_entity = correct_ents.get(offset, None)

# the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
if gold_entity is not None:

@@ -207,8 +188,8 @@ def measure_baselines(data, kb):
prior_candidate = candidates[best_index].entity_
random_candidate = random.choice(candidates).entity_

current_count = counts_d.get(ent_label, 0)
counts_d[ent_label] = current_count+1
current_count = counts.get(ent_label, 0)
counts[ent_label] = current_count+1

baseline_results.update_baselines(
gold_entity,

@@ -218,8 +199,6 @@ def measure_baselines(data, kb):
oracle_candidate,
)

return baseline_results, counts_d


def _offset(start, end):
return "{}_{}".format(start, end)
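The evaluation helpers above match predicted entities to gold links purely by character offsets, via the `_offset()` key. A self-contained sketch (not part of this diff) of that matching step, using made-up spans and KB ids:

```python
# Standalone illustration of the offset-keyed matching used above; the gold_links
# and predicted values are invented examples, not data from the repository.
def _offset(start, end):
    return "{}_{}".format(start, end)

gold_links = {(0, 14): {"Q312": 1.0}}   # gold: characters 0-14 link to KB id Q312
predicted = [("ORG", 0, 14, "Q312")]    # (ent.label_, start_char, end_char, kb_id_)

correct_ents = {}
for (start, end), kb_dict in gold_links.items():
    for gold_kb, value in kb_dict.items():
        if value:                       # only evaluate positive gold annotations
            correct_ents[_offset(start, end)] = gold_kb

for label, start, end, pred_kb in predicted:
    gold_kb = correct_ents.get(_offset(start, end))
    if gold_kb is not None:             # gold is incomplete: skip unmatched spans
        print(label, "correct" if gold_kb == pred_kb else "wrong")
```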
@@ -40,7 +40,7 @@ logger = logging.getLogger(__name__)
loc_prior_prob=("Location to file with prior probabilities", "option", "p", Path),
loc_entity_defs=("Location to file with entity definitions", "option", "d", Path),
loc_entity_desc=("Location to file with entity descriptions", "option", "s", Path),
descr_from_wp=("Flag for using wp descriptions not wd", "flag", "wp"),
descr_from_wp=("Flag for using descriptions from WP instead of WD (default False)", "flag", "wp"),
limit_prior=("Threshold to limit lines read from WP for prior probabilities", "option", "lp", int),
limit_train=("Threshold to limit lines read from WP for training set", "option", "lt", int),
limit_wd=("Threshold to limit lines read from WD", "option", "lw", int),
@@ -1,5 +1,5 @@
# coding: utf-8
"""Script to take a previously created Knowledge Base and train an entity linking
"""Script that takes a previously created Knowledge Base and trains an entity linking
pipeline. The provided KB directory should hold the kb, the original nlp object and
its vocab used to create the KB, and a few auxiliary files such as the entity definitions,
as created by the script `wikidata_create_kb`.

@@ -14,9 +14,16 @@ import logging
import spacy
from pathlib import Path
import plac
from tqdm import tqdm

from bin.wiki_entity_linking import wikipedia_processor
from bin.wiki_entity_linking import TRAINING_DATA_FILE, KB_MODEL_DIR, KB_FILE, LOG_FORMAT, OUTPUT_MODEL_DIR
from bin.wiki_entity_linking import (
TRAINING_DATA_FILE,
KB_MODEL_DIR,
KB_FILE,
LOG_FORMAT,
OUTPUT_MODEL_DIR,
)
from bin.wiki_entity_linking.entity_linker_evaluation import measure_performance
from bin.wiki_entity_linking.kb_creator import read_kb


@@ -33,8 +40,8 @@ logger = logging.getLogger(__name__)
dropout=("Dropout to prevent overfitting (default 0.5)", "option", "p", float),
lr=("Learning rate (default 0.005)", "option", "n", float),
l2=("L2 regularization", "option", "r", float),
train_inst=("# training instances (default 90% of all)", "option", "t", int),
dev_inst=("# test instances (default 10% of all)", "option", "d", int),
train_articles=("# training articles (default 90% of all)", "option", "t", int),
dev_articles=("# dev test articles (default 10% of all)", "option", "d", int),
labels_discard=("NER labels to discard (default None)", "option", "l", str),
)
def main(

@@ -45,10 +52,15 @@ def main(
dropout=0.5,
lr=0.005,
l2=1e-6,
train_inst=None,
dev_inst=None,
labels_discard=None
train_articles=None,
dev_articles=None,
labels_discard=None,
):
if not output_dir:
logger.warning(
"No output dir specified so no results will be written, are you sure about this ?"
)

logger.info("Creating Entity Linker with Wikipedia and WikiData")

output_dir = Path(output_dir) if output_dir else dir_kb

@@ -64,47 +76,57 @@ def main(
# STEP 1 : load the NLP object
logger.info("STEP 1a: Loading model from {}".format(nlp_dir))
nlp = spacy.load(nlp_dir)
logger.info("STEP 1b: Loading KB from {}".format(kb_path))
kb = read_kb(nlp, kb_path)
logger.info(
"Original NLP pipeline has following pipeline components: {}".format(
nlp.pipe_names
)
)

# check that there is a NER component in the pipeline
if "ner" not in nlp.pipe_names:
raise ValueError("The `nlp` object should have a pretrained `ner` component.")

# STEP 2: read the training dataset previously created from WP
logger.info("STEP 2: Reading training dataset from {}".format(training_path))
logger.info("STEP 1b: Loading KB from {}".format(kb_path))
kb = read_kb(nlp, kb_path)

# STEP 2: read the training dataset previously created from WP
logger.info("STEP 2: Reading training & dev dataset from {}".format(training_path))
train_indices, dev_indices = wikipedia_processor.read_training_indices(
training_path
)
logger.info(
"Training set has {} articles, limit set to roughly {} articles per epoch".format(
len(train_indices), train_articles if train_articles else "all"
)
)
logger.info(
"Dev set has {} articles, limit set to roughly {} articles for evaluation".format(
len(dev_indices), dev_articles if dev_articles else "all"
)
)
if dev_articles:
dev_indices = dev_indices[0:dev_articles]

# STEP 3: create and train an entity linking pipe
logger.info(
"STEP 3: Creating and training an Entity Linking pipe for {} epochs".format(
epochs
)
)
if labels_discard:
labels_discard = [x.strip() for x in labels_discard.split(",")]
logger.info("Discarding {} NER types: {}".format(len(labels_discard), labels_discard))
logger.info(
"Discarding {} NER types: {}".format(len(labels_discard), labels_discard)
)
else:
labels_discard = []

train_data = wikipedia_processor.read_training(
nlp=nlp,
entity_file_path=training_path,
dev=False,
limit=train_inst,
kb=kb,
labels_discard=labels_discard
)

# for testing, get all pos instances (independently of KB)
dev_data = wikipedia_processor.read_training(
nlp=nlp,
entity_file_path=training_path,
dev=True,
limit=dev_inst,
kb=None,
labels_discard=labels_discard
)

# STEP 3: create and train an entity linking pipe
logger.info("STEP 3: Creating and training an Entity Linking pipe")

el_pipe = nlp.create_pipe(
name="entity_linker", config={"pretrained_vectors": nlp.vocab.vectors,
"labels_discard": labels_discard}
name="entity_linker",
config={
"pretrained_vectors": nlp.vocab.vectors,
"labels_discard": labels_discard,
},
)
el_pipe.set_kb(kb)
nlp.add_pipe(el_pipe, last=True)

@@ -115,78 +137,96 @@ def main(
optimizer.learn_rate = lr
optimizer.L2 = l2

logger.info("Training on {} articles".format(len(train_data)))
logger.info("Dev testing on {} articles".format(len(dev_data)))

# baseline performance on dev data
logger.info("Dev Baseline Accuracies:")
measure_performance(dev_data, kb, el_pipe, baseline=True, context=False)
dev_data = wikipedia_processor.read_el_docs_golds(
nlp=nlp,
entity_file_path=training_path,
dev=True,
line_ids=dev_indices,
kb=kb,
labels_discard=labels_discard,
)

measure_performance(
dev_data, kb, el_pipe, baseline=True, context=False, dev_limit=len(dev_indices)
)

for itn in range(epochs):
random.shuffle(train_data)
random.shuffle(train_indices)
losses = {}
batches = minibatch(train_data, size=compounding(4.0, 128.0, 1.001))
batches = minibatch(train_indices, size=compounding(8.0, 128.0, 1.001))
batchnr = 0
articles_processed = 0

with nlp.disable_pipes(*other_pipes):
# we either process the whole training file, or just a part each epoch
bar_total = len(train_indices)
if train_articles:
bar_total = train_articles

with tqdm(total=bar_total, leave=False, desc=f"Epoch {itn}") as pbar:
for batch in batches:
if not train_articles or articles_processed < train_articles:
with nlp.disable_pipes("entity_linker"):
train_batch = wikipedia_processor.read_el_docs_golds(
nlp=nlp,
entity_file_path=training_path,
dev=False,
line_ids=batch,
kb=kb,
labels_discard=labels_discard,
)
docs, golds = zip(*train_batch)
try:
with nlp.disable_pipes(*other_pipes):
nlp.update(
examples=batch,
docs=docs,
golds=golds,
sgd=optimizer,
drop=dropout,
losses=losses,
)
batchnr += 1
articles_processed += len(docs)
pbar.update(len(docs))
except Exception as e:
logger.error("Error updating batch:" + str(e))
if batchnr > 0:
logging.info("Epoch {}, train loss {}".format(itn, round(losses["entity_linker"] / batchnr, 2)))
measure_performance(dev_data, kb, el_pipe, baseline=False, context=True)

# STEP 4: measure the performance of our trained pipe on an independent dev set
|
||||
logger.info("STEP 4: Final performance measurement of Entity Linking pipe")
|
||||
measure_performance(dev_data, kb, el_pipe)
|
||||
|
||||
# STEP 5: apply the EL pipe on a toy example
|
||||
logger.info("STEP 5: Applying Entity Linking to toy example")
|
||||
run_el_toy_example(nlp=nlp)
|
||||
logging.info(
|
||||
"Epoch {} trained on {} articles, train loss {}".format(
|
||||
itn, articles_processed, round(losses["entity_linker"] / batchnr, 2)
|
||||
)
|
||||
)
|
||||
# re-read the dev_data (data is returned as a generator)
|
||||
dev_data = wikipedia_processor.read_el_docs_golds(
|
||||
nlp=nlp,
|
||||
entity_file_path=training_path,
|
||||
dev=True,
|
||||
line_ids=dev_indices,
|
||||
kb=kb,
|
||||
labels_discard=labels_discard,
|
||||
)
|
||||
measure_performance(
|
||||
dev_data,
|
||||
kb,
|
||||
el_pipe,
|
||||
baseline=False,
|
||||
context=True,
|
||||
dev_limit=len(dev_indices),
|
||||
)
|
||||
|
||||
if output_dir:
|
||||
# STEP 6: write the NLP pipeline (now including an EL model) to file
|
||||
logger.info("STEP 6: Writing trained NLP to {}".format(nlp_output_dir))
|
||||
# STEP 4: write the NLP pipeline (now including an EL model) to file
|
||||
logger.info(
|
||||
"Final NLP pipeline has following pipeline components: {}".format(
|
||||
nlp.pipe_names
|
||||
)
|
||||
)
|
||||
logger.info("STEP 4: Writing trained NLP to {}".format(nlp_output_dir))
|
||||
nlp.to_disk(nlp_output_dir)
|
||||
|
||||
logger.info("Done!")
|
||||
|
||||
|
||||
def check_kb(kb):
|
||||
for mention in ("Bush", "Douglas Adams", "Homer", "Brazil", "China"):
|
||||
candidates = kb.get_candidates(mention)
|
||||
|
||||
logger.info("generating candidates for " + mention + " :")
|
||||
for c in candidates:
|
||||
logger.info(" ".join[
|
||||
str(c.prior_prob),
|
||||
c.alias_,
|
||||
"-->",
|
||||
c.entity_ + " (freq=" + str(c.entity_freq) + ")"
|
||||
]))
|
||||
|
||||
|
||||
def run_el_toy_example(nlp):
|
||||
text = (
|
||||
"In The Hitchhiker's Guide to the Galaxy, written by Douglas Adams, "
|
||||
"Douglas reminds us to always bring our towel, even in China or Brazil. "
|
||||
"The main character in Doug's novel is the man Arthur Dent, "
|
||||
"but Dougledydoug doesn't write about George Washington or Homer Simpson."
|
||||
)
|
||||
doc = nlp(text)
|
||||
logger.info(text)
|
||||
for ent in doc.ents:
|
||||
logger.info(" ".join(["ent", ent.text, ent.label_, ent.kb_id_]))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)
|
||||
plac.call(main)
|
||||
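The training script above no longer materialises every article up front: it first collects line indices for the train and dev splits, then re-reads only the lines belonging to the current minibatch in each epoch. A minimal, self-contained sketch of that pattern follows; the file name, JSON layout and helper names are hypothetical stand-ins, not the actual wiki processor API.

import json
import random
from spacy.util import minibatch, compounding

def collect_indices(path):
    # single pass over the file: remember which lines hold usable articles
    with open(path, encoding="utf8") as f:
        return [i for i, line in enumerate(f) if json.loads(line).get("clean_text")]

def read_lines(path, wanted):
    # re-open the file and parse only the lines requested for this batch
    wanted = set(wanted)
    with open(path, encoding="utf8") as f:
        for i, line in enumerate(f):
            if i in wanted:
                yield json.loads(line)

indices = collect_indices("wp_training.jsonl")  # hypothetical training file
random.shuffle(indices)
for batch in minibatch(indices, size=compounding(8.0, 128.0, 1.001)):
    articles = list(read_lines("wp_training.jsonl", batch))
    # ... convert `articles` to docs/golds and call nlp.update() here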
|
|
|
@ -6,9 +6,6 @@ import bz2
|
|||
import logging
|
||||
import random
|
||||
import json
|
||||
from tqdm import tqdm
|
||||
|
||||
from functools import partial
|
||||
|
||||
from spacy.gold import GoldParse
|
||||
from bin.wiki_entity_linking import wiki_io as io
|
||||
|
@ -454,25 +451,40 @@ def _write_training_entities(outputfile, article_id, clean_text, entities):
|
|||
outputfile.write(line)
|
||||
|
||||
|
||||
def read_training(nlp, entity_file_path, dev, limit, kb, labels_discard=None):
|
||||
""" This method provides training examples that correspond to the entity annotations found by the nlp object.
|
||||
def read_training_indices(entity_file_path):
|
||||
""" This method creates two lists of indices into the training file: one with indices for the
|
||||
training examples, and one for the dev examples."""
|
||||
train_indices = []
|
||||
dev_indices = []
|
||||
|
||||
with entity_file_path.open("r", encoding="utf8") as file:
|
||||
for i, line in enumerate(file):
|
||||
example = json.loads(line)
|
||||
article_id = example["article_id"]
|
||||
clean_text = example["clean_text"]
|
||||
|
||||
if is_valid_article(clean_text):
|
||||
if is_dev(article_id):
|
||||
dev_indices.append(i)
|
||||
else:
|
||||
train_indices.append(i)
|
||||
|
||||
return train_indices, dev_indices
|
||||
|
||||
|
||||
def read_el_docs_golds(nlp, entity_file_path, dev, line_ids, kb, labels_discard=None):
|
||||
""" This method provides training/dev examples that correspond to the entity annotations found by the nlp object.
|
||||
For training, it will include both positive and negative examples by using the candidate generator from the kb.
|
||||
For testing (kb=None), it will include all positive examples only."""
|
||||
if not labels_discard:
|
||||
labels_discard = []
|
||||
|
||||
data = []
|
||||
num_entities = 0
|
||||
get_gold_parse = partial(
|
||||
_get_gold_parse, dev=dev, kb=kb, labels_discard=labels_discard
|
||||
)
|
||||
texts = []
|
||||
entities_list = []
|
||||
|
||||
logger.info(
|
||||
"Reading {} data with limit {}".format("dev" if dev else "train", limit)
|
||||
)
|
||||
with entity_file_path.open("r", encoding="utf8") as file:
|
||||
with tqdm(total=limit, leave=False) as pbar:
|
||||
for i, line in enumerate(file):
|
||||
if i in line_ids:
|
||||
example = json.loads(line)
|
||||
article_id = example["article_id"]
|
||||
clean_text = example["clean_text"]
|
||||
|
@ -481,16 +493,15 @@ def read_training(nlp, entity_file_path, dev, limit, kb, labels_discard=None):
|
|||
if dev != is_dev(article_id) or not is_valid_article(clean_text):
|
||||
continue
|
||||
|
||||
doc = nlp(clean_text)
|
||||
gold = get_gold_parse(doc, entities)
|
||||
texts.append(clean_text)
|
||||
entities_list.append(entities)
|
||||
|
||||
docs = nlp.pipe(texts, batch_size=50)
|
||||
|
||||
for doc, entities in zip(docs, entities_list):
|
||||
gold = _get_gold_parse(doc, entities, dev=dev, kb=kb, labels_discard=labels_discard)
|
||||
if gold and len(gold.links) > 0:
|
||||
data.append((doc, gold))
|
||||
num_entities += len(gold.links)
|
||||
pbar.update(len(gold.links))
|
||||
if limit and num_entities >= limit:
|
||||
break
|
||||
logger.info("Read {} entities in {} articles".format(num_entities, len(data)))
|
||||
return data
|
||||
yield doc, gold
|
||||
|
||||
|
||||
def _get_gold_parse(doc, entities, dev, kb, labels_discard):
|
||||
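Note that `read_el_docs_golds` now batches its texts through `nlp.pipe` and yields `(doc, gold)` pairs lazily instead of building one big list. A rough sketch of that shape, with `records` and `make_gold` as hypothetical placeholders for the JSON entries and the gold-parse construction:

def docs_with_golds(nlp, records, make_gold):
    # run the pipeline over all texts in batches, then pair each doc
    # with its gold annotation and yield the pairs one at a time
    texts = [record["clean_text"] for record in records]
    annotations = [record["entities"] for record in records]
    for doc, entities in zip(nlp.pipe(texts, batch_size=50), annotations):
        gold = make_gold(doc, entities)
        if gold is not None:
            yield doc, gold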
|
|
|
@ -26,12 +26,12 @@ DEFAULT_TEXT = "Mark Zuckerberg is the CEO of Facebook."
|
|||
HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""
|
||||
|
||||
|
||||
@st.cache(ignore_hash=True)
|
||||
@st.cache(allow_output_mutation=True)
|
||||
def load_model(name):
|
||||
return spacy.load(name)
|
||||
|
||||
|
||||
@st.cache(ignore_hash=True)
|
||||
@st.cache(allow_output_mutation=True)
|
||||
def process_text(model_name, text):
|
||||
nlp = load_model(model_name)
|
||||
return nlp(text)
|
||||
|
@ -79,7 +79,9 @@ if "ner" in nlp.pipe_names:
|
|||
st.header("Named Entities")
|
||||
st.sidebar.header("Named Entities")
|
||||
label_set = nlp.get_pipe("ner").labels
|
||||
labels = st.sidebar.multiselect("Entity labels", label_set, label_set)
|
||||
labels = st.sidebar.multiselect(
|
||||
"Entity labels", options=label_set, default=list(label_set)
|
||||
)
|
||||
html = displacy.render(doc, style="ent", options={"ents": labels})
|
||||
# Newlines seem to mess with the rendering
|
||||
html = html.replace("\n", " ")
|
||||
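In the Streamlit demo, `ignore_hash=True` is replaced by `allow_output_mutation=True`, the newer Streamlit flag that tells the cache not to hash (or warn about mutations of) the returned object, which suits an `nlp` pipeline. A minimal sketch of the cached-loader pattern, assuming `en_core_web_sm` is installed:

import spacy
import streamlit as st
from spacy import displacy

@st.cache(allow_output_mutation=True)
def load_model(name):
    # don't hash or copy the returned pipeline, just reuse the cached object
    return spacy.load(name)

nlp = load_model("en_core_web_sm")
doc = nlp("Mark Zuckerberg is the CEO of Facebook.")
label_set = nlp.get_pipe("ner").labels
labels = st.sidebar.multiselect("Entity labels", options=label_set, default=list(label_set))
html = displacy.render(doc, style="ent", options={"ents": labels})
st.markdown(html.replace("\n", " "), unsafe_allow_html=True)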
|
|
|
@ -32,27 +32,24 @@ DESC_WIDTH = 64 # dimension of output entity vectors
|
|||
|
||||
|
||||
@plac.annotations(
|
||||
vocab_path=("Path to the vocab for the kb", "option", "v", Path),
|
||||
model=("Model name, should have pretrained word embeddings", "option", "m", str),
|
||||
model=("Model name, should have pretrained word embeddings", "positional", None, str),
|
||||
output_dir=("Optional output directory", "option", "o", Path),
|
||||
n_iter=("Number of training iterations", "option", "n", int),
|
||||
)
|
||||
def main(vocab_path=None, model=None, output_dir=None, n_iter=50):
|
||||
def main(model=None, output_dir=None, n_iter=50):
|
||||
"""Load the model, create the KB and pretrain the entity encodings.
|
||||
Either an nlp model or a vocab is needed to provide access to pretrained word embeddings.
|
||||
If an output_dir is provided, the KB will be stored there in a file 'kb'.
|
||||
When providing an nlp model, the updated vocab will also be written to a directory in the output_dir."""
|
||||
if model is None and vocab_path is None:
|
||||
raise ValueError("Either the `nlp` model or the `vocab` should be specified.")
|
||||
The updated vocab will also be written to a directory in the output_dir."""
|
||||
|
||||
if model is not None:
|
||||
nlp = spacy.load(model) # load existing spaCy model
|
||||
print("Loaded model '%s'" % model)
|
||||
else:
|
||||
vocab = Vocab().from_disk(vocab_path)
|
||||
# create blank Language class with specified vocab
|
||||
nlp = spacy.blank("en", vocab=vocab)
|
||||
print("Created blank 'en' model with vocab from '%s'" % vocab_path)
|
||||
|
||||
# check the length of the nlp vectors
|
||||
if "vectors" not in nlp.meta or not nlp.vocab.vectors.size:
|
||||
raise ValueError(
|
||||
"The `nlp` object should have access to pretrained word vectors, "
|
||||
" cf. https://spacy.io/usage/models#languages."
|
||||
)
|
||||
|
||||
kb = KnowledgeBase(vocab=nlp.vocab)
|
||||
|
||||
|
@ -103,8 +100,6 @@ def main(vocab_path=None, model=None, output_dir=None, n_iter=50):
|
|||
print()
|
||||
print("Saved KB to", kb_path)
|
||||
|
||||
# only storing the vocab if we weren't already reading it from file
|
||||
if not vocab_path:
|
||||
vocab_path = output_dir / "vocab"
|
||||
kb.vocab.to_disk(vocab_path)
|
||||
print("Saved vocab to", vocab_path)
|
||||
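The KB creation script above now always loads an nlp model with pretrained vectors (the separate vocab-only path was dropped) and stores both the knowledge base and its vocab in the output directory. A hedged sketch of the basic KnowledgeBase workflow those steps rely on; the model name, entity ID and values are made up for illustration:

import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load("en_core_web_md")  # assumed model with pretrained word vectors
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)
kb.add_entity(entity="Q42", freq=42, entity_vector=[0.0] * 64)
kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[1.0])
kb.dump("kb")                 # serialize the KB to disk
nlp.vocab.to_disk("vocab")    # store the vocab alongside it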
|
|
|
@ -131,7 +131,8 @@ def train_textcat(nlp, n_texts, n_iter=10):
|
|||
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
|
||||
|
||||
# get names of other pipes to disable them during training
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
|
||||
pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||
with nlp.disable_pipes(*other_pipes): # only train textcat
|
||||
optimizer = nlp.begin_training()
|
||||
textcat.model.tok2vec.from_bytes(tok2vec_weights)
|
||||
|
|
|
@ -63,7 +63,8 @@ def main(model_name, unlabelled_loc):
|
|||
optimizer.b2 = 0.0
|
||||
|
||||
# get names of other pipes to disable them during training
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
|
||||
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||
sizes = compounding(1.0, 4.0, 1.001)
|
||||
with nlp.disable_pipes(*other_pipes):
|
||||
for itn in range(n_iter):
|
||||
|
|
|
@ -113,7 +113,8 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
|
|||
TRAIN_DOCS.append((doc, annotation_clean))
|
||||
|
||||
# get names of other pipes to disable them during training
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
|
||||
pipe_exceptions = ["entity_linker", "trf_wordpiecer", "trf_tok2vec"]
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||
with nlp.disable_pipes(*other_pipes): # only train entity linker
|
||||
# reset and initialize the weights randomly
|
||||
optimizer = nlp.begin_training()
|
||||
|
|
|
@ -124,7 +124,8 @@ def main(model=None, output_dir=None, n_iter=15):
|
|||
for dep in annotations.get("deps", []):
|
||||
parser.add_label(dep)
|
||||
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "parser"]
|
||||
pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||
with nlp.disable_pipes(*other_pipes): # only train parser
|
||||
optimizer = nlp.begin_training()
|
||||
for itn in range(n_iter):
|
||||
|
|
|
@ -55,7 +55,8 @@ def main(model=None, output_dir=None, n_iter=100):
|
|||
ner.add_label(ent[2])
|
||||
|
||||
# get names of other pipes to disable them during training
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
|
||||
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||
with nlp.disable_pipes(*other_pipes): # only train NER
|
||||
# reset and initialize the weights randomly – but only if we're
|
||||
# training a new model
|
||||
|
|
|
@ -95,7 +95,8 @@ def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
|
|||
optimizer = nlp.resume_training()
|
||||
move_names = list(ner.move_names)
|
||||
# get names of other pipes to disable them during training
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
|
||||
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||
with nlp.disable_pipes(*other_pipes): # only train NER
|
||||
sizes = compounding(1.0, 4.0, 1.001)
|
||||
# batch up the examples using spaCy's minibatch
|
||||
|
|
|
@ -65,7 +65,8 @@ def main(model=None, output_dir=None, n_iter=15):
|
|||
parser.add_label(dep)
|
||||
|
||||
# get names of other pipes to disable them during training
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "parser"]
|
||||
pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||
with nlp.disable_pipes(*other_pipes): # only train parser
|
||||
optimizer = nlp.begin_training()
|
||||
for itn in range(n_iter):
|
||||
|
|
|
@ -68,7 +68,8 @@ def main(model=None, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=None
|
|||
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
|
||||
|
||||
# get names of other pipes to disable them during training
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
|
||||
pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||
with nlp.disable_pipes(*other_pipes): # only train textcat
|
||||
optimizer = nlp.begin_training()
|
||||
if init_tok2vec is not None:
|
||||
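The example-script hunks above all make the same change: instead of disabling everything that isn't the component being trained, they also keep the transformer sub-components (`trf_wordpiecer`, `trf_tok2vec`) enabled when present, presumably so pipelines built on spacy-transformers keep producing the features the trained component consumes. A condensed sketch of the pattern, assuming a pipeline that contains an `ner` component:

import spacy

nlp = spacy.load("en_core_web_sm")  # any pipeline with an "ner" component will do

# keep the component being trained plus any transformer sub-components active
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

with nlp.disable_pipes(*other_pipes):  # only the excepted pipes stay enabled
    optimizer = nlp.begin_training()
    # ... minibatch over the training data and call nlp.update(..., sgd=optimizer) here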
|
|
|
@ -49,6 +49,7 @@ install_requires =
|
|||
catalogue>=0.0.7,<1.1.0
|
||||
ml_datasets
|
||||
# Third-party dependencies
|
||||
tqdm>=4.38.0,<5.0.0
|
||||
setuptools
|
||||
numpy>=1.15.0
|
||||
plac>=0.9.6,<1.2.0
|
||||
|
|
|
@ -5,7 +5,7 @@ warnings.filterwarnings("ignore", message="numpy.dtype size changed")
|
|||
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
|
||||
|
||||
# These are imported as part of the API
|
||||
from thinc.util import prefer_gpu, require_gpu
|
||||
from thinc.api import prefer_gpu, require_gpu
|
||||
|
||||
from . import pipeline
|
||||
from .cli.info import info as cli_info
|
||||
|
|
|
@ -92,3 +92,4 @@ cdef enum attr_id_t:
|
|||
LANG
|
||||
ENT_KB_ID = symbols.ENT_KB_ID
|
||||
MORPH
|
||||
ENT_ID = symbols.ENT_ID
|
||||
|
|
|
@ -81,6 +81,7 @@ IDS = {
|
|||
"DEP": DEP,
|
||||
"ENT_IOB": ENT_IOB,
|
||||
"ENT_TYPE": ENT_TYPE,
|
||||
"ENT_ID": ENT_ID,
|
||||
"ENT_KB_ID": ENT_KB_ID,
|
||||
"HEAD": HEAD,
|
||||
"SENT_START": SENT_START,
|
||||
|
|
|
@ -9,8 +9,14 @@ from wasabi import Printer
|
|||
|
||||
|
||||
def conllu2json(
|
||||
input_data, n_sents=10, append_morphology=False, lang=None, ner_map=None,
|
||||
merge_subtokens=False, no_print=False, **_
|
||||
input_data,
|
||||
n_sents=10,
|
||||
append_morphology=False,
|
||||
lang=None,
|
||||
ner_map=None,
|
||||
merge_subtokens=False,
|
||||
no_print=False,
|
||||
**_
|
||||
):
|
||||
"""
|
||||
Convert conllu files into JSON format for use with train cli.
|
||||
|
@ -26,9 +32,13 @@ def conllu2json(
|
|||
docs = []
|
||||
raw = ""
|
||||
sentences = []
|
||||
conll_data = read_conllx(input_data, append_morphology=append_morphology,
|
||||
ner_tag_pattern=MISC_NER_PATTERN, ner_map=ner_map,
|
||||
merge_subtokens=merge_subtokens)
|
||||
conll_data = read_conllx(
|
||||
input_data,
|
||||
append_morphology=append_morphology,
|
||||
ner_tag_pattern=MISC_NER_PATTERN,
|
||||
ner_map=ner_map,
|
||||
merge_subtokens=merge_subtokens,
|
||||
)
|
||||
has_ner_tags = has_ner(input_data, ner_tag_pattern=MISC_NER_PATTERN)
|
||||
for i, example in enumerate(conll_data):
|
||||
raw += example.text
|
||||
|
@ -72,20 +82,28 @@ def has_ner(input_data, ner_tag_pattern):
|
|||
return False
|
||||
|
||||
|
||||
def read_conllx(input_data, append_morphology=False, merge_subtokens=False,
|
||||
ner_tag_pattern="", ner_map=None):
|
||||
def read_conllx(
|
||||
input_data,
|
||||
append_morphology=False,
|
||||
merge_subtokens=False,
|
||||
ner_tag_pattern="",
|
||||
ner_map=None,
|
||||
):
|
||||
""" Yield examples, one for each sentence """
|
||||
vocab = Language.Defaults.create_vocab() # need vocab to make a minimal Doc
|
||||
i = 0
|
||||
for sent in input_data.strip().split("\n\n"):
|
||||
lines = sent.strip().split("\n")
|
||||
if lines:
|
||||
while lines[0].startswith("#"):
|
||||
lines.pop(0)
|
||||
example = example_from_conllu_sentence(vocab, lines,
|
||||
ner_tag_pattern, merge_subtokens=merge_subtokens,
|
||||
example = example_from_conllu_sentence(
|
||||
vocab,
|
||||
lines,
|
||||
ner_tag_pattern,
|
||||
merge_subtokens=merge_subtokens,
|
||||
append_morphology=append_morphology,
|
||||
ner_map=ner_map)
|
||||
ner_map=ner_map,
|
||||
)
|
||||
yield example
|
||||
|
||||
|
||||
|
@ -157,8 +175,14 @@ def create_json_doc(raw, sentences, id_):
|
|||
return doc
|
||||
|
||||
|
||||
def example_from_conllu_sentence(vocab, lines, ner_tag_pattern,
|
||||
merge_subtokens=False, append_morphology=False, ner_map=None):
|
||||
def example_from_conllu_sentence(
|
||||
vocab,
|
||||
lines,
|
||||
ner_tag_pattern,
|
||||
merge_subtokens=False,
|
||||
append_morphology=False,
|
||||
ner_map=None,
|
||||
):
|
||||
"""Create an Example from the lines for one CoNLL-U sentence, merging
|
||||
subtokens and appending morphology to tags if required.
|
||||
|
||||
|
@ -182,7 +206,6 @@ def example_from_conllu_sentence(vocab, lines, ner_tag_pattern,
|
|||
in_subtok = False
|
||||
for i in range(len(lines)):
|
||||
line = lines[i]
|
||||
subtok_lines = []
|
||||
parts = line.split("\t")
|
||||
id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts
|
||||
if "." in id_:
|
||||
|
@ -212,7 +235,7 @@ def example_from_conllu_sentence(vocab, lines, ner_tag_pattern,
|
|||
subtok_word = ""
|
||||
in_subtok = False
|
||||
id_ = int(id_) - 1
|
||||
head = (int(head) - 1) if head != "0" else id_
|
||||
head = (int(head) - 1) if head not in ("0", "_") else id_
|
||||
tag = pos if tag == "_" else tag
|
||||
morph = morph if morph != "_" else ""
|
||||
dep = "ROOT" if dep == "root" else dep
|
||||
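The `head` fix above also treats an underscore (an unspecified head in CoNLL-U) like the root marker "0", attaching the token to itself instead of crashing on `int("_")`. A small illustrative helper, hypothetical and not part of the converter:

def conllu_head_to_index(id_str, head_str):
    # CoNLL-U ids and heads are 1-based; "0" marks the root and "_" an unspecified head.
    # spaCy wants 0-based indices, with root/unknown heads pointing at the token itself.
    token_i = int(id_str) - 1
    if head_str in ("0", "_"):
        return token_i
    return int(head_str) - 1

assert conllu_head_to_index("3", "1") == 0   # token 3 attaches to token 1
assert conllu_head_to_index("1", "0") == 0   # root attaches to itself
assert conllu_head_to_index("2", "_") == 1   # unspecified head: self-attachment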
|
@ -266,9 +289,17 @@ def example_from_conllu_sentence(vocab, lines, ner_tag_pattern,
|
|||
if space:
|
||||
raw += " "
|
||||
example = Example(doc=raw)
|
||||
example.set_token_annotation(ids=ids, words=words, tags=tags, pos=pos,
|
||||
morphs=morphs, lemmas=lemmas, heads=heads,
|
||||
deps=deps, entities=ents)
|
||||
example.set_token_annotation(
|
||||
ids=ids,
|
||||
words=words,
|
||||
tags=tags,
|
||||
pos=pos,
|
||||
morphs=morphs,
|
||||
lemmas=lemmas,
|
||||
heads=heads,
|
||||
deps=deps,
|
||||
entities=ents,
|
||||
)
|
||||
return example
|
||||
|
||||
|
||||
|
@ -280,7 +311,7 @@ def merge_conllu_subtokens(lines, doc):
|
|||
id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts
|
||||
if "-" in id_:
|
||||
subtok_start, subtok_end = id_.split("-")
|
||||
subtok_span = doc[int(subtok_start) - 1:int(subtok_end)]
|
||||
subtok_span = doc[int(subtok_start) - 1 : int(subtok_end)]
|
||||
subtok_spans.append(subtok_span)
|
||||
# create merged tag, morph, and lemma values
|
||||
tags = []
|
||||
|
@ -292,7 +323,7 @@ def merge_conllu_subtokens(lines, doc):
|
|||
if token._.merged_morph:
|
||||
for feature in token._.merged_morph.split("|"):
|
||||
field, values = feature.split("=", 1)
|
||||
if not field in morphs:
|
||||
if field not in morphs:
|
||||
morphs[field] = set()
|
||||
for value in values.split(","):
|
||||
morphs[field].add(value)
|
||||
|
@ -306,7 +337,9 @@ def merge_conllu_subtokens(lines, doc):
|
|||
token._.merged_lemma = " ".join(lemmas)
|
||||
token.tag_ = "_".join(tags)
|
||||
token._.merged_morph = "|".join(sorted(morphs.values()))
|
||||
token._.merged_spaceafter = True if subtok_span[-1].whitespace_ else False
|
||||
token._.merged_spaceafter = (
|
||||
True if subtok_span[-1].whitespace_ else False
|
||||
)
|
||||
|
||||
with doc.retokenize() as retokenizer:
|
||||
for span in subtok_spans:
|
||||
|
|
|
@ -166,6 +166,7 @@ def debug_data(
|
|||
has_low_data_warning = False
|
||||
has_no_neg_warning = False
|
||||
has_ws_ents_error = False
|
||||
has_punct_ents_warning = False
|
||||
|
||||
msg.divider("Named Entity Recognition")
|
||||
msg.info(
|
||||
|
@ -190,6 +191,10 @@ def debug_data(
|
|||
msg.fail(f"{gold_train_data['ws_ents']} invalid whitespace entity spans")
|
||||
has_ws_ents_error = True
|
||||
|
||||
if gold_train_data["punct_ents"]:
|
||||
msg.warn(f"{gold_train_data['punct_ents']} entity span(s) with punctuation")
|
||||
has_punct_ents_warning = True
|
||||
|
||||
for label in new_labels:
|
||||
if label_counts[label] <= NEW_LABEL_THRESHOLD:
|
||||
msg.warn(
|
||||
|
@ -209,6 +214,8 @@ def debug_data(
|
|||
msg.good("Examples without occurrences available for all labels")
|
||||
if not has_ws_ents_error:
|
||||
msg.good("No entities consisting of or starting/ending with whitespace")
|
||||
if not has_punct_ents_warning:
|
||||
msg.good("No entities consisting of or starting/ending with punctuation")
|
||||
|
||||
if has_low_data_warning:
|
||||
msg.text(
|
||||
|
@ -229,6 +236,12 @@ def debug_data(
|
|||
"with whitespace characters are considered invalid."
|
||||
)
|
||||
|
||||
if has_punct_ents_warning:
|
||||
msg.text(
|
||||
"Entity spans consisting of or starting/ending "
|
||||
"with punctuation can not be trained with a noise level > 0."
|
||||
)
|
||||
|
||||
if "textcat" in pipeline:
|
||||
msg.divider("Text Classification")
|
||||
labels = [label for label in gold_train_data["cats"]]
|
||||
|
@ -446,6 +459,7 @@ def _compile_gold(examples, pipeline):
|
|||
"words": Counter(),
|
||||
"roots": Counter(),
|
||||
"ws_ents": 0,
|
||||
"punct_ents": 0,
|
||||
"n_words": 0,
|
||||
"n_misaligned_words": 0,
|
||||
"n_sents": 0,
|
||||
|
@ -469,6 +483,16 @@ def _compile_gold(examples, pipeline):
|
|||
if label.startswith(("B-", "U-", "L-")) and doc[i].is_space:
|
||||
# "Illegal" whitespace entity
|
||||
data["ws_ents"] += 1
|
||||
if label.startswith(("B-", "U-", "L-")) and doc[i].text in [
|
||||
".",
|
||||
"'",
|
||||
"!",
|
||||
"?",
|
||||
",",
|
||||
]:
|
||||
# punctuation entity: could be replaced by whitespace when training with noise,
|
||||
# so add a warning to alert the user to this unexpected side effect.
|
||||
data["punct_ents"] += 1
|
||||
if label.startswith(("B-", "U-")):
|
||||
combined_label = label.split("-")[1]
|
||||
data["ner"][combined_label] += 1
|
||||
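The `debug-data` addition above counts entity spans whose boundary token is bare punctuation, because augmentation with a noise level > 0 can replace that punctuation with whitespace and silently corrupt the span. A stand-alone sketch of the boundary check; the token list and BILUO tags are hypothetical inputs:

def count_punct_boundary_ents(tokens, biluo_tags, punct=(".", "'", "!", "?", ",")):
    # flag entities that begin or end on a punctuation-only token
    count = 0
    for token, label in zip(tokens, biluo_tags):
        if label.startswith(("B-", "U-", "L-")) and token in punct:
            count += 1
    return count

tokens = ["Apple", "!", "is", "great"]
tags = ["B-ORG", "L-ORG", "O", "O"]
print(count_punct_boundary_ents(tokens, tags))  # 1: the span ends on "!"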
|
|
|
@ -4,14 +4,12 @@ import time
|
|||
import re
|
||||
from collections import Counter
|
||||
from pathlib import Path
|
||||
from thinc.layers import Linear, Maxout
|
||||
from thinc.util import prefer_gpu
|
||||
from thinc.api import Linear, Maxout, chain, list2array, prefer_gpu
|
||||
from thinc.api import CosineDistance, L2Distance
|
||||
from wasabi import msg
|
||||
import srsly
|
||||
from thinc.layers import chain, list2array
|
||||
from thinc.loss import CosineDistance, L2Distance
|
||||
|
||||
from spacy.gold import Example
|
||||
from ..gold import Example
|
||||
from ..errors import Errors
|
||||
from ..tokens import Doc
|
||||
from ..attrs import ID, HEAD
|
||||
|
@ -28,7 +26,7 @@ def pretrain(
|
|||
vectors_model: ("Name or path to spaCy model with vectors to learn from", "positional", None, str),
|
||||
output_dir: ("Directory to write models to on each epoch", "positional", None, str),
|
||||
width: ("Width of CNN layers", "option", "cw", int) = 96,
|
||||
depth: ("Depth of CNN layers", "option", "cd", int) = 4,
|
||||
conv_depth: ("Depth of CNN layers", "option", "cd", int) = 4,
|
||||
bilstm_depth: ("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int) = 0,
|
||||
cnn_pieces: ("Maxout size for CNN layers. 1 for Mish", "option", "cP", int) = 3,
|
||||
sa_depth: ("Depth of self-attention layers", "option", "sa", int) = 0,
|
||||
|
@ -77,9 +75,15 @@ def pretrain(
|
|||
msg.info("Using GPU" if has_gpu else "Not using GPU")
|
||||
|
||||
output_dir = Path(output_dir)
|
||||
if output_dir.exists() and [p for p in output_dir.iterdir()]:
|
||||
msg.warn(
|
||||
"Output directory is not empty",
|
||||
"It is better to use an empty directory or refer to a new output path, "
|
||||
"then the new directory will be created for you.",
|
||||
)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
msg.good("Created output directory")
|
||||
msg.good(f"Created output directory: {output_dir}")
|
||||
srsly.write_json(output_dir / "config.json", config)
|
||||
msg.good("Saved settings to config.json")
|
||||
|
||||
|
@ -107,7 +111,7 @@ def pretrain(
|
|||
Tok2Vec(
|
||||
width,
|
||||
embed_rows,
|
||||
conv_depth=depth,
|
||||
conv_depth=conv_depth,
|
||||
pretrained_vectors=pretrained_vectors,
|
||||
bilstm_depth=bilstm_depth, # Requires PyTorch. Experimental.
|
||||
subword_features=not use_chars, # Set to False for Chinese etc
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
import os
|
||||
import tqdm
|
||||
from pathlib import Path
|
||||
from thinc.backends import use_ops
|
||||
from thinc.api import use_ops
|
||||
from timeit import default_timer as timer
|
||||
import shutil
|
||||
import srsly
|
||||
|
@ -10,6 +10,7 @@ import contextlib
|
|||
import random
|
||||
|
||||
from ..util import create_default_optimizer
|
||||
from ..util import use_gpu as set_gpu
|
||||
from ..attrs import PROB, IS_OOV, CLUSTER, LANG
|
||||
from ..gold import GoldCorpus
|
||||
from .. import util
|
||||
|
@ -26,6 +27,14 @@ def train(
|
|||
base_model: ("Name of model to update (optional)", "option", "b", str) = None,
|
||||
pipeline: ("Comma-separated names of pipeline components", "option", "p", str) = "tagger,parser,ner",
|
||||
vectors: ("Model to load vectors from", "option", "v", str) = None,
|
||||
replace_components: ("Replace components from base model", "flag", "R", bool) = False,
|
||||
width: ("Width of CNN layers of Tok2Vec component", "option", "cw", int) = 96,
|
||||
conv_depth: ("Depth of CNN layers of Tok2Vec component", "option", "cd", int) = 4,
|
||||
cnn_window: ("Window size for CNN layers of Tok2Vec component", "option", "cW", int) = 1,
|
||||
cnn_pieces: ("Maxout size for CNN layers of Tok2Vec component. 1 for Mish", "option", "cP", int) = 3,
|
||||
use_chars: ("Whether to use character-based embedding of Tok2Vec component", "flag", "chr", bool) = False,
|
||||
bilstm_depth: ("Depth of BiLSTM layers of Tok2Vec component (requires PyTorch)", "option", "lstm", int) = 0,
|
||||
embed_rows: ("Number of embedding rows of Tok2Vec component", "option", "er", int) = 2000,
|
||||
n_iter: ("Number of iterations", "option", "n", int) = 30,
|
||||
n_early_stopping: ("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int) = None,
|
||||
n_examples: ("Number of examples", "option", "ns", int) = 0,
|
||||
|
@ -80,6 +89,7 @@ def train(
|
|||
)
|
||||
if not output_path.exists():
|
||||
output_path.mkdir()
|
||||
msg.good(f"Created output directory: {output_path}")
|
||||
|
||||
tag_map = {}
|
||||
if tag_map_path is not None:
|
||||
|
@ -113,6 +123,21 @@ def train(
|
|||
# training starts from a blank model, intitalize the language class.
|
||||
pipeline = [p.strip() for p in pipeline.split(",")]
|
||||
msg.text(f"Training pipeline: {pipeline}")
|
||||
disabled_pipes = None
|
||||
pipes_added = False
|
||||
msg.text(f"Training pipeline: {pipeline}")
|
||||
if use_gpu >= 0:
|
||||
activated_gpu = None
|
||||
try:
|
||||
activated_gpu = set_gpu(use_gpu)
|
||||
except Exception as e:
|
||||
msg.warn(f"Exception: {e}")
|
||||
if activated_gpu is not None:
|
||||
msg.text(f"Using GPU: {use_gpu}")
|
||||
else:
|
||||
msg.warn(f"Unable to activate GPU: {use_gpu}")
|
||||
msg.text("Using CPU only")
|
||||
use_gpu = -1
|
||||
if base_model:
|
||||
msg.text(f"Starting with base model '{base_model}'")
|
||||
nlp = util.load_model(base_model)
|
||||
|
@ -122,9 +147,8 @@ def train(
|
|||
f"specified as `lang` argument ('{lang}') ",
|
||||
exits=1,
|
||||
)
|
||||
nlp.disable_pipes([p for p in nlp.pipe_names if p not in pipeline])
|
||||
for pipe in pipeline:
|
||||
if pipe not in nlp.pipe_names:
|
||||
pipe_cfg = {}
|
||||
if pipe == "parser":
|
||||
pipe_cfg = {"learn_tokens": learn_tokens}
|
||||
elif pipe == "textcat":
|
||||
|
@ -133,9 +157,14 @@ def train(
|
|||
"architecture": textcat_arch,
|
||||
"positive_label": textcat_positive_label,
|
||||
}
|
||||
else:
|
||||
pipe_cfg = {}
|
||||
if pipe not in nlp.pipe_names:
|
||||
msg.text(f"Adding component to base model '{pipe}'")
|
||||
nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg))
|
||||
pipes_added = True
|
||||
elif replace_components:
|
||||
msg.text(f"Replacing component from base model '{pipe}'")
|
||||
nlp.replace_pipe(pipe, nlp.create_pipe(pipe, config=pipe_cfg))
|
||||
pipes_added = True
|
||||
else:
|
||||
if pipe == "textcat":
|
||||
textcat_cfg = nlp.get_pipe("textcat").cfg
|
||||
|
@ -144,11 +173,6 @@ def train(
|
|||
"architecture": textcat_cfg["architecture"],
|
||||
"positive_label": textcat_cfg["positive_label"],
|
||||
}
|
||||
pipe_cfg = {
|
||||
"exclusive_classes": not textcat_multilabel,
|
||||
"architecture": textcat_arch,
|
||||
"positive_label": textcat_positive_label,
|
||||
}
|
||||
if base_cfg != pipe_cfg:
|
||||
msg.fail(
|
||||
f"The base textcat model configuration does"
|
||||
|
@ -156,6 +180,10 @@ def train(
|
|||
f"Existing cfg: {base_cfg}, provided cfg: {pipe_cfg}",
|
||||
exits=1,
|
||||
)
|
||||
msg.text(f"Extending component from base model '{pipe}'")
|
||||
disabled_pipes = nlp.disable_pipes(
|
||||
[p for p in nlp.pipe_names if p not in pipeline]
|
||||
)
|
||||
else:
|
||||
msg.text(f"Starting with blank model '{lang}'")
|
||||
lang_cls = util.get_lang_class(lang)
|
||||
|
@ -198,13 +226,20 @@ def train(
|
|||
corpus = GoldCorpus(train_path, dev_path, limit=n_examples)
|
||||
n_train_words = corpus.count_train()
|
||||
|
||||
if base_model:
|
||||
if base_model and not pipes_added:
|
||||
# Start with an existing model, use default optimizer
|
||||
optimizer = create_default_optimizer()
|
||||
else:
|
||||
# Start with a blank model, call begin_training
|
||||
optimizer = nlp.begin_training(lambda: corpus.train_examples, device=use_gpu)
|
||||
|
||||
cfg = {"device": use_gpu}
|
||||
cfg["conv_depth"] = conv_depth
|
||||
cfg["token_vector_width"] = width
|
||||
cfg["bilstm_depth"] = bilstm_depth
|
||||
cfg["cnn_maxout_pieces"] = cnn_pieces
|
||||
cfg["embed_size"] = embed_rows
|
||||
cfg["conv_window"] = cnn_window
|
||||
cfg["subword_features"] = not use_chars
|
||||
optimizer = nlp.begin_training(lambda: corpus.train_tuples, **cfg)
|
||||
nlp._optimizer = None
|
||||
|
||||
# Load in pretrained weights
|
||||
|
@ -214,7 +249,7 @@ def train(
|
|||
|
||||
# Verify textcat config
|
||||
if "textcat" in pipeline:
|
||||
textcat_labels = nlp.get_pipe("textcat").cfg["labels"]
|
||||
textcat_labels = nlp.get_pipe("textcat").cfg.get("labels", [])
|
||||
if textcat_positive_label and textcat_positive_label not in textcat_labels:
|
||||
msg.fail(
|
||||
f"The textcat_positive_label (tpl) '{textcat_positive_label}' "
|
||||
|
@ -327,12 +362,22 @@ def train(
|
|||
for batch in util.minibatch_by_words(train_data, size=batch_sizes):
|
||||
if not batch:
|
||||
continue
|
||||
docs, golds = zip(*batch)
|
||||
try:
|
||||
nlp.update(
|
||||
batch,
|
||||
docs,
|
||||
golds,
|
||||
sgd=optimizer,
|
||||
drop=next(dropout_rates),
|
||||
losses=losses,
|
||||
)
|
||||
except ValueError as e:
|
||||
msg.warn("Error during training")
|
||||
if init_tok2vec:
|
||||
msg.warn(
|
||||
"Did you provide the same parameters during 'train' as during 'pretrain'?"
|
||||
)
|
||||
msg.fail(f"Original error message: {e}", exits=1)
|
||||
if raw_text:
|
||||
# If raw text is available, perform 'rehearsal' updates,
|
||||
# which use unlabelled data to reduce overfitting.
|
||||
|
@ -396,11 +441,16 @@ def train(
|
|||
"cpu": cpu_wps,
|
||||
"gpu": gpu_wps,
|
||||
}
|
||||
meta["accuracy"] = scorer.scores
|
||||
meta.setdefault("accuracy", {})
|
||||
for component in nlp.pipe_names:
|
||||
for metric in _get_metrics(component):
|
||||
meta["accuracy"][metric] = scorer.scores[metric]
|
||||
else:
|
||||
meta.setdefault("beam_accuracy", {})
|
||||
meta.setdefault("beam_speed", {})
|
||||
meta["beam_accuracy"][beam_width] = scorer.scores
|
||||
for component in nlp.pipe_names:
|
||||
for metric in _get_metrics(component):
|
||||
meta["beam_accuracy"][metric] = scorer.scores[metric]
|
||||
meta["beam_speed"][beam_width] = {
|
||||
"nwords": nwords,
|
||||
"cpu": cpu_wps,
|
||||
|
@ -453,13 +503,19 @@ def train(
|
|||
f"Best score = {best_score}; Final iteration score = {current_score}"
|
||||
)
|
||||
break
|
||||
except Exception as e:
|
||||
msg.warn(f"Aborting and saving final best model. Encountered exception: {e}")
|
||||
finally:
|
||||
best_pipes = nlp.pipe_names
|
||||
if disabled_pipes:
|
||||
disabled_pipes.restore()
|
||||
with nlp.use_params(optimizer.averages):
|
||||
final_model_path = output_path / "model-final"
|
||||
nlp.to_disk(final_model_path)
|
||||
final_meta = srsly.read_json(output_path / "model-final" / "meta.json")
|
||||
msg.good("Saved model to output directory", final_model_path)
|
||||
with msg.loading("Creating best model..."):
|
||||
best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names)
|
||||
best_model_path = _collate_best_model(final_meta, output_path, best_pipes)
|
||||
msg.good("Created best model", best_model_path)
|
||||
|
||||
|
||||
|
@ -519,15 +575,14 @@ def _load_pretrained_tok2vec(nlp, loc):
|
|||
|
||||
def _collate_best_model(meta, output_path, components):
|
||||
bests = {}
|
||||
meta.setdefault("accuracy", {})
|
||||
for component in components:
|
||||
bests[component] = _find_best(output_path, component)
|
||||
best_dest = output_path / "model-best"
|
||||
shutil.copytree(str(output_path / "model-final"), str(best_dest))
|
||||
for component, best_component_src in bests.items():
|
||||
shutil.rmtree(str(best_dest / component))
|
||||
shutil.copytree(
|
||||
str(best_component_src / component), str(best_dest / component)
|
||||
)
|
||||
shutil.copytree(str(best_component_src / component), str(best_dest / component))
|
||||
accs = srsly.read_json(best_component_src / "accuracy.json")
|
||||
for metric in _get_metrics(component):
|
||||
meta["accuracy"][metric] = accs[metric]
|
||||
|
@ -550,13 +605,15 @@ def _find_best(experiment_dir, component):
|
|||
|
||||
def _get_metrics(component):
|
||||
if component == "parser":
|
||||
return ("las", "uas", "token_acc", "sent_f")
|
||||
return ("las", "uas", "las_per_type", "token_acc", "sent_f")
|
||||
elif component == "tagger":
|
||||
return ("tags_acc",)
|
||||
elif component == "ner":
|
||||
return ("ents_f", "ents_p", "ents_r")
|
||||
return ("ents_f", "ents_p", "ents_r", "enty_per_type")
|
||||
elif component == "sentrec":
|
||||
return ("sent_f", "sent_p", "sent_r")
|
||||
elif component == "textcat":
|
||||
return ("textcat_score",)
|
||||
return ("token_acc",)
|
||||
|
||||
|
||||
|
@ -568,8 +625,12 @@ def _configure_training_output(pipeline, use_gpu, has_beam_widths):
|
|||
row_head.extend(["Tag Loss ", " Tag % "])
|
||||
output_stats.extend(["tag_loss", "tags_acc"])
|
||||
elif pipe == "parser":
|
||||
row_head.extend(["Dep Loss ", " UAS ", " LAS ", "Sent P", "Sent R", "Sent F"])
|
||||
output_stats.extend(["dep_loss", "uas", "las", "sent_p", "sent_r", "sent_f"])
|
||||
row_head.extend(
|
||||
["Dep Loss ", " UAS ", " LAS ", "Sent P", "Sent R", "Sent F"]
|
||||
)
|
||||
output_stats.extend(
|
||||
["dep_loss", "uas", "las", "sent_p", "sent_r", "sent_f"]
|
||||
)
|
||||
elif pipe == "ner":
|
||||
row_head.extend(["NER Loss ", "NER P ", "NER R ", "NER F "])
|
||||
output_stats.extend(["ner_loss", "ents_p", "ents_r", "ents_f"])
|
||||
|
|
|
@ -1,19 +1,20 @@
|
|||
from typing import Optional, Dict, List, Union, Sequence
|
||||
import plac
|
||||
from thinc.util import require_gpu
|
||||
from wasabi import msg
|
||||
from pathlib import Path
|
||||
import thinc
|
||||
import thinc.schedules
|
||||
from thinc.model import Model
|
||||
from spacy.gold import GoldCorpus
|
||||
import spacy
|
||||
from spacy.pipeline.tok2vec import Tok2VecListener
|
||||
from typing import Optional, Dict, List, Union, Sequence
|
||||
from thinc.api import Model
|
||||
from pydantic import BaseModel, FilePath, StrictInt
|
||||
import tqdm
|
||||
|
||||
from ..ml import component_models
|
||||
from .. import util
|
||||
# TODO: relative imports?
|
||||
import spacy
|
||||
from spacy.gold import GoldCorpus
|
||||
from spacy.pipeline.tok2vec import Tok2VecListener
|
||||
from spacy.ml import component_models
|
||||
from spacy import util
|
||||
|
||||
|
||||
registry = util.registry
|
||||
|
||||
|
@ -153,10 +154,9 @@ def create_tb_parser_model(
|
|||
hidden_width: StrictInt = 64,
|
||||
maxout_pieces: StrictInt = 3,
|
||||
):
|
||||
from thinc.layers import Linear, chain, list2array
|
||||
from thinc.api import Linear, chain, list2array, use_ops, zero_init
|
||||
from spacy.ml._layers import PrecomputableAffine
|
||||
from spacy.syntax._parser_model import ParserModel
|
||||
from thinc.api import use_ops, zero_init
|
||||
|
||||
token_vector_width = tok2vec.get_dim("nO")
|
||||
tok2vec = chain(tok2vec, list2array())
|
||||
|
@ -221,13 +221,9 @@ def train_from_config_cli(
|
|||
|
||||
|
||||
def train_from_config(
|
||||
config_path,
|
||||
data_paths,
|
||||
raw_text=None,
|
||||
meta_path=None,
|
||||
output_path=None,
|
||||
config_path, data_paths, raw_text=None, meta_path=None, output_path=None,
|
||||
):
|
||||
msg.info("Loading config from: {}".format(config_path))
|
||||
msg.info(f"Loading config from: {config_path}")
|
||||
config = util.load_from_config(config_path, create_objects=True)
|
||||
use_gpu = config["training"]["use_gpu"]
|
||||
if use_gpu >= 0:
|
||||
|
@ -241,9 +237,7 @@ def train_from_config(
|
|||
msg.info("Loading training corpus")
|
||||
corpus = GoldCorpus(data_paths["train"], data_paths["dev"], limit=limit)
|
||||
msg.info("Initializing the nlp pipeline")
|
||||
nlp.begin_training(
|
||||
lambda: corpus.train_examples, device=use_gpu
|
||||
)
|
||||
nlp.begin_training(lambda: corpus.train_examples, device=use_gpu)
|
||||
|
||||
train_batches = create_train_batches(nlp, corpus, config["training"])
|
||||
evaluate = create_evaluation_callback(nlp, optimizer, corpus, config["training"])
|
||||
|
@ -260,7 +254,7 @@ def train_from_config(
|
|||
config["training"]["eval_frequency"],
|
||||
)
|
||||
|
||||
msg.info("Training. Initial learn rate: {}".format(optimizer.learn_rate))
|
||||
msg.info(f"Training. Initial learn rate: {optimizer.learn_rate}")
|
||||
print_row = setup_printer(config)
|
||||
|
||||
try:
|
||||
|
@ -414,7 +408,7 @@ def subdivide_batch(batch):
|
|||
def setup_printer(config):
|
||||
score_cols = config["training"]["scores"]
|
||||
score_widths = [max(len(col), 6) for col in score_cols]
|
||||
loss_cols = ["Loss {}".format(pipe) for pipe in config["nlp"]["pipeline"]]
|
||||
loss_cols = [f"Loss {pipe}" for pipe in config["nlp"]["pipeline"]]
|
||||
loss_widths = [max(len(col), 8) for col in loss_cols]
|
||||
table_header = ["#"] + loss_cols + score_cols + ["Score"]
|
||||
table_header = [col.upper() for col in table_header]
|
||||
|
|
|
@ -29,7 +29,7 @@ try:
|
|||
except ImportError:
|
||||
cupy = None
|
||||
|
||||
from thinc.optimizers import Optimizer # noqa: F401
|
||||
from thinc.api import Optimizer # noqa: F401
|
||||
|
||||
pickle = pickle
|
||||
copy_reg = copy_reg
|
||||
|
|
|
@ -51,9 +51,10 @@ def render(
|
|||
html = RENDER_WRAPPER(html)
|
||||
if jupyter or (jupyter is None and is_in_jupyter()):
|
||||
# return HTML rendered by IPython display()
|
||||
# See #4840 for details on span wrapper to disable mathjax
|
||||
from IPython.core.display import display, HTML
|
||||
|
||||
return display(HTML(html))
|
||||
return display(HTML('<span class="tex2jax_ignore">{}</span>'.format(html)))
|
||||
return html
|
||||
|
||||
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
# Setting explicit height and max-width: none on the SVG is required for
|
||||
# Jupyter to render it properly in a cell
|
||||
|
||||
|
|
|
@ -75,10 +75,9 @@ class Warnings(object):
|
|||
W015 = ("As of v2.1.0, the use of keyword arguments to exclude fields from "
|
||||
"being serialized or deserialized is deprecated. Please use the "
|
||||
"`exclude` argument instead. For example: exclude=['{arg}'].")
|
||||
W016 = ("The keyword argument `n_threads` on the is now deprecated, as "
|
||||
"the v2.x models cannot release the global interpreter lock. "
|
||||
"Future versions may introduce a `n_process` argument for "
|
||||
"parallel inference via multiprocessing.")
|
||||
W016 = ("The keyword argument `n_threads` is now deprecated. As of v2.2.2, "
|
||||
"the argument `n_process` controls parallel inference via "
|
||||
"multiprocessing.")
|
||||
W017 = ("Alias '{alias}' already exists in the Knowledge Base.")
|
||||
W018 = ("Entity '{entity}' already exists in the Knowledge Base - "
|
||||
"ignoring the duplicate entry.")
|
||||
|
@ -170,7 +169,8 @@ class Errors(object):
|
|||
"and satisfies the correct annotations specified in the GoldParse. "
|
||||
"For example, are all labels added to the model? If you're "
|
||||
"training a named entity recognizer, also make sure that none of "
|
||||
"your annotated entity spans have leading or trailing whitespace. "
|
||||
"your annotated entity spans have leading or trailing whitespace "
|
||||
"or punctuation. "
|
||||
"You can also use the experimental `debug-data` command to "
|
||||
"validate your JSON-formatted training data. For details, run:\n"
|
||||
"python -m spacy debug-data --help")
|
||||
|
@ -536,8 +536,8 @@ class Errors(object):
|
|||
E997 = ("Tokenizer special cases are not allowed to modify the text. "
|
||||
"This would map '{chunk}' to '{orth}' given token attributes "
|
||||
"'{token_attrs}'.")
|
||||
E998 = ("Can only create GoldParse's from Example's without a Doc, "
|
||||
"if get_gold_parses() is called with a Vocab object.")
|
||||
E998 = ("Can only create GoldParse objects from Example objects without a "
|
||||
"Doc if get_gold_parses() is called with a Vocab object.")
|
||||
E999 = ("Encountered an unexpected format for the dictionary holding "
|
||||
"gold annotations: {gold_dict}")
|
||||
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
def explain(term):
|
||||
"""Get a description for a given POS tag, dependency label or entity type.
|
||||
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
from cymem.cymem cimport Pool
|
||||
|
||||
from spacy.tokens import Doc
|
||||
from .tokens import Doc
|
||||
from .typedefs cimport attr_t
|
||||
from .syntax.transition_system cimport Transition
|
||||
|
||||
|
@ -65,5 +65,3 @@ cdef class Example:
|
|||
cdef public TokenAnnotation token_annotation
|
||||
cdef public DocAnnotation doc_annotation
|
||||
cdef public object goldparse
|
||||
|
||||
|
||||
|
|
|
@ -6,7 +6,7 @@ from libcpp.vector cimport vector
|
|||
from libc.stdint cimport int32_t, int64_t
|
||||
from libc.stdio cimport FILE
|
||||
|
||||
from spacy.vocab cimport Vocab
|
||||
from .vocab cimport Vocab
|
||||
from .typedefs cimport hash_t
|
||||
|
||||
from .structs cimport KBEntryC, AliasC
|
||||
|
@ -169,4 +169,3 @@ cdef class Reader:
|
|||
cdef int read_alias(self, int64_t* entry_index, float* prob) except -1
|
||||
|
||||
cdef int _read(self, void* value, size_t size) except -1
|
||||
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
# Source: https://github.com/stopwords-iso/stopwords-af
|
||||
|
||||
STOP_WORDS = set(
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
# Source: https://github.com/Alir3z4/stop-words
|
||||
|
||||
STOP_WORDS = set(
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
STOP_WORDS = set(
|
||||
"""
|
||||
অতএব অথচ অথবা অনুযায়ী অনেক অনেকে অনেকেই অন্তত অবধি অবশ্য অর্থাৎ অন্য অনুযায়ী অর্ধভাগে
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
|
|
|
@ -14,6 +14,17 @@ _tamil = r"\u0B80-\u0BFF"
|
|||
|
||||
_telugu = r"\u0C00-\u0C7F"
|
||||
|
||||
# from the final table in: https://en.wikipedia.org/wiki/CJK_Unified_Ideographs
|
||||
_cjk = (
|
||||
r"\u4E00-\u62FF\u6300-\u77FF\u7800-\u8CFF\u8D00-\u9FFF\u3400-\u4DBF"
|
||||
r"\U00020000-\U000215FF\U00021600-\U000230FF\U00023100-\U000245FF"
|
||||
r"\U00024600-\U000260FF\U00026100-\U000275FF\U00027600-\U000290FF"
|
||||
r"\U00029100-\U0002A6DF\U0002A700-\U0002B73F\U0002B740-\U0002B81F"
|
||||
r"\U0002B820-\U0002CEAF\U0002CEB0-\U0002EBEF\u2E80-\u2EFF\u2F00-\u2FDF"
|
||||
r"\u2FF0-\u2FFF\u3000-\u303F\u31C0-\u31EF\u3200-\u32FF\u3300-\u33FF"
|
||||
r"\uF900-\uFAFF\uFE30-\uFE4F\U0001F200-\U0001F2FF\U0002F800-\U0002FA1F"
|
||||
)
|
||||
|
||||
# Latin standard
|
||||
_latin_u_standard = r"A-Z"
|
||||
_latin_l_standard = r"a-z"
|
||||
|
@ -212,6 +223,7 @@ _uncased = (
|
|||
+ _tamil
|
||||
+ _telugu
|
||||
+ _hangul
|
||||
+ _cjk
|
||||
)
|
||||
|
||||
ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased)
|
||||
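Appending the `_cjk` ranges to the uncased group means CJK characters are treated as alphabetic by the tokenizer's character classes. A toy illustration of how such ranges behave once compiled into a regex character class; the ranges below are abbreviated stand-ins, not the full table:

import re

_cjk = r"\u4E00-\u9FFF\u3400-\u4DBF"   # abbreviated CJK ranges for illustration only
ALPHA = r"A-Za-z" + _cjk

alpha_re = re.compile(f"[{ALPHA}]+")
print(alpha_re.findall("spaCy 支持中文 tokens"))  # ['spaCy', '支持中文', 'tokens']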
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
# Source: https://github.com/Alir3z4/stop-words
|
||||
|
||||
STOP_WORDS = set(
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
STOP_WORDS = set(
|
||||
"""
|
||||
á a ab aber ach acht achte achten achter achtes ag alle allein allem allen
|
||||
|
@ -19,14 +18,14 @@ dort drei drin dritte dritten dritter drittes du durch durchaus dürfen dürft
|
|||
durfte durften
|
||||
|
||||
eben ebenso ehrlich eigen eigene eigenen eigener eigenes ein einander eine
|
||||
einem einen einer eines einigeeinigen einiger einiges einmal einmaleins elf en
|
||||
einem einen einer eines einige einigen einiger einiges einmal einmaleins elf en
|
||||
ende endlich entweder er erst erste ersten erster erstes es etwa etwas euch
|
||||
|
||||
früher fünf fünfte fünften fünfter fünftes für
|
||||
|
||||
gab ganz ganze ganzen ganzer ganzes gar gedurft gegen gegenüber gehabt gehen
|
||||
geht gekannt gekonnt gemacht gemocht gemusst genug gerade gern gesagt geschweige
|
||||
gewesen gewollt geworden gibt ging gleich gott gross groß grosse große grossen
|
||||
gewesen gewollt geworden gibt ging gleich gross groß grosse große grossen
|
||||
großen grosser großer grosses großes gut gute guter gutes
|
||||
|
||||
habe haben habt hast hat hatte hätte hatten hätten heisst heißt her heute hier
|
||||
|
@ -44,9 +43,8 @@ kleines kommen kommt können könnt konnte könnte konnten kurz
|
|||
lang lange leicht leider lieber los
|
||||
|
||||
machen macht machte mag magst man manche manchem manchen mancher manches mehr
|
||||
mein meine meinem meinen meiner meines mensch menschen mich mir mit mittel
|
||||
mochte möchte mochten mögen möglich mögt morgen muss muß müssen musst müsst
|
||||
musste mussten
|
||||
mein meine meinem meinen meiner meines mich mir mit mittel mochte möchte mochten
|
||||
mögen möglich mögt morgen muss muß müssen musst müsst musste mussten
|
||||
|
||||
na nach nachdem nahm natürlich neben nein neue neuen neun neunte neunten neunter
|
||||
neuntes nicht nichts nie niemand niemandem niemanden noch nun nur
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .tag_map_general import TAG_MAP
|
||||
from ..tag_map import TAG_MAP
|
||||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from .lemmatizer import GreekLemmatizer
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
def get_pos_from_wiktionary():
|
||||
import re
|
||||
from gensim.corpora.wikicorpus import extract_pages
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
# These exceptions are used to add NORM values based on a token's ORTH value.
|
||||
# Norms are only set if no alternative is provided in the tokenizer exceptions.
|
||||
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
# Stop words
|
||||
# Link to greek stop words: https://www.translatum.gr/forum/index.php?topic=3550.0?topic=3550.0
|
||||
STOP_WORDS = set(
|
||||
|
|
|
@ -1,24 +0,0 @@
|
|||
from ...symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
|
||||
from ...symbols import PUNCT, NUM, AUX, X, ADJ, VERB, PART, SPACE, CCONJ
|
||||
|
||||
|
||||
TAG_MAP = {
|
||||
"ADJ": {POS: ADJ},
|
||||
"ADV": {POS: ADV},
|
||||
"INTJ": {POS: INTJ},
|
||||
"NOUN": {POS: NOUN},
|
||||
"PROPN": {POS: PROPN},
|
||||
"VERB": {POS: VERB},
|
||||
"ADP": {POS: ADP},
|
||||
"CCONJ": {POS: CCONJ},
|
||||
"SCONJ": {POS: SCONJ},
|
||||
"PART": {POS: PART},
|
||||
"PUNCT": {POS: PUNCT},
|
||||
"SYM": {POS: SYM},
|
||||
"NUM": {POS: NUM},
|
||||
"PRON": {POS: PRON},
|
||||
"AUX": {POS: AUX},
|
||||
"SPACE": {POS: SPACE},
|
||||
"DET": {POS: DET},
|
||||
"X": {POS: X},
|
||||
}
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
_exc = {
|
||||
# Slang and abbreviations
|
||||
"cos": "because",
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
# Stop words
|
||||
STOP_WORDS = set(
|
||||
"""
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
STOP_WORDS = set(
|
||||
"""
|
||||
actualmente acuerdo adelante ademas además adrede afirmó agregó ahi ahora ahí
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
# Source: https://github.com/stopwords-iso/stopwords-et
|
||||
|
||||
STOP_WORDS = set(
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
verb_roots = """
|
||||
#هست
|
||||
آخت#آهنج
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
# Stop words from HAZM package
|
||||
STOP_WORDS = set(
|
||||
"""
|
||||
|
|
|
@ -1,9 +1,10 @@
|
|||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
|
||||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_HYPHENS
|
||||
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
||||
from ..punctuation import TOKENIZER_SUFFIXES
|
||||
|
||||
|
||||
_quotes = CONCAT_QUOTES.replace("'", "")
|
||||
DASHES = "|".join(x for x in LIST_HYPHENS if x != "-")
|
||||
|
||||
_infixes = (
|
||||
LIST_ELLIPSES
|
||||
|
@ -11,11 +12,9 @@ _infixes = (
|
|||
+ [
|
||||
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
|
||||
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
|
||||
r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])(?:{d})(?=[{a}])".format(a=ALPHA, d=DASHES),
|
||||
r"(?<=[{a}0-9])[<>=/](?=[{a}])".format(a=ALPHA),
|
||||
]
|
||||
)
|
||||
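For Finnish, the infix rules now split on dash-like characters other than the plain ASCII hyphen, so compounds such as "linja-auto" stay one token while en and em dashes still separate words. A reduced sketch of the idea, with a hypothetical three-item hyphen list standing in for spaCy's full `LIST_HYPHENS`:

import re

LIST_HYPHENS = ["-", "–", "—"]   # hypothetical, abbreviated list
ALPHA = "A-Za-zÅÄÖåäö"           # simplified alphabet for the example
DASHES = "|".join(x for x in LIST_HYPHENS if x != "-")

infix_re = re.compile(r"(?<=[{a}])(?:{d})(?=[{a}])".format(a=ALPHA, d=DASHES))
print(bool(infix_re.search("linja-auto")))   # False: plain hyphen is kept inside the word
print(bool(infix_re.search("linja–auto")))   # True: en dash is treated as an infix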
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
# Source https://github.com/stopwords-iso/stopwords-fi/blob/master/stopwords-fi.txt
|
||||
# Reformatted with some minor corrections
|
||||
STOP_WORDS = set(
|
||||
|
|
|
@ -28,6 +28,9 @@ for exc_data in [
|
|||
{ORTH: "myöh.", LEMMA: "myöhempi"},
|
||||
{ORTH: "n.", LEMMA: "noin"},
|
||||
{ORTH: "nimim.", LEMMA: "nimimerkki"},
|
||||
{ORTH: "n:o", LEMMA: "numero"},
|
||||
{ORTH: "N:o", LEMMA: "numero"},
|
||||
{ORTH: "nro", LEMMA: "numero"},
|
||||
{ORTH: "ns.", LEMMA: "niin sanottu"},
|
||||
{ORTH: "nyk.", LEMMA: "nykyinen"},
|
||||
{ORTH: "oik.", LEMMA: "oikealla"},
|
||||
|
|
|
@@ -1,4 +1,3 @@
"""
Example sentences to test spaCy and its language models.

@@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
a à â abord absolument afin ah ai aie ailleurs ainsi ait allaient allo allons

@@ -1,4 +1,3 @@
# fmt: off
consonants = ["b", "c", "d", "f", "g", "h", "j", "k", "l", "m", "n", "p", "q", "r", "s", "t", "v", "w", "x", "z"]
broad_vowels = ["a", "á", "o", "ó", "u", "ú"]

@@ -1,4 +1,3 @@
"""
Example sentences to test spaCy and its language models.

@@ -1,4 +1,3 @@
"""
Example sentences to test spaCy and its language models.

@@ -1,4 +1,3 @@
# Source: https://github.com/taranjeet/hindi-tokenizer/blob/master/stopwords.txt, https://data.mendeley.com/datasets/bsr3frvvjc/1#file-a21d5092-99d7-45d8-b044-3ae9edd391c6

STOP_WORDS = set(

@@ -1,4 +1,3 @@
"""
Example sentences to test spaCy and its language models.

@@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
a abban ahhoz ahogy ahol aki akik akkor akár alatt amely amelyek amelyekben

@@ -1,4 +1,3 @@
"""
Example sentences to test spaCy and its language models.

@@ -1,4 +1,3 @@
# Source: https://github.com/Xangis/extra-stopwords

STOP_WORDS = set(

@@ -1,4 +1,3 @@
"""
Example sentences to test spaCy and its language models.

@@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
a abbastanza abbia abbiamo abbiano abbiate accidenti ad adesso affinche agl

@@ -1,4 +1,3 @@
"""
Example sentences to test spaCy and its language models.

@@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
ಹಲವು

@@ -1,4 +1,3 @@
"""
Example sentences to test spaCy and its language models.

@@ -1,4 +1,3 @@
# Source: https://github.com/stopwords-iso/stopwords-lv

STOP_WORDS = set(

@@ -1,4 +1,3 @@
# Source: https://github.com/stopwords-iso/stopwords-mr/blob/master/stopwords-mr.txt, https://github.com/6/stopwords-json/edit/master/dist/mr.json
STOP_WORDS = set(
"""

@@ -1,4 +1,3 @@
"""
Example sentences to test spaCy and its language models.

@@ -1,4 +1,3 @@
"""
Example sentences to test spaCy and its language models.

@@ -1,4 +1,3 @@
# These exceptions are used to add NORM values based on a token's ORTH value.
# Individual languages can also add their own exceptions and overwrite them -
# for example, British vs. American spelling in English.
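The comment block above describes the shared NORM exceptions table: it maps a token's ORTH to a normalized form that individual languages can extend or override. A purely illustrative sketch of its shape; these example entries are assumptions for illustration, not taken from the file in this diff:

# ORTH -> NORM, so that e.g. British and American spellings share one norm.
NORM_EXCEPTIONS = {
    "colour": "color",
    "favourite": "favorite",
    "theatre": "theater",
}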
@@ -1,4 +1,3 @@
"""
Example sentences to test spaCy and its language models.

@@ -1,4 +1,3 @@
"""
Example sentences to test spaCy and its language models.

@@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
à às área acerca ademais adeus agora ainda algo algumas alguns ali além ambas ambos antes

@@ -1,4 +1,3 @@
"""
Example sentences to test spaCy and its language models.

@@ -1,4 +1,3 @@
"""
Example sentences to test spaCy and its language models.

@@ -1,4 +1,3 @@
_exc = {
    # Slang
    "прив": "привет",

@@ -1,4 +1,3 @@
"""
Example sentences to test spaCy and its language models.

@@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
අතර
@@ -1,11 +1,16 @@
from .stop_words import STOP_WORDS
from .tag_map import TAG_MAP
from .lex_attrs import LEX_ATTRS

from ...language import Language
from ...attrs import LANG


class SlovakDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "sk"
    tag_map = TAG_MAP
    stop_words = STOP_WORDS
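In spaCy's per-language layout, a Defaults class like SlovakDefaults is normally paired with a Language subclass that sets the language code. A hedged sketch of that companion class; it is not shown in the hunk above, so treat it as an assumption about the rest of the file:

class Slovak(Language):
    lang = "sk"
    Defaults = SlovakDefaults


__all__ = ["Slovak"]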
23 spacy/lang/sk/examples.py Normal file
@@ -0,0 +1,23 @@
"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.sk.examples import sentences
>>> docs = nlp.pipe(sentences)
"""


sentences = [
    "Ardevop, s.r.o. je malá startup firma na území SR.",
    "Samojazdiace autá presúvajú poistnú zodpovednosť na výrobcov automobilov.",
    "Košice sú na východe.",
    "Bratislava je hlavné mesto Slovenskej republiky.",
    "Kde si?",
    "Kto je prezidentom Francúzska?",
    "Aké je hlavné mesto Slovenska?",
    "Kedy sa narodil Andrej Kiska?",
    "Včera som dostal 100€ na ruku.",
    "Dnes je nedeľa 26.1.2020.",
    "Narodil sa 15.4.1998 v Ružomberku.",
    "Niekto mi povedal, že 500 eur je veľa peňazí.",
    "Podaj mi ruku!",
]
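As the module docstring suggests, these sentences are meant to be piped through a Slovak pipeline. A quick smoke test, assuming a spaCy build that includes this branch so that "sk" is registered:

import spacy
from spacy.lang.sk.examples import sentences

# Blank Slovak pipeline: tokenizer plus the lexical attributes added in this diff.
nlp = spacy.blank("sk")
for doc in nlp.pipe(sentences):
    print([token.text for token in doc])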
59 spacy/lang/sk/lex_attrs.py Normal file
@@ -0,0 +1,59 @@
from ...attrs import LIKE_NUM

_num_words = [
    "nula",
    "jeden",
    "dva",
    "tri",
    "štyri",
    "päť",
    "šesť",
    "sedem",
    "osem",
    "deväť",
    "desať",
    "jedenásť",
    "dvanásť",
    "trinásť",
    "štrnásť",
    "pätnásť",
    "šestnásť",
    "sedemnásť",
    "osemnásť",
    "devätnásť",
    "dvadsať",
    "tridsať",
    "štyridsať",
    "päťdesiat",
    "šesťdesiat",
    "sedemdesiat",
    "osemdesiat",
    "deväťdesiat",
    "sto",
    "tisíc",
    "milión",
    "miliarda",
    "bilión",
    "biliarda",
    "trilión",
    "triliarda",
    "kvadrilión",
]


def like_num(text):
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if text.lower() in _num_words:
        return True
    return False


LEX_ATTRS = {LIKE_NUM: like_num}
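A few spot checks of the like_num helper defined above; the import path assumes this branch is installed, and the expected values follow directly from the function body:

from spacy.lang.sk.lex_attrs import like_num

print(like_num("100"))        # True: plain digits
print(like_num("3,5"))        # True: "," and "." are stripped before isdigit()
print(like_num("1/2"))        # True: numerator and denominator are both digits
print(like_num("devätnásť"))  # True: listed Slovak number word
print(like_num("ruka"))       # False: not numeric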
@@ -1,5 +1,4 @@
# Source: https://github.com/stopwords-iso/stopwords-sk
# Source: https://github.com/Ardevop-sk/stopwords-sk

STOP_WORDS = set(
"""
@@ -7,17 +6,41 @@ a
aby
aj
ak
akej
akejže
ako
akom
akomže
akou
akouže
akože
aká
akáže
aké
akého
akéhože
akému
akémuže
akéže
akú
akúže
aký
akých
akýchže
akým
akými
akýmiže
akýmže
akýže
ale
alebo
and
ani
asi
avšak
až
ba
bez
bezo
bol
bola
boli
@@ -28,23 +51,32 @@ budeme
budete
budeš
budú
buï
buď
by
byť
cez
cezo
dnes
do
ešte
for
ho
hoci
i
iba
ich
im
inej
inom
iná
iné
iného
inému
iní
inú
iný
iných
iným
inými
ja
je
jeho
@@ -53,80 +85,185 @@ jemu
ju
k
kam
kamže
každou
každá
každé
každého
každému
každí
každú
každý
každých
každým
každými
kde
kedže
keï
kej
kejže
keď
keďže
kie
kieho
kiehože
kiemu
kiemuže
kieže
koho
kom
komu
kou
kouže
kto
ktorej
ktorou
ktorá
ktoré
ktorí
ktorú
ktorý
ktorých
ktorým
ktorými
ku
ká
káže
ké
kéže
kú
kúže
ký
kýho
kýhože
kým
kýmu
kýmuže
kýže
lebo
leda
ledaže
len
ma
majú
mal
mala
mali
mať
medzi
menej
mi
mna
mne
mnou
moja
moje
mojej
mojich
mojim
mojimi
mojou
moju
možno
mu
musia
musieť
musí
musím
musíme
musíte
musíš
my
má
mám
máme
máte
mòa
máš
môcť
môj
môjho
môže
môžem
môžeme
môžete
môžeš
môžu
mňa
na
nad
nado
najmä
nami
naša
naše
našej
naši
našich
našim
našimi
našou
ne
nech
neho
nej
nejakej
nejakom
nejakou
nejaká
nejaké
nejakého
nejakému
nejakú
nejaký
nejakých
nejakým
nejakými
nemu
než
nich
nie
niektorej
niektorom
niektorou
niektorá
niektoré
niektorého
niektorému
niektorú
niektorý
niektorých
niektorým
niektorými
nielen
niečo
nim
nimi
nič
ničoho
ničom
ničomu
ničím
no
nová
nové
noví
nový
nám
nás
náš
nášho
ním
o
od
odo
of
on
ona
oni
ono
ony
oň
oňho
po
pod
podo
podľa
pokiaľ
popod
popri
potom
poza
pre
pred
predo
@@ -134,42 +271,56 @@ preto
pretože
prečo
pri
prvá
prvé
prví
prvý
práve
pýta
s
sa
seba
sebe
sebou
sem
si
sme
so
som
späť
ste
svoj
svoja
svoje
svojho
svojich
svojim
svojimi
svojou
svoju
svojím
svojími
sú
ta
tak
takej
takejto
taká
takáto
také
takého
takéhoto
takému
takémuto
takéto
takí
takú
takúto
taký
takýto
takže
tam
te
teba
tebe
tebou
teda
tej
tejto
ten
tento
the
ti
tie
tieto
@@ -177,52 +328,97 @@ tiež
to
toho
tohoto
tohto
tom
tomto
tomu
tomuto
toto
tou
touto
tu
tvoj
tvojími
tvoja
tvoje
tvojej
tvojho
tvoji
tvojich
tvojim
tvojimi
tvojím
ty
tá
táto
tí
títo
tú
túto
tých
tým
tými
týmto
tě
u
už
v
vami
vaša
vaše
veï
vašej
vaši
vašich
vašim
vaším
veď
viac
vo
vy
vám
vás
váš
vášho
však
všetci
všetka
všetko
všetky
všetok
z
za
začo
začože
zo
a
áno
èi
èo
èí
òom
òou
òu
čej
či
čia
čie
čieho
čiemu
čiu
čo
čoho
čom
čomu
čou
čože
čí
čím
čími
ďalšia
ďalšie
ďalšieho
ďalšiemu
ďalšiu
ďalšom
ďalšou
ďalší
ďalších
ďalším
ďalšími
ňom
ňou
ňu
že
""".split()
)
Some files were not shown because too many files have changed in this diff.