Mirror of https://github.com/explosion/spaCy.git (synced 2024-11-14 13:47:13 +03:00)

Commit a48e69d8d1: Merge master

.github/contributors/GiorgioPorgio.md (vendored, new file, 106 lines)

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term **"you"** shall mean
the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | George Ketsopoulos   |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 23 October 2019      |
| GitHub username                | GiorgioPorgio        |
| Website (optional)             |                      |

.github/contributors/PeterGilles.md (vendored, new file, 106 lines)

# spaCy contributor agreement

* [X] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Peter Gilles         |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 10.10.               |
| GitHub username                | PeterGilles          |
| Website (optional)             |                      |

.github/contributors/gustavengstrom.md (vendored, new file, 106 lines)

# spaCy contributor agreement

* [x] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Gustav Engström      |
| Company name (if applicable)   | Davcon               |
| Title or role (if applicable)  | Data scientist       |
| Date                           | 2019-10-10           |
| GitHub username                | gustavengstrom       |
| Website (optional)             |                      |

.github/contributors/neelkamath.md (vendored, new file, 106 lines)

# spaCy contributor agreement

* [x] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                  |
|------------------------------- | ---------------------- |
| Name                           | Neel Kamath            |
| Company name (if applicable)   |                        |
| Title or role (if applicable)  |                        |
| Date                           | October 30, 2019       |
| GitHub username                | neelkamath             |
| Website (optional)             | https://neelkamath.com |

.github/contributors/pberba.md (vendored, new file, 106 lines)

# spaCy contributor agreement

* [X] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Pepe Berba           |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2019-10-18           |
| GitHub username                | pberba               |
| Website (optional)             |                      |

.github/contributors/prilopes.md (vendored, new file, 106 lines)

# spaCy contributor agreement

* [x] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Priscilla Lopes      |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2019-11-06           |
| GitHub username                | prilopes             |
| Website (optional)             |                      |

.github/contributors/questoph.md (vendored, new file, 106 lines)

# spaCy contributor agreement

* [x] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                    |
|------------------------------- | ------------------------ |
| Name                           | Christoph Purschke       |
| Company name (if applicable)   | University of Luxembourg |
| Title or role (if applicable)  |                          |
| Date                           | 14/11/2019               |
| GitHub username                | questoph                 |
| Website (optional)             | https://purschke.info    |

.github/contributors/zhuorulin.md (vendored, new file, 106 lines)

# spaCy contributor agreement

* [x] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                    |
|------------------------------- | ------------------------ |
| Name                           | Zhuoru Lin               |
| Company name (if applicable)   | Bombora Inc.             |
| Title or role (if applicable)  | Data Scientist           |
| Date                           | 2017-11-13               |
| GitHub username                | ZhuoruLin                |
| Website (optional)             |                          |

Makefile (2 changed lines)

@@ -9,7 +9,7 @@ dist/spacy.pex : dist/spacy-$(sha).pex
 
 dist/spacy-$(sha).pex : dist/$(wheel)
 	env3.6/bin/python -m pip install pex==1.5.3
-	env3.6/bin/pex pytest dist/$(wheel) -e spacy -o dist/spacy-$(sha).pex
+	env3.6/bin/pex pytest dist/$(wheel) spacy_lookups_data -e spacy -o dist/spacy-$(sha).pex
 
 dist/$(wheel) : setup.py spacy/*.py* spacy/*/*.py*
 	python3.6 -m venv env3.6

README.md (20 changed lines)

@@ -135,8 +135,7 @@ Thanks to our great community, we've finally re-added conda support. You can now
 install spaCy via `conda-forge`:
 
 ```bash
-conda config --add channels conda-forge
-conda install spacy
+conda install -c conda-forge spacy
 ```
 
 For the feedstock including the build recipe and configuration, check out

@@ -181,9 +180,6 @@ pointing pip to a path or URL.
 # download best-matching version of specific model for your spaCy installation
 python -m spacy download en_core_web_sm
 
-# out-of-the-box: download best-matching default model
-python -m spacy download en
-
 # pip install .tar.gz archive from path or URL
 pip install /Users/you/en_core_web_sm-2.2.0.tar.gz
 pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz

@@ -197,7 +193,7 @@ path to the model data directory.
 ```python
 import spacy
 nlp = spacy.load("en_core_web_sm")
-doc = nlp(u"This is a sentence.")
+doc = nlp("This is a sentence.")
 ```

@@ -208,22 +204,12 @@ import spacy
 import en_core_web_sm
 
 nlp = en_core_web_sm.load()
-doc = nlp(u"This is a sentence.")
+doc = nlp("This is a sentence.")
 ```
 
 📖 **For more info and examples, check out the
 [models documentation](https://spacy.io/docs/usage/models).**
 
-### Support for older versions
-
-If you're using an older version (`v1.6.0` or below), you can still download and
-install the old models from within spaCy using `python -m spacy.en.download all`
-or `python -m spacy.de.download all`. The `.tar.gz` archives are also
-[attached to the v1.6.0 release](https://github.com/explosion/spaCy/tree/v1.6.0).
-To download and install the models manually, unpack the archive, drop the
-contained directory into `spacy/data` and load the model via `spacy.load('en')`
-or `spacy.load('de')`.
-
 ## Compile from source
 
 The other way to install spaCy is to clone its

@@ -9,6 +9,11 @@ trigger:
   exclude:
     - 'website/*'
     - '*.md'
+pr:
+  paths:
+    exclude:
+      - 'website/*'
+      - '*.md'
 
 jobs:

@@ -30,24 +35,12 @@ jobs:
     dependsOn: 'Validate'
     strategy:
       matrix:
-        # Python 2.7 currently doesn't work because it seems to be a narrow
-        # unicode build, which causes problems with the regular expressions
-
-        # Python27Linux:
-        #   imageName: 'ubuntu-16.04'
-        #   python.version: '2.7'
-        # Python27Mac:
-        #   imageName: 'macos-10.13'
-        #   python.version: '2.7'
         Python35Linux:
           imageName: 'ubuntu-16.04'
           python.version: '3.5'
         Python35Windows:
           imageName: 'vs2017-win2016'
           python.version: '3.5'
         Python35Mac:
           imageName: 'macos-10.13'
           python.version: '3.5'
         Python36Linux:
           imageName: 'ubuntu-16.04'
           python.version: '3.6'

@@ -66,6 +59,15 @@ jobs:
         Python37Mac:
           imageName: 'macos-10.13'
           python.version: '3.7'
+        Python38Linux:
+          imageName: 'ubuntu-16.04'
+          python.version: '3.8'
+        Python38Windows:
+          imageName: 'vs2017-win2016'
+          python.version: '3.8'
+        Python38Mac:
+          imageName: 'macos-10.13'
+          python.version: '3.8'
       maxParallel: 4
     pool:
       vmImage: $(imageName)

@@ -76,10 +78,8 @@ jobs:
       versionSpec: '$(python.version)'
       architecture: 'x64'
 
-    # Downgrading pip is necessary to prevent a wheel version incompatiblity.
-    # Might be fixed in the future or some other way, so investigate again.
     - script: |
-        python -m pip install -U pip==18.1 setuptools
+        python -m pip install -U setuptools
         pip install -r requirements.txt
       displayName: 'Install dependencies'

@@ -84,7 +84,7 @@ def read_conllu(file_):
 def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
     if text_loc.parts[-1].endswith(".conllu"):
         docs = []
-        with text_loc.open() as file_:
+        with text_loc.open(encoding="utf8") as file_:
             for conllu_doc in read_conllu(file_):
                 for conllu_sent in conllu_doc:
                     words = [line[1] for line in conllu_sent]
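
(A note on this and the similar hunks below: `Path.open()` without an explicit `encoding` falls back to the platform's locale encoding, e.g. cp1252 on Windows, so UTF-8 `.conllu` treebanks containing non-ASCII tokens can be silently mis-decoded or rejected there. A minimal sketch of the failure mode and the fix; the file name is hypothetical:)

```python
from pathlib import Path

# Hypothetical UTF-8 file, for illustration only.
path = Path("example.conllu")
path.write_text("1\tdéjà\tdéjà\n", encoding="utf8")

# Locale-dependent: a cp1252 locale turns "déjà" into mojibake,
# and stricter locale codecs raise UnicodeDecodeError instead.
# text = path.open().read()

# Deterministic on every platform, as in this commit:
text = path.open(encoding="utf8").read()
assert "déjà" in text
```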

@@ -7,7 +7,6 @@ from __future__ import unicode_literals
 import plac
 from pathlib import Path
 import re
-import sys
 import json
 
 import spacy

@@ -19,12 +18,9 @@ from spacy.util import compounding, minibatch, minibatch_by_words
 from spacy.syntax.nonproj import projectivize
 from spacy.matcher import Matcher
 from spacy import displacy
-from collections import defaultdict, Counter
-from timeit import default_timer as timer
+from collections import defaultdict
 
-import itertools
 import random
-import numpy.random
 
 from spacy import lang
 from spacy.lang import zh

@@ -203,7 +199,7 @@ def golds_to_gold_tuples(docs, golds):
 def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
     if text_loc.parts[-1].endswith(".conllu"):
         docs = []
-        with text_loc.open() as file_:
+        with text_loc.open(encoding="utf8") as file_:
             for conllu_doc in read_conllu(file_):
                 for conllu_sent in conllu_doc:
                     words = [line[1] for line in conllu_sent]

@@ -225,6 +221,13 @@ def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
 
 
 def write_conllu(docs, file_):
+    if not Token.has_extension("get_conllu_lines"):
+        Token.set_extension("get_conllu_lines", method=get_token_conllu)
+    if not Token.has_extension("begins_fused"):
+        Token.set_extension("begins_fused", default=False)
+    if not Token.has_extension("inside_fused"):
+        Token.set_extension("inside_fused", default=False)
+
     merger = Matcher(docs[0].vocab)
     merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}])
     for i, doc in enumerate(docs):

@@ -323,10 +326,6 @@ def get_token_conllu(token, i):
     return "\n".join(lines)
 
 
-Token.set_extension("get_conllu_lines", method=get_token_conllu, force=True)
-Token.set_extension("begins_fused", default=False, force=True)
-Token.set_extension("inside_fused", default=False, force=True)
-
 
 ##################
 # Initialization #

@@ -378,7 +377,7 @@ def _load_pretrained_tok2vec(nlp, loc):
     """Load pretrained weights for the 'token-to-vector' part of the component
     models, which is typically a CNN. See 'spacy pretrain'. Experimental.
     """
-    with Path(loc).open("rb") as file_:
+    with Path(loc).open("rb", encoding="utf8") as file_:
         weights_data = file_.read()
     loaded = []
     for name, component in nlp.pipeline:

@@ -459,13 +458,13 @@ class TreebankPaths(object):
 
 @plac.annotations(
     ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
-    parses_dir=("Directory to write the development parses", "positional", None, Path),
     corpus=(
-        "UD corpus to train and evaluate on, e.g. en, es_ancora, etc",
+        "UD corpus to train and evaluate on, e.g. UD_Spanish-AnCora",
         "positional",
         None,
         str,
     ),
+    parses_dir=("Directory to write the development parses", "positional", None, Path),
     config=("Path to json formatted config file", "option", "C", Path),
     limit=("Size limit", "option", "n", int),
     gpu_device=("Use GPU", "option", "g", int),
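
(For readers unfamiliar with `plac`, which drives this CLI: each annotation is a `(help, kind, abbrev, type)` tuple, and positional argument order is taken from the decorated function's signature rather than from the keyword order in the decorator, so moving `parses_dir` below `corpus` here does not change the command line. A self-contained sketch:)

```python
import plac

@plac.annotations(
    corpus=("UD corpus to train and evaluate on, e.g. UD_Spanish-AnCora", "positional", None, str),
    limit=("Size limit", "option", "n", int),
)
def main(corpus, limit=0):
    # plac maps "corpus" to a positional argument and "limit" to -n/--limit.
    print(corpus, limit)

if __name__ == "__main__":
    plac.call(main)  # e.g.: python demo.py UD_Spanish-AnCora -n 10
```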

@@ -490,6 +489,10 @@ def main(
     # temp fix to avoid import issues cf https://github.com/explosion/spaCy/issues/4200
     import tqdm
 
+    Token.set_extension("get_conllu_lines", method=get_token_conllu)
+    Token.set_extension("begins_fused", default=False)
+    Token.set_extension("inside_fused", default=False)
+
     spacy.util.fix_random_seed()
     lang.zh.Chinese.Defaults.use_jieba = False
     lang.ja.Japanese.Defaults.use_janome = False
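
(This complements the `write_conllu` hunk above: rather than registering the custom `Token` attributes at import time with `force=True`, which silently overwrites any existing registration, the extensions are now registered once in `main()` and guarded with `has_extension` where needed — `Token.set_extension` raises a `ValueError` if the attribute already exists and `force` is not set. A minimal sketch of the guarded idiom:)

```python
from spacy.tokens import Token

def ensure_extension(name, **kwargs):
    # Registering the same extension twice without force=True raises
    # ValueError, so check has_extension first.
    if not Token.has_extension(name):
        Token.set_extension(name, **kwargs)

ensure_extension("begins_fused", default=False)
ensure_extension("begins_fused", default=False)  # no-op on the second call
```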

@@ -506,8 +509,8 @@ def main(
 
     docs, golds = read_data(
         nlp,
-        paths.train.conllu.open(),
-        paths.train.text.open(),
+        paths.train.conllu.open(encoding="utf8"),
+        paths.train.text.open(encoding="utf8"),
         max_doc_length=config.max_doc_length,
         limit=limit,
     )

@@ -519,8 +522,8 @@ def main(
     for i in range(config.nr_epoch):
         docs, golds = read_data(
             nlp,
-            paths.train.conllu.open(),
-            paths.train.text.open(),
+            paths.train.conllu.open(encoding="utf8"),
+            paths.train.text.open(encoding="utf8"),
             max_doc_length=config.max_doc_length,
             limit=limit,
             oracle_segments=use_oracle_segments,
@ -560,7 +563,7 @@ def main(
|
|||
|
||||
def _render_parses(i, to_render):
|
||||
to_render[0].user_data["title"] = "Batch %d" % i
|
||||
with Path("/tmp/parses.html").open("w") as file_:
|
||||
with Path("/tmp/parses.html").open("w", encoding="utf8") as file_:
|
||||
html = displacy.render(to_render[:5], style="dep", page=True)
|
||||
file_.write(html)
|
||||
|
||||
|
|
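The hunks above move the custom `Token` extensions from unconditional module-level registration to guarded registration inside the functions that need them. A minimal sketch of the two registration patterns (not part of the diff; `begins_fused` is one of the extension names used above):

```python
from spacy.tokens import Token

# Guarded registration: safe to call repeatedly (the pattern added in write_conllu above).
if not Token.has_extension("begins_fused"):
    Token.set_extension("begins_fused", default=False)

# Unconditional overwrite: the module-level pattern the diff removes.
Token.set_extension("begins_fused", default=False, force=True)
```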
bin/wiki_entity_linking/README.md (new file, 34 lines)

@@ -0,0 +1,34 @@
## Entity Linking with Wikipedia and Wikidata

### Step 1: Create a Knowledge Base (KB) and training data

Run `wikipedia_pretrain_kb.py`
* This takes as input the locations of a **Wikipedia and a Wikidata dump**, and produces a **KB directory** + **training file**
  * WikiData: get `latest-all.json.bz2` from https://dumps.wikimedia.org/wikidatawiki/entities/
  * Wikipedia: get `enwiki-latest-pages-articles-multistream.xml.bz2` from https://dumps.wikimedia.org/enwiki/latest/ (or the equivalent dump for any other language)
* You can set the filtering parameters for KB construction:
  * `max_per_alias`: maximum number of candidate entities in the KB per alias/synonym
  * `min_freq`: minimum number of times an entity should occur in the corpus to be included in the KB
  * `min_pair`: minimum number of times an entity+alias combination should occur in the corpus to be included in the KB
* Further parameters to set:
  * `descriptions_from_wikipedia`: whether to parse descriptions from Wikipedia (`True`) or Wikidata (`False`)
  * `entity_vector_length`: length of the pre-trained entity description vectors
  * `lang`: language for which to fetch Wikidata information (as the dump contains all languages)

Quick testing and rerunning:
* When trying out the pipeline for a quick test, set `limit_prior`, `limit_train` and/or `limit_wd` to read only parts of the dumps instead of everything (see the sketch after this list).
* If you only want to (re)run certain parts of the pipeline, just remove the corresponding files and they will be recalculated or reparsed.
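A minimal sketch of such a limited test run (not in the original README; the module path and the `model` value are assumptions, while the argument names follow the script's `main()` as shown in the diff further below):

```python
from pathlib import Path
from bin.wiki_entity_linking.wikidata_pretrain_kb import main  # module path assumed

main(
    wd_json=Path("latest-all.json.bz2"),   # Wikidata dump
    wp_xml=Path("enwiki-latest-pages-articles-multistream.xml.bz2"),  # Wikipedia dump
    output_dir=Path("wd_output"),          # KB directory to create
    model="en_core_web_lg",                # any model with pretrained vectors; name assumed
    limit_prior=10000,   # read only part of the WP dump for prior probabilities
    limit_train=10000,   # read only part of the WP dump for the training set
    limit_wd=10000,      # read only part of the WD dump
)
```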

### Step 2: Train an Entity Linking model

Run `wikidata_train_entity_linker.py`
* This takes the **KB directory** produced by Step 1, and trains an **Entity Linking model** (see the sketch after this list)
* You can set the learning parameters for the EL training:
  * `epochs`: number of training iterations
  * `dropout`: dropout rate
  * `lr`: learning rate
  * `l2`: L2 regularization
* Specify the number of training and dev testing entities with `train_inst` and `dev_inst` respectively
* Further parameters to set:
  * `labels_discard`: NER label types to discard during training
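A hypothetical invocation (not in the original README; module path and the numeric values are placeholders, while the parameter names follow the script's `main()` shown later in this diff):

```python
from pathlib import Path
from bin.wiki_entity_linking.wikidata_train_entity_linker import main  # module path assumed

main(
    dir_kb=Path("wd_output"),   # KB directory from Step 1
    epochs=10,                  # placeholder values for the learning parameters
    dropout=0.5,
    lr=0.005,
    l2=1e-6,
    train_inst=1000,            # cap on training entities
    dev_inst=200,               # cap on dev entities
    labels_discard="PERSON",    # comma-separated NER types to skip
)
```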
@@ -6,6 +6,7 @@ OUTPUT_MODEL_DIR = "nlp"
PRIOR_PROB_PATH = "prior_prob.csv"
ENTITY_DEFS_PATH = "entity_defs.csv"
ENTITY_FREQ_PATH = "entity_freq.csv"
ENTITY_ALIAS_PATH = "entity_alias.csv"
ENTITY_DESCR_PATH = "entity_descriptions.csv"

LOG_FORMAT = '%(asctime)s - %(levelname)s - %(name)s - %(message)s'
@@ -15,10 +15,11 @@ class Metrics(object):
        candidate_is_correct = true_entity == candidate

        # Assume that we have no labeled negatives in the data (i.e. cases where true_entity is "NIL")
        # Therefore, if candidate_is_correct then we have a true positive and never a true negative
        # Therefore, if candidate_is_correct then we have a true positive and never a true negative.
        self.true_pos += candidate_is_correct
        self.false_neg += not candidate_is_correct
        if candidate not in {"", "NIL"}:
        if candidate and candidate not in {"", "NIL"}:
            # A wrong prediction (e.g. Q42 != Q3) counts both as a FP as well as a FN.
            self.false_pos += not candidate_is_correct

    def calculate_precision(self):

@@ -33,6 +34,14 @@ class Metrics(object):
        else:
            return self.true_pos / (self.true_pos + self.false_neg)

    def calculate_fscore(self):
        p = self.calculate_precision()
        r = self.calculate_recall()
        if p + r == 0:
            return 0.0
        else:
            return 2 * p * r / (p + r)


class EvaluationResults(object):
    def __init__(self):

@@ -43,18 +52,20 @@ class EvaluationResults(object):
        self.metrics.update_results(true_entity, candidate)
        self.metrics_by_label[ent_label].update_results(true_entity, candidate)

    def increment_false_negatives(self):
        self.metrics.false_neg += 1

    def report_metrics(self, model_name):
        model_str = model_name.title()
        recall = self.metrics.calculate_recall()
        precision = self.metrics.calculate_precision()
        return ("{}: ".format(model_str) +
                "Recall = {} | ".format(round(recall, 3)) +
                "Precision = {} | ".format(round(precision, 3)) +
                "Precision by label = {}".format({k: v.calculate_precision()
                                                  for k, v in self.metrics_by_label.items()}))
        fscore = self.metrics.calculate_fscore()
        return (
            "{}: ".format(model_str)
            + "F-score = {} | ".format(round(fscore, 3))
            + "Recall = {} | ".format(round(recall, 3))
            + "Precision = {} | ".format(round(precision, 3))
            + "F-score by label = {}".format(
                {k: v.calculate_fscore() for k, v in sorted(self.metrics_by_label.items())}
            )
        )


class BaselineResults(object):
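A quick illustration (not part of the diff) of how these evaluation helpers fit together. The F-score added above is the harmonic mean F = 2PR / (P + R). The import path and the Wikidata IDs are assumptions for the example:

```python
from bin.wiki_entity_linking.entity_linker_evaluation import EvaluationResults  # module assumed

results = EvaluationResults()
results.update_metrics("PERSON", true_entity="Q42", candidate="Q42")  # true positive
results.update_metrics("PERSON", true_entity="Q42", candidate="Q3")   # counts as FP and FN
results.update_metrics("LOC", true_entity="Q64", candidate="NIL")     # false negative only

# Overall: TP=1, FP=1, FN=2 -> P=0.5, R=0.333, F=0.4
print(results.report_metrics("baseline"))
```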
@@ -63,25 +74,32 @@ class BaselineResults(object):
        self.prior = EvaluationResults()
        self.oracle = EvaluationResults()

    def report_accuracy(self, model):
    def report_performance(self, model):
        results = getattr(self, model)
        return results.report_metrics(model)

    def update_baselines(self, true_entity, ent_label, random_candidate, prior_candidate, oracle_candidate):
    def update_baselines(
        self,
        true_entity,
        ent_label,
        random_candidate,
        prior_candidate,
        oracle_candidate,
    ):
        self.oracle.update_metrics(ent_label, true_entity, oracle_candidate)
        self.prior.update_metrics(ent_label, true_entity, prior_candidate)
        self.random.update_metrics(ent_label, true_entity, random_candidate)


def measure_performance(dev_data, kb, el_pipe):
    baseline_accuracies = measure_baselines(
        dev_data, kb
    )

    logger.info(baseline_accuracies.report_accuracy("random"))
    logger.info(baseline_accuracies.report_accuracy("prior"))
    logger.info(baseline_accuracies.report_accuracy("oracle"))
def measure_performance(dev_data, kb, el_pipe, baseline=True, context=True):
    if baseline:
        baseline_accuracies, counts = measure_baselines(dev_data, kb)
        logger.info("Counts: {}".format({k: v for k, v in sorted(counts.items())}))
        logger.info(baseline_accuracies.report_performance("random"))
        logger.info(baseline_accuracies.report_performance("prior"))
        logger.info(baseline_accuracies.report_performance("oracle"))

    if context:
        # using only context
        el_pipe.cfg["incl_context"] = True
        el_pipe.cfg["incl_prior"] = False

@@ -96,7 +114,11 @@ def measure_performance(dev_data, kb, el_pipe):


def get_eval_results(data, el_pipe=None):
    # If the docs in the data require further processing with an entity linker, set el_pipe
    """
    Evaluate the ent.kb_id_ annotations against the gold standard.
    Only evaluate entities that overlap between gold and NER, to isolate the performance of the NEL.
    If the docs in the data require further processing with an entity linker, set el_pipe.
    """
    from tqdm import tqdm

    docs = []

@@ -111,18 +133,15 @@ def get_eval_results(data, el_pipe=None):

    results = EvaluationResults()
    for doc, gold in zip(docs, golds):
        tagged_entries_per_article = {_offset(ent.start_char, ent.end_char): ent for ent in doc.ents}
        try:
            correct_entries_per_article = dict()
            for entity, kb_dict in gold.links.items():
                start, end = entity
                # only evaluating on positive examples
                for gold_kb, value in kb_dict.items():
                    if value:
                        # only evaluating on positive examples
                        offset = _offset(start, end)
                        correct_entries_per_article[offset] = gold_kb
                        if offset not in tagged_entries_per_article:
                            results.increment_false_negatives()

        for ent in doc.ents:
            ent_label = ent.label_

@@ -142,7 +161,11 @@ def get_eval_results(data, el_pipe=None):


def measure_baselines(data, kb):
    # Measure 3 performance baselines: random selection, prior probabilities, and 'oracle' prediction for upper bound
    """
    Measure 3 performance baselines: random selection, prior probabilities, and 'oracle' prediction for upper bound.
    Only evaluate entities that overlap between gold and NER, to isolate the performance of the NEL.
    Also return a dictionary of counts by entity label.
    """
    counts_d = dict()

    baseline_results = BaselineResults()

@@ -152,7 +175,6 @@ def measure_baselines(data, kb):

    for doc, gold in zip(docs, golds):
        correct_entries_per_article = dict()
        tagged_entries_per_article = {_offset(ent.start_char, ent.end_char): ent for ent in doc.ents}
        for entity, kb_dict in gold.links.items():
            start, end = entity
            for gold_kb, value in kb_dict.items():

@@ -160,10 +182,6 @@ def measure_baselines(data, kb):
                if value:
                    offset = _offset(start, end)
                    correct_entries_per_article[offset] = gold_kb
                    if offset not in tagged_entries_per_article:
                        baseline_results.random.increment_false_negatives()
                        baseline_results.oracle.increment_false_negatives()
                        baseline_results.prior.increment_false_negatives()

        for ent in doc.ents:
            ent_label = ent.label_

@@ -176,7 +194,7 @@ def measure_baselines(data, kb):
            if gold_entity is not None:
                candidates = kb.get_candidates(ent.text)
                oracle_candidate = ""
                best_candidate = ""
                prior_candidate = ""
                random_candidate = ""
                if candidates:
                    scores = []

@@ -187,13 +205,21 @@ def measure_baselines(data, kb):
                        oracle_candidate = c.entity_

                    best_index = scores.index(max(scores))
                    best_candidate = candidates[best_index].entity_
                    prior_candidate = candidates[best_index].entity_
                    random_candidate = random.choice(candidates).entity_

                baseline_results.update_baselines(gold_entity, ent_label,
                                                  random_candidate, best_candidate, oracle_candidate)
                current_count = counts_d.get(ent_label, 0)
                counts_d[ent_label] = current_count + 1

    return baseline_results
                baseline_results.update_baselines(
                    gold_entity,
                    ent_label,
                    random_candidate,
                    prior_candidate,
                    oracle_candidate,
                )

    return baseline_results, counts_d


def _offset(start, end):
@@ -1,17 +1,12 @@
# coding: utf-8
from __future__ import unicode_literals

import csv
import logging
import spacy
import sys

from spacy.kb import KnowledgeBase

from bin.wiki_entity_linking import wikipedia_processor as wp
from bin.wiki_entity_linking.train_descriptions import EntityEncoder

csv.field_size_limit(sys.maxsize)
from bin.wiki_entity_linking import wiki_io as io


logger = logging.getLogger(__name__)

@@ -22,18 +17,24 @@ def create_kb(
    max_entities_per_alias,
    min_entity_freq,
    min_occ,
    entity_def_input,
    entity_def_path,
    entity_descr_path,
    count_input,
    prior_prob_input,
    entity_alias_path,
    entity_freq_path,
    prior_prob_path,
    entity_vector_length,
):
    # Create the knowledge base from Wikidata entries
    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=entity_vector_length)
    entity_list, filtered_title_to_id = _define_entities(nlp, kb, entity_def_path, entity_descr_path, min_entity_freq, entity_freq_path, entity_vector_length)
    _define_aliases(kb, entity_alias_path, entity_list, filtered_title_to_id, max_entities_per_alias, min_occ, prior_prob_path)
    return kb


def _define_entities(nlp, kb, entity_def_path, entity_descr_path, min_entity_freq, entity_freq_path, entity_vector_length):
    # read the mappings from file
    title_to_id = get_entity_to_id(entity_def_input)
    id_to_descr = get_id_to_description(entity_descr_path)
    title_to_id = io.read_title_to_id(entity_def_path)
    id_to_descr = io.read_id_to_descr(entity_descr_path)

    # check the length of the nlp vectors
    if "vectors" in nlp.meta and nlp.vocab.vectors.size:

@@ -45,10 +46,8 @@ def create_kb(
        " cf. https://spacy.io/usage/models#languages."
    )

    logger.info("Get entity frequencies")
    entity_frequencies = wp.get_all_frequencies(count_input=count_input)

    logger.info("Filtering entities with fewer than {} mentions".format(min_entity_freq))
    entity_frequencies = io.read_entity_to_count(entity_freq_path)
    # filter the entities in the KB by frequency, because there's just too much data (8M entities) otherwise
    filtered_title_to_id, entity_list, description_list, frequency_list = get_filtered_entities(
        title_to_id,

@@ -56,36 +55,33 @@ def create_kb(
        entity_frequencies,
        min_entity_freq
    )
    logger.info("Left with {} entities".format(len(description_list)))
    logger.info("Kept {} entities from the set of {}".format(len(description_list), len(title_to_id.keys())))

    logger.info("Train entity encoder")
    logger.info("Training entity encoder")
    encoder = EntityEncoder(nlp, input_dim, entity_vector_length)
    encoder.train(description_list=description_list, to_print=True)

    logger.info("Get entity embeddings:")
    logger.info("Getting entity embeddings")
    embeddings = encoder.apply_encoder(description_list)

    logger.info("Adding {} entities".format(len(entity_list)))
    kb.set_entities(
        entity_list=entity_list, freq_list=frequency_list, vector_list=embeddings
    )
    return entity_list, filtered_title_to_id

    logger.info("Adding aliases")

def _define_aliases(kb, entity_alias_path, entity_list, filtered_title_to_id, max_entities_per_alias, min_occ, prior_prob_path):
    logger.info("Adding aliases from Wikipedia and Wikidata")
    _add_aliases(
        kb,
        entity_list=entity_list,
        title_to_id=filtered_title_to_id,
        max_entities_per_alias=max_entities_per_alias,
        min_occ=min_occ,
        prior_prob_input=prior_prob_input,
        prior_prob_path=prior_prob_path,
    )

    logger.info("KB size: {} entities, {} aliases".format(
        kb.get_size_entities(),
        kb.get_size_aliases()))

    logger.info("Done with kb")
    return kb


def get_filtered_entities(title_to_id, id_to_descr, entity_frequencies,
                          min_entity_freq: int = 10):

@@ -104,34 +100,13 @@ def get_filtered_entities(title_to_id, id_to_descr, entity_frequencies,
    return filtered_title_to_id, entity_list, description_list, frequency_list


def get_entity_to_id(entity_def_output):
    entity_to_id = dict()
    with entity_def_output.open("r", encoding="utf8") as csvfile:
        csvreader = csv.reader(csvfile, delimiter="|")
        # skip header
        next(csvreader)
        for row in csvreader:
            entity_to_id[row[0]] = row[1]
    return entity_to_id


def get_id_to_description(entity_descr_path):
    id_to_desc = dict()
    with entity_descr_path.open("r", encoding="utf8") as csvfile:
        csvreader = csv.reader(csvfile, delimiter="|")
        # skip header
        next(csvreader)
        for row in csvreader:
            id_to_desc[row[0]] = row[1]
    return id_to_desc


def _add_aliases(kb, title_to_id, max_entities_per_alias, min_occ, prior_prob_input):
def _add_aliases(kb, entity_list, title_to_id, max_entities_per_alias, min_occ, prior_prob_path):
    wp_titles = title_to_id.keys()

    # adding aliases with prior probabilities
    # we can read this file sequentially, it's sorted by alias, and then by count
    with prior_prob_input.open("r", encoding="utf8") as prior_file:
    logger.info("Adding WP aliases")
    with prior_prob_path.open("r", encoding="utf8") as prior_file:
        # skip header
        prior_file.readline()
        line = prior_file.readline()

@@ -180,10 +155,7 @@ def _add_aliases(kb, title_to_id, max_entities_per_alias, min_occ, prior_prob_in
        line = prior_file.readline()


def read_nlp_kb(model_dir, kb_file):
    nlp = spacy.load(model_dir)
def read_kb(nlp, kb_file):
    kb = KnowledgeBase(vocab=nlp.vocab)
    kb.load_bulk(kb_file)
    logger.info("kb entities: {}".format(kb.get_size_entities()))
    logger.info("kb aliases: {}".format(kb.get_size_aliases()))
    return nlp, kb
    return kb
@@ -53,7 +53,7 @@ class EntityEncoder:

            start = start + batch_size
            stop = min(stop + batch_size, len(description_list))
            logger.info("encoded: {} entities".format(stop))
            logger.info("Encoded: {} entities".format(stop))

        return encodings

@@ -62,7 +62,7 @@ class EntityEncoder:
        if to_print:
            logger.info(
                "Trained entity descriptions on {} ".format(processed) +
                "(non-unique) entities across {} ".format(self.epochs) +
                "(non-unique) descriptions across {} ".format(self.epochs) +
                "epochs"
            )
            logger.info("Final loss: {}".format(loss))
@@ -1,395 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals

import logging
import random
import re
import bz2
import json

from functools import partial

from spacy.gold import GoldParse
from bin.wiki_entity_linking import kb_creator

"""
Process Wikipedia interlinks to generate a training dataset for the EL algorithm.
Gold-standard entities are stored in one file in standoff format (by character offset).
"""

ENTITY_FILE = "gold_entities.csv"
logger = logging.getLogger(__name__)


def create_training_examples_and_descriptions(wikipedia_input,
                                              entity_def_input,
                                              description_output,
                                              training_output,
                                              parse_descriptions,
                                              limit=None):
    wp_to_id = kb_creator.get_entity_to_id(entity_def_input)
    _process_wikipedia_texts(wikipedia_input,
                             wp_to_id,
                             description_output,
                             training_output,
                             parse_descriptions,
                             limit)


def _process_wikipedia_texts(wikipedia_input,
                             wp_to_id,
                             output,
                             training_output,
                             parse_descriptions,
                             limit=None):
    """
    Read the XML wikipedia data to parse out training data:
    raw text data + positive instances
    """
    title_regex = re.compile(r"(?<=<title>).*(?=</title>)")
    id_regex = re.compile(r"(?<=<id>)\d*(?=</id>)")

    read_ids = set()

    with output.open("a", encoding="utf8") as descr_file, training_output.open("w", encoding="utf8") as entity_file:
        if parse_descriptions:
            _write_training_description(descr_file, "WD_id", "description")
        with bz2.open(wikipedia_input, mode="rb") as file:
            article_count = 0
            article_text = ""
            article_title = None
            article_id = None
            reading_text = False
            reading_revision = False

            logger.info("Processed {} articles".format(article_count))

            for line in file:
                clean_line = line.strip().decode("utf-8")

                if clean_line == "<revision>":
                    reading_revision = True
                elif clean_line == "</revision>":
                    reading_revision = False

                # Start reading new page
                if clean_line == "<page>":
                    article_text = ""
                    article_title = None
                    article_id = None
                # finished reading this page
                elif clean_line == "</page>":
                    if article_id:
                        clean_text, entities = _process_wp_text(
                            article_title,
                            article_text,
                            wp_to_id
                        )
                        if clean_text is not None and entities is not None:
                            _write_training_entities(entity_file,
                                                     article_id,
                                                     clean_text,
                                                     entities)

                            if article_title in wp_to_id and parse_descriptions:
                                description = " ".join(clean_text[:1000].split(" ")[:-1])
                                _write_training_description(
                                    descr_file,
                                    wp_to_id[article_title],
                                    description
                                )
                            article_count += 1
                            if article_count % 10000 == 0:
                                logger.info("Processed {} articles".format(article_count))
                            if limit and article_count >= limit:
                                break
                    article_text = ""
                    article_title = None
                    article_id = None
                    reading_text = False
                    reading_revision = False

                # start reading text within a page
                if "<text" in clean_line:
                    reading_text = True

                if reading_text:
                    article_text += " " + clean_line

                # stop reading text within a page (we assume a new page doesn't start on the same line)
                if "</text" in clean_line:
                    reading_text = False

                # read the ID of this article (outside the revision portion of the document)
                if not reading_revision:
                    ids = id_regex.search(clean_line)
                    if ids:
                        article_id = ids[0]
                        if article_id in read_ids:
                            logger.info(
                                "Found duplicate article ID", article_id, clean_line
                            )  # This should never happen ...
                        read_ids.add(article_id)

                # read the title of this article (outside the revision portion of the document)
                if not reading_revision:
                    titles = title_regex.search(clean_line)
                    if titles:
                        article_title = titles[0].strip()
    logger.info("Finished. Processed {} articles".format(article_count))


text_regex = re.compile(r"(?<=<text xml:space=\"preserve\">).*(?=</text)")
info_regex = re.compile(r"{[^{]*?}")
htlm_regex = re.compile(r"&lt;!--[^-]*--&gt;")
category_regex = re.compile(r"\[\[Category:[^\[]*]]")
file_regex = re.compile(r"\[\[File:[^[\]]+]]")
ref_regex = re.compile(r"&lt;ref.*?&gt;")  # non-greedy
ref_2_regex = re.compile(r"&lt;/ref.*?&gt;")  # non-greedy


def _process_wp_text(article_title, article_text, wp_to_id):
    # ignore meta Wikipedia pages
    if (
        article_title.startswith("Wikipedia:") or
        article_title.startswith("Kategori:")
    ):
        return None, None

    # remove the text tags
    text_search = text_regex.search(article_text)
    if text_search is None:
        return None, None
    text = text_search.group(0)

    # stop processing if this is a redirect page
    if text.startswith("#REDIRECT"):
        return None, None

    # get the raw text without markup etc, keeping only interwiki links
    clean_text, entities = _remove_links(_get_clean_wp_text(text), wp_to_id)
    return clean_text, entities


def _get_clean_wp_text(article_text):
    clean_text = article_text.strip()

    # remove bolding & italic markup
    clean_text = clean_text.replace("'''", "")
    clean_text = clean_text.replace("''", "")

    # remove nested {{info}} statements by removing the inner/smallest ones first and iterating
    try_again = True
    previous_length = len(clean_text)
    while try_again:
        clean_text = info_regex.sub(
            "", clean_text
        )  # non-greedy match excluding a nested {
        if len(clean_text) < previous_length:
            try_again = True
        else:
            try_again = False
        previous_length = len(clean_text)

    # remove HTML comments
    clean_text = htlm_regex.sub("", clean_text)

    # remove Category and File statements
    clean_text = category_regex.sub("", clean_text)
    clean_text = file_regex.sub("", clean_text)

    # remove multiple =
    while "==" in clean_text:
        clean_text = clean_text.replace("==", "=")

    clean_text = clean_text.replace(". =", ".")
    clean_text = clean_text.replace(" = ", ". ")
    clean_text = clean_text.replace("= ", ".")
    clean_text = clean_text.replace(" =", "")

    # remove refs (non-greedy match)
    clean_text = ref_regex.sub("", clean_text)
    clean_text = ref_2_regex.sub("", clean_text)

    # remove additional wikiformatting
    clean_text = re.sub(r"&lt;blockquote&gt;", "", clean_text)
    clean_text = re.sub(r"&lt;/blockquote&gt;", "", clean_text)

    # change special characters back to normal ones
    clean_text = clean_text.replace(r"&lt;", "<")
    clean_text = clean_text.replace(r"&gt;", ">")
    clean_text = clean_text.replace(r"&quot;", '"')
    clean_text = clean_text.replace(r"&amp;nbsp;", " ")
    clean_text = clean_text.replace(r"&amp;", "&")

    # remove multiple spaces
    while "  " in clean_text:
        clean_text = clean_text.replace("  ", " ")

    return clean_text.strip()


def _remove_links(clean_text, wp_to_id):
    # read the text char by char to get the right offsets for the interwiki links
    entities = []
    final_text = ""
    open_read = 0
    reading_text = True
    reading_entity = False
    reading_mention = False
    reading_special_case = False
    entity_buffer = ""
    mention_buffer = ""
    for index, letter in enumerate(clean_text):
        if letter == "[":
            open_read += 1
        elif letter == "]":
            open_read -= 1
        elif letter == "|":
            if reading_text:
                final_text += letter
            # switch from reading entity to mention in the [[entity|mention]] pattern
            elif reading_entity:
                reading_text = False
                reading_entity = False
                reading_mention = True
            else:
                reading_special_case = True
        else:
            if reading_entity:
                entity_buffer += letter
            elif reading_mention:
                mention_buffer += letter
            elif reading_text:
                final_text += letter
            else:
                raise ValueError("Not sure at point", clean_text[index - 2: index + 2])

        if open_read > 2:
            reading_special_case = True

        if open_read == 2 and reading_text:
            reading_text = False
            reading_entity = True
            reading_mention = False

        # we just finished reading an entity
        if open_read == 0 and not reading_text:
            if "#" in entity_buffer or entity_buffer.startswith(":"):
                reading_special_case = True
            # Ignore cases with nested structures like File: handles etc
            if not reading_special_case:
                if not mention_buffer:
                    mention_buffer = entity_buffer
                start = len(final_text)
                end = start + len(mention_buffer)
                qid = wp_to_id.get(entity_buffer, None)
                if qid:
                    entities.append((mention_buffer, qid, start, end))
                final_text += mention_buffer

            entity_buffer = ""
            mention_buffer = ""

            reading_text = True
            reading_entity = False
            reading_mention = False
            reading_special_case = False
    return final_text, entities
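To make the link-stripping state machine above concrete, here is a small worked run (not from the repo; the mapping is a placeholder and the offsets are worked out by hand from the code):

```python
text = "He was born in [[Ulm]] and moved to [[Bern|the Swiss capital]]."
wp_to_id = {"Ulm": "Q3012", "Bern": "Q70"}  # placeholder mapping

final_text, entities = _remove_links(text, wp_to_id)
# final_text == "He was born in Ulm and moved to the Swiss capital."
# entities   == [("Ulm", "Q3012", 15, 18), ("the Swiss capital", "Q70", 32, 49)]
```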

def _write_training_description(outputfile, qid, description):
    if description is not None:
        line = str(qid) + "|" + description + "\n"
        outputfile.write(line)


def _write_training_entities(outputfile, article_id, clean_text, entities):
    entities_data = [{"alias": ent[0], "entity": ent[1], "start": ent[2], "end": ent[3]} for ent in entities]
    line = json.dumps(
        {
            "article_id": article_id,
            "clean_text": clean_text,
            "entities": entities_data
        },
        ensure_ascii=False) + "\n"
    outputfile.write(line)


def read_training(nlp, entity_file_path, dev, limit, kb):
    """ This method provides training examples that correspond to the entity annotations found by the nlp object.
    For training, it will include negative training examples by using the candidate generator,
    and it will only keep positive training examples that can be found by using the candidate generator.
    For testing, it will include all positive examples only."""

    from tqdm import tqdm
    data = []
    num_entities = 0
    get_gold_parse = partial(_get_gold_parse, dev=dev, kb=kb)

    logger.info("Reading {} data with limit {}".format('dev' if dev else 'train', limit))
    with entity_file_path.open("r", encoding="utf8") as file:
        with tqdm(total=limit, leave=False) as pbar:
            for i, line in enumerate(file):
                example = json.loads(line)
                article_id = example["article_id"]
                clean_text = example["clean_text"]
                entities = example["entities"]

                if dev != is_dev(article_id) or len(clean_text) >= 30000:
                    continue

                doc = nlp(clean_text)
                gold = get_gold_parse(doc, entities)
                if gold and len(gold.links) > 0:
                    data.append((doc, gold))
                    num_entities += len(gold.links)
                    pbar.update(len(gold.links))
                if limit and num_entities >= limit:
                    break
    logger.info("Read {} entities in {} articles".format(num_entities, len(data)))
    return data


def _get_gold_parse(doc, entities, dev, kb):
    gold_entities = {}
    tagged_ent_positions = set(
        [(ent.start_char, ent.end_char) for ent in doc.ents]
    )

    for entity in entities:
        entity_id = entity["entity"]
        alias = entity["alias"]
        start = entity["start"]
        end = entity["end"]

        candidates = kb.get_candidates(alias)
        candidate_ids = [
            c.entity_ for c in candidates
        ]

        should_add_ent = (
            dev or
            (
                (start, end) in tagged_ent_positions and
                entity_id in candidate_ids and
                len(candidates) > 1
            )
        )

        if should_add_ent:
            value_by_id = {entity_id: 1.0}
            if not dev:
                random.shuffle(candidate_ids)
                value_by_id.update({
                    kb_id: 0.0
                    for kb_id in candidate_ids
                    if kb_id != entity_id
                })
            gold_entities[(start, end)] = value_by_id

    return GoldParse(doc, links=gold_entities)


def is_dev(article_id):
    return article_id.endswith("3")
bin/wiki_entity_linking/wiki_io.py (new file, 127 lines)

@@ -0,0 +1,127 @@
# coding: utf-8
from __future__ import unicode_literals

import sys
import csv

# min() needed to prevent error on windows, cf https://stackoverflow.com/questions/52404416/
csv.field_size_limit(min(sys.maxsize, 2147483646))

""" This module provides reading/writing methods for temp files """


# Entity definition: WP title -> WD ID #
def write_title_to_id(entity_def_output, title_to_id):
    with entity_def_output.open("w", encoding="utf8") as id_file:
        id_file.write("WP_title" + "|" + "WD_id" + "\n")
        for title, qid in title_to_id.items():
            id_file.write(title + "|" + str(qid) + "\n")


def read_title_to_id(entity_def_output):
    title_to_id = dict()
    with entity_def_output.open("r", encoding="utf8") as id_file:
        csvreader = csv.reader(id_file, delimiter="|")
        # skip header
        next(csvreader)
        for row in csvreader:
            title_to_id[row[0]] = row[1]
    return title_to_id


# Entity aliases from WD: WD ID -> WD alias #
def write_id_to_alias(entity_alias_path, id_to_alias):
    with entity_alias_path.open("w", encoding="utf8") as alias_file:
        alias_file.write("WD_id" + "|" + "alias" + "\n")
        for qid, alias_list in id_to_alias.items():
            for alias in alias_list:
                alias_file.write(str(qid) + "|" + alias + "\n")


def read_id_to_alias(entity_alias_path):
    id_to_alias = dict()
    with entity_alias_path.open("r", encoding="utf8") as alias_file:
        csvreader = csv.reader(alias_file, delimiter="|")
        # skip header
        next(csvreader)
        for row in csvreader:
            qid = row[0]
            alias = row[1]
            alias_list = id_to_alias.get(qid, [])
            alias_list.append(alias)
            id_to_alias[qid] = alias_list
    return id_to_alias


def read_alias_to_id_generator(entity_alias_path):
    """ Read (alias, qid) tuples """

    with entity_alias_path.open("r", encoding="utf8") as alias_file:
        csvreader = csv.reader(alias_file, delimiter="|")
        # skip header
        next(csvreader)
        for row in csvreader:
            qid = row[0]
            alias = row[1]
            yield alias, qid


# Entity descriptions from WD: WD ID -> WD description #
def write_id_to_descr(entity_descr_output, id_to_descr):
    with entity_descr_output.open("w", encoding="utf8") as descr_file:
        descr_file.write("WD_id" + "|" + "description" + "\n")
        for qid, descr in id_to_descr.items():
            descr_file.write(str(qid) + "|" + descr + "\n")


def read_id_to_descr(entity_desc_path):
    id_to_desc = dict()
    with entity_desc_path.open("r", encoding="utf8") as descr_file:
        csvreader = csv.reader(descr_file, delimiter="|")
        # skip header
        next(csvreader)
        for row in csvreader:
            id_to_desc[row[0]] = row[1]
    return id_to_desc


# Entity counts from WP: WP title -> count #
def write_entity_to_count(prior_prob_input, count_output):
    # Write entity counts for quick access later
    entity_to_count = dict()
    total_count = 0

    with prior_prob_input.open("r", encoding="utf8") as prior_file:
        # skip header
        prior_file.readline()
        line = prior_file.readline()

        while line:
            splits = line.replace("\n", "").split(sep="|")
            # alias = splits[0]
            count = int(splits[1])
            entity = splits[2]

            current_count = entity_to_count.get(entity, 0)
            entity_to_count[entity] = current_count + count

            total_count += count

            line = prior_file.readline()

    with count_output.open("w", encoding="utf8") as entity_file:
        entity_file.write("entity" + "|" + "count" + "\n")
        for entity, count in entity_to_count.items():
            entity_file.write(entity + "|" + str(count) + "\n")


def read_entity_to_count(count_input):
    entity_to_count = dict()
    with count_input.open("r", encoding="utf8") as csvfile:
        csvreader = csv.reader(csvfile, delimiter="|")
        # skip header
        next(csvreader)
        for row in csvreader:
            entity_to_count[row[0]] = int(row[1])

    return entity_to_count
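As a quick illustration (not part of the diff) of the pipe-delimited round-trip these helpers implement; the file name and entry are placeholders, and the import path is the one used elsewhere in this commit:

```python
from pathlib import Path
from bin.wiki_entity_linking import wiki_io as io

io.write_title_to_id(Path("entity_defs.csv"), {"Douglas Adams": "Q42"})
title_to_id = io.read_title_to_id(Path("entity_defs.csv"))
assert title_to_id == {"Douglas Adams": "Q42"}
```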
bin/wiki_entity_linking/wiki_namespaces.py (new file, 128 lines)

@@ -0,0 +1,128 @@
# coding: utf8
from __future__ import unicode_literals

# List of meta pages in Wikidata, should be kept out of the Knowledge base
WD_META_ITEMS = [
    "Q163875",
    "Q191780",
    "Q224414",
    "Q4167836",
    "Q4167410",
    "Q4663903",
    "Q11266439",
    "Q13406463",
    "Q15407973",
    "Q18616576",
    "Q19887878",
    "Q22808320",
    "Q23894233",
    "Q33120876",
    "Q42104522",
    "Q47460393",
    "Q64875536",
    "Q66480449",
]


# TODO: add more cases from non-English WP's

# List of prefixes that refer to Wikipedia "file" pages
WP_FILE_NAMESPACE = ["Bestand", "File"]

# List of prefixes that refer to Wikipedia "category" pages
WP_CATEGORY_NAMESPACE = ["Kategori", "Category", "Categorie"]

# List of prefixes that refer to Wikipedia "meta" pages
# these will/should be matched ignoring case
WP_META_NAMESPACE = (
    WP_FILE_NAMESPACE
    + WP_CATEGORY_NAMESPACE
    + [
        "b",
        "betawikiversity",
        "Book",
        "c",
        "Commons",
        "d",
        "dbdump",
        "download",
        "Draft",
        "Education",
        "Foundation",
        "Gadget",
        "Gadget definition",
        "Gebruiker",
        "gerrit",
        "Help",
        "Image",
        "Incubator",
        "m",
        "mail",
        "mailarchive",
        "media",
        "MediaWiki",
        "MediaWiki talk",
        "Mediawikiwiki",
        "MediaZilla",
        "Meta",
        "Metawikipedia",
        "Module",
        "mw",
        "n",
        "nost",
        "oldwikisource",
        "otrs",
        "OTRSwiki",
        "Overleg gebruiker",
        "outreach",
        "outreachwiki",
        "Portal",
        "phab",
        "Phabricator",
        "Project",
        "q",
        "quality",
        "rev",
        "s",
        "spcom",
        "Special",
        "species",
        "Strategy",
        "sulutil",
        "svn",
        "Talk",
        "Template",
        "Template talk",
        "Testwiki",
        "ticket",
        "TimedText",
        "Toollabs",
        "tools",
        "tswiki",
        "User",
        "User talk",
        "v",
        "voy",
        "w",
        "Wikibooks",
        "Wikidata",
        "wikiHow",
        "Wikinvest",
        "wikilivres",
        "Wikimedia",
        "Wikinews",
        "Wikipedia",
        "Wikipedia talk",
        "Wikiquote",
        "Wikisource",
        "Wikispecies",
        "Wikitech",
        "Wikiversity",
        "Wikivoyage",
        "wikt",
        "wiktionary",
        "wmf",
        "wmania",
        "WP",
    ]
)
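A small sketch (not part of the diff) of how such prefix lists are typically applied, matching the "ignoring case" note above; `_is_meta_page` is a hypothetical helper, not a function from the repo:

```python
from bin.wiki_entity_linking.wiki_namespaces import WP_META_NAMESPACE

_META_PREFIXES = {ns.lower() for ns in WP_META_NAMESPACE}

def _is_meta_page(title):
    # A page like "Category:Physics" or "file:Foo.jpg" is a meta page.
    if ":" not in title:
        return False
    prefix = title.split(":", 1)[0].strip().lower()
    return prefix in _META_PREFIXES

print(_is_meta_page("Category:Physics"))  # True
print(_is_meta_page("Douglas Adams"))     # False
```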

@@ -18,11 +18,12 @@ from pathlib import Path
import plac

from bin.wiki_entity_linking import wikipedia_processor as wp, wikidata_processor as wd
from bin.wiki_entity_linking import wiki_io as io
from bin.wiki_entity_linking import kb_creator
from bin.wiki_entity_linking import training_set_creator
from bin.wiki_entity_linking import TRAINING_DATA_FILE, KB_FILE, ENTITY_DESCR_PATH, KB_MODEL_DIR, LOG_FORMAT
from bin.wiki_entity_linking import ENTITY_FREQ_PATH, PRIOR_PROB_PATH, ENTITY_DEFS_PATH
from bin.wiki_entity_linking import ENTITY_FREQ_PATH, PRIOR_PROB_PATH, ENTITY_DEFS_PATH, ENTITY_ALIAS_PATH
import spacy
from bin.wiki_entity_linking.kb_creator import read_kb

logger = logging.getLogger(__name__)

@@ -39,9 +40,11 @@ logger = logging.getLogger(__name__)
    loc_prior_prob=("Location to file with prior probabilities", "option", "p", Path),
    loc_entity_defs=("Location to file with entity definitions", "option", "d", Path),
    loc_entity_desc=("Location to file with entity descriptions", "option", "s", Path),
    descriptions_from_wikipedia=("Flag for using wp descriptions not wd", "flag", "wp"),
    limit=("Optional threshold to limit lines read from dumps", "option", "l", int),
    lang=("Optional language for which to get wikidata titles. Defaults to 'en'", "option", "la", str),
    descr_from_wp=("Flag for using wp descriptions not wd", "flag", "wp"),
    limit_prior=("Threshold to limit lines read from WP for prior probabilities", "option", "lp", int),
    limit_train=("Threshold to limit lines read from WP for training set", "option", "lt", int),
    limit_wd=("Threshold to limit lines read from WD", "option", "lw", int),
    lang=("Optional language for which to get Wikidata titles. Defaults to 'en'", "option", "la", str),
)
def main(
    wd_json,

@@ -54,13 +57,16 @@ def main(
    entity_vector_length=64,
    loc_prior_prob=None,
    loc_entity_defs=None,
    loc_entity_alias=None,
    loc_entity_desc=None,
    descriptions_from_wikipedia=False,
    limit=None,
    descr_from_wp=False,
    limit_prior=None,
    limit_train=None,
    limit_wd=None,
    lang="en",
):

    entity_defs_path = loc_entity_defs if loc_entity_defs else output_dir / ENTITY_DEFS_PATH
    entity_alias_path = loc_entity_alias if loc_entity_alias else output_dir / ENTITY_ALIAS_PATH
    entity_descr_path = loc_entity_desc if loc_entity_desc else output_dir / ENTITY_DESCR_PATH
    entity_freq_path = output_dir / ENTITY_FREQ_PATH
    prior_prob_path = loc_prior_prob if loc_prior_prob else output_dir / PRIOR_PROB_PATH

@@ -69,15 +75,12 @@ def main(

    logger.info("Creating KB with Wikipedia and WikiData")

    if limit is not None:
        logger.warning("Warning: reading only {} lines of Wikipedia/Wikidata dumps.".format(limit))

    # STEP 0: set up IO
    if not output_dir.exists():
        output_dir.mkdir(parents=True)

    # STEP 1: create the NLP object
    logger.info("STEP 1: Loading model {}".format(model))
    # STEP 1: Load the NLP object
    logger.info("STEP 1: Loading NLP model {}".format(model))
    nlp = spacy.load(model)

    # check the length of the nlp vectors

@@ -90,62 +93,83 @@ def main(
    # STEP 2: create prior probabilities from WP
    if not prior_prob_path.exists():
        # It takes about 2h to process 1000M lines of Wikipedia XML dump
        logger.info("STEP 2: writing prior probabilities to {}".format(prior_prob_path))
        wp.read_prior_probs(wp_xml, prior_prob_path, limit=limit)
    logger.info("STEP 2: reading prior probabilities from {}".format(prior_prob_path))
        logger.info("STEP 2: Writing prior probabilities to {}".format(prior_prob_path))
        if limit_prior is not None:
            logger.warning("Warning: reading only {} lines of Wikipedia dump".format(limit_prior))
        wp.read_prior_probs(wp_xml, prior_prob_path, limit=limit_prior)
    else:
        logger.info("STEP 2: Reading prior probabilities from {}".format(prior_prob_path))

    # STEP 3: deduce entity frequencies from WP (takes only a few minutes)
    logger.info("STEP 3: calculating entity frequencies")
    wp.write_entity_counts(prior_prob_path, entity_freq_path, to_print=False)
    # STEP 3: calculate entity frequencies
    if not entity_freq_path.exists():
        logger.info("STEP 3: Calculating and writing entity frequencies to {}".format(entity_freq_path))
        io.write_entity_to_count(prior_prob_path, entity_freq_path)
    else:
        logger.info("STEP 3: Reading entity frequencies from {}".format(entity_freq_path))

    # STEP 4: reading definitions and (possibly) descriptions from WikiData or from file
    message = " and descriptions" if not descriptions_from_wikipedia else ""
    if (not entity_defs_path.exists()) or (not descriptions_from_wikipedia and not entity_descr_path.exists()):
    if (not entity_defs_path.exists()) or (not descr_from_wp and not entity_descr_path.exists()):
        # It takes about 10h to process 55M lines of Wikidata JSON dump
        logger.info("STEP 4: parsing wikidata for entity definitions" + message)
        title_to_id, id_to_descr = wd.read_wikidata_entities_json(
        logger.info("STEP 4: Parsing and writing Wikidata entity definitions to {}".format(entity_defs_path))
        if limit_wd is not None:
            logger.warning("Warning: reading only {} lines of Wikidata dump".format(limit_wd))
        title_to_id, id_to_descr, id_to_alias = wd.read_wikidata_entities_json(
            wd_json,
            limit,
            limit_wd,
            to_print=False,
            lang=lang,
            parse_descriptions=(not descriptions_from_wikipedia),
            parse_descr=(not descr_from_wp),
        )
        wd.write_entity_files(entity_defs_path, title_to_id)
        if not descriptions_from_wikipedia:
            wd.write_entity_description_files(entity_descr_path, id_to_descr)
    logger.info("STEP 4: read entity definitions" + message)
        io.write_title_to_id(entity_defs_path, title_to_id)

    # STEP 5: Getting gold entities from wikipedia
    message = " and descriptions" if descriptions_from_wikipedia else ""
    if (not training_entities_path.exists()) or (descriptions_from_wikipedia and not entity_descr_path.exists()):
        logger.info("STEP 5: parsing wikipedia for gold entities" + message)
        training_set_creator.create_training_examples_and_descriptions(
            wp_xml,
            entity_defs_path,
            entity_descr_path,
            training_entities_path,
            parse_descriptions=descriptions_from_wikipedia,
            limit=limit,
        )
    logger.info("STEP 5: read gold entities" + message)
        logger.info("STEP 4b: Writing Wikidata entity aliases to {}".format(entity_alias_path))
        io.write_id_to_alias(entity_alias_path, id_to_alias)

        if not descr_from_wp:
            logger.info("STEP 4c: Writing Wikidata entity descriptions to {}".format(entity_descr_path))
            io.write_id_to_descr(entity_descr_path, id_to_descr)
    else:
        logger.info("STEP 4: Reading entity definitions from {}".format(entity_defs_path))
        logger.info("STEP 4b: Reading entity aliases from {}".format(entity_alias_path))
        if not descr_from_wp:
            logger.info("STEP 4c: Reading entity descriptions from {}".format(entity_descr_path))

    # STEP 5: Getting gold entities from Wikipedia
    if (not training_entities_path.exists()) or (descr_from_wp and not entity_descr_path.exists()):
        logger.info("STEP 5: Parsing and writing Wikipedia gold entities to {}".format(training_entities_path))
        if limit_train is not None:
            logger.warning("Warning: reading only {} lines of Wikipedia dump".format(limit_train))
        wp.create_training_and_desc(wp_xml, entity_defs_path, entity_descr_path,
                                    training_entities_path, descr_from_wp, limit_train)
        if descr_from_wp:
            logger.info("STEP 5b: Parsing and writing Wikipedia descriptions to {}".format(entity_descr_path))
    else:
        logger.info("STEP 5: Reading gold entities from {}".format(training_entities_path))
        if descr_from_wp:
            logger.info("STEP 5b: Reading entity descriptions from {}".format(entity_descr_path))

    # STEP 6: creating the actual KB
    # It takes ca. 30 minutes to pretrain the entity embeddings
    logger.info("STEP 6: creating the KB at {}".format(kb_path))
    if not kb_path.exists():
        logger.info("STEP 6: Creating the KB at {}".format(kb_path))
        kb = kb_creator.create_kb(
            nlp=nlp,
            max_entities_per_alias=max_per_alias,
            min_entity_freq=min_freq,
            min_occ=min_pair,
            entity_def_input=entity_defs_path,
            entity_def_path=entity_defs_path,
            entity_descr_path=entity_descr_path,
            count_input=entity_freq_path,
            prior_prob_input=prior_prob_path,
            entity_alias_path=entity_alias_path,
            entity_freq_path=entity_freq_path,
            prior_prob_path=prior_prob_path,
            entity_vector_length=entity_vector_length,
        )

        kb.dump(kb_path)
        logger.info("kb entities: {}".format(kb.get_size_entities()))
        logger.info("kb aliases: {}".format(kb.get_size_aliases()))
        nlp.to_disk(output_dir / KB_MODEL_DIR)
    else:
        logger.info("STEP 6: KB already exists at {}".format(kb_path))

    logger.info("Done!")
@@ -1,40 +1,52 @@
# coding: utf-8
from __future__ import unicode_literals

import gzip
import bz2
import json
import logging
import datetime

from bin.wiki_entity_linking.wiki_namespaces import WD_META_ITEMS

logger = logging.getLogger(__name__)


def read_wikidata_entities_json(wikidata_file, limit=None, to_print=False, lang="en", parse_descriptions=True):
    # Read the JSON wiki data and parse out the entities. Takes about 7u30 to parse 55M lines.
def read_wikidata_entities_json(wikidata_file, limit=None, to_print=False, lang="en", parse_descr=True):
    # Read the JSON wiki data and parse out the entities. Takes about 7-10h to parse 55M lines.
    # get latest-all.json.bz2 from https://dumps.wikimedia.org/wikidatawiki/entities/

    site_filter = '{}wiki'.format(lang)

    # properties filter (currently disabled to get ALL data)
    prop_filter = dict()
    # prop_filter = {'P31': {'Q5', 'Q15632617'}}  # currently defined as OR: one property suffices to be selected
    # filter: currently defined as OR: one hit suffices to be removed from further processing
    exclude_list = WD_META_ITEMS

    # punctuation
    exclude_list.extend(["Q1383557", "Q10617810"])

    # letters etc
    exclude_list.extend(["Q188725", "Q19776628", "Q3841820", "Q17907810", "Q9788", "Q9398093"])

    neg_prop_filter = {
        'P31': exclude_list,   # instance of
        'P279': exclude_list   # subclass
    }

    title_to_id = dict()
    id_to_descr = dict()
    id_to_alias = dict()

    # parse appropriate fields - depending on what we need in the KB
    parse_properties = False
    parse_sitelinks = True
    parse_labels = False
    parse_aliases = False
    parse_claims = False
    parse_aliases = True
    parse_claims = True

    with gzip.open(wikidata_file, mode='rb') as file:
    with bz2.open(wikidata_file, mode='rb') as file:
        for cnt, line in enumerate(file):
            if limit and cnt >= limit:
                break
            if cnt % 500000 == 0:
                logger.info("processed {} lines of WikiData dump".format(cnt))
            if cnt % 500000 == 0 and cnt > 0:
                logger.info("processed {} lines of WikiData JSON dump".format(cnt))
            clean_line = line.strip()
            if clean_line.endswith(b","):
                clean_line = clean_line[:-1]

@@ -43,13 +55,11 @@ def read_wikidata_entities_json(wikidata_file, limit=None, to_print=False, lang=
            entry_type = obj["type"]

            if entry_type == "item":
                # filtering records on their properties (currently disabled to get ALL data)
                # keep = False
                keep = True

                claims = obj["claims"]
                if parse_claims:
                    for prop, value_set in prop_filter.items():
                    for prop, value_set in neg_prop_filter.items():
                        claim_property = claims.get(prop, None)
                        if claim_property:
                            for cp in claim_property:

@@ -61,7 +71,7 @@ def read_wikidata_entities_json(wikidata_file, limit=None, to_print=False, lang=
                                )
                                cp_rank = cp["rank"]
                                if cp_rank != "deprecated" and cp_id in value_set:
                                    keep = True
                                    keep = False

                if keep:
                    unique_id = obj["id"]

@@ -108,7 +118,7 @@ def read_wikidata_entities_json(wikidata_file, limit=None, to_print=False, lang=
                                "label (" + lang + "):", lang_label["value"]
                            )

                    if found_link and parse_descriptions:
                    if found_link and parse_descr:
                        descriptions = obj["descriptions"]
                        if descriptions:
                            lang_descr = descriptions.get(lang, None)

@@ -130,22 +140,15 @@ def read_wikidata_entities_json(wikidata_file, limit=None, to_print=False, lang=
                                    print(
                                        "alias (" + lang + "):", item["value"]
                                    )
                                alias_list = id_to_alias.get(unique_id, [])
                                alias_list.append(item["value"])
                                id_to_alias[unique_id] = alias_list

            if to_print:
                print()

    return title_to_id, id_to_descr
    # log final number of lines processed
    logger.info("Finished. Processed {} lines of WikiData JSON dump".format(cnt))
    return title_to_id, id_to_descr, id_to_alias


def write_entity_files(entity_def_output, title_to_id):
    with entity_def_output.open("w", encoding="utf8") as id_file:
        id_file.write("WP_title" + "|" + "WD_id" + "\n")
        for title, qid in title_to_id.items():
            id_file.write(title + "|" + str(qid) + "\n")


def write_entity_description_files(entity_descr_output, id_to_descr):
    with entity_descr_output.open("w", encoding="utf8") as descr_file:
        descr_file.write("WD_id" + "|" + "description" + "\n")
        for qid, descr in id_to_descr.items():
            descr_file.write(str(qid) + "|" + descr + "\n")
@ -6,19 +6,19 @@ as created by the script `wikidata_create_kb`.
|
|||
|
||||
For the Wikipedia dump: get enwiki-latest-pages-articles-multistream.xml.bz2
|
||||
from https://dumps.wikimedia.org/enwiki/latest/
|
||||
|
||||
"""
|
||||
from __future__ import unicode_literals
|
||||
import random
import logging
import spacy
from pathlib import Path
import plac

from bin.wiki_entity_linking import training_set_creator
from bin.wiki_entity_linking import wikipedia_processor
from bin.wiki_entity_linking import TRAINING_DATA_FILE, KB_MODEL_DIR, KB_FILE, LOG_FORMAT, OUTPUT_MODEL_DIR
from bin.wiki_entity_linking.entity_linker_evaluation import measure_performance, measure_baselines
from bin.wiki_entity_linking.kb_creator import read_nlp_kb
from bin.wiki_entity_linking.entity_linker_evaluation import measure_performance
from bin.wiki_entity_linking.kb_creator import read_kb

from spacy.util import minibatch, compounding

@@ -35,6 +35,7 @@ logger = logging.getLogger(__name__)
    l2=("L2 regularization", "option", "r", float),
    train_inst=("# training instances (default 90% of all)", "option", "t", int),
    dev_inst=("# test instances (default 10% of all)", "option", "d", int),
    labels_discard=("NER labels to discard (default None)", "option", "l", str),
)
def main(
    dir_kb,

@@ -46,13 +47,14 @@ def main(
    l2=1e-6,
    train_inst=None,
    dev_inst=None,
    labels_discard=None
):
    logger.info("Creating Entity Linker with Wikipedia and WikiData")

    output_dir = Path(output_dir) if output_dir else dir_kb
    training_path = loc_training if loc_training else output_dir / TRAINING_DATA_FILE
    training_path = loc_training if loc_training else dir_kb / TRAINING_DATA_FILE
    nlp_dir = dir_kb / KB_MODEL_DIR
    kb_path = output_dir / KB_FILE
    kb_path = dir_kb / KB_FILE
    nlp_output_dir = output_dir / OUTPUT_MODEL_DIR

    # STEP 0: set up IO

@@ -60,38 +62,49 @@ def main(
        output_dir.mkdir()

    # STEP 1 : load the NLP object
    logger.info("STEP 1: loading model from {}".format(nlp_dir))
    nlp, kb = read_nlp_kb(nlp_dir, kb_path)
    logger.info("STEP 1a: Loading model from {}".format(nlp_dir))
    nlp = spacy.load(nlp_dir)
    logger.info("STEP 1b: Loading KB from {}".format(kb_path))
    kb = read_kb(nlp, kb_path)

    # check that there is a NER component in the pipeline
    if "ner" not in nlp.pipe_names:
        raise ValueError("The `nlp` object should have a pretrained `ner` component.")

    # STEP 2: create a training dataset from WP
    logger.info("STEP 2: reading training dataset from {}".format(training_path))
    # STEP 2: read the training dataset previously created from WP
    logger.info("STEP 2: Reading training dataset from {}".format(training_path))

    train_data = training_set_creator.read_training(
    if labels_discard:
        labels_discard = [x.strip() for x in labels_discard.split(",")]
        logger.info("Discarding {} NER types: {}".format(len(labels_discard), labels_discard))
    else:
        labels_discard = []

    train_data = wikipedia_processor.read_training(
        nlp=nlp,
        entity_file_path=training_path,
        dev=False,
        limit=train_inst,
        kb=kb,
        labels_discard=labels_discard
    )

    # for testing, get all pos instances, whether or not they are in the kb
    dev_data = training_set_creator.read_training(
    # for testing, get all pos instances (independently of KB)
    dev_data = wikipedia_processor.read_training(
        nlp=nlp,
        entity_file_path=training_path,
        dev=True,
        limit=dev_inst,
        kb=kb,
        kb=None,
        labels_discard=labels_discard
    )

    # STEP 3: create and train the entity linking pipe
    logger.info("STEP 3: training Entity Linking pipe")
    # STEP 3: create and train an entity linking pipe
    logger.info("STEP 3: Creating and training an Entity Linking pipe")

    el_pipe = nlp.create_pipe(
        name="entity_linker", config={"pretrained_vectors": nlp.vocab.vectors.name}
        name="entity_linker", config={"pretrained_vectors": nlp.vocab.vectors.name,
                                      "labels_discard": labels_discard}
    )
    el_pipe.set_kb(kb)
    nlp.add_pipe(el_pipe, last=True)

@@ -105,14 +118,9 @@ def main(
    logger.info("Training on {} articles".format(len(train_data)))
    logger.info("Dev testing on {} articles".format(len(dev_data)))

    dev_baseline_accuracies = measure_baselines(
        dev_data, kb
    )

    # baseline performance on dev data
    logger.info("Dev Baseline Accuracies:")
    logger.info(dev_baseline_accuracies.report_accuracy("random"))
    logger.info(dev_baseline_accuracies.report_accuracy("prior"))
    logger.info(dev_baseline_accuracies.report_accuracy("oracle"))
    measure_performance(dev_data, kb, el_pipe, baseline=True, context=False)

    for itn in range(epochs):
        random.shuffle(train_data)

@@ -136,18 +144,18 @@ def main(
                logger.error("Error updating batch:" + str(e))
        if batchnr > 0:
            logging.info("Epoch {}, train loss {}".format(itn, round(losses["entity_linker"] / batchnr, 2)))
            measure_performance(dev_data, kb, el_pipe)
            measure_performance(dev_data, kb, el_pipe, baseline=False, context=True)

    # STEP 4: measure the performance of our trained pipe on an independent dev set
    logger.info("STEP 4: performance measurement of Entity Linking pipe")
    logger.info("STEP 4: Final performance measurement of Entity Linking pipe")
    measure_performance(dev_data, kb, el_pipe)

    # STEP 5: apply the EL pipe on a toy example
    logger.info("STEP 5: applying Entity Linking to toy example")
    logger.info("STEP 5: Applying Entity Linking to toy example")
    run_el_toy_example(nlp=nlp)

    if output_dir:
        # STEP 6: write the NLP pipeline (including entity linker) to file
        # STEP 6: write the NLP pipeline (now including an EL model) to file
        logger.info("STEP 6: Writing trained NLP to {}".format(nlp_output_dir))
        nlp.to_disk(nlp_output_dir)

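The epoch loop elided between the hunks above batches articles with the freshly imported spacy.util.minibatch and a compounding batch size. A minimal sketch of that pattern — the dropout value and batch-size schedule here are illustrative stand-ins, not the script's exact hyperparameters:

from spacy.util import minibatch, compounding

def train_one_epoch(nlp, optimizer, train_data, losses):
    # batch size grows from 4 to 32, compounding by a factor of 1.001 per batch
    batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
    for batch in batches:
        docs, golds = zip(*batch)
        nlp.update(docs, golds, sgd=optimizer, drop=0.2, losses=losses)
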
@@ -3,147 +3,104 @@ from __future__ import unicode_literals

import re
import bz2
import csv
import datetime
import logging
import random
import json

from bin.wiki_entity_linking import LOG_FORMAT
from functools import partial

from spacy.gold import GoldParse
from bin.wiki_entity_linking import wiki_io as io
from bin.wiki_entity_linking.wiki_namespaces import (
    WP_META_NAMESPACE,
    WP_FILE_NAMESPACE,
    WP_CATEGORY_NAMESPACE,
)

"""
Process a Wikipedia dump to calculate entity frequencies and prior probabilities in combination with certain mentions.
Write these results to file for downstream KB and training data generation.

Process Wikipedia interlinks to generate a training dataset for the EL algorithm.
"""

ENTITY_FILE = "gold_entities.csv"

map_alias_to_link = dict()

logger = logging.getLogger(__name__)


# these will/should be matched ignoring case
wiki_namespaces = [
    "b",
    "betawikiversity",
    "Book",
    "c",
    "Category",
    "Commons",
    "d",
    "dbdump",
    "download",
    "Draft",
    "Education",
    "Foundation",
    "Gadget",
    "Gadget definition",
    "gerrit",
    "File",
    "Help",
    "Image",
    "Incubator",
    "m",
    "mail",
    "mailarchive",
    "media",
    "MediaWiki",
    "MediaWiki talk",
    "Mediawikiwiki",
    "MediaZilla",
    "Meta",
    "Metawikipedia",
    "Module",
    "mw",
    "n",
    "nost",
    "oldwikisource",
    "outreach",
    "outreachwiki",
    "otrs",
    "OTRSwiki",
    "Portal",
    "phab",
    "Phabricator",
    "Project",
    "q",
    "quality",
    "rev",
    "s",
    "spcom",
    "Special",
    "species",
    "Strategy",
    "sulutil",
    "svn",
    "Talk",
    "Template",
    "Template talk",
    "Testwiki",
    "ticket",
    "TimedText",
    "Toollabs",
    "tools",
    "tswiki",
    "User",
    "User talk",
    "v",
    "voy",
    "w",
    "Wikibooks",
    "Wikidata",
    "wikiHow",
    "Wikinvest",
    "wikilivres",
    "Wikimedia",
    "Wikinews",
    "Wikipedia",
    "Wikipedia talk",
    "Wikiquote",
    "Wikisource",
    "Wikispecies",
    "Wikitech",
    "Wikiversity",
    "Wikivoyage",
    "wikt",
    "wiktionary",
    "wmf",
    "wmania",
    "WP",
]
title_regex = re.compile(r"(?<=<title>).*(?=</title>)")
id_regex = re.compile(r"(?<=<id>)\d*(?=</id>)")
text_regex = re.compile(r"(?<=<text xml:space=\"preserve\">).*(?=</text)")
info_regex = re.compile(r"{[^{]*?}")
html_regex = re.compile(r"<!--[^-]*-->")
ref_regex = re.compile(r"<ref.*?>")  # non-greedy
ref_2_regex = re.compile(r"</ref.*?>")  # non-greedy

# find the links
link_regex = re.compile(r"\[\[[^\[\]]*\]\]")

# match on interwiki links, e.g. `en:` or `:fr:`
ns_regex = r":?" + "[a-z][a-z]" + ":"

# match on Namespace: optionally preceded by a :
for ns in wiki_namespaces:
for ns in WP_META_NAMESPACE:
    ns_regex += "|" + ":?" + ns + ":"

ns_regex = re.compile(ns_regex, re.IGNORECASE)

files = r""
for f in WP_FILE_NAMESPACE:
    files += "\[\[" + f + ":[^[\]]+]]" + "|"
files = files[0 : len(files) - 1]
file_regex = re.compile(files)

cats = r""
for c in WP_CATEGORY_NAMESPACE:
    cats += "\[\[" + c + ":[^\[]*]]" + "|"
cats = cats[0 : len(cats) - 1]
category_regex = re.compile(cats)


def read_prior_probs(wikipedia_input, prior_prob_output, limit=None):
    """
    Read the XML wikipedia data and parse out intra-wiki links to estimate prior probabilities.
    The full file takes about 2h to parse 1100M lines.
    It works relatively fast because it runs line by line, irrelevant of which article the intrawiki is from.
    The full file takes about 2-3h to parse 1100M lines.
    It works relatively fast because it runs line by line, irrelevant of which article the intrawiki is from,
    though dev test articles are excluded in order not to get an artificially strong baseline.
    """
    cnt = 0
    read_id = False
    current_article_id = None
    with bz2.open(wikipedia_input, mode="rb") as file:
        line = file.readline()
        cnt = 0
        while line and (not limit or cnt < limit):
            if cnt % 25000000 == 0:
            if cnt % 25000000 == 0 and cnt > 0:
                logger.info("processed {} lines of Wikipedia XML dump".format(cnt))
            clean_line = line.strip().decode("utf-8")

            # we attempt at reading the article's ID (but not the revision or contributor ID)
            if "<revision>" in clean_line or "<contributor>" in clean_line:
                read_id = False
            if "<page>" in clean_line:
                read_id = True

            if read_id:
                ids = id_regex.search(clean_line)
                if ids:
                    current_article_id = ids[0]

            # only processing prior probabilities from true training (non-dev) articles
            if not is_dev(current_article_id):
                aliases, entities, normalizations = get_wp_links(clean_line)
                for alias, entity, norm in zip(aliases, entities, normalizations):
                    _store_alias(alias, entity, normalize_alias=norm, normalize_entity=True)
                    _store_alias(
                        alias, entity, normalize_alias=norm, normalize_entity=True
                    )

            line = file.readline()
            cnt += 1
        logger.info("processed {} lines of Wikipedia XML dump".format(cnt))
        logger.info("Finished. processed {} lines of Wikipedia XML dump".format(cnt))

    # write all aliases and their entities and count occurrences to file
    with prior_prob_output.open("w", encoding="utf8") as outputfile:
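For orientation, the prior-probability file written above is pipe-delimited. Judging from the reader in write_entity_counts further down (which splits each row into alias, count and entity), a data line would look like this — the alias, count and QID are made-up illustrative values:

alias|count|entity
Douglas Adams|12|Q42
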
@@ -182,7 +139,7 @@ def get_wp_links(text):
        match = match[2:][:-2].replace("_", " ").strip()

        if ns_regex.match(match):
            pass  # ignore namespaces at the beginning of the string
            pass  # ignore the entity if it points to a "meta" page

        # this is a simple [[link]], with the alias the same as the mention
        elif "|" not in match:
@@ -218,47 +175,382 @@ def _capitalize_first(text):
    return result


def write_entity_counts(prior_prob_input, count_output, to_print=False):
    # Write entity counts for quick access later
    entity_to_count = dict()
    total_count = 0

    with prior_prob_input.open("r", encoding="utf8") as prior_file:
        # skip header
        prior_file.readline()
        line = prior_file.readline()

        while line:
            splits = line.replace("\n", "").split(sep="|")
            # alias = splits[0]
            count = int(splits[1])
            entity = splits[2]

            current_count = entity_to_count.get(entity, 0)
            entity_to_count[entity] = current_count + count

            total_count += count

            line = prior_file.readline()

    with count_output.open("w", encoding="utf8") as entity_file:
        entity_file.write("entity" + "|" + "count" + "\n")
        for entity, count in entity_to_count.items():
            entity_file.write(entity + "|" + str(count) + "\n")

    if to_print:
        for entity, count in entity_to_count.items():
            print("Entity count:", entity, count)
        print("Total count:", total_count)
def create_training_and_desc(
    wp_input, def_input, desc_output, training_output, parse_desc, limit=None
):
    wp_to_id = io.read_title_to_id(def_input)
    _process_wikipedia_texts(
        wp_input, wp_to_id, desc_output, training_output, parse_desc, limit
    )


def get_all_frequencies(count_input):
    entity_to_count = dict()
    with count_input.open("r", encoding="utf8") as csvfile:
        csvreader = csv.reader(csvfile, delimiter="|")
        # skip header
        next(csvreader)
        for row in csvreader:
            entity_to_count[row[0]] = int(row[1])
def _process_wikipedia_texts(
    wikipedia_input, wp_to_id, output, training_output, parse_descriptions, limit=None
):
    """
    Read the XML wikipedia data to parse out training data:
    raw text data + positive instances
    """

    return entity_to_count
    read_ids = set()

    with output.open("a", encoding="utf8") as descr_file, training_output.open(
        "w", encoding="utf8"
    ) as entity_file:
        if parse_descriptions:
            _write_training_description(descr_file, "WD_id", "description")
        with bz2.open(wikipedia_input, mode="rb") as file:
            article_count = 0
            article_text = ""
            article_title = None
            article_id = None
            reading_text = False
            reading_revision = False

            for line in file:
                clean_line = line.strip().decode("utf-8")

                if clean_line == "<revision>":
                    reading_revision = True
                elif clean_line == "</revision>":
                    reading_revision = False

                # Start reading new page
                if clean_line == "<page>":
                    article_text = ""
                    article_title = None
                    article_id = None
                # finished reading this page
                elif clean_line == "</page>":
                    if article_id:
                        clean_text, entities = _process_wp_text(
                            article_title, article_text, wp_to_id
                        )
                        if clean_text is not None and entities is not None:
                            _write_training_entities(
                                entity_file, article_id, clean_text, entities
                            )

                            if article_title in wp_to_id and parse_descriptions:
                                description = " ".join(
                                    clean_text[:1000].split(" ")[:-1]
                                )
                                _write_training_description(
                                    descr_file, wp_to_id[article_title], description
                                )
                            article_count += 1
                            if article_count % 10000 == 0 and article_count > 0:
                                logger.info(
                                    "Processed {} articles".format(article_count)
                                )
                            if limit and article_count >= limit:
                                break
                    article_text = ""
                    article_title = None
                    article_id = None
                    reading_text = False
                    reading_revision = False

                # start reading text within a page
                if "<text" in clean_line:
                    reading_text = True

                if reading_text:
                    article_text += " " + clean_line

                # stop reading text within a page (we assume a new page doesn't start on the same line)
                if "</text" in clean_line:
                    reading_text = False

                # read the ID of this article (outside the revision portion of the document)
                if not reading_revision:
                    ids = id_regex.search(clean_line)
                    if ids:
                        article_id = ids[0]
                        if article_id in read_ids:
                            logger.info(
                                "Found duplicate article ID", article_id, clean_line
                            )  # This should never happen ...
                        read_ids.add(article_id)

                # read the title of this article (outside the revision portion of the document)
                if not reading_revision:
                    titles = title_regex.search(clean_line)
                    if titles:
                        article_title = titles[0].strip()
    logger.info("Finished. Processed {} articles".format(article_count))


def _process_wp_text(article_title, article_text, wp_to_id):
    # ignore meta Wikipedia pages
    if ns_regex.match(article_title):
        return None, None

    # remove the text tags
    text_search = text_regex.search(article_text)
    if text_search is None:
        return None, None
    text = text_search.group(0)

    # stop processing if this is a redirect page
    if text.startswith("#REDIRECT"):
        return None, None

    # get the raw text without markup etc, keeping only interwiki links
    clean_text, entities = _remove_links(_get_clean_wp_text(text), wp_to_id)
    return clean_text, entities


def _get_clean_wp_text(article_text):
    clean_text = article_text.strip()

    # remove bolding & italic markup
    clean_text = clean_text.replace("'''", "")
    clean_text = clean_text.replace("''", "")

    # remove nested {{info}} statements by removing the inner/smallest ones first and iterating
    try_again = True
    previous_length = len(clean_text)
    while try_again:
        clean_text = info_regex.sub(
            "", clean_text
        )  # non-greedy match excluding a nested {
        if len(clean_text) < previous_length:
            try_again = True
        else:
            try_again = False
        previous_length = len(clean_text)

    # remove HTML comments
    clean_text = html_regex.sub("", clean_text)

    # remove Category and File statements
    clean_text = category_regex.sub("", clean_text)
    clean_text = file_regex.sub("", clean_text)

    # remove multiple =
    while "==" in clean_text:
        clean_text = clean_text.replace("==", "=")

    clean_text = clean_text.replace(". =", ".")
    clean_text = clean_text.replace(" = ", ". ")
    clean_text = clean_text.replace("= ", ".")
    clean_text = clean_text.replace(" =", "")

    # remove refs (non-greedy match)
    clean_text = ref_regex.sub("", clean_text)
    clean_text = ref_2_regex.sub("", clean_text)

    # remove additional wikiformatting
    clean_text = re.sub(r"&lt;blockquote&gt;", "", clean_text)
    clean_text = re.sub(r"&lt;/blockquote&gt;", "", clean_text)

    # change special characters back to normal ones
    clean_text = clean_text.replace(r"&lt;", "<")
    clean_text = clean_text.replace(r"&gt;", ">")
    clean_text = clean_text.replace(r"&quot;", '"')
    clean_text = clean_text.replace(r"&amp;nbsp;", " ")
    clean_text = clean_text.replace(r"&amp;", "&")

    # remove multiple spaces
    while "  " in clean_text:
        clean_text = clean_text.replace("  ", " ")

    return clean_text.strip()


def _remove_links(clean_text, wp_to_id):
    # read the text char by char to get the right offsets for the interwiki links
    entities = []
    final_text = ""
    open_read = 0
    reading_text = True
    reading_entity = False
    reading_mention = False
    reading_special_case = False
    entity_buffer = ""
    mention_buffer = ""
    for index, letter in enumerate(clean_text):
        if letter == "[":
            open_read += 1
        elif letter == "]":
            open_read -= 1
        elif letter == "|":
            if reading_text:
                final_text += letter
            # switch from reading entity to mention in the [[entity|mention]] pattern
            elif reading_entity:
                reading_text = False
                reading_entity = False
                reading_mention = True
            else:
                reading_special_case = True
        else:
            if reading_entity:
                entity_buffer += letter
            elif reading_mention:
                mention_buffer += letter
            elif reading_text:
                final_text += letter
            else:
                raise ValueError("Not sure at point", clean_text[index - 2 : index + 2])

        if open_read > 2:
            reading_special_case = True

        if open_read == 2 and reading_text:
            reading_text = False
            reading_entity = True
            reading_mention = False

        # we just finished reading an entity
        if open_read == 0 and not reading_text:
            if "#" in entity_buffer or entity_buffer.startswith(":"):
                reading_special_case = True
            # Ignore cases with nested structures like File: handles etc
            if not reading_special_case:
                if not mention_buffer:
                    mention_buffer = entity_buffer
                start = len(final_text)
                end = start + len(mention_buffer)
                qid = wp_to_id.get(entity_buffer, None)
                if qid:
                    entities.append((mention_buffer, qid, start, end))
                final_text += mention_buffer

            entity_buffer = ""
            mention_buffer = ""

            reading_text = True
            reading_entity = False
            reading_mention = False
            reading_special_case = False
    return final_text, entities


def _write_training_description(outputfile, qid, description):
    if description is not None:
        line = str(qid) + "|" + description + "\n"
        outputfile.write(line)


def _write_training_entities(outputfile, article_id, clean_text, entities):
    entities_data = [
        {"alias": ent[0], "entity": ent[1], "start": ent[2], "end": ent[3]}
        for ent in entities
    ]
    line = (
        json.dumps(
            {
                "article_id": article_id,
                "clean_text": clean_text,
                "entities": entities_data,
            },
            ensure_ascii=False,
        )
        + "\n"
    )
    outputfile.write(line)
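Each line that _write_training_entities emits is one self-contained JSON record per article. A hypothetical example line, with the article ID, text and offsets invented for illustration:

{"article_id": "12", "clean_text": "Douglas Adams was an English author.", "entities": [{"alias": "Douglas Adams", "entity": "Q42", "start": 0, "end": 13}]}
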

def read_training(nlp, entity_file_path, dev, limit, kb, labels_discard=None):
    """ This method provides training examples that correspond to the entity annotations found by the nlp object.
    For training, it will include both positive and negative examples by using the candidate generator from the kb.
    For testing (kb=None), it will include all positive examples only."""

    from tqdm import tqdm

    if not labels_discard:
        labels_discard = []

    data = []
    num_entities = 0
    get_gold_parse = partial(
        _get_gold_parse, dev=dev, kb=kb, labels_discard=labels_discard
    )

    logger.info(
        "Reading {} data with limit {}".format("dev" if dev else "train", limit)
    )
    with entity_file_path.open("r", encoding="utf8") as file:
        with tqdm(total=limit, leave=False) as pbar:
            for i, line in enumerate(file):
                example = json.loads(line)
                article_id = example["article_id"]
                clean_text = example["clean_text"]
                entities = example["entities"]

                if dev != is_dev(article_id) or not is_valid_article(clean_text):
                    continue

                doc = nlp(clean_text)
                gold = get_gold_parse(doc, entities)
                if gold and len(gold.links) > 0:
                    data.append((doc, gold))
                    num_entities += len(gold.links)
                    pbar.update(len(gold.links))
                if limit and num_entities >= limit:
                    break
    logger.info("Read {} entities in {} articles".format(num_entities, len(data)))
    return data


def _get_gold_parse(doc, entities, dev, kb, labels_discard):
    gold_entities = {}
    tagged_ent_positions = {
        (ent.start_char, ent.end_char): ent
        for ent in doc.ents
        if ent.label_ not in labels_discard
    }

    for entity in entities:
        entity_id = entity["entity"]
        alias = entity["alias"]
        start = entity["start"]
        end = entity["end"]

        candidate_ids = []
        if kb and not dev:
            candidates = kb.get_candidates(alias)
            candidate_ids = [cand.entity_ for cand in candidates]

        tagged_ent = tagged_ent_positions.get((start, end), None)
        if tagged_ent:
            # TODO: check that alias == doc.text[start:end]
            should_add_ent = (dev or entity_id in candidate_ids) and is_valid_sentence(
                tagged_ent.sent.text
            )

            if should_add_ent:
                value_by_id = {entity_id: 1.0}
                if not dev:
                    random.shuffle(candidate_ids)
                    value_by_id.update(
                        {kb_id: 0.0 for kb_id in candidate_ids if kb_id != entity_id}
                    )
                gold_entities[(start, end)] = value_by_id

    return GoldParse(doc, links=gold_entities)


def is_dev(article_id):
    if not article_id:
        return False
    return article_id.endswith("3")


def is_valid_article(doc_text):
    # custom length cut-off
    return 10 < len(doc_text) < 30000


def is_valid_sentence(sent_text):
    if not 10 < len(sent_text) < 3000:
        # custom length cut-off
        return False

    if sent_text.strip().startswith("*") or sent_text.strip().startswith("#"):
        # remove 'enumeration' sentences (occurs often on Wikipedia)
        return False

    return True

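The gold format produced by _get_gold_parse keys entity offsets to candidate-probability dicts. A minimal sketch of building such a GoldParse by hand — the text, offsets and QIDs below are hypothetical:

from spacy.gold import GoldParse

def make_el_gold(nlp):
    doc = nlp("Douglas Adams was an English author.")
    # offsets (0, 13) cover "Douglas Adams"; Q42 is the correct entity,
    # Q123 stands in for a negative candidate from the KB
    links = {(0, 13): {"Q42": 1.0, "Q123": 0.0}}
    return GoldParse(doc, links=links)
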
@@ -7,7 +7,7 @@ dependency tree to find the noun phrase they are referring to – for example:
$9.4 million --> Net income.

Compatible with: spaCy v2.0.0+
Last tested with: v2.1.0
Last tested with: v2.2.1
"""
from __future__ import unicode_literals, print_function

@@ -38,14 +38,17 @@ def main(model="en_core_web_sm"):

def filter_spans(spans):
    # Filter a sequence of spans so they don't contain overlaps
    get_sort_key = lambda span: (span.end - span.start, span.start)
    # For spaCy 2.1.4+: this function is available as spacy.util.filter_spans()
    get_sort_key = lambda span: (span.end - span.start, -span.start)
    sorted_spans = sorted(spans, key=get_sort_key, reverse=True)
    result = []
    seen_tokens = set()
    for span in sorted_spans:
        # Check for end - 1 here because boundaries are inclusive
        if span.start not in seen_tokens and span.end - 1 not in seen_tokens:
            result.append(span)
            seen_tokens.update(range(span.start, span.end))
    result = sorted(result, key=lambda span: span.start)
    return result

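As the new comment notes, spaCy 2.1.4+ ships this helper as spacy.util.filter_spans, so callers can drop the local copy. A short usage sketch with made-up overlapping spans:

import spacy
from spacy.util import filter_spans  # available in spaCy 2.1.4+

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")
spans = [doc[0:3], doc[1:4], doc[5:7]]  # overlapping candidate spans
# keeps the longest spans and drops any that overlap with an already-kept one
non_overlapping = filter_spans(spans)
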
@@ -91,8 +91,8 @@ def demo(shape):
    nlp = spacy.load("en_vectors_web_lg")
    nlp.add_pipe(KerasSimilarityShim.load(nlp.path / "similarity", nlp, shape[0]))

    doc1 = nlp(u"The king of France is bald.")
    doc2 = nlp(u"France has no king.")
    doc1 = nlp("The king of France is bald.")
    doc2 = nlp("France has no king.")

    print("Sentence 1:", doc1)
    print("Sentence 2:", doc2)

examples/load_from_docbin.py (new file, 45 lines)
@@ -0,0 +1,45 @@
# coding: utf-8
"""
Example of loading previously parsed text using spaCy's DocBin class. The example
performs an entity count to show that the annotations are available.
For more details, see https://spacy.io/usage/saving-loading#docs
Installation:
python -m spacy download en_core_web_lg
Usage:
python examples/load_from_docbin.py en_core_web_lg RC_2015-03-9.spacy
"""
from __future__ import unicode_literals

import spacy
from spacy.tokens import DocBin
from timeit import default_timer as timer
from collections import Counter

EXAMPLE_PARSES_PATH = "RC_2015-03-9.spacy"


def main(model="en_core_web_lg", docbin_path=EXAMPLE_PARSES_PATH):
    nlp = spacy.load(model)
    print("Reading data from {}".format(docbin_path))
    with open(docbin_path, "rb") as file_:
        bytes_data = file_.read()
    nr_word = 0
    start_time = timer()
    entities = Counter()
    docbin = DocBin().from_bytes(bytes_data)
    for doc in docbin.get_docs(nlp.vocab):
        nr_word += len(doc)
        entities.update((e.label_, e.text) for e in doc.ents)
    end_time = timer()
    msg = "Loaded {nr_word} words in {seconds} seconds ({wps} words per second)"
    wps = nr_word / (end_time - start_time)
    print(msg.format(nr_word=nr_word, seconds=end_time - start_time, wps=wps))
    print("Most common entities:")
    for (label, entity), freq in entities.most_common(30):
        print(freq, entity, label)


if __name__ == "__main__":
    import plac

    plac.call(main)
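A file like RC_2015-03-9.spacy can be produced with the same DocBin class. A minimal sketch — the corpus texts, output path and attribute selection here are illustrative, not part of the example above:

import spacy
from spacy.tokens import DocBin

def serialize_corpus(texts, out_path="parses.spacy"):
    nlp = spacy.load("en_core_web_lg")
    # store only the annotations we need to keep the file small
    doc_bin = DocBin(attrs=["ENT_IOB", "ENT_TYPE", "HEAD", "DEP"])
    for doc in nlp.pipe(texts):
        doc_bin.add(doc)
    with open(out_path, "wb") as file_:
        file_.write(doc_bin.to_bytes())
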
examples/training/conllu-config.json (new file, 1 line)
@@ -0,0 +1 @@
{"nr_epoch": 3, "batch_size": 24, "dropout": 0.001, "vectors": 0, "multitask_tag": 0, "multitask_sent": 0}

@@ -13,8 +13,7 @@ import spacy.util
from spacy.tokens import Token, Doc
from spacy.gold import GoldParse
from spacy.syntax.nonproj import projectivize
from collections import defaultdict, Counter
from timeit import default_timer as timer
from collections import defaultdict
from spacy.matcher import Matcher

import itertools

@@ -290,11 +289,6 @@ def get_token_conllu(token, i):
    return "\n".join(lines)


Token.set_extension("get_conllu_lines", method=get_token_conllu)
Token.set_extension("begins_fused", default=False)
Token.set_extension("inside_fused", default=False)


##################
# Initialization #
##################

@@ -381,20 +375,24 @@ class TreebankPaths(object):

@plac.annotations(
    ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
    parses_dir=("Directory to write the development parses", "positional", None, Path),
    config=("Path to json formatted config file", "positional", None, Config.load),
    corpus=(
        "UD corpus to train and evaluate on, e.g. en, es_ancora, etc",
        "UD corpus to train and evaluate on, e.g. UD_Spanish-AnCora",
        "positional",
        None,
        str,
    ),
    parses_dir=("Directory to write the development parses", "positional", None, Path),
    config=("Path to json formatted config file", "positional", None, Config.load),
    limit=("Size limit", "option", "n", int),
)
def main(ud_dir, parses_dir, config, corpus, limit=0):
    # temp fix to avoid import issues cf https://github.com/explosion/spaCy/issues/4200
    import tqdm

    Token.set_extension("get_conllu_lines", method=get_token_conllu)
    Token.set_extension("begins_fused", default=False)
    Token.set_extension("inside_fused", default=False)

    paths = TreebankPaths(ud_dir, corpus)
    if not (parses_dir / corpus).exists():
        (parses_dir / corpus).mkdir()

@@ -403,8 +401,8 @@ def main(ud_dir, parses_dir, config, corpus, limit=0):

    docs, golds = read_data(
        nlp,
        paths.train.conllu.open(),
        paths.train.text.open(),
        paths.train.conllu.open(encoding="utf8"),
        paths.train.text.open(encoding="utf8"),
        max_doc_length=config.max_doc_length,
        limit=limit,
    )

@@ -18,19 +18,21 @@ during training. We discard the auxiliary model before run-time.
The specific example here is not necessarily a good idea --- but it shows
how an arbitrary objective function for some word can be used.

Developed and tested for spaCy 2.0.6
Developed and tested for spaCy 2.0.6. Updated for v2.2.2
"""
import random
import plac
import spacy
import os.path
from spacy.tokens import Doc
from spacy.gold import read_json_file, GoldParse

random.seed(0)

PWD = os.path.dirname(__file__)

TRAIN_DATA = list(read_json_file(os.path.join(PWD, "training-data.json")))
TRAIN_DATA = list(read_json_file(
    os.path.join(PWD, "ner_example_data", "ner-sent-per-line.json")))


def get_position_label(i, words, tags, heads, labels, ents):

@@ -55,6 +57,7 @@ def main(n_iter=10):
    ner = nlp.create_pipe("ner")
    ner.add_multitask_objective(get_position_label)
    nlp.add_pipe(ner)
    print(nlp.pipeline)

    print("Create data", len(TRAIN_DATA))
    optimizer = nlp.begin_training(get_gold_tuples=lambda: TRAIN_DATA)

@@ -62,9 +65,9 @@ def main(n_iter=10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annot_brackets in TRAIN_DATA:
            annotations, _ = annot_brackets
            doc = nlp.make_doc(text)
            gold = GoldParse.from_annot_tuples(doc, annotations[0])
            for annotations, _ in annot_brackets:
                doc = Doc(nlp.vocab, words=annotations[1])
                gold = GoldParse.from_annot_tuples(doc, annotations)
                nlp.update(
                    [doc],  # batch of texts
                    [gold],  # batch of annotations

@@ -76,6 +79,7 @@ def main(n_iter=10):

    # test the trained model
    for text, _ in TRAIN_DATA:
        if text is not None:
            doc = nlp(text)
            print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
            print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
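A multitask objective like get_position_label above maps each token to an auxiliary label that the NER model learns alongside its main objective. A minimal sketch of such a function — this is a hypothetical stand-in with the same signature, not the example's actual implementation:

def get_position_label_sketch(i, words, tags, heads, labels, ents):
    """Return a bucketed position label for token i (hypothetical buckets)."""
    if i == 0:
        return "first"
    elif i < 10:
        return "early"
    else:
        return "late"
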

@@ -8,7 +8,7 @@
{
    "tokens": [
        {
            "head": 4,
            "head": 44,
            "dep": "prep",
            "tag": "IN",
            "orth": "In",

@@ -48,4 +48,6 @@ redirects = [
    {from = "/api/sentencesegmenter", to="/api/sentencizer"},
    {from = "/universe", to = "/universe/project/:id", query = {id = ":id"}, force = true},
    {from = "/universe", to = "/universe/category/:category", query = {category = ":category"}, force = true},
    # Renamed universe projects
    {from = "/universe/project/spacy-pytorch-transformers", to = "/universe/project/spacy-transformers", force = true}
]

@@ -1,15 +1,16 @@
# Our libraries
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc>=7.1.1,<7.2.0
thinc>=7.3.0,<7.4.0
blis>=0.4.0,<0.5.0
murmurhash>=0.28.0,<1.1.0
wasabi>=0.2.0,<1.1.0
wasabi>=0.4.0,<1.1.0
srsly>=0.1.0,<1.1.0
catalogue>=0.0.7,<1.1.0
# Third party dependencies
numpy>=1.15.0
requests>=2.13.0,<3.0.0
plac<1.0.0,>=0.9.6
plac>=0.9.6,<1.2.0
pathlib==1.0.1; python_version < "3.4"
# Optional dependencies
jsonschema>=2.6.0,<3.1.0

setup.cfg
@@ -22,6 +22,7 @@ classifiers =
    Programming Language :: Python :: 3.5
    Programming Language :: Python :: 3.6
    Programming Language :: Python :: 3.7
    Programming Language :: Python :: 3.8
    Topic :: Scientific/Engineering

[options]

@@ -37,40 +38,38 @@ setup_requires =
    cymem>=2.0.2,<2.1.0
    preshed>=3.0.2,<3.1.0
    murmurhash>=0.28.0,<1.1.0
    thinc>=7.1.1,<7.2.0
    thinc>=7.3.0,<7.4.0
install_requires =
    numpy>=1.15.0
    # Our libraries
    murmurhash>=0.28.0,<1.1.0
    cymem>=2.0.2,<2.1.0
    preshed>=3.0.2,<3.1.0
    thinc>=7.1.1,<7.2.0
    thinc>=7.3.0,<7.4.0
    blis>=0.4.0,<0.5.0
    plac<1.0.0,>=0.9.6
    requests>=2.13.0,<3.0.0
    wasabi>=0.2.0,<1.1.0
    wasabi>=0.4.0,<1.1.0
    srsly>=0.1.0,<1.1.0
    catalogue>=0.0.7,<1.1.0
    # Third-party dependencies
    setuptools
    numpy>=1.15.0
    plac>=0.9.6,<1.2.0
    requests>=2.13.0,<3.0.0
    pathlib==1.0.1; python_version < "3.4"

[options.extras_require]
lookups =
    spacy_lookups_data>=0.0.5,<0.2.0
cuda =
    thinc_gpu_ops>=0.0.1,<0.1.0
    cupy>=5.0.0b4
cuda80 =
    thinc_gpu_ops>=0.0.1,<0.1.0
    cupy-cuda80>=5.0.0b4
cuda90 =
    thinc_gpu_ops>=0.0.1,<0.1.0
    cupy-cuda90>=5.0.0b4
cuda91 =
    thinc_gpu_ops>=0.0.1,<0.1.0
    cupy-cuda91>=5.0.0b4
cuda92 =
    thinc_gpu_ops>=0.0.1,<0.1.0
    cupy-cuda92>=5.0.0b4
cuda100 =
    thinc_gpu_ops>=0.0.1,<0.1.0
    cupy-cuda100>=5.0.0b4
# Language tokenizers with external dependencies
ja =

@@ -9,11 +9,14 @@ warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
# These are imported as part of the API
from thinc.neural.util import prefer_gpu, require_gpu

from . import pipeline
from .cli.info import info as cli_info
from .glossary import explain
from .about import __version__
from .errors import Errors, Warnings, deprecation_warning
from . import util
from .util import registry
from .language import component


if sys.maxunicode == 65535:
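The newly exported component decorator is what lets pipeline functions declare the assigns/requires metadata that the new spacy/analysis.py (added further down in this diff) inspects. A hedged usage sketch — the component name and logic are hypothetical, and the exact decorator semantics should be checked against the v2.2 docs:

from spacy.language import component

@component("custom_boundary_setter", assigns=["token.is_sent_start"])
def custom_boundary_setter(doc):
    # naive illustration: start a new sentence after every semicolon
    for token in doc[:-1]:
        if token.text == ";":
            doc[token.i + 1].is_sent_start = True
    return doc
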

@@ -7,12 +7,10 @@ from __future__ import print_function
if __name__ == "__main__":
    import plac
    import sys
    from wasabi import Printer
    from wasabi import msg
    from spacy.cli import download, link, info, package, train, pretrain, convert
    from spacy.cli import init_model, profile, evaluate, validate, debug_data

    msg = Printer()

    commands = {
        "download": download,
        "link": link,

spacy/_ml.py
@@ -7,12 +7,15 @@ from thinc.i2v import HashEmbed, StaticVectors
from thinc.t2t import ExtractWindow, ParametricAttention
from thinc.t2v import Pooling, sum_pool, mean_pool, max_pool
from thinc.misc import Residual
from thinc.v2v import Model, Maxout, Softmax, Affine, ReLu
from thinc.t2t import ExtractWindow, ParametricAttention
from thinc.t2v import Pooling, sum_pool, mean_pool
from thinc.i2v import HashEmbed
from thinc.misc import Residual, FeatureExtracter
from thinc.misc import LayerNorm as LN
from thinc.misc import FeatureExtracter
from thinc.api import add, layerize, chain, clone, concatenate, with_flatten
from thinc.api import with_getitem, flatten_add_lengths
from thinc.api import uniqued, wrap, noop
from thinc.api import with_square_sequences
from thinc.linear.linear import LinearModel
from thinc.neural.ops import NumpyOps, CupyOps
from thinc.neural.util import get_array_module, copy_array, to_categorical

@@ -29,14 +32,13 @@ from .strings import get_string_id
from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE
from .errors import Errors, user_warning, Warnings
from . import util
from . import ml as new_ml
from .ml import _legacy_tok2vec

try:
    import torch.nn
    from thinc.extra.wrappers import PyTorchWrapperRNN
except ImportError:
    torch = None

VECTORS_KEY = "spacy_pretrained_vectors"
# Backwards compatibility with <2.2.2
USE_MODEL_REGISTRY_TOK2VEC = False


def cosine(vec1, vec2):

@@ -314,6 +316,10 @@ def link_vectors_to_models(vocab):


def PyTorchBiLSTM(nO, nI, depth, dropout=0.2):
    import torch.nn
    from thinc.api import with_square_sequences
    from thinc.extra.wrappers import PyTorchWrapperRNN

    if depth == 0:
        return layerize(noop())
    model = torch.nn.LSTM(nI, nO // 2, depth, bidirectional=True, dropout=dropout)

@@ -336,161 +342,91 @@ def Tok2Vec_chars_cnn(width, embed_size, **kwargs):
    tok2vec.embed = embed
    return tok2vec

def Tok2Vec_chars_selfattention(width, embed_size, **kwargs):
    cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3)
    sa_depth = kwargs.get("self_attn_depth", 4)
    with Model.define_operators(
        {">>": chain, "|": concatenate, "**": clone, "+": add, "*": reapply}
    ):
        embed = (
            CharacterEmbed(nM=64, nC=8)
            >> with_flatten(LN(Maxout(width, 64*8, pieces=cnn_maxout_pieces))))
        tok2vec = (
            embed
            >> PositionEncode(10000, width)
            >> SelfAttention(width, 1, 4) ** sa_depth
        )

    # Work around thinc API limitations :(. TODO: Revise in Thinc 7
    tok2vec.nO = width
    tok2vec.embed = embed
    return tok2vec


def Tok2Vec_chars_bilstm(width, embed_size, **kwargs):
    cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3)
    depth = kwargs.get("bilstm_depth", 2)
    with Model.define_operators(
        {">>": chain, "|": concatenate, "**": clone, "+": add, "*": reapply}
    ):
        embed = (
            CharacterEmbed(nM=64, nC=8)
            >> with_flatten(LN(Maxout(width, 64*8, pieces=cnn_maxout_pieces))))
        tok2vec = (
            embed
            >> Residual(PyTorchBiLSTM(width, width, depth))
            >> with_flatten(LN(nO=width))
        )
    # Work around thinc API limitations :(. TODO: Revise in Thinc 7
    tok2vec.nO = width
    tok2vec.embed = embed
    return tok2vec



def CNN(width, depth, pieces, nW=1):
    if pieces == 1:
        layer = chain(
            ExtractWindow(nW=nW),
            Mish(width, width*(nW*2+1)),
            LN(nO=width)
        )
        return clone(Residual(layer), depth)
    else:
        layer = chain(
            ExtractWindow(nW=nW),
            LN(Maxout(width, width * (nW*2+1), pieces=pieces)))
        return clone(Residual(layer), depth)


def SelfAttention(width, depth, pieces):
    layer = chain(
        prepare_self_attention(Affine(width * 3, width), nM=width, nH=pieces, window=None),
        MultiHeadedAttention(),
        with_flatten(Maxout(width, width)))
    return clone(Residual(chain(layer, with_flatten(LN(nO=width)))), depth)


def PositionEncode(L, D):
    positions = NumpyOps().position_encode(L, D)
    positions = Model.ops.asarray(positions)
    def position_encode_forward(Xs, drop=0.):
        output = []
        for x in Xs:
            output.append(x + positions[:x.shape[0]])
        def position_encode_backward(dYs, sgd=None):
            return dYs
        return output, position_encode_backward
    return layerize(position_encode_forward)


def Tok2Vec(width, embed_size, **kwargs):
    pretrained_vectors = kwargs.setdefault("pretrained_vectors", None)
    cnn_maxout_pieces = kwargs.setdefault("cnn_maxout_pieces", 3)
    subword_features = kwargs.setdefault("subword_features", True)
    char_embed = kwargs.setdefault("char_embed", False)
    conv_depth = kwargs.setdefault("conv_depth", 4)
    bilstm_depth = kwargs.setdefault("bilstm_depth", 0)
    self_attn_depth = kwargs.setdefault("self_attn_depth", 0)
    conv_window = kwargs.setdefault("conv_window", 1)
    if char_embed and self_attn_depth:
        return Tok2Vec_chars_selfattention(width, embed_size, **kwargs)
    elif char_embed and bilstm_depth:
        return Tok2Vec_chars_bilstm(width, embed_size, **kwargs)
    elif char_embed and conv_depth:
        return Tok2Vec_chars_cnn(width, embed_size, **kwargs)
    cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
    with Model.define_operators(
        {">>": chain, "|": concatenate, "**": clone, "+": add, "*": reapply}
    ):
        norm = HashEmbed(width, embed_size, column=cols.index(NORM), name="embed_norm")
        if subword_features:
            prefix = HashEmbed(
                width, embed_size // 2, column=cols.index(PREFIX), name="embed_prefix"
            )
            suffix = HashEmbed(
                width, embed_size // 2, column=cols.index(SUFFIX), name="embed_suffix"
            )
            shape = HashEmbed(
                width, embed_size // 2, column=cols.index(SHAPE), name="embed_shape"
            )
        else:
            prefix, suffix, shape = (None, None, None)
        if pretrained_vectors is not None:
            glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))
    if not USE_MODEL_REGISTRY_TOK2VEC:
        # Preserve prior tok2vec for backwards compat, in v2.2.2
        return _legacy_tok2vec.Tok2Vec(width, embed_size, **kwargs)
    pretrained_vectors = kwargs.get("pretrained_vectors", None)
    cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3)
    subword_features = kwargs.get("subword_features", True)
    char_embed = kwargs.get("char_embed", False)
    conv_depth = kwargs.get("conv_depth", 4)
    bilstm_depth = kwargs.get("bilstm_depth", 0)
    conv_window = kwargs.get("conv_window", 1)

            if subword_features:
                embed = uniqued(
                    (glove | norm | prefix | suffix | shape)
                    >> LN(Maxout(width, width * 5, pieces=3)),
                    column=cols.index(ORTH),
                )
            else:
                embed = uniqued(
                    (glove | norm) >> LN(Maxout(width, width * 2, pieces=3)),
                    column=cols.index(ORTH),
                )
        elif subword_features:
            embed = uniqued(
                (norm | prefix | suffix | shape)
                >> LN(Maxout(width, width * 4, pieces=3)),
                column=cols.index(ORTH),
            )
        else:
            embed = norm
    cols = ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"]

        tok2vec = (
            FeatureExtracter(cols)
            >> with_bos_eos(
                with_flatten(
                    embed
                    >> CNN(width, conv_depth, cnn_maxout_pieces, nW=conv_window),
                    pad=conv_depth * conv_window)
            )
        )

        if bilstm_depth >= 1:
            tok2vec = (
                tok2vec
                >> Residual(
                    PyTorchBiLSTM(width, width, 1)
                    >> LN(nO=width)
                ) ** bilstm_depth
            )
        # Work around thinc API limitations :(. TODO: Revise in Thinc 7
        tok2vec.nO = width
        tok2vec.embed = embed
        return tok2vec
    doc2feats_cfg = {"arch": "spacy.Doc2Feats.v1", "config": {"columns": cols}}
    if char_embed:
        embed_cfg = {
            "arch": "spacy.CharacterEmbed.v1",
            "config": {
                "width": 64,
                "chars": 6,
                "@mix": {
                    "arch": "spacy.LayerNormalizedMaxout.v1",
                    "config": {"width": width, "pieces": 3},
                },
                "@embed_features": None,
            },
        }
    else:
        embed_cfg = {
            "arch": "spacy.MultiHashEmbed.v1",
            "config": {
                "width": width,
                "rows": embed_size,
                "columns": cols,
                "use_subwords": subword_features,
                "@pretrained_vectors": None,
                "@mix": {
                    "arch": "spacy.LayerNormalizedMaxout.v1",
                    "config": {"width": width, "pieces": 3},
                },
            },
        }
    if pretrained_vectors:
        embed_cfg["config"]["@pretrained_vectors"] = {
            "arch": "spacy.PretrainedVectors.v1",
            "config": {
                "vectors_name": pretrained_vectors,
                "width": width,
                "column": cols.index("ID"),
            },
        }
    if cnn_maxout_pieces >= 2:
        cnn_cfg = {
            "arch": "spacy.MaxoutWindowEncoder.v1",
            "config": {
                "width": width,
                "window_size": conv_window,
                "pieces": cnn_maxout_pieces,
                "depth": conv_depth,
            },
        }
    else:
        cnn_cfg = {
            "arch": "spacy.MishWindowEncoder.v1",
            "config": {"width": width, "window_size": conv_window, "depth": conv_depth},
        }
    bilstm_cfg = {
        "arch": "spacy.TorchBiLSTMEncoder.v1",
        "config": {"width": width, "depth": bilstm_depth},
    }
    if conv_depth == 0 and bilstm_depth == 0:
        encode_cfg = {}
    elif conv_depth >= 1 and bilstm_depth >= 1:
        encode_cfg = {
            "arch": "thinc.FeedForward.v1",
            "config": {"children": [cnn_cfg, bilstm_cfg]},
        }
    elif conv_depth >= 1:
        encode_cfg = cnn_cfg
    else:
        encode_cfg = bilstm_cfg
    config = {"@doc2feats": doc2feats_cfg, "@embed": embed_cfg, "@encode": encode_cfg}
    return new_ml.Tok2Vec(config)


def with_bos_eos(layer):

@@ -1101,6 +1037,7 @@ class CharacterEmbed(Model):
        return output, backprop_character_embed


<<<<<<< HEAD
def get_characters_loss(ops, docs, prediction, nr_char=10):
    target_ids = numpy.vstack([doc.to_utf8_array(nr_char=nr_char) for doc in docs])
    target_ids = target_ids.reshape((-1,))

@@ -1112,6 +1049,8 @@ def get_characters_loss(ops, docs, prediction, nr_char=10):
    return loss, d_target


=======
>>>>>>> master
def get_cossim_loss(yh, y, ignore_zeros=False):
    xp = get_array_module(yh)
    # Find the zero vectors

@@ -1,6 +1,6 @@
# fmt: off
__title__ = "spacy"
__version__ = "2.2.1"
__version__ = "2.2.2"
__release__ = True
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"

spacy/analysis.py (new file, 179 lines)
@@ -0,0 +1,179 @@
# coding: utf8
from __future__ import unicode_literals

from collections import OrderedDict
from wasabi import Printer

from .tokens import Doc, Token, Span
from .errors import Errors, Warnings, user_warning


def analyze_pipes(pipeline, name, pipe, index, warn=True):
    """Analyze a pipeline component with respect to its position in the current
    pipeline and the other components. Will check whether requirements are
    fulfilled (e.g. if previous components assign the attributes).

    pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
    name (unicode): The name of the pipeline component to analyze.
    pipe (callable): The pipeline component function to analyze.
    index (int): The index of the component in the pipeline.
    warn (bool): Show user warning if problem is found.
    RETURNS (list): The problems found for the given pipeline component.
    """
    assert pipeline[index][0] == name
    prev_pipes = pipeline[:index]
    pipe_requires = getattr(pipe, "requires", [])
    requires = OrderedDict([(annot, False) for annot in pipe_requires])
    if requires:
        for prev_name, prev_pipe in prev_pipes:
            prev_assigns = getattr(prev_pipe, "assigns", [])
            for annot in prev_assigns:
                requires[annot] = True
    problems = []
    for annot, fulfilled in requires.items():
        if not fulfilled:
            problems.append(annot)
            if warn:
                user_warning(Warnings.W025.format(name=name, attr=annot))
    return problems


def analyze_all_pipes(pipeline, warn=True):
    """Analyze all pipes in the pipeline in order.

    pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
    warn (bool): Show user warning if problem is found.
    RETURNS (dict): The problems found, keyed by component name.
    """
    problems = {}
    for i, (name, pipe) in enumerate(pipeline):
        problems[name] = analyze_pipes(pipeline, name, pipe, i, warn=warn)
    return problems


def dot_to_dict(values):
    """Convert dot notation to a dict. For example: ["token.pos", "token._.xyz"]
    become {"token": {"pos": True, "_": {"xyz": True }}}.

    values (iterable): The values to convert.
    RETURNS (dict): The converted values.
    """
    result = {}
    for value in values:
        path = result
        parts = value.lower().split(".")
        for i, item in enumerate(parts):
            is_last = i == len(parts) - 1
            path = path.setdefault(item, True if is_last else {})
    return result


def validate_attrs(values):
    """Validate component attributes provided to "assigns", "requires" etc.
    Raises error for invalid attributes and formatting. Doesn't check if
    custom extension attributes are registered, since this is something the
    user might want to do themselves later in the component.

    values (iterable): The string attributes to check, e.g. `["token.pos"]`.
    RETURNS (iterable): The checked attributes.
    """
    data = dot_to_dict(values)
    objs = {"doc": Doc, "token": Token, "span": Span}
    for obj_key, attrs in data.items():
        if obj_key == "span":
            # Support Span only for custom extension attributes
            span_attrs = [attr for attr in values if attr.startswith("span.")]
            span_attrs = [attr for attr in span_attrs if not attr.startswith("span._.")]
            if span_attrs:
                raise ValueError(Errors.E180.format(attrs=", ".join(span_attrs)))
        if obj_key not in objs:  # first element is not doc/token/span
            invalid_attrs = ", ".join(a for a in values if a.startswith(obj_key))
            raise ValueError(Errors.E181.format(obj=obj_key, attrs=invalid_attrs))
        if not isinstance(attrs, dict):  # attr is something like "doc"
            raise ValueError(Errors.E182.format(attr=obj_key))
        for attr, value in attrs.items():
            if attr == "_":
                if value is True:  # attr is something like "doc._"
                    raise ValueError(Errors.E182.format(attr="{}._".format(obj_key)))
                for ext_attr, ext_value in value.items():
                    # We don't check whether the attribute actually exists
                    if ext_value is not True:  # attr is something like doc._.x.y
                        good = "{}._.{}".format(obj_key, ext_attr)
                        bad = "{}.{}".format(good, ".".join(ext_value))
                        raise ValueError(Errors.E183.format(attr=bad, solution=good))
                continue  # we can't validate those further
            if attr.endswith("_"):  # attr is something like "token.pos_"
                raise ValueError(Errors.E184.format(attr=attr, solution=attr[:-1]))
            if value is not True:  # attr is something like doc.x.y
                good = "{}.{}".format(obj_key, attr)
                bad = "{}.{}".format(good, ".".join(value))
                raise ValueError(Errors.E183.format(attr=bad, solution=good))
            obj = objs[obj_key]
            if not hasattr(obj, attr):
                raise ValueError(Errors.E185.format(obj=obj_key, attr=attr))
    return values


def _get_feature_for_attr(pipeline, attr, feature):
    assert feature in ["assigns", "requires"]
    result = []
    for pipe_name, pipe in pipeline:
        pipe_assigns = getattr(pipe, feature, [])
        if attr in pipe_assigns:
            result.append((pipe_name, pipe))
    return result


def get_assigns_for_attr(pipeline, attr):
    """Get all pipeline components that assign an attr, e.g. "doc.tensor".

    pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
    attr (unicode): The attribute to check.
    RETURNS (list): (name, pipeline) tuples of components that assign the attr.
    """
    return _get_feature_for_attr(pipeline, attr, "assigns")


def get_requires_for_attr(pipeline, attr):
    """Get all pipeline components that require an attr, e.g. "doc.tensor".

    pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
    attr (unicode): The attribute to check.
    RETURNS (list): (name, pipeline) tuples of components that require the attr.
    """
    return _get_feature_for_attr(pipeline, attr, "requires")


def print_summary(nlp, pretty=True, no_print=False):
    """Print a formatted summary for the current nlp object's pipeline. Shows
    a table with the pipeline components and what they assign and require, as
    well as any problems if available.

    nlp (Language): The nlp object.
    pretty (bool): Pretty-print the results (color etc).
    no_print (bool): Don't print anything, just return the data.
    RETURNS (dict): A dict with "overview" and "problems".
    """
    msg = Printer(pretty=pretty, no_print=no_print)
    overview = []
    problems = {}
    for i, (name, pipe) in enumerate(nlp.pipeline):
        requires = getattr(pipe, "requires", [])
        assigns = getattr(pipe, "assigns", [])
        retok = getattr(pipe, "retokenizes", False)
        overview.append((i, name, requires, assigns, retok))
        problems[name] = analyze_pipes(nlp.pipeline, name, pipe, i, warn=False)
    msg.divider("Pipeline Overview")
    header = ("#", "Component", "Requires", "Assigns", "Retokenizes")
    msg.table(overview, header=header, divider=True, multiline=True)
    n_problems = sum(len(p) for p in problems.values())
    if any(p for p in problems.values()):
        msg.divider("Problems ({})".format(n_problems))
        for name, problem in problems.items():
            if problem:
                problem = ", ".join(problem)
                msg.warn("'{}' requirements not met: {}".format(name, problem))
    else:
        msg.good("No problems found.")
    if no_print:
        return {"overview": overview, "problems": problems}

@@ -57,7 +57,8 @@ def convert(
    is written to stdout, so you can pipe them forward to a JSON file:
    $ spacy convert some_file.conllu > some_file.json
    """
    msg = Printer()
    no_print = output_dir == "-"
    msg = Printer(no_print=no_print)
    input_path = Path(input_file)
    if file_type not in FILE_TYPES:
        msg.fail(

@@ -102,6 +103,7 @@ def convert(
        use_morphology=morphology,
        lang=lang,
        model=model,
        no_print=no_print,
    )
    if output_dir != "-":
        # Export data to a file

@@ -9,7 +9,9 @@ from ...tokens.doc import Doc
from ...util import load_model


def conll_ner2json(input_data, n_sents=10, seg_sents=False, model=None, **kwargs):
def conll_ner2json(
    input_data, n_sents=10, seg_sents=False, model=None, no_print=False, **kwargs
):
    """
    Convert files in the CoNLL-2003 NER format and similar
    whitespace-separated columns into JSON format for use with train cli.

@@ -34,7 +36,7 @@ def conll_ner2json(input_data, n_sents=10, seg_sents=False, model=None, **kwargs
    . O

    """
    msg = Printer()
    msg = Printer(no_print=no_print)
    doc_delimiter = "-DOCSTART- -X- O O"
    # check for existing delimiters, which should be preserved
    if "\n\n" in input_data and seg_sents:

@@ -34,6 +34,9 @@ def conllu2json(input_data, n_sents=10, use_morphology=False, lang=None, **_):
            doc = create_doc(sentences, i)
            docs.append(doc)
            sentences = []
    if sentences:
        doc = create_doc(sentences, i)
        docs.append(doc)
    return docs


@@ -8,7 +8,7 @@ from ...util import minibatch
from .conll_ner2json import n_sents_info


def iob2json(input_data, n_sents=10, *args, **kwargs):
def iob2json(input_data, n_sents=10, no_print=False, *args, **kwargs):
    """
    Convert IOB files with one sentence per line and tags separated with '|'
    into JSON format for use with train cli. IOB and IOB2 are accepted.

@@ -20,7 +20,7 @@ def iob2json(input_data, n_sents=10, *args, **kwargs):
    I|PRP|O like|VBP|O London|NNP|I-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
    I|PRP|O like|VBP|O London|NNP|B-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
    """
    msg = Printer()
    msg = Printer(no_print=no_print)
    docs = read_iob(input_data.split("\n"))
    if n_sents > 0:
        n_sents_info(msg, n_sents)

@@ -7,7 +7,7 @@ from ...gold import docs_to_json
 from ...util import get_lang_class, minibatch


-def ner_jsonl2json(input_data, lang=None, n_sents=10, use_morphology=False):
+def ner_jsonl2json(input_data, lang=None, n_sents=10, use_morphology=False, **_):
     if lang is None:
         raise ValueError("No --lang specified, but tokenization required")
     json_docs = []
@@ -121,6 +121,8 @@ def debug_data(
     msg.text("{} training docs".format(len(train_docs)))
     msg.text("{} evaluation docs".format(len(dev_docs)))

+    if not len(dev_docs):
+        msg.fail("No evaluation docs")
     overlap = len(train_texts.intersection(dev_texts))
     if overlap:
         msg.warn("{} training examples also in evaluation data".format(overlap))

@@ -206,6 +208,9 @@ def debug_data(
                    missing_values, "value" if missing_values == 1 else "values"
                )
            )
+        for label in new_labels:
+            if len(label) == 0:
+                msg.fail("Empty label found in new labels")
        if new_labels:
            labels_with_counts = [
                (label, count)

@@ -360,6 +365,16 @@ def debug_data(
            )
        )

+    # check for documents with multiple sentences
+    sents_per_doc = gold_train_data["n_sents"] / len(gold_train_data["texts"])
+    if sents_per_doc < 1.1:
+        msg.warn(
+            "The training data contains {:.2f} sentences per "
+            "document. When there are very few documents containing more "
+            "than one sentence, the parser will not learn how to segment "
+            "longer texts into sentences.".format(sents_per_doc)
+        )
+
    # profile labels
    labels_train = [label for label in gold_train_data["deps"]]
    labels_train_unpreprocessed = [
@@ -6,17 +6,13 @@ import requests
 import os
 import subprocess
 import sys
-import pkg_resources
-from wasabi import Printer
+from wasabi import msg

 from .link import link
 from ..util import get_package_path
 from .. import about


-msg = Printer()
-
-
 @plac.annotations(
     model=("Model to download (shortcut or name)", "positional", None, str),
     direct=("Force direct download of name + version", "flag", "d", bool),

@@ -87,6 +83,8 @@ def download(model, direct=False, *pip_args):

 def require_package(name):
     try:
+        import pkg_resources
+
         pkg_resources.working_set.require(name)
         return True
     except:  # noqa: E722
@@ -3,7 +3,7 @@ from __future__ import unicode_literals, division, print_function

 import plac
 from timeit import default_timer as timer
-from wasabi import Printer
+from wasabi import msg

 from ..gold import GoldCorpus
 from .. import util

@@ -32,7 +32,6 @@ def evaluate(
     Evaluate a model. To render a sample of parses in a HTML file, set an
     output directory as the displacy_path argument.
     """
-    msg = Printer()
     util.fix_random_seed()
     if gpu_id >= 0:
         util.use_gpu(gpu_id)
@@ -4,7 +4,7 @@ from __future__ import unicode_literals

 import plac
 import platform
 from pathlib import Path
-from wasabi import Printer
+from wasabi import msg
 import srsly

 from ..compat import path2str, basestring_, unicode_

@@ -23,7 +23,6 @@ def info(model=None, markdown=False, silent=False):
     specified as an argument, print model information. Flag --markdown
     prints details in Markdown for easy copy-pasting to GitHub issues.
     """
-    msg = Printer()
     if model:
         if util.is_package(model):
             model_path = util.get_package_path(model)
@@ -11,7 +11,7 @@ import tarfile
 import gzip
 import zipfile
 import srsly
-from wasabi import Printer
+from wasabi import msg

 from ..vectors import Vectors
 from ..errors import Errors, Warnings, user_warning

@@ -24,7 +24,6 @@ except ImportError:


 DEFAULT_OOV_PROB = -20
-msg = Printer()


 @plac.annotations(
@@ -3,7 +3,7 @@ from __future__ import unicode_literals

 import plac
 from pathlib import Path
-from wasabi import Printer
+from wasabi import msg

 from ..compat import symlink_to, path2str
 from .. import util

@@ -20,7 +20,6 @@ def link(origin, link_name, force=False, model_path=None):
     either the name of a pip package, or the local path to the model data
     directory. Linking models allows loading them via spacy.load(link_name).
     """
-    msg = Printer()
     if util.is_package(origin):
         model_path = util.get_package_path(origin)
     else:
@@ -4,7 +4,7 @@ from __future__ import unicode_literals

 import plac
 import shutil
 from pathlib import Path
-from wasabi import Printer, get_raw_input
+from wasabi import msg, get_raw_input
 import srsly

 from ..compat import path2str

@@ -27,7 +27,6 @@ def package(input_dir, output_dir, meta_path=None, create_meta=False, force=Fals
     set and a meta.json already exists in the output directory, the existing
     values will be used as the defaults in the command-line prompt.
     """
-    msg = Printer()
     input_path = util.ensure_path(input_dir)
     output_path = util.ensure_path(output_dir)
     meta_path = util.ensure_path(meta_path)
@@ -133,7 +133,6 @@ def pretrain(
     for key in config:
         if isinstance(config[key], Path):
             config[key] = str(config[key])
-    msg = Printer()
     util.fix_random_seed(seed)
     if gpu_id != -1:
         has_gpu = require_gpu(gpu_id=gpu_id)

@@ -272,7 +271,7 @@ def make_update(model, docs, optimizer, drop=0.0, objective="L2"):
     """Perform an update over a single batch of documents.

     docs (iterable): A batch of `Doc` objects.
-    drop (float): The droput rate.
+    drop (float): The dropout rate.
     optimizer (callable): An optimizer.
     RETURNS loss: A float for the loss.
     """
@@ -9,7 +9,7 @@ import pstats
 import sys
 import itertools
 import thinc.extra.datasets
-from wasabi import Printer
+from wasabi import msg

 from ..util import load_model


@@ -26,7 +26,6 @@ def profile(model, inputs=None, n_texts=10000):
     It can either be provided as a JSONL file, or be read from sys.stdin.
     If no input file is specified, the IMDB dataset is loaded via Thinc.
     """
-    msg = Printer()
     if inputs is not None:
         inputs = _read_inputs(inputs, msg)
     if inputs is None:
@@ -8,7 +8,7 @@ from thinc.neural._classes.model import Model
 from timeit import default_timer as timer
 import shutil
 import srsly
-from wasabi import Printer
+from wasabi import msg
 import contextlib
 import random
 from thinc.neural.util import require_gpu

@@ -92,7 +92,6 @@ def train(
     # temp fix to avoid import issues cf https://github.com/explosion/spaCy/issues/4200
     import tqdm

-    msg = Printer()
     util.fix_random_seed()
     util.set_env_log(verbose)

@@ -159,8 +158,7 @@ def train(
            "`lang` argument ('{}') ".format(nlp.lang, lang),
            exits=1,
        )
-    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipeline]
-    nlp.disable_pipes(*other_pipes)
+    nlp.disable_pipes([p for p in nlp.pipe_names if p not in pipeline])
    for pipe in pipeline:
        if pipe not in nlp.pipe_names:
            if pipe == "parser":

@@ -266,7 +264,11 @@ def train(
                exits=1,
            )
        train_docs = corpus.train_docs(
-            nlp, noise_level=noise_level, gold_preproc=gold_preproc, max_length=0
+            nlp,
+            noise_level=noise_level,
+            gold_preproc=gold_preproc,
+            max_length=0,
+            ignore_misaligned=True,
        )
        train_labels = set()
        if textcat_multilabel:

@@ -347,6 +349,7 @@ def train(
                orth_variant_level=orth_variant_level,
                gold_preproc=gold_preproc,
                max_length=0,
+                ignore_misaligned=True,
            )
            if raw_text:
                random.shuffle(raw_text)

@@ -385,7 +388,11 @@ def train(
                    if hasattr(component, "cfg"):
                        component.cfg["beam_width"] = beam_width
                dev_docs = list(
-                    corpus.dev_docs(nlp_loaded, gold_preproc=gold_preproc)
+                    corpus.dev_docs(
+                        nlp_loaded,
+                        gold_preproc=gold_preproc,
+                        ignore_misaligned=True,
+                    )
                )
                nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs)
                start_time = timer()

@@ -402,7 +409,11 @@ def train(
                    if hasattr(component, "cfg"):
                        component.cfg["beam_width"] = beam_width
                dev_docs = list(
-                    corpus.dev_docs(nlp_loaded, gold_preproc=gold_preproc)
+                    corpus.dev_docs(
+                        nlp_loaded,
+                        gold_preproc=gold_preproc,
+                        ignore_misaligned=True,
+                    )
                )
                start_time = timer()
                scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose)
@@ -1,12 +1,11 @@
 # coding: utf8
 from __future__ import unicode_literals, print_function

-import pkg_resources
 from pathlib import Path
 import sys
 import requests
 import srsly
-from wasabi import Printer
+from wasabi import msg

 from ..compat import path2str
 from ..util import get_data_path

@@ -18,7 +17,6 @@ def validate():
     Validate that the currently installed version of spaCy is compatible
     with the installed models. Should be run after `pip install -U spacy`.
     """
-    msg = Printer()
     with msg.loading("Loading compatibility table..."):
         r = requests.get(about.__compatibility__)
     if r.status_code != 200:

@@ -109,6 +107,8 @@ def get_model_links(compat):


 def get_model_pkgs(compat, all_models):
+    import pkg_resources
+
     pkgs = {}
     for pkg_name, pkg_data in pkg_resources.working_set.by_key.items():
         package = pkg_name.replace("-", "_")
@@ -12,6 +12,7 @@ import os
 import sys
 import itertools
 import ast
+import types

 from thinc.neural.util import copy_array


@@ -62,6 +63,7 @@ if is_python2:
     basestring_ = basestring  # noqa: F821
     input_ = raw_input  # noqa: F821
     path2str = lambda path: str(path).decode("utf8")
+    class_types = (type, types.ClassType)

 elif is_python3:
     bytes_ = bytes

@@ -69,6 +71,7 @@ elif is_python3:
     basestring_ = str
     input_ = input
     path2str = lambda path: str(path)
+    class_types = (type, types.ClassType) if is_python_pre_3_5 else type


 def b_to_str(b_str):
@@ -5,7 +5,7 @@ import uuid

 from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
 from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
-from ..util import minify_html, escape_html, get_entry_points, ENTRY_POINTS
+from ..util import minify_html, escape_html, registry
 from ..errors import Errors


@@ -242,7 +242,7 @@ class EntityRenderer(object):
             "CARDINAL": "#e4e7d2",
             "PERCENT": "#e4e7d2",
         }
-        user_colors = get_entry_points(ENTRY_POINTS.displacy_colors)
+        user_colors = registry.displacy_colors.get_all()
         for user_color in user_colors.values():
             colors.update(user_color)
         colors.update(options.get("colors", {}))
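A sketch of supplying user colors through the new mechanism, assuming spacy.util.registry behaves like a catalogue-style registry (which the get_all() call above suggests); the registry entry name, label and hex value are illustrative:

from spacy.util import registry

# Register a dict of label -> color; EntityRenderer merges every
# registered dict via registry.displacy_colors.get_all().
registry.displacy_colors.register("my_colors", func={"MY_LABEL": "#ffa500"})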
@@ -44,14 +44,14 @@ TPL_ENTS = """


 TPL_ENT = """
-<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
+<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
     {text}
     <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">{label}</span>
 </mark>
 """

 TPL_ENT_RTL = """
-<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
+<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em">
     {text}
     <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-right: 0.5rem">{label}</span>
 </mark>
@@ -80,8 +80,8 @@ class Warnings(object):
             "the v2.x models cannot release the global interpreter lock. "
             "Future versions may introduce a `n_process` argument for "
             "parallel inference via multiprocessing.")
-    W017 = ("Alias '{alias}' already exists in the Knowledge base.")
-    W018 = ("Entity '{entity}' already exists in the Knowledge base.")
+    W017 = ("Alias '{alias}' already exists in the Knowledge Base.")
+    W018 = ("Entity '{entity}' already exists in the Knowledge Base.")
     W019 = ("Changing vectors name from {old} to {new}, to avoid clash with "
             "previously loaded vectors. See Issue #3853.")
     W020 = ("Unnamed vectors. This won't allow multiple vectors models to be "

@@ -95,6 +95,12 @@ class Warnings(object):
             "you can ignore this warning by setting SPACY_WARNING_IGNORE=W022. "
             "If this is surprising, make sure you have the spacy-lookups-data "
             "package installed.")
+    W023 = ("Multiprocessing of Language.pipe is not supported in Python 2. "
+            "'n_process' will be set to 1.")
+    W024 = ("Entity '{entity}' - Alias '{alias}' combination already exists in "
+            "the Knowledge Base.")
+    W025 = ("'{name}' requires '{attr}' to be assigned, but none of the "
+            "previous components in the pipeline declare that they assign it.")


 @add_codes

@@ -407,7 +413,7 @@ class Errors(object):
             "{probabilities_length} respectively.")
     E133 = ("The sum of prior probabilities for alias '{alias}' should not "
             "exceed 1, but found {sum}.")
-    E134 = ("Alias '{alias}' defined for unknown entity '{entity}'.")
+    E134 = ("Entity '{entity}' is not defined in the Knowledge Base.")
     E135 = ("If you meant to replace a built-in component, use `create_pipe`: "
             "`nlp.replace_pipe('{name}', nlp.create_pipe('{name}'))`")
     E136 = ("This additional feature requires the jsonschema library to be "

@@ -419,7 +425,7 @@ class Errors(object):
     E138 = ("Invalid JSONL format for raw text '{text}'. Make sure the input "
             "includes either the `text` or `tokens` key. For more info, see "
             "the docs:\nhttps://spacy.io/api/cli#pretrain-jsonl")
-    E139 = ("Knowledge base for component '{name}' not initialized. Did you "
+    E139 = ("Knowledge Base for component '{name}' not initialized. Did you "
             "forget to call set_kb()?")
     E140 = ("The list of entities, prior probabilities and entity vectors "
             "should be of equal length.")

@@ -495,6 +501,34 @@ class Errors(object):
     E173 = ("As of v2.2, the Lemmatizer is initialized with an instance of "
             "Lookups containing the lemmatization tables. See the docs for "
             "details: https://spacy.io/api/lemmatizer#init")
+    E174 = ("Architecture '{name}' not found in registry. Available "
+            "names: {names}")
+    E175 = ("Can't remove rule for unknown match pattern ID: {key}")
+    E176 = ("Alias '{alias}' is not defined in the Knowledge Base.")
+    E177 = ("Ill-formed IOB input detected: {tag}")
+    E178 = ("Invalid pattern. Expected list of dicts but got: {pat}. Maybe you "
+            "accidentally passed a single pattern to Matcher.add instead of a "
+            "list of patterns? If you only want to add one pattern, make sure "
+            "to wrap it in a list. For example: matcher.add('{key}', [pattern])")
+    E179 = ("Invalid pattern. Expected a list of Doc objects but got a single "
+            "Doc. If you only want to add one pattern, make sure to wrap it "
+            "in a list. For example: matcher.add('{key}', [doc])")
+    E180 = ("Span attributes can't be declared as required or assigned by "
+            "components, since spans are only views of the Doc. Use Doc and "
+            "Token attributes (or custom extension attributes) only and remove "
+            "the following: {attrs}")
+    E181 = ("Received invalid attributes for unknown object {obj}: {attrs}. "
+            "Only Doc and Token attributes are supported.")
+    E182 = ("Received invalid attribute declaration: {attr}\nDid you forget "
+            "to define the attribute? For example: {attr}.???")
+    E183 = ("Received invalid attribute declaration: {attr}\nOnly top-level "
+            "attributes are supported, for example: {solution}")
+    E184 = ("Only attributes without underscores are supported in component "
+            "attribute declarations (because underscore and non-underscore "
+            "attributes are connected anyways): {attr} -> {solution}")
+    E185 = ("Received invalid attribute in component attribute declaration: "
+            "{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.")
+    E186 = ("'{tok_a}' and '{tok_b}' are different texts.")


 @add_codes
@@ -527,6 +561,10 @@ class MatchPatternError(ValueError):
         ValueError.__init__(self, msg)


+class AlignmentError(ValueError):
+    pass
+
+
 class ModelsWarning(UserWarning):
     pass

@@ -80,7 +80,7 @@ GLOSSARY = {
     "RBR": "adverb, comparative",
     "RBS": "adverb, superlative",
     "RP": "adverb, particle",
-    "TO": "infinitival to",
+    "TO": 'infinitival "to"',
     "UH": "interjection",
     "VB": "verb, base form",
     "VBD": "verb, past tense",

@@ -279,6 +279,12 @@ GLOSSARY = {
     "re": "repeated element",
     "rs": "reported speech",
     "sb": "subject",
+    "sb": "subject",
+    "sbp": "passivized subject (PP)",
+    "sp": "subject or predicate",
+    "svp": "separable verb prefix",
+    "uc": "unit component",
+    "vo": "vocative",
     # Named Entity Recognition
     # OntoNotes 5
     # https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf
spacy/gold.pyx | 166 lines changed

@@ -11,10 +11,9 @@ import itertools
 from pathlib import Path
 import srsly

-from . import _align
 from .syntax import nonproj
 from .tokens import Doc, Span
-from .errors import Errors
+from .errors import Errors, AlignmentError
 from .compat import path2str
 from . import util
 from .util import minibatch, itershuffle

@@ -22,6 +21,7 @@ from .util import minibatch, itershuffle
 from libc.stdio cimport FILE, fopen, fclose, fread, fwrite, feof, fseek


+USE_NEW_ALIGN = False
 punct_re = re.compile(r"\W")

@@ -73,7 +73,21 @@ def merge_sents(sents):
     return [(m_deps, (m_cats, m_brackets))]


-def align(tokens_a, tokens_b):
+_ALIGNMENT_NORM_MAP = [("``", "'"), ("''", "'"), ('"', "'"), ("`", "'")]
+
+
+def _normalize_for_alignment(tokens):
+    tokens = [w.replace(" ", "").lower() for w in tokens]
+    output = []
+    for token in tokens:
+        token = token.replace(" ", "").lower()
+        for before, after in _ALIGNMENT_NORM_MAP:
+            token = token.replace(before, after)
+        output.append(token)
+    return output
+
+
+def _align_before_v2_2_2(tokens_a, tokens_b):
     """Calculate alignment tables between two tokenizations, using the Levenshtein
     algorithm. The alignment is case-insensitive.

@@ -92,6 +106,7 @@ def align(tokens_a, tokens_b):
     * b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other
         direction.
     """
+    from . import _align
     if tokens_a == tokens_b:
         alignment = numpy.arange(len(tokens_a))
         return 0, alignment, alignment, {}, {}
@@ -111,6 +126,82 @@ def align(tokens_a, tokens_b):
     return cost, i2j, j2i, i2j_multi, j2i_multi


+def align(tokens_a, tokens_b):
+    """Calculate alignment tables between two tokenizations.
+
+    tokens_a (List[str]): The candidate tokenization.
+    tokens_b (List[str]): The reference tokenization.
+    RETURNS: (tuple): A 5-tuple consisting of the following information:
+      * cost (int): The number of misaligned tokens.
+      * a2b (List[int]): Mapping of indices in `tokens_a` to indices in `tokens_b`.
+        For instance, if `a2b[4] == 6`, that means that `tokens_a[4]` aligns
+        to `tokens_b[6]`. If there's no one-to-one alignment for a token,
+        it has the value -1.
+      * b2a (List[int]): The same as `a2b`, but mapping the other direction.
+      * a2b_multi (Dict[int, int]): A dictionary mapping indices in `tokens_a`
+        to indices in `tokens_b`, where multiple tokens of `tokens_a` align to
+        the same token of `tokens_b`.
+      * b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other
+        direction.
+    """
+    if not USE_NEW_ALIGN:
+        return _align_before_v2_2_2(tokens_a, tokens_b)
+    tokens_a = _normalize_for_alignment(tokens_a)
+    tokens_b = _normalize_for_alignment(tokens_b)
+    cost = 0
+    a2b = numpy.empty(len(tokens_a), dtype="i")
+    b2a = numpy.empty(len(tokens_b), dtype="i")
+    a2b_multi = {}
+    b2a_multi = {}
+    i = 0
+    j = 0
+    offset_a = 0
+    offset_b = 0
+    while i < len(tokens_a) and j < len(tokens_b):
+        a = tokens_a[i][offset_a:]
+        b = tokens_b[j][offset_b:]
+        a2b[i] = b2a[j] = -1
+        if a == b:
+            if offset_a == offset_b == 0:
+                a2b[i] = j
+                b2a[j] = i
+            elif offset_a == 0:
+                cost += 2
+                a2b_multi[i] = j
+            elif offset_b == 0:
+                cost += 2
+                b2a_multi[j] = i
+            offset_a = offset_b = 0
+            i += 1
+            j += 1
+        elif a == "":
+            assert offset_a == 0
+            cost += 1
+            i += 1
+        elif b == "":
+            assert offset_b == 0
+            cost += 1
+            j += 1
+        elif b.startswith(a):
+            cost += 1
+            if offset_a == 0:
+                a2b_multi[i] = j
+            i += 1
+            offset_a = 0
+            offset_b += len(a)
+        elif a.startswith(b):
+            cost += 1
+            if offset_b == 0:
+                b2a_multi[j] = i
+            j += 1
+            offset_b = 0
+            offset_a += len(b)
+        else:
+            assert "".join(tokens_a) != "".join(tokens_b)
+            raise AlignmentError(Errors.E186.format(tok_a=tokens_a, tok_b=tokens_b))
+    return cost, a2b, b2a, a2b_multi, b2a_multi
+
+
 class GoldCorpus(object):
     """An annotated corpus, using the JSON file format. Manages
     annotations for tagging, dependency parsing and NER.
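A short sketch of the alignment API documented above. The behaviour for identical and split tokens follows from the code (with USE_NEW_ALIGN False this dispatches to the pre-v2.2.2 Levenshtein path); treat the exact printed values as illustrative:

from spacy.gold import align

cost, a2b, b2a, a2b_multi, b2a_multi = align(
    ["i", "listened", "to", "obama", "'", "s", "podcasts", "."],
    ["i", "listened", "to", "obama", "'s", "podcasts", "."],
)
# "'" and "s" both map onto "'s", so they appear in a2b_multi rather
# than in the one-to-one a2b table (which holds -1 at those positions).
print(cost, list(a2b), a2b_multi)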
@@ -176,6 +267,11 @@ class GoldCorpus(object):
             gold_tuples = read_json_file(loc)
         elif loc.parts[-1].endswith("jsonl"):
             gold_tuples = srsly.read_jsonl(loc)
+            first_gold_tuple = next(gold_tuples)
+            gold_tuples = itertools.chain([first_gold_tuple], gold_tuples)
+            # TODO: proper format checks with schemas
+            if isinstance(first_gold_tuple, dict):
+                gold_tuples = read_json_object(gold_tuples)
         elif loc.parts[-1].endswith("msg"):
             gold_tuples = srsly.read_msgpack(loc)
         else:

@@ -209,7 +305,8 @@ class GoldCorpus(object):
         return n

     def train_docs(self, nlp, gold_preproc=False, max_length=None,
-                   noise_level=0.0, orth_variant_level=0.0):
+                   noise_level=0.0, orth_variant_level=0.0,
+                   ignore_misaligned=False):
         locs = list((self.tmp_dir / 'train').iterdir())
         random.shuffle(locs)
         train_tuples = self.read_tuples(locs, limit=self.limit)

@@ -217,20 +314,23 @@ class GoldCorpus(object):
                                         max_length=max_length,
                                         noise_level=noise_level,
                                         orth_variant_level=orth_variant_level,
-                                        make_projective=True)
+                                        make_projective=True,
+                                        ignore_misaligned=ignore_misaligned)
         yield from gold_docs

     def train_docs_without_preprocessing(self, nlp, gold_preproc=False):
         gold_docs = self.iter_gold_docs(nlp, self.train_tuples, gold_preproc=gold_preproc)
         yield from gold_docs

-    def dev_docs(self, nlp, gold_preproc=False):
-        gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc=gold_preproc)
+    def dev_docs(self, nlp, gold_preproc=False, ignore_misaligned=False):
+        gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc=gold_preproc,
+                                        ignore_misaligned=ignore_misaligned)
         yield from gold_docs

     @classmethod
     def iter_gold_docs(cls, nlp, tuples, gold_preproc, max_length=None,
-                       noise_level=0.0, orth_variant_level=0.0, make_projective=False):
+                       noise_level=0.0, orth_variant_level=0.0, make_projective=False,
+                       ignore_misaligned=False):
         for raw_text, paragraph_tuples in tuples:
             if gold_preproc:
                 raw_text = None
@@ -239,8 +339,10 @@ class GoldCorpus(object):
             docs, paragraph_tuples = cls._make_docs(nlp, raw_text,
                 paragraph_tuples, gold_preproc, noise_level=noise_level,
                 orth_variant_level=orth_variant_level)
-            golds = cls._make_golds(docs, paragraph_tuples, make_projective)
+            golds = cls._make_golds(docs, paragraph_tuples, make_projective,
+                                    ignore_misaligned=ignore_misaligned)
             for doc, gold in zip(docs, golds):
-                if (not max_length) or len(doc) < max_length:
-                    yield doc, gold
+                if gold is not None:
+                    if (not max_length) or len(doc) < max_length:
+                        yield doc, gold

@@ -257,14 +359,22 @@ class GoldCorpus(object):

     @classmethod
-    def _make_golds(cls, docs, paragraph_tuples, make_projective):
+    def _make_golds(cls, docs, paragraph_tuples, make_projective, ignore_misaligned=False):
         if len(docs) != len(paragraph_tuples):
             n_annots = len(paragraph_tuples)
             raise ValueError(Errors.E070.format(n_docs=len(docs), n_annots=n_annots))
-        return [GoldParse.from_annot_tuples(doc, sent_tuples, cats=cats,
-                                            make_projective=make_projective)
-                for doc, (sent_tuples, (cats, brackets))
-                in zip(docs, paragraph_tuples)]
+        golds = []
+        for doc, (sent_tuples, (cats, brackets)) in zip(docs, paragraph_tuples):
+            try:
+                gold = GoldParse.from_annot_tuples(doc, sent_tuples, cats=cats,
+                                                   make_projective=make_projective)
+            except AlignmentError:
+                if ignore_misaligned:
+                    gold = None
+                else:
+                    raise
+            golds.append(gold)
+        return golds


 def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0):
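A sketch of how the new flag is consumed, mirroring the train CLI changes earlier in this diff (the corpus paths are illustrative):

import spacy
from spacy.gold import GoldCorpus

corpus = GoldCorpus("train.json", "dev.json")  # illustrative paths
nlp = spacy.blank("en")
# Misaligned doc/gold pairs now yield gold == None internally and are
# skipped, instead of raising AlignmentError:
for doc, gold in corpus.train_docs(nlp, ignore_misaligned=True):
    pass  # only well-aligned pairs arrive here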
@@ -494,7 +604,6 @@ def _json_iterate(loc):

 def iob_to_biluo(tags):
     out = []
-    curr_label = None
     tags = list(tags)
     while tags:
         out.extend(_consume_os(tags))

@@ -519,6 +628,8 @@ def _consume_ent(tags):
     tags.pop(0)
     label = tag[2:]
     if length == 1:
+        if len(label) == 0:
+            raise ValueError(Errors.E177.format(tag=tag))
         return ["U-" + label]
     else:
         start = "B-" + label
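The new E177 check above rejects entity tags with an empty label, such as a bare "B-". For reference, a sketch of the conversion the helper performs (importable from spacy.gold; tag sequences are illustrative):

from spacy.gold import iob_to_biluo

print(iob_to_biluo(["O", "B-GPE", "I-GPE", "O", "B-PERSON"]))
# -> ['O', 'B-GPE', 'L-GPE', 'O', 'U-PERSON']
iob_to_biluo(["B-"])  # now raises ValueError (E177: ill-formed IOB input)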
@@ -542,7 +653,7 @@ cdef class GoldParse:
     def __init__(self, doc, annot_tuples=None, words=None, tags=None, morphology=None,
                  heads=None, deps=None, entities=None, make_projective=False,
                  cats=None, links=None, **_):
-        """Create a GoldParse.
+        """Create a GoldParse. The fields will not be initialized if len(doc) is zero.

         doc (Doc): The document the annotations refer to.
         words (iterable): A sequence of unicode word strings.
@@ -571,6 +682,15 @@ cdef class GoldParse:
             negative examples respectively.
         RETURNS (GoldParse): The newly constructed object.
         """
+        self.mem = Pool()
+        self.loss = 0
+        self.length = len(doc)
+
+        self.cats = {} if cats is None else dict(cats)
+        self.links = links
+
+        # avoid allocating memory if the doc does not contain any tokens
+        if self.length > 0:
             if words is None:
                 words = [token.text for token in doc]
             if tags is None:

@@ -582,9 +702,9 @@ cdef class GoldParse:
             if morphology is None:
                 morphology = [None for _ in words]
             if entities is None:
-                entities = ["-" for _ in doc]
+                entities = ["-" for _ in words]
             elif len(entities) == 0:
-                entities = ["O" for _ in doc]
+                entities = ["O" for _ in words]
             else:
                 # Translate the None values to '-', to make processing easier.
                 # See Issue #2603

@@ -592,9 +712,6 @@ cdef class GoldParse:
             if not isinstance(entities[0], basestring):
                 # Assume we have entities specified by character offset.
                 entities = biluo_tags_from_offsets(doc, entities)
-        self.mem = Pool()
-        self.loss = 0
-        self.length = len(doc)

             # These are filled by the tagger/parser/entity recogniser
             self.c.tags = <int*>self.mem.alloc(len(doc), sizeof(int))

@@ -604,8 +721,6 @@ cdef class GoldParse:
             self.c.sent_start = <int*>self.mem.alloc(len(doc), sizeof(int))
             self.c.ner = <Transition*>self.mem.alloc(len(doc), sizeof(Transition))

-        self.cats = {} if cats is None else dict(cats)
-        self.links = links
             self.words = [None] * len(doc)
             self.tags = [None] * len(doc)
             self.heads = [None] * len(doc)

@@ -652,7 +767,9 @@ cdef class GoldParse:
                         self.heads[i] = i+1
                         self.labels[i] = "subtok"
                     else:
-                        self.heads[i] = self.gold_to_cand[heads[i2j_multi[i]]]
+                        head_i = heads[i2j_multi[i]]
+                        if head_i:
+                            self.heads[i] = self.gold_to_cand[head_i]
                         self.labels[i] = deps[i2j_multi[i]]
                     # Now set NER...This is annoying because if we've split
                     # got an entity word split into two, we need to adjust the
@@ -740,7 +857,8 @@ def docs_to_json(docs, id=0):

     docs (iterable / Doc): The Doc object(s) to convert.
     id (int): Id for the JSON.
-    RETURNS (list): The data in spaCy's JSON format.
+    RETURNS (dict): The data in spaCy's JSON format
+        - each input doc will be treated as a paragraph in the output doc
     """
     if isinstance(docs, Doc):
         docs = [docs]
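Since the return type changed from a list to a single dict, a sketch of the adjusted usage (texts are illustrative):

import spacy
from spacy.gold import docs_to_json

nlp = spacy.blank("en")
docs = [nlp("Hello world"), nlp("Another paragraph")]
json_data = docs_to_json(docs, id=0)
# One dict per corpus document; each input Doc becomes a paragraph:
assert len(json_data["paragraphs"]) == 2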
spacy/kb.pyx | 69 lines changed

@@ -142,6 +142,7 @@ cdef class KnowledgeBase:

         i = 0
         cdef KBEntryC entry
+        cdef hash_t entity_hash
         while i < nr_entities:
             entity_vector = vector_list[i]
             if len(entity_vector) != self.entity_vector_length:

@@ -161,6 +162,14 @@ cdef class KnowledgeBase:

             i += 1

+    def contains_entity(self, unicode entity):
+        cdef hash_t entity_hash = self.vocab.strings.add(entity)
+        return entity_hash in self._entry_index
+
+    def contains_alias(self, unicode alias):
+        cdef hash_t alias_hash = self.vocab.strings.add(alias)
+        return alias_hash in self._alias_index
+
     def add_alias(self, unicode alias, entities, probabilities):
         """
         For a given alias, add its potential entities and prior probabilities to the KB.
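A sketch of the two new membership helpers (the entity and alias names are illustrative; the setup follows the KnowledgeBase API used elsewhere in spaCy v2.2):

from spacy.kb import KnowledgeBase
from spacy.vocab import Vocab

kb = KnowledgeBase(Vocab(), entity_vector_length=3)
kb.add_entity(entity="Q42", freq=12, entity_vector=[1.0, 2.0, 3.0])
kb.add_alias(alias="Douglas", entities=["Q42"], probabilities=[0.8])
assert kb.contains_entity("Q42")
assert kb.contains_alias("Douglas")
assert not kb.contains_entity("Q0")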
@@ -190,7 +199,7 @@ cdef class KnowledgeBase:
         for entity, prob in zip(entities, probabilities):
             entity_hash = self.vocab.strings[entity]
             if not entity_hash in self._entry_index:
-                raise ValueError(Errors.E134.format(alias=alias, entity=entity))
+                raise ValueError(Errors.E134.format(entity=entity))

             entry_index = <int64_t>self._entry_index.get(entity_hash)
             entry_indices.push_back(int(entry_index))

@@ -201,8 +210,63 @@ cdef class KnowledgeBase:

         return alias_hash

-    def get_candidates(self, unicode alias):
+    def append_alias(self, unicode alias, unicode entity, float prior_prob, ignore_warnings=False):
+        """
+        For an alias already existing in the KB, extend its potential entities with one more.
+        Throw a warning if either the alias or the entity is unknown,
+        or when the combination is already previously recorded.
+        Throw an error if this entity+prior prob would exceed the sum of 1.
+        For efficiency, it's best to use the method `add_alias` as much as possible instead of this one.
+        """
+        # Check if the alias exists in the KB
+        cdef hash_t alias_hash = self.vocab.strings[alias]
+        if not alias_hash in self._alias_index:
+            raise ValueError(Errors.E176.format(alias=alias))
+
+        # Check if the entity exists in the KB
+        cdef hash_t entity_hash = self.vocab.strings[entity]
+        if not entity_hash in self._entry_index:
+            raise ValueError(Errors.E134.format(entity=entity))
+        entry_index = <int64_t>self._entry_index.get(entity_hash)
+
+        # Throw an error if the prior probabilities (including the new one) sum up to more than 1
+        alias_index = <int64_t>self._alias_index.get(alias_hash)
+        alias_entry = self._aliases_table[alias_index]
+        current_sum = sum([p for p in alias_entry.probs])
+        new_sum = current_sum + prior_prob
+
+        if new_sum > 1.00001:
+            raise ValueError(Errors.E133.format(alias=alias, sum=new_sum))
+
+        entry_indices = alias_entry.entry_indices
+
+        is_present = False
+        for i in range(entry_indices.size()):
+            if entry_indices[i] == int(entry_index):
+                is_present = True
+
+        if is_present:
+            if not ignore_warnings:
+                user_warning(Warnings.W024.format(entity=entity, alias=alias))
+        else:
+            entry_indices.push_back(int(entry_index))
+            alias_entry.entry_indices = entry_indices
+
+            probs = alias_entry.probs
+            probs.push_back(float(prior_prob))
+            alias_entry.probs = probs
+            self._aliases_table[alias_index] = alias_entry
+
+    def get_candidates(self, unicode alias):
         """
         Return candidate entities for an alias. Each candidate defines the entity, the original alias,
         and the prior probability of that alias resolving to that entity.
         If the alias is not known in the KB, an empty list is returned.
         """
         cdef hash_t alias_hash = self.vocab.strings[alias]
         if not alias_hash in self._alias_index:
             return []
         alias_index = <int64_t>self._alias_index.get(alias_hash)
         alias_entry = self._aliases_table[alias_index]
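A sketch of extending an existing alias with one more candidate via the new method (continuing the KB sketch above; the entity names and probabilities are illustrative):

kb.add_entity(entity="Q123", freq=5, entity_vector=[0.0, 0.0, 1.0])
# "Douglas" already resolves to Q42 with p=0.8; add Q123 with p=0.1.
# Raises E133 if the priors for the alias would exceed 1.0, and warns
# (W024) if the entity/alias combination is already recorded.
kb.append_alias(alias="Douglas", entity="Q123", prior_prob=0.1)
print([(c.entity_, c.prior_prob) for c in kb.get_candidates("Douglas")])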
@@ -341,7 +405,6 @@ cdef class KnowledgeBase:
         assert nr_entities == self.get_size_entities()

         # STEP 3: load aliases
-
         cdef int64_t nr_aliases
         reader.read_alias_length(&nr_aliases)
         self._alias_index = PreshMap(nr_aliases+1)
@@ -184,7 +184,7 @@ _russian_lower = r"ёа-я"
 _russian_upper = r"ЁА-Я"
 _russian = r"ёа-яЁА-Я"

-_sinhala = r"\u0D85-\u0D96\u0D9A-\u0DB1\u0DB3-\u0DBB\u0DBD\u0DC0-\u0DC6"
+_sinhala = r"\u0D80-\u0DFF"

 _tatar_lower = r"әөүҗңһ"
 _tatar_upper = r"ӘӨҮҖҢҺ"
@@ -1,8 +1,8 @@
 # coding: utf8
 from __future__ import unicode_literals

-from ...symbols import POS, PUNCT, ADJ, CONJ, SCONJ, NUM, DET, ADV, ADP, X, VERB
-from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX
+from ...symbols import POS, PUNCT, ADJ, CCONJ, SCONJ, NUM, DET, ADV, ADP, X
+from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX, VERB


 TAG_MAP = {

@@ -20,8 +20,8 @@ TAG_MAP = {
     "CARD": {POS: NUM, "NumType": "card"},
     "FM": {POS: X, "Foreign": "yes"},
     "ITJ": {POS: INTJ},
-    "KOKOM": {POS: CONJ, "ConjType": "comp"},
-    "KON": {POS: CONJ},
+    "KOKOM": {POS: CCONJ, "ConjType": "comp"},
+    "KON": {POS: CCONJ},
     "KOUI": {POS: SCONJ},
     "KOUS": {POS: SCONJ},
     "NE": {POS: PROPN},

@@ -43,7 +43,7 @@ TAG_MAP = {
     "PTKA": {POS: PART},
     "PTKANT": {POS: PART, "PartType": "res"},
     "PTKNEG": {POS: PART, "Polarity": "neg"},
-    "PTKVZ": {POS: PART, "PartType": "vbp"},
+    "PTKVZ": {POS: ADP, "PartType": "vbp"},
     "PTKZU": {POS: PART, "PartType": "inf"},
     "PWAT": {POS: DET, "PronType": "int"},
     "PWAV": {POS: ADV, "PronType": "int"},
@@ -2,7 +2,7 @@
 from __future__ import unicode_literals

 from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
-from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX
+from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON


 TAG_MAP = {

@@ -28,8 +28,8 @@ TAG_MAP = {
     "JJR": {POS: ADJ, "Degree": "comp"},
     "JJS": {POS: ADJ, "Degree": "sup"},
     "LS": {POS: X, "NumType": "ord"},
-    "MD": {POS: AUX, "VerbType": "mod"},
-    "NIL": {POS: ""},
+    "MD": {POS: VERB, "VerbType": "mod"},
+    "NIL": {POS: X},
     "NN": {POS: NOUN, "Number": "sing"},
     "NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
     "NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},

@@ -37,7 +37,7 @@ TAG_MAP = {
     "PDT": {POS: DET},
     "POS": {POS: PART, "Poss": "yes"},
     "PRP": {POS: PRON, "PronType": "prs"},
-    "PRP$": {POS: PRON, "PronType": "prs", "Poss": "yes"},
+    "PRP$": {POS: DET, "PronType": "prs", "Poss": "yes"},
     "RB": {POS: ADV, "Degree": "pos"},
     "RBR": {POS: ADV, "Degree": "comp"},
     "RBS": {POS: ADV, "Degree": "sup"},

@@ -58,9 +58,9 @@ TAG_MAP = {
         "Number": "sing",
         "Person": "three",
     },
-    "WDT": {POS: PRON},
+    "WDT": {POS: DET},
     "WP": {POS: PRON},
-    "WP$": {POS: PRON, "Poss": "yes"},
+    "WP$": {POS: DET, "Poss": "yes"},
     "WRB": {POS: ADV},
     "ADD": {POS: X},
     "NFP": {POS: PUNCT},
@@ -11,12 +11,12 @@ Example sentences to test spaCy and its language models.


 sentences = [
-    "Apple está buscando comprar una startup del Reino Unido por mil millones de dólares",
-    "Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes",
-    "San Francisco analiza prohibir los robots delivery",
-    "Londres es una gran ciudad del Reino Unido",
-    "El gato come pescado",
-    "Veo al hombre con el telescopio",
-    "La araña come moscas",
-    "El pingüino incuba en su nido",
+    "Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.",
+    "Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.",
+    "San Francisco analiza prohibir los robots delivery.",
+    "Londres es una gran ciudad del Reino Unido.",
+    "El gato come pescado.",
+    "Veo al hombre con el telescopio.",
+    "La araña come moscas.",
+    "El pingüino incuba en su nido.",
 ]
@@ -5,7 +5,7 @@ from ..punctuation import TOKENIZER_INFIXES
 from ..char_classes import ALPHA


-ELISION = " ' ’ ".strip().replace(" ", "").replace("\n", "")
+ELISION = " ' ’ ".strip().replace(" ", "")


 _infixes = TOKENIZER_INFIXES + [
spacy/lang/lb/__init__.py | 36 lines (new file)

@@ -0,0 +1,36 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
+from .norm_exceptions import NORM_EXCEPTIONS
+from .punctuation import TOKENIZER_INFIXES
+from .lex_attrs import LEX_ATTRS
+from .tag_map import TAG_MAP
+from .stop_words import STOP_WORDS
+
+from ..tokenizer_exceptions import BASE_EXCEPTIONS
+from ..norm_exceptions import BASE_NORMS
+from ...language import Language
+from ...attrs import LANG, NORM
+from ...util import update_exc, add_lookups
+
+
+class LuxembourgishDefaults(Language.Defaults):
+    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
+    lex_attr_getters.update(LEX_ATTRS)
+    lex_attr_getters[LANG] = lambda text: "lb"
+    lex_attr_getters[NORM] = add_lookups(
+        Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
+    )
+    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
+    stop_words = STOP_WORDS
+    tag_map = TAG_MAP
+    infixes = TOKENIZER_INFIXES
+
+
+class Luxembourgish(Language):
+    lang = "lb"
+    Defaults = LuxembourgishDefaults
+
+
+__all__ = ["Luxembourgish"]
spacy/lang/lb/examples.py | 18 lines (new file)

@@ -0,0 +1,18 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+"""
+Example sentences to test spaCy and its language models.
+
+>>> from spacy.lang.lb.examples import sentences
+>>> docs = nlp.pipe(sentences)
+"""
+
+sentences = [
+    "An der Zäit hunn sech den Nordwand an d’Sonn gestridden, wie vun hinnen zwee wuel méi staark wier, wéi e Wanderer, deen an ee waarme Mantel agepak war, iwwert de Wee koum.",
+    "Si goufen sech eens, dass deejéinege fir de Stäerkste gëlle sollt, deen de Wanderer forcéiere géif, säi Mantel auszedoen.",
+    "Den Nordwand huet mat aller Force geblosen, awer wat e méi geblosen huet, wat de Wanderer sech méi a säi Mantel agewéckelt huet.",
+    "Um Enn huet den Nordwand säi Kampf opginn.",
+    "Dunn huet d’Sonn d’Loft mat hire frëndleche Strale gewiermt, a schonn no kuerzer Zäit huet de Wanderer säi Mantel ausgedoen.",
+    "Do huet den Nordwand missen zouginn, dass d’Sonn vun hinnen zwee de Stäerkste wier.",
+]
spacy/lang/lb/lex_attrs.py | 44 lines (new file)

@@ -0,0 +1,44 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...attrs import LIKE_NUM
+
+
+_num_words = set(
+    """
+null eent zwee dräi véier fënnef sechs ziwen aacht néng zéng eelef zwielef dräizéng
+véierzéng foffzéng siechzéng siwwenzéng uechtzeng uechzeng nonnzéng nongzéng zwanzeg drësseg véierzeg foffzeg sechzeg siechzeg siwenzeg achtzeg achzeg uechtzeg uechzeg nonnzeg
+honnert dausend millioun milliard billioun billiard trillioun triliard
+""".split()
+)
+
+_ordinal_words = set(
+    """
+éischten zweeten drëtten véierten fënneften sechsten siwenten aachten néngten zéngten eeleften
+zwieleften dräizéngten véierzéngten foffzéngten siechzéngten uechtzéngen uechzéngten nonnzéngten nongzéngten zwanzegsten
+drëssegsten véierzegsten foffzegsten siechzegsten siwenzegsten uechzegsten nonnzegsten
+honnertsten dausendsten milliounsten
+milliardsten billiounsten billiardsten trilliounsten trilliardsten
+""".split()
+)
+
+
+def like_num(text):
+    """
+    check if text resembles a number
+    """
+    text = text.replace(",", "").replace(".", "")
+    if text.isdigit():
+        return True
+    if text.count("/") == 1:
+        num, denom = text.split("/")
+        if num.isdigit() and denom.isdigit():
+            return True
+    if text in _num_words:
+        return True
+    if text in _ordinal_words:
+        return True
+    return False
+
+
+LEX_ATTRS = {LIKE_NUM: like_num}
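A quick sketch of the lexical attribute above (the input strings are illustrative):

from spacy.lang.lb.lex_attrs import like_num

assert like_num("1.000")    # digits, with "," and "." stripped first
assert like_num("3/4")      # simple fractions
assert like_num("véier")    # Luxembourgish cardinal word
assert like_num("zweeten")  # ordinal word
assert not like_num("Mantel")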
spacy/lang/lb/norm_exceptions.py | 16 lines (new file)

@@ -0,0 +1,16 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+# TODO
+# norm exceptions: find a possibility to deal with the zillions of spelling
+# variants (vläicht = vlaicht, vleicht, viläicht, viläischt, etc. etc.)
+# here one could include the most common spelling mistakes
+
+_exc = {"datt": "dass", "wgl.": "weg.", "vläicht": "viläicht"}
+
+
+NORM_EXCEPTIONS = {}
+
+for string, norm in _exc.items():
+    NORM_EXCEPTIONS[string] = norm
+    NORM_EXCEPTIONS[string.title()] = norm
spacy/lang/lb/punctuation.py | 16 lines (new file)

@@ -0,0 +1,16 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ..punctuation import TOKENIZER_INFIXES
+from ..char_classes import ALPHA
+
+
+ELISION = " ' ’ ".strip().replace(" ", "")
+HYPHENS = r"- – — ‐ ‑".strip().replace(" ", "")
+
+
+_infixes = TOKENIZER_INFIXES + [
+    r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
+]
+
+TOKENIZER_INFIXES = _infixes
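To illustrate what the elision infix above does, a plain-re sketch; ALPHA is simplified to ASCII letters here, so this is only an approximation of spaCy's character classes (and re.split on a zero-width pattern requires Python 3.7+):

import re

ELISION = " ' ’ ".strip().replace(" ", "")
pattern = r"(?<=[a-zA-Z][{el}])(?=[a-zA-Z])".format(el=ELISION)
print(re.split(pattern, "d'Sonn"))  # ["d'", "Sonn"]: split after the apostrophe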
spacy/lang/lb/stop_words.py | 214 lines (new file)

@@ -0,0 +1,214 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+STOP_WORDS = set(
+    """
+a
+à
+äis
+är
+ärt
+äert
+ären
+all
+allem
+alles
+alleguer
+als
+also
+am
+an
+anerefalls
+ass
+aus
+awer
+bei
+beim
+bis
+bis
+d'
+dach
+datt
+däin
+där
+dat
+de
+dee
+den
+deel
+deem
+deen
+deene
+déi
+den
+deng
+denger
+dem
+der
+dësem
+di
+dir
+do
+da
+dann
+domat
+dozou
+drop
+du
+duerch
+duerno
+e
+ee
+em
+een
+eent
+ë
+en
+ënner
+ëm
+ech
+eis
+eise
+eisen
+eiser
+eises
+eisereen
+esou
+een
+eng
+enger
+engem
+entweder
+et
+eréischt
+falls
+fir
+géint
+géif
+gëtt
+gët
+geet
+gi
+ginn
+gouf
+gouff
+goung
+hat
+haten
+hatt
+hätt
+hei
+hu
+huet
+hun
+hunn
+hiren
+hien
+hin
+hier
+hir
+jidderen
+jiddereen
+jiddwereen
+jiddereng
+jiddwerengen
+jo
+ins
+iech
+iwwer
+kann
+kee
+keen
+kënne
+kënnt
+kéng
+kéngen
+kéngem
+koum
+kuckt
+mam
+mat
+ma
+mä
+mech
+méi
+mécht
+meng
+menger
+mer
+mir
+muss
+nach
+nämmlech
+nämmelech
+näischt
+nawell
+nëmme
+nëmmen
+net
+nees
+nee
+no
+nu
+nom
+och
+oder
+ons
+onsen
+onser
+onsereen
+onst
+om
+op
+ouni
+säi
+säin
+schonn
+schonns
+si
+sid
+sie
+se
+sech
+seng
+senge
+sengem
+senger
+selwecht
+selwer
+sinn
+sollten
+souguer
+sou
+soss
+sot
+'t
+tëscht
+u
+un
+um
+virdrun
+vu
+vum
+vun
+wann
+war
+waren
+was
+wat
+wëllt
+weider
+wéi
+wéini
+wéinst
+wi
+wollt
+wou
+wouhin
+zanter
+ze
+zu
+zum
+zwar
+""".split()
+)
spacy/lang/lb/tag_map.py | 28 lines (new file)

@@ -0,0 +1,28 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...symbols import POS, PUNCT, ADJ, CONJ, NUM, DET, ADV, ADP, X, VERB
+from ...symbols import NOUN, PART, SPACE, AUX
+
+# TODO: tag map is still using POS tags from an internal training set.
+# These POS tags have to be modified to match those from Universal Dependencies
+
+TAG_MAP = {
+    "$": {POS: PUNCT},
+    "ADJ": {POS: ADJ},
+    "AV": {POS: ADV},
+    "APPR": {POS: ADP, "AdpType": "prep"},
+    "APPRART": {POS: ADP, "AdpType": "prep", "PronType": "art"},
+    "D": {POS: DET, "PronType": "art"},
+    "KO": {POS: CONJ},
+    "N": {POS: NOUN},
+    "P": {POS: ADV},
+    "TRUNC": {POS: X, "Hyph": "yes"},
+    "AUX": {POS: AUX},
+    "V": {POS: VERB},
+    "MV": {POS: VERB, "VerbType": "mod"},
+    "PTK": {POS: PART},
+    "INTER": {POS: PART},
+    "NUM": {POS: NUM},
+    "_SP": {POS: SPACE},
+}
spacy/lang/lb/tokenizer_exceptions.py | 51 lines (new file)

@@ -0,0 +1,51 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...symbols import ORTH, LEMMA, NORM
+
+# TODO
+# treat other apostrophes within words as part of the word: [op d'mannst], [fir d'éischt] (= exceptions)
+
+_exc = {
+
+}
+
+# translate / delete what is not necessary
+for exc_data in [
+    {ORTH: "wgl.", LEMMA: "wann ech gelift", NORM: "wann ech gelieft"},
+    {ORTH: "M.", LEMMA: "Monsieur", NORM: "Monsieur"},
+    {ORTH: "Mme.", LEMMA: "Madame", NORM: "Madame"},
+    {ORTH: "Dr.", LEMMA: "Dokter", NORM: "Dokter"},
+    {ORTH: "Tel.", LEMMA: "Telefon", NORM: "Telefon"},
+    {ORTH: "asw.", LEMMA: "an sou weider", NORM: "an sou weider"},
+    {ORTH: "etc.", LEMMA: "et cetera", NORM: "et cetera"},
+    {ORTH: "bzw.", LEMMA: "bezéiungsweis", NORM: "bezéiungsweis"},
+    {ORTH: "Jan.", LEMMA: "Januar", NORM: "Januar"},
+]:
+    _exc[exc_data[ORTH]] = [exc_data]
+
+
+# to be extended
+for orth in [
+    "z.B.",
+    "Dipl.",
+    "Dr.",
+    "etc.",
+    "i.e.",
+    "o.k.",
+    "O.K.",
+    "p.a.",
+    "p.s.",
+    "P.S.",
+    "phil.",
+    "q.e.d.",
+    "R.I.P.",
+    "rer.",
+    "sen.",
+    "ë.a.",
+    "U.S.",
+    "U.S.A.",
+]:
+    _exc[orth] = [{ORTH: orth}]
+
+TOKENIZER_EXCEPTIONS = _exc
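A sketch of the effect of these exceptions, using the Luxembourgish class added earlier in this diff (the sentence is illustrative and the exact token split depends on the final tokenizer rules):

from spacy.lang.lb import Luxembourgish

nlp = Luxembourgish()
doc = nlp("Den Dr. Janssen kënnt wgl. muer.")
print([t.text for t in doc])
# "Dr." and "wgl." should stay single tokens instead of splitting at the period.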
@@ -11,8 +11,8 @@ Example sentences to test spaCy and its language models.


 sentences = [
-    "Apple vurderer å kjøpe britisk oppstartfirma for en milliard dollar",
-    "Selvkjørende biler flytter forsikringsansvaret over på produsentene ",
-    "San Francisco vurderer å forby robotbud på fortauene",
+    "Apple vurderer å kjøpe britisk oppstartfirma for en milliard dollar.",
+    "Selvkjørende biler flytter forsikringsansvaret over på produsentene.",
+    "San Francisco vurderer å forby robotbud på fortauene.",
+    "London er en stor by i Storbritannia.",
 ]
@@ -14,6 +14,7 @@ from ..norm_exceptions import BASE_NORMS
 from ...language import Language
 from ...attrs import LANG, NORM
 from ...util import update_exc, add_lookups
+from .syntax_iterators import SYNTAX_ITERATORS


 class SwedishDefaults(Language.Defaults):

@@ -29,6 +30,7 @@ class SwedishDefaults(Language.Defaults):
     suffixes = TOKENIZER_SUFFIXES
     stop_words = STOP_WORDS
     morph_rules = MORPH_RULES
+    syntax_iterators = SYNTAX_ITERATORS


 class Swedish(Language):
spacy/lang/sv/syntax_iterators.py | 50 lines (new file)

@@ -0,0 +1,50 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...symbols import NOUN, PROPN, PRON
+
+
+def noun_chunks(obj):
+    """
+    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
+    """
+    labels = [
+        "nsubj",
+        "nsubj:pass",
+        "dobj",
+        "obj",
+        "iobj",
+        "ROOT",
+        "appos",
+        "nmod",
+        "nmod:poss",
+    ]
+    doc = obj.doc  # Ensure works on both Doc and Span.
+    np_deps = [doc.vocab.strings[label] for label in labels]
+    conj = doc.vocab.strings.add("conj")
+    np_label = doc.vocab.strings.add("NP")
+    seen = set()
+    for i, word in enumerate(obj):
+        if word.pos not in (NOUN, PROPN, PRON):
+            continue
+        # Prevent nested chunks from being produced
+        if word.i in seen:
+            continue
+        if word.dep in np_deps:
+            if any(w.i in seen for w in word.subtree):
+                continue
+            seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+            yield word.left_edge.i, word.right_edge.i + 1, np_label
+        elif word.dep == conj:
+            head = word.head
+            while head.dep == conj and head.head.i < head.i:
+                head = head.head
+            # If the head is an NP, and we're coordinated to it, we're an NP
+            if head.dep in np_deps:
+                if any(w.i in seen for w in word.subtree):
+                    continue
+                seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+                yield word.left_edge.i, word.right_edge.i + 1, np_label
+
+
+SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}
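A sketch of the resulting doc.noun_chunks behaviour for Swedish. This needs a parsed Doc, so the model name below is hypothetical and the chunk output depends on the parse:

import spacy

nlp = spacy.load("sv_core_news_sm")  # hypothetical Swedish model
doc = nlp("Jag såg en stor hund i parken.")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.dep_)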
99
spacy/lang/xx/examples.py
Normal file
99
spacy/lang/xx/examples.py
Normal file
|
@ -0,0 +1,99 @@
# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.xx.examples import sentences
>>> docs = nlp.pipe(sentences)
"""


# combined examples from de/en/es/fr/it/nl/pl/pt/ru


sentences = [
    "Die ganze Stadt ist ein Startup: Shenzhen ist das Silicon Valley für Hardware-Firmen",
    "Wie deutsche Startups die Technologie vorantreiben wollen: Künstliche Intelligenz",
    "Trend zum Urlaub in Deutschland beschert Gastwirten mehr Umsatz",
    "Bundesanwaltschaft erhebt Anklage gegen mutmaßlichen Schweizer Spion",
    "San Francisco erwägt Verbot von Lieferrobotern",
    "Autonome Fahrzeuge verlagern Haftpflicht auf Hersteller",
    "Wo bist du?",
    "Was ist die Hauptstadt von Deutschland?",
    "Apple is looking at buying U.K. startup for $1 billion",
    "Autonomous cars shift insurance liability toward manufacturers",
    "San Francisco considers banning sidewalk delivery robots",
    "London is a big city in the United Kingdom.",
    "Where are you?",
    "Who is the president of France?",
    "What is the capital of the United States?",
    "When was Barack Obama born?",
    "Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.",
    "Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.",
    "San Francisco analiza prohibir los robots delivery.",
    "Londres es una gran ciudad del Reino Unido.",
    "El gato come pescado.",
    "Veo al hombre con el telescopio.",
    "La araña come moscas.",
    "El pingüino incuba en su nido.",
    "Apple cherche à acheter une start-up anglaise pour 1 milliard de dollars",
    "Les voitures autonomes déplacent la responsabilité de l'assurance vers les constructeurs",
    "San Francisco envisage d'interdire les robots coursiers sur les trottoirs",
    "Londres est une grande ville du Royaume-Uni",
    "L’Italie choisit ArcelorMittal pour reprendre la plus grande aciérie d’Europe",
    "Apple lance HomePod parce qu'il se sent menacé par l'Echo d'Amazon",
    "La France ne devrait pas manquer d'électricité cet été, même en cas de canicule",
    "Nouvelles attaques de Trump contre le maire de Londres",
    "Où es-tu ?",
    "Qui est le président de la France ?",
    "Où est la capitale des États-Unis ?",
    "Quand est né Barack Obama ?",
    "Apple vuole comprare una startup del Regno Unito per un miliardo di dollari",
    "Le automobili a guida autonoma spostano la responsabilità assicurativa verso i produttori",
    "San Francisco prevede di bandire i robot di consegna porta a porta",
    "Londra è una grande città del Regno Unito.",
    "Apple overweegt om voor 1 miljard een U.K. startup te kopen",
    "Autonome auto's verschuiven de verzekeringverantwoordelijkheid naar producenten",
    "San Francisco overweegt robots op voetpaden te verbieden",
    "Londen is een grote stad in het Verenigd Koninkrijk",
    "Poczuł przyjemną woń mocnej kawy.",
    "Istnieje wiele dróg oddziaływania substancji psychoaktywnej na układ nerwowy.",
    "Powitał mnie biało-czarny kot, płosząc siedzące na płocie trzy dorodne dudki.",
    "Nowy abonament pod lupą Komisji Europejskiej",
    "Czy w ciągu ostatnich 48 godzin spożyłeś leki zawierające paracetamol?",
    "Kto ma ochotę zapoznać się z innymi niż w książkach przygodami Muminków i ich przyjaciół, temu polecam komiks Tove Jansson „Muminki i morze”.",
    "Apple está querendo comprar uma startup do Reino Unido por 100 milhões de dólares.",
    "Carros autônomos empurram a responsabilidade do seguro para os fabricantes.",
    "São Francisco considera banir os robôs de entrega que andam pelas calçadas.",
    "Londres é a maior cidade do Reino Unido.",
    # Translations from English:
    "Apple рассматривает возможность покупки стартапа из Соединённого Королевства за $1 млрд",
    "Беспилотные автомобили перекладывают страховую ответственность на производителя",
    "В Сан-Франциско рассматривается возможность запрета роботов-курьеров, которые перемещаются по тротуару",
    "Лондон — это большой город в Соединённом Королевстве",
    # Native Russian sentences:
    # Colloquial:
    "Да, нет, наверное!",  # Typical polite refusal
    "Обратите внимание на необыкновенную красоту этого города-героя Москвы, столицы нашей Родины!",  # From a tour guide speech
    # Examples of bookish Russian:
    # Quote from "The Golden Calf"
    "Рио-де-Жанейро — это моя мечта, и не смейте касаться её своими грязными лапами!",
    # Quotes from "Ivan Vasilievich Changes His Occupation"
    "Ты пошто боярыню обидел, смерд?!!",
    "Оставь меня, старушка, я в печали!",
    # Quotes from Dostoevsky:
    "Уж коли я, такой же, как и ты, человек грешный, над тобой умилился и пожалел тебя, кольми паче бог",
    "В мечтах я нередко, говорит, доходил до страстных помыслов о служении человечеству и может быть действительно пошел бы на крест за людей, если б это вдруг как-нибудь потребовалось, а между тем я двух дней не в состоянии прожить ни с кем в одной комнате, о чем знаю из опыта",
    "Зато всегда так происходило, что чем более я ненавидел людей в частности, тем пламеннее становилась любовь моя к человечеству вообще",
    # Quotes from Chekhov:
    "Ненужные дела и разговоры всё об одном отхватывают на свою долю лучшую часть времени, лучшие силы, и в конце концов остается какая-то куцая, бескрылая жизнь, какая-то чепуха, и уйти и бежать нельзя, точно сидишь в сумасшедшем доме или в арестантских ротах!",
    # Quotes from Turgenev:
    "Нравится тебе женщина, старайся добиться толку; а нельзя — ну, не надо, отвернись — земля не клином сошлась",
    "Узенькое местечко, которое я занимаю, до того крохотно в сравнении с остальным пространством, где меня нет и где дела до меня нет; и часть времени, которую мне удастся прожить, так ничтожна перед вечностью, где меня не было и не будет...",
    # Quotes from newspapers:
    # Komsomolskaya Pravda:
    "На заседании президиума правительства Москвы принято решение присвоить статус инвестиционного приоритетного проекта города Москвы киностудии Союзмультфильм",
    "Глава Минобороны Сергей Шойгу заявил, что обстановка на этом стратегическом направлении требует непрерывного совершенствования боевого состава войск",
    # Argumenty i Facty:
    "На реплику лже-Говина — дескать, он (Волков) будет лучшим революционером — Стамп с энтузиазмом ответил: Непременно!",
]
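A hedged sketch of the intended use from the docstring above, assuming a blank multi-language pipeline:

```python
import spacy
from spacy.lang.xx.examples import sentences

nlp = spacy.blank("xx")  # blank multi-language pipeline, no components
docs = list(nlp.pipe(sentences))
print(len(docs), "docs;", [t.text for t in docs[0][:5]])
```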
@ -4,19 +4,92 @@ from __future__ import unicode_literals
from ...attrs import LANG
from ...language import Language
from ...tokens import Doc
from ...util import DummyTokenizer
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from .lex_attrs import LEX_ATTRS
from .stop_words import STOP_WORDS
from .tag_map import TAG_MAP


def try_jieba_import(use_jieba):
    try:
        import jieba
        return jieba
    except ImportError:
        if use_jieba:
            msg = (
                "Jieba not installed. Either set Chinese.use_jieba = False, "
                "or install it https://github.com/fxsjy/jieba"
            )
            raise ImportError(msg)


class ChineseTokenizer(DummyTokenizer):
    def __init__(self, cls, nlp=None):
        self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
        self.use_jieba = cls.use_jieba
        self.jieba_seg = try_jieba_import(self.use_jieba)
        self.tokenizer = Language.Defaults().create_tokenizer(nlp)

    def __call__(self, text):
        # use jieba
        if self.use_jieba:
            jieba_words = [x for x in self.jieba_seg.cut(text, cut_all=False) if x]
            words = [jieba_words[0]]
            spaces = [False]
            for i in range(1, len(jieba_words)):
                word = jieba_words[i]
                if word.isspace():
                    # second token in adjacent whitespace following a
                    # non-space token
                    if spaces[-1]:
                        words.append(word)
                        spaces.append(False)
                    # first space token following non-space token
                    elif word == " " and not words[-1].isspace():
                        spaces[-1] = True
                    # token is non-space whitespace or any whitespace following
                    # a whitespace token
                    else:
                        # extend previous whitespace token with more whitespace
                        if words[-1].isspace():
                            words[-1] += word
                        # otherwise it's a new whitespace token
                        else:
                            words.append(word)
                            spaces.append(False)
                else:
                    words.append(word)
                    spaces.append(False)
            return Doc(self.vocab, words=words, spaces=spaces)

        # split into individual characters
        words = []
        spaces = []
        for token in self.tokenizer(text):
            if token.text.isspace():
                words.append(token.text)
                spaces.append(False)
            else:
                words.extend(list(token.text))
                spaces.extend([False] * len(token.text))
                spaces[-1] = bool(token.whitespace_)
        return Doc(self.vocab, words=words, spaces=spaces)


class ChineseDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "zh"
    use_jieba = True
    tokenizer_exceptions = BASE_EXCEPTIONS
    stop_words = STOP_WORDS
    tag_map = TAG_MAP
    writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}

    @classmethod
    def create_tokenizer(cls, nlp=None):
        return ChineseTokenizer(cls, nlp)


class Chinese(Language):

@ -24,26 +97,7 @@ class Chinese(Language):
    Defaults = ChineseDefaults  # override defaults

    def make_doc(self, text):
        if self.Defaults.use_jieba:
            try:
                import jieba
            except ImportError:
                msg = (
                    "Jieba not installed. Either set Chinese.use_jieba = False, "
                    "or install it https://github.com/fxsjy/jieba"
                )
                raise ImportError(msg)
            words = list(jieba.cut(text, cut_all=False))
            words = [x for x in words if x]
            return Doc(self.vocab, words=words, spaces=[False] * len(words))
        else:
            words = []
            spaces = []
            for token in self.tokenizer(text):
                words.extend(list(token.text))
                spaces.extend([False] * len(token.text))
                spaces[-1] = bool(token.whitespace_)
            return Doc(self.vocab, words=words, spaces=spaces)
        return self.tokenizer(text)


__all__ = ["Chinese"]
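The whitespace bookkeeping in `__call__` exists because `Doc` requires `words` and `spaces` of equal length, while jieba emits whitespace as separate tokens; the loop folds single spaces back into the `spaces` flags. A minimal usage sketch, assuming jieba is installed (the sample text is illustrative):

```python
from spacy.lang.zh import Chinese

nlp = Chinese()  # tokenizer is created at init and uses jieba by default
doc = nlp("我爱北京天安门")  # illustrative sample text
print([token.text for token in doc])

# character-level segmentation instead: flip the flag before instantiating
Chinese.Defaults.use_jieba = False
nlp_chars = Chinese()
```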
@ -1,11 +1,12 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import POS, PUNCT, ADJ, CONJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
from ...symbols import NOUN, PART, INTJ, PRON
from ...symbols import POS, PUNCT, ADJ, SCONJ, CCONJ, NUM, DET, ADV, ADP, X
from ...symbols import NOUN, PART, INTJ, PRON, VERB, SPACE

# The Chinese part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set.
# We also map the tags to the simpler Google Universal POS tag set.
# The Chinese part-of-speech tagger uses the OntoNotes 5 version of the Penn
# Treebank tag set. We also map the tags to the simpler Universal Dependencies
# v2 tag set.

TAG_MAP = {
    "AS": {POS: PART},

@ -38,10 +39,11 @@ TAG_MAP = {
    "OD": {POS: NUM},
    "DT": {POS: DET},
    "CC": {POS: CCONJ},
    "CS": {POS: CONJ},
    "CS": {POS: SCONJ},
    "AD": {POS: ADV},
    "JJ": {POS: ADJ},
    "P": {POS: ADP},
    "PN": {POS: PRON},
    "PU": {POS: PUNCT},
    "_SP": {POS: SPACE},
}
@ -3,6 +3,7 @@ from __future__ import absolute_import, unicode_literals

import random
import itertools
from spacy.util import minibatch
import weakref
import functools
from collections import OrderedDict

@ -10,18 +11,15 @@ from contextlib import contextmanager
from copy import copy, deepcopy
from thinc.neural import Model
import srsly
import multiprocessing as mp
from itertools import chain, cycle

from .tokenizer import Tokenizer
from .vocab import Vocab
from .lemmatizer import Lemmatizer
from .lookups import Lookups
from .pipeline import DependencyParser, Tagger
from .pipeline import Tensorizer, EntityRecognizer, EntityLinker
from .pipeline import SimilarityHook, TextCategorizer, Sentencizer
from .pipeline import merge_noun_chunks, merge_entities, merge_subtokens
from .pipeline import EntityRuler
from .pipeline import Morphologizer
from .compat import izip, basestring_
from .analysis import analyze_pipes, analyze_all_pipes, validate_attrs
from .compat import izip, basestring_, is_python2, class_types
from .gold import GoldParse
from .scorer import Scorer
from ._ml import link_vectors_to_models, create_default_optimizer

@ -30,12 +28,16 @@ from .lang.punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
from .lang.punctuation import TOKENIZER_INFIXES
from .lang.tokenizer_exceptions import TOKEN_MATCH
from .lang.tag_map import TAG_MAP
from .tokens import Doc
from .lang.lex_attrs import LEX_ATTRS, is_stop
from .errors import Errors, Warnings, deprecation_warning
from .errors import Errors, Warnings, deprecation_warning, user_warning
from . import util
from . import about


ENABLE_PIPELINE_ANALYSIS = False


class BaseDefaults(object):
    @classmethod
    def create_lemmatizer(cls, nlp=None, lookups=None):

@ -49,8 +51,8 @@ class BaseDefaults(object):
        filenames = {name: root / filename for name, filename in cls.resources}
        if LANG in cls.lex_attr_getters:
            lang = cls.lex_attr_getters[LANG](None)
            user_lookups = util.get_entry_point(util.ENTRY_POINTS.lookups, lang, {})
            filenames.update(user_lookups)
            if lang in util.registry.lookups:
                filenames.update(util.registry.lookups.get(lang))
        lookups = Lookups()
        for name, filename in filenames.items():
            data = util.load_language_data(filename)

@ -106,10 +108,6 @@ class BaseDefaults(object):
    tag_map = dict(TAG_MAP)
    tokenizer_exceptions = {}
    stop_words = set()
    lemma_rules = {}
    lemma_exc = {}
    lemma_index = {}
    lemma_lookup = {}
    morph_rules = {}
    lex_attr_getters = LEX_ATTRS
    syntax_iterators = {}

@ -133,22 +131,7 @@ class Language(object):
    Defaults = BaseDefaults
    lang = None

    factories = {
        "tokenizer": lambda nlp: nlp.Defaults.create_tokenizer(nlp),
        "tensorizer": lambda nlp, **cfg: Tensorizer(nlp.vocab, **cfg),
        "tagger": lambda nlp, **cfg: Tagger(nlp.vocab, **cfg),
        "morphologizer": lambda nlp, **cfg: Morphologizer(nlp.vocab, **cfg),
        "parser": lambda nlp, **cfg: DependencyParser(nlp.vocab, **cfg),
        "ner": lambda nlp, **cfg: EntityRecognizer(nlp.vocab, **cfg),
        "entity_linker": lambda nlp, **cfg: EntityLinker(nlp.vocab, **cfg),
        "similarity": lambda nlp, **cfg: SimilarityHook(nlp.vocab, **cfg),
        "textcat": lambda nlp, **cfg: TextCategorizer(nlp.vocab, **cfg),
        "sentencizer": lambda nlp, **cfg: Sentencizer(**cfg),
        "merge_noun_chunks": lambda nlp, **cfg: merge_noun_chunks,
        "merge_entities": lambda nlp, **cfg: merge_entities,
        "merge_subtokens": lambda nlp, **cfg: merge_subtokens,
        "entity_ruler": lambda nlp, **cfg: EntityRuler(nlp, **cfg),
    }
    factories = {"tokenizer": lambda nlp: nlp.Defaults.create_tokenizer(nlp)}

    def __init__(
        self, vocab=True, make_doc=True, max_length=10 ** 6, meta={}, **kwargs

@ -172,7 +155,7 @@ class Language(object):
        100,000 characters in one text.
        RETURNS (Language): The newly constructed object.
        """
        user_factories = util.get_entry_points(util.ENTRY_POINTS.factories)
        user_factories = util.registry.factories.get_all()
        self.factories.update(user_factories)
        self._meta = dict(meta)
        self._path = None

@ -218,6 +201,7 @@ class Language(object):
            "name": self.vocab.vectors.name,
        }
        self._meta["pipeline"] = self.pipe_names
        self._meta["factories"] = self.pipe_factories
        self._meta["labels"] = self.pipe_labels
        return self._meta

@ -259,6 +243,17 @@ class Language(object):
        """
        return [pipe_name for pipe_name, _ in self.pipeline]

    @property
    def pipe_factories(self):
        """Get the component factories for the available pipeline components.

        RETURNS (dict): Factory names, keyed by component names.
        """
        factories = {}
        for pipe_name, pipe in self.pipeline:
            factories[pipe_name] = getattr(pipe, "factory", pipe_name)
        return factories

    @property
    def pipe_labels(self):
        """Get the labels set by the pipeline components, if available (if

@ -327,33 +322,30 @@ class Language(object):
            msg += Errors.E004.format(component=component)
            raise ValueError(msg)
        if name is None:
            if hasattr(component, "name"):
                name = component.name
            elif hasattr(component, "__name__"):
                name = component.__name__
            elif hasattr(component, "__class__") and hasattr(
                component.__class__, "__name__"
            ):
                name = component.__class__.__name__
            else:
                name = repr(component)
            name = util.get_component_name(component)
        if name in self.pipe_names:
            raise ValueError(Errors.E007.format(name=name, opts=self.pipe_names))
        if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2:
            raise ValueError(Errors.E006)
        pipe_index = 0
        pipe = (name, component)
        if last or not any([first, before, after]):
            pipe_index = len(self.pipeline)
            self.pipeline.append(pipe)
        elif first:
            self.pipeline.insert(0, pipe)
        elif before and before in self.pipe_names:
            pipe_index = self.pipe_names.index(before)
            self.pipeline.insert(self.pipe_names.index(before), pipe)
        elif after and after in self.pipe_names:
            pipe_index = self.pipe_names.index(after) + 1
            self.pipeline.insert(self.pipe_names.index(after) + 1, pipe)
        else:
            raise ValueError(
                Errors.E001.format(name=before or after, opts=self.pipe_names)
            )
        if ENABLE_PIPELINE_ANALYSIS:
            analyze_pipes(self.pipeline, name, component, pipe_index)

    def has_pipe(self, name):
        """Check if a component name is present in the pipeline. Equivalent to

@ -382,6 +374,8 @@ class Language(object):
            msg += Errors.E135.format(name=name)
            raise ValueError(msg)
        self.pipeline[self.pipe_names.index(name)] = (name, component)
        if ENABLE_PIPELINE_ANALYSIS:
            analyze_all_pipes(self.pipeline)

    def rename_pipe(self, old_name, new_name):
        """Rename a pipeline component.

@ -408,7 +402,10 @@ class Language(object):
        """
        if name not in self.pipe_names:
            raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
        return self.pipeline.pop(self.pipe_names.index(name))
        removed = self.pipeline.pop(self.pipe_names.index(name))
        if ENABLE_PIPELINE_ANALYSIS:
            analyze_all_pipes(self.pipeline)
        return removed

    def __call__(self, text, disable=[], component_cfg=None):
        """Apply the pipeline to some text. The text can span multiple sentences,

@ -448,6 +445,8 @@ class Language(object):

        DOCS: https://spacy.io/api/language#disable_pipes
        """
        if len(names) == 1 and isinstance(names[0], (list, tuple)):
            names = names[0]  # support list of names instead of spread
        return DisabledPipes(self, *names)

    def make_doc(self, text):
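The two added lines let `disable_pipes` accept a single list as well as spread names. A short sketch of both call styles (the component names assume a pipeline that actually contains them):

```python
# both call styles are equivalent after this change
with nlp.disable_pipes("tagger", "parser"):    # spread names (existing API)
    doc = nlp("some text")

with nlp.disable_pipes(["tagger", "parser"]):  # list of names (new)
    doc = nlp("some text")
```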
@ -477,7 +476,7 @@ class Language(object):

        docs (iterable): A batch of `Doc` objects.
        golds (iterable): A batch of `GoldParse` objects.
        drop (float): The droput rate.
        drop (float): The dropout rate.
        sgd (callable): An optimizer.
        losses (dict): Dictionary to update with the loss, keyed by component.
        component_cfg (dict): Config parameters for specific pipeline

@ -525,7 +524,7 @@ class Language(object):
        even if you're updating it with a smaller set of examples.

        docs (iterable): A batch of `Doc` objects.
        drop (float): The droput rate.
        drop (float): The dropout rate.
        sgd (callable): An optimizer.
        RETURNS (dict): Results from the update.

@ -679,7 +678,7 @@ class Language(object):
            kwargs = component_cfg.get(name, {})
            kwargs.setdefault("batch_size", batch_size)
            if not hasattr(pipe, "pipe"):
                docs = _pipe(pipe, docs, kwargs)
                docs = _pipe(docs, pipe, kwargs)
            else:
                docs = pipe.pipe(docs, **kwargs)
        for doc, gold in zip(docs, golds):

@ -733,6 +732,7 @@ class Language(object):
        disable=[],
        cleanup=False,
        component_cfg=None,
        n_process=1,
    ):
        """Process texts as a stream, and yield `Doc` objects in order.

@ -746,12 +746,21 @@ class Language(object):
            use. Experimental.
        component_cfg (dict): An optional dictionary with extra keyword
            arguments for specific components.
        n_process (int): Number of processes to use when processing texts.
            Only supported in Python 3. If -1, `multiprocessing.cpu_count()`
            is used.
        YIELDS (Doc): Documents in the order of the original text.

        DOCS: https://spacy.io/api/language#pipe
        """
        # raw_texts will be used later to stop the iterator.
        texts, raw_texts = itertools.tee(texts)
        if is_python2 and n_process != 1:
            user_warning(Warnings.W023)
            n_process = 1
        if n_threads != -1:
            deprecation_warning(Warnings.W016)
        if n_process == -1:
            n_process = mp.cpu_count()
        if as_tuples:
            text_context1, text_context2 = itertools.tee(texts)
            texts = (tc[0] for tc in text_context1)

@ -760,14 +769,18 @@ class Language(object):
                texts,
                batch_size=batch_size,
                disable=disable,
                n_process=n_process,
                component_cfg=component_cfg,
            )
            for doc, context in izip(docs, contexts):
                yield (doc, context)
            return
        docs = (self.make_doc(text) for text in texts)
        if component_cfg is None:
            component_cfg = {}

        # list of functools.partial objects, so that multiprocessing workers
        # can be created easily
        pipes = []
        for name, proc in self.pipeline:
            if name in disable:
                continue

@ -775,10 +788,20 @@ class Language(object):
            # Allow component_cfg to overwrite the top-level kwargs.
            kwargs.setdefault("batch_size", batch_size)
            if hasattr(proc, "pipe"):
                docs = proc.pipe(docs, **kwargs)
                f = functools.partial(proc.pipe, **kwargs)
            else:
                # Apply the function, but yield the doc
                docs = _pipe(proc, docs, kwargs)
                f = functools.partial(_pipe, proc=proc, kwargs=kwargs)
            pipes.append(f)

        if n_process != 1:
            docs = self._multiprocessing_pipe(texts, pipes, n_process, batch_size)
        else:
            # if n_process == 1, no processes are forked.
            docs = (self.make_doc(text) for text in texts)
            for pipe in pipes:
                docs = pipe(docs)

        # Track weakrefs of "recent" documents, so that we can see when they
        # expire from memory. When they do, we know we don't need old strings.
        # This way, we avoid maintaining an unbounded growth in string entries

@ -809,6 +832,46 @@ class Language(object):
                    self.tokenizer._reset_cache(keys)
                nr_seen = 0

    def _multiprocessing_pipe(self, texts, pipes, n_process, batch_size):
        # raw_texts is used later to stop iteration.
        texts, raw_texts = itertools.tee(texts)
        # for sending texts to the workers
        texts_q = [mp.Queue() for _ in range(n_process)]
        # for receiving byte-encoded docs from the workers
        bytedocs_recv_ch, bytedocs_send_ch = zip(
            *[mp.Pipe(False) for _ in range(n_process)]
        )

        batch_texts = minibatch(texts, batch_size)
        # Sender sends texts to the workers.
        # This is necessary to properly handle infinite-length text streams.
        # (In that case, all the data cannot be sent to the workers at once.)
        sender = _Sender(batch_texts, texts_q, chunk_size=n_process)
        # send twice so that the processes are kept busy
        sender.send()
        sender.send()

        procs = [
            mp.Process(target=_apply_pipes, args=(self.make_doc, pipes, rch, sch))
            for rch, sch in zip(texts_q, bytedocs_send_ch)
        ]
        for proc in procs:
            proc.start()

        # Cycle through the channels so the order of the docs is preserved.
        # Each received object is a batch of byte-encoded docs, so flatten
        # them with chain.from_iterable.
        byte_docs = chain.from_iterable(recv.recv() for recv in cycle(bytedocs_recv_ch))
        docs = (Doc(self.vocab).from_bytes(byte_doc) for byte_doc in byte_docs)
        try:
            for i, (_, doc) in enumerate(zip(raw_texts, docs), 1):
                yield doc
                if i % batch_size == 0:
                    # tell `sender` that one batch was consumed.
                    sender.step()
        finally:
            for proc in procs:
                proc.terminate()

    def to_disk(self, path, exclude=tuple(), disable=None):
        """Save the current state to a directory. If a model is loaded, this
        will include the model.
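With `n_process`, `pipe` can fan work out over multiprocessing workers, and the `_Sender` keeps the queues fed so arbitrarily long streams work. A minimal usage sketch (Python 3 only; the text stream is illustrative):

```python
# hedged sketch: fan tokenization + pipeline over 4 worker processes
texts = ("document %d ..." % i for i in range(10000))  # illustrative stream
for doc in nlp.pipe(texts, n_process=4, batch_size=50):
    pass  # docs arrive in the original order
```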
@ -936,6 +999,52 @@ class Language(object):
        return self


class component(object):
    """Decorator for pipeline components. Can decorate both function components
    and class components and will automatically register components in the
    Language.factories. If the component is a class and needs access to the
    nlp object or config parameters, it can expose a from_nlp classmethod
    that takes the nlp object and **cfg arguments and returns the initialized
    component.
    """

    # NB: This decorator needs to live here, because it needs to write to
    # Language.factories. All other solutions would cause circular import.

    def __init__(self, name=None, assigns=tuple(), requires=tuple(), retokenizes=False):
        """Decorate a pipeline component.

        name (unicode): Default component and factory name.
        assigns (list): Attributes assigned by component, e.g. `["token.pos"]`.
        requires (list): Attributes required by component, e.g. `["token.dep"]`.
        retokenizes (bool): Whether the component changes the tokenization.
        """
        self.name = name
        self.assigns = validate_attrs(assigns)
        self.requires = validate_attrs(requires)
        self.retokenizes = retokenizes

    def __call__(self, *args, **kwargs):
        obj = args[0]
        args = args[1:]
        factory_name = self.name or util.get_component_name(obj)
        obj.name = factory_name
        obj.factory = factory_name
        obj.assigns = self.assigns
        obj.requires = self.requires
        obj.retokenizes = self.retokenizes

        def factory(nlp, **cfg):
            if hasattr(obj, "from_nlp"):
                return obj.from_nlp(nlp, **cfg)
            elif isinstance(obj, class_types):
                return obj()
            return obj

        Language.factories[obj.factory] = factory
        return obj


def _fix_pretrained_vectors_name(nlp):
    # TODO: Replace this once we handle vectors consistently as static
    # data
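A hedged sketch of the `component` decorator above in use; the component name, the declared attributes, and the trivial body are all illustrative:

```python
from spacy.language import component

# name and attrs are illustrative; assigns/requires are declarative metadata
@component("mark_sentences", assigns=["doc.sents"], retokenizes=False)
def mark_sentences(doc):
    # a function component: receives and returns the Doc
    for i, token in enumerate(doc):
        token.is_sent_start = i == 0  # trivially mark a single sentence
    return doc

# the decorator registered a factory under the same name, so the component
# can now be created by name, e.g. nlp.create_pipe("mark_sentences")
```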
@ -987,12 +1096,56 @@ class DisabledPipes(list):
        self[:] = []


def _pipe(func, docs, kwargs):
def _pipe(docs, proc, kwargs):
    # We added some args for pipe that __call__ doesn't expect.
    kwargs = dict(kwargs)
    for arg in ["n_threads", "batch_size"]:
        if arg in kwargs:
            kwargs.pop(arg)
    for doc in docs:
        doc = func(doc, **kwargs)
        doc = proc(doc, **kwargs)
        yield doc


def _apply_pipes(make_doc, pipes, receiver, sender):
    """Worker for Language.pipe

    receiver (multiprocessing.Connection): Pipe to receive texts. Usually
        created by `multiprocessing.Pipe()`
    sender (multiprocessing.Connection): Pipe to send docs. Usually created by
        `multiprocessing.Pipe()`
    """
    while True:
        texts = receiver.get()
        docs = (make_doc(text) for text in texts)
        for pipe in pipes:
            docs = pipe(docs)
        # Connection does not accept unpicklable objects, so send a list.
        sender.send([doc.to_bytes() for doc in docs])


class _Sender:
    """Util for sending data to multiprocessing workers in Language.pipe"""

    def __init__(self, data, queues, chunk_size):
        self.data = iter(data)
        self.queues = iter(cycle(queues))
        self.chunk_size = chunk_size
        self.count = 0

    def send(self):
        """Send chunk_size items from self.data to the channels."""
        for item, q in itertools.islice(
            zip(self.data, cycle(self.queues)), self.chunk_size
        ):
            # cycle the channels so the texts are distributed evenly
            q.put(item)

    def step(self):
        """Tell the sender that one item was consumed.

        Data is sent to the workers after every chunk_size calls."""
        self.count += 1
        if self.count >= self.chunk_size:
            self.count = 0
            self.send()
@ -375,7 +375,7 @@ cdef class Lexeme:
        Lexeme.c_set_flag(self.c, IS_STOP, x)

    property is_alpha:
        """RETURNS (bool): Whether the lexeme consists of alphanumeric
        """RETURNS (bool): Whether the lexeme consists of alphabetic
        characters. Equivalent to `lexeme.text.isalpha()`.
        """
        def __get__(self):


@ -111,7 +111,7 @@ TOKEN_PATTERN_SCHEMA = {
        "$ref": "#/definitions/integer_value",
    },
    "IS_ALPHA": {
        "title": "Token consists of alphanumeric characters",
        "title": "Token consists of alphabetic characters",
        "$ref": "#/definitions/boolean_value",
    },
    "IS_ASCII": {
@ -102,7 +102,10 @@ cdef class DependencyMatcher:
            visitedNodes[relation["SPEC"]["NBOR_NAME"]] = True
            idx = idx + 1

    def add(self, key, on_match, *patterns):
    def add(self, key, patterns, *_patterns, on_match=None):
        if patterns is None or hasattr(patterns, "__call__"):  # old API
            on_match = patterns
            patterns = _patterns
        for pattern in patterns:
            if len(pattern) == 0:
                raise ValueError(Errors.E012.format(key=key))

@ -237,7 +240,7 @@ cdef class DependencyMatcher:
        for i, (ent_id, nodes) in enumerate(matched_key_trees):
            on_match = self._callbacks.get(ent_id)
            if on_match is not None:
                on_match(self, doc, i, matches)
                on_match(self, doc, i, matched_key_trees)
        return matched_key_trees

    def recurse(self, tree, id_to_position, _node_operator_map, int patternLength, visitedNodes, matched_trees):


@ -74,7 +74,7 @@ cdef class Matcher:
        """
        return self._normalize_key(key) in self._patterns

    def add(self, key, on_match, *patterns):
    def add(self, key, patterns, *_patterns, on_match=None):
        """Add a match-rule to the matcher. A match-rule consists of: an ID
        key, an on_match callback, and one or more patterns.


@ -98,16 +98,29 @@ cdef class Matcher:
        operator will behave non-greedily. This quirk in the semantics makes
        the matcher more efficient, by avoiding the need for back-tracking.

        As of spaCy v2.2.2, Matcher.add supports the future API, which makes
        the patterns the second argument and a list (instead of a variable
        number of arguments). The on_match callback becomes an optional keyword
        argument.

        key (unicode): The match ID.
        on_match (callable): Callback executed on match.
        *patterns (list): List of token descriptions.
        patterns (list): The patterns to add for the given key.
        on_match (callable): Optional callback executed on match.
        *_patterns (list): For backwards compatibility: list of patterns to add
            as variable arguments. Will be ignored if a list of patterns is
            provided as the second argument.
        """
        errors = {}
        if on_match is not None and not hasattr(on_match, "__call__"):
            raise ValueError(Errors.E171.format(arg_type=type(on_match)))
        if patterns is None or hasattr(patterns, "__call__"):  # old API
            on_match = patterns
            patterns = _patterns
        for i, pattern in enumerate(patterns):
            if len(pattern) == 0:
                raise ValueError(Errors.E012.format(key=key))
            if not isinstance(pattern, list):
                raise ValueError(Errors.E178.format(pat=pattern, key=key))
            if self.validator:
                errors[i] = validate_json(pattern, self.validator)
        if any(err for err in errors.values()):
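A hedged sketch contrasting the two call styles the docstring describes; the pattern and the callback are illustrative:

```python
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]  # illustrative pattern

# future API (v2.2.2+): patterns as a list, callback as optional keyword
matcher.add("HELLO_WORLD", [pattern])
matcher.add("GREETING", [pattern], on_match=lambda m, doc, i, matches: None)

# old API, still accepted: callback (or None) second, patterns spread
matcher.add("HELLO_OLD", None, pattern)
```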
@ -133,13 +146,15 @@ cdef class Matcher:

        key (unicode): The ID of the match rule.
        """
        key = self._normalize_key(key)
        self._patterns.pop(key)
        self._callbacks.pop(key)
        norm_key = self._normalize_key(key)
        if norm_key not in self._patterns:
            raise ValueError(Errors.E175.format(key=key))
        self._patterns.pop(norm_key)
        self._callbacks.pop(norm_key)
        cdef int i = 0
        while i < self.patterns.size():
            pattern_key = get_pattern_key(self.patterns.at(i))
            if pattern_key == key:
            pattern_key = get_ent_id(self.patterns.at(i))
            if pattern_key == norm_key:
                self.patterns.erase(self.patterns.begin()+i)
            else:
                i += 1

@ -252,6 +267,11 @@ cdef find_matches(TokenPatternC** patterns, int n, Doc doc, extensions=None,
    cdef PatternStateC state
    cdef int i, j, nr_extra_attr
    cdef Pool mem = Pool()
    output = []
    if doc.length == 0:
        # avoid any processing or mem alloc if the document is empty
        return output
    if len(predicates) > 0:
        predicate_cache = <char*>mem.alloc(doc.length * len(predicates), sizeof(char))
    if extensions is not None and len(extensions) >= 1:
        nr_extra_attr = max(extensions.values()) + 1

@ -276,7 +296,6 @@ cdef find_matches(TokenPatternC** patterns, int n, Doc doc, extensions=None,
        predicate_cache += len(predicates)
    # Handle matches that end in 0-width patterns
    finish_states(matches, states)
    output = []
    seen = set()
    for i in range(matches.size()):
        match = (

@ -293,18 +312,6 @@ cdef find_matches(TokenPatternC** patterns, int n, Doc doc, extensions=None,
    return output


cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
    # There have been a few bugs here.
    # The code was originally designed to always have pattern[1].attrs.value
    # be the ent_id when we get to the end of a pattern. However, Issue #2671
    # showed this wasn't the case when we had a reject-and-continue before a
    # match.
    # The patch to #2671 was wrong though, which came up in #3839.
    while pattern.attrs.attr != ID:
        pattern += 1
    return pattern.attrs.value


cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& matches,
                            char* cached_py_predicates,
                            Token token, const attr_t* extra_attrs, py_predicates) except *:

@ -533,6 +540,7 @@ cdef char get_is_match(PatternStateC state,
        if predicate_matches[state.pattern.py_predicates[i]] == -1:
            return 0
    spec = state.pattern
    if spec.nr_attr > 0:
        for attr in spec.attrs[:spec.nr_attr]:
            if get_token_attr(token, attr.attr) != attr.value:
                return 0

@ -543,7 +551,11 @@ cdef char get_is_match(PatternStateC state,


cdef char get_is_final(PatternStateC state) nogil:
    if state.pattern[1].attrs[0].attr == ID and state.pattern[1].nr_attr == 0:
    if state.pattern[1].nr_attr == 0 and state.pattern[1].attrs != NULL:
        id_attr = state.pattern[1].attrs[0]
        if id_attr.attr != ID:
            with gil:
                raise ValueError(Errors.E074.format(attr=ID, bad_attr=id_attr.attr))
        return 1
    else:
        return 0

@ -558,22 +570,27 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
    cdef int i, index
    for i, (quantifier, spec, extensions, predicates) in enumerate(token_specs):
        pattern[i].quantifier = quantifier
        # Ensure attrs refers to a null pointer if nr_attr == 0
        if len(spec) > 0:
            pattern[i].attrs = <AttrValueC*>mem.alloc(len(spec), sizeof(AttrValueC))
        pattern[i].nr_attr = len(spec)
        for j, (attr, value) in enumerate(spec):
            pattern[i].attrs[j].attr = attr
            pattern[i].attrs[j].value = value
        if len(extensions) > 0:
            pattern[i].extra_attrs = <IndexValueC*>mem.alloc(len(extensions), sizeof(IndexValueC))
        for j, (index, value) in enumerate(extensions):
            pattern[i].extra_attrs[j].index = index
            pattern[i].extra_attrs[j].value = value
        pattern[i].nr_extra_attr = len(extensions)
        if len(predicates) > 0:
            pattern[i].py_predicates = <int32_t*>mem.alloc(len(predicates), sizeof(int32_t))
        for j, index in enumerate(predicates):
            pattern[i].py_predicates[j] = index
        pattern[i].nr_py = len(predicates)
        pattern[i].key = hash64(pattern[i].attrs, pattern[i].nr_attr * sizeof(AttrValueC), 0)
    i = len(token_specs)
    # Even though here, nr_attr == 0, we're storing the ID value in attrs[0] (bug-prone, thread carefully!)
    pattern[i].attrs = <AttrValueC*>mem.alloc(2, sizeof(AttrValueC))
    pattern[i].attrs[0].attr = ID
    pattern[i].attrs[0].value = entity_id

@ -583,8 +600,26 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
    return pattern


cdef attr_t get_pattern_key(const TokenPatternC* pattern) nogil:
    while pattern.nr_attr != 0 or pattern.nr_extra_attr != 0 or pattern.nr_py != 0:
cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
    # There have been a few bugs here. We used to have two functions,
    # get_ent_id and get_pattern_key, that tried to do the same thing. These
    # are now unified to try to solve the "ghost match" problem.
    # Below is the previous implementation of get_ent_id and the comment on it,
    # preserved for reference while we figure out whether the heisenbug in the
    # matcher is resolved.
    #
    #
    # cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
    #     # The code was originally designed to always have pattern[1].attrs.value
    #     # be the ent_id when we get to the end of a pattern. However, Issue #2671
    #     # showed this wasn't the case when we had a reject-and-continue before a
    #     # match.
    #     # The patch to #2671 was wrong though, which came up in #3839.
    #     while pattern.attrs.attr != ID:
    #         pattern += 1
    #     return pattern.attrs.value
    while pattern.nr_attr != 0 or pattern.nr_extra_attr != 0 or pattern.nr_py != 0 \
            or pattern.quantifier != ZERO:
        pattern += 1
    id_attr = pattern[0].attrs[0]
    if id_attr.attr != ID:

@ -642,7 +677,7 @@ def _get_attr_values(spec, string_store):
            value = string_store.add(value)
        elif isinstance(value, bool):
            value = int(value)
        elif isinstance(value, dict):
        elif isinstance(value, (dict, int)):
            continue
        else:
            raise ValueError(Errors.E153.format(vtype=type(value).__name__))
@ -4,6 +4,7 @@ from cymem.cymem cimport Pool
from preshed.maps cimport key_t, MapStruct

from ..attrs cimport attr_id_t
from ..structs cimport SpanC
from ..tokens.doc cimport Doc
from ..vocab cimport Vocab


@ -18,10 +19,4 @@ cdef class PhraseMatcher:
    cdef Pool mem
    cdef key_t _terminal_hash

    cdef void find_matches(self, Doc doc, vector[MatchStruct] *matches) nogil


cdef struct MatchStruct:
    key_t match_id
    int start
    int end
    cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil


@ -9,6 +9,7 @@ from preshed.maps cimport map_init, map_set, map_get, map_clear, map_iter
from ..attrs cimport ORTH, POS, TAG, DEP, LEMMA
from ..structs cimport TokenC
from ..tokens.token cimport Token
from ..typedefs cimport attr_t

from ._schemas import TOKEN_PATTERN_SCHEMA
from ..errors import Errors, Warnings, deprecation_warning, user_warning

@ -102,8 +103,10 @@ cdef class PhraseMatcher:
        cdef vector[MapStruct*] path_nodes
        cdef vector[key_t] path_keys
        cdef key_t key_to_remove
        for keyword in self._docs[key]:
        for keyword in sorted(self._docs[key], key=lambda x: len(x), reverse=True):
            current_node = self.c_map
            path_nodes.clear()
            path_keys.clear()
            for token in keyword:
                result = map_get(current_node, token)
                if result:

@ -149,16 +152,27 @@ cdef class PhraseMatcher:
        del self._callbacks[key]
        del self._docs[key]

    def add(self, key, on_match, *docs):
    def add(self, key, docs, *_docs, on_match=None):
        """Add a match-rule to the phrase-matcher. A match-rule consists of: an ID
        key, an on_match callback, and one or more patterns.

        As of spaCy v2.2.2, PhraseMatcher.add supports the future API, which
        makes the patterns the second argument and a list (instead of a variable
        number of arguments). The on_match callback becomes an optional keyword
        argument.

        key (unicode): The match ID.
        docs (list): List of `Doc` objects representing match patterns.
        on_match (callable): Callback executed on match.
        *docs (Doc): `Doc` objects representing match patterns.
        *_docs (Doc): For backwards compatibility: list of patterns to add
            as variable arguments. Will be ignored if a list of patterns is
            provided as the second argument.

        DOCS: https://spacy.io/api/phrasematcher#add
        """
        if docs is None or hasattr(docs, "__call__"):  # old API
            on_match = docs
            docs = _docs

        _ = self.vocab[key]
        self._callbacks[key] = on_match
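The same migration applies here; a hedged sketch with illustrative phrase patterns:

```python
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp.make_doc(text) for text in ["Barack Obama", "Angela Merkel"]]

# future API (v2.2.2+): list of Doc patterns, callback as optional keyword
matcher.add("PERSON", patterns)

# old API, still accepted: callback (or None) second, Docs spread
matcher.add("PERSON_OLD", None, *patterns)
```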
@ -168,6 +182,8 @@ cdef class PhraseMatcher:
        cdef MapStruct* internal_node
        cdef void* result

        if isinstance(docs, Doc):
            raise ValueError(Errors.E179.format(key=key))
        for doc in docs:
            if len(doc) == 0:
                continue

@ -220,17 +236,17 @@ cdef class PhraseMatcher:
            # if doc is empty or None just return empty list
            return matches

        cdef vector[MatchStruct] c_matches
        cdef vector[SpanC] c_matches
        self.find_matches(doc, &c_matches)
        for i in range(c_matches.size()):
            matches.append((c_matches[i].match_id, c_matches[i].start, c_matches[i].end))
            matches.append((c_matches[i].label, c_matches[i].start, c_matches[i].end))
        for i, (ent_id, start, end) in enumerate(matches):
            on_match = self._callbacks.get(ent_id)
            on_match = self._callbacks.get(self.vocab.strings[ent_id])
            if on_match is not None:
                on_match(self, doc, i, matches)
        return matches

    cdef void find_matches(self, Doc doc, vector[MatchStruct] *matches) nogil:
    cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil:
        cdef MapStruct* current_node = self.c_map
        cdef int start = 0
        cdef int idx = 0

@ -238,7 +254,7 @@ cdef class PhraseMatcher:
        cdef key_t key
        cdef void* value
        cdef int i = 0
        cdef MatchStruct ms
        cdef SpanC ms
        cdef void* result
        while idx < doc.length:
            start = idx

@ -253,7 +269,7 @@ cdef class PhraseMatcher:
                if result:
                    i = 0
                    while map_iter(<MapStruct*>result, &i, &key, &value):
                        ms = make_matchstruct(key, start, idy)
                        ms = make_spanstruct(key, start, idy)
                        matches.push_back(ms)
                inner_token = Token.get_struct_attr(&doc.c[idy], self.attr)
                result = map_get(current_node, inner_token)

@ -268,7 +284,7 @@ cdef class PhraseMatcher:
            if result:
                i = 0
                while map_iter(<MapStruct*>result, &i, &key, &value):
                    ms = make_matchstruct(key, start, idy)
                    ms = make_spanstruct(key, start, idy)
                    matches.push_back(ms)
            current_node = self.c_map
            idx += 1

@ -318,9 +334,9 @@ def unpickle_matcher(vocab, docs, callbacks, attr):
    return matcher


cdef MatchStruct make_matchstruct(key_t match_id, int start, int end) nogil:
    cdef MatchStruct ms
    ms.match_id = match_id
    ms.start = start
    ms.end = end
    return ms
cdef SpanC make_spanstruct(attr_t label, int start, int end) nogil:
    cdef SpanC spanc
    spanc.label = label
    spanc.start = start
    spanc.end = end
    return spanc
5
spacy/ml/__init__.py
Normal file

@ -0,0 +1,5 @@
# coding: utf8
from __future__ import unicode_literals

from .tok2vec import Tok2Vec  # noqa: F401
from .common import FeedForward, LayerNormalizedMaxout  # noqa: F401

131
spacy/ml/_legacy_tok2vec.py
Normal file

@ -0,0 +1,131 @@
# coding: utf8
from __future__ import unicode_literals
from thinc.v2v import Model, Maxout
from thinc.i2v import HashEmbed, StaticVectors
from thinc.t2t import ExtractWindow
from thinc.misc import Residual
from thinc.misc import LayerNorm as LN
from thinc.misc import FeatureExtracter
from thinc.api import layerize, chain, clone, concatenate, with_flatten
from thinc.api import uniqued, wrap, noop

from ..attrs import ID, ORTH, NORM, PREFIX, SUFFIX, SHAPE


def Tok2Vec(width, embed_size, **kwargs):
    # Circular imports :(
    from .._ml import CharacterEmbed
    from .._ml import PyTorchBiLSTM

    pretrained_vectors = kwargs.get("pretrained_vectors", None)
    cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3)
    subword_features = kwargs.get("subword_features", True)
    char_embed = kwargs.get("char_embed", False)
    if char_embed:
        subword_features = False
    conv_depth = kwargs.get("conv_depth", 4)
    bilstm_depth = kwargs.get("bilstm_depth", 0)
    cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
    with Model.define_operators({">>": chain, "|": concatenate, "**": clone}):
        norm = HashEmbed(width, embed_size, column=cols.index(NORM), name="embed_norm")
        if subword_features:
            prefix = HashEmbed(
                width, embed_size // 2, column=cols.index(PREFIX), name="embed_prefix"
            )
            suffix = HashEmbed(
                width, embed_size // 2, column=cols.index(SUFFIX), name="embed_suffix"
            )
            shape = HashEmbed(
                width, embed_size // 2, column=cols.index(SHAPE), name="embed_shape"
            )
        else:
            prefix, suffix, shape = (None, None, None)
        if pretrained_vectors is not None:
            glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))

            if subword_features:
                embed = uniqued(
                    (glove | norm | prefix | suffix | shape)
                    >> LN(Maxout(width, width * 5, pieces=3)),
                    column=cols.index(ORTH),
                )
            else:
                embed = uniqued(
                    (glove | norm) >> LN(Maxout(width, width * 2, pieces=3)),
                    column=cols.index(ORTH),
                )
        elif subword_features:
            embed = uniqued(
                (norm | prefix | suffix | shape)
                >> LN(Maxout(width, width * 4, pieces=3)),
                column=cols.index(ORTH),
            )
        elif char_embed:
            embed = concatenate_lists(
                CharacterEmbed(nM=64, nC=8),
                FeatureExtracter(cols) >> with_flatten(norm),
            )
            reduce_dimensions = LN(
                Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces)
            )
        else:
            embed = norm

        convolution = Residual(
            ExtractWindow(nW=1)
            >> LN(Maxout(width, width * 3, pieces=cnn_maxout_pieces))
        )
        if char_embed:
            tok2vec = embed >> with_flatten(
                reduce_dimensions >> convolution ** conv_depth, pad=conv_depth
            )
        else:
            tok2vec = FeatureExtracter(cols) >> with_flatten(
                embed >> convolution ** conv_depth, pad=conv_depth
            )

        if bilstm_depth >= 1:
            tok2vec = tok2vec >> PyTorchBiLSTM(width, width, bilstm_depth)
        # Work around thinc API limitations :(. TODO: Revise in Thinc 7
        tok2vec.nO = width
        tok2vec.embed = embed
    return tok2vec


@layerize
def flatten(seqs, drop=0.0):
    ops = Model.ops
    lengths = ops.asarray([len(seq) for seq in seqs], dtype="i")

    def finish_update(d_X, sgd=None):
        return ops.unflatten(d_X, lengths, pad=0)

    X = ops.flatten(seqs, pad=0)
    return X, finish_update


def concatenate_lists(*layers, **kwargs):  # pragma: no cover
    """Compose two or more models `f`, `g`, etc, such that their outputs are
    concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))`
    """
    if not layers:
        return noop()
    drop_factor = kwargs.get("drop_factor", 1.0)
    ops = layers[0].ops
    layers = [chain(layer, flatten) for layer in layers]
    concat = concatenate(*layers)

    def concatenate_lists_fwd(Xs, drop=0.0):
        if drop is not None:
            drop *= drop_factor
        lengths = ops.asarray([len(X) for X in Xs], dtype="i")
        flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop)
        ys = ops.unflatten(flat_y, lengths)

        def concatenate_lists_bwd(d_ys, sgd=None):
            return bp_flat_y(ops.flatten(d_ys), sgd=sgd)

        return ys, concatenate_lists_bwd

    model = wrap(concatenate_lists_fwd, concat)
    return model
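A hedged sketch of constructing the legacy encoder directly; the hyperparameter values are illustrative:

```python
from spacy.ml._legacy_tok2vec import Tok2Vec

# illustrative hyperparameters; returns a thinc model mapping docs to vectors
tok2vec = Tok2Vec(width=96, embed_size=2000, conv_depth=4, bilstm_depth=0)
```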
42
spacy/ml/_wire.py
Normal file

@ -0,0 +1,42 @@
from __future__ import unicode_literals
from thinc.api import layerize, wrap, noop, chain, concatenate
from thinc.v2v import Model


def concatenate_lists(*layers, **kwargs):  # pragma: no cover
    """Compose two or more models `f`, `g`, etc, such that their outputs are
    concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))`
    """
    if not layers:
        return layerize(noop())
    drop_factor = kwargs.get("drop_factor", 1.0)
    ops = layers[0].ops
    layers = [chain(layer, flatten) for layer in layers]
    concat = concatenate(*layers)

    def concatenate_lists_fwd(Xs, drop=0.0):
        if drop is not None:
            drop *= drop_factor
        lengths = ops.asarray([len(X) for X in Xs], dtype="i")
        flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop)
        ys = ops.unflatten(flat_y, lengths)

        def concatenate_lists_bwd(d_ys, sgd=None):
            return bp_flat_y(ops.flatten(d_ys), sgd=sgd)

        return ys, concatenate_lists_bwd

    model = wrap(concatenate_lists_fwd, concat)
    return model


@layerize
def flatten(seqs, drop=0.0):
    ops = Model.ops
    lengths = ops.asarray([len(seq) for seq in seqs], dtype="i")

    def finish_update(d_X, sgd=None):
        return ops.unflatten(d_X, lengths, pad=0)

    X = ops.flatten(seqs, pad=0)
    return X, finish_update
23
spacy/ml/common.py
Normal file

@ -0,0 +1,23 @@
from __future__ import unicode_literals

from thinc.api import chain
from thinc.v2v import Maxout
from thinc.misc import LayerNorm
from ..util import registry, make_layer


@registry.architectures.register("thinc.FeedForward.v1")
def FeedForward(config):
    layers = [make_layer(layer_cfg) for layer_cfg in config["layers"]]
    model = chain(*layers)
    model.cfg = config
    return model


@registry.architectures.register("spacy.LayerNormalizedMaxout.v1")
def LayerNormalizedMaxout(config):
    width = config["width"]
    pieces = config["pieces"]
    layer = LayerNorm(Maxout(width, pieces=pieces))
    layer.nO = width
    return layer
176
spacy/ml/tok2vec.py
Normal file
176
spacy/ml/tok2vec.py
Normal file
|
@ -0,0 +1,176 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
from thinc.api import chain, layerize, clone, concatenate, with_flatten, uniqued
|
||||
from thinc.api import noop, with_square_sequences
|
||||
from thinc.v2v import Maxout, Model
|
||||
from thinc.i2v import HashEmbed, StaticVectors
|
||||
from thinc.t2t import ExtractWindow
|
||||
from thinc.misc import Residual, LayerNorm, FeatureExtracter
|
||||
from ..util import make_layer, registry
|
||||
from ._wire import concatenate_lists
|
||||
|
||||
|
||||
@registry.architectures.register("spacy.Tok2Vec.v1")
|
||||
def Tok2Vec(config):
|
||||
doc2feats = make_layer(config["@doc2feats"])
|
||||
embed = make_layer(config["@embed"])
|
||||
encode = make_layer(config["@encode"])
|
||||
field_size = getattr(encode, "receptive_field", 0)
|
||||
tok2vec = chain(doc2feats, with_flatten(chain(embed, encode), pad=field_size))
|
||||
tok2vec.cfg = config
|
||||
tok2vec.nO = encode.nO
|
||||
tok2vec.embed = embed
|
||||
tok2vec.encode = encode
|
||||
return tok2vec
|
||||
|
||||
|
||||
@registry.architectures.register("spacy.Doc2Feats.v1")
|
||||
def Doc2Feats(config):
|
||||
columns = config["columns"]
|
||||
return FeatureExtracter(columns)
|
||||
|
||||
|
||||
@registry.architectures.register("spacy.MultiHashEmbed.v1")
|
||||
def MultiHashEmbed(config):
|
||||
# For backwards compatibility with models before the architecture registry,
|
||||
# we have to be careful to get exactly the same model structure. One subtle
|
||||
# trick is that when we define concatenation with the operator, the operator
|
||||
# is actually binary associative. So when we write (a | b | c), we're actually
|
||||
# getting concatenate(concatenate(a, b), c). That's why the implementation
|
||||
# is a bit ugly here.
|
||||
cols = config["columns"]
|
||||
width = config["width"]
|
||||
rows = config["rows"]
|
||||
|
||||
norm = HashEmbed(width, rows, column=cols.index("NORM"), name="embed_norm")
|
||||
if config["use_subwords"]:
|
||||
prefix = HashEmbed(
|
||||
width, rows // 2, column=cols.index("PREFIX"), name="embed_prefix"
|
||||
)
|
||||
suffix = HashEmbed(
|
||||
width, rows // 2, column=cols.index("SUFFIX"), name="embed_suffix"
|
||||
)
|
||||
shape = HashEmbed(
|
||||
width, rows // 2, column=cols.index("SHAPE"), name="embed_shape"
|
||||
)
|
||||
if config.get("@pretrained_vectors"):
|
||||
    glove = make_layer(config["@pretrained_vectors"])
    mix = make_layer(config["@mix"])

    with Model.define_operators({">>": chain, "|": concatenate}):
        if config["use_subwords"] and config["@pretrained_vectors"]:
            mix._layers[0].nI = width * 5
            layer = uniqued(
                (glove | norm | prefix | suffix | shape) >> mix,
                column=cols.index("ORTH"),
            )
        elif config["use_subwords"]:
            mix._layers[0].nI = width * 4
            layer = uniqued(
                (norm | prefix | suffix | shape) >> mix, column=cols.index("ORTH")
            )
        elif config["@pretrained_vectors"]:
            mix._layers[0].nI = width * 2
            layer = uniqued((glove | norm) >> mix, column=cols.index("ORTH"),)
        else:
            layer = norm
    layer.cfg = config
    return layer


@registry.architectures.register("spacy.CharacterEmbed.v1")
def CharacterEmbed(config):
    from .. import _ml

    width = config["width"]
    chars = config["chars"]

    chr_embed = _ml.CharacterEmbedModel(nM=width, nC=chars)
    other_tables = make_layer(config["@embed_features"])
    mix = make_layer(config["@mix"])

    model = chain(concatenate_lists(chr_embed, other_tables), mix)
    model.cfg = config
    return model


@registry.architectures.register("spacy.MaxoutWindowEncoder.v1")
def MaxoutWindowEncoder(config):
    nO = config["width"]
    nW = config["window_size"]
    nP = config["pieces"]
    depth = config["depth"]

    cnn = chain(
        ExtractWindow(nW=nW), LayerNorm(Maxout(nO, nO * ((nW * 2) + 1), pieces=nP))
    )
    model = clone(Residual(cnn), depth)
    model.nO = nO
    model.receptive_field = nW * depth
    return model


@registry.architectures.register("spacy.MishWindowEncoder.v1")
def MishWindowEncoder(config):
    from thinc.v2v import Mish

    nO = config["width"]
    nW = config["window_size"]
    depth = config["depth"]

    cnn = chain(ExtractWindow(nW=nW), LayerNorm(Mish(nO, nO * ((nW * 2) + 1))))
    model = clone(Residual(cnn), depth)
    model.nO = nO
    return model


@registry.architectures.register("spacy.PretrainedVectors.v1")
def PretrainedVectors(config):
    return StaticVectors(config["vectors_name"], config["width"], config["column"])


@registry.architectures.register("spacy.TorchBiLSTMEncoder.v1")
def TorchBiLSTMEncoder(config):
    import torch.nn
    from thinc.extra.wrappers import PyTorchWrapperRNN

    width = config["width"]
    depth = config["depth"]
    if depth == 0:
        return layerize(noop())
    return with_square_sequences(
        PyTorchWrapperRNN(torch.nn.LSTM(width, width // 2, depth, bidirectional=True))
    )


_EXAMPLE_CONFIG = {
    "@doc2feats": {
        "arch": "Doc2Feats",
        "config": {"columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"]},
    },
    "@embed": {
        "arch": "spacy.MultiHashEmbed.v1",
        "config": {
            "width": 96,
            "rows": 2000,
            "columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"],
            "use_subwords": True,
            "@pretrained_vectors": {
                "arch": "TransformedStaticVectors",
                "config": {
                    "vectors_name": "en_vectors_web_lg.vectors",
                    "width": 96,
                    "column": 0,
                },
            },
            "@mix": {
                "arch": "LayerNormalizedMaxout",
                "config": {"width": 96, "pieces": 3},
            },
        },
    },
    "@encode": {
        "arch": "MaxoutWindowEncode",
        "config": {"width": 96, "window_size": 1, "depth": 4, "pieces": 3},
    },
}
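
For reference, these registered architectures are resolved by name at runtime: `make_layer` looks up the `"arch"` string in the architectures registry and calls it with the `"config"` dict. A minimal usage sketch, assuming this module is importable as `spacy.ml.tok2vec` (the import path is an assumption, not confirmed by this diff):

import spacy  # noqa: F401
from spacy.ml.tok2vec import make_layer  # module path is an assumption

# Resolve one of the encoders registered above from a config block
encoder = make_layer(
    {
        "arch": "spacy.MaxoutWindowEncoder.v1",
        "config": {"width": 96, "window_size": 1, "depth": 4, "pieces": 3},
    }
)
print(encoder.nO)  # 96
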
spacy/pipeline/entityruler.py

@@ -4,6 +4,7 @@ from __future__ import unicode_literals
 from collections import defaultdict, OrderedDict
 import srsly
 
+from ..language import component
 from ..errors import Errors
 from ..compat import basestring_
 from ..util import ensure_path, to_disk, from_disk

@@ -13,6 +14,7 @@ from ..matcher import Matcher, PhraseMatcher
 DEFAULT_ENT_ID_SEP = "||"
 
 
+@component("entity_ruler", assigns=["doc.ents", "token.ent_type", "token.ent_iob"])
 class EntityRuler(object):
     """The EntityRuler lets you add spans to the `Doc.ents` using token-based
     rules or exact phrase matches. It can be combined with the statistical

@@ -24,8 +26,6 @@ class EntityRuler(object):
     USAGE: https://spacy.io/usage/rule-based-matching#entityruler
     """
 
-    name = "entity_ruler"
-
     def __init__(self, nlp, phrase_matcher_attr=None, validate=False, **cfg):
         """Initialize the entity ruler. If patterns are supplied here, they
         need to be a list of dictionaries with a `"label"` and `"pattern"`

@@ -64,10 +64,15 @@ class EntityRuler(object):
         self.phrase_matcher_attr = None
         self.phrase_matcher = PhraseMatcher(nlp.vocab, validate=validate)
         self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
+        self._ent_ids = defaultdict(dict)
         patterns = cfg.get("patterns")
         if patterns is not None:
             self.add_patterns(patterns)
 
+    @classmethod
+    def from_nlp(cls, nlp, **cfg):
+        return cls(nlp, **cfg)
+
     def __len__(self):
         """The number of all patterns added to the entity ruler."""
         n_token_patterns = sum(len(p) for p in self.token_patterns.values())

@@ -100,10 +105,9 @@ class EntityRuler(object):
                 continue
             # check for end - 1 here because boundaries are inclusive
             if start not in seen_tokens and end - 1 not in seen_tokens:
-                if self.ent_ids:
-                    label_ = self.nlp.vocab.strings[match_id]
-                    ent_label, ent_id = self._split_label(label_)
-                    span = Span(doc, start, end, label=ent_label)
+                if match_id in self._ent_ids:
+                    label, ent_id = self._ent_ids[match_id]
+                    span = Span(doc, start, end, label=label)
                     if ent_id:
                         for token in span:
                             token.ent_id_ = ent_id

@@ -131,11 +135,11 @@ class EntityRuler(object):
 
     @property
     def ent_ids(self):
-        """All entity ids present in the match patterns meta dicts.
+        """All entity ids present in the match patterns `id` properties.
 
         RETURNS (set): The string entity ids.
 
-        DOCS: https://spacy.io/api/entityruler#labels
+        DOCS: https://spacy.io/api/entityruler#ent_ids
         """
         all_ent_ids = set()
         for l in self.labels:

@@ -147,7 +151,6 @@ class EntityRuler(object):
     @property
     def patterns(self):
         """Get all patterns that were added to the entity ruler.
 
         RETURNS (list): The original patterns, one dictionary per pattern.
 
         DOCS: https://spacy.io/api/entityruler#patterns

@@ -183,14 +186,20 @@ class EntityRuler(object):
         # disable the nlp components after this one in case they hadn't been initialized / deserialised yet
         try:
             current_index = self.nlp.pipe_names.index(self.name)
-            subsequent_pipes = [pipe for pipe in self.nlp.pipe_names[current_index + 1:]]
+            subsequent_pipes = [
+                pipe for pipe in self.nlp.pipe_names[current_index + 1 :]
+            ]
         except ValueError:
             subsequent_pipes = []
-        with self.nlp.disable_pipes(*subsequent_pipes):
+        with self.nlp.disable_pipes(subsequent_pipes):
             for entry in patterns:
                 label = entry["label"]
+                if "id" in entry:
+                    ent_label = label
+                    label = self._create_label(label, entry["id"])
+                    key = self.matcher._normalize_key(label)
+                    self._ent_ids[key] = (ent_label, entry["id"])
+
                 pattern = entry["pattern"]
                 if isinstance(pattern, basestring_):
                     self.phrase_patterns[label].append(self.nlp(pattern))

@@ -199,9 +208,9 @@ class EntityRuler(object):
             else:
                 raise ValueError(Errors.E097.format(pattern=pattern))
         for label, patterns in self.token_patterns.items():
-            self.matcher.add(label, None, *patterns)
+            self.matcher.add(label, patterns)
         for label, patterns in self.phrase_patterns.items():
-            self.phrase_matcher.add(label, None, *patterns)
+            self.phrase_matcher.add(label, patterns)
 
     def _split_label(self, label):
         """Split Entity label into ent_label and ent_id if it contains self.ent_id_sep
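
The `id` support added above can be exercised like this; a minimal sketch, assuming a blank English pipeline (the pattern strings are illustrative):

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.blank("en")
ruler = EntityRuler(nlp)
# Both patterns share an "id", so matched spans expose the same token.ent_id_
ruler.add_patterns(
    [
        {"label": "ORG", "pattern": "Apple", "id": "apple"},
        {"label": "ORG", "pattern": [{"LOWER": "apple"}, {"LOWER": "inc"}], "id": "apple"},
    ]
)
nlp.add_pipe(ruler)
doc = nlp("Apple Inc is based in Cupertino")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents])
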
spacy/pipeline/functions.py

@@ -1,9 +1,16 @@
 # coding: utf8
 from __future__ import unicode_literals
 
+from ..language import component
 from ..matcher import Matcher
+from ..util import filter_spans
 
 
+@component(
+    "merge_noun_chunks",
+    requires=["token.dep", "token.tag", "token.pos"],
+    retokenizes=True,
+)
 def merge_noun_chunks(doc):
     """Merge noun chunks into a single token.
 

@@ -21,6 +28,11 @@ def merge_noun_chunks(doc):
     return doc
 
 
+@component(
+    "merge_entities",
+    requires=["doc.ents", "token.ent_iob", "token.ent_type"],
+    retokenizes=True,
+)
 def merge_entities(doc):
     """Merge entities into a single token.
 

@@ -36,6 +48,7 @@ def merge_entities(doc):
     return doc
 
 
+@component("merge_subtokens", requires=["token.dep"], retokenizes=True)
 def merge_subtokens(doc, label="subtok"):
     """Merge subtokens into a single token.
 

@@ -48,7 +61,7 @@ def merge_subtokens(doc, label="subtok"):
     merger = Matcher(doc.vocab)
     merger.add("SUBTOK", None, [{"DEP": label, "op": "+"}])
     matches = merger(doc)
-    spans = [doc[start : end + 1] for _, start, end in matches]
+    spans = filter_spans([doc[start : end + 1] for _, start, end in matches])
     with doc.retokenize() as retokenizer:
         for span in spans:
             retokenizer.merge(span)
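
The switch to `filter_spans` above matters because the retokenizer cannot merge overlapping spans; the helper keeps the longest non-overlapping matches. A quick illustration:

import spacy
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("New York City is big")
spans = [doc[0:3], doc[0:2], doc[3:5]]  # "New York City", "New York", "is big"
# Longest spans win; shorter overlapping ones are dropped
print([span.text for span in filter_spans(spans)])  # ['New York City', 'is big']
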
spacy/pipeline/hooks.py

@@ -5,9 +5,11 @@ from thinc.t2v import Pooling, max_pool, mean_pool
 from thinc.neural._classes.difference import Siamese, CauchySimilarity
 
 from .pipes import Pipe
+from ..language import component
 from .._ml import link_vectors_to_models
 
 
+@component("sentencizer_hook", assigns=["doc.user_hooks"])
 class SentenceSegmenter(object):
     """A simple spaCy hook, to allow custom sentence boundary detection logic
     (that doesn't require the dependency parse). To change the sentence

@@ -17,8 +19,6 @@ class SentenceSegmenter(object):
     and yield `Span` objects for each sentence.
     """
 
-    name = "sentencizer"
-
     def __init__(self, vocab, strategy=None):
         self.vocab = vocab
         if strategy is None or strategy == "on_punct":

@@ -44,6 +44,7 @@ class SentenceSegmenter(object):
         yield doc[start : len(doc)]
 
 
+@component("similarity", assigns=["doc.user_hooks"])
 class SimilarityHook(Pipe):
     """
     Experimental: A pipeline component to install a hook for supervised

@@ -58,8 +59,6 @@ class SimilarityHook(Pipe):
     Where W is a vector of dimension weights, initialized to 1.
     """
 
-    name = "similarity"
-
     def __init__(self, vocab, model=True, **cfg):
         self.vocab = vocab
         self.model = model
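
A strategy for the hook above is any callable that takes a `Doc` and yields sentence `Span` objects. A minimal sketch, assuming `SentenceSegmenter` is exported from `spacy.pipeline` (the split-on-semicolon rule is purely illustrative):

import spacy
from spacy.pipeline import SentenceSegmenter

def split_on_semicolon(doc):
    # Yield a Span for each run of tokens ending in ";"
    start = 0
    for i, token in enumerate(doc):
        if token.text == ";":
            yield doc[start : i + 1]
            start = i + 1
    if start < len(doc):
        yield doc[start : len(doc)]

nlp = spacy.blank("en")
nlp.add_pipe(SentenceSegmenter(nlp.vocab, strategy=split_on_semicolon))
doc = nlp("first clause ; second clause")
print([sent.text for sent in doc.sents])
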
spacy/pipeline/morphologizer.pyx

@@ -8,6 +8,7 @@ from thinc.api import chain
 from thinc.neural.util import to_categorical, copy_array, get_array_module
 from .. import util
 from .pipes import Pipe
+from ..language import component
 from .._ml import Tok2Vec, build_morphologizer_model
 from .._ml import link_vectors_to_models, zero_init, flatten
 from .._ml import create_default_optimizer

@@ -18,8 +19,8 @@ from ..vocab cimport Vocab
 from ..morphology cimport Morphology
 
 
+@component("morphologizer", assigns=["token.morph", "token.pos"])
 class Morphologizer(Pipe):
-    name = 'morphologizer'
 
     @classmethod
     def Model(cls, **cfg):
@ -13,7 +13,6 @@ from thinc.misc import LayerNorm
|
|||
from thinc.neural.util import to_categorical
|
||||
from thinc.neural.util import get_array_module
|
||||
|
||||
from .functions import merge_subtokens
|
||||
from ..tokens.doc cimport Doc
|
||||
from ..syntax.nn_parser cimport Parser
|
||||
from ..syntax.ner cimport BiluoPushDown
|
||||
|
@ -21,6 +20,8 @@ from ..syntax.arc_eager cimport ArcEager
|
|||
from ..morphology cimport Morphology
|
||||
from ..vocab cimport Vocab
|
||||
|
||||
from .functions import merge_subtokens
|
||||
from ..language import Language, component
|
||||
from ..syntax import nonproj
|
||||
from ..attrs import POS, ID
|
||||
from ..parts_of_speech import X
|
||||
|
@ -55,6 +56,10 @@ class Pipe(object):
|
|||
"""Initialize a model for the pipe."""
|
||||
raise NotImplementedError
|
||||
|
||||
@classmethod
|
||||
def from_nlp(cls, nlp, **cfg):
|
||||
return cls(nlp.vocab, **cfg)
|
||||
|
||||
def __init__(self, vocab, model=True, **cfg):
|
||||
"""Create a new pipe instance."""
|
||||
raise NotImplementedError
|
||||
|
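
The new `from_nlp` classmethod gives every pipe a uniform construction hook for the factory machinery. A minimal sketch of the factory path in the spaCy v2 API:

import spacy

nlp = spacy.blank("en")
# create_pipe resolves the factory name, which constructs the component
# from the shared nlp object (via hooks like from_nlp above)
tagger = nlp.create_pipe("tagger")
nlp.add_pipe(tagger)
print(nlp.pipe_names)  # ['tagger']
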
@@ -224,11 +229,10 @@
         return self
 
 
+@component("tensorizer", assigns=["doc.tensor"])
 class Tensorizer(Pipe):
     """Pre-train position-sensitive vectors for tokens."""
 
-    name = "tensorizer"
-
     @classmethod
     def Model(cls, output_size=300, **cfg):
         """Create a new statistical model for the class.

@@ -363,14 +367,13 @@
         return sgd
 
 
+@component("tagger", assigns=["token.tag", "token.pos"])
 class Tagger(Pipe):
     """Pipeline component for part-of-speech tagging.
 
     DOCS: https://spacy.io/api/tagger
     """
 
-    name = "tagger"
-
     def __init__(self, vocab, model=True, **cfg):
         self.vocab = vocab
         self.model = model

@@ -515,7 +518,6 @@
         orig_tag_map = dict(self.vocab.morphology.tag_map)
         new_tag_map = OrderedDict()
         for raw_text, annots_brackets in get_gold_tuples():
-            _ = annots_brackets.pop()
             for annots, brackets in annots_brackets:
                 ids, words, tags, heads, deps, ents = annots
                 for tag in tags:

@@ -658,13 +660,12 @@
         return self
 
 
+@component("nn_labeller")
 class MultitaskObjective(Tagger):
     """Experimental: Assist training of a parser or tagger, by training a
     side-objective.
     """
 
-    name = "nn_labeller"
-
     def __init__(self, vocab, model=True, target='dep_tag_offset', **cfg):
         self.vocab = vocab
         self.model = model
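
For context, the tagger declared above trains against per-token tag annotations. A minimal training sketch under the v2 API (the tags and tag-map values are illustrative):

import spacy

nlp = spacy.blank("en")
tagger = nlp.create_pipe("tagger")
# Values follow the Universal Dependencies tag map convention
tagger.add_label("N", {"pos": "NOUN"})
tagger.add_label("V", {"pos": "VERB"})
nlp.add_pipe(tagger)
optimizer = nlp.begin_training()
losses = {}
nlp.update(["I like cats"], [{"tags": ["N", "V", "N"]}], sgd=optimizer, losses=losses)
print(losses["tagger"])
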
@@ -923,12 +924,12 @@ class ClozeMultitask(Pipe):
         return words
 
 
+@component("textcat", assigns=["doc.cats"])
 class TextCategorizer(Pipe):
     """Pipeline component for text classification.
 
     DOCS: https://spacy.io/api/textcategorizer
     """
-    name = 'textcat'
 
     @classmethod
     def Model(cls, nr_class=1, **cfg):

@@ -1057,7 +1058,8 @@ class TextCategorizer(Pipe):
         return 1
 
     def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs):
-        for raw_text, (_, (cats, _2)) in get_gold_tuples():
+        for raw_text, annot_brackets in get_gold_tuples():
+            for _, (cats, _2) in annot_brackets:
                 for cat in cats:
                     self.add_label(cat)
         if self.model is True:
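
The text categorizer reads its labels from the gold "cats" annotations during training. A minimal end-to-end sketch under the v2 API (label name and example text are illustrative):

import spacy

nlp = spacy.blank("en")
textcat = nlp.create_pipe("textcat")
textcat.add_label("POSITIVE")
nlp.add_pipe(textcat)
optimizer = nlp.begin_training()
losses = {}
nlp.update(
    ["This is great"],
    [{"cats": {"POSITIVE": 1.0}}],
    sgd=optimizer,
    losses=losses,
)
print(losses["textcat"])
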
@@ -1075,8 +1077,11 @@ cdef class DependencyParser(Parser):
 
     DOCS: https://spacy.io/api/dependencyparser
     """
-
+    # cdef classes can't have decorators, so we're defining this here
     name = "parser"
+    factory = "parser"
+    assigns = ["token.dep", "token.is_sent_start", "doc.sents"]
+    requires = []
     TransitionSystem = ArcEager
     nr_feature = 8
 

@@ -1122,8 +1127,10 @@ cdef class EntityRecognizer(Parser):
 
     DOCS: https://spacy.io/api/entityrecognizer
     """
-
     name = "ner"
+    factory = "ner"
+    assigns = ["doc.ents", "token.ent_iob", "token.ent_type"]
+    requires = []
     TransitionSystem = BiluoPushDown
     nr_feature = 3
 

@@ -1154,12 +1161,16 @@ cdef class EntityRecognizer(Parser):
         return tuple(sorted(labels))
 
 
+@component(
+    "entity_linker",
+    requires=["doc.ents", "token.ent_iob", "token.ent_type"],
+    assigns=["token.ent_kb_id"]
+)
 class EntityLinker(Pipe):
     """Pipeline component for named entity linking.
 
     DOCS: https://spacy.io/api/entitylinker
     """
-    name = 'entity_linker'
     NIL = "NIL"  # string used to refer to a non-existing link
 
     @classmethod

@@ -1220,23 +1231,26 @@ class EntityLinker(Pipe):
             docs = [docs]
             golds = [golds]
 
-        context_docs = []
+        sentence_docs = []
 
        for doc, gold in zip(docs, golds):
             ents_by_offset = dict()
             for ent in doc.ents:
-                ents_by_offset["{}_{}".format(ent.start_char, ent.end_char)] = ent
+                ents_by_offset[(ent.start_char, ent.end_char)] = ent
 
             for entity, kb_dict in gold.links.items():
                 start, end = entity
                 mention = doc.text[start:end]
+                # the gold annotations should link to proper entities - if this fails, the dataset is likely corrupt
+                ent = ents_by_offset[(start, end)]
 
                 for kb_id, value in kb_dict.items():
                     # Currently only training on the positive instances
                     if value:
-                        context_docs.append(doc)
+                        sentence_docs.append(ent.sent.as_doc())
 
-        context_encodings, bp_context = self.model.begin_update(context_docs, drop=drop)
-        loss, d_scores = self.get_similarity_loss(scores=context_encodings, golds=golds, docs=None)
+        sentence_encodings, bp_context = self.model.begin_update(sentence_docs, drop=drop)
+        loss, d_scores = self.get_similarity_loss(scores=sentence_encodings, golds=golds, docs=None)
         bp_context(d_scores, sgd=sgd)
 
         if losses is not None:
@@ -1305,22 +1319,41 @@ class EntityLinker(Pipe):
         if isinstance(docs, Doc):
             docs = [docs]
 
-        context_encodings = self.model(docs)
-        xp = get_array_module(context_encodings)
-
         for i, doc in enumerate(docs):
             if len(doc) > 0:
+                # Looping through each sentence and each entity
+                # This may go wrong if there are entities across sentences - because they might not get a KB ID
+                for sent in doc.sents:
+                    sent_doc = sent.as_doc()
                     # currently, the context is the same for each entity in a sentence (should be refined)
-                    context_encoding = context_encodings[i]
-                    context_enc_t = context_encoding.T
-                    norm_1 = xp.linalg.norm(context_enc_t)
-                for ent in doc.ents:
+                    sentence_encoding = self.model([sent_doc])[0]
+                    xp = get_array_module(sentence_encoding)
+                    sentence_encoding_t = sentence_encoding.T
+                    sentence_norm = xp.linalg.norm(sentence_encoding_t)
 
+                    for ent in sent_doc.ents:
                         entity_count += 1
 
                         to_discard = self.cfg.get("labels_discard", [])
                         if to_discard and ent.label_ in to_discard:
+                            # ignoring this entity - setting to NIL
                             final_kb_ids.append(self.NIL)
                             final_tensors.append(sentence_encoding)
 
                         else:
                             candidates = self.kb.get_candidates(ent.text)
                             if not candidates:
-                                final_kb_ids.append(self.NIL)  # no prediction possible for this entity
-                                final_tensors.append(context_encoding)
+                                # no prediction possible for this entity - setting to NIL
+                                final_kb_ids.append(self.NIL)
+                                final_tensors.append(sentence_encoding)
 
                             elif len(candidates) == 1:
                                 # shortcut for efficiency reasons: take the 1 candidate
 
                                 # TODO: thresholding
                                 final_kb_ids.append(candidates[0].entity_)
                                 final_tensors.append(sentence_encoding)
 
                             else:
                                 random.shuffle(candidates)
@@ -1333,13 +1366,13 @@ class EntityLinker(Pipe):
                             # add in similarity from the context
                             if self.cfg.get("incl_context", True):
                                 entity_encodings = xp.asarray([c.entity_vector for c in candidates])
-                                norm_2 = xp.linalg.norm(entity_encodings, axis=1)
+                                entity_norm = xp.linalg.norm(entity_encodings, axis=1)
 
                                 if len(entity_encodings) != len(prior_probs):
                                     raise RuntimeError(Errors.E147.format(method="predict", msg="vectors not of equal length"))
 
                                 # cosine similarity
-                                sims = xp.dot(entity_encodings, context_enc_t) / (norm_1 * norm_2)
+                                sims = xp.dot(entity_encodings, sentence_encoding_t) / (sentence_norm * entity_norm)
                                 if sims.shape != prior_probs.shape:
                                     raise ValueError(Errors.E161)
                                 scores = prior_probs + sims - (prior_probs*sims)

@@ -1348,7 +1381,7 @@ class EntityLinker(Pipe):
                             best_index = scores.argmax()
                             best_candidate = candidates[best_index]
                             final_kb_ids.append(best_candidate.entity_)
-                            final_tensors.append(context_encoding)
+                            final_tensors.append(sentence_encoding)
 
         if not (len(final_tensors) == len(final_kb_ids) == entity_count):
             raise RuntimeError(Errors.E147.format(method="predict", msg="result variables not of equal length"))
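
The final score above combines the KB prior and the context similarity as a probabilistic OR, so either signal alone can push a candidate up. A small numeric check (numpy stands in here for the `xp` array module; the values are made up):

import numpy as np

prior_probs = np.array([0.7, 0.2])  # P(entity | mention) from the knowledge base
sims = np.array([0.3, 0.9])         # cosine similarity of entity vs. sentence encoding
scores = prior_probs + sims - (prior_probs * sims)
print(scores)  # [0.79 0.92]: the contextually similar candidate wins despite a low prior
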
@@ -1408,13 +1441,13 @@ class EntityLinker(Pipe):
         raise NotImplementedError
 
 
+@component("sentencizer", assigns=["token.is_sent_start", "doc.sents"])
 class Sentencizer(object):
     """Segment the Doc into sentences using a rule-based strategy.
 
     DOCS: https://spacy.io/api/sentencizer
     """
 
-    name = "sentencizer"
     default_punct_chars = ['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹',
             '।', '॥', '၊', '။', '።', '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄',
             '᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿',

@@ -1440,6 +1473,10 @@ class Sentencizer(object):
         else:
             self.punct_chars = set(self.default_punct_chars)
 
+    @classmethod
+    def from_nlp(cls, nlp, **cfg):
+        return cls(**cfg)
+
     def __call__(self, doc):
         """Apply the sentencizer to a Doc and set Token.is_sent_start.
 

@@ -1506,4 +1543,9 @@ class Sentencizer(object):
         return self
 
 
+# Cython classes can't be decorated, so we need to add the factories here
+Language.factories["parser"] = lambda nlp, **cfg: DependencyParser.from_nlp(nlp, **cfg)
+Language.factories["ner"] = lambda nlp, **cfg: EntityRecognizer.from_nlp(nlp, **cfg)
+
 
 __all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "EntityLinker", "Sentencizer"]
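
The rule-based sentencizer registered above is available through its factory name. A minimal usage sketch under the v2 API:

import spacy

nlp = spacy.blank("en")
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)
doc = nlp("This is a sentence. This is another.")
print([sent.text for sent in doc.sents])
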
spacy/scorer.py

@@ -82,6 +82,7 @@ class Scorer(object):
         self.sbd = PRFScore()
         self.unlabelled = PRFScore()
         self.labelled = PRFScore()
+        self.labelled_per_dep = dict()
         self.tags = PRFScore()
         self.ner = PRFScore()
         self.ner_per_ents = dict()

@@ -124,9 +125,18 @@ class Scorer(object):
 
     @property
     def las(self):
-        """RETURNS (float): Labelled depdendency score."""
+        """RETURNS (float): Labelled dependency score."""
         return self.labelled.fscore * 100
 
+    @property
+    def las_per_type(self):
+        """RETURNS (dict): Scores per dependency label.
+        """
+        return {
+            k: {"p": v.precision * 100, "r": v.recall * 100, "f": v.fscore * 100}
+            for k, v in self.labelled_per_dep.items()
+        }
+
     @property
     def ents_p(self):
         """RETURNS (float): Named entity accuracy (precision)."""
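
The per-type scores above reuse the same PRF bookkeeping as the aggregate metrics. A minimal sketch of how one candidate set is scored against a gold set:

from spacy.scorer import PRFScore

prf = PRFScore()
# One true positive ("b"), one false positive ("a"), one false negative ("c")
prf.score_set({"a", "b"}, {"b", "c"})
print(prf.precision, prf.recall, prf.fscore)  # 0.5 0.5 0.5
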
@@ -196,6 +206,7 @@ class Scorer(object):
         return {
             "uas": self.uas,
             "las": self.las,
+            "las_per_type": self.las_per_type,
             "ents_p": self.ents_p,
             "ents_r": self.ents_r,
             "ents_f": self.ents_f,

@@ -219,15 +230,24 @@ class Scorer(object):
         DOCS: https://spacy.io/api/scorer#score
         """
         if len(doc) != len(gold):
-            gold = GoldParse.from_annot_tuples(doc, zip(*gold.orig_annot))
+            gold = GoldParse.from_annot_tuples(
+                doc, tuple(zip(*gold.orig_annot)) + (gold.cats,)
+            )
         gold_deps = set()
+        gold_deps_per_dep = {}
         gold_tags = set()
         gold_ents = set(tags_to_entities([annot[-1] for annot in gold.orig_annot]))
         for id_, word, tag, head, dep, ner in gold.orig_annot:
             gold_tags.add((id_, tag))
             if dep not in (None, "") and dep.lower() not in punct_labels:
                 gold_deps.add((id_, head, dep.lower()))
+                if dep.lower() not in self.labelled_per_dep:
+                    self.labelled_per_dep[dep.lower()] = PRFScore()
+                if dep.lower() not in gold_deps_per_dep:
+                    gold_deps_per_dep[dep.lower()] = set()
+                gold_deps_per_dep[dep.lower()].add((id_, head, dep.lower()))
         cand_deps = set()
+        cand_deps_per_dep = {}
         cand_tags = set()
         for token in doc:
             if token.orth_.isspace():

@@ -247,6 +267,11 @@ class Scorer(object):
                     self.labelled.fp += 1
                 else:
                     cand_deps.add((gold_i, gold_head, token.dep_.lower()))
+                    if token.dep_.lower() not in self.labelled_per_dep:
+                        self.labelled_per_dep[token.dep_.lower()] = PRFScore()
+                    if token.dep_.lower() not in cand_deps_per_dep:
+                        cand_deps_per_dep[token.dep_.lower()] = set()
+                    cand_deps_per_dep[token.dep_.lower()].add((gold_i, gold_head, token.dep_.lower()))
         if "-" not in [token[-1] for token in gold.orig_annot]:
             # Find all NER labels in gold and doc
             ent_labels = set([x[0] for x in gold_ents] + [k.label_ for k in doc.ents])

@@ -278,6 +303,8 @@ class Scorer(object):
             self.ner.score_set(cand_ents, gold_ents)
         self.tags.score_set(cand_tags, gold_tags)
         self.labelled.score_set(cand_deps, gold_deps)
+        for dep in self.labelled_per_dep:
+            self.labelled_per_dep[dep].score_set(cand_deps_per_dep.get(dep, set()), gold_deps_per_dep.get(dep, set()))
         self.unlabelled.score_set(
             set(item[:2] for item in cand_deps), set(item[:2] for item in gold_deps)
         )