Merge master

This commit is contained in:
Matthew Honnibal 2019-11-17 17:19:08 +01:00
commit a48e69d8d1
195 changed files with 6107 additions and 1900 deletions

.github/contributors/GiorgioPorgio.md (new file, 106 lines added)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | George Ketsopoulos |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 23 October 2019 |
| GitHub username | GiorgioPorgio |
| Website (optional) | |

.github/contributors/PeterGilles.md (new file, 106 lines added)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Peter Gilles |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 10.10. |
| GitHub username | Peter Gilles |
| Website (optional) | |

.github/contributors/gustavengstrom.md (new file, 106 lines added)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Gustav Engström |
| Company name (if applicable) | Davcon |
| Title or role (if applicable)  | Data scientist       |
| Date | 2019-10-10 |
| GitHub username | gustavengstrom |
| Website (optional) | |

.github/contributors/neelkamath.md (new file, 106 lines added)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ---------------------- |
| Name | Neel Kamath |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | October 30, 2019 |
| GitHub username | neelkamath |
| Website (optional) | https://neelkamath.com |

.github/contributors/pberba.md (new file, 106 lines added)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Pepe Berba |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-10-18 |
| GitHub username | pberba |
| Website (optional) | |

.github/contributors/prilopes.md (new file, 106 lines added)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Priscilla Lopes |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-11-06 |
| GitHub username | prilopes |
| Website (optional) | |

.github/contributors/questoph.md (new file, 106 lines added)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Christoph Purschke |
| Company name (if applicable) | University of Luxembourg |
| Title or role (if applicable) | |
| Date | 14/11/2019 |
| GitHub username | questoph |
| Website (optional) | https://purschke.info |

.github/contributors/zhuorulin.md (new file, 106 lines added)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Zhuoru Lin |
| Company name (if applicable) | Bombora Inc. |
| Title or role (if applicable) | Data Scientist |
| Date | 2017-11-13 |
| GitHub username | ZhuoruLin |
| Website (optional) | |

@@ -9,7 +9,7 @@ dist/spacy.pex : dist/spacy-$(sha).pex
 dist/spacy-$(sha).pex : dist/$(wheel)
 	env3.6/bin/python -m pip install pex==1.5.3
-	env3.6/bin/pex pytest dist/$(wheel) -e spacy -o dist/spacy-$(sha).pex
+	env3.6/bin/pex pytest dist/$(wheel) spacy_lookups_data -e spacy -o dist/spacy-$(sha).pex
 dist/$(wheel) : setup.py spacy/*.py* spacy/*/*.py*
 	python3.6 -m venv env3.6

@@ -72,21 +72,21 @@ it.
 ## Features
 - Non-destructive **tokenization**
 - **Named entity** recognition
 - Support for **50+ languages**
 - pretrained [statistical models](https://spacy.io/models) and word vectors
 - State-of-the-art speed
 - Easy **deep learning** integration
 - Part-of-speech tagging
 - Labelled dependency parsing
 - Syntax-driven sentence segmentation
 - Built in **visualizers** for syntax and NER
 - Convenient string-to-hash mapping
 - Export to numpy data arrays
 - Efficient binary serialization
 - Easy **model packaging** and deployment
 - Robust, rigorously evaluated accuracy
 📖 **For more details, see the
 [facts, figures and benchmarks](https://spacy.io/usage/facts-figures).**
@@ -96,10 +96,10 @@ it.
 For detailed installation instructions, see the
 [documentation](https://spacy.io/usage).
 - **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual
   Studio)
 - **Python version**: Python 2.7, 3.5+ (only 64 bit)
 - **Package managers**: [pip] · [conda] (via `conda-forge`)
 [pip]: https://pypi.org/project/spacy/
 [conda]: https://anaconda.org/conda-forge/spacy
@@ -135,8 +135,7 @@ Thanks to our great community, we've finally re-added conda support. You can now
 install spaCy via `conda-forge`:
 ```bash
-conda config --add channels conda-forge
-conda install spacy
+conda install -c conda-forge spacy
 ```
 For the feedstock including the build recipe and configuration, check out
@@ -181,9 +180,6 @@ pointing pip to a path or URL.
 # download best-matching version of specific model for your spaCy installation
 python -m spacy download en_core_web_sm
-# out-of-the-box: download best-matching default model
-python -m spacy download en
 # pip install .tar.gz archive from path or URL
 pip install /Users/you/en_core_web_sm-2.2.0.tar.gz
 pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
@@ -197,7 +193,7 @@ path to the model data directory.
 ```python
 import spacy
 nlp = spacy.load("en_core_web_sm")
-doc = nlp(u"This is a sentence.")
+doc = nlp("This is a sentence.")
 ```
 You can also `import` a model directly via its full name and then call its
@@ -208,22 +204,12 @@ import spacy
 import en_core_web_sm
 nlp = en_core_web_sm.load()
-doc = nlp(u"This is a sentence.")
+doc = nlp("This is a sentence.")
 ```
 📖 **For more info and examples, check out the
 [models documentation](https://spacy.io/docs/usage/models).**
-### Support for older versions
-If you're using an older version (`v1.6.0` or below), you can still download and
-install the old models from within spaCy using `python -m spacy.en.download all`
-or `python -m spacy.de.download all`. The `.tar.gz` archives are also
-[attached to the v1.6.0 release](https://github.com/explosion/spaCy/tree/v1.6.0).
-To download and install the models manually, unpack the archive, drop the
-contained directory into `spacy/data` and load the model via `spacy.load('en')`
-or `spacy.load('de')`.
 ## Compile from source
 The other way to install spaCy is to clone its

@@ -9,6 +9,11 @@ trigger:
     exclude:
       - 'website/*'
       - '*.md'
+pr:
+  paths:
+    exclude:
+      - 'website/*'
+      - '*.md'
 jobs:
@@ -30,24 +35,12 @@ jobs:
   dependsOn: 'Validate'
   strategy:
     matrix:
-      # Python 2.7 currently doesn't work because it seems to be a narrow
-      # unicode build, which causes problems with the regular expressions
-      # Python27Linux:
-      # imageName: 'ubuntu-16.04'
-      # python.version: '2.7'
-      # Python27Mac:
-      # imageName: 'macos-10.13'
-      # python.version: '2.7'
       Python35Linux:
         imageName: 'ubuntu-16.04'
         python.version: '3.5'
       Python35Windows:
         imageName: 'vs2017-win2016'
         python.version: '3.5'
-      Python35Mac:
-        imageName: 'macos-10.13'
-        python.version: '3.5'
       Python36Linux:
         imageName: 'ubuntu-16.04'
         python.version: '3.6'
@@ -66,6 +59,15 @@ jobs:
       Python37Mac:
         imageName: 'macos-10.13'
         python.version: '3.7'
+      Python38Linux:
+        imageName: 'ubuntu-16.04'
+        python.version: '3.8'
+      Python38Windows:
+        imageName: 'vs2017-win2016'
+        python.version: '3.8'
+      Python38Mac:
+        imageName: 'macos-10.13'
+        python.version: '3.8'
     maxParallel: 4
     pool:
       vmImage: $(imageName)
@@ -76,10 +78,8 @@ jobs:
       versionSpec: '$(python.version)'
       architecture: 'x64'
-  # Downgrading pip is necessary to prevent a wheel version incompatiblity.
-  # Might be fixed in the future or some other way, so investigate again.
   - script: |
-      python -m pip install -U pip==18.1 setuptools
+      python -m pip install -U setuptools
       pip install -r requirements.txt
     displayName: 'Install dependencies'

@@ -84,7 +84,7 @@ def read_conllu(file_):
 def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
     if text_loc.parts[-1].endswith(".conllu"):
         docs = []
-        with text_loc.open() as file_:
+        with text_loc.open(encoding="utf8") as file_:
             for conllu_doc in read_conllu(file_):
                 for conllu_sent in conllu_doc:
                     words = [line[1] for line in conllu_sent]

@@ -7,7 +7,6 @@ from __future__ import unicode_literals
 import plac
 from pathlib import Path
 import re
-import sys
 import json
 import spacy
@@ -19,12 +18,9 @@ from spacy.util import compounding, minibatch, minibatch_by_words
 from spacy.syntax.nonproj import projectivize
 from spacy.matcher import Matcher
 from spacy import displacy
-from collections import defaultdict, Counter
-from timeit import default_timer as timer
-import itertools
+from collections import defaultdict
 import random
-import numpy.random
 from spacy import lang
 from spacy.lang import zh
@@ -203,7 +199,7 @@ def golds_to_gold_tuples(docs, golds):
 def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
     if text_loc.parts[-1].endswith(".conllu"):
         docs = []
-        with text_loc.open() as file_:
+        with text_loc.open(encoding="utf8") as file_:
             for conllu_doc in read_conllu(file_):
                 for conllu_sent in conllu_doc:
                     words = [line[1] for line in conllu_sent]
@@ -225,6 +221,13 @@ def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
 def write_conllu(docs, file_):
+    if not Token.has_extension("get_conllu_lines"):
+        Token.set_extension("get_conllu_lines", method=get_token_conllu)
+    if not Token.has_extension("begins_fused"):
+        Token.set_extension("begins_fused", default=False)
+    if not Token.has_extension("inside_fused"):
+        Token.set_extension("inside_fused", default=False)
     merger = Matcher(docs[0].vocab)
     merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}])
     for i, doc in enumerate(docs):
@@ -323,10 +326,6 @@ def get_token_conllu(token, i):
     return "\n".join(lines)
-Token.set_extension("get_conllu_lines", method=get_token_conllu, force=True)
-Token.set_extension("begins_fused", default=False, force=True)
-Token.set_extension("inside_fused", default=False, force=True)
 ##################
 # Initialization #
@@ -378,7 +377,7 @@ def _load_pretrained_tok2vec(nlp, loc):
     """Load pretrained weights for the 'token-to-vector' part of the component
     models, which is typically a CNN. See 'spacy pretrain'. Experimental.
     """
-    with Path(loc).open("rb") as file_:
+    with Path(loc).open("rb", encoding="utf8") as file_:
         weights_data = file_.read()
     loaded = []
     for name, component in nlp.pipeline:
@@ -459,13 +458,13 @@ class TreebankPaths(object):
 @plac.annotations(
     ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
-    parses_dir=("Directory to write the development parses", "positional", None, Path),
     corpus=(
-        "UD corpus to train and evaluate on, e.g. en, es_ancora, etc",
+        "UD corpus to train and evaluate on, e.g. UD_Spanish-AnCora",
         "positional",
         None,
         str,
     ),
+    parses_dir=("Directory to write the development parses", "positional", None, Path),
     config=("Path to json formatted config file", "option", "C", Path),
     limit=("Size limit", "option", "n", int),
     gpu_device=("Use GPU", "option", "g", int),
@@ -490,6 +489,10 @@ def main(
     # temp fix to avoid import issues cf https://github.com/explosion/spaCy/issues/4200
     import tqdm
+    Token.set_extension("get_conllu_lines", method=get_token_conllu)
+    Token.set_extension("begins_fused", default=False)
+    Token.set_extension("inside_fused", default=False)
     spacy.util.fix_random_seed()
     lang.zh.Chinese.Defaults.use_jieba = False
     lang.ja.Japanese.Defaults.use_janome = False
@@ -506,8 +509,8 @@ def main(
     docs, golds = read_data(
         nlp,
-        paths.train.conllu.open(),
-        paths.train.text.open(),
+        paths.train.conllu.open(encoding="utf8"),
+        paths.train.text.open(encoding="utf8"),
         max_doc_length=config.max_doc_length,
         limit=limit,
     )
@@ -519,8 +522,8 @@ def main(
     for i in range(config.nr_epoch):
         docs, golds = read_data(
             nlp,
-            paths.train.conllu.open(),
-            paths.train.text.open(),
+            paths.train.conllu.open(encoding="utf8"),
+            paths.train.text.open(encoding="utf8"),
             max_doc_length=config.max_doc_length,
             limit=limit,
             oracle_segments=use_oracle_segments,
@@ -560,7 +563,7 @@ def main(
 def _render_parses(i, to_render):
     to_render[0].user_data["title"] = "Batch %d" % i
-    with Path("/tmp/parses.html").open("w") as file_:
+    with Path("/tmp/parses.html").open("w", encoding="utf8") as file_:
         html = displacy.render(to_render[:5], style="dep", page=True)
         file_.write(html)

@@ -0,0 +1,34 @@
## Entity Linking with Wikipedia and Wikidata
### Step 1: Create a Knowledge Base (KB) and training data
Run `wikipedia_pretrain_kb.py`
* This takes as input the locations of a **Wikipedia and a Wikidata dump**, and produces a **KB directory** + **training file**
* Wikidata: get `latest-all.json.bz2` from https://dumps.wikimedia.org/wikidatawiki/entities/
* Wikipedia: get `enwiki-latest-pages-articles-multistream.xml.bz2` from https://dumps.wikimedia.org/enwiki/latest/ (or the equivalent dump for any other language)
* You can set the filtering parameters for KB construction:
* `max_per_alias`: maximum number of candidate entities kept in the KB per alias/synonym
* `min_freq`: minimum number of times an entity must occur in the corpus to be included in the KB
* `min_pair`: minimum number of times an entity+alias combination must occur in the corpus to be included in the KB
* Further parameters to set:
* `descriptions_from_wikipedia`: whether to parse descriptions from Wikipedia (`True`) or Wikidata (`False`)
* `entity_vector_length`: length of the pre-trained entity description vectors
* `lang`: language for which to fetch Wikidata information (as the dump contains all languages)
Quick testing and rerunning:
* When trying out the pipeline for a quick test, set `limit_prior`, `limit_train` and/or `limit_wd` to read only parts of the dumps instead of everything.
* If you only want to (re)run certain parts of the pipeline, just remove the corresponding files and they will be recalculated or reparsed.
### Step 2: Train an Entity Linking model
Run `wikidata_train_entity_linker.py`
* This takes the **KB directory** produced by Step 1, and trains an **Entity Linking model**
* You can set the learning parameters for the EL training:
* `epochs`: number of training iterations
* `dropout`: dropout rate
* `lr`: learning rate
* `l2`: L2 regularization
* Specify the number of training and dev testing entities with `train_inst` and `dev_inst` respectively
* Further parameters to set:
* `labels_discard`: NER label types to discard during training
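
Once both steps have run, the resulting artifacts can be inspected with the regular spaCy v2.x API. The snippet below is a minimal sketch and is not part of the scripts above; the base model (`en_core_web_lg`), the `entity_vector_length` value and the `output/kb` path are placeholders and must match whatever was actually used when the KB was built.

```python
import spacy
from spacy.kb import KnowledgeBase

# The vocab must be the same one the KB was built with (assumed: en_core_web_lg).
nlp = spacy.load("en_core_web_lg")

# entity_vector_length must match the value used during KB construction (assumed: 64).
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)
kb.load_bulk("output/kb")  # hypothetical path to the KB file written in Step 1

# List the candidate entities stored for an alias, with their prior probabilities.
for candidate in kb.get_candidates("Douglas Adams"):
    print(candidate.entity_, candidate.prior_prob)
```

The model directory written in Step 2 contains a regular pipeline with an `entity_linker` component, so it should be loadable with `spacy.load()` like any other spaCy model.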

@@ -6,6 +6,7 @@ OUTPUT_MODEL_DIR = "nlp"
 PRIOR_PROB_PATH = "prior_prob.csv"
 ENTITY_DEFS_PATH = "entity_defs.csv"
 ENTITY_FREQ_PATH = "entity_freq.csv"
+ENTITY_ALIAS_PATH = "entity_alias.csv"
 ENTITY_DESCR_PATH = "entity_descriptions.csv"
 LOG_FORMAT = '%(asctime)s - %(levelname)s - %(name)s - %(message)s'


@ -15,10 +15,11 @@ class Metrics(object):
candidate_is_correct = true_entity == candidate candidate_is_correct = true_entity == candidate
# Assume that we have no labeled negatives in the data (i.e. cases where true_entity is "NIL") # Assume that we have no labeled negatives in the data (i.e. cases where true_entity is "NIL")
# Therefore, if candidate_is_correct then we have a true positive and never a true negative # Therefore, if candidate_is_correct then we have a true positive and never a true negative.
self.true_pos += candidate_is_correct self.true_pos += candidate_is_correct
self.false_neg += not candidate_is_correct self.false_neg += not candidate_is_correct
if candidate not in {"", "NIL"}: if candidate and candidate not in {"", "NIL"}:
# A wrong prediction (e.g. Q42 != Q3) counts both as a FP as well as a FN.
self.false_pos += not candidate_is_correct self.false_pos += not candidate_is_correct
def calculate_precision(self): def calculate_precision(self):
@ -33,6 +34,14 @@ class Metrics(object):
else: else:
return self.true_pos / (self.true_pos + self.false_neg) return self.true_pos / (self.true_pos + self.false_neg)
def calculate_fscore(self):
p = self.calculate_precision()
r = self.calculate_recall()
if p + r == 0:
return 0.0
else:
return 2 * p * r / (p + r)
class EvaluationResults(object): class EvaluationResults(object):
def __init__(self): def __init__(self):
@ -43,18 +52,20 @@ class EvaluationResults(object):
self.metrics.update_results(true_entity, candidate) self.metrics.update_results(true_entity, candidate)
self.metrics_by_label[ent_label].update_results(true_entity, candidate) self.metrics_by_label[ent_label].update_results(true_entity, candidate)
def increment_false_negatives(self):
self.metrics.false_neg += 1
def report_metrics(self, model_name): def report_metrics(self, model_name):
model_str = model_name.title() model_str = model_name.title()
recall = self.metrics.calculate_recall() recall = self.metrics.calculate_recall()
precision = self.metrics.calculate_precision() precision = self.metrics.calculate_precision()
return ("{}: ".format(model_str) + fscore = self.metrics.calculate_fscore()
"Recall = {} | ".format(round(recall, 3)) + return (
"Precision = {} | ".format(round(precision, 3)) + "{}: ".format(model_str)
"Precision by label = {}".format({k: v.calculate_precision() + "F-score = {} | ".format(round(fscore, 3))
for k, v in self.metrics_by_label.items()})) + "Recall = {} | ".format(round(recall, 3))
+ "Precision = {} | ".format(round(precision, 3))
+ "F-score by label = {}".format(
{k: v.calculate_fscore() for k, v in sorted(self.metrics_by_label.items())}
)
)
class BaselineResults(object): class BaselineResults(object):
@ -63,40 +74,51 @@ class BaselineResults(object):
self.prior = EvaluationResults() self.prior = EvaluationResults()
self.oracle = EvaluationResults() self.oracle = EvaluationResults()
def report_accuracy(self, model): def report_performance(self, model):
results = getattr(self, model) results = getattr(self, model)
return results.report_metrics(model) return results.report_metrics(model)
def update_baselines(self, true_entity, ent_label, random_candidate, prior_candidate, oracle_candidate): def update_baselines(
self,
true_entity,
ent_label,
random_candidate,
prior_candidate,
oracle_candidate,
):
self.oracle.update_metrics(ent_label, true_entity, oracle_candidate) self.oracle.update_metrics(ent_label, true_entity, oracle_candidate)
self.prior.update_metrics(ent_label, true_entity, prior_candidate) self.prior.update_metrics(ent_label, true_entity, prior_candidate)
self.random.update_metrics(ent_label, true_entity, random_candidate) self.random.update_metrics(ent_label, true_entity, random_candidate)
def measure_performance(dev_data, kb, el_pipe): def measure_performance(dev_data, kb, el_pipe, baseline=True, context=True):
baseline_accuracies = measure_baselines( if baseline:
dev_data, kb baseline_accuracies, counts = measure_baselines(dev_data, kb)
) logger.info("Counts: {}".format({k: v for k, v in sorted(counts.items())}))
logger.info(baseline_accuracies.report_performance("random"))
logger.info(baseline_accuracies.report_performance("prior"))
logger.info(baseline_accuracies.report_performance("oracle"))
logger.info(baseline_accuracies.report_accuracy("random")) if context:
logger.info(baseline_accuracies.report_accuracy("prior")) # using only context
logger.info(baseline_accuracies.report_accuracy("oracle")) el_pipe.cfg["incl_context"] = True
el_pipe.cfg["incl_prior"] = False
results = get_eval_results(dev_data, el_pipe)
logger.info(results.report_metrics("context only"))
# using only context # measuring combined accuracy (prior + context)
el_pipe.cfg["incl_context"] = True el_pipe.cfg["incl_context"] = True
el_pipe.cfg["incl_prior"] = False el_pipe.cfg["incl_prior"] = True
results = get_eval_results(dev_data, el_pipe) results = get_eval_results(dev_data, el_pipe)
logger.info(results.report_metrics("context only")) logger.info(results.report_metrics("context and prior"))
# measuring combined accuracy (prior + context)
el_pipe.cfg["incl_context"] = True
el_pipe.cfg["incl_prior"] = True
results = get_eval_results(dev_data, el_pipe)
logger.info(results.report_metrics("context and prior"))
def get_eval_results(data, el_pipe=None): def get_eval_results(data, el_pipe=None):
# If the docs in the data require further processing with an entity linker, set el_pipe """
Evaluate the ent.kb_id_ annotations against the gold standard.
Only evaluate entities that overlap between gold and NER, to isolate the performance of the NEL.
If the docs in the data require further processing with an entity linker, set el_pipe.
"""
from tqdm import tqdm from tqdm import tqdm
docs = [] docs = []
@ -111,18 +133,15 @@ def get_eval_results(data, el_pipe=None):
results = EvaluationResults() results = EvaluationResults()
for doc, gold in zip(docs, golds): for doc, gold in zip(docs, golds):
tagged_entries_per_article = {_offset(ent.start_char, ent.end_char): ent for ent in doc.ents}
try: try:
correct_entries_per_article = dict() correct_entries_per_article = dict()
for entity, kb_dict in gold.links.items(): for entity, kb_dict in gold.links.items():
start, end = entity start, end = entity
# only evaluating on positive examples
for gold_kb, value in kb_dict.items(): for gold_kb, value in kb_dict.items():
if value: if value:
# only evaluating on positive examples
offset = _offset(start, end) offset = _offset(start, end)
correct_entries_per_article[offset] = gold_kb correct_entries_per_article[offset] = gold_kb
if offset not in tagged_entries_per_article:
results.increment_false_negatives()
for ent in doc.ents: for ent in doc.ents:
ent_label = ent.label_ ent_label = ent.label_
@ -142,7 +161,11 @@ def get_eval_results(data, el_pipe=None):
def measure_baselines(data, kb): def measure_baselines(data, kb):
# Measure 3 performance baselines: random selection, prior probabilities, and 'oracle' prediction for upper bound """
Measure 3 performance baselines: random selection, prior probabilities, and 'oracle' prediction for upper bound.
Only evaluate entities that overlap between gold and NER, to isolate the performance of the NEL.
Also return a dictionary of counts by entity label.
"""
counts_d = dict() counts_d = dict()
baseline_results = BaselineResults() baseline_results = BaselineResults()
@ -152,7 +175,6 @@ def measure_baselines(data, kb):
for doc, gold in zip(docs, golds): for doc, gold in zip(docs, golds):
correct_entries_per_article = dict() correct_entries_per_article = dict()
tagged_entries_per_article = {_offset(ent.start_char, ent.end_char): ent for ent in doc.ents}
for entity, kb_dict in gold.links.items(): for entity, kb_dict in gold.links.items():
start, end = entity start, end = entity
for gold_kb, value in kb_dict.items(): for gold_kb, value in kb_dict.items():
@ -160,10 +182,6 @@ def measure_baselines(data, kb):
if value: if value:
offset = _offset(start, end) offset = _offset(start, end)
correct_entries_per_article[offset] = gold_kb correct_entries_per_article[offset] = gold_kb
if offset not in tagged_entries_per_article:
baseline_results.random.increment_false_negatives()
baseline_results.oracle.increment_false_negatives()
baseline_results.prior.increment_false_negatives()
for ent in doc.ents: for ent in doc.ents:
ent_label = ent.label_ ent_label = ent.label_
@ -176,7 +194,7 @@ def measure_baselines(data, kb):
if gold_entity is not None: if gold_entity is not None:
candidates = kb.get_candidates(ent.text) candidates = kb.get_candidates(ent.text)
oracle_candidate = "" oracle_candidate = ""
best_candidate = "" prior_candidate = ""
random_candidate = "" random_candidate = ""
if candidates: if candidates:
scores = [] scores = []
@ -187,13 +205,21 @@ def measure_baselines(data, kb):
oracle_candidate = c.entity_ oracle_candidate = c.entity_
best_index = scores.index(max(scores)) best_index = scores.index(max(scores))
best_candidate = candidates[best_index].entity_ prior_candidate = candidates[best_index].entity_
random_candidate = random.choice(candidates).entity_ random_candidate = random.choice(candidates).entity_
baseline_results.update_baselines(gold_entity, ent_label, current_count = counts_d.get(ent_label, 0)
random_candidate, best_candidate, oracle_candidate) counts_d[ent_label] = current_count+1
return baseline_results baseline_results.update_baselines(
gold_entity,
ent_label,
random_candidate,
prior_candidate,
oracle_candidate,
)
return baseline_results, counts_d
def _offset(start, end): def _offset(start, end):


@ -1,17 +1,12 @@
# coding: utf-8 # coding: utf-8
from __future__ import unicode_literals from __future__ import unicode_literals
import csv
import logging import logging
import spacy
import sys
from spacy.kb import KnowledgeBase from spacy.kb import KnowledgeBase
from bin.wiki_entity_linking import wikipedia_processor as wp
from bin.wiki_entity_linking.train_descriptions import EntityEncoder from bin.wiki_entity_linking.train_descriptions import EntityEncoder
from bin.wiki_entity_linking import wiki_io as io
csv.field_size_limit(sys.maxsize)
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@ -22,18 +17,24 @@ def create_kb(
max_entities_per_alias, max_entities_per_alias,
min_entity_freq, min_entity_freq,
min_occ, min_occ,
entity_def_input, entity_def_path,
entity_descr_path, entity_descr_path,
count_input, entity_alias_path,
prior_prob_input, entity_freq_path,
prior_prob_path,
entity_vector_length, entity_vector_length,
): ):
# Create the knowledge base from Wikidata entries # Create the knowledge base from Wikidata entries
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=entity_vector_length) kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=entity_vector_length)
entity_list, filtered_title_to_id = _define_entities(nlp, kb, entity_def_path, entity_descr_path, min_entity_freq, entity_freq_path, entity_vector_length)
_define_aliases(kb, entity_alias_path, entity_list, filtered_title_to_id, max_entities_per_alias, min_occ, prior_prob_path)
return kb
def _define_entities(nlp, kb, entity_def_path, entity_descr_path, min_entity_freq, entity_freq_path, entity_vector_length):
# read the mappings from file # read the mappings from file
title_to_id = get_entity_to_id(entity_def_input) title_to_id = io.read_title_to_id(entity_def_path)
id_to_descr = get_id_to_description(entity_descr_path) id_to_descr = io.read_id_to_descr(entity_descr_path)
# check the length of the nlp vectors # check the length of the nlp vectors
if "vectors" in nlp.meta and nlp.vocab.vectors.size: if "vectors" in nlp.meta and nlp.vocab.vectors.size:
@ -45,10 +46,8 @@ def create_kb(
" cf. https://spacy.io/usage/models#languages." " cf. https://spacy.io/usage/models#languages."
) )
logger.info("Get entity frequencies")
entity_frequencies = wp.get_all_frequencies(count_input=count_input)
logger.info("Filtering entities with fewer than {} mentions".format(min_entity_freq)) logger.info("Filtering entities with fewer than {} mentions".format(min_entity_freq))
entity_frequencies = io.read_entity_to_count(entity_freq_path)
# filter the entities for in the KB by frequency, because there's just too much data (8M entities) otherwise # filter the entities for in the KB by frequency, because there's just too much data (8M entities) otherwise
filtered_title_to_id, entity_list, description_list, frequency_list = get_filtered_entities( filtered_title_to_id, entity_list, description_list, frequency_list = get_filtered_entities(
title_to_id, title_to_id,
@ -56,36 +55,33 @@ def create_kb(
entity_frequencies, entity_frequencies,
min_entity_freq min_entity_freq
) )
logger.info("Left with {} entities".format(len(description_list))) logger.info("Kept {} entities from the set of {}".format(len(description_list), len(title_to_id.keys())))
logger.info("Train entity encoder") logger.info("Training entity encoder")
encoder = EntityEncoder(nlp, input_dim, entity_vector_length) encoder = EntityEncoder(nlp, input_dim, entity_vector_length)
encoder.train(description_list=description_list, to_print=True) encoder.train(description_list=description_list, to_print=True)
logger.info("Get entity embeddings:") logger.info("Getting entity embeddings")
embeddings = encoder.apply_encoder(description_list) embeddings = encoder.apply_encoder(description_list)
logger.info("Adding {} entities".format(len(entity_list))) logger.info("Adding {} entities".format(len(entity_list)))
kb.set_entities( kb.set_entities(
entity_list=entity_list, freq_list=frequency_list, vector_list=embeddings entity_list=entity_list, freq_list=frequency_list, vector_list=embeddings
) )
return entity_list, filtered_title_to_id
logger.info("Adding aliases")
def _define_aliases(kb, entity_alias_path, entity_list, filtered_title_to_id, max_entities_per_alias, min_occ, prior_prob_path):
logger.info("Adding aliases from Wikipedia and Wikidata")
_add_aliases( _add_aliases(
kb, kb,
entity_list=entity_list,
title_to_id=filtered_title_to_id, title_to_id=filtered_title_to_id,
max_entities_per_alias=max_entities_per_alias, max_entities_per_alias=max_entities_per_alias,
min_occ=min_occ, min_occ=min_occ,
prior_prob_input=prior_prob_input, prior_prob_path=prior_prob_path,
) )
logger.info("KB size: {} entities, {} aliases".format(
kb.get_size_entities(),
kb.get_size_aliases()))
logger.info("Done with kb")
return kb
def get_filtered_entities(title_to_id, id_to_descr, entity_frequencies, def get_filtered_entities(title_to_id, id_to_descr, entity_frequencies,
min_entity_freq: int = 10): min_entity_freq: int = 10):
@ -104,34 +100,13 @@ def get_filtered_entities(title_to_id, id_to_descr, entity_frequencies,
return filtered_title_to_id, entity_list, description_list, frequency_list return filtered_title_to_id, entity_list, description_list, frequency_list
def get_entity_to_id(entity_def_output): def _add_aliases(kb, entity_list, title_to_id, max_entities_per_alias, min_occ, prior_prob_path):
entity_to_id = dict()
with entity_def_output.open("r", encoding="utf8") as csvfile:
csvreader = csv.reader(csvfile, delimiter="|")
# skip header
next(csvreader)
for row in csvreader:
entity_to_id[row[0]] = row[1]
return entity_to_id
def get_id_to_description(entity_descr_path):
id_to_desc = dict()
with entity_descr_path.open("r", encoding="utf8") as csvfile:
csvreader = csv.reader(csvfile, delimiter="|")
# skip header
next(csvreader)
for row in csvreader:
id_to_desc[row[0]] = row[1]
return id_to_desc
def _add_aliases(kb, title_to_id, max_entities_per_alias, min_occ, prior_prob_input):
wp_titles = title_to_id.keys() wp_titles = title_to_id.keys()
# adding aliases with prior probabilities # adding aliases with prior probabilities
# we can read this file sequentially, it's sorted by alias, and then by count # we can read this file sequentially, it's sorted by alias, and then by count
with prior_prob_input.open("r", encoding="utf8") as prior_file: logger.info("Adding WP aliases")
with prior_prob_path.open("r", encoding="utf8") as prior_file:
# skip header # skip header
prior_file.readline() prior_file.readline()
line = prior_file.readline() line = prior_file.readline()
@ -180,10 +155,7 @@ def _add_aliases(kb, title_to_id, max_entities_per_alias, min_occ, prior_prob_in
line = prior_file.readline() line = prior_file.readline()
def read_nlp_kb(model_dir, kb_file): def read_kb(nlp, kb_file):
nlp = spacy.load(model_dir)
kb = KnowledgeBase(vocab=nlp.vocab) kb = KnowledgeBase(vocab=nlp.vocab)
kb.load_bulk(kb_file) kb.load_bulk(kb_file)
logger.info("kb entities: {}".format(kb.get_size_entities())) return kb
logger.info("kb aliases: {}".format(kb.get_size_aliases()))
return nlp, kb


@ -53,7 +53,7 @@ class EntityEncoder:
start = start + batch_size start = start + batch_size
stop = min(stop + batch_size, len(description_list)) stop = min(stop + batch_size, len(description_list))
logger.info("encoded: {} entities".format(stop)) logger.info("Encoded: {} entities".format(stop))
return encodings return encodings
@ -62,7 +62,7 @@ class EntityEncoder:
if to_print: if to_print:
logger.info( logger.info(
"Trained entity descriptions on {} ".format(processed) + "Trained entity descriptions on {} ".format(processed) +
"(non-unique) entities across {} ".format(self.epochs) + "(non-unique) descriptions across {} ".format(self.epochs) +
"epochs" "epochs"
) )
logger.info("Final loss: {}".format(loss)) logger.info("Final loss: {}".format(loss))


@ -1,395 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
import logging
import random
import re
import bz2
import json
from functools import partial
from spacy.gold import GoldParse
from bin.wiki_entity_linking import kb_creator
"""
Process Wikipedia interlinks to generate a training dataset for the EL algorithm.
Gold-standard entities are stored in one file in standoff format (by character offset).
"""
ENTITY_FILE = "gold_entities.csv"
logger = logging.getLogger(__name__)
def create_training_examples_and_descriptions(wikipedia_input,
entity_def_input,
description_output,
training_output,
parse_descriptions,
limit=None):
wp_to_id = kb_creator.get_entity_to_id(entity_def_input)
_process_wikipedia_texts(wikipedia_input,
wp_to_id,
description_output,
training_output,
parse_descriptions,
limit)
def _process_wikipedia_texts(wikipedia_input,
wp_to_id,
output,
training_output,
parse_descriptions,
limit=None):
"""
Read the XML wikipedia data to parse out training data:
raw text data + positive instances
"""
title_regex = re.compile(r"(?<=<title>).*(?=</title>)")
id_regex = re.compile(r"(?<=<id>)\d*(?=</id>)")
read_ids = set()
with output.open("a", encoding="utf8") as descr_file, training_output.open("w", encoding="utf8") as entity_file:
if parse_descriptions:
_write_training_description(descr_file, "WD_id", "description")
with bz2.open(wikipedia_input, mode="rb") as file:
article_count = 0
article_text = ""
article_title = None
article_id = None
reading_text = False
reading_revision = False
logger.info("Processed {} articles".format(article_count))
for line in file:
clean_line = line.strip().decode("utf-8")
if clean_line == "<revision>":
reading_revision = True
elif clean_line == "</revision>":
reading_revision = False
# Start reading new page
if clean_line == "<page>":
article_text = ""
article_title = None
article_id = None
# finished reading this page
elif clean_line == "</page>":
if article_id:
clean_text, entities = _process_wp_text(
article_title,
article_text,
wp_to_id
)
if clean_text is not None and entities is not None:
_write_training_entities(entity_file,
article_id,
clean_text,
entities)
if article_title in wp_to_id and parse_descriptions:
description = " ".join(clean_text[:1000].split(" ")[:-1])
_write_training_description(
descr_file,
wp_to_id[article_title],
description
)
article_count += 1
if article_count % 10000 == 0:
logger.info("Processed {} articles".format(article_count))
if limit and article_count >= limit:
break
article_text = ""
article_title = None
article_id = None
reading_text = False
reading_revision = False
# start reading text within a page
if "<text" in clean_line:
reading_text = True
if reading_text:
article_text += " " + clean_line
# stop reading text within a page (we assume a new page doesn't start on the same line)
if "</text" in clean_line:
reading_text = False
# read the ID of this article (outside the revision portion of the document)
if not reading_revision:
ids = id_regex.search(clean_line)
if ids:
article_id = ids[0]
if article_id in read_ids:
logger.info(
"Found duplicate article ID", article_id, clean_line
) # This should never happen ...
read_ids.add(article_id)
# read the title of this article (outside the revision portion of the document)
if not reading_revision:
titles = title_regex.search(clean_line)
if titles:
article_title = titles[0].strip()
logger.info("Finished. Processed {} articles".format(article_count))
text_regex = re.compile(r"(?<=<text xml:space=\"preserve\">).*(?=</text)")
info_regex = re.compile(r"{[^{]*?}")
htlm_regex = re.compile(r"&lt;!--[^-]*--&gt;")
category_regex = re.compile(r"\[\[Category:[^\[]*]]")
file_regex = re.compile(r"\[\[File:[^[\]]+]]")
ref_regex = re.compile(r"&lt;ref.*?&gt;") # non-greedy
ref_2_regex = re.compile(r"&lt;/ref.*?&gt;") # non-greedy
def _process_wp_text(article_title, article_text, wp_to_id):
# ignore meta Wikipedia pages
if (
article_title.startswith("Wikipedia:") or
article_title.startswith("Kategori:")
):
return None, None
# remove the text tags
text_search = text_regex.search(article_text)
if text_search is None:
return None, None
text = text_search.group(0)
# stop processing if this is a redirect page
if text.startswith("#REDIRECT"):
return None, None
# get the raw text without markup etc, keeping only interwiki links
clean_text, entities = _remove_links(_get_clean_wp_text(text), wp_to_id)
return clean_text, entities
def _get_clean_wp_text(article_text):
clean_text = article_text.strip()
# remove bolding & italic markup
clean_text = clean_text.replace("'''", "")
clean_text = clean_text.replace("''", "")
# remove nested {{info}} statements by removing the inner/smallest ones first and iterating
try_again = True
previous_length = len(clean_text)
while try_again:
clean_text = info_regex.sub(
"", clean_text
) # non-greedy match excluding a nested {
if len(clean_text) < previous_length:
try_again = True
else:
try_again = False
previous_length = len(clean_text)
# remove HTML comments
clean_text = htlm_regex.sub("", clean_text)
# remove Category and File statements
clean_text = category_regex.sub("", clean_text)
clean_text = file_regex.sub("", clean_text)
# remove multiple =
while "==" in clean_text:
clean_text = clean_text.replace("==", "=")
clean_text = clean_text.replace(". =", ".")
clean_text = clean_text.replace(" = ", ". ")
clean_text = clean_text.replace("= ", ".")
clean_text = clean_text.replace(" =", "")
# remove refs (non-greedy match)
clean_text = ref_regex.sub("", clean_text)
clean_text = ref_2_regex.sub("", clean_text)
# remove additional wikiformatting
clean_text = re.sub(r"&lt;blockquote&gt;", "", clean_text)
clean_text = re.sub(r"&lt;/blockquote&gt;", "", clean_text)
# change special characters back to normal ones
clean_text = clean_text.replace(r"&lt;", "<")
clean_text = clean_text.replace(r"&gt;", ">")
clean_text = clean_text.replace(r"&quot;", '"')
clean_text = clean_text.replace(r"&amp;nbsp;", " ")
clean_text = clean_text.replace(r"&amp;", "&")
# remove multiple spaces
while " " in clean_text:
clean_text = clean_text.replace(" ", " ")
return clean_text.strip()
def _remove_links(clean_text, wp_to_id):
# read the text char by char to get the right offsets for the interwiki links
entities = []
final_text = ""
open_read = 0
reading_text = True
reading_entity = False
reading_mention = False
reading_special_case = False
entity_buffer = ""
mention_buffer = ""
for index, letter in enumerate(clean_text):
if letter == "[":
open_read += 1
elif letter == "]":
open_read -= 1
elif letter == "|":
if reading_text:
final_text += letter
# switch from reading entity to mention in the [[entity|mention]] pattern
elif reading_entity:
reading_text = False
reading_entity = False
reading_mention = True
else:
reading_special_case = True
else:
if reading_entity:
entity_buffer += letter
elif reading_mention:
mention_buffer += letter
elif reading_text:
final_text += letter
else:
raise ValueError("Not sure at point", clean_text[index - 2: index + 2])
if open_read > 2:
reading_special_case = True
if open_read == 2 and reading_text:
reading_text = False
reading_entity = True
reading_mention = False
# we just finished reading an entity
if open_read == 0 and not reading_text:
if "#" in entity_buffer or entity_buffer.startswith(":"):
reading_special_case = True
# Ignore cases with nested structures like File: handles etc
if not reading_special_case:
if not mention_buffer:
mention_buffer = entity_buffer
start = len(final_text)
end = start + len(mention_buffer)
qid = wp_to_id.get(entity_buffer, None)
if qid:
entities.append((mention_buffer, qid, start, end))
final_text += mention_buffer
entity_buffer = ""
mention_buffer = ""
reading_text = True
reading_entity = False
reading_mention = False
reading_special_case = False
return final_text, entities
def _write_training_description(outputfile, qid, description):
if description is not None:
line = str(qid) + "|" + description + "\n"
outputfile.write(line)
def _write_training_entities(outputfile, article_id, clean_text, entities):
entities_data = [{"alias": ent[0], "entity": ent[1], "start": ent[2], "end": ent[3]} for ent in entities]
line = json.dumps(
{
"article_id": article_id,
"clean_text": clean_text,
"entities": entities_data
},
ensure_ascii=False) + "\n"
outputfile.write(line)
def read_training(nlp, entity_file_path, dev, limit, kb):
""" This method provides training examples that correspond to the entity annotations found by the nlp object.
For training, it will include negative training examples by using the candidate generator,
and it will only keep positive training examples that can be found by using the candidate generator.
For testing, it will include all positive examples only."""
from tqdm import tqdm
data = []
num_entities = 0
get_gold_parse = partial(_get_gold_parse, dev=dev, kb=kb)
logger.info("Reading {} data with limit {}".format('dev' if dev else 'train', limit))
with entity_file_path.open("r", encoding="utf8") as file:
with tqdm(total=limit, leave=False) as pbar:
for i, line in enumerate(file):
example = json.loads(line)
article_id = example["article_id"]
clean_text = example["clean_text"]
entities = example["entities"]
if dev != is_dev(article_id) or len(clean_text) >= 30000:
continue
doc = nlp(clean_text)
gold = get_gold_parse(doc, entities)
if gold and len(gold.links) > 0:
data.append((doc, gold))
num_entities += len(gold.links)
pbar.update(len(gold.links))
if limit and num_entities >= limit:
break
logger.info("Read {} entities in {} articles".format(num_entities, len(data)))
return data
def _get_gold_parse(doc, entities, dev, kb):
gold_entities = {}
tagged_ent_positions = set(
[(ent.start_char, ent.end_char) for ent in doc.ents]
)
for entity in entities:
entity_id = entity["entity"]
alias = entity["alias"]
start = entity["start"]
end = entity["end"]
candidates = kb.get_candidates(alias)
candidate_ids = [
c.entity_ for c in candidates
]
should_add_ent = (
dev or
(
(start, end) in tagged_ent_positions and
entity_id in candidate_ids and
len(candidates) > 1
)
)
if should_add_ent:
value_by_id = {entity_id: 1.0}
if not dev:
random.shuffle(candidate_ids)
value_by_id.update({
kb_id: 0.0
for kb_id in candidate_ids
if kb_id != entity_id
})
gold_entities[(start, end)] = value_by_id
return GoldParse(doc, links=gold_entities)
def is_dev(article_id):
return article_id.endswith("3")


@ -0,0 +1,127 @@
# coding: utf-8
from __future__ import unicode_literals
import sys
import csv
# min() needed to prevent error on windows, cf https://stackoverflow.com/questions/52404416/
csv.field_size_limit(min(sys.maxsize, 2147483646))
""" This class provides reading/writing methods for temp files """
# Entity definition: WP title -> WD ID #
def write_title_to_id(entity_def_output, title_to_id):
with entity_def_output.open("w", encoding="utf8") as id_file:
id_file.write("WP_title" + "|" + "WD_id" + "\n")
for title, qid in title_to_id.items():
id_file.write(title + "|" + str(qid) + "\n")
def read_title_to_id(entity_def_output):
title_to_id = dict()
with entity_def_output.open("r", encoding="utf8") as id_file:
csvreader = csv.reader(id_file, delimiter="|")
# skip header
next(csvreader)
for row in csvreader:
title_to_id[row[0]] = row[1]
return title_to_id
# Entity aliases from WD: WD ID -> WD alias #
def write_id_to_alias(entity_alias_path, id_to_alias):
with entity_alias_path.open("w", encoding="utf8") as alias_file:
alias_file.write("WD_id" + "|" + "alias" + "\n")
for qid, alias_list in id_to_alias.items():
for alias in alias_list:
alias_file.write(str(qid) + "|" + alias + "\n")
def read_id_to_alias(entity_alias_path):
id_to_alias = dict()
with entity_alias_path.open("r", encoding="utf8") as alias_file:
csvreader = csv.reader(alias_file, delimiter="|")
# skip header
next(csvreader)
for row in csvreader:
qid = row[0]
alias = row[1]
alias_list = id_to_alias.get(qid, [])
alias_list.append(alias)
id_to_alias[qid] = alias_list
return id_to_alias
def read_alias_to_id_generator(entity_alias_path):
""" Read (aliases, qid) tuples """
with entity_alias_path.open("r", encoding="utf8") as alias_file:
csvreader = csv.reader(alias_file, delimiter="|")
# skip header
next(csvreader)
for row in csvreader:
qid = row[0]
alias = row[1]
yield alias, qid
# Entity descriptions from WD: WD ID -> WD description #
def write_id_to_descr(entity_descr_output, id_to_descr):
with entity_descr_output.open("w", encoding="utf8") as descr_file:
descr_file.write("WD_id" + "|" + "description" + "\n")
for qid, descr in id_to_descr.items():
descr_file.write(str(qid) + "|" + descr + "\n")
def read_id_to_descr(entity_desc_path):
id_to_desc = dict()
with entity_desc_path.open("r", encoding="utf8") as descr_file:
csvreader = csv.reader(descr_file, delimiter="|")
# skip header
next(csvreader)
for row in csvreader:
id_to_desc[row[0]] = row[1]
return id_to_desc
# Entity counts from WP: WP title -> count #
def write_entity_to_count(prior_prob_input, count_output):
# Write entity counts for quick access later
entity_to_count = dict()
total_count = 0
with prior_prob_input.open("r", encoding="utf8") as prior_file:
# skip header
prior_file.readline()
line = prior_file.readline()
while line:
splits = line.replace("\n", "").split(sep="|")
# alias = splits[0]
count = int(splits[1])
entity = splits[2]
current_count = entity_to_count.get(entity, 0)
entity_to_count[entity] = current_count + count
total_count += count
line = prior_file.readline()
with count_output.open("w", encoding="utf8") as entity_file:
entity_file.write("entity" + "|" + "count" + "\n")
for entity, count in entity_to_count.items():
entity_file.write(entity + "|" + str(count) + "\n")
def read_entity_to_count(count_input):
entity_to_count = dict()
with count_input.open("r", encoding="utf8") as csvfile:
csvreader = csv.reader(csvfile, delimiter="|")
# skip header
next(csvreader)
for row in csvreader:
entity_to_count[row[0]] = int(row[1])
return entity_to_count


@ -0,0 +1,128 @@
# coding: utf8
from __future__ import unicode_literals
# List of meta pages in Wikidata, should be kept out of the Knowledge base
WD_META_ITEMS = [
"Q163875",
"Q191780",
"Q224414",
"Q4167836",
"Q4167410",
"Q4663903",
"Q11266439",
"Q13406463",
"Q15407973",
"Q18616576",
"Q19887878",
"Q22808320",
"Q23894233",
"Q33120876",
"Q42104522",
"Q47460393",
"Q64875536",
"Q66480449",
]
# TODO: add more cases from non-English WP's
# List of prefixes that refer to Wikipedia "file" pages
WP_FILE_NAMESPACE = ["Bestand", "File"]
# List of prefixes that refer to Wikipedia "category" pages
WP_CATEGORY_NAMESPACE = ["Kategori", "Category", "Categorie"]
# List of prefixes that refer to Wikipedia "meta" pages
# these will/should be matched ignoring case
WP_META_NAMESPACE = (
WP_FILE_NAMESPACE
+ WP_CATEGORY_NAMESPACE
+ [
"b",
"betawikiversity",
"Book",
"c",
"Commons",
"d",
"dbdump",
"download",
"Draft",
"Education",
"Foundation",
"Gadget",
"Gadget definition",
"Gebruiker",
"gerrit",
"Help",
"Image",
"Incubator",
"m",
"mail",
"mailarchive",
"media",
"MediaWiki",
"MediaWiki talk",
"Mediawikiwiki",
"MediaZilla",
"Meta",
"Metawikipedia",
"Module",
"mw",
"n",
"nost",
"oldwikisource",
"otrs",
"OTRSwiki",
"Overleg gebruiker",
"outreach",
"outreachwiki",
"Portal",
"phab",
"Phabricator",
"Project",
"q",
"quality",
"rev",
"s",
"spcom",
"Special",
"species",
"Strategy",
"sulutil",
"svn",
"Talk",
"Template",
"Template talk",
"Testwiki",
"ticket",
"TimedText",
"Toollabs",
"tools",
"tswiki",
"User",
"User talk",
"v",
"voy",
"w",
"Wikibooks",
"Wikidata",
"wikiHow",
"Wikinvest",
"wikilivres",
"Wikimedia",
"Wikinews",
"Wikipedia",
"Wikipedia talk",
"Wikiquote",
"Wikisource",
"Wikispecies",
"Wikitech",
"Wikiversity",
"Wikivoyage",
"wikt",
"wiktionary",
"wmf",
"wmania",
"WP",
]
)


@ -18,11 +18,12 @@ from pathlib import Path
import plac import plac
from bin.wiki_entity_linking import wikipedia_processor as wp, wikidata_processor as wd from bin.wiki_entity_linking import wikipedia_processor as wp, wikidata_processor as wd
from bin.wiki_entity_linking import wiki_io as io
from bin.wiki_entity_linking import kb_creator from bin.wiki_entity_linking import kb_creator
from bin.wiki_entity_linking import training_set_creator
from bin.wiki_entity_linking import TRAINING_DATA_FILE, KB_FILE, ENTITY_DESCR_PATH, KB_MODEL_DIR, LOG_FORMAT from bin.wiki_entity_linking import TRAINING_DATA_FILE, KB_FILE, ENTITY_DESCR_PATH, KB_MODEL_DIR, LOG_FORMAT
from bin.wiki_entity_linking import ENTITY_FREQ_PATH, PRIOR_PROB_PATH, ENTITY_DEFS_PATH from bin.wiki_entity_linking import ENTITY_FREQ_PATH, PRIOR_PROB_PATH, ENTITY_DEFS_PATH, ENTITY_ALIAS_PATH
import spacy import spacy
from bin.wiki_entity_linking.kb_creator import read_kb
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@ -39,9 +40,11 @@ logger = logging.getLogger(__name__)
loc_prior_prob=("Location to file with prior probabilities", "option", "p", Path), loc_prior_prob=("Location to file with prior probabilities", "option", "p", Path),
loc_entity_defs=("Location to file with entity definitions", "option", "d", Path), loc_entity_defs=("Location to file with entity definitions", "option", "d", Path),
loc_entity_desc=("Location to file with entity descriptions", "option", "s", Path), loc_entity_desc=("Location to file with entity descriptions", "option", "s", Path),
descriptions_from_wikipedia=("Flag for using wp descriptions not wd", "flag", "wp"), descr_from_wp=("Flag for using wp descriptions not wd", "flag", "wp"),
limit=("Optional threshold to limit lines read from dumps", "option", "l", int), limit_prior=("Threshold to limit lines read from WP for prior probabilities", "option", "lp", int),
lang=("Optional language for which to get wikidata titles. Defaults to 'en'", "option", "la", str), limit_train=("Threshold to limit lines read from WP for training set", "option", "lt", int),
limit_wd=("Threshold to limit lines read from WD", "option", "lw", int),
lang=("Optional language for which to get Wikidata titles. Defaults to 'en'", "option", "la", str),
) )
def main( def main(
wd_json, wd_json,
@ -54,13 +57,16 @@ def main(
entity_vector_length=64, entity_vector_length=64,
loc_prior_prob=None, loc_prior_prob=None,
loc_entity_defs=None, loc_entity_defs=None,
loc_entity_alias=None,
loc_entity_desc=None, loc_entity_desc=None,
descriptions_from_wikipedia=False, descr_from_wp=False,
limit=None, limit_prior=None,
limit_train=None,
limit_wd=None,
lang="en", lang="en",
): ):
entity_defs_path = loc_entity_defs if loc_entity_defs else output_dir / ENTITY_DEFS_PATH entity_defs_path = loc_entity_defs if loc_entity_defs else output_dir / ENTITY_DEFS_PATH
entity_alias_path = loc_entity_alias if loc_entity_alias else output_dir / ENTITY_ALIAS_PATH
entity_descr_path = loc_entity_desc if loc_entity_desc else output_dir / ENTITY_DESCR_PATH entity_descr_path = loc_entity_desc if loc_entity_desc else output_dir / ENTITY_DESCR_PATH
entity_freq_path = output_dir / ENTITY_FREQ_PATH entity_freq_path = output_dir / ENTITY_FREQ_PATH
prior_prob_path = loc_prior_prob if loc_prior_prob else output_dir / PRIOR_PROB_PATH prior_prob_path = loc_prior_prob if loc_prior_prob else output_dir / PRIOR_PROB_PATH
@ -69,15 +75,12 @@ def main(
logger.info("Creating KB with Wikipedia and WikiData") logger.info("Creating KB with Wikipedia and WikiData")
if limit is not None:
logger.warning("Warning: reading only {} lines of Wikipedia/Wikidata dumps.".format(limit))
# STEP 0: set up IO # STEP 0: set up IO
if not output_dir.exists(): if not output_dir.exists():
output_dir.mkdir(parents=True) output_dir.mkdir(parents=True)
# STEP 1: create the NLP object # STEP 1: Load the NLP object
logger.info("STEP 1: Loading model {}".format(model)) logger.info("STEP 1: Loading NLP model {}".format(model))
nlp = spacy.load(model) nlp = spacy.load(model)
# check the length of the nlp vectors # check the length of the nlp vectors
@ -90,62 +93,83 @@ def main(
# STEP 2: create prior probabilities from WP # STEP 2: create prior probabilities from WP
if not prior_prob_path.exists(): if not prior_prob_path.exists():
# It takes about 2h to process 1000M lines of Wikipedia XML dump # It takes about 2h to process 1000M lines of Wikipedia XML dump
logger.info("STEP 2: writing prior probabilities to {}".format(prior_prob_path)) logger.info("STEP 2: Writing prior probabilities to {}".format(prior_prob_path))
wp.read_prior_probs(wp_xml, prior_prob_path, limit=limit) if limit_prior is not None:
logger.info("STEP 2: reading prior probabilities from {}".format(prior_prob_path)) logger.warning("Warning: reading only {} lines of Wikipedia dump".format(limit_prior))
wp.read_prior_probs(wp_xml, prior_prob_path, limit=limit_prior)
else:
logger.info("STEP 2: Reading prior probabilities from {}".format(prior_prob_path))
# STEP 3: deduce entity frequencies from WP (takes only a few minutes) # STEP 3: calculate entity frequencies
logger.info("STEP 3: calculating entity frequencies") if not entity_freq_path.exists():
wp.write_entity_counts(prior_prob_path, entity_freq_path, to_print=False) logger.info("STEP 3: Calculating and writing entity frequencies to {}".format(entity_freq_path))
io.write_entity_to_count(prior_prob_path, entity_freq_path)
else:
logger.info("STEP 3: Reading entity frequencies from {}".format(entity_freq_path))
# STEP 4: reading definitions and (possibly) descriptions from WikiData or from file # STEP 4: reading definitions and (possibly) descriptions from WikiData or from file
message = " and descriptions" if not descriptions_from_wikipedia else "" if (not entity_defs_path.exists()) or (not descr_from_wp and not entity_descr_path.exists()):
if (not entity_defs_path.exists()) or (not descriptions_from_wikipedia and not entity_descr_path.exists()):
# It takes about 10h to process 55M lines of Wikidata JSON dump # It takes about 10h to process 55M lines of Wikidata JSON dump
logger.info("STEP 4: parsing wikidata for entity definitions" + message) logger.info("STEP 4: Parsing and writing Wikidata entity definitions to {}".format(entity_defs_path))
title_to_id, id_to_descr = wd.read_wikidata_entities_json( if limit_wd is not None:
logger.warning("Warning: reading only {} lines of Wikidata dump".format(limit_wd))
title_to_id, id_to_descr, id_to_alias = wd.read_wikidata_entities_json(
wd_json, wd_json,
limit, limit_wd,
to_print=False, to_print=False,
lang=lang, lang=lang,
parse_descriptions=(not descriptions_from_wikipedia), parse_descr=(not descr_from_wp),
) )
wd.write_entity_files(entity_defs_path, title_to_id) io.write_title_to_id(entity_defs_path, title_to_id)
if not descriptions_from_wikipedia:
wd.write_entity_description_files(entity_descr_path, id_to_descr)
logger.info("STEP 4: read entity definitions" + message)
# STEP 5: Getting gold entities from wikipedia logger.info("STEP 4b: Writing Wikidata entity aliases to {}".format(entity_alias_path))
message = " and descriptions" if descriptions_from_wikipedia else "" io.write_id_to_alias(entity_alias_path, id_to_alias)
if (not training_entities_path.exists()) or (descriptions_from_wikipedia and not entity_descr_path.exists()):
logger.info("STEP 5: parsing wikipedia for gold entities" + message) if not descr_from_wp:
training_set_creator.create_training_examples_and_descriptions( logger.info("STEP 4c: Writing Wikidata entity descriptions to {}".format(entity_descr_path))
wp_xml, io.write_id_to_descr(entity_descr_path, id_to_descr)
entity_defs_path, else:
entity_descr_path, logger.info("STEP 4: Reading entity definitions from {}".format(entity_defs_path))
training_entities_path, logger.info("STEP 4b: Reading entity aliases from {}".format(entity_alias_path))
parse_descriptions=descriptions_from_wikipedia, if not descr_from_wp:
limit=limit, logger.info("STEP 4c: Reading entity descriptions from {}".format(entity_descr_path))
)
logger.info("STEP 5: read gold entities" + message) # STEP 5: Getting gold entities from Wikipedia
if (not training_entities_path.exists()) or (descr_from_wp and not entity_descr_path.exists()):
logger.info("STEP 5: Parsing and writing Wikipedia gold entities to {}".format(training_entities_path))
if limit_train is not None:
logger.warning("Warning: reading only {} lines of Wikipedia dump".format(limit_train))
wp.create_training_and_desc(wp_xml, entity_defs_path, entity_descr_path,
training_entities_path, descr_from_wp, limit_train)
if descr_from_wp:
logger.info("STEP 5b: Parsing and writing Wikipedia descriptions to {}".format(entity_descr_path))
else:
logger.info("STEP 5: Reading gold entities from {}".format(training_entities_path))
if descr_from_wp:
logger.info("STEP 5b: Reading entity descriptions from {}".format(entity_descr_path))
# STEP 6: creating the actual KB # STEP 6: creating the actual KB
# It takes ca. 30 minutes to pretrain the entity embeddings # It takes ca. 30 minutes to pretrain the entity embeddings
logger.info("STEP 6: creating the KB at {}".format(kb_path)) if not kb_path.exists():
kb = kb_creator.create_kb( logger.info("STEP 6: Creating the KB at {}".format(kb_path))
nlp=nlp, kb = kb_creator.create_kb(
max_entities_per_alias=max_per_alias, nlp=nlp,
min_entity_freq=min_freq, max_entities_per_alias=max_per_alias,
min_occ=min_pair, min_entity_freq=min_freq,
entity_def_input=entity_defs_path, min_occ=min_pair,
entity_descr_path=entity_descr_path, entity_def_path=entity_defs_path,
count_input=entity_freq_path, entity_descr_path=entity_descr_path,
prior_prob_input=prior_prob_path, entity_alias_path=entity_alias_path,
entity_vector_length=entity_vector_length, entity_freq_path=entity_freq_path,
) prior_prob_path=prior_prob_path,
entity_vector_length=entity_vector_length,
kb.dump(kb_path) )
nlp.to_disk(output_dir / KB_MODEL_DIR) kb.dump(kb_path)
logger.info("kb entities: {}".format(kb.get_size_entities()))
logger.info("kb aliases: {}".format(kb.get_size_aliases()))
nlp.to_disk(output_dir / KB_MODEL_DIR)
else:
logger.info("STEP 6: KB already exists at {}".format(kb_path))
logger.info("Done!") logger.info("Done!")


@ -1,40 +1,52 @@
# coding: utf-8 # coding: utf-8
from __future__ import unicode_literals from __future__ import unicode_literals
import gzip import bz2
import json import json
import logging import logging
import datetime
from bin.wiki_entity_linking.wiki_namespaces import WD_META_ITEMS
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
def read_wikidata_entities_json(wikidata_file, limit=None, to_print=False, lang="en", parse_descriptions=True): def read_wikidata_entities_json(wikidata_file, limit=None, to_print=False, lang="en", parse_descr=True):
# Read the JSON wiki data and parse out the entities. Takes about 7u30 to parse 55M lines. # Read the JSON wiki data and parse out the entities. Takes about 7-10h to parse 55M lines.
# get latest-all.json.bz2 from https://dumps.wikimedia.org/wikidatawiki/entities/ # get latest-all.json.bz2 from https://dumps.wikimedia.org/wikidatawiki/entities/
site_filter = '{}wiki'.format(lang) site_filter = '{}wiki'.format(lang)
# properties filter (currently disabled to get ALL data) # filter: currently defined as OR: one hit suffices to be removed from further processing
prop_filter = dict() exclude_list = WD_META_ITEMS
# prop_filter = {'P31': {'Q5', 'Q15632617'}} # currently defined as OR: one property suffices to be selected
# punctuation
exclude_list.extend(["Q1383557", "Q10617810"])
# letters etc
exclude_list.extend(["Q188725", "Q19776628", "Q3841820", "Q17907810", "Q9788", "Q9398093"])
neg_prop_filter = {
'P31': exclude_list, # instance of
'P279': exclude_list # subclass
}
title_to_id = dict() title_to_id = dict()
id_to_descr = dict() id_to_descr = dict()
id_to_alias = dict()
# parse appropriate fields - depending on what we need in the KB # parse appropriate fields - depending on what we need in the KB
parse_properties = False parse_properties = False
parse_sitelinks = True parse_sitelinks = True
parse_labels = False parse_labels = False
parse_aliases = False parse_aliases = True
parse_claims = False parse_claims = True
with gzip.open(wikidata_file, mode='rb') as file: with bz2.open(wikidata_file, mode='rb') as file:
for cnt, line in enumerate(file): for cnt, line in enumerate(file):
if limit and cnt >= limit: if limit and cnt >= limit:
break break
if cnt % 500000 == 0: if cnt % 500000 == 0 and cnt > 0:
logger.info("processed {} lines of WikiData dump".format(cnt)) logger.info("processed {} lines of WikiData JSON dump".format(cnt))
clean_line = line.strip() clean_line = line.strip()
if clean_line.endswith(b","): if clean_line.endswith(b","):
clean_line = clean_line[:-1] clean_line = clean_line[:-1]
@ -43,13 +55,11 @@ def read_wikidata_entities_json(wikidata_file, limit=None, to_print=False, lang=
entry_type = obj["type"] entry_type = obj["type"]
if entry_type == "item": if entry_type == "item":
# filtering records on their properties (currently disabled to get ALL data)
# keep = False
keep = True keep = True
claims = obj["claims"] claims = obj["claims"]
if parse_claims: if parse_claims:
for prop, value_set in prop_filter.items(): for prop, value_set in neg_prop_filter.items():
claim_property = claims.get(prop, None) claim_property = claims.get(prop, None)
if claim_property: if claim_property:
for cp in claim_property: for cp in claim_property:
@ -61,7 +71,7 @@ def read_wikidata_entities_json(wikidata_file, limit=None, to_print=False, lang=
) )
cp_rank = cp["rank"] cp_rank = cp["rank"]
if cp_rank != "deprecated" and cp_id in value_set: if cp_rank != "deprecated" and cp_id in value_set:
keep = True keep = False
if keep: if keep:
unique_id = obj["id"] unique_id = obj["id"]
@ -108,7 +118,7 @@ def read_wikidata_entities_json(wikidata_file, limit=None, to_print=False, lang=
"label (" + lang + "):", lang_label["value"] "label (" + lang + "):", lang_label["value"]
) )
if found_link and parse_descriptions: if found_link and parse_descr:
descriptions = obj["descriptions"] descriptions = obj["descriptions"]
if descriptions: if descriptions:
lang_descr = descriptions.get(lang, None) lang_descr = descriptions.get(lang, None)
@ -130,22 +140,15 @@ def read_wikidata_entities_json(wikidata_file, limit=None, to_print=False, lang=
print( print(
"alias (" + lang + "):", item["value"] "alias (" + lang + "):", item["value"]
) )
alias_list = id_to_alias.get(unique_id, [])
alias_list.append(item["value"])
id_to_alias[unique_id] = alias_list
if to_print: if to_print:
print() print()
return title_to_id, id_to_descr # log final number of lines processed
logger.info("Finished. Processed {} lines of WikiData JSON dump".format(cnt))
return title_to_id, id_to_descr, id_to_alias
def write_entity_files(entity_def_output, title_to_id):
with entity_def_output.open("w", encoding="utf8") as id_file:
id_file.write("WP_title" + "|" + "WD_id" + "\n")
for title, qid in title_to_id.items():
id_file.write(title + "|" + str(qid) + "\n")
def write_entity_description_files(entity_descr_output, id_to_descr):
with entity_descr_output.open("w", encoding="utf8") as descr_file:
descr_file.write("WD_id" + "|" + "description" + "\n")
for qid, descr in id_to_descr.items():
descr_file.write(str(qid) + "|" + descr + "\n")


@ -6,19 +6,19 @@ as created by the script `wikidata_create_kb`.
For the Wikipedia dump: get enwiki-latest-pages-articles-multistream.xml.bz2 For the Wikipedia dump: get enwiki-latest-pages-articles-multistream.xml.bz2
from https://dumps.wikimedia.org/enwiki/latest/ from https://dumps.wikimedia.org/enwiki/latest/
""" """
from __future__ import unicode_literals from __future__ import unicode_literals
import random import random
import logging import logging
import spacy
from pathlib import Path from pathlib import Path
import plac import plac
from bin.wiki_entity_linking import training_set_creator from bin.wiki_entity_linking import wikipedia_processor
from bin.wiki_entity_linking import TRAINING_DATA_FILE, KB_MODEL_DIR, KB_FILE, LOG_FORMAT, OUTPUT_MODEL_DIR from bin.wiki_entity_linking import TRAINING_DATA_FILE, KB_MODEL_DIR, KB_FILE, LOG_FORMAT, OUTPUT_MODEL_DIR
from bin.wiki_entity_linking.entity_linker_evaluation import measure_performance, measure_baselines from bin.wiki_entity_linking.entity_linker_evaluation import measure_performance
from bin.wiki_entity_linking.kb_creator import read_nlp_kb from bin.wiki_entity_linking.kb_creator import read_kb
from spacy.util import minibatch, compounding from spacy.util import minibatch, compounding
@ -35,6 +35,7 @@ logger = logging.getLogger(__name__)
l2=("L2 regularization", "option", "r", float), l2=("L2 regularization", "option", "r", float),
train_inst=("# training instances (default 90% of all)", "option", "t", int), train_inst=("# training instances (default 90% of all)", "option", "t", int),
dev_inst=("# test instances (default 10% of all)", "option", "d", int), dev_inst=("# test instances (default 10% of all)", "option", "d", int),
labels_discard=("NER labels to discard (default None)", "option", "l", str),
) )
def main( def main(
dir_kb, dir_kb,
@ -46,13 +47,14 @@ def main(
l2=1e-6, l2=1e-6,
train_inst=None, train_inst=None,
dev_inst=None, dev_inst=None,
labels_discard=None
): ):
logger.info("Creating Entity Linker with Wikipedia and WikiData") logger.info("Creating Entity Linker with Wikipedia and WikiData")
output_dir = Path(output_dir) if output_dir else dir_kb output_dir = Path(output_dir) if output_dir else dir_kb
training_path = loc_training if loc_training else output_dir / TRAINING_DATA_FILE training_path = loc_training if loc_training else dir_kb / TRAINING_DATA_FILE
nlp_dir = dir_kb / KB_MODEL_DIR nlp_dir = dir_kb / KB_MODEL_DIR
kb_path = output_dir / KB_FILE kb_path = dir_kb / KB_FILE
nlp_output_dir = output_dir / OUTPUT_MODEL_DIR nlp_output_dir = output_dir / OUTPUT_MODEL_DIR
# STEP 0: set up IO # STEP 0: set up IO
@ -60,38 +62,49 @@ def main(
output_dir.mkdir() output_dir.mkdir()
# STEP 1 : load the NLP object # STEP 1 : load the NLP object
logger.info("STEP 1: loading model from {}".format(nlp_dir)) logger.info("STEP 1a: Loading model from {}".format(nlp_dir))
nlp, kb = read_nlp_kb(nlp_dir, kb_path) nlp = spacy.load(nlp_dir)
logger.info("STEP 1b: Loading KB from {}".format(kb_path))
kb = read_kb(nlp, kb_path)
# check that there is a NER component in the pipeline # check that there is a NER component in the pipeline
if "ner" not in nlp.pipe_names: if "ner" not in nlp.pipe_names:
raise ValueError("The `nlp` object should have a pretrained `ner` component.") raise ValueError("The `nlp` object should have a pretrained `ner` component.")
# STEP 2: create a training dataset from WP # STEP 2: read the training dataset previously created from WP
logger.info("STEP 2: reading training dataset from {}".format(training_path)) logger.info("STEP 2: Reading training dataset from {}".format(training_path))
train_data = training_set_creator.read_training( if labels_discard:
labels_discard = [x.strip() for x in labels_discard.split(",")]
logger.info("Discarding {} NER types: {}".format(len(labels_discard), labels_discard))
else:
labels_discard = []
train_data = wikipedia_processor.read_training(
nlp=nlp, nlp=nlp,
entity_file_path=training_path, entity_file_path=training_path,
dev=False, dev=False,
limit=train_inst, limit=train_inst,
kb=kb, kb=kb,
labels_discard=labels_discard
) )
# for testing, get all pos instances, whether or not they are in the kb # for testing, get all pos instances (independently of KB)
dev_data = training_set_creator.read_training( dev_data = wikipedia_processor.read_training(
nlp=nlp, nlp=nlp,
entity_file_path=training_path, entity_file_path=training_path,
dev=True, dev=True,
limit=dev_inst, limit=dev_inst,
kb=kb, kb=None,
labels_discard=labels_discard
) )
# STEP 3: create and train the entity linking pipe # STEP 3: create and train an entity linking pipe
logger.info("STEP 3: training Entity Linking pipe") logger.info("STEP 3: Creating and training an Entity Linking pipe")
el_pipe = nlp.create_pipe( el_pipe = nlp.create_pipe(
name="entity_linker", config={"pretrained_vectors": nlp.vocab.vectors.name} name="entity_linker", config={"pretrained_vectors": nlp.vocab.vectors.name,
"labels_discard": labels_discard}
) )
el_pipe.set_kb(kb) el_pipe.set_kb(kb)
nlp.add_pipe(el_pipe, last=True) nlp.add_pipe(el_pipe, last=True)
@ -105,14 +118,9 @@ def main(
logger.info("Training on {} articles".format(len(train_data))) logger.info("Training on {} articles".format(len(train_data)))
logger.info("Dev testing on {} articles".format(len(dev_data))) logger.info("Dev testing on {} articles".format(len(dev_data)))
dev_baseline_accuracies = measure_baselines( # baseline performance on dev data
dev_data, kb
)
logger.info("Dev Baseline Accuracies:") logger.info("Dev Baseline Accuracies:")
logger.info(dev_baseline_accuracies.report_accuracy("random")) measure_performance(dev_data, kb, el_pipe, baseline=True, context=False)
logger.info(dev_baseline_accuracies.report_accuracy("prior"))
logger.info(dev_baseline_accuracies.report_accuracy("oracle"))
for itn in range(epochs): for itn in range(epochs):
random.shuffle(train_data) random.shuffle(train_data)
@ -136,18 +144,18 @@ def main(
logger.error("Error updating batch:" + str(e)) logger.error("Error updating batch:" + str(e))
if batchnr > 0: if batchnr > 0:
logging.info("Epoch {}, train loss {}".format(itn, round(losses["entity_linker"] / batchnr, 2))) logging.info("Epoch {}, train loss {}".format(itn, round(losses["entity_linker"] / batchnr, 2)))
measure_performance(dev_data, kb, el_pipe) measure_performance(dev_data, kb, el_pipe, baseline=False, context=True)
# STEP 4: measure the performance of our trained pipe on an independent dev set # STEP 4: measure the performance of our trained pipe on an independent dev set
logger.info("STEP 4: performance measurement of Entity Linking pipe") logger.info("STEP 4: Final performance measurement of Entity Linking pipe")
measure_performance(dev_data, kb, el_pipe) measure_performance(dev_data, kb, el_pipe)
# STEP 5: apply the EL pipe on a toy example # STEP 5: apply the EL pipe on a toy example
logger.info("STEP 5: applying Entity Linking to toy example") logger.info("STEP 5: Applying Entity Linking to toy example")
run_el_toy_example(nlp=nlp) run_el_toy_example(nlp=nlp)
if output_dir: if output_dir:
# STEP 6: write the NLP pipeline (including entity linker) to file # STEP 6: write the NLP pipeline (now including an EL model) to file
logger.info("STEP 6: Writing trained NLP to {}".format(nlp_output_dir)) logger.info("STEP 6: Writing trained NLP to {}".format(nlp_output_dir))
nlp.to_disk(nlp_output_dir) nlp.to_disk(nlp_output_dir)


@ -3,147 +3,104 @@ from __future__ import unicode_literals
import re import re
import bz2 import bz2
import csv
import datetime
import logging import logging
import random
import json
from bin.wiki_entity_linking import LOG_FORMAT from functools import partial
from spacy.gold import GoldParse
from bin.wiki_entity_linking import wiki_io as io
from bin.wiki_entity_linking.wiki_namespaces import (
WP_META_NAMESPACE,
WP_FILE_NAMESPACE,
WP_CATEGORY_NAMESPACE,
)
""" """
Process a Wikipedia dump to calculate entity frequencies and prior probabilities in combination with certain mentions. Process a Wikipedia dump to calculate entity frequencies and prior probabilities in combination with certain mentions.
Write these results to file for downstream KB and training data generation. Write these results to file for downstream KB and training data generation.
Process Wikipedia interlinks to generate a training dataset for the EL algorithm.
""" """
ENTITY_FILE = "gold_entities.csv"
map_alias_to_link = dict() map_alias_to_link = dict()
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
title_regex = re.compile(r"(?<=<title>).*(?=</title>)")
# these will/should be matched ignoring case id_regex = re.compile(r"(?<=<id>)\d*(?=</id>)")
wiki_namespaces = [ text_regex = re.compile(r"(?<=<text xml:space=\"preserve\">).*(?=</text)")
"b", info_regex = re.compile(r"{[^{]*?}")
"betawikiversity", html_regex = re.compile(r"&lt;!--[^-]*--&gt;")
"Book", ref_regex = re.compile(r"&lt;ref.*?&gt;") # non-greedy
"c", ref_2_regex = re.compile(r"&lt;/ref.*?&gt;") # non-greedy
"Category",
"Commons",
"d",
"dbdump",
"download",
"Draft",
"Education",
"Foundation",
"Gadget",
"Gadget definition",
"gerrit",
"File",
"Help",
"Image",
"Incubator",
"m",
"mail",
"mailarchive",
"media",
"MediaWiki",
"MediaWiki talk",
"Mediawikiwiki",
"MediaZilla",
"Meta",
"Metawikipedia",
"Module",
"mw",
"n",
"nost",
"oldwikisource",
"outreach",
"outreachwiki",
"otrs",
"OTRSwiki",
"Portal",
"phab",
"Phabricator",
"Project",
"q",
"quality",
"rev",
"s",
"spcom",
"Special",
"species",
"Strategy",
"sulutil",
"svn",
"Talk",
"Template",
"Template talk",
"Testwiki",
"ticket",
"TimedText",
"Toollabs",
"tools",
"tswiki",
"User",
"User talk",
"v",
"voy",
"w",
"Wikibooks",
"Wikidata",
"wikiHow",
"Wikinvest",
"wikilivres",
"Wikimedia",
"Wikinews",
"Wikipedia",
"Wikipedia talk",
"Wikiquote",
"Wikisource",
"Wikispecies",
"Wikitech",
"Wikiversity",
"Wikivoyage",
"wikt",
"wiktionary",
"wmf",
"wmania",
"WP",
]
# find the links # find the links
link_regex = re.compile(r"\[\[[^\[\]]*\]\]") link_regex = re.compile(r"\[\[[^\[\]]*\]\]")
# match on interwiki links, e.g. `en:` or `:fr:` # match on interwiki links, e.g. `en:` or `:fr:`
ns_regex = r":?" + "[a-z][a-z]" + ":" ns_regex = r":?" + "[a-z][a-z]" + ":"
# match on Namespace: optionally preceded by a : # match on Namespace: optionally preceded by a :
for ns in wiki_namespaces: for ns in WP_META_NAMESPACE:
ns_regex += "|" + ":?" + ns + ":" ns_regex += "|" + ":?" + ns + ":"
ns_regex = re.compile(ns_regex, re.IGNORECASE) ns_regex = re.compile(ns_regex, re.IGNORECASE)
files = r""
for f in WP_FILE_NAMESPACE:
files += "\[\[" + f + ":[^[\]]+]]" + "|"
files = files[0 : len(files) - 1]
file_regex = re.compile(files)
cats = r""
for c in WP_CATEGORY_NAMESPACE:
cats += "\[\[" + c + ":[^\[]*]]" + "|"
cats = cats[0 : len(cats) - 1]
category_regex = re.compile(cats)
def read_prior_probs(wikipedia_input, prior_prob_output, limit=None): def read_prior_probs(wikipedia_input, prior_prob_output, limit=None):
""" """
Read the XML wikipedia data and parse out intra-wiki links to estimate prior probabilities. Read the XML wikipedia data and parse out intra-wiki links to estimate prior probabilities.
The full file takes about 2h to parse 1100M lines. The full file takes about 2-3h to parse 1100M lines.
It works relatively fast because it runs line by line, irrespective of which article the intrawiki link is from. It works relatively fast because it runs line by line, irrespective of which article the intrawiki link is from,
though dev test articles are excluded in order not to get an artificially strong baseline.
""" """
cnt = 0
read_id = False
current_article_id = None
with bz2.open(wikipedia_input, mode="rb") as file: with bz2.open(wikipedia_input, mode="rb") as file:
line = file.readline() line = file.readline()
cnt = 0
while line and (not limit or cnt < limit): while line and (not limit or cnt < limit):
if cnt % 25000000 == 0: if cnt % 25000000 == 0 and cnt > 0:
logger.info("processed {} lines of Wikipedia XML dump".format(cnt)) logger.info("processed {} lines of Wikipedia XML dump".format(cnt))
clean_line = line.strip().decode("utf-8") clean_line = line.strip().decode("utf-8")
aliases, entities, normalizations = get_wp_links(clean_line) # we attempt to read the article's ID (but not the revision or contributor ID)
for alias, entity, norm in zip(aliases, entities, normalizations): if "<revision>" in clean_line or "<contributor>" in clean_line:
_store_alias(alias, entity, normalize_alias=norm, normalize_entity=True) read_id = False
_store_alias(alias, entity, normalize_alias=norm, normalize_entity=True) if "<page>" in clean_line:
read_id = True
if read_id:
ids = id_regex.search(clean_line)
if ids:
current_article_id = ids[0]
# only processing prior probabilities from true training (non-dev) articles
if not is_dev(current_article_id):
aliases, entities, normalizations = get_wp_links(clean_line)
for alias, entity, norm in zip(aliases, entities, normalizations):
_store_alias(
alias, entity, normalize_alias=norm, normalize_entity=True
)
line = file.readline() line = file.readline()
cnt += 1 cnt += 1
logger.info("processed {} lines of Wikipedia XML dump".format(cnt)) logger.info("processed {} lines of Wikipedia XML dump".format(cnt))
logger.info("Finished. processed {} lines of Wikipedia XML dump".format(cnt))
# write all aliases and their entities and count occurrences to file # write all aliases and their entities and count occurrences to file
with prior_prob_output.open("w", encoding="utf8") as outputfile: with prior_prob_output.open("w", encoding="utf8") as outputfile:
@ -182,7 +139,7 @@ def get_wp_links(text):
match = match[2:][:-2].replace("_", " ").strip() match = match[2:][:-2].replace("_", " ").strip()
if ns_regex.match(match): if ns_regex.match(match):
pass # ignore namespaces at the beginning of the string pass # ignore the entity if it points to a "meta" page
# this is a simple [[link]], with the alias the same as the mention # this is a simple [[link]], with the alias the same as the mention
elif "|" not in match: elif "|" not in match:
@ -218,47 +175,382 @@ def _capitalize_first(text):
return result return result
def write_entity_counts(prior_prob_input, count_output, to_print=False): def create_training_and_desc(
# Write entity counts for quick access later wp_input, def_input, desc_output, training_output, parse_desc, limit=None
entity_to_count = dict() ):
total_count = 0 wp_to_id = io.read_title_to_id(def_input)
_process_wikipedia_texts(
with prior_prob_input.open("r", encoding="utf8") as prior_file: wp_input, wp_to_id, desc_output, training_output, parse_desc, limit
# skip header )
prior_file.readline()
line = prior_file.readline()
while line:
splits = line.replace("\n", "").split(sep="|")
# alias = splits[0]
count = int(splits[1])
entity = splits[2]
current_count = entity_to_count.get(entity, 0)
entity_to_count[entity] = current_count + count
total_count += count
line = prior_file.readline()
with count_output.open("w", encoding="utf8") as entity_file:
entity_file.write("entity" + "|" + "count" + "\n")
for entity, count in entity_to_count.items():
entity_file.write(entity + "|" + str(count) + "\n")
if to_print:
for entity, count in entity_to_count.items():
print("Entity count:", entity, count)
print("Total count:", total_count)
def get_all_frequencies(count_input): def _process_wikipedia_texts(
entity_to_count = dict() wikipedia_input, wp_to_id, output, training_output, parse_descriptions, limit=None
with count_input.open("r", encoding="utf8") as csvfile: ):
csvreader = csv.reader(csvfile, delimiter="|") """
# skip header Read the XML wikipedia data to parse out training data:
next(csvreader) raw text data + positive instances
for row in csvreader: """
entity_to_count[row[0]] = int(row[1])
return entity_to_count read_ids = set()
with output.open("a", encoding="utf8") as descr_file, training_output.open(
"w", encoding="utf8"
) as entity_file:
if parse_descriptions:
_write_training_description(descr_file, "WD_id", "description")
with bz2.open(wikipedia_input, mode="rb") as file:
article_count = 0
article_text = ""
article_title = None
article_id = None
reading_text = False
reading_revision = False
for line in file:
clean_line = line.strip().decode("utf-8")
if clean_line == "<revision>":
reading_revision = True
elif clean_line == "</revision>":
reading_revision = False
# Start reading new page
if clean_line == "<page>":
article_text = ""
article_title = None
article_id = None
# finished reading this page
elif clean_line == "</page>":
if article_id:
clean_text, entities = _process_wp_text(
article_title, article_text, wp_to_id
)
if clean_text is not None and entities is not None:
_write_training_entities(
entity_file, article_id, clean_text, entities
)
if article_title in wp_to_id and parse_descriptions:
description = " ".join(
clean_text[:1000].split(" ")[:-1]
)
_write_training_description(
descr_file, wp_to_id[article_title], description
)
article_count += 1
if article_count % 10000 == 0 and article_count > 0:
logger.info(
"Processed {} articles".format(article_count)
)
if limit and article_count >= limit:
break
article_text = ""
article_title = None
article_id = None
reading_text = False
reading_revision = False
# start reading text within a page
if "<text" in clean_line:
reading_text = True
if reading_text:
article_text += " " + clean_line
# stop reading text within a page (we assume a new page doesn't start on the same line)
if "</text" in clean_line:
reading_text = False
# read the ID of this article (outside the revision portion of the document)
if not reading_revision:
ids = id_regex.search(clean_line)
if ids:
article_id = ids[0]
if article_id in read_ids:
logger.info(
"Found duplicate article ID {} in line: {}".format(article_id, clean_line)
) # This should never happen ...
read_ids.add(article_id)
# read the title of this article (outside the revision portion of the document)
if not reading_revision:
titles = title_regex.search(clean_line)
if titles:
article_title = titles[0].strip()
logger.info("Finished. Processed {} articles".format(article_count))
def _process_wp_text(article_title, article_text, wp_to_id):
# ignore meta Wikipedia pages
if ns_regex.match(article_title):
return None, None
# remove the text tags
text_search = text_regex.search(article_text)
if text_search is None:
return None, None
text = text_search.group(0)
# stop processing if this is a redirect page
if text.startswith("#REDIRECT"):
return None, None
# get the raw text without markup etc, keeping only interwiki links
clean_text, entities = _remove_links(_get_clean_wp_text(text), wp_to_id)
return clean_text, entities
def _get_clean_wp_text(article_text):
clean_text = article_text.strip()
# remove bolding & italic markup
clean_text = clean_text.replace("'''", "")
clean_text = clean_text.replace("''", "")
# remove nested {{info}} statements by removing the inner/smallest ones first and iterating
try_again = True
previous_length = len(clean_text)
while try_again:
clean_text = info_regex.sub(
"", clean_text
) # non-greedy match excluding a nested {
if len(clean_text) < previous_length:
try_again = True
else:
try_again = False
previous_length = len(clean_text)
# remove HTML comments
clean_text = html_regex.sub("", clean_text)
# remove Category and File statements
clean_text = category_regex.sub("", clean_text)
clean_text = file_regex.sub("", clean_text)
# remove multiple =
while "==" in clean_text:
clean_text = clean_text.replace("==", "=")
clean_text = clean_text.replace(". =", ".")
clean_text = clean_text.replace(" = ", ". ")
clean_text = clean_text.replace("= ", ".")
clean_text = clean_text.replace(" =", "")
# remove refs (non-greedy match)
clean_text = ref_regex.sub("", clean_text)
clean_text = ref_2_regex.sub("", clean_text)
# remove additional wikiformatting
clean_text = re.sub(r"&lt;blockquote&gt;", "", clean_text)
clean_text = re.sub(r"&lt;/blockquote&gt;", "", clean_text)
# change special characters back to normal ones
clean_text = clean_text.replace(r"&lt;", "<")
clean_text = clean_text.replace(r"&gt;", ">")
clean_text = clean_text.replace(r"&quot;", '"')
clean_text = clean_text.replace(r"&amp;nbsp;", " ")
clean_text = clean_text.replace(r"&amp;", "&")
# remove multiple spaces
while " " in clean_text:
clean_text = clean_text.replace(" ", " ")
return clean_text.strip()
def _remove_links(clean_text, wp_to_id):
# read the text char by char to get the right offsets for the interwiki links
entities = []
final_text = ""
open_read = 0
reading_text = True
reading_entity = False
reading_mention = False
reading_special_case = False
entity_buffer = ""
mention_buffer = ""
for index, letter in enumerate(clean_text):
if letter == "[":
open_read += 1
elif letter == "]":
open_read -= 1
elif letter == "|":
if reading_text:
final_text += letter
# switch from reading entity to mention in the [[entity|mention]] pattern
elif reading_entity:
reading_text = False
reading_entity = False
reading_mention = True
else:
reading_special_case = True
else:
if reading_entity:
entity_buffer += letter
elif reading_mention:
mention_buffer += letter
elif reading_text:
final_text += letter
else:
raise ValueError("Not sure at point", clean_text[index - 2 : index + 2])
if open_read > 2:
reading_special_case = True
if open_read == 2 and reading_text:
reading_text = False
reading_entity = True
reading_mention = False
# we just finished reading an entity
if open_read == 0 and not reading_text:
if "#" in entity_buffer or entity_buffer.startswith(":"):
reading_special_case = True
# Ignore cases with nested structures like File: handles etc
if not reading_special_case:
if not mention_buffer:
mention_buffer = entity_buffer
start = len(final_text)
end = start + len(mention_buffer)
qid = wp_to_id.get(entity_buffer, None)
if qid:
entities.append((mention_buffer, qid, start, end))
final_text += mention_buffer
entity_buffer = ""
mention_buffer = ""
reading_text = True
reading_entity = False
reading_mention = False
reading_special_case = False
return final_text, entities
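For intuition, this is roughly what the state machine above produces for a toy interwiki link (single-entry mapping; offsets refer to the cleaned text):

text = "Born in [[Paris|the French capital]]."
wp_to_id = {"Paris": "Q90"}
final_text, entities = _remove_links(text, wp_to_id)
# final_text == "Born in the French capital."
# entities == [("the French capital", "Q90", 8, 26)]  # (mention, QID, start_char, end_char)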
def _write_training_description(outputfile, qid, description):
if description is not None:
line = str(qid) + "|" + description + "\n"
outputfile.write(line)
def _write_training_entities(outputfile, article_id, clean_text, entities):
entities_data = [
{"alias": ent[0], "entity": ent[1], "start": ent[2], "end": ent[3]}
for ent in entities
]
line = (
json.dumps(
{
"article_id": article_id,
"clean_text": clean_text,
"entities": entities_data,
},
ensure_ascii=False,
)
+ "\n"
)
outputfile.write(line)
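Each call appends one JSON line per article to the training file; a made-up example of such a line, matching the keys written above:

{"article_id": "12", "clean_text": "Douglas Adams was an English writer.", "entities": [{"alias": "Douglas Adams", "entity": "Q42", "start": 0, "end": 13}]}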
def read_training(nlp, entity_file_path, dev, limit, kb, labels_discard=None):
""" This method provides training examples that correspond to the entity annotations found by the nlp object.
For training, it will include both positive and negative examples by using the candidate generator from the kb.
For testing (kb=None), it will include all positive examples only."""
from tqdm import tqdm
if not labels_discard:
labels_discard = []
data = []
num_entities = 0
get_gold_parse = partial(
_get_gold_parse, dev=dev, kb=kb, labels_discard=labels_discard
)
logger.info(
"Reading {} data with limit {}".format("dev" if dev else "train", limit)
)
with entity_file_path.open("r", encoding="utf8") as file:
with tqdm(total=limit, leave=False) as pbar:
for i, line in enumerate(file):
example = json.loads(line)
article_id = example["article_id"]
clean_text = example["clean_text"]
entities = example["entities"]
if dev != is_dev(article_id) or not is_valid_article(clean_text):
continue
doc = nlp(clean_text)
gold = get_gold_parse(doc, entities)
if gold and len(gold.links) > 0:
data.append((doc, gold))
num_entities += len(gold.links)
pbar.update(len(gold.links))
if limit and num_entities >= limit:
break
logger.info("Read {} entities in {} articles".format(num_entities, len(data)))
return data
def _get_gold_parse(doc, entities, dev, kb, labels_discard):
gold_entities = {}
tagged_ent_positions = {
(ent.start_char, ent.end_char): ent
for ent in doc.ents
if ent.label_ not in labels_discard
}
for entity in entities:
entity_id = entity["entity"]
alias = entity["alias"]
start = entity["start"]
end = entity["end"]
candidate_ids = []
if kb and not dev:
candidates = kb.get_candidates(alias)
candidate_ids = [cand.entity_ for cand in candidates]
tagged_ent = tagged_ent_positions.get((start, end), None)
if tagged_ent:
# TODO: check that alias == doc.text[start:end]
should_add_ent = (dev or entity_id in candidate_ids) and is_valid_sentence(
tagged_ent.sent.text
)
if should_add_ent:
value_by_id = {entity_id: 1.0}
if not dev:
random.shuffle(candidate_ids)
value_by_id.update(
{kb_id: 0.0 for kb_id in candidate_ids if kb_id != entity_id}
)
gold_entities[(start, end)] = value_by_id
return GoldParse(doc, links=gold_entities)
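The gold_entities dict handed to GoldParse maps character offsets to candidate scores; schematically (offsets and IDs are illustrative):

# {(0, 13): {"Q42": 1.0, "Q123": 0.0}}  # the correct candidate gets 1.0, other KB candidates 0.0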
def is_dev(article_id):
if not article_id:
return False
return article_id.endswith("3")
def is_valid_article(doc_text):
# custom length cut-off
return 10 < len(doc_text) < 30000
def is_valid_sentence(sent_text):
if not 10 < len(sent_text) < 3000:
# custom length cut-off
return False
if sent_text.strip().startswith("*") or sent_text.strip().startswith("#"):
# remove 'enumeration' sentences (occurs often on Wikipedia)
return False
return True
View File
@ -7,7 +7,7 @@ dependency tree to find the noun phrase they are referring to for example:
$9.4 million --> Net income. $9.4 million --> Net income.
Compatible with: spaCy v2.0.0+ Compatible with: spaCy v2.0.0+
Last tested with: v2.1.0 Last tested with: v2.2.1
""" """
from __future__ import unicode_literals, print_function from __future__ import unicode_literals, print_function
@ -38,14 +38,17 @@ def main(model="en_core_web_sm"):
def filter_spans(spans): def filter_spans(spans):
# Filter a sequence of spans so they don't contain overlaps # Filter a sequence of spans so they don't contain overlaps
get_sort_key = lambda span: (span.end - span.start, span.start) # For spaCy 2.1.4+: this function is available as spacy.util.filter_spans()
get_sort_key = lambda span: (span.end - span.start, -span.start)
sorted_spans = sorted(spans, key=get_sort_key, reverse=True) sorted_spans = sorted(spans, key=get_sort_key, reverse=True)
result = [] result = []
seen_tokens = set() seen_tokens = set()
for span in sorted_spans: for span in sorted_spans:
# Check for end - 1 here because boundaries are inclusive
if span.start not in seen_tokens and span.end - 1 not in seen_tokens: if span.start not in seen_tokens and span.end - 1 not in seen_tokens:
result.append(span) result.append(span)
seen_tokens.update(range(span.start, span.end)) seen_tokens.update(range(span.start, span.end))
result = sorted(result, key=lambda span: span.start)
return result return result
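As the new comment notes, spaCy 2.1.4+ ships the same helper as spacy.util.filter_spans(); a quick illustrative call:

import spacy
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("New York City is big")
spans = [doc[0:2], doc[0:3], doc[3:4]]  # "New York" overlaps "New York City"
print([span.text for span in filter_spans(spans)])  # ['New York City', 'is']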
View File
@ -91,8 +91,8 @@ def demo(shape):
nlp = spacy.load("en_vectors_web_lg") nlp = spacy.load("en_vectors_web_lg")
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / "similarity", nlp, shape[0])) nlp.add_pipe(KerasSimilarityShim.load(nlp.path / "similarity", nlp, shape[0]))
doc1 = nlp(u"The king of France is bald.") doc1 = nlp("The king of France is bald.")
doc2 = nlp(u"France has no king.") doc2 = nlp("France has no king.")
print("Sentence 1:", doc1) print("Sentence 1:", doc1)
print("Sentence 2:", doc2) print("Sentence 2:", doc2)
View File
@ -0,0 +1,45 @@
# coding: utf-8
"""
Example of loading previously parsed text using spaCy's DocBin class. The example
performs an entity count to show that the annotations are available.
For more details, see https://spacy.io/usage/saving-loading#docs
Installation:
python -m spacy download en_core_web_lg
Usage:
python examples/load_from_docbin.py en_core_web_lg RC_2015-03-9.spacy
"""
from __future__ import unicode_literals
import spacy
from spacy.tokens import DocBin
from timeit import default_timer as timer
from collections import Counter
EXAMPLE_PARSES_PATH = "RC_2015-03-9.spacy"
def main(model="en_core_web_lg", docbin_path=EXAMPLE_PARSES_PATH):
nlp = spacy.load(model)
print("Reading data from {}".format(docbin_path))
with open(docbin_path, "rb") as file_:
bytes_data = file_.read()
nr_word = 0
start_time = timer()
entities = Counter()
docbin = DocBin().from_bytes(bytes_data)
for doc in docbin.get_docs(nlp.vocab):
nr_word += len(doc)
entities.update((e.label_, e.text) for e in doc.ents)
end_time = timer()
msg = "Loaded {nr_word} words in {seconds} seconds ({wps} words per second)"
wps = nr_word / (end_time - start_time)
print(msg.format(nr_word=nr_word, seconds=end_time - start_time, wps=wps))
print("Most common entities:")
for (label, entity), freq in entities.most_common(30):
print(freq, entity, label)
if __name__ == "__main__":
import plac
plac.call(main)
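For completeness, a hedged sketch of how such a .spacy file could be produced in the first place (model, texts and filename are illustrative):

import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_lg")
doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"])
for doc in nlp.pipe(["Apple is looking at buying a U.K. startup.", "London is a big city."]):
    doc_bin.add(doc)
with open("RC_2015-03-9.spacy", "wb") as file_:
    file_.write(doc_bin.to_bytes())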
View File
@ -0,0 +1 @@
{"nr_epoch": 3, "batch_size": 24, "dropout": 0.001, "vectors": 0, "multitask_tag": 0, "multitask_sent": 0}
View File
@ -13,8 +13,7 @@ import spacy.util
from spacy.tokens import Token, Doc from spacy.tokens import Token, Doc
from spacy.gold import GoldParse from spacy.gold import GoldParse
from spacy.syntax.nonproj import projectivize from spacy.syntax.nonproj import projectivize
from collections import defaultdict, Counter from collections import defaultdict
from timeit import default_timer as timer
from spacy.matcher import Matcher from spacy.matcher import Matcher
import itertools import itertools
@ -290,11 +289,6 @@ def get_token_conllu(token, i):
return "\n".join(lines) return "\n".join(lines)
Token.set_extension("get_conllu_lines", method=get_token_conllu)
Token.set_extension("begins_fused", default=False)
Token.set_extension("inside_fused", default=False)
################## ##################
# Initialization # # Initialization #
################## ##################
@ -381,20 +375,24 @@ class TreebankPaths(object):
@plac.annotations( @plac.annotations(
ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path), ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
parses_dir=("Directory to write the development parses", "positional", None, Path),
config=("Path to json formatted config file", "positional", None, Config.load),
corpus=( corpus=(
"UD corpus to train and evaluate on, e.g. en, es_ancora, etc", "UD corpus to train and evaluate on, e.g. UD_Spanish-AnCora",
"positional", "positional",
None, None,
str, str,
), ),
parses_dir=("Directory to write the development parses", "positional", None, Path),
config=("Path to json formatted config file", "positional", None, Config.load),
limit=("Size limit", "option", "n", int), limit=("Size limit", "option", "n", int),
) )
def main(ud_dir, parses_dir, config, corpus, limit=0): def main(ud_dir, parses_dir, config, corpus, limit=0):
# temp fix to avoid import issues cf https://github.com/explosion/spaCy/issues/4200 # temp fix to avoid import issues cf https://github.com/explosion/spaCy/issues/4200
import tqdm import tqdm
Token.set_extension("get_conllu_lines", method=get_token_conllu)
Token.set_extension("begins_fused", default=False)
Token.set_extension("inside_fused", default=False)
paths = TreebankPaths(ud_dir, corpus) paths = TreebankPaths(ud_dir, corpus)
if not (parses_dir / corpus).exists(): if not (parses_dir / corpus).exists():
(parses_dir / corpus).mkdir() (parses_dir / corpus).mkdir()
@ -403,8 +401,8 @@ def main(ud_dir, parses_dir, config, corpus, limit=0):
docs, golds = read_data( docs, golds = read_data(
nlp, nlp,
paths.train.conllu.open(), paths.train.conllu.open(encoding="utf8"),
paths.train.text.open(), paths.train.text.open(encoding="utf8"),
max_doc_length=config.max_doc_length, max_doc_length=config.max_doc_length,
limit=limit, limit=limit,
) )
View File
@ -18,19 +18,21 @@ during training. We discard the auxiliary model before run-time.
The specific example here is not necessarily a good idea --- but it shows The specific example here is not necessarily a good idea --- but it shows
how an arbitrary objective function for some word can be used. how an arbitrary objective function for some word can be used.
Developed and tested for spaCy 2.0.6 Developed and tested for spaCy 2.0.6. Updated for v2.2.2
""" """
import random import random
import plac import plac
import spacy import spacy
import os.path import os.path
from spacy.tokens import Doc
from spacy.gold import read_json_file, GoldParse from spacy.gold import read_json_file, GoldParse
random.seed(0) random.seed(0)
PWD = os.path.dirname(__file__) PWD = os.path.dirname(__file__)
TRAIN_DATA = list(read_json_file(os.path.join(PWD, "training-data.json"))) TRAIN_DATA = list(read_json_file(
os.path.join(PWD, "ner_example_data", "ner-sent-per-line.json")))
def get_position_label(i, words, tags, heads, labels, ents): def get_position_label(i, words, tags, heads, labels, ents):
@ -55,6 +57,7 @@ def main(n_iter=10):
ner = nlp.create_pipe("ner") ner = nlp.create_pipe("ner")
ner.add_multitask_objective(get_position_label) ner.add_multitask_objective(get_position_label)
nlp.add_pipe(ner) nlp.add_pipe(ner)
print(nlp.pipeline)
print("Create data", len(TRAIN_DATA)) print("Create data", len(TRAIN_DATA))
optimizer = nlp.begin_training(get_gold_tuples=lambda: TRAIN_DATA) optimizer = nlp.begin_training(get_gold_tuples=lambda: TRAIN_DATA)
@ -62,23 +65,24 @@ def main(n_iter=10):
random.shuffle(TRAIN_DATA) random.shuffle(TRAIN_DATA)
losses = {} losses = {}
for text, annot_brackets in TRAIN_DATA: for text, annot_brackets in TRAIN_DATA:
annotations, _ = annot_brackets for annotations, _ in annot_brackets:
doc = nlp.make_doc(text) doc = Doc(nlp.vocab, words=annotations[1])
gold = GoldParse.from_annot_tuples(doc, annotations[0]) gold = GoldParse.from_annot_tuples(doc, annotations)
nlp.update( nlp.update(
[doc], # batch of texts [doc], # batch of texts
[gold], # batch of annotations [gold], # batch of annotations
drop=0.2, # dropout - make it harder to memorise data drop=0.2, # dropout - make it harder to memorise data
sgd=optimizer, # callable to update weights sgd=optimizer, # callable to update weights
losses=losses, losses=losses,
) )
print(losses.get("nn_labeller", 0.0), losses["ner"]) print(losses.get("nn_labeller", 0.0), losses["ner"])
# test the trained model # test the trained model
for text, _ in TRAIN_DATA: for text, _ in TRAIN_DATA:
doc = nlp(text) if text is not None:
print("Entities", [(ent.text, ent.label_) for ent in doc.ents]) doc = nlp(text)
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc]) print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
if __name__ == "__main__": if __name__ == "__main__":
View File
@ -8,7 +8,7 @@
{ {
"tokens": [ "tokens": [
{ {
"head": 4, "head": 44,
"dep": "prep", "dep": "prep",
"tag": "IN", "tag": "IN",
"orth": "In", "orth": "In",
View File
@ -48,4 +48,6 @@ redirects = [
{from = "/api/sentencesegmenter", to="/api/sentencizer"}, {from = "/api/sentencesegmenter", to="/api/sentencizer"},
{from = "/universe", to = "/universe/project/:id", query = {id = ":id"}, force = true}, {from = "/universe", to = "/universe/project/:id", query = {id = ":id"}, force = true},
{from = "/universe", to = "/universe/category/:category", query = {category = ":category"}, force = true}, {from = "/universe", to = "/universe/category/:category", query = {category = ":category"}, force = true},
# Renamed universe projects
{from = "/universe/project/spacy-pytorch-transformers", to = "/universe/project/spacy-transformers", force = true}
] ]
View File
@ -1,15 +1,16 @@
# Our libraries # Our libraries
cymem>=2.0.2,<2.1.0 cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0 preshed>=3.0.2,<3.1.0
thinc>=7.1.1,<7.2.0 thinc>=7.3.0,<7.4.0
blis>=0.4.0,<0.5.0 blis>=0.4.0,<0.5.0
murmurhash>=0.28.0,<1.1.0 murmurhash>=0.28.0,<1.1.0
wasabi>=0.2.0,<1.1.0 wasabi>=0.4.0,<1.1.0
srsly>=0.1.0,<1.1.0 srsly>=0.1.0,<1.1.0
catalogue>=0.0.7,<1.1.0
# Third party dependencies # Third party dependencies
numpy>=1.15.0 numpy>=1.15.0
requests>=2.13.0,<3.0.0 requests>=2.13.0,<3.0.0
plac<1.0.0,>=0.9.6 plac>=0.9.6,<1.2.0
pathlib==1.0.1; python_version < "3.4" pathlib==1.0.1; python_version < "3.4"
# Optional dependencies # Optional dependencies
jsonschema>=2.6.0,<3.1.0 jsonschema>=2.6.0,<3.1.0
View File
@ -22,6 +22,7 @@ classifiers =
Programming Language :: Python :: 3.5 Programming Language :: Python :: 3.5
Programming Language :: Python :: 3.6 Programming Language :: Python :: 3.6
Programming Language :: Python :: 3.7 Programming Language :: Python :: 3.7
Programming Language :: Python :: 3.8
Topic :: Scientific/Engineering Topic :: Scientific/Engineering
[options] [options]
@ -37,40 +38,38 @@ setup_requires =
cymem>=2.0.2,<2.1.0 cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0 preshed>=3.0.2,<3.1.0
murmurhash>=0.28.0,<1.1.0 murmurhash>=0.28.0,<1.1.0
thinc>=7.1.1,<7.2.0 thinc>=7.3.0,<7.4.0
install_requires = install_requires =
numpy>=1.15.0 # Our libraries
murmurhash>=0.28.0,<1.1.0 murmurhash>=0.28.0,<1.1.0
cymem>=2.0.2,<2.1.0 cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0 preshed>=3.0.2,<3.1.0
thinc>=7.1.1,<7.2.0 thinc>=7.3.0,<7.4.0
blis>=0.4.0,<0.5.0 blis>=0.4.0,<0.5.0
plac<1.0.0,>=0.9.6 wasabi>=0.4.0,<1.1.0
requests>=2.13.0,<3.0.0
wasabi>=0.2.0,<1.1.0
srsly>=0.1.0,<1.1.0 srsly>=0.1.0,<1.1.0
catalogue>=0.0.7,<1.1.0
# Third-party dependencies
setuptools
numpy>=1.15.0
plac>=0.9.6,<1.2.0
requests>=2.13.0,<3.0.0
pathlib==1.0.1; python_version < "3.4" pathlib==1.0.1; python_version < "3.4"
[options.extras_require] [options.extras_require]
lookups = lookups =
spacy_lookups_data>=0.0.5<0.2.0 spacy_lookups_data>=0.0.5<0.2.0
cuda = cuda =
thinc_gpu_ops>=0.0.1,<0.1.0
cupy>=5.0.0b4 cupy>=5.0.0b4
cuda80 = cuda80 =
thinc_gpu_ops>=0.0.1,<0.1.0
cupy-cuda80>=5.0.0b4 cupy-cuda80>=5.0.0b4
cuda90 = cuda90 =
thinc_gpu_ops>=0.0.1,<0.1.0
cupy-cuda90>=5.0.0b4 cupy-cuda90>=5.0.0b4
cuda91 = cuda91 =
thinc_gpu_ops>=0.0.1,<0.1.0
cupy-cuda91>=5.0.0b4 cupy-cuda91>=5.0.0b4
cuda92 = cuda92 =
thinc_gpu_ops>=0.0.1,<0.1.0
cupy-cuda92>=5.0.0b4 cupy-cuda92>=5.0.0b4
cuda100 = cuda100 =
thinc_gpu_ops>=0.0.1,<0.1.0
cupy-cuda100>=5.0.0b4 cupy-cuda100>=5.0.0b4
# Language tokenizers with external dependencies # Language tokenizers with external dependencies
ja = ja =
View File
@ -9,11 +9,14 @@ warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
# These are imported as part of the API # These are imported as part of the API
from thinc.neural.util import prefer_gpu, require_gpu from thinc.neural.util import prefer_gpu, require_gpu
from . import pipeline
from .cli.info import info as cli_info from .cli.info import info as cli_info
from .glossary import explain from .glossary import explain
from .about import __version__ from .about import __version__
from .errors import Errors, Warnings, deprecation_warning from .errors import Errors, Warnings, deprecation_warning
from . import util from . import util
from .util import registry
from .language import component
if sys.maxunicode == 65535: if sys.maxunicode == 65535:
View File
@ -7,12 +7,10 @@ from __future__ import print_function
if __name__ == "__main__": if __name__ == "__main__":
import plac import plac
import sys import sys
from wasabi import Printer from wasabi import msg
from spacy.cli import download, link, info, package, train, pretrain, convert from spacy.cli import download, link, info, package, train, pretrain, convert
from spacy.cli import init_model, profile, evaluate, validate, debug_data from spacy.cli import init_model, profile, evaluate, validate, debug_data
msg = Printer()
commands = { commands = {
"download": download, "download": download,
"link": link, "link": link,
View File
@ -7,12 +7,15 @@ from thinc.i2v import HashEmbed, StaticVectors
from thinc.t2t import ExtractWindow, ParametricAttention from thinc.t2t import ExtractWindow, ParametricAttention
from thinc.t2v import Pooling, sum_pool, mean_pool, max_pool from thinc.t2v import Pooling, sum_pool, mean_pool, max_pool
from thinc.misc import Residual from thinc.misc import Residual
from thinc.v2v import Model, Maxout, Softmax, Affine, ReLu
from thinc.t2t import ExtractWindow, ParametricAttention
from thinc.t2v import Pooling, sum_pool, mean_pool
from thinc.i2v import HashEmbed
from thinc.misc import Residual, FeatureExtracter
from thinc.misc import LayerNorm as LN from thinc.misc import LayerNorm as LN
from thinc.misc import FeatureExtracter
from thinc.api import add, layerize, chain, clone, concatenate, with_flatten from thinc.api import add, layerize, chain, clone, concatenate, with_flatten
from thinc.api import with_getitem, flatten_add_lengths from thinc.api import with_getitem, flatten_add_lengths
from thinc.api import uniqued, wrap, noop from thinc.api import uniqued, wrap, noop
from thinc.api import with_square_sequences
from thinc.linear.linear import LinearModel from thinc.linear.linear import LinearModel
from thinc.neural.ops import NumpyOps, CupyOps from thinc.neural.ops import NumpyOps, CupyOps
from thinc.neural.util import get_array_module, copy_array, to_categorical from thinc.neural.util import get_array_module, copy_array, to_categorical
@ -29,14 +32,13 @@ from .strings import get_string_id
from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE
from .errors import Errors, user_warning, Warnings from .errors import Errors, user_warning, Warnings
from . import util from . import util
from . import ml as new_ml
from .ml import _legacy_tok2vec
try:
import torch.nn
from thinc.extra.wrappers import PyTorchWrapperRNN
except ImportError:
torch = None
VECTORS_KEY = "spacy_pretrained_vectors" VECTORS_KEY = "spacy_pretrained_vectors"
# Backwards compatibility with <2.2.2
USE_MODEL_REGISTRY_TOK2VEC = False
def cosine(vec1, vec2): def cosine(vec1, vec2):
@ -314,6 +316,10 @@ def link_vectors_to_models(vocab):
def PyTorchBiLSTM(nO, nI, depth, dropout=0.2): def PyTorchBiLSTM(nO, nI, depth, dropout=0.2):
import torch.nn
from thinc.api import with_square_sequences
from thinc.extra.wrappers import PyTorchWrapperRNN
if depth == 0: if depth == 0:
return layerize(noop()) return layerize(noop())
model = torch.nn.LSTM(nI, nO // 2, depth, bidirectional=True, dropout=dropout) model = torch.nn.LSTM(nI, nO // 2, depth, bidirectional=True, dropout=dropout)
@ -336,161 +342,91 @@ def Tok2Vec_chars_cnn(width, embed_size, **kwargs):
tok2vec.embed = embed tok2vec.embed = embed
return tok2vec return tok2vec
def Tok2Vec_chars_selfattention(width, embed_size, **kwargs):
cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3)
sa_depth = kwargs.get("self_attn_depth", 4)
with Model.define_operators(
{">>": chain, "|": concatenate, "**": clone, "+": add, "*": reapply}
):
embed = (
CharacterEmbed(nM=64, nC=8)
>> with_flatten(LN(Maxout(width, 64*8, pieces=cnn_maxout_pieces))))
tok2vec = (
embed
>> PositionEncode(10000, width)
>> SelfAttention(width, 1, 4) ** sa_depth
)
# Work around thinc API limitations :(. TODO: Revise in Thinc 7
tok2vec.nO = width
tok2vec.embed = embed
return tok2vec
def Tok2Vec_chars_bilstm(width, embed_size, **kwargs):
cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3)
depth = kwargs.get("bilstm_depth", 2)
with Model.define_operators(
{">>": chain, "|": concatenate, "**": clone, "+": add, "*": reapply}
):
embed = (
CharacterEmbed(nM=64, nC=8)
>> with_flatten(LN(Maxout(width, 64*8, pieces=cnn_maxout_pieces))))
tok2vec = (
embed
>> Residual(PyTorchBiLSTM(width, width, depth))
>> with_flatten(LN(nO=width))
)
# Work around thinc API limitations :(. TODO: Revise in Thinc 7
tok2vec.nO = width
tok2vec.embed = embed
return tok2vec
def CNN(width, depth, pieces, nW=1):
if pieces == 1:
layer = chain(
ExtractWindow(nW=nW),
Mish(width, width*(nW*2+1)),
LN(nO=width)
)
return clone(Residual(layer), depth)
else:
layer = chain(
ExtractWindow(nW=nW),
LN(Maxout(width, width * (nW*2+1), pieces=pieces)))
return clone(Residual(layer), depth)
def SelfAttention(width, depth, pieces):
layer = chain(
prepare_self_attention(Affine(width * 3, width), nM=width, nH=pieces, window=None),
MultiHeadedAttention(),
with_flatten(Maxout(width, width)))
return clone(Residual(chain(layer, with_flatten(LN(nO=width)))), depth)
def PositionEncode(L, D):
positions = NumpyOps().position_encode(L, D)
positions = Model.ops.asarray(positions)
def position_encode_forward(Xs, drop=0.):
output = []
for x in Xs:
output.append(x + positions[:x.shape[0]])
def position_encode_backward(dYs, sgd=None):
return dYs
return output, position_encode_backward
return layerize(position_encode_forward)
def Tok2Vec(width, embed_size, **kwargs): def Tok2Vec(width, embed_size, **kwargs):
pretrained_vectors = kwargs.setdefault("pretrained_vectors", None) if not USE_MODEL_REGISTRY_TOK2VEC:
cnn_maxout_pieces = kwargs.setdefault("cnn_maxout_pieces", 3) # Preserve prior tok2vec for backwards compat, in v2.2.2
subword_features = kwargs.setdefault("subword_features", True) return _legacy_tok2vec.Tok2Vec(width, embed_size, **kwargs)
char_embed = kwargs.setdefault("char_embed", False) pretrained_vectors = kwargs.get("pretrained_vectors", None)
conv_depth = kwargs.setdefault("conv_depth", 4) cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3)
bilstm_depth = kwargs.setdefault("bilstm_depth", 0) subword_features = kwargs.get("subword_features", True)
self_attn_depth = kwargs.setdefault("self_attn_depth", 0) char_embed = kwargs.get("char_embed", False)
conv_window = kwargs.setdefault("conv_window", 1) conv_depth = kwargs.get("conv_depth", 4)
if char_embed and self_attn_depth: bilstm_depth = kwargs.get("bilstm_depth", 0)
return Tok2Vec_chars_selfattention(width, embed_size, **kwargs) conv_window = kwargs.get("conv_window", 1)
elif char_embed and bilstm_depth:
return Tok2Vec_chars_bilstm(width, embed_size, **kwargs)
elif char_embed and conv_depth:
return Tok2Vec_chars_cnn(width, embed_size, **kwargs)
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
with Model.define_operators(
{">>": chain, "|": concatenate, "**": clone, "+": add, "*": reapply}
):
norm = HashEmbed(width, embed_size, column=cols.index(NORM), name="embed_norm")
if subword_features:
prefix = HashEmbed(
width, embed_size // 2, column=cols.index(PREFIX), name="embed_prefix"
)
suffix = HashEmbed(
width, embed_size // 2, column=cols.index(SUFFIX), name="embed_suffix"
)
shape = HashEmbed(
width, embed_size // 2, column=cols.index(SHAPE), name="embed_shape"
)
else:
prefix, suffix, shape = (None, None, None)
if pretrained_vectors is not None:
glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))
if subword_features: cols = ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"]
embed = uniqued(
(glove | norm | prefix | suffix | shape)
>> LN(Maxout(width, width * 5, pieces=3)),
column=cols.index(ORTH),
)
else:
embed = uniqued(
(glove | norm) >> LN(Maxout(width, width * 2, pieces=3)),
column=cols.index(ORTH),
)
elif subword_features:
embed = uniqued(
(norm | prefix | suffix | shape)
>> LN(Maxout(width, width * 4, pieces=3)),
column=cols.index(ORTH),
)
else:
embed = norm
tok2vec = ( doc2feats_cfg = {"arch": "spacy.Doc2Feats.v1", "config": {"columns": cols}}
FeatureExtracter(cols) if char_embed:
>> with_bos_eos( embed_cfg = {
with_flatten( "arch": "spacy.CharacterEmbed.v1",
embed "config": {
>> CNN(width, conv_depth, cnn_maxout_pieces, nW=conv_window), "width": 64,
pad=conv_depth * conv_window) "chars": 6,
) "@mix": {
) "arch": "spacy.LayerNormalizedMaxout.v1",
"config": {"width": width, "pieces": 3},
if bilstm_depth >= 1: },
tok2vec = ( "@embed_features": None,
tok2vec },
>> Residual( }
PyTorchBiLSTM(width, width, 1) else:
>> LN(nO=width) embed_cfg = {
) ** bilstm_depth "arch": "spacy.MultiHashEmbed.v1",
) "config": {
# Work around thinc API limitations :(. TODO: Revise in Thinc 7 "width": width,
tok2vec.nO = width "rows": embed_size,
tok2vec.embed = embed "columns": cols,
return tok2vec "use_subwords": subword_features,
"@pretrained_vectors": None,
"@mix": {
"arch": "spacy.LayerNormalizedMaxout.v1",
"config": {"width": width, "pieces": 3},
},
},
}
if pretrained_vectors:
embed_cfg["config"]["@pretrained_vectors"] = {
"arch": "spacy.PretrainedVectors.v1",
"config": {
"vectors_name": pretrained_vectors,
"width": width,
"column": cols.index("ID"),
},
}
if cnn_maxout_pieces >= 2:
cnn_cfg = {
"arch": "spacy.MaxoutWindowEncoder.v1",
"config": {
"width": width,
"window_size": conv_window,
"pieces": cnn_maxout_pieces,
"depth": conv_depth,
},
}
else:
cnn_cfg = {
"arch": "spacy.MishWindowEncoder.v1",
"config": {"width": width, "window_size": conv_window, "depth": conv_depth},
}
bilstm_cfg = {
"arch": "spacy.TorchBiLSTMEncoder.v1",
"config": {"width": width, "depth": bilstm_depth},
}
if conv_depth == 0 and bilstm_depth == 0:
encode_cfg = {}
elif conv_depth >= 1 and bilstm_depth >= 1:
encode_cfg = {
"arch": "thinc.FeedForward.v1",
"config": {"children": [cnn_cfg, bilstm_cfg]},
}
elif conv_depth >= 1:
encode_cfg = cnn_cfg
else:
encode_cfg = bilstm_cfg
config = {"@doc2feats": doc2feats_cfg, "@embed": embed_cfg, "@encode": encode_cfg}
return new_ml.Tok2Vec(config)
def with_bos_eos(layer): def with_bos_eos(layer):
@ -1101,6 +1037,7 @@ class CharacterEmbed(Model):
return output, backprop_character_embed return output, backprop_character_embed
<<<<<<< HEAD
def get_characters_loss(ops, docs, prediction, nr_char=10): def get_characters_loss(ops, docs, prediction, nr_char=10):
target_ids = numpy.vstack([doc.to_utf8_array(nr_char=nr_char) for doc in docs]) target_ids = numpy.vstack([doc.to_utf8_array(nr_char=nr_char) for doc in docs])
target_ids = target_ids.reshape((-1,)) target_ids = target_ids.reshape((-1,))
@ -1112,6 +1049,8 @@ def get_characters_loss(ops, docs, prediction, nr_char=10):
return loss, d_target return loss, d_target
=======
>>>>>>> master
def get_cossim_loss(yh, y, ignore_zeros=False): def get_cossim_loss(yh, y, ignore_zeros=False):
xp = get_array_module(yh) xp = get_array_module(yh)
# Find the zero vectors # Find the zero vectors
View File
@ -1,6 +1,6 @@
# fmt: off # fmt: off
__title__ = "spacy" __title__ = "spacy"
__version__ = "2.2.1" __version__ = "2.2.2"
__release__ = True __release__ = True
__download_url__ = "https://github.com/explosion/spacy-models/releases/download" __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json" __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
179
spacy/analysis.py Normal file
View File
@ -0,0 +1,179 @@
# coding: utf8
from __future__ import unicode_literals
from collections import OrderedDict
from wasabi import Printer
from .tokens import Doc, Token, Span
from .errors import Errors, Warnings, user_warning
def analyze_pipes(pipeline, name, pipe, index, warn=True):
"""Analyze a pipeline component with respect to its position in the current
pipeline and the other components. Will check whether requirements are
fulfilled (e.g. if previous components assign the attributes).
pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
name (unicode): The name of the pipeline component to analyze.
pipe (callable): The pipeline component function to analyze.
index (int): The index of the component in the pipeline.
warn (bool): Show user warning if problem is found.
RETURNS (list): The problems found for the given pipeline component.
"""
assert pipeline[index][0] == name
prev_pipes = pipeline[:index]
pipe_requires = getattr(pipe, "requires", [])
requires = OrderedDict([(annot, False) for annot in pipe_requires])
if requires:
for prev_name, prev_pipe in prev_pipes:
prev_assigns = getattr(prev_pipe, "assigns", [])
for annot in prev_assigns:
requires[annot] = True
problems = []
for annot, fulfilled in requires.items():
if not fulfilled:
problems.append(annot)
if warn:
user_warning(Warnings.W025.format(name=name, attr=annot))
return problems
def analyze_all_pipes(pipeline, warn=True):
"""Analyze all pipes in the pipeline in order.
pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
warn (bool): Show user warning if problem is found.
RETURNS (dict): The problems found, keyed by component name.
"""
problems = {}
for i, (name, pipe) in enumerate(pipeline):
problems[name] = analyze_pipes(pipeline, name, pipe, i, warn=warn)
return problems
def dot_to_dict(values):
"""Convert dot notation to a dict. For example: ["token.pos", "token._.xyz"]
become {"token": {"pos": True, "_": {"xyz": True }}}.
values (iterable): The values to convert.
RETURNS (dict): The converted values.
"""
result = {}
for value in values:
path = result
parts = value.lower().split(".")
for i, item in enumerate(parts):
is_last = i == len(parts) - 1
path = path.setdefault(item, True if is_last else {})
return result
def validate_attrs(values):
"""Validate component attributes provided to "assigns", "requires" etc.
Raises error for invalid attributes and formatting. Doesn't check if
custom extension attributes are registered, since this is something the
user might want to do themselves later in the component.
values (iterable): The string attributes to check, e.g. `["token.pos"]`.
RETURNS (iterable): The checked attributes.
"""
data = dot_to_dict(values)
objs = {"doc": Doc, "token": Token, "span": Span}
for obj_key, attrs in data.items():
if obj_key == "span":
# Support Span only for custom extension attributes
span_attrs = [attr for attr in values if attr.startswith("span.")]
span_attrs = [attr for attr in span_attrs if not attr.startswith("span._.")]
if span_attrs:
raise ValueError(Errors.E180.format(attrs=", ".join(span_attrs)))
if obj_key not in objs: # first element is not doc/token/span
invalid_attrs = ", ".join(a for a in values if a.startswith(obj_key))
raise ValueError(Errors.E181.format(obj=obj_key, attrs=invalid_attrs))
if not isinstance(attrs, dict): # attr is something like "doc"
raise ValueError(Errors.E182.format(attr=obj_key))
for attr, value in attrs.items():
if attr == "_":
if value is True: # attr is something like "doc._"
raise ValueError(Errors.E182.format(attr="{}._".format(obj_key)))
for ext_attr, ext_value in value.items():
# We don't check whether the attribute actually exists
if ext_value is not True: # attr is something like doc._.x.y
good = "{}._.{}".format(obj_key, ext_attr)
bad = "{}.{}".format(good, ".".join(ext_value))
raise ValueError(Errors.E183.format(attr=bad, solution=good))
continue # we can't validate those further
if attr.endswith("_"): # attr is something like "token.pos_"
raise ValueError(Errors.E184.format(attr=attr, solution=attr[:-1]))
if value is not True: # attr is something like doc.x.y
good = "{}.{}".format(obj_key, attr)
bad = "{}.{}".format(good, ".".join(value))
raise ValueError(Errors.E183.format(attr=bad, solution=good))
obj = objs[obj_key]
if not hasattr(obj, attr):
raise ValueError(Errors.E185.format(obj=obj_key, attr=attr))
return values
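A couple of illustrative calls, based on the checks above (not an exhaustive spec):

validate_attrs(["token.pos", "doc.ents", "doc._.my_attr"])  # returns the values unchanged
validate_attrs(["token.pos_"])  # raises ValueError (E184): underscore variants aren't allowed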
def _get_feature_for_attr(pipeline, attr, feature):
assert feature in ["assigns", "requires"]
result = []
for pipe_name, pipe in pipeline:
pipe_assigns = getattr(pipe, feature, [])
if attr in pipe_assigns:
result.append((pipe_name, pipe))
return result
def get_assigns_for_attr(pipeline, attr):
"""Get all pipeline components that assign an attr, e.g. "doc.tensor".
pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
attr (unicode): The attribute to check.
RETURNS (list): (name, pipeline) tuples of components that assign the attr.
"""
return _get_feature_for_attr(pipeline, attr, "assigns")
def get_requires_for_attr(pipeline, attr):
"""Get all pipeline components that require an attr, e.g. "doc.tensor".
pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
attr (unicode): The attribute to check.
RETURNS (list): (name, pipeline) tuples of components that require the attr.
"""
return _get_feature_for_attr(pipeline, attr, "requires")
def print_summary(nlp, pretty=True, no_print=False):
"""Print a formatted summary for the current nlp object's pipeline. Shows
a table with the pipeline components and what they assign and require, as
well as any problems if available.
nlp (Language): The nlp object.
pretty (bool): Pretty-print the results (color etc).
no_print (bool): Don't print anything, just return the data.
RETURNS (dict): A dict with "overview" and "problems".
"""
msg = Printer(pretty=pretty, no_print=no_print)
overview = []
problems = {}
for i, (name, pipe) in enumerate(nlp.pipeline):
requires = getattr(pipe, "requires", [])
assigns = getattr(pipe, "assigns", [])
retok = getattr(pipe, "retokenizes", False)
overview.append((i, name, requires, assigns, retok))
problems[name] = analyze_pipes(nlp.pipeline, name, pipe, i, warn=False)
msg.divider("Pipeline Overview")
header = ("#", "Component", "Requires", "Assigns", "Retokenizes")
msg.table(overview, header=header, divider=True, multiline=True)
n_problems = sum(len(p) for p in problems.values())
if any(p for p in problems.values()):
msg.divider("Problems ({})".format(n_problems))
for name, problem in problems.items():
if problem:
problem = ", ".join(problem)
msg.warn("'{}' requirements not met: {}".format(name, problem))
else:
msg.good("No problems found.")
if no_print:
return {"overview": overview, "problems": problems}
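A short usage sketch of the new analysis helpers (the component set-up is illustrative):

import spacy
from spacy.analysis import print_summary

nlp = spacy.blank("en")
nlp.add_pipe(nlp.create_pipe("tagger"))
nlp.add_pipe(nlp.create_pipe("parser"))
print_summary(nlp)  # prints the pipeline overview table plus any unmet "requires"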
View File
@ -57,7 +57,8 @@ def convert(
is written to stdout, so you can pipe them forward to a JSON file: is written to stdout, so you can pipe them forward to a JSON file:
$ spacy convert some_file.conllu > some_file.json $ spacy convert some_file.conllu > some_file.json
""" """
msg = Printer() no_print = output_dir == "-"
msg = Printer(no_print=no_print)
input_path = Path(input_file) input_path = Path(input_file)
if file_type not in FILE_TYPES: if file_type not in FILE_TYPES:
msg.fail( msg.fail(
@ -102,6 +103,7 @@ def convert(
use_morphology=morphology, use_morphology=morphology,
lang=lang, lang=lang,
model=model, model=model,
no_print=no_print,
) )
if output_dir != "-": if output_dir != "-":
# Export data to a file # Export data to a file
View File
@ -9,7 +9,9 @@ from ...tokens.doc import Doc
from ...util import load_model from ...util import load_model
def conll_ner2json(input_data, n_sents=10, seg_sents=False, model=None, **kwargs): def conll_ner2json(
input_data, n_sents=10, seg_sents=False, model=None, no_print=False, **kwargs
):
""" """
Convert files in the CoNLL-2003 NER format and similar Convert files in the CoNLL-2003 NER format and similar
whitespace-separated columns into JSON format for use with train cli. whitespace-separated columns into JSON format for use with train cli.
@ -34,7 +36,7 @@ def conll_ner2json(input_data, n_sents=10, seg_sents=False, model=None, **kwargs
. O . O
""" """
msg = Printer() msg = Printer(no_print=no_print)
doc_delimiter = "-DOCSTART- -X- O O" doc_delimiter = "-DOCSTART- -X- O O"
# check for existing delimiters, which should be preserved # check for existing delimiters, which should be preserved
if "\n\n" in input_data and seg_sents: if "\n\n" in input_data and seg_sents:
View File
@ -34,6 +34,9 @@ def conllu2json(input_data, n_sents=10, use_morphology=False, lang=None, **_):
doc = create_doc(sentences, i) doc = create_doc(sentences, i)
docs.append(doc) docs.append(doc)
sentences = [] sentences = []
if sentences:
doc = create_doc(sentences, i)
docs.append(doc)
return docs return docs

View File
from .conll_ner2json import n_sents_info from .conll_ner2json import n_sents_info
def iob2json(input_data, n_sents=10, *args, **kwargs): def iob2json(input_data, n_sents=10, no_print=False, *args, **kwargs):
""" """
Convert IOB files with one sentence per line and tags separated with '|' Convert IOB files with one sentence per line and tags separated with '|'
into JSON format for use with train cli. IOB and IOB2 are accepted. into JSON format for use with train cli. IOB and IOB2 are accepted.
@ -20,7 +20,7 @@ def iob2json(input_data, n_sents=10, *args, **kwargs):
I|PRP|O like|VBP|O London|NNP|I-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O I|PRP|O like|VBP|O London|NNP|I-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
I|PRP|O like|VBP|O London|NNP|B-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O I|PRP|O like|VBP|O London|NNP|B-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
""" """
msg = Printer() msg = Printer(no_print=no_print)
docs = read_iob(input_data.split("\n")) docs = read_iob(input_data.split("\n"))
if n_sents > 0: if n_sents > 0:
n_sents_info(msg, n_sents) n_sents_info(msg, n_sents)
View File
@ -7,7 +7,7 @@ from ...gold import docs_to_json
from ...util import get_lang_class, minibatch from ...util import get_lang_class, minibatch
def ner_jsonl2json(input_data, lang=None, n_sents=10, use_morphology=False): def ner_jsonl2json(input_data, lang=None, n_sents=10, use_morphology=False, **_):
if lang is None: if lang is None:
raise ValueError("No --lang specified, but tokenization required") raise ValueError("No --lang specified, but tokenization required")
json_docs = [] json_docs = []
View File
@ -121,6 +121,8 @@ def debug_data(
msg.text("{} training docs".format(len(train_docs))) msg.text("{} training docs".format(len(train_docs)))
msg.text("{} evaluation docs".format(len(dev_docs))) msg.text("{} evaluation docs".format(len(dev_docs)))
if not len(dev_docs):
msg.fail("No evaluation docs")
overlap = len(train_texts.intersection(dev_texts)) overlap = len(train_texts.intersection(dev_texts))
if overlap: if overlap:
msg.warn("{} training examples also in evaluation data".format(overlap)) msg.warn("{} training examples also in evaluation data".format(overlap))
@ -206,6 +208,9 @@ def debug_data(
missing_values, "value" if missing_values == 1 else "values" missing_values, "value" if missing_values == 1 else "values"
) )
) )
for label in new_labels:
if len(label) == 0:
msg.fail("Empty label found in new labels")
if new_labels: if new_labels:
labels_with_counts = [ labels_with_counts = [
(label, count) (label, count)
@ -360,6 +365,16 @@ def debug_data(
) )
) )
# check for documents with multiple sentences
sents_per_doc = gold_train_data["n_sents"] / len(gold_train_data["texts"])
if sents_per_doc < 1.1:
msg.warn(
"The training data contains {:.2f} sentences per "
"document. When there are very few documents containing more "
"than one sentence, the parser will not learn how to segment "
"longer texts into sentences.".format(sents_per_doc)
)
# profile labels # profile labels
labels_train = [label for label in gold_train_data["deps"]] labels_train = [label for label in gold_train_data["deps"]]
labels_train_unpreprocessed = [ labels_train_unpreprocessed = [
View File
@ -6,17 +6,13 @@ import requests
import os import os
import subprocess import subprocess
import sys import sys
import pkg_resources from wasabi import msg
from wasabi import Printer
from .link import link from .link import link
from ..util import get_package_path from ..util import get_package_path
from .. import about from .. import about
msg = Printer()
@plac.annotations( @plac.annotations(
model=("Model to download (shortcut or name)", "positional", None, str), model=("Model to download (shortcut or name)", "positional", None, str),
direct=("Force direct download of name + version", "flag", "d", bool), direct=("Force direct download of name + version", "flag", "d", bool),
@ -87,6 +83,8 @@ def download(model, direct=False, *pip_args):
def require_package(name): def require_package(name):
try: try:
import pkg_resources
pkg_resources.working_set.require(name) pkg_resources.working_set.require(name)
return True return True
except: # noqa: E722 except: # noqa: E722
View File
@ -3,7 +3,7 @@ from __future__ import unicode_literals, division, print_function
import plac import plac
from timeit import default_timer as timer from timeit import default_timer as timer
from wasabi import Printer from wasabi import msg
from ..gold import GoldCorpus from ..gold import GoldCorpus
from .. import util from .. import util
@ -32,7 +32,6 @@ def evaluate(
Evaluate a model. To render a sample of parses in a HTML file, set an Evaluate a model. To render a sample of parses in a HTML file, set an
output directory as the displacy_path argument. output directory as the displacy_path argument.
""" """
msg = Printer()
util.fix_random_seed() util.fix_random_seed()
if gpu_id >= 0: if gpu_id >= 0:
util.use_gpu(gpu_id) util.use_gpu(gpu_id)
View File
@ -4,7 +4,7 @@ from __future__ import unicode_literals
import plac import plac
import platform import platform
from pathlib import Path from pathlib import Path
from wasabi import Printer from wasabi import msg
import srsly import srsly
from ..compat import path2str, basestring_, unicode_ from ..compat import path2str, basestring_, unicode_
@ -23,7 +23,6 @@ def info(model=None, markdown=False, silent=False):
specified as an argument, print model information. Flag --markdown specified as an argument, print model information. Flag --markdown
prints details in Markdown for easy copy-pasting to GitHub issues. prints details in Markdown for easy copy-pasting to GitHub issues.
""" """
msg = Printer()
if model: if model:
if util.is_package(model): if util.is_package(model):
model_path = util.get_package_path(model) model_path = util.get_package_path(model)
View File
@@ -11,7 +11,7 @@ import tarfile
import gzip
import zipfile
import srsly
-from wasabi import Printer
+from wasabi import msg
from ..vectors import Vectors
from ..errors import Errors, Warnings, user_warning
@@ -24,7 +24,6 @@ except ImportError:
DEFAULT_OOV_PROB = -20
-msg = Printer()
@plac.annotations(
@@ -3,7 +3,7 @@ from __future__ import unicode_literals
import plac
from pathlib import Path
-from wasabi import Printer
+from wasabi import msg
from ..compat import symlink_to, path2str
from .. import util
@@ -20,7 +20,6 @@ def link(origin, link_name, force=False, model_path=None):
either the name of a pip package, or the local path to the model data
directory. Linking models allows loading them via spacy.load(link_name).
"""
-msg = Printer()
if util.is_package(origin):
model_path = util.get_package_path(origin)
else:
@@ -4,7 +4,7 @@ from __future__ import unicode_literals
import plac
import shutil
from pathlib import Path
-from wasabi import Printer, get_raw_input
+from wasabi import msg, get_raw_input
import srsly
from ..compat import path2str
@@ -27,7 +27,6 @@ def package(input_dir, output_dir, meta_path=None, create_meta=False, force=False):
set and a meta.json already exists in the output directory, the existing
values will be used as the defaults in the command-line prompt.
"""
-msg = Printer()
input_path = util.ensure_path(input_dir)
output_path = util.ensure_path(output_dir)
meta_path = util.ensure_path(meta_path)
@@ -133,7 +133,6 @@ def pretrain(
for key in config:
if isinstance(config[key], Path):
config[key] = str(config[key])
-msg = Printer()
util.fix_random_seed(seed)
if gpu_id != -1:
has_gpu = require_gpu(gpu_id=gpu_id)
@@ -272,7 +271,7 @@ def make_update(model, docs, optimizer, drop=0.0, objective="L2"):
"""Perform an update over a single batch of documents.
docs (iterable): A batch of `Doc` objects.
-drop (float): The droput rate.
+drop (float): The dropout rate.
optimizer (callable): An optimizer.
RETURNS loss: A float for the loss.
"""
@@ -9,7 +9,7 @@ import pstats
import sys
import itertools
import thinc.extra.datasets
-from wasabi import Printer
+from wasabi import msg
from ..util import load_model
@@ -26,7 +26,6 @@ def profile(model, inputs=None, n_texts=10000):
It can either be provided as a JSONL file, or be read from sys.stdin.
If no input file is specified, the IMDB dataset is loaded via Thinc.
"""
-msg = Printer()
if inputs is not None:
inputs = _read_inputs(inputs, msg)
if inputs is None:
@@ -8,7 +8,7 @@ from thinc.neural._classes.model import Model
from timeit import default_timer as timer
import shutil
import srsly
-from wasabi import Printer
+from wasabi import msg
import contextlib
import random
from thinc.neural.util import require_gpu
@@ -92,7 +92,6 @@ def train(
# temp fix to avoid import issues cf https://github.com/explosion/spaCy/issues/4200
import tqdm
-msg = Printer()
util.fix_random_seed()
util.set_env_log(verbose)
@@ -159,8 +158,7 @@ def train(
"`lang` argument ('{}') ".format(nlp.lang, lang),
exits=1,
)
-other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipeline]
-nlp.disable_pipes(*other_pipes)
+nlp.disable_pipes([p for p in nlp.pipe_names if p not in pipeline])
for pipe in pipeline:
if pipe not in nlp.pipe_names:
if pipe == "parser":
@@ -266,7 +264,11 @@ def train(
exits=1,
)
train_docs = corpus.train_docs(
-nlp, noise_level=noise_level, gold_preproc=gold_preproc, max_length=0
+nlp,
+noise_level=noise_level,
+gold_preproc=gold_preproc,
+max_length=0,
+ignore_misaligned=True,
)
train_labels = set()
if textcat_multilabel:
@@ -347,6 +349,7 @@ def train(
orth_variant_level=orth_variant_level,
gold_preproc=gold_preproc,
max_length=0,
+ignore_misaligned=True,
)
if raw_text:
random.shuffle(raw_text)
@@ -385,7 +388,11 @@ def train(
if hasattr(component, "cfg"):
component.cfg["beam_width"] = beam_width
dev_docs = list(
-corpus.dev_docs(nlp_loaded, gold_preproc=gold_preproc)
+corpus.dev_docs(
+nlp_loaded,
+gold_preproc=gold_preproc,
+ignore_misaligned=True,
+)
)
nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs)
start_time = timer()
@@ -402,7 +409,11 @@ def train(
if hasattr(component, "cfg"):
component.cfg["beam_width"] = beam_width
dev_docs = list(
-corpus.dev_docs(nlp_loaded, gold_preproc=gold_preproc)
+corpus.dev_docs(
+nlp_loaded,
+gold_preproc=gold_preproc,
+ignore_misaligned=True,
+)
)
start_time = timer()
scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose)
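For context, a hedged sketch of how a training loop would consume the corpus reader with the new ignore_misaligned flag used above; the file names are placeholders and the snippet assumes spaCy v2.2 with training data already converted to spaCy's JSON format:

import spacy
from spacy.gold import GoldCorpus

nlp = spacy.blank("en")
# Placeholder paths to converted training and development data
corpus = GoldCorpus("train.json", "dev.json")

# Examples whose tokens cannot be aligned to nlp's tokenization are now
# skipped instead of raising, mirroring the ignore_misaligned=True calls above.
train_docs = corpus.train_docs(
    nlp,
    noise_level=0.0,
    gold_preproc=False,
    max_length=0,
    ignore_misaligned=True,
)
for doc, gold in train_docs:
    pass  # feed (doc, gold) pairs into nlp.update() as usual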
@@ -1,12 +1,11 @@
# coding: utf8
from __future__ import unicode_literals, print_function
-import pkg_resources
from pathlib import Path
import sys
import requests
import srsly
-from wasabi import Printer
+from wasabi import msg
from ..compat import path2str
from ..util import get_data_path
@@ -18,7 +17,6 @@ def validate():
Validate that the currently installed version of spaCy is compatible
with the installed models. Should be run after `pip install -U spacy`.
"""
-msg = Printer()
with msg.loading("Loading compatibility table..."):
r = requests.get(about.__compatibility__)
if r.status_code != 200:
@@ -109,6 +107,8 @@ def get_model_links(compat):
def get_model_pkgs(compat, all_models):
+import pkg_resources
pkgs = {}
for pkg_name, pkg_data in pkg_resources.working_set.by_key.items():
package = pkg_name.replace("-", "_")
@@ -12,6 +12,7 @@ import os
import sys
import itertools
import ast
+import types
from thinc.neural.util import copy_array
@@ -62,6 +63,7 @@ if is_python2:
basestring_ = basestring # noqa: F821
input_ = raw_input # noqa: F821
path2str = lambda path: str(path).decode("utf8")
+class_types = (type, types.ClassType)
elif is_python3:
bytes_ = bytes
@@ -69,6 +71,7 @@ elif is_python3:
basestring_ = str
input_ = input
path2str = lambda path: str(path)
+class_types = (type, types.ClassType) if is_python_pre_3_5 else type
def b_to_str(b_str):
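The new class_types tuple exists so isinstance() checks against classes behave the same on Python 2 (where old-style classes are types.ClassType) and Python 3. A small usage sketch:

from spacy.compat import class_types

class Example(object):
    pass

# True for class objects, False for instances, on both Python 2 and 3
print(isinstance(Example, class_types))    # True
print(isinstance(Example(), class_types))  # False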
@@ -5,7 +5,7 @@ import uuid
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
-from ..util import minify_html, escape_html, get_entry_points, ENTRY_POINTS
+from ..util import minify_html, escape_html, registry
from ..errors import Errors
@@ -242,7 +242,7 @@ class EntityRenderer(object):
"CARDINAL": "#e4e7d2",
"PERCENT": "#e4e7d2",
}
-user_colors = get_entry_points(ENTRY_POINTS.displacy_colors)
+user_colors = registry.displacy_colors.get_all()
for user_color in user_colors.values():
colors.update(user_color)
colors.update(options.get("colors", {}))
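displaCy now resolves user-defined entity colors through spacy.util.registry.displacy_colors (a catalogue-backed registry) rather than scanning entry points directly. A hedged sketch of registering extra colors in-process; the registry entry name and the color values are invented, and installed packages would normally expose the same dict via the spacy_displacy_colors entry point group:

from spacy.util import registry

# The registered object is a plain dict mapping entity label -> CSS color.
# "my_colors" is an arbitrary example name.
registry.displacy_colors.register(
    "my_colors", func={"FRUIT": "#ffbb66", "VEGETABLE": "#aadd88"}
)

# EntityRenderer picks these up via registry.displacy_colors.get_all(),
# as shown in the diff above, and merges them into its default palette.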
@@ -44,14 +44,14 @@ TPL_ENTS = """
TPL_ENT = """
-<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
+<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
{text}
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">{label}</span>
</mark>
"""
TPL_ENT_RTL = """
-<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
+<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em">
{text}
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-right: 0.5rem">{label}</span>
</mark>
@@ -80,8 +80,8 @@ class Warnings(object):
"the v2.x models cannot release the global interpreter lock. "
"Future versions may introduce a `n_process` argument for "
"parallel inference via multiprocessing.")
-W017 = ("Alias '{alias}' already exists in the Knowledge base.")
-W018 = ("Entity '{entity}' already exists in the Knowledge base.")
+W017 = ("Alias '{alias}' already exists in the Knowledge Base.")
+W018 = ("Entity '{entity}' already exists in the Knowledge Base.")
W019 = ("Changing vectors name from {old} to {new}, to avoid clash with "
"previously loaded vectors. See Issue #3853.")
W020 = ("Unnamed vectors. This won't allow multiple vectors models to be "
@@ -95,6 +95,12 @@ class Warnings(object):
"you can ignore this warning by setting SPACY_WARNING_IGNORE=W022. "
"If this is surprising, make sure you have the spacy-lookups-data "
"package installed.")
W023 = ("Multiprocessing of Language.pipe is not supported in Python 2. "
"'n_process' will be set to 1.")
W024 = ("Entity '{entity}' - Alias '{alias}' combination already exists in "
"the Knowledge Base.")
W025 = ("'{name}' requires '{attr}' to be assigned, but none of the "
"previous components in the pipeline declare that they assign it.")
@add_codes
@@ -407,7 +413,7 @@ class Errors(object):
"{probabilities_length} respectively.")
E133 = ("The sum of prior probabilities for alias '{alias}' should not "
"exceed 1, but found {sum}.")
-E134 = ("Alias '{alias}' defined for unknown entity '{entity}'.")
+E134 = ("Entity '{entity}' is not defined in the Knowledge Base.")
E135 = ("If you meant to replace a built-in component, use `create_pipe`: "
"`nlp.replace_pipe('{name}', nlp.create_pipe('{name}'))`")
E136 = ("This additional feature requires the jsonschema library to be "
@@ -419,7 +425,7 @@ class Errors(object):
E138 = ("Invalid JSONL format for raw text '{text}'. Make sure the input "
"includes either the `text` or `tokens` key. For more info, see "
"the docs:\nhttps://spacy.io/api/cli#pretrain-jsonl")
-E139 = ("Knowledge base for component '{name}' not initialized. Did you "
+E139 = ("Knowledge Base for component '{name}' not initialized. Did you "
"forget to call set_kb()?")
E140 = ("The list of entities, prior probabilities and entity vectors "
"should be of equal length.")
@@ -495,6 +501,34 @@ class Errors(object):
E173 = ("As of v2.2, the Lemmatizer is initialized with an instance of "
"Lookups containing the lemmatization tables. See the docs for "
"details: https://spacy.io/api/lemmatizer#init")
E174 = ("Architecture '{name}' not found in registry. Available "
"names: {names}")
E175 = ("Can't remove rule for unknown match pattern ID: {key}")
E176 = ("Alias '{alias}' is not defined in the Knowledge Base.")
E177 = ("Ill-formed IOB input detected: {tag}")
E178 = ("Invalid pattern. Expected list of dicts but got: {pat}. Maybe you "
"accidentally passed a single pattern to Matcher.add instead of a "
"list of patterns? If you only want to add one pattern, make sure "
"to wrap it in a list. For example: matcher.add('{key}', [pattern])")
E179 = ("Invalid pattern. Expected a list of Doc objects but got a single "
"Doc. If you only want to add one pattern, make sure to wrap it "
"in a list. For example: matcher.add('{key}', [doc])")
E180 = ("Span attributes can't be declared as required or assigned by "
"components, since spans are only views of the Doc. Use Doc and "
"Token attributes (or custom extension attributes) only and remove "
"the following: {attrs}")
E181 = ("Received invalid attributes for unknown object {obj}: {attrs}. "
"Only Doc and Token attributes are supported.")
E182 = ("Received invalid attribute declaration: {attr}\nDid you forget "
"to define the attribute? For example: {attr}.???")
E183 = ("Received invalid attribute declaration: {attr}\nOnly top-level "
"attributes are supported, for example: {solution}")
E184 = ("Only attributes without underscores are supported in component "
"attribute declarations (because underscore and non-underscore "
"attributes are connected anyways): {attr} -> {solution}")
E185 = ("Received invalid attribute in component attribute declaration: "
"{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.")
E186 = ("'{tok_a}' and '{tok_b}' are different texts.")
@add_codes
@@ -527,6 +561,10 @@ class MatchPatternError(ValueError):
ValueError.__init__(self, msg)
+class AlignmentError(ValueError):
+pass
class ModelsWarning(UserWarning):
pass
@@ -80,7 +80,7 @@ GLOSSARY = {
"RBR": "adverb, comparative",
"RBS": "adverb, superlative",
"RP": "adverb, particle",
-"TO": "infinitival to",
+"TO": 'infinitival "to"',
"UH": "interjection",
"VB": "verb, base form",
"VBD": "verb, past tense",
@@ -279,6 +279,12 @@ GLOSSARY = {
"re": "repeated element",
"rs": "reported speech",
"sb": "subject",
+"sb": "subject",
+"sbp": "passivized subject (PP)",
+"sp": "subject or predicate",
+"svp": "separable verb prefix",
+"uc": "unit component",
+"vo": "vocative",
# Named Entity Recognition
# OntoNotes 5
# https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf
@@ -11,10 +11,9 @@ import itertools
from pathlib import Path
import srsly
-from . import _align
from .syntax import nonproj
from .tokens import Doc, Span
-from .errors import Errors
+from .errors import Errors, AlignmentError
from .compat import path2str
from . import util
from .util import minibatch, itershuffle
@@ -22,6 +21,7 @@ from .util import minibatch, itershuffle
from libc.stdio cimport FILE, fopen, fclose, fread, fwrite, feof, fseek
+USE_NEW_ALIGN = False
punct_re = re.compile(r"\W")
@@ -73,7 +73,21 @@ def merge_sents(sents):
return [(m_deps, (m_cats, m_brackets))]
-def align(tokens_a, tokens_b):
+_ALIGNMENT_NORM_MAP = [("``", "'"), ("''", "'"), ('"', "'"), ("`", "'")]
+def _normalize_for_alignment(tokens):
+tokens = [w.replace(" ", "").lower() for w in tokens]
+output = []
+for token in tokens:
+token = token.replace(" ", "").lower()
+for before, after in _ALIGNMENT_NORM_MAP:
+token = token.replace(before, after)
+output.append(token)
+return output
+def _align_before_v2_2_2(tokens_a, tokens_b):
"""Calculate alignment tables between two tokenizations, using the Levenshtein
algorithm. The alignment is case-insensitive.
@@ -92,6 +106,7 @@ def align(tokens_a, tokens_b):
* b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other
direction.
"""
+from . import _align
if tokens_a == tokens_b:
alignment = numpy.arange(len(tokens_a))
return 0, alignment, alignment, {}, {}
@@ -111,6 +126,82 @@ def align(tokens_a, tokens_b):
return cost, i2j, j2i, i2j_multi, j2i_multi
def align(tokens_a, tokens_b):
"""Calculate alignment tables between two tokenizations.
tokens_a (List[str]): The candidate tokenization.
tokens_b (List[str]): The reference tokenization.
RETURNS: (tuple): A 5-tuple consisting of the following information:
* cost (int): The number of misaligned tokens.
* a2b (List[int]): Mapping of indices in `tokens_a` to indices in `tokens_b`.
For instance, if `a2b[4] == 6`, that means that `tokens_a[4]` aligns
to `tokens_b[6]`. If there's no one-to-one alignment for a token,
it has the value -1.
* b2a (List[int]): The same as `a2b`, but mapping the other direction.
* a2b_multi (Dict[int, int]): A dictionary mapping indices in `tokens_a`
to indices in `tokens_b`, where multiple tokens of `tokens_a` align to
the same token of `tokens_b`.
* b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other
direction.
"""
if not USE_NEW_ALIGN:
return _align_before_v2_2_2(tokens_a, tokens_b)
tokens_a = _normalize_for_alignment(tokens_a)
tokens_b = _normalize_for_alignment(tokens_b)
cost = 0
a2b = numpy.empty(len(tokens_a), dtype="i")
b2a = numpy.empty(len(tokens_b), dtype="i")
a2b_multi = {}
b2a_multi = {}
i = 0
j = 0
offset_a = 0
offset_b = 0
while i < len(tokens_a) and j < len(tokens_b):
a = tokens_a[i][offset_a:]
b = tokens_b[j][offset_b:]
a2b[i] = b2a[j] = -1
if a == b:
if offset_a == offset_b == 0:
a2b[i] = j
b2a[j] = i
elif offset_a == 0:
cost += 2
a2b_multi[i] = j
elif offset_b == 0:
cost += 2
b2a_multi[j] = i
offset_a = offset_b = 0
i += 1
j += 1
elif a == "":
assert offset_a == 0
cost += 1
i += 1
elif b == "":
assert offset_b == 0
cost += 1
j += 1
elif b.startswith(a):
cost += 1
if offset_a == 0:
a2b_multi[i] = j
i += 1
offset_a = 0
offset_b += len(a)
elif a.startswith(b):
cost += 1
if offset_b == 0:
b2a_multi[j] = i
j += 1
offset_b = 0
offset_a += len(b)
else:
assert "".join(tokens_a) != "".join(tokens_b)
raise AlignmentError(Errors.E186.format(tok_a=tokens_a, tok_b=tokens_b))
return cost, a2b, b2a, a2b_multi, b2a_multi
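A short sketch of calling the alignment helper on two tokenizations of the same text; the token lists are invented examples, and the exact cost and multi-alignment output depend on which code path is active (USE_NEW_ALIGN is False in this commit, so the pre-v2.2.2 Levenshtein version runs):

from spacy.gold import align

cand = ["I", "listened", "to", "obama", "'", "s", "podcasts", "."]
gold = ["I", "listened", "to", "obama", "'s", "podcasts", "."]

cost, a2b, b2a, a2b_multi, b2a_multi = align(cand, gold)
# cost: number of misaligned tokens
# a2b / b2a: one-to-one index mappings, -1 where no one-to-one match exists
# a2b_multi / b2a_multi: many-to-one mappings, e.g. "'" and "s" both map to "'s"
print(cost, list(a2b), a2b_multi)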
class GoldCorpus(object):
"""An annotated corpus, using the JSON file format. Manages
annotations for tagging, dependency parsing and NER.
@@ -176,6 +267,11 @@ class GoldCorpus(object):
gold_tuples = read_json_file(loc)
elif loc.parts[-1].endswith("jsonl"):
gold_tuples = srsly.read_jsonl(loc)
+first_gold_tuple = next(gold_tuples)
+gold_tuples = itertools.chain([first_gold_tuple], gold_tuples)
+# TODO: proper format checks with schemas
+if isinstance(first_gold_tuple, dict):
+gold_tuples = read_json_object(gold_tuples)
elif loc.parts[-1].endswith("msg"):
gold_tuples = srsly.read_msgpack(loc)
else:
@@ -209,7 +305,8 @@ class GoldCorpus(object):
return n
def train_docs(self, nlp, gold_preproc=False, max_length=None,
-noise_level=0.0, orth_variant_level=0.0):
+noise_level=0.0, orth_variant_level=0.0,
+ignore_misaligned=False):
locs = list((self.tmp_dir / 'train').iterdir())
random.shuffle(locs)
train_tuples = self.read_tuples(locs, limit=self.limit)
@@ -217,20 +314,23 @@
max_length=max_length,
noise_level=noise_level,
orth_variant_level=orth_variant_level,
-make_projective=True)
+make_projective=True,
+ignore_misaligned=ignore_misaligned)
yield from gold_docs
def train_docs_without_preprocessing(self, nlp, gold_preproc=False):
gold_docs = self.iter_gold_docs(nlp, self.train_tuples, gold_preproc=gold_preproc)
yield from gold_docs
-def dev_docs(self, nlp, gold_preproc=False):
-gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc=gold_preproc)
+def dev_docs(self, nlp, gold_preproc=False, ignore_misaligned=False):
+gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc=gold_preproc,
+ignore_misaligned=ignore_misaligned)
yield from gold_docs
@classmethod
def iter_gold_docs(cls, nlp, tuples, gold_preproc, max_length=None,
-noise_level=0.0, orth_variant_level=0.0, make_projective=False):
+noise_level=0.0, orth_variant_level=0.0, make_projective=False,
+ignore_misaligned=False):
for raw_text, paragraph_tuples in tuples:
if gold_preproc:
raw_text = None
@@ -239,10 +339,12 @@
docs, paragraph_tuples = cls._make_docs(nlp, raw_text,
paragraph_tuples, gold_preproc, noise_level=noise_level,
orth_variant_level=orth_variant_level)
-golds = cls._make_golds(docs, paragraph_tuples, make_projective)
+golds = cls._make_golds(docs, paragraph_tuples, make_projective,
+ignore_misaligned=ignore_misaligned)
for doc, gold in zip(docs, golds):
-if (not max_length) or len(doc) < max_length:
-yield doc, gold
+if gold is not None:
+if (not max_length) or len(doc) < max_length:
+yield doc, gold
@classmethod
def _make_docs(cls, nlp, raw_text, paragraph_tuples, gold_preproc, noise_level=0.0, orth_variant_level=0.0):
@@ -257,14 +359,22 @@
@classmethod
-def _make_golds(cls, docs, paragraph_tuples, make_projective):
+def _make_golds(cls, docs, paragraph_tuples, make_projective, ignore_misaligned=False):
if len(docs) != len(paragraph_tuples):
n_annots = len(paragraph_tuples)
raise ValueError(Errors.E070.format(n_docs=len(docs), n_annots=n_annots))
-return [GoldParse.from_annot_tuples(doc, sent_tuples, cats=cats,
-make_projective=make_projective)
-for doc, (sent_tuples, (cats, brackets))
-in zip(docs, paragraph_tuples)]
+golds = []
+for doc, (sent_tuples, (cats, brackets)) in zip(docs, paragraph_tuples):
+try:
+gold = GoldParse.from_annot_tuples(doc, sent_tuples, cats=cats,
make_projective=make_projective)
except AlignmentError:
if ignore_misaligned:
gold = None
else:
raise
golds.append(gold)
return golds
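The effect of the flag in isolation: with ignore_misaligned=False a badly misaligned example now raises the new AlignmentError, while True silently drops it (the gold is None and the pair is never yielded). A hedged sketch, with placeholder corpus paths:

import spacy
from spacy.errors import AlignmentError
from spacy.gold import GoldCorpus

nlp = spacy.blank("en")
corpus = GoldCorpus("train.json", "dev.json")  # placeholder paths

try:
    dev_docs = list(corpus.dev_docs(nlp, gold_preproc=False))
except AlignmentError:
    # Retry, skipping examples whose tokens cannot be aligned
    dev_docs = list(
        corpus.dev_docs(nlp, gold_preproc=False, ignore_misaligned=True)
    )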
def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0):
@@ -494,7 +604,6 @@ def _json_iterate(loc):
def iob_to_biluo(tags):
out = []
-curr_label = None
tags = list(tags)
while tags:
out.extend(_consume_os(tags))
@@ -519,6 +628,8 @@ def _consume_ent(tags):
tags.pop(0)
label = tag[2:]
if length == 1:
+if len(label) == 0:
+raise ValueError(Errors.E177.format(tag=tag))
return ["U-" + label]
else:
start = "B-" + label
@@ -542,7 +653,7 @@ cdef class GoldParse:
def __init__(self, doc, annot_tuples=None, words=None, tags=None, morphology=None,
heads=None, deps=None, entities=None, make_projective=False,
cats=None, links=None, **_):
-"""Create a GoldParse.
+"""Create a GoldParse. The fields will not be initialized if len(doc) is zero.
doc (Doc): The document the annotations refer to.
words (iterable): A sequence of unicode word strings.
@ -571,138 +682,144 @@ cdef class GoldParse:
negative examples respectively. negative examples respectively.
RETURNS (GoldParse): The newly constructed object. RETURNS (GoldParse): The newly constructed object.
""" """
if words is None:
words = [token.text for token in doc]
if tags is None:
tags = [None for _ in words]
if heads is None:
heads = [None for _ in words]
if deps is None:
deps = [None for _ in words]
if morphology is None:
morphology = [None for _ in words]
if entities is None:
entities = ["-" for _ in doc]
elif len(entities) == 0:
entities = ["O" for _ in doc]
else:
# Translate the None values to '-', to make processing easier.
# See Issue #2603
entities = [(ent if ent is not None else "-") for ent in entities]
if not isinstance(entities[0], basestring):
# Assume we have entities specified by character offset.
entities = biluo_tags_from_offsets(doc, entities)
self.mem = Pool() self.mem = Pool()
self.loss = 0 self.loss = 0
self.length = len(doc) self.length = len(doc)
# These are filled by the tagger/parser/entity recogniser
self.c.tags = <int*>self.mem.alloc(len(doc), sizeof(int))
self.c.heads = <int*>self.mem.alloc(len(doc), sizeof(int))
self.c.labels = <attr_t*>self.mem.alloc(len(doc), sizeof(attr_t))
self.c.has_dep = <int*>self.mem.alloc(len(doc), sizeof(int))
self.c.sent_start = <int*>self.mem.alloc(len(doc), sizeof(int))
self.c.ner = <Transition*>self.mem.alloc(len(doc), sizeof(Transition))
self.cats = {} if cats is None else dict(cats) self.cats = {} if cats is None else dict(cats)
self.links = links self.links = links
self.words = [None] * len(doc)
self.tags = [None] * len(doc)
self.heads = [None] * len(doc)
self.labels = [None] * len(doc)
self.ner = [None] * len(doc)
self.morphology = [None] * len(doc)
# This needs to be done before we align the words # avoid allocating memory if the doc does not contain any tokens
if make_projective and heads is not None and deps is not None: if self.length > 0:
heads, deps = nonproj.projectivize(heads, deps) if words is None:
words = [token.text for token in doc]
# Do many-to-one alignment for misaligned tokens. if tags is None:
# If we over-segment, we'll have one gold word that covers a sequence tags = [None for _ in words]
# of predicted words if heads is None:
# If we under-segment, we'll have one predicted word that covers a heads = [None for _ in words]
# sequence of gold words. if deps is None:
# If we "mis-segment", we'll have a sequence of predicted words covering deps = [None for _ in words]
# a sequence of gold words. That's many-to-many -- we don't do that. if morphology is None:
cost, i2j, j2i, i2j_multi, j2i_multi = align([t.orth_ for t in doc], words) morphology = [None for _ in words]
if entities is None:
self.cand_to_gold = [(j if j >= 0 else None) for j in i2j] entities = ["-" for _ in words]
self.gold_to_cand = [(i if i >= 0 else None) for i in j2i] elif len(entities) == 0:
entities = ["O" for _ in words]
annot_tuples = (range(len(words)), words, tags, heads, deps, entities)
self.orig_annot = list(zip(*annot_tuples))
for i, gold_i in enumerate(self.cand_to_gold):
if doc[i].text.isspace():
self.words[i] = doc[i].text
self.tags[i] = "_SP"
self.heads[i] = None
self.labels[i] = None
self.ner[i] = None
self.morphology[i] = set()
if gold_i is None:
if i in i2j_multi:
self.words[i] = words[i2j_multi[i]]
self.tags[i] = tags[i2j_multi[i]]
self.morphology[i] = morphology[i2j_multi[i]]
is_last = i2j_multi[i] != i2j_multi.get(i+1)
is_first = i2j_multi[i] != i2j_multi.get(i-1)
# Set next word in multi-token span as head, until last
if not is_last:
self.heads[i] = i+1
self.labels[i] = "subtok"
else:
self.heads[i] = self.gold_to_cand[heads[i2j_multi[i]]]
self.labels[i] = deps[i2j_multi[i]]
# Now set NER...This is annoying because if we've split
# got an entity word split into two, we need to adjust the
# BILUO tags. We can't have BB or LL etc.
# Case 1: O -- easy.
ner_tag = entities[i2j_multi[i]]
if ner_tag == "O":
self.ner[i] = "O"
# Case 2: U. This has to become a B I* L sequence.
elif ner_tag.startswith("U-"):
if is_first:
self.ner[i] = ner_tag.replace("U-", "B-", 1)
elif is_last:
self.ner[i] = ner_tag.replace("U-", "L-", 1)
else:
self.ner[i] = ner_tag.replace("U-", "I-", 1)
# Case 3: L. If not last, change to I.
elif ner_tag.startswith("L-"):
if is_last:
self.ner[i] = ner_tag
else:
self.ner[i] = ner_tag.replace("L-", "I-", 1)
# Case 4: I. Stays correct
elif ner_tag.startswith("I-"):
self.ner[i] = ner_tag
else: else:
self.words[i] = words[gold_i] # Translate the None values to '-', to make processing easier.
self.tags[i] = tags[gold_i] # See Issue #2603
self.morphology[i] = morphology[gold_i] entities = [(ent if ent is not None else "-") for ent in entities]
if heads[gold_i] is None: if not isinstance(entities[0], basestring):
# Assume we have entities specified by character offset.
entities = biluo_tags_from_offsets(doc, entities)
# These are filled by the tagger/parser/entity recogniser
self.c.tags = <int*>self.mem.alloc(len(doc), sizeof(int))
self.c.heads = <int*>self.mem.alloc(len(doc), sizeof(int))
self.c.labels = <attr_t*>self.mem.alloc(len(doc), sizeof(attr_t))
self.c.has_dep = <int*>self.mem.alloc(len(doc), sizeof(int))
self.c.sent_start = <int*>self.mem.alloc(len(doc), sizeof(int))
self.c.ner = <Transition*>self.mem.alloc(len(doc), sizeof(Transition))
self.words = [None] * len(doc)
self.tags = [None] * len(doc)
self.heads = [None] * len(doc)
self.labels = [None] * len(doc)
self.ner = [None] * len(doc)
self.morphology = [None] * len(doc)
# This needs to be done before we align the words
if make_projective and heads is not None and deps is not None:
heads, deps = nonproj.projectivize(heads, deps)
# Do many-to-one alignment for misaligned tokens.
# If we over-segment, we'll have one gold word that covers a sequence
# of predicted words
# If we under-segment, we'll have one predicted word that covers a
# sequence of gold words.
# If we "mis-segment", we'll have a sequence of predicted words covering
# a sequence of gold words. That's many-to-many -- we don't do that.
cost, i2j, j2i, i2j_multi, j2i_multi = align([t.orth_ for t in doc], words)
self.cand_to_gold = [(j if j >= 0 else None) for j in i2j]
self.gold_to_cand = [(i if i >= 0 else None) for i in j2i]
annot_tuples = (range(len(words)), words, tags, heads, deps, entities)
self.orig_annot = list(zip(*annot_tuples))
for i, gold_i in enumerate(self.cand_to_gold):
if doc[i].text.isspace():
self.words[i] = doc[i].text
self.tags[i] = "_SP"
self.heads[i] = None self.heads[i] = None
self.labels[i] = None
self.ner[i] = None
self.morphology[i] = set()
if gold_i is None:
if i in i2j_multi:
self.words[i] = words[i2j_multi[i]]
self.tags[i] = tags[i2j_multi[i]]
self.morphology[i] = morphology[i2j_multi[i]]
is_last = i2j_multi[i] != i2j_multi.get(i+1)
is_first = i2j_multi[i] != i2j_multi.get(i-1)
# Set next word in multi-token span as head, until last
if not is_last:
self.heads[i] = i+1
self.labels[i] = "subtok"
else:
head_i = heads[i2j_multi[i]]
if head_i:
self.heads[i] = self.gold_to_cand[head_i]
self.labels[i] = deps[i2j_multi[i]]
# Now set NER...This is annoying because if we've split
# got an entity word split into two, we need to adjust the
# BILUO tags. We can't have BB or LL etc.
# Case 1: O -- easy.
ner_tag = entities[i2j_multi[i]]
if ner_tag == "O":
self.ner[i] = "O"
# Case 2: U. This has to become a B I* L sequence.
elif ner_tag.startswith("U-"):
if is_first:
self.ner[i] = ner_tag.replace("U-", "B-", 1)
elif is_last:
self.ner[i] = ner_tag.replace("U-", "L-", 1)
else:
self.ner[i] = ner_tag.replace("U-", "I-", 1)
# Case 3: L. If not last, change to I.
elif ner_tag.startswith("L-"):
if is_last:
self.ner[i] = ner_tag
else:
self.ner[i] = ner_tag.replace("L-", "I-", 1)
# Case 4: I. Stays correct
elif ner_tag.startswith("I-"):
self.ner[i] = ner_tag
else: else:
self.heads[i] = self.gold_to_cand[heads[gold_i]] self.words[i] = words[gold_i]
self.labels[i] = deps[gold_i] self.tags[i] = tags[gold_i]
self.ner[i] = entities[gold_i] self.morphology[i] = morphology[gold_i]
if heads[gold_i] is None:
self.heads[i] = None
else:
self.heads[i] = self.gold_to_cand[heads[gold_i]]
self.labels[i] = deps[gold_i]
self.ner[i] = entities[gold_i]
# Prevent whitespace that isn't within entities from being tagged as # Prevent whitespace that isn't within entities from being tagged as
# an entity. # an entity.
for i in range(len(self.ner)): for i in range(len(self.ner)):
if self.tags[i] == "_SP": if self.tags[i] == "_SP":
prev_ner = self.ner[i-1] if i >= 1 else None prev_ner = self.ner[i-1] if i >= 1 else None
next_ner = self.ner[i+1] if (i+1) < len(self.ner) else None next_ner = self.ner[i+1] if (i+1) < len(self.ner) else None
if prev_ner == "O" or next_ner == "O": if prev_ner == "O" or next_ner == "O":
self.ner[i] = "O" self.ner[i] = "O"
cycle = nonproj.contains_cycle(self.heads) cycle = nonproj.contains_cycle(self.heads)
if cycle is not None: if cycle is not None:
raise ValueError(Errors.E069.format(cycle=cycle, raise ValueError(Errors.E069.format(cycle=cycle,
cycle_tokens=" ".join(["'{}'".format(self.words[tok_id]) for tok_id in cycle]), cycle_tokens=" ".join(["'{}'".format(self.words[tok_id]) for tok_id in cycle]),
doc_tokens=" ".join(words[:50]))) doc_tokens=" ".join(words[:50])))
def __len__(self):
"""Get the number of gold-standard tokens.
@@ -740,7 +857,8 @@ def docs_to_json(docs, id=0):
docs (iterable / Doc): The Doc object(s) to convert.
id (int): Id for the JSON.
-RETURNS (list): The data in spaCy's JSON format.
+RETURNS (dict): The data in spaCy's JSON format
+- each input doc will be treated as a paragraph in the output doc
"""
if isinstance(docs, Doc):
docs = [docs]
@@ -795,7 +913,7 @@ def biluo_tags_from_offsets(doc, entities, missing="O"):
"""
# Ensure no overlapping entity labels exist
tokens_in_ents = {}
starts = {token.idx: token.i for token in doc}
ends = {token.idx + len(token): token.i for token in doc}
biluo = ["-" for _ in doc]
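The GoldParse changes above put all field initialization behind the new length guard, so building a gold parse for a zero-token Doc no longer allocates annotation arrays. A minimal sketch:

import spacy
from spacy.gold import GoldParse

nlp = spacy.blank("en")
empty_doc = nlp.make_doc("")   # a Doc with zero tokens
gold = GoldParse(empty_doc)    # no annotation fields are initialized
print(len(gold))               # 0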
@@ -142,6 +142,7 @@ cdef class KnowledgeBase:
i = 0
cdef KBEntryC entry
+cdef hash_t entity_hash
while i < nr_entities:
entity_vector = vector_list[i]
if len(entity_vector) != self.entity_vector_length:
@@ -161,6 +162,14 @@
i += 1
def contains_entity(self, unicode entity):
cdef hash_t entity_hash = self.vocab.strings.add(entity)
return entity_hash in self._entry_index
def contains_alias(self, unicode alias):
cdef hash_t alias_hash = self.vocab.strings.add(alias)
return alias_hash in self._alias_index
def add_alias(self, unicode alias, entities, probabilities):
"""
For a given alias, add its potential entities and prior probabilities to the KB.
@@ -190,7 +199,7 @@ cdef class KnowledgeBase:
for entity, prob in zip(entities, probabilities):
entity_hash = self.vocab.strings[entity]
if not entity_hash in self._entry_index:
-raise ValueError(Errors.E134.format(alias=alias, entity=entity))
+raise ValueError(Errors.E134.format(entity=entity))
entry_index = <int64_t>self._entry_index.get(entity_hash)
entry_indices.push_back(int(entry_index))
@@ -201,8 +210,63 @@
return alias_hash
-def get_candidates(self, unicode alias):
+def append_alias(self, unicode alias, unicode entity, float prior_prob, ignore_warnings=False):
"""
For an alias already existing in the KB, extend its potential entities with one more.
Throw a warning if either the alias or the entity is unknown,
or when the combination is already previously recorded.
Throw an error if this entity+prior prob would exceed the sum of 1.
For efficiency, it's best to use the method `add_alias` as much as possible instead of this one.
"""
# Check if the alias exists in the KB
cdef hash_t alias_hash = self.vocab.strings[alias]
if not alias_hash in self._alias_index:
raise ValueError(Errors.E176.format(alias=alias))
# Check if the entity exists in the KB
cdef hash_t entity_hash = self.vocab.strings[entity]
if not entity_hash in self._entry_index:
raise ValueError(Errors.E134.format(entity=entity))
entry_index = <int64_t>self._entry_index.get(entity_hash)
# Throw an error if the prior probabilities (including the new one) sum up to more than 1
alias_index = <int64_t>self._alias_index.get(alias_hash)
alias_entry = self._aliases_table[alias_index]
current_sum = sum([p for p in alias_entry.probs])
new_sum = current_sum + prior_prob
if new_sum > 1.00001:
raise ValueError(Errors.E133.format(alias=alias, sum=new_sum))
entry_indices = alias_entry.entry_indices
is_present = False
for i in range(entry_indices.size()):
if entry_indices[i] == int(entry_index):
is_present = True
if is_present:
if not ignore_warnings:
user_warning(Warnings.W024.format(entity=entity, alias=alias))
else:
entry_indices.push_back(int(entry_index))
alias_entry.entry_indices = entry_indices
probs = alias_entry.probs
probs.push_back(float(prior_prob))
alias_entry.probs = probs
self._aliases_table[alias_index] = alias_entry
def get_candidates(self, unicode alias):
"""
Return candidate entities for an alias. Each candidate defines the entity, the original alias,
and the prior probability of that alias resolving to that entity.
If the alias is not known in the KB, an empty list is returned.
"""
cdef hash_t alias_hash = self.vocab.strings[alias]
if not alias_hash in self._alias_index:
return []
alias_index = <int64_t>self._alias_index.get(alias_hash)
alias_entry = self._aliases_table[alias_index]
@@ -341,7 +405,6 @@ cdef class KnowledgeBase:
assert nr_entities == self.get_size_entities()
# STEP 3: load aliases
cdef int64_t nr_aliases
reader.read_alias_length(&nr_aliases)
self._alias_index = PreshMap(nr_aliases+1)
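An end-to-end sketch of the KnowledgeBase additions above (contains_entity, contains_alias and append_alias); the entity IDs, frequencies, vectors and probabilities are invented for illustration:

from spacy.kb import KnowledgeBase
from spacy.vocab import Vocab

vocab = Vocab()
kb = KnowledgeBase(vocab=vocab, entity_vector_length=3)

kb.add_entity(entity="Q42", freq=342, entity_vector=[1, 9, -3])
kb.add_entity(entity="Q1004791", freq=6, entity_vector=[0, 3, 5])
kb.add_alias(alias="Douglas", entities=["Q42"], probabilities=[0.8])

print(kb.contains_entity("Q42"))     # True
print(kb.contains_alias("Douglas"))  # True

# Extend the existing alias with one more candidate entity;
# the prior probabilities still have to sum to at most 1.
kb.append_alias(alias="Douglas", entity="Q1004791", prior_prob=0.1)
print([c.entity_ for c in kb.get_candidates("Douglas")])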
@@ -184,7 +184,7 @@ _russian_lower = r"ёа-я"
_russian_upper = r"ЁА-Я"
_russian = r"ёа-яЁА-Я"
-_sinhala = r"\u0D85-\u0D96\u0D9A-\u0DB1\u0DB3-\u0DBB\u0DBD\u0DC0-\u0DC6"
+_sinhala = r"\u0D80-\u0DFF"
_tatar_lower = r"әөүҗңһ"
_tatar_upper = r"ӘӨҮҖҢҺ"
@@ -1,8 +1,8 @@
# coding: utf8
from __future__ import unicode_literals
-from ...symbols import POS, PUNCT, ADJ, CONJ, SCONJ, NUM, DET, ADV, ADP, X, VERB
-from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX
+from ...symbols import POS, PUNCT, ADJ, CCONJ, SCONJ, NUM, DET, ADV, ADP, X
+from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX, VERB
TAG_MAP = {
@@ -20,8 +20,8 @@ TAG_MAP = {
"CARD": {POS: NUM, "NumType": "card"},
"FM": {POS: X, "Foreign": "yes"},
"ITJ": {POS: INTJ},
-"KOKOM": {POS: CONJ, "ConjType": "comp"},
-"KON": {POS: CONJ},
+"KOKOM": {POS: CCONJ, "ConjType": "comp"},
+"KON": {POS: CCONJ},
"KOUI": {POS: SCONJ},
"KOUS": {POS: SCONJ},
"NE": {POS: PROPN},
@@ -43,7 +43,7 @@ TAG_MAP = {
"PTKA": {POS: PART},
"PTKANT": {POS: PART, "PartType": "res"},
"PTKNEG": {POS: PART, "Polarity": "neg"},
-"PTKVZ": {POS: PART, "PartType": "vbp"},
+"PTKVZ": {POS: ADP, "PartType": "vbp"},
"PTKZU": {POS: PART, "PartType": "inf"},
"PWAT": {POS: DET, "PronType": "int"},
"PWAV": {POS: ADV, "PronType": "int"},
@@ -2,7 +2,7 @@
from __future__ import unicode_literals
from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
-from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX
+from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON
TAG_MAP = {
@@ -28,8 +28,8 @@ TAG_MAP = {
"JJR": {POS: ADJ, "Degree": "comp"},
"JJS": {POS: ADJ, "Degree": "sup"},
"LS": {POS: X, "NumType": "ord"},
-"MD": {POS: AUX, "VerbType": "mod"},
-"NIL": {POS: ""},
+"MD": {POS: VERB, "VerbType": "mod"},
+"NIL": {POS: X},
"NN": {POS: NOUN, "Number": "sing"},
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
@@ -37,7 +37,7 @@ TAG_MAP = {
"PDT": {POS: DET},
"POS": {POS: PART, "Poss": "yes"},
"PRP": {POS: PRON, "PronType": "prs"},
-"PRP$": {POS: PRON, "PronType": "prs", "Poss": "yes"},
+"PRP$": {POS: DET, "PronType": "prs", "Poss": "yes"},
"RB": {POS: ADV, "Degree": "pos"},
"RBR": {POS: ADV, "Degree": "comp"},
"RBS": {POS: ADV, "Degree": "sup"},
@@ -58,9 +58,9 @@ TAG_MAP = {
"Number": "sing",
"Person": "three",
},
-"WDT": {POS: PRON},
+"WDT": {POS: DET},
"WP": {POS: PRON},
-"WP$": {POS: PRON, "Poss": "yes"},
+"WP$": {POS: DET, "Poss": "yes"},
"WRB": {POS: ADV},
"ADD": {POS: X},
"NFP": {POS: PUNCT},
@@ -11,12 +11,12 @@ Example sentences to test spaCy and its language models.
sentences = [
-"Apple está buscando comprar una startup del Reino Unido por mil millones de dólares",
-"Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes",
-"San Francisco analiza prohibir los robots delivery",
-"Londres es una gran ciudad del Reino Unido",
-"El gato come pescado",
-"Veo al hombre con el telescopio",
-"La araña come moscas",
-"El pingüino incuba en su nido",
+"Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.",
+"Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.",
+"San Francisco analiza prohibir los robots delivery.",
+"Londres es una gran ciudad del Reino Unido.",
+"El gato come pescado.",
+"Veo al hombre con el telescopio.",
+"La araña come moscas.",
+"El pingüino incuba en su nido.",
]
@@ -5,7 +5,7 @@ from ..punctuation import TOKENIZER_INFIXES
from ..char_classes import ALPHA
-ELISION = " ' ".strip().replace(" ", "").replace("\n", "")
+ELISION = " ' ".strip().replace(" ", "")
_infixes = TOKENIZER_INFIXES + [
spacy/lang/lb/__init__.py Normal file
@@ -0,0 +1,36 @@
# coding: utf8
from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .norm_exceptions import NORM_EXCEPTIONS
from .punctuation import TOKENIZER_INFIXES
from .lex_attrs import LEX_ATTRS
from .tag_map import TAG_MAP
from .stop_words import STOP_WORDS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups
class LuxembourgishDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: "lb"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
tag_map = TAG_MAP
infixes = TOKENIZER_INFIXES
class Luxembourgish(Language):
lang = "lb"
Defaults = LuxembourgishDefaults
__all__ = ["Luxembourgish"]

spacy/lang/lb/examples.py Normal file
@@ -0,0 +1,18 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.lb.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"An der Zäit hunn sech den Nordwand an dSonn gestridden, wie vun hinnen zwee wuel méi staark wier, wéi e Wanderer, deen an ee waarme Mantel agepak war, iwwert de Wee koum.",
"Si goufen sech eens, dass deejéinege fir de Stäerkste gëlle sollt, deen de Wanderer forcéiere géif, säi Mantel auszedoen.",
"Den Nordwand huet mat aller Force geblosen, awer wat e méi geblosen huet, wat de Wanderer sech méi a säi Mantel agewéckelt huet.",
"Um Enn huet den Nordwand säi Kampf opginn.",
"Dunn huet dSonn dLoft mat hire frëndleche Strale gewiermt, a schonn no kuerzer Zäit huet de Wanderer säi Mantel ausgedoen.",
"Do huet den Nordwand missen zouginn, dass dSonn vun hinnen zwee de Stäerkste wier.",
]

@@ -0,0 +1,44 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = set(
"""
null eent zwee dräi véier fënnef sechs ziwen aacht néng zéng eelef zwielef dräizéng
véierzéng foffzéng siechzéng siwwenzéng uechtzeng uechzeng nonnzéng nongzéng zwanzeg drësseg véierzeg foffzeg sechzeg siechzeg siwenzeg achtzeg achzeg uechtzeg uechzeg nonnzeg
honnert dausend millioun milliard billioun billiard trillioun triliard
""".split()
)
_ordinal_words = set(
"""
éischten zweeten drëtten véierten fënneften sechsten siwenten aachten néngten zéngten eeleften
zwieleften dräizéngten véierzéngten foffzéngten siechzéngten uechtzéngen uechzéngten nonnzéngten nongzéngten zwanzegsten
drëssegsten véierzegsten foffzegsten siechzegsten siwenzegsten uechzegsten nonnzegsten
honnertsten dausendsten milliounsten
milliardsten billiounsten billiardsten trilliounsten trilliardsten
""".split()
)
def like_num(text):
"""
check if text resembles a number
"""
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
return True
if text in _ordinal_words:
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}
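A quick sketch of what the Luxembourgish like_num lexical attribute accepts; the tokens are examples only:

from spacy.lang.lb.lex_attrs import like_num

print(like_num("1995"))     # True  - plain digits
print(like_num("3/4"))      # True  - fraction
print(like_num("zwee"))     # True  - cardinal number word
print(like_num("drëtten"))  # True  - ordinal number word
print(like_num("Mantel"))   # False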

@@ -0,0 +1,16 @@
# coding: utf8
from __future__ import unicode_literals
# TODO
# norm exceptions: find a possibility to deal with the zillions of spelling
# variants (vläicht = vlaicht, vleicht, viläicht, viläischt, etc. etc.)
# here one could include the most common spelling mistakes
_exc = {"datt": "dass", "wgl.": "weg.", "vläicht": "viläicht"}
NORM_EXCEPTIONS = {}
for string, norm in _exc.items():
NORM_EXCEPTIONS[string] = norm
NORM_EXCEPTIONS[string.title()] = norm

@@ -0,0 +1,16 @@
# coding: utf8
from __future__ import unicode_literals
from ..punctuation import TOKENIZER_INFIXES
from ..char_classes import ALPHA
ELISION = " ' ".strip().replace(" ", "")
HYPHENS = r"- ".strip().replace(" ", "")
_infixes = TOKENIZER_INFIXES + [
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
]
TOKENIZER_INFIXES = _infixes

spacy/lang/lb/stop_words.py Normal file
@@ -0,0 +1,214 @@
# coding: utf8
from __future__ import unicode_literals
STOP_WORDS = set(
"""
a
à
äis
är
ärt
äert
ären
all
allem
alles
alleguer
als
also
am
an
anerefalls
ass
aus
awer
bei
beim
bis
bis
d'
dach
datt
däin
där
dat
de
dee
den
deel
deem
deen
deene
déi
den
deng
denger
dem
der
dësem
di
dir
do
da
dann
domat
dozou
drop
du
duerch
duerno
e
ee
em
een
eent
ë
en
ënner
ëm
ech
eis
eise
eisen
eiser
eises
eisereen
esou
een
eng
enger
engem
entweder
et
eréischt
falls
fir
géint
géif
gëtt
gët
geet
gi
ginn
gouf
gouff
goung
hat
haten
hatt
hätt
hei
hu
huet
hun
hunn
hiren
hien
hin
hier
hir
jidderen
jiddereen
jiddwereen
jiddereng
jiddwerengen
jo
ins
iech
iwwer
kann
kee
keen
kënne
kënnt
kéng
kéngen
kéngem
koum
kuckt
mam
mat
ma
mech
méi
mécht
meng
menger
mer
mir
muss
nach
nämmlech
nämmelech
näischt
nawell
nëmme
nëmmen
net
nees
nee
no
nu
nom
och
oder
ons
onsen
onser
onsereen
onst
om
op
ouni
säi
säin
schonn
schonns
si
sid
sie
se
sech
seng
senge
sengem
senger
selwecht
selwer
sinn
sollten
souguer
sou
soss
sot
't
tëscht
u
un
um
virdrun
vu
vum
vun
wann
war
waren
was
wat
wëllt
weider
wéi
wéini
wéinst
wi
wollt
wou
wouhin
zanter
ze
zu
zum
zwar
""".split()
)

spacy/lang/lb/tag_map.py Normal file
@@ -0,0 +1,28 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import POS, PUNCT, ADJ, CONJ, NUM, DET, ADV, ADP, X, VERB
from ...symbols import NOUN, PART, SPACE, AUX
# TODO: tag map is still using POS tags from an internal training set.
# These POS tags have to be modified to match those from Universal Dependencies
TAG_MAP = {
"$": {POS: PUNCT},
"ADJ": {POS: ADJ},
"AV": {POS: ADV},
"APPR": {POS: ADP, "AdpType": "prep"},
"APPRART": {POS: ADP, "AdpType": "prep", "PronType": "art"},
"D": {POS: DET, "PronType": "art"},
"KO": {POS: CONJ},
"N": {POS: NOUN},
"P": {POS: ADV},
"TRUNC": {POS: X, "Hyph": "yes"},
"AUX": {POS: AUX},
"V": {POS: VERB},
"MV": {POS: VERB, "VerbType": "mod"},
"PTK": {POS: PART},
"INTER": {POS: PART},
"NUM": {POS: NUM},
"_SP": {POS: SPACE},
}

@@ -0,0 +1,51 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import ORTH, LEMMA, NORM
# TODO
# treat other apostrophes within words as part of the word: [op d'mannst], [fir d'éischt] (= exceptions)
_exc = {
}
# translate / delete what is not necessary
for exc_data in [
{ORTH: "wgl.", LEMMA: "wann ech gelift", NORM: "wann ech gelieft"},
{ORTH: "M.", LEMMA: "Monsieur", NORM: "Monsieur"},
{ORTH: "Mme.", LEMMA: "Madame", NORM: "Madame"},
{ORTH: "Dr.", LEMMA: "Dokter", NORM: "Dokter"},
{ORTH: "Tel.", LEMMA: "Telefon", NORM: "Telefon"},
{ORTH: "asw.", LEMMA: "an sou weider", NORM: "an sou weider"},
{ORTH: "etc.", LEMMA: "et cetera", NORM: "et cetera"},
{ORTH: "bzw.", LEMMA: "bezéiungsweis", NORM: "bezéiungsweis"},
{ORTH: "Jan.", LEMMA: "Januar", NORM: "Januar"},
]:
_exc[exc_data[ORTH]] = [exc_data]
# to be extended
for orth in [
"z.B.",
"Dipl.",
"Dr.",
"etc.",
"i.e.",
"o.k.",
"O.K.",
"p.a.",
"p.s.",
"P.S.",
"phil.",
"q.e.d.",
"R.I.P.",
"rer.",
"sen.",
"ë.a.",
"U.S.",
"U.S.A.",
]:
_exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = _exc

@@ -11,8 +11,8 @@ Example sentences to test spaCy and its language models.
sentences = [
-"Apple vurderer å kjøpe britisk oppstartfirma for en milliard dollar",
-"Selvkjørende biler flytter forsikringsansvaret over på produsentene ",
-"San Francisco vurderer å forby robotbud på fortauene",
+"Apple vurderer å kjøpe britisk oppstartfirma for en milliard dollar.",
+"Selvkjørende biler flytter forsikringsansvaret over på produsentene.",
+"San Francisco vurderer å forby robotbud på fortauene.",
"London er en stor by i Storbritannia.",
]
@@ -14,6 +14,7 @@ from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups
+from .syntax_iterators import SYNTAX_ITERATORS
class SwedishDefaults(Language.Defaults):
@@ -29,6 +30,7 @@ class SwedishDefaults(Language.Defaults):
suffixes = TOKENIZER_SUFFIXES
stop_words = STOP_WORDS
morph_rules = MORPH_RULES
+syntax_iterators = SYNTAX_ITERATORS
class Swedish(Language):

@@ -0,0 +1,50 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import NOUN, PROPN, PRON
def noun_chunks(obj):
"""
Detect base noun phrases from a dependency parse. Works on both Doc and Span.
"""
labels = [
"nsubj",
"nsubj:pass",
"dobj",
"obj",
"iobj",
"ROOT",
"appos",
"nmod",
"nmod:poss",
]
doc = obj.doc # Ensure works on both Doc and Span.
np_deps = [doc.vocab.strings[label] for label in labels]
conj = doc.vocab.strings.add("conj")
np_label = doc.vocab.strings.add("NP")
seen = set()
for i, word in enumerate(obj):
if word.pos not in (NOUN, PROPN, PRON):
continue
# Prevent nested chunks from being produced
if word.i in seen:
continue
if word.dep in np_deps:
if any(w.i in seen for w in word.subtree):
continue
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
yield word.left_edge.i, word.right_edge.i + 1, np_label
elif word.dep == conj:
head = word.head
while head.dep == conj and head.head.i < head.i:
head = head.head
# If the head is an NP, and we're coordinated to it, we're an NP
if head.dep in np_deps:
if any(w.i in seen for w in word.subtree):
continue
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
yield word.left_edge.i, word.right_edge.i + 1, np_label
SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}
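Because the iterator relies on dependency labels and POS tags, it only produces chunks on a parsed Doc, so the sketch below assumes a trained Swedish pipeline with a parser is installed ("sv_example_model" is a placeholder name):

import spacy

nlp = spacy.load("sv_example_model")   # placeholder: any Swedish pipeline with a parser
doc = nlp("Jag såg en stor hund i parken.")
for chunk in doc.noun_chunks:          # backed by the SYNTAX_ITERATORS entry above
    print(chunk.text, chunk.root.dep_)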

99
spacy/lang/xx/examples.py Normal file
View File

@ -0,0 +1,99 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.de.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
# combined examples from de/en/es/fr/it/nl/pl/pt/ru
sentences = [
"Die ganze Stadt ist ein Startup: Shenzhen ist das Silicon Valley für Hardware-Firmen",
"Wie deutsche Startups die Technologie vorantreiben wollen: Künstliche Intelligenz",
"Trend zum Urlaub in Deutschland beschert Gastwirten mehr Umsatz",
"Bundesanwaltschaft erhebt Anklage gegen mutmaßlichen Schweizer Spion",
"San Francisco erwägt Verbot von Lieferrobotern",
"Autonome Fahrzeuge verlagern Haftpflicht auf Hersteller",
"Wo bist du?",
"Was ist die Hauptstadt von Deutschland?",
"Apple is looking at buying U.K. startup for $1 billion",
"Autonomous cars shift insurance liability toward manufacturers",
"San Francisco considers banning sidewalk delivery robots",
"London is a big city in the United Kingdom.",
"Where are you?",
"Who is the president of France?",
"What is the capital of the United States?",
"When was Barack Obama born?",
"Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.",
"Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.",
"San Francisco analiza prohibir los robots delivery.",
"Londres es una gran ciudad del Reino Unido.",
"El gato come pescado.",
"Veo al hombre con el telescopio.",
"La araña come moscas.",
"El pingüino incuba en su nido.",
"Apple cherche à acheter une start-up anglaise pour 1 milliard de dollars",
"Les voitures autonomes déplacent la responsabilité de l'assurance vers les constructeurs",
"San Francisco envisage d'interdire les robots coursiers sur les trottoirs",
"Londres est une grande ville du Royaume-Uni",
"LItalie choisit ArcelorMittal pour reprendre la plus grande aciérie dEurope",
"Apple lance HomePod parce qu'il se sent menacé par l'Echo d'Amazon",
"La France ne devrait pas manquer d'électricité cet été, même en cas de canicule",
"Nouvelles attaques de Trump contre le maire de Londres",
"Où es-tu ?",
"Qui est le président de la France ?",
"Où est la capitale des États-Unis ?",
"Quand est né Barack Obama ?",
"Apple vuole comprare una startup del Regno Unito per un miliardo di dollari",
"Le automobili a guida autonoma spostano la responsabilità assicurativa verso i produttori",
"San Francisco prevede di bandire i robot di consegna porta a porta",
"Londra è una grande città del Regno Unito.",
"Apple overweegt om voor 1 miljard een U.K. startup te kopen",
"Autonome auto's verschuiven de verzekeringverantwoordelijkheid naar producenten",
"San Francisco overweegt robots op voetpaden te verbieden",
"Londen is een grote stad in het Verenigd Koninkrijk",
"Poczuł przyjemną woń mocnej kawy.",
"Istnieje wiele dróg oddziaływania substancji psychoaktywnej na układ nerwowy.",
"Powitał mnie biało-czarny kot, płosząc siedzące na płocie trzy dorodne dudki.",
"Nowy abonament pod lupą Komisji Europejskiej",
"Czy w ciągu ostatnich 48 godzin spożyłeś leki zawierające paracetamol?",
"Kto ma ochotę zapoznać się z innymi niż w książkach przygodami Muminków i ich przyjaciół, temu polecam komiks Tove Jansson „Muminki i morze”.",
"Apple está querendo comprar uma startup do Reino Unido por 100 milhões de dólares.",
"Carros autônomos empurram a responsabilidade do seguro para os fabricantes.."
"São Francisco considera banir os robôs de entrega que andam pelas calçadas.",
"Londres é a maior cidade do Reino Unido.",
# Translations from English:
"Apple рассматривает возможность покупки стартапа из Соединённого Королевства за $1 млрд",
"Беспилотные автомобили перекладывают страховую ответственность на производителя",
"В Сан-Франциско рассматривается возможность запрета роботов-курьеров, которые перемещаются по тротуару",
"Лондон — это большой город в Соединённом Королевстве",
# Native Russian sentences:
# Colloquial:
"Да, нет, наверное!", # Typical polite refusal
"Обратите внимание на необыкновенную красоту этого города-героя Москвы, столицы нашей Родины!", # From a tour guide speech
# Examples of Bookish Russian:
# Quote from "The Golden Calf"
"Рио-де-Жанейро — это моя мечта, и не смейте касаться её своими грязными лапами!",
# Quotes from "Ivan Vasilievich changes his occupation"
"Ты пошто боярыню обидел, смерд?!!",
"Оставь меня, старушка, я в печали!",
# Quotes from Dostoevsky:
"Уж коли я, такой же, как и ты, человек грешный, над тобой умилился и пожалел тебя, кольми паче бог",
"В мечтах я нередко, говорит, доходил до страстных помыслов о служении человечеству и может быть действительно пошел бы на крест за людей, если б это вдруг как-нибудь потребовалось, а между тем я двух дней не в состоянии прожить ни с кем в одной комнате, о чем знаю из опыта",
"Зато всегда так происходило, что чем более я ненавидел людей в частности, тем пламеннее становилась любовь моя к человечеству вообще",
# Quotes from Chekhov:
"Ненужные дела и разговоры всё об одном отхватывают на свою долю лучшую часть времени, лучшие силы, и в конце концов остается какая-то куцая, бескрылая жизнь, какая-то чепуха, и уйти и бежать нельзя, точно сидишь в сумасшедшем доме или в арестантских ротах!",
# Quotes from Turgenev:
"Нравится тебе женщина, старайся добиться толку; а нельзя — ну, не надо, отвернись — земля не клином сошлась",
"Узенькое местечко, которое я занимаю, до того крохотно в сравнении с остальным пространством, где меня нет и где дела до меня нет; и часть времени, которую мне удастся прожить, так ничтожна перед вечностью, где меня не было и не будет...",
# Quotes from newspapers:
# Komsomolskaya Pravda:
"На заседании президиума правительства Москвы принято решение присвоить статус инвестиционного приоритетного проекта города Москвы киностудии Союзмультфильм",
"Глава Минобороны Сергей Шойгу заявил, что обстановка на этом стратегическом направлении требует непрерывного совершенствования боевого состава войск",
# Argumenty i Facty:
"На реплику лже-Говина — дескать, он (Волков) будет лучшим революционером — Стамп с энтузиазмом ответил: Непременно!",
]

View File

@ -4,19 +4,92 @@ from __future__ import unicode_literals
from ...attrs import LANG from ...attrs import LANG
from ...language import Language from ...language import Language
from ...tokens import Doc from ...tokens import Doc
from ...util import DummyTokenizer
from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..tokenizer_exceptions import BASE_EXCEPTIONS
from .lex_attrs import LEX_ATTRS
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .tag_map import TAG_MAP from .tag_map import TAG_MAP
def try_jieba_import(use_jieba):
try:
import jieba
return jieba
except ImportError:
if use_jieba:
msg = (
"Jieba not installed. Either set Chinese.use_jieba = False, "
"or install it https://github.com/fxsjy/jieba"
)
raise ImportError(msg)
class ChineseTokenizer(DummyTokenizer):
def __init__(self, cls, nlp=None):
self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
self.use_jieba = cls.use_jieba
self.jieba_seg = try_jieba_import(self.use_jieba)
self.tokenizer = Language.Defaults().create_tokenizer(nlp)
def __call__(self, text):
# use jieba
if self.use_jieba:
jieba_words = [x for x in self.jieba_seg.cut(text, cut_all=False) if x]
# guard against empty input: jieba_words may be an empty list
words = jieba_words[:1]
spaces = [False] * len(words)
for i in range(1, len(jieba_words)):
word = jieba_words[i]
if word.isspace():
# second token in adjacent whitespace following a
# non-space token
if spaces[-1]:
words.append(word)
spaces.append(False)
# first space token following non-space token
elif word == " " and not words[-1].isspace():
spaces[-1] = True
# token is non-space whitespace or any whitespace following
# a whitespace token
else:
# extend previous whitespace token with more whitespace
if words[-1].isspace():
words[-1] += word
# otherwise it's a new whitespace token
else:
words.append(word)
spaces.append(False)
else:
words.append(word)
spaces.append(False)
return Doc(self.vocab, words=words, spaces=spaces)
# split into individual characters
words = []
spaces = []
for token in self.tokenizer(text):
if token.text.isspace():
words.append(token.text)
spaces.append(False)
else:
words.extend(list(token.text))
spaces.extend([False] * len(token.text))
spaces[-1] = bool(token.whitespace_)
return Doc(self.vocab, words=words, spaces=spaces)
class ChineseDefaults(Language.Defaults): class ChineseDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: "zh" lex_attr_getters[LANG] = lambda text: "zh"
use_jieba = True
tokenizer_exceptions = BASE_EXCEPTIONS tokenizer_exceptions = BASE_EXCEPTIONS
stop_words = STOP_WORDS stop_words = STOP_WORDS
tag_map = TAG_MAP tag_map = TAG_MAP
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False} writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
use_jieba = True
@classmethod
def create_tokenizer(cls, nlp=None):
return ChineseTokenizer(cls, nlp)
class Chinese(Language): class Chinese(Language):
@ -24,26 +97,7 @@ class Chinese(Language):
Defaults = ChineseDefaults # override defaults Defaults = ChineseDefaults # override defaults
def make_doc(self, text): def make_doc(self, text):
if self.Defaults.use_jieba: return self.tokenizer(text)
try:
import jieba
except ImportError:
msg = (
"Jieba not installed. Either set Chinese.use_jieba = False, "
"or install it https://github.com/fxsjy/jieba"
)
raise ImportError(msg)
words = list(jieba.cut(text, cut_all=False))
words = [x for x in words if x]
return Doc(self.vocab, words=words, spaces=[False] * len(words))
else:
words = []
spaces = []
for token in self.tokenizer(text):
words.extend(list(token.text))
spaces.extend([False] * len(token.text))
spaces[-1] = bool(token.whitespace_)
return Doc(self.vocab, words=words, spaces=spaces)
__all__ = ["Chinese"] __all__ = ["Chinese"]

View File

@ -1,11 +1,12 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
from ...symbols import POS, PUNCT, ADJ, CONJ, CCONJ, NUM, DET, ADV, ADP, X, VERB from ...symbols import POS, PUNCT, ADJ, SCONJ, CCONJ, NUM, DET, ADV, ADP, X
from ...symbols import NOUN, PART, INTJ, PRON from ...symbols import NOUN, PART, INTJ, PRON, VERB, SPACE
# The Chinese part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. # The Chinese part-of-speech tagger uses the OntoNotes 5 version of the Penn
# We also map the tags to the simpler Google Universal POS tag set. # Treebank tag set. We also map the tags to the simpler Universal Dependencies
# v2 tag set.
TAG_MAP = { TAG_MAP = {
"AS": {POS: PART}, "AS": {POS: PART},
@ -38,10 +39,11 @@ TAG_MAP = {
"OD": {POS: NUM}, "OD": {POS: NUM},
"DT": {POS: DET}, "DT": {POS: DET},
"CC": {POS: CCONJ}, "CC": {POS: CCONJ},
"CS": {POS: CONJ}, "CS": {POS: SCONJ},
"AD": {POS: ADV}, "AD": {POS: ADV},
"JJ": {POS: ADJ}, "JJ": {POS: ADJ},
"P": {POS: ADP}, "P": {POS: ADP},
"PN": {POS: PRON}, "PN": {POS: PRON},
"PU": {POS: PUNCT}, "PU": {POS: PUNCT},
"_SP": {POS: SPACE},
} }

View File

@ -3,6 +3,7 @@ from __future__ import absolute_import, unicode_literals
import random import random
import itertools import itertools
from spacy.util import minibatch
import weakref import weakref
import functools import functools
from collections import OrderedDict from collections import OrderedDict
@ -10,18 +11,15 @@ from contextlib import contextmanager
from copy import copy, deepcopy from copy import copy, deepcopy
from thinc.neural import Model from thinc.neural import Model
import srsly import srsly
import multiprocessing as mp
from itertools import chain, cycle
from .tokenizer import Tokenizer from .tokenizer import Tokenizer
from .vocab import Vocab from .vocab import Vocab
from .lemmatizer import Lemmatizer from .lemmatizer import Lemmatizer
from .lookups import Lookups from .lookups import Lookups
from .pipeline import DependencyParser, Tagger from .analysis import analyze_pipes, analyze_all_pipes, validate_attrs
from .pipeline import Tensorizer, EntityRecognizer, EntityLinker from .compat import izip, basestring_, is_python2, class_types
from .pipeline import SimilarityHook, TextCategorizer, Sentencizer
from .pipeline import merge_noun_chunks, merge_entities, merge_subtokens
from .pipeline import EntityRuler
from .pipeline import Morphologizer
from .compat import izip, basestring_
from .gold import GoldParse from .gold import GoldParse
from .scorer import Scorer from .scorer import Scorer
from ._ml import link_vectors_to_models, create_default_optimizer from ._ml import link_vectors_to_models, create_default_optimizer
@ -30,12 +28,16 @@ from .lang.punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
from .lang.punctuation import TOKENIZER_INFIXES from .lang.punctuation import TOKENIZER_INFIXES
from .lang.tokenizer_exceptions import TOKEN_MATCH from .lang.tokenizer_exceptions import TOKEN_MATCH
from .lang.tag_map import TAG_MAP from .lang.tag_map import TAG_MAP
from .tokens import Doc
from .lang.lex_attrs import LEX_ATTRS, is_stop from .lang.lex_attrs import LEX_ATTRS, is_stop
from .errors import Errors, Warnings, deprecation_warning from .errors import Errors, Warnings, deprecation_warning, user_warning
from . import util from . import util
from . import about from . import about
ENABLE_PIPELINE_ANALYSIS = False
class BaseDefaults(object): class BaseDefaults(object):
@classmethod @classmethod
def create_lemmatizer(cls, nlp=None, lookups=None): def create_lemmatizer(cls, nlp=None, lookups=None):
@ -49,8 +51,8 @@ class BaseDefaults(object):
filenames = {name: root / filename for name, filename in cls.resources} filenames = {name: root / filename for name, filename in cls.resources}
if LANG in cls.lex_attr_getters: if LANG in cls.lex_attr_getters:
lang = cls.lex_attr_getters[LANG](None) lang = cls.lex_attr_getters[LANG](None)
user_lookups = util.get_entry_point(util.ENTRY_POINTS.lookups, lang, {}) if lang in util.registry.lookups:
filenames.update(user_lookups) filenames.update(util.registry.lookups.get(lang))
lookups = Lookups() lookups = Lookups()
for name, filename in filenames.items(): for name, filename in filenames.items():
data = util.load_language_data(filename) data = util.load_language_data(filename)
@ -106,10 +108,6 @@ class BaseDefaults(object):
tag_map = dict(TAG_MAP) tag_map = dict(TAG_MAP)
tokenizer_exceptions = {} tokenizer_exceptions = {}
stop_words = set() stop_words = set()
lemma_rules = {}
lemma_exc = {}
lemma_index = {}
lemma_lookup = {}
morph_rules = {} morph_rules = {}
lex_attr_getters = LEX_ATTRS lex_attr_getters = LEX_ATTRS
syntax_iterators = {} syntax_iterators = {}
@ -133,22 +131,7 @@ class Language(object):
Defaults = BaseDefaults Defaults = BaseDefaults
lang = None lang = None
factories = { factories = {"tokenizer": lambda nlp: nlp.Defaults.create_tokenizer(nlp)}
"tokenizer": lambda nlp: nlp.Defaults.create_tokenizer(nlp),
"tensorizer": lambda nlp, **cfg: Tensorizer(nlp.vocab, **cfg),
"tagger": lambda nlp, **cfg: Tagger(nlp.vocab, **cfg),
"morphologizer": lambda nlp, **cfg: Morphologizer(nlp.vocab, **cfg),
"parser": lambda nlp, **cfg: DependencyParser(nlp.vocab, **cfg),
"ner": lambda nlp, **cfg: EntityRecognizer(nlp.vocab, **cfg),
"entity_linker": lambda nlp, **cfg: EntityLinker(nlp.vocab, **cfg),
"similarity": lambda nlp, **cfg: SimilarityHook(nlp.vocab, **cfg),
"textcat": lambda nlp, **cfg: TextCategorizer(nlp.vocab, **cfg),
"sentencizer": lambda nlp, **cfg: Sentencizer(**cfg),
"merge_noun_chunks": lambda nlp, **cfg: merge_noun_chunks,
"merge_entities": lambda nlp, **cfg: merge_entities,
"merge_subtokens": lambda nlp, **cfg: merge_subtokens,
"entity_ruler": lambda nlp, **cfg: EntityRuler(nlp, **cfg),
}
def __init__( def __init__(
self, vocab=True, make_doc=True, max_length=10 ** 6, meta={}, **kwargs self, vocab=True, make_doc=True, max_length=10 ** 6, meta={}, **kwargs
@ -172,7 +155,7 @@ class Language(object):
100,000 characters in one text. 100,000 characters in one text.
RETURNS (Language): The newly constructed object. RETURNS (Language): The newly constructed object.
""" """
user_factories = util.get_entry_points(util.ENTRY_POINTS.factories) user_factories = util.registry.factories.get_all()
self.factories.update(user_factories) self.factories.update(user_factories)
self._meta = dict(meta) self._meta = dict(meta)
self._path = None self._path = None
@ -218,6 +201,7 @@ class Language(object):
"name": self.vocab.vectors.name, "name": self.vocab.vectors.name,
} }
self._meta["pipeline"] = self.pipe_names self._meta["pipeline"] = self.pipe_names
self._meta["factories"] = self.pipe_factories
self._meta["labels"] = self.pipe_labels self._meta["labels"] = self.pipe_labels
return self._meta return self._meta
@ -259,6 +243,17 @@ class Language(object):
""" """
return [pipe_name for pipe_name, _ in self.pipeline] return [pipe_name for pipe_name, _ in self.pipeline]
@property
def pipe_factories(self):
"""Get the component factories for the available pipeline components.
RETURNS (dict): Factory names, keyed by component names.
"""
factories = {}
for pipe_name, pipe in self.pipeline:
factories[pipe_name] = getattr(pipe, "factory", pipe_name)
return factories
@property @property
def pipe_labels(self): def pipe_labels(self):
"""Get the labels set by the pipeline components, if available (if """Get the labels set by the pipeline components, if available (if
@ -327,33 +322,30 @@ class Language(object):
msg += Errors.E004.format(component=component) msg += Errors.E004.format(component=component)
raise ValueError(msg) raise ValueError(msg)
if name is None: if name is None:
if hasattr(component, "name"): name = util.get_component_name(component)
name = component.name
elif hasattr(component, "__name__"):
name = component.__name__
elif hasattr(component, "__class__") and hasattr(
component.__class__, "__name__"
):
name = component.__class__.__name__
else:
name = repr(component)
if name in self.pipe_names: if name in self.pipe_names:
raise ValueError(Errors.E007.format(name=name, opts=self.pipe_names)) raise ValueError(Errors.E007.format(name=name, opts=self.pipe_names))
if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2: if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2:
raise ValueError(Errors.E006) raise ValueError(Errors.E006)
pipe_index = 0
pipe = (name, component) pipe = (name, component)
if last or not any([first, before, after]): if last or not any([first, before, after]):
pipe_index = len(self.pipeline)
self.pipeline.append(pipe) self.pipeline.append(pipe)
elif first: elif first:
self.pipeline.insert(0, pipe) self.pipeline.insert(0, pipe)
elif before and before in self.pipe_names: elif before and before in self.pipe_names:
pipe_index = self.pipe_names.index(before)
self.pipeline.insert(self.pipe_names.index(before), pipe) self.pipeline.insert(self.pipe_names.index(before), pipe)
elif after and after in self.pipe_names: elif after and after in self.pipe_names:
pipe_index = self.pipe_names.index(after) + 1
self.pipeline.insert(self.pipe_names.index(after) + 1, pipe) self.pipeline.insert(self.pipe_names.index(after) + 1, pipe)
else: else:
raise ValueError( raise ValueError(
Errors.E001.format(name=before or after, opts=self.pipe_names) Errors.E001.format(name=before or after, opts=self.pipe_names)
) )
if ENABLE_PIPELINE_ANALYSIS:
analyze_pipes(self.pipeline, name, component, pipe_index)
def has_pipe(self, name): def has_pipe(self, name):
"""Check if a component name is present in the pipeline. Equivalent to """Check if a component name is present in the pipeline. Equivalent to
@ -382,6 +374,8 @@ class Language(object):
msg += Errors.E135.format(name=name) msg += Errors.E135.format(name=name)
raise ValueError(msg) raise ValueError(msg)
self.pipeline[self.pipe_names.index(name)] = (name, component) self.pipeline[self.pipe_names.index(name)] = (name, component)
if ENABLE_PIPELINE_ANALYSIS:
analyze_all_pipes(self.pipeline)
def rename_pipe(self, old_name, new_name): def rename_pipe(self, old_name, new_name):
"""Rename a pipeline component. """Rename a pipeline component.
@ -408,7 +402,10 @@ class Language(object):
""" """
if name not in self.pipe_names: if name not in self.pipe_names:
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names)) raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
return self.pipeline.pop(self.pipe_names.index(name)) removed = self.pipeline.pop(self.pipe_names.index(name))
if ENABLE_PIPELINE_ANALYSIS:
analyze_all_pipes(self.pipeline)
return removed
def __call__(self, text, disable=[], component_cfg=None): def __call__(self, text, disable=[], component_cfg=None):
"""Apply the pipeline to some text. The text can span multiple sentences, """Apply the pipeline to some text. The text can span multiple sentences,
@ -448,6 +445,8 @@ class Language(object):
DOCS: https://spacy.io/api/language#disable_pipes DOCS: https://spacy.io/api/language#disable_pipes
""" """
if len(names) == 1 and isinstance(names[0], (list, tuple)):
names = names[0] # support list of names instead of spread
return DisabledPipes(self, *names) return DisabledPipes(self, *names)
def make_doc(self, text): def make_doc(self, text):
@ -477,7 +476,7 @@ class Language(object):
docs (iterable): A batch of `Doc` objects. docs (iterable): A batch of `Doc` objects.
golds (iterable): A batch of `GoldParse` objects. golds (iterable): A batch of `GoldParse` objects.
drop (float): The droput rate. drop (float): The dropout rate.
sgd (callable): An optimizer. sgd (callable): An optimizer.
losses (dict): Dictionary to update with the loss, keyed by component. losses (dict): Dictionary to update with the loss, keyed by component.
component_cfg (dict): Config parameters for specific pipeline component_cfg (dict): Config parameters for specific pipeline
@ -525,7 +524,7 @@ class Language(object):
even if you're updating it with a smaller set of examples. even if you're updating it with a smaller set of examples.
docs (iterable): A batch of `Doc` objects. docs (iterable): A batch of `Doc` objects.
drop (float): The droput rate. drop (float): The dropout rate.
sgd (callable): An optimizer. sgd (callable): An optimizer.
RETURNS (dict): Results from the update. RETURNS (dict): Results from the update.
@ -679,7 +678,7 @@ class Language(object):
kwargs = component_cfg.get(name, {}) kwargs = component_cfg.get(name, {})
kwargs.setdefault("batch_size", batch_size) kwargs.setdefault("batch_size", batch_size)
if not hasattr(pipe, "pipe"): if not hasattr(pipe, "pipe"):
docs = _pipe(pipe, docs, kwargs) docs = _pipe(docs, pipe, kwargs)
else: else:
docs = pipe.pipe(docs, **kwargs) docs = pipe.pipe(docs, **kwargs)
for doc, gold in zip(docs, golds): for doc, gold in zip(docs, golds):
@ -733,6 +732,7 @@ class Language(object):
disable=[], disable=[],
cleanup=False, cleanup=False,
component_cfg=None, component_cfg=None,
n_process=1,
): ):
"""Process texts as a stream, and yield `Doc` objects in order. """Process texts as a stream, and yield `Doc` objects in order.
@ -746,12 +746,21 @@ class Language(object):
use. Experimental. use. Experimental.
component_cfg (dict): An optional dictionary with extra keyword component_cfg (dict): An optional dictionary with extra keyword
arguments for specific components. arguments for specific components.
n_process (int): Number of processes to use when processing the texts;
only supported in Python 3. If -1, `multiprocessing.cpu_count()` is used.
YIELDS (Doc): Documents in the order of the original text. YIELDS (Doc): Documents in the order of the original text.
DOCS: https://spacy.io/api/language#pipe DOCS: https://spacy.io/api/language#pipe
""" """
# raw_texts will be used later to stop the iterator.
texts, raw_texts = itertools.tee(texts)
if is_python2 and n_process != 1:
user_warning(Warnings.W023)
n_process = 1
if n_threads != -1: if n_threads != -1:
deprecation_warning(Warnings.W016) deprecation_warning(Warnings.W016)
if n_process == -1:
n_process = mp.cpu_count()
if as_tuples: if as_tuples:
text_context1, text_context2 = itertools.tee(texts) text_context1, text_context2 = itertools.tee(texts)
texts = (tc[0] for tc in text_context1) texts = (tc[0] for tc in text_context1)
@ -760,14 +769,18 @@ class Language(object):
texts, texts,
batch_size=batch_size, batch_size=batch_size,
disable=disable, disable=disable,
n_process=n_process,
component_cfg=component_cfg, component_cfg=component_cfg,
) )
for doc, context in izip(docs, contexts): for doc, context in izip(docs, contexts):
yield (doc, context) yield (doc, context)
return return
docs = (self.make_doc(text) for text in texts)
if component_cfg is None: if component_cfg is None:
component_cfg = {} component_cfg = {}
pipes = []  # contains functools.partial objects so that multiprocessing workers can be created easily
for name, proc in self.pipeline: for name, proc in self.pipeline:
if name in disable: if name in disable:
continue continue
@ -775,10 +788,20 @@ class Language(object):
# Allow component_cfg to overwrite the top-level kwargs. # Allow component_cfg to overwrite the top-level kwargs.
kwargs.setdefault("batch_size", batch_size) kwargs.setdefault("batch_size", batch_size)
if hasattr(proc, "pipe"): if hasattr(proc, "pipe"):
docs = proc.pipe(docs, **kwargs) f = functools.partial(proc.pipe, **kwargs)
else: else:
# Apply the function, but yield the doc # Apply the function, but yield the doc
docs = _pipe(proc, docs, kwargs) f = functools.partial(_pipe, proc=proc, kwargs=kwargs)
pipes.append(f)
if n_process != 1:
docs = self._multiprocessing_pipe(texts, pipes, n_process, batch_size)
else:
# if n_process == 1, no processes are forked.
docs = (self.make_doc(text) for text in texts)
for pipe in pipes:
docs = pipe(docs)
# Track weakrefs of "recent" documents, so that we can see when they # Track weakrefs of "recent" documents, so that we can see when they
# expire from memory. When they do, we know we don't need old strings. # expire from memory. When they do, we know we don't need old strings.
# This way, we avoid maintaining an unbounded growth in string entries # This way, we avoid maintaining an unbounded growth in string entries
@ -809,6 +832,46 @@ class Language(object):
self.tokenizer._reset_cache(keys) self.tokenizer._reset_cache(keys)
nr_seen = 0 nr_seen = 0
def _multiprocessing_pipe(self, texts, pipes, n_process, batch_size):
# raw_texts is used later to stop iteration.
texts, raw_texts = itertools.tee(texts)
# for sending texts to worker
texts_q = [mp.Queue() for _ in range(n_process)]
# for receiving byte encoded docs from worker
bytedocs_recv_ch, bytedocs_send_ch = zip(
*[mp.Pipe(False) for _ in range(n_process)]
)
batch_texts = minibatch(texts, batch_size)
# Sender sends texts to the workers.
# This is necessary to handle texts of arbitrary (even infinite) length properly.
# (In that case, not all of the data can be sent to the workers at once.)
sender = _Sender(batch_texts, texts_q, chunk_size=n_process)
# send twice so that the worker processes are kept busy from the start
sender.send()
sender.send()
procs = [
mp.Process(target=_apply_pipes, args=(self.make_doc, pipes, rch, sch))
for rch, sch in zip(texts_q, bytedocs_send_ch)
]
for proc in procs:
proc.start()
# Cycle through the channels so that the order of the docs is preserved.
# Each received object is a batch of byte-encoded docs, so flatten them with chain.from_iterable.
byte_docs = chain.from_iterable(recv.recv() for recv in cycle(bytedocs_recv_ch))
docs = (Doc(self.vocab).from_bytes(byte_doc) for byte_doc in byte_docs)
try:
for i, (_, doc) in enumerate(zip(raw_texts, docs), 1):
yield doc
if i % batch_size == 0:
# tell `sender` that one batch was consumed.
sender.step()
finally:
for proc in procs:
proc.terminate()
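A minimal usage sketch of the new n_process option (Python 3 only); the pipeline package named here is just an example of any installed model:

import spacy

nlp = spacy.load("en_core_web_sm")   # any installed pipeline package
texts = ["First document.", "Second document.", "Third document.", "Fourth document."]
for doc in nlp.pipe(texts, n_process=2, batch_size=2):
    print(len(doc), "tokens")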
def to_disk(self, path, exclude=tuple(), disable=None): def to_disk(self, path, exclude=tuple(), disable=None):
"""Save the current state to a directory. If a model is loaded, this """Save the current state to a directory. If a model is loaded, this
will include the model. will include the model.
@ -936,6 +999,52 @@ class Language(object):
return self return self
class component(object):
"""Decorator for pipeline components. Can decorate both function components
and class components and will automatically register components in the
Language.factories. If the component is a class and needs access to the
nlp object or config parameters, it can expose a from_nlp classmethod
that takes the nlp object and **cfg arguments and returns the initialized
component.
"""
# NB: This decorator needs to live here, because it needs to write to
# Language.factories. All other solutions would cause circular import.
def __init__(self, name=None, assigns=tuple(), requires=tuple(), retokenizes=False):
"""Decorate a pipeline component.
name (unicode): Default component and factory name.
assigns (list): Attributes assigned by component, e.g. `["token.pos"]`.
requires (list): Attributes required by component, e.g. `["token.dep"]`.
retokenizes (bool): Whether the component changes the tokenization.
"""
self.name = name
self.assigns = validate_attrs(assigns)
self.requires = validate_attrs(requires)
self.retokenizes = retokenizes
def __call__(self, *args, **kwargs):
obj = args[0]
args = args[1:]
factory_name = self.name or util.get_component_name(obj)
obj.name = factory_name
obj.factory = factory_name
obj.assigns = self.assigns
obj.requires = self.requires
obj.retokenizes = self.retokenizes
def factory(nlp, **cfg):
if hasattr(obj, "from_nlp"):
return obj.from_nlp(nlp, **cfg)
elif isinstance(obj, class_types):
return obj()
return obj
Language.factories[obj.factory] = factory
return obj
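A hedged sketch of registering a plain function component with the decorator; the component name is illustrative, and nlp stands for any existing Language object:

from spacy.language import component

@component("debug_logger")
def debug_logger(doc):
    # toy component: report the doc length and pass the doc through unchanged
    print("processing", len(doc), "tokens")
    return doc

# The decorator writes a factory into Language.factories["debug_logger"], so the
# component can then be created and added by name:
# nlp.add_pipe(nlp.create_pipe("debug_logger"))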
def _fix_pretrained_vectors_name(nlp): def _fix_pretrained_vectors_name(nlp):
# TODO: Replace this once we handle vectors consistently as static # TODO: Replace this once we handle vectors consistently as static
# data # data
@ -987,12 +1096,56 @@ class DisabledPipes(list):
self[:] = [] self[:] = []
def _pipe(func, docs, kwargs): def _pipe(docs, proc, kwargs):
# We added some args for pipe that __call__ doesn't expect. # We added some args for pipe that __call__ doesn't expect.
kwargs = dict(kwargs) kwargs = dict(kwargs)
for arg in ["n_threads", "batch_size"]: for arg in ["n_threads", "batch_size"]:
if arg in kwargs: if arg in kwargs:
kwargs.pop(arg) kwargs.pop(arg)
for doc in docs: for doc in docs:
doc = func(doc, **kwargs) doc = proc(doc, **kwargs)
yield doc yield doc
def _apply_pipes(make_doc, pipes, receiver, sender):
"""Worker for Language.pipe
receiver (multiprocessing.Queue): Queue from which batches of texts are
received; filled by the `_Sender` in the parent process.
sender (multiprocessing.Connection): Pipe to send doc. Usually created by
`multiprocessing.Pipe()`
"""
while True:
texts = receiver.get()
docs = (make_doc(text) for text in texts)
for pipe in pipes:
docs = pipe(docs)
# Connection does not accept unpicklable objects, so send a list of bytes.
sender.send([doc.to_bytes() for doc in docs])
class _Sender:
"""Util for sending data to multiprocessing workers in Language.pipe"""
def __init__(self, data, queues, chunk_size):
self.data = iter(data)
self.queues = iter(cycle(queues))
self.chunk_size = chunk_size
self.count = 0
def send(self):
"""Send chunk_size items from self.data to channels."""
for item, q in itertools.islice(
zip(self.data, cycle(self.queues)), self.chunk_size
):
# cycle through the channels so the texts are distributed evenly
q.put(item)
def step(self):
"""Tell sender that comsumed one item.
Data is sent to the workers after every chunk_size calls."""
self.count += 1
if self.count >= self.chunk_size:
self.count = 0
self.send()

View File

@ -375,7 +375,7 @@ cdef class Lexeme:
Lexeme.c_set_flag(self.c, IS_STOP, x) Lexeme.c_set_flag(self.c, IS_STOP, x)
property is_alpha: property is_alpha:
"""RETURNS (bool): Whether the lexeme consists of alphanumeric """RETURNS (bool): Whether the lexeme consists of alphabetic
characters. Equivalent to `lexeme.text.isalpha()`. characters. Equivalent to `lexeme.text.isalpha()`.
""" """
def __get__(self): def __get__(self):

View File

@ -111,7 +111,7 @@ TOKEN_PATTERN_SCHEMA = {
"$ref": "#/definitions/integer_value", "$ref": "#/definitions/integer_value",
}, },
"IS_ALPHA": { "IS_ALPHA": {
"title": "Token consists of alphanumeric characters", "title": "Token consists of alphabetic characters",
"$ref": "#/definitions/boolean_value", "$ref": "#/definitions/boolean_value",
}, },
"IS_ASCII": { "IS_ASCII": {

View File

@ -102,7 +102,10 @@ cdef class DependencyMatcher:
visitedNodes[relation["SPEC"]["NBOR_NAME"]] = True visitedNodes[relation["SPEC"]["NBOR_NAME"]] = True
idx = idx + 1 idx = idx + 1
def add(self, key, on_match, *patterns): def add(self, key, patterns, *_patterns, on_match=None):
if patterns is None or hasattr(patterns, "__call__"): # old API
on_match = patterns
patterns = _patterns
for pattern in patterns: for pattern in patterns:
if len(pattern) == 0: if len(pattern) == 0:
raise ValueError(Errors.E012.format(key=key)) raise ValueError(Errors.E012.format(key=key))
@ -237,7 +240,7 @@ cdef class DependencyMatcher:
for i, (ent_id, nodes) in enumerate(matched_key_trees): for i, (ent_id, nodes) in enumerate(matched_key_trees):
on_match = self._callbacks.get(ent_id) on_match = self._callbacks.get(ent_id)
if on_match is not None: if on_match is not None:
on_match(self, doc, i, matches) on_match(self, doc, i, matched_key_trees)
return matched_key_trees return matched_key_trees
def recurse(self,tree,id_to_position,_node_operator_map,int patternLength,visitedNodes,matched_trees): def recurse(self,tree,id_to_position,_node_operator_map,int patternLength,visitedNodes,matched_trees):

View File

@ -74,7 +74,7 @@ cdef class Matcher:
""" """
return self._normalize_key(key) in self._patterns return self._normalize_key(key) in self._patterns
def add(self, key, on_match, *patterns): def add(self, key, patterns, *_patterns, on_match=None):
"""Add a match-rule to the matcher. A match-rule consists of: an ID """Add a match-rule to the matcher. A match-rule consists of: an ID
key, an on_match callback, and one or more patterns. key, an on_match callback, and one or more patterns.
@ -98,16 +98,29 @@ cdef class Matcher:
operator will behave non-greedily. This quirk in the semantics makes operator will behave non-greedily. This quirk in the semantics makes
the matcher more efficient, by avoiding the need for back-tracking. the matcher more efficient, by avoiding the need for back-tracking.
As of spaCy v2.2.2, Matcher.add supports the future API, which makes
the patterns the second argument and a list (instead of a variable
number of arguments). The on_match callback becomes an optional keyword
argument.
key (unicode): The match ID. key (unicode): The match ID.
on_match (callable): Callback executed on match. patterns (list): The patterns to add for the given key.
*patterns (list): List of token descriptions. on_match (callable): Optional callback executed on match.
*_patterns (list): For backwards compatibility: list of patterns to add
as variable arguments. Will be ignored if a list of patterns is
provided as the second argument.
""" """
errors = {} errors = {}
if on_match is not None and not hasattr(on_match, "__call__"): if on_match is not None and not hasattr(on_match, "__call__"):
raise ValueError(Errors.E171.format(arg_type=type(on_match))) raise ValueError(Errors.E171.format(arg_type=type(on_match)))
if patterns is None or hasattr(patterns, "__call__"): # old API
on_match = patterns
patterns = _patterns
for i, pattern in enumerate(patterns): for i, pattern in enumerate(patterns):
if len(pattern) == 0: if len(pattern) == 0:
raise ValueError(Errors.E012.format(key=key)) raise ValueError(Errors.E012.format(key=key))
if not isinstance(pattern, list):
raise ValueError(Errors.E178.format(pat=pattern, key=key))
if self.validator: if self.validator:
errors[i] = validate_json(pattern, self.validator) errors[i] = validate_json(pattern, self.validator)
if any(err for err in errors.values()): if any(err for err in errors.values()):
@ -133,13 +146,15 @@ cdef class Matcher:
key (unicode): The ID of the match rule. key (unicode): The ID of the match rule.
""" """
key = self._normalize_key(key) norm_key = self._normalize_key(key)
self._patterns.pop(key) if not norm_key in self._patterns:
self._callbacks.pop(key) raise ValueError(Errors.E175.format(key=key))
self._patterns.pop(norm_key)
self._callbacks.pop(norm_key)
cdef int i = 0 cdef int i = 0
while i < self.patterns.size(): while i < self.patterns.size():
pattern_key = get_pattern_key(self.patterns.at(i)) pattern_key = get_ent_id(self.patterns.at(i))
if pattern_key == key: if pattern_key == norm_key:
self.patterns.erase(self.patterns.begin()+i) self.patterns.erase(self.patterns.begin()+i)
else: else:
i += 1 i += 1
@ -252,7 +267,12 @@ cdef find_matches(TokenPatternC** patterns, int n, Doc doc, extensions=None,
cdef PatternStateC state cdef PatternStateC state
cdef int i, j, nr_extra_attr cdef int i, j, nr_extra_attr
cdef Pool mem = Pool() cdef Pool mem = Pool()
predicate_cache = <char*>mem.alloc(doc.length * len(predicates), sizeof(char)) output = []
if doc.length == 0:
# avoid any processing or mem alloc if the document is empty
return output
if len(predicates) > 0:
predicate_cache = <char*>mem.alloc(doc.length * len(predicates), sizeof(char))
if extensions is not None and len(extensions) >= 1: if extensions is not None and len(extensions) >= 1:
nr_extra_attr = max(extensions.values()) + 1 nr_extra_attr = max(extensions.values()) + 1
extra_attr_values = <attr_t*>mem.alloc(doc.length * nr_extra_attr, sizeof(attr_t)) extra_attr_values = <attr_t*>mem.alloc(doc.length * nr_extra_attr, sizeof(attr_t))
@ -276,7 +296,6 @@ cdef find_matches(TokenPatternC** patterns, int n, Doc doc, extensions=None,
predicate_cache += len(predicates) predicate_cache += len(predicates)
# Handle matches that end in 0-width patterns # Handle matches that end in 0-width patterns
finish_states(matches, states) finish_states(matches, states)
output = []
seen = set() seen = set()
for i in range(matches.size()): for i in range(matches.size()):
match = ( match = (
@ -293,18 +312,6 @@ cdef find_matches(TokenPatternC** patterns, int n, Doc doc, extensions=None,
return output return output
cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
# There have been a few bugs here.
# The code was originally designed to always have pattern[1].attrs.value
# be the ent_id when we get to the end of a pattern. However, Issue #2671
# showed this wasn't the case when we had a reject-and-continue before a
# match.
# The patch to #2671 was wrong though, which came up in #3839.
while pattern.attrs.attr != ID:
pattern += 1
return pattern.attrs.value
cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& matches, cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& matches,
char* cached_py_predicates, char* cached_py_predicates,
Token token, const attr_t* extra_attrs, py_predicates) except *: Token token, const attr_t* extra_attrs, py_predicates) except *:
@ -533,9 +540,10 @@ cdef char get_is_match(PatternStateC state,
if predicate_matches[state.pattern.py_predicates[i]] == -1: if predicate_matches[state.pattern.py_predicates[i]] == -1:
return 0 return 0
spec = state.pattern spec = state.pattern
for attr in spec.attrs[:spec.nr_attr]: if spec.nr_attr > 0:
if get_token_attr(token, attr.attr) != attr.value: for attr in spec.attrs[:spec.nr_attr]:
return 0 if get_token_attr(token, attr.attr) != attr.value:
return 0
for i in range(spec.nr_extra_attr): for i in range(spec.nr_extra_attr):
if spec.extra_attrs[i].value != extra_attrs[spec.extra_attrs[i].index]: if spec.extra_attrs[i].value != extra_attrs[spec.extra_attrs[i].index]:
return 0 return 0
@ -543,7 +551,11 @@ cdef char get_is_match(PatternStateC state,
cdef char get_is_final(PatternStateC state) nogil: cdef char get_is_final(PatternStateC state) nogil:
if state.pattern[1].attrs[0].attr == ID and state.pattern[1].nr_attr == 0: if state.pattern[1].nr_attr == 0 and state.pattern[1].attrs != NULL:
id_attr = state.pattern[1].attrs[0]
if id_attr.attr != ID:
with gil:
raise ValueError(Errors.E074.format(attr=ID, bad_attr=id_attr.attr))
return 1 return 1
else: else:
return 0 return 0
@ -558,22 +570,27 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
cdef int i, index cdef int i, index
for i, (quantifier, spec, extensions, predicates) in enumerate(token_specs): for i, (quantifier, spec, extensions, predicates) in enumerate(token_specs):
pattern[i].quantifier = quantifier pattern[i].quantifier = quantifier
pattern[i].attrs = <AttrValueC*>mem.alloc(len(spec), sizeof(AttrValueC)) # Ensure attrs refers to a null pointer if nr_attr == 0
if len(spec) > 0:
pattern[i].attrs = <AttrValueC*>mem.alloc(len(spec), sizeof(AttrValueC))
pattern[i].nr_attr = len(spec) pattern[i].nr_attr = len(spec)
for j, (attr, value) in enumerate(spec): for j, (attr, value) in enumerate(spec):
pattern[i].attrs[j].attr = attr pattern[i].attrs[j].attr = attr
pattern[i].attrs[j].value = value pattern[i].attrs[j].value = value
pattern[i].extra_attrs = <IndexValueC*>mem.alloc(len(extensions), sizeof(IndexValueC)) if len(extensions) > 0:
pattern[i].extra_attrs = <IndexValueC*>mem.alloc(len(extensions), sizeof(IndexValueC))
for j, (index, value) in enumerate(extensions): for j, (index, value) in enumerate(extensions):
pattern[i].extra_attrs[j].index = index pattern[i].extra_attrs[j].index = index
pattern[i].extra_attrs[j].value = value pattern[i].extra_attrs[j].value = value
pattern[i].nr_extra_attr = len(extensions) pattern[i].nr_extra_attr = len(extensions)
pattern[i].py_predicates = <int32_t*>mem.alloc(len(predicates), sizeof(int32_t)) if len(predicates) > 0:
pattern[i].py_predicates = <int32_t*>mem.alloc(len(predicates), sizeof(int32_t))
for j, index in enumerate(predicates): for j, index in enumerate(predicates):
pattern[i].py_predicates[j] = index pattern[i].py_predicates[j] = index
pattern[i].nr_py = len(predicates) pattern[i].nr_py = len(predicates)
pattern[i].key = hash64(pattern[i].attrs, pattern[i].nr_attr * sizeof(AttrValueC), 0) pattern[i].key = hash64(pattern[i].attrs, pattern[i].nr_attr * sizeof(AttrValueC), 0)
i = len(token_specs) i = len(token_specs)
# Even though nr_attr == 0 here, we're storing the ID value in attrs[0] (bug-prone, tread carefully!)
pattern[i].attrs = <AttrValueC*>mem.alloc(2, sizeof(AttrValueC)) pattern[i].attrs = <AttrValueC*>mem.alloc(2, sizeof(AttrValueC))
pattern[i].attrs[0].attr = ID pattern[i].attrs[0].attr = ID
pattern[i].attrs[0].value = entity_id pattern[i].attrs[0].value = entity_id
@ -583,8 +600,26 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
return pattern return pattern
cdef attr_t get_pattern_key(const TokenPatternC* pattern) nogil: cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
while pattern.nr_attr != 0 or pattern.nr_extra_attr != 0 or pattern.nr_py != 0: # There have been a few bugs here. We used to have two functions,
# get_ent_id and get_pattern_key that tried to do the same thing. These
# are now unified to try to solve the "ghost match" problem.
# Below is the previous implementation of get_ent_id and the comment on it,
# preserved for reference while we figure out whether the heisenbug in the
# matcher is resolved.
#
#
# cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
# # The code was originally designed to always have pattern[1].attrs.value
# # be the ent_id when we get to the end of a pattern. However, Issue #2671
# # showed this wasn't the case when we had a reject-and-continue before a
# # match.
# # The patch to #2671 was wrong though, which came up in #3839.
# while pattern.attrs.attr != ID:
# pattern += 1
# return pattern.attrs.value
while pattern.nr_attr != 0 or pattern.nr_extra_attr != 0 or pattern.nr_py != 0 \
or pattern.quantifier != ZERO:
pattern += 1 pattern += 1
id_attr = pattern[0].attrs[0] id_attr = pattern[0].attrs[0]
if id_attr.attr != ID: if id_attr.attr != ID:
@ -642,7 +677,7 @@ def _get_attr_values(spec, string_store):
value = string_store.add(value) value = string_store.add(value)
elif isinstance(value, bool): elif isinstance(value, bool):
value = int(value) value = int(value)
elif isinstance(value, dict): elif isinstance(value, (dict, int)):
continue continue
else: else:
raise ValueError(Errors.E153.format(vtype=type(value).__name__)) raise ValueError(Errors.E153.format(vtype=type(value).__name__))

View File

@ -4,6 +4,7 @@ from cymem.cymem cimport Pool
from preshed.maps cimport key_t, MapStruct from preshed.maps cimport key_t, MapStruct
from ..attrs cimport attr_id_t from ..attrs cimport attr_id_t
from ..structs cimport SpanC
from ..tokens.doc cimport Doc from ..tokens.doc cimport Doc
from ..vocab cimport Vocab from ..vocab cimport Vocab
@ -18,10 +19,4 @@ cdef class PhraseMatcher:
cdef Pool mem cdef Pool mem
cdef key_t _terminal_hash cdef key_t _terminal_hash
cdef void find_matches(self, Doc doc, vector[MatchStruct] *matches) nogil cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil
cdef struct MatchStruct:
key_t match_id
int start
int end

View File

@ -9,6 +9,7 @@ from preshed.maps cimport map_init, map_set, map_get, map_clear, map_iter
from ..attrs cimport ORTH, POS, TAG, DEP, LEMMA from ..attrs cimport ORTH, POS, TAG, DEP, LEMMA
from ..structs cimport TokenC from ..structs cimport TokenC
from ..tokens.token cimport Token from ..tokens.token cimport Token
from ..typedefs cimport attr_t
from ._schemas import TOKEN_PATTERN_SCHEMA from ._schemas import TOKEN_PATTERN_SCHEMA
from ..errors import Errors, Warnings, deprecation_warning, user_warning from ..errors import Errors, Warnings, deprecation_warning, user_warning
@ -102,8 +103,10 @@ cdef class PhraseMatcher:
cdef vector[MapStruct*] path_nodes cdef vector[MapStruct*] path_nodes
cdef vector[key_t] path_keys cdef vector[key_t] path_keys
cdef key_t key_to_remove cdef key_t key_to_remove
for keyword in self._docs[key]: for keyword in sorted(self._docs[key], key=lambda x: len(x), reverse=True):
current_node = self.c_map current_node = self.c_map
path_nodes.clear()
path_keys.clear()
for token in keyword: for token in keyword:
result = map_get(current_node, token) result = map_get(current_node, token)
if result: if result:
@ -149,16 +152,27 @@ cdef class PhraseMatcher:
del self._callbacks[key] del self._callbacks[key]
del self._docs[key] del self._docs[key]
def add(self, key, on_match, *docs): def add(self, key, docs, *_docs, on_match=None):
"""Add a match-rule to the phrase-matcher. A match-rule consists of: an ID """Add a match-rule to the phrase-matcher. A match-rule consists of: an ID
key, an on_match callback, and one or more patterns. key, an on_match callback, and one or more patterns.
As of spaCy v2.2.2, PhraseMatcher.add supports the future API, which
makes the patterns the second argument and a list (instead of a variable
number of arguments). The on_match callback becomes an optional keyword
argument.
key (unicode): The match ID. key (unicode): The match ID.
docs (list): List of `Doc` objects representing match patterns.
on_match (callable): Callback executed on match. on_match (callable): Callback executed on match.
*docs (Doc): `Doc` objects representing match patterns. *_docs (Doc): For backwards compatibility: list of patterns to add
as variable arguments. Will be ignored if a list of patterns is
provided as the second argument.
DOCS: https://spacy.io/api/phrasematcher#add DOCS: https://spacy.io/api/phrasematcher#add
""" """
if docs is None or hasattr(docs, "__call__"): # old API
on_match = docs
docs = _docs
_ = self.vocab[key] _ = self.vocab[key]
self._callbacks[key] = on_match self._callbacks[key] = on_match
@ -168,6 +182,8 @@ cdef class PhraseMatcher:
cdef MapStruct* internal_node cdef MapStruct* internal_node
cdef void* result cdef void* result
if isinstance(docs, Doc):
raise ValueError(Errors.E179.format(key=key))
for doc in docs: for doc in docs:
if len(doc) == 0: if len(doc) == 0:
continue continue
@ -220,17 +236,17 @@ cdef class PhraseMatcher:
# if doc is empty or None just return empty list # if doc is empty or None just return empty list
return matches return matches
cdef vector[MatchStruct] c_matches cdef vector[SpanC] c_matches
self.find_matches(doc, &c_matches) self.find_matches(doc, &c_matches)
for i in range(c_matches.size()): for i in range(c_matches.size()):
matches.append((c_matches[i].match_id, c_matches[i].start, c_matches[i].end)) matches.append((c_matches[i].label, c_matches[i].start, c_matches[i].end))
for i, (ent_id, start, end) in enumerate(matches): for i, (ent_id, start, end) in enumerate(matches):
on_match = self._callbacks.get(ent_id) on_match = self._callbacks.get(self.vocab.strings[ent_id])
if on_match is not None: if on_match is not None:
on_match(self, doc, i, matches) on_match(self, doc, i, matches)
return matches return matches
cdef void find_matches(self, Doc doc, vector[MatchStruct] *matches) nogil: cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil:
cdef MapStruct* current_node = self.c_map cdef MapStruct* current_node = self.c_map
cdef int start = 0 cdef int start = 0
cdef int idx = 0 cdef int idx = 0
@ -238,7 +254,7 @@ cdef class PhraseMatcher:
cdef key_t key cdef key_t key
cdef void* value cdef void* value
cdef int i = 0 cdef int i = 0
cdef MatchStruct ms cdef SpanC ms
cdef void* result cdef void* result
while idx < doc.length: while idx < doc.length:
start = idx start = idx
@ -253,7 +269,7 @@ cdef class PhraseMatcher:
if result: if result:
i = 0 i = 0
while map_iter(<MapStruct*>result, &i, &key, &value): while map_iter(<MapStruct*>result, &i, &key, &value):
ms = make_matchstruct(key, start, idy) ms = make_spanstruct(key, start, idy)
matches.push_back(ms) matches.push_back(ms)
inner_token = Token.get_struct_attr(&doc.c[idy], self.attr) inner_token = Token.get_struct_attr(&doc.c[idy], self.attr)
result = map_get(current_node, inner_token) result = map_get(current_node, inner_token)
@ -268,7 +284,7 @@ cdef class PhraseMatcher:
if result: if result:
i = 0 i = 0
while map_iter(<MapStruct*>result, &i, &key, &value): while map_iter(<MapStruct*>result, &i, &key, &value):
ms = make_matchstruct(key, start, idy) ms = make_spanstruct(key, start, idy)
matches.push_back(ms) matches.push_back(ms)
current_node = self.c_map current_node = self.c_map
idx += 1 idx += 1
@ -318,9 +334,9 @@ def unpickle_matcher(vocab, docs, callbacks, attr):
return matcher return matcher
cdef MatchStruct make_matchstruct(key_t match_id, int start, int end) nogil: cdef SpanC make_spanstruct(attr_t label, int start, int end) nogil:
cdef MatchStruct ms cdef SpanC spanc
ms.match_id = match_id spanc.label = label
ms.start = start spanc.start = start
ms.end = end spanc.end = end
return ms return spanc

5
spacy/ml/__init__.py Normal file
View File

@ -0,0 +1,5 @@
# coding: utf8
from __future__ import unicode_literals
from .tok2vec import Tok2Vec # noqa: F401
from .common import FeedForward, LayerNormalizedMaxout # noqa: F401

131
spacy/ml/_legacy_tok2vec.py Normal file
View File

@ -0,0 +1,131 @@
# coding: utf8
from __future__ import unicode_literals
from thinc.v2v import Model, Maxout
from thinc.i2v import HashEmbed, StaticVectors
from thinc.t2t import ExtractWindow
from thinc.misc import Residual
from thinc.misc import LayerNorm as LN
from thinc.misc import FeatureExtracter
from thinc.api import layerize, chain, clone, concatenate, with_flatten
from thinc.api import uniqued, wrap, noop
from ..attrs import ID, ORTH, NORM, PREFIX, SUFFIX, SHAPE
def Tok2Vec(width, embed_size, **kwargs):
# Circular imports :(
from .._ml import CharacterEmbed
from .._ml import PyTorchBiLSTM
pretrained_vectors = kwargs.get("pretrained_vectors", None)
cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3)
subword_features = kwargs.get("subword_features", True)
char_embed = kwargs.get("char_embed", False)
if char_embed:
subword_features = False
conv_depth = kwargs.get("conv_depth", 4)
bilstm_depth = kwargs.get("bilstm_depth", 0)
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
with Model.define_operators({">>": chain, "|": concatenate, "**": clone}):
norm = HashEmbed(width, embed_size, column=cols.index(NORM), name="embed_norm")
if subword_features:
prefix = HashEmbed(
width, embed_size // 2, column=cols.index(PREFIX), name="embed_prefix"
)
suffix = HashEmbed(
width, embed_size // 2, column=cols.index(SUFFIX), name="embed_suffix"
)
shape = HashEmbed(
width, embed_size // 2, column=cols.index(SHAPE), name="embed_shape"
)
else:
prefix, suffix, shape = (None, None, None)
if pretrained_vectors is not None:
glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))
if subword_features:
embed = uniqued(
(glove | norm | prefix | suffix | shape)
>> LN(Maxout(width, width * 5, pieces=3)),
column=cols.index(ORTH),
)
else:
embed = uniqued(
(glove | norm) >> LN(Maxout(width, width * 2, pieces=3)),
column=cols.index(ORTH),
)
elif subword_features:
embed = uniqued(
(norm | prefix | suffix | shape)
>> LN(Maxout(width, width * 4, pieces=3)),
column=cols.index(ORTH),
)
elif char_embed:
embed = concatenate_lists(
CharacterEmbed(nM=64, nC=8),
FeatureExtracter(cols) >> with_flatten(norm),
)
reduce_dimensions = LN(
Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces)
)
else:
embed = norm
convolution = Residual(
ExtractWindow(nW=1)
>> LN(Maxout(width, width * 3, pieces=cnn_maxout_pieces))
)
if char_embed:
tok2vec = embed >> with_flatten(
reduce_dimensions >> convolution ** conv_depth, pad=conv_depth
)
else:
tok2vec = FeatureExtracter(cols) >> with_flatten(
embed >> convolution ** conv_depth, pad=conv_depth
)
if bilstm_depth >= 1:
tok2vec = tok2vec >> PyTorchBiLSTM(width, width, bilstm_depth)
# Work around thinc API limitations :(. TODO: Revise in Thinc 7
tok2vec.nO = width
tok2vec.embed = embed
return tok2vec
@layerize
def flatten(seqs, drop=0.0):
ops = Model.ops
lengths = ops.asarray([len(seq) for seq in seqs], dtype="i")
def finish_update(d_X, sgd=None):
return ops.unflatten(d_X, lengths, pad=0)
X = ops.flatten(seqs, pad=0)
return X, finish_update
def concatenate_lists(*layers, **kwargs): # pragma: no cover
"""Compose two or more models `f`, `g`, etc, such that their outputs are
concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))`
"""
if not layers:
return noop()
drop_factor = kwargs.get("drop_factor", 1.0)
ops = layers[0].ops
layers = [chain(layer, flatten) for layer in layers]
concat = concatenate(*layers)
def concatenate_lists_fwd(Xs, drop=0.0):
if drop is not None:
drop *= drop_factor
lengths = ops.asarray([len(X) for X in Xs], dtype="i")
flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop)
ys = ops.unflatten(flat_y, lengths)
def concatenate_lists_bwd(d_ys, sgd=None):
return bp_flat_y(ops.flatten(d_ys), sgd=sgd)
return ys, concatenate_lists_bwd
model = wrap(concatenate_lists_fwd, concat)
return model

42
spacy/ml/_wire.py Normal file
View File

@ -0,0 +1,42 @@
from __future__ import unicode_literals
from thinc.api import layerize, wrap, noop, chain, concatenate
from thinc.v2v import Model
def concatenate_lists(*layers, **kwargs): # pragma: no cover
"""Compose two or more models `f`, `g`, etc, such that their outputs are
concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))`
"""
if not layers:
return layerize(noop())
drop_factor = kwargs.get("drop_factor", 1.0)
ops = layers[0].ops
layers = [chain(layer, flatten) for layer in layers]
concat = concatenate(*layers)
def concatenate_lists_fwd(Xs, drop=0.0):
if drop is not None:
drop *= drop_factor
lengths = ops.asarray([len(X) for X in Xs], dtype="i")
flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop)
ys = ops.unflatten(flat_y, lengths)
def concatenate_lists_bwd(d_ys, sgd=None):
return bp_flat_y(ops.flatten(d_ys), sgd=sgd)
return ys, concatenate_lists_bwd
model = wrap(concatenate_lists_fwd, concat)
return model
@layerize
def flatten(seqs, drop=0.0):
ops = Model.ops
lengths = ops.asarray([len(seq) for seq in seqs], dtype="i")
def finish_update(d_X, sgd=None):
return ops.unflatten(d_X, lengths, pad=0)
X = ops.flatten(seqs, pad=0)
return X, finish_update

23
spacy/ml/common.py Normal file

@ -0,0 +1,23 @@
from __future__ import unicode_literals
from thinc.api import chain
from thinc.v2v import Maxout
from thinc.misc import LayerNorm
from ..util import registry, make_layer
@registry.architectures.register("thinc.FeedForward.v1")
def FeedForward(config):
layers = [make_layer(layer_cfg) for layer_cfg in config["layers"]]
model = chain(*layers)
model.cfg = config
return model
@registry.architectures.register("spacy.LayerNormalizedMaxout.v1")
def LayerNormalizedMaxout(config):
width = config["width"]
pieces = config["pieces"]
layer = LayerNorm(Maxout(width, pieces=pieces))
layer.nO = width
return layer
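As a hedged illustration of how these registered architectures are meant to be resolved (the `make_layer` call shape is inferred from its use elsewhere in this diff; the values are arbitrary):

# Sketch: build the layer-normalized maxout block from a plain config dict.
mix = make_layer({
    "arch": "spacy.LayerNormalizedMaxout.v1",
    "config": {"width": 96, "pieces": 3},
})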

176
spacy/ml/tok2vec.py Normal file

@ -0,0 +1,176 @@
from __future__ import unicode_literals
from thinc.api import chain, layerize, clone, concatenate, with_flatten, uniqued
from thinc.api import noop, with_square_sequences
from thinc.v2v import Maxout, Model
from thinc.i2v import HashEmbed, StaticVectors
from thinc.t2t import ExtractWindow
from thinc.misc import Residual, LayerNorm, FeatureExtracter
from ..util import make_layer, registry
from ._wire import concatenate_lists
@registry.architectures.register("spacy.Tok2Vec.v1")
def Tok2Vec(config):
doc2feats = make_layer(config["@doc2feats"])
embed = make_layer(config["@embed"])
encode = make_layer(config["@encode"])
field_size = getattr(encode, "receptive_field", 0)
tok2vec = chain(doc2feats, with_flatten(chain(embed, encode), pad=field_size))
tok2vec.cfg = config
tok2vec.nO = encode.nO
tok2vec.embed = embed
tok2vec.encode = encode
return tok2vec
@registry.architectures.register("spacy.Doc2Feats.v1")
def Doc2Feats(config):
columns = config["columns"]
return FeatureExtracter(columns)
@registry.architectures.register("spacy.MultiHashEmbed.v1")
def MultiHashEmbed(config):
# For backwards compatibility with models before the architecture registry,
# we have to be careful to get exactly the same model structure. One subtle
# trick is that when we define concatenation with the operator, the operator
# is actually binary associative. So when we write (a | b | c), we're actually
# getting concatenate(concatenate(a, b), c). That's why the implementation
# is a bit ugly here.
cols = config["columns"]
width = config["width"]
rows = config["rows"]
norm = HashEmbed(width, rows, column=cols.index("NORM"), name="embed_norm")
if config["use_subwords"]:
prefix = HashEmbed(
width, rows // 2, column=cols.index("PREFIX"), name="embed_prefix"
)
suffix = HashEmbed(
width, rows // 2, column=cols.index("SUFFIX"), name="embed_suffix"
)
shape = HashEmbed(
width, rows // 2, column=cols.index("SHAPE"), name="embed_shape"
)
if config.get("@pretrained_vectors"):
glove = make_layer(config["@pretrained_vectors"])
mix = make_layer(config["@mix"])
with Model.define_operators({">>": chain, "|": concatenate}):
if config["use_subwords"] and config["@pretrained_vectors"]:
mix._layers[0].nI = width * 5
layer = uniqued(
(glove | norm | prefix | suffix | shape) >> mix,
column=cols.index("ORTH"),
)
elif config["use_subwords"]:
mix._layers[0].nI = width * 4
layer = uniqued(
(norm | prefix | suffix | shape) >> mix, column=cols.index("ORTH")
)
elif config["@pretrained_vectors"]:
mix._layers[0].nI = width * 2
layer = uniqued((glove | norm) >> mix, column=cols.index("ORTH"),)
else:
layer = norm
layer.cfg = config
return layer
@registry.architectures.register("spacy.CharacterEmbed.v1")
def CharacterEmbed(config):
from .. import _ml
width = config["width"]
chars = config["chars"]
chr_embed = _ml.CharacterEmbedModel(nM=width, nC=chars)
other_tables = make_layer(config["@embed_features"])
mix = make_layer(config["@mix"])
model = chain(concatenate_lists(chr_embed, other_tables), mix)
model.cfg = config
return model
@registry.architectures.register("spacy.MaxoutWindowEncoder.v1")
def MaxoutWindowEncoder(config):
nO = config["width"]
nW = config["window_size"]
nP = config["pieces"]
depth = config["depth"]
cnn = chain(
ExtractWindow(nW=nW), LayerNorm(Maxout(nO, nO * ((nW * 2) + 1), pieces=nP))
)
model = clone(Residual(cnn), depth)
model.nO = nO
model.receptive_field = nW * depth
return model
@registry.architectures.register("spacy.MishWindowEncoder.v1")
def MishWindowEncoder(config):
from thinc.v2v import Mish
nO = config["width"]
nW = config["window_size"]
depth = config["depth"]
cnn = chain(ExtractWindow(nW=nW), LayerNorm(Mish(nO, nO * ((nW * 2) + 1))))
model = clone(Residual(cnn), depth)
model.nO = nO
return model
@registry.architectures.register("spacy.PretrainedVectors.v1")
def PretrainedVectors(config):
return StaticVectors(config["vectors_name"], config["width"], config["column"])
@registry.architectures.register("spacy.TorchBiLSTMEncoder.v1")
def TorchBiLSTMEncoder(config):
import torch.nn
from thinc.extra.wrappers import PyTorchWrapperRNN
width = config["width"]
depth = config["depth"]
if depth == 0:
return layerize(noop())
return with_square_sequences(
PyTorchWrapperRNN(torch.nn.LSTM(width, width // 2, depth, bidirectional=True))
)
_EXAMPLE_CONFIG = {
"@doc2feats": {
"arch": "Doc2Feats",
"config": {"columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"]},
},
"@embed": {
"arch": "spacy.MultiHashEmbed.v1",
"config": {
"width": 96,
"rows": 2000,
"columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"],
"use_subwords": True,
"@pretrained_vectors": {
"arch": "TransformedStaticVectors",
"config": {
"vectors_name": "en_vectors_web_lg.vectors",
"width": 96,
"column": 0,
},
},
"@mix": {
"arch": "LayerNormalizedMaxout",
"config": {"width": 96, "pieces": 3},
},
},
},
"@encode": {
"arch": "MaxoutWindowEncode",
"config": {"width": 96, "window_size": 1, "depth": 4, "pieces": 3},
},
}
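A hedged sketch of resolving a full Tok2Vec model through the registry (not taken from the diff): unlike `_EXAMPLE_CONFIG`, which uses shorthand `arch` names, this uses the names registered above, and it assumes `make_layer` accepts the `{"arch": ..., "config": ...}` shape shown there.

# Sketch only: arch names are the registered ones; all values are examples.
tok2vec = make_layer({
    "arch": "spacy.Tok2Vec.v1",
    "config": {
        "@doc2feats": {
            "arch": "spacy.Doc2Feats.v1",
            "config": {"columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"]},
        },
        "@embed": {
            "arch": "spacy.MultiHashEmbed.v1",
            "config": {
                "width": 96,
                "rows": 2000,
                "columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"],
                "use_subwords": True,
                "@pretrained_vectors": None,
                "@mix": {
                    "arch": "spacy.LayerNormalizedMaxout.v1",
                    "config": {"width": 96, "pieces": 3},
                },
            },
        },
        "@encode": {
            "arch": "spacy.MaxoutWindowEncoder.v1",
            "config": {"width": 96, "window_size": 1, "depth": 4, "pieces": 3},
        },
    },
})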

spacy/pipeline/entityruler.py

@ -4,6 +4,7 @@ from __future__ import unicode_literals
 from collections import defaultdict, OrderedDict
 import srsly
 
+from ..language import component
 from ..errors import Errors
 from ..compat import basestring_
 from ..util import ensure_path, to_disk, from_disk
@ -13,6 +14,7 @@ from ..matcher import Matcher, PhraseMatcher
 DEFAULT_ENT_ID_SEP = "||"
 
 
+@component("entity_ruler", assigns=["doc.ents", "token.ent_type", "token.ent_iob"])
 class EntityRuler(object):
     """The EntityRuler lets you add spans to the `Doc.ents` using token-based
     rules or exact phrase matches. It can be combined with the statistical
@ -24,8 +26,6 @@ class EntityRuler(object):
     USAGE: https://spacy.io/usage/rule-based-matching#entityruler
     """
 
-    name = "entity_ruler"
-
     def __init__(self, nlp, phrase_matcher_attr=None, validate=False, **cfg):
         """Initialize the entitiy ruler. If patterns are supplied here, they
         need to be a list of dictionaries with a `"label"` and `"pattern"`
@ -64,10 +64,15 @@
             self.phrase_matcher_attr = None
             self.phrase_matcher = PhraseMatcher(nlp.vocab, validate=validate)
         self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
+        self._ent_ids = defaultdict(dict)
         patterns = cfg.get("patterns")
         if patterns is not None:
             self.add_patterns(patterns)
 
+    @classmethod
+    def from_nlp(cls, nlp, **cfg):
+        return cls(nlp, **cfg)
+
     def __len__(self):
         """The number of all patterns added to the entity ruler."""
         n_token_patterns = sum(len(p) for p in self.token_patterns.values())
@ -100,10 +105,9 @@
                 continue
             # check for end - 1 here because boundaries are inclusive
             if start not in seen_tokens and end - 1 not in seen_tokens:
-                if self.ent_ids:
-                    label_ = self.nlp.vocab.strings[match_id]
-                    ent_label, ent_id = self._split_label(label_)
-                    span = Span(doc, start, end, label=ent_label)
+                if match_id in self._ent_ids:
+                    label, ent_id = self._ent_ids[match_id]
+                    span = Span(doc, start, end, label=label)
                     if ent_id:
                         for token in span:
                             token.ent_id_ = ent_id
@ -131,11 +135,11 @@
     @property
     def ent_ids(self):
-        """All entity ids present in the match patterns meta dicts.
+        """All entity ids present in the match patterns `id` properties.
 
         RETURNS (set): The string entity ids.
 
-        DOCS: https://spacy.io/api/entityruler#labels
+        DOCS: https://spacy.io/api/entityruler#ent_ids
         """
         all_ent_ids = set()
         for l in self.labels:
@ -147,7 +151,6 @@
     @property
     def patterns(self):
         """Get all patterns that were added to the entity ruler.
 
         RETURNS (list): The original patterns, one dictionary per pattern.
 
         DOCS: https://spacy.io/api/entityruler#patterns
@ -183,14 +186,20 @@
         # disable the nlp components after this one in case they hadn't been initialized / deserialised yet
         try:
             current_index = self.nlp.pipe_names.index(self.name)
-            subsequent_pipes = [pipe for pipe in self.nlp.pipe_names[current_index + 1:]]
+            subsequent_pipes = [
+                pipe for pipe in self.nlp.pipe_names[current_index + 1 :]
+            ]
         except ValueError:
             subsequent_pipes = []
-        with self.nlp.disable_pipes(*subsequent_pipes):
+        with self.nlp.disable_pipes(subsequent_pipes):
             for entry in patterns:
                 label = entry["label"]
                 if "id" in entry:
+                    ent_label = label
                     label = self._create_label(label, entry["id"])
+                    key = self.matcher._normalize_key(label)
+                    self._ent_ids[key] = (ent_label, entry["id"])
                 pattern = entry["pattern"]
                 if isinstance(pattern, basestring_):
                     self.phrase_patterns[label].append(self.nlp(pattern))
@ -199,9 +208,9 @@
                 else:
                     raise ValueError(Errors.E097.format(pattern=pattern))
         for label, patterns in self.token_patterns.items():
-            self.matcher.add(label, None, *patterns)
+            self.matcher.add(label, patterns)
         for label, patterns in self.phrase_patterns.items():
-            self.phrase_matcher.add(label, None, *patterns)
+            self.phrase_matcher.add(label, patterns)
 
     def _split_label(self, label):
         """Split Entity label into ent_label and ent_id if it contains self.ent_id_sep

spacy/pipeline/functions.py

@ -1,9 +1,16 @@
 # coding: utf8
 from __future__ import unicode_literals
 
+from ..language import component
 from ..matcher import Matcher
+from ..util import filter_spans
 
 
+@component(
+    "merge_noun_chunks",
+    requires=["token.dep", "token.tag", "token.pos"],
+    retokenizes=True,
+)
 def merge_noun_chunks(doc):
     """Merge noun chunks into a single token.
@ -21,6 +28,11 @@ def merge_noun_chunks(doc):
     return doc
 
 
+@component(
+    "merge_entities",
+    requires=["doc.ents", "token.ent_iob", "token.ent_type"],
+    retokenizes=True,
+)
 def merge_entities(doc):
     """Merge entities into a single token.
@ -36,6 +48,7 @@ def merge_entities(doc):
     return doc
 
 
+@component("merge_subtokens", requires=["token.dep"], retokenizes=True)
 def merge_subtokens(doc, label="subtok"):
     """Merge subtokens into a single token.
@ -48,7 +61,7 @@ def merge_subtokens(doc, label="subtok"):
     merger = Matcher(doc.vocab)
     merger.add("SUBTOK", None, [{"DEP": label, "op": "+"}])
     matches = merger(doc)
-    spans = [doc[start : end + 1] for _, start, end in matches]
+    spans = filter_spans([doc[start : end + 1] for _, start, end in matches])
     with doc.retokenize() as retokenizer:
         for span in spans:
             retokenizer.merge(span)
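The switch to `filter_spans` above exists because overlapping "subtok" matches would otherwise trip up the retokenizer; `filter_spans` keeps the longest non-overlapping spans. A hedged sketch of that helper's behaviour, assuming an existing `nlp` object:

# Sketch: filter_spans drops overlapping spans, preferring the longest ones.
from spacy.util import filter_spans

doc = nlp("New York City is large")
spans = [doc[0:3], doc[1:3], doc[3:4]]   # "New York City", "York City", "is"
print(filter_spans(spans))               # [New York City, is]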

spacy/pipeline/hooks.py

@ -5,9 +5,11 @@ from thinc.t2v import Pooling, max_pool, mean_pool
 from thinc.neural._classes.difference import Siamese, CauchySimilarity
 
 from .pipes import Pipe
+from ..language import component
 from .._ml import link_vectors_to_models
 
 
+@component("sentencizer_hook", assigns=["doc.user_hooks"])
 class SentenceSegmenter(object):
     """A simple spaCy hook, to allow custom sentence boundary detection logic
     (that doesn't require the dependency parse). To change the sentence
@ -17,8 +19,6 @@ class SentenceSegmenter(object):
     and yield `Span` objects for each sentence.
     """
 
-    name = "sentencizer"
-
     def __init__(self, vocab, strategy=None):
         self.vocab = vocab
         if strategy is None or strategy == "on_punct":
@ -44,6 +44,7 @@ class SentenceSegmenter(object):
             yield doc[start : len(doc)]
 
 
+@component("similarity", assigns=["doc.user_hooks"])
 class SimilarityHook(Pipe):
     """
     Experimental: A pipeline component to install a hook for supervised
@ -58,8 +59,6 @@ class SimilarityHook(Pipe):
     Where W is a vector of dimension weights, initialized to 1.
     """
 
-    name = "similarity"
-
     def __init__(self, vocab, model=True, **cfg):
         self.vocab = vocab
         self.model = model

spacy/pipeline/morphologizer.pyx

@ -8,6 +8,7 @@ from thinc.api import chain
 from thinc.neural.util import to_categorical, copy_array, get_array_module
 
 from .. import util
 from .pipes import Pipe
+from ..language import component
 from .._ml import Tok2Vec, build_morphologizer_model
 from .._ml import link_vectors_to_models, zero_init, flatten
 from .._ml import create_default_optimizer
@ -18,9 +19,9 @@ from ..vocab cimport Vocab
 from ..morphology cimport Morphology
 
 
+@component("morphologizer", assigns=["token.morph", "token.pos"])
 class Morphologizer(Pipe):
-    name = 'morphologizer'
 
     @classmethod
     def Model(cls, **cfg):
         if cfg.get('pretrained_dims') and not cfg.get('pretrained_vectors'):

spacy/pipeline/pipes.pyx

@ -13,7 +13,6 @@ from thinc.misc import LayerNorm
 from thinc.neural.util import to_categorical
 from thinc.neural.util import get_array_module
 
-from .functions import merge_subtokens
 from ..tokens.doc cimport Doc
 from ..syntax.nn_parser cimport Parser
 from ..syntax.ner cimport BiluoPushDown
@ -21,6 +20,8 @@ from ..syntax.arc_eager cimport ArcEager
 from ..morphology cimport Morphology
 from ..vocab cimport Vocab
 
+from .functions import merge_subtokens
+from ..language import Language, component
 from ..syntax import nonproj
 from ..attrs import POS, ID
 from ..parts_of_speech import X
@ -55,6 +56,10 @@ class Pipe(object):
         """Initialize a model for the pipe."""
         raise NotImplementedError
 
+    @classmethod
+    def from_nlp(cls, nlp, **cfg):
+        return cls(nlp.vocab, **cfg)
+
     def __init__(self, vocab, model=True, **cfg):
         """Create a new pipe instance."""
         raise NotImplementedError
@ -224,11 +229,10 @@ class Pipe(object):
         return self
 
 
+@component("tensorizer", assigns=["doc.tensor"])
 class Tensorizer(Pipe):
     """Pre-train position-sensitive vectors for tokens."""
 
-    name = "tensorizer"
-
     @classmethod
     def Model(cls, output_size=300, **cfg):
         """Create a new statistical model for the class.
@ -363,14 +367,13 @@ class Tensorizer(Pipe):
         return sgd
 
 
+@component("tagger", assigns=["token.tag", "token.pos"])
 class Tagger(Pipe):
     """Pipeline component for part-of-speech tagging.
 
     DOCS: https://spacy.io/api/tagger
     """
 
-    name = "tagger"
-
     def __init__(self, vocab, model=True, **cfg):
         self.vocab = vocab
         self.model = model
@ -515,7 +518,6 @@ class Tagger(Pipe):
         orig_tag_map = dict(self.vocab.morphology.tag_map)
         new_tag_map = OrderedDict()
         for raw_text, annots_brackets in get_gold_tuples():
-            _ = annots_brackets.pop()
             for annots, brackets in annots_brackets:
                 ids, words, tags, heads, deps, ents = annots
                 for tag in tags:
@ -658,13 +660,12 @@ class Tagger(Pipe):
         return self
 
 
+@component("nn_labeller")
 class MultitaskObjective(Tagger):
     """Experimental: Assist training of a parser or tagger, by training a
     side-objective.
     """
 
-    name = "nn_labeller"
-
     def __init__(self, vocab, model=True, target='dep_tag_offset', **cfg):
         self.vocab = vocab
         self.model = model
@ -923,12 +924,12 @@ class ClozeMultitask(Pipe):
         return words
 
 
+@component("textcat", assigns=["doc.cats"])
 class TextCategorizer(Pipe):
     """Pipeline component for text classification.
 
     DOCS: https://spacy.io/api/textcategorizer
     """
 
-    name = 'textcat'
 
     @classmethod
     def Model(cls, nr_class=1, **cfg):
@ -1057,9 +1058,10 @@ class TextCategorizer(Pipe):
         return 1
 
     def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs):
-        for raw_text, (_, (cats, _2)) in get_gold_tuples():
-            for cat in cats:
-                self.add_label(cat)
+        for raw_text, annot_brackets in get_gold_tuples():
+            for _, (cats, _2) in annot_brackets:
+                for cat in cats:
+                    self.add_label(cat)
         if self.model is True:
             self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
             self.require_labels()
@ -1075,8 +1077,11 @@ cdef class DependencyParser(Parser):
     DOCS: https://spacy.io/api/dependencyparser
     """
 
+    # cdef classes can't have decorators, so we're defining this here
     name = "parser"
+    factory = "parser"
+    assigns = ["token.dep", "token.is_sent_start", "doc.sents"]
+    requires = []
     TransitionSystem = ArcEager
     nr_feature = 8
@ -1122,8 +1127,10 @@ cdef class EntityRecognizer(Parser):
     DOCS: https://spacy.io/api/entityrecognizer
     """
 
     name = "ner"
+    factory = "ner"
+    assigns = ["doc.ents", "token.ent_iob", "token.ent_type"]
+    requires = []
     TransitionSystem = BiluoPushDown
     nr_feature = 3
@ -1154,12 +1161,16 @@ cdef class EntityRecognizer(Parser):
         return tuple(sorted(labels))
 
 
+@component(
+    "entity_linker",
+    requires=["doc.ents", "token.ent_iob", "token.ent_type"],
+    assigns=["token.ent_kb_id"]
+)
 class EntityLinker(Pipe):
     """Pipeline component for named entity linking.
 
     DOCS: https://spacy.io/api/entitylinker
     """
 
-    name = 'entity_linker'
     NIL = "NIL"  # string used to refer to a non-existing link
 
     @classmethod
@ -1220,23 +1231,26 @@ class EntityLinker(Pipe):
             docs = [docs]
             golds = [golds]
 
-        context_docs = []
+        sentence_docs = []
 
         for doc, gold in zip(docs, golds):
             ents_by_offset = dict()
             for ent in doc.ents:
-                ents_by_offset["{}_{}".format(ent.start_char, ent.end_char)] = ent
+                ents_by_offset[(ent.start_char, ent.end_char)] = ent
 
             for entity, kb_dict in gold.links.items():
                 start, end = entity
                 mention = doc.text[start:end]
+                # the gold annotations should link to proper entities - if this fails, the dataset is likely corrupt
+                ent = ents_by_offset[(start, end)]
 
                 for kb_id, value in kb_dict.items():
                     # Currently only training on the positive instances
                     if value:
-                        context_docs.append(doc)
+                        sentence_docs.append(ent.sent.as_doc())
 
-        context_encodings, bp_context = self.model.begin_update(context_docs, drop=drop)
-        loss, d_scores = self.get_similarity_loss(scores=context_encodings, golds=golds, docs=None)
+        sentence_encodings, bp_context = self.model.begin_update(sentence_docs, drop=drop)
+        loss, d_scores = self.get_similarity_loss(scores=sentence_encodings, golds=golds, docs=None)
         bp_context(d_scores, sgd=sgd)
 
         if losses is not None:
@ -1305,50 +1319,69 @@ class EntityLinker(Pipe):
         if isinstance(docs, Doc):
             docs = [docs]
 
-        context_encodings = self.model(docs)
-        xp = get_array_module(context_encodings)
-
         for i, doc in enumerate(docs):
             if len(doc) > 0:
-                # currently, the context is the same for each entity in a sentence (should be refined)
-                context_encoding = context_encodings[i]
-                context_enc_t = context_encoding.T
-                norm_1 = xp.linalg.norm(context_enc_t)
-                for ent in doc.ents:
-                    entity_count += 1
-                    candidates = self.kb.get_candidates(ent.text)
-                    if not candidates:
-                        final_kb_ids.append(self.NIL)  # no prediction possible for this entity
-                        final_tensors.append(context_encoding)
-                    else:
-                        random.shuffle(candidates)
-                        # this will set all prior probabilities to 0 if they should be excluded from the model
-                        prior_probs = xp.asarray([c.prior_prob for c in candidates])
-                        if not self.cfg.get("incl_prior", True):
-                            prior_probs = xp.asarray([0.0 for c in candidates])
-                        scores = prior_probs
-                        # add in similarity from the context
-                        if self.cfg.get("incl_context", True):
-                            entity_encodings = xp.asarray([c.entity_vector for c in candidates])
-                            norm_2 = xp.linalg.norm(entity_encodings, axis=1)
-                            if len(entity_encodings) != len(prior_probs):
-                                raise RuntimeError(Errors.E147.format(method="predict", msg="vectors not of equal length"))
-                            # cosine similarity
-                            sims = xp.dot(entity_encodings, context_enc_t) / (norm_1 * norm_2)
-                            if sims.shape != prior_probs.shape:
-                                raise ValueError(Errors.E161)
-                            scores = prior_probs + sims - (prior_probs*sims)
-                        # TODO: thresholding
-                        best_index = scores.argmax()
-                        best_candidate = candidates[best_index]
-                        final_kb_ids.append(best_candidate.entity_)
-                        final_tensors.append(context_encoding)
+                # Looping through each sentence and each entity
+                # This may go wrong if there are entities across sentences - because they might not get a KB ID
+                for sent in doc.ents:
+                    sent_doc = sent.as_doc()
+                    # currently, the context is the same for each entity in a sentence (should be refined)
+                    sentence_encoding = self.model([sent_doc])[0]
+                    xp = get_array_module(sentence_encoding)
+                    sentence_encoding_t = sentence_encoding.T
+                    sentence_norm = xp.linalg.norm(sentence_encoding_t)
+
+                    for ent in sent_doc.ents:
+                        entity_count += 1
+
+                        to_discard = self.cfg.get("labels_discard", [])
+                        if to_discard and ent.label_ in to_discard:
+                            # ignoring this entity - setting to NIL
+                            final_kb_ids.append(self.NIL)
+                            final_tensors.append(sentence_encoding)
+
+                        else:
+                            candidates = self.kb.get_candidates(ent.text)
+                            if not candidates:
+                                # no prediction possible for this entity - setting to NIL
+                                final_kb_ids.append(self.NIL)
+                                final_tensors.append(sentence_encoding)
+
+                            elif len(candidates) == 1:
+                                # shortcut for efficiency reasons: take the 1 candidate
+                                # TODO: thresholding
+                                final_kb_ids.append(candidates[0].entity_)
+                                final_tensors.append(sentence_encoding)
+
+                            else:
+                                random.shuffle(candidates)
+
+                                # this will set all prior probabilities to 0 if they should be excluded from the model
+                                prior_probs = xp.asarray([c.prior_prob for c in candidates])
+                                if not self.cfg.get("incl_prior", True):
+                                    prior_probs = xp.asarray([0.0 for c in candidates])
+                                scores = prior_probs
+
+                                # add in similarity from the context
+                                if self.cfg.get("incl_context", True):
+                                    entity_encodings = xp.asarray([c.entity_vector for c in candidates])
+                                    entity_norm = xp.linalg.norm(entity_encodings, axis=1)
+
+                                    if len(entity_encodings) != len(prior_probs):
+                                        raise RuntimeError(Errors.E147.format(method="predict", msg="vectors not of equal length"))
+
+                                    # cosine similarity
+                                    sims = xp.dot(entity_encodings, sentence_encoding_t) / (sentence_norm * entity_norm)
+                                    if sims.shape != prior_probs.shape:
+                                        raise ValueError(Errors.E161)
+                                    scores = prior_probs + sims - (prior_probs*sims)
+
+                                # TODO: thresholding
+                                best_index = scores.argmax()
+                                best_candidate = candidates[best_index]
+                                final_kb_ids.append(best_candidate.entity_)
+                                final_tensors.append(sentence_encoding)
 
         if not (len(final_tensors) == len(final_kb_ids) == entity_count):
             raise RuntimeError(Errors.E147.format(method="predict", msg="result variables not of equal length"))
@ -1408,13 +1441,13 @@ class EntityLinker(Pipe):
         raise NotImplementedError
 
 
+@component("sentencizer", assigns=["token.is_sent_start", "doc.sents"])
 class Sentencizer(object):
     """Segment the Doc into sentences using a rule-based strategy.
 
     DOCS: https://spacy.io/api/sentencizer
     """
 
-    name = "sentencizer"
-
     default_punct_chars = ['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹',
             '', '', '', '', '', '', '', '', '', '', '', '', '',
             '', '', '', '', '', '', '', '', '', '', '', '', '᱿',
@ -1440,6 +1473,10 @@ class Sentencizer(object):
         else:
             self.punct_chars = set(self.default_punct_chars)
 
+    @classmethod
+    def from_nlp(cls, nlp, **cfg):
+        return cls(**cfg)
+
     def __call__(self, doc):
         """Apply the sentencizer to a Doc and set Token.is_sent_start.
@ -1506,4 +1543,9 @@ class Sentencizer(object):
         return self
 
 
+# Cython classes can't be decorated, so we need to add the factories here
+Language.factories["parser"] = lambda nlp, **cfg: DependencyParser.from_nlp(nlp, **cfg)
+Language.factories["ner"] = lambda nlp, **cfg: EntityRecognizer.from_nlp(nlp, **cfg)
+
 __all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "EntityLinker", "Sentencizer"]

spacy/scorer.py

@ -82,6 +82,7 @@ class Scorer(object):
         self.sbd = PRFScore()
         self.unlabelled = PRFScore()
         self.labelled = PRFScore()
+        self.labelled_per_dep = dict()
         self.tags = PRFScore()
         self.ner = PRFScore()
         self.ner_per_ents = dict()
@ -124,9 +125,18 @@ class Scorer(object):
     @property
     def las(self):
-        """RETURNS (float): Labelled depdendency score."""
+        """RETURNS (float): Labelled dependency score."""
         return self.labelled.fscore * 100
 
+    @property
+    def las_per_type(self):
+        """RETURNS (dict): Scores per dependency label.
+        """
+        return {
+            k: {"p": v.precision * 100, "r": v.recall * 100, "f": v.fscore * 100}
+            for k, v in self.labelled_per_dep.items()
+        }
+
     @property
     def ents_p(self):
         """RETURNS (float): Named entity accuracy (precision)."""
@ -196,6 +206,7 @@ class Scorer(object):
         return {
             "uas": self.uas,
             "las": self.las,
+            "las_per_type": self.las_per_type,
             "ents_p": self.ents_p,
             "ents_r": self.ents_r,
             "ents_f": self.ents_f,
@ -219,15 +230,24 @@ class Scorer(object):
         DOCS: https://spacy.io/api/scorer#score
         """
         if len(doc) != len(gold):
-            gold = GoldParse.from_annot_tuples(doc, zip(*gold.orig_annot))
+            gold = GoldParse.from_annot_tuples(
+                doc, tuple(zip(*gold.orig_annot)) + (gold.cats,)
+            )
         gold_deps = set()
+        gold_deps_per_dep = {}
         gold_tags = set()
         gold_ents = set(tags_to_entities([annot[-1] for annot in gold.orig_annot]))
         for id_, word, tag, head, dep, ner in gold.orig_annot:
             gold_tags.add((id_, tag))
             if dep not in (None, "") and dep.lower() not in punct_labels:
                 gold_deps.add((id_, head, dep.lower()))
+                if dep.lower() not in self.labelled_per_dep:
+                    self.labelled_per_dep[dep.lower()] = PRFScore()
+                if dep.lower() not in gold_deps_per_dep:
+                    gold_deps_per_dep[dep.lower()] = set()
+                gold_deps_per_dep[dep.lower()].add((id_, head, dep.lower()))
         cand_deps = set()
+        cand_deps_per_dep = {}
         cand_tags = set()
         for token in doc:
             if token.orth_.isspace():
@ -247,6 +267,11 @@ class Scorer(object):
                     self.labelled.fp += 1
                 else:
                     cand_deps.add((gold_i, gold_head, token.dep_.lower()))
+                    if token.dep_.lower() not in self.labelled_per_dep:
+                        self.labelled_per_dep[token.dep_.lower()] = PRFScore()
+                    if token.dep_.lower() not in cand_deps_per_dep:
+                        cand_deps_per_dep[token.dep_.lower()] = set()
+                    cand_deps_per_dep[token.dep_.lower()].add((gold_i, gold_head, token.dep_.lower()))
         if "-" not in [token[-1] for token in gold.orig_annot]:
             # Find all NER labels in gold and doc
             ent_labels = set([x[0] for x in gold_ents] + [k.label_ for k in doc.ents])
@ -278,6 +303,8 @@ class Scorer(object):
             self.ner.score_set(cand_ents, gold_ents)
         self.tags.score_set(cand_tags, gold_tags)
         self.labelled.score_set(cand_deps, gold_deps)
+        for dep in self.labelled_per_dep:
+            self.labelled_per_dep[dep].score_set(cand_deps_per_dep.get(dep, set()), gold_deps_per_dep.get(dep, set()))
         self.unlabelled.score_set(
             set(item[:2] for item in cand_deps), set(item[:2] for item in gold_deps)
         )
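With the per-label bookkeeping above, the `scores` dict gains a `las_per_type` entry alongside `las`. A hedged sketch of reading it after scoring an evaluation set (`evaluation_pairs` is an assumed iterable of (Doc, GoldParse) pairs; the numbers are illustrative):

# Sketch: per-dependency-label precision/recall/F after scoring.
scorer = Scorer()
for doc, gold in evaluation_pairs:
    scorer.score(doc, gold)
print(scorer.scores["las"])                        # overall labelled attachment score
print(scorer.scores["las_per_type"].get("nsubj"))  # e.g. {"p": 91.2, "r": 89.7, "f": 90.4}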

Some files were not shown because too many files have changed in this diff.