Merge branch 'master' into spacy.io

Ines Montani 2019-07-12 14:30:49 +02:00
commit 69dbd59a13
109 changed files with 249446 additions and 1296 deletions

106
.github/contributors/ameyuuno.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Alexey Kim |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-07-09 |
| GitHub username | ameyuuno |
| Website (optional) | https://ameyuuno.io |

106
.github/contributors/askhogan.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Patrick Hogan |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 7/7/2019 |
| GitHub username | askhogan@gmail.com |
| Website (optional) | |

106
.github/contributors/cedar101.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Kim, Baeg-il |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-07-03 |
| GitHub username | cedar101 |
| Website (optional) | |

106
.github/contributors/khellan.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Knut O. Hellan |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 02.07.2019 |
| GitHub username | khellan |
| Website (optional) | knuthellan.com |

106
.github/contributors/kognate.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Joshua B. Smith |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | July 7, 2019 |
| GitHub username | kognate |
| Website (optional) | |

106
.github/contributors/rokasramas.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ----------------------- |
| Name | Rokas Ramanauskas |
| Company name (if applicable) | TokenMill |
| Title or role (if applicable) | Software Engineer |
| Date | 2019-07-02 |
| GitHub username | rokasramas |
| Website (optional) | http://www.tokenmill.lt |

106
.github/contributors/yashpatadia.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Yash Patadia |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 11/07/2019 |
| GitHub username | yash1994 |
| Website (optional) | |

2
.gitignore vendored

@@ -56,6 +56,8 @@ parts/
sdist/
var/
*.egg-info/
pip-wheel-metadata/
Pipfile.lock
.installed.cfg
*.egg
.eggs

0
bin/__init__.py Normal file


@@ -5,7 +5,6 @@ import logging
from pathlib import Path
from collections import defaultdict
from gensim.models import Word2Vec
from preshed.counter import PreshCounter
import plac
import spacy


@@ -292,8 +292,8 @@ def evaluate(gold_ud, system_ud, deprel_weights=None, check_parse=True):
def spans_score(gold_spans, system_spans):
correct, gi, si = 0, 0, 0
undersegmented = list()
oversegmented = list()
undersegmented = []
oversegmented = []
combo = 0
previous_end_si_earlier = False
previous_end_gi_earlier = False


bin/wiki_entity_linking/kb_creator.py Normal file

@@ -0,0 +1,171 @@
# coding: utf-8
from __future__ import unicode_literals
from .train_descriptions import EntityEncoder
from . import wikidata_processor as wd, wikipedia_processor as wp
from spacy.kb import KnowledgeBase
import csv
import datetime
INPUT_DIM = 300 # dimension of pre-trained input vectors
DESC_WIDTH = 64 # dimension of output entity vectors
def create_kb(nlp, max_entities_per_alias, min_entity_freq, min_occ,
entity_def_output, entity_descr_output,
count_input, prior_prob_input, wikidata_input):
# Create the knowledge base from Wikidata entries
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=DESC_WIDTH)
# disable this part of the pipeline when rerunning the KB generation from preprocessed files
read_raw_data = True
if read_raw_data:
print()
print(" * _read_wikidata_entities", datetime.datetime.now())
title_to_id, id_to_descr = wd.read_wikidata_entities_json(wikidata_input)
# write the title-ID and ID-description mappings to file
_write_entity_files(entity_def_output, entity_descr_output, title_to_id, id_to_descr)
else:
# read the mappings from file
title_to_id = get_entity_to_id(entity_def_output)
id_to_descr = get_id_to_description(entity_descr_output)
print()
print(" * _get_entity_frequencies", datetime.datetime.now())
print()
entity_frequencies = wp.get_all_frequencies(count_input=count_input)
# filter the entities for the KB by frequency, because there's just too much data (8M entities) otherwise
filtered_title_to_id = dict()
entity_list = []
description_list = []
frequency_list = []
for title, entity in title_to_id.items():
freq = entity_frequencies.get(title, 0)
desc = id_to_descr.get(entity, None)
if desc and freq > min_entity_freq:
entity_list.append(entity)
description_list.append(desc)
frequency_list.append(freq)
filtered_title_to_id[title] = entity
print("Kept", len(filtered_title_to_id.keys()), "out of", len(title_to_id.keys()),
"titles with filter frequency", min_entity_freq)
print()
print(" * train entity encoder", datetime.datetime.now())
print()
encoder = EntityEncoder(nlp, INPUT_DIM, DESC_WIDTH)
encoder.train(description_list=description_list, to_print=True)
print()
print(" * get entity embeddings", datetime.datetime.now())
print()
embeddings = encoder.apply_encoder(description_list)
print()
print(" * adding", len(entity_list), "entities", datetime.datetime.now())
kb.set_entities(entity_list=entity_list, prob_list=frequency_list, vector_list=embeddings)
print()
print(" * adding aliases", datetime.datetime.now())
print()
_add_aliases(kb, title_to_id=filtered_title_to_id,
max_entities_per_alias=max_entities_per_alias, min_occ=min_occ,
prior_prob_input=prior_prob_input)
print()
print("kb size:", len(kb), kb.get_size_entities(), kb.get_size_aliases())
print("done with kb", datetime.datetime.now())
return kb
def _write_entity_files(entity_def_output, entity_descr_output, title_to_id, id_to_descr):
with open(entity_def_output, mode='w', encoding='utf8') as id_file:
id_file.write("WP_title" + "|" + "WD_id" + "\n")
for title, qid in title_to_id.items():
id_file.write(title + "|" + str(qid) + "\n")
with open(entity_descr_output, mode='w', encoding='utf8') as descr_file:
descr_file.write("WD_id" + "|" + "description" + "\n")
for qid, descr in id_to_descr.items():
descr_file.write(str(qid) + "|" + descr + "\n")
def get_entity_to_id(entity_def_output):
entity_to_id = dict()
with open(entity_def_output, 'r', encoding='utf8') as csvfile:
csvreader = csv.reader(csvfile, delimiter='|')
# skip header
next(csvreader)
for row in csvreader:
entity_to_id[row[0]] = row[1]
return entity_to_id
def get_id_to_description(entity_descr_output):
id_to_desc = dict()
with open(entity_descr_output, 'r', encoding='utf8') as csvfile:
csvreader = csv.reader(csvfile, delimiter='|')
# skip header
next(csvreader)
for row in csvreader:
id_to_desc[row[0]] = row[1]
return id_to_desc
def _add_aliases(kb, title_to_id, max_entities_per_alias, min_occ, prior_prob_input):
wp_titles = title_to_id.keys()
# adding aliases with prior probabilities
# we can read this file sequentially, it's sorted by alias, and then by count
with open(prior_prob_input, mode='r', encoding='utf8') as prior_file:
# skip header
prior_file.readline()
line = prior_file.readline()
previous_alias = None
total_count = 0
counts = []
entities = []
while line:
splits = line.replace('\n', "").split(sep='|')
new_alias = splits[0]
count = int(splits[1])
entity = splits[2]
if new_alias != previous_alias and previous_alias:
# done reading the previous alias --> output
if len(entities) > 0:
selected_entities = []
prior_probs = []
for ent_count, ent_string in zip(counts, entities):
if ent_string in wp_titles:
wd_id = title_to_id[ent_string]
p_entity_givenalias = ent_count / total_count
selected_entities.append(wd_id)
prior_probs.append(p_entity_givenalias)
if selected_entities:
try:
kb.add_alias(alias=previous_alias, entities=selected_entities, probabilities=prior_probs)
except ValueError as e:
print(e)
total_count = 0
counts = []
entities = []
total_count += count
if len(entities) < max_entities_per_alias and count >= min_occ:
counts.append(count)
entities.append(entity)
previous_alias = new_alias
line = prior_file.readline()
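For orientation, below is a minimal sketch of how create_kb() might be driven from a separate script, assuming the bin.wiki_entity_linking package is importable from the repository root. The model name, file paths and threshold values are illustrative assumptions and not part of this commit; only the function signature and the KB size accessors come from the code above.

import spacy
from bin.wiki_entity_linking import kb_creator

# en_core_web_lg is an assumption: any model with 300-d vectors matches INPUT_DIM
nlp = spacy.load("en_core_web_lg")
kb = kb_creator.create_kb(
    nlp,
    max_entities_per_alias=10,  # illustrative thresholds
    min_entity_freq=20,
    min_occ=5,
    entity_def_output="entity_defs.csv",
    entity_descr_output="entity_descriptions.csv",
    count_input="entity_freq.csv",
    prior_prob_input="prior_prob.csv",
    wikidata_input="latest-all.json.bz2",
)
print("KB contains", kb.get_size_entities(), "entities and", kb.get_size_aliases(), "aliases")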

bin/wiki_entity_linking/train_descriptions.py Normal file

@@ -0,0 +1,121 @@
# coding: utf-8
from random import shuffle
import numpy as np
from spacy._ml import zero_init, create_default_optimizer
from spacy.cli.pretrain import get_cossim_loss
from thinc.v2v import Model
from thinc.api import chain
from thinc.neural._classes.affine import Affine
class EntityEncoder:
"""
Train the embeddings of entity descriptions to fit a fixed-size entity vector (e.g. 64D).
This entity vector will be stored in the KB, for further downstream use in the entity model.
"""
DROP = 0
EPOCHS = 5
STOP_THRESHOLD = 0.04
BATCH_SIZE = 1000
def __init__(self, nlp, input_dim, desc_width):
self.nlp = nlp
self.input_dim = input_dim
self.desc_width = desc_width
self.encoder = None  # built by _build_network() during training; checked in apply_encoder()
def apply_encoder(self, description_list):
if self.encoder is None:
raise ValueError("Can not apply encoder before training it")
batch_size = 100000
start = 0
stop = min(batch_size, len(description_list))
encodings = []
while start < len(description_list):
docs = list(self.nlp.pipe(description_list[start:stop]))
doc_embeddings = [self._get_doc_embedding(doc) for doc in docs]
enc = self.encoder(np.asarray(doc_embeddings))
encodings.extend(enc.tolist())
start = start + batch_size
stop = min(stop + batch_size, len(description_list))
return encodings
def train(self, description_list, to_print=False):
processed, loss = self._train_model(description_list)
if to_print:
print("Trained on", processed, "entities across", self.EPOCHS, "epochs")
print("Final loss:", loss)
def _train_model(self, description_list):
# TODO: when loss gets too low, a 'mean of empty slice' warning is thrown by numpy
self._build_network(self.input_dim, self.desc_width)
processed = 0
loss = 1
descriptions = description_list.copy() # copy this list so that shuffling does not affect other functions
for i in range(self.EPOCHS):
shuffle(descriptions)
batch_nr = 0
start = 0
stop = min(self.BATCH_SIZE, len(descriptions))
while loss > self.STOP_THRESHOLD and start < len(descriptions):
batch = []
for descr in descriptions[start:stop]:
doc = self.nlp(descr)
doc_vector = self._get_doc_embedding(doc)
batch.append(doc_vector)
loss = self._update(batch)
print(i, batch_nr, loss)
processed += len(batch)
batch_nr += 1
start = start + self.BATCH_SIZE
stop = min(stop + self.BATCH_SIZE, len(descriptions))
return processed, loss
@staticmethod
def _get_doc_embedding(doc):
indices = np.zeros((len(doc),), dtype="i")
for i, word in enumerate(doc):
if word.orth in doc.vocab.vectors.key2row:
indices[i] = doc.vocab.vectors.key2row[word.orth]
else:
indices[i] = 0
word_vectors = doc.vocab.vectors.data[indices]
doc_vector = np.mean(word_vectors, axis=0)
return doc_vector
def _build_network(self, orig_width, hidden_with):
with Model.define_operators({">>": chain}):
# very simple encoder-decoder model
self.encoder = (
Affine(hidden_with, orig_width)
)
self.model = self.encoder >> zero_init(Affine(orig_width, hidden_with, drop_factor=0.0))
self.sgd = create_default_optimizer(self.model.ops)
def _update(self, vectors):
predictions, bp_model = self.model.begin_update(np.asarray(vectors), drop=self.DROP)
loss, d_scores = self._get_loss(scores=predictions, golds=np.asarray(vectors))
bp_model(d_scores, sgd=self.sgd)
return loss / len(vectors)
@staticmethod
def _get_loss(golds, scores):
loss, gradients = get_cossim_loss(scores, golds)
return loss, gradients
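A minimal usage sketch for EntityEncoder, mirroring the calls kb_creator makes above; the model name and the toy description list are placeholder assumptions, and the package is assumed to be importable from the repository root.

import spacy
from bin.wiki_entity_linking.train_descriptions import EntityEncoder

nlp = spacy.load("en_core_web_lg")  # assumed model with 300-d word vectors
descriptions = ["placeholder description of entity one", "placeholder description of entity two"]
encoder = EntityEncoder(nlp, input_dim=300, desc_width=64)
encoder.train(description_list=descriptions, to_print=True)
embeddings = encoder.apply_encoder(descriptions)  # one 64-d vector per description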

bin/wiki_entity_linking/training_set_creator.py Normal file

@@ -0,0 +1,353 @@
# coding: utf-8
from __future__ import unicode_literals
import os
import re
import bz2
import datetime
from spacy.gold import GoldParse
from bin.wiki_entity_linking import kb_creator
"""
Process Wikipedia interlinks to generate a training dataset for the EL algorithm.
Gold-standard entities are stored in one file in standoff format (by character offset).
"""
ENTITY_FILE = "gold_entities.csv"
def create_training(wikipedia_input, entity_def_input, training_output):
wp_to_id = kb_creator.get_entity_to_id(entity_def_input)
_process_wikipedia_texts(wikipedia_input, wp_to_id, training_output, limit=None)
def _process_wikipedia_texts(wikipedia_input, wp_to_id, training_output, limit=None):
"""
Read the XML wikipedia data to parse out training data:
raw text data + positive instances
"""
title_regex = re.compile(r'(?<=<title>).*(?=</title>)')
id_regex = re.compile(r'(?<=<id>)\d*(?=</id>)')
read_ids = set()
entityfile_loc = training_output / ENTITY_FILE
with open(entityfile_loc, mode="w", encoding='utf8') as entityfile:
# write entity training header file
_write_training_entity(outputfile=entityfile,
article_id="article_id",
alias="alias",
entity="WD_id",
start="start",
end="end")
with bz2.open(wikipedia_input, mode='rb') as file:
line = file.readline()
cnt = 0
article_text = ""
article_title = None
article_id = None
reading_text = False
reading_revision = False
while line and (not limit or cnt < limit):
if cnt % 1000000 == 0:
print(datetime.datetime.now(), "processed", cnt, "lines of Wikipedia dump")
clean_line = line.strip().decode("utf-8")
if clean_line == "<revision>":
reading_revision = True
elif clean_line == "</revision>":
reading_revision = False
# Start reading new page
if clean_line == "<page>":
article_text = ""
article_title = None
article_id = None
# finished reading this page
elif clean_line == "</page>":
if article_id:
try:
_process_wp_text(wp_to_id, entityfile, article_id, article_title, article_text.strip(),
training_output)
except Exception as e:
print("Error processing article", article_id, article_title, e)
else:
print("Done processing a page, but couldn't find an article_id ?", article_title)
article_text = ""
article_title = None
article_id = None
reading_text = False
reading_revision = False
# start reading text within a page
if "<text" in clean_line:
reading_text = True
if reading_text:
article_text += " " + clean_line
# stop reading text within a page (we assume a new page doesn't start on the same line)
if "</text" in clean_line:
reading_text = False
# read the ID of this article (outside the revision portion of the document)
if not reading_revision:
ids = id_regex.search(clean_line)
if ids:
article_id = ids[0]
if article_id in read_ids:
print("Found duplicate article ID", article_id, clean_line) # This should never happen ...
read_ids.add(article_id)
# read the title of this article (outside the revision portion of the document)
if not reading_revision:
titles = title_regex.search(clean_line)
if titles:
article_title = titles[0].strip()
line = file.readline()
cnt += 1
text_regex = re.compile(r'(?<=<text xml:space=\"preserve\">).*(?=</text)')
def _process_wp_text(wp_to_id, entityfile, article_id, article_title, article_text, training_output):
found_entities = False
# ignore meta Wikipedia pages
if article_title.startswith("Wikipedia:"):
return
# remove the text tags
text = text_regex.search(article_text).group(0)
# stop processing if this is a redirect page
if text.startswith("#REDIRECT"):
return
# get the raw text without markup etc, keeping only interwiki links
clean_text = _get_clean_wp_text(text)
# read the text char by char to get the right offsets for the interwiki links
final_text = ""
open_read = 0
reading_text = True
reading_entity = False
reading_mention = False
reading_special_case = False
entity_buffer = ""
mention_buffer = ""
for index, letter in enumerate(clean_text):
if letter == '[':
open_read += 1
elif letter == ']':
open_read -= 1
elif letter == '|':
if reading_text:
final_text += letter
# switch from reading entity to mention in the [[entity|mention]] pattern
elif reading_entity:
reading_text = False
reading_entity = False
reading_mention = True
else:
reading_special_case = True
else:
if reading_entity:
entity_buffer += letter
elif reading_mention:
mention_buffer += letter
elif reading_text:
final_text += letter
else:
raise ValueError("Not sure at point", clean_text[index-2:index+2])
if open_read > 2:
reading_special_case = True
if open_read == 2 and reading_text:
reading_text = False
reading_entity = True
reading_mention = False
# we just finished reading an entity
if open_read == 0 and not reading_text:
if '#' in entity_buffer or entity_buffer.startswith(':'):
reading_special_case = True
# Ignore cases with nested structures like File: handles etc
if not reading_special_case:
if not mention_buffer:
mention_buffer = entity_buffer
start = len(final_text)
end = start + len(mention_buffer)
qid = wp_to_id.get(entity_buffer, None)
if qid:
_write_training_entity(outputfile=entityfile,
article_id=article_id,
alias=mention_buffer,
entity=qid,
start=start,
end=end)
found_entities = True
final_text += mention_buffer
entity_buffer = ""
mention_buffer = ""
reading_text = True
reading_entity = False
reading_mention = False
reading_special_case = False
if found_entities:
_write_training_article(article_id=article_id, clean_text=final_text, training_output=training_output)
info_regex = re.compile(r'{[^{]*?}')
htlm_regex = re.compile(r'&lt;!--[^-]*--&gt;')
category_regex = re.compile(r'\[\[Category:[^\[]*]]')
file_regex = re.compile(r'\[\[File:[^[\]]+]]')
ref_regex = re.compile(r'&lt;ref.*?&gt;') # non-greedy
ref_2_regex = re.compile(r'&lt;/ref.*?&gt;') # non-greedy
def _get_clean_wp_text(article_text):
clean_text = article_text.strip()
# remove bolding & italic markup
clean_text = clean_text.replace('\'\'\'', '')
clean_text = clean_text.replace('\'\'', '')
# remove nested {{info}} statements by removing the inner/smallest ones first and iterating
try_again = True
previous_length = len(clean_text)
while try_again:
clean_text = info_regex.sub('', clean_text) # non-greedy match excluding a nested {
if len(clean_text) < previous_length:
try_again = True
else:
try_again = False
previous_length = len(clean_text)
# remove HTML comments
clean_text = htlm_regex.sub('', clean_text)
# remove Category and File statements
clean_text = category_regex.sub('', clean_text)
clean_text = file_regex.sub('', clean_text)
# remove multiple =
while '==' in clean_text:
clean_text = clean_text.replace("==", "=")
clean_text = clean_text.replace(". =", ".")
clean_text = clean_text.replace(" = ", ". ")
clean_text = clean_text.replace("= ", ".")
clean_text = clean_text.replace(" =", "")
# remove refs (non-greedy match)
clean_text = ref_regex.sub('', clean_text)
clean_text = ref_2_regex.sub('', clean_text)
# remove additional wikiformatting
clean_text = re.sub(r'&lt;blockquote&gt;', '', clean_text)
clean_text = re.sub(r'&lt;/blockquote&gt;', '', clean_text)
# change special characters back to normal ones
clean_text = clean_text.replace(r'&lt;', '<')
clean_text = clean_text.replace(r'&gt;', '>')
clean_text = clean_text.replace(r'&quot;', '"')
clean_text = clean_text.replace(r'&amp;nbsp;', ' ')
clean_text = clean_text.replace(r'&amp;', '&')
# remove multiple spaces
while ' ' in clean_text:
clean_text = clean_text.replace(' ', ' ')
return clean_text.strip()
def _write_training_article(article_id, clean_text, training_output):
file_loc = training_output / (str(article_id) + ".txt")
with open(file_loc, mode='w', encoding='utf8') as outputfile:
outputfile.write(clean_text)
def _write_training_entity(outputfile, article_id, alias, entity, start, end):
outputfile.write(article_id + "|" + alias + "|" + entity + "|" + str(start) + "|" + str(end) + "\n")
def is_dev(article_id):
return article_id.endswith("3")
def read_training(nlp, training_dir, dev, limit):
# This method provides training examples that correspond to the entity annotations found by the nlp object
entityfile_loc = training_dir / ENTITY_FILE
data = []
# assume the data is written sequentially, so we can reuse the article docs
current_article_id = None
current_doc = None
ents_by_offset = dict()
skip_articles = set()
total_entities = 0
with open(entityfile_loc, mode='r', encoding='utf8') as file:
for line in file:
if not limit or len(data) < limit:
fields = line.replace('\n', "").split(sep='|')
article_id = fields[0]
alias = fields[1]
wp_title = fields[2]
start = fields[3]
end = fields[4]
if dev == is_dev(article_id) and article_id != "article_id" and article_id not in skip_articles:
if not current_doc or (current_article_id != article_id):
# parse the new article text
file_name = article_id + ".txt"
try:
with open(os.path.join(training_dir, file_name), mode="r", encoding='utf8') as f:
text = f.read()
if len(text) < 30000: # threshold for convenience / speed of processing
current_doc = nlp(text)
current_article_id = article_id
ents_by_offset = dict()
for ent in current_doc.ents:
sent_length = len(ent.sent)
# custom filtering to avoid too long or too short sentences
if 5 < sent_length < 100:
ents_by_offset[str(ent.start_char) + "_" + str(ent.end_char)] = ent
else:
skip_articles.add(article_id)
current_doc = None
except Exception as e:
print("Problem parsing article", article_id, e)
skip_articles.add(article_id)
raise e
# repeat checking this condition in case an exception was thrown
if current_doc and (current_article_id == article_id):
found_ent = ents_by_offset.get(start + "_" + end, None)
if found_ent:
if found_ent.text != alias:
skip_articles.add(article_id)
current_doc = None
else:
sent = found_ent.sent.as_doc()
# currently feeding the gold data one entity per sentence at a time
gold_start = int(start) - found_ent.sent.start_char
gold_end = int(end) - found_ent.sent.start_char
gold_entities = [(gold_start, gold_end, wp_title)]
gold = GoldParse(doc=sent, links=gold_entities)
data.append((sent, gold))
total_entities += 1
if len(data) % 2500 == 0:
print(" -read", total_entities, "entities")
print(" -read", total_entities, "entities")
return data
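A rough usage sketch for the two entry points above, assuming create_training() and read_training() are in scope; the Wikipedia dump path, output directory and limit are placeholder assumptions, and the output directory is assumed to already exist.

from pathlib import Path
import spacy

training_dir = Path("training_data")  # assumed to exist
create_training(
    wikipedia_input="enwiki-latest-pages-articles.xml.bz2",
    entity_def_input="entity_defs.csv",
    training_output=training_dir,
)
nlp = spacy.load("en_core_web_lg")  # assumed model
# dev=False keeps only articles whose ID does not end in "3" (see is_dev above)
train_data = read_training(nlp, training_dir, dev=False, limit=10000)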

bin/wiki_entity_linking/wikidata_processor.py Normal file

@@ -0,0 +1,119 @@
# coding: utf-8
from __future__ import unicode_literals
import bz2
import json
import datetime
def read_wikidata_entities_json(wikidata_file, limit=None, to_print=False):
# Read the JSON wiki data and parse out the entities. Takes about 7.5 hours to parse 55M lines.
# get latest-all.json.bz2 from https://dumps.wikimedia.org/wikidatawiki/entities/
lang = 'en'
site_filter = 'enwiki'
# properties filter (currently disabled to get ALL data)
prop_filter = dict()
# prop_filter = {'P31': {'Q5', 'Q15632617'}} # currently defined as OR: one property suffices to be selected
title_to_id = dict()
id_to_descr = dict()
# parse appropriate fields - depending on what we need in the KB
parse_properties = False
parse_sitelinks = True
parse_labels = False
parse_descriptions = True
parse_aliases = False
parse_claims = False
with bz2.open(wikidata_file, mode='rb') as file:
line = file.readline()
cnt = 0
while line and (not limit or cnt < limit):
if cnt % 500000 == 0:
print(datetime.datetime.now(), "processed", cnt, "lines of WikiData dump")
clean_line = line.strip()
if clean_line.endswith(b","):
clean_line = clean_line[:-1]
if len(clean_line) > 1:
obj = json.loads(clean_line)
entry_type = obj["type"]
if entry_type == "item":
# filtering records on their properties (currently disabled to get ALL data)
# keep = False
keep = True
claims = obj["claims"]
if parse_claims:
for prop, value_set in prop_filter.items():
claim_property = claims.get(prop, None)
if claim_property:
for cp in claim_property:
cp_id = cp['mainsnak'].get('datavalue', {}).get('value', {}).get('id')
cp_rank = cp['rank']
if cp_rank != "deprecated" and cp_id in value_set:
keep = True
if keep:
unique_id = obj["id"]
if to_print:
print("ID:", unique_id)
print("type:", entry_type)
# parsing all properties that refer to other entities
if parse_properties:
for prop, claim_property in claims.items():
cp_dicts = [cp['mainsnak']['datavalue'].get('value') for cp in claim_property
if cp['mainsnak'].get('datavalue')]
cp_values = [cp_dict.get('id') for cp_dict in cp_dicts if isinstance(cp_dict, dict)
if cp_dict.get('id') is not None]
if cp_values:
if to_print:
print("prop:", prop, cp_values)
found_link = False
if parse_sitelinks:
site_value = obj["sitelinks"].get(site_filter, None)
if site_value:
site = site_value['title']
if to_print:
print(site_filter, ":", site)
title_to_id[site] = unique_id
found_link = True
if parse_labels:
labels = obj["labels"]
if labels:
lang_label = labels.get(lang, None)
if lang_label:
if to_print:
print("label (" + lang + "):", lang_label["value"])
if found_link and parse_descriptions:
descriptions = obj["descriptions"]
if descriptions:
lang_descr = descriptions.get(lang, None)
if lang_descr:
if to_print:
print("description (" + lang + "):", lang_descr["value"])
id_to_descr[unique_id] = lang_descr["value"]
if parse_aliases:
aliases = obj["aliases"]
if aliases:
lang_aliases = aliases.get(lang, None)
if lang_aliases:
for item in lang_aliases:
if to_print:
print("alias (" + lang + "):", item["value"])
if to_print:
print()
line = file.readline()
cnt += 1
return title_to_id, id_to_descr
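
As a rough usage illustration (editor's sketch; the file name and limit are placeholders), the parser above can be called directly on a compressed dump:

title_to_id, id_to_descr = read_wikidata_entities_json("latest-all.json.bz2", limit=100000)
print(len(title_to_id), "enwiki titles mapped to Wikidata IDs")
print(len(id_to_descr), "English descriptions kept for linked entities")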

View File

@ -0,0 +1,182 @@
# coding: utf-8
from __future__ import unicode_literals
import re
import bz2
import csv
import datetime
"""
Process a Wikipedia dump to calculate entity frequencies and the prior probability of each entity given a mention.
Write these results to file for downstream KB and training data generation.
"""
map_alias_to_link = dict()
# these will/should be matched ignoring case
wiki_namespaces = ["b", "betawikiversity", "Book", "c", "Category", "Commons",
"d", "dbdump", "download", "Draft", "Education", "Foundation",
"Gadget", "Gadget definition", "gerrit", "File", "Help", "Image", "Incubator",
"m", "mail", "mailarchive", "media", "MediaWiki", "MediaWiki talk", "Mediawikiwiki",
"MediaZilla", "Meta", "Metawikipedia", "Module",
"mw", "n", "nost", "oldwikisource", "outreach", "outreachwiki", "otrs", "OTRSwiki",
"Portal", "phab", "Phabricator", "Project", "q", "quality", "rev",
"s", "spcom", "Special", "species", "Strategy", "sulutil", "svn",
"Talk", "Template", "Template talk", "Testwiki", "ticket", "TimedText", "Toollabs", "tools",
"tswiki", "User", "User talk", "v", "voy",
"w", "Wikibooks", "Wikidata", "wikiHow", "Wikinvest", "wikilivres", "Wikimedia", "Wikinews",
"Wikipedia", "Wikipedia talk", "Wikiquote", "Wikisource", "Wikispecies", "Wikitech",
"Wikiversity", "Wikivoyage", "wikt", "wiktionary", "wmf", "wmania", "WP"]
# find the links
link_regex = re.compile(r'\[\[[^\[\]]*\]\]')
# match on interwiki links, e.g. `en:` or `:fr:`
ns_regex = r":?" + "[a-z][a-z]" + ":"
# match on Namespace: optionally preceded by a :
for ns in wiki_namespaces:
ns_regex += "|" + ":?" + ns + ":"
ns_regex = re.compile(ns_regex, re.IGNORECASE)
def read_wikipedia_prior_probs(wikipedia_input, prior_prob_output):
"""
Read the XML wikipedia data and parse out intra-wiki links to estimate prior probabilities.
    Parsing the full file (about 1100M lines) takes around 2 hours.
    It is relatively fast because it processes the dump line by line, regardless of which article an intra-wiki link comes from.
"""
with bz2.open(wikipedia_input, mode='rb') as file:
line = file.readline()
cnt = 0
while line:
if cnt % 5000000 == 0:
print(datetime.datetime.now(), "processed", cnt, "lines of Wikipedia dump")
clean_line = line.strip().decode("utf-8")
aliases, entities, normalizations = get_wp_links(clean_line)
for alias, entity, norm in zip(aliases, entities, normalizations):
_store_alias(alias, entity, normalize_alias=norm, normalize_entity=True)
line = file.readline()
cnt += 1
# write all aliases and their entities and count occurrences to file
with open(prior_prob_output, mode='w', encoding='utf8') as outputfile:
outputfile.write("alias" + "|" + "count" + "|" + "entity" + "\n")
for alias, alias_dict in sorted(map_alias_to_link.items(), key=lambda x: x[0]):
for entity, count in sorted(alias_dict.items(), key=lambda x: x[1], reverse=True):
outputfile.write(alias + "|" + str(count) + "|" + entity + "\n")
def _store_alias(alias, entity, normalize_alias=False, normalize_entity=True):
alias = alias.strip()
entity = entity.strip()
    # remove everything after "#", as this is not part of the title but refers to a specific section
if normalize_entity:
# wikipedia titles are always capitalized
entity = _capitalize_first(entity.split("#")[0])
if normalize_alias:
alias = alias.split("#")[0]
if alias and entity:
alias_dict = map_alias_to_link.get(alias, dict())
entity_count = alias_dict.get(entity, 0)
alias_dict[entity] = entity_count + 1
map_alias_to_link[alias] = alias_dict
def get_wp_links(text):
aliases = []
entities = []
normalizations = []
matches = link_regex.findall(text)
for match in matches:
match = match[2:][:-2].replace("_", " ").strip()
if ns_regex.match(match):
pass # ignore namespaces at the beginning of the string
# this is a simple [[link]], with the alias the same as the mention
elif "|" not in match:
aliases.append(match)
entities.append(match)
normalizations.append(True)
# in wiki format, the link is written as [[entity|alias]]
else:
splits = match.split("|")
entity = splits[0].strip()
alias = splits[1].strip()
# specific wiki format [[alias (specification)|]]
if len(alias) == 0 and "(" in entity:
alias = entity.split("(")[0]
aliases.append(alias)
entities.append(entity)
normalizations.append(False)
else:
aliases.append(alias)
entities.append(entity)
normalizations.append(False)
return aliases, entities, normalizations
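
A small worked example (editor's addition, with an invented snippet of wiki markup) of the three link forms handled above: a piped link yields separate entity and alias, a plain link keeps them identical and marks the alias for normalization, and an interwiki prefix such as fr: is skipped.

sample = "See [[Douglas Adams|Adams]] and [[Hitchhiker's Guide]] but not [[fr:Paris]]."
aliases, entities, normalizations = get_wp_links(sample)
# aliases        == ["Adams", "Hitchhiker's Guide"]
# entities       == ["Douglas Adams", "Hitchhiker's Guide"]
# normalizations == [False, True]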
def _capitalize_first(text):
if not text:
return None
result = text[0].capitalize()
if len(result) > 0:
result += text[1:]
return result
def write_entity_counts(prior_prob_input, count_output, to_print=False):
# Write entity counts for quick access later
entity_to_count = dict()
total_count = 0
with open(prior_prob_input, mode='r', encoding='utf8') as prior_file:
# skip header
prior_file.readline()
line = prior_file.readline()
while line:
splits = line.replace('\n', "").split(sep='|')
# alias = splits[0]
count = int(splits[1])
entity = splits[2]
current_count = entity_to_count.get(entity, 0)
entity_to_count[entity] = current_count + count
total_count += count
line = prior_file.readline()
with open(count_output, mode='w', encoding='utf8') as entity_file:
entity_file.write("entity" + "|" + "count" + "\n")
for entity, count in entity_to_count.items():
entity_file.write(entity + "|" + str(count) + "\n")
if to_print:
for entity, count in entity_to_count.items():
print("Entity count:", entity, count)
print("Total count:", total_count)
def get_all_frequencies(count_input):
entity_to_count = dict()
with open(count_input, 'r', encoding='utf8') as csvfile:
csvreader = csv.reader(csvfile, delimiter='|')
# skip header
next(csvreader)
for row in csvreader:
entity_to_count[row[0]] = int(row[1])
return entity_to_count
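
Putting the module together, a hypothetical end-to-end run might look as follows (editor's sketch; the file names are placeholders). Both intermediate files are pipe-delimited, with headers alias|count|entity and entity|count respectively.

read_wikipedia_prior_probs(
    wikipedia_input="enwiki-latest-pages-articles-multistream.xml.bz2",
    prior_prob_output="prior_prob.csv",
)
write_entity_counts(prior_prob_input="prior_prob.csv", count_output="entity_freq.csv")
frequencies = get_all_frequencies(count_input="entity_freq.csv")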

View File

@ -51,7 +51,6 @@ def filter_spans(spans):
def extract_currency_relations(doc):
# Merge entities and noun chunks into one token
seen_tokens = set()
spans = list(doc.ents) + list(doc.noun_chunks)
spans = filter_spans(spans)
with doc.retokenize() as retokenizer:

View File

@ -9,26 +9,26 @@ from spacy.kb import KnowledgeBase
def create_kb(vocab):
kb = KnowledgeBase(vocab=vocab)
kb = KnowledgeBase(vocab=vocab, entity_vector_length=1)
# adding entities
entity_0 = "Q1004791_Douglas"
print("adding entity", entity_0)
kb.add_entity(entity=entity_0, prob=0.5)
kb.add_entity(entity=entity_0, prob=0.5, entity_vector=[0])
entity_1 = "Q42_Douglas_Adams"
print("adding entity", entity_1)
kb.add_entity(entity=entity_1, prob=0.5)
kb.add_entity(entity=entity_1, prob=0.5, entity_vector=[1])
entity_2 = "Q5301561_Douglas_Haig"
print("adding entity", entity_2)
kb.add_entity(entity=entity_2, prob=0.5)
kb.add_entity(entity=entity_2, prob=0.5, entity_vector=[2])
# adding aliases
print()
alias_0 = "Douglas"
print("adding alias", alias_0)
kb.add_alias(alias=alias_0, entities=[entity_0, entity_1, entity_2], probabilities=[0.1, 0.6, 0.2])
kb.add_alias(alias=alias_0, entities=[entity_0, entity_1, entity_2], probabilities=[0.6, 0.1, 0.2])
alias_1 = "Douglas Adams"
print("adding alias", alias_1)
@ -41,8 +41,12 @@ def create_kb(vocab):
def add_el(kb, nlp):
el_pipe = nlp.create_pipe(name='entity_linker', config={"kb": kb})
el_pipe = nlp.create_pipe(name='entity_linker', config={"context_width": 64})
el_pipe.set_kb(kb)
nlp.add_pipe(el_pipe, last=True)
nlp.begin_training()
el_pipe.context_weight = 0
el_pipe.prior_weight = 1
for alias in ["Douglas Adams", "Douglas"]:
candidates = nlp.linker.kb.get_candidates(alias)
@ -66,6 +70,6 @@ def add_el(kb, nlp):
if __name__ == "__main__":
nlp = spacy.load('en_core_web_sm')
my_kb = create_kb(nlp.vocab)
add_el(my_kb, nlp)
my_nlp = spacy.load('en_core_web_sm')
my_kb = create_kb(my_nlp.vocab)
add_el(my_kb, my_nlp)

View File

@ -0,0 +1,442 @@
# coding: utf-8
from __future__ import unicode_literals
import random
import datetime
from pathlib import Path
from bin.wiki_entity_linking import training_set_creator, kb_creator, wikipedia_processor as wp
from bin.wiki_entity_linking.kb_creator import DESC_WIDTH
import spacy
from spacy.kb import KnowledgeBase
from spacy.util import minibatch, compounding
"""
Demonstrate how to build a knowledge base from Wikidata and run an Entity Linking algorithm.
"""
ROOT_DIR = Path("C:/Users/Sofie/Documents/data/")
OUTPUT_DIR = ROOT_DIR / 'wikipedia'
TRAINING_DIR = OUTPUT_DIR / 'training_data_nel'
PRIOR_PROB = OUTPUT_DIR / 'prior_prob.csv'
ENTITY_COUNTS = OUTPUT_DIR / 'entity_freq.csv'
ENTITY_DEFS = OUTPUT_DIR / 'entity_defs.csv'
ENTITY_DESCR = OUTPUT_DIR / 'entity_descriptions.csv'
KB_FILE = OUTPUT_DIR / 'kb_1' / 'kb'
NLP_1_DIR = OUTPUT_DIR / 'nlp_1'
NLP_2_DIR = OUTPUT_DIR / 'nlp_2'
# get latest-all.json.bz2 from https://dumps.wikimedia.org/wikidatawiki/entities/
WIKIDATA_JSON = ROOT_DIR / 'wikidata' / 'wikidata-20190304-all.json.bz2'
# get enwiki-latest-pages-articles-multistream.xml.bz2 from https://dumps.wikimedia.org/enwiki/latest/
ENWIKI_DUMP = ROOT_DIR / 'wikipedia' / 'enwiki-20190320-pages-articles-multistream.xml.bz2'
# KB construction parameters
MAX_CANDIDATES = 10
MIN_ENTITY_FREQ = 20
MIN_PAIR_OCC = 5
# model training parameters
EPOCHS = 10
DROPOUT = 0.5
LEARN_RATE = 0.005
L2 = 1e-6
CONTEXT_WIDTH = 128
def run_pipeline():
# set the appropriate booleans to define which parts of the pipeline should be re(run)
print("START", datetime.datetime.now())
print()
nlp_1 = spacy.load('en_core_web_lg')
nlp_2 = None
kb_2 = None
# one-time methods to create KB and write to file
to_create_prior_probs = False
to_create_entity_counts = False
to_create_kb = False
# read KB back in from file
to_read_kb = True
to_test_kb = False
# create training dataset
create_wp_training = False
# train the EL pipe
train_pipe = True
measure_performance = True
# test the EL pipe on a simple example
to_test_pipeline = True
# write the NLP object, read back in and test again
to_write_nlp = True
to_read_nlp = True
test_from_file = False
# STEP 1 : create prior probabilities from WP (run only once)
if to_create_prior_probs:
print("STEP 1: to_create_prior_probs", datetime.datetime.now())
wp.read_wikipedia_prior_probs(wikipedia_input=ENWIKI_DUMP, prior_prob_output=PRIOR_PROB)
print()
# STEP 2 : deduce entity frequencies from WP (run only once)
if to_create_entity_counts:
print("STEP 2: to_create_entity_counts", datetime.datetime.now())
wp.write_entity_counts(prior_prob_input=PRIOR_PROB, count_output=ENTITY_COUNTS, to_print=False)
print()
# STEP 3 : create KB and write to file (run only once)
if to_create_kb:
print("STEP 3a: to_create_kb", datetime.datetime.now())
kb_1 = kb_creator.create_kb(nlp_1,
max_entities_per_alias=MAX_CANDIDATES,
min_entity_freq=MIN_ENTITY_FREQ,
min_occ=MIN_PAIR_OCC,
entity_def_output=ENTITY_DEFS,
entity_descr_output=ENTITY_DESCR,
count_input=ENTITY_COUNTS,
prior_prob_input=PRIOR_PROB,
wikidata_input=WIKIDATA_JSON)
print("kb entities:", kb_1.get_size_entities())
print("kb aliases:", kb_1.get_size_aliases())
print()
print("STEP 3b: write KB and NLP", datetime.datetime.now())
kb_1.dump(KB_FILE)
nlp_1.to_disk(NLP_1_DIR)
print()
# STEP 4 : read KB back in from file
if to_read_kb:
print("STEP 4: to_read_kb", datetime.datetime.now())
nlp_2 = spacy.load(NLP_1_DIR)
kb_2 = KnowledgeBase(vocab=nlp_2.vocab, entity_vector_length=DESC_WIDTH)
kb_2.load_bulk(KB_FILE)
print("kb entities:", kb_2.get_size_entities())
print("kb aliases:", kb_2.get_size_aliases())
print()
# test KB
if to_test_kb:
check_kb(kb_2)
print()
# STEP 5: create a training dataset from WP
if create_wp_training:
print("STEP 5: create training dataset", datetime.datetime.now())
training_set_creator.create_training(wikipedia_input=ENWIKI_DUMP,
entity_def_input=ENTITY_DEFS,
training_output=TRAINING_DIR)
# STEP 6: create and train the entity linking pipe
if train_pipe:
print("STEP 6: training Entity Linking pipe", datetime.datetime.now())
type_to_int = {label: i for i, label in enumerate(nlp_2.entity.labels)}
print(" -analysing", len(type_to_int), "different entity types")
el_pipe = nlp_2.create_pipe(name='entity_linker',
config={"context_width": CONTEXT_WIDTH,
"pretrained_vectors": nlp_2.vocab.vectors.name,
"type_to_int": type_to_int})
el_pipe.set_kb(kb_2)
nlp_2.add_pipe(el_pipe, last=True)
other_pipes = [pipe for pipe in nlp_2.pipe_names if pipe != "entity_linker"]
with nlp_2.disable_pipes(*other_pipes): # only train Entity Linking
optimizer = nlp_2.begin_training()
optimizer.learn_rate = LEARN_RATE
optimizer.L2 = L2
# define the size (nr of entities) of training and dev set
train_limit = 5000
dev_limit = 5000
train_data = training_set_creator.read_training(nlp=nlp_2,
training_dir=TRAINING_DIR,
dev=False,
limit=train_limit)
print("Training on", len(train_data), "articles")
print()
dev_data = training_set_creator.read_training(nlp=nlp_2,
training_dir=TRAINING_DIR,
dev=True,
limit=dev_limit)
print("Dev testing on", len(dev_data), "articles")
print()
if not train_data:
print("Did not find any training data")
else:
for itn in range(EPOCHS):
random.shuffle(train_data)
losses = {}
batches = minibatch(train_data, size=compounding(4.0, 128.0, 1.001))
batchnr = 0
with nlp_2.disable_pipes(*other_pipes):
for batch in batches:
try:
docs, golds = zip(*batch)
nlp_2.update(
docs,
golds,
sgd=optimizer,
drop=DROPOUT,
losses=losses,
)
batchnr += 1
except Exception as e:
print("Error updating batch:", e)
if batchnr > 0:
el_pipe.cfg["context_weight"] = 1
el_pipe.cfg["prior_weight"] = 1
dev_acc_context, dev_acc_context_dict = _measure_accuracy(dev_data, el_pipe)
losses['entity_linker'] = losses['entity_linker'] / batchnr
print("Epoch, train loss", itn, round(losses['entity_linker'], 2),
" / dev acc avg", round(dev_acc_context, 3))
# STEP 7: measure the performance of our trained pipe on an independent dev set
if len(dev_data) and measure_performance:
print()
print("STEP 7: performance measurement of Entity Linking pipe", datetime.datetime.now())
print()
counts, acc_r, acc_r_label, acc_p, acc_p_label, acc_o, acc_o_label = _measure_baselines(dev_data, kb_2)
print("dev counts:", sorted(counts.items(), key=lambda x: x[0]))
print("dev acc oracle:", round(acc_o, 3), [(x, round(y, 3)) for x, y in acc_o_label.items()])
print("dev acc random:", round(acc_r, 3), [(x, round(y, 3)) for x, y in acc_r_label.items()])
print("dev acc prior:", round(acc_p, 3), [(x, round(y, 3)) for x, y in acc_p_label.items()])
# using only context
el_pipe.cfg["context_weight"] = 1
el_pipe.cfg["prior_weight"] = 0
dev_acc_context, dev_acc_context_dict = _measure_accuracy(dev_data, el_pipe)
print("dev acc context avg:", round(dev_acc_context, 3),
[(x, round(y, 3)) for x, y in dev_acc_context_dict.items()])
# measuring combined accuracy (prior + context)
el_pipe.cfg["context_weight"] = 1
el_pipe.cfg["prior_weight"] = 1
dev_acc_combo, dev_acc_combo_dict = _measure_accuracy(dev_data, el_pipe, error_analysis=False)
print("dev acc combo avg:", round(dev_acc_combo, 3),
[(x, round(y, 3)) for x, y in dev_acc_combo_dict.items()])
# STEP 8: apply the EL pipe on a toy example
if to_test_pipeline:
print()
print("STEP 8: applying Entity Linking to toy example", datetime.datetime.now())
print()
run_el_toy_example(nlp=nlp_2)
# STEP 9: write the NLP pipeline (including entity linker) to file
if to_write_nlp:
print()
print("STEP 9: testing NLP IO", datetime.datetime.now())
print()
print("writing to", NLP_2_DIR)
nlp_2.to_disk(NLP_2_DIR)
print()
# verify that the IO has gone correctly
if to_read_nlp:
print("reading from", NLP_2_DIR)
nlp_3 = spacy.load(NLP_2_DIR)
print("running toy example with NLP 3")
run_el_toy_example(nlp=nlp_3)
# testing performance with an NLP model from file
if test_from_file:
nlp_2 = spacy.load(NLP_1_DIR)
nlp_3 = spacy.load(NLP_2_DIR)
el_pipe = nlp_3.get_pipe("entity_linker")
dev_limit = 5000
dev_data = training_set_creator.read_training(nlp=nlp_2,
training_dir=TRAINING_DIR,
dev=True,
limit=dev_limit)
print("Dev testing from file on", len(dev_data), "articles")
print()
dev_acc_combo, dev_acc_combo_dict = _measure_accuracy(dev_data, el_pipe=el_pipe, error_analysis=False)
print("dev acc combo avg:", round(dev_acc_combo, 3),
[(x, round(y, 3)) for x, y in dev_acc_combo_dict.items()])
print()
print("STOP", datetime.datetime.now())
def _measure_accuracy(data, el_pipe=None, error_analysis=False):
# If the docs in the data require further processing with an entity linker, set el_pipe
correct_by_label = dict()
incorrect_by_label = dict()
docs = [d for d, g in data if len(d) > 0]
if el_pipe is not None:
docs = list(el_pipe.pipe(docs))
golds = [g for d, g in data if len(d) > 0]
for doc, gold in zip(docs, golds):
try:
correct_entries_per_article = dict()
for entity in gold.links:
start, end, gold_kb = entity
correct_entries_per_article[str(start) + "-" + str(end)] = gold_kb
for ent in doc.ents:
ent_label = ent.label_
pred_entity = ent.kb_id_
start = ent.start_char
end = ent.end_char
gold_entity = correct_entries_per_article.get(str(start) + "-" + str(end), None)
# the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
if gold_entity is not None:
if gold_entity == pred_entity:
correct = correct_by_label.get(ent_label, 0)
correct_by_label[ent_label] = correct + 1
else:
incorrect = incorrect_by_label.get(ent_label, 0)
incorrect_by_label[ent_label] = incorrect + 1
if error_analysis:
print(ent.text, "in", doc)
print("Predicted", pred_entity, "should have been", gold_entity)
print()
except Exception as e:
print("Error assessing accuracy", e)
acc, acc_by_label = calculate_acc(correct_by_label, incorrect_by_label)
return acc, acc_by_label
def _measure_baselines(data, kb):
# Measure 3 performance baselines: random selection, prior probabilities, and 'oracle' prediction for upper bound
counts_by_label = dict()
random_correct_by_label = dict()
random_incorrect_by_label = dict()
oracle_correct_by_label = dict()
oracle_incorrect_by_label = dict()
prior_correct_by_label = dict()
prior_incorrect_by_label = dict()
docs = [d for d, g in data if len(d) > 0]
golds = [g for d, g in data if len(d) > 0]
for doc, gold in zip(docs, golds):
try:
correct_entries_per_article = dict()
for entity in gold.links:
start, end, gold_kb = entity
correct_entries_per_article[str(start) + "-" + str(end)] = gold_kb
for ent in doc.ents:
ent_label = ent.label_
start = ent.start_char
end = ent.end_char
gold_entity = correct_entries_per_article.get(str(start) + "-" + str(end), None)
# the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
if gold_entity is not None:
counts_by_label[ent_label] = counts_by_label.get(ent_label, 0) + 1
candidates = kb.get_candidates(ent.text)
oracle_candidate = ""
best_candidate = ""
random_candidate = ""
if candidates:
scores = []
for c in candidates:
scores.append(c.prior_prob)
if c.entity_ == gold_entity:
oracle_candidate = c.entity_
best_index = scores.index(max(scores))
best_candidate = candidates[best_index].entity_
random_candidate = random.choice(candidates).entity_
if gold_entity == best_candidate:
prior_correct_by_label[ent_label] = prior_correct_by_label.get(ent_label, 0) + 1
else:
prior_incorrect_by_label[ent_label] = prior_incorrect_by_label.get(ent_label, 0) + 1
if gold_entity == random_candidate:
random_correct_by_label[ent_label] = random_correct_by_label.get(ent_label, 0) + 1
else:
random_incorrect_by_label[ent_label] = random_incorrect_by_label.get(ent_label, 0) + 1
if gold_entity == oracle_candidate:
oracle_correct_by_label[ent_label] = oracle_correct_by_label.get(ent_label, 0) + 1
else:
oracle_incorrect_by_label[ent_label] = oracle_incorrect_by_label.get(ent_label, 0) + 1
except Exception as e:
print("Error assessing accuracy", e)
acc_prior, acc_prior_by_label = calculate_acc(prior_correct_by_label, prior_incorrect_by_label)
acc_rand, acc_rand_by_label = calculate_acc(random_correct_by_label, random_incorrect_by_label)
acc_oracle, acc_oracle_by_label = calculate_acc(oracle_correct_by_label, oracle_incorrect_by_label)
return counts_by_label, acc_rand, acc_rand_by_label, acc_prior, acc_prior_by_label, acc_oracle, acc_oracle_by_label
def calculate_acc(correct_by_label, incorrect_by_label):
acc_by_label = dict()
total_correct = 0
total_incorrect = 0
all_keys = set()
all_keys.update(correct_by_label.keys())
all_keys.update(incorrect_by_label.keys())
for label in sorted(all_keys):
correct = correct_by_label.get(label, 0)
incorrect = incorrect_by_label.get(label, 0)
total_correct += correct
total_incorrect += incorrect
if correct == incorrect == 0:
acc_by_label[label] = 0
else:
acc_by_label[label] = correct / (correct + incorrect)
acc = 0
if not (total_correct == total_incorrect == 0):
acc = total_correct / (total_correct + total_incorrect)
return acc, acc_by_label
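
A tiny worked example (editor's addition) of the accuracy bookkeeping: with three correct and one incorrect PERSON prediction and two incorrect ORG predictions,

acc, acc_by_label = calculate_acc({"PERSON": 3}, {"PERSON": 1, "ORG": 2})
# acc_by_label == {"ORG": 0.0, "PERSON": 0.75}
# acc == 3 / 6 == 0.5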
def check_kb(kb):
for mention in ("Bush", "Douglas Adams", "Homer", "Brazil", "China"):
candidates = kb.get_candidates(mention)
print("generating candidates for " + mention + " :")
for c in candidates:
print(" ", c.prior_prob, c.alias_, "-->", c.entity_ + " (freq=" + str(c.entity_freq) + ")")
print()
def run_el_toy_example(nlp):
text = "In The Hitchhiker's Guide to the Galaxy, written by Douglas Adams, " \
"Douglas reminds us to always bring our towel, even in China or Brazil. " \
"The main character in Doug's novel is the man Arthur Dent, " \
"but Douglas doesn't write about George Washington or Homer Simpson."
doc = nlp(text)
print(text)
for ent in doc.ents:
print(" ent", ent.text, ent.label_, ent.kb_id_)
print()
if __name__ == "__main__":
run_pipeline()

View File

@ -5,6 +5,6 @@ requires = ["setuptools",
"cymem>=2.0.2,<2.1.0",
"preshed>=2.0.1,<2.1.0",
"murmurhash>=0.28.0,<1.1.0",
"thinc==7.0.0.dev6",
"thinc>=7.0.8,<7.1.0",
]
build-backend = "setuptools.build_meta"

View File

@ -1,7 +1,7 @@
# Our libraries
cymem>=2.0.2,<2.1.0
preshed>=2.0.1,<2.1.0
thinc>=7.0.2,<7.1.0
thinc>=7.0.8,<7.1.0
blis>=0.2.2,<0.3.0
murmurhash>=0.28.0,<1.1.0
wasabi>=0.2.0,<1.1.0

View File

@ -228,7 +228,7 @@ def setup_package():
"murmurhash>=0.28.0,<1.1.0",
"cymem>=2.0.2,<2.1.0",
"preshed>=2.0.1,<2.1.0",
"thinc>=7.0.2,<7.1.0",
"thinc>=7.0.8,<7.1.0",
"blis>=0.2.2,<0.3.0",
"plac<1.0.0,>=0.9.6",
"requests>=2.13.0,<3.0.0",
@ -246,6 +246,7 @@ def setup_package():
"cuda100": ["thinc_gpu_ops>=0.0.1,<0.1.0", "cupy-cuda100>=5.0.0b4"],
# Language tokenizers with external dependencies
"ja": ["mecab-python3==0.7"],
"ko": ["natto-py==0.9.0"],
},
python_requires=">=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*",
classifiers=[

View File

@ -24,7 +24,7 @@ from thinc.neural._classes.affine import _set_dimensions_if_needed
import thinc.extra.load_nlp
from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE
from .errors import Errors
from .errors import Errors, user_warning, Warnings
from . import util
try:
@ -299,7 +299,17 @@ def link_vectors_to_models(vocab):
data = ops.asarray(vectors.data)
# Set an entry here, so that vectors are accessed by StaticVectors
# (unideal, I know)
thinc.extra.load_nlp.VECTORS[(ops.device, vectors.name)] = data
key = (ops.device, vectors.name)
if key in thinc.extra.load_nlp.VECTORS:
if thinc.extra.load_nlp.VECTORS[key].shape != data.shape:
# This is a hack to avoid the problem in #3853. Maybe we should
# print a warning as well?
old_name = vectors.name
new_name = vectors.name + "_%d" % data.shape[0]
user_warning(Warnings.W019.format(old=old_name, new=new_name))
vectors.name = new_name
key = (ops.device, vectors.name)
thinc.extra.load_nlp.VECTORS[key] = data
def PyTorchBiLSTM(nO, nI, depth, dropout=0.2):
@ -652,6 +662,51 @@ def build_simple_cnn_text_classifier(tok2vec, nr_class, exclusive_classes=False,
return model
def build_nel_encoder(embed_width, hidden_width, ner_types, **cfg):
# TODO proper error
if "entity_width" not in cfg:
raise ValueError("entity_width not found")
if "context_width" not in cfg:
raise ValueError("context_width not found")
conv_depth = cfg.get("conv_depth", 2)
cnn_maxout_pieces = cfg.get("cnn_maxout_pieces", 3)
pretrained_vectors = cfg.get("pretrained_vectors") # self.nlp.vocab.vectors.name
context_width = cfg.get("context_width")
entity_width = cfg.get("entity_width")
with Model.define_operators({">>": chain, "**": clone}):
model = (
Affine(entity_width, entity_width + context_width + 1 + ner_types)
>> Affine(1, entity_width, drop_factor=0.0)
>> logistic
)
# context encoder
tok2vec = (
Tok2Vec(
width=hidden_width,
embed_size=embed_width,
pretrained_vectors=pretrained_vectors,
cnn_maxout_pieces=cnn_maxout_pieces,
subword_features=True,
conv_depth=conv_depth,
bilstm_depth=0,
)
>> flatten_add_lengths
>> Pooling(mean_pool)
>> Residual(zero_init(Maxout(hidden_width, hidden_width)))
>> zero_init(Affine(context_width, hidden_width))
)
model.tok2vec = tok2vec
model.tok2vec.nO = context_width
model.nO = 1
return model
@layerize
def flatten(seqs, drop=0.0):
ops = Model.ops

View File

@ -4,13 +4,13 @@
# fmt: off
__title__ = "spacy"
__version__ = "2.1.4"
__version__ = "2.1.5"
__summary__ = "Industrial-strength Natural Language Processing (NLP) with Python and Cython"
__uri__ = "https://spacy.io"
__author__ = "Explosion AI"
__email__ = "contact@explosion.ai"
__license__ = "MIT"
__release__ = False
__release__ = True
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"

View File

@ -82,6 +82,7 @@ cdef enum attr_id_t:
DEP
ENT_IOB
ENT_TYPE
ENT_KB_ID
HEAD
SENT_START
SPACY

View File

@ -84,6 +84,7 @@ IDS = {
"DEP": DEP,
"ENT_IOB": ENT_IOB,
"ENT_TYPE": ENT_TYPE,
"ENT_KB_ID": ENT_KB_ID,
"HEAD": HEAD,
"SENT_START": SENT_START,
"SPACY": SPACY,

View File

@ -5,6 +5,7 @@ import plac
import random
import numpy
import time
import re
from collections import Counter
from pathlib import Path
from thinc.v2v import Affine, Maxout
@ -65,6 +66,13 @@ from .train import _load_pretrained_tok2vec
"t2v",
Path,
),
epoch_start=(
"The epoch to start counting at. Only relevant when using '--init-tok2vec' and the given weight file has been "
"renamed. Prevents unintended overwriting of existing weight files.",
"option",
"es",
int
),
)
def pretrain(
texts_loc,
@ -83,6 +91,7 @@ def pretrain(
seed=0,
n_save_every=None,
init_tok2vec=None,
epoch_start=None,
):
"""
Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components,
@ -151,9 +160,29 @@ def pretrain(
if init_tok2vec is not None:
components = _load_pretrained_tok2vec(nlp, init_tok2vec)
msg.text("Loaded pretrained tok2vec for: {}".format(components))
# Parse the epoch number from the given weight file
model_name = re.search(r"model\d+\.bin", str(init_tok2vec))
if model_name:
            # Default weight file name, so read epoch_start from it by stripping off 'model' and '.bin'
epoch_start = int(model_name.group(0)[5:][:-4]) + 1
else:
if not epoch_start:
msg.fail(
"You have to use the '--epoch-start' argument when using a renamed weight file for "
"'--init-tok2vec'", exits=True
)
elif epoch_start < 0:
msg.fail(
"The argument '--epoch-start' has to be greater or equal to 0. '%d' is invalid" % epoch_start,
exits=True
)
else:
# Without '--init-tok2vec' the '--epoch-start' argument is ignored
epoch_start = 0
optimizer = create_default_optimizer(model.ops)
tracker = ProgressTracker(frequency=10000)
msg.divider("Pre-training tok2vec layer")
msg.divider("Pre-training tok2vec layer - starting at epoch %d" % epoch_start)
row_settings = {"widths": (3, 10, 10, 6, 4), "aligns": ("r", "r", "r", "r", "r")}
msg.row(("#", "# Words", "Total Loss", "Loss", "w/s"), **row_settings)
@ -174,7 +203,7 @@ def pretrain(
file_.write(srsly.json_dumps(log) + "\n")
skip_counter = 0
for epoch in range(n_iter):
for epoch in range(epoch_start, n_iter + epoch_start):
for batch_id, batch in enumerate(
util.minibatch_by_words(((text, None) for text in texts), size=batch_size)
):
@ -272,7 +301,7 @@ def get_vectors_loss(ops, docs, prediction, objective="L2"):
elif objective == "cosine":
loss, d_target = get_cossim_loss(prediction, target)
else:
raise ValueError(Errors.E139.format(loss_func=objective))
raise ValueError(Errors.E142.format(loss_func=objective))
return loss, d_target

View File

@ -82,6 +82,8 @@ class Warnings(object):
"parallel inference via multiprocessing.")
W017 = ("Alias '{alias}' already exists in the Knowledge base.")
W018 = ("Entity '{entity}' already exists in the Knowledge base.")
W019 = ("Changing vectors name from {old} to {new}, to avoid clash with "
"previously loaded vectors. See Issue #3853.")
@add_codes
@ -399,7 +401,11 @@ class Errors(object):
E138 = ("Invalid JSONL format for raw text '{text}'. Make sure the input includes either the "
"`text` or `tokens` key. For more info, see the docs:\n"
"https://spacy.io/api/cli#pretrain-jsonl")
E139 = ("Unsupported loss_function '{loss_func}'. Use either 'L2' or 'cosine'")
E139 = ("Knowledge base for component '{name}' not initialized. Did you forget to call set_kb()?")
E140 = ("The list of entities, prior probabilities and entity vectors should be of equal length.")
E141 = ("Entity vectors should be of length {required} instead of the provided {found}.")
E142 = ("Unsupported loss_function '{loss_func}'. Use either 'L2' or 'cosine'")
E143 = ("Labels for component '{name}' not initialized. Did you forget to call add_label()?")
@add_codes

View File

@ -31,6 +31,7 @@ cdef class GoldParse:
cdef public list ents
cdef public dict brackets
cdef public object cats
cdef public list links
cdef readonly list cand_to_gold
cdef readonly list gold_to_cand

View File

@ -427,7 +427,7 @@ cdef class GoldParse:
def __init__(self, doc, annot_tuples=None, words=None, tags=None,
heads=None, deps=None, entities=None, make_projective=False,
cats=None, **_):
cats=None, links=None, **_):
"""Create a GoldParse.
doc (Doc): The document the annotations refer to.
@ -450,6 +450,8 @@ cdef class GoldParse:
examples of a label to have the value 0.0. Labels not in the
dictionary are treated as missing - the gradient for those labels
will be zero.
links (iterable): A sequence of `(start_char, end_char, kb_id)` tuples,
representing the external ID of an entity in a knowledge base.
RETURNS (GoldParse): The newly constructed object.
"""
if words is None:
@ -485,6 +487,7 @@ cdef class GoldParse:
self.c.ner = <Transition*>self.mem.alloc(len(doc), sizeof(Transition))
self.cats = {} if cats is None else dict(cats)
self.links = links
self.words = [None] * len(doc)
self.tags = [None] * len(doc)
self.heads = [None] * len(doc)

View File

@ -1,53 +1,27 @@
"""Knowledge-base for entity or concept linking."""
from cymem.cymem cimport Pool
from preshed.maps cimport PreshMap
from libcpp.vector cimport vector
from libc.stdint cimport int32_t, int64_t
from libc.stdio cimport FILE
from spacy.vocab cimport Vocab
from .typedefs cimport hash_t
# Internal struct, for storage and disambiguation. This isn't what we return
# to the user as the answer to "here's your entity". It's the minimum number
# of bits we need to keep track of the answers.
cdef struct _EntryC:
    # The hash of this entry's unique ID and name in the KB
hash_t entity_hash
# Allows retrieval of one or more vectors.
# Each element of vector_rows should be an index into a vectors table.
# Every entry should have the same number of vectors, so we can avoid storing
# the number of vectors in each knowledge-base struct
int32_t* vector_rows
# Allows retrieval of a struct of non-vector features. We could make this a
# pointer, but we have 32 bits left over in the struct after prob, so we'd
# like this to only be 32 bits. We can also set this to -1, for the common
# case where there are no features.
int32_t feats_row
# log probability of entity, based on corpus frequency
float prob
# Each alias struct stores a list of Entry pointers with their prior probabilities
# for this specific mention/alias.
cdef struct _AliasC:
# All entry candidates for this alias
vector[int64_t] entry_indices
# Prior probability P(entity|alias) - should sum up to (at most) 1.
vector[float] probs
from .structs cimport KBEntryC, AliasC
ctypedef vector[KBEntryC] entry_vec
ctypedef vector[AliasC] alias_vec
ctypedef vector[float] float_vec
ctypedef vector[float_vec] float_matrix
# Object used by the Entity Linker that summarizes one entity-alias candidate combination.
cdef class Candidate:
cdef readonly KnowledgeBase kb
cdef hash_t entity_hash
cdef float entity_freq
cdef vector[float] entity_vector
cdef hash_t alias_hash
cdef float prior_prob
@ -55,9 +29,10 @@ cdef class Candidate:
cdef class KnowledgeBase:
cdef Pool mem
cpdef readonly Vocab vocab
cdef int64_t entity_vector_length
# This maps 64bit keys (hash of unique entity string)
# to 64bit values (position of the _EntryC struct in the _entries vector).
# to 64bit values (position of the _KBEntryC struct in the _entries vector).
# The PreshMap is pretty space efficient, as it uses open addressing. So
# the only overhead is the vacancy rate, which is approximately 30%.
cdef PreshMap _entry_index
@ -66,7 +41,7 @@ cdef class KnowledgeBase:
# over allocation.
# In total we end up with (N*128*1.3)+(N*128*1.3) bits for N entries.
# Storing 1m entries would take 41.6mb under this scheme.
cdef vector[_EntryC] _entries
cdef entry_vec _entries
# This maps 64bit keys (hash of unique alias string)
# to 64bit values (position of the _AliasC struct in the _aliases_table vector).
@ -76,7 +51,7 @@ cdef class KnowledgeBase:
# should be P(entity | mention), which is pretty important to know.
# We can pack both pieces of information into a 64-bit value, to keep things
# efficient.
cdef vector[_AliasC] _aliases_table
cdef alias_vec _aliases_table
# This is the part which might take more space: storing various
# categorical features for the entries, and storing vectors for disambiguation
@ -87,7 +62,7 @@ cdef class KnowledgeBase:
# model, that embeds different features of the entities into vectors. We'll
# still want some per-entity features, like the Wikipedia text or entity
# co-occurrence. Hopefully those vectors can be narrow, e.g. 64 dimensions.
cdef object _vectors_table
cdef float_matrix _vectors_table
# It's very useful to track categorical features, at least for output, even
# if they're not useful in the model itself. For instance, we should be
@ -96,53 +71,102 @@ cdef class KnowledgeBase:
# optional data, we can let users configure a DB as the backend for this.
cdef object _features_table
cdef inline int64_t c_add_vector(self, vector[float] entity_vector) nogil:
"""Add an entity vector to the vectors table."""
cdef int64_t new_index = self._vectors_table.size()
self._vectors_table.push_back(entity_vector)
return new_index
cdef inline int64_t c_add_entity(self, hash_t entity_hash, float prob,
int32_t* vector_rows, int feats_row):
"""Add an entry to the knowledge base."""
# This is what we'll map the hash key to. It's where the entry will sit
int32_t vector_index, int feats_row) nogil:
"""Add an entry to the vector of entries.
        After calling this method, make sure to also update the _entry_index using the return value"""
# This is what we'll map the entity hash key to. It's where the entry will sit
# in the vector of entries, so we can get it later.
cdef int64_t new_index = self._entries.size()
self._entries.push_back(
_EntryC(
entity_hash=entity_hash,
vector_rows=vector_rows,
feats_row=feats_row,
prob=prob
))
self._entry_index[entity_hash] = new_index
# Avoid struct initializer to enable nogil, cf https://github.com/cython/cython/issues/1642
cdef KBEntryC entry
entry.entity_hash = entity_hash
entry.vector_index = vector_index
entry.feats_row = feats_row
entry.prob = prob
self._entries.push_back(entry)
return new_index
cdef inline int64_t c_add_aliases(self, hash_t alias_hash, vector[int64_t] entry_indices, vector[float] probs):
"""Connect a mention to a list of potential entities with their prior probabilities ."""
cdef inline int64_t c_add_aliases(self, hash_t alias_hash, vector[int64_t] entry_indices, vector[float] probs) nogil:
"""Connect a mention to a list of potential entities with their prior probabilities .
After calling this method, make sure to update also the _alias_index using the return value"""
# This is what we'll map the alias hash key to. It's where the alias will be defined
# in the vector of aliases.
cdef int64_t new_index = self._aliases_table.size()
self._aliases_table.push_back(
_AliasC(
entry_indices=entry_indices,
probs=probs
))
self._alias_index[alias_hash] = new_index
# Avoid struct initializer to enable nogil
cdef AliasC alias
alias.entry_indices = entry_indices
alias.probs = probs
self._aliases_table.push_back(alias)
return new_index
cdef inline _create_empty_vectors(self):
cdef inline void _create_empty_vectors(self, hash_t dummy_hash) nogil:
"""
Making sure the first element of each vector is a dummy,
Initializing the vectors and making sure the first element of each vector is a dummy,
because the PreshMap maps pointing to indices in these vectors can not contain 0 as value
cf. https://github.com/explosion/preshed/issues/17
"""
cdef int32_t dummy_value = 0
self.vocab.strings.add("")
self._entries.push_back(
_EntryC(
entity_hash=self.vocab.strings[""],
vector_rows=&dummy_value,
feats_row=dummy_value,
prob=dummy_value
))
self._aliases_table.push_back(
_AliasC(
entry_indices=[dummy_value],
probs=[dummy_value]
))
# Avoid struct initializer to enable nogil
cdef KBEntryC entry
entry.entity_hash = dummy_hash
entry.vector_index = dummy_value
entry.feats_row = dummy_value
entry.prob = dummy_value
# Avoid struct initializer to enable nogil
cdef vector[int64_t] dummy_entry_indices
dummy_entry_indices.push_back(0)
cdef vector[float] dummy_probs
dummy_probs.push_back(0)
cdef AliasC alias
alias.entry_indices = dummy_entry_indices
alias.probs = dummy_probs
self._entries.push_back(entry)
self._aliases_table.push_back(alias)
cpdef load_bulk(self, loc)
cpdef set_entities(self, entity_list, prob_list, vector_list)
cdef class Writer:
cdef FILE* _fp
cdef int write_header(self, int64_t nr_entries, int64_t entity_vector_length) except -1
cdef int write_vector_element(self, float element) except -1
cdef int write_entry(self, hash_t entry_hash, float entry_prob, int32_t vector_index) except -1
cdef int write_alias_length(self, int64_t alias_length) except -1
cdef int write_alias_header(self, hash_t alias_hash, int64_t candidate_length) except -1
cdef int write_alias(self, int64_t entry_index, float prob) except -1
cdef int _write(self, void* value, size_t size) except -1
cdef class Reader:
cdef FILE* _fp
cdef int read_header(self, int64_t* nr_entries, int64_t* entity_vector_length) except -1
cdef int read_vector_element(self, float* element) except -1
cdef int read_entry(self, hash_t* entity_hash, float* prob, int32_t* vector_index) except -1
cdef int read_alias_length(self, int64_t* alias_length) except -1
cdef int read_alias_header(self, hash_t* alias_hash, int64_t* candidate_length) except -1
cdef int read_alias(self, int64_t* entry_index, float* prob) except -1
cdef int _read(self, void* value, size_t size) except -1

View File

@ -1,13 +1,30 @@
# cython: infer_types=True
# cython: profile=True
# coding: utf8
from spacy.errors import Errors, Warnings, user_warning
from pathlib import Path
from cymem.cymem cimport Pool
from preshed.maps cimport PreshMap
from cpython.exc cimport PyErr_SetFromErrno
from libc.stdio cimport fopen, fclose, fread, fwrite, feof, fseek
from libc.stdint cimport int32_t, int64_t
from .typedefs cimport hash_t
from os import path
from libcpp.vector cimport vector
cdef class Candidate:
def __init__(self, KnowledgeBase kb, entity_hash, alias_hash, prior_prob):
def __init__(self, KnowledgeBase kb, entity_hash, entity_freq, entity_vector, alias_hash, prior_prob):
self.kb = kb
self.entity_hash = entity_hash
self.entity_freq = entity_freq
self.entity_vector = entity_vector
self.alias_hash = alias_hash
self.prior_prob = prior_prob
@ -19,7 +36,7 @@ cdef class Candidate:
@property
def entity_(self):
"""RETURNS (unicode): ID/name of this entity in the KB"""
return self.kb.vocab.strings[self.entity]
return self.kb.vocab.strings[self.entity_hash]
@property
def alias(self):
@ -29,7 +46,15 @@ cdef class Candidate:
@property
def alias_(self):
"""RETURNS (unicode): ID of the original alias"""
return self.kb.vocab.strings[self.alias]
return self.kb.vocab.strings[self.alias_hash]
@property
def entity_freq(self):
return self.entity_freq
@property
def entity_vector(self):
return self.entity_vector
@property
def prior_prob(self):
@ -38,26 +63,41 @@ cdef class Candidate:
cdef class KnowledgeBase:
def __init__(self, Vocab vocab):
def __init__(self, Vocab vocab, entity_vector_length):
self.vocab = vocab
self.mem = Pool()
self.entity_vector_length = entity_vector_length
self._entry_index = PreshMap()
self._alias_index = PreshMap()
self.mem = Pool()
self._create_empty_vectors()
self.vocab.strings.add("")
self._create_empty_vectors(dummy_hash=self.vocab.strings[""])
@property
def entity_vector_length(self):
"""RETURNS (uint64): length of the entity vectors"""
return self.entity_vector_length
def __len__(self):
return self.get_size_entities()
def get_size_entities(self):
return self._entries.size() - 1 # not counting dummy element on index 0
return len(self._entry_index)
def get_entity_strings(self):
return [self.vocab.strings[x] for x in self._entry_index]
def get_size_aliases(self):
return self._aliases_table.size() - 1 # not counting dummy element on index 0
return len(self._alias_index)
def add_entity(self, unicode entity, float prob=0.5, vectors=None, features=None):
def get_alias_strings(self):
return [self.vocab.strings[x] for x in self._alias_index]
def add_entity(self, unicode entity, float prob, vector[float] entity_vector):
"""
Add an entity to the KB.
Return the hash of the entity ID at the end
        Add an entity to the KB, specifying its log probability based on corpus frequency.
Return the hash of the entity ID/name at the end.
"""
cdef hash_t entity_hash = self.vocab.strings.add(entity)
@ -66,40 +106,72 @@ cdef class KnowledgeBase:
user_warning(Warnings.W018.format(entity=entity))
return
cdef int32_t dummy_value = 342
self.c_add_entity(entity_hash=entity_hash, prob=prob,
vector_rows=&dummy_value, feats_row=dummy_value)
# TODO self._vectors_table.get_pointer(vectors),
# self._features_table.get(features))
# Raise an error if the provided entity vector is not of the correct length
if len(entity_vector) != self.entity_vector_length:
raise ValueError(Errors.E141.format(found=len(entity_vector), required=self.entity_vector_length))
vector_index = self.c_add_vector(entity_vector=entity_vector)
new_index = self.c_add_entity(entity_hash=entity_hash,
prob=prob,
vector_index=vector_index,
feats_row=-1) # Features table currently not implemented
self._entry_index[entity_hash] = new_index
return entity_hash
cpdef set_entities(self, entity_list, prob_list, vector_list):
if len(entity_list) != len(prob_list) or len(entity_list) != len(vector_list):
raise ValueError(Errors.E140)
nr_entities = len(entity_list)
self._entry_index = PreshMap(nr_entities+1)
self._entries = entry_vec(nr_entities+1)
i = 0
cdef KBEntryC entry
while i < nr_entities:
entity_vector = vector_list[i]
if len(entity_vector) != self.entity_vector_length:
raise ValueError(Errors.E141.format(found=len(entity_vector), required=self.entity_vector_length))
entity_hash = self.vocab.strings.add(entity_list[i])
entry.entity_hash = entity_hash
entry.prob = prob_list[i]
vector_index = self.c_add_vector(entity_vector=vector_list[i])
entry.vector_index = vector_index
entry.feats_row = -1 # Features table currently not implemented
self._entries[i+1] = entry
self._entry_index[entity_hash] = i+1
i += 1
def add_alias(self, unicode alias, entities, probabilities):
"""
For a given alias, add its potential entities and prior probabilies to the KB.
Return the alias_hash at the end
"""
# Throw an error if the length of entities and probabilities are not the same
if not len(entities) == len(probabilities):
raise ValueError(Errors.E132.format(alias=alias,
entities_length=len(entities),
probabilities_length=len(probabilities)))
# Throw an error if the probabilities sum up to more than 1
# Throw an error if the probabilities sum up to more than 1 (allow for some rounding errors)
prob_sum = sum(probabilities)
if prob_sum > 1:
if prob_sum > 1.00001:
raise ValueError(Errors.E133.format(alias=alias, sum=prob_sum))
cdef hash_t alias_hash = self.vocab.strings.add(alias)
# Return if this alias was added before
# Check whether this alias was added before
if alias_hash in self._alias_index:
user_warning(Warnings.W017.format(alias=alias))
return
cdef hash_t entity_hash
cdef vector[int64_t] entry_indices
cdef vector[float] probs
@ -112,20 +184,295 @@ cdef class KnowledgeBase:
entry_indices.push_back(int(entry_index))
probs.push_back(float(prob))
self.c_add_aliases(alias_hash=alias_hash, entry_indices=entry_indices, probs=probs)
new_index = self.c_add_aliases(alias_hash=alias_hash, entry_indices=entry_indices, probs=probs)
self._alias_index[alias_hash] = new_index
return alias_hash
def get_candidates(self, unicode alias):
""" TODO: where to put this functionality ?"""
cdef hash_t alias_hash = self.vocab.strings[alias]
alias_index = <int64_t>self._alias_index.get(alias_hash)
alias_entry = self._aliases_table[alias_index]
return [Candidate(kb=self,
entity_hash=self._entries[entry_index].entity_hash,
entity_freq=self._entries[entry_index].prob,
entity_vector=self._vectors_table[self._entries[entry_index].vector_index],
alias_hash=alias_hash,
prior_prob=prob)
for (entry_index, prob) in zip(alias_entry.entry_indices, alias_entry.probs)
if entry_index != 0]
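
Given a populated kb, the Candidate objects returned above resolve the stored hashes back to strings; a short inspection sketch (editor's addition, with an invented alias):

for c in kb.get_candidates("Douglas"):
    print(c.alias_, "->", c.entity_, "prior:", c.prior_prob, "freq:", c.entity_freq)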
def dump(self, loc):
cdef Writer writer = Writer(loc)
writer.write_header(self.get_size_entities(), self.entity_vector_length)
# dumping the entity vectors in their original order
i = 0
for entity_vector in self._vectors_table:
for element in entity_vector:
writer.write_vector_element(element)
i = i+1
# dumping the entry records in the order in which they are in the _entries vector.
# index 0 is a dummy object not stored in the _entry_index and can be ignored.
i = 1
for entry_hash, entry_index in sorted(self._entry_index.items(), key=lambda x: x[1]):
entry = self._entries[entry_index]
assert entry.entity_hash == entry_hash
assert entry_index == i
writer.write_entry(entry.entity_hash, entry.prob, entry.vector_index)
i = i+1
writer.write_alias_length(self.get_size_aliases())
# dumping the aliases in the order in which they are in the _alias_index vector.
# index 0 is a dummy object not stored in the _aliases_table and can be ignored.
i = 1
for alias_hash, alias_index in sorted(self._alias_index.items(), key=lambda x: x[1]):
alias = self._aliases_table[alias_index]
assert alias_index == i
candidate_length = len(alias.entry_indices)
writer.write_alias_header(alias_hash, candidate_length)
for j in range(0, candidate_length):
writer.write_alias(alias.entry_indices[j], alias.probs[j])
i = i+1
writer.close()
cpdef load_bulk(self, loc):
cdef hash_t entity_hash
cdef hash_t alias_hash
cdef int64_t entry_index
cdef float prob
cdef int32_t vector_index
cdef KBEntryC entry
cdef AliasC alias
cdef float vector_element
cdef Reader reader = Reader(loc)
# STEP 0: load header and initialize KB
cdef int64_t nr_entities
cdef int64_t entity_vector_length
reader.read_header(&nr_entities, &entity_vector_length)
self.entity_vector_length = entity_vector_length
self._entry_index = PreshMap(nr_entities+1)
self._entries = entry_vec(nr_entities+1)
self._vectors_table = float_matrix(nr_entities+1)
# STEP 1: load entity vectors
cdef int i = 0
cdef int j = 0
while i < nr_entities:
entity_vector = float_vec(entity_vector_length)
j = 0
while j < entity_vector_length:
reader.read_vector_element(&vector_element)
entity_vector[j] = vector_element
j = j+1
self._vectors_table[i] = entity_vector
i = i+1
# STEP 2: load entities
# we assume that the entity data was written in sequence
# index 0 is a dummy object not stored in the _entry_index and can be ignored.
i = 1
while i <= nr_entities:
reader.read_entry(&entity_hash, &prob, &vector_index)
entry.entity_hash = entity_hash
entry.prob = prob
entry.vector_index = vector_index
entry.feats_row = -1 # Features table currently not implemented
self._entries[i] = entry
self._entry_index[entity_hash] = i
i += 1
# check that all entities were read in properly
assert nr_entities == self.get_size_entities()
# STEP 3: load aliases
cdef int64_t nr_aliases
reader.read_alias_length(&nr_aliases)
self._alias_index = PreshMap(nr_aliases+1)
self._aliases_table = alias_vec(nr_aliases+1)
cdef int64_t nr_candidates
cdef vector[int64_t] entry_indices
cdef vector[float] probs
i = 1
# we assume the alias data was written in sequence
# index 0 is a dummy object not stored in the _entry_index and can be ignored.
while i <= nr_aliases:
reader.read_alias_header(&alias_hash, &nr_candidates)
entry_indices = vector[int64_t](nr_candidates)
probs = vector[float](nr_candidates)
for j in range(0, nr_candidates):
reader.read_alias(&entry_index, &prob)
entry_indices[j] = entry_index
probs[j] = prob
alias.entry_indices = entry_indices
alias.probs = probs
self._aliases_table[i] = alias
self._alias_index[alias_hash] = i
i += 1
# check that all aliases were read in properly
assert nr_aliases == self.get_size_aliases()
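
A minimal round-trip sketch of the serialization above (editor's addition; the entity ID, vector, alias, probabilities and file path are all invented):

from spacy.vocab import Vocab
from spacy.kb import KnowledgeBase

vocab = Vocab()
kb = KnowledgeBase(vocab=vocab, entity_vector_length=3)
kb.add_entity(entity="Q42", prob=0.9, entity_vector=[1.0, 0.0, 0.0])
kb.add_alias(alias="Douglas", entities=["Q42"], probabilities=[0.8])
kb.dump("/tmp/demo_kb")

kb2 = KnowledgeBase(vocab=vocab, entity_vector_length=3)
kb2.load_bulk("/tmp/demo_kb")
assert kb2.get_size_entities() == 1 and kb2.get_size_aliases() == 1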
cdef class Writer:
def __init__(self, object loc):
if path.exists(loc):
            assert not path.isdir(loc), "%s is a directory." % loc
if isinstance(loc, Path):
loc = bytes(loc)
cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
self._fp = fopen(<char*>bytes_loc, 'wb')
assert self._fp != NULL
fseek(self._fp, 0, 0)
def close(self):
cdef size_t status = fclose(self._fp)
assert status == 0
cdef int write_header(self, int64_t nr_entries, int64_t entity_vector_length) except -1:
self._write(&nr_entries, sizeof(nr_entries))
self._write(&entity_vector_length, sizeof(entity_vector_length))
cdef int write_vector_element(self, float element) except -1:
self._write(&element, sizeof(element))
cdef int write_entry(self, hash_t entry_hash, float entry_prob, int32_t vector_index) except -1:
self._write(&entry_hash, sizeof(entry_hash))
self._write(&entry_prob, sizeof(entry_prob))
self._write(&vector_index, sizeof(vector_index))
# Features table currently not implemented and not written to file
cdef int write_alias_length(self, int64_t alias_length) except -1:
self._write(&alias_length, sizeof(alias_length))
cdef int write_alias_header(self, hash_t alias_hash, int64_t candidate_length) except -1:
self._write(&alias_hash, sizeof(alias_hash))
self._write(&candidate_length, sizeof(candidate_length))
cdef int write_alias(self, int64_t entry_index, float prob) except -1:
self._write(&entry_index, sizeof(entry_index))
self._write(&prob, sizeof(prob))
cdef int _write(self, void* value, size_t size) except -1:
status = fwrite(value, size, 1, self._fp)
assert status == 1, status
cdef class Reader:
def __init__(self, object loc):
assert path.exists(loc)
assert not path.isdir(loc)
if isinstance(loc, Path):
loc = bytes(loc)
cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
self._fp = fopen(<char*>bytes_loc, 'rb')
if not self._fp:
PyErr_SetFromErrno(IOError)
status = fseek(self._fp, 0, 0) # this can be 0 if there is no header
def __dealloc__(self):
fclose(self._fp)
cdef int read_header(self, int64_t* nr_entries, int64_t* entity_vector_length) except -1:
status = self._read(nr_entries, sizeof(int64_t))
if status < 1:
if feof(self._fp):
return 0 # end of file
raise IOError("error reading header from input file")
status = self._read(entity_vector_length, sizeof(int64_t))
if status < 1:
if feof(self._fp):
return 0 # end of file
raise IOError("error reading header from input file")
cdef int read_vector_element(self, float* element) except -1:
status = self._read(element, sizeof(float))
if status < 1:
if feof(self._fp):
return 0 # end of file
raise IOError("error reading entity vector from input file")
cdef int read_entry(self, hash_t* entity_hash, float* prob, int32_t* vector_index) except -1:
status = self._read(entity_hash, sizeof(hash_t))
if status < 1:
if feof(self._fp):
return 0 # end of file
raise IOError("error reading entity hash from input file")
status = self._read(prob, sizeof(float))
if status < 1:
if feof(self._fp):
return 0 # end of file
raise IOError("error reading entity prob from input file")
status = self._read(vector_index, sizeof(int32_t))
if status < 1:
if feof(self._fp):
return 0 # end of file
raise IOError("error reading entity vector from input file")
if feof(self._fp):
return 0
else:
return 1
cdef int read_alias_length(self, int64_t* alias_length) except -1:
status = self._read(alias_length, sizeof(int64_t))
if status < 1:
if feof(self._fp):
return 0 # end of file
raise IOError("error reading alias length from input file")
cdef int read_alias_header(self, hash_t* alias_hash, int64_t* candidate_length) except -1:
status = self._read(alias_hash, sizeof(hash_t))
if status < 1:
if feof(self._fp):
return 0 # end of file
raise IOError("error reading alias hash from input file")
status = self._read(candidate_length, sizeof(int64_t))
if status < 1:
if feof(self._fp):
return 0 # end of file
raise IOError("error reading candidate length from input file")
cdef int read_alias(self, int64_t* entry_index, float* prob) except -1:
status = self._read(entry_index, sizeof(int64_t))
if status < 1:
if feof(self._fp):
return 0 # end of file
raise IOError("error reading entry index for alias from input file")
status = self._read(prob, sizeof(float))
if status < 1:
if feof(self._fp):
return 0 # end of file
raise IOError("error reading prob for entity/alias from input file")
cdef int _read(self, void* value, size_t size) except -1:
status = fread(value, size, 1, self._fp)
return status
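
Summarizing the binary layout that Writer emits and Reader consumes (editor's summary inferred from the code above, not an official format specification):

# header:   nr_entries (int64), entity_vector_length (int64)
# vectors:  nr_entries * entity_vector_length float32 values, in storage order
# entries:  nr_entries records of (entity_hash: uint64, prob: float32, vector_index: int32)
# aliases:  nr_aliases (int64), then per alias:
#           alias_hash (uint64), candidate_length (int64),
#           followed by candidate_length pairs of (entry_index: int64, prior_prob: float32)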

View File

@ -9,6 +9,8 @@ _bengali = r"\u0980-\u09FF"
_hebrew = r"\u0591-\u05F4\uFB1D-\uFB4F"
_hindi = r"\u0900-\u097F"
# Latin standard
_latin_u_standard = r"A-Z"
_latin_l_standard = r"a-z"
@ -193,7 +195,7 @@ _ukrainian = r"а-щюяіїєґА-ЩЮЯІЇЄҐ"
_upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian_upper
_lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower
_uncased = _bengali + _hebrew + _persian + _sinhala
_uncased = _bengali + _hebrew + _persian + _sinhala + _hindi
ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased)
ALPHA_LOWER = group_chars(_lower + _uncased)

View File

@ -5,7 +5,7 @@ from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.en.examples import sentences
>>> from spacy.lang.id.examples import sentences
>>> docs = nlp.pipe(sentences)
"""

120
spacy/lang/ko/__init__.py Normal file
View File

@ -0,0 +1,120 @@
# encoding: utf8
from __future__ import unicode_literals, print_function
import re
import sys
from .stop_words import STOP_WORDS
from .tag_map import TAG_MAP
from ...attrs import LANG
from ...language import Language
from ...tokens import Doc
from ...compat import copy_reg
from ...util import DummyTokenizer
from ...compat import is_python3, is_python_pre_3_5
is_python_post_3_7 = is_python3 and sys.version_info[1] >= 7
# fmt: off
if is_python_pre_3_5:
from collections import namedtuple
Morpheme = namedtuple("Morpheme", "surface lemma tag")
elif is_python_post_3_7:
from dataclasses import dataclass
@dataclass(frozen=True)
class Morpheme:
surface: str
lemma: str
tag: str
else:
from typing import NamedTuple
class Morpheme(NamedTuple):
surface: str
lemma: str
tag: str
def try_mecab_import():
try:
from natto import MeCab
return MeCab
except ImportError:
raise ImportError(
"Korean support requires [mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md), "
"[mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic), "
"and [natto-py](https://github.com/buruzaemon/natto-py)"
)
# fmt: on
def check_spaces(text, tokens):
token_pattern = re.compile(r"\s?".join("({})".format(t) for t in tokens))
m = token_pattern.match(text)
if m is not None:
for i in range(1, m.lastindex):
yield m.end(i) < m.start(i + 1)
yield False
class KoreanTokenizer(DummyTokenizer):
def __init__(self, cls, nlp=None):
self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
self.Tokenizer = try_mecab_import()
def __call__(self, text):
dtokens = list(self.detailed_tokens(text))
surfaces = [dt.surface for dt in dtokens]
doc = Doc(self.vocab, words=surfaces, spaces=list(check_spaces(text, surfaces)))
for token, dtoken in zip(doc, dtokens):
first_tag, sep, eomi_tags = dtoken.tag.partition("+")
token.tag_ = first_tag # stem(어간) or pre-final(선어말 어미)
token.lemma_ = dtoken.lemma
doc.user_data["full_tags"] = [dt.tag for dt in dtokens]
return doc
def detailed_tokens(self, text):
# MeCab feature fields: POS tag[0], semantic class[1], jongseong (final consonant) flag[2], reading[3],
# type[4], first POS of compound[5], last POS of compound[6], expression[7], *
with self.Tokenizer("-F%f[0],%f[7]") as tokenizer:
for node in tokenizer.parse(text, as_nodes=True):
if node.is_eos():
break
surface = node.surface
feature = node.feature
tag, _, expr = feature.partition(",")
lemma, _, remainder = expr.partition("/")
if lemma == "*":
lemma = surface
yield Morpheme(surface, lemma, tag)
class KoreanDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda _text: "ko"
stop_words = STOP_WORDS
tag_map = TAG_MAP
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
@classmethod
def create_tokenizer(cls, nlp=None):
return KoreanTokenizer(cls, nlp)
class Korean(Language):
lang = "ko"
Defaults = KoreanDefaults
def make_doc(self, text):
return self.tokenizer(text)
def pickle_korean(instance):
return Korean, tuple()
copy_reg.pickle(Korean, pickle_korean)
__all__ = ["Korean"]
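A brief usage sketch for the new Korean class, based on the code above; it assumes mecab-ko, mecab-ko-dic and natto-py are installed (otherwise try_mecab_import() raises the ImportError shown), and the printed values are only indicative.
from spacy.lang.ko import Korean

nlp = Korean()
doc = nlp("서울 타워 근처에 살고 있습니다.")
for token in doc:
    # tag_ holds the first MeCab tag, lemma_ the dictionary form (or the surface)
    print(token.text, token.tag_, token.lemma_)
# The unsplit MeCab tags (e.g. "EP+EF") are kept in doc.user_data["full_tags"]
print(doc.user_data["full_tags"])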

15
spacy/lang/ko/examples.py Normal file
View File

@ -0,0 +1,15 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.ko.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"애플이 영국의 신생 기업을 10억 달러에 구매를 고려중이다.",
"자동 운전 자동차의 손해 배상 책임에 자동차 메이커에 일정한 부담을 요구하겠다.",
"자동 배달 로봇이 보도를 주행하는 것을 샌프란시스코시가 금지를 검토중이라고 합니다.",
"런던은 영국의 수도이자 가장 큰 도시입니다."
]

View File

@ -0,0 +1,68 @@
# coding: utf8
from __future__ import unicode_literals
STOP_WORDS = set("""
아니
그렇
위하
때문
그것
말하
그러나
못하
그런
그리고
시키
그러
하나
어떤
다른
어떻
이렇
""".split())

59
spacy/lang/ko/tag_map.py Normal file
View File

@ -0,0 +1,59 @@
# encoding: utf8
from __future__ import unicode_literals
from ...symbols import POS, PUNCT, INTJ, X, SYM, ADJ, AUX, ADP, CONJ, NOUN, PRON
from ...symbols import VERB, ADV, PROPN, NUM, DET
# Map the mecab-ko-dic (은전한닢) POS tags to Universal POS tags
# https://docs.google.com/spreadsheets/d/1-9blXKjtjeKZqsf4NzHeYJCrr49-nXeRF6D80udfcwY/edit#gid=589544265
# https://universaldependencies.org/u/pos/
TAG_MAP = {
# J.{1,2} particles (조사)
"JKS": {POS: ADP},
"JKC": {POS: ADP},
"JKG": {POS: ADP},
"JKO": {POS: ADP},
"JKB": {POS: ADP},
"JKV": {POS: ADP},
"JKQ": {POS: ADP},
"JX": {POS: ADP}, # 보조사
"JC": {POS: CONJ}, # 접속 조사
"MAJ": {POS: CONJ}, # 접속 부사
"MAG": {POS: ADV}, # 일반 부사
"MM": {POS: DET}, # 관형사
"XPN": {POS: X}, # 접두사
# XS. 접미사
"XSN": {POS: X},
"XSV": {POS: X},
"XSA": {POS: X},
"XR": {POS: X}, # 어근
# E.{1,2} 어미
"EP": {POS: X},
"EF": {POS: X},
"EC": {POS: X},
"ETN": {POS: X},
"ETM": {POS: X},
"IC": {POS: INTJ}, # 감탄사
"VV": {POS: VERB}, # 동사
"VA": {POS: ADJ}, # 형용사
"VX": {POS: AUX}, # 보조 용언
"VCP": {POS: ADP}, # 긍정 지정사(이다)
"VCN": {POS: ADJ}, # 부정 지정사(아니다)
"NNG": {POS: NOUN}, # 일반 명사(general noun)
"NNB": {POS: NOUN}, # 의존 명사
"NNBC": {POS: NOUN}, # 의존 명사(단위: unit)
"NNP": {POS: PROPN}, # 고유 명사(proper noun)
"NP": {POS: PRON}, # 대명사
"NR": {POS: NUM}, # 수사(numerals)
"SN": {POS: NUM}, # 숫자
# S.{1,2} 부호
# 문장 부호
"SF": {POS: PUNCT}, # period or other EOS marker
"SE": {POS: PUNCT},
"SC": {POS: PUNCT}, # comma, etc.
"SSO": {POS: PUNCT}, # open bracket
"SSC": {POS: PUNCT}, # close bracket
"SY": {POS: SYM}, # 기타 기호
"SL": {POS: X}, # 외국어
"SH": {POS: X}, # 한자
}

View File

@ -1,15 +1,37 @@
# coding: utf8
from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .tag_map import TAG_MAP
from .lemmatizer import LOOKUP
from .morph_rules import MORPH_RULES
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups
def _return_lt(_):
return "lt"
class LithuanianDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: "lt"
lex_attr_getters[LANG] = _return_lt
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
)
lex_attr_getters.update(LEX_ATTRS)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
tag_map = TAG_MAP
morph_rules = MORPH_RULES
lemma_lookup = LOOKUP
class Lithuanian(Language):

22
spacy/lang/lt/examples.py Normal file
View File

@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.lt.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Jaunikis pirmąją vestuvinę naktį iškeitė į areštinės gultą",
"Bepiločiai automobiliai išnaikins vairavimo mokyklas, autoservisus ir eismo nelaimes",
"Vilniuje galvojama uždrausti naudoti skėčius",
"Londonas yra didelis miestas Jungtinėje Karalystėje",
"Kur tu?",
"Kas yra Prancūzijos prezidentas?",
"Kokia yra Jungtinių Amerikos Valstijų sostinė?",
"Kada gimė Dalia Grybauskaitė?",
]

234227
spacy/lang/lt/lemmatizer.py Normal file

File diff suppressed because it is too large Load Diff

1153
spacy/lang/lt/lex_attrs.py Normal file

File diff suppressed because it is too large Load Diff

3075
spacy/lang/lt/morph_rules.py Normal file

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

4798
spacy/lang/lt/tag_map.py Normal file

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,268 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import ORTH
_exc = {}
for orth in [
"G.",
"J. E.",
"J. Em.",
"J.E.",
"J.Em.",
"K.",
"N.",
"V.",
"Vt.",
"a.",
"a.k.",
"a.s.",
"adv.",
"akad.",
"aklg.",
"akt.",
"al.",
"ang.",
"angl.",
"aps.",
"apskr.",
"apyg.",
"arbat.",
"asist.",
"asm.",
"asm.k.",
"asmv.",
"atk.",
"atsak.",
"atsisk.",
"atsisk.sąsk.",
"atv.",
"aut.",
"avd.",
"b.k.",
"baud.",
"biol.",
"bkl.",
"bot.",
"bt.",
"buv.",
"ch.",
"chem.",
"corp.",
"d.",
"dab.",
"dail.",
"dek.",
"deš.",
"dir.",
"dirig.",
"doc.",
"dol.",
"dr.",
"drp.",
"dvit.",
"dėst.",
"dš.",
"dž.",
"e.b.",
"e.bankas",
"e.p.",
"e.parašas",
"e.paštas",
"e.v.",
"e.valdžia",
"egz.",
"eil.",
"ekon.",
"el.",
"el.bankas",
"el.p.",
"el.parašas",
"el.paštas",
"el.valdžia",
"etc.",
"ež.",
"fak.",
"faks.",
"feat.",
"filol.",
"filos.",
"g.",
"gen.",
"geol.",
"gerb.",
"gim.",
"gr.",
"gv.",
"gyd.",
"gyv.",
"habil.",
"inc.",
"insp.",
"inž.",
"ir pan.",
"ir t. t.",
"isp.",
"istor.",
"it.",
"just.",
"k.",
"k. a.",
"k.a.",
"kab.",
"kand.",
"kart.",
"kat.",
"ketv.",
"kh.",
"kl.",
"kln.",
"km.",
"kn.",
"koresp.",
"kpt.",
"kr.",
"kt.",
"kub.",
"kun.",
"kv.",
"kyš.",
"l. e. p.",
"l.e.p.",
"lenk.",
"liet.",
"lot.",
"lt.",
"ltd.",
"ltn.",
"m.",
"m.e..",
"m.m.",
"mat.",
"med.",
"mgnt.",
"mgr.",
"min.",
"mjr.",
"ml.",
"mln.",
"mlrd.",
"mob.",
"mok.",
"moksl.",
"mokyt.",
"mot.",
"mr.",
"mst.",
"mstl.",
"mėn.",
"nkt.",
"no.",
"nr.",
"ntk.",
"nuotr.",
"op.",
"org.",
"orig.",
"p.",
"p.d.",
"p.m.e.",
"p.s.",
"pab.",
"pan.",
"past.",
"pav.",
"pavad.",
"per.",
"perd.",
"pirm.",
"pl.",
"plg.",
"plk.",
"pr.",
"pr.Kr.",
"pranc.",
"proc.",
"prof.",
"prom.",
"prot.",
"psl.",
"pss.",
"pvz.",
"pšt.",
"r.",
"raj.",
"red.",
"rez.",
"rež.",
"rus.",
"rš.",
"s.",
"sav.",
"saviv.",
"sek.",
"sekr.",
"sen.",
"sh.",
"sk.",
"skg.",
"skv.",
"skyr.",
"sp.",
"spec.",
"sr.",
"st.",
"str.",
"stud.",
"sąs.",
"t.",
"t. p.",
"t. y.",
"t.p.",
"t.t.",
"t.y.",
"techn.",
"tel.",
"teol.",
"th.",
"tir.",
"trit.",
"trln.",
"tšk.",
"tūks.",
"tūkst.",
"up.",
"upl.",
"v.s.",
"vad.",
"val.",
"valg.",
"ved.",
"vert.",
"vet.",
"vid.",
"virš.",
"vlsč.",
"vnt.",
"vok.",
"vs.",
"vtv.",
"vv.",
"vyr.",
"vyresn.",
"zool.",
"Įn",
"įl.",
"š.m.",
"šnek.",
"šv.",
"švč.",
"ž.ū.",
"žin.",
"žml.",
"žr.",
]:
_exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = _exc
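As a quick sanity check of what these exceptions buy (mirroring the abbreviation tests later in this diff), the listed abbreviations should stay single tokens instead of being split at the trailing period:
from spacy.lang.lt import Lithuanian

nlp = Lithuanian()
for abbr in ("km.", "pvz.", "biol."):
    assert len(nlp(abbr)) == 1  # kept as one token via TOKENIZER_EXCEPTIONS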

View File

@ -22,6 +22,7 @@ NOUN_RULES = [
VERB_RULES = [
["er", "e"], # vasker -> vaske
["et", "e"], # vasket -> vaske
["a", "e"], # vaska -> vaske
["es", "e"], # vaskes -> vaske
["te", "e"], # stekte -> steke
["år", "å"], # får -> få

View File

@ -10,7 +10,15 @@ _exc = {}
for exc_data in [
{ORTH: "jan.", LEMMA: "januar"},
{ORTH: "feb.", LEMMA: "februar"},
{ORTH: "mar.", LEMMA: "mars"},
{ORTH: "apr.", LEMMA: "april"},
{ORTH: "jun.", LEMMA: "juni"},
{ORTH: "jul.", LEMMA: "juli"},
{ORTH: "aug.", LEMMA: "august"},
{ORTH: "sep.", LEMMA: "september"},
{ORTH: "okt.", LEMMA: "oktober"},
{ORTH: "nov.", LEMMA: "november"},
{ORTH: "des.", LEMMA: "desember"},
]:
_exc[exc_data[ORTH]] = [exc_data]
@ -18,11 +26,13 @@ for exc_data in [
for orth in [
"adm.dir.",
"a.m.",
"andelsnr",
"Aq.",
"b.c.",
"bl.a.",
"bla.",
"bm.",
"bnr.",
"bto.",
"ca.",
"cand.mag.",
@ -41,6 +51,7 @@ for orth in [
"el.",
"e.l.",
"et.",
"etc.",
"etg.",
"ev.",
"evt.",
@ -76,6 +87,7 @@ for orth in [
"kgl.res.",
"kl.",
"komm.",
"kr.",
"kst.",
"lø.",
"ma.",
@ -106,6 +118,7 @@ for orth in [
"o.l.",
"on.",
"op.",
"org."
"osv.",
"ovf.",
"p.",
@ -130,6 +143,7 @@ for orth in [
"sep.",
"siviling.",
"sms.",
"snr.",
"spm.",
"sr.",
"sst.",

18
spacy/lang/sq/examples.py Normal file
View File

@ -0,0 +1,18 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.sq.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Apple po shqyrton blerjen e nje shoqërie të U.K. për 1 miliard dollarë",
"Makinat autonome ndryshojnë përgjegjësinë e sigurimit ndaj prodhuesve",
"San Francisko konsideron ndalimin e robotëve të shpërndarjes",
"Londra është një qytet i madh në Mbretërinë e Bashkuar.",
]

View File

@ -262,13 +262,13 @@ cdef find_matches(TokenPatternC** patterns, int n, Doc doc, extensions=None,
cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
# There have been a few bugs here.
# The code was originally designed to always have pattern[1].attrs.value
# be the ent_id when we get to the end of a pattern. However, Issue #2671
# showed this wasn't the case when we had a reject-and-continue before a
# match. I still don't really understand what's going on here, but this
# workaround does resolve the issue.
while pattern.attrs.attr != ID and \
(pattern.nr_attr > 0 or pattern.nr_extra_attr > 0 or pattern.nr_py > 0):
# match.
# The patch to #2671 was wrong though, which came up in #3839.
while pattern.attrs.attr != ID:
pattern += 1
return pattern.attrs.value

View File

@ -1,15 +1,17 @@
# coding: utf8
from __future__ import unicode_literals
from collections import defaultdict
from collections import defaultdict, OrderedDict
import srsly
from ..errors import Errors
from ..compat import basestring_
from ..util import ensure_path
from ..util import ensure_path, to_disk, from_disk
from ..tokens import Span
from ..matcher import Matcher, PhraseMatcher
DEFAULT_ENT_ID_SEP = "||"
class EntityRuler(object):
"""The EntityRuler lets you add spans to the `Doc.ents` using token-based
@ -24,7 +26,7 @@ class EntityRuler(object):
name = "entity_ruler"
def __init__(self, nlp, **cfg):
def __init__(self, nlp, phrase_matcher_attr=None, **cfg):
"""Initialize the entitiy ruler. If patterns are supplied here, they
need to be a list of dictionaries with a `"label"` and `"pattern"`
key. A pattern can either be a token pattern (list) or a phrase pattern
@ -32,6 +34,8 @@ class EntityRuler(object):
nlp (Language): The shared nlp object to pass the vocab to the matchers
and process phrase patterns.
phrase_matcher_attr (int / unicode): Token attribute to match on, passed
to the internal PhraseMatcher as `attr`
patterns (iterable): Optional patterns to load in.
overwrite_ents (bool): If existing entities are present, e.g. entities
added by the model, overwrite them by matches if necessary.
@ -47,8 +51,15 @@ class EntityRuler(object):
self.token_patterns = defaultdict(list)
self.phrase_patterns = defaultdict(list)
self.matcher = Matcher(nlp.vocab)
self.phrase_matcher = PhraseMatcher(nlp.vocab)
self.ent_id_sep = cfg.get("ent_id_sep", "||")
if phrase_matcher_attr is not None:
self.phrase_matcher_attr = phrase_matcher_attr
self.phrase_matcher = PhraseMatcher(
nlp.vocab, attr=self.phrase_matcher_attr
)
else:
self.phrase_matcher_attr = None
self.phrase_matcher = PhraseMatcher(nlp.vocab)
self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
patterns = cfg.get("patterns")
if patterns is not None:
self.add_patterns(patterns)
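A short sketch of the new argument in use: with phrase_matcher_attr="LOWER" the phrase patterns are matched on the lowercase form, so they become case-insensitive. Names and strings here are illustrative only.
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER")
ruler.add_patterns([{"label": "ORG", "pattern": "apple"}])  # phrase pattern
nlp.add_pipe(ruler)
doc = nlp("Apple makes phones.")
print([(ent.text, ent.label_) for ent in doc.ents])  # [('Apple', 'ORG')]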
@ -212,8 +223,18 @@ class EntityRuler(object):
DOCS: https://spacy.io/api/entityruler#from_bytes
"""
patterns = srsly.msgpack_loads(patterns_bytes)
self.add_patterns(patterns)
cfg = srsly.msgpack_loads(patterns_bytes)
if isinstance(cfg, dict):
self.add_patterns(cfg.get("patterns", cfg))
self.overwrite = cfg.get("overwrite", False)
self.phrase_matcher_attr = cfg.get("phrase_matcher_attr", None)
if self.phrase_matcher_attr is not None:
self.phrase_matcher = PhraseMatcher(
self.nlp.vocab, attr=self.phrase_matcher_attr
)
self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
else:
self.add_patterns(cfg)
return self
def to_bytes(self, **kwargs):
@ -223,7 +244,16 @@ class EntityRuler(object):
DOCS: https://spacy.io/api/entityruler#to_bytes
"""
return srsly.msgpack_dumps(self.patterns)
serial = OrderedDict(
(
("overwrite", self.overwrite),
("ent_id_sep", self.ent_id_sep),
("phrase_matcher_attr", self.phrase_matcher_attr),
("patterns", self.patterns),
)
)
return srsly.msgpack_dumps(serial)
def from_disk(self, path, **kwargs):
"""Load the entity ruler from a file. Expects a file containing
@ -236,21 +266,52 @@ class EntityRuler(object):
DOCS: https://spacy.io/api/entityruler#from_disk
"""
path = ensure_path(path)
path = path.with_suffix(".jsonl")
patterns = srsly.read_jsonl(path)
self.add_patterns(patterns)
depr_patterns_path = path.with_suffix(".jsonl")
if depr_patterns_path.is_file():
patterns = srsly.read_jsonl(depr_patterns_path)
self.add_patterns(patterns)
else:
cfg = {}
deserializers = {
"patterns": lambda p: self.add_patterns(
srsly.read_jsonl(p.with_suffix(".jsonl"))
),
"cfg": lambda p: cfg.update(srsly.read_json(p)),
}
from_disk(path, deserializers, {})
self.overwrite = cfg.get("overwrite", False)
self.phrase_matcher_attr = cfg.get("phrase_matcher_attr")
self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
if self.phrase_matcher_attr is not None:
self.phrase_matcher = PhraseMatcher(
self.nlp.vocab, attr=self.phrase_matcher_attr
)
return self
def to_disk(self, path, **kwargs):
"""Save the entity ruler patterns to a directory. The patterns will be
saved as newline-delimited JSON (JSONL).
path (unicode / Path): The JSONL file to load.
path (unicode / Path): The JSONL file to save.
**kwargs: Other config parameters, mostly for consistency.
DOCS: https://spacy.io/api/entityruler#to_disk
"""
path = ensure_path(path)
path = path.with_suffix(".jsonl")
srsly.write_jsonl(path, self.patterns)
cfg = {
"overwrite": self.overwrite,
"phrase_matcher_attr": self.phrase_matcher_attr,
"ent_id_sep": self.ent_id_sep,
}
serializers = {
"patterns": lambda p: srsly.write_jsonl(
p.with_suffix(".jsonl"), self.patterns
),
"cfg": lambda p: srsly.write_json(p, cfg),
}
if path.suffix == ".jsonl": # user wants to save only JSONL
srsly.write_jsonl(path, self.patterns)
else:
to_disk(path, serializers, {})
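In short, to_disk() now has two modes: a path ending in .jsonl keeps the old patterns-only format, while any other path is written as a directory holding the patterns plus a cfg file that from_disk() uses to restore the settings. A small sketch, with hypothetical file names:
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER")
ruler.add_patterns([{"label": "ORG", "pattern": "apple"}])

ruler.to_disk("my_patterns.jsonl")  # backwards-compatible, patterns only
ruler.to_disk("my_entity_ruler")    # directory with patterns.jsonl + cfg
new_ruler = EntityRuler(nlp).from_disk("my_entity_ruler")
assert new_ruler.phrase_matcher_attr == "LOWER"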

View File

@ -3,16 +3,18 @@
# coding: utf8
from __future__ import unicode_literals
cimport numpy as np
import numpy
import srsly
import random
from collections import OrderedDict
from thinc.api import chain
from thinc.v2v import Affine, Maxout, Softmax
from thinc.misc import LayerNorm
from thinc.neural.util import to_categorical, copy_array
from thinc.neural.util import to_categorical
from thinc.neural.util import get_array_module
from spacy.kb import KnowledgeBase
from ..cli.pretrain import get_cossim_loss
from .functions import merge_subtokens
from ..tokens.doc cimport Doc
from ..syntax.nn_parser cimport Parser
@ -24,9 +26,9 @@ from ..vocab cimport Vocab
from ..syntax import nonproj
from ..attrs import POS, ID
from ..parts_of_speech import X
from .._ml import Tok2Vec, build_tagger_model
from .._ml import Tok2Vec, build_tagger_model, cosine
from .._ml import build_text_classifier, build_simple_cnn_text_classifier
from .._ml import build_bow_text_classifier
from .._ml import build_bow_text_classifier, build_nel_encoder
from .._ml import link_vectors_to_models, zero_init, flatten
from .._ml import masked_language_model, create_default_optimizer
from ..errors import Errors, TempErrors
@ -229,7 +231,7 @@ class Tensorizer(Pipe):
vocab (Vocab): A `Vocab` instance. The model must share the same
`Vocab` instance with the `Doc` objects it will process.
model (Model): A `Model` instance or `True` allocate one later.
model (Model): A `Model` instance or `True` to allocate one later.
**cfg: Config parameters.
EXAMPLE:
@ -294,7 +296,7 @@ class Tensorizer(Pipe):
docs (iterable): A batch of `Doc` objects.
golds (iterable): A batch of `GoldParse` objects.
drop (float): The droput rate.
drop (float): The dropout rate.
sgd (callable): An optimizer.
RETURNS (dict): Results from the update.
"""
@ -386,7 +388,7 @@ class Tagger(Pipe):
def predict(self, docs):
self.require_model()
if not any(len(doc) for doc in docs):
# Handle case where there are no tokens in any docs.
# Handle cases where there are no tokens in any docs.
n_labels = len(self.labels)
guesses = [self.model.ops.allocate((0, n_labels)) for doc in docs]
tokvecs = self.model.ops.allocate((0, self.model.tok2vec.nO))
@ -900,6 +902,11 @@ class TextCategorizer(Pipe):
def labels(self):
return tuple(self.cfg.setdefault("labels", []))
def require_labels(self):
"""Raise an error if the component's model has no labels defined."""
if not self.labels:
raise ValueError(Errors.E143.format(name=self.name))
@labels.setter
def labels(self, value):
self.cfg["labels"] = tuple(value)
@ -929,6 +936,7 @@ class TextCategorizer(Pipe):
doc.cats[label] = float(scores[i, j])
def update(self, docs, golds, state=None, drop=0., sgd=None, losses=None):
self.require_model()
scores, bp_scores = self.model.begin_update(docs, drop=drop)
loss, d_scores = self.get_loss(docs, golds, scores)
bp_scores(d_scores, sgd=sgd)
@ -983,6 +991,7 @@ class TextCategorizer(Pipe):
def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs):
if self.model is True:
self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
self.require_labels()
self.model = self.Model(len(self.labels), **self.cfg)
link_vectors_to_models(self.vocab)
if sgd is None:
@ -1001,7 +1010,7 @@ cdef class DependencyParser(Parser):
@property
def postprocesses(self):
return [nonproj.deprojectivize, merge_subtokens]
return [nonproj.deprojectivize]
def add_multitask_objective(self, target):
if target == "cloze":
@ -1063,52 +1072,252 @@ cdef class EntityRecognizer(Parser):
class EntityLinker(Pipe):
"""Pipeline component for named entity linking.
DOCS: TODO
"""
name = 'entity_linker'
@classmethod
def Model(cls, nr_class=1, **cfg):
# TODO: non-dummy EL implementation
return None
def Model(cls, **cfg):
embed_width = cfg.get("embed_width", 300)
hidden_width = cfg.get("hidden_width", 128)
type_to_int = cfg.get("type_to_int", dict())
def __init__(self, model=True, **cfg):
self.model = False
model = build_nel_encoder(embed_width=embed_width, hidden_width=hidden_width, ner_types=len(type_to_int), **cfg)
return model
def __init__(self, vocab, **cfg):
self.vocab = vocab
self.model = True
self.kb = None
self.cfg = dict(cfg)
self.kb = self.cfg["kb"]
self.sgd_context = None
def set_kb(self, kb):
self.kb = kb
def require_model(self):
# Raise an error if the component's model is not initialized.
if getattr(self, "model", None) in (None, True, False):
raise ValueError(Errors.E109.format(name=self.name))
def require_kb(self):
# Raise an error if the knowledge base is not initialized.
if getattr(self, "kb", None) in (None, True, False):
raise ValueError(Errors.E139.format(name=self.name))
def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs):
self.require_kb()
self.cfg["entity_width"] = self.kb.entity_vector_length
if self.model is True:
self.model = self.Model(**self.cfg)
self.sgd_context = self.create_optimizer()
if sgd is None:
sgd = self.create_optimizer()
return sgd
def update(self, docs, golds, state=None, drop=0.0, sgd=None, losses=None):
self.require_model()
self.require_kb()
if losses is not None:
losses.setdefault(self.name, 0.0)
if not docs or not golds:
return 0
if len(docs) != len(golds):
raise ValueError(Errors.E077.format(value="EL training", n_docs=len(docs),
n_golds=len(golds)))
if isinstance(docs, Doc):
docs = [docs]
golds = [golds]
context_docs = []
entity_encodings = []
cats = []
priors = []
type_vectors = []
type_to_int = self.cfg.get("type_to_int", dict())
for doc, gold in zip(docs, golds):
ents_by_offset = dict()
for ent in doc.ents:
ents_by_offset[str(ent.start_char) + "_" + str(ent.end_char)] = ent
for entity in gold.links:
start, end, gold_kb = entity
mention = doc.text[start:end]
gold_ent = ents_by_offset[str(start) + "_" + str(end)]
assert gold_ent is not None
type_vector = [0 for i in range(len(type_to_int))]
if len(type_to_int) > 0:
type_vector[type_to_int[gold_ent.label_]] = 1
candidates = self.kb.get_candidates(mention)
random.shuffle(candidates)
nr_neg = 0
for c in candidates:
kb_id = c.entity_
entity_encoding = c.entity_vector
entity_encodings.append(entity_encoding)
context_docs.append(doc)
type_vectors.append(type_vector)
if self.cfg.get("prior_weight", 1) > 0:
priors.append([c.prior_prob])
else:
priors.append([0])
if kb_id == gold_kb:
cats.append([1])
else:
nr_neg += 1
cats.append([0])
if len(entity_encodings) > 0:
assert len(priors) == len(entity_encodings) == len(context_docs) == len(cats) == len(type_vectors)
context_encodings, bp_context = self.model.tok2vec.begin_update(context_docs, drop=drop)
entity_encodings = self.model.ops.asarray(entity_encodings, dtype="float32")
mention_encodings = [list(context_encodings[i]) + list(entity_encodings[i]) + priors[i] + type_vectors[i]
for i in range(len(entity_encodings))]
pred, bp_mention = self.model.begin_update(self.model.ops.asarray(mention_encodings, dtype="float32"), drop=drop)
cats = self.model.ops.asarray(cats, dtype="float32")
loss, d_scores = self.get_loss(prediction=pred, golds=cats, docs=None)
mention_gradient = bp_mention(d_scores, sgd=sgd)
context_gradients = [list(x[0:self.cfg.get("context_width")]) for x in mention_gradient]
bp_context(self.model.ops.asarray(context_gradients, dtype="float32"), sgd=self.sgd_context)
if losses is not None:
losses[self.name] += loss
return loss
return 0
def get_loss(self, docs, golds, prediction):
d_scores = (prediction - golds)
loss = (d_scores ** 2).sum()
loss = loss / len(golds)
return loss, d_scores
def get_loss_old(self, docs, golds, scores):
# this loss function assumes we're only using positive examples
loss, gradients = get_cossim_loss(yh=scores, y=golds)
loss = loss / len(golds)
return loss, gradients
def __call__(self, doc):
self.set_annotations([doc], scores=None, tensors=None)
entities, kb_ids = self.predict([doc])
self.set_annotations([doc], entities, kb_ids)
return doc
def pipe(self, stream, batch_size=128, n_threads=-1):
"""Apply the pipe to a stream of documents.
Both __call__ and pipe should delegate to the `predict()`
and `set_annotations()` methods.
"""
for docs in util.minibatch(stream, size=batch_size):
docs = list(docs)
self.set_annotations(docs, scores=None, tensors=None)
entities, kb_ids = self.predict(docs)
self.set_annotations(docs, entities, kb_ids)
yield from docs
def set_annotations(self, docs, scores, tensors=None):
"""
Currently implemented as taking the KB entry with highest prior probability for each named entity
TODO: actually use context etc
"""
for i, doc in enumerate(docs):
for ent in doc.ents:
candidates = self.kb.get_candidates(ent.text)
if candidates:
best_candidate = max(candidates, key=lambda c: c.prior_prob)
for token in ent:
token.ent_kb_id_ = best_candidate.entity_
def predict(self, docs):
self.require_model()
self.require_kb()
def get_loss(self, docs, golds, scores):
# TODO
pass
final_entities = []
final_kb_ids = []
if not docs:
return final_entities, final_kb_ids
if isinstance(docs, Doc):
docs = [docs]
context_encodings = self.model.tok2vec(docs)
xp = get_array_module(context_encodings)
type_to_int = self.cfg.get("type_to_int", dict())
for i, doc in enumerate(docs):
if len(doc) > 0:
context_encoding = context_encodings[i]
for ent in doc.ents:
type_vector = [0 for i in range(len(type_to_int))]
if len(type_to_int) > 0:
type_vector[type_to_int[ent.label_]] = 1
candidates = self.kb.get_candidates(ent.text)
if candidates:
random.shuffle(candidates)
# this will set the prior probabilities to 0 (just like in training) if their weight is 0
prior_probs = xp.asarray([[c.prior_prob] for c in candidates])
prior_probs *= self.cfg.get("prior_weight", 1)
scores = prior_probs
if self.cfg.get("context_weight", 1) > 0:
entity_encodings = xp.asarray([c.entity_vector for c in candidates])
assert len(entity_encodings) == len(prior_probs)
mention_encodings = [list(context_encoding) + list(entity_encodings[i])
+ list(prior_probs[i]) + type_vector
for i in range(len(entity_encodings))]
scores = self.model(self.model.ops.asarray(mention_encodings, dtype="float32"))
# TODO: thresholding
best_index = scores.argmax()
best_candidate = candidates[best_index]
final_entities.append(ent)
final_kb_ids.append(best_candidate.entity_)
return final_entities, final_kb_ids
def set_annotations(self, docs, entities, kb_ids=None):
for entity, kb_id in zip(entities, kb_ids):
for token in entity:
token.ent_kb_id_ = kb_id
def to_disk(self, path, exclude=tuple(), **kwargs):
serialize = OrderedDict()
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
serialize["vocab"] = lambda p: self.vocab.to_disk(p)
serialize["kb"] = lambda p: self.kb.dump(p)
if self.model not in (None, True, False):
serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes())
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
util.to_disk(path, serialize, exclude)
def from_disk(self, path, exclude=tuple(), **kwargs):
def load_model(p):
if self.model is True:
self.model = self.Model(**self.cfg)
self.model.from_bytes(p.open("rb").read())
def load_kb(p):
kb = KnowledgeBase(vocab=self.vocab, entity_vector_length=self.cfg["entity_width"])
kb.load_bulk(p)
self.set_kb(kb)
deserialize = OrderedDict()
deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p))
deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
deserialize["kb"] = load_kb
deserialize["model"] = load_model
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_disk(path, deserialize, exclude)
return self
def rehearse(self, docs, sgd=None, losses=None, **config):
raise NotImplementedError
def add_label(self, label):
# TODO
pass
raise NotImplementedError
class Sentencizer(object):
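To summarise the new component's contract: it refuses to train or predict without a KnowledgeBase (require_kb), builds its model lazily in begin_training(), and links entities that an upstream NER component has already set. A minimal wiring sketch, mirroring test_preserving_links_asdoc further down in this diff; the sizes and the entity ID are illustrative, and with an untrained model the prediction simply settles on the only KB candidate.
from spacy.lang.en import English
from spacy.kb import KnowledgeBase
from spacy.pipeline import EntityRuler

nlp = English()
nlp.add_pipe(nlp.create_pipe("sentencizer"))
ruler = EntityRuler(nlp, patterns=[{"label": "GPE", "pattern": "Boston"}])
nlp.add_pipe(ruler)

kb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
kb.add_entity(entity="Q100", prob=0.9, entity_vector=[1])
kb.add_alias(alias="Boston", entities=["Q100"], probabilities=[0.7])

el_pipe = nlp.create_pipe("entity_linker", config={"context_width": 64})
el_pipe.set_kb(kb)
el_pipe.begin_training()
nlp.add_pipe(el_pipe, last=True)

doc = nlp("She lives in Boston.")
print([(ent.text, ent.kb_id_) for ent in doc.ents])  # e.g. [('Boston', 'Q100')]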

View File

@ -52,6 +52,7 @@ class Scorer(object):
self.labelled = PRFScore()
self.tags = PRFScore()
self.ner = PRFScore()
self.ner_per_ents = dict()
self.eval_punct = eval_punct
@property
@ -91,6 +92,15 @@ class Scorer(object):
"""RETURNS (float): Named entity accuracy (F-score)."""
return self.ner.fscore * 100
@property
def ents_per_type(self):
"""RETURNS (dict): Scores per entity label.
"""
return {
k: {"p": v.precision * 100, "r": v.recall * 100, "f": v.fscore * 100}
for k, v in self.ner_per_ents.items()
}
@property
def scores(self):
"""RETURNS (dict): All scores with keys `uas`, `las`, `ents_p`,
@ -102,6 +112,7 @@ class Scorer(object):
"ents_p": self.ents_p,
"ents_r": self.ents_r,
"ents_f": self.ents_f,
"ents_per_type": self.ents_per_type,
"tags_acc": self.tags_acc,
"token_acc": self.token_acc,
}
@ -149,13 +160,31 @@ class Scorer(object):
cand_deps.add((gold_i, gold_head, token.dep_.lower()))
if "-" not in [token[-1] for token in gold.orig_annot]:
cand_ents = set()
current_ent = {k.label_: set() for k in doc.ents}
current_gold = {k.label_: set() for k in doc.ents}
for ent in doc.ents:
if ent.label_ not in self.ner_per_ents:
self.ner_per_ents[ent.label_] = PRFScore()
first = gold.cand_to_gold[ent.start]
last = gold.cand_to_gold[ent.end - 1]
if first is None or last is None:
self.ner.fp += 1
self.ner_per_ents[ent.label_].fp += 1
else:
cand_ents.add((ent.label_, first, last))
current_ent[ent.label_].add(
tuple(x for x in cand_ents if x[0] == ent.label_)
)
current_gold[ent.label_].add(
tuple(x for x in gold_ents if x[0] == ent.label_)
)
# Scores per ent
[
v.score_set(current_ent[k], current_gold[k])
for k, v in self.ner_per_ents.items()
if k in current_ent
]
# Score for all ents
self.ner.score_set(cand_ents, gold_ents)
self.tags.score_set(cand_tags, gold_tags)
self.labelled.score_set(cand_deps, gold_deps)
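The new ents_per_type property exposes precision, recall and F-score per entity label alongside the overall NER scores; a minimal sketch of the output shape (label names and numbers are hypothetical):
from spacy.scorer import Scorer

scorer = Scorer()
# ... call scorer.score(doc, gold) over an evaluation set ...
print(scorer.scores["ents_f"])         # overall NER F-score
print(scorer.scores["ents_per_type"])  # e.g. {"PERSON": {"p": 95.0, "r": 90.0, "f": 92.4}}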

View File

@ -3,6 +3,10 @@ from libc.stdint cimport uint8_t, uint32_t, int32_t, uint64_t
from .typedefs cimport flags_t, attr_t, hash_t
from .parts_of_speech cimport univ_pos_t
from libcpp.vector cimport vector
from libc.stdint cimport int32_t, int64_t
cdef struct LexemeC:
flags_t flags
@ -72,3 +76,32 @@ cdef struct TokenC:
attr_t ent_type # TODO: Is there a better way to do this? Multiple sources of truth..
attr_t ent_kb_id
hash_t ent_id
# Internal struct, for storage and disambiguation of entities.
cdef struct KBEntryC:
# The hash of this entry's unique ID/name in the kB
hash_t entity_hash
# Allows retrieval of the entity vector, as an index into a vectors table of the KB.
# Can be expanded later to refer to multiple rows (compositional model to reduce storage footprint).
int32_t vector_index
# Allows retrieval of a struct of non-vector features.
# This is currently not implemented and set to -1 for the common case where there are no features.
int32_t feats_row
# log probability of entity, based on corpus frequency
float prob
# Each alias struct stores a list of Entry pointers with their prior probabilities
# for this specific mention/alias.
cdef struct AliasC:
# All entry candidates for this alias
vector[int64_t] entry_indices
# Prior probability P(entity|alias) - should sum up to (at most) 1.
vector[float] probs

View File

@ -81,6 +81,7 @@ cdef enum symbol_t:
DEP
ENT_IOB
ENT_TYPE
ENT_KB_ID
HEAD
SENT_START
SPACY

View File

@ -86,6 +86,7 @@ IDS = {
"DEP": DEP,
"ENT_IOB": ENT_IOB,
"ENT_TYPE": ENT_TYPE,
"ENT_KB_ID": ENT_KB_ID,
"HEAD": HEAD,
"SENT_START": SENT_START,
"SPACY": SPACY,

View File

@ -124,6 +124,22 @@ def ja_tokenizer():
return get_lang_class("ja").Defaults.create_tokenizer()
@pytest.fixture(scope="session")
def ko_tokenizer():
pytest.importorskip("natto")
return get_lang_class("ko").Defaults.create_tokenizer()
@pytest.fixture(scope="session")
def lt_tokenizer():
return get_lang_class("lt").Defaults.create_tokenizer()
@pytest.fixture(scope="session")
def lt_lemmatizer():
return get_lang_class("lt").Defaults.create_lemmatizer()
@pytest.fixture(scope="session")
def nb_tokenizer():
return get_lang_class("nb").Defaults.create_tokenizer()

View File

View File

@ -0,0 +1,12 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize(
"word,lemma", [("새로운", "새롭"), ("빨간", "빨갛"), ("클수록", ""), ("뭡니까", ""), ("됐다", "")]
)
def test_ko_lemmatizer_assigns(ko_tokenizer, word, lemma):
test_lemma = ko_tokenizer(word)[0].lemma_
assert test_lemma == lemma

View File

@ -0,0 +1,46 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
# fmt: off
TOKENIZER_TESTS = [("서울 타워 근처에 살고 있습니다.", "서울 타워 근처 에 살 고 있 습니다 ."),
("영등포구에 있는 맛집 좀 알려주세요.", "영등포구 에 있 는 맛집 좀 알려 주 세요 .")]
TAG_TESTS = [("서울 타워 근처에 살고 있습니다.",
"NNP NNG NNG JKB VV EC VX EF SF"),
("영등포구에 있는 맛집 좀 알려주세요.",
"NNP JKB VV ETM NNG MAG VV VX EP SF")]
FULL_TAG_TESTS = [("영등포구에 있는 맛집 좀 알려주세요.",
"NNP JKB VV ETM NNG MAG VV+EC VX EP+EF SF")]
POS_TESTS = [("서울 타워 근처에 살고 있습니다.",
"PROPN NOUN NOUN ADP VERB X AUX X PUNCT"),
("영등포구에 있는 맛집 좀 알려주세요.",
"PROPN ADP VERB X NOUN ADV VERB AUX X PUNCT")]
# fmt: on
@pytest.mark.parametrize("text,expected_tokens", TOKENIZER_TESTS)
def test_ko_tokenizer(ko_tokenizer, text, expected_tokens):
tokens = [token.text for token in ko_tokenizer(text)]
assert tokens == expected_tokens.split()
@pytest.mark.parametrize("text,expected_tags", TAG_TESTS)
def test_ko_tokenizer_tags(ko_tokenizer, text, expected_tags):
tags = [token.tag_ for token in ko_tokenizer(text)]
assert tags == expected_tags.split()
@pytest.mark.parametrize("text,expected_tags", FULL_TAG_TESTS)
def test_ko_tokenizer_full_tags(ko_tokenizer, text, expected_tags):
tags = ko_tokenizer(text).user_data["full_tags"]
assert tags == expected_tags.split()
@pytest.mark.parametrize("text,expected_pos", POS_TESTS)
def test_ko_tokenizer_pos(ko_tokenizer, text, expected_pos):
pos = [token.pos_ for token in ko_tokenizer(text)]
assert pos == expected_pos.split()

View File

View File

@ -0,0 +1,15 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize("tokens,lemmas", [
(["Galime", "vadinti", "gerovės", "valstybe", ",", "turime", "išvystytą", "socialinę", "apsaugą", ",",
"sveikatos", "apsaugą", "ir", "prieinamą", "švietimą", "."],
["galėti", "vadintas", "gerovė", "valstybė", ",", "turėti", "išvystytas", "socialinis",
"apsauga", ",", "sveikata", "apsauga", "ir", "prieinamas", "švietimas", "."]),
(["taip", ",", "uoliai", "tyrinėjau", "ir", "pasirinkau", "geriausią", "variantą", "."],
["taip", ",", "uolus", "tyrinėti", "ir", "pasirinkti", "geras", "variantas", "."])])
def test_lt_lemmatizer(lt_lemmatizer, tokens, lemmas):
assert lemmas == [lt_lemmatizer.lookup(token) for token in tokens]

View File

@ -0,0 +1,56 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
def test_lt_tokenizer_handles_long_text(lt_tokenizer):
text = """Tokios sausros kriterijus atitinka pirmadienį atlikti skaičiavimai, palyginus faktinį ir žemiausią vidutinį daugiametį vandens lygį. Nustatyta, kad iš 48 šalies vandens matavimo stočių 28-iose stotyse vandens lygis yra žemesnis arba lygus žemiausiam vidutiniam daugiamečiam šiltojo laikotarpio vandens lygiui."""
tokens = lt_tokenizer(text)
assert len(tokens) == 42
@pytest.mark.parametrize(
"text,length",
[
(
"177R Parodų rūmaiOzo g. nuo vasario 18 d. bus skelbiamas interneto tinklalapyje.",
15,
),
(
"ISM universiteto doc. dr. Ieva Augutytė-Kvedaravičienė pastebi, kad tyrimais nustatyti elgesio pokyčiai.",
16,
),
],
)
def test_lt_tokenizer_handles_punct_abbrev(lt_tokenizer, text, length):
tokens = lt_tokenizer(text)
assert len(tokens) == length
@pytest.mark.parametrize("text", ["km.", "pvz.", "biol."])
def test_lt_tokenizer_abbrev_exceptions(lt_tokenizer, text):
tokens = lt_tokenizer(text)
assert len(tokens) == 1
@pytest.mark.parametrize(
"text,match",
[
("10", True),
("1", True),
("10,000", True),
("10,00", True),
("999.0", True),
("vienas", True),
("du", True),
("milijardas", True),
("šuo", False),
(",", False),
("1/2", True),
],
)
def test_lt_lex_attrs_like_number(lt_tokenizer, text, match):
tokens = lt_tokenizer(text)
assert len(tokens) == 1
assert tokens[0].like_num == match

View File

@ -5,7 +5,6 @@ import pytest
import re
from spacy.matcher import Matcher, DependencyMatcher
from spacy.tokens import Doc, Token
from ..util import get_doc
@pytest.fixture
@ -288,24 +287,43 @@ def deps():
def dependency_matcher(en_vocab):
def is_brown_yellow(text):
return bool(re.compile(r"brown|yellow|over").match(text))
IS_BROWN_YELLOW = en_vocab.add_flag(is_brown_yellow)
pattern1 = [
{"SPEC": {"NODE_NAME": "fox"}, "PATTERN": {"ORTH": "fox"}},
{"SPEC": {"NODE_NAME": "q", "NBOR_RELOP": ">", "NBOR_NAME": "fox"},"PATTERN": {"ORTH": "quick", "DEP": "amod"}},
{"SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">", "NBOR_NAME": "fox"}, "PATTERN": {IS_BROWN_YELLOW: True}},
{
"SPEC": {"NODE_NAME": "q", "NBOR_RELOP": ">", "NBOR_NAME": "fox"},
"PATTERN": {"ORTH": "quick", "DEP": "amod"},
},
{
"SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">", "NBOR_NAME": "fox"},
"PATTERN": {IS_BROWN_YELLOW: True},
},
]
pattern2 = [
{"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}},
{"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
{"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}}
{
"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"},
"PATTERN": {"ORTH": "fox"},
},
{
"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"},
"PATTERN": {"ORTH": "fox"},
},
]
pattern3 = [
{"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}},
{"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
{"SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">>", "NBOR_NAME": "fox"}, "PATTERN": {"ORTH": "brown"}}
{
"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"},
"PATTERN": {"ORTH": "fox"},
},
{
"SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">>", "NBOR_NAME": "fox"},
"PATTERN": {"ORTH": "brown"},
},
]
matcher = DependencyMatcher(en_vocab)
@ -320,9 +338,9 @@ def test_dependency_matcher_compile(dependency_matcher):
assert len(dependency_matcher) == 3
def test_dependency_matcher(dependency_matcher, text, heads, deps):
doc = get_doc(dependency_matcher.vocab, text.split(), heads=heads, deps=deps)
matches = dependency_matcher(doc)
# assert matches[0][1] == [[3, 1, 2]]
# assert matches[1][1] == [[4, 3, 3]]
# assert matches[2][1] == [[4, 3, 2]]
# def test_dependency_matcher(dependency_matcher, text, heads, deps):
# doc = get_doc(dependency_matcher.vocab, text.split(), heads=heads, deps=deps)
# matches = dependency_matcher(doc)
# assert matches[0][1] == [[3, 1, 2]]
# assert matches[1][1] == [[4, 3, 3]]
# assert matches[2][1] == [[4, 3, 2]]

View File

@ -1,91 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.kb import KnowledgeBase
from spacy.lang.en import English
@pytest.fixture
def nlp():
return English()
def test_kb_valid_entities(nlp):
"""Test the valid construction of a KB with 3 entities and two aliases"""
mykb = KnowledgeBase(nlp.vocab)
# adding entities
mykb.add_entity(entity=u'Q1', prob=0.9)
mykb.add_entity(entity=u'Q2')
mykb.add_entity(entity=u'Q3', prob=0.5)
# adding aliases
mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.2])
mykb.add_alias(alias=u'adam', entities=[u'Q2'], probabilities=[0.9])
# test the size of the corresponding KB
assert(mykb.get_size_entities() == 3)
assert(mykb.get_size_aliases() == 2)
def test_kb_invalid_entities(nlp):
"""Test the invalid construction of a KB with an alias linked to a non-existing entity"""
mykb = KnowledgeBase(nlp.vocab)
# adding entities
mykb.add_entity(entity=u'Q1', prob=0.9)
mykb.add_entity(entity=u'Q2', prob=0.2)
mykb.add_entity(entity=u'Q3', prob=0.5)
# adding aliases - should fail because one of the given IDs is not valid
with pytest.raises(ValueError):
mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q342'], probabilities=[0.8, 0.2])
def test_kb_invalid_probabilities(nlp):
"""Test the invalid construction of a KB with wrong prior probabilities"""
mykb = KnowledgeBase(nlp.vocab)
# adding entities
mykb.add_entity(entity=u'Q1', prob=0.9)
mykb.add_entity(entity=u'Q2', prob=0.2)
mykb.add_entity(entity=u'Q3', prob=0.5)
# adding aliases - should fail because the sum of the probabilities exceeds 1
with pytest.raises(ValueError):
mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.4])
def test_kb_invalid_combination(nlp):
"""Test the invalid construction of a KB with non-matching entity and probability lists"""
mykb = KnowledgeBase(nlp.vocab)
# adding entities
mykb.add_entity(entity=u'Q1', prob=0.9)
mykb.add_entity(entity=u'Q2', prob=0.2)
mykb.add_entity(entity=u'Q3', prob=0.5)
# adding aliases - should fail because the entities and probabilities vectors are not of equal length
with pytest.raises(ValueError):
mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.3, 0.4, 0.1])
def test_candidate_generation(nlp):
"""Test correct candidate generation"""
mykb = KnowledgeBase(nlp.vocab)
# adding entities
mykb.add_entity(entity=u'Q1', prob=0.9)
mykb.add_entity(entity=u'Q2', prob=0.2)
mykb.add_entity(entity=u'Q3', prob=0.5)
# adding aliases
mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.2])
mykb.add_alias(alias=u'adam', entities=[u'Q2'], probabilities=[0.9])
# test the size of the relevant candidates
assert(len(mykb.get_candidates(u'douglas')) == 2)
assert(len(mykb.get_candidates(u'adam')) == 1)
assert(len(mykb.get_candidates(u'shrubbery')) == 0)

View File

@ -0,0 +1,145 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.kb import KnowledgeBase
from spacy.lang.en import English
from spacy.pipeline import EntityRuler
@pytest.fixture
def nlp():
return English()
def test_kb_valid_entities(nlp):
"""Test the valid construction of a KB with 3 entities and two aliases"""
mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
# adding entities
mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1])
mykb.add_entity(entity='Q2', prob=0.5, entity_vector=[2])
mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3])
# adding aliases
mykb.add_alias(alias='douglas', entities=['Q2', 'Q3'], probabilities=[0.8, 0.2])
mykb.add_alias(alias='adam', entities=['Q2'], probabilities=[0.9])
# test the size of the corresponding KB
assert(mykb.get_size_entities() == 3)
assert(mykb.get_size_aliases() == 2)
def test_kb_invalid_entities(nlp):
"""Test the invalid construction of a KB with an alias linked to a non-existing entity"""
mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
# adding entities
mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1])
mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2])
mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3])
# adding aliases - should fail because one of the given IDs is not valid
with pytest.raises(ValueError):
mykb.add_alias(alias='douglas', entities=['Q2', 'Q342'], probabilities=[0.8, 0.2])
def test_kb_invalid_probabilities(nlp):
"""Test the invalid construction of a KB with wrong prior probabilities"""
mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
# adding entities
mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1])
mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2])
mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3])
# adding aliases - should fail because the sum of the probabilities exceeds 1
with pytest.raises(ValueError):
mykb.add_alias(alias='douglas', entities=['Q2', 'Q3'], probabilities=[0.8, 0.4])
def test_kb_invalid_combination(nlp):
"""Test the invalid construction of a KB with non-matching entity and probability lists"""
mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
# adding entities
mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1])
mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2])
mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3])
# adding aliases - should fail because the entities and probabilities vectors are not of equal length
with pytest.raises(ValueError):
mykb.add_alias(alias='douglas', entities=['Q2', 'Q3'], probabilities=[0.3, 0.4, 0.1])
def test_kb_invalid_entity_vector(nlp):
"""Test the invalid construction of a KB with non-matching entity vector lengths"""
mykb = KnowledgeBase(nlp.vocab, entity_vector_length=3)
# adding entities
mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1, 2, 3])
# this should fail because the kb's expected entity vector length is 3
with pytest.raises(ValueError):
mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2])
def test_candidate_generation(nlp):
"""Test correct candidate generation"""
mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
# adding entities
mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1])
mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2])
mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3])
# adding aliases
mykb.add_alias(alias='douglas', entities=['Q2', 'Q3'], probabilities=[0.8, 0.2])
mykb.add_alias(alias='adam', entities=['Q2'], probabilities=[0.9])
# test the size of the relevant candidates
assert(len(mykb.get_candidates('douglas')) == 2)
assert(len(mykb.get_candidates('adam')) == 1)
assert(len(mykb.get_candidates('shrubbery')) == 0)
def test_preserving_links_asdoc(nlp):
"""Test that Span.as_doc preserves the existing entity links"""
mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
# adding entities
mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1])
mykb.add_entity(entity='Q2', prob=0.8, entity_vector=[1])
# adding aliases
mykb.add_alias(alias='Boston', entities=['Q1'], probabilities=[0.7])
mykb.add_alias(alias='Denver', entities=['Q2'], probabilities=[0.6])
# set up pipeline with NER (Entity Ruler) and NEL (prior probability only, model not trained)
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)
ruler = EntityRuler(nlp)
patterns = [{"label": "GPE", "pattern": "Boston"},
{"label": "GPE", "pattern": "Denver"}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
el_pipe = nlp.create_pipe(name='entity_linker', config={"context_width": 64})
el_pipe.set_kb(mykb)
el_pipe.begin_training()
el_pipe.context_weight = 0
el_pipe.prior_weight = 1
nlp.add_pipe(el_pipe, last=True)
# test whether the entity links are preserved by the `as_doc()` function
text = "She lives in Boston. He lives in Denver."
doc = nlp(text)
for ent in doc.ents:
orig_text = ent.text
orig_kb_id = ent.kb_id_
sent_doc = ent.sent.as_doc()
for s_ent in sent_doc.ents:
if s_ent.text == orig_text:
assert s_ent.kb_id_ == orig_kb_id

View File

@ -106,5 +106,24 @@ def test_entity_ruler_serialize_bytes(nlp, patterns):
assert len(new_ruler) == 0
assert len(new_ruler.labels) == 0
new_ruler = new_ruler.from_bytes(ruler_bytes)
assert len(new_ruler) == len(patterns)
assert len(new_ruler.labels) == 4
assert len(new_ruler.patterns) == len(ruler.patterns)
for pattern in ruler.patterns:
assert pattern in new_ruler.patterns
assert sorted(new_ruler.labels) == sorted(ruler.labels)
def test_entity_ruler_serialize_phrase_matcher_attr_bytes(nlp, patterns):
ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER", patterns=patterns)
assert len(ruler) == len(patterns)
assert len(ruler.labels) == 4
ruler_bytes = ruler.to_bytes()
new_ruler = EntityRuler(nlp)
assert len(new_ruler) == 0
assert len(new_ruler.labels) == 0
assert new_ruler.phrase_matcher_attr is None
new_ruler = new_ruler.from_bytes(ruler_bytes)
assert len(new_ruler) == len(patterns)
assert len(new_ruler.labels) == 4
assert new_ruler.phrase_matcher_attr == "LOWER"

View File

@ -4,6 +4,7 @@ from __future__ import unicode_literals
import pytest
import numpy
from spacy.tokens import Doc
from spacy.matcher import Matcher
from spacy.displacy import render
from spacy.gold import iob_to_biluo
from spacy.lang.it import Italian
@ -123,6 +124,15 @@ def test_issue2396(en_vocab):
assert (span.get_lca_matrix() == matrix).all()
def test_issue2464(en_vocab):
"""Test problem with successive ?. This is the same bug, so putting it here."""
matcher = Matcher(en_vocab)
doc = Doc(en_vocab, words=["a", "b"])
matcher.add("4", None, [{"OP": "?"}, {"OP": "?"}])
matches = matcher(doc)
assert len(matches) == 3
def test_issue2482():
"""Test we can serialize and deserialize a blank NER or parser model."""
nlp = Italian()

View File

@ -0,0 +1,334 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from spacy.lang.en import English
from spacy.lang.de import German
from spacy.pipeline import EntityRuler, EntityRecognizer
from spacy.matcher import Matcher, PhraseMatcher
from spacy.tokens import Doc
from spacy.vocab import Vocab
from spacy.attrs import ENT_IOB, ENT_TYPE
from spacy.compat import pickle, is_python2, unescape_unicode
from spacy import displacy
from spacy.util import decaying
import numpy
import re
from ..util import get_doc
def test_issue3002():
"""Test that the tokenizer doesn't hang on a long list of dots"""
nlp = German()
doc = nlp(
"880.794.982.218.444.893.023.439.794.626.120.190.780.624.990.275.671 ist eine lange Zahl"
)
assert len(doc) == 5
def test_issue3009(en_vocab):
"""Test problem with matcher quantifiers"""
patterns = [
[{"LEMMA": "have"}, {"LOWER": "to"}, {"LOWER": "do"}, {"POS": "ADP"}],
[
{"LEMMA": "have"},
{"IS_ASCII": True, "IS_PUNCT": False, "OP": "*"},
{"LOWER": "to"},
{"LOWER": "do"},
{"POS": "ADP"},
],
[
{"LEMMA": "have"},
{"IS_ASCII": True, "IS_PUNCT": False, "OP": "?"},
{"LOWER": "to"},
{"LOWER": "do"},
{"POS": "ADP"},
],
]
words = ["also", "has", "to", "do", "with"]
tags = ["RB", "VBZ", "TO", "VB", "IN"]
doc = get_doc(en_vocab, words=words, tags=tags)
matcher = Matcher(en_vocab)
for i, pattern in enumerate(patterns):
matcher.add(str(i), None, pattern)
matches = matcher(doc)
assert matches
def test_issue3012(en_vocab):
"""Test that the is_tagged attribute doesn't get overwritten when we from_array
without tag information."""
words = ["This", "is", "10", "%", "."]
tags = ["DT", "VBZ", "CD", "NN", "."]
pos = ["DET", "VERB", "NUM", "NOUN", "PUNCT"]
ents = [(2, 4, "PERCENT")]
doc = get_doc(en_vocab, words=words, tags=tags, pos=pos, ents=ents)
assert doc.is_tagged
expected = ("10", "NUM", "CD", "PERCENT")
assert (doc[2].text, doc[2].pos_, doc[2].tag_, doc[2].ent_type_) == expected
header = [ENT_IOB, ENT_TYPE]
ent_array = doc.to_array(header)
doc.from_array(header, ent_array)
assert (doc[2].text, doc[2].pos_, doc[2].tag_, doc[2].ent_type_) == expected
# Serializing then deserializing
doc_bytes = doc.to_bytes()
doc2 = Doc(en_vocab).from_bytes(doc_bytes)
assert (doc2[2].text, doc2[2].pos_, doc2[2].tag_, doc2[2].ent_type_) == expected
def test_issue3199():
"""Test that Span.noun_chunks works correctly if no noun chunks iterator
is available. To make this test future-proof, we're constructing a Doc
with a new Vocab here and setting is_parsed to make sure the noun chunks run.
"""
doc = Doc(Vocab(), words=["This", "is", "a", "sentence"])
doc.is_parsed = True
assert list(doc[0:3].noun_chunks) == []
def test_issue3209():
"""Test issue that occurred in spaCy nightly where NER labels were being
mapped to classes incorrectly after loading the model, when the labels
were added using ner.add_label().
"""
nlp = English()
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("ANIMAL")
nlp.begin_training()
move_names = ["O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL", "U-ANIMAL"]
assert ner.move_names == move_names
nlp2 = English()
nlp2.add_pipe(nlp2.create_pipe("ner"))
nlp2.from_bytes(nlp.to_bytes())
assert nlp2.get_pipe("ner").move_names == move_names
def test_issue3248_1():
"""Test that the PhraseMatcher correctly reports its number of rules, not
total number of patterns."""
nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c"))
matcher.add("TEST2", None, nlp("d"))
assert len(matcher) == 2
def test_issue3248_2():
"""Test that the PhraseMatcher can be pickled correctly."""
nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c"))
matcher.add("TEST2", None, nlp("d"))
data = pickle.dumps(matcher)
new_matcher = pickle.loads(data)
assert len(new_matcher) == len(matcher)
def test_issue3277(es_tokenizer):
"""Test that hyphens are split correctly as prefixes."""
doc = es_tokenizer("—Yo me llamo... murmuró el niño Emilio Sánchez Pérez.")
assert len(doc) == 14
assert doc[0].text == "\u2014"
assert doc[5].text == "\u2013"
assert doc[9].text == "\u2013"
def test_issue3288(en_vocab):
"""Test that retokenization works correctly via displaCy when punctuation
is merged onto the preceding token and tensor is resized."""
words = ["Hello", "World", "!", "When", "is", "this", "breaking", "?"]
heads = [1, 0, -1, 1, 0, 1, -2, -3]
deps = ["intj", "ROOT", "punct", "advmod", "ROOT", "det", "nsubj", "punct"]
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
doc.tensor = numpy.zeros((len(words), 96), dtype="float32")
displacy.render(doc)
def test_issue3289():
"""Test that Language.to_bytes handles serializing a pipeline component
with an uninitialized model."""
nlp = English()
nlp.add_pipe(nlp.create_pipe("textcat"))
bytes_data = nlp.to_bytes()
new_nlp = English()
new_nlp.add_pipe(nlp.create_pipe("textcat"))
new_nlp.from_bytes(bytes_data)
def test_issue3328(en_vocab):
doc = Doc(en_vocab, words=["Hello", ",", "how", "are", "you", "doing", "?"])
matcher = Matcher(en_vocab)
patterns = [
[{"LOWER": {"IN": ["hello", "how"]}}],
[{"LOWER": {"IN": ["you", "doing"]}}],
]
matcher.add("TEST", None, *patterns)
matches = matcher(doc)
assert len(matches) == 4
matched_texts = [doc[start:end].text for _, start, end in matches]
assert matched_texts == ["Hello", "how", "you", "doing"]
@pytest.mark.xfail
def test_issue3331(en_vocab):
"""Test that duplicate patterns for different rules result in multiple
matches, one per rule.
"""
matcher = PhraseMatcher(en_vocab)
matcher.add("A", None, Doc(en_vocab, words=["Barack", "Obama"]))
matcher.add("B", None, Doc(en_vocab, words=["Barack", "Obama"]))
doc = Doc(en_vocab, words=["Barack", "Obama", "lifts", "America"])
matches = matcher(doc)
assert len(matches) == 2
match_ids = [en_vocab.strings[matches[0][0]], en_vocab.strings[matches[1][0]]]
assert sorted(match_ids) == ["A", "B"]
def test_issue3345():
"""Test case where preset entity crosses sentence boundary."""
nlp = English()
doc = Doc(nlp.vocab, words=["I", "live", "in", "New", "York"])
doc[4].is_sent_start = True
ruler = EntityRuler(nlp, patterns=[{"label": "GPE", "pattern": "New York"}])
ner = EntityRecognizer(doc.vocab)
# Add the OUT action. I wouldn't have thought this would be necessary...
ner.moves.add_action(5, "")
ner.add_label("GPE")
doc = ruler(doc)
# Get into the state just before "New"
state = ner.moves.init_batch([doc])[0]
ner.moves.apply_transition(state, "O")
ner.moves.apply_transition(state, "O")
ner.moves.apply_transition(state, "O")
# Check that B-GPE is valid.
assert ner.moves.is_valid(state, "B-GPE")
if is_python2:
# If we have this test in Python 3, pytest chokes, as it can't print the
# string above in the xpass message.
prefix_search = (
b"^\xc2\xa7|^%|^=|^\xe2\x80\x94|^\xe2\x80\x93|^\\+(?![0-9])"
b"|^\xe2\x80\xa6|^\xe2\x80\xa6\xe2\x80\xa6|^,|^:|^;|^\\!|^\\?"
b"|^\xc2\xbf|^\xd8\x9f|^\xc2\xa1|^\\(|^\\)|^\\[|^\\]|^\\{|^\\}"
b"|^<|^>|^_|^#|^\\*|^&|^\xe3\x80\x82|^\xef\xbc\x9f|^\xef\xbc\x81|"
b"^\xef\xbc\x8c|^\xe3\x80\x81|^\xef\xbc\x9b|^\xef\xbc\x9a|"
b"^\xef\xbd\x9e|^\xc2\xb7|^\xe0\xa5\xa4|^\xd8\x8c|^\xd8\x9b|"
b"^\xd9\xaa|^\\.\\.+|^\xe2\x80\xa6|^\\'|^\"|^\xe2\x80\x9d|"
b"^\xe2\x80\x9c|^`|^\xe2\x80\x98|^\xc2\xb4|^\xe2\x80\x99|"
b"^\xe2\x80\x9a|^,|^\xe2\x80\x9e|^\xc2\xbb|^\xc2\xab|^\xe3\x80\x8c|"
b"^\xe3\x80\x8d|^\xe3\x80\x8e|^\xe3\x80\x8f|^\xef\xbc\x88|"
b"^\xef\xbc\x89|^\xe3\x80\x94|^\xe3\x80\x95|^\xe3\x80\x90|"
b"^\xe3\x80\x91|^\xe3\x80\x8a|^\xe3\x80\x8b|^\xe3\x80\x88|"
b"^\xe3\x80\x89|^\\$|^\xc2\xa3|^\xe2\x82\xac|^\xc2\xa5|^\xe0\xb8\xbf|"
b"^US\\$|^C\\$|^A\\$|^\xe2\x82\xbd|^\xef\xb7\xbc|^\xe2\x82\xb4|"
b"^[\\u00A6\\u00A9\\u00AE\\u00B0\\u0482\\u058D\\u058E\\u060E\\u060F"
b"\\u06DE\\u06E9\\u06FD\\u06FE\\u07F6\\u09FA\\u0B70\\u0BF3-\\u0BF8"
b"\\u0BFA\\u0C7F\\u0D4F\\u0D79\\u0F01-\\u0F03\\u0F13\\u0F15-\\u0F17"
b"\\u0F1A-\\u0F1F\\u0F34\\u0F36\\u0F38\\u0FBE-\\u0FC5\\u0FC7-\\u0FCC"
b"\\u0FCE\\u0FCF\\u0FD5-\\u0FD8\\u109E\\u109F\\u1390-\\u1399\\u1940"
b"\\u19DE-\\u19FF\\u1B61-\\u1B6A\\u1B74-\\u1B7C\\u2100\\u2101\\u2103"
b"-\\u2106\\u2108\\u2109\\u2114\\u2116\\u2117\\u211E-\\u2123\\u2125"
b"\\u2127\\u2129\\u212E\\u213A\\u213B\\u214A\\u214C\\u214D\\u214F"
b"\\u218A\\u218B\\u2195-\\u2199\\u219C-\\u219F\\u21A1\\u21A2\\u21A4"
b"\\u21A5\\u21A7-\\u21AD\\u21AF-\\u21CD\\u21D0\\u21D1\\u21D3\\u21D5"
b"-\\u21F3\\u2300-\\u2307\\u230C-\\u231F\\u2322-\\u2328\\u232B"
b"-\\u237B\\u237D-\\u239A\\u23B4-\\u23DB\\u23E2-\\u2426\\u2440"
b"-\\u244A\\u249C-\\u24E9\\u2500-\\u25B6\\u25B8-\\u25C0\\u25C2"
b"-\\u25F7\\u2600-\\u266E\\u2670-\\u2767\\u2794-\\u27BF\\u2800"
b"-\\u28FF\\u2B00-\\u2B2F\\u2B45\\u2B46\\u2B4D-\\u2B73\\u2B76"
b"-\\u2B95\\u2B98-\\u2BC8\\u2BCA-\\u2BFE\\u2CE5-\\u2CEA\\u2E80"
b"-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u2FF0-\\u2FFB\\u3004"
b"\\u3012\\u3013\\u3020\\u3036\\u3037\\u303E\\u303F\\u3190\\u3191"
b"\\u3196-\\u319F\\u31C0-\\u31E3\\u3200-\\u321E\\u322A-\\u3247\\u3250"
b"\\u3260-\\u327F\\u328A-\\u32B0\\u32C0-\\u32FE\\u3300-\\u33FF\\u4DC0"
b"-\\u4DFF\\uA490-\\uA4C6\\uA828-\\uA82B\\uA836\\uA837\\uA839\\uAA77"
b"-\\uAA79\\uFDFD\\uFFE4\\uFFE8\\uFFED\\uFFEE\\uFFFC\\uFFFD\\U00010137"
b"-\\U0001013F\\U00010179-\\U00010189\\U0001018C-\\U0001018E"
b"\\U00010190-\\U0001019B\\U000101A0\\U000101D0-\\U000101FC\\U00010877"
b"\\U00010878\\U00010AC8\\U0001173F\\U00016B3C-\\U00016B3F\\U00016B45"
b"\\U0001BC9C\\U0001D000-\\U0001D0F5\\U0001D100-\\U0001D126\\U0001D129"
b"-\\U0001D164\\U0001D16A-\\U0001D16C\\U0001D183\\U0001D184\\U0001D18C"
b"-\\U0001D1A9\\U0001D1AE-\\U0001D1E8\\U0001D200-\\U0001D241\\U0001D245"
b"\\U0001D300-\\U0001D356\\U0001D800-\\U0001D9FF\\U0001DA37-\\U0001DA3A"
b"\\U0001DA6D-\\U0001DA74\\U0001DA76-\\U0001DA83\\U0001DA85\\U0001DA86"
b"\\U0001ECAC\\U0001F000-\\U0001F02B\\U0001F030-\\U0001F093\\U0001F0A0"
b"-\\U0001F0AE\\U0001F0B1-\\U0001F0BF\\U0001F0C1-\\U0001F0CF\\U0001F0D1"
b"-\\U0001F0F5\\U0001F110-\\U0001F16B\\U0001F170-\\U0001F1AC\\U0001F1E6"
b"-\\U0001F202\\U0001F210-\\U0001F23B\\U0001F240-\\U0001F248\\U0001F250"
b"\\U0001F251\\U0001F260-\\U0001F265\\U0001F300-\\U0001F3FA\\U0001F400"
b"-\\U0001F6D4\\U0001F6E0-\\U0001F6EC\\U0001F6F0-\\U0001F6F9\\U0001F700"
b"-\\U0001F773\\U0001F780-\\U0001F7D8\\U0001F800-\\U0001F80B\\U0001F810"
b"-\\U0001F847\\U0001F850-\\U0001F859\\U0001F860-\\U0001F887\\U0001F890"
b"-\\U0001F8AD\\U0001F900-\\U0001F90B\\U0001F910-\\U0001F93E\\U0001F940"
b"-\\U0001F970\\U0001F973-\\U0001F976\\U0001F97A\\U0001F97C-\\U0001F9A2"
b"\\U0001F9B0-\\U0001F9B9\\U0001F9C0-\\U0001F9C2\\U0001F9D0-\\U0001F9FF"
b"\\U0001FA60-\\U0001FA6D]"
)
def test_issue3356():
pattern = re.compile(unescape_unicode(prefix_search.decode("utf8")))
assert not pattern.search("hello")
def test_issue3410():
texts = ["Hello world", "This is a test"]
nlp = English()
matcher = Matcher(nlp.vocab)
phrasematcher = PhraseMatcher(nlp.vocab)
with pytest.deprecated_call():
docs = list(nlp.pipe(texts, n_threads=4))
with pytest.deprecated_call():
docs = list(nlp.tokenizer.pipe(texts, n_threads=4))
with pytest.deprecated_call():
list(matcher.pipe(docs, n_threads=4))
with pytest.deprecated_call():
list(phrasematcher.pipe(docs, n_threads=4))
def test_issue3447():
sizes = decaying(10.0, 1.0, 0.5)
size = next(sizes)
assert size == 10.0
size = next(sizes)
assert size == 10.0 - 0.5
size = next(sizes)
assert size == 10.0 - 0.5 - 0.5
@pytest.mark.xfail(reason="default suffix rules avoid one upper-case letter before dot")
def test_issue3449():
nlp = English()
nlp.add_pipe(nlp.create_pipe("sentencizer"))
text1 = "He gave the ball to I. Do you want to go to the movies with I?"
text2 = "He gave the ball to I. Do you want to go to the movies with I?"
text3 = "He gave the ball to I.\nDo you want to go to the movies with I?"
t1 = nlp(text1)
t2 = nlp(text2)
t3 = nlp(text3)
assert t1[5].text == "I"
assert t2[5].text == "I"
assert t3[5].text == "I"
def test_issue3468():
"""Test that sentence boundaries are set correctly so Doc.is_sentenced can
be restored after serialization."""
nlp = English()
nlp.add_pipe(nlp.create_pipe("sentencizer"))
doc = nlp("Hello world")
assert doc[0].is_sent_start
assert doc.is_sentenced
assert len(list(doc.sents)) == 1
doc_bytes = doc.to_bytes()
new_doc = Doc(nlp.vocab).from_bytes(doc_bytes)
assert new_doc[0].is_sent_start
assert new_doc.is_sentenced
assert len(list(new_doc.sents)) == 1

View File

@ -1,11 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
from spacy.lang.de import German
def test_issue3002():
"""Test that the tokenizer doesn't hang on a long list of dots"""
nlp = German()
doc = nlp('880.794.982.218.444.893.023.439.794.626.120.190.780.624.990.275.671 ist eine lange Zahl')
assert len(doc) == 5

View File

@ -1,67 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.matcher import Matcher
from spacy.tokens import Doc
PATTERNS = [
("1", [[{"LEMMA": "have"}, {"LOWER": "to"}, {"LOWER": "do"}, {"POS": "ADP"}]]),
(
"2",
[
[
{"LEMMA": "have"},
{"IS_ASCII": True, "IS_PUNCT": False, "OP": "*"},
{"LOWER": "to"},
{"LOWER": "do"},
{"POS": "ADP"},
]
],
),
(
"3",
[
[
{"LEMMA": "have"},
{"IS_ASCII": True, "IS_PUNCT": False, "OP": "?"},
{"LOWER": "to"},
{"LOWER": "do"},
{"POS": "ADP"},
]
],
),
]
@pytest.fixture
def doc(en_tokenizer):
doc = en_tokenizer("also has to do with")
doc[0].tag_ = "RB"
doc[1].tag_ = "VBZ"
doc[2].tag_ = "TO"
doc[3].tag_ = "VB"
doc[4].tag_ = "IN"
return doc
@pytest.fixture
def matcher(en_tokenizer):
return Matcher(en_tokenizer.vocab)
@pytest.mark.parametrize("pattern", PATTERNS)
def test_issue3009(doc, matcher, pattern):
"""Test problem with matcher quantifiers"""
matcher.add(pattern[0], None, *pattern[1])
matches = matcher(doc)
assert matches
def test_issue2464(matcher):
"""Test problem with successive ?. This is the same bug, so putting it here."""
doc = Doc(matcher.vocab, words=["a", "b"])
matcher.add("4", None, [{"OP": "?"}, {"OP": "?"}])
matches = matcher(doc)
assert len(matches) == 3

View File

@ -1,31 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import ENT_IOB, ENT_TYPE
from ...tokens import Doc
from ..util import get_doc
def test_issue3012(en_vocab):
"""Test that the is_tagged attribute doesn't get overwritten when we from_array
without tag information."""
words = ["This", "is", "10", "%", "."]
tags = ["DT", "VBZ", "CD", "NN", "."]
pos = ["DET", "VERB", "NUM", "NOUN", "PUNCT"]
ents = [(2, 4, "PERCENT")]
doc = get_doc(en_vocab, words=words, tags=tags, pos=pos, ents=ents)
assert doc.is_tagged
expected = ("10", "NUM", "CD", "PERCENT")
assert (doc[2].text, doc[2].pos_, doc[2].tag_, doc[2].ent_type_) == expected
header = [ENT_IOB, ENT_TYPE]
ent_array = doc.to_array(header)
doc.from_array(header, ent_array)
assert (doc[2].text, doc[2].pos_, doc[2].tag_, doc[2].ent_type_) == expected
# serializing then deserializing
doc_bytes = doc.to_bytes()
doc2 = Doc(en_vocab).from_bytes(doc_bytes)
assert (doc2[2].text, doc2[2].pos_, doc2[2].tag_, doc2[2].ent_type_) == expected

View File

@ -1,15 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
from spacy.tokens import Doc
from spacy.vocab import Vocab
def test_issue3199():
"""Test that Span.noun_chunks works correctly if no noun chunks iterator
is available. To make this test future-proof, we're constructing a Doc
with a new Vocab here and setting is_parsed to make sure the noun chunks run.
"""
doc = Doc(Vocab(), words=["This", "is", "a", "sentence"])
doc.is_parsed = True
assert list(doc[0:3].noun_chunks) == []

View File

@ -1,23 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
from spacy.lang.en import English
def test_issue3209():
"""Test issue that occurred in spaCy nightly where NER labels were being
mapped to classes incorrectly after loading the model, when the labels
were added using ner.add_label().
"""
nlp = English()
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("ANIMAL")
nlp.begin_training()
move_names = ["O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL", "U-ANIMAL"]
assert ner.move_names == move_names
nlp2 = English()
nlp2.add_pipe(nlp2.create_pipe("ner"))
nlp2.from_bytes(nlp.to_bytes())
assert nlp2.get_pipe("ner").move_names == move_names

View File

@ -1,27 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
from spacy.matcher import PhraseMatcher
from spacy.lang.en import English
from spacy.compat import pickle
def test_issue3248_1():
"""Test that the PhraseMatcher correctly reports its number of rules, not
total number of patterns."""
nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c"))
matcher.add("TEST2", None, nlp("d"))
assert len(matcher) == 2
def test_issue3248_2():
"""Test that the PhraseMatcher can be pickled correctly."""
nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c"))
matcher.add("TEST2", None, nlp("d"))
data = pickle.dumps(matcher)
new_matcher = pickle.loads(data)
assert len(new_matcher) == len(matcher)

View File

@ -1,11 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
def test_issue3277(es_tokenizer):
"""Test that hyphens are split correctly as prefixes."""
doc = es_tokenizer("—Yo me llamo... murmuró el niño Emilio Sánchez Pérez.")
assert len(doc) == 14
assert doc[0].text == "\u2014"
assert doc[5].text == "\u2013"
assert doc[9].text == "\u2013"

View File

@ -1,18 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
import numpy
from spacy import displacy
from ..util import get_doc
def test_issue3288(en_vocab):
"""Test that retokenization works correctly via displaCy when punctuation
is merged onto the preceding token and the tensor is resized."""
words = ["Hello", "World", "!", "When", "is", "this", "breaking", "?"]
heads = [1, 0, -1, 1, 0, 1, -2, -3]
deps = ["intj", "ROOT", "punct", "advmod", "ROOT", "det", "nsubj", "punct"]
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
doc.tensor = numpy.zeros((len(words), 96), dtype="float32")
displacy.render(doc)

View File

@ -1,15 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
from spacy.lang.en import English
def test_issue3289():
"""Test that Language.to_bytes handles serializing a pipeline component
with an uninitialized model."""
nlp = English()
nlp.add_pipe(nlp.create_pipe("textcat"))
bytes_data = nlp.to_bytes()
new_nlp = English()
new_nlp.add_pipe(nlp.create_pipe("textcat"))
new_nlp.from_bytes(bytes_data)

View File

@ -1,19 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
from spacy.matcher import Matcher
from spacy.tokens import Doc
def test_issue3328(en_vocab):
doc = Doc(en_vocab, words=["Hello", ",", "how", "are", "you", "doing", "?"])
matcher = Matcher(en_vocab)
patterns = [
[{"LOWER": {"IN": ["hello", "how"]}}],
[{"LOWER": {"IN": ["you", "doing"]}}],
]
matcher.add("TEST", None, *patterns)
matches = matcher(doc)
assert len(matches) == 4
matched_texts = [doc[start:end].text for _, start, end in matches]
assert matched_texts == ["Hello", "how", "you", "doing"]

View File

@ -1,21 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc
@pytest.mark.xfail
def test_issue3331(en_vocab):
"""Test that duplicate patterns for different rules result in multiple
matches, one per rule.
"""
matcher = PhraseMatcher(en_vocab)
matcher.add("A", None, Doc(en_vocab, words=["Barack", "Obama"]))
matcher.add("B", None, Doc(en_vocab, words=["Barack", "Obama"]))
doc = Doc(en_vocab, words=["Barack", "Obama", "lifts", "America"])
matches = matcher(doc)
assert len(matches) == 2
match_ids = [en_vocab.strings[matches[0][0]], en_vocab.strings[matches[1][0]]]
assert sorted(match_ids) == ["A", "B"]

View File

@ -1,26 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
from spacy.lang.en import English
from spacy.tokens import Doc
from spacy.pipeline import EntityRuler, EntityRecognizer
def test_issue3345():
"""Test case where preset entity crosses sentence boundary."""
nlp = English()
doc = Doc(nlp.vocab, words=["I", "live", "in", "New", "York"])
doc[4].is_sent_start = True
ruler = EntityRuler(nlp, patterns=[{"label": "GPE", "pattern": "New York"}])
ner = EntityRecognizer(doc.vocab)
# Add the OUT action. I wouldn't have thought this would be necessary...
ner.moves.add_action(5, "")
ner.add_label("GPE")
doc = ruler(doc)
# Get into the state just before "New"
state = ner.moves.init_batch([doc])[0]
ner.moves.apply_transition(state, "O")
ner.moves.apply_transition(state, "O")
ner.moves.apply_transition(state, "O")
# Check that B-GPE is valid.
assert ner.moves.is_valid(state, "B-GPE")

View File

@ -1,72 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
import re
from spacy import compat
prefix_search = (
b"^\xc2\xa7|^%|^=|^\xe2\x80\x94|^\xe2\x80\x93|^\\+(?![0-9])"
b"|^\xe2\x80\xa6|^\xe2\x80\xa6\xe2\x80\xa6|^,|^:|^;|^\\!|^\\?"
b"|^\xc2\xbf|^\xd8\x9f|^\xc2\xa1|^\\(|^\\)|^\\[|^\\]|^\\{|^\\}"
b"|^<|^>|^_|^#|^\\*|^&|^\xe3\x80\x82|^\xef\xbc\x9f|^\xef\xbc\x81|"
b"^\xef\xbc\x8c|^\xe3\x80\x81|^\xef\xbc\x9b|^\xef\xbc\x9a|"
b"^\xef\xbd\x9e|^\xc2\xb7|^\xe0\xa5\xa4|^\xd8\x8c|^\xd8\x9b|"
b"^\xd9\xaa|^\\.\\.+|^\xe2\x80\xa6|^\\'|^\"|^\xe2\x80\x9d|"
b"^\xe2\x80\x9c|^`|^\xe2\x80\x98|^\xc2\xb4|^\xe2\x80\x99|"
b"^\xe2\x80\x9a|^,|^\xe2\x80\x9e|^\xc2\xbb|^\xc2\xab|^\xe3\x80\x8c|"
b"^\xe3\x80\x8d|^\xe3\x80\x8e|^\xe3\x80\x8f|^\xef\xbc\x88|"
b"^\xef\xbc\x89|^\xe3\x80\x94|^\xe3\x80\x95|^\xe3\x80\x90|"
b"^\xe3\x80\x91|^\xe3\x80\x8a|^\xe3\x80\x8b|^\xe3\x80\x88|"
b"^\xe3\x80\x89|^\\$|^\xc2\xa3|^\xe2\x82\xac|^\xc2\xa5|^\xe0\xb8\xbf|"
b"^US\\$|^C\\$|^A\\$|^\xe2\x82\xbd|^\xef\xb7\xbc|^\xe2\x82\xb4|"
b"^[\\u00A6\\u00A9\\u00AE\\u00B0\\u0482\\u058D\\u058E\\u060E\\u060F"
b"\\u06DE\\u06E9\\u06FD\\u06FE\\u07F6\\u09FA\\u0B70\\u0BF3-\\u0BF8"
b"\\u0BFA\\u0C7F\\u0D4F\\u0D79\\u0F01-\\u0F03\\u0F13\\u0F15-\\u0F17"
b"\\u0F1A-\\u0F1F\\u0F34\\u0F36\\u0F38\\u0FBE-\\u0FC5\\u0FC7-\\u0FCC"
b"\\u0FCE\\u0FCF\\u0FD5-\\u0FD8\\u109E\\u109F\\u1390-\\u1399\\u1940"
b"\\u19DE-\\u19FF\\u1B61-\\u1B6A\\u1B74-\\u1B7C\\u2100\\u2101\\u2103"
b"-\\u2106\\u2108\\u2109\\u2114\\u2116\\u2117\\u211E-\\u2123\\u2125"
b"\\u2127\\u2129\\u212E\\u213A\\u213B\\u214A\\u214C\\u214D\\u214F"
b"\\u218A\\u218B\\u2195-\\u2199\\u219C-\\u219F\\u21A1\\u21A2\\u21A4"
b"\\u21A5\\u21A7-\\u21AD\\u21AF-\\u21CD\\u21D0\\u21D1\\u21D3\\u21D5"
b"-\\u21F3\\u2300-\\u2307\\u230C-\\u231F\\u2322-\\u2328\\u232B"
b"-\\u237B\\u237D-\\u239A\\u23B4-\\u23DB\\u23E2-\\u2426\\u2440"
b"-\\u244A\\u249C-\\u24E9\\u2500-\\u25B6\\u25B8-\\u25C0\\u25C2"
b"-\\u25F7\\u2600-\\u266E\\u2670-\\u2767\\u2794-\\u27BF\\u2800"
b"-\\u28FF\\u2B00-\\u2B2F\\u2B45\\u2B46\\u2B4D-\\u2B73\\u2B76"
b"-\\u2B95\\u2B98-\\u2BC8\\u2BCA-\\u2BFE\\u2CE5-\\u2CEA\\u2E80"
b"-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u2FF0-\\u2FFB\\u3004"
b"\\u3012\\u3013\\u3020\\u3036\\u3037\\u303E\\u303F\\u3190\\u3191"
b"\\u3196-\\u319F\\u31C0-\\u31E3\\u3200-\\u321E\\u322A-\\u3247\\u3250"
b"\\u3260-\\u327F\\u328A-\\u32B0\\u32C0-\\u32FE\\u3300-\\u33FF\\u4DC0"
b"-\\u4DFF\\uA490-\\uA4C6\\uA828-\\uA82B\\uA836\\uA837\\uA839\\uAA77"
b"-\\uAA79\\uFDFD\\uFFE4\\uFFE8\\uFFED\\uFFEE\\uFFFC\\uFFFD\\U00010137"
b"-\\U0001013F\\U00010179-\\U00010189\\U0001018C-\\U0001018E"
b"\\U00010190-\\U0001019B\\U000101A0\\U000101D0-\\U000101FC\\U00010877"
b"\\U00010878\\U00010AC8\\U0001173F\\U00016B3C-\\U00016B3F\\U00016B45"
b"\\U0001BC9C\\U0001D000-\\U0001D0F5\\U0001D100-\\U0001D126\\U0001D129"
b"-\\U0001D164\\U0001D16A-\\U0001D16C\\U0001D183\\U0001D184\\U0001D18C"
b"-\\U0001D1A9\\U0001D1AE-\\U0001D1E8\\U0001D200-\\U0001D241\\U0001D245"
b"\\U0001D300-\\U0001D356\\U0001D800-\\U0001D9FF\\U0001DA37-\\U0001DA3A"
b"\\U0001DA6D-\\U0001DA74\\U0001DA76-\\U0001DA83\\U0001DA85\\U0001DA86"
b"\\U0001ECAC\\U0001F000-\\U0001F02B\\U0001F030-\\U0001F093\\U0001F0A0"
b"-\\U0001F0AE\\U0001F0B1-\\U0001F0BF\\U0001F0C1-\\U0001F0CF\\U0001F0D1"
b"-\\U0001F0F5\\U0001F110-\\U0001F16B\\U0001F170-\\U0001F1AC\\U0001F1E6"
b"-\\U0001F202\\U0001F210-\\U0001F23B\\U0001F240-\\U0001F248\\U0001F250"
b"\\U0001F251\\U0001F260-\\U0001F265\\U0001F300-\\U0001F3FA\\U0001F400"
b"-\\U0001F6D4\\U0001F6E0-\\U0001F6EC\\U0001F6F0-\\U0001F6F9\\U0001F700"
b"-\\U0001F773\\U0001F780-\\U0001F7D8\\U0001F800-\\U0001F80B\\U0001F810"
b"-\\U0001F847\\U0001F850-\\U0001F859\\U0001F860-\\U0001F887\\U0001F890"
b"-\\U0001F8AD\\U0001F900-\\U0001F90B\\U0001F910-\\U0001F93E\\U0001F940"
b"-\\U0001F970\\U0001F973-\\U0001F976\\U0001F97A\\U0001F97C-\\U0001F9A2"
b"\\U0001F9B0-\\U0001F9B9\\U0001F9C0-\\U0001F9C2\\U0001F9D0-\\U0001F9FF"
b"\\U0001FA60-\\U0001FA6D]"
)
if compat.is_python2:
# If we have this test in Python 3, pytest chokes, as it can't print the
# string above in the xpass message.
def test_issue3356():
pattern = re.compile(compat.unescape_unicode(prefix_search.decode("utf8")))
assert not pattern.search("hello")

View File

@ -1,21 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from spacy.lang.en import English
from spacy.matcher import Matcher, PhraseMatcher
def test_issue3410():
texts = ["Hello world", "This is a test"]
nlp = English()
matcher = Matcher(nlp.vocab)
phrasematcher = PhraseMatcher(nlp.vocab)
with pytest.deprecated_call():
docs = list(nlp.pipe(texts, n_threads=4))
with pytest.deprecated_call():
docs = list(nlp.tokenizer.pipe(texts, n_threads=4))
with pytest.deprecated_call():
list(matcher.pipe(docs, n_threads=4))
with pytest.deprecated_call():
list(phrasematcher.pipe(docs, n_threads=4))

View File

@ -1,14 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
from spacy.util import decaying
def test_issue3447():
sizes = decaying(10.0, 1.0, 0.5)
size = next(sizes)
assert size == 10.0
size = next(sizes)
assert size == 10.0 - 0.5
size = next(sizes)
assert size == 10.0 - 0.5 - 0.5

View File

@ -1,21 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from spacy.lang.en import English
@pytest.mark.xfail(reason="default suffix rules avoid one upper-case letter before dot")
def test_issue3449():
nlp = English()
nlp.add_pipe(nlp.create_pipe("sentencizer"))
text1 = "He gave the ball to I. Do you want to go to the movies with I?"
text2 = "He gave the ball to I. Do you want to go to the movies with I?"
text3 = "He gave the ball to I.\nDo you want to go to the movies with I?"
t1 = nlp(text1)
t2 = nlp(text2)
t3 = nlp(text3)
assert t1[5].text == "I"
assert t2[5].text == "I"
assert t3[5].text == "I"

View File

@ -1,21 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
from spacy.lang.en import English
from spacy.tokens import Doc
def test_issue3468():
"""Test that sentence boundaries are set correctly so Doc.is_sentenced can
be restored after serialization."""
nlp = English()
nlp.add_pipe(nlp.create_pipe("sentencizer"))
doc = nlp("Hello world")
assert doc[0].is_sent_start
assert doc.is_sentenced
assert len(list(doc.sents)) == 1
doc_bytes = doc.to_bytes()
new_doc = Doc(nlp.vocab).from_bytes(doc_bytes)
assert new_doc[0].is_sent_start
assert new_doc.is_sentenced
assert len(list(new_doc.sents)) == 1

View File

@ -0,0 +1,88 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from spacy.tokens import Span
from spacy.language import Language
from spacy.pipeline import EntityRuler
from spacy import load
import srsly
from ..util import make_tempdir
@pytest.fixture
def patterns():
return [
{"label": "HELLO", "pattern": "hello world"},
{"label": "BYE", "pattern": [{"LOWER": "bye"}, {"LOWER": "bye"}]},
{"label": "HELLO", "pattern": [{"ORTH": "HELLO"}]},
{"label": "COMPLEX", "pattern": [{"ORTH": "foo", "OP": "*"}]},
{"label": "TECH_ORG", "pattern": "Apple", "id": "a1"},
]
@pytest.fixture
def add_ent():
def add_ent_component(doc):
doc.ents = [Span(doc, 0, 3, label=doc.vocab.strings["ORG"])]
return doc
return add_ent_component
def test_entity_ruler_existing_overwrite_serialize_bytes(patterns, en_vocab):
nlp = Language(vocab=en_vocab)
ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
ruler_bytes = ruler.to_bytes()
assert len(ruler) == len(patterns)
assert len(ruler.labels) == 4
assert ruler.overwrite
new_ruler = EntityRuler(nlp)
new_ruler = new_ruler.from_bytes(ruler_bytes)
assert len(new_ruler) == len(ruler)
assert len(new_ruler.labels) == 4
assert new_ruler.overwrite == ruler.overwrite
assert new_ruler.ent_id_sep == ruler.ent_id_sep
def test_entity_ruler_existing_bytes_old_format_safe(patterns, en_vocab):
nlp = Language(vocab=en_vocab)
ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
bytes_old_style = srsly.msgpack_dumps(ruler.patterns)
new_ruler = EntityRuler(nlp)
new_ruler = new_ruler.from_bytes(bytes_old_style)
assert len(new_ruler) == len(ruler)
for pattern in ruler.patterns:
assert pattern in new_ruler.patterns
assert new_ruler.overwrite is not ruler.overwrite
def test_entity_ruler_from_disk_old_format_safe(patterns, en_vocab):
nlp = Language(vocab=en_vocab)
ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
with make_tempdir() as tmpdir:
out_file = tmpdir / "entity_ruler"
srsly.write_jsonl(out_file.with_suffix(".jsonl"), ruler.patterns)
new_ruler = EntityRuler(nlp).from_disk(out_file)
for pattern in ruler.patterns:
assert pattern in new_ruler.patterns
assert len(new_ruler) == len(ruler)
assert new_ruler.overwrite is not ruler.overwrite
def test_entity_ruler_in_pipeline_from_issue(patterns, en_vocab):
nlp = Language(vocab=en_vocab)
ruler = EntityRuler(nlp, overwrite_ents=True)
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
nlp.add_pipe(ruler)
with make_tempdir() as tmpdir:
nlp.to_disk(tmpdir)
ruler = nlp.get_pipe("entity_ruler")
assert ruler.patterns == [{"label": "ORG", "pattern": "Apple"}]
assert ruler.overwrite is True
nlp2 = load(tmpdir)
new_ruler = nlp2.get_pipe("entity_ruler")
assert new_ruler.patterns == [{"label": "ORG", "pattern": "Apple"}]
assert new_ruler.overwrite is True

View File

@ -0,0 +1,51 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
import spacy
from spacy.util import minibatch, compounding
def test_issue3611():
""" Test whether adding n-grams in the textcat works even when n > token length of some docs """
unique_classes = ["offensive", "inoffensive"]
x_train = ["This is an offensive text",
"This is the second offensive text",
"inoff"]
y_train = ["offensive", "offensive", "inoffensive"]
# preparing the data
pos_cats = list()
for train_instance in y_train:
pos_cats.append({label: label == train_instance for label in unique_classes})
train_data = list(zip(x_train, [{'cats': cats} for cats in pos_cats]))
# set up the spacy model with a text categorizer component
nlp = spacy.blank('en')
textcat = nlp.create_pipe(
"textcat",
config={
"exclusive_classes": True,
"architecture": "bow",
"ngram_size": 2
}
)
for label in unique_classes:
textcat.add_label(label)
nlp.add_pipe(textcat, last=True)
# training the network
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
with nlp.disable_pipes(*other_pipes):
optimizer = nlp.begin_training()
for i in range(3):
losses = {}
batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
texts, annotations = zip(*batch)
nlp.update(docs=texts, golds=annotations, sgd=optimizer, drop=0.1, losses=losses)

View File

@ -0,0 +1,10 @@
# coding: utf8
from __future__ import unicode_literals
from spacy.lang.hi import Hindi
def test_issue3625():
"""Test that default punctuation rules applies to hindi unicode characters"""
nlp = Hindi()
doc = nlp(u"hi. how हुए. होटल, होटल")
assert [token.text for token in doc] == ['hi', '.', 'how', 'हुए', '.', 'होटल', ',', 'होटल']

View File

@ -6,7 +6,6 @@ from spacy.matcher import Matcher
from spacy.tokens import Doc
@pytest.mark.xfail
def test_issue3839(en_vocab):
"""Test that match IDs returned by the matcher are correct, are in the string """
doc = Doc(en_vocab, words=["terrific", "group", "of", "people"])

View File

@ -0,0 +1,31 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from spacy.attrs import IS_ALPHA
from spacy.lang.en import English
@pytest.mark.parametrize(
"sentence",
[
'The story was to the effect that a young American student recently called on Professor Christlieb with a letter of introduction.',
'The next month Barry Siddall joined Stoke City on a free transfer, after Chris Pearce had established himself as the Vale\'s #1.',
'The next month Barry Siddall joined Stoke City on a free transfer, after Chris Pearce had established himself as the Vale\'s number one',
'Indeed, making the one who remains do all the work has installed him into a position of such insolent tyranny, it will take a month at least to reduce him to his proper proportions.',
"It was a missed assignment, but it shouldn't have resulted in a turnover ..."
],
)
def test_issue3869(sentence):
"""Test that the Doc's count_by function works consistently"""
nlp = English()
doc = nlp(sentence)
count = 0
for token in doc:
count += token.is_alpha
assert count == doc.count_by(IS_ALPHA).get(1, 0)

View File

@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals
from spacy.lang.en import English
def test_issue3880():
"""Test that `nlp.pipe()` works when an empty string ends the batch.
Fixed in v7.0.5 of Thinc.
"""
texts = ["hello", "world", "", ""]
nlp = English()
nlp.add_pipe(nlp.create_pipe("parser"))
nlp.add_pipe(nlp.create_pipe("ner"))
nlp.add_pipe(nlp.create_pipe("tagger"))
nlp.get_pipe("parser").add_label("dep")
nlp.get_pipe("ner").add_label("PERSON")
nlp.get_pipe("tagger").add_label("NN")
nlp.begin_training()
for doc in nlp.pipe(texts):
pass

View File

@ -0,0 +1,74 @@
# coding: utf-8
from __future__ import unicode_literals
from ..util import make_tempdir
from ...util import ensure_path
from spacy.kb import KnowledgeBase
def test_serialize_kb_disk(en_vocab):
# baseline assertions
kb1 = _get_dummy_kb(en_vocab)
_check_kb(kb1)
# dumping to file & loading back in
with make_tempdir() as d:
dir_path = ensure_path(d)
if not dir_path.exists():
dir_path.mkdir()
file_path = dir_path / "kb"
kb1.dump(str(file_path))
kb2 = KnowledgeBase(vocab=en_vocab, entity_vector_length=3)
kb2.load_bulk(str(file_path))
# final assertions
_check_kb(kb2)
def _get_dummy_kb(vocab):
kb = KnowledgeBase(vocab=vocab, entity_vector_length=3)
kb.add_entity(entity='Q53', prob=0.33, entity_vector=[0, 5, 3])
kb.add_entity(entity='Q17', prob=0.2, entity_vector=[7, 1, 0])
kb.add_entity(entity='Q007', prob=0.7, entity_vector=[0, 0, 7])
kb.add_entity(entity='Q44', prob=0.4, entity_vector=[4, 4, 4])
kb.add_alias(alias='double07', entities=['Q17', 'Q007'], probabilities=[0.1, 0.9])
kb.add_alias(alias='guy', entities=['Q53', 'Q007', 'Q17', 'Q44'], probabilities=[0.3, 0.3, 0.2, 0.1])
kb.add_alias(alias='random', entities=['Q007'], probabilities=[1.0])
return kb
def _check_kb(kb):
# check entities
assert kb.get_size_entities() == 4
for entity_string in ['Q53', 'Q17', 'Q007', 'Q44']:
assert entity_string in kb.get_entity_strings()
for entity_string in ['', 'Q0']:
assert entity_string not in kb.get_entity_strings()
# check aliases
assert kb.get_size_aliases() == 3
for alias_string in ['double07', 'guy', 'random']:
assert alias_string in kb.get_alias_strings()
for alias_string in ['nothingness', '', 'randomnoise']:
assert alias_string not in kb.get_alias_strings()
# check candidates & probabilities
candidates = sorted(kb.get_candidates('double07'), key=lambda x: x.entity_)
assert len(candidates) == 2
assert candidates[0].entity_ == 'Q007'
assert 0.6999 < candidates[0].entity_freq < 0.701
assert candidates[0].entity_vector == [0, 0, 7]
assert candidates[0].alias_ == 'double07'
assert 0.899 < candidates[0].prior_prob < 0.901
assert candidates[1].entity_ == 'Q17'
assert 0.199 < candidates[1].entity_freq < 0.201
assert candidates[1].entity_vector == [7, 1, 0]
assert candidates[1].alias_ == 'double07'
assert 0.099 < candidates[1].prior_prob < 0.101

View File

@ -11,29 +11,27 @@ from ..tokens import Doc
from ..attrs import SPACY, ORTH
class Binder(object):
class DocBox(object):
"""Serialize analyses from a collection of doc objects."""
def __init__(self, attrs=None):
"""Create a Binder object, to hold serialized annotations.
def __init__(self, attrs=None, store_user_data=False):
"""Create a DocBox object, to hold serialized annotations.
attrs (list): List of attributes to serialize. 'orth' and 'spacy' are
always serialized, so they're not required. Defaults to None.
"""
attrs = attrs or []
self.attrs = list(attrs)
# Ensure ORTH is always attrs[0]
if ORTH in self.attrs:
self.attrs.pop(ORTH)
if SPACY in self.attrs:
self.attrs.pop(SPACY)
self.attrs = [attr for attr in attrs if attr != ORTH and attr != SPACY]
self.attrs.insert(0, ORTH)
self.tokens = []
self.spaces = []
self.user_data = []
self.strings = set()
self.store_user_data = store_user_data
def add(self, doc):
"""Add a doc's annotations to the binder for serialization."""
"""Add a doc's annotations to the DocBox for serialization."""
array = doc.to_array(self.attrs)
if len(array.shape) == 1:
array = array.reshape((array.shape[0], 1))
@ -43,27 +41,35 @@ class Binder(object):
spaces = spaces.reshape((spaces.shape[0], 1))
self.spaces.append(numpy.asarray(spaces, dtype=bool))
self.strings.update(w.text for w in doc)
if self.store_user_data:
self.user_data.append(srsly.msgpack_dumps(doc.user_data))
def get_docs(self, vocab):
"""Recover Doc objects from the annotations, using the given vocab."""
for string in self.strings:
vocab[string]
orth_col = self.attrs.index(ORTH)
for tokens, spaces in zip(self.tokens, self.spaces):
for i in range(len(self.tokens)):
tokens = self.tokens[i]
spaces = self.spaces[i]
words = [vocab.strings[orth] for orth in tokens[:, orth_col]]
doc = Doc(vocab, words=words, spaces=spaces)
doc = doc.from_array(self.attrs, tokens)
if self.store_user_data:
doc.user_data.update(srsly.msgpack_loads(self.user_data[i]))
yield doc
def merge(self, other):
"""Extend the annotations of this binder with the annotations from another."""
"""Extend the annotations of this DocBox with the annotations from another."""
assert self.attrs == other.attrs
self.tokens.extend(other.tokens)
self.spaces.extend(other.spaces)
self.strings.update(other.strings)
if self.store_user_data:
self.user_data.extend(other.user_data)
def to_bytes(self):
"""Serialize the binder's annotations into a byte string."""
"""Serialize the DocBox's annotations into a byte string."""
for tokens in self.tokens:
assert len(tokens.shape) == 2, tokens.shape
lengths = [len(tokens) for tokens in self.tokens]
@ -74,10 +80,12 @@ class Binder(object):
"lengths": numpy.asarray(lengths, dtype="int32").tobytes("C"),
"strings": list(self.strings),
}
if self.store_user_data:
msg["user_data"] = self.user_data
return gzip.compress(srsly.msgpack_dumps(msg))
def from_bytes(self, string):
"""Deserialize the binder's annotations from a byte string."""
"""Deserialize the DocBox's annotations from a byte string."""
msg = srsly.msgpack_loads(gzip.decompress(string))
self.attrs = msg["attrs"]
self.strings = set(msg["strings"])
@ -89,29 +97,38 @@ class Binder(object):
flat_spaces = flat_spaces.reshape((flat_spaces.size, 1))
self.tokens = NumpyOps().unflatten(flat_tokens, lengths)
self.spaces = NumpyOps().unflatten(flat_spaces, lengths)
if self.store_user_data and "user_data" in msg:
self.user_data = list(msg["user_data"])
for tokens in self.tokens:
assert len(tokens.shape) == 2, tokens.shape
return self
def merge_bytes(binder_strings):
"""Concatenate multiple serialized binders into one byte string."""
output = None
for byte_string in binder_strings:
binder = Binder().from_bytes(byte_string)
if output is None:
output = binder
else:
output.merge(binder)
return output.to_bytes()
def merge_boxes(boxes):
merged = None
for byte_string in boxes:
if byte_string is not None:
box = DocBox(store_user_data=True).from_bytes(byte_string)
if merged is None:
merged = box
else:
merged.merge(box)
if merged is not None:
return merged.to_bytes()
else:
return b""
def pickle_binder(binder):
return (unpickle_binder, (binder.to_bytes(),))
def pickle_box(box):
return (unpickle_box, (box.to_bytes(),))
def unpickle_binder(byte_string):
return Binder().from_bytes(byte_string)
def unpickle_box(byte_string):
return DocBox().from_bytes(byte_string)
copy_reg.pickle(Binder, pickle_binder, unpickle_binder)
copy_reg.pickle(DocBox, pickle_box, unpickle_box)
# Compatibility, as we had named it this previously.
Binder = DocBox
__all__ = ["DocBox"]

View File

@ -1,6 +1,5 @@
from cymem.cymem cimport Pool
cimport numpy as np
from preshed.counter cimport PreshCounter
from ..vocab cimport Vocab
from ..structs cimport TokenC, LexemeC

View File

@ -9,6 +9,7 @@ cimport cython
cimport numpy as np
from libc.string cimport memcpy, memset
from libc.math cimport sqrt
from collections import Counter
import numpy
import numpy.linalg
@ -22,7 +23,7 @@ from ..lexeme cimport Lexeme, EMPTY_LEXEME
from ..typedefs cimport attr_t, flags_t
from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER
from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
from ..attrs cimport ENT_TYPE, SENT_START, attr_id_t
from ..attrs cimport ENT_TYPE, ENT_KB_ID, SENT_START, attr_id_t
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
from ..attrs import intify_attrs, IDS
@ -64,6 +65,8 @@ cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
return token.ent_iob
elif feat_name == ENT_TYPE:
return token.ent_type
elif feat_name == ENT_KB_ID:
return token.ent_kb_id
else:
return Lexeme.get_struct_attr(token.lex, feat_name)
@ -85,13 +88,14 @@ cdef class Doc:
Python-level `Token` and `Span` objects are views of this array, i.e.
they don't own the data themselves.
EXAMPLE: Construction 1
EXAMPLE:
Construction 1
>>> doc = nlp(u'Some text')
Construction 2
>>> from spacy.tokens import Doc
>>> doc = Doc(nlp.vocab, words=[u'hello', u'world', u'!'],
spaces=[True, False, False])
>>> spaces=[True, False, False])
DOCS: https://spacy.io/api/doc
"""
@ -237,6 +241,8 @@ cdef class Doc:
return True
if self.is_parsed:
return True
if len(self) < 2:
return True
for i in range(1, self.length):
if self.c[i].sent_start == -1 or self.c[i].sent_start == 1:
return True
@ -248,6 +254,8 @@ cdef class Doc:
*any* of the tokens has a named entity tag set (even if the others are
unknown values).
"""
if len(self) == 0:
return True
for i in range(self.length):
if self.c[i].ent_iob != 0:
return True
@ -690,7 +698,7 @@ cdef class Doc:
# Handle 1d case
return output if len(attr_ids) >= 2 else output.reshape((self.length,))
def count_by(self, attr_id_t attr_id, exclude=None, PreshCounter counts=None):
def count_by(self, attr_id_t attr_id, exclude=None, object counts=None):
"""Count the frequencies of a given attribute. Produces a dict of
`{attribute (int): count (ints)}` frequencies, keyed by the values of
the given attribute ID.
@ -705,19 +713,18 @@ cdef class Doc:
cdef size_t count
if counts is None:
counts = PreshCounter()
counts = Counter()
output_dict = True
else:
output_dict = False
# Take this check out of the loop, for a bit of extra speed
if exclude is None:
for i in range(self.length):
counts.inc(get_token_attr(&self.c[i], attr_id), 1)
counts[get_token_attr(&self.c[i], attr_id)] += 1
else:
for i in range(self.length):
if not exclude(self[i]):
attr = get_token_attr(&self.c[i], attr_id)
counts.inc(attr, 1)
counts[get_token_attr(&self.c[i], attr_id)] += 1
if output_dict:
return dict(counts)
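As a quick illustration of the behaviour this change preserves (a sketch, assuming a plain English pipeline), count_by returns a dict keyed by the attribute value, e.g. the ORTH hash:
from spacy.lang.en import English
from spacy.attrs import ORTH

nlp = English()
doc = nlp("apple apple orange")
counts = doc.count_by(ORTH)            # {orth_hash: frequency}
assert counts[nlp.vocab.strings["apple"]] == 2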
@ -850,7 +857,7 @@ cdef class Doc:
DOCS: https://spacy.io/api/doc#to_bytes
"""
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE]
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE] # TODO: ENT_KB_ID ?
if self.is_tagged:
array_head.append(TAG)
# If doc parsed add head and dep attribute
@ -1004,6 +1011,7 @@ cdef class Doc:
"""
cdef unicode tag, lemma, ent_type
deprecation_warning(Warnings.W013.format(obj="Doc"))
# TODO: ENT_KB_ID ?
if len(args) == 3:
deprecation_warning(Warnings.W003)
tag, lemma, ent_type = args

View File

@ -210,7 +210,7 @@ cdef class Span:
words = [t.text for t in self]
spaces = [bool(t.whitespace_) for t in self]
cdef Doc doc = Doc(self.doc.vocab, words=words, spaces=spaces)
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE]
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_KB_ID]
if self.doc.is_tagged:
array_head.append(TAG)
# If doc parsed add head and dep attribute

View File

@ -53,6 +53,8 @@ cdef class Token:
return token.ent_iob
elif feat_name == ENT_TYPE:
return token.ent_type
elif feat_name == ENT_KB_ID:
return token.ent_kb_id
elif feat_name == SENT_START:
return token.sent_start
else:
@ -79,5 +81,7 @@ cdef class Token:
token.ent_iob = value
elif feat_name == ENT_TYPE:
token.ent_type = value
elif feat_name == ENT_KB_ID:
token.ent_kb_id = value
elif feat_name == SENT_START:
token.sent_start = value

View File

@ -520,7 +520,9 @@ spaCy takes training data in JSON format. The built-in
[`convert`](/api/cli#convert) command helps you convert the `.conllu` format
used by the
[Universal Dependencies corpora](https://github.com/UniversalDependencies) to
spaCy's training format.
spaCy's training format. To convert one or more existing `Doc` objects to
spaCy's JSON format, you can use the
[`gold.docs_to_json`](/api/goldparse#docs_to_json) helper.
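For example, a rough sketch of that conversion (the model name, example texts and output path are placeholders; check the `docs_to_json` signature for your spaCy version):

```python
import spacy
import srsly
from spacy.gold import docs_to_json

nlp = spacy.load("en_core_web_sm")  # placeholder model
docs = list(nlp.pipe(["I like London.", "She works in Berlin."]))
json_data = docs_to_json(docs)            # one JSON "document" wrapping all Docs
srsly.write_json("train.json", [json_data])  # the train command expects a list
```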
> #### Annotating entities
>

Some files were not shown because too many files have changed in this diff.