Merge branch 'master' into develop

Ines Montani 2020-02-18 14:47:23 +01:00
commit de11ea753a
79 changed files with 3539 additions and 498 deletions

106
.github/contributors/AlJohri.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Al Johri |
| Company name (if applicable) | N/A |
| Title or role (if applicable) | N/A |
| Date | December 27th, 2019 |
| GitHub username | AlJohri |
| Website (optional) | http://aljohri.com/ |

106
.github/contributors/Jan-711.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jan Jessewitsch |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 16.02.2020 |
| GitHub username | Jan-711 |
| Website (optional) | |

106
.github/contributors/ceteri.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ---------------------- |
| Name | Paco Nathan |
| Company name (if applicable) | Derwen, Inc. |
| Title or role (if applicable) | Managing Partner |
| Date | 2020-01-25 |
| GitHub username | ceteri |
| Website (optional) | https://derwen.ai/paco |

106
.github/contributors/drndos.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Filip Bednárik |
| Company name (if applicable) | Ardevop, s. r. o. |
| Title or role (if applicable) | IT Consultant |
| Date | 2020-01-26 |
| GitHub username | drndos |
| Website (optional) | https://ardevop.sk |

106
.github/contributors/iechevarria.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | --------------------- |
| Name | Ivan Echevarria |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-12-24 |
| GitHub username | iechevarria |
| Website (optional) | https://echevarria.io |

106
.github/contributors/iurshina.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Anastasiia Iurshina |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 28.12.2019 |
| GitHub username | iurshina |
| Website (optional) | |

106
.github/contributors/onlyanegg.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
- Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
- to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
- each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
| ----------------------------- | ---------------- |
| Name | Tyler Couto |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | January 29, 2020 |
| GitHub username | onlyanegg |
| Website (optional) | |

View File

@@ -1,5 +1,5 @@
recursive-include include *.h
recursive-include spacy *.pyx *.pxd *.txt
recursive-include spacy *.txt *.pyx *.pxd
include LICENSE
include README.md
include bin/spacy

View File

@@ -1 +1,2 @@
#! /bin/sh
python -m spacy "$@"

View File

@@ -7,16 +7,17 @@ Run `wikipedia_pretrain_kb.py`
* WikiData: get `latest-all.json.bz2` from https://dumps.wikimedia.org/wikidatawiki/entities/
* Wikipedia: get `enwiki-latest-pages-articles-multistream.xml.bz2` from https://dumps.wikimedia.org/enwiki/latest/ (or for any other language)
* You can set the filtering parameters for KB construction:
* `max_per_alias`: (max) number of candidate entities in the KB per alias/synonym
* `min_freq`: threshold of number of times an entity should occur in the corpus to be included in the KB
* `min_pair`: threshold of number of times an entity+alias combination should occur in the corpus to be included in the KB
* `max_per_alias` (`-a`): (max) number of candidate entities in the KB per alias/synonym
* `min_freq` (`-f`): threshold of number of times an entity should occur in the corpus to be included in the KB
* `min_pair` (`-c`): threshold of number of times an entity+alias combination should occur in the corpus to be included in the KB
* Further parameters to set:
* `descriptions_from_wikipedia`: whether to parse descriptions from Wikipedia (`True`) or Wikidata (`False`)
* `entity_vector_length`: length of the pre-trained entity description vectors
* `lang`: language for which to fetch Wikidata information (as the dump contains all languages)
* `descriptions_from_wikipedia` (`-wp`): whether to parse descriptions from Wikipedia (`True`) or Wikidata (`False`)
* `entity_vector_length` (`-v`): length of the pre-trained entity description vectors
* `lang` (`-la`): language for which to fetch Wikidata information (as the dump contains all languages)
Quick testing and rerunning:
* When trying out the pipeline for a quick test, set `limit_prior`, `limit_train` and/or `limit_wd` to read only parts of the dumps instead of everything.
* When trying out the pipeline for a quick test, set `limit_prior` (`-lp`), `limit_train` (`-lt`) and/or `limit_wd` (`-lw`) to read only parts of the dumps instead of everything.
* e.g. set `-lt 20000 -lp 2000 -lw 3000 -f 1`
* If you only want to (re)run certain parts of the pipeline, just remove the corresponding files and they will be recalculated or reparsed.
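
For orientation, the sketch below strings the documented flags into a single command via `subprocess`. Only the flag abbreviations come from the list above; the script path, the positional arguments (dump locations and output directory), and their order are assumptions, so adjust them to the script's actual signature.

```python
# Hedged sketch: flag names are the ones documented above; positional arguments,
# their order, and the script path are assumptions.
import subprocess
import sys

subprocess.run(
    [
        sys.executable, "wikipedia_pretrain_kb.py",          # script as referenced above
        "latest-all.json.bz2",                               # assumed: WikiData dump
        "enwiki-latest-pages-articles-multistream.xml.bz2",  # assumed: Wikipedia dump
        "output/kb_dir",                                     # assumed: output directory
        "-a", "10",                                          # max_per_alias
        "-f", "20",                                          # min_freq
        "-c", "5",                                           # min_pair
        "-v", "64",                                          # entity_vector_length
        "-la", "en",                                         # lang
        "-lp", "2000", "-lt", "20000", "-lw", "3000",        # quick-test limits
    ],
    check=True,
)
```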
@@ -24,11 +25,13 @@ Quick testing and rerunning:
Run `wikidata_train_entity_linker.py`
* This takes the **KB directory** produced by Step 1, and trains an **Entity Linking model**
* Specify the output directory (`-o`) in which the final, trained model will be saved
* You can set the learning parameters for the EL training:
* `epochs`: number of training iterations
* `dropout`: dropout rate
* `lr`: learning rate
* `l2`: L2 regularization
* Specify the number of training and dev testing entities with `train_inst` and `dev_inst` respectively
* `epochs` (`-e`): number of training iterations
* `dropout` (`-p`): dropout rate
* `lr` (`-n`): learning rate
* `l2` (`-r`): L2 regularization
* Specify the number of training and dev testing articles with `train_articles` (`-t`) and `dev_articles` (`-d`) respectively
* If not specified, the full dataset will be processed - this may take a LONG time!
* Further parameters to set:
* `labels_discard`: NER label types to discard during training
* `labels_discard` (`-l`): NER label types to discard during training
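
As a rough guide, the sketch below passes the training options listed above as CLI flags via `subprocess`. Only the flags are documented here; the positional KB directory produced by Step 1 and the example label list are assumptions, and remaining options are left at their defaults.

```python
# Hedged sketch: flags match the options documented above; the positional KB
# directory and the example labels are assumptions.
import subprocess
import sys

subprocess.run(
    [
        sys.executable, "wikidata_train_entity_linker.py",
        "output/kb_dir",       # assumed positional: KB directory produced by Step 1
        "-o", "output/model",  # output directory for the trained pipeline
        "-e", "10",            # epochs
        "-p", "0.5",           # dropout
        "-n", "0.005",         # learning rate
        "-r", "1e-6",          # L2 regularization
        "-t", "20000",         # train_articles per epoch
        "-d", "2000",          # dev_articles for evaluation
        "-l", "PERSON,MONEY",  # labels_discard: comma-separated NER types (example values)
    ],
    check=True,
)
```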

View File

@@ -1,6 +1,8 @@
# coding: utf-8
from __future__ import unicode_literals
import logging
import random
from tqdm import tqdm
from collections import defaultdict
@@ -92,133 +94,110 @@ class BaselineResults(object):
self.random.update_metrics(ent_label, true_entity, random_candidate)
def measure_performance(dev_data, kb, el_pipe, baseline=True, context=True):
if baseline:
baseline_accuracies, counts = measure_baselines(dev_data, kb)
logger.info("Counts: {}".format({k: v for k, v in sorted(counts.items())}))
logger.info(baseline_accuracies.report_performance("random"))
logger.info(baseline_accuracies.report_performance("prior"))
logger.info(baseline_accuracies.report_performance("oracle"))
def measure_performance(dev_data, kb, el_pipe, baseline=True, context=True, dev_limit=None):
counts = dict()
baseline_results = BaselineResults()
context_results = EvaluationResults()
combo_results = EvaluationResults()
if context:
# using only context
el_pipe.cfg["incl_context"] = True
el_pipe.cfg["incl_prior"] = False
results = get_eval_results(dev_data, el_pipe)
logger.info(results.report_metrics("context only"))
# measuring combined accuracy (prior + context)
el_pipe.cfg["incl_context"] = True
el_pipe.cfg["incl_prior"] = True
results = get_eval_results(dev_data, el_pipe)
logger.info(results.report_metrics("context and prior"))
def get_eval_results(data, el_pipe=None):
"""
Evaluate the ent.kb_id_ annotations against the gold standard.
Only evaluate entities that overlap between gold and NER, to isolate the performance of the NEL.
If the docs in the data require further processing with an entity linker, set el_pipe.
"""
docs = []
golds = []
for d, g in tqdm(data, leave=False):
if len(d) > 0:
golds.append(g)
if el_pipe is not None:
docs.append(el_pipe(d))
else:
docs.append(d)
results = EvaluationResults()
for doc, gold in zip(docs, golds):
try:
correct_entries_per_article = dict()
for doc, gold in tqdm(dev_data, total=dev_limit, leave=False, desc='Processing dev data'):
if len(doc) > 0:
correct_ents = dict()
for entity, kb_dict in gold.links.items():
start, end = entity
for gold_kb, value in kb_dict.items():
if value:
# only evaluating on positive examples
offset = _offset(start, end)
correct_entries_per_article[offset] = gold_kb
correct_ents[offset] = gold_kb
for ent in doc.ents:
ent_label = ent.label_
pred_entity = ent.kb_id_
start = ent.start_char
end = ent.end_char
offset = _offset(start, end)
gold_entity = correct_entries_per_article.get(offset, None)
# the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
if gold_entity is not None:
results.update_metrics(ent_label, gold_entity, pred_entity)
if baseline:
_add_baseline(baseline_results, counts, doc, correct_ents, kb)
except Exception as e:
logging.error("Error assessing accuracy " + str(e))
if context:
# using only context
el_pipe.cfg["incl_context"] = True
el_pipe.cfg["incl_prior"] = False
_add_eval_result(context_results, doc, correct_ents, el_pipe)
return results
# measuring combined accuracy (prior + context)
el_pipe.cfg["incl_context"] = True
el_pipe.cfg["incl_prior"] = True
_add_eval_result(combo_results, doc, correct_ents, el_pipe)
if baseline:
logger.info("Counts: {}".format({k: v for k, v in sorted(counts.items())}))
logger.info(baseline_results.report_performance("random"))
logger.info(baseline_results.report_performance("prior"))
logger.info(baseline_results.report_performance("oracle"))
if context:
logger.info(context_results.report_metrics("context only"))
logger.info(combo_results.report_metrics("context and prior"))
def measure_baselines(data, kb):
def _add_eval_result(results, doc, correct_ents, el_pipe):
"""
Measure 3 performance baselines: random selection, prior probabilities, and 'oracle' prediction for upper bound.
Evaluate the ent.kb_id_ annotations against the gold standard.
Only evaluate entities that overlap between gold and NER, to isolate the performance of the NEL.
Also return a dictionary of counts by entity label.
"""
counts_d = dict()
baseline_results = BaselineResults()
docs = [d for d, g in data if len(d) > 0]
golds = [g for d, g in data if len(d) > 0]
for doc, gold in zip(docs, golds):
correct_entries_per_article = dict()
for entity, kb_dict in gold.links.items():
start, end = entity
for gold_kb, value in kb_dict.items():
# only evaluating on positive examples
if value:
offset = _offset(start, end)
correct_entries_per_article[offset] = gold_kb
try:
doc = el_pipe(doc)
for ent in doc.ents:
ent_label = ent.label_
start = ent.start_char
end = ent.end_char
offset = _offset(start, end)
gold_entity = correct_entries_per_article.get(offset, None)
gold_entity = correct_ents.get(offset, None)
# the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
if gold_entity is not None:
candidates = kb.get_candidates(ent.text)
oracle_candidate = ""
prior_candidate = ""
random_candidate = ""
if candidates:
scores = []
pred_entity = ent.kb_id_
results.update_metrics(ent_label, gold_entity, pred_entity)
for c in candidates:
scores.append(c.prior_prob)
if c.entity_ == gold_entity:
oracle_candidate = c.entity_
except Exception as e:
logging.error("Error assessing accuracy " + str(e))
best_index = scores.index(max(scores))
prior_candidate = candidates[best_index].entity_
random_candidate = random.choice(candidates).entity_
current_count = counts_d.get(ent_label, 0)
counts_d[ent_label] = current_count+1
def _add_baseline(baseline_results, counts, doc, correct_ents, kb):
"""
Measure 3 performance baselines: random selection, prior probabilities, and 'oracle' prediction for upper bound.
Only evaluate entities that overlap between gold and NER, to isolate the performance of the NEL.
"""
for ent in doc.ents:
ent_label = ent.label_
start = ent.start_char
end = ent.end_char
offset = _offset(start, end)
gold_entity = correct_ents.get(offset, None)
baseline_results.update_baselines(
gold_entity,
ent_label,
random_candidate,
prior_candidate,
oracle_candidate,
)
# the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
if gold_entity is not None:
candidates = kb.get_candidates(ent.text)
oracle_candidate = ""
prior_candidate = ""
random_candidate = ""
if candidates:
scores = []
return baseline_results, counts_d
for c in candidates:
scores.append(c.prior_prob)
if c.entity_ == gold_entity:
oracle_candidate = c.entity_
best_index = scores.index(max(scores))
prior_candidate = candidates[best_index].entity_
random_candidate = random.choice(candidates).entity_
current_count = counts.get(ent_label, 0)
counts[ent_label] = current_count+1
baseline_results.update_baselines(
gold_entity,
ent_label,
random_candidate,
prior_candidate,
oracle_candidate,
)
def _offset(start, end):
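
As a stand-alone recap of the matching logic in this evaluation code, the toy example below keys gold links and predictions by character offsets and scores only overlapping spans; the `_offset` body here is a placeholder, since the real helper's implementation is cut off in this hunk.

```python
# Toy recap: gold links and predicted entities are matched on (start_char, end_char)
# offsets; spans without a gold link are skipped rather than counted as wrong.
def _offset(start, end):
    # placeholder for the real helper, whose body is not shown in this hunk
    return "{}_{}".format(start, end)

gold_links = {(0, 13): {"Q42": 1.0}}          # toy gold annotation: span -> {kb_id: value}
predictions = [
    (0, 13, "PERSON", "Q42"),                 # overlaps a gold link -> evaluated
    (21, 27, "GPE", "Q148"),                  # no gold link -> skipped
]

correct_ents = {
    _offset(start, end): gold_kb
    for (start, end), kb_dict in gold_links.items()
    for gold_kb, value in kb_dict.items()
    if value                                  # positive examples only
}

for start, end, label, pred_kb in predictions:
    gold_kb = correct_ents.get(_offset(start, end))
    if gold_kb is not None:
        print(label, "correct" if pred_kb == gold_kb else "wrong")
```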

View File

@@ -40,7 +40,7 @@ logger = logging.getLogger(__name__)
loc_prior_prob=("Location to file with prior probabilities", "option", "p", Path),
loc_entity_defs=("Location to file with entity definitions", "option", "d", Path),
loc_entity_desc=("Location to file with entity descriptions", "option", "s", Path),
descr_from_wp=("Flag for using wp descriptions not wd", "flag", "wp"),
descr_from_wp=("Flag for using descriptions from WP instead of WD (default False)", "flag", "wp"),
limit_prior=("Threshold to limit lines read from WP for prior probabilities", "option", "lp", int),
limit_train=("Threshold to limit lines read from WP for training set", "option", "lt", int),
limit_wd=("Threshold to limit lines read from WD", "option", "lw", int),

View File

@@ -1,5 +1,5 @@
# coding: utf-8
"""Script to take a previously created Knowledge Base and train an entity linking
"""Script that takes a previously created Knowledge Base and trains an entity linking
pipeline. The provided KB directory should hold the kb, the original nlp object and
its vocab used to create the KB, and a few auxiliary files such as the entity definitions,
as created by the script `wikidata_create_kb`.
@@ -14,9 +14,16 @@ import logging
import spacy
from pathlib import Path
import plac
from tqdm import tqdm
from bin.wiki_entity_linking import wikipedia_processor
from bin.wiki_entity_linking import TRAINING_DATA_FILE, KB_MODEL_DIR, KB_FILE, LOG_FORMAT, OUTPUT_MODEL_DIR
from bin.wiki_entity_linking import (
TRAINING_DATA_FILE,
KB_MODEL_DIR,
KB_FILE,
LOG_FORMAT,
OUTPUT_MODEL_DIR,
)
from bin.wiki_entity_linking.entity_linker_evaluation import measure_performance
from bin.wiki_entity_linking.kb_creator import read_kb
@@ -33,8 +40,8 @@ logger = logging.getLogger(__name__)
dropout=("Dropout to prevent overfitting (default 0.5)", "option", "p", float),
lr=("Learning rate (default 0.005)", "option", "n", float),
l2=("L2 regularization", "option", "r", float),
train_inst=("# training instances (default 90% of all)", "option", "t", int),
dev_inst=("# test instances (default 10% of all)", "option", "d", int),
train_articles=("# training articles (default 90% of all)", "option", "t", int),
dev_articles=("# dev test articles (default 10% of all)", "option", "d", int),
labels_discard=("NER labels to discard (default None)", "option", "l", str),
)
def main(
@@ -45,10 +52,15 @@ def main(
dropout=0.5,
lr=0.005,
l2=1e-6,
train_inst=None,
dev_inst=None,
labels_discard=None
train_articles=None,
dev_articles=None,
labels_discard=None,
):
if not output_dir:
logger.warning(
"No output dir specified so no results will be written, are you sure about this ?"
)
logger.info("Creating Entity Linker with Wikipedia and WikiData")
output_dir = Path(output_dir) if output_dir else dir_kb
@@ -64,47 +76,57 @@ def main(
# STEP 1 : load the NLP object
logger.info("STEP 1a: Loading model from {}".format(nlp_dir))
nlp = spacy.load(nlp_dir)
logger.info("STEP 1b: Loading KB from {}".format(kb_path))
kb = read_kb(nlp, kb_path)
logger.info(
"Original NLP pipeline has following pipeline components: {}".format(
nlp.pipe_names
)
)
# check that there is a NER component in the pipeline
if "ner" not in nlp.pipe_names:
raise ValueError("The `nlp` object should have a pretrained `ner` component.")
# STEP 2: read the training dataset previously created from WP
logger.info("STEP 2: Reading training dataset from {}".format(training_path))
logger.info("STEP 1b: Loading KB from {}".format(kb_path))
kb = read_kb(nlp, kb_path)
# STEP 2: read the training dataset previously created from WP
logger.info("STEP 2: Reading training & dev dataset from {}".format(training_path))
train_indices, dev_indices = wikipedia_processor.read_training_indices(
training_path
)
logger.info(
"Training set has {} articles, limit set to roughly {} articles per epoch".format(
len(train_indices), train_articles if train_articles else "all"
)
)
logger.info(
"Dev set has {} articles, limit set to rougly {} articles for evaluation".format(
len(dev_indices), dev_articles if dev_articles else "all"
)
)
if dev_articles:
dev_indices = dev_indices[0:dev_articles]
# STEP 3: create and train an entity linking pipe
logger.info(
"STEP 3: Creating and training an Entity Linking pipe for {} epochs".format(
epochs
)
)
if labels_discard:
labels_discard = [x.strip() for x in labels_discard.split(",")]
logger.info("Discarding {} NER types: {}".format(len(labels_discard), labels_discard))
logger.info(
"Discarding {} NER types: {}".format(len(labels_discard), labels_discard)
)
else:
labels_discard = []
train_data = wikipedia_processor.read_training(
nlp=nlp,
entity_file_path=training_path,
dev=False,
limit=train_inst,
kb=kb,
labels_discard=labels_discard
)
# for testing, get all pos instances (independently of KB)
dev_data = wikipedia_processor.read_training(
nlp=nlp,
entity_file_path=training_path,
dev=True,
limit=dev_inst,
kb=None,
labels_discard=labels_discard
)
# STEP 3: create and train an entity linking pipe
logger.info("STEP 3: Creating and training an Entity Linking pipe")
el_pipe = nlp.create_pipe(
name="entity_linker", config={"pretrained_vectors": nlp.vocab.vectors,
"labels_discard": labels_discard}
name="entity_linker",
config={
"pretrained_vectors": nlp.vocab.vectors,
"labels_discard": labels_discard,
},
)
el_pipe.set_kb(kb)
nlp.add_pipe(el_pipe, last=True)
@@ -115,78 +137,96 @@ def main(
optimizer.learn_rate = lr
optimizer.L2 = l2
logger.info("Training on {} articles".format(len(train_data)))
logger.info("Dev testing on {} articles".format(len(dev_data)))
# baseline performance on dev data
logger.info("Dev Baseline Accuracies:")
measure_performance(dev_data, kb, el_pipe, baseline=True, context=False)
dev_data = wikipedia_processor.read_el_docs_golds(
nlp=nlp,
entity_file_path=training_path,
dev=True,
line_ids=dev_indices,
kb=kb,
labels_discard=labels_discard,
)
measure_performance(
dev_data, kb, el_pipe, baseline=True, context=False, dev_limit=len(dev_indices)
)
for itn in range(epochs):
random.shuffle(train_data)
random.shuffle(train_indices)
losses = {}
batches = minibatch(train_data, size=compounding(4.0, 128.0, 1.001))
batches = minibatch(train_indices, size=compounding(8.0, 128.0, 1.001))
batchnr = 0
articles_processed = 0
with nlp.disable_pipes(*other_pipes):
# we either process the whole training file, or just a part each epoch
bar_total = len(train_indices)
if train_articles:
bar_total = train_articles
with tqdm(total=bar_total, leave=False, desc=f"Epoch {itn}") as pbar:
for batch in batches:
try:
nlp.update(
examples=batch,
sgd=optimizer,
drop=dropout,
losses=losses,
)
batchnr += 1
except Exception as e:
logger.error("Error updating batch:" + str(e))
if not train_articles or articles_processed < train_articles:
with nlp.disable_pipes("entity_linker"):
train_batch = wikipedia_processor.read_el_docs_golds(
nlp=nlp,
entity_file_path=training_path,
dev=False,
line_ids=batch,
kb=kb,
labels_discard=labels_discard,
)
docs, golds = zip(*train_batch)
try:
with nlp.disable_pipes(*other_pipes):
nlp.update(
docs=docs,
golds=golds,
sgd=optimizer,
drop=dropout,
losses=losses,
)
batchnr += 1
articles_processed += len(docs)
pbar.update(len(docs))
except Exception as e:
logger.error("Error updating batch:" + str(e))
if batchnr > 0:
logging.info("Epoch {}, train loss {}".format(itn, round(losses["entity_linker"] / batchnr, 2)))
measure_performance(dev_data, kb, el_pipe, baseline=False, context=True)
# STEP 4: measure the performance of our trained pipe on an independent dev set
logger.info("STEP 4: Final performance measurement of Entity Linking pipe")
measure_performance(dev_data, kb, el_pipe)
# STEP 5: apply the EL pipe on a toy example
logger.info("STEP 5: Applying Entity Linking to toy example")
run_el_toy_example(nlp=nlp)
logging.info(
"Epoch {} trained on {} articles, train loss {}".format(
itn, articles_processed, round(losses["entity_linker"] / batchnr, 2)
)
)
# re-read the dev_data (data is returned as a generator)
dev_data = wikipedia_processor.read_el_docs_golds(
nlp=nlp,
entity_file_path=training_path,
dev=True,
line_ids=dev_indices,
kb=kb,
labels_discard=labels_discard,
)
measure_performance(
dev_data,
kb,
el_pipe,
baseline=False,
context=True,
dev_limit=len(dev_indices),
)
if output_dir:
# STEP 6: write the NLP pipeline (now including an EL model) to file
logger.info("STEP 6: Writing trained NLP to {}".format(nlp_output_dir))
# STEP 4: write the NLP pipeline (now including an EL model) to file
logger.info(
"Final NLP pipeline has following pipeline components: {}".format(
nlp.pipe_names
)
)
logger.info("STEP 4: Writing trained NLP to {}".format(nlp_output_dir))
nlp.to_disk(nlp_output_dir)
logger.info("Done!")
def check_kb(kb):
for mention in ("Bush", "Douglas Adams", "Homer", "Brazil", "China"):
candidates = kb.get_candidates(mention)
logger.info("generating candidates for " + mention + " :")
for c in candidates:
logger.info(" ".join[
str(c.prior_prob),
c.alias_,
"-->",
c.entity_ + " (freq=" + str(c.entity_freq) + ")"
])
def run_el_toy_example(nlp):
text = (
"In The Hitchhiker's Guide to the Galaxy, written by Douglas Adams, "
"Douglas reminds us to always bring our towel, even in China or Brazil. "
"The main character in Doug's novel is the man Arthur Dent, "
"but Dougledydoug doesn't write about George Washington or Homer Simpson."
)
doc = nlp(text)
logger.info(text)
for ent in doc.ents:
logger.info(" ".join(["ent", ent.text, ent.label_, ent.kb_id_]))
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)
plac.call(main)
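
The training loop above batches article indices with spaCy's `minibatch` and `compounding` utilities; the self-contained snippet below only shows how that batch schedule behaves, using a dummy index list in place of `train_indices`.

```python
# Batch sizes grow geometrically from 8 towards 128, so early updates are small
# and later ones larger; the index list here is a stand-in for train_indices.
from spacy.util import minibatch, compounding

train_indices = list(range(1000))
for i, batch in enumerate(minibatch(train_indices, size=compounding(8.0, 128.0, 1.001))):
    if i % 20 == 0:
        print("batch", i, "size", len(batch))
```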

View File

@@ -6,9 +6,6 @@ import bz2
import logging
import random
import json
from tqdm import tqdm
from functools import partial
from spacy.gold import GoldParse
from bin.wiki_entity_linking import wiki_io as io
@@ -454,25 +451,40 @@ def _write_training_entities(outputfile, article_id, clean_text, entities):
outputfile.write(line)
def read_training(nlp, entity_file_path, dev, limit, kb, labels_discard=None):
""" This method provides training examples that correspond to the entity annotations found by the nlp object.
def read_training_indices(entity_file_path):
""" This method creates two lists of indices into the training file: one with indices for the
training examples, and one for the dev examples."""
train_indices = []
dev_indices = []
with entity_file_path.open("r", encoding="utf8") as file:
for i, line in enumerate(file):
example = json.loads(line)
article_id = example["article_id"]
clean_text = example["clean_text"]
if is_valid_article(clean_text):
if is_dev(article_id):
dev_indices.append(i)
else:
train_indices.append(i)
return train_indices, dev_indices
def read_el_docs_golds(nlp, entity_file_path, dev, line_ids, kb, labels_discard=None):
""" This method provides training/dev examples that correspond to the entity annotations found by the nlp object.
For training, it will include both positive and negative examples by using the candidate generator from the kb.
For testing (kb=None), it will include all positive examples only."""
if not labels_discard:
labels_discard = []
data = []
num_entities = 0
get_gold_parse = partial(
_get_gold_parse, dev=dev, kb=kb, labels_discard=labels_discard
)
texts = []
entities_list = []
logger.info(
"Reading {} data with limit {}".format("dev" if dev else "train", limit)
)
with entity_file_path.open("r", encoding="utf8") as file:
with tqdm(total=limit, leave=False) as pbar:
for i, line in enumerate(file):
for i, line in enumerate(file):
if i in line_ids:
example = json.loads(line)
article_id = example["article_id"]
clean_text = example["clean_text"]
@ -481,16 +493,15 @@ def read_training(nlp, entity_file_path, dev, limit, kb, labels_discard=None):
if dev != is_dev(article_id) or not is_valid_article(clean_text):
continue
doc = nlp(clean_text)
gold = get_gold_parse(doc, entities)
if gold and len(gold.links) > 0:
data.append((doc, gold))
num_entities += len(gold.links)
pbar.update(len(gold.links))
if limit and num_entities >= limit:
break
logger.info("Read {} entities in {} articles".format(num_entities, len(data)))
return data
texts.append(clean_text)
entities_list.append(entities)
docs = nlp.pipe(texts, batch_size=50)
for doc, entities in zip(docs, entities_list):
gold = _get_gold_parse(doc, entities, dev=dev, kb=kb, labels_discard=labels_discard)
if gold and len(gold.links) > 0:
yield doc, gold
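The rewritten reader collects the selected texts first and pushes them through nlp.pipe() in batches instead of calling nlp(clean_text) once per article, then yields (doc, gold) pairs lazily. A minimal, self-contained sketch of that batching idiom (spaCy v2.x assumed, with an installed en_core_web_sm model):

import spacy

nlp = spacy.load("en_core_web_sm")  # assumed to be installed
texts = ["Douglas Adams wrote a famous novel.", "Arthur Dent always carries a towel."]
# batch processing is much faster than calling nlp(text) inside a Python loop
for doc in nlp.pipe(texts, batch_size=50):
    print([(ent.text, ent.label_) for ent in doc.ents])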
def _get_gold_parse(doc, entities, dev, kb, labels_discard):

View File

@ -26,12 +26,12 @@ DEFAULT_TEXT = "Mark Zuckerberg is the CEO of Facebook."
HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""
@st.cache(ignore_hash=True)
@st.cache(allow_output_mutation=True)
def load_model(name):
return spacy.load(name)
@st.cache(ignore_hash=True)
@st.cache(allow_output_mutation=True)
def process_text(model_name, text):
nlp = load_model(model_name)
return nlp(text)
@ -79,7 +79,9 @@ if "ner" in nlp.pipe_names:
st.header("Named Entities")
st.sidebar.header("Named Entities")
label_set = nlp.get_pipe("ner").labels
labels = st.sidebar.multiselect("Entity labels", label_set, label_set)
labels = st.sidebar.multiselect(
"Entity labels", options=label_set, default=list(label_set)
)
html = displacy.render(doc, style="ent", options={"ents": labels})
# Newlines seem to mess with the rendering
html = html.replace("\n", " ")
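The cache decorator swap matters because spacy.load() returns a large mutable object that Streamlit cannot hash cheaply; allow_output_mutation=True caches the object without hashing or copying it on every rerun. A hedged sketch of the pattern (Streamlit API as of early 2020; the model name is an assumption):

import spacy
import streamlit as st

@st.cache(allow_output_mutation=True)
def load_model(name):
    # loaded once per session; the returned nlp object is reused, not re-hashed
    return spacy.load(name)

nlp = load_model("en_core_web_sm")
doc = nlp("Mark Zuckerberg is the CEO of Facebook.")
st.write([(ent.text, ent.label_) for ent in doc.ents])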

View File

@ -32,27 +32,24 @@ DESC_WIDTH = 64 # dimension of output entity vectors
@plac.annotations(
vocab_path=("Path to the vocab for the kb", "option", "v", Path),
model=("Model name, should have pretrained word embeddings", "option", "m", str),
model=("Model name, should have pretrained word embeddings", "positional", None, str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int),
)
def main(vocab_path=None, model=None, output_dir=None, n_iter=50):
def main(model=None, output_dir=None, n_iter=50):
"""Load the model, create the KB and pretrain the entity encodings.
Either an nlp model or a vocab is needed to provide access to pretrained word embeddings.
If an output_dir is provided, the KB will be stored there in a file 'kb'.
When providing an nlp model, the updated vocab will also be written to a directory in the output_dir."""
if model is None and vocab_path is None:
raise ValueError("Either the `nlp` model or the `vocab` should be specified.")
The updated vocab will also be written to a directory in the output_dir."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
vocab = Vocab().from_disk(vocab_path)
# create blank Language class with specified vocab
nlp = spacy.blank("en", vocab=vocab)
print("Created blank 'en' model with vocab from '%s'" % vocab_path)
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
# check the length of the nlp vectors
if "vectors" not in nlp.meta or not nlp.vocab.vectors.size:
raise ValueError(
"The `nlp` object should have access to pretrained word vectors, "
" cf. https://spacy.io/usage/models#languages."
)
kb = KnowledgeBase(vocab=nlp.vocab)
@ -103,11 +100,9 @@ def main(vocab_path=None, model=None, output_dir=None, n_iter=50):
print()
print("Saved KB to", kb_path)
# only storing the vocab if we weren't already reading it from file
if not vocab_path:
vocab_path = output_dir / "vocab"
kb.vocab.to_disk(vocab_path)
print("Saved vocab to", vocab_path)
vocab_path = output_dir / "vocab"
kb.vocab.to_disk(vocab_path)
print("Saved vocab to", vocab_path)
print()

View File

@ -131,7 +131,8 @@ def train_textcat(nlp, n_texts, n_iter=10):
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes): # only train textcat
optimizer = nlp.begin_training()
textcat.model.tok2vec.from_bytes(tok2vec_weights)
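This same pipe_exceptions change recurs in the example scripts that follow: instead of disabling everything except the component being trained, the scripts also keep the spacy-transformers components (trf_wordpiecer, trf_tok2vec) enabled, so a transformer base model still produces the token representations the trained component consumes. A hedged, self-contained sketch of the idiom (spaCy v2.x assumed; the "POSITIVE" label is made up):

import spacy

nlp = spacy.blank("en")
textcat = nlp.create_pipe("textcat")
textcat.add_label("POSITIVE")
nlp.add_pipe(textcat)
pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes):  # only textcat (and any trf_* components) stay active
    optimizer = nlp.begin_training()
    # ... nlp.update(...) calls would go here, as in the scripts above ...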

View File

@ -63,7 +63,8 @@ def main(model_name, unlabelled_loc):
optimizer.b2 = 0.0
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
sizes = compounding(1.0, 4.0, 1.001)
with nlp.disable_pipes(*other_pipes):
for itn in range(n_iter):

View File

@ -113,7 +113,8 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
TRAIN_DOCS.append((doc, annotation_clean))
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
pipe_exceptions = ["entity_linker", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes): # only train entity linker
# reset and initialize the weights randomly
optimizer = nlp.begin_training()

View File

@ -124,7 +124,8 @@ def main(model=None, output_dir=None, n_iter=15):
for dep in annotations.get("deps", []):
parser.add_label(dep)
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "parser"]
pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes): # only train parser
optimizer = nlp.begin_training()
for itn in range(n_iter):

View File

@ -55,7 +55,8 @@ def main(model=None, output_dir=None, n_iter=100):
ner.add_label(ent[2])
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes): # only train NER
# reset and initialize the weights randomly but only if we're
# training a new model

View File

@ -95,7 +95,8 @@ def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
optimizer = nlp.resume_training()
move_names = list(ner.move_names)
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes): # only train NER
sizes = compounding(1.0, 4.0, 1.001)
# batch up the examples using spaCy's minibatch

View File

@ -65,7 +65,8 @@ def main(model=None, output_dir=None, n_iter=15):
parser.add_label(dep)
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "parser"]
pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes): # only train parser
optimizer = nlp.begin_training()
for itn in range(n_iter):

View File

@ -68,7 +68,8 @@ def main(model=None, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=None
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes): # only train textcat
optimizer = nlp.begin_training()
if init_tok2vec is not None:

View File

@ -49,6 +49,7 @@ install_requires =
catalogue>=0.0.7,<1.1.0
ml_datasets
# Third-party dependencies
tqdm>=4.38.0,<5.0.0
setuptools
numpy>=1.15.0
plac>=0.9.6,<1.2.0

View File

@ -92,3 +92,4 @@ cdef enum attr_id_t:
LANG
ENT_KB_ID = symbols.ENT_KB_ID
MORPH
ENT_ID = symbols.ENT_ID

View File

@ -81,6 +81,7 @@ IDS = {
"DEP": DEP,
"ENT_IOB": ENT_IOB,
"ENT_TYPE": ENT_TYPE,
"ENT_ID": ENT_ID,
"ENT_KB_ID": ENT_KB_ID,
"HEAD": HEAD,
"SENT_START": SENT_START,

View File

@ -9,8 +9,14 @@ from wasabi import Printer
def conllu2json(
input_data, n_sents=10, append_morphology=False, lang=None, ner_map=None,
merge_subtokens=False, no_print=False, **_
input_data,
n_sents=10,
append_morphology=False,
lang=None,
ner_map=None,
merge_subtokens=False,
no_print=False,
**_
):
"""
Convert conllu files into JSON format for use with train cli.
@ -26,9 +32,13 @@ def conllu2json(
docs = []
raw = ""
sentences = []
conll_data = read_conllx(input_data, append_morphology=append_morphology,
ner_tag_pattern=MISC_NER_PATTERN, ner_map=ner_map,
merge_subtokens=merge_subtokens)
conll_data = read_conllx(
input_data,
append_morphology=append_morphology,
ner_tag_pattern=MISC_NER_PATTERN,
ner_map=ner_map,
merge_subtokens=merge_subtokens,
)
has_ner_tags = has_ner(input_data, ner_tag_pattern=MISC_NER_PATTERN)
for i, example in enumerate(conll_data):
raw += example.text
@ -72,20 +82,28 @@ def has_ner(input_data, ner_tag_pattern):
return False
def read_conllx(input_data, append_morphology=False, merge_subtokens=False,
ner_tag_pattern="", ner_map=None):
def read_conllx(
input_data,
append_morphology=False,
merge_subtokens=False,
ner_tag_pattern="",
ner_map=None,
):
""" Yield examples, one for each sentence """
vocab = Language.Defaults.create_vocab() # need vocab to make a minimal Doc
i = 0
vocab = Language.Defaults.create_vocab() # need vocab to make a minimal Doc
for sent in input_data.strip().split("\n\n"):
lines = sent.strip().split("\n")
if lines:
while lines[0].startswith("#"):
lines.pop(0)
example = example_from_conllu_sentence(vocab, lines,
ner_tag_pattern, merge_subtokens=merge_subtokens,
append_morphology=append_morphology,
ner_map=ner_map)
example = example_from_conllu_sentence(
vocab,
lines,
ner_tag_pattern,
merge_subtokens=merge_subtokens,
append_morphology=append_morphology,
ner_map=ner_map,
)
yield example
@ -157,8 +175,14 @@ def create_json_doc(raw, sentences, id_):
return doc
def example_from_conllu_sentence(vocab, lines, ner_tag_pattern,
merge_subtokens=False, append_morphology=False, ner_map=None):
def example_from_conllu_sentence(
vocab,
lines,
ner_tag_pattern,
merge_subtokens=False,
append_morphology=False,
ner_map=None,
):
"""Create an Example from the lines for one CoNLL-U sentence, merging
subtokens and appending morphology to tags if required.
@ -182,7 +206,6 @@ def example_from_conllu_sentence(vocab, lines, ner_tag_pattern,
in_subtok = False
for i in range(len(lines)):
line = lines[i]
subtok_lines = []
parts = line.split("\t")
id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts
if "." in id_:
@ -266,9 +289,17 @@ def example_from_conllu_sentence(vocab, lines, ner_tag_pattern,
if space:
raw += " "
example = Example(doc=raw)
example.set_token_annotation(ids=ids, words=words, tags=tags, pos=pos,
morphs=morphs, lemmas=lemmas, heads=heads,
deps=deps, entities=ents)
example.set_token_annotation(
ids=ids,
words=words,
tags=tags,
pos=pos,
morphs=morphs,
lemmas=lemmas,
heads=heads,
deps=deps,
entities=ents,
)
return example
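For reference, the converter can be exercised end to end on a small CoNLL-U snippet. The sketch below is hedged: the ten-column layout mirrors the parsing code above, and the return value is assumed to be a list of documents in spaCy's JSON training format.

from spacy.cli.converters.conllu2json import conllu2json

conllu = (
    "1\tHello\thello\tINTJ\tUH\t_\t2\tdiscourse\t_\t_\n"
    "2\tworld\tworld\tNOUN\tNN\t_\t0\troot\t_\t_\n"
)
docs = conllu2json(conllu, n_sents=1, no_print=True)
print(docs)  # list of documents in spaCy's JSON training format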
@ -280,7 +311,7 @@ def merge_conllu_subtokens(lines, doc):
id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts
if "-" in id_:
subtok_start, subtok_end = id_.split("-")
subtok_span = doc[int(subtok_start) - 1:int(subtok_end)]
subtok_span = doc[int(subtok_start) - 1 : int(subtok_end)]
subtok_spans.append(subtok_span)
# create merged tag, morph, and lemma values
tags = []
@ -292,7 +323,7 @@ def merge_conllu_subtokens(lines, doc):
if token._.merged_morph:
for feature in token._.merged_morph.split("|"):
field, values = feature.split("=", 1)
if not field in morphs:
if field not in morphs:
morphs[field] = set()
for value in values.split(","):
morphs[field].add(value)
@ -306,7 +337,9 @@ def merge_conllu_subtokens(lines, doc):
token._.merged_lemma = " ".join(lemmas)
token.tag_ = "_".join(tags)
token._.merged_morph = "|".join(sorted(morphs.values()))
token._.merged_spaceafter = True if subtok_span[-1].whitespace_ else False
token._.merged_spaceafter = (
True if subtok_span[-1].whitespace_ else False
)
with doc.retokenize() as retokenizer:
for span in subtok_spans:

View File

@ -166,6 +166,7 @@ def debug_data(
has_low_data_warning = False
has_no_neg_warning = False
has_ws_ents_error = False
has_punct_ents_warning = False
msg.divider("Named Entity Recognition")
msg.info(
@ -190,6 +191,14 @@ def debug_data(
msg.fail(f"{gold_train_data['ws_ents']} invalid whitespace entity spans")
has_ws_ents_error = True
if gold_train_data["punct_ents"]:
msg.warn(
"{} entity span(s) with punctuation".format(
gold_train_data["punct_ents"]
)
)
has_punct_ents_warning = True
for label in new_labels:
if label_counts[label] <= NEW_LABEL_THRESHOLD:
msg.warn(
@ -209,6 +218,8 @@ def debug_data(
msg.good("Examples without occurrences available for all labels")
if not has_ws_ents_error:
msg.good("No entities consisting of or starting/ending with whitespace")
if not has_punct_ents_warning:
msg.good("No entities consisting of or starting/ending with punctuation")
if has_low_data_warning:
msg.text(
@ -229,6 +240,12 @@ def debug_data(
"with whitespace characters are considered invalid."
)
if has_punct_ents_warning:
msg.text(
"Entity spans consisting of or starting/ending "
"with punctuation can not be trained with a noise level > 0."
)
if "textcat" in pipeline:
msg.divider("Text Classification")
labels = [label for label in gold_train_data["cats"]]
@ -446,6 +463,7 @@ def _compile_gold(examples, pipeline):
"words": Counter(),
"roots": Counter(),
"ws_ents": 0,
"punct_ents": 0,
"n_words": 0,
"n_misaligned_words": 0,
"n_sents": 0,
@ -469,6 +487,16 @@ def _compile_gold(examples, pipeline):
if label.startswith(("B-", "U-", "L-")) and doc[i].is_space:
# "Illegal" whitespace entity
data["ws_ents"] += 1
if label.startswith(("B-", "U-", "L-")) and doc[i].text in [
".",
"'",
"!",
"?",
",",
]:
# punctuation entity: could be replaced by whitespace when training with noise,
# so add a warning to alert the user to this unexpected side effect.
data["punct_ents"] += 1
if label.startswith(("B-", "U-")):
combined_label = label.split("-")[1]
data["ner"][combined_label] += 1

View File

@ -28,7 +28,7 @@ def pretrain(
vectors_model: ("Name or path to spaCy model with vectors to learn from", "positional", None, str),
output_dir: ("Directory to write models to on each epoch", "positional", None, str),
width: ("Width of CNN layers", "option", "cw", int) = 96,
depth: ("Depth of CNN layers", "option", "cd", int) = 4,
conv_depth: ("Depth of CNN layers", "option", "cd", int) = 4,
bilstm_depth: ("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int) = 0,
cnn_pieces: ("Maxout size for CNN layers. 1 for Mish", "option", "cP", int) = 3,
sa_depth: ("Depth of self-attention layers", "option", "sa", int) = 0,
@ -77,9 +77,15 @@ def pretrain(
msg.info("Using GPU" if has_gpu else "Not using GPU")
output_dir = Path(output_dir)
if output_dir.exists() and [p for p in output_dir.iterdir()]:
msg.warn(
"Output directory is not empty",
"It is better to use an empty directory or refer to a new output path, "
"then the new directory will be created for you.",
)
if not output_dir.exists():
output_dir.mkdir()
msg.good("Created output directory")
msg.good("Created output directory: {}".format(output_dir))
srsly.write_json(output_dir / "config.json", config)
msg.good("Saved settings to config.json")
@ -107,7 +113,7 @@ def pretrain(
Tok2Vec(
width,
embed_rows,
conv_depth=depth,
conv_depth=conv_depth,
pretrained_vectors=pretrained_vectors,
bilstm_depth=bilstm_depth, # Requires PyTorch. Experimental.
subword_features=not use_chars, # Set to False for Chinese etc

View File

@ -10,6 +10,7 @@ import contextlib
import random
from ..util import create_default_optimizer
from ..util import use_gpu as set_gpu
from ..attrs import PROB, IS_OOV, CLUSTER, LANG
from ..gold import GoldCorpus
from .. import util
@ -26,6 +27,14 @@ def train(
base_model: ("Name of model to update (optional)", "option", "b", str) = None,
pipeline: ("Comma-separated names of pipeline components", "option", "p", str) = "tagger,parser,ner",
vectors: ("Model to load vectors from", "option", "v", str) = None,
replace_components: ("Replace components from base model", "flag", "R", bool) = False,
width: ("Width of CNN layers of Tok2Vec component", "option", "cw", int) = 96,
conv_depth: ("Depth of CNN layers of Tok2Vec component", "option", "cd", int) = 4,
cnn_window: ("Window size for CNN layers of Tok2Vec component", "option", "cW", int) = 1,
cnn_pieces: ("Maxout size for CNN layers of Tok2Vec component. 1 for Mish", "option", "cP", int) = 3,
use_chars: ("Whether to use character-based embedding of Tok2Vec component", "flag", "chr", bool) = False,
bilstm_depth: ("Depth of BiLSTM layers of Tok2Vec component (requires PyTorch)", "option", "lstm", int) = 0,
embed_rows: ("Number of embedding rows of Tok2Vec component", "option", "er", int) = 2000,
n_iter: ("Number of iterations", "option", "n", int) = 30,
n_early_stopping: ("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int) = None,
n_examples: ("Number of examples", "option", "ns", int) = 0,
@ -80,6 +89,7 @@ def train(
)
if not output_path.exists():
output_path.mkdir()
msg.good("Created output directory: {}".format(output_path))
tag_map = {}
if tag_map_path is not None:
@ -113,6 +123,21 @@ def train(
# training starts from a blank model, intitalize the language class.
pipeline = [p.strip() for p in pipeline.split(",")]
msg.text(f"Training pipeline: {pipeline}")
disabled_pipes = None
pipes_added = False
msg.text("Training pipeline: {}".format(pipeline))
if use_gpu >= 0:
activated_gpu = None
try:
activated_gpu = set_gpu(use_gpu)
except Exception as e:
msg.warn("Exception: {}".format(e))
if activated_gpu is not None:
msg.text("Using GPU: {}".format(use_gpu))
else:
msg.warn("Unable to activate GPU: {}".format(use_gpu))
msg.text("Using CPU only")
use_gpu = -1
if base_model:
msg.text(f"Starting with base model '{base_model}'")
nlp = util.load_model(base_model)
@ -122,20 +147,24 @@ def train(
f"specified as `lang` argument ('{lang}') ",
exits=1,
)
nlp.disable_pipes([p for p in nlp.pipe_names if p not in pipeline])
for pipe in pipeline:
pipe_cfg = {}
if pipe == "parser":
pipe_cfg = {"learn_tokens": learn_tokens}
elif pipe == "textcat":
pipe_cfg = {
"exclusive_classes": not textcat_multilabel,
"architecture": textcat_arch,
"positive_label": textcat_positive_label,
}
if pipe not in nlp.pipe_names:
if pipe == "parser":
pipe_cfg = {"learn_tokens": learn_tokens}
elif pipe == "textcat":
pipe_cfg = {
"exclusive_classes": not textcat_multilabel,
"architecture": textcat_arch,
"positive_label": textcat_positive_label,
}
else:
pipe_cfg = {}
msg.text("Adding component to base model '{}'".format(pipe))
nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg))
pipes_added = True
elif replace_components:
msg.text("Replacing component from base model '{}'".format(pipe))
nlp.replace_pipe(pipe, nlp.create_pipe(pipe, config=pipe_cfg))
pipes_added = True
else:
if pipe == "textcat":
textcat_cfg = nlp.get_pipe("textcat").cfg
@ -144,11 +173,6 @@ def train(
"architecture": textcat_cfg["architecture"],
"positive_label": textcat_cfg["positive_label"],
}
pipe_cfg = {
"exclusive_classes": not textcat_multilabel,
"architecture": textcat_arch,
"positive_label": textcat_positive_label,
}
if base_cfg != pipe_cfg:
msg.fail(
f"The base textcat model configuration does"
@ -156,6 +180,10 @@ def train(
f"Existing cfg: {base_cfg}, provided cfg: {pipe_cfg}",
exits=1,
)
msg.text("Extending component from base model '{}'".format(pipe))
disabled_pipes = nlp.disable_pipes(
[p for p in nlp.pipe_names if p not in pipeline]
)
else:
msg.text(f"Starting with blank model '{lang}'")
lang_cls = util.get_lang_class(lang)
@ -198,13 +226,20 @@ def train(
corpus = GoldCorpus(train_path, dev_path, limit=n_examples)
n_train_words = corpus.count_train()
if base_model:
if base_model and not pipes_added:
# Start with an existing model, use default optimizer
optimizer = create_default_optimizer()
else:
# Start with a blank model, call begin_training
optimizer = nlp.begin_training(lambda: corpus.train_examples, device=use_gpu)
cfg = {"device": use_gpu}
cfg["conv_depth"] = conv_depth
cfg["token_vector_width"] = width
cfg["bilstm_depth"] = bilstm_depth
cfg["cnn_maxout_pieces"] = cnn_pieces
cfg["embed_size"] = embed_rows
cfg["conv_window"] = cnn_window
cfg["subword_features"] = not use_chars
optimizer = nlp.begin_training(lambda: corpus.train_tuples, **cfg)
nlp._optimizer = None
# Load in pretrained weights
@ -214,7 +249,7 @@ def train(
# Verify textcat config
if "textcat" in pipeline:
textcat_labels = nlp.get_pipe("textcat").cfg["labels"]
textcat_labels = nlp.get_pipe("textcat").cfg.get("labels", [])
if textcat_positive_label and textcat_positive_label not in textcat_labels:
msg.fail(
f"The textcat_positive_label (tpl) '{textcat_positive_label}' "
@ -327,12 +362,22 @@ def train(
for batch in util.minibatch_by_words(train_data, size=batch_sizes):
if not batch:
continue
nlp.update(
batch,
sgd=optimizer,
drop=next(dropout_rates),
losses=losses,
)
docs, golds = zip(*batch)
try:
nlp.update(
docs,
golds,
sgd=optimizer,
drop=next(dropout_rates),
losses=losses,
)
except ValueError as e:
msg.warn("Error during training")
if init_tok2vec:
msg.warn(
"Did you provide the same parameters during 'train' as during 'pretrain'?"
)
msg.fail("Original error message: {}".format(e), exits=1)
if raw_text:
# If raw text is available, perform 'rehearsal' updates,
# which use unlabelled data to reduce overfitting.
@ -396,11 +441,16 @@ def train(
"cpu": cpu_wps,
"gpu": gpu_wps,
}
meta["accuracy"] = scorer.scores
meta.setdefault("accuracy", {})
for component in nlp.pipe_names:
for metric in _get_metrics(component):
meta["accuracy"][metric] = scorer.scores[metric]
else:
meta.setdefault("beam_accuracy", {})
meta.setdefault("beam_speed", {})
meta["beam_accuracy"][beam_width] = scorer.scores
for component in nlp.pipe_names:
for metric in _get_metrics(component):
meta["beam_accuracy"][metric] = scorer.scores[metric]
meta["beam_speed"][beam_width] = {
"nwords": nwords,
"cpu": cpu_wps,
@ -453,13 +503,23 @@ def train(
f"Best score = {best_score}; Final iteration score = {current_score}"
)
break
except Exception as e:
msg.warn(
"Aborting and saving the final best model. Encountered exception: {}".format(
e
)
)
finally:
best_pipes = nlp.pipe_names
if disabled_pipes:
disabled_pipes.restore()
with nlp.use_params(optimizer.averages):
final_model_path = output_path / "model-final"
nlp.to_disk(final_model_path)
final_meta = srsly.read_json(output_path / "model-final" / "meta.json")
msg.good("Saved model to output directory", final_model_path)
with msg.loading("Creating best model..."):
best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names)
best_model_path = _collate_best_model(final_meta, output_path, best_pipes)
msg.good("Created best model", best_model_path)
@ -519,15 +579,14 @@ def _load_pretrained_tok2vec(nlp, loc):
def _collate_best_model(meta, output_path, components):
bests = {}
meta.setdefault("accuracy", {})
for component in components:
bests[component] = _find_best(output_path, component)
best_dest = output_path / "model-best"
shutil.copytree(str(output_path / "model-final"), str(best_dest))
for component, best_component_src in bests.items():
shutil.rmtree(str(best_dest / component))
shutil.copytree(
str(best_component_src / component), str(best_dest / component)
)
shutil.copytree(str(best_component_src / component), str(best_dest / component))
accs = srsly.read_json(best_component_src / "accuracy.json")
for metric in _get_metrics(component):
meta["accuracy"][metric] = accs[metric]
@ -550,13 +609,15 @@ def _find_best(experiment_dir, component):
def _get_metrics(component):
if component == "parser":
return ("las", "uas", "token_acc", "sent_f")
return ("las", "uas", "las_per_type", "token_acc", "sent_f")
elif component == "tagger":
return ("tags_acc",)
elif component == "ner":
return ("ents_f", "ents_p", "ents_r")
return ("ents_f", "ents_p", "ents_r", "enty_per_type")
elif component == "sentrec":
return ("sent_f", "sent_p", "sent_r")
elif component == "textcat":
return ("textcat_score",)
return ("token_acc",)
@ -568,8 +629,12 @@ def _configure_training_output(pipeline, use_gpu, has_beam_widths):
row_head.extend(["Tag Loss ", " Tag % "])
output_stats.extend(["tag_loss", "tags_acc"])
elif pipe == "parser":
row_head.extend(["Dep Loss ", " UAS ", " LAS ", "Sent P", "Sent R", "Sent F"])
output_stats.extend(["dep_loss", "uas", "las", "sent_p", "sent_r", "sent_f"])
row_head.extend(
["Dep Loss ", " UAS ", " LAS ", "Sent P", "Sent R", "Sent F"]
)
output_stats.extend(
["dep_loss", "uas", "las", "sent_p", "sent_r", "sent_f"]
)
elif pipe == "ner":
row_head.extend(["NER Loss ", "NER P ", "NER R ", "NER F "])
output_stats.extend(["ner_loss", "ents_p", "ents_r", "ents_f"])

View File

@ -51,9 +51,10 @@ def render(
html = RENDER_WRAPPER(html)
if jupyter or (jupyter is None and is_in_jupyter()):
# return HTML rendered by IPython display()
# See #4840 for details on span wrapper to disable mathjax
from IPython.core.display import display, HTML
return display(HTML(html))
return display(HTML('<span class="tex2jax_ignore">{}</span>'.format(html)))
return html
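Outside Jupyter, render() still returns the raw markup; inside Jupyter it is now wrapped in a tex2jax_ignore span so MathJax does not try to typeset the entity HTML (see #4840). A hedged usage sketch (assumes an installed en_core_web_sm model):

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mark Zuckerberg is the CEO of Facebook.")
html = displacy.render(doc, style="ent", jupyter=False)  # plain HTML string outside Jupyter
print(html[:80])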

View File

@ -75,10 +75,9 @@ class Warnings(object):
W015 = ("As of v2.1.0, the use of keyword arguments to exclude fields from "
"being serialized or deserialized is deprecated. Please use the "
"`exclude` argument instead. For example: exclude=['{arg}'].")
W016 = ("The keyword argument `n_threads` on the is now deprecated, as "
"the v2.x models cannot release the global interpreter lock. "
"Future versions may introduce a `n_process` argument for "
"parallel inference via multiprocessing.")
W016 = ("The keyword argument `n_threads` is now deprecated. As of v2.2.2, "
"the argument `n_process` controls parallel inference via "
"multiprocessing.")
W017 = ("Alias '{alias}' already exists in the Knowledge Base.")
W018 = ("Entity '{entity}' already exists in the Knowledge Base - "
"ignoring the duplicate entry.")
@ -170,7 +169,8 @@ class Errors(object):
"and satisfies the correct annotations specified in the GoldParse. "
"For example, are all labels added to the model? If you're "
"training a named entity recognizer, also make sure that none of "
"your annotated entity spans have leading or trailing whitespace. "
"your annotated entity spans have leading or trailing whitespace "
"or punctuation. "
"You can also use the experimental `debug-data` command to "
"validate your JSON-formatted training data. For details, run:\n"
"python -m spacy debug-data --help")

View File

@ -991,6 +991,11 @@ cdef class GoldParse:
self.cats = {} if cats is None else dict(cats)
self.links = {} if links is None else dict(links)
# orig_annot is used as an iterator in `nlp.evaluate` even if self.length == 0,
# so set an empty list to avoid errors.
# If self.length > 0, this is modified later.
self.orig_annot = []
# avoid allocating memory if the doc does not contain any tokens
if self.length > 0:
if not words:

View File

@ -14,6 +14,17 @@ _tamil = r"\u0B80-\u0BFF"
_telugu = r"\u0C00-\u0C7F"
# from the final table in: https://en.wikipedia.org/wiki/CJK_Unified_Ideographs
_cjk = (
r"\u4E00-\u62FF\u6300-\u77FF\u7800-\u8CFF\u8D00-\u9FFF\u3400-\u4DBF"
r"\U00020000-\U000215FF\U00021600-\U000230FF\U00023100-\U000245FF"
r"\U00024600-\U000260FF\U00026100-\U000275FF\U00027600-\U000290FF"
r"\U00029100-\U0002A6DF\U0002A700-\U0002B73F\U0002B740-\U0002B81F"
r"\U0002B820-\U0002CEAF\U0002CEB0-\U0002EBEF\u2E80-\u2EFF\u2F00-\u2FDF"
r"\u2FF0-\u2FFF\u3000-\u303F\u31C0-\u31EF\u3200-\u32FF\u3300-\u33FF"
r"\uF900-\uFAFF\uFE30-\uFE4F\U0001F200-\U0001F2FF\U0002F800-\U0002FA1F"
)
# Latin standard
_latin_u_standard = r"A-Z"
_latin_l_standard = r"a-z"
@ -212,6 +223,7 @@ _uncased = (
+ _tamil
+ _telugu
+ _hangul
+ _cjk
)
ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased)
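Adding the CJK ranges to the uncased group means CJK characters now fall inside spaCy's ALPHA character class, from which the tokenizer's prefix/suffix/infix regexes are built. A quick hedged check of the effect:

import re
from spacy.lang.char_classes import ALPHA

print(bool(re.match("[{a}]".format(a=ALPHA), "漢")))  # expected True with CJK included
print(bool(re.match("[{a}]".format(a=ALPHA), "!")))   # punctuation is not alphabetic -> False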

View File

@ -19,14 +19,14 @@ dort drei drin dritte dritten dritter drittes du durch durchaus dürfen dürft
durfte durften
eben ebenso ehrlich eigen eigene eigenen eigener eigenes ein einander eine
einem einen einer eines einigeeinigen einiger einiges einmal einmaleins elf en
einem einen einer eines einige einigen einiger einiges einmal einmaleins elf en
ende endlich entweder er erst erste ersten erster erstes es etwa etwas euch
früher fünf fünfte fünften fünfter fünftes für
gab ganz ganze ganzen ganzer ganzes gar gedurft gegen gegenüber gehabt gehen
geht gekannt gekonnt gemacht gemocht gemusst genug gerade gern gesagt geschweige
gewesen gewollt geworden gibt ging gleich gott gross groß grosse große grossen
gewesen gewollt geworden gibt ging gleich gross groß grosse große grossen
großen grosser großer grosses großes gut gute guter gutes
habe haben habt hast hat hatte hätte hatten hätten heisst heißt her heute hier
@ -44,9 +44,8 @@ kleines kommen kommt können könnt konnte könnte konnten kurz
lang lange leicht leider lieber los
machen macht machte mag magst man manche manchem manchen mancher manches mehr
mein meine meinem meinen meiner meines mensch menschen mich mir mit mittel
mochte möchte mochten mögen möglich mögt morgen muss muß müssen musst müsst
musste mussten
mein meine meinem meinen meiner meines mich mir mit mittel mochte möchte mochten
mögen möglich mögt morgen muss muß müssen musst müsst musste mussten
na nach nachdem nahm natürlich neben nein neue neuen neun neunte neunten neunter
neuntes nicht nichts nie niemand niemandem niemanden noch nun nur

View File

@ -1,5 +1,5 @@
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .tag_map_general import TAG_MAP
from ..tag_map import TAG_MAP
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .lemmatizer import GreekLemmatizer

View File

@ -1,24 +0,0 @@
from ...symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
from ...symbols import PUNCT, NUM, AUX, X, ADJ, VERB, PART, SPACE, CCONJ
TAG_MAP = {
"ADJ": {POS: ADJ},
"ADV": {POS: ADV},
"INTJ": {POS: INTJ},
"NOUN": {POS: NOUN},
"PROPN": {POS: PROPN},
"VERB": {POS: VERB},
"ADP": {POS: ADP},
"CCONJ": {POS: CCONJ},
"SCONJ": {POS: SCONJ},
"PART": {POS: PART},
"PUNCT": {POS: PUNCT},
"SYM": {POS: SYM},
"NUM": {POS: NUM},
"PRON": {POS: PRON},
"AUX": {POS: AUX},
"SPACE": {POS: SPACE},
"DET": {POS: DET},
"X": {POS: X},
}

View File

@ -1,9 +1,10 @@
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_HYPHENS
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
from ..punctuation import TOKENIZER_SUFFIXES
_quotes = CONCAT_QUOTES.replace("'", "")
DASHES = "|".join(x for x in LIST_HYPHENS if x != "-")
_infixes = (
LIST_ELLIPSES
@ -11,11 +12,9 @@ _infixes = (
+ [
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
r"(?<=[{a}])(?:{d})(?=[{a}])".format(a=ALPHA, d=DASHES),
r"(?<=[{a}0-9])[<>=/](?=[{a}])".format(a=ALPHA),
]
)

View File

@ -28,6 +28,9 @@ for exc_data in [
{ORTH: "myöh.", LEMMA: "myöhempi"},
{ORTH: "n.", LEMMA: "noin"},
{ORTH: "nimim.", LEMMA: "nimimerkki"},
{ORTH: "n:o", LEMMA: "numero"},
{ORTH: "N:o", LEMMA: "numero"},
{ORTH: "nro", LEMMA: "numero"},
{ORTH: "ns.", LEMMA: "niin sanottu"},
{ORTH: "nyk.", LEMMA: "nykyinen"},
{ORTH: "oik.", LEMMA: "oikealla"},

View File

@ -1,11 +1,16 @@
from .stop_words import STOP_WORDS
from .tag_map import TAG_MAP
from .lex_attrs import LEX_ATTRS
from ...language import Language
from ...attrs import LANG
class SlovakDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: "sk"
tag_map = TAG_MAP
stop_words = STOP_WORDS

27
spacy/lang/sk/examples.py Normal file
View File

@ -0,0 +1,27 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.sk.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Ardevop, s.r.o. je malá startup firma na území SR.",
"Samojazdiace autá presúvajú poistnú zodpovednosť na výrobcov automobilov.",
"Košice sú na východe.",
"Bratislava je hlavné mesto Slovenskej republiky.",
"Kde si?",
"Kto je prezidentom Francúzska?",
"Aké je hlavné mesto Slovenska?",
"Kedy sa narodil Andrej Kiska?",
"Včera som dostal 100€ na ruku.",
"Dnes je nedeľa 26.1.2020.",
"Narodil sa 15.4.1998 v Ružomberku.",
"Niekto mi povedal, že 500 eur je veľa peňazí.",
"Podaj mi ruku!",
]
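With the new Slovak defaults wired up (tag map, lexical attributes, extended stop words), a blank "sk" pipeline can be exercised directly on one of these sentences. A minimal hedged sketch:

import spacy

nlp = spacy.blank("sk")
doc = nlp("Bratislava je hlavné mesto Slovenskej republiky.")
print([(t.text, t.like_num, t.is_stop) for t in doc])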

View File

@ -0,0 +1,62 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = [
"nula",
"jeden",
"dva",
"tri",
"štyri",
"päť",
"šesť",
"sedem",
"osem",
"deväť",
"desať",
"jedenásť",
"dvanásť",
"trinásť",
"štrnásť",
"pätnásť",
"šestnásť",
"sedemnásť",
"osemnásť",
"devätnásť",
"dvadsať",
"tridsať",
"štyridsať",
"päťdesiat",
"šesťdesiat",
"sedemdesiat",
"osemdesiat",
"deväťdesiat",
"sto",
"tisíc",
"milión",
"miliarda",
"bilión",
"biliarda",
"trilión",
"triliarda",
"kvadrilión",
]
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}
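A quick hedged check of like_num above: it should accept digits, simple fractions, signed numbers and the listed Slovak number words regardless of case.

from spacy.lang.sk.lex_attrs import like_num

print(like_num("15"), like_num("3/4"), like_num("-100"), like_num("Päťdesiat"), like_num("pes"))
# expected: True True True True False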

View File

@ -1,5 +1,5 @@
# Source: https://github.com/stopwords-iso/stopwords-sk
# Source: https://github.com/Ardevop-sk/stopwords-sk
STOP_WORDS = set(
"""
@ -7,17 +7,41 @@ a
aby
aj
ak
akej
akejže
ako
akom
akomže
akou
akouže
akože
aká
akáže
aké
akého
akéhože
akému
akémuže
akéže
akú
akúže
aký
akých
akýchže
akým
akými
akýmiže
akýmže
akýže
ale
alebo
and
ani
asi
avšak
ba
bez
bezo
bol
bola
boli
@ -28,23 +52,32 @@ budeme
budete
budeš
budú
buï
buď
by
byť
cez
cezo
dnes
do
ešte
for
ho
hoci
i
iba
ich
im
inej
inom
iná
iné
iného
inému
iní
inú
iný
iných
iným
inými
ja
je
jeho
@ -53,80 +86,185 @@ jemu
ju
k
kam
kamže
každou
každá
každé
každého
každému
každí
každú
každý
každých
každým
každými
kde
kedže
keï
kej
kejže
keď
keďže
kie
kieho
kiehože
kiemu
kiemuže
kieže
koho
kom
komu
kou
kouže
kto
ktorej
ktorou
ktorá
ktoré
ktorí
ktorú
ktorý
ktorých
ktorým
ktorými
ku
káže
kéže
kúže
kýho
kýhože
kým
kýmu
kýmuže
kýže
lebo
leda
ledaže
len
ma
majú
mal
mala
mali
mať
medzi
menej
mi
mna
mne
mnou
moja
moje
mojej
mojich
mojim
mojimi
mojou
moju
možno
mu
musia
musieť
musí
musím
musíme
musíte
musíš
my
mám
máme
máte
mòa
máš
môcť
môj
môjho
môže
môžem
môžeme
môžete
môžeš
môžu
mňa
na
nad
nado
najmä
nami
naša
naše
našej
naši
našich
našim
našimi
našou
ne
nech
neho
nej
nejakej
nejakom
nejakou
nejaká
nejaké
nejakého
nejakému
nejakú
nejaký
nejakých
nejakým
nejakými
nemu
než
nich
nie
niektorej
niektorom
niektorou
niektorá
niektoré
niektorého
niektorému
niektorú
niektorý
niektorých
niektorým
niektorými
nielen
niečo
nim
nimi
nič
ničoho
ničom
ničomu
ničím
no
nová
nové
noví
nový
nám
nás
náš
nášho
ním
o
od
odo
of
on
ona
oni
ono
ony
oňho
po
pod
podo
podľa
pokiaľ
popod
popri
potom
poza
pre
pred
predo
@ -134,42 +272,56 @@ preto
pretože
prečo
pri
prvá
prvé
prví
prvý
práve
pýta
s
sa
seba
sebe
sebou
sem
si
sme
so
som
späť
ste
svoj
svoja
svoje
svojho
svojich
svojim
svojimi
svojou
svoju
svojím
svojími
ta
tak
takej
takejto
taká
takáto
také
takého
takéhoto
takému
takémuto
takéto
takí
takú
takúto
taký
takýto
takže
tam
te
teba
tebe
tebou
teda
tej
tejto
ten
tento
the
ti
tie
tieto
@ -177,52 +329,97 @@ tiež
to
toho
tohoto
tohto
tom
tomto
tomu
tomuto
toto
tou
touto
tu
tvoj
tvojími
tvoja
tvoje
tvojej
tvojho
tvoji
tvojich
tvojim
tvojimi
tvojím
ty
táto
títo
túto
tých
tým
tými
týmto
u
v
vami
vaša
vaše
veï
vašej
vaši
vašich
vašim
vaším
veď
viac
vo
vy
vám
vás
váš
vášho
však
všetci
všetka
všetko
všetky
všetok
z
za
začo
začože
zo
a
áno
èi
èo
èí
òom
òou
òu
čej
či
čia
čie
čieho
čiemu
čiu
čo
čoho
čom
čomu
čou
čože
čí
čím
čími
ďalšia
ďalšie
ďalšieho
ďalšiemu
ďalšiu
ďalšom
ďalšou
ďalší
ďalších
ďalším
ďalšími
ňom
ňou
ňu
že
""".split()
)

1467
spacy/lang/sk/tag_map.py Normal file

File diff suppressed because it is too large

View File

@ -1,13 +1,17 @@
import re
from .char_classes import ALPHA_LOWER
from ..symbols import ORTH, POS, TAG, LEMMA, SPACE
# URL validation regex courtesy of: https://mathiasbynens.be/demo/url-regex
# A few minor mods to this regex to account for use cases represented in test_urls
# and https://gist.github.com/dperini/729294 (Diego Perini, MIT License)
# A few mods to this regex to account for use cases represented in test_urls
URL_PATTERN = (
# fmt: off
r"^"
# protocol identifier (see: https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml)
# protocol identifier (mods: make optional and expand schemes)
# (see: https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml)
r"(?:(?:[\w\+\-\.]{2,})://)?"
# mailto:user or user:pass authentication
r"(?:\S+(?::\S*)?@)?"
@ -28,18 +32,27 @@ URL_PATTERN = (
r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
r"|"
# host name
r"(?:(?:[a-z0-9\-]*)?[a-z0-9]+)"
# domain name
r"(?:\.(?:[a-z0-9])(?:[a-z0-9\-])*[a-z0-9])?"
# host & domain names
# mods: match is case-sensitive, so include [A-Z]
"(?:"
"(?:"
"[A-Za-z0-9\u00a1-\uffff]"
"[A-Za-z0-9\u00a1-\uffff_-]{0,62}"
")?"
"[A-Za-z0-9\u00a1-\uffff]\."
")+"
# TLD identifier
r"(?:\.(?:[a-z]{2,}))"
# mods: use ALPHA_LOWER instead of a wider range so that this doesn't match
# strings like "lower.Upper", which can be split on "." by infixes in some
# languages
r"(?:[" + ALPHA_LOWER + "]{2,63})"
r")"
# port number
r"(?::\d{2,5})?"
# resource path
r"(?:[/?#]\S*)?"
r"$"
# fmt: on
).strip()
TOKEN_MATCH = re.compile(URL_PATTERN, re.UNICODE).match
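The net effect of these modifications: the scheme is optional, hosts may be mixed-case or internationalized, but the TLD must still be lowercase so strings like "lower.Upper" are left to the usual infix rules. A hedged check against the compiled matcher:

from spacy.lang.tokenizer_exceptions import TOKEN_MATCH

for candidate in ["http://BlahBlah.com/Blah_Blah", "http://例子.测试", "lower.Upper"]:
    print(candidate, bool(TOKEN_MATCH(candidate)))
# expected: True True False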

View File

@ -4,7 +4,6 @@ import weakref
import functools
from contextlib import contextmanager
from copy import copy, deepcopy
from thinc.model import Model
from thinc.backends import get_current_ops
import srsly
import multiprocessing as mp
@ -481,7 +480,6 @@ class Language(object):
component_cfg.setdefault(name, {})
component_cfg[name].setdefault("drop", drop)
component_cfg[name].setdefault("set_annotations", False)
grads = {}
for name, proc in self.pipeline:
if not hasattr(proc, "update"):
continue
@ -581,7 +579,8 @@ class Language(object):
self.vocab.vectors.data = ops.asarray(self.vocab.vectors.data)
link_vectors_to_models(self.vocab)
if self.vocab.vectors.data.shape[1]:
cfg["pretrained_vectors"] = self.vocab.vectors
cfg["pretrained_vectors"] = self.vocab.vectors.name
cfg["pretrained_dims"] = self.vocab.vectors.data.shape[1]
if sgd is None:
sgd = create_default_optimizer()
self._optimizer = sgd
@ -746,7 +745,7 @@ class Language(object):
pipes = (
[]
) # contains functools.partial objects so that easily create multiprocess worker.
) # contains functools.partial objects to easily create multiprocess worker.
for name, proc in self.pipeline:
if name in disable:
continue
@ -803,7 +802,7 @@ class Language(object):
texts, raw_texts = itertools.tee(texts)
# for sending texts to worker
texts_q = [mp.Queue() for _ in range(n_process)]
# for receiving byte encoded docs from worker
# for receiving byte-encoded docs from worker
bytedocs_recv_ch, bytedocs_send_ch = zip(
*[mp.Pipe(False) for _ in range(n_process)]
)
@ -813,7 +812,7 @@ class Language(object):
# This is necessary to properly handle infinite length of texts.
# (In this case, all data cannot be sent to the workers at once)
sender = _Sender(batch_texts, texts_q, chunk_size=n_process)
# send twice so that make process busy
# send twice to make process busy
sender.send()
sender.send()
@ -825,7 +824,7 @@ class Language(object):
proc.start()
# Cycle channels not to break the order of docs.
# The received object is batch of byte encoded docs, so flatten them with chain.from_iterable.
# The received object is a batch of byte-encoded docs, so flatten them with chain.from_iterable.
byte_docs = chain.from_iterable(recv.recv() for recv in cycle(bytedocs_recv_ch))
docs = (Doc(self.vocab).from_bytes(byte_doc) for byte_doc in byte_docs)
try:

View File

@ -4,7 +4,7 @@ import srsly
from ..language import component
from ..errors import Errors
from ..util import ensure_path, to_disk, from_disk
from ..tokens import Span
from ..tokens import Doc, Span
from ..matcher import Matcher, PhraseMatcher
DEFAULT_ENT_ID_SEP = "||"
@ -125,20 +125,31 @@ class EntityRuler(object):
DOCS: https://spacy.io/api/entityruler#labels
"""
all_labels = set(self.token_patterns.keys())
all_labels.update(self.phrase_patterns.keys())
keys = set(self.token_patterns.keys())
keys.update(self.phrase_patterns.keys())
all_labels = set()
for l in keys:
if self.ent_id_sep in l:
label, _ = self._split_label(l)
all_labels.add(label)
else:
all_labels.add(l)
return tuple(all_labels)
@property
def ent_ids(self):
"""All entity ids present in the match patterns `id` properties.
"""All entity ids present in the match patterns `id` properties
RETURNS (set): The string entity ids.
DOCS: https://spacy.io/api/entityruler#ent_ids
"""
keys = set(self.token_patterns.keys())
keys.update(self.phrase_patterns.keys())
all_ent_ids = set()
for l in self.labels:
for l in keys:
if self.ent_id_sep in l:
_, ent_id = self._split_label(l)
all_ent_ids.add(ent_id)
@ -147,6 +158,7 @@ class EntityRuler(object):
@property
def patterns(self):
"""Get all patterns that were added to the entity ruler.
RETURNS (list): The original patterns, one dictionary per pattern.
DOCS: https://spacy.io/api/entityruler#patterns
@ -179,6 +191,7 @@ class EntityRuler(object):
DOCS: https://spacy.io/api/entityruler#add_patterns
"""
# disable the nlp components after this one in case they hadn't been initialized / deserialised yet
try:
current_index = self.nlp.pipe_names.index(self.name)
@ -188,7 +201,31 @@ class EntityRuler(object):
except ValueError:
subsequent_pipes = []
with self.nlp.disable_pipes(subsequent_pipes):
token_patterns = []
phrase_pattern_labels = []
phrase_pattern_texts = []
phrase_pattern_ids = []
for entry in patterns:
if isinstance(entry["pattern"], str):
phrase_pattern_labels.append(entry["label"])
phrase_pattern_texts.append(entry["pattern"])
phrase_pattern_ids.append(entry.get("id"))
elif isinstance(entry["pattern"], list):
token_patterns.append(entry)
phrase_patterns = []
for label, pattern, ent_id in zip(
phrase_pattern_labels,
self.nlp.pipe(phrase_pattern_texts),
phrase_pattern_ids,
):
phrase_pattern = {"label": label, "pattern": pattern, "id": ent_id}
if ent_id:
phrase_pattern["id"] = ent_id
phrase_patterns.append(phrase_pattern)
for entry in token_patterns + phrase_patterns:
label = entry["label"]
if "id" in entry:
ent_label = label
@ -197,8 +234,8 @@ class EntityRuler(object):
self._ent_ids[key] = (ent_label, entry["id"])
pattern = entry["pattern"]
if isinstance(pattern, str):
self.phrase_patterns[label].append(self.nlp(pattern))
if isinstance(pattern, Doc):
self.phrase_patterns[label].append(pattern)
elif isinstance(pattern, list):
self.token_patterns[label].append(pattern)
else:
@ -211,6 +248,8 @@ class EntityRuler(object):
def _split_label(self, label):
"""Split Entity label into ent_label and ent_id if it contains self.ent_id_sep
label (str): The value of label in a pattern entry
RETURNS (tuple): ent_label, ent_id
"""
if self.ent_id_sep in label:
@ -224,6 +263,9 @@ class EntityRuler(object):
def _create_label(self, label, ent_id):
"""Join Entity label with ent_id if the pattern has an `id` attribute
label (str): The label to set for ent.label_
ent_id (str): The label
RETURNS (str): The ent_label joined with configured `ent_id_sep`
"""
if isinstance(ent_id, str):
@ -235,6 +277,7 @@ class EntityRuler(object):
patterns_bytes (bytes): The bytestring to load.
**kwargs: Other config paramters, mostly for consistency.
RETURNS (EntityRuler): The loaded entity ruler.
DOCS: https://spacy.io/api/entityruler#from_bytes
@ -274,6 +317,7 @@ class EntityRuler(object):
path (unicode / Path): The JSONL file to load.
**kwargs: Other config paramters, mostly for consistency.
RETURNS (EntityRuler): The loaded entity ruler.
DOCS: https://spacy.io/api/entityruler#from_disk
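A hedged usage sketch (spaCy v2.2.x assumed) of the behaviour touched above: add_patterns() now tokenizes all phrase patterns in one nlp.pipe() batch, and the labels / ent_ids properties split the internal "label||id" keys back apart. The labels, pattern strings and IDs below are made up for illustration:

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.blank("en")
ruler = EntityRuler(nlp)
ruler.add_patterns([
    {"label": "TECH_ORG", "pattern": "Apple", "id": "a1"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "sf"},
])
nlp.add_pipe(ruler)
doc = nlp("Apple opened an office in San Francisco.")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents])
print(sorted(ruler.labels), sorted(ruler.ent_ids))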

View File

@ -1632,7 +1632,7 @@ class EntityLinker(Pipe):
for i, doc in enumerate(docs):
if len(doc) > 0:
# Looping through each sentence and each entity
# This may go wrong if there are entities across sentences - because they might not get a KB ID
# This may go wrong if there are entities across sentences - which shouldn't happen normally.
for sent in doc.sents:
sent_doc = sent.as_doc()
# currently, the context is the same for each entity in a sentence (should be refined)
@ -1829,7 +1829,7 @@ class Sentencizer(Pipe):
yield ex
else:
yield from docs
def predict(self, docs):
"""Apply the pipeline's model to a batch of docs, without
modifying them.
@ -1840,20 +1840,21 @@ class Sentencizer(Pipe):
return guesses
guesses = []
for doc in docs:
start = 0
seen_period = False
doc_guesses = [False] * len(doc)
doc_guesses[0] = True
for i, token in enumerate(doc):
is_in_punct_chars = token.text in self.punct_chars
if seen_period and not token.is_punct and not is_in_punct_chars:
if len(doc) > 0:
start = 0
seen_period = False
doc_guesses[0] = True
for i, token in enumerate(doc):
is_in_punct_chars = token.text in self.punct_chars
if seen_period and not token.is_punct and not is_in_punct_chars:
doc_guesses[start] = True
start = token.i
seen_period = False
elif is_in_punct_chars:
seen_period = True
if start < len(doc):
doc_guesses[start] = True
start = token.i
seen_period = False
elif is_in_punct_chars:
seen_period = True
if start < len(doc):
doc_guesses[start] = True
guesses.append(doc_guesses)
return guesses

View File

@ -463,3 +463,4 @@ cdef enum symbol_t:
ENT_KB_ID
MORPH
ENT_ID

View File

@ -82,6 +82,7 @@ IDS = {
"DEP": DEP,
"ENT_IOB": ENT_IOB,
"ENT_TYPE": ENT_TYPE,
"ENT_ID": ENT_ID,
"ENT_KB_ID": ENT_KB_ID,
"HEAD": HEAD,
"SENT_START": SENT_START,

View File

@ -57,7 +57,7 @@ cdef class Parser:
subword_features = util.env_opt('subword_features',
cfg.get('subword_features', True))
conv_depth = util.env_opt('conv_depth', cfg.get('conv_depth', 4))
window_size = util.env_opt('window_size', cfg.get('window_size', 1))
conv_window = util.env_opt('conv_window', cfg.get('conv_window', 1))
t2v_pieces = util.env_opt('cnn_maxout_pieces', cfg.get('cnn_maxout_pieces', 3))
bilstm_depth = util.env_opt('bilstm_depth', cfg.get('bilstm_depth', 0))
self_attn_depth = util.env_opt('self_attn_depth', cfg.get('self_attn_depth', 0))

View File

@ -4,7 +4,7 @@ import numpy
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab
from spacy.errors import ModelsWarning
from spacy.attrs import ENT_TYPE, ENT_IOB
from spacy.attrs import ENT_TYPE, ENT_IOB, SENT_START, HEAD, DEP
from ..util import get_doc
@ -271,6 +271,39 @@ def test_doc_is_nered(en_vocab):
assert new_doc.is_nered
def test_doc_from_array_sent_starts(en_vocab):
words = ["I", "live", "in", "New", "York", ".", "I", "like", "cats", "."]
heads = [0, 0, 0, 0, 0, 0, 6, 6, 6, 6]
deps = ["ROOT", "dep", "dep", "dep", "dep", "dep", "ROOT", "dep", "dep", "dep", "dep"]
doc = Doc(en_vocab, words=words)
for i, (dep, head) in enumerate(zip(deps, heads)):
doc[i].dep_ = dep
doc[i].head = doc[head]
if head == i:
doc[i].is_sent_start = True
doc.is_parsed
attrs = [SENT_START, HEAD]
arr = doc.to_array(attrs)
new_doc = Doc(en_vocab, words=words)
with pytest.raises(ValueError):
new_doc.from_array(attrs, arr)
attrs = [SENT_START, DEP]
arr = doc.to_array(attrs)
new_doc = Doc(en_vocab, words=words)
new_doc.from_array(attrs, arr)
assert [t.is_sent_start for t in doc] == [t.is_sent_start for t in new_doc]
assert not new_doc.is_parsed
attrs = [HEAD, DEP]
arr = doc.to_array(attrs)
new_doc = Doc(en_vocab, words=words)
new_doc.from_array(attrs, arr)
assert [t.is_sent_start for t in doc] == [t.is_sent_start for t in new_doc]
assert new_doc.is_parsed
def test_doc_lang(en_vocab):
doc = Doc(en_vocab, words=["Hello", "world"])
assert doc.lang_ == "en"

View File

@ -276,3 +276,12 @@ def test_filter_spans(doc):
assert len(filtered[1]) == 5
assert filtered[0].start == 1 and filtered[0].end == 4
assert filtered[1].start == 5 and filtered[1].end == 10
def test_span_eq_hash(doc, doc_not_parsed):
assert doc[0:2] == doc[0:2]
assert doc[0:2] != doc[1:3]
assert doc[0:2] != doc_not_parsed[0:2]
assert hash(doc[0:2]) == hash(doc[0:2])
assert hash(doc[0:2]) != hash(doc[1:3])
assert hash(doc[0:2]) != hash(doc_not_parsed[0:2])

View File

@ -16,6 +16,21 @@ HYPHENATED_TESTS = [
)
]
ABBREVIATION_INFLECTION_TESTS = [
(
"VTT:ssa ennen v:ta 2010 suoritetut mittaukset",
["VTT:ssa", "ennen", "v:ta", "2010", "suoritetut", "mittaukset"]
),
(
"ALV:n osuus on 24 %.",
["ALV:n", "osuus", "on", "24", "%", "."]
),
(
"Hiihtäjä oli kilpailun 14:s.",
["Hiihtäjä", "oli", "kilpailun", "14:s", "."]
)
]
@pytest.mark.parametrize("text,expected_tokens", ABBREVIATION_TESTS)
def test_fi_tokenizer_abbreviations(fi_tokenizer, text, expected_tokens):
@ -29,3 +44,10 @@ def test_fi_tokenizer_hyphenated_words(fi_tokenizer, text, expected_tokens):
tokens = fi_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list
@pytest.mark.parametrize("text,expected_tokens", ABBREVIATION_INFLECTION_TESTS)
def test_fi_tokenizer_abbreviation_inflections(fi_tokenizer, text, expected_tokens):
tokens = fi_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list

View File

@ -293,9 +293,8 @@ WIKI_TESTS = [
("cérium(IV)-oxid", ["cérium", "(", "IV", ")", "-oxid"]),
]
TESTCASES = (
DEFAULT_TESTS
+ DOT_TESTS
EXTRA_TESTS = (
DOT_TESTS
+ QUOTE_TESTS
+ NUMBER_TESTS
+ HYPHEN_TESTS
@ -303,8 +302,16 @@ TESTCASES = (
+ TYPO_TESTS
)
# normal: default tests + 10% of extra tests
TESTS = DEFAULT_TESTS
TESTS.extend([x for i, x in enumerate(EXTRA_TESTS) if i % 10 == 0])
@pytest.mark.parametrize("text,expected_tokens", TESTCASES)
# slow: remaining 90% of extra tests
SLOW_TESTS = [x for i, x in enumerate(EXTRA_TESTS) if i % 10 != 0]
TESTS.extend([pytest.param(x[0], x[1], marks=pytest.mark.slow()) if not isinstance(x[0], tuple) else x for x in SLOW_TESTS])
@pytest.mark.parametrize("text,expected_tokens", TESTS)
def test_hu_tokenizer_handles_testcases(hu_tokenizer, text, expected_tokens):
tokens = hu_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space]

View File

@ -41,15 +41,15 @@ TYPOS_IN_PUNC_TESTS = [
LONG_TEXTS_TESTS = [
(
"Иң борынгы кешеләр суыклар һәм салкын кышлар булмый торган җылы"
"якларда яшәгәннәр, шуңа күрә аларга кием кирәк булмаган.Йөз"
"меңнәрчә еллар үткән, борынгы кешеләр акрынлап Европа һәм Азиянең"
"салкын илләрендә дә яши башлаганнар. Алар кырыс һәм салкын"
"Иң борынгы кешеләр суыклар һәм салкын кышлар булмый торган җылы "
"якларда яшәгәннәр, шуңа күрә аларга кием кирәк булмаган.Йөз "
"меңнәрчә еллар үткән, борынгы кешеләр акрынлап Европа һәм Азиянең "
"салкын илләрендә дә яши башлаганнар. Алар кырыс һәм салкын "
"кышлардан саклану өчен кием-салым уйлап тапканнар - итәк.",
"Иң борынгы кешеләр суыклар һәм салкын кышлар булмый торган җылы"
"якларда яшәгәннәр , шуңа күрә аларга кием кирәк булмаган . Йөз"
"меңнәрчә еллар үткән , борынгы кешеләр акрынлап Европа һәм Азиянең"
"салкын илләрендә дә яши башлаганнар . Алар кырыс һәм салкын"
"Иң борынгы кешеләр суыклар һәм салкын кышлар булмый торган җылы "
"якларда яшәгәннәр , шуңа күрә аларга кием кирәк булмаган . Йөз "
"меңнәрчә еллар үткән , борынгы кешеләр акрынлап Европа һәм Азиянең "
"салкын илләрендә дә яши башлаганнар . Алар кырыс һәм салкын "
"кышлардан саклану өчен кием-салым уйлап тапканнар - итәк .".split(),
)
]

View File

@ -18,6 +18,7 @@ def patterns():
{"label": "HELLO", "pattern": [{"ORTH": "HELLO"}]},
{"label": "COMPLEX", "pattern": [{"ORTH": "foo", "OP": "*"}]},
{"label": "TECH_ORG", "pattern": "Apple", "id": "a1"},
{"label": "TECH_ORG", "pattern": "Microsoft", "id": "a2"},
]
@ -144,3 +145,14 @@ def test_entity_ruler_validate(nlp):
# invalid pattern raises error with validate
with pytest.raises(MatchPatternError):
validated_ruler.add_patterns([invalid_pattern])
def test_entity_ruler_properties(nlp, patterns):
ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
assert sorted(ruler.labels) == sorted([
"HELLO",
"BYE",
"COMPLEX",
"TECH_ORG"
])
assert sorted(ruler.ent_ids) == ["a1", "a2"]

View File

@ -32,6 +32,22 @@ def test_sentencizer_pipe():
assert len(list(doc.sents)) == 2
def test_sentencizer_empty_docs():
one_empty_text = [""]
many_empty_texts = ["", "", ""]
some_empty_texts = ["hi", "", "This is a test. Here are two sentences.", ""]
nlp = English()
nlp.add_pipe(nlp.create_pipe("sentencizer"))
for texts in [one_empty_text, many_empty_texts, some_empty_texts]:
for doc in nlp.pipe(texts):
assert doc.is_sentenced
sent_starts = [t.is_sent_start for t in doc]
if len(doc) == 0:
assert sent_starts == []
else:
assert len(sent_starts) > 0
@pytest.mark.parametrize(
"words,sent_starts,n_sents",
[

View File

@ -0,0 +1,31 @@
from spacy.cli.converters.conllu2json import conllu2json
input_data = """
1 [ _ PUNCT -LRB- _ _ punct _ _
2 This _ DET DT _ _ det _ _
3 killing _ NOUN NN _ _ nsubj _ _
4 of _ ADP IN _ _ case _ _
5 a _ DET DT _ _ det _ _
6 respected _ ADJ JJ _ _ amod _ _
7 cleric _ NOUN NN _ _ nmod _ _
8 will _ AUX MD _ _ aux _ _
9 be _ AUX VB _ _ aux _ _
10 causing _ VERB VBG _ _ root _ _
11 us _ PRON PRP _ _ iobj _ _
12 trouble _ NOUN NN _ _ dobj _ _
13 for _ ADP IN _ _ case _ _
14 years _ NOUN NNS _ _ nmod _ _
15 to _ PART TO _ _ mark _ _
16 come _ VERB VB _ _ acl _ _
17 . _ PUNCT . _ _ punct _ _
18 ] _ PUNCT -RRB- _ _ punct _ _
"""
def test_issue4665():
"""
conllu2json should not raise an exception if the HEAD column contains an
underscore
"""
conllu2json(input_data)

View File

@ -0,0 +1,36 @@
# coding: utf8
from __future__ import unicode_literals
from spacy.lang.en import English
from spacy.pipeline import EntityRuler
def test_issue4849():
nlp = English()
ruler = EntityRuler(
nlp, patterns=[
{"label": "PERSON", "pattern": 'joe biden', "id": 'joe-biden'},
{"label": "PERSON", "pattern": 'bernie sanders', "id": 'bernie-sanders'},
],
phrase_matcher_attr="LOWER"
)
nlp.add_pipe(ruler)
text = """
The left is starting to take aim at Democratic front-runner Joe Biden.
Sen. Bernie Sanders joined in her criticism: "There is no 'middle ground' when it comes to climate policy."
"""
# USING 1 PROCESS
count_ents = 0
for doc in nlp.pipe([text], n_process=1):
count_ents += len([ent for ent in doc.ents if ent.ent_id > 0])
assert(count_ents == 2)
# USING 2 PROCESSES
count_ents = 0
for doc in nlp.pipe([text], n_process=2):
count_ents += len([ent for ent in doc.ents if ent.ent_id > 0])
assert (count_ents == 2)

View File

@ -0,0 +1,16 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
import spacy
@pytest.fixture
def nlp():
return spacy.blank("en")
def test_evaluate(nlp):
docs_golds = [("", {})]
nlp.evaluate(docs_golds)

View File

@ -1,5 +1,5 @@
import pytest
from spacy.tokens import Doc
from spacy.tokens import Doc, Token
from spacy.vocab import Vocab
@ -10,6 +10,10 @@ def doc_w_attrs(en_tokenizer):
Doc.set_extension("_test_method", method=lambda doc, arg: f"{len(doc.text)}{arg}")
doc = en_tokenizer("This is a test.")
doc._._test_attr = "test"
Token.set_extension("_test_token", default="t0")
doc[1]._._test_token = "t1"
return doc
@ -20,3 +24,6 @@ def test_serialize_ext_attrs_from_bytes(doc_w_attrs):
assert doc._._test_attr == "test"
assert doc._._test_prop == len(doc.text)
assert doc._._test_method("test") == f"{len(doc.text)}test"
assert doc[0]._._test_token == "t0"
assert doc[1]._._test_token == "t1"
assert doc[2]._._test_token == "t0"


@ -19,6 +19,7 @@ URLS_FULL = URLS_BASIC + [
# URL SHOULD_MATCH and SHOULD_NOT_MATCH patterns courtesy of https://mathiasbynens.be/demo/url-regex
URLS_SHOULD_MATCH = [
"http://foo.com/blah_blah",
"http://BlahBlah.com/Blah_Blah",
"http://foo.com/blah_blah/",
"http://www.example.com/wpstyle/?p=364",
"https://www.example.com/foo/?bar=baz&inga=42&quux",
@ -56,14 +57,17 @@ URLS_SHOULD_MATCH = [
),
"http://foo.com/blah_blah_(wikipedia)",
"http://foo.com/blah_blah_(wikipedia)_(again)",
pytest.param("http://⌘.ws", marks=pytest.mark.xfail()),
pytest.param("http://⌘.ws/", marks=pytest.mark.xfail()),
pytest.param("http://☺.damowmow.com/", marks=pytest.mark.xfail()),
pytest.param("http://✪df.ws/123", marks=pytest.mark.xfail()),
pytest.param("http://➡.ws/䨹", marks=pytest.mark.xfail()),
pytest.param("http://مثال.إختبار", marks=pytest.mark.xfail()),
pytest.param("http://例子.测试", marks=pytest.mark.xfail()),
pytest.param("http://उदाहरण.परीक्षा", marks=pytest.mark.xfail()),
"http://www.foo.co.uk",
"http://www.foo.co.uk/",
"http://www.foo.co.uk/blah/blah",
"http://⌘.ws",
"http://⌘.ws/",
"http://☺.damowmow.com/",
"http://✪df.ws/123",
"http://➡.ws/䨹",
"http://مثال.إختبار",
"http://例子.测试",
"http://उदाहरण.परीक्षा",
]
URLS_SHOULD_NOT_MATCH = [


@ -91,7 +91,11 @@ def assert_docs_equal(doc1, doc2):
assert [t.ent_type for t in doc1] == [t.ent_type for t in doc2]
assert [t.ent_iob for t in doc1] == [t.ent_iob for t in doc2]
assert [ent for ent in doc1.ents] == [ent for ent in doc2.ents]
for ent1, ent2 in zip(doc1.ents, doc2.ents):
assert ent1.start == ent2.start
assert ent1.end == ent2.end
assert ent1.label == ent2.label
assert ent1.kb_id == ent2.kb_id
def assert_packed_msg_equal(b1, b2):


@ -19,7 +19,7 @@ from ..lexeme cimport Lexeme, EMPTY_LEXEME
from ..typedefs cimport attr_t, flags_t
from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER
from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
from ..attrs cimport ENT_TYPE, ENT_KB_ID, SENT_START, attr_id_t
from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, attr_id_t
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
from ..attrs import intify_attrs, IDS
@ -65,6 +65,8 @@ cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
return token.ent_iob
elif feat_name == ENT_TYPE:
return token.ent_type
elif feat_name == ENT_ID:
return token.ent_id
elif feat_name == ENT_KB_ID:
return token.ent_kb_id
else:
@ -807,7 +809,7 @@ cdef class Doc:
if attr_ids[j] != TAG:
Token.set_struct_attr(token, attr_ids[j], array[i, j])
# Set flags
self.is_parsed = bool(self.is_parsed or HEAD in attrs or DEP in attrs)
self.is_parsed = bool(self.is_parsed or HEAD in attrs)
self.is_tagged = bool(self.is_tagged or TAG in attrs or POS in attrs)
# If document is parsed, set children
if self.is_parsed:
@ -864,7 +866,7 @@ cdef class Doc:
DOCS: https://spacy.io/api/doc#to_bytes
"""
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE] # TODO: ENT_KB_ID ?
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID] # TODO: ENT_KB_ID ?
if self.is_tagged:
array_head.extend([TAG, POS])
# If doc parsed add head and dep attribute
@ -990,9 +992,9 @@ cdef class Doc:
order, and no span intersection is allowed.
spans (Span[]): Spans to merge, in document order, with all span
intersections empty. Cannot be emty.
intersections empty. Cannot be empty.
attributes (Dictionary[]): Attributes to assign to the merged tokens. By default,
must be the same lenghth as spans, emty dictionaries are allowed.
must be the same length as spans, empty dictionaries are allowed.
attributes are inherited from the syntactic root of the span.
RETURNS (Token): The first newly merged token.
"""


@ -124,22 +124,27 @@ cdef class Span:
return False
else:
return True
# Eq
# <
if op == 0:
return self.start_char < other.start_char
# <=
elif op == 1:
return self.start_char <= other.start_char
# ==
elif op == 2:
return self.start_char == other.start_char and self.end_char == other.end_char
return (self.doc, self.start_char, self.end_char, self.label, self.kb_id) == (other.doc, other.start_char, other.end_char, other.label, other.kb_id)
# !=
elif op == 3:
return self.start_char != other.start_char or self.end_char != other.end_char
return (self.doc, self.start_char, self.end_char, self.label, self.kb_id) != (other.doc, other.start_char, other.end_char, other.label, other.kb_id)
# >
elif op == 4:
return self.start_char > other.start_char
# >=
elif op == 5:
return self.start_char >= other.start_char
def __hash__(self):
return hash((self.doc, self.label, self.start_char, self.end_char))
return hash((self.doc, self.start_char, self.end_char, self.label, self.kb_id))
def __len__(self):
"""Get the number of tokens in the span.
@ -207,7 +212,7 @@ cdef class Span:
words = [t.text for t in self]
spaces = [bool(t.whitespace_) for t in self]
cdef Doc doc = Doc(self.doc.vocab, words=words, spaces=spaces)
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_KB_ID]
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID, ENT_KB_ID]
if self.doc.is_tagged:
array_head.append(TAG)
# If doc parsed add head and dep attribute


@ -55,6 +55,8 @@ cdef class Token:
return token.ent_iob
elif feat_name == ENT_TYPE:
return token.ent_type
elif feat_name == ENT_ID:
return token.ent_id
elif feat_name == ENT_KB_ID:
return token.ent_kb_id
elif feat_name == SENT_START:
@ -85,6 +87,8 @@ cdef class Token:
token.ent_iob = value
elif feat_name == ENT_TYPE:
token.ent_type = value
elif feat_name == ENT_ID:
token.ent_id = value
elif feat_name == ENT_KB_ID:
token.ent_kb_id = value
elif feat_name == SENT_START:


@ -278,7 +278,11 @@ cdef class Vectors:
DOCS: https://spacy.io/api/vectors#add
"""
key = get_string_id(key)
# use int for all keys and rows in key2row for more efficient access
# and serialization
key = int(get_string_id(key))
if row is not None:
row = int(row)
if row is None and key in self.key2row:
row = self.key2row[key]
elif row is None:


@ -372,7 +372,7 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
| `--n-iter`, `-n` | option | Number of iterations (default: `30`). |
| `--n-early-stopping`, `-ne` | option | Maximum number of training epochs without dev accuracy improvement. |
| `--n-examples`, `-ns` | option | Number of examples to use (defaults to `0` for all examples). |
| `--use-gpu`, `-g` | option | Whether to use GPU. Can be either `0`, `1` or `-1`. |
| `--use-gpu`, `-g` | option | GPU ID or `-1` for CPU only (default: `-1`). |
| `--version`, `-V` | option | Model version. Will be written out to the model's `meta.json` after training. |
| `--meta-path`, `-m` <Tag variant="new">2</Tag> | option | Optional path to model [`meta.json`](/usage/training#models-generating). All relevant properties like `lang`, `pipeline` and `spacy_version` will be overwritten. |
| `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. |


@ -77,9 +77,9 @@ more efficient than processing texts one-by-one.
Early versions of spaCy used simple statistical models that could be efficiently
multi-threaded, as we were able to entirely release Python's global interpreter
lock. The multi-threading was controlled using the `n_threads` keyword argument
to the `.pipe` method. This keyword argument is now deprecated as of v2.1.0.
Future versions may introduce a `n_process` argument for parallel inference via
multiprocessing.
to the `.pipe` method. This keyword argument is now deprecated as of v2.1.0. A
new keyword argument, `n_process`, was introduced to control parallel inference
via multiprocessing in v2.2.2.
</Infobox>
@ -98,6 +98,7 @@ multiprocessing.
| `batch_size` | int | The number of texts to buffer. |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
| `n_process` <Tag variant="new">2.2.2</Tag> | int | Number of processes to use, only supported in Python 3. Defaults to `1`. |
| **YIELDS** | `Doc` | Documents in the order of the original text. |
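To illustrate the `n_process` argument described above, here's a minimal sketch (the texts and process count are illustrative; assumes Python 3 and spaCy v2.2.2+):

```python
import spacy

nlp = spacy.blank("en")
texts = ["This is the first text.", "This is the second text.", "And a third one."]

# Distribute inference across two worker processes (Python 3 only)
for doc in nlp.pipe(texts, n_process=2):
    print([token.text for token in doc])
```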
## Language.update {#update tag="method"}


@ -38,7 +38,7 @@ be shown.
| Name | Type | Description |
| --------------------------------------- | --------------- | ------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. |
| `max_length` | int | Deprecated argument - the `PhraseMatcher` does not have a phrase length limit anymore. |
| `attr` <Tag variant="new">2.1</Tag> | int / unicode | The token attribute to match on. Defaults to `ORTH`, i.e. the verbatim token text. |
| `validate` <Tag variant="new">2.1</Tag> | bool | Validate patterns added to the matcher. |
| **RETURNS** | `PhraseMatcher` | The newly constructed object. |
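As a rough sketch of how these parameters fit together (the match key `"OBAMA"` and the example texts are made up for illustration):

```python
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
# Match on the lowercase form of each token and validate patterns as they're added
matcher = PhraseMatcher(nlp.vocab, attr="LOWER", validate=True)
matcher.add("OBAMA", None, nlp("Barack Obama"))

doc = nlp("barack obama was the 44th president of the united states")
matches = matcher(doc)  # list of (match_id, start, end) tuples
```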
@ -70,6 +70,18 @@ Find all token sequences matching the supplied patterns on the `Doc`.
| `doc` | `Doc` | The document to match over. |
| **RETURNS** | list | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. |
<Infobox title="Note on retrieving the string representation of the match_id" variant="warning">
Because spaCy stores all strings as integers, the `match_id` you get back will
be an integer, too, but you can always get the string representation by looking
it up in the vocabulary's `StringStore`, i.e. `nlp.vocab.strings`:
```python
match_id_string = nlp.vocab.strings[match_id]
```
</Infobox>
## PhraseMatcher.pipe {#pipe tag="method"}
Match a stream of documents, yielding them in turn.


@ -5,8 +5,6 @@ next: /usage/spacy-101
menu:
- ['Feature Comparison', 'comparison']
- ['Benchmarks', 'benchmarks']
- ['Powered by spaCy', 'powered-by']
- ['Other Libraries', 'other-libraries']
---
## Feature comparison {#comparison}


@ -135,9 +135,8 @@ interface for GPU arrays.
spaCy can be installed on GPU by specifying `spacy[cuda]`, `spacy[cuda90]`,
`spacy[cuda91]`, `spacy[cuda92]` or `spacy[cuda100]`. If you know your cuda
version, using the more explicit specifier allows cupy to be installed via
wheel, saving some compilation time. The specifiers should install two
libraries: [`cupy`](https://cupy.chainer.org) and
[`thinc_gpu_ops`](https://github.com/explosion/thinc_gpu_ops).
wheel, saving some compilation time. The specifiers should install
[`cupy`](https://cupy.chainer.org).
```bash
$ pip install -U spacy[cuda92]


@ -327,7 +327,7 @@ displaCy in our [online demo](https://explosion.ai/demos/displacy)..
### Disabling the parser {#disabling}
In the [default models](/models), the parser is loaded and enabled as part of
the [standard processing pipeline](/usage/processing-pipelin). If you don't need
the [standard processing pipeline](/usage/processing-pipelines). If you don't need
any of the syntactic information, you should disable the parser. Disabling the
parser will make spaCy load and run much faster. If you want to load the parser,
but need to disable it for specific documents, you can also control its use on
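A minimal sketch of both approaches, assuming the `en_core_web_sm` model is installed (the example text is illustrative):

```python
import spacy

# Option 1: load the model without the parser at all
nlp = spacy.load("en_core_web_sm", disable=["parser"])

# Option 2: keep the parser loaded, but disable it for specific documents
nlp = spacy.load("en_core_web_sm")
with nlp.disable_pipes("parser"):
    doc = nlp("The parser is skipped for this text.")
```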


@ -9,7 +9,7 @@ menu:
---
Compared to using regular expressions on raw text, spaCy's rule-based matcher
engines and components not only let you find you the words and phrases you're
engines and components not only let you find the words and phrases you're
looking for; they also give you access to the tokens within the document and
their relationships. This means you can easily access and analyze the
surrounding tokens, merge spans into single tokens or add entries to the named
@ -1096,6 +1096,33 @@ with the patterns. When you load the model back in, all pipeline components will
be restored and deserialized, including the entity ruler. This lets you ship
powerful model packages with binary weights _and_ rules included!
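As a minimal sketch of that round trip (the path and the `"ORG"` pattern are illustrative):

```python
import spacy
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp, patterns=[{"label": "ORG", "pattern": "Explosion AI"}])
nlp.add_pipe(ruler)

nlp.to_disk("/tmp/model_with_ruler")            # the ruler's patterns are saved with the pipeline
restored = spacy.load("/tmp/model_with_ruler")  # the entity ruler is restored as well
```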
### Using a large number of phrase patterns {#entityruler-large-phrase-patterns new="2.2.4"}
When using a large number of **phrase patterns** (roughly > 10000), it's useful to understand how the `add_patterns` function of the EntityRuler works: for each **phrase pattern**,
the EntityRuler calls the nlp object to construct a Doc object. This matters if, for example, you add the EntityRuler at the end of an existing pipeline that includes a POS tagger
and want to extract matches based on the pattern's POS signature, in which case you would pass `phrase_matcher_attr="POS"` to the EntityRuler.
Running the full language pipeline across every pattern in a large list scales linearly and can therefore take a long time for large numbers of phrase patterns.
As of spaCy 2.2.4, `add_patterns` has been refactored to use `nlp.pipe` on all phrase patterns, resulting in roughly a 10x-20x speed-up for 5,000-100,000 phrase patterns respectively.
Even with this speed-up (and especially if you're using an older version), `add_patterns` can still take a long time.
An easy workaround is to disable the other pipeline components while the phrase patterns are added, as in the example below.
```python
import spacy
from spacy.pipeline import EntityRuler

# Assumes any installed pipeline that includes a tagger, e.g. en_core_web_sm
nlp = spacy.load("en_core_web_sm")
entityruler = EntityRuler(nlp)
patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]
# Disable everything except the tagger while the patterns are added
other_pipes = [p for p in nlp.pipe_names if p != "tagger"]
with nlp.disable_pipes(*other_pipes):
    entityruler.add_patterns(patterns)
```
## Combining models and rules {#models-rules}
You can combine statistical and rule-based components in a variety of ways.


@ -229,10 +229,10 @@ For more details on **adding hooks** and **overwriting** the built-in `Doc`,
If you're using a GPU, it's much more efficient to keep the word vectors on the
device. You can do that by setting the [`Vectors.data`](/api/vectors#attributes)
attribute to a `cupy.ndarray` object if you're using spaCy or
[Chainer]("https://chainer.org"), or a `torch.Tensor` object if you're using
[PyTorch]("http://pytorch.org"). The `data` object just needs to support
[Chainer](https://chainer.org), or a `torch.Tensor` object if you're using
[PyTorch](http://pytorch.org). The `data` object just needs to support
`__iter__` and `__getitem__`, so if you're using another library such as
[TensorFlow]("https://www.tensorflow.org"), you could also create a wrapper for
[TensorFlow](https://www.tensorflow.org), you could also create a wrapper for
your vectors data.
```python


@ -999,6 +999,17 @@
"author": "Graphbrain",
"category": ["standalone"]
},
{
"type": "education",
"id": "nostarch-nlp-python",
"title": "Natural Language Processing Using Python",
"slogan": "No Starch Press, 2020",
"description": "Natural Language Processing Using Python is an introduction to natural language processing (NLP), the task of converting human language into data that a computer can process. The book uses spaCy, a leading Python library for NLP, to guide readers through common NLP tasks related to generating and understanding human language with code. It addresses problems like understanding a user's intent, continuing a conversation with a human, and maintaining the state of a conversation.",
"cover": "https://nostarch.com/sites/default/files/styles/uc_product_full/public/NaturalLanguageProcessing_final_v01.jpg",
"url": "https://nostarch.com/NLPPython",
"author": "Yuli Vasiliev",
"category": ["books"]
},
{
"type": "education",
"id": "oreilly-python-ds",
@ -1509,28 +1520,30 @@
{
"id": "spacy-conll",
"title": "spacy_conll",
"slogan": "Parse text with spaCy and print the output in CoNLL-U format",
"description": "This module allows you to parse a text to CoNLL-U format. You can use it as a command line tool, or embed it in your own scripts.",
"slogan": "Parse text with spaCy and gets its output in CoNLL-U format",
"description": "This module allows you to parse a text to CoNLL-U format. It contains a pipeline component for spaCy that adds CoNLL-U properties to a Doc and its sentences. It can also be used as a command-line tool.",
"code_example": [
"from spacy_conll import Spacy2ConllParser",
"spacyconll = Spacy2ConllParser()",
"import spacy",
"from spacy_conll import ConllFormatter",
"",
"# `parse` returns a generator of the parsed sentences",
"for parsed_sent in spacyconll.parse(input_str='I like cookies.\nWhat about you?\nI don't like 'em!'):",
" do_something_(parsed_sent)",
"",
"# `parseprint` prints output to stdout (default) or a file (use `output_file` parameter)",
"# This method is called when using the command line",
"spacyconll.parseprint(input_str='I like cookies.')"
"nlp = spacy.load('en')",
"conllformatter = ConllFormatter(nlp)",
"nlp.add_pipe(conllformatter, after='parser')",
"doc = nlp('I like cookies. Do you?')",
"conll = doc._.conll",
"print(doc._.conll_str_headers)",
"print(doc._.conll_str)"
],
"code_language": "python",
"author": "Bram Vanroy",
"author_links": {
"github": "BramVanroy",
"github": "BramVanroy",
"twitter": "BramVanroy",
"website": "https://bramvanroy.be"
},
"github": "BramVanroy/spacy_conll",
"category": ["standalone"]
"category": ["standalone", "pipeline"],
"tags": ["linguistics", "computational linguistics", "conll"]
},
{
"id": "spacy-langdetect",
@ -1837,6 +1850,20 @@
"github": "microsoft"
}
},
{
"id": "presidio-research",
"title": "Presidio Research",
"slogan": "Toolbox for developing and evaluating PII detectors, NER models for PII and generating fake PII data",
"description": "This package features data-science related tasks for developing new recognizers for Microsoft Presidio. It is used for the evaluation of the entire system, as well as for evaluating specific PII recognizers or PII detection models. Anyone interested in evaluating an existing Microsoft Presidio instance, a specific PII recognizer or to develop new models or logic for detecting PII could leverage the preexisting work in this package. Additionally, anyone interested in generating new data based on previous datasets (e.g. to increase the coverage of entity values) for Named Entity Recognition models could leverage the data generator contained in this package.",
"url": "https://aka.ms/presidio-research",
"github": "microsoft/presidio-research",
"category": ["standalone"],
"thumb": "https://avatars0.githubusercontent.com/u/6154722",
"author": "Microsoft",
"author_links": {
"github": "microsoft"
}
},
{
"id": "python-sentence-boundary-disambiguation",
"title": "pySBD - python Sentence Boundary Disambiguation",
@ -1901,6 +1928,43 @@
"twitter": "PatadiaYash",
"github": "yash1994"
}
},
{
"id": "spacy-pytextrank",
"title": "PyTextRank",
"slogan": "Py impl of TextRank for lightweight phrase extraction",
"description": "An implementation of TextRank in Python for use in spaCy pipelines which provides fast, effective phrase extraction from texts, along with extractive summarization. The graph algorithm works independent of a specific natural language and does not require domain knowledge. See (Mihalcea 2004) https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf",
"github": "DerwenAI/pytextrank",
"pip": "pytextrank",
"code_example": [
"import spacy",
"import pytextrank",
"",
"nlp = spacy.load('en_core_web_sm')",
"",
"tr = pytextrank.TextRank()",
"nlp.add_pipe(tr.PipelineComponent, name='textrank', last=True)",
"",
"text = 'Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered.'",
"doc = nlp(text)",
"",
"# examine the top-ranked phrases in the document",
"for p in doc._.phrases:",
" print('{:.4f} {:5d} {}'.format(p.rank, p.count, p.text))",
" print(p.chunks)"
],
"code_language": "python",
"url": "https://github.com/DerwenAI/pytextrank/wiki",
"thumb": "https://memegenerator.net/img/instances/66942896.jpg",
"image": "https://memegenerator.net/img/instances/66942896.jpg",
"author": "Paco Nathan",
"author_links": {
"twitter": "pacoid",
"github": "ceteri",
"website": "https://derwen.ai/paco"
},
"category": ["pipeline"],
"tags": ["phrase extraction", "ner", "summarization", "graph algorithms", "textrank"]
}
],