Merge branch 'develop' into refactor/remove-symlinks

Ines Montani 2020-02-18 17:22:20 +01:00
commit a3335d36b8
189 changed files with 3690 additions and 736 deletions

.github/contributors/AlJohri.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Al Johri |
| Company name (if applicable) | N/A |
| Title or role (if applicable) | N/A |
| Date | December 27th, 2019 |
| GitHub username | AlJohri |
| Website (optional) | http://aljohri.com/ |

.github/contributors/Jan-711.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jan Jessewitsch |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 16.02.2020 |
| GitHub username | Jan-711 |
| Website (optional) | |

.github/contributors/ceteri.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ---------------------- |
| Name | Paco Nathan |
| Company name (if applicable) | Derwen, Inc. |
| Title or role (if applicable) | Managing Partner |
| Date | 2020-01-25 |
| GitHub username | ceteri |
| Website (optional) | https://derwen.ai/paco |

.github/contributors/drndos.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Filip Bednárik |
| Company name (if applicable) | Ardevop, s. r. o. |
| Title or role (if applicable) | IT Consultant |
| Date | 2020-01-26 |
| GitHub username | drndos |
| Website (optional) | https://ardevop.sk |

.github/contributors/iechevarria.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | --------------------- |
| Name | Ivan Echevarria |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-12-24 |
| GitHub username | iechevarria |
| Website (optional) | https://echevarria.io |

.github/contributors/iurshina.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Anastasiia Iurshina |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 28.12.2019 |
| GitHub username | iurshina |
| Website (optional) | |

.github/contributors/onlyanegg.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
- Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
- to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
- each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
| ----------------------------- | ---------------- |
| Name | Tyler Couto |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | January 29, 2020 |
| GitHub username | onlyanegg |
| Website (optional) | |

MANIFEST.in

@@ -1,5 +1,5 @@
 recursive-include include *.h
-recursive-include spacy *.pyx *.pxd *.txt
+recursive-include spacy *.txt *.pyx *.pxd
 include LICENSE
 include README.md
 include bin/spacy

bin/spacy

@@ -1 +1,2 @@
+#! /bin/sh
 python -m spacy "$@"

bin/wiki_entity_linking/README.md

@@ -7,16 +7,17 @@ Run `wikipedia_pretrain_kb.py`
 * WikiData: get `latest-all.json.bz2` from https://dumps.wikimedia.org/wikidatawiki/entities/
 * Wikipedia: get `enwiki-latest-pages-articles-multistream.xml.bz2` from https://dumps.wikimedia.org/enwiki/latest/ (or for any other language)
 * You can set the filtering parameters for KB construction:
-* `max_per_alias`: (max) number of candidate entities in the KB per alias/synonym
-* `min_freq`: threshold of number of times an entity should occur in the corpus to be included in the KB
-* `min_pair`: threshold of number of times an entity+alias combination should occur in the corpus to be included in the KB
+* `max_per_alias` (`-a`): (max) number of candidate entities in the KB per alias/synonym
+* `min_freq` (`-f`): threshold of number of times an entity should occur in the corpus to be included in the KB
+* `min_pair` (`-c`): threshold of number of times an entity+alias combination should occur in the corpus to be included in the KB
 * Further parameters to set:
-* `descriptions_from_wikipedia`: whether to parse descriptions from Wikipedia (`True`) or Wikidata (`False`)
-* `entity_vector_length`: length of the pre-trained entity description vectors
-* `lang`: language for which to fetch Wikidata information (as the dump contains all languages)
+* `descriptions_from_wikipedia` (`-wp`): whether to parse descriptions from Wikipedia (`True`) or Wikidata (`False`)
+* `entity_vector_length` (`-v`): length of the pre-trained entity description vectors
+* `lang` (`-la`): language for which to fetch Wikidata information (as the dump contains all languages)
 
 Quick testing and rerunning:
-* When trying out the pipeline for a quick test, set `limit_prior`, `limit_train` and/or `limit_wd` to read only parts of the dumps instead of everything.
+* When trying out the pipeline for a quick test, set `limit_prior` (`-lp`), `limit_train` (`-lt`) and/or `limit_wd` (`-lw`) to read only parts of the dumps instead of everything.
+* e.g. set `-lt 20000 -lp 2000 -lw 3000 -f 1`
 * If you only want to (re)run certain parts of the pipeline, just remove the corresponding files and they will be recalculated or reparsed.
 
@@ -24,11 +25,13 @@ Quick testing and rerunning:
 Run `wikidata_train_entity_linker.py`
 * This takes the **KB directory** produced by Step 1, and trains an **Entity Linking model**
+* Specify the output directory (`-o`) in which the final, trained model will be saved
 * You can set the learning parameters for the EL training:
-* `epochs`: number of training iterations
-* `dropout`: dropout rate
-* `lr`: learning rate
-* `l2`: L2 regularization
+* `epochs` (`-e`): number of training iterations
+* `dropout` (`-p`): dropout rate
+* `lr` (`-n`): learning rate
+* `l2` (`-r`): L2 regularization
-* Specify the number of training and dev testing entities with `train_inst` and `dev_inst` respectively
+* Specify the number of training and dev testing articles with `train_articles` (`-t`) and `dev_articles` (`-d`) respectively
+* If not specified, the full dataset will be processed - this may take a LONG time !
 * Further parameters to set:
-* `labels_discard`: NER label types to discard during training
+* `labels_discard` (`-l`): NER label types to discard during training
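
The options above map onto the scripts' CLI flags. For reference, a step-2 training run might be driven like the sketch below; this is not part of the commit, the paths, article counts and discarded labels are placeholder assumptions, and the KB directory is assumed to be the script's positional argument.

```python
# Hypothetical invocation of the step-2 training script using the flags
# documented above; all paths and numbers are placeholders.
import subprocess

subprocess.run(
    [
        "python", "wikidata_train_entity_linker.py",  # script named in the README; exact path may differ
        "output/wiki_kb",          # KB directory produced by step 1 (assumed positional argument)
        "-o", "output/nel_model",  # output directory for the trained pipeline
        "-e", "10",                # epochs
        "-t", "5000",              # train_articles: articles used per epoch
        "-d", "500",               # dev_articles: articles used for evaluation
        "-l", "ORDINAL,CARDINAL",  # labels_discard: NER types to skip (example values)
    ],
    check=True,
)
```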

bin/wiki_entity_linking/entity_linker_evaluation.py

@@ -1,6 +1,8 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
 import logging
 import random
 from tqdm import tqdm
 from collections import defaultdict
@@ -92,102 +94,81 @@ class BaselineResults(object):
         self.random.update_metrics(ent_label, true_entity, random_candidate)
 
 
-def measure_performance(dev_data, kb, el_pipe, baseline=True, context=True):
-    if baseline:
-        baseline_accuracies, counts = measure_baselines(dev_data, kb)
-        logger.info("Counts: {}".format({k: v for k, v in sorted(counts.items())}))
-        logger.info(baseline_accuracies.report_performance("random"))
-        logger.info(baseline_accuracies.report_performance("prior"))
-        logger.info(baseline_accuracies.report_performance("oracle"))
-
-    if context:
-        # using only context
-        el_pipe.cfg["incl_context"] = True
-        el_pipe.cfg["incl_prior"] = False
-        results = get_eval_results(dev_data, el_pipe)
-        logger.info(results.report_metrics("context only"))
-
-        # measuring combined accuracy (prior + context)
-        el_pipe.cfg["incl_context"] = True
-        el_pipe.cfg["incl_prior"] = True
-        results = get_eval_results(dev_data, el_pipe)
-        logger.info(results.report_metrics("context and prior"))
+def measure_performance(dev_data, kb, el_pipe, baseline=True, context=True, dev_limit=None):
+    counts = dict()
+    baseline_results = BaselineResults()
+    context_results = EvaluationResults()
+    combo_results = EvaluationResults()
+
+    for doc, gold in tqdm(dev_data, total=dev_limit, leave=False, desc='Processing dev data'):
+        if len(doc) > 0:
+            correct_ents = dict()
+            for entity, kb_dict in gold.links.items():
+                start, end = entity
+                for gold_kb, value in kb_dict.items():
+                    if value:
+                        # only evaluating on positive examples
+                        offset = _offset(start, end)
+                        correct_ents[offset] = gold_kb
+
+            if baseline:
+                _add_baseline(baseline_results, counts, doc, correct_ents, kb)
+
+            if context:
+                # using only context
+                el_pipe.cfg["incl_context"] = True
+                el_pipe.cfg["incl_prior"] = False
+                _add_eval_result(context_results, doc, correct_ents, el_pipe)
+
+                # measuring combined accuracy (prior + context)
+                el_pipe.cfg["incl_context"] = True
+                el_pipe.cfg["incl_prior"] = True
+                _add_eval_result(combo_results, doc, correct_ents, el_pipe)
+
+    if baseline:
+        logger.info("Counts: {}".format({k: v for k, v in sorted(counts.items())}))
+        logger.info(baseline_results.report_performance("random"))
+        logger.info(baseline_results.report_performance("prior"))
+        logger.info(baseline_results.report_performance("oracle"))
+
+    if context:
+        logger.info(context_results.report_metrics("context only"))
+        logger.info(combo_results.report_metrics("context and prior"))
 
 
-def get_eval_results(data, el_pipe=None):
+def _add_eval_result(results, doc, correct_ents, el_pipe):
     """
     Evaluate the ent.kb_id_ annotations against the gold standard.
     Only evaluate entities that overlap between gold and NER, to isolate the performance of the NEL.
-    If the docs in the data require further processing with an entity linker, set el_pipe.
     """
-    docs = []
-    golds = []
-    for d, g in tqdm(data, leave=False):
-        if len(d) > 0:
-            golds.append(g)
-            if el_pipe is not None:
-                docs.append(el_pipe(d))
-            else:
-                docs.append(d)
-
-    results = EvaluationResults()
-    for doc, gold in zip(docs, golds):
-        try:
-            correct_entries_per_article = dict()
-            for entity, kb_dict in gold.links.items():
-                start, end = entity
-                for gold_kb, value in kb_dict.items():
-                    if value:
-                        # only evaluating on positive examples
-                        offset = _offset(start, end)
-                        correct_entries_per_article[offset] = gold_kb
-
-            for ent in doc.ents:
-                ent_label = ent.label_
-                pred_entity = ent.kb_id_
-                start = ent.start_char
-                end = ent.end_char
-                offset = _offset(start, end)
-                gold_entity = correct_entries_per_article.get(offset, None)
-                # the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
-                if gold_entity is not None:
-                    results.update_metrics(ent_label, gold_entity, pred_entity)
-        except Exception as e:
-            logging.error("Error assessing accuracy " + str(e))
-
-    return results
+    try:
+        doc = el_pipe(doc)
+        for ent in doc.ents:
+            ent_label = ent.label_
+            start = ent.start_char
+            end = ent.end_char
+            offset = _offset(start, end)
+            gold_entity = correct_ents.get(offset, None)
+            # the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
+            if gold_entity is not None:
+                pred_entity = ent.kb_id_
+                results.update_metrics(ent_label, gold_entity, pred_entity)
+    except Exception as e:
+        logging.error("Error assessing accuracy " + str(e))
 
 
-def measure_baselines(data, kb):
+def _add_baseline(baseline_results, counts, doc, correct_ents, kb):
     """
     Measure 3 performance baselines: random selection, prior probabilities, and 'oracle' prediction for upper bound.
     Only evaluate entities that overlap between gold and NER, to isolate the performance of the NEL.
-    Also return a dictionary of counts by entity label.
     """
-    counts_d = dict()
-    baseline_results = BaselineResults()
-
-    docs = [d for d, g in data if len(d) > 0]
-    golds = [g for d, g in data if len(d) > 0]
-
-    for doc, gold in zip(docs, golds):
-        correct_entries_per_article = dict()
-        for entity, kb_dict in gold.links.items():
-            start, end = entity
-            for gold_kb, value in kb_dict.items():
-                # only evaluating on positive examples
-                if value:
-                    offset = _offset(start, end)
-                    correct_entries_per_article[offset] = gold_kb
-
-        for ent in doc.ents:
-            ent_label = ent.label_
-            start = ent.start_char
-            end = ent.end_char
-            offset = _offset(start, end)
-            gold_entity = correct_entries_per_article.get(offset, None)
-            # the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
-            if gold_entity is not None:
+    for ent in doc.ents:
+        ent_label = ent.label_
+        start = ent.start_char
+        end = ent.end_char
+        offset = _offset(start, end)
+        gold_entity = correct_ents.get(offset, None)
+        # the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
+        if gold_entity is not None:
@@ -207,8 +188,8 @@ def measure_baselines(data, kb):
-                prior_candidate = candidates[best_index].entity_
-                random_candidate = random.choice(candidates).entity_
-                current_count = counts_d.get(ent_label, 0)
-                counts_d[ent_label] = current_count+1
-                baseline_results.update_baselines(
-                    gold_entity,
+            prior_candidate = candidates[best_index].entity_
+            random_candidate = random.choice(candidates).entity_
+            current_count = counts.get(ent_label, 0)
+            counts[ent_label] = current_count+1
+            baseline_results.update_baselines(
+                gold_entity,
@@ -218,8 +199,6 @@ def measure_baselines(data, kb):
-                    oracle_candidate,
-                )
-
-    return baseline_results, counts_d
+                oracle_candidate,
+            )
 
 
 def _offset(start, end):
     return "{}_{}".format(start, end)

View File

@@ -40,7 +40,7 @@ logger = logging.getLogger(__name__)
     loc_prior_prob=("Location to file with prior probabilities", "option", "p", Path),
     loc_entity_defs=("Location to file with entity definitions", "option", "d", Path),
     loc_entity_desc=("Location to file with entity descriptions", "option", "s", Path),
-    descr_from_wp=("Flag for using wp descriptions not wd", "flag", "wp"),
+    descr_from_wp=("Flag for using descriptions from WP instead of WD (default False)", "flag", "wp"),
     limit_prior=("Threshold to limit lines read from WP for prior probabilities", "option", "lp", int),
     limit_train=("Threshold to limit lines read from WP for training set", "option", "lt", int),
    limit_wd=("Threshold to limit lines read from WD", "option", "lw", int),

bin/wiki_entity_linking/wikidata_train_entity_linker.py

@ -1,5 +1,5 @@
# coding: utf-8 # coding: utf-8
"""Script to take a previously created Knowledge Base and train an entity linking """Script that takes a previously created Knowledge Base and trains an entity linking
pipeline. The provided KB directory should hold the kb, the original nlp object and pipeline. The provided KB directory should hold the kb, the original nlp object and
its vocab used to create the KB, and a few auxiliary files such as the entity definitions, its vocab used to create the KB, and a few auxiliary files such as the entity definitions,
as created by the script `wikidata_create_kb`. as created by the script `wikidata_create_kb`.
@ -14,9 +14,16 @@ import logging
import spacy import spacy
from pathlib import Path from pathlib import Path
import plac import plac
from tqdm import tqdm
from bin.wiki_entity_linking import wikipedia_processor from bin.wiki_entity_linking import wikipedia_processor
from bin.wiki_entity_linking import TRAINING_DATA_FILE, KB_MODEL_DIR, KB_FILE, LOG_FORMAT, OUTPUT_MODEL_DIR from bin.wiki_entity_linking import (
TRAINING_DATA_FILE,
KB_MODEL_DIR,
KB_FILE,
LOG_FORMAT,
OUTPUT_MODEL_DIR,
)
from bin.wiki_entity_linking.entity_linker_evaluation import measure_performance from bin.wiki_entity_linking.entity_linker_evaluation import measure_performance
from bin.wiki_entity_linking.kb_creator import read_kb from bin.wiki_entity_linking.kb_creator import read_kb
@ -33,8 +40,8 @@ logger = logging.getLogger(__name__)
dropout=("Dropout to prevent overfitting (default 0.5)", "option", "p", float), dropout=("Dropout to prevent overfitting (default 0.5)", "option", "p", float),
lr=("Learning rate (default 0.005)", "option", "n", float), lr=("Learning rate (default 0.005)", "option", "n", float),
l2=("L2 regularization", "option", "r", float), l2=("L2 regularization", "option", "r", float),
train_inst=("# training instances (default 90% of all)", "option", "t", int), train_articles=("# training articles (default 90% of all)", "option", "t", int),
dev_inst=("# test instances (default 10% of all)", "option", "d", int), dev_articles=("# dev test articles (default 10% of all)", "option", "d", int),
labels_discard=("NER labels to discard (default None)", "option", "l", str), labels_discard=("NER labels to discard (default None)", "option", "l", str),
) )
def main( def main(
@ -45,10 +52,15 @@ def main(
dropout=0.5, dropout=0.5,
lr=0.005, lr=0.005,
l2=1e-6, l2=1e-6,
train_inst=None, train_articles=None,
dev_inst=None, dev_articles=None,
labels_discard=None labels_discard=None,
): ):
if not output_dir:
logger.warning(
"No output dir specified so no results will be written, are you sure about this ?"
)
logger.info("Creating Entity Linker with Wikipedia and WikiData") logger.info("Creating Entity Linker with Wikipedia and WikiData")
output_dir = Path(output_dir) if output_dir else dir_kb output_dir = Path(output_dir) if output_dir else dir_kb
@ -64,47 +76,57 @@ def main(
# STEP 1 : load the NLP object # STEP 1 : load the NLP object
logger.info("STEP 1a: Loading model from {}".format(nlp_dir)) logger.info("STEP 1a: Loading model from {}".format(nlp_dir))
nlp = spacy.load(nlp_dir) nlp = spacy.load(nlp_dir)
logger.info("STEP 1b: Loading KB from {}".format(kb_path)) logger.info(
kb = read_kb(nlp, kb_path) "Original NLP pipeline has following pipeline components: {}".format(
nlp.pipe_names
)
)
# check that there is a NER component in the pipeline # check that there is a NER component in the pipeline
if "ner" not in nlp.pipe_names: if "ner" not in nlp.pipe_names:
raise ValueError("The `nlp` object should have a pretrained `ner` component.") raise ValueError("The `nlp` object should have a pretrained `ner` component.")
# STEP 2: read the training dataset previously created from WP logger.info("STEP 1b: Loading KB from {}".format(kb_path))
logger.info("STEP 2: Reading training dataset from {}".format(training_path)) kb = read_kb(nlp, kb_path)
# STEP 2: read the training dataset previously created from WP
logger.info("STEP 2: Reading training & dev dataset from {}".format(training_path))
train_indices, dev_indices = wikipedia_processor.read_training_indices(
training_path
)
logger.info(
"Training set has {} articles, limit set to roughly {} articles per epoch".format(
len(train_indices), train_articles if train_articles else "all"
)
)
logger.info(
"Dev set has {} articles, limit set to rougly {} articles for evaluation".format(
len(dev_indices), dev_articles if dev_articles else "all"
)
)
if dev_articles:
dev_indices = dev_indices[0:dev_articles]
# STEP 3: create and train an entity linking pipe
logger.info(
"STEP 3: Creating and training an Entity Linking pipe for {} epochs".format(
epochs
)
)
if labels_discard: if labels_discard:
labels_discard = [x.strip() for x in labels_discard.split(",")] labels_discard = [x.strip() for x in labels_discard.split(",")]
logger.info("Discarding {} NER types: {}".format(len(labels_discard), labels_discard)) logger.info(
"Discarding {} NER types: {}".format(len(labels_discard), labels_discard)
)
    else:
        labels_discard = []
-    train_data = wikipedia_processor.read_training(
-        nlp=nlp,
-        entity_file_path=training_path,
-        dev=False,
-        limit=train_inst,
-        kb=kb,
-        labels_discard=labels_discard
-    )
-
-    # for testing, get all pos instances (independently of KB)
-    dev_data = wikipedia_processor.read_training(
-        nlp=nlp,
-        entity_file_path=training_path,
-        dev=True,
-        limit=dev_inst,
-        kb=None,
-        labels_discard=labels_discard
-    )

    # STEP 3: create and train an entity linking pipe
    logger.info("STEP 3: Creating and training an Entity Linking pipe")
-    el_pipe = nlp.create_pipe(
-        name="entity_linker", config={"pretrained_vectors": nlp.vocab.vectors,
-                                      "labels_discard": labels_discard}
-    )
+    el_pipe = nlp.create_pipe(
+        name="entity_linker",
+        config={
+            "pretrained_vectors": nlp.vocab.vectors,
+            "labels_discard": labels_discard,
+        },
+    )
    el_pipe.set_kb(kb)
    nlp.add_pipe(el_pipe, last=True)
@@ -115,78 +137,96 @@ def main(
    optimizer.learn_rate = lr
    optimizer.L2 = l2

-    logger.info("Training on {} articles".format(len(train_data)))
-    logger.info("Dev testing on {} articles".format(len(dev_data)))
-
-    # baseline performance on dev data
    logger.info("Dev Baseline Accuracies:")
-    measure_performance(dev_data, kb, el_pipe, baseline=True, context=False)
+    dev_data = wikipedia_processor.read_el_docs_golds(
+        nlp=nlp,
+        entity_file_path=training_path,
+        dev=True,
+        line_ids=dev_indices,
+        kb=kb,
+        labels_discard=labels_discard,
+    )
+    measure_performance(
+        dev_data, kb, el_pipe, baseline=True, context=False, dev_limit=len(dev_indices)
+    )

    for itn in range(epochs):
-        random.shuffle(train_data)
+        random.shuffle(train_indices)
        losses = {}
-        batches = minibatch(train_data, size=compounding(4.0, 128.0, 1.001))
+        batches = minibatch(train_indices, size=compounding(8.0, 128.0, 1.001))
        batchnr = 0
+        articles_processed = 0

-        with nlp.disable_pipes(*other_pipes):
-            for batch in batches:
-                try:
-                    nlp.update(
-                        examples=batch,
-                        sgd=optimizer,
-                        drop=dropout,
-                        losses=losses,
-                    )
-                    batchnr += 1
-                except Exception as e:
-                    logger.error("Error updating batch:" + str(e))
+        # we either process the whole training file, or just a part each epoch
+        bar_total = len(train_indices)
+        if train_articles:
+            bar_total = train_articles
+
+        with tqdm(total=bar_total, leave=False, desc=f"Epoch {itn}") as pbar:
+            for batch in batches:
+                if not train_articles or articles_processed < train_articles:
+                    with nlp.disable_pipes("entity_linker"):
+                        train_batch = wikipedia_processor.read_el_docs_golds(
+                            nlp=nlp,
+                            entity_file_path=training_path,
+                            dev=False,
+                            line_ids=batch,
+                            kb=kb,
+                            labels_discard=labels_discard,
+                        )
+                        docs, golds = zip(*train_batch)
+                    try:
+                        with nlp.disable_pipes(*other_pipes):
+                            nlp.update(
+                                docs=docs,
+                                golds=golds,
+                                sgd=optimizer,
+                                drop=dropout,
+                                losses=losses,
+                            )
+                        batchnr += 1
+                        articles_processed += len(docs)
+                        pbar.update(len(docs))
+                    except Exception as e:
+                        logger.error("Error updating batch:" + str(e))
        if batchnr > 0:
-            logging.info("Epoch {}, train loss {}".format(itn, round(losses["entity_linker"] / batchnr, 2)))
-            measure_performance(dev_data, kb, el_pipe, baseline=False, context=True)
+            logging.info(
+                "Epoch {} trained on {} articles, train loss {}".format(
+                    itn, articles_processed, round(losses["entity_linker"] / batchnr, 2)
+                )
+            )
+            # re-read the dev_data (data is returned as a generator)
+            dev_data = wikipedia_processor.read_el_docs_golds(
+                nlp=nlp,
+                entity_file_path=training_path,
+                dev=True,
+                line_ids=dev_indices,
+                kb=kb,
+                labels_discard=labels_discard,
+            )
+            measure_performance(
+                dev_data,
+                kb,
+                el_pipe,
+                baseline=False,
+                context=True,
+                dev_limit=len(dev_indices),
+            )

-    # STEP 4: measure the performance of our trained pipe on an independent dev set
-    logger.info("STEP 4: Final performance measurement of Entity Linking pipe")
-    measure_performance(dev_data, kb, el_pipe)
-
-    # STEP 5: apply the EL pipe on a toy example
-    logger.info("STEP 5: Applying Entity Linking to toy example")
-    run_el_toy_example(nlp=nlp)
-
    if output_dir:
-        # STEP 6: write the NLP pipeline (now including an EL model) to file
-        logger.info("STEP 6: Writing trained NLP to {}".format(nlp_output_dir))
+        # STEP 4: write the NLP pipeline (now including an EL model) to file
+        logger.info(
+            "Final NLP pipeline has following pipeline components: {}".format(
+                nlp.pipe_names
+            )
+        )
+        logger.info("STEP 4: Writing trained NLP to {}".format(nlp_output_dir))
        nlp.to_disk(nlp_output_dir)
    logger.info("Done!")

-
-def check_kb(kb):
-    for mention in ("Bush", "Douglas Adams", "Homer", "Brazil", "China"):
-        candidates = kb.get_candidates(mention)
-
-        logger.info("generating candidates for " + mention + " :")
-        for c in candidates:
-            logger.info(" ".join([
-                str(c.prior_prob),
-                c.alias_,
-                "-->",
-                c.entity_ + " (freq=" + str(c.entity_freq) + ")"
-            ]))
-
-
-def run_el_toy_example(nlp):
-    text = (
-        "In The Hitchhiker's Guide to the Galaxy, written by Douglas Adams, "
-        "Douglas reminds us to always bring our towel, even in China or Brazil. "
-        "The main character in Doug's novel is the man Arthur Dent, "
-        "but Dougledydoug doesn't write about George Washington or Homer Simpson."
-    )
-    doc = nlp(text)
-    logger.info(text)
-    for ent in doc.ents:
-        logger.info(" ".join(["ent", ent.text, ent.label_, ent.kb_id_]))
-

 if __name__ == "__main__":
     logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)
     plac.call(main)
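As a side note on the training loop above, this is a minimal, self-contained sketch of the minibatch/compounding batching pattern it relies on; the index list here is only a stand-in for the article line indices used by the script.

import random
from spacy.util import minibatch, compounding

train_indices = list(range(1000))  # stand-in for the training line indices

for epoch in range(2):
    random.shuffle(train_indices)
    # batch size grows from 8 towards 128, multiplied by 1.001 after every batch
    for batch in minibatch(train_indices, size=compounding(8.0, 128.0, 1.001)):
        pass  # each `batch` is a list of indices; the script reads and trains on them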
@@ -6,9 +6,6 @@ import bz2
 import logging
 import random
 import json
-from tqdm import tqdm
-from functools import partial

 from spacy.gold import GoldParse
 from bin.wiki_entity_linking import wiki_io as io
@@ -454,25 +451,40 @@ def _write_training_entities(outputfile, article_id, clean_text, entities):
         outputfile.write(line)


-def read_training(nlp, entity_file_path, dev, limit, kb, labels_discard=None):
-    """ This method provides training examples that correspond to the entity annotations found by the nlp object.
+def read_training_indices(entity_file_path):
+    """ This method creates two lists of indices into the training file: one with indices for the
+    training examples, and one for the dev examples."""
+    train_indices = []
+    dev_indices = []
+    with entity_file_path.open("r", encoding="utf8") as file:
+        for i, line in enumerate(file):
+            example = json.loads(line)
+            article_id = example["article_id"]
+            clean_text = example["clean_text"]
+            if is_valid_article(clean_text):
+                if is_dev(article_id):
+                    dev_indices.append(i)
+                else:
+                    train_indices.append(i)
+    return train_indices, dev_indices
+
+
+def read_el_docs_golds(nlp, entity_file_path, dev, line_ids, kb, labels_discard=None):
+    """ This method provides training/dev examples that correspond to the entity annotations found by the nlp object.
     For training, it will include both positive and negative examples by using the candidate generator from the kb.
     For testing (kb=None), it will include all positive examples only."""
     if not labels_discard:
         labels_discard = []

-    data = []
-    num_entities = 0
-    get_gold_parse = partial(
-        _get_gold_parse, dev=dev, kb=kb, labels_discard=labels_discard
-    )
-
-    logger.info(
-        "Reading {} data with limit {}".format("dev" if dev else "train", limit)
-    )
+    texts = []
+    entities_list = []

     with entity_file_path.open("r", encoding="utf8") as file:
-        with tqdm(total=limit, leave=False) as pbar:
-            for i, line in enumerate(file):
+        for i, line in enumerate(file):
+            if i in line_ids:
                 example = json.loads(line)
                 article_id = example["article_id"]
                 clean_text = example["clean_text"]
@@ -481,16 +493,15 @@ def read_training(nlp, entity_file_path, dev, limit, kb, labels_discard=None):
                 if dev != is_dev(article_id) or not is_valid_article(clean_text):
                     continue

-                doc = nlp(clean_text)
-                gold = get_gold_parse(doc, entities)
-                if gold and len(gold.links) > 0:
-                    data.append((doc, gold))
-                    num_entities += len(gold.links)
-                    pbar.update(len(gold.links))
-                    if limit and num_entities >= limit:
-                        break
-    logger.info("Read {} entities in {} articles".format(num_entities, len(data)))
-    return data
+                texts.append(clean_text)
+                entities_list.append(entities)
+
+    docs = nlp.pipe(texts, batch_size=50)
+    for doc, entities in zip(docs, entities_list):
+        gold = _get_gold_parse(doc, entities, dev=dev, kb=kb, labels_discard=labels_discard)
+        if gold and len(gold.links) > 0:
+            yield doc, gold


 def _get_gold_parse(doc, entities, dev, kb, labels_discard):
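For orientation, a hedged usage sketch of the two helpers above: the function names and keyword arguments come from this diff, the module import mirrors the `bin.wiki_entity_linking` imports shown above, and the file path and blank pipeline are placeholders.

from pathlib import Path
import spacy
from bin.wiki_entity_linking import wikipedia_processor

nlp = spacy.blank("en")  # stand-in; the real script uses a model with an NER pipe
training_path = Path("gold_entities.jsonl")  # placeholder path to the processed training file

train_indices, dev_indices = wikipedia_processor.read_training_indices(training_path)

# read_el_docs_golds() is a generator: (doc, gold) pairs are streamed, not kept in memory
dev_data = wikipedia_processor.read_el_docs_golds(
    nlp=nlp,
    entity_file_path=training_path,
    dev=True,
    line_ids=dev_indices,
    kb=None,  # kb=None keeps only positive examples, as the docstring above states
    labels_discard=None,
)
for doc, gold in dev_data:
    pass  # evaluate or collect the streamed examples here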
@@ -26,12 +26,12 @@ DEFAULT_TEXT = "Mark Zuckerberg is the CEO of Facebook."
 HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""


-@st.cache(ignore_hash=True)
+@st.cache(allow_output_mutation=True)
 def load_model(name):
     return spacy.load(name)


-@st.cache(ignore_hash=True)
+@st.cache(allow_output_mutation=True)
 def process_text(model_name, text):
     nlp = load_model(model_name)
     return nlp(text)
@@ -79,7 +79,9 @@ if "ner" in nlp.pipe_names:
     st.header("Named Entities")
     st.sidebar.header("Named Entities")
     label_set = nlp.get_pipe("ner").labels
-    labels = st.sidebar.multiselect("Entity labels", label_set, label_set)
+    labels = st.sidebar.multiselect(
+        "Entity labels", options=label_set, default=list(label_set)
+    )
     html = displacy.render(doc, style="ent", options={"ents": labels})
     # Newlines seem to mess with the rendering
     html = html.replace("\n", " ")
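A minimal sketch of the caching change above, assuming a Streamlit release of that era and an installed `en_core_web_sm` package: `allow_output_mutation=True` replaces the removed `ignore_hash` flag and tells `st.cache` not to hash or copy the returned spaCy objects.

import spacy
import streamlit as st

@st.cache(allow_output_mutation=True)
def load_model(name):
    # the returned nlp object cannot be hashed by st.cache, so output mutation is allowed
    return spacy.load(name)

nlp = load_model("en_core_web_sm")  # assumes this model package is installed
doc = nlp("Mark Zuckerberg is the CEO of Facebook.")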
@@ -32,27 +32,24 @@ DESC_WIDTH = 64  # dimension of output entity vectors

 @plac.annotations(
-    vocab_path=("Path to the vocab for the kb", "option", "v", Path),
-    model=("Model name, should have pretrained word embeddings", "option", "m", str),
+    model=("Model name, should have pretrained word embeddings", "positional", None, str),
     output_dir=("Optional output directory", "option", "o", Path),
     n_iter=("Number of training iterations", "option", "n", int),
 )
-def main(vocab_path=None, model=None, output_dir=None, n_iter=50):
+def main(model=None, output_dir=None, n_iter=50):
     """Load the model, create the KB and pretrain the entity encodings.
-    Either an nlp model or a vocab is needed to provide access to pretrained word embeddings.
     If an output_dir is provided, the KB will be stored there in a file 'kb'.
-    When providing an nlp model, the updated vocab will also be written to a directory in the output_dir."""
-    if model is None and vocab_path is None:
-        raise ValueError("Either the `nlp` model or the `vocab` should be specified.")
+    The updated vocab will also be written to a directory in the output_dir."""

-    if model is not None:
-        nlp = spacy.load(model)  # load existing spaCy model
-        print("Loaded model '%s'" % model)
-    else:
-        vocab = Vocab().from_disk(vocab_path)
-        # create blank Language class with specified vocab
-        nlp = spacy.blank("en", vocab=vocab)
-        print("Created blank 'en' model with vocab from '%s'" % vocab_path)
+    nlp = spacy.load(model)  # load existing spaCy model
+    print("Loaded model '%s'" % model)
+
+    # check the length of the nlp vectors
+    if "vectors" not in nlp.meta or not nlp.vocab.vectors.size:
+        raise ValueError(
+            "The `nlp` object should have access to pretrained word vectors, "
+            " cf. https://spacy.io/usage/models#languages."
+        )

     kb = KnowledgeBase(vocab=nlp.vocab)
@@ -103,8 +100,6 @@ def main(vocab_path=None, model=None, output_dir=None, n_iter=50):
         print()
         print("Saved KB to", kb_path)

-        # only storing the vocab if we weren't already reading it from file
-        if not vocab_path:
-            vocab_path = output_dir / "vocab"
-            kb.vocab.to_disk(vocab_path)
-            print("Saved vocab to", vocab_path)
+        vocab_path = output_dir / "vocab"
+        kb.vocab.to_disk(vocab_path)
+        print("Saved vocab to", vocab_path)
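A hedged sketch of the vectors guard above combined with basic KnowledgeBase usage, assuming the spaCy v2.2-style KB API; the model name, entity ID "Q42", frequency and zero vector are illustrative values only (64 mirrors the DESC_WIDTH constant shown above).

import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load("en_core_web_md")  # assumes a model shipping word vectors is installed
if "vectors" not in nlp.meta or not nlp.vocab.vectors.size:
    raise ValueError("The `nlp` object should have access to pretrained word vectors.")

kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)
kb.add_entity(entity="Q42", freq=32, entity_vector=[0.0] * 64)  # illustrative entity
kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[1.0])
print(kb.get_size_entities(), kb.get_size_aliases())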
@@ -131,7 +131,8 @@ def train_textcat(nlp, n_texts, n_iter=10):
     train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
     # get names of other pipes to disable them during training
-    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
+    pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
+    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
     with nlp.disable_pipes(*other_pipes):  # only train textcat
         optimizer = nlp.begin_training()
         textcat.model.tok2vec.from_bytes(tok2vec_weights)
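The same freeze-everything-but-one-component pattern in isolation, as a minimal sketch assuming the v2.x-style API used throughout these example scripts (the "POSITIVE" label is only there to make the snippet runnable):

import spacy

nlp = spacy.blank("en")
textcat = nlp.create_pipe("textcat")
textcat.add_label("POSITIVE")
nlp.add_pipe(textcat)

pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes):  # everything except textcat (and the trf wrappers) is frozen
    optimizer = nlp.begin_training()
    print(nlp.pipe_names)  # only the components left enabled will be updated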
@@ -63,7 +63,8 @@ def main(model_name, unlabelled_loc):
     optimizer.b2 = 0.0
     # get names of other pipes to disable them during training
-    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
+    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
+    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
     sizes = compounding(1.0, 4.0, 1.001)
     with nlp.disable_pipes(*other_pipes):
         for itn in range(n_iter):
@@ -113,7 +113,8 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
         TRAIN_DOCS.append((doc, annotation_clean))
     # get names of other pipes to disable them during training
-    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
+    pipe_exceptions = ["entity_linker", "trf_wordpiecer", "trf_tok2vec"]
+    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
     with nlp.disable_pipes(*other_pipes):  # only train entity linker
         # reset and initialize the weights randomly
         optimizer = nlp.begin_training()
@@ -124,7 +124,8 @@ def main(model=None, output_dir=None, n_iter=15):
         for dep in annotations.get("deps", []):
             parser.add_label(dep)
-    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "parser"]
+    pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
+    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
     with nlp.disable_pipes(*other_pipes):  # only train parser
         optimizer = nlp.begin_training()
         for itn in range(n_iter):
@@ -55,7 +55,8 @@ def main(model=None, output_dir=None, n_iter=100):
             ner.add_label(ent[2])
     # get names of other pipes to disable them during training
-    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
+    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
+    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
     with nlp.disable_pipes(*other_pipes):  # only train NER
         # reset and initialize the weights randomly but only if we're
         # training a new model
@@ -95,7 +95,8 @@ def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
         optimizer = nlp.resume_training()
     move_names = list(ner.move_names)
     # get names of other pipes to disable them during training
-    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
+    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
+    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
     with nlp.disable_pipes(*other_pipes):  # only train NER
         sizes = compounding(1.0, 4.0, 1.001)
         # batch up the examples using spaCy's minibatch
@@ -65,7 +65,8 @@ def main(model=None, output_dir=None, n_iter=15):
         parser.add_label(dep)
     # get names of other pipes to disable them during training
-    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "parser"]
+    pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
+    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
     with nlp.disable_pipes(*other_pipes):  # only train parser
         optimizer = nlp.begin_training()
         for itn in range(n_iter):
@@ -68,7 +68,8 @@ def main(model=None, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=None
     train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
     # get names of other pipes to disable them during training
-    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
+    pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
+    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
     with nlp.disable_pipes(*other_pipes):  # only train textcat
         optimizer = nlp.begin_training()
         if init_tok2vec is not None:
@@ -49,6 +49,7 @@ install_requires =
     catalogue>=0.0.7,<1.1.0
     ml_datasets
     # Third-party dependencies
+    tqdm>=4.38.0,<5.0.0
     setuptools
     numpy>=1.15.0
     plac>=0.9.6,<1.2.0
@@ -5,7 +5,7 @@ warnings.filterwarnings("ignore", message="numpy.dtype size changed")
 warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

 # These are imported as part of the API
-from thinc.util import prefer_gpu, require_gpu
+from thinc.api import prefer_gpu, require_gpu

 from . import pipeline
 from .cli.info import info as cli_info
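The helpers re-exported above are also reachable from user code through the top-level package; a tiny sketch:

import spacy

# prefer_gpu() returns True and switches Thinc's ops to the GPU when one is
# available, otherwise it returns False and computation stays on the CPU
print("GPU" if spacy.prefer_gpu() else "CPU")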
@@ -92,3 +92,4 @@ cdef enum attr_id_t:
     LANG
     ENT_KB_ID = symbols.ENT_KB_ID
     MORPH
+    ENT_ID = symbols.ENT_ID
@@ -81,6 +81,7 @@ IDS = {
     "DEP": DEP,
     "ENT_IOB": ENT_IOB,
     "ENT_TYPE": ENT_TYPE,
+    "ENT_ID": ENT_ID,
     "ENT_KB_ID": ENT_KB_ID,
     "HEAD": HEAD,
     "SENT_START": SENT_START,
@ -9,8 +9,14 @@ from wasabi import Printer
def conllu2json( def conllu2json(
input_data, n_sents=10, append_morphology=False, lang=None, ner_map=None, input_data,
merge_subtokens=False, no_print=False, **_ n_sents=10,
append_morphology=False,
lang=None,
ner_map=None,
merge_subtokens=False,
no_print=False,
**_
): ):
""" """
Convert conllu files into JSON format for use with train cli. Convert conllu files into JSON format for use with train cli.
@ -26,9 +32,13 @@ def conllu2json(
docs = [] docs = []
raw = "" raw = ""
sentences = [] sentences = []
conll_data = read_conllx(input_data, append_morphology=append_morphology, conll_data = read_conllx(
ner_tag_pattern=MISC_NER_PATTERN, ner_map=ner_map, input_data,
merge_subtokens=merge_subtokens) append_morphology=append_morphology,
ner_tag_pattern=MISC_NER_PATTERN,
ner_map=ner_map,
merge_subtokens=merge_subtokens,
)
has_ner_tags = has_ner(input_data, ner_tag_pattern=MISC_NER_PATTERN) has_ner_tags = has_ner(input_data, ner_tag_pattern=MISC_NER_PATTERN)
for i, example in enumerate(conll_data): for i, example in enumerate(conll_data):
raw += example.text raw += example.text
@ -72,20 +82,28 @@ def has_ner(input_data, ner_tag_pattern):
return False return False
def read_conllx(input_data, append_morphology=False, merge_subtokens=False, def read_conllx(
ner_tag_pattern="", ner_map=None): input_data,
append_morphology=False,
merge_subtokens=False,
ner_tag_pattern="",
ner_map=None,
):
""" Yield examples, one for each sentence """ """ Yield examples, one for each sentence """
vocab = Language.Defaults.create_vocab() # need vocab to make a minimal Doc vocab = Language.Defaults.create_vocab() # need vocab to make a minimal Doc
i = 0
for sent in input_data.strip().split("\n\n"): for sent in input_data.strip().split("\n\n"):
lines = sent.strip().split("\n") lines = sent.strip().split("\n")
if lines: if lines:
while lines[0].startswith("#"): while lines[0].startswith("#"):
lines.pop(0) lines.pop(0)
example = example_from_conllu_sentence(vocab, lines, example = example_from_conllu_sentence(
ner_tag_pattern, merge_subtokens=merge_subtokens, vocab,
lines,
ner_tag_pattern,
merge_subtokens=merge_subtokens,
append_morphology=append_morphology, append_morphology=append_morphology,
ner_map=ner_map) ner_map=ner_map,
)
yield example yield example
@ -157,8 +175,14 @@ def create_json_doc(raw, sentences, id_):
return doc return doc
def example_from_conllu_sentence(vocab, lines, ner_tag_pattern, def example_from_conllu_sentence(
merge_subtokens=False, append_morphology=False, ner_map=None): vocab,
lines,
ner_tag_pattern,
merge_subtokens=False,
append_morphology=False,
ner_map=None,
):
"""Create an Example from the lines for one CoNLL-U sentence, merging """Create an Example from the lines for one CoNLL-U sentence, merging
subtokens and appending morphology to tags if required. subtokens and appending morphology to tags if required.
@ -182,7 +206,6 @@ def example_from_conllu_sentence(vocab, lines, ner_tag_pattern,
in_subtok = False in_subtok = False
for i in range(len(lines)): for i in range(len(lines)):
line = lines[i] line = lines[i]
subtok_lines = []
parts = line.split("\t") parts = line.split("\t")
id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts
if "." in id_: if "." in id_:
@ -212,7 +235,7 @@ def example_from_conllu_sentence(vocab, lines, ner_tag_pattern,
subtok_word = "" subtok_word = ""
in_subtok = False in_subtok = False
id_ = int(id_) - 1 id_ = int(id_) - 1
head = (int(head) - 1) if head != "0" else id_ head = (int(head) - 1) if head not in ("0", "_") else id_
tag = pos if tag == "_" else tag tag = pos if tag == "_" else tag
morph = morph if morph != "_" else "" morph = morph if morph != "_" else ""
dep = "ROOT" if dep == "root" else dep dep = "ROOT" if dep == "root" else dep
@ -266,9 +289,17 @@ def example_from_conllu_sentence(vocab, lines, ner_tag_pattern,
if space: if space:
raw += " " raw += " "
example = Example(doc=raw) example = Example(doc=raw)
example.set_token_annotation(ids=ids, words=words, tags=tags, pos=pos, example.set_token_annotation(
morphs=morphs, lemmas=lemmas, heads=heads, ids=ids,
deps=deps, entities=ents) words=words,
tags=tags,
pos=pos,
morphs=morphs,
lemmas=lemmas,
heads=heads,
deps=deps,
entities=ents,
)
return example return example
@ -292,7 +323,7 @@ def merge_conllu_subtokens(lines, doc):
if token._.merged_morph: if token._.merged_morph:
for feature in token._.merged_morph.split("|"): for feature in token._.merged_morph.split("|"):
field, values = feature.split("=", 1) field, values = feature.split("=", 1)
if not field in morphs: if field not in morphs:
morphs[field] = set() morphs[field] = set()
for value in values.split(","): for value in values.split(","):
morphs[field].add(value) morphs[field].add(value)
@ -306,7 +337,9 @@ def merge_conllu_subtokens(lines, doc):
token._.merged_lemma = " ".join(lemmas) token._.merged_lemma = " ".join(lemmas)
token.tag_ = "_".join(tags) token.tag_ = "_".join(tags)
token._.merged_morph = "|".join(sorted(morphs.values())) token._.merged_morph = "|".join(sorted(morphs.values()))
token._.merged_spaceafter = True if subtok_span[-1].whitespace_ else False token._.merged_spaceafter = (
True if subtok_span[-1].whitespace_ else False
)
with doc.retokenize() as retokenizer: with doc.retokenize() as retokenizer:
for span in subtok_spans: for span in subtok_spans:
@ -166,6 +166,7 @@ def debug_data(
has_low_data_warning = False has_low_data_warning = False
has_no_neg_warning = False has_no_neg_warning = False
has_ws_ents_error = False has_ws_ents_error = False
has_punct_ents_warning = False
msg.divider("Named Entity Recognition") msg.divider("Named Entity Recognition")
msg.info( msg.info(
@ -190,6 +191,10 @@ def debug_data(
msg.fail(f"{gold_train_data['ws_ents']} invalid whitespace entity spans") msg.fail(f"{gold_train_data['ws_ents']} invalid whitespace entity spans")
has_ws_ents_error = True has_ws_ents_error = True
if gold_train_data["punct_ents"]:
msg.warn(f"{gold_train_data['punct_ents']} entity span(s) with punctuation")
has_punct_ents_warning = True
for label in new_labels: for label in new_labels:
if label_counts[label] <= NEW_LABEL_THRESHOLD: if label_counts[label] <= NEW_LABEL_THRESHOLD:
msg.warn( msg.warn(
@ -209,6 +214,8 @@ def debug_data(
msg.good("Examples without occurrences available for all labels") msg.good("Examples without occurrences available for all labels")
if not has_ws_ents_error: if not has_ws_ents_error:
msg.good("No entities consisting of or starting/ending with whitespace") msg.good("No entities consisting of or starting/ending with whitespace")
if not has_punct_ents_warning:
msg.good("No entities consisting of or starting/ending with punctuation")
if has_low_data_warning: if has_low_data_warning:
msg.text( msg.text(
@ -229,6 +236,12 @@ def debug_data(
"with whitespace characters are considered invalid." "with whitespace characters are considered invalid."
) )
if has_punct_ents_warning:
msg.text(
"Entity spans consisting of or starting/ending "
"with punctuation can not be trained with a noise level > 0."
)
if "textcat" in pipeline: if "textcat" in pipeline:
msg.divider("Text Classification") msg.divider("Text Classification")
labels = [label for label in gold_train_data["cats"]] labels = [label for label in gold_train_data["cats"]]
@ -446,6 +459,7 @@ def _compile_gold(examples, pipeline):
"words": Counter(), "words": Counter(),
"roots": Counter(), "roots": Counter(),
"ws_ents": 0, "ws_ents": 0,
"punct_ents": 0,
"n_words": 0, "n_words": 0,
"n_misaligned_words": 0, "n_misaligned_words": 0,
"n_sents": 0, "n_sents": 0,
@ -469,6 +483,16 @@ def _compile_gold(examples, pipeline):
if label.startswith(("B-", "U-", "L-")) and doc[i].is_space: if label.startswith(("B-", "U-", "L-")) and doc[i].is_space:
# "Illegal" whitespace entity # "Illegal" whitespace entity
data["ws_ents"] += 1 data["ws_ents"] += 1
if label.startswith(("B-", "U-", "L-")) and doc[i].text in [
".",
"'",
"!",
"?",
",",
]:
# punctuation entity: could be replaced by whitespace when training with noise,
# so add a warning to alert the user to this unexpected side effect.
data["punct_ents"] += 1
if label.startswith(("B-", "U-")): if label.startswith(("B-", "U-")):
combined_label = label.split("-")[1] combined_label = label.split("-")[1]
data["ner"][combined_label] += 1 data["ner"][combined_label] += 1
@ -4,14 +4,12 @@ import time
import re import re
from collections import Counter from collections import Counter
from pathlib import Path from pathlib import Path
from thinc.layers import Linear, Maxout from thinc.api import Linear, Maxout, chain, list2array, prefer_gpu
from thinc.util import prefer_gpu from thinc.api import CosineDistance, L2Distance
from wasabi import msg from wasabi import msg
import srsly import srsly
from thinc.layers import chain, list2array
from thinc.loss import CosineDistance, L2Distance
from spacy.gold import Example from ..gold import Example
from ..errors import Errors from ..errors import Errors
from ..tokens import Doc from ..tokens import Doc
from ..attrs import ID, HEAD from ..attrs import ID, HEAD
@ -28,7 +26,7 @@ def pretrain(
vectors_model: ("Name or path to spaCy model with vectors to learn from", "positional", None, str), vectors_model: ("Name or path to spaCy model with vectors to learn from", "positional", None, str),
output_dir: ("Directory to write models to on each epoch", "positional", None, str), output_dir: ("Directory to write models to on each epoch", "positional", None, str),
width: ("Width of CNN layers", "option", "cw", int) = 96, width: ("Width of CNN layers", "option", "cw", int) = 96,
depth: ("Depth of CNN layers", "option", "cd", int) = 4, conv_depth: ("Depth of CNN layers", "option", "cd", int) = 4,
bilstm_depth: ("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int) = 0, bilstm_depth: ("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int) = 0,
cnn_pieces: ("Maxout size for CNN layers. 1 for Mish", "option", "cP", int) = 3, cnn_pieces: ("Maxout size for CNN layers. 1 for Mish", "option", "cP", int) = 3,
sa_depth: ("Depth of self-attention layers", "option", "sa", int) = 0, sa_depth: ("Depth of self-attention layers", "option", "sa", int) = 0,
@ -77,9 +75,15 @@ def pretrain(
msg.info("Using GPU" if has_gpu else "Not using GPU") msg.info("Using GPU" if has_gpu else "Not using GPU")
output_dir = Path(output_dir) output_dir = Path(output_dir)
if output_dir.exists() and [p for p in output_dir.iterdir()]:
msg.warn(
"Output directory is not empty",
"It is better to use an empty directory or refer to a new output path, "
"then the new directory will be created for you.",
)
if not output_dir.exists(): if not output_dir.exists():
output_dir.mkdir() output_dir.mkdir()
msg.good("Created output directory") msg.good(f"Created output directory: {output_dir}")
srsly.write_json(output_dir / "config.json", config) srsly.write_json(output_dir / "config.json", config)
msg.good("Saved settings to config.json") msg.good("Saved settings to config.json")
@ -107,7 +111,7 @@ def pretrain(
Tok2Vec( Tok2Vec(
width, width,
embed_rows, embed_rows,
conv_depth=depth, conv_depth=conv_depth,
pretrained_vectors=pretrained_vectors, pretrained_vectors=pretrained_vectors,
bilstm_depth=bilstm_depth, # Requires PyTorch. Experimental. bilstm_depth=bilstm_depth, # Requires PyTorch. Experimental.
subword_features=not use_chars, # Set to False for Chinese etc subword_features=not use_chars, # Set to False for Chinese etc
@ -1,7 +1,7 @@
import os import os
import tqdm import tqdm
from pathlib import Path from pathlib import Path
from thinc.backends import use_ops from thinc.api import use_ops
from timeit import default_timer as timer from timeit import default_timer as timer
import shutil import shutil
import srsly import srsly
@ -10,6 +10,7 @@ import contextlib
import random import random
from ..util import create_default_optimizer from ..util import create_default_optimizer
from ..util import use_gpu as set_gpu
from ..attrs import PROB, IS_OOV, CLUSTER, LANG from ..attrs import PROB, IS_OOV, CLUSTER, LANG
from ..gold import GoldCorpus from ..gold import GoldCorpus
from .. import util from .. import util
@ -26,6 +27,14 @@ def train(
base_model: ("Name of model to update (optional)", "option", "b", str) = None, base_model: ("Name of model to update (optional)", "option", "b", str) = None,
pipeline: ("Comma-separated names of pipeline components", "option", "p", str) = "tagger,parser,ner", pipeline: ("Comma-separated names of pipeline components", "option", "p", str) = "tagger,parser,ner",
vectors: ("Model to load vectors from", "option", "v", str) = None, vectors: ("Model to load vectors from", "option", "v", str) = None,
replace_components: ("Replace components from base model", "flag", "R", bool) = False,
width: ("Width of CNN layers of Tok2Vec component", "option", "cw", int) = 96,
conv_depth: ("Depth of CNN layers of Tok2Vec component", "option", "cd", int) = 4,
cnn_window: ("Window size for CNN layers of Tok2Vec component", "option", "cW", int) = 1,
cnn_pieces: ("Maxout size for CNN layers of Tok2Vec component. 1 for Mish", "option", "cP", int) = 3,
use_chars: ("Whether to use character-based embedding of Tok2Vec component", "flag", "chr", bool) = False,
bilstm_depth: ("Depth of BiLSTM layers of Tok2Vec component (requires PyTorch)", "option", "lstm", int) = 0,
embed_rows: ("Number of embedding rows of Tok2Vec component", "option", "er", int) = 2000,
n_iter: ("Number of iterations", "option", "n", int) = 30, n_iter: ("Number of iterations", "option", "n", int) = 30,
n_early_stopping: ("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int) = None, n_early_stopping: ("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int) = None,
n_examples: ("Number of examples", "option", "ns", int) = 0, n_examples: ("Number of examples", "option", "ns", int) = 0,
@ -80,6 +89,7 @@ def train(
) )
if not output_path.exists(): if not output_path.exists():
output_path.mkdir() output_path.mkdir()
msg.good(f"Created output directory: {output_path}")
tag_map = {} tag_map = {}
if tag_map_path is not None: if tag_map_path is not None:
@ -113,6 +123,21 @@ def train(
# training starts from a blank model, intitalize the language class. # training starts from a blank model, intitalize the language class.
pipeline = [p.strip() for p in pipeline.split(",")] pipeline = [p.strip() for p in pipeline.split(",")]
msg.text(f"Training pipeline: {pipeline}") msg.text(f"Training pipeline: {pipeline}")
disabled_pipes = None
pipes_added = False
msg.text(f"Training pipeline: {pipeline}")
if use_gpu >= 0:
activated_gpu = None
try:
activated_gpu = set_gpu(use_gpu)
except Exception as e:
msg.warn(f"Exception: {e}")
if activated_gpu is not None:
msg.text(f"Using GPU: {use_gpu}")
else:
msg.warn(f"Unable to activate GPU: {use_gpu}")
msg.text("Using CPU only")
use_gpu = -1
if base_model: if base_model:
msg.text(f"Starting with base model '{base_model}'") msg.text(f"Starting with base model '{base_model}'")
nlp = util.load_model(base_model) nlp = util.load_model(base_model)
@ -122,9 +147,8 @@ def train(
f"specified as `lang` argument ('{lang}') ", f"specified as `lang` argument ('{lang}') ",
exits=1, exits=1,
) )
nlp.disable_pipes([p for p in nlp.pipe_names if p not in pipeline])
for pipe in pipeline: for pipe in pipeline:
if pipe not in nlp.pipe_names: pipe_cfg = {}
if pipe == "parser": if pipe == "parser":
pipe_cfg = {"learn_tokens": learn_tokens} pipe_cfg = {"learn_tokens": learn_tokens}
elif pipe == "textcat": elif pipe == "textcat":
@ -133,9 +157,14 @@ def train(
"architecture": textcat_arch, "architecture": textcat_arch,
"positive_label": textcat_positive_label, "positive_label": textcat_positive_label,
} }
else: if pipe not in nlp.pipe_names:
pipe_cfg = {} msg.text(f"Adding component to base model '{pipe}'")
nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg)) nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg))
pipes_added = True
elif replace_components:
msg.text(f"Replacing component from base model '{pipe}'")
nlp.replace_pipe(pipe, nlp.create_pipe(pipe, config=pipe_cfg))
pipes_added = True
else: else:
if pipe == "textcat": if pipe == "textcat":
textcat_cfg = nlp.get_pipe("textcat").cfg textcat_cfg = nlp.get_pipe("textcat").cfg
@ -144,11 +173,6 @@ def train(
"architecture": textcat_cfg["architecture"], "architecture": textcat_cfg["architecture"],
"positive_label": textcat_cfg["positive_label"], "positive_label": textcat_cfg["positive_label"],
} }
pipe_cfg = {
"exclusive_classes": not textcat_multilabel,
"architecture": textcat_arch,
"positive_label": textcat_positive_label,
}
if base_cfg != pipe_cfg: if base_cfg != pipe_cfg:
msg.fail( msg.fail(
f"The base textcat model configuration does" f"The base textcat model configuration does"
@ -156,6 +180,10 @@ def train(
f"Existing cfg: {base_cfg}, provided cfg: {pipe_cfg}", f"Existing cfg: {base_cfg}, provided cfg: {pipe_cfg}",
exits=1, exits=1,
) )
msg.text(f"Extending component from base model '{pipe}'")
disabled_pipes = nlp.disable_pipes(
[p for p in nlp.pipe_names if p not in pipeline]
)
else: else:
msg.text(f"Starting with blank model '{lang}'") msg.text(f"Starting with blank model '{lang}'")
lang_cls = util.get_lang_class(lang) lang_cls = util.get_lang_class(lang)
@ -198,13 +226,20 @@ def train(
corpus = GoldCorpus(train_path, dev_path, limit=n_examples) corpus = GoldCorpus(train_path, dev_path, limit=n_examples)
n_train_words = corpus.count_train() n_train_words = corpus.count_train()
if base_model: if base_model and not pipes_added:
# Start with an existing model, use default optimizer # Start with an existing model, use default optimizer
optimizer = create_default_optimizer() optimizer = create_default_optimizer()
else: else:
# Start with a blank model, call begin_training # Start with a blank model, call begin_training
optimizer = nlp.begin_training(lambda: corpus.train_examples, device=use_gpu) cfg = {"device": use_gpu}
cfg["conv_depth"] = conv_depth
cfg["token_vector_width"] = width
cfg["bilstm_depth"] = bilstm_depth
cfg["cnn_maxout_pieces"] = cnn_pieces
cfg["embed_size"] = embed_rows
cfg["conv_window"] = cnn_window
cfg["subword_features"] = not use_chars
optimizer = nlp.begin_training(lambda: corpus.train_tuples, **cfg)
nlp._optimizer = None nlp._optimizer = None
# Load in pretrained weights # Load in pretrained weights
@ -214,7 +249,7 @@ def train(
# Verify textcat config # Verify textcat config
if "textcat" in pipeline: if "textcat" in pipeline:
textcat_labels = nlp.get_pipe("textcat").cfg["labels"] textcat_labels = nlp.get_pipe("textcat").cfg.get("labels", [])
if textcat_positive_label and textcat_positive_label not in textcat_labels: if textcat_positive_label and textcat_positive_label not in textcat_labels:
msg.fail( msg.fail(
f"The textcat_positive_label (tpl) '{textcat_positive_label}' " f"The textcat_positive_label (tpl) '{textcat_positive_label}' "
@ -327,12 +362,22 @@ def train(
for batch in util.minibatch_by_words(train_data, size=batch_sizes): for batch in util.minibatch_by_words(train_data, size=batch_sizes):
if not batch: if not batch:
continue continue
docs, golds = zip(*batch)
try:
nlp.update( nlp.update(
batch, docs,
golds,
sgd=optimizer, sgd=optimizer,
drop=next(dropout_rates), drop=next(dropout_rates),
losses=losses, losses=losses,
) )
except ValueError as e:
msg.warn("Error during training")
if init_tok2vec:
msg.warn(
"Did you provide the same parameters during 'train' as during 'pretrain'?"
)
msg.fail(f"Original error message: {e}", exits=1)
if raw_text: if raw_text:
# If raw text is available, perform 'rehearsal' updates, # If raw text is available, perform 'rehearsal' updates,
# which use unlabelled data to reduce overfitting. # which use unlabelled data to reduce overfitting.
@ -396,11 +441,16 @@ def train(
"cpu": cpu_wps, "cpu": cpu_wps,
"gpu": gpu_wps, "gpu": gpu_wps,
} }
meta["accuracy"] = scorer.scores meta.setdefault("accuracy", {})
for component in nlp.pipe_names:
for metric in _get_metrics(component):
meta["accuracy"][metric] = scorer.scores[metric]
else: else:
meta.setdefault("beam_accuracy", {}) meta.setdefault("beam_accuracy", {})
meta.setdefault("beam_speed", {}) meta.setdefault("beam_speed", {})
meta["beam_accuracy"][beam_width] = scorer.scores for component in nlp.pipe_names:
for metric in _get_metrics(component):
meta["beam_accuracy"][metric] = scorer.scores[metric]
meta["beam_speed"][beam_width] = { meta["beam_speed"][beam_width] = {
"nwords": nwords, "nwords": nwords,
"cpu": cpu_wps, "cpu": cpu_wps,
@ -453,13 +503,19 @@ def train(
f"Best score = {best_score}; Final iteration score = {current_score}" f"Best score = {best_score}; Final iteration score = {current_score}"
) )
break break
except Exception as e:
msg.warn(f"Aborting and saving final best model. Encountered exception: {e}")
finally: finally:
best_pipes = nlp.pipe_names
if disabled_pipes:
disabled_pipes.restore()
with nlp.use_params(optimizer.averages): with nlp.use_params(optimizer.averages):
final_model_path = output_path / "model-final" final_model_path = output_path / "model-final"
nlp.to_disk(final_model_path) nlp.to_disk(final_model_path)
final_meta = srsly.read_json(output_path / "model-final" / "meta.json")
msg.good("Saved model to output directory", final_model_path) msg.good("Saved model to output directory", final_model_path)
with msg.loading("Creating best model..."): with msg.loading("Creating best model..."):
best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names) best_model_path = _collate_best_model(final_meta, output_path, best_pipes)
msg.good("Created best model", best_model_path) msg.good("Created best model", best_model_path)
@ -519,15 +575,14 @@ def _load_pretrained_tok2vec(nlp, loc):
def _collate_best_model(meta, output_path, components): def _collate_best_model(meta, output_path, components):
bests = {} bests = {}
meta.setdefault("accuracy", {})
for component in components: for component in components:
bests[component] = _find_best(output_path, component) bests[component] = _find_best(output_path, component)
best_dest = output_path / "model-best" best_dest = output_path / "model-best"
shutil.copytree(str(output_path / "model-final"), str(best_dest)) shutil.copytree(str(output_path / "model-final"), str(best_dest))
for component, best_component_src in bests.items(): for component, best_component_src in bests.items():
shutil.rmtree(str(best_dest / component)) shutil.rmtree(str(best_dest / component))
shutil.copytree( shutil.copytree(str(best_component_src / component), str(best_dest / component))
str(best_component_src / component), str(best_dest / component)
)
accs = srsly.read_json(best_component_src / "accuracy.json") accs = srsly.read_json(best_component_src / "accuracy.json")
for metric in _get_metrics(component): for metric in _get_metrics(component):
meta["accuracy"][metric] = accs[metric] meta["accuracy"][metric] = accs[metric]
@ -550,13 +605,15 @@ def _find_best(experiment_dir, component):
def _get_metrics(component): def _get_metrics(component):
if component == "parser": if component == "parser":
return ("las", "uas", "token_acc", "sent_f") return ("las", "uas", "las_per_type", "token_acc", "sent_f")
elif component == "tagger": elif component == "tagger":
return ("tags_acc",) return ("tags_acc",)
elif component == "ner": elif component == "ner":
return ("ents_f", "ents_p", "ents_r") return ("ents_f", "ents_p", "ents_r", "enty_per_type")
elif component == "sentrec": elif component == "sentrec":
return ("sent_f", "sent_p", "sent_r") return ("sent_f", "sent_p", "sent_r")
elif component == "textcat":
return ("textcat_score",)
return ("token_acc",) return ("token_acc",)
@ -568,8 +625,12 @@ def _configure_training_output(pipeline, use_gpu, has_beam_widths):
row_head.extend(["Tag Loss ", " Tag % "]) row_head.extend(["Tag Loss ", " Tag % "])
output_stats.extend(["tag_loss", "tags_acc"]) output_stats.extend(["tag_loss", "tags_acc"])
elif pipe == "parser": elif pipe == "parser":
row_head.extend(["Dep Loss ", " UAS ", " LAS ", "Sent P", "Sent R", "Sent F"]) row_head.extend(
output_stats.extend(["dep_loss", "uas", "las", "sent_p", "sent_r", "sent_f"]) ["Dep Loss ", " UAS ", " LAS ", "Sent P", "Sent R", "Sent F"]
)
output_stats.extend(
["dep_loss", "uas", "las", "sent_p", "sent_r", "sent_f"]
)
elif pipe == "ner": elif pipe == "ner":
row_head.extend(["NER Loss ", "NER P ", "NER R ", "NER F "]) row_head.extend(["NER Loss ", "NER P ", "NER R ", "NER F "])
output_stats.extend(["ner_loss", "ents_p", "ents_r", "ents_f"]) output_stats.extend(["ner_loss", "ents_p", "ents_r", "ents_f"])
@ -1,19 +1,20 @@
from typing import Optional, Dict, List, Union, Sequence
import plac import plac
from thinc.util import require_gpu
from wasabi import msg from wasabi import msg
from pathlib import Path from pathlib import Path
import thinc import thinc
import thinc.schedules import thinc.schedules
from thinc.model import Model from thinc.api import Model
from spacy.gold import GoldCorpus
import spacy
from spacy.pipeline.tok2vec import Tok2VecListener
from typing import Optional, Dict, List, Union, Sequence
from pydantic import BaseModel, FilePath, StrictInt from pydantic import BaseModel, FilePath, StrictInt
import tqdm import tqdm
from ..ml import component_models # TODO: relative imports?
from .. import util import spacy
from spacy.gold import GoldCorpus
from spacy.pipeline.tok2vec import Tok2VecListener
from spacy.ml import component_models
from spacy import util
registry = util.registry registry = util.registry
@ -153,10 +154,9 @@ def create_tb_parser_model(
hidden_width: StrictInt = 64, hidden_width: StrictInt = 64,
maxout_pieces: StrictInt = 3, maxout_pieces: StrictInt = 3,
): ):
from thinc.layers import Linear, chain, list2array from thinc.api import Linear, chain, list2array, use_ops, zero_init
from spacy.ml._layers import PrecomputableAffine from spacy.ml._layers import PrecomputableAffine
from spacy.syntax._parser_model import ParserModel from spacy.syntax._parser_model import ParserModel
from thinc.api import use_ops, zero_init
token_vector_width = tok2vec.get_dim("nO") token_vector_width = tok2vec.get_dim("nO")
tok2vec = chain(tok2vec, list2array()) tok2vec = chain(tok2vec, list2array())
@ -221,13 +221,9 @@ def train_from_config_cli(
def train_from_config( def train_from_config(
config_path, config_path, data_paths, raw_text=None, meta_path=None, output_path=None,
data_paths,
raw_text=None,
meta_path=None,
output_path=None,
): ):
msg.info("Loading config from: {}".format(config_path)) msg.info(f"Loading config from: {config_path}")
config = util.load_from_config(config_path, create_objects=True) config = util.load_from_config(config_path, create_objects=True)
use_gpu = config["training"]["use_gpu"] use_gpu = config["training"]["use_gpu"]
if use_gpu >= 0: if use_gpu >= 0:
@ -241,9 +237,7 @@ def train_from_config(
msg.info("Loading training corpus") msg.info("Loading training corpus")
corpus = GoldCorpus(data_paths["train"], data_paths["dev"], limit=limit) corpus = GoldCorpus(data_paths["train"], data_paths["dev"], limit=limit)
msg.info("Initializing the nlp pipeline") msg.info("Initializing the nlp pipeline")
nlp.begin_training( nlp.begin_training(lambda: corpus.train_examples, device=use_gpu)
lambda: corpus.train_examples, device=use_gpu
)
train_batches = create_train_batches(nlp, corpus, config["training"]) train_batches = create_train_batches(nlp, corpus, config["training"])
evaluate = create_evaluation_callback(nlp, optimizer, corpus, config["training"]) evaluate = create_evaluation_callback(nlp, optimizer, corpus, config["training"])
@ -260,7 +254,7 @@ def train_from_config(
config["training"]["eval_frequency"], config["training"]["eval_frequency"],
) )
msg.info("Training. Initial learn rate: {}".format(optimizer.learn_rate)) msg.info(f"Training. Initial learn rate: {optimizer.learn_rate}")
print_row = setup_printer(config) print_row = setup_printer(config)
try: try:
@ -414,7 +408,7 @@ def subdivide_batch(batch):
def setup_printer(config): def setup_printer(config):
score_cols = config["training"]["scores"] score_cols = config["training"]["scores"]
score_widths = [max(len(col), 6) for col in score_cols] score_widths = [max(len(col), 6) for col in score_cols]
loss_cols = ["Loss {}".format(pipe) for pipe in config["nlp"]["pipeline"]] loss_cols = [f"Loss {pipe}" for pipe in config["nlp"]["pipeline"]]
loss_widths = [max(len(col), 8) for col in loss_cols] loss_widths = [max(len(col), 8) for col in loss_cols]
table_header = ["#"] + loss_cols + score_cols + ["Score"] table_header = ["#"] + loss_cols + score_cols + ["Score"]
table_header = [col.upper() for col in table_header] table_header = [col.upper() for col in table_header]
@@ -29,7 +29,7 @@ try:
 except ImportError:
     cupy = None

-from thinc.optimizers import Optimizer  # noqa: F401
+from thinc.api import Optimizer  # noqa: F401

 pickle = pickle
 copy_reg = copy_reg
@@ -51,9 +51,10 @@ def render(
         html = RENDER_WRAPPER(html)
     if jupyter or (jupyter is None and is_in_jupyter()):
         # return HTML rendered by IPython display()
+        # See #4840 for details on span wrapper to disable mathjax
         from IPython.core.display import display, HTML

-        return display(HTML(html))
+        return display(HTML('<span class="tex2jax_ignore">{}</span>'.format(html)))
     return html
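A small, hedged usage sketch of the two render paths above: passing jupyter=False forces the plain-HTML branch so the raw markup string is returned, while inside a notebook the output is wrapped in the tex2jax_ignore span so MathJax leaves it alone.

import spacy
from spacy import displacy

nlp = spacy.blank("en")
doc = nlp("Mark Zuckerberg is the CEO of Facebook.")

html = displacy.render(doc, style="dep", jupyter=False)  # raw HTML string, no notebook wrapper
print(html[:60])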
@@ -1,4 +1,3 @@
 # Setting explicit height and max-width: none on the SVG is required for
 # Jupyter to render it properly in a cell
@@ -75,10 +75,9 @@ class Warnings(object):
     W015 = ("As of v2.1.0, the use of keyword arguments to exclude fields from "
             "being serialized or deserialized is deprecated. Please use the "
             "`exclude` argument instead. For example: exclude=['{arg}'].")
-    W016 = ("The keyword argument `n_threads` on the is now deprecated, as "
-            "the v2.x models cannot release the global interpreter lock. "
-            "Future versions may introduce a `n_process` argument for "
-            "parallel inference via multiprocessing.")
+    W016 = ("The keyword argument `n_threads` is now deprecated. As of v2.2.2, "
+            "the argument `n_process` controls parallel inference via "
+            "multiprocessing.")
     W017 = ("Alias '{alias}' already exists in the Knowledge Base.")
     W018 = ("Entity '{entity}' already exists in the Knowledge Base - "
             "ignoring the duplicate entry.")
@@ -170,7 +169,8 @@ class Errors(object):
             "and satisfies the correct annotations specified in the GoldParse. "
             "For example, are all labels added to the model? If you're "
             "training a named entity recognizer, also make sure that none of "
-            "your annotated entity spans have leading or trailing whitespace. "
+            "your annotated entity spans have leading or trailing whitespace "
+            "or punctuation. "
             "You can also use the experimental `debug-data` command to "
             "validate your JSON-formatted training data. For details, run:\n"
             "python -m spacy debug-data --help")
@@ -536,8 +536,8 @@ class Errors(object):
     E997 = ("Tokenizer special cases are not allowed to modify the text. "
             "This would map '{chunk}' to '{orth}' given token attributes "
             "'{token_attrs}'.")
-    E998 = ("Can only create GoldParse's from Example's without a Doc, "
-            "if get_gold_parses() is called with a Vocab object.")
+    E998 = ("Can only create GoldParse objects from Example objects without a "
+            "Doc if get_gold_parses() is called with a Vocab object.")
     E999 = ("Encountered an unexpected format for the dictionary holding "
             "gold annotations: {gold_dict}")
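The reworded W016 warning above points users at `n_process`; a minimal sketch of that replacement API (available since spaCy 2.2.2), using a blank pipeline so the snippet runs without any model installed:

import spacy

def main():
    nlp = spacy.blank("en")
    texts = ["First text.", "Second text.", "Third text.", "Fourth text."]
    # n_process runs the pipeline in separate worker processes
    for doc in nlp.pipe(texts, n_process=2, batch_size=2):
        print(len(doc))

if __name__ == "__main__":
    main()  # the __main__ guard keeps multiprocessing safe on all platforms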
@@ -1,4 +1,3 @@
 def explain(term):
     """Get a description for a given POS tag, dependency label or entity type.
@@ -1,6 +1,6 @@
 from cymem.cymem cimport Pool

-from spacy.tokens import Doc
+from .tokens import Doc
 from .typedefs cimport attr_t
 from .syntax.transition_system cimport Transition
@@ -65,5 +65,3 @@ cdef class Example:
     cdef public TokenAnnotation token_annotation
     cdef public DocAnnotation doc_annotation
     cdef public object goldparse
@@ -6,7 +6,7 @@ from libcpp.vector cimport vector
 from libc.stdint cimport int32_t, int64_t
 from libc.stdio cimport FILE

-from spacy.vocab cimport Vocab
+from .vocab cimport Vocab
 from .typedefs cimport hash_t
 from .structs cimport KBEntryC, AliasC
@@ -169,4 +169,3 @@ cdef class Reader:
     cdef int read_alias(self, int64_t* entry_index, float* prob) except -1
     cdef int _read(self, void* value, size_t size) except -1
@@ -1,4 +1,3 @@
 # Source: https://github.com/stopwords-iso/stopwords-af
 STOP_WORDS = set(
@@ -1,4 +1,3 @@
 # Source: https://github.com/Alir3z4/stop-words
 STOP_WORDS = set(
@@ -1,4 +1,3 @@
 """
 Example sentences to test spaCy and its language models.
@@ -1,4 +1,3 @@
 STOP_WORDS = set(
 """
 অতএব অথচ অথব অন অন অন অন অনতত অবধি অবশ অর অন অন অরধভ
@@ -1,4 +1,3 @@
 """
 Example sentences to test spaCy and its language models.
@@ -14,6 +14,17 @@ _tamil = r"\u0B80-\u0BFF"
 _telugu = r"\u0C00-\u0C7F"

+# from the final table in: https://en.wikipedia.org/wiki/CJK_Unified_Ideographs
+_cjk = (
+    r"\u4E00-\u62FF\u6300-\u77FF\u7800-\u8CFF\u8D00-\u9FFF\u3400-\u4DBF"
+    r"\U00020000-\U000215FF\U00021600-\U000230FF\U00023100-\U000245FF"
+    r"\U00024600-\U000260FF\U00026100-\U000275FF\U00027600-\U000290FF"
+    r"\U00029100-\U0002A6DF\U0002A700-\U0002B73F\U0002B740-\U0002B81F"
+    r"\U0002B820-\U0002CEAF\U0002CEB0-\U0002EBEF\u2E80-\u2EFF\u2F00-\u2FDF"
+    r"\u2FF0-\u2FFF\u3000-\u303F\u31C0-\u31EF\u3200-\u32FF\u3300-\u33FF"
+    r"\uF900-\uFAFF\uFE30-\uFE4F\U0001F200-\U0001F2FF\U0002F800-\U0002FA1F"
+)
+
 # Latin standard
 _latin_u_standard = r"A-Z"
 _latin_l_standard = r"a-z"
@@ -212,6 +223,7 @@ _uncased = (
     + _tamil
     + _telugu
     + _hangul
+    + _cjk
 )

 ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased)
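A quick, hedged check of what the added ranges do, using only the standard `re` module and an abbreviated subset of the ranges from the hunk above:

import re

_cjk_subset = (
    r"\u4E00-\u62FF\u6300-\u77FF\u7800-\u8CFF\u8D00-\u9FFF\u3400-\u4DBF"
    r"\uF900-\uFAFF\uFE30-\uFE4F"
)
han = re.compile(r"[{}]".format(_cjk_subset))
print(bool(han.match("中")))  # True: CJK ideographs now fall inside the alphabetic classes
print(bool(han.match("a")))   # False: not in the CJK ranges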
@@ -1,4 +1,3 @@
 # Source: https://github.com/Alir3z4/stop-words
 STOP_WORDS = set(
@@ -1,4 +1,3 @@
 """
 Example sentences to test spaCy and its language models.
@@ -1,4 +1,3 @@
 """
 Example sentences to test spaCy and its language models.
@@ -1,4 +1,3 @@
 STOP_WORDS = set(
 """
 á a ab aber ach acht achte achten achter achtes ag alle allein allem allen
@@ -26,7 +25,7 @@ früher fünf fünfte fünften fünfter fünftes für
 gab ganz ganze ganzen ganzer ganzes gar gedurft gegen gegenüber gehabt gehen
 geht gekannt gekonnt gemacht gemocht gemusst genug gerade gern gesagt geschweige
-gewesen gewollt geworden gibt ging gleich gott gross groß grosse große grossen
+gewesen gewollt geworden gibt ging gleich gross groß grosse große grossen
 großen grosser großer grosses großes gut gute guter gutes
 habe haben habt hast hat hatte hätte hatten hätten heisst heißt her heute hier
@@ -44,9 +43,8 @@ kleines kommen kommt können könnt konnte könnte konnten kurz
 lang lange leicht leider lieber los
 machen macht machte mag magst man manche manchem manchen mancher manches mehr
-mein meine meinem meinen meiner meines mensch menschen mich mir mit mittel
-mochte möchte mochten mögen möglich mögt morgen muss muß müssen musst müsst
-musste mussten
+mein meine meinem meinen meiner meines mich mir mit mittel mochte möchte mochten
+mögen möglich mögt morgen muss muß müssen musst müsst musste mussten
 na nach nachdem nahm natürlich neben nein neue neuen neun neunte neunten neunter
 neuntes nicht nichts nie niemand niemandem niemanden noch nun nur
@@ -1,5 +1,5 @@
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .tag_map_general import TAG_MAP
+from ..tag_map import TAG_MAP
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .lemmatizer import GreekLemmatizer
@@ -1,4 +1,3 @@
 def get_pos_from_wiktionary():
     import re
     from gensim.corpora.wikicorpus import extract_pages
@@ -1,4 +1,3 @@
 # These exceptions are used to add NORM values based on a token's ORTH value.
 # Norms are only set if no alternative is provided in the tokenizer exceptions.
@@ -1,4 +1,3 @@
 # Stop words
 # Link to greek stop words: https://www.translatum.gr/forum/index.php?topic=3550.0?topic=3550.0
 STOP_WORDS = set(
@ -1,24 +0,0 @@
from ...symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
from ...symbols import PUNCT, NUM, AUX, X, ADJ, VERB, PART, SPACE, CCONJ
TAG_MAP = {
    "ADJ": {POS: ADJ},
    "ADV": {POS: ADV},
    "INTJ": {POS: INTJ},
    "NOUN": {POS: NOUN},
    "PROPN": {POS: PROPN},
    "VERB": {POS: VERB},
    "ADP": {POS: ADP},
    "CCONJ": {POS: CCONJ},
    "SCONJ": {POS: SCONJ},
    "PART": {POS: PART},
    "PUNCT": {POS: PUNCT},
    "SYM": {POS: SYM},
    "NUM": {POS: NUM},
    "PRON": {POS: PRON},
    "AUX": {POS: AUX},
    "SPACE": {POS: SPACE},
    "DET": {POS: DET},
    "X": {POS: X},
}

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
_exc = {
    # Slang and abbreviations
    "cos": "because",

View File

@ -1,4 +1,3 @@
# Stop words
STOP_WORDS = set(
"""

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
actualmente acuerdo adelante ademas además adrede afirmó agregó ahi ahora ahí

View File

@ -1,4 +1,3 @@
# Source: https://github.com/stopwords-iso/stopwords-et
STOP_WORDS = set(

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
verb_roots = """
#هست
آخت#آهنج

View File

@ -1,4 +1,3 @@
# Stop words from HAZM package
STOP_WORDS = set(
"""

View File

@ -1,9 +1,10 @@
-from ..char_classes import LIST_ELLIPSES, LIST_ICONS
+from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_HYPHENS
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
from ..punctuation import TOKENIZER_SUFFIXES
_quotes = CONCAT_QUOTES.replace("'", "")
+DASHES = "|".join(x for x in LIST_HYPHENS if x != "-")
_infixes = (
    LIST_ELLIPSES
@ -11,11 +12,9 @@ _infixes = (
    + [
        r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
        r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
-        r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
-        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
-        r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
+        r"(?<=[{a}])(?:{d})(?=[{a}])".format(a=ALPHA, d=DASHES),
-        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
+        r"(?<=[{a}0-9])[<>=/](?=[{a}])".format(a=ALPHA),
    ]
)
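
The hunk above swaps the hard-coded "--" infix for an alternation built from LIST_HYPHENS minus the plain ASCII hyphen, so dash-like characters between letters become split points while ordinary hyphenated words stay intact. A minimal re-only sketch of the same idea, using a hypothetical stand-in for spaCy's LIST_HYPHENS and a simplified letter class:

import re

# Hypothetical stand-in for LIST_HYPHENS (assumption, not spaCy's real list).
HYPHEN_LIKE = ["-", "\u2013", "\u2014", "\u2212"]  # hyphen, en dash, em dash, minus sign

# Mirror the diff: join every hyphen-like character except the plain ASCII "-".
DASHES = "|".join(re.escape(x) for x in HYPHEN_LIKE if x != "-")

ALPHA = "A-Za-zÀ-ÿ"  # simplified letter class, only for this sketch
infix = re.compile(r"(?<=[{a}])(?:{d})(?=[{a}])".format(a=ALPHA, d=DASHES))

print(infix.split("Helsinki\u2013Vantaa"))  # ['Helsinki', 'Vantaa']: en dash splits
print(infix.split("itse-oppinut"))          # ['itse-oppinut']: ASCII hyphen untouched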

View File

@ -1,4 +1,3 @@
# Source https://github.com/stopwords-iso/stopwords-fi/blob/master/stopwords-fi.txt
# Reformatted with some minor corrections
STOP_WORDS = set(

View File

@ -28,6 +28,9 @@ for exc_data in [
{ORTH: "myöh.", LEMMA: "myöhempi"}, {ORTH: "myöh.", LEMMA: "myöhempi"},
{ORTH: "n.", LEMMA: "noin"}, {ORTH: "n.", LEMMA: "noin"},
{ORTH: "nimim.", LEMMA: "nimimerkki"}, {ORTH: "nimim.", LEMMA: "nimimerkki"},
{ORTH: "n:o", LEMMA: "numero"},
{ORTH: "N:o", LEMMA: "numero"},
{ORTH: "nro", LEMMA: "numero"},
{ORTH: "ns.", LEMMA: "niin sanottu"}, {ORTH: "ns.", LEMMA: "niin sanottu"},
{ORTH: "nyk.", LEMMA: "nykyinen"}, {ORTH: "nyk.", LEMMA: "nykyinen"},
{ORTH: "oik.", LEMMA: "oikealla"}, {ORTH: "oik.", LEMMA: "oikealla"},

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
a à â abord absolument afin ah ai aie ailleurs ainsi ait allaient allo allons

View File

@ -1,4 +1,3 @@
# fmt: off
consonants = ["b", "c", "d", "f", "g", "h", "j", "k", "l", "m", "n", "p", "q", "r", "s", "t", "v", "w", "x", "z"]
broad_vowels = ["a", "á", "o", "ó", "u", "ú"]

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
# Source: https://github.com/taranjeet/hindi-tokenizer/blob/master/stopwords.txt, https://data.mendeley.com/datasets/bsr3frvvjc/1#file-a21d5092-99d7-45d8-b044-3ae9edd391c6
STOP_WORDS = set(

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
a abban ahhoz ahogy ahol aki akik akkor akár alatt amely amelyek amelyekben

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
# Source: https://github.com/Xangis/extra-stopwords
STOP_WORDS = set(

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
a abbastanza abbia abbiamo abbiano abbiate accidenti ad adesso affinche agl

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
ಹಲವ

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
# Source: https://github.com/stopwords-iso/stopwords-lv
STOP_WORDS = set(

View File

@ -1,4 +1,3 @@
# Source: https://github.com/stopwords-iso/stopwords-mr/blob/master/stopwords-mr.txt, https://github.com/6/stopwords-json/edit/master/dist/mr.json
STOP_WORDS = set(
"""

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
# These exceptions are used to add NORM values based on a token's ORTH value.
# Individual languages can also add their own exceptions and overwrite them -
# for example, British vs. American spelling in English.

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
à às área acerca ademais adeus agora ainda algo algumas alguns ali além ambas ambos antes

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
_exc = {
    # Slang
    "прив": "привет",

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
අතර

View File

@ -1,11 +1,16 @@
from .stop_words import STOP_WORDS
+from .tag_map import TAG_MAP
+from .lex_attrs import LEX_ATTRS
from ...language import Language
from ...attrs import LANG
class SlovakDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
+    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "sk"
+    tag_map = TAG_MAP
    stop_words = STOP_WORDS
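
With LEX_ATTRS, TAG_MAP and the stop words attached to SlovakDefaults, a blank Slovak pipeline picks them up automatically. A hedged usage sketch against the spaCy 2.x API, assuming a build that ships the new sk data:

import spacy

nlp = spacy.blank("sk")  # constructed from SlovakDefaults
doc = nlp("Niekto mi povedal, že 500 eur je veľa peňazí.")

# like_num comes from the new lex_attrs, is_stop from the extended stop word list.
print([(t.text, t.like_num, t.is_stop) for t in doc])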

23
spacy/lang/sk/examples.py Normal file
View File

@ -0,0 +1,23 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.sk.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
    "Ardevop, s.r.o. je malá startup firma na území SR.",
    "Samojazdiace autá presúvajú poistnú zodpovednosť na výrobcov automobilov.",
    "Košice sú na východe.",
    "Bratislava je hlavné mesto Slovenskej republiky.",
    "Kde si?",
    "Kto je prezidentom Francúzska?",
    "Aké je hlavné mesto Slovenska?",
    "Kedy sa narodil Andrej Kiska?",
    "Včera som dostal 100€ na ruku.",
    "Dnes je nedeľa 26.1.2020.",
    "Narodil sa 15.4.1998 v Ružomberku.",
    "Niekto mi povedal, že 500 eur je veľa peňazí.",
    "Podaj mi ruku!",
]

View File

@ -0,0 +1,59 @@
from ...attrs import LIKE_NUM

_num_words = [
    "nula",
    "jeden",
    "dva",
    "tri",
    "štyri",
    "päť",
    "šesť",
    "sedem",
    "osem",
    "deväť",
    "desať",
    "jedenásť",
    "dvanásť",
    "trinásť",
    "štrnásť",
    "pätnásť",
    "šestnásť",
    "sedemnásť",
    "osemnásť",
    "devätnásť",
    "dvadsať",
    "tridsať",
    "štyridsať",
    "päťdesiat",
    "šesťdesiat",
    "sedemdesiat",
    "osemdesiat",
    "deväťdesiat",
    "sto",
    "tisíc",
    "milión",
    "miliarda",
    "bilión",
    "biliarda",
    "trilión",
    "triliarda",
    "kvadrilión",
]


def like_num(text):
    # Strip a leading sign before checking the rest of the string.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Drop separators so forms like "1,000" and "3.14" count as numbers.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as "3/4".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Slovak number words from the list above.
    if text.lower() in _num_words:
        return True
    return False


LEX_ATTRS = {LIKE_NUM: like_num}
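
like_num strips a leading sign, drops "," and "." separators, and then accepts digit strings, simple fractions and the Slovak number words listed above. A quick illustration, assuming the module is importable as spacy.lang.sk.lex_attrs:

from spacy.lang.sk.lex_attrs import like_num

print(like_num("100"))      # True: plain digits
print(like_num("-2,5"))     # True: sign stripped, separator removed
print(like_num("3/4"))      # True: numerator and denominator are digits
print(like_num("Päť"))      # True: lower-cased match against _num_words
print(like_num("peniaze"))  # False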

View File

@ -1,5 +1,4 @@
# Source: https://github.com/Ardevop-sk/stopwords-sk
# Source: https://github.com/stopwords-iso/stopwords-sk
STOP_WORDS = set(
"""
@ -7,17 +6,41 @@ a
aby aby
aj aj
ak ak
akej
akejže
ako ako
akom
akomže
akou
akouže
akože
aká
akáže
aké
akého
akéhože
akému
akémuže
akéže
akú
akúže
aký aký
akých
akýchže
akým
akými
akýmiže
akýmže
akýže
ale ale
alebo alebo
and
ani ani
asi asi
avšak avšak
ba ba
bez bez
bezo
bol bol
bola bola
boli boli
@ -28,23 +51,32 @@ budeme
budete budete
budeš budeš
budú budú
buï
buď buď
by by
byť byť
cez cez
cezo
dnes dnes
do do
ešte ešte
for
ho ho
hoci hoci
i i
iba iba
ich ich
im im
inej
inom
iná
iné iné
iného
inému
iní
inú
iný iný
iných
iným
inými
ja ja
je je
jeho jeho
@ -53,80 +85,185 @@ jemu
ju ju
k k
kam kam
kamže
každou
každá každá
každé každé
každého
každému
každí každí
každú
každý každý
každých
každým
každými
kde kde
-kedže
-keï
+kej
+kejže
keď
keďže
kie
kieho
kiehože
kiemu
kiemuže
kieže
koho
kom
komu
kou
kouže
kto kto
ktorej
ktorou ktorou
ktorá ktorá
ktoré ktoré
ktorí ktorí
ktorú
ktorý ktorý
ktorých
ktorým
ktorými
ku ku
káže
kéže
kúže
kýho
kýhože
kým
kýmu
kýmuže
kýže
lebo lebo
leda
ledaže
len len
ma ma
majú
mal
mala
mali
mať mať
medzi medzi
menej
mi mi
mna
mne mne
mnou mnou
moja moja
moje moje
mojej
mojich
mojim
mojimi
mojou
moju
možno
mu mu
musia
musieť musieť
musí
musím
musíme
musíte
musíš
my my
mám
máme
máte máte
-mòa
+máš
môcť môcť
môj môj
môjho
môže môže
môžem
môžeme
môžete
môžeš
môžu
mňa
na na
nad nad
nado
najmä
nami nami
naša
naše
našej
naši naši
našich
našim
našimi
našou
ne
nech nech
neho neho
nej nej
nejakej
nejakom
nejakou
nejaká
nejaké
nejakého
nejakému
nejakú
nejaký
nejakých
nejakým
nejakými
nemu nemu
než než
nich nich
nie nie
niektorej
niektorom
niektorou
niektorá
niektoré
niektorého
niektorému
niektorú
niektorý niektorý
niektorých
niektorým
niektorými
nielen nielen
niečo
nim nim
nimi
nič nič
ničoho
ničom
ničomu
ničím
no no
nová
nové
noví
nový
nám nám
nás nás
náš náš
nášho
ním ním
o o
od od
odo odo
of
on on
ona ona
oni oni
ono ono
ony ony
oňho
po po
pod pod
podo
podľa podľa
pokiaľ pokiaľ
popod
popri
potom potom
poza
pre pre
pred pred
predo predo
@ -134,42 +271,56 @@ preto
pretože pretože
prečo prečo
pri pri
prvá
prvé
prví
prvý
práve práve
pýta
s s
sa sa
seba seba
sebe
sebou
sem sem
si si
sme sme
so so
som som
späť
ste ste
svoj svoj
svoja
svoje svoje
svojho
svojich svojich
svojim
svojimi
svojou
svoju
svojím svojím
svojími
ta ta
tak tak
takej
takejto
taká
takáto
také
takého
takéhoto
takému
takémuto
takéto
takí
takú
takúto
taký taký
takýto
takže takže
tam tam
te
teba teba
tebe tebe
tebou tebou
teda teda
tej tej
tejto
ten ten
tento tento
the
ti ti
tie tie
tieto tieto
@ -177,52 +328,97 @@ tiež
to to
toho toho
tohoto tohoto
tohto
tom tom
tomto tomto
tomu tomu
tomuto tomuto
toto toto
tou tou
touto
tu tu
tvoj tvoj
-tvojími
+tvoja
tvoje
tvojej
tvojho
tvoji
tvojich
tvojim
tvojimi
tvojím
ty ty
táto táto
títo
túto túto
tých
tým tým
tými
týmto týmto
u
v v
vami vami
vaša
vaše vaše
-veï
+vašej
vaši
vašich
vašim
vaším
veď
viac viac
vo vo
vy vy
vám vám
vás vás
váš váš
vášho
však však
všetci
všetka
všetko
všetky
všetok všetok
z z
za za
začo
začože
zo zo
a
áno áno
-èi
+čej
èo
èí
òom
òou
òu
či či
čia
čie
čieho
čiemu
čiu
čo čo
čoho
čom
čomu
čou
čože
čí
čím
čími
ďalšia ďalšia
ďalšie ďalšie
ďalšieho
ďalšiemu
ďalšiu
ďalšom
ďalšou
ďalší ďalší
ďalších
ďalším
ďalšími
ňom
ňou
ňu
že že
""".split() """.split()
) )

Some files were not shown because too many files have changed in this diff.