Mirror of https://github.com/explosion/spaCy.git

Commit f2ea3e3ea2: Merge branch 'master' into feature/nel-wiki
.github/contributors/ameyuuno.md | new file (vendored) | +106
@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
| ------------------------------ | -------------------- |
| Name                           | Alexey Kim           |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2019-07-09           |
| GitHub username                | ameyuuno             |
| Website (optional)             | https://ameyuuno.io  |
.github/contributors/askhogan.md | new file (vendored) | +106
@@ -0,0 +1,106 @@

Identical agreement text to the file above, except that **"us"** means
[ExplosionAI GmbH](https://explosion.ai/legal). Signed as an individual:

* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

## Contributor Details

| Field                          | Entry                |
| ------------------------------ | -------------------- |
| Name                           | Patrick Hogan        |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 7/7/2019             |
| GitHub username                | askhogan@gmail.com   |
| Website (optional)             |                      |
.github/contributors/khellan.md | new file (vendored) | +106
@@ -0,0 +1,106 @@

Identical agreement text to the file above (**"us"** meaning
[ExplosionAI GmbH](https://explosion.ai/legal)). Signed as an individual:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

## Contributor Details

| Field                          | Entry                |
| ------------------------------ | -------------------- |
| Name                           | Knut O. Hellan       |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 02.07.2019           |
| GitHub username                | khellan              |
| Website (optional)             | knuthellan.com       |
.github/contributors/kognate.md | new file (vendored) | +106
@@ -0,0 +1,106 @@

Identical agreement text to the file above (**"us"** meaning
[ExplosionAI GmbH](https://explosion.ai/legal)). Signed as an individual:

* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

## Contributor Details

| Field                          | Entry                |
| ------------------------------ | -------------------- |
| Name                           | Joshua B. Smith      |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | July 7, 2019         |
| GitHub username                | kognate              |
| Website (optional)             |                      |
.github/contributors/rokasramas.md | new file (vendored) | +106
@@ -0,0 +1,106 @@

Identical agreement text to the file above (**"us"** meaning
[ExplosionAI GmbH](https://explosion.ai/legal)). Signed on behalf of an employer:

* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                   |
| ------------------------------ | ----------------------- |
| Name                           | Rokas Ramanauskas       |
| Company name (if applicable)   | TokenMill               |
| Title or role (if applicable)  | Software Engineer       |
| Date                           | 2019-07-02              |
| GitHub username                | rokasramas              |
| Website (optional)             | http://www.tokenmill.lt |
CITATION | 8 lines changed
@@ -1,6 +1,6 @@
-@ARTICLE{spacy2,
-  AUTHOR = {Honnibal, Matthew AND Montani, Ines},
-  TITLE = {spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing},
+@unpublished{spacy2,
+  AUTHOR = {Honnibal, Matthew and Montani, Ines},
+  TITLE = {{spaCy 2}: Natural language understanding with {B}loom embeddings, convolutional neural networks and incremental parsing},
   YEAR = {2017},
-  JOURNAL = {To appear}
+  Note = {To appear}
 }
@ -51,7 +51,6 @@ def filter_spans(spans):
|
||||||
|
|
||||||
def extract_currency_relations(doc):
|
def extract_currency_relations(doc):
|
||||||
# Merge entities and noun chunks into one token
|
# Merge entities and noun chunks into one token
|
||||||
seen_tokens = set()
|
|
||||||
spans = list(doc.ents) + list(doc.noun_chunks)
|
spans = list(doc.ents) + list(doc.noun_chunks)
|
||||||
spans = filter_spans(spans)
|
spans = filter_spans(spans)
|
||||||
with doc.retokenize() as retokenizer:
|
with doc.retokenize() as retokenizer:
|
||||||
|
|
|
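For context, the pattern this hunk touches, filtering overlapping spans and then merging each survivor into a single token, can be sketched as follows. This is a minimal illustration, assuming an installed English model; in spaCy 2.1.4+ `filter_spans` is also available as `spacy.util.filter_spans`:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Net income was $9.4 million compared to the prior year of $2.7 million.")

# Keep only the longest non-overlapping spans among entities and noun chunks
spans = spacy.util.filter_spans(list(doc.ents) + list(doc.noun_chunks))

# Merge each surviving span into a single token
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span)

print([t.text for t in doc])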
@@ -5,6 +5,7 @@ import plac
 import random
 import numpy
 import time
+import re
 from collections import Counter
 from pathlib import Path
 from thinc.v2v import Affine, Maxout

@@ -23,19 +24,39 @@ from .train import _load_pretrained_tok2vec


 @plac.annotations(
-    texts_loc=("Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the "
-               "key 'tokens'", "positional", None, str),
+    texts_loc=(
+        "Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the "
+        "key 'tokens'",
+        "positional",
+        None,
+        str,
+    ),
     vectors_model=("Name or path to spaCy model with vectors to learn from"),
     output_dir=("Directory to write models to on each epoch", "positional", None, str),
     width=("Width of CNN layers", "option", "cw", int),
     depth=("Depth of CNN layers", "option", "cd", int),
     embed_rows=("Number of embedding rows", "option", "er", int),
-    loss_func=("Loss function to use for the objective. Either 'L2' or 'cosine'", "option", "L", str),
+    loss_func=(
+        "Loss function to use for the objective. Either 'L2' or 'cosine'",
+        "option",
+        "L",
+        str,
+    ),
     use_vectors=("Whether to use the static vectors as input features", "flag", "uv"),
     dropout=("Dropout rate", "option", "d", float),
     batch_size=("Number of words per training batch", "option", "bs", int),
-    max_length=("Max words per example. Longer examples are discarded", "option", "xw", int),
-    min_length=("Min words per example. Shorter examples are discarded", "option", "nw", int),
+    max_length=(
+        "Max words per example. Longer examples are discarded",
+        "option",
+        "xw",
+        int,
+    ),
+    min_length=(
+        "Min words per example. Shorter examples are discarded",
+        "option",
+        "nw",
+        int,
+    ),
     seed=("Seed for random number generators", "option", "s", int),
     n_iter=("Number of iterations to pretrain", "option", "i", int),
     n_save_every=("Save model every X batches.", "option", "se", int),

@@ -45,6 +66,13 @@ from .train import _load_pretrained_tok2vec
         "t2v",
         Path,
     ),
+    epoch_start=(
+        "The epoch to start counting at. Only relevant when using '--init-tok2vec' and the given weight file has been "
+        "renamed. Prevents unintended overwriting of existing weight files.",
+        "option",
+        "es",
+        int
+    ),
 )
 def pretrain(
     texts_loc,

@@ -63,6 +91,7 @@ def pretrain(
     seed=0,
     n_save_every=None,
     init_tok2vec=None,
+    epoch_start=None,
 ):
     """
     Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components,

@@ -131,9 +160,29 @@ def pretrain(
     if init_tok2vec is not None:
         components = _load_pretrained_tok2vec(nlp, init_tok2vec)
         msg.text("Loaded pretrained tok2vec for: {}".format(components))
+        # Parse the epoch number from the given weight file
+        model_name = re.search(r"model\d+\.bin", str(init_tok2vec))
+        if model_name:
+            # Default weight file name so read epoch_start from it by cutting off 'model' and '.bin'
+            epoch_start = int(model_name.group(0)[5:][:-4]) + 1
+        else:
+            if not epoch_start:
+                msg.fail(
+                    "You have to use the '--epoch-start' argument when using a renamed weight file for "
+                    "'--init-tok2vec'", exits=True
+                )
+            elif epoch_start < 0:
+                msg.fail(
+                    "The argument '--epoch-start' has to be greater or equal to 0. '%d' is invalid" % epoch_start,
+                    exits=True
+                )
+    else:
+        # Without '--init-tok2vec' the '--epoch-start' argument is ignored
+        epoch_start = 0
+
     optimizer = create_default_optimizer(model.ops)
     tracker = ProgressTracker(frequency=10000)
-    msg.divider("Pre-training tok2vec layer")
+    msg.divider("Pre-training tok2vec layer - starting at epoch %d" % epoch_start)
     row_settings = {"widths": (3, 10, 10, 6, 4), "aligns": ("r", "r", "r", "r", "r")}
     msg.row(("#", "# Words", "Total Loss", "Loss", "w/s"), **row_settings)

@@ -154,7 +203,7 @@ def pretrain(
             file_.write(srsly.json_dumps(log) + "\n")

     skip_counter = 0
-    for epoch in range(n_iter):
+    for epoch in range(epoch_start, n_iter + epoch_start):
         for batch_id, batch in enumerate(
             util.minibatch_by_words(((text, None) for text in texts), size=batch_size)
         ):
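To make the resume logic above concrete, here is a standalone sketch of the filename parsing it performs (the helper name and filenames are hypothetical; the regex and slicing mirror the hunk above):

import re

def parse_epoch_start(init_tok2vec):
    """Infer the next epoch from a default-named weight file like 'model12.bin'."""
    match = re.search(r"model\d+\.bin", str(init_tok2vec))
    if match:
        # "model12.bin" -> drop "model" (5 chars) and ".bin" (4 chars) -> 12; resume at 13
        return int(match.group(0)[5:][:-4]) + 1
    return None  # renamed file: the caller must supply --epoch-start explicitly

assert parse_epoch_start("output/model12.bin") == 13
assert parse_epoch_start("output/best_weights.bin") is None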
@@ -116,7 +116,7 @@ def parse_deps(orig_doc, options={}):
     doc (Doc): Document do parse.
     RETURNS (dict): Generated dependency parse keyed by words and arcs.
     """
-    doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes())
+    doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes(exclude=["user_data"]))
     if not doc.is_parsed:
         user_warning(Warnings.W005)
     if options.get("collapse_phrases", False):
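The change above keeps user data out of the copied Doc. A minimal sketch of the serialization call it relies on (assuming a spaCy 2.1-era environment with a blank pipeline):

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = nlp("The quick brown fox jumps over the lazy dog.")
doc.user_data["source"] = "example"  # arbitrary user data attached to the Doc

# Round-trip without user_data, as parse_deps now does before mutating the copy
copy = Doc(doc.vocab).from_bytes(doc.to_bytes(exclude=["user_data"]))
assert copy.user_data == {}
assert [t.text for t in copy] == [t.text for t in doc]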
@@ -537,6 +537,7 @@ for orth in [
     "Sen.",
     "St.",
     "vs.",
+    "v.s."
 ]:
     _exc[orth] = [{ORTH: orth}]
@@ -5,7 +5,7 @@ from __future__ import unicode_literals
 """
 Example sentences to test spaCy and its language models.

->>> from spacy.lang.en.examples import sentences
+>>> from spacy.lang.id.examples import sentences
 >>> docs = nlp.pipe(sentences)
 """
@@ -1,15 +1,37 @@
 # coding: utf8
 from __future__ import unicode_literals

+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
+from .lex_attrs import LEX_ATTRS
+from .tag_map import TAG_MAP
+from .lemmatizer import LOOKUP
+from .morph_rules import MORPH_RULES

+from ..tokenizer_exceptions import BASE_EXCEPTIONS
+from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG
+from ...attrs import LANG, NORM
+from ...util import update_exc, add_lookups


+def _return_lt(_):
+    return "lt"
+
+
 class LithuanianDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
-    lex_attr_getters[LANG] = lambda text: "lt"
+    lex_attr_getters[LANG] = _return_lt
+    lex_attr_getters[NORM] = add_lookups(
+        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
+    )
+    lex_attr_getters.update(LEX_ATTRS)

+    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
+    tag_map = TAG_MAP
+    morph_rules = MORPH_RULES
+    lemma_lookup = LOOKUP


 class Lithuanian(Language):
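The switch from `lambda text: "lt"` to the module-level `_return_lt` is presumably about picklability: lambdas cannot be pickled, while named module-level functions can, which matters when language defaults are serialized (for example under multiprocessing). A minimal demonstration:

import pickle

def _return_lt(_):
    return "lt"

pickle.dumps(_return_lt)  # fine: module-level functions pickle by qualified name

try:
    pickle.dumps(lambda text: "lt")
except (pickle.PicklingError, AttributeError) as err:
    print("lambdas are not picklable:", err)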
spacy/lang/lt/examples.py | new file | +22
@@ -0,0 +1,22 @@

# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.lt.examples import sentences
>>> docs = nlp.pipe(sentences)
"""


sentences = [
    "Jaunikis pirmąją vestuvinę naktį iškeitė į areštinės gultą",
    "Bepiločiai automobiliai išnaikins vairavimo mokyklas, autoservisus ir eismo nelaimes",
    "Vilniuje galvojama uždrausti naudoti skėčius",
    "Londonas yra didelis miestas Jungtinėje Karalystėje",
    "Kur tu?",
    "Kas yra Prancūzijos prezidentas?",
    "Kokia yra Jungtinių Amerikos Valstijų sostinė?",
    "Kada gimė Dalia Grybauskaitė?",
]
spacy/lang/lt/lemmatizer.py | new file | +234227 (diff suppressed: too large)
spacy/lang/lt/lex_attrs.py | new file | +1153 (diff suppressed: too large)
spacy/lang/lt/morph_rules.py | new file | +3075 (diff suppressed: too large)
spacy/lang/lt/tag_map.py | new file | +4798 (diff suppressed: too large)
spacy/lang/lt/tokenizer_exceptions.py | new file | +268
@@ -0,0 +1,268 @@

# coding: utf8
from __future__ import unicode_literals

from ...symbols import ORTH

_exc = {}

for orth in [
    "G.",
    "J. E.",
    "J. Em.",
    "J.E.",
    "J.Em.",
    "K.",
    "N.",
    "V.",
    "Vt.",
    "a.",
    "a.k.",
    "a.s.",
    "adv.",
    "akad.",
    "aklg.",
    "akt.",
    "al.",
    "ang.",
    "angl.",
    "aps.",
    "apskr.",
    "apyg.",
    "arbat.",
    "asist.",
    "asm.",
    "asm.k.",
    "asmv.",
    "atk.",
    "atsak.",
    "atsisk.",
    "atsisk.sąsk.",
    "atv.",
    "aut.",
    "avd.",
    "b.k.",
    "baud.",
    "biol.",
    "bkl.",
    "bot.",
    "bt.",
    "buv.",
    "ch.",
    "chem.",
    "corp.",
    "d.",
    "dab.",
    "dail.",
    "dek.",
    "deš.",
    "dir.",
    "dirig.",
    "doc.",
    "dol.",
    "dr.",
    "drp.",
    "dvit.",
    "dėst.",
    "dš.",
    "dž.",
    "e.b.",
    "e.bankas",
    "e.p.",
    "e.parašas",
    "e.paštas",
    "e.v.",
    "e.valdžia",
    "egz.",
    "eil.",
    "ekon.",
    "el.",
    "el.bankas",
    "el.p.",
    "el.parašas",
    "el.paštas",
    "el.valdžia",
    "etc.",
    "ež.",
    "fak.",
    "faks.",
    "feat.",
    "filol.",
    "filos.",
    "g.",
    "gen.",
    "geol.",
    "gerb.",
    "gim.",
    "gr.",
    "gv.",
    "gyd.",
    "gyv.",
    "habil.",
    "inc.",
    "insp.",
    "inž.",
    "ir pan.",
    "ir t. t.",
    "isp.",
    "istor.",
    "it.",
    "just.",
    "k.",
    "k. a.",
    "k.a.",
    "kab.",
    "kand.",
    "kart.",
    "kat.",
    "ketv.",
    "kh.",
    "kl.",
    "kln.",
    "km.",
    "kn.",
    "koresp.",
    "kpt.",
    "kr.",
    "kt.",
    "kub.",
    "kun.",
    "kv.",
    "kyš.",
    "l. e. p.",
    "l.e.p.",
    "lenk.",
    "liet.",
    "lot.",
    "lt.",
    "ltd.",
    "ltn.",
    "m.",
    "m.e..",
    "m.m.",
    "mat.",
    "med.",
    "mgnt.",
    "mgr.",
    "min.",
    "mjr.",
    "ml.",
    "mln.",
    "mlrd.",
    "mob.",
    "mok.",
    "moksl.",
    "mokyt.",
    "mot.",
    "mr.",
    "mst.",
    "mstl.",
    "mėn.",
    "nkt.",
    "no.",
    "nr.",
    "ntk.",
    "nuotr.",
    "op.",
    "org.",
    "orig.",
    "p.",
    "p.d.",
    "p.m.e.",
    "p.s.",
    "pab.",
    "pan.",
    "past.",
    "pav.",
    "pavad.",
    "per.",
    "perd.",
    "pirm.",
    "pl.",
    "plg.",
    "plk.",
    "pr.",
    "pr.Kr.",
    "pranc.",
    "proc.",
    "prof.",
    "prom.",
    "prot.",
    "psl.",
    "pss.",
    "pvz.",
    "pšt.",
    "r.",
    "raj.",
    "red.",
    "rez.",
    "rež.",
    "rus.",
    "rš.",
    "s.",
    "sav.",
    "saviv.",
    "sek.",
    "sekr.",
    "sen.",
    "sh.",
    "sk.",
    "skg.",
    "skv.",
    "skyr.",
    "sp.",
    "spec.",
    "sr.",
    "st.",
    "str.",
    "stud.",
    "sąs.",
    "t.",
    "t. p.",
    "t. y.",
    "t.p.",
    "t.t.",
    "t.y.",
    "techn.",
    "tel.",
    "teol.",
    "th.",
    "tir.",
    "trit.",
    "trln.",
    "tšk.",
    "tūks.",
    "tūkst.",
    "up.",
    "upl.",
    "v.s.",
    "vad.",
    "val.",
    "valg.",
    "ved.",
    "vert.",
    "vet.",
    "vid.",
    "virš.",
    "vlsč.",
    "vnt.",
    "vok.",
    "vs.",
    "vtv.",
    "vv.",
    "vyr.",
    "vyresn.",
    "zool.",
    "Įn",
    "įl.",
    "š.m.",
    "šnek.",
    "šv.",
    "švč.",
    "ž.ū.",
    "žin.",
    "žml.",
    "žr.",
]:
    _exc[orth] = [{ORTH: orth}]

TOKENIZER_EXCEPTIONS = _exc
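As a quick illustration of what these exceptions buy, an abbreviation like "pvz." should survive tokenization as a single token instead of being split before the period (a sketch assuming this branch is installed; exact output may differ):

from spacy.lang.lt import Lithuanian

nlp = Lithuanian()
doc = nlp("Veikia daug mašinų, pvz. traktoriai.")
print([t.text for t in doc])
# "pvz." stays together thanks to the exception table:
# ['Veikia', 'daug', 'mašinų', ',', 'pvz.', 'traktoriai', '.']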
@@ -22,6 +22,7 @@ NOUN_RULES = [
 VERB_RULES = [
     ["er", "e"],  # vasker -> vaske
     ["et", "e"],  # vasket -> vaske
+    ["a", "e"],  # vaska -> vaske
     ["es", "e"],  # vaskes -> vaske
     ["te", "e"],  # stekte -> steke
     ["år", "å"],  # får -> få
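These rules are plain suffix rewrites; a minimal sketch of how such a table could be applied (illustrative only, not spaCy's actual lemmatizer code):

VERB_RULES = [
    ["er", "e"],  # vasker -> vaske
    ["et", "e"],  # vasket -> vaske
    ["a", "e"],   # vaska -> vaske (the rule added here)
]

def apply_rules(word, rules):
    # Return the first rewrite whose suffix matches, else the word unchanged
    for old, new in rules:
        if word.endswith(old):
            return word[: -len(old)] + new
    return word

assert apply_rules("vaska", VERB_RULES) == "vaske"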
@@ -10,7 +10,15 @@ _exc = {}
 for exc_data in [
     {ORTH: "jan.", LEMMA: "januar"},
     {ORTH: "feb.", LEMMA: "februar"},
+    {ORTH: "mar.", LEMMA: "mars"},
+    {ORTH: "apr.", LEMMA: "april"},
+    {ORTH: "jun.", LEMMA: "juni"},
     {ORTH: "jul.", LEMMA: "juli"},
+    {ORTH: "aug.", LEMMA: "august"},
+    {ORTH: "sep.", LEMMA: "september"},
+    {ORTH: "okt.", LEMMA: "oktober"},
+    {ORTH: "nov.", LEMMA: "november"},
+    {ORTH: "des.", LEMMA: "desember"},
 ]:
     _exc[exc_data[ORTH]] = [exc_data]

@@ -18,11 +26,13 @@ for exc_data in [
 for orth in [
     "adm.dir.",
     "a.m.",
+    "andelsnr",
     "Aq.",
     "b.c.",
     "bl.a.",
     "bla.",
     "bm.",
+    "bnr.",
     "bto.",
     "ca.",
     "cand.mag.",

@@ -41,6 +51,7 @@ for orth in [
     "el.",
     "e.l.",
     "et.",
+    "etc.",
     "etg.",
     "ev.",
     "evt.",

@@ -76,6 +87,7 @@ for orth in [
     "kgl.res.",
     "kl.",
     "komm.",
+    "kr.",
     "kst.",
     "lø.",
     "ma.",

@@ -106,6 +118,7 @@ for orth in [
     "o.l.",
     "on.",
     "op.",
+    "org.",
     "osv.",
     "ovf.",
     "p.",

@@ -130,6 +143,7 @@ for orth in [
     "sep.",
     "siviling.",
     "sms.",
+    "snr.",
     "spm.",
     "sr.",
     "sst.",
spacy/lang/sq/examples.py | new file | +18
@@ -0,0 +1,18 @@

# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.sq.examples import sentences
>>> docs = nlp.pipe(sentences)
"""


sentences = [
    "Apple po shqyrton blerjen e nje shoqërie të U.K. për 1 miliard dollarë",
    "Makinat autonome ndryshojnë përgjegjësinë e sigurimit ndaj prodhuesve",
    "San Francisko konsideron ndalimin e robotëve të shpërndarjes",
    "Londra është një qytet i madh në Mbretërinë e Bashkuar.",
]
@ -1,15 +1,17 @@
|
||||||
# coding: utf8
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from collections import defaultdict
|
from collections import defaultdict, OrderedDict
|
||||||
import srsly
|
import srsly
|
||||||
|
|
||||||
from ..errors import Errors
|
from ..errors import Errors
|
||||||
from ..compat import basestring_
|
from ..compat import basestring_
|
||||||
from ..util import ensure_path
|
from ..util import ensure_path, to_disk, from_disk
|
||||||
from ..tokens import Span
|
from ..tokens import Span
|
||||||
from ..matcher import Matcher, PhraseMatcher
|
from ..matcher import Matcher, PhraseMatcher
|
||||||
|
|
||||||
|
DEFAULT_ENT_ID_SEP = '||'
|
||||||
|
|
||||||
|
|
||||||
class EntityRuler(object):
|
class EntityRuler(object):
|
||||||
"""The EntityRuler lets you add spans to the `Doc.ents` using token-based
|
"""The EntityRuler lets you add spans to the `Doc.ents` using token-based
|
||||||
|
@ -24,7 +26,7 @@ class EntityRuler(object):
|
||||||
|
|
||||||
name = "entity_ruler"
|
name = "entity_ruler"
|
||||||
|
|
||||||
def __init__(self, nlp, **cfg):
|
def __init__(self, nlp, phrase_matcher_attr=None, **cfg):
|
||||||
"""Initialize the entitiy ruler. If patterns are supplied here, they
|
"""Initialize the entitiy ruler. If patterns are supplied here, they
|
||||||
need to be a list of dictionaries with a `"label"` and `"pattern"`
|
need to be a list of dictionaries with a `"label"` and `"pattern"`
|
||||||
key. A pattern can either be a token pattern (list) or a phrase pattern
|
key. A pattern can either be a token pattern (list) or a phrase pattern
|
||||||
|
@ -32,6 +34,8 @@ class EntityRuler(object):
|
||||||
|
|
||||||
nlp (Language): The shared nlp object to pass the vocab to the matchers
|
nlp (Language): The shared nlp object to pass the vocab to the matchers
|
||||||
and process phrase patterns.
|
and process phrase patterns.
|
||||||
|
phrase_matcher_attr (int / unicode): Token attribute to match on, passed
|
||||||
|
to the internal PhraseMatcher as `attr`
|
||||||
patterns (iterable): Optional patterns to load in.
|
patterns (iterable): Optional patterns to load in.
|
||||||
overwrite_ents (bool): If existing entities are present, e.g. entities
|
overwrite_ents (bool): If existing entities are present, e.g. entities
|
||||||
added by the model, overwrite them by matches if necessary.
|
added by the model, overwrite them by matches if necessary.
|
||||||
|
@ -47,8 +51,13 @@ class EntityRuler(object):
|
||||||
self.token_patterns = defaultdict(list)
|
self.token_patterns = defaultdict(list)
|
||||||
self.phrase_patterns = defaultdict(list)
|
self.phrase_patterns = defaultdict(list)
|
||||||
self.matcher = Matcher(nlp.vocab)
|
self.matcher = Matcher(nlp.vocab)
|
||||||
|
if phrase_matcher_attr is not None:
|
||||||
|
self.phrase_matcher_attr = phrase_matcher_attr
|
||||||
|
self.phrase_matcher = PhraseMatcher(nlp.vocab, attr=self.phrase_matcher_attr)
|
||||||
|
else:
|
||||||
|
self.phrase_matcher_attr = None
|
||||||
self.phrase_matcher = PhraseMatcher(nlp.vocab)
|
self.phrase_matcher = PhraseMatcher(nlp.vocab)
|
||||||
self.ent_id_sep = cfg.get("ent_id_sep", "||")
|
self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
|
||||||
patterns = cfg.get("patterns")
|
patterns = cfg.get("patterns")
|
||||||
if patterns is not None:
|
if patterns is not None:
|
||||||
self.add_patterns(patterns)
|
self.add_patterns(patterns)
|
||||||
@@ -212,8 +221,17 @@ class EntityRuler(object):
 
         DOCS: https://spacy.io/api/entityruler#from_bytes
         """
-        patterns = srsly.msgpack_loads(patterns_bytes)
-        self.add_patterns(patterns)
+        cfg = srsly.msgpack_loads(patterns_bytes)
+        if isinstance(cfg, dict):
+            self.add_patterns(cfg.get('patterns', cfg))
+            self.overwrite = cfg.get('overwrite', False)
+            self.phrase_matcher_attr = cfg.get('phrase_matcher_attr', None)
+            if self.phrase_matcher_attr is not None:
+                self.phrase_matcher = PhraseMatcher(self.nlp.vocab,
+                                                    attr=self.phrase_matcher_attr)
+            self.ent_id_sep = cfg.get('ent_id_sep', DEFAULT_ENT_ID_SEP)
+        else:
+            self.add_patterns(cfg)
         return self
 
     def to_bytes(self, **kwargs):

@@ -223,7 +241,13 @@ class EntityRuler(object):
 
         DOCS: https://spacy.io/api/entityruler#to_bytes
         """
-        return srsly.msgpack_dumps(self.patterns)
+        serial = OrderedDict((
+            ('overwrite', self.overwrite),
+            ('ent_id_sep', self.ent_id_sep),
+            ('phrase_matcher_attr', self.phrase_matcher_attr),
+            ('patterns', self.patterns)))
+        return srsly.msgpack_dumps(serial)
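Together, the two methods now round-trip the ruler's configuration, not just its patterns. A rough sketch:

```python
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER",
                    patterns=[{"label": "ORG", "pattern": "apple"}])
ruler_bytes = ruler.to_bytes()  # msgpack of the OrderedDict above, not a bare list

new_ruler = EntityRuler(nlp).from_bytes(ruler_bytes)
assert new_ruler.phrase_matcher_attr == "LOWER"  # cfg survives the round trip
for pattern in ruler.patterns:
    assert pattern in new_ruler.patterns
```

The `isinstance(cfg, dict)` check is what keeps old-style byte streams (a plain pattern list) loadable; the regression tests further down exercise exactly that path.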
 
     def from_disk(self, path, **kwargs):
         """Load the entity ruler from a file. Expects a file containing

@@ -236,9 +260,23 @@ class EntityRuler(object):
 
         DOCS: https://spacy.io/api/entityruler#from_disk
         """
         path = ensure_path(path)
-        path = path.with_suffix(".jsonl")
-        patterns = srsly.read_jsonl(path)
-        self.add_patterns(patterns)
+        if path.is_file():
+            patterns = srsly.read_jsonl(path)
+            self.add_patterns(patterns)
+        else:
+            cfg = {}
+            deserializers = {
+                'patterns': lambda p: self.add_patterns(srsly.read_jsonl(p.with_suffix('.jsonl'))),
+                'cfg': lambda p: cfg.update(srsly.read_json(p))
+            }
+            from_disk(path, deserializers, {})
+            self.overwrite = cfg.get('overwrite', False)
+            self.phrase_matcher_attr = cfg.get('phrase_matcher_attr')
+            self.ent_id_sep = cfg.get('ent_id_sep', DEFAULT_ENT_ID_SEP)
+
+            if self.phrase_matcher_attr is not None:
+                self.phrase_matcher = PhraseMatcher(self.nlp.vocab,
+                                                    attr=self.phrase_matcher_attr)
         return self
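When `path.is_file()` is true, the input is still treated as plain JSONL with one pattern dict per line, so legacy single-file rulers keep loading. A sketch of that fallback (file name hypothetical):

```python
import srsly
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
patterns = [{"label": "ORG", "pattern": "Apple"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
srsly.write_jsonl("rules.jsonl", patterns)  # legacy single-file layout

ruler = EntityRuler(nlp).from_disk("rules.jsonl")
assert len(ruler) == len(patterns)
```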
 
     def to_disk(self, path, **kwargs):

@@ -251,6 +289,13 @@ class EntityRuler(object):
 
         DOCS: https://spacy.io/api/entityruler#to_disk
         """
+        cfg = {'overwrite': self.overwrite,
+               'phrase_matcher_attr': self.phrase_matcher_attr,
+               'ent_id_sep': self.ent_id_sep}
+        serializers = {
+            'patterns': lambda p: srsly.write_jsonl(p.with_suffix('.jsonl'),
+                                                    self.patterns),
+            'cfg': lambda p: srsly.write_json(p, cfg)
+        }
         path = ensure_path(path)
-        path = path.with_suffix(".jsonl")
-        srsly.write_jsonl(path, self.patterns)
+        to_disk(path, serializers, {})
 
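With the directory branch in place, `to_disk` on a suffix-less path writes a `cfg` JSON file next to `patterns.jsonl`, and `from_disk` restores both. A rough round-trip sketch (temporary path hypothetical):

```python
from pathlib import Path
import tempfile

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER",
                    patterns=[{"label": "ORG", "pattern": "Apple"}])

with tempfile.TemporaryDirectory() as tmpdir:
    out = Path(tmpdir) / "entity_ruler"
    ruler.to_disk(out)  # writes entity_ruler/cfg and entity_ruler/patterns.jsonl
    restored = EntityRuler(nlp).from_disk(out)
    assert restored.phrase_matcher_attr == "LOWER"
    for pattern in ruler.patterns:
        assert pattern in restored.patterns
```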
@@ -1003,7 +1003,7 @@ cdef class DependencyParser(Parser):
 
     @property
     def postprocesses(self):
-        return [nonproj.deprojectivize]  # , merge_subtokens]
+        return [nonproj.deprojectivize]
 
     def add_multitask_objective(self, target):
         if target == "cloze":
 
@@ -52,6 +52,7 @@ class Scorer(object):
         self.labelled = PRFScore()
         self.tags = PRFScore()
         self.ner = PRFScore()
+        self.ner_per_ents = dict()
         self.eval_punct = eval_punct
 
     @property

@@ -104,6 +105,15 @@ class Scorer(object):
             "ents_f": self.ents_f,
             "tags_acc": self.tags_acc,
             "token_acc": self.token_acc,
+            "ents_per_type": self.__scores_per_ents(),
         }
 
+    def __scores_per_ents(self):
+        """RETURNS (dict): Scores per NER entity type."""
+        return {
+            k: {"p": v.precision * 100, "r": v.recall * 100, "f": v.fscore * 100}
+            for k, v in self.ner_per_ents.items()
+        }
+
     def score(self, doc, gold, verbose=False, punct_labels=("p", "punct")):
@@ -149,13 +159,31 @@ def score(self, doc, gold, verbose=False, punct_labels=("p", "punct")):
             cand_deps.add((gold_i, gold_head, token.dep_.lower()))
         if "-" not in [token[-1] for token in gold.orig_annot]:
             cand_ents = set()
+            current_ent = {k.label_: set() for k in doc.ents}
+            current_gold = {k.label_: set() for k in doc.ents}
             for ent in doc.ents:
+                if ent.label_ not in self.ner_per_ents:
+                    self.ner_per_ents[ent.label_] = PRFScore()
                 first = gold.cand_to_gold[ent.start]
                 last = gold.cand_to_gold[ent.end - 1]
                 if first is None or last is None:
                     self.ner.fp += 1
+                    self.ner_per_ents[ent.label_].fp += 1
                 else:
                     cand_ents.add((ent.label_, first, last))
+                    current_ent[ent.label_].add(
+                        tuple(x for x in cand_ents if x[0] == ent.label_)
+                    )
+                    current_gold[ent.label_].add(
+                        tuple(x for x in gold_ents if x[0] == ent.label_)
+                    )
+            # Scores per ent
+            [
+                v.score_set(current_ent[k], current_gold[k])
+                for k, v in self.ner_per_ents.items()
+                if k in current_ent
+            ]
+            # Score for all ents
             self.ner.score_set(cand_ents, gold_ents)
         self.tags.score_set(cand_tags, gold_tags)
         self.labelled.score_set(cand_deps, gold_deps)
 
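Downstream, the per-type numbers surface under the new `ents_per_type` key alongside the aggregate scores. A reading sketch (this `Scorer` hasn't scored anything yet, so the dict stays empty until `score()` runs):

```python
from spacy.scorer import Scorer

scorer = Scorer()
scores = scorer.scores
print(scores["ents_f"])  # aggregate NER F-score, as before
for label, prf in scores["ents_per_type"].items():
    # each value is {"p": ..., "r": ..., "f": ...}, scaled to 0-100
    print(label, prf["p"], prf["r"], prf["f"])
```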
@@ -124,6 +124,16 @@ def ja_tokenizer():
     return get_lang_class("ja").Defaults.create_tokenizer()
 
 
+@pytest.fixture(scope="session")
+def lt_tokenizer():
+    return get_lang_class("lt").Defaults.create_tokenizer()
+
+
+@pytest.fixture(scope="session")
+def lt_lemmatizer():
+    return get_lang_class("lt").Defaults.create_lemmatizer()
+
+
 @pytest.fixture(scope="session")
 def nb_tokenizer():
     return get_lang_class("nb").Defaults.create_tokenizer()
 
0
spacy/tests/lang/lt/__init__.py
Normal file
15
spacy/tests/lang/lt/test_lemmatizer.py
Normal file
@@ -0,0 +1,15 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+@pytest.mark.parametrize("tokens,lemmas", [
+    (["Galime", "vadinti", "gerovės", "valstybe", ",", "turime", "išvystytą", "socialinę", "apsaugą", ",",
+      "sveikatos", "apsaugą", "ir", "prieinamą", "švietimą", "."],
+     ["galėti", "vadintas", "gerovė", "valstybė", ",", "turėti", "išvystytas", "socialinis",
+      "apsauga", ",", "sveikata", "apsauga", "ir", "prieinamas", "švietimas", "."]),
+    (["taip", ",", "uoliai", "tyrinėjau", "ir", "pasirinkau", "geriausią", "variantą", "."],
+     ["taip", ",", "uolus", "tyrinėti", "ir", "pasirinkti", "geras", "variantas", "."])])
+def test_lt_lemmatizer(lt_lemmatizer, tokens, lemmas):
+    assert lemmas == [lt_lemmatizer.lookup(token) for token in tokens]
44
spacy/tests/lang/lt/test_text.py
Normal file
@@ -0,0 +1,44 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_lt_tokenizer_handles_long_text(lt_tokenizer):
+    text = """Tokios sausros kriterijus atitinka pirmadienį atlikti skaičiavimai, palyginus faktinį ir žemiausią
+vidutinį daugiametį vandens lygį. Nustatyta, kad iš 48 šalies vandens matavimo stočių 28-iose stotyse vandens lygis
+yra žemesnis arba lygus žemiausiam vidutiniam daugiamečiam šiltojo laikotarpio vandens lygiui."""
+    tokens = lt_tokenizer(text.replace("\n", ""))
+    assert len(tokens) == 42
+
+
+@pytest.mark.parametrize('text,length', [
+    ("177R Parodų rūmai–Ozo g. nuo vasario 18 d. bus skelbiamas interneto tinklalapyje.", 15),
+    ("ISM universiteto doc. dr. Ieva Augutytė-Kvedaravičienė pastebi, kad tyrimais nustatyti elgesio pokyčiai.", 16)])
+def test_lt_tokenizer_handles_punct_abbrev(lt_tokenizer, text, length):
+    tokens = lt_tokenizer(text)
+    assert len(tokens) == length
+
+
+@pytest.mark.parametrize("text", ["km.", "pvz.", "biol."])
+def test_lt_tokenizer_abbrev_exceptions(lt_tokenizer, text):
+    tokens = lt_tokenizer(text)
+    assert len(tokens) == 1
+
+
+@pytest.mark.parametrize("text,match", [
+    ("10", True),
+    ("1", True),
+    ("10,000", True),
+    ("10,00", True),
+    ("999.0", True),
+    ("vienas", True),
+    ("du", True),
+    ("milijardas", True),
+    ("šuo", False),
+    (",", False),
+    ("1/2", True)])
+def test_lt_lex_attrs_like_number(lt_tokenizer, text, match):
+    tokens = lt_tokenizer(text)
+    assert len(tokens) == 1
+    assert tokens[0].like_num == match
@@ -106,5 +106,24 @@ def test_entity_ruler_serialize_bytes(nlp, patterns):
     assert len(new_ruler) == 0
     assert len(new_ruler.labels) == 0
     new_ruler = new_ruler.from_bytes(ruler_bytes)
+    assert len(new_ruler) == len(patterns)
+    assert len(new_ruler.labels) == 4
+    assert len(new_ruler.patterns) == len(ruler.patterns)
+    for pattern in ruler.patterns:
+        assert pattern in new_ruler.patterns
+    assert new_ruler.labels == ruler.labels
+
+
+def test_entity_ruler_serialize_phrase_matcher_attr_bytes(nlp, patterns):
+    ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER", patterns=patterns)
     assert len(ruler) == len(patterns)
     assert len(ruler.labels) == 4
+    ruler_bytes = ruler.to_bytes()
+    new_ruler = EntityRuler(nlp)
+    assert len(new_ruler) == 0
+    assert len(new_ruler.labels) == 0
+    assert new_ruler.phrase_matcher_attr is None
+    new_ruler = new_ruler.from_bytes(ruler_bytes)
+    assert len(new_ruler) == len(patterns)
+    assert len(new_ruler.labels) == 4
+    assert new_ruler.phrase_matcher_attr == "LOWER"
86
spacy/tests/regression/test_issue3526.py
Normal file
@@ -0,0 +1,86 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+import pytest
+from spacy.tokens import Span
+from spacy.language import Language
+from spacy.pipeline import EntityRuler
+from spacy import load
+import srsly
+from ..util import make_tempdir
+
+
+@pytest.fixture
+def patterns():
+    return [
+        {"label": "HELLO", "pattern": "hello world"},
+        {"label": "BYE", "pattern": [{"LOWER": "bye"}, {"LOWER": "bye"}]},
+        {"label": "HELLO", "pattern": [{"ORTH": "HELLO"}]},
+        {"label": "COMPLEX", "pattern": [{"ORTH": "foo", "OP": "*"}]},
+        {"label": "TECH_ORG", "pattern": "Apple", "id": "a1"},
+    ]
+
+
+@pytest.fixture
+def add_ent():
+    def add_ent_component(doc):
+        doc.ents = [Span(doc, 0, 3, label=doc.vocab.strings["ORG"])]
+        return doc
+
+    return add_ent_component
+
+
+def test_entity_ruler_existing_overwrite_serialize_bytes(patterns, en_vocab):
+    nlp = Language(vocab=en_vocab)
+    ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
+    ruler_bytes = ruler.to_bytes()
+    assert len(ruler) == len(patterns)
+    assert len(ruler.labels) == 4
+    assert ruler.overwrite
+    new_ruler = EntityRuler(nlp)
+    new_ruler = new_ruler.from_bytes(ruler_bytes)
+    assert len(new_ruler) == len(ruler)
+    assert len(new_ruler.labels) == 4
+    assert new_ruler.overwrite == ruler.overwrite
+    assert new_ruler.ent_id_sep == ruler.ent_id_sep
+
+
+def test_entity_ruler_existing_bytes_old_format_safe(patterns, en_vocab):
+    nlp = Language(vocab=en_vocab)
+    ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
+    bytes_old_style = srsly.msgpack_dumps(ruler.patterns)
+    new_ruler = EntityRuler(nlp)
+    new_ruler = new_ruler.from_bytes(bytes_old_style)
+    assert len(new_ruler) == len(ruler)
+    for pattern in ruler.patterns:
+        assert pattern in new_ruler.patterns
+    assert new_ruler.overwrite is not ruler.overwrite
+
+
+def test_entity_ruler_from_disk_old_format_safe(patterns, en_vocab):
+    nlp = Language(vocab=en_vocab)
+    ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
+    with make_tempdir() as tmpdir:
+        out_file = tmpdir / "entity_ruler.jsonl"
+        srsly.write_jsonl(out_file, ruler.patterns)
+        new_ruler = EntityRuler(nlp)
+        new_ruler = new_ruler.from_disk(out_file)
+        for pattern in ruler.patterns:
+            assert pattern in new_ruler.patterns
+        assert len(new_ruler) == len(ruler)
+        assert new_ruler.overwrite is not ruler.overwrite
+
+
+def test_entity_ruler_in_pipeline_from_issue(patterns, en_vocab):
+    nlp = Language(vocab=en_vocab)
+    ruler = EntityRuler(nlp, overwrite_ents=True)
+
+    ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
+    nlp.add_pipe(ruler)
+    with make_tempdir() as tmpdir:
+        nlp.to_disk(tmpdir)
+        assert nlp.pipeline[-1][-1].patterns == [{"label": "ORG", "pattern": "Apple"}]
+        assert nlp.pipeline[-1][-1].overwrite is True
+        nlp2 = load(tmpdir)
+        assert nlp2.pipeline[-1][-1].patterns == [{"label": "ORG", "pattern": "Apple"}]
+        assert nlp2.pipeline[-1][-1].overwrite is True
15
spacy/tests/regression/test_issue3882.py
Normal file
@@ -0,0 +1,15 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from spacy.displacy import parse_deps
+from spacy.tokens import Doc
+
+
+def test_issue3882(en_vocab):
+    """Test that displaCy doesn't serialize the doc.user_data when making a
+    copy of the Doc.
+    """
+    doc = Doc(en_vocab, words=["Hello", "world"])
+    doc.is_parsed = True
+    doc.user_data["test"] = set()
+    parse_deps(doc)
@@ -284,9 +284,9 @@ same between pretraining and training. The API and errors around this need some
 improvement.
 
 ```bash
-$ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width]
-[--depth] [--embed-rows] [--loss_func] [--dropout] [--seed] [--n-iter] [--use-vectors]
-[--n-save_every]
+$ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
+[--width] [--depth] [--embed-rows] [--loss_func] [--dropout] [--batch-size] [--max-length] [--min-length]
+[--seed] [--n-iter] [--use-vectors] [--n-save_every] [--init-tok2vec] [--epoch-start]
 ```
 
 | Argument | Type | Description |

@@ -307,6 +307,7 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width]
 | `--use-vectors`, `-uv` | flag | Whether to use the static vectors as input features. |
 | `--n-save-every`, `-se` | option | Save model every X batches. |
 | `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. |
+| `--epoch-start`, `-es` <Tag variant="new">2.1.5</Tag> | option | The epoch to start counting at. Only relevant when using `--init-tok2vec` and the given weight file has been renamed. Prevents unintended overwriting of existing weight files. |
 | **CREATES** | weights | The pre-trained weights that can be used to initialize `spacy train`. |
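For instance, resuming from a renamed weights file might look like the following (paths, file names and the epoch number are hypothetical):

```bash
$ python -m spacy pretrain texts.jsonl en_vectors_web_lg output_dir \
    --init-tok2vec output_dir/model99.bin --epoch-start 100
```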
 
 ### JSONL format for raw text {#pretrain-jsonl}
 
@@ -34,6 +34,7 @@ be a token pattern (list) or a phrase pattern (string). For example:
 | ---------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `nlp` | `Language` | The shared nlp object to pass the vocab to the matchers and process phrase patterns. |
 | `patterns` | iterable | Optional patterns to load in. |
+| `phrase_matcher_attr` | int / unicode | Optional attr to pass to the internal [`PhraseMatcher`](/api/phrasematcher). Defaults to `None`. |
 | `overwrite_ents` | bool | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. |
 | `**cfg` | - | Other config parameters. If pipeline component is loaded as part of a model pipeline, this will include all keyword arguments passed to `spacy.load`. |
 | **RETURNS** | `EntityRuler` | The newly constructed object. |
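A short sketch of the new argument in use; with `"LOWER"`, phrase patterns match case-insensitively (pipeline set-up hypothetical):

```python
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER")
ruler.add_patterns([{"label": "ORG", "pattern": "apple"}])
nlp.add_pipe(ruler)

doc = nlp("I work at APPLE")  # still matches despite the casing
assert [(ent.text, ent.label_) for ent in doc.ents] == [("APPLE", "ORG")]
```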
@@ -305,11 +305,11 @@ match on the uppercase versions, in case someone has written it as "Google i/o".
 
 ```python
 ### {executable="true"}
-import spacy
+from spacy.lang.en import English
 from spacy.matcher import Matcher
 from spacy.tokens import Span
 
-nlp = spacy.load("en_core_web_sm")
+nlp = English()
 matcher = Matcher(nlp.vocab)
 
 def add_event_ent(matcher, doc, i, matches):

@@ -322,7 +322,7 @@ def add_event_ent(matcher, doc, i, matches):
 
 pattern = [{"ORTH": "Google"}, {"ORTH": "I"}, {"ORTH": "/"}, {"ORTH": "O"}]
 matcher.add("GoogleIO", add_event_ent, pattern)
-doc = nlp(u"This is a text about Google I/O.")
+doc = nlp(u"This is a text about Google I/O")
 matches = matcher(doc)
 ```
 
@@ -106,7 +106,12 @@
 { "code": "hi", "name": "Hindi", "example": "यह एक वाक्य है।", "has_examples": true },
 { "code": "kn", "name": "Kannada" },
 { "code": "ta", "name": "Tamil", "has_examples": true },
-{ "code": "id", "name": "Indonesian", "has_examples": true },
+{
+    "code": "id",
+    "name": "Indonesian",
+    "example": "Ini adalah sebuah kalimat.",
+    "has_examples": true
+},
 { "code": "tl", "name": "Tagalog" },
 { "code": "af", "name": "Afrikaans" },
 { "code": "bg", "name": "Bulgarian" },

@@ -116,7 +121,12 @@
 { "code": "lv", "name": "Latvian" },
 { "code": "sk", "name": "Slovak" },
 { "code": "sl", "name": "Slovenian" },
-{ "code": "sq", "name": "Albanian" },
+{
+    "code": "sq",
+    "name": "Albanian",
+    "example": "Kjo është një fjali.",
+    "has_examples": true
+},
 { "code": "et", "name": "Estonian" },
 {
     "code": "th",