mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-24 08:14:15 +03:00
Merge pull request #6444 from adrianeboyd/chore/update-develop-from-master
commit
f2571b5ec4
108
.github/contributors/KKsharma99.md
vendored
Normal file
|
@@ -0,0 +1,108 @@
|
||||||
|
<!-- This agreement was mistakenly submitted as an update to the CONTRIBUTOR_AGREEMENT.md template. Commit: 8a2d22222dec5cf910df5a378cbcd9ea2ab53ec4. It was therefore moved over manually. -->
|
||||||
|
|
||||||
|
# spaCy contributor agreement
|
||||||
|
|
||||||
|
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||||
|
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||||
|
The SCA applies to any contribution that you make to any product or project
|
||||||
|
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||||
|
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||||
|
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||||
|
**"you"** shall mean the person or entity identified below.
|
||||||
|
|
||||||
|
If you agree to be bound by these terms, fill in the information requested
|
||||||
|
below and include the filled-in version with your first pull request, under the
|
||||||
|
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||||
|
should be your GitHub username, with the extension `.md`. For example, the user
|
||||||
|
example_user would create the file `.github/contributors/example_user.md`.
|
||||||
|
|
||||||
|
Read this agreement carefully before signing. These terms and conditions
|
||||||
|
constitute a binding legal agreement.
|
||||||
|
|
||||||
|
## Contributor Agreement
|
||||||
|
|
||||||
|
1. The term "contribution" or "contributed materials" means any source code,
|
||||||
|
object code, patch, tool, sample, graphic, specification, manual,
|
||||||
|
documentation, or any other material posted or submitted by you to the project.
|
||||||
|
|
||||||
|
2. With respect to any worldwide copyrights, or copyright applications and
|
||||||
|
registrations, in your contribution:
|
||||||
|
|
||||||
|
* you hereby assign to us joint ownership, and to the extent that such
|
||||||
|
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||||
|
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||||
|
royalty-free, unrestricted license to exercise all rights under those
|
||||||
|
copyrights. This includes, at our option, the right to sublicense these same
|
||||||
|
rights to third parties through multiple levels of sublicensees or other
|
||||||
|
licensing arrangements;
|
||||||
|
|
||||||
|
* you agree that each of us can do all things in relation to your
|
||||||
|
contribution as if each of us were the sole owners, and if one of us makes
|
||||||
|
a derivative work of your contribution, the one who makes the derivative
|
||||||
|
work (or has it made) will be the sole owner of that derivative work;
|
||||||
|
|
||||||
|
* you agree that you will not assert any moral rights in your contribution
|
||||||
|
against us, our licensees or transferees;
|
||||||
|
|
||||||
|
* you agree that we may register a copyright in your contribution and
|
||||||
|
exercise all ownership rights associated with it; and
|
||||||
|
|
||||||
|
* you agree that neither of us has any duty to consult with, obtain the
|
||||||
|
consent of, pay or render an accounting to the other for any use or
|
||||||
|
distribution of your contribution.
|
||||||
|
|
||||||
|
3. With respect to any patents you own, or that you can license without payment
|
||||||
|
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||||
|
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||||
|
|
||||||
|
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||||
|
your contribution in whole or in part, alone or in combination with or
|
||||||
|
included in any product, work or materials arising out of the project to
|
||||||
|
which your contribution was submitted, and
|
||||||
|
|
||||||
|
* at our option, to sublicense these same rights to third parties through
|
||||||
|
multiple levels of sublicensees or other licensing arrangements.
|
||||||
|
|
||||||
|
4. Except as set out above, you keep all right, title, and interest in your
|
||||||
|
contribution. The rights that you grant to us under these terms are effective
|
||||||
|
on the date you first submitted a contribution to us, even if your submission
|
||||||
|
took place before the date you sign these terms.
|
||||||
|
|
||||||
|
5. You covenant, represent, warrant and agree that:
|
||||||
|
|
||||||
|
* Each contribution that you submit is and shall be an original work of
|
||||||
|
authorship and you can legally grant the rights set out in this SCA;
|
||||||
|
|
||||||
|
* to the best of your knowledge, each contribution will not violate any
|
||||||
|
third party's copyrights, trademarks, patents, or other intellectual
|
||||||
|
property rights; and
|
||||||
|
|
||||||
|
* each contribution shall be in compliance with U.S. export control laws and
|
||||||
|
other applicable export and import laws. You agree to notify us if you
|
||||||
|
become aware of any circumstance which would make any of the foregoing
|
||||||
|
representations inaccurate in any respect. We may publicly disclose your
|
||||||
|
participation in the project, including the fact that you have signed the SCA.
|
||||||
|
|
||||||
|
6. This SCA is governed by the laws of the State of California and applicable
|
||||||
|
U.S. Federal law. Any choice of law rules will not apply.
|
||||||
|
|
||||||
|
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||||
|
mark both statements:
|
||||||
|
|
||||||
|
* [x] I am signing on behalf of myself as an individual and no other person
|
||||||
|
or entity, including my employer, has or will have rights with respect to my
|
||||||
|
contributions.
|
||||||
|
|
||||||
|
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||||
|
actual authority to contractually bind that entity.
|
||||||
|
|
||||||
|
## Contributor Details
|
||||||
|
|
||||||
|
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Kunal Sharma |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 10/19/2020 |
| GitHub username | KKsharma99 |
| Website (optional) | |
|
106
.github/contributors/borijang.md
vendored
Normal file
|
@@ -0,0 +1,106 @@
|
||||||
|
# spaCy contributor agreement
|
||||||
|
|
||||||
|
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||||
|
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||||
|
The SCA applies to any contribution that you make to any product or project
|
||||||
|
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||||
|
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||||
|
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||||
|
**"you"** shall mean the person or entity identified below.
|
||||||
|
|
||||||
|
If you agree to be bound by these terms, fill in the information requested
|
||||||
|
below and include the filled-in version with your first pull request, under the
|
||||||
|
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||||
|
should be your GitHub username, with the extension `.md`. For example, the user
|
||||||
|
example_user would create the file `.github/contributors/example_user.md`.
|
||||||
|
|
||||||
|
Read this agreement carefully before signing. These terms and conditions
|
||||||
|
constitute a binding legal agreement.
|
||||||
|
|
||||||
|
## Contributor Agreement
|
||||||
|
|
||||||
|
1. The term "contribution" or "contributed materials" means any source code,
|
||||||
|
object code, patch, tool, sample, graphic, specification, manual,
|
||||||
|
documentation, or any other material posted or submitted by you to the project.
|
||||||
|
|
||||||
|
2. With respect to any worldwide copyrights, or copyright applications and
|
||||||
|
registrations, in your contribution:
|
||||||
|
|
||||||
|
* you hereby assign to us joint ownership, and to the extent that such
|
||||||
|
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||||
|
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||||
|
royalty-free, unrestricted license to exercise all rights under those
|
||||||
|
copyrights. This includes, at our option, the right to sublicense these same
|
||||||
|
rights to third parties through multiple levels of sublicensees or other
|
||||||
|
licensing arrangements;
|
||||||
|
|
||||||
|
* you agree that each of us can do all things in relation to your
|
||||||
|
contribution as if each of us were the sole owners, and if one of us makes
|
||||||
|
a derivative work of your contribution, the one who makes the derivative
|
||||||
|
work (or has it made) will be the sole owner of that derivative work;
|
||||||
|
|
||||||
|
* you agree that you will not assert any moral rights in your contribution
|
||||||
|
against us, our licensees or transferees;
|
||||||
|
|
||||||
|
* you agree that we may register a copyright in your contribution and
|
||||||
|
exercise all ownership rights associated with it; and
|
||||||
|
|
||||||
|
* you agree that neither of us has any duty to consult with, obtain the
|
||||||
|
consent of, pay or render an accounting to the other for any use or
|
||||||
|
distribution of your contribution.
|
||||||
|
|
||||||
|
3. With respect to any patents you own, or that you can license without payment
|
||||||
|
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||||
|
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||||
|
|
||||||
|
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||||
|
your contribution in whole or in part, alone or in combination with or
|
||||||
|
included in any product, work or materials arising out of the project to
|
||||||
|
which your contribution was submitted, and
|
||||||
|
|
||||||
|
* at our option, to sublicense these same rights to third parties through
|
||||||
|
multiple levels of sublicensees or other licensing arrangements.
|
||||||
|
|
||||||
|
4. Except as set out above, you keep all right, title, and interest in your
|
||||||
|
contribution. The rights that you grant to us under these terms are effective
|
||||||
|
on the date you first submitted a contribution to us, even if your submission
|
||||||
|
took place before the date you sign these terms.
|
||||||
|
|
||||||
|
5. You covenant, represent, warrant and agree that:
|
||||||
|
|
||||||
|
* Each contribution that you submit is and shall be an original work of
|
||||||
|
authorship and you can legally grant the rights set out in this SCA;
|
||||||
|
|
||||||
|
* to the best of your knowledge, each contribution will not violate any
|
||||||
|
third party's copyrights, trademarks, patents, or other intellectual
|
||||||
|
property rights; and
|
||||||
|
|
||||||
|
* each contribution shall be in compliance with U.S. export control laws and
|
||||||
|
other applicable export and import laws. You agree to notify us if you
|
||||||
|
become aware of any circumstance which would make any of the foregoing
|
||||||
|
representations inaccurate in any respect. We may publicly disclose your
|
||||||
|
participation in the project, including the fact that you have signed the SCA.
|
||||||
|
|
||||||
|
6. This SCA is governed by the laws of the State of California and applicable
|
||||||
|
U.S. Federal law. Any choice of law rules will not apply.
|
||||||
|
|
||||||
|
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||||
|
mark both statements:
|
||||||
|
|
||||||
|
* [ ] I am signing on behalf of myself as an individual and no other person
|
||||||
|
or entity, including my employer, has or will have rights with respect to my
|
||||||
|
contributions.
|
||||||
|
|
||||||
|
* [x] I am signing on behalf of my employer or a legal entity and I have the
|
||||||
|
actual authority to contractually bind that entity.
|
||||||
|
|
||||||
|
## Contributor Details
|
||||||
|
|
||||||
|
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Borijan Georgievski |
| Company name (if applicable) | Netcetera |
| Title or role (if applicable) | Data Scientist |
| Date | 2020.10.09 |
| GitHub username | borijang |
| Website (optional) | |
|
106
.github/contributors/danielvasic.md
vendored
Normal file
|
@@ -0,0 +1,106 @@
|
||||||
|
# spaCy contributor agreement
|
||||||
|
|
||||||
|
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||||
|
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||||
|
The SCA applies to any contribution that you make to any product or project
|
||||||
|
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||||
|
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||||
|
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||||
|
**"you"** shall mean the person or entity identified below.
|
||||||
|
|
||||||
|
If you agree to be bound by these terms, fill in the information requested
|
||||||
|
below and include the filled-in version with your first pull request, under the
|
||||||
|
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||||
|
should be your GitHub username, with the extension `.md`. For example, the user
|
||||||
|
example_user would create the file `.github/contributors/example_user.md`.
|
||||||
|
|
||||||
|
Read this agreement carefully before signing. These terms and conditions
|
||||||
|
constitute a binding legal agreement.
|
||||||
|
|
||||||
|
## Contributor Agreement
|
||||||
|
|
||||||
|
1. The term "contribution" or "contributed materials" means any source code,
|
||||||
|
object code, patch, tool, sample, graphic, specification, manual,
|
||||||
|
documentation, or any other material posted or submitted by you to the project.
|
||||||
|
|
||||||
|
2. With respect to any worldwide copyrights, or copyright applications and
|
||||||
|
registrations, in your contribution:
|
||||||
|
|
||||||
|
* you hereby assign to us joint ownership, and to the extent that such
|
||||||
|
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||||
|
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||||
|
royalty-free, unrestricted license to exercise all rights under those
|
||||||
|
copyrights. This includes, at our option, the right to sublicense these same
|
||||||
|
rights to third parties through multiple levels of sublicensees or other
|
||||||
|
licensing arrangements;
|
||||||
|
|
||||||
|
* you agree that each of us can do all things in relation to your
|
||||||
|
contribution as if each of us were the sole owners, and if one of us makes
|
||||||
|
a derivative work of your contribution, the one who makes the derivative
|
||||||
|
work (or has it made) will be the sole owner of that derivative work;
|
||||||
|
|
||||||
|
* you agree that you will not assert any moral rights in your contribution
|
||||||
|
against us, our licensees or transferees;
|
||||||
|
|
||||||
|
* you agree that we may register a copyright in your contribution and
|
||||||
|
exercise all ownership rights associated with it; and
|
||||||
|
|
||||||
|
* you agree that neither of us has any duty to consult with, obtain the
|
||||||
|
consent of, pay or render an accounting to the other for any use or
|
||||||
|
distribution of your contribution.
|
||||||
|
|
||||||
|
3. With respect to any patents you own, or that you can license without payment
|
||||||
|
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||||
|
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||||
|
|
||||||
|
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||||
|
your contribution in whole or in part, alone or in combination with or
|
||||||
|
included in any product, work or materials arising out of the project to
|
||||||
|
which your contribution was submitted, and
|
||||||
|
|
||||||
|
* at our option, to sublicense these same rights to third parties through
|
||||||
|
multiple levels of sublicensees or other licensing arrangements.
|
||||||
|
|
||||||
|
4. Except as set out above, you keep all right, title, and interest in your
|
||||||
|
contribution. The rights that you grant to us under these terms are effective
|
||||||
|
on the date you first submitted a contribution to us, even if your submission
|
||||||
|
took place before the date you sign these terms.
|
||||||
|
|
||||||
|
5. You covenant, represent, warrant and agree that:
|
||||||
|
|
||||||
|
* Each contribution that you submit is and shall be an original work of
|
||||||
|
authorship and you can legally grant the rights set out in this SCA;
|
||||||
|
|
||||||
|
* to the best of your knowledge, each contribution will not violate any
|
||||||
|
third party's copyrights, trademarks, patents, or other intellectual
|
||||||
|
property rights; and
|
||||||
|
|
||||||
|
* each contribution shall be in compliance with U.S. export control laws and
|
||||||
|
other applicable export and import laws. You agree to notify us if you
|
||||||
|
become aware of any circumstance which would make any of the foregoing
|
||||||
|
representations inaccurate in any respect. We may publicly disclose your
|
||||||
|
participation in the project, including the fact that you have signed the SCA.
|
||||||
|
|
||||||
|
6. This SCA is governed by the laws of the State of California and applicable
|
||||||
|
U.S. Federal law. Any choice of law rules will not apply.
|
||||||
|
|
||||||
|
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||||
|
mark both statements:
|
||||||
|
|
||||||
|
* [x] I am signing on behalf of myself as an individual and no other person
|
||||||
|
or entity, including my employer, has or will have rights with respect to my
|
||||||
|
contributions.
|
||||||
|
|
||||||
|
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||||
|
actual authority to contractually bind that entity.
|
||||||
|
|
||||||
|
## Contributor Details
|
||||||
|
|
||||||
|
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Daniel Vasić |
| Company name (if applicable) | University of Mostar |
| Title or role (if applicable) | Teaching assistant |
| Date | 13/10/2020 |
| GitHub username | danielvasic |
| Website (optional) | |
|
106
.github/contributors/forest1988.md
vendored
Normal file
|
@@ -0,0 +1,106 @@
|
||||||
|
# spaCy contributor agreement
|
||||||
|
|
||||||
|
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||||
|
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||||
|
The SCA applies to any contribution that you make to any product or project
|
||||||
|
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||||
|
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||||
|
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||||
|
**"you"** shall mean the person or entity identified below.
|
||||||
|
|
||||||
|
If you agree to be bound by these terms, fill in the information requested
|
||||||
|
below and include the filled-in version with your first pull request, under the
|
||||||
|
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||||
|
should be your GitHub username, with the extension `.md`. For example, the user
|
||||||
|
example_user would create the file `.github/contributors/example_user.md`.
|
||||||
|
|
||||||
|
Read this agreement carefully before signing. These terms and conditions
|
||||||
|
constitute a binding legal agreement.
|
||||||
|
|
||||||
|
## Contributor Agreement
|
||||||
|
|
||||||
|
1. The term "contribution" or "contributed materials" means any source code,
|
||||||
|
object code, patch, tool, sample, graphic, specification, manual,
|
||||||
|
documentation, or any other material posted or submitted by you to the project.
|
||||||
|
|
||||||
|
2. With respect to any worldwide copyrights, or copyright applications and
|
||||||
|
registrations, in your contribution:
|
||||||
|
|
||||||
|
* you hereby assign to us joint ownership, and to the extent that such
|
||||||
|
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||||
|
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||||
|
royalty-free, unrestricted license to exercise all rights under those
|
||||||
|
copyrights. This includes, at our option, the right to sublicense these same
|
||||||
|
rights to third parties through multiple levels of sublicensees or other
|
||||||
|
licensing arrangements;
|
||||||
|
|
||||||
|
* you agree that each of us can do all things in relation to your
|
||||||
|
contribution as if each of us were the sole owners, and if one of us makes
|
||||||
|
a derivative work of your contribution, the one who makes the derivative
|
||||||
|
work (or has it made) will be the sole owner of that derivative work;
|
||||||
|
|
||||||
|
* you agree that you will not assert any moral rights in your contribution
|
||||||
|
against us, our licensees or transferees;
|
||||||
|
|
||||||
|
* you agree that we may register a copyright in your contribution and
|
||||||
|
exercise all ownership rights associated with it; and
|
||||||
|
|
||||||
|
* you agree that neither of us has any duty to consult with, obtain the
|
||||||
|
consent of, pay or render an accounting to the other for any use or
|
||||||
|
distribution of your contribution.
|
||||||
|
|
||||||
|
3. With respect to any patents you own, or that you can license without payment
|
||||||
|
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||||
|
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||||
|
|
||||||
|
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||||
|
your contribution in whole or in part, alone or in combination with or
|
||||||
|
included in any product, work or materials arising out of the project to
|
||||||
|
which your contribution was submitted, and
|
||||||
|
|
||||||
|
* at our option, to sublicense these same rights to third parties through
|
||||||
|
multiple levels of sublicensees or other licensing arrangements.
|
||||||
|
|
||||||
|
4. Except as set out above, you keep all right, title, and interest in your
|
||||||
|
contribution. The rights that you grant to us under these terms are effective
|
||||||
|
on the date you first submitted a contribution to us, even if your submission
|
||||||
|
took place before the date you sign these terms.
|
||||||
|
|
||||||
|
5. You covenant, represent, warrant and agree that:
|
||||||
|
|
||||||
|
* Each contribution that you submit is and shall be an original work of
|
||||||
|
authorship and you can legally grant the rights set out in this SCA;
|
||||||
|
|
||||||
|
* to the best of your knowledge, each contribution will not violate any
|
||||||
|
third party's copyrights, trademarks, patents, or other intellectual
|
||||||
|
property rights; and
|
||||||
|
|
||||||
|
* each contribution shall be in compliance with U.S. export control laws and
|
||||||
|
other applicable export and import laws. You agree to notify us if you
|
||||||
|
become aware of any circumstance which would make any of the foregoing
|
||||||
|
representations inaccurate in any respect. We may publicly disclose your
|
||||||
|
participation in the project, including the fact that you have signed the SCA.
|
||||||
|
|
||||||
|
6. This SCA is governed by the laws of the State of California and applicable
|
||||||
|
U.S. Federal law. Any choice of law rules will not apply.
|
||||||
|
|
||||||
|
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||||
|
mark both statements:
|
||||||
|
|
||||||
|
* [x] I am signing on behalf of myself as an individual and no other person
|
||||||
|
or entity, including my employer, has or will have rights with respect to my
|
||||||
|
contributions.
|
||||||
|
|
||||||
|
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||||
|
actual authority to contractually bind that entity.
|
||||||
|
|
||||||
|
## Contributor Details
|
||||||
|
|
||||||
|
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Yusuke Mori |
| Company name (if applicable) | |
| Title or role (if applicable) | Ph.D. student |
| Date | 2020/11/22 |
| GitHub username | forest1988 |
| Website (optional) | https://forest1988.github.io |
|
106
.github/contributors/jabortell.md
vendored
Normal file
|
@@ -0,0 +1,106 @@
|
||||||
|
# spaCy contributor agreement
|
||||||
|
|
||||||
|
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||||
|
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||||
|
The SCA applies to any contribution that you make to any product or project
|
||||||
|
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||||
|
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||||
|
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||||
|
**"you"** shall mean the person or entity identified below.
|
||||||
|
|
||||||
|
If you agree to be bound by these terms, fill in the information requested
|
||||||
|
below and include the filled-in version with your first pull request, under the
|
||||||
|
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||||
|
should be your GitHub username, with the extension `.md`. For example, the user
|
||||||
|
example_user would create the file `.github/contributors/example_user.md`.
|
||||||
|
|
||||||
|
Read this agreement carefully before signing. These terms and conditions
|
||||||
|
constitute a binding legal agreement.
|
||||||
|
|
||||||
|
## Contributor Agreement
|
||||||
|
|
||||||
|
1. The term "contribution" or "contributed materials" means any source code,
|
||||||
|
object code, patch, tool, sample, graphic, specification, manual,
|
||||||
|
documentation, or any other material posted or submitted by you to the project.
|
||||||
|
|
||||||
|
2. With respect to any worldwide copyrights, or copyright applications and
|
||||||
|
registrations, in your contribution:
|
||||||
|
|
||||||
|
* you hereby assign to us joint ownership, and to the extent that such
|
||||||
|
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||||
|
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||||
|
royalty-free, unrestricted license to exercise all rights under those
|
||||||
|
copyrights. This includes, at our option, the right to sublicense these same
|
||||||
|
rights to third parties through multiple levels of sublicensees or other
|
||||||
|
licensing arrangements;
|
||||||
|
|
||||||
|
* you agree that each of us can do all things in relation to your
|
||||||
|
contribution as if each of us were the sole owners, and if one of us makes
|
||||||
|
a derivative work of your contribution, the one who makes the derivative
|
||||||
|
work (or has it made) will be the sole owner of that derivative work;
|
||||||
|
|
||||||
|
* you agree that you will not assert any moral rights in your contribution
|
||||||
|
against us, our licensees or transferees;
|
||||||
|
|
||||||
|
* you agree that we may register a copyright in your contribution and
|
||||||
|
exercise all ownership rights associated with it; and
|
||||||
|
|
||||||
|
* you agree that neither of us has any duty to consult with, obtain the
|
||||||
|
consent of, pay or render an accounting to the other for any use or
|
||||||
|
distribution of your contribution.
|
||||||
|
|
||||||
|
3. With respect to any patents you own, or that you can license without payment
|
||||||
|
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||||
|
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||||
|
|
||||||
|
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||||
|
your contribution in whole or in part, alone or in combination with or
|
||||||
|
included in any product, work or materials arising out of the project to
|
||||||
|
which your contribution was submitted, and
|
||||||
|
|
||||||
|
* at our option, to sublicense these same rights to third parties through
|
||||||
|
multiple levels of sublicensees or other licensing arrangements.
|
||||||
|
|
||||||
|
4. Except as set out above, you keep all right, title, and interest in your
|
||||||
|
contribution. The rights that you grant to us under these terms are effective
|
||||||
|
on the date you first submitted a contribution to us, even if your submission
|
||||||
|
took place before the date you sign these terms.
|
||||||
|
|
||||||
|
5. You covenant, represent, warrant and agree that:
|
||||||
|
|
||||||
|
* Each contribution that you submit is and shall be an original work of
|
||||||
|
authorship and you can legally grant the rights set out in this SCA;
|
||||||
|
|
||||||
|
* to the best of your knowledge, each contribution will not violate any
|
||||||
|
third party's copyrights, trademarks, patents, or other intellectual
|
||||||
|
property rights; and
|
||||||
|
|
||||||
|
* each contribution shall be in compliance with U.S. export control laws and
|
||||||
|
other applicable export and import laws. You agree to notify us if you
|
||||||
|
become aware of any circumstance which would make any of the foregoing
|
||||||
|
representations inaccurate in any respect. We may publicly disclose your
|
||||||
|
participation in the project, including the fact that you have signed the SCA.
|
||||||
|
|
||||||
|
6. This SCA is governed by the laws of the State of California and applicable
|
||||||
|
U.S. Federal law. Any choice of law rules will not apply.
|
||||||
|
|
||||||
|
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||||
|
mark both statements:
|
||||||
|
|
||||||
|
* [x] I am signing on behalf of myself as an individual and no other person
|
||||||
|
or entity, including my employer, has or will have rights with respect to my
|
||||||
|
contributions.
|
||||||
|
|
||||||
|
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||||
|
actual authority to contractually bind that entity.
|
||||||
|
|
||||||
|
## Contributor Details
|
||||||
|
|
||||||
|
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jacob Bortell |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-11-20 |
| GitHub username | jabortell |
| Website (optional) | |
|
106
.github/contributors/revuel.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
||||||
|
# spaCy contributor agreement
|
||||||
|
|
||||||
|
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||||
|
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||||
|
The SCA applies to any contribution that you make to any product or project
|
||||||
|
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||||
|
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||||
|
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||||
|
**"you"** shall mean the person or entity identified below.
|
||||||
|
|
||||||
|
If you agree to be bound by these terms, fill in the information requested
|
||||||
|
below and include the filled-in version with your first pull request, under the
|
||||||
|
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||||
|
should be your GitHub username, with the extension `.md`. For example, the user
|
||||||
|
example_user would create the file `.github/contributors/example_user.md`.
|
||||||
|
|
||||||
|
Read this agreement carefully before signing. These terms and conditions
|
||||||
|
constitute a binding legal agreement.
|
||||||
|
|
||||||
|
## Contributor Agreement
|
||||||
|
|
||||||
|
1. The term "contribution" or "contributed materials" means any source code,
|
||||||
|
object code, patch, tool, sample, graphic, specification, manual,
|
||||||
|
documentation, or any other material posted or submitted by you to the project.
|
||||||
|
|
||||||
|
2. With respect to any worldwide copyrights, or copyright applications and
|
||||||
|
registrations, in your contribution:
|
||||||
|
|
||||||
|
* you hereby assign to us joint ownership, and to the extent that such
|
||||||
|
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||||
|
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||||
|
royalty-free, unrestricted license to exercise all rights under those
|
||||||
|
copyrights. This includes, at our option, the right to sublicense these same
|
||||||
|
rights to third parties through multiple levels of sublicensees or other
|
||||||
|
licensing arrangements;
|
||||||
|
|
||||||
|
* you agree that each of us can do all things in relation to your
|
||||||
|
contribution as if each of us were the sole owners, and if one of us makes
|
||||||
|
a derivative work of your contribution, the one who makes the derivative
|
||||||
|
work (or has it made) will be the sole owner of that derivative work;
|
||||||
|
|
||||||
|
* you agree that you will not assert any moral rights in your contribution
|
||||||
|
against us, our licensees or transferees;
|
||||||
|
|
||||||
|
* you agree that we may register a copyright in your contribution and
|
||||||
|
exercise all ownership rights associated with it; and
|
||||||
|
|
||||||
|
* you agree that neither of us has any duty to consult with, obtain the
|
||||||
|
consent of, pay or render an accounting to the other for any use or
|
||||||
|
distribution of your contribution.
|
||||||
|
|
||||||
|
3. With respect to any patents you own, or that you can license without payment
|
||||||
|
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||||
|
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||||
|
|
||||||
|
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||||
|
your contribution in whole or in part, alone or in combination with or
|
||||||
|
included in any product, work or materials arising out of the project to
|
||||||
|
which your contribution was submitted, and
|
||||||
|
|
||||||
|
* at our option, to sublicense these same rights to third parties through
|
||||||
|
multiple levels of sublicensees or other licensing arrangements.
|
||||||
|
|
||||||
|
4. Except as set out above, you keep all right, title, and interest in your
|
||||||
|
contribution. The rights that you grant to us under these terms are effective
|
||||||
|
on the date you first submitted a contribution to us, even if your submission
|
||||||
|
took place before the date you sign these terms.
|
||||||
|
|
||||||
|
5. You covenant, represent, warrant and agree that:
|
||||||
|
|
||||||
|
* Each contribution that you submit is and shall be an original work of
|
||||||
|
authorship and you can legally grant the rights set out in this SCA;
|
||||||
|
|
||||||
|
* to the best of your knowledge, each contribution will not violate any
|
||||||
|
third party's copyrights, trademarks, patents, or other intellectual
|
||||||
|
property rights; and
|
||||||
|
|
||||||
|
* each contribution shall be in compliance with U.S. export control laws and
|
||||||
|
other applicable export and import laws. You agree to notify us if you
|
||||||
|
become aware of any circumstance which would make any of the foregoing
|
||||||
|
representations inaccurate in any respect. We may publicly disclose your
|
||||||
|
participation in the project, including the fact that you have signed the SCA.
|
||||||
|
|
||||||
|
6. This SCA is governed by the laws of the State of California and applicable
|
||||||
|
U.S. Federal law. Any choice of law rules will not apply.
|
||||||
|
|
||||||
|
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||||
|
mark both statements:
|
||||||
|
|
||||||
|
* [x] I am signing on behalf of myself as an individual and no other person
|
||||||
|
or entity, including my employer, has or will have rights with respect to my
|
||||||
|
contributions.
|
||||||
|
|
||||||
|
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||||
|
actual authority to contractually bind that entity.
|
||||||
|
|
||||||
|
## Contributor Details
|
||||||
|
|
||||||
|
| Field | Entry |
|
||||||
|
|------------------------------- | -------------------- |
|
||||||
|
| Name | Miguel Revuelta |
|
||||||
|
| Company name (if applicable) | |
|
||||||
|
| Title or role (if applicable) | |
|
||||||
|
| Date | 2020-11-17 |
|
||||||
|
| GitHub username | revuel |
|
||||||
|
| Website (optional) | |
|
106
.github/contributors/robertsipek.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
||||||
|
# spaCy contributor agreement
|
||||||
|
|
||||||
|
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||||
|
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||||
|
The SCA applies to any contribution that you make to any product or project
|
||||||
|
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||||
|
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||||
|
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||||
|
**"you"** shall mean the person or entity identified below.
|
||||||
|
|
||||||
|
If you agree to be bound by these terms, fill in the information requested
|
||||||
|
below and include the filled-in version with your first pull request, under the
|
||||||
|
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||||
|
should be your GitHub username, with the extension `.md`. For example, the user
|
||||||
|
example_user would create the file `.github/contributors/example_user.md`.
|
||||||
|
|
||||||
|
Read this agreement carefully before signing. These terms and conditions
|
||||||
|
constitute a binding legal agreement.
|
||||||
|
|
||||||
|
## Contributor Agreement
|
||||||
|
|
||||||
|
1. The term "contribution" or "contributed materials" means any source code,
|
||||||
|
object code, patch, tool, sample, graphic, specification, manual,
|
||||||
|
documentation, or any other material posted or submitted by you to the project.
|
||||||
|
|
||||||
|
2. With respect to any worldwide copyrights, or copyright applications and
|
||||||
|
registrations, in your contribution:
|
||||||
|
|
||||||
|
* you hereby assign to us joint ownership, and to the extent that such
|
||||||
|
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||||
|
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||||
|
royalty-free, unrestricted license to exercise all rights under those
|
||||||
|
copyrights. This includes, at our option, the right to sublicense these same
|
||||||
|
rights to third parties through multiple levels of sublicensees or other
|
||||||
|
licensing arrangements;
|
||||||
|
|
||||||
|
* you agree that each of us can do all things in relation to your
|
||||||
|
contribution as if each of us were the sole owners, and if one of us makes
|
||||||
|
a derivative work of your contribution, the one who makes the derivative
|
||||||
|
work (or has it made) will be the sole owner of that derivative work;
|
||||||
|
|
||||||
|
* you agree that you will not assert any moral rights in your contribution
|
||||||
|
against us, our licensees or transferees;
|
||||||
|
|
||||||
|
* you agree that we may register a copyright in your contribution and
|
||||||
|
exercise all ownership rights associated with it; and
|
||||||
|
|
||||||
|
* you agree that neither of us has any duty to consult with, obtain the
|
||||||
|
consent of, pay or render an accounting to the other for any use or
|
||||||
|
distribution of your contribution.
|
||||||
|
|
||||||
|
3. With respect to any patents you own, or that you can license without payment
|
||||||
|
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||||
|
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||||
|
|
||||||
|
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||||
|
your contribution in whole or in part, alone or in combination with or
|
||||||
|
included in any product, work or materials arising out of the project to
|
||||||
|
which your contribution was submitted, and
|
||||||
|
|
||||||
|
* at our option, to sublicense these same rights to third parties through
|
||||||
|
multiple levels of sublicensees or other licensing arrangements.
|
||||||
|
|
||||||
|
4. Except as set out above, you keep all right, title, and interest in your
|
||||||
|
contribution. The rights that you grant to us under these terms are effective
|
||||||
|
on the date you first submitted a contribution to us, even if your submission
|
||||||
|
took place before the date you sign these terms.
|
||||||
|
|
||||||
|
5. You covenant, represent, warrant and agree that:
|
||||||
|
|
||||||
|
* Each contribution that you submit is and shall be an original work of
|
||||||
|
authorship and you can legally grant the rights set out in this SCA;
|
||||||
|
|
||||||
|
* to the best of your knowledge, each contribution will not violate any
|
||||||
|
third party's copyrights, trademarks, patents, or other intellectual
|
||||||
|
property rights; and
|
||||||
|
|
||||||
|
* each contribution shall be in compliance with U.S. export control laws and
|
||||||
|
other applicable export and import laws. You agree to notify us if you
|
||||||
|
become aware of any circumstance which would make any of the foregoing
|
||||||
|
representations inaccurate in any respect. We may publicly disclose your
|
||||||
|
participation in the project, including the fact that you have signed the SCA.
|
||||||
|
|
||||||
|
6. This SCA is governed by the laws of the State of California and applicable
|
||||||
|
U.S. Federal law. Any choice of law rules will not apply.
|
||||||
|
|
||||||
|
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||||
|
mark both statements:
|
||||||
|
|
||||||
|
* [x] I am signing on behalf of myself as an individual and no other person
|
||||||
|
or entity, including my employer, has or will have rights with respect to my
|
||||||
|
contributions.
|
||||||
|
|
||||||
|
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||||
|
actual authority to contractually bind that entity.
|
||||||
|
|
||||||
|
## Contributor Details
|
||||||
|
|
||||||
|
| Field | Entry |
|
||||||
|
|------------------------------- | -------------------- |
|
||||||
|
| Name | Robert Šípek |
|
||||||
|
| Company name (if applicable) | |
|
||||||
|
| Title or role (if applicable) | |
|
||||||
|
| Date | 22.10.2020 |
|
||||||
|
| GitHub username | @robertsipek |
|
||||||
|
| Website (optional) | |
|
106
.github/contributors/vha14.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
||||||
|
# spaCy contributor agreement
|
||||||
|
|
||||||
|
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||||
|
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||||
|
The SCA applies to any contribution that you make to any product or project
|
||||||
|
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||||
|
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||||
|
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||||
|
**"you"** shall mean the person or entity identified below.
|
||||||
|
|
||||||
|
If you agree to be bound by these terms, fill in the information requested
|
||||||
|
below and include the filled-in version with your first pull request, under the
|
||||||
|
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||||
|
should be your GitHub username, with the extension `.md`. For example, the user
|
||||||
|
example_user would create the file `.github/contributors/example_user.md`.
|
||||||
|
|
||||||
|
Read this agreement carefully before signing. These terms and conditions
|
||||||
|
constitute a binding legal agreement.
|
||||||
|
|
||||||
|
## Contributor Agreement
|
||||||
|
|
||||||
|
1. The term "contribution" or "contributed materials" means any source code,
|
||||||
|
object code, patch, tool, sample, graphic, specification, manual,
|
||||||
|
documentation, or any other material posted or submitted by you to the project.
|
||||||
|
|
||||||
|
2. With respect to any worldwide copyrights, or copyright applications and
|
||||||
|
registrations, in your contribution:
|
||||||
|
|
||||||
|
* you hereby assign to us joint ownership, and to the extent that such
|
||||||
|
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||||
|
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||||
|
royalty-free, unrestricted license to exercise all rights under those
|
||||||
|
copyrights. This includes, at our option, the right to sublicense these same
|
||||||
|
rights to third parties through multiple levels of sublicensees or other
|
||||||
|
licensing arrangements;
|
||||||
|
|
||||||
|
* you agree that each of us can do all things in relation to your
|
||||||
|
contribution as if each of us were the sole owners, and if one of us makes
|
||||||
|
a derivative work of your contribution, the one who makes the derivative
|
||||||
|
work (or has it made) will be the sole owner of that derivative work;
|
||||||
|
|
||||||
|
* you agree that you will not assert any moral rights in your contribution
|
||||||
|
against us, our licensees or transferees;
|
||||||
|
|
||||||
|
* you agree that we may register a copyright in your contribution and
|
||||||
|
exercise all ownership rights associated with it; and
|
||||||
|
|
||||||
|
* you agree that neither of us has any duty to consult with, obtain the
|
||||||
|
consent of, pay or render an accounting to the other for any use or
|
||||||
|
distribution of your contribution.
|
||||||
|
|
||||||
|
3. With respect to any patents you own, or that you can license without payment
|
||||||
|
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||||
|
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||||
|
|
||||||
|
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||||
|
your contribution in whole or in part, alone or in combination with or
|
||||||
|
included in any product, work or materials arising out of the project to
|
||||||
|
which your contribution was submitted, and
|
||||||
|
|
||||||
|
* at our option, to sublicense these same rights to third parties through
|
||||||
|
multiple levels of sublicensees or other licensing arrangements.
|
||||||
|
|
||||||
|
4. Except as set out above, you keep all right, title, and interest in your
|
||||||
|
contribution. The rights that you grant to us under these terms are effective
|
||||||
|
on the date you first submitted a contribution to us, even if your submission
|
||||||
|
took place before the date you sign these terms.
|
||||||
|
|
||||||
|
5. You covenant, represent, warrant and agree that:
|
||||||
|
|
||||||
|
* Each contribution that you submit is and shall be an original work of
|
||||||
|
authorship and you can legally grant the rights set out in this SCA;
|
||||||
|
|
||||||
|
* to the best of your knowledge, each contribution will not violate any
|
||||||
|
third party's copyrights, trademarks, patents, or other intellectual
|
||||||
|
property rights; and
|
||||||
|
|
||||||
|
* each contribution shall be in compliance with U.S. export control laws and
|
||||||
|
other applicable export and import laws. You agree to notify us if you
|
||||||
|
become aware of any circumstance which would make any of the foregoing
|
||||||
|
representations inaccurate in any respect. We may publicly disclose your
|
||||||
|
participation in the project, including the fact that you have signed the SCA.
|
||||||
|
|
||||||
|
6. This SCA is governed by the laws of the State of California and applicable
|
||||||
|
U.S. Federal law. Any choice of law rules will not apply.
|
||||||
|
|
||||||
|
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||||
|
mark both statements:
|
||||||
|
|
||||||
|
* [x] I am signing on behalf of myself as an individual and no other person
|
||||||
|
or entity, including my employer, has or will have rights with respect to my
|
||||||
|
contributions.
|
||||||
|
|
||||||
|
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||||
|
actual authority to contractually bind that entity.
|
||||||
|
|
||||||
|
## Contributor Details
|
||||||
|
|
||||||
|
| Field | Entry |
|
||||||
|
|------------------------------- | -------------------- |
|
||||||
|
| Name | Vu Ha |
|
||||||
|
| Company name (if applicable) | |
|
||||||
|
| Title or role (if applicable) | |
|
||||||
|
| Date | 10-23-2020 |
|
||||||
|
| GitHub username | vha14 |
|
||||||
|
| Website (optional) | |
|
106
.github/contributors/walterhenry.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
||||||
|
# spaCy contributor agreement
|
||||||
|
|
||||||
|
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||||
|
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||||
|
The SCA applies to any contribution that you make to any product or project
|
||||||
|
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||||
|
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||||
|
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||||
|
**"you"** shall mean the person or entity identified below.
|
||||||
|
|
||||||
|
If you agree to be bound by these terms, fill in the information requested
|
||||||
|
below and include the filled-in version with your first pull request, under the
|
||||||
|
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||||
|
should be your GitHub username, with the extension `.md`. For example, the user
|
||||||
|
example_user would create the file `.github/contributors/example_user.md`.
|
||||||
|
|
||||||
|
Read this agreement carefully before signing. These terms and conditions
|
||||||
|
constitute a binding legal agreement.
|
||||||
|
|
||||||
|
## Contributor Agreement
|
||||||
|
|
||||||
|
1. The term "contribution" or "contributed materials" means any source code,
|
||||||
|
object code, patch, tool, sample, graphic, specification, manual,
|
||||||
|
documentation, or any other material posted or submitted by you to the project.
|
||||||
|
|
||||||
|
2. With respect to any worldwide copyrights, or copyright applications and
|
||||||
|
registrations, in your contribution:
|
||||||
|
|
||||||
|
* you hereby assign to us joint ownership, and to the extent that such
|
||||||
|
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||||
|
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||||
|
royalty-free, unrestricted license to exercise all rights under those
|
||||||
|
copyrights. This includes, at our option, the right to sublicense these same
|
||||||
|
rights to third parties through multiple levels of sublicensees or other
|
||||||
|
licensing arrangements;
|
||||||
|
|
||||||
|
* you agree that each of us can do all things in relation to your
|
||||||
|
contribution as if each of us were the sole owners, and if one of us makes
|
||||||
|
a derivative work of your contribution, the one who makes the derivative
|
||||||
|
work (or has it made) will be the sole owner of that derivative work;
|
||||||
|
|
||||||
|
* you agree that you will not assert any moral rights in your contribution
|
||||||
|
against us, our licensees or transferees;
|
||||||
|
|
||||||
|
* you agree that we may register a copyright in your contribution and
|
||||||
|
exercise all ownership rights associated with it; and
|
||||||
|
|
||||||
|
* you agree that neither of us has any duty to consult with, obtain the
|
||||||
|
consent of, pay or render an accounting to the other for any use or
|
||||||
|
distribution of your contribution.
|
||||||
|
|
||||||
|
3. With respect to any patents you own, or that you can license without payment
|
||||||
|
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||||
|
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||||
|
|
||||||
|
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||||
|
your contribution in whole or in part, alone or in combination with or
|
||||||
|
included in any product, work or materials arising out of the project to
|
||||||
|
which your contribution was submitted, and
|
||||||
|
|
||||||
|
* at our option, to sublicense these same rights to third parties through
|
||||||
|
multiple levels of sublicensees or other licensing arrangements.
|
||||||
|
|
||||||
|
4. Except as set out above, you keep all right, title, and interest in your
|
||||||
|
contribution. The rights that you grant to us under these terms are effective
|
||||||
|
on the date you first submitted a contribution to us, even if your submission
|
||||||
|
took place before the date you sign these terms.
|
||||||
|
|
||||||
|
5. You covenant, represent, warrant and agree that:
|
||||||
|
|
||||||
|
* Each contribution that you submit is and shall be an original work of
|
||||||
|
authorship and you can legally grant the rights set out in this SCA;
|
||||||
|
|
||||||
|
* to the best of your knowledge, each contribution will not violate any
|
||||||
|
third party's copyrights, trademarks, patents, or other intellectual
|
||||||
|
property rights; and
|
||||||
|
|
||||||
|
* each contribution shall be in compliance with U.S. export control laws and
|
||||||
|
other applicable export and import laws. You agree to notify us if you
|
||||||
|
become aware of any circumstance which would make any of the foregoing
|
||||||
|
representations inaccurate in any respect. We may publicly disclose your
|
||||||
|
participation in the project, including the fact that you have signed the SCA.
|
||||||
|
|
||||||
|
6. This SCA is governed by the laws of the State of California and applicable
|
||||||
|
U.S. Federal law. Any choice of law rules will not apply.
|
||||||
|
|
||||||
|
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||||
|
mark both statements:
|
||||||
|
|
||||||
|
* [x] I am signing on behalf of myself as an individual and no other person
|
||||||
|
or entity, including my employer, has or will have rights with respect to my
|
||||||
|
contributions.
|
||||||
|
|
||||||
|
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||||
|
actual authority to contractually bind that entity.
|
||||||
|
|
||||||
|
## Contributor Details
|
||||||
|
|
||||||
|
| Field | Entry |
|
||||||
|
|------------------------------- | -------------------- |
|
||||||
|
| Name | Walter Henry |
|
||||||
|
| Company name (if applicable) | ExplosionAI GmbH |
|
||||||
|
| Title or role (if applicable) | Executive Assistant |
|
||||||
|
| Date | September 14, 2020 |
|
||||||
|
| GitHub username | walterhenry |
|
||||||
|
| Website (optional) | |
|
|
@ -112,6 +112,14 @@ pip install -U pip setuptools wheel
|
||||||
pip install spacy
|
pip install spacy
|
||||||
```
|
```
|
||||||
|
|
||||||
|
For installation on Python 3.5 where binary wheels are not provided for the most
|
||||||
|
recent versions of the dependencies, you can prefer older binary wheels over
|
||||||
|
newer source packages with `--prefer-binary`:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pip install spacy --prefer-binary
|
||||||
|
```
|
||||||
|
|
||||||
To install additional data tables for lemmatization and normalization in
|
To install additional data tables for lemmatization and normalization in
|
||||||
**spaCy v2.2+** you can run `pip install spacy[lookups]` or install
|
**spaCy v2.2+** you can run `pip install spacy[lookups]` or install
|
||||||
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
|
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
|
||||||
|
|
|
@ -2,96 +2,113 @@ trigger:
|
||||||
batch: true
|
batch: true
|
||||||
branches:
|
branches:
|
||||||
include:
|
include:
|
||||||
- '*'
|
- "*"
|
||||||
exclude:
|
exclude:
|
||||||
- 'spacy.io'
|
- "spacy.io"
|
||||||
paths:
|
paths:
|
||||||
exclude:
|
exclude:
|
||||||
- 'website/*'
|
- "website/*"
|
||||||
- '*.md'
|
- "*.md"
|
||||||
pr:
|
pr:
|
||||||
paths:
|
paths:
|
||||||
exclude:
|
exclude:
|
||||||
- 'website/*'
|
- "website/*"
|
||||||
- '*.md'
|
- "*.md"
|
||||||
|
|
||||||
jobs:
|
jobs:
|
||||||
|
# Perform basic checks for most important errors (syntax etc.) Uses the config
|
||||||
|
# defined in .flake8 and overwrites the selected codes.
|
||||||
|
- job: "Validate"
|
||||||
|
pool:
|
||||||
|
vmImage: "ubuntu-16.04"
|
||||||
|
steps:
|
||||||
|
- task: UsePythonVersion@0
|
||||||
|
inputs:
|
||||||
|
versionSpec: "3.7"
|
||||||
|
- script: |
|
||||||
|
pip install flake8==3.5.0
|
||||||
|
python -m flake8 spacy --count --select=E901,E999,F821,F822,F823 --show-source --statistics
|
||||||
|
displayName: "flake8"
|
||||||
|
|
||||||
# Perform basic checks for most important errors (syntax etc.) Uses the config
|
- job: "Test"
|
||||||
# defined in .flake8 and overwrites the selected codes.
|
dependsOn: "Validate"
|
||||||
- job: 'Validate'
|
strategy:
|
||||||
pool:
|
matrix:
|
||||||
vmImage: 'ubuntu-16.04'
|
Python36Linux:
|
||||||
steps:
|
imageName: "ubuntu-16.04"
|
||||||
- task: UsePythonVersion@0
|
python.version: "3.6"
|
||||||
inputs:
|
Python36Windows:
|
||||||
versionSpec: '3.7'
|
imageName: "vs2017-win2016"
|
||||||
- script: |
|
python.version: "3.6"
|
||||||
pip install flake8==3.5.0
|
Python36Mac:
|
||||||
python -m flake8 spacy --count --select=E901,E999,F821,F822,F823 --show-source --statistics
|
imageName: "macos-10.14"
|
||||||
displayName: 'flake8'
|
python.version: "3.6"
|
||||||
|
# Don't test on 3.7 for now to speed up builds
|
||||||
|
# Python37Linux:
|
||||||
|
# imageName: 'ubuntu-16.04'
|
||||||
|
# python.version: '3.7'
|
||||||
|
# Python37Windows:
|
||||||
|
# imageName: 'vs2017-win2016'
|
||||||
|
# python.version: '3.7'
|
||||||
|
# Python37Mac:
|
||||||
|
# imageName: 'macos-10.14'
|
||||||
|
# python.version: '3.7'
|
||||||
|
Python38Linux:
|
||||||
|
imageName: "ubuntu-16.04"
|
||||||
|
python.version: "3.8"
|
||||||
|
Python38Windows:
|
||||||
|
imageName: "vs2017-win2016"
|
||||||
|
python.version: "3.8"
|
||||||
|
Python38Mac:
|
||||||
|
imageName: "macos-10.14"
|
||||||
|
python.version: "3.8"
|
||||||
|
# Python39Linux:
|
||||||
|
# imageName: "ubuntu-16.04"
|
||||||
|
# python.version: "3.9"
|
||||||
|
# Python39Windows:
|
||||||
|
# imageName: "vs2017-win2016"
|
||||||
|
# python.version: "3.9"
|
||||||
|
# Python39Mac:
|
||||||
|
# imageName: "macos-10.14"
|
||||||
|
# python.version: "3.9"
|
||||||
|
maxParallel: 4
|
||||||
|
pool:
|
||||||
|
vmImage: $(imageName)
|
||||||
|
|
||||||
- job: 'Test'
|
steps:
|
||||||
dependsOn: 'Validate'
|
- task: UsePythonVersion@0
|
||||||
strategy:
|
inputs:
|
||||||
matrix:
|
versionSpec: "$(python.version)"
|
||||||
Python36Linux:
|
architecture: "x64"
|
||||||
imageName: 'ubuntu-16.04'
|
|
||||||
python.version: '3.6'
|
|
||||||
Python36Windows:
|
|
||||||
imageName: 'vs2017-win2016'
|
|
||||||
python.version: '3.6'
|
|
||||||
Python36Mac:
|
|
||||||
imageName: 'macos-10.14'
|
|
||||||
python.version: '3.6'
|
|
||||||
# Don't test on 3.7 for now to speed up builds
|
|
||||||
# Python37Linux:
|
|
||||||
# imageName: 'ubuntu-16.04'
|
|
||||||
# python.version: '3.7'
|
|
||||||
# Python37Windows:
|
|
||||||
# imageName: 'vs2017-win2016'
|
|
||||||
# python.version: '3.7'
|
|
||||||
# Python37Mac:
|
|
||||||
# imageName: 'macos-10.14'
|
|
||||||
# python.version: '3.7'
|
|
||||||
Python38Linux:
|
|
||||||
imageName: 'ubuntu-16.04'
|
|
||||||
python.version: '3.8'
|
|
||||||
Python38Windows:
|
|
||||||
imageName: 'vs2017-win2016'
|
|
||||||
python.version: '3.8'
|
|
||||||
Python38Mac:
|
|
||||||
imageName: 'macos-10.14'
|
|
||||||
python.version: '3.8'
|
|
||||||
maxParallel: 4
|
|
||||||
pool:
|
|
||||||
vmImage: $(imageName)
|
|
||||||
|
|
||||||
steps:
|
- script: |
|
||||||
- task: UsePythonVersion@0
|
python -m pip install -U pip setuptools
|
||||||
inputs:
|
pip install -r requirements.txt
|
||||||
versionSpec: '$(python.version)'
|
displayName: "Install dependencies"
|
||||||
architecture: 'x64'
|
condition: not(eq(variables['python.version'], '3.5'))
|
||||||
|
|
||||||
- script: |
|
- script: |
|
||||||
python -m pip install -U setuptools
|
python setup.py build_ext --inplace -j 2
|
||||||
pip install -r requirements.txt
|
python setup.py sdist --formats=gztar
|
||||||
displayName: 'Install dependencies'
|
displayName: "Compile and build sdist"
|
||||||
|
|
||||||
- script: |
|
- task: DeleteFiles@1
|
||||||
python setup.py build_ext --inplace
|
inputs:
|
||||||
python setup.py sdist --formats=gztar
|
contents: "spacy"
|
||||||
displayName: 'Compile and build sdist'
|
displayName: "Delete source directory"
|
||||||
|
|
||||||
- task: DeleteFiles@1
|
- script: |
|
||||||
inputs:
|
pip freeze > installed.txt
|
||||||
contents: 'spacy'
|
pip uninstall -y -r installed.txt
|
||||||
displayName: 'Delete source directory'
|
displayName: "Uninstall all packages"
|
||||||
|
|
||||||
- bash: |
|
- bash: |
|
||||||
SDIST=$(python -c "import os;print(os.listdir('./dist')[-1])" 2>&1)
|
SDIST=$(python -c "import os;print(os.listdir('./dist')[-1])" 2>&1)
|
||||||
pip install dist/$SDIST
|
pip install dist/$SDIST
|
||||||
displayName: 'Install from sdist'
|
displayName: "Install from sdist"
|
||||||
|
condition: not(eq(variables['python.version'], '3.5'))
|
||||||
|
|
||||||
- script: python -m pytest --pyargs spacy
|
- script: |
|
||||||
displayName: 'Run tests'
|
pip install -r requirements.txt
|
||||||
|
python -m pytest --pyargs spacy
|
||||||
|
displayName: "Run tests"
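The "Install from sdist" step above picks the archive with `os.listdir('./dist')[-1]`. Since `os.listdir` returns entries in arbitrary order, that is only reliable while `dist` holds a single file. A minimal sketch of a more explicit selection, assuming the `--formats=gztar` build leaves exactly one `.tar.gz` archive; `find_sdist` is a hypothetical helper and not part of the pipeline:

```python
import os
import tempfile


def find_sdist(dist_dir):
    """Return the path of the only .tar.gz archive in dist_dir (hypothetical helper)."""
    archives = sorted(
        name for name in os.listdir(dist_dir) if name.endswith(".tar.gz")
    )
    # Fail loudly instead of silently installing the wrong file.
    if len(archives) != 1:
        raise RuntimeError(f"expected exactly one sdist, found {archives!r}")
    return os.path.join(dist_dir, archives[0])


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than a real ./dist folder.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("spacy-0.0.0.tar.gz", "notes.txt"):
            open(os.path.join(tmp, name), "w").close()
        print(os.path.basename(find_sdist(tmp)))
```

Raising on zero or several archives trades convenience for a clearer failure when the build step produces something unexpected.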
|
||||||
|
|
5
build-constraints.txt
Normal file
|
@ -0,0 +1,5 @@
|
||||||
|
# build version constraints for use with wheelwright + multibuild
|
||||||
|
numpy==1.15.0; python_version<='3.7'
|
||||||
|
numpy==1.17.3; python_version=='3.8'
|
||||||
|
numpy==1.19.3; python_version=='3.9'
|
||||||
|
numpy; python_version>='3.10'
|
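As a rough illustration of how the markers in build-constraints.txt partition interpreter versions, the sketch below mirrors them with plain tuple comparisons. The function name `pinned_numpy` is hypothetical and not part of this commit; real installers evaluate the environment markers themselves.

```python
from typing import Optional, Tuple


def pinned_numpy(py_version: Tuple[int, int]) -> Optional[str]:
    """Return the numpy pin selected by the constraints above, or None if unpinned.

    Hypothetical helper for illustration only; it mirrors the four marker
    lines of build-constraints.txt using (major, minor) tuples.
    """
    if py_version <= (3, 7):
        return "1.15.0"
    if py_version == (3, 8):
        return "1.17.3"
    if py_version == (3, 9):
        return "1.19.3"
    # python_version >= '3.10': numpy is still required, but no version is pinned
    return None
```

Comparing tuples rather than strings avoids the pitfall that "3.10" sorts before "3.9" as plain text.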
|
@ -3,6 +3,8 @@ redirects = [
|
||||||
{from = "https://spacy.netlify.com/*", to="https://spacy.io/:splat", force = true },
|
{from = "https://spacy.netlify.com/*", to="https://spacy.io/:splat", force = true },
|
||||||
# Subdomain for branches
|
# Subdomain for branches
|
||||||
{from = "https://nightly.spacy.io/*", to="https://nightly-spacy-io.spacy.io/:splat", force = true, status = 200},
|
{from = "https://nightly.spacy.io/*", to="https://nightly-spacy-io.spacy.io/:splat", force = true, status = 200},
|
||||||
|
# TODO: update this with the v2 branch build once v3 is live (status = 200)
|
||||||
|
{from = "https://v2.spacy.io/*", to="https://spacy.io/:splat", force = true},
|
||||||
# Old subdomains
|
# Old subdomains
|
||||||
{from = "https://survey.spacy.io/*", to = "https://spacy.io", force = true},
|
{from = "https://survey.spacy.io/*", to = "https://spacy.io", force = true},
|
||||||
{from = "http://survey.spacy.io/*", to = "https://spacy.io", force = true},
|
{from = "http://survey.spacy.io/*", to = "https://spacy.io", force = true},
|
||||||
|
|
|
@ -1,13 +1,16 @@
|
||||||
[build-system]
|
[build-system]
|
||||||
requires = [
|
requires = [
|
||||||
"setuptools",
|
"setuptools",
|
||||||
"wheel",
|
|
||||||
"cython>=0.25",
|
"cython>=0.25",
|
||||||
"cymem>=2.0.2,<2.1.0",
|
"cymem>=2.0.2,<2.1.0",
|
||||||
"preshed>=3.0.2,<3.1.0",
|
"preshed>=3.0.2,<3.1.0",
|
||||||
"murmurhash>=0.28.0,<1.1.0",
|
"murmurhash>=0.28.0,<1.1.0",
|
||||||
"thinc>=8.0.0rc2,<8.1.0",
|
"thinc>=8.0.0rc2,<8.1.0",
|
||||||
"blis>=0.4.0,<0.8.0",
|
"blis>=0.4.0,<0.8.0",
|
||||||
"pathy"
|
"pathy",
|
||||||
|
"numpy==1.15.0; python_version<='3.7'",
|
||||||
|
"numpy==1.17.3; python_version=='3.8'",
|
||||||
|
"numpy==1.19.3; python_version=='3.9'",
|
||||||
|
"numpy; python_version>='3.10'",
|
||||||
]
|
]
|
||||||
build-backend = "setuptools.build_meta"
|
build-backend = "setuptools.build_meta"
|
||||||
|
|
|
@ -20,6 +20,7 @@ classifiers =
|
||||||
Programming Language :: Python :: 3.6
|
Programming Language :: Python :: 3.6
|
||||||
Programming Language :: Python :: 3.7
|
Programming Language :: Python :: 3.7
|
||||||
Programming Language :: Python :: 3.8
|
Programming Language :: Python :: 3.8
|
||||||
|
Programming Language :: Python :: 3.9
|
||||||
Topic :: Scientific/Engineering
|
Topic :: Scientific/Engineering
|
||||||
|
|
||||||
[options]
|
[options]
|
||||||
|
@ -27,7 +28,6 @@ zip_safe = false
|
||||||
include_package_data = true
|
include_package_data = true
|
||||||
python_requires = >=3.6
|
python_requires = >=3.6
|
||||||
setup_requires =
|
setup_requires =
|
||||||
wheel
|
|
||||||
cython>=0.25
|
cython>=0.25
|
||||||
numpy>=1.15.0
|
numpy>=1.15.0
|
||||||
# We also need our Cython packages here to compile against
|
# We also need our Cython packages here to compile against
|
||||||
|
|
6
setup.py
|
@ -2,9 +2,9 @@
|
||||||
from setuptools import Extension, setup, find_packages
|
from setuptools import Extension, setup, find_packages
|
||||||
import sys
|
import sys
|
||||||
import platform
|
import platform
|
||||||
|
import numpy
|
||||||
from distutils.command.build_ext import build_ext
|
from distutils.command.build_ext import build_ext
|
||||||
from distutils.sysconfig import get_python_inc
|
from distutils.sysconfig import get_python_inc
|
||||||
import numpy
|
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
import shutil
|
import shutil
|
||||||
from Cython.Build import cythonize
|
from Cython.Build import cythonize
|
||||||
|
@ -194,8 +194,8 @@ def setup_package():
|
||||||
print(f"Copied {copy_file} -> {target_dir}")
|
print(f"Copied {copy_file} -> {target_dir}")
|
||||||
|
|
||||||
include_dirs = [
|
include_dirs = [
|
||||||
get_python_inc(plat_specific=True),
|
|
||||||
numpy.get_include(),
|
numpy.get_include(),
|
||||||
|
get_python_inc(plat_specific=True),
|
||||||
]
|
]
|
||||||
ext_modules = []
|
ext_modules = []
|
||||||
for name in MOD_NAMES:
|
for name in MOD_NAMES:
|
||||||
|
@ -212,7 +212,7 @@ def setup_package():
|
||||||
ext_modules=ext_modules,
|
ext_modules=ext_modules,
|
||||||
cmdclass={"build_ext": build_ext_subclass},
|
cmdclass={"build_ext": build_ext_subclass},
|
||||||
include_dirs=include_dirs,
|
include_dirs=include_dirs,
|
||||||
package_data={"": ["*.pyx", "*.pxd", "*.pxi", "*.cpp"]},
|
package_data={"": ["*.pyx", "*.pxd", "*.pxi"]},
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -210,8 +210,12 @@ _ukrainian_lower = r"а-щюяіїєґ"
|
||||||
_ukrainian_upper = r"А-ЩЮЯІЇЄҐ"
|
_ukrainian_upper = r"А-ЩЮЯІЇЄҐ"
|
||||||
_ukrainian = r"а-щюяіїєґА-ЩЮЯІЇЄҐ"
|
_ukrainian = r"а-щюяіїєґА-ЩЮЯІЇЄҐ"
|
||||||
|
|
||||||
_upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian_upper
|
_macedonian_lower = r"ѓѕјљњќѐѝ"
|
||||||
_lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower
|
_macedonian_upper = r"ЃЅЈЉЊЌЀЍ"
|
||||||
|
_macedonian = r"ѓѕјљњќѐѝЃЅЈЉЊЌЀЍ"
|
||||||
|
|
||||||
|
_upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian_upper + _macedonian_upper
|
||||||
|
_lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower + _macedonian_lower
|
||||||
|
|
||||||
_uncased = (
|
_uncased = (
|
||||||
_bengali
|
_bengali
|
||||||
|
@ -226,7 +230,7 @@ _uncased = (
|
||||||
+ _cjk
|
+ _cjk
|
||||||
)
|
)
|
||||||
|
|
||||||
ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased)
|
ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _macedonian + _uncased)
|
||||||
ALPHA_LOWER = group_chars(_lower + _uncased)
|
ALPHA_LOWER = group_chars(_lower + _uncased)
|
||||||
ALPHA_UPPER = group_chars(_upper + _uncased)
|
ALPHA_UPPER = group_chars(_upper + _uncased)
|
||||||
|
|
||||||
|
|
|
@ -1,9 +1,16 @@
|
||||||
from .stop_words import STOP_WORDS
|
from .stop_words import STOP_WORDS
|
||||||
|
from .tag_map import TAG_MAP
|
||||||
|
from ...language import Language
|
||||||
|
from ...attrs import LANG
|
||||||
from .lex_attrs import LEX_ATTRS
|
from .lex_attrs import LEX_ATTRS
|
||||||
from ...language import Language
|
from ...language import Language
|
||||||
|
|
||||||
|
|
||||||
class CzechDefaults(Language.Defaults):
|
class CzechDefaults(Language.Defaults):
|
||||||
|
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||||
|
lex_attr_getters.update(LEX_ATTRS)
|
||||||
|
lex_attr_getters[LANG] = lambda text: "cs"
|
||||||
|
tag_map = TAG_MAP
|
||||||
stop_words = STOP_WORDS
|
stop_words = STOP_WORDS
|
||||||
lex_attr_getters = LEX_ATTRS
|
lex_attr_getters = LEX_ATTRS
|
||||||
|
|
||||||
|
|
4312
spacy/lang/cs/tag_map.py
Normal file
File diff suppressed because it is too large
|
@ -6,10 +6,21 @@ from ...tokens import Doc, Span
|
||||||
|
|
||||||
|
|
||||||
def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
|
def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
|
||||||
"""Detect base noun phrases from a dependency parse. Works on Doc and Span."""
|
"""
|
||||||
# fmt: off
|
Detect base noun phrases from a dependency parse. Works on both Doc and Span.
|
||||||
labels = ["nsubj", "dobj", "nsubjpass", "pcomp", "pobj", "dative", "appos", "attr", "ROOT"]
|
"""
|
||||||
# fmt: on
|
labels = [
|
||||||
|
"oprd",
|
||||||
|
"nsubj",
|
||||||
|
"dobj",
|
||||||
|
"nsubjpass",
|
||||||
|
"pcomp",
|
||||||
|
"pobj",
|
||||||
|
"dative",
|
||||||
|
"appos",
|
||||||
|
"attr",
|
||||||
|
"ROOT",
|
||||||
|
]
|
||||||
doc = doclike.doc # Ensure works on both Doc and Span.
|
doc = doclike.doc # Ensure works on both Doc and Span.
|
||||||
if not doc.has_annotation("DEP"):
|
if not doc.has_annotation("DEP"):
|
||||||
raise ValueError(Errors.E029)
|
raise ValueError(Errors.E029)
|
||||||
|
|
48
spacy/lang/mk/__init__.py
Normal file
|
@ -0,0 +1,48 @@
|
||||||
|
from typing import Optional
|
||||||
|
from thinc.api import Model
|
||||||
|
from .lemmatizer import MacedonianLemmatizer
|
||||||
|
from .stop_words import STOP_WORDS
|
||||||
|
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||||
|
from .lex_attrs import LEX_ATTRS
|
||||||
|
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||||
|
|
||||||
|
from ...language import Language
|
||||||
|
from ...attrs import LANG
|
||||||
|
from ...util import update_exc
|
||||||
|
from ...lookups import Lookups
|
||||||
|
|
||||||
|
|
||||||
|
class MacedonianDefaults(Language.Defaults):
|
||||||
|
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||||
|
lex_attr_getters[LANG] = lambda text: "mk"
|
||||||
|
|
||||||
|
# Optional: replace flags with custom functions, e.g. like_num()
|
||||||
|
lex_attr_getters.update(LEX_ATTRS)
|
||||||
|
|
||||||
|
# Merge base exceptions and custom tokenizer exceptions
|
||||||
|
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||||
|
stop_words = STOP_WORDS
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def create_lemmatizer(cls, nlp=None, lookups=None):
|
||||||
|
if lookups is None:
|
||||||
|
lookups = Lookups()
|
||||||
|
return MacedonianLemmatizer(lookups)
|
||||||
|
|
||||||
|
|
||||||
|
class Macedonian(Language):
|
||||||
|
lang = "mk"
|
||||||
|
Defaults = MacedonianDefaults
|
||||||
|
|
||||||
|
|
||||||
|
@Macedonian.factory(
|
||||||
|
"lemmatizer",
|
||||||
|
assigns=["token.lemma"],
|
||||||
|
default_config={"model": None, "mode": "rule"},
|
||||||
|
default_score_weights={"lemma_acc": 1.0},
|
||||||
|
)
|
||||||
|
def make_lemmatizer(nlp: Language, model: Optional[Model], name: str, mode: str):
|
||||||
|
return MacedonianLemmatizer(nlp.vocab, model, name, mode=mode)
|
||||||
|
|
||||||
|
|
||||||
|
__all__ = ["Macedonian"]
|
55
spacy/lang/mk/lemmatizer.py
Normal file
|
@ -0,0 +1,55 @@
|
||||||
|
from typing import List
|
||||||
|
from collections import OrderedDict
|
||||||
|
|
||||||
|
from ...pipeline import Lemmatizer
|
||||||
|
from ...tokens import Token
|
||||||
|
|
||||||
|
|
||||||
|
class MacedonianLemmatizer(Lemmatizer):
|
||||||
|
def rule_lemmatize(self, token: Token) -> List[str]:
|
||||||
|
string = token.text
|
||||||
|
univ_pos = token.pos_.lower()
|
||||||
|
morphology = token.morph.to_dict()
|
||||||
|
|
||||||
|
if univ_pos in ("", "eol", "space"):
|
||||||
|
return [string.lower()]
|
||||||
|
|
||||||
|
if string[-3:] == 'јќи':
|
||||||
|
string = string[:-3]
|
||||||
|
univ_pos = "verb"
|
||||||
|
|
||||||
|
if callable(self.is_base_form) and self.is_base_form(univ_pos, morphology):
|
||||||
|
return [string.lower()]
|
||||||
|
index_table = self.lookups.get_table("lemma_index", {})
|
||||||
|
exc_table = self.lookups.get_table("lemma_exc", {})
|
||||||
|
rules_table = self.lookups.get_table("lemma_rules", {})
|
||||||
|
if not any((index_table.get(univ_pos), exc_table.get(univ_pos), rules_table.get(univ_pos))):
|
||||||
|
if univ_pos == "propn":
|
||||||
|
return [string]
|
||||||
|
else:
|
||||||
|
return [string.lower()]
|
||||||
|
|
||||||
|
index = index_table.get(univ_pos, {})
|
||||||
|
exceptions = exc_table.get(univ_pos, {})
|
||||||
|
rules = rules_table.get(univ_pos, [])
|
||||||
|
|
||||||
|
orig = string
|
||||||
|
string = string.lower()
|
||||||
|
forms = []
|
||||||
|
|
||||||
|
for old, new in rules:
|
||||||
|
if string.endswith(old):
|
||||||
|
form = string[: len(string) - len(old)] + new
|
||||||
|
if not form:
|
||||||
|
continue
|
||||||
|
if form in index or not form.isalpha():
|
||||||
|
forms.append(form)
|
||||||
|
|
||||||
|
forms = list(OrderedDict.fromkeys(forms))
|
||||||
|
for form in exceptions.get(string, []):
|
||||||
|
if form not in forms:
|
||||||
|
forms.insert(0, form)
|
||||||
|
if not forms:
|
||||||
|
forms.append(orig)
|
||||||
|
|
||||||
|
return forms
|
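The core of `rule_lemmatize` above is the suffix-rule loop followed by the exception lookup. The sketch below isolates that part so it can be run without spaCy. The function name `apply_suffix_rules` and the toy tables in the usage note are hypothetical; the real tables come from `self.lookups`.

```python
from collections import OrderedDict
from typing import Dict, Iterable, List, Set, Tuple


def apply_suffix_rules(
    string: str,
    index: Set[str],
    exceptions: Dict[str, List[str]],
    rules: Iterable[Tuple[str, str]],
) -> List[str]:
    """Mirror the rule/exception part of MacedonianLemmatizer.rule_lemmatize."""
    orig = string
    string = string.lower()
    forms: List[str] = []

    for old, new in rules:
        if string.endswith(old):
            # Replace the matched suffix with its substitute
            form = string[: len(string) - len(old)] + new
            if not form:
                continue
            # Keep known lemmas, and anything that is not purely alphabetic
            if form in index or not form.isalpha():
                forms.append(form)

    # Remove duplicates while preserving order
    forms = list(OrderedDict.fromkeys(forms))
    # Exceptions take priority, so they are inserted at the front
    for form in exceptions.get(string, []):
        if form not in forms:
            forms.insert(0, form)
    # Fall back to the original token text if nothing matched
    if not forms:
        forms.append(orig)
    return forms
```

With the toy tables `rules = [("и", "а")]`, `index = {"книга"}` and `exceptions = {"луѓе": ["човек"]}`, the plural "книги" is rewritten through the suffix rule, "луѓе" is resolved through the exception table, and a token that matches neither is returned unchanged, including its original casing.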
55
spacy/lang/mk/lex_attrs.py
Normal file
|
@ -0,0 +1,55 @@
|
||||||
|
from ...attrs import LIKE_NUM
|
||||||
|
|
||||||
|
_num_words = [
|
||||||
|
"нула", "еден", "една", "едно", "два", "две", "три", "четири", "пет", "шест", "седум", "осум", "девет", "десет",
|
||||||
|
"единаесет", "дванаесет", "тринаесет", "четиринаесет", "петнаесет", "шеснаесет", "седумнаесет", "осумнаесет",
|
||||||
|
"деветнаесет", "дваесет", "триесет", "четириесет", "педесет", "шеесет", "седумдесет", "осумдесет", "деведесет",
|
||||||
|
"сто", "двесте", "триста", "четиристотини", "петстотини", "шестотини", "седумстотини", "осумстотини",
|
||||||
|
"деветстотини", "илјада", "илјади", 'милион', 'милиони', 'милијарда', 'милијарди', 'билион', 'билиони',
|
||||||
|
|
||||||
|
"двајца", "тројца", "четворица", "петмина", "шестмина", "седуммина", "осуммина", "деветмина", "обата", "обајцата",
|
||||||
|
|
||||||
|
"прв", "втор", "трет", "четврт", "седм", "осм", "двестоти",
|
||||||
|
|
||||||
|
"два-три", "два-триесет", "два-триесетмина", "два-тринаесет", "два-тројца", "две-три", "две-тристотини",
|
||||||
|
"пет-шеесет", "пет-шеесетмина", "пет-шеснаесетмина", "пет-шест", "пет-шестмина", "пет-шестотини", "петина",
|
||||||
|
"осмина", "седум-осум", "седум-осумдесет", "седум-осуммина", "седум-осумнаесет", "седум-осумнаесетмина",
|
||||||
|
"три-четириесет", "три-четиринаесет", "шеесет", "шеесетина", "шеесетмина", "шеснаесет", "шеснаесетмина",
|
||||||
|
"шест-седум", "шест-седумдесет", "шест-седумнаесет", "шест-седумстотини", "шестоти", "шестотини"
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def like_num(text):
|
||||||
|
if text.startswith(("+", "-", "±", "~")):
|
||||||
|
text = text[1:]
|
||||||
|
text = text.replace(",", "").replace(".", "")
|
||||||
|
if text.isdigit():
|
||||||
|
return True
|
||||||
|
if text.count("/") == 1:
|
||||||
|
num, denom = text.split("/")
|
||||||
|
if num.isdigit() and denom.isdigit():
|
||||||
|
return True
|
||||||
|
|
||||||
|
text_lower = text.lower()
|
||||||
|
if text_lower in _num_words:
|
||||||
|
return True
|
||||||
|
|
||||||
|
if text_lower.endswith(("а", "о", "и")):
|
||||||
|
if text_lower[:-1] in _num_words:
|
||||||
|
return True
|
||||||
|
|
||||||
|
if text_lower.endswith(("ти", "та", "то", "на")):
|
||||||
|
if text_lower[:-2] in _num_words:
|
||||||
|
return True
|
||||||
|
|
||||||
|
if text_lower.endswith(("ата", "иот", "ите", "ина", "чки")):
|
||||||
|
if text_lower[:-3] in _num_words:
|
||||||
|
return True
|
||||||
|
|
||||||
|
if text_lower.endswith(("мина", "тина")):
|
||||||
|
if text_lower[:-4] in _num_words:
|
||||||
|
return True
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
LEX_ATTRS = {LIKE_NUM: like_num}
|
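To make the suffix handling in `like_num` easier to follow, here is a condensed, standalone sketch. It uses a deliberately tiny word list (an assumption for the example, not the full `_num_words` above) and folds the four suffix checks into one loop; the checks themselves are the same as in the diff.

```python
# Tiny illustrative subset of _num_words; the real list is much longer
_num_words = {"еден", "два", "три", "пет", "десет", "сто"}

# Suffix groups from the diff, paired with the number of characters to strip
_suffix_groups = (
    (("а", "о", "и"), 1),
    (("ти", "та", "то", "на"), 2),
    (("ата", "иот", "ите", "ина", "чки"), 3),
    (("мина", "тина"), 4),
)


def like_num(text: str) -> bool:
    # Drop a single leading sign and any thousands/decimal separators
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as 3/4
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True

    text_lower = text.lower()
    if text_lower in _num_words:
        return True
    # Strip a known ending and look the remaining stem up in the word list
    for suffixes, length in _suffix_groups:
        if text_lower.endswith(suffixes) and text_lower[:-length] in _num_words:
            return True
    return False
```

For example, "петти" is accepted because stripping "ти" leaves "пет", and "петмина" because stripping "мина" leaves the same stem.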
815
spacy/lang/mk/stop_words.py
Normal file
|
@ -0,0 +1,815 @@
|
||||||
|
STOP_WORDS = set(
|
||||||
|
"""
|
||||||
|
а
|
||||||
|
абре
|
||||||
|
aв
|
||||||
|
аи
|
||||||
|
ако
|
||||||
|
алало
|
||||||
|
ам
|
||||||
|
ама
|
||||||
|
аман
|
||||||
|
ами
|
||||||
|
амин
|
||||||
|
априли-ли-ли
|
||||||
|
ау
|
||||||
|
аух
|
||||||
|
ауч
|
||||||
|
ах
|
||||||
|
аха
|
||||||
|
аха-ха
|
||||||
|
аш
|
||||||
|
ашколсум
|
||||||
|
ашколсун
|
||||||
|
ај
|
||||||
|
ајде
|
||||||
|
ајс
|
||||||
|
аџаба
|
||||||
|
бавно
|
||||||
|
бам
|
||||||
|
бам-бум
|
||||||
|
бап
|
||||||
|
бар
|
||||||
|
баре
|
||||||
|
барем
|
||||||
|
бау
|
||||||
|
бау-бау
|
||||||
|
баш
|
||||||
|
бај
|
||||||
|
бе
|
||||||
|
беа
|
||||||
|
бев
|
||||||
|
бевме
|
||||||
|
бевте
|
||||||
|
без
|
||||||
|
безбели
|
||||||
|
бездруго
|
||||||
|
белки
|
||||||
|
беше
|
||||||
|
би
|
||||||
|
бидејќи
|
||||||
|
бим
|
||||||
|
бис
|
||||||
|
бла
|
||||||
|
блазе
|
||||||
|
богами
|
||||||
|
божем
|
||||||
|
боц
|
||||||
|
браво
|
||||||
|
бравос
|
||||||
|
бре
|
||||||
|
бреј
|
||||||
|
брзо
|
||||||
|
бришка
|
||||||
|
бррр
|
||||||
|
бу
|
||||||
|
бум
|
||||||
|
буф
|
||||||
|
буц
|
||||||
|
бујрум
|
||||||
|
ваа
|
||||||
|
вам
|
||||||
|
варај
|
||||||
|
варда
|
||||||
|
вас
|
||||||
|
вај
|
||||||
|
ве
|
||||||
|
велат
|
||||||
|
вели
|
||||||
|
версус
|
||||||
|
веќе
|
||||||
|
ви
|
||||||
|
виа
|
||||||
|
види
|
||||||
|
вие
|
||||||
|
вистина
|
||||||
|
витос
|
||||||
|
внатре
|
||||||
|
во
|
||||||
|
воз
|
||||||
|
вон
|
||||||
|
впрочем
|
||||||
|
врв
|
||||||
|
вред
|
||||||
|
време
|
||||||
|
врз
|
||||||
|
всушност
|
||||||
|
втор
|
||||||
|
галиба
|
||||||
|
ги
|
||||||
|
гитла
|
||||||
|
го
|
||||||
|
годе
|
||||||
|
годишник
|
||||||
|
горе
|
||||||
|
гра
|
||||||
|
гуц
|
||||||
|
гљу
|
||||||
|
да
|
||||||
|
даан
|
||||||
|
дава
|
||||||
|
дал
|
||||||
|
дали
|
||||||
|
дан
|
||||||
|
два
|
||||||
|
дваесет
|
||||||
|
дванаесет
|
||||||
|
двајца
|
||||||
|
две
|
||||||
|
двесте
|
||||||
|
движам
|
||||||
|
движат
|
||||||
|
движи
|
||||||
|
движиме
|
||||||
|
движите
|
||||||
|
движиш
|
||||||
|
де
|
||||||
|
деведесет
|
||||||
|
девет
|
||||||
|
деветнаесет
|
||||||
|
деветстотини
|
||||||
|
деветти
|
||||||
|
дека
|
||||||
|
дел
|
||||||
|
делми
|
||||||
|
демек
|
||||||
|
десет
|
||||||
|
десетина
|
||||||
|
десетти
|
||||||
|
деситици
|
||||||
|
дејгиди
|
||||||
|
дејди
|
||||||
|
ди
|
||||||
|
дилми
|
||||||
|
дин
|
||||||
|
дип
|
||||||
|
дно
|
||||||
|
до
|
||||||
|
доволно
|
||||||
|
додека
|
||||||
|
додуша
|
||||||
|
докај
|
||||||
|
доколку
|
||||||
|
доправено
|
||||||
|
доправи
|
||||||
|
досамоти
|
||||||
|
доста
|
||||||
|
држи
|
||||||
|
дрн
|
||||||
|
друг
|
||||||
|
друга
|
||||||
|
другата
|
||||||
|
други
|
||||||
|
другиот
|
||||||
|
другите
|
||||||
|
друго
|
||||||
|
другото
|
||||||
|
дум
|
||||||
|
дур
|
||||||
|
дури
|
||||||
|
е
|
||||||
|
евала
|
||||||
|
еве
|
||||||
|
евет
|
||||||
|
ега
|
||||||
|
егиди
|
||||||
|
еден
|
||||||
|
едикојси
|
||||||
|
единаесет
|
||||||
|
единствено
|
||||||
|
еднаш
|
||||||
|
едно
|
||||||
|
ексик
|
||||||
|
ела
|
||||||
|
елбете
|
||||||
|
елем
|
||||||
|
ели
|
||||||
|
ем
|
||||||
|
еми
|
||||||
|
ене
|
||||||
|
ете
|
||||||
|
еурека
|
||||||
|
ех
|
||||||
|
еј
|
||||||
|
жими
|
||||||
|
жити
|
||||||
|
за
|
||||||
|
завал
|
||||||
|
заврши
|
||||||
|
зад
|
||||||
|
задека
|
||||||
|
задоволна
|
||||||
|
задржи
|
||||||
|
заедно
|
||||||
|
зар
|
||||||
|
зарад
|
||||||
|
заради
|
||||||
|
заре
|
||||||
|
зарем
|
||||||
|
затоа
|
||||||
|
зашто
|
||||||
|
згора
|
||||||
|
зема
|
||||||
|
земе
|
||||||
|
земува
|
||||||
|
зер
|
||||||
|
значи
|
||||||
|
зошто
|
||||||
|
зуј
|
||||||
|
и
|
||||||
|
иако
|
||||||
|
из
|
||||||
|
извезен
|
||||||
|
изгледа
|
||||||
|
измеѓу
|
||||||
|
износ
|
||||||
|
или
|
||||||
|
или-или
|
||||||
|
илјада
|
||||||
|
илјади
|
||||||
|
им
|
||||||
|
има
|
||||||
|
имаа
|
||||||
|
имаат
|
||||||
|
имавме
|
||||||
|
имавте
|
||||||
|
имам
|
||||||
|
имаме
|
||||||
|
имате
|
||||||
|
имаш
|
||||||
|
имаше
|
||||||
|
име
|
||||||
|
имено
|
||||||
|
именува
|
||||||
|
имплицира
|
||||||
|
имплицираат
|
||||||
|
имплицирам
|
||||||
|
имплицираме
|
||||||
|
имплицирате
|
||||||
|
имплицираш
|
||||||
|
инаку
|
||||||
|
индицира
|
||||||
|
исечок
|
||||||
|
исклучен
|
||||||
|
исклучена
|
||||||
|
исклучени
|
||||||
|
исклучено
|
||||||
|
искористен
|
||||||
|
искористена
|
||||||
|
искористени
|
||||||
|
искористено
|
||||||
|
искористи
|
||||||
|
искрај
|
||||||
|
исти
|
||||||
|
исто
|
||||||
|
итака
|
||||||
|
итн
|
||||||
|
их
|
||||||
|
иха
|
||||||
|
ихуу
|
||||||
|
иш
|
||||||
|
ишала
|
||||||
|
иј
|
||||||
|
ка
|
||||||
|
каде
|
||||||
|
кажува
|
||||||
|
како
|
||||||
|
каков
|
||||||
|
камоли
|
||||||
|
кај
|
||||||
|
ква
|
||||||
|
ки
|
||||||
|
кит
|
||||||
|
кло
|
||||||
|
клум
|
||||||
|
кога
|
||||||
|
кого
|
||||||
|
кого-годе
|
||||||
|
кое
|
||||||
|
кои
|
||||||
|
количество
|
||||||
|
количина
|
||||||
|
колку
|
||||||
|
кому
|
||||||
|
кон
|
||||||
|
користена
|
||||||
|
користени
|
||||||
|
користено
|
||||||
|
користи
|
||||||
|
кот
|
||||||
|
котрр
|
||||||
|
кош-кош
|
||||||
|
кој
|
||||||
|
која
|
||||||
|
којзнае
|
||||||
|
којшто
|
||||||
|
кр-кр-кр
|
||||||
|
крај
|
||||||
|
крек
|
||||||
|
крз
|
||||||
|
крк
|
||||||
|
крц
|
||||||
|
куку
|
||||||
|
кукуригу
|
||||||
|
куш
|
||||||
|
ле
|
||||||
|
лебами
|
||||||
|
леле
|
||||||
|
лели
|
||||||
|
ли
|
||||||
|
лиду
|
||||||
|
луп
|
||||||
|
ма
|
||||||
|
макар
|
||||||
|
малку
|
||||||
|
марш
|
||||||
|
мат
|
||||||
|
мац
|
||||||
|
машала
|
||||||
|
ме
|
||||||
|
мене
|
||||||
|
место
|
||||||
|
меѓу
|
||||||
|
меѓувреме
|
||||||
|
меѓутоа
|
||||||
|
ми
|
||||||
|
мое
|
||||||
|
може
|
||||||
|
можеби
|
||||||
|
молам
|
||||||
|
моли
|
||||||
|
мор
|
||||||
|
мора
|
||||||
|
море
|
||||||
|
мори
|
||||||
|
мразец
|
||||||
|
му
|
||||||
|
муклец
|
||||||
|
мутлак
|
||||||
|
муц
|
||||||
|
мјау
|
||||||
|
на
|
||||||
|
навидум
|
||||||
|
навистина
|
||||||
|
над
|
||||||
|
надвор
|
||||||
|
назад
|
||||||
|
накај
|
||||||
|
накрај
|
||||||
|
нали
|
||||||
|
нам
|
||||||
|
наместо
|
||||||
|
наоколу
|
||||||
|
направено
|
||||||
|
направи
|
||||||
|
напред
|
||||||
|
нас
|
||||||
|
наспоред
|
||||||
|
наспрема
|
||||||
|
наспроти
|
||||||
|
насред
|
||||||
|
натаму
|
||||||
|
натема
|
||||||
|
начин
|
||||||
|
наш
|
||||||
|
наша
|
||||||
|
наше
|
||||||
|
наши
|
||||||
|
нај
|
||||||
|
најдоцна
|
||||||
|
најмалку
|
||||||
|
најмногу
|
||||||
|
не
|
||||||
|
неа
|
||||||
|
него
|
||||||
|
негов
|
||||||
|
негова
|
||||||
|
негови
|
||||||
|
негово
|
||||||
|
незе
|
||||||
|
нека
|
||||||
|
некаде
|
||||||
|
некако
|
||||||
|
некаков
|
||||||
|
некого
|
||||||
|
некое
|
||||||
|
некои
|
||||||
|
неколку
|
||||||
|
некому
|
||||||
|
некој
|
||||||
|
некојси
|
||||||
|
нели
|
||||||
|
немој
|
||||||
|
нему
|
||||||
|
неоти
|
||||||
|
нечиј
|
||||||
|
нешто
|
||||||
|
нејзе
|
||||||
|
нејзин
|
||||||
|
нејзини
|
||||||
|
нејзино
|
||||||
|
нејсе
|
||||||
|
ни
|
||||||
|
нив
|
||||||
|
нивен
|
||||||
|
нивна
|
||||||
|
нивни
|
||||||
|
нивно
|
||||||
|
ние
|
||||||
|
низ
|
||||||
|
никаде
|
||||||
|
никако
|
||||||
|
никогаш
|
||||||
|
никого
|
||||||
|
никому
|
||||||
|
никој
|
||||||
|
ним
|
||||||
|
нити
|
||||||
|
нито
|
||||||
|
ниту
|
||||||
|
ничиј
|
||||||
|
ништо
|
||||||
|
но
|
||||||
|
нѐ
|
||||||
|
о
|
||||||
|
обр
|
||||||
|
ова
|
||||||
|
ова-она
|
||||||
|
оваа
|
||||||
|
овај
|
||||||
|
овде
|
||||||
|
овега
|
||||||
|
овие
|
||||||
|
овој
|
||||||
|
од
|
||||||
|
одавде
|
||||||
|
оди
|
||||||
|
однесува
|
||||||
|
односно
|
||||||
|
одошто
|
||||||
|
околу
|
||||||
|
олеле
|
||||||
|
олкацок
|
||||||
|
он
|
||||||
|
она
|
||||||
|
онаа
|
||||||
|
онака
|
||||||
|
онаков
|
||||||
|
онде
|
||||||
|
они
|
||||||
|
оние
|
||||||
|
оно
|
||||||
|
оној
|
||||||
|
оп
|
||||||
|
освем
|
||||||
|
освен
|
||||||
|
осем
|
||||||
|
осми
|
||||||
|
осум
|
||||||
|
осумдесет
|
||||||
|
осумнаесет
|
||||||
|
осумстотитни
|
||||||
|
отаде
|
||||||
|
оти
|
||||||
|
откако
|
||||||
|
откај
|
||||||
|
откога
|
||||||
|
отколку
|
||||||
|
оттаму
|
||||||
|
оттука
|
||||||
|
оф
|
||||||
|
ох
|
||||||
|
ој
|
||||||
|
па
|
||||||
|
пак
|
||||||
|
папа
|
||||||
|
пардон
|
||||||
|
пате-ќуте
|
||||||
|
пати
|
||||||
|
пау
|
||||||
|
паче
|
||||||
|
пеесет
|
||||||
|
пеки
|
||||||
|
пет
|
||||||
|
петнаесет
|
||||||
|
петстотини
|
||||||
|
петти
|
||||||
|
пи
|
||||||
|
пи-пи
|
||||||
|
пис
|
||||||
|
плас
|
||||||
|
плус
|
||||||
|
по
|
||||||
|
побавно
|
||||||
|
поблиску
|
||||||
|
побрзо
|
||||||
|
побуни
|
||||||
|
повеќе
|
||||||
|
повторно
|
||||||
|
под
|
||||||
|
подалеку
|
||||||
|
подолу
|
||||||
|
подоцна
|
||||||
|
подруго
|
||||||
|
позади
|
||||||
|
поинаква
|
||||||
|
поинакви
|
||||||
|
поинакво
|
||||||
|
поинаков
|
||||||
|
поинаку
|
||||||
|
покаже
|
||||||
|
покажува
|
||||||
|
покрај
|
||||||
|
полно
|
||||||
|
помалку
|
||||||
|
помеѓу
|
||||||
|
понатаму
|
||||||
|
понекогаш
|
||||||
|
понекој
|
||||||
|
поради
|
||||||
|
поразличен
|
||||||
|
поразлична
|
||||||
|
поразлични
|
||||||
|
поразлично
|
||||||
|
поседува
|
||||||
|
после
|
||||||
|
последен
|
||||||
|
последна
|
||||||
|
последни
|
||||||
|
последно
|
||||||
|
поспоро
|
||||||
|
потег
|
||||||
|
потоа
|
||||||
|
пошироко
|
||||||
|
прави
|
||||||
|
празно
|
||||||
|
прв
|
||||||
|
пред
|
||||||
|
през
|
||||||
|
преку
|
||||||
|
претежно
|
||||||
|
претходен
|
||||||
|
претходна
|
||||||
|
претходни
|
||||||
|
претходник
|
||||||
|
претходно
|
||||||
|
при
|
||||||
|
присвои
|
||||||
|
притоа
|
||||||
|
причинува
|
||||||
|
пријатно
|
||||||
|
просто
|
||||||
|
против
|
||||||
|
прр
|
||||||
|
пст
|
||||||
|
пук
|
||||||
|
пусто
|
||||||
|
пуф
|
||||||
|
пуј
|
||||||
|
пфуј
|
||||||
|
пшт
|
||||||
|
ради
|
||||||
|
различен
|
||||||
|
различна
|
||||||
|
различни
|
||||||
|
различно
|
||||||
|
разни
|
||||||
|
разоружен
|
||||||
|
разредлив
|
||||||
|
рамките
|
||||||
|
рамнообразно
|
||||||
|
растревожено
|
||||||
|
растреперено
|
||||||
|
расчувствувано
|
||||||
|
ратоборно
|
||||||
|
рече
|
||||||
|
роден
|
||||||
|
с
|
||||||
|
сакан
|
||||||
|
сам
|
||||||
|
сама
|
||||||
|
сами
|
||||||
|
самите
|
||||||
|
само
|
||||||
|
самоти
|
||||||
|
свое
|
||||||
|
свои
|
||||||
|
свој
|
||||||
|
своја
|
||||||
|
се
|
||||||
|
себе
|
||||||
|
себеси
|
||||||
|
сега
|
||||||
|
седми
|
||||||
|
седум
|
||||||
|
седумдесет
|
||||||
|
седумнаесет
|
||||||
|
седумстотини
|
||||||
|
секаде
|
||||||
|
секаков
|
||||||
|
секи
|
||||||
|
секогаш
|
||||||
|
секого
|
||||||
|
секому
|
||||||
|
секој
|
||||||
|
секојдневно
|
||||||
|
сем
|
||||||
|
сенешто
|
||||||
|
сепак
|
||||||
|
сериозен
|
||||||
|
сериозна
|
||||||
|
сериозни
|
||||||
|
сериозно
|
||||||
|
сет
|
||||||
|
сечиј
|
||||||
|
сешто
|
||||||
|
си
|
||||||
|
сиктер
|
||||||
|
сиот
|
||||||
|
сип
|
||||||
|
сиреч
|
||||||
|
сите
|
||||||
|
сичко
|
||||||
|
скок
|
||||||
|
скоро
|
||||||
|
скрц
|
||||||
|
следбеник
|
||||||
|
следбеничка
|
||||||
|
следен
|
||||||
|
следователно
|
||||||
|
следствено
|
||||||
|
сме
|
||||||
|
со
|
||||||
|
соне
|
||||||
|
сопствен
|
||||||
|
сопствена
|
||||||
|
сопствени
|
||||||
|
сопствено
|
||||||
|
сосе
|
||||||
|
сосем
|
||||||
|
сполај
|
||||||
|
според
|
||||||
|
споро
|
||||||
|
спрема
|
||||||
|
спроти
|
||||||
|
спротив
|
||||||
|
сред
|
||||||
|
среде
|
||||||
|
среќно
|
||||||
|
срочен
|
||||||
|
сст
|
||||||
|
става
|
||||||
|
ставаат
|
||||||
|
ставам
|
||||||
|
ставаме
|
||||||
|
ставате
|
||||||
|
ставаш
|
||||||
|
стави
|
||||||
|
сте
|
||||||
|
сто
|
||||||
|
стоп
|
||||||
|
страна
|
||||||
|
сум
|
||||||
|
сума
|
||||||
|
супер
|
||||||
|
сус
|
||||||
|
сѐ
|
||||||
|
та
|
||||||
|
таа
|
||||||
|
така
|
||||||
|
таква
|
||||||
|
такви
|
||||||
|
таков
|
||||||
|
тамам
|
||||||
|
таму
|
||||||
|
тангар-мангар
|
||||||
|
тандар-мандар
|
||||||
|
тап
|
||||||
|
твое
|
||||||
|
те
|
||||||
|
тебе
|
||||||
|
тебека
|
||||||
|
тек
|
||||||
|
текот
|
||||||
|
ти
|
||||||
|
тие
|
||||||
|
тизе
|
||||||
|
тик-так
|
||||||
|
тики
|
||||||
|
тоа
|
||||||
|
тогаш
|
||||||
|
тој
|
||||||
|
трак
|
||||||
|
трака-трука
|
||||||
|
трас
|
||||||
|
треба
|
||||||
|
трет
|
||||||
|
три
|
||||||
|
триесет
|
||||||
|
тринаест
|
||||||
|
триста
|
||||||
|
труп
|
||||||
|
трупа
|
||||||
|
трус
|
||||||
|
ту
|
||||||
|
тука
|
||||||
|
туку
|
||||||
|
тукушто
|
||||||
|
туф
|
||||||
|
у
|
||||||
|
уа
|
||||||
|
убаво
|
||||||
|
уви
|
||||||
|
ужасно
|
||||||
|
уз
|
||||||
|
ура
|
||||||
|
уу
|
||||||
|
уф
|
||||||
|
уха
|
||||||
|
уш
|
||||||
|
уште
|
||||||
|
фазен
|
||||||
|
фала
|
||||||
|
фил
|
||||||
|
филан
|
||||||
|
фис
|
||||||
|
фиу
|
||||||
|
фиљан
|
||||||
|
фоб
|
||||||
|
фон
|
||||||
|
ха
|
||||||
|
ха-ха
|
||||||
|
хе
|
||||||
|
хеј
|
||||||
|
хеј
|
||||||
|
хи
|
||||||
|
хм
|
||||||
|
хо
|
||||||
|
цак
|
||||||
|
цап
|
||||||
|
целина
|
||||||
|
цело
|
||||||
|
цигу-лигу
|
||||||
|
циц
|
||||||
|
чекај
|
||||||
|
често
|
||||||
|
четврт
|
||||||
|
четири
|
||||||
|
четириесет
|
||||||
|
четиринаесет
|
||||||
|
четирстотини
|
||||||
|
чие
|
||||||
|
чии
|
||||||
|
чик
|
||||||
|
чик-чирик
|
||||||
|
чини
|
||||||
|
чиш
|
||||||
|
чиј
|
||||||
|
чија
|
||||||
|
чијшто
|
||||||
|
чкрап
|
||||||
|
чому
|
||||||
|
чук
|
||||||
|
чукш
|
||||||
|
чуму
|
||||||
|
чунки
|
||||||
|
шеесет
|
||||||
|
шеснаесет
|
||||||
|
шест
|
||||||
|
шести
|
||||||
|
шестотини
|
||||||
|
ширум
|
||||||
|
шлак
|
||||||
|
шлап
|
||||||
|
шлапа-шлупа
|
||||||
|
шлуп
|
||||||
|
шмрк
|
||||||
|
што
|
||||||
|
штогоде
|
||||||
|
штом
|
||||||
|
штотуку
|
||||||
|
штрак
|
||||||
|
штрап
|
||||||
|
штрап-штруп
|
||||||
|
шуќур
|
||||||
|
ѓиди
|
||||||
|
ѓоа
|
||||||
|
ѓоамити
|
||||||
|
ѕан
|
||||||
|
ѕе
|
||||||
|
ѕин
|
||||||
|
ја
|
||||||
|
јадец
|
||||||
|
јазе
|
||||||
|
јали
|
||||||
|
јас
|
||||||
|
јаска
|
||||||
|
јок
|
||||||
|
ќе
|
||||||
|
ќешки
|
||||||
|
ѝ
|
||||||
|
џагара-магара
|
||||||
|
џанам
|
||||||
|
џив-џив
|
||||||
|
""".split()
|
||||||
|
)
|
100
spacy/lang/mk/tokenizer_exceptions.py
Normal file
|
@ -0,0 +1,100 @@
|
||||||
|
from ...symbols import ORTH, NORM
|
||||||
|
|
||||||
|
|
||||||
|
_exc = {}
|
||||||
|
|
||||||
|
|
||||||
|
_abbr_exc = [
|
||||||
|
{ORTH: "м", NORM: "метар"},
|
||||||
|
{ORTH: "мм", NORM: "милиметар"},
|
||||||
|
{ORTH: "цм", NORM: "центиметар"},
|
||||||
|
{ORTH: "см", NORM: "сантиметар"},
|
||||||
|
{ORTH: "дм", NORM: "дециметар"},
|
||||||
|
{ORTH: "км", NORM: "километар"},
|
||||||
|
{ORTH: "кг", NORM: "килограм"},
|
||||||
|
{ORTH: "дкг", NORM: "декаграм"},
|
||||||
|
{ORTH: "дг", NORM: "дециграм"},
|
||||||
|
{ORTH: "мг", NORM: "милиграм"},
|
||||||
|
{ORTH: "г", NORM: "грам"},
|
||||||
|
{ORTH: "т", NORM: "тон"},
|
||||||
|
{ORTH: "кл", NORM: "килолитар"},
|
||||||
|
{ORTH: "хл", NORM: "хектолитар"},
|
||||||
|
{ORTH: "дкл", NORM: "декалитар"},
|
||||||
|
{ORTH: "л", NORM: "литар"},
|
||||||
|
{ORTH: "дл", NORM: "децилитар"}
|
||||||
|
|
||||||
|
]
|
||||||
|
for abbr in _abbr_exc:
|
||||||
|
_exc[abbr[ORTH]] = [abbr]
|
||||||
|
|
||||||
|
_abbr_line_exc = [
|
||||||
|
{ORTH: "д-р", NORM: "доктор"},
|
||||||
|
{ORTH: "м-р", NORM: "магистер"},
|
||||||
|
{ORTH: "г-ѓа", NORM: "госпоѓа"},
|
||||||
|
{ORTH: "г-ца", NORM: "госпоѓица"},
|
||||||
|
{ORTH: "г-дин", NORM: "господин"},
|
||||||
|
|
||||||
|
]
|
||||||
|
|
||||||
|
for abbr in _abbr_line_exc:
|
||||||
|
_exc[abbr[ORTH]] = [abbr]
|
||||||
|
|
||||||
|
_abbr_dot_exc = [
|
||||||
|
{ORTH: "в.", NORM: "век"},
|
||||||
|
{ORTH: "в.д.", NORM: "вршител на должност"},
|
||||||
|
{ORTH: "г.", NORM: "година"},
|
||||||
|
{ORTH: "г.г.", NORM: "господин господин"},
|
||||||
|
{ORTH: "м.р.", NORM: "машки род"},
|
||||||
|
{ORTH: "год.", NORM: "женски род"},
|
||||||
|
{ORTH: "с.р.", NORM: "среден род"},
|
||||||
|
{ORTH: "н.е.", NORM: "наша ера"},
|
||||||
|
{ORTH: "о.г.", NORM: "оваа година"},
|
||||||
|
{ORTH: "о.м.", NORM: "овој месец"},
|
||||||
|
{ORTH: "с.", NORM: "село"},
|
||||||
|
{ORTH: "т.", NORM: "точка"},
|
||||||
|
{ORTH: "т.е.", NORM: "то ест"},
|
||||||
|
{ORTH: "т.н.", NORM: "таканаречен"},
|
||||||
|
|
||||||
|
{ORTH: "бр.", NORM: "број"},
|
||||||
|
{ORTH: "гр.", NORM: "град"},
|
||||||
|
{ORTH: "др.", NORM: "другар"},
|
||||||
|
{ORTH: "и др.", NORM: "и друго"},
|
||||||
|
{ORTH: "и сл.", NORM: "и слично"},
|
||||||
|
{ORTH: "кн.", NORM: "книга"},
|
||||||
|
{ORTH: "мн.", NORM: "множина"},
|
||||||
|
{ORTH: "на пр.", NORM: "на пример"},
|
||||||
|
{ORTH: "св.", NORM: "свети"},
|
||||||
|
{ORTH: "сп.", NORM: "списание"},
|
||||||
|
{ORTH: "с.", NORM: "страница"},
|
||||||
|
{ORTH: "стр.", NORM: "страница"},
|
||||||
|
{ORTH: "чл.", NORM: "член"},
|
||||||
|
|
||||||
|
{ORTH: "арх.", NORM: "архитект"},
|
||||||
|
{ORTH: "бел.", NORM: "белешка"},
|
||||||
|
{ORTH: "гимн.", NORM: "гимназија"},
|
||||||
|
{ORTH: "ден.", NORM: "денар"},
|
||||||
|
{ORTH: "ул.", NORM: "улица"},
|
||||||
|
{ORTH: "инж.", NORM: "инженер"},
|
||||||
|
{ORTH: "проф.", NORM: "професор"},
|
||||||
|
{ORTH: "студ.", NORM: "студент"},
|
||||||
|
{ORTH: "бот.", NORM: "ботаника"},
|
||||||
|
{ORTH: "мат.", NORM: "математика"},
|
||||||
|
{ORTH: "мед.", NORM: "медицина"},
|
||||||
|
{ORTH: "прил.", NORM: "прилог"},
|
||||||
|
{ORTH: "прид.", NORM: "придавка"},
|
||||||
|
{ORTH: "сврз.", NORM: "сврзник"},
|
||||||
|
{ORTH: "физ.", NORM: "физика"},
|
||||||
|
{ORTH: "хем.", NORM: "хемија"},
|
||||||
|
{ORTH: "пр. н.", NORM: "природни науки"},
|
||||||
|
{ORTH: "истор.", NORM: "историја"},
|
||||||
|
{ORTH: "геогр.", NORM: "географија"},
|
||||||
|
{ORTH: "литер.", NORM: "литература"},
|
||||||
|
|
||||||
|
|
||||||
|
]
|
||||||
|
|
||||||
|
for abbr in _abbr_dot_exc:
|
||||||
|
_exc[abbr[ORTH]] = [abbr]
|
||||||
|
|
||||||
|
|
||||||
|
TOKENIZER_EXCEPTIONS = _exc
|
|
@@ -1,4 +1,4 @@
|
||||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, TOKEN_MATCH
|
||||||
from .stop_words import STOP_WORDS
|
from .stop_words import STOP_WORDS
|
||||||
from .syntax_iterators import SYNTAX_ITERATORS
|
from .syntax_iterators import SYNTAX_ITERATORS
|
||||||
from .lex_attrs import LEX_ATTRS
|
from .lex_attrs import LEX_ATTRS
|
||||||
|
@@ -9,6 +9,7 @@ class TurkishDefaults(Language.Defaults):
|
||||||
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
|
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
|
||||||
lex_attr_getters = LEX_ATTRS
|
lex_attr_getters = LEX_ATTRS
|
||||||
stop_words = STOP_WORDS
|
stop_words = STOP_WORDS
|
||||||
|
token_match = TOKEN_MATCH
|
||||||
syntax_iterators = SYNTAX_ITERATORS
|
syntax_iterators = SYNTAX_ITERATORS
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@@ -1,119 +1,181 @@
|
||||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
import re
|
||||||
|
|
||||||
|
from ..punctuation import ALPHA_LOWER, ALPHA
|
||||||
from ...symbols import ORTH, NORM
|
from ...symbols import ORTH, NORM
|
||||||
from ...util import update_exc
|
|
||||||
|
|
||||||
|
|
||||||
_exc = {"sağol": [{ORTH: "sağ"}, {ORTH: "ol", NORM: "olun"}]}
|
_exc = {}
|
||||||
|
|
||||||
|
|
||||||
for exc_data in [
|
_abbr_period_exc = [
|
||||||
{ORTH: "A.B.D.", NORM: "Amerika Birleşik Devletleri"},
|
{ORTH: "A.B.D.", NORM: "Amerika"},
|
||||||
{ORTH: "Alb.", NORM: "Albay"},
|
{ORTH: "Alb.", NORM: "albay"},
|
||||||
{ORTH: "Ar.Gör.", NORM: "Araştırma Görevlisi"},
|
{ORTH: "Ank.", NORM: "Ankara"},
|
||||||
{ORTH: "Arş.Gör.", NORM: "Araştırma Görevlisi"},
|
{ORTH: "Ar.Gör."},
|
||||||
{ORTH: "Asb.", NORM: "Astsubay"},
|
{ORTH: "Arş.Gör."},
|
||||||
{ORTH: "Astsb.", NORM: "Astsubay"},
|
{ORTH: "Asb.", NORM: "astsubay"},
|
||||||
{ORTH: "As.İz.", NORM: "Askeri İnzibat"},
|
{ORTH: "Astsb.", NORM: "astsubay"},
|
||||||
{ORTH: "Atğm", NORM: "Asteğmen"},
|
{ORTH: "As.İz."},
|
||||||
{ORTH: "Av.", NORM: "Avukat"},
|
{ORTH: "as.iz."},
|
||||||
{ORTH: "Apt.", NORM: "Apartmanı"},
|
{ORTH: "Atğm", NORM: "asteğmen"},
|
||||||
{ORTH: "Bçvş.", NORM: "Başçavuş"},
|
{ORTH: "Av.", NORM: "avukat"},
|
||||||
|
{ORTH: "Apt.", NORM: "apartmanı"},
|
||||||
|
{ORTH: "apt.", NORM: "apartmanı"},
|
||||||
|
{ORTH: "Bçvş.", NORM: "başçavuş"},
|
||||||
|
{ORTH: "bçvş.", NORM: "başçavuş"},
|
||||||
{ORTH: "bk.", NORM: "bakınız"},
|
{ORTH: "bk.", NORM: "bakınız"},
|
||||||
{ORTH: "bknz.", NORM: "bakınız"},
|
{ORTH: "bknz.", NORM: "bakınız"},
|
||||||
{ORTH: "Bnb.", NORM: "Binbaşı"},
|
{ORTH: "Bnb.", NORM: "binbaşı"},
|
||||||
{ORTH: "bnb.", NORM: "binbaşı"},
|
{ORTH: "bnb.", NORM: "binbaşı"},
|
||||||
{ORTH: "Böl.", NORM: "Bölümü"},
|
{ORTH: "Böl.", NORM: "bölümü"},
|
||||||
{ORTH: "Bşk.", NORM: "Başkanlığı"},
|
{ORTH: "böl.", NORM: "bölümü"},
|
||||||
{ORTH: "Bştbp.", NORM: "Baştabip"},
|
{ORTH: "Bşk.", NORM: "başkanlığı"},
|
||||||
{ORTH: "Bul.", NORM: "Bulvarı"},
|
{ORTH: "bşk.", NORM: "başkanlığı"},
|
||||||
{ORTH: "Cad.", NORM: "Caddesi"},
|
{ORTH: "Bştbp.", NORM: "baştabip"},
|
||||||
|
{ORTH: "bştbp.", NORM: "baştabip"},
|
||||||
|
{ORTH: "Bul.", NORM: "bulvarı"},
|
||||||
|
{ORTH: "bul.", NORM: "bulvarı"},
|
||||||
|
{ORTH: "Cad.", NORM: "caddesi"},
|
||||||
|
{ORTH: "cad.", NORM: "caddesi"},
|
||||||
{ORTH: "çev.", NORM: "çeviren"},
|
{ORTH: "çev.", NORM: "çeviren"},
|
||||||
{ORTH: "Çvş.", NORM: "Çavuş"},
|
{ORTH: "Çvş.", NORM: "çavuş"},
|
||||||
|
{ORTH: "çvş.", NORM: "çavuş"},
|
||||||
{ORTH: "dak.", NORM: "dakika"},
|
{ORTH: "dak.", NORM: "dakika"},
|
||||||
{ORTH: "dk.", NORM: "dakika"},
|
{ORTH: "dk.", NORM: "dakika"},
|
||||||
{ORTH: "Doç.", NORM: "Doçent"},
|
{ORTH: "Doç.", NORM: "doçent"},
|
||||||
{ORTH: "doğ.", NORM: "doğum tarihi"},
|
{ORTH: "doğ."},
|
||||||
|
{ORTH: "Dr.", NORM: "doktor"},
|
||||||
|
{ORTH: "dr.", NORM: "doktor"},
|
||||||
{ORTH: "drl.", NORM: "derleyen"},
|
{ORTH: "drl.", NORM: "derleyen"},
|
||||||
{ORTH: "Dz.", NORM: "Deniz"},
|
{ORTH: "Dz.", NORM: "deniz"},
|
||||||
{ORTH: "Dz.K.K.lığı", NORM: "Deniz Kuvvetleri Komutanlığı"},
|
{ORTH: "Dz.K.K.lığı"},
|
||||||
{ORTH: "Dz.Kuv.", NORM: "Deniz Kuvvetleri"},
|
{ORTH: "Dz.Kuv."},
|
||||||
{ORTH: "Dz.Kuv.K.", NORM: "Deniz Kuvvetleri Komutanlığı"},
|
{ORTH: "Dz.Kuv.K."},
|
||||||
{ORTH: "dzl.", NORM: "düzenleyen"},
|
{ORTH: "dzl.", NORM: "düzenleyen"},
|
||||||
{ORTH: "Ecz.", NORM: "Eczanesi"},
|
{ORTH: "Ecz.", NORM: "eczanesi"},
|
||||||
|
{ORTH: "ecz.", NORM: "eczanesi"},
|
||||||
{ORTH: "ekon.", NORM: "ekonomi"},
|
{ORTH: "ekon.", NORM: "ekonomi"},
|
||||||
{ORTH: "Fak.", NORM: "Fakültesi"},
|
{ORTH: "Fak.", NORM: "fakültesi"},
|
||||||
{ORTH: "Gn.", NORM: "Genel"},
|
{ORTH: "Gn.", NORM: "genel"},
|
||||||
{ORTH: "Gnkur.", NORM: "Genelkurmay"},
|
{ORTH: "Gnkur.", NORM: "Genelkurmay"},
|
||||||
{ORTH: "Gn.Kur.", NORM: "Genelkurmay"},
|
{ORTH: "Gn.Kur.", NORM: "Genelkurmay"},
|
||||||
{ORTH: "gr.", NORM: "gram"},
|
{ORTH: "gr.", NORM: "gram"},
|
||||||
{ORTH: "Hst.", NORM: "Hastanesi"},
|
{ORTH: "Hst.", NORM: "hastanesi"},
|
||||||
{ORTH: "Hs.Uzm.", NORM: "Hesap Uzmanı"},
|
{ORTH: "hst.", NORM: "hastanesi"},
|
||||||
|
{ORTH: "Hs.Uzm."},
|
||||||
{ORTH: "huk.", NORM: "hukuk"},
|
{ORTH: "huk.", NORM: "hukuk"},
|
||||||
{ORTH: "Hv.", NORM: "Hava"},
|
{ORTH: "Hv.", NORM: "hava"},
|
||||||
{ORTH: "Hv.K.K.lığı", NORM: "Hava Kuvvetleri Komutanlığı"},
|
{ORTH: "Hv.K.K.lığı"},
|
||||||
{ORTH: "Hv.Kuv.", NORM: "Hava Kuvvetleri"},
|
{ORTH: "Hv.Kuv."},
|
||||||
{ORTH: "Hv.Kuv.K.", NORM: "Hava Kuvvetleri Komutanlığı"},
|
{ORTH: "Hv.Kuv.K."},
|
||||||
{ORTH: "Hz.", NORM: "Hazreti"},
|
{ORTH: "Hz.", NORM: "hazreti"},
|
||||||
{ORTH: "Hz.Öz.", NORM: "Hizmete Özel"},
|
{ORTH: "Hz.Öz."},
|
||||||
{ORTH: "İng.", NORM: "İngilizce"},
|
{ORTH: "İng.", NORM: "ingilizce"},
|
||||||
{ORTH: "Jeol.", NORM: "Jeoloji"},
|
{ORTH: "İst.", NORM: "İstanbul"},
|
||||||
|
{ORTH: "Jeol.", NORM: "jeoloji"},
|
||||||
{ORTH: "jeol.", NORM: "jeoloji"},
|
{ORTH: "jeol.", NORM: "jeoloji"},
|
||||||
{ORTH: "Korg.", NORM: "Korgeneral"},
|
{ORTH: "Korg.", NORM: "korgeneral"},
|
||||||
{ORTH: "Kur.", NORM: "Kurmay"},
|
{ORTH: "Kur.", NORM: "kurmay"},
|
||||||
{ORTH: "Kur.Bşk.", NORM: "Kurmay Başkanı"},
|
{ORTH: "Kur.Bşk."},
|
||||||
{ORTH: "Kuv.", NORM: "Kuvvetleri"},
|
{ORTH: "Kuv.", NORM: "kuvvetleri"},
|
||||||
{ORTH: "Ltd.", NORM: "Limited"},
|
{ORTH: "Ltd.", NORM: "limited"},
|
||||||
{ORTH: "Mah.", NORM: "Mahallesi"},
|
{ORTH: "ltd.", NORM: "limited"},
|
||||||
|
{ORTH: "Mah.", NORM: "mahallesi"},
|
||||||
{ORTH: "mah.", NORM: "mahallesi"},
|
{ORTH: "mah.", NORM: "mahallesi"},
|
||||||
{ORTH: "max.", NORM: "maksimum"},
|
{ORTH: "max.", NORM: "maksimum"},
|
||||||
{ORTH: "min.", NORM: "minimum"},
|
{ORTH: "min.", NORM: "minimum"},
|
||||||
{ORTH: "Müh.", NORM: "Mühendisliği"},
|
{ORTH: "Müh.", NORM: "mühendisliği"},
|
||||||
{ORTH: "müh.", NORM: "mühendisliği"},
|
{ORTH: "müh.", NORM: "mühendisliği"},
|
||||||
{ORTH: "MÖ.", NORM: "Milattan Önce"},
|
{ORTH: "M.Ö."},
|
||||||
{ORTH: "Onb.", NORM: "Onbaşı"},
|
{ORTH: "M.S."},
|
||||||
{ORTH: "Ord.", NORM: "Ordinaryüs"},
|
{ORTH: "Onb.", NORM: "onbaşı"},
|
||||||
{ORTH: "Org.", NORM: "Orgeneral"},
|
{ORTH: "Ord.", NORM: "ordinaryüs"},
|
||||||
{ORTH: "Ped.", NORM: "Pedagoji"},
|
{ORTH: "Org.", NORM: "orgeneral"},
|
||||||
{ORTH: "Prof.", NORM: "Profesör"},
|
{ORTH: "Ped.", NORM: "pedagoji"},
|
||||||
{ORTH: "Sb.", NORM: "Subay"},
|
{ORTH: "Prof.", NORM: "profesör"},
|
||||||
{ORTH: "Sn.", NORM: "Sayın"},
|
{ORTH: "prof.", NORM: "profesör"},
|
||||||
|
{ORTH: "Sb.", NORM: "subay"},
|
||||||
|
{ORTH: "Sn.", NORM: "sayın"},
|
||||||
{ORTH: "sn.", NORM: "saniye"},
|
{ORTH: "sn.", NORM: "saniye"},
|
||||||
{ORTH: "Sok.", NORM: "Sokak"},
|
{ORTH: "Sok.", NORM: "sokak"},
|
||||||
{ORTH: "Şb.", NORM: "Şube"},
|
{ORTH: "sok.", NORM: "sokak"},
|
||||||
{ORTH: "Şti.", NORM: "Şirketi"},
|
{ORTH: "Şb.", NORM: "şube"},
|
||||||
{ORTH: "Tbp.", NORM: "Tabip"},
|
{ORTH: "şb.", NORM: "şube"},
|
||||||
{ORTH: "T.C.", NORM: "Türkiye Cumhuriyeti"},
|
{ORTH: "Şti.", NORM: "şirketi"},
|
||||||
{ORTH: "Tel.", NORM: "Telefon"},
|
{ORTH: "şti.", NORM: "şirketi"},
|
||||||
|
{ORTH: "Tbp.", NORM: "tabip"},
|
||||||
|
{ORTH: "tbp.", NORM: "tabip"},
|
||||||
|
{ORTH: "T.C."},
|
||||||
|
{ORTH: "Tel.", NORM: "telefon"},
|
||||||
{ORTH: "tel.", NORM: "telefon"},
|
{ORTH: "tel.", NORM: "telefon"},
|
||||||
{ORTH: "telg.", NORM: "telgraf"},
|
{ORTH: "telg.", NORM: "telgraf"},
|
||||||
{ORTH: "Tğm.", NORM: "Teğmen"},
|
{ORTH: "Tğm.", NORM: "teğmen"},
|
||||||
{ORTH: "tğm.", NORM: "teğmen"},
|
{ORTH: "tğm.", NORM: "teğmen"},
|
||||||
{ORTH: "tic.", NORM: "ticaret"},
|
{ORTH: "tic.", NORM: "ticaret"},
|
||||||
{ORTH: "Tug.", NORM: "Tugay"},
|
{ORTH: "Tug.", NORM: "tugay"},
|
||||||
{ORTH: "Tuğg.", NORM: "Tuğgeneral"},
|
{ORTH: "Tuğg.", NORM: "tuğgeneral"},
|
||||||
{ORTH: "Tümg.", NORM: "Tümgeneral"},
|
{ORTH: "Tümg.", NORM: "tümgeneral"},
|
||||||
{ORTH: "Uzm.", NORM: "Uzman"},
|
{ORTH: "Uzm.", NORM: "uzman"},
|
||||||
{ORTH: "Üçvş.", NORM: "Üstçavuş"},
|
{ORTH: "Üçvş.", NORM: "üstçavuş"},
|
||||||
{ORTH: "Üni.", NORM: "Üniversitesi"},
|
{ORTH: "Üni.", NORM: "üniversitesi"},
|
||||||
{ORTH: "Ütğm.", NORM: "Üsteğmen"},
|
{ORTH: "Ütğm.", NORM: "üsteğmen"},
|
||||||
{ORTH: "vb.", NORM: "ve benzeri"},
|
{ORTH: "vb."},
|
||||||
{ORTH: "vs.", NORM: "vesaire"},
|
{ORTH: "vs.", NORM: "vesaire"},
|
||||||
{ORTH: "Yard.", NORM: "Yardımcı"},
|
{ORTH: "Yard.", NORM: "yardımcı"},
|
||||||
{ORTH: "Yar.", NORM: "Yardımcı"},
|
{ORTH: "Yar.", NORM: "yardımcı"},
|
||||||
{ORTH: "Yd.Sb.", NORM: "Yedek Subay"},
|
{ORTH: "Yd.Sb."},
|
||||||
{ORTH: "Yard.Doç.", NORM: "Yardımcı Doçent"},
|
{ORTH: "Yard.Doç."},
|
||||||
{ORTH: "Yar.Doç.", NORM: "Yardımcı Doçent"},
|
{ORTH: "Yar.Doç."},
|
||||||
{ORTH: "Yb.", NORM: "Yarbay"},
|
{ORTH: "Yb.", NORM: "yarbay"},
|
||||||
{ORTH: "Yrd.", NORM: "Yardımcı"},
|
{ORTH: "Yrd.", NORM: "yardımcı"},
|
||||||
{ORTH: "Yrd.Doç.", NORM: "Yardımcı Doçent"},
|
{ORTH: "Yrd.Doç."},
|
||||||
{ORTH: "Y.Müh.", NORM: "Yüksek mühendis"},
|
{ORTH: "Y.Müh."},
|
||||||
{ORTH: "Y.Mim.", NORM: "Yüksek mimar"},
|
{ORTH: "Y.Mim."},
|
||||||
]:
|
{ORTH: "yy.", NORM: "yüzyıl"},
|
||||||
_exc[exc_data[ORTH]] = [exc_data]
|
]
|
||||||
|
|
||||||
|
for abbr in _abbr_period_exc:
|
||||||
|
_exc[abbr[ORTH]] = [abbr]
|
||||||
|
|
||||||
|
_abbr_exc = [
|
||||||
|
{ORTH: "AB", NORM: "Avrupa Birliği"},
|
||||||
|
{ORTH: "ABD", NORM: "Amerika"},
|
||||||
|
{ORTH: "ABS", NORM: "fren"},
|
||||||
|
{ORTH: "AOÇ"},
|
||||||
|
{ORTH: "ASKİ"},
|
||||||
|
{ORTH: "Bağ-kur", NORM: "Bağkur"},
|
||||||
|
{ORTH: "BDDK"},
|
||||||
|
{ORTH: "BJK", NORM: "Beşiktaş"},
|
||||||
|
{ORTH: "ESA", NORM: "Avrupa uzay ajansı"},
|
||||||
|
{ORTH: "FB", NORM: "Fenerbahçe"},
|
||||||
|
{ORTH: "GATA"},
|
||||||
|
{ORTH: "GS", NORM: "Galatasaray"},
|
||||||
|
{ORTH: "İSKİ"},
|
||||||
|
{ORTH: "KBB"},
|
||||||
|
{ORTH: "RTÜK", NORM: "radyo ve televizyon üst kurulu"},
|
||||||
|
{ORTH: "TBMM"},
|
||||||
|
{ORTH: "TC"},
|
||||||
|
{ORTH: "TÜİK", NORM: "Türkiye istatistik kurumu"},
|
||||||
|
{ORTH: "YÖK"},
|
||||||
|
]
|
||||||
|
|
||||||
|
for abbr in _abbr_exc:
|
||||||
|
_exc[abbr[ORTH]] = [abbr]
|
||||||
|
|
||||||
|
|
||||||
for orth in ["Dr.", "yy."]:
|
|
||||||
_exc[orth] = [{ORTH: orth}]
|
|
||||||
|
|
||||||
|
_num = r"[+-]?\d+([,.]\d+)*"
|
||||||
|
_ord_num = r"(\d+\.)"
|
||||||
|
_date = r"(((\d{1,2}[./-]){2})?(\d{4})|(\d{1,2}[./]\d{1,2}(\.)?))"
|
||||||
|
_dash_num = r"(([{al}\d]+/\d+)|(\d+/[{al}]))".format(al=ALPHA)
|
||||||
|
_roman_num = "M{0,3}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})"
|
||||||
|
_roman_ord = r"({rn})\.".format(rn=_roman_num)
|
||||||
|
_time_exp = r"\d+(:\d+)*"
|
||||||
|
|
||||||
TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
|
_inflections = r"'[{al}]+".format(al=ALPHA_LOWER)
|
||||||
|
_abbrev_inflected = r"[{a}]+\.'[{al}]+".format(a=ALPHA, al=ALPHA_LOWER)
|
||||||
|
|
||||||
|
_nums = r"(({d})|({dn})|({te})|({on})|({n})|({ro})|({rn}))({inf})?".format(d=_date, dn=_dash_num, te=_time_exp, on=_ord_num, n=_num, ro=_roman_ord, rn=_roman_num, inf=_inflections)
|
||||||
|
|
||||||
|
TOKENIZER_EXCEPTIONS = _exc
|
||||||
|
TOKEN_MATCH = re.compile(r"^({abbr})|({n})$".format(n=_nums, abbr=_abbrev_inflected)).match
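The way the pieces above combine into a single `token_match` callable can be sketched in isolation. This is a minimal sketch, not the committed code: `ALPHA` and `ALPHA_LOWER` are simplified stand-ins for spaCy's character classes (an assumption), only the ordinal and plain-number branches are kept, and the two alternatives are wrapped in one group so that both are anchored at the end of the string. The diff writes `^({abbr})|({n})$` instead; because `.match` already anchors at the start, that form leaves the abbreviation branch free to match a prefix only.

```python
import re

# Simplified stand-ins for spaCy's ALPHA / ALPHA_LOWER classes (assumption, not the real definitions).
ALPHA = "a-zA-ZçğıöşüÇĞİÖŞÜ"
ALPHA_LOWER = "a-zçğıöşü"

_num = r"[+-]?\d+([,.]\d+)*"  # signed number with optional , or . groups
_ord_num = r"(\d+\.)"         # ordinal such as "6."
_inflections = r"'[{al}]+".format(al=ALPHA_LOWER)  # apostrophe + suffix
_abbrev_inflected = r"[{a}]+\.'[{al}]+".format(a=ALPHA, al=ALPHA_LOWER)

# Reduced version of _nums: ordinal or plain number, optionally inflected.
_nums = r"(({on})|({n}))({inf})?".format(on=_ord_num, n=_num, inf=_inflections)

# Both alternatives share one group, so "$" applies to each of them.
token_match = re.compile(
    r"^(({abbr})|({n}))$".format(abbr=_abbrev_inflected, n=_nums)
).match

for candidate in ["6'da", "-4'ten", "6.'dan", "Prof.'un", "görüştüm"]:
    print(candidate, bool(token_match(candidate)))
```

Strings for which `token_match` returns a match object are kept as single tokens by the tokenizer; everything else goes through the normal prefix, suffix and infix handling.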
|
||||||
|
|
|
@@ -968,10 +968,6 @@ class Language:
|
||||||
|
|
||||||
DOCS: https://nightly.spacy.io/api/language#call
|
DOCS: https://nightly.spacy.io/api/language#call
|
||||||
"""
|
"""
|
||||||
if len(text) > self.max_length:
|
|
||||||
raise ValueError(
|
|
||||||
Errors.E088.format(length=len(text), max_length=self.max_length)
|
|
||||||
)
|
|
||||||
doc = self.make_doc(text)
|
doc = self.make_doc(text)
|
||||||
if component_cfg is None:
|
if component_cfg is None:
|
||||||
component_cfg = {}
|
component_cfg = {}
|
||||||
|
@@ -1045,6 +1041,11 @@ class Language:
|
||||||
text (str): The text to process.
|
text (str): The text to process.
|
||||||
RETURNS (Doc): The processed doc.
|
RETURNS (Doc): The processed doc.
|
||||||
"""
|
"""
|
||||||
|
if len(text) > self.max_length:
|
||||||
|
raise ValueError(
|
||||||
|
Errors.E088.format(length=len(text), max_length=self.max_length)
|
||||||
|
)
|
||||||
|
return self.tokenizer(text)
|
||||||
return self.tokenizer(text)
|
return self.tokenizer(text)
|
||||||
|
|
||||||
def update(
|
def update(
|
||||||
|
|
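The two hunks above move the `max_length` guard from `__call__` into `make_doc`. The effect can be sketched with a toy class. `TinyPipeline` is hypothetical and is not spaCy's `Language`; the whitespace split stands in for a real tokenizer and the error text is invented for illustration.

```python
class TinyPipeline:
    """Hypothetical stand-in for a pipeline object; not spaCy's Language class."""

    def __init__(self, max_length=20):
        self.max_length = max_length

    def make_doc(self, text):
        # The length guard lives here, so every caller of make_doc is covered.
        if len(text) > self.max_length:
            raise ValueError(
                f"Text of length {len(text)} exceeds maximum of {self.max_length}"
            )
        return text.split()  # toy "tokenizer": whitespace split

    def __call__(self, text):
        # No separate length check is needed: make_doc raises first.
        return self.make_doc(text)
```

With the guard in `make_doc`, code paths that build a doc without going through `__call__` are subject to the same limit.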
|
@@ -26,6 +26,7 @@ cdef enum quantifier_t:
|
||||||
ZERO_PLUS
|
ZERO_PLUS
|
||||||
ONE
|
ONE
|
||||||
ONE_PLUS
|
ONE_PLUS
|
||||||
|
FINAL_ID
|
||||||
|
|
||||||
|
|
||||||
cdef struct AttrValueC:
|
cdef struct AttrValueC:
|
||||||
|
|
|
@@ -2,7 +2,7 @@
|
||||||
from typing import List
|
from typing import List
|
||||||
|
|
||||||
from libcpp.vector cimport vector
|
from libcpp.vector cimport vector
|
||||||
from libc.stdint cimport int32_t
|
from libc.stdint cimport int32_t, int8_t
|
||||||
from libc.string cimport memset, memcmp
|
from libc.string cimport memset, memcmp
|
||||||
from cymem.cymem cimport Pool
|
from cymem.cymem cimport Pool
|
||||||
from murmurhash.mrmr cimport hash64
|
from murmurhash.mrmr cimport hash64
|
||||||
|
@@ -308,7 +308,7 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
|
||||||
# avoid any processing or mem alloc if the document is empty
|
# avoid any processing or mem alloc if the document is empty
|
||||||
return output
|
return output
|
||||||
if len(predicates) > 0:
|
if len(predicates) > 0:
|
||||||
predicate_cache = <char*>mem.alloc(length * len(predicates), sizeof(char))
|
predicate_cache = <int8_t*>mem.alloc(length * len(predicates), sizeof(int8_t))
|
||||||
if extensions is not None and len(extensions) >= 1:
|
if extensions is not None and len(extensions) >= 1:
|
||||||
nr_extra_attr = max(extensions.values()) + 1
|
nr_extra_attr = max(extensions.values()) + 1
|
||||||
extra_attr_values = <attr_t*>mem.alloc(length * nr_extra_attr, sizeof(attr_t))
|
extra_attr_values = <attr_t*>mem.alloc(length * nr_extra_attr, sizeof(attr_t))
|
||||||
|
@@ -349,7 +349,7 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
|
||||||
|
|
||||||
|
|
||||||
cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& matches,
|
cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& matches,
|
||||||
char* cached_py_predicates,
|
int8_t* cached_py_predicates,
|
||||||
Token token, const attr_t* extra_attrs, py_predicates) except *:
|
Token token, const attr_t* extra_attrs, py_predicates) except *:
|
||||||
cdef int q = 0
|
cdef int q = 0
|
||||||
cdef vector[PatternStateC] new_states
|
cdef vector[PatternStateC] new_states
|
||||||
|
@@ -421,7 +421,7 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match
|
||||||
states.push_back(new_states[i])
|
states.push_back(new_states[i])
|
||||||
|
|
||||||
|
|
||||||
cdef int update_predicate_cache(char* cache,
|
cdef int update_predicate_cache(int8_t* cache,
|
||||||
const TokenPatternC* pattern, Token token, predicates) except -1:
|
const TokenPatternC* pattern, Token token, predicates) except -1:
|
||||||
# If the state references any extra predicates, check whether they match.
|
# If the state references any extra predicates, check whether they match.
|
||||||
# These are cached, so that we don't call these potentially expensive
|
# These are cached, so that we don't call these potentially expensive
|
||||||
|
@@ -459,7 +459,7 @@ cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states)
|
||||||
|
|
||||||
cdef action_t get_action(PatternStateC state,
|
cdef action_t get_action(PatternStateC state,
|
||||||
const TokenC* token, const attr_t* extra_attrs,
|
const TokenC* token, const attr_t* extra_attrs,
|
||||||
const char* predicate_matches) nogil:
|
const int8_t* predicate_matches) nogil:
|
||||||
"""We need to consider:
|
"""We need to consider:
|
||||||
a) Does the token match the specification? [Yes, No]
|
a) Does the token match the specification? [Yes, No]
|
||||||
b) What's the quantifier? [1, 0+, ?]
|
b) What's the quantifier? [1, 0+, ?]
|
||||||
|
@@ -517,7 +517,7 @@ cdef action_t get_action(PatternStateC state,
|
||||||
|
|
||||||
Problem: If a quantifier is matching, we're adding a lot of open partials
|
Problem: If a quantifier is matching, we're adding a lot of open partials
|
||||||
"""
|
"""
|
||||||
cdef char is_match
|
cdef int8_t is_match
|
||||||
is_match = get_is_match(state, token, extra_attrs, predicate_matches)
|
is_match = get_is_match(state, token, extra_attrs, predicate_matches)
|
||||||
quantifier = get_quantifier(state)
|
quantifier = get_quantifier(state)
|
||||||
is_final = get_is_final(state)
|
is_final = get_is_final(state)
|
||||||
|
@@ -569,9 +569,9 @@ cdef action_t get_action(PatternStateC state,
|
||||||
return RETRY
|
return RETRY
|
||||||
|
|
||||||
|
|
||||||
cdef char get_is_match(PatternStateC state,
|
cdef int8_t get_is_match(PatternStateC state,
|
||||||
const TokenC* token, const attr_t* extra_attrs,
|
const TokenC* token, const attr_t* extra_attrs,
|
||||||
const char* predicate_matches) nogil:
|
const int8_t* predicate_matches) nogil:
|
||||||
for i in range(state.pattern.nr_py):
|
for i in range(state.pattern.nr_py):
|
||||||
if predicate_matches[state.pattern.py_predicates[i]] == -1:
|
if predicate_matches[state.pattern.py_predicates[i]] == -1:
|
||||||
return 0
|
return 0
|
||||||
|
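The diff does not say why the predicate cache and the related return types change from `char` to `int8_t`. One plausible reason, stated here as an assumption, is that the signedness of plain `char` is implementation-defined in C, while the cache is compared against `-1` (see `predicate_matches[...] == -1` above). The sketch below uses Python's `ctypes` only to show what happens to `-1` in a signed versus an unsigned 8-bit slot; the names `NOT_MATCHED`, `signed_slot` and `unsigned_slot` are illustrative.

```python
import ctypes

NOT_MATCHED = -1  # sentinel value that the comparison in get_is_match looks for

# int8_t is always signed, so the sentinel survives a round trip.
signed_slot = ctypes.c_int8(NOT_MATCHED).value

# On a platform where plain char is unsigned, the stored byte reads back as 255.
unsigned_slot = ctypes.c_uint8(NOT_MATCHED).value

print(signed_slot == NOT_MATCHED)    # comparison behaves as intended
print(unsigned_slot == NOT_MATCHED)  # comparison silently fails
```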
@@ -586,8 +586,8 @@ cdef char get_is_match(PatternStateC state,
|
||||||
return True
|
return True
|
||||||
|
|
||||||
|
|
||||||
cdef char get_is_final(PatternStateC state) nogil:
|
cdef int8_t get_is_final(PatternStateC state) nogil:
|
||||||
if state.pattern[1].nr_attr == 0 and state.pattern[1].attrs != NULL:
|
if state.pattern[1].quantifier == FINAL_ID:
|
||||||
id_attr = state.pattern[1].attrs[0]
|
id_attr = state.pattern[1].attrs[0]
|
||||||
if id_attr.attr != ID:
|
if id_attr.attr != ID:
|
||||||
with gil:
|
with gil:
|
||||||
|
@@ -597,7 +597,7 @@ cdef char get_is_final(PatternStateC state) nogil:
|
||||||
return 0
|
return 0
|
||||||
|
|
||||||
|
|
||||||
cdef char get_quantifier(PatternStateC state) nogil:
|
cdef int8_t get_quantifier(PatternStateC state) nogil:
|
||||||
return state.pattern.quantifier
|
return state.pattern.quantifier
|
||||||
|
|
||||||
|
|
||||||
|
@@ -626,36 +626,20 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
|
||||||
pattern[i].nr_py = len(predicates)
|
pattern[i].nr_py = len(predicates)
|
||||||
pattern[i].key = hash64(pattern[i].attrs, pattern[i].nr_attr * sizeof(AttrValueC), 0)
|
pattern[i].key = hash64(pattern[i].attrs, pattern[i].nr_attr * sizeof(AttrValueC), 0)
|
||||||
i = len(token_specs)
|
i = len(token_specs)
|
||||||
# Even though here, nr_attr == 0, we're storing the ID value in attrs[0] (bug-prone, thread carefully!)
|
# Use quantifier to identify final ID pattern node (rather than previous
|
||||||
pattern[i].attrs = <AttrValueC*>mem.alloc(2, sizeof(AttrValueC))
|
# uninitialized quantifier == 0/ZERO + nr_attr == 0 + non-zero-length attrs)
|
||||||
|
pattern[i].quantifier = FINAL_ID
|
||||||
|
pattern[i].attrs = <AttrValueC*>mem.alloc(1, sizeof(AttrValueC))
|
||||||
pattern[i].attrs[0].attr = ID
|
pattern[i].attrs[0].attr = ID
|
||||||
pattern[i].attrs[0].value = entity_id
|
pattern[i].attrs[0].value = entity_id
|
||||||
pattern[i].nr_attr = 0
|
pattern[i].nr_attr = 1
|
||||||
pattern[i].nr_extra_attr = 0
|
pattern[i].nr_extra_attr = 0
|
||||||
pattern[i].nr_py = 0
|
pattern[i].nr_py = 0
|
||||||
return pattern
|
return pattern
|
||||||
|
|
||||||
|
|
||||||
cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
|
cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
|
||||||
# There have been a few bugs here. We used to have two functions,
|
while pattern.quantifier != FINAL_ID:
|
||||||
# get_ent_id and get_pattern_key that tried to do the same thing. These
|
|
||||||
# are now unified to try to solve the "ghost match" problem.
|
|
||||||
# Below is the previous implementation of get_ent_id and the comment on it,
|
|
||||||
# preserved for reference while we figure out whether the heisenbug in the
|
|
||||||
# matcher is resolved.
|
|
||||||
#
|
|
||||||
#
|
|
||||||
# cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
|
|
||||||
# # The code was originally designed to always have pattern[1].attrs.value
|
|
||||||
# # be the ent_id when we get to the end of a pattern. However, Issue #2671
|
|
||||||
# # showed this wasn't the case when we had a reject-and-continue before a
|
|
||||||
# # match.
|
|
||||||
# # The patch to #2671 was wrong though, which came up in #3839.
|
|
||||||
# while pattern.attrs.attr != ID:
|
|
||||||
# pattern += 1
|
|
||||||
# return pattern.attrs.value
|
|
||||||
while pattern.nr_attr != 0 or pattern.nr_extra_attr != 0 or pattern.nr_py != 0 \
|
|
||||||
or pattern.quantifier != ZERO:
|
|
||||||
pattern += 1
|
pattern += 1
|
||||||
id_attr = pattern[0].attrs[0]
|
id_attr = pattern[0].attrs[0]
|
||||||
if id_attr.attr != ID:
|
if id_attr.attr != ID:
|
||||||
|
|
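The `FINAL_ID` change above replaces an implicit end-of-pattern test (zero attributes plus an uninitialized quantifier) with an explicit sentinel node. A minimal Python sketch of that idea follows. `Node`, the dictionary-based attributes and the numeric enum values are hypothetical simplifications of the Cython structs, chosen only to show the traversal.

```python
from dataclasses import dataclass
from enum import IntEnum


class Quantifier(IntEnum):
    # Names follow the diff; the numeric values are illustrative only.
    ZERO = 0
    ZERO_PLUS = 1
    ONE = 2
    ONE_PLUS = 3
    FINAL_ID = 4


@dataclass
class Node:
    quantifier: Quantifier
    attrs: dict


def init_pattern(entity_id, token_specs):
    """Build the token nodes, then append one sentinel node carrying the ID."""
    nodes = [Node(Quantifier.ONE, dict(spec)) for spec in token_specs]
    nodes.append(Node(Quantifier.FINAL_ID, {"ID": entity_id}))
    return nodes


def get_ent_id(nodes, start=0):
    """Walk forward from any node until the FINAL_ID sentinel is reached."""
    i = start
    while nodes[i].quantifier != Quantifier.FINAL_ID:
        i += 1
    return nodes[i].attrs["ID"]


pattern = init_pattern(42, [{"LOWER": "hello"}, {}])
print(get_ent_id(pattern))
```

Because the sentinel is identified by its quantifier alone, a wildcard token with no attributes (the empty `{}` spec here) cannot be mistaken for the end of the pattern.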
|
@@ -261,7 +261,11 @@ class EntityRuler(Pipe):
|
||||||
|
|
||||||
# disable the nlp components after this one in case they hadn't been initialized / deserialised yet
|
# disable the nlp components after this one in case they hadn't been initialized / deserialised yet
|
||||||
try:
|
try:
|
||||||
current_index = self.nlp.pipe_names.index(self.name)
|
current_index = -1
|
||||||
|
for i, (name, pipe) in enumerate(self.nlp.pipeline):
|
||||||
|
if self == pipe:
|
||||||
|
current_index = i
|
||||||
|
break
|
||||||
subsequent_pipes = [
|
subsequent_pipes = [
|
||||||
pipe for pipe in self.nlp.pipe_names[current_index + 1 :]
|
pipe for pipe in self.nlp.pipe_names[current_index + 1 :]
|
||||||
]
|
]
|
||||||
|
|
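The `EntityRuler` hunk above stops looking the component up by `self.name` and scans the `(name, pipe)` pairs for the component itself. A small sketch of that lookup is shown below. `index_of_component` is a hypothetical helper, and it compares with `is` where the diff uses `==`; the motivation given in the comment (a component registered under a different name) is an assumption, since the diff does not state one.

```python
def index_of_component(pipeline, component):
    """Return the position of `component` in a list of (name, pipe) pairs, or -1.

    Searching by object rather than by name still works if the component was
    added to the pipeline under a name that differs from its own `name`.
    """
    for i, (name, pipe) in enumerate(pipeline):
        if pipe is component:
            return i
    return -1


ruler = object()
pipeline = [("tagger", object()), ("my_custom_ruler", ruler), ("ner", object())]
current_index = index_of_component(pipeline, ruler)
subsequent = [name for name, _ in pipeline[current_index + 1:]]
print(current_index, subsequent)
```

Note that with the fallback value of `-1`, the slice `[current_index + 1:]` starts at 0 and therefore covers every component in the pipeline.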
|
@@ -172,6 +172,11 @@ def lt_tokenizer():
|
||||||
return get_lang_class("lt")().tokenizer
|
return get_lang_class("lt")().tokenizer
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture(scope="session")
|
||||||
|
def mk_tokenizer():
|
||||||
|
return get_lang_class("mk")().tokenizer
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture(scope="session")
|
@pytest.fixture(scope="session")
|
||||||
def ml_tokenizer():
|
def ml_tokenizer():
|
||||||
return get_lang_class("ml")().tokenizer
|
return get_lang_class("ml")().tokenizer
|
||||||
|
|
|
@@ -123,6 +123,7 @@ def test_doc_api_serialize(en_tokenizer, text):
|
||||||
tokens[0].norm_ = "norm"
|
tokens[0].norm_ = "norm"
|
||||||
tokens.ents = [(tokens.vocab.strings["PRODUCT"], 0, 1)]
|
tokens.ents = [(tokens.vocab.strings["PRODUCT"], 0, 1)]
|
||||||
tokens[0].ent_kb_id_ = "ent_kb_id"
|
tokens[0].ent_kb_id_ = "ent_kb_id"
|
||||||
|
tokens[0].ent_id_ = "ent_id"
|
||||||
new_tokens = Doc(tokens.vocab).from_bytes(tokens.to_bytes())
|
new_tokens = Doc(tokens.vocab).from_bytes(tokens.to_bytes())
|
||||||
assert tokens.text == new_tokens.text
|
assert tokens.text == new_tokens.text
|
||||||
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
||||||
|
@@ -130,6 +131,7 @@ def test_doc_api_serialize(en_tokenizer, text):
|
||||||
assert new_tokens[0].lemma_ == "lemma"
|
assert new_tokens[0].lemma_ == "lemma"
|
||||||
assert new_tokens[0].norm_ == "norm"
|
assert new_tokens[0].norm_ == "norm"
|
||||||
assert new_tokens[0].ent_kb_id_ == "ent_kb_id"
|
assert new_tokens[0].ent_kb_id_ == "ent_kb_id"
|
||||||
|
assert new_tokens[0].ent_id_ == "ent_id"
|
||||||
|
|
||||||
new_tokens = Doc(tokens.vocab).from_bytes(
|
new_tokens = Doc(tokens.vocab).from_bytes(
|
||||||
tokens.to_bytes(exclude=["tensor"]), exclude=["tensor"]
|
tokens.to_bytes(exclude=["tensor"]), exclude=["tensor"]
|
||||||
|
|
|
@@ -416,6 +416,13 @@ def test_doc_retokenizer_merge_lex_attrs(en_vocab):
|
||||||
assert doc[1].is_stop
|
assert doc[1].is_stop
|
||||||
assert not doc[0].is_stop
|
assert not doc[0].is_stop
|
||||||
assert not doc[1].like_num
|
assert not doc[1].like_num
|
||||||
|
# Test that norm is only set on tokens
|
||||||
|
doc = Doc(en_vocab, words=["eins", "zwei", "!", "!"])
|
||||||
|
assert doc[0].norm_ == "eins"
|
||||||
|
with doc.retokenize() as retokenizer:
|
||||||
|
retokenizer.merge(doc[0:1], attrs={"norm": "1"})
|
||||||
|
assert doc[0].norm_ == "1"
|
||||||
|
assert en_vocab["eins"].norm_ == "eins"
|
||||||
|
|
||||||
|
|
||||||
def test_retokenize_skip_duplicates(en_vocab):
|
def test_retokenize_skip_duplicates(en_vocab):
|
||||||
|
|
0
spacy/tests/lang/mk/__init__.py
Normal file
84
spacy/tests/lang/mk/test_text.py
Normal file
|
@@ -0,0 +1,84 @@
|
||||||
|
import pytest
|
||||||
|
from spacy.lang.mk.lex_attrs import like_num
|
||||||
|
|
||||||
|
|
||||||
|
def test_tokenizer_handles_long_text(mk_tokenizer):
|
||||||
|
text = """
|
||||||
|
Во организациските работи или на нашите собранија со членството, никој од нас не зборуваше за
|
||||||
|
организацијата и идеологијата. Работна беше нашата работа, а не идеолошка. Што се однесува до социјализмот на
|
||||||
|
Делчев, неговата дејност зборува сама за себе - спротивно. Во суштина, водачите си имаа свои основни погледи и
|
||||||
|
свои разбирања за положбата и работите, коишто стоеја пред нив и ги завршуваа со голема упорност, настојчивост и
|
||||||
|
насоченост. Значи, идеологија имаше, само што нивната идеологија имаше своја оригиналност. Македонија денеска,
|
||||||
|
чиста рожба на животот и положбата во Македонија, кои му служеа како база на неговите побуди, беше дејност која
|
||||||
|
имаше потреба од ум за да си најде своја смисла. Таквата идеологија и заемното дејство на умот и срцето му
|
||||||
|
помогнаа на Делчев да не се занесе по патот на својата идеологија... Во суштина, Организацијата и нејзините
|
||||||
|
водачи имаа свои разбирања за работите и положбата во идеен поглед, но тоа беше врската, животот и положбата во
|
||||||
|
Македонија и го внесуваа во својата идеологија гласот на своето срце, и на крај, прибегнуваа до умот,
|
||||||
|
за да најдат смисла или да ѝ дадат. Тоа содејство и заемен сооднос на умот и срцето му помогнаа на Делчев да ја
|
||||||
|
држи својата идеологија во сообразност со положбата на работите... Водачите навистина направија една жртва
|
||||||
|
бидејќи на населението не му зборуваа за своите мисли и идеи. Тие се одрекоа од секаква субјективност во своите
|
||||||
|
мисли. Целта беше да не се зголемуваат целите и задачите како и преданоста во работата. Населението не можеше да
|
||||||
|
ги разбере овие идеи...
|
||||||
|
"""
|
||||||
|
tokens = mk_tokenizer(text)
|
||||||
|
assert len(tokens) == 297
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize(
|
||||||
|
"word,match",
|
||||||
|
[
|
||||||
|
("10", True),
|
||||||
|
("1", True),
|
||||||
|
("10.000", True),
|
||||||
|
("1000", True),
|
||||||
|
("бројка", False),
|
||||||
|
("999,0", True),
|
||||||
|
("еден", True),
|
||||||
|
("два", True),
|
||||||
|
("цифра", False),
|
||||||
|
("десет", True),
|
||||||
|
("сто", True),
|
||||||
|
("број", False),
|
||||||
|
("илјада", True),
|
||||||
|
("илјади", True),
|
||||||
|
("милион", True),
|
||||||
|
(",", False),
|
||||||
|
("милијарда", True),
|
||||||
|
("билион", True),
|
||||||
|
]
|
||||||
|
)
|
||||||
|
def test_mk_lex_attrs_like_number(mk_tokenizer, word, match):
|
||||||
|
tokens = mk_tokenizer(word)
|
||||||
|
assert len(tokens) == 1
|
||||||
|
assert tokens[0].like_num == match
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize(
|
||||||
|
"word",
|
||||||
|
[
|
||||||
|
"двесте",
|
||||||
|
"два-три",
|
||||||
|
"пет-шест"
|
||||||
|
]
|
||||||
|
)
|
||||||
|
def test_mk_lex_attrs_capitals(word):
|
||||||
|
assert like_num(word)
|
||||||
|
assert like_num(word.upper())
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize(
|
||||||
|
"word",
|
||||||
|
[
|
||||||
|
"првиот",
|
||||||
|
"втора",
|
||||||
|
"четврт",
|
||||||
|
"четвртата",
|
||||||
|
"петти",
|
||||||
|
"петто",
|
||||||
|
"стоти",
|
||||||
|
"шеесетите",
|
||||||
|
"седумдесетите"
|
||||||
|
]
|
||||||
|
)
|
||||||
|
def test_mk_lex_attrs_like_number_for_ordinal(word):
|
||||||
|
assert like_num(word)
|
|
@@ -2,6 +2,27 @@ import pytest
|
||||||
from spacy.lang.tr.lex_attrs import like_num
|
from spacy.lang.tr.lex_attrs import like_num
|
||||||
|
|
||||||
|
|
||||||
|
def test_tr_tokenizer_handles_long_text(tr_tokenizer):
|
||||||
|
text = """Pamuk nasıl ipliğe dönüştürülür?
|
||||||
|
|
||||||
|
Sıkıştırılmış balyalar halindeki pamuk, iplik fabrikasına getirildiğinde hem
|
||||||
|
lifleri birbirine dolaşmıştır, hem de tarladan toplanırken araya bitkinin
|
||||||
|
parçaları karışmıştır. Üstelik balyalardaki pamuğun cinsi aynı olsa bile kalitesi
|
||||||
|
değişeceğinden, önce bütün balyaların birbirine karıştırılarak harmanlanması gerekir.
|
||||||
|
|
||||||
|
Daha sonra pamuk yığınları, liflerin açılıp temizlenmesi için tek bir birim halinde
|
||||||
|
birleştirilmiş çeşitli makinelerden geçirilir.Bunlardan biri, dönen tokmaklarıyla
|
||||||
|
pamuğu dövüp kabartarak dağınık yumaklar haline getiren ve liflerin arasındaki yabancı
|
||||||
|
maddeleri temizleyen hallaç makinesidir. Daha sonra tarak makinesine giren pamuk demetleri,
|
||||||
|
herbirinin yüzeyinde yüzbinlerce incecik iğne bulunan döner silindirlerin arasından geçerek lif lif ayrılır
|
||||||
|
ve tül inceliğinde gevşek bir örtüye dönüşür. Ama bir sonraki makine bu lifleri dağınık
|
||||||
|
ve gevşek bir biçimde birbirine yaklaştırarak 2 cm eninde bir pamuk şeridi haline getirir."""
|
||||||
|
tokens = tr_tokenizer(text)
|
||||||
|
assert len(tokens) == 146
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize(
|
@pytest.mark.parametrize(
|
||||||
"word",
|
"word",
|
||||||
[
|
[
|
||||||
|
|
152
spacy/tests/lang/tr/test_tokenizer.py
Normal file
|
@@ -0,0 +1,152 @@
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
ABBREV_TESTS = [
|
||||||
|
("Dr. Murat Bey ile görüştüm.", ["Dr.", "Murat", "Bey", "ile", "görüştüm", "."]),
|
||||||
|
("Dr.la görüştüm.", ["Dr.la", "görüştüm", "."]),
|
||||||
|
("Dr.'la görüştüm.", ["Dr.'la", "görüştüm", "."]),
|
||||||
|
("TBMM'de çalışıyormuş.", ["TBMM'de", "çalışıyormuş", "."]),
|
||||||
|
("Hem İst. hem Ank. bu konuda gayet iyi durumda.", ["Hem", "İst.", "hem", "Ank.", "bu", "konuda", "gayet", "iyi", "durumda", "."]),
|
||||||
|
("Hem İst. hem Ank.'da yağış var.", ["Hem", "İst.", "hem", "Ank.'da", "yağış", "var", "."]),
|
||||||
|
("Dr.", ["Dr."]),
|
||||||
|
("Yrd.Doç.", ["Yrd.Doç."]),
|
||||||
|
("Prof.'un", ["Prof.'un"]),
|
||||||
|
("Böl.'nde", ["Böl.'nde"]),
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
URL_TESTS = [
|
||||||
|
("Bizler de www.duygu.com.tr adında bir websitesi kurduk.", ["Bizler", "de", "www.duygu.com.tr", "adında", "bir", "websitesi", "kurduk", "."]),
|
||||||
|
("Bizler de https://www.duygu.com.tr adında bir websitesi kurduk.", ["Bizler", "de", "https://www.duygu.com.tr", "adında", "bir", "websitesi", "kurduk", "."]),
|
||||||
|
("Bizler de www.duygu.com.tr'dan satın aldık.", ["Bizler", "de", "www.duygu.com.tr'dan", "satın", "aldık", "."]),
|
||||||
|
("Bizler de https://www.duygu.com.tr'dan satın aldık.", ["Bizler", "de", "https://www.duygu.com.tr'dan", "satın", "aldık", "."]),
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
NUMBER_TESTS = [
|
||||||
|
("Rakamla 6 yazılıydı.", ["Rakamla", "6", "yazılıydı", "."]),
|
||||||
|
("Hava -4 dereceydi.", ["Hava", "-4", "dereceydi", "."]),
|
||||||
|
("Hava sıcaklığı -4ten +6ya yükseldi.", ["Hava", "sıcaklığı", "-4ten", "+6ya", "yükseldi", "."]),
|
||||||
|
("Hava sıcaklığı -4'ten +6'ya yükseldi.", ["Hava", "sıcaklığı", "-4'ten", "+6'ya", "yükseldi", "."]),
|
||||||
|
("Yarışta 6. oldum.", ["Yarışta", "6.", "oldum", "."]),
|
||||||
|
("Yarışta 438547745. oldum.", ["Yarışta", "438547745.", "oldum", "."]),
|
||||||
|
("Kitap IV. Murat hakkında.", ["Kitap", "IV.", "Murat", "hakkında", "."]),
|
||||||
|
#("Bana söylediği sayı 6.", ["Bana", "söylediği", "sayı", "6", "."]),
|
||||||
|
("Saat 6'da buluşalım.", ["Saat", "6'da", "buluşalım", "."]),
|
||||||
|
("Saat 6dan sonra buluşalım.", ["Saat", "6dan", "sonra", "buluşalım", "."]),
|
||||||
|
("6.dan sonra saymadım.", ["6.dan", "sonra", "saymadım", "."]),
|
||||||
|
("6.'dan sonra saymadım.", ["6.'dan", "sonra", "saymadım", "."]),
|
||||||
|
("Saat 6'ydı.", ["Saat", "6'ydı", "."]),
|
||||||
|
("5'te", ["5'te"]),
|
||||||
|
("6'da", ["6'da"]),
|
||||||
|
("9dan", ["9dan"]),
|
||||||
|
("19'da", ["19'da"]),
|
||||||
|
("VI'da", ["VI'da"]),
|
||||||
|
("5.", ["5."]),
|
||||||
|
("72.", ["72."]),
|
||||||
|
("VI.", ["VI."]),
|
||||||
|
("6.'dan", ["6.'dan"]),
|
||||||
|
("19.'dan", ["19.'dan"]),
|
||||||
|
("6.dan", ["6.dan"]),
|
||||||
|
("16.dan", ["16.dan"]),
|
||||||
|
("VI.'dan", ["VI.'dan"]),
|
||||||
|
("VI.dan", ["VI.dan"]),
|
||||||
|
("Hepsi 1994 yılında oldu.", ["Hepsi", "1994", "yılında", "oldu", "."]),
|
||||||
|
("Hepsi 1994'te oldu.", ["Hepsi", "1994'te", "oldu", "."]),
|
||||||
|
("2/3 tarihli faturayı bulamadım.", ["2/3", "tarihli", "faturayı", "bulamadım", "."]),
|
||||||
|
("2.3 tarihli faturayı bulamadım.", ["2.3", "tarihli", "faturayı", "bulamadım", "."]),
|
||||||
|
("2.3. tarihli faturayı bulamadım.", ["2.3.", "tarihli", "faturayı", "bulamadım", "."]),
|
||||||
|
("2/3/2020 tarihli faturayı bulamadm.", ["2/3/2020", "tarihli", "faturayı", "bulamadm", "."]),
|
||||||
|
("2/3/1987 tarihinden beri burda yaşıyorum.", ["2/3/1987", "tarihinden", "beri", "burda", "yaşıyorum", "."]),
|
||||||
|
("2-3-1987 tarihinden beri burdayım.", ["2-3-1987", "tarihinden", "beri", "burdayım", "."]),
|
||||||
|
("2.3.1987 tarihinden beri burdayım.", ["2.3.1987", "tarihinden", "beri", "burdayım", "."]),
|
||||||
|
("Bu olay 2005-2006 tarihleri arasında oldu.", ["Bu", "olay", "2005", "-", "2006", "tarihleri", "arasında", "oldu", "."]),
|
||||||
|
("Bu olay 4/12/2005-21/3/2006 tarihleri arasında oldu.", ["Bu", "olay", "4/12/2005", "-", "21/3/2006", "tarihleri", "arasında", "oldu", ".",]),
|
||||||
|
("Ek fıkra: 5/11/2003-4999/3 maddesine göre uygundur.", ["Ek", "fıkra", ":", "5/11/2003", "-", "4999/3", "maddesine", "göre", "uygundur", "."]),
|
||||||
|
("2/A alanları: 6831 sayılı Kanunun 2nci maddesinin birinci fıkrasının (A) bendine göre", ["2/A", "alanları", ":", "6831", "sayılı", "Kanunun", "2nci", "maddesinin", "birinci", "fıkrasının", "(", "A", ")", "bendine", "göre"]),
|
||||||
|
("ŞEHİTTEĞMENKALMAZ Cad. No: 2/311", ["ŞEHİTTEĞMENKALMAZ", "Cad.", "No", ":", "2/311"]),
|
||||||
|
("2-3-2025", ["2-3-2025",]),
|
||||||
|
("2/3/2025", ["2/3/2025"]),
|
||||||
|
("Yıllardır 0.5 uç kullanıyorum.", ["Yıllardır", "0.5", "uç", "kullanıyorum", "."]),
|
||||||
|
("Kan değerlerim 0.5-0.7 arasıydı.", ["Kan", "değerlerim", "0.5", "-", "0.7", "arasıydı", "."]),
|
||||||
|
("0.5", ["0.5"]),
|
||||||
|
("1/2", ["1/2"]),
|
||||||
|
("%1", ["%", "1"]),
|
||||||
|
("%1lik", ["%", "1lik"]),
|
||||||
|
("%1'lik", ["%", "1'lik"]),
|
||||||
|
("%1lik dilim", ["%", "1lik", "dilim"]),
|
||||||
|
("%1'lik dilim", ["%", "1'lik", "dilim"]),
|
||||||
|
("%1.5", ["%", "1.5"]),
|
||||||
|
#("%1-%2 arası büyüme bekleniyor.", ["%", "1", "-", "%", "2", "arası", "büyüme", "bekleniyor", "."]),
|
||||||
|
("%1-2 arası büyüme bekliyoruz.", ["%", "1", "-", "2", "arası", "büyüme", "bekliyoruz", "."]),
|
||||||
|
("%11-12 arası büyüme bekliyoruz.", ["%", "11", "-", "12", "arası", "büyüme", "bekliyoruz", "."]),
|
||||||
|
("%1.5luk büyüme bekliyoruz.", ["%", "1.5luk", "büyüme", "bekliyoruz", "."]),
|
||||||
|
("Saat 1-2 arası gelin lütfen.", ["Saat", "1", "-", "2", "arası", "gelin", "lütfen", "."]),
|
||||||
|
("Saat 15:30 gibi buluşalım.", ["Saat", "15:30", "gibi", "buluşalım", "."]),
|
||||||
|
("Saat 15:30'da buluşalım.", ["Saat", "15:30'da", "buluşalım", "."]),
|
||||||
|
("Saat 15.30'da buluşalım.", ["Saat", "15.30'da", "buluşalım", "."]),
|
||||||
|
("Saat 15.30da buluşalım.", ["Saat", "15.30da", "buluşalım", "."]),
|
||||||
|
("Saat 15 civarı buluşalım.", ["Saat", "15", "civarı", "buluşalım", "."]),
|
||||||
|
("9’daki otobüse binsek mi?", ["9’daki", "otobüse", "binsek", "mi", "?"]),
|
||||||
|
("Okulumuz 3-B şubesi", ["Okulumuz", "3-B", "şubesi"]),
|
||||||
|
("Okulumuz 3/B şubesi", ["Okulumuz", "3/B", "şubesi"]),
|
||||||
|
("Okulumuz 3B şubesi", ["Okulumuz", "3B", "şubesi"]),
|
||||||
|
("Okulumuz 3b şubesi", ["Okulumuz", "3b", "şubesi"]),
|
||||||
|
("Antonio Gaudí 20. yüzyılda, 1904-1914 yılları arasında on yıl süren bir reform süreci getirmiştir.", ["Antonio", "Gaudí", "20.", "yüzyılda", ",", "1904", "-", "1914", "yılları", "arasında", "on", "yıl", "süren", "bir", "reform", "süreci", "getirmiştir", "."]),
|
||||||
|
("Dizel yakıtın avro bölgesi ortalaması olan 1,165 avroya kıyasla litre başına 1,335 avroya mal olduğunu gösteriyor.", ["Dizel", "yakıtın", "avro", "bölgesi", "ortalaması", "olan", "1,165", "avroya", "kıyasla", "litre", "başına", "1,335", "avroya", "mal", "olduğunu", "gösteriyor", "."]),
|
||||||
|
("Marcus Antonius M.Ö. 1 Ocak 49'da, Sezar'dan Vali'nin kendisini barış dostu ilan ettiği bir bildiri yayınlamıştır.", ["Marcus", "Antonius", "M.Ö.", "1", "Ocak", "49'da", ",", "Sezar'dan", "Vali'nin", "kendisini", "barış", "dostu", "ilan", "ettiği", "bir", "bildiri", "yayınlamıştır", "."])
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
PUNCT_TESTS = [
|
||||||
|
("Gitmedim dedim ya!", ["Gitmedim", "dedim", "ya", "!"]),
|
||||||
|
("Gitmedim dedim ya!!", ["Gitmedim", "dedim", "ya", "!", "!"]),
|
||||||
|
("Gitsek mi?", ["Gitsek", "mi", "?"]),
|
||||||
|
("Gitsek mi??", ["Gitsek", "mi", "?", "?"]),
|
||||||
|
("Gitsek mi?!?", ["Gitsek", "mi", "?", "!", "?"]),
|
||||||
|
("Ankara - Antalya arası otobüs işliyor.", ["Ankara", "-", "Antalya", "arası", "otobüs", "işliyor", "."]),
|
||||||
|
("Ankara-Antalya arası otobüs işliyor.", ["Ankara", "-", "Antalya", "arası", "otobüs", "işliyor", "."]),
|
||||||
|
("Sen--ben, ya da onlar.", ["Sen", "--", "ben", ",", "ya", "da", "onlar", "."]),
|
||||||
|
("Senden, benden, bizden şarkısını biliyor musun?", ["Senden", ",", "benden", ",", "bizden", "şarkısını", "biliyor", "musun", "?"]),
|
||||||
|
("Akif'le geldik, sonra da o ayrıldı.", ["Akif'le", "geldik", ",", "sonra", "da", "o", "ayrıldı", "."]),
|
||||||
|
("Bu adam ne dedi şimdi???", ["Bu", "adam", "ne", "dedi", "şimdi", "?", "?", "?"]),
|
||||||
|
("Yok hasta olmuş, yok annesi hastaymış, bahaneler işte...", ["Yok", "hasta", "olmuş", ",", "yok", "annesi", "hastaymış", ",", "bahaneler", "işte", "..."]),
|
||||||
|
("Ankara'dan İstanbul'a ... bir aşk hikayesi.", ["Ankara'dan", "İstanbul'a", "...", "bir", "aşk", "hikayesi", "."]),
|
||||||
|
("Ahmet'te", ["Ahmet'te"]),
|
||||||
|
("İstanbul'da", ["İstanbul'da"]),
|
||||||
|
]
|
||||||
|
|
||||||
|
GENERAL_TESTS = [
|
||||||
|
("1914'teki Endurance seferinde, Sir Ernest Shackleton'ın kaptanlığını yaptığı İngiliz Endurance gemisi yirmi sekiz kişi ile Antarktika'yı geçmek üzere yelken açtı.", ["1914'teki", "Endurance", "seferinde", ",", "Sir", "Ernest", "Shackleton'ın", "kaptanlığını", "yaptığı", "İngiliz", "Endurance", "gemisi", "yirmi", "sekiz", "kişi", "ile", "Antarktika'yı", "geçmek", "üzere", "yelken", "açtı", "."]),
|
||||||
|
("Danışılan \"%100 Cospedal\" olduğunu belirtti.", ["Danışılan", '"', "%", "100", "Cospedal", '"', "olduğunu", "belirtti", "."]),
|
||||||
|
("1976'da parkur artık kullanılmıyordu; 1990'da ise bir yangın, daha sonraları ahırlarla birlikte yıkılacak olan tahta tribünlerden geri kalanları da yok etmişti.", ["1976'da", "parkur", "artık", "kullanılmıyordu", ";", "1990'da", "ise", "bir", "yangın", ",", "daha", "sonraları", "ahırlarla", "birlikte", "yıkılacak", "olan", "tahta", "tribünlerden", "geri", "kalanları", "da", "yok", "etmişti", "."]),
|
||||||
|
("Dahiyane bir ameliyat ve zorlu bir rehabilitasyon sürecinden sonra, tamamen iyileştim.", ["Dahiyane", "bir", "ameliyat", "ve", "zorlu", "bir", "rehabilitasyon", "sürecinden", "sonra", ",", "tamamen", "iyileştim", "."]),
|
||||||
|
("Yaklaşık iki hafta süren bireysel erken oy kullanma döneminin ardından 5,7 milyondan fazla Floridalı sandık başına gitti.", ["Yaklaşık", "iki", "hafta", "süren", "bireysel", "erken", "oy", "kullanma", "döneminin", "ardından", "5,7", "milyondan", "fazla", "Floridalı", "sandık", "başına", "gitti", "."]),
|
||||||
|
("Ancak, bu ABD Çevre Koruma Ajansı'nın dünyayı bu konularda uyarmasının ardından ortaya çıktı.", ["Ancak", ",", "bu", "ABD", "Çevre", "Koruma", "Ajansı'nın", "dünyayı", "bu", "konularda", "uyarmasının", "ardından", "ortaya", "çıktı", "."]),
|
||||||
|
("Ortalama şansa ve 10.000 Sterlin değerinde tahvillere sahip bir yatırımcı yılda 125 Sterlin ikramiye kazanabilir.", ["Ortalama", "şansa", "ve", "10.000", "Sterlin", "değerinde", "tahvillere", "sahip", "bir", "yatırımcı", "yılda", "125", "Sterlin", "ikramiye", "kazanabilir", "."]),
|
||||||
|
("Granit adaları; Seyşeller ve Tioman ile Saint Helena gibi volkanik adaları kapsar.", ["Granit", "adaları", ";", "Seyşeller", "ve", "Tioman", "ile", "Saint", "Helena", "gibi", "volkanik", "adaları", "kapsar", "."]),
|
||||||
|
("Barış antlaşmasıyla İspanya, Amerika'ya Porto Riko, Guam ve Filipinler kolonilerini devretti.", ["Barış", "antlaşmasıyla", "İspanya", ",", "Amerika'ya", "Porto", "Riko", ",", "Guam", "ve", "Filipinler", "kolonilerini", "devretti", "."]),
|
||||||
|
("Makedonya\'nın sınır bölgelerini güvence altına alan Philip, büyük bir Makedon ordusu kurdu ve uzun bir fetih seferi için Trakya\'ya doğru yürüdü.", ["Makedonya\'nın", "sınır", "bölgelerini", "güvence", "altına", "alan", "Philip", ",", "büyük", "bir", "Makedon", "ordusu", "kurdu", "ve", "uzun", "bir", "fetih", "seferi", "için", "Trakya\'ya", "doğru", "yürüdü", "."]),
|
||||||
|
("Fransız gazetesi Le Figaro'ya göre bu hükumet planı sayesinde 42 milyon Euro kazanç sağlanabilir ve elde edilen paranın 15.5 milyonu ulusal güvenlik için kullanılabilir.", ["Fransız", "gazetesi", "Le", "Figaro'ya", "göre", "bu", "hükumet", "planı", "sayesinde", "42", "milyon", "Euro", "kazanç", "sağlanabilir", "ve", "elde", "edilen", "paranın", "15.5", "milyonu", "ulusal", "güvenlik", "için", "kullanılabilir", "."]),
|
||||||
|
("Ortalama şansa ve 10.000 Sterlin değerinde tahvillere sahip bir yatırımcı yılda 125 Sterlin ikramiye kazanabilir.", ["Ortalama", "şansa", "ve", "10.000", "Sterlin", "değerinde", "tahvillere", "sahip", "bir", "yatırımcı", "yılda", "125", "Sterlin", "ikramiye", "kazanabilir", "."]),
|
||||||
|
("3 Kasım Salı günü, Ankara Belediye Başkanı 2014'te hükümetle birlikte oluşturulan kentsel gelişim anlaşmasını askıya alma kararı verdi.", ["3", "Kasım", "Salı", "günü", ",", "Ankara", "Belediye", "Başkanı", "2014'te", "hükümetle", "birlikte", "oluşturulan", "kentsel", "gelişim", "anlaşmasını", "askıya", "alma", "kararı", "verdi", "."]),
|
||||||
|
("Stalin, Abakumov'u Beria'nın enerji bakanlıkları üzerindeki baskınlığına karşı MGB içinde kendi ağını kurmaya teşvik etmeye başlamıştı.", ["Stalin", ",", "Abakumov'u", "Beria'nın", "enerji", "bakanlıkları", "üzerindeki", "baskınlığına", "karşı", "MGB", "içinde", "kendi", "ağını", "kurmaya", "teşvik", "etmeye", "başlamıştı", "."]),
|
||||||
|
("Güney Avrupa'daki kazı alanlarının çoğunluğu gibi, bu bulgu M.Ö. 5. yüzyılın başlar", ["Güney", "Avrupa'daki", "kazı", "alanlarının", "çoğunluğu", "gibi", ",", "bu", "bulgu", "M.Ö.", "5.", "yüzyılın", "başlar"]),
|
||||||
|
("Sağlığın bozulması Hitchcock hayatının son yirmi yılında üretimini azalttı.", ["Sağlığın", "bozulması", "Hitchcock", "hayatının", "son", "yirmi", "yılında", "üretimini", "azalttı", "."]),
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
TESTS = (ABBREV_TESTS + URL_TESTS + NUMBER_TESTS + PUNCT_TESTS + GENERAL_TESTS)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("text,expected_tokens", TESTS)
|
||||||
|
def test_tr_tokenizer_handles_allcases(tr_tokenizer, text, expected_tokens):
|
||||||
|
tokens = tr_tokenizer(text)
|
||||||
|
token_list = [token.text for token in tokens if not token.is_space]
|
||||||
|
print(token_list)
|
||||||
|
assert expected_tokens == token_list
|
||||||
|
|
|
@ -457,6 +457,7 @@ def test_attr_pipeline_checks(en_vocab):
|
||||||
([{"IS_LEFT_PUNCT": True}], "``"),
|
([{"IS_LEFT_PUNCT": True}], "``"),
|
||||||
([{"IS_RIGHT_PUNCT": True}], "''"),
|
([{"IS_RIGHT_PUNCT": True}], "''"),
|
||||||
([{"IS_STOP": True}], "the"),
|
([{"IS_STOP": True}], "the"),
|
||||||
|
([{"SPACY": True}], "the"),
|
||||||
([{"LIKE_NUM": True}], "1"),
|
([{"LIKE_NUM": True}], "1"),
|
||||||
([{"LIKE_URL": True}], "http://example.com"),
|
([{"LIKE_URL": True}], "http://example.com"),
|
||||||
([{"LIKE_EMAIL": True}], "mail@example.com"),
|
([{"LIKE_EMAIL": True}], "mail@example.com"),
|
||||||
|
|
|
@ -4,7 +4,9 @@ from pathlib import Path
|
||||||
|
|
||||||
def test_build_dependencies():
|
def test_build_dependencies():
|
||||||
# Check that library requirements are pinned exactly the same across different setup files.
|
# Check that library requirements are pinned exactly the same across different setup files.
|
||||||
|
# TODO: correct checks for numpy rather than ignoring
|
||||||
libs_ignore_requirements = [
|
libs_ignore_requirements = [
|
||||||
|
"numpy",
|
||||||
"pytest",
|
"pytest",
|
||||||
"pytest-timeout",
|
"pytest-timeout",
|
||||||
"mock",
|
"mock",
|
||||||
|
@ -12,6 +14,7 @@ def test_build_dependencies():
|
||||||
]
|
]
|
||||||
# ignore language-specific packages that shouldn't be installed by all
|
# ignore language-specific packages that shouldn't be installed by all
|
||||||
libs_ignore_setup = [
|
libs_ignore_setup = [
|
||||||
|
"numpy",
|
||||||
"fugashi",
|
"fugashi",
|
||||||
"natto-py",
|
"natto-py",
|
||||||
"pythainlp",
|
"pythainlp",
|
||||||
|
@ -67,7 +70,7 @@ def test_build_dependencies():
|
||||||
line = line.strip().strip(",").strip('"')
|
line = line.strip().strip(",").strip('"')
|
||||||
if not line.startswith("#"):
|
if not line.startswith("#"):
|
||||||
lib, v = _parse_req(line)
|
lib, v = _parse_req(line)
|
||||||
if lib:
|
if lib and lib not in libs_ignore_requirements:
|
||||||
req_v = req_dict.get(lib, None)
|
req_v = req_dict.get(lib, None)
|
||||||
assert (lib + v) == (lib + req_v), (
|
assert (lib + v) == (lib + req_v), (
|
||||||
"{} has different version in pyproject.toml and in requirements.txt: "
|
"{} has different version in pyproject.toml and in requirements.txt: "
|
||||||
|
|
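The hunk above makes the pin comparison skip any library listed in `libs_ignore_requirements`. A minimal standalone sketch of that check follows, assuming a simplified parser: `parse_req` and `find_mismatches` are hypothetical names for illustration, not the helpers used in the actual test, and the real `_parse_req` is not shown in this diff.

```python
import re


def parse_req(line):
    # Hypothetical simplified parser: split "name>=1.0,<2.0" into
    # the package name and the version specifier that follows it.
    match = re.match(r"^([A-Za-z0-9_.-]+)(.*)$", line.strip())
    if not match:
        return None, None
    return match.group(1), match.group(2).strip()


def find_mismatches(requirements, setup_lines, ignore):
    # Map each library in the requirements list to its version specifier.
    req_dict = dict(parse_req(line) for line in requirements)
    mismatches = []
    for line in setup_lines:
        # Same clean-up as in the hunk: drop whitespace, trailing comma, quotes.
        line = line.strip().strip(",").strip('"')
        if line.startswith("#"):
            continue
        lib, v = parse_req(line)
        # Ignored libraries are skipped instead of compared.
        if lib and lib not in ignore:
            if req_dict.get(lib) != v:
                mismatches.append(lib)
    return mismatches
```

With `numpy` in the ignore list, a differing numpy pin is no longer reported, which matches the intent of the `TODO` comment in the hunk.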
|
@ -197,3 +197,21 @@ def test_entity_ruler_overlapping_spans(nlp):
|
||||||
doc = ruler(nlp.make_doc("foo bar baz"))
|
doc = ruler(nlp.make_doc("foo bar baz"))
|
||||||
assert len(doc.ents) == 1
|
assert len(doc.ents) == 1
|
||||||
assert doc.ents[0].label_ == "FOOBAR"
|
assert doc.ents[0].label_ == "FOOBAR"
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("n_process", [1, 2])
|
||||||
|
def test_entity_ruler_multiprocessing(nlp, n_process):
|
||||||
|
texts = [
|
||||||
|
"I enjoy eating Pizza Hut pizza."
|
||||||
|
]
|
||||||
|
|
||||||
|
patterns = [
|
||||||
|
{"label": "FASTFOOD", "pattern": "Pizza Hut", "id": "1234"}
|
||||||
|
]
|
||||||
|
|
||||||
|
ruler = nlp.add_pipe("entity_ruler")
|
||||||
|
ruler.add_patterns(patterns)
|
||||||
|
|
||||||
|
for doc in nlp.pipe(texts, n_process=n_process):
|
||||||
|
for ent in doc.ents:
|
||||||
|
assert ent.ent_id_ == "1234"
|
||||||
|
|
|
@ -404,9 +404,7 @@ cdef class Tokenizer:
|
||||||
cdef unicode minus_suf
|
cdef unicode minus_suf
|
||||||
cdef size_t last_size = 0
|
cdef size_t last_size = 0
|
||||||
while string and len(string) != last_size:
|
while string and len(string) != last_size:
|
||||||
if self.token_match and self.token_match(string) \
|
if self.token_match and self.token_match(string):
|
||||||
and not self.find_prefix(string) \
|
|
||||||
and not self.find_suffix(string):
|
|
||||||
break
|
break
|
||||||
if with_special_cases and self._specials.get(hash_string(string)) != NULL:
|
if with_special_cases and self._specials.get(hash_string(string)) != NULL:
|
||||||
break
|
break
|
||||||
|
@ -679,6 +677,8 @@ cdef class Tokenizer:
|
||||||
break
|
break
|
||||||
suffixes.append(("SUFFIX", substring[split:]))
|
suffixes.append(("SUFFIX", substring[split:]))
|
||||||
substring = substring[:split]
|
substring = substring[:split]
|
||||||
|
if len(substring) == 0:
|
||||||
|
continue
|
||||||
if token_match(substring):
|
if token_match(substring):
|
||||||
tokens.append(("TOKEN_MATCH", substring))
|
tokens.append(("TOKEN_MATCH", substring))
|
||||||
substring = ''
|
substring = ''
|
||||||
|
|
|
@ -11,7 +11,7 @@ from .span cimport Span
|
||||||
from .token cimport Token
|
from .token cimport Token
|
||||||
from ..lexeme cimport Lexeme, EMPTY_LEXEME
|
from ..lexeme cimport Lexeme, EMPTY_LEXEME
|
||||||
from ..structs cimport LexemeC, TokenC
|
from ..structs cimport LexemeC, TokenC
|
||||||
from ..attrs cimport MORPH
|
from ..attrs cimport MORPH, NORM
|
||||||
from ..vocab cimport Vocab
|
from ..vocab cimport Vocab
|
||||||
|
|
||||||
from .underscore import is_writable_attr
|
from .underscore import is_writable_attr
|
||||||
|
@ -372,9 +372,10 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
|
||||||
# Set attributes on both token and lexeme to take care of token
|
# Set attributes on both token and lexeme to take care of token
|
||||||
# attribute vs. lexical attribute without having to enumerate
|
# attribute vs. lexical attribute without having to enumerate
|
||||||
# them. If an attribute name is not valid, set_struct_attr will
|
# them. If an attribute name is not valid, set_struct_attr will
|
||||||
# ignore it.
|
# ignore it. Exception: set NORM only on tokens.
|
||||||
Token.set_struct_attr(token, attr_name, get_string_id(attr_value))
|
Token.set_struct_attr(token, attr_name, get_string_id(attr_value))
|
||||||
Lexeme.set_struct_attr(<LexemeC*>token.lex, attr_name, get_string_id(attr_value))
|
if attr_name != NORM:
|
||||||
|
Lexeme.set_struct_attr(<LexemeC*>token.lex, attr_name, get_string_id(attr_value))
|
||||||
# Assign correct dependencies to the inner token
|
# Assign correct dependencies to the inner token
|
||||||
for i, head in enumerate(heads):
|
for i, head in enumerate(heads):
|
||||||
doc.c[token_index + i].head = head
|
doc.c[token_index + i].head = head
|
||||||
|
@ -435,6 +436,7 @@ def set_token_attrs(Token py_token, attrs):
|
||||||
# Set attributes on both token and lexeme to take care of token
|
# Set attributes on both token and lexeme to take care of token
|
||||||
# attribute vs. lexical attribute without having to enumerate
|
# attribute vs. lexical attribute without having to enumerate
|
||||||
# them. If an attribute name is not valid, set_struct_attr will
|
# them. If an attribute name is not valid, set_struct_attr will
|
||||||
# ignore it.
|
# ignore it. Exception: set NORM only on tokens.
|
||||||
Token.set_struct_attr(token, attr_name, attr_value)
|
Token.set_struct_attr(token, attr_name, attr_value)
|
||||||
Lexeme.set_struct_attr(<LexemeC*>lex, attr_name, attr_value)
|
if attr_name != NORM:
|
||||||
|
Lexeme.set_struct_attr(<LexemeC*>lex, attr_name, attr_value)
|
||||||
|
|
|
@ -5,7 +5,6 @@ from libc.stdint cimport uint8_t
|
||||||
ctypedef float weight_t
|
ctypedef float weight_t
|
||||||
ctypedef uint64_t hash_t
|
ctypedef uint64_t hash_t
|
||||||
ctypedef uint64_t class_t
|
ctypedef uint64_t class_t
|
||||||
ctypedef char* utf8_t
|
|
||||||
ctypedef uint64_t attr_t
|
ctypedef uint64_t attr_t
|
||||||
ctypedef uint64_t flags_t
|
ctypedef uint64_t flags_t
|
||||||
ctypedef uint16_t len_t
|
ctypedef uint16_t len_t
|
||||||
|
|
|
@ -1295,6 +1295,13 @@ def combine_score_weights(
|
||||||
|
|
||||||
|
|
||||||
class DummyTokenizer:
|
class DummyTokenizer:
|
||||||
|
def __call__(self, text):
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
def pipe(self, texts, **kwargs):
|
||||||
|
for text in texts:
|
||||||
|
yield self(text)
|
||||||
|
|
||||||
# add dummy methods for to_bytes, from_bytes, to_disk and from_disk to
|
# add dummy methods for to_bytes, from_bytes, to_disk and from_disk to
|
||||||
# allow serialization (see #1557)
|
# allow serialization (see #1557)
|
||||||
def to_bytes(self, **kwargs):
|
def to_bytes(self, **kwargs):
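The `pipe` method added to `DummyTokenizer` in this hunk calls the tokenizer once per text and yields the results lazily. A minimal standalone sketch of how a subclass picks it up follows. `WhitespaceTokenizer` is a hypothetical subclass for illustration only, and it returns plain lists rather than spaCy `Doc` objects.

```python
class DummyTokenizer:
    def __call__(self, text):
        raise NotImplementedError

    def pipe(self, texts, **kwargs):
        # Tokenize a stream of texts lazily, one at a time.
        for text in texts:
            yield self(text)


class WhitespaceTokenizer(DummyTokenizer):
    # Hypothetical subclass: only __call__ is defined,
    # pipe() is inherited from the base class.
    def __call__(self, text):
        return text.split()


tokenizer = WhitespaceTokenizer()
docs = list(tokenizer.pipe(["foo bar", "baz"]))
```

Because `pipe` is a generator, nothing is tokenized until the result is iterated. A subclass that does not override `__call__` would raise `NotImplementedError` on the first item.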
|
||||||
|
|
|
@ -4,7 +4,7 @@ from cymem.cymem cimport Pool
|
||||||
from murmurhash.mrmr cimport hash64
|
from murmurhash.mrmr cimport hash64
|
||||||
|
|
||||||
from .structs cimport LexemeC, TokenC
|
from .structs cimport LexemeC, TokenC
|
||||||
from .typedefs cimport utf8_t, attr_t, hash_t
|
from .typedefs cimport attr_t, hash_t
|
||||||
from .strings cimport StringStore
|
from .strings cimport StringStore
|
||||||
from .morphology cimport Morphology
|
from .morphology cimport Morphology
|
||||||
|
|
||||||
|
|
|
@ -305,6 +305,9 @@ cdef class Vocab:
|
||||||
DOCS: https://nightly.spacy.io/api/vocab#prune_vectors
|
DOCS: https://nightly.spacy.io/api/vocab#prune_vectors
|
||||||
"""
|
"""
|
||||||
xp = get_array_module(self.vectors.data)
|
xp = get_array_module(self.vectors.data)
|
||||||
|
# Make sure all vectors are in the vocab
|
||||||
|
for orth in self.vectors:
|
||||||
|
self[orth]
|
||||||
# Make prob negative so it sorts by rank ascending
|
# Make prob negative so it sorts by rank ascending
|
||||||
# (key2row contains the rank)
|
# (key2row contains the rank)
|
||||||
priority = [(-lex.prob, self.vectors.key2row[lex.orth], lex.orth)
|
priority = [(-lex.prob, self.vectors.key2row[lex.orth], lex.orth)
|
||||||
|
|
|
@ -39,7 +39,9 @@ rule-based matching are:
|
||||||
| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ |
|
| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ |
|
||||||
| `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase. ~~bool~~ |
|
| `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase. ~~bool~~ |
|
||||||
| `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word. ~~bool~~ |
|
| `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word. ~~bool~~ |
|
||||||
|
| `IS_SENT_START` | Token is start of sentence. ~~bool~~ |
|
||||||
| `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email. ~~bool~~ |
|
| `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email. ~~bool~~ |
|
||||||
|
| `SPACY` | Token has a trailing space. ~~bool~~ |
|
||||||
| `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. ~~str~~ |
|
| `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. ~~str~~ |
|
||||||
| `ENT_TYPE` | The token's entity label. ~~str~~ |
|
| `ENT_TYPE` | The token's entity label. ~~str~~ |
|
||||||
| `_` <Tag variant="new">2.1</Tag> | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~ |
|
| `_` <Tag variant="new">2.1</Tag> | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~ |
|
||||||
|
@ -61,7 +63,7 @@ matched:
|
||||||
| `!` | Negate the pattern, by requiring it to match exactly 0 times. |
|
| `!` | Negate the pattern, by requiring it to match exactly 0 times. |
|
||||||
| `?` | Make the pattern optional, by allowing it to match 0 or 1 times. |
|
| `?` | Make the pattern optional, by allowing it to match 0 or 1 times. |
|
||||||
| `+` | Require the pattern to match 1 or more times. |
|
| `+` | Require the pattern to match 1 or more times. |
|
||||||
| `*` | Allow the pattern to match 0 or more times. |
|
| `*` | Allow the pattern to match 0 or more times. |
|
||||||
|
|
||||||
Token patterns can also map to a **dictionary of properties** instead of a
|
Token patterns can also map to a **dictionary of properties** instead of a
|
||||||
single value to indicate whether the expected value is a member of a list or how
|
single value to indicate whether the expected value is a member of a list or how
|
||||||
|
|
|
@ -158,21 +158,22 @@ The available token pattern keys correspond to a number of
|
||||||
[`Token` attributes](/api/token#attributes). The supported attributes for
|
[`Token` attributes](/api/token#attributes). The supported attributes for
|
||||||
rule-based matching are:
|
rule-based matching are:
|
||||||
|
|
||||||
| Attribute | Description |
|
| Attribute | Description |
|
||||||
| ----------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
|
| ----------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `ORTH` | The exact verbatim text of a token. ~~str~~ |
|
| `ORTH` | The exact verbatim text of a token. ~~str~~ |
|
||||||
| `TEXT` <Tag variant="new">2.1</Tag> | The exact verbatim text of a token. ~~str~~ |
|
| `TEXT` <Tag variant="new">2.1</Tag> | The exact verbatim text of a token. ~~str~~ |
|
||||||
| `LOWER` | The lowercase form of the token text. ~~str~~ |
|
| `LOWER` | The lowercase form of the token text. ~~str~~ |
|
||||||
| `LENGTH` | The length of the token text. ~~int~~ |
|
| `LENGTH` | The length of the token text. ~~int~~ |
|
||||||
| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ |
|
| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ |
|
||||||
| `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase. ~~bool~~ |
|
| `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase. ~~bool~~ |
|
||||||
| `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word. ~~bool~~ |
|
| `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word. ~~bool~~ |
|
||||||
| `IS_SENT_START` | Token is start of sentence. ~~bool~~ |
|
| `IS_SENT_START` | Token is start of sentence. ~~bool~~ |
|
||||||
| `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email. ~~bool~~ |
|
| `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email. ~~bool~~ |
|
||||||
| `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. ~~str~~ |
|
| `SPACY` | Token has a trailing space. ~~bool~~ |
|
||||||
| `ENT_TYPE` | The token's entity label. ~~str~~ |
|
| `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. Note that the values of these attributes are case-sensitive. For a list of available part-of-speech tags and dependency labels, see the [Annotation Specifications](/api/annotation). ~~str~~ |
|
||||||
| `_` <Tag variant="new">2.1</Tag> | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~ |
|
| `ENT_TYPE` | The token's entity label. ~~str~~ |
|
||||||
| `OP` | [Operator or quantifier](#quantifiers) to determine how often to match a token pattern. ~~str~~ |
|
| `_` <Tag variant="new">2.1</Tag> | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~ |
|
||||||
|
| `OP` | [Operator or quantifier](#quantifiers) to determine how often to match a token pattern. ~~str~~ |
|
||||||
|
|
||||||
<Accordion title="Does it matter if the attribute names are uppercase or lowercase?">
|
<Accordion title="Does it matter if the attribute names are uppercase or lowercase?">
|
||||||
|
|
||||||
|
|
|
@ -199,6 +199,36 @@
|
||||||
"name": "Vietnamese",
|
"name": "Vietnamese",
|
||||||
"dependencies": [{ "name": "Pyvi", "url": "https://github.com/trungtv/pyvi" }]
|
"dependencies": [{ "name": "Pyvi", "url": "https://github.com/trungtv/pyvi" }]
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"code": "lij",
|
||||||
|
"name": "Ligurian",
|
||||||
|
"example": "Sta chì a l'é unna fraxe.",
|
||||||
|
"has_examples": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"code": "hy",
|
||||||
|
"name": "Armenian",
|
||||||
|
"has_examples": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"code": "gu",
|
||||||
|
"name": "Gujarati",
|
||||||
|
"has_examples": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"code": "ml",
|
||||||
|
"name": "Malayalam",
|
||||||
|
"has_examples": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"code": "ne",
|
||||||
|
"name": "Nepali",
|
||||||
|
"has_examples": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"code": "mk",
|
||||||
|
"name": "Macedonian"
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"code": "xx",
|
"code": "xx",
|
||||||
"name": "Multi-language",
|
"name": "Multi-language",
|
||||||
|
|
|
@ -1,5 +1,36 @@
|
||||||
{
|
{
|
||||||
"resources": [
|
"resources": [
|
||||||
|
{
|
||||||
|
"id": "spacy-textblob",
|
||||||
|
"title": "spaCyTextBlob",
|
||||||
|
"slogan": "Easy sentiment analysis for spaCy using TextBlob",
|
||||||
|
"description": "spaCyTextBlob is a pipeline component that enables sentiment analysis using the [TextBlob](https://github.com/sloria/TextBlob) library. It will add the additional extenstion `._.sentiment` to `Doc`, `Span`, and `Token` objects.",
|
||||||
|
"github": "SamEdwardes/spaCyTextBlob",
|
||||||
|
"pip": "spacytextblob",
|
||||||
|
"code_example": [
|
||||||
|
"import spacy",
|
||||||
|
"from spacytextblob.spacytextblob import SpacyTextBlob",
|
||||||
|
"",
|
||||||
|
"nlp = spacy.load('en_core_web_sm')",
|
||||||
|
"spacy_text_blob = SpacyTextBlob()",
|
||||||
|
"nlp.add_pipe(spacy_text_blob)",
|
||||||
|
"text = 'I had a really horrible day. It was the worst day ever! But every now and then I have a really good day that makes me happy.'",
|
||||||
|
"doc = nlp(text)",
|
||||||
|
"doc._.sentiment.polarity # Polarity: -0.125",
|
||||||
|
"doc._.sentiment.subjectivity # Sujectivity: 0.9",
|
||||||
|
"doc._.sentiment.assessments # Assessments: [(['really', 'horrible'], -1.0, 1.0, None), (['worst', '!'], -1.0, 1.0, None), (['really', 'good'], 0.7, 0.6000000000000001, None), (['happy'], 0.8, 1.0, None)]"
|
||||||
|
],
|
||||||
|
"code_language": "python",
|
||||||
|
"url": "https://spacytextblob.netlify.app/",
|
||||||
|
"author": "Sam Edwardes",
|
||||||
|
"author_links": {
|
||||||
|
"twitter": "TheReaLSamlam",
|
||||||
|
"github": "SamEdwardes",
|
||||||
|
"website": "https://samedwardes.com"
|
||||||
|
},
|
||||||
|
"category": ["pipeline"],
|
||||||
|
"tags": ["sentiment", "textblob"]
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"id": "spacy-ray",
|
"id": "spacy-ray",
|
||||||
"title": "spacy-ray",
|
"title": "spacy-ray",
|
||||||
|
@ -788,6 +819,22 @@
|
||||||
"category": ["conversational"],
|
"category": ["conversational"],
|
||||||
"tags": ["chatbots"]
|
"tags": ["chatbots"]
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"id": "mindmeld",
|
||||||
|
"title": "MindMeld - Conversational AI platform",
|
||||||
|
"slogan": "Conversational AI platform for deep-domain voice interfaces and chatbots",
|
||||||
|
"description": "The MindMeld Conversational AI platform is among the most advanced AI platforms for building production-quality conversational applications. It is a Python-based machine learning framework which encompasses all of the algorithms and utilities required for this purpose. (https://github.com/cisco/mindmeld)",
|
||||||
|
"github": "cisco/mindmeld",
|
||||||
|
"pip": "mindmeld",
|
||||||
|
"thumb": "https://www.mindmeld.com/img/mindmeld-logo.png",
|
||||||
|
"category": ["conversational", "ner"],
|
||||||
|
"tags": ["chatbots"],
|
||||||
|
"author": "Cisco",
|
||||||
|
"author_links": {
|
||||||
|
"github": "cisco/mindmeld",
|
||||||
|
"website": "https://www.mindmeld.com/"
|
||||||
|
}
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"id": "torchtext",
|
"id": "torchtext",
|
||||||
"title": "torchtext",
|
"title": "torchtext",
|
||||||
|
@ -1648,7 +1695,7 @@
|
||||||
"",
|
"",
|
||||||
"nlp = spacy.load('en')",
|
"nlp = spacy.load('en')",
|
||||||
"nlp.add_pipe(BeneparComponent('benepar_en'))",
|
"nlp.add_pipe(BeneparComponent('benepar_en'))",
|
||||||
"doc = nlp('The time for action is now. It's never too late to do something.')",
|
"doc = nlp('The time for action is now. It is never too late to do something.')",
|
||||||
"sent = list(doc.sents)[0]",
|
"sent = list(doc.sents)[0]",
|
||||||
"print(sent._.parse_string)",
|
"print(sent._.parse_string)",
|
||||||
"# (S (NP (NP (DT The) (NN time)) (PP (IN for) (NP (NN action)))) (VP (VBZ is) (ADVP (RB now))) (. .))",
|
"# (S (NP (NP (DT The) (NN time)) (PP (IN for) (NP (NN action)))) (VP (VBZ is) (ADVP (RB now))) (. .))",
|
||||||
|
@ -2527,14 +2574,14 @@
|
||||||
"description": "A spaCy rule-based pipeline for identifying positive cases of COVID-19 from clinical text. A version of this system was deployed as part of the US Department of Veterans Affairs biosurveillance response to COVID-19.",
|
"description": "A spaCy rule-based pipeline for identifying positive cases of COVID-19 from clinical text. A version of this system was deployed as part of the US Department of Veterans Affairs biosurveillance response to COVID-19.",
|
||||||
"pip": "cov-bsv",
|
"pip": "cov-bsv",
|
||||||
"code_example": [
|
"code_example": [
|
||||||
"import cov_bsv",
|
"import cov_bsv",
|
||||||
"",
|
"",
|
||||||
"nlp = cov_bsv.load()",
|
"nlp = cov_bsv.load()",
|
||||||
"text = 'Pt tested for COVID-19. His wife was recently diagnosed with novel coronavirus. SARS-COV-2: Detected'",
|
"doc = nlp('Pt tested for COVID-19. His wife was recently diagnosed with novel coronavirus. SARS-COV-2: Detected')",
|
||||||
"",
|
"",
|
||||||
"print(doc.ents)",
|
"print(doc.ents)",
|
||||||
"print(doc._.cov_classification)",
|
"print(doc._.cov_classification)",
|
||||||
"cov_bsv.visualize_doc(doc)"
|
"cov_bsv.visualize_doc(doc)"
|
||||||
],
|
],
|
||||||
"category": ["pipeline", "standalone", "biomedical", "scientific"],
|
"category": ["pipeline", "standalone", "biomedical", "scientific"],
|
||||||
"tags": ["clinical", "epidemiology", "covid-19", "surveillance"],
|
"tags": ["clinical", "epidemiology", "covid-19", "surveillance"],
|
||||||
|
@ -2542,6 +2589,35 @@
|
||||||
"author_links": {
|
"author_links": {
|
||||||
"github": "abchapman93"
|
"github": "abchapman93"
|
||||||
}
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "medspacy",
|
||||||
|
"title": "medspaCy",
|
||||||
|
"thumb": "https://raw.githubusercontent.com/medspacy/medspacy/master/images/medspacy_logo.png",
|
||||||
|
"slogan": "A toolkit for clinical NLP with spaCy.",
|
||||||
|
"github": "medspacy/medspacy",
|
||||||
|
"description": "A toolkit for clinical NLP with spaCy. Features include sentence splitting, section detection, and asserting negation, family history, and uncertainty.",
|
||||||
|
"pip": "medspacy",
|
||||||
|
"code_example": [
|
||||||
|
"import medspacy",
|
||||||
|
"from medspacy.ner import TargetRule",
|
||||||
|
"",
|
||||||
|
"nlp = medspacy.load()",
|
||||||
|
"print(nlp.pipe_names)",
|
||||||
|
"",
|
||||||
|
"nlp.get_pipe('target_matcher').add([TargetRule('stroke', 'CONDITION'), TargetRule('diabetes', 'CONDITION'), TargetRule('pna', 'CONDITION')])",
|
||||||
|
"doc = nlp('Patient has hx of stroke. Mother diagnosed with diabetes. No evidence of pna.')",
|
||||||
|
"",
|
||||||
|
"for ent in doc.ents:",
|
||||||
|
" print(ent, ent._.is_negated, ent._.is_family, ent._.is_historical)",
|
||||||
|
"medspacy.visualization.visualize_ent(doc)"
|
||||||
|
],
|
||||||
|
"category": ["biomedical", "scientific", "research"],
|
||||||
|
"tags": ["clinical"],
|
||||||
|
"author": "medspacy",
|
||||||
|
"author_links": {
|
||||||
|
"github": "medspacy"
|
||||||
|
}
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"id": "rita-dsl",
|
"id": "rita-dsl",
|
||||||
|
@ -2578,6 +2654,32 @@
|
||||||
"author_links": {
|
"author_links": {
|
||||||
"github": "zaibacu"
|
"github": "zaibacu"
|
||||||
}
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "PatternOmatic",
|
||||||
|
"title": "PatternOmatic",
|
||||||
|
"slogan": "Finds linguistic patterns effortlessly",
|
||||||
|
"description": "Discover spaCy's linguistic patterns matching a given set of String samples to be used by the spaCy's Rule Based Matcher",
|
||||||
|
"github": "revuel/PatternOmatic",
|
||||||
|
"pip": "PatternOmatic",
|
||||||
|
"code_example": [
|
||||||
|
"from PatternOmatic.api import find_patterns",
|
||||||
|
"",
|
||||||
|
"samples = ['I am a cat!', 'You are a dog!', 'She is an owl!']",
|
||||||
|
"",
|
||||||
|
"patterns_found, _ = find_patterns(samples)",
|
||||||
|
"",
|
||||||
|
"print(f'Patterns found: {patterns_found}')"
|
||||||
|
],
|
||||||
|
"code_language": "python",
|
||||||
|
"thumb": "https://svgshare.com/i/R3P.svg",
|
||||||
|
"image": "https://svgshare.com/i/R3P.svg",
|
||||||
|
"author": "Miguel Revuelta Espinosa",
|
||||||
|
"author_links": {
|
||||||
|
"github": "revuel"
|
||||||
|
},
|
||||||
|
"category": ["scientific", "research", "standalone"],
|
||||||
|
"tags": ["Evolutionary Computation", "Grammatical Evolution"]
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
|
|
||||||
|
|
|
@ -207,42 +207,49 @@ const Landing = ({ data }) => {
|
||||||
|
|
||||||
<LandingBannerGrid>
|
<LandingBannerGrid>
|
||||||
<LandingBanner
|
<LandingBanner
|
||||||
to="https://course.spacy.io"
|
title="spaCy v3.0 nightly: Transformer-based pipelines, new training system, project templates & more"
|
||||||
button="Start the course"
|
label="Try the pre-release"
|
||||||
background="#f6f6f6"
|
to="https://nightly.spacy.io"
|
||||||
color="#252a33"
|
button="See what's new"
|
||||||
|
background="#8758fe"
|
||||||
|
color="#ffffff"
|
||||||
small
|
small
|
||||||
>
|
>
|
||||||
<Link to="https://course.spacy.io" hidden>
|
spaCy v3.0 features all new <strong>transformer-based pipelines</strong> that
|
||||||
|
bring spaCy's accuracy right up to the current <strong>state-of-the-art</strong>
|
||||||
|
. You can use any pretrained transformer to train your own pipelines, and even
|
||||||
|
share one transformer between multiple components with{' '}
|
||||||
|
<strong>multi-task learning</strong>. Training is now fully configurable and
|
||||||
|
extensible, and you can define your own custom models using{' '}
|
||||||
|
<strong>PyTorch</strong>, <strong>TensorFlow</strong> and other frameworks. The
|
||||||
|
new spaCy projects system lets you describe whole{' '}
|
||||||
|
<strong>end-to-end workflows</strong> in a single file, giving you an easy path
|
||||||
|
from prototype to production, and making it easy to clone and adapt
|
||||||
|
best-practice projects for your own use cases.
|
||||||
|
</LandingBanner>
|
||||||
|
|
||||||
|
<LandingBanner
|
||||||
|
title="Prodigy: Radically efficient machine teaching"
|
||||||
|
label="From the makers of spaCy"
|
||||||
|
to="https://prodi.gy"
|
||||||
|
button="Try it out"
|
||||||
|
background="#f6f6f6"
|
||||||
|
color="#000"
|
||||||
|
small
|
||||||
|
>
|
||||||
|
<Link to="https://prodi.gy" hidden>
|
||||||
<img
|
<img
|
||||||
src={courseImage}
|
src={prodigyImage}
|
||||||
alt="Advanced NLP with spaCy: A free online course"
|
alt="Prodigy: Radically efficient machine teaching"
|
||||||
/>
|
/>
|
||||||
</Link>
|
</Link>
|
||||||
<br />
|
<br />
|
||||||
<br />
|
<br />
|
||||||
In this <strong>free and interactive online course</strong> you’ll learn how to
|
Prodigy is an <strong>annotation tool</strong> so efficient that data scientists
|
||||||
use spaCy to build advanced natural language understanding systems, using both
|
can do the annotation themselves, enabling a new level of rapid iteration.
|
||||||
rule-based and machine learning approaches. It includes{' '}
|
Whether you're working on entity recognition, intent detection or image
|
||||||
<strong>55 exercises</strong> featuring videos, slide decks, multiple-choice
|
classification, Prodigy can help you <strong>train and evaluate</strong> your
|
||||||
questions and interactive coding practice in the browser.
|
models faster.
|
||||||
</LandingBanner>
|
|
||||||
<LandingBanner
|
|
||||||
title="spaCy IRL: Two days of NLP"
|
|
||||||
label="Watch the videos"
|
|
||||||
to="https://www.youtube.com/playlist?list=PLBmcuObd5An4UC6jvK_-eSl6jCvP1gwXc"
|
|
||||||
button="Watch the videos"
|
|
||||||
background="#ffc194"
|
|
||||||
backgroundImage={irlBackground}
|
|
||||||
color="#1a1e23"
|
|
||||||
small
|
|
||||||
>
|
|
||||||
We were pleased to invite the spaCy community and other folks working on NLP to
|
|
||||||
Berlin for a small and intimate event. We booked a beautiful venue, hand-picked
|
|
||||||
an awesome lineup of speakers and scheduled plenty of social time to get to know
|
|
||||||
each other. The YouTube playlist includes 12 talks about NLP research,
|
|
||||||
development and applications, with keynotes by Sebastian Ruder (DeepMind) and
|
|
||||||
Yoav Goldberg (Allen AI).
|
|
||||||
</LandingBanner>
|
</LandingBanner>
|
||||||
</LandingBannerGrid>
|
</LandingBannerGrid>
|
||||||
|
|
||||||
|
|