mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-26 17:24:41 +03:00
Merge branch 'develop' into feature/init-config-cpu-gpu
commit
9d32e839d3
108
.github/contributors/KKsharma99.md
vendored
Normal file
@ -0,0 +1,108 @@
<!-- This agreement was mistakenly submitted as an update to the CONTRIBUTOR_AGREEMENT.md template. Commit: 8a2d22222dec5cf910df5a378cbcd9ea2ab53ec4. It was therefore moved over manually. -->
|
||||
|
||||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Kunal Sharma |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 10/19/2020 |
| GitHub username | KKsharma99 |
| Website (optional) | |
106
.github/contributors/borijang.md
vendored
Normal file
@ -0,0 +1,106 @@
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
|
||||
|
||||
* [ ] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [x] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Borijan Georgievski |
| Company name (if applicable) | Netcetera |
| Title or role (if applicable) | Data Scientist |
| Date | 2020.10.09 |
| GitHub username | borijang |
| Website (optional) | |
106
.github/contributors/danielvasic.md
vendored
Normal file
@ -0,0 +1,106 @@
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Daniel Vasić |
| Company name (if applicable) | University of Mostar |
| Title or role (if applicable) | Teaching assistant |
| Date | 13/10/2020 |
| GitHub username | danielvasic |
| Website (optional) | |
106
.github/contributors/forest1988.md
vendored
Normal file
@ -0,0 +1,106 @@
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Yusuke Mori |
| Company name (if applicable) | |
| Title or role (if applicable) | Ph.D. student |
| Date | 2020/11/22 |
| GitHub username | forest1988 |
| Website (optional) | https://forest1988.github.io |
106
.github/contributors/jabortell.md
vendored
Normal file
@ -0,0 +1,106 @@
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jacob Bortell |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-11-20 |
| GitHub username | jabortell |
| Website (optional) | |
106
.github/contributors/revuel.md
vendored
Normal file
@ -0,0 +1,106 @@
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Miguel Revuelta |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2020-11-17 |
|
||||
| GitHub username | revuel |
|
||||
| Website (optional) | |
|
106
.github/contributors/robertsipek.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Robert Šípek |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 22.10.2020 |
|
||||
| GitHub username | @robertsipek |
|
||||
| Website (optional) | |
|
106
.github/contributors/vha14.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Vu Ha |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 10-23-2020 |
|
||||
| GitHub username | vha14 |
|
||||
| Website (optional) | |
|
106
.github/contributors/walterhenry.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Walter Henry |
|
||||
| Company name (if applicable) | ExplosionAI GmbH |
|
||||
| Title or role (if applicable) | Executive Assistant |
|
||||
| Date | September 14, 2020 |
|
||||
| GitHub username | walterhenry |
|
||||
| Website (optional) | |
|
|
@ -2,96 +2,113 @@ trigger:
|
|||
batch: true
|
||||
branches:
|
||||
include:
|
||||
- '*'
|
||||
- "*"
|
||||
exclude:
|
||||
- 'spacy.io'
|
||||
- "spacy.io"
|
||||
paths:
|
||||
exclude:
|
||||
- 'website/*'
|
||||
- '*.md'
|
||||
- "website/*"
|
||||
- "*.md"
|
||||
pr:
|
||||
paths:
|
||||
exclude:
|
||||
- 'website/*'
|
||||
- '*.md'
|
||||
- "website/*"
|
||||
- "*.md"
|
||||
|
||||
jobs:
|
||||
# Perform basic checks for most important errors (syntax etc.) Uses the config
|
||||
# defined in .flake8 and overwrites the selected codes.
|
||||
- job: "Validate"
|
||||
pool:
|
||||
vmImage: "ubuntu-16.04"
|
||||
steps:
|
||||
- task: UsePythonVersion@0
|
||||
inputs:
|
||||
versionSpec: "3.7"
|
||||
- script: |
|
||||
pip install flake8==3.5.0
|
||||
python -m flake8 spacy --count --select=E901,E999,F821,F822,F823 --show-source --statistics
|
||||
displayName: "flake8"
|
||||
|
||||
# Perform basic checks for most important errors (syntax etc.) Uses the config
|
||||
# defined in .flake8 and overwrites the selected codes.
|
||||
- job: 'Validate'
|
||||
pool:
|
||||
vmImage: 'ubuntu-16.04'
|
||||
steps:
|
||||
- task: UsePythonVersion@0
|
||||
inputs:
|
||||
versionSpec: '3.7'
|
||||
- script: |
|
||||
pip install flake8==3.5.0
|
||||
python -m flake8 spacy --count --select=E901,E999,F821,F822,F823 --show-source --statistics
|
||||
displayName: 'flake8'
|
||||
- job: "Test"
|
||||
dependsOn: "Validate"
|
||||
strategy:
|
||||
matrix:
|
||||
Python36Linux:
|
||||
imageName: "ubuntu-16.04"
|
||||
python.version: "3.6"
|
||||
Python36Windows:
|
||||
imageName: "vs2017-win2016"
|
||||
python.version: "3.6"
|
||||
Python36Mac:
|
||||
imageName: "macos-10.14"
|
||||
python.version: "3.6"
|
||||
# Don't test on 3.7 for now to speed up builds
|
||||
# Python37Linux:
|
||||
# imageName: 'ubuntu-16.04'
|
||||
# python.version: '3.7'
|
||||
# Python37Windows:
|
||||
# imageName: 'vs2017-win2016'
|
||||
# python.version: '3.7'
|
||||
# Python37Mac:
|
||||
# imageName: 'macos-10.14'
|
||||
# python.version: '3.7'
|
||||
Python38Linux:
|
||||
imageName: "ubuntu-16.04"
|
||||
python.version: "3.8"
|
||||
Python38Windows:
|
||||
imageName: "vs2017-win2016"
|
||||
python.version: "3.8"
|
||||
Python38Mac:
|
||||
imageName: "macos-10.14"
|
||||
python.version: "3.8"
|
||||
# Python39Linux:
|
||||
# imageName: "ubuntu-16.04"
|
||||
# python.version: "3.9"
|
||||
# Python39Windows:
|
||||
# imageName: "vs2017-win2016"
|
||||
# python.version: "3.9"
|
||||
# Python39Mac:
|
||||
# imageName: "macos-10.14"
|
||||
# python.version: "3.9"
|
||||
maxParallel: 4
|
||||
pool:
|
||||
vmImage: $(imageName)
|
||||
|
||||
- job: 'Test'
|
||||
dependsOn: 'Validate'
|
||||
strategy:
|
||||
matrix:
|
||||
Python36Linux:
|
||||
imageName: 'ubuntu-16.04'
|
||||
python.version: '3.6'
|
||||
Python36Windows:
|
||||
imageName: 'vs2017-win2016'
|
||||
python.version: '3.6'
|
||||
Python36Mac:
|
||||
imageName: 'macos-10.14'
|
||||
python.version: '3.6'
|
||||
# Don't test on 3.7 for now to speed up builds
|
||||
# Python37Linux:
|
||||
# imageName: 'ubuntu-16.04'
|
||||
# python.version: '3.7'
|
||||
# Python37Windows:
|
||||
# imageName: 'vs2017-win2016'
|
||||
# python.version: '3.7'
|
||||
# Python37Mac:
|
||||
# imageName: 'macos-10.14'
|
||||
# python.version: '3.7'
|
||||
Python38Linux:
|
||||
imageName: 'ubuntu-16.04'
|
||||
python.version: '3.8'
|
||||
Python38Windows:
|
||||
imageName: 'vs2017-win2016'
|
||||
python.version: '3.8'
|
||||
Python38Mac:
|
||||
imageName: 'macos-10.14'
|
||||
python.version: '3.8'
|
||||
maxParallel: 4
|
||||
pool:
|
||||
vmImage: $(imageName)
|
||||
steps:
|
||||
- task: UsePythonVersion@0
|
||||
inputs:
|
||||
versionSpec: "$(python.version)"
|
||||
architecture: "x64"
|
||||
|
||||
steps:
|
||||
- task: UsePythonVersion@0
|
||||
inputs:
|
||||
versionSpec: '$(python.version)'
|
||||
architecture: 'x64'
|
||||
- script: |
|
||||
python -m pip install -U pip setuptools
|
||||
pip install -r requirements.txt
|
||||
displayName: "Install dependencies"
|
||||
condition: not(eq(variables['python.version'], '3.5'))
|
||||
|
||||
- script: |
|
||||
python -m pip install -U setuptools
|
||||
pip install -r requirements.txt
|
||||
displayName: 'Install dependencies'
|
||||
- script: |
|
||||
python setup.py build_ext --inplace -j 2
|
||||
python setup.py sdist --formats=gztar
|
||||
displayName: "Compile and build sdist"
|
||||
|
||||
- script: |
|
||||
python setup.py build_ext --inplace
|
||||
python setup.py sdist --formats=gztar
|
||||
displayName: 'Compile and build sdist'
|
||||
- task: DeleteFiles@1
|
||||
inputs:
|
||||
contents: "spacy"
|
||||
displayName: "Delete source directory"
|
||||
|
||||
- task: DeleteFiles@1
|
||||
inputs:
|
||||
contents: 'spacy'
|
||||
displayName: 'Delete source directory'
|
||||
- script: |
|
||||
pip freeze > installed.txt
|
||||
pip uninstall -y -r installed.txt
|
||||
displayName: "Uninstall all packages"
|
||||
|
||||
- bash: |
|
||||
SDIST=$(python -c "import os;print(os.listdir('./dist')[-1])" 2>&1)
|
||||
pip install dist/$SDIST
|
||||
displayName: 'Install from sdist'
|
||||
- bash: |
|
||||
SDIST=$(python -c "import os;print(os.listdir('./dist')[-1])" 2>&1)
|
||||
pip install dist/$SDIST
|
||||
displayName: "Install from sdist"
|
||||
condition: not(eq(variables['python.version'], '3.5'))
|
||||
|
||||
- script: python -m pytest --pyargs spacy
|
||||
displayName: 'Run tests'
|
||||
- script: |
|
||||
pip install -r requirements.txt
|
||||
python -m pytest --pyargs spacy
|
||||
displayName: "Run tests"
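The "Install from sdist" step above picks the archive with `os.listdir('./dist')[-1]`. Since `os.listdir` returns entries in arbitrary order, this is only safe when `dist/` holds a single file. A minimal sketch of a more explicit selection follows; the helper name `newest_sdist` and the choice of "most recently modified `.tar.gz`" are assumptions made for illustration and are not part of this commit.

```python
from pathlib import Path
from typing import Optional


def newest_sdist(dist_dir: str) -> Optional[Path]:
    """Return the most recently modified .tar.gz in dist_dir, or None.

    Hypothetical helper: the pipeline in the diff does not define it.
    """
    candidates = sorted(
        Path(dist_dir).glob("*.tar.gz"),   # ignore anything that is not a gztar sdist
        key=lambda p: p.stat().st_mtime,   # oldest first, newest last
    )
    return candidates[-1] if candidates else None
```

In a pipeline step this could feed `pip install` in place of the `$SDIST` variable, though with a freshly created `dist/` directory the original one-liner behaves the same.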
|
||||
|
|
5
build-constraints.txt
Normal file
|
@ -0,0 +1,5 @@
|
|||
# build version constraints for use with wheelwright + multibuild
|
||||
numpy==1.15.0; python_version<='3.7'
|
||||
numpy==1.17.3; python_version=='3.8'
|
||||
numpy==1.19.3; python_version=='3.9'
|
||||
numpy; python_version>='3.10'
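To make the four marker lines above easier to read, here is a rough Python sketch of the selection they express. The function `numpy_requirement` is hypothetical and exists only for illustration; the actual resolution is done by pip from the environment markers.

```python
from typing import Tuple


def numpy_requirement(python_version: Tuple[int, int]) -> str:
    """Mirror the marker lines in build-constraints.txt (illustrative only)."""
    if python_version <= (3, 7):
        return "numpy==1.15.0"
    if python_version == (3, 8):
        return "numpy==1.17.3"
    if python_version == (3, 9):
        return "numpy==1.19.3"
    # python_version >= '3.10': no pin, any numpy release is accepted
    return "numpy"
```

Tuples are used so that `(3, 10)` sorts after `(3, 9)`, matching version ordering; a plain string comparison would place `"3.10"` before `"3.9"`.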
|
|
@ -3,6 +3,8 @@ redirects = [
|
|||
{from = "https://spacy.netlify.com/*", to="https://spacy.io/:splat", force = true },
|
||||
# Subdomain for branches
|
||||
{from = "https://nightly.spacy.io/*", to="https://nightly-spacy-io.spacy.io/:splat", force = true, status = 200},
|
||||
# TODO: update this with the v2 branch build once v3 is live (status = 200)
|
||||
{from = "https://v2.spacy.io/*", to="https://spacy.io/:splat", force = true},
|
||||
# Old subdomains
|
||||
{from = "https://survey.spacy.io/*", to = "https://spacy.io", force = true},
|
||||
{from = "http://survey.spacy.io/*", to = "https://spacy.io", force = true},
|
||||
|
|
|
@ -1,13 +1,16 @@
|
|||
[build-system]
|
||||
requires = [
|
||||
"setuptools",
|
||||
"wheel",
|
||||
"cython>=0.25",
|
||||
"cymem>=2.0.2,<2.1.0",
|
||||
"preshed>=3.0.2,<3.1.0",
|
||||
"murmurhash>=0.28.0,<1.1.0",
|
||||
"thinc>=8.0.0rc2,<8.1.0",
|
||||
"blis>=0.4.0,<0.8.0",
|
||||
"pathy"
|
||||
"pathy",
|
||||
"numpy==1.15.0; python_version<='3.7'",
|
||||
"numpy==1.17.3; python_version=='3.8'",
|
||||
"numpy==1.19.3; python_version=='3.9'",
|
||||
"numpy; python_version>='3.10'",
|
||||
]
|
||||
build-backend = "setuptools.build_meta"
|
||||
|
|
|
@ -20,6 +20,7 @@ classifiers =
|
|||
Programming Language :: Python :: 3.6
|
||||
Programming Language :: Python :: 3.7
|
||||
Programming Language :: Python :: 3.8
|
||||
Programming Language :: Python :: 3.9
|
||||
Topic :: Scientific/Engineering
|
||||
|
||||
[options]
|
||||
|
@ -27,7 +28,6 @@ zip_safe = false
|
|||
include_package_data = true
|
||||
python_requires = >=3.6
|
||||
setup_requires =
|
||||
wheel
|
||||
cython>=0.25
|
||||
numpy>=1.15.0
|
||||
# We also need our Cython packages here to compile against
|
||||
|
|
6
setup.py
|
@ -2,9 +2,9 @@
|
|||
from setuptools import Extension, setup, find_packages
|
||||
import sys
|
||||
import platform
|
||||
import numpy
|
||||
from distutils.command.build_ext import build_ext
|
||||
from distutils.sysconfig import get_python_inc
|
||||
import numpy
|
||||
from pathlib import Path
|
||||
import shutil
|
||||
from Cython.Build import cythonize
|
||||
|
@ -194,8 +194,8 @@ def setup_package():
|
|||
print(f"Copied {copy_file} -> {target_dir}")
|
||||
|
||||
include_dirs = [
|
||||
get_python_inc(plat_specific=True),
|
||||
numpy.get_include(),
|
||||
get_python_inc(plat_specific=True),
|
||||
]
|
||||
ext_modules = []
|
||||
for name in MOD_NAMES:
|
||||
|
@ -212,7 +212,7 @@ def setup_package():
|
|||
ext_modules=ext_modules,
|
||||
cmdclass={"build_ext": build_ext_subclass},
|
||||
include_dirs=include_dirs,
|
||||
package_data={"": ["*.pyx", "*.pxd", "*.pxi", "*.cpp"]},
|
||||
package_data={"": ["*.pyx", "*.pxd", "*.pxi"]},
|
||||
)
|
||||
|
||||
|
||||
|
|
|
@ -45,14 +45,16 @@ def init_config_cli(
|
|||
if isinstance(optimize, Optimizations): # instance of enum from the CLI
|
||||
optimize = optimize.value
|
||||
pipeline = string_to_list(pipeline)
|
||||
init_config(
|
||||
output_file,
|
||||
is_stdout = str(output_file) == "-"
|
||||
config = init_config(
|
||||
lang=lang,
|
||||
pipeline=pipeline,
|
||||
optimize=optimize,
|
||||
gpu=gpu,
|
||||
pretraining=pretraining,
|
||||
silent=is_stdout,
|
||||
)
|
||||
save_config(config, output_file, is_stdout=is_stdout)
|
||||
|
||||
|
||||
@init_cli.command("fill-config")
|
||||
|
@ -118,16 +120,15 @@ def fill_config(
|
|||
|
||||
|
||||
def init_config(
|
||||
output_file: Path,
|
||||
*,
|
||||
lang: str,
|
||||
pipeline: List[str],
|
||||
optimize: str,
|
||||
gpu: bool,
|
||||
pretraining: bool = False,
|
||||
) -> None:
|
||||
is_stdout = str(output_file) == "-"
|
||||
msg = Printer(no_print=is_stdout)
|
||||
silent: bool = True,
|
||||
) -> Config:
|
||||
msg = Printer(no_print=silent)
|
||||
with TEMPLATE_PATH.open("r") as f:
|
||||
template = Template(f.read())
|
||||
# Filter out duplicates since tok2vec and transformer are added by template
|
||||
|
@ -173,7 +174,7 @@ def init_config(
|
|||
pretrain_config = util.load_config(DEFAULT_CONFIG_PRETRAIN_PATH)
|
||||
config = pretrain_config.merge(config)
|
||||
msg.good("Auto-filled config with all values")
|
||||
save_config(config, output_file, is_stdout=is_stdout)
|
||||
return config
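The change above separates building the config from writing it: `init_config` now returns the config, and the CLI wrapper decides where it goes. The sketch below shows the same pattern in isolation. `build_config` and `write_config` are hypothetical stand-ins using a plain dict and JSON; they are not spaCy's `init_config` or `save_config`.

```python
import json
import sys
from pathlib import Path
from typing import Any, Dict, List


def build_config(*, lang: str, pipeline: List[str]) -> Dict[str, Any]:
    # Hypothetical stand-in for init_config: returns data and performs no I/O
    return {"nlp": {"lang": lang, "pipeline": list(pipeline)}}


def write_config(config: Dict[str, Any], output_file: Path, is_stdout: bool = False) -> None:
    # Hypothetical stand-in for save_config: the only place that writes output
    text = json.dumps(config, indent=2)
    if is_stdout:
        sys.stdout.write(text + "\n")
    else:
        output_file.write_text(text, encoding="utf8")


# The CLI layer decides the destination, as init_config_cli does in the diff
output_file = Path("-")
is_stdout = str(output_file) == "-"
config = build_config(lang="en", pipeline=["tagger", "parser"])
write_config(config, output_file, is_stdout=is_stdout)
```

One benefit of this split is that the builder can be called from other code and tested without touching the filesystem.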
|
||||
|
||||
|
||||
def save_config(
|
||||
|
|
|
@ -119,6 +119,10 @@ class Warnings:
|
|||
"call the {matcher} on each Doc object.")
|
||||
W107 = ("The property `Doc.{prop}` is deprecated. Use "
|
||||
"`Doc.has_annotation(\"{attr}\")` instead.")
|
||||
W108 = ("The rule-based lemmatizer did not find POS annotation for the "
|
||||
"token '{text}'. Check that your pipeline includes components that "
|
||||
"assign token.pos, typically 'tagger'+'attribute_ruler' or "
|
||||
"'morphologizer'.")
|
||||
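As a small illustration of how a template such as W108 is filled in, the sketch below copies the message text from the diff and formats it with a token text. The helper `warn_missing_pos` is hypothetical; how spaCy itself raises this warning is not shown in this hunk.

```python
import warnings

# Template text copied from the W108 entry in the diff above
W108 = ("The rule-based lemmatizer did not find POS annotation for the "
        "token '{text}'. Check that your pipeline includes components that "
        "assign token.pos, typically 'tagger'+'attribute_ruler' or "
        "'morphologizer'.")


def warn_missing_pos(token_text: str) -> str:
    """Hypothetical helper: fill the template and emit it as a warning."""
    message = W108.format(text=token_text)
    warnings.warn(message)
    return message
```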
|
||||
|
||||
@add_codes
|
||||
|
|
|
@ -210,8 +210,12 @@ _ukrainian_lower = r"а-щюяіїєґ"
|
|||
_ukrainian_upper = r"А-ЩЮЯІЇЄҐ"
|
||||
_ukrainian = r"а-щюяіїєґА-ЩЮЯІЇЄҐ"
|
||||
|
||||
_upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian_upper
|
||||
_lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower
|
||||
_macedonian_lower = r"ѓѕјљњќѐѝ"
|
||||
_macedonian_upper = r"ЃЅЈЉЊЌЀЍ"
|
||||
_macedonian = r"ѓѕјљњќѐѝЃЅЈЉЊЌЀЍ"
|
||||
|
||||
_upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian_upper + _macedonian_upper
|
||||
_lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower + _macedonian_lower
|
||||
|
||||
_uncased = (
|
||||
_bengali
|
||||
|
@ -226,7 +230,7 @@ _uncased = (
|
|||
+ _cjk
|
||||
)
|
||||
|
||||
ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased)
|
||||
ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _macedonian + _uncased)
|
||||
ALPHA_LOWER = group_chars(_lower + _uncased)
|
||||
ALPHA_UPPER = group_chars(_upper + _uncased)
|
||||
|
||||
|
|
|
@ -1,9 +1,16 @@
|
|||
from .stop_words import STOP_WORDS
|
||||
from .tag_map import TAG_MAP
|
||||
from ...language import Language
|
||||
from ...attrs import LANG
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from ...language import Language
|
||||
|
||||
|
||||
class CzechDefaults(Language.Defaults):
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters.update(LEX_ATTRS)
|
||||
lex_attr_getters[LANG] = lambda text: "cs"
|
||||
tag_map = TAG_MAP
|
||||
stop_words = STOP_WORDS
|
||||
lex_attr_getters = LEX_ATTRS
|
||||
|
||||
|
|
4312
spacy/lang/cs/tag_map.py
Normal file
File diff suppressed because it is too large
|
@ -6,10 +6,21 @@ from ...tokens import Doc, Span
|
|||
|
||||
|
||||
def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
|
||||
"""Detect base noun phrases from a dependency parse. Works on Doc and Span."""
|
||||
# fmt: off
|
||||
labels = ["nsubj", "dobj", "nsubjpass", "pcomp", "pobj", "dative", "appos", "attr", "ROOT"]
|
||||
# fmt: on
|
||||
"""
|
||||
Detect base noun phrases from a dependency parse. Works on both Doc and Span.
|
||||
"""
|
||||
labels = [
|
||||
"oprd",
|
||||
"nsubj",
|
||||
"dobj",
|
||||
"nsubjpass",
|
||||
"pcomp",
|
||||
"pobj",
|
||||
"dative",
|
||||
"appos",
|
||||
"attr",
|
||||
"ROOT",
|
||||
]
|
||||
doc = doclike.doc # Ensure works on both Doc and Span.
|
||||
if not doc.has_annotation("DEP"):
|
||||
raise ValueError(Errors.E029)
|
||||
|
|
48
spacy/lang/mk/__init__.py
Normal file
|
@ -0,0 +1,48 @@
from typing import Optional
from thinc.api import Model
from .lemmatizer import MacedonianLemmatizer
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .lex_attrs import LEX_ATTRS
from ..tokenizer_exceptions import BASE_EXCEPTIONS

from ...language import Language
from ...attrs import LANG
from ...util import update_exc
from ...lookups import Lookups


class MacedonianDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: "mk"

    # Optional: replace flags with custom functions, e.g. like_num()
    lex_attr_getters.update(LEX_ATTRS)

    # Merge base exceptions and custom tokenizer exceptions
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS

    @classmethod
    def create_lemmatizer(cls, nlp=None, lookups=None):
        if lookups is None:
            lookups = Lookups()
        return MacedonianLemmatizer(lookups)


class Macedonian(Language):
    lang = "mk"
    Defaults = MacedonianDefaults


@Macedonian.factory(
    "lemmatizer",
    assigns=["token.lemma"],
    default_config={"model": None, "mode": "rule"},
    default_score_weights={"lemma_acc": 1.0},
)
def make_lemmatizer(nlp: Language, model: Optional[Model], name: str, mode: str):
    return MacedonianLemmatizer(nlp.vocab, model, name, mode=mode)


__all__ = ["Macedonian"]
|
55
spacy/lang/mk/lemmatizer.py
Normal file
|
@ -0,0 +1,55 @@
from typing import List
from collections import OrderedDict

from ...pipeline import Lemmatizer
from ...tokens import Token


class MacedonianLemmatizer(Lemmatizer):
    def rule_lemmatize(self, token: Token) -> List[str]:
        string = token.text
        univ_pos = token.pos_.lower()
        morphology = token.morph.to_dict()

        if univ_pos in ("", "eol", "space"):
            return [string.lower()]

        if string[-3:] == 'јќи':
            string = string[:-3]
            univ_pos = "verb"

        if callable(self.is_base_form) and self.is_base_form(univ_pos, morphology):
            return [string.lower()]
        index_table = self.lookups.get_table("lemma_index", {})
        exc_table = self.lookups.get_table("lemma_exc", {})
        rules_table = self.lookups.get_table("lemma_rules", {})
        if not any((index_table.get(univ_pos), exc_table.get(univ_pos), rules_table.get(univ_pos))):
            if univ_pos == "propn":
                return [string]
            else:
                return [string.lower()]

        index = index_table.get(univ_pos, {})
        exceptions = exc_table.get(univ_pos, {})
        rules = rules_table.get(univ_pos, [])

        orig = string
        string = string.lower()
        forms = []

        for old, new in rules:
            if string.endswith(old):
                form = string[: len(string) - len(old)] + new
                if not form:
                    continue
                if form in index or not form.isalpha():
                    forms.append(form)

        forms = list(OrderedDict.fromkeys(forms))
        for form in exceptions.get(string, []):
            if form not in forms:
                forms.insert(0, form)
        if not forms:
            forms.append(orig)

        return forms
|
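The suffix-rule loop at the end of `rule_lemmatize` can be read in isolation. The following is a minimal standalone sketch, not spaCy's API: the function name `apply_suffix_rules` is hypothetical, and its `index`, `exceptions` and `rules` arguments stand in for the per-POS entries of the `lemma_index`, `lemma_exc` and `lemma_rules` lookup tables.

```python
from collections import OrderedDict
from typing import Dict, List, Sequence, Set, Tuple


def apply_suffix_rules(
    string: str,
    index: Set[str],
    exceptions: Dict[str, List[str]],
    rules: Sequence[Tuple[str, str]],
) -> List[str]:
    """Hypothetical helper mirroring the rule loop in rule_lemmatize."""
    orig = string
    string = string.lower()
    forms: List[str] = []

    for old, new in rules:
        if string.endswith(old):
            # Swap the matched suffix for its replacement.
            form = string[: len(string) - len(old)] + new
            if not form:
                continue
            # Keep a candidate if the index knows it, or if it contains
            # non-alphabetic characters (those bypass the index check).
            if form in index or not form.isalpha():
                forms.append(form)

    # Deduplicate while preserving order.
    forms = list(OrderedDict.fromkeys(forms))
    # Exceptions are looked up by the lowercased string and go first.
    for form in exceptions.get(string, []):
        if form not in forms:
            forms.insert(0, form)
    # With no candidate at all, fall back to the original casing.
    if not forms:
        forms.append(orig)
    return forms
```

With toy English tables (illustrative only, not Macedonian data) such as `index = {"cat"}`, `rules = [("s", "")]` and `exceptions = {"mice": ["mouse"]}`, a rule-derived form is kept only when the index contains it, so an unknown word falls through to the original string.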
55
spacy/lang/mk/lex_attrs.py
Normal file
|
@ -0,0 +1,55 @@
from ...attrs import LIKE_NUM

_num_words = [
    "нула", "еден", "една", "едно", "два", "две", "три", "четири", "пет", "шест", "седум", "осум", "девет", "десет",
    "единаесет", "дванаесет", "тринаесет", "четиринаесет", "петнаесет", "шеснаесет", "седумнаесет", "осумнаесет",
    "деветнаесет", "дваесет", "триесет", "четириесет", "педесет", "шеесет", "седумдесет", "осумдесет", "деведесет",
    "сто", "двесте", "триста", "четиристотини", "петстотини", "шестотини", "седумстотини", "осумстотини",
    "деветстотини", "илјада", "илјади", 'милион', 'милиони', 'милијарда', 'милијарди', 'билион', 'билиони',

    "двајца", "тројца", "четворица", "петмина", "шестмина", "седуммина", "осуммина", "деветмина", "обата", "обајцата",

    "прв", "втор", "трет", "четврт", "седм", "осм", "двестоти",

    "два-три", "два-триесет", "два-триесетмина", "два-тринаесет", "два-тројца", "две-три", "две-тристотини",
    "пет-шеесет", "пет-шеесетмина", "пет-шеснаесетмина", "пет-шест", "пет-шестмина", "пет-шестотини", "петина",
    "осмина", "седум-осум", "седум-осумдесет", "седум-осуммина", "седум-осумнаесет", "седум-осумнаесетмина",
    "три-четириесет", "три-четиринаесет", "шеесет", "шеесетина", "шеесетмина", "шеснаесет", "шеснаесетмина",
    "шест-седум", "шест-седумдесет", "шест-седумнаесет", "шест-седумстотини", "шестоти", "шестотини"
]


def like_num(text):
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True

    text_lower = text.lower()
    if text_lower in _num_words:
        return True

    if text_lower.endswith(("а", "о", "и")):
        if text_lower[:-1] in _num_words:
            return True

    if text_lower.endswith(("ти", "та", "то", "на")):
        if text_lower[:-2] in _num_words:
            return True

    if text_lower.endswith(("ата", "иот", "ите", "ина", "чки")):
        if text_lower[:-3] in _num_words:
            return True

    if text_lower.endswith(("мина", "тина")):
        if text_lower[:-4] in _num_words:
            return True
    return False


LEX_ATTRS = {LIKE_NUM: like_num}
|
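The four suffix checks in `like_num` all follow one pattern: if the token ends in one of a group of suffixes, strip a fixed number of characters and look the stem up in the word list. The sketch below shows that pattern as a table-driven loop. It is an illustration only: `NUM_WORDS` is a deliberately shortened, hypothetical word list, and `SUFFIX_GROUPS` is a name introduced here, not part of the committed file.

```python
# Hypothetical, shortened word list; the committed file lists many more entries.
NUM_WORDS = {"еден", "два", "три", "пет", "десет", "сто"}

# (suffixes, number of characters to strip), one entry per check in like_num.
SUFFIX_GROUPS = [
    (("а", "о", "и"), 1),
    (("ти", "та", "то", "на"), 2),
    (("ата", "иот", "ите", "ина", "чки"), 3),
    (("мина", "тина"), 4),
]


def like_num(text: str) -> bool:
    # Drop one leading sign-like character.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands and decimal separators.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as "1/2".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True

    text_lower = text.lower()
    if text_lower in NUM_WORDS:
        return True

    # Strip a known suffix and retry the lookup on the remaining stem.
    for suffixes, length in SUFFIX_GROUPS:
        if text_lower.endswith(suffixes) and text_lower[:-length] in NUM_WORDS:
            return True
    return False
```

Because the stem lookup is a plain membership test, an inflected form is recognized only when stripping the suffix yields an entry exactly as it is spelled in the word list.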
815
spacy/lang/mk/stop_words.py
Normal file
|
@ -0,0 +1,815 @@
|
|||
STOP_WORDS = set(
|
||||
"""
|
||||
а
|
||||
абре
|
||||
aв
|
||||
аи
|
||||
ако
|
||||
алало
|
||||
ам
|
||||
ама
|
||||
аман
|
||||
ами
|
||||
амин
|
||||
априли-ли-ли
|
||||
ау
|
||||
аух
|
||||
ауч
|
||||
ах
|
||||
аха
|
||||
аха-ха
|
||||
аш
|
||||
ашколсум
|
||||
ашколсун
|
||||
ај
|
||||
ајде
|
||||
ајс
|
||||
аџаба
|
||||
бавно
|
||||
бам
|
||||
бам-бум
|
||||
бап
|
||||
бар
|
||||
баре
|
||||
барем
|
||||
бау
|
||||
бау-бау
|
||||
баш
|
||||
бај
|
||||
бе
|
||||
беа
|
||||
бев
|
||||
бевме
|
||||
бевте
|
||||
без
|
||||
безбели
|
||||
бездруго
|
||||
белки
|
||||
беше
|
||||
би
|
||||
бидејќи
|
||||
бим
|
||||
бис
|
||||
бла
|
||||
блазе
|
||||
богами
|
||||
божем
|
||||
боц
|
||||
браво
|
||||
бравос
|
||||
бре
|
||||
бреј
|
||||
брзо
|
||||
бришка
|
||||
бррр
|
||||
бу
|
||||
бум
|
||||
буф
|
||||
буц
|
||||
бујрум
|
||||
ваа
|
||||
вам
|
||||
варај
|
||||
варда
|
||||
вас
|
||||
вај
|
||||
ве
|
||||
велат
|
||||
вели
|
||||
версус
|
||||
веќе
|
||||
ви
|
||||
виа
|
||||
види
|
||||
вие
|
||||
вистина
|
||||
витос
|
||||
внатре
|
||||
во
|
||||
воз
|
||||
вон
|
||||
впрочем
|
||||
врв
|
||||
вред
|
||||
време
|
||||
врз
|
||||
всушност
|
||||
втор
|
||||
галиба
|
||||
ги
|
||||
гитла
|
||||
го
|
||||
годе
|
||||
годишник
|
||||
горе
|
||||
гра
|
||||
гуц
|
||||
гљу
|
||||
да
|
||||
даан
|
||||
дава
|
||||
дал
|
||||
дали
|
||||
дан
|
||||
два
|
||||
дваесет
|
||||
дванаесет
|
||||
двајца
|
||||
две
|
||||
двесте
|
||||
движам
|
||||
движат
|
||||
движи
|
||||
движиме
|
||||
движите
|
||||
движиш
|
||||
де
|
||||
деведесет
|
||||
девет
|
||||
деветнаесет
|
||||
деветстотини
|
||||
деветти
|
||||
дека
|
||||
дел
|
||||
делми
|
||||
демек
|
||||
десет
|
||||
десетина
|
||||
десетти
|
||||
деситици
|
||||
дејгиди
|
||||
дејди
|
||||
ди
|
||||
дилми
|
||||
дин
|
||||
дип
|
||||
дно
|
||||
до
|
||||
доволно
|
||||
додека
|
||||
додуша
|
||||
докај
|
||||
доколку
|
||||
доправено
|
||||
доправи
|
||||
досамоти
|
||||
доста
|
||||
држи
|
||||
дрн
|
||||
друг
|
||||
друга
|
||||
другата
|
||||
други
|
||||
другиот
|
||||
другите
|
||||
друго
|
||||
другото
|
||||
дум
|
||||
дур
|
||||
дури
|
||||
е
|
||||
евала
|
||||
еве
|
||||
евет
|
||||
ега
|
||||
егиди
|
||||
еден
|
||||
едикојси
|
||||
единаесет
|
||||
единствено
|
||||
еднаш
|
||||
едно
|
||||
ексик
|
||||
ела
|
||||
елбете
|
||||
елем
|
||||
ели
|
||||
ем
|
||||
еми
|
||||
ене
|
||||
ете
|
||||
еурека
|
||||
ех
|
||||
еј
|
||||
жими
|
||||
жити
|
||||
за
|
||||
завал
|
||||
заврши
|
||||
зад
|
||||
задека
|
||||
задоволна
|
||||
задржи
|
||||
заедно
|
||||
зар
|
||||
зарад
|
||||
заради
|
||||
заре
|
||||
зарем
|
||||
затоа
|
||||
зашто
|
||||
згора
|
||||
зема
|
||||
земе
|
||||
земува
|
||||
зер
|
||||
значи
|
||||
зошто
|
||||
зуј
|
||||
и
|
||||
иако
|
||||
из
|
||||
извезен
|
||||
изгледа
|
||||
измеѓу
|
||||
износ
|
||||
или
|
||||
или-или
|
||||
илјада
|
||||
илјади
|
||||
им
|
||||
има
|
||||
имаа
|
||||
имаат
|
||||
имавме
|
||||
имавте
|
||||
имам
|
||||
имаме
|
||||
имате
|
||||
имаш
|
||||
имаше
|
||||
име
|
||||
имено
|
||||
именува
|
||||
имплицира
|
||||
имплицираат
|
||||
имплицирам
|
||||
имплицираме
|
||||
имплицирате
|
||||
имплицираш
|
||||
инаку
|
||||
индицира
|
||||
исечок
|
||||
исклучен
|
||||
исклучена
|
||||
исклучени
|
||||
исклучено
|
||||
искористен
|
||||
искористена
|
||||
искористени
|
||||
искористено
|
||||
искористи
|
||||
искрај
|
||||
исти
|
||||
исто
|
||||
итака
|
||||
итн
|
||||
их
|
||||
иха
|
||||
ихуу
|
||||
иш
|
||||
ишала
|
||||
иј
|
||||
ка
|
||||
каде
|
||||
кажува
|
||||
како
|
||||
каков
|
||||
камоли
|
||||
кај
|
||||
ква
|
||||
ки
|
||||
кит
|
||||
кло
|
||||
клум
|
||||
кога
|
||||
кого
|
||||
кого-годе
|
||||
кое
|
||||
кои
|
||||
количество
|
||||
количина
|
||||
колку
|
||||
кому
|
||||
кон
|
||||
користена
|
||||
користени
|
||||
користено
|
||||
користи
|
||||
кот
|
||||
котрр
|
||||
кош-кош
|
||||
кој
|
||||
која
|
||||
којзнае
|
||||
којшто
|
||||
кр-кр-кр
|
||||
крај
|
||||
крек
|
||||
крз
|
||||
крк
|
||||
крц
|
||||
куку
|
||||
кукуригу
|
||||
куш
|
||||
ле
|
||||
лебами
|
||||
леле
|
||||
лели
|
||||
ли
|
||||
лиду
|
||||
луп
|
||||
ма
|
||||
макар
|
||||
малку
|
||||
марш
|
||||
мат
|
||||
мац
|
||||
машала
|
||||
ме
|
||||
мене
|
||||
место
|
||||
меѓу
|
||||
меѓувреме
|
||||
меѓутоа
|
||||
ми
|
||||
мое
|
||||
може
|
||||
можеби
|
||||
молам
|
||||
моли
|
||||
мор
|
||||
мора
|
||||
море
|
||||
мори
|
||||
мразец
|
||||
му
|
||||
муклец
|
||||
мутлак
|
||||
муц
|
||||
мјау
|
||||
на
|
||||
навидум
|
||||
навистина
|
||||
над
|
||||
надвор
|
||||
назад
|
||||
накај
|
||||
накрај
|
||||
нали
|
||||
нам
|
||||
наместо
|
||||
наоколу
|
||||
направено
|
||||
направи
|
||||
напред
|
||||
нас
|
||||
наспоред
|
||||
наспрема
|
||||
наспроти
|
||||
насред
|
||||
натаму
|
||||
натема
|
||||
начин
|
||||
наш
|
||||
наша
|
||||
наше
|
||||
наши
|
||||
нај
|
||||
најдоцна
|
||||
најмалку
|
||||
најмногу
|
||||
не
|
||||
неа
|
||||
него
|
||||
негов
|
||||
негова
|
||||
негови
|
||||
негово
|
||||
незе
|
||||
нека
|
||||
некаде
|
||||
некако
|
||||
некаков
|
||||
некого
|
||||
некое
|
||||
некои
|
||||
неколку
|
||||
некому
|
||||
некој
|
||||
некојси
|
||||
нели
|
||||
немој
|
||||
нему
|
||||
неоти
|
||||
нечиј
|
||||
нешто
|
||||
нејзе
|
||||
нејзин
|
||||
нејзини
|
||||
нејзино
|
||||
нејсе
|
||||
ни
|
||||
нив
|
||||
нивен
|
||||
нивна
|
||||
нивни
|
||||
нивно
|
||||
ние
|
||||
низ
|
||||
никаде
|
||||
никако
|
||||
никогаш
|
||||
никого
|
||||
никому
|
||||
никој
|
||||
ним
|
||||
нити
|
||||
нито
|
||||
ниту
|
||||
ничиј
|
||||
ништо
|
||||
но
|
||||
нѐ
|
||||
о
|
||||
обр
|
||||
ова
|
||||
ова-она
|
||||
оваа
|
||||
овај
|
||||
овде
|
||||
овега
|
||||
овие
|
||||
овој
|
||||
од
|
||||
одавде
|
||||
оди
|
||||
однесува
|
||||
односно
|
||||
одошто
|
||||
околу
|
||||
олеле
|
||||
олкацок
|
||||
он
|
||||
она
|
||||
онаа
|
||||
онака
|
||||
онаков
|
||||
онде
|
||||
они
|
||||
оние
|
||||
оно
|
||||
оној
|
||||
оп
|
||||
освем
|
||||
освен
|
||||
осем
|
||||
осми
|
||||
осум
|
||||
осумдесет
|
||||
осумнаесет
|
||||
осумстотитни
|
||||
отаде
|
||||
оти
|
||||
откако
|
||||
откај
|
||||
откога
|
||||
отколку
|
||||
оттаму
|
||||
оттука
|
||||
оф
|
||||
ох
|
||||
ој
|
||||
па
|
||||
пак
|
||||
папа
|
||||
пардон
|
||||
пате-ќуте
|
||||
пати
|
||||
пау
|
||||
паче
|
||||
пеесет
|
||||
пеки
|
||||
пет
|
||||
петнаесет
|
||||
петстотини
|
||||
петти
|
||||
пи
|
||||
пи-пи
|
||||
пис
|
||||
плас
|
||||
плус
|
||||
по
|
||||
побавно
|
||||
поблиску
|
||||
побрзо
|
||||
побуни
|
||||
повеќе
|
||||
повторно
|
||||
под
|
||||
подалеку
|
||||
подолу
|
||||
подоцна
|
||||
подруго
|
||||
позади
|
||||
поинаква
|
||||
поинакви
|
||||
поинакво
|
||||
поинаков
|
||||
поинаку
|
||||
покаже
|
||||
покажува
|
||||
покрај
|
||||
полно
|
||||
помалку
|
||||
помеѓу
|
||||
понатаму
|
||||
понекогаш
|
||||
понекој
|
||||
поради
|
||||
поразличен
|
||||
поразлична
|
||||
поразлични
|
||||
поразлично
|
||||
поседува
|
||||
после
|
||||
последен
|
||||
последна
|
||||
последни
|
||||
последно
|
||||
поспоро
|
||||
потег
|
||||
потоа
|
||||
пошироко
|
||||
прави
|
||||
празно
|
||||
прв
|
||||
пред
|
||||
през
|
||||
преку
|
||||
претежно
|
||||
претходен
|
||||
претходна
|
||||
претходни
|
||||
претходник
|
||||
претходно
|
||||
при
|
||||
присвои
|
||||
притоа
|
||||
причинува
|
||||
пријатно
|
||||
просто
|
||||
против
|
||||
прр
|
||||
пст
|
||||
пук
|
||||
пусто
|
||||
пуф
|
||||
пуј
|
||||
пфуј
|
||||
пшт
|
||||
ради
|
||||
различен
|
||||
различна
|
||||
различни
|
||||
различно
|
||||
разни
|
||||
разоружен
|
||||
разредлив
|
||||
рамките
|
||||
рамнообразно
|
||||
растревожено
|
||||
растреперено
|
||||
расчувствувано
|
||||
ратоборно
|
||||
рече
|
||||
роден
|
||||
с
|
||||
сакан
|
||||
сам
|
||||
сама
|
||||
сами
|
||||
самите
|
||||
само
|
||||
самоти
|
||||
свое
|
||||
свои
|
||||
свој
|
||||
своја
|
||||
се
|
||||
себе
|
||||
себеси
|
||||
сега
|
||||
седми
|
||||
седум
|
||||
седумдесет
|
||||
седумнаесет
|
||||
седумстотини
|
||||
секаде
|
||||
секаков
|
||||
секи
|
||||
секогаш
|
||||
секого
|
||||
секому
|
||||
секој
|
||||
секојдневно
|
||||
сем
|
||||
сенешто
|
||||
сепак
|
||||
сериозен
|
||||
сериозна
|
||||
сериозни
|
||||
сериозно
|
||||
сет
|
||||
сечиј
|
||||
сешто
|
||||
си
|
||||
сиктер
|
||||
сиот
|
||||
сип
|
||||
сиреч
|
||||
сите
|
||||
сичко
|
||||
скок
|
||||
скоро
|
||||
скрц
|
||||
следбеник
|
||||
следбеничка
|
||||
следен
|
||||
следователно
|
||||
следствено
|
||||
сме
|
||||
со
|
||||
соне
|
||||
сопствен
|
||||
сопствена
|
||||
сопствени
|
||||
сопствено
|
||||
сосе
|
||||
сосем
|
||||
сполај
|
||||
според
|
||||
споро
|
||||
спрема
|
||||
спроти
|
||||
спротив
|
||||
сред
|
||||
среде
|
||||
среќно
|
||||
срочен
|
||||
сст
|
||||
става
|
||||
ставаат
|
||||
ставам
|
||||
ставаме
|
||||
ставате
|
||||
ставаш
|
||||
стави
|
||||
сте
|
||||
сто
|
||||
стоп
|
||||
страна
|
||||
сум
|
||||
сума
|
||||
супер
|
||||
сус
|
||||
сѐ
|
||||
та
|
||||
таа
|
||||
така
|
||||
таква
|
||||
такви
|
||||
таков
|
||||
тамам
|
||||
таму
|
||||
тангар-мангар
|
||||
тандар-мандар
|
||||
тап
|
||||
твое
|
||||
те
|
||||
тебе
|
||||
тебека
|
||||
тек
|
||||
текот
|
||||
ти
|
||||
тие
|
||||
тизе
|
||||
тик-так
|
||||
тики
|
||||
тоа
|
||||
тогаш
|
||||
тој
|
||||
трак
|
||||
трака-трука
|
||||
трас
|
||||
треба
|
||||
трет
|
||||
три
|
||||
триесет
|
||||
тринаест
|
||||
триста
|
||||
труп
|
||||
трупа
|
||||
трус
|
||||
ту
|
||||
тука
|
||||
туку
|
||||
тукушто
|
||||
туф
|
||||
у
|
||||
уа
|
||||
убаво
|
||||
уви
|
||||
ужасно
|
||||
уз
|
||||
ура
|
||||
уу
|
||||
уф
|
||||
уха
|
||||
уш
|
||||
уште
|
||||
фазен
|
||||
фала
|
||||
фил
|
||||
филан
|
||||
фис
|
||||
фиу
|
||||
фиљан
|
||||
фоб
|
||||
фон
|
||||
ха
|
||||
ха-ха
|
||||
хе
|
||||
хеј
|
||||
хеј
|
||||
хи
|
||||
хм
|
||||
хо
|
||||
цак
|
||||
цап
|
||||
целина
|
||||
цело
|
||||
цигу-лигу
|
||||
циц
|
||||
чекај
|
||||
често
|
||||
четврт
|
||||
четири
|
||||
четириесет
|
||||
четиринаесет
|
||||
четирстотини
|
||||
чие
|
||||
чии
|
||||
чик
|
||||
чик-чирик
|
||||
чини
|
||||
чиш
|
||||
чиј
|
||||
чија
|
||||
чијшто
|
||||
чкрап
|
||||
чому
|
||||
чук
|
||||
чукш
|
||||
чуму
|
||||
чунки
|
||||
шеесет
|
||||
шеснаесет
|
||||
шест
|
||||
шести
|
||||
шестотини
|
||||
ширум
|
||||
шлак
|
||||
шлап
|
||||
шлапа-шлупа
|
||||
шлуп
|
||||
шмрк
|
||||
што
|
||||
штогоде
|
||||
штом
|
||||
штотуку
|
||||
штрак
|
||||
штрап
|
||||
штрап-штруп
|
||||
шуќур
|
||||
ѓиди
|
||||
ѓоа
|
||||
ѓоамити
|
||||
ѕан
|
||||
ѕе
|
||||
ѕин
|
||||
ја
|
||||
јадец
|
||||
јазе
|
||||
јали
|
||||
јас
|
||||
јаска
|
||||
јок
|
||||
ќе
|
||||
ќешки
|
||||
ѝ
|
||||
џагара-магара
|
||||
џанам
|
||||
џив-џив
|
||||
""".split()
|
||||
)
|
100
spacy/lang/mk/tokenizer_exceptions.py
Normal file
|
@ -0,0 +1,100 @@
|
|||
from ...symbols import ORTH, NORM
|
||||
|
||||
|
||||
_exc = {}
|
||||
|
||||
|
||||
_abbr_exc = [
|
||||
{ORTH: "м", NORM: "метар"},
|
||||
{ORTH: "мм", NORM: "милиметар"},
|
||||
{ORTH: "цм", NORM: "центиметар"},
|
||||
{ORTH: "см", NORM: "сантиметар"},
|
||||
{ORTH: "дм", NORM: "дециметар"},
|
||||
{ORTH: "км", NORM: "километар"},
|
||||
{ORTH: "кг", NORM: "килограм"},
|
||||
{ORTH: "дкг", NORM: "декаграм"},
|
||||
{ORTH: "дг", NORM: "дециграм"},
|
||||
{ORTH: "мг", NORM: "милиграм"},
|
||||
{ORTH: "г", NORM: "грам"},
|
||||
{ORTH: "т", NORM: "тон"},
|
||||
{ORTH: "кл", NORM: "килолитар"},
|
||||
{ORTH: "хл", NORM: "хектолитар"},
|
||||
{ORTH: "дкл", NORM: "декалитар"},
|
||||
{ORTH: "л", NORM: "литар"},
|
||||
{ORTH: "дл", NORM: "децилитар"}
|
||||
|
||||
]
|
||||
for abbr in _abbr_exc:
|
||||
_exc[abbr[ORTH]] = [abbr]
|
||||
|
||||
_abbr_line_exc = [
|
||||
{ORTH: "д-р", NORM: "доктор"},
|
||||
{ORTH: "м-р", NORM: "магистер"},
|
||||
{ORTH: "г-ѓа", NORM: "госпоѓа"},
|
||||
{ORTH: "г-ца", NORM: "госпоѓица"},
|
||||
{ORTH: "г-дин", NORM: "господин"},
|
||||
|
||||
]
|
||||
|
||||
for abbr in _abbr_line_exc:
|
||||
_exc[abbr[ORTH]] = [abbr]
|
||||
|
||||
_abbr_dot_exc = [
|
||||
{ORTH: "в.", NORM: "век"},
|
||||
{ORTH: "в.д.", NORM: "вршител на должност"},
|
||||
{ORTH: "г.", NORM: "година"},
|
||||
{ORTH: "г.г.", NORM: "господин господин"},
|
||||
{ORTH: "м.р.", NORM: "машки род"},
|
||||
{ORTH: "год.", NORM: "женски род"},
|
||||
{ORTH: "с.р.", NORM: "среден род"},
|
||||
{ORTH: "н.е.", NORM: "наша ера"},
|
||||
{ORTH: "о.г.", NORM: "оваа година"},
|
||||
{ORTH: "о.м.", NORM: "овој месец"},
|
||||
{ORTH: "с.", NORM: "село"},
|
||||
{ORTH: "т.", NORM: "точка"},
|
||||
{ORTH: "т.е.", NORM: "то ест"},
|
||||
{ORTH: "т.н.", NORM: "таканаречен"},
|
||||
|
||||
{ORTH: "бр.", NORM: "број"},
|
||||
{ORTH: "гр.", NORM: "град"},
|
||||
{ORTH: "др.", NORM: "другар"},
|
||||
{ORTH: "и др.", NORM: "и друго"},
|
||||
{ORTH: "и сл.", NORM: "и слично"},
|
||||
{ORTH: "кн.", NORM: "книга"},
|
||||
{ORTH: "мн.", NORM: "множина"},
|
||||
{ORTH: "на пр.", NORM: "на пример"},
|
||||
{ORTH: "св.", NORM: "свети"},
|
||||
{ORTH: "сп.", NORM: "списание"},
|
||||
{ORTH: "с.", NORM: "страница"},
|
||||
{ORTH: "стр.", NORM: "страница"},
|
||||
{ORTH: "чл.", NORM: "член"},
|
||||
|
||||
{ORTH: "арх.", NORM: "архитект"},
|
||||
{ORTH: "бел.", NORM: "белешка"},
|
||||
{ORTH: "гимн.", NORM: "гимназија"},
|
||||
{ORTH: "ден.", NORM: "денар"},
|
||||
{ORTH: "ул.", NORM: "улица"},
|
||||
{ORTH: "инж.", NORM: "инженер"},
|
||||
{ORTH: "проф.", NORM: "професор"},
|
||||
{ORTH: "студ.", NORM: "студент"},
|
||||
{ORTH: "бот.", NORM: "ботаника"},
|
||||
{ORTH: "мат.", NORM: "математика"},
|
||||
{ORTH: "мед.", NORM: "медицина"},
|
||||
{ORTH: "прил.", NORM: "прилог"},
|
||||
{ORTH: "прид.", NORM: "придавка"},
|
||||
{ORTH: "сврз.", NORM: "сврзник"},
|
||||
{ORTH: "физ.", NORM: "физика"},
|
||||
{ORTH: "хем.", NORM: "хемија"},
|
||||
{ORTH: "пр. н.", NORM: "природни науки"},
|
||||
{ORTH: "истор.", NORM: "историја"},
|
||||
{ORTH: "геогр.", NORM: "географија"},
|
||||
{ORTH: "литер.", NORM: "литература"},
|
||||
|
||||
|
||||
]
|
||||
|
||||
for abbr in _abbr_dot_exc:
|
||||
_exc[abbr[ORTH]] = [abbr]
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = _exc
|
|
@ -1,4 +1,4 @@
|
|||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, TOKEN_MATCH
|
||||
from .stop_words import STOP_WORDS
|
||||
from .syntax_iterators import SYNTAX_ITERATORS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
|
@ -9,6 +9,7 @@ class TurkishDefaults(Language.Defaults):
|
|||
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
|
||||
lex_attr_getters = LEX_ATTRS
|
||||
stop_words = STOP_WORDS
|
||||
token_match = TOKEN_MATCH
|
||||
syntax_iterators = SYNTAX_ITERATORS
|
||||
|
||||
|
||||
|
|
|
@ -1,119 +1,181 @@
|
|||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
import re
|
||||
|
||||
from ..punctuation import ALPHA_LOWER, ALPHA
|
||||
from ...symbols import ORTH, NORM
|
||||
from ...util import update_exc
|
||||
|
||||
|
||||
_exc = {"sağol": [{ORTH: "sağ"}, {ORTH: "ol", NORM: "olun"}]}
|
||||
_exc = {}
|
||||
|
||||
|
||||
for exc_data in [
|
||||
{ORTH: "A.B.D.", NORM: "Amerika Birleşik Devletleri"},
|
||||
{ORTH: "Alb.", NORM: "Albay"},
|
||||
{ORTH: "Ar.Gör.", NORM: "Araştırma Görevlisi"},
|
||||
{ORTH: "Arş.Gör.", NORM: "Araştırma Görevlisi"},
|
||||
{ORTH: "Asb.", NORM: "Astsubay"},
|
||||
{ORTH: "Astsb.", NORM: "Astsubay"},
|
||||
{ORTH: "As.İz.", NORM: "Askeri İnzibat"},
|
||||
{ORTH: "Atğm", NORM: "Asteğmen"},
|
||||
{ORTH: "Av.", NORM: "Avukat"},
|
||||
{ORTH: "Apt.", NORM: "Apartmanı"},
|
||||
{ORTH: "Bçvş.", NORM: "Başçavuş"},
|
||||
_abbr_period_exc = [
|
||||
{ORTH: "A.B.D.", NORM: "Amerika"},
|
||||
{ORTH: "Alb.", NORM: "albay"},
|
||||
{ORTH: "Ank.", NORM: "Ankara"},
|
||||
{ORTH: "Ar.Gör."},
|
||||
{ORTH: "Arş.Gör."},
|
||||
{ORTH: "Asb.", NORM: "astsubay"},
|
||||
{ORTH: "Astsb.", NORM: "astsubay"},
|
||||
{ORTH: "As.İz."},
|
||||
{ORTH: "as.iz."},
|
||||
{ORTH: "Atğm", NORM: "asteğmen"},
|
||||
{ORTH: "Av.", NORM: "avukat"},
|
||||
{ORTH: "Apt.", NORM: "apartmanı"},
|
||||
{ORTH: "apt.", NORM: "apartmanı"},
|
||||
{ORTH: "Bçvş.", NORM: "başçavuş"},
|
||||
{ORTH: "bçvş.", NORM: "başçavuş"},
|
||||
{ORTH: "bk.", NORM: "bakınız"},
|
||||
{ORTH: "bknz.", NORM: "bakınız"},
|
||||
{ORTH: "Bnb.", NORM: "Binbaşı"},
|
||||
{ORTH: "Bnb.", NORM: "binbaşı"},
|
||||
{ORTH: "bnb.", NORM: "binbaşı"},
|
||||
{ORTH: "Böl.", NORM: "Bölümü"},
|
||||
{ORTH: "Bşk.", NORM: "Başkanlığı"},
|
||||
{ORTH: "Bştbp.", NORM: "Baştabip"},
|
||||
{ORTH: "Bul.", NORM: "Bulvarı"},
|
||||
{ORTH: "Cad.", NORM: "Caddesi"},
|
||||
{ORTH: "Böl.", NORM: "bölümü"},
|
||||
{ORTH: "böl.", NORM: "bölümü"},
|
||||
{ORTH: "Bşk.", NORM: "başkanlığı"},
|
||||
{ORTH: "bşk.", NORM: "başkanlığı"},
|
||||
{ORTH: "Bştbp.", NORM: "baştabip"},
|
||||
{ORTH: "bştbp.", NORM: "baştabip"},
|
||||
{ORTH: "Bul.", NORM: "bulvarı"},
|
||||
{ORTH: "bul.", NORM: "bulvarı"},
|
||||
{ORTH: "Cad.", NORM: "caddesi"},
|
||||
{ORTH: "cad.", NORM: "caddesi"},
|
||||
{ORTH: "çev.", NORM: "çeviren"},
|
||||
{ORTH: "Çvş.", NORM: "Çavuş"},
|
||||
{ORTH: "Çvş.", NORM: "çavuş"},
|
||||
{ORTH: "çvş.", NORM: "çavuş"},
|
||||
{ORTH: "dak.", NORM: "dakika"},
|
||||
{ORTH: "dk.", NORM: "dakika"},
|
||||
{ORTH: "Doç.", NORM: "Doçent"},
|
||||
{ORTH: "doğ.", NORM: "doğum tarihi"},
|
||||
{ORTH: "Doç.", NORM: "doçent"},
|
||||
{ORTH: "doğ."},
|
||||
{ORTH: "Dr.", NORM: "doktor"},
|
||||
{ORTH: "dr.", NORM:"doktor"},
|
||||
{ORTH: "drl.", NORM: "derleyen"},
|
||||
{ORTH: "Dz.", NORM: "Deniz"},
|
||||
{ORTH: "Dz.K.K.lığı", NORM: "Deniz Kuvvetleri Komutanlığı"},
|
||||
{ORTH: "Dz.Kuv.", NORM: "Deniz Kuvvetleri"},
|
||||
{ORTH: "Dz.Kuv.K.", NORM: "Deniz Kuvvetleri Komutanlığı"},
|
||||
{ORTH: "Dz.", NORM: "deniz"},
|
||||
{ORTH: "Dz.K.K.lığı"},
|
||||
{ORTH: "Dz.Kuv."},
|
||||
{ORTH: "Dz.Kuv.K."},
|
||||
{ORTH: "dzl.", NORM: "düzenleyen"},
|
||||
{ORTH: "Ecz.", NORM: "Eczanesi"},
|
||||
{ORTH: "Ecz.", NORM: "eczanesi"},
|
||||
{ORTH: "ecz.", NORM: "eczanesi"},
|
||||
{ORTH: "ekon.", NORM: "ekonomi"},
|
||||
{ORTH: "Fak.", NORM: "Fakültesi"},
|
||||
{ORTH: "Gn.", NORM: "Genel"},
|
||||
{ORTH: "Fak.", NORM: "fakültesi"},
|
||||
{ORTH: "Gn.", NORM: "genel"},
|
||||
{ORTH: "Gnkur.", NORM: "Genelkurmay"},
|
||||
{ORTH: "Gn.Kur.", NORM: "Genelkurmay"},
|
||||
{ORTH: "gr.", NORM: "gram"},
|
||||
{ORTH: "Hst.", NORM: "Hastanesi"},
|
||||
{ORTH: "Hs.Uzm.", NORM: "Hesap Uzmanı"},
|
||||
{ORTH: "Hst.", NORM: "hastanesi"},
|
||||
{ORTH: "hst.", NORM: "hastanesi"},
|
||||
{ORTH: "Hs.Uzm."},
|
||||
{ORTH: "huk.", NORM: "hukuk"},
|
||||
{ORTH: "Hv.", NORM: "Hava"},
|
||||
{ORTH: "Hv.K.K.lığı", NORM: "Hava Kuvvetleri Komutanlığı"},
|
||||
{ORTH: "Hv.Kuv.", NORM: "Hava Kuvvetleri"},
|
||||
{ORTH: "Hv.Kuv.K.", NORM: "Hava Kuvvetleri Komutanlığı"},
|
||||
{ORTH: "Hz.", NORM: "Hazreti"},
|
||||
{ORTH: "Hz.Öz.", NORM: "Hizmete Özel"},
|
||||
{ORTH: "İng.", NORM: "İngilizce"},
|
||||
{ORTH: "Jeol.", NORM: "Jeoloji"},
|
||||
{ORTH: "Hv.", NORM: "hava"},
|
||||
{ORTH: "Hv.K.K.lığı"},
|
||||
{ORTH: "Hv.Kuv."},
|
||||
{ORTH: "Hv.Kuv.K."},
|
||||
{ORTH: "Hz.", NORM: "hazreti"},
|
||||
{ORTH: "Hz.Öz."},
|
||||
{ORTH: "İng.", NORM: "ingilizce"},
|
||||
{ORTH: "İst.", NORM: "İstanbul"},
|
||||
{ORTH: "Jeol.", NORM: "jeoloji"},
|
||||
{ORTH: "jeol.", NORM: "jeoloji"},
|
||||
{ORTH: "Korg.", NORM: "Korgeneral"},
|
||||
{ORTH: "Kur.", NORM: "Kurmay"},
|
||||
{ORTH: "Kur.Bşk.", NORM: "Kurmay Başkanı"},
|
||||
{ORTH: "Kuv.", NORM: "Kuvvetleri"},
|
||||
{ORTH: "Ltd.", NORM: "Limited"},
|
||||
{ORTH: "Mah.", NORM: "Mahallesi"},
|
||||
{ORTH: "Korg.", NORM: "korgeneral"},
|
||||
{ORTH: "Kur.", NORM: "kurmay"},
|
||||
{ORTH: "Kur.Bşk."},
|
||||
{ORTH: "Kuv.", NORM: "kuvvetleri"},
|
||||
{ORTH: "Ltd.", NORM: "limited"},
|
||||
{ORTH: "ltd.", NORM: "limited"},
|
||||
{ORTH: "Mah.", NORM: "mahallesi"},
|
||||
{ORTH: "mah.", NORM: "mahallesi"},
|
||||
{ORTH: "max.", NORM: "maksimum"},
|
||||
{ORTH: "min.", NORM: "minimum"},
|
||||
{ORTH: "Müh.", NORM: "Mühendisliği"},
|
||||
{ORTH: "Müh.", NORM: "mühendisliği"},
|
||||
{ORTH: "müh.", NORM: "mühendisliği"},
|
||||
{ORTH: "MÖ.", NORM: "Milattan Önce"},
|
||||
{ORTH: "Onb.", NORM: "Onbaşı"},
|
||||
{ORTH: "Ord.", NORM: "Ordinaryüs"},
|
||||
{ORTH: "Org.", NORM: "Orgeneral"},
|
||||
{ORTH: "Ped.", NORM: "Pedagoji"},
|
||||
{ORTH: "Prof.", NORM: "Profesör"},
|
||||
{ORTH: "Sb.", NORM: "Subay"},
|
||||
{ORTH: "Sn.", NORM: "Sayın"},
|
||||
{ORTH: "M.Ö."},
|
||||
{ORTH: "M.S."},
|
||||
{ORTH: "Onb.", NORM: "onbaşı"},
|
||||
{ORTH: "Ord.", NORM: "ordinaryüs"},
|
||||
{ORTH: "Org.", NORM: "orgeneral"},
|
||||
{ORTH: "Ped.", NORM: "pedagoji"},
|
||||
{ORTH: "Prof.", NORM: "profesör"},
|
||||
{ORTH: "prof.", NORM: "profesör"},
|
||||
{ORTH: "Sb.", NORM: "subay"},
|
||||
{ORTH: "Sn.", NORM: "sayın"},
|
||||
{ORTH: "sn.", NORM: "saniye"},
|
||||
{ORTH: "Sok.", NORM: "Sokak"},
|
||||
{ORTH: "Şb.", NORM: "Şube"},
|
||||
{ORTH: "Şti.", NORM: "Şirketi"},
|
||||
{ORTH: "Tbp.", NORM: "Tabip"},
|
||||
{ORTH: "T.C.", NORM: "Türkiye Cumhuriyeti"},
|
||||
{ORTH: "Tel.", NORM: "Telefon"},
|
||||
{ORTH: "Sok.", NORM: "sokak"},
|
||||
{ORTH: "sok.", NORM: "sokak"},
|
||||
{ORTH: "Şb.", NORM: "şube"},
|
||||
{ORTH: "şb.", NORM: "şube"},
|
||||
{ORTH: "Şti.", NORM: "şirketi"},
|
||||
{ORTH: "şti.", NORM: "şirketi"},
|
||||
{ORTH: "Tbp.", NORM: "tabip"},
|
||||
{ORTH: "tbp.", NORM: "tabip"},
|
||||
{ORTH: "T.C."},
|
||||
{ORTH: "Tel.", NORM: "telefon"},
|
||||
{ORTH: "tel.", NORM: "telefon"},
|
||||
{ORTH: "telg.", NORM: "telgraf"},
|
||||
{ORTH: "Tğm.", NORM: "Teğmen"},
|
||||
{ORTH: "Tğm.", NORM: "teğmen"},
|
||||
{ORTH: "tğm.", NORM: "teğmen"},
|
||||
{ORTH: "tic.", NORM: "ticaret"},
|
||||
{ORTH: "Tug.", NORM: "Tugay"},
|
||||
{ORTH: "Tuğg.", NORM: "Tuğgeneral"},
|
||||
{ORTH: "Tümg.", NORM: "Tümgeneral"},
|
||||
{ORTH: "Uzm.", NORM: "Uzman"},
|
||||
{ORTH: "Üçvş.", NORM: "Üstçavuş"},
|
||||
{ORTH: "Üni.", NORM: "Üniversitesi"},
|
||||
{ORTH: "Ütğm.", NORM: "Üsteğmen"},
|
||||
{ORTH: "vb.", NORM: "ve benzeri"},
|
||||
{ORTH: "Tug.", NORM: "tugay"},
|
||||
{ORTH: "Tuğg.", NORM: "tuğgeneral"},
|
||||
{ORTH: "Tümg.", NORM: "tümgeneral"},
|
||||
{ORTH: "Uzm.", NORM: "uzman"},
|
||||
{ORTH: "Üçvş.", NORM: "üstçavuş"},
|
||||
{ORTH: "Üni.", NORM: "üniversitesi"},
|
||||
{ORTH: "Ütğm.", NORM: "üsteğmen"},
|
||||
{ORTH: "vb."},
|
||||
{ORTH: "vs.", NORM: "vesaire"},
|
||||
{ORTH: "Yard.", NORM: "Yardımcı"},
|
||||
{ORTH: "Yar.", NORM: "Yardımcı"},
|
||||
{ORTH: "Yd.Sb.", NORM: "Yedek Subay"},
|
||||
{ORTH: "Yard.Doç.", NORM: "Yardımcı Doçent"},
|
||||
{ORTH: "Yar.Doç.", NORM: "Yardımcı Doçent"},
|
||||
{ORTH: "Yb.", NORM: "Yarbay"},
|
||||
{ORTH: "Yrd.", NORM: "Yardımcı"},
|
||||
{ORTH: "Yrd.Doç.", NORM: "Yardımcı Doçent"},
|
||||
{ORTH: "Y.Müh.", NORM: "Yüksek mühendis"},
|
||||
{ORTH: "Y.Mim.", NORM: "Yüksek mimar"},
|
||||
]:
|
||||
_exc[exc_data[ORTH]] = [exc_data]
|
||||
{ORTH: "Yard.", NORM: "yardımcı"},
|
||||
{ORTH: "Yar.", NORM: "yardımcı"},
|
||||
{ORTH: "Yd.Sb."},
|
||||
{ORTH: "Yard.Doç."},
|
||||
{ORTH: "Yar.Doç."},
|
||||
{ORTH: "Yb.", NORM: "yarbay"},
|
||||
{ORTH: "Yrd.", NORM: "yardımcı"},
|
||||
{ORTH: "Yrd.Doç."},
|
||||
{ORTH: "Y.Müh."},
|
||||
{ORTH: "Y.Mim."},
|
||||
{ORTH: "yy.", NORM: "yüzyıl"},
|
||||
]
|
||||
|
||||
for abbr in _abbr_period_exc:
|
||||
_exc[abbr[ORTH]] = [abbr]
|
||||
|
||||
_abbr_exc = [
|
||||
{ORTH: "AB", NORM: "Avrupa Birliği"},
|
||||
{ORTH: "ABD", NORM: "Amerika"},
|
||||
{ORTH: "ABS", NORM: "fren"},
|
||||
{ORTH: "AOÇ"},
|
||||
{ORTH: "ASKİ"},
|
||||
{ORTH: "Bağ-kur", NORM: "Bağkur"},
|
||||
{ORTH: "BDDK"},
|
||||
{ORTH: "BJK", NORM: "Beşiktaş"},
|
||||
{ORTH: "ESA", NORM: "Avrupa uzay ajansı"},
|
||||
{ORTH: "FB", NORM: "Fenerbahçe"},
|
||||
{ORTH: "GATA"},
|
||||
{ORTH: "GS", NORM: "Galatasaray"},
|
||||
{ORTH: "İSKİ"},
|
||||
{ORTH: "KBB"},
|
||||
{ORTH: "RTÜK", NORM: "radyo ve televizyon üst kurulu"},
|
||||
{ORTH: "TBMM"},
|
||||
{ORTH: "TC"},
|
||||
{ORTH: "TÜİK", NORM: "Türkiye istatistik kurumu"},
|
||||
{ORTH: "YÖK"},
|
||||
]
|
||||
|
||||
for abbr in _abbr_exc:
|
||||
_exc[abbr[ORTH]] = [abbr]
|
||||
|
||||
|
||||
for orth in ["Dr.", "yy."]:
|
||||
_exc[orth] = [{ORTH: orth}]
|
||||
|
||||
_num = r"[+-]?\d+([,.]\d+)*"
|
||||
_ord_num = r"(\d+\.)"
|
||||
_date = r"(((\d{1,2}[./-]){2})?(\d{4})|(\d{1,2}[./]\d{1,2}(\.)?))"
|
||||
_dash_num = r"(([{al}\d]+/\d+)|(\d+/[{al}]))".format(al=ALPHA)
|
||||
_roman_num = "M{0,3}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})"
|
||||
_roman_ord = r"({rn})\.".format(rn=_roman_num)
|
||||
_time_exp = r"\d+(:\d+)*"
|
||||
|
||||
TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
|
||||
_inflections = r"'[{al}]+".format(al=ALPHA_LOWER)
|
||||
_abbrev_inflected = r"[{a}]+\.'[{al}]+".format(a=ALPHA, al=ALPHA_LOWER)
|
||||
|
||||
_nums = r"(({d})|({dn})|({te})|({on})|({n})|({ro})|({rn}))({inf})?".format(d=_date, dn=_dash_num, te=_time_exp, on=_ord_num, n=_num, ro=_roman_ord, rn=_roman_num, inf=_inflections)
|
||||
|
||||
TOKENIZER_EXCEPTIONS = _exc
|
||||
TOKEN_MATCH = re.compile(r"^({abbr})|({n})$".format(n=_nums, abbr=_abbrev_inflected)).match
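The number, date and inflected-abbreviation patterns above can be tried outside spaCy with a small sketch. The `ALPHA` and `ALPHA_LOWER` values here are simplified stand-ins (an assumption: the real ones are imported from spaCy's character classes and cover far more characters), and `is_single_token` is a hypothetical helper name used only for illustration.

```python
import re

# Simplified stand-ins for spaCy's character classes (assumption: the real
# ALPHA / ALPHA_LOWER cover many more characters than these).
ALPHA = "a-zA-ZçğıöşüÇĞİÖŞÜ"
ALPHA_LOWER = "a-zçğıöşü"

# Patterns copied from the diff above.
_num = r"[+-]?\d+([,.]\d+)*"
_ord_num = r"(\d+\.)"
_date = r"(((\d{1,2}[./-]){2})?(\d{4})|(\d{1,2}[./]\d{1,2}(\.)?))"
_dash_num = r"(([{al}\d]+/\d+)|(\d+/[{al}]))".format(al=ALPHA)
_roman_num = "M{0,3}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})"
_roman_ord = r"({rn})\.".format(rn=_roman_num)
_time_exp = r"\d+(:\d+)*"

# Turkish case suffixes attached with an apostrophe, e.g. 6'da, Dr.'la
_inflections = r"'[{al}]+".format(al=ALPHA_LOWER)
_abbrev_inflected = r"[{a}]+\.'[{al}]+".format(a=ALPHA, al=ALPHA_LOWER)

_nums = r"(({d})|({dn})|({te})|({on})|({n})|({ro})|({rn}))({inf})?".format(
    d=_date, dn=_dash_num, te=_time_exp, on=_ord_num, n=_num,
    ro=_roman_ord, rn=_roman_num, inf=_inflections,
)

token_match = re.compile(
    r"^({abbr})|({n})$".format(n=_nums, abbr=_abbrev_inflected)
).match


def is_single_token(text: str) -> bool:
    """Hypothetical helper: True if the pattern keeps `text` as one token."""
    return token_match(text) is not None


for sample in ["6'da", "Dr.'la", "15:30'da", "2/3/2020", "görüştüm"]:
    print(sample, is_single_token(sample))
```

Because the result of `re.compile(...).match` is used, matching always starts at the beginning of the string; the trailing `$` applies only to the numeric alternative, not to the abbreviation one.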
|
||||
|
|
|
@ -968,10 +968,6 @@ class Language:
|
|||
|
||||
DOCS: https://nightly.spacy.io/api/language#call
|
||||
"""
|
||||
if len(text) > self.max_length:
|
||||
raise ValueError(
|
||||
Errors.E088.format(length=len(text), max_length=self.max_length)
|
||||
)
|
||||
doc = self.make_doc(text)
|
||||
if component_cfg is None:
|
||||
component_cfg = {}
|
||||
|
@ -1045,6 +1041,11 @@ class Language:
|
|||
text (str): The text to process.
|
||||
RETURNS (Doc): The processed doc.
|
||||
"""
|
||||
if len(text) > self.max_length:
|
||||
raise ValueError(
|
||||
Errors.E088.format(length=len(text), max_length=self.max_length)
|
||||
)
|
||||
return self.tokenizer(text)
|
||||
return self.tokenizer(text)
|
||||
|
||||
def update(
|
||||
|
|
|
@ -26,6 +26,7 @@ cdef enum quantifier_t:
|
|||
ZERO_PLUS
|
||||
ONE
|
||||
ONE_PLUS
|
||||
FINAL_ID
|
||||
|
||||
|
||||
cdef struct AttrValueC:
|
||||
|
|
|
@ -2,7 +2,7 @@
|
|||
from typing import List
|
||||
|
||||
from libcpp.vector cimport vector
|
||||
from libc.stdint cimport int32_t
|
||||
from libc.stdint cimport int32_t, int8_t
|
||||
from libc.string cimport memset, memcmp
|
||||
from cymem.cymem cimport Pool
|
||||
from murmurhash.mrmr cimport hash64
|
||||
|
@ -308,7 +308,7 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
|
|||
# avoid any processing or mem alloc if the document is empty
|
||||
return output
|
||||
if len(predicates) > 0:
|
||||
predicate_cache = <char*>mem.alloc(length * len(predicates), sizeof(char))
|
||||
predicate_cache = <int8_t*>mem.alloc(length * len(predicates), sizeof(int8_t))
|
||||
if extensions is not None and len(extensions) >= 1:
|
||||
nr_extra_attr = max(extensions.values()) + 1
|
||||
extra_attr_values = <attr_t*>mem.alloc(length * nr_extra_attr, sizeof(attr_t))
|
||||
|
@ -349,7 +349,7 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
|
|||
|
||||
|
||||
cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& matches,
|
||||
char* cached_py_predicates,
|
||||
int8_t* cached_py_predicates,
|
||||
Token token, const attr_t* extra_attrs, py_predicates) except *:
|
||||
cdef int q = 0
|
||||
cdef vector[PatternStateC] new_states
|
||||
|
@ -421,7 +421,7 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match
|
|||
states.push_back(new_states[i])
|
||||
|
||||
|
||||
cdef int update_predicate_cache(char* cache,
|
||||
cdef int update_predicate_cache(int8_t* cache,
|
||||
const TokenPatternC* pattern, Token token, predicates) except -1:
|
||||
# If the state references any extra predicates, check whether they match.
|
||||
# These are cached, so that we don't call these potentially expensive
|
||||
|
@ -459,7 +459,7 @@ cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states)
|
|||
|
||||
cdef action_t get_action(PatternStateC state,
|
||||
const TokenC* token, const attr_t* extra_attrs,
|
||||
const char* predicate_matches) nogil:
|
||||
const int8_t* predicate_matches) nogil:
|
||||
"""We need to consider:
|
||||
a) Does the token match the specification? [Yes, No]
|
||||
b) What's the quantifier? [1, 0+, ?]
|
||||
|
@ -517,7 +517,7 @@ cdef action_t get_action(PatternStateC state,
|
|||
|
||||
Problem: If a quantifier is matching, we're adding a lot of open partials
|
||||
"""
|
||||
cdef char is_match
|
||||
cdef int8_t is_match
|
||||
is_match = get_is_match(state, token, extra_attrs, predicate_matches)
|
||||
quantifier = get_quantifier(state)
|
||||
is_final = get_is_final(state)
|
||||
|
@ -569,9 +569,9 @@ cdef action_t get_action(PatternStateC state,
|
|||
return RETRY
|
||||
|
||||
|
||||
cdef char get_is_match(PatternStateC state,
|
||||
cdef int8_t get_is_match(PatternStateC state,
|
||||
const TokenC* token, const attr_t* extra_attrs,
|
||||
const char* predicate_matches) nogil:
|
||||
const int8_t* predicate_matches) nogil:
|
||||
for i in range(state.pattern.nr_py):
|
||||
if predicate_matches[state.pattern.py_predicates[i]] == -1:
|
||||
return 0
|
||||
|
@ -586,8 +586,8 @@ cdef char get_is_match(PatternStateC state,
|
|||
return True
|
||||
|
||||
|
||||
cdef char get_is_final(PatternStateC state) nogil:
|
||||
if state.pattern[1].nr_attr == 0 and state.pattern[1].attrs != NULL:
|
||||
cdef int8_t get_is_final(PatternStateC state) nogil:
|
||||
if state.pattern[1].quantifier == FINAL_ID:
|
||||
id_attr = state.pattern[1].attrs[0]
|
||||
if id_attr.attr != ID:
|
||||
with gil:
|
||||
|
@ -597,7 +597,7 @@ cdef char get_is_final(PatternStateC state) nogil:
|
|||
return 0
|
||||
|
||||
|
||||
cdef char get_quantifier(PatternStateC state) nogil:
|
||||
cdef int8_t get_quantifier(PatternStateC state) nogil:
|
||||
return state.pattern.quantifier
|
||||
|
||||
|
||||
|
@ -626,36 +626,20 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
|
|||
pattern[i].nr_py = len(predicates)
|
||||
pattern[i].key = hash64(pattern[i].attrs, pattern[i].nr_attr * sizeof(AttrValueC), 0)
|
||||
i = len(token_specs)
|
||||
# Even though here, nr_attr == 0, we're storing the ID value in attrs[0] (bug-prone, tread carefully!)
|
||||
pattern[i].attrs = <AttrValueC*>mem.alloc(2, sizeof(AttrValueC))
|
||||
# Use quantifier to identify final ID pattern node (rather than previous
|
||||
# uninitialized quantifier == 0/ZERO + nr_attr == 0 + non-zero-length attrs)
|
||||
pattern[i].quantifier = FINAL_ID
|
||||
pattern[i].attrs = <AttrValueC*>mem.alloc(1, sizeof(AttrValueC))
|
||||
pattern[i].attrs[0].attr = ID
|
||||
pattern[i].attrs[0].value = entity_id
|
||||
pattern[i].nr_attr = 0
|
||||
pattern[i].nr_attr = 1
|
||||
pattern[i].nr_extra_attr = 0
|
||||
pattern[i].nr_py = 0
|
||||
return pattern
|
||||
|
||||
|
||||
cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
|
||||
# There have been a few bugs here. We used to have two functions,
|
||||
# get_ent_id and get_pattern_key that tried to do the same thing. These
|
||||
# are now unified to try to solve the "ghost match" problem.
|
||||
# Below is the previous implementation of get_ent_id and the comment on it,
|
||||
# preserved for reference while we figure out whether the heisenbug in the
|
||||
# matcher is resolved.
|
||||
#
|
||||
#
|
||||
# cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
|
||||
# # The code was originally designed to always have pattern[1].attrs.value
|
||||
# # be the ent_id when we get to the end of a pattern. However, Issue #2671
|
||||
# # showed this wasn't the case when we had a reject-and-continue before a
|
||||
# # match.
|
||||
# # The patch to #2671 was wrong though, which came up in #3839.
|
||||
# while pattern.attrs.attr != ID:
|
||||
# pattern += 1
|
||||
# return pattern.attrs.value
|
||||
while pattern.nr_attr != 0 or pattern.nr_extra_attr != 0 or pattern.nr_py != 0 \
|
||||
or pattern.quantifier != ZERO:
|
||||
while pattern.quantifier != FINAL_ID:
|
||||
pattern += 1
|
||||
id_attr = pattern[0].attrs[0]
|
||||
if id_attr.attr != ID:
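The change in this hunk marks the terminal pattern node with an explicit `FINAL_ID` quantifier instead of inferring it from empty fields. A minimal pure-Python sketch of that sentinel idea follows; `PatternNode`, the string `ID` constant, the first two enum names and all numeric enum values are assumptions made for illustration, not the Cython structs from the diff.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Sequence, Tuple


class Quantifier(IntEnum):
    # The last four names appear in the diff; the first two names and all
    # numeric values are assumptions made for this sketch.
    ZERO = 0
    ZERO_ONE = 1
    ZERO_PLUS = 2
    ONE = 3
    ONE_PLUS = 4
    FINAL_ID = 5


ID = "ID"  # hypothetical stand-in for the ID attribute constant


@dataclass
class PatternNode:
    quantifier: Quantifier
    attrs: List[Tuple[str, int]] = field(default_factory=list)


def init_pattern(
    entity_id: int, token_specs: Sequence[Tuple[Quantifier, list]]
) -> List[PatternNode]:
    nodes = [PatternNode(q, list(attrs)) for q, attrs in token_specs]
    # Terminal node: flagged explicitly and carrying the entity ID in attrs[0].
    nodes.append(PatternNode(Quantifier.FINAL_ID, [(ID, entity_id)]))
    return nodes


def get_ent_id(nodes: List[PatternNode], start: int = 0) -> int:
    # Walk forward from any node until the explicit terminal marker is found.
    i = start
    while nodes[i].quantifier != Quantifier.FINAL_ID:
        i += 1
    attr, value = nodes[i].attrs[0]
    if attr != ID:
        raise ValueError("terminal node does not carry an ID attribute")
    return value


# A node with no attributes and quantifier ZERO is no longer confused
# with the end of the pattern.
pattern = init_pattern(42, [(Quantifier.ONE, [("LOWER", 7)]), (Quantifier.ZERO, [])])
print(get_ent_id(pattern), get_ent_id(pattern, start=1))
```

With a dedicated marker, the scan depends on one field only, which is the property the added comment in `init_pattern` describes.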
|
||||
|
|
|
@ -261,7 +261,11 @@ class EntityRuler(Pipe):
|
|||
|
||||
# disable the nlp components after this one in case they hadn't been initialized / deserialised yet
|
||||
try:
|
||||
current_index = self.nlp.pipe_names.index(self.name)
|
||||
current_index = -1
|
||||
for i, (name, pipe) in enumerate(self.nlp.pipeline):
|
||||
if self == pipe:
|
||||
current_index = i
|
||||
break
|
||||
subsequent_pipes = [
|
||||
pipe for pipe in self.nlp.pipe_names[current_index + 1 :]
|
||||
]
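The replacement lookup above finds the component by comparing objects rather than by searching `pipe_names` for `self.name`. A small sketch of that lookup, using a plain list of `(name, component)` tuples as a stand-in for `nlp.pipeline` (the `Component` class and `subsequent_pipe_names` function are hypothetical names); presumably this helps when the component is registered under a name that differs from its own `name` attribute, although the diff does not say so.

```python
from typing import Any, List, Tuple


class Component:
    """Hypothetical stand-in for a pipeline component."""

    def __init__(self, name: str) -> None:
        self.name = name


def subsequent_pipe_names(
    pipeline: List[Tuple[str, Any]], component: Any
) -> List[str]:
    # Locate the component by comparing objects, as in the diff.
    current_index = -1
    for i, (_name, pipe) in enumerate(pipeline):
        if component == pipe:
            current_index = i
            break
    # If the component is not found, current_index stays -1 and every
    # name in the pipeline is returned.
    return [name for name, _pipe in pipeline[current_index + 1 :]]


ruler = Component("entity_ruler")
pipeline = [
    ("tagger", Component("tagger")),
    ("my_ruler", ruler),  # registered under a custom name
    ("ner", Component("ner")),
]
print(subsequent_pipe_names(pipeline, ruler))
```

A name-based search for `"entity_ruler"` would raise `ValueError` on this pipeline, while the object comparison still finds the component.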
|
||||
|
|
|
@ -4,7 +4,7 @@ from thinc.api import Model
|
|||
from pathlib import Path
|
||||
|
||||
from .pipe import Pipe
|
||||
from ..errors import Errors
|
||||
from ..errors import Errors, Warnings
|
||||
from ..language import Language
|
||||
from ..training import Example
|
||||
from ..lookups import Lookups, load_lookups
|
||||
|
@ -197,6 +197,8 @@ class Lemmatizer(Pipe):
|
|||
string = token.text
|
||||
univ_pos = token.pos_.lower()
|
||||
if univ_pos in ("", "eol", "space"):
|
||||
if univ_pos == "":
|
||||
logger.warn(Warnings.W108.format(text=string))
|
||||
return [string.lower()]
|
||||
# See Issue #435 for an example of where this logic is required.
|
||||
if self.is_base_form(token):
|
||||
|
|
|
@ -172,6 +172,11 @@ def lt_tokenizer():
|
|||
return get_lang_class("lt")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def mk_tokenizer():
|
||||
return get_lang_class("mk")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def ml_tokenizer():
|
||||
return get_lang_class("ml")().tokenizer
|
||||
|
|
|
@ -123,6 +123,7 @@ def test_doc_api_serialize(en_tokenizer, text):
|
|||
tokens[0].norm_ = "norm"
|
||||
tokens.ents = [(tokens.vocab.strings["PRODUCT"], 0, 1)]
|
||||
tokens[0].ent_kb_id_ = "ent_kb_id"
|
||||
tokens[0].ent_id_ = "ent_id"
|
||||
new_tokens = Doc(tokens.vocab).from_bytes(tokens.to_bytes())
|
||||
assert tokens.text == new_tokens.text
|
||||
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
||||
|
@ -130,6 +131,7 @@ def test_doc_api_serialize(en_tokenizer, text):
|
|||
assert new_tokens[0].lemma_ == "lemma"
|
||||
assert new_tokens[0].norm_ == "norm"
|
||||
assert new_tokens[0].ent_kb_id_ == "ent_kb_id"
|
||||
assert new_tokens[0].ent_id_ == "ent_id"
|
||||
|
||||
new_tokens = Doc(tokens.vocab).from_bytes(
|
||||
tokens.to_bytes(exclude=["tensor"]), exclude=["tensor"]
|
||||
|
|
|
@ -416,6 +416,13 @@ def test_doc_retokenizer_merge_lex_attrs(en_vocab):
|
|||
assert doc[1].is_stop
|
||||
assert not doc[0].is_stop
|
||||
assert not doc[1].like_num
|
||||
# Test that norm is only set on tokens
|
||||
doc = Doc(en_vocab, words=["eins", "zwei", "!", "!"])
|
||||
assert doc[0].norm_ == "eins"
|
||||
with doc.retokenize() as retokenizer:
|
||||
retokenizer.merge(doc[0:1], attrs={"norm": "1"})
|
||||
assert doc[0].norm_ == "1"
|
||||
assert en_vocab["eins"].norm_ == "eins"
|
||||
|
||||
|
||||
def test_retokenize_skip_duplicates(en_vocab):
|
||||
|
|
0
spacy/tests/lang/mk/__init__.py
Normal file
84
spacy/tests/lang/mk/test_text.py
Normal file
|
@ -0,0 +1,84 @@
|
|||
import pytest
|
||||
from spacy.lang.mk.lex_attrs import like_num
|
||||
|
||||
|
||||
def test_tokenizer_handles_long_text(mk_tokenizer):
|
||||
text = """
|
||||
Во организациските работи или на нашите собранија со членството, никој од нас не зборуваше за
|
||||
организацијата и идеологијата. Работна беше нашата работа, а не идеолошка. Што се однесува до социјализмот на
|
||||
Делчев, неговата дејност зборува сама за себе - спротивно. Во суштина, водачите си имаа свои основни погледи и
|
||||
свои разбирања за положбата и работите, коишто стоеја пред нив и ги завршуваа со голема упорност, настојчивост и
|
||||
насоченост. Значи, идеологија имаше, само што нивната идеологија имаше своја оригиналност. Македонија денеска,
|
||||
чиста рожба на животот и положбата во Македонија, кои му служеа како база на неговите побуди, беше дејност која
|
||||
имаше потреба од ум за да си најде своја смисла. Таквата идеологија и заемното дејство на умот и срцето му
|
||||
помогнаа на Делчев да не се занесе по патот на својата идеологија... Во суштина, Организацијата и нејзините
|
||||
водачи имаа свои разбирања за работите и положбата во идеен поглед, но тоа беше врската, животот и положбата во
|
||||
Македонија и го внесуваа во својата идеологија гласот на своето срце, и на крај, прибегнуваа до умот,
|
||||
за да најдат смисла или да ѝ дадат. Тоа содејство и заемен сооднос на умот и срцето му помогнаа на Делчев да ја
|
||||
држи својата идеологија во сообразност со положбата на работите... Водачите навистина направија една жртва
|
||||
бидејќи на населението не му зборуваа за своите мисли и идеи. Тие се одрекоа од секаква субјективност во своите
|
||||
мисли. Целта беше да не се зголемуваат целите и задачите како и преданоста во работата. Населението не можеше да
|
||||
ги разбере овие идеи...
|
||||
"""
|
||||
tokens = mk_tokenizer(text)
|
||||
assert len(tokens) == 297
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"word,match",
|
||||
[
|
||||
("10", True),
|
||||
("1", True),
|
||||
("10.000", True),
|
||||
("1000", True),
|
||||
("бројка", False),
|
||||
("999,0", True),
|
||||
("еден", True),
|
||||
("два", True),
|
||||
("цифра", False),
|
||||
("десет", True),
|
||||
("сто", True),
|
||||
("број", False),
|
||||
("илјада", True),
|
||||
("илјади", True),
|
||||
("милион", True),
|
||||
(",", False),
|
||||
("милијарда", True),
|
||||
("билион", True),
|
||||
]
|
||||
)
|
||||
def test_mk_lex_attrs_like_number(mk_tokenizer, word, match):
|
||||
tokens = mk_tokenizer(word)
|
||||
assert len(tokens) == 1
|
||||
assert tokens[0].like_num == match
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"word",
|
||||
[
|
||||
"двесте",
|
||||
"два-три",
|
||||
"пет-шест"
|
||||
]
|
||||
)
|
||||
def test_mk_lex_attrs_capitals(word):
|
||||
assert like_num(word)
|
||||
assert like_num(word.upper())
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"word",
|
||||
[
|
||||
"првиот",
|
||||
"втора",
|
||||
"четврт",
|
||||
"четвртата",
|
||||
"петти",
|
||||
"петто",
|
||||
"стоти",
|
||||
"шеесетите",
|
||||
"седумдесетите"
|
||||
]
|
||||
)
|
||||
def test_mk_lex_attrs_like_number_for_ordinal(word):
|
||||
assert like_num(word)
|
|
@ -2,6 +2,27 @@ import pytest
|
|||
from spacy.lang.tr.lex_attrs import like_num
|
||||
|
||||
|
||||
def test_tr_tokenizer_handles_long_text(tr_tokenizer):
|
||||
text = """Pamuk nasıl ipliğe dönüştürülür?
|
||||
|
||||
Sıkıştırılmış balyalar halindeki pamuk, iplik fabrikasına getirildiğinde hem
|
||||
lifleri birbirine dolaşmıştır, hem de tarladan toplanırken araya bitkinin
|
||||
parçaları karışmıştır. Üstelik balyalardaki pamuğun cinsi aynı olsa bile kalitesi
|
||||
değişeceğinden, önce bütün balyaların birbirine karıştırılarak harmanlanması gerekir.
|
||||
|
||||
Daha sonra pamuk yığınları, liflerin açılıp temizlenmesi için tek bir birim halinde
|
||||
birleştirilmiş çeşitli makinelerden geçirilir.Bunlardan biri, dönen tokmaklarıyla
|
||||
pamuğu dövüp kabartarak dağınık yumaklar haline getiren ve liflerin arasındaki yabancı
|
||||
maddeleri temizleyen hallaç makinesidir. Daha sonra tarak makinesine giren pamuk demetleri,
|
||||
herbirinin yüzeyinde yüzbinlerce incecik iğne bulunan döner silindirlerin arasından geçerek lif lif ayrılır
|
||||
ve tül inceliğinde gevşek bir örtüye dönüşür. Ama bir sonraki makine bu lifleri dağınık
|
||||
ve gevşek bir biçimde birbirine yaklaştırarak 2 cm eninde bir pamuk şeridi haline getirir."""
|
||||
tokens = tr_tokenizer(text)
|
||||
assert len(tokens) == 146
|
||||
|
||||
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"word",
|
||||
[
|
||||
|
|
152
spacy/tests/lang/tr/test_tokenizer.py
Normal file
|
@ -0,0 +1,152 @@
|
|||
import pytest
|
||||
|
||||
|
||||
ABBREV_TESTS = [
|
||||
("Dr. Murat Bey ile görüştüm.", ["Dr.", "Murat", "Bey", "ile", "görüştüm", "."]),
|
||||
("Dr.la görüştüm.", ["Dr.la", "görüştüm", "."]),
|
||||
("Dr.'la görüştüm.", ["Dr.'la", "görüştüm", "."]),
|
||||
("TBMM'de çalışıyormuş.", ["TBMM'de", "çalışıyormuş", "."]),
|
||||
("Hem İst. hem Ank. bu konuda gayet iyi durumda.", ["Hem", "İst.", "hem", "Ank.", "bu", "konuda", "gayet", "iyi", "durumda", "."]),
|
||||
("Hem İst. hem Ank.'da yağış var.", ["Hem", "İst.", "hem", "Ank.'da", "yağış", "var", "."]),
|
||||
("Dr.", ["Dr."]),
|
||||
("Yrd.Doç.", ["Yrd.Doç."]),
|
||||
("Prof.'un", ["Prof.'un"]),
|
||||
("Böl.'nde", ["Böl.'nde"]),
|
||||
]
|
||||
|
||||
|
||||
|
||||
URL_TESTS = [
|
||||
("Bizler de www.duygu.com.tr adında bir websitesi kurduk.", ["Bizler", "de", "www.duygu.com.tr", "adında", "bir", "websitesi", "kurduk", "."]),
|
||||
("Bizler de https://www.duygu.com.tr adında bir websitesi kurduk.", ["Bizler", "de", "https://www.duygu.com.tr", "adında", "bir", "websitesi", "kurduk", "."]),
|
||||
("Bizler de www.duygu.com.tr'dan satın aldık.", ["Bizler", "de", "www.duygu.com.tr'dan", "satın", "aldık", "."]),
|
||||
("Bizler de https://www.duygu.com.tr'dan satın aldık.", ["Bizler", "de", "https://www.duygu.com.tr'dan", "satın", "aldık", "."]),
|
||||
]
|
||||
|
||||
|
||||
|
||||
NUMBER_TESTS = [
|
||||
("Rakamla 6 yazılıydı.", ["Rakamla", "6", "yazılıydı", "."]),
|
||||
("Hava -4 dereceydi.", ["Hava", "-4", "dereceydi", "."]),
|
||||
("Hava sıcaklığı -4ten +6ya yükseldi.", ["Hava", "sıcaklığı", "-4ten", "+6ya", "yükseldi", "."]),
|
||||
("Hava sıcaklığı -4'ten +6'ya yükseldi.", ["Hava", "sıcaklığı", "-4'ten", "+6'ya", "yükseldi", "."]),
|
||||
("Yarışta 6. oldum.", ["Yarışta", "6.", "oldum", "."]),
|
||||
("Yarışta 438547745. oldum.", ["Yarışta", "438547745.", "oldum", "."]),
|
||||
("Kitap IV. Murat hakkında.",["Kitap", "IV.", "Murat", "hakkında", "."]),
|
||||
#("Bana söylediği sayı 6.", ["Bana", "söylediği", "sayı", "6", "."]),
|
||||
("Saat 6'da buluşalım.", ["Saat", "6'da", "buluşalım", "."]),
|
||||
("Saat 6dan sonra buluşalım.", ["Saat", "6dan", "sonra", "buluşalım", "."]),
|
||||
("6.dan sonra saymadım.", ["6.dan", "sonra", "saymadım", "."]),
|
||||
("6.'dan sonra saymadım.", ["6.'dan", "sonra", "saymadım", "."]),
|
||||
("Saat 6'ydı.", ["Saat", "6'ydı", "."]),
|
||||
("5'te", ["5'te"]),
|
||||
("6'da", ["6'da"]),
|
||||
("9dan", ["9dan"]),
|
||||
("19'da", ["19'da"]),
|
||||
("VI'da", ["VI'da"]),
|
||||
("5.", ["5."]),
|
||||
("72.", ["72."]),
|
||||
("VI.", ["VI."]),
|
||||
("6.'dan", ["6.'dan"]),
|
||||
("19.'dan", ["19.'dan"]),
|
||||
("6.dan", ["6.dan"]),
|
||||
("16.dan", ["16.dan"]),
|
||||
("VI.'dan", ["VI.'dan"]),
|
||||
("VI.dan", ["VI.dan"]),
|
||||
("Hepsi 1994 yılında oldu.", ["Hepsi", "1994", "yılında", "oldu", "."]),
|
||||
("Hepsi 1994'te oldu.", ["Hepsi", "1994'te", "oldu", "."]),
|
||||
("2/3 tarihli faturayı bulamadım.", ["2/3", "tarihli", "faturayı", "bulamadım", "."]),
|
||||
("2.3 tarihli faturayı bulamadım.", ["2.3", "tarihli", "faturayı", "bulamadım", "."]),
|
||||
("2.3. tarihli faturayı bulamadım.", ["2.3.", "tarihli", "faturayı", "bulamadım", "."]),
|
||||
("2/3/2020 tarihli faturayı bulamadm.", ["2/3/2020", "tarihli", "faturayı", "bulamadm", "."]),
|
||||
("2/3/1987 tarihinden beri burda yaşıyorum.", ["2/3/1987", "tarihinden", "beri", "burda", "yaşıyorum", "."]),
|
||||
("2-3-1987 tarihinden beri burdayım.", ["2-3-1987", "tarihinden", "beri", "burdayım", "."]),
|
||||
("2.3.1987 tarihinden beri burdayım.", ["2.3.1987", "tarihinden", "beri", "burdayım", "."]),
|
||||
("Bu olay 2005-2006 tarihleri arasında oldu.", ["Bu", "olay", "2005", "-", "2006", "tarihleri", "arasında", "oldu", "."]),
|
||||
("Bu olay 4/12/2005-21/3/2006 tarihleri arasında oldu.", ["Bu", "olay", "4/12/2005", "-", "21/3/2006", "tarihleri", "arasında", "oldu", ".",]),
|
||||
("Ek fıkra: 5/11/2003-4999/3 maddesine göre uygundur.", ["Ek", "fıkra", ":", "5/11/2003", "-", "4999/3", "maddesine", "göre", "uygundur", "."]),
|
||||
("2/A alanları: 6831 sayılı Kanunun 2nci maddesinin birinci fıkrasının (A) bendine göre", ["2/A", "alanları", ":", "6831", "sayılı", "Kanunun", "2nci", "maddesinin", "birinci", "fıkrasının", "(", "A", ")", "bendine", "göre"]),
|
||||
("ŞEHİTTEĞMENKALMAZ Cad. No: 2/311", ["ŞEHİTTEĞMENKALMAZ", "Cad.", "No", ":", "2/311"]),
|
||||
("2-3-2025", ["2-3-2025",]),
|
||||
("2/3/2025", ["2/3/2025"]),
|
||||
("Yıllardır 0.5 uç kullanıyorum.", ["Yıllardır", "0.5", "uç", "kullanıyorum", "."]),
|
||||
("Kan değerlerim 0.5-0.7 arasıydı.", ["Kan", "değerlerim", "0.5", "-", "0.7", "arasıydı", "."]),
|
||||
("0.5", ["0.5"]),
|
||||
("1/2", ["1/2"]),
|
||||
("%1", ["%", "1"]),
|
||||
("%1lik", ["%", "1lik"]),
|
||||
("%1'lik", ["%", "1'lik"]),
|
||||
("%1lik dilim", ["%", "1lik", "dilim"]),
|
||||
("%1'lik dilim", ["%", "1'lik", "dilim"]),
|
||||
("%1.5", ["%", "1.5"]),
|
||||
#("%1-%2 arası büyüme bekleniyor.", ["%", "1", "-", "%", "2", "arası", "büyüme", "bekleniyor", "."]),
|
||||
("%1-2 arası büyüme bekliyoruz.", ["%", "1", "-", "2", "arası", "büyüme", "bekliyoruz", "."]),
|
||||
("%11-12 arası büyüme bekliyoruz.", ["%", "11", "-", "12", "arası", "büyüme", "bekliyoruz", "."]),
|
||||
("%1.5luk büyüme bekliyoruz.", ["%", "1.5luk", "büyüme", "bekliyoruz", "."]),
|
||||
("Saat 1-2 arası gelin lütfen.", ["Saat", "1", "-", "2", "arası", "gelin", "lütfen", "."]),
|
||||
("Saat 15:30 gibi buluşalım.", ["Saat", "15:30", "gibi", "buluşalım", "."]),
|
||||
("Saat 15:30'da buluşalım.", ["Saat", "15:30'da", "buluşalım", "."]),
|
||||
("Saat 15.30'da buluşalım.", ["Saat", "15.30'da", "buluşalım", "."]),
|
||||
("Saat 15.30da buluşalım.", ["Saat", "15.30da", "buluşalım", "."]),
|
||||
("Saat 15 civarı buluşalım.", ["Saat", "15", "civarı", "buluşalım", "."]),
|
||||
("9’daki otobüse binsek mi?", ["9’daki", "otobüse", "binsek", "mi", "?"]),
|
||||
("Okulumuz 3-B şubesi", ["Okulumuz", "3-B", "şubesi"]),
|
||||
("Okulumuz 3/B şubesi", ["Okulumuz", "3/B", "şubesi"]),
|
||||
("Okulumuz 3B şubesi", ["Okulumuz", "3B", "şubesi"]),
|
||||
("Okulumuz 3b şubesi", ["Okulumuz", "3b", "şubesi"]),
|
||||
("Antonio Gaudí 20. yüzyılda, 1904-1914 yılları arasında on yıl süren bir reform süreci getirmiştir.", ["Antonio", "Gaudí", "20.", "yüzyılda", ",", "1904", "-", "1914", "yılları", "arasında", "on", "yıl", "süren", "bir", "reform", "süreci", "getirmiştir", "."]),
|
||||
("Dizel yakıtın avro bölgesi ortalaması olan 1,165 avroya kıyasla litre başına 1,335 avroya mal olduğunu gösteriyor.", ["Dizel", "yakıtın", "avro", "bölgesi", "ortalaması", "olan", "1,165", "avroya", "kıyasla", "litre", "başına", "1,335", "avroya", "mal", "olduğunu", "gösteriyor", "."]),
|
||||
("Marcus Antonius M.Ö. 1 Ocak 49'da, Sezar'dan Vali'nin kendisini barış dostu ilan ettiği bir bildiri yayınlamıştır.", ["Marcus", "Antonius", "M.Ö.", "1", "Ocak", "49'da", ",", "Sezar'dan", "Vali'nin", "kendisini", "barış", "dostu", "ilan", "ettiği", "bir", "bildiri", "yayınlamıştır", "."])
|
||||
]
|
||||
|
||||
|
||||
PUNCT_TESTS = [
|
||||
("Gitmedim dedim ya!", ["Gitmedim", "dedim", "ya", "!"]),
|
||||
("Gitmedim dedim ya!!", ["Gitmedim", "dedim", "ya", "!", "!"]),
|
||||
("Gitsek mi?", ["Gitsek", "mi", "?"]),
|
||||
("Gitsek mi??", ["Gitsek", "mi", "?", "?"]),
|
||||
("Gitsek mi?!?", ["Gitsek", "mi", "?", "!", "?"]),
|
||||
("Ankara - Antalya arası otobüs işliyor.", ["Ankara", "-", "Antalya", "arası", "otobüs", "işliyor", "."]),
|
||||
("Ankara-Antalya arası otobüs işliyor.", ["Ankara", "-", "Antalya", "arası", "otobüs", "işliyor", "."]),
|
||||
("Sen--ben, ya da onlar.", ["Sen", "--", "ben", ",", "ya", "da", "onlar", "."]),
|
||||
("Senden, benden, bizden şarkısını biliyor musun?", ["Senden", ",", "benden", ",", "bizden", "şarkısını", "biliyor", "musun", "?"]),
|
||||
("Akif'le geldik, sonra da o ayrıldı.", ["Akif'le", "geldik", ",", "sonra", "da", "o", "ayrıldı", "."]),
|
||||
("Bu adam ne dedi şimdi???", ["Bu", "adam", "ne", "dedi", "şimdi", "?", "?", "?"]),
|
||||
("Yok hasta olmuş, yok annesi hastaymış, bahaneler işte...", ["Yok", "hasta", "olmuş", ",", "yok", "annesi", "hastaymış", ",", "bahaneler", "işte", "..."]),
|
||||
("Ankara'dan İstanbul'a ... bir aşk hikayesi.", ["Ankara'dan", "İstanbul'a", "...", "bir", "aşk", "hikayesi", "."]),
|
||||
("Ahmet'te", ["Ahmet'te"]),
|
||||
("İstanbul'da", ["İstanbul'da"]),
|
||||
]
|
||||
|
||||
GENERAL_TESTS = [
|
||||
("1914'teki Endurance seferinde, Sir Ernest Shackleton'ın kaptanlığını yaptığı İngiliz Endurance gemisi yirmi sekiz kişi ile Antarktika'yı geçmek üzere yelken açtı.", ["1914'teki", "Endurance", "seferinde", ",", "Sir", "Ernest", "Shackleton'ın", "kaptanlığını", "yaptığı", "İngiliz", "Endurance", "gemisi", "yirmi", "sekiz", "kişi", "ile", "Antarktika'yı", "geçmek", "üzere", "yelken", "açtı", "."]),
|
||||
("Danışılan \"%100 Cospedal\" olduğunu belirtti.", ["Danışılan", '"', "%", "100", "Cospedal", '"', "olduğunu", "belirtti", "."]),
|
||||
("1976'da parkur artık kullanılmıyordu; 1990'da ise bir yangın, daha sonraları ahırlarla birlikte yıkılacak olan tahta tribünlerden geri kalanları da yok etmişti.", ["1976'da", "parkur", "artık", "kullanılmıyordu", ";", "1990'da", "ise", "bir", "yangın", ",", "daha", "sonraları", "ahırlarla", "birlikte", "yıkılacak", "olan", "tahta", "tribünlerden", "geri", "kalanları", "da", "yok", "etmişti", "."]),
|
||||
("Dahiyane bir ameliyat ve zorlu bir rehabilitasyon sürecinden sonra, tamamen iyileştim.", ["Dahiyane", "bir", "ameliyat", "ve", "zorlu", "bir", "rehabilitasyon", "sürecinden", "sonra", ",", "tamamen", "iyileştim", "."]),
|
||||
("Yaklaşık iki hafta süren bireysel erken oy kullanma döneminin ardından 5,7 milyondan fazla Floridalı sandık başına gitti.", ["Yaklaşık", "iki", "hafta", "süren", "bireysel", "erken", "oy", "kullanma", "döneminin", "ardından", "5,7", "milyondan", "fazla", "Floridalı", "sandık", "başına", "gitti", "."]),
|
||||
("Ancak, bu ABD Çevre Koruma Ajansı'nın dünyayı bu konularda uyarmasının ardından ortaya çıktı.", ["Ancak", ",", "bu", "ABD", "Çevre", "Koruma", "Ajansı'nın", "dünyayı", "bu", "konularda", "uyarmasının", "ardından", "ortaya", "çıktı", "."]),
|
||||
("Ortalama şansa ve 10.000 Sterlin değerinde tahvillere sahip bir yatırımcı yılda 125 Sterlin ikramiye kazanabilir.", ["Ortalama", "şansa", "ve", "10.000", "Sterlin", "değerinde", "tahvillere", "sahip", "bir", "yatırımcı", "yılda", "125", "Sterlin", "ikramiye", "kazanabilir", "."]),
|
||||
("Granit adaları; Seyşeller ve Tioman ile Saint Helena gibi volkanik adaları kapsar." , ["Granit", "adaları", ";", "Seyşeller", "ve", "Tioman", "ile", "Saint", "Helena", "gibi", "volkanik", "adaları", "kapsar", "."]),
|
||||
("Barış antlaşmasıyla İspanya, Amerika'ya Porto Riko, Guam ve Filipinler kolonilerini devretti.", ["Barış", "antlaşmasıyla", "İspanya", ",", "Amerika'ya", "Porto", "Riko", ",", "Guam", "ve", "Filipinler", "kolonilerini", "devretti", "."]),
|
||||
("Makedonya\'nın sınır bölgelerini güvence altına alan Philip, büyük bir Makedon ordusu kurdu ve uzun bir fetih seferi için Trakya\'ya doğru yürüdü.", ["Makedonya\'nın", "sınır", "bölgelerini", "güvence", "altına", "alan", "Philip", ",", "büyük", "bir", "Makedon", "ordusu", "kurdu", "ve", "uzun", "bir", "fetih", "seferi", "için", "Trakya\'ya", "doğru", "yürüdü", "."]),
|
||||
("Fransız gazetesi Le Figaro'ya göre bu hükumet planı sayesinde 42 milyon Euro kazanç sağlanabilir ve elde edilen paranın 15.5 milyonu ulusal güvenlik için kullanılabilir.", ["Fransız", "gazetesi", "Le", "Figaro'ya", "göre", "bu", "hükumet", "planı", "sayesinde", "42", "milyon", "Euro", "kazanç", "sağlanabilir", "ve", "elde", "edilen", "paranın", "15.5", "milyonu", "ulusal", "güvenlik", "için", "kullanılabilir", "."]),
|
||||
("Ortalama şansa ve 10.000 Sterlin değerinde tahvillere sahip bir yatırımcı yılda 125 Sterlin ikramiye kazanabilir.", ["Ortalama", "şansa", "ve", "10.000", "Sterlin", "değerinde", "tahvillere", "sahip", "bir", "yatırımcı", "yılda", "125", "Sterlin", "ikramiye", "kazanabilir", "."]),
|
||||
("3 Kasım Salı günü, Ankara Belediye Başkanı 2014'te hükümetle birlikte oluşturulan kentsel gelişim anlaşmasını askıya alma kararı verdi.", ["3", "Kasım", "Salı", "günü", ",", "Ankara", "Belediye", "Başkanı", "2014'te", "hükümetle", "birlikte", "oluşturulan", "kentsel", "gelişim", "anlaşmasını", "askıya", "alma", "kararı", "verdi", "."]),
|
||||
("Stalin, Abakumov'u Beria'nın enerji bakanlıkları üzerindeki baskınlığına karşı MGB içinde kendi ağını kurmaya teşvik etmeye başlamıştı.", ["Stalin", ",", "Abakumov'u", "Beria'nın", "enerji", "bakanlıkları", "üzerindeki", "baskınlığına", "karşı", "MGB", "içinde", "kendi", "ağını", "kurmaya", "teşvik", "etmeye", "başlamıştı", "."]),
|
||||
("Güney Avrupa'daki kazı alanlarının çoğunluğu gibi, bu bulgu M.Ö. 5. yüzyılın başlar", ["Güney", "Avrupa'daki", "kazı", "alanlarının", "çoğunluğu", "gibi", ",", "bu", "bulgu", "M.Ö.", "5.", "yüzyılın", "başlar"]),
|
||||
("Sağlığın bozulması Hitchcock hayatının son yirmi yılında üretimini azalttı.", ["Sağlığın", "bozulması", "Hitchcock", "hayatının", "son", "yirmi", "yılında", "üretimini", "azalttı", "."]),
|
||||
]
|
||||
|
||||
|
||||
|
||||
TESTS = (ABBREV_TESTS + URL_TESTS + NUMBER_TESTS + PUNCT_TESTS + GENERAL_TESTS)
|
||||
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", TESTS)
|
||||
def test_tr_tokenizer_handles_allcases(tr_tokenizer, text, expected_tokens):
|
||||
tokens = tr_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
print(token_list)
|
||||
assert expected_tokens == token_list
|
||||
|
|
@ -457,6 +457,7 @@ def test_attr_pipeline_checks(en_vocab):
|
|||
([{"IS_LEFT_PUNCT": True}], "``"),
|
||||
([{"IS_RIGHT_PUNCT": True}], "''"),
|
||||
([{"IS_STOP": True}], "the"),
|
||||
([{"SPACY": True}], "the"),
|
||||
([{"LIKE_NUM": True}], "1"),
|
||||
([{"LIKE_URL": True}], "http://example.com"),
|
||||
([{"LIKE_EMAIL": True}], "mail@example.com"),
|
||||
|
|
|
@ -4,7 +4,9 @@ from pathlib import Path
|
|||
|
||||
def test_build_dependencies():
|
||||
# Check that library requirements are pinned exactly the same across different setup files.
|
||||
# TODO: correct checks for numpy rather than ignoring
|
||||
libs_ignore_requirements = [
|
||||
"numpy",
|
||||
"pytest",
|
||||
"pytest-timeout",
|
||||
"mock",
|
||||
|
@ -12,6 +14,7 @@ def test_build_dependencies():
|
|||
]
|
||||
# ignore language-specific packages that shouldn't be installed by all
|
||||
libs_ignore_setup = [
|
||||
"numpy",
|
||||
"fugashi",
|
||||
"natto-py",
|
||||
"pythainlp",
|
||||
|
@ -67,7 +70,7 @@ def test_build_dependencies():
|
|||
line = line.strip().strip(",").strip('"')
|
||||
if not line.startswith("#"):
|
||||
lib, v = _parse_req(line)
|
||||
if lib:
|
||||
if lib and lib not in libs_ignore_requirements:
|
||||
req_v = req_dict.get(lib, None)
|
||||
assert (lib + v) == (lib + req_v), (
|
||||
"{} has different version in pyproject.toml and in requirements.txt: "
|
||||
|
|
|
@ -197,3 +197,21 @@ def test_entity_ruler_overlapping_spans(nlp):
|
|||
doc = ruler(nlp.make_doc("foo bar baz"))
|
||||
assert len(doc.ents) == 1
|
||||
assert doc.ents[0].label_ == "FOOBAR"
|
||||
|
||||
|
||||
@pytest.mark.parametrize("n_process", [1, 2])
|
||||
def test_entity_ruler_multiprocessing(nlp, n_process):
|
||||
texts = [
|
||||
"I enjoy eating Pizza Hut pizza."
|
||||
]
|
||||
|
||||
patterns = [
|
||||
{"label": "FASTFOOD", "pattern": "Pizza Hut", "id": "1234"}
|
||||
]
|
||||
|
||||
ruler = nlp.add_pipe("entity_ruler")
|
||||
ruler.add_patterns(patterns)
|
||||
|
||||
for doc in nlp.pipe(texts, n_process=n_process):
|
||||
for ent in doc.ents:
|
||||
assert ent.ent_id_ == "1234"
|
||||
|
|
|
@ -1,4 +1,6 @@
|
|||
import pytest
|
||||
import logging
|
||||
import mock
|
||||
from spacy import util, registry
|
||||
from spacy.lang.en import English
|
||||
from spacy.lookups import Lookups
|
||||
|
@ -54,9 +56,18 @@ def test_lemmatizer_config(nlp):
|
|||
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "rule"})
|
||||
nlp.initialize()
|
||||
|
||||
# warning if no POS assigned
|
||||
doc = nlp.make_doc("coping")
|
||||
logger = logging.getLogger("spacy")
|
||||
with mock.patch.object(logger, "warn") as mock_warn:
|
||||
doc = lemmatizer(doc)
|
||||
mock_warn.assert_called_once()
|
||||
|
||||
# works with POS
|
||||
doc = nlp.make_doc("coping")
|
||||
doc[0].pos_ = "VERB"
|
||||
assert doc[0].lemma_ == ""
|
||||
doc[0].pos_ = "VERB"
|
||||
doc = lemmatizer(doc)
|
||||
doc = lemmatizer(doc)
|
||||
assert doc[0].text == "coping"
|
||||
assert doc[0].lemma_ == "cope"
|
||||
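Review note: the test above patches the logger's `warn` method and asserts it was called exactly once when no POS is assigned. The same pattern can be sketched with the standard library only. This sketch makes several assumptions: `lemmatize` is a toy function rather than spaCy's lemmatizer, the logger name is made up, it patches `warning` instead of `warn`, and returning the word unchanged without a POS is an invented behaviour.

```python
import logging
from unittest import mock

# Hypothetical logger name; the test in the diff uses the "spacy" logger.
logger = logging.getLogger("lemmatizer_sketch")


def lemmatize(word, pos=None):
    """Toy rule lemmatizer: warn and return the word unchanged without a POS."""
    if pos is None:
        logger.warning("No POS tag set for %r; returning the text unchanged", word)
        return word
    if pos == "VERB" and word.endswith("ing"):
        # Naive rule for illustration: "coping" -> "cope".
        return word[:-3] + "e"
    return word


# Replace the logger method with a mock for the duration of the call.
with mock.patch.object(logger, "warning") as mock_warning:
    result_without_pos = lemmatize("coping")
mock_warning.assert_called_once()

result_with_pos = lemmatize("coping", pos="VERB")
```

Patching the method on the logger object keeps the check independent of handler configuration, so the assertion holds whether or not warnings are actually emitted anywhere.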
|
|
|
@ -8,7 +8,7 @@ from spacy.cli.init_config import init_config, RECOMMENDATIONS
|
|||
from spacy.cli._util import validate_project_commands, parse_config_overrides
|
||||
from spacy.cli._util import load_project_config, substitute_project_variables
|
||||
from spacy.cli._util import string_to_list
|
||||
from thinc.api import ConfigValidationError
|
||||
from thinc.api import ConfigValidationError, Config
|
||||
import srsly
|
||||
import os
|
||||
|
||||
|
@ -368,7 +368,8 @@ def test_parse_cli_overrides():
|
|||
@pytest.mark.parametrize("optimize", ["efficiency", "accuracy"])
|
||||
def test_init_config(lang, pipeline, optimize):
|
||||
# TODO: add more tests and also check for GPU with transformers
|
||||
init_config("-", lang=lang, pipeline=pipeline, optimize=optimize, gpu=False)
|
||||
config = init_config(lang=lang, pipeline=pipeline, optimize=optimize, gpu=False)
|
||||
assert isinstance(config, Config)
|
||||
|
||||
|
||||
def test_model_recommendations():
|
||||
|
|
|
@ -404,9 +404,7 @@ cdef class Tokenizer:
|
|||
cdef unicode minus_suf
|
||||
cdef size_t last_size = 0
|
||||
while string and len(string) != last_size:
|
||||
if self.token_match and self.token_match(string) \
|
||||
and not self.find_prefix(string) \
|
||||
and not self.find_suffix(string):
|
||||
if self.token_match and self.token_match(string):
|
||||
break
|
||||
if with_special_cases and self._specials.get(hash_string(string)) != NULL:
|
||||
break
|
||||
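Review note: per the hunk header (9 lines before, 7 after), the three-line condition is the old one and the single-line condition is the new one, so a `token_match` hit now stops affix splitting even when a prefix or suffix would also match. A minimal sketch of the two stop conditions, assuming hypothetical regexes: the patterns below are not spaCy's defaults, and `stops_old` / `stops_new` are illustrative names.

```python
import re

# Hypothetical stand-ins for the tokenizer's callables.
token_match = re.compile(r"^\(\w+\)$").match   # whole string like "(hello)"
find_prefix = re.compile(r"^[(\[]").search     # leading bracket
find_suffix = re.compile(r"[)\]]$").search     # trailing bracket


def stops_old(string):
    """Old condition: token_match only wins if no prefix or suffix is found."""
    return bool(
        token_match(string)
        and not find_prefix(string)
        and not find_suffix(string)
    )


def stops_new(string):
    """New condition: a token_match hit always stops further splitting."""
    return bool(token_match(string))


old_result = stops_old("(hello)")
new_result = stops_new("(hello)")
```

Under these assumed patterns, the old condition would keep splitting `(hello)` because of its brackets, while the new one keeps it as a single token.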
|
@ -679,6 +677,8 @@ cdef class Tokenizer:
|
|||
break
|
||||
suffixes.append(("SUFFIX", substring[split:]))
|
||||
substring = substring[:split]
|
||||
if len(substring) == 0:
|
||||
continue
|
||||
if token_match(substring):
|
||||
tokens.append(("TOKEN_MATCH", substring))
|
||||
substring = ''
|
||||
|
|
|
@ -11,7 +11,7 @@ from .span cimport Span
|
|||
from .token cimport Token
|
||||
from ..lexeme cimport Lexeme, EMPTY_LEXEME
|
||||
from ..structs cimport LexemeC, TokenC
|
||||
from ..attrs cimport MORPH
|
||||
from ..attrs cimport MORPH, NORM
|
||||
from ..vocab cimport Vocab
|
||||
|
||||
from .underscore import is_writable_attr
|
||||
|
@ -372,9 +372,10 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
|
|||
# Set attributes on both token and lexeme to take care of token
|
||||
# attribute vs. lexical attribute without having to enumerate
|
||||
# them. If an attribute name is not valid, set_struct_attr will
|
||||
# ignore it.
|
||||
# ignore it. Exception: set NORM only on tokens.
|
||||
Token.set_struct_attr(token, attr_name, get_string_id(attr_value))
|
||||
Lexeme.set_struct_attr(<LexemeC*>token.lex, attr_name, get_string_id(attr_value))
|
||||
if attr_name != NORM:
|
||||
Lexeme.set_struct_attr(<LexemeC*>token.lex, attr_name, get_string_id(attr_value))
|
||||
# Assign correct dependencies to the inner token
|
||||
for i, head in enumerate(heads):
|
||||
doc.c[token_index + i].head = head
|
||||
|
@ -435,6 +436,7 @@ def set_token_attrs(Token py_token, attrs):
|
|||
# Set attributes on both token and lexeme to take care of token
|
||||
# attribute vs. lexical attribute without having to enumerate
|
||||
# them. If an attribute name is not valid, set_struct_attr will
|
||||
# ignore it.
|
||||
# ignore it. Exception: set NORM only on tokens.
|
||||
Token.set_struct_attr(token, attr_name, attr_value)
|
||||
Lexeme.set_struct_attr(<LexemeC*>lex, attr_name, attr_value)
|
||||
if attr_name != NORM:
|
||||
Lexeme.set_struct_attr(<LexemeC*>lex, attr_name, attr_value)
|
||||
|
|
|
@ -5,7 +5,6 @@ from libc.stdint cimport uint8_t
|
|||
ctypedef float weight_t
|
||||
ctypedef uint64_t hash_t
|
||||
ctypedef uint64_t class_t
|
||||
ctypedef char* utf8_t
|
||||
ctypedef uint64_t attr_t
|
||||
ctypedef uint64_t flags_t
|
||||
ctypedef uint16_t len_t
|
||||
|
|
|
@ -1295,6 +1295,13 @@ def combine_score_weights(
|
|||
|
||||
|
||||
class DummyTokenizer:
|
||||
def __call__(self, text):
|
||||
raise NotImplementedError
|
||||
|
||||
def pipe(self, texts, **kwargs):
|
||||
for text in texts:
|
||||
yield self(text)
|
||||
|
||||
# add dummy methods for to_bytes, from_bytes, to_disk and from_disk to
|
||||
# allow serialization (see #1557)
|
||||
def to_bytes(self, **kwargs):
|
||||
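Review note: the added `pipe` method is a generator that calls the tokenizer on each text in turn, so any subclass that implements `__call__` gets streaming support for free. A minimal sketch, assuming a hypothetical `WhitespaceTokenizer` subclass that returns plain lists instead of `Doc` objects; the serialization methods from the hunk are omitted.

```python
class DummyTokenizer:
    def __call__(self, text):
        raise NotImplementedError

    def pipe(self, texts, **kwargs):
        # Lazily tokenize each text; nothing runs until iteration starts.
        for text in texts:
            yield self(text)


class WhitespaceTokenizer(DummyTokenizer):
    """Hypothetical subclass for illustration only."""

    def __call__(self, text):
        return text.split()


tokenizer = WhitespaceTokenizer()
batches = list(tokenizer.pipe(["one two", "three"]))

# The base class only fails once the generator is consumed.
base_gen = DummyTokenizer().pipe(["x"])
try:
    next(base_gen)
    base_raised = False
except NotImplementedError:
    base_raised = True
```

Because `pipe` accepts and ignores `**kwargs`, callers can pass batching options without the dummy tokenizer having to support them.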
|
|
|
@ -4,7 +4,7 @@ from cymem.cymem cimport Pool
|
|||
from murmurhash.mrmr cimport hash64
|
||||
|
||||
from .structs cimport LexemeC, TokenC
|
||||
from .typedefs cimport utf8_t, attr_t, hash_t
|
||||
from .typedefs cimport attr_t, hash_t
|
||||
from .strings cimport StringStore
|
||||
from .morphology cimport Morphology
|
||||
|
||||
|
|
|
@ -305,6 +305,9 @@ cdef class Vocab:
|
|||
DOCS: https://nightly.spacy.io/api/vocab#prune_vectors
|
||||
"""
|
||||
xp = get_array_module(self.vectors.data)
|
||||
# Make sure all vectors are in the vocab
|
||||
for orth in self.vectors:
|
||||
self[orth]
|
||||
# Make prob negative so it sorts by rank ascending
|
||||
# (key2row contains the rank)
|
||||
priority = [(-lex.prob, self.vectors.key2row[lex.orth], lex.orth)
|
||||
|
|
|
@ -39,7 +39,9 @@ rule-based matching are:
|
|||
| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ |
|
||||
| `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase. ~~bool~~ |
|
||||
| `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word. ~~bool~~ |
|
||||
| `IS_SENT_START` | Token is start of sentence. ~~bool~~ |
|
||||
| `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email. ~~bool~~ |
|
||||
| `SPACY` | Token has a trailing space. ~~bool~~ |
|
||||
| `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. ~~str~~ |
|
||||
| `ENT_TYPE` | The token's entity label. ~~str~~ |
|
||||
| `_` <Tag variant="new">2.1</Tag> | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~ |
|
||||
|
@ -61,7 +63,7 @@ matched:
|
|||
| `!` | Negate the pattern, by requiring it to match exactly 0 times. |
|
||||
| `?` | Make the pattern optional, by allowing it to match 0 or 1 times. |
|
||||
| `+` | Require the pattern to match 1 or more times. |
|
||||
| `*` | Allow the pattern to match 0 or more times. |
|
||||
| `*` | Allow the pattern to match 0 or more times. |
|
||||
|
||||
Token patterns can also map to a **dictionary of properties** instead of a
|
||||
single value to indicate whether the expected value is a member of a list or how
|
||||
|
|
|
@ -158,21 +158,22 @@ The available token pattern keys correspond to a number of
|
|||
[`Token` attributes](/api/token#attributes). The supported attributes for
|
||||
rule-based matching are:
|
||||
|
||||
| Attribute | Description |
|
||||
| ----------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `ORTH` | The exact verbatim text of a token. ~~str~~ |
|
||||
| `TEXT` <Tag variant="new">2.1</Tag> | The exact verbatim text of a token. ~~str~~ |
|
||||
| `LOWER` | The lowercase form of the token text. ~~str~~ |
|
||||
| `LENGTH` | The length of the token text. ~~int~~ |
|
||||
| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ |
|
||||
| `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase. ~~bool~~ |
|
||||
| `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word. ~~bool~~ |
|
||||
| `IS_SENT_START` | Token is start of sentence. ~~bool~~ |
|
||||
| `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email. ~~bool~~ |
|
||||
| `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. ~~str~~ |
|
||||
| `ENT_TYPE` | The token's entity label. ~~str~~ |
|
||||
| `_` <Tag variant="new">2.1</Tag> | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~ |
|
||||
| `OP` | [Operator or quantifier](#quantifiers) to determine how often to match a token pattern. ~~str~~ |
|
||||
| Attribute | Description |
|
||||
| ----------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `ORTH` | The exact verbatim text of a token. ~~str~~ |
|
||||
| `TEXT` <Tag variant="new">2.1</Tag> | The exact verbatim text of a token. ~~str~~ |
|
||||
| `LOWER` | The lowercase form of the token text. ~~str~~ |
|
||||
| `LENGTH` | The length of the token text. ~~int~~ |
|
||||
| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ |
|
||||
| `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase. ~~bool~~ |
|
||||
| `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word. ~~bool~~ |
|
||||
| `IS_SENT_START` | Token is start of sentence. ~~bool~~ |
|
||||
| `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email. ~~bool~~ |
|
||||
| `SPACY` | Token has a trailing space. ~~bool~~ |
|
||||
| `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. Note that the values of these attributes are case-sensitive. For a list of available part-of-speech tags and dependency labels, see the [Annotation Specifications](/api/annotation). ~~str~~ |
|
||||
| `ENT_TYPE` | The token's entity label. ~~str~~ |
|
||||
| `_` <Tag variant="new">2.1</Tag> | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~ |
|
||||
| `OP` | [Operator or quantifier](#quantifiers) to determine how often to match a token pattern. ~~str~~ |
|
||||
|
||||
<Accordion title="Does it matter if the attribute names are uppercase or lowercase?">
|
||||
|
||||
|
|
|
@ -199,6 +199,36 @@
|
|||
"name": "Vietnamese",
|
||||
"dependencies": [{ "name": "Pyvi", "url": "https://github.com/trungtv/pyvi" }]
|
||||
},
|
||||
{
|
||||
"code": "lij",
|
||||
"name": "Ligurian",
|
||||
"example": "Sta chì a l'é unna fraxe.",
|
||||
"has_examples": true
|
||||
},
|
||||
{
|
||||
"code": "hy",
|
||||
"name": "Armenian",
|
||||
"has_examples": true
|
||||
},
|
||||
{
|
||||
"code": "gu",
|
||||
"name": "Gujarati",
|
||||
"has_examples": true
|
||||
},
|
||||
{
|
||||
"code": "ml",
|
||||
"name": "Malayalam",
|
||||
"has_examples": true
|
||||
},
|
||||
{
|
||||
"code": "ne",
|
||||
"name": "Nepali",
|
||||
"has_examples": true
|
||||
},
|
||||
{
|
||||
"code": "mk",
|
||||
"name": "Macedonian"
|
||||
},
|
||||
{
|
||||
"code": "xx",
|
||||
"name": "Multi-language",
|
||||
|
|
|
@ -1,5 +1,36 @@
|
|||
{
|
||||
"resources": [
|
||||
{
|
||||
"id": "spacy-textblob",
|
||||
"title": "spaCyTextBlob",
|
||||
"slogan": "Easy sentiment analysis for spaCy using TextBlob",
|
||||
"description": "spaCyTextBlob is a pipeline component that enables sentiment analysis using the [TextBlob](https://github.com/sloria/TextBlob) library. It adds the extension `._.sentiment` to `Doc`, `Span`, and `Token` objects.",
|
||||
"github": "SamEdwardes/spaCyTextBlob",
|
||||
"pip": "spacytextblob",
|
||||
"code_example": [
|
||||
"import spacy",
|
||||
"from spacytextblob.spacytextblob import SpacyTextBlob",
|
||||
"",
|
||||
"nlp = spacy.load('en_core_web_sm')",
|
||||
"spacy_text_blob = SpacyTextBlob()",
|
||||
"nlp.add_pipe(spacy_text_blob)",
|
||||
"text = 'I had a really horrible day. It was the worst day ever! But every now and then I have a really good day that makes me happy.'",
|
||||
"doc = nlp(text)",
|
||||
"doc._.sentiment.polarity # Polarity: -0.125",
|
||||
"doc._.sentiment.subjectivity # Subjectivity: 0.9",
|
||||
"doc._.sentiment.assessments # Assessments: [(['really', 'horrible'], -1.0, 1.0, None), (['worst', '!'], -1.0, 1.0, None), (['really', 'good'], 0.7, 0.6000000000000001, None), (['happy'], 0.8, 1.0, None)]"
|
||||
],
|
||||
"code_language": "python",
|
||||
"url": "https://spacytextblob.netlify.app/",
|
||||
"author": "Sam Edwardes",
|
||||
"author_links": {
|
||||
"twitter": "TheReaLSamlam",
|
||||
"github": "SamEdwardes",
|
||||
"website": "https://samedwardes.com"
|
||||
},
|
||||
"category": ["pipeline"],
|
||||
"tags": ["sentiment", "textblob"]
|
||||
},
|
||||
{
|
||||
"id": "spacy-ray",
|
||||
"title": "spacy-ray",
|
||||
|
@ -788,6 +819,22 @@
|
|||
"category": ["conversational"],
|
||||
"tags": ["chatbots"]
|
||||
},
|
||||
{
|
||||
"id": "mindmeld",
|
||||
"title": "MindMeld - Conversational AI platform",
|
||||
"slogan": "Conversational AI platform for deep-domain voice interfaces and chatbots",
|
||||
"description": "The MindMeld Conversational AI platform is among the most advanced AI platforms for building production-quality conversational applications. It is a Python-based machine learning framework which encompasses all of the algorithms and utilities required for this purpose. (https://github.com/cisco/mindmeld)",
|
||||
"github": "cisco/mindmeld",
|
||||
"pip": "mindmeld",
|
||||
"thumb": "https://www.mindmeld.com/img/mindmeld-logo.png",
|
||||
"category": ["conversational", "ner"],
|
||||
"tags": ["chatbots"],
|
||||
"author": "Cisco",
|
||||
"author_links": {
|
||||
"github": "cisco/mindmeld",
|
||||
"website": "https://www.mindmeld.com/"
|
||||
}
|
||||
},
|
||||
{
|
||||
"id": "torchtext",
|
||||
"title": "torchtext",
|
||||
|
@ -1648,7 +1695,7 @@
|
|||
"",
|
||||
"nlp = spacy.load('en')",
|
||||
"nlp.add_pipe(BeneparComponent('benepar_en'))",
|
||||
"doc = nlp('The time for action is now. It's never too late to do something.')",
|
||||
"doc = nlp('The time for action is now. It is never too late to do something.')",
|
||||
"sent = list(doc.sents)[0]",
|
||||
"print(sent._.parse_string)",
|
||||
"# (S (NP (NP (DT The) (NN time)) (PP (IN for) (NP (NN action)))) (VP (VBZ is) (ADVP (RB now))) (. .))",
|
||||
|
@ -2527,14 +2574,14 @@
|
|||
"description": "A spaCy rule-based pipeline for identifying positive cases of COVID-19 from clinical text. A version of this system was deployed as part of the US Department of Veterans Affairs biosurveillance response to COVID-19.",
|
||||
"pip": "cov-bsv",
|
||||
"code_example": [
|
||||
"import cov_bsv",
|
||||
"",
|
||||
"nlp = cov_bsv.load()",
|
||||
"text = 'Pt tested for COVID-19. His wife was recently diagnosed with novel coronavirus. SARS-COV-2: Detected'",
|
||||
"",
|
||||
"print(doc.ents)",
|
||||
"print(doc._.cov_classification)",
|
||||
"cov_bsv.visualize_doc(doc)"
|
||||
"import cov_bsv",
|
||||
"",
|
||||
"nlp = cov_bsv.load()",
|
||||
"doc = nlp('Pt tested for COVID-19. His wife was recently diagnosed with novel coronavirus. SARS-COV-2: Detected')",
|
||||
"",
|
||||
"print(doc.ents)",
|
||||
"print(doc._.cov_classification)",
|
||||
"cov_bsv.visualize_doc(doc)"
|
||||
],
|
||||
"category": ["pipeline", "standalone", "biomedical", "scientific"],
|
||||
"tags": ["clinical", "epidemiology", "covid-19", "surveillance"],
|
||||
|
@ -2542,6 +2589,35 @@
|
|||
"author_links": {
|
||||
"github": "abchapman93"
|
||||
}
|
||||
},
|
||||
{
|
||||
"id": "medspacy",
|
||||
"title": "medspaCy",
|
||||
"thumb": "https://raw.githubusercontent.com/medspacy/medspacy/master/images/medspacy_logo.png",
|
||||
"slogan": "A toolkit for clinical NLP with spaCy.",
|
||||
"github": "medspacy/medspacy",
|
||||
"description": "A toolkit for clinical NLP with spaCy. Features include sentence splitting, section detection, and asserting negation, family history, and uncertainty.",
|
||||
"pip": "medspacy",
|
||||
"code_example": [
|
||||
"import medspacy",
|
||||
"from medspacy.ner import TargetRule",
|
||||
"",
|
||||
"nlp = medspacy.load()",
|
||||
"print(nlp.pipe_names)",
|
||||
"",
|
||||
"nlp.get_pipe('target_matcher').add([TargetRule('stroke', 'CONDITION'), TargetRule('diabetes', 'CONDITION'), TargetRule('pna', 'CONDITION')])",
|
||||
"doc = nlp('Patient has hx of stroke. Mother diagnosed with diabetes. No evidence of pna.')",
|
||||
"",
|
||||
"for ent in doc.ents:",
|
||||
" print(ent, ent._.is_negated, ent._.is_family, ent._.is_historical)",
|
||||
"medspacy.visualization.visualize_ent(doc)"
|
||||
],
|
||||
"category": ["biomedical", "scientific", "research"],
|
||||
"tags": ["clinical"],
|
||||
"author": "medspacy",
|
||||
"author_links": {
|
||||
"github": "medspacy"
|
||||
}
|
||||
},
|
||||
{
|
||||
"id": "rita-dsl",
|
||||
|
@ -2578,6 +2654,32 @@
|
|||
"author_links": {
|
||||
"github": "zaibacu"
|
||||
}
|
||||
},
|
||||
{
|
||||
"id": "PatternOmatic",
|
||||
"title": "PatternOmatic",
|
||||
"slogan": "Finds linguistic patterns effortlessly",
|
||||
"description": "Discover linguistic patterns that match a given set of string samples, for use with spaCy's rule-based Matcher",
|
||||
"github": "revuel/PatternOmatic",
|
||||
"pip": "PatternOmatic",
|
||||
"code_example": [
|
||||
"from PatternOmatic.api import find_patterns",
|
||||
"",
|
||||
"samples = ['I am a cat!', 'You are a dog!', 'She is an owl!']",
|
||||
"",
|
||||
"patterns_found, _ = find_patterns(samples)",
|
||||
"",
|
||||
"print(f'Patterns found: {patterns_found}')"
|
||||
],
|
||||
"code_language": "python",
|
||||
"thumb": "https://svgshare.com/i/R3P.svg",
|
||||
"image": "https://svgshare.com/i/R3P.svg",
|
||||
"author": "Miguel Revuelta Espinosa",
|
||||
"author_links": {
|
||||
"github": "revuel"
|
||||
},
|
||||
"category": ["scientific", "research", "standalone"],
|
||||
"tags": ["Evolutionary Computation", "Grammatical Evolution"]
|
||||
}
|
||||
],
|
||||
|
||||
|
|
|
@ -207,42 +207,49 @@ const Landing = ({ data }) => {
|
|||
|
||||
<LandingBannerGrid>
|
||||
<LandingBanner
|
||||
to="https://course.spacy.io"
|
||||
button="Start the course"
|
||||
background="#f6f6f6"
|
||||
color="#252a33"
|
||||
title="spaCy v3.0 nightly: Transformer-based pipelines, new training system, project templates & more"
|
||||
label="Try the pre-release"
|
||||
to="https://nightly.spacy.io"
|
||||
button="See what's new"
|
||||
background="#8758fe"
|
||||
color="#ffffff"
|
||||
small
|
||||
>
|
||||
<Link to="https://course.spacy.io" hidden>
|
||||
spaCy v3.0 features all new <strong>transformer-based pipelines</strong> that
|
||||
bring spaCy's accuracy right up to the current <strong>state-of-the-art</strong>
|
||||
. You can use any pretrained transformer to train your own pipelines, and even
|
||||
share one transformer between multiple components with{' '}
|
||||
<strong>multi-task learning</strong>. Training is now fully configurable and
|
||||
extensible, and you can define your own custom models using{' '}
|
||||
<strong>PyTorch</strong>, <strong>TensorFlow</strong> and other frameworks. The
|
||||
new spaCy projects system lets you describe whole{' '}
|
||||
<strong>end-to-end workflows</strong> in a single file, giving you an easy path
|
||||
from prototype to production, and making it easy to clone and adapt
|
||||
best-practice projects for your own use cases.
|
||||
</LandingBanner>
|
||||
|
||||
<LandingBanner
|
||||
title="Prodigy: Radically efficient machine teaching"
|
||||
label="From the makers of spaCy"
|
||||
to="https://prodi.gy"
|
||||
button="Try it out"
|
||||
background="#f6f6f6"
|
||||
color="#000"
|
||||
small
|
||||
>
|
||||
<Link to="https://prodi.gy" hidden>
|
||||
<img
|
||||
src={courseImage}
|
||||
alt="Advanced NLP with spaCy: A free online course"
|
||||
src={prodigyImage}
|
||||
alt="Prodigy: Radically efficient machine teaching"
|
||||
/>
|
||||
</Link>
|
||||
<br />
|
||||
<br />
|
||||
In this <strong>free and interactive online course</strong> you’ll learn how to
|
||||
use spaCy to build advanced natural language understanding systems, using both
|
||||
rule-based and machine learning approaches. It includes{' '}
|
||||
<strong>55 exercises</strong> featuring videos, slide decks, multiple-choice
|
||||
questions and interactive coding practice in the browser.
|
||||
</LandingBanner>
|
||||
<LandingBanner
|
||||
title="spaCy IRL: Two days of NLP"
|
||||
label="Watch the videos"
|
||||
to="https://www.youtube.com/playlist?list=PLBmcuObd5An4UC6jvK_-eSl6jCvP1gwXc"
|
||||
button="Watch the videos"
|
||||
background="#ffc194"
|
||||
backgroundImage={irlBackground}
|
||||
color="#1a1e23"
|
||||
small
|
||||
>
|
||||
We were pleased to invite the spaCy community and other folks working on NLP to
|
||||
Berlin for a small and intimate event. We booked a beautiful venue, hand-picked
|
||||
an awesome lineup of speakers and scheduled plenty of social time to get to know
|
||||
each other. The YouTube playlist includes 12 talks about NLP research,
|
||||
development and applications, with keynotes by Sebastian Ruder (DeepMind) and
|
||||
Yoav Goldberg (Allen AI).
|
||||
Prodigy is an <strong>annotation tool</strong> so efficient that data scientists
|
||||
can do the annotation themselves, enabling a new level of rapid iteration.
|
||||
Whether you're working on entity recognition, intent detection or image
|
||||
classification, Prodigy can help you <strong>train and evaluate</strong> your
|
||||
models faster.
|
||||
</LandingBanner>
|
||||
</LandingBannerGrid>
|
||||
|
||||
|
|