mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-12 02:06:31 +03:00
Merge pull request #5788 from explosion/master-tmp
commit
311d0bde29
106
.github/contributors/PluieElectrique.md
vendored
Normal file
@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Pluie |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-06-18 |
| GitHub username | PluieElectrique |
| Website (optional) | |
106
.github/contributors/abchapman93.md
vendored
Normal file
@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Alec Chapman |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 7/17/2020 |
| GitHub username | abchapman93 |
| Website (optional) | |
106
.github/contributors/gandersen101.md
vendored
Normal file
@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

* [ x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Grant Andersen |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 07.06.2020 |
| GitHub username | gandersen101 |
| Website (optional) | |
106
.github/contributors/jbesomi.md
vendored
Normal file
@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jonathan B. |
| Company name (if applicable) | besomi.ai |
| Title or role (if applicable) | - |
| Date | 07.07.2020 |
| GitHub username | jbesomi |
| Website (optional) | besomi.ai |
106
.github/contributors/mikeizbicki.md
vendored
Normal file
@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Mike Izbicki |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 02 Jun 2020 |
| GitHub username | mikeizbicki |
| Website (optional) | https://izbicki.me |
106
.github/contributors/rameshhpathak.md
vendored
Normal file
@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
|
||||||
|
|
||||||
|
* you agree that each of us can do all things in relation to your
|
||||||
|
contribution as if each of us were the sole owners, and if one of us makes
|
||||||
|
a derivative work of your contribution, the one who makes the derivative
|
||||||
|
work (or has it made will be the sole owner of that derivative work;
|
||||||
|
|
||||||
|
* you agree that you will not assert any moral rights in your contribution
|
||||||
|
against us, our licensees or transferees;
|
||||||
|
|
||||||
|
* you agree that we may register a copyright in your contribution and
|
||||||
|
exercise all ownership rights associated with it; and
|
||||||
|
|
||||||
|
* you agree that neither of us has any duty to consult with, obtain the
|
||||||
|
consent of, pay or render an accounting to the other for any use or
|
||||||
|
distribution of your contribution.
|
||||||
|
|
||||||
|
3. With respect to any patents you own, or that you can license without payment
|
||||||
|
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||||
|
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||||
|
|
||||||
|
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||||
|
your contribution in whole or in part, alone or in combination with or
|
||||||
|
included in any product, work or materials arising out of the project to
|
||||||
|
which your contribution was submitted, and
|
||||||
|
|
||||||
|
* at our option, to sublicense these same rights to third parties through
|
||||||
|
multiple levels of sublicensees or other licensing arrangements.
|
||||||
|
|
||||||
|
4. Except as set out above, you keep all right, title, and interest in your
|
||||||
|
contribution. The rights that you grant to us under these terms are effective
|
||||||
|
on the date you first submitted a contribution to us, even if your submission
|
||||||
|
took place before the date you sign these terms.
|
||||||
|
|
||||||
|
5. You covenant, represent, warrant and agree that:
|
||||||
|
|
||||||
|
* Each contribution that you submit is and shall be an original work of
|
||||||
|
authorship and you can legally grant the rights set out in this SCA;
|
||||||
|
|
||||||
|
* to the best of your knowledge, each contribution will not violate any
|
||||||
|
third party's copyrights, trademarks, patents, or other intellectual
|
||||||
|
property rights; and
|
||||||
|
|
||||||
|
* each contribution shall be in compliance with U.S. export control laws and
|
||||||
|
other applicable export and import laws. You agree to notify us if you
|
||||||
|
become aware of any circumstance which would make any of the foregoing
|
||||||
|
representations inaccurate in any respect. We may publicly disclose your
|
||||||
|
participation in the project, including the fact that you have signed the SCA.
|
||||||
|
|
||||||
|
6. This SCA is governed by the laws of the State of California and applicable
|
||||||
|
U.S. Federal law. Any choice of law rules will not apply.
|
||||||
|
|
||||||
|
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||||
|
mark both statements:
|
||||||
|
|
||||||
|
* [x] I am signing on behalf of myself as an individual and no other person
|
||||||
|
or entity, including my employer, has or will have rights with respect to my
|
||||||
|
contributions.
|
||||||
|
|
||||||
|
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||||
|
actual authority to contractually bind that entity.
|
||||||
|
|
||||||
|
## Contributor Details
|
||||||
|
|
||||||
|
| Field | Entry |
|
||||||
|
|------------------------------- | -------------------- |
|
||||||
|
| Name | Ramesh Pathak |
|
||||||
|
| Company name (if applicable) | Diyo AI |
|
||||||
|
| Title or role (if applicable) | AI Engineer |
|
||||||
|
| Date | June 21, 2020 |
|
||||||
|
| GitHub username | rameshhpathak |
|
||||||
|
| Website (optional)            | rameshhpathak.github.io |
|
106
.github/contributors/richardliaw.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
||||||
|
# spaCy contributor agreement
|
||||||
|
|
||||||
|
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||||
|
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||||
|
The SCA applies to any contribution that you make to any product or project
|
||||||
|
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||||
|
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||||
|
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||||
|
**"you"** shall mean the person or entity identified below.
|
||||||
|
|
||||||
|
If you agree to be bound by these terms, fill in the information requested
|
||||||
|
below and include the filled-in version with your first pull request, under the
|
||||||
|
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||||
|
should be your GitHub username, with the extension `.md`. For example, the user
|
||||||
|
example_user would create the file `.github/contributors/example_user.md`.
|
||||||
|
|
||||||
|
Read this agreement carefully before signing. These terms and conditions
|
||||||
|
constitute a binding legal agreement.
|
||||||
|
|
||||||
|
## Contributor Agreement
|
||||||
|
|
||||||
|
1. The term "contribution" or "contributed materials" means any source code,
|
||||||
|
object code, patch, tool, sample, graphic, specification, manual,
|
||||||
|
documentation, or any other material posted or submitted by you to the project.
|
||||||
|
|
||||||
|
2. With respect to any worldwide copyrights, or copyright applications and
|
||||||
|
registrations, in your contribution:
|
||||||
|
|
||||||
|
* you hereby assign to us joint ownership, and to the extent that such
|
||||||
|
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||||
|
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||||
|
royalty-free, unrestricted license to exercise all rights under those
|
||||||
|
copyrights. This includes, at our option, the right to sublicense these same
|
||||||
|
rights to third parties through multiple levels of sublicensees or other
|
||||||
|
licensing arrangements;
|
||||||
|
|
||||||
|
* you agree that each of us can do all things in relation to your
|
||||||
|
contribution as if each of us were the sole owners, and if one of us makes
|
||||||
|
a derivative work of your contribution, the one who makes the derivative
|
||||||
|
work (or has it made will be the sole owner of that derivative work;
|
||||||
|
|
||||||
|
* you agree that you will not assert any moral rights in your contribution
|
||||||
|
against us, our licensees or transferees;
|
||||||
|
|
||||||
|
* you agree that we may register a copyright in your contribution and
|
||||||
|
exercise all ownership rights associated with it; and
|
||||||
|
|
||||||
|
* you agree that neither of us has any duty to consult with, obtain the
|
||||||
|
consent of, pay or render an accounting to the other for any use or
|
||||||
|
distribution of your contribution.
|
||||||
|
|
||||||
|
3. With respect to any patents you own, or that you can license without payment
|
||||||
|
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||||
|
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||||
|
|
||||||
|
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||||
|
your contribution in whole or in part, alone or in combination with or
|
||||||
|
included in any product, work or materials arising out of the project to
|
||||||
|
which your contribution was submitted, and
|
||||||
|
|
||||||
|
* at our option, to sublicense these same rights to third parties through
|
||||||
|
multiple levels of sublicensees or other licensing arrangements.
|
||||||
|
|
||||||
|
4. Except as set out above, you keep all right, title, and interest in your
|
||||||
|
contribution. The rights that you grant to us under these terms are effective
|
||||||
|
on the date you first submitted a contribution to us, even if your submission
|
||||||
|
took place before the date you sign these terms.
|
||||||
|
|
||||||
|
5. You covenant, represent, warrant and agree that:
|
||||||
|
|
||||||
|
* Each contribution that you submit is and shall be an original work of
|
||||||
|
authorship and you can legally grant the rights set out in this SCA;
|
||||||
|
|
||||||
|
* to the best of your knowledge, each contribution will not violate any
|
||||||
|
third party's copyrights, trademarks, patents, or other intellectual
|
||||||
|
property rights; and
|
||||||
|
|
||||||
|
* each contribution shall be in compliance with U.S. export control laws and
|
||||||
|
other applicable export and import laws. You agree to notify us if you
|
||||||
|
become aware of any circumstance which would make any of the foregoing
|
||||||
|
representations inaccurate in any respect. We may publicly disclose your
|
||||||
|
participation in the project, including the fact that you have signed the SCA.
|
||||||
|
|
||||||
|
6. This SCA is governed by the laws of the State of California and applicable
|
||||||
|
U.S. Federal law. Any choice of law rules will not apply.
|
||||||
|
|
||||||
|
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||||
|
mark both statements:
|
||||||
|
|
||||||
|
* [x] I am signing on behalf of myself as an individual and no other person
|
||||||
|
or entity, including my employer, has or will have rights with respect to my
|
||||||
|
contributions.
|
||||||
|
|
||||||
|
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||||
|
actual authority to contractually bind that entity.
|
||||||
|
|
||||||
|
## Contributor Details
|
||||||
|
|
||||||
|
| Field | Entry |
|
||||||
|
|------------------------------- | -------------------- |
|
||||||
|
| Name | Richard Liaw |
|
||||||
|
| Company name (if applicable) | |
|
||||||
|
| Title or role (if applicable) | |
|
||||||
|
| Date | 06/22/2020 |
|
||||||
|
| GitHub username | richardliaw |
|
||||||
|
| Website (optional) | |
|
1
.gitignore
vendored
|
@ -71,6 +71,7 @@ Pipfile.lock
|
||||||
*.egg
|
*.egg
|
||||||
.eggs
|
.eggs
|
||||||
MANIFEST
|
MANIFEST
|
||||||
|
spacy/git_info.py
|
||||||
|
|
||||||
# Temporary files
|
# Temporary files
|
||||||
*.~*
|
*.~*
|
||||||
|
|
|
@ -5,3 +5,4 @@ include README.md
|
||||||
include pyproject.toml
|
include pyproject.toml
|
||||||
recursive-exclude spacy/lang *.json
|
recursive-exclude spacy/lang *.json
|
||||||
recursive-include spacy/lang *.json.gz
|
recursive-include spacy/lang *.json.gz
|
||||||
|
recursive-include licenses *
|
||||||
|
|
4
Makefile
|
@ -5,7 +5,7 @@ VENV := ./env$(PYVER)
|
||||||
version := $(shell "bin/get-version.sh")
|
version := $(shell "bin/get-version.sh")
|
||||||
|
|
||||||
dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
|
dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
|
||||||
$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) spacy-lookups-data jieba pkuseg==0.0.22 sudachipy sudachidict_core
|
$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) spacy-lookups-data jieba pkuseg==0.0.25 sudachipy sudachidict_core
|
||||||
chmod a+rx $@
|
chmod a+rx $@
|
||||||
cp $@ dist/spacy.pex
|
cp $@ dist/spacy.pex
|
||||||
|
|
||||||
|
@ -15,7 +15,7 @@ dist/pytest.pex : wheelhouse/pytest-*.whl
|
||||||
|
|
||||||
wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
|
wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
|
||||||
$(VENV)/bin/pip wheel . -w ./wheelhouse
|
$(VENV)/bin/pip wheel . -w ./wheelhouse
|
||||||
$(VENV)/bin/pip wheel spacy-lookups-data jieba pkuseg==0.0.22 sudachipy sudachidict_core -w ./wheelhouse
|
$(VENV)/bin/pip wheel spacy-lookups-data jieba pkuseg==0.0.25 sudachipy sudachidict_core -w ./wheelhouse
|
||||||
touch $@
|
touch $@
|
||||||
|
|
||||||
wheelhouse/pytest-%.whl : $(VENV)/bin/pex
|
wheelhouse/pytest-%.whl : $(VENV)/bin/pex
|
||||||
|
|
|
@ -16,8 +16,6 @@ from __future__ import unicode_literals, print_function
|
||||||
import plac
|
import plac
|
||||||
import random
|
import random
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
|
||||||
from spacy.vocab import Vocab
|
|
||||||
import spacy
|
import spacy
|
||||||
from spacy.kb import KnowledgeBase
|
from spacy.kb import KnowledgeBase
|
||||||
|
|
||||||
|
@ -61,13 +59,13 @@ TRAIN_DATA = sample_train_data()
|
||||||
output_dir=("Optional output directory", "option", "o", Path),
|
output_dir=("Optional output directory", "option", "o", Path),
|
||||||
n_iter=("Number of training iterations", "option", "n", int),
|
n_iter=("Number of training iterations", "option", "n", int),
|
||||||
)
|
)
|
||||||
def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
|
def main(kb_path, vocab_path, output_dir=None, n_iter=50):
|
||||||
"""Create a blank model with the specified vocab, set up the pipeline and train the entity linker.
|
"""Create a blank model with the specified vocab, set up the pipeline and train the entity linker.
|
||||||
The `vocab` should be the one used during creation of the KB."""
|
The `vocab` should be the one used during creation of the KB."""
|
||||||
vocab = Vocab().from_disk(vocab_path)
|
|
||||||
# create blank English model with correct vocab
|
# create blank English model with correct vocab
|
||||||
nlp = spacy.blank("en", vocab=vocab)
|
nlp = spacy.blank("en")
|
||||||
nlp.vocab.vectors.name = "nel_vectors"
|
nlp.vocab.from_disk(vocab_path)
|
||||||
|
nlp.vocab.vectors.name = "spacy_pretrained_vectors"
|
||||||
print("Created blank 'en' model with vocab from '%s'" % vocab_path)
|
print("Created blank 'en' model with vocab from '%s'" % vocab_path)
|
||||||
|
|
||||||
# Add a sentencizer component. Alternatively, add a dependency parser for higher accuracy.
|
# Add a sentencizer component. Alternatively, add a dependency parser for higher accuracy.
|
||||||
|
@ -96,7 +94,7 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
|
||||||
# Convert the texts to docs to make sure we have doc.ents set for the training examples.
|
# Convert the texts to docs to make sure we have doc.ents set for the training examples.
|
||||||
# Also ensure that the annotated examples correspond to known identifiers in the knowledge base.
|
# Also ensure that the annotated examples correspond to known identifiers in the knowledge base.
|
||||||
kb_ids = nlp.get_pipe("entity_linker").kb.get_entity_strings()
|
kb_ids = nlp.get_pipe("entity_linker").kb.get_entity_strings()
|
||||||
train_examples = []
|
train_examples = []
|
||||||
for text, annotation in TRAIN_DATA:
|
for text, annotation in TRAIN_DATA:
|
||||||
with nlp.select_pipes(disable="entity_linker"):
|
with nlp.select_pipes(disable="entity_linker"):
|
||||||
doc = nlp(text)
|
doc = nlp(text)
|
||||||
|
@ -111,7 +109,7 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
|
||||||
"Removed", kb_id, "from training because it is not in the KB."
|
"Removed", kb_id, "from training because it is not in the KB."
|
||||||
)
|
)
|
||||||
annotation_clean["links"][offset] = new_dict
|
annotation_clean["links"][offset] = new_dict
|
||||||
train_examples .append(Example.from_dict(doc, annotation_clean))
|
train_examples.append(Example.from_dict(doc, annotation_clean))
|
||||||
|
|
||||||
with nlp.select_pipes(enable="entity_linker"): # only train entity linker
|
with nlp.select_pipes(enable="entity_linker"): # only train entity linker
|
||||||
# reset and initialize the weights randomly
|
# reset and initialize the weights randomly
|
||||||
|
|
52
setup.py
|
@ -4,13 +4,14 @@ import sys
|
||||||
import platform
|
import platform
|
||||||
from distutils.command.build_ext import build_ext
|
from distutils.command.build_ext import build_ext
|
||||||
from distutils.sysconfig import get_python_inc
|
from distutils.sysconfig import get_python_inc
|
||||||
import distutils.util
|
|
||||||
from distutils import ccompiler, msvccompiler
|
from distutils import ccompiler, msvccompiler
|
||||||
import numpy
|
import numpy
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
import shutil
|
import shutil
|
||||||
from Cython.Build import cythonize
|
from Cython.Build import cythonize
|
||||||
from Cython.Compiler import Options
|
from Cython.Compiler import Options
|
||||||
|
import os
|
||||||
|
import subprocess
|
||||||
|
|
||||||
|
|
||||||
ROOT = Path(__file__).parent
|
ROOT = Path(__file__).parent
|
||||||
|
@ -75,7 +76,6 @@ COPY_FILES = {
|
||||||
|
|
||||||
def is_new_osx():
|
def is_new_osx():
|
||||||
"""Check whether we're on OSX >= 10.7"""
|
"""Check whether we're on OSX >= 10.7"""
|
||||||
name = distutils.util.get_platform()
|
|
||||||
if sys.platform != "darwin":
|
if sys.platform != "darwin":
|
||||||
return False
|
return False
|
||||||
mac_ver = platform.mac_ver()[0]
|
mac_ver = platform.mac_ver()[0]
|
||||||
|
@ -118,6 +118,53 @@ class build_ext_subclass(build_ext, build_ext_options):
|
||||||
build_ext.build_extensions(self)
|
build_ext.build_extensions(self)
|
||||||
|
|
||||||
|
|
||||||
|
# Include the git version in the build (adapted from NumPy)
|
||||||
|
# Copyright (c) 2005-2020, NumPy Developers.
|
||||||
|
# BSD 3-Clause license, see licenses/3rd_party_licenses.txt
|
||||||
|
def write_git_info_py(filename="spacy/git_info.py"):
|
||||||
|
def _minimal_ext_cmd(cmd):
|
||||||
|
# construct minimal environment
|
||||||
|
env = {}
|
||||||
|
for k in ["SYSTEMROOT", "PATH", "HOME"]:
|
||||||
|
v = os.environ.get(k)
|
||||||
|
if v is not None:
|
||||||
|
env[k] = v
|
||||||
|
# LANGUAGE is used on win32
|
||||||
|
env["LANGUAGE"] = "C"
|
||||||
|
env["LANG"] = "C"
|
||||||
|
env["LC_ALL"] = "C"
|
||||||
|
out = subprocess.check_output(cmd, stderr=subprocess.STDOUT, env=env)
|
||||||
|
return out
|
||||||
|
|
||||||
|
git_version = "Unknown"
|
||||||
|
if Path(".git").exists():
|
||||||
|
try:
|
||||||
|
out = _minimal_ext_cmd(["git", "rev-parse", "--short", "HEAD"])
|
||||||
|
git_version = out.strip().decode("ascii")
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
elif Path(filename).exists():
|
||||||
|
# must be a source distribution, use existing version file
|
||||||
|
try:
|
||||||
|
a = open(filename, "r")
|
||||||
|
lines = a.readlines()
|
||||||
|
git_version = lines[-1].split('"')[1]
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
finally:
|
||||||
|
a.close()
|
||||||
|
|
||||||
|
text = """# THIS FILE IS GENERATED FROM SPACY SETUP.PY
|
||||||
|
#
|
||||||
|
GIT_VERSION = "%(git_version)s"
|
||||||
|
"""
|
||||||
|
a = open(filename, "w")
|
||||||
|
try:
|
||||||
|
a.write(text % {"git_version": git_version})
|
||||||
|
finally:
|
||||||
|
a.close()
|
||||||
|
|
||||||
|
|
||||||
def clean(path):
|
def clean(path):
|
||||||
for path in path.glob("**/*"):
|
for path in path.glob("**/*"):
|
||||||
if path.is_file() and path.suffix in (".so", ".cpp", ".html"):
|
if path.is_file() and path.suffix in (".so", ".cpp", ".html"):
|
||||||
|
@ -126,6 +173,7 @@ def clean(path):
|
||||||
|
|
||||||
|
|
||||||
def setup_package():
|
def setup_package():
|
||||||
|
write_git_info_py()
|
||||||
if len(sys.argv) > 1 and sys.argv[1] == "clean":
|
if len(sys.argv) > 1 and sys.argv[1] == "clean":
|
||||||
return clean(PACKAGE_ROOT)
|
return clean(PACKAGE_ROOT)
|
||||||
|
|
||||||
|
|
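Review note on `write_git_info_py`: the hunk above does three things. It reads the short commit hash when a `.git` directory exists, it falls back to the last line of an existing generated file for source distributions, and it writes a small module containing `GIT_VERSION`. The following is a minimal sketch of those steps, not the code from `setup.py`. The helper names `render_git_info`, `parse_git_info` and `read_git_version` are hypothetical and chosen only for illustration, and the sketch passes a `cwd` argument rather than building the minimal environment used in the diff.

```python
import subprocess
import tempfile
from pathlib import Path

# Same layout as the generated file in the diff: two comment lines, then the constant.
TEMPLATE = (
    "# THIS FILE IS GENERATED FROM SPACY SETUP.PY\n"
    "#\n"
    'GIT_VERSION = "{git_version}"\n'
)


def render_git_info(git_version):
    """Return the text of the generated module (hypothetical helper)."""
    return TEMPLATE.format(git_version=git_version)


def parse_git_info(text):
    """Recover the version from generated text, like lines[-1].split('"')[1]."""
    return text.splitlines()[-1].split('"')[1]


def read_git_version(repo_dir):
    """Return the short commit hash, or "Unknown" if it cannot be determined."""
    repo_dir = Path(repo_dir)
    if not (repo_dir / ".git").exists():
        return "Unknown"
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=str(repo_dir),
            stderr=subprocess.STDOUT,
        )
        return out.strip().decode("ascii")
    except Exception:
        # Missing git binary, corrupt repository, non-ASCII output, etc.
        return "Unknown"


if __name__ == "__main__":
    # A fresh temporary directory has no .git, so this takes the fallback path.
    with tempfile.TemporaryDirectory() as tmp:
        print(read_git_version(tmp))
    print(parse_git_info(render_git_info("abc1234")))
```

Splitting the last line on `"` only works while `GIT_VERSION` stays on the final line of the generated file, which is worth keeping in mind if the template ever grows.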
|
@ -31,6 +31,41 @@ class EnglishDefaults(Language.Defaults):
|
||||||
{"tags": ["``", "''"], "variants": [('"', '"'), ("“", "”")]},
|
{"tags": ["``", "''"], "variants": [('"', '"'), ("“", "”")]},
|
||||||
]
|
]
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def is_base_form(cls, univ_pos, morphology=None):
|
||||||
|
"""
|
||||||
|
Check whether we're dealing with an uninflected paradigm, so we can
|
||||||
|
avoid lemmatization entirely.
|
||||||
|
|
||||||
|
univ_pos (unicode / int): The token's universal part-of-speech tag.
|
||||||
|
morphology (dict): The token's morphological features following the
|
||||||
|
Universal Dependencies scheme.
|
||||||
|
"""
|
||||||
|
if morphology is None:
|
||||||
|
morphology = {}
|
||||||
|
if univ_pos == "noun" and morphology.get("Number") == "sing":
|
||||||
|
return True
|
||||||
|
elif univ_pos == "verb" and morphology.get("VerbForm") == "inf":
|
||||||
|
return True
|
||||||
|
# This maps 'VBP' to base form -- probably just need 'IS_BASE'
|
||||||
|
# morphology
|
||||||
|
elif univ_pos == "verb" and (
|
||||||
|
morphology.get("VerbForm") == "fin"
|
||||||
|
and morphology.get("Tense") == "pres"
|
||||||
|
and morphology.get("Number") is None
|
||||||
|
):
|
||||||
|
return True
|
||||||
|
elif univ_pos == "adj" and morphology.get("Degree") == "pos":
|
||||||
|
return True
|
||||||
|
elif morphology.get("VerbForm") == "inf":
|
||||||
|
return True
|
||||||
|
elif morphology.get("VerbForm") == "none":
|
||||||
|
return True
|
||||||
|
elif morphology.get("Degree") == "pos":
|
||||||
|
return True
|
||||||
|
else:
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
class English(Language):
|
class English(Language):
|
||||||
lang = "en"
|
lang = "en"
|
||||||
|
|
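Review note on `is_base_form`: the method added above decides from the universal POS tag and the morphological features whether a token is already uninflected, so lemmatization can be skipped. The sketch below restates the same checks as a plain function so they can be tried outside the `EnglishDefaults` class. It is a standalone illustration that assumes lower-case string tags and a plain `dict` of features, as the diff does, and the trailing fallback branches are folded into one expression.

```python
def is_base_form(univ_pos, morphology=None):
    """Return True when the token is already in its uninflected form."""
    if morphology is None:
        morphology = {}
    if univ_pos == "noun" and morphology.get("Number") == "sing":
        return True
    if univ_pos == "verb" and morphology.get("VerbForm") == "inf":
        return True
    # Finite present-tense verb with no Number feature (maps 'VBP' to base form).
    if univ_pos == "verb" and (
        morphology.get("VerbForm") == "fin"
        and morphology.get("Tense") == "pres"
        and morphology.get("Number") is None
    ):
        return True
    if univ_pos == "adj" and morphology.get("Degree") == "pos":
        return True
    # Fallbacks that ignore the POS tag, as in the diff.
    return (
        morphology.get("VerbForm") in ("inf", "none")
        or morphology.get("Degree") == "pos"
    )


if __name__ == "__main__":
    print(is_base_form("noun", {"Number": "sing"}))
    print(is_base_form("noun", {"Number": "plur"}))
    print(is_base_form("verb", {"VerbForm": "fin", "Tense": "pres"}))
```

Note that the last fallbacks apply regardless of `univ_pos`, so any token carrying `Degree == "pos"` is treated as a base form.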
|
@ -41,9 +41,6 @@ class FrenchLemmatizer(Lemmatizer):
|
||||||
univ_pos = "sconj"
|
univ_pos = "sconj"
|
||||||
else:
|
else:
|
||||||
return [self.lookup(string)]
|
return [self.lookup(string)]
|
||||||
# See Issue #435 for example of where this logic is required.
|
|
||||||
if self.is_base_form(univ_pos, morphology):
|
|
||||||
return list(set([string.lower()]))
|
|
||||||
index_table = self.lookups.get_table("lemma_index", {})
|
index_table = self.lookups.get_table("lemma_index", {})
|
||||||
exc_table = self.lookups.get_table("lemma_exc", {})
|
exc_table = self.lookups.get_table("lemma_exc", {})
|
||||||
rules_table = self.lookups.get_table("lemma_rules", {})
|
rules_table = self.lookups.get_table("lemma_rules", {})
|
||||||
|
|
|
@ -8,6 +8,6 @@ Example sentences to test spaCy and its language models.
|
||||||
sentences = [
|
sentences = [
|
||||||
"Լոնդոնը Միացյալ Թագավորության մեծ քաղաք է։",
|
"Լոնդոնը Միացյալ Թագավորության մեծ քաղաք է։",
|
||||||
"Ո՞վ է Ֆրանսիայի նախագահը։",
|
"Ո՞վ է Ֆրանսիայի նախագահը։",
|
||||||
"Որն է Միացյալ Նահանգների մայրաքաղաքը։",
|
"Ո՞րն է Միացյալ Նահանգների մայրաքաղաքը։",
|
||||||
"Ե՞րբ է ծնվել Բարաք Օբաման։",
|
"Ե՞րբ է ծնվել Բարաք Օբաման։",
|
||||||
]
|
]
|
||||||
|
|
|
@ -15,14 +15,15 @@ _num_words = [
|
||||||
"տասը",
|
"տասը",
|
||||||
"տասնմեկ",
|
"տասնմեկ",
|
||||||
"տասներկու",
|
"տասներկու",
|
||||||
"տասներեք",
|
"տասներեք",
|
||||||
"տասնչորս",
|
"տասնչորս",
|
||||||
"տասնհինգ",
|
"տասնհինգ",
|
||||||
"տասնվեց",
|
"տասնվեց",
|
||||||
"տասնյոթ",
|
"տասնյոթ",
|
||||||
"տասնութ",
|
"տասնութ",
|
||||||
"տասնինը",
|
"տասնինը",
|
||||||
"քսան" "երեսուն",
|
"քսան",
|
||||||
|
"երեսուն",
|
||||||
"քառասուն",
|
"քառասուն",
|
||||||
"հիսուն",
|
"հիսուն",
|
||||||
"վաթսուն",
|
"վաթսուն",
|
||||||
|
|
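Review note on the `_num_words` fix: the old line `"քսան" "երեսուն",` was missing a comma, and Python joins adjacent string literals into a single string. The list therefore held one fused entry instead of two, so neither word could match on its own. The short sketch below reproduces the effect with the two entries from the diff; the `like_num` helper is a simplified stand-in for illustration, not the function from the language data.

```python
# Missing comma: the two literals are concatenated into one list element.
buggy_num_words = ["տասնինը", "քսան" "երեսուն", "քառասուն"]

# Fixed version from the diff: each word is its own element.
fixed_num_words = ["տասնինը", "քսան", "երեսուն", "քառասուն"]


def like_num(text, num_words):
    """Simplified stand-in: True if the text is a known number word."""
    return text.lower() in num_words


if __name__ == "__main__":
    print(len(buggy_num_words), len(fixed_num_words))
    print(like_num("քսան", buggy_num_words))
    print(like_num("քսան", fixed_num_words))
```

A length check on the list, or a test that every expected word is matched, would catch this kind of slip early.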
|
@ -17,12 +17,9 @@ from ... import util
|
||||||
|
|
||||||
|
|
||||||
# Hold the attributes we need with convenient names
|
# Hold the attributes we need with convenient names
|
||||||
DetailedToken = namedtuple("DetailedToken", ["surface", "pos", "lemma"])
|
DetailedToken = namedtuple(
|
||||||
|
"DetailedToken", ["surface", "tag", "inf", "lemma", "reading", "sub_tokens"]
|
||||||
# Handling for multiple spaces in a row is somewhat awkward, this simplifies
|
)
|
||||||
# the flow by creating a dummy with the same interface.
|
|
||||||
DummyNode = namedtuple("DummyNode", ["surface", "pos", "lemma"])
|
|
||||||
DummySpace = DummyNode(" ", " ", " ")
|
|
||||||
|
|
||||||
|
|
||||||
def try_sudachi_import(split_mode="A"):
|
def try_sudachi_import(split_mode="A"):
|
||||||
|
@ -49,7 +46,7 @@ def try_sudachi_import(split_mode="A"):
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
def resolve_pos(orth, pos, next_pos):
|
def resolve_pos(orth, tag, next_tag):
|
||||||
"""If necessary, add a field to the POS tag for UD mapping.
|
"""If necessary, add a field to the POS tag for UD mapping.
|
||||||
Under Universal Dependencies, sometimes the same Unidic POS tag can
|
Under Universal Dependencies, sometimes the same Unidic POS tag can
|
||||||
be mapped differently depending on the literal token or its context
|
be mapped differently depending on the literal token or its context
|
||||||
|
@ -60,127 +57,80 @@ def resolve_pos(orth, pos, next_pos):
|
||||||
# Some tokens have their UD tag decided based on the POS of the following
|
# Some tokens have their UD tag decided based on the POS of the following
|
||||||
# token.
|
# token.
|
||||||
|
|
||||||
# orth based rules
|
# apply orth based mapping
|
||||||
if pos[0] in TAG_ORTH_MAP:
|
if tag in TAG_ORTH_MAP:
|
||||||
orth_map = TAG_ORTH_MAP[pos[0]]
|
orth_map = TAG_ORTH_MAP[tag]
|
||||||
if orth in orth_map:
|
if orth in orth_map:
|
||||||
return orth_map[orth], None
|
return orth_map[orth], None # current_pos, next_pos
|
||||||
|
|
||||||
# tag bi-gram mapping
|
# apply tag bi-gram mapping
|
||||||
if next_pos:
|
if next_tag:
|
||||||
tag_bigram = pos[0], next_pos[0]
|
tag_bigram = tag, next_tag
|
||||||
if tag_bigram in TAG_BIGRAM_MAP:
|
if tag_bigram in TAG_BIGRAM_MAP:
|
||||||
bipos = TAG_BIGRAM_MAP[tag_bigram]
|
current_pos, next_pos = TAG_BIGRAM_MAP[tag_bigram]
|
||||||
if bipos[0] is None:
|
if current_pos is None: # apply tag uni-gram mapping for current_pos
|
||||||
return TAG_MAP[pos[0]][POS], bipos[1]
|
return (
|
||||||
|
TAG_MAP[tag][POS],
|
||||||
|
next_pos,
|
||||||
|
) # only next_pos is identified by tag bi-gram mapping
|
||||||
else:
|
else:
|
||||||
return bipos
|
return current_pos, next_pos
|
||||||
|
|
||||||
return TAG_MAP[pos[0]][POS], None
|
# apply tag uni-gram mapping
|
||||||
|
return TAG_MAP[tag][POS], None
|
||||||
|
|
||||||
|
|
||||||
# Use a mapping of paired punctuation to avoid splitting quoted sentences.
|
def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"):
|
||||||
pairpunct = {"「": "」", "『": "』", "【": "】"}
|
# Compare the content of tokens and text, first
|
||||||
|
|
||||||
|
|
||||||
def separate_sentences(doc):
|
|
||||||
"""Given a doc, mark tokens that start sentences based on Unidic tags.
|
|
||||||
"""
|
|
||||||
|
|
||||||
stack = [] # save paired punctuation
|
|
||||||
|
|
||||||
for i, token in enumerate(doc[:-2]):
|
|
||||||
# Set all tokens after the first to false by default. This is necessary
|
|
||||||
# for the doc code to be aware we've done sentencization, see
|
|
||||||
# `is_sentenced`.
|
|
||||||
token.sent_start = i == 0
|
|
||||||
if token.tag_:
|
|
||||||
if token.tag_ == "補助記号-括弧開":
|
|
||||||
ts = str(token)
|
|
||||||
if ts in pairpunct:
|
|
||||||
stack.append(pairpunct[ts])
|
|
||||||
elif stack and ts == stack[-1]:
|
|
||||||
stack.pop()
|
|
||||||
|
|
||||||
if token.tag_ == "補助記号-句点":
|
|
||||||
next_token = doc[i + 1]
|
|
||||||
if next_token.tag_ != token.tag_ and not stack:
|
|
||||||
next_token.sent_start = True
|
|
||||||
|
|
||||||
|
|
||||||
def get_dtokens(tokenizer, text):
|
|
||||||
tokens = tokenizer.tokenize(text)
|
|
||||||
words = []
|
|
||||||
for ti, token in enumerate(tokens):
|
|
||||||
tag = "-".join([xx for xx in token.part_of_speech()[:4] if xx != "*"])
|
|
||||||
inf = "-".join([xx for xx in token.part_of_speech()[4:] if xx != "*"])
|
|
||||||
dtoken = DetailedToken(token.surface(), (tag, inf), token.dictionary_form())
|
|
||||||
if ti > 0 and words[-1].pos[0] == "空白" and tag == "空白":
|
|
||||||
# don't add multiple space tokens in a row
|
|
||||||
continue
|
|
||||||
words.append(dtoken)
|
|
||||||
|
|
||||||
# remove empty tokens. These can be produced with characters like … that
|
|
||||||
# Sudachi normalizes internally.
|
|
||||||
words = [ww for ww in words if len(ww.surface) > 0]
|
|
||||||
return words
|
|
||||||
|
|
||||||
|
|
||||||
def get_words_lemmas_tags_spaces(dtokens, text, gap_tag=("空白", "")):
|
|
||||||
words = [x.surface for x in dtokens]
|
words = [x.surface for x in dtokens]
|
||||||
if "".join("".join(words).split()) != "".join(text.split()):
|
if "".join("".join(words).split()) != "".join(text.split()):
|
||||||
raise ValueError(Errors.E194.format(text=text, words=words))
|
raise ValueError(Errors.E194.format(text=text, words=words))
|
||||||
text_words = []
|
|
||||||
text_lemmas = []
|
text_dtokens = []
|
||||||
text_tags = []
|
|
||||||
text_spaces = []
|
text_spaces = []
|
||||||
text_pos = 0
|
text_pos = 0
|
||||||
# handle empty and whitespace-only texts
|
# handle empty and whitespace-only texts
|
||||||
if len(words) == 0:
|
if len(words) == 0:
|
||||||
return text_words, text_lemmas, text_tags, text_spaces
|
return text_dtokens, text_spaces
|
||||||
elif len([word for word in words if not word.isspace()]) == 0:
|
elif len([word for word in words if not word.isspace()]) == 0:
|
||||||
assert text.isspace()
|
assert text.isspace()
|
||||||
text_words = [text]
|
text_dtokens = [DetailedToken(text, gap_tag, "", text, None, None)]
|
||||||
text_lemmas = [text]
|
|
||||||
text_tags = [gap_tag]
|
|
||||||
text_spaces = [False]
|
text_spaces = [False]
|
||||||
return text_words, text_lemmas, text_tags, text_spaces
|
return text_dtokens, text_spaces
|
||||||
# normalize words to remove all whitespace tokens
|
|
||||||
norm_words, norm_dtokens = zip(
|
# align words and dtokens by referring text, and insert gap tokens for the space char spans
|
||||||
*[
|
for word, dtoken in zip(words, dtokens):
|
||||||
(word, dtokens)
|
# skip all space tokens
|
||||||
for word, dtokens in zip(words, dtokens)
|
if word.isspace():
|
||||||
if not word.isspace()
|
continue
|
||||||
]
|
|
||||||
)
|
|
||||||
# align words with text
|
|
||||||
for word, dtoken in zip(norm_words, norm_dtokens):
|
|
||||||
try:
|
try:
|
||||||
word_start = text[text_pos:].index(word)
|
word_start = text[text_pos:].index(word)
|
||||||
except ValueError:
|
except ValueError:
|
||||||
raise ValueError(Errors.E194.format(text=text, words=words))
|
raise ValueError(Errors.E194.format(text=text, words=words))
|
||||||
|
|
||||||
|
# space token
|
||||||
if word_start > 0:
|
if word_start > 0:
|
||||||
w = text[text_pos : text_pos + word_start]
|
w = text[text_pos : text_pos + word_start]
|
||||||
text_words.append(w)
|
text_dtokens.append(DetailedToken(w, gap_tag, "", w, None, None))
|
||||||
text_lemmas.append(w)
|
|
||||||
text_tags.append(gap_tag)
|
|
||||||
text_spaces.append(False)
|
text_spaces.append(False)
|
||||||
text_pos += word_start
|
text_pos += word_start
|
||||||
text_words.append(word)
|
|
||||||
text_lemmas.append(dtoken.lemma)
|
# content word
|
||||||
text_tags.append(dtoken.pos)
|
text_dtokens.append(dtoken)
|
||||||
text_spaces.append(False)
|
text_spaces.append(False)
|
||||||
text_pos += len(word)
|
text_pos += len(word)
|
||||||
|
# consume a single space char after the word
|
||||||
if text_pos < len(text) and text[text_pos] == " ":
|
if text_pos < len(text) and text[text_pos] == " ":
|
||||||
text_spaces[-1] = True
|
text_spaces[-1] = True
|
||||||
text_pos += 1
|
text_pos += 1
|
||||||
|
|
||||||
|
# trailing space token
|
||||||
if text_pos < len(text):
|
if text_pos < len(text):
|
||||||
w = text[text_pos:]
|
w = text[text_pos:]
|
||||||
text_words.append(w)
|
text_dtokens.append(DetailedToken(w, gap_tag, "", w, None, None))
|
||||||
text_lemmas.append(w)
|
|
||||||
text_tags.append(gap_tag)
|
|
||||||
text_spaces.append(False)
|
text_spaces.append(False)
|
||||||
return text_words, text_lemmas, text_tags, text_spaces
|
|
||||||
|
return text_dtokens, text_spaces
|
||||||
|
|
||||||
|
|
||||||
class JapaneseTokenizer(DummyTokenizer):
|
class JapaneseTokenizer(DummyTokenizer):
|
||||||
|
@ -190,29 +140,96 @@ class JapaneseTokenizer(DummyTokenizer):
|
||||||
self.tokenizer = try_sudachi_import(self.split_mode)
|
self.tokenizer = try_sudachi_import(self.split_mode)
|
||||||
|
|
||||||
def __call__(self, text):
|
def __call__(self, text):
|
||||||
dtokens = get_dtokens(self.tokenizer, text)
|
# convert sudachipy.morpheme.Morpheme to DetailedToken and merge continuous spaces
|
||||||
|
sudachipy_tokens = self.tokenizer.tokenize(text)
|
||||||
|
dtokens = self._get_dtokens(sudachipy_tokens)
|
||||||
|
dtokens, spaces = get_dtokens_and_spaces(dtokens, text)
|
||||||
|
|
||||||
words, lemmas, unidic_tags, spaces = get_words_lemmas_tags_spaces(dtokens, text)
|
# create Doc with tag bi-gram based part-of-speech identification rules
|
||||||
|
words, tags, inflections, lemmas, readings, sub_tokens_list = (
|
||||||
|
zip(*dtokens) if dtokens else [[]] * 6
|
||||||
|
)
|
||||||
|
sub_tokens_list = list(sub_tokens_list)
|
||||||
doc = Doc(self.vocab, words=words, spaces=spaces)
|
doc = Doc(self.vocab, words=words, spaces=spaces)
|
||||||
next_pos = None
|
next_pos = None # for bi-gram rules
|
||||||
for idx, (token, lemma, unidic_tag) in enumerate(zip(doc, lemmas, unidic_tags)):
|
for idx, (token, dtoken) in enumerate(zip(doc, dtokens)):
|
||||||
token.tag_ = unidic_tag[0]
|
token.tag_ = dtoken.tag
|
||||||
if next_pos:
|
if next_pos: # already identified in previous iteration
|
||||||
token.pos = next_pos
|
token.pos = next_pos
|
||||||
next_pos = None
|
next_pos = None
|
||||||
else:
|
else:
|
||||||
token.pos, next_pos = resolve_pos(
|
token.pos, next_pos = resolve_pos(
|
||||||
token.orth_,
|
token.orth_,
|
||||||
unidic_tag,
|
dtoken.tag,
|
||||||
unidic_tags[idx + 1] if idx + 1 < len(unidic_tags) else None,
|
tags[idx + 1] if idx + 1 < len(tags) else None,
|
||||||
)
|
)
|
||||||
|
|
||||||
# if there's no lemma info (it's an unk) just use the surface
|
# if there's no lemma info (it's an unk) just use the surface
|
||||||
token.lemma_ = lemma
|
token.lemma_ = dtoken.lemma if dtoken.lemma else dtoken.surface
|
||||||
doc.user_data["unidic_tags"] = unidic_tags
|
|
||||||
|
doc.user_data["inflections"] = inflections
|
||||||
|
doc.user_data["reading_forms"] = readings
|
||||||
|
doc.user_data["sub_tokens"] = sub_tokens_list
|
||||||
|
|
||||||
return doc
|
return doc
|
||||||
|
|
||||||
|
def _get_dtokens(self, sudachipy_tokens, need_sub_tokens=True):
|
||||||
|
sub_tokens_list = (
|
||||||
|
self._get_sub_tokens(sudachipy_tokens) if need_sub_tokens else None
|
||||||
|
)
|
||||||
|
dtokens = [
|
||||||
|
DetailedToken(
|
||||||
|
token.surface(), # orth
|
||||||
|
"-".join([xx for xx in token.part_of_speech()[:4] if xx != "*"]), # tag
|
||||||
|
",".join([xx for xx in token.part_of_speech()[4:] if xx != "*"]), # inf
|
||||||
|
token.dictionary_form(), # lemma
|
||||||
|
token.reading_form(), # user_data['reading_forms']
|
||||||
|
sub_tokens_list[idx]
|
||||||
|
if sub_tokens_list
|
||||||
|
else None, # user_data['sub_tokens']
|
||||||
|
)
|
||||||
|
for idx, token in enumerate(sudachipy_tokens)
|
||||||
|
if len(token.surface()) > 0
|
||||||
|
# remove empty tokens which can be produced with characters like … that
|
||||||
|
]
|
||||||
|
# Sudachi normalizes internally and outputs each space char as a token.
|
||||||
|
# This is the preparation for get_dtokens_and_spaces() to merge the continuous space tokens
|
||||||
|
return [
|
||||||
|
t
|
||||||
|
for idx, t in enumerate(dtokens)
|
||||||
|
if idx == 0
|
||||||
|
or not t.surface.isspace()
|
||||||
|
or t.tag != "空白"
|
||||||
|
or not dtokens[idx - 1].surface.isspace()
|
||||||
|
or dtokens[idx - 1].tag != "空白"
|
||||||
|
]
|
||||||
|
|
||||||
|
def _get_sub_tokens(self, sudachipy_tokens):
|
||||||
|
if (
|
||||||
|
self.split_mode is None or self.split_mode == "A"
|
||||||
|
): # do nothing for default split mode
|
||||||
|
return None
|
||||||
|
|
||||||
|
sub_tokens_list = [] # list of (list of list of DetailedToken | None)
|
||||||
|
for token in sudachipy_tokens:
|
||||||
|
sub_a = token.split(self.tokenizer.SplitMode.A)
|
||||||
|
if len(sub_a) == 1: # no sub tokens
|
||||||
|
sub_tokens_list.append(None)
|
||||||
|
elif self.split_mode == "B":
|
||||||
|
sub_tokens_list.append([self._get_dtokens(sub_a, False)])
|
||||||
|
else: # "C"
|
||||||
|
sub_b = token.split(self.tokenizer.SplitMode.B)
|
||||||
|
if len(sub_a) == len(sub_b):
|
||||||
|
dtokens = self._get_dtokens(sub_a, False)
|
||||||
|
sub_tokens_list.append([dtokens, dtokens])
|
||||||
|
else:
|
||||||
|
sub_tokens_list.append(
|
||||||
|
[
|
||||||
|
self._get_dtokens(sub_a, False),
|
||||||
|
self._get_dtokens(sub_b, False),
|
||||||
|
]
|
||||||
|
)
|
||||||
|
return sub_tokens_list
|
||||||
|
|
||||||
def _get_config(self):
|
def _get_config(self):
|
||||||
config = OrderedDict((("split_mode", self.split_mode),))
|
config = OrderedDict((("split_mode", self.split_mode),))
|
||||||
return config
|
return config
|
||||||
|
|
|
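The alignment step in the Japanese tokenizer hunk above can be restated as a small standalone sketch. It walks the raw text, skips whitespace tokens coming from the segmenter, inserts a gap token for any whitespace span that a trailing-space flag cannot represent, and folds one following ASCII space into the preceding token. The function name `align_tokens` and the `namedtuple` definition are assumptions made for this illustration (only the field order is taken from the `DetailedToken(...)` calls in the diff), and `text.index(word, pos)` is used here in place of the slice-based lookup in the original.

```python
from collections import namedtuple

# Field order mirrors the DetailedToken(...) calls in the diff;
# the namedtuple definition itself is an assumption for this sketch.
DetailedToken = namedtuple(
    "DetailedToken", ["surface", "tag", "inf", "lemma", "reading", "sub_tokens"]
)


def align_tokens(dtokens, text, gap_tag="空白"):
    """Hypothetical, simplified restatement of the alignment loop."""
    out, spaces, pos = [], [], 0
    for dtoken in dtokens:
        word = dtoken.surface
        if word.isspace():
            continue  # whitespace is re-derived from the text itself
        start = text.index(word, pos)  # raises ValueError on a mismatch
        if start > pos:
            # whitespace span that cannot be expressed as a trailing space
            gap = text[pos:start]
            out.append(DetailedToken(gap, gap_tag, "", gap, None, None))
            spaces.append(False)
        out.append(dtoken)
        spaces.append(False)
        pos = start + len(word)
        # one following ASCII space is folded into the token's space flag
        if pos < len(text) and text[pos] == " ":
            spaces[-1] = True
            pos += 1
    if pos < len(text):
        # trailing whitespace becomes a final gap token
        gap = text[pos:]
        out.append(DetailedToken(gap, gap_tag, "", gap, None, None))
        spaces.append(False)
    return out, spaces
```

Joining each surface with a single space wherever its flag is `True` reproduces the input text, which is the invariant `Doc(vocab, words=..., spaces=...)` relies on.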
@ -1,176 +0,0 @@
|
||||||
POS_PHRASE_MAP = {
|
|
||||||
"NOUN": "NP",
|
|
||||||
"NUM": "NP",
|
|
||||||
"PRON": "NP",
|
|
||||||
"PROPN": "NP",
|
|
||||||
"VERB": "VP",
|
|
||||||
"ADJ": "ADJP",
|
|
||||||
"ADV": "ADVP",
|
|
||||||
"CCONJ": "CCONJP",
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
# return value: [(bunsetu_tokens, phrase_type={'NP', 'VP', 'ADJP', 'ADVP'}, phrase_tokens)]
|
|
||||||
def yield_bunsetu(doc, debug=False):
|
|
||||||
bunsetu = []
|
|
||||||
bunsetu_may_end = False
|
|
||||||
phrase_type = None
|
|
||||||
phrase = None
|
|
||||||
prev = None
|
|
||||||
prev_tag = None
|
|
||||||
prev_dep = None
|
|
||||||
prev_head = None
|
|
||||||
for t in doc:
|
|
||||||
pos = t.pos_
|
|
||||||
pos_type = POS_PHRASE_MAP.get(pos, None)
|
|
||||||
tag = t.tag_
|
|
||||||
dep = t.dep_
|
|
||||||
head = t.head.i
|
|
||||||
if debug:
|
|
||||||
print(
|
|
||||||
t.i,
|
|
||||||
t.orth_,
|
|
||||||
pos,
|
|
||||||
pos_type,
|
|
||||||
dep,
|
|
||||||
head,
|
|
||||||
bunsetu_may_end,
|
|
||||||
phrase_type,
|
|
||||||
phrase,
|
|
||||||
bunsetu,
|
|
||||||
)
|
|
||||||
|
|
||||||
# DET is always an individual bunsetu
|
|
||||||
if pos == "DET":
|
|
||||||
if bunsetu:
|
|
||||||
yield bunsetu, phrase_type, phrase
|
|
||||||
yield [t], None, None
|
|
||||||
bunsetu = []
|
|
||||||
bunsetu_may_end = False
|
|
||||||
phrase_type = None
|
|
||||||
phrase = None
|
|
||||||
|
|
||||||
# PRON or Open PUNCT always splits bunsetu
|
|
||||||
elif tag == "補助記号-括弧開":
|
|
||||||
if bunsetu:
|
|
||||||
yield bunsetu, phrase_type, phrase
|
|
||||||
bunsetu = [t]
|
|
||||||
bunsetu_may_end = True
|
|
||||||
phrase_type = None
|
|
||||||
phrase = None
|
|
||||||
|
|
||||||
# bunsetu head has not appeared yet
|
|
||||||
elif phrase_type is None:
|
|
||||||
if bunsetu and prev_tag == "補助記号-読点":
|
|
||||||
yield bunsetu, phrase_type, phrase
|
|
||||||
bunsetu = []
|
|
||||||
bunsetu_may_end = False
|
|
||||||
phrase_type = None
|
|
||||||
phrase = None
|
|
||||||
bunsetu.append(t)
|
|
||||||
if pos_type: # begin phrase
|
|
||||||
phrase = [t]
|
|
||||||
phrase_type = pos_type
|
|
||||||
if pos_type in {"ADVP", "CCONJP"}:
|
|
||||||
bunsetu_may_end = True
|
|
||||||
|
|
||||||
# entering new bunsetu
|
|
||||||
elif pos_type and (
|
|
||||||
pos_type != phrase_type  # different phrase type arises
|
|
||||||
or bunsetu_may_end  # same phrase type but bunsetu already ended
|
|
||||||
):
|
|
||||||
# exceptional case: NOUN to VERB
|
|
||||||
if (
|
|
||||||
phrase_type == "NP"
|
|
||||||
and pos_type == "VP"
|
|
||||||
and prev_dep == "compound"
|
|
||||||
and prev_head == t.i
|
|
||||||
):
|
|
||||||
bunsetu.append(t)
|
|
||||||
phrase_type = "VP"
|
|
||||||
phrase.append(t)
|
|
||||||
# exceptional case: VERB to NOUN
|
|
||||||
elif (
|
|
||||||
phrase_type == "VP"
|
|
||||||
and pos_type == "NP"
|
|
||||||
and (
|
|
||||||
prev_dep == "compound"
|
|
||||||
and prev_head == t.i
|
|
||||||
or dep == "compound"
|
|
||||||
and prev == head
|
|
||||||
or prev_dep == "nmod"
|
|
||||||
and prev_head == t.i
|
|
||||||
)
|
|
||||||
):
|
|
||||||
bunsetu.append(t)
|
|
||||||
phrase_type = "NP"
|
|
||||||
phrase.append(t)
|
|
||||||
else:
|
|
||||||
yield bunsetu, phrase_type, phrase
|
|
||||||
bunsetu = [t]
|
|
||||||
bunsetu_may_end = False
|
|
||||||
phrase_type = pos_type
|
|
||||||
phrase = [t]
|
|
||||||
|
|
||||||
# NOUN bunsetu
|
|
||||||
elif phrase_type == "NP":
|
|
||||||
bunsetu.append(t)
|
|
||||||
if not bunsetu_may_end and (
|
|
||||||
(
|
|
||||||
(pos_type == "NP" or pos == "SYM")
|
|
||||||
and (prev_head == t.i or prev_head == head)
|
|
||||||
and prev_dep in {"compound", "nummod"}
|
|
||||||
)
|
|
||||||
or (
|
|
||||||
pos == "PART"
|
|
||||||
and (prev == head or prev_head == head)
|
|
||||||
and dep == "mark"
|
|
||||||
)
|
|
||||||
):
|
|
||||||
phrase.append(t)
|
|
||||||
else:
|
|
||||||
bunsetu_may_end = True
|
|
||||||
|
|
||||||
# VERB bunsetu
|
|
||||||
elif phrase_type == "VP":
|
|
||||||
bunsetu.append(t)
|
|
||||||
if (
|
|
||||||
not bunsetu_may_end
|
|
||||||
and pos == "VERB"
|
|
||||||
and prev_head == t.i
|
|
||||||
and prev_dep == "compound"
|
|
||||||
):
|
|
||||||
phrase.append(t)
|
|
||||||
else:
|
|
||||||
bunsetu_may_end = True
|
|
||||||
|
|
||||||
# ADJ bunsetu
|
|
||||||
elif phrase_type == "ADJP" and tag != "連体詞":
|
|
||||||
bunsetu.append(t)
|
|
||||||
if not bunsetu_may_end and (
|
|
||||||
(
|
|
||||||
pos == "NOUN"
|
|
||||||
and (prev_head == t.i or prev_head == head)
|
|
||||||
and prev_dep in {"amod", "compound"}
|
|
||||||
)
|
|
||||||
or (
|
|
||||||
pos == "PART"
|
|
||||||
and (prev == head or prev_head == head)
|
|
||||||
and dep == "mark"
|
|
||||||
)
|
|
||||||
):
|
|
||||||
phrase.append(t)
|
|
||||||
else:
|
|
||||||
bunsetu_may_end = True
|
|
||||||
|
|
||||||
# other bunsetu
|
|
||||||
else:
|
|
||||||
bunsetu.append(t)
|
|
||||||
|
|
||||||
prev = t.i
|
|
||||||
prev_tag = t.tag_
|
|
||||||
prev_dep = t.dep_
|
|
||||||
prev_head = head
|
|
||||||
|
|
||||||
if bunsetu:
|
|
||||||
yield bunsetu, phrase_type, phrase
|
|
|
@ -39,7 +39,11 @@ def check_spaces(text, tokens):
|
||||||
class KoreanTokenizer(DummyTokenizer):
|
class KoreanTokenizer(DummyTokenizer):
|
||||||
def __init__(self, cls, nlp=None):
|
def __init__(self, cls, nlp=None):
|
||||||
self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
|
self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
|
||||||
self.Tokenizer = try_mecab_import()
|
MeCab = try_mecab_import()
|
||||||
|
self.mecab_tokenizer = MeCab("-F%f[0],%f[7]")
|
||||||
|
|
||||||
|
def __del__(self):
|
||||||
|
self.mecab_tokenizer.__del__()
|
||||||
|
|
||||||
def __call__(self, text):
|
def __call__(self, text):
|
||||||
dtokens = list(self.detailed_tokens(text))
|
dtokens = list(self.detailed_tokens(text))
|
||||||
|
@ -55,17 +59,16 @@ class KoreanTokenizer(DummyTokenizer):
|
||||||
def detailed_tokens(self, text):
|
def detailed_tokens(self, text):
|
||||||
# 품사 태그(POS)[0], 의미 부류(semantic class)[1], 종성 유무(jongseong)[2], 읽기(reading)[3],
|
# 품사 태그(POS)[0], 의미 부류(semantic class)[1], 종성 유무(jongseong)[2], 읽기(reading)[3],
|
||||||
# 타입(type)[4], 첫번째 품사(start pos)[5], 마지막 품사(end pos)[6], 표현(expression)[7], *
|
# 타입(type)[4], 첫번째 품사(start pos)[5], 마지막 품사(end pos)[6], 표현(expression)[7], *
|
||||||
with self.Tokenizer("-F%f[0],%f[7]") as tokenizer:
|
for node in self.mecab_tokenizer.parse(text, as_nodes=True):
|
||||||
for node in tokenizer.parse(text, as_nodes=True):
|
if node.is_eos():
|
||||||
if node.is_eos():
|
break
|
||||||
break
|
surface = node.surface
|
||||||
surface = node.surface
|
feature = node.feature
|
||||||
feature = node.feature
|
tag, _, expr = feature.partition(",")
|
||||||
tag, _, expr = feature.partition(",")
|
lemma, _, remainder = expr.partition("/")
|
||||||
lemma, _, remainder = expr.partition("/")
|
if lemma == "*":
|
||||||
if lemma == "*":
|
lemma = surface
|
||||||
lemma = surface
|
yield {"surface": surface, "lemma": lemma, "tag": tag}
|
||||||
yield {"surface": surface, "lemma": lemma, "tag": tag}
|
|
||||||
|
|
||||||
|
|
||||||
class KoreanDefaults(Language.Defaults):
|
class KoreanDefaults(Language.Defaults):
|
||||||
|
|
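The feature parsing inside `detailed_tokens` in the Korean tokenizer hunk above does not depend on MeCab and can be tried in isolation. The sketch below isolates that logic in a helper. The name `parse_feature` and the sample feature strings are assumptions made for illustration, not captured MeCab output. The strings only follow the `tag,expression` shape requested by the `-F%f[0],%f[7]` format.

```python
def parse_feature(surface, feature):
    """Hypothetical helper: split a 'tag,expression' feature string."""
    # everything before the first comma is the POS tag
    tag, _, expr = feature.partition(",")
    # the expression starts with the lemma, followed by '/'-separated details
    lemma, _, remainder = expr.partition("/")
    if lemma == "*":
        # '*' means no expression was given, so fall back to the surface form
        lemma = surface
    return {"surface": surface, "lemma": lemma, "tag": tag}


# illustrative inputs only
print(parse_feature("학교", "NNG,*"))
print(parse_feature("가", "VV+EC,가/VV/*+아/EC/*"))
```

`str.partition` always returns three parts, so a feature string without a comma or slash still parses without raising.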
23
spacy/lang/ne/__init__.py
Normal file
|
@ -0,0 +1,23 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from .stop_words import STOP_WORDS
|
||||||
|
from .lex_attrs import LEX_ATTRS
|
||||||
|
|
||||||
|
from ...language import Language
|
||||||
|
from ...attrs import LANG
|
||||||
|
|
||||||
|
|
||||||
|
class NepaliDefaults(Language.Defaults):
|
||||||
|
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||||
|
lex_attr_getters.update(LEX_ATTRS)
|
||||||
|
lex_attr_getters[LANG] = lambda text: "ne" # Nepali language ISO code
|
||||||
|
stop_words = STOP_WORDS
|
||||||
|
|
||||||
|
|
||||||
|
class Nepali(Language):
|
||||||
|
lang = "ne"
|
||||||
|
Defaults = NepaliDefaults
|
||||||
|
|
||||||
|
|
||||||
|
__all__ = ["Nepali"]
|
22
spacy/lang/ne/examples.py
Normal file
|
@ -0,0 +1,22 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
|
"""
|
||||||
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
>>> from spacy.lang.ne.examples import sentences
|
||||||
|
>>> docs = nlp.pipe(sentences)
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
sentences = [
|
||||||
|
"एप्पलले अमेरिकी स्टार्टअप १ अर्ब डलरमा किन्ने सोच्दै छ",
|
||||||
|
"स्वायत्त कारहरूले बीमा दायित्व निर्माताहरु तिर बदल्छन्",
|
||||||
|
"स्यान फ्रांसिस्कोले फुटपाथ वितरण रोबोटहरु प्रतिबंध गर्ने विचार गर्दै छ",
|
||||||
|
"लन्डन यूनाइटेड किंगडमको एक ठूलो शहर हो।",
|
||||||
|
"तिमी कहाँ छौ?",
|
||||||
|
"फ्रान्स को राष्ट्रपति को हो?",
|
||||||
|
"संयुक्त राज्यको राजधानी के हो?",
|
||||||
|
"बराक ओबामा कहिले कहिले जन्मेका हुन्?",
|
||||||
|
]
|
98
spacy/lang/ne/lex_attrs.py
Normal file
|
@ -0,0 +1,98 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ..norm_exceptions import BASE_NORMS
|
||||||
|
from ...attrs import NORM, LIKE_NUM
|
||||||
|
|
||||||
|
|
||||||
|
# fmt: off
|
||||||
|
_stem_suffixes = [
|
||||||
|
["ा", "ि", "ी", "ु", "ू", "ृ", "े", "ै", "ो", "ौ"],
|
||||||
|
["ँ", "ं", "्", "ः"],
|
||||||
|
["लाई", "ले", "बाट", "को", "मा", "हरू"],
|
||||||
|
["हरूलाई", "हरूले", "हरूबाट", "हरूको", "हरूमा"],
|
||||||
|
["इलो", "िलो", "नु", "ाउनु", "ई", "इन", "इन्", "इनन्"],
|
||||||
|
["एँ", "इँन्", "इस्", "इनस्", "यो", "एन", "यौं", "एनौं", "ए", "एनन्"],
|
||||||
|
["छु", "छौँ", "छस्", "छौ", "छ", "छन्", "छेस्", "छे", "छ्यौ", "छिन्", "हुन्छ"],
|
||||||
|
["दै", "दिन", "दिँन", "दैनस्", "दैन", "दैनौँ", "दैनौं", "दैनन्"],
|
||||||
|
["हुन्न", "न्न", "न्न्स्", "न्नौं", "न्नौ", "न्न्न्", "िई"],
|
||||||
|
["अ", "ओ", "ऊ", "अरी", "साथ", "वित्तिकै", "पूर्वक"],
|
||||||
|
["याइ", "ाइ", "बार", "वार", "चाँहि"],
|
||||||
|
["ने", "ेको", "ेकी", "ेका", "ेर", "दै", "तै", "िकन", "उ", "न", "नन्"]
|
||||||
|
]
|
||||||
|
# fmt: on
|
||||||
|
|
||||||
|
# reference 1: https://en.wikipedia.org/wiki/Numbers_in_Nepali_language
|
||||||
|
# reference 2: https://www.imnepal.com/nepali-numbers/
|
||||||
|
_num_words = [
|
||||||
|
"शुन्य",
|
||||||
|
"एक",
|
||||||
|
"दुई",
|
||||||
|
"तीन",
|
||||||
|
"चार",
|
||||||
|
"पाँच",
|
||||||
|
"छ",
|
||||||
|
"सात",
|
||||||
|
"आठ",
|
||||||
|
"नौ",
|
||||||
|
"दश",
|
||||||
|
"एघार",
|
||||||
|
"बाह्र",
|
||||||
|
"तेह्र",
|
||||||
|
"चौध",
|
||||||
|
"पन्ध्र",
|
||||||
|
"सोह्र",
|
||||||
|
"सोह्र",
|
||||||
|
"सत्र",
|
||||||
|
"अठार",
|
||||||
|
"उन्नाइस",
|
||||||
|
"बीस",
|
||||||
|
"तीस",
|
||||||
|
"चालीस",
|
||||||
|
"पचास",
|
||||||
|
"साठी",
|
||||||
|
"सत्तरी",
|
||||||
|
"असी",
|
||||||
|
"नब्बे",
|
||||||
|
"सय",
|
||||||
|
"हजार",
|
||||||
|
"लाख",
|
||||||
|
"करोड",
|
||||||
|
"अर्ब",
|
||||||
|
"खर्ब",
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def norm(string):
|
||||||
|
# normalise base exceptions, e.g. punctuation or currency symbols
|
||||||
|
if string in BASE_NORMS:
|
||||||
|
return BASE_NORMS[string]
|
||||||
|
# set stem word as norm, if available, adapted from:
|
||||||
|
# https://github.com/explosion/spaCy/blob/master/spacy/lang/hi/lex_attrs.py
|
||||||
|
# https://www.researchgate.net/publication/237261579_Structure_of_Nepali_Grammar
|
||||||
|
for suffix_group in reversed(_stem_suffixes):
|
||||||
|
length = len(suffix_group[0])
|
||||||
|
if len(string) <= length:
|
||||||
|
break
|
||||||
|
for suffix in suffix_group:
|
||||||
|
if string.endswith(suffix):
|
||||||
|
return string[:-length]
|
||||||
|
return string
|
||||||
|
|
||||||
|
|
||||||
|
def like_num(text):
|
||||||
|
if text.startswith(("+", "-", "±", "~")):
|
||||||
|
text = text[1:]
|
||||||
|
text = text.replace(", ", "").replace(".", "")
|
||||||
|
if text.isdigit():
|
||||||
|
return True
|
||||||
|
if text.count("/") == 1:
|
||||||
|
num, denom = text.split("/")
|
||||||
|
if num.isdigit() and denom.isdigit():
|
||||||
|
return True
|
||||||
|
if text.lower() in _num_words:
|
||||||
|
return True
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
LEX_ATTRS = {NORM: norm, LIKE_NUM: like_num}
|
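A note on `norm` in the Nepali `lex_attrs.py` above: the slice length is taken from the first suffix of each group (`len(suffix_group[0])`), so a match on a shorter or longer suffix in the same group is cut by that first suffix's length rather than its own. A variant that strips exactly the matched suffix might look like the sketch below. The function name `strip_suffix` and the reduced suffix list are assumptions for illustration. This is not the code merged in the commit.

```python
# A small, illustrative subset of suffixes; not the full table from the diff.
_SUFFIXES = ["हरूलाई", "हरूको", "लाई", "हरू", "ले", "को", "मा"]


def strip_suffix(string, suffixes=_SUFFIXES):
    """Hypothetical variant: remove exactly the suffix that matched."""
    # try longer suffixes first so that a compound suffix wins over its tail
    for suffix in sorted(suffixes, key=len, reverse=True):
        # never strip the whole word
        if len(string) > len(suffix) and string.endswith(suffix):
            return string[: -len(suffix)]
    return string
```

Sorting by length is one possible design choice. The grouped layout in the diff could also be kept, as long as the slice uses the length of the suffix that actually matched.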
498
spacy/lang/ne/stop_words.py
Normal file
|
@ -0,0 +1,498 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
|
# Source: https://github.com/sanjaalcorps/NepaliStopWords/blob/master/NepaliStopWords.txt
|
||||||
|
|
||||||
|
STOP_WORDS = set(
|
||||||
|
"""
|
||||||
|
अक्सर
|
||||||
|
अगाडि
|
||||||
|
अगाडी
|
||||||
|
अघि
|
||||||
|
अझै
|
||||||
|
अठार
|
||||||
|
अथवा
|
||||||
|
अनि
|
||||||
|
अनुसार
|
||||||
|
अन्तर्गत
|
||||||
|
अन्य
|
||||||
|
अन्यत्र
|
||||||
|
अन्यथा
|
||||||
|
अब
|
||||||
|
अरु
|
||||||
|
अरुलाई
|
||||||
|
अरू
|
||||||
|
अर्को
|
||||||
|
अर्थात
|
||||||
|
अर्थात्
|
||||||
|
अलग
|
||||||
|
अलि
|
||||||
|
अवस्था
|
||||||
|
अहिले
|
||||||
|
आए
|
||||||
|
आएका
|
||||||
|
आएको
|
||||||
|
आज
|
||||||
|
आजको
|
||||||
|
आठ
|
||||||
|
आत्म
|
||||||
|
आदि
|
||||||
|
आदिलाई
|
||||||
|
आफनो
|
||||||
|
आफू
|
||||||
|
आफूलाई
|
||||||
|
आफै
|
||||||
|
आफैँ
|
||||||
|
आफ्नै
|
||||||
|
आफ्नो
|
||||||
|
आयो
|
||||||
|
उ
|
||||||
|
उक्त
|
||||||
|
उदाहरण
|
||||||
|
उनको
|
||||||
|
उनलाई
|
||||||
|
उनले
|
||||||
|
उनि
|
||||||
|
उनी
|
||||||
|
उनीहरुको
|
||||||
|
उन्नाइस
|
||||||
|
उप
|
||||||
|
उसको
|
||||||
|
उसलाई
|
||||||
|
उसले
|
||||||
|
उहालाई
|
||||||
|
ऊ
|
||||||
|
एउटा
|
||||||
|
एउटै
|
||||||
|
एक
|
||||||
|
एकदम
|
||||||
|
एघार
|
||||||
|
ओठ
|
||||||
|
औ
|
||||||
|
औं
|
||||||
|
कता
|
||||||
|
कति
|
||||||
|
कतै
|
||||||
|
कम
|
||||||
|
कमसेकम
|
||||||
|
कसरि
|
||||||
|
कसरी
|
||||||
|
कसै
|
||||||
|
कसैको
|
||||||
|
कसैलाई
|
||||||
|
कसैले
|
||||||
|
कसैसँग
|
||||||
|
कस्तो
|
||||||
|
कहाँबाट
|
||||||
|
कहिलेकाहीं
|
||||||
|
का
|
||||||
|
काम
|
||||||
|
कारण
|
||||||
|
कि
|
||||||
|
किन
|
||||||
|
किनभने
|
||||||
|
कुन
|
||||||
|
कुनै
|
||||||
|
कुन्नी
|
||||||
|
कुरा
|
||||||
|
कृपया
|
||||||
|
के
|
||||||
|
केहि
|
||||||
|
केही
|
||||||
|
को
|
||||||
|
कोहि
|
||||||
|
कोहिपनि
|
||||||
|
कोही
|
||||||
|
कोहीपनि
|
||||||
|
क्रमशः
|
||||||
|
गए
|
||||||
|
गएको
|
||||||
|
गएर
|
||||||
|
गयौ
|
||||||
|
गरि
|
||||||
|
गरी
|
||||||
|
गरे
|
||||||
|
गरेका
|
||||||
|
गरेको
|
||||||
|
गरेर
|
||||||
|
गरौं
|
||||||
|
गर्छ
|
||||||
|
गर्छन्
|
||||||
|
गर्छु
|
||||||
|
गर्दा
|
||||||
|
गर्दै
|
||||||
|
गर्न
|
||||||
|
गर्नु
|
||||||
|
गर्नुपर्छ
|
||||||
|
गर्ने
|
||||||
|
गैर
|
||||||
|
घर
|
||||||
|
चार
|
||||||
|
चाले
|
||||||
|
चाहनुहुन्छ
|
||||||
|
चाहन्छु
|
||||||
|
चाहिं
|
||||||
|
चाहिए
|
||||||
|
चाहिंले
|
||||||
|
चाहीं
|
||||||
|
चाहेको
|
||||||
|
चाहेर
|
||||||
|
चोटी
|
||||||
|
चौथो
|
||||||
|
चौध
|
||||||
|
छ
|
||||||
|
छन
|
||||||
|
छन्
|
||||||
|
छु
|
||||||
|
छू
|
||||||
|
छैन
|
||||||
|
छैनन्
|
||||||
|
छौ
|
||||||
|
छौं
|
||||||
|
जता
|
||||||
|
जताततै
|
||||||
|
जना
|
||||||
|
जनाको
|
||||||
|
जनालाई
|
||||||
|
जनाले
|
||||||
|
जब
|
||||||
|
जबकि
|
||||||
|
जबकी
|
||||||
|
जसको
|
||||||
|
जसबाट
|
||||||
|
जसमा
|
||||||
|
जसरी
|
||||||
|
जसलाई
|
||||||
|
जसले
|
||||||
|
जस्ता
|
||||||
|
जस्तै
|
||||||
|
जस्तो
|
||||||
|
जस्तोसुकै
|
||||||
|
जहाँ
|
||||||
|
जान
|
||||||
|
जाने
|
||||||
|
जाहिर
|
||||||
|
जुन
|
||||||
|
जुनै
|
||||||
|
जे
|
||||||
|
जो
|
||||||
|
जोपनि
|
||||||
|
जोपनी
|
||||||
|
झैं
|
||||||
|
ठाउँमा
|
||||||
|
ठीक
|
||||||
|
ठूलो
|
||||||
|
त
|
||||||
|
तता
|
||||||
|
तत्काल
|
||||||
|
तथा
|
||||||
|
तथापि
|
||||||
|
तथापी
|
||||||
|
तदनुसार
|
||||||
|
तपाइ
|
||||||
|
तपाई
|
||||||
|
तपाईको
|
||||||
|
तब
|
||||||
|
तर
|
||||||
|
तर्फ
|
||||||
|
तल
|
||||||
|
तसरी
|
||||||
|
तापनि
|
||||||
|
तापनी
|
||||||
|
तिन
|
||||||
|
तिनि
|
||||||
|
तिनिहरुलाई
|
||||||
|
तिनी
|
||||||
|
तिनीहरु
|
||||||
|
तिनीहरुको
|
||||||
|
तिनीहरू
|
||||||
|
तिनीहरूको
|
||||||
|
तिनै
|
||||||
|
तिमी
|
||||||
|
तिर
|
||||||
|
तिरको
|
||||||
|
ती
|
||||||
|
तीन
|
||||||
|
तुरन्त
|
||||||
|
तुरुन्त
|
||||||
|
तुरुन्तै
|
||||||
|
तेश्रो
|
||||||
|
तेस्कारण
|
||||||
|
तेस्रो
|
||||||
|
तेह्र
|
||||||
|
तैपनि
|
||||||
|
तैपनी
|
||||||
|
त्यत्तिकै
|
||||||
|
त्यत्तिकैमा
|
||||||
|
त्यस
|
||||||
|
त्यसकारण
|
||||||
|
त्यसको
|
||||||
|
त्यसले
|
||||||
|
त्यसैले
|
||||||
|
त्यसो
|
||||||
|
त्यस्तै
|
||||||
|
त्यस्तो
|
||||||
|
त्यहाँ
|
||||||
|
त्यहिँ
|
||||||
|
त्यही
|
||||||
|
त्यहीँ
|
||||||
|
त्यहीं
|
||||||
|
त्यो
|
||||||
|
त्सपछि
|
||||||
|
त्सैले
|
||||||
|
थप
|
||||||
|
थरि
|
||||||
|
थरी
|
||||||
|
थाहा
|
||||||
|
थिए
|
||||||
|
थिएँ
|
||||||
|
थिएन
|
||||||
|
थियो
|
||||||
|
दर्ता
|
||||||
|
दश
|
||||||
|
दिए
|
||||||
|
दिएको
|
||||||
|
दिन
|
||||||
|
दिनुभएको
|
||||||
|
दिनुहुन्छ
|
||||||
|
दुइ
|
||||||
|
दुइवटा
|
||||||
|
दुई
|
||||||
|
देखि
|
||||||
|
देखिन्छ
|
||||||
|
देखियो
|
||||||
|
देखे
|
||||||
|
देखेको
|
||||||
|
देखेर
|
||||||
|
दोश्री
|
||||||
|
दोश्रो
|
||||||
|
दोस्रो
|
||||||
|
द्वारा
|
||||||
|
धन्न
|
||||||
|
धेरै
|
||||||
|
धौ
|
||||||
|
न
|
||||||
|
नगर्नु
|
||||||
|
नगर्नू
|
||||||
|
नजिकै
|
||||||
|
नत्र
|
||||||
|
नत्रभने
|
||||||
|
नभई
|
||||||
|
नभएको
|
||||||
|
नभनेर
|
||||||
|
नयाँ
|
||||||
|
नि
|
||||||
|
निकै
|
||||||
|
निम्ति
|
||||||
|
निम्न
|
||||||
|
निम्नानुसार
|
||||||
|
निर्दिष्ट
|
||||||
|
नै
|
||||||
|
नौ
|
||||||
|
पक्का
|
||||||
|
पक्कै
|
||||||
|
पछाडि
|
||||||
|
पछाडी
|
||||||
|
पछि
|
||||||
|
पछिल्लो
|
||||||
|
पछी
|
||||||
|
पटक
|
||||||
|
पनि
|
||||||
|
पन्ध्र
|
||||||
|
पर्छ
|
||||||
|
पर्थ्यो
|
||||||
|
पर्दैन
|
||||||
|
पर्ने
|
||||||
|
पर्नेमा
|
||||||
|
पर्याप्त
|
||||||
|
पहिले
|
||||||
|
पहिलो
|
||||||
|
पहिल्यै
|
||||||
|
पाँच
|
||||||
|
पांच
|
||||||
|
पाचौँ
|
||||||
|
पाँचौं
|
||||||
|
पिच्छे
|
||||||
|
पूर्व
|
||||||
|
पो
|
||||||
|
प्रति
|
||||||
|
प्रतेक
|
||||||
|
प्रत्यक
|
||||||
|
प्राय
|
||||||
|
प्लस
|
||||||
|
फरक
|
||||||
|
फेरि
|
||||||
|
फेरी
|
||||||
|
बढी
|
||||||
|
बताए
|
||||||
|
बने
|
||||||
|
बरु
|
||||||
|
बाट
|
||||||
|
बारे
|
||||||
|
बाहिर
|
||||||
|
बाहेक
|
||||||
|
बाह्र
|
||||||
|
बिच
|
||||||
|
बिचमा
|
||||||
|
बिरुद्ध
|
||||||
|
बिशेष
|
||||||
|
बिस
|
||||||
|
बीच
|
||||||
|
बीचमा
|
||||||
|
बीस
|
||||||
|
भए
|
||||||
|
भएँ
|
||||||
|
भएका
|
||||||
|
भएकालाई
|
||||||
|
भएको
|
||||||
|
भएन
|
||||||
|
भएर
|
||||||
|
भन
|
||||||
|
भने
|
||||||
|
भनेको
|
||||||
|
भनेर
|
||||||
|
भन्
|
||||||
|
भन्छन्
|
||||||
|
भन्छु
|
||||||
|
भन्दा
|
||||||
|
भन्दै
|
||||||
|
भन्नुभयो
|
||||||
|
भन्ने
|
||||||
|
भन्या
|
||||||
|
भयेन
|
||||||
|
भयो
|
||||||
|
भर
|
||||||
|
भरि
|
||||||
|
भरी
|
||||||
|
भा
|
||||||
|
भित्र
|
||||||
|
भित्री
|
||||||
|
भीत्र
|
||||||
|
म
|
||||||
|
मध्य
|
||||||
|
मध्ये
|
||||||
|
मलाई
|
||||||
|
मा
|
||||||
|
मात्र
|
||||||
|
मात्रै
|
||||||
|
माथि
|
||||||
|
माथी
|
||||||
|
मुख्य
|
||||||
|
मुनि
|
||||||
|
मुन्तिर
|
||||||
|
मेरो
|
||||||
|
मैले
|
||||||
|
यति
|
||||||
|
यथोचित
|
||||||
|
यदि
|
||||||
|
यद्ध्यपि
|
||||||
|
यद्यपि
|
||||||
|
यस
|
||||||
|
यसका
|
||||||
|
यसको
|
||||||
|
यसपछि
|
||||||
|
यसबाहेक
|
||||||
|
यसमा
|
||||||
|
यसरी
|
||||||
|
यसले
|
||||||
|
यसो
|
||||||
|
यस्तै
|
||||||
|
यस्तो
|
||||||
|
यहाँ
|
||||||
|
यहाँसम्म
|
||||||
|
यही
|
||||||
|
या
|
||||||
|
यी
|
||||||
|
यो
|
||||||
|
र
|
||||||
|
रही
|
||||||
|
रहेका
|
||||||
|
रहेको
|
||||||
|
रहेछ
|
||||||
|
राखे
|
||||||
|
राख्छ
|
||||||
|
राम्रो
|
||||||
|
रुपमा
|
||||||
|
रूप
|
||||||
|
रे
|
||||||
|
लगभग
|
||||||
|
लगायत
|
||||||
|
लाई
|
||||||
|
लाख
|
||||||
|
लागि
|
||||||
|
लागेको
|
||||||
|
ले
|
||||||
|
वटा
|
||||||
|
वरीपरी
|
||||||
|
वा
|
||||||
|
वाट
|
||||||
|
वापत
|
||||||
|
वास्तवमा
|
||||||
|
शायद
|
||||||
|
सक्छ
|
||||||
|
सक्ने
|
||||||
|
सँग
|
||||||
|
संग
|
||||||
|
सँगको
|
||||||
|
सँगसँगै
|
||||||
|
सँगै
|
||||||
|
संगै
|
||||||
|
सङ्ग
|
||||||
|
सङ्गको
|
||||||
|
सट्टा
|
||||||
|
सत्र
|
||||||
|
सधै
|
||||||
|
सबै
|
||||||
|
सबैको
|
||||||
|
सबैलाई
|
||||||
|
समय
|
||||||
|
समेत
|
||||||
|
सम्भव
|
||||||
|
सम्म
|
||||||
|
सय
|
||||||
|
सरह
|
||||||
|
सहित
|
||||||
|
सहितै
|
||||||
|
सही
|
||||||
|
साँच्चै
|
||||||
|
सात
|
||||||
|
साथ
|
||||||
|
साथै
|
||||||
|
सायद
|
||||||
|
सारा
|
||||||
|
सुनेको
|
||||||
|
सुनेर
|
||||||
|
सुरु
|
||||||
|
सुरुको
|
||||||
|
सुरुमै
|
||||||
|
सो
|
||||||
|
सोचेको
|
||||||
|
सोचेर
|
||||||
|
सोही
|
||||||
|
सोह्र
|
||||||
|
स्थित
|
||||||
|
स्पष्ट
|
||||||
|
हजार
|
||||||
|
हरे
|
||||||
|
हरेक
|
||||||
|
हामी
|
||||||
|
हामीले
|
||||||
|
हाम्रा
|
||||||
|
हाम्रो
|
||||||
|
हुँदैन
|
||||||
|
हुन
|
||||||
|
हुनत
|
||||||
|
हुनु
|
||||||
|
हुने
|
||||||
|
हुनेछ
|
||||||
|
हुन्
|
||||||
|
हुन्छ
|
||||||
|
हुन्थ्यो
|
||||||
|
हैन
|
||||||
|
हो
|
||||||
|
होइन
|
||||||
|
होकि
|
||||||
|
होला
|
||||||
|
""".split()
|
||||||
|
)
|
|
@ -14,7 +14,7 @@ from .stop_words import STOP_WORDS
|
||||||
from ... import util
|
from ... import util
|
||||||
|
|
||||||
|
|
||||||
_PKUSEG_INSTALL_MSG = "install it with `pip install pkuseg==0.0.22` or from https://github.com/lancopku/pkuseg-python"
|
_PKUSEG_INSTALL_MSG = "install it with `pip install pkuseg==0.0.25` or from https://github.com/lancopku/pkuseg-python"
|
||||||
|
|
||||||
|
|
||||||
def try_jieba_import(segmenter):
|
def try_jieba_import(segmenter):
|
||||||
|
|
|
@ -32,6 +32,7 @@ from .lang.tag_map import TAG_MAP
|
||||||
from .tokens import Doc
|
from .tokens import Doc
|
||||||
from .lang.lex_attrs import LEX_ATTRS, is_stop
|
from .lang.lex_attrs import LEX_ATTRS, is_stop
|
||||||
from .errors import Errors, Warnings
|
from .errors import Errors, Warnings
|
||||||
|
from .git_info import GIT_VERSION
|
||||||
from . import util
|
from . import util
|
||||||
from . import about
|
from . import about
|
||||||
|
|
||||||
|
@ -44,7 +45,7 @@ class BaseDefaults:
|
||||||
def create_lemmatizer(cls, nlp=None, lookups=None):
|
def create_lemmatizer(cls, nlp=None, lookups=None):
|
||||||
if lookups is None:
|
if lookups is None:
|
||||||
lookups = cls.create_lookups(nlp=nlp)
|
lookups = cls.create_lookups(nlp=nlp)
|
||||||
return Lemmatizer(lookups=lookups)
|
return Lemmatizer(lookups=lookups, is_base_form=cls.is_base_form)
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
def create_lookups(cls, nlp=None):
|
def create_lookups(cls, nlp=None):
|
||||||
|
@ -116,6 +117,7 @@ class BaseDefaults:
|
||||||
tokenizer_exceptions = {}
|
tokenizer_exceptions = {}
|
||||||
stop_words = set()
|
stop_words = set()
|
||||||
morph_rules = {}
|
morph_rules = {}
|
||||||
|
is_base_form = None
|
||||||
lex_attr_getters = LEX_ATTRS
|
lex_attr_getters = LEX_ATTRS
|
||||||
syntax_iterators = {}
|
syntax_iterators = {}
|
||||||
resources = {}
|
resources = {}
|
||||||
|
@ -212,6 +214,7 @@ class Language:
|
||||||
self._meta.setdefault("email", "")
|
self._meta.setdefault("email", "")
|
||||||
self._meta.setdefault("url", "")
|
self._meta.setdefault("url", "")
|
||||||
self._meta.setdefault("license", "")
|
self._meta.setdefault("license", "")
|
||||||
|
self._meta.setdefault("spacy_git_version", GIT_VERSION)
|
||||||
self._meta["vectors"] = {
|
self._meta["vectors"] = {
|
||||||
"width": self.vocab.vectors_length,
|
"width": self.vocab.vectors_length,
|
||||||
"vectors": len(self.vocab.vectors),
|
"vectors": len(self.vocab.vectors),
|
||||||
|
|
|
@ -14,7 +14,7 @@ class Lemmatizer:
|
||||||
def load(cls, *args, **kwargs):
|
def load(cls, *args, **kwargs):
|
||||||
raise NotImplementedError(Errors.E172)
|
raise NotImplementedError(Errors.E172)
|
||||||
|
|
||||||
def __init__(self, lookups):
|
def __init__(self, lookups, is_base_form=None):
|
||||||
"""Initialize a Lemmatizer.
|
"""Initialize a Lemmatizer.
|
||||||
|
|
||||||
lookups (Lookups): The lookups object containing the (optional) tables
|
lookups (Lookups): The lookups object containing the (optional) tables
|
||||||
|
@ -22,6 +22,7 @@ class Lemmatizer:
|
||||||
RETURNS (Lemmatizer): The newly constructed object.
|
RETURNS (Lemmatizer): The newly constructed object.
|
||||||
"""
|
"""
|
||||||
self.lookups = lookups
|
self.lookups = lookups
|
||||||
|
self.is_base_form = is_base_form
|
||||||
|
|
||||||
def __call__(self, string, univ_pos, morphology=None):
|
def __call__(self, string, univ_pos, morphology=None):
|
||||||
"""Lemmatize a string.
|
"""Lemmatize a string.
|
||||||
|
@ -42,7 +43,7 @@ class Lemmatizer:
|
||||||
if univ_pos in ("", "eol", "space"):
|
if univ_pos in ("", "eol", "space"):
|
||||||
return [string.lower()]
|
return [string.lower()]
|
||||||
# See Issue #435 for example of where this logic is required.
|
# See Issue #435 for example of where this logic is required.
|
||||||
if self.is_base_form(univ_pos, morphology):
|
if callable(self.is_base_form) and self.is_base_form(univ_pos, morphology):
|
||||||
return [string.lower()]
|
return [string.lower()]
|
||||||
index_table = self.lookups.get_table("lemma_index", {})
|
index_table = self.lookups.get_table("lemma_index", {})
|
||||||
exc_table = self.lookups.get_table("lemma_exc", {})
|
exc_table = self.lookups.get_table("lemma_exc", {})
|
||||||
|
|
|
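The effect of the `callable(self.is_base_form)` guard in the `Lemmatizer` hunk above can be seen in a reduced sketch. `TinyLemmatizer` and its toy fallback rule are assumptions made for illustration. Only the guard expression is taken from the diff.

```python
class TinyLemmatizer:
    """Hypothetical stand-in showing the optional is_base_form hook."""

    def __init__(self, is_base_form=None):
        # None means the language defines no base-form check
        self.is_base_form = is_base_form

    def __call__(self, string, univ_pos, morphology=None):
        # the guard from the diff: only call the hook if one was supplied
        if callable(self.is_base_form) and self.is_base_form(univ_pos, morphology):
            return [string.lower()]
        # toy fallback rule, standing in for the real table lookups
        return [string.lower().rstrip("s")]


print(TinyLemmatizer()("Cats", "noun"))
print(TinyLemmatizer(lambda pos, morph: True)("Cats", "noun"))
```

With the default `None`, the hook is skipped instead of raising `TypeError`, which is what allows `BaseDefaults.is_base_form = None` to be passed straight through.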
@ -346,7 +346,7 @@ cdef class Lexeme:
|
||||||
@property
|
@property
|
||||||
def is_oov(self):
|
def is_oov(self):
|
||||||
"""RETURNS (bool): Whether the lexeme is out-of-vocabulary."""
|
"""RETURNS (bool): Whether the lexeme is out-of-vocabulary."""
|
||||||
return self.orth in self.vocab.vectors
|
return self.orth not in self.vocab.vectors
|
||||||
|
|
||||||
property is_stop:
|
property is_stop:
|
||||||
"""RETURNS (bool): Whether the lexeme is a stop word."""
|
"""RETURNS (bool): Whether the lexeme is a stop word."""
|
||||||
|
|
|
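With the added `not`, `is_oov` is true exactly when the lexeme has no entry in the vectors table. A toy sketch of that semantics, assuming a plain dict keyed by strings in place of `vocab.vectors`; the free function `is_oov` below is hypothetical and only illustrates the membership test:

```python
# Plain dict standing in for the vectors table (an assumption for illustration).
vectors = {"cat": [1.0, 1.0], "dog": [2.0, 2.0]}


def is_oov(word, vectors):
    # Out-of-vocabulary means: no vector stored for this word.
    return word not in vectors


for word in ("cat", "dog", "hamster"):
    print(word, is_oov(word, vectors))
```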
@ -117,8 +117,7 @@ class Lookups:
|
||||||
"""
|
"""
|
||||||
self._tables = {}
|
self._tables = {}
|
||||||
for key, value in srsly.msgpack_loads(bytes_data).items():
|
for key, value in srsly.msgpack_loads(bytes_data).items():
|
||||||
self._tables[key] = Table(key)
|
self._tables[key] = Table(key, value)
|
||||||
self._tables[key].update(value)
|
|
||||||
return self
|
return self
|
||||||
|
|
||||||
def to_disk(self, path, filename="lookups.bin", **kwargs):
|
def to_disk(self, path, filename="lookups.bin", **kwargs):
|
||||||
|
@ -189,7 +188,7 @@ class Table(OrderedDict):
|
||||||
self.name = name
|
self.name = name
|
||||||
# Assume a default size of 1M items
|
# Assume a default size of 1M items
|
||||||
self.default_size = 1e6
|
self.default_size = 1e6
|
||||||
size = len(data) if data and len(data) > 0 else self.default_size
|
size = max(len(data), 1) if data is not None else self.default_size
|
||||||
self.bloom = BloomFilter.from_error_rate(size)
|
self.bloom = BloomFilter.from_error_rate(size)
|
||||||
if data:
|
if data:
|
||||||
self.update(data)
|
self.update(data)
|
||||||
|
|
|
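The two sizing expressions in this hunk differ only for an empty table. A small sketch comparing them side by side; `old_size`, `new_size`, and `DEFAULT_SIZE` are hypothetical names that copy the expressions from the diff:

```python
DEFAULT_SIZE = 1e6  # mirrors self.default_size in the diff


def old_size(data):
    # Previous expression: empty data falls back to the default size.
    return len(data) if data and len(data) > 0 else DEFAULT_SIZE


def new_size(data):
    # New expression: only a missing (None) table uses the default size.
    return max(len(data), 1) if data is not None else DEFAULT_SIZE


for data in (None, {}, {"a": 1}):
    print(repr(data), old_size(data), new_size(data))
```

For `{}` the old expression requests a filter sized for a million items, while the new one requests a size of 1; for `None` and for non-empty data both agree.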
@ -781,6 +781,20 @@ class ClozeMultitask(Pipe):
|
||||||
if losses is not None:
|
if losses is not None:
|
||||||
losses[self.name] += loss
|
losses[self.name] += loss
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def decode_utf8_predictions(char_array):
|
||||||
|
# The format alternates filling from start and end, and 255 is missing
|
||||||
|
words = []
|
||||||
|
char_array = char_array.reshape((char_array.shape[0], -1, 256))
|
||||||
|
nr_char = char_array.shape[1]
|
||||||
|
char_array = char_array.argmax(axis=-1)
|
||||||
|
for row in char_array:
|
||||||
|
starts = [chr(c) for c in row[::2] if c != 255]
|
||||||
|
ends = [chr(c) for c in row[1::2] if c != 255]
|
||||||
|
word = "".join(starts + list(reversed(ends)))
|
||||||
|
words.append(word)
|
||||||
|
return words
|
||||||
|
|
||||||
|
|
||||||
@component("textcat", assigns=["doc.cats"], default_model=default_textcat)
|
@component("textcat", assigns=["doc.cats"], default_model=default_textcat)
|
||||||
class TextCategorizer(Pipe):
|
class TextCategorizer(Pipe):
|
||||||
|
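The comment in `decode_utf8_predictions` says the format alternates filling from the start and the end of the word, with 255 marking a missing slot. A minimal sketch of just that interleaving step, assuming rows of already-argmaxed character codes held in plain lists rather than the array the method reshapes; `decode_interleaved` is a hypothetical helper name:

```python
def decode_interleaved(rows, missing=255):
    """Rebuild words from rows whose even slots hold characters read from the
    front of the word and whose odd slots hold characters read from the back."""
    words = []
    for row in rows:
        starts = [chr(c) for c in row[::2] if c != missing]
        ends = [chr(c) for c in row[1::2] if c != missing]
        # Characters taken from the back were stored last-first, so reverse them.
        words.append("".join(starts + list(reversed(ends))))
    return words


# "hello" in six slots: h, o, e, l, l, <missing>
row = [ord("h"), ord("o"), ord("e"), ord("l"), ord("l"), 255]
print(decode_interleaved([row]))  # ['hello']
```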
@ -949,6 +963,7 @@ cdef class DependencyParser(Parser):
|
||||||
assigns = ["token.dep", "token.is_sent_start", "doc.sents"]
|
assigns = ["token.dep", "token.is_sent_start", "doc.sents"]
|
||||||
requires = []
|
requires = []
|
||||||
TransitionSystem = ArcEager
|
TransitionSystem = ArcEager
|
||||||
|
nr_feature = 8
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def postprocesses(self):
|
def postprocesses(self):
|
||||||
|
|
|
@ -167,6 +167,11 @@ def nb_tokenizer():
|
||||||
return get_lang_class("nb").Defaults.create_tokenizer()
|
return get_lang_class("nb").Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture(scope="session")
|
||||||
|
def ne_tokenizer():
|
||||||
|
return get_lang_class("ne").Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture(scope="session")
|
@pytest.fixture(scope="session")
|
||||||
def nl_tokenizer():
|
def nl_tokenizer():
|
||||||
return get_lang_class("nl").Defaults.create_tokenizer()
|
return get_lang_class("nl").Defaults.create_tokenizer()
|
||||||
|
|
|
@ -102,10 +102,16 @@ def test_doc_api_getitem(en_tokenizer):
|
||||||
)
|
)
|
||||||
def test_doc_api_serialize(en_tokenizer, text):
|
def test_doc_api_serialize(en_tokenizer, text):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
|
tokens[0].lemma_ = "lemma"
|
||||||
|
tokens[0].norm_ = "norm"
|
||||||
|
tokens[0].ent_kb_id_ = "ent_kb_id"
|
||||||
new_tokens = Doc(tokens.vocab).from_bytes(tokens.to_bytes())
|
new_tokens = Doc(tokens.vocab).from_bytes(tokens.to_bytes())
|
||||||
assert tokens.text == new_tokens.text
|
assert tokens.text == new_tokens.text
|
||||||
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
||||||
assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
|
assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
|
||||||
|
assert new_tokens[0].lemma_ == "lemma"
|
||||||
|
assert new_tokens[0].norm_ == "norm"
|
||||||
|
assert new_tokens[0].ent_kb_id_ == "ent_kb_id"
|
||||||
|
|
||||||
new_tokens = Doc(tokens.vocab).from_bytes(
|
new_tokens = Doc(tokens.vocab).from_bytes(
|
||||||
tokens.to_bytes(exclude=["tensor"]), exclude=["tensor"]
|
tokens.to_bytes(exclude=["tensor"]), exclude=["tensor"]
|
||||||
|
|
|
@ -1,7 +1,7 @@
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
from ...tokenizer.test_naughty_strings import NAUGHTY_STRINGS
|
from ...tokenizer.test_naughty_strings import NAUGHTY_STRINGS
|
||||||
from spacy.lang.ja import Japanese
|
from spacy.lang.ja import Japanese, DetailedToken
|
||||||
|
|
||||||
# fmt: off
|
# fmt: off
|
||||||
TOKENIZER_TESTS = [
|
TOKENIZER_TESTS = [
|
||||||
|
@ -93,6 +93,57 @@ def test_ja_tokenizer_split_modes(ja_tokenizer, text, len_a, len_b, len_c):
|
||||||
assert len(nlp_c(text)) == len_c
|
assert len(nlp_c(text)) == len_c
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("text,sub_tokens_list_a,sub_tokens_list_b,sub_tokens_list_c",
|
||||||
|
[
|
||||||
|
(
|
||||||
|
"選挙管理委員会",
|
||||||
|
[None, None, None, None],
|
||||||
|
[None, None, [
|
||||||
|
[
|
||||||
|
DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None),
|
||||||
|
DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None),
|
||||||
|
]
|
||||||
|
]],
|
||||||
|
[[
|
||||||
|
[
|
||||||
|
DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None),
|
||||||
|
DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None),
|
||||||
|
DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None),
|
||||||
|
DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None),
|
||||||
|
], [
|
||||||
|
DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None),
|
||||||
|
DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None),
|
||||||
|
DetailedToken(surface='委員会', tag='名詞-普通名詞-一般', inf='', lemma='委員会', reading='イインカイ', sub_tokens=None),
|
||||||
|
]
|
||||||
|
]]
|
||||||
|
),
|
||||||
|
]
|
||||||
|
)
|
||||||
|
def test_ja_tokenizer_sub_tokens(ja_tokenizer, text, sub_tokens_list_a, sub_tokens_list_b, sub_tokens_list_c):
|
||||||
|
nlp_a = Japanese(meta={"tokenizer": {"config": {"split_mode": "A"}}})
|
||||||
|
nlp_b = Japanese(meta={"tokenizer": {"config": {"split_mode": "B"}}})
|
||||||
|
nlp_c = Japanese(meta={"tokenizer": {"config": {"split_mode": "C"}}})
|
||||||
|
|
||||||
|
assert ja_tokenizer(text).user_data["sub_tokens"] == sub_tokens_list_a
|
||||||
|
assert nlp_a(text).user_data["sub_tokens"] == sub_tokens_list_a
|
||||||
|
assert nlp_b(text).user_data["sub_tokens"] == sub_tokens_list_b
|
||||||
|
assert nlp_c(text).user_data["sub_tokens"] == sub_tokens_list_c
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("text,inflections,reading_forms",
|
||||||
|
[
|
||||||
|
(
|
||||||
|
"取ってつけた",
|
||||||
|
("五段-ラ行,連用形-促音便", "", "下一段-カ行,連用形-一般", "助動詞-タ,終止形-一般"),
|
||||||
|
("トッ", "テ", "ツケ", "タ"),
|
||||||
|
),
|
||||||
|
]
|
||||||
|
)
|
||||||
|
def test_ja_tokenizer_inflections_reading_forms(ja_tokenizer, text, inflections, reading_forms):
|
||||||
|
assert ja_tokenizer(text).user_data["inflections"] == inflections
|
||||||
|
assert ja_tokenizer(text).user_data["reading_forms"] == reading_forms
|
||||||
|
|
||||||
|
|
||||||
def test_ja_tokenizer_emptyish_texts(ja_tokenizer):
|
def test_ja_tokenizer_emptyish_texts(ja_tokenizer):
|
||||||
doc = ja_tokenizer("")
|
doc = ja_tokenizer("")
|
||||||
assert len(doc) == 0
|
assert len(doc) == 0
|
||||||
|
|
0
spacy/tests/lang/ne/__init__.py
Normal file
19
spacy/tests/lang/ne/test_text.py
Normal file
|
@ -0,0 +1,19 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
def test_ne_tokenizer_handlers_long_text(ne_tokenizer):
|
||||||
|
text = """मैले पाएको सर्टिफिकेटलाई म त बोक्रो सम्झन्छु र अभ्यास तब सुरु भयो, जब मैले कलेज पार गरेँ र जीवनको पढाइ सुरु गरेँ ।"""
|
||||||
|
tokens = ne_tokenizer(text)
|
||||||
|
assert len(tokens) == 24
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize(
|
||||||
|
"text,length",
|
||||||
|
[("समय जान कति पनि बेर लाग्दैन ।", 7), ("म ठूलो हुँदै थिएँ ।", 5)],
|
||||||
|
)
|
||||||
|
def test_ne_tokenizer_handles_cnts(ne_tokenizer, text, length):
|
||||||
|
tokens = ne_tokenizer(text)
|
||||||
|
assert len(tokens) == length
|
|
@ -4,7 +4,9 @@ from spacy import util
|
||||||
from spacy.gold import Example
|
from spacy.gold import Example
|
||||||
from spacy.lang.en import English
|
from spacy.lang.en import English
|
||||||
from spacy.language import Language
|
from spacy.language import Language
|
||||||
from spacy.tests.util import make_tempdir
|
from spacy.symbols import POS, NOUN
|
||||||
|
|
||||||
|
from ..util import make_tempdir
|
||||||
|
|
||||||
|
|
||||||
def test_label_types():
|
def test_label_types():
|
||||||
|
@ -15,6 +17,19 @@ def test_label_types():
|
||||||
nlp.get_pipe("tagger").add_label(9)
|
nlp.get_pipe("tagger").add_label(9)
|
||||||
|
|
||||||
|
|
||||||
|
def test_tagger_begin_training_tag_map():
|
||||||
|
"""Test that Tagger.begin_training() without gold tuples does not clobber
|
||||||
|
the tag map."""
|
||||||
|
nlp = Language()
|
||||||
|
tagger = nlp.create_pipe("tagger")
|
||||||
|
orig_tag_count = len(tagger.labels)
|
||||||
|
tagger.add_label("A", {"POS": "NOUN"})
|
||||||
|
nlp.add_pipe(tagger)
|
||||||
|
nlp.begin_training()
|
||||||
|
assert nlp.vocab.morphology.tag_map["A"] == {POS: NOUN}
|
||||||
|
assert orig_tag_count + 1 == len(nlp.get_pipe("tagger").labels)
|
||||||
|
|
||||||
|
|
||||||
TAG_MAP = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}, "J": {"pos": "ADJ"}}
|
TAG_MAP = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}, "J": {"pos": "ADJ"}}
|
||||||
|
|
||||||
MORPH_RULES = {"V": {"like": {"lemma": "luck"}}}
|
MORPH_RULES = {"V": {"like": {"lemma": "luck"}}}
|
||||||
|
|
|
@ -11,6 +11,7 @@ from spacy.lang.en import English
|
||||||
from spacy.lemmatizer import Lemmatizer
|
from spacy.lemmatizer import Lemmatizer
|
||||||
from spacy.lookups import Lookups
|
from spacy.lookups import Lookups
|
||||||
from spacy.tokens import Doc, Span
|
from spacy.tokens import Doc, Span
|
||||||
|
from spacy.lang.en import EnglishDefaults
|
||||||
|
|
||||||
from ..util import get_doc, make_tempdir
|
from ..util import get_doc, make_tempdir
|
||||||
|
|
||||||
|
@ -164,7 +165,7 @@ def test_issue595():
|
||||||
lookups.add_table("lemma_rules", {"verb": [["ed", "e"]]})
|
lookups.add_table("lemma_rules", {"verb": [["ed", "e"]]})
|
||||||
lookups.add_table("lemma_index", {"verb": {}})
|
lookups.add_table("lemma_index", {"verb": {}})
|
||||||
lookups.add_table("lemma_exc", {"verb": {}})
|
lookups.add_table("lemma_exc", {"verb": {}})
|
||||||
lemmatizer = Lemmatizer(lookups)
|
lemmatizer = Lemmatizer(lookups, is_base_form=EnglishDefaults.is_base_form)
|
||||||
vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map)
|
vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map)
|
||||||
doc = Doc(vocab, words=words)
|
doc = Doc(vocab, words=words)
|
||||||
doc[2].tag_ = "VB"
|
doc[2].tag_ = "VB"
|
||||||
|
|
|
@ -57,7 +57,7 @@ def test_issue2626_2835(en_tokenizer, text):
|
||||||
|
|
||||||
|
|
||||||
def test_issue2656(en_tokenizer):
|
def test_issue2656(en_tokenizer):
|
||||||
"""Test that tokenizer correctly splits of punctuation after numbers with
|
"""Test that tokenizer correctly splits off punctuation after numbers with
|
||||||
decimal points.
|
decimal points.
|
||||||
"""
|
"""
|
||||||
doc = en_tokenizer("I went for 40.3, and got home by 10.0.")
|
doc = en_tokenizer("I went for 40.3, and got home by 10.0.")
|
||||||
|
|
|
@ -2,6 +2,7 @@ import pytest
|
||||||
from spacy.tokens import Doc
|
from spacy.tokens import Doc
|
||||||
from spacy.language import Language
|
from spacy.language import Language
|
||||||
from spacy.lookups import Lookups
|
from spacy.lookups import Lookups
|
||||||
|
from spacy.lemmatizer import Lemmatizer
|
||||||
|
|
||||||
|
|
||||||
def test_lemmatizer_reflects_lookups_changes():
|
def test_lemmatizer_reflects_lookups_changes():
|
||||||
|
@ -46,3 +47,14 @@ def test_tagger_warns_no_lookups():
|
||||||
with pytest.warns(None) as record:
|
with pytest.warns(None) as record:
|
||||||
nlp.begin_training()
|
nlp.begin_training()
|
||||||
assert not record.list
|
assert not record.list
|
||||||
|
|
||||||
|
|
||||||
|
def test_lemmatizer_without_is_base_form_implementation():
|
||||||
|
# Norwegian example from #5658
|
||||||
|
lookups = Lookups()
|
||||||
|
lookups.add_table("lemma_rules", {"noun": []})
|
||||||
|
lookups.add_table("lemma_index", {"noun": {}})
|
||||||
|
lookups.add_table("lemma_exc", {"noun": {"formuesskatten": ["formuesskatt"]}})
|
||||||
|
|
||||||
|
lemmatizer = Lemmatizer(lookups, is_base_form=None)
|
||||||
|
assert lemmatizer("Formuesskatten", "noun", {'Definite': 'def', 'Gender': 'masc', 'Number': 'sing'}) == ["formuesskatt"]
|
||||||
|
|
|
@ -370,6 +370,6 @@ def test_vector_is_oov():
|
||||||
data[1] = 2.0
|
data[1] = 2.0
|
||||||
vocab.set_vector("cat", data[0])
|
vocab.set_vector("cat", data[0])
|
||||||
vocab.set_vector("dog", data[1])
|
vocab.set_vector("dog", data[1])
|
||||||
assert vocab["cat"].is_oov is True
|
assert vocab["cat"].is_oov is False
|
||||||
assert vocab["dog"].is_oov is True
|
assert vocab["dog"].is_oov is False
|
||||||
assert vocab["hamster"].is_oov is False
|
assert vocab["hamster"].is_oov is True
|
||||||
|
|
|
@ -1062,7 +1062,7 @@ cdef class Doc:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/doc#to_bytes
|
DOCS: https://spacy.io/api/doc#to_bytes
|
||||||
"""
|
"""
|
||||||
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID, NORM] # TODO: ENT_KB_ID ?
|
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID, NORM, ENT_KB_ID]
|
||||||
if self.is_tagged:
|
if self.is_tagged:
|
||||||
array_head.extend([TAG, POS])
|
array_head.extend([TAG, POS])
|
||||||
# If doc parsed add head and dep attribute
|
# If doc parsed add head and dep attribute
|
||||||
|
@ -1071,6 +1071,14 @@ cdef class Doc:
|
||||||
# Otherwise add sent_start
|
# Otherwise add sent_start
|
||||||
else:
|
else:
|
||||||
array_head.append(SENT_START)
|
array_head.append(SENT_START)
|
||||||
|
strings = set()
|
||||||
|
for token in self:
|
||||||
|
strings.add(token.tag_)
|
||||||
|
strings.add(token.lemma_)
|
||||||
|
strings.add(token.dep_)
|
||||||
|
strings.add(token.ent_type_)
|
||||||
|
strings.add(token.ent_kb_id_)
|
||||||
|
strings.add(token.norm_)
|
||||||
# Msgpack doesn't distinguish between lists and tuples, which is
|
# Msgpack doesn't distinguish between lists and tuples, which is
|
||||||
# vexing for user data. As a best guess, we *know* that within
|
# vexing for user data. As a best guess, we *know* that within
|
||||||
# keys, we must have tuples. In values we just have to hope
|
# keys, we must have tuples. In values we just have to hope
|
||||||
|
@ -1082,6 +1090,7 @@ cdef class Doc:
|
||||||
"sentiment": lambda: self.sentiment,
|
"sentiment": lambda: self.sentiment,
|
||||||
"tensor": lambda: self.tensor,
|
"tensor": lambda: self.tensor,
|
||||||
"cats": lambda: self.cats,
|
"cats": lambda: self.cats,
|
||||||
|
"strings": lambda: list(strings),
|
||||||
"has_unknown_spaces": lambda: self.has_unknown_spaces
|
"has_unknown_spaces": lambda: self.has_unknown_spaces
|
||||||
}
|
}
|
||||||
if "user_data" not in exclude and self.user_data:
|
if "user_data" not in exclude and self.user_data:
|
||||||
|
@ -1110,6 +1119,7 @@ cdef class Doc:
|
||||||
"sentiment": lambda b: None,
|
"sentiment": lambda b: None,
|
||||||
"tensor": lambda b: None,
|
"tensor": lambda b: None,
|
||||||
"cats": lambda b: None,
|
"cats": lambda b: None,
|
||||||
|
"strings": lambda b: None,
|
||||||
"user_data_keys": lambda b: None,
|
"user_data_keys": lambda b: None,
|
||||||
"user_data_values": lambda b: None,
|
"user_data_values": lambda b: None,
|
||||||
"has_unknown_spaces": lambda b: None
|
"has_unknown_spaces": lambda b: None
|
||||||
|
@ -1130,6 +1140,9 @@ cdef class Doc:
|
||||||
self.tensor = msg["tensor"]
|
self.tensor = msg["tensor"]
|
||||||
if "cats" not in exclude and "cats" in msg:
|
if "cats" not in exclude and "cats" in msg:
|
||||||
self.cats = msg["cats"]
|
self.cats = msg["cats"]
|
||||||
|
if "strings" not in exclude and "strings" in msg:
|
||||||
|
for s in msg["strings"]:
|
||||||
|
self.vocab.strings.add(s)
|
||||||
if "has_unknown_spaces" not in exclude and "has_unknown_spaces" in msg:
|
if "has_unknown_spaces" not in exclude and "has_unknown_spaces" in msg:
|
||||||
self.has_unknown_spaces = msg["has_unknown_spaces"]
|
self.has_unknown_spaces = msg["has_unknown_spaces"]
|
||||||
start = 0
|
start = 0
|
||||||
|
|
|
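The `strings` entry added to `to_bytes` and `from_bytes` ships the string values alongside the attribute array, so the receiving vocab can resolve the stored IDs. A toy sketch of why that matters, assuming a hash-keyed table; `ToyStringStore`, `to_msg`, and `from_msg` are hypothetical names and do not reproduce spaCy's `StringStore` or its hashing:

```python
import zlib


class ToyStringStore:
    """Hypothetical hash-to-string table used only for this illustration."""

    def __init__(self):
        self._strings = {}

    def add(self, string):
        key = zlib.crc32(string.encode("utf8"))
        self._strings[key] = string
        return key

    def __getitem__(self, key):
        return self._strings[key]


def to_msg(store, lemmas):
    # Store IDs for the attributes, plus the strings needed to resolve them.
    return {"lemma_ids": [store.add(s) for s in lemmas],
            "strings": sorted(set(lemmas))}


def from_msg(store, msg):
    # Register the shipped strings first, then resolve the IDs.
    for s in msg.get("strings", []):
        store.add(s)
    return [store[key] for key in msg["lemma_ids"]]


msg = to_msg(ToyStringStore(), ["lemma", "norm"])
restored = from_msg(ToyStringStore(), msg)
print(restored)
```

If the `strings` entry is dropped from the message, the lookup into a fresh store raises `KeyError`, which is the failure mode the toy is meant to show.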
@ -923,7 +923,7 @@ cdef class Token:
|
||||||
@property
|
@property
|
||||||
def is_oov(self):
|
def is_oov(self):
|
||||||
"""RETURNS (bool): Whether the token is out-of-vocabulary."""
|
"""RETURNS (bool): Whether the token is out-of-vocabulary."""
|
||||||
return self.c.lex.orth in self.vocab.vectors
|
return self.c.lex.orth not in self.vocab.vectors
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def is_stop(self):
|
def is_stop(self):
|
||||||
|
|
|
@ -187,6 +187,10 @@ def load_model_from_path(model_path, meta=False, **overrides):
|
||||||
pipeline = nlp.Defaults.pipe_names
|
pipeline = nlp.Defaults.pipe_names
|
||||||
elif pipeline in (False, None):
|
elif pipeline in (False, None):
|
||||||
pipeline = []
|
pipeline = []
|
||||||
|
# skip "vocab" from overrides in component initialization since vocab is
|
||||||
|
# already configured from overrides when nlp is initialized above
|
||||||
|
if "vocab" in overrides:
|
||||||
|
del overrides["vocab"]
|
||||||
for name in pipeline:
|
for name in pipeline:
|
||||||
if name not in disable:
|
if name not in disable:
|
||||||
config = meta.get("pipeline_args", {}).get(name, {})
|
config = meta.get("pipeline_args", {}).get(name, {})
|
||||||
|
|
|
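The added lines drop `"vocab"` from `overrides` before the per-component loop, since the vocab was already applied when `nlp` was created. A minimal sketch of the same idea, assuming a hypothetical helper `component_overrides`; unlike the `del` in the diff, it copies the dict rather than mutating the caller's:

```python
def component_overrides(overrides):
    """Return the overrides to forward to pipeline components, without "vocab"."""
    cfg = dict(overrides)   # copy so the caller's dict is left untouched
    cfg.pop("vocab", None)  # no error if "vocab" was never passed
    return cfg


overrides = {"vocab": object(), "disable": ["ner"]}
print(sorted(component_overrides(overrides)))
```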
@ -105,8 +105,8 @@ The Chinese language class supports three word segmentation options:
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
1. **Character segmentation:** Character segmentation is the default
|
1. **Character segmentation:** Character segmentation is the default
|
||||||
segmentation option. It's enabled when you create a new `Chinese`
|
segmentation option. It's enabled when you create a new `Chinese` language
|
||||||
language class or call `spacy.blank("zh")`.
|
class or call `spacy.blank("zh")`.
|
||||||
2. **Jieba:** `Chinese` uses [Jieba](https://github.com/fxsjy/jieba) for word
|
2. **Jieba:** `Chinese` uses [Jieba](https://github.com/fxsjy/jieba) for word
|
||||||
segmentation with the tokenizer option `{"segmenter": "jieba"}`.
|
segmentation with the tokenizer option `{"segmenter": "jieba"}`.
|
||||||
3. **PKUSeg**: As of spaCy v2.3.0, support for
|
3. **PKUSeg**: As of spaCy v2.3.0, support for
|
||||||
|
|
|
@ -1,5 +1,58 @@
|
||||||
{
|
{
|
||||||
"resources": [
|
"resources": [
|
||||||
|
{
|
||||||
|
"id": "spacy-streamlit",
|
||||||
|
"title": "spacy-streamlit",
|
||||||
|
"slogan": "spaCy building blocks for Streamlit apps",
|
||||||
|
"github": "explosion/spacy-streamlit",
|
||||||
|
"description": "This package contains utilities for visualizing spaCy models and building interactive spaCy-powered apps with [Streamlit](https://streamlit.io). It includes various building blocks you can use in your own Streamlit app, like visualizers for **syntactic dependencies**, **named entities**, **text classification**, **semantic similarity** via word vectors, token attributes, and more.",
|
||||||
|
"pip": "spacy-streamlit",
|
||||||
|
"category": ["visualizers"],
|
||||||
|
"thumb": "https://i.imgur.com/mhEjluE.jpg",
|
||||||
|
"image": "https://user-images.githubusercontent.com/13643239/85388081-f2da8700-b545-11ea-9bd4-e303d3c5763c.png",
|
||||||
|
"code_example": [
|
||||||
|
"import spacy_streamlit",
|
||||||
|
"",
|
||||||
|
"models = [\"en_core_web_sm\", \"en_core_web_md\"]",
|
||||||
|
"default_text = \"Sundar Pichai is the CEO of Google.\"",
|
||||||
|
"spacy_streamlit.visualize(models, default_text)"
|
||||||
|
],
|
||||||
|
"author": "Ines Montani",
|
||||||
|
"author_links": {
|
||||||
|
"twitter": "_inesmontani",
|
||||||
|
"github": "ines",
|
||||||
|
"website": "https://ines.io"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "spaczz",
|
||||||
|
"title": "spaczz",
|
||||||
|
"slogan": "Fuzzy matching and more for spaCy.",
|
||||||
|
"description": "Spaczz provides fuzzy matching and multi-token regex matching functionality for spaCy. Spaczz's components have similar APIs to their spaCy counterparts and spaczz pipeline components can integrate into spaCy pipelines where they can be saved/loaded as models.",
|
||||||
|
"github": "gandersen101/spaczz",
|
||||||
|
"pip": "spaczz",
|
||||||
|
"code_example": [
|
||||||
|
"import spacy",
|
||||||
|
"from spaczz.pipeline import SpaczzRuler",
|
||||||
|
"",
|
||||||
|
"nlp = spacy.blank('en')",
|
||||||
|
"ruler = SpaczzRuler(nlp)",
|
||||||
|
"ruler.add_patterns([{'label': 'PERSON', 'pattern': 'Bill Gates', 'type': 'fuzzy'}])",
|
||||||
|
"nlp.add_pipe(ruler)",
|
||||||
|
"",
|
||||||
|
"doc = nlp('Oops, I spelled Bill Gatez wrong.')",
|
||||||
|
"print([(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents])"
|
||||||
|
],
|
||||||
|
"code_language": "python",
|
||||||
|
"url": "https://spaczz.readthedocs.io/en/latest/",
|
||||||
|
"author": "Grant Andersen",
|
||||||
|
"author_links": {
|
||||||
|
"twitter": "gandersen101",
|
||||||
|
"github": "gandersen101"
|
||||||
|
},
|
||||||
|
"category": ["pipeline"],
|
||||||
|
"tags": ["fuzzy-matching", "regex"]
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"id": "spacy-universal-sentence-encoder",
|
"id": "spacy-universal-sentence-encoder",
|
||||||
"title": "SpaCy - Universal Sentence Encoder",
|
"title": "SpaCy - Universal Sentence Encoder",
|
||||||
|
@ -1238,6 +1291,19 @@
|
||||||
"youtube": "K1elwpgDdls",
|
"youtube": "K1elwpgDdls",
|
||||||
"category": ["videos"]
|
"category": ["videos"]
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"type": "education",
|
||||||
|
"id": "video-spacy-course-es",
|
||||||
|
"title": "NLP avanzado con spaCy · Un curso en línea gratis",
|
||||||
|
"description": "spaCy es un paquete moderno de Python para hacer Procesamiento de Lenguaje Natural de potencia industrial. En este curso en línea, interactivo y gratuito, aprenderás a usar spaCy para construir sistemas avanzados de comprensión de lenguaje natural usando enfoques basados en reglas y en machine learning.",
|
||||||
|
"url": "https://course.spacy.io/es",
|
||||||
|
"author": "Camila Gutiérrez",
|
||||||
|
"author_links": {
|
||||||
|
"twitter": "Mariacamilagl30"
|
||||||
|
},
|
||||||
|
"youtube": "RNiLVCE5d4k",
|
||||||
|
"category": ["videos"]
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"type": "education",
|
"type": "education",
|
||||||
"id": "video-intro-to-nlp-episode-1",
|
"id": "video-intro-to-nlp-episode-1",
|
||||||
|
@ -1294,6 +1360,20 @@
|
||||||
"youtube": "IqOJU1-_Fi0",
|
"youtube": "IqOJU1-_Fi0",
|
||||||
"category": ["videos"]
|
"category": ["videos"]
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"type": "education",
|
||||||
|
"id": "video-intro-to-nlp-episode-5",
|
||||||
|
"title": "Intro to NLP with spaCy (5)",
|
||||||
|
"slogan": "Episode 5: Rules vs. Machine Learning",
|
||||||
|
"description": "In this new video series, data science instructor Vincent Warmerdam gets started with spaCy, an open-source library for Natural Language Processing in Python. His mission: building a system to automatically detect programming languages in large volumes of text. Follow his process from the first idea to a prototype all the way to data collection and training a statistical named entity recognition model from scratch.",
|
||||||
|
"author": "Vincent Warmerdam",
|
||||||
|
"author_links": {
|
||||||
|
"twitter": "fishnets88",
|
||||||
|
"github": "koaning"
|
||||||
|
},
|
||||||
|
"youtube": "f4sqeLRzkPg",
|
||||||
|
"category": ["videos"]
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"type": "education",
|
"type": "education",
|
||||||
"id": "video-spacy-irl-entity-linking",
|
"id": "video-spacy-irl-entity-linking",
|
||||||
|
@ -2348,6 +2428,56 @@
|
||||||
},
|
},
|
||||||
"category": ["pipeline", "conversational", "research"],
|
"category": ["pipeline", "conversational", "research"],
|
||||||
"tags": ["spell check", "correction", "preprocessing", "translation", "correction"]
|
"tags": ["spell check", "correction", "preprocessing", "translation", "correction"]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "texthero",
|
||||||
|
"title": "Texthero",
|
||||||
|
"slogan": "Text preprocessing, representation and visualization from zero to hero.",
|
||||||
|
"description": "Texthero is a python package to work with text data efficiently. It empowers NLP developers with a tool to quickly understand any text-based dataset and it provides a solid pipeline to clean and represent text data, from zero to hero.",
|
||||||
|
"github": "jbesomi/texthero",
|
||||||
|
"pip": "texthero",
|
||||||
|
"code_example": [
|
||||||
|
"import texthero as hero",
|
||||||
|
"import pandas as pd",
|
||||||
|
"",
|
||||||
|
"df = pd.read_csv('https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv')",
|
||||||
|
"df['named_entities'] = hero.named_entities(df['text'])",
|
||||||
|
"df.head()"
|
||||||
|
],
|
||||||
|
"code_language": "python",
|
||||||
|
"url": "https://texthero.org",
|
||||||
|
"thumb": "https://texthero.org/img/T.png",
|
||||||
|
"image": "https://texthero.org/docs/assets/texthero.png",
|
||||||
|
"author": "Jonathan Besomi",
|
||||||
|
"author_links": {
|
||||||
|
"github": "jbesomi",
|
||||||
|
"website": "https://besomi.ai"
|
||||||
|
},
|
||||||
|
"category": ["standalone"]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "cov-bsv",
|
||||||
|
"title": "VA COVID-19 NLP BSV",
|
||||||
|
"slogan": "spaCy pipeline for COVID-19 surveillance.",
|
||||||
|
"github": "abchapman93/VA_COVID-19_NLP_BSV",
|
||||||
|
"description": "A spaCy rule-based pipeline for identifying positive cases of COVID-19 from clinical text. A version of this system was deployed as part of the US Department of Veterans Affairs biosurveillance response to COVID-19.",
|
||||||
|
"pip": "cov-bsv",
|
||||||
|
"code_example": [
|
||||||
|
"import cov_bsv",
|
||||||
|
"",
|
||||||
|
"nlp = cov_bsv.load()",
|
||||||
|
"text = 'Pt tested for COVID-19. His wife was recently diagnosed with novel coronavirus. SARS-COV-2: Detected'",
|
||||||
|
"doc = nlp(text)",
|
||||||
|
"",
|
||||||
|
"print(doc.ents)",
|
||||||
|
"print(doc._.cov_classification)",
|
||||||
|
"cov_bsv.visualize_doc(doc)"
|
||||||
|
],
|
||||||
|
"category": ["pipeline", "standalone", "biomedical", "scientific"],
|
||||||
|
"tags": ["clinical", "epidemiology", "covid-19", "surveillance"],
|
||||||
|
"author": "Alec Chapman",
|
||||||
|
"author_links": {
|
||||||
|
"github": "abchapman93"
|
||||||
|
}
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
|
|
||||||
|
|
56916
website/package-lock.json
generated
File diff suppressed because it is too large