Mirror of https://github.com/explosion/spaCy.git (synced 2025-07-12 01:02:23 +03:00)

commit 0df3aaa795
parent f42c9026f5

This reverts commit f42c9026f5.

.github/contributors/mahnerak.md (vendored, deleted file — 106 lines removed)

@@ -1,106 +0,0 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
      assignment is or becomes invalid, ineffective or unenforceable, you hereby
      grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
      royalty-free, unrestricted license to exercise all rights under those
      copyrights. This includes, at our option, the right to sublicense these same
      rights to third parties through multiple levels of sublicensees or other
      licensing arrangements;

    * you agree that each of us can do all things in relation to your
      contribution as if each of us were the sole owners, and if one of us makes
      a derivative work of your contribution, the one who makes the derivative
      work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
      against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
      exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
      consent of, pay or render an accounting to the other for any use or
      distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
      your contribution in whole or in part, alone or in combination with or
      included in any product, work or materials arising out of the project to
      which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
      multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
      authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
      third party's copyrights, trademarks, patents, or other intellectual
      property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
      other applicable export and import laws. You agree to notify us if you
      become aware of any circumstance which would make any of the foregoing
      representations inaccurate in any respect. We may publicly disclose your
      participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an "x" on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
      or entity, including my employer, has or will have rights with respect to my
      contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
      actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                 |
|------------------------------- | --------------------- |
| Name                           | Karen Hambardzumyan   |
| Company name (if applicable)   | YerevaNN              |
| Title or role (if applicable)  | Researcher            |
| Date                           | 2020-06-19            |
| GitHub username                | mahnerak              |
| Website (optional)             | https://mahnerak.com/ |

.github/contributors/myavrum.md (vendored, deleted file — 106 lines removed)

@@ -1,106 +0,0 @@

The deleted file contains the same spaCy contributor agreement text as mahnerak.md above; only the Contributor Details table differs:

| Field                          | Entry                            |
|------------------------------- | -------------------------------- |
| Name                           | Marat M. Yavrumyan               |
| Company name (if applicable)   | YSU, UD_Armenian Project         |
| Title or role (if applicable)  | Dr., Principal Investigator      |
| Date                           | 2020-06-19                       |
| GitHub username                | myavrum                          |
| Website (optional)             | http://armtreebank.yerevann.com/ |

.github/contributors/rameshhpathak.md (vendored, deleted file — 106 lines removed)

@@ -1,106 +0,0 @@

The deleted file contains the same spaCy contributor agreement text as mahnerak.md above; only the Contributor Details table differs:

| Field                          | Entry                   |
|------------------------------- | ----------------------- |
| Name                           | Ramesh Pathak           |
| Company name (if applicable)   | Diyo AI                 |
| Title or role (if applicable)  | AI Engineer             |
| Date                           | June 21, 2020           |
| GitHub username                | rameshhpathak           |
| Website (optional)             | rameshhpathak.github.io |

.github/contributors/richardliaw.md (vendored, deleted file — 106 lines removed)

@@ -1,106 +0,0 @@

The deleted file contains the same spaCy contributor agreement text as mahnerak.md above; only the Contributor Details table differs:

| Field                          | Entry        |
|------------------------------- | ------------ |
| Name                           | Richard Liaw |
| Company name (if applicable)   |              |
| Title or role (if applicable)  |              |
| Date                           | 06/22/2020   |
| GitHub username                | richardliaw  |
| Website (optional)             |              |

@@ -11,6 +11,6 @@ Example sentences to test spaCy and its language models.
 sentences = [
     "Լոնդոնը Միացյալ Թագավորության մեծ քաղաք է։",
     "Ո՞վ է Ֆրանսիայի նախագահը։",
-    "Ո՞րն է Միացյալ Նահանգների մայրաքաղաքը։",
+    "Որն է Միացյալ Նահանգների մայրաքաղաքը։",
     "Ե՞րբ է ծնվել Բարաք Օբաման։",
 ]

@@ -5,8 +5,8 @@ from ...attrs import LIKE_NUM


 _num_words = [
-    "զրո",
-    "մեկ",
+    "զրօ",
+    "մէկ",
     "երկու",
     "երեք",
     "չորս",

@@ -18,21 +18,20 @@ _num_words = [
     "տասը",
     "տասնմեկ",
     "տասներկու",
     "տասներեք",
     "տասնչորս",
     "տասնհինգ",
     "տասնվեց",
     "տասնյոթ",
     "տասնութ",
     "տասնինը",
-    "քսան",
-    "երեսուն",
+    "քսան" "երեսուն",
     "քառասուն",
     "հիսուն",
-    "վաթսուն",
+    "վաթցսուն",
     "յոթանասուն",
     "ութսուն",
-    "իննսուն",
+    "ինիսուն",
     "հարյուր",
     "հազար",
     "միլիոն",
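
One detail in the restored list: Python joins adjacent string literals, so `"քսան" "երեսուն",` on the right-hand side is a single list element, not two. A quick illustration in plain Python:

# Adjacent string literals are fused at compile time, so the two spellings
# below produce lists of different lengths.
split_entries = [
    "քսան",
    "երեսուն",
]
joined_entry = [
    "քսան" "երեսուն",
]

assert len(split_entries) == 2
assert len(joined_entry) == 1
assert joined_entry[0] == "քսաներեսուն"  # the two words fuse into one string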

@@ -20,7 +20,12 @@ from ... import util

 # Hold the attributes we need with convenient names
-DetailedToken = namedtuple("DetailedToken", ["surface", "tag", "inf", "lemma", "reading", "sub_tokens"])
+DetailedToken = namedtuple("DetailedToken", ["surface", "pos", "lemma"])
+
+# Handling for multiple spaces in a row is somewhat awkward, this simplifies
+# the flow by creating a dummy with the same interface.
+DummyNode = namedtuple("DummyNode", ["surface", "pos", "lemma"])
+DummySpace = DummyNode(" ", " ", " ")


 def try_sudachi_import(split_mode="A"):
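
As a side note, the restored three-field `DetailedToken` behaves like any `collections.namedtuple`; a small sketch of building and reading one (the sample values are illustrative):

from collections import namedtuple

# Same shape as the restored definition above.
DetailedToken = namedtuple("DetailedToken", ["surface", "pos", "lemma"])
DummyNode = namedtuple("DummyNode", ["surface", "pos", "lemma"])
DummySpace = DummyNode(" ", " ", " ")

token = DetailedToken("食べ", ("動詞-一般", ""), "食べる")
print(token.surface)   # 食べ
print(token.pos[0])    # 動詞-一般 — the Unidic tag half of the (tag, inf) pair
print(token.lemma)     # 食べる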

@@ -48,7 +53,7 @@ def try_sudachi_import(split_mode="A"):
     )


-def resolve_pos(orth, tag, next_tag):
+def resolve_pos(orth, pos, next_pos):
     """If necessary, add a field to the POS tag for UD mapping.
     Under Universal Dependencies, sometimes the same Unidic POS tag can
     be mapped differently depending on the literal token or its context

@@ -59,77 +64,124 @@ def resolve_pos(orth, tag, next_tag):
     # Some tokens have their UD tag decided based on the POS of the following
     # token.

-    # apply orth based mapping
-    if tag in TAG_ORTH_MAP:
-        orth_map = TAG_ORTH_MAP[tag]
+    # orth based rules
+    if pos[0] in TAG_ORTH_MAP:
+        orth_map = TAG_ORTH_MAP[pos[0]]
         if orth in orth_map:
-            return orth_map[orth], None  # current_pos, next_pos
+            return orth_map[orth], None

-    # apply tag bi-gram mapping
-    if next_tag:
-        tag_bigram = tag, next_tag
+    # tag bi-gram mapping
+    if next_pos:
+        tag_bigram = pos[0], next_pos[0]
         if tag_bigram in TAG_BIGRAM_MAP:
-            current_pos, next_pos = TAG_BIGRAM_MAP[tag_bigram]
-            if current_pos is None:  # apply tag uni-gram mapping for current_pos
-                return TAG_MAP[tag][POS], next_pos  # only next_pos is identified by tag bi-gram mapping
+            bipos = TAG_BIGRAM_MAP[tag_bigram]
+            if bipos[0] is None:
+                return TAG_MAP[pos[0]][POS], bipos[1]
             else:
-                return current_pos, next_pos
+                return bipos

-    # apply tag uni-gram mapping
-    return TAG_MAP[tag][POS], None
+    return TAG_MAP[pos[0]][POS], None


-def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"):
-    # Compare the content of tokens and text, first
+# Use a mapping of paired punctuation to avoid splitting quoted sentences.
+pairpunct = {'「':'」', '『': '』', '【': '】'}
+
+
+def separate_sentences(doc):
+    """Given a doc, mark tokens that start sentences based on Unidic tags.
+    """
+
+    stack = []  # save paired punctuation
+
+    for i, token in enumerate(doc[:-2]):
+        # Set all tokens after the first to false by default. This is necessary
+        # for the doc code to be aware we've done sentencization, see
+        # `is_sentenced`.
+        token.sent_start = (i == 0)
+        if token.tag_:
+            if token.tag_ == "補助記号-括弧開":
+                ts = str(token)
+                if ts in pairpunct:
+                    stack.append(pairpunct[ts])
+                elif stack and ts == stack[-1]:
+                    stack.pop()
+
+            if token.tag_ == "補助記号-句点":
+                next_token = doc[i+1]
+                if next_token.tag_ != token.tag_ and not stack:
+                    next_token.sent_start = True
+
+
+def get_dtokens(tokenizer, text):
+    tokens = tokenizer.tokenize(text)
+    words = []
+    for ti, token in enumerate(tokens):
+        tag = '-'.join([xx for xx in token.part_of_speech()[:4] if xx != '*'])
+        inf = '-'.join([xx for xx in token.part_of_speech()[4:] if xx != '*'])
+        dtoken = DetailedToken(
+            token.surface(),
+            (tag, inf),
+            token.dictionary_form())
+        if ti > 0 and words[-1].pos[0] == '空白' and tag == '空白':
+            # don't add multiple space tokens in a row
+            continue
+        words.append(dtoken)
+
+    # remove empty tokens. These can be produced with characters like … that
+    # Sudachi normalizes internally.
+    words = [ww for ww in words if len(ww.surface) > 0]
+    return words
+
+
+def get_words_lemmas_tags_spaces(dtokens, text, gap_tag=("空白", "")):
     words = [x.surface for x in dtokens]
     if "".join("".join(words).split()) != "".join(text.split()):
         raise ValueError(Errors.E194.format(text=text, words=words))
-    text_dtokens = []
+    text_words = []
+    text_lemmas = []
+    text_tags = []
     text_spaces = []
     text_pos = 0
     # handle empty and whitespace-only texts
     if len(words) == 0:
-        return text_dtokens, text_spaces
+        return text_words, text_lemmas, text_tags, text_spaces
     elif len([word for word in words if not word.isspace()]) == 0:
         assert text.isspace()
-        text_dtokens = [DetailedToken(text, gap_tag, '', text, None, None)]
+        text_words = [text]
+        text_lemmas = [text]
+        text_tags = [gap_tag]
         text_spaces = [False]
-        return text_dtokens, text_spaces
-    # align words and dtokens by referring text, and insert gap tokens for the space char spans
-    for word, dtoken in zip(words, dtokens):
-        # skip all space tokens
-        if word.isspace():
-            continue
+        return text_words, text_lemmas, text_tags, text_spaces
+    # normalize words to remove all whitespace tokens
+    norm_words, norm_dtokens = zip(*[(word, dtokens) for word, dtokens in zip(words, dtokens) if not word.isspace()])
+    # align words with text
+    for word, dtoken in zip(norm_words, norm_dtokens):
         try:
             word_start = text[text_pos:].index(word)
         except ValueError:
             raise ValueError(Errors.E194.format(text=text, words=words))
-        # space token
         if word_start > 0:
             w = text[text_pos:text_pos + word_start]
-            text_dtokens.append(DetailedToken(w, gap_tag, '', w, None, None))
+            text_words.append(w)
+            text_lemmas.append(w)
+            text_tags.append(gap_tag)
             text_spaces.append(False)
             text_pos += word_start
-        # content word
-        text_dtokens.append(dtoken)
+        text_words.append(word)
+        text_lemmas.append(dtoken.lemma)
+        text_tags.append(dtoken.pos)
         text_spaces.append(False)
         text_pos += len(word)
-        # poll a space char after the word
         if text_pos < len(text) and text[text_pos] == " ":
             text_spaces[-1] = True
             text_pos += 1
-    # trailing space token
     if text_pos < len(text):
         w = text[text_pos:]
-        text_dtokens.append(DetailedToken(w, gap_tag, '', w, None, None))
+        text_words.append(w)
+        text_lemmas.append(w)
+        text_tags.append(gap_tag)
         text_spaces.append(False)
-    return text_dtokens, text_spaces
+    return text_words, text_lemmas, text_tags, text_spaces


 class JapaneseTokenizer(DummyTokenizer):
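
The quote handling added back in `separate_sentences` is a small stack: an opening bracket pushes its expected closer, a matching closer pops it, and sentence starts are suppressed while the stack is non-empty. A standalone sketch of that idea (hypothetical helper, not part of the diff):

pairpunct = {'「': '」', '『': '』', '【': '】'}

def inside_quotes(chars):
    """Yield (char, in_quotes) pairs, tracking nesting with a stack of closers."""
    stack = []
    for ch in chars:
        if ch in pairpunct:              # opening bracket: remember its closer
            stack.append(pairpunct[ch])
        elif stack and ch == stack[-1]:  # matching closer: unwind one level
            stack.pop()
        yield ch, bool(stack)

text = "彼は「今日は。明日も。」と言った。"
breaks = [ch for ch, quoted in inside_quotes(text) if ch == "。" and not quoted]
assert len(breaks) == 1  # only the final 。 outside the quotes ends a sentence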

@@ -139,78 +191,29 @@ class JapaneseTokenizer(DummyTokenizer):
         self.tokenizer = try_sudachi_import(self.split_mode)

     def __call__(self, text):
-        # convert sudachipy.morpheme.Morpheme to DetailedToken and merge continuous spaces
-        sudachipy_tokens = self.tokenizer.tokenize(text)
-        dtokens = self._get_dtokens(sudachipy_tokens)
-        dtokens, spaces = get_dtokens_and_spaces(dtokens, text)
+        dtokens = get_dtokens(self.tokenizer, text)

-        # create Doc with tag bi-gram based part-of-speech identification rules
-        words, tags, inflections, lemmas, readings, sub_tokens_list = zip(*dtokens) if dtokens else [[]] * 6
-        sub_tokens_list = list(sub_tokens_list)
+        words, lemmas, unidic_tags, spaces = get_words_lemmas_tags_spaces(dtokens, text)
         doc = Doc(self.vocab, words=words, spaces=spaces)
-        next_pos = None  # for bi-gram rules
-        for idx, (token, dtoken) in enumerate(zip(doc, dtokens)):
-            token.tag_ = dtoken.tag
-            if next_pos:  # already identified in previous iteration
+        next_pos = None
+        for idx, (token, lemma, unidic_tag) in enumerate(zip(doc, lemmas, unidic_tags)):
+            token.tag_ = unidic_tag[0]
+            if next_pos:
                 token.pos = next_pos
                 next_pos = None
             else:
                 token.pos, next_pos = resolve_pos(
                     token.orth_,
-                    dtoken.tag,
-                    tags[idx + 1] if idx + 1 < len(tags) else None
+                    unidic_tag,
+                    unidic_tags[idx + 1] if idx + 1 < len(unidic_tags) else None
                 )
             # if there's no lemma info (it's an unk) just use the surface
-            token.lemma_ = dtoken.lemma if dtoken.lemma else dtoken.surface
-
-        doc.user_data["inflections"] = inflections
-        doc.user_data["reading_forms"] = readings
-        doc.user_data["sub_tokens"] = sub_tokens_list
+            token.lemma_ = lemma
+        doc.user_data["unidic_tags"] = unidic_tags

         return doc

-    def _get_dtokens(self, sudachipy_tokens, need_sub_tokens=True):
-        sub_tokens_list = self._get_sub_tokens(sudachipy_tokens) if need_sub_tokens else None
-        dtokens = [
-            DetailedToken(
-                token.surface(),  # orth
-                '-'.join([xx for xx in token.part_of_speech()[:4] if xx != '*']),  # tag
-                ','.join([xx for xx in token.part_of_speech()[4:] if xx != '*']),  # inf
-                token.dictionary_form(),  # lemma
-                token.reading_form(),  # user_data['reading_forms']
-                sub_tokens_list[idx] if sub_tokens_list else None,  # user_data['sub_tokens']
-            ) for idx, token in enumerate(sudachipy_tokens) if len(token.surface()) > 0
-            # remove empty tokens which can be produced with characters like … that
-        ]
-        # Sudachi normalizes internally and outputs each space char as a token.
-        # This is the preparation for get_dtokens_and_spaces() to merge the continuous space tokens
-        return [
-            t for idx, t in enumerate(dtokens) if
-            idx == 0 or
-            not t.surface.isspace() or t.tag != '空白' or
-            not dtokens[idx - 1].surface.isspace() or dtokens[idx - 1].tag != '空白'
-        ]
-
-    def _get_sub_tokens(self, sudachipy_tokens):
-        if self.split_mode is None or self.split_mode == "A":  # do nothing for default split mode
-            return None
-
-        sub_tokens_list = []  # list of (list of list of DetailedToken | None)
-        for token in sudachipy_tokens:
-            sub_a = token.split(self.tokenizer.SplitMode.A)
-            if len(sub_a) == 1:  # no sub tokens
-                sub_tokens_list.append(None)
-            elif self.split_mode == "B":
-                sub_tokens_list.append([self._get_dtokens(sub_a, False)])
-            else:  # "C"
-                sub_b = token.split(self.tokenizer.SplitMode.B)
-                if len(sub_a) == len(sub_b):
-                    dtokens = self._get_dtokens(sub_a, False)
-                    sub_tokens_list.append([dtokens, dtokens])
-                else:
-                    sub_tokens_list.append([self._get_dtokens(sub_a, False), self._get_dtokens(sub_b, False)])
-        return sub_tokens_list
-
     def _get_config(self):
         config = OrderedDict(
             (
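
Both sides of this hunk expose their extra analyses through `Doc.user_data`; only the keys differ. A hedged sketch of reading them (assumes SudachiPy and its dictionary are installed; the text is illustrative):

import spacy

nlp = spacy.blank("ja")              # SudachiPy-backed tokenizer
doc = nlp("すもももももももものうち")

# Restored (right-hand) side stores one key with the raw Unidic tags:
print(doc.user_data.get("unidic_tags"))

# The removed (left-hand) side stored three keys instead:
#   doc.user_data["inflections"], doc.user_data["reading_forms"],
#   doc.user_data["sub_tokens"]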

spacy/lang/ja/bunsetu.py (new file, 144 lines)

@@ -0,0 +1,144 @@

# coding: utf8
from __future__ import unicode_literals

from .stop_words import STOP_WORDS


POS_PHRASE_MAP = {
    "NOUN": "NP",
    "NUM": "NP",
    "PRON": "NP",
    "PROPN": "NP",

    "VERB": "VP",

    "ADJ": "ADJP",

    "ADV": "ADVP",

    "CCONJ": "CCONJP",
}


# return value: [(bunsetu_tokens, phrase_type={'NP', 'VP', 'ADJP', 'ADVP'}, phrase_tokens)]
def yield_bunsetu(doc, debug=False):
    bunsetu = []
    bunsetu_may_end = False
    phrase_type = None
    phrase = None
    prev = None
    prev_tag = None
    prev_dep = None
    prev_head = None
    for t in doc:
        pos = t.pos_
        pos_type = POS_PHRASE_MAP.get(pos, None)
        tag = t.tag_
        dep = t.dep_
        head = t.head.i
        if debug:
            print(t.i, t.orth_, pos, pos_type, dep, head, bunsetu_may_end, phrase_type, phrase, bunsetu)

        # DET is always an individual bunsetu
        if pos == "DET":
            if bunsetu:
                yield bunsetu, phrase_type, phrase
            yield [t], None, None
            bunsetu = []
            bunsetu_may_end = False
            phrase_type = None
            phrase = None

        # PRON or Open PUNCT always splits bunsetu
        elif tag == "補助記号-括弧開":
            if bunsetu:
                yield bunsetu, phrase_type, phrase
            bunsetu = [t]
            bunsetu_may_end = True
            phrase_type = None
            phrase = None

        # bunsetu head not appeared
        elif phrase_type is None:
            if bunsetu and prev_tag == "補助記号-読点":
                yield bunsetu, phrase_type, phrase
                bunsetu = []
                bunsetu_may_end = False
                phrase_type = None
                phrase = None
            bunsetu.append(t)
            if pos_type:  # begin phrase
                phrase = [t]
                phrase_type = pos_type
                if pos_type in {"ADVP", "CCONJP"}:
                    bunsetu_may_end = True

        # entering new bunsetu
        elif pos_type and (
                pos_type != phrase_type or  # different phrase type arises
                bunsetu_may_end  # same phrase type but bunsetu already ended
        ):
            # exceptional case: NOUN to VERB
            if phrase_type == "NP" and pos_type == "VP" and prev_dep == 'compound' and prev_head == t.i:
                bunsetu.append(t)
                phrase_type = "VP"
                phrase.append(t)
            # exceptional case: VERB to NOUN
            elif phrase_type == "VP" and pos_type == "NP" and (
                    prev_dep == 'compound' and prev_head == t.i or
                    dep == 'compound' and prev == head or
                    prev_dep == 'nmod' and prev_head == t.i
            ):
                bunsetu.append(t)
                phrase_type = "NP"
                phrase.append(t)
            else:
                yield bunsetu, phrase_type, phrase
                bunsetu = [t]
                bunsetu_may_end = False
                phrase_type = pos_type
                phrase = [t]

        # NOUN bunsetu
        elif phrase_type == "NP":
            bunsetu.append(t)
            if not bunsetu_may_end and ((
                    (pos_type == "NP" or pos == "SYM") and (prev_head == t.i or prev_head == head) and prev_dep in {'compound', 'nummod'}
            ) or (
                    pos == "PART" and (prev == head or prev_head == head) and dep == 'mark'
            )):
                phrase.append(t)
            else:
                bunsetu_may_end = True

        # VERB bunsetu
        elif phrase_type == "VP":
            bunsetu.append(t)
            if not bunsetu_may_end and pos == "VERB" and prev_head == t.i and prev_dep == 'compound':
                phrase.append(t)
            else:
                bunsetu_may_end = True

        # ADJ bunsetu
        elif phrase_type == "ADJP" and tag != '連体詞':
            bunsetu.append(t)
            if not bunsetu_may_end and ((
                    pos == "NOUN" and (prev_head == t.i or prev_head == head) and prev_dep in {'amod', 'compound'}
            ) or (
                    pos == "PART" and (prev == head or prev_head == head) and dep == 'mark'
            )):
                phrase.append(t)
            else:
                bunsetu_may_end = True

        # other bunsetu
        else:
            bunsetu.append(t)

        prev = t.i
        prev_tag = t.tag_
        prev_dep = t.dep_
        prev_head = head

    if bunsetu:
        yield bunsetu, phrase_type, phrase
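
A rough, hypothetical usage sketch for `yield_bunsetu` (assumes `doc` was produced by a Japanese pipeline that fills in POS tags, Unidic tags and dependency heads, since the rules above read `pos_`, `tag_`, `dep_` and `head`):

from spacy.lang.ja.bunsetu import yield_bunsetu

# `doc` is assumed to come from a Japanese model with POS tags and dependencies.
for tokens, phrase_type, phrase in yield_bunsetu(doc):
    chunk = "".join(t.orth_ for t in tokens)
    head_phrase = "".join(t.orth_ for t in phrase) if phrase else ""
    print(phrase_type, chunk, head_phrase)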

Deleted file (23 lines):

@@ -1,23 +0,0 @@

# coding: utf8
from __future__ import unicode_literals

from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS

from ...language import Language
from ...attrs import LANG


class NepaliDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "ne"  # Nepali language ISO code
    stop_words = STOP_WORDS


class Nepali(Language):
    lang = "ne"
    Defaults = NepaliDefaults


__all__ = ["Nepali"]
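
A minimal usage sketch for the removed language class, for reference (it only runs on a spaCy build where `spacy.lang.ne` is registered; the sentence comes from the deleted examples file shown next):

from spacy.lang.ne import Nepali

nlp = Nepali()                      # blank pipeline: tokenizer plus lexical attributes
doc = nlp("लन्डन यूनाइटेड किंगडमको एक ठूलो शहर हो।")
print([t.text for t in doc])
print([t.like_num for t in doc])    # LIKE_NUM is wired up through LEX_ATTRS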

Deleted file (22 lines):

@@ -1,22 +0,0 @@

# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.ne.examples import sentences
>>> docs = nlp.pipe(sentences)
"""


sentences = [
    "एप्पलले अमेरिकी स्टार्टअप १ अर्ब डलरमा किन्ने सोच्दै छ",
    "स्वायत्त कारहरूले बीमा दायित्व निर्माताहरु तिर बदल्छन्",
    "स्यान फ्रांसिस्कोले फुटपाथ वितरण रोबोटहरु प्रतिबंध गर्ने विचार गर्दै छ",
    "लन्डन यूनाइटेड किंगडमको एक ठूलो शहर हो।",
    "तिमी कहाँ छौ?",
    "फ्रान्स को राष्ट्रपति को हो?",
    "संयुक्त राज्यको राजधानी के हो?",
    "बराक ओबामा कहिले कहिले जन्मेका हुन्?",
]

Deleted file (98 lines):

@@ -1,98 +0,0 @@

# coding: utf8
from __future__ import unicode_literals

from ..norm_exceptions import BASE_NORMS
from ...attrs import NORM, LIKE_NUM


# fmt: off
_stem_suffixes = [
    ["ा", "ि", "ी", "ु", "ू", "ृ", "े", "ै", "ो", "ौ"],
    ["ँ", "ं", "्", "ः"],
    ["लाई", "ले", "बाट", "को", "मा", "हरू"],
    ["हरूलाई", "हरूले", "हरूबाट", "हरूको", "हरूमा"],
    ["इलो", "िलो", "नु", "ाउनु", "ई", "इन", "इन्", "इनन्"],
    ["एँ", "इँन्", "इस्", "इनस्", "यो", "एन", "यौं", "एनौं", "ए", "एनन्"],
    ["छु", "छौँ", "छस्", "छौ", "छ", "छन्", "छेस्", "छे", "छ्यौ", "छिन्", "हुन्छ"],
    ["दै", "दिन", "दिँन", "दैनस्", "दैन", "दैनौँ", "दैनौं", "दैनन्"],
    ["हुन्न", "न्न", "न्न्स्", "न्नौं", "न्नौ", "न्न्न्", "िई"],
    ["अ", "ओ", "ऊ", "अरी", "साथ", "वित्तिकै", "पूर्वक"],
    ["याइ", "ाइ", "बार", "वार", "चाँहि"],
    ["ने", "ेको", "ेकी", "ेका", "ेर", "दै", "तै", "िकन", "उ", "न", "नन्"]
]
# fmt: on

# reference 1: https://en.wikipedia.org/wiki/Numbers_in_Nepali_language
# reference 2: https://www.imnepal.com/nepali-numbers/
_num_words = [
    "शुन्य", "एक", "दुई", "तीन", "चार", "पाँच", "छ", "सात", "आठ", "नौ",
    "दश", "एघार", "बाह्र", "तेह्र", "चौध", "पन्ध्र", "सोह्र", "सोह्र", "सत्र",
    "अठार", "उन्नाइस", "बीस", "तीस", "चालीस", "पचास", "साठी", "सत्तरी",
    "असी", "नब्बे", "सय", "हजार", "लाख", "करोड", "अर्ब", "खर्ब",
]


def norm(string):
    # normalise base exceptions, e.g. punctuation or currency symbols
    if string in BASE_NORMS:
        return BASE_NORMS[string]
    # set stem word as norm, if available, adapted from:
    # https://github.com/explosion/spaCy/blob/master/spacy/lang/hi/lex_attrs.py
    # https://www.researchgate.net/publication/237261579_Structure_of_Nepali_Grammar
    for suffix_group in reversed(_stem_suffixes):
        length = len(suffix_group[0])
        if len(string) <= length:
            break
        for suffix in suffix_group:
            if string.endswith(suffix):
                return string[:-length]
    return string


def like_num(text):
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(", ", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if text.lower() in _num_words:
        return True
    return False


LEX_ATTRS = {NORM: norm, LIKE_NUM: like_num}
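
A short illustration of what the deleted `like_num` accepts, inferred from the function body above (hedged; assumes Python 3 `str.isdigit` semantics and that `like_num` from the module above is in scope):

print(like_num("10"))      # True  – plain digits
print(like_num("१०"))      # True  – Devanagari digits also satisfy str.isdigit()
print(like_num("3/4"))     # True  – simple fractions
print(like_num("एक"))      # True  – listed number word ("one")
print(like_num("घर"))      # False – ordinary noun ("house")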

Deleted file (498 lines; the stop words, one per line in the original, are shown here whitespace-separated, which is equivalent under `.split()`):

@@ -1,498 +0,0 @@

# coding: utf8
from __future__ import unicode_literals


# Source: https://github.com/sanjaalcorps/NepaliStopWords/blob/master/NepaliStopWords.txt

STOP_WORDS = set(
    """
अक्सर अगाडि अगाडी अघि अझै अठार अथवा अनि अनुसार अन्तर्गत अन्य अन्यत्र अन्यथा अब अरु अरुलाई अरू अर्को
अर्थात अर्थात् अलग अलि अवस्था अहिले आए आएका आएको आज आजको आठ आत्म आदि आदिलाई आफनो आफू आफूलाई
आफै आफैँ आफ्नै आफ्नो आयो उ उक्त उदाहरण उनको उनलाई उनले उनि उनी उनीहरुको उन्नाइस उप उसको उसलाई उसले
उहालाई ऊ एउटा एउटै एक एकदम एघार ओठ औ औं कता कति कतै कम कमसेकम कसरि कसरी कसै कसैको कसैलाई
कसैले कसैसँग कस्तो कहाँबाट कहिलेकाहीं का काम कारण कि किन किनभने कुन कुनै कुन्नी कुरा कृपया के केहि केही
को कोहि कोहिपनि कोही कोहीपनि क्रमशः गए गएको गएर गयौ गरि गरी गरे गरेका गरेको गरेर गरौं गर्छ गर्छन् गर्छु
गर्दा गर्दै गर्न गर्नु गर्नुपर्छ गर्ने गैर घर चार चाले चाहनुहुन्छ चाहन्छु चाहिं चाहिए चाहिंले चाहीं चाहेको चाहेर चोटी चौथो
चौध छ छन छन् छु छू छैन छैनन् छौ छौं जता जताततै जना जनाको जनालाई जनाले जब जबकि जबकी जसको जसबाट
जसमा जसरी जसलाई जसले जस्ता जस्तै जस्तो जस्तोसुकै जहाँ जान जाने जाहिर जुन जुनै जे जो जोपनि जोपनी झैं
ठाउँमा ठीक ठूलो त तता तत्काल तथा तथापि तथापी तदनुसार तपाइ तपाई तपाईको तब तर तर्फ तल तसरी तापनि
तापनी तिन तिनि तिनिहरुलाई तिनी तिनीहरु तिनीहरुको तिनीहरू तिनीहरूको तिनै तिमी तिर तिरको ती तीन तुरन्त
तुरुन्त तुरुन्तै तेश्रो तेस्कारण तेस्रो तेह्र तैपनि तैपनी त्यत्तिकै त्यत्तिकैमा त्यस त्यसकारण त्यसको त्यसले त्यसैले त्यसो
त्यस्तै त्यस्तो त्यहाँ त्यहिँ त्यही त्यहीँ त्यहीं त्यो त्सपछि त्सैले थप थरि थरी थाहा थिए थिएँ थिएन थियो दर्ता दश दिए
दिएको दिन दिनुभएको दिनुहुन्छ दुइ दुइवटा दुई देखि देखिन्छ देखियो देखे देखेको देखेर दोश्री दोश्रो दोस्रो द्वारा धन्न
धेरै धौ न नगर्नु नगर्नू नजिकै नत्र नत्रभने नभई नभएको नभनेर नयाँ नि निकै निम्ति निम्न निम्नानुसार निर्दिष्ट नै नौ
पक्का पक्कै पछाडि पछाडी पछि पछिल्लो पछी पटक पनि पन्ध्र पर्छ पर्थ्यो पर्दैन पर्ने पर्नेमा पर्याप्त पहिले पहिलो
पहिल्यै पाँच पांच पाचौँ पाँचौं पिच्छे पूर्व पो प्रति प्रतेक प्रत्यक प्राय प्लस फरक फेरि फेरी बढी बताए बने बरु बाट बारे
बाहिर बाहेक बाह्र बिच बिचमा बिरुद्ध बिशेष बिस बीच बीचमा बीस भए भएँ भएका भएकालाई भएको भएन भएर भन
भने भनेको भनेर भन् भन्छन् भन्छु भन्दा भन्दै भन्नुभयो भन्ने भन्या भयेन भयो भर भरि भरी भा भित्र भित्री भीत्र म
मध्य मध्ये मलाई मा मात्र मात्रै माथि माथी मुख्य मुनि मुन्तिर मेरो मैले यति यथोचित यदि यद्ध्यपि यद्यपि यस यसका
यसको यसपछि यसबाहेक यसमा यसरी यसले यसो यस्तै यस्तो यहाँ यहाँसम्म यही या यी यो र रही रहेका रहेको रहेछ
राखे राख्छ राम्रो रुपमा रूप रे लगभग लगायत लाई लाख लागि लागेको ले वटा वरीपरी वा वाट वापत वास्तवमा शायद
सक्छ सक्ने सँग संग सँगको सँगसँगै सँगै संगै सङ्ग सङ्गको सट्टा सत्र सधै सबै सबैको सबैलाई समय समेत सम्भव सम्म
सय सरह सहित सहितै सही साँच्चै सात साथ साथै सायद सारा सुनेको सुनेर सुरु सुरुको सुरुमै सो सोचेको सोचेर सोही
सोह्र स्थित स्पष्ट हजार हरे हरेक हामी हामीले हाम्रा हाम्रो हुँदैन हुन हुनत हुनु हुने हुनेछ हुन् हुन्छ हुन्थ्यो हैन हो होइन
होकि होला
    """.split()
)
@@ -349,7 +349,7 @@ cdef class Lexeme:
     @property
     def is_oov(self):
         """RETURNS (bool): Whether the lexeme is out-of-vocabulary."""
-        return self.orth not in self.vocab.vectors
+        return self.orth in self.vocab.vectors

     property is_stop:
         """RETURNS (bool): Whether the lexeme is a stop word."""
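For context, a minimal sketch (not part of the diff) of how `is_oov` can be inspected against the vectors table that both variants of the hunk above compare against; it assumes a model with vectors such as `en_core_web_md` is installed:

```python
import spacy

# Assumes en_core_web_md (a pipeline with word vectors) is installed.
nlp = spacy.load("en_core_web_md")
lex = nlp.vocab["cat"]
# Both sides of the hunk derive is_oov from membership in nlp.vocab.vectors;
# they differ only in whether that membership test is negated.
print(lex.is_oov, lex.orth in nlp.vocab.vectors)
```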
@@ -528,10 +528,10 @@ class Tagger(Pipe):
                         new_tag_map[tag] = orig_tag_map[tag]
                     else:
                         new_tag_map[tag] = {POS: X}
+        if "_SP" in orig_tag_map:
+            new_tag_map["_SP"] = orig_tag_map["_SP"]
         cdef Vocab vocab = self.vocab
         if new_tag_map:
-            if "_SP" in orig_tag_map:
-                new_tag_map["_SP"] = orig_tag_map["_SP"]
             vocab.morphology = Morphology(vocab.strings, new_tag_map,
                                           vocab.morphology.lemmatizer,
                                           exc=vocab.morphology.exc)
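A minimal sketch (not part of the diff) of the case this `_SP` handling targets, mirroring the `test_tagger_begin_training_tag_map` test removed further down in this diff:

```python
from spacy.language import Language

nlp = Language()
tagger = nlp.create_pipe("tagger")
tagger.add_label("A", {"POS": "NOUN"})
nlp.add_pipe(tagger)
# begin_training() is called without gold tuples; the question is whether
# entries such as "_SP" survive in the rebuilt tag map.
nlp.begin_training()
print(nlp.vocab.morphology.tag_map.get("A"))
print("_SP" in nlp.vocab.morphology.tag_map)
```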
@@ -170,11 +170,6 @@ def nb_tokenizer():
     return get_lang_class("nb").Defaults.create_tokenizer()


-@pytest.fixture(scope="session")
-def ne_tokenizer():
-    return get_lang_class("ne").Defaults.create_tokenizer()
-
-
 @pytest.fixture(scope="session")
 def nl_tokenizer():
     return get_lang_class("nl").Defaults.create_tokenizer()
@@ -4,7 +4,7 @@ from __future__ import unicode_literals
 import pytest

 from ...tokenizer.test_naughty_strings import NAUGHTY_STRINGS
-from spacy.lang.ja import Japanese, DetailedToken
+from spacy.lang.ja import Japanese

 # fmt: off
 TOKENIZER_TESTS = [
@@ -96,57 +96,6 @@ def test_ja_tokenizer_split_modes(ja_tokenizer, text, len_a, len_b, len_c):
     assert len(nlp_c(text)) == len_c


-@pytest.mark.parametrize("text,sub_tokens_list_a,sub_tokens_list_b,sub_tokens_list_c",
-    [
-        (
-            "選挙管理委員会",
-            [None, None, None, None],
-            [None, None, [
-                [
-                    DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None),
-                    DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None),
-                ]
-            ]],
-            [[
-                [
-                    DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None),
-                    DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None),
-                    DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None),
-                    DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None),
-                ], [
-                    DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None),
-                    DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None),
-                    DetailedToken(surface='委員会', tag='名詞-普通名詞-一般', inf='', lemma='委員会', reading='イインカイ', sub_tokens=None),
-                ]
-            ]]
-        ),
-    ]
-)
-def test_ja_tokenizer_sub_tokens(ja_tokenizer, text, sub_tokens_list_a, sub_tokens_list_b, sub_tokens_list_c):
-    nlp_a = Japanese(meta={"tokenizer": {"config": {"split_mode": "A"}}})
-    nlp_b = Japanese(meta={"tokenizer": {"config": {"split_mode": "B"}}})
-    nlp_c = Japanese(meta={"tokenizer": {"config": {"split_mode": "C"}}})
-
-    assert ja_tokenizer(text).user_data["sub_tokens"] == sub_tokens_list_a
-    assert nlp_a(text).user_data["sub_tokens"] == sub_tokens_list_a
-    assert nlp_b(text).user_data["sub_tokens"] == sub_tokens_list_b
-    assert nlp_c(text).user_data["sub_tokens"] == sub_tokens_list_c
-
-
-@pytest.mark.parametrize("text,inflections,reading_forms",
-    [
-        (
-            "取ってつけた",
-            ("五段-ラ行,連用形-促音便", "", "下一段-カ行,連用形-一般", "助動詞-タ,終止形-一般"),
-            ("トッ", "テ", "ツケ", "タ"),
-        ),
-    ]
-)
-def test_ja_tokenizer_inflections_reading_forms(ja_tokenizer, text, inflections, reading_forms):
-    assert ja_tokenizer(text).user_data["inflections"] == inflections
-    assert ja_tokenizer(text).user_data["reading_forms"] == reading_forms
-
-
 def test_ja_tokenizer_emptyish_texts(ja_tokenizer):
     doc = ja_tokenizer("")
     assert len(doc) == 0
@ -1,19 +0,0 @@
|
||||||
# coding: utf-8
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
|
|
||||||
def test_ne_tokenizer_handlers_long_text(ne_tokenizer):
|
|
||||||
text = """मैले पाएको सर्टिफिकेटलाई म त बोक्रो सम्झन्छु र अभ्यास तब सुरु भयो, जब मैले कलेज पार गरेँ र जीवनको पढाइ सुरु गरेँ ।"""
|
|
||||||
tokens = ne_tokenizer(text)
|
|
||||||
assert len(tokens) == 24
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize(
|
|
||||||
"text,length",
|
|
||||||
[("समय जान कति पनि बेर लाग्दैन ।", 7), ("म ठूलो हुँदै थिएँ ।", 5)],
|
|
||||||
)
|
|
||||||
def test_ne_tokenizer_handles_cnts(ne_tokenizer, text, length):
|
|
||||||
tokens = ne_tokenizer(text)
|
|
||||||
assert len(tokens) == length
|
|
|
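A minimal sketch (not part of the diff) of building the Nepali tokenizer the same way the removed `ne_tokenizer` fixture did, using one of the sentences from the removed test:

```python
from spacy.util import get_lang_class

# Same construction as the removed conftest fixture.
ne_tokenizer = get_lang_class("ne").Defaults.create_tokenizer()
tokens = ne_tokenizer("समय जान कति पनि बेर लाग्दैन ।")
print(len(tokens))  # the removed test expected 7 tokens for this sentence
```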
@@ -3,7 +3,6 @@ from __future__ import unicode_literals

 import pytest
 from spacy.language import Language
-from spacy.symbols import POS, NOUN


 def test_label_types():
@@ -12,16 +11,3 @@ def test_label_types():
     nlp.get_pipe("tagger").add_label("A")
     with pytest.raises(ValueError):
         nlp.get_pipe("tagger").add_label(9)
-
-
-def test_tagger_begin_training_tag_map():
-    """Test that Tagger.begin_training() without gold tuples does not clobber
-    the tag map."""
-    nlp = Language()
-    tagger = nlp.create_pipe("tagger")
-    orig_tag_count = len(tagger.labels)
-    tagger.add_label("A", {"POS": "NOUN"})
-    nlp.add_pipe(tagger)
-    nlp.begin_training()
-    assert nlp.vocab.morphology.tag_map["A"] == {POS: NOUN}
-    assert orig_tag_count + 1 == len(nlp.get_pipe("tagger").labels)
@@ -376,6 +376,6 @@ def test_vector_is_oov():
     data[1] = 2.0
     vocab.set_vector("cat", data[0])
     vocab.set_vector("dog", data[1])
-    assert vocab["cat"].is_oov is False
-    assert vocab["dog"].is_oov is False
-    assert vocab["hamster"].is_oov is True
+    assert vocab["cat"].is_oov is True
+    assert vocab["dog"].is_oov is True
+    assert vocab["hamster"].is_oov is False
@@ -923,7 +923,7 @@ cdef class Token:
     @property
     def is_oov(self):
        """RETURNS (bool): Whether the token is out-of-vocabulary."""
-        return self.c.lex.orth not in self.vocab.vectors
+        return self.c.lex.orth in self.vocab.vectors

     @property
     def is_stop(self):
@@ -208,10 +208,6 @@ def load_model_from_path(model_path, meta=False, **overrides):
         pipeline = nlp.Defaults.pipe_names
     elif pipeline in (False, None):
         pipeline = []
-    # skip "vocab" from overrides in component initialization since vocab is
-    # already configured from overrides when nlp is initialized above
-    if "vocab" in overrides:
-        del overrides["vocab"]
     for name in pipeline:
         if name not in disable:
             config = meta.get("pipeline_args", {}).get(name, {})
@@ -12,18 +12,18 @@ expects true examples of a label to have the value `1.0`, and negative examples
 of a label to have the value `0.0`. Labels not in the dictionary are treated as
 missing – the gradient for those labels will be zero.

 | Name | Type | Description |
-| ----------------- | ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| ----------- | ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `doc` | `Doc` | The document the annotations refer to. |
 | `words` | iterable | A sequence of unicode word strings. |
 | `tags` | iterable | A sequence of strings, representing tag annotations. |
 | `heads` | iterable | A sequence of integers, representing syntactic head offsets. |
 | `deps` | iterable | A sequence of strings, representing the syntactic relation types. |
 | `entities` | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. |
 | `cats` | dict | Labels for text classification. Each key in the dictionary is a string label for the category and each value is `1.0` (positive) or `0.0` (negative). |
 | `links` | dict | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either `1.0` (positive) or `0.0` (negative). |
-| `make_projective` | bool | Whether to projectivize the dependency tree. Defaults to `False`. |
+| `make_projective` | bool | Whether to projectivize the dependency tree. Defaults to `False.`. |
 | **RETURNS** | `GoldParse` | The newly constructed object. |

 ## GoldParse.\_\_len\_\_ {#len tag="method"}

@@ -43,17 +43,17 @@ Whether the provided syntactic annotations form a projective dependency tree.

 ## Attributes {#attributes}

 | Name | Type | Description |
-| ------------------------------------ | ---- | ------------------------------------------------------------------------------------------------------------------------ |
+| ------------------------------------ | ---- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `words` | list | The words. |
 | `tags` | list | The part-of-speech tag annotations. |
 | `heads` | list | The syntactic head annotations. |
 | `labels` | list | The syntactic relation-type annotations. |
 | `ner` | list | The named entity annotations as BILUO tags. |
 | `cand_to_gold` | list | The alignment from candidate tokenization to gold tokenization. |
 | `gold_to_cand` | list | The alignment from gold tokenization to candidate tokenization. |
 | `cats` <Tag variant="new">2</Tag> | dict | Keys in the dictionary are string category labels with values `1.0` or `0.0`. |
 | `links` <Tag variant="new">2.2</Tag> | dict | Keys in the dictionary are `(start_char, end_char)` triples, and the values are dictionaries with `kb_id:value` entries. |

 ## Utilities {#util}

@@ -61,8 +61,7 @@ Whether the provided syntactic annotations form a projective dependency tree.

 Convert a list of Doc objects into the
 [JSON-serializable format](/api/annotation#json-input) used by the
-[`spacy train`](/api/cli#train) command. Each input doc will be treated as a
-'paragraph' in the output doc.
+[`spacy train`](/api/cli#train) command. Each input doc will be treated as a 'paragraph' in the output doc.

 > #### Example
 >
@@ -57,7 +57,7 @@ spaCy v2.3, the `Matcher` can also be called on `Span` objects.

 | Name | Type | Description |
 | ----------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `doclike` | `Doc`/`Span` | The document to match over or a `Span` (as of v2.3). |
+| `doclike` | `Doc`/`Span` | The document to match over or a `Span` (as of v2.3).. |
 | **RETURNS** | list | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. |

 <Infobox title="Important note" variant="warning">
@@ -36,7 +36,7 @@ for token in doc:
 | Text | Lemma | POS | Tag | Dep | Shape | alpha | stop |
 | ------- | ------- | ------- | ----- | ---------- | ------- | ------- | ------- |
 | Apple | apple | `PROPN` | `NNP` | `nsubj` | `Xxxxx` | `True` | `False` |
-| is | be | `AUX` | `VBZ` | `aux` | `xx` | `True` | `True` |
+| is | be | `VERB` | `VBZ` | `aux` | `xx` | `True` | `True` |
 | looking | look | `VERB` | `VBG` | `ROOT` | `xxxx` | `True` | `False` |
 | at | at | `ADP` | `IN` | `prep` | `xx` | `True` | `True` |
 | buying | buy | `VERB` | `VBG` | `pcomp` | `xxxx` | `True` | `False` |
@@ -662,7 +662,7 @@ One thing to keep in mind is that spaCy expects to train its models from **whole
 documents**, not just single sentences. If your corpus only contains single
 sentences, spaCy's models will never learn to expect multi-sentence documents,
 leading to low performance on real text. To mitigate this problem, you can use
-the `-n` argument to the `spacy convert` command, to merge some of the sentences
+the `-N` argument to the `spacy convert` command, to merge some of the sentences
 into longer pseudo-documents.

 ### Training the tagger and parser {#train-tagger-parser}
@@ -471,7 +471,7 @@ doc = nlp.make_doc("London is a big city in the United Kingdom.")
 print("Before", doc.ents) # []

 header = [ENT_IOB, ENT_TYPE]
-attr_array = numpy.zeros((len(doc), len(header)), dtype="uint64")
+attr_array = numpy.zeros((len(doc), len(header)))
 attr_array[0, 0] = 3 # B
 attr_array[0, 1] = doc.vocab.strings["GPE"]
 doc.from_array(header, attr_array)
@@ -1143,9 +1143,9 @@ from spacy.gold import align
 other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
 spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
 cost, a2b, b2a, a2b_multi, b2a_multi = align(other_tokens, spacy_tokens)
-print("Edit distance:", cost) # 3
+print("Misaligned tokens:", cost) # 2
 print("One-to-one mappings a -> b", a2b) # array([0, 1, 2, 3, -1, -1, 5, 6])
-print("One-to-one mappings b -> a", b2a) # array([0, 1, 2, 3, -1, 6, 7])
+print("One-to-one mappings b -> a", b2a) # array([0, 1, 2, 3, 5, 6, 7])
 print("Many-to-one mappings a -> b", a2b_multi) # {4: 4, 5: 4}
 print("Many-to-one mappings b-> a", b2a_multi) # {}
 ```
@@ -1153,7 +1153,7 @@ print("Many-to-one mappings b-> a", b2a_multi) # {}
 Here are some insights from the alignment information generated in the example
 above:

-- The edit distance (cost) is `3`: two deletions and one insertion.
+- Two tokens are misaligned.
 - The one-to-one mappings for the first four tokens are identical, which means
   they map to each other. This makes sense because they're also identical in the
   input: `"i"`, `"listened"`, `"to"` and `"obama"`.
@@ -117,18 +117,6 @@ The Chinese language class supports three word segmentation options:
   better segmentation for Chinese OntoNotes and the new
   [Chinese models](/models/zh).

-<Infobox variant="warning">
-
-Note that [`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship
-with pre-compiled wheels for Python 3.8. If you're running Python 3.8, you can
-install it from our fork and compile it locally:
-
-```bash
-$ pip install https://github.com/honnibal/pkuseg-python/archive/master.zip
-```
-
-</Infobox>
-
 <Accordion title="Details on spaCy's PKUSeg API">

 The `meta` argument of the `Chinese` language class supports the following
@@ -208,20 +196,12 @@ nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_mo

 The Japanese language class uses
 [SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
-segmentation and part-of-speech tagging. The default Japanese language class and
-the provided Japanese models use SudachiPy split mode `A`.
+segmentation and part-of-speech tagging. The default Japanese language class
+and the provided Japanese models use SudachiPy split mode `A`.

 The `meta` argument of the `Japanese` language class can be used to configure
 the split mode to `A`, `B` or `C`.

-<Infobox variant="warning">
-
-If you run into errors related to `sudachipy`, which is currently under active
-development, we suggest downgrading to `sudachipy==0.4.5`, which is the version
-used for training the current [Japanese models](/models/ja).
-
-</Infobox>
-
 ## Installing and using models {#download}

 > #### Downloading models in spaCy < v1.7
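A minimal sketch (not part of the diff) of picking a SudachiPy split mode through the `meta` argument described above; it assumes spaCy is installed with the Japanese extra (`pip install spacy[ja]`):

```python
from spacy.lang.ja import Japanese

# Split mode A produces the shortest units, C the longest.
nlp_a = Japanese(meta={"tokenizer": {"config": {"split_mode": "A"}}})
nlp_c = Japanese(meta={"tokenizer": {"config": {"split_mode": "C"}}})
text = "選挙管理委員会"
print(len(nlp_a(text)), len(nlp_c(text)))
```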
@@ -1158,17 +1158,17 @@ what you need for your application.
 > available corpus.

 For example, the corpus spaCy's [English models](/models/en) were trained on
-defines a `PERSON` entity as just the **person name**, without titles like "Mr."
-or "Dr.". This makes sense, because it makes it easier to resolve the entity
-type back to a knowledge base. But what if your application needs the full
-names, _including_ the titles?
+defines a `PERSON` entity as just the **person name**, without titles like "Mr"
+or "Dr". This makes sense, because it makes it easier to resolve the entity type
+back to a knowledge base. But what if your application needs the full names,
+_including_ the titles?

 ```python
 ### {executable="true"}
 import spacy

 nlp = spacy.load("en_core_web_sm")
-doc = nlp("Dr. Alex Smith chaired first board meeting of Acme Corp Inc.")
+doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
 print([(ent.text, ent.label_) for ent in doc.ents])
 ```

@@ -1233,7 +1233,7 @@ def expand_person_entities(doc):
 # Add the component after the named entity recognizer
 nlp.add_pipe(expand_person_entities, after='ner')

-doc = nlp("Dr. Alex Smith chaired first board meeting of Acme Corp Inc.")
+doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
 print([(ent.text, ent.label_) for ent in doc.ents])
 ```

@@ -14,10 +14,10 @@ all language models, and decreased model size and loading times for models with
 vectors. We've added pretrained models for **Chinese, Danish, Japanese, Polish
 and Romanian** and updated the training data and vectors for most languages.
 Model packages with vectors are about **2×** smaller on disk and load
-**2-4×** faster. For the full changelog, see the
-[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0).
-For more details and a behind-the-scenes look at the new release,
-[see our blog post](https://explosion.ai/blog/spacy-v2-3).
+**2-4×** faster. For the full changelog, see the [release notes on
+GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0). For more
+details and a behind-the-scenes look at the new release, [see our blog
+post](https://explosion.ai/blog/spacy-v2-3).

 ### Expanded model families with vectors {#models}

@@ -33,10 +33,10 @@ For more details and a behind-the-scenes look at the new release,

 With new model families for Chinese, Danish, Polish, Romanian and Chinese plus
 `md` and `lg` models with word vectors for all languages, this release provides
-a total of 46 model packages. For models trained using
-[Universal Dependencies](https://universaldependencies.org) corpora, the
-training data has been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish)
-and Dutch has been extended to include both UD Dutch Alpino and LassySmall.
+a total of 46 model packages. For models trained using [Universal
+Dependencies](https://universaldependencies.org) corpora, the training data has
+been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish) and Dutch has been
+extended to include both UD Dutch Alpino and LassySmall.

 <Infobox>

@@ -48,7 +48,6 @@ and Dutch has been extended to include both UD Dutch Alpino and LassySmall.
 ### Chinese {#chinese}

 > #### Example
->
 > ```python
 > from spacy.lang.zh import Chinese
 >
@@ -58,49 +57,41 @@ and Dutch has been extended to include both UD Dutch Alpino and LassySmall.
 >
 > # Append words to user dict
 > nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])
-> ```

 This release adds support for
-[`pkuseg`](https://github.com/lancopku/pkuseg-python) for word segmentation and
-the new Chinese models ship with a custom pkuseg model trained on OntoNotes. The
-Chinese tokenizer can be initialized with both `pkuseg` and custom models and
-the `pkuseg` user dictionary is easy to customize. Note that
-[`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship with
-pre-compiled wheels for Python 3.8. See the
-[usage documentation](/usage/models#chinese) for details on how to install it on
-Python 3.8.
+[pkuseg](https://github.com/lancopku/pkuseg-python) for word segmentation and
+the new Chinese models ship with a custom pkuseg model trained on OntoNotes.
+The Chinese tokenizer can be initialized with both `pkuseg` and custom models
+and the `pkuseg` user dictionary is easy to customize.

 <Infobox>

-**Models:** [Chinese models](/models/zh) **Usage: **
-[Chinese tokenizer usage](/usage/models#chinese)
+**Chinese:** [Chinese tokenizer usage](/usage/models#chinese)

 </Infobox>

 ### Japanese {#japanese}

 The updated Japanese language class switches to
-[`SudachiPy`](https://github.com/WorksApplications/SudachiPy) for word
-segmentation and part-of-speech tagging. Using `SudachiPy` greatly simplifies
+[SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
+segmentation and part-of-speech tagging. Using `sudachipy` greatly simplifies
 installing spaCy for Japanese, which is now possible with a single command:
 `pip install spacy[ja]`.

 <Infobox>

-**Models:** [Japanese models](/models/ja) **Usage:**
-[Japanese tokenizer usage](/usage/models#japanese)
+**Japanese:** [Japanese tokenizer usage](/usage/models#japanese)

 </Infobox>

 ### Small CLI updates

-- [`spacy debug-data`](/api/cli#debug-data) provides the coverage of the vectors
-  in a base model with `spacy debug-data lang train dev -b base_model`
-- [`spacy evaluate`](/api/cli#evaluate) supports `blank:lg` (e.g.
-  `spacy evaluate blank:en dev.json`) to evaluate the tokenization accuracy
-  without loading a model
-- [`spacy train`](/api/cli#train) on GPU restricts the CPU timing evaluation to
-  the first iteration
+- `spacy debug-data` provides the coverage of the vectors in a base model with
+  `spacy debug-data lang train dev -b base_model`
+- `spacy evaluate` supports `blank:lg` (e.g. `spacy evaluate blank:en
+  dev.json`) to evaluate the tokenization accuracy without loading a model
+- `spacy train` on GPU restricts the CPU timing evaluation to the first
+  iteration

 ## Backwards incompatibilities {#incompat}

@@ -109,8 +100,8 @@ installing spaCy for Japanese, which is now possible with a single command:
 If you've been training **your own models**, you'll need to **retrain** them
 with the new version. Also don't forget to upgrade all models to the latest
 versions. Models for earlier v2 releases (v2.0, v2.1, v2.2) aren't compatible
-with models for v2.3. To check if all of your models are up to date, you can run
-the [`spacy validate`](/api/cli#validate) command.
+with models for v2.3. To check if all of your models are up to date, you can
+run the [`spacy validate`](/api/cli#validate) command.

 </Infobox>

@@ -125,20 +116,21 @@ the [`spacy validate`](/api/cli#validate) command.
 > directly.

 - If you're training new models, you'll want to install the package
-  [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
-  now includes both the lemmatization tables (as in v2.2) and the normalization
-  tables (new in v2.3). If you're using pretrained models, **nothing changes**,
-  because the relevant tables are included in the model packages.
+  [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data),
+  which now includes both the lemmatization tables (as in v2.2) and the
+  normalization tables (new in v2.3). If you're using pretrained models,
+  **nothing changes**, because the relevant tables are included in the model
+  packages.
 - Due to the updated Universal Dependencies training data, the fine-grained
   part-of-speech tags will change for many provided language models. The
   coarse-grained part-of-speech tagset remains the same, but the mapping from
   particular fine-grained to coarse-grained tags may show minor differences.
 - For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech
-  tagsets contain new merged tags related to contracted forms, such as `ADP_DET`
-  for French `"au"`, which maps to UPOS `ADP` based on the head `"à"`. This
-  increases the accuracy of the models by improving the alignment between
-  spaCy's tokenization and Universal Dependencies multi-word tokens used for
-  contractions.
+  tagsets contain new merged tags related to contracted forms, such as
+  `ADP_DET` for French `"au"`, which maps to UPOS `ADP` based on the head
+  `"à"`. This increases the accuracy of the models by improving the alignment
+  between spaCy's tokenization and Universal Dependencies multi-word tokens
+  used for contractions.

 ### Migrating from spaCy 2.2 {#migrating}

@@ -151,81 +143,29 @@ v2.3 so that `token_match` has priority over prefixes and suffixes as in v2.2.1
 and earlier versions.

 A new tokenizer setting `url_match` has been introduced in v2.3.0 to handle
-cases like URLs where the tokenizer should remove prefixes and suffixes (e.g., a
-comma at the end of a URL) before applying the match. See the full
-[tokenizer documentation](/usage/linguistic-features#tokenization) and try out
+cases like URLs where the tokenizer should remove prefixes and suffixes (e.g.,
+a comma at the end of a URL) before applying the match. See the full [tokenizer
+documentation](/usage/linguistic-features#tokenization) and try out
 [`nlp.tokenizer.explain()`](/usage/linguistic-features#tokenizer-debug) when
 debugging your tokenizer configuration.

 #### Warnings configuration

-spaCy's custom warnings have been replaced with native Python
+spaCy's custom warnings have been replaced with native python
 [`warnings`](https://docs.python.org/3/library/warnings.html). Instead of
-setting `SPACY_WARNING_IGNORE`, use the [`warnings`
+setting `SPACY_WARNING_IGNORE`, use the [warnings
 filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter)
 to manage warnings.

-```diff
-import spacy
-+ import warnings
-
-- spacy.errors.SPACY_WARNING_IGNORE.append('W007')
-+ warnings.filterwarnings("ignore", message=r"\\[W007\\]", category=UserWarning)
-```
-
 #### Normalization tables

 The normalization tables have moved from the language data in
-[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to the
-package [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data).
-If you're adding data for a new language, the normalization table should be
-added to `spacy-lookups-data`. See
-[adding norm exceptions](/usage/adding-languages#norm-exceptions).
+[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to
+the package
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). If
+you're adding data for a new language, the normalization table should be added
+to `spacy-lookups-data`. See [adding norm
+exceptions](/usage/adding-languages#norm-exceptions).

-#### No preloaded lexemes/vocab for models with vectors
-
-To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer
-loaded on initialization for models with vectors. As you process texts, the
-lexemes will be added to the vocab automatically, just as in models without
-vectors.
-
-To see the number of unique vectors and number of words with vectors, see
-`nlp.meta['vectors']`, for example for `en_core_web_md` there are `20000`
-unique vectors and `684830` words with vectors:
-
-```python
-{
-    'width': 300,
-    'vectors': 20000,
-    'keys': 684830,
-    'name': 'en_core_web_md.vectors'
-}
-```
-
-If required, for instance if you are working directly with word vectors rather
-than processing texts, you can load all lexemes for words with vectors at once:
-
-```python
-for orth in nlp.vocab.vectors:
-    _ = nlp.vocab[orth]
-```
-
-#### Lexeme.is_oov and Token.is_oov
-
-<Infobox title="Important note" variant="warning">
-
-Due to a bug, the values for `is_oov` are reversed in v2.3.0, but this will be
-fixed in the next patch release v2.3.1.
-
-</Infobox>
-
-In v2.3, `Lexeme.is_oov` and `Token.is_oov` are `True` if the lexeme does not
-have a word vector. This is equivalent to `token.orth not in
-nlp.vocab.vectors`.
-
-Previously in v2.2, `is_oov` corresponded to whether a lexeme had stored
-probability and cluster features. The probability and cluster features are no
-longer included in the provided medium and large models (see the next section).
-
 #### Probability and cluster features

@@ -241,28 +181,28 @@ longer included in the provided medium and large models (see the next section).

 The `Token.prob` and `Token.cluster` features, which are no longer used by the
 core pipeline components as of spaCy v2, are no longer provided in the
-pretrained models to reduce the model size. To keep these features available for
-users relying on them, the `prob` and `cluster` features for the most frequent
-1M tokens have been moved to
+pretrained models to reduce the model size. To keep these features available
+for users relying on them, the `prob` and `cluster` features for the most
+frequent 1M tokens have been moved to
 [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) as
 `extra` features for the relevant languages (English, German, Greek and
 Spanish).

 The extra tables are loaded lazily, so if you have `spacy-lookups-data`
-installed and your code accesses `Token.prob`, the full table is loaded into the
-model vocab, which will take a few seconds on initial loading. When you save
-this model after loading the `prob` table, the full `prob` table will be saved
-as part of the model vocab.
+installed and your code accesses `Token.prob`, the full table is loaded into
+the model vocab, which will take a few seconds on initial loading. When you
+save this model after loading the `prob` table, the full `prob` table will be
+saved as part of the model vocab.

-If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as part
-of a new model, add the data to
+If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as
+part of a new model, add the data to
 [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under
 the entry point `lg_extra`, e.g. `en_extra` for English. Alternatively, you can
 initialize your [`Vocab`](/api/vocab) with the `lookups_extra` argument with a
 [`Lookups`](/api/lookups) object that includes the tables `lexeme_cluster`,
 `lexeme_prob`, `lexeme_sentiment` or `lexeme_settings`. `lexeme_settings` is
-currently only used to provide a custom `oov_prob`. See examples in the
-[`data` directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data)
+currently only used to provide a custom `oov_prob`. See examples in the [`data`
+directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data)
 in `spacy-lookups-data`.

 #### Initializing new models without extra lookups tables
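A minimal sketch (not part of the diff) of supplying a custom `lexeme_prob` table through the `lookups_extra` argument described above; the table name follows the documentation, and the probability values are made-up placeholders:

```python
from spacy.lookups import Lookups
from spacy.vocab import Vocab

lookups = Lookups()
# Hypothetical log-probability values, just to show the table format.
lookups.add_table("lexeme_prob", {"the": -3.0, "platypus": -14.0})
vocab = Vocab(lookups_extra=lookups)
print(vocab["the"].prob, vocab["platypus"].prob)
```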
@@ -23,9 +23,9 @@
   "apiKey": "371e26ed49d29a27bd36273dfdaf89af",
   "indexName": "spacy"
 },
-"binderUrl": "explosion/spacy-io-binder",
+"binderUrl": "ines/spacy-io-binder",
 "binderBranch": "live",
-"binderVersion": "2.3.0",
+"binderVersion": "2.2.0",
 "sections": [
   { "id": "usage", "title": "Usage Documentation", "theme": "blue" },
   { "id": "models", "title": "Models Documentation", "theme": "blue" },