Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-26 01:04:34 +03:00)

Commit 730fa493a4: Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match
.github/contributors/ilivans.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                    |
| ------------------------------ | ------------------------ |
| Name                           | Ilia Ivanov              |
| Company name (if applicable)   | Chattermill              |
| Title or role (if applicable)  | DL Engineer              |
| Date                           | 2020-05-14               |
| GitHub username                | ilivans                  |
| Website (optional)             |                          |
.github/contributors/kevinlu1248.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@

# spaCy contributor agreement

The same spaCy Contributor Agreement text as in `.github/contributors/ilivans.md`
above, with **"us"** meaning [ExplosionAI GmbH](https://explosion.ai/legal),
signed as follows:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

## Contributor Details

| Field                          | Entry                |
| ------------------------------ | -------------------- |
| Name                           | Kevin Lu             |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  | Student              |
| Date                           |                      |
| GitHub username                | kevinlu1248          |
| Website (optional)             |                      |
.github/contributors/lfiedler.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@

# spaCy contributor agreement

The same spaCy Contributor Agreement text as in `.github/contributors/ilivans.md`
above, with **"us"** meaning [ExplosionAI GmbH](https://explosion.ai/legal),
signed as follows:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

## Contributor Details

| Field                          | Entry                |
| ------------------------------ | -------------------- |
| Name                           | Leander Fiedler      |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 06 April 2020        |
| GitHub username                | lfiedler             |
| Website (optional)             |                      |
.github/contributors/osori.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@

# spaCy contributor agreement

The same spaCy Contributor Agreement text as in `.github/contributors/ilivans.md`
above, with **"us"** meaning [ExplosionAI GmbH](https://explosion.ai/legal),
signed as follows:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

## Contributor Details

| Field                          | Entry                |
| ------------------------------ | -------------------- |
| Name                           | Ilkyu Ju             |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2020-05-17           |
| GitHub username                | osori                |
| Website (optional)             |                      |
.github/contributors/thoppe.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@

# spaCy contributor agreement

The same spaCy Contributor Agreement text as in `.github/contributors/ilivans.md`
above, with **"us"** meaning [ExplosionAI GmbH](https://explosion.ai/legal),
signed as follows:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

## Contributor Details

| Field                          | Entry                    |
| ------------------------------ | ------------------------ |
| Name                           | Travis Hoppe             |
| Company name (if applicable)   |                          |
| Title or role (if applicable)  | Data Scientist           |
| Date                           | 07 May 2020              |
| GitHub username                | thoppe                   |
| Website (optional)             | http://thoppe.github.io/ |
.github/contributors/vishnupriyavr.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@

# spaCy contributor agreement

The same spaCy Contributor Agreement text as in `.github/contributors/ilivans.md`
above, with **"us"** meaning
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal), signed as follows:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

## Contributor Details

| Field                          | Entry                |
| ------------------------------ | -------------------- |
| Name                           | Vishnu Priya VR      |
| Company name (if applicable)   | Uniphore             |
| Title or role (if applicable)  | NLP/AI Engineer      |
| Date                           | 2020-05-03           |
| GitHub username                | vishnupriyavr        |
| Website (optional)             |                      |
```diff
@@ -1,6 +1,7 @@
 """Prevent catastrophic forgetting with rehearsal updates."""
 import plac
 import random
+import warnings
 import srsly
 import spacy
 from spacy.gold import GoldParse
@@ -66,7 +67,10 @@ def main(model_name, unlabelled_loc):
     pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
     other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
     sizes = compounding(1.0, 4.0, 1.001)
-    with nlp.disable_pipes(*other_pipes):
+    with nlp.disable_pipes(*other_pipes) and warnings.catch_warnings():
+        # show warnings for misaligned entity spans once
+        warnings.filterwarnings("once", category=UserWarning, module='spacy')
+
         for itn in range(n_iter):
             random.shuffle(TRAIN_DATA)
             random.shuffle(raw_docs)
```
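The `with ... and ...` line above deserves a brief note: in Python, `with a and b:` enters only one of the two objects as a context manager (whichever the `and` expression evaluates to). The example still behaves as intended because `nlp.disable_pipes(...)` in spaCy v2.x takes effect as soon as it is called; the pipes are simply not restored automatically when the block ends. A more explicit equivalent, as a sketch:

```python
import warnings

import spacy

nlp = spacy.blank("en")
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]

# Nest the two context managers instead of combining them with `and`:
# the pipes are disabled for the duration of the block and restored afterwards,
# and the warning filter is scoped to the block as well.
with nlp.disable_pipes(*other_pipes):
    with warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module="spacy")
        pass  # training loop goes here
```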
```diff
@@ -64,7 +64,7 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
     """Create a blank model with the specified vocab, set up the pipeline and train the entity linker.
     The `vocab` should be the one used during creation of the KB."""
     vocab = Vocab().from_disk(vocab_path)
-    # create blank Language class with correct vocab
+    # create blank English model with correct vocab
     nlp = spacy.blank("en", vocab=vocab)
     nlp.vocab.vectors.name = "spacy_pretrained_vectors"
     print("Created blank 'en' model with vocab from '%s'" % vocab_path)
```
```diff
@@ -8,12 +8,13 @@ For more details, see the documentation:
 * NER: https://spacy.io/usage/linguistic-features#named-entities

 Compatible with: spaCy v2.0.0+
-Last tested with: v2.1.0
+Last tested with: v2.2.4
 """
 from __future__ import unicode_literals, print_function

 import plac
 import random
+import warnings
 from pathlib import Path
 import spacy
 from spacy.util import minibatch, compounding
@@ -57,7 +58,11 @@ def main(model=None, output_dir=None, n_iter=100):
     # get names of other pipes to disable them during training
     pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
     other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
-    with nlp.disable_pipes(*other_pipes):  # only train NER
+    # only train NER
+    with nlp.disable_pipes(*other_pipes) and warnings.catch_warnings():
+        # show warnings for misaligned entity spans once
+        warnings.filterwarnings("once", category=UserWarning, module='spacy')
+
         # reset and initialize the weights randomly – but only if we're
         # training a new model
         if model is None:
```
```diff
@@ -24,12 +24,13 @@ For more details, see the documentation:
 * NER: https://spacy.io/usage/linguistic-features#named-entities

 Compatible with: spaCy v2.1.0+
-Last tested with: v2.1.0
+Last tested with: v2.2.4
 """
 from __future__ import unicode_literals, print_function

 import plac
 import random
+import warnings
 from pathlib import Path
 import spacy
 from spacy.util import minibatch, compounding
@@ -97,7 +98,11 @@ def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
     # get names of other pipes to disable them during training
     pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
     other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
-    with nlp.disable_pipes(*other_pipes):  # only train NER
+    # only train NER
+    with nlp.disable_pipes(*other_pipes) and warnings.catch_warnings():
+        # show warnings for misaligned entity spans once
+        warnings.filterwarnings("once", category=UserWarning, module='spacy')
+
         sizes = compounding(1.0, 4.0, 1.001)
         # batch up the examples using spaCy's minibatch
         for itn in range(n_iter):
```
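The `UserWarning` being filtered in these training examples is typically the misaligned-entity warning (see the new W030 message further down). A minimal sketch (assuming spaCy v2.x) of checking an annotation's alignment up front with the helper that warning recommends:

```python
import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank("en")
doc = nlp.make_doc("Who is Shaka Khan?")
# Character-offset annotation for "Shaka Khan"; tokens that cannot be aligned
# would show up as "-" and be ignored during training.
tags = biluo_tags_from_offsets(doc, [(7, 17, "PERSON")])
print(tags)  # ['O', 'O', 'B-PERSON', 'L-PERSON', 'O']
```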
```diff
@@ -59,7 +59,7 @@ install_requires =

 [options.extras_require]
 lookups =
-    spacy_lookups_data>=0.0.5,<0.2.0
+    spacy_lookups_data>=0.3.1,<0.4.0
 cuda =
     cupy>=5.0.0b4,<9.0.0
 cuda80 =
```
|
||||||
break
|
break
|
||||||
|
|
||||||
|
|
||||||
def link_vectors_to_models(vocab):
|
def link_vectors_to_models(vocab, skip_rank=False):
|
||||||
vectors = vocab.vectors
|
vectors = vocab.vectors
|
||||||
if vectors.name is None:
|
if vectors.name is None:
|
||||||
vectors.name = VECTORS_KEY
|
vectors.name = VECTORS_KEY
|
||||||
if vectors.data.size != 0:
|
if vectors.data.size != 0:
|
||||||
warnings.warn(Warnings.W020.format(shape=vectors.data.shape))
|
warnings.warn(Warnings.W020.format(shape=vectors.data.shape))
|
||||||
ops = Model.ops
|
ops = Model.ops
|
||||||
|
if not skip_rank:
|
||||||
for word in vocab:
|
for word in vocab:
|
||||||
if word.orth in vectors.key2row:
|
if word.orth in vectors.key2row:
|
||||||
word.rank = vectors.key2row[word.orth]
|
word.rank = vectors.key2row[word.orth]
|
||||||
|
|
|
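For context on what the rank assignment guarded by the new `skip_rank` flag does, a small sketch (assuming spaCy v2.x and a vectors-enabled package such as `en_core_web_md`; the package choice is only for illustration):

```python
import spacy

nlp = spacy.load("en_core_web_md")
word = nlp.vocab["apple"]
# key2row maps a lexeme's hash to its row in the vectors table; link_vectors_to_models
# copies that row index onto lexeme.rank so vectors can be addressed by row.
print(word.rank, nlp.vocab.vectors.key2row.get(word.orth))
```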
```diff
@@ -15,7 +15,7 @@ cdef enum attr_id_t:
     LIKE_NUM
     LIKE_EMAIL
     IS_STOP
-    IS_OOV
+    IS_OOV_DEPRECATED
     IS_BRACKET
     IS_QUOTE
     IS_LEFT_PUNCT
```
```diff
@@ -16,7 +16,7 @@ IDS = {
     "LIKE_NUM": LIKE_NUM,
     "LIKE_EMAIL": LIKE_EMAIL,
     "IS_STOP": IS_STOP,
-    "IS_OOV": IS_OOV,
+    "IS_OOV_DEPRECATED": IS_OOV_DEPRECATED,
     "IS_BRACKET": IS_BRACKET,
     "IS_QUOTE": IS_QUOTE,
     "IS_LEFT_PUNCT": IS_LEFT_PUNCT,
```
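Renaming the old flag to `IS_OOV_DEPRECATED` frees `IS_OOV` up for new semantics. A small sketch of the user-facing effect (an assumption based on the related v2.3 changes, not something shown in this diff): `is_oov` is now derived from the vectors table rather than from the stored lexeme flag.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("hello world")
# A blank pipeline has no vectors table, so every token is reported as OOV
# under the vector-based definition.
print([(token.text, token.is_oov) for token in doc])
```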
```diff
@@ -187,12 +187,17 @@ def debug_data(
         n_missing_vectors = sum(gold_train_data["words_missing_vectors"].values())
         msg.warn(
             "{} words in training data without vectors ({:0.2f}%)".format(
-                n_missing_vectors,
-                n_missing_vectors / gold_train_data["n_words"],
+                n_missing_vectors, n_missing_vectors / gold_train_data["n_words"],
             ),
         )
         msg.text(
-            "10 most common words without vectors: {}".format(_format_labels(gold_train_data["words_missing_vectors"].most_common(10), counts=True)), show=verbose,
+            "10 most common words without vectors: {}".format(
+                _format_labels(
+                    gold_train_data["words_missing_vectors"].most_common(10),
+                    counts=True,
+                )
+            ),
+            show=verbose,
         )
     else:
         msg.info("No word vectors present in the model")
```
```diff
@@ -2,7 +2,6 @@
 from __future__ import unicode_literals, division, print_function

 import plac
-import spacy
 from timeit import default_timer as timer
 from wasabi import msg

@@ -45,7 +44,7 @@ def evaluate(
         msg.fail("Visualization output directory not found", displacy_path, exits=1)
     corpus = GoldCorpus(data_path, data_path)
     if model.startswith("blank:"):
-        nlp = spacy.blank(model.replace("blank:", ""))
+        nlp = util.get_lang_class(model.replace("blank:", ""))()
     else:
         nlp = util.load_model(model)
     dev_docs = list(corpus.dev_docs(nlp, gold_preproc=gold_preproc))
```
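Both constructions produce an empty pipeline for the given language code; a minimal sketch of the equivalence (assuming spaCy v2.x). Calling `util.get_lang_class` directly lets the CLI drop its `import spacy`, which is exactly what the hunk above does.

```python
import spacy
from spacy import util

nlp_a = spacy.blank("en")            # thin wrapper around the language class
nlp_b = util.get_lang_class("en")()  # what the evaluate command now calls directly
print(type(nlp_a) is type(nlp_b))    # True: both are blank English pipelines
```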
```diff
@@ -17,7 +17,9 @@ from wasabi import msg

 from ..vectors import Vectors
 from ..errors import Errors, Warnings
-from ..util import ensure_path, get_lang_class, OOV_RANK
+from ..util import ensure_path, get_lang_class, load_model, OOV_RANK
+from ..lookups import Lookups


 try:
     import ftfy
@@ -49,6 +51,8 @@ DEFAULT_OOV_PROB = -20
         str,
     ),
     model_name=("Optional name for the model meta", "option", "mn", str),
+    omit_extra_lookups=("Don't include extra lookups in model", "flag", "OEL", bool),
+    base_model=("Base model (for languages with custom tokenizers)", "option", "b", str),
 )
 def init_model(
     lang,
@@ -61,6 +65,8 @@ def init_model(
     prune_vectors=-1,
     vectors_name=None,
     model_name=None,
+    omit_extra_lookups=False,
+    base_model=None,
 ):
     """
     Create a new model from raw data, like word frequencies, Brown clusters
@@ -92,7 +98,16 @@ def init_model(
         lex_attrs = read_attrs_from_deprecated(freqs_loc, clusters_loc)

     with msg.loading("Creating model..."):
-        nlp = create_model(lang, lex_attrs, name=model_name)
+        nlp = create_model(lang, lex_attrs, name=model_name, base_model=base_model)
+
+    # Create empty extra lexeme tables so the data from spacy-lookups-data
+    # isn't loaded if these features are accessed
+    if omit_extra_lookups:
+        nlp.vocab.lookups_extra = Lookups()
+        nlp.vocab.lookups_extra.add_table("lexeme_cluster")
+        nlp.vocab.lookups_extra.add_table("lexeme_prob")
+        nlp.vocab.lookups_extra.add_table("lexeme_settings")
+
     msg.good("Successfully created model")
     if vectors_loc is not None:
         add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, vectors_name)
@@ -152,20 +167,23 @@ def read_attrs_from_deprecated(freqs_loc, clusters_loc):
     return lex_attrs


-def create_model(lang, lex_attrs, name=None):
-    lang_class = get_lang_class(lang)
-    nlp = lang_class()
+def create_model(lang, lex_attrs, name=None, base_model=None):
+    if base_model:
+        nlp = load_model(base_model)
+        # keep the tokenizer but remove any existing pipeline components due to
+        # potentially conflicting vectors
+        for pipe in nlp.pipe_names:
+            nlp.remove_pipe(pipe)
+    else:
+        lang_class = get_lang_class(lang)
+        nlp = lang_class()
     for lexeme in nlp.vocab:
         lexeme.rank = OOV_RANK
-    lex_added = 0
     for attrs in lex_attrs:
         if "settings" in attrs:
             continue
         lexeme = nlp.vocab[attrs["orth"]]
         lexeme.set_attrs(**attrs)
-        lexeme.is_oov = False
-        lex_added += 1
-        lex_added += 1
     if len(nlp.vocab):
         oov_prob = min(lex.prob for lex in nlp.vocab) - 1
     else:
@@ -181,7 +199,7 @@ def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None):
     if vectors_loc and vectors_loc.parts[-1].endswith(".npz"):
         nlp.vocab.vectors = Vectors(data=numpy.load(vectors_loc.open("rb")))
         for lex in nlp.vocab:
-            if lex.rank:
+            if lex.rank and lex.rank != OOV_RANK:
                 nlp.vocab.vectors.add(lex.orth, row=lex.rank)
     else:
         if vectors_loc:
@@ -193,8 +211,7 @@ def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None):
             if vector_keys is not None:
                 for word in vector_keys:
                     if word not in nlp.vocab:
-                        lexeme = nlp.vocab[word]
-                        lexeme.is_oov = False
+                        nlp.vocab[word]
             if vectors_data is not None:
                 nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys)
     if name is None:
```
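A sketch of what the new `base_model` path buys you (the package name below is hypothetical): the base model's tokenizer and language defaults are kept, while any trained components are removed so they cannot conflict with the freshly initialized vocab and vectors, mirroring the loop added to `create_model` above.

```python
import spacy

# Hypothetical package with a custom tokenizer; stands in for whatever is passed
# via the new base_model option.
nlp = spacy.load("xx_custom_tokenizer_model")
for pipe in list(nlp.pipe_names):
    nlp.remove_pipe(pipe)
print(nlp.pipe_names)  # [] (only the tokenizer, vocab and language defaults remain)
```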
```diff
@@ -15,9 +15,9 @@ import random

 from .._ml import create_default_optimizer
 from ..util import use_gpu as set_gpu
-from ..attrs import PROB, IS_OOV, CLUSTER, LANG
 from ..gold import GoldCorpus
 from ..compat import path2str
+from ..lookups import Lookups
 from .. import util
 from .. import about

@@ -58,6 +58,7 @@ from .. import about
     textcat_arch=("Textcat model architecture", "option", "ta", str),
     textcat_positive_label=("Textcat positive label for binary classes with two labels", "option", "tpl", str),
     tag_map_path=("Location of JSON-formatted tag map", "option", "tm", Path),
+    omit_extra_lookups=("Don't include extra lookups in model", "flag", "OEL", bool),
     verbose=("Display more information for debug", "flag", "VV", bool),
     debug=("Run data diagnostics before training", "flag", "D", bool),
     # fmt: on
@@ -97,6 +98,7 @@ def train(
     textcat_arch="bow",
     textcat_positive_label=None,
     tag_map_path=None,
+    omit_extra_lookups=False,
     verbose=False,
     debug=False,
 ):
@@ -248,6 +250,14 @@ def train(
         # Update tag map with provided mapping
         nlp.vocab.morphology.tag_map.update(tag_map)

+    # Create empty extra lexeme tables so the data from spacy-lookups-data
+    # isn't loaded if these features are accessed
+    if omit_extra_lookups:
+        nlp.vocab.lookups_extra = Lookups()
+        nlp.vocab.lookups_extra.add_table("lexeme_cluster")
+        nlp.vocab.lookups_extra.add_table("lexeme_prob")
+        nlp.vocab.lookups_extra.add_table("lexeme_settings")
+
     if vectors:
         msg.text("Loading vector from model '{}'".format(vectors))
         _load_vectors(nlp, vectors)
@@ -630,15 +640,6 @@ def _create_progress_bar(total):

 def _load_vectors(nlp, vectors):
     util.load_model(vectors, vocab=nlp.vocab)
-    for lex in nlp.vocab:
-        values = {}
-        for attr, func in nlp.vocab.lex_attr_getters.items():
-            # These attrs are expected to be set by data. Others should
-            # be set by calling the language functions.
-            if attr not in (CLUSTER, PROB, IS_OOV, LANG):
-                values[lex.vocab.strings[attr]] = func(lex.orth_)
-        lex.set_attrs(**values)
-        lex.is_oov = False


 def _load_pretrained_tok2vec(nlp, loc):
```
@@ -1,12 +1,16 @@
 # coding: utf8
 from __future__ import unicode_literals


 def add_codes(err_cls):
     """Add error codes to string messages via class attribute names."""

-    class ErrorsWithCodes(object):
+    class ErrorsWithCodes(err_cls):
         def __getattribute__(self, code):
-            msg = getattr(err_cls, code)
+            msg = super(ErrorsWithCodes, self).__getattribute__(code)
+            if code.startswith("__"):  # python system attributes like __class__
+                return msg
+            else:
                 return "[{code}] {msg}".format(code=code, msg=msg)

     return ErrorsWithCodes()
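To make the add_codes change above concrete, a small self-contained sketch of how the wrapper behaves (the Demo class is made up for illustration and is not part of spaCy):

    def add_codes(err_cls):
        class ErrorsWithCodes(err_cls):
            def __getattribute__(self, code):
                msg = super(ErrorsWithCodes, self).__getattribute__(code)
                if code.startswith("__"):  # leave __class__, __dict__, ... alone
                    return msg
                return "[{code}] {msg}".format(code=code, msg=msg)
        return ErrorsWithCodes()

    @add_codes
    class Demo(object):
        E001 = "Something failed: {detail}"

    print(Demo.E001.format(detail="no parser"))  # [E001] Something failed: no parser
    print(Demo.__class__.__name__)               # ErrorsWithCodes (dunder passthrough)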
@@ -106,6 +110,11 @@ class Warnings(object):
            "in problems with the vocab further on in the pipeline.")
    W029 = ("Unable to align tokens with entities from character offsets. "
            "Discarding entity annotation for the text: {text}.")
+   W030 = ("Some entities could not be aligned in the text \"{text}\" with "
+           "entities \"{entities}\". Use "
+           "`spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)`"
+           " to check the alignment. Misaligned entities ('-') will be "
+           "ignored during training.")


 @add_codes
@@ -555,6 +564,9 @@ class Errors(object):
    E195 = ("Matcher can be called on {good} only, got {got}.")
    E196 = ("Refusing to write to token.is_sent_end. Sentence boundaries can "
            "only be fixed with token.is_sent_start.")
+   E197 = ("Row out of bounds, unable to add row {row} for key {key}.")
+   E198 = ("Unable to return {n} most similar vectors for the current vectors "
+           "table, which contains {n_rows} vectors.")


 @add_codes
@@ -658,7 +658,15 @@ cdef class GoldParse:
            entdoc = None

        # avoid allocating memory if the doc does not contain any tokens
-       if self.length > 0:
+       if self.length == 0:
+           self.words = []
+           self.tags = []
+           self.heads = []
+           self.labels = []
+           self.ner = []
+           self.morphology = []
+
+       else:
            if words is None:
                words = [token.text for token in doc]
            if tags is None:
@@ -949,6 +957,12 @@ def biluo_tags_from_offsets(doc, entities, missing="O"):
                break
        else:
            biluo[token.i] = missing
+   if "-" in biluo:
+       ent_str = str(entities)
+       warnings.warn(Warnings.W030.format(
+           text=doc.text[:50] + "..." if len(doc.text) > 50 else doc.text,
+           entities=ent_str[:50] + "..." if len(ent_str) > 50 else ent_str
+       ))
    return biluo

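The W030 message added above tells users to inspect the alignment themselves; a short sketch of that check (spaCy 2.x API, with made-up example text and offsets):

    import spacy
    from spacy.gold import biluo_tags_from_offsets

    nlp = spacy.blank("en")
    text = "I like London."
    print(biluo_tags_from_offsets(nlp.make_doc(text), [(7, 13, "GPE")]))
    # ['O', 'O', 'U-GPE', 'O'] -- the offsets line up with the token "London"
    print(biluo_tags_from_offsets(nlp.make_doc(text), [(2, 8, "GPE")]))
    # misaligned offsets come back as '-' and are ignored during training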
@@ -6,7 +6,7 @@ from libcpp.vector cimport vector
 from libc.stdint cimport int32_t, int64_t
 from libc.stdio cimport FILE

-from spacy.vocab cimport Vocab
+from .vocab cimport Vocab
 from .typedefs cimport hash_t

 from .structs cimport KBEntryC, AliasC

@@ -169,4 +169,3 @@ cdef class Reader:
    cdef int read_alias(self, int64_t* entry_index, float* prob) except -1

    cdef int _read(self, void* value, size_t size) except -1
-
25
spacy/kb.pyx

@@ -1,23 +1,20 @@
 # cython: infer_types=True
 # cython: profile=True
 # coding: utf8
-import warnings
-
-from spacy.errors import Errors, Warnings
-
-from pathlib import Path
 from cymem.cymem cimport Pool
 from preshed.maps cimport PreshMap

 from cpython.exc cimport PyErr_SetFromErrno

 from libc.stdio cimport fopen, fclose, fread, fwrite, feof, fseek
 from libc.stdint cimport int32_t, int64_t
+from libcpp.vector cimport vector
+
+import warnings
+from os import path
+from pathlib import Path

 from .typedefs cimport hash_t

-from os import path
-from libcpp.vector cimport vector
+from .errors import Errors, Warnings


 cdef class Candidate:
@@ -448,10 +445,10 @@ cdef class KnowledgeBase:

 cdef class Writer:
    def __init__(self, object loc):
-       if path.exists(loc):
-           assert not path.isdir(loc), "%s is directory." % loc
        if isinstance(loc, Path):
            loc = bytes(loc)
+       if path.exists(loc):
+           assert not path.isdir(loc), "%s is directory." % loc
        cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
        self._fp = fopen(<char*>bytes_loc, 'wb')
        if not self._fp:
@@ -493,10 +490,10 @@ cdef class Writer:

 cdef class Reader:
    def __init__(self, object loc):
-       assert path.exists(loc)
-       assert not path.isdir(loc)
        if isinstance(loc, Path):
            loc = bytes(loc)
+       assert path.exists(loc)
+       assert not path.isdir(loc)
        cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
        self._fp = fopen(<char*>bytes_loc, 'rb')
        if not self._fp:

@@ -586,5 +583,3 @@ cdef class Reader:
    cdef int _read(self, void* value, size_t size) except -1:
        status = fread(value, size, 1, self._fp)
        return status
-
-
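The Writer/Reader changes above only run the path checks after a pathlib.Path has been converted to bytes; a minimal sketch of that normalization order (normalize_loc is a hypothetical helper written for illustration, not spaCy code):

    from os import path
    from pathlib import Path

    def normalize_loc(loc):
        # Convert Path objects first so the os.path checks always see str/bytes.
        if isinstance(loc, Path):
            loc = bytes(loc)
        assert path.exists(loc)
        assert not path.isdir(loc), "%s is directory." % loc
        return loc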
@@ -2,7 +2,6 @@
 from __future__ import unicode_literals

 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .norm_exceptions import NORM_EXCEPTIONS
 from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS

@@ -10,19 +9,15 @@ from .morph_rules import MORPH_RULES
 from ..tag_map import TAG_MAP

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG, NORM
-from ...util import update_exc, add_lookups
+from ...attrs import LANG
+from ...util import update_exc


 class DanishDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "da"
-   lex_attr_getters[NORM] = add_lookups(
-       Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
-   )
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    morph_rules = MORPH_RULES
    infixes = TOKENIZER_INFIXES
@@ -1,527 +0,0 @@
-# coding: utf8
-"""
-Special-case rules for normalizing tokens to improve the model's predictions.
-For example 'mysterium' vs 'mysterie' and similar.
-"""
-from __future__ import unicode_literals
-
-
-# Sources:
-# 1: https://dsn.dk/retskrivning/om-retskrivningsordbogen/mere-om-retskrivningsordbogen-2012/endrede-stave-og-ordformer/
-# 2: http://www.tjerry-korrektur.dk/ord-med-flere-stavemaader/
-
-_exc = {
-    # Alternative spelling
-    "a-kraft-værk": "a-kraftværk", # 1
-    "ålborg": "aalborg", # 2
-    "århus": "aarhus",
-    "accessoirer": "accessoires", # 1
[... the remainder of the roughly 500 removed Danish normalization entries, "affektert" through "øvehefte", is omitted here for length; the whole file is deleted in this commit ...]
-    "øvehefte": "øvehæfte", # 1
-}
-
-
-NORM_EXCEPTIONS = {}
-
-for string, norm in _exc.items():
-    NORM_EXCEPTIONS[string] = norm
-    NORM_EXCEPTIONS[string.title()] = norm
@@ -6,7 +6,7 @@ Source: https://forkortelse.dk/ and various others.

 from __future__ import unicode_literals

-from ...symbols import ORTH, LEMMA, NORM, TAG, PUNCT
+from ...symbols import ORTH, LEMMA, NORM


 _exc = {}

@@ -52,7 +52,7 @@ for exc_data in [
    {ORTH: "Ons.", LEMMA: "onsdag"},
    {ORTH: "Fre.", LEMMA: "fredag"},
    {ORTH: "Lør.", LEMMA: "lørdag"},
-   {ORTH: "og/eller", LEMMA: "og/eller", NORM: "og/eller", TAG: "CC"},
+   {ORTH: "og/eller", LEMMA: "og/eller", NORM: "og/eller"},
 ]:
    _exc[exc_data[ORTH]] = [exc_data]

@@ -577,7 +577,7 @@ for h in range(1, 31 + 1):
    for period in ["."]:
        _exc["%d%s" % (h, period)] = [{ORTH: "%d." % h}]

-_custom_base_exc = {"i.": [{ORTH: "i", LEMMA: "i", NORM: "i"}, {ORTH: ".", TAG: PUNCT}]}
+_custom_base_exc = {"i.": [{ORTH: "i", LEMMA: "i", NORM: "i"}, {ORTH: "."}]}
 _exc.update(_custom_base_exc)

 TOKENIZER_EXCEPTIONS = _exc
@@ -2,7 +2,6 @@
 from __future__ import unicode_literals

 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .norm_exceptions import NORM_EXCEPTIONS
 from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
 from .punctuation import TOKENIZER_INFIXES
 from .tag_map import TAG_MAP

@@ -10,18 +9,14 @@ from .stop_words import STOP_WORDS
 from .syntax_iterators import SYNTAX_ITERATORS

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG, NORM
-from ...util import update_exc, add_lookups
+from ...attrs import LANG
+from ...util import update_exc


 class GermanDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: "de"
-   lex_attr_getters[NORM] = add_lookups(
-       Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
-   )
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    prefixes = TOKENIZER_PREFIXES
    suffixes = TOKENIZER_SUFFIXES
@@ -1,16 +0,0 @@
-# coding: utf8
-from __future__ import unicode_literals
-
-# Here we only want to include the absolute most common words. Otherwise,
-# this list would get impossibly long for German – especially considering the
-# old vs. new spelling rules, and all possible cases.
-
-
-_exc = {"daß": "dass"}
-
-
-NORM_EXCEPTIONS = {}
-
-for string, norm in _exc.items():
-    NORM_EXCEPTIONS[string] = norm
-    NORM_EXCEPTIONS[string.title()] = norm
@@ -2,9 +2,10 @@
 from __future__ import unicode_literals

 from ...symbols import NOUN, PROPN, PRON
+from ...errors import Errors


-def noun_chunks(obj):
+def noun_chunks(doclike):
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    """

@@ -27,13 +28,17 @@ def noun_chunks(obj):
        "og",
        "app",
    ]
-   doc = obj.doc  # Ensure works on both Doc and Span.
+   doc = doclike.doc  # Ensure works on both Doc and Span.
+
+   if not doc.is_parsed:
+       raise ValueError(Errors.E029)
+
    np_label = doc.vocab.strings.add("NP")
    np_deps = set(doc.vocab.strings.add(label) for label in labels)
    close_app = doc.vocab.strings.add("nk")

    rbracket = 0
-   for i, word in enumerate(obj):
+   for i, word in enumerate(doclike):
        if i < rbracket:
            continue
        if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps:
@@ -10,21 +10,16 @@ from .lemmatizer import GreekLemmatizer
 from .syntax_iterators import SYNTAX_ITERATORS
 from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from .norm_exceptions import NORM_EXCEPTIONS
-from ..norm_exceptions import BASE_NORMS
 from ...language import Language
 from ...lookups import Lookups
-from ...attrs import LANG, NORM
-from ...util import update_exc, add_lookups
+from ...attrs import LANG
+from ...util import update_exc


 class GreekDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "el"
-   lex_attr_getters[NORM] = add_lookups(
-       Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
-   )
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
    tag_map = TAG_MAP

File diff suppressed because it is too large
@@ -2,9 +2,10 @@
 from __future__ import unicode_literals

 from ...symbols import NOUN, PROPN, PRON
+from ...errors import Errors


-def noun_chunks(obj):
+def noun_chunks(doclike):
    """
    Detect base noun phrases. Works on both Doc and Span.
    """

@@ -13,34 +14,34 @@ def noun_chunks(obj):
    # obj tag corrects some DEP tagger mistakes.
    # Further improvement of the models will eliminate the need for this tag.
    labels = ["nsubj", "obj", "iobj", "appos", "ROOT", "obl"]
-   doc = obj.doc  # Ensure works on both Doc and Span.
+   doc = doclike.doc  # Ensure works on both Doc and Span.
+
+   if not doc.is_parsed:
+       raise ValueError(Errors.E029)
+
    np_deps = [doc.vocab.strings.add(label) for label in labels]
    conj = doc.vocab.strings.add("conj")
    nmod = doc.vocab.strings.add("nmod")
    np_label = doc.vocab.strings.add("NP")
-   seen = set()
-   for i, word in enumerate(obj):
+   prev_end = -1
+   for i, word in enumerate(doclike):
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
-       if word.i in seen:
+       if word.left_edge.i <= prev_end:
            continue
        if word.dep in np_deps:
-           if any(w.i in seen for w in word.subtree):
-               continue
            flag = False
            if word.pos == NOUN:
                # check for patterns such as γραμμή παραγωγής
                for potential_nmod in word.rights:
                    if potential_nmod.dep == nmod:
-                       seen.update(
-                           j for j in range(word.left_edge.i, potential_nmod.i + 1)
-                       )
+                       prev_end = potential_nmod.i
                        yield word.left_edge.i, potential_nmod.i + 1, np_label
                        flag = True
                        break
            if flag is False:
-               seen.update(j for j in range(word.left_edge.i, word.i + 1))
+               prev_end = word.i
                yield word.left_edge.i, word.i + 1, np_label
        elif word.dep == conj:
            # covers the case: έχει όμορφα και έξυπνα παιδιά

@@ -49,9 +50,7 @@ def noun_chunks(obj):
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
-               if any(w.i in seen for w in word.subtree):
-                   continue
-               seen.update(j for j in range(word.left_edge.i, word.i + 1))
+               prev_end = word.i
                yield word.left_edge.i, word.i + 1, np_label
@@ -2,7 +2,6 @@
 from __future__ import unicode_literals

 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .norm_exceptions import NORM_EXCEPTIONS
 from .tag_map import TAG_MAP
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS

@@ -10,10 +9,9 @@ from .morph_rules import MORPH_RULES
 from .syntax_iterators import SYNTAX_ITERATORS

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG, NORM
-from ...util import update_exc, add_lookups
+from ...attrs import LANG
+from ...util import update_exc


 def _return_en(_):

@@ -24,9 +22,6 @@ class EnglishDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = _return_en
-   lex_attr_getters[NORM] = add_lookups(
-       Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
-   )
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    tag_map = TAG_MAP
    stop_words = STOP_WORDS

File diff suppressed because it is too large
@@ -2,9 +2,10 @@
 from __future__ import unicode_literals

 from ...symbols import NOUN, PROPN, PRON
+from ...errors import Errors


-def noun_chunks(obj):
+def noun_chunks(doclike):
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    """

@@ -19,21 +20,23 @@ def noun_chunks(obj):
        "attr",
        "ROOT",
    ]
-   doc = obj.doc  # Ensure works on both Doc and Span.
+   doc = doclike.doc  # Ensure works on both Doc and Span.
+
+   if not doc.is_parsed:
+       raise ValueError(Errors.E029)
+
    np_deps = [doc.vocab.strings.add(label) for label in labels]
    conj = doc.vocab.strings.add("conj")
    np_label = doc.vocab.strings.add("NP")
-   seen = set()
-   for i, word in enumerate(obj):
+   prev_end = -1
+   for i, word in enumerate(doclike):
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
-       if word.i in seen:
+       if word.left_edge.i <= prev_end:
            continue
        if word.dep in np_deps:
-           if any(w.i in seen for w in word.subtree):
-               continue
-           seen.update(j for j in range(word.left_edge.i, word.i + 1))
+           prev_end = word.i
            yield word.left_edge.i, word.i + 1, np_label
        elif word.dep == conj:
            head = word.head

@@ -41,9 +44,7 @@ def noun_chunks(obj):
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
-               if any(w.i in seen for w in word.subtree):
-                   continue
-               seen.update(j for j in range(word.left_edge.i, word.i + 1))
+               prev_end = word.i
                yield word.left_edge.i, word.i + 1, np_label
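The seen-set to prev_end change above (mirrored in the other noun_chunks iterators in this commit) keeps only the end index of the last emitted chunk instead of a set of all covered tokens; a tiny standalone sketch of that bookkeeping with made-up (left_edge, word_index) pairs:

    def non_overlapping(candidates):
        prev_end = -1
        for left_edge, i in candidates:
            if left_edge <= prev_end:  # would nest inside the previous chunk
                continue
            prev_end = i
            yield left_edge, i + 1

    print(list(non_overlapping([(0, 1), (1, 1), (3, 5), (4, 5)])))
    # [(0, 2), (3, 6)] -- nested candidates are skipped, no index set needed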
@@ -77,12 +77,12 @@ for pron in ["i", "you", "he", "she", "it", "we", "they"]:

    _exc[orth + "'d"] = [
        {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
-       {ORTH: "'d", LEMMA: "would", NORM: "would", TAG: "MD"},
+       {ORTH: "'d", NORM: "'d"},
    ]

    _exc[orth + "d"] = [
        {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
-       {ORTH: "d", LEMMA: "would", NORM: "would", TAG: "MD"},
+       {ORTH: "d", NORM: "'d"},
    ]

    _exc[orth + "'d've"] = [

@@ -195,7 +195,10 @@ for word in ["who", "what", "when", "where", "why", "how", "there", "that"]:
        {ORTH: "'d", NORM: "'d"},
    ]

-   _exc[orth + "d"] = [{ORTH: orth, LEMMA: word, NORM: word}, {ORTH: "d"}]
+   _exc[orth + "d"] = [
+       {ORTH: orth, LEMMA: word, NORM: word},
+       {ORTH: "d", NORM: "'d"},
+   ]

    _exc[orth + "'d've"] = [
        {ORTH: orth, LEMMA: word, NORM: word},
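At runtime, entries like the ones above become tokenizer special cases; a hedged sketch using the public add_special_case API (spaCy 2.x; "whod" is one of the keys generated by the loop shown in the diff):

    import spacy
    from spacy.attrs import ORTH, NORM

    nlp = spacy.blank("en")
    nlp.tokenizer.add_special_case("whod", [{ORTH: "who"}, {ORTH: "d", NORM: "'d"}])
    doc = nlp("whod have thought")
    print([(t.text, t.norm_) for t in doc[:2]])  # [('who', 'who'), ('d', "'d")]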
@@ -5,7 +5,6 @@ from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES
 from ..char_classes import LIST_ICONS, CURRENCY, LIST_UNITS, PUNCT
 from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA
 from ..char_classes import merge_chars
-from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES


 _list_units = [u for u in LIST_UNITS if u != "%"]
@@ -2,10 +2,15 @@
 from __future__ import unicode_literals

 from ...symbols import NOUN, PROPN, PRON, VERB, AUX
+from ...errors import Errors


-def noun_chunks(obj):
-   doc = obj.doc
+def noun_chunks(doclike):
+   doc = doclike.doc
+
+   if not doc.is_parsed:
+       raise ValueError(Errors.E029)
+
    if not len(doc):
        return
    np_label = doc.vocab.strings.add("NP")

@@ -16,7 +21,7 @@ def noun_chunks(obj):
    np_right_deps = [doc.vocab.strings.add(label) for label in right_labels]
    stop_deps = [doc.vocab.strings.add(label) for label in stop_labels]
    token = doc[0]
-   while token and token.i < len(doc):
+   while token and token.i < len(doclike):
        if token.pos in [PROPN, NOUN, PRON]:
            left, right = noun_bounds(
                doc, token, np_left_deps, np_right_deps, stop_deps
@@ -10,6 +10,7 @@ from .lex_attrs import LEX_ATTRS
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .tag_map import TAG_MAP
 from .punctuation import TOKENIZER_SUFFIXES
+from .syntax_iterators import SYNTAX_ITERATORS


 class PersianDefaults(Language.Defaults):

@@ -24,6 +25,7 @@ class PersianDefaults(Language.Defaults):
    tag_map = TAG_MAP
    suffixes = TOKENIZER_SUFFIXES
    writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}
+   syntax_iterators = SYNTAX_ITERATORS


 class Persian(Language):
@@ -2,9 +2,10 @@
 from __future__ import unicode_literals

 from ...symbols import NOUN, PROPN, PRON
+from ...errors import Errors


-def noun_chunks(obj):
+def noun_chunks(doclike):
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    """

@@ -19,21 +20,23 @@ def noun_chunks(obj):
        "attr",
        "ROOT",
    ]
-   doc = obj.doc  # Ensure works on both Doc and Span.
+   doc = doclike.doc  # Ensure works on both Doc and Span.
+
+   if not doc.is_parsed:
+       raise ValueError(Errors.E029)
+
    np_deps = [doc.vocab.strings.add(label) for label in labels]
    conj = doc.vocab.strings.add("conj")
    np_label = doc.vocab.strings.add("NP")
-   seen = set()
-   for i, word in enumerate(obj):
+   prev_end = -1
+   for i, word in enumerate(doclike):
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
-       if word.i in seen:
+       if word.left_edge.i <= prev_end:
            continue
        if word.dep in np_deps:
-           if any(w.i in seen for w in word.subtree):
-               continue
-           seen.update(j for j in range(word.left_edge.i, word.i + 1))
+           prev_end = word.i
            yield word.left_edge.i, word.i + 1, np_label
        elif word.dep == conj:
            head = word.head

@@ -41,9 +44,7 @@ def noun_chunks(obj):
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
-               if any(w.i in seen for w in word.subtree):
-                   continue
-               seen.update(j for j in range(word.left_edge.i, word.i + 1))
+               prev_end = word.i
                yield word.left_edge.i, word.i + 1, np_label
@@ -2,9 +2,10 @@
 from __future__ import unicode_literals

 from ...symbols import NOUN, PROPN, PRON
+from ...errors import Errors


-def noun_chunks(obj):
+def noun_chunks(doclike):
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    """

@@ -18,21 +19,23 @@ def noun_chunks(obj):
        "nmod",
        "nmod:poss",
    ]
-   doc = obj.doc  # Ensure works on both Doc and Span.
+   doc = doclike.doc  # Ensure works on both Doc and Span.
+
+   if not doc.is_parsed:
+       raise ValueError(Errors.E029)
+
    np_deps = [doc.vocab.strings[label] for label in labels]
    conj = doc.vocab.strings.add("conj")
    np_label = doc.vocab.strings.add("NP")
-   seen = set()
-   for i, word in enumerate(obj):
+   prev_end = -1
+   for i, word in enumerate(doclike):
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
-       if word.i in seen:
+       if word.left_edge.i <= prev_end:
            continue
        if word.dep in np_deps:
-           if any(w.i in seen for w in word.subtree):
-               continue
-           seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+           prev_end = word.right_edge.i
            yield word.left_edge.i, word.right_edge.i + 1, np_label
        elif word.dep == conj:
            head = word.head

@@ -40,9 +43,7 @@ def noun_chunks(obj):
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
-               if any(w.i in seen for w in word.subtree):
-                   continue
-               seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+               prev_end = word.right_edge.i
                yield word.left_edge.i, word.right_edge.i + 1, np_label
@@ -1,11 +1,12 @@
+# coding: utf8
+from __future__ import unicode_literals
+
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .tag_map import TAG_MAP


 from ...attrs import LANG
 from ...language import Language
-from ...tokens import Doc


 class ArmenianDefaults(Language.Defaults):
@@ -1,6 +1,6 @@
+# coding: utf8
 from __future__ import unicode_literals


 """
 Example sentences to test spaCy and its language models.

 >>> from spacy.lang.hy.examples import sentences
@@ -1,3 +1,4 @@
+# coding: utf8
 from __future__ import unicode_literals

 from ...attrs import LIKE_NUM
@@ -1,6 +1,6 @@
+# coding: utf8
 from __future__ import unicode_literals


 STOP_WORDS = set(
     """
 նա
@@ -1,7 +1,7 @@
 # coding: utf8
 from __future__ import unicode_literals

-from ...symbols import POS, SYM, ADJ, NUM, DET, ADV, ADP, X, VERB, NOUN
+from ...symbols import POS, ADJ, NUM, DET, ADV, ADP, X, VERB, NOUN
 from ...symbols import PROPN, PART, INTJ, PRON, SCONJ, AUX, CCONJ

 TAG_MAP = {
@@ -716,7 +716,7 @@ TAG_MAP = {
         POS: NOUN,
         "Animacy": "Nhum",
         "Case": "Dat",
-        "Number": "Coll",
+        # "Number": "Coll",
         "Number": "Sing",
         "Person": "1",
     },
@@ -815,7 +815,7 @@ TAG_MAP = {
         "Animacy": "Nhum",
         "Case": "Nom",
         "Definite": "Def",
-        "Number": "Plur",
+        # "Number": "Plur",
         "Number": "Sing",
         "Poss": "Yes",
     },
@@ -880,7 +880,7 @@ TAG_MAP = {
         POS: NOUN,
         "Animacy": "Nhum",
         "Case": "Nom",
-        "Number": "Plur",
+        # "Number": "Plur",
         "Number": "Sing",
         "Person": "2",
     },
@@ -1223,9 +1223,9 @@ TAG_MAP = {
     "PRON_Case=Nom|Number=Sing|Number=Plur|Person=3|Person=1|PronType=Emp": {
         POS: PRON,
         "Case": "Nom",
-        "Number": "Sing",
+        # "Number": "Sing",
         "Number": "Plur",
-        "Person": "3",
+        # "Person": "3",
         "Person": "1",
         "PronType": "Emp",
     },
@@ -4,25 +4,20 @@ from __future__ import unicode_literals
 from .stop_words import STOP_WORDS
 from .punctuation import TOKENIZER_SUFFIXES, TOKENIZER_PREFIXES, TOKENIZER_INFIXES
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .norm_exceptions import NORM_EXCEPTIONS
 from .lex_attrs import LEX_ATTRS
 from .syntax_iterators import SYNTAX_ITERATORS
 from .tag_map import TAG_MAP

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG, NORM
-from ...util import update_exc, add_lookups
+from ...attrs import LANG
+from ...util import update_exc


 class IndonesianDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters[LANG] = lambda text: "id"
     lex_attr_getters.update(LEX_ATTRS)
-    lex_attr_getters[NORM] = add_lookups(
-        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
-    )
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
     prefixes = TOKENIZER_PREFIXES
@@ -1,532 +0,0 @@
# coding: utf8
from __future__ import unicode_literals

# Daftar kosakata yang sering salah dieja
# https://id.wikipedia.org/wiki/Wikipedia:Daftar_kosakata_bahasa_Indonesia_yang_sering_salah_dieja
_exc = {
    # Slang and abbreviations
    "silahkan": "silakan", "yg": "yang", "kalo": "kalau", "cawu": "caturwulan", "ok": "oke",
    "gak": "tidak", "enggak": "tidak", "nggak": "tidak", "ndak": "tidak", "ngga": "tidak",
    "dgn": "dengan", "tdk": "tidak", "jg": "juga", "klo": "kalau", "denger": "dengar",
    "pinter": "pintar", "krn": "karena", "nemuin": "menemukan", "jgn": "jangan", "udah": "sudah",
    "sy": "saya", "udh": "sudah", "dapetin": "mendapatkan", "ngelakuin": "melakukan",
    "ngebuat": "membuat", "membikin": "membuat", "bikin": "buat",
    # Daftar kosakata yang sering salah dieja
    "malpraktik": "malapraktik", "malfungsi": "malafungsi", "malserap": "malaserap",
    "maladaptasi": "malaadaptasi", "malsuai": "malasuai", "maldistribusi": "maladistribusi",
    "malgizi": "malagizi", "malsikap": "malasikap", "memperhatikan": "memerhatikan",
    "akte": "akta", "cemilan": "camilan", "esei": "esai", "frase": "frasa", "kafeteria": "kafetaria",
    "ketapel": "katapel", "kenderaan": "kendaraan", "menejemen": "manajemen", "menejer": "manajer",
    "mesjid": "masjid", "rebo": "rabu", "seksama": "saksama", "senggama": "sanggama",
    "sekedar": "sekadar", "seprei": "seprai", "semedi": "semadi", "samadi": "semadi",
    "amandemen": "amendemen", "algoritma": "algoritme", "aritmatika": "aritmetika",
    "metoda": "metode", "materai": "meterai", "meterei": "meterai", "kalendar": "kalender",
    "kadaluwarsa": "kedaluwarsa", "katagori": "kategori", "parlamen": "parlemen",
    "sekular": "sekuler", "selular": "seluler", "sirkular": "sirkuler", "survai": "survei",
    "survey": "survei", "aktuil": "aktual", "formil": "formal", "trotoir": "trotoar",
    "komersiil": "komersial", "komersil": "komersial", "tradisionil": "tradisionial",
    "orisinil": "orisinal", "orijinil": "orisinal", "afdol": "afdal", "antri": "antre",
    "apotik": "apotek", "atlit": "atlet", "atmosfir": "atmosfer", "cidera": "cedera",
    "cendikiawan": "cendekiawan", "cepet": "cepat", "cinderamata": "cenderamata", "debet": "debit",
    "difinisi": "definisi", "dekrit": "dekret", "disain": "desain", "diskripsi": "deskripsi",
    "diskotik": "diskotek", "eksim": "eksem", "exim": "eksem", "faidah": "faedah",
    "ekstrim": "ekstrem", "ekstrimis": "ekstremis", "komplit": "komplet", "konkrit": "konkret",
    "kongkrit": "konkret", "kongkret": "konkret", "kridit": "kredit", "musium": "museum",
    "pinalti": "penalti", "piranti": "peranti", "pinsil": "pensil", "personil": "personel",
    "sistim": "sistem", "teoritis": "teoretis", "vidio": "video", "cengkeh": "cengkih",
    "desertasi": "disertasi", "hakekat": "hakikat", "intelejen": "intelijen", "kaedah": "kaidah",
    "kempes": "kempis", "kementrian": "kementerian", "ledeng": "leding", "nasehat": "nasihat",
    "penasehat": "penasihat", "praktek": "praktik", "praktekum": "praktikum", "resiko": "risiko",
    "retsleting": "ritsleting", "senen": "senin", "amuba": "ameba", "punggawa": "penggawa",
    "surban": "serban", "nomer": "nomor", "sorban": "serban", "bis": "bus",
    "agribisnis": "agrobisnis", "kantung": "kantong", "khutbah": "khotbah", "mandur": "mandor",
    "rubuh": "roboh", "pastur": "pastor", "supir": "sopir", "goncang": "guncang", "goa": "gua",
    "kaos": "kaus", "kokoh": "kukuh", "komulatif": "kumulatif", "kolomnis": "kolumnis",
    "korma": "kurma", "lobang": "lubang", "limo": "limusin", "limosin": "limusin",
    "mangkok": "mangkuk", "saos": "saus", "sop": "sup", "sorga": "surga", "tegor": "tegur",
    "telor": "telur", "obrak-abrik": "ubrak-abrik", "ekwivalen": "ekuivalen",
    "frekwensi": "frekuensi", "konsekwensi": "konsekuensi", "kwadran": "kuadran",
    "kwadrat": "kuadrat", "kwalifikasi": "kualifikasi", "kwalitas": "kualitas",
    "kwalitet": "kualitas", "kwalitatif": "kualitatif", "kwantitas": "kuantitas",
    "kwantitatif": "kuantitatif", "kwantum": "kuantum", "kwartal": "kuartal", "kwintal": "kuintal",
    "kwitansi": "kuitansi", "kwatir": "khawatir", "kuatir": "khawatir", "jadual": "jadwal",
    "hirarki": "hierarki", "karir": "karier", "aktip": "aktif", "daptar": "daftar",
    "efektip": "efektif", "epektif": "efektif", "epektip": "efektif", "Pebruari": "Februari",
    "pisik": "fisik", "pondasi": "fondasi", "photo": "foto", "photokopi": "fotokopi",
    "hapal": "hafal", "insap": "insaf", "insyaf": "insaf", "konperensi": "konferensi",
    "kreatip": "kreatif", "kreativ": "kreatif", "maap": "maaf", "napsu": "nafsu",
    "negatip": "negatif", "negativ": "negatif", "objektip": "objektif", "obyektip": "objektif",
    "obyektif": "objektif", "pasip": "pasif", "pasiv": "pasif", "positip": "positif",
    "positiv": "positif", "produktip": "produktif", "produktiv": "produktif", "sarap": "saraf",
    "sertipikat": "sertifikat", "subjektip": "subjektif", "subyektip": "subjektif",
    "subyektif": "subjektif", "tarip": "tarif", "transitip": "transitif", "transitiv": "transitif",
    "faham": "paham", "fikir": "pikir", "berfikir": "berpikir", "telefon": "telepon",
    "telfon": "telepon", "telpon": "telepon", "tilpon": "telepon", "nafas": "napas",
    "bernafas": "bernapas", "pernafasan": "pernapasan", "vermak": "permak", "vulpen": "pulpen",
    "aktifis": "aktivis", "konfeksi": "konveksi", "motifasi": "motivasi", "Nopember": "November",
    "propinsi": "provinsi", "babtis": "baptis", "jerembab": "jerembap", "lembab": "lembap",
    "sembab": "sembap", "saptu": "sabtu", "tekat": "tekad", "bejad": "bejat", "nekad": "nekat",
    "otoped": "otopet", "skuad": "skuat", "jenius": "genius", "marjin": "margin",
    "marjinal": "marginal", "obyek": "objek", "subyek": "subjek", "projek": "proyek",
    "azas": "asas", "ijasah": "ijazah", "jenasah": "jenazah", "plasa": "plaza", "bathin": "batin",
    "Katholik": "Katolik", "orthografi": "ortografi", "pathogen": "patogen", "theologi": "teologi",
    "ijin": "izin", "rejeki": "rezeki", "rejim": "rezim", "jaman": "zaman", "jamrud": "zamrud",
    "jinah": "zina", "perjinahan": "perzinaan", "anugrah": "anugerah",
    "cendrawasih": "cenderawasih", "jendral": "jenderal", "kripik": "keripik", "krupuk": "kerupuk",
    "ksatria": "kesatria", "mentri": "menteri", "negri": "negeri", "Prancis": "Perancis",
    "sebrang": "seberang", "menyebrang": "menyeberang", "Sumatra": "Sumatera",
    "trampil": "terampil", "isteri": "istri", "justeru": "justru", "perajurit": "prajurit",
    "putera": "putra", "puteri": "putri", "samudera": "samudra", "sastera": "sastra",
    "sutera": "sutra", "terompet": "trompet", "iklas": "ikhlas", "iktisar": "ikhtisar",
    "kafilah": "khafilah", "kawatir": "khawatir", "kotbah": "khotbah", "kusyuk": "khusyuk",
    "makluk": "makhluk", "mahluk": "makhluk", "mahkluk": "makhluk", "nahkoda": "nakhoda",
    "nakoda": "nakhoda", "tahta": "takhta", "takhyul": "takhayul", "tahyul": "takhayul",
    "tahayul": "takhayul", "akhli": "ahli", "anarkhi": "anarki", "kharisma": "karisma",
    "kharismatik": "karismatik", "mahsud": "maksud", "makhsud": "maksud", "rakhmat": "rahmat",
    "tekhnik": "teknik", "tehnik": "teknik", "tehnologi": "teknologi", "ikhwal": "ihwal",
    "expor": "ekspor", "extra": "ekstra", "komplex": "komplek", "sex": "seks", "taxi": "taksi",
    "extasi": "ekstasi", "syaraf": "saraf", "syurga": "surga", "mashur": "masyhur",
    "masyur": "masyhur", "mahsyur": "masyhur", "mashyur": "masyhur", "muadzin": "muazin",
    "adzan": "azan", "ustadz": "ustaz", "ustad": "ustaz", "ustadzah": "ustaz", "dzikir": "zikir",
    "dzuhur": "zuhur", "dhuhur": "zuhur", "zhuhur": "zuhur", "analisa": "analisis",
    "diagnosa": "diagnosis", "hipotesa": "hipotesis", "sintesa": "sintesis",
    "aktiviti": "aktivitas", "aktifitas": "aktivitas", "efektifitas": "efektivitas",
    "komuniti": "komunitas", "kreatifitas": "kreativitas", "produktifitas": "produktivitas",
    "realiti": "realitas", "realita": "realitas", "selebriti": "selebritas",
    "spotifitas": "sportivitas", "universiti": "universitas", "utiliti": "utilitas",
    "validiti": "validitas", "dilokalisir": "dilokalisasi", "didramatisir": "didramatisasi",
    "dipolitisir": "dipolitisasi", "dinetralisir": "dinetralisasi", "dikonfrontir": "dikonfrontasi",
    "mendominir": "mendominasi", "koordinir": "koordinasi", "proklamir": "proklamasi",
    "terorganisir": "terorganisasi", "terealisir": "terealisasi", "robah": "ubah",
    "dirubah": "diubah", "merubah": "mengubah", "terlanjur": "telanjur", "terlantar": "telantar",
    "penglepasan": "pelepasan", "pelihatan": "penglihatan", "pemukiman": "permukiman",
    "pengrumahan": "perumahan", "penyewaan": "persewaan", "menyintai": "mencintai",
    "menyolok": "mencolok", "contek": "sontek", "mencontek": "menyontek", "pungkir": "mungkir",
    "dipungkiri": "dimungkiri", "kupungkiri": "kumungkiri", "kaupungkiri": "kaumungkiri",
    "nampak": "tampak", "nampaknya": "tampaknya", "nongkrong": "tongkrong",
    "berternak": "beternak", "berterbangan": "beterbangan", "berserta": "beserta",
    "berperkara": "beperkara", "berpergian": "bepergian", "berkerja": "bekerja",
    "berberapa": "beberapa", "terbersit": "tebersit", "terpercaya": "tepercaya",
    "terperdaya": "teperdaya", "terpercik": "tepercik", "terpergok": "tepergok",
    "aksesoris": "aksesori", "handal": "andal", "hantar": "antar", "panutan": "anutan",
    "atsiri": "asiri", "bhakti": "bakti", "china": "cina", "dharma": "darma",
    "diktaktor": "diktator", "eksport": "ekspor", "hembus": "embus", "hadits": "hadis",
    "hadist": "hadits", "harafiah": "harfiah", "himbau": "imbau", "import": "impor",
    "inget": "ingat", "hisap": "isap", "interprestasi": "interpretasi", "kangker": "kanker",
    "konggres": "kongres", "lansekap": "lanskap", "maghrib": "magrib", "emak": "mak",
    "moderen": "modern", "pasport": "paspor", "perduli": "peduli", "ramadhan": "ramadan",
    "rapih": "rapi", "Sansekerta": "Sanskerta", "shalat": "salat", "sholat": "salat",
    "silahkan": "silakan", "standard": "standar", "hutang": "utang", "zinah": "zina",
    "ambulan": "ambulans", "antartika": "sntarktika", "arteri": "arteria", "asik": "asyik",
    "australi": "australia", "denga": "dengan", "depo": "depot", "detil": "detail",
    "ensiklopedi": "ensiklopedia", "elit": "elite", "frustasi": "frustrasi", "gladi": "geladi",
    "greget": "gereget", "itali": "italia", "karna": "karena", "klenteng": "kelenteng",
    "erling": "kerling", "kontruksi": "konstruksi", "masal": "massal", "merk": "merek",
    "respon": "respons", "diresponi": "direspons", "skak": "sekak", "stir": "setir",
    "singapur": "singapura", "standarisasi": "standardisasi", "varitas": "varietas",
    "amphibi": "amfibi", "anjlog": "anjlok", "alpukat": "avokad", "alpokat": "avokad",
    "bolpen": "pulpen", "cabe": "cabai", "cabay": "cabai", "ceret": "cerek",
    "differensial": "diferensial", "duren": "durian", "faksimili": "faksimile",
    "faksimil": "faksimile", "graha": "gerha", "goblog": "goblok", "gombrong": "gombroh",
    "horden": "gorden", "korden": "gorden", "gubug": "gubuk", "imaginasi": "imajinasi",
    "jerigen": "jeriken", "jirigen": "jeriken", "carut-marut": "karut-marut", "kwota": "kuota",
    "mahzab": "mazhab", "mempesona": "memesona", "milyar": "miliar", "missi": "misi",
    "nenas": "nanas", "negoisasi": "negosiasi", "automotif": "otomotif", "pararel": "paralel",
    "paska": "pasca", "prosen": "persen", "pete": "petai", "petay": "petai",
    "proffesor": "profesor", "rame": "ramai", "rapot": "rapor", "rileks": "relaks",
    "rileksasi": "relaksasi", "renumerasi": "remunerasi", "seketaris": "sekretaris",
    "sekertaris": "sekretaris", "sensorik": "sensoris", "sentausa": "sentosa",
    "strawberi": "stroberi", "strawbery": "stroberi", "taqwa": "takwa", "tauco": "taoco",
    "tauge": "taoge", "toge": "taoge", "tauladan": "teladan", "taubat": "tobat",
    "trilyun": "triliun", "vissi": "visi", "coklat": "cokelat", "narkotika": "narkotik",
    "oase": "oasis", "politisi": "politikus", "terong": "terung", "wool": "wol",
    "himpit": "impit", "mujizat": "mukjizat", "mujijat": "mukjizat", "yag": "yang",
}


NORM_EXCEPTIONS = {}

for string, norm in _exc.items():
    NORM_EXCEPTIONS[string] = norm
    NORM_EXCEPTIONS[string.title()] = norm
@@ -2,9 +2,10 @@
 from __future__ import unicode_literals

 from ...symbols import NOUN, PROPN, PRON
+from ...errors import Errors


-def noun_chunks(obj):
+def noun_chunks(doclike):
     """
     Detect base noun phrases from a dependency parse. Works on both Doc and Span.
     """
@@ -18,21 +19,23 @@ def noun_chunks(obj):
         "nmod",
         "nmod:poss",
     ]
-    doc = obj.doc  # Ensure works on both Doc and Span.
+    doc = doclike.doc  # Ensure works on both Doc and Span.
+
+    if not doc.is_parsed:
+        raise ValueError(Errors.E029)
+
     np_deps = [doc.vocab.strings[label] for label in labels]
     conj = doc.vocab.strings.add("conj")
     np_label = doc.vocab.strings.add("NP")
-    seen = set()
-    for i, word in enumerate(obj):
+    prev_end = -1
+    for i, word in enumerate(doclike):
         if word.pos not in (NOUN, PROPN, PRON):
             continue
         # Prevent nested chunks from being produced
-        if word.i in seen:
+        if word.left_edge.i <= prev_end:
             continue
         if word.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+            prev_end = word.right_edge.i
             yield word.left_edge.i, word.right_edge.i + 1, np_label
         elif word.dep == conj:
             head = word.head
@@ -40,9 +43,7 @@ def noun_chunks(obj):
                 head = head.head
             # If the head is an NP, and we're coordinated to it, we're an NP
             if head.dep in np_deps:
-                if any(w.i in seen for w in word.subtree):
-                    continue
-                seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+                prev_end = word.right_edge.i
                 yield word.left_edge.i, word.right_edge.i + 1, np_label
@@ -9,8 +9,8 @@ Example sentences to test spaCy and its language models.
 """

 sentences = [
-    "애플이 영국의 신생 기업을 10억 달러에 구매를 고려중이다.",
-    "자동 운전 자동차의 손해 배상 책임에 자동차 메이커에 일정한 부담을 요구하겠다.",
-    "자동 배달 로봇이 보도를 주행하는 것을 샌프란시스코시가 금지를 검토중이라고 합니다.",
+    "애플이 영국의 스타트업을 10억 달러에 인수하는 것을 알아보고 있다.",
+    "자율주행 자동차의 손해 배상 책임이 제조 업체로 옮겨 가다",
+    "샌프란시스코 시가 자동 배달 로봇의 보도 주행 금지를 검토 중이라고 합니다.",
     "런던은 영국의 수도이자 가장 큰 도시입니다.",
 ]
@@ -2,26 +2,21 @@
 from __future__ import unicode_literals

 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .norm_exceptions import NORM_EXCEPTIONS
 from .punctuation import TOKENIZER_INFIXES
 from .lex_attrs import LEX_ATTRS
 from .tag_map import TAG_MAP
 from .stop_words import STOP_WORDS

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG, NORM
-from ...util import update_exc, add_lookups
+from ...attrs import LANG
+from ...util import update_exc


 class LuxembourgishDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters.update(LEX_ATTRS)
     lex_attr_getters[LANG] = lambda text: "lb"
-    lex_attr_getters[NORM] = add_lookups(
-        Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
-    )
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
     tag_map = TAG_MAP
@@ -1,16 +0,0 @@
# coding: utf8
from __future__ import unicode_literals

# TODO
# norm execptions: find a possibility to deal with the zillions of spelling
# variants (vläicht = vlaicht, vleicht, viläicht, viläischt, etc. etc.)
# here one could include the most common spelling mistakes

_exc = {"dass": "datt", "viläicht": "vläicht"}


NORM_EXCEPTIONS = {}

for string, norm in _exc.items():
    NORM_EXCEPTIONS[string] = norm
    NORM_EXCEPTIONS[string.title()] = norm
@@ -186,10 +186,6 @@ def suffix(string):
     return string[-3:]


-def cluster(string):
-    return 0
-
-
 def is_alpha(string):
     return string.isalpha()

@@ -218,20 +214,11 @@ def is_stop(string, stops=set()):
     return string.lower() in stops


-def is_oov(string):
-    return True
-
-
-def get_prob(string):
-    return -20.0
-
-
 LEX_ATTRS = {
     attrs.LOWER: lower,
     attrs.NORM: lower,
     attrs.PREFIX: prefix,
     attrs.SUFFIX: suffix,
-    attrs.CLUSTER: cluster,
     attrs.IS_ALPHA: is_alpha,
     attrs.IS_DIGIT: is_digit,
     attrs.IS_LOWER: is_lower,
@@ -239,8 +226,6 @@ LEX_ATTRS = {
     attrs.IS_TITLE: is_title,
     attrs.IS_UPPER: is_upper,
     attrs.IS_STOP: is_stop,
-    attrs.IS_OOV: is_oov,
-    attrs.PROB: get_prob,
     attrs.LIKE_EMAIL: like_email,
     attrs.LIKE_NUM: like_num,
     attrs.IS_PUNCT: is_punct,
@@ -55,7 +55,7 @@ _num_words = [
    "തൊണ്ണൂറ് ",
    "നുറ് ",
    "ആയിരം ",
-    "പത്തുലക്ഷം"
+    "പത്തുലക്ഷം",
 ]
@@ -3,7 +3,6 @@ from __future__ import unicode_literals


 STOP_WORDS = set(
-
     """
 അത്
 ഇത്
@@ -2,9 +2,10 @@
 from __future__ import unicode_literals

 from ...symbols import NOUN, PROPN, PRON
+from ...errors import Errors


-def noun_chunks(obj):
+def noun_chunks(doclike):
     """
     Detect base noun phrases from a dependency parse. Works on both Doc and Span.
     """
@@ -18,21 +19,23 @@ def noun_chunks(obj):
         "nmod",
         "nmod:poss",
     ]
-    doc = obj.doc  # Ensure works on both Doc and Span.
+    doc = doclike.doc  # Ensure works on both Doc and Span.
+
+    if not doc.is_parsed:
+        raise ValueError(Errors.E029)
+
     np_deps = [doc.vocab.strings[label] for label in labels]
     conj = doc.vocab.strings.add("conj")
     np_label = doc.vocab.strings.add("NP")
-    seen = set()
-    for i, word in enumerate(obj):
+    prev_end = -1
+    for i, word in enumerate(doclike):
         if word.pos not in (NOUN, PROPN, PRON):
             continue
         # Prevent nested chunks from being produced
-        if word.i in seen:
+        if word.left_edge.i <= prev_end:
             continue
         if word.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+            prev_end = word.right_edge.i
             yield word.left_edge.i, word.right_edge.i + 1, np_label
         elif word.dep == conj:
             head = word.head
@@ -40,9 +43,7 @@ def noun_chunks(obj):
                 head = head.head
             # If the head is an NP, and we're coordinated to it, we're an NP
             if head.dep in np_deps:
-                if any(w.i in seen for w in word.subtree):
-                    continue
-                seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+                prev_end = word.right_edge.i
                 yield word.left_edge.i, word.right_edge.i + 1, np_label
@@ -1,17 +1,19 @@
 # coding: utf8
 from __future__ import unicode_literals

-from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .punctuation import TOKENIZER_INFIXES
+from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
+from .punctuation import TOKENIZER_SUFFIXES
 from .tag_map import TAG_MAP
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
+from .lemmatizer import PolishLemmatizer

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from ..norm_exceptions import BASE_NORMS
 from ...language import Language
 from ...attrs import LANG, NORM
-from ...util import update_exc, add_lookups
+from ...util import add_lookups
+from ...lookups import Lookups


 class PolishDefaults(Language.Defaults):
@@ -21,10 +23,21 @@ class PolishDefaults(Language.Defaults):
     lex_attr_getters[NORM] = add_lookups(
         Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
     )
-    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
+    mod_base_exceptions = {
+        exc: val for exc, val in BASE_EXCEPTIONS.items() if not exc.endswith(".")
+    }
+    tokenizer_exceptions = mod_base_exceptions
     stop_words = STOP_WORDS
     tag_map = TAG_MAP
+    prefixes = TOKENIZER_PREFIXES
     infixes = TOKENIZER_INFIXES
+    suffixes = TOKENIZER_SUFFIXES
+
+    @classmethod
+    def create_lemmatizer(cls, nlp=None, lookups=None):
+        if lookups is None:
+            lookups = Lookups()
+        return PolishLemmatizer(lookups)


 class Polish(Language):
File diff suppressed because it is too large
106
spacy/lang/pl/lemmatizer.py
Normal file
@@ -0,0 +1,106 @@
# coding: utf-8
from __future__ import unicode_literals

from ...lemmatizer import Lemmatizer
from ...parts_of_speech import NAMES


class PolishLemmatizer(Lemmatizer):
    # This lemmatizer implements lookup lemmatization based on
    # the Morfeusz dictionary (morfeusz.sgjp.pl/en) by Institute of Computer Science PAS
    # It utilizes some prefix based improvements for
    # verb and adjectives lemmatization, as well as case-sensitive
    # lemmatization for nouns
    def __init__(self, lookups, *args, **kwargs):
        # this lemmatizer is lookup based, so it does not require an index, exceptionlist, or rules
        super(PolishLemmatizer, self).__init__(lookups)
        self.lemma_lookups = {}
        for tag in [
            "ADJ",
            "ADP",
            "ADV",
            "AUX",
            "NOUN",
            "NUM",
            "PART",
            "PRON",
            "VERB",
            "X",
        ]:
            self.lemma_lookups[tag] = self.lookups.get_table(
                "lemma_lookup_" + tag.lower(), {}
            )
        self.lemma_lookups["DET"] = self.lemma_lookups["X"]
        self.lemma_lookups["PROPN"] = self.lemma_lookups["NOUN"]

    def __call__(self, string, univ_pos, morphology=None):
        if isinstance(univ_pos, int):
            univ_pos = NAMES.get(univ_pos, "X")
        univ_pos = univ_pos.upper()

        if univ_pos == "NOUN":
            return self.lemmatize_noun(string, morphology)

        if univ_pos != "PROPN":
            string = string.lower()

        if univ_pos == "ADJ":
            return self.lemmatize_adj(string, morphology)
        elif univ_pos == "VERB":
            return self.lemmatize_verb(string, morphology)

        lemma_dict = self.lemma_lookups.get(univ_pos, {})
        return [lemma_dict.get(string, string.lower())]

    def lemmatize_adj(self, string, morphology):
        # this method utilizes different procedures for adjectives
        # with 'nie' and 'naj' prefixes
        lemma_dict = self.lemma_lookups["ADJ"]

        if string[:3] == "nie":
            search_string = string[3:]
            if search_string[:3] == "naj":
                naj_search_string = search_string[3:]
                if naj_search_string in lemma_dict:
                    return [lemma_dict[naj_search_string]]
            if search_string in lemma_dict:
                return [lemma_dict[search_string]]

        if string[:3] == "naj":
            naj_search_string = string[3:]
            if naj_search_string in lemma_dict:
                return [lemma_dict[naj_search_string]]

        return [lemma_dict.get(string, string)]

    def lemmatize_verb(self, string, morphology):
        # this method utilizes a different procedure for verbs
        # with 'nie' prefix
        lemma_dict = self.lemma_lookups["VERB"]

        if string[:3] == "nie":
            search_string = string[3:]
            if search_string in lemma_dict:
                return [lemma_dict[search_string]]

        return [lemma_dict.get(string, string)]

    def lemmatize_noun(self, string, morphology):
        # this method is case-sensitive, in order to work
        # for incorrectly tagged proper names
        lemma_dict = self.lemma_lookups["NOUN"]

        if string != string.lower():
            if string.lower() in lemma_dict:
                return [lemma_dict[string.lower()]]
            elif string in lemma_dict:
                return [lemma_dict[string]]
            return [string.lower()]

        return [lemma_dict.get(string, string)]

    def lookup(self, string, orth=None):
        return string.lower()

    def lemmatize(self, string, index, exceptions, rules):
        raise NotImplementedError
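A rough usage sketch for the lemmatizer added above, assuming a spaCy version that contains this file. The lookup-table contents below are made up purely for illustration; in practice the `lemma_lookup_*` tables are provided by the lookups data package rather than filled in by hand:

# Rough usage sketch for PolishLemmatizer; the table contents are illustrative.
from spacy.lookups import Lookups
from spacy.lang.pl.lemmatizer import PolishLemmatizer

lookups = Lookups()
lookups.add_table("lemma_lookup_adj", {"dobry": "dobry", "lepszy": "dobry"})
lookups.add_table("lemma_lookup_verb", {"robić": "robić"})

lemmatizer = PolishLemmatizer(lookups)

# the "naj" prefix is stripped before the ADJ table lookup
print(lemmatizer("najlepszy", "ADJ"))  # ['dobry']
# the "nie" prefix is stripped before the VERB table lookup
print(lemmatizer("nierobić", "VERB"))  # ['robić']

Tables that are not registered fall back to empty dicts via `get_table(name, {})`, so the lemmatizer degrades to returning the lowercased form for unknown words.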
@@ -1,23 +0,0 @@
Copyright (c) 2019, Marcin Miłkowski
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@@ -1,22 +1,48 @@
 # coding: utf8
 from __future__ import unicode_literals

-from ..char_classes import LIST_ELLIPSES, CONCAT_ICONS
+from ..char_classes import LIST_ELLIPSES, LIST_PUNCT, LIST_HYPHENS
+from ..char_classes import LIST_ICONS, LIST_QUOTES, CURRENCY, UNITS, PUNCT
 from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
+from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES

 _quotes = CONCAT_QUOTES.replace("'", "")

+_prefixes = _prefixes = [
+    r"(długo|krótko|jedno|dwu|trzy|cztero)-"
+] + BASE_TOKENIZER_PREFIXES
+
 _infixes = (
     LIST_ELLIPSES
-    + [CONCAT_ICONS]
+    + LIST_ICONS
+    + LIST_HYPHENS
     + [
-        r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
+        r"(?<=[0-9{al}])\.(?=[0-9{au}])".format(al=ALPHA, au=ALPHA_UPPER),
         r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
-        r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
-        r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
+        r"(?<=[{a}])[:<>=\/](?=[{a}])".format(a=ALPHA),
         r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
-        r"(?<=[{a}])([{q}\)\]\(\[])(?=[\-{a}])".format(a=ALPHA, q=CONCAT_QUOTES),
+        r"(?<=[{a}])([{q}\)\]\(\[])(?=[\-{a}])".format(a=ALPHA, q=_quotes),
     ]
 )

+_suffixes = (
+    ["''", "’’", r"\.", "…"]
+    + LIST_PUNCT
+    + LIST_QUOTES
+    + LIST_ICONS
+    + [
+        r"(?<=[0-9])\+",
+        r"(?<=°[FfCcKk])\.",
+        r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
+        r"(?<=[0-9])(?:{u})".format(u=UNITS),
+        r"(?<=[0-9{al}{e}{p}(?:{q})])\.".format(
+            al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
+        ),
+        r"(?<=[{au}])\.".format(au=ALPHA_UPPER),
+    ]
+)
+
+
+TOKENIZER_PREFIXES = _prefixes
 TOKENIZER_INFIXES = _infixes
+TOKENIZER_SUFFIXES = _suffixes
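A small sketch of how a prefix pattern like the one added above ends up being applied, using spaCy's own `compile_prefix_regex` helper; the example string is illustrative and this is only a standalone check of the regex, not of the full tokenizer pipeline:

# Sketch: compile a prefix list and test it against a hyphenated Polish word.
from spacy.util import compile_prefix_regex

prefixes = [r"(długo|krótko|jedno|dwu|trzy|cztero)-"]
prefix_re = compile_prefix_regex(prefixes)

match = prefix_re.search("krótko-terminowy")
print(match.group(0) if match else None)  # "krótko-"

In the tokenizer itself the matched prefix is split off as its own token before the remaining string is processed by the infix and suffix rules.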
@@ -1,26 +0,0 @@
# encoding: utf8
from __future__ import unicode_literals

from ._tokenizer_exceptions_list import PL_BASE_EXCEPTIONS
from ...symbols import POS, ADV, NOUN, ORTH, LEMMA, ADJ


_exc = {}

for exc_data in [
    {ORTH: "m.in.", LEMMA: "między innymi", POS: ADV},
    {ORTH: "inż.", LEMMA: "inżynier", POS: NOUN},
    {ORTH: "mgr.", LEMMA: "magister", POS: NOUN},
    {ORTH: "tzn.", LEMMA: "to znaczy", POS: ADV},
    {ORTH: "tj.", LEMMA: "to jest", POS: ADV},
    {ORTH: "tzw.", LEMMA: "tak zwany", POS: ADJ},
]:
    _exc[exc_data[ORTH]] = [exc_data]

for orth in ["w.", "r."]:
    _exc[orth] = [{ORTH: orth}]

for orth in PL_BASE_EXCEPTIONS:
    _exc[orth] = [{ORTH: orth}]

TOKENIZER_EXCEPTIONS = _exc
@@ -5,22 +5,17 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .tag_map import TAG_MAP
-from .norm_exceptions import NORM_EXCEPTIONS

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from .punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES
-from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG, NORM
-from ...util import update_exc, add_lookups
+from ...attrs import LANG
+from ...util import update_exc


 class PortugueseDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters[LANG] = lambda text: "pt"
-    lex_attr_getters[NORM] = add_lookups(
-        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
-    )
     lex_attr_getters.update(LEX_ATTRS)
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
@@ -1,23 +0,0 @@
# coding: utf8
from __future__ import unicode_literals

# These exceptions are used to add NORM values based on a token's ORTH value.
# Individual languages can also add their own exceptions and overwrite them -
# for example, British vs. American spelling in English.

# Norms are only set if no alternative is provided in the tokenizer exceptions.
# Note that this does not change any other token attributes. Its main purpose
# is to normalise the word representations so that equivalent tokens receive
# similar representations. For example: $ and € are very different, but they're
# both currency symbols. By normalising currency symbols to $, all symbols are
# seen as similar, no matter how common they are in the training data.


NORM_EXCEPTIONS = {
    "R$": "$",  # Real
    "r$": "$",  # Real
    "Cz$": "$",  # Cruzado
    "cz$": "$",  # Cruzado
    "NCz$": "$",  # Cruzado Novo
    "ncz$": "$",  # Cruzado Novo
}
@@ -3,26 +3,21 @@ from __future__ import unicode_literals, print_function

 from .stop_words import STOP_WORDS
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .norm_exceptions import NORM_EXCEPTIONS
 from .lex_attrs import LEX_ATTRS
 from .tag_map import TAG_MAP
 from .lemmatizer import RussianLemmatizer

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from ..norm_exceptions import BASE_NORMS
-from ...util import update_exc, add_lookups
+from ...util import update_exc
 from ...language import Language
 from ...lookups import Lookups
-from ...attrs import LANG, NORM
+from ...attrs import LANG


 class RussianDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters.update(LEX_ATTRS)
     lex_attr_getters[LANG] = lambda text: "ru"
-    lex_attr_getters[NORM] = add_lookups(
-        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
-    )
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
     tag_map = TAG_MAP
@@ -1,36 +0,0 @@
# coding: utf8
from __future__ import unicode_literals


_exc = {
    # Slang
    "прив": "привет", "дарова": "привет", "дак": "так", "дык": "так", "здарова": "привет",
    "пакедава": "пока", "пакедаво": "пока", "ща": "сейчас", "спс": "спасибо",
    "пжлст": "пожалуйста", "плиз": "пожалуйста", "ладненько": "ладно", "лады": "ладно",
    "лан": "ладно", "ясн": "ясно", "всм": "всмысле", "хош": "хочешь", "хаюшки": "привет",
    "оч": "очень", "че": "что", "чо": "что", "шо": "что",
}


NORM_EXCEPTIONS = {}

for string, norm in _exc.items():
    NORM_EXCEPTIONS[string] = norm
    NORM_EXCEPTIONS[string.title()] = norm
@@ -3,22 +3,17 @@ from __future__ import unicode_literals

 from .stop_words import STOP_WORDS
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .norm_exceptions import NORM_EXCEPTIONS
 from .lex_attrs import LEX_ATTRS
 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG, NORM
-from ...util import update_exc, add_lookups
+from ...attrs import LANG
+from ...util import update_exc


 class SerbianDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters.update(LEX_ATTRS)
     lex_attr_getters[LANG] = lambda text: "sr"
-    lex_attr_getters[NORM] = add_lookups(
-        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
-    )
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
@@ -1,26 +0,0 @@
# coding: utf8
from __future__ import unicode_literals


_exc = {
    # Slang
    "ћале": "отац", "кева": "мајка", "смор": "досада", "кец": "јединица", "тебра": "брат",
    "штребер": "ученик", "факс": "факултет", "профа": "професор", "бус": "аутобус",
    "пискарало": "службеник", "бакутанер": "бака", "џибер": "простак",
}


NORM_EXCEPTIONS = {}

for string, norm in _exc.items():
    NORM_EXCEPTIONS[string] = norm
    NORM_EXCEPTIONS[string.title()] = norm
@@ -40,7 +40,7 @@ _num_words = [
    "miljard",
    "biljon",
    "biljard",
-    "kvadriljon"
+    "kvadriljon",
 ]
@@ -2,9 +2,10 @@
 from __future__ import unicode_literals

 from ...symbols import NOUN, PROPN, PRON
+from ...errors import Errors


-def noun_chunks(obj):
+def noun_chunks(doclike):
     """
     Detect base noun phrases from a dependency parse. Works on both Doc and Span.
     """
@@ -19,21 +20,23 @@ def noun_chunks(obj):
         "nmod",
         "nmod:poss",
     ]
-    doc = obj.doc  # Ensure works on both Doc and Span.
+    doc = doclike.doc  # Ensure works on both Doc and Span.
+
+    if not doc.is_parsed:
+        raise ValueError(Errors.E029)
+
     np_deps = [doc.vocab.strings[label] for label in labels]
     conj = doc.vocab.strings.add("conj")
     np_label = doc.vocab.strings.add("NP")
-    seen = set()
-    for i, word in enumerate(obj):
+    prev_end = -1
+    for i, word in enumerate(doclike):
         if word.pos not in (NOUN, PROPN, PRON):
             continue
         # Prevent nested chunks from being produced
-        if word.i in seen:
+        if word.left_edge.i <= prev_end:
             continue
         if word.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+            prev_end = word.right_edge.i
             yield word.left_edge.i, word.right_edge.i + 1, np_label
         elif word.dep == conj:
             head = word.head
@@ -41,9 +44,7 @@ def noun_chunks(obj):
             head = head.head
         # If the head is an NP, and we're coordinated to it, we're an NP
         if head.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+            prev_end = word.right_edge.i
             yield word.left_edge.i, word.right_edge.i + 1, np_label
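For reference, the hunks above replace the old `seen` set with a single `prev_end` index: the iterator only remembers the right edge of the most recently yielded chunk and skips any candidate whose left edge falls inside it. A minimal standalone sketch of that check (the candidate tuples are invented for illustration, not taken from the diff):

def filter_overlapping_chunks(candidates):
    # candidates: (left_edge, right_edge) token-index pairs in document order
    prev_end = -1
    for left, right in candidates:
        if left <= prev_end:
            # nested inside the previously yielded chunk -> skip
            continue
        prev_end = right
        yield left, right

print(list(filter_overlapping_chunks([(0, 2), (1, 2), (3, 5)])))  # [(0, 2), (3, 5)]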
@@ -1,7 +1,7 @@
 # coding: utf8
 from __future__ import unicode_literals

-from ...symbols import LEMMA, NORM, ORTH, PRON_LEMMA, PUNCT, TAG
+from ...symbols import LEMMA, NORM, ORTH, PRON_LEMMA

 _exc = {}

@@ -155,6 +155,6 @@ for orth in ABBREVIATIONS:
 # Sentences ending in "i." (as in "... peka i."), "m." (as in "...än 2000 m."),
 # should be tokenized as two separate tokens.
 for orth in ["i", "m"]:
-    _exc[orth + "."] = [{ORTH: orth, LEMMA: orth, NORM: orth}, {ORTH: ".", TAG: PUNCT}]
+    _exc[orth + "."] = [{ORTH: orth, LEMMA: orth, NORM: orth}, {ORTH: "."}]

 TOKENIZER_EXCEPTIONS = _exc
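To illustrate what such a special-case entry does (a hedged example; the hunk itself does not name the language, though the abbreviations look Swedish): the entry for "m." tells the tokenizer to emit the abbreviation and the trailing period as two separate tokens, and after this change the period token no longer carries a hard-coded tag.

from spacy.lang.sv import Swedish  # assumed language for this file

nlp = Swedish()
doc = nlp("Vi gick mer än 2000 m.")
print([t.text for t in doc][-2:])  # expected: ['m', '.']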
@@ -1,139 +0,0 @@
-# coding: utf8
-from __future__ import unicode_literals
-
-_exc = {
-    # Regional words normal
-    # Sri Lanka - wikipeadia
-    "இங்க": "இங்கே",
-    "வாங்க": "வாருங்கள்",
-    "ஒண்டு": "ஒன்று",
-    "கண்டு": "கன்று",
-    "கொண்டு": "கொன்று",
-    "பண்டி": "பன்றி",
-    "பச்ச": "பச்சை",
-    "அம்பது": "ஐம்பது",
-    "வெச்ச": "வைத்து",
-    "வச்ச": "வைத்து",
-    "வச்சி": "வைத்து",
-    "வாளைப்பழம்": "வாழைப்பழம்",
-    "மண்ணு": "மண்",
-    "பொன்னு": "பொன்",
-    "சாவல்": "சேவல்",
-    "அங்கால": "அங்கு ",
-    "அசுப்பு": "நடமாட்டம்",
-    "எழுவான் கரை": "எழுவான்கரை",
-    "ஓய்யாரம்": "எழில் ",
-    "ஒளும்பு": "எழும்பு",
-    "ஓர்மை": "துணிவு",
-    "கச்சை": "கோவணம்",
-    "கடப்பு": "தெருவாசல்",
-    "சுள்ளி": "காய்ந்த குச்சி",
-    "திறாவுதல்": "தடவுதல்",
-    "நாசமறுப்பு": "தொல்லை",
-    "பரிசாரி": "வைத்தியன்",
-    "பறவாதி": "பேராசைக்காரன்",
-    "பிசினி": "உலோபி ",
-    "விசர்": "பைத்தியம்",
-    "ஏனம்": "பாத்திரம்",
-    "ஏலா": "இயலாது",
-    "ஒசில்": "அழகு",
-    "ஒள்ளுப்பம்": "கொஞ்சம்",
-    # Srilankan and indian
-    "குத்துமதிப்பு": "",
-    "நூனாயம்": "நூல்நயம்",
-    "பைய": "மெதுவாக",
-    "மண்டை": "தலை",
-    "வெள்ளனே": "சீக்கிரம்",
-    "உசுப்பு": "எழுப்பு",
-    "ஆணம்": "குழம்பு",
-    "உறக்கம்": "தூக்கம்",
-    "பஸ்": "பேருந்து",
-    "களவு": "திருட்டு ",
-    # relationship
-    "புருசன்": "கணவன்",
-    "பொஞ்சாதி": "மனைவி",
-    "புள்ள": "பிள்ளை",
-    "பிள்ள": "பிள்ளை",
-    "ஆம்பிளப்புள்ள": "ஆண் பிள்ளை",
-    "பொம்பிளப்புள்ள": "பெண் பிள்ளை",
-    "அண்ணாச்சி": "அண்ணா",
-    "அக்காச்சி": "அக்கா",
-    "தங்கச்சி": "தங்கை",
-    # difference words
-    "பொடியன்": "சிறுவன்",
-    "பொட்டை": "சிறுமி",
-    "பிறகு": "பின்பு",
-    "டக்கென்டு": "விரைவாக",
-    "கெதியா": "விரைவாக",
-    "கிறுகி": "திரும்பி",
-    "போயித்து வாறன்": "போய் வருகிறேன்",
-    "வருவாங்களா": "வருவார்களா",
-    # regular spokens
-    "சொல்லு": "சொல்",
-    "கேளு": "கேள்",
-    "சொல்லுங்க": "சொல்லுங்கள்",
-    "கேளுங்க": "கேளுங்கள்",
-    "நீங்கள்": "நீ",
-    "உன்": "உன்னுடைய",
-    # Portugeese formal words
-    "அலவாங்கு": "கடப்பாரை",
-    "ஆசுப்பத்திரி": "மருத்துவமனை",
-    "உரோதை": "சில்லு",
-    "கடுதாசி": "கடிதம்",
-    "கதிரை": "நாற்காலி",
-    "குசினி": "அடுக்களை",
-    "கோப்பை": "கிண்ணம்",
-    "சப்பாத்து": "காலணி",
-    "தாச்சி": "இரும்புச் சட்டி",
-    "துவாய்": "துவாலை",
-    "தவறணை": "மதுக்கடை",
-    "பீப்பா": "மரத்தாழி",
-    "யன்னல்": "சாளரம்",
-    "வாங்கு": "மரஇருக்கை",
-    # Dutch formal words
-    "இறாக்கை": "பற்சட்டம்",
-    "இலாட்சி": "இழுப்பறை",
-    "கந்தோர்": "பணிமனை",
-    "நொத்தாரிசு": "ஆவண எழுத்துபதிவாளர்",
-    # English formal words
-    "இஞ்சினியர்": "பொறியியலாளர்",
-    "சூப்பு": "ரசம்",
-    "செக்": "காசோலை",
-    "சேட்டு": "மேற்ச்சட்டை",
-    "மார்க்கட்டு": "சந்தை",
-    "விண்ணன்": "கெட்டிக்காரன்",
-    # Arabic formal words
-    "ஈமான்": "நம்பிக்கை",
-    "சுன்னத்து": "விருத்தசேதனம்",
-    "செய்த்தான்": "பிசாசு",
-    "மவுத்து": "இறப்பு",
-    "ஹலால்": "அங்கீகரிக்கப்பட்டது",
-    "கறாம்": "நிராகரிக்கப்பட்டது",
-    # Persian, Hindustanian and hindi formal words
-    "சுமார்": "கிட்டத்தட்ட",
-    "சிப்பாய்": "போர்வீரன்",
-    "சிபார்சு": "சிபாரிசு",
-    "ஜமீன்": "பணக்காரா்",
-    "அசல்": "மெய்யான",
-    "அந்தஸ்து": "கௌரவம்",
-    "ஆஜர்": "சமா்ப்பித்தல்",
-    "உசார்": "எச்சரிக்கை",
-    "அச்சா": "நல்ல",
-    # English words used in text conversations
-    "bcoz": "ஏனெனில்",
-    "bcuz": "ஏனெனில்",
-    "fav": "விருப்பமான",
-    "morning": "காலை வணக்கம்",
-    "gdeveng": "மாலை வணக்கம்",
-    "gdnyt": "இரவு வணக்கம்",
-    "gdnit": "இரவு வணக்கம்",
-    "plz": "தயவு செய்து",
-    "pls": "தயவு செய்து",
-    "thx": "நன்றி",
-    "thanx": "நன்றி",
-}
-
-NORM_EXCEPTIONS = {}
-
-for string, norm in _exc.items():
-    NORM_EXCEPTIONS[string] = norm
@@ -4,14 +4,12 @@ from __future__ import unicode_literals
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .tag_map import TAG_MAP
 from .stop_words import STOP_WORDS
-from .norm_exceptions import NORM_EXCEPTIONS
 from .lex_attrs import LEX_ATTRS

-from ..norm_exceptions import BASE_NORMS
-from ...attrs import LANG, NORM
+from ...attrs import LANG
 from ...language import Language
 from ...tokens import Doc
-from ...util import DummyTokenizer, add_lookups
+from ...util import DummyTokenizer


 class ThaiTokenizer(DummyTokenizer):
@@ -37,9 +35,6 @@ class ThaiDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters.update(LEX_ATTRS)
     lex_attr_getters[LANG] = lambda _text: "th"
-    lex_attr_getters[NORM] = add_lookups(
-        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
-    )
     tokenizer_exceptions = dict(TOKENIZER_EXCEPTIONS)
     tag_map = TAG_MAP
     stop_words = STOP_WORDS
@@ -1,113 +0,0 @@
-# coding: utf8
-from __future__ import unicode_literals
-
-
-_exc = {
-    # Conjugation and Diversion invalid to Tonal form (ผันอักษรและเสียงไม่ตรงกับรูปวรรณยุกต์)
-    "สนุ๊กเกอร์": "สนุกเกอร์",
-    "โน้ต": "โน้ต",
-    # Misspelled because of being lazy or hustle (สะกดผิดเพราะขี้เกียจพิมพ์ หรือเร่งรีบ)
-    "โทสับ": "โทรศัพท์",
-    "พุ่งนี้": "พรุ่งนี้",
-    # Strange (ให้ดูแปลกตา)
-    "ชะมะ": "ใช่ไหม",
-    "ชิมิ": "ใช่ไหม",
-    "ชะ": "ใช่ไหม",
-    "ช่ายมะ": "ใช่ไหม",
-    "ป่าว": "เปล่า",
-    "ป่ะ": "เปล่า",
-    "ปล่าว": "เปล่า",
-    "คัย": "ใคร",
-    "ไค": "ใคร",
-    "คราย": "ใคร",
-    "เตง": "ตัวเอง",
-    "ตะเอง": "ตัวเอง",
-    "รึ": "หรือ",
-    "เหรอ": "หรือ",
-    "หรา": "หรือ",
-    "หรอ": "หรือ",
-    "ชั้น": "ฉัน",
-    "ชั้ล": "ฉัน",
-    "ช้าน": "ฉัน",
-    "เทอ": "เธอ",
-    "เทอร์": "เธอ",
-    "เทอว์": "เธอ",
-    "แกร": "แก",
-    "ป๋ม": "ผม",
-    "บ่องตง": "บอกตรงๆ",
-    "ถ่ามตง": "ถามตรงๆ",
-    "ต่อมตง": "ตอบตรงๆ",
-    "เพิ่ล": "เพื่อน",
-    "จอบอ": "จอบอ",
-    "ดั้ย": "ได้",
-    "ขอบคุง": "ขอบคุณ",
-    "ยังงัย": "ยังไง",
-    "Inw": "เทพ",
-    "uou": "นอน",
-    "Lกรีeu": "เกรียน",
-    # Misspelled to express emotions (คำที่สะกดผิดเพื่อแสดงอารมณ์)
-    "เปงราย": "เป็นอะไร",
-    "เปนรัย": "เป็นอะไร",
-    "เปงรัย": "เป็นอะไร",
-    "เป็นอัลไล": "เป็นอะไร",
-    "ทามมาย": "ทำไม",
-    "ทามมัย": "ทำไม",
-    "จังรุย": "จังเลย",
-    "จังเยย": "จังเลย",
-    "จุงเบย": "จังเลย",
-    "ไม่รู้": "มะรุ",
-    "เฮ่ย": "เฮ้ย",
-    "เห้ย": "เฮ้ย",
-    "น่าร็อค": "น่ารัก",
-    "น่าร๊าก": "น่ารัก",
-    "ตั้ลล๊าก": "น่ารัก",
-    "คือร๊ะ": "คืออะไร",
-    "โอป่ะ": "โอเคหรือเปล่า",
-    "น่ามคาน": "น่ารำคาญ",
-    "น่ามสาร": "น่าสงสาร",
-    "วงวาร": "สงสาร",
-    "บับว่า": "แบบว่า",
-    "อัลไล": "อะไร",
-    "อิจ": "อิจฉา",
-    # Reduce rough words or Avoid to software filter (คำที่สะกดผิดเพื่อลดความหยาบของคำ หรืออาจใช้หลีกเลี่ยงการกรองคำหยาบของซอฟต์แวร์)
-    "กรู": "กู",
-    "กุ": "กู",
-    "กรุ": "กู",
-    "ตู": "กู",
-    "ตรู": "กู",
-    "มรึง": "มึง",
-    "เมิง": "มึง",
-    "มืง": "มึง",
-    "มุง": "มึง",
-    "สาด": "สัตว์",
-    "สัส": "สัตว์",
-    "สัก": "สัตว์",
-    "แสรด": "สัตว์",
-    "โคโตะ": "โคตร",
-    "โคด": "โคตร",
-    "โครต": "โคตร",
-    "โคตะระ": "โคตร",
-    "พ่อง": "พ่อมึง",
-    "แม่เมิง": "แม่มึง",
-    "เชี่ย": "เหี้ย",
-    # Imitate words (คำเลียนเสียง โดยส่วนใหญ่จะเพิ่มทัณฑฆาต หรือซ้ำตัวอักษร)
-    "แอร๊ยย": "อ๊าย",
-    "อร๊ายยย": "อ๊าย",
-    "มันส์": "มัน",
-    "วู๊วววววววว์": "วู้",
-    # Acronym (แบบคำย่อ)
-    "หมาลัย": "มหาวิทยาลัย",
-    "วิดวะ": "วิศวะ",
-    "สินสาด ": "ศิลปศาสตร์",
-    "สินกำ ": "ศิลปกรรมศาสตร์",
-    "เสารีย์ ": "อนุเสาวรีย์ชัยสมรภูมิ",
-    "เมกา ": "อเมริกา",
-    "มอไซค์ ": "มอเตอร์ไซค์",
-}
-
-
-NORM_EXCEPTIONS = {}
-
-for string, norm in _exc.items():
-    NORM_EXCEPTIONS[string] = norm
-    NORM_EXCEPTIONS[string.title()] = norm
@@ -38,7 +38,6 @@ TAG_MAP = {
     "NNPC": {POS: PROPN},
     "NNC": {POS: NOUN},
     "PSP": {POS: ADP},
-
     ".": {POS: PUNCT},
     ",": {POS: PUNCT},
     "-LRB-": {POS: PUNCT},
@@ -104,6 +104,23 @@ class ChineseTokenizer(DummyTokenizer):
         (words, spaces) = util.get_words_and_spaces(words, text)
         return Doc(self.vocab, words=words, spaces=spaces)

+    def pkuseg_update_user_dict(self, words, reset=False):
+        if self.pkuseg_seg:
+            if reset:
+                try:
+                    import pkuseg
+
+                    self.pkuseg_seg.preprocesser = pkuseg.Preprocesser(None)
+                except ImportError:
+                    if self.use_pkuseg:
+                        msg = (
+                            "pkuseg not installed: unable to reset pkuseg "
+                            "user dict. Please " + _PKUSEG_INSTALL_MSG
+                        )
+                        raise ImportError(msg)
+            for word in words:
+                self.pkuseg_seg.preprocesser.insert(word.strip(), "")
+
     def _get_config(self):
         config = OrderedDict(
             (
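A rough usage sketch of the new user-dictionary hook (assumes pkuseg and its default model are installed; the words are arbitrary examples):

from spacy.lang.zh import Chinese

cfg = {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}
nlp = Chinese(meta={"tokenizer": {"config": cfg}})
nlp.tokenizer.pkuseg_update_user_dict(["自然语言处理", "机器学习"])  # add user words
nlp.tokenizer.pkuseg_update_user_dict([], reset=True)  # reset the user dict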
@@ -152,21 +169,16 @@ class ChineseTokenizer(DummyTokenizer):
         return util.to_bytes(serializers, [])

     def from_bytes(self, data, **kwargs):
-        pkuseg_features_b = b""
-        pkuseg_weights_b = b""
-        pkuseg_processors_data = None
+        pkuseg_data = {"features_b": b"", "weights_b": b"", "processors_data": None}

         def deserialize_pkuseg_features(b):
-            nonlocal pkuseg_features_b
-            pkuseg_features_b = b
+            pkuseg_data["features_b"] = b

         def deserialize_pkuseg_weights(b):
-            nonlocal pkuseg_weights_b
-            pkuseg_weights_b = b
+            pkuseg_data["weights_b"] = b

         def deserialize_pkuseg_processors(b):
-            nonlocal pkuseg_processors_data
-            pkuseg_processors_data = srsly.msgpack_loads(b)
+            pkuseg_data["processors_data"] = srsly.msgpack_loads(b)

         deserializers = OrderedDict(
             (
@@ -178,13 +190,13 @@ class ChineseTokenizer(DummyTokenizer):
         )
         util.from_bytes(data, deserializers, [])

-        if pkuseg_features_b and pkuseg_weights_b:
+        if pkuseg_data["features_b"] and pkuseg_data["weights_b"]:
             with tempfile.TemporaryDirectory() as tempdir:
                 tempdir = Path(tempdir)
                 with open(tempdir / "features.pkl", "wb") as fileh:
-                    fileh.write(pkuseg_features_b)
+                    fileh.write(pkuseg_data["features_b"])
                 with open(tempdir / "weights.npz", "wb") as fileh:
-                    fileh.write(pkuseg_weights_b)
+                    fileh.write(pkuseg_data["weights_b"])
                 try:
                     import pkuseg
                 except ImportError:
@@ -193,13 +205,9 @@ class ChineseTokenizer(DummyTokenizer):
                         + _PKUSEG_INSTALL_MSG
                     )
                 self.pkuseg_seg = pkuseg.pkuseg(str(tempdir))
-            if pkuseg_processors_data:
-                (
-                    user_dict,
-                    do_process,
-                    common_words,
-                    other_words,
-                ) = pkuseg_processors_data
+            if pkuseg_data["processors_data"]:
+                processors_data = pkuseg_data["processors_data"]
+                (user_dict, do_process, common_words, other_words) = processors_data
                 self.pkuseg_seg.preprocesser = pkuseg.Preprocesser(user_dict)
                 self.pkuseg_seg.postprocesser.do_process = do_process
                 self.pkuseg_seg.postprocesser.common_words = set(common_words)
@@ -4,10 +4,7 @@ from __future__ import absolute_import, unicode_literals
 import random
 import itertools
 import warnings
-
 from thinc.extra import load_nlp
-
-from spacy.util import minibatch
 import weakref
 import functools
 from collections import OrderedDict
@@ -28,10 +25,11 @@ from .compat import izip, basestring_, is_python2, class_types
 from .gold import GoldParse
 from .scorer import Scorer
 from ._ml import link_vectors_to_models, create_default_optimizer
-from .attrs import IS_STOP, LANG
+from .attrs import IS_STOP, LANG, NORM
 from .lang.punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
 from .lang.punctuation import TOKENIZER_INFIXES
 from .lang.tokenizer_exceptions import TOKEN_MATCH, TOKEN_MATCH_WITH_AFFIXES
+from .lang.norm_exceptions import BASE_NORMS
 from .lang.tag_map import TAG_MAP
 from .tokens import Doc
 from .lang.lex_attrs import LEX_ATTRS, is_stop
@@ -77,6 +75,11 @@ class BaseDefaults(object):
             lemmatizer=lemmatizer,
             lookups=lookups,
         )
+        vocab.lex_attr_getters[NORM] = util.add_lookups(
+            vocab.lex_attr_getters.get(NORM, LEX_ATTRS[NORM]),
+            BASE_NORMS,
+            vocab.lookups.get_table("lexeme_norm"),
+        )
         for tag_str, exc in cls.morph_rules.items():
             for orth_str, attrs in exc.items():
                 vocab.morphology.add_special_case(tag_str, orth_str, attrs)
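For context, `util.add_lookups` chains extra lookup tables in front of a default attribute getter. Roughly (a simplified stand-in, not the actual spaCy implementation):

def add_lookups(default_getter, *tables):
    def get_norm(string):
        # prefer entries from the extra tables, in order
        for table in tables:
            if string in table:
                return table[string]
        # otherwise fall back to the default lexical attribute getter
        return default_getter(string)
    return get_norm

base_norms = {"dont": "do not"}
lexeme_norm = {"gonna": "going to"}
norm_getter = add_lookups(lambda s: s.lower(), base_norms, lexeme_norm)
print(norm_getter("gonna"))  # 'going to'
print(norm_getter("Hello"))  # 'hello' (default getter)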
@@ -417,7 +420,7 @@ class Language(object):

     def __call__(self, text, disable=[], component_cfg=None):
         """Apply the pipeline to some text. The text can span multiple sentences,
-        and can contain arbtrary whitespace. Alignment into the original string
+        and can contain arbitrary whitespace. Alignment into the original string
         is preserved.

         text (unicode): The text to be processed.
@@ -849,7 +852,7 @@ class Language(object):
            *[mp.Pipe(False) for _ in range(n_process)]
        )

-        batch_texts = minibatch(texts, batch_size)
+        batch_texts = util.minibatch(texts, batch_size)
         # Sender sends texts to the workers.
         # This is necessary to properly handle infinite length of texts.
         # (In this case, all data cannot be sent to the workers at once)
@@ -907,9 +910,8 @@ class Language(object):
         serializers["tokenizer"] = lambda p: self.tokenizer.to_disk(
             p, exclude=["vocab"]
         )
-        serializers["meta.json"] = lambda p: p.open("w").write(
-            srsly.json_dumps(self.meta)
-        )
+        serializers["meta.json"] = lambda p: srsly.write_json(p, self.meta)
         for name, proc in self.pipeline:
             if not hasattr(proc, "name"):
                 continue
@@ -973,7 +975,9 @@ class Language(object):
         serializers = OrderedDict()
         serializers["vocab"] = lambda: self.vocab.to_bytes()
         serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"])
-        serializers["meta.json"] = lambda: srsly.json_dumps(OrderedDict(sorted(self.meta.items())))
+        serializers["meta.json"] = lambda: srsly.json_dumps(
+            OrderedDict(sorted(self.meta.items()))
+        )
         for name, proc in self.pipeline:
             if name in exclude:
                 continue
@@ -1075,7 +1079,7 @@ def _fix_pretrained_vectors_name(nlp):
     else:
         raise ValueError(Errors.E092)
     if nlp.vocab.vectors.size != 0:
-        link_vectors_to_models(nlp.vocab)
+        link_vectors_to_models(nlp.vocab, skip_rank=True)
     for name, proc in nlp.pipeline:
         if not hasattr(proc, "cfg"):
             continue
@@ -6,6 +6,7 @@ from collections import OrderedDict
 from .symbols import NOUN, VERB, ADJ, PUNCT, PROPN
 from .errors import Errors
 from .lookups import Lookups
+from .parts_of_speech import NAMES as UPOS_NAMES


 class Lemmatizer(object):
@@ -43,17 +44,11 @@ class Lemmatizer(object):
         lookup_table = self.lookups.get_table("lemma_lookup", {})
         if "lemma_rules" not in self.lookups:
             return [lookup_table.get(string, string)]
-        if univ_pos in (NOUN, "NOUN", "noun"):
-            univ_pos = "noun"
-        elif univ_pos in (VERB, "VERB", "verb"):
-            univ_pos = "verb"
-        elif univ_pos in (ADJ, "ADJ", "adj"):
-            univ_pos = "adj"
-        elif univ_pos in (PUNCT, "PUNCT", "punct"):
-            univ_pos = "punct"
-        elif univ_pos in (PROPN, "PROPN"):
-            return [string]
-        else:
+        if isinstance(univ_pos, int):
+            univ_pos = UPOS_NAMES.get(univ_pos, "X")
+        univ_pos = univ_pos.lower()
+
+        if univ_pos in ("", "eol", "space"):
             return [string.lower()]
         # See Issue #435 for example of where this logic is requied.
         if self.is_base_form(univ_pos, morphology):
@@ -61,6 +56,11 @@ class Lemmatizer(object):
         index_table = self.lookups.get_table("lemma_index", {})
         exc_table = self.lookups.get_table("lemma_exc", {})
         rules_table = self.lookups.get_table("lemma_rules", {})
+        if not any((index_table.get(univ_pos), exc_table.get(univ_pos), rules_table.get(univ_pos))):
+            if univ_pos == "propn":
+                return [string]
+            else:
+                return [string.lower()]
         lemmas = self.lemmatize(
             string,
             index_table.get(univ_pos, {}),
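A small sketch of the new normalization step for `univ_pos` (assumes spaCy v2.x, where `spacy.parts_of_speech.NAMES` maps the internal POS IDs to UPOS strings):

from spacy.parts_of_speech import NAMES as UPOS_NAMES
from spacy.symbols import VERB

univ_pos = VERB  # an integer symbol, e.g. coming from the tagger
if isinstance(univ_pos, int):
    univ_pos = UPOS_NAMES.get(univ_pos, "X")
univ_pos = univ_pos.lower()
print(univ_pos)  # expected: 'verb'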
@@ -1,8 +1,8 @@
 from .typedefs cimport attr_t, hash_t, flags_t, len_t, tag_t
 from .attrs cimport attr_id_t
-from .attrs cimport ID, ORTH, LOWER, NORM, SHAPE, PREFIX, SUFFIX, LENGTH, CLUSTER, LANG
+from .attrs cimport ID, ORTH, LOWER, NORM, SHAPE, PREFIX, SUFFIX, LENGTH, LANG

-from .structs cimport LexemeC, SerializedLexemeC
+from .structs cimport LexemeC
 from .strings cimport StringStore
 from .vocab cimport Vocab

@@ -24,22 +24,6 @@ cdef class Lexeme:
         self.vocab = vocab
         self.orth = lex.orth

-    @staticmethod
-    cdef inline SerializedLexemeC c_to_bytes(const LexemeC* lex) nogil:
-        cdef SerializedLexemeC lex_data
-        buff = <const unsigned char*>&lex.flags
-        end = <const unsigned char*>&lex.sentiment + sizeof(lex.sentiment)
-        for i in range(sizeof(lex_data.data)):
-            lex_data.data[i] = buff[i]
-        return lex_data
-
-    @staticmethod
-    cdef inline void c_from_bytes(LexemeC* lex, SerializedLexemeC lex_data) nogil:
-        buff = <unsigned char*>&lex.flags
-        end = <unsigned char*>&lex.sentiment + sizeof(lex.sentiment)
-        for i in range(sizeof(lex_data.data)):
-            buff[i] = lex_data.data[i]
-
     @staticmethod
     cdef inline void set_struct_attr(LexemeC* lex, attr_id_t name, attr_t value) nogil:
         if name < (sizeof(flags_t) * 8):
@@ -56,8 +40,6 @@ cdef class Lexeme:
             lex.prefix = value
         elif name == SUFFIX:
             lex.suffix = value
-        elif name == CLUSTER:
-            lex.cluster = value
         elif name == LANG:
             lex.lang = value
@@ -84,8 +66,6 @@ cdef class Lexeme:
             return lex.suffix
         elif feat_name == LENGTH:
             return lex.length
-        elif feat_name == CLUSTER:
-            return lex.cluster
         elif feat_name == LANG:
             return lex.lang
         else:
@@ -17,7 +17,7 @@ from .typedefs cimport attr_t, flags_t
 from .attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE
 from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP
 from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT
-from .attrs cimport IS_CURRENCY, IS_OOV, PROB
+from .attrs cimport IS_CURRENCY

 from .attrs import intify_attrs
 from .errors import Errors, Warnings
@@ -89,11 +89,10 @@ cdef class Lexeme:
         cdef attr_id_t attr
         attrs = intify_attrs(attrs)
         for attr, value in attrs.items():
-            if attr == PROB:
-                self.c.prob = value
-            elif attr == CLUSTER:
-                self.c.cluster = int(value)
-            elif isinstance(value, int) or isinstance(value, long):
+            # skip PROB, e.g. from lexemes.jsonl
+            if isinstance(value, float):
+                continue
+            elif isinstance(value, (int, long)):
                 Lexeme.set_struct_attr(self.c, attr, value)
             else:
                 Lexeme.set_struct_attr(self.c, attr, self.vocab.strings.add(value))
@@ -137,34 +136,6 @@ cdef class Lexeme:
         xp = get_array_module(vector)
         return (xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm))

-    def to_bytes(self):
-        lex_data = Lexeme.c_to_bytes(self.c)
-        start = <const char*>&self.c.flags
-        end = <const char*>&self.c.sentiment + sizeof(self.c.sentiment)
-        if (end-start) != sizeof(lex_data.data):
-            raise ValueError(Errors.E072.format(length=end-start,
-                                                bad_length=sizeof(lex_data.data)))
-        byte_string = b"\0" * sizeof(lex_data.data)
-        byte_chars = <char*>byte_string
-        for i in range(sizeof(lex_data.data)):
-            byte_chars[i] = lex_data.data[i]
-        if len(byte_string) != sizeof(lex_data.data):
-            raise ValueError(Errors.E072.format(length=len(byte_string),
-                                                bad_length=sizeof(lex_data.data)))
-        return byte_string
-
-    def from_bytes(self, bytes byte_string):
-        # This method doesn't really have a use-case --- wrote it for testing.
-        # Possibly delete? It puts the Lexeme out of synch with the vocab.
-        cdef SerializedLexemeC lex_data
-        if len(byte_string) != sizeof(lex_data.data):
-            raise ValueError(Errors.E072.format(length=len(byte_string),
-                                                bad_length=sizeof(lex_data.data)))
-        for i in range(len(byte_string)):
-            lex_data.data[i] = byte_string[i]
-        Lexeme.c_from_bytes(self.c, lex_data)
-        self.orth = self.c.orth
-
     @property
     def has_vector(self):
         """RETURNS (bool): Whether a word vector is associated with the object.
@@ -208,10 +179,14 @@ cdef class Lexeme:
         """RETURNS (float): A scalar value indicating the positivity or
             negativity of the lexeme."""
         def __get__(self):
-            return self.c.sentiment
+            sentiment_table = self.vocab.lookups.get_table("lexeme_sentiment", {})
+            return sentiment_table.get(self.c.orth, 0.0)

-        def __set__(self, float sentiment):
-            self.c.sentiment = sentiment
+        def __set__(self, float x):
+            if "lexeme_sentiment" not in self.vocab.lookups:
+                self.vocab.lookups.add_table("lexeme_sentiment")
+            sentiment_table = self.vocab.lookups.get_table("lexeme_sentiment")
+            sentiment_table[self.c.orth] = x

     @property
     def orth_(self):
@@ -241,6 +216,10 @@ cdef class Lexeme:
             return self.c.norm

         def __set__(self, attr_t x):
+            if "lexeme_norm" not in self.vocab.lookups:
+                self.vocab.lookups.add_table("lexeme_norm")
+            norm_table = self.vocab.lookups.get_table("lexeme_norm")
+            norm_table[self.c.orth] = self.vocab.strings[x]
             self.c.norm = x

     property shape:
@@ -276,10 +255,12 @@ cdef class Lexeme:
     property cluster:
         """RETURNS (int): Brown cluster ID."""
         def __get__(self):
-            return self.c.cluster
+            cluster_table = self.vocab.load_extra_lookups("lexeme_cluster")
+            return cluster_table.get(self.c.orth, 0)

-        def __set__(self, attr_t x):
-            self.c.cluster = x
+        def __set__(self, int x):
+            cluster_table = self.vocab.load_extra_lookups("lexeme_cluster")
+            cluster_table[self.c.orth] = x

     property lang:
         """RETURNS (uint64): Language of the parent vocabulary."""
@@ -293,10 +274,14 @@ cdef class Lexeme:
         """RETURNS (float): Smoothed log probability estimate of the lexeme's
             type."""
         def __get__(self):
-            return self.c.prob
+            prob_table = self.vocab.load_extra_lookups("lexeme_prob")
+            settings_table = self.vocab.load_extra_lookups("lexeme_settings")
+            default_oov_prob = settings_table.get("oov_prob", -20.0)
+            return prob_table.get(self.c.orth, default_oov_prob)

         def __set__(self, float x):
-            self.c.prob = x
+            prob_table = self.vocab.load_extra_lookups("lexeme_prob")
+            prob_table[self.c.orth] = x

     property lower_:
         """RETURNS (unicode): Lowercase form of the word."""
@@ -314,7 +299,7 @@ cdef class Lexeme:
             return self.vocab.strings[self.c.norm]

         def __set__(self, unicode x):
-            self.c.norm = self.vocab.strings.add(x)
+            self.norm = self.vocab.strings.add(x)

     property shape_:
         """RETURNS (unicode): Transform of the word's string, to show
@@ -362,13 +347,10 @@ cdef class Lexeme:
         def __set__(self, flags_t x):
             self.c.flags = x

-    property is_oov:
+    @property
+    def is_oov(self):
         """RETURNS (bool): Whether the lexeme is out-of-vocabulary."""
-        def __get__(self):
-            return Lexeme.c_check_flag(self.c, IS_OOV)
-
-        def __set__(self, attr_t x):
-            Lexeme.c_set_flag(self.c, IS_OOV, x)
+        return self.orth in self.vocab.vectors

     property is_stop:
         """RETURNS (bool): Whether the lexeme is a stop word."""
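In short, attributes like `norm`, `cluster`, `prob` and `sentiment` now read from and write to lookup tables on the shared vocab instead of fields on the C struct. A hedged example of the write-through behavior for the norm (assumes a v2.3-style spaCy install):

import spacy

nlp = spacy.blank("en")
lex = nlp.vocab["colour"]
lex.norm_ = "color"  # also records the value in the "lexeme_norm" table
norm_table = nlp.vocab.lookups.get_table("lexeme_norm")
print(norm_table[lex.orth])  # expected: 'color'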
@@ -124,7 +124,7 @@ class Lookups(object):
             self._tables[key].update(value)
         return self

-    def to_disk(self, path, **kwargs):
+    def to_disk(self, path, filename="lookups.bin", **kwargs):
         """Save the lookups to a directory as lookups.bin. Expects a path to a
         directory, which will be created if it doesn't exist.

@@ -136,11 +136,11 @@ class Lookups(object):
         path = ensure_path(path)
         if not path.exists():
             path.mkdir()
-        filepath = path / "lookups.bin"
+        filepath = path / filename
         with filepath.open("wb") as file_:
             file_.write(self.to_bytes())

-    def from_disk(self, path, **kwargs):
+    def from_disk(self, path, filename="lookups.bin", **kwargs):
         """Load lookups from a directory containing a lookups.bin. Will skip
         loading if the file doesn't exist.

@@ -150,7 +150,7 @@ class Lookups(object):
         DOCS: https://spacy.io/api/lookups#from_disk
         """
         path = ensure_path(path)
-        filepath = path / "lookups.bin"
+        filepath = path / filename
         if filepath.exists():
             with filepath.open("rb") as file_:
                 data = file_.read()
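A quick usage sketch of the new `filename` argument (paths and table contents are made up):

from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("lemma_lookup", {"went": "go"})
lookups.to_disk("/tmp/lookups_demo", filename="lemma_lookups.bin")

reloaded = Lookups()
reloaded.from_disk("/tmp/lookups_demo", filename="lemma_lookups.bin")
print(reloaded.get_table("lemma_lookup")["went"])  # expected: 'go'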
@@ -213,28 +213,28 @@ cdef class Matcher:
         else:
             yield doc

-    def __call__(self, object doc_or_span):
+    def __call__(self, object doclike):
         """Find all token sequences matching the supplied pattern.

-        doc_or_span (Doc or Span): The document to match over.
+        doclike (Doc or Span): The document to match over.
         RETURNS (list): A list of `(key, start, end)` tuples,
             describing the matches. A match tuple describes a span
             `doc[start:end]`. The `label_id` and `key` are both integers.
         """
-        if isinstance(doc_or_span, Doc):
-            doc = doc_or_span
+        if isinstance(doclike, Doc):
+            doc = doclike
             length = len(doc)
-        elif isinstance(doc_or_span, Span):
-            doc = doc_or_span.doc
-            length = doc_or_span.end - doc_or_span.start
+        elif isinstance(doclike, Span):
+            doc = doclike.doc
+            length = doclike.end - doclike.start
         else:
-            raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doc_or_span).__name__))
+            raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))
         if len(set([LEMMA, POS, TAG]) & self._seen_attrs) > 0 \
                 and not doc.is_tagged:
             raise ValueError(Errors.E155.format())
         if DEP in self._seen_attrs and not doc.is_parsed:
             raise ValueError(Errors.E156.format())
-        matches = find_matches(&self.patterns[0], self.patterns.size(), doc_or_span, length,
+        matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length,
                                extensions=self._extensions, predicates=self._extra_predicates)
         for i, (key, start, end) in enumerate(matches):
             on_match = self._callbacks.get(key, None)
@@ -257,7 +257,7 @@ def unpickle_matcher(vocab, patterns, callbacks):
     return matcher


-cdef find_matches(TokenPatternC** patterns, int n, object doc_or_span, int length, extensions=None, predicates=tuple()):
+cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple()):
     """Find matches in a doc, with a compiled array of patterns. Matches are
     returned as a list of (id, start, end) tuples.

@@ -286,7 +286,7 @@ cdef find_matches(TokenPatternC** patterns, int n, object doc_or_span, int lengt
     else:
         nr_extra_attr = 0
         extra_attr_values = <attr_t*>mem.alloc(length, sizeof(attr_t))
-    for i, token in enumerate(doc_or_span):
+    for i, token in enumerate(doclike):
         for name, index in extensions.items():
             value = token._.get(name)
             if isinstance(value, basestring):
@@ -298,7 +298,7 @@ cdef find_matches(TokenPatternC** patterns, int n, object doc_or_span, int lengt
         for j in range(n):
             states.push_back(PatternStateC(patterns[j], i, 0))
         transition_states(states, matches, predicate_cache,
-                          doc_or_span[i], extra_attr_values, predicates)
+                          doclike[i], extra_attr_values, predicates)
         extra_attr_values += nr_extra_attr
         predicate_cache += len(predicates)
     # Handle matches that end in 0-width patterns
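The rename to `doclike` reflects that the matcher accepts both `Doc` and `Span`; a minimal hedged example (pattern and text invented):

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("HELLO_WORLD", [[{"LOWER": "hello"}, {"LOWER": "world"}]])
doc = nlp("Hello world! Hello again.")
span = doc[0:3]       # a Span is accepted just like a Doc
print(matcher(span))  # list of (match_id, start, end) tuples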
@@ -203,7 +203,7 @@ class Pipe(object):
         serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
         serialize["vocab"] = lambda p: self.vocab.to_disk(p)
         if self.model not in (None, True, False):
-            serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes())
+            serialize["model"] = lambda p: self.model.to_disk(p)
         exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
         util.to_disk(path, serialize, exclude)

@@ -626,7 +626,7 @@ class Tagger(Pipe):
         serialize = OrderedDict((
             ("vocab", lambda p: self.vocab.to_disk(p)),
             ("tag_map", lambda p: srsly.write_msgpack(p, tag_map)),
-            ("model", lambda p: p.open("wb").write(self.model.to_bytes())),
+            ("model", lambda p: self.model.to_disk(p)),
             ("cfg", lambda p: srsly.write_json(p, self.cfg))
         ))
         exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
@@ -1395,7 +1395,7 @@ class EntityLinker(Pipe):
         serialize["vocab"] = lambda p: self.vocab.to_disk(p)
         serialize["kb"] = lambda p: self.kb.dump(p)
         if self.model not in (None, True, False):
-            serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes())
+            serialize["model"] = lambda p: self.model.to_disk(p)
         exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
         util.to_disk(path, serialize, exclude)
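These serialization hunks swap the manual `p.open("wb").write(self.model.to_bytes())` pattern for the model's own `to_disk`; from the user's side the API is unchanged, e.g. (a hedged round-trip on a blank pipeline):

import spacy

nlp = spacy.blank("en")
nlp.to_disk("/tmp/blank_en")  # components delegate to model.to_disk internally
nlp2 = spacy.load("/tmp/blank_en")
print(nlp2.pipe_names)        # [] for a blank pipeline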
@@ -23,29 +23,6 @@ cdef struct LexemeC:
     attr_t prefix
     attr_t suffix

-    attr_t cluster
-
-    float prob
-    float sentiment
-
-
-cdef struct SerializedLexemeC:
-    unsigned char[8 + 8*10 + 4 + 4] data
-    # sizeof(flags_t)  # flags
-    # + sizeof(attr_t)  # lang
-    # + sizeof(attr_t)  # id
-    # + sizeof(attr_t)  # length
-    # + sizeof(attr_t)  # orth
-    # + sizeof(attr_t)  # lower
-    # + sizeof(attr_t)  # norm
-    # + sizeof(attr_t)  # shape
-    # + sizeof(attr_t)  # prefix
-    # + sizeof(attr_t)  # suffix
-    # + sizeof(attr_t)  # cluster
-    # + sizeof(float)  # prob
-    # + sizeof(float)  # cluster
-    # + sizeof(float)  # l2_norm
-

 cdef struct SpanC:
     hash_t id
@@ -12,7 +12,7 @@ cdef enum symbol_t:
     LIKE_NUM
     LIKE_EMAIL
     IS_STOP
-    IS_OOV
+    IS_OOV_DEPRECATED
    IS_BRACKET
    IS_QUOTE
    IS_LEFT_PUNCT
@@ -17,7 +17,7 @@ IDS = {
     "LIKE_NUM": LIKE_NUM,
     "LIKE_EMAIL": LIKE_EMAIL,
     "IS_STOP": IS_STOP,
-    "IS_OOV": IS_OOV,
+    "IS_OOV_DEPRECATED": IS_OOV_DEPRECATED,
     "IS_BRACKET": IS_BRACKET,
     "IS_QUOTE": IS_QUOTE,
     "IS_LEFT_PUNCT": IS_LEFT_PUNCT,
@@ -9,7 +9,6 @@ import numpy
 cimport cython.parallel
 import numpy.random
 cimport numpy as np
-from itertools import islice
 from cpython.ref cimport PyObject, Py_XDECREF
 from cpython.exc cimport PyErr_CheckSignals, PyErr_SetFromErrno
 from libc.math cimport exp
@@ -621,15 +620,15 @@ cdef class Parser:
             self.model, cfg = self.Model(self.moves.n_moves, **cfg)
             if sgd is None:
                 sgd = self.create_optimizer()
-            doc_sample = []
-            gold_sample = []
-            for raw_text, annots_brackets in islice(get_gold_tuples(), 1000):
+            docs = []
+            golds = []
+            for raw_text, annots_brackets in get_gold_tuples():
                 for annots, brackets in annots_brackets:
                     ids, words, tags, heads, deps, ents = annots
-                    doc_sample.append(Doc(self.vocab, words=words))
-                    gold_sample.append(GoldParse(doc_sample[-1], words=words, tags=tags,
-                                                 heads=heads, deps=deps, entities=ents))
-            self.model.begin_training(doc_sample, gold_sample)
+                    docs.append(Doc(self.vocab, words=words))
+                    golds.append(GoldParse(docs[-1], words=words, tags=tags,
+                                           heads=heads, deps=deps, entities=ents))
+            self.model.begin_training(docs, golds)
             if pipeline is not None:
                 self.init_multitask_objectives(get_gold_tuples, pipeline, sgd=sgd, **cfg)
             link_vectors_to_models(self.vocab)
@@ -88,6 +88,11 @@ def eu_tokenizer():
     return get_lang_class("eu").Defaults.create_tokenizer()


+@pytest.fixture(scope="session")
+def fa_tokenizer():
+    return get_lang_class("fa").Defaults.create_tokenizer()
+
+
 @pytest.fixture(scope="session")
 def fi_tokenizer():
     return get_lang_class("fi").Defaults.create_tokenizer()
@@ -107,6 +112,7 @@ def ga_tokenizer():
 def gu_tokenizer():
     return get_lang_class("gu").Defaults.create_tokenizer()

+
 @pytest.fixture(scope="session")
 def he_tokenizer():
     return get_lang_class("he").Defaults.create_tokenizer()
@@ -241,7 +247,9 @@ def yo_tokenizer():

 @pytest.fixture(scope="session")
 def zh_tokenizer_char():
-    return get_lang_class("zh").Defaults.create_tokenizer(config={"use_jieba": False, "use_pkuseg": False})
+    return get_lang_class("zh").Defaults.create_tokenizer(
+        config={"use_jieba": False, "use_pkuseg": False}
+    )


 @pytest.fixture(scope="session")
@@ -253,7 +261,9 @@ def zh_tokenizer_jieba():
 @pytest.fixture(scope="session")
 def zh_tokenizer_pkuseg():
     pytest.importorskip("pkuseg")
-    return get_lang_class("zh").Defaults.create_tokenizer(config={"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True})
+    return get_lang_class("zh").Defaults.create_tokenizer(
+        config={"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}
+    )


 @pytest.fixture(scope="session")
@@ -50,7 +50,9 @@ def test_create_from_words_and_text(vocab):
     assert [t.text for t in doc] == [" ", "'", "dogs", "'", "\n\n", "run", " "]
     assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""]
     assert doc.text == text
-    assert [t.text for t in doc if not t.text.isspace()] == [word for word in words if not word.isspace()]
+    assert [t.text for t in doc if not t.text.isspace()] == [
+        word for word in words if not word.isspace()
+    ]

     # partial whitespace in words
     words = [" ", "'", "dogs", "'", "\n\n", "run", " "]
@@ -60,7 +62,9 @@ def test_create_from_words_and_text(vocab):
     assert [t.text for t in doc] == [" ", "'", "dogs", "'", "\n\n", "run", " "]
     assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""]
     assert doc.text == text
-    assert [t.text for t in doc if not t.text.isspace()] == [word for word in words if not word.isspace()]
+    assert [t.text for t in doc if not t.text.isspace()] == [
+        word for word in words if not word.isspace()
+    ]

     # non-standard whitespace tokens
     words = [" ", " ", "'", "dogs", "'", "\n\n", "run"]
@@ -70,7 +74,9 @@ def test_create_from_words_and_text(vocab):
     assert [t.text for t in doc] == [" ", "'", "dogs", "'", "\n\n", "run", " "]
     assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""]
     assert doc.text == text
-    assert [t.text for t in doc if not t.text.isspace()] == [word for word in words if not word.isspace()]
+    assert [t.text for t in doc if not t.text.isspace()] == [
+        word for word in words if not word.isspace()
+    ]

     # mismatch between words and text
     with pytest.raises(ValueError):
@@ -181,6 +181,7 @@ def test_is_sent_start(en_tokenizer):
     doc.is_parsed = True
     assert len(list(doc.sents)) == 2

+
 def test_is_sent_end(en_tokenizer):
     doc = en_tokenizer("This is a sentence. This is another.")
     assert doc[4].is_sent_end is None
@@ -213,6 +214,7 @@ def test_token0_has_sent_start_true():
     assert doc[1].is_sent_start is None
     assert not doc.is_sentenced

+
 def test_tokenlast_has_sent_end_true():
     doc = Doc(Vocab(), words=["hello", "world"])
     assert doc[0].is_sent_end is None
@@ -37,14 +37,6 @@ def test_da_tokenizer_handles_custom_base_exc(da_tokenizer):
     assert tokens[7].text == "."


-@pytest.mark.parametrize(
-    "text,norm", [("akvarium", "akvarie"), ("bedstemoder", "bedstemor")]
-)
-def test_da_tokenizer_norm_exceptions(da_tokenizer, text, norm):
-    tokens = da_tokenizer(text)
-    assert tokens[0].norm_ == norm
-
-
 @pytest.mark.parametrize(
     "text,n_tokens",
     [
@@ -22,17 +22,3 @@ def test_de_tokenizer_handles_exc_in_text(de_tokenizer):
     assert len(tokens) == 6
     assert tokens[2].text == "z.Zt."
     assert tokens[2].lemma_ == "zur Zeit"
-
-
-@pytest.mark.parametrize(
-    "text,norms", [("vor'm", ["vor", "dem"]), ("du's", ["du", "es"])]
-)
-def test_de_tokenizer_norm_exceptions(de_tokenizer, text, norms):
-    tokens = de_tokenizer(text)
-    assert [token.norm_ for token in tokens] == norms
-
-
-@pytest.mark.parametrize("text,norm", [("daß", "dass")])
-def test_de_lex_attrs_norm_exceptions(de_tokenizer, text, norm):
-    tokens = de_tokenizer(text)
-    assert tokens[0].norm_ == norm
16
spacy/tests/lang/de/test_noun_chunks.py
Normal file
@@ -0,0 +1,16 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_noun_chunks_is_parsed_de(de_tokenizer):
+    """Test that noun_chunks raises Value Error for 'de' language if Doc is not parsed.
+    To check this test, we're constructing a Doc
+    with a new Vocab here and forcing is_parsed to 'False'
+    to make sure the noun chunks don't run.
+    """
+    doc = de_tokenizer("Er lag auf seinem")
+    doc.is_parsed = False
+    with pytest.raises(ValueError):
+        list(doc.noun_chunks)
16
spacy/tests/lang/el/test_noun_chunks.py
Normal file
@@ -0,0 +1,16 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_noun_chunks_is_parsed_el(el_tokenizer):
+    """Test that noun_chunks raises Value Error for 'el' language if Doc is not parsed.
+    To check this test, we're constructing a Doc
+    with a new Vocab here and forcing is_parsed to 'False'
+    to make sure the noun chunks don't run.
+    """
+    doc = el_tokenizer("είναι χώρα της νοτιοανατολικής")
+    doc.is_parsed = False
+    with pytest.raises(ValueError):
+        list(doc.noun_chunks)
|
@ -118,6 +118,7 @@ def test_en_tokenizer_norm_exceptions(en_tokenizer, text, norms):
|
||||||
assert [token.norm_ for token in tokens] == norms
|
assert [token.norm_ for token in tokens] == norms
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.skip
|
||||||
@pytest.mark.parametrize(
|
@pytest.mark.parametrize(
|
||||||
"text,norm", [("radicalised", "radicalized"), ("cuz", "because")]
|
"text,norm", [("radicalised", "radicalized"), ("cuz", "because")]
|
||||||
)
|
)
|
||||||
|
|
|
@@ -6,9 +6,24 @@ from spacy.attrs import HEAD, DEP
 from spacy.symbols import nsubj, dobj, amod, nmod, conj, cc, root
 from spacy.lang.en.syntax_iterators import SYNTAX_ITERATORS
 
+import pytest
+
+
 from ...util import get_doc
 
 
+def test_noun_chunks_is_parsed(en_tokenizer):
+    """Test that noun_chunks raises Value Error for 'en' language if Doc is not parsed.
+    To check this test, we're constructing a Doc
+    with a new Vocab here and forcing is_parsed to 'False'
+    to make sure the noun chunks don't run.
+    """
+    doc = en_tokenizer("This is a sentence")
+    doc.is_parsed = False
+    with pytest.raises(ValueError):
+        list(doc.noun_chunks)
+
+
 def test_en_noun_chunks_not_nested(en_vocab):
     words = ["Peter", "has", "chronic", "command", "and", "control", "issues"]
     heads = [1, 0, 4, 3, -1, -2, -5]
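The guard that the new test above exercises can also be observed outside the test suite. A minimal sketch, assuming spaCy v2.x behaviour where Doc.noun_chunks raises ValueError when no dependency parse is present:

import pytest
import spacy


def test_noun_chunks_need_a_parse():
    nlp = spacy.blank("en")          # tokenizer only, no parser in the pipeline
    doc = nlp("This is a sentence")  # so the Doc never gets a dependency parse
    with pytest.raises(ValueError):
        list(doc.noun_chunks)        # noun_chunks requires the parse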
16 spacy/tests/lang/es/test_noun_chunks.py Normal file
@@ -0,0 +1,16 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_noun_chunks_is_parsed_es(es_tokenizer):
+    """Test that noun_chunks raises Value Error for 'es' language if Doc is not parsed.
+    To check this test, we're constructing a Doc
+    with a new Vocab here and forcing is_parsed to 'False'
+    to make sure the noun chunks don't run.
+    """
+    doc = es_tokenizer("en Oxford este verano")
+    doc.is_parsed = False
+    with pytest.raises(ValueError):
+        list(doc.noun_chunks)
0 spacy/tests/lang/fa/__init__.py Normal file
17 spacy/tests/lang/fa/test_noun_chunks.py Normal file
@@ -0,0 +1,17 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_noun_chunks_is_parsed_fa(fa_tokenizer):
+    """Test that noun_chunks raises Value Error for 'fa' language if Doc is not parsed.
+    To check this test, we're constructing a Doc
+    with a new Vocab here and forcing is_parsed to 'False'
+    to make sure the noun chunks don't run.
+    """
+
+    doc = fa_tokenizer("این یک جمله نمونه می باشد.")
+    doc.is_parsed = False
+    with pytest.raises(ValueError):
+        list(doc.noun_chunks)
16 spacy/tests/lang/fr/test_noun_chunks.py Normal file
@@ -0,0 +1,16 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_noun_chunks_is_parsed_fr(fr_tokenizer):
+    """Test that noun_chunks raises Value Error for 'fr' language if Doc is not parsed.
+    To check this test, we're constructing a Doc
+    with a new Vocab here and forcing is_parsed to 'False'
+    to make sure the noun chunks don't run.
+    """
+    doc = fr_tokenizer("trouver des travaux antérieurs")
+    doc.is_parsed = False
+    with pytest.raises(ValueError):
+        list(doc.noun_chunks)
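The five new files above are essentially the same test repeated per language. Not part of this diff, but as an illustration of the shared pattern, here is a hypothetical consolidated version that parametrizes over the language-specific tokenizer fixtures from spacy/tests/conftest.py, reusing the sample texts from the files above:

import pytest

# (fixture name, sample text) pairs taken from the per-language tests above
SAMPLES = [
    ("de_tokenizer", "Er lag auf seinem"),
    ("el_tokenizer", "είναι χώρα της νοτιοανατολικής"),
    ("es_tokenizer", "en Oxford este verano"),
    ("fa_tokenizer", "این یک جمله نمونه می باشد."),
    ("fr_tokenizer", "trouver des travaux antérieurs"),
]


@pytest.mark.parametrize("fixture_name,text", SAMPLES)
def test_noun_chunks_require_parse(request, fixture_name, text):
    # Resolve the language-specific tokenizer fixture by name.
    tokenizer = request.getfixturevalue(fixture_name)
    doc = tokenizer(text)
    doc.is_parsed = False  # force an unparsed Doc, as in the tests above
    with pytest.raises(ValueError):
        list(doc.noun_chunks)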
@@ -3,17 +3,16 @@ from __future__ import unicode_literals
 
 import pytest
 
 
 def test_gu_tokenizer_handlers_long_text(gu_tokenizer):
     text = """પશ્ચિમ ભારતમાં આવેલું ગુજરાત રાજ્ય જે વ્યક્તિઓની માતૃભૂમિ છે"""
     tokens = gu_tokenizer(text)
     assert len(tokens) == 9
 
 
 @pytest.mark.parametrize(
     "text,length",
-    [
-        ("ગુજરાતીઓ ખાવાના શોખીન માનવામાં આવે છે", 6),
-        ("ખેતરની ખેડ કરવામાં આવે છે.", 5),
-    ],
+    [("ગુજરાતીઓ ખાવાના શોખીન માનવામાં આવે છે", 6), ("ખેતરની ખેડ કરવામાં આવે છે.", 5)],
 )
 def test_gu_tokenizer_handles_cnts(gu_tokenizer, text, length):
     tokens = gu_tokenizer(text)
Some files were not shown because too many files have changed in this diff.