Mirror of https://github.com/explosion/spaCy.git (synced 2024-12-24 17:06:29 +03:00)
Merge branch 'master' into develop
commit 61d09c481b
.github/contributors/Brixjohn.md (new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [ ] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [X] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Brixter John Lumabi  |
| Company name (if applicable)   | Stratpoint           |
| Title or role (if applicable)  | Software Developer   |
| Date                           | 18 December 2018     |
| GitHub username                | Brixjohn             |
| Website (optional)             |                      |

.github/contributors/amperinet.md (new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                   |
|------------------------------- | ----------------------- |
| Name                           | Amandine Périnet        |
| Company name (if applicable)   | 365Talents              |
| Title or role (if applicable)  | Data Science Researcher |
| Date                           | 12/12/2018              |
| GitHub username                | amperinet               |
| Website (optional)             |                         |

.github/contributors/beatesi.md (new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [ ] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [x] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Beate Sildnes        |
| Company name (if applicable)   | NAV                  |
| Title or role (if applicable)  | Data Scientist       |
| Date                           | 04.12.2018           |
| GitHub username                | beatesi              |
| Website (optional)             |                      |

.github/contributors/chezou.md (new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Aki Ariga            |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 07/12/2018           |
| GitHub username                | chezou               |
| Website (optional)             | chezo.uno            |

.github/contributors/svlandeg.md (new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Sofie Van Landeghem  |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 29 Nov 2018          |
| GitHub username                | svlandeg             |
| Website (optional)             |                      |

.github/contributors/wxv.md (new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Jason Xu             |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2018-11-29           |
| GitHub username                | wxv                  |
| Website (optional)             |                      |

@ -20,9 +20,10 @@ import os
|
|||
import importlib
|
||||
from keras import backend as K
|
||||
|
||||
|
||||
def set_keras_backend(backend):
|
||||
if K.backend() != backend:
|
||||
os.environ['KERAS_BACKEND'] = backend
|
||||
os.environ["KERAS_BACKEND"] = backend
|
||||
importlib.reload(K)
|
||||
assert K.backend() == backend
|
||||
if backend == "tensorflow":
|
||||
|
@ -32,6 +33,7 @@ def set_keras_backend(backend):
|
|||
K.set_session(K.tf.Session(config=cfg))
|
||||
K.clear_session()
|
||||
|
||||
|
||||
set_keras_backend("tensorflow")
|
||||
|
||||
|
||||
|
@ -40,9 +42,8 @@ def train(train_loc, dev_loc, shape, settings):
|
|||
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
|
||||
|
||||
print("Loading spaCy")
|
||||
nlp = spacy.load('en_vectors_web_lg')
|
||||
nlp = spacy.load("en_vectors_web_lg")
|
||||
assert nlp.path is not None
|
||||
|
||||
print("Processing texts...")
|
||||
train_X = create_dataset(nlp, train_texts1, train_texts2, 100, shape[0])
|
||||
dev_X = create_dataset(nlp, dev_texts1, dev_texts2, 100, shape[0])
|
||||
|
@ -54,29 +55,28 @@ def train(train_loc, dev_loc, shape, settings):
|
|||
model.fit(
|
||||
train_X,
|
||||
train_labels,
|
||||
validation_data = (dev_X, dev_labels),
|
||||
epochs = settings['nr_epoch'],
|
||||
batch_size = settings['batch_size'])
|
||||
|
||||
if not (nlp.path / 'similarity').exists():
|
||||
(nlp.path / 'similarity').mkdir()
|
||||
print("Saving to", nlp.path / 'similarity')
|
||||
validation_data=(dev_X, dev_labels),
|
||||
epochs=settings["nr_epoch"],
|
||||
batch_size=settings["batch_size"],
|
||||
)
|
||||
if not (nlp.path / "similarity").exists():
|
||||
(nlp.path / "similarity").mkdir()
|
||||
print("Saving to", nlp.path / "similarity")
|
||||
weights = model.get_weights()
|
||||
# remove the embedding matrix. We can reconstruct it.
|
||||
del weights[1]
|
||||
with (nlp.path / 'similarity' / 'model').open('wb') as file_:
|
||||
with (nlp.path / "similarity" / "model").open("wb") as file_:
|
||||
pickle.dump(weights, file_)
|
||||
with (nlp.path / 'similarity' / 'config.json').open('w') as file_:
|
||||
with (nlp.path / "similarity" / "config.json").open("w") as file_:
|
||||
file_.write(model.to_json())
|
||||
|
||||
|
||||
def evaluate(dev_loc, shape):
|
||||
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
|
||||
nlp = spacy.load('en_vectors_web_lg')
|
||||
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / 'similarity', nlp, shape[0]))
|
||||
|
||||
total = 0.
|
||||
correct = 0.
|
||||
nlp = spacy.load("en_vectors_web_lg")
|
||||
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / "similarity", nlp, shape[0]))
|
||||
total = 0.0
|
||||
correct = 0.0
|
||||
for text1, text2, label in zip(dev_texts1, dev_texts2, dev_labels):
|
||||
doc1 = nlp(text1)
|
||||
doc2 = nlp(text2)
|
||||
|
@ -88,11 +88,11 @@ def evaluate(dev_loc, shape):
|
|||
|
||||
|
||||
def demo(shape):
|
||||
nlp = spacy.load('en_vectors_web_lg')
|
||||
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / 'similarity', nlp, shape[0]))
|
||||
nlp = spacy.load("en_vectors_web_lg")
|
||||
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / "similarity", nlp, shape[0]))
|
||||
|
||||
doc1 = nlp(u'The king of France is bald.')
|
||||
doc2 = nlp(u'France has no king.')
|
||||
doc1 = nlp(u"The king of France is bald.")
|
||||
doc2 = nlp(u"France has no king.")
|
||||
|
||||
print("Sentence 1:", doc1)
|
||||
print("Sentence 2:", doc2)
|
||||
|
@ -101,30 +101,31 @@ def demo(shape):
|
|||
print("Entailment type:", entailment_type, "(Confidence:", confidence, ")")
|
||||
|
||||
|
||||
LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2}
|
||||
LABELS = {"entailment": 0, "contradiction": 1, "neutral": 2}
|
||||
|
||||
|
||||
def read_snli(path):
|
||||
texts1 = []
|
||||
texts2 = []
|
||||
labels = []
|
||||
with open(path, 'r') as file_:
|
||||
with open(path, "r") as file_:
|
||||
for line in file_:
|
||||
eg = json.loads(line)
|
||||
label = eg['gold_label']
|
||||
if label == '-': # per Parikh, ignore - SNLI entries
|
||||
label = eg["gold_label"]
|
||||
if label == "-": # per Parikh, ignore - SNLI entries
|
||||
continue
|
||||
texts1.append(eg['sentence1'])
|
||||
texts2.append(eg['sentence2'])
|
||||
texts1.append(eg["sentence1"])
|
||||
texts2.append(eg["sentence2"])
|
||||
labels.append(LABELS[label])
|
||||
return texts1, texts2, to_categorical(np.asarray(labels, dtype='int32'))
|
||||
return texts1, texts2, to_categorical(np.asarray(labels, dtype="int32"))
|
||||
|
||||
|
||||
def create_dataset(nlp, texts, hypotheses, num_unk, max_length):
|
||||
sents = texts + hypotheses
|
||||
|
||||
sents_as_ids = []
|
||||
for sent in sents:
|
||||
doc = nlp(sent)
|
||||
word_ids = []
|
||||
|
||||
for i, token in enumerate(doc):
|
||||
# skip odd spaces from tokenizer
|
||||
if token.has_vector and token.vector_norm == 0:
|
||||
|
@ -140,13 +141,12 @@ def create_dataset(nlp, texts, hypotheses, num_unk, max_length):
|
|||
word_ids.append(token.rank % num_unk + 1)
|
||||
|
||||
# there must be a simpler way of generating padded arrays from lists...
|
||||
word_id_vec = np.zeros((max_length), dtype='int')
|
||||
word_id_vec = np.zeros((max_length), dtype="int")
|
||||
clipped_len = min(max_length, len(word_ids))
|
||||
word_id_vec[:clipped_len] = word_ids[:clipped_len]
|
||||
sents_as_ids.append(word_id_vec)
|
||||
|
||||
|
||||
return [np.array(sents_as_ids[:len(texts)]), np.array(sents_as_ids[len(texts):])]
|
||||
return [np.array(sents_as_ids[: len(texts)]), np.array(sents_as_ids[len(texts) :])]
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
|
@ -159,39 +159,49 @@ def create_dataset(nlp, texts, hypotheses, num_unk, max_length):
|
|||
learn_rate=("Learning rate", "option", "r", float),
|
||||
batch_size=("Batch size for neural network training", "option", "b", int),
|
||||
nr_epoch=("Number of training epochs", "option", "e", int),
|
||||
entail_dir=("Direction of entailment", "option", "D", str, ["both", "left", "right"])
|
||||
entail_dir=(
|
||||
"Direction of entailment",
|
||||
"option",
|
||||
"D",
|
||||
str,
|
||||
["both", "left", "right"],
|
||||
),
|
||||
)
|
||||
def main(mode, train_loc, dev_loc,
|
||||
max_length = 50,
|
||||
nr_hidden = 200,
|
||||
dropout = 0.2,
|
||||
learn_rate = 0.001,
|
||||
batch_size = 1024,
|
||||
nr_epoch = 10,
|
||||
entail_dir="both"):
|
||||
|
||||
def main(
|
||||
mode,
|
||||
train_loc,
|
||||
dev_loc,
|
||||
max_length=50,
|
||||
nr_hidden=200,
|
||||
dropout=0.2,
|
||||
learn_rate=0.001,
|
||||
batch_size=1024,
|
||||
nr_epoch=10,
|
||||
entail_dir="both",
|
||||
):
|
||||
shape = (max_length, nr_hidden, 3)
|
||||
settings = {
|
||||
'lr': learn_rate,
|
||||
'dropout': dropout,
|
||||
'batch_size': batch_size,
|
||||
'nr_epoch': nr_epoch,
|
||||
'entail_dir': entail_dir
|
||||
"lr": learn_rate,
|
||||
"dropout": dropout,
|
||||
"batch_size": batch_size,
|
||||
"nr_epoch": nr_epoch,
|
||||
"entail_dir": entail_dir,
|
||||
}
|
||||
|
||||
if mode == 'train':
|
||||
if mode == "train":
|
||||
if train_loc == None or dev_loc == None:
|
||||
print("Train mode requires paths to training and development data sets.")
|
||||
sys.exit(1)
|
||||
train(train_loc, dev_loc, shape, settings)
|
||||
elif mode == 'evaluate':
|
||||
if dev_loc == None:
|
||||
elif mode == "evaluate":
|
||||
if dev_loc == None:
|
||||
print("Evaluate mode requires paths to test data set.")
|
||||
sys.exit(1)
|
||||
correct, total = evaluate(dev_loc, shape)
|
||||
print(correct, '/', total, correct / total)
|
||||
print(correct, "/", total, correct / total)
|
||||
else:
|
||||
demo(shape)
|
||||
|
||||
if __name__ == '__main__':
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
|
||||
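The comment in `create_dataset` asks for a simpler way to build padded arrays. A minimal sketch of the same clip-or-pad step follows, assuming plain Python lists instead of NumPy arrays so that it runs without dependencies. `pad_to_length` and `pad_batch` are hypothetical helper names, not part of the example script.

```python
def pad_to_length(word_ids, max_length, pad_value=0):
    """Clip word_ids to max_length, then right-pad with pad_value."""
    clipped = list(word_ids[:max_length])
    return clipped + [pad_value] * (max_length - len(clipped))


def pad_batch(id_lists, max_length):
    """Apply pad_to_length to every sequence so all rows share one length."""
    return [pad_to_length(ids, max_length) for ids in id_lists]


batch = pad_batch([[4, 7, 2], [9, 1, 5, 3, 8, 6]], max_length=4)
print(batch)  # [[4, 7, 2, 0], [9, 1, 5, 3]]
```

Wrapping the result in `np.array(...)` should give a 2-D integer array of the kind `create_dataset` assembles row by row.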
|
|
|
@ -5,11 +5,12 @@ import numpy as np
|
|||
from keras import layers, Model, models, optimizers
|
||||
from keras import backend as K
|
||||
|
||||
|
||||
def build_model(vectors, shape, settings):
|
||||
max_length, nr_hidden, nr_class = shape
|
||||
|
||||
input1 = layers.Input(shape=(max_length,), dtype='int32', name='words1')
|
||||
input2 = layers.Input(shape=(max_length,), dtype='int32', name='words2')
|
||||
input1 = layers.Input(shape=(max_length,), dtype="int32", name="words1")
|
||||
input2 = layers.Input(shape=(max_length,), dtype="int32", name="words2")
|
||||
|
||||
# embeddings (projected)
|
||||
embed = create_embedding(vectors, max_length, nr_hidden)
|
||||
|
@ -23,11 +24,11 @@ def build_model(vectors, shape, settings):
|
|||
|
||||
G = create_feedforward(nr_hidden)
|
||||
|
||||
if settings['entail_dir'] == 'both':
|
||||
if settings["entail_dir"] == "both":
|
||||
norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
|
||||
norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
|
||||
alpha = layers.dot([norm_weights_a, a], axes=1)
|
||||
beta = layers.dot([norm_weights_b, b], axes=1)
|
||||
beta = layers.dot([norm_weights_b, b], axes=1)
|
||||
|
||||
# step 2: compare
|
||||
comp1 = layers.concatenate([a, beta])
|
||||
|
@ -40,7 +41,7 @@ def build_model(vectors, shape, settings):
|
|||
v2_sum = layers.Lambda(sum_word)(v2)
|
||||
concat = layers.concatenate([v1_sum, v2_sum])
|
||||
|
||||
elif settings['entail_dir'] == 'left':
|
||||
elif settings["entail_dir"] == "left":
|
||||
norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
|
||||
alpha = layers.dot([norm_weights_a, a], axes=1)
|
||||
comp2 = layers.concatenate([b, alpha])
|
||||
|
@ -50,7 +51,7 @@ def build_model(vectors, shape, settings):
|
|||
|
||||
else:
|
||||
norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
|
||||
beta = layers.dot([norm_weights_b, b], axes=1)
|
||||
beta = layers.dot([norm_weights_b, b], axes=1)
|
||||
comp1 = layers.concatenate([a, beta])
|
||||
v1 = layers.TimeDistributed(G)(comp1)
|
||||
v1_sum = layers.Lambda(sum_word)(v1)
|
||||
|
@ -58,80 +59,86 @@ def build_model(vectors, shape, settings):
|
|||
|
||||
H = create_feedforward(nr_hidden)
|
||||
out = H(concat)
|
||||
out = layers.Dense(nr_class, activation='softmax')(out)
|
||||
out = layers.Dense(nr_class, activation="softmax")(out)
|
||||
|
||||
model = Model([input1, input2], out)
|
||||
|
||||
model.compile(
|
||||
optimizer=optimizers.Adam(lr=settings['lr']),
|
||||
loss='categorical_crossentropy',
|
||||
metrics=['accuracy'])
|
||||
optimizer=optimizers.Adam(lr=settings["lr"]),
|
||||
loss="categorical_crossentropy",
|
||||
metrics=["accuracy"],
|
||||
)
|
||||
|
||||
return model
|
||||
|
||||
|
||||
def create_embedding(vectors, max_length, projected_dim):
|
||||
return models.Sequential([
|
||||
layers.Embedding(
|
||||
vectors.shape[0],
|
||||
vectors.shape[1],
|
||||
input_length=max_length,
|
||||
weights=[vectors],
|
||||
trainable=False),
|
||||
return models.Sequential(
|
||||
[
|
||||
layers.Embedding(
|
||||
vectors.shape[0],
|
||||
vectors.shape[1],
|
||||
input_length=max_length,
|
||||
weights=[vectors],
|
||||
trainable=False,
|
||||
),
|
||||
layers.TimeDistributed(
|
||||
layers.Dense(projected_dim, activation=None, use_bias=False)
|
||||
),
|
||||
]
|
||||
)
|
||||
|
||||
layers.TimeDistributed(
|
||||
layers.Dense(projected_dim,
|
||||
activation=None,
|
||||
use_bias=False))
|
||||
])
|
||||
|
||||
def create_feedforward(num_units=200, activation='relu', dropout_rate=0.2):
|
||||
return models.Sequential([
|
||||
layers.Dense(num_units, activation=activation),
|
||||
layers.Dropout(dropout_rate),
|
||||
layers.Dense(num_units, activation=activation),
|
||||
layers.Dropout(dropout_rate)
|
||||
])
|
||||
def create_feedforward(num_units=200, activation="relu", dropout_rate=0.2):
|
||||
return models.Sequential(
|
||||
[
|
||||
layers.Dense(num_units, activation=activation),
|
||||
layers.Dropout(dropout_rate),
|
||||
layers.Dense(num_units, activation=activation),
|
||||
layers.Dropout(dropout_rate),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
def normalizer(axis):
|
||||
def _normalize(att_weights):
|
||||
exp_weights = K.exp(att_weights)
|
||||
sum_weights = K.sum(exp_weights, axis=axis, keepdims=True)
|
||||
return exp_weights/sum_weights
|
||||
return exp_weights / sum_weights
|
||||
|
||||
return _normalize
|
||||
|
||||
|
||||
def sum_word(x):
|
||||
return K.sum(x, axis=1)
|
||||
|
||||
|
||||
def test_build_model():
|
||||
vectors = np.ndarray((100, 8), dtype='float32')
|
||||
vectors = np.ndarray((100, 8), dtype="float32")
|
||||
shape = (10, 16, 3)
|
||||
settings = {'lr': 0.001, 'dropout': 0.2, 'gru_encode':True, 'entail_dir':'both'}
|
||||
settings = {"lr": 0.001, "dropout": 0.2, "gru_encode": True, "entail_dir": "both"}
|
||||
model = build_model(vectors, shape, settings)
|
||||
|
||||
|
||||
def test_fit_model():
|
||||
|
||||
def _generate_X(nr_example, length, nr_vector):
|
||||
X1 = np.ndarray((nr_example, length), dtype='int32')
|
||||
X1 = np.ndarray((nr_example, length), dtype="int32")
|
||||
X1 *= X1 < nr_vector
|
||||
X1 *= 0 <= X1
|
||||
X2 = np.ndarray((nr_example, length), dtype='int32')
|
||||
X2 = np.ndarray((nr_example, length), dtype="int32")
|
||||
X2 *= X2 < nr_vector
|
||||
X2 *= 0 <= X2
|
||||
return [X1, X2]
|
||||
|
||||
def _generate_Y(nr_example, nr_class):
|
||||
ys = np.zeros((nr_example, nr_class), dtype='int32')
|
||||
ys = np.zeros((nr_example, nr_class), dtype="int32")
|
||||
for i in range(nr_example):
|
||||
ys[i, i % nr_class] = 1
|
||||
return ys
|
||||
|
||||
vectors = np.ndarray((100, 8), dtype='float32')
|
||||
vectors = np.ndarray((100, 8), dtype="float32")
|
||||
shape = (10, 16, 3)
|
||||
settings = {'lr': 0.001, 'dropout': 0.2, 'gru_encode':True, 'entail_dir':'both'}
|
||||
settings = {"lr": 0.001, "dropout": 0.2, "gru_encode": True, "entail_dir": "both"}
|
||||
model = build_model(vectors, shape, settings)
|
||||
|
||||
train_X = _generate_X(20, shape[0], vectors.shape[0])
|
||||
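The `normalizer(axis)` helper in this file exponentiates the attention scores and divides by their sum along one axis, which amounts to a softmax along that axis. A rough pure-Python sketch for a single 2-D score matrix is shown here. The Keras version works on batched tensors, so its axes 1 and 2 correspond to axes 0 and 1 in this sketch. `softmax_along` is a hypothetical name.

```python
import math


def softmax_along(scores, axis):
    """Exponentiate a 2-D list of scores and normalise along axis 0 or 1."""
    exp_scores = [[math.exp(s) for s in row] for row in scores]
    if axis == 1:
        # Each row sums to 1.
        return [[e / sum(row) for e in row] for row in exp_scores]
    # axis == 0: each column sums to 1.
    col_sums = [sum(col) for col in zip(*exp_scores)]
    return [[e / total for e, total in zip(row, col_sums)] for row in exp_scores]


scores = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
by_row = softmax_along(scores, axis=1)
by_col = softmax_along(scores, axis=0)
```

Like the original, this sketch does not subtract the maximum score before exponentiating, so very large scores could overflow.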
|
|
|
@ -59,7 +59,7 @@ def main(model=None, output_dir=None, n_iter=100):
|
|||
# reset and initialize the weights randomly – but only if we're
|
||||
# training a new model
|
||||
if model is None:
|
||||
optimizer = nlp.begin_training()
|
||||
nlp.begin_training()
|
||||
for itn in range(n_iter):
|
||||
random.shuffle(TRAIN_DATA)
|
||||
losses = {}
|
||||
|
|
|
@ -90,7 +90,8 @@ def main(model=None, output_dir=None, n_iter=20, n_texts=2000):
|
|||
output_dir = Path(output_dir)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
nlp.to_disk(output_dir)
|
||||
with nlp.use_params(optimizer.averages):
|
||||
nlp.to_disk(output_dir)
|
||||
print("Saved model to", output_dir)
|
||||
|
||||
# test the saved model
|
||||
|
|
9
setup.py
|
@ -98,6 +98,14 @@ if os.environ.get("USE_OPENMP", USE_OPENMP_DEFAULT) == "1":
|
|||
COMPILE_OPTIONS["other"].append("-fopenmp")
|
||||
LINK_OPTIONS["other"].append("-fopenmp")
|
||||
|
||||
if sys.platform == "darwin":
|
||||
# On Mac, use libc++ because Apple deprecated use of
|
||||
# libstdc
|
||||
COMPILE_OPTIONS["other"].append("-stdlib=libc++")
|
||||
LINK_OPTIONS["other"].append("-lc++")
|
||||
# g++ (used by unix compiler on mac) links to libstdc++ as a default lib.
|
||||
# See: https://stackoverflow.com/questions/1653047/avoid-linking-to-libstdc
|
||||
LINK_OPTIONS["other"].append("-nodefaultlibs")
|
||||
|
||||
# By subclassing build_extensions we have the actual compiler that will be used which is really known only after finalize_options
|
||||
# http://stackoverflow.com/questions/724664/python-distutils-how-to-get-a-compiler-that-is-going-to-be-used
|
||||
|
@ -183,6 +191,7 @@ def setup_package():
|
|||
for mod_name in MOD_NAMES:
|
||||
mod_path = mod_name.replace(".", "/") + ".cpp"
|
||||
extra_link_args = []
|
||||
extra_compile_args = []
|
||||
# ???
|
||||
# Imported from patch from @mikepb
|
||||
# See Issue #267. Running blind here...
|
||||
|
|
|
@ -4,6 +4,8 @@ from __future__ import unicode_literals
|
|||
from ...gold import iob_to_biluo
|
||||
from ...util import minibatch
|
||||
|
||||
import re
|
||||
|
||||
|
||||
def iob2json(input_data, n_sents=10, *args, **kwargs):
|
||||
"""
|
||||
|
@ -25,7 +27,8 @@ def read_iob(raw_sents):
|
|||
for line in raw_sents:
|
||||
if not line.strip():
|
||||
continue
|
||||
tokens = [t.split("|") for t in line.split()]
|
||||
# tokens = [t.split("|") for t in line.split()]
|
||||
tokens = [re.split("[^\w\-]", line.strip())]
|
||||
if len(tokens[0]) == 3:
|
||||
words, pos, iob = zip(*tokens)
|
||||
else:
|
||||
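To see what the new `re.split` pattern in `read_iob` does, here is a small standalone check. The sample strings are invented for illustration, and the pattern is written as a raw string, which matches the same characters.

```python
import re

# Split on any character that is neither a word character nor a hyphen.
TOKEN_SPLIT = re.compile(r"[^\w\-]")

single_token = TOKEN_SPLIT.split("Berlin|NNP|B-LOC")
two_tokens = TOKEN_SPLIT.split("I|PRP|O like|VBP|O")

print(single_token)  # ['Berlin', 'NNP', 'B-LOC']
print(two_tokens)    # ['I', 'PRP', 'O', 'like', 'VBP', 'O']
```

As far as this hunk shows, a line holding one token yields three fields, while a line with several space-separated tokens yields one flat list of six, so the `len(tokens[0]) == 3` check would only match the single-token case.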
|
|
|
@ -1,6 +1,7 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
# stop words from HAZM package
|
||||
|
||||
# Stop words from HAZM package
|
||||
STOP_WORDS = set(
|
||||
|
|
File diff suppressed because it is too large
File diff suppressed because it is too large
File diff suppressed because it is too large
|
@ -3,6 +3,7 @@ from __future__ import unicode_literals
|
|||
|
||||
|
||||
AUXILIARY_VERBS_IRREG = {
|
||||
"été": ("être",),
|
||||
"suis": ("être",),
|
||||
"es": ("être",),
|
||||
"est": ("être",),
|
||||
|
@ -83,4 +84,286 @@ AUXILIARY_VERBS_IRREG = {
|
|||
"eussiez": ("avoir",),
|
||||
"eussent": ("avoir",),
|
||||
"ayant": ("avoir",),
|
||||
"eu": ("avoir",),
|
||||
"eue": ("avoir",),
|
||||
"eues": ("avoir",),
|
||||
"devaient": ("devoir",),
|
||||
"devais": ("devoir",),
|
||||
"devait": ("devoir",),
|
||||
"devant": ("devoir",),
|
||||
"devez": ("devoir",),
|
||||
"deviez": ("devoir",),
|
||||
"devions": ("devoir",),
|
||||
"devons": ("devoir",),
|
||||
"devra": ("devoir",),
|
||||
"devrai": ("devoir",),
|
||||
"devraient": ("devoir",),
|
||||
"devrais": ("devoir",),
|
||||
"devrait": ("devoir",),
|
||||
"devras": ("devoir",),
|
||||
"devrez": ("devoir",),
|
||||
"devriez": ("devoir",),
|
||||
"devrions": ("devoir",),
|
||||
"devrons": ("devoir",),
|
||||
"devront": ("devoir",),
|
||||
"dois": ("devoir",),
|
||||
"doit": ("devoir",),
|
||||
"doive": ("devoir",),
|
||||
"doivent": ("devoir",),
|
||||
"doives": ("devoir",),
|
||||
"dû": ("devoir",),
|
||||
"due": ("devoir",),
|
||||
"dues": ("devoir",),
|
||||
"dûmes": ("devoir",),
|
||||
"durent": ("devoir",),
|
||||
"dus": ("devoir",),
|
||||
"dûs": ("devoir",),
|
||||
"dusse": ("devoir",),
|
||||
"dussent": ("devoir",),
|
||||
"dusses": ("devoir",),
|
||||
"dussiez": ("devoir",),
|
||||
"dussions": ("devoir",),
|
||||
"dut": ("devoir",),
|
||||
"dût": ("devoir",),
|
||||
"dûtes": ("devoir",),
|
||||
"peut": ("pouvoir",),
|
||||
"peuvent": ("pouvoir",),
|
||||
"peux": ("pouvoir",),
|
||||
"pourraient": ("pouvoir",),
|
||||
"pourrai": ("pouvoir",),
|
||||
"pourrais": ("pouvoir",),
|
||||
"pourrait": ("pouvoir",),
|
||||
"pourra": ("pouvoir",),
|
||||
"pourras": ("pouvoir",),
|
||||
"pourrez": ("pouvoir",),
|
||||
"pourriez": ("pouvoir",),
|
||||
"pourrions": ("pouvoir",),
|
||||
"pourrons": ("pouvoir",),
|
||||
"pourront": ("pouvoir",),
|
||||
"pouvaient": ("pouvoir",),
|
||||
"pouvais": ("pouvoir",),
|
||||
"pouvait": ("pouvoir",),
|
||||
"pouvez": ("pouvoir",),
|
||||
"pouviez": ("pouvoir",),
|
||||
"pouvions": ("pouvoir",),
|
||||
"pouvons": ("pouvoir",),
|
||||
"pûmes": ("pouvoir",),
|
||||
"pu": ("pouvoir",),
|
||||
"purent": ("pouvoir",),
|
||||
"pus": ("pouvoir",),
|
||||
"pûtes": ("pouvoir",),
|
||||
"put": ("pouvoir",),
|
||||
"pouvant": ("pouvoir",),
|
||||
"puisse": ("pouvoir",),
|
||||
"puissions": ("pouvoir",),
|
||||
"puissiez": ("pouvoir",),
|
||||
"puissent": ("pouvoir",),
|
||||
"pusse": ("pouvoir",),
|
||||
"pusses": ("pouvoir",),
|
||||
"pussions": ("pouvoir",),
|
||||
"pussiez": ("pouvoir",),
|
||||
"pussent": ("pouvoir",),
|
||||
"faisaient": ("faire",),
|
||||
"faisais": ("faire",),
|
||||
"faisait": ("faire",),
|
||||
"faisant": ("faire",),
|
||||
"fais": ("faire",),
|
||||
"faisiez": ("faire",),
|
||||
"faisions": ("faire",),
|
||||
"faisons": ("faire",),
|
||||
"faite": ("faire",),
|
||||
"faites": ("faire",),
|
||||
"fait": ("faire",),
|
||||
"faits": ("faire",),
|
||||
"fasse": ("faire",),
|
||||
"fassent": ("faire",),
|
||||
"fasses": ("faire",),
|
||||
"fassiez": ("faire",),
|
||||
"fassions": ("faire",),
|
||||
"fera": ("faire",),
|
||||
"feraient": ("faire",),
|
||||
"ferai": ("faire",),
|
||||
"ferais": ("faire",),
|
||||
"ferait": ("faire",),
|
||||
"feras": ("faire",),
|
||||
"ferez": ("faire",),
|
||||
"feriez": ("faire",),
|
||||
"ferions": ("faire",),
|
||||
"ferons": ("faire",),
|
||||
"feront": ("faire",),
|
||||
"fîmes": ("faire",),
|
||||
"firent": ("faire",),
|
||||
"fis": ("faire",),
|
||||
"fisse": ("faire",),
|
||||
"fissent": ("faire",),
|
||||
"fisses": ("faire",),
|
||||
"fissiez": ("faire",),
|
||||
"fissions": ("faire",),
|
||||
"fîtes": ("faire",),
|
||||
"fit": ("faire",),
|
||||
"fît": ("faire",),
|
||||
"font": ("faire",),
|
||||
"veuillent": ("vouloir",),
|
||||
"veuilles": ("vouloir",),
|
||||
"veuille": ("vouloir",),
|
||||
"veuillez": ("vouloir",),
|
||||
"veuillons": ("vouloir",),
|
||||
"veulent": ("vouloir",),
|
||||
"veut": ("vouloir",),
|
||||
"veux": ("vouloir",),
|
||||
"voudraient": ("vouloir",),
|
||||
"voudrais": ("vouloir",),
|
||||
"voudrait": ("vouloir",),
|
||||
"voudrai": ("vouloir",),
|
||||
"voudras": ("vouloir",),
|
||||
"voudra": ("vouloir",),
|
||||
"voudrez": ("vouloir",),
|
||||
"voudriez": ("vouloir",),
|
||||
"voudrions": ("vouloir",),
|
||||
"voudrons": ("vouloir",),
|
||||
"voudront": ("vouloir",),
|
||||
"voulaient": ("vouloir",),
|
||||
"voulais": ("vouloir",),
|
||||
"voulait": ("vouloir",),
|
||||
"voulant": ("vouloir",),
|
||||
"voulez": ("vouloir",),
|
||||
"vouliez": ("vouloir",),
|
||||
"voulions": ("vouloir",),
|
||||
"voulons": ("vouloir",),
|
||||
"voulues": ("vouloir",),
|
||||
"voulue": ("vouloir",),
|
||||
"voulûmes": ("vouloir",),
|
||||
"voulurent": ("vouloir",),
|
||||
"voulussent": ("vouloir",),
|
||||
"voulusses": ("vouloir",),
|
||||
"voulusse": ("vouloir",),
|
||||
"voulussiez": ("vouloir",),
|
||||
"voulussions": ("vouloir",),
|
||||
"voulus": ("vouloir",),
|
||||
"voulûtes": ("vouloir",),
|
||||
"voulut": ("vouloir",),
|
||||
"voulût": ("vouloir",),
|
||||
"voulu": ("vouloir",),
|
||||
"sachant": ("savoir",),
|
||||
"sachent": ("savoir",),
|
||||
"sache": ("savoir",),
|
||||
"saches": ("savoir",),
|
||||
"sachez": ("savoir",),
|
||||
"sachiez": ("savoir",),
|
||||
"sachions": ("savoir",),
|
||||
"sachons": ("savoir",),
|
||||
"sais": ("savoir",),
|
||||
"sait": ("savoir",),
|
||||
"sauraient": ("savoir",),
|
||||
"saurai": ("savoir",),
|
||||
"saurais": ("savoir",),
|
||||
"saurait": ("savoir",),
|
||||
"saura": ("savoir",),
|
||||
"sauras": ("savoir",),
|
||||
"saurez": ("savoir",),
|
||||
"sauriez": ("savoir",),
|
||||
"saurions": ("savoir",),
|
||||
"saurons": ("savoir",),
|
||||
"sauront": ("savoir",),
|
||||
"savaient": ("savoir",),
|
||||
"savais": ("savoir",),
|
||||
"savait": ("savoir",),
|
||||
"savent": ("savoir",),
|
||||
"savez": ("savoir",),
|
||||
"saviez": ("savoir",),
|
||||
"savions": ("savoir",),
|
||||
"savons": ("savoir",),
|
||||
"sue": ("savoir",),
|
||||
"sues": ("savoir",),
|
||||
"sûmes": ("savoir",),
|
||||
"surent": ("savoir",),
|
||||
"su": ("savoir",),
|
||||
"sus": ("savoir",),
|
||||
"sussent": ("savoir",),
|
||||
"susse": ("savoir",),
|
||||
"susses": ("savoir",),
|
||||
"sussiez": ("savoir",),
|
||||
"sussions": ("savoir",),
|
||||
"sûtes": ("savoir",),
|
||||
"sut": ("savoir",),
|
||||
"sût": ("savoir",),
|
||||
"venaient": ("venir",),
|
||||
"venais": ("venir",),
|
||||
"venait": ("venir",),
|
||||
"venant": ("venir",),
|
||||
"venez": ("venir",),
|
||||
"veniez": ("venir",),
|
||||
"venions": ("venir",),
|
||||
"venons": ("venir",),
|
||||
"venues": ("venir",),
|
||||
"venue": ("venir",),
|
||||
"venus": ("venir",),
|
||||
"venu": ("venir",),
|
||||
"viendraient": ("venir",),
|
||||
"viendrais": ("venir",),
|
||||
"viendrait": ("venir",),
|
||||
"viendrai": ("venir",),
|
||||
"viendras": ("venir",),
|
||||
"viendra": ("venir",),
|
||||
"viendrez": ("venir",),
|
||||
"viendriez": ("venir",),
|
||||
"viendrions": ("venir",),
|
||||
"viendrons": ("venir",),
|
||||
"viendront": ("venir",),
|
||||
"viennent": ("venir",),
|
||||
"viennes": ("venir",),
|
||||
"vienne": ("venir",),
|
||||
"viens": ("venir",),
|
||||
"vient": ("venir",),
|
||||
"vînmes": ("venir",),
|
||||
"vinrent": ("venir",),
|
||||
"vinssent": ("venir",),
|
||||
"vinsses": ("venir",),
|
||||
"vinsse": ("venir",),
|
||||
"vinssiez": ("venir",),
|
||||
"vinssions": ("venir",),
|
||||
"vins": ("venir",),
|
||||
"vîntes": ("venir",),
|
||||
"vint": ("venir",),
|
||||
"vînt": ("venir",),
|
||||
"aille": ("aller",),
|
||||
"aillent": ("aller",),
|
||||
"ailles": ("aller",),
|
||||
"alla": ("aller",),
|
||||
"allai": ("aller",),
|
||||
"allaient": ("aller",),
|
||||
"allais": ("aller",),
|
||||
"allait": ("aller",),
|
||||
"allâmes": ("aller",),
|
||||
"allant": ("aller",),
|
||||
"allas": ("aller",),
|
||||
"allasse": ("aller",),
|
||||
"allassent": ("aller",),
|
||||
"allasses": ("aller",),
|
||||
"allassiez": ("aller",),
|
||||
"allassions": ("aller",),
|
||||
"allât": ("aller",),
|
||||
"allâtes": ("aller",),
|
||||
"allé": ("aller",),
|
||||
"allée": ("aller",),
|
||||
"allées": ("aller",),
|
||||
"allèrent": ("aller",),
|
||||
"allés": ("aller",),
|
||||
"allez": ("aller",),
|
||||
"allons": ("aller",),
|
||||
"ira": ("aller",),
|
||||
"irai": ("aller",),
|
||||
"iraient": ("aller",),
|
||||
"irais": ("aller",),
|
||||
"irait": ("aller",),
|
||||
"iras": ("aller",),
|
||||
"irez": ("aller",),
|
||||
"iriez": ("aller",),
|
||||
"irions": ("aller",),
|
||||
"irons": ("aller",),
|
||||
"iront": ("aller",),
|
||||
"va": ("aller",),
|
||||
"vais": ("aller",),
|
||||
"vas": ("aller",),
|
||||
"vont": ("aller",)
|
||||
}
|
||||
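The table maps each inflected form to a tuple of candidate lemmas. A minimal sketch of how such a table might be consulted is given here, using three entries copied from the diff. `AUX_SAMPLE` and `lookup_lemmas` are hypothetical names for illustration, not spaCy's lemmatizer API.

```python
# Three entries copied from AUXILIARY_VERBS_IRREG in the diff.
AUX_SAMPLE = {
    "devrait": ("devoir",),
    "pourrons": ("pouvoir",),
    "vont": ("aller",),
}


def lookup_lemmas(form, table):
    """Return the lemmas listed for form, or the form itself if unlisted."""
    return table.get(form.lower(), (form,))


print(lookup_lemmas("Vont", AUX_SAMPLE))      # ('aller',)
print(lookup_lemmas("chantait", AUX_SAMPLE))  # ('chantait',)
```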
|
|
|
@ -2,10 +2,113 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
ADJECTIVE_RULES = [["s", ""], ["e", ""], ["es", ""]]
|
||||
ADJECTIVE_RULES = [
|
||||
["a", "a"],
|
||||
["aux", "al"],
|
||||
["c", "c"],
|
||||
["d", "d"],
|
||||
["e", ""],
|
||||
["é", "é"],
|
||||
["eux", "eux"],
|
||||
["f", "f"],
|
||||
["i", "i"],
|
||||
["ï", "ï"],
|
||||
["l", "l"],
|
||||
["m", "m"],
|
||||
["n", "n"],
|
||||
["o", "o"],
|
||||
["p", "p"],
|
||||
["r", "r"],
|
||||
["s", ""],
|
||||
["t", "t"],
|
||||
["u", "u"],
|
||||
["y", "y"],
|
||||
]
|
||||
|
||||
|
||||
NOUN_RULES = [["s", ""]]
|
||||
NOUN_RULES = [
|
||||
["a", "a"],
|
||||
["à", "à"],
|
||||
["â", "â"],
|
||||
["b", "b"],
|
||||
["c", "c"],
|
||||
["ç", "ç"],
|
||||
["d", "d"],
|
||||
["e", "e"],
|
||||
["é", "é"],
|
||||
["è", "è"],
|
||||
["ê", "ê"],
|
||||
["ë", "ë"],
|
||||
["f", "f"],
|
||||
["g", "g"],
|
||||
["h", "h"],
|
||||
["i", "i"],
|
||||
["î", "î"],
|
||||
["ï", "ï"],
|
||||
["j", "j"],
|
||||
["k", "k"],
|
||||
["l", "l"],
|
||||
["m", "m"],
|
||||
["n", "n"],
|
||||
["o", "o"],
|
||||
["ô", "ö"],
|
||||
["ö", "ö"],
|
||||
["p", "p"],
|
||||
["q", "q"],
|
||||
["r", "r"],
|
||||
["t", "t"],
|
||||
["u", "u"],
|
||||
["û", "û"],
|
||||
["v", "v"],
|
||||
["w", "w"],
|
||||
["y", "y"],
|
||||
["z", "z"],
|
||||
["as", "a"],
|
||||
["aux", "au"],
|
||||
["cs", "c"],
|
||||
["chs", "ch"],
|
||||
["ds", "d"],
|
||||
["és", "é"],
|
||||
["es", "e"],
|
||||
["eux", "eu"],
|
||||
["fs", "f"],
|
||||
["gs", "g"],
|
||||
["hs", "h"],
|
||||
["is", "i"],
|
||||
["ïs", "ï"],
|
||||
["js", "j"],
|
||||
["ks", "k"],
|
||||
["ls", "l"],
|
||||
["ms", "m"],
|
||||
["ns", "n"],
|
||||
["oux", "ou"],
|
||||
["os", "o"],
|
||||
["ps", "p"],
|
||||
["qs", "q"],
|
||||
["rs", "r"],
|
||||
["ses", "se"],
|
||||
["se", "se"],
|
||||
["ts", "t"],
|
||||
["us", "u"],
|
||||
["vs", "v"],
|
||||
["ws", "w"],
|
||||
["ys", "y"],
|
||||
["nt(e", "nt"],
|
||||
["nt(e)", "nt"],
|
||||
["al(e", "ale"],
|
||||
["é(", "é"],
|
||||
["é(e", "é"],
|
||||
["é.e", "é"],
|
||||
["el(le", "el"],
|
||||
["eurs(rices", "eur"],
|
||||
["eur(rice", "eur"],
|
||||
["eux(se", "eux"],
|
||||
["ial(e", "ial"],
|
||||
["er(ère", "er"],
|
||||
["eur(se", "eur"],
|
||||
["teur(trice", "teur"],
|
||||
["teurs(trices", "teur"],
|
||||
]
|
||||
|
||||
|
||||
VERB_RULES = [
|
||||
|
@ -47,4 +150,11 @@ VERB_RULES = [
|
|||
["assiez", "er"],
|
||||
["assent", "er"],
|
||||
["ant", "er"],
|
||||
["ante", "er"],
|
||||
["ants", "er"],
|
||||
["antes", "er"],
|
||||
["u(er", "u"],
|
||||
["és(ées", "er"],
|
||||
["é()e", "er"],
|
||||
["é()", "er"],
|
||||
]
|
||||
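Each entry in the rule tables above is a `[suffix, replacement]` pair. A sketch of how such rules could produce candidate lemmas follows, assuming a simple try-every-matching-suffix strategy. `candidate_lemmas` is a hypothetical helper, and a real lemmatizer would typically also filter the candidates against a word index.

```python
# Three rules copied from ADJECTIVE_RULES in the diff.
SAMPLE_RULES = [["aux", "al"], ["e", ""], ["s", ""]]


def candidate_lemmas(word, rules):
    """Apply every rule whose suffix matches and collect the distinct results."""
    candidates = []
    for old_suffix, new_suffix in rules:
        if word.endswith(old_suffix):
            form = word[: len(word) - len(old_suffix)] + new_suffix
            if form not in candidates:
                candidates.append(form)
    return candidates


print(candidate_lemmas("nationaux", SAMPLE_RULES))  # ['national']
print(candidate_lemmas("grande", SAMPLE_RULES))     # ['grand']
```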
|
|
File diff suppressed because it is too large
File diff suppressed because it is too large
File diff suppressed because it is too large
File diff suppressed because it is too large
|
@ -94,15 +94,19 @@ for pre, pre_lemma in [("qu'", "que"), ("n'", "ne")]:
|
|||
|
||||
_infixes_exc = []
|
||||
orig_elision = "'"
|
||||
orig_hyphen = '-'
|
||||
orig_hyphen = "-"
|
||||
|
||||
# loop through the elision and hyphen characters, and try to substitute the ones that weren't used in the original list
|
||||
for infix in FR_BASE_EXCEPTIONS:
|
||||
variants_infix = {infix}
|
||||
for elision_char in [x for x in ELISION if x != orig_elision]:
|
||||
variants_infix.update([word.replace(orig_elision, elision_char) for word in variants_infix])
|
||||
for hyphen_char in [x for x in ['-', '‐'] if x != orig_hyphen]:
|
||||
variants_infix.update([word.replace(orig_hyphen, hyphen_char) for word in variants_infix])
|
||||
variants_infix.update(
|
||||
[word.replace(orig_elision, elision_char) for word in variants_infix]
|
||||
)
|
||||
for hyphen_char in [x for x in ["-", "‐"] if x != orig_hyphen]:
|
||||
variants_infix.update(
|
||||
[word.replace(orig_hyphen, hyphen_char) for word in variants_infix]
|
||||
)
|
||||
variants_infix.update([upper_first_letter(word) for word in variants_infix])
|
||||
_infixes_exc.extend(variants_infix)
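The loop above expands every base exception into its spelling variants: alternative elision characters, alternative hyphen characters, and a capitalised first letter. The sketch below reproduces that idea for a single word. The values of `ELISION` and `HYPHEN_CHARS` and the body of `upper_first_letter` are assumptions made for illustration; the real constants are defined elsewhere in the package and may contain more characters.

```python
# Assumed values for illustration only.
ELISION = "'’"              # straight and typographic apostrophe
HYPHEN_CHARS = ["-", "‐"]   # ASCII hyphen-minus and Unicode hyphen
ORIG_ELISION = "'"
ORIG_HYPHEN = "-"


def upper_first_letter(text):
    # Stand-in for the helper used in the diff: capitalise the first character.
    return text[:1].upper() + text[1:]


def expand_variants(word):
    """Return the set of spelling variants for one base exception."""
    variants = {word}
    for elision_char in [c for c in ELISION if c != ORIG_ELISION]:
        # The list is built in full before the set is updated.
        variants.update([w.replace(ORIG_ELISION, elision_char) for w in variants])
    for hyphen_char in [c for c in HYPHEN_CHARS if c != ORIG_HYPHEN]:
        variants.update([w.replace(ORIG_HYPHEN, hyphen_char) for w in variants])
    variants.update([upper_first_letter(w) for w in variants])
    return variants


for variant in sorted(expand_variants("aujourd'hui")):
    print(variant)
```

Because `str.replace` swaps every occurrence at once, a word with two hyphens gets one variant per hyphen character, not every mixed combination.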
|
||||
|
||||
|
@ -327,7 +331,9 @@ _regular_exp = [
|
|||
"^chape[{hyphen}]chut[{alpha}]+$".format(hyphen=HYPHENS, alpha=ALPHA_LOWER),
|
||||
"^down[{hyphen}]load[{alpha}]*$".format(hyphen=HYPHENS, alpha=ALPHA_LOWER),
|
||||
"^[ée]tats[{hyphen}]uni[{alpha}]*$".format(hyphen=HYPHENS, alpha=ALPHA_LOWER),
|
||||
"^droits?[{hyphen}]de[{hyphen}]l'homm[{alpha}]+$".format(hyphen=HYPHENS, alpha=ALPHA_LOWER),
|
||||
"^droits?[{hyphen}]de[{hyphen}]l'homm[{alpha}]+$".format(
|
||||
hyphen=HYPHENS, alpha=ALPHA_LOWER
|
||||
),
|
||||
"^fac[{hyphen}]simil[{alpha}]*$".format(hyphen=HYPHENS, alpha=ALPHA_LOWER),
|
||||
"^fleur[{hyphen}]bleuis[{alpha}]+$".format(hyphen=HYPHENS, alpha=ALPHA_LOWER),
|
||||
"^flic[{hyphen}]flaqu[{alpha}]+$".format(hyphen=HYPHENS, alpha=ALPHA_LOWER),
|
||||
|
@ -380,25 +386,32 @@ _regular_exp += [
|
|||
]
|
||||
|
||||
# catching cases like entr'abat
|
||||
_elision_prefix = ['r?é?entr', 'grande?s?', 'r']
|
||||
_elision_prefix = ["r?é?entr", "grande?s?", "r"]
|
||||
_regular_exp += [
|
||||
"^{prefix}[{elision}][{alpha}][{alpha}{elision}{hyphen}\-]*$".format(
|
||||
prefix=p,
|
||||
elision=ELISION,
|
||||
hyphen=_other_hyphens,
|
||||
alpha=ALPHA_LOWER,
|
||||
prefix=p, elision=ELISION, hyphen=_other_hyphens, alpha=ALPHA_LOWER
|
||||
)
|
||||
for p in _elision_prefix
|
||||
]
|
||||
|
||||
# catching cases like saut-de-ski, pet-en-l'air
|
||||
_hyphen_combination = ['l[èe]s?', 'la', 'en', 'des?', 'd[eu]', 'sur', 'sous', 'aux?', 'à', 'et', "près", "saint"]
|
||||
_hyphen_combination = [
|
||||
"l[èe]s?",
|
||||
"la",
|
||||
"en",
|
||||
"des?",
|
||||
"d[eu]",
|
||||
"sur",
|
||||
"sous",
|
||||
"aux?",
|
||||
"à",
|
||||
"et",
|
||||
"près",
|
||||
"saint",
|
||||
]
|
||||
_regular_exp += [
|
||||
"^[{alpha}]+[{hyphen}]{hyphen_combo}[{hyphen}](?:l[{elision}])?[{alpha}]+$".format(
|
||||
hyphen_combo=hc,
|
||||
elision=ELISION,
|
||||
hyphen=HYPHENS,
|
||||
alpha=ALPHA_LOWER,
|
||||
hyphen_combo=hc, elision=ELISION, hyphen=HYPHENS, alpha=ALPHA_LOWER
|
||||
)
|
||||
for hc in _hyphen_combination
|
||||
]
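The comprehension above builds one regular expression per linking word, so that compounds such as saut-de-ski or pet-en-l'air are matched as single tokens. A small sketch of the same construction follows. The character classes `ALPHA_LOWER`, `HYPHENS` and `ELISION` are simplified assumptions, and `is_hyphen_exception` is a hypothetical helper, not part of the library.

```python
import re

# Simplified, assumed character classes; the real ones are larger.
ALPHA_LOWER = "a-zàâæçéèêëîïôœùûüÿ"
HYPHENS = "-‐"
ELISION = "'’"

# Two of the linking words from the list above.
sample_combinations = ["en", "des?"]

patterns = [
    "^[{alpha}]+[{hyphen}]{hyphen_combo}[{hyphen}](?:l[{elision}])?[{alpha}]+$".format(
        hyphen_combo=hc, elision=ELISION, hyphen=HYPHENS, alpha=ALPHA_LOWER
    )
    for hc in sample_combinations
]
compiled = [re.compile(p) for p in patterns]


def is_hyphen_exception(token):
    """Return True if the token matches any of the compound patterns."""
    return any(rx.match(token) for rx in compiled)


print(is_hyphen_exception("pet-en-l'air"))
print(is_hyphen_exception("saut-de-ski"))
print(is_hyphen_exception("bonjour"))
```

Placing the hyphen-minus first inside each character class keeps it literal, so no backslash escape is needed there.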
|
||||
|
|
|
@ -1,3 +1,10 @@
|
|||
"""
|
||||
Slang and abbreviations
|
||||
|
||||
List of frequently misspelled words
|
||||
https://id.wikipedia.org/wiki/Wikipedia:Daftar_kosakata_bahasa_Indonesia_yang_sering_salah_dieja
|
||||
|
||||
"""
|
||||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
|
|
@ -1,3 +1,6 @@
|
|||
"""
|
||||
List of stop words in Bahasa Indonesia.
|
||||
"""
|
||||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
|
|
@ -1,3 +1,7 @@
|
|||
"""
|
||||
List of abbreviations and acronyms from:
|
||||
https://id.wiktionary.org/wiki/Wiktionary:Daftar_singkatan_dan_akronim_bahasa_Indonesia#A
|
||||
"""
|
||||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
|
File diff suppressed because it is too large
Load Diff
|
@ -1,6 +1,6 @@
|
|||
# coding: utf8
|
||||
"""
|
||||
All wordforms are extracted from Norsk Ordbank in Norwegian Bokmål 2005
|
||||
All wordforms are extracted from Norsk Ordbank in Norwegian Bokmål 2005, updated 20180627
|
||||
(CLARINO NB - Språkbanken), Nasjonalbiblioteket, Norway:
|
||||
https://www.nb.no/sprakbanken/show?serial=oai%3Anb.no%3Asbr-5&lang=en
|
||||
License:
|
||||
|
@ -15,9 +15,7 @@ ADVERBS_WORDFORMS = {
|
|||
'à la grecque': ('à la grecque',),
|
||||
'à la mode': ('à la mode',),
|
||||
'òg': ('òg',),
|
||||
'a': ('a',),
|
||||
'a cappella': ('a cappella',),
|
||||
'a conto': ('a conto',),
|
||||
'a konto': ('a konto',),
|
||||
'a posteriori': ('a posteriori',),
|
||||
'a prima vista': ('a prima vista',),
|
||||
|
@ -34,6 +32,12 @@ ADVERBS_WORDFORMS = {
|
|||
'ad undas': ('ad undas',),
|
||||
'adagio': ('adagio',),
|
||||
'akkurat': ('akkurat',),
|
||||
'aktenfor': ('aktenfor',),
|
||||
'aktenfra': ('aktenfra',),
|
||||
'akter': ('akter',),
|
||||
'akterinn': ('akterinn',),
|
||||
'akterover': ('akterover',),
|
||||
'akterut': ('akterut',),
|
||||
'al fresco': ('al fresco',),
|
||||
'al secco': ('al secco',),
|
||||
'aldeles': ('aldeles',),
|
||||
|
@ -46,6 +50,9 @@ ADVERBS_WORDFORMS = {
|
|||
'allegro': ('allegro',),
|
||||
'aller': ('aller',),
|
||||
'allerede': ('allerede',),
|
||||
'allesteds': ('allesteds',),
|
||||
'allestedsfra': ('allestedsfra',),
|
||||
'allestedshen': ('allestedshen',),
|
||||
'allikevel': ('allikevel',),
|
||||
'alltid': ('alltid',),
|
||||
'alltids': ('alltids',),
|
||||
|
@ -60,8 +67,12 @@ ADVERBS_WORDFORMS = {
|
|||
'andelsvis': ('andelsvis',),
|
||||
'andfares': ('andfares',),
|
||||
'andføttes': ('andføttes',),
|
||||
'annensteds': ('annensteds',),
|
||||
'annenstedsfra': ('annenstedsfra',),
|
||||
'annenstedshen': ('annenstedshen',),
|
||||
'annetsteds': ('annetsteds',),
|
||||
'annetstedsfra': ('annetstedsfra',),
|
||||
'annetstedsfra': ('annetstedsfra',),
|
||||
'annetstedshen': ('annetstedshen',),
|
||||
'anno': ('anno',),
|
||||
'anslagsvis': ('anslagsvis',),
|
||||
|
@ -72,21 +83,35 @@ ADVERBS_WORDFORMS = {
|
|||
'apropos': ('apropos',),
|
||||
'argende': ('argende',),
|
||||
'at': ('at',),
|
||||
'att': ('att',),
|
||||
'attende': ('attende',),
|
||||
'atter': ('atter',),
|
||||
'attpåtil': ('attpåtil',),
|
||||
'attåt': ('attåt',),
|
||||
'au': ('au',),
|
||||
'aust': ('aust',),
|
||||
'austa': ('austa',),
|
||||
'austafjells': ('austafjells',),
|
||||
'av gårde': ('av gårde',),
|
||||
'av sted': ('av sted',),
|
||||
'avdelingsvis': ('avdelingsvis',),
|
||||
'avdragsvis': ('avdragsvis',),
|
||||
'avhendes': ('avhendes',),
|
||||
'avhends': ('avhends',),
|
||||
'avsatsvis': ('avsatsvis',),
|
||||
'babord': ('babord',),
|
||||
'bakfra': ('bakfra',),
|
||||
'bakk': ('bakk',),
|
||||
'baklengs': ('baklengs',),
|
||||
'bakover': ('bakover',),
|
||||
'bakut': ('bakut',),
|
||||
'bare': ('bare',),
|
||||
'bataljonsvis': ('bataljonsvis',),
|
||||
'beint fram': ('beint fram',),
|
||||
'bekende': ('bekende',),
|
||||
'belgende': ('belgende',),
|
||||
'bent fram': ('bent fram',),
|
||||
'bent frem': ('bent frem',),
|
||||
'betids': ('betids',),
|
||||
'bi': ('bi',),
|
||||
'bidevind': ('bidevind',),
|
||||
|
@ -102,17 +127,21 @@ ADVERBS_WORDFORMS = {
|
|||
'bom': ('bom',),
|
||||
'bommende': ('bommende',),
|
||||
'bona fide': ('bona fide',),
|
||||
'bort': ('bort',),
|
||||
'borte': ('borte',),
|
||||
'bortimot': ('bortimot',),
|
||||
'brennfort': ('brennfort',),
|
||||
'brutto': ('brutto',),
|
||||
'bråtevis': ('bråtevis',),
|
||||
'bums': ('bums',),
|
||||
'buntevis': ('buntevis',),
|
||||
'buntvis': ('buntvis',),
|
||||
'bus': ('bus',),
|
||||
'bygdimellom': ('bygdimellom',),
|
||||
'cantabile': ('cantabile',),
|
||||
'cf': ('cf',),
|
||||
'cif': ('cif',),
|
||||
'cirka': ('cirka',),
|
||||
'comme il faut': ('comme il faut',),
|
||||
'crescendo': ('crescendo',),
|
||||
'da': ('da',),
|
||||
'dagevis': ('dagevis',),
|
||||
|
@ -127,18 +156,38 @@ ADVERBS_WORDFORMS = {
|
|||
'delkredere': ('delkredere',),
|
||||
'dels': ('dels',),
|
||||
'delvis': ('delvis',),
|
||||
'den gang': ('den gang',),
|
||||
'der': ('der',),
|
||||
'der borte': ('der borte',),
|
||||
'der hen': ('der hen',),
|
||||
'der inne': ('der inne',),
|
||||
'der nede': ('der nede',),
|
||||
'der oppe': ('der oppe',),
|
||||
'der ute': ('der ute',),
|
||||
'derav': ('derav',),
|
||||
'deretter': ('deretter',),
|
||||
'derfor': ('derfor',),
|
||||
'derfra': ('derfra',),
|
||||
'deri': ('deri',),
|
||||
'deriblant': ('deriblant',),
|
||||
'derifra': ('derifra',),
|
||||
'derimot': ('derimot',),
|
||||
'dermed': ('dermed',),
|
||||
'dernest': ('dernest',),
|
||||
'derom': ('derom',),
|
||||
'derpå': ('derpå',),
|
||||
'dertil': ('dertil',),
|
||||
'derved': ('derved',),
|
||||
'dess': ('dess',),
|
||||
'dessuten': ('dessuten',),
|
||||
'dessverre': ('dessverre',),
|
||||
'desto': ('desto',),
|
||||
'diminuendo': ('diminuendo',),
|
||||
'dis': ('dis',),
|
||||
'dit': ('dit',),
|
||||
'dit hen': ('dit hen',),
|
||||
'ditover': ('ditover',),
|
||||
'ditto': ('ditto',),
|
||||
'dog': ('dog',),
|
||||
'dolce': ('dolce',),
|
||||
'dorgende': ('dorgende',),
|
||||
|
@ -158,10 +207,10 @@ ADVERBS_WORDFORMS = {
|
|||
'eitrende': ('eitrende',),
|
||||
'eks': ('eks',),
|
||||
'eksempelvis': ('eksempelvis',),
|
||||
'eksklusiv': ('eksklusiv',),
|
||||
'eksklusive': ('eksklusive',),
|
||||
'ekspress': ('ekspress',),
|
||||
'ekstempore': ('ekstempore',),
|
||||
'eldende': ('eldende',),
|
||||
'eldende': ('eldende',),
|
||||
'ellers': ('ellers',),
|
||||
'en': ('en',),
|
||||
'en bloc': ('en bloc',),
|
||||
|
@ -175,6 +224,8 @@ ADVERBS_WORDFORMS = {
|
|||
'enda': ('enda',),
|
||||
'endatil': ('endatil',),
|
||||
'ende': ('ende',),
|
||||
'ende fram': ('ende fram',),
|
||||
'ende frem': ('ende frem',),
|
||||
'ender': ('ender',),
|
||||
'endog': ('endog',),
|
||||
'ene': ('ene',),
|
||||
|
@ -183,10 +234,12 @@ ADVERBS_WORDFORMS = {
|
|||
'enkom': ('enkom',),
|
||||
'enn': ('enn',),
|
||||
'ennå': ('ennå',),
|
||||
'ensteds': ('ensteds',),
|
||||
'eo ipso': ('eo ipso',),
|
||||
'ergo': ('ergo',),
|
||||
'et cetera': ('et cetera',),
|
||||
'etappevis': ('etappevis',),
|
||||
'etsteds': ('etsteds',),
|
||||
'etterhånden': ('etterhånden',),
|
||||
'etterpå': ('etterpå',),
|
||||
'etterskottsvis': ('etterskottsvis',),
|
||||
|
@ -195,9 +248,10 @@ ADVERBS_WORDFORMS = {
|
|||
'ex auditorio': ('ex auditorio',),
|
||||
'ex cathedra': ('ex cathedra',),
|
||||
'ex officio': ('ex officio',),
|
||||
'exit': ('exit',),
|
||||
'f.o.r.': ('f.o.r.',),
|
||||
'fas': ('fas',),
|
||||
'fatt': ('fatt',),
|
||||
'fatt': ('fatt',),
|
||||
'feil': ('feil',),
|
||||
'femti-femti': ('femti-femti',),
|
||||
'fifty-fifty': ('fifty-fifty',),
|
||||
|
@ -208,44 +262,64 @@ ADVERBS_WORDFORMS = {
|
|||
'flunkende': ('flunkende',),
|
||||
'flust': ('flust',),
|
||||
'fly': ('fly',),
|
||||
'fløyten': ('fløyten',),
|
||||
'fob': ('fob',),
|
||||
'for': ('for',),
|
||||
'for hånden': ('for hånden',),
|
||||
'for lengst': ('for lengst',),
|
||||
'for resten': ('for resten',),
|
||||
'for så vidt': ('for så vidt',),
|
||||
'for tida': ('for tida',),
|
||||
'for tiden': ('for tiden',),
|
||||
'for visst': ('for visst',),
|
||||
'for øvrig': ('for øvrig',),
|
||||
'fordevind': ('fordevind',),
|
||||
'fordum': ('fordum',),
|
||||
'fore': ('fore',),
|
||||
'forfra': ('forfra',),
|
||||
'forhakkende': ('forhakkende',),
|
||||
'forholdsvis': ('forholdsvis',),
|
||||
'forhåpentlig': ('forhåpentlig',),
|
||||
'forhåpentligvis': ('forhåpentligvis',),
|
||||
'forlengs': ('forlengs',),
|
||||
'formelig': ('formelig',),
|
||||
'forover': ('forover',),
|
||||
'forresten': ('forresten',),
|
||||
'forsøksvis': ('forsøksvis',),
|
||||
'fort': ('fort',),
|
||||
'fortere': ('fort',),
|
||||
'fortest': ('fort',),
|
||||
'forte': ('forte',),
|
||||
'fortfarende': ('fortfarende',),
|
||||
'fortissimo': ('fortissimo',),
|
||||
'fortrinnsvis': ('fortrinnsvis',),
|
||||
'forut': ('forut',),
|
||||
'fra borde': ('fra borde',),
|
||||
'fram': ('fram',),
|
||||
'framføre': ('framføre',),
|
||||
'framleis': ('framleis',),
|
||||
'framlengs': ('framlengs',),
|
||||
'framme': ('framme',),
|
||||
'framstupes': ('framstupes',),
|
||||
'framstups': ('framstups',),
|
||||
'franko': ('franko',),
|
||||
'free on board': ('free on board',),
|
||||
'free on rail': ('free on rail',),
|
||||
'frem': ('frem',),
|
||||
'fremad': ('fremad',),
|
||||
'fremdeles': ('fremdeles',),
|
||||
'fremlengs': ('fremlengs',),
|
||||
'fremme': ('fremme',),
|
||||
'fremstupes': ('fremstupes',),
|
||||
'fremstups': ('fremstups',),
|
||||
'furioso': ('furioso',),
|
||||
'fylkesvis': ('fylkesvis',),
|
||||
'følgelig': ('følgelig',),
|
||||
'føre': ('føre',),
|
||||
'først': ('først',),
|
||||
'ganske': ('ganske',),
|
||||
'gardimellom': ('gardimellom',),
|
||||
'gatelangs': ('gatelangs',),
|
||||
'gid': ('gid',),
|
||||
'givetvis': ('givetvis',),
|
||||
'gjerne': ('gjerne',),
|
||||
|
@ -267,17 +341,56 @@ ADVERBS_WORDFORMS = {
|
|||
'gørrende': ('gørrende',),
|
||||
'hakk': ('hakk',),
|
||||
'hakkende': ('hakkende',),
|
||||
'halvveges': ('halvveges',),
|
||||
'halvvegs': ('halvvegs',),
|
||||
'halvveis': ('halvveis',),
|
||||
'haugevis': ('haugevis',),
|
||||
'heden': ('heden',),
|
||||
'heim': ('heim',),
|
||||
'heim att': ('heim att',),
|
||||
'heiman': ('heiman',),
|
||||
'heime': ('heime',),
|
||||
'heimefra': ('heimefra',),
|
||||
'heimetter': ('heimetter',),
|
||||
'heimom': ('heimom',),
|
||||
'heimover': ('heimover',),
|
||||
'heldigvis': ('heldigvis',),
|
||||
'heller': ('heller',),
|
||||
'helst': ('helst',),
|
||||
'hen': ('hen',),
|
||||
'henholdsvis': ('henholdsvis',),
|
||||
'henne': ('henne',),
|
||||
'her': ('her',),
|
||||
'herav': ('herav',),
|
||||
'heretter': ('heretter',),
|
||||
'herfra': ('herfra',),
|
||||
'heri': ('heri',),
|
||||
'heriblant': ('heriblant',),
|
||||
'herifra': ('herifra',),
|
||||
'herigjennom': ('herigjennom',),
|
||||
'herimot': ('herimot',),
|
||||
'hermed': ('hermed',),
|
||||
'herom': ('herom',),
|
||||
'herover': ('herover',),
|
||||
'herpå': ('herpå',),
|
||||
'herre': ('herre',),
|
||||
'hersens': ('hersens',),
|
||||
'hertil': ('hertil',),
|
||||
'herunder': ('herunder',),
|
||||
'herved': ('herved',),
|
||||
'himlende': ('himlende',),
|
||||
'hisset': ('hisset',),
|
||||
'hist': ('hist',),
|
||||
'hit': ('hit',),
|
||||
'hitover': ('hitover',),
|
||||
'hittil': ('hittil',),
|
||||
'hjem': ('hjem',),
|
||||
'hjemad': ('hjemad',),
|
||||
'hjemetter': ('hjemetter',),
|
||||
'hjemme': ('hjemme',),
|
||||
'hjemmefra': ('hjemmefra',),
|
||||
'hjemom': ('hjemom',),
|
||||
'hjemover': ('hjemover',),
|
||||
'hodekulls': ('hodekulls',),
|
||||
'hodestupes': ('hodestupes',),
|
||||
'hodestups': ('hodestups',),
|
||||
|
@ -288,15 +401,41 @@ ADVERBS_WORDFORMS = {
|
|||
'hundretusenvis': ('hundretusenvis',),
|
||||
'hundrevis': ('hundrevis',),
|
||||
'hurra-meg-rundt': ('hurra-meg-rundt',),
|
||||
'husimellom': ('husimellom',),
|
||||
'hvi': ('hvi',),
|
||||
'hvor': ('hvor',),
|
||||
'hvor hen': ('hvor hen',),
|
||||
'hvorav': ('hvorav',),
|
||||
'hvordan': ('hvordan',),
|
||||
'hvoretter': ('hvoretter',),
|
||||
'hvorfor': ('hvorfor',),
|
||||
'hvorfra': ('hvorfra',),
|
||||
'hvori': ('hvori',),
|
||||
'hvoriblant': ('hvoriblant',),
|
||||
'hvorimot': ('hvorimot',),
|
||||
'hvorledes': ('hvorledes',),
|
||||
'hvormed': ('hvormed',),
|
||||
'hvorom': ('hvorom',),
|
||||
'hvorpå': ('hvorpå',),
|
||||
'hånt': ('hånt',),
|
||||
'høylig': ('høylig',),
|
||||
'høyst': ('høyst',),
|
||||
'i aften': ('i aften',),
|
||||
'i aftes': ('i aftes',),
|
||||
'i alle fall': ('i alle fall',),
|
||||
'i dag': ('i dag',),
|
||||
'i fjor': ('i fjor',),
|
||||
'i fleng': ('i fleng',),
|
||||
'i forfjor': ('i forfjor',),
|
||||
'i forgårs': ('i forgårs',),
|
||||
'i gjerde': ('i gjerde',),
|
||||
'i gjære': ('i gjære',),
|
||||
'i grunnen': ('i grunnen',),
|
||||
'i går': ('i går',),
|
||||
'i hende': ('i hende',),
|
||||
'i hjel': ('i hjel',),
|
||||
'i hug': ('i hug',),
|
||||
'i huleste': ('i huleste',),
|
||||
'i stedet': ('i stedet',),
|
||||
'iallfall': ('iallfall',),
|
||||
'ibidem': ('ibidem',),
|
||||
|
@ -304,7 +443,7 @@ ADVERBS_WORDFORMS = {
|
|||
'igjen': ('igjen',),
|
||||
'ikke': ('ikke',),
|
||||
'ildende': ('ildende',),
|
||||
'ildende': ('ildende',),
|
||||
'ille': ('ille',),
|
||||
'imens': ('imens',),
|
||||
'imidlertid': ('imidlertid',),
|
||||
'in absentia': ('in absentia',),
|
||||
|
@ -334,10 +473,22 @@ ADVERBS_WORDFORMS = {
|
|||
'in vivo': ('in vivo',),
|
||||
'ingenlunde': ('ingenlunde',),
|
||||
'ingensteds': ('ingensteds',),
|
||||
'inklusiv': ('inklusiv',),
|
||||
'inklusive': ('inklusive',),
|
||||
'inkognito': ('inkognito',),
|
||||
'inn': ('inn',),
|
||||
'innad': ('innad',),
|
||||
'innafra': ('innafra',),
|
||||
'innalands': ('innalands',),
|
||||
'innaskjærs': ('innaskjærs',),
|
||||
'inne': ('inne',),
|
||||
'innenat': ('innenat',),
|
||||
'innenfra': ('innenfra',),
|
||||
'innenlands': ('innenlands',),
|
||||
'innenskjærs': ('innenskjærs',),
|
||||
'innledningsvis': ('innledningsvis',),
|
||||
'innleiingsvis': ('innleiingsvis',),
|
||||
'innomhus': ('innomhus',),
|
||||
'isteden': ('isteden',),
|
||||
'især': ('især',),
|
||||
'item': ('item',),
|
||||
|
@ -380,12 +531,26 @@ ADVERBS_WORDFORMS = {
|
|||
'lagerfritt': ('lagerfritt',),
|
||||
'lagom': ('lagom',),
|
||||
'lagvis': ('lagvis',),
|
||||
'landimellom': ('landimellom',),
|
||||
'landverts': ('landverts',),
|
||||
'langt': ('langt',),
|
||||
'lenger': ('langt',),
|
||||
'lengst': ('langt',),
|
||||
'langveges': ('langveges',),
|
||||
'langvegesfra': ('langvegesfra',),
|
||||
'langvegs': ('langvegs',),
|
||||
'langvegsfra': ('langvegsfra',),
|
||||
'langveis': ('langveis',),
|
||||
'langveisfra': ('langveisfra',),
|
||||
'larghetto': ('larghetto',),
|
||||
'largo': ('largo',),
|
||||
'lassevis': ('lassevis',),
|
||||
'legato': ('legato',),
|
||||
'leilighetsvis': ('leilighetsvis',),
|
||||
'lell': ('lell',),
|
||||
'lenge': ('lenge',),
|
||||
'lenger': ('lenge',),
|
||||
'lengst': ('lenge',),
|
||||
'lenger': ('lenger',),
|
||||
'liddelig': ('liddelig',),
|
||||
'like': ('like',),
|
||||
|
@ -408,19 +573,25 @@ ADVERBS_WORDFORMS = {
|
|||
'maestoso': ('maestoso',),
|
||||
'mala fide': ('mala fide',),
|
||||
'malapropos': ('malapropos',),
|
||||
'mannemellom': ('mannemellom',),
|
||||
'massevis': ('massevis',),
|
||||
'med rette': ('med rette',),
|
||||
'medio': ('medio',),
|
||||
'medium': ('medium',),
|
||||
'medsols': ('medsols',),
|
||||
'medstrøms': ('medstrøms',),
|
||||
'meget': ('meget',),
|
||||
'mengdevis': ('mengdevis',),
|
||||
'metervis': ('metervis',),
|
||||
'mezzoforte': ('mezzoforte',),
|
||||
'midsommers': ('midsommers',),
|
||||
'midsommers': ('midsommers',),
|
||||
'midt': ('midt',),
|
||||
'midtfjords': ('midtfjords',),
|
||||
'midtskips': ('midtskips',),
|
||||
'midtsommers': ('midtsommers',),
|
||||
'midtsommers': ('midtsommers',),
|
||||
'midtveges': ('midtveges',),
|
||||
'midtvegs': ('midtvegs',),
|
||||
'midtveis': ('midtveis',),
|
||||
'midtvinters': ('midtvinters',),
|
||||
'midvinters': ('midvinters',),
|
||||
'milevis': ('milevis',),
|
||||
|
@ -445,6 +616,13 @@ ADVERBS_WORDFORMS = {
|
|||
'naturligvis': ('naturligvis',),
|
||||
'nauende': ('nauende',),
|
||||
'navnlig': ('navnlig',),
|
||||
'ned': ('ned',),
|
||||
'nedad': ('nedad',),
|
||||
'nedatil': ('nedatil',),
|
||||
'nede': ('nede',),
|
||||
'nedentil': ('nedentil',),
|
||||
'nedenunder': ('nedenunder',),
|
||||
'nedstrøms': ('nedstrøms',),
|
||||
'neigu': ('neigu',),
|
||||
'neimen': ('neimen',),
|
||||
'nemlig': ('nemlig',),
|
||||
|
@ -452,31 +630,46 @@ ADVERBS_WORDFORMS = {
|
|||
'nesegrus': ('nesegrus',),
|
||||
'nest': ('nest',),
|
||||
'nesten': ('nesten',),
|
||||
'netto': ('netto',),
|
||||
'nettopp': ('nettopp',),
|
||||
'noenlunde': ('noenlunde',),
|
||||
'noensinne': ('noensinne',),
|
||||
'noensteds': ('noensteds',),
|
||||
'nok': ('nok',),
|
||||
'nok': ('nok',),
|
||||
'noksom': ('noksom',),
|
||||
'nokså': ('nokså',),
|
||||
'non stop': ('non stop',),
|
||||
'nonstop': ('nonstop',),
|
||||
'nord': ('nord',),
|
||||
'nordafjells': ('nordafjells',),
|
||||
'nordaust': ('nordaust',),
|
||||
'nordenfjells': ('nordenfjells',),
|
||||
'nordost': ('nordost',),
|
||||
'nordvest': ('nordvest',),
|
||||
'nordøst': ('nordøst',),
|
||||
'notabene': ('notabene',),
|
||||
'nu': ('nu',),
|
||||
'nylig': ('nylig',),
|
||||
'nyss': ('nyss',),
|
||||
'nå': ('nå',),
|
||||
'når': ('når',),
|
||||
'nåvel': ('nåvel',),
|
||||
'nær': ('nær',),
|
||||
'nærere': ('nær',),
|
||||
'nærmere': ('nær',),
|
||||
'nærest': ('nær',),
|
||||
'nærmest': ('nær',),
|
||||
'nære': ('nære',),
|
||||
'nærere': ('nærere',),
|
||||
'nærest': ('nærest',),
|
||||
'nærme': ('nærme',),
|
||||
'nærmere': ('nærmere',),
|
||||
'nærmest': ('nærmest',),
|
||||
'nødig': ('nødig',),
|
||||
'nødigere': ('nødig',),
|
||||
'nødigst': ('nødig',),
|
||||
'nødvendigvis': ('nødvendigvis',),
|
||||
'offside': ('offside',),
|
||||
'ofte': ('ofte',),
|
||||
'oftere': ('ofte',),
|
||||
'oftest': ('ofte',),
|
||||
'også': ('også',),
|
||||
'om att': ('om att',),
|
||||
'om igjen': ('om igjen',),
|
||||
|
@ -485,11 +678,18 @@ ADVERBS_WORDFORMS = {
|
|||
'omsonst': ('omsonst',),
|
||||
'omtrent': ('omtrent',),
|
||||
'onnimellom': ('onnimellom',),
|
||||
'opp': ('opp',),
|
||||
'opp att': ('opp att',),
|
||||
'opp ned': ('opp ned',),
|
||||
'oppad': ('oppad',),
|
||||
'oppe': ('oppe',),
|
||||
'oppstrøms': ('oppstrøms',),
|
||||
'ost': ('ost',),
|
||||
'ovabords': ('ovabords',),
|
||||
'ovatil': ('ovatil',),
|
||||
'oven': ('oven',),
|
||||
'ovenbords': ('ovenbords',),
|
||||
'oventil': ('oventil',),
|
||||
'overalt': ('overalt',),
|
||||
'overens': ('overens',),
|
||||
'overhodet': ('overhodet',),
|
||||
|
@ -506,8 +706,6 @@ ADVERBS_WORDFORMS = {
|
|||
'partout': ('partout',),
|
||||
'parvis': ('parvis',),
|
||||
'per capita': ('per capita',),
|
||||
'peu à peu': ('peu à peu',),
|
||||
'peu om peu': ('peu om peu',),
|
||||
'pianissimo': ('pianissimo',),
|
||||
'piano': ('piano',),
|
||||
'pinende': ('pinende',),
|
||||
|
@ -554,7 +752,6 @@ ADVERBS_WORDFORMS = {
|
|||
'respektive': ('respektive',),
|
||||
'rettsøles': ('rettsøles',),
|
||||
'reverenter': ('reverenter',),
|
||||
'riktig nok': ('riktig nok',),
|
||||
'riktignok': ('riktignok',),
|
||||
'rimeligvis': ('rimeligvis',),
|
||||
'ringside': ('ringside',),
|
||||
|
@ -567,6 +764,8 @@ ADVERBS_WORDFORMS = {
|
|||
'saktelig': ('saktelig',),
|
||||
'saktens': ('saktens',),
|
||||
'sammen': ('sammen',),
|
||||
'sammesteds': ('sammesteds',),
|
||||
'sammestedsfra': ('sammestedsfra',),
|
||||
'samstundes': ('samstundes',),
|
||||
'samt': ('samt',),
|
||||
'sann': ('sann',),
|
||||
|
@ -578,6 +777,7 @@ ADVERBS_WORDFORMS = {
|
|||
'senhøstes': ('senhøstes',),
|
||||
'sia': ('sia',),
|
||||
'sic': ('sic',),
|
||||
'sidelangs': ('sidelangs',),
|
||||
'sidelengs': ('sidelengs',),
|
||||
'siden': ('siden',),
|
||||
'sideveges': ('sideveges',),
|
||||
|
@ -587,9 +787,9 @@ ADVERBS_WORDFORMS = {
|
|||
'silde': ('silde',),
|
||||
'simpelthen': ('simpelthen',),
|
||||
'sine anno': ('sine anno',),
|
||||
'sistpå': ('sistpå',),
|
||||
'sjelden': ('sjelden',),
|
||||
'sjøleies': ('sjøleies',),
|
||||
'sjøleis': ('sjøleis',),
|
||||
'sjøverts': ('sjøverts',),
|
||||
'skeis': ('skeis',),
|
||||
'skiftevis': ('skiftevis',),
|
||||
|
@ -607,6 +807,9 @@ ADVERBS_WORDFORMS = {
|
|||
'smekk': ('smekk',),
|
||||
'smellende': ('smellende',),
|
||||
'småningom': ('småningom',),
|
||||
'snart': ('snart',),
|
||||
'snarere': ('snart',),
|
||||
'snarest': ('snart',),
|
||||
'sneisevis': ('sneisevis',),
|
||||
'snesevis': ('snesevis',),
|
||||
'snuft': ('snuft',),
|
||||
|
@ -616,6 +819,7 @@ ADVERBS_WORDFORMS = {
|
|||
'snyte': ('snyte',),
|
||||
'solo': ('solo',),
|
||||
'sommerstid': ('sommerstid',),
|
||||
'sommesteds': ('sommesteds',),
|
||||
'spenna': ('spenna',),
|
||||
'spent': ('spent',),
|
||||
'spika': ('spika',),
|
||||
|
@ -651,6 +855,7 @@ ADVERBS_WORDFORMS = {
|
|||
'styggelig': ('styggelig',),
|
||||
'styggende': ('styggende',),
|
||||
'stykkevis': ('stykkevis',),
|
||||
'styrbord': ('styrbord',),
|
||||
'støtt': ('støtt',),
|
||||
'støtvis': ('støtvis',),
|
||||
'støytvis': ('støytvis',),
|
||||
|
@ -658,6 +863,12 @@ ADVERBS_WORDFORMS = {
|
|||
'summa summarum': ('summa summarum',),
|
||||
'surr': ('surr',),
|
||||
'svinaktig': ('svinaktig',),
|
||||
'svint': ('svint',),
|
||||
'svintere': ('svint',),
|
||||
'svintest': ('svint',),
|
||||
'syd': ('syd',),
|
||||
'sydost': ('sydost',),
|
||||
'sydvest': ('sydvest',),
|
||||
'sydøst': ('sydøst',),
|
||||
'synderlig': ('synderlig',),
|
||||
'så': ('så',),
|
||||
|
@ -672,6 +883,13 @@ ADVERBS_WORDFORMS = {
|
|||
'søkk': ('søkk',),
|
||||
'søkkende': ('søkkende',),
|
||||
'sønder': ('sønder',),
|
||||
'sønna': ('sønna',),
|
||||
'sønnafjells': ('sønnafjells',),
|
||||
'sønnenfjells': ('sønnenfjells',),
|
||||
'sør': ('sør',),
|
||||
'søraust': ('søraust',),
|
||||
'sørvest': ('sørvest',),
|
||||
'sørøst': ('sørøst',),
|
||||
'takimellom': ('takimellom',),
|
||||
'takomtil': ('takomtil',),
|
||||
'temmelig': ('temmelig',),
|
||||
|
@ -679,10 +897,15 @@ ADVERBS_WORDFORMS = {
|
|||
'tidligdags': ('tidligdags',),
|
||||
'tidsnok': ('tidsnok',),
|
||||
'tidvis': ('tidvis',),
|
||||
'til like': ('til like',),
|
||||
'tilbake': ('tilbake',),
|
||||
'tilfeldigvis': ('tilfeldigvis',),
|
||||
'tilmed': ('tilmed',),
|
||||
'tilnærmelsesvis': ('tilnærmelsesvis',),
|
||||
'timevis': ('timevis',),
|
||||
'titt': ('titt',),
|
||||
'tiere': ('titt',),
|
||||
'tiest': ('titt',),
|
||||
'tjokkende': ('tjokkende',),
|
||||
'tomreipes': ('tomreipes',),
|
||||
'tott': ('tott',),
|
||||
|
@ -695,44 +918,55 @@ ADVERBS_WORDFORMS = {
|
|||
'trutt': ('trutt',),
|
||||
'turevis': ('turevis',),
|
||||
'turvis': ('turvis',),
|
||||
'tusenfold': ('tusenfold',),
|
||||
'tusenvis': ('tusenvis',),
|
||||
'tvers': ('tvers',),
|
||||
'tvert': ('tvert',),
|
||||
'tydeligvis': ('tydeligvis',),
|
||||
'tynnevis': ('tynnevis',),
|
||||
'tynnevis': ('tynnevis',),
|
||||
'tålig': ('tålig',),
|
||||
'tønnevis': ('tønnevis',),
|
||||
'tønnevis': ('tønnevis',),
|
||||
'ufravendt': ('ufravendt',),
|
||||
'ugjerne': ('ugjerne',),
|
||||
'uheldigvis': ('uheldigvis',),
|
||||
'ukevis': ('ukevis',),
|
||||
'ukevis': ('ukevis',),
|
||||
'ultimo': ('ultimo',),
|
||||
'ulykkeligvis': ('ulykkeligvis',),
|
||||
'uløyves': ('uløyves',),
|
||||
'undas': ('undas',),
|
||||
'underhånden': ('underhånden',),
|
||||
'undertiden': ('undertiden',),
|
||||
'undervegs': ('undervegs',),
|
||||
'underveis': ('underveis',),
|
||||
'unntakelsesvis': ('unntakelsesvis',),
|
||||
'unntaksvis': ('unntaksvis',),
|
||||
'ustyggelig': ('ustyggelig',),
|
||||
'ut': ('ut',),
|
||||
'utaboks': ('utaboks',),
|
||||
'utad': ('utad',),
|
||||
'utalands': ('utalands',),
|
||||
'utbygdes': ('utbygdes',),
|
||||
'utdragsvis': ('utdragsvis',),
|
||||
'ute': ('ute',),
|
||||
'utelukkende': ('utelukkende',),
|
||||
'utenat': ('utenat',),
|
||||
'utenboks': ('utenboks',),
|
||||
'utenlands': ('utenlands',),
|
||||
'utomhus': ('utomhus',),
|
||||
'uvegerlig': ('uvegerlig',),
|
||||
'uviselig': ('uviselig',),
|
||||
'uvislig': ('uvislig',),
|
||||
'va banque': ('va banque',),
|
||||
'vanligvis': ('vanligvis',),
|
||||
'vann': ('vann',),
|
||||
'vekevis': ('vekevis',),
|
||||
'vekevis': ('vekevis',),
|
||||
'ved like': ('ved like',),
|
||||
'veggimellom': ('veggimellom',),
|
||||
'vekk': ('vekk',),
|
||||
'vekke': ('vekke',),
|
||||
'vekselvis': ('vekselvis',),
|
||||
'vel': ('vel',),
|
||||
'vest': ('vest',),
|
||||
'vesta': ('vesta',),
|
||||
'vestafjells': ('vestafjells',),
|
||||
'vestenfjells': ('vestenfjells',),
|
||||
'vibrato': ('vibrato',),
|
||||
'vice versa': ('vice versa',),
|
||||
'vide': ('vide',),
|
||||
|
@ -741,7 +975,6 @@ ADVERBS_WORDFORMS = {
|
|||
'viselig': ('viselig',),
|
||||
'visselig': ('visselig',),
|
||||
'visst': ('visst',),
|
||||
'visst nok': ('visst nok',),
|
||||
'visstnok': ('visstnok',),
|
||||
'vivace': ('vivace',),
|
||||
'vonlig': ('vonlig',),
|
||||
|
@ -754,40 +987,183 @@ ADVERBS_WORDFORMS = {
|
|||
'årlig års': ('årlig års',),
|
||||
'åssen': ('åssen',),
|
||||
'ørende': ('ørende',),
|
||||
'øst': ('øst',),
|
||||
'østa': ('østa',),
|
||||
'østafjells': ('østafjells',),
|
||||
'østenfjells': ('østenfjells',),
|
||||
'øyensynlig': ('øyensynlig',),
|
||||
'antageligvis': ('antageligvis',),
|
||||
'coolly': ('coolly',),
|
||||
'kor': ('kor',),
|
||||
'korfor': ('korfor',),
|
||||
'kor': ('kor',),
|
||||
'korfor': ('korfor',),
|
||||
'medels': ('medels',),
|
||||
'nasegrus': ('nasegrus',),
|
||||
'overimorgen': ('overimorgen',),
|
||||
'unntagelsesvis': ('unntagelsesvis',),
|
||||
'åffer': ('åffer',),
|
||||
'åffer': ('åffer',),
|
||||
'sist': ('sist',),
|
||||
'seinhaustes': ('seinhaustes',),
|
||||
'stetse': ('stetse',),
|
||||
'stikk': ('stikk',),
|
||||
'storlig': ('storlig',),
|
||||
'A': ('A',),
|
||||
'for': ('for',),
|
||||
'still going strong': ('still going strong',),
|
||||
'til og med': ('til og med',),
|
||||
'i hu': ('i hu',),
|
||||
'dengang': ('dengang',),
|
||||
'derborte': ('derborte',),
|
||||
'derefter': ('derefter',),
|
||||
'derinne': ('derinne',),
|
||||
'dernede': ('dernede',),
|
||||
'deromkring': ('deromkring',),
|
||||
'etterhvert': ('etterhvert',),
|
||||
'fordømrade': ('fordømrade',),
|
||||
'foreksempel': ('foreksempel',),
|
||||
'forsåvidt': ('forsåvidt',),
|
||||
'forøvrig': ('forøvrig',),
|
||||
'herefter': ('herefter',),
|
||||
'hvertfall': ('hvertfall',),
|
||||
'idag': ('idag',),
|
||||
'ifjor': ('ifjor',),
|
||||
'i gang': ('i gang',),
|
||||
'igår': ('igår',),
|
||||
'ihvertfall': ('ihvertfall',),
|
||||
'ikveld': ('ikveld',),
|
||||
'iland': ('iland',),
|
||||
'imorgen': ('imorgen',),
|
||||
'imøte': ('imøte',),
|
||||
'inatt': ('inatt',),
|
||||
'iorden': ('iorden',),
|
||||
'istand': ('istand',),
|
||||
'istedet': ('istedet',),
|
||||
'javisst': ('javisst',),
|
||||
'neivisst': ('neivisst',),
|
||||
'fortsatt': ('fortsatt',),
|
||||
'slik': ('slik',),
|
||||
'sådan': ('sådan',),
|
||||
'sånn': ('sånn',),
|
||||
'for eksempel': ('for eksempel',),
|
||||
'fra barnsbein av': ('fra barnsbein av',),
|
||||
'fra barnsben av': ('fra barnsben av',),
|
||||
'fra oven': ('fra oven',),
|
||||
'på vidvanke': ('på vidvanke',),
|
||||
'rubb og stubb': ('rubb og stubb',),
|
||||
'akterifra': ('akterifra',),
|
||||
'andsynes': ('andsynes',),
|
||||
'austenom': ('austenom',),
|
||||
'avslutningsvis': ('avslutningsvis',),
|
||||
'bøttevis': ('bøttevis',),
|
||||
'bakenfra': ('bakenfra',),
|
||||
'bakenom': ('bakenom',),
|
||||
'baki': ('baki',),
|
||||
'bedriftsvis': ('bedriftsvis',),
|
||||
'beklageligvis': ('beklageligvis',),
|
||||
'benveges': ('benveges',),
|
||||
'benveies': ('benveies',),
|
||||
'bistrende': ('bistrende',),
|
||||
'bitvis': ('bitvis',),
|
||||
'bortenom': ('bortenom',),
|
||||
'bortmed': ('bortmed',),
|
||||
'bråfort': ('bråfort',),
|
||||
'bunkevis': ('bunkevis',),
|
||||
'ca': ('ca',),
|
||||
'derigjennom': ('derigjennom',),
|
||||
'derover': ('derover',),
|
||||
'dessuaktet': ('dessuaktet',),
|
||||
'distriktsvis': ('distriktsvis',),
|
||||
'doloroso': ('doloroso',),
|
||||
'erfaringsvis': ('erfaringsvis',),
|
||||
'falskelig': ('falskelig',),
|
||||
'fjellstøtt': ('fjellstøtt',),
|
||||
'flekkvis': ('flekkvis',),
|
||||
'flerveis': ('flerveis',),
|
||||
'forholdvis': ('forholdvis',),
|
||||
'fornemmelig': ('fornemmelig',),
|
||||
'fornuftigvis': ('fornuftigvis',),
|
||||
'forsiktigvis': ('forsiktigvis',),
|
||||
'forskottsvis': ('forskottsvis',),
|
||||
'forskuddsvis': ('forskuddsvis',),
|
||||
'forutsetningsvis': ('forutsetningsvis',),
|
||||
'framt': ('framt',),
|
||||
'fremt': ('fremt',),
|
||||
'godhetsfullt': ('godhetsfullt',),
|
||||
'hvortil': ('hvortil',),
|
||||
'hvorunder': ('hvorunder',),
|
||||
'hvorved': ('hvorved',),
|
||||
'iltrende': ('iltrende',),
|
||||
'innatil': ('innatil',),
|
||||
'innentil': ('innentil',),
|
||||
'innigjennom': ('innigjennom',),
|
||||
'kilometervis': ('kilometervis',),
|
||||
'klattvis': ('klattvis',),
|
||||
'kolonnevis': ('kolonnevis',),
|
||||
'kommunevis': ('kommunevis',),
|
||||
'listelig': ('listelig',),
|
||||
'lusende': ('lusende',),
|
||||
'mildelig': ('mildelig',),
|
||||
'milevidt': ('milevidt',),
|
||||
'nordøstover': ('nordøstover',),
|
||||
'ovenover': ('ovenover',),
|
||||
'periodevis': ('periodevis',),
|
||||
'pirende': ('pirende',),
|
||||
'priori': ('priori',),
|
||||
'rettnok': ('rettnok',),
|
||||
'rykkvis': ('rykkvis',),
|
||||
'sørøstover': ('sørøstover',),
|
||||
'sørvestover': ('sørvestover',),
|
||||
'sedvanligvis': ('sedvanligvis',),
|
||||
'seksjonsvis': ('seksjonsvis',),
|
||||
'styggfort': ('styggfort',),
|
||||
'stykkomtil': ('stykkomtil',),
|
||||
'sydvestover': ('sydvestover',),
|
||||
'terminvis': ('terminvis',),
|
||||
'tertialvis': ('tertialvis',),
|
||||
'utdannelsesmessig': ('utdannelsesmessig',),
|
||||
'vis-à-vis': ('vis-à-vis',),
|
||||
'før': ('før',),
|
||||
'jo': ('jo',),
|
||||
'såvel': ('såvel',),
|
||||
'efterhvert': ('efterhvert',),
|
||||
'liksom': ('liksom',),
|
||||
'dann og vann': ('dann og vann',),
|
||||
'jaggu': ('jaggu',),
|
||||
'joggu': ('joggu',),
|
||||
'knekk': ('knekk',),
|
||||
'live': ('live',),
|
||||
'og': ('og',),
|
||||
'sabla': ('sabla',),
|
||||
'sikksakk': ('sikksakk',),
|
||||
'stadig': ('stadig',),
|
||||
'rett og slett': ('rett og slett',),
|
||||
'såvidt': ('såvidt',),
|
||||
'for moro skyld': ('for moro skyld',),
|
||||
'omlag': ('omlag',),
|
||||
'nattestid': ('nattestid',),
|
||||
'sørpe': ('sørpe',),
|
||||
'A.': ('A.',),
|
||||
'selv': ('selv',),
|
||||
'forlengst': ('forlengst',),
|
||||
'sjøl': ('sjøl',),
|
||||
'drita': ('drita',),
|
||||
'ennu': ('ennu',),
|
||||
'skauleies': ('skauleies',),
|
||||
'da capo': ('da capo',),
|
||||
'iallefall': ('iallefall',),
|
||||
'til alters': ('til alters',),
|
||||
'pokka': ('pokka',),
|
||||
'tilslutt': ('tilslutt',),
|
||||
'i steden': ('i steden',),
|
||||
'm.a.': ('m.a.',),
|
||||
'til syvende og sist': ('til syvende og sist',),
|
||||
'i en fei': ('i en fei',),
|
||||
'ender og da': ('ender og da',),
|
||||
'ender og gang': ('ender og gang',),
|
||||
'fra arilds tid': ('fra arilds tid',),
|
||||
'i hør og heim': ('i hør og heim',),
|
||||
'for fote': ('for fote',),
|
||||
'natterstid': ('natterstid',),
|
||||
'natterstider': ('natterstider',),
|
||||
'høgstdags': ('høgstdags',),
|
||||
'høgstnattes': ('høgstnattes',),
|
||||
'beint frem': ('beint frem',),
|
||||
'beintfrem': ('beintfrem',),
|
||||
'beinveges': ('beinveges',),
|
||||
'beinvegs': ('beinvegs',),
|
||||
'beinveis': ('beinveis',),
|
||||
'benvegs': ('benvegs',),
|
||||
'benveis': ('benveis',),
|
||||
'en garde': ('en garde',),
|
||||
'etter hvert': ('etter hvert',),
|
||||
'framåt': ('framåt',),
|
||||
'krittende': ('krittende',),
|
||||
'kvivitt': ('kvivitt',),
|
||||
|
@ -801,5 +1177,14 @@ ADVERBS_WORDFORMS = {
|
|||
'til sammen': ('til sammen',),
|
||||
'tomrepes': ('tomrepes',),
|
||||
'medurs': ('medurs',),
|
||||
'moturs': ('moturs',)
|
||||
'moturs': ('moturs',),
|
||||
'til ansvar': ('til ansvar',),
|
||||
'til ansvars': ('til ansvars',),
|
||||
'til fullnads': ('til fullnads',),
|
||||
'concertando': ('concertando',),
|
||||
'lesto': ('lesto',),
|
||||
'tardando': ('tardando',),
|
||||
'natters tid': ('natters tid',),
|
||||
'natters tider': ('natters tider',),
|
||||
'snydens': ('snydens',)
|
||||
}
|
||||
|
|
File diff suppressed because it is too large
File diff suppressed because it is too large
File diff suppressed because it is too large
73
spacy/lang/tl/__init__.py
Normal file
|
@ -0,0 +1,73 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
|
||||
# uncomment if files are available
|
||||
# from .norm_exceptions import NORM_EXCEPTIONS
|
||||
from .tag_map import TAG_MAP
|
||||
# from .morph_rules import MORPH_RULES
|
||||
|
||||
# uncomment if lookup-based lemmatizer is available
|
||||
from .lemmatizer import LOOKUP
|
||||
# from ...lemmatizerlookup import Lemmatizer
|
||||
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ..norm_exceptions import BASE_NORMS
|
||||
from ...language import Language
|
||||
from ...attrs import LANG, NORM
|
||||
from ...util import update_exc, add_lookups
|
||||
|
||||
def _return_tl(_):
|
||||
return 'tl'
|
||||
|
||||
|
||||
# Create a Language subclass
|
||||
# Documentation: https://spacy.io/docs/usage/adding-languages
|
||||
|
||||
# This file should be placed in spacy/lang/xx (ISO code of language).
|
||||
# Before submitting a pull request, make sure to remove all comments from the
|
||||
# language data files, and run at least the basic tokenizer tests. Simply add the
|
||||
# language ID to the list of languages in spacy/tests/conftest.py to include it
|
||||
# in the basic tokenizer sanity tests. You can optionally add a fixture for the
|
||||
# language's tokenizer and add more specific tests. For more info, see the
|
||||
# tests documentation: https://github.com/explosion/spaCy/tree/master/spacy/tests
|
||||
|
||||
|
||||
class TagalogDefaults(Language.Defaults):
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters[LANG] = _return_tl # ISO code
|
||||
# add more norm exception dictionaries here
|
||||
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
|
||||
|
||||
# overwrite functions for lexical attributes
|
||||
lex_attr_getters.update(LEX_ATTRS)
|
||||
|
||||
# add custom tokenizer exceptions to base exceptions
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
|
||||
# add stop words
|
||||
stop_words = STOP_WORDS
|
||||
|
||||
# if available: add tag map
|
||||
# tag_map = dict(TAG_MAP)
|
||||
|
||||
# if available: add morph rules
|
||||
# morph_rules = dict(MORPH_RULES)
|
||||
|
||||
# if available: add lookup lemmatizer
|
||||
# @classmethod
|
||||
# def create_lemmatizer(cls, nlp=None):
|
||||
# return Lemmatizer(LOOKUP)
|
||||
|
||||
|
||||
class Tagalog(Language):
|
||||
lang = 'tl' # ISO code
|
||||
Defaults = TagalogDefaults # set Defaults to custom language defaults
|
||||
|
||||
|
||||
# set default export – this allows the language class to be lazy-loaded
|
||||
__all__ = ['Tagalog']
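The `add_lookups(...)` call in `TagalogDefaults` chains lookup tables in front of a default attribute getter. The sketch below is a simplified stand-in for that idea in plain Python, not spaCy's actual implementation; `chain_lookups`, `default_norm` and the sample table are hypothetical names and data chosen for illustration.

```python
def chain_lookups(default_func, *lookups):
    """Return a getter that consults each lookup table before the default."""
    def getter(string):
        for table in lookups:
            if string in table:
                return table[string]
        # No table had an entry, so fall back to the default getter.
        return default_func(string)
    return getter


def default_norm(string):
    # Assumed default behaviour for this sketch: lowercase the string.
    return string.lower()


# Hypothetical norm table: map a typographic apostrophe to a plain one.
base_norms = {"\u2019": "'"}

get_norm = chain_lookups(default_norm, base_norms)
print(get_norm("\u2019"))  # found in the table
print(get_norm("Tayo"))    # falls back to default_norm
```

Under these assumptions, tables listed earlier take precedence over later ones, and the default getter only runs when no table matches.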
|
18
spacy/lang/tl/lemmatizer.py
Normal file
|
@ -0,0 +1,18 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
# Adding a lemmatizer lookup table
|
||||
# Documentation: https://spacy.io/docs/usage/adding-languages#lemmatizer
|
||||
# Entries should be added in the following format:
|
||||
|
||||
|
||||
LOOKUP = {
|
||||
"kaugnayan": "ugnay",
|
||||
"sangkatauhan": "tao",
|
||||
"kanayunan": "nayon",
|
||||
"pandaigdigan": "daigdig",
|
||||
"kasaysayan": "saysay",
|
||||
"kabayanihan": "bayani",
|
||||
"karuwagan": "duwag"
|
||||
}
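A lookup table like this is typically used by mapping each surface form to its lemma and falling back to the form itself when there is no entry. The following is a minimal sketch of that behaviour in plain Python, using three of the entries above; `lookup_lemma` is a hypothetical helper, not part of spaCy.

```python
# Subset of the LOOKUP table from the diff above.
LOOKUP = {
    "kaugnayan": "ugnay",
    "sangkatauhan": "tao",
    "kanayunan": "nayon",
}


def lookup_lemma(word, table=LOOKUP):
    """Return the lemma for `word`, or `word` unchanged if it is not listed."""
    return table.get(word, word)


print(lookup_lemma("kanayunan"))  # listed form
print(lookup_lemma("nayon"))      # unlisted form is returned as-is
```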
|
43
spacy/lang/tl/lex_attrs.py
Normal file
|
@ -0,0 +1,43 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
# import the symbols for the attrs you want to overwrite
|
||||
from ...attrs import LIKE_NUM
|
||||
|
||||
|
||||
# Overwriting functions for lexical attributes
|
||||
# Documentation: https://spacy.io/docs/usage/adding-languages#lex-attrs
|
||||
# Most of these functions, like is_lower or like_url should be language-
|
||||
# independent. Others, like like_num (which includes both digits and number
|
||||
# words), require customisation.
|
||||
|
||||
|
||||
# Example: check if token resembles a number
|
||||
|
||||
_num_words = ['sero', 'isa', 'dalawa', 'tatlo', 'apat', 'lima', 'anim', 'pito',
|
||||
'walo', 'siyam', 'sampu', 'labing-isa', 'labindalawa', 'labintatlo', 'labing-apat',
|
||||
'labinlima', 'labing-anim', 'labimpito', 'labing-walo', 'labinsiyam', 'dalawampu',
|
||||
'tatlumpu', 'apatnapu', 'limampu', 'animnapu', 'pitumpu', 'walumpu', 'siyamnapu',
|
||||
'daan', 'libo', 'milyon', 'bilyon', 'trilyon', 'quadrilyon',
|
||||
'gajilyon', 'bazilyon']
|
||||
|
||||
|
||||
def like_num(text):
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    if text in _num_words:
        return True
    return False
|
||||
|
||||
|
||||
# Create dictionary of functions to overwrite. The default lex_attr_getters are
|
||||
# updated with this one, so only the functions defined here are overwritten.
|
||||
|
||||
LEX_ATTRS = {
|
||||
LIKE_NUM: like_num
|
||||
}
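To make the behaviour of `like_num` concrete, here is a self-contained sketch that repeats the same logic with a small subset of the number words and runs it on a few sample strings. `NUM_WORDS` and `samples` are illustrative names, not part of the committed file.

```python
# Subset of _num_words from the diff above, for illustration only.
NUM_WORDS = ["sero", "isa", "dalawa", "tatlo", "labing-isa"]


def like_num(text):
    # Strip thousands separators and decimal points before the digit check.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as "3/4".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text in NUM_WORDS


samples = ["1,000", "10.5", "3/4", "dalawa", "Dalawa", "aso"]
results = {s: like_num(s) for s in samples}
print(results)
```

Note that the word lookup is case-sensitive, so `"Dalawa"` is not matched in this sketch; lowercasing the text before the lookup would be one way to change that, if desired.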
|
162
spacy/lang/tl/stop_words.py
Normal file
|
@ -0,0 +1,162 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
# Add stop words
|
||||
# Documentation: https://spacy.io/docs/usage/adding-languages#stop-words
|
||||
# To improve readability, words should be ordered alphabetically and separated
|
||||
# by spaces and newlines. When adding stop words from an online source, always
|
||||
# include the link in a comment. Make sure to proofread and double-check the
|
||||
# words – lists available online are often known to contain mistakes.
|
||||
|
||||
# data from https://github.com/stopwords-iso/stopwords-tl/blob/master/stopwords-tl.txt
|
||||
|
||||
STOP_WORDS = set("""
|
||||
akin
|
||||
aking
|
||||
ako
|
||||
alin
|
||||
am
|
||||
amin
|
||||
aming
|
||||
ang
|
||||
ano
|
||||
anumang
|
||||
apat
|
||||
at
|
||||
atin
|
||||
ating
|
||||
ay
|
||||
bababa
|
||||
bago
|
||||
bakit
|
||||
bawat
|
||||
bilang
|
||||
dahil
|
||||
dalawa
|
||||
dapat
|
||||
din
|
||||
dito
|
||||
doon
|
||||
gagawin
|
||||
gayunman
|
||||
ginagawa
|
||||
ginawa
|
||||
ginawang
|
||||
gumawa
|
||||
gusto
|
||||
habang
|
||||
hanggang
|
||||
hindi
|
||||
huwag
|
||||
iba
|
||||
ibaba
|
||||
ibabaw
|
||||
ibig
|
||||
ikaw
|
||||
ilagay
|
||||
ilalim
|
||||
ilan
|
||||
inyong
|
||||
isa
|
||||
isang
|
||||
itaas
|
||||
ito
|
||||
iyo
|
||||
iyon
|
||||
iyong
|
||||
ka
|
||||
kahit
|
||||
kailangan
|
||||
kailanman
|
||||
kami
|
||||
kanila
|
||||
kanilang
|
||||
kanino
|
||||
kanya
|
||||
kanyang
|
||||
kapag
|
||||
kapwa
|
||||
karamihan
|
||||
katiyakan
|
||||
katulad
|
||||
kaya
|
||||
kaysa
|
||||
ko
|
||||
kong
|
||||
kulang
|
||||
kumuha
|
||||
kung
|
||||
laban
|
||||
lahat
|
||||
lamang
|
||||
likod
|
||||
lima
|
||||
maaari
|
||||
maaaring
|
||||
maging
|
||||
mahusay
|
||||
makita
|
||||
marami
|
||||
marapat
|
||||
masyado
|
||||
may
|
||||
mayroon
|
||||
mga
|
||||
minsan
|
||||
mismo
|
||||
mula
|
||||
muli
|
||||
na
|
||||
nabanggit
|
||||
naging
|
||||
nagkaroon
|
||||
nais
|
||||
nakita
|
||||
namin
|
||||
napaka
|
||||
narito
|
||||
nasaan
|
||||
ng
|
||||
ngayon
|
||||
ni
|
||||
nila
|
||||
nilang
|
||||
nito
|
||||
niya
|
||||
niyang
|
||||
noon
|
||||
o
|
||||
pa
|
||||
paano
|
||||
pababa
|
||||
paggawa
|
||||
pagitan
|
||||
pagkakaroon
|
||||
pagkatapos
|
||||
palabas
|
||||
pamamagitan
|
||||
panahon
|
||||
pangalawa
|
||||
para
|
||||
paraan
|
||||
pareho
|
||||
pataas
|
||||
pero
|
||||
pumunta
|
||||
pumupunta
|
||||
sa
|
||||
saan
|
||||
sabi
|
||||
sabihin
|
||||
sarili
|
||||
sila
|
||||
sino
|
||||
siya
|
||||
tatlo
|
||||
tayo
|
||||
tulad
|
||||
tungkol
|
||||
una
|
||||
walang
|
||||
""".split())
|
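The `set("""...""".split())` idiom above turns a whitespace-separated block of words into a set with fast membership tests. A minimal sketch of how such a set might be used to filter tokens follows; it uses only a handful of the words above, and `remove_stop_words` and the sample tokens are hypothetical.

```python
# A few entries from the stop list above; split() handles spaces and newlines.
STOP_WORDS = set("""
akin aking ako
ang at ay
""".split())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercased form is in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["Ako", "ay", "guro"]  # illustrative input
print(remove_stop_words(tokens))
```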
36
spacy/lang/tl/tag_map.py
Normal file
|
@ -0,0 +1,36 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
|
||||
from ...symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ
|
||||
|
||||
|
||||
# Add a tag map
|
||||
# Documentation: https://spacy.io/docs/usage/adding-languages#tag-map
|
||||
# Universal Dependencies: http://universaldependencies.org/u/pos/all.html
|
||||
# The keys of the tag map should be strings in your tag set. The dictionary must
|
||||
# have an entry POS whose value is one of the Universal Dependencies tags.
|
||||
# Optionally, you can also include morphological features or other attributes.
|
||||
|
||||
|
||||
TAG_MAP = {
|
||||
"ADV": {POS: ADV},
|
||||
"NOUN": {POS: NOUN},
|
||||
"ADP": {POS: ADP},
|
||||
"PRON": {POS: PRON},
|
||||
"SCONJ": {POS: SCONJ},
|
||||
"PROPN": {POS: PROPN},
|
||||
"DET": {POS: DET},
|
||||
"SYM": {POS: SYM},
|
||||
"INTJ": {POS: INTJ},
|
||||
"PUNCT": {POS: PUNCT},
|
||||
"NUM": {POS: NUM},
|
||||
"AUX": {POS: AUX},
|
||||
"X": {POS: X},
|
||||
"CONJ": {POS: CONJ},
|
||||
"CCONJ": {POS: CCONJ},
|
||||
"ADJ": {POS: ADJ},
|
||||
"VERB": {POS: VERB},
|
||||
"PART": {POS: PART},
|
||||
"SP": {POS: SPACE}
|
||||
}
|
48
spacy/lang/tl/tokenizer_exceptions.py
Normal file
|
@ -0,0 +1,48 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
# import symbols – if you need to use more, add them here
|
||||
from ...symbols import ORTH, LEMMA, TAG, NORM, ADP, DET
|
||||
|
||||
|
||||
# Add tokenizer exceptions
|
||||
# Documentation: https://spacy.io/docs/usage/adding-languages#tokenizer-exceptions
|
||||
# Feel free to use custom logic to generate repetitive exceptions more efficiently.
|
||||
# If an exception is split into more than one token, the ORTH values combined always
|
||||
# need to match the original string.
|
||||
|
||||
# Exceptions should be added in the following format:
|
||||
|
||||
_exc = {
|
||||
"tayo'y": [
|
||||
{ORTH: "tayo", LEMMA: "tayo"},
|
||||
{ORTH: "'y", LEMMA: "ay"}],
|
||||
"isa'y": [
|
||||
{ORTH: "isa", LEMMA: "isa"},
|
||||
{ORTH: "'y", LEMMA: "ay"}],
|
||||
"baya'y": [
|
||||
{ORTH: "baya", LEMMA: "bayan"},
|
||||
{ORTH: "'y", LEMMA: "ay"}],
|
||||
"sa'yo": [
|
||||
{ORTH: "sa", LEMMA: "sa"},
|
||||
{ORTH: "'yo", LEMMA: "iyo"}],
|
||||
"ano'ng": [
|
||||
{ORTH: "ano", LEMMA: "ano"},
|
||||
{ORTH: "'ng", LEMMA: "ang"}],
|
||||
"siya'y": [
|
||||
{ORTH: "siya", LEMMA: "siya"},
|
||||
{ORTH: "'y", LEMMA: "ay"}],
|
||||
"nawa'y": [
|
||||
{ORTH: "nawa", LEMMA: "nawa"},
|
||||
{ORTH: "'y", LEMMA: "ay"}],
|
||||
"papa'no": [
|
||||
{ORTH: "papa'no", LEMMA: "papaano"}],
|
||||
"'di": [
|
||||
{ORTH: "'di", LEMMA: "hindi"}]
|
||||
}
|
||||
|
||||
|
||||
# To keep things clean and readable, it's recommended to only declare the
|
||||
# TOKENIZER_EXCEPTIONS at the bottom:
|
||||
|
||||
TOKENIZER_EXCEPTIONS = _exc
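The comment at the top of this file requires that the `ORTH` values of a split exception, joined together, equal the original string. That rule can be checked mechanically; the sketch below does so in plain Python, using the strings `"ORTH"` and `"LEMMA"` as stand-ins for spaCy's symbols. `find_mismatches` is a hypothetical helper.

```python
ORTH, LEMMA = "ORTH", "LEMMA"  # plain-string stand-ins for spaCy's symbols

# Three of the exceptions from the diff above.
exceptions = {
    "tayo'y": [{ORTH: "tayo", LEMMA: "tayo"}, {ORTH: "'y", LEMMA: "ay"}],
    "sa'yo": [{ORTH: "sa", LEMMA: "sa"}, {ORTH: "'yo", LEMMA: "iyo"}],
    "'di": [{ORTH: "'di", LEMMA: "hindi"}],
}


def find_mismatches(exc):
    """Return (key, joined ORTH) pairs where the pieces do not rebuild the key."""
    bad = []
    for string, tokens in exc.items():
        joined = "".join(tok[ORTH] for tok in tokens)
        if joined != string:
            bad.append((string, joined))
    return bad


print(find_mismatches(exceptions))  # empty list: every entry is consistent

# A deliberately broken entry (apostrophe dropped) is reported.
broken = {"ano'ng": [{ORTH: "ano"}, {ORTH: "ng"}]}
print(find_mismatches(broken))
```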
|
|
@ -291,6 +291,8 @@ cdef char get_quantifier(PatternStateC state) nogil:
|
|||
|
||||
DEF PADDING = 5
|
||||
|
||||
DEF PADDING = 5
|
||||
|
||||
|
||||
cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id,
|
||||
object token_specs) except NULL:
|
||||
|
|
|
@ -189,6 +189,25 @@ def test_doc_api_merge(en_tokenizer):
|
|||
assert doc[5].text_with_ws == "all night"
|
||||
assert doc[5].tag_ == "NAMED"
|
||||
|
||||
# merge both with bulk merge
|
||||
doc = en_tokenizer(text)
|
||||
assert len(doc) == 9
|
||||
with doc.retokenize() as retokenizer:
|
||||
retokenizer.merge(
|
||||
doc[4:7], attrs={"tag": "NAMED", "lemma": "LEMMA", "ent_type": "TYPE"}
|
||||
)
|
||||
retokenizer.merge(
|
||||
doc[7:9], attrs={"tag": "NAMED", "lemma": "LEMMA", "ent_type": "TYPE"}
|
||||
)
|
||||
|
||||
assert len(doc) == 6
|
||||
assert doc[4].text == "the beach boys"
|
||||
assert doc[4].text_with_ws == "the beach boys "
|
||||
assert doc[4].tag_ == "NAMED"
|
||||
assert doc[5].text == "all night"
|
||||
assert doc[5].text_with_ws == "all night"
|
||||
assert doc[5].tag_ == "NAMED"
|
||||
|
||||
|
||||
def test_doc_api_merge_children(en_tokenizer):
|
||||
"""Test that attachments work correctly after merging."""
|
||||
|
|
|
@ -67,6 +67,22 @@ def test_spans_merge_non_disjoint(en_tokenizer):
|
|||
)
|
||||
|
||||
|
||||
def test_spans_merge_non_disjoint(en_tokenizer):
|
||||
text = "Los Angeles start."
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens])
|
||||
with pytest.raises(ValueError):
|
||||
with doc.retokenize() as retokenizer:
|
||||
retokenizer.merge(
|
||||
doc[0:2],
|
||||
attrs={"tag": "NNP", "lemma": "Los Angeles", "ent_type": "GPE"},
|
||||
)
|
||||
retokenizer.merge(
|
||||
doc[0:1],
|
||||
attrs={"tag": "NNP", "lemma": "Los Angeles", "ent_type": "GPE"},
|
||||
)
|
||||
|
||||
|
||||
def test_span_np_merges(en_tokenizer):
|
||||
text = "displaCy is a parse tool built with Javascript"
|
||||
heads = [1, 0, 2, 1, -3, -1, -1, -1]
|
||||
|
|
|
@ -5,15 +5,36 @@ import pytest
|
|||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"text", ["aujourd'hui", "Aujourd'hui", "prud'hommes", "prud’hommal",
|
||||
"audio-numérique", "Audio-numérique",
|
||||
"entr'amis", "entr'abat", "rentr'ouvertes", "grand'hamien",
|
||||
"Châteauneuf-la-Forêt", "Château-Guibert",
|
||||
"11-septembre", "11-Septembre", "refox-trottâmes",
|
||||
"K-POP", "K-Pop", "K-pop", "z'yeutes",
|
||||
"black-outeront", "états-unienne",
|
||||
"courtes-pattes", "court-pattes",
|
||||
"saut-de-ski", "Écourt-Saint-Quentin", "Bout-de-l'Îlien", "pet-en-l'air"]
|
||||
"text",
|
||||
[
|
||||
"aujourd'hui",
|
||||
"Aujourd'hui",
|
||||
"prud'hommes",
|
||||
"prud’hommal",
|
||||
"audio-numérique",
|
||||
"Audio-numérique",
|
||||
"entr'amis",
|
||||
"entr'abat",
|
||||
"rentr'ouvertes",
|
||||
"grand'hamien",
|
||||
"Châteauneuf-la-Forêt",
|
||||
"Château-Guibert",
|
||||
"11-septembre",
|
||||
"11-Septembre",
|
||||
"refox-trottâmes",
|
||||
"K-POP",
|
||||
"K-Pop",
|
||||
"K-pop",
|
||||
"z'yeutes",
|
||||
"black-outeront",
|
||||
"états-unienne",
|
||||
"courtes-pattes",
|
||||
"court-pattes",
|
||||
"saut-de-ski",
|
||||
"Écourt-Saint-Quentin",
|
||||
"Bout-de-l'Îlien",
|
||||
"pet-en-l'air",
|
||||
],
|
||||
)
|
||||
def test_fr_tokenizer_infix_exceptions(fr_tokenizer, text):
|
||||
tokens = fr_tokenizer(text)
|
||||
|
|
89
spacy/tests/regression/_test_issue1622.py
Normal file
|
@ -0,0 +1,89 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
import json
|
||||
from tempfile import NamedTemporaryFile
|
||||
import pytest
|
||||
|
||||
from ...cli.train import train
|
||||
|
||||
|
||||
def test_cli_trained_model_can_be_saved(tmpdir):
|
||||
lang = 'nl'
|
||||
output_dir = str(tmpdir)
|
||||
train_file = NamedTemporaryFile('wb', dir=output_dir, delete=False)
|
||||
train_corpus = [
|
||||
{
|
||||
"id": "identifier_0",
|
||||
"paragraphs": [
|
||||
{
|
||||
"raw": "Jan houdt van Marie.\n",
|
||||
"sentences": [
|
||||
{
|
||||
"tokens": [
|
||||
{
|
||||
"id": 0,
|
||||
"dep": "nsubj",
|
||||
"head": 1,
|
||||
"tag": "NOUN",
|
||||
"orth": "Jan",
|
||||
"ner": "B-PER"
|
||||
},
|
||||
{
|
||||
"id": 1,
|
||||
"dep": "ROOT",
|
||||
"head": 0,
|
||||
"tag": "VERB",
|
||||
"orth": "houdt",
|
||||
"ner": "O"
|
||||
},
|
||||
{
|
||||
"id": 2,
|
||||
"dep": "case",
|
||||
"head": 1,
|
||||
"tag": "ADP",
|
||||
"orth": "van",
|
||||
"ner": "O"
|
||||
},
|
||||
{
|
||||
"id": 3,
|
||||
"dep": "obj",
|
||||
"head": -2,
|
||||
"tag": "NOUN",
|
||||
"orth": "Marie",
|
||||
"ner": "B-PER"
|
||||
},
|
||||
{
|
||||
"id": 4,
|
||||
"dep": "punct",
|
||||
"head": -3,
|
||||
"tag": "PUNCT",
|
||||
"orth": ".",
|
||||
"ner": "O"
|
||||
},
|
||||
{
|
||||
"id": 5,
|
||||
"dep": "",
|
||||
"head": -1,
|
||||
"tag": "SPACE",
|
||||
"orth": "\n",
|
||||
"ner": "O"
|
||||
}
|
||||
],
|
||||
"brackets": []
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
|
||||
train_file.write(json.dumps(train_corpus).encode('utf-8'))
|
||||
train_file.close()
|
||||
train_data = train_file.name
|
||||
dev_data = train_data
|
||||
|
||||
# spacy train -n 1 -g -1 nl output_nl training_corpus.json training \
|
||||
# corpus.json
|
||||
train(lang, output_dir, train_data, dev_data, n_iter=1)
|
||||
|
||||
assert True
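The `"head"` values in the training corpus above look like offsets relative to each token's own position rather than absolute indices. Assuming that reading (it is an assumption, not something the diff states), the sketch below converts them to absolute head indices; `absolute_heads` is a hypothetical helper.

```python
# (orth, head) pairs copied from the corpus above.
tokens = [
    ("Jan", 1),
    ("houdt", 0),
    ("van", 1),
    ("Marie", -2),
    (".", -3),
    ("\n", -1),
]


def absolute_heads(tokens):
    """Convert relative head offsets to absolute token indices (assumed format)."""
    return [i + offset for i, (_, offset) in enumerate(tokens)]


heads = absolute_heads(tokens)
for (orth, _), head in zip(tokens, heads):
    print(repr(orth), "->", repr(tokens[head][0]))
```

Under this assumption the root token "houdt" points to itself, and "Jan", "Marie" and "." all attach to it.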
|
36
spacy/tests/regression/test_issue2800.py
Normal file
|
@ -0,0 +1,36 @@
|
|||
'''Test issue that arises when too many labels are added to NER model.'''
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import random
|
||||
from ...lang.en import English
|
||||
|
||||
def train_model(train_data, entity_types):
|
||||
nlp = English(pipeline=[])
|
||||
|
||||
ner = nlp.create_pipe("ner")
|
||||
nlp.add_pipe(ner)
|
||||
|
||||
for entity_type in list(entity_types):
|
||||
ner.add_label(entity_type)
|
||||
|
||||
optimizer = nlp.begin_training()
|
||||
|
||||
# Start training
|
||||
for i in range(20):
|
||||
losses = {}
|
||||
index = 0
|
||||
random.shuffle(train_data)
|
||||
|
||||
for statement, entities in train_data:
|
||||
nlp.update([statement], [entities], sgd=optimizer, losses=losses, drop=0.5)
|
||||
return nlp
|
||||
|
||||
|
||||
def test_train_with_many_entity_types():
|
||||
train_data = []
|
||||
train_data.extend([("One sentence", {"entities": []})])
|
||||
entity_types = [str(i) for i in range(1000)]
|
||||
|
||||
model = train_model(train_data, entity_types)
|
||||
|
||||
|
40
spacy/tests/test_symlink_windows.py
Normal file
|
@ -0,0 +1,40 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
import os
|
||||
from pathlib import Path
|
||||
|
||||
from ..compat import symlink_to, symlink_remove, path2str
|
||||
|
||||
|
||||
def target_local_path():
|
||||
return "./foo-target"
|
||||
|
||||
|
||||
def link_local_path():
|
||||
return "./foo-symlink"
|
||||
|
||||
|
||||
@pytest.fixture(scope="function")
|
||||
def setup_target(request):
|
||||
target = Path(target_local_path())
|
||||
if not target.exists():
|
||||
os.mkdir(path2str(target))
|
||||
|
||||
# yield -- need to clean up even if assertion fails
|
||||
# https://github.com/pytest-dev/pytest/issues/2508#issuecomment-309934240
|
||||
def cleanup():
|
||||
symlink_remove(Path(link_local_path()))
|
||||
os.rmdir(target_local_path())
|
||||
|
||||
request.addfinalizer(cleanup)
|
||||
|
||||
|
||||
def test_create_symlink_windows(setup_target):
|
||||
target = Path(target_local_path())
|
||||
link = Path(link_local_path())
|
||||
assert target.exists()
|
||||
|
||||
symlink_to(link, target)
|
||||
assert link.exists()
|
|
@ -865,7 +865,7 @@ cdef class Token:
|
|||
return Lexeme.c_check_flag(self.c.lex, IS_LEFT_PUNCT)
|
||||
|
||||
property is_right_punct:
|
||||
"""RETURNS (bool): Whether the token is a left punctuation mark."""
|
||||
"""RETURNS (bool): Whether the token is a right punctuation mark."""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_RIGHT_PUNCT)
|
||||
|
||||
|
|
|
@ -2,7 +2,7 @@
|
|||
|
||||
p
|
||||
| Models trained on the
|
||||
| #[+a("https://catalog.ldc.upenn.edu/ldc2013t19") OntoNotes 5] corpus
|
||||
| #[+a("https://catalog.ldc.upenn.edu/LDC2013T19") OntoNotes 5] corpus
|
||||
| support the following entity types:
|
||||
|
||||
+table(["Type", "Description"])
|
||||
|
|
|
@ -245,7 +245,7 @@ p The following file format converters are available:
|
|||
|
||||
+row
|
||||
+cell #[code iob]
|
||||
+cell IOB named entity recognition format.
|
||||
+cell IOB or IOB2 named entity recognition format.
|
||||
|
||||
+h(3, "train") Train
|
||||
|
||||
|
|
|
@ -352,6 +352,7 @@ p Retokenize the document, such that the span is merged into a single token.
|
|||
+h(2, "ents") Span.ents
|
||||
+tag property
|
||||
+tag-model("NER")
|
||||
+tag-new("2.0.12")
|
||||
|
||||
p
|
||||
| Iterate over the entities in the span. Yields named-entity
|
||||
|
|
|
@ -714,7 +714,7 @@ p The L2 norm of the token's vector representation.
|
|||
+cell bool
|
||||
+cell
|
||||
| Does the token consist of ASCII characters? Equivalent to
|
||||
| #[code [any(ord(c) >= 128 for c in token.text)]].
|
||||
| #[code all(ord(c) < 128 for c in token.text)].
|
||||
|
||||
+row
|
||||
+cell #[code is_digit]
|
||||
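The documentation change above swaps an `any(ord(c) >= 128 ...)` expression for `all(ord(c) < 128 ...)`. The two are logical negations of each other, which a small plain-Python sketch can show; the function names here are illustrative and operate on strings rather than spaCy tokens.

```python
def is_ascii(text):
    """True when every character is in the 7-bit ASCII range."""
    return all(ord(c) < 128 for c in text)


def has_non_ascii(text):
    """The earlier expression: True when any character is outside ASCII."""
    return any(ord(c) >= 128 for c in text)


for sample in ["spaCy", "p\u00e5", ""]:
    # The two checks always disagree, including for the empty string.
    print(repr(sample), is_ascii(sample), has_non_ascii(sample))
```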
|
|
|
@ -91,8 +91,8 @@ p
|
|||
|
||||
p
|
||||
| spaCy can be installed on GPU by specifying #[code spacy[cuda]],
|
||||
| #[code spacy[cuda90]], #[code spacy[cuda91]], #[code spacy[cuda92]] or
|
||||
| #[code spacy[cuda10]]. If you know your cuda version, using the more
|
||||
| #[code spacy[cuda90]], #[code spacy[cuda91]] or #[code spacy[cuda92]].
|
||||
| If you know your cuda version, using the more
|
||||
| explicit specifier allows cupy to be installed via wheel, saving some
|
||||
| compilation time. The specifiers should install two libraries:
|
||||
| #[+a("https://cupy.chainer.org") #[code cupy]] and
|
||||
|
|
|
@ -206,7 +206,8 @@ p
|
|||
nlp = spacy.load('en_core_web_sm')
|
||||
matcher = PhraseMatcher(nlp.vocab)
|
||||
terminology_list = ['Barack Obama', 'Angela Merkel', 'Washington, D.C.']
|
||||
patterns = [nlp(text) for text in terminology_list]
|
||||
# Only run nlp.make_doc to speed things up
|
||||
patterns = [nlp.make_doc(text) for text in terminology_list]
|
||||
matcher.add('TerminologyList', None, *patterns)
|
||||
|
||||
doc = nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
|
||||
|
|
|
@ -44,7 +44,7 @@ p
|
|||
|
||||
+list.o-no-block
|
||||
+item #[strong Chinese]: #[+a("https://github.com/fxsjy/jieba") Jieba]
|
||||
+item #[strong Japanese]: #[+a("https://github.com/taku910/mecab") MeCab]
|
||||
+item #[strong Japanese]: #[+a("https://github.com/taku910/mecab") MeCab] with #[+a("http://unidic.ninjal.ac.jp/back_number#unidic_cwj") Unidic]
|
||||
+item #[strong Thai]: #[+a("https://github.com/wannaphongcom/pythainlp") pythainlp]
|
||||
+item #[strong Vietnamese]: #[+a("https://github.com/trungtv/pyvi") Pyvi]
|
||||
+item #[strong Russian]: #[+a("https://github.com/kmike/pymorphy2") pymorphy2]
|
||||
|
|
|
@ -72,7 +72,7 @@ p
|
|||
name = 'entity_matcher'
|
||||
|
||||
def __init__(self, nlp, terms, label):
|
||||
patterns = [nlp(text) for text in terms]
|
||||
patterns = [nlp.make_doc(text) for text in terms]
|
||||
self.matcher = PhraseMatcher(nlp.vocab)
|
||||
self.matcher.add(label, None, *patterns)
|
||||
|
||||
|
|
|
@ -240,7 +240,7 @@ p
|
|||
+code-new.
|
||||
from spacy.matcher import PhraseMatcher
|
||||
matcher = PhraseMatcher(nlp.vocab)
|
||||
patterns = [nlp(text) for text in large_terminology_list]
|
||||
patterns = [nlp.make_doc(text) for text in large_terminology_list]
|
||||
matcher.add('PRODUCT', None, *patterns)
|
||||
|
||||
+code-old.
|
||||
|
|