mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-11 17:56:30 +03:00
Update develop from master
commit
4336397ecb
@ -8,7 +8,6 @@ environment:
- PYTHON: "C:\\Python27-x64"
#- PYTHON: "C:\\Python34"
#- PYTHON: "C:\\Python35"
#- PYTHON: "C:\\Python27-x64"
#- DISTUTILS_USE_SDK: "1"
#- PYTHON: "C:\\Python34-x64"
#- DISTUTILS_USE_SDK: "1"
106
.github/contributors/DimaBryuhanov.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [X] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Dmitry Briukhanov    |
| Company name (if applicable)   | -                    |
| Title or role (if applicable)  | -                    |
| Date                           | 7/24/2018            |
| GitHub username                | DimaBryuhanov        |
| Website (optional)             |                      |
|
106
.github/contributors/EmilStenstrom.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field                          | Entry                   |
|------------------------------- | ----------------------- |
| Name                           | Emil Stenström          |
| Company name (if applicable)   | -                       |
| Title or role (if applicable)  | -                       |
| Date                           | 2018-07-28              |
| GitHub username                | EmilStenstrom           |
| Website (optional)             | https://friendlybit.com |
|
106
.github/contributors/aashishg.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Aashish Gangwani     |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 7/08/2018            |
| GitHub username                | aashishg             |
| Website (optional)             |                      |
|
106
.github/contributors/sammous.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [ ] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field                          | Entry                      |
|------------------------------- | -------------------------- |
| Name                           | Sami Moustachir            |
| Company name (if applicable)   |                            |
| Title or role (if applicable)  | Data Scientist             |
| Date                           | 2018-08-02                 |
| GitHub username                | sammous                    |
| Website (optional)             | https://samimoustachir.com |
|
106
.github/contributors/vikaskyadav.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [ ] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | vikas yadav          |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  | Data Scientist       |
| Date                           | 1 August 2018        |
| GitHub username                | vikaskyadav          |
| Website (optional)             | www.vikaskyadav.tk   |
|
106
.github/contributors/wojtuch.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Wojciech Lukasiewicz |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 13.08.2018 |
|
||||
| GitHub username | wojtuch |
|
||||
| Website (optional) | |
|
15
bin/push-tag.sh
Executable file
|
@ -0,0 +1,15 @@
|
|||
#!/usr/bin/env bash
|
||||
|
||||
set -e
|
||||
|
||||
# Insist repository is clean
|
||||
git diff-index --quiet HEAD
|
||||
|
||||
git checkout master
|
||||
git pull origin master
|
||||
version=$(grep "__version__ = " spacy/about.py)
|
||||
version=${version/__version__ = }
|
||||
version=${version/\'/}
|
||||
version=${version/\'/}
|
||||
git tag "v$version"
|
||||
git push origin --tags
|
48
examples/pipeline/custom_sentence_segmentation.py
Normal file
|
@ -0,0 +1,48 @@
|
|||
'''Example of adding a pipeline component to prohibit sentence boundaries
|
||||
before certain tokens.
|
||||
|
||||
What we do is write to the token.is_sent_start attribute, which
|
||||
takes values in {True, False, None}. The default value None allows the parser
|
||||
to predict sentence segments. The value False prohibits the parser from inserting
|
||||
a sentence boundary before that token. Note that fixing the sentence segmentation
|
||||
should also improve the parse quality.
|
||||
|
||||
The specific example here is drawn from https://github.com/explosion/spaCy/issues/2627
|
||||
Other versions of the model may not make the original mistake, so the specific
|
||||
example might not be apt for future versions.
|
||||
'''
|
||||
import plac
|
||||
import spacy
|
||||
|
||||
def prevent_sentence_boundaries(doc):
|
||||
for token in doc:
|
||||
if not can_be_sentence_start(token):
|
||||
token.is_sent_start = False
|
||||
return doc
|
||||
|
||||
def can_be_sentence_start(token):
|
||||
if token.i == 0:
|
||||
return True
|
||||
elif token.is_title:
|
||||
return True
|
||||
elif token.nbor(-1).is_punct:
|
||||
return True
|
||||
elif token.nbor(-1).is_space:
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
|
||||
def main():
|
||||
nlp = spacy.load('en_core_web_lg')
|
||||
raw_text = "Been here and I'm loving it."
|
||||
doc = nlp(raw_text)
|
||||
sentences = [sent.string.strip() for sent in doc.sents]
|
||||
print(sentences)
|
||||
nlp.add_pipe(prevent_sentence_boundaries, before='parser')
|
||||
doc = nlp(raw_text)
|
||||
sentences = [sent.string.strip() for sent in doc.sents]
|
||||
print(sentences)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
|
@ -1,5 +1,5 @@
|
|||
cython>=0.24,<0.28.0
|
||||
numpy>=1.7
|
||||
numpy>=1.15.0
|
||||
cymem>=1.30,<1.32
|
||||
preshed>=1.0.0,<2.0.0
|
||||
thinc>=6.11.2,<6.12.0
|
||||
|
|
2
setup.py
|
@ -188,7 +188,7 @@ def setup_package():
|
|||
ext_modules=ext_modules,
|
||||
scripts=['bin/spacy'],
|
||||
install_requires=[
|
||||
'numpy>=1.7',
|
||||
'numpy>=1.15.0',
|
||||
'murmurhash>=0.28,<0.29',
|
||||
'cymem>=1.30,<1.32',
|
||||
'preshed>=1.0.0,<2.0.0',
|
||||
|
|
|
@ -1,5 +1,8 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
import warnings
|
||||
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
|
||||
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
|
||||
|
||||
from .cli.info import info as cli_info
|
||||
from .glossary import explain
|
||||
|
|
|
@ -3,13 +3,13 @@
|
|||
# https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py
|
||||
|
||||
__title__ = 'spacy-nightly'
|
||||
__version__ = '2.1.0a0'
|
||||
__version__ = '2.1.0a2'
|
||||
__summary__ = 'Industrial-strength Natural Language Processing (NLP) with Python and Cython'
|
||||
__uri__ = 'https://spacy.io'
|
||||
__author__ = 'Explosion AI'
|
||||
__email__ = 'contact@explosion.ai'
|
||||
__license__ = 'MIT'
|
||||
__release__ = True
|
||||
__release__ = False
|
||||
|
||||
__download_url__ = 'https://github.com/explosion/spacy-models/releases/download'
|
||||
__compatibility__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json'
|
||||
|
|
|
@ -4,9 +4,12 @@ from __future__ import unicode_literals
|
|||
from .._messages import Messages
|
||||
from ...compat import json_dumps, path2str
|
||||
from ...util import prints
|
||||
from ...gold import iob_to_biluo
|
||||
import re
|
||||
|
||||
|
||||
def conllu2json(input_path, output_path, n_sents=10, use_morphology=False):
|
||||
|
||||
"""
|
||||
Convert conllu files into JSON format for use with train cli.
|
||||
use_morphology parameter enables appending morphology to tags, which is
|
||||
|
@ -14,15 +17,27 @@ def conllu2json(input_path, output_path, n_sents=10, use_morphology=False):
|
|||
"""
|
||||
# by @dvsrepo, via #11 explosion/spacy-dev-resources
|
||||
|
||||
"""
|
||||
Extract NER tags if available and convert them so that they follow
|
||||
BILUO and the Wikipedia scheme
|
||||
"""
|
||||
# by @katarkor
|
||||
|
||||
docs = []
|
||||
sentences = []
|
||||
conll_tuples = read_conllx(input_path, use_morphology=use_morphology)
|
||||
checked_for_ner = False
|
||||
has_ner_tags = False
|
||||
|
||||
for i, (raw_text, tokens) in enumerate(conll_tuples):
|
||||
sentence, brackets = tokens[0]
|
||||
sentences.append(generate_sentence(sentence))
|
||||
if not checked_for_ner:
|
||||
has_ner_tags = is_ner(sentence[5][0])
|
||||
checked_for_ner = True
|
||||
sentences.append(generate_sentence(sentence, has_ner_tags))
|
||||
# Real-sized documents could be extracted using the comments on the
|
||||
# CoNLL-U document
|
||||
|
||||
if(len(sentences) % n_sents == 0):
|
||||
doc = create_doc(sentences, i)
|
||||
docs.append(doc)
|
||||
|
@ -37,6 +52,21 @@ def conllu2json(input_path, output_path, n_sents=10, use_morphology=False):
|
|||
title=Messages.M032.format(name=path2str(output_file)))
|
||||
|
||||
|
||||
def is_ner(tag):
|
||||
|
||||
"""
|
||||
Check the 10th column of the first token to determine if the file contains
|
||||
NER tags
|
||||
"""
|
||||
|
||||
tag_match = re.match('([A-Z_]+)-([A-Z_]+)', tag)
|
||||
if tag_match:
|
||||
return True
|
||||
elif tag == "O":
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
|
||||
def read_conllx(input_path, use_morphology=False, n=0):
|
||||
text = input_path.open('r', encoding='utf-8').read()
|
||||
i = 0
|
||||
|
@ -49,7 +79,7 @@ def read_conllx(input_path, use_morphology=False, n=0):
|
|||
for line in lines:
|
||||
|
||||
parts = line.split('\t')
|
||||
id_, word, lemma, pos, tag, morph, head, dep, _1, _2 = parts
|
||||
id_, word, lemma, pos, tag, morph, head, dep, _1, iob = parts
|
||||
if '-' in id_ or '.' in id_:
|
||||
continue
|
||||
try:
|
||||
|
@ -58,7 +88,7 @@ def read_conllx(input_path, use_morphology=False, n=0):
|
|||
dep = 'ROOT' if dep == 'root' else dep
|
||||
tag = pos if tag == '_' else tag
|
||||
tag = tag+'__'+morph if use_morphology else tag
|
||||
tokens.append((id_, word, tag, head, dep, 'O'))
|
||||
tokens.append((id_, word, tag, head, dep, iob))
|
||||
except:
|
||||
print(line)
|
||||
raise
|
||||
|
@ -68,17 +98,47 @@ def read_conllx(input_path, use_morphology=False, n=0):
|
|||
if n >= 1 and i >= n:
|
||||
break
|
||||
|
||||
def simplify_tags(iob):
|
||||
|
||||
"""
|
||||
Simplify tags obtained from the dataset in order to follow Wikipedia
|
||||
scheme (PER, LOC, ORG, MISC). 'PER', 'LOC' and 'ORG' keep their tags, while
|
||||
'GPE_LOC' is simplified to 'LOC', 'GPE_ORG' to 'ORG' and all remaining tags to
|
||||
'MISC'.
|
||||
"""
|
||||
|
||||
def generate_sentence(sent):
|
||||
(id_, word, tag, head, dep, _) = sent
|
||||
new_iob = []
|
||||
for tag in iob:
|
||||
tag_match = re.match('([A-Z_]+)-([A-Z_]+)', tag)
|
||||
if tag_match:
|
||||
prefix = tag_match.group(1)
|
||||
suffix = tag_match.group(2)
|
||||
if suffix == 'GPE_LOC':
|
||||
suffix = 'LOC'
|
||||
elif suffix == 'GPE_ORG':
|
||||
suffix = 'ORG'
|
||||
elif suffix != 'PER' and suffix != 'LOC' and suffix != 'ORG':
|
||||
suffix = 'MISC'
|
||||
tag = prefix + '-' + suffix
|
||||
new_iob.append(tag)
|
||||
return new_iob
|
||||
|
||||
def generate_sentence(sent, has_ner_tags):
|
||||
(id_, word, tag, head, dep, iob) = sent
|
||||
sentence = {}
|
||||
tokens = []
|
||||
if has_ner_tags:
|
||||
iob = simplify_tags(iob)
|
||||
biluo = iob_to_biluo(iob)
|
||||
for i, id in enumerate(id_):
|
||||
token = {}
|
||||
token["id"] = id
|
||||
token["orth"] = word[i]
|
||||
token["tag"] = tag[i]
|
||||
token["head"] = head[i] - id
|
||||
token["dep"] = dep[i]
|
||||
if has_ner_tags:
|
||||
token["ner"] = biluo[i]
|
||||
tokens.append(token)
|
||||
sentence["tokens"] = tokens
|
||||
return sentence
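The tag simplification added in this hunk can be tried in isolation. The following is a minimal sketch, not the converter itself: `simplify_tags_sketch` and the sample tag list are illustrative names and data that do not appear in the commit. Only the regular expression and the mapping (GPE_LOC to LOC, GPE_ORG to ORG, anything other than PER, LOC or ORG to MISC) are taken from the diff.

```python
import re


def simplify_tags_sketch(iob):
    """Map dataset-specific IOB tags onto PER, LOC, ORG and MISC.

    Illustrative re-statement of the logic shown in the diff; the function
    name is hypothetical.
    """
    new_iob = []
    for tag in iob:
        # Tags such as "B-GPE_LOC" split into a prefix ("B") and a suffix
        # ("GPE_LOC"); a bare "O" has no hyphen and is left untouched.
        tag_match = re.match(r'([A-Z_]+)-([A-Z_]+)', tag)
        if tag_match:
            prefix = tag_match.group(1)
            suffix = tag_match.group(2)
            if suffix == 'GPE_LOC':
                suffix = 'LOC'
            elif suffix == 'GPE_ORG':
                suffix = 'ORG'
            elif suffix not in ('PER', 'LOC', 'ORG'):
                suffix = 'MISC'
            tag = prefix + '-' + suffix
        new_iob.append(tag)
    return new_iob


# Hypothetical sample input, chosen to exercise every branch.
sample = ['B-GPE_LOC', 'I-GPE_LOC', 'O', 'B-PER', 'B-EVT', 'B-GPE_ORG']
simplified = simplify_tags_sketch(sample)
print(simplified)
```

In the converter, the simplified list is then passed to `iob_to_biluo`, so the simplification has to preserve the B/I prefixes, which this sketch does.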
|
||||
|
|
|
@ -38,12 +38,13 @@ from ..compat import json_dumps
|
|||
gold_preproc=("Use gold preprocessing", "flag", "G", bool),
|
||||
version=("Model version", "option", "V", str),
|
||||
meta_path=("Optional path to meta.json. All relevant properties will be "
|
||||
"overwritten.", "option", "m", Path))
|
||||
"overwritten.", "option", "m", Path),
|
||||
verbose=("Display more information for debug", "option", None, bool))
|
||||
def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
|
||||
parser_multitasks='', entity_multitasks='',
|
||||
use_gpu=-1, vectors=None, no_tagger=False,
|
||||
no_parser=False, no_entities=False, gold_preproc=False,
|
||||
version="0.0.0", meta_path=None):
|
||||
version="0.0.0", meta_path=None, verbose=False):
|
||||
"""
|
||||
Train a model. Expects data in spaCy's JSON format.
|
||||
"""
|
||||
|
@ -146,7 +147,7 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
|
|||
gold_preproc=gold_preproc))
|
||||
nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs)
|
||||
start_time = timer()
|
||||
scorer = nlp_loaded.evaluate(dev_docs)
|
||||
scorer = nlp_loaded.evaluate(dev_docs, verbose)
|
||||
end_time = timer()
|
||||
if use_gpu < 0:
|
||||
gpu_wps = None
|
||||
|
|
|
@ -39,7 +39,7 @@ def tags_to_entities(tags):
|
|||
continue
|
||||
elif tag.startswith('I'):
|
||||
if start is None:
|
||||
raise ValueError(Errors.E067.format(tags=tags[:i]))
|
||||
raise ValueError(Errors.E067.format(tags=tags[:i+1]))
|
||||
continue
|
||||
if tag.startswith('U'):
|
||||
entities.append((tag[2:], i, i))
|
||||
|
|
|
@ -7,6 +7,8 @@ from .tag_map_general import TAG_MAP
|
|||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from .lemmatizer import LEMMA_RULES, LEMMA_INDEX, LEMMA_EXC
|
||||
from .lemmatizer.lemmatizer import GreekLemmatizer
|
||||
from .syntax_iterators import SYNTAX_ITERATORS
|
||||
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from .norm_exceptions import NORM_EXCEPTIONS
|
||||
|
@ -20,15 +22,23 @@ class GreekDefaults(Language.Defaults):
|
|||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters.update(LEX_ATTRS)
|
||||
lex_attr_getters[LANG] = lambda text: 'el' # ISO code
|
||||
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS)
|
||||
lex_attr_getters[NORM] = add_lookups(
|
||||
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS)
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
stop_words = STOP_WORDS
|
||||
lemma_rules = LEMMA_RULES
|
||||
lemma_index = LEMMA_INDEX
|
||||
tag_map = TAG_MAP
|
||||
prefixes = TOKENIZER_PREFIXES
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
infixes = TOKENIZER_INFIXES
|
||||
syntax_iterators = SYNTAX_ITERATORS
|
||||
|
||||
@classmethod
|
||||
def create_lemmatizer(cls, nlp=None):
|
||||
lemma_rules = LEMMA_RULES
|
||||
lemma_index = LEMMA_INDEX
|
||||
lemma_exc = LEMMA_EXC
|
||||
return GreekLemmatizer(index=lemma_index, exceptions=lemma_exc,
|
||||
rules=lemma_rules)
|
||||
|
||||
|
||||
class Greek(Language):
|
||||
|
@ -39,4 +49,3 @@ class Greek(Language):
|
|||
|
||||
# set default export – this allows the language class to be lazy-loaded
|
||||
__all__ = ['Greek']
|
||||
|
||||
|
|
|
@ -9,11 +9,20 @@ Example sentences to test spaCy and its language models.
|
|||
"""
|
||||
|
||||
sentences = [
|
||||
"Η άνιση κατανομή του πλούτου και του εισοδήματος, η οποία έχει λάβει τρομερές διαστάσεις, δεν δείχνει τάσεις βελτίωσης.",
|
||||
"Ο στόχος της σύντομης αυτής έκθεσης είναι να συνοψίσει τα κυριότερα συμπεράσματα των επισκοπήσεων κάθε μιας χώρας.",
|
||||
"Μέχρι αργά χθες το βράδυ ο πλοιοκτήτης παρέμενε έξω από το γραφείο του γενικού γραμματέα του υπουργείου, ενώ είχε μόνον τηλεφωνική επικοινωνία με τον υπουργό.",
|
||||
"Σύμφωνα με καλά ενημερωμένη πηγή, από την επεξεργασία του προέκυψε ότι οι δράστες της επίθεσης ήταν δύο, καθώς και ότι προσέγγισαν και αποχώρησαν από το σημείο με μοτοσικλέτα.",
|
||||
'''Η άνιση κατανομή του πλούτου και του εισοδήματος, η οποία έχει λάβει
|
||||
τρομερές διαστάσεις, δεν δείχνει τάσεις βελτίωσης.''',
|
||||
'''Ο στόχος της σύντομης αυτής έκθεσης είναι να συνοψίσει τα κυριότερα
|
||||
συμπεράσματα των επισκοπήσεων κάθε μιας χώρας.''',
|
||||
'''Μέχρι αργά χθες το βράδυ ο πλοιοκτήτης παρέμενε έξω από το γραφείο του
|
||||
γενικού γραμματέα του υπουργείου, ενώ είχε μόνον τηλεφωνική επικοινωνία με
|
||||
τον υπουργό.''',
|
||||
'''Σύμφωνα με καλά ενημερωμένη πηγή, από την επεξεργασία του προέκυψε ότι
|
||||
οι δράστες της επίθεσης ήταν δύο, καθώς και ότι προσέγγισαν και αποχώρησαν
|
||||
από το σημείο με μοτοσικλέτα.''',
|
||||
"Η υποδομή καταλυμάτων στην Ελλάδα είναι πλήρης και ανανεώνεται συνεχώς.",
|
||||
"Το επείγον ταχυδρομείο (ήτοι το παραδοτέο εντός 48 ωρών το πολύ) μπορεί να μεταφέρεται αεροπορικώς μόνον εφόσον εφαρμόζονται οι κανόνες ασφαλείας.",
|
||||
"Στις ορεινές περιοχές του νησιού οι χιονοπτώσεις και οι παγετοί είναι περιορισμένοι ενώ στις παραθαλάσσιες περιοχές σημειώνονται σπανίως."
|
||||
'''Το επείγον ταχυδρομείο (ήτοι το παραδοτέο εντός 48 ωρών το πολύ) μπορεί
|
||||
να μεταφέρεται αεροπορικώς μόνον εφόσον εφαρμόζονται οι κανόνες
|
||||
ασφαλείας.''',
|
||||
'''Στις ορεινές περιοχές του νησιού οι χιονοπτώσεις και οι παγετοί είναι
|
||||
περιορισμένοι ενώ στις παραθαλάσσιες περιοχές σημειώνονται σπανίως.'''
|
||||
]
|
||||
|
|
|
@ -5,19 +5,29 @@ from __future__ import unicode_literals
|
|||
ADJECTIVES_IRREG = {
|
||||
"χειρότερος": ("κακός",),
|
||||
"χειρότερη": ("κακός",),
|
||||
"χειρότερης": ("κακός",),
|
||||
"χειρότερο": ("κακός",),
|
||||
"χειρότεροι": ("κακός",),
|
||||
"χειρότερων": ("κακός",),
|
||||
"χειρότερου": ("κακός",),
|
||||
"βέλτιστος": ("καλός",),
|
||||
"βέλτιστη": ("καλός",),
|
||||
"βέλτιστης": ("καλός",),
|
||||
"βέλτιστο": ("καλός",),
|
||||
"βέλτιστοι": ("καλός",),
|
||||
"βέλτιστων": ("καλός",),
|
||||
"βέλτιστου": ("καλός",),
|
||||
"ελάχιστος": ("λίγος",),
|
||||
"ελάχιστα": ("λίγος",),
|
||||
"ελάχιστοι": ("λίγος",),
|
||||
"ελάχιστων": ("λίγος",),
|
||||
"ελάχιστη": ("λίγος",),
|
||||
"ελάχιστης": ("λίγος",),
|
||||
"ελάχιστο": ("λίγος",),
|
||||
"ελάχιστου": ("λίγος",),
|
||||
"πλείστος": ("πολύς",),
|
||||
"πλείστου": ("πολύς",),
|
||||
"πλείστων": ("πολύς",),
|
||||
"πολλή": ("πολύ",),
|
||||
"πολύς": ("πολύ",),
|
||||
"πολλύ": ("πολύ",),
|
||||
|
|
|
@ -3,94 +3,148 @@ from __future__ import unicode_literals
|
|||
|
||||
|
||||
ADJECTIVE_RULES = [
|
||||
["οί","ός"], # καρδιακοί
|
||||
["ές","ός"], # επιφανειακές
|
||||
["ές","ος"], # καρδιακές
|
||||
["ές","ύς"], # πολλές
|
||||
["οι","ος"],
|
||||
["αία","ος"], # ωραία
|
||||
["ωδη","ες"], # δασώδη
|
||||
["ώδη","ες"],
|
||||
["ότερη","ός"],
|
||||
["ότερος","ός"],
|
||||
["ότεροι", "ός"],
|
||||
["ότερων","ός"],
|
||||
["ότερες", "ός"],
|
||||
["οί", "ός"], # καρδιακοί -> καρδιακός. Ονομαστική πλ. σε -ός. (m)
|
||||
["ών", "ός"], # καρδιακών -> καρδιακός. Γενική πλ. σε -ός. (m)
|
||||
["ού", "ός"], # καρδιακού -> καρδιακός. Γενική εν. σε -ός. (m)
|
||||
["ή", "ός"], # καρδιακή -> καρδιακός. Ονομαστική εν. σε -ή. (f)
|
||||
["ής", "ός"], # καρδιακής -> καρδιακός. Γενική εν. σε -ή. (f)
|
||||
["ές", "ός"], # καρδιακές -> καρδιακός. Ονομαστική πλ. σε -ή. (f)
|
||||
["οι", "ος"], # ωραίοι -> ωραίος. Ονομαστική πλ. σε -ος. (m)
|
||||
["ων", "ος"], # ωραίων -> ωραίος. Γενική πλ. σε -ος. (m)
|
||||
["ου", "ος"], # ωραίου -> ωραίος. Γενική εν. σε -ος. (m)
|
||||
["ο", "ος"], # ωραίο -> ωραίος. Ονομαστική εν. σε -ο. (n)
|
||||
["α", "ος"], # χυδαία -> χυδαίος. Ονομαστική πλ. σε -ο. (n)
|
||||
["ώδη", "ώδες"], # δασώδη -> δασώδες. Ονομαστική πλ. σε -ώδες. (n)
|
||||
["ύτερη", "ός"], # καλύτερη -> καλός. Συγκριτικός βαθμός σε -ή. (f)
|
||||
["ύτερης", "ός"], # καλύτερης -> καλός. (f)
|
||||
["ύτερων", "ός"], # καλύτερων -> καλός. (f)
|
||||
["ύτερος", "ός"], # καλύτερος -> καλός. Συγκριτικός βαθμός σε -ός. (m)
|
||||
["ύτερου", "ός"], # καλύτερου -> καλός. (m)
|
||||
]
|
||||
|
||||
|
||||
# masculine -> m, feminine -> f, neuter -> n.
|
||||
NOUN_RULES = [
|
||||
["ιά","ί"], # παιδιά
|
||||
["ια","ι"], # ποτήρια
|
||||
["ες","α"], # κεραμίδες
|
||||
["ές","ά"],
|
||||
["ές","ά"],
|
||||
["ες","α"], # εσπερινές
|
||||
["ες","η"], # ζάχαρη
|
||||
["ές","ή"], # φυλακές
|
||||
["ές","ής"], # καθηγητής
|
||||
["α","ο"], # πρόβατα
|
||||
["α","α"], # ζήτημα
|
||||
["ατα","α"], # στόματα
|
||||
["άτα","άτα"], # ντομάτα
|
||||
["άτες","άτα"], # πατάτες
|
||||
["ία","ία"],
|
||||
["ιά","ιά"],
|
||||
["οί","ός"], # υπουργοί
|
||||
["ίας","ία"], # δικτατορίας, δυσωδείας, τρομοκρατίας
|
||||
["άτων","ατα"], # δικαιωμάτων
|
||||
["ώπων","ωπος"], # ανθρώπων
|
||||
["ιού", "ί"], # παιδιού -> παιδί. Γενική ενικού σε -ί. (n)
|
||||
["ιά", "ί"], # παιδιά -> παιδί. Ονομαστική πληθυντικού σε -ί. (n)
|
||||
["ιών", "ί"], # παιδιών -> παιδί. Γενική πληθυντικού σε -ί. (n)
|
||||
["ηριού", "ήρι"], # ποτηριού -> ποτήρι. Γενική ενικού σε -ι. (n)
|
||||
["ια", "ι"], # ποτήρια -> ποτήρι. Ονομαστική πληθυντικού σε -ι. (n)
|
||||
["ηριών", "ήρι"], # ποτηριών -> ποτήρι. Γενική πληθυντικού σε -ι. (n)
|
||||
["ας", "α"], # κεραμίδας -> κεραμίδα. Γενική ενικού σε -α. (f)
|
||||
["ες", "α"], # κεραμίδες -> κεραμίδα. Ονομαστική πληθυντικού σε -α. (f)
|
||||
["ων", "α"], # κεραμίδων -> κεραμίδα. Γενική πληθυντικού σε -α. (f)
|
||||
["άς", "ά"], # βελανιδιάς -> βελανιδιά. Γενική ενικού σε -ά. (f)
|
||||
["ές", "ά"], # βελανιδιές -> βελανιδιά. Ονομαστική πληθυντικού σε -ά. (f)
|
||||
["ών", "ά"], # βελανιδιών -> βελανιδιά. Γενική πληθυντικού σε -ά. (f)
|
||||
["ής", "ή"], # φυλακής -> φυλακή. Γενική ενικού σε -ή. (f)
|
||||
["ές", "ή"], # φυλακές -> φυλακή. Ονομαστική πληθυντικού σε -ή. (f)
|
||||
["ών", "ή"], # φυλακών -> φυλακή. Γενική πληθυντικού σε -ή. (f)
|
||||
["ές", "ής"], # καθηγητές -> καθηγητής. Ονομαστική πληθυντικού σε -ής. (m)
|
||||
["ών", "ής"], # καθηγητών -> καθηγητής. Γενική πληθυντικού σε -ής. (m)
|
||||
["ου", "ο"], # προβάτου -> πρόβατο. Γενική ενικού σε -ο. (n)
|
||||
["α", "ο"], # πρόβατα -> πρόβατο. Ονομαστική πληθυντικού σε -o. (n)
|
||||
["ων", "ο"], # προβάτων -> πρόβατο. Γενική πληθυντικού σε -ο. (n)
|
||||
["ητήματος", "ήτημα"], # ζητήματος -> ζήτημα. Γενική ενικού σε -α (n)
|
||||
# ζητήματα -> ζήτημα. Ονομαστική πληθυντικού σε -α. (n)
|
||||
["ητήματα", "ήτημα"],
|
||||
# ζητημάτων -> ζήτημα. Γενική πληθυντικού σε -α. (n)
|
||||
["ητημάτων", "ήτημα"],
|
||||
["τος", ""], # στόματος -> στόμα. Γενική ενικού σε -α. (n)
|
||||
["τα", "α"], # στόματα -> στόμα. Ονομαστική πληθυντικού σε -α. (n)
|
||||
["ομάτων", "όμα"], # στομάτων -> στόμα. Γενική πληθυντικού σε -α. (n)
|
||||
["ού", "ός"], # υπουργού -> υπουργός. Γενική ενικού σε -ος. (m)
|
||||
["οί", "ός"], # υπουργοί -> υπουργός. Ονομαστική πληθυντικού σε -ος. (m)
|
||||
["ών", "ός"], # υπουργών -> υπουργός. Γενική πληθυντικού σε -ος. (m)
|
||||
["ς", ""], # δικτατορίας -> δικτατορία. Γενική ενικού σε -ας. (f)
|
||||
# δικτατορίες -> δικτατορία. Ονομαστική πληθυντικού σε -ας. (f)
|
||||
["ες", "α"],
|
||||
["ιών", "ία"], # δικτατοριών -> δικτατορία. Γενική πληθυντικού σε -ας. (f)
|
||||
["α", "ας"], # βασιλιά -> βασιλιάς. Γενική ενικού σε -άς. (m)
|
||||
["δων", ""], # βασιλιάδων -> βασιλιά. Γενική πληθυντικού σε -άς. (m)
|
||||
]
|
||||
|
||||
|
||||
VERB_RULES = [
|
||||
["εις", "ω"],
|
||||
["εις","ώ"],
|
||||
["ει","ω"],
|
||||
["ει","ώ"],
|
||||
["ουμε","ω"],
|
||||
["ουμε","ώ"],
|
||||
["ούμε","ώ"], # θεώρησα
|
||||
["ούνε","ώ"], #
|
||||
["ετε","ω"],
|
||||
["ετε","ώ"],
|
||||
["ουν","ω"],
|
||||
["ουν","ώ"],
|
||||
["είς","ώ"],
|
||||
["εί","ώ"],
|
||||
["ούν","ώ"],
|
||||
["εσαι","ομαι"], #αισθάνεσαι
|
||||
["εσαι","όμαι"],
|
||||
["έσαι","ομαι"],
|
||||
["έσαι","όμαι"],
|
||||
["εται","ομαι"],
|
||||
["εται","όμαι"],
|
||||
["έται","ομαι"],
|
||||
["έται","όμαι"],
|
||||
["όμαστε","όμαι"],
|
||||
["όμαστε","ομαι"],
|
||||
["έσθε","όμαι"],
|
||||
["εσθε","όμαι"],
|
||||
["άς","ώ"], # αγαπάς
|
||||
["άει","ώ"],
|
||||
["άμε","ώ"],
|
||||
["άτε","ώ"],
|
||||
["άνε","ώ"],
|
||||
["άν","ώ"],
|
||||
["άμε","ώ"],
|
||||
["άω","ώ"], # _verbs.py could contain any of the two
|
||||
["ώ","άω"],
|
||||
["όμουν", "ομαι"], # ζαλιζόμουν
|
||||
["όμουν", "όμαι"],
|
||||
["όμουν", "αμαι"], # κοιμόμουν
|
||||
["όμουν", "αμαι"],
|
||||
["ούσα", "ώ"], # ζητούσα -> ζητώ
|
||||
["ούσες", "ώ"],
|
||||
["ούσε", "ώ"],
|
||||
["ούσαμε", "ώ"],
|
||||
["ούσατε", "ώ"],
|
||||
["ούσαν", "ώ"],
|
||||
["ούσανε", "ώ"],
|
||||
["εις", "ω"], # πάρεις -> πάρω. Ενεστώτας ρήματος σε -ω.
|
||||
["ει", "ω"],
|
||||
["ουμε", "ω"],
|
||||
["ετε", "ω"],
|
||||
["ουνε", "ω"],
|
||||
["ουν", "ω"],
|
||||
["είς", "ώ"], # πονείς -> πονώ. Ενεστώτας ρήματος σε -ώ vol1.
|
||||
["εί", "ώ"], # οι κανόνες που λείπουν καλύπτονται από το αγαπώ.
|
||||
["ούν", "ώ"],
|
||||
["εσαι", "ομαι"], # αισθάνεσαι -> αισθάνομαι. Ενεστώτας ρήματος σε -ομαι.
|
||||
["εται", "ομαι"],
|
||||
["ανόμαστε", "άνομαι"],
|
||||
["εστε", "ομαι"],
|
||||
["ονται", "ομαι"],
|
||||
["άς", "ώ"], # αγαπάς -> αγαπάω (ή αγαπώ). Ενεστώτας ρήματος σε -ώ vol2.
|
||||
["άει", "ώ"],
|
||||
["άμε", "ώ"],
|
||||
["άτε", "ώ"],
|
||||
["άνε", "ώ"],
|
||||
["άν", "ώ"],
|
||||
["άω", "ώ"],
|
||||
["ώ", "άω"],
|
||||
# ζαλιζόμουν -> ζαλίζομαι. Παρατατικός ρήματος -ίζομαι.
|
||||
["ιζόμουν", "ίζομαι"],
|
||||
["ιζόσουν", "ίζομαι"],
|
||||
["ιζόταν", "ίζομαι"],
|
||||
["ιζόμασταν", "ίζομαι"],
|
||||
["ιζόσασταν", "ίζομαι"],
|
||||
["ονταν", "ομαι"],
|
||||
["όμουν", "άμαι"], # κοιμόμουν -> κοιμάμαι. Παρατατικός ρήματος σε -άμαι.
|
||||
["όσουν", "άμαι"],
|
||||
["όταν", "άμαι"],
|
||||
["όμασταν", "άμαι"],
|
||||
["όσασταν", "άμαι"],
|
||||
["όντουσταν", "άμαι"],
|
||||
["ούσα", "ώ"], # ζητούσα -> ζητώ. Παρατατικός ρήματος σε -ώ.
|
||||
["ούσες", "ώ"],
|
||||
["ούσε", "ώ"],
|
||||
["ούσαμε", "ώ"],
|
||||
["ούσατε", "ώ"],
|
||||
["ούσαν", "ώ"],
|
||||
["ούσανε", "ώ"],
|
||||
["λαμε", "ζω"], # βγάλαμε -> βγάζω. Αόριστος ρήματος σε -ω vol1.
|
||||
["λατε", "ζω"],
|
||||
["ήρα", "άρω"], # πήρα -> πάρω. Αόριστος ρήματος σε -ω vol2.
|
||||
["ήρες", "άρω"],
|
||||
["ήρε", "άρω"],
|
||||
["ήραμε", "άρω"],
|
||||
["ήρατε", "άρω"],
|
||||
["ήραν", "άρω"],
|
||||
["ένησα", "ενώ"], # φιλοξένησα -> φιλοξενώ. Αόριστος ρήματος σε -ώ vol1.
|
||||
["ένησες", "ενώ"],
|
||||
["ένησε", "ενώ"],
|
||||
["ενήσαμε", "ενώ"],
|
||||
["ενήσατε", "ενώ"],
|
||||
["ένησαν", "ενώ"],
|
||||
["όνεσα", "ονώ"], # πόνεσα -> πονώ. Αόριστος ρήματος σε -ώ vol2.
|
||||
["όνεσες", "ονώ"],
|
||||
["όνεσε", "ονώ"],
|
||||
["έσαμε", "ώ"],
|
||||
["έσατε", "ώ"],
|
||||
["ισα", "ομαι"], # κάθισα -> κάθομαι. Αόριστος ρήματος σε -ομαι.
|
||||
["ισες", "ομαι"],
|
||||
["ισε", "ομαι"],
|
||||
["αθίσαμε", "άθομαι"],
|
||||
["αθίσατε", "άθομαι"],
|
||||
["ισαν", "ομαι"],
|
||||
["άπα", "απώ"], # αγάπα -> αγαπώ. Προστακτική ρήματος σε -άω/ώ vol1.
|
||||
["ά", "ώ"], # τιμά -> τιμώ. Προστακτική ρήματος σε άω/ώ vol2.
|
||||
["οντας", "ω"], # βλέποντας -> βλέπω. Μετοχή.
|
||||
["ξω", "ζω"], # παίξω -> παίζω. Μέλλοντας σε -ω.
|
||||
["ξεις", "ζω"],
|
||||
["ξουμε", "ζω"],
|
||||
["ξετε", "ζω"],
|
||||
["ξουν", "ζω"],
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
]
|
||||
|
||||
|
||||
|
|
|
@ -21,6 +21,8 @@ VERBS_IRREG = {
|
|||
"είπατε": ("λέω",),
|
||||
"είπαν": ("λέω",),
|
||||
"είπανε": ("λέω",),
|
||||
"πει": ("λέω",),
|
||||
"πω": ("λέω",),
|
||||
"πάω": ("πηγαίνω",),
|
||||
"πάς": ("πηγαίνω",),
|
||||
"πας": ("πηγαίνω",),
|
||||
|
@ -38,7 +40,7 @@ VERBS_IRREG = {
|
|||
"έπαιζα": ("παίζω",),
|
||||
"έπαιζες": ("παίζω",),
|
||||
"έπαιζε": ("παίζω",),
|
||||
"έπαιζαν":("παίζω,",),
|
||||
"έπαιζαν": ("παίζω",),
|
||||
"έπαιξα": ("παίζω",),
|
||||
"έπαιξες": ("παίζω",),
|
||||
"έπαιξε": ("παίζω",),
|
||||
|
@ -52,6 +54,7 @@ VERBS_IRREG = {
|
|||
"είχαμε": ("έχω",),
|
||||
"είχατε": ("έχω",),
|
||||
"είχαν": ("έχω",),
|
||||
"είχανε": ("έχω",),
|
||||
"έπαιρνα": ("παίρνω",),
|
||||
"έπαιρνες": ("παίρνω",),
|
||||
"έπαιρνε": ("παίρνω",),
|
||||
|
@ -72,6 +75,12 @@ VERBS_IRREG = {
|
|||
"έβλεπες": ("βλέπω",),
|
||||
"έβλεπε": ("βλέπω",),
|
||||
"έβλεπαν": ("βλέπω",),
|
||||
"είδα": ("βλέπω",),
|
||||
"είδες": ("βλέπω",),
|
||||
"είδε": ("βλέπω",),
|
||||
"είδαμε": ("βλέπω",),
|
||||
"είδατε": ("βλέπω",),
|
||||
"είδαν": ("βλέπω",),
|
||||
"έφερνα": ("φέρνω",),
|
||||
"έφερνες": ("φέρνω",),
|
||||
"έφερνε": ("φέρνω",),
|
||||
|
@ -122,6 +131,10 @@ VERBS_IRREG = {
|
|||
"έπεφτες": ("πέφτω",),
|
||||
"έπεφτε": ("πέφτω",),
|
||||
"έπεφταν": ("πέφτω",),
|
||||
"έπεσα": ("πέφτω",),
|
||||
"έπεσες": ("πέφτω",),
|
||||
"έπεσε": ("πέφτω",),
|
||||
"έπεσαν": ("πέφτω",),
|
||||
"έστειλα": ("στέλνω",),
|
||||
"έστειλες": ("στέλνω",),
|
||||
"έστειλε": ("στέλνω",),
|
||||
|
@ -142,6 +155,12 @@ VERBS_IRREG = {
|
|||
"έπινες": ("πίνω",),
|
||||
"έπινε": ("πίνω",),
|
||||
"έπιναν": ("πίνω",),
|
||||
"ήπια": ("πίνω",),
|
||||
"ήπιες": ("πίνω",),
|
||||
"ήπιε": ("πίνω",),
|
||||
"ήπιαμε": ("πίνω",),
|
||||
"ήπιατε": ("πίνω",),
|
||||
"ήπιαν": ("πίνω",),
|
||||
"ετύχα": ("τυχαίνω",),
|
||||
"ετύχες": ("τυχαίνω",),
|
||||
"ετύχε": ("τυχαίνω",),
|
||||
|
@ -159,4 +178,23 @@ VERBS_IRREG = {
|
|||
"τρώγατε": ("τρώω",),
|
||||
"τρώγανε": ("τρώω",),
|
||||
"τρώγαν": ("τρώω",),
|
||||
"πέρασα": ("περνώ",),
|
||||
"πέρασες": ("περνώ",),
|
||||
"πέρασε": ("περνώ",),
|
||||
"περάσαμε": ("περνώ",),
|
||||
"περάσατε": ("περνώ",),
|
||||
"πέρασαν": ("περνώ",),
|
||||
"έγδαρα": ("γδάρω",),
|
||||
"έγδαρες": ("γδάρω",),
|
||||
"έγδαρε": ("γδάρω",),
|
||||
"έγδαραν": ("γδάρω",),
|
||||
"έβγαλα": ("βγάλω",),
|
||||
"έβγαλες": ("βγάλω",),
|
||||
"έβγαλε": ("βγάλω",),
|
||||
"έβγαλαν": ("βγάλω",),
|
||||
"έφθασα": ("φτάνω",),
|
||||
"έφθασες": ("φτάνω",),
|
||||
"έφθασε": ("φτάνω",),
|
||||
"έφθασαν": ("φτάνω",),
|
||||
|
||||
}
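Each value in this table is meant to be a one-element tuple, because the lemmatizer passes the looked-up value to `list.extend`. A short sketch of why the trailing comma matters; the variable names are illustrative and not part of the commit:

```python
# A one-element tuple needs a trailing comma; without it the parentheses
# are only grouping and the value is a plain string.
as_tuple = ("λέω",)
as_string = ("λέω")

forms_from_tuple = []
forms_from_tuple.extend(as_tuple)    # adds the lemma as one item

forms_from_string = []
forms_from_string.extend(as_string)  # adds each character separately

print(forms_from_tuple)
print(forms_from_string)
```

With a plain string as the value, the lemma list would contain single characters instead of the lemma.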
|
||||
|
|
69
spacy/lang/el/lemmatizer/lemmatizer.py
Normal file
|
@ -0,0 +1,69 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ....symbols import NOUN, VERB, ADJ, PUNCT
|
||||
|
||||
'''
|
||||
The Greek lemmatizer applies the default rule-based lemmatization
|
||||
procedure with some modifications for better Greek language support.
|
||||
|
||||
The first modification is that it checks if the word for lemmatization is
|
||||
already a lemma and if yes, it just returns it.
|
||||
The second modification is about removing the base forms function which is
|
||||
not applicable to the Greek language.
|
||||
'''
|
||||
|
||||
|
||||
class GreekLemmatizer(object):
|
||||
@classmethod
|
||||
def load(cls, path, index=None, exc=None, rules=None, lookup=None):
|
||||
return cls(index, exc, rules, lookup)
|
||||
|
||||
def __init__(self, index=None, exceptions=None, rules=None, lookup=None):
|
||||
self.index = index
|
||||
self.exc = exceptions
|
||||
self.rules = rules
|
||||
self.lookup_table = lookup if lookup is not None else {}
|
||||
|
||||
def __call__(self, string, univ_pos, morphology=None):
|
||||
if not self.rules:
|
||||
return [self.lookup_table.get(string, string)]
|
||||
if univ_pos in (NOUN, 'NOUN', 'noun'):
|
||||
univ_pos = 'noun'
|
||||
elif univ_pos in (VERB, 'VERB', 'verb'):
|
||||
univ_pos = 'verb'
|
||||
elif univ_pos in (ADJ, 'ADJ', 'adj'):
|
||||
univ_pos = 'adj'
|
||||
elif univ_pos in (PUNCT, 'PUNCT', 'punct'):
|
||||
univ_pos = 'punct'
|
||||
else:
|
||||
return list(set([string.lower()]))
|
||||
lemmas = lemmatize(string, self.index.get(univ_pos, {}),
|
||||
self.exc.get(univ_pos, {}),
|
||||
self.rules.get(univ_pos, []))
|
||||
return lemmas
|
||||
|
||||
|
||||
def lemmatize(string, index, exceptions, rules):
|
||||
string = string.lower()
|
||||
forms = []
|
||||
if (string in index):
|
||||
forms.append(string)
|
||||
return forms
|
||||
forms.extend(exceptions.get(string, []))
|
||||
oov_forms = []
|
||||
if not forms:
|
||||
for old, new in rules:
|
||||
if string.endswith(old):
|
||||
form = string[:len(string) - len(old)] + new
|
||||
if not form:
|
||||
pass
|
||||
elif form in index or not form.isalpha():
|
||||
forms.append(form)
|
||||
else:
|
||||
oov_forms.append(form)
|
||||
if not forms:
|
||||
forms.extend(oov_forms)
|
||||
if not forms:
|
||||
forms.append(string)
|
||||
return list(set(forms))
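Note: the lookup order in `lemmatize` (index hit, then exception table, then suffix rules, then fallback to the input) can be exercised on its own. The following is a minimal sketch that restates the function from this diff and runs it on toy data. Only the `έφθασε` exception comes from the tables in this commit; the `index` contents and the `("ιά", "ί")` rule are hypothetical values chosen for illustration, not entries from the real Greek rule files.

```python
# Standalone restatement of the rule logic shown in the diff above.
def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    # A word that is already a known lemma is returned unchanged.
    if string in index:
        return [string]
    forms = list(exceptions.get(string, []))
    oov_forms = []
    if not forms:
        for old, new in rules:
            if string.endswith(old):
                form = string[:len(string) - len(old)] + new
                if not form:
                    continue
                if form in index or not form.isalpha():
                    forms.append(form)
                else:
                    # Candidate is not in the index: keep it as a fallback.
                    oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return list(set(forms))


# Toy data (hypothetical, except the exception entry copied from the diff).
index = {"φτάνω", "παιδί"}
exceptions = {"έφθασε": ("φτάνω",)}
rules = [("ιά", "ί")]

print(lemmatize("παιδί", index, exceptions, rules))   # already a lemma
print(lemmatize("έφθασε", index, exceptions, rules))  # exception table hit
print(lemmatize("παιδιά", index, exceptions, rules))  # suffix rule applies
print(lemmatize("σπίτι", index, exceptions, rules))   # nothing applies
```

The early return on an index hit is the first Greek-specific modification described in the docstring: a known lemma never reaches the exception table or the suffix rules.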
|
|
@ -4,14 +4,20 @@ from __future__ import unicode_literals
|
|||
|
||||
from ...attrs import LIKE_NUM
|
||||
|
||||
_num_words = ['μηδέν', 'ένας', 'δυο', 'δυό', 'τρεις', 'τέσσερις', 'πέντε', 'έξι', 'εφτά', 'επτά', 'οκτώ', 'οχτώ',
|
||||
'εννιά', 'εννέα', 'δέκα', 'έντεκα', 'ένδεκα', 'δώδεκα', 'δεκατρείς', 'δεκατέσσερις', 'δεκαπέντε',
|
||||
'δεκαέξι', 'δεκαεπτά', 'δεκαοχτώ', 'δεκαεννέα', 'δεκαεννεα', 'είκοσι', 'τριάντα', 'σαράντα', 'πενήντα',
|
||||
'εξήντα', 'εβδομήντα', 'ογδόντα', 'ενενήντα', 'εκατό', 'διακόσιοι', 'διακόσοι', 'τριακόσιοι', 'τριακόσοι',
|
||||
'τετρακόσιοι', 'τετρακόσοι', 'πεντακόσιοι', 'πεντακόσοι', 'εξακόσιοι', 'εξακόσοι', 'εφτακόσιοι',
|
||||
'εφτακόσοι', 'επτακόσιοι', 'επτακόσοι', 'οχτακόσιοι', 'οχτακόσοι', 'οκτακόσιοι', 'οκτακόσοι',
|
||||
'εννιακόσιοι', 'χίλιοι', 'χιλιάδα', 'εκατομμύριο', 'δισεκατομμύριο', 'τρισεκατομμύριο', 'τετράκις',
|
||||
'πεντάκις', 'εξάκις', 'επτάκις', 'οκτάκις', 'εννεάκις', 'ένα', 'δύο', 'τρία', 'τέσσερα', 'δις', 'χιλιάδες']
|
||||
_num_words = ['μηδέν', 'ένας', 'δυο', 'δυό', 'τρεις', 'τέσσερις', 'πέντε',
|
||||
'έξι', 'εφτά', 'επτά', 'οκτώ', 'οχτώ',
|
||||
'εννιά', 'εννέα', 'δέκα', 'έντεκα', 'ένδεκα', 'δώδεκα',
|
||||
'δεκατρείς', 'δεκατέσσερις', 'δεκαπέντε', 'δεκαέξι', 'δεκαεπτά',
|
||||
'δεκαοχτώ', 'δεκαεννέα', 'δεκαεννεα', 'είκοσι', 'τριάντα',
|
||||
'σαράντα', 'πενήντα', 'εξήντα', 'εβδομήντα', 'ογδόντα',
|
||||
'ενενήντα', 'εκατό', 'διακόσιοι', 'διακόσοι', 'τριακόσιοι',
|
||||
'τριακόσοι', 'τετρακόσιοι', 'τετρακόσοι', 'πεντακόσιοι',
|
||||
'πεντακόσοι', 'εξακόσιοι', 'εξακόσοι', 'εφτακόσιοι', 'εφτακόσοι',
|
||||
'επτακόσιοι', 'επτακόσοι', 'οχτακόσιοι', 'οχτακόσοι',
|
||||
'οκτακόσιοι', 'οκτακόσοι', 'εννιακόσιοι', 'χίλιοι', 'χιλιάδα',
|
||||
'εκατομμύριο', 'δισεκατομμύριο', 'τρισεκατομμύριο', 'τετράκις',
|
||||
'πεντάκις', 'εξάκις', 'επτάκις', 'οκτάκις', 'εννεάκις', 'ένα',
|
||||
'δύο', 'τρία', 'τέσσερα', 'δις', 'χιλιάδες']
|
||||
|
||||
|
||||
def like_num(text):
|
||||
|
|
File diff suppressed because it is too large
|
@ -10,7 +10,11 @@ _units = ('km km² km³ m m² m³ dm dm² dm³ cm cm² cm³ mm mm² mm³ ha µm
|
|||
'kg g mg µg t lb oz m/s km/h kmh mph hPa Pa mbar mb MB kb KB gb GB tb '
|
||||
'TB T G M K км км² км³ м м² м³ дм дм² дм³ см см² см³ мм мм² мм³ нм '
|
||||
'кг г мг м/с км/ч кПа Па мбар Кб КБ кб Мб МБ мб Гб ГБ гб Тб ТБ тб')
|
||||
merge_chars = lambda char: char.strip().replace(' ', '|')
|
||||
|
||||
|
||||
def merge_chars(char): return char.strip().replace(' ', '|')
|
||||
|
||||
|
||||
UNITS = merge_chars(_units)
|
||||
|
||||
_prefixes = (['\'\'', '§', '%', '=', r'\+[0-9]+%', # 90%
|
||||
|
@ -42,7 +46,8 @@ _suffixes = (LIST_PUNCT + LIST_ELLIPSES + LIST_QUOTES + LIST_ICONS +
|
|||
r'(?<=[Α-Ωα-ωίϊΐόάέύϋΰήώ])\.',
|
||||
r'^[Α-Ω]{1}\.',
|
||||
r'\ [Α-Ω]{1}\.',
|
||||
r'[ΈΆΊΑΌ-Ωα-ωίϊΐόάέύϋΰήώ]+([\-]([ΈΆΊΑΌ-Ωα-ωίϊΐόάέύϋΰήώ]+))+', # πρώτος-δεύτερος , πρώτος-δεύτερος-τρίτος
|
||||
# πρώτος-δεύτερος , πρώτος-δεύτερος-τρίτος
|
||||
r'[ΈΆΊΑΌ-Ωα-ωίϊΐόάέύϋΰήώ]+([\-]([ΈΆΊΑΌ-Ωα-ωίϊΐόάέύϋΰήώ]+))+',
|
||||
r'([0-9]+)mg', # 13mg
|
||||
r'([0-9]+)\.([0-9]+)m' # 1.2m
|
||||
])
|
||||
|
@ -53,7 +58,8 @@ _infixes = (LIST_ELLIPSES + LIST_ICONS +
|
|||
r'([0-9])+(\.([0-9]+))*([\-]([0-9])+)+', # 10.9 , 10.9.9 , 10.9-6
|
||||
r'([0-9])+[,]([0-9])+[\-]([0-9])+[,]([0-9])+', # 10,11,12
|
||||
r'([0-9])+[ης]+([\-]([0-9])+)+', # 1ης-2
|
||||
r'([0-9]){1,4}[\/]([0-9]){1,2}([\/]([0-9]){0,4}){0,1}', # 15/2 , 15/2/17 , 2017/2/15
|
||||
# 15/2 , 15/2/17 , 2017/2/15
|
||||
r'([0-9]){1,4}[\/]([0-9]){1,2}([\/]([0-9]){0,4}){0,1}',
|
||||
r'[A-Za-z]+\@[A-Za-z]+(\-[A-Za-z]+)*\.[A-Za-z]+', # abc@cde-fgh.a
|
||||
r'([a-zA-Z]+)(\-([a-zA-Z]+))+', # abc-abc
|
||||
r'(?<=[{}])\.(?=[{}])'.format(ALPHA_LOWER, ALPHA_UPPER),
|
||||
|
|
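Note: two of the pieces touched in the punctuation hunks above can be checked in isolation with the standard `re` module. This is a sketch under assumptions: `sample_units` is a short hypothetical stand-in for the much longer `_units` string, and the `unit_suffix` pattern is an assumed way of consuming `UNITS`, not something shown in this diff. The date infix is copied verbatim from the hunk.

```python
import re


def merge_chars(char):
    # Same body as the function in the diff: "a b c" -> "a|b|c".
    return char.strip().replace(' ', '|')


# Short hypothetical sample; the real _units string is much longer.
sample_units = 'km m cm kg mg'
UNITS = merge_chars(sample_units)

# Assumed usage: match a unit only when it directly follows a digit.
unit_suffix = re.compile(r'(?<=[0-9])(?:{})'.format(UNITS))

# Date infix copied from the diff: 15/2 , 15/2/17 , 2017/2/15
date_infix = re.compile(
    r'([0-9]){1,4}[\/]([0-9]){1,2}([\/]([0-9]){0,4}){0,1}')

print(UNITS)
print(unit_suffix.search('10km').group())
for text in ('15/2', '15/2/17', '2017/2/15'):
    print(text, bool(date_infix.fullmatch(text)))
```

Replacing the lambda with a `def` does not change behavior; the function still turns the space-separated unit list into a regex alternation.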
61
spacy/lang/el/syntax_iterators.py
Normal file
|
@ -0,0 +1,61 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import NOUN, PROPN, PRON
|
||||
|
||||
|
||||
def noun_chunks(obj):
|
||||
"""
|
||||
Detect base noun phrases. Works on both Doc and Span.
|
||||
"""
|
||||
|
||||
# It follows the logic of the English noun chunks finder,
|
||||
# adjusted for some characteristics specific to Greek.
|
||||
|
||||
# The 'obj' label corrects some DEP tagger mistakes.
|
||||
# Further improvement of the models will eliminate the need for this tag.
|
||||
labels = ['nsubj', 'obj', 'iobj', 'appos', 'ROOT', 'obl']
|
||||
doc = obj.doc # Ensure works on both Doc and Span.
|
||||
np_deps = [doc.vocab.strings.add(label) for label in labels]
|
||||
conj = doc.vocab.strings.add('conj')
|
||||
nmod = doc.vocab.strings.add('nmod')
|
||||
np_label = doc.vocab.strings.add('NP')
|
||||
seen = set()
|
||||
for i, word in enumerate(obj):
|
||||
if word.pos not in (NOUN, PROPN, PRON):
|
||||
continue
|
||||
# Prevent nested chunks from being produced
|
||||
if word.i in seen:
|
||||
continue
|
||||
if word.dep in np_deps:
|
||||
if any(w.i in seen for w in word.subtree):
|
||||
continue
|
||||
flag = False
|
||||
if (word.pos == NOUN):
|
||||
# check for patterns such as γραμμή παραγωγής
|
||||
for potential_nmod in word.rights:
|
||||
if (potential_nmod.dep == nmod):
|
||||
seen.update(j for j in range(
|
||||
word.left_edge.i, potential_nmod.i + 1))
|
||||
yield word.left_edge.i, potential_nmod.i + 1, np_label
|
||||
flag = True
|
||||
break
|
||||
if (flag is False):
|
||||
seen.update(j for j in range(word.left_edge.i, word.i + 1))
|
||||
yield word.left_edge.i, word.i + 1, np_label
|
||||
elif word.dep == conj:
|
||||
# covers the case: έχει όμορφα και έξυπνα παιδιά
|
||||
head = word.head
|
||||
while head.dep == conj and head.head.i < head.i:
|
||||
head = head.head
|
||||
# If the head is an NP, and we're coordinated to it, we're an NP
|
||||
if head.dep in np_deps:
|
||||
if any(w.i in seen for w in word.subtree):
|
||||
continue
|
||||
seen.update(j for j in range(word.left_edge.i, word.i + 1))
|
||||
yield word.left_edge.i, word.i + 1, np_label
|
||||
|
||||
|
||||
SYNTAX_ITERATORS = {
|
||||
'noun_chunks': noun_chunks
|
||||
}
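Note: the `seen` bookkeeping and the `nmod` extension in `noun_chunks` can be illustrated without a trained model. The sketch below is a simplification, not the function from this diff: `Tok` is a hypothetical stand-in for a parsed token, the parse of the sample sentence is hand-written, and the `conj` branch is omitted.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed token: index, text, coarse POS,
# dependency label, index of the leftmost token in its subtree, indices of
# the whole subtree, and index of the first right child labelled nmod.
Tok = namedtuple("Tok", "i text pos dep left_edge subtree nmod")

NP_DEPS = ("nsubj", "obj", "iobj", "appos", "ROOT", "obl")


def toy_noun_chunks(tokens):
    seen = set()
    for word in tokens:
        if word.pos not in ("NOUN", "PROPN", "PRON"):
            continue
        # Prevent nested chunks from being produced.
        if word.i in seen:
            continue
        if word.dep in NP_DEPS:
            if any(j in seen for j in word.subtree):
                continue
            # For a NOUN with an nmod to its right (γραμμή παραγωγής),
            # extend the chunk to include that modifier.
            end = word.i
            if word.pos == "NOUN" and word.nmod is not None:
                end = word.nmod
            seen.update(range(word.left_edge, end + 1))
            yield word.left_edge, end + 1, "NP"


# Hand-written parse of "η γραμμή παραγωγής έχει πρόβλημα".
tokens = [
    Tok(0, "η", "DET", "det", 0, (0,), None),
    Tok(1, "γραμμή", "NOUN", "nsubj", 0, (0, 1, 2), 2),
    Tok(2, "παραγωγής", "NOUN", "nmod", 2, (2,), None),
    Tok(3, "έχει", "VERB", "ROOT", 0, (0, 1, 2, 3, 4), None),
    Tok(4, "πρόβλημα", "NOUN", "obj", 4, (4,), None),
]

for start, end, label in toy_noun_chunks(tokens):
    print(label, " ".join(t.text for t in tokens[start:end]))
```

Because the `nmod` token is added to `seen` together with its head, it is skipped when the loop reaches it, so it is never emitted as a separate chunk.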
|
|
@ -2,10 +2,10 @@
|
|||
|
||||
from __future__ import unicode_literals
|
||||
from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, SCONJ, NUM, DET, ADV, ADP, X, VERB
|
||||
from ...symbols import NOUN, PROPN, PART, INTJ,SPACE,PRON
|
||||
from ...symbols import NOUN, PROPN, PART, INTJ, PRON
|
||||
|
||||
TAG_MAP = {
|
||||
"ABBR": {POS: NOUN, "Abbr":"Yes"},
|
||||
"ABBR": {POS: NOUN, "Abbr": "Yes"},
|
||||
"AdXxBa": {POS: ADV, "Degree": ""},
|
||||
"AdXxCp": {POS: ADV, "Degree": "Cmp"},
|
||||
"AdXxSu": {POS: ADV, "Degree": "Sup"},
|
||||
|
@ -112,38 +112,38 @@ TAG_MAP = {
|
|||
"AsPpPaNeSgAc": {POS: ADP, "Gender": "Neut", "Number": "Sing", "Case": "Acc"},
|
||||
"AsPpPaNeSgGe": {POS: ADP, "Gender": "Neut", "Number": "Sing", "Case": "Gen"},
|
||||
"AsPpSp": {POS: ADP},
|
||||
"AtDfFePlAc": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Plur", "Case": "Acc", "Other":{"Definite": "Def"}},
|
||||
"AtDfFePlGe": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Plur", "Case": "Gen", "Other":{"Definite": "Def"}},
|
||||
"AtDfFePlNm": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Plur", "Case": "Nom", "Other":{"Definite": "Def"}},
|
||||
"AtDfFeSgAc": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Sing", "Case": "Acc", "Other":{"Definite": "Def"}},
|
||||
"AtDfFeSgDa": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Sing", "Case": "Dat", "Other":{"Definite": "Def"}},
|
||||
"AtDfFeSgGe": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Sing", "Case": "Gen", "Other":{"Definite": "Def"}},
|
||||
"AtDfFeSgNm": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Sing", "Case": "Nom", "Other":{"Definite": "Def"}},
|
||||
"AtDfMaPlAc": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Plur", "Case": "Acc", "Other":{"Definite": "Def"}},
|
||||
"AtDfMaPlGe": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Plur", "Case": "Gen", "Other":{"Definite": "Def"}},
|
||||
"AtDfMaPlNm": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Plur", "Case": "Nom", "Other":{"Definite": "Def"}},
|
||||
"AtDfMaSgAc": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Sing", "Case": "Acc", "Other":{"Definite": "Def"}},
|
||||
"AtDfMaSgDa": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Sing", "Case": "Dat", "Other":{"Definite": "Def"}},
|
||||
"AtDfMaSgGe": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Sing", "Case": "Gen", "Other":{"Definite": "Def"}},
|
||||
"AtDfMaSgNm": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Sing", "Case": "Nom", "Other":{"Definite": "Def"}},
|
||||
"AtDfNePlAc": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Plur", "Case": "Acc", "Other":{"Definite": "Def"}},
|
||||
"AtDfNePlDa": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Plur", "Case": "Dat", "Other":{"Definite": "Def"}},
|
||||
"AtDfNePlGe": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Plur", "Case": "Gen", "Other":{"Definite": "Def"}},
|
||||
"AtDfNePlNm": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Plur", "Case": "Nom", "Other":{"Definite": "Def"}},
|
||||
"AtDfNeSgAc": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Sing", "Case": "Acc", "Other":{"Definite": "Def"}},
|
||||
"AtDfNeSgDa": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Sing", "Case": "Dat", "Other":{"Definite": "Def"}},
|
||||
"AtDfNeSgGe": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Sing", "Case": "Gen", "Other":{"Definite": "Def"}},
|
||||
"AtDfNeSgNm": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Sing", "Case": "Nom", "Other":{"Definite": "Def"}},
|
||||
"AtIdFeSgAc": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Sing", "Case": "Acc", "Other":{"Definite": "Ind"}},
|
||||
"AtIdFeSgDa": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Sing", "Case": "Dat", "Other":{"Definite": "Ind"}},
|
||||
"AtIdFeSgGe": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Sing", "Case": "Gen", "Other":{"Definite": "Ind"}},
|
||||
"AtIdFeSgNm": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Sing", "Case": "Nom", "Other":{"Definite": "Ind"}},
|
||||
"AtIdMaSgAc": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Sing", "Case": "Acc", "Other":{"Definite": "Ind"}},
|
||||
"AtIdMaSgGe": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Sing", "Case": "Gen", "Other":{"Definite": "Ind"}},
|
||||
"AtIdMaSgNm": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Sing", "Case": "Nom", "Other":{"Definite": "Ind"}},
|
||||
"AtIdNeSgAc": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Sing", "Case": "Acc", "Other":{"Definite": "Ind"}},
|
||||
"AtIdNeSgGe": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Sing", "Case": "Gen", "Other":{"Definite": "Ind"}},
|
||||
"AtIdNeSgNm": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Sing", "Case": "Nom", "Other":{"Definite": "Ind"}},
|
||||
"AtDfFePlAc": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Plur", "Case": "Acc", "Other": {"Definite": "Def"}},
|
||||
"AtDfFePlGe": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Plur", "Case": "Gen", "Other": {"Definite": "Def"}},
|
||||
"AtDfFePlNm": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Plur", "Case": "Nom", "Other": {"Definite": "Def"}},
|
||||
"AtDfFeSgAc": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Sing", "Case": "Acc", "Other": {"Definite": "Def"}},
|
||||
"AtDfFeSgDa": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Sing", "Case": "Dat", "Other": {"Definite": "Def"}},
|
||||
"AtDfFeSgGe": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Sing", "Case": "Gen", "Other": {"Definite": "Def"}},
|
||||
"AtDfFeSgNm": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Sing", "Case": "Nom", "Other": {"Definite": "Def"}},
|
||||
"AtDfMaPlAc": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Plur", "Case": "Acc", "Other": {"Definite": "Def"}},
|
||||
"AtDfMaPlGe": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Plur", "Case": "Gen", "Other": {"Definite": "Def"}},
|
||||
"AtDfMaPlNm": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Plur", "Case": "Nom", "Other": {"Definite": "Def"}},
|
||||
"AtDfMaSgAc": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Sing", "Case": "Acc", "Other": {"Definite": "Def"}},
|
||||
"AtDfMaSgDa": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Sing", "Case": "Dat", "Other": {"Definite": "Def"}},
|
||||
"AtDfMaSgGe": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Sing", "Case": "Gen", "Other": {"Definite": "Def"}},
|
||||
"AtDfMaSgNm": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Sing", "Case": "Nom", "Other": {"Definite": "Def"}},
|
||||
"AtDfNePlAc": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Plur", "Case": "Acc", "Other": {"Definite": "Def"}},
|
||||
"AtDfNePlDa": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Plur", "Case": "Dat", "Other": {"Definite": "Def"}},
|
||||
"AtDfNePlGe": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Plur", "Case": "Gen", "Other": {"Definite": "Def"}},
|
||||
"AtDfNePlNm": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Plur", "Case": "Nom", "Other": {"Definite": "Def"}},
|
||||
"AtDfNeSgAc": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Sing", "Case": "Acc", "Other": {"Definite": "Def"}},
|
||||
"AtDfNeSgDa": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Sing", "Case": "Dat", "Other": {"Definite": "Def"}},
|
||||
"AtDfNeSgGe": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Sing", "Case": "Gen", "Other": {"Definite": "Def"}},
|
||||
"AtDfNeSgNm": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Sing", "Case": "Nom", "Other": {"Definite": "Def"}},
|
||||
"AtIdFeSgAc": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Sing", "Case": "Acc", "Other": {"Definite": "Ind"}},
|
||||
"AtIdFeSgDa": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Sing", "Case": "Dat", "Other": {"Definite": "Ind"}},
|
||||
"AtIdFeSgGe": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Sing", "Case": "Gen", "Other": {"Definite": "Ind"}},
|
||||
"AtIdFeSgNm": {POS: DET, "PronType": "Art", "Gender": "Fem", "Number": "Sing", "Case": "Nom", "Other": {"Definite": "Ind"}},
|
||||
"AtIdMaSgAc": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Sing", "Case": "Acc", "Other": {"Definite": "Ind"}},
|
||||
"AtIdMaSgGe": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Sing", "Case": "Gen", "Other": {"Definite": "Ind"}},
|
||||
"AtIdMaSgNm": {POS: DET, "PronType": "Art", "Gender": "Masc", "Number": "Sing", "Case": "Nom", "Other": {"Definite": "Ind"}},
|
||||
"AtIdNeSgAc": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Sing", "Case": "Acc", "Other": {"Definite": "Ind"}},
|
||||
"AtIdNeSgGe": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Sing", "Case": "Gen", "Other": {"Definite": "Ind"}},
|
||||
"AtIdNeSgNm": {POS: DET, "PronType": "Art", "Gender": "Neut", "Number": "Sing", "Case": "Nom", "Other": {"Definite": "Ind"}},
|
||||
"CjCo": {POS: CCONJ},
|
||||
"CjSb": {POS: SCONJ},
|
||||
"CPUNCT": {POS: PUNCT},
|
||||
|
@ -152,7 +152,7 @@ TAG_MAP = {
|
|||
"ENUM": {POS: NUM},
|
||||
"Ij": {POS: INTJ},
|
||||
"INIT": {POS: SYM},
|
||||
"NBABBR": {POS: NOUN, "Abbr":"Yes"},
|
||||
"NBABBR": {POS: NOUN, "Abbr": "Yes"},
|
||||
"NmAnFePlAcAj": {POS: NUM, "NumType": "Mult", "Gender": "Fem", "Number": "Plur", "Case": "Acc"},
|
||||
"NmAnFePlGeAj": {POS: NUM, "NumType": "Mult", "Gender": "Fem", "Number": "Plur", "Case": "Gen"},
|
||||
"NmAnFePlNmAj": {POS: NUM, "NumType": "Mult", "Gender": "Fem", "Number": "Plur", "Case": "Nom"},
|
||||
|
@ -529,71 +529,70 @@ TAG_MAP = {
|
|||
"VbMnIdPa03PlXxIpAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Past", "Person": "3", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPa03PlXxIpPvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Past", "Person": "3", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPa03PlXxPeAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Past", "Person": "3", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPa03PlXxPePvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Past", "Person": "3", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPa03SgXxIpAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Past", "Person": "3", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPa03SgXxIpPvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Past", "Person": "3", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPa03SgXxPeAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Past", "Person": "3", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPa03SgXxPePvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Past", "Person": "3", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr01PlXxIpAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "1", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr01PlXxIpPvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "1", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr01SgXxIpAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "1", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr01SgXxIpPvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "1", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr02PlXxIpAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "2", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr02PlXxIpPvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "2", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr02SgXxIpAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "2", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr02SgXxIpPvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "2", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr03PlXxIpAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "3", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr03PlXxIpPvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "3", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr03SgXxIpAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "3", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr03SgXxIpPvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "3", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx01PlXxPeAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "1", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx01PlXxPePvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "1", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx01SgXxPeAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "1", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx01SgXxPePvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "1", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx02PlXxPeAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "2", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx02PlXxPePvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "2", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx02SgXxPeAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "2", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx02SgXxPePvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "2", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx03PlXxPeAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "3", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx03PlXxPePvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "3", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx03SgXxPeAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "3", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx03SgXxPePvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "3", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnMpXx02PlXxIpAvXx": {POS: VERB, "VerbForm": "", "Mood": "Imp", "Tense": "Pres|Past", "Person": "2", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnMpXx02PlXxIpPvXx": {POS: VERB, "VerbForm": "", "Mood": "Imp", "Tense": "Pres|Past", "Person": "2", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnMpXx02PlXxPeAvXx": {POS: VERB, "VerbForm": "", "Mood": "Imp", "Tense": "Pres|Past", "Person": "2", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnMpXx02PlXxPePvXx": {POS: VERB, "VerbForm": "", "Mood": "Imp", "Tense": "Pres|Past", "Person": "2", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnMpXx02SgXxIpAvXx": {POS: VERB, "VerbForm": "", "Mood": "Imp", "Tense": "Pres|Past", "Person": "2", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnMpXx02SgXxIpPvXx": {POS: VERB, "VerbForm": "", "Mood": "Imp", "Tense": "Pres|Past", "Person": "2", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnMpXx02SgXxPeAvXx": {POS: VERB, "VerbForm": "", "Mood": "Imp", "Tense": "Pres|Past", "Person": "2", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnMpXx02SgXxPePvXx": {POS: VERB, "VerbForm": "", "Mood": "Imp", "Tense": "Pres|Past", "Person": "2", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnMpXx03SgXxIpPvXx": {POS: VERB, "VerbForm": "", "Mood": "Imp", "Tense": "Pres|Past", "Person": "3", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnNfXxXxXxXxPeAvXx": {POS: VERB, "VerbForm": "Inf", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing|Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnNfXxXxXxXxPePvXx": {POS: VERB, "VerbForm": "Inf", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing|Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf" , "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnPpPrXxXxXxIpAvXx": {POS: VERB, "VerbForm": "Conv", "Mood": "", "Tense": "Pres", "Person": "1|2|3", "Number": "Sing|Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnPpXxXxPlFePePvAc": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Fem", "Aspect": "Perf" , "Voice": "Pass", "Case": "Acc"},
|
||||
"VbMnPpXxXxPlFePePvGe": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Fem", "Aspect": "Perf" , "Voice": "Pass", "Case": "Gen"},
|
||||
"VbMnPpXxXxPlFePePvNm": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Fem", "Aspect": "Perf" , "Voice": "Pass", "Case": "Nom"},
|
||||
"VbMnPpXxXxPlFePePvVo": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Fem", "Aspect": "Perf" , "Voice": "Pass", "Case": "Voc"},
|
||||
"VbMnPpXxXxPlMaPePvAc": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Masc", "Aspect": "Perf" , "Voice": "Pass", "Case": "Acc"},
|
||||
"VbMnPpXxXxPlMaPePvGe": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Masc", "Aspect": "Perf" , "Voice": "Pass", "Case": "Gen"},
|
||||
"VbMnPpXxXxPlMaPePvNm": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Masc", "Aspect": "Perf" , "Voice": "Pass", "Case": "Nom"},
|
||||
"VbMnPpXxXxPlMaPePvVo": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Masc", "Aspect": "Perf" , "Voice": "Pass", "Case": "Voc"},
|
||||
"VbMnPpXxXxPlNePePvAc": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Neut", "Aspect": "Perf" , "Voice": "Pass", "Case": "Acc"},
|
||||
"VbMnPpXxXxPlNePePvGe": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Neut", "Aspect": "Perf" , "Voice": "Pass", "Case": "Gen"},
|
||||
"VbMnPpXxXxPlNePePvNm": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Neut", "Aspect": "Perf" , "Voice": "Pass", "Case": "Nom"},
|
||||
"VbMnPpXxXxPlNePePvVo": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Neut", "Aspect": "Perf" , "Voice": "Pass", "Case": "Voc"},
|
||||
"VbMnPpXxXxSgFePePvAc": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Fem", "Aspect": "Perf" , "Voice": "Pass", "Case": "Acc"},
|
||||
"VbMnPpXxXxSgFePePvGe": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Fem", "Aspect": "Perf" , "Voice": "Pass", "Case": "Gen"},
|
||||
"VbMnPpXxXxSgFePePvNm": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Fem", "Aspect": "Perf" , "Voice": "Pass", "Case": "Nom"},
|
||||
"VbMnPpXxXxSgFePePvVo": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Fem", "Aspect": "Perf" , "Voice": "Pass", "Case": "Voc"},
|
||||
"VbMnPpXxXxSgMaPePvAc": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Masc", "Aspect": "Perf" , "Voice": "Pass", "Case": "Acc"},
|
||||
"VbMnPpXxXxSgMaPePvGe": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Masc", "Aspect": "Perf" , "Voice": "Pass", "Case": "Gen"},
|
||||
"VbMnPpXxXxSgMaPePvNm": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Masc", "Aspect": "Perf" , "Voice": "Pass", "Case": "Nom"},
|
||||
"VbMnPpXxXxSgMaPePvVo": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Masc", "Aspect": "Perf" , "Voice": "Pass", "Case": "Voc"},
|
||||
"VbMnPpXxXxSgNePePvAc": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Neut", "Aspect": "Perf" , "Voice": "Pass", "Case": "Acc"},
|
||||
"VbMnPpXxXxSgNePePvGe": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Neut", "Aspect": "Perf" , "Voice": "Pass", "Case": "Gen"},
|
||||
"VbMnPpXxXxSgNePePvNm": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Neut", "Aspect": "Perf" , "Voice": "Pass", "Case": "Nom"},
|
||||
"VbMnPpXxXxSgNePePvVo": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Neut", "Aspect": "Perf" , "Voice": "Pass", "Case": "Voc"},
|
||||
"VbMnPpXxXxXxXxIpAvXx": {POS: VERB, "VerbForm": "Conv", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing|Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp" , "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"}
|
||||
"VbMnIdPa03PlXxPePvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Past", "Person": "3", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPa03SgXxIpAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Past", "Person": "3", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPa03SgXxIpPvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Past", "Person": "3", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPa03SgXxPeAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Past", "Person": "3", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPa03SgXxPePvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Past", "Person": "3", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr01PlXxIpAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "1", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr01PlXxIpPvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "1", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr01SgXxIpAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "1", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr01SgXxIpPvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "1", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr02PlXxIpAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "2", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr02PlXxIpPvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "2", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr02SgXxIpAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "2", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr02SgXxIpPvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "2", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr03PlXxIpAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "3", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr03PlXxIpPvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "3", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr03SgXxIpAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "3", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdPr03SgXxIpPvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Person": "3", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx01PlXxPeAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "1", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx01PlXxPePvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "1", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx01SgXxPeAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "1", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx01SgXxPePvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "1", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx02PlXxPeAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "2", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx02PlXxPePvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "2", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx02SgXxPeAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "2", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx02SgXxPePvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "2", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx03PlXxPeAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "3", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx03PlXxPePvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "3", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx03SgXxPeAvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "3", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnIdXx03SgXxPePvXx": {POS: VERB, "VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres|Past", "Person": "3", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnMpXx02PlXxIpAvXx": {POS: VERB, "VerbForm": "", "Mood": "Imp", "Tense": "Pres|Past", "Person": "2", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnMpXx02PlXxIpPvXx": {POS: VERB, "VerbForm": "", "Mood": "Imp", "Tense": "Pres|Past", "Person": "2", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnMpXx02PlXxPeAvXx": {POS: VERB, "VerbForm": "", "Mood": "Imp", "Tense": "Pres|Past", "Person": "2", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnMpXx02PlXxPePvXx": {POS: VERB, "VerbForm": "", "Mood": "Imp", "Tense": "Pres|Past", "Person": "2", "Number": "Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnMpXx02SgXxIpAvXx": {POS: VERB, "VerbForm": "", "Mood": "Imp", "Tense": "Pres|Past", "Person": "2", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnMpXx02SgXxIpPvXx": {POS: VERB, "VerbForm": "", "Mood": "Imp", "Tense": "Pres|Past", "Person": "2", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnMpXx02SgXxPeAvXx": {POS: VERB, "VerbForm": "", "Mood": "Imp", "Tense": "Pres|Past", "Person": "2", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnMpXx02SgXxPePvXx": {POS: VERB, "VerbForm": "", "Mood": "Imp", "Tense": "Pres|Past", "Person": "2", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnMpXx03SgXxIpPvXx": {POS: VERB, "VerbForm": "", "Mood": "Imp", "Tense": "Pres|Past", "Person": "3", "Number": "Sing", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnNfXxXxXxXxPeAvXx": {POS: VERB, "VerbForm": "Inf", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing|Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnNfXxXxXxXxPePvXx": {POS: VERB, "VerbForm": "Inf", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing|Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Perf", "Voice": "Pass", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnPpPrXxXxXxIpAvXx": {POS: VERB, "VerbForm": "Conv", "Mood": "", "Tense": "Pres", "Person": "1|2|3", "Number": "Sing|Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"},
|
||||
"VbMnPpXxXxPlFePePvAc": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Fem", "Aspect": "Perf", "Voice": "Pass", "Case": "Acc"},
|
||||
"VbMnPpXxXxPlFePePvGe": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Fem", "Aspect": "Perf", "Voice": "Pass", "Case": "Gen"},
|
||||
"VbMnPpXxXxPlFePePvNm": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Fem", "Aspect": "Perf", "Voice": "Pass", "Case": "Nom"},
|
||||
"VbMnPpXxXxPlFePePvVo": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Fem", "Aspect": "Perf", "Voice": "Pass", "Case": "Voc"},
|
||||
"VbMnPpXxXxPlMaPePvAc": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Masc", "Aspect": "Perf", "Voice": "Pass", "Case": "Acc"},
|
||||
"VbMnPpXxXxPlMaPePvGe": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Masc", "Aspect": "Perf", "Voice": "Pass", "Case": "Gen"},
|
||||
"VbMnPpXxXxPlMaPePvNm": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Masc", "Aspect": "Perf", "Voice": "Pass", "Case": "Nom"},
|
||||
"VbMnPpXxXxPlMaPePvVo": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Masc", "Aspect": "Perf", "Voice": "Pass", "Case": "Voc"},
|
||||
"VbMnPpXxXxPlNePePvAc": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Neut", "Aspect": "Perf", "Voice": "Pass", "Case": "Acc"},
|
||||
"VbMnPpXxXxPlNePePvGe": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Neut", "Aspect": "Perf", "Voice": "Pass", "Case": "Gen"},
|
||||
"VbMnPpXxXxPlNePePvNm": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Neut", "Aspect": "Perf", "Voice": "Pass", "Case": "Nom"},
|
||||
"VbMnPpXxXxPlNePePvVo": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Plur", "Gender": "Neut", "Aspect": "Perf", "Voice": "Pass", "Case": "Voc"},
|
||||
"VbMnPpXxXxSgFePePvAc": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Fem", "Aspect": "Perf", "Voice": "Pass", "Case": "Acc"},
|
||||
"VbMnPpXxXxSgFePePvGe": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Fem", "Aspect": "Perf", "Voice": "Pass", "Case": "Gen"},
|
||||
"VbMnPpXxXxSgFePePvNm": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Fem", "Aspect": "Perf", "Voice": "Pass", "Case": "Nom"},
|
||||
"VbMnPpXxXxSgFePePvVo": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Fem", "Aspect": "Perf", "Voice": "Pass", "Case": "Voc"},
|
||||
"VbMnPpXxXxSgMaPePvAc": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Masc", "Aspect": "Perf", "Voice": "Pass", "Case": "Acc"},
|
||||
"VbMnPpXxXxSgMaPePvGe": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Masc", "Aspect": "Perf", "Voice": "Pass", "Case": "Gen"},
|
||||
"VbMnPpXxXxSgMaPePvNm": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Masc", "Aspect": "Perf", "Voice": "Pass", "Case": "Nom"},
|
||||
"VbMnPpXxXxSgMaPePvVo": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Masc", "Aspect": "Perf", "Voice": "Pass", "Case": "Voc"},
|
||||
"VbMnPpXxXxSgNePePvAc": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Neut", "Aspect": "Perf", "Voice": "Pass", "Case": "Acc"},
|
||||
"VbMnPpXxXxSgNePePvGe": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Neut", "Aspect": "Perf", "Voice": "Pass", "Case": "Gen"},
|
||||
"VbMnPpXxXxSgNePePvNm": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Neut", "Aspect": "Perf", "Voice": "Pass", "Case": "Nom"},
|
||||
"VbMnPpXxXxSgNePePvVo": {POS: VERB, "VerbForm": "Part", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing", "Gender": "Neut", "Aspect": "Perf", "Voice": "Pass", "Case": "Voc"},
|
||||
"VbMnPpXxXxXxXxIpAvXx": {POS: VERB, "VerbForm": "Conv", "Mood": "", "Tense": "Pres|Past", "Person": "1|2|3", "Number": "Sing|Plur", "Gender": "Masc|Fem|Neut", "Aspect": "Imp", "Voice": "Act", "Case": "Nom|Gen|Dat|Acc|Voc"}
|
||||
}
|
||||
|
||||
|
|
|
@ -1,27 +1,26 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
|
||||
from ...symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ
|
||||
|
||||
from ...symbols import PUNCT, NUM, AUX, X, ADJ, VERB, PART, SPACE, CCONJ
|
||||
|
||||
|
||||
TAG_MAP = {
|
||||
"ADJ": {POS: ADJ},
|
||||
"ADV": {POS: ADV},
|
||||
"INTJ": {POS: INTJ},
|
||||
"NOUN": {POS: NOUN},
|
||||
"PROPN": {POS: PROPN},
|
||||
"VERB": {POS: VERB},
|
||||
"ADP": {POS: ADP},
|
||||
"CCONJ": {POS: CCONJ},
|
||||
"SCONJ": {POS: SCONJ},
|
||||
"PART": {POS: PART},
|
||||
"PUNCT": {POS: PUNCT},
|
||||
"SYM": {POS: SYM},
|
||||
"NUM": {POS: NUM},
|
||||
"PRON": {POS: PRON},
|
||||
"AUX": {POS: AUX},
|
||||
"SPACE": {POS: SPACE},
|
||||
"DET": {POS: DET},
|
||||
"X" : {POS: X}
|
||||
"ADJ": {POS: ADJ},
|
||||
"ADV": {POS: ADV},
|
||||
"INTJ": {POS: INTJ},
|
||||
"NOUN": {POS: NOUN},
|
||||
"PROPN": {POS: PROPN},
|
||||
"VERB": {POS: VERB},
|
||||
"ADP": {POS: ADP},
|
||||
"CCONJ": {POS: CCONJ},
|
||||
"SCONJ": {POS: SCONJ},
|
||||
"PART": {POS: PART},
|
||||
"PUNCT": {POS: PUNCT},
|
||||
"SYM": {POS: SYM},
|
||||
"NUM": {POS: NUM},
|
||||
"PRON": {POS: PRON},
|
||||
"AUX": {POS: AUX},
|
||||
"SPACE": {POS: SPACE},
|
||||
"DET": {POS: DET},
|
||||
"X": {POS: X}
|
||||
}
|
||||
|
|
|
@ -2,7 +2,7 @@
|
|||
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import ORTH, LEMMA, TAG, NORM, ADP, DET
|
||||
from ...symbols import ORTH, LEMMA, NORM
|
||||
|
||||
_exc = {}
|
||||
|
||||
|
|
|
@ -44,7 +44,7 @@ lors lorsque lui lui-meme lui-même là lès
|
|||
|
||||
m' m’ ma maint maintenant mais malgre malgré maximale me meme memes merci mes mien
|
||||
mienne miennes miens mille mince minimale moi moi-meme moi-même moindres moins
|
||||
mon moyennant multiple multiples même mêmes
|
||||
mon moyennant même mêmes
|
||||
|
||||
n' n’ na naturel naturelle naturelles ne neanmoins necessaire necessairement neuf
|
||||
neuvième ni nombreuses nombreux non nos notamment notre nous nous-mêmes nouveau
|
||||
|
|
|
@ -3,9 +3,9 @@ from __future__ import unicode_literals
|
|||
|
||||
from ..norm_exceptions import BASE_NORMS
|
||||
from ...attrs import NORM
|
||||
from ...attrs import LIKE_NUM
|
||||
from ...util import add_lookups
|
||||
|
||||
|
||||
_stem_suffixes = [
|
||||
["ो","े","ू","ु","ी","ि","ा"],
|
||||
["कर","ाओ","िए","ाई","ाए","ने","नी","ना","ते","ीं","ती","ता","ाँ","ां","ों","ें"],
|
||||
|
@ -14,6 +14,13 @@ _stem_suffixes = [
|
|||
["ाएंगी","ाएंगे","ाऊंगी","ाऊंगा","ाइयाँ","ाइयों","ाइयां"]
|
||||
]
|
||||
|
||||
# reference 1: https://en.wikipedia.org/wiki/Indian_numbering_system
|
||||
# reference 2: https://blogs.transparent.com/hindi/hindi-numbers-1-100/
|
||||
|
||||
_num_words = ['शून्य', 'एक', 'दो', 'तीन', 'चार', 'पांच', 'छह', 'सात', 'आठ', 'नौ', 'दस',
|
||||
'ग्यारह', 'बारह', 'तेरह', 'चौदह', 'पंद्रह', 'सोलह', 'सत्रह', 'अठारह', 'उन्नीस',
|
||||
'बीस', 'तीस', 'चालीस', 'पचास', 'साठ', 'सत्तर', 'अस्सी', 'नब्बे', 'सौ', 'हज़ार',
|
||||
'लाख', 'करोड़', 'अरब', 'खरब']
|
||||
|
||||
def norm(string):
|
||||
# normalise base exceptions, e.g. punctuation or currency symbols
|
||||
|
@ -32,7 +39,20 @@ def norm(string):
|
|||
return string[:-length]
|
||||
return string
|
||||
|
||||
def like_num(text):
|
||||
text = text.replace(',', '').replace('.', '')
|
||||
if text.isdigit():
|
||||
return True
|
||||
if text.count('/') == 1:
|
||||
num, denom = text.split('/')
|
||||
if num.isdigit() and denom.isdigit():
|
||||
return True
|
||||
if text.lower() in _num_words:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
LEX_ATTRS = {
|
||||
NORM: norm,
|
||||
LIKE_NUM: like_num
|
||||
}
|
||||
|
|
|
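The `like_num` check added in the hunk above can be tried on its own. The following is a minimal standalone sketch, assuming Python 3 and a shortened, illustrative `NUM_WORDS` list in place of the full `_num_words` list; it is not the module itself.

```python
# Standalone sketch of the like_num heuristic; NUM_WORDS is a shortened,
# illustrative subset of the _num_words list shown in the diff.
NUM_WORDS = ['शून्य', 'एक', 'दो', 'तीन', 'दस', 'सौ', 'लाख']


def like_num(text):
    # Ignore thousands separators and decimal points before testing digits.
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    # Accept simple fractions such as "3/4".
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    # Fall back to a lookup of spelled-out number words.
    return text.lower() in NUM_WORDS


for token in ['1,000', '3/4', 'दस', 'नमस्ते']:
    print(token, like_num(token))
```

Strings with more than one slash, such as `1/2/3`, are rejected because the fraction branch only runs when exactly one `/` is present.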
@ -10,7 +10,7 @@ Example sentences to test spaCy and its language models.
|
|||
"""
|
||||
|
||||
|
||||
examples = [
|
||||
sentences = [
|
||||
"Apple overweegt om voor 1 miljard een U.K. startup te kopen",
|
||||
"Autonome auto's verschuiven de verzekeringverantwoordelijkheid naar producenten",
|
||||
"San Francisco overweegt robots op voetpaden te verbieden",
|
||||
|
|
|
@ -11,8 +11,41 @@ Example sentences to test spaCy and its language models.
|
|||
|
||||
|
||||
sentences = [
|
||||
#Translations from English:
|
||||
"Apple рассматривает возможность покупки стартапа из Соединенного Королевства за $1 млрд",
|
||||
"Автономные автомобили переносят страховую ответственность на производителя",
|
||||
"В Сан Франциско рассматривается возможность запрета роботов-курьеров, которые перемещаются по тротуару",
|
||||
"Лондон - большой город Соединенного Королевства"
|
||||
"Лондон - большой город Соединенного Королевства",
|
||||
|
||||
#Native Russian sentences:
|
||||
|
||||
#Colloquial:
|
||||
"Да, нет, наверное!",#Typical polite refusal
|
||||
"Обратите внимание на необыкновенную красоту этого города-героя Москвы, столицы нашей Родины!",#From a tour guide speech
|
||||
|
||||
#Examples of Bookish Russian:
|
||||
"Рио-де-Жанейро — эта моя мечта и не смейте касаться ее своими грязными лапами!",#Quote from "The Golden Calf"
|
||||
|
||||
#Quotes from "Ivan Vasilievich Changes His Occupation", a famous Russian comedy film
|
||||
"Ты пошто боярыню обидел, смерд?!!",
|
||||
"Оставь меня, старушка, я в печали!",
|
||||
|
||||
#Quotes from Dostoevsky:
|
||||
"Уж коли я, такой же, как и ты, человек грешный, над тобой умилился и пожалел тебя, кольми паче бог.",
|
||||
"В мечтах я нередко, говорит, доходил до страстных помыслов о служении человечеству и может быть действительно пошел бы на крест за людей, если б это вдруг как-нибудь потребовалось, а между тем я двух дней не в состоянии прожить ни с кем в одной комнате, о чем знаю из опыта.",
|
||||
"Зато всегда так происходило, что чем более я ненавидел людей в частности, тем пламеннее становилась любовь моя к человечеству вообще.",
|
||||
|
||||
#Quotes from Chekhov:
|
||||
"Ненужные дела и разговоры все об одном отхватывают на свою долю лучшую часть времени, лучшие силы, и в конце концов остается какая-то куцая, бескрылая жизнь, какая-то чепуха, и уйти и бежать нельзя, точно сидишь в сумасшедшем доме или в арестантских ротах!",
|
||||
|
||||
#Quotes from Turgenev:
|
||||
"Нравится тебе женщина, старайся добиться толку; а нельзя - ну, не надо, отвернись - земля не клином сошлась.",
|
||||
"Узенькое местечко, которое я занимаю, до того крохотно в сравнении с остальным пространством, где меня нет и где дела до меня нет; и часть времени, которую мне удастся прожить, так ничтожна перед вечностью, где меня не было и не будет...",
|
||||
|
||||
#Quotes from newspapers:
|
||||
#Komsomolskaya Pravda:
|
||||
"На заседании президиума правительства Москвы принято решение присвоить статус инвестиционного приоритетного проекта города Москвы киностудии Союзмультфильм",
|
||||
"Глава Минобороны Сергей Шойгу заявил, что обстановка на этом стратегическом направлении требует непрерывного совершенствования боевого состава войск.",
|
||||
#Argumenty i Fakty:
|
||||
"На реплику лже-Говина — дескать, он (Волков) будет лучшим революционером — Стамп с энтузиазмом ответил: Непременно!",
|
||||
]
|
||||
|
|
|
@ -5,15 +5,27 @@ from __future__ import unicode_literals
|
|||
_exc = {
|
||||
# Slang
|
||||
'прив': 'привет',
|
||||
'дарова': 'привет',
|
||||
'дак': 'так',
|
||||
'дык': 'так',
|
||||
'здарова': 'привет',
|
||||
'пакедава': 'пока',
|
||||
'пакедаво': 'пока',
|
||||
'ща': 'сейчас',
|
||||
'спс': 'спасибо',
|
||||
'пжлст': 'пожалуйста',
|
||||
'плиз': 'пожалуйста',
|
||||
'ладненько': 'ладно',
|
||||
'лады': 'ладно',
|
||||
'лан': 'ладно',
|
||||
'ясн': 'ясно',
|
||||
'всм': 'всмысле',
|
||||
'хош': 'хочешь',
|
||||
'оч': 'очень'
|
||||
'хаюшки': 'привет',
|
||||
'оч': 'очень',
|
||||
'че': 'что',
|
||||
'чо': 'что',
|
||||
'шо': 'что'
|
||||
}
|
||||
|
||||
|
||||
|
|
|
@ -1,8 +1,7 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA
|
||||
|
||||
from ...symbols import LEMMA, NORM, ORTH, PRON_LEMMA, PUNCT, TAG
|
||||
|
||||
_exc = {}
|
||||
|
||||
|
@ -70,13 +69,25 @@ for exc_data in [
|
|||
_exc[exc_data[ORTH]] = [exc_data]
|
||||
|
||||
|
||||
for orth in [
|
||||
"ang.", "anm.", "bil.", "bl.a.", "dvs.", "e.Kr.", "el.", "e.d.", "eng.",
|
||||
"etc.", "exkl.", "f.d.", "fid.", "f.Kr.", "forts.", "fr.o.m.", "f.ö.",
|
||||
"förf.", "inkl.", "jur.", "kl.", "kr.", "lat.", "m.a.o.", "max.", "m.fl.",
|
||||
"min.", "m.m.", "obs.", "o.d.", "osv.", "p.g.a.", "ref.", "resp.", "s.a.s.",
|
||||
"s.k.", "st.", "s:t", "t.ex.", "t.o.m.", "ung.", "äv.", "övers."]:
|
||||
ABBREVIATIONS = [
|
||||
"ang", "anm", "bil", "bl.a", "d.v.s", "doc", "dvs", "e.d", "e.kr", "el",
|
||||
"eng", "etc", "exkl", "f", "f.d", "f.kr", "f.n", "f.ö", "fid", "fig",
|
||||
"forts", "fr.o.m", "förf", "inkl", "jur", "kap", "kl", "kor", "kr",
|
||||
"kungl", "lat", "m.a.o", "m.fl", "m.m", "max", "milj", "min", "mos",
|
||||
"mt", "o.d", "o.s.v", "obs", "osv", "p.g.a", "proc", "prof", "ref",
|
||||
"resp", "s.a.s", "s.k", "s.t", "sid", "s:t", "t.ex", "t.h", "t.o.m", "t.v",
|
||||
"tel", "ung", "vol", "äv", "övers"
|
||||
]
|
||||
ABBREVIATIONS = [abbr + "." for abbr in ABBREVIATIONS] + ABBREVIATIONS
|
||||
|
||||
for orth in ABBREVIATIONS:
|
||||
_exc[orth] = [{ORTH: orth}]
|
||||
|
||||
# Sentence-final "i." (as in "... peka i.") and "m." (as in "...än 2000 m.")
# should be split into two tokens: the word and the period.
|
||||
for orth in ["i", "m"]:
|
||||
_exc[orth + "."] = [
|
||||
{ORTH: orth, LEMMA: orth, NORM: orth},
|
||||
{ORTH: ".", TAG: PUNCT}]
|
||||
|
||||
TOKENIZER_EXCEPTIONS = _exc
|
||||
|
|
|
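The Swedish hunk above registers each abbreviation with and without its trailing period, then adds two-token exceptions for sentence-final "i." and "m.". The following is a library-free sketch of that construction. It assumes plain strings as stand-ins for the `ORTH`, `LEMMA`, `NORM`, `TAG` and `PUNCT` symbols imported from `spacy.symbols`, and uses a shortened sample of the abbreviation list.

```python
# Plain-string stand-ins for the spaCy symbols (assumption: the real module
# imports these names from spacy.symbols instead).
ORTH, LEMMA, NORM, TAG, PUNCT = "ORTH", "LEMMA", "NORM", "TAG", "PUNCT"

abbreviations = ["bl.a", "kl", "p.g.a", "t.ex"]  # shortened sample
# Register every abbreviation both with and without the trailing period.
abbreviations = [abbr + "." for abbr in abbreviations] + abbreviations

exc = {}
for orth in abbreviations:
    # Single-token exception: the abbreviation is kept as one token.
    exc[orth] = [{ORTH: orth}]

# Sentence-final "i." and "m." are split into the word and the period.
for orth in ["i", "m"]:
    exc[orth + "."] = [
        {ORTH: orth, LEMMA: orth, NORM: orth},
        {ORTH: ".", TAG: PUNCT},
    ]

print(sorted(exc))
print(exc["i."])
```

Because "i" and "m" are not in the abbreviation list, the only entries for "i." and "m." are the two-token ones, which matches the new test case `'Anders I. tycker om ord med i i.'` in the Swedish tokenizer tests.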
@ -4,12 +4,20 @@ from __future__ import unicode_literals
|
|||
from ...attrs import LANG
|
||||
from ...language import Language
|
||||
from ...tokens import Doc
|
||||
from .tag_map import TAG_MAP
|
||||
from .stop_words import STOP_WORDS
|
||||
from ...util import update_exc
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
|
||||
|
||||
class ChineseDefaults(Language.Defaults):
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters[LANG] = lambda text: 'zh' # for pickling
|
||||
use_jieba = True
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
tag_map = TAG_MAP
|
||||
stop_words = STOP_WORDS
|
||||
|
||||
|
||||
class Chinese(Language):
|
||||
|
|
1901
spacy/lang/zh/stop_words.py
Normal file
File diff suppressed because it is too large
24
spacy/lang/zh/tag_map.py
Normal file
|
@ -0,0 +1,24 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import *
|
||||
|
||||
|
||||
TAG_MAP = {
|
||||
"ADV": {POS: ADV},
|
||||
"NOUN": {POS: NOUN},
|
||||
"ADP": {POS: ADP},
|
||||
"PRON": {POS: PRON},
|
||||
"SCONJ": {POS: SCONJ},
|
||||
"PROPN": {POS: PROPN},
|
||||
"DET": {POS: DET},
|
||||
"SYM": {POS: SYM},
|
||||
"INTJ": {POS: INTJ},
|
||||
"PUNCT": {POS: PUNCT},
|
||||
"NUM": {POS: NUM},
|
||||
"AUX": {POS: AUX},
|
||||
"X": {POS: X},
|
||||
"CONJ": {POS: CONJ},
|
||||
"ADJ": {POS: ADJ},
|
||||
"VERB": {POS: VERB}
|
||||
}
|
45
spacy/lang/zh/tokenizer_exceptions.py
Normal file
|
@ -0,0 +1,45 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import *
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = {
|
||||
"Jan.": [
|
||||
{ORTH: "Jan.", LEMMA: "January"}
|
||||
]
|
||||
}
|
||||
|
||||
|
||||
# exceptions mapped to a single token containing only ORTH property
|
||||
# example: {"string": [{ORTH: "string"}]}
|
||||
# converted using strings_to_exc() util
|
||||
|
||||
ORTH_ONLY = [
|
||||
"a.",
|
||||
"b.",
|
||||
"c.",
|
||||
"d.",
|
||||
"e.",
|
||||
"f.",
|
||||
"g.",
|
||||
"h.",
|
||||
"i.",
|
||||
"j.",
|
||||
"k.",
|
||||
"l.",
|
||||
"m.",
|
||||
"n.",
|
||||
"o.",
|
||||
"p.",
|
||||
"q.",
|
||||
"r.",
|
||||
"s.",
|
||||
"t.",
|
||||
"u.",
|
||||
"v.",
|
||||
"w.",
|
||||
"x.",
|
||||
"y.",
|
||||
"z."
|
||||
]
|
|
@ -96,49 +96,40 @@ def he_tokenizer():
|
|||
def nb_tokenizer():
|
||||
return get_lang_class('nb').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope='session')
|
||||
def da_tokenizer():
|
||||
return get_lang_class('da').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope='session')
|
||||
def ja_tokenizer():
|
||||
mecab = pytest.importorskip("MeCab")
|
||||
return get_lang_class('ja').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope='session')
|
||||
def th_tokenizer():
|
||||
pythainlp = pytest.importorskip("pythainlp")
|
||||
return get_lang_class('th').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope='session')
|
||||
def tr_tokenizer():
|
||||
return get_lang_class('tr').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope='session')
|
||||
def tt_tokenizer():
|
||||
return get_lang_class('tt').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope='session')
|
||||
def el_tokenizer():
|
||||
return get_lang_class('el').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope='session')
|
||||
def ar_tokenizer():
|
||||
return get_lang_class('ar').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope='session')
|
||||
def ur_tokenizer():
|
||||
return get_lang_class('ur').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope='session')
|
||||
def ru_tokenizer():
|
||||
pymorphy = pytest.importorskip('pymorphy2')
|
||||
|
|
|
@ -150,3 +150,31 @@ def test_span_as_doc(doc):
|
|||
span = doc[4:10]
|
||||
span_doc = span.as_doc()
|
||||
assert span.text == span_doc.text.strip()
|
||||
|
||||
def test_span_ents_property(doc):
|
||||
"""Test span.ents for the sentence spans of a doc."""
|
||||
doc.ents = [
|
||||
(doc.vocab.strings['PRODUCT'], 0, 1),
|
||||
(doc.vocab.strings['PRODUCT'], 7, 8),
|
||||
(doc.vocab.strings['PRODUCT'], 11, 14)
|
||||
]
|
||||
assert len(list(doc.ents)) == 3
|
||||
sentences = list(doc.sents)
|
||||
assert len(sentences) == 3
|
||||
assert len(sentences[0].ents) == 1
|
||||
# First sentence, also tests start of sentence
|
||||
assert sentences[0].ents[0].text == "This"
|
||||
assert sentences[0].ents[0].label_ == "PRODUCT"
|
||||
assert sentences[0].ents[0].start == 0
|
||||
assert sentences[0].ents[0].end == 1
|
||||
# Second sentence
|
||||
assert len(sentences[1].ents) == 1
|
||||
assert sentences[1].ents[0].text == "another"
|
||||
assert sentences[1].ents[0].label_ == "PRODUCT"
|
||||
assert sentences[1].ents[0].start == 7
|
||||
assert sentences[1].ents[0].end == 8
|
||||
# Third sentence, also tests end of sentence
|
||||
assert sentences[2].ents[0].text == "a third ."
|
||||
assert sentences[2].ents[0].label_ == "PRODUCT"
|
||||
assert sentences[2].ents[0].start == 11
|
||||
assert sentences[2].ents[0].end == 14
|
||||
|
|
|
@ -2,6 +2,11 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from .... import util
|
||||
|
||||
@pytest.fixture(scope='module')
|
||||
def fr_tokenizer():
|
||||
return util.get_lang_class('fr').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', [
|
||||
|
|
|
@ -1,5 +1,13 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
import pytest
|
||||
|
||||
from .... import util
|
||||
|
||||
@pytest.fixture(scope='module')
|
||||
def fr_tokenizer():
|
||||
return util.get_lang_class('fr').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
import pytest
|
||||
from spacy.lang.fr.lex_attrs import like_num
|
||||
|
|
|
@ -6,7 +6,8 @@ import pytest
|
|||
|
||||
SV_TOKEN_EXCEPTION_TESTS = [
|
||||
('Smörsåsen används bl.a. till fisk', ['Smörsåsen', 'används', 'bl.a.', 'till', 'fisk']),
|
||||
('Jag kommer först kl. 13 p.g.a. diverse förseningar', ['Jag', 'kommer', 'först', 'kl.', '13', 'p.g.a.', 'diverse', 'förseningar'])
|
||||
('Jag kommer först kl. 13 p.g.a. diverse förseningar', ['Jag', 'kommer', 'först', 'kl.', '13', 'p.g.a.', 'diverse', 'förseningar']),
|
||||
('Anders I. tycker om ord med i i.', ["Anders", "I.", "tycker", "om", "ord", "med", "i", "i", "."])
|
||||
]
|
||||
|
||||
|
||||
|
|
11
spacy/tests/regression/test_issue2626.py
Normal file
|
@ -0,0 +1,11 @@
|
|||
from __future__ import unicode_literals
|
||||
import spacy
|
||||
|
||||
def test_issue2626():
|
||||
'''Check that this sentence doesn't cause an infinite loop in the tokenizer.'''
|
||||
nlp = spacy.blank('en')
|
||||
text = """
|
||||
ABLEItemColumn IAcceptance Limits of ErrorIn-Service Limits of ErrorColumn IIColumn IIIColumn IVColumn VComputed VolumeUnder Registration of\xa0VolumeOver Registration of\xa0VolumeUnder Registration of\xa0VolumeOver Registration of\xa0VolumeCubic FeetCubic FeetCubic FeetCubic FeetCubic Feet1Up to 10.0100.0050.0100.005220.0200.0100.0200.010350.0360.0180.0360.0184100.0500.0250.0500.0255Over 100.5% of computed volume0.25% of computed volume0.5% of computed volume0.25% of computed volume TABLE ItemColumn IAcceptance Limits of ErrorIn-Service Limits of ErrorColumn IIColumn IIIColumn IVColumn VComputed VolumeUnder Registration of\xa0VolumeOver Registration of\xa0VolumeUnder Registration of\xa0VolumeOver Registration of\xa0VolumeCubic FeetCubic FeetCubic FeetCubic FeetCubic Feet1Up to 10.0100.0050.0100.005220.0200.0100.0200.010350.0360.0180.0360.0184100.0500.0250.0500.0255Over 100.5% of computed volume0.25% of computed volume0.5% of computed volume0.25% of computed volume ItemColumn IAcceptance Limits of ErrorIn-Service Limits of ErrorColumn IIColumn IIIColumn IVColumn VComputed VolumeUnder Registration of\xa0VolumeOver Registration of\xa0VolumeUnder Registration of\xa0VolumeOver Registration of\xa0VolumeCubic FeetCubic FeetCubic FeetCubic FeetCubic Feet1Up to 10.0100.0050.0100.005220.0200.0100.0200.010350.0360.0180.0360.0184100.0500.0250.0500.0255Over 100.5% of computed volume0.25% of computed volume0.5% of computed volume0.25% of computed volume
|
||||
"""
|
||||
doc = nlp.make_doc(text)
|
||||
|
0
spacy/tests/zh/__init__.py
Normal file
|
@ -324,6 +324,15 @@ cdef class Span:
|
|||
break
|
||||
return self.doc[start:end]
|
||||
|
||||
property ents:
|
||||
"""RETURNS (list): A list of tokens that belong to the current span."""
|
||||
def __get__(self):
|
||||
ents = []
|
||||
for ent in self.doc.ents:
|
||||
if ent.start >= self.start and ent.end <= self.end:
|
||||
ents.append(ent)
|
||||
return ents
|
||||
|
||||
property has_vector:
|
||||
"""RETURNS (bool): Whether a word vector is associated with the object.
|
||||
"""
|
||||
|
|
|
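The containment check in the new `Span.ents` property can be sketched in plain Python. This is only an illustration of the filter, not the Cython implementation: `Ent` and `ents_within` are hypothetical names, and the offsets are token indices with an exclusive end, as in the hunk above.

```python
from collections import namedtuple

# Hypothetical stand-in for an entity span: token offsets plus a label.
Ent = namedtuple("Ent", ["start", "end", "label"])


def ents_within(doc_ents, span_start, span_end):
    """Return the entities that lie entirely inside [span_start, span_end)."""
    return [
        ent for ent in doc_ents
        if ent.start >= span_start and ent.end <= span_end
    ]


doc_ents = [Ent(0, 2, "PERSON"), Ent(5, 6, "ORG"), Ent(8, 10, "GPE")]
# Only the ORG entity is fully contained in tokens 4..9; the GPE entity
# starts inside the span but ends after it, so it is excluded.
print(ents_within(doc_ents, 4, 9))
```

An entity that only partially overlaps the span is dropped, which matches the `>=` / `<=` comparison in the property.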
@ -10,7 +10,7 @@
|
|||
|
||||
- MODEL_COUNT = Object.keys(MODELS).map(m => Object.keys(MODELS[m]).length).reduce((a, b) => a + b)
|
||||
- MODEL_LANG_COUNT = Object.keys(MODELS).length
|
||||
- LANG_COUNT = Object.keys(LANGUAGES).length
|
||||
- LANG_COUNT = Object.keys(LANGUAGES).length - 1
|
||||
|
||||
- MODEL_META = public.models._data.MODEL_META
|
||||
- MODEL_LICENSES = public.models._data.MODEL_LICENSES
|
||||
|
|
|
@ -107,4 +107,3 @@ for id in CURRENT_MODELS
|
|||
print(doc.text)
|
||||
for token in doc:
|
||||
print(token.text, token.pos_, token.dep_)
|
||||
|
||||
|
|
|
@ -25,7 +25,7 @@ p
|
|||
+table(["Name", "Type", "Description", "Default"])
|
||||
+row
|
||||
+cell #[code docs]
|
||||
+cell list or #[code Doc]
|
||||
+cell list, #[code Doc], #[code Span]
|
||||
+cell Document(s) to visualize.
|
||||
+cell
|
||||
|
||||
|
@ -84,7 +84,7 @@ p Render a dependency parse tree or named entity visualization.
|
|||
+table(["Name", "Type", "Description", "Default"])
|
||||
+row
|
||||
+cell #[code docs]
|
||||
+cell list or #[code Doc]
|
||||
+cell list, #[code Doc], #[code Span]
|
||||
+cell Document(s) to visualize.
|
||||
+cell
|
||||
|
||||
|
@ -157,6 +157,12 @@ p
|
|||
| as it prevents long arcs to attach punctuation.
|
||||
+cell #[code True]
|
||||
|
||||
+row
|
||||
+cell #[code collapse_phrases]
|
||||
+cell bool
|
||||
+cell Merge noun phrases into one token.
|
||||
+cell #[code False]
|
||||
|
||||
+row
|
||||
+cell #[code compact]
|
||||
+cell bool
|
||||
|
|
|
@ -136,6 +136,12 @@ p
|
|||
+cell flag
|
||||
+cell Print information as Markdown.
|
||||
|
||||
+row
|
||||
+cell #[code --silent], #[code -s]
|
||||
+tag-new("2.0.12")
|
||||
+cell flag
|
||||
+cell Don't print anything, just return the values.
|
||||
|
||||
+row
|
||||
+cell #[code --help], #[code -h]
|
||||
+cell flag
|
||||
|
@ -254,7 +260,7 @@ p
|
|||
+code(false, "bash", "$", false, false, true).
|
||||
python -m spacy train [lang] [output_dir] [train_data] [dev_data] [--n-iter]
|
||||
[--n-sents] [--use-gpu] [--meta-path] [--vectors] [--no-tagger] [--no-parser]
|
||||
[--no-entities] [--gold-preproc]
|
||||
[--no-entities] [--gold-preproc] [--verbose]
|
||||
|
||||
+table(["Argument", "Type", "Description"])
|
||||
+row
|
||||
|
@ -338,6 +344,11 @@ p
|
|||
+cell flag
|
||||
+cell Show help message and available arguments.
|
||||
|
||||
+row
|
||||
+cell #[code --verbose]
|
||||
+cell flag
|
||||
+cell Show more detailed messages during training.
|
||||
|
||||
+row("foot")
|
||||
+cell creates
|
||||
+cell model, pickle
|
||||
|
|
|
@ -202,8 +202,8 @@ p
|
|||
|
||||
+aside-code("Example").
|
||||
from spacy.tokens import Doc
|
||||
Doc.set_extension('is_city', default=False)
|
||||
extension = Doc.get_extension('is_city')
|
||||
Doc.set_extension('has_city', default=False)
|
||||
extension = Doc.get_extension('has_city')
|
||||
assert extension == (False, None, None, None)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
|
@ -227,8 +227,8 @@ p Check whether an extension has been registered on the #[code Doc] class.
|
|||
|
||||
+aside-code("Example").
|
||||
from spacy.tokens import Doc
|
||||
Doc.set_extension('is_city', default=False)
|
||||
assert Doc.has_extension('is_city')
|
||||
Doc.set_extension('has_city', default=False)
|
||||
assert Doc.has_extension('has_city')
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
|
@ -241,6 +241,31 @@ p Check whether an extension has been registered on the #[code Doc] class.
|
|||
+cell bool
|
||||
+cell Whether the extension has been registered.
|
||||
|
||||
+h(2, "remove_extension") Doc.remove_extension
|
||||
+tag classmethod
|
||||
+tag-new("2.0.12")
|
||||
|
||||
p Remove a previously registered extension.
|
||||
|
||||
+aside-code("Example").
|
||||
from spacy.tokens import Doc
|
||||
Doc.set_extension('has_city', default=False)
|
||||
removed = Doc.remove_extension('has_city')
|
||||
assert not Doc.has_extension('has_city')
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code name]
|
||||
+cell unicode
|
||||
+cell Name of the extension.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell tuple
|
||||
+cell
|
||||
| A #[code.u-break (default, method, getter, setter)] tuple of the
|
||||
| removed extension.
|
||||
|
||||
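The register, check and remove cycle documented above can be modelled with a small plain-Python registry. This is a sketch under assumptions, not spaCy's implementation: `ExtensionRegistry` is a hypothetical class, and raising `ValueError` for an unknown name is a choice made for this example.

```python
class ExtensionRegistry(object):
    """Hypothetical stand-in for a class-level extension registry."""

    def __init__(self):
        self._extensions = {}

    def set_extension(self, name, default=None, method=None,
                      getter=None, setter=None):
        # Store the same 4-tuple shape that the docs describe.
        self._extensions[name] = (default, method, getter, setter)

    def has_extension(self, name):
        return name in self._extensions

    def remove_extension(self, name):
        # Assumption for this sketch: unknown names are an error.
        if name not in self._extensions:
            raise ValueError("Extension '%s' is not registered" % name)
        # pop() removes the entry and hands back the stored tuple.
        return self._extensions.pop(name)


registry = ExtensionRegistry()
registry.set_extension('has_city', default=False)
removed = registry.remove_extension('has_city')
print(removed)
```

Returning the removed tuple lets a caller re-register the same extension later, for example after a test has temporarily replaced it.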
+h(2, "char_span") Doc.char_span
|
||||
+tag method
|
||||
+tag-new(2)
|
||||
|
@ -263,7 +288,7 @@ p
|
|||
+row
|
||||
+cell #[code end]
|
||||
+cell int
|
||||
+cell The index of the first character after the span.
|
||||
+cell The index of the last character after the span.
|
||||
|
||||
+row
|
||||
+cell #[code label]
|
||||
|
@ -761,6 +786,13 @@ p
|
|||
+cell bool
|
||||
+cell A flag indicating that the document has been syntactically parsed.
|
||||
|
||||
+row
|
||||
+cell #[code is_sentenced]
|
||||
+cell bool
|
||||
+cell
|
||||
| A flag indicating that sentence boundaries have been applied to
|
||||
| the document.
|
||||
|
||||
+row
|
||||
+cell #[code sentiment]
|
||||
+cell float
|
||||
|
|
|
@ -513,11 +513,19 @@ p
|
|||
p
|
||||
| Loads state from a directory. Modifies the object in place and returns
|
||||
| it. If the saved #[code Language] object contains a model, the
|
||||
| #[strong model will be loaded].
|
||||
| model will be loaded. Note that this method is commonly used via the
|
||||
| subclasses like #[code English] or #[code German] to make
|
||||
| language-specific functionality like the
|
||||
| #[+a("/usage/adding-languages#lex-attrs") lexical attribute getters]
|
||||
| available to the loaded object.
|
||||
|
||||
+aside-code("Example").
|
||||
from spacy.language import Language
|
||||
nlp = Language().from_disk('/path/to/models')
|
||||
nlp = Language().from_disk('/path/to/model')
|
||||
|
||||
# using language-specific subclass
|
||||
from spacy.lang.en import English
|
||||
nlp = English().from_disk('/path/to/en_model')
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
|
@ -575,10 +583,15 @@ p Serialize the current state to a binary string.
|
|||
+h(2, "from_bytes") Language.from_bytes
|
||||
+tag method
|
||||
|
||||
p Load state from a binary string.
|
||||
p
|
||||
| Load state from a binary string. Note that this method is commonly used
|
||||
| via the subclasses like #[code English] or #[code German] to make
|
||||
| language-specific functionality like the
|
||||
| #[+a("/usage/adding-languages#lex-attrs") lexical attribute getters]
|
||||
| available to the loaded object.
|
||||
|
||||
+aside-code("Example").
|
||||
fron spacy.lang.en import English
|
||||
from spacy.lang.en import English
|
||||
nlp_bytes = nlp.to_bytes()
|
||||
nlp2 = English()
|
||||
nlp2.from_bytes(nlp_bytes)
|
||||
|
|
|
@ -219,6 +219,31 @@ p Check whether an extension has been registered on the #[code Span] class.
|
|||
+cell bool
|
||||
+cell Whether the extension has been registered.
|
||||
|
||||
+h(2, "remove_extension") Span.remove_extension
|
||||
+tag classmethod
|
||||
+tag-new("2.0.12")
|
||||
|
||||
p Remove a previously registered extension.
|
||||
|
||||
+aside-code("Example").
|
||||
from spacy.tokens import Span
|
||||
Span.set_extension('is_city', default=False)
|
||||
removed = Span.remove_extension('is_city')
|
||||
assert not Span.has_extension('is_city')
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code name]
|
||||
+cell unicode
|
||||
+cell Name of the extension.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell tuple
|
||||
+cell
|
||||
| A #[code.u-break (default, method, getter, setter)] tuple of the
|
||||
| removed extension.
|
||||
|
||||
+h(2, "similarity") Span.similarity
|
||||
+tag method
|
||||
+tag-model("vectors")
|
||||
|
|
|
@ -154,6 +154,31 @@ p Check whether an extension has been registered on the #[code Token] class.
|
|||
+cell bool
|
||||
+cell Whether the extension has been registered.
|
||||
|
||||
+h(2, "remove_extension") Token.remove_extension
|
||||
+tag classmethod
|
||||
+tag-new("2.0.11")
|
||||
|
||||
p Remove a previously registered extension.
|
||||
|
||||
+aside-code("Example").
|
||||
from spacy.tokens import Token
|
||||
Token.set_extension('is_fruit', default=False)
|
||||
removed = Token.remove_extension('is_fruit')
|
||||
assert not Token.has_extension('is_fruit')
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code name]
|
||||
+cell unicode
|
||||
+cell Name of the extension.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell tuple
|
||||
+cell
|
||||
| A #[code.u-break (default, method, getter, setter)] tuple of the
|
||||
| removed extension.
|
||||
|
||||
+h(2, "check_flag") Token.check_flag
|
||||
+tag method
|
||||
|
||||
|
@ -380,7 +405,7 @@ p
|
|||
+tag property
|
||||
+tag-model("parse")
|
||||
|
||||
p A sequence of all the token's syntactic descendents.
|
||||
p A sequence of all the token's syntactic descendants.
|
||||
|
||||
+aside-code("Example").
|
||||
doc = nlp(u'Give it back! He pleaded.')
|
||||
|
@ -484,6 +509,17 @@ p The L2 norm of the token's vector representation.
|
|||
+h(2, "attributes") Attributes
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code doc]
|
||||
+cell #[code Doc]
|
||||
+cell The parent document.
|
||||
|
||||
+row
|
||||
+cell #[code sent]
|
||||
+tag-new("2.0.12")
|
||||
+cell #[code Span]
|
||||
+cell The sentence span that this token is a part of.
|
||||
|
||||
+row
|
||||
+cell #[code text]
|
||||
+cell unicode
|
||||
|
@ -534,7 +570,7 @@ p The L2 norm of the token's vector representation.
|
|||
+row
|
||||
+cell #[code right_edge]
|
||||
+cell #[code Token]
|
||||
+cell The rightmost token of this token's syntactic descendents.
|
||||
+cell The rightmost token of this token's syntactic descendants.
|
||||
|
||||
+row
|
||||
+cell #[code i]
|
||||
|
|
|
@ -95,7 +95,7 @@
|
|||
|
||||
"EXAMPLE_SENT_LANGS": [
|
||||
"da", "de", "en", "es", "fa", "fr", "he", "hi", "hu", "id", "it", "ja",
|
||||
"nb", "nl", "pl", "pt", "ru", "sv", "tr", "zh"
|
||||
"nb", "pl", "pt", "ru", "sv", "tr", "zh"
|
||||
],
|
||||
|
||||
"LANGUAGES": {
|
||||
|
|
|
@ -251,27 +251,29 @@
|
|||
},
|
||||
{
|
||||
"id": "spacy-lefff",
|
||||
"slogan": "French lemmatization with Lefff",
|
||||
"description": "spacy v2.0 extension and pipeline component for adding a French lemmatizer based on [Lefff](https://hal.inria.fr/inria-00521242/).",
|
||||
"slogan": "POS and French lemmatization with Lefff",
|
||||
"description": "spacy v2.0 extension and pipeline component for adding a French POS and lemmatizer based on [Lefff](https://hal.inria.fr/inria-00521242/).",
|
||||
"github": "sammous/spacy-lefff",
|
||||
"pip": "spacy-lefff",
|
||||
"code_example": [
|
||||
"import spacy",
|
||||
"from spacy_lefff import LefffLemmatizer",
|
||||
"from spacy_lefff import LefffLemmatizer, POSTagger",
|
||||
"",
|
||||
"nlp = spacy.load('fr')",
|
||||
"french_lemmatizer = LefffLemmatizer()",
|
||||
"nlp.add_pipe(french_lemmatizer, name='lefff', after='parser')",
|
||||
"pos = POSTagger()",
|
||||
"french_lemmatizer = LefffLemmatizer(after_melt=True)",
|
||||
"nlp.add_pipe(pos, name='pos', after='parser')",
|
||||
"nlp.add_pipe(french_lemmatizer, name='lefff', after='pos')",
|
||||
"doc = nlp(u\"Paris est une ville très chère.\")",
|
||||
"for d in doc:",
|
||||
" print(d.text, d.pos_, d._.lefff_lemma, d.tag_)"
|
||||
" print(d.text, d.pos_, d._.melt_tagger, d._.lefff_lemma, d.tag_, d.lemma_)"
|
||||
],
|
||||
"author": "Sami Moustachir",
|
||||
"author_links": {
|
||||
"github": "sammous"
|
||||
},
|
||||
"category": ["pipeline"],
|
||||
"tags": ["lemmatizer", "french"]
|
||||
"tags": ["pos", "lemmatizer", "french"]
|
||||
},
|
||||
{
|
||||
"id": "lemmy",
|
||||
|
@ -943,17 +945,19 @@
|
|||
{
|
||||
"id": "excelcy",
|
||||
"title": "ExcelCy",
|
||||
"slogan": "Excel Integration with SpaCy. Includes, Entity training, Entity matcher pipe.",
|
||||
"description": "ExcelCy is a SpaCy toolkit to help improve the data training experiences. It provides easy annotation using Excel file format. It has helper to pre-train entity annotation with phrase and regex matcher pipe.",
|
||||
"slogan": "Excel Integration with spaCy. Training NER using XLSX from PDF, DOCX, PPT, PNG or JPG.",
|
||||
"description": "ExcelCy is a toolkit to integrate Excel to spaCy NLP training experiences. Training NER using XLSX from PDF, DOCX, PPT, PNG or JPG. ExcelCy has pipeline to match Entity with PhraseMatcher or Matcher in regular expression.",
|
||||
"url": "https://github.com/kororo/excelcy",
|
||||
"github": "kororo/excelcy",
|
||||
"pip": "excelcy",
|
||||
"code_example": [
|
||||
"from excelcy import ExcelCy",
|
||||
"",
|
||||
"excelcy = ExcelCy()",
|
||||
"# download data from here, https://github.com/kororo/excelcy/tree/master/excelcy/tests/data/test_data_28.xlsx",
|
||||
"excelcy.train(data_path='test_data_28.xlsx')"
|
||||
"# collect sentences, annotate Entities and train NER using spaCy",
|
||||
"excelcy = ExcelCy.execute(file_path='https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx')",
|
||||
"# use the nlp object as per spaCy API",
|
||||
"doc = excelcy.nlp('Google rebrands its business apps')",
|
||||
"# or save it for faster bootstrap for application",
|
||||
"excelcy.nlp.to_disk('/model')"
|
||||
],
|
||||
"author": "Robertus Johansyah",
|
||||
"author_links": {
|
||||
|
@ -961,6 +965,45 @@
|
|||
},
|
||||
"category": ["training"],
|
||||
"tags": ["excel"]
|
||||
},
|
||||
{
|
||||
"id": "spacy-graphql",
|
||||
"title": "spacy-graphql",
|
||||
"slogan": "Query spaCy's linguistic annotations using GraphQL",
|
||||
"github": "ines/spacy-graphql",
|
||||
"description": "A very simple and experimental app that lets you query spaCy's linguistic annotations using [GraphQL](https://graphql.org/). The API currently supports most token attributes, named entities, sentences and text categories (if available as `doc.cats`, i.e. if you added a text classifier to a model). The `meta` field will return the model meta data. Models are only loaded once and kept in memory.",
|
||||
"url": "https://explosion.ai/demos/spacy-graphql",
|
||||
"category": ["apis"],
|
||||
"tags": ["graphql"],
|
||||
"thumb": "https://i.imgur.com/xC7zpTO.png",
|
||||
"code_example": [
|
||||
"{",
|
||||
" nlp(text: \"Zuckerberg is the CEO of Facebook.\", model: \"en_core_web_sm\") {",
|
||||
" meta {",
|
||||
" lang",
|
||||
" description",
|
||||
" }",
|
||||
" doc {",
|
||||
" text",
|
||||
" tokens {",
|
||||
" text",
|
||||
" pos_",
|
||||
" }",
|
||||
" ents {",
|
||||
" text",
|
||||
" label_",
|
||||
" }",
|
||||
" }",
|
||||
" }",
|
||||
"}"
|
||||
],
|
||||
"code_language": "json",
|
||||
"author": "Ines Montani",
|
||||
"author_links": {
|
||||
"twitter": "_inesmontani",
|
||||
"github": "ines",
|
||||
"website": "https://ines.io"
|
||||
}
|
||||
}
|
||||
],
|
||||
"projectCats": {
|
||||
|
@ -970,7 +1013,7 @@
|
|||
},
|
||||
"training": {
|
||||
"title": "Training",
|
||||
"description": "Helpers and toolkits for trainig spaCy models"
|
||||
"description": "Helpers and toolkits for training spaCy models"
|
||||
},
|
||||
"conversational": {
|
||||
"title": "Conversational",
|
||||
|
|
|
@ -103,7 +103,7 @@
|
|||
"menu": {
|
||||
"How Pipelines Work": "pipelines",
|
||||
"Custom Components": "custom-components",
|
||||
"Extension Attributes": "custom-components-extensions",
|
||||
"Extension Attributes": "custom-components-attributes",
|
||||
"Multi-Threading": "multithreading",
|
||||
"Serialization": "serialization"
|
||||
}
|
||||
|
|
|
@ -103,8 +103,8 @@ p
|
|||
+h(4, "ner-accuracy-ontonotes5") NER accuracy (OntoNotes 5, no pre-process)
|
||||
|
||||
p
|
||||
| This is the evaluation we use to tune spaCy's parameters are decide which
|
||||
| algorithms are better than others. It's reasonably close to actual usage,
|
||||
| This is the evaluation we use to tune spaCy's parameters and decide which
|
||||
| algorithms are better than the others. It's reasonably close to actual usage,
|
||||
| because it requires the parses to be produced from raw text, without any
|
||||
| pre-processing.
|
||||
|
||||
|
|
|
@ -129,8 +129,8 @@ p
|
|||
substring = substring[split:]
|
||||
elif find_suffix(substring) is not None:
|
||||
split = find_suffix(substring)
|
||||
suffixes.append(substring[split:])
|
||||
substring = substring[:split]
|
||||
suffixes.append(substring[-split:])
|
||||
substring = substring[:-split]
|
||||
elif find_infixes(substring):
|
||||
infixes = find_infixes(substring)
|
||||
offset = 0
|
||||
|
|
|
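The suffix fix in this hunk switches to negative slicing, which is only correct if `find_suffix` returns the length of the matched suffix. The sketch below makes that assumption explicit. `SUFFIX_RE`, `find_suffix` and `split_suffix` are hypothetical helpers for illustration, and the pattern is far simpler than spaCy's language-specific suffix rules.

```python
import re

# Hypothetical suffix pattern: a run of closing punctuation at the very end.
SUFFIX_RE = re.compile(r"[.,!?)]+\Z")


def find_suffix(substring):
    """Return the length of the trailing suffix, or None if there is none.

    Assumption: the pseudo-code's find_suffix yields a length, which is
    what makes the negative slices below correct.
    """
    match = SUFFIX_RE.search(substring)
    return len(match.group()) if match else None


def split_suffix(substring):
    """Split the trailing suffix off substring using negative slicing."""
    split = find_suffix(substring)
    if split is None:
        return substring, None
    # substring[-split:] is the suffix, substring[:-split] is the remainder.
    return substring[:-split], substring[-split:]


print(split_suffix("hello!"))
print(split_suffix("world"))
```

With the positive slices from the old version, a suffix length of 1 on `"hello!"` would have yielded `"ello!"` as the suffix and `"h"` as the remainder, which is the behavior the hunk corrects.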
@ -62,8 +62,8 @@ p
|
|||
|
||||
+code.
|
||||
nlp_latin = spacy.load('/tmp/la_vectors_wiki_lg')
|
||||
doc1 = nlp(u"Caecilius est in horto")
|
||||
doc2 = nlp(u"servus est in atrio")
|
||||
doc1 = nlp_latin(u"Caecilius est in horto")
|
||||
doc2 = nlp_latin(u"servus est in atrio")
|
||||
doc1.similarity(doc2)
|
||||
|
||||
p
|
||||
|
|
|
@ -60,3 +60,26 @@ p
|
|||
displacy.serve(doc, style='dep', options=options)
|
||||
|
||||
+codepen("39c02c893a84794353de77a605d817fd", 360)
|
||||
|
||||
+h(3, "dep-long-text") Visualizing long texts
|
||||
+tag-new("2.0.12")
|
||||
|
||||
p
|
||||
| Long texts can become difficult to read when displayed in one row, so
|
||||
| it's often better to visualize them sentence-by-sentence instead. As of
|
||||
| v2.0.12, #[code displacy] supports rendering both
|
||||
| #[+api("doc") #[code Doc]] and #[+api("span") #[code Span]] objects, as
|
||||
| well as lists of #[code Doc]s or #[code Span]s. Instead of passing the
|
||||
| full #[code Doc] to #[code displacy.serve], you can also pass in a list
|
||||
| of the #[code doc.sents]. This will create one visualization for each
|
||||
| sentence.
|
||||
|
||||
+code.
|
||||
import spacy
|
||||
from spacy import displacy
|
||||
|
||||
nlp = spacy.load('en')
|
||||
text = u"""In ancient Rome, some neighbors live in three adjacent houses. In the center is the house of Senex, who lives there with wife Domina, son Hero, and several slaves, including head slave Hysterium and the musical's main character Pseudolus. A slave belonging to Hero, Pseudolus wishes to buy, win, or steal his freedom. One of the neighboring houses is owned by Marcus Lycus, who is a buyer and seller of beautiful women; the other belongs to the ancient Erronius, who is abroad searching for his long-lost children (stolen in infancy by pirates). One day, Senex and Domina go on a trip and leave Pseudolus in charge of Hero. Hero confides in Pseudolus that he is in love with the lovely Philia, one of the courtesans in the House of Lycus (albeit still a virgin)."""
|
||||
doc = nlp(text)
|
||||
sentence_spans = list(doc.sents)
|
||||
displacy.serve(sentence_spans, style='dep')
|
||||
|
|
|
@ -12,8 +12,8 @@ include _spacy-101/_pipelines
|
|||
+h(2, "custom-components") Creating custom pipeline components
|
||||
include _processing-pipelines/_custom-components
|
||||
|
||||
+section("custom-components-extensions")
|
||||
+h(2, "custom-components-extensions") Extension attributes
|
||||
+section("custom-components-attributes")
|
||||
+h(2, "custom-components-attributes") Extension attributes
|
||||
+tag-new(2)
|
||||
include _processing-pipelines/_extensions
|
||||
|
||||
|
|