Merge branch 'master' into develop

Ines Montani 2019-02-07 20:54:07 +01:00
commit 5d0b60999d
77 changed files with 293374 additions and 292084 deletions

106
.github/contributors/DeNeutoy.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name |Mark Neumann |
| Company name (if applicable) |Allen Institute for AI |
| Title or role (if applicable) |Research Engineer |
| Date | 13/01/2019 |
| GitHub username |@Deneutoy |
| Website (optional) |markneumann.xyz |

106
.github/contributors/Loghijiaha.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Loghi Perinpanayagam |
| Company name (if applicable) | |
| Title or role (if applicable) | Student |
| Date | 13 Jan, 2019 |
| GitHub username | loghijiaha |
| Website (optional) | |


@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jo |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-01-26 |
| GitHub username | PolyglotOpenstreetmap|
| Website (optional) | |

106
.github/contributors/adrianeboyd.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Adriane Boyd |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 28 January 2019 |
| GitHub username | adrianeboyd |
| Website (optional) | |

106
.github/contributors/alvations.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Liling |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 04 Jan 2019 |
| GitHub username | alvations |
| Website (optional) | |


@ -101,6 +101,6 @@ mark both statements:
| Name | Amandine Périnet |
| Company name (if applicable) | 365Talents |
| Title or role (if applicable) | Data Science Researcher |
| Date | 12/12/2018 |
| Date | 28/01/2019 |
| GitHub username | amperinet |
| Website (optional) | |

106
.github/contributors/boena.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Björn Lennartsson |
| Company name (if applicable) | Uptrail AB |
| Title or role (if applicable) | CTO |
| Date | 2019-01-15 |
| GitHub username | boena |
| Website (optional) | www.uptrail.com |

106
.github/contributors/foufaster.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name |Anès Foufa |
| Company name (if applicable) | |
| Title or role (if applicable) |NLP developer |
| Date |21/01/2019 |
| GitHub username |foufaster |
| Website (optional) | |

106
.github/contributors/ozcankasal.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Ozcan Kasal |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | December 21, 2018 |
| GitHub username | ozcankasal |
| Website (optional) | |

106
.github/contributors/retnuh.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
- Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
- to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
- each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
| ----------------------------- | ------------ |
| Name | Hunter Kelly |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-01-10 |
| GitHub username | retnuh |
| Website (optional) | |

106
.github/contributors/willprice.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | --------------------- |
| Name | Will Price |
| Company name (if applicable) | N/A |
| Title or role (if applicable) | N/A |
| Date | 26/12/2018 |
| GitHub username | willprice |
| Website (optional) | https://willprice.org |


@ -1,4 +1,5 @@
recursive-include include *.h
include LICENSE
include README.md
+include pyproject.toml
include bin/spacy

106
contributer_agreement.md Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Laura Baakman |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | February 7, 2019 |
| GitHub username | lauraBaakman |
| Website (optional) | |


@ -58,7 +58,7 @@ import spacy
lang=("Language class to initialise", "option", "l", str),
)
def main(patterns_loc, text_loc, n=10000, lang="en"):
-nlp = spacy.blank("en")
+nlp = spacy.blank(lang)
nlp.vocab.lex_attr_getters = {}
phrases = read_gazetteer(nlp.tokenizer, patterns_loc)
count = 0


@ -26,6 +26,11 @@ from spacy.util import minibatch, compounding
n_iter=("Number of training iterations", "option", "n", int),
)
def main(model=None, output_dir=None, n_iter=20, n_texts=2000):
+if output_dir is not None:
+output_dir = Path(output_dir)
+if not output_dir.exists():
+output_dir.mkdir()
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
@ -87,9 +92,6 @@ def main(model=None, output_dir=None, n_iter=20, n_texts=2000):
print(test_text, doc.cats)
if output_dir is not None:
-output_dir = Path(output_dir)
-if not output_dir.exists():
-output_dir.mkdir()
with nlp.use_params(optimizer.averages):
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
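
The two hunks above move the output-directory setup from the end of `main` to its start, so a bad path fails before any training time is spent. A minimal sketch of that fail-early pattern follows; `prepare_output_dir` is a hypothetical helper name for illustration, not a function in the example script.

```python
from pathlib import Path
import tempfile


def prepare_output_dir(output_dir=None):
    """Return output_dir as a Path, creating it if missing; None passes through."""
    if output_dir is None:
        return None
    output_dir = Path(output_dir)
    if not output_dir.exists():
        # Like the hunk above, this creates a single level only (no parents).
        output_dir.mkdir()
    return output_dir


# Usage: call this before the expensive work, not after it.
with tempfile.TemporaryDirectory() as tmp:
    target = prepare_output_dir(Path(tmp) / "model")
    print(target.is_dir())
```

Because `mkdir()` is called without `parents=True`, a path whose parent does not exist still raises, which is arguably the desired early failure.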


@ -1,6 +1,6 @@
[
{
-"id": "wsj_0200",
+"id": 42,
"paragraphs": [
{
"raw": "In an Oct. 19 review of \"The Misanthrope\" at Chicago's Goodman Theatre (\"Revitalized Classics Take the Stage in Windy City,\" Leisure & Arts), the role of Celimene, played by Kim Cattrall, was mistakenly attributed to Christina Haag. Ms. Haag plays Elianti.",

10
pyproject.toml Normal file

@ -0,0 +1,10 @@
[build-system]
requires = ["setuptools",
"wheel>0.32.0,<0.33.0",
"Cython",
"cymem>=2.0.2,<2.1.0",
"preshed>=2.0.1,<2.1.0",
"murmurhash>=0.28.0,<1.1.0",
"thinc>=6.12.1,<6.13.0",
]
build-backend = "setuptools.build_meta"


@ -14,7 +14,7 @@ plac<1.0.0,>=0.9.6
pathlib==1.0.1; python_version < "3.4"
# Development dependencies
cython>=0.25
-pytest>=4.0.0,<5.0.0
+pytest>=4.0.0,<4.1.0
pytest-timeout>=1.3.0,<2.0.0
mock>=2.0.0,<3.0.0
flake8>=3.5.0,<3.6.0


@ -246,6 +246,7 @@ def setup_package():
"cuda92": ["cupy-cuda92>=4.0"],
"cuda100": ["cupy-cuda100>=4.0"],
},
+python_requires=">=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*",
classifiers=[
"Development Status :: 5 - Production/Stable",
"Environment :: Console",
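
The `python_requires` specifier in this hunk admits Python 2.7 and anything from 3.4 upward, while excluding 3.0 through 3.3. As a rough sketch of the same rule applied to a version tuple (`is_supported` is a hypothetical helper for illustration, not part of setuptools or of `setup.py`):

```python
import sys


def is_supported(version_info):
    """Mirror ">=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*" for a version tuple."""
    major_minor = tuple(version_info[:2])
    if major_minor < (2, 7):
        return False
    # 3.0 through 3.3 are ruled out by the four != clauses.
    return not ((3, 0) <= major_minor <= (3, 3))


print(is_supported(sys.version_info))
```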


@ -31,9 +31,13 @@ def read_iob(raw_sents):
tokens = [re.split("[^\w\-]", line.strip())]
if len(tokens[0]) == 3:
words, pos, iob = zip(*tokens)
-else:
+elif len(tokens[0]) == 2:
words, iob = zip(*tokens)
pos = ["-"] * len(words)
+else:
+raise ValueError(
+"The iob/iob2 file is not formatted correctly. Try checking whitespace and delimiters."
+)
biluo = iob_to_biluo(iob)
sentences.append(
[
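
The hunk above stops treating every non-three-field token as a two-field one and raises on anything else. A simplified per-token sketch of that branching is shown below; `split_iob_token` is a hypothetical name, and the converter itself works on whole lines rather than single tokens.

```python
import re


def split_iob_token(token):
    """Split 'word|pos|iob' or 'word|iob' into a (word, pos, iob) triple."""
    # Same pattern as the hunk: split on anything that is not a word
    # character or a hyphen, so tags such as "B-PER" stay intact.
    fields = re.split(r"[^\w\-]", token.strip())
    if len(fields) == 3:
        word, pos, iob = fields
    elif len(fields) == 2:
        word, iob = fields
        pos = "-"  # placeholder when no POS tag is given
    else:
        raise ValueError(
            "The iob/iob2 file is not formatted correctly. "
            "Try checking whitespace and delimiters."
        )
    return word, pos, iob


print(split_iob_token("Alex|NNP|B-PER"))
print(split_iob_token("Alex|B-PER"))
```

With the old bare `else`, a token with one field or four fields would have failed later with a less helpful unpacking error.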


@ -208,7 +208,11 @@ def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50):
doc_freq = int(doc_freq)
freq = int(freq)
if doc_freq >= min_doc_freq and freq >= min_freq and len(key) < max_length:
-word = literal_eval(key)
+try:
+word = literal_eval(key)
+except SyntaxError:
+# Take odd strings literally.
+word = literal_eval("'%s'" % key)
smooth_count = counts.smoother(int(freq))
probs[word] = math.log(smooth_count) - log_total
oov_prob = math.log(counts.smoother(0)) - log_total
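
The fallback added in this hunk can be tried in isolation. The sketch below wraps it in a hypothetical `parse_key` helper; the sample keys are made up for illustration and are not taken from a real frequencies file.

```python
from ast import literal_eval


def parse_key(key):
    """Evaluate a quoted key; on a syntax error, wrap it in quotes and retry."""
    try:
        return literal_eval(key)
    except SyntaxError:
        # Take odd strings literally, as the hunk above does.
        return literal_eval("'%s'" % key)


print(parse_key("'word'"))
print(parse_key("a b"))
```

Note that only `SyntaxError` is caught. An unquoted bare word such as `hello` parses as a name and makes `literal_eval` raise `ValueError` instead, and a key containing a single quote would break the retry, so the fallback covers some malformed keys rather than all of them.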


@ -9,7 +9,6 @@ from ..util import is_in_jupyter
_html = {}
-IS_JUPYTER = is_in_jupyter()
RENDER_WRAPPER = None
@ -18,7 +17,7 @@ def render(
style="dep",
page=False,
minify=False,
-jupyter=IS_JUPYTER,
+jupyter=False,
options={},
manual=False,
):
@ -51,7 +50,7 @@ def render(
html = _html["parsed"]
if RENDER_WRAPPER is not None:
html = RENDER_WRAPPER(html)
-if jupyter: # return HTML rendered by IPython display()
+if jupyter or is_in_jupyter(): # return HTML rendered by IPython display()
from IPython.core.display import display, HTML
return display(HTML(html))


@ -1,7 +1,7 @@
# coding: utf8
from __future__ import unicode_literals
-import random
+import uuid
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS
from .templates import TPL_ENT, TPL_ENTS, TPL_FIGURE, TPL_TITLE, TPL_PAGE
@ -41,7 +41,7 @@ class DependencyRenderer(object):
"""
# Create a random ID prefix to make sure parses don't receive the
# same ID, even if they're identical
-id_prefix = random.randint(0, 999)
+id_prefix = uuid.uuid4().hex
rendered = [
self.render_svg("{}-{}".format(id_prefix, i), p["words"], p["arcs"])
for i, p in enumerate(parsed)
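
Replacing `random.randint(0, 999)` with `uuid.uuid4().hex` makes ID clashes between separately rendered parses far less likely, since the prefix is now a random 32-character hex string rather than one of 1000 integers. A minimal sketch of the ID scheme, using a hypothetical `make_render_ids` helper in place of the renderer:

```python
import uuid


def make_render_ids(n_parses):
    """Build one ID per parse, all sharing a random 32-character hex prefix."""
    id_prefix = uuid.uuid4().hex
    return ["{}-{}".format(id_prefix, i) for i in range(n_parses)]


print(make_render_ids(2))
```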


@ -4,20 +4,24 @@ from __future__ import unicode_literals
from .lookup import LOOKUP
from ._adjectives import ADJECTIVES
from ._adjectives_irreg import ADJECTIVES_IRREG
+from ._adp_irreg import ADP_IRREG
from ._adverbs import ADVERBS
+from ._auxiliary_verbs_irreg import AUXILIARY_VERBS_IRREG
+from ._cconj_irreg import CCONJ_IRREG
+from ._dets_irreg import DETS_IRREG
+from ._lemma_rules import ADJECTIVE_RULES, NOUN_RULES, VERB_RULES
from ._nouns import NOUNS
from ._nouns_irreg import NOUNS_IRREG
+from ._pronouns_irreg import PRONOUNS_IRREG
+from ._sconj_irreg import SCONJ_IRREG
from ._verbs import VERBS
from ._verbs_irreg import VERBS_IRREG
-from ._dets_irreg import DETS_IRREG
-from ._pronouns_irreg import PRONOUNS_IRREG
-from ._auxiliary_verbs_irreg import AUXILIARY_VERBS_IRREG
-from ._lemma_rules import ADJECTIVE_RULES, NOUN_RULES, VERB_RULES
LEMMA_INDEX = {'adj': ADJECTIVES, 'adv': ADVERBS, 'noun': NOUNS, 'verb': VERBS}
-LEMMA_EXC = {'adj': ADJECTIVES_IRREG, 'noun': NOUNS_IRREG, 'verb': VERBS_IRREG,
-'det': DETS_IRREG, 'pron': PRONOUNS_IRREG, 'aux': AUXILIARY_VERBS_IRREG}
+LEMMA_EXC = {'adj': ADJECTIVES_IRREG, 'adp': ADP_IRREG, 'aux': AUXILIARY_VERBS_IRREG,
+'cconj': CCONJ_IRREG, 'det': DETS_IRREG, 'noun': NOUNS_IRREG, 'verb': VERBS_IRREG,
+'pron': PRONOUNS_IRREG, 'sconj': SCONJ_IRREG}
LEMMA_RULES = {'adj': ADJECTIVE_RULES, 'noun': NOUN_RULES, 'verb': VERB_RULES}

File diff suppressed because it is too large

File diff suppressed because it is too large


@ -0,0 +1,24 @@
# coding: utf8
from __future__ import unicode_literals
ADP_IRREG = {
"a": ("à",),
"apr.": ("après",),
"aux": ("à",),
"av.": ("avant",),
"avt": ("avant",),
"cf.": ("cf",),
"conf.": ("cf",),
"confer": ("cf",),
"d'": ("de",),
"des": ("de",),
"du": ("de",),
"jusqu'": ("jusque",),
"pdt": ("pendant",),
"+": ("plus",),
"pr": ("pour",),
"/": ("sur",),
"versus": ("vs",),
"vs.": ("vs",)
}

File diff suppressed because it is too large


@ -0,0 +1,17 @@
# coding: utf8
from __future__ import unicode_literals
CCONJ_IRREG = {
"&amp;": ("et",),
"c-à-d": ("c'est-à-dire",),
"c.-à.-d.": ("c'est-à-dire",),
"càd": ("c'est-à-dire",),
"&": ("et",),
"et|ou": ("et-ou",),
"et/ou": ("et-ou",),
"i.e.": ("c'est-à-dire",),
"ie": ("c'est-à-dire",),
"ou/et": ("et-ou",),
"+": ("plus",)
}
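
The new `ADP_IRREG` and `CCONJ_IRREG` tables map a surface form to a tuple of lemmas and are registered in `LEMMA_EXC` under a lowercase part-of-speech key. How the lemmatizer consults them is not shown in this diff, so the following is only a sketch under that assumption: `lookup_exception` is a hypothetical helper, and the tables are small excerpts of the entries above.

```python
# -*- coding: utf-8 -*-
# Excerpts of the tables added in this commit (not the full dictionaries).
ADP_IRREG = {"du": ("de",), "aux": ("à",), "jusqu'": ("jusque",)}
CCONJ_IRREG = {"&": ("et",), "càd": ("c'est-à-dire",)}

LEMMA_EXC = {"adp": ADP_IRREG, "cconj": CCONJ_IRREG}


def lookup_exception(word, pos):
    """Return the exception lemmas for word under pos, else the word itself."""
    table = LEMMA_EXC.get(pos, {})
    return table.get(word, (word,))


print(lookup_exception("du", "adp"))
print(lookup_exception("&", "cconj"))
print(lookup_exception("du", "noun"))
```

Keying the tables by part of speech lets the same string resolve differently by tag: `"+"` maps to `plus` in both new tables, while `"du"` maps to `de` as an adposition.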


@ -4,20 +4,27 @@ from __future__ import unicode_literals
DETS_IRREG = {
"aucune": ("aucun",),
"cents": ("cent",),
"certaine": ("certain",),
"certaines": ("certain",),
"certains": ("certain",),
"ces": ("ce",),
"cet": ("ce",),
"cette": ("ce",),
"cents": ("cent",),
"certaines": ("certains",),
"des": ("un",),
"différentes": ("différents",),
"diverse": ("divers",),
"diverses": ("divers",),
"du": ("de",),
"la": ("le",),
"les": ("le",),
"l'": ("le",),
"laquelle": ("lequel",),
"les": ("le",),
"lesdites": ("ledit",),
"lesdits": ("ledit",),
"leurs": ("leur",),
"lesquelles": ("lequel",),
"lesquels": ("lequel",),
"leurs": ("leur",),
"l'": ("le",),
"mainte": ("maint",),
"maintes": ("maint",),
"maints": ("maint",),
@ -27,23 +34,29 @@ DETS_IRREG = {
"nulle": ("nul",),
"nulles": ("nul",),
"nuls": ("nul",),
"pareille": ("pareil",),
"pareilles": ("pareil",),
"pareils": ("pareil",),
"quelle": ("quel",),
"quelles": ("quel",),
"quels": ("quel",),
"quelqu'": ("quelque",),
"qq": ("quelque",),
"qqes": ("quelque",),
"qqs": ("quelque",),
"quelques": ("quelque",),
"quelqu'": ("quelque",),
"quels": ("quel",),
"sa": ("son",),
"ses": ("son",),
"telle": ("tel",),
"telles": ("tel",),
"tels": ("tel",),
"ta": ("ton",),
"telles": ("tel",),
"telle": ("tel",),
"tels": ("tel",),
"tes": ("ton",),
"tous": ("tout",),
"toute": ("tout",),
"toutes": ("tout",),
"des": ("un",),
"toute": ("tout",),
"une": ("un",),
"vingts": ("vingt",),
"vot'": ("votre",),
"vos": ("votre",),
}

@ -63,36 +63,8 @@ NOUN_RULES = [
["w", "w"],
["y", "y"],
["z", "z"],
["as", "a"],
["aux", "au"],
["cs", "c"],
["chs", "ch"],
["ds", "d"],
["és", "é"],
["es", "e"],
["eux", "eu"],
["fs", "f"],
["gs", "g"],
["hs", "h"],
["is", "i"],
["ïs", "ï"],
["js", "j"],
["ks", "k"],
["ls", "l"],
["ms", "m"],
["ns", "n"],
["oux", "ou"],
["os", "o"],
["ps", "p"],
["qs", "q"],
["rs", "r"],
["ses", "se"],
["se", "se"],
["ts", "t"],
["us", "u"],
["vs", "v"],
["ws", "w"],
["ys", "y"],
["s", ""],
["x", ""],
["nt(e", "nt"],
["nt(e)", "nt"],
["al(e", "ale"],

File diff suppressed because it is too large

File diff suppressed because it is too large

@ -4,37 +4,89 @@ from __future__ import unicode_literals
PRONOUNS_IRREG = {
"aucune": ("aucun",),
"celle-ci": ("celui-ci",),
"celles-ci": ("celui-ci",),
"ceux-ci": ("celui-ci",),
"celle-là": ("celui-là",),
"celles-là": ("celui-là",),
"ceux-là": ("celui-là",),
"autres": ("autre",),
"ça": ("cela",),
"c'": ("ce",),
"celle": ("celui",),
"celle-ci": ("celui-ci",),
"celle-là": ("celui-là",),
"celles": ("celui",),
"ceux": ("celui",),
"celles-ci": ("celui-ci",),
"celles-là": ("celui-là",),
"certaines": ("certains",),
"ceux": ("celui",),
"ceux-ci": ("celui-ci",),
"ceux-là": ("celui-là",),
"chacune": ("chacun",),
"-elle": ("lui",),
"elle": ("lui",),
"elle-même": ("lui-même",),
"-elles": ("lui",),
"elles": ("lui",),
"elles-mêmes": ("lui-même",),
"eux": ("lui",),
"eux-mêmes": ("lui-même",),
"icelle": ("icelui",),
"icelles": ("icelui",),
"iceux": ("icelui",),
"-il": ("il",),
"-ils": ("il",),
"ils": ("il",),
"-je": ("je",),
"j'": ("je",),
"la": ("le",),
"les": ("le",),
"laquelle": ("lequel",),
"l'autre": ("l'autre",),
"les": ("le",),
"lesquelles": ("lequel",),
"lesquels": ("lequel",),
"elle-même": ("lui-même",),
"elles-mêmes": ("lui-même",),
"eux-mêmes": ("lui-même",),
"-leur": ("leur",),
"l'on": ("on",),
"-lui": ("lui",),
"l'une": ("l'un",),
"mêmes": ("même",),
"-m'": ("me",),
"m'": ("me",),
"-moi": ("moi",),
"nous-mêmes": ("nous-même",),
"-nous": ("nous",),
"-on": ("on",),
"qqchose": ("quelque chose",),
"qqch": ("quelque chose",),
"qqc": ("quelque chose",),
"qqn": ("quelqu'un",),
"quelle": ("quel",),
"quelles": ("quel",),
"quels": ("quel",),
"quelques-unes": ("quelqu'un",),
"quelques-uns": ("quelqu'un",),
"quelques-unes": ("quelques-uns",),
"quelque-une": ("quelqu'un",),
"quelqu'une": ("quelqu'un",),
"quels": ("quel",),
"qu": ("que",),
"telle": ("tel",),
"s'": ("se",),
"-t-elle": ("elle",),
"-t-elles": ("elle",),
"telles": ("tel",),
"telle": ("tel",),
"tels": ("tel",),
"toutes": ("tous",),
"-t-en": ("en",),
"-t-il": ("il",),
"-t-ils": ("il",),
"-toi": ("toi",),
"-t-on": ("on",),
"tous": ("tout",),
"toutes": ("tout",),
"toute": ("tout",),
"-t'": ("te",),
"t'": ("te",),
"-tu": ("tu",),
"-t-y": ("y",),
"unes": ("un",),
"une": ("un",),
"uns": ("un",),
"vous-mêmes": ("vous-même",),
"vous-même": ("vous-même",),
"-vous": ("vous",),
"-vs": ("vous",),
"vs": ("vous",),
"-y": ("y",),
}

@ -0,0 +1,19 @@
# coding: utf8
from __future__ import unicode_literals
SCONJ_IRREG = {
"lorsqu'": ("lorsque",),
"pac'que": ("parce que",),
"pac'qu'": ("parce que",),
"parc'que": ("parce que",),
"parc'qu'": ("parce que",),
"paske": ("parce que",),
"pask'": ("parce que",),
"pcq": ("parce que",),
"+": ("plus",),
"puisqu'": ("puisque",),
"qd": ("quand",),
"quoiqu'": ("quoique",),
"qu'": ("que",)
}

@ -6,63 +6,64 @@ VERBS = set(
"""
abaisser abandonner abdiquer abecquer abéliser aberrer abhorrer abîmer abjurer
ablater abluer ablutionner abominer abonder abonner aborder aborner aboucher
abouler abouter abraquer abraser abreuver abricoter abriter absenter absinther
absolutiser absorber abuser académifier académiser acagnarder accabler
accagner accaparer accastiller accentuer accepter accessoiriser accidenter
acclamer acclimater accointer accolader accoler accommoder accompagner
accorder accorer accoster accoter accoucher accouder accouer accoupler
accoutrer accoutumer accouver accrassiner accréditer accrocher acculer
acculturer accumuler accuser acenser acétaliser acétyler achalander acharner
acheminer achopper achromatiser aciduler aciériser acliquer acoquiner acquêter
acquitter acter actiniser actionner activer actoriser actualiser acupuncturer
acyler adapter additionner adenter adieuser adirer adjectiver adjectiviser
adjurer adjuver administrer admirer admonester adoniser adonner adopter adorer
adorner adosser adouber adresser adsorber aduler adverbialiser aéroporter
aérosoliser aérosonder aérotransporter affabuler affacturer affairer affaisser
affaiter affaler affamer affecter affectionner affermer afficher affider
affiler affiner affirmer affistoler affixer affleurer afflouer affluer affoler
afforester affouiller affourcher affriander affricher affrioler affriquer
affriter affronter affruiter affubler affurer affûter afghaniser afistoler
africaniser agatiser agenouiller agglutiner aggraver agioter agiter agoniser
agourmander agrafer agrainer agrémenter agresser agriffer agripper
agroalimentariser agrouper aguetter aguicher ahaner aheurter aicher aider
aigretter aiguer aiguiller aiguillonner aiguiser ailer ailler ailloliser
aimanter aimer airer ajointer ajourer ajourner ajouter ajuster ajuter
alambiquer alarmer albaniser albitiser alcaliniser alcaliser alcooliser
alcoolyser alcoyler aldoliser alerter aleviner algébriser algérianiser
algorithmiser aligner alimenter alinéater alinéatiser aliter alkyler allaiter
allectomiser allégoriser allitiser allivrer allocutionner alloter allouer
alluder allumer allusionner alluvionner allyler aloter alpaguer alphabétiser
alterner aluminer aluminiser aluner alvéoler alvéoliser amabiliser amadouer
amalgamer amariner amarrer amateloter ambitionner ambler ambrer ambuler
améliorer amender amenuiser américaniser ameulonner ameuter amhariser amiauler
amicoter amidonner amignarder amignoter amignotter aminer ammoniaquer
ammoniser ammoxyder amocher amouiller amouracher amourer amphotériser ampouler
amputer amunitionner amurer amuser anagrammatiser anagrammer analyser
anamorphoser anaphylactiser anarchiser anastomoser anathématiser anatomiser
ancher anchoiter ancrer anecdoter anecdotiser angéliser anglaiser angler
angliciser angoisser anguler animaliser animer aniser ankyloser annexer
annihiler annoter annualiser annuler anodiser ânonner anser antagoniser
antéposer antérioriser anthropomorphiser anticiper anticoaguler antidater
antiparasiter antiquer antiseptiser anuiter aoûter apaiser apériter apetisser
apeurer apicaliser apiquer aplaner apologiser aponévrotomiser aponter aposter
apostiller apostoliser apostropher apostumer apothéoser appareiller apparenter
appeauter appertiser appliquer appointer appoltronner apponter apporter
apposer appréhender apprêter apprivoiser approcher approuver approvisionner
approximer apurer aquareller arabiser araméiser aramer araser arbitrer arborer
arboriser arcbouter arc-bouter archaïser architecturer archiver arçonner
ardoiser aréniser arer argenter argentiniser argoter argotiser argumenter
arianiser arimer ariser aristocratiser aristotéliser arithmétiser armaturer
armer arnaquer aromatiser arpenter arquebuser arquer arracher arraisonner
arrenter arrêter arrher arrimer arriser arriver arroser arsouiller
artérialiser articler articuler artificialiser artistiquer aryaniser aryler
ascensionner ascétiser aseptiser asexuer asianiser asiatiser aspecter
asphalter aspirer assabler assaisonner assassiner assembler assener asséner
assermenter asserter assibiler assigner assimiler assister assoiffer assoler
assommer assoner assoter assumer assurer asticoter astiquer athéiser
atlantiser atomiser atourner atropiniser attabler attacher attaquer attarder
attenter attentionner atténuer atterrer attester attifer attirer attiser
attitrer attraper attremper attribuer attrister attrouper aubiner
abouler abouter aboutonner abracadabrer abraquer abraser abreuver abricoter
abriter absenter absinther absolutiser absorber abuser académifier académiser
acagnarder accabler accagner accaparer accastiller accentuer accepter
accessoiriser accidenter acclamer acclimater accointer accolader accoler
accommoder accompagner accorder accorer accoster accoter accoucher accouder
accouer accoupler accoutrer accoutumer accouver accrassiner accréditer
accrocher acculer acculturer accumuler accuser acenser acétaliser acétyler
achalander acharner acheminer achopper achromatiser aciduler aciériser
acliquer acoquiner acquêter acquitter acter actiniser actionner activer
actoriser actualiser acupuncturer acyler adapter additionner adenter adieuser
adirer adjectiver adjectiviser adjurer adjuver administrer admirer admonester
adoniser adonner adopter adorer adorner adosser adouber adresser adsorber
aduler adverbialiser aéroporter aérosoliser aérosonder aérotransporter
affabuler affacturer affairer affaisser affaiter affaler affamer affecter
affectionner affermer afficher affider affiler affiner affirmer affistoler
affixer affleurer afflouer affluer affoler afforester affouiller affourcher
affriander affricher affrioler affriquer affriter affronter affruiter affubler
affurer affûter afghaniser afistoler africaniser agatiser agenouiller
agglutiner aggraver agioter agiter agoniser agourmander agrafer agrainer
agrémenter agresser agricher agriffer agripper agroalimentariser agrouper
aguetter aguicher aguiller ahaner aheurter aicher aider aigretter aiguer
aiguiller aiguillonner aiguiser ailer ailler ailloliser aimanter aimer airer
ajointer ajourer ajourner ajouter ajuster ajuter alambiquer alarmer albaniser
albitiser alcaliniser alcaliser alcooliser alcoolyser alcoyler aldoliser
alerter aleviner algébriser algérianiser algorithmiser aligner alimenter
alinéater alinéatiser aliter alkyler allaiter allectomiser allégoriser
allitiser allivrer allocutionner alloter allouer alluder allumer allusionner
alluvionner allyler aloter alpaguer alphabétiser alterner aluminer aluminiser
aluner alvéoler alvéoliser amabiliser amadouer amalgamer amariner amarrer
amateloter ambitionner ambler ambrer ambuler améliorer amender amenuiser
américaniser ameulonner ameuter amhariser amiauler amicoter amidonner
amignarder amignoter amignotter aminer ammoniaquer ammoniser ammoxyder amocher
amouiller amouracher amourer amphotériser ampouler amputer amunitionner amurer
amuser anagrammatiser anagrammer analyser anamorphoser anaphylactiser
anarchiser anastomoser anathématiser anatomiser ancher anchoiter ancrer
anecdoter anecdotiser angéliser anglaiser angler angliciser angoisser anguler
animaliser animer aniser ankyloser annexer annihiler annoter annualiser
annuler anodiser ânonner anser antagoniser antéposer antérioriser
anthropomorphiser anticiper anticoaguler antidater antiparasiter antiquer
antiseptiser anuiter aoûter apaiser apériter apetisser apeurer apicaliser
apiquer aplaner apologiser aponévrotomiser aponter aposter apostiller
apostoliser apostropher apostumer apothéoser appareiller apparenter appeauter
appertiser appliquer appointer appoltronner apponter apporter apposer
appréhender apprêter apprivoiser approcher approuver approvisionner approximer
apurer aquareller arabiser araméiser aramer araser arbitrer arborer arboriser
arcbouter arc-bouter archaïser architecturer archiver arçonner ardoiser
aréniser arer argenter argentiniser argoter argotiser argumenter arianiser
arimer ariser aristocratiser aristotéliser arithmétiser armaturer armer
arnaquer aromatiser arpenter arquebuser arquer arracher arraisonner arrenter
arrêter arrher arrimer arriser arriver arroser arsouiller artérialiser
articler articuler artificialiser artistiquer aryaniser aryler ascensionner
ascétiser aseptiser asexuer asianiser asiatiser aspecter asphalter aspirer
assabler assaisonner assassiner assembler assener asséner assermenter asserter
assibiler assigner assimiler assister assoiffer assoler assommer assoner
assoter assumer assurer asticoter astiquer athéiser atlantiser atomiser
atourner atropiniser attabler attacher attaquer attarder attenter attentionner
atténuer atterrer attester attifer attirer attiser attitrer attoucher attraper
attremper attribuer attriquer attrister attrouper aubader aubiner
audiovisualiser auditer auditionner augmenter augurer aulofer auloffer aumôner
auner auréoler ausculter authentiquer autoaccuser autoadapter autoadministrer
autoagglutiner autoalimenter autoallumer autoamputer autoanalyser autoancrer
@ -73,10 +74,10 @@ VERBS = set(
autodéterminer autodévelopper autodévorer autodicter autodiscipliner
autodupliquer autoéduquer autoenchâsser autoenseigner autoépurer autoéquiper
autoévaporiser autoévoluer autoféconder autofertiliser autoflageller
autofonder autoformer autofretter autogouverner autogreffer autoguider auto-
immuniser auto-ioniser autolégitimer autolimiter autoliquider autolyser
automatiser automédiquer automitrailler automutiler autonomiser auto-
optimaliser auto-optimiser autoorganiser autoperpétuer autopersuader
autofonder autoformer autofretter autogouverner autogreffer autoguider
auto-immuniser auto-ioniser autolégitimer autolimiter autoliquider autolyser
automatiser automédiquer automitrailler automutiler autonomiser
auto-optimaliser auto-optimiser autoorganiser autoperpétuer autopersuader
autopiloter autopolliniser autoporter autopositionner autoproclamer
autopropulser autoréaliser autorecruter autoréglementer autoréguler
autorelaxer autoréparer autoriser autosélectionner autosevrer autostabiliser
@ -84,7 +85,7 @@ VERBS = set(
autotracter autotransformer autovacciner autoventiler avaler avaliser
aventurer aveugler avillonner aviner avironner aviser avitailler aviver
avoiner avoisiner avorter avouer axéniser axer axiomatiser azimuter azoter
azurer babiller babouiner bâcher bachonner bachoter bâcler badauder
azurer babiller babouiner bâcher bachonner bachoter bâcler badauder bader
badigeonner badiner baffer bafouer bafouiller bâfrer bagarrer bagoter bagouler
baguenauder baguer baguetter bahuter baigner bailler bâiller baîller
bâillonner baîllonner baiser baisoter baisouiller baisser bakéliser balader
@ -135,9 +136,9 @@ VERBS = set(
brouillonner broussailler brousser brouter bruiner bruisser bruiter brûler
brumer brumiser bruncher brusquer brutaliser bruter bûcher bucoliser
budgétiser buer buffériser buffler bugler bugner buiser buissonner bulgariser
buquer bureaucratiser buriner buser busquer buter butiner butonner butter
buvoter byzantiner byzantiniser cabaler cabaliser cabaner câbler cabosser
caboter cabotiner cabrer cabrioler cacaber cacaoter cacarder cacher
buller buquer bureaucratiser buriner buser busquer buter butiner butonner
butter buvoter byzantiner byzantiniser cabaler cabaliser cabaner câbler
cabosser caboter cabotiner cabrer cabrioler cacaber cacaoter cacarder cacher
cachetonner cachotter cadastrer cadavériser cadeauter cadetter cadoter cadrer
cafarder cafeter cafouiller cafter cageoler cagnarder cagner caguer cahoter
caillebotter cailler caillouter cajoler calaminer calamistrer calamiter
@ -185,65 +186,66 @@ VERBS = set(
claveliser claver clavetter clayonner cléricaliser clicher cligner clignoter
climatiser clinquanter clinquer cliper cliquer clisser cliver clochardiser
clocher clocter cloisonner cloîtrer cloner cloper clopiner cloquer clôturer
clouer clouter coaccuser coacerver coacher coadapter coagglutiner coaguler
coaliser coaltarer coaltariser coanimer coarticuler cobelligérer cocaïniser
cocarder cocheniller cocher côcher cochonner coconiser coconner cocooner
cocoter coder codéterminer codiller coéditer coéduquer coexister coexploiter
coexprimer coffiner coffrer cofonder cogiter cogner cogouverner cohabiter
cohériter cohober coiffer coincher coincider coïncider coïter colchiciner
collaber collaborer collationner collecter collectionner collectiviser coller
collisionner colloquer colluvionner colmater colombianiser colombiner
coloniser colorer coloriser colostomiser colporter colpotomiser coltiner
columniser combiner combler commander commanditer commémorer commenter
commercialiser comminer commissionner commotionner commuer communaliser
communautariser communiquer communiser commuter compacifier compacter comparer
compartimenter compenser compiler compisser complanter complémenter
complétiviser complexer complimenter compliquer comploter comporter composer
composter compoter compounder compresser comprimer comptabiliser compter
compulser computer computériser concentrer conceptualiser concerner concerter
concher conciliabuler concocter concomiter concorder concrétionner concrétiser
concubiner condamner condenser condimenter conditionner confabuler
confectionner confédéraliser confesser confessionnaliser configurer confiner
confirmer confisquer confiter confluer conformer conforter confronter
confusionner congestionner conglober conglutiner congoliser congratuler
coniser conjecturer conjointer conjuger conjuguer conjurer connecter conniver
connoter conquêter consacrer conscientiser conseiller conserver consigner
consister consoler consolider consommariser consommer consonantiser consoner
conspirer conspuer constater consteller conster consterner constiper
constituer constitutionnaliser consulter consumer contacter contagionner
containeriser containériser contaminer contemner contempler conteneuriser
contenter conter contester contextualiser continentaliser contingenter
continuer contorsionner contourner contracter contractualiser contracturer
contraposer contraster contre-attaquer contrebouter contrebuter contrecalquer
contrecarrer contre-expertiser contreficher contrefraser contre-indiquer
contremander contremanifester contremarcher contremarquer contreminer
contremurer contrenquêter contreplaquer contrepointer contrer contresigner
contrespionner contretyper contreventer contribuer contrister contrôler
controuver controverser contusionner conventionnaliser conventionner
conventualiser converser convoiter convoler convoquer convulser convulsionner
cooccuper coopératiser coopter coordonner coorganiser coparrainer coparticiper
copermuter copiner copolycondenser copolymériser coprésenter coprésider copser
copter copuler copyrighter coqueliner coquer coqueriquer coquiller corailler
corder cordonner coréaliser coréaniser coréguler coresponsabiliser cornaquer
cornemuser corner coroniser corporiser correctionaliser correctionnaliser
correler corréler corroborer corroder corser corticaliser cosigner cosmétiquer
cosser costumer coter cotillonner cotiser cotonner cotransfecter couaquer
couarder couchailler coucher couchoter couchotter coucouer coucouler couder
coudrer couillonner couiner couler coulisser coupailler coupeller couper
couperoser coupler couponner courailler courbaturer courber courbetter
courcailler couronner courrieler courser courtauder court-circuiter courtiser
cousiner coussiner coûter couturer couver cracher crachiner crachoter
crachouiller crailler cramer craminer cramper cramponner crampser cramser
craner crâner crânoter cranter crapahuter crapaüter crapser crapuler craquer
crasher cratériser craticuler cratoniser cravacher cravater crawler crayonner
crédibiliser créditer crématiser créoliser créosoter crêper crépiner crépiter
crésyler crêter crétiniser creuser criailler cribler criminaliser criquer
crisper crisser cristalliser criticailler critiquer crocher croiser crôler
croquer croskiller crosser crotoniser crotter crouler croupionner crouponner
clotûrer clouer clouter coaccuser coacerver coacher coadapter coagglutiner
coaguler coaliser coaltarer coaltariser coanimer coarticuler cobelligérer
cocaïniser cocarder cocheniller cocher côcher cochonner coconiser coconner
cocooner cocoter coder codéterminer codiller coéditer coéduquer coexister
coexploiter coexprimer coffiner coffrer cofonder cogiter cogner cogouverner
cohabiter cohériter cohober coiffer coincher coincider coïncider coïter
colchiciner collaber collaborer collationner collecter collectionner
collectiviser coller collisionner colloquer colluvionner colmater
colombianiser colombiner coloniser colorer coloriser colostomiser colporter
colpotomiser coltiner columniser combiner combler commander commanditer
commémorer commenter commercialiser comminer commissionner commotionner
commuer communaliser communautariser communiquer communiser commuter
compacifier compacter comparer compartimenter compenser compiler compisser
complanter complémenter complétiviser complexer complimenter compliquer
comploter comporter composer composter compoter compounder compresser
comprimer comptabiliser compter compulser computer computériser concentrer
conceptualiser concerner concerter concher conciliabuler concocter concomiter
concorder concrétionner concrétiser concubiner condamner condenser condimenter
conditionner confabuler confectionner confédéraliser confesser
confessionnaliser configurer confiner confirmer confisquer confiter confluer
conformer conforter confronter confusionner congestionner conglober
conglutiner congoliser congratuler coniser conjecturer conjointer conjuger
conjuguer conjurer connecter conniver connoter conquêter consacrer
conscientiser conseiller conserver consigner consister consoler consolider
consommariser consommer consonantiser consoner conspirer conspuer constater
consteller conster consterner constiper constituer constitutionnaliser
consulter consumer contacter contagionner containeriser containériser
contaminer contemner contempler conteneuriser contenter conter contester
contextualiser continentaliser contingenter continuer contorsionner contourner
contracter contractualiser contracturer contraposer contraster contre-attaquer
contrebouter contrebuter contrecalquer contrecarrer contre-expertiser
contreficher contrefraser contre-indiquer contremander contremanifester
contremarcher contremarquer contreminer contremurer contrenquêter
contreplaquer contrepointer contrer contresigner contrespionner contretyper
contreventer contribuer contrister contrôler controuver controverser
contusionner conventionnaliser conventionner conventualiser converser
convoiter convoler convoquer convulser convulsionner cooccuper coopératiser
coopter coordonner coorganiser coparrainer coparticiper copermuter copiner
copolycondenser copolymériser coprésenter coprésider copser copter copuler
copyrighter coqueliner coquer coqueriquer coquiller corailler corder cordonner
coréaliser coréaniser coréguler coresponsabiliser cornaquer cornemuser corner
coroniser corporiser correctionaliser correctionnaliser correler corréler
corroborer corroder corser corticaliser cosigner cosmétiquer cosser costumer
coter cotillonner cotiser cotonner cotransfecter couaquer couarder couchailler
coucher couchoter couchotter coucouer coucouler couder coudrer couillonner
couiner couler coulisser coupailler coupeller couper couperoser coupler
couponner courailler courbaturer courber courbetter courcailler couronner
courrieler courser courtauder court-circuiter courtiser cousiner coussiner
coûter couturer couver cracher crachiner crachoter crachouiller crailler
cramer craminer cramper cramponner crampser cramser craner crâner crânoter
cranter crapahuter crapaüter crapser crapuler craquer crasher cratériser
craticuler cratoniser cravacher cravater crawler crayonner crédibiliser
créditer crématiser créoliser créosoter crêper crépiner crépiter crésyler
crêter crétiniser creuser criailler cribler criminaliser criquer crisper
crisser cristalliser criticailler critiquer crocher croiser crôler croquer
croskiller crosser crotoniser crotter crouler croupionner crouponner
croustiller croûter croûtonner cryoappliquer cryocautériser cryocoaguler
cryoconcentrer cryodécaper cryoébarber cryofixer cryogéniser cryomarquer
cryosorber crypter cuber cueiller cuider cuisiner cuiter cuivrer culbuter
culer culminer culotter culpabiliser cultiver culturaliser cumuler curariser
cryosorber crypter cuber cueiller cuider cuisiner cuivrer culbuter culer
culminer culotter culpabiliser cultiver culturaliser cumuler curariser
curedenter curer curetter customiser cuter cutiniser cuver cyaniser cyanoser
cyanurer cybernétiser cycler cycliser cycloner cylindrer dactylocoder daguer
daguerréotyper daïer daigner dailler daller damasquiner damer damner
@ -748,8 +750,8 @@ VERBS = set(
mithridatiser mitonner mitrailler mixer mixter mixtionner mobiliser modaliser
modéliser modérantiser moderniser moduler moellonner mofler moirer moiser
moissonner molarder molariser moléculariser molester moletter mollarder
molletter monarchiser mondaniser monder mondialiser monétariser monétiser
moniliser monologuer monomériser monophtonguer monopoler monopoliser
molletonner molletter monarchiser mondaniser monder mondialiser monétariser
monétiser moniliser monologuer monomériser monophtonguer monopoler monopoliser
monoprogrammer monosiallitiser monotoniser monseigneuriser monter montrer
monumentaliser moquer moquetter morailler moraliser mordailler mordiller
mordillonner mordorer mordoriser morfailler morfaler morfiler morfler morganer
@ -792,63 +794,64 @@ VERBS = set(
palpiter palucher panacher panader pancarter paner paniquer panneauter panner
pannetonner panoramiquer panser pantiner pantomimer pantoufler paoner paonner
papelarder papillonner papilloter papoter papouiller paquer paraboliser
parachuter parader parafer paraffiner paralléliser paralyser paramétriser
parangonner parapher paraphraser parasiter parcellariser parceller parcelliser
parcheminer parcoriser pardonner parementer parenthétiser parer paresser
parfiler parfumer parisianiser parjurer parkériser parlementer parler parloter
parlotter parquer parrainer participer particulariser partitionner partouzer
pasquiner pasquiniser passefiler passementer passepoiler passeriller
passionnaliser passionner pasteller pasteuriser pasticher pastiller pastoriser
patafioler pateliner patenter paternaliser paterner pathétiser patienter
patiner pâtisser patoiser pâtonner patouiller patrimonialiser patrociner
patronner patrouiller patter pâturer paumer paupériser pauser pavaner paver
pavoiser peaufiner pébriner pécher pêcher pécloter pectiser pédaler pédanter
pédantiser pédiculiser pédicurer pédimenter peigner peiner peinturer
peinturlurer péjorer pelaner pelauder péleriner pèleriner pelletiser
pelleverser pelliculer peloter pelotonner pelucher pelurer pénaliser pencher
pendeloquer pendiller pendouiller penduler pénéplaner penser pensionner
peptiser peptoniser percaliner percher percoler percuter perdurer pérégriner
pérenniser perfectionner perforer performer perfuser péricliter périmer
périodiser périphériser périphraser péritoniser perler permanenter permaner
perméabiliser permuter pérorer pérouaniser peroxyder perpétuer perquisitionner
perreyer perruquer persécuter persifler persiller persister personnaliser
persuader perturber pervibrer pester pétarader pétarder pétiller pétitionner
pétocher pétouiller pétrarquiser pétroliser pétuner peupler pexer
phacoémulsifier phagocyter phalangiser pharyngaliser phéniquer phénoler
phényler philosophailler philosopher phlébotomiser phlegmatiser phlogistiquer
phonétiser phonologiser phosphater phosphorer phosphoriser phosphoryler
photoactiver photocomposer photograver photo-ioniser photoïoniser photomonter
photophosphoryler photopolymériser photosensibiliser phraser piaffer piailler
pianomiser pianoter piauler pickler picocher picoler picorer picoter picouser
picouzer picrater pictonner picturaliser pidginiser piédestaliser pierrer
piétiner piétonnifier piétonniser pieuter pifer piffer piffrer pigeonner
pigmenter pigner pignocher pignoler piler piller pilloter pilonner piloter
pimenter pinailler pinceauter pinçoter pindariser pinter piocher pionner
piotter piper piqueniquer pique-niquer piquer piquetonner piquouser piquouzer
pirater pirouetter piser pisser pissoter pissouiller pistacher pister pistoler
pistonner pitancher pitcher piter pitonner pituiter pivoter placarder
placardiser plafonner plaider plainer plaisanter plamer plancher planer
planétariser planétiser planquer planter plaquer plasmolyser plastiquer
plastronner platiner platiniser platoniser plâtrer plébisciter pleurailler
pleuraliser pleurer pleurnicher pleuroter pleuviner pleuvioter pleuvoter
plisser plissoter plomber ploquer plotiniser plouter ploutrer plucher
plumarder plumer pluraliser plussoyer pluviner pluvioter pocharder pocher
pochetronner pochtronner poculer podzoliser poêler poétiser poignarder poigner
poiler poinçonner pointer pointiller poireauter poirer poiroter poisser
poitriner poivrer poivroter polariser poldériser polémiquer polissonner
politicailler politiquer politiser polker polliciser polliniser polluer
poloniser polychromer polycontaminer polygoner polygoniser polymériser
polyploïdiser polytransfuser polyviser pommader pommer pomper pomponner
ponctionner ponctuer ponter pontiller populariser poquer porer porphyriser
porter porteuser portionner portoricaniser portraicturer portraiturer poser
positionner positiver possibiliser postdater poster postérioriser posticher
postillonner postposer postsonoriser postsynchroniser postuler potabiliser
potentialiser poter poteyer potiner poudrer pouffer pouiller pouliner pouloper
poulotter pouponner pourpenser pourprer poussailler pousser poutser praliner
pratiquer préaccentuer préadapter préallouer préassembler préassimiler
préaviser précariser précautionner prêchailler préchauffer préchauler prêcher
précipiter préciser préciter précompter préconditionner préconfigurer
préconiser préconstituer précoter prédater prédécouper prédésigner prédestiner
parachuter parader parafer paraffiner paraisonner paralléliser paralyser
paramétriser parangonner parapher paraphraser parasiter parcellariser
parceller parcelliser parcheminer parcoriser pardonner parementer
parenthétiser parer paresser parfiler parfumer parisianiser parjurer
parkériser parlementer parler parloter parlotter parquer parrainer participer
particulariser partitionner partouzer pasquiner pasquiniser passefiler
passementer passepoiler passeriller passionnaliser passionner pasteller
pasteuriser pasticher pastiller pastoriser patafioler pateliner patenter
paternaliser paterner pathétiser patienter patiner pâtisser patoiser pâtonner
patouiller patrimonialiser patrociner patronner patrouiller patter pâturer
paumer paupériser pauser pavaner paver pavoiser peaufiner pébriner pécher
pêcher pécloter pectiser pédaler pédanter pédantiser pédiculiser pédicurer
pédimenter peigner peiner peinturer peinturlurer péjorer pelaner pelauder
péleriner pèleriner pelletiser pelleverser pelliculer peloter pelotonner
pelucher pelurer pénaliser pencher pendeloquer pendiller pendouiller penduler
pénéplaner penser pensionner peptiser peptoniser percaliner percher percoler
percuter perdurer pérégriner pérenniser perfectionner perforer performer
perfuser péricliter périmer périodiser périphériser périphraser péritoniser
perler permanenter permaner perméabiliser permuter pérorer pérouaniser
peroxyder perpétuer perquisitionner perreyer perruquer persécuter persifler
persiller persister personnaliser persuader perturber pervibrer pester
pétarader pétarder pétiller pétitionner pétocher pétouiller pétrarquiser
pétroliser pétuner peupler pexer phacoémulsifier phagocyter phalangiser
pharyngaliser phéniquer phénoler phényler philosophailler philosopher
phlébotomiser phlegmatiser phlogistiquer phonétiser phonologiser phosphater
phosphorer phosphoriser phosphoryler photoactiver photocomposer photograver
photo-ioniser photoïoniser photomonter photophosphoryler photopolymériser
photosensibiliser phraser piaffer piailler pianomiser pianoter piauler pickler
picocher picoler picorer picoter picouser picouzer picrater pictonner
picturaliser pidginiser piédestaliser pierrer piétiner piétonnifier
piétonniser pieuter pifer piffer piffrer pigeonner pigmenter pigner pignocher
pignoler piler piller pilloter pilonner piloter pimenter pinailler pinceauter
pinçoter pindariser pinter piocher pionner piotter piper piqueniquer
pique-niquer piquer piquetonner piquouser piquouzer pirater pirouetter piser
pisser pissoter pissouiller pistacher pister pistoler pistonner pitancher
pitcher piter pitonner pituiter pivoter placarder placardiser plafonner
plaider plainer plaisanter plamer plancher planer planétariser planétiser
planquer planter plaquer plasmolyser plastiquer plastronner platiner
platiniser platoniser plâtrer plébisciter pleurailler pleuraliser pleurer
pleurnicher pleuroter pleuviner pleuvioter pleuvoter plisser plissoter plomber
ploquer plotiniser plouter ploutrer plucher plumarder plumer pluraliser
plussoyer pluviner pluvioter pocharder pocher pochetronner pochtronner poculer
podzoliser poêler poétiser poignarder poigner poiler poinçonner pointer
pointiller poireauter poirer poiroter poisser poitriner poivrer poivroter
polariser poldériser polémiquer polissonner politicailler politiquer politiser
polker polliciser polliniser polluer poloniser polychromer polycontaminer
polygoner polygoniser polymériser polyploïdiser polytransfuser polyviser
pommader pommer pomper pomponner ponctionner ponctuer ponter pontiller
populariser poquer porer porphyriser porter porteuser portionner
portoricaniser portraicturer portraiturer poser positionner positiver
possibiliser postdater poster postérioriser posticher postillonner postposer
postsonoriser postsynchroniser postuler potabiliser potentialiser poter
poteyer potiner poudrer pouffer pouiller pouliner pouloper poulotter pouponner
pourpenser pourprer poussailler pousser poutser praliner pratiquer
préaccentuer préadapter préallouer préassembler préassimiler préaviser
précariser précautionner prêchailler préchauffer préchauler prêcher précipiter
préciser préciter précompter préconditionner préconfigurer préconiser
préconstituer précoter prédater prédécouper prédésigner prédestiner
prédéterminer prédiffuser prédilectionner prédiquer prédisposer prédominer
préemballer préempter préencoller préenregistrer préenrober préexaminer
préexister préfabriquer préfaner préfigurer préfixer préformater préformer
@ -879,8 +882,8 @@ VERBS = set(
raccommoder raccompagner raccorder raccoutrer raccoutumer raccrocher racémiser
rachalander racher raciner racketter racler râcler racoler raconter racoquiner
radariser rader radicaliser radiner radioactiver radiobaliser radiocommander
radioconserver radiodétecter radiodiffuser radioexposer radioguider radio-
immuniser radiolocaliser radiopasteuriser radiosonder radiostériliser
radioconserver radiodétecter radiodiffuser radioexposer radioguider
radio-immuniser radiolocaliser radiopasteuriser radiosonder radiostériliser
radiotéléphoner radiotéléviser radoter radouber rafaler raffermer raffiler
raffiner raffluer raffoler raffûter rafistoler rafler ragoter ragoûter
ragrafer raguer raguser raiguiser railler rainer rainurer raisonner rajouter
@ -1123,19 +1126,21 @@ VERBS = set(
sommer somnambuler somniloquer somnoler sonder sonnailler sonner sonoriser
sophistiquer sorguer soubresauter souder souffler souffroter soufrer souhaiter
souiller souillonner soûler souligner soûlotter soumissionner soupailler
soupçonner souper soupirer souquer sourciller sourdiner sous-capitaliser sous-
catégoriser sousestimer sous-estimer sous-industrialiser sous-médicaliser
sousperformer sous-qualifier soussigner sous-titrer sous-utiliser soutacher
souter soutirer soviétiser spammer spasmer spatialiser spatuler spécialiser
spéculer sphéroïdiser spilitiser spiraler spiraliser spirantiser spiritualiser
spitter splénectomiser spléniser sponsoriser sporter sporuler sprinter
squatériser squatter squatteriser squattériser squeezer stabiliser stabuler
staffer stagner staliniser standardiser standoliser stanioler stariser
stationner statistiquer statuer stelliter stenciler stendhaliser sténoser
sténotyper stepper stéréotyper stériliser stigmatiser stimuler stipuler
stocker stoloniser stopper stranguler stratégiser stresser strider striduler
striper stripper striquer stronker strouiller structurer strychniser stuquer
styler styliser subalterniser subdiviser subdivisionner subériser subjectiver
soupçonner souper soupirer souquer sourciller sourdiner sous-alimenter
sous-capitaliser sous-catégoriser sous-équiper sousestimer sous-estimer
sous-évaluer sous-exploiter sous-exposer sous-industrialiser sous-louer
sous-médicaliser sousperformer sous-qualifier soussigner sous-titrer
sous-traiter sous-utiliser sous-virer soutacher souter soutirer soviétiser
spammer spasmer spatialiser spatuler spécialiser spéculer sphéroïdiser
spilitiser spiraler spiraliser spirantiser spiritualiser spitter
splénectomiser spléniser sponsoriser sporter sporuler sprinter squatériser
squatter squatteriser squattériser squeezer stabiliser stabuler staffer
stagner staliniser standardiser standoliser stanioler stariser stationner
statistiquer statuer stelliter stenciler stendhaliser sténoser sténotyper
stepper stéréotyper stériliser stigmatiser stimuler stipuler stocker
stoloniser stopper stranguler stratégiser stresser strider striduler striper
stripper striquer stronker strouiller structurer strychniser stuquer styler
styliser subalterniser subdiviser subdivisionner subériser subjectiver
subjectiviser subjuguer sublimer sublimiser subluxer subminiaturiser subodorer
subordonner suborner subsister substanter substantialiser substantiver
substituer subsumer subtiliser suburbaniser subventionner succomber suçoter

File diff suppressed because it is too large.

@ -1,7 +1,7 @@
# coding: utf8
from __future__ import unicode_literals
from ....symbols import POS, NOUN, VERB, ADJ, ADV, PRON, DET, AUX, PUNCT
from ....symbols import POS, NOUN, VERB, ADJ, ADV, PRON, DET, AUX, PUNCT, ADP, SCONJ, CCONJ
from ....symbols import VerbForm_inf, VerbForm_none, Number_sing, Degree_pos
from .lookup import LOOKUP
@ -9,7 +9,7 @@ from .lookup import LOOKUP
French language lemmatizer applies the default rule based lemmatization
procedure with some modifications for better French language support.
The parts of speech 'ADV', 'PRON', 'DET' and 'AUX' are added to use the
The parts of speech 'ADV', 'PRON', 'DET', 'ADP' and 'AUX' are added to use the
rule-based lemmatization. As a last resort, the lemmatizer checks in
the lookup table.
'''
@ -34,16 +34,22 @@ class FrenchLemmatizer(object):
univ_pos = 'verb'
elif univ_pos in (ADJ, 'ADJ', 'adj'):
univ_pos = 'adj'
elif univ_pos in (ADP, 'ADP', 'adp'):
univ_pos = 'adp'
elif univ_pos in (ADV, 'ADV', 'adv'):
univ_pos = 'adv'
elif univ_pos in (PRON, 'PRON', 'pron'):
univ_pos = 'pron'
elif univ_pos in (DET, 'DET', 'det'):
univ_pos = 'det'
elif univ_pos in (AUX, 'AUX', 'aux'):
univ_pos = 'aux'
elif univ_pos in (CCONJ, 'CCONJ', 'cconj'):
univ_pos = 'cconj'
elif univ_pos in (DET, 'DET', 'det'):
univ_pos = 'det'
elif univ_pos in (PRON, 'PRON', 'pron'):
univ_pos = 'pron'
elif univ_pos in (PUNCT, 'PUNCT', 'punct'):
univ_pos = 'punct'
elif univ_pos in (SCONJ, 'SCONJ', 'sconj'):
univ_pos = 'sconj'
else:
return [self.lookup(string)]
# See Issue #435 for example of where this logic is required.
@ -100,7 +106,7 @@ class FrenchLemmatizer(object):
def lookup(self, string):
if string in self.lookup_table:
return self.lookup_table[string]
return self.lookup_table[string][0]
return string
@ -125,7 +131,7 @@ def lemmatize(string, index, exceptions, rules):
if not forms:
forms.extend(oov_forms)
if not forms and string in LOOKUP.keys():
forms.append(LOOKUP[string])
forms.append(LOOKUP[string][0])
if not forms:
forms.append(string)
return list(set(forms))
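Both hunks above index the lookup entry with `[0]`, which suggests that the French table now holds a sequence of candidate lemmas per surface form. The following is a minimal sketch of that access pattern, not the actual lemmatizer: `LOOKUP_SKETCH` and its entries are hypothetical stand-ins, and the assumption that values are lists is inferred only from the indexing shown in the diff.

```python
# Hypothetical miniature table. The real LOOKUP is far larger, and its value
# type is an assumption inferred from the `[0]` indexing in the diff.
LOOKUP_SKETCH = {
    "suis": ["être", "suivre"],
    "herbes": ["herbe"],
}


def lookup(string, table=LOOKUP_SKETCH):
    """Return the first candidate lemma, or the string itself if unknown."""
    if string in table:
        return table[string][0]
    return string
```

Taking the first candidate keeps the return type a plain string, so callers that expect a single lemma do not need to change.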

File diff suppressed because it is too large.

@ -1,16 +1,15 @@
# encoding: utf8
from __future__ import unicode_literals, print_function
from ...language import Language
from ...attrs import LANG
from ...tokens import Doc, Token
from ...tokenizer import Tokenizer
from ... import util
from .tag_map import TAG_MAP
import re
from collections import namedtuple
from .tag_map import TAG_MAP
from ...attrs import LANG
from ...language import Language
from ...tokens import Doc, Token
from ...util import DummyTokenizer
ShortUnitWord = namedtuple("ShortUnitWord", ["surface", "lemma", "pos"])
@ -46,12 +45,12 @@ def resolve_pos(token):
# PoS mappings.
if token.pos == "連体詞,*,*,*":
if re.match("^[こそあど此其彼]の", token.surface):
if re.match(r"[こそあど此其彼]の", token.surface):
return token.pos + ",DET"
if re.match("^[こそあど此其彼]", token.surface):
if re.match(r"[こそあど此其彼]", token.surface):
return token.pos + ",PRON"
else:
return token.pos + ",ADJ"
return token.pos + ",ADJ"
return token.pos
@ -68,7 +67,8 @@ def detailed_tokens(tokenizer, text):
pos = ",".join(parts[0:4])
if len(parts) > 7:
# this information is only available for words in the tokenizer dictionary
# this information is only available for words in the tokenizer
# dictionary
base = parts[7]
words.append(ShortUnitWord(surface, base, pos))
@ -76,38 +76,27 @@ def detailed_tokens(tokenizer, text):
return words
class JapaneseTokenizer(object):
class JapaneseTokenizer(DummyTokenizer):
def __init__(self, cls, nlp=None):
self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
MeCab = try_mecab_import()
self.tokenizer = MeCab.Tagger()
self.tokenizer = try_mecab_import().Tagger()
self.tokenizer.parseToNode("") # see #2901
def __call__(self, text):
dtokens = detailed_tokens(self.tokenizer, text)
words = [x.surface for x in dtokens]
doc = Doc(self.vocab, words=words, spaces=[False] * len(words))
spaces = [False] * len(words)
doc = Doc(self.vocab, words=words, spaces=spaces)
for token, dtoken in zip(doc, dtokens):
token._.mecab_tag = dtoken.pos
token.tag_ = resolve_pos(dtoken)
token.lemma_ = dtoken.lemma
return doc
# add dummy methods for to_bytes, from_bytes, to_disk and from_disk to
# allow serialization (see #1557)
def to_bytes(self, **exclude):
return b""
def from_bytes(self, bytes_data, **exclude):
return self
def to_disk(self, path, **exclude):
return None
def from_disk(self, path, **exclude):
return self
class JapaneseCharacterSegmenter(object):
def __init__(self, vocab):
@ -154,7 +143,8 @@ class JapaneseCharacterSegmenter(object):
class JapaneseDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: "ja"
lex_attr_getters[LANG] = lambda _text: "ja"
tag_map = TAG_MAP
use_janome = True
@ -169,7 +159,6 @@ class JapaneseDefaults(Language.Defaults):
class Japanese(Language):
lang = "ja"
Defaults = JapaneseDefaults
Tokenizer = JapaneseTokenizer
def make_doc(self, text):
return self.tokenizer(text)
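The `resolve_pos` hunk above drops the leading `^` from both patterns. Because `re.match` only ever matches at the start of the string, the anchor was redundant. A small sketch of the same branch order, where `classify_rentaishi` is a hypothetical helper name and the return values are shortened to the bare tag suffix:

```python
import re


def classify_rentaishi(surface):
    """Mirror the branch order shown in resolve_pos for adnominal tokens."""
    # re.match always starts at position 0, so no explicit ^ is needed.
    if re.match(r"[こそあど此其彼]の", surface):
        return "DET"
    if re.match(r"[こそあど此其彼]", surface):
        return "PRON"
    return "ADJ"
```

The more specific two-character pattern has to be tested first, since any string it matches would also match the single-character pattern.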


@ -5,6 +5,7 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS
from .morph_rules import MORPH_RULES
from .lemmatizer import LEMMA_RULES, LOOKUP
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
@ -20,12 +21,14 @@ class SwedishDefaults(Language.Defaults):
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
morph_rules = MORPH_RULES
infixes = TOKENIZER_INFIXES
suffixes = TOKENIZER_SUFFIXES
stop_words = STOP_WORDS
lemma_rules = LEMMA_RULES
lemma_lookup = LOOKUP
morph_rules = MORPH_RULES
class Swedish(Language):
lang = "sv"
Defaults = SwedishDefaults


@ -233167,7 +233167,6 @@ LOOKUP = {
"jades": "jade",
"jaet": "ja",
"jaets": "ja",
"jag": "jaga",
"jagad": "jaga",
"jagade": "jaga",
"jagades": "jaga",


@ -0,0 +1,25 @@
# coding: utf8
"""Punctuation stolen from Danish"""
from __future__ import unicode_literals
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
from ..char_classes import QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
from ..punctuation import TOKENIZER_SUFFIXES
_quotes = QUOTES.replace("'", '')
_infixes = (LIST_ELLIPSES + LIST_ICONS +
[r'(?<=[{}])\.(?=[{}])'.format(ALPHA_LOWER, ALPHA_UPPER),
r'(?<=[{a}])[,!?](?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}])([{q}\)\]\(\[])(?=[\{a}])'.format(a=ALPHA, q=_quotes),
r'(?<=[{a}])--(?=[{a}])'.format(a=ALPHA)])
_suffixes = [suffix for suffix in TOKENIZER_SUFFIXES if suffix not in ["'s", "'S", "s", "S", r"\'"]]
_suffixes += [r"(?<=[^sSxXzZ])\'"]
TOKENIZER_INFIXES = _infixes
TOKENIZER_SUFFIXES = _suffixes


@ -26,14 +26,15 @@ for verb_data in [
{ORTH: "u", LEMMA: PRON_LEMMA, NORM: "du"},
]
# The weekday abbreviation "sön." (for "söndag" / "söner") is left out
# because it is ambiguous. The same applies to the abbreviations
# "jul." and "Jul." ("juli" / "jul").
for exc_data in [
{ORTH: "jan.", LEMMA: "januari"},
{ORTH: "febr.", LEMMA: "februari"},
{ORTH: "feb.", LEMMA: "februari"},
{ORTH: "apr.", LEMMA: "april"},
{ORTH: "jun.", LEMMA: "juni"},
{ORTH: "jul.", LEMMA: "juli"},
{ORTH: "aug.", LEMMA: "augusti"},
{ORTH: "sept.", LEMMA: "september"},
{ORTH: "sep.", LEMMA: "september"},
@ -46,13 +47,11 @@ for exc_data in [
{ORTH: "tors.", LEMMA: "torsdag"},
{ORTH: "fre.", LEMMA: "fredag"},
{ORTH: "lör.", LEMMA: "lördag"},
{ORTH: "sön.", LEMMA: "söndag"},
{ORTH: "Jan.", LEMMA: "Januari"},
{ORTH: "Febr.", LEMMA: "Februari"},
{ORTH: "Feb.", LEMMA: "Februari"},
{ORTH: "Apr.", LEMMA: "April"},
{ORTH: "Jun.", LEMMA: "Juni"},
{ORTH: "Jul.", LEMMA: "Juli"},
{ORTH: "Aug.", LEMMA: "Augusti"},
{ORTH: "Sept.", LEMMA: "September"},
{ORTH: "Sep.", LEMMA: "September"},
@ -65,28 +64,32 @@ for exc_data in [
{ORTH: "Tors.", LEMMA: "Torsdag"},
{ORTH: "Fre.", LEMMA: "Fredag"},
{ORTH: "Lör.", LEMMA: "Lördag"},
{ORTH: "Sön.", LEMMA: "Söndag"},
{ORTH: "sthlm", LEMMA: "Stockholm"},
{ORTH: "gbg", LEMMA: "Göteborg"},
]:
_exc[exc_data[ORTH]] = [exc_data]
# Specific case abbreviations only
for orth in ["AB", "Dr.", "H.M.", "H.K.H.", "m/s", "M/S", "Ph.d.", "S:t", "s:t"]:
_exc[orth] = [{ORTH: orth}]
ABBREVIATIONS = [
"ang",
"anm",
"bil",
"bl.a",
"d.v.s",
"doc",
"dvs",
"e.d",
"e.kr",
"el",
"el.",
"eng",
"etc",
"exkl",
"f",
"ev",
"f.",
"f.d",
"f.kr",
"f.n",
@ -97,10 +100,11 @@ ABBREVIATIONS = [
"fr.o.m",
"förf",
"inkl",
"jur",
"iofs",
"jur.",
"kap",
"kl",
"kor",
"kor.",
"kr",
"kungl",
"lat",
@ -109,9 +113,10 @@ ABBREVIATIONS = [
"m.m",
"max",
"milj",
"min",
"min.",
"mos",
"mt",
"mvh",
"o.d",
"o.s.v",
"obs",
@ -125,21 +130,27 @@ ABBREVIATIONS = [
"s.k",
"s.t",
"sid",
"s:t",
"t.ex",
"t.h",
"t.o.m",
"t.v",
"tel",
"ung",
"ung.",
"vol",
"v.",
"äv",
"övers",
]
ABBREVIATIONS = [abbr + "." for abbr in ABBREVIATIONS] + ABBREVIATIONS
# Add a variant with a trailing period for each abbreviation, skipping
# abbreviations that already end in one.
for abbr in list(ABBREVIATIONS):
    if not abbr.endswith("."):
        ABBREVIATIONS.append(abbr + ".")
for orth in ABBREVIATIONS:
_exc[orth] = [{ORTH: orth}]
capitalized = orth.capitalize()
_exc[capitalized] = [{ORTH: capitalized}]
# Sentences ending in "i." (as in "... peka i."), "m." (as in "...än 2000 m."),
# should be tokenized as two separate tokens.
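The abbreviation handling above can be sketched without mutating the list while it is being iterated. This is an illustrative rewrite under stated assumptions, not the module's code: `with_trailing_period` and `build_exceptions` are hypothetical helpers, and the string key `"ORTH"` stands in for spaCy's `ORTH` symbol.

```python
def with_trailing_period(abbreviations):
    """Return the list plus a dotted variant of every entry lacking one."""
    expanded = list(abbreviations)
    for abbr in abbreviations:
        if not abbr.endswith("."):
            expanded.append(abbr + ".")
    return expanded


def build_exceptions(abbreviations):
    """Map each form and its capitalized variant to a one-token exception."""
    exc = {}
    for orth in with_trailing_period(abbreviations):
        # "ORTH" is a placeholder key for illustration only.
        exc[orth] = [{"ORTH": orth}]
        capitalized = orth.capitalize()
        exc[capitalized] = [{"ORTH": capitalized}]
    return exc
```

For a single entry such as `"kl"`, this yields four exception keys: the bare and dotted forms, each in lower case and capitalized.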

24
spacy/lang/ta/__init__.py Normal file

@ -0,0 +1,24 @@
# import language-specific data
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language
from ...attrs import LANG
from ...util import update_exc
# create Defaults class in the module scope (necessary for pickling!)
class TamilDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'ta' # language ISO code
# optional: replace flags with custom functions, e.g. like_num()
lex_attr_getters.update(LEX_ATTRS)
# create actual Language class
class Tamil(Language):
lang = 'ta' # language ISO code
Defaults = TamilDefaults # override defaults
# set default export; this allows the language class to be lazy-loaded
__all__ = ['Tamil']

21
spacy/lang/ta/examples.py Normal file

@ -0,0 +1,21 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.ta.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"கிறிஸ்துமஸ் மற்றும் இனிய புத்தாண்டு வாழ்த்துக்கள்",
"எனக்கு என் குழந்தைப் பருவம் நினைவிருக்கிறது",
"உங்கள் பெயர் என்ன?",
"ஏறத்தாழ இலங்கைத் தமிழரில் மூன்றிலொரு பங்கினர் இலங்கையை விட்டு வெளியேறிப் பிற நாடுகளில் வாழ்கின்றனர்",
"இந்த ஃபோனுடன் சுமார் ரூ.2,990 மதிப்புள்ள போட் ராக்கர்ஸ் நிறுவனத்தின் ஸ்போர்ட் புளூடூத் ஹெட்போன்ஸ் இலவசமாக வழங்கப்படவுள்ளது.",
"மட்டக்களப்பில் பல இடங்களில் வீட்டுத் திட்டங்களுக்கு இன்று அடிக்கல் நாட்டல்",
"ஐ போன்க்கு முகத்தை வைத்து அன்லாக் செய்யும் முறை மற்றும் விரலால் தொட்டு அன்லாக் செய்யும் முறையை வாட்ஸ் ஆப் நிறுவனம் இதற்கு முன் கண்டுபிடித்தது"
]


@ -0,0 +1,44 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_numeral_suffixes = {'பத்து': 'பது', 'ற்று': 'று', 'ரத்து':'ரம்' , 'சத்து': 'சம்'}
_num_words = ['பூச்சியம்', 'ஒரு', 'ஒன்று', 'இரண்டு', 'மூன்று', 'நான்கு', 'ஐந்து', 'ஆறு', 'ஏழு',
'எட்டு', 'ஒன்பது', 'பத்து', 'பதினொன்று', 'பன்னிரண்டு', 'பதின்மூன்று', 'பதினான்கு',
'பதினைந்து', 'பதினாறு', 'பதினேழு', 'பதினெட்டு', 'பத்தொன்பது', 'இருபது',
'முப்பது', 'நாற்பது', 'ஐம்பது', 'அறுபது', 'எழுபது', 'எண்பது', 'தொண்ணூறு',
'நூறு', 'இருநூறு', 'முன்னூறு', 'நாநூறு', 'ஐநூறு', 'அறுநூறு', 'எழுநூறு', 'எண்ணூறு', 'தொள்ளாயிரம்',
'ஆயிரம்', 'ஒராயிரம்', 'லட்சம்', 'மில்லியன்', 'கோடி', 'பில்லியன்', 'டிரில்லியன்']
# 20-89, 90-899, 900-99999 and above have different suffixes
def suffix_filter(text):
    # text without numeral suffixes
    for num_suffix in _numeral_suffixes.keys():
        length = len(num_suffix)
        if len(text) < length:
            # too short for this suffix; try the next one
            continue
        elif text.endswith(num_suffix):
            return text[:-length] + _numeral_suffixes[num_suffix]
    return text
def like_num(text):
text = text.replace(',', '').replace('.', '')
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
return True
elif suffix_filter(text) in _num_words:
return True
return False
LEX_ATTRS = {
LIKE_NUM: like_num
}
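To make the intended behaviour of `like_num` easier to check, here is a self-contained sketch with a reduced word list. `NUM_WORDS` is an illustrative subset of `_num_words`, and the length check and debug output from the module are left out on the assumption that `str.endswith` already covers the short-text case.

```python
# Small illustrative subsets; the full tables are in the hunk above.
NUMERAL_SUFFIXES = {"பத்து": "பது", "ற்று": "று", "ரத்து": "ரம்", "சத்து": "சம்"}
NUM_WORDS = {"ஒன்று", "இருபது", "நூறு", "ஆயிரம்"}


def suffix_filter(text):
    """Rewrite a combining numeral suffix to its standalone form."""
    for suffix, replacement in NUMERAL_SUFFIXES.items():
        # endswith is already False for texts shorter than the suffix, so
        # no length check (and no early exit from the loop) is needed.
        if text.endswith(suffix):
            return text[: -len(suffix)] + replacement
    return text


def like_num(text):
    """Return True for digits, simple fractions and known numeral words."""
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text in NUM_WORDS or suffix_filter(text) in NUM_WORDS
```

The suffix step lets a combining form such as `இருபத்து` resolve to the listed numeral `இருபது` before the word-list check.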


@ -0,0 +1,148 @@
# coding: utf8
from __future__ import unicode_literals
_exc = {
# Regional words normal
# Sri Lanka - Wikipedia
"இங்க": "இங்கே",
"வாங்க": "வாருங்கள்",
'ஒண்டு':'ஒன்று',
'கண்டு': 'கன்று',
'கொண்டு': 'கொன்று',
'பண்டி': 'பன்றி',
'பச்ச': 'பச்சை',
'அம்பது': 'ஐம்பது',
'வெச்ச': 'வைத்து',
'வச்ச': 'வைத்து',
'வச்சி': 'வைத்து',
'வாளைப்பழம்':'வாழைப்பழம்',
'மண்ணு': 'மண்',
'பொன்னு': 'பொன்',
'சாவல்': 'சேவல்',
'அங்கால': 'அங்கு ',
'அசுப்பு': 'நடமாட்டம்',
'எழுவான் கரை': 'எழுவான்கரை',
'ஓய்யாரம்': 'எழில் ',
'ஒளும்பு': 'எழும்பு',
'ஓர்மை': 'துணிவு',
'கச்சை': 'கோவணம்',
'கடப்பு': 'தெருவாசல்',
'சுள்ளி': 'காய்ந்த குச்சி',
'திறாவுதல்': 'தடவுதல்',
'நாசமறுப்பு': 'தொல்லை',
'பரிசாரி': 'வைத்தியன்',
'பறவாதி': 'பேராசைக்காரன்',
'பிசினி': 'உலோபி ',
'விசர்': 'பைத்தியம்',
'ஏனம்': 'பாத்திரம்',
'ஏலா': 'இயலாது',
'ஒசில்': 'அழகு',
'ஒள்ளுப்பம்': 'கொஞ்சம்',
# Sri Lankan and Indian
'குத்துமதிப்பு': '',
'நூனாயம்': 'நூல்நயம்',
'பைய': 'மெதுவாக',
'மண்டை': 'தலை',
'வெள்ளனே': 'சீக்கிரம்',
'உசுப்பு': 'எழுப்பு',
'ஆணம்': 'குழம்பு',
'உறக்கம்': 'தூக்கம்',
'பஸ்': 'பேருந்து',
'களவு': 'திருட்டு ',
#relationship
'புருசன்': 'கணவன்',
'பொஞ்சாதி': 'மனைவி',
'புள்ள': 'பிள்ளை',
'பிள்ள': 'பிள்ளை',
'ஆம்பிளப்புள்ள': 'ஆண் பிள்ளை',
'பொம்பிளப்புள்ள': 'பெண் பிள்ளை',
'அண்ணாச்சி': 'அண்ணா',
'அக்காச்சி': 'அக்கா',
'தங்கச்சி': 'தங்கை',
#difference words
'பொடியன்': 'சிறுவன்',
'பொட்டை': 'சிறுமி',
'பிறகு': 'பின்பு',
'டக்கென்டு': 'விரைவாக',
'கெதியா': 'விரைவாக',
'கிறுகி': 'திரும்பி',
'போயித்து வாறன்': 'போய் வருகிறேன்',
'வருவாங்களா': 'வருவார்களா',
# regular spoken forms
'சொல்லு': 'சொல்',
'கேளு': 'கேள்',
'சொல்லுங்க': 'சொல்லுங்கள்',
'கேளுங்க': 'கேளுங்கள்',
'நீங்கள்': 'நீ',
'உன்': 'உன்னுடைய',
# Portuguese formal words
'அலவாங்கு': 'கடப்பாரை',
'ஆசுப்பத்திரி': 'மருத்துவமனை',
'உரோதை': 'சில்லு',
'கடுதாசி': 'கடிதம்',
'கதிரை': 'நாற்காலி',
'குசினி': 'அடுக்களை',
'கோப்பை': 'கிண்ணம்',
'சப்பாத்து': 'காலணி',
'தாச்சி': 'இரும்புச் சட்டி',
'துவாய்': 'துவாலை',
'தவறணை': 'மதுக்கடை',
'பீப்பா': 'மரத்தாழி',
'யன்னல்': 'சாளரம்',
'வாங்கு': 'மரஇருக்கை',
# Dutch formal words
'இறாக்கை': 'பற்சட்டம்',
'இலாட்சி': 'இழுப்பறை',
'கந்தோர்': 'பணிமனை',
'நொத்தாரிசு': 'ஆவண எழுத்துபதிவாளர்',
# English formal words
'இஞ்சினியர்': 'பொறியியலாளர்',
'சூப்பு': 'ரசம்',
'செக்': 'காசோலை',
'சேட்டு': 'மேற்ச்சட்டை',
'மார்க்கட்டு': 'சந்தை',
'விண்ணன்': 'கெட்டிக்காரன்',
# Arabic formal words
'ஈமான்': 'நம்பிக்கை',
'சுன்னத்து': 'விருத்தசேதனம்',
'செய்த்தான்': 'பிசாசு',
'மவுத்து': 'இறப்பு',
'ஹலால்': 'அங்கீகரிக்கப்பட்டது',
'கறாம்': 'நிராகரிக்கப்பட்டது',
# Persian, Hindustani and Hindi formal words
'சுமார்': 'கிட்டத்தட்ட',
'சிப்பாய்': 'போர்வீரன்',
'சிபார்சு': 'சிபாரிசு',
'ஜமீன்': 'பணக்காரா்',
'அசல்': 'மெய்யான',
'அந்தஸ்து': 'கௌரவம்',
'ஆஜர்': 'சமா்ப்பித்தல்',
'உசார்': 'எச்சரிக்கை',
'அச்சா':'நல்ல',
# English words used in text conversations
"bcoz": "ஏனெனில்",
"bcuz": "ஏனெனில்",
"fav": "விருப்பமான",
"morning": "காலை வணக்கம்",
"gdeveng": "மாலை வணக்கம்",
"gdnyt": "இரவு வணக்கம்",
"gdnit": "இரவு வணக்கம்",
"plz": "தயவு செய்து",
"pls": "தயவு செய்து",
"thx": "நன்றி",
"thanx": "நன்றி",
}
NORM_EXCEPTIONS = {}
for string, norm in _exc.items():
NORM_EXCEPTIONS[string] = norm

133
spacy/lang/ta/stop_words.py Normal file

@ -0,0 +1,133 @@
# coding: utf8
from __future__ import unicode_literals
# Stop words
STOP_WORDS = set("""
ஒர
என
மற
இந
இத
என
எனபத
பல
ஆக
அலலத
அவர
உள
அந
இவர
என
தல
என
இர
ி
என
வந
இதன
அத
அவன
பலர
என
ினர
இர
தனத
உளளத
என
அதன
தன
ிறக
அவரகள
வர
அவள
ஆகி
இரதத
உளளன
வந
இர
ிகவ
இங
ஓர
இவ
இநதக
பறி
வர
இர
இதி
இப
அவரத
மட
இநதப
என
ி
ஆகி
எனக
இன
அநதப
அன
ஒர
ி
அங
பல
ி
அத
பறி
உன
அதி
அநதக
இதன
அவ
அத
ஏன
எனபத
எல
மட
இங
அங
இடம
இடதி
அதி
அதற
எனவ
ி
ி
மற
ி
எந
எனவ
எனபபட
எனி
அட
இதன
இத
இநதத
இதற
அதன
தவி
வரி
சற
எனக
""".split())


@ -5,24 +5,14 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .tag_map import TAG_MAP
from .stop_words import STOP_WORDS
from ...tokens import Doc
from ...language import Language
from ...attrs import LANG
from ...language import Language
from ...tokens import Doc
from ...util import DummyTokenizer
class ThaiDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: "th"
tokenizer_exceptions = dict(TOKENIZER_EXCEPTIONS)
tag_map = TAG_MAP
stop_words = STOP_WORDS
class Thai(Language):
lang = "th"
Defaults = ThaiDefaults
def make_doc(self, text):
class ThaiTokenizer(DummyTokenizer):
def __init__(self, cls, nlp=None):
try:
from pythainlp.tokenize import word_tokenize
except ImportError:
@ -30,8 +20,35 @@ class Thai(Language):
"The Thai tokenizer requires the PyThaiNLP library: "
"https://github.com/PyThaiNLP/pythainlp"
)
words = [x for x in list(word_tokenize(text, "newmm"))]
return Doc(self.vocab, words=words, spaces=[False] * len(words))
self.word_tokenize = word_tokenize
self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
def __call__(self, text):
words = list(self.word_tokenize(text, "newmm"))
spaces = [False] * len(words)
return Doc(self.vocab, words=words, spaces=spaces)
class ThaiDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda _text: "th"
tokenizer_exceptions = dict(TOKENIZER_EXCEPTIONS)
tag_map = TAG_MAP
stop_words = STOP_WORDS
@classmethod
def create_tokenizer(cls, nlp=None):
return ThaiTokenizer(cls, nlp)
class Thai(Language):
lang = "th"
Defaults = ThaiDefaults
def make_doc(self, text):
return self.tokenizer(text)
__all__ = ["Thai"]
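Both the Japanese and Thai tokenizers in this commit now inherit their serialization methods from `DummyTokenizer` instead of defining them inline. The pattern can be sketched without spaCy as follows. `DummyTokenizerSketch` and `WrappedTokenizer` are hypothetical names; the method bodies copy the no-op methods removed from `JapaneseTokenizer` above, and the sketch returns plain lists where the real tokenizers build a `Doc`.

```python
class DummyTokenizerSketch(object):
    """Stand-in for a base class with no-op serialization hooks."""

    def to_bytes(self, **exclude):
        return b""

    def from_bytes(self, bytes_data, **exclude):
        return self

    def to_disk(self, path, **exclude):
        return None

    def from_disk(self, path, **exclude):
        return self


class WrappedTokenizer(DummyTokenizerSketch):
    """Hypothetical tokenizer delegating to an external segmenter function."""

    def __init__(self, segment):
        self.segment = segment

    def __call__(self, text):
        words = list(self.segment(text))
        # Scripts written without spaces: no token is followed by whitespace.
        spaces = [False] * len(words)
        return words, spaces
```

Keeping the hooks in one base class means a new wrapper tokenizer only has to implement `__init__` and `__call__`.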


@ -5,6 +5,7 @@ from ...attrs import LIKE_NUM
# Thirteen, fifteen etc. are written separate: on üç
_num_words = [
"bir",
"iki",
@ -28,6 +29,7 @@ _num_words = [
"bin",
"milyon",
"milyar",
"trilyon",
"katrilyon",
"kentilyon",
]


@ -353,10 +353,38 @@ def test_doc_api_similarity_match():
assert doc.similarity(doc2) == 0.0
def test_lowest_common_ancestor(en_tokenizer):
tokens = en_tokenizer("the lazy dog slept")
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[2, 1, 1, 0])
@pytest.mark.parametrize(
"sentence,heads,lca_matrix",
[
(
"the lazy dog slept",
[2, 1, 1, 0],
numpy.array([[0, 2, 2, 3], [2, 1, 2, 3], [2, 2, 2, 3], [3, 3, 3, 3]]),
),
(
"The lazy dog slept. The quick fox jumped",
[2, 1, 1, 0, -1, 2, 1, 1, 0],
numpy.array(
[
[0, 2, 2, 3, 3, -1, -1, -1, -1],
[2, 1, 2, 3, 3, -1, -1, -1, -1],
[2, 2, 2, 3, 3, -1, -1, -1, -1],
[3, 3, 3, 3, 3, -1, -1, -1, -1],
[3, 3, 3, 3, 4, -1, -1, -1, -1],
[-1, -1, -1, -1, -1, 5, 7, 7, 8],
[-1, -1, -1, -1, -1, 7, 6, 7, 8],
[-1, -1, -1, -1, -1, 7, 7, 7, 8],
[-1, -1, -1, -1, -1, 8, 8, 8, 8],
]
),
),
],
)
def test_lowest_common_ancestor(en_tokenizer, sentence, heads, lca_matrix):
tokens = en_tokenizer(sentence)
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
lca = doc.get_lca_matrix()
assert (lca == lca_matrix).all()
assert lca[1, 1] == 1
assert lca[0, 1] == 2
assert lca[1, 2] == 2


@ -80,10 +80,24 @@ def test_spans_lca_matrix(en_tokenizer):
tokens = en_tokenizer("the lazy dog slept")
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[2, 1, 1, 0])
lca = doc[:2].get_lca_matrix()
assert lca[0, 0] == 0
assert lca[0, 1] == -1
assert lca[1, 0] == -1
assert lca[1, 1] == 1
assert lca.shape == (2, 2)
assert lca[0, 0] == 0 # the & the -> the
assert lca[0, 1] == -1 # the & lazy -> dog (out of span)
assert lca[1, 0] == -1 # lazy & the -> dog (out of span)
assert lca[1, 1] == 1 # lazy & lazy -> lazy
lca = doc[1:].get_lca_matrix()
assert lca.shape == (3, 3)
assert lca[0, 0] == 0 # lazy & lazy -> lazy
assert lca[0, 1] == 1 # lazy & dog -> dog
assert lca[0, 2] == 2 # lazy & slept -> slept
lca = doc[2:].get_lca_matrix()
assert lca.shape == (2, 2)
assert lca[0, 0] == 0 # dog & dog -> dog
assert lca[0, 1] == 1 # dog & slept -> slept
assert lca[1, 0] == 1 # slept & dog -> slept
assert lca[1, 1] == 1 # slept & slept -> slept
def test_span_similarity_match():
@ -158,15 +172,17 @@ def test_span_as_doc(doc):
def test_span_string_label(doc):
span = Span(doc, 0, 1, label='hello')
assert span.label_ == 'hello'
assert span.label == doc.vocab.strings['hello']
span = Span(doc, 0, 1, label="hello")
assert span.label_ == "hello"
assert span.label == doc.vocab.strings["hello"]
def test_span_string_set_label(doc):
span = Span(doc, 0, 1)
span.label_ = 'hello'
assert span.label_ == 'hello'
assert span.label == doc.vocab.strings['hello']
span.label_ = "hello"
assert span.label_ == "hello"
assert span.label == doc.vocab.strings["hello"]
def test_span_ents_property(doc):
"""Test span.ents for the """


@ -0,0 +1,53 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
SV_TOKEN_EXCEPTION_TESTS = [
('Smörsåsen används bl.a. till fisk', ['Smörsåsen', 'används', 'bl.a.', 'till', 'fisk']),
('Jag kommer först kl. 13 p.g.a. diverse förseningar', ['Jag', 'kommer', 'först', 'kl.', '13', 'p.g.a.', 'diverse', 'förseningar']),
('Anders I. tycker om ord med i i.', ["Anders", "I.", "tycker", "om", "ord", "med", "i", "i", "."])
]
@pytest.mark.parametrize('text,expected_tokens', SV_TOKEN_EXCEPTION_TESTS)
def test_sv_tokenizer_handles_exception_cases(sv_tokenizer, text, expected_tokens):
tokens = sv_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list
@pytest.mark.parametrize('text', ["driveru", "hajaru", "Serru", "Fixaru"])
def test_sv_tokenizer_handles_verb_exceptions(sv_tokenizer, text):
tokens = sv_tokenizer(text)
assert len(tokens) == 2
assert tokens[1].text == "u"
@pytest.mark.parametrize('text',
["bl.a", "m.a.o.", "Jan.", "Dec.", "kr.", "osv."])
def test_sv_tokenizer_handles_abbr(sv_tokenizer, text):
tokens = sv_tokenizer(text)
assert len(tokens) == 1
@pytest.mark.parametrize('text', ["Jul.", "jul.", "sön.", "Sön."])
def test_sv_tokenizer_handles_ambiguous_abbr(sv_tokenizer, text):
tokens = sv_tokenizer(text)
assert len(tokens) == 2
def test_sv_tokenizer_handles_exc_in_text(sv_tokenizer):
text = "Det er bl.a. ikke meningen"
tokens = sv_tokenizer(text)
assert len(tokens) == 5
assert tokens[2].text == "bl.a."
def test_sv_tokenizer_handles_custom_base_exc(sv_tokenizer):
text = "Her er noget du kan kigge i."
tokens = sv_tokenizer(text)
assert len(tokens) == 8
assert tokens[6].text == "i"
assert tokens[7].text == "."


@ -0,0 +1,15 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize('string,lemma', [('DNA-profilernas', 'DNA-profil'),
('Elfenbenskustens', 'Elfenbenskusten'),
('abortmotståndarens', 'abortmotståndare'),
('kolesterols', 'kolesterol'),
('portionssnusernas', 'portionssnus'),
('åsyns', 'åsyn')])
def test_lemmatizer_lookup_assigns(sv_tokenizer, string, lemma):
tokens = sv_tokenizer(string)
assert tokens[0].lemma_ == lemma


@ -0,0 +1,37 @@
# coding: utf-8
"""Test that tokenizer prefixes, suffixes and infixes are handled correctly."""
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize('text', ["(under)"])
def test_tokenizer_splits_no_special(sv_tokenizer, text):
tokens = sv_tokenizer(text)
assert len(tokens) == 3
@pytest.mark.parametrize('text', ["gitta'r", "Björn's", "Lars'"])
def test_tokenizer_handles_no_punct(sv_tokenizer, text):
tokens = sv_tokenizer(text)
assert len(tokens) == 1
@pytest.mark.parametrize('text', ["svart.Gul", "Hej.Världen"])
def test_tokenizer_splits_period_infix(sv_tokenizer, text):
tokens = sv_tokenizer(text)
assert len(tokens) == 3
@pytest.mark.parametrize('text', ["Hej,Världen", "en,två"])
def test_tokenizer_splits_comma_infix(sv_tokenizer, text):
tokens = sv_tokenizer(text)
assert len(tokens) == 3
assert tokens[0].text == text.split(",")[0]
assert tokens[1].text == ","
assert tokens[2].text == text.split(",")[1]
@pytest.mark.parametrize('text', ["svart...Gul", "svart...gul"])
def test_tokenizer_splits_ellipsis_infix(sv_tokenizer, text):
tokens = sv_tokenizer(text)
assert len(tokens) == 3


@ -0,0 +1,21 @@
# coding: utf-8
"""Test that longer and mixed texts are tokenized correctly."""
from __future__ import unicode_literals
import pytest
def test_sv_tokenizer_handles_long_text(sv_tokenizer):
text = """Det var så härligt ute på landet. Det var sommar, majsen var gul, havren grön,
höet var uppställt i stackar nere vid den gröna ängen, och där gick storken sina långa,
röda ben och snackade engelska, för det språket hade han lärt sig av sin mor.
Runt om åkrar och äng låg den stora skogen, och mitt i skogen fanns djupa sjöar; jo, det var verkligen trevligt ute landet!"""
tokens = sv_tokenizer(text)
assert len(tokens) == 86
def test_sv_tokenizer_handles_trailing_dot_for_i_in_sentence(sv_tokenizer):
text = "Provar att tokenisera en mening med ord i."
tokens = sv_tokenizer(text)
assert len(tokens) == 9


@ -5,27 +5,31 @@ from ..util import get_doc
import pytest
import numpy
from numpy.testing import assert_array_equal
@pytest.mark.parametrize('words,heads,matrix', [
(
'She created a test for spacy'.split(),
[1, 0, 1, -2, -1, -1],
numpy.array([
[0, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 2, 3, 3, 3],
[1, 1, 3, 3, 3, 3],
[1, 1, 3, 3, 4, 4],
[1, 1, 3, 3, 4, 5]], dtype=numpy.int32)
)
])
def test_issue2396(en_vocab, words, heads, matrix):
doc = get_doc(en_vocab, words=words, heads=heads)
@pytest.mark.parametrize(
"sentence,heads,matrix",
[
(
"She created a test for spacy",
[1, 0, 1, -2, -1, -1],
numpy.array(
[
[0, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 2, 3, 3, 3],
[1, 1, 3, 3, 3, 3],
[1, 1, 3, 3, 4, 4],
[1, 1, 3, 3, 4, 5],
],
dtype=numpy.int32,
),
)
],
)
def test_issue2396(en_tokenizer, sentence, heads, matrix):
tokens = en_tokenizer(sentence)
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
span = doc[:]
assert_array_equal(doc.get_lca_matrix(), matrix)
assert_array_equal(span.get_lca_matrix(), matrix)
assert (doc.get_lca_matrix() == matrix).all()
assert (span.get_lca_matrix() == matrix).all()

View File

@ -10,7 +10,7 @@ def test_issue2901():
"""Test that `nlp` doesn't fail."""
try:
nlp = Japanese()
except:
except ImportError:
pytest.skip()
doc = nlp("pythonが大好きです")

View File

@ -0,0 +1,10 @@
from __future__ import unicode_literals
import pytest
import spacy
@pytest.mark.models('fr')
def test_issue1959(FR):
texts = ['Je suis la mauvaise herbe', "Me, myself and moi"]
for text in texts:
FR(text)

View File

@ -1075,21 +1075,30 @@ cdef int [:,:] _get_lca_matrix(Doc doc, int start, int end):
cdef int [:,:] lca_matrix
n_tokens= end - start
lca_matrix = numpy.empty((n_tokens, n_tokens), dtype=numpy.int32)
lca_mat = numpy.empty((n_tokens, n_tokens), dtype=numpy.int32)
lca_mat.fill(-1)
lca_matrix = lca_mat
for j in range(start, end):
token_j = doc[j]
for j in range(n_tokens):
token_j = doc[start + j]
# the common ancestor of token and itself is itself:
lca_matrix[j, j] = j
for k in range(j + 1, end):
lca = _get_tokens_lca(token_j, doc[k])
# we will only iterate through tokens in the same sentence
sent = token_j.sent
sent_start = sent.start
j_idx_in_sent = start + j - sent_start
n_missing_tokens_in_sent = len(sent) - j_idx_in_sent
# make sure we do not go past `end`, in cases where `end` < sent.end
max_range = min(j + n_missing_tokens_in_sent, end)
for k in range(j + 1, max_range):
lca = _get_tokens_lca(token_j, doc[start + k])
# if lca is outside of span, we set it to -1
if not start <= lca < end:
lca_matrix[j, k] = -1
lca_matrix[k, j] = -1
else:
lca_matrix[j, k] = lca
lca_matrix[k, j] = lca
lca_matrix[j, k] = lca - start
lca_matrix[k, j] = lca - start
return lca_matrix
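
To make the intent of the hunk above easier to follow, here is a dependency-free sketch of the same idea in plain Python. It is not the Cython implementation from this commit. The function name `lca_matrix_from_heads` and its arguments are hypothetical, and it assumes heads are given as relative offsets with a root pointing to itself (the convention used by the `test_issue2396` parameters). Pairs without a common ancestor inside the span stay at -1, and indices are relative to the span start.

```python
def lca_matrix_from_heads(heads, start=0, end=None):
    """Sketch: lowest-common-ancestor matrix for tokens in [start, end).

    `heads` holds relative head offsets; a root has offset 0 (assumption).
    """
    end = len(heads) if end is None else end
    abs_heads = [i + h for i, h in enumerate(heads)]

    def path_to_root(i):
        # the token itself followed by its ancestors, nearest first
        path = [i]
        while abs_heads[path[-1]] != path[-1]:
            path.append(abs_heads[path[-1]])
        return path

    size = end - start
    matrix = [[-1] * size for _ in range(size)]
    for j in range(size):
        # the common ancestor of a token and itself is itself
        matrix[j][j] = j
        ancestors_j = path_to_root(start + j)
        for k in range(j + 1, size):
            ancestors_k = set(path_to_root(start + k))
            lca = next((a for a in ancestors_j if a in ancestors_k), -1)
            # an LCA outside the span (or none at all) is left as -1
            if start <= lca < end:
                matrix[j][k] = lca - start
                matrix[k][j] = lca - start
    return matrix


heads = [1, 0, 1, -2, -1, -1]  # "She created a test for spacy"
full = lca_matrix_from_heads(heads)
assert full[2] == [1, 1, 2, 3, 3, 3]
```

For the whole sentence this reproduces the matrix used in `test_issue2396`. For a sub-span such as `lca_matrix_from_heads(heads, 2, 6)` the entries are shifted so they index into the span rather than the document.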

View File

@ -524,9 +524,9 @@ cdef class Span:
return len(list(self.rights))
property subtree:
"""Tokens that descend from tokens in the span, but fall outside it.
"""Tokens within the span and tokens which descend from them.
YIELDS (Token): A descendant of a token within the span.
YIELDS (Token): A token within the span, or a descendant from it.
"""
def __get__(self):
for word in self.lefts:

View File

@ -457,10 +457,11 @@ cdef class Token:
yield from self.rights
property subtree:
"""A sequence of all the token's syntactic descendents.
"""A sequence containing the token and all the token's syntactic
descendants.
YIELDS (Token): A descendent token such that
`self.is_ancestor(descendent)`.
`self.is_ancestor(descendent) or token == self`.
"""
def __get__(self):
for word in self.lefts:
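
The corrected docstring says the subtree contains the token itself in addition to its descendants. A small plain-Python sketch of that definition, assuming relative head offsets with a root pointing to itself; the helper name `subtree_indices` is hypothetical and this is not how spaCy implements the property:

```python
def subtree_indices(heads, i):
    """Sketch: indices of token `i` and all its syntactic descendants.

    Results are listed by position in the text (an assumption of this sketch).
    """
    abs_heads = [j + h for j, h in enumerate(heads)]

    def is_ancestor(ancestor, j):
        # walk up from j until the root is reached
        while abs_heads[j] != j:
            j = abs_heads[j]
            if j == ancestor:
                return True
        return False

    # mirrors the docstring: `self.is_ancestor(token) or token == self`
    return [j for j in range(len(heads)) if j == i or is_ancestor(i, j)]


heads = [1, 0, 1, -2, -1, -1]  # "She created a test for spacy"
assert subtree_indices(heads, 0) == [0]  # a leaf still yields itself
```

A leaf token therefore has a subtree of length one rather than an empty one, which is the behaviour the docstring change documents.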

View File

@ -253,7 +253,6 @@ def get_entry_point(key, value):
def is_in_jupyter():
"""Check if user is running spaCy from a Jupyter notebook by detecting the
IPython kernel. Mainly used for the displaCy visualizer.
RETURNS (bool): True if in Jupyter, False if not.
"""
# https://stackoverflow.com/a/39662359/6400719
@ -667,3 +666,19 @@ class SimpleFrozenDict(dict):
def update(self, other):
raise NotImplementedError(Errors.E095)
class DummyTokenizer(object):
# add dummy methods for to_bytes, from_bytes, to_disk and from_disk to
# allow serialization (see #1557)
def to_bytes(self, **exclude):
return b''
def from_bytes(self, _bytes_data, **exclude):
return self
def to_disk(self, _path, **exclude):
return None
def from_disk(self, _path, **exclude):
return self
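
A rough sketch of how a custom tokenizer might build on this base class. `DummyTokenizer` is copied from the hunk above so the example runs without spaCy installed. The subclass `WhitespaceTokenizer` is hypothetical and returns a plain list of strings, whereas a real spaCy tokenizer would return a `Doc`.

```python
class DummyTokenizer(object):
    # no-op serialization methods, as added in this commit
    def to_bytes(self, **exclude):
        return b''

    def from_bytes(self, _bytes_data, **exclude):
        return self

    def to_disk(self, _path, **exclude):
        return None

    def from_disk(self, _path, **exclude):
        return self


class WhitespaceTokenizer(DummyTokenizer):
    """Hypothetical tokenizer with no state of its own to serialize."""

    def __call__(self, text):
        # simplification: a real tokenizer would construct a Doc here
        return text.split()


tokenizer = WhitespaceTokenizer()
words = tokenizer("Give it back")
data = tokenizer.to_bytes()          # nothing to save, so empty bytes
restored = tokenizer.from_bytes(data)  # returns the same instance
```

Because the inherited methods do nothing, code that serializes every pipeline component can call them on such a tokenizer without special-casing it.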

View File

@ -150,3 +150,9 @@ p
+dep-row("re", "repeated element")
+dep-row("rs", "reported speech")
+dep-row("sb", "subject")
+dep-row("sbp", "passivised subject")
+dep-row("sp", "subject or predicate")
+dep-row("svp", "separable verb prefix")
+dep-row("uc", "unit component")
+dep-row("vo", "vocative")
+dep-row("ROOT", "root")

View File

@ -5,7 +5,7 @@ include ../_includes/_mixins
p
| The #[code PhraseMatcher] lets you efficiently match large terminology
| lists. While the #[+api("matcher") #[code Matcher]] lets you match
| squences based on lists of token descriptions, the #[code PhraseMatcher]
| sequences based on lists of token descriptions, the #[code PhraseMatcher]
| accepts match patterns in the form of #[code Doc] objects.
+h(2, "init") PhraseMatcher.__init__

View File

@ -489,7 +489,7 @@ p
+tag property
+tag-model("parse")
p Tokens that descend from tokens in the span, but fall outside it.
p Tokens within the span and tokens which descend from them.
+aside-code("Example").
doc = nlp(u'Give it back! He pleaded.')
@ -500,7 +500,7 @@ p Tokens that descend from tokens in the span, but fall outside it.
+row("foot")
+cell yields
+cell #[code Token]
+cell A descendant of a token within the span.
+cell A token within the span, or a descendant from it.
+h(2, "has_vector") Span.has_vector
+tag property

View File

@ -1,3 +1,4 @@
//- 💫 DOCS > API > TOKEN
include ../_includes/_mixins
@ -405,7 +406,7 @@ p
+tag property
+tag-model("parse")
p A sequence of all the token's syntactic descendants.
p A sequence containing the token and all the token's syntactic descendants.
+aside-code("Example").
doc = nlp(u'Give it back! He pleaded.')
@ -416,7 +417,7 @@ p A sequence of all the token's syntactic descendants.
+row("foot")
+cell yields
+cell #[code Token]
+cell A descendant token such that #[code self.is_ancestor(descendant)].
+cell A descendant token such that #[code self.is_ancestor(token) or token == self].
+h(2, "is_sent_start") Token.is_sent_start
+tag property

View File

@ -1083,20 +1083,31 @@
"category": ["pipeline"]
},
{
"id": "spacy2conllu",
"title": "spaCy2CoNLLU",
"id": "spacy-conll",
"title": "spacy_conll",
"slogan": "Parse text with spaCy and print the output in CoNLL-U format",
"description": "Simple script to parse text with spaCy and print the output in CoNLL-U format",
"description": "This module allows you to parse a text to CoNLL-U format. You can use it as a command line tool, or embed it in your own scripts.",
"code_example": [
"python parse_as_conllu.py [-h] --input_file INPUT_FILE [--output_file OUTPUT_FILE] --model MODEL"
"from spacy_conll import Spacy2ConllParser",
"spacyconll = Spacy2ConllParser()",
"",
"# `parse` returns a generator of the parsed sentences",
"for parsed_sent in spacyconll.parse(input_str='I like cookies.\nWhat about you?\nI don't like 'em!'):",
" do_something_(parsed_sent)",
"",
"# `parseprint` prints output to stdout (default) or a file (use `output_file` parameter)",
"# This method is called when using the command line",
"spacyconll.parseprint(input_str='I like cookies.')"
],
"code_language": "bash",
"author": "Raquel G. Alhama",
"code_language": "python",
"author": "Bram Vanroy",
"author_links": {
"github": "rgalhama"
"github": "BramVanroy",
"website": "https://bramvanroy.be"
},
"github": "rgalhama/spaCy2CoNLLU",
"category": ["training"]
"github": "BramVanroy/spacy_conll",
"category": ["standalone"]
}
],
"projectCats": {

View File

@ -159,7 +159,7 @@ p
| To provide training examples to the entity recogniser, you'll first need
| to create an instance of the #[+api("goldparse") #[code GoldParse]] class.
| You can specify your annotations in a stand-off format or as token tags.
| If a character offset in your entity annotations don't fall on a token
| If a character offset in your entity annotations doesn't fall on a token
| boundary, the #[code GoldParse] class will treat that annotation as a
| missing value. This allows for more realistic training, because the
| entity recogniser is allowed to learn from examples that may feature

View File

@ -444,7 +444,7 @@ p
| Let's say you're analysing user comments and you want to find out what
| people are saying about Facebook. You want to start off by finding
| adjectives following "Facebook is" or "Facebook was". This is obviously
| a very rudimentary solution, but it'll be fast, and a great way get an
| a very rudimentary solution, but it'll be fast, and a great way to get an
| idea for what's in your data. Your pattern could look like this:
+code.

View File

@ -40,7 +40,7 @@ p
| constrained to predict parses consistent with the sentence boundaries.
+infobox("Important note", "⚠️")
| To prevent inconsitent state, you can only set boundaries #[em before] a
| To prevent inconsistent state, you can only set boundaries #[em before] a
| document is parsed (and #[code Doc.is_parsed] is #[code False]). To
| ensure that your component is added in the right place, you can set
| #[code before='parser'] or #[code first=True] when adding it to the

View File

@ -21,7 +21,7 @@ p
| which needs to be split into two tokens: #[code {ORTH: "do"}] and
| #[code {ORTH: "n't", LEMMA: "not"}]. The prefixes, suffixes and infixes
| mostly define punctuation rules for example, when to split off periods
| (at the end of a sentence), and when to leave token containing periods
| (at the end of a sentence), and when to leave tokens containing periods
| intact (abbreviations like "U.S.").
+graphic("/assets/img/language_data.svg")

View File

@ -43,7 +43,7 @@ p
p
| This example shows how to use multiple cores to process text using
| spaCy and #[+a("https://pythonhosted.org/joblib/") Joblib]. We're
| spaCy and #[+a("https://joblib.readthedocs.io/en/latest/parallel.html") Joblib]. We're
| exporting part-of-speech-tagged, true-cased, (very roughly)
| sentence-separated text, with each "sentence" on a newline, and
| spaces between tokens. Data is loaded from the IMDB movie reviews

View File

@ -74,7 +74,7 @@ p
displacy.serve(doc, style='ent')
p
| This feature is espeically handy if you're using displaCy to compare
| This feature is especially handy if you're using displaCy to compare
| performance at different stages of a process, e.g. during training. Here
| you could use the title for a brief description of the text example and
| the number of iterations.

View File

@ -61,7 +61,7 @@ p
output_path.open('w', encoding='utf-8').write(svg)
p
| The above code will generate the dependency visualizations as to
| The above code will generate the dependency visualizations as
| two files, #[code This-is-an-example.svg] and #[code This-is-another-one.svg].

View File

@ -24,7 +24,7 @@ include ../_includes/_mixins
| standards.
p
| The quickest way visualize #[code Doc] is to use
| The quickest way to visualize #[code Doc] is to use
| #[+api("displacy#serve") #[code displacy.serve]]. This will spin up a
| simple web server and let you view the result straight from your browser.
| displaCy can either take a single #[code Doc] or a list of #[code Doc]