Merge branch 'master' into develop

Ines Montani 2019-02-07 20:54:07 +01:00
commit 5d0b60999d
77 changed files with 293374 additions and 292084 deletions

106
.github/contributors/DeNeutoy.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name |Mark Neumann |
| Company name (if applicable) |Allen Institute for AI |
| Title or role (if applicable) |Research Engineer |
| Date | 13/01/2019 |
| GitHub username |@Deneutoy |
| Website (optional) |markneumann.xyz |

106
.github/contributors/Loghijiaha.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Loghi Perinpanayagam |
| Company name (if applicable) | |
| Title or role (if applicable) | Student |
| Date | 13 Jan, 2019 |
| GitHub username | loghijiaha |
| Website (optional) | |


@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jo |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-01-26 |
| GitHub username | PolyglotOpenstreetmap|
| Website (optional) | |

106
.github/contributors/adrianeboyd.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Adriane Boyd |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 28 January 2019 |
| GitHub username | adrianeboyd |
| Website (optional) | |

106
.github/contributors/alvations.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Liling |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 04 Jan 2019 |
| GitHub username | alvations |
| Website (optional) | |


@ -101,6 +101,6 @@ mark both statements:
 | Name | Amandine Périnet |
 | Company name (if applicable) | 365Talents |
 | Title or role (if applicable) | Data Science Researcher |
-| Date | 12/12/2018 |
+| Date | 28/01/2019 |
 | GitHub username | amperinet |
 | Website (optional) | |

106
.github/contributors/boena.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Björn Lennartsson |
| Company name (if applicable) | Uptrail AB |
| Title or role (if applicable) | CTO |
| Date | 2019-01-15 |
| GitHub username | boena |
| Website (optional) | www.uptrail.com |

106
.github/contributors/foufaster.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name |Anès Foufa |
| Company name (if applicable) | |
| Title or role (if applicable) |NLP developer |
| Date |21/01/2019 |
| GitHub username |foufaster |
| Website (optional) | |

106
.github/contributors/ozcankasal.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Ozcan Kasal |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | December 21, 2018 |
| GitHub username | ozcankasal |
| Website (optional) | |

106
.github/contributors/retnuh.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
- Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
- to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
- each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
| ----------------------------- | ------------ |
| Name | Hunter Kelly |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-01-10 |
| GitHub username | retnuh |
| Website (optional) | |

106
.github/contributors/willprice.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | --------------------- |
| Name | Will Price |
| Company name (if applicable) | N/A |
| Title or role (if applicable) | N/A |
| Date | 26/12/2018 |
| GitHub username | willprice |
| Website (optional) | https://willprice.org |


@ -1,4 +1,5 @@
 recursive-include include *.h
 include LICENSE
 include README.md
+include pyproject.toml
 include bin/spacy

106
contributer_agreement.md Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Laura Baakman |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | February 7, 2019 |
| GitHub username | lauraBaakman |
| Website (optional) | |


@ -58,7 +58,7 @@ import spacy
     lang=("Language class to initialise", "option", "l", str),
 )
 def main(patterns_loc, text_loc, n=10000, lang="en"):
-    nlp = spacy.blank("en")
+    nlp = spacy.blank(lang)
     nlp.vocab.lex_attr_getters = {}
     phrases = read_gazetteer(nlp.tokenizer, patterns_loc)
     count = 0


@ -26,6 +26,11 @@ from spacy.util import minibatch, compounding
     n_iter=("Number of training iterations", "option", "n", int),
 )
 def main(model=None, output_dir=None, n_iter=20, n_texts=2000):
+    if output_dir is not None:
+        output_dir = Path(output_dir)
+        if not output_dir.exists():
+            output_dir.mkdir()
     if model is not None:
         nlp = spacy.load(model)  # load existing spaCy model
         print("Loaded model '%s'" % model)
@ -87,9 +92,6 @@ def main(model=None, output_dir=None, n_iter=20, n_texts=2000):
     print(test_text, doc.cats)
     if output_dir is not None:
-        output_dir = Path(output_dir)
-        if not output_dir.exists():
-            output_dir.mkdir()
         with nlp.use_params(optimizer.averages):
             nlp.to_disk(output_dir)
         print("Saved model to", output_dir)
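
The hunk above moves the output-directory setup to the top of `main`, presumably so that a path problem surfaces before training starts rather than after it finishes. A minimal sketch of that fail-early pattern, assuming a hypothetical helper named `ensure_output_dir` (it is not part of spaCy or of the example script):

```python
from pathlib import Path


def ensure_output_dir(output_dir):
    """Return output_dir as a Path, creating the directory if it is missing.

    Hypothetical helper for illustration only. Returns None when no
    output directory was requested.
    """
    if output_dir is None:
        return None
    path = Path(output_dir)
    if not path.exists():
        # Like the example script, this does not create missing parents,
        # so a wrong parent path raises here, before any expensive work.
        path.mkdir()
    return path
```

Calling such a helper as the first statement of a training function means an unusable path fails in seconds instead of after the last epoch.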


@ -1,6 +1,6 @@
 [
   {
-    "id": "wsj_0200",
+    "id": 42,
     "paragraphs": [
       {
         "raw": "In an Oct. 19 review of \"The Misanthrope\" at Chicago's Goodman Theatre (\"Revitalized Classics Take the Stage in Windy City,\" Leisure & Arts), the role of Celimene, played by Kim Cattrall, was mistakenly attributed to Christina Haag. Ms. Haag plays Elianti.",

10
pyproject.toml Normal file

@ -0,0 +1,10 @@
[build-system]
requires = ["setuptools",
"wheel>0.32.0,<0.33.0",
"Cython",
"cymem>=2.0.2,<2.1.0",
"preshed>=2.0.1,<2.1.0",
"murmurhash>=0.28.0,<1.1.0",
"thinc>=6.12.1,<6.13.0",
]
build-backend = "setuptools.build_meta"


@ -14,7 +14,7 @@ plac<1.0.0,>=0.9.6
 pathlib==1.0.1; python_version < "3.4"
 # Development dependencies
 cython>=0.25
-pytest>=4.0.0,<5.0.0
+pytest>=4.0.0,<4.1.0
 pytest-timeout>=1.3.0,<2.0.0
 mock>=2.0.0,<3.0.0
 flake8>=3.5.0,<3.6.0


@ -246,6 +246,7 @@ def setup_package():
             "cuda92": ["cupy-cuda92>=4.0"],
             "cuda100": ["cupy-cuda100>=4.0"],
         },
+        python_requires=">=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*",
         classifiers=[
             "Development Status :: 5 - Production/Stable",
             "Environment :: Console",


@ -31,9 +31,13 @@ def read_iob(raw_sents):
         tokens = [re.split("[^\w\-]", line.strip())]
         if len(tokens[0]) == 3:
             words, pos, iob = zip(*tokens)
-        else:
+        elif len(tokens[0]) == 2:
             words, iob = zip(*tokens)
             pos = ["-"] * len(words)
+        else:
+            raise ValueError(
+                "The iob/iob2 file is not formatted correctly. Try checking whitespace and delimiters."
+            )
         biluo = iob_to_biluo(iob)
         sentences.append(
             [
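
The change above turns a catch-all `else` into an explicit two-column case and makes every other column count an error. A minimal sketch of that dispatch for a single token, assuming a hypothetical helper `split_iob_token` and reusing the split pattern from the hunk (this is an illustration, not the converter's actual code path):

```python
import re


def split_iob_token(token):
    """Split one 'word|POS|IOB' or 'word|IOB' token into (word, pos, iob).

    Hypothetical helper for illustration. Any character that is not a word
    character or a hyphen is treated as a delimiter, as in the hunk above.
    """
    parts = re.split(r"[^\w\-]", token.strip())
    if len(parts) == 3:
        word, pos, iob = parts
    elif len(parts) == 2:
        word, iob = parts
        pos = "-"  # placeholder tag when no POS column is present
    else:
        # Fail loudly instead of failing later with an unpacking error.
        raise ValueError(
            "Expected 2 or 3 columns, got %d in %r" % (len(parts), token)
        )
    return word, pos, iob
```

With the old `else` branch, a one-column or four-column token would have reached the two-value unpacking and failed with a less helpful message.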


@ -208,7 +208,11 @@ def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50):
             doc_freq = int(doc_freq)
             freq = int(freq)
             if doc_freq >= min_doc_freq and freq >= min_freq and len(key) < max_length:
-                word = literal_eval(key)
+                try:
+                    word = literal_eval(key)
+                except SyntaxError:
+                    # Take odd strings literally.
+                    word = literal_eval("'%s'" % key)
                 smooth_count = counts.smoother(int(freq))
                 probs[word] = math.log(smooth_count) - log_total
     oov_prob = math.log(counts.smoother(0)) - log_total
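
The fallback above can be tried in isolation. The sketch below assumes a hypothetical wrapper `parse_key` around the same two `literal_eval` calls; it only illustrates which inputs take which branch and is not the function from the diff:

```python
from ast import literal_eval


def parse_key(key):
    """Evaluate a quoted key, falling back to treating it as raw text.

    Hypothetical wrapper for illustration, mirroring the try/except above.
    """
    try:
        return literal_eval(key)
    except SyntaxError:
        # The key is not valid Python syntax, so wrap it in quotes first.
        return literal_eval("'%s'" % key)
```

One caveat worth knowing: an unquoted bare word such as `hello` is valid syntax but not a literal, so `literal_eval` raises `ValueError` for it, and that exception is not caught by this pattern.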


@ -9,7 +9,6 @@ from ..util import is_in_jupyter
 _html = {}
-IS_JUPYTER = is_in_jupyter()
 RENDER_WRAPPER = None
@ -18,7 +17,7 @@ def render(
     style="dep",
     page=False,
     minify=False,
-    jupyter=IS_JUPYTER,
+    jupyter=False,
     options={},
     manual=False,
 ):
@ -51,7 +50,7 @@ def render(
     html = _html["parsed"]
     if RENDER_WRAPPER is not None:
         html = RENDER_WRAPPER(html)
-    if jupyter:  # return HTML rendered by IPython display()
+    if jupyter or is_in_jupyter():  # return HTML rendered by IPython display()
         from IPython.core.display import display, HTML
         return display(HTML(html))
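
The difference between the old and new code comes down to when the environment check runs: a default argument is evaluated once, when the module is imported, while a check inside the function body runs on every call. A small sketch of that distinction, using stand-in names (`is_in_notebook`, `render_frozen`, `render_live`) that are assumptions for illustration and not displaCy functions:

```python
# Stand-in for the real environment; flipped below to simulate a change
# that happens after the module has been imported.
_state = {"in_notebook": False}


def is_in_notebook():
    return _state["in_notebook"]


IS_NOTEBOOK = is_in_notebook()  # evaluated once, at import time


def render_frozen(notebook=IS_NOTEBOOK):
    # The default was captured at import and never changes afterwards.
    return "display" if notebook else "markup"


def render_live(notebook=False):
    # The environment is re-checked on every call.
    return "display" if (notebook or is_in_notebook()) else "markup"


_state["in_notebook"] = True  # environment changes after import

print(render_frozen())
print(render_live())
```

After the flip, only `render_live` notices the new environment, which matches the shape of the change in the hunk above.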


@ -1,7 +1,7 @@
 # coding: utf8
 from __future__ import unicode_literals
-import random
+import uuid
 from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS
 from .templates import TPL_ENT, TPL_ENTS, TPL_FIGURE, TPL_TITLE, TPL_PAGE
@ -41,7 +41,7 @@ class DependencyRenderer(object):
         """
         # Create a random ID prefix to make sure parses don't receive the
         # same ID, even if they're identical
-        id_prefix = random.randint(0, 999)
+        id_prefix = uuid.uuid4().hex
         rendered = [
             self.render_svg("{}-{}".format(id_prefix, i), p["words"], p["arcs"])
             for i, p in enumerate(parsed)
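
`random.randint(0, 999)` can only produce 1000 distinct prefixes, so separate render calls on one page will eventually collide; a UUID makes that practically impossible. A short sketch comparing the two, with hypothetical helper names (`prefixes_randint`, `prefixes_uuid`) that are not part of displaCy:

```python
import random
import uuid


def prefixes_randint(n, rng):
    """Generate n ID prefixes the old way: an integer in [0, 999]."""
    return [rng.randint(0, 999) for _ in range(n)]


def prefixes_uuid(n):
    """Generate n ID prefixes the new way: 32 hex characters each."""
    return [uuid.uuid4().hex for _ in range(n)]


rng = random.Random(0)  # seeded only to make the sketch repeatable
old = prefixes_randint(2000, rng)
new = prefixes_uuid(2000)

# At most 1000 distinct values exist, so 2000 draws must repeat.
print(len(set(old)) <= 1000)
# UUID4 prefixes are, for practical purposes, all distinct.
print(len(set(new)) == len(new))
```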


@ -4,20 +4,24 @@ from __future__ import unicode_literals
 from .lookup import LOOKUP
 from ._adjectives import ADJECTIVES
 from ._adjectives_irreg import ADJECTIVES_IRREG
+from ._adp_irreg import ADP_IRREG
 from ._adverbs import ADVERBS
+from ._auxiliary_verbs_irreg import AUXILIARY_VERBS_IRREG
+from ._cconj_irreg import CCONJ_IRREG
+from ._dets_irreg import DETS_IRREG
+from ._lemma_rules import ADJECTIVE_RULES, NOUN_RULES, VERB_RULES
 from ._nouns import NOUNS
 from ._nouns_irreg import NOUNS_IRREG
+from ._pronouns_irreg import PRONOUNS_IRREG
+from ._sconj_irreg import SCONJ_IRREG
 from ._verbs import VERBS
 from ._verbs_irreg import VERBS_IRREG
-from ._dets_irreg import DETS_IRREG
-from ._pronouns_irreg import PRONOUNS_IRREG
-from ._auxiliary_verbs_irreg import AUXILIARY_VERBS_IRREG
-from ._lemma_rules import ADJECTIVE_RULES, NOUN_RULES, VERB_RULES
 LEMMA_INDEX = {'adj': ADJECTIVES, 'adv': ADVERBS, 'noun': NOUNS, 'verb': VERBS}
-LEMMA_EXC = {'adj': ADJECTIVES_IRREG, 'noun': NOUNS_IRREG, 'verb': VERBS_IRREG,
-             'det': DETS_IRREG, 'pron': PRONOUNS_IRREG, 'aux': AUXILIARY_VERBS_IRREG}
+LEMMA_EXC = {'adj': ADJECTIVES_IRREG, 'adp': ADP_IRREG, 'aux': AUXILIARY_VERBS_IRREG,
+             'cconj': CCONJ_IRREG, 'det': DETS_IRREG, 'noun': NOUNS_IRREG, 'verb': VERBS_IRREG,
+             'pron': PRONOUNS_IRREG, 'sconj': SCONJ_IRREG}
 LEMMA_RULES = {'adj': ADJECTIVE_RULES, 'noun': NOUN_RULES, 'verb': VERB_RULES}
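
Note on the hunk above: `LEMMA_EXC` is keyed by part of speech, so the same surface form can map to different lemmas depending on its tag. The sketch below shows one way such a table could be consulted. It is an assumption for illustration only, not spaCy's lemmatizer; `lookup_exception` is a hypothetical helper, and the two tiny tables reuse a few entries from the ADP and DET files in this commit.

```python
# Miniature stand-ins for the real tables, using entries from this commit.
ADP_IRREG = {"des": ("de",), "du": ("de",), "aux": ("à",)}
DETS_IRREG = {"des": ("un",), "du": ("de",)}
LEMMA_EXC = {"adp": ADP_IRREG, "det": DETS_IRREG}


def lookup_exception(word, pos, exceptions=LEMMA_EXC):
    """Return the irregular lemmas listed for (word, pos).

    Falls back to the word itself when the part of speech has no table
    or the table has no entry for the word.
    """
    table = exceptions.get(pos, {})
    return table.get(word.lower(), (word,))


print(lookup_exception("des", "adp"))   # ('de',)
print(lookup_exception("des", "det"))   # ('un',)
print(lookup_exception("chat", "noun")) # ('chat',)
```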

File diff suppressed because it is too large

File diff suppressed because it is too large


@ -0,0 +1,24 @@
# coding: utf8
from __future__ import unicode_literals
ADP_IRREG = {
"a": ("à",),
"apr.": ("après",),
"aux": ("à",),
"av.": ("avant",),
"avt": ("avant",),
"cf.": ("cf",),
"conf.": ("cf",),
"confer": ("cf",),
"d'": ("de",),
"des": ("de",),
"du": ("de",),
"jusqu'": ("jusque",),
"pdt": ("pendant",),
"+": ("plus",),
"pr": ("pour",),
"/": ("sur",),
"versus": ("vs",),
"vs.": ("vs",)
}

File diff suppressed because it is too large


@ -0,0 +1,17 @@
# coding: utf8
from __future__ import unicode_literals
CCONJ_IRREG = {
"&amp;": ("et",),
"c-à-d": ("c'est-à-dire",),
"c.-à.-d.": ("c'est-à-dire",),
"càd": ("c'est-à-dire",),
"&": ("et",),
"et|ou": ("et-ou",),
"et/ou": ("et-ou",),
"i.e.": ("c'est-à-dire",),
"ie": ("c'est-à-dire",),
"ou/et": ("et-ou",),
"+": ("plus",)
}


@ -4,20 +4,27 @@ from __future__ import unicode_literals
 DETS_IRREG = {
     "aucune": ("aucun",),
+    "cents": ("cent",),
+    "certaine": ("certain",),
+    "certaines": ("certain",),
+    "certains": ("certain",),
     "ces": ("ce",),
     "cet": ("ce",),
     "cette": ("ce",),
-    "cents": ("cent",),
-    "certaines": ("certains",),
+    "des": ("un",),
     "différentes": ("différents",),
+    "diverse": ("divers",),
     "diverses": ("divers",),
+    "du": ("de",),
     "la": ("le",),
-    "les": ("le",),
-    "l'": ("le",),
     "laquelle": ("lequel",),
+    "les": ("le",),
+    "lesdites": ("ledit",),
+    "lesdits": ("ledit",),
+    "leurs": ("leur",),
     "lesquelles": ("lequel",),
     "lesquels": ("lequel",),
-    "leurs": ("leur",),
+    "l'": ("le",),
     "mainte": ("maint",),
     "maintes": ("maint",),
     "maints": ("maint",),
@ -27,23 +34,29 @@ DETS_IRREG = {
     "nulle": ("nul",),
     "nulles": ("nul",),
     "nuls": ("nul",),
+    "pareille": ("pareil",),
+    "pareilles": ("pareil",),
+    "pareils": ("pareil",),
     "quelle": ("quel",),
     "quelles": ("quel",),
-    "quels": ("quel",),
-    "quelqu'": ("quelque",),
+    "qq": ("quelque",),
+    "qqes": ("quelque",),
+    "qqs": ("quelque",),
     "quelques": ("quelque",),
+    "quelqu'": ("quelque",),
+    "quels": ("quel",),
     "sa": ("son",),
     "ses": ("son",),
-    "telle": ("tel",),
-    "telles": ("tel",),
-    "tels": ("tel",),
     "ta": ("ton",),
+    "telles": ("tel",),
+    "telle": ("tel",),
+    "tels": ("tel",),
     "tes": ("ton",),
     "tous": ("tout",),
-    "toute": ("tout",),
     "toutes": ("tout",),
-    "des": ("un",),
+    "toute": ("tout",),
     "une": ("un",),
     "vingts": ("vingt",),
+    "vot'": ("votre",),
     "vos": ("votre",),
 }


@ -63,36 +63,8 @@ NOUN_RULES = [
     ["w", "w"],
     ["y", "y"],
     ["z", "z"],
-    ["as", "a"],
-    ["aux", "au"],
-    ["cs", "c"],
-    ["chs", "ch"],
-    ["ds", "d"],
-    ["és", "é"],
-    ["es", "e"],
-    ["eux", "eu"],
-    ["fs", "f"],
-    ["gs", "g"],
-    ["hs", "h"],
-    ["is", "i"],
-    ["ïs", "ï"],
-    ["js", "j"],
-    ["ks", "k"],
-    ["ls", "l"],
-    ["ms", "m"],
-    ["ns", "n"],
-    ["oux", "ou"],
-    ["os", "o"],
-    ["ps", "p"],
-    ["qs", "q"],
-    ["rs", "r"],
-    ["ses", "se"],
-    ["se", "se"],
-    ["ts", "t"],
-    ["us", "u"],
-    ["vs", "v"],
-    ["ws", "w"],
-    ["ys", "y"],
+    ["s", ""],
+    ["x", ""],
     ["nt(e", "nt"],
     ["nt(e)", "nt"],
     ["al(e", "ale"],
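
Note on the hunk above: the thirty per-letter plural rules are replaced by two generic ones, `["s", ""]` and `["x", ""]`, which appear to cover every removed `[suffix, replacement]` pair except the no-op `["se", "se"]`. The sketch below shows how such suffix rules could be applied. It is an illustration under that assumption, not the lemmatizer's actual code; `apply_rules` and `NOUN_RULES_SAMPLE` are hypothetical names.

```python
# Only the two rules added by this hunk; the full NOUN_RULES list is longer.
NOUN_RULES_SAMPLE = [["s", ""], ["x", ""]]


def apply_rules(word, rules):
    """Return one candidate lemma per [old_suffix, new_suffix] rule that matches."""
    candidates = []
    for old, new in rules:
        if word.endswith(old):
            candidates.append(word[: len(word) - len(old)] + new)
    return candidates


print(apply_rules("chats", NOUN_RULES_SAMPLE))    # ['chat']
print(apply_rules("chevaux", NOUN_RULES_SAMPLE))  # ['chevau']
print(apply_rules("chat", NOUN_RULES_SAMPLE))     # []
```

The second result is not a real word, which is why suffix rules of this kind are usually paired with an index of known lemmas and a table of irregular forms.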

File diff suppressed because it is too large

File diff suppressed because it is too large


@ -4,37 +4,89 @@ from __future__ import unicode_literals
 PRONOUNS_IRREG = {
     "aucune": ("aucun",),
-    "celle-ci": ("celui-ci",),
-    "celles-ci": ("celui-ci",),
-    "ceux-ci": ("celui-ci",),
-    "celle-là": ("celui-là",),
-    "celles-là": ("celui-là",),
-    "ceux-là": ("celui-là",),
+    "autres": ("autre",),
+    "ça": ("cela",),
+    "c'": ("ce",),
     "celle": ("celui",),
+    "celle-ci": ("celui-ci",),
+    "celle-là": ("celui-là",),
     "celles": ("celui",),
-    "ceux": ("celui",),
+    "celles-ci": ("celui-ci",),
+    "celles-là": ("celui-là",),
     "certaines": ("certains",),
+    "ceux": ("celui",),
+    "ceux-ci": ("celui-ci",),
+    "ceux-là": ("celui-là",),
     "chacune": ("chacun",),
+    "-elle": ("lui",),
+    "elle": ("lui",),
+    "elle-même": ("lui-même",),
+    "-elles": ("lui",),
+    "elles": ("lui",),
+    "elles-mêmes": ("lui-même",),
+    "eux": ("lui",),
+    "eux-mêmes": ("lui-même",),
     "icelle": ("icelui",),
     "icelles": ("icelui",),
     "iceux": ("icelui",),
+    "-il": ("il",),
+    "-ils": ("il",),
+    "ils": ("il",),
+    "-je": ("je",),
+    "j'": ("je",),
     "la": ("le",),
-    "les": ("le",),
     "laquelle": ("lequel",),
+    "l'autre": ("l'autre",),
+    "les": ("le",),
     "lesquelles": ("lequel",),
     "lesquels": ("lequel",),
-    "elle-même": ("lui-même",),
-    "elles-mêmes": ("lui-même",),
-    "eux-mêmes": ("lui-même",),
+    "-leur": ("leur",),
+    "l'on": ("on",),
+    "-lui": ("lui",),
+    "l'une": ("l'un",),
+    "mêmes": ("même",),
+    "-m'": ("me",),
+    "m'": ("me",),
+    "-moi": ("moi",),
+    "nous-mêmes": ("nous-même",),
+    "-nous": ("nous",),
+    "-on": ("on",),
+    "qqchose": ("quelque chose",),
+    "qqch": ("quelque chose",),
+    "qqc": ("quelque chose",),
+    "qqn": ("quelqu'un",),
     "quelle": ("quel",),
     "quelles": ("quel",),
-    "quels": ("quel",),
-    "quelques-unes": ("quelqu'un",),
-    "quelques-uns": ("quelqu'un",),
+    "quelques-unes": ("quelques-uns",),
     "quelque-une": ("quelqu'un",),
+    "quelqu'une": ("quelqu'un",),
+    "quels": ("quel",),
     "qu": ("que",),
-    "telle": ("tel",),
+    "s'": ("se",),
+    "-t-elle": ("elle",),
+    "-t-elles": ("elle",),
     "telles": ("tel",),
+    "telle": ("tel",),
     "tels": ("tel",),
-    "toutes": ("tous",),
+    "-t-en": ("en",),
+    "-t-il": ("il",),
+    "-t-ils": ("il",),
+    "-toi": ("toi",),
+    "-t-on": ("on",),
+    "tous": ("tout",),
+    "toutes": ("tout",),
+    "toute": ("tout",),
+    "-t'": ("te",),
+    "t'": ("te",),
+    "-tu": ("tu",),
+    "-t-y": ("y",),
+    "unes": ("un",),
+    "une": ("un",),
+    "uns": ("un",),
+    "vous-mêmes": ("vous-même",),
+    "vous-même": ("vous-même",),
+    "-vous": ("vous",),
+    "-vs": ("vous",),
+    "vs": ("vous",),
+    "-y": ("y",),
 }


@ -0,0 +1,19 @@
# coding: utf8
from __future__ import unicode_literals
SCONJ_IRREG = {
"lorsqu'": ("lorsque",),
"pac'que": ("parce que",),
"pac'qu'": ("parce que",),
"parc'que": ("parce que",),
"parc'qu'": ("parce que",),
"paske": ("parce que",),
"pask'": ("parce que",),
"pcq": ("parce que",),
"+": ("plus",),
"puisqu'": ("puisque",),
"qd": ("quand",),
"quoiqu'": ("quoique",),
"qu'": ("que",)
}


@ -6,63 +6,64 @@ VERBS = set(
""" """
abaisser abandonner abdiquer abecquer abéliser aberrer abhorrer abîmer abjurer abaisser abandonner abdiquer abecquer abéliser aberrer abhorrer abîmer abjurer
ablater abluer ablutionner abominer abonder abonner aborder aborner aboucher ablater abluer ablutionner abominer abonder abonner aborder aborner aboucher
abouler abouter abraquer abraser abreuver abricoter abriter absenter absinther abouler abouter aboutonner abracadabrer abraquer abraser abreuver abricoter
absolutiser absorber abuser académifier académiser acagnarder accabler abriter absenter absinther absolutiser absorber abuser académifier académiser
accagner accaparer accastiller accentuer accepter accessoiriser accidenter acagnarder accabler accagner accaparer accastiller accentuer accepter
acclamer acclimater accointer accolader accoler accommoder accompagner accessoiriser accidenter acclamer acclimater accointer accolader accoler
accorder accorer accoster accoter accoucher accouder accouer accoupler accommoder accompagner accorder accorer accoster accoter accoucher accouder
accoutrer accoutumer accouver accrassiner accréditer accrocher acculer accouer accoupler accoutrer accoutumer accouver accrassiner accréditer
acculturer accumuler accuser acenser acétaliser acétyler achalander acharner accrocher acculer acculturer accumuler accuser acenser acétaliser acétyler
acheminer achopper achromatiser aciduler aciériser acliquer acoquiner acquêter achalander acharner acheminer achopper achromatiser aciduler aciériser
acquitter acter actiniser actionner activer actoriser actualiser acupuncturer acliquer acoquiner acquêter acquitter acter actiniser actionner activer
acyler adapter additionner adenter adieuser adirer adjectiver adjectiviser actoriser actualiser acupuncturer acyler adapter additionner adenter adieuser
adjurer adjuver administrer admirer admonester adoniser adonner adopter adorer adirer adjectiver adjectiviser adjurer adjuver administrer admirer admonester
adorner adosser adouber adresser adsorber aduler adverbialiser aéroporter adoniser adonner adopter adorer adorner adosser adouber adresser adsorber
aérosoliser aérosonder aérotransporter affabuler affacturer affairer affaisser aduler adverbialiser aéroporter aérosoliser aérosonder aérotransporter
affaiter affaler affamer affecter affectionner affermer afficher affider affabuler affacturer affairer affaisser affaiter affaler affamer affecter
affiler affiner affirmer affistoler affixer affleurer afflouer affluer affoler affectionner affermer afficher affider affiler affiner affirmer affistoler
afforester affouiller affourcher affriander affricher affrioler affriquer affixer affleurer afflouer affluer affoler afforester affouiller affourcher
affriter affronter affruiter affubler affurer affûter afghaniser afistoler affriander affricher affrioler affriquer affriter affronter affruiter affubler
africaniser agatiser agenouiller agglutiner aggraver agioter agiter agoniser affurer affûter afghaniser afistoler africaniser agatiser agenouiller
agourmander agrafer agrainer agrémenter agresser agriffer agripper agglutiner aggraver agioter agiter agoniser agourmander agrafer agrainer
agroalimentariser agrouper aguetter aguicher ahaner aheurter aicher aider agrémenter agresser agricher agriffer agripper agroalimentariser agrouper
aigretter aiguer aiguiller aiguillonner aiguiser ailer ailler ailloliser aguetter aguicher aguiller ahaner aheurter aicher aider aigretter aiguer
aimanter aimer airer ajointer ajourer ajourner ajouter ajuster ajuter aiguiller aiguillonner aiguiser ailer ailler ailloliser aimanter aimer airer
alambiquer alarmer albaniser albitiser alcaliniser alcaliser alcooliser ajointer ajourer ajourner ajouter ajuster ajuter alambiquer alarmer albaniser
alcoolyser alcoyler aldoliser alerter aleviner algébriser algérianiser albitiser alcaliniser alcaliser alcooliser alcoolyser alcoyler aldoliser
algorithmiser aligner alimenter alinéater alinéatiser aliter alkyler allaiter alerter aleviner algébriser algérianiser algorithmiser aligner alimenter
allectomiser allégoriser allitiser allivrer allocutionner alloter allouer alinéater alinéatiser aliter alkyler allaiter allectomiser allégoriser
alluder allumer allusionner alluvionner allyler aloter alpaguer alphabétiser allitiser allivrer allocutionner alloter allouer alluder allumer allusionner
alterner aluminer aluminiser aluner alvéoler alvéoliser amabiliser amadouer alluvionner allyler aloter alpaguer alphabétiser alterner aluminer aluminiser
amalgamer amariner amarrer amateloter ambitionner ambler ambrer ambuler aluner alvéoler alvéoliser amabiliser amadouer amalgamer amariner amarrer
améliorer amender amenuiser américaniser ameulonner ameuter amhariser amiauler amateloter ambitionner ambler ambrer ambuler améliorer amender amenuiser
amicoter amidonner amignarder amignoter amignotter aminer ammoniaquer américaniser ameulonner ameuter amhariser amiauler amicoter amidonner
ammoniser ammoxyder amocher amouiller amouracher amourer amphotériser ampouler amignarder amignoter amignotter aminer ammoniaquer ammoniser ammoxyder amocher
amputer amunitionner amurer amuser anagrammatiser anagrammer analyser amouiller amouracher amourer amphotériser ampouler amputer amunitionner amurer
anamorphoser anaphylactiser anarchiser anastomoser anathématiser anatomiser amuser anagrammatiser anagrammer analyser anamorphoser anaphylactiser
ancher anchoiter ancrer anecdoter anecdotiser angéliser anglaiser angler anarchiser anastomoser anathématiser anatomiser ancher anchoiter ancrer
angliciser angoisser anguler animaliser animer aniser ankyloser annexer anecdoter anecdotiser angéliser anglaiser angler angliciser angoisser anguler
annihiler annoter annualiser annuler anodiser ânonner anser antagoniser animaliser animer aniser ankyloser annexer annihiler annoter annualiser
antéposer antérioriser anthropomorphiser anticiper anticoaguler antidater annuler anodiser ânonner anser antagoniser antéposer antérioriser
antiparasiter antiquer antiseptiser anuiter aoûter apaiser apériter apetisser anthropomorphiser anticiper anticoaguler antidater antiparasiter antiquer
apeurer apicaliser apiquer aplaner apologiser aponévrotomiser aponter aposter antiseptiser anuiter aoûter apaiser apériter apetisser apeurer apicaliser
apostiller apostoliser apostropher apostumer apothéoser appareiller apparenter apiquer aplaner apologiser aponévrotomiser aponter aposter apostiller
appeauter appertiser appliquer appointer appoltronner apponter apporter apostoliser apostropher apostumer apothéoser appareiller apparenter appeauter
apposer appréhender apprêter apprivoiser approcher approuver approvisionner appertiser appliquer appointer appoltronner apponter apporter apposer
approximer apurer aquareller arabiser araméiser aramer araser arbitrer arborer appréhender apprêter apprivoiser approcher approuver approvisionner approximer
arboriser arcbouter arc-bouter archaïser architecturer archiver arçonner apurer aquareller arabiser araméiser aramer araser arbitrer arborer arboriser
ardoiser aréniser arer argenter argentiniser argoter argotiser argumenter arcbouter arc-bouter archaïser architecturer archiver arçonner ardoiser
arianiser arimer ariser aristocratiser aristotéliser arithmétiser armaturer aréniser arer argenter argentiniser argoter argotiser argumenter arianiser
armer arnaquer aromatiser arpenter arquebuser arquer arracher arraisonner arimer ariser aristocratiser aristotéliser arithmétiser armaturer armer
arrenter arrêter arrher arrimer arriser arriver arroser arsouiller arnaquer aromatiser arpenter arquebuser arquer arracher arraisonner arrenter
artérialiser articler articuler artificialiser artistiquer aryaniser aryler arrêter arrher arrimer arriser arriver arroser arsouiller artérialiser
ascensionner ascétiser aseptiser asexuer asianiser asiatiser aspecter articler articuler artificialiser artistiquer aryaniser aryler ascensionner
asphalter aspirer assabler assaisonner assassiner assembler assener asséner ascétiser aseptiser asexuer asianiser asiatiser aspecter asphalter aspirer
assermenter asserter assibiler assigner assimiler assister assoiffer assoler assabler assaisonner assassiner assembler assener asséner assermenter asserter
assommer assoner assoter assumer assurer asticoter astiquer athéiser assibiler assigner assimiler assister assoiffer assoler assommer assoner
atlantiser atomiser atourner atropiniser attabler attacher attaquer attarder assoter assumer assurer asticoter astiquer athéiser atlantiser atomiser
attenter attentionner atténuer atterrer attester attifer attirer attiser atourner atropiniser attabler attacher attaquer attarder attenter attentionner
attitrer attraper attremper attribuer attrister attrouper aubiner atténuer atterrer attester attifer attirer attiser attitrer attoucher attraper
attremper attribuer attriquer attrister attrouper aubader aubiner
audiovisualiser auditer auditionner augmenter augurer aulofer auloffer aumôner audiovisualiser auditer auditionner augmenter augurer aulofer auloffer aumôner
auner auréoler ausculter authentiquer autoaccuser autoadapter autoadministrer auner auréoler ausculter authentiquer autoaccuser autoadapter autoadministrer
autoagglutiner autoalimenter autoallumer autoamputer autoanalyser autoancrer autoagglutiner autoalimenter autoallumer autoamputer autoanalyser autoancrer
@ -73,10 +74,10 @@ VERBS = set(
 autodéterminer autodévelopper autodévorer autodicter autodiscipliner
 autodupliquer autoéduquer autoenchâsser autoenseigner autoépurer autoéquiper
 autoévaporiser autoévoluer autoféconder autofertiliser autoflageller
-autofonder autoformer autofretter autogouverner autogreffer autoguider auto-
-immuniser auto-ioniser autolégitimer autolimiter autoliquider autolyser
-automatiser automédiquer automitrailler automutiler autonomiser auto-
-optimaliser auto-optimiser autoorganiser autoperpétuer autopersuader
+autofonder autoformer autofretter autogouverner autogreffer autoguider
+auto-immuniser auto-ioniser autolégitimer autolimiter autoliquider autolyser
+automatiser automédiquer automitrailler automutiler autonomiser
+auto-optimaliser auto-optimiser autoorganiser autoperpétuer autopersuader
 autopiloter autopolliniser autoporter autopositionner autoproclamer
 autopropulser autoréaliser autorecruter autoréglementer autoréguler
 autorelaxer autoréparer autoriser autosélectionner autosevrer autostabiliser
@ -84,7 +85,7 @@ VERBS = set(
 autotracter autotransformer autovacciner autoventiler avaler avaliser
 aventurer aveugler avillonner aviner avironner aviser avitailler aviver
 avoiner avoisiner avorter avouer axéniser axer axiomatiser azimuter azoter
-azurer babiller babouiner bâcher bachonner bachoter bâcler badauder
+azurer babiller babouiner bâcher bachonner bachoter bâcler badauder bader
 badigeonner badiner baffer bafouer bafouiller bâfrer bagarrer bagoter bagouler
 baguenauder baguer baguetter bahuter baigner bailler bâiller baîller
 bâillonner baîllonner baiser baisoter baisouiller baisser bakéliser balader
@ -135,9 +136,9 @@ VERBS = set(
 brouillonner broussailler brousser brouter bruiner bruisser bruiter brûler
 brumer brumiser bruncher brusquer brutaliser bruter bûcher bucoliser
 budgétiser buer buffériser buffler bugler bugner buiser buissonner bulgariser
-buquer bureaucratiser buriner buser busquer buter butiner butonner butter
-buvoter byzantiner byzantiniser cabaler cabaliser cabaner câbler cabosser
-caboter cabotiner cabrer cabrioler cacaber cacaoter cacarder cacher
+buller buquer bureaucratiser buriner buser busquer buter butiner butonner
+butter buvoter byzantiner byzantiniser cabaler cabaliser cabaner câbler
+cabosser caboter cabotiner cabrer cabrioler cacaber cacaoter cacarder cacher
 cachetonner cachotter cadastrer cadavériser cadeauter cadetter cadoter cadrer
 cafarder cafeter cafouiller cafter cageoler cagnarder cagner caguer cahoter
 caillebotter cailler caillouter cajoler calaminer calamistrer calamiter
@ -185,65 +186,66 @@ VERBS = set(
claveliser claver clavetter clayonner cléricaliser clicher cligner clignoter claveliser claver clavetter clayonner cléricaliser clicher cligner clignoter
climatiser clinquanter clinquer cliper cliquer clisser cliver clochardiser climatiser clinquanter clinquer cliper cliquer clisser cliver clochardiser
clocher clocter cloisonner cloîtrer cloner cloper clopiner cloquer clôturer clocher clocter cloisonner cloîtrer cloner cloper clopiner cloquer clôturer
clouer clouter coaccuser coacerver coacher coadapter coagglutiner coaguler clotûrer clouer clouter coaccuser coacerver coacher coadapter coagglutiner
coaliser coaltarer coaltariser coanimer coarticuler cobelligérer cocaïniser coaguler coaliser coaltarer coaltariser coanimer coarticuler cobelligérer
cocarder cocheniller cocher côcher cochonner coconiser coconner cocooner cocaïniser cocarder cocheniller cocher côcher cochonner coconiser coconner
cocoter coder codéterminer codiller coéditer coéduquer coexister coexploiter cocooner cocoter coder codéterminer codiller coéditer coéduquer coexister
coexprimer coffiner coffrer cofonder cogiter cogner cogouverner cohabiter coexploiter coexprimer coffiner coffrer cofonder cogiter cogner cogouverner
cohériter cohober coiffer coincher coincider coïncider coïter colchiciner cohabiter cohériter cohober coiffer coincher coincider coïncider coïter
collaber collaborer collationner collecter collectionner collectiviser coller colchiciner collaber collaborer collationner collecter collectionner
collisionner colloquer colluvionner colmater colombianiser colombiner collectiviser coller collisionner colloquer colluvionner colmater
coloniser colorer coloriser colostomiser colporter colpotomiser coltiner colombianiser colombiner coloniser colorer coloriser colostomiser colporter
columniser combiner combler commander commanditer commémorer commenter colpotomiser coltiner columniser combiner combler commander commanditer
commercialiser comminer commissionner commotionner commuer communaliser commémorer commenter commercialiser comminer commissionner commotionner
communautariser communiquer communiser commuter compacifier compacter comparer commuer communaliser communautariser communiquer communiser commuter
compartimenter compenser compiler compisser complanter complémenter compacifier compacter comparer compartimenter compenser compiler compisser
complétiviser complexer complimenter compliquer comploter comporter composer complanter complémenter complétiviser complexer complimenter compliquer
composter compoter compounder compresser comprimer comptabiliser compter comploter comporter composer composter compoter compounder compresser
compulser computer computériser concentrer conceptualiser concerner concerter comprimer comptabiliser compter compulser computer computériser concentrer
concher conciliabuler concocter concomiter concorder concrétionner concrétiser conceptualiser concerner concerter concher conciliabuler concocter concomiter
concubiner condamner condenser condimenter conditionner confabuler concorder concrétionner concrétiser concubiner condamner condenser condimenter
confectionner confédéraliser confesser confessionnaliser configurer confiner conditionner confabuler confectionner confédéraliser confesser
confirmer confisquer confiter confluer conformer conforter confronter confessionnaliser configurer confiner confirmer confisquer confiter confluer
confusionner congestionner conglober conglutiner congoliser congratuler conformer conforter confronter confusionner congestionner conglober
coniser conjecturer conjointer conjuger conjuguer conjurer connecter conniver conglutiner congoliser congratuler coniser conjecturer conjointer conjuger
connoter conquêter consacrer conscientiser conseiller conserver consigner conjuguer conjurer connecter conniver connoter conquêter consacrer
consister consoler consolider consommariser consommer consonantiser consoner conscientiser conseiller conserver consigner consister consoler consolider
conspirer conspuer constater consteller conster consterner constiper consommariser consommer consonantiser consoner conspirer conspuer constater
constituer constitutionnaliser consulter consumer contacter contagionner consteller conster consterner constiper constituer constitutionnaliser
containeriser containériser contaminer contemner contempler conteneuriser consulter consumer contacter contagionner containeriser containériser
contenter conter contester contextualiser continentaliser contingenter contaminer contemner contempler conteneuriser contenter conter contester
continuer contorsionner contourner contracter contractualiser contracturer contextualiser continentaliser contingenter continuer contorsionner contourner
contraposer contraster contre-attaquer contrebouter contrebuter contrecalquer contracter contractualiser contracturer contraposer contraster contre-attaquer
contrecarrer contre-expertiser contreficher contrefraser contre-indiquer contrebouter contrebuter contrecalquer contrecarrer contre-expertiser
contremander contremanifester contremarcher contremarquer contreminer contreficher contrefraser contre-indiquer contremander contremanifester
contremurer contrenquêter contreplaquer contrepointer contrer contresigner contremarcher contremarquer contreminer contremurer contrenquêter
contrespionner contretyper contreventer contribuer contrister contrôler contreplaquer contrepointer contrer contresigner contrespionner contretyper
controuver controverser contusionner conventionnaliser conventionner contreventer contribuer contrister contrôler controuver controverser
conventualiser converser convoiter convoler convoquer convulser convulsionner contusionner conventionnaliser conventionner conventualiser converser
cooccuper coopératiser coopter coordonner coorganiser coparrainer coparticiper convoiter convoler convoquer convulser convulsionner cooccuper coopératiser
copermuter copiner copolycondenser copolymériser coprésenter coprésider copser coopter coordonner coorganiser coparrainer coparticiper copermuter copiner
copter copuler copyrighter coqueliner coquer coqueriquer coquiller corailler copolycondenser copolymériser coprésenter coprésider copser copter copuler
corder cordonner coréaliser coréaniser coréguler coresponsabiliser cornaquer copyrighter coqueliner coquer coqueriquer coquiller corailler corder cordonner
cornemuser corner coroniser corporiser correctionaliser correctionnaliser coréaliser coréaniser coréguler coresponsabiliser cornaquer cornemuser corner
correler corréler corroborer corroder corser corticaliser cosigner cosmétiquer coroniser corporiser correctionaliser correctionnaliser correler corréler
cosser costumer coter cotillonner cotiser cotonner cotransfecter couaquer corroborer corroder corser corticaliser cosigner cosmétiquer cosser costumer
couarder couchailler coucher couchoter couchotter coucouer coucouler couder coter cotillonner cotiser cotonner cotransfecter couaquer couarder couchailler
coudrer couillonner couiner couler coulisser coupailler coupeller couper coucher couchoter couchotter coucouer coucouler couder coudrer couillonner
couperoser coupler couponner courailler courbaturer courber courbetter couiner couler coulisser coupailler coupeller couper couperoser coupler
courcailler couronner courrieler courser courtauder court-circuiter courtiser couponner courailler courbaturer courber courbetter courcailler couronner
cousiner coussiner coûter couturer couver cracher crachiner crachoter courrieler courser courtauder court-circuiter courtiser cousiner coussiner
crachouiller crailler cramer craminer cramper cramponner crampser cramser coûter couturer couver cracher crachiner crachoter crachouiller crailler
craner crâner crânoter cranter crapahuter crapaüter crapser crapuler craquer cramer craminer cramper cramponner crampser cramser craner crâner crânoter
crasher cratériser craticuler cratoniser cravacher cravater crawler crayonner cranter crapahuter crapaüter crapser crapuler craquer crasher cratériser
crédibiliser créditer crématiser créoliser créosoter crêper crépiner crépiter craticuler cratoniser cravacher cravater crawler crayonner crédibiliser
crésyler crêter crétiniser creuser criailler cribler criminaliser criquer créditer crématiser créoliser créosoter crêper crépiner crépiter crésyler
crisper crisser cristalliser criticailler critiquer crocher croiser crôler crêter crétiniser creuser criailler cribler criminaliser criquer crisper
croquer croskiller crosser crotoniser crotter crouler croupionner crouponner crisser cristalliser criticailler critiquer crocher croiser crôler croquer
croskiller crosser crotoniser crotter crouler croupionner crouponner
croustiller croûter croûtonner cryoappliquer cryocautériser cryocoaguler croustiller croûter croûtonner cryoappliquer cryocautériser cryocoaguler
cryoconcentrer cryodécaper cryoébarber cryofixer cryogéniser cryomarquer cryoconcentrer cryodécaper cryoébarber cryofixer cryogéniser cryomarquer
cryosorber crypter cuber cueiller cuider cuisiner cuiter cuivrer culbuter cryosorber crypter cuber cueiller cuider cuisiner cuivrer culbuter culer
culer culminer culotter culpabiliser cultiver culturaliser cumuler curariser culminer culotter culpabiliser cultiver culturaliser cumuler curariser
curedenter curer curetter customiser cuter cutiniser cuver cyaniser cyanoser curedenter curer curetter customiser cuter cutiniser cuver cyaniser cyanoser
cyanurer cybernétiser cycler cycliser cycloner cylindrer dactylocoder daguer cyanurer cybernétiser cycler cycliser cycloner cylindrer dactylocoder daguer
daguerréotyper daïer daigner dailler daller damasquiner damer damner daguerréotyper daïer daigner dailler daller damasquiner damer damner
@ -748,8 +750,8 @@ VERBS = set(
 mithridatiser mitonner mitrailler mixer mixter mixtionner mobiliser modaliser
 modéliser modérantiser moderniser moduler moellonner mofler moirer moiser
 moissonner molarder molariser moléculariser molester moletter mollarder
-molletter monarchiser mondaniser monder mondialiser monétariser monétiser
-moniliser monologuer monomériser monophtonguer monopoler monopoliser
+molletonner molletter monarchiser mondaniser monder mondialiser monétariser
+monétiser moniliser monologuer monomériser monophtonguer monopoler monopoliser
 monoprogrammer monosiallitiser monotoniser monseigneuriser monter montrer
 monumentaliser moquer moquetter morailler moraliser mordailler mordiller
 mordillonner mordorer mordoriser morfailler morfaler morfiler morfler morganer
@ -792,63 +794,64 @@ VERBS = set(
palpiter palucher panacher panader pancarter paner paniquer panneauter panner palpiter palucher panacher panader pancarter paner paniquer panneauter panner
pannetonner panoramiquer panser pantiner pantomimer pantoufler paoner paonner pannetonner panoramiquer panser pantiner pantomimer pantoufler paoner paonner
papelarder papillonner papilloter papoter papouiller paquer paraboliser papelarder papillonner papilloter papoter papouiller paquer paraboliser
parachuter parader parafer paraffiner paralléliser paralyser paramétriser parachuter parader parafer paraffiner paraisonner paralléliser paralyser
parangonner parapher paraphraser parasiter parcellariser parceller parcelliser paramétriser parangonner parapher paraphraser parasiter parcellariser
parcheminer parcoriser pardonner parementer parenthétiser parer paresser parceller parcelliser parcheminer parcoriser pardonner parementer
parfiler parfumer parisianiser parjurer parkériser parlementer parler parloter parenthétiser parer paresser parfiler parfumer parisianiser parjurer
parlotter parquer parrainer participer particulariser partitionner partouzer parkériser parlementer parler parloter parlotter parquer parrainer participer
pasquiner pasquiniser passefiler passementer passepoiler passeriller particulariser partitionner partouzer pasquiner pasquiniser passefiler
passionnaliser passionner pasteller pasteuriser pasticher pastiller pastoriser passementer passepoiler passeriller passionnaliser passionner pasteller
patafioler pateliner patenter paternaliser paterner pathétiser patienter pasteuriser pasticher pastiller pastoriser patafioler pateliner patenter
patiner pâtisser patoiser pâtonner patouiller patrimonialiser patrociner paternaliser paterner pathétiser patienter patiner pâtisser patoiser pâtonner
patronner patrouiller patter pâturer paumer paupériser pauser pavaner paver patouiller patrimonialiser patrociner patronner patrouiller patter pâturer
pavoiser peaufiner pébriner pécher pêcher pécloter pectiser pédaler pédanter paumer paupériser pauser pavaner paver pavoiser peaufiner pébriner pécher
pédantiser pédiculiser pédicurer pédimenter peigner peiner peinturer pêcher pécloter pectiser pédaler pédanter pédantiser pédiculiser pédicurer
peinturlurer péjorer pelaner pelauder péleriner pèleriner pelletiser pédimenter peigner peiner peinturer peinturlurer péjorer pelaner pelauder
pelleverser pelliculer peloter pelotonner pelucher pelurer pénaliser pencher péleriner pèleriner pelletiser pelleverser pelliculer peloter pelotonner
pendeloquer pendiller pendouiller penduler pénéplaner penser pensionner pelucher pelurer pénaliser pencher pendeloquer pendiller pendouiller penduler
peptiser peptoniser percaliner percher percoler percuter perdurer pérégriner pénéplaner penser pensionner peptiser peptoniser percaliner percher percoler
pérenniser perfectionner perforer performer perfuser péricliter périmer percuter perdurer pérégriner pérenniser perfectionner perforer performer
périodiser périphériser périphraser péritoniser perler permanenter permaner perfuser péricliter périmer périodiser périphériser périphraser péritoniser
perméabiliser permuter pérorer pérouaniser peroxyder perpétuer perquisitionner perler permanenter permaner perméabiliser permuter pérorer pérouaniser
perreyer perruquer persécuter persifler persiller persister personnaliser peroxyder perpétuer perquisitionner perreyer perruquer persécuter persifler
persuader perturber pervibrer pester pétarader pétarder pétiller pétitionner persiller persister personnaliser persuader perturber pervibrer pester
pétocher pétouiller pétrarquiser pétroliser pétuner peupler pexer pétarader pétarder pétiller pétitionner pétocher pétouiller pétrarquiser
phacoémulsifier phagocyter phalangiser pharyngaliser phéniquer phénoler pétroliser pétuner peupler pexer phacoémulsifier phagocyter phalangiser
phényler philosophailler philosopher phlébotomiser phlegmatiser phlogistiquer pharyngaliser phéniquer phénoler phényler philosophailler philosopher
phonétiser phonologiser phosphater phosphorer phosphoriser phosphoryler phlébotomiser phlegmatiser phlogistiquer phonétiser phonologiser phosphater
photoactiver photocomposer photograver photo-ioniser photoïoniser photomonter phosphorer phosphoriser phosphoryler photoactiver photocomposer photograver
photophosphoryler photopolymériser photosensibiliser phraser piaffer piailler photo-ioniser photoïoniser photomonter photophosphoryler photopolymériser
pianomiser pianoter piauler pickler picocher picoler picorer picoter picouser photosensibiliser phraser piaffer piailler pianomiser pianoter piauler pickler
picouzer picrater pictonner picturaliser pidginiser piédestaliser pierrer picocher picoler picorer picoter picouser picouzer picrater pictonner
piétiner piétonnifier piétonniser pieuter pifer piffer piffrer pigeonner picturaliser pidginiser piédestaliser pierrer piétiner piétonnifier
pigmenter pigner pignocher pignoler piler piller pilloter pilonner piloter piétonniser pieuter pifer piffer piffrer pigeonner pigmenter pigner pignocher
pimenter pinailler pinceauter pinçoter pindariser pinter piocher pionner pignoler piler piller pilloter pilonner piloter pimenter pinailler pinceauter
piotter piper piqueniquer pique-niquer piquer piquetonner piquouser piquouzer pinçoter pindariser pinter piocher pionner piotter piper piqueniquer
pirater pirouetter piser pisser pissoter pissouiller pistacher pister pistoler pique-niquer piquer piquetonner piquouser piquouzer pirater pirouetter piser
pistonner pitancher pitcher piter pitonner pituiter pivoter placarder pisser pissoter pissouiller pistacher pister pistoler pistonner pitancher
placardiser plafonner plaider plainer plaisanter plamer plancher planer pitcher piter pitonner pituiter pivoter placarder placardiser plafonner
planétariser planétiser planquer planter plaquer plasmolyser plastiquer plaider plainer plaisanter plamer plancher planer planétariser planétiser
plastronner platiner platiniser platoniser plâtrer plébisciter pleurailler planquer planter plaquer plasmolyser plastiquer plastronner platiner
pleuraliser pleurer pleurnicher pleuroter pleuviner pleuvioter pleuvoter platiniser platoniser plâtrer plébisciter pleurailler pleuraliser pleurer
plisser plissoter plomber ploquer plotiniser plouter ploutrer plucher pleurnicher pleuroter pleuviner pleuvioter pleuvoter plisser plissoter plomber
plumarder plumer pluraliser plussoyer pluviner pluvioter pocharder pocher ploquer plotiniser plouter ploutrer plucher plumarder plumer pluraliser
pochetronner pochtronner poculer podzoliser poêler poétiser poignarder poigner plussoyer pluviner pluvioter pocharder pocher pochetronner pochtronner poculer
poiler poinçonner pointer pointiller poireauter poirer poiroter poisser podzoliser poêler poétiser poignarder poigner poiler poinçonner pointer
poitriner poivrer poivroter polariser poldériser polémiquer polissonner pointiller poireauter poirer poiroter poisser poitriner poivrer poivroter
politicailler politiquer politiser polker polliciser polliniser polluer polariser poldériser polémiquer polissonner politicailler politiquer politiser
poloniser polychromer polycontaminer polygoner polygoniser polymériser polker polliciser polliniser polluer poloniser polychromer polycontaminer
polyploïdiser polytransfuser polyviser pommader pommer pomper pomponner polygoner polygoniser polymériser polyploïdiser polytransfuser polyviser
ponctionner ponctuer ponter pontiller populariser poquer porer porphyriser pommader pommer pomper pomponner ponctionner ponctuer ponter pontiller
porter porteuser portionner portoricaniser portraicturer portraiturer poser populariser poquer porer porphyriser porter porteuser portionner
positionner positiver possibiliser postdater poster postérioriser posticher portoricaniser portraicturer portraiturer poser positionner positiver
postillonner postposer postsonoriser postsynchroniser postuler potabiliser possibiliser postdater poster postérioriser posticher postillonner postposer
potentialiser poter poteyer potiner poudrer pouffer pouiller pouliner pouloper postsonoriser postsynchroniser postuler potabiliser potentialiser poter
poulotter pouponner pourpenser pourprer poussailler pousser poutser praliner poteyer potiner poudrer pouffer pouiller pouliner pouloper poulotter pouponner
pratiquer préaccentuer préadapter préallouer préassembler préassimiler pourpenser pourprer poussailler pousser poutser praliner pratiquer
préaviser précariser précautionner prêchailler préchauffer préchauler prêcher préaccentuer préadapter préallouer préassembler préassimiler préaviser
précipiter préciser préciter précompter préconditionner préconfigurer précariser précautionner prêchailler préchauffer préchauler prêcher précipiter
préconiser préconstituer précoter prédater prédécouper prédésigner prédestiner préciser préciter précompter préconditionner préconfigurer préconiser
préconstituer précoter prédater prédécouper prédésigner prédestiner
prédéterminer prédiffuser prédilectionner prédiquer prédisposer prédominer prédéterminer prédiffuser prédilectionner prédiquer prédisposer prédominer
préemballer préempter préencoller préenregistrer préenrober préexaminer préemballer préempter préencoller préenregistrer préenrober préexaminer
préexister préfabriquer préfaner préfigurer préfixer préformater préformer préexister préfabriquer préfaner préfigurer préfixer préformater préformer
@ -879,8 +882,8 @@ VERBS = set(
raccommoder raccompagner raccorder raccoutrer raccoutumer raccrocher racémiser raccommoder raccompagner raccorder raccoutrer raccoutumer raccrocher racémiser
rachalander racher raciner racketter racler râcler racoler raconter racoquiner rachalander racher raciner racketter racler râcler racoler raconter racoquiner
radariser rader radicaliser radiner radioactiver radiobaliser radiocommander radariser rader radicaliser radiner radioactiver radiobaliser radiocommander
radioconserver radiodétecter radiodiffuser radioexposer radioguider radio- radioconserver radiodétecter radiodiffuser radioexposer radioguider
immuniser radiolocaliser radiopasteuriser radiosonder radiostériliser radio-immuniser radiolocaliser radiopasteuriser radiosonder radiostériliser
radiotéléphoner radiotéléviser radoter radouber rafaler raffermer raffiler radiotéléphoner radiotéléviser radoter radouber rafaler raffermer raffiler
raffiner raffluer raffoler raffûter rafistoler rafler ragoter ragoûter raffiner raffluer raffoler raffûter rafistoler rafler ragoter ragoûter
ragrafer raguer raguser raiguiser railler rainer rainurer raisonner rajouter ragrafer raguer raguser raiguiser railler rainer rainurer raisonner rajouter
@ -1123,19 +1126,21 @@ VERBS = set(
sommer somnambuler somniloquer somnoler sonder sonnailler sonner sonoriser sommer somnambuler somniloquer somnoler sonder sonnailler sonner sonoriser
sophistiquer sorguer soubresauter souder souffler souffroter soufrer souhaiter sophistiquer sorguer soubresauter souder souffler souffroter soufrer souhaiter
souiller souillonner soûler souligner soûlotter soumissionner soupailler souiller souillonner soûler souligner soûlotter soumissionner soupailler
soupçonner souper soupirer souquer sourciller sourdiner sous-capitaliser sous- soupçonner souper soupirer souquer sourciller sourdiner sous-alimenter
catégoriser sousestimer sous-estimer sous-industrialiser sous-médicaliser sous-capitaliser sous-catégoriser sous-équiper sousestimer sous-estimer
sousperformer sous-qualifier soussigner sous-titrer sous-utiliser soutacher sous-évaluer sous-exploiter sous-exposer sous-industrialiser sous-louer
souter soutirer soviétiser spammer spasmer spatialiser spatuler spécialiser sous-médicaliser sousperformer sous-qualifier soussigner sous-titrer
spéculer sphéroïdiser spilitiser spiraler spiraliser spirantiser spiritualiser sous-traiter sous-utiliser sous-virer soutacher souter soutirer soviétiser
spitter splénectomiser spléniser sponsoriser sporter sporuler sprinter spammer spasmer spatialiser spatuler spécialiser spéculer sphéroïdiser
squatériser squatter squatteriser squattériser squeezer stabiliser stabuler spilitiser spiraler spiraliser spirantiser spiritualiser spitter
staffer stagner staliniser standardiser standoliser stanioler stariser splénectomiser spléniser sponsoriser sporter sporuler sprinter squatériser
stationner statistiquer statuer stelliter stenciler stendhaliser sténoser squatter squatteriser squattériser squeezer stabiliser stabuler staffer
sténotyper stepper stéréotyper stériliser stigmatiser stimuler stipuler stagner staliniser standardiser standoliser stanioler stariser stationner
stocker stoloniser stopper stranguler stratégiser stresser strider striduler statistiquer statuer stelliter stenciler stendhaliser sténoser sténotyper
striper stripper striquer stronker strouiller structurer strychniser stuquer stepper stéréotyper stériliser stigmatiser stimuler stipuler stocker
styler styliser subalterniser subdiviser subdivisionner subériser subjectiver stoloniser stopper stranguler stratégiser stresser strider striduler striper
stripper striquer stronker strouiller structurer strychniser stuquer styler
styliser subalterniser subdiviser subdivisionner subériser subjectiver
subjectiviser subjuguer sublimer sublimiser subluxer subminiaturiser subodorer subjectiviser subjuguer sublimer sublimiser subluxer subminiaturiser subodorer
subordonner suborner subsister substanter substantialiser substantiver subordonner suborner subsister substanter substantialiser substantiver
substituer subsumer subtiliser suburbaniser subventionner succomber suçoter substituer subsumer subtiliser suburbaniser subventionner succomber suçoter

File diff suppressed because it is too large


@ -1,7 +1,7 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
from ....symbols import POS, NOUN, VERB, ADJ, ADV, PRON, DET, AUX, PUNCT from ....symbols import POS, NOUN, VERB, ADJ, ADV, PRON, DET, AUX, PUNCT, ADP, SCONJ, CCONJ
from ....symbols import VerbForm_inf, VerbForm_none, Number_sing, Degree_pos from ....symbols import VerbForm_inf, VerbForm_none, Number_sing, Degree_pos
from .lookup import LOOKUP from .lookup import LOOKUP
@ -9,7 +9,7 @@ from .lookup import LOOKUP
French language lemmatizer applies the default rule based lemmatization French language lemmatizer applies the default rule based lemmatization
procedure with some modifications for better French language support. procedure with some modifications for better French language support.
The parts of speech 'ADV', 'PRON', 'DET' and 'AUX' are added to use the The parts of speech 'ADV', 'PRON', 'DET', 'ADP' and 'AUX' are added to use the
rule-based lemmatization. As a last resort, the lemmatizer checks in rule-based lemmatization. As a last resort, the lemmatizer checks in
the lookup table. the lookup table.
''' '''
@ -34,16 +34,22 @@ class FrenchLemmatizer(object):
univ_pos = 'verb' univ_pos = 'verb'
elif univ_pos in (ADJ, 'ADJ', 'adj'): elif univ_pos in (ADJ, 'ADJ', 'adj'):
univ_pos = 'adj' univ_pos = 'adj'
elif univ_pos in (ADP, 'ADP', 'adp'):
univ_pos = 'adp'
elif univ_pos in (ADV, 'ADV', 'adv'): elif univ_pos in (ADV, 'ADV', 'adv'):
univ_pos = 'adv' univ_pos = 'adv'
elif univ_pos in (PRON, 'PRON', 'pron'):
univ_pos = 'pron'
elif univ_pos in (DET, 'DET', 'det'):
univ_pos = 'det'
elif univ_pos in (AUX, 'AUX', 'aux'): elif univ_pos in (AUX, 'AUX', 'aux'):
univ_pos = 'aux' univ_pos = 'aux'
elif univ_pos in (CCONJ, 'CCONJ', 'cconj'):
univ_pos = 'cconj'
elif univ_pos in (DET, 'DET', 'det'):
univ_pos = 'det'
elif univ_pos in (PRON, 'PRON', 'pron'):
univ_pos = 'pron'
elif univ_pos in (PUNCT, 'PUNCT', 'punct'): elif univ_pos in (PUNCT, 'PUNCT', 'punct'):
univ_pos = 'punct' univ_pos = 'punct'
elif univ_pos in (SCONJ, 'SCONJ', 'sconj'):
univ_pos = 'sconj'
else: else:
return [self.lookup(string)] return [self.lookup(string)]
# See Issue #435 for example of where this logic is required. # See Issue #435 for example of where this logic is required.
@ -100,7 +106,7 @@ class FrenchLemmatizer(object):
def lookup(self, string): def lookup(self, string):
if string in self.lookup_table: if string in self.lookup_table:
return self.lookup_table[string] return self.lookup_table[string][0]
return string return string
@ -125,7 +131,7 @@ def lemmatize(string, index, exceptions, rules):
if not forms: if not forms:
forms.extend(oov_forms) forms.extend(oov_forms)
if not forms and string in LOOKUP.keys(): if not forms and string in LOOKUP.keys():
forms.append(LOOKUP[string]) forms.append(LOOKUP[string][0])
if not forms: if not forms:
forms.append(string) forms.append(string)
return list(set(forms)) return list(set(forms))
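
The two `[0]` changes above suggest that each lookup entry is now a sequence of candidate lemmas rather than a single string. Below is a minimal sketch of that fallback, assuming tuple values. `TOY_LOOKUP` and `lookup_first` are illustrative names and are not part of spaCy.

```python
# Hypothetical toy table: each surface form maps to a tuple of candidate
# lemmas (an assumption inferred from the `[0]` indexing in the diff).
TOY_LOOKUP = {
    "chevaux": ("cheval",),
    "suis": ("être", "suivre"),
}


def lookup_first(table, string):
    """Return the first candidate lemma, or the string itself if unknown."""
    candidates = table.get(string)
    if candidates:
        return candidates[0]
    return string
```

If the table values were plain strings, indexing with `[0]` would return only the first character, so the table and the lookup code have to change together.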

File diff suppressed because it is too large


@ -1,16 +1,15 @@
# encoding: utf8 # encoding: utf8
from __future__ import unicode_literals, print_function from __future__ import unicode_literals, print_function
from ...language import Language
from ...attrs import LANG
from ...tokens import Doc, Token
from ...tokenizer import Tokenizer
from ... import util
from .tag_map import TAG_MAP
import re import re
from collections import namedtuple from collections import namedtuple
from .tag_map import TAG_MAP
from ...attrs import LANG
from ...language import Language
from ...tokens import Doc, Token
from ...util import DummyTokenizer
ShortUnitWord = namedtuple("ShortUnitWord", ["surface", "lemma", "pos"]) ShortUnitWord = namedtuple("ShortUnitWord", ["surface", "lemma", "pos"])
@ -46,12 +45,12 @@ def resolve_pos(token):
# PoS mappings. # PoS mappings.
if token.pos == "連体詞,*,*,*": if token.pos == "連体詞,*,*,*":
if re.match("^[こそあど此其彼]の", token.surface): if re.match(r"[こそあど此其彼]の", token.surface):
return token.pos + ",DET" return token.pos + ",DET"
if re.match("^[こそあど此其彼]", token.surface): if re.match(r"[こそあど此其彼]", token.surface):
return token.pos + ",PRON" return token.pos + ",PRON"
else: return token.pos + ",ADJ"
return token.pos + ",ADJ"
return token.pos return token.pos
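
Dropping the leading `^` from the patterns should not change behavior, because `re.match` only matches at the start of the string. A small check of that claim follows; the sample strings and the helper `same_result` are made up for illustration.

```python
import re

# Old pattern (explicit anchor) and new pattern (raw string, no anchor).
WITH_ANCHOR = re.compile("^[こそあど此其彼]の")
WITHOUT_ANCHOR = re.compile(r"[こそあど此其彼]の")


def same_result(text):
    """True if both patterns agree on whether `text` matches at position 0."""
    return bool(WITH_ANCHOR.match(text)) == bool(WITHOUT_ANCHOR.match(text))


# Illustrative samples: some start with a demonstrative, some do not.
SAMPLES = ["この", "その本", "あの", "本この", "の", ""]
ALL_AGREE = all(same_result(s) for s in SAMPLES)
```

The equivalence holds for `re.match` only. With `re.search`, removing the anchor would let the pattern match anywhere in the surface string.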
@ -68,7 +67,8 @@ def detailed_tokens(tokenizer, text):
pos = ",".join(parts[0:4]) pos = ",".join(parts[0:4])
if len(parts) > 7: if len(parts) > 7:
# this information is only available for words in the tokenizer dictionary # this information is only available for words in the tokenizer
# dictionary
base = parts[7] base = parts[7]
words.append(ShortUnitWord(surface, base, pos)) words.append(ShortUnitWord(surface, base, pos))
@ -76,38 +76,27 @@ def detailed_tokens(tokenizer, text):
return words return words
class JapaneseTokenizer(object): class JapaneseTokenizer(DummyTokenizer):
def __init__(self, cls, nlp=None): def __init__(self, cls, nlp=None):
self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp) self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
MeCab = try_mecab_import() self.tokenizer = try_mecab_import().Tagger()
self.tokenizer = MeCab.Tagger()
self.tokenizer.parseToNode("") # see #2901 self.tokenizer.parseToNode("") # see #2901
def __call__(self, text): def __call__(self, text):
dtokens = detailed_tokens(self.tokenizer, text) dtokens = detailed_tokens(self.tokenizer, text)
words = [x.surface for x in dtokens] words = [x.surface for x in dtokens]
doc = Doc(self.vocab, words=words, spaces=[False] * len(words)) spaces = [False] * len(words)
doc = Doc(self.vocab, words=words, spaces=spaces)
for token, dtoken in zip(doc, dtokens): for token, dtoken in zip(doc, dtokens):
token._.mecab_tag = dtoken.pos token._.mecab_tag = dtoken.pos
token.tag_ = resolve_pos(dtoken) token.tag_ = resolve_pos(dtoken)
token.lemma_ = dtoken.lemma token.lemma_ = dtoken.lemma
return doc return doc
# add dummy methods for to_bytes, from_bytes, to_disk and from_disk to
# allow serialization (see #1557)
def to_bytes(self, **exclude):
return b""
def from_bytes(self, bytes_data, **exclude):
return self
def to_disk(self, path, **exclude):
return None
def from_disk(self, path, **exclude):
return self
class JapaneseCharacterSegmenter(object): class JapaneseCharacterSegmenter(object):
def __init__(self, vocab): def __init__(self, vocab):
@ -154,7 +143,8 @@ class JapaneseCharacterSegmenter(object):
class JapaneseDefaults(Language.Defaults): class JapaneseDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: "ja" lex_attr_getters[LANG] = lambda _text: "ja"
tag_map = TAG_MAP tag_map = TAG_MAP
use_janome = True use_janome = True
@ -169,7 +159,6 @@ class JapaneseDefaults(Language.Defaults):
class Japanese(Language): class Japanese(Language):
lang = "ja" lang = "ja"
Defaults = JapaneseDefaults Defaults = JapaneseDefaults
Tokenizer = JapaneseTokenizer
def make_doc(self, text): def make_doc(self, text):
return self.tokenizer(text) return self.tokenizer(text)
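
The serialization stubs removed from `JapaneseTokenizer` are now inherited from `DummyTokenizer`. The idea can be sketched without spaCy as shown below. `NoOpSerializable` and `WhitespaceTokenizer` are hypothetical stand-ins, and the method signatures mirror the stubs removed in this diff.

```python
class NoOpSerializable(object):
    """Base class whose serialization methods do nothing.

    Useful for tokenizers that wrap an external tool and have no state
    of their own to save or restore.
    """

    def to_bytes(self, **exclude):
        return b""

    def from_bytes(self, bytes_data, **exclude):
        return self

    def to_disk(self, path, **exclude):
        return None

    def from_disk(self, path, **exclude):
        return self


class WhitespaceTokenizer(NoOpSerializable):
    """Toy tokenizer that only needs to define __call__."""

    def __call__(self, text):
        return text.split()
```

Each language-specific tokenizer then only implements `__call__`, and the shared no-op methods live in one place.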


@ -5,6 +5,7 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .morph_rules import MORPH_RULES from .morph_rules import MORPH_RULES
from .lemmatizer import LEMMA_RULES, LOOKUP from .lemmatizer import LEMMA_RULES, LOOKUP
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS from ..norm_exceptions import BASE_NORMS
@ -20,12 +21,14 @@ class SwedishDefaults(Language.Defaults):
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
) )
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
morph_rules = MORPH_RULES
infixes = TOKENIZER_INFIXES
suffixes = TOKENIZER_SUFFIXES
stop_words = STOP_WORDS stop_words = STOP_WORDS
lemma_rules = LEMMA_RULES lemma_rules = LEMMA_RULES
lemma_lookup = LOOKUP lemma_lookup = LOOKUP
morph_rules = MORPH_RULES morph_rules = MORPH_RULES
class Swedish(Language): class Swedish(Language):
lang = "sv" lang = "sv"
Defaults = SwedishDefaults Defaults = SwedishDefaults


@ -233167,7 +233167,6 @@ LOOKUP = {
"jades": "jade", "jades": "jade",
"jaet": "ja", "jaet": "ja",
"jaets": "ja", "jaets": "ja",
"jag": "jaga",
"jagad": "jaga", "jagad": "jaga",
"jagade": "jaga", "jagade": "jaga",
"jagades": "jaga", "jagades": "jaga",


@ -0,0 +1,25 @@
# coding: utf8
"""Punctuation rules adapted from Danish"""
from __future__ import unicode_literals
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
from ..char_classes import QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
from ..punctuation import TOKENIZER_SUFFIXES
_quotes = QUOTES.replace("'", '')
_infixes = (LIST_ELLIPSES + LIST_ICONS +
[r'(?<=[{}])\.(?=[{}])'.format(ALPHA_LOWER, ALPHA_UPPER),
r'(?<=[{a}])[,!?](?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}])([{q}\)\]\(\[])(?=[\{a}])'.format(a=ALPHA, q=_quotes),
r'(?<=[{a}])--(?=[{a}])'.format(a=ALPHA)])
_suffixes = [suffix for suffix in TOKENIZER_SUFFIXES if suffix not in ["'s", "'S", "s", "S", r"\'"]]
_suffixes += [r"(?<=[^sSxXzZ])\'"]
TOKENIZER_INFIXES = _infixes
TOKENIZER_SUFFIXES = _suffixes


@ -26,14 +26,15 @@ for verb_data in [
{ORTH: "u", LEMMA: PRON_LEMMA, NORM: "du"}, {ORTH: "u", LEMMA: PRON_LEMMA, NORM: "du"},
] ]
# The weekday abbreviation "sön." ("söndag" / "söner") is left out
# because it is ambiguous. The same applies to the abbreviations
# "jul." and "Jul." ("juli" / "jul").
for exc_data in [ for exc_data in [
{ORTH: "jan.", LEMMA: "januari"}, {ORTH: "jan.", LEMMA: "januari"},
{ORTH: "febr.", LEMMA: "februari"}, {ORTH: "febr.", LEMMA: "februari"},
{ORTH: "feb.", LEMMA: "februari"}, {ORTH: "feb.", LEMMA: "februari"},
{ORTH: "apr.", LEMMA: "april"}, {ORTH: "apr.", LEMMA: "april"},
{ORTH: "jun.", LEMMA: "juni"}, {ORTH: "jun.", LEMMA: "juni"},
{ORTH: "jul.", LEMMA: "juli"},
{ORTH: "aug.", LEMMA: "augusti"}, {ORTH: "aug.", LEMMA: "augusti"},
{ORTH: "sept.", LEMMA: "september"}, {ORTH: "sept.", LEMMA: "september"},
{ORTH: "sep.", LEMMA: "september"}, {ORTH: "sep.", LEMMA: "september"},
@ -46,13 +47,11 @@ for exc_data in [
{ORTH: "tors.", LEMMA: "torsdag"}, {ORTH: "tors.", LEMMA: "torsdag"},
{ORTH: "fre.", LEMMA: "fredag"}, {ORTH: "fre.", LEMMA: "fredag"},
{ORTH: "lör.", LEMMA: "lördag"}, {ORTH: "lör.", LEMMA: "lördag"},
{ORTH: "sön.", LEMMA: "söndag"},
{ORTH: "Jan.", LEMMA: "Januari"}, {ORTH: "Jan.", LEMMA: "Januari"},
{ORTH: "Febr.", LEMMA: "Februari"}, {ORTH: "Febr.", LEMMA: "Februari"},
{ORTH: "Feb.", LEMMA: "Februari"}, {ORTH: "Feb.", LEMMA: "Februari"},
{ORTH: "Apr.", LEMMA: "April"}, {ORTH: "Apr.", LEMMA: "April"},
{ORTH: "Jun.", LEMMA: "Juni"}, {ORTH: "Jun.", LEMMA: "Juni"},
{ORTH: "Jul.", LEMMA: "Juli"},
{ORTH: "Aug.", LEMMA: "Augusti"}, {ORTH: "Aug.", LEMMA: "Augusti"},
{ORTH: "Sept.", LEMMA: "September"}, {ORTH: "Sept.", LEMMA: "September"},
{ORTH: "Sep.", LEMMA: "September"}, {ORTH: "Sep.", LEMMA: "September"},
@ -65,28 +64,32 @@ for exc_data in [
{ORTH: "Tors.", LEMMA: "Torsdag"}, {ORTH: "Tors.", LEMMA: "Torsdag"},
{ORTH: "Fre.", LEMMA: "Fredag"}, {ORTH: "Fre.", LEMMA: "Fredag"},
{ORTH: "Lör.", LEMMA: "Lördag"}, {ORTH: "Lör.", LEMMA: "Lördag"},
{ORTH: "Sön.", LEMMA: "Söndag"},
{ORTH: "sthlm", LEMMA: "Stockholm"}, {ORTH: "sthlm", LEMMA: "Stockholm"},
{ORTH: "gbg", LEMMA: "Göteborg"}, {ORTH: "gbg", LEMMA: "Göteborg"},
]: ]:
_exc[exc_data[ORTH]] = [exc_data] _exc[exc_data[ORTH]] = [exc_data]
# Abbreviations that are only registered in this exact casing
for orth in ["AB", "Dr.", "H.M.", "H.K.H.", "m/s", "M/S", "Ph.d.", "S:t", "s:t"]:
_exc[orth] = [{ORTH: orth}]
ABBREVIATIONS = [ ABBREVIATIONS = [
"ang", "ang",
"anm", "anm",
"bil",
"bl.a", "bl.a",
"d.v.s", "d.v.s",
"doc", "doc",
"dvs", "dvs",
"e.d", "e.d",
"e.kr", "e.kr",
"el", "el.",
"eng", "eng",
"etc", "etc",
"exkl", "exkl",
"f", "ev",
"f.",
"f.d", "f.d",
"f.kr", "f.kr",
"f.n", "f.n",
@ -97,10 +100,11 @@ ABBREVIATIONS = [
"fr.o.m", "fr.o.m",
"förf", "förf",
"inkl", "inkl",
"jur", "iofs",
"jur.",
"kap", "kap",
"kl", "kl",
"kor", "kor.",
"kr", "kr",
"kungl", "kungl",
"lat", "lat",
@ -109,9 +113,10 @@ ABBREVIATIONS = [
"m.m", "m.m",
"max", "max",
"milj", "milj",
"min", "min.",
"mos", "mos",
"mt", "mt",
"mvh",
"o.d", "o.d",
"o.s.v", "o.s.v",
"obs", "obs",
@ -125,21 +130,27 @@ ABBREVIATIONS = [
"s.k", "s.k",
"s.t", "s.t",
"sid", "sid",
"s:t",
"t.ex", "t.ex",
"t.h", "t.h",
"t.o.m", "t.o.m",
"t.v", "t.v",
"tel", "tel",
"ung", "ung.",
"vol", "vol",
"v.",
"äv", "äv",
"övers", "övers",
] ]
ABBREVIATIONS = [abbr + "." for abbr in ABBREVIATIONS] + ABBREVIATIONS
# Also add each abbreviation with a trailing period, skipping those that already end with one.
for abbr in list(ABBREVIATIONS):
    if not abbr.endswith("."):
        ABBREVIATIONS.append(abbr + ".")
for orth in ABBREVIATIONS: for orth in ABBREVIATIONS:
_exc[orth] = [{ORTH: orth}] _exc[orth] = [{ORTH: orth}]
capitalized = orth.capitalize()
_exc[capitalized] = [{ORTH: capitalized}]
# Sentences ending in "i." (as in "... peka i."), "m." (as in "...än 2000 m."), # Sentences ending in "i." (as in "... peka i."), "m." (as in "...än 2000 m."),
# should be tokenized as two separate tokens. # should be tokenized as two separate tokens.
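
The abbreviation handling in this file adds a dotted variant for every entry that lacks one and registers a capitalized exception for each entry. Here is a self-contained sketch of the same steps on a toy list. `build_exceptions` is an illustrative helper, and the string key `"ORTH"` stands in for spaCy's `ORTH` symbol.

```python
def build_exceptions(abbreviations):
    """Return a dict of tokenizer exceptions for the given abbreviations."""
    expanded = list(abbreviations)
    for abbr in abbreviations:
        # Add the dotted form unless the entry already ends with a period.
        if not abbr.endswith("."):
            expanded.append(abbr + ".")

    exceptions = {}
    for orth in expanded:
        exceptions[orth] = [{"ORTH": orth}]
        # Register the sentence-initial (capitalized) form as well.
        capitalized = orth.capitalize()
        exceptions[capitalized] = [{"ORTH": capitalized}]
    return exceptions


TOY_EXCEPTIONS = build_exceptions(["bl.a", "el.", "kl"])
```

Building the expanded list separately avoids appending to a list while iterating over it, and entries such as `"el."` never receive a second period.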

24
spacy/lang/ta/__init__.py Normal file

@ -0,0 +1,24 @@
# import language-specific data
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language
from ...attrs import LANG
from ...util import update_exc
# create Defaults class in the module scope (necessary for pickling!)
class TamilDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'ta' # language ISO code
# optional: replace flags with custom functions, e.g. like_num()
lex_attr_getters.update(LEX_ATTRS)
# create actual Language class
class Tamil(Language):
lang = 'ta' # language ISO code
Defaults = TamilDefaults # override defaults
# set default export; this allows the language class to be lazy-loaded
__all__ = ['Tamil']

21
spacy/lang/ta/examples.py Normal file

@ -0,0 +1,21 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.ta.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"கிறிஸ்துமஸ் மற்றும் இனிய புத்தாண்டு வாழ்த்துக்கள்",
"எனக்கு என் குழந்தைப் பருவம் நினைவிருக்கிறது",
"உங்கள் பெயர் என்ன?",
"ஏறத்தாழ இலங்கைத் தமிழரில் மூன்றிலொரு பங்கினர் இலங்கையை விட்டு வெளியேறிப் பிற நாடுகளில் வாழ்கின்றனர்",
"இந்த ஃபோனுடன் சுமார் ரூ.2,990 மதிப்புள்ள போட் ராக்கர்ஸ் நிறுவனத்தின் ஸ்போர்ட் புளூடூத் ஹெட்போன்ஸ் இலவசமாக வழங்கப்படவுள்ளது.",
"மட்டக்களப்பில் பல இடங்களில் வீட்டுத் திட்டங்களுக்கு இன்று அடிக்கல் நாட்டல்",
"ஐ போன்க்கு முகத்தை வைத்து அன்லாக் செய்யும் முறை மற்றும் விரலால் தொட்டு அன்லாக் செய்யும் முறையை வாட்ஸ் ஆப் நிறுவனம் இதற்கு முன் கண்டுபிடித்தது"
]


@ -0,0 +1,44 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_numeral_suffixes = {'பத்து': 'பது', 'ற்று': 'று', 'ரத்து':'ரம்' , 'சத்து': 'சம்'}
_num_words = ['பூச்சியம்', 'ஒரு', 'ஒன்று', 'இரண்டு', 'மூன்று', 'நான்கு', 'ஐந்து', 'ஆறு', 'ஏழு',
'எட்டு', 'ஒன்பது', 'பத்து', 'பதினொன்று', 'பன்னிரண்டு', 'பதின்மூன்று', 'பதினான்கு',
'பதினைந்து', 'பதினாறு', 'பதினேழு', 'பதினெட்டு', 'பத்தொன்பது', 'இருபது',
'முப்பது', 'நாற்பது', 'ஐம்பது', 'அறுபது', 'எழுபது', 'எண்பது', 'தொண்ணூறு',
'நூறு', 'இருநூறு', 'முன்னூறு', 'நாநூறு', 'ஐநூறு', 'அறுநூறு', 'எழுநூறு', 'எண்ணூறு', 'தொள்ளாயிரம்',
'ஆயிரம்', 'ஒராயிரம்', 'லட்சம்', 'மில்லியன்', 'கோடி', 'பில்லியன்', 'டிரில்லியன்']
# 20-89 ,90-899,900-99999 and above have different suffixes
def suffix_filter(text):
    # return the text with a numeral suffix replaced by its base form
    for num_suffix in _numeral_suffixes.keys():
        length = len(num_suffix)
        if len(text) < length:
            # this suffix is longer than the word; try the remaining suffixes
            continue
        elif text.endswith(num_suffix):
            return text[:-length] + _numeral_suffixes[num_suffix]
    return text


def like_num(text):
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    if text.lower() in _num_words:
        return True
    elif suffix_filter(text) in _num_words:
        return True
    return False


LEX_ATTRS = {
LIKE_NUM: like_num
}
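
The numeral check in this file can be sketched in a compact, standalone form. The version below uses only a small subset of the numeral data and relies on `str.endswith`, which is already false when a suffix is longer than the word. The names `NUMERAL_SUFFIXES` and `NUM_WORDS` are illustrative, and the behavior is an assumption based on the code shown in the diff.

```python
# Subset of the numeral data from the diff, kept short for illustration.
NUMERAL_SUFFIXES = [
    ("பத்து", "பது"),
    ("ற்று", "று"),
    ("ரத்து", "ரம்"),
    ("சத்து", "சம்"),
]
NUM_WORDS = {"ஒன்று", "பத்து", "இருபது", "நூறு", "ஆயிரம்"}


def suffix_filter(text):
    """Replace a combining numeral suffix with its base form, if present."""
    for suffix, replacement in NUMERAL_SUFFIXES:
        if text.endswith(suffix):
            return text[: -len(suffix)] + replacement
    return text


def like_num(text):
    """Return True for digits, simple fractions and known numeral words."""
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text in NUM_WORDS or suffix_filter(text) in NUM_WORDS
```

Storing the suffixes as a list of pairs keeps the matching order explicit instead of relying on dictionary ordering.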


@ -0,0 +1,148 @@
# coding: utf8
from __future__ import unicode_literals
_exc = {
# Regional words normal
# Sri Lanka - Wikipedia
"இங்க": "இங்கே",
"வாங்க": "வாருங்கள்",
'ஒண்டு':'ஒன்று',
'கண்டு': 'கன்று',
'கொண்டு': 'கொன்று',
'பண்டி': 'பன்றி',
'பச்ச': 'பச்சை',
'அம்பது': 'ஐம்பது',
'வெச்ச': 'வைத்து',
'வச்ச': 'வைத்து',
'வச்சி': 'வைத்து',
'வாளைப்பழம்':'வாழைப்பழம்',
'மண்ணு': 'மண்',
'பொன்னு': 'பொன்',
'சாவல்': 'சேவல்',
'அங்கால': 'அங்கு',
'அசுப்பு': 'நடமாட்டம்',
'எழுவான் கரை': 'எழுவான்கரை',
'ஓய்யாரம்': 'எழில்',
'ஒளும்பு': 'எழும்பு',
'ஓர்மை': 'துணிவு',
'கச்சை': 'கோவணம்',
'கடப்பு': 'தெருவாசல்',
'சுள்ளி': 'காய்ந்த குச்சி',
'திறாவுதல்': 'தடவுதல்',
'நாசமறுப்பு': 'தொல்லை',
'பரிசாரி': 'வைத்தியன்',
'பறவாதி': 'பேராசைக்காரன்',
'பிசினி': 'உலோபி',
'விசர்': 'பைத்தியம்',
'ஏனம்': 'பாத்திரம்',
'ஏலா': 'இயலாது',
'ஒசில்': 'அழகு',
'ஒள்ளுப்பம்': 'கொஞ்சம்',
# Sri Lankan and Indian
'குத்துமதிப்பு': '',
'நூனாயம்': 'நூல்நயம்',
'பைய': 'மெதுவாக',
'மண்டை': 'தலை',
'வெள்ளனே': 'சீக்கிரம்',
'உசுப்பு': 'எழுப்பு',
'ஆணம்': 'குழம்பு',
'உறக்கம்': 'தூக்கம்',
'பஸ்': 'பேருந்து',
'களவு': 'திருட்டு',
#relationship
'புருசன்': 'கணவன்',
'பொஞ்சாதி': 'மனைவி',
'புள்ள': 'பிள்ளை',
'பிள்ள': 'பிள்ளை',
'ஆம்பிளப்புள்ள': 'ஆண் பிள்ளை',
'பொம்பிளப்புள்ள': 'பெண் பிள்ளை',
'அண்ணாச்சி': 'அண்ணா',
'அக்காச்சி': 'அக்கா',
'தங்கச்சி': 'தங்கை',
#difference words
'பொடியன்': 'சிறுவன்',
'பொட்டை': 'சிறுமி',
'பிறகு': 'பின்பு',
'டக்கென்டு': 'விரைவாக',
'கெதியா': 'விரைவாக',
'கிறுகி': 'திரும்பி',
'போயித்து வாறன்': 'போய் வருகிறேன்',
'வருவாங்களா': 'வருவார்களா',
# regular spokens
'சொல்லு': 'சொல்',
'கேளு': 'கேள்',
'சொல்லுங்க': 'சொல்லுங்கள்',
'கேளுங்க': 'கேளுங்கள்',
'நீங்கள்': 'நீ',
'உன்': 'உன்னுடைய',
# Portuguese formal words
'அலவாங்கு': 'கடப்பாரை',
'ஆசுப்பத்திரி': 'மருத்துவமனை',
'உரோதை': 'சில்லு',
'கடுதாசி': 'கடிதம்',
'கதிரை': 'நாற்காலி',
'குசினி': 'அடுக்களை',
'கோப்பை': 'கிண்ணம்',
'சப்பாத்து': 'காலணி',
'தாச்சி': 'இரும்புச் சட்டி',
'துவாய்': 'துவாலை',
'தவறணை': 'மதுக்கடை',
'பீப்பா': 'மரத்தாழி',
'யன்னல்': 'சாளரம்',
'வாங்கு': 'மரஇருக்கை',
# Dutch formal words
'இறாக்கை': 'பற்சட்டம்',
'இலாட்சி': 'இழுப்பறை',
'கந்தோர்': 'பணிமனை',
'நொத்தாரிசு': 'ஆவண எழுத்துபதிவாளர்',
# English formal words
'இஞ்சினியர்': 'பொறியியலாளர்',
'சூப்பு': 'ரசம்',
'செக்': 'காசோலை',
'சேட்டு': 'மேற்ச்சட்டை',
'மார்க்கட்டு': 'சந்தை',
'விண்ணன்': 'கெட்டிக்காரன்',
# Arabic formal words
'ஈமான்': 'நம்பிக்கை',
'சுன்னத்து': 'விருத்தசேதனம்',
'செய்த்தான்': 'பிசாசு',
'மவுத்து': 'இறப்பு',
'ஹலால்': 'அங்கீகரிக்கப்பட்டது',
'கறாம்': 'நிராகரிக்கப்பட்டது',
# Persian, Hindustani and Hindi formal words
'சுமார்': 'கிட்டத்தட்ட',
'சிப்பாய்': 'போர்வீரன்',
'சிபார்சு': 'சிபாரிசு',
'ஜமீன்': 'பணக்காரா்',
'அசல்': 'மெய்யான',
'அந்தஸ்து': 'கௌரவம்',
'ஆஜர்': 'சமா்ப்பித்தல்',
'உசார்': 'எச்சரிக்கை',
'அச்சா':'நல்ல',
# English words used in text conversations
"bcoz": "ஏனெனில்",
"bcuz": "ஏனெனில்",
"fav": "விருப்பமான",
"morning": "காலை வணக்கம்",
"gdeveng": "மாலை வணக்கம்",
"gdnyt": "இரவு வணக்கம்",
"gdnit": "இரவு வணக்கம்",
"plz": "தயவு செய்து",
"pls": "தயவு செய்து",
"thx": "நன்றி",
"thanx": "நன்றி",
}
NORM_EXCEPTIONS = {}
for string, norm in _exc.items():
    NORM_EXCEPTIONS[string] = norm
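Hand-maintained norm tables of this kind are easy to get subtly wrong, for example with an empty norm or stray whitespace around a value. The following is a small sketch of a sanity check; `find_suspect_norms` is a hypothetical helper, not part of spaCy, and the sample table is made up.

```python
def find_suspect_norms(table):
    """Return the keys whose norm is empty or padded with whitespace."""
    suspects = []
    for string, norm in table.items():
        if not norm or norm != norm.strip():
            suspects.append(string)
    return suspects


# Hypothetical sample table: one clean entry, one empty norm, one padded norm.
sample = {"thx": "thanks", "idk": "", "plz": "please "}
suspects = find_suspect_norms(sample)
```

Here `suspects` contains `"idk"` and `"plz"`; running the same check over `_exc` before building `NORM_EXCEPTIONS` would flag entries worth reviewing.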

spacy/lang/ta/stop_words.py Normal file

@ -0,0 +1,133 @@
# coding: utf8
from __future__ import unicode_literals
# Stop words
STOP_WORDS = set("""
ஒர
என
மற
இந
இத
என
எனபத
பல
ஆக
அலலத
அவர
உள
அந
இவர
என
தல
என
இர
ி
என
வந
இதன
அத
அவன
பலர
என
ினர
இர
தனத
உளளத
என
அதன
தன
ிறக
அவரகள
வர
அவள
ஆகி
இரதத
உளளன
வந
இர
ிகவ
இங
ஓர
இவ
இநதக
பறி
வர
இர
இதி
இப
அவரத
மட
இநதப
என
ி
ஆகி
எனக
இன
அநதப
அன
ஒர
ி
அங
பல
ி
அத
பறி
உன
அதி
அநதக
இதன
அவ
அத
ஏன
எனபத
எல
மட
இங
அங
இடம
இடதி
அதி
அதற
எனவ
ி
ி
மற
ி
எந
எனவ
எனபபட
எனி
அட
இதன
இத
இநதத
இதற
அதன
தவி
வரி
சற
எனக
""".split())


@ -5,24 +5,14 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .tag_map import TAG_MAP from .tag_map import TAG_MAP
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from ...tokens import Doc
from ...language import Language
from ...attrs import LANG from ...attrs import LANG
from ...language import Language
from ...tokens import Doc
from ...util import DummyTokenizer
class ThaiDefaults(Language.Defaults): class ThaiTokenizer(DummyTokenizer):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters) def __init__(self, cls, nlp=None):
lex_attr_getters[LANG] = lambda text: "th"
tokenizer_exceptions = dict(TOKENIZER_EXCEPTIONS)
tag_map = TAG_MAP
stop_words = STOP_WORDS
class Thai(Language):
lang = "th"
Defaults = ThaiDefaults
def make_doc(self, text):
try: try:
from pythainlp.tokenize import word_tokenize from pythainlp.tokenize import word_tokenize
except ImportError: except ImportError:
@ -30,8 +20,35 @@ class Thai(Language):
"The Thai tokenizer requires the PyThaiNLP library: " "The Thai tokenizer requires the PyThaiNLP library: "
"https://github.com/PyThaiNLP/pythainlp" "https://github.com/PyThaiNLP/pythainlp"
) )
words = [x for x in list(word_tokenize(text, "newmm"))]
return Doc(self.vocab, words=words, spaces=[False] * len(words)) self.word_tokenize = word_tokenize
self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
def __call__(self, text):
words = list(self.word_tokenize(text, "newmm"))
spaces = [False] * len(words)
return Doc(self.vocab, words=words, spaces=spaces)
class ThaiDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda _text: "th"
tokenizer_exceptions = dict(TOKENIZER_EXCEPTIONS)
tag_map = TAG_MAP
stop_words = STOP_WORDS
@classmethod
def create_tokenizer(cls, nlp=None):
return ThaiTokenizer(cls, nlp)
class Thai(Language):
lang = "th"
Defaults = ThaiDefaults
def make_doc(self, text):
return self.tokenizer(text)
__all__ = ["Thai"] __all__ = ["Thai"]


@ -5,6 +5,7 @@ from ...attrs import LIKE_NUM
# Thirteen, fifteen etc. are written separate: on üç # Thirteen, fifteen etc. are written separate: on üç
_num_words = [ _num_words = [
"bir", "bir",
"iki", "iki",
@ -28,6 +29,7 @@ _num_words = [
"bin", "bin",
"milyon", "milyon",
"milyar", "milyar",
"trilyon",
"katrilyon", "katrilyon",
"kentilyon", "kentilyon",
] ]


@ -353,10 +353,38 @@ def test_doc_api_similarity_match():
assert doc.similarity(doc2) == 0.0 assert doc.similarity(doc2) == 0.0
def test_lowest_common_ancestor(en_tokenizer): @pytest.mark.parametrize(
tokens = en_tokenizer("the lazy dog slept") "sentence,heads,lca_matrix",
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[2, 1, 1, 0]) [
(
"the lazy dog slept",
[2, 1, 1, 0],
numpy.array([[0, 2, 2, 3], [2, 1, 2, 3], [2, 2, 2, 3], [3, 3, 3, 3]]),
),
(
"The lazy dog slept. The quick fox jumped",
[2, 1, 1, 0, -1, 2, 1, 1, 0],
numpy.array(
[
[0, 2, 2, 3, 3, -1, -1, -1, -1],
[2, 1, 2, 3, 3, -1, -1, -1, -1],
[2, 2, 2, 3, 3, -1, -1, -1, -1],
[3, 3, 3, 3, 3, -1, -1, -1, -1],
[3, 3, 3, 3, 4, -1, -1, -1, -1],
[-1, -1, -1, -1, -1, 5, 7, 7, 8],
[-1, -1, -1, -1, -1, 7, 6, 7, 8],
[-1, -1, -1, -1, -1, 7, 7, 7, 8],
[-1, -1, -1, -1, -1, 8, 8, 8, 8],
]
),
),
],
)
def test_lowest_common_ancestor(en_tokenizer, sentence, heads, lca_matrix):
tokens = en_tokenizer(sentence)
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
lca = doc.get_lca_matrix() lca = doc.get_lca_matrix()
assert (lca == lca_matrix).all()
assert lca[1, 1] == 1 assert lca[1, 1] == 1
assert lca[0, 1] == 2 assert lca[0, 1] == 2
assert lca[1, 2] == 2 assert lca[1, 2] == 2


@ -80,10 +80,24 @@ def test_spans_lca_matrix(en_tokenizer):
tokens = en_tokenizer("the lazy dog slept") tokens = en_tokenizer("the lazy dog slept")
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[2, 1, 1, 0]) doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[2, 1, 1, 0])
lca = doc[:2].get_lca_matrix() lca = doc[:2].get_lca_matrix()
assert lca[0, 0] == 0 assert lca.shape == (2, 2)
assert lca[0, 1] == -1 assert lca[0, 0] == 0 # the & the -> the
assert lca[1, 0] == -1 assert lca[0, 1] == -1 # the & lazy -> dog (out of span)
assert lca[1, 1] == 1 assert lca[1, 0] == -1 # lazy & the -> dog (out of span)
assert lca[1, 1] == 1 # lazy & lazy -> lazy
lca = doc[1:].get_lca_matrix()
assert lca.shape == (3, 3)
assert lca[0, 0] == 0 # lazy & lazy -> lazy
assert lca[0, 1] == 1 # lazy & dog -> dog
assert lca[0, 2] == 2 # lazy & slept -> slept
lca = doc[2:].get_lca_matrix()
assert lca.shape == (2, 2)
assert lca[0, 0] == 0 # dog & dog -> dog
assert lca[0, 1] == 1 # dog & slept -> slept
assert lca[1, 0] == 1 # slept & dog -> slept
assert lca[1, 1] == 1 # slept & slept -> slept
def test_span_similarity_match(): def test_span_similarity_match():
@ -158,15 +172,17 @@ def test_span_as_doc(doc):
def test_span_string_label(doc): def test_span_string_label(doc):
span = Span(doc, 0, 1, label='hello') span = Span(doc, 0, 1, label="hello")
assert span.label_ == 'hello' assert span.label_ == "hello"
assert span.label == doc.vocab.strings['hello'] assert span.label == doc.vocab.strings["hello"]
def test_span_string_set_label(doc): def test_span_string_set_label(doc):
span = Span(doc, 0, 1) span = Span(doc, 0, 1)
span.label_ = 'hello' span.label_ = "hello"
assert span.label_ == 'hello' assert span.label_ == "hello"
assert span.label == doc.vocab.strings['hello'] assert span.label == doc.vocab.strings["hello"]
def test_span_ents_property(doc): def test_span_ents_property(doc):
"""Test span.ents for the """ """Test span.ents for the """


@ -0,0 +1,53 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
SV_TOKEN_EXCEPTION_TESTS = [
('Smörsåsen används bl.a. till fisk', ['Smörsåsen', 'används', 'bl.a.', 'till', 'fisk']),
('Jag kommer först kl. 13 p.g.a. diverse förseningar', ['Jag', 'kommer', 'först', 'kl.', '13', 'p.g.a.', 'diverse', 'förseningar']),
('Anders I. tycker om ord med i i.', ["Anders", "I.", "tycker", "om", "ord", "med", "i", "i", "."])
]
@pytest.mark.parametrize('text,expected_tokens', SV_TOKEN_EXCEPTION_TESTS)
def test_sv_tokenizer_handles_exception_cases(sv_tokenizer, text, expected_tokens):
tokens = sv_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list
@pytest.mark.parametrize('text', ["driveru", "hajaru", "Serru", "Fixaru"])
def test_sv_tokenizer_handles_verb_exceptions(sv_tokenizer, text):
tokens = sv_tokenizer(text)
assert len(tokens) == 2
assert tokens[1].text == "u"
@pytest.mark.parametrize('text',
["bl.a", "m.a.o.", "Jan.", "Dec.", "kr.", "osv."])
def test_sv_tokenizer_handles_abbr(sv_tokenizer, text):
tokens = sv_tokenizer(text)
assert len(tokens) == 1
@pytest.mark.parametrize('text', ["Jul.", "jul.", "sön.", "Sön."])
def test_sv_tokenizer_handles_ambiguous_abbr(sv_tokenizer, text):
tokens = sv_tokenizer(text)
assert len(tokens) == 2
def test_sv_tokenizer_handles_exc_in_text(sv_tokenizer):
text = "Det er bl.a. ikke meningen"
tokens = sv_tokenizer(text)
assert len(tokens) == 5
assert tokens[2].text == "bl.a."
def test_sv_tokenizer_handles_custom_base_exc(sv_tokenizer):
text = "Her er noget du kan kigge i."
tokens = sv_tokenizer(text)
assert len(tokens) == 8
assert tokens[6].text == "i"
assert tokens[7].text == "."


@ -0,0 +1,15 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize('string,lemma', [('DNA-profilernas', 'DNA-profil'),
('Elfenbenskustens', 'Elfenbenskusten'),
('abortmotståndarens', 'abortmotståndare'),
('kolesterols', 'kolesterol'),
('portionssnusernas', 'portionssnus'),
('åsyns', 'åsyn')])
def test_lemmatizer_lookup_assigns(sv_tokenizer, string, lemma):
tokens = sv_tokenizer(string)
assert tokens[0].lemma_ == lemma


@ -0,0 +1,37 @@
# coding: utf-8
"""Test that tokenizer prefixes, suffixes and infixes are handled correctly."""
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize('text', ["(under)"])
def test_tokenizer_splits_no_special(sv_tokenizer, text):
tokens = sv_tokenizer(text)
assert len(tokens) == 3
@pytest.mark.parametrize('text', ["gitta'r", "Björn's", "Lars'"])
def test_tokenizer_handles_no_punct(sv_tokenizer, text):
tokens = sv_tokenizer(text)
assert len(tokens) == 1
@pytest.mark.parametrize('text', ["svart.Gul", "Hej.Världen"])
def test_tokenizer_splits_period_infix(sv_tokenizer, text):
tokens = sv_tokenizer(text)
assert len(tokens) == 3
@pytest.mark.parametrize('text', ["Hej,Världen", "en,två"])
def test_tokenizer_splits_comma_infix(sv_tokenizer, text):
tokens = sv_tokenizer(text)
assert len(tokens) == 3
assert tokens[0].text == text.split(",")[0]
assert tokens[1].text == ","
assert tokens[2].text == text.split(",")[1]
@pytest.mark.parametrize('text', ["svart...Gul", "svart...gul"])
def test_tokenizer_splits_ellipsis_infix(sv_tokenizer, text):
tokens = sv_tokenizer(text)
assert len(tokens) == 3


@ -0,0 +1,21 @@
# coding: utf-8
"""Test that longer and mixed texts are tokenized correctly."""
from __future__ import unicode_literals
import pytest
def test_sv_tokenizer_handles_long_text(sv_tokenizer):
text = """Det var så härligt ute på landet. Det var sommar, majsen var gul, havren grön,
höet var uppställt i stackar nere vid den gröna ängen, och där gick storken sina långa,
röda ben och snackade engelska, för det språket hade han lärt sig av sin mor.
Runt om åkrar och äng låg den stora skogen, och mitt i skogen fanns djupa sjöar; jo, det var verkligen trevligt ute landet!"""
tokens = sv_tokenizer(text)
assert len(tokens) == 86
def test_sv_tokenizer_handles_trailing_dot_for_i_in_sentence(sv_tokenizer):
text = "Provar att tokenisera en mening med ord i."
tokens = sv_tokenizer(text)
assert len(tokens) == 9


@ -5,27 +5,31 @@ from ..util import get_doc
import pytest import pytest
import numpy import numpy
from numpy.testing import assert_array_equal
@pytest.mark.parametrize('words,heads,matrix', [ @pytest.mark.parametrize(
( "sentence,heads,matrix",
'She created a test for spacy'.split(), [
[1, 0, 1, -2, -1, -1], (
numpy.array([ "She created a test for spacy",
[0, 1, 1, 1, 1, 1], [1, 0, 1, -2, -1, -1],
[1, 1, 1, 1, 1, 1], numpy.array(
[1, 1, 2, 3, 3, 3], [
[1, 1, 3, 3, 3, 3], [0, 1, 1, 1, 1, 1],
[1, 1, 3, 3, 4, 4], [1, 1, 1, 1, 1, 1],
[1, 1, 3, 3, 4, 5]], dtype=numpy.int32) [1, 1, 2, 3, 3, 3],
) [1, 1, 3, 3, 3, 3],
]) [1, 1, 3, 3, 4, 4],
def test_issue2396(en_vocab, words, heads, matrix): [1, 1, 3, 3, 4, 5],
doc = get_doc(en_vocab, words=words, heads=heads) ],
dtype=numpy.int32,
),
)
],
)
def test_issue2396(en_tokenizer, sentence, heads, matrix):
tokens = en_tokenizer(sentence)
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
span = doc[:] span = doc[:]
assert_array_equal(doc.get_lca_matrix(), matrix) assert (doc.get_lca_matrix() == matrix).all()
assert_array_equal(span.get_lca_matrix(), matrix) assert (span.get_lca_matrix() == matrix).all()


@ -10,7 +10,7 @@ def test_issue2901():
"""Test that `nlp` doesn't fail.""" """Test that `nlp` doesn't fail."""
try: try:
nlp = Japanese() nlp = Japanese()
except: except ImportError:
pytest.skip() pytest.skip()
doc = nlp("pythonが大好きです") doc = nlp("pythonが大好きです")


@ -0,0 +1,10 @@
from __future__ import unicode_literals
import pytest
import spacy
@pytest.mark.models('fr')
def test_issue1959(FR):
texts = ['Je suis la mauvaise herbe', "Me, myself and moi"]
for text in texts:
FR(text)


@ -1075,21 +1075,30 @@ cdef int [:,:] _get_lca_matrix(Doc doc, int start, int end):
cdef int [:,:] lca_matrix cdef int [:,:] lca_matrix
n_tokens= end - start n_tokens= end - start
lca_matrix = numpy.empty((n_tokens, n_tokens), dtype=numpy.int32) lca_mat = numpy.empty((n_tokens, n_tokens), dtype=numpy.int32)
lca_mat.fill(-1)
lca_matrix = lca_mat
for j in range(start, end): for j in range(n_tokens):
token_j = doc[j] token_j = doc[start + j]
# the common ancestor of token and itself is itself: # the common ancestor of token and itself is itself:
lca_matrix[j, j] = j lca_matrix[j, j] = j
for k in range(j + 1, end): # we will only iterate through tokens in the same sentence
lca = _get_tokens_lca(token_j, doc[k]) sent = token_j.sent
sent_start = sent.start
j_idx_in_sent = start + j - sent_start
n_missing_tokens_in_sent = len(sent) - j_idx_in_sent
# make sure we do not go past `end`, in cases where `end` < sent.end
max_range = min(j + n_missing_tokens_in_sent, end)
for k in range(j + 1, max_range):
lca = _get_tokens_lca(token_j, doc[start + k])
# if lca is outside of span, we set it to -1 # if lca is outside of span, we set it to -1
if not start <= lca < end: if not start <= lca < end:
lca_matrix[j, k] = -1 lca_matrix[j, k] = -1
lca_matrix[k, j] = -1 lca_matrix[k, j] = -1
else: else:
lca_matrix[j, k] = lca lca_matrix[j, k] = lca - start
lca_matrix[k, j] = lca lca_matrix[k, j] = lca - start
return lca_matrix return lca_matrix
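The indexing change in `_get_lca_matrix` (positions relative to `start`, and `-1` for pairs whose common ancestor lies outside the span or that sit in different sentences) can be illustrated with a pure-Python sketch. This is not the Cython implementation: `lca_matrix` and its parameters are hypothetical names, `heads` uses the relative offsets passed to `get_doc` in the tests, and a well-formed tree is assumed.

```python
def lca_matrix(heads, start=0, end=None):
    """Lowest-common-ancestor matrix for tokens start..end (end exclusive).

    heads[i] is the offset from token i to its head; 0 marks a root.
    Entries are indices relative to `start`, or -1 when there is no
    common ancestor inside the span.
    """
    end = len(heads) if end is None else end
    abs_heads = [i + offset for i, offset in enumerate(heads)]

    def ancestors(i):
        # The token itself, then its head, and so on up to the root.
        chain = [i]
        while abs_heads[i] != i:
            i = abs_heads[i]
            chain.append(i)
        return chain

    size = end - start
    matrix = [[-1] * size for _ in range(size)]
    for j in range(size):
        matrix[j][j] = j  # a token is its own lowest common ancestor
        chain_j = ancestors(start + j)
        for k in range(j + 1, size):
            chain_k = set(ancestors(start + k))
            lca = next((a for a in chain_j if a in chain_k), -1)
            if start <= lca < end:
                matrix[j][k] = matrix[k][j] = lca - start
    return matrix


full = lca_matrix([2, 1, 1, 0])             # "the lazy dog slept"
first_two = lca_matrix([2, 1, 1, 0], 0, 2)  # doc[:2]
tail = lca_matrix([2, 1, 1, 0], 1, 4)       # doc[1:]
```

For the four-token sentence, `full` matches the matrix used in `test_lowest_common_ancestor`, `first_two` is `[[0, -1], [-1, 1]]` because "dog" falls outside the span, and `tail` is indexed from the span start rather than the document start.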


@ -524,9 +524,9 @@ cdef class Span:
return len(list(self.rights)) return len(list(self.rights))
property subtree: property subtree:
"""Tokens that descend from tokens in the span, but fall outside it. """Tokens within the span and tokens which descend from them.
YIELDS (Token): A descendant of a token within the span. YIELDS (Token): A token within the span, or a descendant from it.
""" """
def __get__(self): def __get__(self):
for word in self.lefts: for word in self.lefts:


@ -457,10 +457,11 @@ cdef class Token:
yield from self.rights yield from self.rights
property subtree: property subtree:
"""A sequence of all the token's syntactic descendents. """A sequence containing the token and all the token's syntactic
descendants.
YIELDS (Token): A descendent token such that YIELDS (Token): A descendent token such that
`self.is_ancestor(descendent)`. `self.is_ancestor(descendent) or token == self`.
""" """
def __get__(self): def __get__(self):
for word in self.lefts: for word in self.lefts:
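The corrected docstring says the subtree includes the token itself, not only its descendants. A pure-Python sketch of that behaviour follows; `subtree` is a hypothetical helper, `heads` uses the relative offsets from the tests, and a projective tree is assumed so that the traversal yields document order.

```python
def subtree(heads, i):
    """Return token i and all of its descendants, left to right."""
    abs_heads = [idx + offset for idx, offset in enumerate(heads)]
    children = {idx: [] for idx in range(len(heads))}
    for idx, head in enumerate(abs_heads):
        if head != idx:  # roots point at themselves
            children[head].append(idx)

    def walk(node):
        # Left children first, then the node itself, then right children.
        for child in children[node]:
            if child < node:
                yield from walk(child)
        yield node
        for child in children[node]:
            if child > node:
                yield from walk(child)

    return list(walk(i))


# "the lazy dog slept" with heads [2, 1, 1, 0]
dog_subtree = subtree([2, 1, 1, 0], 2)
leaf_subtree = subtree([2, 1, 1, 0], 0)
```

`dog_subtree` covers "the", "lazy" and "dog" (indices 0, 1, 2), and a leaf such as "the" yields only itself.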


@ -253,7 +253,6 @@ def get_entry_point(key, value):
def is_in_jupyter(): def is_in_jupyter():
"""Check if user is running spaCy from a Jupyter notebook by detecting the """Check if user is running spaCy from a Jupyter notebook by detecting the
IPython kernel. Mainly used for the displaCy visualizer. IPython kernel. Mainly used for the displaCy visualizer.
RETURNS (bool): True if in Jupyter, False if not. RETURNS (bool): True if in Jupyter, False if not.
""" """
# https://stackoverflow.com/a/39662359/6400719 # https://stackoverflow.com/a/39662359/6400719
@ -667,3 +666,19 @@ class SimpleFrozenDict(dict):
def update(self, other): def update(self, other):
raise NotImplementedError(Errors.E095) raise NotImplementedError(Errors.E095)
class DummyTokenizer(object):
    # add dummy methods for to_bytes, from_bytes, to_disk and from_disk to
    # allow serialization (see #1557)
    def to_bytes(self, **exclude):
        return b''

    def from_bytes(self, _bytes_data, **exclude):
        return self

    def to_disk(self, _path, **exclude):
        return None

    def from_disk(self, _path, **exclude):
        return self
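A minimal sketch of how a custom tokenizer can build on `DummyTokenizer`, kept free of spaCy imports so it runs on its own. `WhitespaceTokenizer` is a hypothetical subclass that returns plain strings; a real tokenizer such as `ThaiTokenizer` would return a `Doc` instead.

```python
class DummyTokenizer(object):
    # No-op serialization hooks, mirroring the class added in this diff.
    def to_bytes(self, **exclude):
        return b""

    def from_bytes(self, _bytes_data, **exclude):
        return self

    def to_disk(self, _path, **exclude):
        return None

    def from_disk(self, _path, **exclude):
        return self


class WhitespaceTokenizer(DummyTokenizer):
    """Hypothetical tokenizer with no state worth serializing."""

    def __call__(self, text):
        return text.split()


tokenizer = WhitespaceTokenizer()
words = tokenizer("the lazy dog slept")
data = tokenizer.to_bytes()                         # nothing to store
restored = WhitespaceTokenizer().from_bytes(data)   # returns the instance
```

Because the subclass inherits all four hooks, code that saves or loads a pipeline can call them without special-casing tokenizers that wrap an external library.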


@ -150,3 +150,9 @@ p
+dep-row("re", "repeated element") +dep-row("re", "repeated element")
+dep-row("rs", "reported speech") +dep-row("rs", "reported speech")
+dep-row("sb", "subject") +dep-row("sb", "subject")
+dep-row("sbp", "passivised subject")
+dep-row("sp", "subject or predicate")
+dep-row("svp", "separable verb prefix")
+dep-row("uc", "unit component")
+dep-row("vo", "vocative")
+dep-row("ROOT", "root")


@ -5,7 +5,7 @@ include ../_includes/_mixins
p p
| The #[code PhraseMatcher] lets you efficiently match large terminology | The #[code PhraseMatcher] lets you efficiently match large terminology
| lists. While the #[+api("matcher") #[code Matcher]] lets you match | lists. While the #[+api("matcher") #[code Matcher]] lets you match
| squences based on lists of token descriptions, the #[code PhraseMatcher] | sequences based on lists of token descriptions, the #[code PhraseMatcher]
| accepts match patterns in the form of #[code Doc] objects. | accepts match patterns in the form of #[code Doc] objects.
+h(2, "init") PhraseMatcher.__init__ +h(2, "init") PhraseMatcher.__init__


@ -489,7 +489,7 @@ p
+tag property +tag property
+tag-model("parse") +tag-model("parse")
p Tokens that descend from tokens in the span, but fall outside it. p Tokens within the span and tokens which descend from them.
+aside-code("Example"). +aside-code("Example").
doc = nlp(u'Give it back! He pleaded.') doc = nlp(u'Give it back! He pleaded.')
@ -500,7 +500,7 @@ p Tokens that descend from tokens in the span, but fall outside it.
+row("foot") +row("foot")
+cell yields +cell yields
+cell #[code Token] +cell #[code Token]
+cell A descendant of a token within the span. +cell A token within the span, or a descendant from it.
+h(2, "has_vector") Span.has_vector +h(2, "has_vector") Span.has_vector
+tag property +tag property


@ -1,3 +1,4 @@
//- 💫 DOCS > API > TOKEN //- 💫 DOCS > API > TOKEN
include ../_includes/_mixins include ../_includes/_mixins
@ -405,7 +406,7 @@ p
+tag property +tag property
+tag-model("parse") +tag-model("parse")
p A sequence of all the token's syntactic descendants. p A sequence containing the token and all the token's syntactic descendants.
+aside-code("Example"). +aside-code("Example").
doc = nlp(u'Give it back! He pleaded.') doc = nlp(u'Give it back! He pleaded.')
@ -416,7 +417,7 @@ p A sequence of all the token's syntactic descendants.
+row("foot") +row("foot")
+cell yields +cell yields
+cell #[code Token] +cell #[code Token]
+cell A descendant token such that #[code self.is_ancestor(descendant)]. +cell A descendant token such that #[code self.is_ancestor(token) or token == self].
+h(2, "is_sent_start") Token.is_sent_start +h(2, "is_sent_start") Token.is_sent_start
+tag property +tag property


@ -1083,20 +1083,31 @@
"category": ["pipeline"] "category": ["pipeline"]
}, },
{ {
"id": "spacy2conllu", "id": "spacy-conll",
"title": "spaCy2CoNLLU", "title": "spacy_conll",
"slogan": "Parse text with spaCy and print the output in CoNLL-U format", "slogan": "Parse text with spaCy and print the output in CoNLL-U format",
"description": "Simple script to parse text with spaCy and print the output in CoNLL-U format", "description": "This module allows you to parse a text to CoNLL-U format. You can use it as a command line tool, or embed it in your own scripts.",
"code_example": [ "code_example": [
"python parse_as_conllu.py [-h] --input_file INPUT_FILE [--output_file OUTPUT_FILE] --model MODEL" "from spacy_conll import Spacy2ConllParser",
"spacyconll = Spacy2ConllParser()",
"",
"# `parse` returns a generator of the parsed sentences",
"for parsed_sent in spacyconll.parse(input_str=\"I like cookies.\\nWhat about you?\\nI don't like 'em!\"):",
" do_something_(parsed_sent)",
"",
"# `parseprint` prints output to stdout (default) or a file (use `output_file` parameter)",
"# This method is called when using the command line",
"spacyconll.parseprint(input_str='I like cookies.')"
], ],
"code_language": "bash", "code_language": "python",
"author": "Raquel G. Alhama", "author": "Bram Vanroy",
"author_links": { "author_links": {
"github": "rgalhama" "github": "BramVanroy",
"website": "https://bramvanroy.be"
}, },
"github": "rgalhama/spaCy2CoNLLU", "github": "BramVanroy/spacy_conll",
"category": ["training"] "category": ["standalone"]
} }
], ],
"projectCats": { "projectCats": {


@ -159,7 +159,7 @@ p
| To provide training examples to the entity recogniser, you'll first need | To provide training examples to the entity recogniser, you'll first need
| to create an instance of the #[+api("goldparse") #[code GoldParse]] class. | to create an instance of the #[+api("goldparse") #[code GoldParse]] class.
| You can specify your annotations in a stand-off format or as token tags. | You can specify your annotations in a stand-off format or as token tags.
| If a character offset in your entity annotations don't fall on a token | If a character offset in your entity annotations doesn't fall on a token
| boundary, the #[code GoldParse] class will treat that annotation as a | boundary, the #[code GoldParse] class will treat that annotation as a
| missing value. This allows for more realistic training, because the | missing value. This allows for more realistic training, because the
| entity recogniser is allowed to learn from examples that may feature | entity recogniser is allowed to learn from examples that may feature


@ -444,7 +444,7 @@ p
| Let's say you're analysing user comments and you want to find out what | Let's say you're analysing user comments and you want to find out what
| people are saying about Facebook. You want to start off by finding | people are saying about Facebook. You want to start off by finding
| adjectives following "Facebook is" or "Facebook was". This is obviously | adjectives following "Facebook is" or "Facebook was". This is obviously
| a very rudimentary solution, but it'll be fast, and a great way get an | a very rudimentary solution, but it'll be fast, and a great way to get an
| idea for what's in your data. Your pattern could look like this: | idea for what's in your data. Your pattern could look like this:
+code. +code.


@ -40,7 +40,7 @@ p
| constrained to predict parses consistent with the sentence boundaries. | constrained to predict parses consistent with the sentence boundaries.
+infobox("Important note", "⚠️") +infobox("Important note", "⚠️")
| To prevent inconsitent state, you can only set boundaries #[em before] a | To prevent inconsistent state, you can only set boundaries #[em before] a
| document is parsed (and #[code Doc.is_parsed] is #[code False]). To | document is parsed (and #[code Doc.is_parsed] is #[code False]). To
| ensure that your component is added in the right place, you can set | ensure that your component is added in the right place, you can set
| #[code before='parser'] or #[code first=True] when adding it to the | #[code before='parser'] or #[code first=True] when adding it to the


@ -21,7 +21,7 @@ p
| which needs to be split into two tokens: #[code {ORTH: "do"}] and | which needs to be split into two tokens: #[code {ORTH: "do"}] and
| #[code {ORTH: "n't", LEMMA: "not"}]. The prefixes, suffixes and infixes | #[code {ORTH: "n't", LEMMA: "not"}]. The prefixes, suffixes and infixes
| mostly define punctuation rules for example, when to split off periods | mostly define punctuation rules for example, when to split off periods
| (at the end of a sentence), and when to leave token containing periods | (at the end of a sentence), and when to leave tokens containing periods
| intact (abbreviations like "U.S."). | intact (abbreviations like "U.S.").
+graphic("/assets/img/language_data.svg") +graphic("/assets/img/language_data.svg")


@ -43,7 +43,7 @@ p
p p
| This example shows how to use multiple cores to process text using | This example shows how to use multiple cores to process text using
| spaCy and #[+a("https://pythonhosted.org/joblib/") Joblib]. We're | spaCy and #[+a("https://joblib.readthedocs.io/en/latest/parallel.html") Joblib]. We're
| exporting part-of-speech-tagged, true-cased, (very roughly) | exporting part-of-speech-tagged, true-cased, (very roughly)
| sentence-separated text, with each "sentence" on a newline, and | sentence-separated text, with each "sentence" on a newline, and
| spaces between tokens. Data is loaded from the IMDB movie reviews | spaces between tokens. Data is loaded from the IMDB movie reviews


@ -74,7 +74,7 @@ p
displacy.serve(doc, style='ent') displacy.serve(doc, style='ent')
p p
| This feature is espeically handy if you're using displaCy to compare | This feature is especially handy if you're using displaCy to compare
| performance at different stages of a process, e.g. during training. Here | performance at different stages of a process, e.g. during training. Here
| you could use the title for a brief description of the text example and | you could use the title for a brief description of the text example and
| the number of iterations. | the number of iterations.


@ -61,7 +61,7 @@ p
output_path.open('w', encoding='utf-8').write(svg) output_path.open('w', encoding='utf-8').write(svg)
p p
| The above code will generate the dependency visualizations as to | The above code will generate the dependency visualizations as
| two files, #[code This-is-an-example.svg] and #[code This-is-another-one.svg]. | two files, #[code This-is-an-example.svg] and #[code This-is-another-one.svg].


@ -24,7 +24,7 @@ include ../_includes/_mixins
| standards. | standards.
p p
| The quickest way visualize #[code Doc] is to use | The quickest way to visualize #[code Doc] is to use
| #[+api("displacy#serve") #[code displacy.serve]]. This will spin up a | #[+api("displacy#serve") #[code displacy.serve]]. This will spin up a
| simple web server and let you view the result straight from your browser. | simple web server and let you view the result straight from your browser.
| displaCy can either take a single #[code Doc] or a list of #[code Doc] | displaCy can either take a single #[code Doc] or a list of #[code Doc]