Matthew Honnibal 2018-09-27 12:50:31 +02:00
commit bae6b3e2b3
60 changed files with 3435 additions and 913 deletions


@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Aniruddha Adhikary |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-09-05 |
| GitHub username | aniruddha-adhikary |
| Website (optional) | https://adhikary.net |

.github/contributors/aongko.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Andrew Ongko |
| Company name (if applicable) | Kurio |
| Title or role (if applicable) | Senior Data Science |
| Date | Sep 10, 2018 |
| GitHub username | aongko |
| Website (optional) | |

.github/contributors/aryaprabhudesai.md vendored Normal file

@@ -0,0 +1,54 @@
spaCy contributor agreement
This spaCy Contributor Agreement ("SCA") is based on the Oracle Contributor Agreement. The SCA applies to any contribution that you make to any product or project managed by us (the "project"), and sets out the intellectual property rights you grant to us in the contributed materials. The term "us" shall mean ExplosionAI UG (haftungsbeschränkt). The term "you" shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested below and include the filled-in version with your first pull request, under the folder .github/contributors/. The name of the file should be your GitHub username, with the extension .md. For example, the user example_user would create the file .github/contributors/example_user.md.
Read this agreement carefully before signing. These terms and conditions constitute a binding legal agreement.
Contributor Agreement
The term "contribution" or "contributed materials" means any source code, object code, patch, tool, sample, graphic, specification, manual, documentation, or any other material posted or submitted by you to the project.
With respect to any worldwide copyrights, or copyright applications and registrations, in your contribution:
you hereby assign to us joint ownership, and to the extent that such assignment is or becomes invalid, ineffective or unenforceable, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free, unrestricted license to exercise all rights under those copyrights. This includes, at our option, the right to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements;
you agree that each of us can do all things in relation to your contribution as if each of us were the sole owners, and if one of us makes a derivative work of your contribution, the one who makes the derivative work (or has it made) will be the sole owner of that derivative work;
you agree that you will not assert any moral rights in your contribution against us, our licensees or transferees;
you agree that we may register a copyright in your contribution and exercise all ownership rights associated with it; and
you agree that neither of us has any duty to consult with, obtain the consent of, pay or render an accounting to the other for any use or distribution of your contribution.
With respect to any patents you own, or that you can license without payment to any third party, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free license to:
make, have made, use, sell, offer to sell, import, and otherwise transfer your contribution in whole or in part, alone or in combination with or included in any product, work or materials arising out of the project to which your contribution was submitted, and
at our option, to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements.
Except as set out above, you keep all right, title, and interest in your contribution. The rights that you grant to us under these terms are effective on the date you first submitted a contribution to us, even if your submission took place before the date you sign these terms.
You covenant, represent, warrant and agree that:
Each contribution that you submit is and shall be an original work of authorship and you can legally grant the rights set out in this SCA;
to the best of your knowledge, each contribution will not violate any third party's copyrights, trademarks, patents, or other intellectual property rights; and
each contribution shall be in compliance with U.S. export control laws and other applicable export and import laws. You agree to notify us if you become aware of any circumstance which would make any of the foregoing representations inaccurate in any respect. We may publicly disclose your participation in the project, including the fact that you have signed the SCA.
This SCA is governed by the laws of the State of California and applicable U.S. Federal law. Any choice of law rules will not apply.
Please place an “x” on one of the applicable statements below. Please do NOT mark both statements:
[X] I am signing on behalf of myself as an individual and no other person or entity, including my employer, has or will have rights with respect to my contributions.
I am signing on behalf of my employer or a legal entity and I have the actual authority to contractually bind that entity.
Contributor Details
Field Entry
Name Arya Prabhudesai
Company name (if applicable) -
Title or role (if applicable) -
Date 2018-08-17
GitHub username aryaprabhudesai
Website (optional) -

.github/contributors/charlax.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Charles-Axel Dein |
| Company name (if applicable) | Skrib |
| Title or role (if applicable) | CEO |
| Date | 27/09/2018 |
| GitHub username | charlax |
| Website (optional) | www.dein.fr |

.github/contributors/darindf.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Darin DeForest |
| Company name (if applicable) | Ipro Tech |
| Title or role (if applicable) | Senior Software Engineer |
| Date | 2018-09-26 |
| GitHub username | darindf |
| Website (optional) | |

.github/contributors/filipecaixeta.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Filipe Caixeta |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 09.12.2018 |
| GitHub username | filipecaixeta |
| Website (optional) | filipecaixeta.com.br |

.github/contributors/free-variation.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | John Stewart |
| Company name (if applicable) | Amplify |
| Title or role (if applicable) | SVP Research |
| Date | 14/09/2018 |
| GitHub username | free-variation |
| Website (optional) | |

106
.github/contributors/grivaz.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name |C. Grivaz |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date |08.22.2018 |
| GitHub username |grivaz |
| Website (optional) | |

106
.github/contributors/keshan.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Keshan Sodimana |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | Sep 21, 2018 |
| GitHub username | keshan |
| Website (optional) | |

106
.github/contributors/mbkupfer.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Maxim Kupfer |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | Sep 6, 2018 |
| GitHub username | mbkupfer |
| Website (optional) | |

106
.github/contributors/njsmith.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Nathaniel J. Smith |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-08-26 |
| GitHub username | njsmith |
| Website (optional) | https://vorpus.org |

106
.github/contributors/phojnacki.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ X ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------------------- |
| Name | Przemysław Hojnacki |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 12/09/2018 |
| GitHub username | phojnacki |
| Website (optional) | https://about.me/przemyslaw.hojnacki |

106
.github/contributors/pzelasko.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Piotr Żelasko |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 04-09-2018 |
| GitHub username | pzelasko |
| Website (optional) | |

.github/contributors/sainathadapa.md

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Sainath Adapa |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-09-06 |
| GitHub username | sainathadapa |
| Website (optional) | |

.github/contributors/tyburam.md

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Mateusz Tybura |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 08.09.2018 |
| GitHub username | tyburam |
| Website (optional) | |


@@ -92,11 +92,13 @@ def get_features(docs, max_length):
 def train(train_texts, train_labels, dev_texts, dev_labels,
           lstm_shape, lstm_settings, lstm_optimizer, batch_size=100,
           nb_epoch=5, by_sentence=True):
     print("Loading spaCy")
     nlp = spacy.load('en_vectors_web_lg')
     nlp.add_pipe(nlp.create_pipe('sentencizer'))
     embeddings = get_embeddings(nlp.vocab)
     model = compile_lstm(embeddings, lstm_shape, lstm_settings)
     print("Parsing texts...")
     train_docs = list(nlp.pipe(train_texts))
     dev_docs = list(nlp.pipe(dev_texts))
@@ -107,7 +109,7 @@ def train(train_texts, train_labels, dev_texts, dev_labels,
     train_X = get_features(train_docs, lstm_shape['max_length'])
     dev_X = get_features(dev_docs, lstm_shape['max_length'])
     model.fit(train_X, train_labels, validation_data=(dev_X, dev_labels),
-              nb_epoch=nb_epoch, batch_size=batch_size)
+              epochs=nb_epoch, batch_size=batch_size)
     return model
@@ -138,15 +140,9 @@ def get_embeddings(vocab):
 def evaluate(model_dir, texts, labels, max_length=100):
-    def create_pipeline(nlp):
-        '''
-        This could be a lambda, but named functions are easier to read in Python.
-        '''
-        return [nlp.tagger, nlp.parser, SentimentAnalyser.load(model_dir, nlp,
-                                                               max_length=max_length)]
-    nlp = spacy.load('en')
-    nlp.pipeline = create_pipeline(nlp)
+    nlp = spacy.load('en_vectors_web_lg')
+    nlp.add_pipe(nlp.create_pipe('sentencizer'))
+    nlp.add_pipe(SentimentAnalyser.load(model_dir, nlp, max_length=max_length))
     correct = 0
     i = 0
@@ -186,7 +182,7 @@ def main(model_dir=None, train_dir=None, dev_dir=None,
          is_runtime=False,
          nr_hidden=64, max_length=100, # Shape
          dropout=0.5, learn_rate=0.001, # General NN config
-         nb_epoch=5, batch_size=100, nr_examples=-1): # Training params
+         nb_epoch=5, batch_size=256, nr_examples=-1): # Training params
     if model_dir is not None:
         model_dir = pathlib.Path(model_dir)
     if train_dir is None or dev_dir is None:
@@ -219,7 +215,7 @@ def main(model_dir=None, train_dir=None, dev_dir=None,
     if model_dir is not None:
         with (model_dir / 'model').open('wb') as file_:
             pickle.dump(weights[1:], file_)
-        with (model_dir / 'config.json').open('wb') as file_:
+        with (model_dir / 'config.json').open('w') as file_:
             file_.write(lstm.to_json())


@@ -14,7 +14,7 @@ from .. import about
 @plac.annotations(
-    model=("model to download, shortcut or name)", "positional", None, str),
+    model=("model to download, shortcut or name", "positional", None, str),
     direct=("force direct download. Needs model name with version and won't "
             "perform compatibility check", "flag", "d", bool),
     pip_args=("additional arguments to be passed to `pip install` when "


@@ -249,6 +249,7 @@ class Errors(object):
             "error. Are you writing to a default function argument?")
     E096 = ("Invalid object passed to displaCy: Can only visualize Doc or "
             "Span objects, or dicts if set to manual=True.")
+    E097 = ("Can't merge non-disjoint spans. '{token}' is already part of tokens to merge")
 @add_codes
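
The new `E097` message carries a `{token}` placeholder. As a rough illustration of the kind of check that could raise it, the sketch below tracks which token indices are already claimed by a span to merge. `check_disjoint` and the `(start, end)` span representation are assumptions made for this example, not spaCy's actual merging code:

```python
E097 = ("Can't merge non-disjoint spans. '{token}' is already part of "
        "tokens to merge")


def check_disjoint(spans):
    """Raise ValueError if two spans share a token index.

    `spans` is a list of (start, end) token offsets, with `end` exclusive.
    Returns the set of all covered token indices when the spans are disjoint.
    """
    seen = set()
    for start, end in spans:
        for i in range(start, end):
            if i in seen:
                # Fill the placeholder with the offending token index.
                raise ValueError(E097.format(token=i))
            seen.add(i)
    return seen
```

Calling `check_disjoint([(0, 3), (2, 5)])` fails on token index 2, because the first span already covers it.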


@@ -286,6 +286,7 @@ GLOSSARY = {
     'PERSON': 'People, including fictional',
     'NORP': 'Nationalities or religious or political groups',
     'FACILITY': 'Buildings, airports, highways, bridges, etc.',
+    'FAC': 'Buildings, airports, highways, bridges, etc.',
     'ORG': 'Companies, agencies, institutions, etc.',
     'GPE': 'Countries, cities, states',
     'LOC': 'Non-GPE locations, mountain ranges, bodies of water',


@@ -20,12 +20,11 @@ _suffixes = (_list_punct + LIST_ELLIPSES + LIST_QUOTES + LIST_ICONS +
              r'(?<=[{}(?:{})])\.'.format('|'.join([ALPHA_LOWER, r'%²\-\)\]\+', QUOTES]), _currency)])
 _infixes = (LIST_ELLIPSES + LIST_ICONS +
-            [r'(?<=[{}])\.(?=[{}])'.format(ALPHA_LOWER, ALPHA_UPPER),
+            [r'(?<=[0-9{zero}-{nine}])[+\-\*^=](?=[0-9{zero}-{nine}-])'.format(zero=u'', nine=u''),
              r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
-             r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA),
-             r'(?<=[{a}])--(?=[{a}])'.format(a=ALPHA),
-             r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
-             r'(?<=[{a}])([{q}\)\]\(\[])(?=[\-{a}])'.format(a=ALPHA, q=_quotes)])
+             r'(?<=[{a}])[{h}](?={ae})'.format(a=ALPHA, h=HYPHENS, ae=u''),
+             r'(?<=[{a}])[?";:=,.]*(?:{h})(?=[{a}])'.format(a=ALPHA, h=HYPHENS),
+             r'(?<=[{a}"])[:<>=/](?=[{a}])'.format(a=ALPHA)])
 TOKENIZER_PREFIXES = _prefixes
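
The infix patterns in this hunk rely on lookbehind and lookahead assertions, so a hyphen is split off only when it sits between two letters. Below is a simplified sketch using the standard `re` module. The ASCII-only `ALPHA` and single-character `HYPHENS` are stand-ins chosen for illustration, and `split_infixes` is a hypothetical helper rather than the tokenizer's real logic:

```python
import re

ALPHA = "a-zA-Z"  # stand-in: the real ALPHA covers many more scripts
HYPHENS = "-"     # stand-in: the real HYPHENS lists several dash variants

# Match a hyphen only when a letter precedes it and a letter follows it.
infix_re = re.compile(
    r'(?<=[{a}])(?:{h})(?=[{a}])'.format(a=ALPHA, h=HYPHENS)
)


def split_infixes(text):
    """Split `text` around every infix match, keeping the infix itself."""
    pieces = []
    start = 0
    for match in infix_re.finditer(text):
        pieces.append(text[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    pieces.append(text[start:])
    return pieces
```

With these stand-ins, `split_infixes("well-known")` yields three pieces, while `"-5"` and `"a-1"` are left whole because the hyphen is not flanked by letters on both sides.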


@@ -16,6 +16,7 @@ _latin = r'[[\p{Ll}||\p{Lu}]&&\p{Latin}]'
 _persian = r'[\p{L}&&\p{Arabic}]'
 _russian_lower = r'[ёа-я]'
 _russian_upper = r'[ЁА-Я]'
+_sinhala = r'[\p{L}&&\p{Sinhala}]'
 _tatar_lower = r'[әөүҗңһ]'
 _tatar_upper = r'[ӘӨҮҖҢҺ]'
 _greek_lower = r'[α-ωάέίόώήύ]'
@@ -23,7 +24,7 @@ _greek_upper = r'[Α-ΩΆΈΊΌΏΉΎ]'
 _upper = [_latin_upper, _russian_upper, _tatar_upper, _greek_upper]
 _lower = [_latin_lower, _russian_lower, _tatar_lower, _greek_lower]
-_uncased = [_bengali, _hebrew, _persian]
+_uncased = [_bengali, _hebrew, _persian, _sinhala]
 ALPHA = merge_char_classes(_upper + _lower + _uncased)
 ALPHA_LOWER = merge_char_classes(_lower + _uncased)


@ -31599,5 +31599,499 @@ FR_BASE_EXCEPTIONS = [
"σ-additivités", "σ-additivités",
"σ-compacité", "σ-compacité",
"σ-compact", "σ-compact",
"σ-compacts" "σ-compacts",
"Abbie-Gaëlle",
"Abby-Gail",
"Aëlle-Lys",
"Agathe-Rose",
"Aimée-Alexandrine",
"Amy-Lee",
"Andrée-Ange",
"Anna-Elisa",
"Anna-Maria",
"Anna-Sophia",
"Anne-Aëlle",
"Anne-Aymone",
"Anne-Catherine",
"Anne-Cécile",
"Anne-Christelle",
"Anne-Claire",
"Anne-Clarisse",
"Anne-Clémence",
"Anne-Clotilde",
"Anne-Colombe",
"Anne-Eléonore",
"Anne-Elisabeth",
"Anne-Flore",
"Anne-Gabrielle",
"Anne-Gaëlle",
"Anne-Garance",
"Anne-Hélène",
"Anne-Hortense",
"Anne-Isabelle",
"Anne-Juliette",
"Anne-Laurence",
"Anne-Lise",
"Anne-Louise",
"Anne-Lucie",
"Anne-Lyse",
"Anne-Marceline",
"Anne-Marguerite",
"Anne-Mathilde",
"Anne-Pascale",
"Anne-Pénélope",
"Anne-Raphaëlle",
"Anne-Sixtine",
"Anne-Victoire",
"Anne-Yvonne",
"Annie-Claire",
"Annie-Claude",
"Annie-Christine",
"Annie-Flore",
"Annie-France",
"Annie-Françoise",
"Annie-Laure",
"Annie-Paule",
"Annie-Pierre",
"Annie-Rose",
"Annie-Thérèse",
"Audrey-Anne",
"Aure-Anne",
"Bérénice-Marie",
"Blanche-Castille",
"Brune-Hilde",
"Carla-Luna",
"Carla-Marie",
"Carole-Anne",
"Catherine-Amélie",
"Catherine-Josée",
"Catherine-Luce",
"Cécile-Liv",
"Cécilie-Anne",
"Claire-Aude",
"Claire-Juliette",
"Claire-Marie",
"Edith-Marie",
"Eléa-Nore",
"Elée-Anne",
"Elisa-Maude",
"Elise-Marie",
"Elyse-Ea",
"Elysée-Anne",
"Emma-Line",
"Emma-Lou",
"Emma-Louise",
"Emma-Rose",
"Eugénie-Héloïse",
"Eva-Elle",
"Eva-Marie",
"Eve-Aëlle",
"Eve-Angéline",
"Eve-Rose",
"Eve-Charlotte",
"Eve-Marie",
"Gracie-Lou",
"Hélène-Sarah",
"Héloïse-Luce",
"Héloïse-Marie",
"Jeanne-Antide",
"Jeanne-Claire",
"Jeanne-Charlotte",
"Jeanne-Cécile",
"Jeanne-Colombe",
"Jeanne-Françoise",
"Jeanne-Hélène",
"Jeanne-Laure",
"Jeanne-Lise",
"Jeanne-Louise",
"Jeanne-Marie",
"Jeanne-Sixtine",
"Jeanne-Sophie",
"Julie-Anne",
"Julie-Maude",
"Julie-Michelle",
"Josée-Anne",
"Laure-Anne",
"Laure-Eléa",
"Laure-Élise",
"Laure-Lise",
"Laure-Lou",
"Laure-Marie",
"Laurie-Anne",
"Léa-Jade",
"Lily-Rose",
"Lisa-Marie",
"Lise-Anne",
"Liv-Helen",
"Lou-Anne",
"Lou-Eve",
"Louisa-May",
"Louise-Marie",
"Louise-Anne",
"Louise-Hélène",
"Maëlle-Anne",
"Maëlle-Lys",
"Maï-Lan",
"Marguerite-Marie",
"Marie-Adélaïde",
"Marie-Adeline",
"Marie-Agathe",
"Marie-Agnès",
"Marie-Aimée",
"Marie-Aldine",
"Marie-Alice",
"Marie-Aline",
"Marie-Alix",
"Marie-Amélie",
"Marie-Andrée",
"Marie-Anne",
"Marie-Astrid",
"Marie-Athéna",
"Marie-Aude",
"Marie-Bérénice",
"Marie-Carmèle",
"Marie-Caroline",
"Marie-Cécile",
"Marie-Charlotte",
"Marie-Clarence",
"Marie-Clémence",
"Marie-Clémentine",
"Marie-Clotilde",
"Marie-Colombe",
"Marie-Dominique",
"Marie-Edith",
"Marie-Elisabeth",
"Marie-Ella",
"Marie-Elle",
"Marie-Emeline",
"Marie-Eve",
"Marie-Flore",
"Marie-Gabrielle",
"Marie-Garance",
"Marie-Héloïse",
"Marie-Hortense",
"Marie-Isabelle",
"Marie-Jeanne",
"Marie-Josée",
"Marie-Josèphe",
"Marie-Juliette",
"Marie-Léonor",
"Marie-Liesse",
"Marie-Line",
"Marie-Lise",
"Marie-Lorraine",
"Marie-Lou",
"Marie-Lyne",
"Marie-Lyse",
"Marie-Naëlle",
"Marie-Odile",
"Marie-Pascale",
"Marie-Prudence",
"Marie-Prune",
"Marie-Rose",
"Marie-Sarah",
"Marie-Sophie",
"Marie-Valentine",
"Marie-Victoire",
"Marie-Yolande",
"Mary-Beth",
"Mary-Lynn",
"Mathilde-Marie",
"Meg-Anne",
"Nathalie-Anne",
"Paule-Émeline",
"Paule-Marie",
"Pauline-Hortense",
"Pauline-Marie",
"Pénélope-Fiona",
"Rose-Anne",
"Rose-Hélène",
"Sarah-Anne",
"Sarah-Eve",
"Sarah-Jane",
"Sarah-Laure",
"Sarah-Line",
"Sarah-Lise",
"Sarah-Lou",
"Sarah-Louise",
"Sarah-Marie",
"Sarah-Myriam",
"Sixtine-Jade",
"Sophie-Anne",
"Sophie-Caroline",
"Sylvie-Anne",
"Valérie-Anne",
"Alain-Maxime",
"Alain-Michel",
"Alain-Pierre",
"Ambroise-Polycarpe",
"André-Ferdinand",
"André-Jean",
"André-Marie",
"Ange-Marie",
"Antoine-Daniel",
"Antoine-Marie",
"Auguste-Eugène",
"Benoît-Joseph",
"Bernard-Marie",
"Camille-Raphaël",
"Carl-Éric",
"Carl-Edwin",
"Carl-Philippe",
"Charles-Aubin",
"Charles-Eric",
"Charles-Edouard",
"Charles-Étienne",
"Charles-Henri",
"Charles-Marie",
"Charles-Olivier",
"Charles-Orland",
"Christian-Jacques",
"Claude-Henri",
"Claude-Pierre",
"Côme-Edouard",
"David-Frédéric",
"David-Vincent",
"Don-Yves",
"Eric-Antoine",
"Etienne-Henri",
"Félix-Antoine",
"Félix-Marie",
"François-Alexandre",
"François-Charles",
"François-David",
"François-Éric",
"François-Ferdinand",
"François-Guillaume",
"François-Henri",
"François-Jérôme",
"François-Joseph",
"François-Julien",
"François-Louis",
"François-Marie",
"François-Nicolas",
"François-Pierre",
"François-Régis",
"François-René",
"François-Xavier",
"Frédéric-François",
"Gonzague-Edouard",
"Guy-Marie",
"Guy-Noël",
"Henri-Brice",
"Henri-Claude",
"Henri-Jean",
"Henri-Jules",
"Henri-Louis",
"Henri-Michel",
"Henri-Paul",
"Henri-Pierre",
"Henri-Xavier",
"Ian-Aël",
"Jacques-Alexandre",
"Jacques-Antoine",
"Jacques-Édouard",
"Jacques-Étienne",
"Jacques-Henri",
"Jacques-Olivier",
"Jacques-Pierre",
"Jacques-Yves",
"Jean-Adrien",
"Jean-Aimé",
"Jean-Alain",
"Jean-Albert",
"Jean-Alexandre",
"Jean-Alexis",
"Jean-Alfred",
"Jean-André",
"Jean-Antoine",
"Jean-Arnaud",
"Jean-Auguste",
"Jean-Baptiste",
"Jean-Barthélémy",
"Jean-Bastien",
"Jean-Benoît",
"Jean-Bernard",
"Jean-Bertrand",
"Jean-Blaise",
"Jean-Bosco",
"Jean-Briac",
"Jean-Brice",
"Jean-Bruno",
"Jean-Camille",
"Jean-Casimir",
"Jean-Cédric",
"Jean-Charles",
"Jean-Christian",
"Jean-Christophe",
"Jean-Clair",
"Jean-Claude",
"Jean-Clément",
"Jean-Côme",
"Jean-Cyril",
"Jean-Damien",
"Jean-Daniel",
"Jean-David",
"Jean-Denis",
"Jean-Didier",
"Jean-Dominique",
"Jean-Édouard",
"Jean-Élie",
"Jean-Éloi",
"Jean-Émile",
"Jean-Éric",
"Jean-Étienne",
"Jean-Eudes",
"Jean-Fabien",
"Jean-Francis",
"Jean-François",
"Jean-Gabriel",
"Jean-Gaël",
"Jean-Guy",
"Jean-Hugues",
"Jean-Jacques",
"Jean-Joseph",
"Jean-Julien",
"Jean-Kévin",
"Jean-Louis",
"Jean-Loup",
"Jean-Luc",
"Jean-Marc",
"Jean-Marcel",
"Jean-Marie",
"Jean-Matthieu",
"Jean-Maxime",
"Jean-Maurice",
"Jean-Michel",
"Jean-Nicolas",
"Jean-Noël",
"Jean-Pascal",
"Jean-Patrick",
"Jean-Paul",
"Jean-Philibert",
"Jean-Philippe",
"Jean-Pierre",
"Jean-René",
"Jean-Robert",
"Jean-Roch",
"Jean-Roland",
"Jean-Sébastien",
"Jean-Simon",
"Jean-Thomas",
"Jean-Vincent",
"Jean-Victor",
"Jean-Xavier",
"Jean-Yves",
"Jérôme-André",
"Joël-Francis",
"Joseph-Antoine",
"Joseph-Désiré",
"Joseph-Marie",
"Jules-Edouard",
"Jules-Henri",
"Julien-Pierre",
"Léo-Paul",
"Louis-André",
"Louis-Antoine",
"Louis-Arthur",
"Louis-Auguste",
"Louis-Benjamin",
"Louis-Benoît",
"Louis-Bernard",
"Louis-Casimir",
"Louis-Camille",
"Louis-Charles",
"Louis-Daniel",
"Louis-Ferdinand",
"Louis-Joseph",
"Louis-Marie",
"Louis-Mathieu",
"Louis-Philippe",
"Louis-Roger",
"Louis-Théophane",
"Louis-Valentin",
"Louis-Victor",
"Louis-Xavier",
"Loup-Léis",
"Luc-Henri",
"Marc-André",
"Marc-Antoine",
"Marc-Aurèle",
"Marc-Emmanuel",
"Marc-Olivier",
"Marie-Joseph",
"Michael-Corneille",
"Paul-Albert",
"Paul-Alexandre",
"Paul-André",
"Paul-Antoine",
"Paul-Armand",
"Paul-Arthur",
"Paul-Elie",
"Paul-Eloi",
"Paul-Emile",
"Paul-Étienne",
"Paul-Henri",
"Paul-Louis",
"Paul-Loup",
"Paul-Marie",
"Paul-Saintange",
"Paul-Vincent",
"Philippe-Alexandre",
"Philippe-Auguste",
"Pierre-Alexandre",
"Pierre-André",
"Pierre-Antoine",
"Pierre-Arnaud",
"Pierre-Baptiste",
"Pierre-Benoit",
"Pierre-Charles",
"Pierre-Côme",
"Pierre-Cyrille",
"Pierre-Edouard",
"Pierre-Elie",
"Pierre-Eloi",
"Pierre-Emmanuel",
"Pierre-Étienne",
"Pierre-Eugène",
"Pierre-François",
"Pierre-Henri",
"Pierre-Jean",
"Pierre-Jérôme",
"Pierre-Jules",
"Pierre-Julien",
"Pierre-Paul",
"Pierre-Quentin",
"Pierre-Valentin",
"Pierre-Vincent",
"Pierre-Xavier",
"Rémi-Etienne",
"René-Charles",
"René-Marc",
"René-Jean",
"René-Paul",
"René-Pierre",
"René-Yves",
"Thibaud-Marie",
"Vincent-Xavier",
"Xavier-Alexandre",
"Xavier-François",
"Xavier-Marie",
"Xavier-Pierre",
"Yann-Aël",
"Yann-Alrick",
"Yann-Éric",
"Yann-Gaël",
"Yann-Yves",
"Yann-Vari",
"Yves-Alain",
"Yves-Alexandre",
"Yves-André",
"Yves-Éric",
"Yves-Henri",
"Yves-Laurent",
"Yves-Marie",
"Yves-Michel",
"Yves-Olivier",
"Yves-Pierre"
] ]


@@ -10,15 +10,18 @@ from .lex_attrs import LEX_ATTRS
 from .syntax_iterators import SYNTAX_ITERATORS
 from ..tokenizer_exceptions import BASE_EXCEPTIONS
+from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG
-from ...util import update_exc
+from ...attrs import LANG, NORM
+from ...util import update_exc, add_lookups
 class IndonesianDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters[LANG] = lambda text: 'id'
     lex_attr_getters.update(LEX_ATTRS)
+    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM],
+                                         BASE_NORMS, NORM_EXCEPTIONS)
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
     prefixes = TOKENIZER_PREFIXES
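
The hunk above wires the `NORM` getter through `add_lookups` with two tables. The sketch below shows how a helper with that call shape might behave: each table is consulted in order, and the default getter is used when none matches. It is a standalone re-implementation written for illustration, not the library's source, and the table contents and lowercase default are made-up assumptions:

```python
def add_lookups(default_func, *lookups):
    """Return a getter that checks each lookup table before default_func."""
    def get_attr(string):
        for lookup in lookups:
            if string in lookup:
                return lookup[string]
        # No table had an entry, so fall back to the default getter.
        return default_func(string)
    return get_attr


# Hypothetical tables, for illustration only.
BASE_NORMS = {"’": "'"}
NORM_EXCEPTIONS = {"gak": "tidak"}

# Assumed default: lowercase the token text.
get_norm = add_lookups(lambda string: string.lower(),
                       BASE_NORMS, NORM_EXCEPTIONS)
```

In this sketch the tables are checked in the order given, so an entry in `BASE_NORMS` would win over one in `NORM_EXCEPTIONS` for the same string.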


@ -24,7 +24,7 @@ aci-acinya
aco-acoan aco-acoan
ad-blocker ad-blocker
ad-interim ad-interim
ada-ada saja ada-ada
ada-adanya ada-adanya
ada-adanyakah ada-adanyakah
adang-adang adang-adang
@ -243,7 +243,6 @@ bari-bari
barik-barik barik-barik
baris-berbaris baris-berbaris
baru-baru baru-baru
baru-baru ini
baru-batu baru-batu
barung-barung barung-barung
basa-basi basa-basi
@ -1059,7 +1058,6 @@ box-to-box
boyo-boyo boyo-boyo
buah-buahan buah-buahan
buang-buang buang-buang
buang-buang air
buat-buatan buat-buatan
buaya-buaya buaya-buaya
bubun-bubun bubun-bubun
@ -1226,7 +1224,6 @@ deg-degan
degap-degap degap-degap
dekak-dekak dekak-dekak
dekat-dekat dekat-dekat
dengan -
dengar-dengaran dengar-dengaran
dengking-mendengking dengking-mendengking
departemen-departemen departemen-departemen
@ -1246,6 +1243,7 @@ dibayang-bayangi
dibuat-buat dibuat-buat
diiming-imingi diiming-imingi
dilebih-lebihkan dilebih-lebihkan
dimana-mana
dimata-matai dimata-matai
dinas-dinas dinas-dinas
dinul-Islam dinul-Islam
@ -1278,6 +1276,57 @@ dulang-dulang
duri-duri duri-duri
duta-duta duta-duta
dwi-kewarganegaraan dwi-kewarganegaraan
e-arena
e-billing
e-budgeting
e-cctv
e-class
e-commerce
e-counting
e-elektronik
e-entertainment
e-evolution
e-faktur
e-filing
e-fin
e-form
e-government
e-govt
e-hakcipta
e-id
e-info
e-katalog
e-ktp
e-leadership
e-lhkpn
e-library
e-loket
e-m1
e-money
e-news
e-nisn
e-npwp
e-paspor
e-paten
e-pay
e-perda
e-perizinan
e-planning
e-polisi
e-power
e-punten
e-retribusi
e-samsat
e-sport
e-store
e-tax
e-ticketing
e-tilang
e-toll
e-visa
e-voting
e-wallet
e-warong
ecek-ecek ecek-ecek
eco-friendly eco-friendly
eco-park eco-park
@ -1440,7 +1489,25 @@ ginang-ginang
girap-girap girap-girap
girik-girik girik-girik
giring-giring giring-giring
go-auto
go-bills
go-bluebird
go-box
go-car
go-clean
go-food
go-glam
go-jek
go-kart go-kart
go-mart
go-massage
go-med
go-points
go-pulsa
go-ride
go-send
go-shop
go-tix
go-to-market go-to-market
goak-goak goak-goak
goal-line goal-line
@ -1488,7 +1555,6 @@ hang-out
hantu-hantu hantu-hantu
happy-happy happy-happy
harap-harap harap-harap
harap-harap cemas
harap-harapan harap-harapan
hard-disk hard-disk
harga-harga harga-harga
@ -1633,7 +1699,7 @@ jor-joran
jotos-jotosan jotos-jotosan
juak-juak juak-juak
jual-beli jual-beli
juang-juang !!? lenjuang juang-juang
julo-julo julo-julo
julung-julung julung-julung
julur-julur julur-julur
@ -1787,6 +1853,7 @@ kemarah-marahan
kemasam-masaman kemasam-masaman
kemati-matian kemati-matian
kembang-kembang kembang-kembang
kemenpan-rb
kementerian-kementerian kementerian-kementerian
kemerah-merahan kemerah-merahan
kempang-kempis kempang-kempis
@ -1827,7 +1894,6 @@ keras-mengerasi
kercap-kercip kercap-kercip
kercap-kercup kercap-kercup
keriang-keriut keriang-keriut
kering-kering air
kerja-kerja kerja-kerja
kernyat-kernyut kernyat-kernyut
kerobak-kerabit kerobak-kerabit
@ -1952,7 +2018,7 @@ kuda-kudaan
kudap-kudap kudap-kudap
kue-kue kue-kue
kulah-kulah kulah-kulah
kulak-kulak tangan kulak-kulak
kulik-kulik kulik-kulik
kulum-kulum kulum-kulum
kumat-kamit kumat-kamit
@ -2086,7 +2152,6 @@ lumba-lumba
lumi-lumi lumi-lumi
luntang-lantung luntang-lantung
lupa-lupa lupa-lupa
lupa-lupa ingat
lupa-lupaan lupa-lupaan
lurah-camat lurah-camat
maaf-memaafkan maaf-memaafkan
@ -2097,6 +2162,7 @@ macan-macanan
machine-to-machine machine-to-machine
mafia-mafia mafia-mafia
mahasiswa-mahasiswi mahasiswa-mahasiswi
mahasiswa/i
mahi-mahi mahi-mahi
main-main main-main
main-mainan main-mainan
@ -2185,14 +2251,14 @@ memandai-mandai
memanggil-manggil memanggil-manggil
memanis-manis memanis-manis
memanjut-manjut memanjut-manjut
memantas-mantas diri memantas-mantas
memasak-masak memasak-masak
memata-matai memata-matai
mematah-matah mematah-matah
mematuk-matuk mematuk-matuk
mematut-matut mematut-matut
memau-mau memau-mau
memayah-mayahkan (diri) memayah-mayahkan
membaca-baca membaca-baca
membacah-bacah membacah-bacah
membagi-bagikan membagi-bagikan
@ -2576,6 +2642,7 @@ meraung-raungkan
merayau-rayau merayau-rayau
merayu-rayu merayu-rayu
mercak-mercik mercak-mercik
mercedes-benz
merek-merek merek-merek
mereka-mereka mereka-mereka
mereka-reka mereka-reka
@ -2627,9 +2694,9 @@ morat-marit
move-on move-on
muda-muda muda-muda
muda-mudi muda-mudi
muda/i
mudah-mudahan mudah-mudahan
muka-muka muka-muka
muka-muka (dengan -)
mula-mula mula-mula
multiple-output multiple-output
muluk-muluk muluk-muluk
@ -2791,6 +2858,7 @@ paus-paus
paut-memaut paut-memaut
pay-per-click pay-per-click
paya-paya paya-paya
pdi-p
pecah-pecah pecah-pecah
pecat-pecatan pecat-pecatan
peer-to-peer peer-to-peer
@ -2951,6 +3019,7 @@ putih-hitam
putih-putih putih-putih
putra-putra putra-putra
putra-putri putra-putri
putra/i
putri-putri putri-putri
putus-putus putus-putus
putusan-putusan putusan-putusan
@ -3069,6 +3138,7 @@ sambung-bersambung
sambung-menyambung sambung-menyambung
sambut-menyambut sambut-menyambut
samo-samo samo-samo
sampah-sampah
sampai-sampai sampai-sampai
samping-menyamping samping-menyamping
sana-sini sana-sini
@ -3204,7 +3274,7 @@ seolah-olah
sepala-pala sepala-pala
sepandai-pandai sepandai-pandai
sepetang-petangan sepetang-petangan
sepoi-sepoi (basa) sepoi-sepoi
sepraktis-praktisnya sepraktis-praktisnya
sepuas-puasnya sepuas-puasnya
serak-serak serak-serak
@ -3278,6 +3348,7 @@ sisa-sisa
sisi-sisi sisi-sisi
siswa-siswa siswa-siswa
siswa-siswi siswa-siswi
siswa/i
siswi-siswi siswi-siswi
situ-situ situ-situ
situs-situs situs-situs
@ -3380,6 +3451,7 @@ tanggul-tanggul
tanggung-menanggung tanggung-menanggung
tanggung-tanggung tanggung-tanggung
tank-tank tank-tank
tante-tante
tanya-jawab tanya-jawab
tapa-tapa tapa-tapa
tapak-tapak tapak-tapak
@ -3424,7 +3496,6 @@ teralang-alang
terambang-ambang terambang-ambang
terambung-ambung terambung-ambung
terang-terang terang-terang
terang-terang laras
terang-terangan terang-terangan
teranggar-anggar teranggar-anggar
terangguk-angguk terangguk-angguk
@ -3438,7 +3509,6 @@ terayap-rayap
terbada-bada terbada-bada
terbahak-bahak terbahak-bahak
terbang-terbang terbang-terbang
terbang-terbang hinggap
terbata-bata terbata-bata
terbatuk-batuk terbatuk-batuk
terbayang-bayang terbayang-bayang

View File

@ -18199,7 +18199,6 @@ LOOKUP = {
'sekelap': 'kelap', 'sekelap': 'kelap',
'kelap-kelip': 'terkelap', 'kelap-kelip': 'terkelap',
'mengelapkan': 'lap', 'mengelapkan': 'lap',
'sekelap': 'terkelap',
'berlapar': 'lapar', 'berlapar': 'lapar',
'kelaparan': 'lapar', 'kelaparan': 'lapar',
'kelaparannya': 'lapar', 'kelaparannya': 'lapar',
@ -30179,7 +30178,6 @@ LOOKUP = {
'terperonyok': 'peronyok', 'terperonyok': 'peronyok',
'terperosok': 'perosok', 'terperosok': 'perosok',
'terperosoknya': 'perosok', 'terperosoknya': 'perosok',
'merosot': 'perosot',
'memerosot': 'perosot', 'memerosot': 'perosot',
'memerosotkan': 'perosot', 'memerosotkan': 'perosot',
'kepustakaan': 'pustaka', 'kepustakaan': 'pustaka',

View File

@ -1,7 +1,10 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
from ...attrs import LIKE_NUM import unicodedata
from .punctuation import LIST_CURRENCY
from ...attrs import IS_CURRENCY, LIKE_NUM
_num_words = ['nol', 'satu', 'dua', 'tiga', 'empat', 'lima', 'enam', 'tujuh', _num_words = ['nol', 'satu', 'dua', 'tiga', 'empat', 'lima', 'enam', 'tujuh',
@ -27,6 +30,17 @@ def like_num(text):
return False return False
def is_currency(text):
if text in LIST_CURRENCY:
return True
for char in text:
if unicodedata.category(char) != 'Sc':
return False
return True
LEX_ATTRS = { LEX_ATTRS = {
IS_CURRENCY: is_currency,
LIKE_NUM: like_num LIKE_NUM: like_num
} }
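The `is_currency` check added in this hunk can be tried in isolation. The following is a minimal sketch, not the spaCy module itself: `LIST_CURRENCY` is a small hypothetical stand-in for the list imported from `.punctuation`, whose real contents are not shown in this diff.

```python
import unicodedata

# Hypothetical stand-in for LIST_CURRENCY from .punctuation (assumption:
# the real list is built from char_classes._currency and is not shown here).
LIST_CURRENCY = ["Rp", "IDR", "$", "US$"]


def is_currency(text):
    # Known currency strings such as "Rp" match directly.
    if text in LIST_CURRENCY:
        return True
    # Otherwise every character must be a Unicode currency symbol (category "Sc").
    for char in text:
        if unicodedata.category(char) != "Sc":
            return False
    return True
```

With this logic an empty string passes the loop vacuously and returns `True`, so a caller that may see empty tokens would probably want to guard against that case.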

View File

@ -1,7 +1,535 @@
"""
Slang and abbreviations
List of frequently misspelled words
https://id.wikipedia.org/wiki/Wikipedia:Daftar_kosakata_bahasa_Indonesia_yang_sering_salah_dieja
"""
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
_exc = {} _exc = {
# Slang and abbreviations
"silahkan": "silakan",
"yg": "yang",
"kalo": "kalau",
"cawu": "caturwulan",
"ok": "oke",
"gak": "tidak",
"enggak": "tidak",
"nggak": "tidak",
"ndak": "tidak",
"ngga": "tidak",
"dgn": "dengan",
"tdk": "tidak",
"jg": "juga",
"klo": "kalau",
"denger": "dengar",
"pinter": "pintar",
"krn": "karena",
"nemuin": "menemukan",
"jgn": "jangan",
"udah": "sudah",
"sy": "saya",
"udh": "sudah",
"dapetin": "mendapatkan",
"ngelakuin": "melakukan",
"ngebuat": "membuat",
"membikin": "membuat",
"bikin": "buat",
# Frequently misspelled words
"malpraktik": "malapraktik",
"malfungsi": "malafungsi",
"malserap": "malaserap",
"maladaptasi": "malaadaptasi",
"malsuai": "malasuai",
"maldistribusi": "maladistribusi",
"malgizi": "malagizi",
"malsikap": "malasikap",
"memperhatikan": "memerhatikan",
"akte": "akta",
"cemilan": "camilan",
"esei": "esai",
"frase": "frasa",
"kafeteria": "kafetaria",
"ketapel": "katapel",
"kenderaan": "kendaraan",
"menejemen": "manajemen",
"menejer": "manajer",
"mesjid": "masjid",
"rebo": "rabu",
"seksama": "saksama",
"senggama": "sanggama",
"sekedar": "sekadar",
"seprei": "seprai",
"semedi": "semadi",
"samadi": "semadi",
"amandemen": "amendemen",
"algoritma": "algoritme",
"aritmatika": "aritmetika",
"metoda": "metode",
"materai": "meterai",
"meterei": "meterai",
"kalendar": "kalender",
"kadaluwarsa": "kedaluwarsa",
"katagori": "kategori",
"parlamen": "parlemen",
"sekular": "sekuler",
"selular": "seluler",
"sirkular": "sirkuler",
"survai": "survei",
"survey": "survei",
"aktuil": "aktual",
"formil": "formal",
"trotoir": "trotoar",
"komersiil": "komersial",
"komersil": "komersial",
"tradisionil": "tradisional",
"orisinil": "orisinal",
"orijinil": "orisinal",
"afdol": "afdal",
"antri": "antre",
"apotik": "apotek",
"atlit": "atlet",
"atmosfir": "atmosfer",
"cidera": "cedera",
"cendikiawan": "cendekiawan",
"cepet": "cepat",
"cinderamata": "cenderamata",
"debet": "debit",
"difinisi": "definisi",
"dekrit": "dekret",
"disain": "desain",
"diskripsi": "deskripsi",
"diskotik": "diskotek",
"eksim": "eksem",
"exim": "eksem",
"faidah": "faedah",
"ekstrim": "ekstrem",
"ekstrimis": "ekstremis",
"komplit": "komplet",
"konkrit": "konkret",
"kongkrit": "konkret",
"kongkret": "konkret",
"kridit": "kredit",
"musium": "museum",
"pinalti": "penalti",
"piranti": "peranti",
"pinsil": "pensil",
"personil": "personel",
"sistim": "sistem",
"teoritis": "teoretis",
"vidio": "video",
"cengkeh": "cengkih",
"desertasi": "disertasi",
"hakekat": "hakikat",
"intelejen": "intelijen",
"kaedah": "kaidah",
"kempes": "kempis",
"kementrian": "kementerian",
"ledeng": "leding",
"nasehat": "nasihat",
"penasehat": "penasihat",
"praktek": "praktik",
"praktekum": "praktikum",
"resiko": "risiko",
"retsleting": "ritsleting",
"senen": "senin",
"amuba": "ameba",
"punggawa": "penggawa",
"surban": "serban",
"nomer": "nomor",
"sorban": "serban",
"bis": "bus",
"agribisnis": "agrobisnis",
"kantung": "kantong",
"khutbah": "khotbah",
"mandur": "mandor",
"rubuh": "roboh",
"pastur": "pastor",
"supir": "sopir",
"goncang": "guncang",
"goa": "gua",
"kaos": "kaus",
"kokoh": "kukuh",
"komulatif": "kumulatif",
"kolomnis": "kolumnis",
"korma": "kurma",
"lobang": "lubang",
"limo": "limusin",
"limosin": "limusin",
"mangkok": "mangkuk",
"saos": "saus",
"sop": "sup",
"sorga": "surga",
"tegor": "tegur",
"telor": "telur",
"obrak-abrik": "ubrak-abrik",
"ekwivalen": "ekuivalen",
"frekwensi": "frekuensi",
"konsekwensi": "konsekuensi",
"kwadran": "kuadran",
"kwadrat": "kuadrat",
"kwalifikasi": "kualifikasi",
"kwalitas": "kualitas",
"kwalitet": "kualitas",
"kwalitatif": "kualitatif",
"kwantitas": "kuantitas",
"kwantitatif": "kuantitatif",
"kwantum": "kuantum",
"kwartal": "kuartal",
"kwintal": "kuintal",
"kwitansi": "kuitansi",
"kwatir": "khawatir",
"kuatir": "khawatir",
"jadual": "jadwal",
"hirarki": "hierarki",
"karir": "karier",
"aktip": "aktif",
"daptar": "daftar",
"efektip": "efektif",
"epektif": "efektif",
"epektip": "efektif",
"Pebruari": "Februari",
"pisik": "fisik",
"pondasi": "fondasi",
"photo": "foto",
"photokopi": "fotokopi",
"hapal": "hafal",
"insap": "insaf",
"insyaf": "insaf",
"konperensi": "konferensi",
"kreatip": "kreatif",
"kreativ": "kreatif",
"maap": "maaf",
"napsu": "nafsu",
"negatip": "negatif",
"negativ": "negatif",
"objektip": "objektif",
"obyektip": "objektif",
"obyektif": "objektif",
"pasip": "pasif",
"pasiv": "pasif",
"positip": "positif",
"positiv": "positif",
"produktip": "produktif",
"produktiv": "produktif",
"sarap": "saraf",
"sertipikat": "sertifikat",
"subjektip": "subjektif",
"subyektip": "subjektif",
"subyektif": "subjektif",
"tarip": "tarif",
"transitip": "transitif",
"transitiv": "transitif",
"faham": "paham",
"fikir": "pikir",
"berfikir": "berpikir",
"telefon": "telepon",
"telfon": "telepon",
"telpon": "telepon",
"tilpon": "telepon",
"nafas": "napas",
"bernafas": "bernapas",
"pernafasan": "pernapasan",
"vermak": "permak",
"vulpen": "pulpen",
"aktifis": "aktivis",
"konfeksi": "konveksi",
"motifasi": "motivasi",
"Nopember": "November",
"propinsi": "provinsi",
"babtis": "baptis",
"jerembab": "jerembap",
"lembab": "lembap",
"sembab": "sembap",
"saptu": "sabtu",
"tekat": "tekad",
"bejad": "bejat",
"nekad": "nekat",
"otoped": "otopet",
"skuad": "skuat",
"jenius": "genius",
"marjin": "margin",
"marjinal": "marginal",
"obyek": "objek",
"subyek": "subjek",
"projek": "proyek",
"azas": "asas",
"ijasah": "ijazah",
"jenasah": "jenazah",
"plasa": "plaza",
"bathin": "batin",
"Katholik": "Katolik",
"orthografi": "ortografi",
"pathogen": "patogen",
"theologi": "teologi",
"ijin": "izin",
"rejeki": "rezeki",
"rejim": "rezim",
"jaman": "zaman",
"jamrud": "zamrud",
"jinah": "zina",
"perjinahan": "perzinaan",
"anugrah": "anugerah",
"cendrawasih": "cenderawasih",
"jendral": "jenderal",
"kripik": "keripik",
"krupuk": "kerupuk",
"ksatria": "kesatria",
"mentri": "menteri",
"negri": "negeri",
"Prancis": "Perancis",
"sebrang": "seberang",
"menyebrang": "menyeberang",
"Sumatra": "Sumatera",
"trampil": "terampil",
"isteri": "istri",
"justeru": "justru",
"perajurit": "prajurit",
"putera": "putra",
"puteri": "putri",
"samudera": "samudra",
"sastera": "sastra",
"sutera": "sutra",
"terompet": "trompet",
"iklas": "ikhlas",
"iktisar": "ikhtisar",
"khafilah": "kafilah",
"kawatir": "khawatir",
"kotbah": "khotbah",
"kusyuk": "khusyuk",
"makluk": "makhluk",
"mahluk": "makhluk",
"mahkluk": "makhluk",
"nahkoda": "nakhoda",
"nakoda": "nakhoda",
"tahta": "takhta",
"takhyul": "takhayul",
"tahyul": "takhayul",
"tahayul": "takhayul",
"akhli": "ahli",
"anarkhi": "anarki",
"kharisma": "karisma",
"kharismatik": "karismatik",
"mahsud": "maksud",
"makhsud": "maksud",
"rakhmat": "rahmat",
"tekhnik": "teknik",
"tehnik": "teknik",
"tehnologi": "teknologi",
"ikhwal": "ihwal",
"expor": "ekspor",
"extra": "ekstra",
"komplex": "kompleks",
"sex": "seks",
"taxi": "taksi",
"extasi": "ekstasi",
"syaraf": "saraf",
"syurga": "surga",
"mashur": "masyhur",
"masyur": "masyhur",
"mahsyur": "masyhur",
"mashyur": "masyhur",
"muadzin": "muazin",
"adzan": "azan",
"ustadz": "ustaz",
"ustad": "ustaz",
"ustadzah": "ustazah",
"dzikir": "zikir",
"dzuhur": "zuhur",
"dhuhur": "zuhur",
"zhuhur": "zuhur",
"analisa": "analisis",
"diagnosa": "diagnosis",
"hipotesa": "hipotesis",
"sintesa": "sintesis",
"aktiviti": "aktivitas",
"aktifitas": "aktivitas",
"efektifitas": "efektivitas",
"komuniti": "komunitas",
"kreatifitas": "kreativitas",
"produktifitas": "produktivitas",
"realiti": "realitas",
"realita": "realitas",
"selebriti": "selebritas",
"sportifitas": "sportivitas",
"universiti": "universitas",
"utiliti": "utilitas",
"validiti": "validitas",
"dilokalisir": "dilokalisasi",
"didramatisir": "didramatisasi",
"dipolitisir": "dipolitisasi",
"dinetralisir": "dinetralisasi",
"dikonfrontir": "dikonfrontasi",
"mendominir": "mendominasi",
"koordinir": "koordinasi",
"proklamir": "proklamasi",
"terorganisir": "terorganisasi",
"terealisir": "terealisasi",
"robah": "ubah",
"dirubah": "diubah",
"merubah": "mengubah",
"terlanjur": "telanjur",
"terlantar": "telantar",
"penglepasan": "pelepasan",
"pelihatan": "penglihatan",
"pemukiman": "permukiman",
"pengrumahan": "perumahan",
"penyewaan": "persewaan",
"menyintai": "mencintai",
"menyolok": "mencolok",
"contek": "sontek",
"mencontek": "menyontek",
"pungkir": "mungkir",
"dipungkiri": "dimungkiri",
"kupungkiri": "kumungkiri",
"kaupungkiri": "kaumungkiri",
"nampak": "tampak",
"nampaknya": "tampaknya",
"nongkrong": "tongkrong",
"berternak": "beternak",
"berterbangan": "beterbangan",
"berserta": "beserta",
"berperkara": "beperkara",
"berpergian": "bepergian",
"berkerja": "bekerja",
"berberapa": "beberapa",
"terbersit": "tebersit",
"terpercaya": "tepercaya",
"terperdaya": "teperdaya",
"terpercik": "tepercik",
"terpergok": "tepergok",
"aksesoris": "aksesori",
"handal": "andal",
"hantar": "antar",
"panutan": "anutan",
"atsiri": "asiri",
"bhakti": "bakti",
"china": "cina",
"dharma": "darma",
"diktaktor": "diktator",
"eksport": "ekspor",
"hembus": "embus",
"hadits": "hadis",
"hadist": "hadis",
"harafiah": "harfiah",
"himbau": "imbau",
"import": "impor",
"inget": "ingat",
"hisap": "isap",
"interprestasi": "interpretasi",
"kangker": "kanker",
"konggres": "kongres",
"lansekap": "lanskap",
"maghrib": "magrib",
"emak": "mak",
"moderen": "modern",
"pasport": "paspor",
"perduli": "peduli",
"ramadhan": "ramadan",
"rapih": "rapi",
"Sansekerta": "Sanskerta",
"shalat": "salat",
"sholat": "salat",
"silahkan": "silakan",
"standard": "standar",
"hutang": "utang",
"zinah": "zina",
"ambulan": "ambulans",
"antartika": "antarktika",
"arteri": "arteria",
"asik": "asyik",
"australi": "australia",
"denga": "dengan",
"depo": "depot",
"detil": "detail",
"ensiklopedi": "ensiklopedia",
"elit": "elite",
"frustasi": "frustrasi",
"gladi": "geladi",
"greget": "gereget",
"itali": "italia",
"karna": "karena",
"klenteng": "kelenteng",
"erling": "kerling",
"kontruksi": "konstruksi",
"masal": "massal",
"merk": "merek",
"respon": "respons",
"diresponi": "direspons",
"skak": "sekak",
"stir": "setir",
"singapur": "singapura",
"standarisasi": "standardisasi",
"varitas": "varietas",
"amphibi": "amfibi",
"anjlog": "anjlok",
"alpukat": "avokad",
"alpokat": "avokad",
"bolpen": "pulpen",
"cabe": "cabai",
"cabay": "cabai",
"ceret": "cerek",
"differensial": "diferensial",
"duren": "durian",
"faksimili": "faksimile",
"faksimil": "faksimile",
"graha": "gerha",
"goblog": "goblok",
"gombrong": "gombroh",
"horden": "gorden",
"korden": "gorden",
"gubug": "gubuk",
"imaginasi": "imajinasi",
"jerigen": "jeriken",
"jirigen": "jeriken",
"carut-marut": "karut-marut",
"kwota": "kuota",
"mahzab": "mazhab",
"mempesona": "memesona",
"milyar": "miliar",
"missi": "misi",
"nenas": "nanas",
"negoisasi": "negosiasi",
"automotif": "otomotif",
"pararel": "paralel",
"paska": "pasca",
"prosen": "persen",
"pete": "petai",
"petay": "petai",
"proffesor": "profesor",
"rame": "ramai",
"rapot": "rapor",
"rileks": "relaks",
"rileksasi": "relaksasi",
"renumerasi": "remunerasi",
"seketaris": "sekretaris",
"sekertaris": "sekretaris",
"sensorik": "sensoris",
"sentausa": "sentosa",
"strawberi": "stroberi",
"strawbery": "stroberi",
"taqwa": "takwa",
"tauco": "taoco",
"tauge": "taoge",
"toge": "taoge",
"tauladan": "teladan",
"taubat": "tobat",
"trilyun": "triliun",
"vissi": "visi",
"coklat": "cokelat",
"narkotika": "narkotik",
"oase": "oasis",
"politisi": "politikus",
"terong": "terung",
"wool": "wol",
"himpit": "impit",
"mujizat": "mukjizat",
"mujijat": "mukjizat",
"yag": "yang",
}
NORM_EXCEPTIONS = {} NORM_EXCEPTIONS = {}
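As an illustration of what a table like `_exc` is for, the sketch below applies a three-entry excerpt to a list of tokens. The helper name `normalize` is hypothetical, and how spaCy itself consults `NORM_EXCEPTIONS` is not shown in this diff, so treat this as a plain dictionary lookup rather than the library's mechanism.

```python
# Small excerpt of the mapping above, copied for illustration only.
_exc = {
    "yg": "yang",
    "gak": "tidak",
    "praktek": "praktik",
}


def normalize(token):
    # Hypothetical helper: return the normalized form if one is listed,
    # otherwise return the token unchanged. The lookup is case-sensitive.
    return _exc.get(token, token)


tokens = "yg gak praktek itu".split()
normalized = [normalize(t) for t in tokens]
```

Because the lookup is case-sensitive, an entry keyed as `"Pebruari"` would not match a lowercase `"pebruari"` unless the caller adds the case variants itself.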

View File

@ -4,7 +4,7 @@ from __future__ import unicode_literals
from ..punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES from ..punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
from ..char_classes import merge_chars, split_chars, _currency, _units from ..char_classes import merge_chars, split_chars, _currency, _units
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES
from ..char_classes import QUOTES, UNITS, ALPHA, ALPHA_LOWER, ALPHA_UPPER, HYPHENS from ..char_classes import QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER, HYPHENS
_units = (_units + 's bit Gbps Mbps mbps Kbps kbps ƒ ppi px ' _units = (_units + 's bit Gbps Mbps mbps Kbps kbps ƒ ppi px '
'Hz kHz MHz GHz mAh ' 'Hz kHz MHz GHz mAh '
@ -25,7 +25,7 @@ HTML_SUFFIX = r'</(b|strong|i|em|p|span|div|a)>'
MONTHS = merge_chars(_months) MONTHS = merge_chars(_months)
LIST_CURRENCY = split_chars(_currency) LIST_CURRENCY = split_chars(_currency)
TOKENIZER_PREFIXES.remove('#') # hashtag TOKENIZER_PREFIXES.remove('#') # hashtag
_prefixes = TOKENIZER_PREFIXES + LIST_CURRENCY + [HTML_PREFIX] + ['/', ''] _prefixes = TOKENIZER_PREFIXES + LIST_CURRENCY + [HTML_PREFIX] + ['/', '']
_suffixes = TOKENIZER_SUFFIXES + [r'\-[Nn]ya', '-[KkMm]u', '[—-]'] + [ _suffixes = TOKENIZER_SUFFIXES + [r'\-[Nn]ya', '-[KkMm]u', '[—-]'] + [

View File

@ -1,763 +1,122 @@
"""
List of stop words in Bahasa Indonesia.
"""
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
STOP_WORDS = set(""" STOP_WORDS = set("""
ada ada adalah adanya adapun agak agaknya agar akan akankah akhir akhiri akhirnya
adalah aku akulah amat amatlah anda andalah antar antara antaranya apa apaan apabila
adanya apakah apalagi apatah artinya asal asalkan atas atau ataukah ataupun awal
adapun
agak
agaknya
agar
akan
akankah
akhir
akhiri
akhirnya
aku
akulah
amat
amatlah
anda
andalah
antar
antara
antaranya
apa
apaan
apabila
apakah
apalagi
apatah
artinya
asal
asalkan
atas
atau
ataukah
ataupun
awal
awalnya awalnya
bagai
bagaikan bagai bagaikan bagaimana bagaimanakah bagaimanapun bagi bagian bahkan bahwa
bagaimana bahwasanya baik bakal bakalan balik banyak bapak baru bawah beberapa begini
bagaimanakah beginian beginikah beginilah begitu begitukah begitulah begitupun bekerja
bagaimanapun belakang belakangan belum belumlah benar benarkah benarlah berada berakhir
bagi berakhirlah berakhirnya berapa berapakah berapalah berapapun berarti berawal
bagian berbagai berdatangan beri berikan berikut berikutnya berjumlah berkali-kali
bahkan berkata berkehendak berkeinginan berkenaan berlainan berlalu berlangsung
bahwa berlebihan bermacam bermacam-macam bermaksud bermula bersama bersama-sama
bahwasanya bersiap bersiap-siap bertanya bertanya-tanya berturut berturut-turut bertutur
baik berujar berupa besar betul betulkah biasa biasanya bila bilakah bisa bisakah
bakal boleh bolehkah bolehlah buat bukan bukankah bukanlah bukannya bulan bung
bakalan
balik cara caranya cukup cukupkah cukuplah cuma
banyak
bapak dahulu dalam dan dapat dari daripada datang dekat demi demikian demikianlah
baru dengan depan di dia diakhiri diakhirinya dialah diantara diantaranya diberi
bawah diberikan diberikannya dibuat dibuatnya didapat didatangkan digunakan
beberapa diibaratkan diibaratkannya diingat diingatkan diinginkan dijawab dijelaskan
begini dijelaskannya dikarenakan dikatakan dikatakannya dikerjakan diketahui
beginian diketahuinya dikira dilakukan dilalui dilihat dimaksud dimaksudkan
beginikah dimaksudkannya dimaksudnya diminta dimintai dimisalkan dimulai dimulailah
beginilah dimulainya dimungkinkan dini dipastikan diperbuat diperbuatnya dipergunakan
begitu diperkirakan diperlihatkan diperlukan diperlukannya dipersoalkan dipertanyakan
begitukah dipunyai diri dirinya disampaikan disebut disebutkan disebutkannya disini
begitulah disinilah ditambahkan ditandaskan ditanya ditanyai ditanyakan ditegaskan
begitupun ditujukan ditunjuk ditunjuki ditunjukkan ditunjukkannya ditunjuknya dituturkan
bekerja dituturkannya diucapkan diucapkannya diungkapkan dong dua dulu
belakang
belakangan empat enggak enggaknya entah entahlah
belum
belumlah guna gunakan
benar
benarkah hal hampir hanya hanyalah hari harus haruslah harusnya hendak hendaklah
benarlah hendaknya hingga
berada
berakhir ia ialah ibarat ibaratkan ibaratnya ibu ikut ingat ingat-ingat ingin inginkah
berakhirlah inginkan ini inikah inilah itu itukah itulah
berakhirnya
berapa jadi jadilah jadinya jangan jangankan janganlah jauh jawab jawaban jawabnya
berapakah jelas jelaskan jelaslah jelasnya jika jikalau juga jumlah jumlahnya justru
berapalah
berapapun kala kalau kalaulah kalaupun kalian kami kamilah kamu kamulah kan kapan
berarti kapankah kapanpun karena karenanya kasus kata katakan katakanlah katanya ke
berawal keadaan kebetulan kecil kedua keduanya keinginan kelamaan kelihatan
berbagai kelihatannya kelima keluar kembali kemudian kemungkinan kemungkinannya kenapa
berdatangan kepada kepadanya kesampaian keseluruhan keseluruhannya keterlaluan ketika
beri khususnya kini kinilah kira kira-kira kiranya kita kitalah kok kurang
berikan
berikut lagi lagian lah lain lainnya lalu lama lamanya lanjut lanjutnya lebih lewat
berikutnya lima luar
berjumlah
berkali-kali macam maka makanya makin malah malahan mampu mampukah mana manakala manalagi
berkata masa masalah masalahnya masih masihkah masing masing-masing mau maupun
berkehendak melainkan melakukan melalui melihat melihatnya memang memastikan memberi
berkeinginan memberikan membuat memerlukan memihak meminta memintakan memisalkan memperbuat
berkenaan mempergunakan memperkirakan memperlihatkan mempersiapkan mempersoalkan
berlainan mempertanyakan mempunyai memulai memungkinkan menaiki menambahkan menandaskan
berlalu menanti menanti-nanti menantikan menanya menanyai menanyakan mendapat
berlangsung mendapatkan mendatang mendatangi mendatangkan menegaskan mengakhiri mengapa
berlebihan mengatakan mengatakannya mengenai mengerjakan mengetahui menggunakan
bermacam menghendaki mengibaratkan mengibaratkannya mengingat mengingatkan menginginkan
bermacam-macam mengira mengucapkan mengucapkannya mengungkapkan menjadi menjawab menjelaskan
bermaksud menuju menunjuk menunjuki menunjukkan menunjuknya menurut menuturkan
bermula menyampaikan menyangkut menyatakan menyebutkan menyeluruh menyiapkan merasa
bersama mereka merekalah merupakan meski meskipun meyakini meyakinkan minta mirip
bersama-sama misal misalkan misalnya mula mulai mulailah mulanya mungkin mungkinkah
bersiap
bersiap-siap nah naik namun nanti nantinya nyaris nyatanya
bertanya
bertanya-tanya oleh olehnya
berturut
berturut-turut pada padahal padanya pak paling panjang pantas para pasti pastilah penting
bertutur pentingnya per percuma perlu perlukah perlunya pernah persoalan pertama
berujar pertama-tama pertanyaan pertanyakan pihak pihaknya pukul pula pun punya
berupa
besar rasa rasanya rata rupanya
betul
betulkah saat saatnya saja sajalah saling sama sama-sama sambil sampai sampai-sampai
biasa sampaikan sana sangat sangatlah satu saya sayalah se sebab sebabnya sebagai
biasanya sebagaimana sebagainya sebagian sebaik sebaik-baiknya sebaiknya sebaliknya
bila sebanyak sebegini sebegitu sebelum sebelumnya sebenarnya seberapa sebesar
bilakah sebetulnya sebisanya sebuah sebut sebutlah sebutnya secara secukupnya sedang
bisa sedangkan sedemikian sedikit sedikitnya seenaknya segala segalanya segera
bisakah seharusnya sehingga seingat sejak sejauh sejenak sejumlah sekadar sekadarnya
boleh sekali sekali-kali sekalian sekaligus sekalipun sekarang sekecil
bolehkah seketika sekiranya sekitar sekitarnya sekurang-kurangnya sekurangnya sela
bolehlah selain selaku selalu selama selama-lamanya selamanya selanjutnya seluruh
buat seluruhnya semacam semakin semampu semampunya semasa semasih semata semata-mata
bukan semaunya sementara semisal semisalnya sempat semua semuanya semula sendiri
bukankah sendirian sendirinya seolah seolah-olah seorang sepanjang sepantasnya
bukanlah sepantasnyalah seperlunya seperti sepertinya sepihak sering seringnya serta
bukannya serupa sesaat sesama sesampai sesegera sesekali seseorang sesuatu sesuatunya
bulan sesudah sesudahnya setelah setempat setengah seterusnya setiap setiba setibanya
bung setidak-tidaknya setidaknya setinggi seusai sewaktu siap siapa siapakah
cara siapapun sini sinilah soal soalnya suatu sudah sudahkah sudahlah supaya
caranya
cukup tadi tadinya tahu tahun tak tambah tambahnya tampak tampaknya tandas tandasnya
cukupkah tanpa tanya tanyakan tanyanya tapi tegas tegasnya telah tempat tengah tentang
cukuplah tentu tentulah tentunya tepat terakhir terasa terbanyak terdahulu terdapat
cuma terdiri terhadap terhadapnya teringat teringat-ingat terjadi terjadilah
dahulu terjadinya terkira terlalu terlebih terlihat termasuk ternyata tersampaikan
dalam tersebut tersebutlah tertentu tertuju terus terutama tetap tetapi tiap tiba
dan tiba-tiba tidak tidakkah tidaklah tiga tinggi toh tunjuk turut tutur tuturnya
dapat
dari ucap ucapnya ujar ujarnya umum umumnya ungkap ungkapnya untuk usah usai
daripada
datang waduh wah wahai waktu waktunya walau walaupun wong
dekat
demi yaitu yakin yakni yang
demikian """.split())
demikianlah
dengan
depan
di
dia
diakhiri
diakhirinya
dialah
diantara
diantaranya
diberi
diberikan
diberikannya
dibuat
dibuatnya
didapat
didatangkan
digunakan
diibaratkan
diibaratkannya
diingat
diingatkan
diinginkan
dijawab
dijelaskan
dijelaskannya
dikarenakan
dikatakan
dikatakannya
dikerjakan
diketahui
diketahuinya
dikira
dilakukan
dilalui
dilihat
dimaksud
dimaksudkan
dimaksudkannya
dimaksudnya
diminta
dimintai
dimisalkan
dimulai
dimulailah
dimulainya
dimungkinkan
dini
dipastikan
diperbuat
diperbuatnya
dipergunakan
diperkirakan
diperlihatkan
diperlukan
diperlukannya
dipersoalkan
dipertanyakan
dipunyai
diri
dirinya
disampaikan
disebut
disebutkan
disebutkannya
disini
disinilah
ditambahkan
ditandaskan
ditanya
ditanyai
ditanyakan
ditegaskan
ditujukan
ditunjuk
ditunjuki
ditunjukkan
ditunjukkannya
ditunjuknya
dituturkan
dituturkannya
diucapkan
diucapkannya
diungkapkan
dong
dua
dulu
empat
enggak
enggaknya
entah
entahlah
guna
gunakan
hal
hampir
hanya
hanyalah
hari
harus
haruslah
harusnya
hendak
hendaklah
hendaknya
hingga
ia
ialah
ibarat
ibaratkan
ibaratnya
ibu
ikut
ingat
ingat-ingat
ingin
inginkah
inginkan
ini
inikah
inilah
itu
itukah
itulah
jadi
jadilah
jadinya
jangan
jangankan
janganlah
jauh
jawab
jawaban
jawabnya
jelas
jelaskan
jelaslah
jelasnya
jika
jikalau
juga
jumlah
jumlahnya
justru
kala
kalau
kalaulah
kalaupun
kalian
kami
kamilah
kamu
kamulah
kan
kapan
kapankah
kapanpun
karena
karenanya
kasus
kata
katakan
katakanlah
katanya
ke
keadaan
kebetulan
kecil
kedua
keduanya
keinginan
kelamaan
kelihatan
kelihatannya
kelima
keluar
kembali
kemudian
kemungkinan
kemungkinannya
kenapa
kepada
kepadanya
kesampaian
keseluruhan
keseluruhannya
keterlaluan
ketika
khususnya
kini
kinilah
kira
kira-kira
kiranya
kita
kitalah
kok
kurang
lagi
lagian
lah
lain
lainnya
lalu
lama
lamanya
lanjut
lanjutnya
lebih
lewat
lima
luar
macam
maka
makanya
makin
malah
malahan
mampu
mampukah
mana
manakala
manalagi
masa
masalah
masalahnya
masih
masihkah
masing
masing-masing
mau
maupun
melainkan
melakukan
melalui
melihat
melihatnya
memang
memastikan
memberi
memberikan
membuat
memerlukan
memihak
meminta
memintakan
memisalkan
memperbuat
mempergunakan
memperkirakan
memperlihatkan
mempersiapkan
mempersoalkan
mempertanyakan
mempunyai
memulai
memungkinkan
menaiki
menambahkan
menandaskan
menanti
menanti-nanti
menantikan
menanya
menanyai
menanyakan
mendapat
mendapatkan
mendatang
mendatangi
mendatangkan
menegaskan
mengakhiri
mengapa
mengatakan
mengatakannya
mengenai
mengerjakan
mengetahui
menggunakan
menghendaki
mengibaratkan
mengibaratkannya
mengingat
mengingatkan
menginginkan
mengira
mengucapkan
mengucapkannya
mengungkapkan
menjadi
menjawab
menjelaskan
menuju
menunjuk
menunjuki
menunjukkan
menunjuknya
menurut
menuturkan
menyampaikan
menyangkut
menyatakan
menyebutkan
menyeluruh
menyiapkan
merasa
mereka
merekalah
merupakan
meski
meskipun
meyakini
meyakinkan
minta
mirip
misal
misalkan
misalnya
mula
mulai
mulailah
mulanya
mungkin
mungkinkah
nah
naik
namun
nanti
nantinya
nyaris
nyatanya
oleh
olehnya
pada
padahal
padanya
pak
paling
panjang
pantas
para
pasti
pastilah
penting
pentingnya
per
percuma
perlu
perlukah
perlunya
pernah
persoalan
pertama
pertama-tama
pertanyaan
pertanyakan
pihak
pihaknya
pukul
pula
pun
punya
rasa
rasanya
rata
rupanya
saat
saatnya
saja
sajalah
saling
sama
sama-sama
sambil
sampai
sampai-sampai
sampaikan
sana
sangat
sangatlah
satu
saya
sayalah
se
sebab
sebabnya
sebagai
sebagaimana
sebagainya
sebagian
sebaik
sebaik-baiknya
sebaiknya
sebaliknya
sebanyak
sebegini
sebegitu
sebelum
sebelumnya
sebenarnya
seberapa
sebesar
sebetulnya
sebisanya
sebuah
sebut
sebutlah
sebutnya
secara
secukupnya
sedang
sedangkan
sedemikian
sedikit
sedikitnya
seenaknya
segala
segalanya
segera
seharusnya
sehingga
seingat
sejak
sejauh
sejenak
sejumlah
sekadar
sekadarnya
sekali
sekali-kali
sekalian
sekaligus
sekalipun
sekarang
sekarang
sekecil
seketika
sekiranya
sekitar
sekitarnya
sekurang-kurangnya
sekurangnya
sela
selain
selaku
selalu
selama
selama-lamanya
selamanya
selanjutnya
seluruh
seluruhnya
semacam
semakin
semampu
semampunya
semasa
semasih
semata
semata-mata
semaunya
sementara
semisal
semisalnya
sempat
semua
semuanya
semula
sendiri
sendirian
sendirinya
seolah
seolah-olah
seorang
sepanjang
sepantasnya
sepantasnyalah
seperlunya
seperti
sepertinya
sepihak
sering
seringnya
serta
serupa
sesaat
sesama
sesampai
sesegera
sesekali
seseorang
sesuatu
sesuatunya
sesudah
sesudahnya
setelah
setempat
setengah
seterusnya
setiap
setiba
setibanya
setidak-tidaknya
setidaknya
setinggi
seusai
sewaktu
siap
siapa
siapakah
siapapun
sini
sinilah
soal
soalnya
suatu
sudah
sudahkah
sudahlah
supaya
tadi
tadinya
tahu
tahun
tak
tambah
tambahnya
tampak
tampaknya
tandas
tandasnya
tanpa
tanya
tanyakan
tanyanya
tapi
tegas
tegasnya
telah
tempat
tengah
tentang
tentu
tentulah
tentunya
tepat
terakhir
terasa
terbanyak
terdahulu
terdapat
terdiri
terhadap
terhadapnya
teringat
teringat-ingat
terjadi
terjadilah
terjadinya
terkira
terlalu
terlebih
terlihat
termasuk
ternyata
tersampaikan
tersebut
tersebutlah
tertentu
tertuju
terus
terutama
tetap
tetapi
tiap
tiba
tiba-tiba
tidak
tidakkah
tidaklah
tiga
tinggi
toh
tunjuk
turut
tutur
tuturnya
ucap
ucapnya
ujar
ujarnya
umum
umumnya
ungkap
ungkapnya
untuk
usah
usai
waduh
wah
wahai
waktu
waktunya
walau
walaupun
wong
yaitu
yakin
yakni
yang
""".split())

View File

@ -1,10 +1,11 @@
"""
List of abbreviations and acronyms from:
https://id.wiktionary.org/wiki/Wiktionary:Daftar_singkatan_dan_akronim_bahasa_Indonesia#A
"""
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
import regex as re
from ._tokenizer_exceptions_list import ID_BASE_EXCEPTIONS from ._tokenizer_exceptions_list import ID_BASE_EXCEPTIONS
from ..tokenizer_exceptions import URL_PATTERN
from ...symbols import ORTH, LEMMA, NORM from ...symbols import ORTH, LEMMA, NORM
@ -22,6 +23,9 @@ for orth in ID_BASE_EXCEPTIONS:
orth_lower = orth.lower() orth_lower = orth.lower()
_exc[orth_lower] = [{ORTH: orth_lower}] _exc[orth_lower] = [{ORTH: orth_lower}]
orth_first_upper = orth[0].upper() + orth[1:]
_exc[orth_first_upper] = [{ORTH: orth_first_upper}]
if '-' in orth: if '-' in orth:
orth_title = '-'.join([part.title() for part in orth.split('-')]) orth_title = '-'.join([part.title() for part in orth.split('-')])
_exc[orth_title] = [{ORTH: orth_title}] _exc[orth_title] = [{ORTH: orth_title}]
@ -30,28 +34,6 @@ for orth in ID_BASE_EXCEPTIONS:
_exc[orth_caps] = [{ORTH: orth_caps}] _exc[orth_caps] = [{ORTH: orth_caps}]
for exc_data in [ for exc_data in [
{ORTH: "CKG", LEMMA: "Cakung", NORM: "Cakung"},
{ORTH: "CGP", LEMMA: "Grogol Petamburan", NORM: "Grogol Petamburan"},
{ORTH: "KSU", LEMMA: "Kepulauan Seribu Utara", NORM: "Kepulauan Seribu Utara"},
{ORTH: "KYB", LEMMA: "Kebayoran Baru", NORM: "Kebayoran Baru"},
{ORTH: "TJP", LEMMA: "Tanjungpriok", NORM: "Tanjungpriok"},
{ORTH: "TNA", LEMMA: "Tanah Abang", NORM: "Tanah Abang"},
{ORTH: "BEK", LEMMA: "Bengkayang", NORM: "Bengkayang"},
{ORTH: "KTP", LEMMA: "Ketapang", NORM: "Ketapang"},
{ORTH: "MPW", LEMMA: "Mempawah", NORM: "Mempawah"},
{ORTH: "NGP", LEMMA: "Nanga Pinoh", NORM: "Nanga Pinoh"},
{ORTH: "NBA", LEMMA: "Ngabang", NORM: "Ngabang"},
{ORTH: "PTK", LEMMA: "Pontianak", NORM: "Pontianak"},
{ORTH: "PTS", LEMMA: "Putussibau", NORM: "Putussibau"},
{ORTH: "SBS", LEMMA: "Sambas", NORM: "Sambas"},
{ORTH: "SAG", LEMMA: "Sanggau", NORM: "Sanggau"},
{ORTH: "SED", LEMMA: "Sekadau", NORM: "Sekadau"},
{ORTH: "SKW", LEMMA: "Singkawang", NORM: "Singkawang"},
{ORTH: "STG", LEMMA: "Sintang", NORM: "Sintang"},
{ORTH: "SKD", LEMMA: "Sukadane", NORM: "Sukadane"},
{ORTH: "SRY", LEMMA: "Sungai Raya", NORM: "Sungai Raya"},
{ORTH: "Jan.", LEMMA: "Januari", NORM: "Januari"}, {ORTH: "Jan.", LEMMA: "Januari", NORM: "Januari"},
{ORTH: "Feb.", LEMMA: "Februari", NORM: "Februari"}, {ORTH: "Feb.", LEMMA: "Februari", NORM: "Februari"},
{ORTH: "Mar.", LEMMA: "Maret", NORM: "Maret"}, {ORTH: "Mar.", LEMMA: "Maret", NORM: "Maret"},
@ -66,25 +48,43 @@ for exc_data in [
{ORTH: "Des.", LEMMA: "Desember", NORM: "Desember"}]: {ORTH: "Des.", LEMMA: "Desember", NORM: "Desember"}]:
_exc[exc_data[ORTH]] = [exc_data] _exc[exc_data[ORTH]] = [exc_data]
_other_exc = {
"do'a": [{ORTH: "do'a", LEMMA: "doa", NORM: "doa"}],
"jum'at": [{ORTH: "jum'at", LEMMA: "Jumat", NORM: "Jumat"}],
"Jum'at": [{ORTH: "Jum'at", LEMMA: "Jumat", NORM: "Jumat"}],
"la'nat": [{ORTH: "la'nat", LEMMA: "laknat", NORM: "laknat"}],
"ma'af": [{ORTH: "ma'af", LEMMA: "maaf", NORM: "maaf"}],
"mu'jizat": [{ORTH: "mu'jizat", LEMMA: "mukjizat", NORM: "mukjizat"}],
"Mu'jizat": [{ORTH: "Mu'jizat", LEMMA: "mukjizat", NORM: "mukjizat"}],
"ni'mat": [{ORTH: "ni'mat", LEMMA: "nikmat", NORM: "nikmat"}],
"raka'at": [{ORTH: "raka'at", LEMMA: "rakaat", NORM: "rakaat"}],
"ta'at": [{ORTH: "ta'at", LEMMA: "taat", NORM: "taat"}],
}
_exc.update(_other_exc)
for orth in [ for orth in [
"A.AB.", "A.Ma.", "A.Md.", "A.Md.Keb.", "A.Md.Kep.", "A.P.", "A.AB.", "A.Ma.", "A.Md.", "A.Md.Keb.", "A.Md.Kep.", "A.P.",
"B.A.", "B.Ch.E.", "B.Sc.", "Dr.", "Dra.", "Drs.", "Hj.", "Ka.", "Kp.", "B.A.", "B.Ch.E.", "B.Sc.", "Dr.", "Dra.", "Drs.", "Hj.", "Ka.", "Kp.",
"M.AB", "M.Ag.", "M.AP", "M.Arl", "M.A.R.S", "M.Hum.", "M.I.Kom.", "M.Kes,", "M.AB", "M.Ag.", "M.AP", "M.Arl", "M.A.R.S", "M.Hum.", "M.I.Kom.",
"M.Kom.", "M.M.", "M.P.", "M.Pd.", "M.Psi.", "M.Psi.T.", "M.Sc.", "M.SArl", "M.Kes,", "M.Kom.", "M.M.", "M.P.", "M.Pd.", "M.Psi.", "M.Psi.T.", "M.Sc.",
"M.Si.", "M.Sn.", "M.T.", "M.Th.", "No.", "Pjs.", "Plt.", "R.A.", "M.SArl", "M.Si.", "M.Sn.", "M.T.", "M.Th.", "No.", "Pjs.", "Plt.", "R.A.",
"S.AB", "S.AP", "S.Adm", "S.Ag.", "S.Agr", "S.Ant", "S.Arl", "S.Ars", "S.AB", "S.AP", "S.Adm", "S.Ag.", "S.Agr", "S.Ant", "S.Arl", "S.Ars",
"S.A.R.S", "S.Ds", "S.E.", "S.E.I.", "S.Farm", "S.Gz.", "S.H.", "S.Han", "S.A.R.S", "S.Ds", "S.E.", "S.E.I.", "S.Farm", "S.Gz.", "S.H.", "S.Han",
"S.H.Int", "S.Hum", "S.Hut.", "S.In.", "S.IK.", "S.I.Kom.", "S.I.P", "S.IP", "S.H.Int", "S.Hum", "S.Hut.", "S.In.", "S.IK.", "S.I.Kom.", "S.I.P",
"S.P.", "S.Pt", "S.Psi", "S.Ptk", "S.Keb", "S.Ked", "S.Kep", "S.KG", "S.KH", "S.IP", "S.P.", "S.Pt", "S.Psi", "S.Ptk", "S.Keb", "S.Ked", "S.Kep",
"S.Kel", "S.K.M.", "S.Kedg.", "S.Kedh.", "S.Kom.", "S.KPM", "S.Mb", "S.Mat", "S.KG", "S.KH", "S.Kel", "S.K.M.", "S.Kedg.", "S.Kedh.", "S.Kom.", "S.KPM",
"S.Par", "S.Pd.", "S.Pd.I.", "S.Pd.SD", "S.Pol.", "S.Psi.", "S.S.", "S.SArl.", "S.Mb", "S.Mat", "S.Par", "S.Pd.", "S.Pd.I.", "S.Pd.SD", "S.Pol.",
"S.Sn", "S.Si.", "S.Si.Teol.", "S.SI.", "S.ST.", "S.ST.Han", "S.STP", "S.Sos.", "S.Psi.", "S.S.", "S.SArl.", "S.Sn", "S.Si.", "S.Si.Teol.", "S.SI.",
"S.Sy.", "S.T.", "S.T.Han", "S.Th.", "S.Th.I", "S.TI.", "S.T.P.", "S.TrK", "S.ST.", "S.ST.Han", "S.STP", "S.Sos.", "S.Sy.", "S.T.", "S.T.Han",
"S.Tekp.", "S.Th.", "S.Th.", "S.Th.I", "S.TI.", "S.T.P.", "S.TrK", "S.Tekp.", "S.Th.",
"a.l.", "a.n.", "a.s.", "b.d.", "d.a.", "d.l.", "d/h", "dkk.", "dll.", "Prof.", "drg.", "KH.", "Ust.", "Lc", "Pdt.", "S.H.H.", "Rm.", "Ps.",
"dr.", "drh.", "ds.", "dsb.", "dst.", "faks.", "fax.", "hlm.", "i/o", "St.", "M.A.", "M.B.A", "M.Eng.", "M.Eng.Sc.", "M.Pharm.", "Dr. med",
"n.b.", "p.p.", "pjs.", "s.d.", "tel.", "u.p.", "Dr.-Ing", "Dr. rer. nat.", "Dr. phil.", "Dr. iur.", "Dr. rer. oec",
]: "Dr. rer. pol.", "R.Ng.", "R.", "R.M.", "R.B.", "R.P.", "R.Ay.", "Rr.",
"R.Ngt.", "a.l.", "a.n.", "a.s.", "b.d.", "d.a.", "d.l.", "d/h", "dkk.",
"dll.", "dr.", "drh.", "ds.", "dsb.", "dst.", "faks.", "fax.", "hlm.",
"i/o", "n.b.", "p.p.", "pjs.", "s.d.", "tel.", "u.p."]:
_exc[orth] = [{ORTH: orth}] _exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = _exc TOKENIZER_EXCEPTIONS = _exc
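
The loop over `ID_BASE_EXCEPTIONS` in this hunk registers several casing variants of each base word. A minimal sketch of that idea in plain Python follows. `ORTH` here is a string stand-in for `spacy.symbols.ORTH`, the helper names `case_variants` and `build_exceptions` are hypothetical, and `tiba-tiba` is only an illustrative input, not necessarily an entry of the real list.

```python
ORTH = "orth"  # stand-in for spacy.symbols.ORTH (assumption of this sketch)


def case_variants(orth):
    """Return the casing variants of a non-empty base form."""
    # As written, lower-cased, and with only the first letter upper-cased.
    variants = {orth, orth.lower(), orth[0].upper() + orth[1:]}
    if "-" in orth:
        parts = orth.split("-")
        # Title-case and upper-case each hyphen-separated part.
        variants.add("-".join(part.title() for part in parts))
        variants.add("-".join(part.upper() for part in parts))
    return variants


def build_exceptions(base_words):
    """Map every casing variant to a single-token exception entry."""
    exc = {}
    for orth in base_words:
        for variant in case_variants(orth):
            exc[variant] = [{ORTH: variant}]
    return exc


exceptions = build_exceptions(["tiba-tiba"])
print(sorted(exceptions))
```

Collecting the variants in a set means a word that is already lower-case produces no duplicate key.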

View File

@ -0,0 +1,32 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = ['zero', 'jeden', 'dwa', 'trzy', 'cztery', 'pięć', 'sześć',
'siedem', 'osiem', 'dziewięć', 'dziesięć', 'jedenaście',
'dwanaście', 'trzynaście', 'czternaście',
'piętnaście', 'szesnaście', 'siedemnaście', 'osiemnaście',
'dziewiętnaście', 'dwadzieścia', 'trzydzieści', 'czterdzieści',
'pięćdziesiąt', 'sześćdziesiąt', 'siedemdziesiąt',
'osiemdziesiąt', 'dziewięćdziesiąt', 'sto', 'tysiąc', 'milion',
'miliard', 'bilion', 'trylion']
def like_num(text):
text = text.replace(',', '').replace('.', '')
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
return True
return False
LEX_ATTRS = {
LIKE_NUM: like_num
}
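
To make the behaviour of `like_num` concrete, here is a small self-contained sketch. The function body is copied from the hunk above, but `_num_words` is cut down to a handful of Polish entries and the sample inputs are made up for illustration.

```python
# Reduced word list for illustration; the module above lists many more.
_num_words = ['zero', 'jeden', 'dwa', 'trzy', 'sto', 'tysiąc']


def like_num(text):
    # Thousands and decimal separators are dropped before the digit check.
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    # Simple fractions such as "3/4" count as number-like.
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    # Number words are matched case-insensitively.
    if text.lower() in _num_words:
        return True
    return False


examples = ['10.000', '3/4', 'Dwa', '-5', '1/2/3', 'kot']
results = {text: like_num(text) for text in examples}
print(results)
```

Note that a leading sign (`-5`) and a string with two slashes (`1/2/3`) are not treated as number-like by this logic.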

View File

@ -4,12 +4,14 @@ from __future__ import unicode_literals
from ...attrs import LIKE_NUM from ...attrs import LIKE_NUM
_num_words = ['zero', 'um', 'dois', 'três', 'quatro', 'cinco', 'seis', 'sete', _num_words = ['zero', 'um', 'dois', 'três', 'tres', 'quatro', 'cinco', 'seis', 'sete', 'oito', 'nove', 'dez',
'oito', 'nove', 'dez', 'onze', 'doze', 'treze', 'catorze', 'onze', 'doze', 'dúzia', 'dúzias', 'duzia', 'duzias', 'treze', 'catorze', 'quinze', 'dezasseis',
'quinze', 'dezesseis', 'dezasseis', 'dezessete', 'dezassete', 'dezoito', 'dezenove', 'dezanove', 'vinte', 'dezassete', 'dezoito', 'dezanove', 'vinte', 'trinta', 'quarenta', 'cinquenta', 'sessenta',
'trinta', 'quarenta', 'cinquenta', 'sessenta', 'setenta', 'setenta', 'oitenta', 'noventa', 'cem', 'cento', 'duzentos', 'trezentos', 'quatrocentos',
'oitenta', 'noventa', 'cem', 'mil', 'milhão', 'bilhão', 'bilião', 'trilhão', 'trilião', 'quinhentos', 'seiscentos', 'setecentos', 'oitocentos', 'novecentos', 'mil', 'milhão', 'milhao',
'quatrilhão'] 'milhões', 'milhoes', 'bilhão', 'bilhao', 'bilhões', 'bilhoes', 'trilhão', 'trilhao', 'trilhões',
'trilhoes', 'quadrilhão', 'quadrilhao', 'quadrilhões', 'quadrilhoes']
_ordinal_words = ['primeiro', 'segundo', 'terceiro', 'quarto', 'quinto', 'sexto', _ordinal_words = ['primeiro', 'segundo', 'terceiro', 'quarto', 'quinto', 'sexto',
'sétimo', 'oitavo', 'nono', 'décimo', 'vigésimo', 'trigésimo', 'sétimo', 'oitavo', 'nono', 'décimo', 'vigésimo', 'trigésimo',

23
spacy/lang/si/__init__.py Normal file
View File

@ -0,0 +1,23 @@
# coding: utf8
from __future__ import unicode_literals
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from ...language import Language
from ...attrs import LANG
class SinhalaDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: 'si'
stop_words = STOP_WORDS
class Sinhala(Language):
lang = 'si'
Defaults = SinhalaDefaults
__all__ = ['Sinhala']

22
spacy/lang/si/examples.py Normal file
View File

@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.si.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"ඔබ කවුද?",
"ගූගල් සමාගම ඩොලර් මිලියන 500 කට එම ආයතනය මිලදී ගන්නා ලදී.",
"කොළඹ ශ්‍රී ලංකාවේ ප්‍රධානතම නගරය යි.",
"ප්‍රංශයේ ජනාධිපති කවරෙක් ද?",
"මට බිස්කට් 1 ක් දෙන්න",
"ඔවුන් ලකුණු 59 කින් තරඟය පරාජයට පත් විය.",
"1 ත් 10 ත් අතර සංඛ්‍යාවක් කියන්න",
"ඔහු සහ ඇය නුවර හෝ කොළඹ පදිංචි කරුවන් නොවේ",
]

View File

@ -0,0 +1,28 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = ['බින්දුව', 'බිංදුව', 'එක', 'දෙක', 'තුන', 'හතර', 'පහ', 'හය', 'හත',
'අට', 'නවය', 'නමය', 'දහය', 'එකොළහ', 'දොළහ', 'දහතුන', 'දහහතර',
'දාහතර', 'පහළව', 'පහළොව', 'දහසය', 'දහහත', 'දාහත', 'දහඅට',
'දහනවය', 'විස්ස', 'තිහ', 'හතළිහ', 'පනහ', 'හැට', 'හැත්තෑව', 'අසූව',
'අනූව', 'සියය', 'දහස', 'දාහ', 'ලක්ෂය', 'මිලියනය', 'කෝටිය',
'බිලියනය', 'ට්‍රිලියනය']
def like_num(text):
text = text.replace(',', '').replace('.', '')
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
return True
return False
LEX_ATTRS = {
LIKE_NUM: like_num
}

View File

@ -0,0 +1,51 @@
# coding: utf8
from __future__ import unicode_literals
# Stop words
STOP_WORDS = set("""
අතර
එචචර
එපමණ
එල
එව
කට
කද
නම
පමණ
පමණ
චර
පමණ
ලද
වග
වන
තර
වත
වද
සමඟ
සහ
වත
""".split())

23
spacy/lang/te/__init__.py Normal file
View File

@ -0,0 +1,23 @@
# coding: utf8
from __future__ import unicode_literals
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from ...language import Language
from ...attrs import LANG
class TeluguDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: 'te'
stop_words = STOP_WORDS
class Telugu(Language):
lang = 'te'
Defaults = TeluguDefaults
__all__ = ['Telugu']

24
spacy/lang/te/examples.py Normal file
View File

@ -0,0 +1,24 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.te import Telugu
>>> nlp = Telugu()
>>> from spacy.lang.te.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"ఆపిల్ 1 బిలియన్ డాలర్స్ కి యూ.కె. స్టార్ట్అప్ ని కొనాలని అనుకుంటుంది.",
"ఆటోనోమోస్ కార్లు భీమా బాధ్యతను తయారీదారులపైకి మళ్లిస్తాయి.",
"సాన్ ఫ్రాన్సిస్కో కాలిబాట డెలివరీ రోబోట్లను నిషేధించడానికి ఆలోచిస్తుంది.",
"లండన్ యునైటెడ్ కింగ్డమ్ లో పెద్ద సిటీ.",
"నువ్వు ఎక్కడ ఉన్నావ్?",
"ఫ్రాన్స్ అధ్యక్షుడు ఎవరు?",
"యునైటెడ్ స్టేట్స్ యొక్క రాజధాని ఏంటి?",
"బరాక్ ఒబామా ఎప్పుడు జన్మించారు?"
]

View File

@ -0,0 +1,28 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = ['సున్నా', 'శూన్యం', 'ఒకటి', 'రెండు', 'మూడు', 'నాలుగు', 'ఐదు', 'ఆరు',
'ఏడు', 'ఎనిమిది', 'తొమ్మిది', 'పది', 'పదకొండు', 'పన్నెండు', 'పదమూడు',
'పద్నాలుగు', 'పదిహేను', 'పదహారు', 'పదిహేడు', 'పద్దెనిమిది', 'పందొమ్మిది', 'ఇరవై',
'ముప్పై', 'నలభై', 'యాభై', 'అరవై', 'డెబ్బై', 'ఎనభై', 'తొంబై', 'వంద', 'నూరు',
'వెయ్యి', 'లక్ష', 'కోటి']
def like_num(text):
text = text.replace(',', '').replace('.', '')
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
return True
return False
LEX_ATTRS = {
LIKE_NUM: like_num
}

View File

@ -0,0 +1,57 @@
# coding: utf8
from __future__ import unicode_literals
# Source: https://github.com/Xangis/extra-stopwords (MIT License)
STOP_WORDS = set("""
దర
అడగి
అడగడ
అడ
అన
అనమతి
అనమతిి
అయి
ఇపపటి
ఉన
ఎకకడ
ఎప
ఎవర
ఎవర ఒకర
ఏద
ఏమనపపటిి
ఏమనపపటిి
ఒక
ఒక రకకన
కనిిి
ిి
యగలిిి
తగి
తర
తర
ి
రక
మధ
మధ
మరి
మర
మళ
రమ
వద
వద
యతి
""".split())

View File

@ -119,12 +119,6 @@ class Language(object):
`Language.Defaults.create_vocab`. `Language.Defaults.create_vocab`.
make_doc (callable): A function that takes text and returns a `Doc` make_doc (callable): A function that takes text and returns a `Doc`
object. Usually a `Tokenizer`. object. Usually a `Tokenizer`.
pipeline (list): A list of annotation processes or IDs of annotation,
processes, e.g. a `Tagger` object, or `'tagger'`. IDs are looked
up in `Language.Defaults.factories`.
disable (list): A list of component names to exclude from the pipeline.
The disable list has priority over the pipeline list -- if the same
string occurs in both, the component is not loaded.
meta (dict): Custom meta data for the Language class. Is written to by meta (dict): Custom meta data for the Language class. Is written to by
models to add model meta data. models to add model meta data.
max_length (int) : max_length (int) :

View File

@ -212,13 +212,22 @@ def pytest_addoption(parser):
def pytest_runtest_setup(item): def pytest_runtest_setup(item):
def getopt(opt):
# When using 'pytest --pyargs spacy' to test an installed copy of
# spacy, pytest skips running our pytest_addoption() hook. Later, when
# we call getoption(), pytest raises an error, because it doesn't
# recognize the option we're asking about. To avoid this, we need to
# pass a default value. We default to False, i.e., we act like all the
# options weren't given.
return item.config.getoption("--%s" % opt, False)
for opt in ['models', 'vectors', 'slow']: for opt in ['models', 'vectors', 'slow']:
if opt in item.keywords and not item.config.getoption("--%s" % opt): if opt in item.keywords and not getopt(opt):
pytest.skip("need --%s option to run" % opt) pytest.skip("need --%s option to run" % opt)
# Check if test is marked with models and has arguments set, i.e. specific # Check if test is marked with models and has arguments set, i.e. specific
# language. If so, skip test if flag not set. # language. If so, skip test if flag not set.
if item.get_marker('models'): if item.get_marker('models'):
for arg in item.get_marker('models').args: for arg in item.get_marker('models').args:
if not item.config.getoption("--%s" % arg) and not item.config.getoption("--all"): if not getopt(arg) and not getopt("all"):
pytest.skip("need --%s or --all option to run" % arg) pytest.skip("need --%s or --all option to run" % arg)
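
The `getopt` helper above relies on passing a default to `config.getoption`, so that an option that was never registered does not raise. The sketch below illustrates that pattern without importing pytest. `FakeConfig` is a hypothetical stand-in that only mimics the relevant behaviour (a lookup that raises `ValueError` for an unknown option unless a default is supplied), and `skip_reason` is an illustrative name.

```python
class FakeConfig(object):
    """Hypothetical stand-in for the pytest config object (sketch only)."""
    _NOTSET = object()

    def __init__(self, known):
        self._known = dict(known)

    def getoption(self, name, default=_NOTSET):
        key = name.lstrip("-")
        if key in self._known:
            return self._known[key]
        if default is FakeConfig._NOTSET:
            raise ValueError("no option named %r" % name)
        return default


def skip_reason(config, keywords, opts=("models", "vectors", "slow")):
    """Return a skip message if a marked test lacks its option, else None."""
    for opt in opts:
        # Unknown options fall back to False, as if the flag was not given.
        if opt in keywords and not config.getoption("--%s" % opt, False):
            return "need --%s option to run" % opt
    return None


# No options registered at all: the marked test is skipped, not an error.
print(skip_reason(FakeConfig({}), {"slow"}))
# Option registered and switched on: the test runs.
print(skip_reason(FakeConfig({"slow": True}), {"slow"}))
```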

View File

@ -5,6 +5,7 @@ from ..util import get_doc
from ...tokens import Doc from ...tokens import Doc
from ...vocab import Vocab from ...vocab import Vocab
from ...attrs import LEMMA from ...attrs import LEMMA
from ...tokens import Span
import pytest import pytest
import numpy import numpy
@ -156,6 +157,23 @@ def test_doc_api_merge(en_tokenizer):
assert doc[7].text == 'all night' assert doc[7].text == 'all night'
assert doc[7].text_with_ws == 'all night' assert doc[7].text_with_ws == 'all night'
# merge both with bulk merge
doc = en_tokenizer(text)
assert len(doc) == 9
with doc.retokenize() as retokenizer:
retokenizer.merge(doc[4: 7], attrs={'tag':'NAMED', 'lemma':'LEMMA',
'ent_type':'TYPE'})
retokenizer.merge(doc[7: 9], attrs={'tag':'NAMED', 'lemma':'LEMMA',
'ent_type':'TYPE'})
assert len(doc) == 6
assert doc[4].text == 'the beach boys'
assert doc[4].text_with_ws == 'the beach boys '
assert doc[4].tag_ == 'NAMED'
assert doc[5].text == 'all night'
assert doc[5].text_with_ws == 'all night'
assert doc[5].tag_ == 'NAMED'
def test_doc_api_merge_children(en_tokenizer): def test_doc_api_merge_children(en_tokenizer):
"""Test that attachments work correctly after merging.""" """Test that attachments work correctly after merging."""

View File

@ -4,6 +4,7 @@ from __future__ import unicode_literals
from ..util import get_doc from ..util import get_doc
from ...vocab import Vocab from ...vocab import Vocab
from ...tokens import Doc from ...tokens import Doc
from ...tokens import Span
import pytest import pytest
@ -16,16 +17,8 @@ def test_spans_merge_tokens(en_tokenizer):
assert len(doc) == 4 assert len(doc) == 4
assert doc[0].head.text == 'Angeles' assert doc[0].head.text == 'Angeles'
assert doc[1].head.text == 'start' assert doc[1].head.text == 'start'
doc.merge(0, len('Los Angeles'), tag='NNP', lemma='Los Angeles', ent_type='GPE') with doc.retokenize() as retokenizer:
assert len(doc) == 3 retokenizer.merge(doc[0 : 2], attrs={'tag':'NNP', 'lemma':'Los Angeles', 'ent_type':'GPE'})
assert doc[0].text == 'Los Angeles'
assert doc[0].head.text == 'start'
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
assert len(doc) == 4
assert doc[0].head.text == 'Angeles'
assert doc[1].head.text == 'start'
doc.merge(0, len('Los Angeles'), tag='NNP', lemma='Los Angeles', label='GPE')
assert len(doc) == 3 assert len(doc) == 3
assert doc[0].text == 'Los Angeles' assert doc[0].text == 'Los Angeles'
assert doc[0].head.text == 'start' assert doc[0].head.text == 'start'
@ -38,8 +31,8 @@ def test_spans_merge_heads(en_tokenizer):
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads) doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
assert len(doc) == 8 assert len(doc) == 8
doc.merge(doc[3].idx, doc[4].idx + len(doc[4]), tag=doc[4].tag_, with doc.retokenize() as retokenizer:
lemma='pilates class', ent_type='O') retokenizer.merge(doc[3 : 5], attrs={'tag':doc[4].tag_, 'lemma':'pilates class', 'ent_type':'O'})
assert len(doc) == 7 assert len(doc) == 7
assert doc[0].head.i == 1 assert doc[0].head.i == 1
assert doc[1].head.i == 1 assert doc[1].head.i == 1
@ -48,6 +41,14 @@ def test_spans_merge_heads(en_tokenizer):
assert doc[4].head.i in [1, 3] assert doc[4].head.i in [1, 3]
assert doc[5].head.i == 4 assert doc[5].head.i == 4
def test_spans_merge_non_disjoint(en_tokenizer):
text = "Los Angeles start."
tokens = en_tokenizer(text)
doc = get_doc(tokens.vocab, [t.text for t in tokens])
with pytest.raises(ValueError):
with doc.retokenize() as retokenizer:
retokenizer.merge(doc[0: 2], attrs={'tag': 'NNP', 'lemma': 'Los Angeles', 'ent_type': 'GPE'})
retokenizer.merge(doc[0: 1], attrs={'tag': 'NNP', 'lemma': 'Los Angeles', 'ent_type': 'GPE'})
def test_span_np_merges(en_tokenizer): def test_span_np_merges(en_tokenizer):
text = "displaCy is a parse tool built with Javascript" text = "displaCy is a parse tool built with Javascript"
@ -111,6 +112,25 @@ def test_spans_entity_merge_iob():
assert doc[0].ent_iob_ == "B" assert doc[0].ent_iob_ == "B"
assert doc[1].ent_iob_ == "I" assert doc[1].ent_iob_ == "I"
words = ["a", "b", "c", "d", "e", "f", "g", "h", "i"]
doc = Doc(Vocab(), words=words)
doc.ents = [(doc.vocab.strings.add('ent-de'), 3, 5),
(doc.vocab.strings.add('ent-fg'), 5, 7)]
assert doc[3].ent_iob_ == "B"
assert doc[4].ent_iob_ == "I"
assert doc[5].ent_iob_ == "B"
assert doc[6].ent_iob_ == "I"
with doc.retokenize() as retokenizer:
retokenizer.merge(doc[2 : 4])
retokenizer.merge(doc[4 : 6])
retokenizer.merge(doc[7 : 9])
for token in doc:
print(token)
print(token.ent_iob)
assert len(doc) == 6
assert doc[3].ent_iob_ == "B"
assert doc[4].ent_iob_ == "I"
def test_spans_sentence_update_after_merge(en_tokenizer): def test_spans_sentence_update_after_merge(en_tokenizer):
text = "Stewart Lee is a stand up comedian. He lives in England and loves Joe Pasquale." text = "Stewart Lee is a stand up comedian. He lives in England and loves Joe Pasquale."
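
`test_spans_merge_non_disjoint` above expects a `ValueError` when two queued merges share a token. A minimal sketch of that bookkeeping in plain Python, using token index ranges instead of `Span` objects, could look like this. `MergePlanner` is a hypothetical name, and unlike the loop in `Retokenizer.merge` it checks the whole range before recording anything.

```python
class MergePlanner(object):
    """Queue merges over token index ranges and reject overlapping ones."""

    def __init__(self):
        self.merges = []
        self.tokens_to_merge = set()

    def merge(self, start, end, attrs=None):
        # `end` is exclusive, like a Python slice.
        indices = set(range(start, end))
        clash = indices & self.tokens_to_merge
        if clash:
            raise ValueError("tokens %s are already part of a queued merge"
                             % sorted(clash))
        self.tokens_to_merge |= indices
        self.merges.append(((start, end), dict(attrs or {})))


planner = MergePlanner()
planner.merge(0, 2, {'tag': 'NNP'})
try:
    planner.merge(0, 1)  # token 0 is already claimed by the first merge
    overlap_rejected = False
except ValueError:
    overlap_rejected = True
print(overlap_rejected, len(planner.merges))
```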

View File

@ -10,6 +10,11 @@ PUNCTUATION_TESTS = [
(u'আমি বাংলায় কথা কই।', [u'আমি', u'বাংলায়', u'কথা', u'কই', u'।']), (u'আমি বাংলায় কথা কই।', [u'আমি', u'বাংলায়', u'কথা', u'কই', u'।']),
(u'বসুন্ধরা জনসম্মুখে দোষ স্বীকার করলো না?', [u'বসুন্ধরা', u'জনসম্মুখে', u'দোষ', u'স্বীকার', u'করলো', u'না', u'?']), (u'বসুন্ধরা জনসম্মুখে দোষ স্বীকার করলো না?', [u'বসুন্ধরা', u'জনসম্মুখে', u'দোষ', u'স্বীকার', u'করলো', u'না', u'?']),
(u'টাকা থাকলে কি না হয়!', [u'টাকা', u'থাকলে', u'কি', u'না', u'হয়', u'!']), (u'টাকা থাকলে কি না হয়!', [u'টাকা', u'থাকলে', u'কি', u'না', u'হয়', u'!']),
(u'সরকারি বিশ্ববিদ্যালয়-এর ছাত্র নই বলেই কি এমন আচরণ?',
[u'সরকারি', u'বিশ্ববিদ্যালয়', u'-', u'এর', u'ছাত্র', u'নই', u'বলেই', u'কি', u'এমন', u'আচরণ', u'?']),
(u'তারা বলে, "ওরা খামারের মুরগি।"', [u'তারা', u'বলে', ',', '"', u'ওরা', u'খামারের', u'মুরগি', u'।', '"']),
(u'৩*৩=৬?', [u'৩', u'*', u'৩', '=', u'৬', '?']),
(u'কাঁঠাল-এর গন্ধই অন্যরকম', [u'কাঁঠাল', '-', u'এর', u'গন্ধই', u'অন্যরকম']),
] ]
ABBREVIATIONS = [ ABBREVIATIONS = [

View File

@ -5,6 +5,9 @@
from __future__ import unicode_literals from __future__ import unicode_literals
from libc.string cimport memcpy, memset from libc.string cimport memcpy, memset
from libc.stdlib cimport malloc, free
from cymem.cymem cimport Pool
from .doc cimport Doc, set_children_from_heads, token_by_start, token_by_end from .doc cimport Doc, set_children_from_heads, token_by_start, token_by_end
from .span cimport Span from .span cimport Span
@ -14,24 +17,31 @@ from ..structs cimport LexemeC, TokenC
from ..attrs cimport TAG from ..attrs cimport TAG
from ..attrs import intify_attrs from ..attrs import intify_attrs
from ..util import SimpleFrozenDict from ..util import SimpleFrozenDict
from ..errors import Errors
cdef class Retokenizer: cdef class Retokenizer:
"""Helper class for doc.retokenize() context manager.""" """Helper class for doc.retokenize() context manager."""
cdef Doc doc cdef Doc doc
cdef list merges cdef list merges
cdef list splits cdef list splits
cdef set tokens_to_merge
def __init__(self, doc): def __init__(self, doc):
self.doc = doc self.doc = doc
self.merges = [] self.merges = []
self.splits = [] self.splits = []
self.tokens_to_merge = set()
def merge(self, Span span, attrs=SimpleFrozenDict()): def merge(self, Span span, attrs=SimpleFrozenDict()):
"""Mark a span for merging. The attrs will be applied to the resulting """Mark a span for merging. The attrs will be applied to the resulting
token. token.
""" """
for token in span:
if token.i in self.tokens_to_merge:
raise ValueError(Errors.E097.format(token=repr(token)))
self.tokens_to_merge.add(token.i)
attrs = intify_attrs(attrs, strings_map=self.doc.vocab.strings) attrs = intify_attrs(attrs, strings_map=self.doc.vocab.strings)
self.merges.append((span.start_char, span.end_char, attrs)) self.merges.append((span, attrs))
def split(self, Token token, orths, attrs=SimpleFrozenDict()): def split(self, Token token, orths, attrs=SimpleFrozenDict()):
"""Mark a Token for splitting, into the specified orths. The attrs """Mark a Token for splitting, into the specified orths. The attrs
@ -47,20 +57,22 @@ cdef class Retokenizer:
def __exit__(self, *args): def __exit__(self, *args):
# Do the actual merging here # Do the actual merging here
for start_char, end_char, attrs in self.merges: if len(self.merges) > 1:
start = token_by_start(self.doc.c, self.doc.length, start_char) _bulk_merge(self.doc, self.merges)
end = token_by_end(self.doc.c, self.doc.length, end_char) elif len(self.merges) == 1:
_merge(self.doc, start, end+1, attrs) (span, attrs) = self.merges[0]
start = span.start
end = span.end
_merge(self.doc, start, end, attrs)
for start_char, orths, attrs in self.splits: for start_char, orths, attrs in self.splits:
raise NotImplementedError raise NotImplementedError
def _merge(Doc doc, int start, int end, attributes): def _merge(Doc doc, int start, int end, attributes):
"""Retokenize the document, such that the span at """Retokenize the document, such that the span at
`doc.text[start_idx : end_idx]` is merged into a single token. If `doc.text[start_idx : end_idx]` is merged into a single token. If
`start_idx` and `end_idx` do not mark start and end token boundaries, `start_idx` and `end_idx` do not mark start and end token boundaries,
the document remains unchanged. the document remains unchanged.
start_idx (int): Character index of the start of the slice to merge. start_idx (int): Character index of the start of the slice to merge.
end_idx (int): Character index after the end of the slice to merge. end_idx (int): Character index after the end of the slice to merge.
**attributes: Attributes to assign to the merged token. By default, **attributes: Attributes to assign to the merged token. By default,
@ -131,3 +143,139 @@ def _merge(Doc doc, int start, int end, attributes):
# Clear the cached Python objects # Clear the cached Python objects
# Return the merged Python object # Return the merged Python object
return doc[start] return doc[start]
def _bulk_merge(Doc doc, merges):
"""Retokenize the document, such that each span described in 'merges'
is merged into a single token. The merges are sorted by start position
below, so they may be passed in any order, but they must not intersect
each other in any way.
merges: Tokens to merge, and corresponding attributes to assign to the
merged token. By default, attributes are inherited from the
syntactic root of the span.
RETURNS (Token): The first newly merged token.
"""
cdef Span span
cdef const LexemeC* lex
cdef Pool mem = Pool()
tokens = <TokenC**>mem.alloc(len(merges), sizeof(TokenC))
spans = []
def _get_start(merge):
return merge[0].start
merges.sort(key=_get_start)
for merge_index, (span, attributes) in enumerate(merges):
start = span.start
end = span.end
spans.append(span)
# House the new merged token where it starts
token = &doc.c[start]
tokens[merge_index] = token
# Assign attributes
for attr_name, attr_value in attributes.items():
if attr_name == TAG:
doc.vocab.morphology.assign_tag(token, attr_value)
else:
Token.set_struct_attr(token, attr_name, attr_value)
# Memorize span roots and set the dependencies of the newly merged
# tokens to the dependencies of their roots.
span_roots = []
for i, span in enumerate(spans):
span_roots.append(span.root.i)
tokens[i].dep = span.root.dep
# We update token.lex after keeping span root and dep, since
# setting token.lex will change span.start and span.end properties
# as it modifies the character offsets in the doc
for token_index in range(len(merges)):
new_orth = ''.join([t.text_with_ws for t in spans[token_index]])
if spans[token_index][-1].whitespace_:
new_orth = new_orth[:-len(spans[token_index][-1].whitespace_)]
lex = doc.vocab.get(doc.mem, new_orth)
tokens[token_index].lex = lex
# We set trailing space here too
tokens[token_index].spacy = doc.c[spans[token_index].end-1].spacy
# Begin by setting all the head indices to absolute token positions
# This is easier to work with for now than the offsets
# Before thinking of something simpler, beware the case where a
# dependency bridges over the entity. Here the alignment of the
# tokens changes.
for i in range(doc.length):
doc.c[i].head += i
# Set the head of the merged token from the Span
for i in range(len(merges)):
tokens[i].head = doc.c[span_roots[i]].head
# Adjust deps before shrinking tokens
# Tokens which point into the merged token should now point to it
# Subtract the offset from all tokens which point to >= end
offsets = []
current_span_index = 0
current_offset = 0
for i in range(doc.length):
if current_span_index < len(spans) and i == spans[current_span_index].end:
#last token was the last of the span
current_offset += (spans[current_span_index].end - spans[current_span_index].start) -1
current_span_index += 1
if current_span_index < len(spans) and \
spans[current_span_index].start <= i < spans[current_span_index].end:
offsets.append(spans[current_span_index].start - current_offset)
else:
offsets.append(i - current_offset)
for i in range(doc.length):
doc.c[i].head = offsets[doc.c[i].head]
# Now compress the token array
offset = 0
in_span = False
span_index = 0
for i in range(doc.length):
if in_span and i == spans[span_index].end:
# First token after a span
in_span = False
span_index += 1
if span_index < len(spans) and i == spans[span_index].start:
# First token in a span
doc.c[i - offset] = doc.c[i] # move token to its place
offset += (spans[span_index].end - spans[span_index].start) - 1
in_span = True
if not in_span:
doc.c[i - offset] = doc.c[i] # move token to its place
for i in range(doc.length - offset, doc.length):
memset(&doc.c[i], 0, sizeof(TokenC))
doc.c[i].lex = &EMPTY_LEXEME
doc.length -= offset
# ...And, set heads back to a relative position
for i in range(doc.length):
doc.c[i].head -= i
# Set the left/right children, left/right edges
set_children_from_heads(doc.c, doc.length)
# Make sure ent_iob remains consistent
for (span, _) in merges:
if(span.end < len(offsets)):
#if it's not the last span
token_after_span_position = offsets[span.end]
if doc.c[token_after_span_position].ent_iob == 1\
and doc.c[token_after_span_position - 1].ent_iob in (0, 2):
if doc.c[token_after_span_position - 1].ent_type == doc.c[token_after_span_position].ent_type:
doc.c[token_after_span_position - 1].ent_iob = 3
else:
# If they're not the same entity type, let them be two entities
doc.c[token_after_span_position].ent_iob = 3
# Return the merged Python object
return doc[spans[0].start]
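
The head-remapping step in `_bulk_merge` builds an `offsets` list that maps every old token index to its index after all merges. The sketch below reproduces just that computation in plain Python, assuming sorted, non-overlapping `(start, end)` pairs with exclusive ends. The function name `build_offsets` is hypothetical.

```python
def build_offsets(doc_length, spans):
    """Map old token indices to new ones after merging `spans`.

    spans: sorted, non-overlapping (start, end) pairs, end exclusive.
    """
    offsets = []
    span_index = 0
    removed = 0  # tokens removed by the spans that are fully behind us
    for i in range(doc_length):
        if span_index < len(spans) and i == spans[span_index][1]:
            # The previous token was the last one of the current span.
            start, end = spans[span_index]
            removed += (end - start) - 1
            span_index += 1
        if span_index < len(spans) and \
                spans[span_index][0] <= i < spans[span_index][1]:
            # Tokens inside a span all collapse onto the span's first slot.
            offsets.append(spans[span_index][0] - removed)
        else:
            offsets.append(i - removed)
    return offsets


# Nine tokens, merging tokens 4..6 and 7..8, as in test_doc_api_merge.
print(build_offsets(9, [(4, 7), (7, 9)]))
```

With these inputs the nine old indices map onto six new positions, which matches the `len(doc) == 6` assertion in the bulk-merge test.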

View File

@ -867,7 +867,7 @@ cdef class Doc:
''' '''
xp = get_array_module(self.tensor) xp = get_array_module(self.tensor)
if self.tensor.size == 0: if self.tensor.size == 0:
self.tensor.resize(tensor.shape) self.tensor.resize(tensor.shape, refcheck=False)
copy_array(self.tensor, tensor) copy_array(self.tensor, tensor)
else: else:
self.tensor = xp.hstack((self.tensor, tensor)) self.tensor = xp.hstack((self.tensor, tensor))
@ -884,6 +884,28 @@ cdef class Doc:
''' '''
return Retokenizer(self) return Retokenizer(self)
def _bulk_merge(self, spans, attributes):
"""Retokenize the document, such that the spans given as arguments
are merged into single tokens. The spans need to be in document
order, and no span intersection is allowed.
spans (Span[]): Spans to merge, in document order, with all span
intersections empty. Cannot be empty.
attributes (Dictionary[]): Attributes to assign to the merged tokens. Must
be the same length as spans; empty dictionaries are allowed. By default,
attributes are inherited from the syntactic root of the span.
RETURNS (Token): The first newly merged token.
"""
cdef unicode tag, lemma, ent_type
assert len(attributes) == len(spans), \
"attributes and spans must have the same length, got %d and %d" % (len(attributes), len(spans))
with self.retokenize() as retokenizer:
for i, span in enumerate(spans):
fix_attributes(self, attributes[i])
remove_label_if_necessary(attributes[i])
retokenizer.merge(span, attributes[i])
def merge(self, int start_idx, int end_idx, *args, **attributes): def merge(self, int start_idx, int end_idx, *args, **attributes):
"""Retokenize the document, such that the span at """Retokenize the document, such that the span at
`doc.text[start_idx : end_idx]` is merged into a single token. If `doc.text[start_idx : end_idx]` is merged into a single token. If
@ -905,20 +927,12 @@ cdef class Doc:
attributes[LEMMA] = lemma attributes[LEMMA] = lemma
attributes[ENT_TYPE] = ent_type attributes[ENT_TYPE] = ent_type
elif not args: elif not args:
if 'label' in attributes and 'ent_type' not in attributes: fix_attributes(self, attributes)
if isinstance(attributes['label'], int):
attributes[ENT_TYPE] = attributes['label']
else:
attributes[ENT_TYPE] = self.vocab.strings[attributes['label']]
if 'ent_type' in attributes:
attributes[ENT_TYPE] = attributes['ent_type']
elif args: elif args:
raise ValueError(Errors.E034.format(n_args=len(args), raise ValueError(Errors.E034.format(n_args=len(args),
args=repr(args), args=repr(args),
kwargs=repr(attributes))) kwargs=repr(attributes)))
# More deprecated attribute handling =/ remove_label_if_necessary(attributes)
if 'label' in attributes:
attributes['ent_type'] = attributes.pop('label')
attributes = intify_attrs(attributes, strings_map=self.vocab.strings) attributes = intify_attrs(attributes, strings_map=self.vocab.strings)
@ -1034,3 +1048,17 @@ def unpickle_doc(vocab, hooks_and_data, bytes_data):
copy_reg.pickle(Doc, pickle_doc, unpickle_doc) copy_reg.pickle(Doc, pickle_doc, unpickle_doc)
def remove_label_if_necessary(attributes):
# More deprecated attribute handling =/
if 'label' in attributes:
attributes['ent_type'] = attributes.pop('label')
def fix_attributes(doc, attributes):
if 'label' in attributes and 'ent_type' not in attributes:
if isinstance(attributes['label'], int):
attributes[ENT_TYPE] = attributes['label']
else:
attributes[ENT_TYPE] = doc.vocab.strings[attributes['label']]
if 'ent_type' in attributes:
attributes[ENT_TYPE] = attributes['ent_type']
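
The two helpers above handle the deprecated `label` keyword by mapping it onto `ent_type`. As a rough illustration of that kind of normalization, here is a simplified, self-contained variant that works on plain string keys only. It is not the code from the diff: the name `normalize_merge_attrs` is hypothetical, it returns a copy instead of mutating its argument, and letting an explicit `ent_type` win over `label` is a choice of this sketch.

```python
def normalize_merge_attrs(attributes):
    """Return a copy of `attributes` with 'label' folded into 'ent_type'."""
    attrs = dict(attributes)
    if 'label' in attrs:
        label = attrs.pop('label')
        # Assumption of this sketch: an explicit 'ent_type' takes precedence.
        attrs.setdefault('ent_type', label)
    return attrs


original = {'label': 'GPE', 'tag': 'NNP'}
normalized = normalize_merge_attrs(original)
print(normalized)
```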

View File

@ -161,10 +161,12 @@ cdef class Token:
elif hasattr(other, 'orth'): elif hasattr(other, 'orth'):
if self.c.lex.orth == other.orth: if self.c.lex.orth == other.orth:
return 1.0 return 1.0
if self.vector_norm == 0 or other.vector_norm == 0: self_norm = self.vector_norm
other_norm = other.vector_norm
if self_norm == 0 or other_norm == 0:
return 0.0 return 0.0
return (numpy.dot(self.vector, other.vector) / return (numpy.dot(self.vector, other.vector) /
(self.vector_norm * other.vector_norm)) (self_norm * other_norm))
property lex_id: property lex_id:
"""RETURNS (int): Sequential ID of the token's lexical type.""" """RETURNS (int): Sequential ID of the token's lexical type."""
@ -687,7 +689,7 @@ cdef class Token:
property orth_: property orth_:
"""RETURNS (unicode): Verbatim text content (identical to """RETURNS (unicode): Verbatim text content (identical to
`Token.text`). Existst mostly for consistency with the other `Token.text`). Exists mostly for consistency with the other
attributes. attributes.
""" """
def __get__(self): def __get__(self):
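
Review note: the `similarity` change above reads each `vector_norm` once into a local variable, presumably so the property is not evaluated twice per call. A minimal pure-Python sketch of the same pattern follows. It assumes plain lists of floats in place of spaCy vectors, and `cosine_similarity` is a hypothetical helper name, not part of spaCy.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity that computes each norm once and guards against zero."""
    # Compute each norm a single time and reuse it below.
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(x * x for x in vec_b))
    if norm_a == 0 or norm_b == 0:
        # Mirrors the early return in the hunk: no vector, no similarity.
        return 0.0
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    return dot / (norm_a * norm_b)
```

The zero check has to come before the division, otherwise a token without a vector would raise a `ZeroDivisionError`.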

View File

@ -85,7 +85,7 @@
], ],
"V_CSS": "2.2.1", "V_CSS": "2.2.1",
"V_JS": "2.2.2", "V_JS": "2.2.4",
"DEFAULT_SYNTAX": "python", "DEFAULT_SYNTAX": "python",
"ANALYTICS": "UA-58931649-1", "ANALYTICS": "UA-58931649-1",
"MAILCHIMP": { "MAILCHIMP": {

View File

@ -52,10 +52,11 @@ p
+accordion("English", "dependency-parsing-english") +accordion("English", "dependency-parsing-english")
p p
| The English dependency labels use the | The English dependency labels use the
| #[+a("http://www.mathcs.emory.edu/~choi/doc/cu-2012-choi.pdf") CLEAR Style] | #[+a("https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md") CLEAR Style]
| by #[+a("http://www.clearnlp.com") ClearNLP]. | by #[+a("http://www.clearnlp.com") ClearNLP].
+table(["Label", "Description"]) +table(["Label", "Description"])
+dep-row("acl", "clausal modifier of noun (adjectival clause)")
+dep-row("acomp", "adjectival complement") +dep-row("acomp", "adjectival complement")
+dep-row("advcl", "adverbial clause modifier") +dep-row("advcl", "adverbial clause modifier")
+dep-row("advmod", "adverbial modifier") +dep-row("advmod", "adverbial modifier")
@ -65,47 +66,42 @@ p
+dep-row("attr", "attribute") +dep-row("attr", "attribute")
+dep-row("aux", "auxiliary") +dep-row("aux", "auxiliary")
+dep-row("auxpass", "auxiliary (passive)") +dep-row("auxpass", "auxiliary (passive)")
+dep-row("case", "case marking")
+dep-row("cc", "coordinating conjunction") +dep-row("cc", "coordinating conjunction")
+dep-row("ccomp", "clausal complement") +dep-row("ccomp", "clausal complement")
+dep-row("complm", "complementizer") +dep-row("compound", "compound")
+dep-row("conj", "conjunct") +dep-row("conj", "conjunct")
+dep-row("cop", "copula") +dep-row("cop", "copula")
+dep-row("csubj", "clausal subject") +dep-row("csubj", "clausal subject")
+dep-row("csubjpass", "clausal subject (passive)") +dep-row("csubjpass", "clausal subject (passive)")
+dep-row("dative", "dative")
+dep-row("dep", "unclassified dependent") +dep-row("dep", "unclassified dependent")
+dep-row("det", "determiner") +dep-row("det", "determiner")
+dep-row("dobj", "direct object") +dep-row("dobj", "direct object")
+dep-row("expl", "expletive") +dep-row("expl", "expletive")
+dep-row("hmod", "modifier in hyphenation")
+dep-row("hyph", "hyphen")
+dep-row("infmod", "infinitival modifier")
+dep-row("intj", "interjection") +dep-row("intj", "interjection")
+dep-row("iobj", "indirect object")
+dep-row("mark", "marker") +dep-row("mark", "marker")
+dep-row("meta", "meta modifier") +dep-row("meta", "meta modifier")
+dep-row("neg", "negation modifier") +dep-row("neg", "negation modifier")
+dep-row("nmod", "modifier of nominal")
+dep-row("nn", "noun compound modifier") +dep-row("nn", "noun compound modifier")
+dep-row("npadvmod", "noun phrase as adverbial modifier") +dep-row("nounmod", "modifier of nominal")
+dep-row("npmod", "noun phrase as adverbial modifier")
+dep-row("nsubj", "nominal subject") +dep-row("nsubj", "nominal subject")
+dep-row("nsubjpass", "nominal subject (passive)") +dep-row("nsubjpass", "nominal subject (passive)")
+dep-row("num", "number modifier") +dep-row("nummod", "numeric modifier")
+dep-row("number", "number compound modifier")
+dep-row("oprd", "object predicate") +dep-row("oprd", "object predicate")
+dep-row("obj", "object") +dep-row("obj", "object")
+dep-row("obl", "oblique nominal") +dep-row("obl", "oblique nominal")
+dep-row("parataxis", "parataxis") +dep-row("parataxis", "parataxis")
+dep-row("partmod", "participal modifier")
+dep-row("pcomp", "complement of preposition") +dep-row("pcomp", "complement of preposition")
+dep-row("pobj", "object of preposition") +dep-row("pobj", "object of preposition")
+dep-row("poss", "possession modifier") +dep-row("poss", "possession modifier")
+dep-row("possessive", "possessive modifier")
+dep-row("preconj", "pre-correlative conjunction") +dep-row("preconj", "pre-correlative conjunction")
+dep-row("prep", "prepositional modifier") +dep-row("prep", "prepositional modifier")
+dep-row("prt", "particle") +dep-row("prt", "particle")
+dep-row("punct", "punctuation") +dep-row("punct", "punctuation")
+dep-row("quantmod", "modifier of quantifier") +dep-row("quantmod", "modifier of quantifier")
+dep-row("rcmod", "relative clause modifier") +dep-row("relcl", "relative clause modifier")
+dep-row("root", "root") +dep-row("root", "root")
+dep-row("xcomp", "open clausal complement") +dep-row("xcomp", "open clausal complement")

View File

@ -228,7 +228,7 @@ p
+cell JSON +cell JSON
+cell Data in spaCy's #[+a("/api/annotation#json-input") JSON format]. +cell Data in spaCy's #[+a("/api/annotation#json-input") JSON format].
p The following converters are available: p The following file format converters are available:
+table(["ID", "Description"]) +table(["ID", "Description"])
+row +row

View File

@ -99,6 +99,12 @@ p
| Process texts as a stream, and yield #[code Doc] objects in order. | Process texts as a stream, and yield #[code Doc] objects in order.
| Supports GIL-free multi-threading. | Supports GIL-free multi-threading.
+infobox("Important note for spaCy v2.0.x", "⚠️")
| By default, multiple threads will be launched for matrix multiplication,
| which may be inefficient on multi-core machines. Setting
| #[code OPENBLAS_NUM_THREADS=1] should fix this problem. spaCy v2.1.x
| will be switching to single-thread by default.
+aside-code("Example"). +aside-code("Example").
texts = [u'One document.', u'...', u'Lots of documents'] texts = [u'One document.', u'...', u'Lots of documents']
for doc in nlp.pipe(texts, batch_size=50, n_threads=4): for doc in nlp.pipe(texts, batch_size=50, n_threads=4):
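
Review note: the infobox added above recommends `OPENBLAS_NUM_THREADS=1`. One way to apply it from Python, rather than from the shell, is sketched below. This assumes that OpenBLAS reads the variable when the library is first loaded, so the assignment has to run before numpy or spaCy is imported.

```python
import os

# Assumption: this must run before numpy (and therefore spaCy) is imported,
# because the thread count is read when the BLAS library is loaded.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

# import spacy  # would follow here in a real script
```

Exporting the variable in the shell before starting Python has the same effect and avoids any dependence on import order.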

View File

@ -173,7 +173,7 @@ p The L2 norm of the lexeme's vector representation.
+cell #[code orth_] +cell #[code orth_]
+cell unicode +cell unicode
+cell +cell
| Verbatim text content (identical to #[code Lexeme.text]). Existst | Verbatim text content (identical to #[code Lexeme.text]). Exists
| mostly for consistency with the other attributes. | mostly for consistency with the other attributes.
+row +row

View File

@ -544,7 +544,7 @@ p The L2 norm of the token's vector representation.
+cell #[code orth_] +cell #[code orth_]
+cell unicode +cell unicode
+cell +cell
| Verbatim text content (identical to #[code Token.text]). Existst | Verbatim text content (identical to #[code Token.text]). Exists
| mostly for consistency with the other attributes. | mostly for consistency with the other attributes.
+row +row

View File

@ -49,6 +49,7 @@ import initUniverse from './universe.vue.js';
if (window.Juniper) { if (window.Juniper) {
new Juniper({ new Juniper({
repo: 'ines/spacy-io-binder', repo: 'ines/spacy-io-binder',
branch: 'live',
storageExpire: 60 storageExpire: 60
}); });
} }

View File

@ -10,6 +10,7 @@ export default function(repo) {
'CC BY-SA 4.0': 'https://creativecommons.org/licenses/by-sa/4.0/', 'CC BY-SA 4.0': 'https://creativecommons.org/licenses/by-sa/4.0/',
'CC BY-NC': 'https://creativecommons.org/licenses/by-nc/3.0/', 'CC BY-NC': 'https://creativecommons.org/licenses/by-nc/3.0/',
'CC BY-NC 3.0': 'https://creativecommons.org/licenses/by-nc/3.0/', 'CC BY-NC 3.0': 'https://creativecommons.org/licenses/by-nc/3.0/',
'CC BY-NC 4.0': 'https://creativecommons.org/licenses/by-nc/4.0/',
'CC-BY-NC-SA 3.0': 'https://creativecommons.org/licenses/by-nc-sa/3.0/', 'CC-BY-NC-SA 3.0': 'https://creativecommons.org/licenses/by-nc-sa/3.0/',
'GPL': 'https://www.gnu.org/licenses/gpl.html', 'GPL': 'https://www.gnu.org/licenses/gpl.html',
'LGPL': 'https://www.gnu.org/licenses/lgpl.html', 'LGPL': 'https://www.gnu.org/licenses/lgpl.html',
@ -33,8 +34,8 @@ export default function(repo) {
version: 'n/a', version: 'n/a',
notes: null, notes: null,
sizeFull: null, sizeFull: null,
fullName: null,
pipeline: null, pipeline: null,
releaseUrl: null,
description: null, description: null,
license: null, license: null,
author: null, author: null,
@ -61,6 +62,10 @@ export default function(repo) {
}, },
hasAccuracy() { hasAccuracy() {
return this.uas || this.las || this.tags_acc || this.ents_f || this.ents_p || this.ents_r; return this.uas || this.las || this.tags_acc || this.ents_f || this.ents_p || this.ents_r;
},
releaseUrl() {
const baseUrl = `${this.repo}/releases`;
return `${baseUrl}${this.fullName ? `/tag/${this.fullName}` : ''}`;
} }
}, },
beforeMount() { beforeMount() {
@ -78,9 +83,8 @@ export default function(repo) {
}, },
methods: { methods: {
$_updateData(data) { $_updateData(data) {
const fullName = `${data.lang}_${data.name}-${data.version}`; this.fullName = `${data.lang}_${data.name}-${data.version}`;
this.version = data.version; this.version = data.version;
this.releaseUrl = `${this.repo}/releases/tag/${fullName}`;
this.sizeFull = data.size; this.sizeFull = data.size;
this.pipeline = data.pipeline; this.pipeline = data.pipeline;
this.notes = data.notes; this.notes = data.notes;
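
Review note: the `releaseUrl` computed property introduced in this file links to a specific release tag when `fullName` is known and to the releases index otherwise. A Python sketch of that fallback is given below. The function name `release_url` and the repository URL in the usage lines are illustrative assumptions, and the sketch joins the parts so that no doubled slash appears.

```python
def release_url(repo, full_name=None):
    """Return the tag URL if a full model name is known, else the releases index."""
    base_url = f"{repo}/releases"
    if full_name:
        return f"{base_url}/tag/{full_name}"
    return base_url


# Example values only; the repository URL is an assumption for illustration.
repo = "https://github.com/explosion/spacy-models"
tagged = release_url(repo, "en_core_web_sm-2.0.0")
index = release_url(repo)
```

Deriving the URL from `fullName` on demand means the link stays valid even before `$_updateData` has run.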

View File

@ -352,7 +352,6 @@ p
span.merge() # merge span.merge() # merge
for token in span: for token in span:
token._.bad_html = True # mark token as bad HTML token._.bad_html = True # mark token as bad HTML
doc.vocab[token.text].is_stop = True # mark lexeme as stop word
return doc return doc
nlp = spacy.load('en_core_web_sm') nlp = spacy.load('en_core_web_sm')

View File

@ -100,7 +100,7 @@ p
| containing the models for each installation, you can now choose | containing the models for each installation, you can now choose
| #[strong how and where you want to keep your data]. For example, you could | #[strong how and where you want to keep your data]. For example, you could
| download all models manually and put them into a local directory. | download all models manually and put them into a local directory.
| Whenever your spaCy projects need a models, you create a shortcut link to | Whenever your spaCy projects need a model, you create a shortcut link to
| tell spaCy to load it from there. This means you'll never end up with | tell spaCy to load it from there. This means you'll never end up with
| duplicate data. | duplicate data.

View File

@ -3,7 +3,7 @@
p p
| spaCy makes it very easy to create your own pipelines consisting of | spaCy makes it very easy to create your own pipelines consisting of
| reusable components. This includes spaCy's default tagger, | reusable components. This includes spaCy's default tagger,
| parser and entity regcognizer, but also your own custom processing | parser and entity recognizer, but also your own custom processing
| functions. A pipeline component can be added to an already existing | functions. A pipeline component can be added to an already existing
| #[code nlp] object, specified when initialising a #[code Language] class, | #[code nlp] object, specified when initialising a #[code Language] class,
| or defined within a | or defined within a

View File

@ -12,10 +12,10 @@ p
from spacy import displacy from spacy import displacy
doc = nlp(u'Rats are various medium-sized, long-tailed rodents.') doc = nlp(u'Rats are various medium-sized, long-tailed rodents.')
displacy.render(doc, style='dep') displacy.render(doc, style='dep', jupyter=True)
doc2 = nlp(LONG_NEWS_ARTICLE) doc2 = nlp(LONG_NEWS_ARTICLE)
displacy.render(doc2, style='ent') displacy.render(doc2, style='ent', jupyter=True)
+aside("Enabling or disabling Jupyter mode") +aside("Enabling or disabling Jupyter mode")
| To explicitly enable or disable "Jupyter mode", you can use the | To explicitly enable or disable "Jupyter mode", you can use the