Merge branch 'master' into feature/nel-wiki

This commit is contained in:
svlandeg 2019-06-03 09:35:10 +02:00
commit d83a1e3052
97 changed files with 3825 additions and 435 deletions

106
.github/contributors/BreakBB.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Björn Böing |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 15.04.2019 |
| GitHub username | BreakBB |
| Website (optional) | |

106
.github/contributors/Dobita21.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Nattapol |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 18.04.2019 |
| GitHub username | Dobita21 |
| Website (optional) | |

106
.github/contributors/F0rge1cE.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Icarus Xu |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 05/06/2019 |
| GitHub username | F0rge1cE |
| Website (optional) | |

106
.github/contributors/NirantK.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Nirant Kasliwal |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | |
| GitHub username | NirantK |
| Website (optional) | https://nirantk.com |

106
.github/contributors/aaronkub.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Aaron Kub |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-05-09 |
| GitHub username | aaronkub |
| Website (optional) | |

106
.github/contributors/amitness.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Amit Chaudhary |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | April 29, 2019 |
| GitHub username | amitness |
| Website (optional) | https://amitness.com |

106
.github/contributors/bjascob.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Brad Jascob |
| Company name (if applicable) | n/a |
| Title or role (if applicable) | Software Engineer |
| Date | 04/25/2019 |
| GitHub username | bjascob |
| Website (optional) | n/a |

106
.github/contributors/bryant1410.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Santiago Castro |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-04-09 |
| GitHub username | bryant1410 |
| Website (optional) | |

106
.github/contributors/celikomer.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Omer Celik |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 04/11/2019 |
| GitHub username | celikomer |
| Website (optional) | www.ocelik.com |

106
.github/contributors/estr4ng7d.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Amey Baviskar |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 21-May-2019 |
| GitHub username | estr4ng7d |
| Website (optional) | |

106
.github/contributors/fizban99.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | A.I.M. |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 16.04.2019 |
| GitHub username | fizban99 |
| Website (optional) | |

106
.github/contributors/henry860916.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Henry Zhang |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-04-30 |
| GitHub username | henry860916 |
| Website (optional) | |

106
.github/contributors/ldorigo.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Luca Dorigo |
| Company name (if applicable) | / |
| Title or role (if applicable) | / |
| Date | 08.05.2019 |
| GitHub username | ldorigo |
| Website (optional) | / |

106
.github/contributors/munozbravo.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Germán Muñoz |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-06-01 |
| GitHub username | munozbravo |
| Website (optional) | |

106
.github/contributors/nipunsadvilkar.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Nipun Sadvilkar |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 31st May, 2019 |
| GitHub username | nipunsadvilkar|
| Website (optional) |https://nipunsadvilkar.github.io/|

106
.github/contributors/pickfire.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Ivan Tham Jun Hoe |
| Company name (if applicable) | Semut |
| Title or role (if applicable) | Data Analyst |
| Date | Apr 11, 2019 |
| GitHub username | pickfire |
| Website (optional) | https://pickfire.tk |

View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Richard Paul Hudson |
| Company name (if applicable) | msg systems ag |
| Title or role (if applicable) | Principal IT Consultant|
| Date | 06. May 2019 |
| GitHub username | richardpaulhudson |
| Website (optional) | |

106
.github/contributors/ujwal-narayan.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Ujwal Narayan |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 17/05/2019 |
| GitHub username | ujwal-narayan |
| Website (optional) | |

106
.github/contributors/xssChauhan.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Shikhar Chauhan |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 12/11/2019 |
| GitHub username | xssChauhan |
| Website (optional) | |

106
.github/contributors/yaph.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Ramiro Gómez |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-04-29 |
| GitHub username | yaph |
| Website (optional) | http://ramiro.org/ |

View File

@ -447,17 +447,7 @@ use the `get_doc()` utility function to construct it manually.
## Updating the website
Our [website and docs](https://spacy.io) are implemented in
[Jade/Pug](https://www.jade-lang.org), and built or served by
[Harp](https://harpjs.com). Jade/Pug is an extensible templating language with a
readable syntax, that compiles to HTML. Here's how to view the site locally:
```bash
sudo npm install --global harp
git clone https://github.com/explosion/spaCy
cd spaCy/website
harp server
```
For instructions on how to build and run the [website](https://spacy.io) locally see **[Setup and installation](https://github.com/explosion/spaCy/blob/master/website/README.md#setup-and-installation-setup)** in the *website* directory's README.
The docs can always use another example or more detail, and they should always
be up to date and not misleading. To quickly find the correct file to edit,

View File

@ -6,11 +6,10 @@ spaCy is a library for advanced Natural Language Processing in Python and
Cython. It's built on the very latest research, and was designed from day one
to be used in real products. spaCy comes with
[pre-trained statistical models](https://spacy.io/models) and word vectors, and
currently supports tokenization for **45+ languages**. It features the
**fastest syntactic parser** in the world, convolutional
**neural network models** for tagging, parsing and **named entity recognition**
and easy **deep learning** integration. It's commercial open-source software,
released under the MIT license.
currently supports tokenization for **49+ languages**. It features
state-of-the-art speed, convolutional **neural network models** for tagging,
parsing and **named entity recognition** and easy **deep learning** integration.
It's commercial open-source software, released under the MIT license.
💫 **Version 2.1 out now!** [Check out the release notes here.](https://github.com/explosion/spaCy/releases)
@ -66,11 +65,11 @@ valuable if it's shared publicly, so that more people can benefit from it.
## Features
- **Fastest syntactic parser** in the world
- **Named entity** recognition
- Non-destructive **tokenization**
- Support for **45+ languages**
- **Named entity** recognition
- Support for **49+ languages**
- Pre-trained [statistical models](https://spacy.io/models) and word vectors
- State-of-the-art speed
- Easy **deep learning** integration
- Part-of-speech tagging
- Labelled dependency parsing
@ -80,7 +79,6 @@ valuable if it's shared publicly, so that more people can benefit from it.
- Export to numpy data arrays
- Efficient binary serialization
- Easy **model packaging** and deployment
- State-of-the-art speed
- Robust, rigorously evaluated accuracy
📖 **For more details, see the

View File

@ -16,4 +16,4 @@ version=${version/\'/}
version=${version/\"/}
version=${version/\"/}
git tag "v$version"
git push origin "v$version" --tags
git push origin "v$version"

View File

@ -36,11 +36,27 @@ def main(model="en_core_web_sm"):
print("{:<10}\t{}\t{}".format(r1.text, r2.ent_type_, r2.text))
def filter_spans(spans):
# Filter a sequence of spans so they don't contain overlaps
get_sort_key = lambda span: (span.end - span.start, span.start)
sorted_spans = sorted(spans, key=get_sort_key, reverse=True)
result = []
seen_tokens = set()
for span in sorted_spans:
if span.start not in seen_tokens and span.end - 1 not in seen_tokens:
result.append(span)
seen_tokens.update(range(span.start, span.end))
return result
def extract_currency_relations(doc):
# merge entities and noun chunks into one token
# Merge entities and noun chunks into one token
seen_tokens = set()
spans = list(doc.ents) + list(doc.noun_chunks)
for span in spans:
span.merge()
spans = filter_spans(spans)
with doc.retokenize() as retokenizer:
for span in spans:
retokenizer.merge(span)
relations = []
for money in filter(lambda w: w.ent_type_ == "MONEY", doc):

View File

@ -9,9 +9,10 @@ srsly>=0.0.5,<1.1.0
# Third party dependencies
numpy>=1.15.0
requests>=2.13.0,<3.0.0
jsonschema>=2.6.0,<3.0.0
plac<1.0.0,>=0.9.6
pathlib==1.0.1; python_version < "3.4"
# Optional dependencies
jsonschema>=2.6.0,<3.1.0
# Development dependencies
cython>=0.25
pytest>=4.0.0,<4.1.0

View File

@ -209,7 +209,7 @@ def setup_package():
generate_cython(root, "spacy")
setup(
name=about["__title__"],
name="spacy",
zip_safe=False,
packages=PACKAGES,
package_data=PACKAGE_DATA,
@ -232,7 +232,6 @@ def setup_package():
"blis>=0.2.2,<0.3.0",
"plac<1.0.0,>=0.9.6",
"requests>=2.13.0,<3.0.0",
"jsonschema>=2.6.0,<3.0.0",
"wasabi>=0.2.0,<1.1.0",
"srsly>=0.0.5,<1.1.0",
'pathlib==1.0.1; python_version < "3.4"',

View File

@ -4,13 +4,13 @@
# fmt: off
__title__ = "spacy"
__version__ = "2.1.3"
__version__ = "2.1.4"
__summary__ = "Industrial-strength Natural Language Processing (NLP) with Python and Cython"
__uri__ = "https://spacy.io"
__author__ = "Explosion AI"
__email__ = "contact@explosion.ai"
__license__ = "MIT"
__release__ = True
__release__ = False
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"

View File

@ -39,7 +39,7 @@ FILE_TYPES_STDOUT = ("json", "jsonl")
def convert(
input_file,
output_dir="-",
file_type="jsonl",
file_type="json",
n_sents=1,
morphology=False,
converter="auto",
@ -48,8 +48,8 @@ def convert(
"""
Convert files into JSON format for use with train command and other
experiment management functions. If no output_dir is specified, the data
is written to stdout, so you can pipe them forward to a JSONL file:
$ spacy convert some_file.conllu > some_file.jsonl
is written to stdout, so you can pipe them forward to a JSON file:
$ spacy convert some_file.conllu > some_file.json
"""
msg = Printer()
input_path = Path(input_file)

View File

@ -11,14 +11,8 @@ def iob2json(input_data, n_sents=10, *args, **kwargs):
"""
Convert IOB files into JSON format for use with train cli.
"""
docs = []
for group in minibatch(docs, n_sents):
group = list(group)
first = group.pop(0)
to_extend = first["paragraphs"][0]["sentences"]
for sent in group[1:]:
to_extend.extend(sent["paragraphs"][0]["sentences"])
docs.append(first)
sentences = read_iob(input_data.split("\n"))
docs = merge_sentences(sentences, n_sents)
return docs
@ -27,7 +21,6 @@ def read_iob(raw_sents):
for line in raw_sents:
if not line.strip():
continue
# tokens = [t.split("|") for t in line.split()]
tokens = [re.split("[^\w\-]", line.strip())]
if len(tokens[0]) == 3:
words, pos, iob = zip(*tokens)
@ -49,3 +42,15 @@ def read_iob(raw_sents):
paragraphs = [{"sentences": [sent]} for sent in sentences]
docs = [{"id": 0, "paragraphs": [para]} for para in paragraphs]
return docs
def merge_sentences(docs, n_sents):
merged = []
for group in minibatch(docs, size=n_sents):
group = list(group)
first = group.pop(0)
to_extend = first["paragraphs"][0]["sentences"]
for sent in group[1:]:
to_extend.extend(sent["paragraphs"][0]["sentences"])
merged.append(first)
return merged

View File

@ -17,6 +17,7 @@ from .. import displacy
gpu_id=("Use GPU", "option", "g", int),
displacy_path=("Directory to output rendered parses as HTML", "option", "dp", str),
displacy_limit=("Limit of parses to render as HTML", "option", "dl", int),
return_scores=("Return dict containing model scores", "flag", "R", bool),
)
def evaluate(
model,
@ -25,6 +26,7 @@ def evaluate(
gold_preproc=False,
displacy_path=None,
displacy_limit=25,
return_scores=False,
):
"""
Evaluate a model. To render a sample of parses in a HTML file, set an
@ -75,6 +77,8 @@ def evaluate(
ents=render_ents,
)
msg.good("Generated {} parses as HTML".format(displacy_limit), displacy_path)
if return_scores:
return scorer.scores
def render_parses(docs, output_path, model_name="", limit=250, deps=True, ents=True):

View File

@ -181,7 +181,7 @@ def read_vectors(vectors_loc):
vectors_keys = []
for i, line in enumerate(tqdm(f)):
line = line.rstrip()
pieces = line.rsplit(" ", vectors_data.shape[1] + 1)
pieces = line.rsplit(" ", vectors_data.shape[1])
word = pieces.pop(0)
if len(pieces) != vectors_data.shape[1]:
msg.fail(Errors.E094.format(line_num=i, loc=vectors_loc), exits=1)

View File

@ -34,7 +34,8 @@ from .. import util
max_length=("Max words per example.", "option", "xw", int),
min_length=("Min words per example.", "option", "nw", int),
seed=("Seed for random number generators", "option", "s", float),
nr_iter=("Number of iterations to pretrain", "option", "i", int),
n_iter=("Number of iterations to pretrain", "option", "i", int),
n_save_every=("Save model every X batches.", "option", "se", int),
)
def pretrain(
texts_loc,
@ -46,11 +47,12 @@ def pretrain(
loss_func="cosine",
use_vectors=False,
dropout=0.2,
nr_iter=1000,
n_iter=1000,
batch_size=3000,
max_length=500,
min_length=5,
seed=0,
n_save_every=None,
):
"""
Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components,
@ -115,9 +117,26 @@ def pretrain(
msg.divider("Pre-training tok2vec layer")
row_settings = {"widths": (3, 10, 10, 6, 4), "aligns": ("r", "r", "r", "r", "r")}
msg.row(("#", "# Words", "Total Loss", "Loss", "w/s"), **row_settings)
for epoch in range(nr_iter):
for batch in util.minibatch_by_words(
((text, None) for text in texts), size=batch_size
def _save_model(epoch, is_temp=False):
is_temp_str = ".temp" if is_temp else ""
with model.use_params(optimizer.averages):
with (output_dir / ("model%d%s.bin" % (epoch, is_temp_str))).open(
"wb"
) as file_:
file_.write(model.tok2vec.to_bytes())
log = {
"nr_word": tracker.nr_word,
"loss": tracker.loss,
"epoch_loss": tracker.epoch_loss,
"epoch": epoch,
}
with (output_dir / "log.jsonl").open("a") as file_:
file_.write(srsly.json_dumps(log) + "\n")
for epoch in range(n_iter):
for batch_id, batch in enumerate(
util.minibatch_by_words(((text, None) for text in texts), size=batch_size)
):
docs = make_docs(
nlp,
@ -133,17 +152,9 @@ def pretrain(
msg.row(progress, **row_settings)
if texts_loc == "-" and tracker.words_per_epoch[epoch] >= 10 ** 7:
break
with model.use_params(optimizer.averages):
with (output_dir / ("model%d.bin" % epoch)).open("wb") as file_:
file_.write(model.tok2vec.to_bytes())
log = {
"nr_word": tracker.nr_word,
"loss": tracker.loss,
"epoch_loss": tracker.epoch_loss,
"epoch": epoch,
}
with (output_dir / "log.jsonl").open("a") as file_:
file_.write(srsly.json_dumps(log) + "\n")
if n_save_every and (batch_id % n_save_every == 0):
_save_model(epoch, is_temp=True)
_save_model(epoch)
tracker.epoch_loss = 0.0
if texts_loc != "-":
# Reshuffle the texts if texts were loaded from a file
@ -170,10 +181,10 @@ def make_update(model, docs, optimizer, drop=0.0, objective="L2"):
def make_docs(nlp, batch, min_length, max_length):
docs = []
for record in batch:
text = record["text"]
if "tokens" in record:
doc = Doc(nlp.vocab, words=record["tokens"])
else:
text = record["text"]
doc = nlp.make_doc(text)
if "heads" in record:
heads = record["heads"]

View File

@ -16,6 +16,7 @@ import random
from .._ml import create_default_optimizer
from ..attrs import PROB, IS_OOV, CLUSTER, LANG
from ..gold import GoldCorpus
from ..compat import path2str
from .. import util
from .. import about
@ -35,6 +36,12 @@ from .. import about
pipeline=("Comma-separated names of pipeline components", "option", "p", str),
vectors=("Model to load vectors from", "option", "v", str),
n_iter=("Number of iterations", "option", "n", int),
n_early_stopping=(
"Maximum number of training epochs without dev accuracy improvement",
"option",
"ne",
int,
),
n_examples=("Number of examples", "option", "ns", int),
use_gpu=("Use GPU", "option", "g", int),
version=("Model version", "option", "V", str),
@ -74,6 +81,7 @@ def train(
pipeline="tagger,parser,ner",
vectors=None,
n_iter=30,
n_early_stopping=None,
n_examples=0,
use_gpu=-1,
version="0.0.0",
@ -101,6 +109,7 @@ def train(
train_path = util.ensure_path(train_path)
dev_path = util.ensure_path(dev_path)
meta_path = util.ensure_path(meta_path)
output_path = util.ensure_path(output_path)
if raw_text is not None:
raw_text = list(srsly.read_jsonl(raw_text))
if not train_path or not train_path.exists():
@ -222,6 +231,8 @@ def train(
msg.row(row_head, **row_settings)
msg.row(["-" * width for width in row_settings["widths"]], **row_settings)
try:
iter_since_best = 0
best_score = 0.0
for i in range(n_iter):
train_docs = corpus.train_docs(
nlp, noise_level=noise_level, gold_preproc=gold_preproc, max_length=0
@ -276,7 +287,9 @@ def train(
gpu_wps = nwords / (end_time - start_time)
with Model.use_device("cpu"):
nlp_loaded = util.load_model_from_path(epoch_model_path)
nlp_loaded.parser.cfg["beam_width"]
for name, component in nlp_loaded.pipeline:
if hasattr(component, "cfg"):
component.cfg["beam_width"] = beam_width
dev_docs = list(
corpus.dev_docs(nlp_loaded, gold_preproc=gold_preproc)
)
@ -328,6 +341,24 @@ def train(
gpu_wps=gpu_wps,
)
msg.row(progress, **row_settings)
# Early stopping
if n_early_stopping is not None:
current_score = _score_for_model(meta)
if current_score < best_score:
iter_since_best += 1
else:
iter_since_best = 0
best_score = current_score
if iter_since_best >= n_early_stopping:
msg.text(
"Early stopping, best iteration "
"is: {}".format(i - iter_since_best)
)
msg.text(
"Best score = {}; Final iteration "
"score = {}".format(best_score, current_score)
)
break
finally:
with nlp.use_params(optimizer.averages):
final_model_path = output_path / "model-final"
@ -338,6 +369,20 @@ def train(
msg.good("Created best model", best_model_path)
def _score_for_model(meta):
""" Returns mean score between tasks in pipeline that can be used for early stopping. """
mean_acc = list()
pipes = meta["pipeline"]
acc = meta["accuracy"]
if "tagger" in pipes:
mean_acc.append(acc["tags_acc"])
if "parser" in pipes:
mean_acc.append((acc["uas"] + acc["las"]) / 2)
if "ner" in pipes:
mean_acc.append((acc["ents_p"] + acc["ents_r"] + acc["ents_f"]) / 3)
return sum(mean_acc) / len(mean_acc)
@contextlib.contextmanager
def _create_progress_bar(total):
if int(os.environ.get("LOG_FRIENDLY", 0)):
@ -379,10 +424,12 @@ def _collate_best_model(meta, output_path, components):
for component in components:
bests[component] = _find_best(output_path, component)
best_dest = output_path / "model-best"
shutil.copytree(output_path / "model-final", best_dest)
shutil.copytree(path2str(output_path / "model-final"), path2str(best_dest))
for component, best_component_src in bests.items():
shutil.rmtree(best_dest / component)
shutil.copytree(best_component_src / component, best_dest / component)
shutil.rmtree(path2str(best_dest / component))
shutil.copytree(
path2str(best_component_src / component), path2str(best_dest / component)
)
accs = srsly.read_json(best_component_src / "accuracy.json")
for metric in _get_metrics(component):
meta["accuracy"][metric] = accs[metric]

View File

@ -92,7 +92,9 @@ def symlink_to(orig, dest):
if is_windows:
import subprocess
subprocess.call(["mklink", "/d", path2str(orig), path2str(dest)], shell=True)
subprocess.check_call(
["mklink", "/d", path2str(orig), path2str(dest)], shell=True
)
else:
orig.symlink_to(dest)

View File

@ -19,7 +19,7 @@ RENDER_WRAPPER = None
def render(
docs, style="dep", page=False, minify=False, jupyter=False, options={}, manual=False
docs, style="dep", page=False, minify=False, jupyter=None, options={}, manual=False
):
"""Render displaCy visualisation.
@ -27,7 +27,7 @@ def render(
style (unicode): Visualisation style, 'dep' or 'ent'.
page (bool): Render markup as full HTML page.
minify (bool): Minify HTML markup.
jupyter (bool): Experimental, use Jupyter's `display()` to output markup.
jupyter (bool): Override Jupyter auto-detection.
options (dict): Visualiser-specific options, e.g. colors.
manual (bool): Don't parse `Doc` and instead expect a dict/list of dicts.
RETURNS (unicode): Rendered HTML markup.
@ -53,7 +53,8 @@ def render(
html = _html["parsed"]
if RENDER_WRAPPER is not None:
html = RENDER_WRAPPER(html)
if jupyter or is_in_jupyter(): # return HTML rendered by IPython display()
if jupyter or (jupyter is None and is_in_jupyter()):
# return HTML rendered by IPython display()
from IPython.core.display import display, HTML
return display(HTML(html))

View File

@ -141,8 +141,14 @@ class Errors(object):
E023 = ("Error cleaning up beam: The same state occurred twice at "
"memory address {addr} and position {i}.")
E024 = ("Could not find an optimal move to supervise the parser. Usually, "
"this means the GoldParse was not correct. For example, are all "
"labels added to the model?")
"this means that the model can't be updated in a way that's valid "
"and satisfies the correct annotations specified in the GoldParse. "
"For example, are all labels added to the model? If you're "
"training a named entity recognizer, also make sure that none of "
"your annotated entity spans have leading or trailing whitespace. "
"You can also use the experimental `debug-data` command to "
"validate your JSON-formatted training data. For details, run:\n"
"python -m spacy debug-data --help")
E025 = ("String is too long: {length} characters. Max is 2**30.")
E026 = ("Error accessing token at position {i}: out of bounds in Doc of "
"length {length}.")
@ -383,6 +389,10 @@ class Errors(object):
E133 = ("The sum of prior probabilities for alias '{alias}' should not exceed 1, "
"but found {sum}.")
E134 = ("Alias '{alias}' defined for unknown entity '{entity}'.")
E135 = ("If you meant to replace a built-in component, use `create_pipe`: "
"`nlp.replace_pipe('{name}', nlp.create_pipe('{name}'))`")
E136 = ("This additional feature requires the jsonschema library to be "
"installed:\npip install jsonschema")
@add_codes

View File

@ -168,6 +168,7 @@ GLOSSARY = {
# Dependency Labels (English)
# ClearNLP / Universal Dependencies
# https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md
"acl": "clausal modifier of noun (adjectival clause)",
"acomp": "adjectival complement",
"advcl": "adverbial clause modifier",
"advmod": "adverbial modifier",
@ -177,22 +178,32 @@ GLOSSARY = {
"attr": "attribute",
"aux": "auxiliary",
"auxpass": "auxiliary (passive)",
"case": "case marking",
"cc": "coordinating conjunction",
"ccomp": "clausal complement",
"clf": "classifier",
"complm": "complementizer",
"compound": "compound",
"conj": "conjunct",
"cop": "copula",
"csubj": "clausal subject",
"csubjpass": "clausal subject (passive)",
"dative": "dative",
"dep": "unclassified dependent",
"det": "determiner",
"discourse": "discourse element",
"dislocated": "dislocated elements",
"dobj": "direct object",
"expl": "expletive",
"fixed": "fixed multiword expression",
"flat": "flat multiword expression",
"goeswith": "goes with",
"hmod": "modifier in hyphenation",
"hyph": "hyphen",
"infmod": "infinitival modifier",
"intj": "interjection",
"iobj": "indirect object",
"list": "list",
"mark": "marker",
"meta": "meta modifier",
"neg": "negation modifier",
@ -201,11 +212,15 @@ GLOSSARY = {
"npadvmod": "noun phrase as adverbial modifier",
"nsubj": "nominal subject",
"nsubjpass": "nominal subject (passive)",
"nounmod": "modifier of nominal",
"npmod": "noun phrase as adverbial modifier",
"num": "number modifier",
"number": "number compound modifier",
"nummod": "numeric modifier",
"oprd": "object predicate",
"obj": "object",
"obl": "oblique nominal",
"orphan": "orphan",
"parataxis": "parataxis",
"partmod": "participal modifier",
"pcomp": "complement of preposition",
@ -218,7 +233,10 @@ GLOSSARY = {
"punct": "punctuation",
"quantmod": "modifier of quantifier",
"rcmod": "relative clause modifier",
"relcl": "relative clause modifier",
"reparandum": "overridden disfluency",
"root": "root",
"vocative": "vocative",
"xcomp": "open clausal complement",
# Dependency labels (German)
# TIGER Treebank

View File

@ -532,7 +532,7 @@ cdef class GoldParse:
self.labels[i] = deps[i2j_multi[i]]
# Now set NER...This is annoying because if we've split
# got an entity word split into two, we need to adjust the
# BILOU tags. We can't have BB or LL etc.
# BILUO tags. We can't have BB or LL etc.
# Case 1: O -- easy.
ner_tag = entities[i2j_multi[i]]
if ner_tag == "O":

View File

@ -5,8 +5,8 @@ from __future__ import unicode_literals
STOP_WORDS = set(
"""
á a ab aber ach acht achte achten achter achtes ag alle allein allem allen
aller allerdings alles allgemeinen als also am an andere anderen andern anders
auch auf aus ausser außer ausserdem außerdem
aller allerdings alles allgemeinen als also am an andere anderen anderem andern
anders auch auf aus ausser außer ausserdem außerdem
bald bei beide beiden beim beispiel bekannt bereits besonders besser besten bin
bis bisher bist
@ -35,8 +35,8 @@ großen grosser großer grosses großes gut gute guter gutes
habe haben habt hast hat hatte hätte hatten hätten heisst heißt her heute hier
hin hinter hoch
ich ihm ihn ihnen ihr ihre ihrem ihrer ihres im immer in indem infolgedessen
ins irgend ist
ich ihm ihn ihnen ihr ihre ihrem ihren ihrer ihres im immer in indem
infolgedessen ins irgend ist
ja jahr jahre jahren je jede jedem jeden jeder jedermann jedermanns jedoch
jemand jemandem jemanden jene jenem jenen jener jenes jetzt

View File

@ -39,7 +39,7 @@ made make many may me meanwhile might mine more moreover most mostly move much
must my myself
name namely neither never nevertheless next nine no nobody none noone nor not
nothing now nowhere
nothing now nowhere
of off often on once one only onto or other others otherwise our ours ourselves
out over own
@ -75,4 +75,3 @@ STOP_WORDS.update(contractions)
for apostrophe in ["", ""]:
for stopword in contractions:
STOP_WORDS.add(stopword.replace("'", apostrophe))

View File

@ -4,6 +4,7 @@ from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .tag_map import TAG_MAP
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .lemmatizer import LOOKUP
from .syntax_iterators import SYNTAX_ITERATORS
@ -16,6 +17,7 @@ from ...util import update_exc, add_lookups
class SpanishDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: "es"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS

View File

@ -0,0 +1,59 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = [
"cero",
"uno",
"dos",
"tres",
"cuatro",
"cinco",
"seis",
"siete",
"ocho",
"nueve",
"diez",
"once",
"doce",
"trece",
"catorce",
"quince",
"dieciséis",
"diecisiete",
"dieciocho",
"diecinueve",
"veinte",
"treinta",
"cuarenta",
"cincuenta",
"sesenta",
"setenta",
"ochenta",
"noventa",
"cien",
"mil",
"millón",
"billón",
"trillón",
]
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}

View File

@ -11,9 +11,9 @@ Example sentences to test spaCy and its language models.
sentences = [
"Apple cherche a acheter une startup anglaise pour 1 milliard de dollard",
"Les voitures autonomes voient leur assurances décalées vers les constructeurs",
"San Francisco envisage d'interdire les robots coursiers",
"Apple cherche à acheter une startup anglaise pour 1 milliard de dollars",
"Les voitures autonomes déplacent la responsabilité de l'assurance vers les constructeurs",
"San Francisco envisage d'interdire les robots coursiers sur les trottoirs",
"Londres est une grande ville du Royaume-Uni",
"LItalie choisit ArcelorMittal pour reprendre la plus grande aciérie dEurope",
"Apple lance HomePod parce qu'il se sent menacé par l'Echo d'Amazon",

View File

@ -7,88 +7,89 @@ from ...symbols import NOUN, PRON, AUX, SCONJ, INTJ, PART, PROPN
# POS explanations for indonesian available from https://www.aclweb.org/anthology/Y12-1014
TAG_MAP = {
"NSD": {POS: NOUN},
"Z--": {POS: PUNCT},
"VSA": {POS: VERB},
"CC-": {POS: NUM},
"R--": {POS: ADP},
"D--": {POS: ADV},
"ASP": {POS: ADJ},
"S--": {POS: SCONJ},
"VSP": {POS: VERB},
"H--": {POS: CCONJ},
"F--": {POS: X},
"B--": {POS: DET},
"CO-": {POS: NUM},
"G--": {POS: ADV},
"PS3": {POS: PRON},
"W--": {POS: ADV},
"O--": {POS: AUX},
"PP1": {POS: PRON},
"ASS": {POS: ADJ},
"PS1": {POS: PRON},
"APP": {POS: ADJ},
"CD-": {POS: NUM},
"VPA": {POS: VERB},
"VPP": {POS: VERB},
"X--": {POS: X},
"CO-+PS3": {POS: NUM},
"NSD+PS3": {POS: NOUN},
"ASP+PS3": {POS: ADJ},
"M--": {POS: AUX},
"VSA+PS3": {POS: VERB},
"R--+PS3": {POS: ADP},
"W--+T--": {POS: ADV},
"PS2": {POS:PRON},
"NSD+PS1": {POS:NOUN},
"PP3": {POS: PRON},
"VSA+T--": {POS: VERB},
"D--+T--": {POS: ADV},
"VSP+PS3": {POS: VERB},
"F--+PS3": {POS: X},
"M--+T--": {POS: AUX},
"F--+T--": {POS: X},
"PUNCT": {POS: PUNCT},
"PROPN": {POS: PROPN},
"I--": {POS: INTJ},
"S--+PS3": {POS: SCONJ},
"ASP+T--": {POS: ADJ},
"CC-+PS3": {POS: NUM},
"NSD+PS2": {POS: NOUN},
"B--+T--": {POS: DET},
"H--+T--": {POS: CCONJ},
"VSA+PS2": {POS: VERB},
"NSF": {POS: NOUN},
"PS1+VSA": {POS: PRON},
"NPD": {POS: NOUN},
"PP2": {POS:PRON},
"VSA+PS1": {POS: VERB},
"T--": {POS: PART},
"NSM": {POS: NOUN},
"NUM": {POS: NUM},
"ASP+PS2": {POS: ADJ},
"G--+T--": {POS: PART},
"D--+PS3": {POS: ADV},
"R--+PS2": {POS: ADP},
"NSM+PS3": {POS: NOUN},
"VSP+T--": {POS: VERB},
"M--+PS3": {POS: AUX},
"ASS+PS3": {POS: ADJ},
"G--+PS3": {POS: PART},
"F--+PS1": {POS: X},
"NSD+T--": {POS: NOUN},
"PP1+T--": {POS: PRON},
"B--+PS3": {POS: DET},
"NOUN": {POS: NOUN},
"NPD+PS3": {POS: NOUN},
"R--+PS1": {POS: ADP},
"F--+PS2": {POS: X},
"CD-+PS3": {POS: NUM},
"PS1+VSA+T--":{POS: VERB},
"PS2+VSA": {POS: VERB},
"VERB": {POS: VERB},
"CC-+T--": {POS: NUM},
"NPD+PS2":{POS: NOUN},
"D--+PS2":{POS: ADV},
"PP3+T--": {POS: PRON},
"X": {POS: X}}
"NSD": {POS: NOUN},
"Z--": {POS: PUNCT},
"VSA": {POS: VERB},
"CC-": {POS: NUM},
"R--": {POS: ADP},
"D--": {POS: ADV},
"ASP": {POS: ADJ},
"S--": {POS: SCONJ},
"VSP": {POS: VERB},
"H--": {POS: CCONJ},
"F--": {POS: X},
"B--": {POS: DET},
"CO-": {POS: NUM},
"G--": {POS: ADV},
"PS3": {POS: PRON},
"W--": {POS: ADV},
"O--": {POS: AUX},
"PP1": {POS: PRON},
"ASS": {POS: ADJ},
"PS1": {POS: PRON},
"APP": {POS: ADJ},
"CD-": {POS: NUM},
"VPA": {POS: VERB},
"VPP": {POS: VERB},
"X--": {POS: X},
"CO-+PS3": {POS: NUM},
"NSD+PS3": {POS: NOUN},
"ASP+PS3": {POS: ADJ},
"M--": {POS: AUX},
"VSA+PS3": {POS: VERB},
"R--+PS3": {POS: ADP},
"W--+T--": {POS: ADV},
"PS2": {POS: PRON},
"NSD+PS1": {POS: NOUN},
"PP3": {POS: PRON},
"VSA+T--": {POS: VERB},
"D--+T--": {POS: ADV},
"VSP+PS3": {POS: VERB},
"F--+PS3": {POS: X},
"M--+T--": {POS: AUX},
"F--+T--": {POS: X},
"PUNCT": {POS: PUNCT},
"PROPN": {POS: PROPN},
"I--": {POS: INTJ},
"S--+PS3": {POS: SCONJ},
"ASP+T--": {POS: ADJ},
"CC-+PS3": {POS: NUM},
"NSD+PS2": {POS: NOUN},
"B--+T--": {POS: DET},
"H--+T--": {POS: CCONJ},
"VSA+PS2": {POS: VERB},
"NSF": {POS: NOUN},
"PS1+VSA": {POS: PRON},
"NPD": {POS: NOUN},
"PP2": {POS: PRON},
"VSA+PS1": {POS: VERB},
"T--": {POS: PART},
"NSM": {POS: NOUN},
"NUM": {POS: NUM},
"ASP+PS2": {POS: ADJ},
"G--+T--": {POS: PART},
"D--+PS3": {POS: ADV},
"R--+PS2": {POS: ADP},
"NSM+PS3": {POS: NOUN},
"VSP+T--": {POS: VERB},
"M--+PS3": {POS: AUX},
"ASS+PS3": {POS: ADJ},
"G--+PS3": {POS: PART},
"F--+PS1": {POS: X},
"NSD+T--": {POS: NOUN},
"PP1+T--": {POS: PRON},
"B--+PS3": {POS: DET},
"NOUN": {POS: NOUN},
"NPD+PS3": {POS: NOUN},
"R--+PS1": {POS: ADP},
"F--+PS2": {POS: X},
"CD-+PS3": {POS: NUM},
"PS1+VSA+T--": {POS: VERB},
"PS2+VSA": {POS: VERB},
"VERB": {POS: VERB},
"CC-+T--": {POS: NUM},
"NPD+PS2": {POS: NOUN},
"D--+PS2": {POS: ADV},
"PP3+T--": {POS: PRON},
"X": {POS: X},
}

View File

@ -4,67 +4,87 @@ from __future__ import unicode_literals
STOP_WORDS = set(
"""
ಮತ
ಅವರ
ಅವರ
ಬಗ
ಆದರ
ಅವರನ
ಆದರ
ತಮ
ದರ
ಿದರ
ಿ
ಬಳಿ
ಅವರಿ
ನಡ
ಿ
ಇದ
ಅವರ
ಕಳ
ಇದ
ಿಿಿದರ
ಿ
ತನ
ಿಿಿ
ಿ
ಈಗ
ಎಲ
ನನ
ನಮ
ಈಗಗಲ
ಇದಕ
ಹಲವ
ಇದ
ಮತ
ಿದರ
ಿ
ಇದರಿ
ಲಕ
ಅದ
ಇದನ
ಿ
ದರ
ಅವರ
ಈಗ
ಿ
ಅಷ
ಇದ
ಿ
ತಮ
ನಮ
ಿದರ
ಮತ
ಇದ
ಇತ
ಎಲ
ನಡ
ಅದನ
ಇಲಿ
ಆಗ
ಿ.
ಅದ
ಇರ
ಅಲಲದ
ಲವ
ದರ
ಿ
ಿ
ಇದರಿ
ನನಗ
ಅಲಲದ
ಎಷ
ಇದರ
ಇಲ
ಕಳ
ಈಗಗಲ
ಿ
ಅದಕ
ಬಗ
ಅವರ
ಇದನ
ಇದ
ಇನ
ಎಲ
ಇರ
ಅವರಿ
ಿ
ಏನ
ಇಲಿ
ನನನನ
ಲವ
ಬಳಿ
ತನ
ಆಗ
ಅಥವ
ಅಲ
ವಲ
ಆದರ
ಮತ
ಇನ
ಅದ
ಆಗಿ
ಅವರನ
ಿ
ನಡಿ
ಇದಕ
ನನ
""".split()
)

20
spacy/lang/mr/__init__.py Normal file
View File

@ -0,0 +1,20 @@
#coding: utf8
from __future__ import unicode_literals
from .stop_words import STOP_WORDS
from ...language import Language
from ...attrs import LANG
class MarathiDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: "mr"
stop_words = STOP_WORDS
class Marathi(Language):
lang = "mr"
Defaults = MarathiDefaults
__all__ = ["Marathi"]

196
spacy/lang/mr/stop_words.py Normal file
View File

@ -0,0 +1,196 @@
# coding: utf8
from __future__ import unicode_literals
# Source: https://github.com/stopwords-iso/stopwords-mr/blob/master/stopwords-mr.txt, https://github.com/6/stopwords-json/edit/master/dist/mr.json
STOP_WORDS = set(
"""
अतर
आणि
मग
पर
ऐस
आत
तय
अस
हण
आह
जर
हणि
एक
ऐस
मज
एथ
जय
अस
कर
ऐस
हल
ि
आघव
ऊनि
एक
सकळ
एऱहव
ि
ि
ि
तरि
आपण
ि
कर
इय
पड
अधि
अन
अश
असलय
असल
अस
अस
अस
आज
आणि
आत
आपल
आल
आल
आल
आह
आह
एक
एक
कम
करणय
कर
ि
ऊन
तर
तर
तस
ि
पण
पम
परयतन
ि
हणज
हण
हण
यकत
सर
ि
हज
""".split()
)

View File

@ -6,10 +6,7 @@ from .lex_attrs import LEX_ATTRS
from .tag_map import TAG_MAP
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
from .lemmatizer import LOOKUP, LEMMA_EXC, LEMMA_INDEX, RULES
from .lemmatizer.lemmatizer import DutchLemmatizer
from .lemmatizer import LOOKUP, LEMMA_EXC, LEMMA_INDEX, RULES, DutchLemmatizer
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
@ -21,9 +18,10 @@ class DutchDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: 'nl'
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM],
BASE_NORMS)
lex_attr_getters[LANG] = lambda text: "nl"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
tag_map = TAG_MAP
@ -36,15 +34,14 @@ class DutchDefaults(Language.Defaults):
lemma_index = LEMMA_INDEX
lemma_exc = LEMMA_EXC
lemma_lookup = LOOKUP
return DutchLemmatizer(index=lemma_index,
exceptions=lemma_exc,
lookup=lemma_lookup,
rules=rules)
return DutchLemmatizer(
index=lemma_index, exceptions=lemma_exc, lookup=lemma_lookup, rules=rules
)
class Dutch(Language):
lang = 'nl'
lang = "nl"
Defaults = DutchDefaults
__all__ = ['Dutch']
__all__ = ["Dutch"]

View File

@ -18,23 +18,26 @@ from ._adpositions import ADPOSITIONS
from ._determiners import DETERMINERS
from .lookup import LOOKUP
from ._lemma_rules import RULES
from .lemmatizer import DutchLemmatizer
LEMMA_INDEX = {"adj": ADJECTIVES,
"noun": NOUNS,
"verb": VERBS,
"adp": ADPOSITIONS,
"det": DETERMINERS}
LEMMA_INDEX = {
"adj": ADJECTIVES,
"noun": NOUNS,
"verb": VERBS,
"adp": ADPOSITIONS,
"det": DETERMINERS,
}
LEMMA_EXC = {"adj": ADJECTIVES_IRREG,
"adv": ADVERBS_IRREG,
"adp": ADPOSITIONS_IRREG,
"noun": NOUNS_IRREG,
"verb": VERBS_IRREG,
"det": DETERMINERS_IRREG,
"pron": PRONOUNS_IRREG}
LEMMA_EXC = {
"adj": ADJECTIVES_IRREG,
"adv": ADVERBS_IRREG,
"adp": ADPOSITIONS_IRREG,
"noun": NOUNS_IRREG,
"verb": VERBS_IRREG,
"det": DETERMINERS_IRREG,
"pron": PRONOUNS_IRREG,
}
__all__ = ["LOOKUP", "LEMMA_EXC", "LEMMA_INDEX", "RULES", "DutchLemmatizer"]

View File

@ -1,7 +1,7 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA
from ...symbols import ORTH
# Extensive list of both common and uncommon dutch abbreviations copied from
# github.com/diasks2/pragmatic_segmenter, a Ruby library for rule-based
@ -16,7 +16,7 @@ from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA
# are extremely domain-specific. Tokenizer performance may benefit from some
# slight pruning, although no performance regression has been observed so far.
# fmt: off
abbrevs = ['a.2d.', 'a.a.', 'a.a.j.b.', 'a.f.t.', 'a.g.j.b.',
'a.h.v.', 'a.h.w.', 'a.hosp.', 'a.i.', 'a.j.b.', 'a.j.t.',
'a.m.', 'a.m.r.', 'a.p.m.', 'a.p.r.', 'a.p.t.', 'a.s.',
@ -326,7 +326,7 @@ abbrevs = ['a.2d.', 'a.a.', 'a.a.j.b.', 'a.f.t.', 'a.g.j.b.',
'wtvb.', 'ww.', 'x.d.', 'z.a.', 'z.g.', 'z.i.', 'z.j.',
'z.o.z.', 'z.p.', 'z.s.m.', 'zg.', 'zgn.', 'zn.', 'znw.',
'zr.', 'zr.', 'ms.', 'zr.ms.']
# fmt: on
_exc = {}
for orth in abbrevs:

View File

@ -53,4 +53,11 @@ BASE_NORMS = {
"US$": "$",
"C$": "$",
"A$": "$",
"": "$",
"": "$",
"": "$",
"": "$",
"Mex$": "$",
"": "$",
"": "$",
}

View File

@ -4,11 +4,14 @@ from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .tag_map import TAG_MAP
from .stop_words import STOP_WORDS
from .norm_exceptions import NORM_EXCEPTIONS
from .lex_attrs import LEX_ATTRS
from ...attrs import LANG
from ..norm_exceptions import BASE_NORMS
from ...attrs import LANG, NORM
from ...language import Language
from ...tokens import Doc
from ...util import DummyTokenizer
from ...util import DummyTokenizer, add_lookups
class ThaiTokenizer(DummyTokenizer):
@ -25,15 +28,18 @@ class ThaiTokenizer(DummyTokenizer):
self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
def __call__(self, text):
words = list(self.word_tokenize(text, "newmm"))
words = list(self.word_tokenize(text))
spaces = [False] * len(words)
return Doc(self.vocab, words=words, spaces=spaces)
class ThaiDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda _text: "th"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
)
tokenizer_exceptions = dict(TOKENIZER_EXCEPTIONS)
tag_map = TAG_MAP
stop_words = STOP_WORDS

View File

@ -0,0 +1,62 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = [
"ศูนย์",
"หนึ่ง",
"สอง",
"สาม",
"สี่",
"ห้า",
"หก",
"เจ็ด",
"แปด",
"เก้า",
"สิบ",
"สิบเอ็ด",
"ยี่สิบ",
"ยี่สิบเอ็ด",
"สามสิบ",
"สามสิบเอ็ด",
"สี่สิบ",
"สี่สิบเอ็ด",
"ห้าสิบ",
"ห้าสิบเอ็ด",
"หกสิบเอ็ด",
"เจ็ดสิบ",
"เจ็ดสิบเอ็ด",
"แปดสิบ",
"แปดสิบเอ็ด",
"เก้าสิบ",
"เก้าสิบเอ็ด",
"ร้อย",
"พัน",
"ล้าน",
"พันล้าน",
"หมื่นล้าน",
"แสนล้าน",
"ล้านล้าน",
"ล้านล้านล้าน",
"ล้านล้านล้านล้าน",
]
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}

View File

@ -0,0 +1,113 @@
# coding: utf8
from __future__ import unicode_literals
_exc = {
# Conjugation and Diversion invalid to Tonal form (ผันอักษรและเสียงไม่ตรงกับรูปวรรณยุกต์)
"สนุ๊กเกอร์": "สนุกเกอร์",
"โน้ต": "โน้ต",
# Misspelled because of being lazy or hustle (สะกดผิดเพราะขี้เกียจพิมพ์ หรือเร่งรีบ)
"โทสับ": "โทรศัพท์",
"พุ่งนี้": "พรุ่งนี้",
# Strange (ให้ดูแปลกตา)
"ชะมะ": "ใช่ไหม",
"ชิมิ": "ใช่ไหม",
"ชะ": "ใช่ไหม",
"ช่ายมะ": "ใช่ไหม",
"ป่าว": "เปล่า",
"ป่ะ": "เปล่า",
"ปล่าว": "เปล่า",
"คัย": "ใคร",
"ไค": "ใคร",
"คราย": "ใคร",
"เตง": "ตัวเอง",
"ตะเอง": "ตัวเอง",
"รึ": "หรือ",
"เหรอ": "หรือ",
"หรา": "หรือ",
"หรอ": "หรือ",
"ชั้น": "ฉัน",
"ชั้ล": "ฉัน",
"ช้าน": "ฉัน",
"เทอ": "เธอ",
"เทอร์": "เธอ",
"เทอว์": "เธอ",
"แกร": "แก",
"ป๋ม": "ผม",
"บ่องตง": "บอกตรงๆ",
"ถ่ามตง": "ถามตรงๆ",
"ต่อมตง": "ตอบตรงๆ",
"เพิ่ล": "เพื่อน",
"จอบอ": "จอบอ",
"ดั้ย": "ได้",
"ขอบคุง": "ขอบคุณ",
"ยังงัย": "ยังไง",
"Inw": "เทพ",
"uou": "นอน",
"Lกรีeu": "เกรียน",
# Misspelled to express emotions (คำที่สะกดผิดเพื่อแสดงอารมณ์)
"เปงราย": "เป็นอะไร",
"เปนรัย": "เป็นอะไร",
"เปงรัย": "เป็นอะไร",
"เป็นอัลไล": "เป็นอะไร",
"ทามมาย": "ทำไม",
"ทามมัย": "ทำไม",
"จังรุย": "จังเลย",
"จังเยย": "จังเลย",
"จุงเบย": "จังเลย",
"ไม่รู้": "มะรุ",
"เฮ่ย": "เฮ้ย",
"เห้ย": "เฮ้ย",
"น่าร็อค": "น่ารัก",
"น่าร๊าก": "น่ารัก",
"ตั้ลล๊าก": "น่ารัก",
"คือร๊ะ": "คืออะไร",
"โอป่ะ": "โอเคหรือเปล่า",
"น่ามคาน": "น่ารำคาญ",
"น่ามสาร": "น่าสงสาร",
"วงวาร": "สงสาร",
"บับว่า": "แบบว่า",
"อัลไล": "อะไร",
"อิจ": "อิจฉา",
# Reduce rough words or Avoid to software filter (คำที่สะกดผิดเพื่อลดความหยาบของคำ หรืออาจใช้หลีกเลี่ยงการกรองคำหยาบของซอฟต์แวร์)
"กรู": "กู",
"กุ": "กู",
"กรุ": "กู",
"ตู": "กู",
"ตรู": "กู",
"มรึง": "มึง",
"เมิง": "มึง",
"มืง": "มึง",
"มุง": "มึง",
"สาด": "สัตว์",
"สัส": "สัตว์",
"สัก": "สัตว์",
"แสรด": "สัตว์",
"โคโตะ": "โคตร",
"โคด": "โคตร",
"โครต": "โคตร",
"โคตะระ": "โคตร",
"พ่อง": "พ่อมึง",
"แม่เมิง": "แม่มึง",
"เชี่ย": "เหี้ย",
# Imitate words (คำเลียนเสียง โดยส่วนใหญ่จะเพิ่มทัณฑฆาต หรือซ้ำตัวอักษร)
"แอร๊ยย": "อ๊าย",
"อร๊ายยย": "อ๊าย",
"มันส์": "มัน",
"วู๊วววววววว์": "วู้",
# Acronym (แบบคำย่อ)
"หมาลัย": "มหาวิทยาลัย",
"วิดวะ": "วิศวะ",
"สินสาด ": "ศิลปศาสตร์",
"สินกำ ": "ศิลปกรรมศาสตร์",
"เสารีย์ ": "อนุเสาวรีย์ชัยสมรภูมิ",
"เมกา ": "อเมริกา",
"มอไซค์ ": "มอเตอร์ไซค์",
}
NORM_EXCEPTIONS = {}
for string, norm in _exc.items():
NORM_EXCEPTIONS[string] = norm
NORM_EXCEPTIONS[string.title()] = norm

View File

@ -5,7 +5,7 @@ from ...symbols import ORTH, LEMMA
_exc = {
#หน่วยงานรัฐ / government agency
# หน่วยงานรัฐ / government agency
"กกต.": [{ORTH: "กกต.", LEMMA: "คณะกรรมการการเลือกตั้ง"}],
"กทท.": [{ORTH: "กทท.", LEMMA: "การท่าเรือแห่งประเทศไทย"}],
"กทพ.": [{ORTH: "กทพ.", LEMMA: "การทางพิเศษแห่งประเทศไทย"}],
@ -44,11 +44,21 @@ _exc = {
"ธอส.": [{ORTH: "ธอส.", LEMMA: "ธนาคารอาคารสงเคราะห์"}],
"นย.": [{ORTH: "นย.", LEMMA: "นาวิกโยธิน"}],
"ปตท.": [{ORTH: "ปตท.", LEMMA: "การปิโตรเลียมแห่งประเทศไทย"}],
"ป.ป.ช.": [{ORTH: "ป.ป.ช.", LEMMA: "คณะกรรมการป้องกันและปราบปรามการทุจริตและประพฤติมิชอบในวงราชการ"}],
"ป.ป.ช.": [
{
ORTH: "ป.ป.ช.",
LEMMA: "คณะกรรมการป้องกันและปราบปรามการทุจริตและประพฤติมิชอบในวงราชการ",
}
],
"ป.ป.ส.": [{ORTH: "ป.ป.ส.", LEMMA: "คณะกรรมการป้องกันและปราบปรามยาเสพติด"}],
"บพร.": [{ORTH: "บพร.", LEMMA: "กรมการบินพลเรือน"}],
"บย.": [{ORTH: "บย.", LEMMA: "กองบินยุทธการ"}],
"พสวท.": [{ORTH: "พสวท.", LEMMA: "โครงการพัฒนาและส่งเสริมผู้มีความรู้ความสามารถพิเศษทางวิทยาศาสตร์และเทคโนโลยี"}],
"พสวท.": [
{
ORTH: "พสวท.",
LEMMA: "โครงการพัฒนาและส่งเสริมผู้มีความรู้ความสามารถพิเศษทางวิทยาศาสตร์และเทคโนโลยี",
}
],
"มอก.": [{ORTH: "มอก.", LEMMA: "สำนักงานมาตรฐานผลิตภัณฑ์อุตสาหกรรม"}],
"ยธ.": [{ORTH: "ยธ.", LEMMA: "กรมโยธาธิการ"}],
"รพช.": [{ORTH: "รพช.", LEMMA: "สำนักงานเร่งรัดพัฒนาชนบท"}],
@ -71,11 +81,15 @@ _exc = {
"สปช.": [{ORTH: "สปช.", LEMMA: "สำนักงานคณะกรรมการการประถมศึกษาแห่งชาติ"}],
"สปอ.": [{ORTH: "สปอ.", LEMMA: "สำนักงานการประถมศึกษาอำเภอ"}],
"สพช.": [{ORTH: "สพช.", LEMMA: "สำนักงานคณะกรรมการนโยบายพลังงานแห่งชาติ"}],
"สยช.": [{ORTH: "สยช.", LEMMA: "สำนักงานคณะกรรมการส่งเสริมและประสานงานเยาวชนแห่งชาติ"}],
"สยช.": [
{ORTH: "สยช.", LEMMA: "สำนักงานคณะกรรมการส่งเสริมและประสานงานเยาวชนแห่งชาติ"}
],
"สวช.": [{ORTH: "สวช.", LEMMA: "สำนักงานคณะกรรมการวัฒนธรรมแห่งชาติ"}],
"สวท.": [{ORTH: "สวท.", LEMMA: "สถานีวิทยุกระจายเสียงแห่งประเทศไทย"}],
"สวทช.": [{ORTH: "สวทช.", LEMMA: "สำนักงานพัฒนาวิทยาศาสตร์และเทคโนโลยีแห่งชาติ"}],
"สคช.": [{ORTH: "สคช.", LEMMA: "สำนักงานคณะกรรมการพัฒนาการเศรษฐกิจและสังคมแห่งชาติ"}],
"สคช.": [
{ORTH: "สคช.", LEMMA: "สำนักงานคณะกรรมการพัฒนาการเศรษฐกิจและสังคมแห่งชาติ"}
],
"สสว.": [{ORTH: "สสว.", LEMMA: "สำนักงานส่งเสริมวิสาหกิจขนาดกลางและขนาดย่อม"}],
"สสส.": [{ORTH: "สสส.", LEMMA: "สำนักงานกองทุนสนับสนุนการสร้างเสริมสุขภาพ"}],
"สสวท.": [{ORTH: "สสวท.", LEMMA: "สถาบันส่งเสริมการสอนวิทยาศาสตร์และเทคโนโลยี"}],
@ -85,7 +99,7 @@ _exc = {
"อปพร.": [{ORTH: "อปพร.", LEMMA: "อาสาสมัครป้องกันภัยฝ่ายพลเรือน"}],
"อย.": [{ORTH: "อย.", LEMMA: "สำนักงานคณะกรรมการอาหารและยา"}],
"อ.ส.ม.ท.": [{ORTH: "อ.ส.ม.ท.", LEMMA: "องค์การสื่อสารมวลชนแห่งประเทศไทย"}],
#มหาวิทยาลัย / สถานศึกษา / university / college
# มหาวิทยาลัย / สถานศึกษา / university / college
"มทส.": [{ORTH: "มทส.", LEMMA: "มหาวิทยาลัยเทคโนโลยีสุรนารี"}],
"มธ.": [{ORTH: "มธ.", LEMMA: "มหาวิทยาลัยธรรมศาสตร์"}],
"ม.อ.": [{ORTH: "ม.อ.", LEMMA: "มหาวิทยาลัยสงขลานครินทร์"}],
@ -93,7 +107,7 @@ _exc = {
"มมส.": [{ORTH: "มมส.", LEMMA: "มหาวิทยาลัยมหาสารคาม"}],
"วท.": [{ORTH: "วท.", LEMMA: "วิทยาลัยเทคนิค"}],
"สตม.": [{ORTH: "สตม.", LEMMA: "สำนักงานตรวจคนเข้าเมือง (ตำรวจ)"}],
#ยศ / rank
# ยศ / rank
"ดร.": [{ORTH: "ดร.", LEMMA: "ดอกเตอร์"}],
"ด.ต.": [{ORTH: "ด.ต.", LEMMA: "ดาบตำรวจ"}],
"จ.ต.": [{ORTH: "จ.ต.", LEMMA: "จ่าตรี"}],
@ -133,10 +147,14 @@ _exc = {
"ผญบ.": [{ORTH: "ผญบ.", LEMMA: "ผู้ใหญ่บ้าน"}],
"ผบ.": [{ORTH: "ผบ.", LEMMA: "ผู้บังคับบัญชา"}],
"ผบก.": [{ORTH: "ผบก.", LEMMA: "ผู้บังคับบัญชาการ (ตำรวจ)"}],
"ผบก.": [{ORTH: "ผบก.", LEMMA: "ผู้บังคับการ (ตำรวจ)"}],
"ผบก.น.": [{ORTH: "ผบก.น.", LEMMA: "ผู้บังคับการตำรวจนครบาล"}],
"ผบก.ป.": [{ORTH: "ผบก.ป.", LEMMA: "ผู้บังคับการตำรวจกองปราบปราม"}],
"ผบก.ปค.": [{ORTH: "ผบก.ปค.", LEMMA: "ผู้บังคับการ กองบังคับการปกครอง (โรงเรียนนายร้อยตำรวจ)"}],
"ผบก.ปค.": [
{
ORTH: "ผบก.ปค.",
LEMMA: "ผู้บังคับการ กองบังคับการปกครอง (โรงเรียนนายร้อยตำรวจ)",
}
],
"ผบก.ปม.": [{ORTH: "ผบก.ปม.", LEMMA: "ผู้บังคับการตำรวจป่าไม้"}],
"ผบก.ภ.": [{ORTH: "ผบก.ภ.", LEMMA: "ผู้บังคับการตำรวจภูธร"}],
"ผบช.": [{ORTH: "ผบช.", LEMMA: "ผู้บัญชาการ (ตำรวจ)"}],
@ -177,7 +195,6 @@ _exc = {
"พล.อ.ต.": [{ORTH: "พล.อ.ต.", LEMMA: "พลอากาศตรี"}],
"พล.อ.ท.": [{ORTH: "พล.อ.ท.", LEMMA: "พลอากาศโท"}],
"พล.อ.อ.": [{ORTH: "พล.อ.อ.", LEMMA: "พลอากาศเอก"}],
"พ.อ.": [{ORTH: "พ.อ.", LEMMA: "พันเอก"}],
"พ.อ.พิเศษ": [{ORTH: "พ.อ.พิเศษ", LEMMA: "พันเอกพิเศษ"}],
"พ.อ.ต.": [{ORTH: "พ.อ.ต.", LEMMA: "พันจ่าอากาศตรี"}],
"พ.อ.ท.": [{ORTH: "พ.อ.ท.", LEMMA: "พันจ่าอากาศโท"}],
@ -209,7 +226,7 @@ _exc = {
"ส.อ.": [{ORTH: "ส.อ.", LEMMA: "สิบเอก"}],
"อจ.": [{ORTH: "อจ.", LEMMA: "อาจารย์"}],
"อจญ.": [{ORTH: "อจญ.", LEMMA: "อาจารย์ใหญ่"}],
#วุฒิ / bachelor degree
# วุฒิ / bachelor degree
"ป.": [{ORTH: "ป.", LEMMA: "ประถมศึกษา"}],
"ป.กศ.": [{ORTH: "ป.กศ.", LEMMA: "ประกาศนียบัตรวิชาการศึกษา"}],
"ป.กศ.สูง": [{ORTH: "ป.กศ.สูง", LEMMA: "ประกาศนียบัตรวิชาการศึกษาชั้นสูง"}],
@ -283,20 +300,20 @@ _exc = {
"อ.บ.": [{ORTH: "อ.บ.", LEMMA: "อักษรศาสตรบัณฑิต"}],
"อ.ม.": [{ORTH: "อ.ม.", LEMMA: "อักษรศาสตรมหาบัณฑิต"}],
"อ.ด.": [{ORTH: "อ.ด.", LEMMA: "อักษรศาสตรดุษฎีบัณฑิต"}],
#ปี / เวลา / year / time
# ปี / เวลา / year / time
"ชม.": [{ORTH: "ชม.", LEMMA: "ชั่วโมง"}],
"จ.ศ.": [{ORTH: "จ.ศ.", LEMMA: "จุลศักราช"}],
"ค.ศ.": [{ORTH: "ค.ศ.", LEMMA: "คริสต์ศักราช"}],
"ฮ.ศ.": [{ORTH: "ฮ.ศ.", LEMMA: "ฮิจเราะห์ศักราช"}],
"ว.ด.ป.": [{ORTH: "ว.ด.ป.", LEMMA: "วัน เดือน ปี"}],
#ระยะทาง / distance
# ระยะทาง / distance
"ฮม.": [{ORTH: "ฮม.", LEMMA: "เฮกโตเมตร"}],
"ดคม.": [{ORTH: "ดคม.", LEMMA: "เดคาเมตร"}],
"ดม.": [{ORTH: "ดม.", LEMMA: "เดซิเมตร"}],
"มม.": [{ORTH: "มม.", LEMMA: "มิลลิเมตร"}],
"ซม.": [{ORTH: "ซม.", LEMMA: "เซนติเมตร"}],
"กม.": [{ORTH: "กม.", LEMMA: "กิโลเมตร"}],
#น้ำหนัก / weight
# น้ำหนัก / weight
"น.น.": [{ORTH: "น.น.", LEMMA: "น้ำหนัก"}],
"ฮก.": [{ORTH: "ฮก.", LEMMA: "เฮกโตกรัม"}],
"ดคก.": [{ORTH: "ดคก.", LEMMA: "เดคากรัม"}],
@ -305,7 +322,7 @@ _exc = {
"มก.": [{ORTH: "มก.", LEMMA: "มิลลิกรัม"}],
"ก.": [{ORTH: "ก.", LEMMA: "กรัม"}],
"กก.": [{ORTH: "กก.", LEMMA: "กิโลกรัม"}],
#ปริมาตร / volume
# ปริมาตร / volume
"ฮล.": [{ORTH: "ฮล.", LEMMA: "เฮกโตลิตร"}],
"ดคล.": [{ORTH: "ดคล.", LEMMA: "เดคาลิตร"}],
"ดล.": [{ORTH: "ดล.", LEMMA: "เดซิลิตร"}],
@ -313,12 +330,12 @@ _exc = {
"ล.": [{ORTH: "ล.", LEMMA: "ลิตร"}],
"กล.": [{ORTH: "กล.", LEMMA: "กิโลลิตร"}],
"ลบ.": [{ORTH: "ลบ.", LEMMA: "ลูกบาศก์"}],
#พื้นที่ / area
# พื้นที่ / area
"ตร.ซม.": [{ORTH: "ตร.ซม.", LEMMA: "ตารางเซนติเมตร"}],
"ตร.ม.": [{ORTH: "ตร.ม.", LEMMA: "ตารางเมตร"}],
"ตร.ว.": [{ORTH: "ตร.ว.", LEMMA: "ตารางวา"}],
"ตร.กม.": [{ORTH: "ตร.กม.", LEMMA: "ตารางกิโลเมตร"}],
#เดือน / month
# เดือน / month
"ม.ค.": [{ORTH: "ม.ค.", LEMMA: "มกราคม"}],
"ก.พ.": [{ORTH: "ก.พ.", LEMMA: "กุมภาพันธ์"}],
"มี.ค.": [{ORTH: "มี.ค.", LEMMA: "มีนาคม"}],
@ -331,22 +348,22 @@ _exc = {
"ต.ค.": [{ORTH: "ต.ค.", LEMMA: "ตุลาคม"}],
"พ.ย.": [{ORTH: "พ.ย.", LEMMA: "พฤศจิกายน"}],
"ธ.ค.": [{ORTH: "ธ.ค.", LEMMA: "ธันวาคม"}],
#เพศ / gender
# เพศ / gender
"ช.": [{ORTH: "ช.", LEMMA: "ชาย"}],
"ญ.": [{ORTH: "ญ.", LEMMA: "หญิง"}],
"ด.ช.": [{ORTH: "ด.ช.", LEMMA: "เด็กชาย"}],
"ด.ญ.": [{ORTH: "ด.ญ.", LEMMA: "เด็กหญิง"}],
#ที่อยู่ / address
# ที่อยู่ / address
"ถ.": [{ORTH: "ถ.", LEMMA: "ถนน"}],
"ต.": [{ORTH: "ต.", LEMMA: "ตำบล"}],
"อ.": [{ORTH: "อ.", LEMMA: "อำเภอ"}],
"จ.": [{ORTH: "จ.", LEMMA: "จังหวัด"}],
#สรรพนาม / pronoun
# สรรพนาม / pronoun
"ข้าฯ": [{ORTH: "ข้าฯ", LEMMA: "ข้าพระพุทธเจ้า"}],
"ทูลเกล้าฯ": [{ORTH: "ทูลเกล้าฯ", LEMMA: "ทูลเกล้าทูลกระหม่อม"}],
"น้อมเกล้าฯ": [{ORTH: "น้อมเกล้าฯ", LEMMA: "น้อมเกล้าน้อมกระหม่อม"}],
"โปรดเกล้าฯ": [{ORTH: "โปรดเกล้าฯ", LEMMA: "โปรดเกล้าโปรดกระหม่อม"}],
#การเมือง / politic
# การเมือง / politic
"ขจก.": [{ORTH: "ขจก.", LEMMA: "ขบวนการโจรก่อการร้าย"}],
"ขบด.": [{ORTH: "ขบด.", LEMMA: "ขบวนการแบ่งแยกดินแดน"}],
"นปช.": [{ORTH: "นปช.", LEMMA: "แนวร่วมประชาธิปไตยขับไล่เผด็จการ"}],
@ -363,7 +380,7 @@ _exc = {
"สจ.": [{ORTH: "สจ.", LEMMA: "สมาชิกสภาจังหวัด"}],
"สว.": [{ORTH: "สว.", LEMMA: "สมาชิกวุฒิสภา"}],
"ส.ส.": [{ORTH: "ส.ส.", LEMMA: "สมาชิกสภาผู้แทนราษฎร"}],
#ทั่วไป / general
# ทั่วไป / general
"ก.ข.ค.": [{ORTH: "ก.ข.ค.", LEMMA: "ก้างขวางคอ"}],
"กทม.": [{ORTH: "กทม.", LEMMA: "กรุงเทพมหานคร"}],
"กรุงเทพฯ": [{ORTH: "กรุงเทพฯ", LEMMA: "กรุงเทพมหานคร"}],
@ -376,7 +393,12 @@ _exc = {
"จก.": [{ORTH: "จก.", LEMMA: "จำกัด"}],
"จขกท.": [{ORTH: "จขกท.", LEMMA: "เจ้าของกระทู้"}],
"จนท.": [{ORTH: "จนท.", LEMMA: "เจ้าหน้าที่"}],
"จ.ป.ร.": [{ORTH: "จ.ป.ร.", LEMMA: "มหาจุฬาลงกรณ ปรมราชาธิราช (พระปรมาภิไธยในพระบาทสมเด็จพระจุลจอมเกล้าเจ้าอยู่หัว)"}],
"จ.ป.ร.": [
{
ORTH: "จ.ป.ร.",
LEMMA: "มหาจุฬาลงกรณ ปรมราชาธิราช (พระปรมาภิไธยในพระบาทสมเด็จพระจุลจอมเกล้าเจ้าอยู่หัว)",
}
],
"จ.ม.": [{ORTH: "จ.ม.", LEMMA: "จดหมาย"}],
"จย.": [{ORTH: "จย.", LEMMA: "จักรยาน"}],
"จยย.": [{ORTH: "จยย.", LEMMA: "จักรยานยนต์"}],
@ -387,7 +409,9 @@ _exc = {
"น.ศ.": [{ORTH: "น.ศ.", LEMMA: "นักศึกษา"}],
"น.ส.": [{ORTH: "น.ส.", LEMMA: "นางสาว"}],
"น.ส.๓": [{ORTH: "น.ส.๓", LEMMA: "หนังสือรับรองการทำประโยชน์ในที่ดิน"}],
"น.ส.๓ ก.": [{ORTH: "น.ส.๓ ก", LEMMA: "หนังสือแสดงกรรมสิทธิ์ในที่ดิน (มีระวางกำหนด)"}],
"น.ส.๓ ก.": [
{ORTH: "น.ส.๓ ก", LEMMA: "หนังสือแสดงกรรมสิทธิ์ในที่ดิน (มีระวางกำหนด)"}
],
"นสพ.": [{ORTH: "นสพ.", LEMMA: "หนังสือพิมพ์"}],
"บ.ก.": [{ORTH: "บ.ก.", LEMMA: "บรรณาธิการ"}],
"บจก.": [{ORTH: "บจก.", LEMMA: "บริษัทจำกัด"}],
@ -410,7 +434,12 @@ _exc = {
"พขร.": [{ORTH: "พขร.", LEMMA: "พนักงานขับรถ"}],
"ภ.ง.ด.": [{ORTH: "ภ.ง.ด.", LEMMA: "ภาษีเงินได้"}],
"ภ.ง.ด.๙": [{ORTH: "ภ.ง.ด.๙", LEMMA: "แบบแสดงรายการเสียภาษีเงินได้ของกรมสรรพากร"}],
"ภ.ป.ร.": [{ORTH: "ภ.ป.ร.", LEMMA: "ภูมิพลอดุยเดช ปรมราชาธิราช (พระปรมาภิไธยในพระบาทสมเด็จพระปรมินทรมหาภูมิพลอดุลยเดช)"}],
"ภ.ป.ร.": [
{
ORTH: "ภ.ป.ร.",
LEMMA: "ภูมิพลอดุยเดช ปรมราชาธิราช (พระปรมาภิไธยในพระบาทสมเด็จพระปรมินทรมหาภูมิพลอดุลยเดช)",
}
],
"ภ.พ.": [{ORTH: "ภ.พ.", LEMMA: "ภาษีมูลค่าเพิ่ม"}],
"ร.": [{ORTH: "ร.", LEMMA: "รัชกาล"}],
"ร.ง.": [{ORTH: "ร.ง.", LEMMA: "โรงงาน"}],
@ -438,7 +467,6 @@ _exc = {
"เสธ.": [{ORTH: "เสธ.", LEMMA: "เสนาธิการ"}],
"หจก.": [{ORTH: "หจก.", LEMMA: "ห้างหุ้นส่วนจำกัด"}],
"ห.ร.ม.": [{ORTH: "ห.ร.ม.", LEMMA: "ตัวหารร่วมมาก"}],
}

View File

@ -333,6 +333,11 @@ class Language(object):
"""
if name not in self.pipe_names:
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
if not hasattr(component, "__call__"):
msg = Errors.E003.format(component=repr(component), name=name)
if isinstance(component, basestring_) and component in self.factories:
msg += Errors.E135.format(name=name)
raise ValueError(msg)
self.pipeline[self.pipe_names.index(name)] = (name, component)
def rename_pipe(self, old_name, new_name):
@ -412,7 +417,9 @@ class Language(object):
golds (iterable): A batch of `GoldParse` objects.
drop (float): The droput rate.
sgd (callable): An optimizer.
RETURNS (dict): Results from the update.
losses (dict): Dictionary to update with the loss, keyed by component.
component_cfg (dict): Config parameters for specific pipeline
components, keyed by component name.
DOCS: https://spacy.io/api/language#update
"""
@ -593,6 +600,19 @@ class Language(object):
def evaluate(
self, docs_golds, verbose=False, batch_size=256, scorer=None, component_cfg=None
):
"""Evaluate a model's pipeline components.
docs_golds (iterable): Tuples of `Doc` and `GoldParse` objects.
verbose (bool): Print debugging information.
batch_size (int): Batch size to use.
scorer (Scorer): Optional `Scorer` to use. If not passed in, a new one
will be created.
component_cfg (dict): An optional dictionary with extra keyword
arguments for specific components.
RETURNS (Scorer): The scorer containing the evaluation results.
DOCS: https://spacy.io/api/language#evaluate
"""
if scorer is None:
scorer = Scorer()
if component_cfg is None:

View File

@ -1,5 +1,6 @@
# coding: utf8
from __future__ import unicode_literals
from collections import OrderedDict
from .symbols import POS, NOUN, VERB, ADJ, PUNCT, PROPN
from .symbols import VerbForm_inf, VerbForm_none, Number_sing, Degree_pos
@ -118,8 +119,8 @@ def lemmatize(string, index, exceptions, rules):
forms.append(form)
else:
oov_forms.append(form)
# Remove duplicates, and sort forms generated by rules alphabetically.
forms = list(set(forms))
# Remove duplicates but preserve the ordering of applied "rules"
forms = list(OrderedDict.fromkeys(forms))
# Put exceptions at the front of the list, so they get priority.
# This is a dodgy heuristic -- but it's the best we can do until we get
# frequencies on this. We can at least prune out problematic exceptions,

View File

@ -48,7 +48,10 @@ cdef class Matcher:
self._extra_predicates = []
self.vocab = vocab
self.mem = Pool()
self.validator = get_json_validator(TOKEN_PATTERN_SCHEMA) if validate else None
if validate:
self.validator = get_json_validator(TOKEN_PATTERN_SCHEMA)
else:
self.validator = None
def __reduce__(self):
data = (self.vocab, self._patterns, self._callbacks)
@ -105,7 +108,7 @@ cdef class Matcher:
raise ValueError(Errors.E012.format(key=key))
if self.validator:
errors[i] = validate_json(pattern, self.validator)
if errors:
if any(err for err in errors.values()):
raise MatchPatternError(key, errors)
key = self._normalize_key(key)
for pattern in patterns:

View File

@ -127,7 +127,7 @@ cdef class PhraseMatcher:
and self.attr not in (DEP, POS, TAG, LEMMA):
string_attr = self.vocab.strings[self.attr]
user_warning(Warnings.W012.format(key=key, attr=string_attr))
tags = get_bilou(length)
tags = get_biluo(length)
phrase_key = <attr_t*>mem.alloc(length, sizeof(attr_t))
for i, tag in enumerate(tags):
attr_value = self.get_lex_value(doc, i)
@ -230,7 +230,7 @@ cdef class PhraseMatcher:
return "matcher:{}-{}".format(string_attr_name, string_attr_value)
def get_bilou(length):
def get_biluo(length):
if length == 0:
raise ValueError(Errors.E127)
elif length == 1:

View File

@ -109,6 +109,7 @@ cdef class Morphology:
analysis.tag = rich_tag
analysis.lemma = self.lemmatize(analysis.tag.pos, token.lex.orth,
self.tag_map.get(tag_str, {}))
self._cache.set(tag_id, token.lex.orth, analysis)
if token.lemma == 0:
token.lemma = analysis.lemma
@ -140,7 +141,7 @@ cdef class Morphology:
if tag not in self.reverse_index:
return
tag_id = self.reverse_index[tag]
orth = self.strings[orth_str]
orth = self.strings.add(orth_str)
cdef RichTagC rich_tag = self.rich_tags[tag_id]
attrs = intify_attrs(attrs, self.strings, _do_deprecated=True)
cached = <MorphAnalysisC*>self._cache.get(tag_id, orth)

View File

@ -35,7 +35,17 @@ class PRFScore(object):
class Scorer(object):
"""Compute evaluation scores."""
def __init__(self, eval_punct=False):
"""Initialize the Scorer.
eval_punct (bool): Evaluate the dependency attachments to and from
punctuation.
RETURNS (Scorer): The newly created object.
DOCS: https://spacy.io/api/scorer#init
"""
self.tokens = PRFScore()
self.sbd = PRFScore()
self.unlabelled = PRFScore()
@ -46,34 +56,46 @@ class Scorer(object):
@property
def tags_acc(self):
"""RETURNS (float): Part-of-speech tag accuracy (fine grained tags,
i.e. `Token.tag`).
"""
return self.tags.fscore * 100
@property
def token_acc(self):
"""RETURNS (float): Tokenization accuracy."""
return self.tokens.precision * 100
@property
def uas(self):
"""RETURNS (float): Unlabelled dependency score."""
return self.unlabelled.fscore * 100
@property
def las(self):
"""RETURNS (float): Labelled depdendency score."""
return self.labelled.fscore * 100
@property
def ents_p(self):
"""RETURNS (float): Named entity accuracy (precision)."""
return self.ner.precision * 100
@property
def ents_r(self):
"""RETURNS (float): Named entity accuracy (recall)."""
return self.ner.recall * 100
@property
def ents_f(self):
"""RETURNS (float): Named entity accuracy (F-score)."""
return self.ner.fscore * 100
@property
def scores(self):
"""RETURNS (dict): All scores with keys `uas`, `las`, `ents_p`,
`ents_r`, `ents_f`, `tags_acc` and `token_acc`.
"""
return {
"uas": self.uas,
"las": self.las,
@ -84,9 +106,20 @@ class Scorer(object):
"token_acc": self.token_acc,
}
def score(self, tokens, gold, verbose=False, punct_labels=("p", "punct")):
if len(tokens) != len(gold):
gold = GoldParse.from_annot_tuples(tokens, zip(*gold.orig_annot))
def score(self, doc, gold, verbose=False, punct_labels=("p", "punct")):
"""Update the evaluation scores from a single Doc / GoldParse pair.
doc (Doc): The predicted annotations.
gold (GoldParse): The correct annotations.
verbose (bool): Print debugging information.
punct_labels (tuple): Dependency labels for punctuation. Used to
evaluate dependency attachments to punctuation if `eval_punct` is
`True`.
DOCS: https://spacy.io/api/scorer#score
"""
if len(doc) != len(gold):
gold = GoldParse.from_annot_tuples(doc, zip(*gold.orig_annot))
gold_deps = set()
gold_tags = set()
gold_ents = set(tags_to_entities([annot[-1] for annot in gold.orig_annot]))
@ -96,7 +129,7 @@ class Scorer(object):
gold_deps.add((id_, head, dep.lower()))
cand_deps = set()
cand_tags = set()
for token in tokens:
for token in doc:
if token.orth_.isspace():
continue
gold_i = gold.cand_to_gold[token.i]
@ -116,7 +149,7 @@ class Scorer(object):
cand_deps.add((gold_i, gold_head, token.dep_.lower()))
if "-" not in [token[-1] for token in gold.orig_annot]:
cand_ents = set()
for ent in tokens.ents:
for ent in doc.ents:
first = gold.cand_to_gold[ent.start]
last = gold.cand_to_gold[ent.end - 1]
if first is None or last is None:

View File

@ -6,6 +6,7 @@ from spacy.attrs import ORTH, LENGTH
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab
from spacy.errors import ModelsWarning
from spacy.util import filter_spans
from ..util import get_doc
@ -219,3 +220,21 @@ def test_span_ents_property(doc):
assert sentences[2].ents[0].label_ == "PRODUCT"
assert sentences[2].ents[0].start == 11
assert sentences[2].ents[0].end == 14
def test_filter_spans(doc):
# Test filtering duplicates
spans = [doc[1:4], doc[6:8], doc[1:4], doc[10:14]]
filtered = filter_spans(spans)
assert len(filtered) == 3
assert filtered[0].start == 1 and filtered[0].end == 4
assert filtered[1].start == 6 and filtered[1].end == 8
assert filtered[2].start == 10 and filtered[2].end == 14
# Test filtering overlaps with longest preference
spans = [doc[1:4], doc[1:3], doc[5:10], doc[7:9], doc[1:4]]
filtered = filter_spans(spans)
assert len(filtered) == 2
assert len(filtered[0]) == 3
assert len(filtered[1]) == 5
assert filtered[0].start == 1 and filtered[0].end == 4
assert filtered[1].start == 5 and filtered[1].end == 10

View File

@ -140,3 +140,28 @@ def test_underscore_mutable_defaults_dict(en_vocab):
assert len(token1._.mutable) == 2
assert token1._.mutable["x"] == ["y"]
assert len(token2._.mutable) == 0
def test_underscore_dir(en_vocab):
"""Test that dir() correctly returns extension attributes. This enables
things like tab-completion for the attributes in doc._."""
Doc.set_extension("test_dir", default=None)
doc = Doc(en_vocab, words=["hello", "world"])
assert "_" in dir(doc)
assert "test_dir" in dir(doc._)
assert "test_dir" not in dir(doc[0]._)
assert "test_dir" not in dir(doc[0:2]._)
def test_underscore_docstring(en_vocab):
"""Test that docstrings are available for extension methods, even though
they're partials."""
def test_method(doc, arg1=1, arg2=2):
"""I am a docstring"""
return (arg1, arg2)
Doc.set_extension("test_docstrings", method=test_method)
doc = Doc(en_vocab, words=["hello", "world"])
assert test_method.__doc__ == "I am a docstring"
assert doc._.test_docstrings.__doc__.rsplit(". ")[-1] == "I am a docstring"

View File

@ -52,11 +52,13 @@ def test_get_pipe(nlp, name):
assert nlp.get_pipe(name) == new_pipe
@pytest.mark.parametrize("name,replacement", [("my_component", lambda doc: doc)])
def test_replace_pipe(nlp, name, replacement):
@pytest.mark.parametrize("name,replacement,not_callable", [("my_component", lambda doc: doc, {})])
def test_replace_pipe(nlp, name, replacement, not_callable):
with pytest.raises(ValueError):
nlp.replace_pipe(name, new_pipe)
nlp.add_pipe(new_pipe, name=name)
with pytest.raises(ValueError):
nlp.replace_pipe(name, not_callable)
nlp.replace_pipe(name, replacement)
assert nlp.get_pipe(name) != new_pipe
assert nlp.get_pipe(name) == replacement

View File

@ -6,20 +6,16 @@ import pytest
from spacy.lang.en import English
@pytest.mark.xfail(reason="Current default suffix rules avoid one upper-case letter before a dot.")
@pytest.mark.xfail(reason="default suffix rules avoid one upper-case letter before dot")
def test_issue3449():
nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))
nlp.add_pipe(nlp.create_pipe("sentencizer"))
text1 = "He gave the ball to I. Do you want to go to the movies with I?"
text2 = "He gave the ball to I. Do you want to go to the movies with I?"
text3 = "He gave the ball to I.\nDo you want to go to the movies with I?"
t1 = nlp(text1)
t2 = nlp(text2)
t3 = nlp(text3)
assert t1[5].text == 'I'
assert t2[5].text == 'I'
assert t3[5].text == 'I'
assert t1[5].text == "I"
assert t2[5].text == "I"
assert t3[5].text == "I"

View File

@ -0,0 +1,15 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from spacy.matcher import Matcher
from spacy.errors import MatchPatternError
def test_issue3549(en_vocab):
"""Test that match pattern validation doesn't raise on empty errors."""
matcher = Matcher(en_vocab, validate=True)
pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]
matcher.add("GOOD", None, pattern)
with pytest.raises(MatchPatternError):
matcher.add("BAD", None, [{"X": "Y"}])

View File

@ -0,0 +1,17 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from spacy.tokens import Doc, Token
from spacy.matcher import Matcher
@pytest.mark.xfail
def test_issue3555(en_vocab):
"""Test that custom extensions with default None don't break matcher."""
Token.set_extension("issue3555", default=None)
matcher = Matcher(en_vocab)
pattern = [{"LEMMA": "have"}, {"_": {"issue3555": True}}]
matcher.add("TEST", None, pattern)
doc = Doc(en_vocab, words=["have", "apple"])
matcher(doc)

View File

@ -0,0 +1,15 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from spacy.lang.es import Spanish
def test_issue3803():
"""Test that spanish num-like tokens have True for like_num attribute."""
nlp = Spanish()
text = "2 dos 1000 mil 12 doce"
doc = nlp(text)
assert [t.like_num for t in doc] == [True, True, True, True, True, True]

View File

@ -3,11 +3,13 @@ from __future__ import unicode_literals
import pytest
import os
import ctypes
from pathlib import Path
from spacy import util
from spacy import prefer_gpu, require_gpu
from spacy.compat import symlink_to, symlink_remove, path2str
from spacy.compat import symlink_to, symlink_remove, path2str, is_windows
from spacy._ml import PrecomputableAffine
from subprocess import CalledProcessError
@pytest.fixture
@ -28,12 +30,25 @@ def symlink_setup_target(request, symlink_target, symlink):
# https://github.com/pytest-dev/pytest/issues/2508#issuecomment-309934240
def cleanup():
symlink_remove(symlink)
# Remove symlink only if it was created
if symlink.exists():
symlink_remove(symlink)
os.rmdir(path2str(symlink_target))
request.addfinalizer(cleanup)
@pytest.fixture
def is_admin():
"""Determine if the tests are run as admin or not."""
try:
admin = os.getuid() == 0
except AttributeError:
admin = ctypes.windll.shell32.IsUserAnAdmin() != 0
return admin
@pytest.mark.parametrize("text", ["hello/world", "hello world"])
def test_util_ensure_path_succeeds(text):
path = util.ensure_path(text)
@ -88,7 +103,20 @@ def test_require_gpu():
require_gpu()
def test_create_symlink_windows(symlink_setup_target, symlink_target, symlink):
def test_create_symlink_windows(
symlink_setup_target, symlink_target, symlink, is_admin
):
"""Test the creation of symlinks on windows. If run as admin or not on windows it should succeed, otherwise a CalledProcessError should be raised."""
assert symlink_target.exists()
symlink_to(symlink, symlink_target)
assert symlink.exists()
if is_admin or not is_windows:
try:
symlink_to(symlink, symlink_target)
assert symlink.exists()
except CalledProcessError as e:
pytest.fail(e)
else:
with pytest.raises(CalledProcessError):
symlink_to(symlink, symlink_target)
assert not symlink.exists()

View File

@ -25,6 +25,11 @@ class Underscore(object):
object.__setattr__(self, "_start", start)
object.__setattr__(self, "_end", end)
def __dir__(self):
# Hack to enable autocomplete on custom extensions
extensions = list(self._extensions.keys())
return ["set", "get", "has"] + extensions
def __getattr__(self, name):
if name not in self._extensions:
raise AttributeError(Errors.E046.format(name=name))
@ -32,7 +37,16 @@ class Underscore(object):
if getter is not None:
return getter(self._obj)
elif method is not None:
return functools.partial(method, self._obj)
method_partial = functools.partial(method, self._obj)
# Hack to port over docstrings of the original function
# See https://stackoverflow.com/q/27362727/6400719
method_docstring = method.__doc__ or ""
method_docstring_prefix = (
"This method is a partial function and its first argument "
"(the object it's called on) will be filled automatically. "
)
method_partial.__doc__ = method_docstring_prefix + method_docstring
return method_partial
else:
key = self._get_key(name)
if key in self._doc.user_data:

View File

@ -14,8 +14,11 @@ import functools
import itertools
import numpy.random
import srsly
from jsonschema import Draft4Validator
try:
import jsonschema
except ImportError:
jsonschema = None
try:
import cupy.random
@ -510,7 +513,7 @@ def decaying(start, stop, decay):
curr = float(start)
while True:
yield max(curr, stop)
curr -= (decay)
curr -= decay
def minibatch_by_words(items, size, tuples=True, count_words=len):
@ -571,6 +574,28 @@ def itershuffle(iterable, bufsize=1000):
raise StopIteration
def filter_spans(spans):
"""Filter a sequence of spans and remove duplicates or overlaps. Useful for
creating named entities (where one token can only be part of one entity) or
when merging spans with `Retokenizer.merge`. When spans overlap, the (first)
longest span is preferred over shorter spans.
spans (iterable): The spans to filter.
RETURNS (list): The filtered spans.
"""
get_sort_key = lambda span: (span.end - span.start, span.start)
sorted_spans = sorted(spans, key=get_sort_key, reverse=True)
result = []
seen_tokens = set()
for span in sorted_spans:
# Check for end - 1 here because boundaries are inclusive
if span.start not in seen_tokens and span.end - 1 not in seen_tokens:
result.append(span)
seen_tokens.update(range(span.start, span.end))
result = sorted(result, key=lambda span: span.start)
return result
def to_bytes(getters, exclude):
serialized = OrderedDict()
for key, getter in getters.items():
@ -660,7 +685,9 @@ def get_json_validator(schema):
# validator that's used (e.g. different draft implementation), without
# having to change it all across the codebase.
# TODO: replace with (stable) Draft6Validator, if available
return Draft4Validator(schema)
if jsonschema is None:
raise ValueError(Errors.E136)
return jsonschema.Draft4Validator(schema)
def validate_schema(schema):

View File

@ -457,7 +457,7 @@ sit amet dignissim justo congue.
## Setup and installation {#setup}
Before running the setup, make sure your versions of
[Node](https://nodejs.org/en/) and [npm](https://www.npmjs.com/) are up to date.
[Node](https://nodejs.org/en/) and [npm](https://www.npmjs.com/) are up to date. Node v10.15 or later is required.
```bash
# Clone the repository

94
website/UNIVERSE.md Normal file
View File

@ -0,0 +1,94 @@
<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
# spaCy Universe
The [spaCy Universe](https://spacy.io/universe) collects the many great resources developed with or for spaCy. It
includes standalone packages, plugins, extensions, educational materials,
operational utilities and bindings for other languages.
If you have a project that you want the spaCy community to make use of, you can
suggest it by submitting a pull request to this repository. The Universe
database is open-source and collected in a simple JSON file.
Looking for inspiration for your own spaCy plugin or extension? Check out the
[`project idea`](https://github.com/explosion/spaCy/labels/project%20idea) label
on the issue tracker.
## Checklist
### Projects
✅ Libraries and packages should be **open-source** (with a user-friendly license) and at least somewhat **documented** (e.g. a simple `README` with usage instructions).
✅ We're happy to include work in progress and prereleases, but we'd like to keep the emphasis on projects that should be useful to the community **right away**.
✅ Demos and visualizers should be available via a **public URL**.
### Educational Materials
✅ Books should be **available for purchase or download** (not just pre-order). Ebooks and self-published books are fine, too, if they include enough substantial content.
✅ The `"url"` of book entries should either point to the publisher's website or a reseller of your choice (ideally one that ships worldwide or as close as possible).
✅ If an online course is only available behind a paywall, it should at least have a **free excerpt** or chapter available, so users know what to expect.
## JSON format
To add a project, fork this repository, edit the [`universe.json`](meta/universe.json)
and add an object of the following format to the list of `"resources"`. Before
you submit your pull request, make sure to use a linter to verify that your
markup is correct.
```json
{
"id": "unique-project-id",
"title": "Project title",
"slogan": "A short summary",
"description": "A longer description *Mardown allowed!*",
"github": "user/repo",
"pip": "package-name",
"code_example": [
"import spacy",
"import package_name",
"",
"nlp = spacy.load('en')",
"nlp.add_pipe(package_name)"
],
"code_language": "python",
"url": "https://example.com",
"thumb": "https://example.com/thumb.jpg",
"image": "https://example.com/image.jpg",
"author": "Your Name",
"author_links": {
"twitter": "username",
"github": "username",
"website": "https://example.com"
},
"category": ["pipeline", "standalone"],
"tags": ["some-tag", "etc"]
}
```
| Field | Type | Description |
| --- | --- | --- |
| `id` | string | Unique ID of the project. |
| `title` | string | Project title. If not set, the `id` will be used as the display title. |
| `slogan` | string | A short description of the project. Displayed in the overview and under the title. |
| `description` | string | A longer description of the project. Markdown is allowed, but should be limited to basic formatting like bold, italics, code or links. |
| `github` | string | Associated GitHub repo in the format `user/repo`. Will be displayed as a link and used for release, license and star badges. |
| `pip` | string | Package name on pip. If available, the installation command will be displayed. |
| `cran` | string | For R packages: package name on CRAN. If available, the installation command will be displayed. |
| `code_example` | array | Short example that shows how to use the project. Formatted as an array with one string per line. |
| `code_language` | string | Defaults to `'python'`. Optional code language used for syntax highlighting with [Prism](http://prismjs.com/). |
| `url` | string | Optional project link to display as button. |
| `thumb` | string | Optional URL to project thumbnail to display in overview and project header. Recommended size is 100x100px. |
| `image` | string | Optional URL to project image to display with description. |
| `author` | string | Name(s) of project author(s). |
| `author_links` | object | Usernames and links to display as icons to author info. Currently supports `twitter` and `github` usernames, as well as `website` link. |
| `category` | list | One or more categories to assign to project. Must be one of the available options. |
| `tags` | list | Still experimental and not used for filtering: one or more tags to assign to project. |
To separate them from the projects, educational materials also specify
`"type": "education`. Books can also set a `"cover"` field containing a URL
to a cover image. If available, it's used in the overview and displayed on
the individual book page.

View File

@ -510,7 +510,7 @@ described in any single publication. The model is a greedy transition-based
parser guided by a linear model whose weights are learned using the averaged
perceptron loss, via the
[dynamic oracle](http://www.aclweb.org/anthology/C12-1059) imitation learning
strategy. The transition system is equivalent to the BILOU tagging scheme.
strategy. The transition system is equivalent to the BILUO tagging scheme.
## Models and training data {#training}

View File

@ -189,7 +189,7 @@ using the [`package`](/api/cli#package) command.
<Infobox title="Changed in v2.1" variant="warning">
As of spaCy 2.1, the `--no-tagger`, `--no-parser` and `--no-parser` flags have
As of spaCy 2.1, the `--no-tagger`, `--no-parser` and `--no-entities` flags have
been replaced by a `--pipeline` option, which lets you define comma-separated
names of pipeline components to train. For example, `--pipeline tagger,parser`
will only train the tagger and parser.
@ -198,7 +198,7 @@ will only train the tagger and parser.
```bash
$ python -m spacy train [lang] [output_path] [train_path] [dev_path]
[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-examples] [--use-gpu]
[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-early-stopping] [--n-examples] [--use-gpu]
[--version] [--meta-path] [--init-tok2vec] [--parser-multitasks]
[--entity-multitasks] [--gold-preproc] [--noise-level] [--learn-tokens]
[--verbose]
@ -210,10 +210,11 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
| `output_path` | positional | Directory to store model in. Will be created if it doesn't exist. |
| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. |
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
| `--base-model`, `-b` <Tag variant="new">2.1</Tag> | option | Optional name of base model to update. Can be any loadable spaCy model. |
| `--pipeline`, `-p` <Tag variant="new">2.1</Tag> | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
| `--vectors`, `-v` | option | Model to load vectors from. |
| `--n-iter`, `-n` | option | Number of iterations (default: `30`). |
| `--n-early-stopping`, `-ne` | option | Maximum number of training epochs without dev accuracy improvement. |
| `--n-examples`, `-ns` | option | Number of examples to use (defaults to `0` for all examples). |
| `--use-gpu`, `-g` | option | Whether to use GPU. Can be either `0`, `1` or `-1`. |
| `--version`, `-V` | option | Model version. Will be written out to the model's `meta.json` after training. |
@ -274,7 +275,7 @@ an approximate language-modeling objective. Specifically, we load pre-trained
vectors, and train a component like a CNN, BiLSTM, etc to predict vectors which
match the pre-trained ones. The weights are saved to a directory after each
epoch. You can then pass a path to one of these pre-trained weights files to the
'spacy train' command.
`spacy train` command.
This technique may be especially helpful if you have little labelled data.
However, it's still quite experimental, so your mileage may vary. To load the
@ -285,24 +286,26 @@ improvement.
```bash
$ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width]
[--depth] [--embed-rows] [--dropout] [--seed] [--n-iter] [--use-vectors]
[--n-save_every]
```
| Argument | Type | Description |
| ---------------------- | ---------- | --------------------------------------------------------------------------------------------------------------------------------- |
| `texts_loc` | positional | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"`. [See here](#pretrain-jsonl) for details. |
| `vectors_model` | positional | Name or path to spaCy model with vectors to learn from. |
| `output_dir` | positional | Directory to write models to on each epoch. |
| `--width`, `-cw` | option | Width of CNN layers. |
| `--depth`, `-cd` | option | Depth of CNN layers. |
| `--embed-rows`, `-er` | option | Number of embedding rows. |
| `--dropout`, `-d` | option | Dropout rate. |
| `--batch-size`, `-bs` | option | Number of words per training batch. |
| `--max-length`, `-xw` | option | Maximum words per example. Longer examples are discarded. |
| `--min-length`, `-nw` | option | Minimum words per example. Shorter examples are discarded. |
| `--seed`, `-s` | option | Seed for random number generators. |
| `--n-iter`, `-i` | option | Number of iterations to pretrain. |
| `--use-vectors`, `-uv` | flag | Whether to use the static vectors as input features. |
| **CREATES** | weights | The pre-trained weights that can be used to initialize `spacy train`. |
| Argument | Type | Description |
| ----------------------- | ---------- | --------------------------------------------------------------------------------------------------------------------------------- |
| `texts_loc` | positional | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"`. [See here](#pretrain-jsonl) for details. |
| `vectors_model` | positional | Name or path to spaCy model with vectors to learn from. |
| `output_dir` | positional | Directory to write models to on each epoch. |
| `--width`, `-cw` | option | Width of CNN layers. |
| `--depth`, `-cd` | option | Depth of CNN layers. |
| `--embed-rows`, `-er` | option | Number of embedding rows. |
| `--dropout`, `-d` | option | Dropout rate. |
| `--batch-size`, `-bs` | option | Number of words per training batch. |
| `--max-length`, `-xw` | option | Maximum words per example. Longer examples are discarded. |
| `--min-length`, `-nw` | option | Minimum words per example. Shorter examples are discarded. |
| `--seed`, `-s` | option | Seed for random number generators. |
| `--n-iter`, `-i` | option | Number of iterations to pretrain. |
| `--use-vectors`, `-uv` | flag | Whether to use the static vectors as input features. |
| `--n-save_every`, `-se` | option | Save model every X batches. |
| **CREATES** | weights | The pre-trained weights that can be used to initialize `spacy train`. |
### JSONL format for raw text {#pretrain-jsonl}
@ -324,7 +327,7 @@ tokenization can be provided.
| Key | Type | Description |
| -------- | ------- | -------------------------------------------- |
| `text` | unicode | The raw input text. |
| `text` | unicode | The raw input text. Is not required if `tokens` available. |
| `tokens` | list | Optional tokenization, one string per token. |
```json
@ -332,6 +335,7 @@ tokenization can be provided.
{"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
{"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
{"text": "My cynical view on this is that it will never be free to the public. Reason: what would be the draw of joining the military? Right now their selling point is free Healthcare and Education. Ironically both are run horribly and most, that I've talked to, come out wishing they never went in."}
{"tokens": ["If", "tokens", "are", "provided", "then", "we", "can", "skip", "the", "raw", "input", "text"]}
```
## Init Model {#init-model new="2"}
@ -375,7 +379,7 @@ pipeline.
```bash
$ python -m spacy evaluate [model] [data_path] [--displacy-path] [--displacy-limit]
[--gpu-id] [--gold-preproc]
[--gpu-id] [--gold-preproc] [--return-scores]
```
| Argument | Type | Description |
@ -386,6 +390,7 @@ $ python -m spacy evaluate [model] [data_path] [--displacy-path] [--displacy-lim
| `--displacy-limit`, `-dl` | option | Number of parses to generate per file. Defaults to `25`. Keep in mind that a significantly higher number might cause the `.html` files to render slowly. |
| `--gpu-id`, `-g` | option | GPU to use, if any. Defaults to `-1` for CPU. |
| `--gold-preproc`, `-G` | flag | Use gold preprocessing. |
| `--return-scores`, `-R` | flag | Return dict containing model scores. |
| **CREATES** | `stdout`, HTML | Training results and optional displaCy visualizations. |
## Package {#package}

View File

@ -172,7 +172,7 @@ struct.
| `prefix` | <Abbr title="uint64_t">`attr_t`</Abbr> | Length-N substring from the start of the lexeme. Defaults to `N=1`. |
| `suffix` | <Abbr title="uint64_t">`attr_t`</Abbr> | Length-N substring from the end of the lexeme. Defaults to `N=3`. |
| `cluster` | <Abbr title="uint64_t">`attr_t`</Abbr> | Brown cluster ID. |
| `prob` | `float` | Smoothed log probability estimate of the lexeme's type. |
| `prob` | `float` | Smoothed log probability estimate of the lexeme's word type (context-independent entry in the vocabulary). |
| `sentiment` | `float` | A scalar value indicating positivity or negativity. |
### Lexeme.get_struct_attr {#lexeme_get_struct_attr tag="staticmethod, nogil" source="spacy/lexeme.pxd"}

View File

@ -102,10 +102,10 @@ Apply the pipeline's model to a batch of docs, without modifying them.
> scores = parser.predict([doc1, doc2])
> ```
| Name | Type | Description |
| ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs` | iterable | The documents to predict. |
| **RETURNS** | tuple | A `(scores, tensors)` tuple where `scores` is the model's prediction for each document and `tensors` is the token representations used to predict the scores. Each tensor is an array with one row for each token in the document. |
| Name | Type | Description |
| ----------- | ------------------- | ---------------------------------------------- |
| `docs` | iterable | The documents to predict. |
| **RETURNS** | `syntax.StateClass` | A helper class for the parse state (internal). |
## DependencyParser.set_annotations {#set_annotations tag="method"}

View File

@ -119,8 +119,27 @@ Update the models in the pipeline.
| `golds` | iterable | A batch of `GoldParse` objects or dictionaries. Dictionaries will be used to create [`GoldParse`](/api/goldparse) objects. For the available keys and their usage, see [`GoldParse.__init__`](/api/goldparse#init). |
| `drop` | float | The dropout rate. |
| `sgd` | callable | An optimizer. |
| `losses` | dict | Dictionary to update with the loss, keyed by pipeline component. |
| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
| **RETURNS** | dict | Results from the update. |
## Language.evaluate {#evaluate tag="method"}
Evaluate a model's pipeline components.
> #### Example
>
> ```python
> scorer = nlp.evaluate(docs_golds, verbose=True)
> print(scorer.scores)
> ```
| Name | Type | Description |
| -------------------------------------------- | -------- | ------------------------------------------------------------------------------------- |
| `docs_golds` | iterable | Tuples of `Doc` and `GoldParse` objects. |
| `verbose` | bool | Print debugging information. |
| `batch_size` | int | The batch size to use. |
| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
## Language.begin_training {#begin_training tag="method"}

View File

@ -128,7 +128,6 @@ The L2 norm of the lexeme's vector representation.
| `text` | unicode | Verbatim text content. |
| `orth` | int | ID of the verbatim text content. |
| `orth_` | unicode | Verbatim text content (identical to `Lexeme.text`). Exists mostly for consistency with the other attributes. |
| `lex_id` | int | ID of the lexeme's lexical type. |
| `rank` | int | Sequential ID of the lexemes's lexical type, used to index into tables, e.g. for word vectors. |
| `flags` | int | Container of the lexeme's binary flags. |
| `norm` | int | The lexemes's norm, i.e. a normalized form of the lexeme text. |
@ -161,6 +160,6 @@ The L2 norm of the lexeme's vector representation.
| `is_stop` | bool | Is the lexeme part of a "stop list"? |
| `lang` | int | Language of the parent vocabulary. |
| `lang_` | unicode | Language of the parent vocabulary. |
| `prob` | float | Smoothed log probability estimate of the lexeme's type. |
| `prob` | float | Smoothed log probability estimate of the lexeme's word type (context-independent entry in the vocabulary). |
| `cluster` | int | Brown cluster ID. |
| `sentiment` | float | A scalar value indicating the positivity or negativity of the lexeme. |

View File

@ -0,0 +1,58 @@
---
title: Scorer
teaser: Compute evaluation scores
tag: class
source: spacy/scorer.py
---
The `Scorer` computes and stores evaluation scores. It's typically created by
[`Language.evaluate`](/api/language#evaluate).
## Scorer.\_\_init\_\_ {#init tag="method"}
Create a new `Scorer`.
> #### Example
>
> ```python
> from spacy.scorer import Scorer
>
> scorer = Scorer()
> ```
| Name | Type | Description |
| ------------ | -------- | ------------------------------------------------------------ |
| `eval_punct` | bool | Evaluate the dependency attachments to and from punctuation. |
| **RETURNS** | `Scorer` | The newly created object. |
## Scorer.score {#score tag="method"}
Update the evaluation scores from a single [`Doc`](/api/doc) /
[`GoldParse`](/api/goldparse) pair.
> #### Example
>
> ```python
> scorer = Scorer()
> scorer.score(doc, gold)
> ```
| Name | Type | Description |
| -------------- | ----------- | -------------------------------------------------------------------------------------------------------------------- |
| `doc` | `Doc` | The predicted annotations. |
| `gold` | `GoldParse` | The correct annotations. |
| `verbose` | bool | Print debugging information. |
| `punct_labels` | tuple | Dependency labels for punctuation. Used to evaluate dependency attachments to punctuation if `eval_punct` is `True`. |
## Properties
| Name | Type | Description |
| ----------- | ----- | -------------------------------------------------------------------------------------------- |
| `token_acc` | float | Tokenization accuracy. |
| `tags_acc` | float | Part-of-speech tag accuracy (fine grained tags, i.e. `Token.tag`). |
| `uas` | float | Unlabelled dependency score. |
| `las` | float | Labelled dependency score. |
| `ents_p` | float | Named entity accuracy (precision). |
| `ents_r` | float | Named entity accuracy (recall). |
| `ents_f` | float | Named entity accuracy (F-score). |
| `scores` | dict | All scores with keys `uas`, `las`, `ents_p`, `ents_r`, `ents_f`, `tags_acc` and `token_acc`. |

View File

@ -424,7 +424,7 @@ The L2 norm of the token's vector representation.
| `ent_type` | int | Named entity type. |
| `ent_type_` | unicode | Named entity type. |
| `ent_iob` | int | IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. | |
| `ent_iob_` | unicode | IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. |
| `ent_iob_` | unicode | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set. |
| `ent_id` | int | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. |
| `ent_id_` | unicode | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. |
| `lemma` | int | Base form of the token, with no inflectional suffixes. |
@ -465,10 +465,10 @@ The L2 norm of the token's vector representation.
| `dep_` | unicode | Syntactic dependency relation. |
| `lang` | int | Language of the parent document's vocabulary. |
| `lang_` | unicode | Language of the parent document's vocabulary. |
| `prob` | float | Smoothed log probability estimate of token's type. |
| `prob` | float | Smoothed log probability estimate of token's word type (context-independent entry in the vocabulary). |
| `idx` | int | The character offset of the token within the parent document. |
| `sentiment` | float | A scalar value indicating the positivity or negativity of the token. |
| `lex_id` | int | Sequential ID of the token's lexical type. |
| `lex_id` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. |
| `rank` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. |
| `cluster` | int | Brown cluster ID. |
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |

View File

@ -211,16 +211,16 @@ Render a dependency parse tree or named entity visualization.
> html = displacy.render(doc, style="dep")
> ```
| Name | Type | Description | Default |
| ----------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------ | ---------------------- |
| `docs` | list, `Doc`, `Span` | Document(s) to visualize. |
| `style` | unicode | Visualization style, `'dep'` or `'ent'`. | `'dep'` |
| `page` | bool | Render markup as full HTML page. | `False` |
| `minify` | bool | Minify HTML markup. | `False` |
| `jupyter` | bool | Explicitly enable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. | detected automatically |
| `options` | dict | [Visualizer-specific options](#options), e.g. colors. | `{}` |
| `manual` | bool | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. | `False` |
| **RETURNS** | unicode | Rendered HTML markup. |
| Name | Type | Description | Default |
| ----------- | ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| `docs` | list, `Doc`, `Span` | Document(s) to visualize. |
| `style` | unicode | Visualization style, `'dep'` or `'ent'`. | `'dep'` |
| `page` | bool | Render markup as full HTML page. | `False` |
| `minify` | bool | Minify HTML markup. | `False` |
| `jupyter` | bool | Explicitly enable or disable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. Detected automatically if `None`. | `None` |
| `options` | dict | [Visualizer-specific options](#options), e.g. colors. | `{}` |
| `manual` | bool | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. | `False` |
| **RETURNS** | unicode | Rendered HTML markup. |
### Visualizer options {#displacy_options}
@ -351,7 +351,7 @@ the two-letter language code.
| `name` | unicode | Two-letter language code, e.g. `'en'`. |
| `cls` | `Language` | The language class, e.g. `English`. |
### util.lang_class_is_loaded (#util.lang_class_is_loaded tag="function" new="2.1")
### util.lang_class_is_loaded {#util.lang_class_is_loaded tag="function" new="2.1"}
Check whether a `Language` class is already loaded. `Language` classes are
loaded lazily, to avoid expensive setup code associated with the language data.
@ -654,6 +654,27 @@ for batching. Larger `buffsize` means less bias.
| `buffsize` | int | Items to hold back. |
| **YIELDS** | iterable | The shuffled iterator. |
### util.filter_spans {#util.filter_spans tag="function" new="2.1.4"}
Filter a sequence of [`Span`](/api/span) objects and remove duplicates or
overlaps. Useful for creating named entities (where one token can only be part
of one entity) or when merging spans with
[`Retokenizer.merge`](/api/doc#retokenizer.merge). When spans overlap, the
(first) longest span is preferred over shorter spans.
> #### Example
>
> ```python
> doc = nlp("This is a sentence.")
> spans = [doc[0:2], doc[0:2], doc[0:4]]
> filtered = filter_spans(spans)
> ```
| Name | Type | Description |
| ----------- | -------- | -------------------- |
| `spans` | iterable | The spans to filter. |
| **RETURNS** | list | The filtered spans. |
## Compatibility functions {#compat source="spacy/compaty.py"}
All Python code is written in an **intersection of Python 2 and Python 3**. This

View File

@ -306,7 +306,7 @@ vectors, they will be counted individually.
Load [GloVe](https://nlp.stanford.edu/projects/glove/) vectors from a directory.
Assumes binary format, that the vocab is in a `vocab.txt`, and that vectors are
named `vectors.{size}.[fd`.bin], e.g. `vectors.128.f.bin` for 128d float32
named `vectors.{size}.[fd.bin]`, e.g. `vectors.128.f.bin` for 128d float32
vectors, `vectors.300.d.bin` for 300d float64 (double) vectors, etc. By default
GloVe outputs 64-bit vectors.

Binary file not shown.

Before

Width:  |  Height:  |  Size: 1.6 MiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 270 KiB

View File

@ -4,7 +4,7 @@ example, everything that's in your `nlp` object. This means you'll have to
translate its contents and structure into a format that can be saved, like a
file or a byte string. This process is called serialization. spaCy comes with
**built-in serialization methods** and supports the
[Pickle protocol](http://www.diveintopython3.net/serializing.html#dump).
[Pickle protocol](https://www.diveinto.org/python3/serializing.html#dump).
> #### What's pickle?
>

View File

@ -50,7 +50,7 @@ together.
## Benchmarks {#benchmarks}
Two peer-reviewed papers in 2015 confirm that spaCy offers the **fastest
Two peer-reviewed papers in 2015 confirmed that spaCy offers the **fastest
syntactic parser in the world** and that **its accuracy is within 1% of the
best** available. The few systems that are more accurate are 20× slower or more.

View File

@ -326,7 +326,7 @@ URLs.
```text
### requirements.txt
spacy>=2.0.0,<3.0.0
https://github.com/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm
```
Specifying `#egg=` with the package name tells pip which package to expect from

View File

@ -260,7 +260,7 @@ def my_component(doc):
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(my_component, name="print_info", last=True)
print(nlp.pipe_names) # ['print_info', 'tagger', 'parser', 'ner']
print(nlp.pipe_names) # ['tagger', 'parser', 'ner', 'print_info']
doc = nlp(u"This is a sentence.")
```

View File

@ -214,7 +214,8 @@ example, you might want to match different spellings of a word, without having
to add a new pattern for each spelling.
```python
pattern = [{"TEXT": {"REGEX": "^([Uu](\\.?|nited) ?[Ss](\\.?|tates)"}},
pattern = [{"TEXT": {"REGEX": "^[Uu](\\.?|nited)$"}},
{"TEXT": {"REGEX": "^[Ss](\\.?|tates)$"}},
{"LOWER": "president"}]
```
@ -227,7 +228,7 @@ attributes:
pattern = [{"TAG": {"REGEX": "^V"}}]
# Match custom attribute values with regular expressions
pattern = [{"_": {"country": {"REGEX": "^([Uu](\\.?|nited) ?[Ss](\\.?|tates)"}}}]
pattern = [{"_": {"country": {"REGEX": "^[Uu](\\.?|nited) ?[Ss](\\.?|tates)$"}}}]
```
<Infobox title="Regular expressions in older versions" variant="warning">
@ -404,7 +405,7 @@ class BadHTMLMerger(object):
for match_id, start, end in matches:
spans.append(doc[start:end])
with doc.retokenize() as retokenizer:
for span in hashtags:
for span in spans:
retokenizer.merge(span)
for token in span:
token._.bad_html = True # Mark token as bad HTML
@ -678,7 +679,7 @@ for match_id, start, end in matches:
if doc.vocab.strings[match_id] == "HASHTAG":
hashtags.append(doc[start:end])
with doc.retokenize() as retokenizer:
for span in spans:
for span in hashtags:
retokenizer.merge(span)
for token in span:
token._.is_hashtag = True
@ -712,9 +713,9 @@ from spacy.matcher import PhraseMatcher
nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab)
terminology_list = [u"Barack Obama", u"Angela Merkel", u"Washington, D.C."]
terms = [u"Barack Obama", u"Angela Merkel", u"Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terminology_list]
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", None, *patterns)
doc = nlp(u"German Chancellor Angela Merkel and US President Barack Obama "

View File

@ -29,6 +29,19 @@ quick introduction.
> [pull requests](https://github.com/explosion/spaCy/pulls). You can find a
> "Suggest edits" link at the bottom of each page that points you to the source.
<Infobox title="Take the free interactive course">
[![Advanced NLP with spaCy](../images/course.jpg)](https://course.spacy.io)
In this course you'll learn how to use spaCy to build advanced natural language
understanding systems, using both rule-based and machine learning approaches. It
includes 55 exercises featuring interactive coding practice, multiple-choice
questions and slide decks.
<p><Button to="https://course.spacy.io" variant="primary">Start the course</Button></p>
</Infobox>
## What's spaCy? {#whats-spacy}
<Grid cols={2}>
@ -89,27 +102,12 @@ systems, or to pre-process text for **deep learning**.
integrated and opinionated. spaCy tries to avoid asking the user to choose
between multiple algorithms that deliver equivalent functionality. Keeping the
menu small lets spaCy deliver generally better performance and developer
experience.M
experience.
- **spaCy is not a company**. It's an open-source library. Our company
publishing spaCy and other software is called
[Explosion AI](https://explosion.ai).
<Infobox title="Download the spaCy Cheat Sheet!">
[![spaCy Cheatsheet](../images/cheatsheet.jpg)](http://datacamp-community-prod.s3.amazonaws.com/29aa28bf-570a-4965-8f54-d6a541ae4e06)
For the launch of our
["Advanced NLP with spaCy"](https://www.datacamp.com/courses/advanced-nlp-with-spacy)
course on DataCamp we created the first official spaCy cheat sheet! A handy
two-page reference to the most important concepts and features, from loading
models and accessing linguistic annotations, to custom pipeline components and
rule-based matching.
<p><Button to="http://datacamp-community-prod.s3.amazonaws.com/29aa28bf-570a-4965-8f54-d6a541ae4e06" variant="primary">Download</Button></p>
</Infobox>
## Features {#features}
In the documentation, you'll come across mentions of spaCy's features and

View File

@ -136,7 +136,7 @@ The entity visualizer lets you customize the following `options`:
| Argument | Type | Description | Default |
| -------- | ---- | ------------------------------------------------------------------------------------- | ------- |
| `ents` | list |  Entity types to highlight (`None` for all types). | `None` |
| `colors` | dict | Color overrides. Entity types in lowercase should be mapped to color names or values. | `{}` |
| `colors` | dict | Color overrides. Entity types in uppercase should be mapped to color names or values. | `{}` |
If you specify a list of `ents`, only those entity types will be rendered for
example, you can choose to display `PERSON` entities. Internally, the visualizer

View File

@ -90,7 +90,8 @@
{ "text": "StringStore", "url": "/api/stringstore" },
{ "text": "Vectors", "url": "/api/vectors" },
{ "text": "GoldParse", "url": "/api/goldparse" },
{ "text": "GoldCorpus", "url": "/api/goldcorpus" }
{ "text": "GoldCorpus", "url": "/api/goldcorpus" },
{ "text": "Scorer", "url": "/api/scorer" }
]
},
{

View File

@ -1,5 +1,107 @@
{
"resources": [
{
"id": "nlp-architect",
"title": "NLP Architect",
"slogan": "Python lib for exploring Deep NLP & NLU by Intel AI",
"github": "NervanaSystems/nlp-architect",
"pip": "nlp-architect",
"thumb": "https://i.imgur.com/vMideRx.png",
"category": ["standalone", "research"],
"tags": ["pytorch"]
},
{
"id": "NeuroNER",
"title": "NeuroNER",
"slogan": "Named-entity recognition using neural networks",
"github": "Franck-Dernoncourt/NeuroNER",
"pip": "pyneuroner[cpu]",
"code_example": [
"from neuroner import neuromodel",
"nn = neuromodel.NeuroNER(train_model=False, use_pretrained_model=True)"
],
"category": ["ner"],
"tags": ["standalone"]
},
{
"id": "NLPre",
"title": "NLPre",
"slogan": "Natural Language Preprocessing Library for health data and more",
"github": "NIHOPA/NLPre",
"pip": "nlpre",
"code_example": [
"from nlpre import titlecaps, dedash, identify_parenthetical_phrases",
"from nlpre import replace_acronyms, replace_from_dictionary",
"ABBR = identify_parenthetical_phrases()(text)",
"parsers = [dedash(), titlecaps(), replace_acronyms(ABBR),",
" replace_from_dictionary(prefix='MeSH_')]",
"for f in parsers:",
" text = f(text)",
"print(text)"
],
"category": ["scientific"]
},
{
"id": "Chatterbot",
"title": "Chatterbot",
"slogan": "A machine-learning based conversational dialog engine for creating chat bots",
"github": "gunthercox/ChatterBot",
"pip": "chatterbot",
"thumb": "https://i.imgur.com/eyAhwXk.jpg",
"code_example": [
"from chatterbot import ChatBot",
"from chatterbot.trainers import ListTrainer",
"# Create a new chat bot named Charlie",
"chatbot = ChatBot('Charlie')",
"trainer = ListTrainer(chatbot)",
"trainer.train([",
"'Hi, can I help you?',",
"'Sure, I would like to book a flight to Iceland.",
"'Your flight has been booked.'",
"])",
"",
"response = chatbot.get_response('I would like to book a flight.')"
],
"author": "Gunther Cox",
"author_links": {
"github": "gunthercox"
},
"category": ["conversational", "standalone"],
"tags": ["chatbots"]
},
{
"id": "saber",
"title": "saber",
"slogan": "Deep-learning based tool for information extraction in the biomedical domain",
"github": "BaderLab/saber",
"pip": "saber",
"thumb": "https://raw.githubusercontent.com/BaderLab/saber/master/docs/img/saber_logo.png",
"code_example": [
"from saber.saber import Saber",
"saber = Saber()",
"saber.load('PRGE')",
"saber.annotate('The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53.')"
],
"author": "Bader Lab, University of Toronto",
"category": ["scientific"],
"tags": ["keras", "biomedical"]
},
{
"id": "alibi",
"title": "alibi",
"slogan": "Algorithms for monitoring and explaining machine learning models ",
"github": "SeldonIO/alibi",
"pip": "alibi",
"thumb": "https://i.imgur.com/YkzQHRp.png",
"code_example": [
"from alibi.explainers import AnchorTabular",
"explainer = AnchorTabular(predict_fn, feature_names)",
"explainer.fit(X_train)",
"explainer.explain(x)"
],
"author": "Seldon",
"category": ["standalone", "research"]
},
{
"id": "spacymoji",
"slogan": "Emoji handling and meta data as a spaCy pipeline component",
@ -143,7 +245,7 @@
"doc = nlp(my_doc_text)"
],
"author": "tc64",
"author_link": {
"author_links": {
"github": "tc64"
},
"category": ["pipeline"]
@ -346,7 +448,7 @@
"author_links": {
"github": "huggingface"
},
"category": ["standalone", "conversational"],
"category": ["standalone", "conversational", "models"],
"tags": ["coref"]
},
{
@ -538,7 +640,7 @@
"twitter": "allenai_org",
"website": "http://allenai.org"
},
"category": ["models", "research"]
"category": ["scientific", "models", "research"]
},
{
"id": "textacy",
@ -601,7 +703,7 @@
"github": "ahalterman",
"twitter": "ahalterman"
},
"category": ["standalone"]
"category": ["standalone", "scientific"]
},
{
"id": "kindred",
@ -626,7 +728,7 @@
"author_links": {
"github": "jakelever"
},
"category": ["standalone"]
"category": ["standalone", "scientific"]
},
{
"id": "sense2vec",
@ -837,6 +939,42 @@
},
"category": ["standalone"]
},
{
"id": "prefect",
"title": "Prefect",
"slogan": "Workflow management system designed for modern infrastructure",
"github": "PrefectHQ/prefect",
"pip": "prefect",
"thumb": "https://i.imgur.com/oLTwr0e.png",
"code_example": [
"from prefect import Flow",
"from prefect.tasks.spacy.spacy_tasks import SpacyNLP",
"import spacy",
"",
"nlp = spacy.load(\"en_core_web_sm\")",
"",
"with Flow(\"Natural Language Processing\") as flow:",
" doc = SpacyNLP(text=\"This is some text\", nlp=nlp)",
"",
"flow.run()"
],
"author": "Prefect",
"author_links": {
"website": "https://prefect.io"
},
"category": ["standalone"]
},
{
"id": "graphbrain",
"title": "Graphbrain",
"slogan": "Automated meaning extraction and text understanding",
"description": "Graphbrain is an Artificial Intelligence open-source software library and scientific research tool. Its aim is to facilitate automated meaning extraction and text understanding, as well as the exploration and inference of knowledge.",
"github": "graphbrain/graphbrain",
"pip": "graphbrain",
"thumb": "https://i.imgur.com/cct9W1E.png",
"author": "Graphbrain",
"category": ["standalone"]
},
{
"type": "education",
"id": "oreilly-python-ds",
@ -883,36 +1021,6 @@
"author": "Bhargav Srinivasa-Desikan",
"category": ["books"]
},
{
"type": "education",
"id": "datacamp-nlp-fundamentals",
"title": "Natural Language Processing Fundamentals in Python",
"slogan": "Datacamp, 2017",
"description": "In this course, you'll learn Natural Language Processing (NLP) basics, such as how to identify and separate words, how to extract topics in a text, and how to build your own fake news classifier. You'll also learn how to use basic libraries such as NLTK, alongside libraries which utilize deep learning to solve common NLP problems. This course will give you the foundation to process and parse text as you move forward in your Python learning.",
"url": "https://www.datacamp.com/courses/natural-language-processing-fundamentals-in-python",
"thumb": "https://i.imgur.com/0Zks7c0.jpg",
"author": "Katharine Jarmul",
"author_links": {
"twitter": "kjam"
},
"category": ["courses"]
},
{
"type": "education",
"id": "datacamp-advanced-nlp",
"title": "Advanced Natural Language Processing with spaCy",
"slogan": "Datacamp, 2019",
"description": "If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other? In this course, you'll learn how to use spaCy, a fast-growing industry standard library for NLP in Python, to build advanced natural language understanding systems, using both rule-based and machine learning approaches.",
"url": "https://www.datacamp.com/courses/advanced-nlp-with-spacy",
"thumb": "https://i.imgur.com/0Zks7c0.jpg",
"author": "Ines Montani",
"author_links": {
"twitter": "_inesmontani",
"github": "ines",
"website": "https://ines.io"
},
"category": ["courses"]
},
{
"type": "education",
"id": "learning-path-spacy",
@ -924,6 +1032,23 @@
"author": "Aaron Kramer",
"category": ["courses"]
},
{
"type": "education",
"id": "spacy-course",
"title": "Advanced NLP with spaCy",
"slogan": "spaCy, 2019",
"description": "In this free interactive course, you'll learn how to use spaCy to build advanced natural language understanding systems, using both rule-based and machine learning approaches.",
"url": "https://course.spacy.io",
"image": "https://i.imgur.com/JC00pHW.jpg",
"thumb": "https://i.imgur.com/5RXLtrr.jpg",
"author": "Ines Montani",
"author_links": {
"twitter": "_inesmontani",
"github": "ines",
"website": "https://ines.io"
},
"category": ["courses"]
},
{
"type": "education",
"id": "video-spacys-ner-model",
@ -1010,6 +1135,22 @@
},
"category": ["podcasts"]
},
{
"type": "education",
"id": "twimlai-podcast",
"title": "TWiML & AI: Practical NLP with spaCy and Prodigy",
"slogan": "May 2019",
"description": "\"Ines and I caught up to discuss her various projects, including the aforementioned SpaCy, an open-source NLP library built with a focus on industry and production use cases. In our conversation, Ines gives us an overview of the SpaCy Library, a look at some of the use cases that excite her, and the Spacy community and contributors. We also discuss her work with Prodigy, an annotation service tool that uses continuous active learning to train models, and finally, what other exciting projects she is working on.\"",
"thumb": "https://i.imgur.com/ng2F5gK.png",
"url": "https://twimlai.com/twiml-talk-262-practical-natural-language-processing-with-spacy-and-prodigy-w-ines-montani",
"iframe": "https://html5-player.libsyn.com/embed/episode/id/9691514/height/90/theme/custom/thumbnail/no/preload/no/direction/backward/render-playlist/no/custom-color/3e85b1/",
"iframe_height": 90,
"author": "Sam Charrington",
"author_links": {
"website": "https://twimlai.com"
},
"category": ["podcasts"]
},
{
"id": "adam_qas",
"title": "ADAM: Question Answering System",
@ -1068,7 +1209,7 @@
"github": "ecohealthalliance",
"website": " https://ecohealthalliance.org/"
},
"category": ["research", "standalone"]
"category": ["scientific", "standalone"]
},
{
"id": "self-attentive-parser",
@ -1311,8 +1452,100 @@
"website": "http://w4nderlu.st"
},
"category": ["standalone", "research"]
},
{
"id": "gracyql",
"title": "gracyql",
"slogan": "A thin GraphQL wrapper around spacy",
"github": "oterrier/gracyql",
"description": "An example of a basic [Starlette](https://github.com/encode/starlette) app using [Spacy](https://github.com/explosion/spaCy) and [Graphene](https://github.com/graphql-python/graphene). The main goal is to be able to use the amazing power of spaCy from other languages and retrieving only the information you need thanks to the GraphQL query definition. The GraphQL schema tries to mimic as much as possible the original Spacy API with classes Doc, Span and Token.",
"thumb": "https://i.imgur.com/xC7zpTO.png",
"category": ["apis"],
"tags": ["graphql"],
"code_example": [
"query ParserDisabledQuery {",
" nlp(model: \"en\", disable: [\"parser\", \"ner\"]) {",
" doc(text: \"I live in Grenoble, France\") {",
" text",
" tokens {",
" id",
" pos",
" lemma",
" dep",
" }",
" ents {",
" start",
" end",
" label",
" }",
" }",
" }",
"}"
],
"code_language": "json",
"author": "Olivier Terrier",
"author_links": {
"github": "oterrier"
}
},
{
"id": "pyInflect",
"slogan": "A python module for word inflections",
"description": "This package uses the [spaCy 2.0 extensions](https://spacy.io/usage/processing-pipelines#extensions) to add word inflections to the system.",
"github": "bjascob/pyInflect",
"pip": "pyinflect",
"code_example": [
"import spacy",
"import pyinflect",
"",
"nlp = spacy.load('en_core_web_sm')",
"doc = nlp('This is an example.')",
"doc[3].tag_ # NN",
"doc[3]._.inflect('NNS') # examples"
],
"author": "Brad Jascob",
"author_links": {
"github": "bjascob"
},
"category": ["pipeline"],
"tags": ["inflection"]
},
{
"id": "NGym",
"title": "NeuralGym",
"slogan": "A little Windows GUI for training models with spaCy",
"description": "NeuralGym is a Python application for Windows with a graphical user interface to train models with spaCy. Run the application, select an output folder, a training data file in spaCy's data format, a spaCy model or blank model and press 'Start'.",
"github": "d5555/NeuralGym",
"url": "https://github.com/d5555/NeuralGym",
"image": "https://github.com/d5555/NeuralGym/raw/master/NGym.png",
"thumb": "https://github.com/d5555/NeuralGym/raw/master/NGym/web.png",
"author": "d5555",
"category": ["training"],
"tags": ["windows"]
},
{
"id": "holmes",
"title": "Holmes",
"slogan": "Information extraction from English and German texts based on predicate logic",
"github": "msg-systems/holmes-extractor",
"url": "https://github.com/msg-systems/holmes-extractor",
"description": "Holmes is a Python 3 library that supports a number of use cases involving information extraction from English and German texts, including chatbot, structural search, topic matching and supervised document classification.",
"pip": "holmes-extractor",
"category": ["conversational", "standalone"],
"tags": ["chatbots", "text-processing"],
"code_example": [
"import holmes_extractor as holmes",
"holmes_manager = holmes.Manager(model='en_coref_lg')",
"holmes_manager.register_search_phrase('A big dog chases a cat')",
"holmes_manager.start_chatbot_mode_console()"
],
"author": "Richard Paul Hudson",
"author_links": {
"github": "richardpaulhudson"
}
}
],
"categories": [
{
"label": "Projects",
@ -1337,6 +1570,11 @@
"title": "Research",
"description": "Frameworks and utilities for developing better NLP models, especially using neural networks"
},
{
"id": "scientific",
"title": "Scientific",
"description": "Frameworks and utilities for scientific text processing"
},
{
"id": "visualizers",
"title": "Visualizers",
@ -1356,6 +1594,11 @@
"id": "standalone",
"title": "Standalone",
"description": "Self-contained libraries or tools that use spaCy under the hood"
},
{
"id": "models",
"title": "Models",
"description": "Third-party pre-trained models for different languages and domains"
}
]
},

View File

@ -93,6 +93,7 @@ const SEO = ({ description, lang, title, section, sectionTitle, bodyClass }) =>
return (
<Helmet
defer={false}
htmlAttributes={{ lang }}
bodyAttributes={{ class: bodyClass }}
title={pageTitle}

View File

@ -125,7 +125,7 @@ const UniverseContent = ({ content = [], categories, pageContext, location, mdxC
</p>
<InlineList>
<Button variant="primary" to={github('website/universe/README.md')}>
<Button variant="primary" to={github('website/UNIVERSE.md')}>
Read the docs
</Button>
<Button icon="code" to={github('website/meta/universe.json')}>

View File

@ -75,16 +75,6 @@ const Landing = ({ data }) => {
<LandingSubtitle>in Python</LandingSubtitle>
</LandingHeader>
<LandingGrid blocks>
<LandingCard title="Fastest in the world">
<p>
spaCy excels at large-scale information extraction tasks. It's written from
the ground up in carefully memory-managed Cython. Independent research has
confirmed that spaCy is the fastest in the world. If your application needs
to process entire web dumps, spaCy is the library you want to be using.
</p>
<LandingButton to="/usage/facts-figures">Facts & Figures</LandingButton>
</LandingCard>
<LandingCard title="Get things done">
<p>
spaCy is designed to help you do real work to build real products, or
@ -92,7 +82,16 @@ const Landing = ({ data }) => {
wasting it. It's easy to install, and its API is simple and productive. We
like to think of spaCy as the Ruby on Rails of Natural Language Processing.
</p>
<LandingButton to="/usage">Get started</LandingButton>
<LandingButton to="/usage/spacy-101">Get started</LandingButton>
</LandingCard>
<LandingCard title="Blazing fast">
<p>
spaCy excels at large-scale information extraction tasks. It's written from
the ground up in carefully memory-managed Cython. Independent research in
2015 found spaCy to be the fastest in the world. If your application needs
to process entire web dumps, spaCy is the library you want to be using.
</p>
<LandingButton to="/usage/facts-figures">Facts & Figures</LandingButton>
</LandingCard>
<LandingCard title="Deep learning">
@ -129,6 +128,7 @@ const Landing = ({ data }) => {
<Li>
Pre-trained <strong>word vectors</strong>
</Li>
<Li>State-of-the-art speed</Li>
<Li>
Easy <strong>deep learning</strong> integration
</Li>
@ -144,7 +144,6 @@ const Landing = ({ data }) => {
<Li>
Easy <strong>model packaging</strong> and deployment
</Li>
<Li>State-of-the-art speed</Li>
<Li>Robust, rigorously evaluated accuracy</Li>
</Ul>
</LandingCol>