Merge pull request #5788 from explosion/master-tmp

Ines Montani 2020-07-20 15:39:24 +02:00 committed by GitHub
commit 311d0bde29
45 changed files with 30377 additions and 28795 deletions

106
.github/contributors/PluieElectrique.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Pluie |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-06-18 |
| GitHub username | PluieElectrique |
| Website (optional) | |

106
.github/contributors/abchapman93.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Alec Chapman |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 7/17/2020 |
| GitHub username | abchapman93 |
| Website (optional) | |

106
.github/contributors/gandersen101.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Grant Andersen |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 07.06.2020 |
| GitHub username | gandersen101 |
| Website (optional) | |

106
.github/contributors/jbesomi.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jonathan B. |
| Company name (if applicable) | besomi.ai |
| Title or role (if applicable) | - |
| Date | 07.07.2020 |
| GitHub username | jbesomi |
| Website (optional) | besomi.ai |

106
.github/contributors/mikeizbicki.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Mike Izbicki |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 02 Jun 2020 |
| GitHub username | mikeizbicki |
| Website (optional) | https://izbicki.me |

106
.github/contributors/rameshhpathak.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Ramesh Pathak |
| Company name (if applicable) | Diyo AI |
| Title or role (if applicable) | AI Engineer |
| Date | June 21, 2020 |
| GitHub username | rameshhpathak |
| Website (optional) | rameshhpathak.github.io |

106
.github/contributors/richardliaw.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Richard Liaw |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 06/22/2020 |
| GitHub username | richardliaw |
| Website (optional) | |

1
.gitignore vendored
View File

@ -71,6 +71,7 @@ Pipfile.lock
 *.egg
 .eggs
 MANIFEST
+spacy/git_info.py
 # Temporary files
 *.~*

View File

@ -5,3 +5,4 @@ include README.md
 include pyproject.toml
 recursive-exclude spacy/lang *.json
 recursive-include spacy/lang *.json.gz
+recursive-include licenses *

View File

@ -5,7 +5,7 @@ VENV := ./env$(PYVER)
 version := $(shell "bin/get-version.sh")
 dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
-$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) spacy-lookups-data jieba pkuseg==0.0.22 sudachipy sudachidict_core
+$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) spacy-lookups-data jieba pkuseg==0.0.25 sudachipy sudachidict_core
 chmod a+rx $@
 cp $@ dist/spacy.pex
@ -15,7 +15,7 @@ dist/pytest.pex : wheelhouse/pytest-*.whl
 wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
 $(VENV)/bin/pip wheel . -w ./wheelhouse
-$(VENV)/bin/pip wheel spacy-lookups-data jieba pkuseg==0.0.22 sudachipy sudachidict_core -w ./wheelhouse
+$(VENV)/bin/pip wheel spacy-lookups-data jieba pkuseg==0.0.25 sudachipy sudachidict_core -w ./wheelhouse
 touch $@
 wheelhouse/pytest-%.whl : $(VENV)/bin/pex

View File

@ -16,8 +16,6 @@ from __future__ import unicode_literals, print_function
 import plac
 import random
 from pathlib import Path
-from spacy.vocab import Vocab
 import spacy
 from spacy.kb import KnowledgeBase
@ -61,13 +59,13 @@ TRAIN_DATA = sample_train_data()
 output_dir=("Optional output directory", "option", "o", Path),
 n_iter=("Number of training iterations", "option", "n", int),
 )
-def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
+def main(kb_path, vocab_path, output_dir=None, n_iter=50):
 """Create a blank model with the specified vocab, set up the pipeline and train the entity linker.
 The `vocab` should be the one used during creation of the KB."""
-vocab = Vocab().from_disk(vocab_path)
 # create blank English model with correct vocab
-nlp = spacy.blank("en", vocab=vocab)
-nlp.vocab.vectors.name = "nel_vectors"
+nlp = spacy.blank("en")
+nlp.vocab.from_disk(vocab_path)
+nlp.vocab.vectors.name = "spacy_pretrained_vectors"
 print("Created blank 'en' model with vocab from '%s'" % vocab_path)
 # Add a sentencizer component. Alternatively, add a dependency parser for higher accuracy.
@ -111,7 +109,7 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
 "Removed", kb_id, "from training because it is not in the KB."
 )
 annotation_clean["links"][offset] = new_dict
-train_examples .append(Example.from_dict(doc, annotation_clean))
+train_examples.append(Example.from_dict(doc, annotation_clean))
 with nlp.select_pipes(enable="entity_linker"): # only train entity linker
 # reset and initialize the weights randomly

View File

@ -4,13 +4,14 @@ import sys
 import platform
 from distutils.command.build_ext import build_ext
 from distutils.sysconfig import get_python_inc
-import distutils.util
 from distutils import ccompiler, msvccompiler
 import numpy
 from pathlib import Path
 import shutil
 from Cython.Build import cythonize
 from Cython.Compiler import Options
+import os
+import subprocess
 ROOT = Path(__file__).parent
@ -75,7 +76,6 @@ COPY_FILES = {
 def is_new_osx():
 """Check whether we're on OSX >= 10.7"""
-name = distutils.util.get_platform()
 if sys.platform != "darwin":
 return False
 mac_ver = platform.mac_ver()[0]
@ -118,6 +118,53 @@ class build_ext_subclass(build_ext, build_ext_options):
 build_ext.build_extensions(self)
+# Include the git version in the build (adapted from NumPy)
+# Copyright (c) 2005-2020, NumPy Developers.
+# BSD 3-Clause license, see licenses/3rd_party_licenses.txt
+def write_git_info_py(filename="spacy/git_info.py"):
+def _minimal_ext_cmd(cmd):
+# construct minimal environment
+env = {}
+for k in ["SYSTEMROOT", "PATH", "HOME"]:
+v = os.environ.get(k)
+if v is not None:
+env[k] = v
+# LANGUAGE is used on win32
+env["LANGUAGE"] = "C"
+env["LANG"] = "C"
+env["LC_ALL"] = "C"
+out = subprocess.check_output(cmd, stderr=subprocess.STDOUT, env=env)
+return out
+git_version = "Unknown"
+if Path(".git").exists():
+try:
+out = _minimal_ext_cmd(["git", "rev-parse", "--short", "HEAD"])
+git_version = out.strip().decode("ascii")
+except Exception:
+pass
+elif Path(filename).exists():
+# must be a source distribution, use existing version file
+try:
+a = open(filename, "r")
+lines = a.readlines()
+git_version = lines[-1].split('"')[1]
+except Exception:
+pass
+finally:
+a.close()
+text = """# THIS FILE IS GENERATED FROM SPACY SETUP.PY
+#
+GIT_VERSION = "%(git_version)s"
+"""
+a = open(filename, "w")
+try:
+a.write(text % {"git_version": git_version})
+finally:
+a.close()
 def clean(path):
 for path in path.glob("**/*"):
 if path.is_file() and path.suffix in (".so", ".cpp", ".html"):
@ -126,6 +173,7 @@ def clean(path):
 def setup_package():
+write_git_info_py()
 if len(sys.argv) > 1 and sys.argv[1] == "clean":
 return clean(PACKAGE_ROOT)
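
Review note: the `write_git_info_py` hunk above does three things: it builds a reduced environment for the `git` subprocess, renders a small version file from a template, and, for source distributions, reads the version back from the last line of an existing file. The sketch below isolates those steps so they can be checked without running `git`. The helper names `minimal_env`, `render_git_info` and `parse_git_info` are hypothetical and are not part of the commit.

```python
import os

# Same template shape as the generated file in the setup.py hunk.
TEMPLATE = """# THIS FILE IS GENERATED FROM SPACY SETUP.PY
#
GIT_VERSION = "%(git_version)s"
"""


def minimal_env(environ=None):
    """Build the reduced environment handed to the git subprocess."""
    environ = os.environ if environ is None else environ
    env = {}
    for key in ("SYSTEMROOT", "PATH", "HOME"):
        value = environ.get(key)
        if value is not None:
            env[key] = value
    # Force the C locale so that git output is not localized.
    env.update({"LANGUAGE": "C", "LANG": "C", "LC_ALL": "C"})
    return env


def render_git_info(git_version):
    """Return the text of the generated version file."""
    return TEMPLATE % {"git_version": git_version}


def parse_git_info(text):
    """Recover the version from the quoted value on the last line."""
    return text.splitlines()[-1].split('"')[1]
```

Rendering and then parsing should round-trip the version string, which is the property the source-distribution branch relies on.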

View File

@ -31,6 +31,41 @@ class EnglishDefaults(Language.Defaults):
 {"tags": ["``", "''"], "variants": [('"', '"'), ("", "")]},
 ]
+@classmethod
+def is_base_form(cls, univ_pos, morphology=None):
+"""
+Check whether we're dealing with an uninflected paradigm, so we can
+avoid lemmatization entirely.
+univ_pos (unicode / int): The token's universal part-of-speech tag.
+morphology (dict): The token's morphological features following the
+Universal Dependencies scheme.
+"""
+if morphology is None:
+morphology = {}
+if univ_pos == "noun" and morphology.get("Number") == "sing":
+return True
+elif univ_pos == "verb" and morphology.get("VerbForm") == "inf":
+return True
+# This maps 'VBP' to base form -- probably just need 'IS_BASE'
+# morphology
+elif univ_pos == "verb" and (
+morphology.get("VerbForm") == "fin"
+and morphology.get("Tense") == "pres"
+and morphology.get("Number") is None
+):
+return True
+elif univ_pos == "adj" and morphology.get("Degree") == "pos":
+return True
+elif morphology.get("VerbForm") == "inf":
+return True
+elif morphology.get("VerbForm") == "none":
+return True
+elif morphology.get("Degree") == "pos":
+return True
+else:
+return False
 class English(Language):
 lang = "en"
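
Review note: the `is_base_form` classmethod added above decides from the universal POS tag and the morphological features whether a token is already uninflected. A minimal standalone sketch of the same decision table follows, assuming lowercase POS strings and a plain dict of features as in the hunk; it is a free function here only for illustration, not how the commit wires it into `EnglishDefaults`.

```python
def is_base_form(univ_pos, morphology=None):
    """Return True when the token is already in its uninflected form."""
    morphology = morphology or {}
    verb_form = morphology.get("VerbForm")
    degree = morphology.get("Degree")
    if univ_pos == "noun" and morphology.get("Number") == "sing":
        return True
    if univ_pos == "verb" and verb_form == "inf":
        return True
    # Finite present-tense verb with no Number feature (the 'VBP' case).
    if (
        univ_pos == "verb"
        and verb_form == "fin"
        and morphology.get("Tense") == "pres"
        and morphology.get("Number") is None
    ):
        return True
    if univ_pos == "adj" and degree == "pos":
        return True
    # POS-independent fallbacks, mirroring the last three branches.
    return verb_form in ("inf", "none") or degree == "pos"


print(is_base_form("noun", {"Number": "sing"}))   # singular noun
print(is_base_form("noun", {"Number": "plur"}))   # plural noun
```

Note that the final fallbacks ignore `univ_pos`, so any token carrying `Degree == "pos"` counts as a base form regardless of its tag.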

View File

@ -41,9 +41,6 @@ class FrenchLemmatizer(Lemmatizer):
 univ_pos = "sconj"
 else:
 return [self.lookup(string)]
-# See Issue #435 for example of where this logic is requied.
-if self.is_base_form(univ_pos, morphology):
-return list(set([string.lower()]))
 index_table = self.lookups.get_table("lemma_index", {})
 exc_table = self.lookups.get_table("lemma_exc", {})
 rules_table = self.lookups.get_table("lemma_rules", {})

View File

@ -8,6 +8,6 @@ Example sentences to test spaCy and its language models.
 sentences = [
 "Լոնդոնը Միացյալ Թագավորության մեծ քաղաք է։",
 "Ո՞վ է Ֆրանսիայի նախագահը։",
-"Որն է Միացյալ Նահանգների մայրաքաղաքը։",
+"Ո՞րն է Միացյալ Նահանգների մայրաքաղաքը։",
 "Ե՞րբ է ծնվել Բարաք Օբաման։",
 ]

View File

@ -15,14 +15,15 @@ _num_words = [
 "տասը",
 "տասնմեկ",
 "տասներկու",
-"տասն­երեք",
-"տասն­չորս",
-"տասն­հինգ",
-"տասն­վեց",
-"տասն­յոթ",
-"տասն­ութ",
-"տասն­ինը",
-"քսան" "երեսուն",
+"տասներեք",
+"տասնչորս",
+"տասնհինգ",
+"տասնվեց",
+"տասնյոթ",
+"տասնութ",
+"տասնինը",
+"քսան",
+"երեսուն",
 "քառասուն",
 "հիսուն",
 "վաթսուն",

View File

@ -17,12 +17,9 @@ from ... import util
 # Hold the attributes we need with convenient names
-DetailedToken = namedtuple("DetailedToken", ["surface", "pos", "lemma"])
-# Handling for multiple spaces in a row is somewhat awkward, this simplifies
-# the flow by creating a dummy with the same interface.
-DummyNode = namedtuple("DummyNode", ["surface", "pos", "lemma"])
-DummySpace = DummyNode(" ", " ", " ")
+DetailedToken = namedtuple(
+"DetailedToken", ["surface", "tag", "inf", "lemma", "reading", "sub_tokens"]
+)
 def try_sudachi_import(split_mode="A"):
@ -49,7 +46,7 @@ def try_sudachi_import(split_mode="A"):
 )
-def resolve_pos(orth, pos, next_pos):
+def resolve_pos(orth, tag, next_tag):
 """If necessary, add a field to the POS tag for UD mapping.
 Under Universal Dependencies, sometimes the same Unidic POS tag can
 be mapped differently depending on the literal token or its context
@ -60,127 +57,80 @@ def resolve_pos(orth, pos, next_pos):
 # Some tokens have their UD tag decided based on the POS of the following
 # token.
-# orth based rules
-if pos[0] in TAG_ORTH_MAP:
-orth_map = TAG_ORTH_MAP[pos[0]]
+# apply orth based mapping
+if tag in TAG_ORTH_MAP:
+orth_map = TAG_ORTH_MAP[tag]
 if orth in orth_map:
-return orth_map[orth], None
-# tag bi-gram mapping
-if next_pos:
-tag_bigram = pos[0], next_pos[0]
+return orth_map[orth], None # current_pos, next_pos
+# apply tag bi-gram mapping
+if next_tag:
+tag_bigram = tag, next_tag
 if tag_bigram in TAG_BIGRAM_MAP:
-bipos = TAG_BIGRAM_MAP[tag_bigram]
-if bipos[0] is None:
-return TAG_MAP[pos[0]][POS], bipos[1]
+current_pos, next_pos = TAG_BIGRAM_MAP[tag_bigram]
+if current_pos is None: # apply tag uni-gram mapping for current_pos
+return (
+TAG_MAP[tag][POS],
+next_pos,
+) # only next_pos is identified by tag bi-gram mapping
 else:
-return bipos
-return TAG_MAP[pos[0]][POS], None
-# Use a mapping of paired punctuation to avoid splitting quoted sentences.
-pairpunct = {"": "", "": "", "": ""}
-def separate_sentences(doc):
-"""Given a doc, mark tokens that start sentences based on Unidic tags.
-"""
-stack = [] # save paired punctuation
-for i, token in enumerate(doc[:-2]):
-# Set all tokens after the first to false by default. This is necessary
-# for the doc code to be aware we've done sentencization, see
-# `is_sentenced`.
-token.sent_start = i == 0
-if token.tag_:
-if token.tag_ == "補助記号-括弧開":
-ts = str(token)
-if ts in pairpunct:
-stack.append(pairpunct[ts])
-elif stack and ts == stack[-1]:
-stack.pop()
-if token.tag_ == "補助記号-句点":
-next_token = doc[i + 1]
-if next_token.tag_ != token.tag_ and not stack:
-next_token.sent_start = True
-def get_dtokens(tokenizer, text):
-tokens = tokenizer.tokenize(text)
-words = []
-for ti, token in enumerate(tokens):
-tag = "-".join([xx for xx in token.part_of_speech()[:4] if xx != "*"])
-inf = "-".join([xx for xx in token.part_of_speech()[4:] if xx != "*"])
-dtoken = DetailedToken(token.surface(), (tag, inf), token.dictionary_form())
-if ti > 0 and words[-1].pos[0] == "空白" and tag == "空白":
-# don't add multiple space tokens in a row
-continue
-words.append(dtoken)
-# remove empty tokens. These can be produced with characters like … that
-# Sudachi normalizes internally.
-words = [ww for ww in words if len(ww.surface) > 0]
-return words
-def get_words_lemmas_tags_spaces(dtokens, text, gap_tag=("空白", "")):
+return current_pos, next_pos
+# apply tag uni-gram mapping
+return TAG_MAP[tag][POS], None
+def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"):
+# Compare the content of tokens and text, first
 words = [x.surface for x in dtokens]
 if "".join("".join(words).split()) != "".join(text.split()):
 raise ValueError(Errors.E194.format(text=text, words=words))
-text_words = []
-text_lemmas = []
-text_tags = []
+text_dtokens = []
 text_spaces = []
 text_pos = 0
 # handle empty and whitespace-only texts
 if len(words) == 0:
-return text_words, text_lemmas, text_tags, text_spaces
+return text_dtokens, text_spaces
 elif len([word for word in words if not word.isspace()]) == 0:
 assert text.isspace()
-text_words = [text]
-text_lemmas = [text]
-text_tags = [gap_tag]
+text_dtokens = [DetailedToken(text, gap_tag, "", text, None, None)]
 text_spaces = [False]
-return text_words, text_lemmas, text_tags, text_spaces
-# normalize words to remove all whitespace tokens
-norm_words, norm_dtokens = zip(
-*[
-(word, dtokens)
-for word, dtokens in zip(words, dtokens)
-if not word.isspace()
-]
-)
-# align words with text
-for word, dtoken in zip(norm_words, norm_dtokens):
+return text_dtokens, text_spaces
+# align words and dtokens by referring text, and insert gap tokens for the space char spans
+for word, dtoken in zip(words, dtokens):
+# skip all space tokens
+if word.isspace():
+continue
 try:
 word_start = text[text_pos:].index(word)
 except ValueError:
 raise ValueError(Errors.E194.format(text=text, words=words))
+# space token
 if word_start > 0:
 w = text[text_pos : text_pos + word_start]
-text_words.append(w)
-text_lemmas.append(w)
-text_tags.append(gap_tag)
+text_dtokens.append(DetailedToken(w, gap_tag, "", w, None, None))
 text_spaces.append(False)
 text_pos += word_start
-text_words.append(word)
-text_lemmas.append(dtoken.lemma)
-text_tags.append(dtoken.pos)
+# content word
+text_dtokens.append(dtoken)
 text_spaces.append(False)
 text_pos += len(word)
+# poll a space char after the word
 if text_pos < len(text) and text[text_pos] == " ":
 text_spaces[-1] = True
 text_pos += 1
+# trailing space token
 if text_pos < len(text):
 w = text[text_pos:]
-text_words.append(w)
-text_lemmas.append(w)
-text_tags.append(gap_tag)
+text_dtokens.append(DetailedToken(w, gap_tag, "", w, None, None))
 text_spaces.append(False)
-return text_words, text_lemmas, text_tags, text_spaces
+return text_dtokens, text_spaces
 class JapaneseTokenizer(DummyTokenizer):
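
Review note: the new `get_dtokens_and_spaces` walks the tokenizer output against the raw text, skips whitespace tokens, marks a single trailing space on the previous token, and inserts a gap token for any remaining run of whitespace. The sketch below shows that alignment on plain strings only. `align_words` is a hypothetical name, and the `DetailedToken` payload and the `Errors.E194` message from the hunk are deliberately left out.

```python
def align_words(words, text):
    """Align tokenizer words with the raw text.

    Returns (tokens, spaces): `spaces[i]` is True when token i is followed
    by a single space that is folded into it; any other whitespace span
    becomes a gap token of its own.
    """
    if "".join("".join(words).split()) != "".join(text.split()):
        raise ValueError("words do not match text")
    tokens, spaces = [], []
    pos = 0
    if not words:
        return tokens, spaces
    if all(word.isspace() for word in words):
        # Whitespace-only text becomes one gap token.
        return [text], [False]
    for word in words:
        if word.isspace():
            continue  # whitespace is recovered from the text instead
        start = text[pos:].index(word)
        if start > 0:
            # Gap token for whitespace before the word.
            tokens.append(text[pos : pos + start])
            spaces.append(False)
            pos += start
        tokens.append(word)
        spaces.append(False)
        pos += len(word)
        # Fold one following space into the token just added.
        if pos < len(text) and text[pos] == " ":
            spaces[-1] = True
            pos += 1
    if pos < len(text):
        # Trailing whitespace gap token.
        tokens.append(text[pos:])
        spaces.append(False)
    return tokens, spaces


print(align_words(["日本", " ", "語"], "日本 語"))
```

With two spaces between words, the first space is folded into the preceding token and the second surfaces as a separate gap token, which matches the behaviour the hunk's comments describe.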
@ -190,29 +140,96 @@ class JapaneseTokenizer(DummyTokenizer):
 self.tokenizer = try_sudachi_import(self.split_mode)
 def __call__(self, text):
-dtokens = get_dtokens(self.tokenizer, text)
-words, lemmas, unidic_tags, spaces = get_words_lemmas_tags_spaces(dtokens, text)
+# convert sudachipy.morpheme.Morpheme to DetailedToken and merge continuous spaces
+sudachipy_tokens = self.tokenizer.tokenize(text)
+dtokens = self._get_dtokens(sudachipy_tokens)
+dtokens, spaces = get_dtokens_and_spaces(dtokens, text)
+# create Doc with tag bi-gram based part-of-speech identification rules
+words, tags, inflections, lemmas, readings, sub_tokens_list = (
+zip(*dtokens) if dtokens else [[]] * 6
+)
+sub_tokens_list = list(sub_tokens_list)
 doc = Doc(self.vocab, words=words, spaces=spaces)
-next_pos = None
-for idx, (token, lemma, unidic_tag) in enumerate(zip(doc, lemmas, unidic_tags)):
-token.tag_ = unidic_tag[0]
-if next_pos:
+next_pos = None # for bi-gram rules
+for idx, (token, dtoken) in enumerate(zip(doc, dtokens)):
+token.tag_ = dtoken.tag
+if next_pos: # already identified in previous iteration
 token.pos = next_pos
 next_pos = None
 else:
 token.pos, next_pos = resolve_pos(
 token.orth_,
-unidic_tag,
-unidic_tags[idx + 1] if idx + 1 < len(unidic_tags) else None,
+dtoken.tag,
+tags[idx + 1] if idx + 1 < len(tags) else None,
 )
 # if there's no lemma info (it's an unk) just use the surface
-token.lemma_ = lemma
-doc.user_data["unidic_tags"] = unidic_tags
+token.lemma_ = dtoken.lemma if dtoken.lemma else dtoken.surface
+doc.user_data["inflections"] = inflections
+doc.user_data["reading_forms"] = readings
+doc.user_data["sub_tokens"] = sub_tokens_list
 return doc
+def _get_dtokens(self, sudachipy_tokens, need_sub_tokens=True):
+sub_tokens_list = (
+self._get_sub_tokens(sudachipy_tokens) if need_sub_tokens else None
+)
+dtokens = [
+DetailedToken(
+token.surface(), # orth
+"-".join([xx for xx in token.part_of_speech()[:4] if xx != "*"]), # tag
+",".join([xx for xx in token.part_of_speech()[4:] if xx != "*"]), # inf
+token.dictionary_form(), # lemma
+token.reading_form(), # user_data['reading_forms']
+sub_tokens_list[idx]
+if sub_tokens_list
+else None, # user_data['sub_tokens']
+)
+for idx, token in enumerate(sudachipy_tokens)
+if len(token.surface()) > 0
+# remove empty tokens which can be produced with characters like … that
+]
+# Sudachi normalizes internally and outputs each space char as a token.
+# This is the preparation for get_dtokens_and_spaces() to merge the continuous space tokens
+return [
+t
+for idx, t in enumerate(dtokens)
+if idx == 0
+or not t.surface.isspace()
+or t.tag != "空白"
+or not dtokens[idx - 1].surface.isspace()
+or dtokens[idx - 1].tag != "空白"
+]
+def _get_sub_tokens(self, sudachipy_tokens):
+if (
+self.split_mode is None or self.split_mode == "A"
+): # do nothing for default split mode
+return None
+sub_tokens_list = [] # list of (list of list of DetailedToken | None)
+for token in sudachipy_tokens:
+sub_a = token.split(self.tokenizer.SplitMode.A)
+if len(sub_a) == 1: # no sub tokens
+sub_tokens_list.append(None)
+elif self.split_mode == "B":
+sub_tokens_list.append([self._get_dtokens(sub_a, False)])
+else: # "C"
+sub_b = token.split(self.tokenizer.SplitMode.B)
+if len(sub_a) == len(sub_b):
+dtokens = self._get_dtokens(sub_a, False)
+sub_tokens_list.append([dtokens, dtokens])
+else:
+sub_tokens_list.append(
+[
+self._get_dtokens(sub_a, False),
+self._get_dtokens(sub_b, False),
+]
+)
+return sub_tokens_list
 def _get_config(self):
 config = OrderedDict((("split_mode", self.split_mode),))
 return config

View File

@ -1,176 +0,0 @@
POS_PHRASE_MAP = {
"NOUN": "NP",
"NUM": "NP",
"PRON": "NP",
"PROPN": "NP",
"VERB": "VP",
"ADJ": "ADJP",
"ADV": "ADVP",
"CCONJ": "CCONJP",
}
# return value: [(bunsetu_tokens, phrase_type={'NP', 'VP', 'ADJP', 'ADVP'}, phrase_tokens)]
def yield_bunsetu(doc, debug=False):
bunsetu = []
bunsetu_may_end = False
phrase_type = None
phrase = None
prev = None
prev_tag = None
prev_dep = None
prev_head = None
for t in doc:
pos = t.pos_
pos_type = POS_PHRASE_MAP.get(pos, None)
tag = t.tag_
dep = t.dep_
head = t.head.i
if debug:
print(
t.i,
t.orth_,
pos,
pos_type,
dep,
head,
bunsetu_may_end,
phrase_type,
phrase,
bunsetu,
)
# DET is always an individual bunsetu
if pos == "DET":
if bunsetu:
yield bunsetu, phrase_type, phrase
yield [t], None, None
bunsetu = []
bunsetu_may_end = False
phrase_type = None
phrase = None
# PRON or Open PUNCT always splits bunsetu
elif tag == "補助記号-括弧開":
if bunsetu:
yield bunsetu, phrase_type, phrase
bunsetu = [t]
bunsetu_may_end = True
phrase_type = None
phrase = None
# bunsetu head not appeared
elif phrase_type is None:
if bunsetu and prev_tag == "補助記号-読点":
yield bunsetu, phrase_type, phrase
bunsetu = []
bunsetu_may_end = False
phrase_type = None
phrase = None
bunsetu.append(t)
if pos_type: # begin phrase
phrase = [t]
phrase_type = pos_type
if pos_type in {"ADVP", "CCONJP"}:
bunsetu_may_end = True
# entering new bunsetu
elif pos_type and (
pos_type != phrase_type
or bunsetu_may_end # different phrase type arises # same phrase type but bunsetu already ended
):
# exceptional case: NOUN to VERB
if (
phrase_type == "NP"
and pos_type == "VP"
and prev_dep == "compound"
and prev_head == t.i
):
bunsetu.append(t)
phrase_type = "VP"
phrase.append(t)
# exceptional case: VERB to NOUN
elif (
phrase_type == "VP"
and pos_type == "NP"
and (
prev_dep == "compound"
and prev_head == t.i
or dep == "compound"
and prev == head
or prev_dep == "nmod"
and prev_head == t.i
)
):
bunsetu.append(t)
phrase_type = "NP"
phrase.append(t)
else:
yield bunsetu, phrase_type, phrase
bunsetu = [t]
bunsetu_may_end = False
phrase_type = pos_type
phrase = [t]
# NOUN bunsetu
elif phrase_type == "NP":
bunsetu.append(t)
if not bunsetu_may_end and (
(
(pos_type == "NP" or pos == "SYM")
and (prev_head == t.i or prev_head == head)
and prev_dep in {"compound", "nummod"}
)
or (
pos == "PART"
and (prev == head or prev_head == head)
and dep == "mark"
)
):
phrase.append(t)
else:
bunsetu_may_end = True
# VERB bunsetu
elif phrase_type == "VP":
bunsetu.append(t)
if (
not bunsetu_may_end
and pos == "VERB"
and prev_head == t.i
and prev_dep == "compound"
):
phrase.append(t)
else:
bunsetu_may_end = True
# ADJ bunsetu
elif phrase_type == "ADJP" and tag != "連体詞":
bunsetu.append(t)
if not bunsetu_may_end and (
(
pos == "NOUN"
and (prev_head == t.i or prev_head == head)
and prev_dep in {"amod", "compound"}
)
or (
pos == "PART"
and (prev == head or prev_head == head)
and dep == "mark"
)
):
phrase.append(t)
else:
bunsetu_may_end = True
# other bunsetu
else:
bunsetu.append(t)
prev = t.i
prev_tag = t.tag_
prev_dep = t.dep_
prev_head = head
if bunsetu:
yield bunsetu, phrase_type, phrase

View File

@ -39,7 +39,11 @@ def check_spaces(text, tokens):
 class KoreanTokenizer(DummyTokenizer):
 def __init__(self, cls, nlp=None):
 self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
-self.Tokenizer = try_mecab_import()
+MeCab = try_mecab_import()
+self.mecab_tokenizer = MeCab("-F%f[0],%f[7]")
+def __del__(self):
+self.mecab_tokenizer.__del__()
 def __call__(self, text):
 dtokens = list(self.detailed_tokens(text))
@ -55,8 +59,7 @@ class KoreanTokenizer(DummyTokenizer):
 def detailed_tokens(self, text):
 # 품사 태그(POS)[0], 의미 부류(semantic class)[1], 종성 유무(jongseong)[2], 읽기(reading)[3],
 # 타입(type)[4], 첫번째 품사(start pos)[5], 마지막 품사(end pos)[6], 표현(expression)[7], *
-with self.Tokenizer("-F%f[0],%f[7]") as tokenizer:
-for node in tokenizer.parse(text, as_nodes=True):
+for node in self.mecab_tokenizer.parse(text, as_nodes=True):
 if node.is_eos():
 break
 surface = node.surface

23
spacy/lang/ne/__init__.py Normal file
View File

@ -0,0 +1,23 @@
# coding: utf8
from __future__ import unicode_literals
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from ...language import Language
from ...attrs import LANG
class NepaliDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: "ne" # Nepali language ISO code
stop_words = STOP_WORDS
class Nepali(Language):
lang = "ne"
Defaults = NepaliDefaults
__all__ = ["Nepali"]

22
spacy/lang/ne/examples.py Normal file
View File

@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.ne.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"एप्पलले अमेरिकी स्टार्टअप १ अर्ब डलरमा किन्ने सोच्दै छ",
"स्वायत्त कारहरूले बीमा दायित्व निर्माताहरु तिर बदल्छन्",
"स्यान फ्रांसिस्कोले फुटपाथ वितरण रोबोटहरु प्रतिबंध गर्ने विचार गर्दै छ",
"लन्डन यूनाइटेड किंगडमको एक ठूलो शहर हो।",
"तिमी कहाँ छौ?",
"फ्रान्स को राष्ट्रपति को हो?",
"संयुक्त राज्यको राजधानी के हो?",
"बराक ओबामा कहिले कहिले जन्मेका हुन्?",
]

View File

@ -0,0 +1,98 @@
# coding: utf8
from __future__ import unicode_literals
from ..norm_exceptions import BASE_NORMS
from ...attrs import NORM, LIKE_NUM
# fmt: off
_stem_suffixes = [
["", "ि", "", "", "", "", "", "", "", ""],
["", "", "", ""],
["लाई", "ले", "बाट", "को", "मा", "हरू"],
["हरूलाई", "हरूले", "हरूबाट", "हरूको", "हरूमा"],
["इलो", "िलो", "नु", "ाउनु", "", "इन", "इन्", "इनन्"],
["एँ", "इँन्", "इस्", "इनस्", "यो", "एन", "यौं", "एनौं", "", "एनन्"],
["छु", "छौँ", "छस्", "छौ", "", "छन्", "छेस्", "छे", "छ्यौ", "छिन्", "हुन्छ"],
["दै", "दिन", "दिँन", "दैनस्", "दैन", "दैनौँ", "दैनौं", "दैनन्"],
["हुन्न", "न्न", "न्न्स्", "न्नौं", "न्नौ", "न्न्न्", "िई"],
["", "", "", "अरी", "साथ", "वित्तिकै", "पूर्वक"],
["याइ", "ाइ", "बार", "वार", "चाँहि"],
["ने", "ेको", "ेकी", "ेका", "ेर", "दै", "तै", "िकन", "", "", "नन्"]
]
# fmt: on
# reference 1: https://en.wikipedia.org/wiki/Numbers_in_Nepali_language
# reference 2: https://www.imnepal.com/nepali-numbers/
_num_words = [
"शुन्य",
"एक",
"दुई",
"तीन",
"चार",
"पाँच",
"छ",
"सात",
"आठ",
"नौ",
"दश",
"एघार",
"बाह्र",
"तेह्र",
"चौध",
"पन्ध्र",
"सोह्र",
"सोह्र",
"सत्र",
"अठार",
"उन्नाइस",
"बीस",
"तीस",
"चालीस",
"पचास",
"साठी",
"सत्तरी",
"असी",
"नब्बे",
"सय",
"हजार",
"लाख",
"करोड",
"अर्ब",
"खर्ब",
]
def norm(string):
# normalise base exceptions, e.g. punctuation or currency symbols
if string in BASE_NORMS:
return BASE_NORMS[string]
# set stem word as norm, if available, adapted from:
# https://github.com/explosion/spaCy/blob/master/spacy/lang/hi/lex_attrs.py
# https://www.researchgate.net/publication/237261579_Structure_of_Nepali_Grammar
for suffix_group in reversed(_stem_suffixes):
length = len(suffix_group[0])
if len(string) <= length:
break
for suffix in suffix_group:
if string.endswith(suffix):
return string[:-length]
return string
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(", ", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
return True
return False
LEX_ATTRS = {NORM: norm, LIKE_NUM: like_num}
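
Review note: `like_num` above accepts an optional sign prefix, plain digits, simple fractions, and entries from `_num_words`. The sketch below reproduces that logic with a three-word sample list so it runs on its own; `_SAMPLE_NUM_WORDS` is a stand-in introduced here, not the list from the file. It keeps the `", "` (comma plus space) replacement exactly as written in the hunk, which means a token such as `1,000` is not treated as number-like.

```python
# Small stand-in for the module's _num_words list (assumption for this sketch).
_SAMPLE_NUM_WORDS = {"एक", "दुई", "तीन"}


def like_num(text, num_words=_SAMPLE_NUM_WORDS):
    """Return True when the token looks like a number."""
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Same replacements as the hunk: ", " (comma plus space) and ".".
    text = text.replace(", ", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text.lower() in num_words


for token in ["10", "१०", "+5", "1/2", "एक", "1,000", "abc"]:
    print(token, like_num(token))
```

Because `str.isdigit` accepts Devanagari digits, `१०` passes the digit check without needing an entry in the word list.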

498
spacy/lang/ne/stop_words.py Normal file
View File

@ -0,0 +1,498 @@
# coding: utf8
from __future__ import unicode_literals
# Source: https://github.com/sanjaalcorps/NepaliStopWords/blob/master/NepaliStopWords.txt
STOP_WORDS = set(
"""
अकसर
अगि
अग
अघि
अझ
अठ
अथव
अनि
अन
अनतरगत
अन
अनयत
अनयथ
अब
अर
अर
अर
अर
अर
अर
अलग
अलि
अवस
अहि
आए
आएक
आएक
आज
आजक
आठ
आत
आदि
आदि
आफन
आफ
आफ
आफ
आफ
आफ
आफ
आय
उक
उदहरण
उनक
उनल
उनल
उनि
उन
उनहर
उनइस
उप
उसक
उसल
उसल
उह
एउट
एउट
एक
एकदम
एघ
ओठ
कत
कति
कत
कम
कमसकम
कसरि
कसर
कस
कस
कस
कस
कस
कस
कह
कहि
रण
ि
ि
िनभन
पय
ि
ि
िपनि
पनि
रमश
गए
गएक
गएर
गय
गरि
गर
गर
गर
गर
गर
गर
गर
गरछन
गर
गर
गर
गर
गर
गरपर
गर
घर
हन
हन
ि
ि
ि
छन
छन
नन
जत
जततत
जन
जन
जन
जन
जब
जबकि
जबक
जसक
जसब
जसम
जसर
जसल
जसल
जस
जस
जस
जस
जह
ि
पनि
पन
तत
तत
तथ
तथि
तथ
तदन
तप
तप
तपईक
तब
तर
तर
तल
तसर
पनि
पन
ि
िि
ििहर
ि
िहर
िहर
िहर
िहर
ि
ि
ि
िरक
रन
रण
पनि
पन
यति
यति
यस
यसकरण
यसक
यसल
यस
यस
यस
यस
यह
यहि
यह
यह
यह
सपछि
थप
थरि
थर
ि
ि
िएन
ि
दर
दश
ि
िएक
ि
िभएक
ि
इवट
ि
ि
ि
धन
नगर
नगर
नजि
नत
नतरभन
नभई
नभएक
नभन
नय
ि
ि
िि
ि
ि
िि
पक
पक
पछि
पछ
पछि
पछि
पछ
पटक
पनि
पन
पर
पर
पर
पर
पर
पर
पहि
पहि
पहि
ि
रति
रत
रतयक
लस
फरक
ि
बढ
बत
बन
बर
ि
ि
िचम
ि
ि
ि
चम
भए
भए
भएक
भएक
भएक
भएन
भएर
भन
भन
भन
भन
भन
भनछन
भन
भन
भन
भनभय
भन
भन
भय
भय
भर
भरि
भर
ि
ि
मध
मध
मल
ि
ि
ि
यति
यथि
यदि
यदयपि
यदयपि
यस
यसक
यसक
यसपछि
यसब
यसम
यसर
यसल
यस
यस
यस
यह
यहसम
यह
रह
रह
रह
रह
पम
लगभग
लगयत
ि
वट
वरपर
पत
तवम
यद
सक
सक
गक
गस
सङ
सङगक
सट
सत
सध
सब
सब
सब
समय
सम
समभव
सम
सय
सरह
सहि
सहि
सह
यद
ि
पष
हज
हर
हर
नत
इन
ि
""".split()
)

View File

@ -14,7 +14,7 @@ from .stop_words import STOP_WORDS
from ... import util from ... import util
_PKUSEG_INSTALL_MSG = "install it with `pip install pkuseg==0.0.22` or from https://github.com/lancopku/pkuseg-python" _PKUSEG_INSTALL_MSG = "install it with `pip install pkuseg==0.0.25` or from https://github.com/lancopku/pkuseg-python"
def try_jieba_import(segmenter): def try_jieba_import(segmenter):

View File

@ -32,6 +32,7 @@ from .lang.tag_map import TAG_MAP
from .tokens import Doc from .tokens import Doc
from .lang.lex_attrs import LEX_ATTRS, is_stop from .lang.lex_attrs import LEX_ATTRS, is_stop
from .errors import Errors, Warnings from .errors import Errors, Warnings
from .git_info import GIT_VERSION
from . import util from . import util
from . import about from . import about
@ -44,7 +45,7 @@ class BaseDefaults:
def create_lemmatizer(cls, nlp=None, lookups=None): def create_lemmatizer(cls, nlp=None, lookups=None):
if lookups is None: if lookups is None:
lookups = cls.create_lookups(nlp=nlp) lookups = cls.create_lookups(nlp=nlp)
return Lemmatizer(lookups=lookups) return Lemmatizer(lookups=lookups, is_base_form=cls.is_base_form)
@classmethod @classmethod
def create_lookups(cls, nlp=None): def create_lookups(cls, nlp=None):
@ -116,6 +117,7 @@ class BaseDefaults:
tokenizer_exceptions = {} tokenizer_exceptions = {}
stop_words = set() stop_words = set()
morph_rules = {} morph_rules = {}
is_base_form = None
lex_attr_getters = LEX_ATTRS lex_attr_getters = LEX_ATTRS
syntax_iterators = {} syntax_iterators = {}
resources = {} resources = {}
@ -212,6 +214,7 @@ class Language:
self._meta.setdefault("email", "") self._meta.setdefault("email", "")
self._meta.setdefault("url", "") self._meta.setdefault("url", "")
self._meta.setdefault("license", "") self._meta.setdefault("license", "")
self._meta.setdefault("spacy_git_version", GIT_VERSION)
self._meta["vectors"] = { self._meta["vectors"] = {
"width": self.vocab.vectors_length, "width": self.vocab.vectors_length,
"vectors": len(self.vocab.vectors), "vectors": len(self.vocab.vectors),

View File

@ -14,7 +14,7 @@ class Lemmatizer:
def load(cls, *args, **kwargs): def load(cls, *args, **kwargs):
raise NotImplementedError(Errors.E172) raise NotImplementedError(Errors.E172)
def __init__(self, lookups): def __init__(self, lookups, is_base_form=None):
"""Initialize a Lemmatizer. """Initialize a Lemmatizer.
lookups (Lookups): The lookups object containing the (optional) tables lookups (Lookups): The lookups object containing the (optional) tables
@ -22,6 +22,7 @@ class Lemmatizer:
RETURNS (Lemmatizer): The newly constructed object. RETURNS (Lemmatizer): The newly constructed object.
""" """
self.lookups = lookups self.lookups = lookups
self.is_base_form = is_base_form
def __call__(self, string, univ_pos, morphology=None): def __call__(self, string, univ_pos, morphology=None):
"""Lemmatize a string. """Lemmatize a string.
@ -42,7 +43,7 @@ class Lemmatizer:
if univ_pos in ("", "eol", "space"): if univ_pos in ("", "eol", "space"):
return [string.lower()] return [string.lower()]
# See Issue #435 for example of where this logic is required. # See Issue #435 for example of where this logic is required.
if self.is_base_form(univ_pos, morphology): if callable(self.is_base_form) and self.is_base_form(univ_pos, morphology):
return [string.lower()] return [string.lower()]
index_table = self.lookups.get_table("lemma_index", {}) index_table = self.lookups.get_table("lemma_index", {})
exc_table = self.lookups.get_table("lemma_exc", {}) exc_table = self.lookups.get_table("lemma_exc", {})
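
Note on the `callable(...)` guard: the hunk above makes `is_base_form` an optional hook, so languages that do not define one fall through to the table lookups. A minimal sketch of that pattern follows. `TinyLemmatizer` and `noun_singular_is_base` are hypothetical stand-ins written for this note, not spaCy classes or functions.

```python
class TinyLemmatizer:
    """Toy stand-in for illustration only; not spaCy's Lemmatizer."""

    def __init__(self, exceptions, is_base_form=None):
        self.exceptions = exceptions
        self.is_base_form = is_base_form

    def __call__(self, string, univ_pos, morphology=None):
        # Only consult the hook when one was actually supplied.
        if callable(self.is_base_form) and self.is_base_form(univ_pos, morphology):
            return [string.lower()]
        return self.exceptions.get(string.lower(), [string.lower()])


def noun_singular_is_base(univ_pos, morphology):
    # Hypothetical hook: treat singular nouns as already being in base form.
    morphology = morphology or {}
    return univ_pos == "noun" and morphology.get("Number") == "sing"


exc = {"formuesskatten": ["formuesskatt"]}
without_hook = TinyLemmatizer(exc)
with_hook = TinyLemmatizer(exc, is_base_form=noun_singular_is_base)

print(without_hook("Formuesskatten", "noun", {"Number": "sing"}))
print(with_hook("Formuesskatten", "noun", {"Number": "sing"}))
```

Without a hook the exception table is consulted. With the hook, the lowercased input is returned before any lookup happens.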

View File

@ -346,7 +346,7 @@ cdef class Lexeme:
@property @property
def is_oov(self): def is_oov(self):
"""RETURNS (bool): Whether the lexeme is out-of-vocabulary.""" """RETURNS (bool): Whether the lexeme is out-of-vocabulary."""
return self.orth in self.vocab.vectors return self.orth not in self.vocab.vectors
property is_stop: property is_stop:
"""RETURNS (bool): Whether the lexeme is a stop word.""" """RETURNS (bool): Whether the lexeme is a stop word."""

View File

@ -117,8 +117,7 @@ class Lookups:
""" """
self._tables = {} self._tables = {}
for key, value in srsly.msgpack_loads(bytes_data).items(): for key, value in srsly.msgpack_loads(bytes_data).items():
self._tables[key] = Table(key) self._tables[key] = Table(key, value)
self._tables[key].update(value)
return self return self
def to_disk(self, path, filename="lookups.bin", **kwargs): def to_disk(self, path, filename="lookups.bin", **kwargs):
@ -189,7 +188,7 @@ class Table(OrderedDict):
self.name = name self.name = name
# Assume a default size of 1M items # Assume a default size of 1M items
self.default_size = 1e6 self.default_size = 1e6
size = len(data) if data and len(data) > 0 else self.default_size size = max(len(data), 1) if data is not None else self.default_size
self.bloom = BloomFilter.from_error_rate(size) self.bloom = BloomFilter.from_error_rate(size)
if data: if data:
self.update(data) self.update(data)
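
Note on the Bloom filter sizing change: the old expression fell back to the 1M default for any falsy `data`, including an empty dict, while the new one only uses the default when `data` is `None`. The two expressions can be compared side by side in isolation. The function names below are hypothetical and exist only for this comparison.

```python
DEFAULT_SIZE = 1e6  # same default as in the hunk above


def old_size(data, default_size=DEFAULT_SIZE):
    # Expression before the change.
    return len(data) if data and len(data) > 0 else default_size


def new_size(data, default_size=DEFAULT_SIZE):
    # Expression after the change.
    return max(len(data), 1) if data is not None else default_size


for data in (None, {}, {"a": 1}, {"a": 1, "b": 2}):
    print(repr(data), old_size(data), new_size(data))
```

The visible difference is the empty-table case: a table deserialized with no entries is now sized for one item rather than one million.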

View File

@ -781,6 +781,20 @@ class ClozeMultitask(Pipe):
if losses is not None: if losses is not None:
losses[self.name] += loss losses[self.name] += loss
    @staticmethod
    def decode_utf8_predictions(char_array):
        # The format alternates filling from start and end, and 255 is missing
        words = []
        char_array = char_array.reshape((char_array.shape[0], -1, 256))
        nr_char = char_array.shape[1]
        char_array = char_array.argmax(axis=-1)
        for row in char_array:
            starts = [chr(c) for c in row[::2] if c != 255]
            ends = [chr(c) for c in row[1::2] if c != 255]
            word = "".join(starts + list(reversed(ends)))
            words.append(word)
        return words
@component("textcat", assigns=["doc.cats"], default_model=default_textcat) @component("textcat", assigns=["doc.cats"], default_model=default_textcat)
class TextCategorizer(Pipe): class TextCategorizer(Pipe):
@ -949,6 +963,7 @@ cdef class DependencyParser(Parser):
assigns = ["token.dep", "token.is_sent_start", "doc.sents"] assigns = ["token.dep", "token.is_sent_start", "doc.sents"]
requires = [] requires = []
TransitionSystem = ArcEager TransitionSystem = ArcEager
nr_feature = 8
@property @property
def postprocesses(self): def postprocesses(self):
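
Note on `decode_utf8_predictions`, added to `ClozeMultitask` in the first hunk of this file: after the `argmax`, each row holds one predicted code per slot. Even slots fill the word from the start, odd slots fill it from the end, and 255 marks an unused slot. The pure-Python sketch below mirrors only that row-decoding step. It skips the reshape and `argmax`, and `decode_row` is a hypothetical name.

```python
PAD = 255  # value treated as "no character", as in the method above


def decode_row(codes):
    """Rebuild a word from codes that alternate start-side and end-side slots."""
    starts = [chr(c) for c in codes[::2] if c != PAD]
    ends = [chr(c) for c in codes[1::2] if c != PAD]
    return "".join(starts + list(reversed(ends)))


# "hello" laid out as h (start), o (end), e (start), l (end), l (start), padding
row = [ord("h"), ord("o"), ord("e"), ord("l"), ord("l"), PAD]
print(decode_row(row))
```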

View File

@ -167,6 +167,11 @@ def nb_tokenizer():
return get_lang_class("nb").Defaults.create_tokenizer() return get_lang_class("nb").Defaults.create_tokenizer()
@pytest.fixture(scope="session")
def ne_tokenizer():
return get_lang_class("ne").Defaults.create_tokenizer()
@pytest.fixture(scope="session") @pytest.fixture(scope="session")
def nl_tokenizer(): def nl_tokenizer():
return get_lang_class("nl").Defaults.create_tokenizer() return get_lang_class("nl").Defaults.create_tokenizer()

View File

@ -102,10 +102,16 @@ def test_doc_api_getitem(en_tokenizer):
) )
def test_doc_api_serialize(en_tokenizer, text): def test_doc_api_serialize(en_tokenizer, text):
tokens = en_tokenizer(text) tokens = en_tokenizer(text)
tokens[0].lemma_ = "lemma"
tokens[0].norm_ = "norm"
tokens[0].ent_kb_id_ = "ent_kb_id"
new_tokens = Doc(tokens.vocab).from_bytes(tokens.to_bytes()) new_tokens = Doc(tokens.vocab).from_bytes(tokens.to_bytes())
assert tokens.text == new_tokens.text assert tokens.text == new_tokens.text
assert [t.text for t in tokens] == [t.text for t in new_tokens] assert [t.text for t in tokens] == [t.text for t in new_tokens]
assert [t.orth for t in tokens] == [t.orth for t in new_tokens] assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
assert new_tokens[0].lemma_ == "lemma"
assert new_tokens[0].norm_ == "norm"
assert new_tokens[0].ent_kb_id_ == "ent_kb_id"
new_tokens = Doc(tokens.vocab).from_bytes( new_tokens = Doc(tokens.vocab).from_bytes(
tokens.to_bytes(exclude=["tensor"]), exclude=["tensor"] tokens.to_bytes(exclude=["tensor"]), exclude=["tensor"]

View File

@ -1,7 +1,7 @@
import pytest import pytest
from ...tokenizer.test_naughty_strings import NAUGHTY_STRINGS from ...tokenizer.test_naughty_strings import NAUGHTY_STRINGS
from spacy.lang.ja import Japanese from spacy.lang.ja import Japanese, DetailedToken
# fmt: off # fmt: off
TOKENIZER_TESTS = [ TOKENIZER_TESTS = [
@ -93,6 +93,57 @@ def test_ja_tokenizer_split_modes(ja_tokenizer, text, len_a, len_b, len_c):
assert len(nlp_c(text)) == len_c assert len(nlp_c(text)) == len_c
@pytest.mark.parametrize("text,sub_tokens_list_a,sub_tokens_list_b,sub_tokens_list_c",
    [
        (
            "選挙管理委員会",
            [None, None, None, None],
            [None, None, [
                [
                    DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None),
                    DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None),
                ]
            ]],
            [[
                [
                    DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None),
                    DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None),
                    DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None),
                    DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None),
                ], [
                    DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None),
                    DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None),
                    DetailedToken(surface='委員会', tag='名詞-普通名詞-一般', inf='', lemma='委員会', reading='イインカイ', sub_tokens=None),
                ]
            ]]
        ),
    ]
)
def test_ja_tokenizer_sub_tokens(ja_tokenizer, text, sub_tokens_list_a, sub_tokens_list_b, sub_tokens_list_c):
    nlp_a = Japanese(meta={"tokenizer": {"config": {"split_mode": "A"}}})
    nlp_b = Japanese(meta={"tokenizer": {"config": {"split_mode": "B"}}})
    nlp_c = Japanese(meta={"tokenizer": {"config": {"split_mode": "C"}}})
    assert ja_tokenizer(text).user_data["sub_tokens"] == sub_tokens_list_a
    assert nlp_a(text).user_data["sub_tokens"] == sub_tokens_list_a
    assert nlp_b(text).user_data["sub_tokens"] == sub_tokens_list_b
    assert nlp_c(text).user_data["sub_tokens"] == sub_tokens_list_c


@pytest.mark.parametrize("text,inflections,reading_forms",
    [
        (
            "取ってつけた",
            ("五段-ラ行,連用形-促音便", "", "下一段-カ行,連用形-一般", "助動詞-タ,終止形-一般"),
            ("トッ", "テ", "ツケ", "タ"),
        ),
    ]
)
def test_ja_tokenizer_inflections_reading_forms(ja_tokenizer, text, inflections, reading_forms):
    assert ja_tokenizer(text).user_data["inflections"] == inflections
    assert ja_tokenizer(text).user_data["reading_forms"] == reading_forms


def test_ja_tokenizer_emptyish_texts(ja_tokenizer): def test_ja_tokenizer_emptyish_texts(ja_tokenizer):
doc = ja_tokenizer("") doc = ja_tokenizer("")
assert len(doc) == 0 assert len(doc) == 0

View File

View File

@ -0,0 +1,19 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest


def test_ne_tokenizer_handles_long_text(ne_tokenizer):
    text = """मैले पाएको सर्टिफिकेटलाई म त बोक्रो सम्झन्छु र अभ्यास तब सुरु भयो, जब मैले कलेज पार गरेँ र जीवनको पढाइ सुरु गरेँ ।"""
    tokens = ne_tokenizer(text)
    assert len(tokens) == 24


@pytest.mark.parametrize(
    "text,length",
    [("समय जान कति पनि बेर लाग्दैन ।", 7), ("म ठूलो हुँदै थिएँ ।", 5)],
)
def test_ne_tokenizer_handles_cnts(ne_tokenizer, text, length):
    tokens = ne_tokenizer(text)
    assert len(tokens) == length

View File

@ -4,7 +4,9 @@ from spacy import util
from spacy.gold import Example from spacy.gold import Example
from spacy.lang.en import English from spacy.lang.en import English
from spacy.language import Language from spacy.language import Language
from spacy.tests.util import make_tempdir from spacy.symbols import POS, NOUN
from ..util import make_tempdir
def test_label_types(): def test_label_types():
@ -15,6 +17,19 @@ def test_label_types():
nlp.get_pipe("tagger").add_label(9) nlp.get_pipe("tagger").add_label(9)
def test_tagger_begin_training_tag_map():
    """Test that Tagger.begin_training() without gold tuples does not clobber
    the tag map."""
    nlp = Language()
    tagger = nlp.create_pipe("tagger")
    orig_tag_count = len(tagger.labels)
    tagger.add_label("A", {"POS": "NOUN"})
    nlp.add_pipe(tagger)
    nlp.begin_training()
    assert nlp.vocab.morphology.tag_map["A"] == {POS: NOUN}
    assert orig_tag_count + 1 == len(nlp.get_pipe("tagger").labels)


TAG_MAP = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}, "J": {"pos": "ADJ"}} TAG_MAP = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}, "J": {"pos": "ADJ"}}
MORPH_RULES = {"V": {"like": {"lemma": "luck"}}} MORPH_RULES = {"V": {"like": {"lemma": "luck"}}}

View File

@ -11,6 +11,7 @@ from spacy.lang.en import English
from spacy.lemmatizer import Lemmatizer from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups from spacy.lookups import Lookups
from spacy.tokens import Doc, Span from spacy.tokens import Doc, Span
from spacy.lang.en import EnglishDefaults
from ..util import get_doc, make_tempdir from ..util import get_doc, make_tempdir
@ -164,7 +165,7 @@ def test_issue595():
lookups.add_table("lemma_rules", {"verb": [["ed", "e"]]}) lookups.add_table("lemma_rules", {"verb": [["ed", "e"]]})
lookups.add_table("lemma_index", {"verb": {}}) lookups.add_table("lemma_index", {"verb": {}})
lookups.add_table("lemma_exc", {"verb": {}}) lookups.add_table("lemma_exc", {"verb": {}})
lemmatizer = Lemmatizer(lookups) lemmatizer = Lemmatizer(lookups, is_base_form=EnglishDefaults.is_base_form)
vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map) vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map)
doc = Doc(vocab, words=words) doc = Doc(vocab, words=words)
doc[2].tag_ = "VB" doc[2].tag_ = "VB"

View File

@ -57,7 +57,7 @@ def test_issue2626_2835(en_tokenizer, text):
def test_issue2656(en_tokenizer): def test_issue2656(en_tokenizer):
"""Test that tokenizer correctly splits of punctuation after numbers with """Test that tokenizer correctly splits off punctuation after numbers with
decimal points. decimal points.
""" """
doc = en_tokenizer("I went for 40.3, and got home by 10.0.") doc = en_tokenizer("I went for 40.3, and got home by 10.0.")

View File

@ -2,6 +2,7 @@ import pytest
from spacy.tokens import Doc from spacy.tokens import Doc
from spacy.language import Language from spacy.language import Language
from spacy.lookups import Lookups from spacy.lookups import Lookups
from spacy.lemmatizer import Lemmatizer
def test_lemmatizer_reflects_lookups_changes(): def test_lemmatizer_reflects_lookups_changes():
@ -46,3 +47,14 @@ def test_tagger_warns_no_lookups():
with pytest.warns(None) as record: with pytest.warns(None) as record:
nlp.begin_training() nlp.begin_training()
assert not record.list assert not record.list
def test_lemmatizer_without_is_base_form_implementation():
    # Norwegian example from #5658
    lookups = Lookups()
    lookups.add_table("lemma_rules", {"noun": []})
    lookups.add_table("lemma_index", {"noun": {}})
    lookups.add_table("lemma_exc", {"noun": {"formuesskatten": ["formuesskatt"]}})
    lemmatizer = Lemmatizer(lookups, is_base_form=None)
    assert lemmatizer("Formuesskatten", "noun", {'Definite': 'def', 'Gender': 'masc', 'Number': 'sing'}) == ["formuesskatt"]

View File

@ -370,6 +370,6 @@ def test_vector_is_oov():
data[1] = 2.0 data[1] = 2.0
vocab.set_vector("cat", data[0]) vocab.set_vector("cat", data[0])
vocab.set_vector("dog", data[1]) vocab.set_vector("dog", data[1])
assert vocab["cat"].is_oov is True assert vocab["cat"].is_oov is False
assert vocab["dog"].is_oov is True assert vocab["dog"].is_oov is False
assert vocab["hamster"].is_oov is False assert vocab["hamster"].is_oov is True
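
Note on the flipped `is_oov` assertions: with this change a word counts as out-of-vocabulary when it has no entry in the vectors table, which is what the `not in` added to `Lexeme.is_oov` and `Token.is_oov` expresses. The toy below illustrates only that membership test. `TinyVocab` is a hypothetical stand-in backed by a plain dict, not spaCy's `Vocab`.

```python
class TinyVocab:
    """Toy vocabulary for illustration; vectors are stored in a plain dict."""

    def __init__(self):
        self.vectors = {}

    def set_vector(self, word, vector):
        self.vectors[word] = vector

    def is_oov(self, word):
        # Out-of-vocabulary means: no vector stored for this word.
        return word not in self.vectors


vocab = TinyVocab()
vocab.set_vector("cat", [1.0, 1.0])
vocab.set_vector("dog", [2.0, 2.0])
print(vocab.is_oov("cat"), vocab.is_oov("dog"), vocab.is_oov("hamster"))
```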

View File

@ -1062,7 +1062,7 @@ cdef class Doc:
DOCS: https://spacy.io/api/doc#to_bytes DOCS: https://spacy.io/api/doc#to_bytes
""" """
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID, NORM] # TODO: ENT_KB_ID ? array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID, NORM, ENT_KB_ID]
if self.is_tagged: if self.is_tagged:
array_head.extend([TAG, POS]) array_head.extend([TAG, POS])
# If doc parsed add head and dep attribute # If doc parsed add head and dep attribute
@ -1071,6 +1071,14 @@ cdef class Doc:
# Otherwise add sent_start # Otherwise add sent_start
else: else:
array_head.append(SENT_START) array_head.append(SENT_START)
strings = set()
for token in self:
strings.add(token.tag_)
strings.add(token.lemma_)
strings.add(token.dep_)
strings.add(token.ent_type_)
strings.add(token.ent_kb_id_)
strings.add(token.norm_)
# Msgpack doesn't distinguish between lists and tuples, which is # Msgpack doesn't distinguish between lists and tuples, which is
# vexing for user data. As a best guess, we *know* that within # vexing for user data. As a best guess, we *know* that within
# keys, we must have tuples. In values we just have to hope # keys, we must have tuples. In values we just have to hope
@ -1082,6 +1090,7 @@ cdef class Doc:
"sentiment": lambda: self.sentiment, "sentiment": lambda: self.sentiment,
"tensor": lambda: self.tensor, "tensor": lambda: self.tensor,
"cats": lambda: self.cats, "cats": lambda: self.cats,
"strings": lambda: list(strings),
"has_unknown_spaces": lambda: self.has_unknown_spaces "has_unknown_spaces": lambda: self.has_unknown_spaces
} }
if "user_data" not in exclude and self.user_data: if "user_data" not in exclude and self.user_data:
@ -1110,6 +1119,7 @@ cdef class Doc:
"sentiment": lambda b: None, "sentiment": lambda b: None,
"tensor": lambda b: None, "tensor": lambda b: None,
"cats": lambda b: None, "cats": lambda b: None,
"strings": lambda b: None,
"user_data_keys": lambda b: None, "user_data_keys": lambda b: None,
"user_data_values": lambda b: None, "user_data_values": lambda b: None,
"has_unknown_spaces": lambda b: None "has_unknown_spaces": lambda b: None
@ -1130,6 +1140,9 @@ cdef class Doc:
self.tensor = msg["tensor"] self.tensor = msg["tensor"]
if "cats" not in exclude and "cats" in msg: if "cats" not in exclude and "cats" in msg:
self.cats = msg["cats"] self.cats = msg["cats"]
if "strings" not in exclude and "strings" in msg:
for s in msg["strings"]:
self.vocab.strings.add(s)
if "has_unknown_spaces" not in exclude and "has_unknown_spaces" in msg: if "has_unknown_spaces" not in exclude and "has_unknown_spaces" in msg:
self.has_unknown_spaces = msg["has_unknown_spaces"] self.has_unknown_spaces = msg["has_unknown_spaces"]
start = 0 start = 0
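
Note on the new `"strings"` entry: the attribute array stores string IDs, so a receiving vocab that has never seen a lemma, norm or KB ID cannot turn the ID back into text unless the strings travel with the message. The sketch below shows that round trip with a toy store. `TinyStringStore`, `to_message` and `from_message` are hypothetical names, and CRC32 is used only as a convenient stable hash for the illustration, not because spaCy uses it.

```python
import zlib


class TinyStringStore:
    """Toy string-to-ID table; CRC32 is an assumption made for this sketch."""

    def __init__(self):
        self._strings = {}

    def add(self, string):
        key = zlib.crc32(string.encode("utf8"))
        self._strings[key] = string
        return key

    def __getitem__(self, key):
        return self._strings[key]


def to_message(token_ids, store):
    # Ship the IDs plus every string needed to resolve them.
    return {"ids": list(token_ids), "strings": sorted({store[i] for i in token_ids})}


def from_message(msg, store):
    # Register the shipped strings first, then resolve the IDs.
    for s in msg.get("strings", []):
        store.add(s)
    return [store[i] for i in msg["ids"]]


sender = TinyStringStore()
ids = [sender.add("lemma"), sender.add("norm"), sender.add("ent_kb_id")]
msg = to_message(ids, sender)

receiver = TinyStringStore()
print(from_message(msg, receiver))
```

Dropping the `"strings"` key from `msg` makes the lookup on the fresh receiver fail with a `KeyError`, which is the kind of gap the added test assertions for `lemma_`, `norm_` and `ent_kb_id_` guard against.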

View File

@ -923,7 +923,7 @@ cdef class Token:
@property @property
def is_oov(self): def is_oov(self):
"""RETURNS (bool): Whether the token is out-of-vocabulary.""" """RETURNS (bool): Whether the token is out-of-vocabulary."""
return self.c.lex.orth in self.vocab.vectors return self.c.lex.orth not in self.vocab.vectors
@property @property
def is_stop(self): def is_stop(self):

View File

@ -187,6 +187,10 @@ def load_model_from_path(model_path, meta=False, **overrides):
pipeline = nlp.Defaults.pipe_names pipeline = nlp.Defaults.pipe_names
elif pipeline in (False, None): elif pipeline in (False, None):
pipeline = [] pipeline = []
# skip "vocab" from overrides in component initialization since vocab is
# already configured from overrides when nlp is initialized above
if "vocab" in overrides:
del overrides["vocab"]
for name in pipeline: for name in pipeline:
if name not in disable: if name not in disable:
config = meta.get("pipeline_args", {}).get(name, {}) config = meta.get("pipeline_args", {}).get(name, {})

View File

@ -105,8 +105,8 @@ The Chinese language class supports three word segmentation options:
> ``` > ```
1. **Character segmentation:** Character segmentation is the default 1. **Character segmentation:** Character segmentation is the default
segmentation option. It's enabled when you create a new `Chinese` segmentation option. It's enabled when you create a new `Chinese` language
language class or call `spacy.blank("zh")`. class or call `spacy.blank("zh")`.
2. **Jieba:** `Chinese` uses [Jieba](https://github.com/fxsjy/jieba) for word 2. **Jieba:** `Chinese` uses [Jieba](https://github.com/fxsjy/jieba) for word
segmentation with the tokenizer option `{"segmenter": "jieba"}`. segmentation with the tokenizer option `{"segmenter": "jieba"}`.
3. **PKUSeg**: As of spaCy v2.3.0, support for 3. **PKUSeg**: As of spaCy v2.3.0, support for

View File

@ -1,5 +1,58 @@
{ {
"resources": [ "resources": [
{
"id": "spacy-streamlit",
"title": "spacy-streamlit",
"slogan": "spaCy building blocks for Streamlit apps",
"github": "explosion/spacy-streamlit",
"description": "This package contains utilities for visualizing spaCy models and building interactive spaCy-powered apps with [Streamlit](https://streamlit.io). It includes various building blocks you can use in your own Streamlit app, like visualizers for **syntactic dependencies**, **named entities**, **text classification**, **semantic similarity** via word vectors, token attributes, and more.",
"pip": "spacy-streamlit",
"category": ["visualizers"],
"thumb": "https://i.imgur.com/mhEjluE.jpg",
"image": "https://user-images.githubusercontent.com/13643239/85388081-f2da8700-b545-11ea-9bd4-e303d3c5763c.png",
"code_example": [
"import spacy_streamlit",
"",
"models = [\"en_core_web_sm\", \"en_core_web_md\"]",
"default_text = \"Sundar Pichai is the CEO of Google.\"",
"spacy_streamlit.visualize(models, default_text)"
],
"author": "Ines Montani",
"author_links": {
"twitter": "_inesmontani",
"github": "ines",
"website": "https://ines.io"
}
},
{
"id": "spaczz",
"title": "spaczz",
"slogan": "Fuzzy matching and more for spaCy.",
"description": "Spaczz provides fuzzy matching and multi-token regex matching functionality for spaCy. Spaczz's components have similar APIs to their spaCy counterparts and spaczz pipeline components can integrate into spaCy pipelines where they can be saved/loaded as models.",
"github": "gandersen101/spaczz",
"pip": "spaczz",
"code_example": [
"import spacy",
"from spaczz.pipeline import SpaczzRuler",
"",
"nlp = spacy.blank('en')",
"ruler = SpaczzRuler(nlp)",
"ruler.add_patterns([{'label': 'PERSON', 'pattern': 'Bill Gates', 'type': 'fuzzy'}])",
"nlp.add_pipe(ruler)",
"",
"doc = nlp('Oops, I spelled Bill Gatez wrong.')",
"print([(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents])"
],
"code_language": "python",
"url": "https://spaczz.readthedocs.io/en/latest/",
"author": "Grant Andersen",
"author_links": {
"twitter": "gandersen101",
"github": "gandersen101"
},
"category": ["pipeline"],
"tags": ["fuzzy-matching", "regex"]
},
{ {
"id": "spacy-universal-sentence-encoder", "id": "spacy-universal-sentence-encoder",
"title": "SpaCy - Universal Sentence Encoder", "title": "SpaCy - Universal Sentence Encoder",
@ -1238,6 +1291,19 @@
"youtube": "K1elwpgDdls", "youtube": "K1elwpgDdls",
"category": ["videos"] "category": ["videos"]
}, },
{
"type": "education",
"id": "video-spacy-course-es",
"title": "NLP avanzado con spaCy · Un curso en línea gratis",
"description": "spaCy es un paquete moderno de Python para hacer Procesamiento de Lenguaje Natural de potencia industrial. En este curso en línea, interactivo y gratuito, aprenderás a usar spaCy para construir sistemas avanzados de comprensión de lenguaje natural usando enfoques basados en reglas y en machine learning.",
"url": "https://course.spacy.io/es",
"author": "Camila Gutiérrez",
"author_links": {
"twitter": "Mariacamilagl30"
},
"youtube": "RNiLVCE5d4k",
"category": ["videos"]
},
{ {
"type": "education", "type": "education",
"id": "video-intro-to-nlp-episode-1", "id": "video-intro-to-nlp-episode-1",
@ -1294,6 +1360,20 @@
"youtube": "IqOJU1-_Fi0", "youtube": "IqOJU1-_Fi0",
"category": ["videos"] "category": ["videos"]
}, },
{
"type": "education",
"id": "video-intro-to-nlp-episode-5",
"title": "Intro to NLP with spaCy (5)",
"slogan": "Episode 5: Rules vs. Machine Learning",
"description": "In this new video series, data science instructor Vincent Warmerdam gets started with spaCy, an open-source library for Natural Language Processing in Python. His mission: building a system to automatically detect programming languages in large volumes of text. Follow his process from the first idea to a prototype all the way to data collection and training a statistical named entity recognition model from scratch.",
"author": "Vincent Warmerdam",
"author_links": {
"twitter": "fishnets88",
"github": "koaning"
},
"youtube": "f4sqeLRzkPg",
"category": ["videos"]
},
{ {
"type": "education", "type": "education",
"id": "video-spacy-irl-entity-linking", "id": "video-spacy-irl-entity-linking",
@ -2348,6 +2428,56 @@
}, },
"category": ["pipeline", "conversational", "research"], "category": ["pipeline", "conversational", "research"],
"tags": ["spell check", "correction", "preprocessing", "translation", "correction"] "tags": ["spell check", "correction", "preprocessing", "translation", "correction"]
},
{
"id": "texthero",
"title": "Texthero",
"slogan": "Text preprocessing, representation and visualization from zero to hero.",
"description": "Texthero is a Python package to work with text data efficiently. It empowers NLP developers with a tool to quickly understand any text-based dataset and it provides a solid pipeline to clean and represent text data, from zero to hero.",
"github": "jbesomi/texthero",
"pip": "texthero",
"code_example": [
"import texthero as hero",
"import pandas as pd",
"",
"df = pd.read_csv('https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv')",
"df['named_entities'] = hero.named_entities(df['text'])",
"df.head()"
],
"code_language": "python",
"url": "https://texthero.org",
"thumb": "https://texthero.org/img/T.png",
"image": "https://texthero.org/docs/assets/texthero.png",
"author": "Jonathan Besomi",
"author_links": {
"github": "jbesomi",
"website": "https://besomi.ai"
},
"category": ["standalone"]
},
{
"id": "cov-bsv",
"title": "VA COVID-19 NLP BSV",
"slogan": "spaCy pipeline for COVID-19 surveillance.",
"github": "abchapman93/VA_COVID-19_NLP_BSV",
"description": "A spaCy rule-based pipeline for identifying positive cases of COVID-19 from clinical text. A version of this system was deployed as part of the US Department of Veterans Affairs biosurveillance response to COVID-19.",
"pip": "cov-bsv",
"code_example": [
"import cov_bsv",
"",
"nlp = cov_bsv.load()",
"text = 'Pt tested for COVID-19. His wife was recently diagnosed with novel coronavirus. SARS-COV-2: Detected'",
"doc = nlp(text)",
"",
"print(doc.ents)",
"print(doc._.cov_classification)",
"cov_bsv.visualize_doc(doc)"
],
"category": ["pipeline", "standalone", "biomedical", "scientific"],
"tags": ["clinical", "epidemiology", "covid-19", "surveillance"],
"author": "Alec Chapman",
"author_links": {
"github": "abchapman93"
}
} }
], ],