Merge pull request #6727 from adrianeboyd/chore/update-develop-from-master-rc3

Ines Montani 2021-01-15 11:44:28 +11:00 committed by GitHub
commit 57369909c0
50 changed files with 30401 additions and 28847 deletions

106
.github/contributors/AMArostegui.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Antonio Miras |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 11/01/2020 |
| GitHub username | AMArostegui |
| Website (optional) | |

106
.github/contributors/alexcombessie.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Alex COMBESSIE |
| Company name (if applicable) | Dataiku |
| Title or role (if applicable) | R&D Engineer |
| Date | 2020-10-27 |
| GitHub username | alexcombessie |
| Website (optional) | |

107
.github/contributors/cristianasp.md vendored Normal file

@@ -0,0 +1,107 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Cristiana S Parada |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-11-04 |
| GitHub username | cristianasp |
| Website (optional) | |

106
.github/contributors/lorenanda.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Lorena Ciutacu |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-12-23 |
| GitHub username | lorenanda |
| Website (optional) | lorenaciutacu.com/ |

106
.github/contributors/ophelielacroix.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|-------------------------------|-----------------|
| Name | Ophélie Lacroix |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | |
| GitHub username | ophelielacroix |
| Website (optional) | |

106
.github/contributors/rafguns.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Raf Guns |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-12-09 |
| GitHub username | rafguns |
| Website (optional) | |

106
.github/contributors/thomasbird.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ---------------------- |
| Name | Thomas Bird |
| Company name (if applicable) | Leap Beyond Group |
| Title or role (if applicable) | Data Scientist |
| Date | 15/12/2020 |
| GitHub username | thomasbird |
| Website (optional) | https://leapbeyond.ai |

.github/contributors/yosiasz.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Josiah Solomon |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-12-15 |
| GitHub username | yosiasz |
| Website (optional) | |

@ -1,6 +1,8 @@
-@unpublished{spacy2,
-  AUTHOR = {Honnibal, Matthew and Montani, Ines},
-  TITLE = {{spaCy 2}: Natural language understanding with {B}loom embeddings, convolutional neural networks and incremental parsing},
-  YEAR = {2017},
-  Note = {To appear}
+@software{spacy,
+  author = {Honnibal, Matthew and Montani, Ines and Van Landeghem, Sofie and Boyd, Adriane},
+  title = {{spaCy: Industrial-strength Natural Language Processing in Python}},
+  year = 2020,
+  publisher = {Zenodo},
+  doi = {10.5281/zenodo.1212303},
+  url = {https://doi.org/10.5281/zenodo.1212303}
 }

spacy/lang/am/__init__.py Normal file

@ -0,0 +1,27 @@
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_SUFFIXES
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language
from ...attrs import LANG
from ...util import update_exc


class AmharicDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "am"
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
    suffixes = TOKENIZER_SUFFIXES
    writing_system = {"direction": "ltr", "has_case": False, "has_letters": True}


class Amharic(Language):
    lang = "am"
    Defaults = AmharicDefaults


__all__ = ["Amharic"]

spacy/lang/am/examples.py Normal file

@ -0,0 +1,18 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.am.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"አፕል የዩኬን ጅምር ድርጅት በ 1 ቢሊዮን ዶላር ለመግዛት አስቧል።",
"የራስ ገዝ መኪኖች የኢንሹራንስ ኃላፊነትን ወደ አምራቾች ያዛውራሉ",
"ሳን ፍራንሲስኮ የእግረኛ መንገድ አቅርቦት ሮቦቶችን ማገድን ይመለከታል",
"ለንደን በእንግሊዝ የምትገኝ ትልቅ ከተማ ናት።",
"የት ነህ?",
"የፈረንሳይ ፕሬዝዳንት ማናቸው?",
"የአሜሪካ ዋና ከተማ ምንድነው?",
"ባራክ ኦባማ መቼ ተወለደ?",
]

spacy/lang/am/lex_attrs.py Normal file

@ -0,0 +1,101 @@
from ...attrs import LIKE_NUM
_num_words = [
"ዜሮ",
"አንድ",
"ሁለት",
"ሶስት",
"አራት",
"አምስት",
"ስድስት",
"ሰባት",
"ስምንት",
"ዘጠኝ",
"አስር",
"አስራ አንድ",
"አስራ ሁለት",
"አስራ ሶስት",
"አስራ አራት",
"አስራ አምስት",
"አስራ ስድስት",
"አስራ ሰባት",
"አስራ ስምንት",
"አስራ ዘጠኝ",
"ሃያ",
"ሰላሳ",
"አርባ",
"ሃምሳ",
"ስልሳ",
"ሰባ",
"ሰማንያ",
"ዘጠና",
"መቶ",
"ሺህ",
"ሚሊዮን",
"ቢሊዮን",
"ትሪሊዮን",
"ኳድሪሊዮን",
"ገጅሊዮን",
"ባዝሊዮን"
]
_ordinal_words = [
"አንደኛ",
"ሁለተኛ",
"ሶስተኛ",
"አራተኛ",
"አምስተኛ",
"ስድስተኛ",
"ሰባተኛ",
"ስምንተኛ",
"ዘጠነኛ",
"አስረኛ",
"አስራ አንደኛ",
"አስራ ሁለተኛ",
"አስራ ሶስተኛ",
"አስራ አራተኛ",
"አስራ አምስተኛ",
"አስራ ስድስተኛ",
"አስራ ሰባተኛ",
"አስራ ስምንተኛ",
"አስራ ዘጠነኛ",
"ሃያኛ",
"ሰላሳኛ",
"አርባኛ",
"አምሳኛ",
"ስድሳኛ",
"ሰባኛ",
"ሰማንያኛ",
"ዘጠናኛ",
"መቶኛ",
"ሺኛ",
"ሚሊዮንኛ",
"ቢሊዮንኛ",
"ትሪሊዮንኛ"
]


def like_num(text):
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    text_lower = text.lower()
    if text_lower in _num_words:
        return True
    # Check ordinal number
    if text_lower in _ordinal_words:
        return True
    if text_lower.endswith(""):
        if text_lower[:-2].isdigit():
            return True
    return False


LEX_ATTRS = {LIKE_NUM: like_num}
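
The numeric checks in `like_num` can be tried outside spaCy. The following is a minimal sketch, not the library's code: `NUM_WORDS` is a hypothetical four-word subset of the `_num_words` list, `like_num_sketch` is an illustrative name, and the ordinal handling is left out.

```python
# Hypothetical subset of the number words listed in the diff; not the full list.
NUM_WORDS = {"ዜሮ", "አንድ", "ሁለት", "መቶ"}


def like_num_sketch(text: str) -> bool:
    """Return True if `text` looks like a number (simplified sketch)."""
    # Drop a single leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands separators and decimal points.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Accept simple fractions such as "3/4".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Fall back to a lookup in the number-word set.
    return text in NUM_WORDS
```

Under these assumptions, `like_num_sketch("1,000")`, `like_num_sketch("3/4")` and `like_num_sketch("ሁለት")` are all true, while an ordinary word such as `"ቤት"` is rejected.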

@ -0,0 +1,19 @@
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
from ..char_classes import UNITS, ALPHA_UPPER
_list_punct = LIST_PUNCT + "፡ ። ፣ ፤ ፥ ፦ ፧".strip().split()
_suffixes = (
_list_punct
+ LIST_ELLIPSES
+ LIST_QUOTES
+ [
r"(?<=[0-9])\+",
# Amharic is written from Left-To-Right
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
r"(?<=[0-9])(?:{u})".format(u=UNITS),
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
]
)
TOKENIZER_SUFFIXES = _suffixes
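
To see what suffix patterns of this kind match, here is a small sketch using only the standard `re` module. It assumes each pattern is tried against the end of a token; `SUFFIX_RE`, `PLUS_AFTER_DIGIT` and `split_suffix` are illustrative names and not part of spaCy.

```python
import re

# Ethiopic punctuation marks taken from the list in the diff.
ETHIOPIC_PUNCT = "፡ ። ፣ ፤ ፥ ፦ ፧".split()

# One alternation that matches any of the marks at the end of a string.
SUFFIX_RE = re.compile("(?:" + "|".join(re.escape(p) for p in ETHIOPIC_PUNCT) + ")$")

# The lookbehind pattern from the diff, anchored to the end of the token:
# a "+" counts as a suffix only when a digit precedes it.
PLUS_AFTER_DIGIT = re.compile(r"(?<=[0-9])\+$")


def split_suffix(token: str) -> list:
    """Split one trailing Ethiopic punctuation mark off `token`, if present."""
    match = SUFFIX_RE.search(token)
    if match is None:
        return [token]
    return [token[: match.start()], match.group()]
```

With these definitions, `split_suffix("ናት።")` separates the full stop from the word, and `PLUS_AFTER_DIGIT` matches `"5+"` but not `"a+"`.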

@ -0,0 +1,6 @@
# Stop words
STOP_WORDS = set(
"""
ግን አንቺ አንተ እናንተ ያንተ ያንቺ የናንተ ራስህን ራስሽን ራሳችሁን
""".split()
)

@ -0,0 +1,22 @@
from ...symbols import ORTH, NORM
_exc = {}
for exc_data in [
{ORTH: "ት/ቤት"},
{ORTH: "ወ/ሮ", NORM: "ወይዘሮ"},
]:
_exc[exc_data[ORTH]] = [exc_data]
for orth in [
"ዓ.ም.",
"ኪ.ሜ.",
]:
_exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = _exc
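
The shape of the resulting exception table is easier to see with plain data. This sketch uses the strings `"orth"` and `"norm"` as stand-ins for spaCy's `ORTH` and `NORM` symbols (an assumption made so the snippet runs without spaCy); the entries are the ones from the diff.

```python
# Plain strings stand in for spaCy's ORTH and NORM symbols (assumption for this sketch).
ORTH, NORM = "orth", "norm"

exceptions = {}

# Entries that may carry extra attributes are keyed by their surface form.
for exc_data in [
    {ORTH: "ት/ቤት"},
    {ORTH: "ወ/ሮ", NORM: "ወይዘሮ"},
]:
    exceptions[exc_data[ORTH]] = [exc_data]

# Abbreviations that only need to stay in one piece.
for orth in ["ዓ.ም.", "ኪ.ሜ."]:
    exceptions[orth] = [{ORTH: orth}]
```

Each key maps to a list of token descriptions, so a string containing `/` or `.` is kept as one token rather than being split on punctuation.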

@ -2,6 +2,8 @@ split_chars = lambda char: list(char.strip().split(" "))
 merge_chars = lambda char: char.strip().replace(" ", "|")
 group_chars = lambda char: char.strip().replace(" ", "")
+_ethiopic = r"\u1200-\u137F"
 _bengali = r"\u0980-\u09FF"
 _hebrew = r"\u0591-\u05F4\uFB1D-\uFB4F"
@ -232,7 +234,8 @@ _lower = (
 )
 _uncased = (
-    _bengali
+    _ethiopic
+    + _bengali
     + _hebrew
     + _persian
     + _sinhala
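
As a quick check of what the new `_ethiopic` range covers, the sketch below builds a character class from the same code point range with the standard `re` module. `ETHIOPIC_RE` and `is_ethiopic` are illustrative names, not spaCy identifiers.

```python
import re

# Same code point range as `_ethiopic` in the diff (the Ethiopic Unicode block).
ETHIOPIC = r"\u1200-\u137F"
ETHIOPIC_RE = re.compile(f"[{ETHIOPIC}]+")


def is_ethiopic(text: str) -> bool:
    """Return True if every character of a non-empty `text` is in the range."""
    return ETHIOPIC_RE.fullmatch(text) is not None
```

Note that the range includes the Ethiopic punctuation marks as well as the letters, so `is_ethiopic("።")` is true under this definition.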

@ -2,6 +2,7 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
+from .syntax_iterators import SYNTAX_ITERATORS
 from ...language import Language
@ -11,6 +12,7 @@ class DanishDefaults(Language.Defaults):
     suffixes = TOKENIZER_SUFFIXES
     lex_attr_getters = LEX_ATTRS
     stop_words = STOP_WORDS
+    syntax_iterators = SYNTAX_ITERATORS
 class Danish(Language):

@ -0,0 +1,71 @@
from ...symbols import NOUN, PROPN, PRON, VERB, AUX
from ...errors import Errors


def noun_chunks(doclike):
    def is_verb_token(tok):
        return tok.pos in [VERB, AUX]

    def get_left_bound(doc, root):
        left_bound = root
        for tok in reversed(list(root.lefts)):
            if tok.dep in np_left_deps:
                left_bound = tok
        return left_bound

    def get_right_bound(doc, root):
        right_bound = root
        for tok in root.rights:
            if tok.dep in np_right_deps:
                right = get_right_bound(doc, tok)
                if list(
                    filter(
                        lambda t: is_verb_token(t) or t.dep in stop_deps,
                        doc[root.i : right.i],
                    )
                ):
                    break
                else:
                    right_bound = right
        return right_bound

    def get_bounds(doc, root):
        return get_left_bound(doc, root), get_right_bound(doc, root)

    doc = doclike.doc
    if not doc.has_annotation("DEP"):
        raise ValueError(Errors.E029)
    if not len(doc):
        return
    left_labels = [
        "det",
        "fixed",
        "nmod:poss",
        "amod",
        "flat",
        "goeswith",
        "nummod",
        "appos",
    ]
    right_labels = ["fixed", "nmod:poss", "amod", "flat", "goeswith", "nummod", "appos"]
    stop_labels = ["punct"]
    np_label = doc.vocab.strings.add("NP")
    np_left_deps = [doc.vocab.strings.add(label) for label in left_labels]
    np_right_deps = [doc.vocab.strings.add(label) for label in right_labels]
    stop_deps = [doc.vocab.strings.add(label) for label in stop_labels]
    prev_right = -1
    for token in doclike:
        if token.pos in [PROPN, NOUN, PRON]:
            left, right = get_bounds(doc, token)
            if left.i <= prev_right:
                continue
            yield left.i, right.i + 1, np_label
            prev_right = right.i


SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}
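
The `prev_right` bookkeeping at the end of `noun_chunks` is what keeps the yielded chunks from overlapping. A minimal sketch of that step on plain index pairs, assuming the candidate spans arrive in document order (`drop_overlapping` is a hypothetical helper, not part of spaCy):

```python
def drop_overlapping(spans):
    """Yield (start, end) pairs, skipping spans that begin inside the last kept one.

    `spans` holds inclusive (left, right) token indices, mirroring `left.i`
    and `right.i` in the iterator above. The yielded end index is exclusive.
    """
    prev_right = -1
    for left, right in spans:
        # A span starting at or before the previous right edge would overlap.
        if left <= prev_right:
            continue
        yield left, right + 1
        prev_right = right
```

For the candidates `[(0, 2), (1, 2), (3, 3), (3, 5)]` only the first and third spans survive, since the other two start inside a span that was already emitted.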

@ -330,7 +330,6 @@ for exc_data in [
 # Other contractions with leading apostrophe
 for exc_data in [
-    {ORTH: "cause", NORM: "because"},
     {ORTH: "em", NORM: "them"},
     {ORTH: "ll", NORM: "will"},
     {ORTH: "nuff", NORM: "enough"},

@ -19,34 +19,24 @@ def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
     np_left_deps = [doc.vocab.strings.add(label) for label in left_labels]
     np_right_deps = [doc.vocab.strings.add(label) for label in right_labels]
     stop_deps = [doc.vocab.strings.add(label) for label in stop_labels]
+    prev_right = -1
     for token in doclike:
         if token.pos in [PROPN, NOUN, PRON]:
             left, right = noun_bounds(
                 doc, token, np_left_deps, np_right_deps, stop_deps
             )
+            if left.i <= prev_right:
+                continue
             yield left.i, right.i + 1, np_label
-            token = right
-        token = next_token(token)
+            prev_right = right.i
 def is_verb_token(token: Token) -> bool:
     return token.pos in [VERB, AUX]
-def next_token(token: Token) -> Optional[Token]:
-    try:
-        return token.nbor()
-    except IndexError:
-        return None
-def noun_bounds(
-    doc: Doc,
-    root: Token,
-    np_left_deps: List[str],
-    np_right_deps: List[str],
-    stop_deps: List[str],
-) -> Tuple[Token, Token]:
+def noun_bounds(doc, root, np_left_deps, np_right_deps, stop_deps):
     left_bound = root
     for token in reversed(list(root.lefts)):
         if token.dep in np_left_deps:

@ -1,86 +1,80 @@
STOP_WORDS = set( STOP_WORDS = set(
""" """
a à â abord absolument afin ah ai aie ailleurs ainsi ait allaient allo allons a à â abord afin ah ai aie ainsi ait allaient allons
allô alors anterieur anterieure anterieures apres après as assez attendu au alors anterieur anterieure anterieures apres après as assez attendu au
aucun aucune aujourd aujourd'hui aupres auquel aura auraient aurait auront aucun aucune aujourd aujourd'hui aupres auquel aura auraient aurait auront
aussi autre autrefois autrement autres autrui aux auxquelles auxquels avaient aussi autre autrement autres autrui aux auxquelles auxquels avaient
avais avait avant avec avoir avons ayant avais avait avant avec avoir avons ayant
bah bas basee bat beau beaucoup bien bigre boum bravo brrr bas basee bat
c' c ça car ce ceci cela celle celle-ci celle-là celles celles-ci celles-là celui c' c ça car ce ceci cela celle celle-ci celle-là celles celles-ci celles-là celui
celui-ci celui- cent cependant certain certaine certaines certains certes ces celui-ci celui- cent cependant certain certaine certaines certains certes ces
cet cette ceux ceux-ci ceux- chacun chacune chaque cher chers chez chiche cet cette ceux ceux-ci ceux- chacun chacune chaque chez ci cinq cinquantaine cinquante
chut chère chères ci cinq cinquantaine cinquante cinquantième cinquième clac cinquantième cinquième combien comme comment compris concernant
clic combien comme comment comparable comparables compris concernant contre
couic crac
d' d da dans de debout dedans dehors deja delà depuis dernier derniere derriere d' d da dans de debout dedans dehors deja delà depuis derriere
derrière des desormais desquelles desquels dessous dessus deux deuxième derrière des desormais desquelles desquels dessous dessus deux deuxième
deuxièmement devant devers devra different differentes differents différent deuxièmement devant devers devra different differentes differents différent
différente différentes différents dire directe directement dit dite dits divers différente différentes différents dire directe directement dit dite dits divers
diverse diverses dix dix-huit dix-neuf dix-sept dixième doit doivent donc dont diverse diverses dix dix-huit dix-neuf dix-sept dixième doit doivent donc dont
douze douzième dring du duquel durant dès désormais douze douzième du duquel durant dès désormais
effet egale egalement egales eh elle elle-même elles elles-mêmes en encore effet egale egalement egales eh elle elle-même elles elles-mêmes en encore
enfin entre envers environ es ès est et etaient étaient etais étais etait était enfin entre envers environ es ès est et etaient étaient etais étais etait était
etant étant etc été etre être eu euh eux eux-mêmes exactement excepté extenso etant étant etc été etre être eu eux eux-mêmes exactement excepté
exterieur
fais faisaient faisant fait façon feront fi flac floc font fais faisaient faisant fait façon feront font
gens gens
ha hein hem hep hi ho holà hop hormis hors hou houp hue hui huit huitième hum ha hem hep hi ho hormis hors hou houp hue hui huit huitième
hurrah hélas i il ils importe i il ils importe
j' j je jusqu jusque juste j' j je jusqu jusque juste
l' l la laisser laquelle las le lequel les lesquelles lesquels leur leurs longtemps l' l la laisser laquelle le lequel les lesquelles lesquels leur leurs longtemps
lors lorsque lui lui-meme lui-même lès lors lorsque lui lui-meme lui-même lès
m' m ma maint maintenant mais malgre malgré maximale me meme memes merci mes mien m' m ma maint maintenant mais malgre me meme memes merci mes mien
mienne miennes miens mille mince minimale moi moi-meme moi-même moindres moins mienne miennes miens mille moi moi-meme moi-même moindres moins
mon moyennant même mêmes mon même mêmes
n' n na naturel naturelle naturelles ne neanmoins necessaire necessairement neuf n' n na ne neanmoins neuvième ni nombreuses nombreux nos notamment
neuvième ni nombreuses nombreux non nos notamment notre nous nous-mêmes nouveau notre nous nous-mêmes nouveau nul néanmoins nôtre nôtres
nul néanmoins nôtre nôtres
o ô oh ohé ollé olé on ont onze onzième ore ou ouf ouias oust ouste outre o ô on ont onze onzième ore ou ouias oust outre
ouvert ouverte ouverts ouvert ouverte ouverts
paf pan par parce parfois parle parlent parler parmi parseme partant par parce parfois parle parlent parler parmi parseme partant
particulier particulière particulièrement pas passé pendant pense permet pas pendant pense permet personne peu peut peuvent peux plus
personne peu peut peuvent peux pff pfft pfut pif pire plein plouf plus plusieurs plutôt possible possibles pour pourquoi
plusieurs plutôt possessif possessifs possible possibles pouah pour pourquoi
pourrais pourrait pouvait prealable precisement premier première premièrement pourrais pourrait pouvait prealable precisement premier première premièrement
pres probable probante procedant proche près psitt pu puis puisque pur pure pres procedant proche près pu puis puisque
qu' qu quand quant quant-à-soi quanta quarante quatorze quatre quatre-vingt qu' qu quand quant quant-à-soi quanta quarante quatorze quatre quatre-vingt
quatrième quatrièmement que quel quelconque quelle quelles quelqu'un quelque quatrième quatrièmement que quel quelconque quelle quelles quelqu'un quelque
quelques quels qui quiconque quinze quoi quoique quelques quels qui quiconque quinze quoi quoique
rare rarement rares relative relativement remarquable rend rendre restant reste relative relativement rend rendre restant reste
restent restrictif retour revoici revoilà rien restent retour revoici revoilà
s' s sa sacrebleu sait sans sapristi sauf se sein seize selon semblable semblaient s' s sa sait sans sauf se seize selon semblable semblaient
semble semblent sent sept septième sera seraient serait seront ses seul seule semble semblent sent sept septième sera seraient serait seront ses seul seule
seulement si sien sienne siennes siens sinon six sixième soi soi-même soit seulement si sien sienne siennes siens sinon six sixième soi soi-même soit
soixante son sont sous souvent specifique specifiques speculatif stop soixante son sont sous souvent specifique specifiques stop
strictement subtiles suffisant suffisante suffit suis suit suivant suivante suffisant suffisante suffit suis suit suivant suivante
suivantes suivants suivre superpose sur surtout suivantes suivants suivre sur surtout
t' t ta tac tant tardive te tel telle tellement telles tels tenant tend tenir tente t' t ta tant te tel telle tellement telles tels tenant tend tenir tente
tes tic tien tienne tiennes tiens toc toi toi-même ton touchant toujours tous tes tien tienne tiennes tiens toi toi-même ton touchant toujours tous
tout toute toutefois toutes treize trente tres trois troisième troisièmement tout toute toutes treize trente tres trois troisième troisièmement
trop très tsoin tsouin tu tu
un une unes uniformement unique uniques uns un une unes uns
va vais vas vers via vif vifs vingt vivat vive vives vlan voici voilà vont vos va vais vas vers via vingt voici voilà vont vos
votre vous vous-mêmes vu vôtre vôtres votre vous vous-mêmes vu vôtre vôtres
zut
""".split() """.split()
) )

@ -1,6 +1,6 @@
STOP_WORDS = set( STOP_WORDS = set(
""" """
à às área acerca ademais adeus agora ainda algo algumas alguns ali além ambas ambos antes a à às área acerca ademais adeus agora ainda algo algumas alguns ali além ambas ambos antes
ao aos apenas apoia apoio apontar após aquela aquelas aquele aqueles aqui aquilo ao aos apenas apoia apoio apontar após aquela aquelas aquele aqueles aqui aquilo
as assim através atrás até as assim através atrás até
@ -14,7 +14,7 @@ da daquela daquele dar das de debaixo demais dentro depois des desde dessa desse
desta deste deve devem deverá dez dezanove dezasseis dezassete dezoito diante desta deste deve devem deverá dez dezanove dezasseis dezassete dezoito diante
direita disso diz dizem dizer do dois dos doze duas dão direita disso diz dizem dizer do dois dos doze duas dão
é és ela elas ele eles em embora enquanto entre então era essa essas esse esses esta e é és ela elas ele eles em embora enquanto entre então era essa essas esse esses esta
estado estar estará estas estava este estes esteve estive estivemos estiveram estado estar estará estas estava este estes esteve estive estivemos estiveram
estiveste estivestes estou está estás estão eu eventual exemplo estiveste estivestes estou está estás estão eu eventual exemplo
@ -36,7 +36,7 @@ na nada naquela naquele nas nem nenhuma nessa nesse nesta neste no nos nossa
nossas nosso nossos nova novas nove novo novos num numa nunca nuns não nível nós nossas nosso nossos nova novas nove novo novos num numa nunca nuns não nível nós
número números número números
obrigada obrigado oitava oitavo oito onde ontem onze ora os ou outra outras outros o obrigada obrigado oitava oitavo oito onde ontem onze ora os ou outra outras outros
para parece parte partir pegar pela pelas pelo pelos perto pode podem poder poderá para parece parte partir pegar pela pelas pelo pelos perto pode podem poder poderá
podia pois ponto pontos por porquanto porque porquê portanto porém posição podia pois ponto pontos por porquanto porque porquê portanto porém posição
@ -59,8 +59,8 @@ tudo tão têm
um uma umas uns usa usar último um uma umas uns usa usar último
vai vais valor veja vem vens ver vez vezes vinda vindo vinte você vocês vos vossa vai vais valor veja vem vens ver vez vezes vinda vindo vinte você vocês vos vossa
vossas vosso vossos vários vão vêm vós vossas vosso vossos vários vão vêm vós
zero zero
""".split() """.split()
) )

@ -8,11 +8,13 @@ aceasta
această această
aceea aceea
aceeasi aceeasi
aceeași
acei acei
aceia aceia
acel acel
acela acela
acelasi acelasi
același
acele acele
acelea acelea
acest acest
@ -24,12 +26,11 @@ acestia
acestui acestui
aceşti aceşti
aceştia aceştia
acești
aceștia
acolo acolo
acord acord
acum acum
adica adica
adică
ai ai
aia aia
aibă aibă
@ -53,6 +54,8 @@ alături
am am
anume anume
apoi apoi
apai
apăi
ar ar
are are
as as
@ -150,7 +153,9 @@ că
căci căci
cărei cărei
căror căror
cărora
cărui cărui
căruia
către către
d d
da da
@ -175,6 +180,8 @@ deşi
deși deși
din din
dinaintea dinaintea
dincolo
dincoace
dintr dintr
dintr- dintr-
dintre dintre
@ -186,6 +193,10 @@ drept
dupa dupa
după după
deunaseara
deunăseară
deunazi
deunăzi
e e
ea ea
ei ei
@ -220,7 +231,6 @@ geaba
graţie graţie
grație grație
h h
halbă
i i
ia ia
iar iar
@ -232,6 +242,7 @@ in
inainte inainte
inapoi inapoi
inca inca
incotro
incit incit
insa insa
intr intr
@ -252,6 +263,10 @@ m
ma ma
mai mai
mare mare
macar
măcar
mata
matale
mea mea
mei mei
mele mele
@ -274,11 +289,18 @@ mâine
mîine mîine
n n
na
ne ne
neincetat
neîncetat
nevoie nevoie
ni ni
nici nici
nicidecum
nicidecat
nicidecât
niciodata niciodata
niciodată
nicăieri nicăieri
nimeni nimeni
nimeri nimeri
@ -300,6 +322,10 @@ noștri
nu nu
numai numai
o o
odata
odată
odinioara
odinioară
opt opt
or or
ori ori
@ -314,7 +340,9 @@ oricît
oriunde oriunde
p p
pai pai
păi
parca parca
parcă
patra patra
patru patru
patrulea patrulea
@ -331,13 +359,11 @@ prima
primul primul
prin prin
printr- printr-
printre
putini putini
puţin puţin
puţina puţina
puţină puţină
puțin
puțina
puțină
până până
pînă pînă
r r
@ -415,6 +441,7 @@ unuia
unul unul
v v
va va
vai
vi vi
voastre voastre
voastră voastră

spacy/lang/ti/__init__.py Normal file

@ -0,0 +1,27 @@
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_SUFFIXES
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language
from ...attrs import LANG
from ...util import update_exc


class TigrinyaDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "ti"
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
    suffixes = TOKENIZER_SUFFIXES
    writing_system = {"direction": "ltr", "has_case": False, "has_letters": True}


class Tigrinya(Language):
    lang = "ti"
    Defaults = TigrinyaDefaults


__all__ = ["Tigrinya"]

spacy/lang/ti/examples.py Normal file

@ -0,0 +1,18 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.ti.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"አፕል ብዩኬ ትርከብ ንግድ ብ1 ቢሊዮን ዶላር ንምግዛዕ ሐሲባ።",
"ፈላማይ ክታበት ኮቪድ 19 ተጀሚሩ፤ሓዱሽ ተስፋ ሂቡ ኣሎ",
"ቻንስለር ጀርመን ኣንገላ መርከል ዝርግሓ ቫይረስ ኮሮና ንምክልካል ጽኑዕ እገዳ ክግበር ጸዊዓ",
"ለንደን ብዓዲ እንግሊዝ ትርከብ ዓባይ ከተማ እያ።",
"ናበይ አለኻ፧",
"ናይ ፈረንሳይ ፕሬዝዳንት መን እዩ፧",
"ናይ አሜሪካ ዋና ከተማ እንታይ እያ፧",
"ኦባማ መዓስ ተወሊዱ፧",
]

spacy/lang/ti/lex_attrs.py Normal file

@ -0,0 +1,101 @@
from ...attrs import LIKE_NUM
_num_words = [
"ዜሮ",
"ሐደ",
"ክልተ",
"ሰለስተ",
"ኣርባዕተ",
"ሓሙሽተ",
"ሽድሽተ",
"ሸውዓተ",
"ሽሞንተ",
"ትሽዓተ",
"ኣሰርተ",
"ኣሰርተ ሐደ",
"ኣሰርተ ክልተ",
"ኣሰርተ ሰለስተ",
"ኣሰርተ ኣርባዕተ",
"ኣሰርተ ሓሙሽተ",
"ኣሰርተ ሽድሽተ",
"ኣሰርተ ሸውዓተ",
"ኣሰርተ ሽሞንተ",
"ኣሰርተ ትሽዓተ",
"ዕስራ",
"ሰላሳ",
"ኣርብዓ",
"ሃምሳ",
"ስልሳ",
"ሰብዓ",
"ሰማንያ",
"ተስዓ",
"ሚእቲ",
"ሺሕ",
"ሚልዮን",
"ቢልዮን",
"ትሪልዮን",
"ኳድሪልዮን",
"ገጅልዮን",
"ባዝልዮን"
]
_ordinal_words = [
"ቀዳማይ",
"ካልኣይ",
"ሳልሳይ",
"ራብኣይ",
"ሓምሻይ",
"ሻድሻይ",
"ሻውዓይ",
"ሻምናይ",
"ዘጠነኛ",
"አስረኛ",
"ኣሰርተ አንደኛ",
"ኣሰርተ ሁለተኛ",
"ኣሰርተ ሶስተኛ",
"ኣሰርተ አራተኛ",
"ኣሰርተ አምስተኛ",
"ኣሰርተ ስድስተኛ",
"ኣሰርተ ሰባተኛ",
"ኣሰርተ ስምንተኛ",
"ኣሰርተ ዘጠነኛ",
"ሃያኛ",
"ሰላሳኛ",
"አርባኛ",
"አምሳኛ",
"ስድሳኛ",
"ሰባኛ",
"ሰማንያኛ",
"ዘጠናኛ",
"መቶኛ",
"ሺኛ",
"ሚሊዮንኛ",
"ቢሊዮንኛ",
"ትሪሊዮንኛ"
]


def like_num(text):
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    text_lower = text.lower()
    if text_lower in _num_words:
        return True
    # Check ordinal number
    if text_lower in _ordinal_words:
        return True
    if text_lower.endswith(""):
        if text_lower[:-2].isdigit():
            return True
    return False


LEX_ATTRS = {LIKE_NUM: like_num}

@ -0,0 +1,19 @@
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
from ..char_classes import UNITS, ALPHA_UPPER
_list_punct = LIST_PUNCT + "፡ ። ፣ ፤ ፥ ፦ ፧".strip().split()
_suffixes = (
    _list_punct
    + LIST_ELLIPSES
    + LIST_QUOTES
    + [
        r"(?<=[0-9])\+",
        # Tigrinya is written from Left-To-Right
        r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
        r"(?<=[0-9])(?:{u})".format(u=UNITS),
        r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
    ]
)
TOKENIZER_SUFFIXES = _suffixes
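For readers unfamiliar with how suffix patterns of this kind behave, the sketch below joins a few of them into one end-anchored regular expression and runs it on sample strings. It uses only the standard `re` module. Joining the pieces as `piece + "$"` alternatives is an assumption about how the tokenizer compiles them, and `ETHIOPIC_PUNCT`, `PATTERNS` and `find_suffix` are names made up for this illustration.

```python
import re

# Ethiopic punctuation marks, as listed in `_list_punct`.
ETHIOPIC_PUNCT = "፡ ። ፣ ፤ ፥ ፦ ፧".strip().split()

# A plus sign counts as a suffix only when a digit precedes it.
PATTERNS = ETHIOPIC_PUNCT + [r"(?<=[0-9])\+"]

# Assumption: each piece is anchored to the end of the string and the
# pieces are combined as alternatives.
SUFFIX_RE = re.compile("|".join(piece + "$" for piece in PATTERNS))


def find_suffix(text):
    """Return the matched suffix, or None if no pattern applies."""
    match = SUFFIX_RE.search(text)
    return match.group(0) if match else None


print(find_suffix("እያ።"))  # sentence-final Ethiopic full stop
print(find_suffix("10+"))  # plus sign after a digit
print(find_suffix("a+"))   # plus sign without a preceding digit
```

The lookbehind keeps the digit itself out of the match, so only the `+` would be split off.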

@ -0,0 +1,6 @@
# Stop words
STOP_WORDS = set(
"""
ግን ግና ንስኻ ንስኺ ንስኻትክን ንስኻትኩም ናትካ ናትኪ ናትክን ናትኩም
""".split()
)

@ -0,0 +1,23 @@
from ...symbols import ORTH, NORM
_exc = {}
for exc_data in [
    {ORTH: "ት/ቤት"},
    {ORTH: "ወ/ሮ", NORM: "ወይዘሮ"},
    {ORTH: "ወ/ሪ", NORM: "ወይዘሪት"},
]:
    _exc[exc_data[ORTH]] = [exc_data]
for orth in [
    "ዓ.ም.",
    "ኪ.ሜ.",
]:
    _exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = _exc
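The resulting table maps each exact string to a list of token descriptions. The sketch below rebuilds it with plain string keys standing in for the `ORTH` and `NORM` symbols (an assumption made so that it runs without spaCy; the name `exceptions` is also illustrative) and prints what a lookup returns.

```python
# Stand-ins for the spaCy symbols, assumed here so the sketch has no dependencies.
ORTH, NORM = "ORTH", "NORM"

exceptions = {}

# Entries that may carry a normalised form next to the surface string.
for exc_data in [
    {ORTH: "ት/ቤት"},
    {ORTH: "ወ/ሮ", NORM: "ወይዘሮ"},
    {ORTH: "ወ/ሪ", NORM: "ወይዘሪት"},
]:
    exceptions[exc_data[ORTH]] = [exc_data]

# Abbreviations that should simply stay in one piece.
for orth in ["ዓ.ም.", "ኪ.ሜ."]:
    exceptions[orth] = [{ORTH: orth}]

print(exceptions["ወ/ሮ"])
print(exceptions["ኪ.ሜ."])
```

Each value is a one-element list because every exception here keeps the string as a single token.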

@ -28,6 +28,9 @@ def pytest_runtest_setup(item):
 def tokenizer():
     return get_lang_class("xx")().tokenizer
+@pytest.fixture(scope="session")
+def am_tokenizer():
+    return get_lang_class("am")().tokenizer
 @pytest.fixture(scope="session")
 def ar_tokenizer():
@ -244,6 +247,9 @@ def th_tokenizer():
     pytest.importorskip("pythainlp")
     return get_lang_class("th")().tokenizer
+@pytest.fixture(scope="session")
+def ti_tokenizer():
+    return get_lang_class("ti")().tokenizer
 @pytest.fixture(scope="session")
 def tr_tokenizer():

@ -0,0 +1,52 @@
import pytest
from spacy.lang.am.lex_attrs import like_num
def test_am_tokenizer_handles_long_text(am_tokenizer):
    text = """ሆሴ ሙጂካ በበጋ ወቅት በኦክስፎርድ ንግግር አንድያቀርቡ ሲጋበዙ ጭንቅላታቸው "ፈነዳ"
እጅግ ጥንታዊ የእንግሊዝኛ ተናጋሪ ዩኒቨርስቲ በአስር ሺዎች የሚቆጠሩ ዩሮዎችን ለተማሪዎች በማስተማር የሚያስከፍለው
እና ከማርጋሬት ታቸር እስከ ስቲቨን ሆኪንግ በአዳራሾቻቸው ውስጥ ንግግር ያደረጉበት የትምህርት ማዕከል በሞንቴቪዴኦ
በሚገኘው የመንግስት ትምህርት ቤት የሰለጠኑትን የ81 ዓመቱ አዛውንት አገልግሎት ጠየቁ"""
    tokens = am_tokenizer(text)
    assert len(tokens) == 56
@pytest.mark.parametrize(
"text,length",
[
("ሆሴ ሙጂካ ለምን ተመረጠ?", 5),
("“በፍፁም?”", 4),
("""አዎ! ሆዜ አርካዲዮ ቡንዲያ “እንሂድ” ሲል መለሰ።""", 11),
("እነሱ በግምት 10ኪ.ሜ. ሮጡ።", 7),
("እና ከዚያ ለምን...", 4),
],
)
def test_am_tokenizer_handles_cnts(am_tokenizer, text, length):
    tokens = am_tokenizer(text)
    assert len(tokens) == length
@pytest.mark.parametrize(
"text,match",
[
("10", True),
("1", True),
("10.000", True),
("1000", True),
("999,0", True),
("አንድ", True),
("ሁለት", True),
("ትሪሊዮን", True),
("ውሻ", False),
(",", False),
("1/2", True),
],
)
def test_lex_attrs_like_number(am_tokenizer, text, match):
    tokens = am_tokenizer(text)
    assert len(tokens) == 1
    assert tokens[0].like_num == match

@ -0,0 +1,70 @@
import pytest
from spacy.tokens import Doc
def test_noun_chunks_is_parsed(da_tokenizer):
    """Test that noun_chunks raises ValueError for 'da' language if Doc is not parsed.
    To check this test, we're constructing a Doc
    with a new Vocab here and forcing is_parsed to 'False'
    to make sure the noun chunks don't run.
    """
    doc = da_tokenizer("Det er en sætning")
    with pytest.raises(ValueError):
        list(doc.noun_chunks)
DA_NP_TEST_EXAMPLES = [
(
"Hun elsker at plukker frugt.",
["PRON", "VERB", "PART", "VERB", "NOUN", "PUNCT"],
["nsubj", "ROOT", "mark", "obj", "obj", "punct"],
[1, 0, 1, -2, -1, -4],
["Hun", "frugt"],
),
(
"Påfugle er de smukkeste fugle.",
["NOUN", "AUX", "DET", "ADJ", "NOUN", "PUNCT"],
["nsubj", "cop", "det", "amod", "ROOT", "punct"],
[4, 3, 2, 1, 0, -1],
["Påfugle", "de smukkeste fugle"],
),
(
"Rikke og Jacob Jensen glæder sig til en hyggelig skovtur",
[
"PROPN",
"CCONJ",
"PROPN",
"PROPN",
"VERB",
"PRON",
"ADP",
"DET",
"ADJ",
"NOUN",
],
["nsubj", "cc", "conj", "flat", "ROOT", "obj", "case", "det", "amod", "obl"],
[4, 1, -2, -1, 0, -1, 3, 2, 1, -5],
["Rikke", "Jacob Jensen", "sig", "en hyggelig skovtur"],
),
]
@pytest.mark.parametrize(
"text,pos,deps,heads,expected_noun_chunks", DA_NP_TEST_EXAMPLES
)
def test_da_noun_chunks(da_tokenizer, text, pos, deps, heads, expected_noun_chunks):
    tokens = da_tokenizer(text)
    assert len(heads) == len(pos)
    doc = Doc(
        tokens.vocab,
        words=[t.text for t in tokens],
        heads=[head + i for i, head in enumerate(heads)],
        deps=deps,
        pos=pos,
    )
    noun_chunks = list(doc.noun_chunks)
    assert len(noun_chunks) == len(expected_noun_chunks)
    for i, np in enumerate(noun_chunks):
        assert np.text == expected_noun_chunks[i]
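The `heads` lists in the test data are offsets relative to each token, and the test converts them to absolute token indices before building the `Doc`. A small sketch of that conversion, using the second test sentence (the helper name `to_absolute_heads` is made up for illustration):

```python
def to_absolute_heads(relative_heads):
    """Turn per-token head offsets into absolute token indices."""
    return [offset + i for i, offset in enumerate(relative_heads)]


words = ["Påfugle", "er", "de", "smukkeste", "fugle", "."]
relative = [4, 3, 2, 1, 0, -1]
absolute = to_absolute_heads(relative)

# Every token in this sentence attaches to the root "fugle" at index 4.
for word, head in zip(words, absolute):
    print(f"{word} -> {words[head]}")
```

An offset of `0` marks the root, since such a token points at itself.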

@ -106,7 +106,15 @@ def test_en_tokenizer_handles_times(en_tokenizer, text):
 @pytest.mark.parametrize(
-    "text,norms", [("I'm", ["i", "am"]), ("shan't", ["shall", "not"])]
+    "text,norms",
+    [
+        ("I'm", ["i", "am"]),
+        ("shan't", ["shall", "not"]),
+        (
+            "Many factors cause cancer 'cause it is complex",
+            ["many", "factors", "cause", "cancer", "because", "it", "is", "complex"],
+        ),
+    ],
 )
 def test_en_tokenizer_norm_exceptions(en_tokenizer, text, norms):
     tokens = en_tokenizer(text)

@ -0,0 +1,52 @@
import pytest
from spacy.lang.ti.lex_attrs import like_num
def test_ti_tokenizer_handles_long_text(ti_tokenizer):
    text = """ቻንስለር ጀርመን ኣንገላ መርከል ኣብታ ሃገር ቁጽሪ መትሓዝቲ ኮቪድ መዓልታዊ ክብረ መዝገብ ድሕሪ ምህራሙ- ጽኑዕ እገዳ ክግበር ጸዊዓ።
መርከል ሎሚ ንታሕታዋይ ባይቶ ሃገራ ክትገልጽ ከላ ኣብ ወሳኒ ምዕራፍ ቃልሲ ኢና ዘለና-ዳሕራዋይ ማዕበል ካብቲ ቀዳማይ ክገድድ ይኽእል` ኢላ
ትካል ምክልኻል ተላገብቲ ሕማማት ጀርመን ኣብ ዝሓለፈ 24 ሰዓታት ኣብ ምልእቲ ጀርመር 590 ሰባት ብኮቪድ19 ምሟቶም ኣፍሊጡ`
ቻንስለር ኣንጀላ መርከል ኣብ እዋን በዓላት ልደት ስድራቤታት ክተኣኻኸባ ዝፍቀደለን` እንተኾነ ድሕሪኡ ኣብ ዘሎ ግዜ ግን እቲ እገዳታት ክትግበር ትደሊ"""
    tokens = ti_tokenizer(text)
    assert len(tokens) == 85
@pytest.mark.parametrize(
"text,length",
[
("ቻንስለር ጀርመን ኣንገላ መርከል፧", 5),
("“ስድራቤታት፧”", 4),
("""ኣብ እዋን በዓላት ልደት ስድራቤታት ክተኣኻኸባ ዝፍቀደለን`ኳ እንተኾነ።""", 9),
("ብግምት 10ኪ.ሜ. ጎይዩ።", 6),
("ኣብ ዝሓለፈ 24 ሰዓታት...", 5),
],
)
def test_ti_tokenizer_handles_cnts(ti_tokenizer, text, length):
    tokens = ti_tokenizer(text)
    assert len(tokens) == length
@pytest.mark.parametrize(
"text,match",
[
("10", True),
("1", True),
("10.000", True),
("1000", True),
("999,0", True),
("ሐደ", True),
("ክልተ", True),
("ትሪልዮን", True),
("ከልቢ", False),
(",", False),
("1/2", True),
],
)
def test_lex_attrs_like_number(ti_tokenizer, text, match):
    tokens = ti_tokenizer(text)
    assert len(tokens) == 1
    assert tokens[0].like_num == match

@ -1,4 +1,5 @@
 import pytest
+import re
 from spacy.util import get_lang_class
 from spacy.tokenizer import Tokenizer
@ -19,6 +20,17 @@ def test_serialize_custom_tokenizer(en_vocab, en_tokenizer):
     tokenizer_bytes = tokenizer.to_bytes()
     Tokenizer(en_vocab).from_bytes(tokenizer_bytes)
+    # test that empty/unset values are set correctly on deserialization
+    tokenizer = get_lang_class("en")().tokenizer
+    tokenizer.token_match = re.compile("test").match
+    assert tokenizer.rules != {}
+    assert tokenizer.token_match is not None
+    assert tokenizer.url_match is not None
+    tokenizer.from_bytes(tokenizer_bytes)
+    assert tokenizer.rules == {}
+    assert tokenizer.token_match is None
+    assert tokenizer.url_match is None
     tokenizer = Tokenizer(en_vocab, rules={"ABC.": [{"ORTH": "ABC"}, {"ORTH": "."}]})
     tokenizer.rules = {}
     tokenizer_bytes = tokenizer.to_bytes()

@ -785,10 +785,16 @@ cdef class Tokenizer:
             self.suffix_search = re.compile(data["suffix_search"]).search
         if "infix_finditer" in data and isinstance(data["infix_finditer"], str):
             self.infix_finditer = re.compile(data["infix_finditer"]).finditer
+        # for token_match and url_match, set to None to override the language
+        # defaults if no regex is provided
         if "token_match" in data and isinstance(data["token_match"], str):
             self.token_match = re.compile(data["token_match"]).match
+        else:
+            self.token_match = None
         if "url_match" in data and isinstance(data["url_match"], str):
             self.url_match = re.compile(data["url_match"]).match
+        else:
+            self.url_match = None
         if "rules" in data and isinstance(data["rules"], dict):
             # make sure to hard reset the cache to remove data from the default exceptions
             self._rules = {}
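The effect of the two new `else` branches can be sketched outside the tokenizer. In the snippet below, `restore_matchers` and the `defaults` dictionary are hypothetical stand-ins for the tokenizer attributes, not part of spaCy; the point is only that a missing pattern now clears the matcher, where previously the language default would have been kept.

```python
import re


def restore_matchers(data, defaults):
    """Hypothetical helper mirroring the token_match/url_match handling."""
    restored = dict(defaults)
    for key in ("token_match", "url_match"):
        if key in data and isinstance(data[key], str):
            # A serialised pattern string is compiled back into a matcher.
            restored[key] = re.compile(data[key]).match
        else:
            # No pattern in the data: clear the default instead of keeping it.
            restored[key] = None
    return restored


defaults = {
    "token_match": re.compile("test").match,
    "url_match": re.compile("https?://").match,
}
result = restore_matchers({"token_match": "ab+"}, defaults)
print(result["token_match"]("abb") is not None)
print(result["url_match"])
```

This mirrors what the new serialization test asserts: after loading bytes from a tokenizer that had no `token_match` or `url_match`, both end up as `None`.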

@ -239,7 +239,6 @@ it.
 | `infix_finditer` | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) sequence of `re.MatchObject` objects. ~~Optional[Callable[[str], Iterator[Match]]]~~ |
 | `token_match` | A function matching the signature of `re.compile(string).match` to find token matches. Returns an `re.MatchObject` or `None`. ~~Optional[Callable[[str], Optional[Match]]]~~ |
 | `rules` | A dictionary of tokenizer exceptions and special cases. ~~Optional[Dict[str, List[Dict[int, str]]]]~~ |
 ## Serialization fields {#serialization-fields}
 During serialization, spaCy will export several data fields used to restore

@ -58,15 +58,6 @@ module.exports = {
     },
     plugins: [
-        {
-            resolve: `gatsby-plugin-svgr`,
-            options: {
-                svgo: false,
-                svgoConfig: {
-                    removeViewBox: false,
-                },
-            },
-        },
         {
             resolve: `gatsby-plugin-sass`,
             options: {
@ -109,6 +100,14 @@ module.exports = {
                 path: `${__dirname}/docs/images`,
             },
         },
+        {
+            resolve: 'gatsby-plugin-react-svg',
+            options: {
+                rule: {
+                    include: /src\/images\/(.*)\.svg/,
+                },
+            },
+        },
         {
             resolve: `gatsby-mdx`,
             options: {

@ -1,6 +1,6 @@
{ {
"resources": [ "resources": [
{ {
"id": "spacy-textblob", "id": "spacy-textblob",
"title": "spaCyTextBlob", "title": "spaCyTextBlob",
"slogan": "Easy sentiment analysis for spaCy using TextBlob", "slogan": "Easy sentiment analysis for spaCy using TextBlob",
@ -30,7 +30,7 @@
}, },
"category": ["pipeline"], "category": ["pipeline"],
"tags": ["sentiment", "textblob"] "tags": ["sentiment", "textblob"]
}, },
{ {
"id": "spacy-ray", "id": "spacy-ray",
"title": "spacy-ray", "title": "spacy-ray",
@ -2139,7 +2139,7 @@
                 "from negspacy.negation import Negex",
                 "",
                 "nlp = spacy.load(\"en_core_web_sm\")",
-                "negex = Negex(nlp, ent_types=[\"PERSON','ORG\"])",
+                "negex = Negex(nlp, ent_types=[\"PERSON\",\"ORG\"])",
                 "nlp.add_pipe(negex, last=True)",
                 "",
                 "doc = nlp(\"She does not like Steve Jobs but likes Apple products.\")",
@ -2619,14 +2619,14 @@
"github": "medspacy" "github": "medspacy"
} }
}, },
{ {
"id": "rita-dsl", "id": "rita-dsl",
"title": "RITA DSL", "title": "RITA DSL",
"slogan": "Domain Specific Language for creating language rules", "slogan": "Domain Specific Language for creating language rules",
"github": "zaibacu/rita-dsl", "github": "zaibacu/rita-dsl",
"description": "A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any other format", "description": "A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any other format",
"pip": "rita-dsl", "pip": "rita-dsl",
"thumb": "https://raw.githubusercontent.com/zaibacu/rita-dsl/master/docs/assets/logo-100px.png", "thumb": "https://raw.githubusercontent.com/zaibacu/rita-dsl/master/docs/assets/logo-100px.png",
"code_language": "python", "code_language": "python",
"code_example": [ "code_example": [
"import spacy", "import spacy",
@ -2680,6 +2680,60 @@
             },
             "category": ["scientific", "research", "standalone"],
             "tags": ["Evolutionary Computation", "Grammatical Evolution"]
+        },
+        {
+            "id": "SpacyDotNet",
+            "title": "spaCy .NET Wrapper",
+            "slogan": "SpacyDotNet is a .NET Core compatible wrapper for spaCy, based on Python.NET",
+            "description": "This project relies on [Python.NET](http://pythonnet.github.io/) to interop with spaCy. It's not meant to be a complete and exhaustive implementation of all spaCy features and [APIs](https://spacy.io/api). Although it should be enough for basic tasks, it's considered as a starting point if you need to build a complex project using spaCy in .NET. Most of the basic features in _Spacy101_ are available. All `Container` classes are present (`Doc`, `Token`, `Span` and `Lexeme`) with their basic properties/methods running and also `Vocab` and `StringStore` in a limited form. Anyway, any developer should be ready to add the missing properties or classes in a very straightforward manner.",
+            "github": "AMArostegui/SpacyDotNet",
+            "thumb": "https://raw.githubusercontent.com/AMArostegui/SpacyDotNet/master/cslogo.png",
+            "code_example": [
+                "var spacy = new Spacy();",
+                "",
+                "var nlp = spacy.Load(\"en_core_web_sm\");",
+                "var doc = nlp.GetDocument(\"Apple is looking at buying U.K. startup for $1 billion\");",
+                "",
+                "foreach (Token token in doc.Tokens)",
+                " Console.WriteLine($\"{token.Text} {token.Lemma} {token.PoS} {token.Tag} {token.Dep} {token.Shape} {token.IsAlpha} {token.IsStop}\");",
+                "",
+                "Console.WriteLine(\"\");",
+                "foreach (Span ent in doc.Ents)",
+                " Console.WriteLine($\"{ent.Text} {ent.StartChar} {ent.EndChar} {ent.Label}\");",
+                "",
+                "nlp = spacy.Load(\"en_core_web_md\");",
+                "var tokens = nlp.GetDocument(\"dog cat banana afskfsd\");",
+                "",
+                "Console.WriteLine(\"\");",
+                "foreach (Token token in tokens.Tokens)",
+                " Console.WriteLine($\"{token.Text} {token.HasVector} {token.VectorNorm}, {token.IsOov}\");",
+                "",
+                "tokens = nlp.GetDocument(\"dog cat banana\");",
+                "Console.WriteLine(\"\");",
+                "foreach (Token token1 in tokens.Tokens)",
+                "{",
+                " foreach (Token token2 in tokens.Tokens)",
+                " Console.WriteLine($\"{token1.Text} {token2.Text} {token1.Similarity(token2) }\");",
+                "}",
+                "",
+                "doc = nlp.GetDocument(\"I love coffee\");",
+                "Console.WriteLine(\"\");",
+                "Console.WriteLine(doc.Vocab.Strings[\"coffee\"]);",
+                "Console.WriteLine(doc.Vocab.Strings[3197928453018144401]);",
+                "",
+                "Console.WriteLine(\"\");",
+                "foreach (Token word in doc.Tokens)",
+                "{",
+                " var lexeme = doc.Vocab[word.Text];",
+                " Console.WriteLine($@\"{lexeme.Text} {lexeme.Orth} {lexeme.Shape} {lexeme.Prefix} {lexeme.Suffix} {lexeme.IsAlpha} {lexeme.IsDigit} {lexeme.IsTitle} {lexeme.Lang}\");",
+                "}"
+            ],
+            "code_language": "csharp",
+            "author": "Antonio Miras",
+            "author_links": {
+                "github": "AMArostegui"
+            },
+            "category": ["nonpython"]
         }
     ],

website/package-lock.json generated

File diff suppressed because it is too large

@ -3,7 +3,7 @@
     "private": true,
     "description": "spaCy website",
     "version": "3.0.0",
-    "author": "Explosion <contact@explosion.ai>",
+    "author": "Explosion AI <contact@explosion.ai>",
     "license": "MIT",
     "dependencies": {
         "@jupyterlab/outputarea": "^0.19.1",
@ -16,7 +16,7 @@
         "autoprefixer": "^9.4.7",
         "classnames": "^2.2.6",
         "codemirror": "^5.43.0",
-        "gatsby": "^2.11.1",
+        "gatsby": "^2.1.18",
         "gatsby-image": "^2.0.29",
         "gatsby-mdx": "^0.3.6",
         "gatsby-plugin-catch-links": "^2.0.11",
@ -24,14 +24,12 @@
         "gatsby-plugin-offline": "^2.0.24",
         "gatsby-plugin-plausible": "0.0.6",
         "gatsby-plugin-react-helmet": "^3.0.6",
-        "gatsby-plugin-react-svg": "^2.0.0",
-        "gatsby-plugin-robots-txt": "^1.5.1",
+        "gatsby-plugin-react-svg": "^2.1.2",
         "gatsby-plugin-sass": "^2.0.10",
         "gatsby-plugin-sharp": "^2.0.20",
         "gatsby-plugin-sitemap": "^2.0.5",
         "gatsby-plugin-svgr": "^2.0.1",
         "gatsby-remark-copy-linked-files": "^2.0.9",
-        "gatsby-remark-find-replace": "^0.3.0",
         "gatsby-remark-images": "^3.0.4",
         "gatsby-remark-prismjs": "^3.2.4",
         "gatsby-remark-smartypants": "^2.0.8",
@ -41,11 +39,9 @@
         "gatsby-transformer-sharp": "^2.1.13",
         "html-to-react": "^1.3.4",
         "intersection-observer": "^0.5.1",
-        "jinja-to-js": "^3.2.3",
         "node-sass": "^4.11.0",
         "parse-numeric-range": "0.0.2",
         "prismjs": "^1.15.0",
-        "prismjs-bibtex": "^1.1.0",
         "prop-types": "^15.7.2",
         "react": "^16.8.2",
         "react-dom": "^16.8.2",
@ -54,22 +50,19 @@
         "remark-react": "^5.0.1"
     },
     "scripts": {
-        "build": "npm run python:install && npm run python:setup && gatsby build",
-        "dev": "npm run python:setup && gatsby develop",
-        "dev:nightly": "BRANCH=nightly.spacy.io npm run dev",
+        "build": "gatsby build",
+        "dev": "gatsby develop",
         "lint": "eslint **",
         "clear": "rm -rf .cache",
-        "test": "echo \"Write tests! -> https://gatsby.app/unit-testing\"",
-        "python:install": "pip install -r setup/requirements.txt",
-        "python:setup": "cd setup && sh setup.sh"
+        "test": "echo \"Write tests! -> https://gatsby.app/unit-testing\""
     },
     "devDependencies": {
-        "@sindresorhus/slugify": "^0.8.0",
         "browser-monads": "^1.0.0",
         "md-attr-parser": "^1.2.1",
         "prettier": "^1.16.4",
         "raw-loader": "^1.0.0",
-        "unist-util-visit": "^1.4.0"
+        "unist-util-visit": "^1.4.0",
+        "@sindresorhus/slugify": "^0.8.0"
     },
     "repository": {
         "type": "git",

@ -6,7 +6,7 @@ import classNames from 'classnames'
 import Link from './link'
 import Grid from './grid'
 import Newsletter from './newsletter'
-import { ReactComponent as ExplosionLogo } from '../images/explosion.svg'
+import ExplosionLogo from '-!svg-react-loader!../images/explosion.svg'
 import classes from '../styles/footer.module.sass'
 export default function Footer({ wide = false }) {

@ -1,31 +1,25 @@
-import React, { Fragment } from 'react'
+import React from 'react'
 import PropTypes from 'prop-types'
 import classNames from 'classnames'
-import { ReactComponent as GitHubIcon } from '../images/icons/github.svg'
-import { ReactComponent as TwitterIcon } from '../images/icons/twitter.svg'
-import { ReactComponent as WebsiteIcon } from '../images/icons/website.svg'
-import { ReactComponent as WarningIcon } from '../images/icons/warning.svg'
-import { ReactComponent as InfoIcon } from '../images/icons/info.svg'
-import { ReactComponent as AcceptIcon } from '../images/icons/accept.svg'
-import { ReactComponent as RejectIcon } from '../images/icons/reject.svg'
-import { ReactComponent as DocsIcon } from '../images/icons/docs.svg'
-import { ReactComponent as CodeIcon } from '../images/icons/code.svg'
-import { ReactComponent as HelpIcon } from '../images/icons/help.svg'
-import { ReactComponent as HelpOutlineIcon } from '../images/icons/help-outline.svg'
-import { ReactComponent as ArrowRightIcon } from '../images/icons/arrow-right.svg'
-import { ReactComponent as YesIcon } from '../images/icons/yes.svg'
-import { ReactComponent as NoIcon } from '../images/icons/no.svg'
-import { ReactComponent as NeutralIcon } from '../images/icons/neutral.svg'
-import { ReactComponent as OfflineIcon } from '../images/icons/offline.svg'
-import { ReactComponent as SearchIcon } from '../images/icons/search.svg'
-import { ReactComponent as MoonIcon } from '../images/icons/moon.svg'
-import { ReactComponent as ClipboardIcon } from '../images/icons/clipboard.svg'
-import { ReactComponent as NetworkIcon } from '../images/icons/network.svg'
-import { ReactComponent as DownloadIcon } from '../images/icons/download.svg'
-import { ReactComponent as PackageIcon } from '../images/icons/package.svg'
-import { isString } from './util'
+import GitHubIcon from '-!svg-react-loader!../images/icons/github.svg'
+import TwitterIcon from '-!svg-react-loader!../images/icons/twitter.svg'
+import WebsiteIcon from '-!svg-react-loader!../images/icons/website.svg'
+import WarningIcon from '-!svg-react-loader!../images/icons/warning.svg'
+import InfoIcon from '-!svg-react-loader!../images/icons/info.svg'
+import AcceptIcon from '-!svg-react-loader!../images/icons/accept.svg'
+import RejectIcon from '-!svg-react-loader!../images/icons/reject.svg'
+import DocsIcon from '-!svg-react-loader!../images/icons/docs.svg'
+import CodeIcon from '-!svg-react-loader!../images/icons/code.svg'
+import HelpIcon from '-!svg-react-loader!../images/icons/help.svg'
+import HelpOutlineIcon from '-!svg-react-loader!../images/icons/help-outline.svg'
+import ArrowRightIcon from '-!svg-react-loader!../images/icons/arrow-right.svg'
+import YesIcon from '-!svg-react-loader!../images/icons/yes.svg'
+import NoIcon from '-!svg-react-loader!../images/icons/no.svg'
+import NeutralIcon from '-!svg-react-loader!../images/icons/neutral.svg'
+import OfflineIcon from '-!svg-react-loader!../images/icons/offline.svg'
+import SearchIcon from '-!svg-react-loader!../images/icons/search.svg'
 import classes from '../styles/icon.module.sass'
 const icons = {
@ -47,22 +41,9 @@ const icons = {
     neutral: NeutralIcon,
     offline: OfflineIcon,
     search: SearchIcon,
-    moon: MoonIcon,
-    clipboard: ClipboardIcon,
-    network: NetworkIcon,
-    download: DownloadIcon,
-    package: PackageIcon,
 }
-export default function Icon({
-    name,
-    width = 20,
-    height,
-    inline = false,
-    variant,
-    className,
-    ...props
-}) {
+const Icon = ({ name, width, height, inline, variant, className }) => {
     const IconComponent = icons[name]
     const iconClassNames = classNames(classes.root, className, {
         [classes.inline]: inline,
@ -76,11 +57,15 @@ export default function Icon({
             aria-hidden="true"
             width={width}
             height={height || width}
-            {...props}
         />
     )
 }
+Icon.defaultProps = {
+    width: 20,
+    inline: false,
+}
 Icon.propTypes = {
     name: PropTypes.oneOf(Object.keys(icons)),
     width: PropTypes.number,
@ -90,43 +75,4 @@ Icon.propTypes = {
     className: PropTypes.string,
 }
-export function replaceEmoji(cellChildren) {
-    const icons = {
-        '✅': { name: 'yes', variant: 'success', 'aria-label': 'positive' },
-        '❌': { name: 'no', variant: 'error', 'aria-label': 'negative' },
-    }
-    const iconRe = new RegExp(`^(${Object.keys(icons).join('|')})`, 'g')
-    let children = isString(cellChildren) ? [cellChildren] : cellChildren
-    let hasIcon = false
-    if (Array.isArray(children)) {
-        children = children.map((child, i) => {
-            if (isString(child)) {
-                const icon = icons[child.trim()]
-                if (icon) {
-                    hasIcon = true
-                    return (
-                        <Icon
-                            {...icon}
-                            inline={i < children.length}
-                            aria-hidden={undefined}
-                            key={i}
-                        />
-                    )
-                } else if (iconRe.test(child)) {
-                    hasIcon = true
-                    const [, iconName, text] = child.split(iconRe)
-                    return (
-                        <Fragment key={i}>
-                            <Icon {...icons[iconName]} aria-hidden={undefined} inline={true} />
-                            {text.replace(/^\s+/g, '')}
-                        </Fragment>
-                    )
-                }
-                // Work around prettier auto-escape
-                if (child.startsWith('\\')) return child.slice(1)
-            }
-            return child
-        })
-    }
-    return { content: children, hasIcon }
-}
+export default Icon

@ -6,7 +6,7 @@ import Link from './link'
 import Icon from './icon'
 import Dropdown from './dropdown'
 import { github } from './util'
-import { ReactComponent as Logo } from '../images/logo.svg'
+import Logo from '-!svg-react-loader!../images/logo.svg'
 import classes from '../styles/navigation.module.sass'
 const NavigationDropdown = ({ items = [], section }) => {

@ -0,0 +1,31 @@
import AirbnbLogo from '-!svg-react-loader!./airbnb.svg'
import UberLogo from '-!svg-react-loader!./uber.svg'
import QuoraLogo from '-!svg-react-loader!./quora.svg'
import RetrieverLogo from '-!svg-react-loader!./retriever.svg'
import StitchfixLogo from '-!svg-react-loader!./stitchfix.svg'
import ChartbeatLogo from '-!svg-react-loader!./chartbeat.svg'
import AllenAILogo from '-!svg-react-loader!./allenai.svg'
import RecodeLogo from '-!svg-react-loader!./recode.svg'
import WapoLogo from '-!svg-react-loader!./wapo.svg'
import BBCLogo from '-!svg-react-loader!./bbc.svg'
import MicrosoftLogo from '-!svg-react-loader!./microsoft.svg'
import VenturebeatLogo from '-!svg-react-loader!./venturebeat.svg'
import ThoughtworksLogo from '-!svg-react-loader!./thoughtworks.svg'
export default {
airbnb: AirbnbLogo,
uber: UberLogo,
quora: QuoraLogo,
retriever: RetrieverLogo,
stitchfix: StitchfixLogo,
chartbeat: ChartbeatLogo,
allenai: AllenAILogo,
recode: RecodeLogo,
wapo: WapoLogo,
bbc: BBCLogo,
microsoft: MicrosoftLogo,
venturebeat: VenturebeatLogo,
thoughtworks: ThoughtworksLogo,
}

@ -4,7 +4,7 @@ import Grid from '../components/grid'
 import { Label } from '../components/typography'
 import Link from '../components/link'
-import { ReactComponent as Logo } from '../images/logo.svg'
+import Logo from '-!svg-react-loader!../images/logo.svg'
 import patternBlue from '../images/pattern_blue.jpg'
 import patternGreen from '../images/pattern_green.jpg'
 import patternPurple from '../images/pattern_purple.jpg'