Merge branch 'master' into tmp/sync

This commit is contained in:
Ines Montani 2020-03-26 13:38:14 +01:00
commit 46568f40a7
104 changed files with 2987 additions and 638 deletions

106
.github/contributors/Baciccin.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Giovanni Battista Parodi |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-03-19 |
| GitHub username | Baciccin |
| Website (optional) | |

106
.github/contributors/MisterKeefe.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Tom Keefe |
| Company name (if applicable) | / |
| Title or role (if applicable) | / |
| Date | 18 February 2020 |
| GitHub username | MisterKeefe |
| Website (optional) | / |

106
.github/contributors/Tiljander.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Henrik Tiljander |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 24/3/2020 |
| GitHub username | Tiljander |
| Website (optional) | |

106
.github/contributors/dhpollack.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | David Pollack |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | Mar 5. 2020 |
| GitHub username | dhpollack |
| Website (optional) | |

106
.github/contributors/guerda.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Philip Gillißen |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-03-24 |
| GitHub username | guerda |
| Website (optional) | |

89
.github/contributors/mabraham.md vendored Normal file
View File

@ -0,0 +1,89 @@
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | |
| GitHub username | |
| Website (optional) | |

106
.github/contributors/merrcury.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Himanshu Garg |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-03-10 |
| GitHub username | merrcury |
| Website (optional) | |

106
.github/contributors/pinealan.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Alan Chan |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-03-15 |
| GitHub username | pinealan |
| Website (optional) | http://pinealan.xyz |

106
.github/contributors/sloev.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Johannes Valbjørn |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-03-13 |
| GitHub username | sloev |
| Website (optional) | https://sloev.github.io |

2
.gitignore vendored
View File

@ -46,6 +46,7 @@ __pycache__/
.venv
env3.6/
venv/
env3.*/
.dev
.denv
.pypyenv
@ -62,6 +63,7 @@ lib64/
parts/
sdist/
var/
wheelhouse/
*.egg-info/
pip-wheel-metadata/
Pipfile.lock

View File

@ -1,6 +1,6 @@
The MIT License (MIT)
Copyright (C) 2016-2019 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
Copyright (C) 2016-2020 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal

View File

@ -1,28 +1,37 @@
SHELL := /bin/bash
sha = $(shell "git" "rev-parse" "--short" "HEAD")
version = $(shell "bin/get-version.sh")
wheel = spacy-$(version)-cp36-cp36m-linux_x86_64.whl
PYVER := 3.6
VENV := ./env$(PYVER)
dist/spacy.pex : dist/spacy-$(sha).pex
cp dist/spacy-$(sha).pex dist/spacy.pex
chmod a+rx dist/spacy.pex
version := $(shell "bin/get-version.sh")
dist/spacy-$(sha).pex : dist/$(wheel)
env3.6/bin/python -m pip install pex==1.5.3
env3.6/bin/pex pytest dist/$(wheel) spacy_lookups_data -e spacy -o dist/spacy-$(sha).pex
dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) jsonschema spacy_lookups_data
chmod a+rx $@
dist/$(wheel) : setup.py spacy/*.py* spacy/*/*.py*
python3.6 -m venv env3.6
source env3.6/bin/activate
env3.6/bin/pip install wheel
env3.6/bin/pip install -r requirements.txt --no-cache-dir
env3.6/bin/python setup.py build_ext --inplace
env3.6/bin/python setup.py sdist
env3.6/bin/python setup.py bdist_wheel
dist/pytest.pex : wheelhouse/pytest-*.whl
$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock
chmod a+rx $@
.PHONY : clean
wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
$(VENV)/bin/pip wheel . -w ./wheelhouse
$(VENV)/bin/pip wheel jsonschema spacy_lookups_data -w ./wheelhouse
touch $@
wheelhouse/pytest-%.whl : $(VENV)/bin/pex
$(VENV)/bin/pip wheel pytest pytest-timeout mock -w ./wheelhouse
$(VENV)/bin/pex :
python$(PYVER) -m venv $(VENV)
$(VENV)/bin/pip install -U pip setuptools pex wheel
.PHONY : clean test
test : dist/spacy-$(version).pex dist/pytest.pex
( . $(VENV)/bin/activate ; \
PEX_PATH=dist/spacy-$(version).pex ./dist/pytest.pex --pyargs spacy -x ; )
clean : setup.py
source env3.6/bin/activate
rm -rf dist/*
rm -rf ./wheelhouse
rm -rf $(VENV)
python setup.py clean --all

View File

@ -2,7 +2,7 @@
### Step 1: Create a Knowledge Base (KB) and training data
Run `wikipedia_pretrain_kb.py`
Run `wikidata_pretrain_kb.py`
* This takes as input the locations of a **Wikipedia and a Wikidata dump**, and produces a **KB directory** + **training file**
* WikiData: get `latest-all.json.bz2` from https://dumps.wikimedia.org/wikidatawiki/entities/
* Wikipedia: get `enwiki-latest-pages-articles-multistream.xml.bz2` from https://dumps.wikimedia.org/enwiki/latest/ (or for any other language)

View File

@ -88,8 +88,8 @@ def read_text(bz2_loc, n=10000):
break
def get_matches(tokenizer, phrases, texts, max_length=6):
matcher = PhraseMatcher(tokenizer.vocab, max_length=max_length)
def get_matches(tokenizer, phrases, texts):
matcher = PhraseMatcher(tokenizer.vocab)
matcher.add("Phrase", None, *phrases)
for text in texts:
doc = tokenizer(text)

View File

@ -59,7 +59,7 @@ install_requires =
[options.extras_require]
lookups =
spacy_lookups_data>=0.0.5<0.2.0
spacy_lookups_data>=0.0.5,<0.2.0
cuda =
cupy>=5.0.0b4
cuda80 =

View File

@ -93,3 +93,5 @@ cdef enum attr_id_t:
ENT_KB_ID = symbols.ENT_KB_ID
MORPH
ENT_ID = symbols.ENT_ID
IDX

View File

@ -89,6 +89,7 @@ IDS = {
"PROB": PROB,
"LANG": LANG,
"MORPH": MORPH,
"IDX": IDX
}

View File

@ -405,12 +405,10 @@ def train(
losses=losses,
)
except ValueError as e:
msg.warn("Error during training")
err = "Error during training"
if init_tok2vec:
msg.warn(
"Did you provide the same parameters during 'train' as during 'pretrain'?"
)
msg.fail(f"Original error message: {e}", exits=1)
err += " Did you provide the same parameters during 'train' as during 'pretrain'?"
msg.fail(err, f"Original error message: {e}", exits=1)
if raw_text:
# If raw text is available, perform 'rehearsal' updates,
# which use unlabelled data to reduce overfitting.
@ -545,7 +543,40 @@ def train(
with nlp.use_params(optimizer.averages):
final_model_path = output_path / "model-final"
nlp.to_disk(final_model_path)
final_meta = srsly.read_json(output_path / "model-final" / "meta.json")
meta_loc = output_path / "model-final" / "meta.json"
final_meta = srsly.read_json(meta_loc)
final_meta.setdefault("accuracy", {})
final_meta["accuracy"].update(meta.get("accuracy", {}))
final_meta.setdefault("speed", {})
final_meta["speed"].setdefault("cpu", None)
final_meta["speed"].setdefault("gpu", None)
meta.setdefault("speed", {})
meta["speed"].setdefault("cpu", None)
meta["speed"].setdefault("gpu", None)
# combine cpu and gpu speeds with the base model speeds
if final_meta["speed"]["cpu"] and meta["speed"]["cpu"]:
speed = _get_total_speed(
[final_meta["speed"]["cpu"], meta["speed"]["cpu"]]
)
final_meta["speed"]["cpu"] = speed
if final_meta["speed"]["gpu"] and meta["speed"]["gpu"]:
speed = _get_total_speed(
[final_meta["speed"]["gpu"], meta["speed"]["gpu"]]
)
final_meta["speed"]["gpu"] = speed
# if there were no speeds to update, overwrite with meta
if (
final_meta["speed"]["cpu"] is None
and final_meta["speed"]["gpu"] is None
):
final_meta["speed"].update(meta["speed"])
# note: beam speeds are not combined with the base model
if has_beam_widths:
final_meta.setdefault("beam_accuracy", {})
final_meta["beam_accuracy"].update(meta.get("beam_accuracy", {}))
final_meta.setdefault("beam_speed", {})
final_meta["beam_speed"].update(meta.get("beam_speed", {}))
srsly.write_json(meta_loc, final_meta)
msg.good("Saved model to output directory", final_model_path)
with msg.loading("Creating best model..."):
best_model_path = _collate_best_model(final_meta, output_path, best_pipes)
@ -630,6 +661,8 @@ def _find_best(experiment_dir, component):
if epoch_model.is_dir() and epoch_model.parts[-1] != "model-final":
accs = srsly.read_json(epoch_model / "accuracy.json")
scores = [accs.get(metric, 0.0) for metric in _get_metrics(component)]
# remove per_type dicts from score list for max() comparison
scores = [score for score in scores if isinstance(score, float)]
accuracies.append((scores, epoch_model))
if accuracies:
return max(accuracies)[1]
@ -641,13 +674,13 @@ def _get_metrics(component):
if component == "parser":
return ("las", "uas", "las_per_type", "token_acc", "sent_f")
elif component == "tagger":
return ("tags_acc",)
return ("tags_acc", "token_acc")
elif component == "ner":
return ("ents_f", "ents_p", "ents_r", "ents_per_type")
return ("ents_f", "ents_p", "ents_r", "ents_per_type", "token_acc")
elif component == "senter":
return ("sent_f", "sent_p", "sent_r")
elif component == "textcat":
return ("textcat_score",)
return ("textcat_score", "token_acc")
return ("token_acc",)
@ -714,3 +747,12 @@ def _get_progress(
if beam_width is not None:
result.insert(1, beam_width)
return result
def _get_total_speed(speeds):
seconds_per_word = 0.0
for words_per_second in speeds:
if words_per_second is None:
return None
seconds_per_word += 1.0 / words_per_second
return 1.0 / seconds_per_word

View File

@ -142,10 +142,17 @@ def parse_deps(orig_doc, options={}):
for span, tag, lemma, ent_type in spans:
attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
retokenizer.merge(span, attrs=attrs)
if options.get("fine_grained"):
words = [{"text": w.text, "tag": w.tag_} for w in doc]
else:
words = [{"text": w.text, "tag": w.pos_} for w in doc]
fine_grained = options.get("fine_grained")
add_lemma = options.get("add_lemma")
words = [
{
"text": w.text,
"tag": w.tag_ if fine_grained else w.pos_,
"lemma": w.lemma_ if add_lemma else None,
}
for w in doc
]
arcs = []
for word in doc:
if word.i < word.head.i:

View File

@ -1,6 +1,12 @@
import uuid
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
from .templates import (
TPL_DEP_SVG,
TPL_DEP_WORDS,
TPL_DEP_WORDS_LEMMA,
TPL_DEP_ARCS,
TPL_ENTS,
)
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
from ..util import minify_html, escape_html, registry
from ..errors import Errors
@ -80,7 +86,10 @@ class DependencyRenderer(object):
self.width = self.offset_x + len(words) * self.distance
self.height = self.offset_y + 3 * self.word_spacing
self.id = render_id
words = [self.render_word(w["text"], w["tag"], i) for i, w in enumerate(words)]
words = [
self.render_word(w["text"], w["tag"], w.get("lemma", None), i)
for i, w in enumerate(words)
]
arcs = [
self.render_arrow(a["label"], a["start"], a["end"], a["dir"], i)
for i, a in enumerate(arcs)
@ -98,7 +107,9 @@ class DependencyRenderer(object):
lang=self.lang,
)
def render_word(self, text, tag, i):
def render_word(
self, text, tag, lemma, i,
):
"""Render individual word.
text (unicode): Word text.
@ -111,6 +122,10 @@ class DependencyRenderer(object):
if self.direction == "rtl":
x = self.width - x
html_text = escape_html(text)
if lemma is not None:
return TPL_DEP_WORDS_LEMMA.format(
text=html_text, tag=tag, lemma=lemma, x=x, y=y
)
return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y)
def render_arrow(self, label, start, end, direction, i):

View File

@ -14,6 +14,15 @@ TPL_DEP_WORDS = """
"""
TPL_DEP_WORDS_LEMMA = """
<text class="displacy-token" fill="currentColor" text-anchor="middle" y="{y}">
<tspan class="displacy-word" fill="currentColor" x="{x}">{text}</tspan>
<tspan class="displacy-lemma" dy="2em" fill="currentColor" x="{x}">{lemma}</tspan>
<tspan class="displacy-tag" dy="2em" fill="currentColor" x="{x}">{tag}</tspan>
</text>
"""
TPL_DEP_ARCS = """
<g class="displacy-arrow">
<path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>

View File

@ -96,7 +96,10 @@ class Warnings(object):
W027 = ("Found a large training file of {size} bytes. Note that it may "
"be more efficient to split your training data into multiple "
"smaller JSON files instead.")
W028 = ("Skipping unsupported morphological feature(s): {feature}. "
W028 = ("Doc.from_array was called with a vector of type '{type}', "
"but is expecting one of type 'uint64' instead. This may result "
"in problems with the vocab further on in the pipeline.")
W029 = ("Skipping unsupported morphological feature(s): {feature}. "
"Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or "
"string \"Field1=Value1,Value2|Field2=Value3\".")
@ -531,6 +534,15 @@ class Errors(object):
E188 = ("Could not match the gold entity links to entities in the doc - "
"make sure the gold EL data refers to valid results of the "
"named entity recognizer in the `nlp` pipeline.")
E189 = ("Each argument to `get_doc` should be of equal length.")
E190 = ("Token head out of range in `Doc.from_array()` for token index "
"'{index}' with value '{value}' (equivalent to relative head "
"index: '{rel_head_index}'). The head indices should be relative "
"to the current token index rather than absolute indices in the "
"array.")
E191 = ("Invalid head: the head token must be from the same doc as the "
"token itself.")
# TODO: fix numbering after merging develop into master
E993 = ("The config for 'nlp' should include either a key 'name' to "
"refer to an existing model by name or path, or a key 'lang' "

View File

@ -66,6 +66,7 @@ for orth in [
"A/S",
"B.C.",
"BK.",
"B.T.",
"Dr.",
"Boul.",
"Chr.",
@ -75,6 +76,7 @@ for orth in [
"Hf.",
"i/s",
"I/S",
"Inc.",
"Kprs.",
"L.A.",
"Ll.",
@ -145,6 +147,7 @@ for orth in [
"bygn.",
"c/o",
"ca.",
"cm.",
"cand.",
"d.d.",
"d.m.",
@ -168,10 +171,12 @@ for orth in [
"dl.",
"do.",
"dobb.",
"dr.",
"dr.h.c",
"dr.phil.",
"ds.",
"dvs.",
"d.v.s.",
"e.b.",
"e.l.",
"e.o.",
@ -293,10 +298,14 @@ for orth in [
"kap.",
"kbh.",
"kem.",
"kg.",
"kgs.",
"kgl.",
"kl.",
"kld.",
"km.",
"km/t",
"km/t.",
"knsp.",
"komm.",
"kons.",
@ -307,6 +316,7 @@ for orth in [
"kt.",
"ktr.",
"kv.",
"kvm.",
"kvt.",
"l.c.",
"lab.",
@ -353,6 +363,7 @@ for orth in [
"nto.",
"nuv.",
"o/m",
"o/m.",
"o.a.",
"o.fl.",
"o.h.",
@ -522,6 +533,7 @@ for orth in [
"vejl.",
"vh.",
"vha.",
"vind.",
"vs.",
"vsa.",
"vær.",

View File

@ -1,5 +1,6 @@
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .norm_exceptions import NORM_EXCEPTIONS
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
from .punctuation import TOKENIZER_INFIXES
from .tag_map import TAG_MAP
from .stop_words import STOP_WORDS
@ -19,6 +20,8 @@ class GermanDefaults(Language.Defaults):
Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
prefixes = TOKENIZER_PREFIXES
suffixes = TOKENIZER_SUFFIXES
infixes = TOKENIZER_INFIXES
tag_map = TAG_MAP
stop_words = STOP_WORDS

View File

@ -1,7 +1,29 @@
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_PUNCT, LIST_QUOTES
from ..char_classes import CURRENCY, UNITS, PUNCT
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES
_prefixes = ["``"] + BASE_TOKENIZER_PREFIXES
_suffixes = (
["''", "/"]
+ LIST_PUNCT
+ LIST_ELLIPSES
+ LIST_QUOTES
+ LIST_ICONS
+ [
r"(?<=[0-9])\+",
r"(?<=°[FfCcKk])\.",
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
r"(?<=[0-9])(?:{u})".format(u=UNITS),
r"(?<=[{al}{e}{p}(?:{q})])\.".format(
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
),
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
]
)
_quotes = CONCAT_QUOTES.replace("'", "")
_infixes = (
@ -12,6 +34,7 @@ _infixes = (
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[0-9{a}])\/(?=[0-9{a}])".format(a=ALPHA),
r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
r"(?<=[0-9])-(?=[0-9])",
@ -19,4 +42,6 @@ _infixes = (
)
TOKENIZER_PREFIXES = _prefixes
TOKENIZER_SUFFIXES = _suffixes
TOKENIZER_INFIXES = _infixes

View File

@ -157,6 +157,8 @@ for exc_data in [
for orth in [
"``",
"''",
"A.C.",
"a.D.",
"A.D.",
@ -172,10 +174,13 @@ for orth in [
"biol.",
"Biol.",
"ca.",
"CDU/CSU",
"Chr.",
"Cie.",
"c/o",
"co.",
"Co.",
"d'",
"D.C.",
"Dipl.-Ing.",
"Dipl.",
@ -200,12 +205,18 @@ for orth in [
"i.G.",
"i.Tr.",
"i.V.",
"I.",
"II.",
"III.",
"IV.",
"Inc.",
"Ing.",
"jr.",
"Jr.",
"jun.",
"jur.",
"K.O.",
"L'",
"L.A.",
"lat.",
"M.A.",

30
spacy/lang/eu/__init__.py Normal file
View File

@ -0,0 +1,30 @@
# coding: utf8
from __future__ import unicode_literals
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_SUFFIXES
from .tag_map import TAG_MAP
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language
from ...attrs import LANG
class BasqueDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: "eu"
tokenizer_exceptions = BASE_EXCEPTIONS
tag_map = TAG_MAP
stop_words = STOP_WORDS
suffixes = TOKENIZER_SUFFIXES
class Basque(Language):
lang = "eu"
Defaults = BasqueDefaults
__all__ = ["Basque"]

14
spacy/lang/eu/examples.py Normal file
View File

@ -0,0 +1,14 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.eu.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"bilbon ko castinga egin da eta nik jakin ez zuetako inork egin al du edota parte hartu duen ezagunik ba al du",
"gaur telebistan entzunda denok martetik gatoz hortaz martzianoak gara beno nire ustez batzuk beste batzuk baino martzianoagoak dira",
]

View File

@ -0,0 +1,79 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
# Source http://mylanguages.org/basque_numbers.php
_num_words = """
bat
bi
hiru
lau
bost
sei
zazpi
zortzi
bederatzi
hamar
hamaika
hamabi
hamahiru
hamalau
hamabost
hamasei
hamazazpi
Hemezortzi
hemeretzi
hogei
ehun
mila
milioi
""".split()
# source https://www.google.com/intl/ur/inputtools/try/
_ordinal_words = """
lehen
bigarren
hirugarren
laugarren
bosgarren
seigarren
zazpigarren
zortzigarren
bederatzigarren
hamargarren
hamaikagarren
hamabigarren
hamahirugarren
hamalaugarren
hamabosgarren
hamaseigarren
hamazazpigarren
hamazortzigarren
hemeretzigarren
hogeigarren
behin
""".split()
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
return True
if text in _ordinal_words:
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}

View File

@ -0,0 +1,7 @@
# coding: utf8
from __future__ import unicode_literals
from ..punctuation import TOKENIZER_SUFFIXES
_suffixes = TOKENIZER_SUFFIXES

108
spacy/lang/eu/stop_words.py Normal file
View File

@ -0,0 +1,108 @@
# encoding: utf8
from __future__ import unicode_literals
# Source: https://github.com/stopwords-iso/stopwords-eu
# https://www.ranks.nl/stopwords/basque
# https://www.mustgo.com/worldlanguages/basque/
STOP_WORDS = set(
"""
al
anitz
arabera
asko
baina
bat
batean
batek
bati
batzuei
batzuek
batzuetan
batzuk
bera
beraiek
berau
berauek
bere
berori
beroriek
beste
bezala
da
dago
dira
ditu
du
dute
edo
egin
ere
eta
eurak
ez
gainera
gu
gutxi
guzti
haiei
haiek
haietan
hainbeste
hala
han
handik
hango
hara
hari
hark
hartan
hau
hauei
hauek
hauetan
hemen
hemendik
hemengo
hi
hona
honek
honela
honetan
honi
hor
hori
horiei
horiek
horietan
horko
horra
horrek
horrela
horretan
horri
hortik
hura
izan
ni
noiz
nola
non
nondik
nongo
nor
nora
ze
zein
zen
zenbait
zenbat
zer
zergatik
ziren
zituen
zu
zuek
zuen
zuten
""".split()
)

71
spacy/lang/eu/tag_map.py Normal file
View File

@ -0,0 +1,71 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON
TAG_MAP = {
".": {POS: PUNCT, "PunctType": "peri"},
",": {POS: PUNCT, "PunctType": "comm"},
"-LRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"},
"-RRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"},
"``": {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"},
'""': {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
"''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
":": {POS: PUNCT},
"$": {POS: SYM, "Other": {"SymType": "currency"}},
"#": {POS: SYM, "Other": {"SymType": "numbersign"}},
"AFX": {POS: ADJ, "Hyph": "yes"},
"CC": {POS: CCONJ, "ConjType": "coor"},
"CD": {POS: NUM, "NumType": "card"},
"DT": {POS: DET},
"EX": {POS: ADV, "AdvType": "ex"},
"FW": {POS: X, "Foreign": "yes"},
"HYPH": {POS: PUNCT, "PunctType": "dash"},
"IN": {POS: ADP},
"JJ": {POS: ADJ, "Degree": "pos"},
"JJR": {POS: ADJ, "Degree": "comp"},
"JJS": {POS: ADJ, "Degree": "sup"},
"LS": {POS: PUNCT, "NumType": "ord"},
"MD": {POS: VERB, "VerbType": "mod"},
"NIL": {POS: ""},
"NN": {POS: NOUN, "Number": "sing"},
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
"NNS": {POS: NOUN, "Number": "plur"},
"PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
"POS": {POS: PART, "Poss": "yes"},
"PRP": {POS: PRON, "PronType": "prs"},
"PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
"RB": {POS: ADV, "Degree": "pos"},
"RBR": {POS: ADV, "Degree": "comp"},
"RBS": {POS: ADV, "Degree": "sup"},
"RP": {POS: PART},
"SP": {POS: SPACE},
"SYM": {POS: SYM},
"TO": {POS: PART, "PartType": "inf", "VerbForm": "inf"},
"UH": {POS: INTJ},
"VB": {POS: VERB, "VerbForm": "inf"},
"VBD": {POS: VERB, "VerbForm": "fin", "Tense": "past"},
"VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
"VBN": {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"},
"VBP": {POS: VERB, "VerbForm": "fin", "Tense": "pres"},
"VBZ": {
POS: VERB,
"VerbForm": "fin",
"Tense": "pres",
"Number": "sing",
"Person": 3,
},
"WDT": {POS: ADJ, "PronType": "int|rel"},
"WP": {POS: NOUN, "PronType": "int|rel"},
"WP$": {POS: ADJ, "Poss": "yes", "PronType": "int|rel"},
"WRB": {POS: ADV, "PronType": "int|rel"},
"ADD": {POS: X},
"NFP": {POS: PUNCT},
"GW": {POS: X},
"XX": {POS: X},
"BES": {POS: VERB},
"HVS": {POS: VERB},
"_SP": {POS: SPACE},
}

View File

@ -11,6 +11,7 @@ for exc_data in [
{ORTH: "alv.", LEMMA: "arvonlisävero"},
{ORTH: "ark.", LEMMA: "arkisin"},
{ORTH: "as.", LEMMA: "asunto"},
{ORTH: "eaa.", LEMMA: "ennen ajanlaskun alkua"},
{ORTH: "ed.", LEMMA: "edellinen"},
{ORTH: "esim.", LEMMA: "esimerkki"},
{ORTH: "huom.", LEMMA: "huomautus"},
@ -24,6 +25,7 @@ for exc_data in [
{ORTH: "läh.", LEMMA: "lähettäjä"},
{ORTH: "miel.", LEMMA: "mieluummin"},
{ORTH: "milj.", LEMMA: "miljoona"},
{ORTH: "Mm.", LEMMA: "muun muassa"},
{ORTH: "mm.", LEMMA: "muun muassa"},
{ORTH: "myöh.", LEMMA: "myöhempi"},
{ORTH: "n.", LEMMA: "noin"},

View File

@ -1,5 +1,6 @@
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, TOKEN_MATCH
from .punctuation import TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
from .punctuation import TOKENIZER_SUFFIXES
from .tag_map import TAG_MAP
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
@ -24,6 +25,7 @@ class FrenchDefaults(Language.Defaults):
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
tag_map = TAG_MAP
stop_words = STOP_WORDS
prefixes = TOKENIZER_PREFIXES
infixes = TOKENIZER_INFIXES
suffixes = TOKENIZER_SUFFIXES
token_match = TOKEN_MATCH

View File

@ -1,12 +1,23 @@
from ..punctuation import TOKENIZER_INFIXES
from ..punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
from ..char_classes import CONCAT_QUOTES, UNITS, ALPHA, ALPHA_LOWER, ALPHA_UPPER
from ..char_classes import merge_chars
ELISION = " ' ".strip().replace(" ", "").replace("\n", "")
HYPHENS = r"- ".strip().replace(" ", "").replace("\n", "")
ELISION = "' ".replace(" ", "")
HYPHENS = r"- ".replace(" ", "")
_prefixes_elision = "d l n"
_prefixes_elision += " " + _prefixes_elision.upper()
_hyphen_suffixes = "ce clés elle en il ils je là moi nous on t vous"
_hyphen_suffixes += " " + _hyphen_suffixes.upper()
_prefixes = TOKENIZER_PREFIXES + [
r"(?:({pe})[{el}])(?=[{a}])".format(
a=ALPHA, el=ELISION, pe=merge_chars(_prefixes_elision)
)
]
_suffixes = (
LIST_PUNCT
+ LIST_ELLIPSES
@ -14,7 +25,6 @@ _suffixes = (
+ [
r"(?<=[0-9])\+",
r"(?<=°[FfCcKk])\.", # °C. -> ["°C", "."]
r"(?<=[0-9])°[FfCcKk]", # 4°C -> ["4", "°C"]
r"(?<=[0-9])%", # 4% -> ["4", "%"]
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
r"(?<=[0-9])(?:{u})".format(u=UNITS),
@ -22,14 +32,17 @@ _suffixes = (
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES
),
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
r"(?<=[{a}])[{h}]({hs})".format(
a=ALPHA, h=HYPHENS, hs=merge_chars(_hyphen_suffixes)
),
]
)
_infixes = TOKENIZER_INFIXES + [
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
]
TOKENIZER_PREFIXES = _prefixes
TOKENIZER_SUFFIXES = _suffixes
TOKENIZER_INFIXES = _infixes

View File

@ -3,7 +3,7 @@ import re
from .punctuation import ELISION, HYPHENS
from ..tokenizer_exceptions import URL_PATTERN
from ..char_classes import ALPHA_LOWER, ALPHA
from ...symbols import ORTH, LEMMA, TAG
from ...symbols import ORTH, LEMMA
# not using the large _tokenizer_exceptions_list by default as it slows down the tokenizer
# from ._tokenizer_exceptions_list import FR_BASE_EXCEPTIONS
@ -53,7 +53,28 @@ for exc_data in [
_exc[exc_data[ORTH]] = [exc_data]
for orth in ["etc."]:
for orth in [
"après-midi",
"au-delà",
"au-dessus",
"celle-ci",
"celles-ci",
"celui-ci",
"cf.",
"ci-dessous",
"elle-même",
"en-dessous",
"etc.",
"jusque-là",
"lui-même",
"MM.",
"No.",
"peut-être",
"pp.",
"quelques-uns",
"rendez-vous",
"Vol.",
]:
_exc[orth] = [{ORTH: orth}]
@ -69,7 +90,7 @@ for verb, verb_lemma in [
for pronoun in ["elle", "il", "on"]:
token = f"{orth}-t-{pronoun}"
_exc[token] = [
{LEMMA: verb_lemma, ORTH: orth, TAG: "VERB"},
{LEMMA: verb_lemma, ORTH: orth}, # , TAG: "VERB"},
{LEMMA: "t", ORTH: "-t"},
{LEMMA: pronoun, ORTH: "-" + pronoun},
]
@ -78,7 +99,7 @@ for verb, verb_lemma in [("est", "être")]:
for orth in [verb, verb.title()]:
token = f"{orth}-ce"
_exc[token] = [
{LEMMA: verb_lemma, ORTH: orth, TAG: "VERB"},
{LEMMA: verb_lemma, ORTH: orth}, # , TAG: "VERB"},
{LEMMA: "ce", ORTH: "-ce"},
]
@ -86,12 +107,29 @@ for verb, verb_lemma in [("est", "être")]:
for pre, pre_lemma in [("qu'", "que"), ("n'", "ne")]:
for orth in [pre, pre.title()]:
_exc[f"{orth}est-ce"] = [
{LEMMA: pre_lemma, ORTH: orth, TAG: "ADV"},
{LEMMA: "être", ORTH: "est", TAG: "VERB"},
{LEMMA: pre_lemma, ORTH: orth},
{LEMMA: "être", ORTH: "est"},
{LEMMA: "ce", ORTH: "-ce"},
]
for verb, pronoun in [("est", "il"), ("EST", "IL")]:
token = "{}-{}".format(verb, pronoun)
_exc[token] = [
{LEMMA: "être", ORTH: verb},
{LEMMA: pronoun, ORTH: "-" + pronoun},
]
for s, verb, pronoun in [("s", "est", "il"), ("S", "EST", "IL")]:
token = "{}'{}-{}".format(s, verb, pronoun)
_exc[token] = [
{LEMMA: "se", ORTH: s + "'"},
{LEMMA: "être", ORTH: verb},
{LEMMA: pronoun, ORTH: "-" + pronoun},
]
_infixes_exc = []
orig_elision = "'"
orig_hyphen = "-"

View File

@ -1,7 +1,7 @@
from .stop_words import STOP_WORDS
from .tag_map import TAG_MAP
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .punctuation import TOKENIZER_INFIXES
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
@ -19,6 +19,7 @@ class ItalianDefaults(Language.Defaults):
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
tag_map = TAG_MAP
prefixes = TOKENIZER_PREFIXES
infixes = TOKENIZER_INFIXES

View File

@ -1,12 +1,29 @@
from ..punctuation import TOKENIZER_INFIXES
from ..char_classes import ALPHA
from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
from ..char_classes import ALPHA, HYPHENS, CONCAT_QUOTES
from ..char_classes import ALPHA_LOWER, ALPHA_UPPER
ELISION = " ' ".strip().replace(" ", "")
ELISION = "'"
_infixes = TOKENIZER_INFIXES + [
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
]
_prefixes = [r"'[0-9][0-9]", r"[0-9]+°"] + BASE_TOKENIZER_PREFIXES
_infixes = (
LIST_ELLIPSES
+ LIST_ICONS
+ [
r"(?<=[0-9])[+\-\*^](?=[0-9-])",
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}])(?:{h})(?=[{al}])".format(a=ALPHA, h=HYPHENS, al=ALPHA_LOWER),
r"(?<=[{a}0-9])[:<>=\/](?=[{a}])".format(a=ALPHA),
r"(?<=[{a}][{el}])(?=[{a}0-9\"])".format(a=ALPHA, el=ELISION),
]
)
TOKENIZER_PREFIXES = _prefixes
TOKENIZER_INFIXES = _infixes

View File

@ -1,5 +1,55 @@
from ...symbols import ORTH, LEMMA
_exc = {"po'": [{ORTH: "po'", LEMMA: "poco"}]}
_exc = {
"all'art.": [{ORTH: "all'"}, {ORTH: "art."}],
"dall'art.": [{ORTH: "dall'"}, {ORTH: "art."}],
"dell'art.": [{ORTH: "dell'"}, {ORTH: "art."}],
"L'art.": [{ORTH: "L'"}, {ORTH: "art."}],
"l'art.": [{ORTH: "l'"}, {ORTH: "art."}],
"nell'art.": [{ORTH: "nell'"}, {ORTH: "art."}],
"po'": [{ORTH: "po'", LEMMA: "poco"}],
"sett..": [{ORTH: "sett."}, {ORTH: "."}],
}
for orth in [
"..",
"....",
"al.",
"all-path",
"art.",
"Art.",
"artt.",
"att.",
"by-pass",
"c.d.",
"centro-sinistra",
"check-up",
"Civ.",
"cm.",
"Cod.",
"col.",
"Cost.",
"d.C.",
'de"',
"distr.",
"E'",
"ecc.",
"e-mail",
"e/o",
"etc.",
"Jr.",
"",
"nord-est",
"pag.",
"Proc.",
"prof.",
"sett.",
"s.p.a.",
"ss.",
"St.",
"tel.",
"week-end",
]:
_exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = _exc

View File

@ -2,11 +2,13 @@ from ..char_classes import LIST_ELLIPSES, LIST_ICONS, ALPHA, ALPHA_LOWER, ALPHA_
ELISION = " ' ".strip().replace(" ", "")
abbrev = ("d", "D")
_infixes = (
LIST_ELLIPSES
+ LIST_ICONS
+ [
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION),
r"(?<=^[{ab}][{el}])(?=[{a}])".format(ab=abbrev, a=ALPHA, el=ELISION),
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),

View File

@ -7,6 +7,8 @@ _exc = {}
# translate / delete what is not necessary
for exc_data in [
{ORTH: "t", LEMMA: "et", NORM: "et"},
{ORTH: "T", LEMMA: "et", NORM: "et"},
{ORTH: "'t", LEMMA: "et", NORM: "et"},
{ORTH: "'T", LEMMA: "et", NORM: "et"},
{ORTH: "wgl.", LEMMA: "wannechgelift", NORM: "wannechgelift"},

View File

@ -0,0 +1,31 @@
# coding: utf8
from __future__ import unicode_literals
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .punctuation import TOKENIZER_INFIXES
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups
class LigurianDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: "lij"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
infixes = TOKENIZER_INFIXES
class Ligurian(Language):
lang = "lij"
Defaults = LigurianDefaults
__all__ = ["Ligurian"]

View File

@ -0,0 +1,18 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.lij.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Sciusciâ e sciorbî no se peu.",
"Graçie di çetroin, che me son arrivæ.",
"Vegnime apreuvo, che ve fasso pescâ di òmmi.",
"Bella pe sempre l'ægua inta conchetta quande unn'agoggia d'ægua a se â trapaña.",
]

View File

@ -0,0 +1,15 @@
# coding: utf8
from __future__ import unicode_literals
from ..punctuation import TOKENIZER_INFIXES
from ..char_classes import ALPHA
ELISION = " ' ".strip().replace(" ", "").replace("\n", "")
_infixes = TOKENIZER_INFIXES + [
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
]
TOKENIZER_INFIXES = _infixes

View File

@ -0,0 +1,43 @@
# coding: utf8
from __future__ import unicode_literals
STOP_WORDS = set(
"""
a à â a-a a-e a-i a-o aiva aloa an ancheu ancon apreuvo ascì atra atre atri atro avanti avei
bella belle belli bello ben
ch' che chì chi ciù co-a co-e co-i co-o comm' comme con cösa coscì cöse
d' da da-a da-e da-i da-o dapeu de delongo derê di do doe doî donde dòppo
é e ê ea ean emmo en ëse
fin fiña
gh' ghe guæei
i î in insemme int' inta inte inti into
l' lê lì lô
m' ma manco me megio meno mezo mi
na n' ne ni ninte nisciun nisciuña no
o ò ô oua
parte pe pe-a pe-i pe-e pe-o perché pittin primma pròpio
quæ quand' quande quarche quella quelle quelli quello
s' sce scê sci sciâ sciô sciù se segge seu sò solo son sott' sta stæta stæte stæti stæto ste sti sto
tanta tante tanti tanto te ti torna tra tròppo tutta tutte tutti tutto
un uña unn' unna
za zu
""".split()
)

View File

@ -0,0 +1,52 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import ORTH, LEMMA
_exc = {}
for raw, lemma in [
("a-a", "a-o"),
("a-e", "a-o"),
("a-o", "a-o"),
("a-i", "a-o"),
("co-a", "co-o"),
("co-e", "co-o"),
("co-i", "co-o"),
("co-o", "co-o"),
("da-a", "da-o"),
("da-e", "da-o"),
("da-i", "da-o"),
("da-o", "da-o"),
("pe-a", "pe-o"),
("pe-e", "pe-o"),
("pe-i", "pe-o"),
("pe-o", "pe-o"),
]:
for orth in [raw, raw.capitalize()]:
_exc[orth] = [{ORTH: orth, LEMMA: lemma}]
# Prefix + prepositions with à (e.g. "sott'a-o")
for prep, prep_lemma in [
("a-a", "a-o"),
("a-e", "a-o"),
("a-o", "a-o"),
("a-i", "a-o"),
]:
for prefix, prefix_lemma in [
("sott'", "sotta"),
("sott", "sotta"),
("contr'", "contra"),
("contr", "contra"),
("ch'", "che"),
("ch", "che"),
("s'", "se"),
("s", "se"),
]:
for prefix_orth in [prefix, prefix.capitalize()]:
_exc[prefix_orth + prep] = [
{ORTH: prefix_orth, LEMMA: prefix_lemma},
{ORTH: prep, LEMMA: prep_lemma},
]
TOKENIZER_EXCEPTIONS = _exc

View File

@ -1,3 +1,4 @@
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
@ -23,7 +24,13 @@ class LithuanianDefaults(Language.Defaults):
)
lex_attr_getters.update(LEX_ATTRS)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
infixes = TOKENIZER_INFIXES
suffixes = TOKENIZER_SUFFIXES
mod_base_exceptions = {
exc: val for exc, val in BASE_EXCEPTIONS.items() if not exc.endswith(".")
}
del mod_base_exceptions["8)"]
tokenizer_exceptions = update_exc(mod_base_exceptions, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
tag_map = TAG_MAP
morph_rules = MORPH_RULES

View File

@ -0,0 +1,29 @@
# coding: utf8
from __future__ import unicode_literals
from ..char_classes import LIST_ICONS, LIST_ELLIPSES
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA
from ..char_classes import HYPHENS
from ..punctuation import TOKENIZER_SUFFIXES
_infixes = (
LIST_ELLIPSES
+ LIST_ICONS
+ [
r"(?<=[0-9])[+\*^](?=[0-9-])",
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
]
)
_suffixes = ["\."] + list(TOKENIZER_SUFFIXES)
TOKENIZER_INFIXES = _infixes
TOKENIZER_SUFFIXES = _suffixes

View File

@ -3,262 +3,264 @@ from ...symbols import ORTH
_exc = {}
for orth in [
"G.",
"J. E.",
"J. Em.",
"J.E.",
"J.Em.",
"K.",
"N.",
"V.",
"Vt.",
"a.",
"a.k.",
"a.s.",
"adv.",
"akad.",
"aklg.",
"akt.",
"al.",
"ang.",
"angl.",
"aps.",
"apskr.",
"apyg.",
"arbat.",
"asist.",
"asm.",
"asm.k.",
"asmv.",
"atk.",
"atsak.",
"atsisk.",
"atsisk.sąsk.",
"atv.",
"aut.",
"avd.",
"b.k.",
"baud.",
"biol.",
"bkl.",
"bot.",
"bt.",
"buv.",
"ch.",
"chem.",
"corp.",
"d.",
"dab.",
"dail.",
"dek.",
"deš.",
"dir.",
"dirig.",
"doc.",
"dol.",
"dr.",
"drp.",
"dvit.",
"dėst.",
"dš.",
"dž.",
"e.b.",
"e.bankas",
"e.p.",
"e.parašas",
"e.paštas",
"e.v.",
"e.valdžia",
"egz.",
"eil.",
"ekon.",
"el.",
"el.bankas",
"el.p.",
"el.parašas",
"el.paštas",
"el.valdžia",
"etc.",
"ež.",
"fak.",
"faks.",
"feat.",
"filol.",
"filos.",
"g.",
"gen.",
"geol.",
"gerb.",
"gim.",
"gr.",
"gv.",
"gyd.",
"gyv.",
"habil.",
"inc.",
"insp.",
"inž.",
"ir pan.",
"ir t. t.",
"isp.",
"istor.",
"it.",
"just.",
"k.",
"k. a.",
"k.a.",
"kab.",
"kand.",
"kart.",
"kat.",
"ketv.",
"kh.",
"kl.",
"kln.",
"km.",
"kn.",
"koresp.",
"kpt.",
"kr.",
"kt.",
"kub.",
"kun.",
"kv.",
"kyš.",
"l. e. p.",
"l.e.p.",
"lenk.",
"liet.",
"lot.",
"lt.",
"ltd.",
"ltn.",
"m.",
"m.e..",
"m.m.",
"mat.",
"med.",
"mgnt.",
"mgr.",
"min.",
"mjr.",
"ml.",
"mln.",
"mlrd.",
"mob.",
"mok.",
"moksl.",
"mokyt.",
"mot.",
"mr.",
"mst.",
"mstl.",
"mėn.",
"nkt.",
"no.",
"nr.",
"ntk.",
"nuotr.",
"op.",
"org.",
"orig.",
"p.",
"p.d.",
"p.m.e.",
"p.s.",
"pab.",
"pan.",
"past.",
"pav.",
"pavad.",
"per.",
"perd.",
"pirm.",
"pl.",
"plg.",
"plk.",
"pr.",
"pr.Kr.",
"pranc.",
"proc.",
"prof.",
"prom.",
"prot.",
"psl.",
"pss.",
"pvz.",
"pšt.",
"r.",
"raj.",
"red.",
"rez.",
"rež.",
"rus.",
"rš.",
"s.",
"sav.",
"saviv.",
"sek.",
"sekr.",
"sen.",
"sh.",
"sk.",
"skg.",
"skv.",
"skyr.",
"sp.",
"spec.",
"sr.",
"st.",
"str.",
"stud.",
"sąs.",
"t.",
"t. p.",
"t. y.",
"t.p.",
"t.t.",
"t.y.",
"techn.",
"tel.",
"teol.",
"th.",
"tir.",
"trit.",
"trln.",
"tšk.",
"tūks.",
"tūkst.",
"up.",
"upl.",
"v.s.",
"vad.",
"val.",
"valg.",
"ved.",
"vert.",
"vet.",
"vid.",
"virš.",
"vlsč.",
"vnt.",
"vok.",
"vs.",
"vtv.",
"vv.",
"vyr.",
"vyresn.",
"zool.",
"Įn",
"įl.",
"š.m.",
"šnek.",
"šv.",
"švč.",
"ž.ū.",
"žin.",
"žml.",
"žr.",
"n-tosios",
"?!",
# "G.",
# "J. E.",
# "J. Em.",
# "J.E.",
# "J.Em.",
# "K.",
# "N.",
# "V.",
# "Vt.",
# "a.",
# "a.k.",
# "a.s.",
# "adv.",
# "akad.",
# "aklg.",
# "akt.",
# "al.",
# "ang.",
# "angl.",
# "aps.",
# "apskr.",
# "apyg.",
# "arbat.",
# "asist.",
# "asm.",
# "asm.k.",
# "asmv.",
# "atk.",
# "atsak.",
# "atsisk.",
# "atsisk.sąsk.",
# "atv.",
# "aut.",
# "avd.",
# "b.k.",
# "baud.",
# "biol.",
# "bkl.",
# "bot.",
# "bt.",
# "buv.",
# "ch.",
# "chem.",
# "corp.",
# "d.",
# "dab.",
# "dail.",
# "dek.",
# "deš.",
# "dir.",
# "dirig.",
# "doc.",
# "dol.",
# "dr.",
# "drp.",
# "dvit.",
# "dėst.",
# "dš.",
# "dž.",
# "e.b.",
# "e.bankas",
# "e.p.",
# "e.parašas",
# "e.paštas",
# "e.v.",
# "e.valdžia",
# "egz.",
# "eil.",
# "ekon.",
# "el.",
# "el.bankas",
# "el.p.",
# "el.parašas",
# "el.paštas",
# "el.valdžia",
# "etc.",
# "ež.",
# "fak.",
# "faks.",
# "feat.",
# "filol.",
# "filos.",
# "g.",
# "gen.",
# "geol.",
# "gerb.",
# "gim.",
# "gr.",
# "gv.",
# "gyd.",
# "gyv.",
# "habil.",
# "inc.",
# "insp.",
# "inž.",
# "ir pan.",
# "ir t. t.",
# "isp.",
# "istor.",
# "it.",
# "just.",
# "k.",
# "k. a.",
# "k.a.",
# "kab.",
# "kand.",
# "kart.",
# "kat.",
# "ketv.",
# "kh.",
# "kl.",
# "kln.",
# "km.",
# "kn.",
# "koresp.",
# "kpt.",
# "kr.",
# "kt.",
# "kub.",
# "kun.",
# "kv.",
# "kyš.",
# "l. e. p.",
# "l.e.p.",
# "lenk.",
# "liet.",
# "lot.",
# "lt.",
# "ltd.",
# "ltn.",
# "m.",
# "m.e..",
# "m.m.",
# "mat.",
# "med.",
# "mgnt.",
# "mgr.",
# "min.",
# "mjr.",
# "ml.",
# "mln.",
# "mlrd.",
# "mob.",
# "mok.",
# "moksl.",
# "mokyt.",
# "mot.",
# "mr.",
# "mst.",
# "mstl.",
# "mėn.",
# "nkt.",
# "no.",
# "nr.",
# "ntk.",
# "nuotr.",
# "op.",
# "org.",
# "orig.",
# "p.",
# "p.d.",
# "p.m.e.",
# "p.s.",
# "pab.",
# "pan.",
# "past.",
# "pav.",
# "pavad.",
# "per.",
# "perd.",
# "pirm.",
# "pl.",
# "plg.",
# "plk.",
# "pr.",
# "pr.Kr.",
# "pranc.",
# "proc.",
# "prof.",
# "prom.",
# "prot.",
# "psl.",
# "pss.",
# "pvz.",
# "pšt.",
# "r.",
# "raj.",
# "red.",
# "rez.",
# "rež.",
# "rus.",
# "rš.",
# "s.",
# "sav.",
# "saviv.",
# "sek.",
# "sekr.",
# "sen.",
# "sh.",
# "sk.",
# "skg.",
# "skv.",
# "skyr.",
# "sp.",
# "spec.",
# "sr.",
# "st.",
# "str.",
# "stud.",
# "sąs.",
# "t.",
# "t. p.",
# "t. y.",
# "t.p.",
# "t.t.",
# "t.y.",
# "techn.",
# "tel.",
# "teol.",
# "th.",
# "tir.",
# "trit.",
# "trln.",
# "tšk.",
# "tūks.",
# "tūkst.",
# "up.",
# "upl.",
# "v.s.",
# "vad.",
# "val.",
# "valg.",
# "ved.",
# "vert.",
# "vet.",
# "vid.",
# "virš.",
# "vlsč.",
# "vnt.",
# "vok.",
# "vs.",
# "vtv.",
# "vv.",
# "vyr.",
# "vyresn.",
# "zool.",
# "Įn",
# "įl.",
# "š.m.",
# "šnek.",
# "šv.",
# "švč.",
# "ž.ū.",
# "žin.",
# "žml.",
# "žr.",
]:
_exc[orth] = [{ORTH: orth}]

View File

@ -1,4 +1,6 @@
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
from .punctuation import TOKENIZER_SUFFIXES
from .stop_words import STOP_WORDS
from .morph_rules import MORPH_RULES
from .syntax_iterators import SYNTAX_ITERATORS
@ -18,6 +20,9 @@ class NorwegianDefaults(Language.Defaults):
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
prefixes = TOKENIZER_PREFIXES
infixes = TOKENIZER_INFIXES
suffixes = TOKENIZER_SUFFIXES
stop_words = STOP_WORDS
morph_rules = MORPH_RULES
tag_map = TAG_MAP

View File

@ -1,13 +1,29 @@
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_PUNCT, LIST_QUOTES
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
from ..punctuation import TOKENIZER_SUFFIXES
from ..char_classes import CURRENCY, PUNCT, UNITS, LIST_CURRENCY
# Punctuation stolen from Danish
# Punctuation adapted from Danish
_quotes = CONCAT_QUOTES.replace("'", "")
_list_punct = [x for x in LIST_PUNCT if x != "#"]
_list_icons = [x for x in LIST_ICONS if x != "°"]
_list_icons = [x.replace("\\u00B0", "") for x in _list_icons]
_list_quotes = [x for x in LIST_QUOTES if x != "\\'"]
_prefixes = (
["§", "%", "=", "", "", r"\+(?![0-9])"]
+ _list_punct
+ LIST_ELLIPSES
+ LIST_QUOTES
+ LIST_CURRENCY
+ LIST_ICONS
)
_infixes = (
LIST_ELLIPSES
+ LIST_ICONS
+ _list_icons
+ [
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
@ -18,13 +34,26 @@ _infixes = (
]
)
_suffixes = [
suffix
for suffix in TOKENIZER_SUFFIXES
if suffix not in ["'s", "'S", "s", "S", r"\'"]
]
_suffixes += [r"(?<=[^sSxXzZ])\'"]
_suffixes = (
LIST_PUNCT
+ LIST_ELLIPSES
+ _list_quotes
+ _list_icons
+ ["", ""]
+ [
r"(?<=[0-9])\+",
r"(?<=°[FfCcKk])\.",
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
r"(?<=[0-9])(?:{u})".format(u=UNITS),
r"(?<=[{al}{e}{p}(?:{q})])\.".format(
al=ALPHA_LOWER, e=r"%²\-\+", q=_quotes, p=PUNCT
),
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
]
+ [r"(?<=[^sSxXzZ])'"]
)
TOKENIZER_PREFIXES = _prefixes
TOKENIZER_INFIXES = _infixes
TOKENIZER_SUFFIXES = _suffixes

View File

@ -21,57 +21,80 @@ for exc_data in [
for orth in [
"adm.dir.",
"a.m.",
"andelsnr",
"Ap.",
"Aq.",
"Ca.",
"Chr.",
"Co.",
"Co.",
"Dr.",
"F.eks.",
"Fr.p.",
"Frp.",
"Grl.",
"Kr.",
"Kr.F.",
"Kr.F.s",
"Mr.",
"Mrs.",
"Pb.",
"Pr.",
"Sp.",
"Sp.",
"St.",
"a.m.",
"ad.",
"adm.dir.",
"andelsnr",
"b.c.",
"bl.a.",
"bla.",
"bm.",
"bnr.",
"bto.",
"c.c.",
"ca.",
"cand.mag.",
"c.c.",
"co.",
"d.d.",
"dept.",
"d.m.",
"dr.philos.",
"dvs.",
"d.y.",
"E. coli",
"dept.",
"dr.",
"dr.med.",
"dr.philos.",
"dr.psychol.",
"dvs.",
"e.Kr.",
"e.l.",
"eg.",
"ekskl.",
"e.Kr.",
"el.",
"e.l.",
"et.",
"etc.",
"etg.",
"ev.",
"evt.",
"f.",
"f.Kr.",
"f.eks.",
"f.o.m.",
"fhv.",
"fk.",
"f.Kr.",
"f.o.m.",
"foreg.",
"fork.",
"fv.",
"fvt.",
"g.",
"gt.",
"gl.",
"gno.",
"gnr.",
"grl.",
"gt.",
"h.r.adv.",
"hhv.",
"hoh.",
"hr.",
"h.r.adv.",
"ifb.",
"ifm.",
"iht.",
@ -80,39 +103,45 @@ for orth in [
"jf.",
"jr.",
"jun.",
"juris.",
"kfr.",
"kgl.",
"kgl.res.",
"kl.",
"komm.",
"kr.",
"kst.",
"lat.",
"lø.",
"m.a.o.",
"m.fl.",
"m.m.",
"m.v.",
"ma.",
"mag.art.",
"m.a.o.",
"md.",
"mfl.",
"mht.",
"mill.",
"min.",
"m.m.",
"mnd.",
"moh.",
"Mr.",
"mrd.",
"muh.",
"mv.",
"mva.",
"n.å.",
"ndf.",
"no.",
"nov.",
"nr.",
"nto.",
"nyno.",
"n.å.",
"o.a.",
"o.l.",
"off.",
"ofl.",
"okt.",
"o.l.",
"on.",
"op.",
"org.",
@ -120,14 +149,15 @@ for orth in [
"ovf.",
"p.",
"p.a.",
"Pb.",
"p.g.a.",
"p.m.",
"p.t.",
"pga.",
"ph.d.",
"pkt.",
"p.m.",
"pr.",
"pst.",
"p.t.",
"pt.",
"red.anm.",
"ref.",
"res.",
@ -136,6 +166,10 @@ for orth in [
"rv.",
"s.",
"s.d.",
"s.k.",
"s.k.",
"s.u.",
"s.å.",
"sen.",
"sep.",
"siviling.",
@ -145,16 +179,17 @@ for orth in [
"sr.",
"sst.",
"st.",
"stip.",
"stk.",
"st.meld.",
"st.prp.",
"stip.",
"stk.",
"stud.",
"s.u.",
"sv.",
"sø.",
"s.å.",
"såk.",
"sø.",
"t.h.",
"t.o.m.",
"t.v.",
"temp.",
"ti.",
"tils.",
@ -162,7 +197,6 @@ for orth in [
"tl;dr",
"tlf.",
"to.",
"t.o.m.",
"ult.",
"utg.",
"v.",
@ -176,8 +210,10 @@ for orth in [
"vol.",
"vs.",
"vsa.",
"©NTB",
"årg.",
"årh.",
"§§",
]:
_exc[orth] = [{ORTH: orth}]

View File

@ -1,69 +1,47 @@
from ...symbols import ORTH, NORM
from ...symbols import ORTH
_exc = {
"às": [{ORTH: "à", NORM: "a"}, {ORTH: "s", NORM: "as"}],
"ao": [{ORTH: "a"}, {ORTH: "o"}],
"aos": [{ORTH: "a"}, {ORTH: "os"}],
"àquele": [{ORTH: "à", NORM: "a"}, {ORTH: "quele", NORM: "aquele"}],
"àquela": [{ORTH: "à", NORM: "a"}, {ORTH: "quela", NORM: "aquela"}],
"àqueles": [{ORTH: "à", NORM: "a"}, {ORTH: "queles", NORM: "aqueles"}],
"àquelas": [{ORTH: "à", NORM: "a"}, {ORTH: "quelas", NORM: "aquelas"}],
"àquilo": [{ORTH: "à", NORM: "a"}, {ORTH: "quilo", NORM: "aquilo"}],
"aonde": [{ORTH: "a"}, {ORTH: "onde"}],
}
# Contractions
_per_pron = ["ele", "ela", "eles", "elas"]
_dem_pron = [
"este",
"esta",
"estes",
"estas",
"isto",
"esse",
"essa",
"esses",
"essas",
"isso",
"aquele",
"aquela",
"aqueles",
"aquelas",
"aquilo",
]
_und_pron = ["outro", "outra", "outros", "outras"]
_adv = ["aqui", "", "ali", "além"]
for orth in _per_pron + _dem_pron + _und_pron + _adv:
_exc["d" + orth] = [{ORTH: "d", NORM: "de"}, {ORTH: orth}]
for orth in _per_pron + _dem_pron + _und_pron:
_exc["n" + orth] = [{ORTH: "n", NORM: "em"}, {ORTH: orth}]
_exc = {}
for orth in [
"Adm.",
"Art.",
"art.",
"Av.",
"av.",
"Cia.",
"dom.",
"Dr.",
"dr.",
"e.g.",
"E.g.",
"E.G.",
"e/ou",
"ed.",
"eng.",
"etc.",
"Fund.",
"Gen.",
"Gov.",
"i.e.",
"I.e.",
"I.E.",
"Inc.",
"Jr.",
"km/h",
"Ltd.",
"Mr.",
"p.m.",
"Ph.D.",
"Rep.",
"Rev.",
"S/A",
"Sen.",
"Sr.",
"sr.",
"Sra.",
"sra.",
"vs.",
"tel.",
"pág.",

View File

@ -1,5 +1,7 @@
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
from .punctuation import TOKENIZER_SUFFIXES
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
@ -21,6 +23,9 @@ class RomanianDefaults(Language.Defaults):
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
prefixes = TOKENIZER_PREFIXES
suffixes = TOKENIZER_SUFFIXES
infixes = TOKENIZER_INFIXES
tag_map = TAG_MAP

View File

@ -0,0 +1,164 @@
# coding: utf8
from __future__ import unicode_literals
import itertools
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY
from ..char_classes import LIST_ICONS, CURRENCY
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA, PUNCT
_list_icons = [x for x in LIST_ICONS if x != "°"]
_list_icons = [x.replace("\\u00B0", "") for x in _list_icons]
_ro_variants = {
"Ă": ["Ă", "A"],
"Â": ["Â", "A"],
"Î": ["Î", "I"],
"Ș": ["Ș", "Ş", "S"],
"Ț": ["Ț", "Ţ", "T"],
}
def _make_ro_variants(tokens):
variants = []
for token in tokens:
upper_token = token.upper()
upper_char_variants = [_ro_variants.get(c, [c]) for c in upper_token]
upper_variants = ["".join(x) for x in itertools.product(*upper_char_variants)]
for variant in upper_variants:
variants.extend([variant, variant.lower(), variant.title()])
return sorted(list(set(variants)))
# UD_Romanian-RRT closed class prefixes
# POS: ADP|AUX|CCONJ|DET|NUM|PART|PRON|SCONJ
_ud_rrt_prefixes = [
"a-",
"c-",
"ce-",
"cu-",
"d-",
"de-",
"dintr-",
"e-",
"făr-",
"i-",
"l-",
"le-",
"m-",
"mi-",
"n-",
"ne-",
"p-",
"pe-",
"prim-",
"printr-",
"s-",
"se-",
"te-",
"v-",
"într-",
"ș-",
"și-",
"ți-",
]
_ud_rrt_prefix_variants = _make_ro_variants(_ud_rrt_prefixes)
# UD_Romanian-RRT closed class suffixes without NUM
# POS: ADP|AUX|CCONJ|DET|PART|PRON|SCONJ
_ud_rrt_suffixes = [
"-a",
"-aceasta",
"-ai",
"-al",
"-ale",
"-alta",
"-am",
"-ar",
"-astea",
"-atâta",
"-au",
"-aș",
"-ați",
"-i",
"-ilor",
"-l",
"-le",
"-lea",
"-mea",
"-meu",
"-mi",
"-mă",
"-n",
"-ndărătul",
"-ne",
"-o",
"-oi",
"-or",
"-s",
"-se",
"-si",
"-te",
"-ul",
"-ului",
"-un",
"-uri",
"-urile",
"-urilor",
"-veți",
"-vă",
"-ăștia",
"-și",
"-ți",
]
_ud_rrt_suffix_variants = _make_ro_variants(_ud_rrt_suffixes)
_prefixes = (
["§", "%", "=", "", "", r"\+(?![0-9])"]
+ _ud_rrt_prefix_variants
+ LIST_PUNCT
+ LIST_ELLIPSES
+ LIST_QUOTES
+ LIST_CURRENCY
+ LIST_ICONS
)
_suffixes = (
_ud_rrt_suffix_variants
+ LIST_PUNCT
+ LIST_ELLIPSES
+ LIST_QUOTES
+ _list_icons
+ ["", ""]
+ [
r"(?<=[0-9])\+",
r"(?<=°[FfCcKk])\.",
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
r"(?<=[0-9{al}{e}{p}(?:{q})])\.".format(
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
),
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
]
)
_infixes = (
LIST_ELLIPSES
+ _list_icons
+ [
r"(?<=[0-9])[+\*^](?=[0-9-])",
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA),
]
)
TOKENIZER_PREFIXES = _prefixes
TOKENIZER_SUFFIXES = _suffixes
TOKENIZER_INFIXES = _infixes

View File

@ -1,4 +1,5 @@
from ...symbols import ORTH
from .punctuation import _make_ro_variants
_exc = {}
@ -42,8 +43,52 @@ for orth in [
"dpdv",
"șamd.",
"ș.a.m.d.",
# below: from UD_Romanian-RRT:
"A.c.",
"A.f.",
"A.r.",
"Al.",
"Art.",
"Aug.",
"Bd.",
"Dem.",
"Dr.",
"Fig.",
"Fr.",
"Gh.",
"Gr.",
"Lt.",
"Nr.",
"Obs.",
"Prof.",
"Sf.",
"a.m.",
"a.r.",
"alin.",
"art.",
"d-l",
"d-lui",
"d-nei",
"ex.",
"fig.",
"ian.",
"lit.",
"lt.",
"p.a.",
"p.m.",
"pct.",
"prep.",
"sf.",
"tel.",
"univ.",
"îngr.",
"într-adevăr",
"Șt.",
"ș.a.",
]:
_exc[orth] = [{ORTH: orth}]
# note: does not distinguish capitalized-only exceptions from others
for variant in _make_ro_variants([orth]):
_exc[variant] = [{ORTH: variant}]
TOKENIZER_EXCEPTIONS = _exc

View File

@ -13,6 +13,7 @@ import multiprocessing as mp
from itertools import chain, cycle
from .tokenizer import Tokenizer
from .tokens.underscore import Underscore
from .vocab import Vocab
from .lemmatizer import Lemmatizer
from .lookups import Lookups
@ -874,7 +875,10 @@ class Language(object):
sender.send()
procs = [
mp.Process(target=_apply_pipes, args=(self.make_doc, pipes, rch, sch))
mp.Process(
target=_apply_pipes,
args=(self.make_doc, pipes, rch, sch, Underscore.get_state()),
)
for rch, sch in zip(texts_q, bytedocs_send_ch)
]
for proc in procs:
@ -1146,16 +1150,19 @@ def _pipe(examples, proc, kwargs):
yield ex
def _apply_pipes(make_doc, pipes, reciever, sender):
def _apply_pipes(make_doc, pipes, receiver, sender, underscore_state, vectors):
"""Worker for Language.pipe
receiver (multiprocessing.Connection): Pipe to receive text. Usually
created by `multiprocessing.Pipe()`
sender (multiprocessing.Connection): Pipe to send doc. Usually created by
`multiprocessing.Pipe()`
underscore_state (tuple): The data in the Underscore class of the parent
vectors (dict): The global vectors data, copied from the parent
"""
Underscore.load_state(underscore_state)
while True:
texts = reciever.get()
texts = receiver.get()
docs = (make_doc(text) for text in texts)
for pipe in pipes:
docs = pipe(docs)

View File

@ -664,6 +664,8 @@ def _get_attr_values(spec, string_store):
continue
if attr == "TEXT":
attr = "ORTH"
if attr == "IS_SENT_START":
attr = "SENT_START"
attr = IDS.get(attr)
if isinstance(value, basestring):
value = string_store.add(value)

View File

@ -365,7 +365,7 @@ class Tensorizer(Pipe):
return sgd
@component("tagger", assigns=["token.tag", "token.pos"])
@component("tagger", assigns=["token.tag", "token.pos", "token.lemma"])
class Tagger(Pipe):
"""Pipeline component for part-of-speech tagging.

View File

@ -464,3 +464,5 @@ cdef enum symbol_t:
ENT_KB_ID
MORPH
ENT_ID
IDX

View File

@ -89,6 +89,7 @@ IDS = {
"SPACY": SPACY,
"PROB": PROB,
"LANG": LANG,
"IDX": IDX,
"ADJ": ADJ,
"ADP": ADP,

View File

@ -80,6 +80,11 @@ def es_tokenizer():
return get_lang_class("es").Defaults.create_tokenizer()
@pytest.fixture(scope="session")
def eu_tokenizer():
return get_lang_class("eu").Defaults.create_tokenizer()
@pytest.fixture(scope="session")
def fi_tokenizer():
return get_lang_class("fi").Defaults.create_tokenizer()

View File

@ -63,3 +63,39 @@ def test_doc_array_to_from_string_attrs(en_vocab, attrs):
words = ["An", "example", "sentence"]
doc = Doc(en_vocab, words=words)
Doc(en_vocab, words=words).from_array(attrs, doc.to_array(attrs))
def test_doc_array_idx(en_vocab):
"""Test that Doc.to_array can retrieve token start indices"""
words = ["An", "example", "sentence"]
offsets = Doc(en_vocab, words=words).to_array("IDX")
assert offsets[0] == 0
assert offsets[1] == 3
assert offsets[2] == 11
def test_doc_from_array_heads_in_bounds(en_vocab):
"""Test that Doc.from_array doesn't set heads that are out of bounds."""
words = ["This", "is", "a", "sentence", "."]
doc = Doc(en_vocab, words=words)
for token in doc:
token.head = doc[0]
# correct
arr = doc.to_array(["HEAD"])
doc_from_array = Doc(en_vocab, words=words)
doc_from_array.from_array(["HEAD"], arr)
# head before start
arr = doc.to_array(["HEAD"])
arr[0] = -1
doc_from_array = Doc(en_vocab, words=words)
with pytest.raises(ValueError):
doc_from_array.from_array(["HEAD"], arr)
# head after end
arr = doc.to_array(["HEAD"])
arr[0] = 5
doc_from_array = Doc(en_vocab, words=words)
with pytest.raises(ValueError):
doc_from_array.from_array(["HEAD"], arr)

View File

@ -145,10 +145,9 @@ def test_doc_api_runtime_error(en_tokenizer):
# Example that caused run-time error while parsing Reddit
# fmt: off
text = "67% of black households are single parent \n\n72% of all black babies born out of wedlock \n\n50% of all black kids don\u2019t finish high school"
deps = ["nsubj", "prep", "amod", "pobj", "ROOT", "amod", "attr", "",
"nummod", "prep", "det", "amod", "pobj", "acl", "prep", "prep",
"pobj", "", "nummod", "prep", "det", "amod", "pobj", "aux", "neg",
"ROOT", "amod", "dobj"]
deps = ["nummod", "nsubj", "prep", "amod", "pobj", "ROOT", "amod", "attr", "", "nummod", "appos", "prep", "det",
"amod", "pobj", "acl", "prep", "prep", "pobj",
"", "nummod", "nsubj", "prep", "det", "amod", "pobj", "aux", "neg", "ccomp", "amod", "dobj"]
# fmt: on
tokens = en_tokenizer(text)
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
@ -272,19 +271,9 @@ def test_doc_is_nered(en_vocab):
def test_doc_from_array_sent_starts(en_vocab):
words = ["I", "live", "in", "New", "York", ".", "I", "like", "cats", "."]
heads = [0, 0, 0, 0, 0, 0, 6, 6, 6, 6]
deps = [
"ROOT",
"dep",
"dep",
"dep",
"dep",
"dep",
"ROOT",
"dep",
"dep",
"dep",
"dep",
]
# fmt: off
deps = ["ROOT", "dep", "dep", "dep", "dep", "dep", "ROOT", "dep", "dep", "dep", "dep"]
# fmt: on
doc = Doc(en_vocab, words=words)
for i, (dep, head) in enumerate(zip(deps, heads)):
doc[i].dep_ = dep

View File

@ -164,6 +164,11 @@ def test_doc_token_api_head_setter(en_tokenizer):
assert doc[4].left_edge.i == 0
assert doc[2].left_edge.i == 0
# head token must be from the same document
doc2 = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
with pytest.raises(ValueError):
doc[0].head = doc2[0]
def test_is_sent_start(en_tokenizer):
doc = en_tokenizer("This is a sentence. This is another.")
@ -211,7 +216,7 @@ def test_token_api_conjuncts_chain(en_vocab):
def test_token_api_conjuncts_simple(en_vocab):
words = "They came and went .".split()
heads = [1, 0, -1, -2, -1]
deps = ["nsubj", "ROOT", "cc", "conj"]
deps = ["nsubj", "ROOT", "cc", "conj", "dep"]
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
assert [w.text for w in doc[1].conjuncts] == ["went"]
assert [w.text for w in doc[3].conjuncts] == ["came"]

View File

@ -4,6 +4,15 @@ from spacy.tokens import Doc, Span, Token
from spacy.tokens.underscore import Underscore
@pytest.fixture(scope="function", autouse=True)
def clean_underscore():
# reset the Underscore object after the test, to avoid having state copied across tests
yield
Underscore.doc_extensions = {}
Underscore.span_extensions = {}
Underscore.token_extensions = {}
def test_create_doc_underscore():
doc = Mock()
doc.doc = doc

View File

@ -55,7 +55,8 @@ def test_da_tokenizer_norm_exceptions(da_tokenizer, text, norm):
("Kristiansen c/o Madsen", 3),
("Sprogteknologi a/s", 2),
("De boede i A/B Bellevue", 5),
("Rotorhastigheden er 3400 o/m.", 5),
# note: skipping due to weirdness in UD_Danish-DDT
# ("Rotorhastigheden er 3400 o/m.", 5),
("Jeg købte billet t/r.", 5),
("Murerarbejdsmand m/k søges", 3),
("Netværket kører over TCP/IP", 4),

View File

@ -0,0 +1,22 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
def test_eu_tokenizer_handles_long_text(eu_tokenizer):
text = """ta nere guitarra estrenatu ondoren"""
tokens = eu_tokenizer(text)
assert len(tokens) == 5
@pytest.mark.parametrize(
"text,length",
[
("milesker ederra joan zen hitzaldia plazer hutsa", 7),
("astelehen guztia sofan pasau biot", 5),
],
)
def test_eu_tokenizer_handles_cnts(eu_tokenizer, text, length):
tokens = eu_tokenizer(text)
assert len(tokens) == length

View File

@ -7,12 +7,22 @@ ABBREVIATION_TESTS = [
["Hyvää", "uutta", "vuotta", "t.", "siht.", "Niemelä", "!"],
),
("Paino on n. 2.2 kg", ["Paino", "on", "n.", "2.2", "kg"]),
(
"Vuonna 1 eaa. tapahtui kauheita.",
["Vuonna", "1", "eaa.", "tapahtui", "kauheita", "."],
),
]
HYPHENATED_TESTS = [
(
"1700-luvulle sijoittuva taide-elokuva",
["1700-luvulle", "sijoittuva", "taide-elokuva"],
"1700-luvulle sijoittuva taide-elokuva Wikimedia-säätiön Varsinais-Suomen",
[
"1700-luvulle",
"sijoittuva",
"taide-elokuva",
"Wikimedia-säätiön",
"Varsinais-Suomen",
],
)
]
@ -23,6 +33,7 @@ ABBREVIATION_INFLECTION_TESTS = [
),
("ALV:n osuus on 24 %.", ["ALV:n", "osuus", "on", "24", "%", "."]),
("Hiihtäjä oli kilpailun 14:s.", ["Hiihtäjä", "oli", "kilpailun", "14:s", "."]),
("EU:n toimesta tehtiin jotain.", ["EU:n", "toimesta", "tehtiin", "jotain", "."]),
]

View File

@ -12,11 +12,11 @@ def test_lt_tokenizer_handles_long_text(lt_tokenizer):
[
(
"177R Parodų rūmaiOzo g. nuo vasario 18 d. bus skelbiamas interneto tinklalapyje.",
15,
17,
),
(
"ISM universiteto doc. dr. Ieva Augutytė-Kvedaravičienė pastebi, kad tyrimais nustatyti elgesio pokyčiai.",
16,
18,
),
],
)
@ -28,7 +28,7 @@ def test_lt_tokenizer_handles_punct_abbrev(lt_tokenizer, text, length):
@pytest.mark.parametrize("text", ["km.", "pvz.", "biol."])
def test_lt_tokenizer_abbrev_exceptions(lt_tokenizer, text):
tokens = lt_tokenizer(text)
assert len(tokens) == 1
assert len(tokens) == 2
@pytest.mark.parametrize(

View File

@ -4,6 +4,8 @@ from mock import Mock
from spacy.matcher import Matcher, DependencyMatcher
from spacy.tokens import Doc, Token
from ..doc.test_underscore import clean_underscore # noqa: F401
@pytest.fixture
def matcher(en_vocab):
@ -197,6 +199,7 @@ def test_matcher_any_token_operator(en_vocab):
assert matches[2] == "test hello world"
@pytest.mark.usefixtures("clean_underscore")
def test_matcher_extension_attribute(en_vocab):
matcher = Matcher(en_vocab)
get_is_fruit = lambda token: token.text in ("apple", "banana")

View File

@ -31,6 +31,8 @@ TEST_PATTERNS = [
([{"NORM": "a"}, {"POS": {"IN": ["NOUN"]}}], 0, 0),
([{"_": {"foo": {"NOT_IN": ["bar", "baz"]}, "a": 5, "b": {">": 10}}}], 0, 0),
([{"orth": "foo"}], 0, 0), # prev: xfail
([{"IS_SENT_START": True}], 0, 0),
([{"SENT_START": True}], 0, 0),
]

View File

@ -31,23 +31,23 @@ BIG BROTHER IS WATCHING YOU, the caption beneath it ran.
@pytest.fixture
def heads():
# fmt: off
return [1, 1, 0, 3, 2, 1, -4, -1, -1, -7, -8, 1, -10, 2, 1, -3, -1, -15,
-1, 1, 4, -1, 1, -3, 0, -1, 1, -2, -4, 1, -2, 1, -2, 3, -1, 1,
-4, -13, -14, -1, -2, 2, 1, -3, -1, 1, -2, -9, -1, 3, 1, 1, -14,
1, -2, 1, -2, -1, 1, -2, -6, -1, -1, -2, -1, -1, -42, -1, 2, 1,
0, -1, 1, -2, -1, 2, 1, -4, -8, 0, 1, -2, -1, -1, 3, -1, 1, -6,
9, 1, 7, -1, 1, -2, 3, 2, 1, -10, -1, 1, -2, -22, -1, 1, 0, -1,
2, 1, -4, -1, -2, -1, 1, -2, -6, -7, 1, -9, -1, 2, -1, -3, -1,
3, 2, 1, -4, -19, -24, 3, 2, 1, -4, -1, 1, 2, -1, -5, -34, 1, 0,
-1, 1, -2, -4, 1, 0, 1, -2, -1, 1, -2, -6, 1, 9, -1, 1, -3, -1,
-1, 3, 2, 1, 0, -1, -2, 7, -1, 5, 1, 3, -1, 1, -10, -1, -2, 1,
-2, -15, 1, 0, -1, -1, 2, 1, -3, -1, -1, -2, -1, 1, -2, -12, 1,
1, 0, 1, -2, -1, -2, -3, 9, -1, 2, -1, -4, 2, 1, -3, -4, -15, 2,
1, -3, -1, 2, 1, -3, -8, -9, -1, -2, -1, -4, 1, -2, -3, 1, -2,
-19, 17, 1, -2, 14, 13, 3, 2, 1, -4, 8, -1, 1, 5, -1, 2, 1, -3,
return [1, 1, 0, 3, 2, 1, -4, -1, -1, -7, -8, 1, 2, 1, -12, -1, -2,
-1, 1, 4, 3, 1, 1, 0, -1, 1, -2, -4, 1, -2, 1, -2, 3, -1, 1,
-4, -13, -14, -1, -2, 2, 1, -3, -1, 1, -2, -9, -1, -11, 1, 1, -14,
1, -2, 1, -2, -1, 1, -2, -6, -1, -1, -2, -1, -1, -42, -1, 1, 1,
0, -1, 1, -2, -1, 2, 1, -4, -8, 18, 1, -2, -1, -1, 3, -1, 1, 10,
9, 1, 7, -1, 1, -2, 3, 2, 1, 0, -1, 1, -2, -4, -1, 1, 0, -1,
2, 1, -4, -1, 2, 1, 1, 1, -6, -11, 1, 20, -1, 2, -1, -3, -1,
3, 2, 1, -4, -10, -11, 3, 2, 1, -4, -1, 1, -3, -1, 0, -1, 1, 0,
-1, 1, -2, -4, 1, 0, 1, -2, -1, 1, -2, -6, 1, 9, -1, 1, 6, -1,
-1, 3, 2, 1, 0, -1, -2, 7, -1, 2, 1, 3, -1, 1, -10, -1, -2, 1,
-2, -5, 1, 0, -1, -1, 1, -2, -5, -1, -1, -2, -1, 1, -2, -12, 1,
1, 0, 1, -2, -1, -4, -5, 18, -1, 2, -1, -4, 2, 1, -3, -4, -5, 2,
1, -3, -1, 2, 1, -3, -17, -24, -1, -2, -1, -4, 1, -2, -3, 1, -2,
-10, 17, 1, -2, 14, 13, 3, 2, 1, -4, 8, -1, 1, 5, -1, 2, 1, -3,
0, -1, 1, -2, -4, 1, 0, -1, -1, 2, -1, -3, 1, -2, 1, -2, 3, 1,
1, -4, -1, -2, 2, 1, -5, -19, -1, 1, 1, 0, 1, 6, -1, 1, -3, -1,
-1, -8, -9, -1]
1, -4, -1, -2, 2, 1, -3, -19, -1, 1, 1, 0, 0, 6, 5, 1, 3, -1,
-1, 0, -1, -1]
# fmt: on

View File

@ -48,7 +48,7 @@ def test_issue2203(en_vocab):
tag_ids = [en_vocab.strings.add(tag) for tag in tags]
lemma_ids = [en_vocab.strings.add(lemma) for lemma in lemmas]
doc = Doc(en_vocab, words=words)
# Work around lemma corrpution problem and set lemmas after tags
# Work around lemma corruption problem and set lemmas after tags
doc.from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
doc.from_array("LEMMA", numpy.array(lemma_ids, dtype="uint64"))
assert [t.tag_ for t in doc] == tags

View File

@ -121,7 +121,7 @@ def test_issue2772(en_vocab):
words = "When we write or communicate virtually , we can hide our true feelings .".split()
# A tree with a non-projective (i.e. crossing) arc
# The arcs (0, 4) and (2, 9) cross.
heads = [4, 1, 7, -1, -2, -1, 3, 2, 1, 0, -1, -2, -1]
heads = [4, 1, 7, -1, -2, -1, 3, 2, 1, 0, 2, 1, -3, -4]
deps = ["dep"] * len(heads)
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
assert doc[1].is_sent_start is None

View File

@ -24,7 +24,7 @@ def test_issue4590(en_vocab):
text = "The quick brown fox jumped over the lazy fox"
heads = [3, 2, 1, 1, 0, -1, 2, 1, -3]
deps = ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"]
deps = ["det", "amod", "amod", "nsubj", "ROOT", "prep", "det", "amod", "pobj"]
doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps)

View File

@ -0,0 +1,25 @@
# coding: utf8
from __future__ import unicode_literals
import numpy
from spacy.lang.en import English
from spacy.vocab import Vocab
def test_issue4725():
# ensures that this runs correctly and doesn't hang or crash because of the global vectors
vocab = Vocab(vectors_name="test_vocab_add_vector")
data = numpy.ndarray((5, 3), dtype="f")
data[0] = 1.0
data[1] = 2.0
vocab.set_vector("cat", data[0])
vocab.set_vector("dog", data[1])
nlp = English(vocab=vocab)
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
nlp.begin_training()
docs = ["Kurt is in London."] * 10
for _ in nlp.pipe(docs, batch_size=2, n_process=2):
pass

View File

@ -0,0 +1,43 @@
# coding: utf8
from __future__ import unicode_literals
from spacy.lang.en import English
from spacy.tokens import Span, Doc
class CustomPipe:
name = "my_pipe"
def __init__(self):
Span.set_extension("my_ext", getter=self._get_my_ext)
Doc.set_extension("my_ext", default=None)
def __call__(self, doc):
gathered_ext = []
for sent in doc.sents:
sent_ext = self._get_my_ext(sent)
sent._.set("my_ext", sent_ext)
gathered_ext.append(sent_ext)
doc._.set("my_ext", "\n".join(gathered_ext))
return doc
@staticmethod
def _get_my_ext(span):
return str(span.end)
def test_issue4903():
# ensures that this runs correctly and doesn't hang or crash on Windows / macOS
nlp = English()
custom_component = CustomPipe()
nlp.add_pipe(nlp.create_pipe("sentencizer"))
nlp.add_pipe(custom_component, after="sentencizer")
text = ["I like bananas.", "Do you like them?", "No, I prefer wasabi."]
docs = list(nlp.pipe(text, n_process=2))
assert docs[0].text == "I like bananas."
assert docs[1].text == "Do you like them?"
assert docs[2].text == "No, I prefer wasabi."

View File

@ -2,7 +2,7 @@ import pytest
from spacy.language import Language
def test_evaluate():
def test_issue4924():
nlp = Language()
docs_golds = [("", {})]
with pytest.raises(ValueError):

View File

@ -0,0 +1,35 @@
# coding: utf8
from __future__ import unicode_literals
import numpy
from spacy.tokens import Doc
from spacy.attrs import DEP, POS, TAG
from ..util import get_doc
def test_issue5048(en_vocab):
words = ["This", "is", "a", "sentence"]
pos_s = ["DET", "VERB", "DET", "NOUN"]
spaces = [" ", " ", " ", ""]
deps_s = ["dep", "adj", "nn", "atm"]
tags_s = ["DT", "VBZ", "DT", "NN"]
strings = en_vocab.strings
for w in words:
strings.add(w)
deps = [strings.add(d) for d in deps_s]
pos = [strings.add(p) for p in pos_s]
tags = [strings.add(t) for t in tags_s]
attrs = [POS, DEP, TAG]
array = numpy.array(list(zip(pos, deps, tags)), dtype="uint64")
doc = Doc(en_vocab, words=words, spaces=spaces)
doc.from_array(attrs, array)
v1 = [(token.text, token.pos_, token.tag_) for token in doc]
doc2 = get_doc(en_vocab, words=words, pos=pos_s, deps=deps_s, tags=tags_s)
v2 = [(token.text, token.pos_, token.tag_) for token in doc2]
assert v1 == v2

View File

@ -0,0 +1,46 @@
# coding: utf8
from __future__ import unicode_literals
import numpy as np
from spacy.lang.en import English
from spacy.pipeline import EntityRuler
def test_issue5082():
# Ensure the 'merge_entities' pipeline does something sensible for the vectors of the merged tokens
nlp = English()
vocab = nlp.vocab
array1 = np.asarray([0.1, 0.5, 0.8], dtype=np.float32)
array2 = np.asarray([-0.2, -0.6, -0.9], dtype=np.float32)
array3 = np.asarray([0.3, -0.1, 0.7], dtype=np.float32)
array4 = np.asarray([0.5, 0, 0.3], dtype=np.float32)
array34 = np.asarray([0.4, -0.05, 0.5], dtype=np.float32)
vocab.set_vector("I", array1)
vocab.set_vector("like", array2)
vocab.set_vector("David", array3)
vocab.set_vector("Bowie", array4)
text = "I like David Bowie"
ruler = EntityRuler(nlp)
patterns = [
{"label": "PERSON", "pattern": [{"LOWER": "david"}, {"LOWER": "bowie"}]}
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
parsed_vectors_1 = [t.vector for t in nlp(text)]
assert len(parsed_vectors_1) == 4
np.testing.assert_array_equal(parsed_vectors_1[0], array1)
np.testing.assert_array_equal(parsed_vectors_1[1], array2)
np.testing.assert_array_equal(parsed_vectors_1[2], array3)
np.testing.assert_array_equal(parsed_vectors_1[3], array4)
merge_ents = nlp.create_pipe("merge_entities")
nlp.add_pipe(merge_ents)
parsed_vectors_2 = [t.vector for t in nlp(text)]
assert len(parsed_vectors_2) == 3
np.testing.assert_array_equal(parsed_vectors_2[0], array1)
np.testing.assert_array_equal(parsed_vectors_2[1], array2)
np.testing.assert_array_equal(parsed_vectors_2[2], array34)

View File

@ -12,12 +12,19 @@ def load_tokenizer(b):
def test_serialize_custom_tokenizer(en_vocab, en_tokenizer):
"""Test that custom tokenizer with not all functions defined can be
serialized and deserialized correctly (see #2494)."""
"""Test that custom tokenizer with not all functions defined or empty
properties can be serialized and deserialized correctly (see #2494,
#4991)."""
tokenizer = Tokenizer(en_vocab, suffix_search=en_tokenizer.suffix_search)
tokenizer_bytes = tokenizer.to_bytes()
Tokenizer(en_vocab).from_bytes(tokenizer_bytes)
tokenizer = Tokenizer(en_vocab, rules={"ABC.": [{"ORTH": "ABC"}, {"ORTH": "."}]})
tokenizer.rules = {}
tokenizer_bytes = tokenizer.to_bytes()
tokenizer_reloaded = Tokenizer(en_vocab).from_bytes(tokenizer_bytes)
assert tokenizer_reloaded.rules == {}
@pytest.mark.skip(reason="Currently unreliable across platforms")
@pytest.mark.parametrize("text", ["I💜you", "theyre", "“hello”"])

View File

@ -28,10 +28,10 @@ def test_displacy_parse_deps(en_vocab):
deps = displacy.parse_deps(doc)
assert isinstance(deps, dict)
assert deps["words"] == [
{"text": "This", "tag": "DET"},
{"text": "is", "tag": "AUX"},
{"text": "a", "tag": "DET"},
{"text": "sentence", "tag": "NOUN"},
{"lemma": None, "text": words[0], "tag": pos[0]},
{"lemma": None, "text": words[1], "tag": pos[1]},
{"lemma": None, "text": words[2], "tag": pos[2]},
{"lemma": None, "text": words[3], "tag": pos[3]},
]
assert deps["arcs"] == [
{"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
@ -72,7 +72,7 @@ def test_displacy_rtl():
deps = ["foo", "bar", "foo", "baz"]
heads = [1, 0, 1, -2]
nlp = Persian()
doc = get_doc(nlp.vocab, words=words, pos=pos, tags=pos, heads=heads, deps=deps)
doc = get_doc(nlp.vocab, words=words, tags=pos, heads=heads, deps=deps)
doc.ents = [Span(doc, 1, 3, label="TEST")]
html = displacy.render(doc, page=True, style="dep")
assert "direction: rtl" in html

View File

@ -4,8 +4,10 @@ import shutil
import contextlib
import srsly
from pathlib import Path
from spacy import Errors
from spacy.tokens import Doc, Span
from spacy.attrs import POS, HEAD, DEP
from spacy.attrs import POS, TAG, HEAD, DEP, LEMMA
@contextlib.contextmanager
@ -22,30 +24,56 @@ def make_tempdir():
shutil.rmtree(str(d))
def get_doc(vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=None):
def get_doc(
vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=None, lemmas=None
):
"""Create Doc object from given vocab, words and annotations."""
pos = pos or [""] * len(words)
tags = tags or [""] * len(words)
heads = heads or [0] * len(words)
deps = deps or [""] * len(words)
for value in deps + tags + pos:
if deps and not heads:
heads = [0] * len(deps)
headings = []
values = []
annotations = [pos, heads, deps, lemmas, tags]
possible_headings = [POS, HEAD, DEP, LEMMA, TAG]
for a, annot in enumerate(annotations):
if annot is not None:
if len(annot) != len(words):
raise ValueError(Errors.E189)
headings.append(possible_headings[a])
if annot is not heads:
values.extend(annot)
for value in values:
vocab.strings.add(value)
doc = Doc(vocab, words=words)
attrs = doc.to_array([POS, HEAD, DEP])
for i, (p, head, dep) in enumerate(zip(pos, heads, deps)):
attrs[i, 0] = doc.vocab.strings[p]
attrs[i, 1] = head
attrs[i, 2] = doc.vocab.strings[dep]
doc.from_array([POS, HEAD, DEP], attrs)
# if there are any other annotations, set them
if headings:
attrs = doc.to_array(headings)
j = 0
for annot in annotations:
if annot:
if annot is heads:
for i in range(len(words)):
if attrs.ndim == 1:
attrs[i] = heads[i]
else:
attrs[i, j] = heads[i]
else:
for i in range(len(words)):
if attrs.ndim == 1:
attrs[i] = doc.vocab.strings[annot[i]]
else:
attrs[i, j] = doc.vocab.strings[annot[i]]
j += 1
doc.from_array(headings, attrs)
# finally, set the entities
if ents:
doc.ents = [
Span(doc, start, end, label=doc.vocab.strings[label])
for start, end, label in ents
]
if tags:
for token in doc:
token.tag_ = tags[token.i]
return doc
@ -86,8 +114,7 @@ def assert_docs_equal(doc1, doc2):
assert [t.head.i for t in doc1] == [t.head.i for t in doc2]
assert [t.dep for t in doc1] == [t.dep for t in doc2]
if doc1.is_parsed and doc2.is_parsed:
assert [s for s in doc1.sents] == [s for s in doc2.sents]
assert [t.is_sent_start for t in doc1] == [t.is_sent_start for t in doc2]
assert [t.ent_type for t in doc1] == [t.ent_type for t in doc2]
assert [t.ent_iob for t in doc1] == [t.ent_iob for t in doc2]

View File

@ -699,6 +699,7 @@ cdef class Tokenizer:
DOCS: https://spacy.io/api/tokenizer#to_disk
"""
path = util.ensure_path(path)
with path.open("wb") as file_:
file_.write(self.to_bytes(**kwargs))
@ -712,6 +713,7 @@ cdef class Tokenizer:
DOCS: https://spacy.io/api/tokenizer#from_disk
"""
path = util.ensure_path(path)
with path.open("rb") as file_:
bytes_data = file_.read()
self.from_bytes(bytes_data, **kwargs)
@ -756,21 +758,20 @@ cdef class Tokenizer:
}
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
msg = util.from_bytes(bytes_data, deserializers, exclude)
if data.get("prefix_search"):
if "prefix_search" in data and isinstance(data["prefix_search"], str):
self.prefix_search = re.compile(data["prefix_search"]).search
if data.get("suffix_search"):
if "suffix_search" in data and isinstance(data["suffix_search"], str):
self.suffix_search = re.compile(data["suffix_search"]).search
if data.get("infix_finditer"):
if "infix_finditer" in data and isinstance(data["infix_finditer"], str):
self.infix_finditer = re.compile(data["infix_finditer"]).finditer
if data.get("token_match"):
if "token_match" in data and isinstance(data["token_match"], str):
self.token_match = re.compile(data["token_match"]).match
if data.get("rules"):
if "rules" in data and isinstance(data["rules"], dict):
# make sure to hard reset the cache to remove data from the default exceptions
self._rules = {}
self._flush_cache()
self._flush_specials()
self._load_special_cases(data.get("rules", {}))
self._load_special_cases(data["rules"])
return self

View File

@ -213,6 +213,10 @@ def _merge(Doc doc, merges):
new_orth = ''.join([t.text_with_ws for t in spans[token_index]])
if spans[token_index][-1].whitespace_:
new_orth = new_orth[:-len(spans[token_index][-1].whitespace_)]
# add the vector of the (merged) entity to the vocab
if not doc.vocab.get_vector(new_orth).any():
if doc.vocab.vectors_length > 0:
doc.vocab.set_vector(new_orth, span.vector)
token = tokens[token_index]
lex = doc.vocab.get(doc.mem, new_orth)
token.lex = lex

View File

@ -19,7 +19,7 @@ from ..lexeme cimport Lexeme, EMPTY_LEXEME
from ..typedefs cimport attr_t, flags_t
from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER
from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, attr_id_t
from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, IDX, attr_id_t
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
from ..attrs import intify_attrs, IDS
@ -68,6 +68,8 @@ cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
return token.ent_id
elif feat_name == ENT_KB_ID:
return token.ent_kb_id
elif feat_name == IDX:
return token.idx
else:
return Lexeme.get_struct_attr(token.lex, feat_name)
@ -253,7 +255,7 @@ cdef class Doc:
def is_nered(self):
"""Check if the document has named entities set. Will return True if
*any* of the tokens has a named entity tag set (even if the others are
unknown values).
unknown values), or if the document is empty.
"""
if len(self) == 0:
return True
@ -778,10 +780,12 @@ cdef class Doc:
# Allow strings, e.g. 'lemma' or 'LEMMA'
attrs = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_)
for id_ in attrs]
if array.dtype != numpy.uint64:
user_warning(Warnings.W028.format(type=array.dtype))
if SENT_START in attrs and HEAD in attrs:
raise ValueError(Errors.E032)
cdef int i, col
cdef int i, col, abs_head_index
cdef attr_id_t attr_id
cdef TokenC* tokens = self.c
cdef int length = len(array)
@ -795,6 +799,14 @@ cdef class Doc:
attr_ids[i] = attr_id
if len(array.shape) == 1:
array = array.reshape((array.size, 1))
# Check that all heads are within the document bounds
if HEAD in attrs:
col = attrs.index(HEAD)
for i in range(length):
# cast index to signed int
abs_head_index = numpy.int32(array[i, col]) + i
if abs_head_index < 0 or abs_head_index >= length:
raise ValueError(Errors.E190.format(index=i, value=array[i, col], rel_head_index=numpy.int32(array[i, col])))
# Do TAG first. This lets subsequent loop override stuff like POS, LEMMA
if TAG in attrs:
col = attrs.index(TAG)
@ -865,7 +877,7 @@ cdef class Doc:
DOCS: https://spacy.io/api/doc#to_bytes
"""
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID] # TODO: ENT_KB_ID ?
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID, NORM] # TODO: ENT_KB_ID ?
if self.is_tagged:
array_head.extend([TAG, POS])
# If doc parsed add head and dep attribute
@ -1166,6 +1178,7 @@ cdef int set_children_from_heads(TokenC* tokens, int length) except -1:
heads_within_sents = _set_lr_kids_and_edges(tokens, length, loop_count)
if loop_count > 10:
warnings.warn(Warnings.W026)
break
loop_count += 1
# Set sentence starts
for i in range(length):

View File

@ -626,6 +626,9 @@ cdef class Token:
# This function sets the head of self to new_head and updates the
# counters for left/right dependents and left/right corner for the
# new and the old head
# Check that token is from the same document
if self.doc != new_head.doc:
raise ValueError(Errors.E191)
# Do nothing if old head is new head
if self.i + self.c.head == new_head.i:
return

View File

@ -76,6 +76,14 @@ class Underscore(object):
def _get_key(self, name):
return ("._.", name, self._start, self._end)
@classmethod
def get_state(cls):
return cls.token_extensions, cls.span_extensions, cls.doc_extensions
@classmethod
def load_state(cls, state):
cls.token_extensions, cls.span_extensions, cls.doc_extensions = state
def get_ext_args(**kwargs):
"""Validate and convert arguments. Reused in Doc, Token and Span."""

View File

@ -349,44 +349,6 @@ cdef class Vectors:
for i in range(len(queries)) ], dtype="uint64")
return (keys, best_rows, scores)
def from_glove(self, path):
"""Load GloVe vectors from a directory. Assumes binary format,
that the vocab is in a vocab.txt, and that vectors are named
vectors.{size}.[fd].bin, e.g. vectors.128.f.bin for 128d float32
vectors, vectors.300.d.bin for 300d float64 (double) vectors, etc.
By default GloVe outputs 64-bit vectors.
path (unicode / Path): The path to load the GloVe vectors from.
RETURNS: A `StringStore` object, holding the key-to-string mapping.
DOCS: https://spacy.io/api/vectors#from_glove
"""
path = util.ensure_path(path)
width = None
for name in path.iterdir():
if name.parts[-1].startswith("vectors"):
_, dims, dtype, _2 = name.parts[-1].split('.')
width = int(dims)
break
else:
raise IOError(Errors.E061.format(filename=path))
bin_loc = path / f"vectors.{dims}.{dtype}.bin"
xp = get_array_module(self.data)
self.data = None
with bin_loc.open("rb") as file_:
self.data = xp.fromfile(file_, dtype=dtype)
if dtype != "float32":
self.data = xp.ascontiguousarray(self.data, dtype="float32")
if self.data.ndim == 1:
self.data = self.data.reshape((self.data.size//width, width))
n = 0
strings = StringStore()
with (path / "vocab.txt").open("r") as file_:
for i, line in enumerate(file_):
key = strings.add(line.strip())
self.add(key, row=i)
return strings
def to_disk(self, path, **kwargs):
"""Save the current state to a directory.

View File

@ -109,9 +109,9 @@ links) and check whether they are compatible with the currently installed
version of spaCy. Should be run after upgrading spaCy via `pip install -U spacy`
to ensure that all installed models are can be used with the new version. The
command is also useful to detect out-of-sync model links resulting from links
created in different virtual environments. It will a list of models, the
installed versions, the latest compatible version (if out of date) and the
commands for updating.
created in different virtual environments. It will show a list of models and
their installed versions. If any model is out of date, the latest compatible
versions and command for updating are shown.
> #### Automated validation
>
@ -176,7 +176,7 @@ All output files generated by this command are compatible with
## Debug data {#debug-data new="2.2"}
Analyze, debug and validate your training and development data, get useful
Analyze, debug, and validate your training and development data. Get useful
stats, and find problems like invalid entity annotations, cyclic dependencies,
low data labels and more.
@ -185,10 +185,11 @@ $ python -m spacy debug-data [lang] [train_path] [dev_path] [--base-model] [--pi
```
| Argument | Type | Description |
| -------------------------- | ---------- | -------------------------------------------------------------------------------------------------- |
| ------------------------------------------------------ | ---------- | -------------------------------------------------------------------------------------------------- |
| `lang` | positional | Model language. |
| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. |
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
| `--tag-map-path`, `-tm` <Tag variant="new">2.2.3</Tag> | option | Location of JSON-formatted tag map. |
| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
| `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
@ -368,6 +369,7 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
| `--base-model`, `-b` <Tag variant="new">2.1</Tag> | option | Optional name of base model to update. Can be any loadable spaCy model. |
| `--pipeline`, `-p` <Tag variant="new">2.1</Tag> | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
| `--replace-components`, `-R` | flag | Replace components from the base model. |
| `--vectors`, `-v` | option | Model to load vectors from. |
| `--n-iter`, `-n` | option | Number of iterations (default: `30`). |
| `--n-early-stopping`, `-ne` | option | Maximum number of training epochs without dev accuracy improvement. |
@ -378,6 +380,13 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
| `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. |
| `--parser-multitasks`, `-pt` | option | Side objectives for parser CNN, e.g. `'dep'` or `'dep,tag'` |
| `--entity-multitasks`, `-et` | option | Side objectives for NER CNN, e.g. `'dep'` or `'dep,tag'` |
| `--width`, `-cw` <Tag variant="new">2.2.4</Tag> | option | Width of CNN layers of `Tok2Vec` component. |
| `--conv-depth`, `-cd` <Tag variant="new">2.2.4</Tag> | option | Depth of CNN layers of `Tok2Vec` component. |
| `--cnn-window`, `-cW` <Tag variant="new">2.2.4</Tag> | option | Window size for CNN layers of `Tok2Vec` component. |
| `--cnn-pieces`, `-cP` <Tag variant="new">2.2.4</Tag> | option | Maxout size for CNN layers of `Tok2Vec` component. |
| `--use-chars`, `-chr` <Tag variant="new">2.2.4</Tag> | flag | Whether to use character-based embedding of `Tok2Vec` component. |
| `--bilstm-depth`, `-lstm` <Tag variant="new">2.2.4</Tag> | option | Depth of BiLSTM layers of `Tok2Vec` component (requires PyTorch). |
| `--embed-rows`, `-er` <Tag variant="new">2.2.4</Tag> | option | Number of embedding rows of `Tok2Vec` component. |
| `--noise-level`, `-nl` | option | Float indicating the amount of corruption for data augmentation. |
| `--orth-variant-level`, `-ovl` <Tag variant="new">2.2</Tag> | option | Float indicating the orthography variation for data augmentation (e.g. `0.3` for making 30% of occurrences of some tokens subject to replacement). |
| `--gold-preproc`, `-G` | flag | Use gold preprocessing. |
@ -385,6 +394,7 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
| `--textcat-multilabel`, `-TML` <Tag variant="new">2.2</Tag> | flag | Text classification classes aren't mutually exclusive (multilabel). |
| `--textcat-arch`, `-ta` <Tag variant="new">2.2</Tag> | option | Text classification model architecture. Defaults to `"bow"`. |
| `--textcat-positive-label`, `-tpl` <Tag variant="new">2.2</Tag> | option | Text classification positive label for binary classes with two labels. |
| `--tag-map-path`, `-tm` <Tag variant="new">2.2.4</Tag> | option | Location of JSON-formatted tag map. |
| `--verbose`, `-VV` <Tag variant="new">2.0.13</Tag> | flag | Show more detailed messages during training. |
| `--help`, `-h` | flag | Show help message and available arguments. |
| **CREATES** | model, pickle | A spaCy model on each epoch. |

View File

@ -7,9 +7,10 @@ source: spacy/tokens/doc.pyx
A `Doc` is a sequence of [`Token`](/api/token) objects. Access sentences and
named entities, export annotations to numpy arrays, losslessly serialize to
compressed binary strings. The `Doc` object holds an array of [`TokenC`](/api/cython-structs#tokenc) structs.
The Python-level `Token` and [`Span`](/api/span) objects are views of this
array, i.e. they don't own the data themselves.
compressed binary strings. The `Doc` object holds an array of
[`TokenC`](/api/cython-structs#tokenc) structs. The Python-level `Token` and
[`Span`](/api/span) objects are views of this array, i.e. they don't own the
data themselves.
## Doc.\_\_init\_\_ {#init tag="method"}
@ -198,10 +199,11 @@ the character indices don't map to a valid span.
> ```
| Name | Type | Description |
| ----------- | ---------------------------------------- | ------------------------------------------------------- |
| ------------------------------------ | ---------------------------------------- | --------------------------------------------------------------------- |
| `start` | int | The index of the first character of the span. |
| `end` | int | The index of the last character after the span. |
| `label` | uint64 / unicode | A label to attach to the Span, e.g. for named entities. |
| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. |
| `kb_id` <Tag variant="new">2.2</Tag> | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. |
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
| **RETURNS** | `Span` | The newly constructed object or `None`. |
@ -655,10 +657,10 @@ The L2 norm of the document's vector representation.
| `user_data` | - | A generic storage area, for user custom data. |
| `lang` <Tag variant="new">2.1</Tag> | int | Language of the document's vocabulary. |
| `lang_` <Tag variant="new">2.1</Tag> | unicode | Language of the document's vocabulary. |
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. |
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. |
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. |
| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if _any_ of the tokens has an entity tag set, even if the others are unknown. |
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. Returns `True` if the `Doc` is empty. |
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. Returns `True` if the `Doc` is empty. |
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. Returns `True` if the `Doc` is empty. |
| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if the `Doc` is empty, or if _any_ of the tokens has an entity tag set, even if the others are unknown. |
| `sentiment` | float | The document's positivity/negativity score, if available. |
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |

View File

@ -83,7 +83,8 @@ Find matches in the `Doc` and add them to the `doc.ents`. Typically, this
happens automatically after the component has been added to the pipeline using
[`nlp.add_pipe`](/api/language#add_pipe). If the entity ruler was initialized
with `overwrite_ents=True`, existing entities will be replaced if they overlap
with the matches.
with the matches. When matches overlap in a Doc, the entity ruler prioritizes longer
patterns over shorter, and if equal the match occuring first in the Doc is chosen.
> #### Example
>

View File

@ -172,6 +172,28 @@ Remove a previously registered extension.
| `name` | unicode | Name of the extension. |
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
## Span.char_span {#char_span tag="method" new="2.2.4"}
Create a `Span` object from the slice `span.text[start:end]`. Returns `None` if
the character indices don't map to a valid span.
> #### Example
>
> ```python
> doc = nlp("I like New York")
> span = doc[1:4].char_span(5, 13, label="GPE")
> assert span.text == "New York"
> ```
| Name | Type | Description |
| ----------- | ---------------------------------------- | --------------------------------------------------------------------- |
| `start` | int | The index of the first character of the span. |
| `end` | int | The index of the last character after the span. |
| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. |
| `kb_id` | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. |
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
| **RETURNS** | `Span` | The newly constructed object or `None`. |
## Span.similarity {#similarity tag="method" model="vectors"}
Make a semantic similarity estimate. The default estimate is cosine similarity
@ -294,7 +316,7 @@ Create a new `Doc` object corresponding to the `Span`, with a copy of the data.
> ```
| Name | Type | Description |
| ----------------- | ----- | ---------------------------------------------------- |
| ---------------- | ----- | ---------------------------------------------------- |
| `copy_user_data` | bool | Whether or not to copy the original doc's user data. |
| **RETURNS** | `Doc` | A `Doc` object of the `Span`'s content. |

View File

@ -437,8 +437,8 @@ The L2 norm of the token's vector representation.
| `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
| `lower` | int | Lowercase form of the token. |
| `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
| `shape` | int | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
| `shape_` | unicode | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
| `shape` | int | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
| `shape_` | unicode | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
| `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. |
| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |

View File

@ -237,8 +237,9 @@ If a setting is not present in the options, the default value will be used.
> ```
| Name | Type | Description | Default |
| ------------------ | ------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
| ------------------------------------------ | ------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
| `add_lemma` <Tag variant="new">2.2.4</Tag> | bool | Print the lemma's in a separate row below the token texts. | `False` |
| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` |
| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |

View File

@ -326,25 +326,6 @@ performed in chunks, to avoid consuming too much memory. You can set the
| `sort` | bool | Whether to sort the entries returned by score. Defaults to `True`. |
| **RETURNS** | tuple | The most similar entries as a `(keys, best_rows, scores)` tuple. |
## Vectors.from_glove {#from_glove tag="method"}
Load [GloVe](https://nlp.stanford.edu/projects/glove/) vectors from a directory.
Assumes binary format, that the vocab is in a `vocab.txt`, and that vectors are
named `vectors.{size}.[fd.bin]`, e.g. `vectors.128.f.bin` for 128d float32
vectors, `vectors.300.d.bin` for 300d float64 (double) vectors, etc. By default
GloVe outputs 64-bit vectors.
> #### Example
>
> ```python
> vectors = Vectors()
> vectors.from_glove("/path/to/glove_vectors")
> ```
| Name | Type | Description |
| ------ | ---------------- | ---------------------------------------- |
| `path` | unicode / `Path` | The path to load the GloVe vectors from. |
## Vectors.to_disk {#to_disk tag="method"}
Save the current state to a directory.

View File

@ -622,13 +622,13 @@ categorizer is to use the [`spacy train`](/api/cli#train) command-line utility.
In order to use this, you'll need training and evaluation data in the
[JSON format](/api/annotation#json-input) spaCy expects for training.
You can now train the model using a corpus for your language annotated with If
your data is in one of the supported formats, the easiest solution might be to
use the [`spacy convert`](/api/cli#convert) command-line utility. This supports
several popular formats, including the IOB format for named entity recognition,
the JSONL format produced by our annotation tool [Prodigy](https://prodi.gy),
and the [CoNLL-U](http://universaldependencies.org/docs/format.html) format used
by the [Universal Dependencies](http://universaldependencies.org/) corpus.
If your data is in one of the supported formats, the easiest solution might be
to use the [`spacy convert`](/api/cli#convert) command-line utility. This
supports several popular formats, including the IOB format for named entity
recognition, the JSONL format produced by our annotation tool
[Prodigy](https://prodi.gy), and the
[CoNLL-U](http://universaldependencies.org/docs/format.html) format used by the
[Universal Dependencies](http://universaldependencies.org/) corpus.
One thing to keep in mind is that spaCy expects to train its models from **whole
documents**, not just single sentences. If your corpus only contains single

View File

@ -968,7 +968,10 @@ pattern. The entity ruler accepts two types of patterns:
The [`EntityRuler`](/api/entityruler) is a pipeline component that's typically
added via [`nlp.add_pipe`](/api/language#add_pipe). When the `nlp` object is
called on a text, it will find matches in the `doc` and add them as entities to
the `doc.ents`, using the specified pattern label as the entity label.
the `doc.ents`, using the specified pattern label as the entity label. If any
matches were to overlap, the pattern matching most tokens takes priority. If
they also happen to be equally long, then the match occuring first in the Doc is
chosen.
```python
### {executable="true"}
@ -1119,7 +1122,7 @@ entityruler = EntityRuler(nlp)
patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]
other_pipes = [p for p in nlp.pipe_names if p != "tagger"]
with nlp.disable_pipes(*disable_pipes):
with nlp.disable_pipes(*other_pipes):
entityruler.add_patterns(patterns)
```

View File

@ -94,7 +94,7 @@ docs = list(doc_bin.get_docs(nlp.vocab))
If `store_user_data` is set to `True`, the `Doc.user_data` will be serialized as
well, which includes the values of
[extension attributes](/processing-pipelines#custom-components-attributes) (if
[extension attributes](/usage/processing-pipelines#custom-components-attributes) (if
they're serializable with msgpack).
<Infobox title="Important note on serializing extension attributes" variant="warning">

Some files were not shown because too many files have changed in this diff Show More