mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-12 10:16:27 +03:00
Merge branch 'master' into tmp/sync
commit
46568f40a7
106
.github/contributors/Baciccin.md
vendored
Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                    |
|------------------------------- | ------------------------ |
| Name                           | Giovanni Battista Parodi |
| Company name (if applicable)   |                          |
| Title or role (if applicable)  |                          |
| Date                           | 2020-03-19               |
| GitHub username                | Baciccin                 |
| Website (optional)             |                          |
106
.github/contributors/MisterKeefe.md
vendored
Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [ ] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Tom Keefe            |
| Company name (if applicable)   | /                    |
| Title or role (if applicable)  | /                    |
| Date                           | 18 February 2020     |
| GitHub username                | MisterKeefe          |
| Website (optional)             | /                    |
106
.github/contributors/Tiljander.md
vendored
Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Henrik Tiljander     |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 24/3/2020            |
| GitHub username                | Tiljander            |
| Website (optional)             |                      |
106
.github/contributors/dhpollack.md
vendored
Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [X] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | David Pollack        |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | Mar 5. 2020          |
| GitHub username                | dhpollack            |
| Website (optional)             |                      |
106
.github/contributors/guerda.md
vendored
Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Philip Gillißen      |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2020-03-24           |
| GitHub username                | guerda               |
| Website (optional)             |                      |
89
.github/contributors/mabraham.md
vendored
Normal file
@@ -0,0 +1,89 @@
+
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+* you hereby assign to us joint ownership, and to the extent that such
+assignment is or becomes invalid, ineffective or unenforceable, you hereby
+grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+royalty-free, unrestricted license to exercise all rights under those
+copyrights. This includes, at our option, the right to sublicense these same
+rights to third parties through multiple levels of sublicensees or other
+licensing arrangements;
+
+* you agree that each of us can do all things in relation to your
+contribution as if each of us were the sole owners, and if one of us makes
+a derivative work of your contribution, the one who makes the derivative
+work (or has it made will be the sole owner of that derivative work;
+
+* you agree that you will not assert any moral rights in your contribution
+against us, our licensees or transferees;
+
+* you agree that we may register a copyright in your contribution and
+exercise all ownership rights associated with it; and
+
+* you agree that neither of us has any duty to consult with, obtain the
+consent of, pay or render an accounting to the other for any use or
+distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+* make, have made, use, sell, offer to sell, import, and otherwise transfer
+your contribution in whole or in part, alone or in combination with or
+included in any product, work or materials arising out of the project to
+which your contribution was submitted, and
+
+* at our option, to sublicense these same rights to third parties through
+multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+* Each contribution that you submit is and shall be an original work of
+authorship and you can legally grant the rights set out in this SCA;
+
+* to the best of your knowledge, each contribution will not violate any
+third party's copyrights, trademarks, patents, or other intellectual
+property rights; and
+
+* each contribution shall be in compliance with U.S. export control laws and
+other applicable export and import laws. You agree to notify us if you
+become aware of any circumstance which would make any of the foregoing
+representations inaccurate in any respect. We may publicly disclose your
+participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+* [x] I am signing on behalf of myself as an individual and no other person
+or entity, including my employer, has or will have rights with respect to my
+contributions.
+
+* [ ] I am signing on behalf of my employer or a legal entity and I have the
+actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | |
+| GitHub username | |
+| Website (optional) | |
106
.github/contributors/merrcury.md
vendored
Normal file
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+* you hereby assign to us joint ownership, and to the extent that such
+assignment is or becomes invalid, ineffective or unenforceable, you hereby
+grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+royalty-free, unrestricted license to exercise all rights under those
+copyrights. This includes, at our option, the right to sublicense these same
+rights to third parties through multiple levels of sublicensees or other
+licensing arrangements;
+
+* you agree that each of us can do all things in relation to your
+contribution as if each of us were the sole owners, and if one of us makes
+a derivative work of your contribution, the one who makes the derivative
+work (or has it made will be the sole owner of that derivative work;
+
+* you agree that you will not assert any moral rights in your contribution
+against us, our licensees or transferees;
+
+* you agree that we may register a copyright in your contribution and
+exercise all ownership rights associated with it; and
+
+* you agree that neither of us has any duty to consult with, obtain the
+consent of, pay or render an accounting to the other for any use or
+distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+* make, have made, use, sell, offer to sell, import, and otherwise transfer
+your contribution in whole or in part, alone or in combination with or
+included in any product, work or materials arising out of the project to
+which your contribution was submitted, and
+
+* at our option, to sublicense these same rights to third parties through
+multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+* Each contribution that you submit is and shall be an original work of
+authorship and you can legally grant the rights set out in this SCA;
+
+* to the best of your knowledge, each contribution will not violate any
+third party's copyrights, trademarks, patents, or other intellectual
+property rights; and
+
+* each contribution shall be in compliance with U.S. export control laws and
+other applicable export and import laws. You agree to notify us if you
+become aware of any circumstance which would make any of the foregoing
+representations inaccurate in any respect. We may publicly disclose your
+participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+* [X] I am signing on behalf of myself as an individual and no other person
+or entity, including my employer, has or will have rights with respect to my
+contributions.
+
+* [ ] I am signing on behalf of my employer or a legal entity and I have the
+actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Himanshu Garg |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 2020-03-10 |
+| GitHub username | merrcury |
+| Website (optional) | |
106
.github/contributors/pinealan.md
vendored
Normal file
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+* you hereby assign to us joint ownership, and to the extent that such
+assignment is or becomes invalid, ineffective or unenforceable, you hereby
+grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+royalty-free, unrestricted license to exercise all rights under those
+copyrights. This includes, at our option, the right to sublicense these same
+rights to third parties through multiple levels of sublicensees or other
+licensing arrangements;
+
+* you agree that each of us can do all things in relation to your
+contribution as if each of us were the sole owners, and if one of us makes
+a derivative work of your contribution, the one who makes the derivative
+work (or has it made will be the sole owner of that derivative work;
+
+* you agree that you will not assert any moral rights in your contribution
+against us, our licensees or transferees;
+
+* you agree that we may register a copyright in your contribution and
+exercise all ownership rights associated with it; and
+
+* you agree that neither of us has any duty to consult with, obtain the
+consent of, pay or render an accounting to the other for any use or
+distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+* make, have made, use, sell, offer to sell, import, and otherwise transfer
+your contribution in whole or in part, alone or in combination with or
+included in any product, work or materials arising out of the project to
+which your contribution was submitted, and
+
+* at our option, to sublicense these same rights to third parties through
+multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+* Each contribution that you submit is and shall be an original work of
+authorship and you can legally grant the rights set out in this SCA;
+
+* to the best of your knowledge, each contribution will not violate any
+third party's copyrights, trademarks, patents, or other intellectual
+property rights; and
+
+* each contribution shall be in compliance with U.S. export control laws and
+other applicable export and import laws. You agree to notify us if you
+become aware of any circumstance which would make any of the foregoing
+representations inaccurate in any respect. We may publicly disclose your
+participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+* [x] I am signing on behalf of myself as an individual and no other person
+or entity, including my employer, has or will have rights with respect to my
+contributions.
+
+* [ ] I am signing on behalf of my employer or a legal entity and I have the
+actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Alan Chan |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 2020-03-15 |
+| GitHub username | pinealan |
+| Website (optional) | http://pinealan.xyz |
106
.github/contributors/sloev.md
vendored
Normal file
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+* you hereby assign to us joint ownership, and to the extent that such
+assignment is or becomes invalid, ineffective or unenforceable, you hereby
+grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+royalty-free, unrestricted license to exercise all rights under those
+copyrights. This includes, at our option, the right to sublicense these same
+rights to third parties through multiple levels of sublicensees or other
+licensing arrangements;
+
+* you agree that each of us can do all things in relation to your
+contribution as if each of us were the sole owners, and if one of us makes
+a derivative work of your contribution, the one who makes the derivative
+work (or has it made will be the sole owner of that derivative work;
+
+* you agree that you will not assert any moral rights in your contribution
+against us, our licensees or transferees;
+
+* you agree that we may register a copyright in your contribution and
+exercise all ownership rights associated with it; and
+
+* you agree that neither of us has any duty to consult with, obtain the
+consent of, pay or render an accounting to the other for any use or
+distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+* make, have made, use, sell, offer to sell, import, and otherwise transfer
+your contribution in whole or in part, alone or in combination with or
+included in any product, work or materials arising out of the project to
+which your contribution was submitted, and
+
+* at our option, to sublicense these same rights to third parties through
+multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+* Each contribution that you submit is and shall be an original work of
+authorship and you can legally grant the rights set out in this SCA;
+
+* to the best of your knowledge, each contribution will not violate any
+third party's copyrights, trademarks, patents, or other intellectual
+property rights; and
+
+* each contribution shall be in compliance with U.S. export control laws and
+other applicable export and import laws. You agree to notify us if you
+become aware of any circumstance which would make any of the foregoing
+representations inaccurate in any respect. We may publicly disclose your
+participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+* [x] I am signing on behalf of myself as an individual and no other person
+or entity, including my employer, has or will have rights with respect to my
+contributions.
+
+* [ ] I am signing on behalf of my employer or a legal entity and I have the
+actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | ------------------------ |
+| Name | Johannes Valbjørn |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 2020-03-13 |
+| GitHub username | sloev |
+| Website (optional) | https://sloev.github.io |
2
.gitignore
vendored
@@ -46,6 +46,7 @@ __pycache__/
 .venv
 env3.6/
 venv/
+env3.*/
 .dev
 .denv
 .pypyenv
@@ -62,6 +63,7 @@ lib64/
 parts/
 sdist/
 var/
+wheelhouse/
 *.egg-info/
 pip-wheel-metadata/
 Pipfile.lock
2
LICENSE
@@ -1,6 +1,6 @@
 The MIT License (MIT)
 
-Copyright (C) 2016-2019 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
+Copyright (C) 2016-2020 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
47
Makefile
@@ -1,28 +1,37 @@
 SHELL := /bin/bash
-sha = $(shell "git" "rev-parse" "--short" "HEAD")
-version = $(shell "bin/get-version.sh")
-wheel = spacy-$(version)-cp36-cp36m-linux_x86_64.whl
+PYVER := 3.6
+VENV := ./env$(PYVER)
 
-dist/spacy.pex : dist/spacy-$(sha).pex
-	cp dist/spacy-$(sha).pex dist/spacy.pex
-	chmod a+rx dist/spacy.pex
+version := $(shell "bin/get-version.sh")
 
-dist/spacy-$(sha).pex : dist/$(wheel)
-	env3.6/bin/python -m pip install pex==1.5.3
-	env3.6/bin/pex pytest dist/$(wheel) spacy_lookups_data -e spacy -o dist/spacy-$(sha).pex
+dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
+	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) jsonschema spacy_lookups_data
+	chmod a+rx $@
 
-dist/$(wheel) : setup.py spacy/*.py* spacy/*/*.py*
-	python3.6 -m venv env3.6
-	source env3.6/bin/activate
-	env3.6/bin/pip install wheel
-	env3.6/bin/pip install -r requirements.txt --no-cache-dir
-	env3.6/bin/python setup.py build_ext --inplace
-	env3.6/bin/python setup.py sdist
-	env3.6/bin/python setup.py bdist_wheel
+dist/pytest.pex : wheelhouse/pytest-*.whl
+	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock
+	chmod a+rx $@
 
-.PHONY : clean
+wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
+	$(VENV)/bin/pip wheel . -w ./wheelhouse
+	$(VENV)/bin/pip wheel jsonschema spacy_lookups_data -w ./wheelhouse
+	touch $@
+
+wheelhouse/pytest-%.whl : $(VENV)/bin/pex
+	$(VENV)/bin/pip wheel pytest pytest-timeout mock -w ./wheelhouse
+
+$(VENV)/bin/pex :
+	python$(PYVER) -m venv $(VENV)
+	$(VENV)/bin/pip install -U pip setuptools pex wheel
+
+.PHONY : clean test
+
+test : dist/spacy-$(version).pex dist/pytest.pex
+	( . $(VENV)/bin/activate ; \
+	PEX_PATH=dist/spacy-$(version).pex ./dist/pytest.pex --pyargs spacy -x ; )
 
 clean : setup.py
-	source env3.6/bin/activate
 	rm -rf dist/*
+	rm -rf ./wheelhouse
+	rm -rf $(VENV)
 	python setup.py clean --all
@@ -2,7 +2,7 @@
 
 ### Step 1: Create a Knowledge Base (KB) and training data
 
-Run `wikipedia_pretrain_kb.py`
+Run `wikidata_pretrain_kb.py`
 * This takes as input the locations of a **Wikipedia and a Wikidata dump**, and produces a **KB directory** + **training file**
 * WikiData: get `latest-all.json.bz2` from https://dumps.wikimedia.org/wikidatawiki/entities/
 * Wikipedia: get `enwiki-latest-pages-articles-multistream.xml.bz2` from https://dumps.wikimedia.org/enwiki/latest/ (or for any other language)
@@ -88,8 +88,8 @@ def read_text(bz2_loc, n=10000):
                 break
 
 
-def get_matches(tokenizer, phrases, texts, max_length=6):
-    matcher = PhraseMatcher(tokenizer.vocab, max_length=max_length)
+def get_matches(tokenizer, phrases, texts):
+    matcher = PhraseMatcher(tokenizer.vocab)
     matcher.add("Phrase", None, *phrases)
     for text in texts:
         doc = tokenizer(text)
@@ -59,7 +59,7 @@ install_requires =
 
 [options.extras_require]
 lookups =
-    spacy_lookups_data>=0.0.5<0.2.0
+    spacy_lookups_data>=0.0.5,<0.2.0
 cuda =
     cupy>=5.0.0b4
 cuda80 =
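The one-character fix in this hunk separates the two version clauses with a comma, which is how requirement strings list multiple constraints. The sketch below illustrates the difference with a deliberately simplified check; `CLAUSE`, `split_requirement` and `is_well_formed` are hypothetical helpers written for this note, and the pattern is only an approximation of the real version-specifier grammar, not what pip or setuptools actually run.

```python
import re

# Simplified clause pattern: one comparison operator followed by a version.
# Illustrative approximation only, not the full PEP 440 grammar.
CLAUSE = re.compile(r"^(~=|==|!=|<=|>=|<|>)\s*[0-9A-Za-z.*+!_-]+$")


def split_requirement(requirement):
    """Split 'name<specifiers>' into the name and its comma-separated clauses."""
    match = re.match(r"^([A-Za-z0-9_.-]+)\s*(.*)$", requirement)
    name, spec = match.group(1), match.group(2)
    clauses = [part.strip() for part in spec.split(",") if part.strip()]
    return name, clauses


def is_well_formed(requirement):
    """True if every clause is a single operator plus a version string."""
    _, clauses = split_requirement(requirement)
    return all(CLAUSE.match(clause) for clause in clauses)


old = "spacy_lookups_data>=0.0.5<0.2.0"   # before the fix: one fused clause
new = "spacy_lookups_data>=0.0.5,<0.2.0"  # after the fix: two clauses
print(is_well_formed(old), is_well_formed(new))
```

With the comma in place, the lower and upper bounds are read as two independent constraints on `spacy_lookups_data`.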
@@ -93,3 +93,5 @@ cdef enum attr_id_t:
     ENT_KB_ID = symbols.ENT_KB_ID
     MORPH
     ENT_ID = symbols.ENT_ID
+
+    IDX
@@ -89,6 +89,7 @@ IDS = {
     "PROB": PROB,
     "LANG": LANG,
     "MORPH": MORPH,
+    "IDX": IDX
 }
 
 
@@ -405,12 +405,10 @@ def train(
         losses=losses,
     )
 except ValueError as e:
-    msg.warn("Error during training")
+    err = "Error during training"
     if init_tok2vec:
-        msg.warn(
-            "Did you provide the same parameters during 'train' as during 'pretrain'?"
-        )
-    msg.fail(f"Original error message: {e}", exits=1)
+        err += " Did you provide the same parameters during 'train' as during 'pretrain'?"
+    msg.fail(err, f"Original error message: {e}", exits=1)
 if raw_text:
     # If raw text is available, perform 'rehearsal' updates,
     # which use unlabelled data to reduce overfitting.
@@ -545,7 +543,40 @@ def train(
 with nlp.use_params(optimizer.averages):
     final_model_path = output_path / "model-final"
     nlp.to_disk(final_model_path)
-    final_meta = srsly.read_json(output_path / "model-final" / "meta.json")
+    meta_loc = output_path / "model-final" / "meta.json"
+    final_meta = srsly.read_json(meta_loc)
+    final_meta.setdefault("accuracy", {})
+    final_meta["accuracy"].update(meta.get("accuracy", {}))
+    final_meta.setdefault("speed", {})
+    final_meta["speed"].setdefault("cpu", None)
+    final_meta["speed"].setdefault("gpu", None)
+    meta.setdefault("speed", {})
+    meta["speed"].setdefault("cpu", None)
+    meta["speed"].setdefault("gpu", None)
+    # combine cpu and gpu speeds with the base model speeds
+    if final_meta["speed"]["cpu"] and meta["speed"]["cpu"]:
+        speed = _get_total_speed(
+            [final_meta["speed"]["cpu"], meta["speed"]["cpu"]]
+        )
+        final_meta["speed"]["cpu"] = speed
+    if final_meta["speed"]["gpu"] and meta["speed"]["gpu"]:
+        speed = _get_total_speed(
+            [final_meta["speed"]["gpu"], meta["speed"]["gpu"]]
+        )
+        final_meta["speed"]["gpu"] = speed
+    # if there were no speeds to update, overwrite with meta
+    if (
+        final_meta["speed"]["cpu"] is None
+        and final_meta["speed"]["gpu"] is None
+    ):
+        final_meta["speed"].update(meta["speed"])
+    # note: beam speeds are not combined with the base model
+    if has_beam_widths:
+        final_meta.setdefault("beam_accuracy", {})
+        final_meta["beam_accuracy"].update(meta.get("beam_accuracy", {}))
+        final_meta.setdefault("beam_speed", {})
+        final_meta["beam_speed"].update(meta.get("beam_speed", {}))
+    srsly.write_json(meta_loc, final_meta)
 msg.good("Saved model to output directory", final_model_path)
 with msg.loading("Creating best model..."):
     best_model_path = _collate_best_model(final_meta, output_path, best_pipes)
||||||
|
@ -630,6 +661,8 @@ def _find_best(experiment_dir, component):
|
||||||
if epoch_model.is_dir() and epoch_model.parts[-1] != "model-final":
|
if epoch_model.is_dir() and epoch_model.parts[-1] != "model-final":
|
||||||
accs = srsly.read_json(epoch_model / "accuracy.json")
|
accs = srsly.read_json(epoch_model / "accuracy.json")
|
||||||
scores = [accs.get(metric, 0.0) for metric in _get_metrics(component)]
|
scores = [accs.get(metric, 0.0) for metric in _get_metrics(component)]
|
||||||
|
# remove per_type dicts from score list for max() comparison
|
||||||
|
scores = [score for score in scores if isinstance(score, float)]
|
||||||
accuracies.append((scores, epoch_model))
|
accuracies.append((scores, epoch_model))
|
||||||
if accuracies:
|
if accuracies:
|
||||||
return max(accuracies)[1]
|
return max(accuracies)[1]
|
||||||
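Note on the `_find_best` hunk above: the two added lines drop non-float entries (the `*_per_type` dicts) before `max()` compares the score lists. Below is a minimal sketch of why that matters. It assumes per-type scores are stored as dicts, as the comment in the diff says. The helper name `comparable_scores`, the model names, and the sample numbers are hypothetical and used only for illustration.

```python
def comparable_scores(accs, metrics):
    # Same two steps as in the diff: look up each metric, then keep only floats.
    scores = [accs.get(metric, 0.0) for metric in metrics]
    return [score for score in scores if isinstance(score, float)]


ner_metrics = ("ents_f", "ents_p", "ents_r", "ents_per_type", "token_acc")

# Made-up accuracy dicts for two epochs; they differ only in the last metric.
epoch_a = {
    "ents_f": 80.0,
    "ents_p": 82.0,
    "ents_r": 78.0,
    "ents_per_type": {"ORG": {"f": 70.0}},
    "token_acc": 99.0,
}
epoch_b = dict(epoch_a, ents_per_type={"ORG": {"f": 75.0}}, token_acc=99.5)

accuracies = [
    (comparable_scores(epoch_a, ner_metrics), "model0"),
    (comparable_scores(epoch_b, ner_metrics), "model1"),
]
# Lists compare element by element, so the first differing float decides.
best = max(accuracies)[1]
```

Without the filter, the comparison would reach the two `ents_per_type` dicts once the earlier floats tie, and Python 3 raises `TypeError` when ordering dicts. The `isinstance(score, float)` check would also drop an integer score, if one were ever stored.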
|
@ -641,13 +674,13 @@ def _get_metrics(component):
|
||||||
if component == "parser":
|
if component == "parser":
|
||||||
return ("las", "uas", "las_per_type", "token_acc", "sent_f")
|
return ("las", "uas", "las_per_type", "token_acc", "sent_f")
|
||||||
elif component == "tagger":
|
elif component == "tagger":
|
||||||
return ("tags_acc",)
|
return ("tags_acc", "token_acc")
|
||||||
elif component == "ner":
|
elif component == "ner":
|
||||||
return ("ents_f", "ents_p", "ents_r", "ents_per_type")
|
return ("ents_f", "ents_p", "ents_r", "ents_per_type", "token_acc")
|
||||||
elif component == "senter":
|
elif component == "senter":
|
||||||
return ("sent_f", "sent_p", "sent_r")
|
return ("sent_f", "sent_p", "sent_r")
|
||||||
elif component == "textcat":
|
elif component == "textcat":
|
||||||
return ("textcat_score",)
|
return ("textcat_score", "token_acc")
|
||||||
return ("token_acc",)
|
return ("token_acc",)
|
||||||
|
|
||||||
|
|
||||||
|
@ -714,3 +747,12 @@ def _get_progress(
|
||||||
if beam_width is not None:
|
if beam_width is not None:
|
||||||
result.insert(1, beam_width)
|
result.insert(1, beam_width)
|
||||||
return result
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
def _get_total_speed(speeds):
|
||||||
|
seconds_per_word = 0.0
|
||||||
|
for words_per_second in speeds:
|
||||||
|
if words_per_second is None:
|
||||||
|
return None
|
||||||
|
seconds_per_word += 1.0 / words_per_second
|
||||||
|
return 1.0 / seconds_per_word
|
||||||
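Note on `_get_total_speed` above: each speed is in words per second, and the helper sums the reciprocals (seconds per word) before inverting again, so the combined figure is always lower than the slowest input. The sketch below restates the helper so it runs on its own. The sample speeds are made up for illustration.

```python
def _get_total_speed(speeds):
    # Same logic as the helper in the diff: add up seconds-per-word,
    # then invert back to words-per-second.
    seconds_per_word = 0.0
    for words_per_second in speeds:
        if words_per_second is None:
            return None
        seconds_per_word += 1.0 / words_per_second
    return 1.0 / seconds_per_word


# Hypothetical speeds: base model at 10000 w/s, new components at 5000 w/s.
combined = _get_total_speed([10000.0, 5000.0])
# A missing measurement makes the combined speed undefined.
missing = _get_total_speed([10000.0, None])
```

In the `train` hunk the helper is only called when both the final model's speed and the base model's speed are truthy, so the `None` branch acts as a safeguard rather than the normal path.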
|
|
|
@ -142,10 +142,17 @@ def parse_deps(orig_doc, options={}):
|
||||||
for span, tag, lemma, ent_type in spans:
|
for span, tag, lemma, ent_type in spans:
|
||||||
attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
|
attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
|
||||||
retokenizer.merge(span, attrs=attrs)
|
retokenizer.merge(span, attrs=attrs)
|
||||||
if options.get("fine_grained"):
|
fine_grained = options.get("fine_grained")
|
||||||
words = [{"text": w.text, "tag": w.tag_} for w in doc]
|
add_lemma = options.get("add_lemma")
|
||||||
else:
|
words = [
|
||||||
words = [{"text": w.text, "tag": w.pos_} for w in doc]
|
{
|
||||||
|
"text": w.text,
|
||||||
|
"tag": w.tag_ if fine_grained else w.pos_,
|
||||||
|
"lemma": w.lemma_ if add_lemma else None,
|
||||||
|
}
|
||||||
|
for w in doc
|
||||||
|
]
|
||||||
|
|
||||||
arcs = []
|
arcs = []
|
||||||
for word in doc:
|
for word in doc:
|
||||||
if word.i < word.head.i:
|
if word.i < word.head.i:
|
||||||
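Note on the `parse_deps` hunk above: the rewritten list comprehension always emits a `"lemma"` key, set to `None` unless the `add_lemma` option is truthy. The sketch below shows the resulting dicts without needing spaCy installed. The `Token` namedtuple is a hypothetical stand-in for a real token, and `build_words` is an illustrative wrapper around the comprehension from the diff.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token; only the attributes used below.
Token = namedtuple("Token", ["text", "tag_", "pos_", "lemma_"])


def build_words(doc, options):
    # Mirrors the comprehension in the diff.
    fine_grained = options.get("fine_grained")
    add_lemma = options.get("add_lemma")
    return [
        {
            "text": w.text,
            "tag": w.tag_ if fine_grained else w.pos_,
            "lemma": w.lemma_ if add_lemma else None,
        }
        for w in doc
    ]


doc = [Token("cats", "NNS", "NOUN", "cat"), Token("sleep", "VBP", "VERB", "sleep")]
plain = build_words(doc, {})
with_lemma = build_words(doc, {"add_lemma": True, "fine_grained": True})
```

Because the renderer reads the value with `w.get("lemma", None)`, word dicts built by older callers without a `"lemma"` key should still render through the original template.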
|
|
|
@ -1,6 +1,12 @@
|
||||||
import uuid
|
import uuid
|
||||||
|
|
||||||
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
|
from .templates import (
|
||||||
|
TPL_DEP_SVG,
|
||||||
|
TPL_DEP_WORDS,
|
||||||
|
TPL_DEP_WORDS_LEMMA,
|
||||||
|
TPL_DEP_ARCS,
|
||||||
|
TPL_ENTS,
|
||||||
|
)
|
||||||
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
|
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
|
||||||
from ..util import minify_html, escape_html, registry
|
from ..util import minify_html, escape_html, registry
|
||||||
from ..errors import Errors
|
from ..errors import Errors
|
||||||
|
@ -80,7 +86,10 @@ class DependencyRenderer(object):
|
||||||
self.width = self.offset_x + len(words) * self.distance
|
self.width = self.offset_x + len(words) * self.distance
|
||||||
self.height = self.offset_y + 3 * self.word_spacing
|
self.height = self.offset_y + 3 * self.word_spacing
|
||||||
self.id = render_id
|
self.id = render_id
|
||||||
words = [self.render_word(w["text"], w["tag"], i) for i, w in enumerate(words)]
|
words = [
|
||||||
|
self.render_word(w["text"], w["tag"], w.get("lemma", None), i)
|
||||||
|
for i, w in enumerate(words)
|
||||||
|
]
|
||||||
arcs = [
|
arcs = [
|
||||||
self.render_arrow(a["label"], a["start"], a["end"], a["dir"], i)
|
self.render_arrow(a["label"], a["start"], a["end"], a["dir"], i)
|
||||||
for i, a in enumerate(arcs)
|
for i, a in enumerate(arcs)
|
||||||
|
@ -98,7 +107,9 @@ class DependencyRenderer(object):
|
||||||
lang=self.lang,
|
lang=self.lang,
|
||||||
)
|
)
|
||||||
|
|
||||||
def render_word(self, text, tag, i):
|
def render_word(
|
||||||
|
self, text, tag, lemma, i,
|
||||||
|
):
|
||||||
"""Render individual word.
|
"""Render individual word.
|
||||||
|
|
||||||
text (unicode): Word text.
|
text (unicode): Word text.
|
||||||
|
@ -111,6 +122,10 @@ class DependencyRenderer(object):
|
||||||
if self.direction == "rtl":
|
if self.direction == "rtl":
|
||||||
x = self.width - x
|
x = self.width - x
|
||||||
html_text = escape_html(text)
|
html_text = escape_html(text)
|
||||||
|
if lemma is not None:
|
||||||
|
return TPL_DEP_WORDS_LEMMA.format(
|
||||||
|
text=html_text, tag=tag, lemma=lemma, x=x, y=y
|
||||||
|
)
|
||||||
return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y)
|
return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y)
|
||||||
|
|
||||||
def render_arrow(self, label, start, end, direction, i):
|
def render_arrow(self, label, start, end, direction, i):
|
||||||
|
|
|
@ -14,6 +14,15 @@ TPL_DEP_WORDS = """
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
TPL_DEP_WORDS_LEMMA = """
|
||||||
|
<text class="displacy-token" fill="currentColor" text-anchor="middle" y="{y}">
|
||||||
|
<tspan class="displacy-word" fill="currentColor" x="{x}">{text}</tspan>
|
||||||
|
<tspan class="displacy-lemma" dy="2em" fill="currentColor" x="{x}">{lemma}</tspan>
|
||||||
|
<tspan class="displacy-tag" dy="2em" fill="currentColor" x="{x}">{tag}</tspan>
|
||||||
|
</text>
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
TPL_DEP_ARCS = """
|
TPL_DEP_ARCS = """
|
||||||
<g class="displacy-arrow">
|
<g class="displacy-arrow">
|
||||||
<path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>
|
<path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>
|
||||||
|
|
|
@ -96,7 +96,10 @@ class Warnings(object):
|
||||||
W027 = ("Found a large training file of {size} bytes. Note that it may "
|
W027 = ("Found a large training file of {size} bytes. Note that it may "
|
||||||
"be more efficient to split your training data into multiple "
|
"be more efficient to split your training data into multiple "
|
||||||
"smaller JSON files instead.")
|
"smaller JSON files instead.")
|
||||||
W028 = ("Skipping unsupported morphological feature(s): {feature}. "
|
W028 = ("Doc.from_array was called with a vector of type '{type}', "
|
||||||
|
"but is expecting one of type 'uint64' instead. This may result "
|
||||||
|
"in problems with the vocab further on in the pipeline.")
|
||||||
|
W029 = ("Skipping unsupported morphological feature(s): {feature}. "
|
||||||
"Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or "
|
"Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or "
|
||||||
"string \"Field1=Value1,Value2|Field2=Value3\".")
|
"string \"Field1=Value1,Value2|Field2=Value3\".")
|
||||||
|
|
||||||
|
@ -531,6 +534,15 @@ class Errors(object):
|
||||||
E188 = ("Could not match the gold entity links to entities in the doc - "
|
E188 = ("Could not match the gold entity links to entities in the doc - "
|
||||||
"make sure the gold EL data refers to valid results of the "
|
"make sure the gold EL data refers to valid results of the "
|
||||||
"named entity recognizer in the `nlp` pipeline.")
|
"named entity recognizer in the `nlp` pipeline.")
|
||||||
|
E189 = ("Each argument to `get_doc` should be of equal length.")
|
||||||
|
E190 = ("Token head out of range in `Doc.from_array()` for token index "
|
||||||
|
"'{index}' with value '{value}' (equivalent to relative head "
|
||||||
|
"index: '{rel_head_index}'). The head indices should be relative "
|
||||||
|
"to the current token index rather than absolute indices in the "
|
||||||
|
"array.")
|
||||||
|
E191 = ("Invalid head: the head token must be from the same doc as the "
|
||||||
|
"token itself.")
|
||||||
|
|
||||||
# TODO: fix numbering after merging develop into master
|
# TODO: fix numbering after merging develop into master
|
||||||
E993 = ("The config for 'nlp' should include either a key 'name' to "
|
E993 = ("The config for 'nlp' should include either a key 'name' to "
|
||||||
"refer to an existing model by name or path, or a key 'lang' "
|
"refer to an existing model by name or path, or a key 'lang' "
|
||||||
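Note on E190 above: the message says head indices passed to `Doc.from_array()` should be relative to the current token index rather than absolute positions. The sketch below shows that conversion in plain Python. The function `to_relative_heads` is hypothetical and not part of spaCy.

```python
def to_relative_heads(absolute_heads):
    """Convert absolute head indices to offsets relative to each token."""
    n = len(absolute_heads)
    relative = []
    for i, head in enumerate(absolute_heads):
        # An absolute head outside the doc can never be valid.
        if not 0 <= head < n:
            raise ValueError(f"head {head} for token {i} is out of range")
        relative.append(head - i)
    return relative


# "She ate pizza": 'ate' (index 1) is the root and is its own head.
absolute = [1, 1, 1]
relative = to_relative_heads(absolute)
```

Here the root gets offset 0, the token before it gets +1 and the token after it gets -1.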
|
|
|
@ -66,6 +66,7 @@ for orth in [
|
||||||
"A/S",
|
"A/S",
|
||||||
"B.C.",
|
"B.C.",
|
||||||
"BK.",
|
"BK.",
|
||||||
|
"B.T.",
|
||||||
"Dr.",
|
"Dr.",
|
||||||
"Boul.",
|
"Boul.",
|
||||||
"Chr.",
|
"Chr.",
|
||||||
|
@ -75,6 +76,7 @@ for orth in [
|
||||||
"Hf.",
|
"Hf.",
|
||||||
"i/s",
|
"i/s",
|
||||||
"I/S",
|
"I/S",
|
||||||
|
"Inc.",
|
||||||
"Kprs.",
|
"Kprs.",
|
||||||
"L.A.",
|
"L.A.",
|
||||||
"Ll.",
|
"Ll.",
|
||||||
|
@ -145,6 +147,7 @@ for orth in [
|
||||||
"bygn.",
|
"bygn.",
|
||||||
"c/o",
|
"c/o",
|
||||||
"ca.",
|
"ca.",
|
||||||
|
"cm.",
|
||||||
"cand.",
|
"cand.",
|
||||||
"d.d.",
|
"d.d.",
|
||||||
"d.m.",
|
"d.m.",
|
||||||
|
@ -168,10 +171,12 @@ for orth in [
|
||||||
"dl.",
|
"dl.",
|
||||||
"do.",
|
"do.",
|
||||||
"dobb.",
|
"dobb.",
|
||||||
|
"dr.",
|
||||||
"dr.h.c",
|
"dr.h.c",
|
||||||
"dr.phil.",
|
"dr.phil.",
|
||||||
"ds.",
|
"ds.",
|
||||||
"dvs.",
|
"dvs.",
|
||||||
|
"d.v.s.",
|
||||||
"e.b.",
|
"e.b.",
|
||||||
"e.l.",
|
"e.l.",
|
||||||
"e.o.",
|
"e.o.",
|
||||||
|
@ -293,10 +298,14 @@ for orth in [
|
||||||
"kap.",
|
"kap.",
|
||||||
"kbh.",
|
"kbh.",
|
||||||
"kem.",
|
"kem.",
|
||||||
|
"kg.",
|
||||||
|
"kgs.",
|
||||||
"kgl.",
|
"kgl.",
|
||||||
"kl.",
|
"kl.",
|
||||||
"kld.",
|
"kld.",
|
||||||
|
"km.",
|
||||||
"km/t",
|
"km/t",
|
||||||
|
"km/t.",
|
||||||
"knsp.",
|
"knsp.",
|
||||||
"komm.",
|
"komm.",
|
||||||
"kons.",
|
"kons.",
|
||||||
|
@ -307,6 +316,7 @@ for orth in [
|
||||||
"kt.",
|
"kt.",
|
||||||
"ktr.",
|
"ktr.",
|
||||||
"kv.",
|
"kv.",
|
||||||
|
"kvm.",
|
||||||
"kvt.",
|
"kvt.",
|
||||||
"l.c.",
|
"l.c.",
|
||||||
"lab.",
|
"lab.",
|
||||||
|
@ -353,6 +363,7 @@ for orth in [
|
||||||
"nto.",
|
"nto.",
|
||||||
"nuv.",
|
"nuv.",
|
||||||
"o/m",
|
"o/m",
|
||||||
|
"o/m.",
|
||||||
"o.a.",
|
"o.a.",
|
||||||
"o.fl.",
|
"o.fl.",
|
||||||
"o.h.",
|
"o.h.",
|
||||||
|
@ -522,6 +533,7 @@ for orth in [
|
||||||
"vejl.",
|
"vejl.",
|
||||||
"vh.",
|
"vh.",
|
||||||
"vha.",
|
"vha.",
|
||||||
|
"vind.",
|
||||||
"vs.",
|
"vs.",
|
||||||
"vsa.",
|
"vsa.",
|
||||||
"vær.",
|
"vær.",
|
||||||
|
|
|
@ -1,5 +1,6 @@
|
||||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||||
from .norm_exceptions import NORM_EXCEPTIONS
|
from .norm_exceptions import NORM_EXCEPTIONS
|
||||||
|
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
|
||||||
from .punctuation import TOKENIZER_INFIXES
|
from .punctuation import TOKENIZER_INFIXES
|
||||||
from .tag_map import TAG_MAP
|
from .tag_map import TAG_MAP
|
||||||
from .stop_words import STOP_WORDS
|
from .stop_words import STOP_WORDS
|
||||||
|
@ -19,6 +20,8 @@ class GermanDefaults(Language.Defaults):
|
||||||
Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
|
Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
|
||||||
)
|
)
|
||||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||||
|
prefixes = TOKENIZER_PREFIXES
|
||||||
|
suffixes = TOKENIZER_SUFFIXES
|
||||||
infixes = TOKENIZER_INFIXES
|
infixes = TOKENIZER_INFIXES
|
||||||
tag_map = TAG_MAP
|
tag_map = TAG_MAP
|
||||||
stop_words = STOP_WORDS
|
stop_words = STOP_WORDS
|
||||||
|
|
|
@ -1,7 +1,29 @@
|
||||||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
|
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_PUNCT, LIST_QUOTES
|
||||||
|
from ..char_classes import CURRENCY, UNITS, PUNCT
|
||||||
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
||||||
|
from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES
|
||||||
|
|
||||||
|
|
||||||
|
_prefixes = ["``"] + BASE_TOKENIZER_PREFIXES
|
||||||
|
|
||||||
|
_suffixes = (
|
||||||
|
["''", "/"]
|
||||||
|
+ LIST_PUNCT
|
||||||
|
+ LIST_ELLIPSES
|
||||||
|
+ LIST_QUOTES
|
||||||
|
+ LIST_ICONS
|
||||||
|
+ [
|
||||||
|
r"(?<=[0-9])\+",
|
||||||
|
r"(?<=°[FfCcKk])\.",
|
||||||
|
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
|
||||||
|
r"(?<=[0-9])(?:{u})".format(u=UNITS),
|
||||||
|
r"(?<=[{al}{e}{p}(?:{q})])\.".format(
|
||||||
|
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
|
||||||
|
),
|
||||||
|
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
|
||||||
|
]
|
||||||
|
)
|
||||||
|
|
||||||
_quotes = CONCAT_QUOTES.replace("'", "")
|
_quotes = CONCAT_QUOTES.replace("'", "")
|
||||||
|
|
||||||
_infixes = (
|
_infixes = (
|
||||||
|
@ -12,6 +34,7 @@ _infixes = (
|
||||||
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
||||||
r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
|
r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
|
||||||
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||||
|
r"(?<=[0-9{a}])\/(?=[0-9{a}])".format(a=ALPHA),
|
||||||
r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
|
r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
|
||||||
r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
|
r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
|
||||||
r"(?<=[0-9])-(?=[0-9])",
|
r"(?<=[0-9])-(?=[0-9])",
|
||||||
|
@ -19,4 +42,6 @@ _infixes = (
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
|
TOKENIZER_PREFIXES = _prefixes
|
||||||
|
TOKENIZER_SUFFIXES = _suffixes
|
||||||
TOKENIZER_INFIXES = _infixes
|
TOKENIZER_INFIXES = _infixes
|
||||||
|
|
|
@ -157,6 +157,8 @@ for exc_data in [
|
||||||
|
|
||||||
|
|
||||||
for orth in [
|
for orth in [
|
||||||
|
"``",
|
||||||
|
"''",
|
||||||
"A.C.",
|
"A.C.",
|
||||||
"a.D.",
|
"a.D.",
|
||||||
"A.D.",
|
"A.D.",
|
||||||
|
@ -172,10 +174,13 @@ for orth in [
|
||||||
"biol.",
|
"biol.",
|
||||||
"Biol.",
|
"Biol.",
|
||||||
"ca.",
|
"ca.",
|
||||||
|
"CDU/CSU",
|
||||||
"Chr.",
|
"Chr.",
|
||||||
"Cie.",
|
"Cie.",
|
||||||
|
"c/o",
|
||||||
"co.",
|
"co.",
|
||||||
"Co.",
|
"Co.",
|
||||||
|
"d'",
|
||||||
"D.C.",
|
"D.C.",
|
||||||
"Dipl.-Ing.",
|
"Dipl.-Ing.",
|
||||||
"Dipl.",
|
"Dipl.",
|
||||||
|
@ -200,12 +205,18 @@ for orth in [
|
||||||
"i.G.",
|
"i.G.",
|
||||||
"i.Tr.",
|
"i.Tr.",
|
||||||
"i.V.",
|
"i.V.",
|
||||||
|
"I.",
|
||||||
|
"II.",
|
||||||
|
"III.",
|
||||||
|
"IV.",
|
||||||
|
"Inc.",
|
||||||
"Ing.",
|
"Ing.",
|
||||||
"jr.",
|
"jr.",
|
||||||
"Jr.",
|
"Jr.",
|
||||||
"jun.",
|
"jun.",
|
||||||
"jur.",
|
"jur.",
|
||||||
"K.O.",
|
"K.O.",
|
||||||
|
"L'",
|
||||||
"L.A.",
|
"L.A.",
|
||||||
"lat.",
|
"lat.",
|
||||||
"M.A.",
|
"M.A.",
|
||||||
|
|
30
spacy/lang/eu/__init__.py
Normal file
|
@ -0,0 +1,30 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from .stop_words import STOP_WORDS
|
||||||
|
from .lex_attrs import LEX_ATTRS
|
||||||
|
from .punctuation import TOKENIZER_SUFFIXES
|
||||||
|
from .tag_map import TAG_MAP
|
||||||
|
|
||||||
|
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||||
|
from ...language import Language
|
||||||
|
from ...attrs import LANG
|
||||||
|
|
||||||
|
|
||||||
|
class BasqueDefaults(Language.Defaults):
|
||||||
|
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||||
|
lex_attr_getters.update(LEX_ATTRS)
|
||||||
|
lex_attr_getters[LANG] = lambda text: "eu"
|
||||||
|
|
||||||
|
tokenizer_exceptions = BASE_EXCEPTIONS
|
||||||
|
tag_map = TAG_MAP
|
||||||
|
stop_words = STOP_WORDS
|
||||||
|
suffixes = TOKENIZER_SUFFIXES
|
||||||
|
|
||||||
|
|
||||||
|
class Basque(Language):
|
||||||
|
lang = "eu"
|
||||||
|
Defaults = BasqueDefaults
|
||||||
|
|
||||||
|
|
||||||
|
__all__ = ["Basque"]
|
14
spacy/lang/eu/examples.py
Normal file
|
@ -0,0 +1,14 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
"""
|
||||||
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
>>> from spacy.lang.eu.examples import sentences
|
||||||
|
>>> docs = nlp.pipe(sentences)
|
||||||
|
"""
|
||||||
|
|
||||||
|
sentences = [
|
||||||
|
"bilbon ko castinga egin da eta nik jakin ez zuetako inork egin al du edota parte hartu duen ezagunik ba al du",
|
||||||
|
"gaur telebistan entzunda denok martetik gatoz hortaz martzianoak gara beno nire ustez batzuk beste batzuk baino martzianoagoak dira",
|
||||||
|
]
|
79
spacy/lang/eu/lex_attrs.py
Normal file
|
@ -0,0 +1,79 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ...attrs import LIKE_NUM
|
||||||
|
|
||||||
|
# Source http://mylanguages.org/basque_numbers.php
|
||||||
|
|
||||||
|
|
||||||
|
_num_words = """
|
||||||
|
bat
|
||||||
|
bi
|
||||||
|
hiru
|
||||||
|
lau
|
||||||
|
bost
|
||||||
|
sei
|
||||||
|
zazpi
|
||||||
|
zortzi
|
||||||
|
bederatzi
|
||||||
|
hamar
|
||||||
|
hamaika
|
||||||
|
hamabi
|
||||||
|
hamahiru
|
||||||
|
hamalau
|
||||||
|
hamabost
|
||||||
|
hamasei
|
||||||
|
hamazazpi
|
||||||
|
hemezortzi
|
||||||
|
hemeretzi
|
||||||
|
hogei
|
||||||
|
ehun
|
||||||
|
mila
|
||||||
|
milioi
|
||||||
|
""".split()
|
||||||
|
|
||||||
|
# source https://www.google.com/intl/ur/inputtools/try/
|
||||||
|
|
||||||
|
_ordinal_words = """
|
||||||
|
lehen
|
||||||
|
bigarren
|
||||||
|
hirugarren
|
||||||
|
laugarren
|
||||||
|
bosgarren
|
||||||
|
seigarren
|
||||||
|
zazpigarren
|
||||||
|
zortzigarren
|
||||||
|
bederatzigarren
|
||||||
|
hamargarren
|
||||||
|
hamaikagarren
|
||||||
|
hamabigarren
|
||||||
|
hamahirugarren
|
||||||
|
hamalaugarren
|
||||||
|
hamabosgarren
|
||||||
|
hamaseigarren
|
||||||
|
hamazazpigarren
|
||||||
|
hamazortzigarren
|
||||||
|
hemeretzigarren
|
||||||
|
hogeigarren
|
||||||
|
behin
|
||||||
|
""".split()
|
||||||
|
|
||||||
|
|
||||||
|
def like_num(text):
|
||||||
|
if text.startswith(("+", "-", "±", "~")):
|
||||||
|
text = text[1:]
|
||||||
|
text = text.replace(",", "").replace(".", "")
|
||||||
|
if text.isdigit():
|
||||||
|
return True
|
||||||
|
if text.count("/") == 1:
|
||||||
|
num, denom = text.split("/")
|
||||||
|
if num.isdigit() and denom.isdigit():
|
||||||
|
return True
|
||||||
|
if text in _num_words:
|
||||||
|
return True
|
||||||
|
if text in _ordinal_words:
|
||||||
|
return True
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
LEX_ATTRS = {LIKE_NUM: like_num}
|
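Note on `like_num` above: the function strips one leading sign, removes `,` and `.` separators, then accepts digit strings, simple fractions, and words from the two lists. The sketch below is a self-contained copy with shortened word lists. The shortened lists and the sample inputs are chosen only for illustration.

```python
# Shortened, illustrative word lists; the lists in the diff are longer.
_num_words = {"bat", "bi", "hiru", "hamar", "ehun", "mila"}
_ordinal_words = {"lehen", "bigarren", "hirugarren"}


def like_num(text):
    # Drop a single leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands and decimal separators.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if text in _num_words:
        return True
    if text in _ordinal_words:
        return True
    return False


samples = ["1.000", "-3/4", "hamar", "bigarren", "Hamar", "etxe"]
checks = {text: like_num(text) for text in samples}
```

The lookup is case-sensitive as written, so `"Hamar"` is not recognised here. An entry in the word list only matches text with exactly the same casing.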
7
spacy/lang/eu/punctuation.py
Normal file
|
@ -0,0 +1,7 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ..punctuation import TOKENIZER_SUFFIXES
|
||||||
|
|
||||||
|
|
||||||
|
_suffixes = TOKENIZER_SUFFIXES
|
108
spacy/lang/eu/stop_words.py
Normal file
|
@ -0,0 +1,108 @@
|
||||||
|
# encoding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
# Source: https://github.com/stopwords-iso/stopwords-eu
|
||||||
|
# https://www.ranks.nl/stopwords/basque
|
||||||
|
# https://www.mustgo.com/worldlanguages/basque/
|
||||||
|
STOP_WORDS = set(
|
||||||
|
"""
|
||||||
|
al
|
||||||
|
anitz
|
||||||
|
arabera
|
||||||
|
asko
|
||||||
|
baina
|
||||||
|
bat
|
||||||
|
batean
|
||||||
|
batek
|
||||||
|
bati
|
||||||
|
batzuei
|
||||||
|
batzuek
|
||||||
|
batzuetan
|
||||||
|
batzuk
|
||||||
|
bera
|
||||||
|
beraiek
|
||||||
|
berau
|
||||||
|
berauek
|
||||||
|
bere
|
||||||
|
berori
|
||||||
|
beroriek
|
||||||
|
beste
|
||||||
|
bezala
|
||||||
|
da
|
||||||
|
dago
|
||||||
|
dira
|
||||||
|
ditu
|
||||||
|
du
|
||||||
|
dute
|
||||||
|
edo
|
||||||
|
egin
|
||||||
|
ere
|
||||||
|
eta
|
||||||
|
eurak
|
||||||
|
ez
|
||||||
|
gainera
|
||||||
|
gu
|
||||||
|
gutxi
|
||||||
|
guzti
|
||||||
|
haiei
|
||||||
|
haiek
|
||||||
|
haietan
|
||||||
|
hainbeste
|
||||||
|
hala
|
||||||
|
han
|
||||||
|
handik
|
||||||
|
hango
|
||||||
|
hara
|
||||||
|
hari
|
||||||
|
hark
|
||||||
|
hartan
|
||||||
|
hau
|
||||||
|
hauei
|
||||||
|
hauek
|
||||||
|
hauetan
|
||||||
|
hemen
|
||||||
|
hemendik
|
||||||
|
hemengo
|
||||||
|
hi
|
||||||
|
hona
|
||||||
|
honek
|
||||||
|
honela
|
||||||
|
honetan
|
||||||
|
honi
|
||||||
|
hor
|
||||||
|
hori
|
||||||
|
horiei
|
||||||
|
horiek
|
||||||
|
horietan
|
||||||
|
horko
|
||||||
|
horra
|
||||||
|
horrek
|
||||||
|
horrela
|
||||||
|
horretan
|
||||||
|
horri
|
||||||
|
hortik
|
||||||
|
hura
|
||||||
|
izan
|
||||||
|
ni
|
||||||
|
noiz
|
||||||
|
nola
|
||||||
|
non
|
||||||
|
nondik
|
||||||
|
nongo
|
||||||
|
nor
|
||||||
|
nora
|
||||||
|
ze
|
||||||
|
zein
|
||||||
|
zen
|
||||||
|
zenbait
|
||||||
|
zenbat
|
||||||
|
zer
|
||||||
|
zergatik
|
||||||
|
ziren
|
||||||
|
zituen
|
||||||
|
zu
|
||||||
|
zuek
|
||||||
|
zuen
|
||||||
|
zuten
|
||||||
|
""".split()
|
||||||
|
)
|
71
spacy/lang/eu/tag_map.py
Normal file
|
@ -0,0 +1,71 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
|
||||||
|
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON
|
||||||
|
|
||||||
|
TAG_MAP = {
|
||||||
|
".": {POS: PUNCT, "PunctType": "peri"},
|
||||||
|
",": {POS: PUNCT, "PunctType": "comm"},
|
||||||
|
"-LRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"},
|
||||||
|
"-RRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"},
|
||||||
|
"``": {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"},
|
||||||
|
'""': {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
|
||||||
|
"''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
|
||||||
|
":": {POS: PUNCT},
|
||||||
|
"$": {POS: SYM, "Other": {"SymType": "currency"}},
|
||||||
|
"#": {POS: SYM, "Other": {"SymType": "numbersign"}},
|
||||||
|
"AFX": {POS: ADJ, "Hyph": "yes"},
|
||||||
|
"CC": {POS: CCONJ, "ConjType": "coor"},
|
||||||
|
"CD": {POS: NUM, "NumType": "card"},
|
||||||
|
"DT": {POS: DET},
|
||||||
|
"EX": {POS: ADV, "AdvType": "ex"},
|
||||||
|
"FW": {POS: X, "Foreign": "yes"},
|
||||||
|
"HYPH": {POS: PUNCT, "PunctType": "dash"},
|
||||||
|
"IN": {POS: ADP},
|
||||||
|
"JJ": {POS: ADJ, "Degree": "pos"},
|
||||||
|
"JJR": {POS: ADJ, "Degree": "comp"},
|
||||||
|
"JJS": {POS: ADJ, "Degree": "sup"},
|
||||||
|
"LS": {POS: PUNCT, "NumType": "ord"},
|
||||||
|
"MD": {POS: VERB, "VerbType": "mod"},
|
||||||
|
"NIL": {POS: ""},
|
||||||
|
"NN": {POS: NOUN, "Number": "sing"},
|
||||||
|
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
|
||||||
|
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
|
||||||
|
"NNS": {POS: NOUN, "Number": "plur"},
|
||||||
|
"PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
|
||||||
|
"POS": {POS: PART, "Poss": "yes"},
|
||||||
|
"PRP": {POS: PRON, "PronType": "prs"},
|
||||||
|
"PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
|
||||||
|
"RB": {POS: ADV, "Degree": "pos"},
|
||||||
|
"RBR": {POS: ADV, "Degree": "comp"},
|
||||||
|
"RBS": {POS: ADV, "Degree": "sup"},
|
||||||
|
"RP": {POS: PART},
|
||||||
|
"SP": {POS: SPACE},
|
||||||
|
"SYM": {POS: SYM},
|
||||||
|
"TO": {POS: PART, "PartType": "inf", "VerbForm": "inf"},
|
||||||
|
"UH": {POS: INTJ},
|
||||||
|
"VB": {POS: VERB, "VerbForm": "inf"},
|
||||||
|
"VBD": {POS: VERB, "VerbForm": "fin", "Tense": "past"},
|
||||||
|
"VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
|
||||||
|
"VBN": {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"},
|
||||||
|
"VBP": {POS: VERB, "VerbForm": "fin", "Tense": "pres"},
|
||||||
|
"VBZ": {
|
||||||
|
POS: VERB,
|
||||||
|
"VerbForm": "fin",
|
||||||
|
"Tense": "pres",
|
||||||
|
"Number": "sing",
|
||||||
|
"Person": 3,
|
||||||
|
},
|
||||||
|
"WDT": {POS: ADJ, "PronType": "int|rel"},
|
||||||
|
"WP": {POS: NOUN, "PronType": "int|rel"},
|
||||||
|
"WP$": {POS: ADJ, "Poss": "yes", "PronType": "int|rel"},
|
||||||
|
"WRB": {POS: ADV, "PronType": "int|rel"},
|
||||||
|
"ADD": {POS: X},
|
||||||
|
"NFP": {POS: PUNCT},
|
||||||
|
"GW": {POS: X},
|
||||||
|
"XX": {POS: X},
|
||||||
|
"BES": {POS: VERB},
|
||||||
|
"HVS": {POS: VERB},
|
||||||
|
"_SP": {POS: SPACE},
|
||||||
|
}
|
|
@ -11,6 +11,7 @@ for exc_data in [
|
||||||
{ORTH: "alv.", LEMMA: "arvonlisävero"},
|
{ORTH: "alv.", LEMMA: "arvonlisävero"},
|
||||||
{ORTH: "ark.", LEMMA: "arkisin"},
|
{ORTH: "ark.", LEMMA: "arkisin"},
|
||||||
{ORTH: "as.", LEMMA: "asunto"},
|
{ORTH: "as.", LEMMA: "asunto"},
|
||||||
|
{ORTH: "eaa.", LEMMA: "ennen ajanlaskun alkua"},
|
||||||
{ORTH: "ed.", LEMMA: "edellinen"},
|
{ORTH: "ed.", LEMMA: "edellinen"},
|
||||||
{ORTH: "esim.", LEMMA: "esimerkki"},
|
{ORTH: "esim.", LEMMA: "esimerkki"},
|
||||||
{ORTH: "huom.", LEMMA: "huomautus"},
|
{ORTH: "huom.", LEMMA: "huomautus"},
|
||||||
|
@ -24,6 +25,7 @@ for exc_data in [
|
||||||
{ORTH: "läh.", LEMMA: "lähettäjä"},
|
{ORTH: "läh.", LEMMA: "lähettäjä"},
|
||||||
{ORTH: "miel.", LEMMA: "mieluummin"},
|
{ORTH: "miel.", LEMMA: "mieluummin"},
|
||||||
{ORTH: "milj.", LEMMA: "miljoona"},
|
{ORTH: "milj.", LEMMA: "miljoona"},
|
||||||
|
{ORTH: "Mm.", LEMMA: "muun muassa"},
|
||||||
{ORTH: "mm.", LEMMA: "muun muassa"},
|
{ORTH: "mm.", LEMMA: "muun muassa"},
|
||||||
{ORTH: "myöh.", LEMMA: "myöhempi"},
|
{ORTH: "myöh.", LEMMA: "myöhempi"},
|
||||||
{ORTH: "n.", LEMMA: "noin"},
|
{ORTH: "n.", LEMMA: "noin"},
|
||||||
|
|
|
@ -1,5 +1,6 @@
|
||||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, TOKEN_MATCH
|
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, TOKEN_MATCH
|
||||||
from .punctuation import TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
|
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
|
||||||
|
from .punctuation import TOKENIZER_SUFFIXES
|
||||||
from .tag_map import TAG_MAP
|
from .tag_map import TAG_MAP
|
||||||
from .stop_words import STOP_WORDS
|
from .stop_words import STOP_WORDS
|
||||||
from .lex_attrs import LEX_ATTRS
|
from .lex_attrs import LEX_ATTRS
|
||||||
|
@ -24,6 +25,7 @@ class FrenchDefaults(Language.Defaults):
|
||||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||||
tag_map = TAG_MAP
|
tag_map = TAG_MAP
|
||||||
stop_words = STOP_WORDS
|
stop_words = STOP_WORDS
|
||||||
|
prefixes = TOKENIZER_PREFIXES
|
||||||
infixes = TOKENIZER_INFIXES
|
infixes = TOKENIZER_INFIXES
|
||||||
suffixes = TOKENIZER_SUFFIXES
|
suffixes = TOKENIZER_SUFFIXES
|
||||||
token_match = TOKEN_MATCH
|
token_match = TOKEN_MATCH
|
||||||
|
|
|
@ -1,12 +1,23 @@
|
||||||
from ..punctuation import TOKENIZER_INFIXES
|
from ..punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
|
||||||
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
|
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
|
||||||
from ..char_classes import CONCAT_QUOTES, UNITS, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
from ..char_classes import CONCAT_QUOTES, UNITS, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
||||||
|
from ..char_classes import merge_chars
|
||||||
|
|
||||||
|
|
||||||
ELISION = " ' ’ ".strip().replace(" ", "").replace("\n", "")
|
ELISION = "' ’".replace(" ", "")
|
||||||
HYPHENS = r"- – — ‐ ‑".strip().replace(" ", "").replace("\n", "")
|
HYPHENS = r"- – — ‐ ‑".replace(" ", "")
|
||||||
|
_prefixes_elision = "d l n"
|
||||||
|
_prefixes_elision += " " + _prefixes_elision.upper()
|
||||||
|
_hyphen_suffixes = "ce clés elle en il ils je là moi nous on t vous"
|
||||||
|
_hyphen_suffixes += " " + _hyphen_suffixes.upper()
|
||||||
|
|
||||||
|
|
||||||
|
_prefixes = TOKENIZER_PREFIXES + [
|
||||||
|
r"(?:({pe})[{el}])(?=[{a}])".format(
|
||||||
|
a=ALPHA, el=ELISION, pe=merge_chars(_prefixes_elision)
|
||||||
|
)
|
||||||
|
]
|
||||||
|
|
||||||
_suffixes = (
|
_suffixes = (
|
||||||
LIST_PUNCT
|
LIST_PUNCT
|
||||||
+ LIST_ELLIPSES
|
+ LIST_ELLIPSES
|
||||||
|
@ -14,7 +25,6 @@ _suffixes = (
|
||||||
+ [
|
+ [
|
||||||
r"(?<=[0-9])\+",
|
r"(?<=[0-9])\+",
|
||||||
r"(?<=°[FfCcKk])\.", # °C. -> ["°C", "."]
|
r"(?<=°[FfCcKk])\.", # °C. -> ["°C", "."]
|
||||||
r"(?<=[0-9])°[FfCcKk]", # 4°C -> ["4", "°C"]
|
|
||||||
r"(?<=[0-9])%", # 4% -> ["4", "%"]
|
r"(?<=[0-9])%", # 4% -> ["4", "%"]
|
||||||
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
|
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
|
||||||
r"(?<=[0-9])(?:{u})".format(u=UNITS),
|
r"(?<=[0-9])(?:{u})".format(u=UNITS),
|
||||||
|
@ -22,14 +32,17 @@ _suffixes = (
|
||||||
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES
|
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES
|
||||||
),
|
),
|
||||||
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
|
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
|
||||||
|
r"(?<=[{a}])[{h}]({hs})".format(
|
||||||
|
a=ALPHA, h=HYPHENS, hs=merge_chars(_hyphen_suffixes)
|
||||||
|
),
|
||||||
]
|
]
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
_infixes = TOKENIZER_INFIXES + [
|
_infixes = TOKENIZER_INFIXES + [
|
||||||
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
|
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
|
TOKENIZER_PREFIXES = _prefixes
|
||||||
TOKENIZER_SUFFIXES = _suffixes
|
TOKENIZER_SUFFIXES = _suffixes
|
||||||
TOKENIZER_INFIXES = _infixes
|
TOKENIZER_INFIXES = _infixes
|
||||||
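Note on the French punctuation hunks above: the new prefix rule splits an elided `d`, `l` or `n` plus apostrophe off the following word, and the new suffix rule splits hyphenated clitics such as `-moi` or `-il` off the preceding word. The sketch below exercises both patterns with the standard `re` module. Several parts are assumptions made for illustration: `ALPHA` is a much smaller character set than spaCy's, `merge_chars` is a local stand-in that joins alternatives with `|`, and the trailing `$` mimics suffix patterns being anchored at the end of the string.

```python
import re

# Simplified stand-ins (assumptions), not the spaCy definitions.
ALPHA = "a-zA-Zàâçéèêëîïôûùüœ"
ELISION = "' ’".replace(" ", "")
HYPHENS = r"- – — ‐ ‑".replace(" ", "")


def merge_chars(chars):
    # Local stand-in: turn "d l n" into the alternation "d|l|n".
    return "|".join(chars.split())


_prefixes_elision = "d l n"
_prefixes_elision += " " + _prefixes_elision.upper()
_hyphen_suffixes = "ce clés elle en il ils je là moi nous on t vous"
_hyphen_suffixes += " " + _hyphen_suffixes.upper()

# Prefix rule from the diff: elided letter + apostrophe, followed by a letter.
prefix_re = re.compile(
    r"(?:({pe})[{el}])(?=[{a}])".format(
        a=ALPHA, el=ELISION, pe=merge_chars(_prefixes_elision)
    )
)
# Suffix rule from the diff, with "$" appended to anchor it at the end.
suffix_re = re.compile(
    r"(?<=[{a}])[{h}]({hs})$".format(
        a=ALPHA, h=HYPHENS, hs=merge_chars(_hyphen_suffixes)
    )
)

elided = prefix_re.match("l'avion")
clitic = suffix_re.search("dis-moi")
```

The lookahead and lookbehind keep the rules from firing on a bare apostrophe or hyphen, because both require a letter on the far side.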
|
|
|
@ -3,7 +3,7 @@ import re
|
||||||
from .punctuation import ELISION, HYPHENS
|
from .punctuation import ELISION, HYPHENS
|
||||||
from ..tokenizer_exceptions import URL_PATTERN
|
from ..tokenizer_exceptions import URL_PATTERN
|
||||||
from ..char_classes import ALPHA_LOWER, ALPHA
|
from ..char_classes import ALPHA_LOWER, ALPHA
|
||||||
from ...symbols import ORTH, LEMMA, TAG
|
from ...symbols import ORTH, LEMMA
|
||||||
|
|
||||||
# not using the large _tokenizer_exceptions_list by default as it slows down the tokenizer
|
# not using the large _tokenizer_exceptions_list by default as it slows down the tokenizer
|
||||||
# from ._tokenizer_exceptions_list import FR_BASE_EXCEPTIONS
|
# from ._tokenizer_exceptions_list import FR_BASE_EXCEPTIONS
|
||||||
|
@ -53,7 +53,28 @@ for exc_data in [
|
||||||
_exc[exc_data[ORTH]] = [exc_data]
|
_exc[exc_data[ORTH]] = [exc_data]
|
||||||
|
|
||||||
|
|
||||||
for orth in ["etc."]:
|
for orth in [
|
||||||
|
"après-midi",
|
||||||
|
"au-delà",
|
||||||
|
"au-dessus",
|
||||||
|
"celle-ci",
|
||||||
|
"celles-ci",
|
||||||
|
"celui-ci",
|
||||||
|
"cf.",
|
||||||
|
"ci-dessous",
|
||||||
|
"elle-même",
|
||||||
|
"en-dessous",
|
||||||
|
"etc.",
|
||||||
|
"jusque-là",
|
||||||
|
"lui-même",
|
||||||
|
"MM.",
|
||||||
|
"No.",
|
||||||
|
"peut-être",
|
||||||
|
"pp.",
|
||||||
|
"quelques-uns",
|
||||||
|
"rendez-vous",
|
||||||
|
"Vol.",
|
||||||
|
]:
|
||||||
_exc[orth] = [{ORTH: orth}]
|
_exc[orth] = [{ORTH: orth}]
|
||||||
|
|
||||||
|
|
||||||
|
@ -69,7 +90,7 @@ for verb, verb_lemma in [
|
||||||
for pronoun in ["elle", "il", "on"]:
|
for pronoun in ["elle", "il", "on"]:
|
||||||
token = f"{orth}-t-{pronoun}"
|
token = f"{orth}-t-{pronoun}"
|
||||||
_exc[token] = [
|
_exc[token] = [
|
||||||
{LEMMA: verb_lemma, ORTH: orth, TAG: "VERB"},
|
{LEMMA: verb_lemma, ORTH: orth}, # , TAG: "VERB"},
|
||||||
{LEMMA: "t", ORTH: "-t"},
|
{LEMMA: "t", ORTH: "-t"},
|
||||||
{LEMMA: pronoun, ORTH: "-" + pronoun},
|
{LEMMA: pronoun, ORTH: "-" + pronoun},
|
||||||
]
|
]
|
||||||
|
@ -78,7 +99,7 @@ for verb, verb_lemma in [("est", "être")]:
|
||||||
for orth in [verb, verb.title()]:
|
for orth in [verb, verb.title()]:
|
||||||
token = f"{orth}-ce"
|
token = f"{orth}-ce"
|
||||||
_exc[token] = [
|
_exc[token] = [
|
||||||
{LEMMA: verb_lemma, ORTH: orth, TAG: "VERB"},
|
{LEMMA: verb_lemma, ORTH: orth}, # , TAG: "VERB"},
|
||||||
{LEMMA: "ce", ORTH: "-ce"},
|
{LEMMA: "ce", ORTH: "-ce"},
|
||||||
]
|
]
|
||||||
|
|
||||||
|
@ -86,12 +107,29 @@ for verb, verb_lemma in [("est", "être")]:
|
||||||
for pre, pre_lemma in [("qu'", "que"), ("n'", "ne")]:
|
for pre, pre_lemma in [("qu'", "que"), ("n'", "ne")]:
|
||||||
for orth in [pre, pre.title()]:
|
for orth in [pre, pre.title()]:
|
||||||
_exc[f"{orth}est-ce"] = [
|
_exc[f"{orth}est-ce"] = [
|
||||||
{LEMMA: pre_lemma, ORTH: orth, TAG: "ADV"},
|
{LEMMA: pre_lemma, ORTH: orth},
|
||||||
{LEMMA: "être", ORTH: "est", TAG: "VERB"},
|
{LEMMA: "être", ORTH: "est"},
|
||||||
{LEMMA: "ce", ORTH: "-ce"},
|
{LEMMA: "ce", ORTH: "-ce"},
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
|
for verb, pronoun in [("est", "il"), ("EST", "IL")]:
|
||||||
|
token = "{}-{}".format(verb, pronoun)
|
||||||
|
_exc[token] = [
|
||||||
|
{LEMMA: "être", ORTH: verb},
|
||||||
|
{LEMMA: pronoun, ORTH: "-" + pronoun},
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
for s, verb, pronoun in [("s", "est", "il"), ("S", "EST", "IL")]:
|
||||||
|
token = "{}'{}-{}".format(s, verb, pronoun)
|
||||||
|
_exc[token] = [
|
||||||
|
{LEMMA: "se", ORTH: s + "'"},
|
||||||
|
{LEMMA: "être", ORTH: verb},
|
||||||
|
{LEMMA: pronoun, ORTH: "-" + pronoun},
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
_infixes_exc = []
|
_infixes_exc = []
|
||||||
orig_elision = "'"
|
orig_elision = "'"
|
||||||
orig_hyphen = "-"
|
orig_hyphen = "-"
|
||||||
|
|
|
@ -1,7 +1,7 @@
|
||||||
from .stop_words import STOP_WORDS
|
from .stop_words import STOP_WORDS
|
||||||
from .tag_map import TAG_MAP
|
from .tag_map import TAG_MAP
|
||||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||||
from .punctuation import TOKENIZER_INFIXES
|
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
|
||||||
|
|
||||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||||
from ..norm_exceptions import BASE_NORMS
|
from ..norm_exceptions import BASE_NORMS
|
||||||
|
@ -19,6 +19,7 @@ class ItalianDefaults(Language.Defaults):
|
||||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||||
stop_words = STOP_WORDS
|
stop_words = STOP_WORDS
|
||||||
tag_map = TAG_MAP
|
tag_map = TAG_MAP
|
||||||
|
prefixes = TOKENIZER_PREFIXES
|
||||||
infixes = TOKENIZER_INFIXES
|
infixes = TOKENIZER_INFIXES
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -1,12 +1,29 @@
|
||||||
from ..punctuation import TOKENIZER_INFIXES
|
from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES
|
||||||
from ..char_classes import ALPHA
|
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
|
||||||
|
from ..char_classes import ALPHA, HYPHENS, CONCAT_QUOTES
|
||||||
|
from ..char_classes import ALPHA_LOWER, ALPHA_UPPER
|
||||||
|
|
||||||
|
|
||||||
ELISION = " ' ’ ".strip().replace(" ", "")
|
ELISION = "'’"
|
||||||
|
|
||||||
|
|
||||||
_infixes = TOKENIZER_INFIXES + [
|
_prefixes = [r"'[0-9][0-9]", r"[0-9]+°"] + BASE_TOKENIZER_PREFIXES
|
||||||
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
|
|
||||||
|
|
||||||
|
_infixes = (
|
||||||
|
LIST_ELLIPSES
|
||||||
|
+ LIST_ICONS
|
||||||
|
+ [
|
||||||
|
r"(?<=[0-9])[+\-\*^](?=[0-9-])",
|
||||||
|
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
|
||||||
|
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
|
||||||
|
),
|
||||||
|
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||||
|
r"(?<=[{a}])(?:{h})(?=[{al}])".format(a=ALPHA, h=HYPHENS, al=ALPHA_LOWER),
|
||||||
|
r"(?<=[{a}0-9])[:<>=\/](?=[{a}])".format(a=ALPHA),
|
||||||
|
r"(?<=[{a}][{el}])(?=[{a}0-9\"])".format(a=ALPHA, el=ELISION),
|
||||||
]
|
]
|
||||||
|
)
|
||||||
|
|
||||||
|
TOKENIZER_PREFIXES = _prefixes
|
||||||
TOKENIZER_INFIXES = _infixes
|
TOKENIZER_INFIXES = _infixes
|
||||||
|
|
|
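The two Italian prefix patterns added above (`'[0-9][0-9]` and `[0-9]+°`) can be tried in isolation. The sketch below is not spaCy's own prefix compilation. It assumes that each prefix pattern is anchored at the start of the string with `^` and joined with `|`, and the helper name `leading_prefix` is hypothetical.

```python
import re

# The two patterns prepended to the base prefixes in the diff above.
ITALIAN_EXTRA_PREFIXES = [r"'[0-9][0-9]", r"[0-9]+°"]

# Assumption: prefixes are anchored at the start of the token text.
prefix_re = re.compile("|".join("^" + piece for piece in ITALIAN_EXTRA_PREFIXES))


def leading_prefix(text):
    """Return the matched prefix of `text`, or None if no pattern applies."""
    match = prefix_re.search(text)
    return match.group() if match else None


print(leading_prefix("'98"))       # abbreviated year
print(leading_prefix("10°piano"))  # ordinal written with a degree sign
print(leading_prefix("casa"))      # plain word, no prefix
```

Under these assumptions, an abbreviated year such as `'98` and an ordinal such as `10°` are recognised as a unit at the start of a token instead of having the apostrophe or the degree sign split off on its own.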
@ -1,5 +1,55 @@
|
||||||
from ...symbols import ORTH, LEMMA
|
from ...symbols import ORTH, LEMMA
|
||||||
|
|
||||||
_exc = {"po'": [{ORTH: "po'", LEMMA: "poco"}]}
|
_exc = {
|
||||||
|
"all'art.": [{ORTH: "all'"}, {ORTH: "art."}],
|
||||||
|
"dall'art.": [{ORTH: "dall'"}, {ORTH: "art."}],
|
||||||
|
"dell'art.": [{ORTH: "dell'"}, {ORTH: "art."}],
|
||||||
|
"L'art.": [{ORTH: "L'"}, {ORTH: "art."}],
|
||||||
|
"l'art.": [{ORTH: "l'"}, {ORTH: "art."}],
|
||||||
|
"nell'art.": [{ORTH: "nell'"}, {ORTH: "art."}],
|
||||||
|
"po'": [{ORTH: "po'", LEMMA: "poco"}],
|
||||||
|
"sett..": [{ORTH: "sett."}, {ORTH: "."}],
|
||||||
|
}
|
||||||
|
|
||||||
|
for orth in [
|
||||||
|
"..",
|
||||||
|
"....",
|
||||||
|
"al.",
|
||||||
|
"all-path",
|
||||||
|
"art.",
|
||||||
|
"Art.",
|
||||||
|
"artt.",
|
||||||
|
"att.",
|
||||||
|
"by-pass",
|
||||||
|
"c.d.",
|
||||||
|
"centro-sinistra",
|
||||||
|
"check-up",
|
||||||
|
"Civ.",
|
||||||
|
"cm.",
|
||||||
|
"Cod.",
|
||||||
|
"col.",
|
||||||
|
"Cost.",
|
||||||
|
"d.C.",
|
||||||
|
'de"',
|
||||||
|
"distr.",
|
||||||
|
"E'",
|
||||||
|
"ecc.",
|
||||||
|
"e-mail",
|
||||||
|
"e/o",
|
||||||
|
"etc.",
|
||||||
|
"Jr.",
|
||||||
|
"n°",
|
||||||
|
"nord-est",
|
||||||
|
"pag.",
|
||||||
|
"Proc.",
|
||||||
|
"prof.",
|
||||||
|
"sett.",
|
||||||
|
"s.p.a.",
|
||||||
|
"ss.",
|
||||||
|
"St.",
|
||||||
|
"tel.",
|
||||||
|
"week-end",
|
||||||
|
]:
|
||||||
|
_exc[orth] = [{ORTH: orth}]
|
||||||
|
|
||||||
TOKENIZER_EXCEPTIONS = _exc
|
TOKENIZER_EXCEPTIONS = _exc
|
||||||
|
|
|
@ -2,11 +2,13 @@ from ..char_classes import LIST_ELLIPSES, LIST_ICONS, ALPHA, ALPHA_LOWER, ALPHA_
|
||||||
|
|
||||||
ELISION = " ' ’ ".strip().replace(" ", "")
|
ELISION = " ' ’ ".strip().replace(" ", "")
|
||||||
|
|
||||||
|
abbrev = ("d", "D")
|
||||||
|
|
||||||
_infixes = (
|
_infixes = (
|
||||||
LIST_ELLIPSES
|
LIST_ELLIPSES
|
||||||
+ LIST_ICONS
|
+ LIST_ICONS
|
||||||
+ [
|
+ [
|
||||||
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION),
|
r"(?<=^[{ab}][{el}])(?=[{a}])".format(ab=abbrev, a=ALPHA, el=ELISION),
|
||||||
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
|
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
|
||||||
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
||||||
r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
|
r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
|
||||||
|
|
|
@ -7,6 +7,8 @@ _exc = {}
|
||||||
|
|
||||||
# translate / delete what is not necessary
|
# translate / delete what is not necessary
|
||||||
for exc_data in [
|
for exc_data in [
|
||||||
|
{ORTH: "’t", LEMMA: "et", NORM: "et"},
|
||||||
|
{ORTH: "’T", LEMMA: "et", NORM: "et"},
|
||||||
{ORTH: "'t", LEMMA: "et", NORM: "et"},
|
{ORTH: "'t", LEMMA: "et", NORM: "et"},
|
||||||
{ORTH: "'T", LEMMA: "et", NORM: "et"},
|
{ORTH: "'T", LEMMA: "et", NORM: "et"},
|
||||||
{ORTH: "wgl.", LEMMA: "wannechgelift", NORM: "wannechgelift"},
|
{ORTH: "wgl.", LEMMA: "wannechgelift", NORM: "wannechgelift"},
|
||||||
|
|
31
spacy/lang/lij/__init__.py
Normal file
|
@ -0,0 +1,31 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from .stop_words import STOP_WORDS
|
||||||
|
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||||
|
from .punctuation import TOKENIZER_INFIXES
|
||||||
|
|
||||||
|
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||||
|
from ..norm_exceptions import BASE_NORMS
|
||||||
|
from ...language import Language
|
||||||
|
from ...attrs import LANG, NORM
|
||||||
|
from ...util import update_exc, add_lookups
|
||||||
|
|
||||||
|
|
||||||
|
class LigurianDefaults(Language.Defaults):
|
||||||
|
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||||
|
lex_attr_getters[LANG] = lambda text: "lij"
|
||||||
|
lex_attr_getters[NORM] = add_lookups(
|
||||||
|
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
|
||||||
|
)
|
||||||
|
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||||
|
stop_words = STOP_WORDS
|
||||||
|
infixes = TOKENIZER_INFIXES
|
||||||
|
|
||||||
|
|
||||||
|
class Ligurian(Language):
|
||||||
|
lang = "lij"
|
||||||
|
Defaults = LigurianDefaults
|
||||||
|
|
||||||
|
|
||||||
|
__all__ = ["Ligurian"]
|
18
spacy/lang/lij/examples.py
Normal file
|
@ -0,0 +1,18 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
|
"""
|
||||||
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
>>> from spacy.lang.lij.examples import sentences
|
||||||
|
>>> docs = nlp.pipe(sentences)
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
sentences = [
|
||||||
|
"Sciusciâ e sciorbî no se peu.",
|
||||||
|
"Graçie di çetroin, che me son arrivæ.",
|
||||||
|
"Vegnime apreuvo, che ve fasso pescâ di òmmi.",
|
||||||
|
"Bella pe sempre l'ægua inta conchetta quande unn'agoggia d'ægua a se â trapaña.",
|
||||||
|
]
|
15
spacy/lang/lij/punctuation.py
Normal file
|
@ -0,0 +1,15 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ..punctuation import TOKENIZER_INFIXES
|
||||||
|
from ..char_classes import ALPHA
|
||||||
|
|
||||||
|
|
||||||
|
ELISION = " ' ’ ".strip().replace(" ", "").replace("\n", "")
|
||||||
|
|
||||||
|
|
||||||
|
_infixes = TOKENIZER_INFIXES + [
|
||||||
|
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
|
||||||
|
]
|
||||||
|
|
||||||
|
TOKENIZER_INFIXES = _infixes
|
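The Ligurian infix rule above is a zero-width pattern: it matches the position right after a letter plus an elision apostrophe, provided another letter follows. A minimal sketch of that behaviour with the standard `re` module is shown below. The `ALPHA` range here is a simplified stand-in (ASCII plus the Latin-1 letter block), not spaCy's `char_classes.ALPHA`, and `split_on_elision` is a hypothetical helper, not part of the tokenizer.

```python
import re

# Assumption: simplified letter class (ASCII + Latin-1 range, which also
# contains the two symbols × and ÷); spaCy's ALPHA is much broader.
ALPHA = "a-zA-Z\u00C0-\u00FF"
ELISION = " ' ’ ".strip().replace(" ", "").replace("\n", "")

elision_infix = re.compile(
    r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
)


def split_on_elision(word):
    """Split `word` at every position matched by the zero-width infix."""
    pieces = []
    start = 0
    for match in elision_infix.finditer(word):
        pieces.append(word[start:match.start()])
        start = match.start()
    pieces.append(word[start:])
    return pieces


print(split_on_elision("l'ægua"))
print(split_on_elision("unn’agoggia"))
print(split_on_elision("peu"))
```

Because the match has no width, the apostrophe stays attached to the elided word on the left (`l'`), and words without an elision are returned unchanged.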
43
spacy/lang/lij/stop_words.py
Normal file
|
@ -0,0 +1,43 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
|
STOP_WORDS = set(
|
||||||
|
"""
|
||||||
|
a à â a-a a-e a-i a-o aiva aloa an ancheu ancon apreuvo ascì atra atre atri atro avanti avei
|
||||||
|
|
||||||
|
bella belle belli bello ben
|
||||||
|
|
||||||
|
ch' che chì chi ciù co-a co-e co-i co-o comm' comme con cösa coscì cöse
|
||||||
|
|
||||||
|
d' da da-a da-e da-i da-o dapeu de delongo derê di do doe doî donde dòppo
|
||||||
|
|
||||||
|
é e ê ea ean emmo en ëse
|
||||||
|
|
||||||
|
fin fiña
|
||||||
|
|
||||||
|
gh' ghe guæei
|
||||||
|
|
||||||
|
i î in insemme int' inta inte inti into
|
||||||
|
|
||||||
|
l' lê lì lô
|
||||||
|
|
||||||
|
m' ma manco me megio meno mezo mi
|
||||||
|
|
||||||
|
na n' ne ni ninte nisciun nisciuña no
|
||||||
|
|
||||||
|
o ò ô oua
|
||||||
|
|
||||||
|
parte pe pe-a pe-i pe-e pe-o perché pittin pö primma pròpio
|
||||||
|
|
||||||
|
quæ quand' quande quarche quella quelle quelli quello
|
||||||
|
|
||||||
|
s' sce scê sci sciâ sciô sciù se segge seu sò solo son sott' sta stæta stæte stæti stæto ste sti sto
|
||||||
|
|
||||||
|
tanta tante tanti tanto te ti torna tra tròppo tutta tutte tutti tutto
|
||||||
|
|
||||||
|
un uña unn' unna
|
||||||
|
|
||||||
|
za zu
|
||||||
|
""".split()
|
||||||
|
)
|
52
spacy/lang/lij/tokenizer_exceptions.py
Normal file
|
@ -0,0 +1,52 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
from ...symbols import ORTH, LEMMA
|
||||||
|
|
||||||
|
_exc = {}
|
||||||
|
|
||||||
|
for raw, lemma in [
|
||||||
|
("a-a", "a-o"),
|
||||||
|
("a-e", "a-o"),
|
||||||
|
("a-o", "a-o"),
|
||||||
|
("a-i", "a-o"),
|
||||||
|
("co-a", "co-o"),
|
||||||
|
("co-e", "co-o"),
|
||||||
|
("co-i", "co-o"),
|
||||||
|
("co-o", "co-o"),
|
||||||
|
("da-a", "da-o"),
|
||||||
|
("da-e", "da-o"),
|
||||||
|
("da-i", "da-o"),
|
||||||
|
("da-o", "da-o"),
|
||||||
|
("pe-a", "pe-o"),
|
||||||
|
("pe-e", "pe-o"),
|
||||||
|
("pe-i", "pe-o"),
|
||||||
|
("pe-o", "pe-o"),
|
||||||
|
]:
|
||||||
|
for orth in [raw, raw.capitalize()]:
|
||||||
|
_exc[orth] = [{ORTH: orth, LEMMA: lemma}]
|
||||||
|
|
||||||
|
# Prefix + prepositions with à (e.g. "sott'a-o")
|
||||||
|
|
||||||
|
for prep, prep_lemma in [
|
||||||
|
("a-a", "a-o"),
|
||||||
|
("a-e", "a-o"),
|
||||||
|
("a-o", "a-o"),
|
||||||
|
("a-i", "a-o"),
|
||||||
|
]:
|
||||||
|
for prefix, prefix_lemma in [
|
||||||
|
("sott'", "sotta"),
|
||||||
|
("sott’", "sotta"),
|
||||||
|
("contr'", "contra"),
|
||||||
|
("contr’", "contra"),
|
||||||
|
("ch'", "che"),
|
||||||
|
("ch’", "che"),
|
||||||
|
("s'", "se"),
|
||||||
|
("s’", "se"),
|
||||||
|
]:
|
||||||
|
for prefix_orth in [prefix, prefix.capitalize()]:
|
||||||
|
_exc[prefix_orth + prep] = [
|
||||||
|
{ORTH: prefix_orth, LEMMA: prefix_lemma},
|
||||||
|
{ORTH: prep, LEMMA: prep_lemma},
|
||||||
|
]
|
||||||
|
|
||||||
|
TOKENIZER_EXCEPTIONS = _exc
|
|
@ -1,3 +1,4 @@
|
||||||
|
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
|
||||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||||
from .stop_words import STOP_WORDS
|
from .stop_words import STOP_WORDS
|
||||||
from .lex_attrs import LEX_ATTRS
|
from .lex_attrs import LEX_ATTRS
|
||||||
|
@ -23,7 +24,13 @@ class LithuanianDefaults(Language.Defaults):
|
||||||
)
|
)
|
||||||
lex_attr_getters.update(LEX_ATTRS)
|
lex_attr_getters.update(LEX_ATTRS)
|
||||||
|
|
||||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
infixes = TOKENIZER_INFIXES
|
||||||
|
suffixes = TOKENIZER_SUFFIXES
|
||||||
|
mod_base_exceptions = {
|
||||||
|
exc: val for exc, val in BASE_EXCEPTIONS.items() if not exc.endswith(".")
|
||||||
|
}
|
||||||
|
del mod_base_exceptions["8)"]
|
||||||
|
tokenizer_exceptions = update_exc(mod_base_exceptions, TOKENIZER_EXCEPTIONS)
|
||||||
stop_words = STOP_WORDS
|
stop_words = STOP_WORDS
|
||||||
tag_map = TAG_MAP
|
tag_map = TAG_MAP
|
||||||
morph_rules = MORPH_RULES
|
morph_rules = MORPH_RULES
|
||||||
|
|
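The Lithuanian defaults above no longer pass `BASE_EXCEPTIONS` through unchanged: every exception ending in a period is dropped, and the `"8)"` entry is deleted as well (the diff does not state the reason). A minimal sketch of that filtering step follows, using a small made-up dictionary in place of spaCy's real `BASE_EXCEPTIONS`. The helper `drop_period_exceptions` is hypothetical, and it uses `pop` rather than `del` so that a missing key does not raise.

```python
# Assumption: a tiny stand-in for spaCy's BASE_EXCEPTIONS.
BASE_EXCEPTIONS = {
    "a.": [{"orth": "a."}],
    "e.g.": [{"orth": "e.g."}],
    "8)": [{"orth": "8)"}],
    ":)": [{"orth": ":)"}],
}


def drop_period_exceptions(base, extra_removals=("8)",)):
    """Keep only exceptions that do not end in '.', minus explicit removals."""
    kept = {exc: val for exc, val in base.items() if not exc.endswith(".")}
    for key in extra_removals:
        kept.pop(key, None)  # tolerate keys that are already absent
    return kept


mod_base_exceptions = drop_period_exceptions(BASE_EXCEPTIONS)
print(sorted(mod_base_exceptions))
```

This fits the rest of the change: a bare period is added as a suffix and most of the abbreviation list is commented out, so abbreviations ending in a period are no longer protected by exceptions.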
29
spacy/lang/lt/punctuation.py
Normal file
|
@ -0,0 +1,29 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ..char_classes import LIST_ICONS, LIST_ELLIPSES
|
||||||
|
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA
|
||||||
|
from ..char_classes import HYPHENS
|
||||||
|
from ..punctuation import TOKENIZER_SUFFIXES
|
||||||
|
|
||||||
|
|
||||||
|
_infixes = (
|
||||||
|
LIST_ELLIPSES
|
||||||
|
+ LIST_ICONS
|
||||||
|
+ [
|
||||||
|
r"(?<=[0-9])[+\*^](?=[0-9-])",
|
||||||
|
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
|
||||||
|
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
|
||||||
|
),
|
||||||
|
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||||
|
r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
|
||||||
|
r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
|
||||||
|
]
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
_suffixes = [r"\."] + list(TOKENIZER_SUFFIXES)
|
||||||
|
|
||||||
|
|
||||||
|
TOKENIZER_INFIXES = _infixes
|
||||||
|
TOKENIZER_SUFFIXES = _suffixes
|
|
@ -3,262 +3,264 @@ from ...symbols import ORTH
|
||||||
_exc = {}
|
_exc = {}
|
||||||
|
|
||||||
for orth in [
|
for orth in [
|
||||||
"G.",
|
"n-tosios",
|
||||||
"J. E.",
|
"?!",
|
||||||
"J. Em.",
|
# "G.",
|
||||||
"J.E.",
|
# "J. E.",
|
||||||
"J.Em.",
|
# "J. Em.",
|
||||||
"K.",
|
# "J.E.",
|
||||||
"N.",
|
# "J.Em.",
|
||||||
"V.",
|
# "K.",
|
||||||
"Vt.",
|
# "N.",
|
||||||
"a.",
|
# "V.",
|
||||||
"a.k.",
|
# "Vt.",
|
||||||
"a.s.",
|
# "a.",
|
||||||
"adv.",
|
# "a.k.",
|
||||||
"akad.",
|
# "a.s.",
|
||||||
"aklg.",
|
# "adv.",
|
||||||
"akt.",
|
# "akad.",
|
||||||
"al.",
|
# "aklg.",
|
||||||
"ang.",
|
# "akt.",
|
||||||
"angl.",
|
# "al.",
|
||||||
"aps.",
|
# "ang.",
|
||||||
"apskr.",
|
# "angl.",
|
||||||
"apyg.",
|
# "aps.",
|
||||||
"arbat.",
|
# "apskr.",
|
||||||
"asist.",
|
# "apyg.",
|
||||||
"asm.",
|
# "arbat.",
|
||||||
"asm.k.",
|
# "asist.",
|
||||||
"asmv.",
|
# "asm.",
|
||||||
"atk.",
|
# "asm.k.",
|
||||||
"atsak.",
|
# "asmv.",
|
||||||
"atsisk.",
|
# "atk.",
|
||||||
"atsisk.sąsk.",
|
# "atsak.",
|
||||||
"atv.",
|
# "atsisk.",
|
||||||
"aut.",
|
# "atsisk.sąsk.",
|
||||||
"avd.",
|
# "atv.",
|
||||||
"b.k.",
|
# "aut.",
|
||||||
"baud.",
|
# "avd.",
|
||||||
"biol.",
|
# "b.k.",
|
||||||
"bkl.",
|
# "baud.",
|
||||||
"bot.",
|
# "biol.",
|
||||||
"bt.",
|
# "bkl.",
|
||||||
"buv.",
|
# "bot.",
|
||||||
"ch.",
|
# "bt.",
|
||||||
"chem.",
|
# "buv.",
|
||||||
"corp.",
|
# "ch.",
|
||||||
"d.",
|
# "chem.",
|
||||||
"dab.",
|
# "corp.",
|
||||||
"dail.",
|
# "d.",
|
||||||
"dek.",
|
# "dab.",
|
||||||
"deš.",
|
# "dail.",
|
||||||
"dir.",
|
# "dek.",
|
||||||
"dirig.",
|
# "deš.",
|
||||||
"doc.",
|
# "dir.",
|
||||||
"dol.",
|
# "dirig.",
|
||||||
"dr.",
|
# "doc.",
|
||||||
"drp.",
|
# "dol.",
|
||||||
"dvit.",
|
# "dr.",
|
||||||
"dėst.",
|
# "drp.",
|
||||||
"dš.",
|
# "dvit.",
|
||||||
"dž.",
|
# "dėst.",
|
||||||
"e.b.",
|
# "dš.",
|
||||||
"e.bankas",
|
# "dž.",
|
||||||
"e.p.",
|
# "e.b.",
|
||||||
"e.parašas",
|
# "e.bankas",
|
||||||
"e.paštas",
|
# "e.p.",
|
||||||
"e.v.",
|
# "e.parašas",
|
||||||
"e.valdžia",
|
# "e.paštas",
|
||||||
"egz.",
|
# "e.v.",
|
||||||
"eil.",
|
# "e.valdžia",
|
||||||
"ekon.",
|
# "egz.",
|
||||||
"el.",
|
# "eil.",
|
||||||
"el.bankas",
|
# "ekon.",
|
||||||
"el.p.",
|
# "el.",
|
||||||
"el.parašas",
|
# "el.bankas",
|
||||||
"el.paštas",
|
# "el.p.",
|
||||||
"el.valdžia",
|
# "el.parašas",
|
||||||
"etc.",
|
# "el.paštas",
|
||||||
"ež.",
|
# "el.valdžia",
|
||||||
"fak.",
|
# "etc.",
|
||||||
"faks.",
|
# "ež.",
|
||||||
"feat.",
|
# "fak.",
|
||||||
"filol.",
|
# "faks.",
|
||||||
"filos.",
|
# "feat.",
|
||||||
"g.",
|
# "filol.",
|
||||||
"gen.",
|
# "filos.",
|
||||||
"geol.",
|
# "g.",
|
||||||
"gerb.",
|
# "gen.",
|
||||||
"gim.",
|
# "geol.",
|
||||||
"gr.",
|
# "gerb.",
|
||||||
"gv.",
|
# "gim.",
|
||||||
"gyd.",
|
# "gr.",
|
||||||
"gyv.",
|
# "gv.",
|
||||||
"habil.",
|
# "gyd.",
|
||||||
"inc.",
|
# "gyv.",
|
||||||
"insp.",
|
# "habil.",
|
||||||
"inž.",
|
# "inc.",
|
||||||
"ir pan.",
|
# "insp.",
|
||||||
"ir t. t.",
|
# "inž.",
|
||||||
"isp.",
|
# "ir pan.",
|
||||||
"istor.",
|
# "ir t. t.",
|
||||||
"it.",
|
# "isp.",
|
||||||
"just.",
|
# "istor.",
|
||||||
"k.",
|
# "it.",
|
||||||
"k. a.",
|
# "just.",
|
||||||
"k.a.",
|
# "k.",
|
||||||
"kab.",
|
# "k. a.",
|
||||||
"kand.",
|
# "k.a.",
|
||||||
"kart.",
|
# "kab.",
|
||||||
"kat.",
|
# "kand.",
|
||||||
"ketv.",
|
# "kart.",
|
||||||
"kh.",
|
# "kat.",
|
||||||
"kl.",
|
# "ketv.",
|
||||||
"kln.",
|
# "kh.",
|
||||||
"km.",
|
# "kl.",
|
||||||
"kn.",
|
# "kln.",
|
||||||
"koresp.",
|
# "km.",
|
||||||
"kpt.",
|
# "kn.",
|
||||||
"kr.",
|
# "koresp.",
|
||||||
"kt.",
|
# "kpt.",
|
||||||
"kub.",
|
# "kr.",
|
||||||
"kun.",
|
# "kt.",
|
||||||
"kv.",
|
# "kub.",
|
||||||
"kyš.",
|
# "kun.",
|
||||||
"l. e. p.",
|
# "kv.",
|
||||||
"l.e.p.",
|
# "kyš.",
|
||||||
"lenk.",
|
# "l. e. p.",
|
||||||
"liet.",
|
# "l.e.p.",
|
||||||
"lot.",
|
# "lenk.",
|
||||||
"lt.",
|
# "liet.",
|
||||||
"ltd.",
|
# "lot.",
|
||||||
"ltn.",
|
# "lt.",
|
||||||
"m.",
|
# "ltd.",
|
||||||
"m.e..",
|
# "ltn.",
|
||||||
"m.m.",
|
# "m.",
|
||||||
"mat.",
|
# "m.e..",
|
||||||
"med.",
|
# "m.m.",
|
||||||
"mgnt.",
|
# "mat.",
|
||||||
"mgr.",
|
# "med.",
|
||||||
"min.",
|
# "mgnt.",
|
||||||
"mjr.",
|
# "mgr.",
|
||||||
"ml.",
|
# "min.",
|
||||||
"mln.",
|
# "mjr.",
|
||||||
"mlrd.",
|
# "ml.",
|
||||||
"mob.",
|
# "mln.",
|
||||||
"mok.",
|
# "mlrd.",
|
||||||
"moksl.",
|
# "mob.",
|
||||||
"mokyt.",
|
# "mok.",
|
||||||
"mot.",
|
# "moksl.",
|
||||||
"mr.",
|
# "mokyt.",
|
||||||
"mst.",
|
# "mot.",
|
||||||
"mstl.",
|
# "mr.",
|
||||||
"mėn.",
|
# "mst.",
|
||||||
"nkt.",
|
# "mstl.",
|
||||||
"no.",
|
# "mėn.",
|
||||||
"nr.",
|
# "nkt.",
|
||||||
"ntk.",
|
# "no.",
|
||||||
"nuotr.",
|
# "nr.",
|
||||||
"op.",
|
# "ntk.",
|
||||||
"org.",
|
# "nuotr.",
|
||||||
"orig.",
|
# "op.",
|
||||||
"p.",
|
# "org.",
|
||||||
"p.d.",
|
# "orig.",
|
||||||
"p.m.e.",
|
# "p.",
|
||||||
"p.s.",
|
# "p.d.",
|
||||||
"pab.",
|
# "p.m.e.",
|
||||||
"pan.",
|
# "p.s.",
|
||||||
"past.",
|
# "pab.",
|
||||||
"pav.",
|
# "pan.",
|
||||||
"pavad.",
|
# "past.",
|
||||||
"per.",
|
# "pav.",
|
||||||
"perd.",
|
# "pavad.",
|
||||||
"pirm.",
|
# "per.",
|
||||||
"pl.",
|
# "perd.",
|
||||||
"plg.",
|
# "pirm.",
|
||||||
"plk.",
|
# "pl.",
|
||||||
"pr.",
|
# "plg.",
|
||||||
"pr.Kr.",
|
# "plk.",
|
||||||
"pranc.",
|
# "pr.",
|
||||||
"proc.",
|
# "pr.Kr.",
|
||||||
"prof.",
|
# "pranc.",
|
||||||
"prom.",
|
# "proc.",
|
||||||
"prot.",
|
# "prof.",
|
||||||
"psl.",
|
# "prom.",
|
||||||
"pss.",
|
# "prot.",
|
||||||
"pvz.",
|
# "psl.",
|
||||||
"pšt.",
|
# "pss.",
|
||||||
"r.",
|
# "pvz.",
|
||||||
"raj.",
|
# "pšt.",
|
||||||
"red.",
|
# "r.",
|
||||||
"rez.",
|
# "raj.",
|
||||||
"rež.",
|
# "red.",
|
||||||
"rus.",
|
# "rez.",
|
||||||
"rš.",
|
# "rež.",
|
||||||
"s.",
|
# "rus.",
|
||||||
"sav.",
|
# "rš.",
|
||||||
"saviv.",
|
# "s.",
|
||||||
"sek.",
|
# "sav.",
|
||||||
"sekr.",
|
# "saviv.",
|
||||||
"sen.",
|
# "sek.",
|
||||||
"sh.",
|
# "sekr.",
|
||||||
"sk.",
|
# "sen.",
|
||||||
"skg.",
|
# "sh.",
|
||||||
"skv.",
|
# "sk.",
|
||||||
"skyr.",
|
# "skg.",
|
||||||
"sp.",
|
# "skv.",
|
||||||
"spec.",
|
# "skyr.",
|
||||||
"sr.",
|
# "sp.",
|
||||||
"st.",
|
# "spec.",
|
||||||
"str.",
|
# "sr.",
|
||||||
"stud.",
|
# "st.",
|
||||||
"sąs.",
|
# "str.",
|
||||||
"t.",
|
# "stud.",
|
||||||
"t. p.",
|
# "sąs.",
|
||||||
"t. y.",
|
# "t.",
|
||||||
"t.p.",
|
# "t. p.",
|
||||||
"t.t.",
|
# "t. y.",
|
||||||
"t.y.",
|
# "t.p.",
|
||||||
"techn.",
|
# "t.t.",
|
||||||
"tel.",
|
# "t.y.",
|
||||||
"teol.",
|
# "techn.",
|
||||||
"th.",
|
# "tel.",
|
||||||
"tir.",
|
# "teol.",
|
||||||
"trit.",
|
# "th.",
|
||||||
"trln.",
|
# "tir.",
|
||||||
"tšk.",
|
# "trit.",
|
||||||
"tūks.",
|
# "trln.",
|
||||||
"tūkst.",
|
# "tšk.",
|
||||||
"up.",
|
# "tūks.",
|
||||||
"upl.",
|
# "tūkst.",
|
||||||
"v.s.",
|
# "up.",
|
||||||
"vad.",
|
# "upl.",
|
||||||
"val.",
|
# "v.s.",
|
||||||
"valg.",
|
# "vad.",
|
||||||
"ved.",
|
# "val.",
|
||||||
"vert.",
|
# "valg.",
|
||||||
"vet.",
|
# "ved.",
|
||||||
"vid.",
|
# "vert.",
|
||||||
"virš.",
|
# "vet.",
|
||||||
"vlsč.",
|
# "vid.",
|
||||||
"vnt.",
|
# "virš.",
|
||||||
"vok.",
|
# "vlsč.",
|
||||||
"vs.",
|
# "vnt.",
|
||||||
"vtv.",
|
# "vok.",
|
||||||
"vv.",
|
# "vs.",
|
||||||
"vyr.",
|
# "vtv.",
|
||||||
"vyresn.",
|
# "vv.",
|
||||||
"zool.",
|
# "vyr.",
|
||||||
"Įn",
|
# "vyresn.",
|
||||||
"įl.",
|
# "zool.",
|
||||||
"š.m.",
|
# "Įn",
|
||||||
"šnek.",
|
# "įl.",
|
||||||
"šv.",
|
# "š.m.",
|
||||||
"švč.",
|
# "šnek.",
|
||||||
"ž.ū.",
|
# "šv.",
|
||||||
"žin.",
|
# "švč.",
|
||||||
"žml.",
|
# "ž.ū.",
|
||||||
"žr.",
|
# "žin.",
|
||||||
|
# "žml.",
|
||||||
|
# "žr.",
|
||||||
]:
|
]:
|
||||||
_exc[orth] = [{ORTH: orth}]
|
_exc[orth] = [{ORTH: orth}]
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,6 @@
|
||||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||||
|
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
|
||||||
|
from .punctuation import TOKENIZER_SUFFIXES
|
||||||
from .stop_words import STOP_WORDS
|
from .stop_words import STOP_WORDS
|
||||||
from .morph_rules import MORPH_RULES
|
from .morph_rules import MORPH_RULES
|
||||||
from .syntax_iterators import SYNTAX_ITERATORS
|
from .syntax_iterators import SYNTAX_ITERATORS
|
||||||
|
@ -18,6 +20,9 @@ class NorwegianDefaults(Language.Defaults):
|
||||||
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
|
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
|
||||||
)
|
)
|
||||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||||
|
prefixes = TOKENIZER_PREFIXES
|
||||||
|
infixes = TOKENIZER_INFIXES
|
||||||
|
suffixes = TOKENIZER_SUFFIXES
|
||||||
stop_words = STOP_WORDS
|
stop_words = STOP_WORDS
|
||||||
morph_rules = MORPH_RULES
|
morph_rules = MORPH_RULES
|
||||||
tag_map = TAG_MAP
|
tag_map = TAG_MAP
|
||||||
|
|
|
@ -1,13 +1,29 @@
|
||||||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
|
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_PUNCT, LIST_QUOTES
|
||||||
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
||||||
from ..punctuation import TOKENIZER_SUFFIXES
|
from ..char_classes import CURRENCY, PUNCT, UNITS, LIST_CURRENCY
|
||||||
|
|
||||||
# Punctuation stolen from Danish
|
|
||||||
|
# Punctuation adapted from Danish
|
||||||
_quotes = CONCAT_QUOTES.replace("'", "")
|
_quotes = CONCAT_QUOTES.replace("'", "")
|
||||||
|
_list_punct = [x for x in LIST_PUNCT if x != "#"]
|
||||||
|
_list_icons = [x for x in LIST_ICONS if x != "°"]
|
||||||
|
_list_icons = [x.replace("\\u00B0", "") for x in _list_icons]
|
||||||
|
_list_quotes = [x for x in LIST_QUOTES if x != "\\'"]
|
||||||
|
|
||||||
|
|
||||||
|
_prefixes = (
|
||||||
|
["§", "%", "=", "—", "–", r"\+(?![0-9])"]
|
||||||
|
+ _list_punct
|
||||||
|
+ LIST_ELLIPSES
|
||||||
|
+ LIST_QUOTES
|
||||||
|
+ LIST_CURRENCY
|
||||||
|
+ LIST_ICONS
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
_infixes = (
|
_infixes = (
|
||||||
LIST_ELLIPSES
|
LIST_ELLIPSES
|
||||||
+ LIST_ICONS
|
+ _list_icons
|
||||||
+ [
|
+ [
|
||||||
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
|
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
|
||||||
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
||||||
|
@ -18,13 +34,26 @@ _infixes = (
|
||||||
]
|
]
|
||||||
)
|
)
|
||||||
|
|
||||||
_suffixes = [
|
_suffixes = (
|
||||||
suffix
|
LIST_PUNCT
|
||||||
for suffix in TOKENIZER_SUFFIXES
|
+ LIST_ELLIPSES
|
||||||
if suffix not in ["'s", "'S", "’s", "’S", r"\'"]
|
+ _list_quotes
|
||||||
|
+ _list_icons
|
||||||
|
+ ["—", "–"]
|
||||||
|
+ [
|
||||||
|
r"(?<=[0-9])\+",
|
||||||
|
r"(?<=°[FfCcKk])\.",
|
||||||
|
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
|
||||||
|
r"(?<=[0-9])(?:{u})".format(u=UNITS),
|
||||||
|
r"(?<=[{al}{e}{p}(?:{q})])\.".format(
|
||||||
|
al=ALPHA_LOWER, e=r"%²\-\+", q=_quotes, p=PUNCT
|
||||||
|
),
|
||||||
|
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
|
||||||
]
|
]
|
||||||
_suffixes += [r"(?<=[^sSxXzZ])\'"]
|
+ [r"(?<=[^sSxXzZ])'"]
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
TOKENIZER_PREFIXES = _prefixes
|
||||||
TOKENIZER_INFIXES = _infixes
|
TOKENIZER_INFIXES = _infixes
|
||||||
TOKENIZER_SUFFIXES = _suffixes
|
TOKENIZER_SUFFIXES = _suffixes
|
||||||
|
|
|
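The last Norwegian suffix rule, `(?<=[^sSxXzZ])'`, splits off a trailing apostrophe unless the preceding character is `s`, `x`, or `z` in either case. The sketch below checks that single pattern with the standard `re` module. It assumes that suffix patterns are anchored at the end of the string with `$`, and `trailing_apostrophe` is a hypothetical helper, not a spaCy function.

```python
import re

# Assumption: suffix patterns are matched at the end of the token text.
apostrophe_suffix = re.compile(r"(?<=[^sSxXzZ])'" + "$")


def trailing_apostrophe(text):
    """Return "'" if the rule would split it off `text`, else None."""
    match = apostrophe_suffix.search(text)
    return match.group() if match else None


print(trailing_apostrophe("Hans'"))  # apostrophe after "s" stays attached
print(trailing_apostrophe("Max'"))   # apostrophe after "x" stays attached
print(trailing_apostrophe("dag'"))   # apostrophe after "g" is a suffix
```

The negative character class inside the lookbehind is what keeps forms such as `Hans'` together as one token, while an apostrophe after any other character is treated as ordinary closing punctuation.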
@ -21,57 +21,80 @@ for exc_data in [
|
||||||
|
|
||||||
|
|
||||||
for orth in [
|
for orth in [
|
||||||
"adm.dir.",
|
"Ap.",
|
||||||
"a.m.",
|
|
||||||
"andelsnr",
|
|
||||||
"Aq.",
|
"Aq.",
|
||||||
|
"Ca.",
|
||||||
|
"Chr.",
|
||||||
|
"Co.",
|
||||||
|
"Co.",
|
||||||
|
"Dr.",
|
||||||
|
"F.eks.",
|
||||||
|
"Fr.p.",
|
||||||
|
"Frp.",
|
||||||
|
"Grl.",
|
||||||
|
"Kr.",
|
||||||
|
"Kr.F.",
|
||||||
|
"Kr.F.s",
|
||||||
|
"Mr.",
|
||||||
|
"Mrs.",
|
||||||
|
"Pb.",
|
||||||
|
"Pr.",
|
||||||
|
"Sp.",
|
||||||
|
"Sp.",
|
||||||
|
"St.",
|
||||||
|
"a.m.",
|
||||||
|
"ad.",
|
||||||
|
"adm.dir.",
|
||||||
|
"andelsnr",
|
||||||
"b.c.",
|
"b.c.",
|
||||||
"bl.a.",
|
"bl.a.",
|
||||||
"bla.",
|
"bla.",
|
||||||
"bm.",
|
"bm.",
|
||||||
"bnr.",
|
"bnr.",
|
||||||
"bto.",
|
"bto.",
|
||||||
|
"c.c.",
|
||||||
"ca.",
|
"ca.",
|
||||||
"cand.mag.",
|
"cand.mag.",
|
||||||
"c.c.",
|
|
||||||
"co.",
|
"co.",
|
||||||
"d.d.",
|
"d.d.",
|
||||||
"dept.",
|
|
||||||
"d.m.",
|
"d.m.",
|
||||||
"dr.philos.",
|
|
||||||
"dvs.",
|
|
||||||
"d.y.",
|
"d.y.",
|
||||||
"E. coli",
|
"dept.",
|
||||||
|
"dr.",
|
||||||
|
"dr.med.",
|
||||||
|
"dr.philos.",
|
||||||
|
"dr.psychol.",
|
||||||
|
"dvs.",
|
||||||
|
"e.Kr.",
|
||||||
|
"e.l.",
|
||||||
"eg.",
|
"eg.",
|
||||||
"ekskl.",
|
"ekskl.",
|
||||||
"e.Kr.",
|
|
||||||
"el.",
|
"el.",
|
||||||
"e.l.",
|
|
||||||
"et.",
|
"et.",
|
||||||
"etc.",
|
"etc.",
|
||||||
"etg.",
|
"etg.",
|
||||||
"ev.",
|
"ev.",
|
||||||
"evt.",
|
"evt.",
|
||||||
"f.",
|
"f.",
|
||||||
|
"f.Kr.",
|
||||||
"f.eks.",
|
"f.eks.",
|
||||||
|
"f.o.m.",
|
||||||
"fhv.",
|
"fhv.",
|
||||||
"fk.",
|
"fk.",
|
||||||
"f.Kr.",
|
|
||||||
"f.o.m.",
|
|
||||||
"foreg.",
|
"foreg.",
|
||||||
"fork.",
|
"fork.",
|
||||||
"fv.",
|
"fv.",
|
||||||
"fvt.",
|
"fvt.",
|
||||||
"g.",
|
"g.",
|
||||||
"gt.",
|
|
||||||
"gl.",
|
"gl.",
|
||||||
"gno.",
|
"gno.",
|
||||||
"gnr.",
|
"gnr.",
|
||||||
"grl.",
|
"grl.",
|
||||||
|
"gt.",
|
||||||
|
"h.r.adv.",
|
||||||
"hhv.",
|
"hhv.",
|
||||||
"hoh.",
|
"hoh.",
|
||||||
"hr.",
|
"hr.",
|
||||||
"h.r.adv.",
|
|
||||||
"ifb.",
|
"ifb.",
|
||||||
"ifm.",
|
"ifm.",
|
||||||
"iht.",
|
"iht.",
|
||||||
|
@ -80,39 +103,45 @@ for orth in [
|
||||||
"jf.",
|
"jf.",
|
||||||
"jr.",
|
"jr.",
|
||||||
"jun.",
|
"jun.",
|
||||||
|
"juris.",
|
||||||
"kfr.",
|
"kfr.",
|
||||||
|
"kgl.",
|
||||||
"kgl.res.",
|
"kgl.res.",
|
||||||
"kl.",
|
"kl.",
|
||||||
"komm.",
|
"komm.",
|
||||||
"kr.",
|
"kr.",
|
||||||
"kst.",
|
"kst.",
|
||||||
|
"lat.",
|
||||||
"lø.",
|
"lø.",
|
||||||
|
"m.a.o.",
|
||||||
|
"m.fl.",
|
||||||
|
"m.m.",
|
||||||
|
"m.v.",
|
||||||
"ma.",
|
"ma.",
|
||||||
"mag.art.",
|
"mag.art.",
|
||||||
"m.a.o.",
|
|
||||||
"md.",
|
"md.",
|
||||||
"mfl.",
|
"mfl.",
|
||||||
|
"mht.",
|
||||||
"mill.",
|
"mill.",
|
||||||
"min.",
|
"min.",
|
||||||
"m.m.",
|
|
||||||
"mnd.",
|
"mnd.",
|
||||||
"moh.",
|
"moh.",
|
||||||
"Mr.",
|
"mrd.",
|
||||||
"muh.",
|
"muh.",
|
||||||
"mv.",
|
"mv.",
|
||||||
"mva.",
|
"mva.",
|
||||||
|
"n.å.",
|
||||||
"ndf.",
|
"ndf.",
|
||||||
"no.",
|
"no.",
|
||||||
"nov.",
|
"nov.",
|
||||||
"nr.",
|
"nr.",
|
||||||
"nto.",
|
"nto.",
|
||||||
"nyno.",
|
"nyno.",
|
||||||
"n.å.",
|
|
||||||
"o.a.",
|
"o.a.",
|
||||||
|
"o.l.",
|
||||||
"off.",
|
"off.",
|
||||||
"ofl.",
|
"ofl.",
|
||||||
"okt.",
|
"okt.",
|
||||||
"o.l.",
|
|
||||||
"on.",
|
"on.",
|
||||||
"op.",
|
"op.",
|
||||||
"org.",
|
"org.",
|
||||||
|
@ -120,14 +149,15 @@ for orth in [
|
||||||
"ovf.",
|
"ovf.",
|
||||||
"p.",
|
"p.",
|
||||||
"p.a.",
|
"p.a.",
|
||||||
"Pb.",
|
"p.g.a.",
|
||||||
|
"p.m.",
|
||||||
|
"p.t.",
|
||||||
"pga.",
|
"pga.",
|
||||||
"ph.d.",
|
"ph.d.",
|
||||||
"pkt.",
|
"pkt.",
|
||||||
"p.m.",
|
|
||||||
"pr.",
|
"pr.",
|
||||||
"pst.",
|
"pst.",
|
||||||
"p.t.",
|
"pt.",
|
||||||
"red.anm.",
|
"red.anm.",
|
||||||
"ref.",
|
"ref.",
|
||||||
"res.",
|
"res.",
|
||||||
|
@ -136,6 +166,10 @@ for orth in [
|
||||||
"rv.",
|
"rv.",
|
||||||
"s.",
|
"s.",
|
||||||
"s.d.",
|
"s.d.",
|
||||||
|
"s.k.",
|
||||||
|
"s.k.",
|
||||||
|
"s.u.",
|
||||||
|
"s.å.",
|
||||||
"sen.",
|
"sen.",
|
||||||
"sep.",
|
"sep.",
|
||||||
"siviling.",
|
"siviling.",
|
||||||
|
@ -145,16 +179,17 @@ for orth in [
|
||||||
"sr.",
|
"sr.",
|
||||||
"sst.",
|
"sst.",
|
||||||
"st.",
|
"st.",
|
||||||
"stip.",
|
|
||||||
"stk.",
|
|
||||||
"st.meld.",
|
"st.meld.",
|
||||||
"st.prp.",
|
"st.prp.",
|
||||||
|
"stip.",
|
||||||
|
"stk.",
|
||||||
"stud.",
|
"stud.",
|
||||||
"s.u.",
|
|
||||||
"sv.",
|
"sv.",
|
||||||
"sø.",
|
|
||||||
"s.å.",
|
|
||||||
"såk.",
|
"såk.",
|
||||||
|
"sø.",
|
||||||
|
"t.h.",
|
||||||
|
"t.o.m.",
|
||||||
|
"t.v.",
|
||||||
"temp.",
|
"temp.",
|
||||||
"ti.",
|
"ti.",
|
||||||
"tils.",
|
"tils.",
|
||||||
|
@ -162,7 +197,6 @@ for orth in [
|
||||||
"tl;dr",
|
"tl;dr",
|
||||||
"tlf.",
|
"tlf.",
|
||||||
"to.",
|
"to.",
|
||||||
"t.o.m.",
|
|
||||||
"ult.",
|
"ult.",
|
||||||
"utg.",
|
"utg.",
|
||||||
"v.",
|
"v.",
|
||||||
|
@ -176,8 +210,10 @@ for orth in [
|
||||||
"vol.",
|
"vol.",
|
||||||
"vs.",
|
"vs.",
|
||||||
"vsa.",
|
"vsa.",
|
||||||
|
"©NTB",
|
||||||
"årg.",
|
"årg.",
|
||||||
"årh.",
|
"årh.",
|
||||||
|
"§§",
|
||||||
]:
|
]:
|
||||||
_exc[orth] = [{ORTH: orth}]
|
_exc[orth] = [{ORTH: orth}]
|
||||||
|
|
||||||
|
|
|
@ -1,69 +1,47 @@
|
||||||
from ...symbols import ORTH, NORM
|
from ...symbols import ORTH
|
||||||
|
|
||||||
|
|
||||||
_exc = {
|
_exc = {}
|
||||||
"às": [{ORTH: "à", NORM: "a"}, {ORTH: "s", NORM: "as"}],
|
|
||||||
"ao": [{ORTH: "a"}, {ORTH: "o"}],
|
|
||||||
"aos": [{ORTH: "a"}, {ORTH: "os"}],
|
|
||||||
"àquele": [{ORTH: "à", NORM: "a"}, {ORTH: "quele", NORM: "aquele"}],
|
|
||||||
"àquela": [{ORTH: "à", NORM: "a"}, {ORTH: "quela", NORM: "aquela"}],
|
|
||||||
"àqueles": [{ORTH: "à", NORM: "a"}, {ORTH: "queles", NORM: "aqueles"}],
|
|
||||||
"àquelas": [{ORTH: "à", NORM: "a"}, {ORTH: "quelas", NORM: "aquelas"}],
|
|
||||||
"àquilo": [{ORTH: "à", NORM: "a"}, {ORTH: "quilo", NORM: "aquilo"}],
|
|
||||||
"aonde": [{ORTH: "a"}, {ORTH: "onde"}],
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
# Contractions
|
|
||||||
_per_pron = ["ele", "ela", "eles", "elas"]
|
|
||||||
_dem_pron = [
|
|
||||||
"este",
|
|
||||||
"esta",
|
|
||||||
"estes",
|
|
||||||
"estas",
|
|
||||||
"isto",
|
|
||||||
"esse",
|
|
||||||
"essa",
|
|
||||||
"esses",
|
|
||||||
"essas",
|
|
||||||
"isso",
|
|
||||||
"aquele",
|
|
||||||
"aquela",
|
|
||||||
"aqueles",
|
|
||||||
"aquelas",
|
|
||||||
"aquilo",
|
|
||||||
]
|
|
||||||
_und_pron = ["outro", "outra", "outros", "outras"]
|
|
||||||
_adv = ["aqui", "aí", "ali", "além"]
|
|
||||||
|
|
||||||
|
|
||||||
for orth in _per_pron + _dem_pron + _und_pron + _adv:
|
|
||||||
_exc["d" + orth] = [{ORTH: "d", NORM: "de"}, {ORTH: orth}]
|
|
||||||
|
|
||||||
for orth in _per_pron + _dem_pron + _und_pron:
|
|
||||||
_exc["n" + orth] = [{ORTH: "n", NORM: "em"}, {ORTH: orth}]
|
|
||||||
|
|
||||||
|
|
||||||
for orth in [
|
for orth in [
|
||||||
"Adm.",
|
"Adm.",
|
||||||
|
"Art.",
|
||||||
|
"art.",
|
||||||
|
"Av.",
|
||||||
|
"av.",
|
||||||
|
"Cia.",
|
||||||
|
"dom.",
|
||||||
"Dr.",
|
"Dr.",
|
||||||
|
"dr.",
|
||||||
"e.g.",
|
"e.g.",
|
||||||
"E.g.",
|
"E.g.",
|
||||||
"E.G.",
|
"E.G.",
|
||||||
|
"e/ou",
|
||||||
|
"ed.",
|
||||||
|
"eng.",
|
||||||
|
"etc.",
|
||||||
|
"Fund.",
|
||||||
"Gen.",
|
"Gen.",
|
||||||
"Gov.",
|
"Gov.",
|
||||||
"i.e.",
|
"i.e.",
|
||||||
"I.e.",
|
"I.e.",
|
||||||
"I.E.",
|
"I.E.",
|
||||||
|
"Inc.",
|
||||||
"Jr.",
|
"Jr.",
|
||||||
|
"km/h",
|
||||||
"Ltd.",
|
"Ltd.",
|
||||||
|
"Mr.",
|
||||||
"p.m.",
|
"p.m.",
|
||||||
"Ph.D.",
|
"Ph.D.",
|
||||||
"Rep.",
|
"Rep.",
|
||||||
"Rev.",
|
"Rev.",
|
||||||
|
"S/A",
|
||||||
"Sen.",
|
"Sen.",
|
||||||
"Sr.",
|
"Sr.",
|
||||||
|
"sr.",
|
||||||
"Sra.",
|
"Sra.",
|
||||||
|
"sra.",
|
||||||
"vs.",
|
"vs.",
|
||||||
"tel.",
|
"tel.",
|
||||||
"pág.",
|
"pág.",
|
||||||
|
|
|
@ -1,5 +1,7 @@
|
||||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||||
from .stop_words import STOP_WORDS
|
from .stop_words import STOP_WORDS
|
||||||
|
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
|
||||||
|
from .punctuation import TOKENIZER_SUFFIXES
|
||||||
|
|
||||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||||
from ..norm_exceptions import BASE_NORMS
|
from ..norm_exceptions import BASE_NORMS
|
||||||
|
@ -21,6 +23,9 @@ class RomanianDefaults(Language.Defaults):
|
||||||
)
|
)
|
||||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||||
stop_words = STOP_WORDS
|
stop_words = STOP_WORDS
|
||||||
|
prefixes = TOKENIZER_PREFIXES
|
||||||
|
suffixes = TOKENIZER_SUFFIXES
|
||||||
|
infixes = TOKENIZER_INFIXES
|
||||||
tag_map = TAG_MAP
|
tag_map = TAG_MAP
|
||||||
|
|
||||||
|
|
||||||
|
|
164
spacy/lang/ro/punctuation.py
Normal file
|
@ -0,0 +1,164 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import itertools
|
||||||
|
|
||||||
|
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY
|
||||||
|
from ..char_classes import LIST_ICONS, CURRENCY
|
||||||
|
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA, PUNCT
|
||||||
|
|
||||||
|
|
||||||
|
_list_icons = [x for x in LIST_ICONS if x != "°"]
|
||||||
|
_list_icons = [x.replace("\\u00B0", "") for x in _list_icons]
|
||||||
|
|
||||||
|
|
||||||
|
_ro_variants = {
|
||||||
|
"Ă": ["Ă", "A"],
|
||||||
|
"Â": ["Â", "A"],
|
||||||
|
"Î": ["Î", "I"],
|
||||||
|
"Ș": ["Ș", "Ş", "S"],
|
||||||
|
"Ț": ["Ț", "Ţ", "T"],
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _make_ro_variants(tokens):
|
||||||
|
variants = []
|
||||||
|
for token in tokens:
|
||||||
|
upper_token = token.upper()
|
||||||
|
upper_char_variants = [_ro_variants.get(c, [c]) for c in upper_token]
|
||||||
|
upper_variants = ["".join(x) for x in itertools.product(*upper_char_variants)]
|
||||||
|
for variant in upper_variants:
|
||||||
|
variants.extend([variant, variant.lower(), variant.title()])
|
||||||
|
return sorted(list(set(variants)))
|
||||||
|
|
||||||
|
|
||||||
|
# UD_Romanian-RRT closed class prefixes
|
||||||
|
# POS: ADP|AUX|CCONJ|DET|NUM|PART|PRON|SCONJ
|
||||||
|
_ud_rrt_prefixes = [
|
||||||
|
"a-",
|
||||||
|
"c-",
|
||||||
|
"ce-",
|
||||||
|
"cu-",
|
||||||
|
"d-",
|
||||||
|
"de-",
|
||||||
|
"dintr-",
|
||||||
|
"e-",
|
||||||
|
"făr-",
|
||||||
|
"i-",
|
||||||
|
"l-",
|
||||||
|
"le-",
|
||||||
|
"m-",
|
||||||
|
"mi-",
|
||||||
|
"n-",
|
||||||
|
"ne-",
|
||||||
|
"p-",
|
||||||
|
"pe-",
|
||||||
|
"prim-",
|
||||||
|
"printr-",
|
||||||
|
"s-",
|
||||||
|
"se-",
|
||||||
|
"te-",
|
||||||
|
"v-",
|
||||||
|
"într-",
|
||||||
|
"ș-",
|
||||||
|
"și-",
|
||||||
|
"ți-",
|
||||||
|
]
|
||||||
|
_ud_rrt_prefix_variants = _make_ro_variants(_ud_rrt_prefixes)
|
||||||
|
|
||||||
|
|
||||||
|
# UD_Romanian-RRT closed class suffixes without NUM
|
||||||
|
# POS: ADP|AUX|CCONJ|DET|PART|PRON|SCONJ
|
||||||
|
_ud_rrt_suffixes = [
|
||||||
|
"-a",
|
||||||
|
"-aceasta",
|
||||||
|
"-ai",
|
||||||
|
"-al",
|
||||||
|
"-ale",
|
||||||
|
"-alta",
|
||||||
|
"-am",
|
||||||
|
"-ar",
|
||||||
|
"-astea",
|
||||||
|
"-atâta",
|
||||||
|
"-au",
|
||||||
|
"-aș",
|
||||||
|
"-ați",
|
||||||
|
"-i",
|
||||||
|
"-ilor",
|
||||||
|
"-l",
|
||||||
|
"-le",
|
||||||
|
"-lea",
|
||||||
|
"-mea",
|
||||||
|
"-meu",
|
||||||
|
"-mi",
|
||||||
|
"-mă",
|
||||||
|
"-n",
|
||||||
|
"-ndărătul",
|
||||||
|
"-ne",
|
||||||
|
"-o",
|
||||||
|
"-oi",
|
||||||
|
"-or",
|
||||||
|
"-s",
|
||||||
|
"-se",
|
||||||
|
"-si",
|
||||||
|
"-te",
|
||||||
|
"-ul",
|
||||||
|
"-ului",
|
||||||
|
"-un",
|
||||||
|
"-uri",
|
||||||
|
"-urile",
|
||||||
|
"-urilor",
|
||||||
|
"-veți",
|
||||||
|
"-vă",
|
||||||
|
"-ăștia",
|
||||||
|
"-și",
|
||||||
|
"-ți",
|
||||||
|
]
|
||||||
|
_ud_rrt_suffix_variants = _make_ro_variants(_ud_rrt_suffixes)
|
||||||
|
|
||||||
|
|
||||||
|
_prefixes = (
|
||||||
|
["§", "%", "=", "—", "–", r"\+(?![0-9])"]
|
||||||
|
+ _ud_rrt_prefix_variants
|
||||||
|
+ LIST_PUNCT
|
||||||
|
+ LIST_ELLIPSES
|
||||||
|
+ LIST_QUOTES
|
||||||
|
+ LIST_CURRENCY
|
||||||
|
+ LIST_ICONS
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
_suffixes = (
|
||||||
|
_ud_rrt_suffix_variants
|
||||||
|
+ LIST_PUNCT
|
||||||
|
+ LIST_ELLIPSES
|
||||||
|
+ LIST_QUOTES
|
||||||
|
+ _list_icons
|
||||||
|
+ ["—", "–"]
|
||||||
|
+ [
|
||||||
|
r"(?<=[0-9])\+",
|
||||||
|
r"(?<=°[FfCcKk])\.",
|
||||||
|
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
|
||||||
|
r"(?<=[0-9{al}{e}{p}(?:{q})])\.".format(
|
||||||
|
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
|
||||||
|
),
|
||||||
|
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
|
||||||
|
]
|
||||||
|
)
|
||||||
|
|
||||||
|
_infixes = (
|
||||||
|
LIST_ELLIPSES
|
||||||
|
+ _list_icons
|
||||||
|
+ [
|
||||||
|
r"(?<=[0-9])[+\*^](?=[0-9-])",
|
||||||
|
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
|
||||||
|
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
|
||||||
|
),
|
||||||
|
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||||
|
r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA),
|
||||||
|
]
|
||||||
|
)
|
||||||
|
|
||||||
|
TOKENIZER_PREFIXES = _prefixes
|
||||||
|
TOKENIZER_SUFFIXES = _suffixes
|
||||||
|
TOKENIZER_INFIXES = _infixes
|
|
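Review note: the `_make_ro_variants` helper added in `spacy/lang/ro/punctuation.py` expands each token into its diacritic and casing variants. The following is a minimal stand-alone sketch of that logic, not the committed code: the names `RO_VARIANTS` and `make_ro_variants` are assumptions made for illustration, and the characters are written as escapes so the comma-below and cedilla forms cannot be confused.

```python
import itertools

# Hypothetical stand-alone copy of the variant table from the hunk above.
RO_VARIANTS = {
    "\u0102": ["\u0102", "A"],            # A with breve
    "\u00c2": ["\u00c2", "A"],            # A with circumflex
    "\u00ce": ["\u00ce", "I"],            # I with circumflex
    "\u0218": ["\u0218", "\u015e", "S"],  # S comma below, S cedilla, plain S
    "\u021a": ["\u021a", "\u0162", "T"],  # T comma below, T cedilla, plain T
}


def make_ro_variants(tokens):
    """Return every diacritic/casing variant of the given tokens, sorted."""
    variants = set()
    for token in tokens:
        # One list of alternatives per character of the upper-cased token.
        per_char = [RO_VARIANTS.get(c, [c]) for c in token.upper()]
        # The Cartesian product enumerates every combination of alternatives.
        for chars in itertools.product(*per_char):
            variant = "".join(chars)
            variants.update([variant, variant.lower(), variant.title()])
    return sorted(variants)


# The prefix "ș-" (comma-below form) expands to six spellings.
result = make_ro_variants(["\u0219-"])
print(result)
```

Because the product grows with every character that has alternatives, a token with several diacritics produces many entries. This is presumably why the tokenizer exceptions hunk notes that capitalized-only exceptions are not distinguished from the others.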
@ -1,4 +1,5 @@
|
||||||
from ...symbols import ORTH
|
from ...symbols import ORTH
|
||||||
|
from .punctuation import _make_ro_variants
|
||||||
|
|
||||||
|
|
||||||
_exc = {}
|
_exc = {}
|
||||||
|
@ -42,8 +43,52 @@ for orth in [
|
||||||
"dpdv",
|
"dpdv",
|
||||||
"șamd.",
|
"șamd.",
|
||||||
"ș.a.m.d.",
|
"ș.a.m.d.",
|
||||||
|
# below: from UD_Romanian-RRT:
|
||||||
|
"A.c.",
|
||||||
|
"A.f.",
|
||||||
|
"A.r.",
|
||||||
|
"Al.",
|
||||||
|
"Art.",
|
||||||
|
"Aug.",
|
||||||
|
"Bd.",
|
||||||
|
"Dem.",
|
||||||
|
"Dr.",
|
||||||
|
"Fig.",
|
||||||
|
"Fr.",
|
||||||
|
"Gh.",
|
||||||
|
"Gr.",
|
||||||
|
"Lt.",
|
||||||
|
"Nr.",
|
||||||
|
"Obs.",
|
||||||
|
"Prof.",
|
||||||
|
"Sf.",
|
||||||
|
"a.m.",
|
||||||
|
"a.r.",
|
||||||
|
"alin.",
|
||||||
|
"art.",
|
||||||
|
"d-l",
|
||||||
|
"d-lui",
|
||||||
|
"d-nei",
|
||||||
|
"ex.",
|
||||||
|
"fig.",
|
||||||
|
"ian.",
|
||||||
|
"lit.",
|
||||||
|
"lt.",
|
||||||
|
"p.a.",
|
||||||
|
"p.m.",
|
||||||
|
"pct.",
|
||||||
|
"prep.",
|
||||||
|
"sf.",
|
||||||
|
"tel.",
|
||||||
|
"univ.",
|
||||||
|
"îngr.",
|
||||||
|
"într-adevăr",
|
||||||
|
"Șt.",
|
||||||
|
"ș.a.",
|
||||||
]:
|
]:
|
||||||
_exc[orth] = [{ORTH: orth}]
|
# note: does not distinguish capitalized-only exceptions from others
|
||||||
|
for variant in _make_ro_variants([orth]):
|
||||||
|
_exc[variant] = [{ORTH: variant}]
|
||||||
|
|
||||||
|
|
||||||
TOKENIZER_EXCEPTIONS = _exc
|
TOKENIZER_EXCEPTIONS = _exc
|
||||||
|
|
|
@ -13,6 +13,7 @@ import multiprocessing as mp
|
||||||
from itertools import chain, cycle
|
from itertools import chain, cycle
|
||||||
|
|
||||||
from .tokenizer import Tokenizer
|
from .tokenizer import Tokenizer
|
||||||
|
from .tokens.underscore import Underscore
|
||||||
from .vocab import Vocab
|
from .vocab import Vocab
|
||||||
from .lemmatizer import Lemmatizer
|
from .lemmatizer import Lemmatizer
|
||||||
from .lookups import Lookups
|
from .lookups import Lookups
|
||||||
|
@ -874,7 +875,10 @@ class Language(object):
|
||||||
sender.send()
|
sender.send()
|
||||||
|
|
||||||
procs = [
|
procs = [
|
||||||
mp.Process(target=_apply_pipes, args=(self.make_doc, pipes, rch, sch))
|
mp.Process(
|
||||||
|
target=_apply_pipes,
|
||||||
|
args=(self.make_doc, pipes, rch, sch, Underscore.get_state()),
|
||||||
|
)
|
||||||
for rch, sch in zip(texts_q, bytedocs_send_ch)
|
for rch, sch in zip(texts_q, bytedocs_send_ch)
|
||||||
]
|
]
|
||||||
for proc in procs:
|
for proc in procs:
|
||||||
|
@ -1146,16 +1150,19 @@ def _pipe(examples, proc, kwargs):
|
||||||
yield ex
|
yield ex
|
||||||
|
|
||||||
|
|
||||||
def _apply_pipes(make_doc, pipes, reciever, sender):
|
def _apply_pipes(make_doc, pipes, receiver, sender, underscore_state, vectors):
|
||||||
"""Worker for Language.pipe
|
"""Worker for Language.pipe
|
||||||
|
|
||||||
receiver (multiprocessing.Connection): Pipe to receive text. Usually
|
receiver (multiprocessing.Connection): Pipe to receive text. Usually
|
||||||
created by `multiprocessing.Pipe()`
|
created by `multiprocessing.Pipe()`
|
||||||
sender (multiprocessing.Connection): Pipe to send doc. Usually created by
|
sender (multiprocessing.Connection): Pipe to send doc. Usually created by
|
||||||
`multiprocessing.Pipe()`
|
`multiprocessing.Pipe()`
|
||||||
|
underscore_state (tuple): The data in the Underscore class of the parent
|
||||||
|
vectors (dict): The global vectors data, copied from the parent
|
||||||
"""
|
"""
|
||||||
|
Underscore.load_state(underscore_state)
|
||||||
while True:
|
while True:
|
||||||
texts = reciever.get()
|
texts = receiver.get()
|
||||||
docs = (make_doc(text) for text in texts)
|
docs = (make_doc(text) for text in texts)
|
||||||
for pipe in pipes:
|
for pipe in pipes:
|
||||||
docs = pipe(docs)
|
docs = pipe(docs)
|
||||||
|
|
|
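Review note: the `_apply_pipes` change passes `Underscore.get_state()` to each worker, which then calls `Underscore.load_state(...)`. The sketch below illustrates why that pattern is useful, under the assumption that extensions live in class-level dictionaries which a freshly started worker process does not necessarily inherit. `ExtensionRegistry` and `worker_setup` are hypothetical stand-ins, not spaCy's implementation, and the fresh worker is simulated in-process instead of with `multiprocessing`.

```python
class ExtensionRegistry:
    """Hypothetical stand-in for a class that keeps its data in class attributes."""

    doc_extensions = {}
    span_extensions = {}
    token_extensions = {}

    @classmethod
    def get_state(cls):
        # Bundle the class-level dictionaries so they can be handed to a worker.
        return cls.token_extensions, cls.span_extensions, cls.doc_extensions

    @classmethod
    def load_state(cls, state):
        # Restore the dictionaries in the process that received the state.
        cls.token_extensions, cls.span_extensions, cls.doc_extensions = state


def worker_setup(registry_state):
    """What a worker would do first, before processing any text."""
    ExtensionRegistry.load_state(registry_state)


# Parent process: register an extension and capture the state.
ExtensionRegistry.token_extensions["is_fruit"] = (False, None, None, None)
state = ExtensionRegistry.get_state()

# Simulate a freshly started worker whose class attributes are empty.
ExtensionRegistry.token_extensions = {}
assert "is_fruit" not in ExtensionRegistry.token_extensions

# After loading the parent's state, the extension is visible again.
worker_setup(state)
print(sorted(ExtensionRegistry.token_extensions))
```

In a real multi-process run the state would travel through the `args` tuple of `mp.Process`, as the hunk shows, so it has to be picklable.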
@ -664,6 +664,8 @@ def _get_attr_values(spec, string_store):
|
||||||
continue
|
continue
|
||||||
if attr == "TEXT":
|
if attr == "TEXT":
|
||||||
attr = "ORTH"
|
attr = "ORTH"
|
||||||
|
if attr == "IS_SENT_START":
|
||||||
|
attr = "SENT_START"
|
||||||
attr = IDS.get(attr)
|
attr = IDS.get(attr)
|
||||||
if isinstance(value, basestring):
|
if isinstance(value, basestring):
|
||||||
value = string_store.add(value)
|
value = string_store.add(value)
|
||||||
|
|
|
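Review note: the hunk in `_get_attr_values` maps `IS_SENT_START` to `SENT_START`, next to the existing `TEXT` to `ORTH` mapping. A minimal sketch of that aliasing step follows; `ATTR_ALIASES` and `normalize_pattern` are hypothetical names used only for illustration, and the real function does more than this (it also resolves names to IDs and interns string values).

```python
# Aliases shown in the hunk above: the user-facing name maps to the internal one.
ATTR_ALIASES = {"TEXT": "ORTH", "IS_SENT_START": "SENT_START"}


def normalize_pattern(pattern):
    """Return a copy of a token pattern with aliased attribute names resolved."""
    normalized = []
    for token_spec in pattern:
        normalized.append(
            {ATTR_ALIASES.get(attr, attr): value for attr, value in token_spec.items()}
        )
    return normalized


# Both spellings end up describing the same attribute.
a = normalize_pattern([{"IS_SENT_START": True}])
b = normalize_pattern([{"SENT_START": True}])
print(a, b)
```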
@ -365,7 +365,7 @@ class Tensorizer(Pipe):
|
||||||
return sgd
|
return sgd
|
||||||
|
|
||||||
|
|
||||||
@component("tagger", assigns=["token.tag", "token.pos"])
|
@component("tagger", assigns=["token.tag", "token.pos", "token.lemma"])
|
||||||
class Tagger(Pipe):
|
class Tagger(Pipe):
|
||||||
"""Pipeline component for part-of-speech tagging.
|
"""Pipeline component for part-of-speech tagging.
|
||||||
|
|
||||||
|
|
|
@ -464,3 +464,5 @@ cdef enum symbol_t:
|
||||||
ENT_KB_ID
|
ENT_KB_ID
|
||||||
MORPH
|
MORPH
|
||||||
ENT_ID
|
ENT_ID
|
||||||
|
|
||||||
|
IDX
|
|
@ -89,6 +89,7 @@ IDS = {
|
||||||
"SPACY": SPACY,
|
"SPACY": SPACY,
|
||||||
"PROB": PROB,
|
"PROB": PROB,
|
||||||
"LANG": LANG,
|
"LANG": LANG,
|
||||||
|
"IDX": IDX,
|
||||||
|
|
||||||
"ADJ": ADJ,
|
"ADJ": ADJ,
|
||||||
"ADP": ADP,
|
"ADP": ADP,
|
||||||
|
|
|
@ -80,6 +80,11 @@ def es_tokenizer():
|
||||||
return get_lang_class("es").Defaults.create_tokenizer()
|
return get_lang_class("es").Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture(scope="session")
|
||||||
|
def eu_tokenizer():
|
||||||
|
return get_lang_class("eu").Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture(scope="session")
|
@pytest.fixture(scope="session")
|
||||||
def fi_tokenizer():
|
def fi_tokenizer():
|
||||||
return get_lang_class("fi").Defaults.create_tokenizer()
|
return get_lang_class("fi").Defaults.create_tokenizer()
|
||||||
|
|
|
@ -63,3 +63,39 @@ def test_doc_array_to_from_string_attrs(en_vocab, attrs):
|
||||||
words = ["An", "example", "sentence"]
|
words = ["An", "example", "sentence"]
|
||||||
doc = Doc(en_vocab, words=words)
|
doc = Doc(en_vocab, words=words)
|
||||||
Doc(en_vocab, words=words).from_array(attrs, doc.to_array(attrs))
|
Doc(en_vocab, words=words).from_array(attrs, doc.to_array(attrs))
|
||||||
|
|
||||||
|
|
||||||
|
def test_doc_array_idx(en_vocab):
|
||||||
|
"""Test that Doc.to_array can retrieve token start indices"""
|
||||||
|
words = ["An", "example", "sentence"]
|
||||||
|
offsets = Doc(en_vocab, words=words).to_array("IDX")
|
||||||
|
assert offsets[0] == 0
|
||||||
|
assert offsets[1] == 3
|
||||||
|
assert offsets[2] == 11
|
||||||
|
|
||||||
|
|
||||||
|
def test_doc_from_array_heads_in_bounds(en_vocab):
|
||||||
|
"""Test that Doc.from_array doesn't set heads that are out of bounds."""
|
||||||
|
words = ["This", "is", "a", "sentence", "."]
|
||||||
|
doc = Doc(en_vocab, words=words)
|
||||||
|
for token in doc:
|
||||||
|
token.head = doc[0]
|
||||||
|
|
||||||
|
# correct
|
||||||
|
arr = doc.to_array(["HEAD"])
|
||||||
|
doc_from_array = Doc(en_vocab, words=words)
|
||||||
|
doc_from_array.from_array(["HEAD"], arr)
|
||||||
|
|
||||||
|
# head before start
|
||||||
|
arr = doc.to_array(["HEAD"])
|
||||||
|
arr[0] = -1
|
||||||
|
doc_from_array = Doc(en_vocab, words=words)
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
doc_from_array.from_array(["HEAD"], arr)
|
||||||
|
|
||||||
|
# head after end
|
||||||
|
arr = doc.to_array(["HEAD"])
|
||||||
|
arr[0] = 5
|
||||||
|
doc_from_array = Doc(en_vocab, words=words)
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
doc_from_array.from_array(["HEAD"], arr)
|
||||||
|
|
|
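Review note: `test_doc_array_idx` expects the offsets 0, 3 and 11 for the words "An", "example" and "sentence". The sketch below shows where those numbers come from, assuming each token is followed by a single space when no explicit spacing is given. `token_offsets` is a hypothetical helper, not part of spaCy.

```python
def token_offsets(words, spaces=None):
    """Compute the character offset at which each token starts."""
    if spaces is None:
        # Assumption: every token is followed by exactly one space.
        spaces = [True] * len(words)
    offsets = []
    position = 0
    for word, has_space in zip(words, spaces):
        offsets.append(position)
        position += len(word) + (1 if has_space else 0)
    return offsets


# "An" occupies characters 0-1 plus a space, so "example" starts at 3;
# "example" adds 7 characters plus a space, so "sentence" starts at 11.
offsets = token_offsets(["An", "example", "sentence"])
print(offsets)
```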
@ -145,10 +145,9 @@ def test_doc_api_runtime_error(en_tokenizer):
|
||||||
# Example that caused run-time error while parsing Reddit
|
# Example that caused run-time error while parsing Reddit
|
||||||
# fmt: off
|
# fmt: off
|
||||||
text = "67% of black households are single parent \n\n72% of all black babies born out of wedlock \n\n50% of all black kids don\u2019t finish high school"
|
text = "67% of black households are single parent \n\n72% of all black babies born out of wedlock \n\n50% of all black kids don\u2019t finish high school"
|
||||||
deps = ["nsubj", "prep", "amod", "pobj", "ROOT", "amod", "attr", "",
|
deps = ["nummod", "nsubj", "prep", "amod", "pobj", "ROOT", "amod", "attr", "", "nummod", "appos", "prep", "det",
|
||||||
"nummod", "prep", "det", "amod", "pobj", "acl", "prep", "prep",
|
"amod", "pobj", "acl", "prep", "prep", "pobj",
|
||||||
"pobj", "", "nummod", "prep", "det", "amod", "pobj", "aux", "neg",
|
"", "nummod", "nsubj", "prep", "det", "amod", "pobj", "aux", "neg", "ccomp", "amod", "dobj"]
|
||||||
"ROOT", "amod", "dobj"]
|
|
||||||
# fmt: on
|
# fmt: on
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
|
||||||
|
@ -272,19 +271,9 @@ def test_doc_is_nered(en_vocab):
|
||||||
def test_doc_from_array_sent_starts(en_vocab):
|
def test_doc_from_array_sent_starts(en_vocab):
|
||||||
words = ["I", "live", "in", "New", "York", ".", "I", "like", "cats", "."]
|
words = ["I", "live", "in", "New", "York", ".", "I", "like", "cats", "."]
|
||||||
heads = [0, 0, 0, 0, 0, 0, 6, 6, 6, 6]
|
heads = [0, 0, 0, 0, 0, 0, 6, 6, 6, 6]
|
||||||
deps = [
|
# fmt: off
|
||||||
"ROOT",
|
deps = ["ROOT", "dep", "dep", "dep", "dep", "dep", "ROOT", "dep", "dep", "dep", "dep"]
|
||||||
"dep",
|
# fmt: on
|
||||||
"dep",
|
|
||||||
"dep",
|
|
||||||
"dep",
|
|
||||||
"dep",
|
|
||||||
"ROOT",
|
|
||||||
"dep",
|
|
||||||
"dep",
|
|
||||||
"dep",
|
|
||||||
"dep",
|
|
||||||
]
|
|
||||||
doc = Doc(en_vocab, words=words)
|
doc = Doc(en_vocab, words=words)
|
||||||
for i, (dep, head) in enumerate(zip(deps, heads)):
|
for i, (dep, head) in enumerate(zip(deps, heads)):
|
||||||
doc[i].dep_ = dep
|
doc[i].dep_ = dep
|
||||||
|
|
|
@ -164,6 +164,11 @@ def test_doc_token_api_head_setter(en_tokenizer):
|
||||||
assert doc[4].left_edge.i == 0
|
assert doc[4].left_edge.i == 0
|
||||||
assert doc[2].left_edge.i == 0
|
assert doc[2].left_edge.i == 0
|
||||||
|
|
||||||
|
# head token must be from the same document
|
||||||
|
doc2 = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
doc[0].head = doc2[0]
|
||||||
|
|
||||||
|
|
||||||
def test_is_sent_start(en_tokenizer):
|
def test_is_sent_start(en_tokenizer):
|
||||||
doc = en_tokenizer("This is a sentence. This is another.")
|
doc = en_tokenizer("This is a sentence. This is another.")
|
||||||
|
@ -211,7 +216,7 @@ def test_token_api_conjuncts_chain(en_vocab):
|
||||||
def test_token_api_conjuncts_simple(en_vocab):
|
def test_token_api_conjuncts_simple(en_vocab):
|
||||||
words = "They came and went .".split()
|
words = "They came and went .".split()
|
||||||
heads = [1, 0, -1, -2, -1]
|
heads = [1, 0, -1, -2, -1]
|
||||||
deps = ["nsubj", "ROOT", "cc", "conj"]
|
deps = ["nsubj", "ROOT", "cc", "conj", "dep"]
|
||||||
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
|
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
|
||||||
assert [w.text for w in doc[1].conjuncts] == ["went"]
|
assert [w.text for w in doc[1].conjuncts] == ["went"]
|
||||||
assert [w.text for w in doc[3].conjuncts] == ["came"]
|
assert [w.text for w in doc[3].conjuncts] == ["came"]
|
||||||
|
|
|
@ -4,6 +4,15 @@ from spacy.tokens import Doc, Span, Token
|
||||||
from spacy.tokens.underscore import Underscore
|
from spacy.tokens.underscore import Underscore
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture(scope="function", autouse=True)
|
||||||
|
def clean_underscore():
|
||||||
|
# reset the Underscore object after the test, to avoid having state copied across tests
|
||||||
|
yield
|
||||||
|
Underscore.doc_extensions = {}
|
||||||
|
Underscore.span_extensions = {}
|
||||||
|
Underscore.token_extensions = {}
|
||||||
|
|
||||||
|
|
||||||
def test_create_doc_underscore():
|
def test_create_doc_underscore():
|
||||||
doc = Mock()
|
doc = Mock()
|
||||||
doc.doc = doc
|
doc.doc = doc
|
||||||
|
|
|
@ -55,7 +55,8 @@ def test_da_tokenizer_norm_exceptions(da_tokenizer, text, norm):
|
||||||
("Kristiansen c/o Madsen", 3),
|
("Kristiansen c/o Madsen", 3),
|
||||||
("Sprogteknologi a/s", 2),
|
("Sprogteknologi a/s", 2),
|
||||||
("De boede i A/B Bellevue", 5),
|
("De boede i A/B Bellevue", 5),
|
||||||
("Rotorhastigheden er 3400 o/m.", 5),
|
# note: skipping due to weirdness in UD_Danish-DDT
|
||||||
|
# ("Rotorhastigheden er 3400 o/m.", 5),
|
||||||
("Jeg købte billet t/r.", 5),
|
("Jeg købte billet t/r.", 5),
|
||||||
("Murerarbejdsmand m/k søges", 3),
|
("Murerarbejdsmand m/k søges", 3),
|
||||||
("Netværket kører over TCP/IP", 4),
|
("Netværket kører over TCP/IP", 4),
|
||||||
|
|
22
spacy/tests/lang/eu/test_text.py
Normal file
|
@ -0,0 +1,22 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
def test_eu_tokenizer_handles_long_text(eu_tokenizer):
|
||||||
|
text = """ta nere guitarra estrenatu ondoren"""
|
||||||
|
tokens = eu_tokenizer(text)
|
||||||
|
assert len(tokens) == 5
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize(
|
||||||
|
"text,length",
|
||||||
|
[
|
||||||
|
("milesker ederra joan zen hitzaldia plazer hutsa", 7),
|
||||||
|
("astelehen guztia sofan pasau biot", 5),
|
||||||
|
],
|
||||||
|
)
|
||||||
|
def test_eu_tokenizer_handles_cnts(eu_tokenizer, text, length):
|
||||||
|
tokens = eu_tokenizer(text)
|
||||||
|
assert len(tokens) == length
|
|
@ -7,12 +7,22 @@ ABBREVIATION_TESTS = [
|
||||||
["Hyvää", "uutta", "vuotta", "t.", "siht.", "Niemelä", "!"],
|
["Hyvää", "uutta", "vuotta", "t.", "siht.", "Niemelä", "!"],
|
||||||
),
|
),
|
||||||
("Paino on n. 2.2 kg", ["Paino", "on", "n.", "2.2", "kg"]),
|
("Paino on n. 2.2 kg", ["Paino", "on", "n.", "2.2", "kg"]),
|
||||||
|
(
|
||||||
|
"Vuonna 1 eaa. tapahtui kauheita.",
|
||||||
|
["Vuonna", "1", "eaa.", "tapahtui", "kauheita", "."],
|
||||||
|
),
|
||||||
]
|
]
|
||||||
|
|
||||||
HYPHENATED_TESTS = [
|
HYPHENATED_TESTS = [
|
||||||
(
|
(
|
||||||
"1700-luvulle sijoittuva taide-elokuva",
|
"1700-luvulle sijoittuva taide-elokuva Wikimedia-säätiön Varsinais-Suomen",
|
||||||
["1700-luvulle", "sijoittuva", "taide-elokuva"],
|
[
|
||||||
|
"1700-luvulle",
|
||||||
|
"sijoittuva",
|
||||||
|
"taide-elokuva",
|
||||||
|
"Wikimedia-säätiön",
|
||||||
|
"Varsinais-Suomen",
|
||||||
|
],
|
||||||
)
|
)
|
||||||
]
|
]
|
||||||
|
|
||||||
|
@ -23,6 +33,7 @@ ABBREVIATION_INFLECTION_TESTS = [
|
||||||
),
|
),
|
||||||
("ALV:n osuus on 24 %.", ["ALV:n", "osuus", "on", "24", "%", "."]),
|
("ALV:n osuus on 24 %.", ["ALV:n", "osuus", "on", "24", "%", "."]),
|
||||||
("Hiihtäjä oli kilpailun 14:s.", ["Hiihtäjä", "oli", "kilpailun", "14:s", "."]),
|
("Hiihtäjä oli kilpailun 14:s.", ["Hiihtäjä", "oli", "kilpailun", "14:s", "."]),
|
||||||
|
("EU:n toimesta tehtiin jotain.", ["EU:n", "toimesta", "tehtiin", "jotain", "."]),
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -12,11 +12,11 @@ def test_lt_tokenizer_handles_long_text(lt_tokenizer):
|
||||||
[
|
[
|
||||||
(
|
(
|
||||||
"177R Parodų rūmai–Ozo g. nuo vasario 18 d. bus skelbiamas interneto tinklalapyje.",
|
"177R Parodų rūmai–Ozo g. nuo vasario 18 d. bus skelbiamas interneto tinklalapyje.",
|
||||||
15,
|
17,
|
||||||
),
|
),
|
||||||
(
|
(
|
||||||
"ISM universiteto doc. dr. Ieva Augutytė-Kvedaravičienė pastebi, kad tyrimais nustatyti elgesio pokyčiai.",
|
"ISM universiteto doc. dr. Ieva Augutytė-Kvedaravičienė pastebi, kad tyrimais nustatyti elgesio pokyčiai.",
|
||||||
16,
|
18,
|
||||||
),
|
),
|
||||||
],
|
],
|
||||||
)
|
)
|
||||||
|
@ -28,7 +28,7 @@ def test_lt_tokenizer_handles_punct_abbrev(lt_tokenizer, text, length):
|
||||||
@pytest.mark.parametrize("text", ["km.", "pvz.", "biol."])
|
@pytest.mark.parametrize("text", ["km.", "pvz.", "biol."])
|
||||||
def test_lt_tokenizer_abbrev_exceptions(lt_tokenizer, text):
|
def test_lt_tokenizer_abbrev_exceptions(lt_tokenizer, text):
|
||||||
tokens = lt_tokenizer(text)
|
tokens = lt_tokenizer(text)
|
||||||
assert len(tokens) == 1
|
assert len(tokens) == 2
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize(
|
@pytest.mark.parametrize(
|
||||||
|
|
|
@ -4,6 +4,8 @@ from mock import Mock
|
||||||
from spacy.matcher import Matcher, DependencyMatcher
|
from spacy.matcher import Matcher, DependencyMatcher
|
||||||
from spacy.tokens import Doc, Token
|
from spacy.tokens import Doc, Token
|
||||||
|
|
||||||
|
from ..doc.test_underscore import clean_underscore # noqa: F401
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def matcher(en_vocab):
|
def matcher(en_vocab):
|
||||||
|
@ -197,6 +199,7 @@ def test_matcher_any_token_operator(en_vocab):
|
||||||
assert matches[2] == "test hello world"
|
assert matches[2] == "test hello world"
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.usefixtures("clean_underscore")
|
||||||
def test_matcher_extension_attribute(en_vocab):
|
def test_matcher_extension_attribute(en_vocab):
|
||||||
matcher = Matcher(en_vocab)
|
matcher = Matcher(en_vocab)
|
||||||
get_is_fruit = lambda token: token.text in ("apple", "banana")
|
get_is_fruit = lambda token: token.text in ("apple", "banana")
|
||||||
|
|
|
@ -31,6 +31,8 @@ TEST_PATTERNS = [
|
||||||
([{"NORM": "a"}, {"POS": {"IN": ["NOUN"]}}], 0, 0),
|
([{"NORM": "a"}, {"POS": {"IN": ["NOUN"]}}], 0, 0),
|
||||||
([{"_": {"foo": {"NOT_IN": ["bar", "baz"]}, "a": 5, "b": {">": 10}}}], 0, 0),
|
([{"_": {"foo": {"NOT_IN": ["bar", "baz"]}, "a": 5, "b": {">": 10}}}], 0, 0),
|
||||||
([{"orth": "foo"}], 0, 0), # prev: xfail
|
([{"orth": "foo"}], 0, 0), # prev: xfail
|
||||||
|
([{"IS_SENT_START": True}], 0, 0),
|
||||||
|
([{"SENT_START": True}], 0, 0),
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -31,23 +31,23 @@ BIG BROTHER IS WATCHING YOU, the caption beneath it ran.
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def heads():
|
def heads():
|
||||||
# fmt: off
|
# fmt: off
|
||||||
return [1, 1, 0, 3, 2, 1, -4, -1, -1, -7, -8, 1, -10, 2, 1, -3, -1, -15,
|
return [1, 1, 0, 3, 2, 1, -4, -1, -1, -7, -8, 1, 2, 1, -12, -1, -2,
|
||||||
-1, 1, 4, -1, 1, -3, 0, -1, 1, -2, -4, 1, -2, 1, -2, 3, -1, 1,
|
-1, 1, 4, 3, 1, 1, 0, -1, 1, -2, -4, 1, -2, 1, -2, 3, -1, 1,
|
||||||
-4, -13, -14, -1, -2, 2, 1, -3, -1, 1, -2, -9, -1, 3, 1, 1, -14,
|
-4, -13, -14, -1, -2, 2, 1, -3, -1, 1, -2, -9, -1, -11, 1, 1, -14,
|
||||||
1, -2, 1, -2, -1, 1, -2, -6, -1, -1, -2, -1, -1, -42, -1, 2, 1,
|
1, -2, 1, -2, -1, 1, -2, -6, -1, -1, -2, -1, -1, -42, -1, 1, 1,
|
||||||
0, -1, 1, -2, -1, 2, 1, -4, -8, 0, 1, -2, -1, -1, 3, -1, 1, -6,
|
0, -1, 1, -2, -1, 2, 1, -4, -8, 18, 1, -2, -1, -1, 3, -1, 1, 10,
|
||||||
9, 1, 7, -1, 1, -2, 3, 2, 1, -10, -1, 1, -2, -22, -1, 1, 0, -1,
|
9, 1, 7, -1, 1, -2, 3, 2, 1, 0, -1, 1, -2, -4, -1, 1, 0, -1,
|
||||||
2, 1, -4, -1, -2, -1, 1, -2, -6, -7, 1, -9, -1, 2, -1, -3, -1,
|
2, 1, -4, -1, 2, 1, 1, 1, -6, -11, 1, 20, -1, 2, -1, -3, -1,
|
||||||
3, 2, 1, -4, -19, -24, 3, 2, 1, -4, -1, 1, 2, -1, -5, -34, 1, 0,
|
3, 2, 1, -4, -10, -11, 3, 2, 1, -4, -1, 1, -3, -1, 0, -1, 1, 0,
|
||||||
-1, 1, -2, -4, 1, 0, 1, -2, -1, 1, -2, -6, 1, 9, -1, 1, -3, -1,
|
-1, 1, -2, -4, 1, 0, 1, -2, -1, 1, -2, -6, 1, 9, -1, 1, 6, -1,
|
||||||
-1, 3, 2, 1, 0, -1, -2, 7, -1, 5, 1, 3, -1, 1, -10, -1, -2, 1,
|
-1, 3, 2, 1, 0, -1, -2, 7, -1, 2, 1, 3, -1, 1, -10, -1, -2, 1,
|
||||||
-2, -15, 1, 0, -1, -1, 2, 1, -3, -1, -1, -2, -1, 1, -2, -12, 1,
|
-2, -5, 1, 0, -1, -1, 1, -2, -5, -1, -1, -2, -1, 1, -2, -12, 1,
|
||||||
1, 0, 1, -2, -1, -2, -3, 9, -1, 2, -1, -4, 2, 1, -3, -4, -15, 2,
|
1, 0, 1, -2, -1, -4, -5, 18, -1, 2, -1, -4, 2, 1, -3, -4, -5, 2,
|
||||||
1, -3, -1, 2, 1, -3, -8, -9, -1, -2, -1, -4, 1, -2, -3, 1, -2,
|
1, -3, -1, 2, 1, -3, -17, -24, -1, -2, -1, -4, 1, -2, -3, 1, -2,
|
||||||
-19, 17, 1, -2, 14, 13, 3, 2, 1, -4, 8, -1, 1, 5, -1, 2, 1, -3,
|
-10, 17, 1, -2, 14, 13, 3, 2, 1, -4, 8, -1, 1, 5, -1, 2, 1, -3,
|
||||||
0, -1, 1, -2, -4, 1, 0, -1, -1, 2, -1, -3, 1, -2, 1, -2, 3, 1,
|
0, -1, 1, -2, -4, 1, 0, -1, -1, 2, -1, -3, 1, -2, 1, -2, 3, 1,
|
||||||
1, -4, -1, -2, 2, 1, -5, -19, -1, 1, 1, 0, 1, 6, -1, 1, -3, -1,
|
1, -4, -1, -2, 2, 1, -3, -19, -1, 1, 1, 0, 0, 6, 5, 1, 3, -1,
|
||||||
-1, -8, -9, -1]
|
-1, 0, -1, -1]
|
||||||
# fmt: on
|
# fmt: on
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -48,7 +48,7 @@ def test_issue2203(en_vocab):
|
||||||
tag_ids = [en_vocab.strings.add(tag) for tag in tags]
|
tag_ids = [en_vocab.strings.add(tag) for tag in tags]
|
||||||
lemma_ids = [en_vocab.strings.add(lemma) for lemma in lemmas]
|
lemma_ids = [en_vocab.strings.add(lemma) for lemma in lemmas]
|
||||||
doc = Doc(en_vocab, words=words)
|
doc = Doc(en_vocab, words=words)
|
||||||
# Work around lemma corrpution problem and set lemmas after tags
|
# Work around lemma corruption problem and set lemmas after tags
|
||||||
doc.from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
|
doc.from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
|
||||||
doc.from_array("LEMMA", numpy.array(lemma_ids, dtype="uint64"))
|
doc.from_array("LEMMA", numpy.array(lemma_ids, dtype="uint64"))
|
||||||
assert [t.tag_ for t in doc] == tags
|
assert [t.tag_ for t in doc] == tags
|
||||||
|
|
|
@ -121,7 +121,7 @@ def test_issue2772(en_vocab):
|
||||||
words = "When we write or communicate virtually , we can hide our true feelings .".split()
|
words = "When we write or communicate virtually , we can hide our true feelings .".split()
|
||||||
# A tree with a non-projective (i.e. crossing) arc
|
# A tree with a non-projective (i.e. crossing) arc
|
||||||
# The arcs (0, 4) and (2, 9) cross.
|
# The arcs (0, 4) and (2, 9) cross.
|
||||||
heads = [4, 1, 7, -1, -2, -1, 3, 2, 1, 0, -1, -2, -1]
|
heads = [4, 1, 7, -1, -2, -1, 3, 2, 1, 0, 2, 1, -3, -4]
|
||||||
deps = ["dep"] * len(heads)
|
deps = ["dep"] * len(heads)
|
||||||
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
|
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
|
||||||
assert doc[1].is_sent_start is None
|
assert doc[1].is_sent_start is None
|
||||||
|
|
|
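Review note: the `heads` lists in these tests are relative offsets, which is why the fix in `test_issue2772` brings the list up to one entry per word (the old list had 13 entries for 14 words). The sketch below resolves relative heads to absolute token indices and rejects out-of-range values, in the spirit of `test_doc_from_array_heads_in_bounds`. `absolute_heads` is a hypothetical helper written for illustration, not a spaCy function.

```python
def absolute_heads(relative_heads):
    """Convert relative head offsets to absolute token indices, checking bounds."""
    n = len(relative_heads)
    absolute = []
    for i, offset in enumerate(relative_heads):
        head = i + offset
        if not 0 <= head < n:
            raise ValueError(
                "head of token {} points to {}, outside 0..{}".format(i, head, n - 1)
            )
        absolute.append(head)
    return absolute


words = "When we write or communicate virtually , we can hide our true feelings .".split()
heads = [4, 1, 7, -1, -2, -1, 3, 2, 1, 0, 2, 1, -3, -4]

# One head per word, otherwise the two lists cannot be zipped safely.
assert len(words) == len(heads)

resolved = absolute_heads(heads)
print(resolved)
```

Token 0 resolves to head 4 and token 2 to head 9, which matches the crossing arcs (0, 4) and (2, 9) mentioned in the test comment.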
@ -24,7 +24,7 @@ def test_issue4590(en_vocab):
|
||||||
|
|
||||||
text = "The quick brown fox jumped over the lazy fox"
|
text = "The quick brown fox jumped over the lazy fox"
|
||||||
heads = [3, 2, 1, 1, 0, -1, 2, 1, -3]
|
heads = [3, 2, 1, 1, 0, -1, 2, 1, -3]
|
||||||
deps = ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"]
|
deps = ["det", "amod", "amod", "nsubj", "ROOT", "prep", "det", "amod", "pobj"]
|
||||||
|
|
||||||
doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps)
|
doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps)
|
||||||
|
|
||||||
|
|
25
spacy/tests/regression/test_issue4725.py
Normal file
|
@ -0,0 +1,25 @@
# coding: utf8
from __future__ import unicode_literals

import numpy

from spacy.lang.en import English
from spacy.vocab import Vocab


def test_issue4725():
    # ensures that this runs correctly and doesn't hang or crash because of the global vectors
    vocab = Vocab(vectors_name="test_vocab_add_vector")
    data = numpy.ndarray((5, 3), dtype="f")
    data[0] = 1.0
    data[1] = 2.0
    vocab.set_vector("cat", data[0])
    vocab.set_vector("dog", data[1])

    nlp = English(vocab=vocab)
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    nlp.begin_training()
    docs = ["Kurt is in London."] * 10
    for _ in nlp.pipe(docs, batch_size=2, n_process=2):
        pass
|
43
spacy/tests/regression/test_issue4903.py
Normal file
|
@ -0,0 +1,43 @@
# coding: utf8
from __future__ import unicode_literals

from spacy.lang.en import English
from spacy.tokens import Span, Doc


class CustomPipe:
    name = "my_pipe"

    def __init__(self):
        Span.set_extension("my_ext", getter=self._get_my_ext)
        Doc.set_extension("my_ext", default=None)

    def __call__(self, doc):
        gathered_ext = []
        for sent in doc.sents:
            sent_ext = self._get_my_ext(sent)
            sent._.set("my_ext", sent_ext)
            gathered_ext.append(sent_ext)

        doc._.set("my_ext", "\n".join(gathered_ext))

        return doc

    @staticmethod
    def _get_my_ext(span):
        return str(span.end)


def test_issue4903():
    # ensures that this runs correctly and doesn't hang or crash on Windows / macOS

    nlp = English()
    custom_component = CustomPipe()
    nlp.add_pipe(nlp.create_pipe("sentencizer"))
    nlp.add_pipe(custom_component, after="sentencizer")

    text = ["I like bananas.", "Do you like them?", "No, I prefer wasabi."]
    docs = list(nlp.pipe(text, n_process=2))
    assert docs[0].text == "I like bananas."
    assert docs[1].text == "Do you like them?"
    assert docs[2].text == "No, I prefer wasabi."
|
|
@ -2,7 +2,7 @@ import pytest
|
||||||
from spacy.language import Language
|
from spacy.language import Language
|
||||||
|
|
||||||
|
|
||||||
def test_evaluate():
|
def test_issue4924():
|
||||||
nlp = Language()
|
nlp = Language()
|
||||||
docs_golds = [("", {})]
|
docs_golds = [("", {})]
|
||||||
with pytest.raises(ValueError):
|
with pytest.raises(ValueError):
|
||||||
|
|
35
spacy/tests/regression/test_issue5048.py
Normal file
|
@ -0,0 +1,35 @@
# coding: utf8
from __future__ import unicode_literals

import numpy
from spacy.tokens import Doc
from spacy.attrs import DEP, POS, TAG

from ..util import get_doc


def test_issue5048(en_vocab):
    words = ["This", "is", "a", "sentence"]
    pos_s = ["DET", "VERB", "DET", "NOUN"]
    spaces = [" ", " ", " ", ""]
    deps_s = ["dep", "adj", "nn", "atm"]
    tags_s = ["DT", "VBZ", "DT", "NN"]

    strings = en_vocab.strings

    for w in words:
        strings.add(w)
    deps = [strings.add(d) for d in deps_s]
    pos = [strings.add(p) for p in pos_s]
    tags = [strings.add(t) for t in tags_s]

    attrs = [POS, DEP, TAG]
    array = numpy.array(list(zip(pos, deps, tags)), dtype="uint64")

    doc = Doc(en_vocab, words=words, spaces=spaces)
    doc.from_array(attrs, array)
    v1 = [(token.text, token.pos_, token.tag_) for token in doc]

    doc2 = get_doc(en_vocab, words=words, pos=pos_s, deps=deps_s, tags=tags_s)
    v2 = [(token.text, token.pos_, token.tag_) for token in doc2]
    assert v1 == v2
|
46
spacy/tests/regression/test_issue5082.py
Normal file
|
@ -0,0 +1,46 @@
# coding: utf8
from __future__ import unicode_literals

import numpy as np
from spacy.lang.en import English
from spacy.pipeline import EntityRuler


def test_issue5082():
    # Ensure the 'merge_entities' pipeline does something sensible for the vectors of the merged tokens
    nlp = English()
    vocab = nlp.vocab
    array1 = np.asarray([0.1, 0.5, 0.8], dtype=np.float32)
    array2 = np.asarray([-0.2, -0.6, -0.9], dtype=np.float32)
    array3 = np.asarray([0.3, -0.1, 0.7], dtype=np.float32)
    array4 = np.asarray([0.5, 0, 0.3], dtype=np.float32)
    array34 = np.asarray([0.4, -0.05, 0.5], dtype=np.float32)

    vocab.set_vector("I", array1)
    vocab.set_vector("like", array2)
    vocab.set_vector("David", array3)
    vocab.set_vector("Bowie", array4)

    text = "I like David Bowie"
    ruler = EntityRuler(nlp)
    patterns = [
        {"label": "PERSON", "pattern": [{"LOWER": "david"}, {"LOWER": "bowie"}]}
    ]
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler)

    parsed_vectors_1 = [t.vector for t in nlp(text)]
    assert len(parsed_vectors_1) == 4
    np.testing.assert_array_equal(parsed_vectors_1[0], array1)
    np.testing.assert_array_equal(parsed_vectors_1[1], array2)
    np.testing.assert_array_equal(parsed_vectors_1[2], array3)
    np.testing.assert_array_equal(parsed_vectors_1[3], array4)

    merge_ents = nlp.create_pipe("merge_entities")
    nlp.add_pipe(merge_ents)

    parsed_vectors_2 = [t.vector for t in nlp(text)]
    assert len(parsed_vectors_2) == 3
    np.testing.assert_array_equal(parsed_vectors_2[0], array1)
    np.testing.assert_array_equal(parsed_vectors_2[1], array2)
    np.testing.assert_array_equal(parsed_vectors_2[2], array34)
|
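Reviewer note on test_issue5082: the expected vector `array34` for the merged "David Bowie" token matches the element-wise mean of `array3` and `array4`, which is how a span vector is averaged from its token vectors by default. The following is a minimal pure-Python sketch of that arithmetic only; `mean_vector` is a hypothetical helper, not part of spaCy, and plain lists stand in for the float32 arrays.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length vectors (hypothetical stand-in
    for the averaging that produces a span vector)."""
    if not vectors:
        raise ValueError("need at least one vector")
    width = len(vectors[0])
    if any(len(v) != width for v in vectors):
        raise ValueError("all vectors must have the same width")
    # zip(*vectors) walks the vectors column by column
    return [sum(column) / len(vectors) for column in zip(*vectors)]


david = [0.3, -0.1, 0.7]   # array3 in the test
bowie = [0.5, 0.0, 0.3]    # array4 in the test
merged = mean_vector([david, bowie])
expected = [0.4, -0.05, 0.5]  # array34 in the test

# Compare with a tolerance, since the values are floats
matches = all(abs(a - b) < 1e-9 for a, b in zip(merged, expected))
```

The test itself uses `assert_array_equal` on float32 arrays; the tolerance here is only a precaution for the plain Python floats in this sketch.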
|
@ -12,12 +12,19 @@ def load_tokenizer(b):
|
||||||
|
|
||||||
|
|
||||||
def test_serialize_custom_tokenizer(en_vocab, en_tokenizer):
|
def test_serialize_custom_tokenizer(en_vocab, en_tokenizer):
|
||||||
"""Test that custom tokenizer with not all functions defined can be
|
"""Test that custom tokenizer with not all functions defined or empty
|
||||||
serialized and deserialized correctly (see #2494)."""
|
properties can be serialized and deserialized correctly (see #2494,
|
||||||
|
#4991)."""
|
||||||
tokenizer = Tokenizer(en_vocab, suffix_search=en_tokenizer.suffix_search)
|
tokenizer = Tokenizer(en_vocab, suffix_search=en_tokenizer.suffix_search)
|
||||||
tokenizer_bytes = tokenizer.to_bytes()
|
tokenizer_bytes = tokenizer.to_bytes()
|
||||||
Tokenizer(en_vocab).from_bytes(tokenizer_bytes)
|
Tokenizer(en_vocab).from_bytes(tokenizer_bytes)
|
||||||
|
|
||||||
|
tokenizer = Tokenizer(en_vocab, rules={"ABC.": [{"ORTH": "ABC"}, {"ORTH": "."}]})
|
||||||
|
tokenizer.rules = {}
|
||||||
|
tokenizer_bytes = tokenizer.to_bytes()
|
||||||
|
tokenizer_reloaded = Tokenizer(en_vocab).from_bytes(tokenizer_bytes)
|
||||||
|
assert tokenizer_reloaded.rules == {}
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.skip(reason="Currently unreliable across platforms")
|
@pytest.mark.skip(reason="Currently unreliable across platforms")
|
||||||
@pytest.mark.parametrize("text", ["I💜you", "they’re", "“hello”"])
|
@pytest.mark.parametrize("text", ["I💜you", "they’re", "“hello”"])
|
||||||
|
|
|
@ -28,10 +28,10 @@ def test_displacy_parse_deps(en_vocab):
|
||||||
deps = displacy.parse_deps(doc)
|
deps = displacy.parse_deps(doc)
|
||||||
assert isinstance(deps, dict)
|
assert isinstance(deps, dict)
|
||||||
assert deps["words"] == [
|
assert deps["words"] == [
|
||||||
{"text": "This", "tag": "DET"},
|
{"lemma": None, "text": words[0], "tag": pos[0]},
|
||||||
{"text": "is", "tag": "AUX"},
|
{"lemma": None, "text": words[1], "tag": pos[1]},
|
||||||
{"text": "a", "tag": "DET"},
|
{"lemma": None, "text": words[2], "tag": pos[2]},
|
||||||
{"text": "sentence", "tag": "NOUN"},
|
{"lemma": None, "text": words[3], "tag": pos[3]},
|
||||||
]
|
]
|
||||||
assert deps["arcs"] == [
|
assert deps["arcs"] == [
|
||||||
{"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
|
{"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
|
||||||
|
@ -72,7 +72,7 @@ def test_displacy_rtl():
|
||||||
deps = ["foo", "bar", "foo", "baz"]
|
deps = ["foo", "bar", "foo", "baz"]
|
||||||
heads = [1, 0, 1, -2]
|
heads = [1, 0, 1, -2]
|
||||||
nlp = Persian()
|
nlp = Persian()
|
||||||
doc = get_doc(nlp.vocab, words=words, pos=pos, tags=pos, heads=heads, deps=deps)
|
doc = get_doc(nlp.vocab, words=words, tags=pos, heads=heads, deps=deps)
|
||||||
doc.ents = [Span(doc, 1, 3, label="TEST")]
|
doc.ents = [Span(doc, 1, 3, label="TEST")]
|
||||||
html = displacy.render(doc, page=True, style="dep")
|
html = displacy.render(doc, page=True, style="dep")
|
||||||
assert "direction: rtl" in html
|
assert "direction: rtl" in html
|
||||||
|
|
|
@ -4,8 +4,10 @@ import shutil
|
||||||
import contextlib
|
import contextlib
|
||||||
import srsly
|
import srsly
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
|
||||||
|
from spacy import Errors
|
||||||
from spacy.tokens import Doc, Span
|
from spacy.tokens import Doc, Span
|
||||||
from spacy.attrs import POS, HEAD, DEP
|
from spacy.attrs import POS, TAG, HEAD, DEP, LEMMA
|
||||||
|
|
||||||
|
|
||||||
@contextlib.contextmanager
|
@contextlib.contextmanager
|
||||||
|
@ -22,30 +24,56 @@ def make_tempdir():
|
||||||
shutil.rmtree(str(d))
|
shutil.rmtree(str(d))
|
||||||
|
|
||||||
|
|
||||||
def get_doc(vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=None):
|
def get_doc(
|
||||||
|
vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=None, lemmas=None
|
||||||
|
):
|
||||||
"""Create Doc object from given vocab, words and annotations."""
|
"""Create Doc object from given vocab, words and annotations."""
|
||||||
pos = pos or [""] * len(words)
|
if deps and not heads:
|
||||||
tags = tags or [""] * len(words)
|
heads = [0] * len(deps)
|
||||||
heads = heads or [0] * len(words)
|
headings = []
|
||||||
deps = deps or [""] * len(words)
|
values = []
|
||||||
for value in deps + tags + pos:
|
annotations = [pos, heads, deps, lemmas, tags]
|
||||||
|
possible_headings = [POS, HEAD, DEP, LEMMA, TAG]
|
||||||
|
for a, annot in enumerate(annotations):
|
||||||
|
if annot is not None:
|
||||||
|
if len(annot) != len(words):
|
||||||
|
raise ValueError(Errors.E189)
|
||||||
|
headings.append(possible_headings[a])
|
||||||
|
if annot is not heads:
|
||||||
|
values.extend(annot)
|
||||||
|
for value in values:
|
||||||
vocab.strings.add(value)
|
vocab.strings.add(value)
|
||||||
|
|
||||||
doc = Doc(vocab, words=words)
|
doc = Doc(vocab, words=words)
|
||||||
attrs = doc.to_array([POS, HEAD, DEP])
|
|
||||||
for i, (p, head, dep) in enumerate(zip(pos, heads, deps)):
|
# if there are any other annotations, set them
|
||||||
attrs[i, 0] = doc.vocab.strings[p]
|
if headings:
|
||||||
attrs[i, 1] = head
|
attrs = doc.to_array(headings)
|
||||||
attrs[i, 2] = doc.vocab.strings[dep]
|
|
||||||
doc.from_array([POS, HEAD, DEP], attrs)
|
j = 0
|
||||||
|
for annot in annotations:
|
||||||
|
if annot:
|
||||||
|
if annot is heads:
|
||||||
|
for i in range(len(words)):
|
||||||
|
if attrs.ndim == 1:
|
||||||
|
attrs[i] = heads[i]
|
||||||
|
else:
|
||||||
|
attrs[i, j] = heads[i]
|
||||||
|
else:
|
||||||
|
for i in range(len(words)):
|
||||||
|
if attrs.ndim == 1:
|
||||||
|
attrs[i] = doc.vocab.strings[annot[i]]
|
||||||
|
else:
|
||||||
|
attrs[i, j] = doc.vocab.strings[annot[i]]
|
||||||
|
j += 1
|
||||||
|
doc.from_array(headings, attrs)
|
||||||
|
|
||||||
|
# finally, set the entities
|
||||||
if ents:
|
if ents:
|
||||||
doc.ents = [
|
doc.ents = [
|
||||||
Span(doc, start, end, label=doc.vocab.strings[label])
|
Span(doc, start, end, label=doc.vocab.strings[label])
|
||||||
for start, end, label in ents
|
for start, end, label in ents
|
||||||
]
|
]
|
||||||
if tags:
|
|
||||||
for token in doc:
|
|
||||||
token.tag_ = tags[token.i]
|
|
||||||
return doc
|
return doc
|
||||||
|
|
||||||
|
|
||||||
|
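Reviewer note on the new `get_doc`: each provided annotation layer must now have exactly one value per word, which is why the `deps` list in test_issue4590 grew from 8 to 9 entries. The sketch below mirrors only that validation step in plain Python; `select_headings` and the string layer names are hypothetical, and the real helper raises `ValueError(Errors.E189)` and works with attribute IDs.

```python
def select_headings(words, annotations):
    """Return the names of the annotation layers that were provided.

    `annotations` is a list of (name, values) pairs; `values` may be None.
    Hypothetical helper mirroring only the length check in get_doc.
    """
    headings = []
    for name, values in annotations:
        if values is None:
            continue  # layer not provided, so it gets no column
        if len(values) != len(words):
            raise ValueError(
                "annotation %r has %d values for %d words"
                % (name, len(values), len(words))
            )
        headings.append(name)
    return headings


words = "The quick brown fox jumped over the lazy fox".split()
heads = [3, 2, 1, 1, 0, -1, 2, 1, -3]
old_deps = ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"]
new_deps = ["det", "amod", "amod", "nsubj", "ROOT", "prep", "det", "amod", "pobj"]

# The corrected annotations pass and yield one heading per provided layer
headings = select_headings(words, [("POS", None), ("HEAD", heads), ("DEP", new_deps)])

# The old 8-entry deps list is rejected for a 9-word text
try:
    select_headings(words, [("HEAD", heads), ("DEP", old_deps)])
    old_deps_rejected = False
except ValueError:
    old_deps_rejected = True
```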
@ -86,8 +114,7 @@ def assert_docs_equal(doc1, doc2):
|
||||||
|
|
||||||
assert [t.head.i for t in doc1] == [t.head.i for t in doc2]
|
assert [t.head.i for t in doc1] == [t.head.i for t in doc2]
|
||||||
assert [t.dep for t in doc1] == [t.dep for t in doc2]
|
assert [t.dep for t in doc1] == [t.dep for t in doc2]
|
||||||
if doc1.is_parsed and doc2.is_parsed:
|
assert [t.is_sent_start for t in doc1] == [t.is_sent_start for t in doc2]
|
||||||
assert [s for s in doc1.sents] == [s for s in doc2.sents]
|
|
||||||
|
|
||||||
assert [t.ent_type for t in doc1] == [t.ent_type for t in doc2]
|
assert [t.ent_type for t in doc1] == [t.ent_type for t in doc2]
|
||||||
assert [t.ent_iob for t in doc1] == [t.ent_iob for t in doc2]
|
assert [t.ent_iob for t in doc1] == [t.ent_iob for t in doc2]
|
||||||
|
|
|
@ -699,6 +699,7 @@ cdef class Tokenizer:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/tokenizer#to_disk
|
DOCS: https://spacy.io/api/tokenizer#to_disk
|
||||||
"""
|
"""
|
||||||
|
path = util.ensure_path(path)
|
||||||
with path.open("wb") as file_:
|
with path.open("wb") as file_:
|
||||||
file_.write(self.to_bytes(**kwargs))
|
file_.write(self.to_bytes(**kwargs))
|
||||||
|
|
||||||
|
@ -712,6 +713,7 @@ cdef class Tokenizer:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/tokenizer#from_disk
|
DOCS: https://spacy.io/api/tokenizer#from_disk
|
||||||
"""
|
"""
|
||||||
|
path = util.ensure_path(path)
|
||||||
with path.open("rb") as file_:
|
with path.open("rb") as file_:
|
||||||
bytes_data = file_.read()
|
bytes_data = file_.read()
|
||||||
self.from_bytes(bytes_data, **kwargs)
|
self.from_bytes(bytes_data, **kwargs)
|
||||||
|
@ -756,21 +758,20 @@ cdef class Tokenizer:
|
||||||
}
|
}
|
||||||
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
||||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||||
if data.get("prefix_search"):
|
if "prefix_search" in data and isinstance(data["prefix_search"], str):
|
||||||
self.prefix_search = re.compile(data["prefix_search"]).search
|
self.prefix_search = re.compile(data["prefix_search"]).search
|
||||||
if data.get("suffix_search"):
|
if "suffix_search" in data and isinstance(data["suffix_search"], str):
|
||||||
self.suffix_search = re.compile(data["suffix_search"]).search
|
self.suffix_search = re.compile(data["suffix_search"]).search
|
||||||
if data.get("infix_finditer"):
|
if "infix_finditer" in data and isinstance(data["infix_finditer"], str):
|
||||||
self.infix_finditer = re.compile(data["infix_finditer"]).finditer
|
self.infix_finditer = re.compile(data["infix_finditer"]).finditer
|
||||||
if data.get("token_match"):
|
if "token_match" in data and isinstance(data["token_match"], str):
|
||||||
self.token_match = re.compile(data["token_match"]).match
|
self.token_match = re.compile(data["token_match"]).match
|
||||||
if data.get("rules"):
|
if "rules" in data and isinstance(data["rules"], dict):
|
||||||
# make sure to hard reset the cache to remove data from the default exceptions
|
# make sure to hard reset the cache to remove data from the default exceptions
|
||||||
self._rules = {}
|
self._rules = {}
|
||||||
self._flush_cache()
|
self._flush_cache()
|
||||||
self._flush_specials()
|
self._flush_specials()
|
||||||
self._load_special_cases(data.get("rules", {}))
|
self._load_special_cases(data["rules"])
|
||||||
|
|
||||||
return self
|
return self
|
||||||
|
|
||||||
|
|
||||||
|
|
|
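Reviewer note on the `from_bytes` change: the old `data.get("rules")` check is falsy for an empty dict, so a tokenizer serialized with `rules = {}` would keep its default rules after loading. The new membership-plus-type check treats an empty value as present. A minimal sketch of the difference, using hypothetical helper names and a plain dict in place of the deserialized message:

```python
def should_load_old(data, key):
    # Old check: relies on truthiness, so {} and "" are skipped
    return bool(data.get(key))


def should_load_new(data, key, expected_type):
    # New check: key must be present with the expected type, even if empty
    return key in data and isinstance(data[key], expected_type)


serialized = {"rules": {}, "prefix_search": None}

old_loads_rules = should_load_old(serialized, "rules")
new_loads_rules = should_load_new(serialized, "rules", dict)

# A None entry (function not defined on the tokenizer) is skipped by both
old_loads_prefix = should_load_old(serialized, "prefix_search")
new_loads_prefix = should_load_new(serialized, "prefix_search", str)
```

This is the behavior the extended `test_serialize_custom_tokenizer` exercises by asserting `tokenizer_reloaded.rules == {}`.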
@ -213,6 +213,10 @@ def _merge(Doc doc, merges):
|
||||||
new_orth = ''.join([t.text_with_ws for t in spans[token_index]])
|
new_orth = ''.join([t.text_with_ws for t in spans[token_index]])
|
||||||
if spans[token_index][-1].whitespace_:
|
if spans[token_index][-1].whitespace_:
|
||||||
new_orth = new_orth[:-len(spans[token_index][-1].whitespace_)]
|
new_orth = new_orth[:-len(spans[token_index][-1].whitespace_)]
|
||||||
|
# add the vector of the (merged) entity to the vocab
|
||||||
|
if not doc.vocab.get_vector(new_orth).any():
|
||||||
|
if doc.vocab.vectors_length > 0:
|
||||||
|
doc.vocab.set_vector(new_orth, span.vector)
|
||||||
token = tokens[token_index]
|
token = tokens[token_index]
|
||||||
lex = doc.vocab.get(doc.mem, new_orth)
|
lex = doc.vocab.get(doc.mem, new_orth)
|
||||||
token.lex = lex
|
token.lex = lex
|
||||||
|
|
|
@ -19,7 +19,7 @@ from ..lexeme cimport Lexeme, EMPTY_LEXEME
|
||||||
from ..typedefs cimport attr_t, flags_t
|
from ..typedefs cimport attr_t, flags_t
|
||||||
from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER
|
from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER
|
||||||
from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
|
from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
|
||||||
from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, attr_id_t
|
from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, IDX, attr_id_t
|
||||||
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
|
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
|
||||||
|
|
||||||
from ..attrs import intify_attrs, IDS
|
from ..attrs import intify_attrs, IDS
|
||||||
|
@ -68,6 +68,8 @@ cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
|
||||||
return token.ent_id
|
return token.ent_id
|
||||||
elif feat_name == ENT_KB_ID:
|
elif feat_name == ENT_KB_ID:
|
||||||
return token.ent_kb_id
|
return token.ent_kb_id
|
||||||
|
elif feat_name == IDX:
|
||||||
|
return token.idx
|
||||||
else:
|
else:
|
||||||
return Lexeme.get_struct_attr(token.lex, feat_name)
|
return Lexeme.get_struct_attr(token.lex, feat_name)
|
||||||
|
|
||||||
|
@ -253,7 +255,7 @@ cdef class Doc:
|
||||||
def is_nered(self):
|
def is_nered(self):
|
||||||
"""Check if the document has named entities set. Will return True if
|
"""Check if the document has named entities set. Will return True if
|
||||||
*any* of the tokens has a named entity tag set (even if the others are
|
*any* of the tokens has a named entity tag set (even if the others are
|
||||||
unknown values).
|
unknown values), or if the document is empty.
|
||||||
"""
|
"""
|
||||||
if len(self) == 0:
|
if len(self) == 0:
|
||||||
return True
|
return True
|
||||||
|
@ -778,10 +780,12 @@ cdef class Doc:
|
||||||
# Allow strings, e.g. 'lemma' or 'LEMMA'
|
# Allow strings, e.g. 'lemma' or 'LEMMA'
|
||||||
attrs = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_)
|
attrs = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_)
|
||||||
for id_ in attrs]
|
for id_ in attrs]
|
||||||
|
if array.dtype != numpy.uint64:
|
||||||
|
user_warning(Warnings.W028.format(type=array.dtype))
|
||||||
|
|
||||||
if SENT_START in attrs and HEAD in attrs:
|
if SENT_START in attrs and HEAD in attrs:
|
||||||
raise ValueError(Errors.E032)
|
raise ValueError(Errors.E032)
|
||||||
cdef int i, col
|
cdef int i, col, abs_head_index
|
||||||
cdef attr_id_t attr_id
|
cdef attr_id_t attr_id
|
||||||
cdef TokenC* tokens = self.c
|
cdef TokenC* tokens = self.c
|
||||||
cdef int length = len(array)
|
cdef int length = len(array)
|
||||||
|
@ -795,6 +799,14 @@ cdef class Doc:
|
||||||
attr_ids[i] = attr_id
|
attr_ids[i] = attr_id
|
||||||
if len(array.shape) == 1:
|
if len(array.shape) == 1:
|
||||||
array = array.reshape((array.size, 1))
|
array = array.reshape((array.size, 1))
|
||||||
|
# Check that all heads are within the document bounds
|
||||||
|
if HEAD in attrs:
|
||||||
|
col = attrs.index(HEAD)
|
||||||
|
for i in range(length):
|
||||||
|
# cast index to signed int
|
||||||
|
abs_head_index = numpy.int32(array[i, col]) + i
|
||||||
|
if abs_head_index < 0 or abs_head_index >= length:
|
||||||
|
raise ValueError(Errors.E190.format(index=i, value=array[i, col], rel_head_index=numpy.int32(array[i, col])))
|
||||||
# Do TAG first. This lets subsequent loop override stuff like POS, LEMMA
|
# Do TAG first. This lets subsequent loop override stuff like POS, LEMMA
|
||||||
if TAG in attrs:
|
if TAG in attrs:
|
||||||
col = attrs.index(TAG)
|
col = attrs.index(TAG)
|
||||||
|
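Reviewer note on the head bounds check: `HEAD` values are relative offsets stored in an unsigned array, so a negative offset has to be read back as a signed integer before adding the token index. The sketch below reproduces that check in plain Python under the assumption that the cast behaves like a two's-complement reinterpretation of the low 32 bits; `to_int32` and `check_heads` are hypothetical names, and the real code raises `ValueError(Errors.E190)`.

```python
UINT64 = 2 ** 64


def to_int32(value):
    """Reinterpret the low 32 bits of an unsigned integer as a signed int32."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def check_heads(stored_heads):
    """Raise if any relative head points outside the document."""
    length = len(stored_heads)
    for i, raw in enumerate(stored_heads):
        rel = to_int32(raw)
        abs_head = i + rel  # absolute index of the head token
        if abs_head < 0 or abs_head >= length:
            raise ValueError(
                "token %d: relative head %d points outside the document" % (i, rel)
            )
    return True


# Corrected heads from test_issue2772 (14 tokens), stored as unsigned values
heads = [4, 1, 7, -1, -2, -1, 3, 2, 1, 0, 2, 1, -3, -4]
stored = [h % UINT64 for h in heads]
all_in_bounds = check_heads(stored)
```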
@ -865,7 +877,7 @@ cdef class Doc:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/doc#to_bytes
|
DOCS: https://spacy.io/api/doc#to_bytes
|
||||||
"""
|
"""
|
||||||
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID] # TODO: ENT_KB_ID ?
|
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID, NORM] # TODO: ENT_KB_ID ?
|
||||||
if self.is_tagged:
|
if self.is_tagged:
|
||||||
array_head.extend([TAG, POS])
|
array_head.extend([TAG, POS])
|
||||||
# If doc parsed add head and dep attribute
|
# If doc parsed add head and dep attribute
|
||||||
|
@ -1166,6 +1178,7 @@ cdef int set_children_from_heads(TokenC* tokens, int length) except -1:
|
||||||
heads_within_sents = _set_lr_kids_and_edges(tokens, length, loop_count)
|
heads_within_sents = _set_lr_kids_and_edges(tokens, length, loop_count)
|
||||||
if loop_count > 10:
|
if loop_count > 10:
|
||||||
warnings.warn(Warnings.W026)
|
warnings.warn(Warnings.W026)
|
||||||
|
break
|
||||||
loop_count += 1
|
loop_count += 1
|
||||||
# Set sentence starts
|
# Set sentence starts
|
||||||
for i in range(length):
|
for i in range(length):
|
||||||
|
|
|
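Reviewer note on the added `break`: without it, the loop would keep warning on every pass once `loop_count` exceeded 10 and never terminate if the heads could not be made consistent. A minimal sketch of the same control flow, with a hypothetical `step` callable standing in for `_set_lr_kids_and_edges` and a placeholder warning message:

```python
import warnings


def iterate_until_stable(step, max_loops=10):
    """Call `step` until it reports success, giving up after too many passes."""
    loop_count = 0
    stable = False
    while not stable:
        stable = step()
        if loop_count > max_loops:
            warnings.warn("giving up: structure did not stabilize")
            break  # the fix: stop instead of looping forever
        loop_count += 1
    return stable, loop_count


calls = []


def never_stable():
    # Hypothetical worst case: the step never succeeds
    calls.append(1)
    return False


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = iterate_until_stable(never_stable)
```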
@ -626,6 +626,9 @@ cdef class Token:
|
||||||
# This function sets the head of self to new_head and updates the
|
# This function sets the head of self to new_head and updates the
|
||||||
# counters for left/right dependents and left/right corner for the
|
# counters for left/right dependents and left/right corner for the
|
||||||
# new and the old head
|
# new and the old head
|
||||||
|
# Check that token is from the same document
|
||||||
|
if self.doc != new_head.doc:
|
||||||
|
raise ValueError(Errors.E191)
|
||||||
# Do nothing if old head is new head
|
# Do nothing if old head is new head
|
||||||
if self.i + self.c.head == new_head.i:
|
if self.i + self.c.head == new_head.i:
|
||||||
return
|
return
|
||||||
|
|
|
@ -76,6 +76,14 @@ class Underscore(object):
|
||||||
def _get_key(self, name):
|
def _get_key(self, name):
|
||||||
return ("._.", name, self._start, self._end)
|
return ("._.", name, self._start, self._end)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def get_state(cls):
|
||||||
|
return cls.token_extensions, cls.span_extensions, cls.doc_extensions
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def load_state(cls, state):
|
||||||
|
cls.token_extensions, cls.span_extensions, cls.doc_extensions = state
|
||||||
|
|
||||||
|
|
||||||
def get_ext_args(**kwargs):
|
def get_ext_args(**kwargs):
|
||||||
"""Validate and convert arguments. Reused in Doc, Token and Span."""
|
"""Validate and convert arguments. Reused in Doc, Token and Span."""
|
||||||
|
|
|
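Reviewer note on `get_state` / `load_state`: extension registries live as class attributes, so a freshly started worker process (as exercised with `n_process=2` in test_issue4903) would presumably not see extensions registered in the parent unless that state is passed along and restored. The sketch below shows the snapshot-and-restore pattern on a hypothetical `Registry` class, with a pickle round trip standing in for the process boundary; the stored tuple is an illustrative placeholder.

```python
import pickle


class Registry:
    # Class-level registries, shared by all instances in one process
    token_extensions = {}
    span_extensions = {}
    doc_extensions = {}

    @classmethod
    def get_state(cls):
        return cls.token_extensions, cls.span_extensions, cls.doc_extensions

    @classmethod
    def load_state(cls, state):
        cls.token_extensions, cls.span_extensions, cls.doc_extensions = state


# "Parent process": register an extension and snapshot the state
Registry.doc_extensions["my_ext"] = (None, None, None, None)
payload = pickle.dumps(Registry.get_state())

# "Worker process": registries start out empty
Registry.load_state(({}, {}, {}))
seen_before_restore = "my_ext" in Registry.doc_extensions

# Restoring the snapshot makes the extension visible again
Registry.load_state(pickle.loads(payload))
seen_after_restore = "my_ext" in Registry.doc_extensions
```

Note that this only works when the registered values can be pickled.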
@ -349,44 +349,6 @@ cdef class Vectors:
|
||||||
for i in range(len(queries)) ], dtype="uint64")
|
for i in range(len(queries)) ], dtype="uint64")
|
||||||
return (keys, best_rows, scores)
|
return (keys, best_rows, scores)
|
||||||
|
|
||||||
def from_glove(self, path):
|
|
||||||
"""Load GloVe vectors from a directory. Assumes binary format,
|
|
||||||
that the vocab is in a vocab.txt, and that vectors are named
|
|
||||||
vectors.{size}.[fd].bin, e.g. vectors.128.f.bin for 128d float32
|
|
||||||
vectors, vectors.300.d.bin for 300d float64 (double) vectors, etc.
|
|
||||||
By default GloVe outputs 64-bit vectors.
|
|
||||||
|
|
||||||
path (unicode / Path): The path to load the GloVe vectors from.
|
|
||||||
RETURNS: A `StringStore` object, holding the key-to-string mapping.
|
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/vectors#from_glove
|
|
||||||
"""
|
|
||||||
path = util.ensure_path(path)
|
|
||||||
width = None
|
|
||||||
for name in path.iterdir():
|
|
||||||
if name.parts[-1].startswith("vectors"):
|
|
||||||
_, dims, dtype, _2 = name.parts[-1].split('.')
|
|
||||||
width = int(dims)
|
|
||||||
break
|
|
||||||
else:
|
|
||||||
raise IOError(Errors.E061.format(filename=path))
|
|
||||||
bin_loc = path / f"vectors.{dims}.{dtype}.bin"
|
|
||||||
xp = get_array_module(self.data)
|
|
||||||
self.data = None
|
|
||||||
with bin_loc.open("rb") as file_:
|
|
||||||
self.data = xp.fromfile(file_, dtype=dtype)
|
|
||||||
if dtype != "float32":
|
|
||||||
self.data = xp.ascontiguousarray(self.data, dtype="float32")
|
|
||||||
if self.data.ndim == 1:
|
|
||||||
self.data = self.data.reshape((self.data.size//width, width))
|
|
||||||
n = 0
|
|
||||||
strings = StringStore()
|
|
||||||
with (path / "vocab.txt").open("r") as file_:
|
|
||||||
for i, line in enumerate(file_):
|
|
||||||
key = strings.add(line.strip())
|
|
||||||
self.add(key, row=i)
|
|
||||||
return strings
|
|
||||||
|
|
||||||
def to_disk(self, path, **kwargs):
|
def to_disk(self, path, **kwargs):
|
||||||
"""Save the current state to a directory.
|
"""Save the current state to a directory.
|
||||||
|
|
||||||
|
|
|
@ -109,9 +109,9 @@ links) and check whether they are compatible with the currently installed
|
||||||
version of spaCy. Should be run after upgrading spaCy via `pip install -U spacy`
|
version of spaCy. Should be run after upgrading spaCy via `pip install -U spacy`
|
||||||
to ensure that all installed models can be used with the new version. The
|
to ensure that all installed models can be used with the new version. The
|
||||||
command is also useful to detect out-of-sync model links resulting from links
|
command is also useful to detect out-of-sync model links resulting from links
|
||||||
created in different virtual environments. It will a list of models, the
|
created in different virtual environments. It will show a list of models and
|
||||||
installed versions, the latest compatible version (if out of date) and the
|
their installed versions. If any model is out of date, the latest compatible
|
||||||
commands for updating.
|
versions and command for updating are shown.
|
||||||
|
|
||||||
> #### Automated validation
|
> #### Automated validation
|
||||||
>
|
>
|
||||||
|
@ -176,7 +176,7 @@ All output files generated by this command are compatible with
|
||||||
|
|
||||||
## Debug data {#debug-data new="2.2"}
|
## Debug data {#debug-data new="2.2"}
|
||||||
|
|
||||||
Analyze, debug and validate your training and development data, get useful
|
Analyze, debug, and validate your training and development data. Get useful
|
||||||
stats, and find problems like invalid entity annotations, cyclic dependencies,
|
stats, and find problems like invalid entity annotations, cyclic dependencies,
|
||||||
low data labels and more.
|
low data labels and more.
|
||||||
|
|
||||||
|
@ -185,10 +185,11 @@ $ python -m spacy debug-data [lang] [train_path] [dev_path] [--base-model] [--pi
|
||||||
```
|
```
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Argument | Type | Description |
|
||||||
| -------------------------- | ---------- | -------------------------------------------------------------------------------------------------- |
|
| ------------------------------------------------------ | ---------- | -------------------------------------------------------------------------------------------------- |
|
||||||
| `lang` | positional | Model language. |
|
| `lang` | positional | Model language. |
|
||||||
| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. |
|
| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. |
|
||||||
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
|
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
|
||||||
|
| `--tag-map-path`, `-tm` <Tag variant="new">2.2.3</Tag> | option | Location of JSON-formatted tag map. |
|
||||||
| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
|
| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
|
||||||
| `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
|
| `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
|
||||||
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
|
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
|
||||||
|
@ -368,6 +369,7 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
|
||||||
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
|
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
|
||||||
| `--base-model`, `-b` <Tag variant="new">2.1</Tag> | option | Optional name of base model to update. Can be any loadable spaCy model. |
|
| `--base-model`, `-b` <Tag variant="new">2.1</Tag> | option | Optional name of base model to update. Can be any loadable spaCy model. |
|
||||||
| `--pipeline`, `-p` <Tag variant="new">2.1</Tag> | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
|
| `--pipeline`, `-p` <Tag variant="new">2.1</Tag> | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
|
||||||
|
| `--replace-components`, `-R` | flag | Replace components from the base model. |
|
||||||
| `--vectors`, `-v` | option | Model to load vectors from. |
|
| `--vectors`, `-v` | option | Model to load vectors from. |
|
||||||
| `--n-iter`, `-n` | option | Number of iterations (default: `30`). |
|
| `--n-iter`, `-n` | option | Number of iterations (default: `30`). |
|
||||||
| `--n-early-stopping`, `-ne` | option | Maximum number of training epochs without dev accuracy improvement. |
|
| `--n-early-stopping`, `-ne` | option | Maximum number of training epochs without dev accuracy improvement. |
|
||||||
|
@ -378,6 +380,13 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
|
||||||
| `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. |
|
| `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. |
|
||||||
| `--parser-multitasks`, `-pt` | option | Side objectives for parser CNN, e.g. `'dep'` or `'dep,tag'` |
|
| `--parser-multitasks`, `-pt` | option | Side objectives for parser CNN, e.g. `'dep'` or `'dep,tag'` |
|
||||||
| `--entity-multitasks`, `-et` | option | Side objectives for NER CNN, e.g. `'dep'` or `'dep,tag'` |
|
| `--entity-multitasks`, `-et` | option | Side objectives for NER CNN, e.g. `'dep'` or `'dep,tag'` |
|
||||||
|
| `--width`, `-cw` <Tag variant="new">2.2.4</Tag> | option | Width of CNN layers of `Tok2Vec` component. |
|
||||||
|
| `--conv-depth`, `-cd` <Tag variant="new">2.2.4</Tag> | option | Depth of CNN layers of `Tok2Vec` component. |
|
||||||
|
| `--cnn-window`, `-cW` <Tag variant="new">2.2.4</Tag> | option | Window size for CNN layers of `Tok2Vec` component. |
|
||||||
|
| `--cnn-pieces`, `-cP` <Tag variant="new">2.2.4</Tag> | option | Maxout size for CNN layers of `Tok2Vec` component. |
|
||||||
|
| `--use-chars`, `-chr` <Tag variant="new">2.2.4</Tag> | flag | Whether to use character-based embedding of `Tok2Vec` component. |
|
||||||
|
| `--bilstm-depth`, `-lstm` <Tag variant="new">2.2.4</Tag> | option | Depth of BiLSTM layers of `Tok2Vec` component (requires PyTorch). |
|
||||||
|
| `--embed-rows`, `-er` <Tag variant="new">2.2.4</Tag> | option | Number of embedding rows of `Tok2Vec` component. |
|
||||||
| `--noise-level`, `-nl` | option | Float indicating the amount of corruption for data augmentation. |
|
| `--noise-level`, `-nl` | option | Float indicating the amount of corruption for data augmentation. |
|
||||||
| `--orth-variant-level`, `-ovl` <Tag variant="new">2.2</Tag> | option | Float indicating the orthography variation for data augmentation (e.g. `0.3` for making 30% of occurrences of some tokens subject to replacement). |
|
| `--orth-variant-level`, `-ovl` <Tag variant="new">2.2</Tag> | option | Float indicating the orthography variation for data augmentation (e.g. `0.3` for making 30% of occurrences of some tokens subject to replacement). |
|
||||||
| `--gold-preproc`, `-G` | flag | Use gold preprocessing. |
|
| `--gold-preproc`, `-G` | flag | Use gold preprocessing. |
|
||||||
|
@ -385,6 +394,7 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
|
||||||
| `--textcat-multilabel`, `-TML` <Tag variant="new">2.2</Tag> | flag | Text classification classes aren't mutually exclusive (multilabel). |
|
| `--textcat-multilabel`, `-TML` <Tag variant="new">2.2</Tag> | flag | Text classification classes aren't mutually exclusive (multilabel). |
|
||||||
| `--textcat-arch`, `-ta` <Tag variant="new">2.2</Tag> | option | Text classification model architecture. Defaults to `"bow"`. |
|
| `--textcat-arch`, `-ta` <Tag variant="new">2.2</Tag> | option | Text classification model architecture. Defaults to `"bow"`. |
|
||||||
| `--textcat-positive-label`, `-tpl` <Tag variant="new">2.2</Tag> | option | Text classification positive label for binary classes with two labels. |
|
| `--textcat-positive-label`, `-tpl` <Tag variant="new">2.2</Tag> | option | Text classification positive label for binary classes with two labels. |
|
||||||
|
| `--tag-map-path`, `-tm` <Tag variant="new">2.2.4</Tag> | option | Location of JSON-formatted tag map. |
|
||||||
| `--verbose`, `-VV` <Tag variant="new">2.0.13</Tag> | flag | Show more detailed messages during training. |
|
| `--verbose`, `-VV` <Tag variant="new">2.0.13</Tag> | flag | Show more detailed messages during training. |
|
||||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||||
| **CREATES** | model, pickle | A spaCy model on each epoch. |
|
| **CREATES** | model, pickle | A spaCy model on each epoch. |
|
||||||
|
|
|
@ -7,9 +7,10 @@ source: spacy/tokens/doc.pyx
|
||||||
|
|
||||||
A `Doc` is a sequence of [`Token`](/api/token) objects. Access sentences and
|
A `Doc` is a sequence of [`Token`](/api/token) objects. Access sentences and
|
||||||
named entities, export annotations to numpy arrays, losslessly serialize to
|
named entities, export annotations to numpy arrays, losslessly serialize to
|
||||||
compressed binary strings. The `Doc` object holds an array of [`TokenC`](/api/cython-structs#tokenc) structs.
|
compressed binary strings. The `Doc` object holds an array of
|
||||||
The Python-level `Token` and [`Span`](/api/span) objects are views of this
|
[`TokenC`](/api/cython-structs#tokenc) structs. The Python-level `Token` and
|
||||||
array, i.e. they don't own the data themselves.
|
[`Span`](/api/span) objects are views of this array, i.e. they don't own the
|
||||||
|
data themselves.
|
||||||
|
|
||||||
## Doc.\_\_init\_\_ {#init tag="method"}
|
## Doc.\_\_init\_\_ {#init tag="method"}
|
||||||
|
|
||||||
|
@ -198,10 +199,11 @@ the character indices don't map to a valid span.
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ---------------------------------------- | ------------------------------------------------------- |
|
| ------------------------------------ | ---------------------------------------- | --------------------------------------------------------------------- |
|
||||||
| `start` | int | The index of the first character of the span. |
|
| `start` | int | The index of the first character of the span. |
|
||||||
| `end` | int | The index of the first character after the span. |
|
| `end` | int | The index of the first character after the span. |
|
||||||
| `label` | uint64 / unicode | A label to attach to the Span, e.g. for named entities. |
|
| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. |
|
||||||
|
| `kb_id` <Tag variant="new">2.2</Tag> | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. |
|
||||||
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
|
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
|
||||||
| **RETURNS** | `Span` | The newly constructed object or `None`. |
|
| **RETURNS** | `Span` | The newly constructed object or `None`. |
|
||||||
|
|
||||||
|
@ -655,10 +657,10 @@ The L2 norm of the document's vector representation.
|
||||||
| `user_data` | - | A generic storage area, for user custom data. |
|
| `user_data` | - | A generic storage area, for user custom data. |
|
||||||
| `lang` <Tag variant="new">2.1</Tag> | int | Language of the document's vocabulary. |
|
| `lang` <Tag variant="new">2.1</Tag> | int | Language of the document's vocabulary. |
|
||||||
| `lang_` <Tag variant="new">2.1</Tag> | unicode | Language of the document's vocabulary. |
|
| `lang_` <Tag variant="new">2.1</Tag> | unicode | Language of the document's vocabulary. |
|
||||||
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. |
|
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. Returns `True` if the `Doc` is empty. |
|
||||||
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. |
|
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. Returns `True` if the `Doc` is empty. |
|
||||||
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. |
|
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. Returns `True` if the `Doc` is empty. |
|
||||||
| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if _any_ of the tokens has an entity tag set, even if the others are unknown. |
|
| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if the `Doc` is empty, or if _any_ of the tokens has an entity tag set, even if the others are unknown. |
|
||||||
| `sentiment` | float | The document's positivity/negativity score, if available. |
|
| `sentiment` | float | The document's positivity/negativity score, if available. |
|
||||||
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
|
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
|
||||||
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
|
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
|
||||||
|
|
|
@ -83,7 +83,8 @@ Find matches in the `Doc` and add them to the `doc.ents`. Typically, this
|
||||||
happens automatically after the component has been added to the pipeline using
|
happens automatically after the component has been added to the pipeline using
|
||||||
[`nlp.add_pipe`](/api/language#add_pipe). If the entity ruler was initialized
|
[`nlp.add_pipe`](/api/language#add_pipe). If the entity ruler was initialized
|
||||||
with `overwrite_ents=True`, existing entities will be replaced if they overlap
|
with `overwrite_ents=True`, existing entities will be replaced if they overlap
|
||||||
with the matches.
|
with the matches. When matches overlap in a Doc, the entity ruler prioritizes longer
|
||||||
|
patterns over shorter ones; if lengths are equal, the match occurring first in the Doc is chosen.
|
||||||
|
|
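The priority rule described above (longer matches win, ties go to the match that starts first) can be sketched in plain Python. This is a minimal illustration, not the `EntityRuler` implementation: the function name `resolve_overlaps` and the `(start, end)` tuple layout are assumptions made for the example, with `end` exclusive as in token slices.

```python
def resolve_overlaps(matches):
    """Pick non-overlapping matches, preferring longer ones, then earlier ones.

    `matches` is a list of hypothetical (start, end) token offsets,
    where `end` is exclusive.
    """
    # Longest match first; among equally long matches, the earliest start first.
    ordered = sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0]))
    seen_tokens = set()
    kept = []
    for start, end in ordered:
        # Skip any match that shares a token with one already accepted.
        if any(i in seen_tokens for i in range(start, end)):
            continue
        kept.append((start, end))
        seen_tokens.update(range(start, end))
    # Return the accepted matches in document order.
    return sorted(kept)
```

For example, given the candidates `(0, 2)`, `(1, 4)` and `(4, 5)`, the three-token match `(1, 4)` is accepted first, so `(0, 2)` is dropped because it shares token 1 with it.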
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
|
|
@ -172,6 +172,28 @@ Remove a previously registered extension.
|
||||||
| `name` | unicode | Name of the extension. |
|
| `name` | unicode | Name of the extension. |
|
||||||
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
|
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
|
||||||
|
|
||||||
|
## Span.char_span {#char_span tag="method" new="2.2.4"}
|
||||||
|
|
||||||
|
Create a `Span` object from the slice `span.text[start:end]`. Returns `None` if
|
||||||
|
the character indices don't map to a valid span.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> doc = nlp("I like New York")
|
||||||
|
> span = doc[1:4].char_span(5, 13, label="GPE")
|
||||||
|
> assert span.text == "New York"
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Type | Description |
|
||||||
|
| ----------- | ---------------------------------------- | --------------------------------------------------------------------- |
|
||||||
|
| `start` | int | The index of the first character of the span. |
|
||||||
|
| `end` | int | The index of the first character after the span. |
|
||||||
|
| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. |
|
||||||
|
| `kb_id` | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. |
|
||||||
|
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
|
||||||
|
| **RETURNS** | `Span` | The newly constructed object or `None`. |
|
||||||
|
|
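To make the "returns `None` if the character indices don't map to a valid span" behaviour concrete, here is a small sketch of the offset-to-token alignment in plain Python. It is an illustration under stated assumptions, not spaCy's code: the helper name `char_offsets_to_tokens` and the `(text, trailing_whitespace)` token tuples are invented for the example.

```python
def char_offsets_to_tokens(tokens, start, end):
    """Map character offsets to a (first_token, last_token + 1) slice.

    `tokens` is a hypothetical list of (text, trailing_whitespace) pairs.
    Returns None when `start` or `end` falls inside a token.
    """
    token_starts = {}  # character offset -> index of the token starting there
    token_ends = {}    # character offset -> index just past the token ending there
    offset = 0
    for i, (text, whitespace) in enumerate(tokens):
        token_starts[offset] = i
        offset += len(text)
        token_ends[offset] = i + 1
        offset += len(whitespace)
    if start not in token_starts or end not in token_ends:
        return None
    first, last = token_starts[start], token_ends[end]
    return (first, last) if first < last else None


# Mirrors the documented example: the span "like New York", offsets 5 to 13.
span_tokens = [("like", " "), ("New", " "), ("York", "")]
aligned = char_offsets_to_tokens(span_tokens, 5, 13)
misaligned = char_offsets_to_tokens(span_tokens, 6, 13)
```

Here `aligned` covers the tokens "New" and "York", while `misaligned` is `None` because offset 6 falls in the middle of "New".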
||||||
## Span.similarity {#similarity tag="method" model="vectors"}
|
## Span.similarity {#similarity tag="method" model="vectors"}
|
||||||
|
|
||||||
Make a semantic similarity estimate. The default estimate is cosine similarity
|
Make a semantic similarity estimate. The default estimate is cosine similarity
|
||||||
|
@ -294,7 +316,7 @@ Create a new `Doc` object corresponding to the `Span`, with a copy of the data.
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------------- | ----- | ---------------------------------------------------- |
|
| ---------------- | ----- | ---------------------------------------------------- |
|
||||||
| `copy_user_data` | bool | Whether or not to copy the original doc's user data. |
|
| `copy_user_data` | bool | Whether or not to copy the original doc's user data. |
|
||||||
| **RETURNS** | `Doc` | A `Doc` object of the `Span`'s content. |
|
| **RETURNS** | `Doc` | A `Doc` object of the `Span`'s content. |
|
||||||
|
|
||||||
|
|
|
@ -437,8 +437,8 @@ The L2 norm of the token's vector representation.
|
||||||
| `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
|
| `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
|
||||||
| `lower` | int | Lowercase form of the token. |
|
| `lower` | int | Lowercase form of the token. |
|
||||||
| `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
|
| `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
|
||||||
| `shape` | int | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
|
| `shape` | int | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
|
||||||
| `shape_` | unicode | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
|
| `shape_` | unicode | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
|
||||||
| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
|
| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
|
||||||
| `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. |
|
| `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. |
|
||||||
| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |
|
| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |
|
||||||
|
|
|
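The `shape` transform described in the table rows above can be approximated in a few lines of plain Python. This is a sketch of the rule as the description states it, not spaCy's own function; the name `word_shape_sketch` is hypothetical, and edge cases that spaCy may handle differently (such as very long tokens) are ignored.

```python
def word_shape_sketch(text):
    """Approximate the documented shape rule for a token string."""
    shape = []
    last = ""
    run_length = 0
    for char in text:
        if char.isalpha():
            mapped = "X" if char.isupper() else "x"
        elif char.isdigit():
            mapped = "d"
        else:
            # Punctuation and other characters are kept as-is.
            mapped = char
        if mapped == last:
            run_length += 1
        else:
            last = mapped
            run_length = 1
        # Runs of the same shape character are truncated after length 4.
        if run_length <= 4:
            shape.append(mapped)
    return "".join(shape)
```

Under this rule a long lowercase word collapses to `"xxxx"`, and a two-digit number becomes `"dd"`.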
@ -237,8 +237,9 @@ If a setting is not present in the options, the default value will be used.
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description | Default |
|
| Name | Type | Description | Default |
|
||||||
| ------------------ | ------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
|
| ------------------------------------------ | ------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
|
||||||
| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
|
| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
|
||||||
|
| `add_lemma` <Tag variant="new">2.2.4</Tag> | bool | Print the lemmas in a separate row below the token texts. | `False` |
|
||||||
| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` |
|
| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` |
|
||||||
| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
|
| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
|
||||||
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
|
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
|
||||||
|
|
|
@ -326,25 +326,6 @@ performed in chunks, to avoid consuming too much memory. You can set the
|
||||||
| `sort` | bool | Whether to sort the entries returned by score. Defaults to `True`. |
|
| `sort` | bool | Whether to sort the entries returned by score. Defaults to `True`. |
|
||||||
| **RETURNS** | tuple | The most similar entries as a `(keys, best_rows, scores)` tuple. |
|
| **RETURNS** | tuple | The most similar entries as a `(keys, best_rows, scores)` tuple. |
|
||||||
|
|
||||||
## Vectors.from_glove {#from_glove tag="method"}
|
|
||||||
|
|
||||||
Load [GloVe](https://nlp.stanford.edu/projects/glove/) vectors from a directory.
|
|
||||||
Assumes binary format, that the vocab is in a `vocab.txt`, and that vectors are
|
|
||||||
named `vectors.{size}.[fd.bin]`, e.g. `vectors.128.f.bin` for 128d float32
|
|
||||||
vectors, `vectors.300.d.bin` for 300d float64 (double) vectors, etc. By default
|
|
||||||
GloVe outputs 64-bit vectors.
|
|
||||||
|
|
||||||
> #### Example
|
|
||||||
>
|
|
||||||
> ```python
|
|
||||||
> vectors = Vectors()
|
|
||||||
> vectors.from_glove("/path/to/glove_vectors")
|
|
||||||
> ```
|
|
||||||
|
|
||||||
| Name | Type | Description |
|
|
||||||
| ------ | ---------------- | ---------------------------------------- |
|
|
||||||
| `path` | unicode / `Path` | The path to load the GloVe vectors from. |
|
|
||||||
|
|
||||||
## Vectors.to_disk {#to_disk tag="method"}
|
## Vectors.to_disk {#to_disk tag="method"}
|
||||||
|
|
||||||
Save the current state to a directory.
|
Save the current state to a directory.
|
||||||
|
|
|
@ -622,13 +622,13 @@ categorizer is to use the [`spacy train`](/api/cli#train) command-line utility.
|
||||||
In order to use this, you'll need training and evaluation data in the
|
In order to use this, you'll need training and evaluation data in the
|
||||||
[JSON format](/api/annotation#json-input) spaCy expects for training.
|
[JSON format](/api/annotation#json-input) spaCy expects for training.
|
||||||
|
|
||||||
You can now train the model using a corpus for your language annotated with If
|
If your data is in one of the supported formats, the easiest solution might be
|
||||||
your data is in one of the supported formats, the easiest solution might be to
|
to use the [`spacy convert`](/api/cli#convert) command-line utility. This
|
||||||
use the [`spacy convert`](/api/cli#convert) command-line utility. This supports
|
supports several popular formats, including the IOB format for named entity
|
||||||
several popular formats, including the IOB format for named entity recognition,
|
recognition, the JSONL format produced by our annotation tool
|
||||||
the JSONL format produced by our annotation tool [Prodigy](https://prodi.gy),
|
[Prodigy](https://prodi.gy), and the
|
||||||
and the [CoNLL-U](http://universaldependencies.org/docs/format.html) format used
|
[CoNLL-U](http://universaldependencies.org/docs/format.html) format used by the
|
||||||
by the [Universal Dependencies](http://universaldependencies.org/) corpus.
|
[Universal Dependencies](http://universaldependencies.org/) corpus.
|
||||||
|
|
||||||
One thing to keep in mind is that spaCy expects to train its models from **whole
|
One thing to keep in mind is that spaCy expects to train its models from **whole
|
||||||
documents**, not just single sentences. If your corpus only contains single
|
documents**, not just single sentences. If your corpus only contains single
|
||||||
|
|
|
@ -968,7 +968,10 @@ pattern. The entity ruler accepts two types of patterns:
|
||||||
The [`EntityRuler`](/api/entityruler) is a pipeline component that's typically
|
The [`EntityRuler`](/api/entityruler) is a pipeline component that's typically
|
||||||
added via [`nlp.add_pipe`](/api/language#add_pipe). When the `nlp` object is
|
added via [`nlp.add_pipe`](/api/language#add_pipe). When the `nlp` object is
|
||||||
called on a text, it will find matches in the `doc` and add them as entities to
|
called on a text, it will find matches in the `doc` and add them as entities to
|
||||||
the `doc.ents`, using the specified pattern label as the entity label.
|
the `doc.ents`, using the specified pattern label as the entity label. If any
|
||||||
|
matches were to overlap, the pattern matching most tokens takes priority. If
|
||||||
|
they also happen to be equally long, then the match occurring first in the Doc is
|
||||||
|
chosen.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
### {executable="true"}
|
### {executable="true"}
|
||||||
|
@ -1119,7 +1122,7 @@ entityruler = EntityRuler(nlp)
|
||||||
patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]
|
patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]
|
||||||
|
|
||||||
other_pipes = [p for p in nlp.pipe_names if p != "tagger"]
|
other_pipes = [p for p in nlp.pipe_names if p != "tagger"]
|
||||||
with nlp.disable_pipes(*disable_pipes):
|
with nlp.disable_pipes(*other_pipes):
|
||||||
entityruler.add_patterns(patterns)
|
entityruler.add_patterns(patterns)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
|
@ -94,7 +94,7 @@ docs = list(doc_bin.get_docs(nlp.vocab))
|
||||||
|
|
||||||
If `store_user_data` is set to `True`, the `Doc.user_data` will be serialized as
|
If `store_user_data` is set to `True`, the `Doc.user_data` will be serialized as
|
||||||
well, which includes the values of
|
well, which includes the values of
|
||||||
[extension attributes](/processing-pipelines#custom-components-attributes) (if
|
[extension attributes](/usage/processing-pipelines#custom-components-attributes) (if
|
||||||
they're serializable with msgpack).
|
they're serializable with msgpack).
|
||||||
|
|
||||||
<Infobox title="Important note on serializing extension attributes" variant="warning">
|
<Infobox title="Important note on serializing extension attributes" variant="warning">
|
||||||
|
|