mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-12 02:06:31 +03:00
Merge branch 'master' into tmp/sync
This commit is contained in:
commit
46568f40a7
106
.github/contributors/Baciccin.md
vendored
Normal file
106
.github/contributors/Baciccin.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | ------------------------ |
|
||||
| Name | Giovanni Battista Parodi |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2020-03-19 |
|
||||
| GitHub username | Baciccin |
|
||||
| Website (optional) | |
|
106
.github/contributors/MisterKeefe.md
vendored
Normal file
106
.github/contributors/MisterKeefe.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [ ] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Tom Keefe |
|
||||
| Company name (if applicable) | / |
|
||||
| Title or role (if applicable) | / |
|
||||
| Date | 18 February 2020 |
|
||||
| GitHub username | MisterKeefe |
|
||||
| Website (optional) | / |
|
106
.github/contributors/Tiljander.md
vendored
Normal file
106
.github/contributors/Tiljander.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Henrik Tiljander |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 24/3/2020 |
|
||||
| GitHub username | Tiljander |
|
||||
| Website (optional) | |
|
106
.github/contributors/dhpollack.md
vendored
Normal file
106
.github/contributors/dhpollack.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [X] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | David Pollack |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | Mar 5. 2020 |
|
||||
| GitHub username | dhpollack |
|
||||
| Website (optional) | |
|
106
.github/contributors/guerda.md
vendored
Normal file
106
.github/contributors/guerda.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Philip Gillißen |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2020-03-24 |
|
||||
| GitHub username | guerda |
|
||||
| Website (optional) | |
|
89
.github/contributors/mabraham.md
vendored
Normal file
89
.github/contributors/mabraham.md
vendored
Normal file
|
@ -0,0 +1,89 @@
|
|||
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | |
|
||||
| GitHub username | |
|
||||
| Website (optional) | |
|
106
.github/contributors/merrcury.md
vendored
Normal file
106
.github/contributors/merrcury.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [X] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Himanshu Garg |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2020-03-10 |
|
||||
| GitHub username | merrcury |
|
||||
| Website (optional) | |
|
106
.github/contributors/pinealan.md
vendored
Normal file
106
.github/contributors/pinealan.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Alan Chan |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2020-03-15 |
|
||||
| GitHub username | pinealan |
|
||||
| Website (optional) | http://pinealan.xyz |
|
106
.github/contributors/sloev.md
vendored
Normal file
106
.github/contributors/sloev.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | ------------------------ |
|
||||
| Name | Johannes Valbjørn |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2020-03-13 |
|
||||
| GitHub username | sloev |
|
||||
| Website (optional) | https://sloev.github.io |
|
2
.gitignore
vendored
2
.gitignore
vendored
|
@ -46,6 +46,7 @@ __pycache__/
|
|||
.venv
|
||||
env3.6/
|
||||
venv/
|
||||
env3.*/
|
||||
.dev
|
||||
.denv
|
||||
.pypyenv
|
||||
|
@ -62,6 +63,7 @@ lib64/
|
|||
parts/
|
||||
sdist/
|
||||
var/
|
||||
wheelhouse/
|
||||
*.egg-info/
|
||||
pip-wheel-metadata/
|
||||
Pipfile.lock
|
||||
|
|
2
LICENSE
2
LICENSE
|
@ -1,6 +1,6 @@
|
|||
The MIT License (MIT)
|
||||
|
||||
Copyright (C) 2016-2019 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
|
||||
Copyright (C) 2016-2020 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
|
||||
|
||||
Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
of this software and associated documentation files (the "Software"), to deal
|
||||
|
|
47
Makefile
47
Makefile
|
@ -1,28 +1,37 @@
|
|||
SHELL := /bin/bash
|
||||
sha = $(shell "git" "rev-parse" "--short" "HEAD")
|
||||
version = $(shell "bin/get-version.sh")
|
||||
wheel = spacy-$(version)-cp36-cp36m-linux_x86_64.whl
|
||||
PYVER := 3.6
|
||||
VENV := ./env$(PYVER)
|
||||
|
||||
dist/spacy.pex : dist/spacy-$(sha).pex
|
||||
cp dist/spacy-$(sha).pex dist/spacy.pex
|
||||
chmod a+rx dist/spacy.pex
|
||||
version := $(shell "bin/get-version.sh")
|
||||
|
||||
dist/spacy-$(sha).pex : dist/$(wheel)
|
||||
env3.6/bin/python -m pip install pex==1.5.3
|
||||
env3.6/bin/pex pytest dist/$(wheel) spacy_lookups_data -e spacy -o dist/spacy-$(sha).pex
|
||||
dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
|
||||
$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) jsonschema spacy_lookups_data
|
||||
chmod a+rx $@
|
||||
|
||||
dist/$(wheel) : setup.py spacy/*.py* spacy/*/*.py*
|
||||
python3.6 -m venv env3.6
|
||||
source env3.6/bin/activate
|
||||
env3.6/bin/pip install wheel
|
||||
env3.6/bin/pip install -r requirements.txt --no-cache-dir
|
||||
env3.6/bin/python setup.py build_ext --inplace
|
||||
env3.6/bin/python setup.py sdist
|
||||
env3.6/bin/python setup.py bdist_wheel
|
||||
dist/pytest.pex : wheelhouse/pytest-*.whl
|
||||
$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock
|
||||
chmod a+rx $@
|
||||
|
||||
.PHONY : clean
|
||||
wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
|
||||
$(VENV)/bin/pip wheel . -w ./wheelhouse
|
||||
$(VENV)/bin/pip wheel jsonschema spacy_lookups_data -w ./wheelhouse
|
||||
touch $@
|
||||
|
||||
wheelhouse/pytest-%.whl : $(VENV)/bin/pex
|
||||
$(VENV)/bin/pip wheel pytest pytest-timeout mock -w ./wheelhouse
|
||||
|
||||
$(VENV)/bin/pex :
|
||||
python$(PYVER) -m venv $(VENV)
|
||||
$(VENV)/bin/pip install -U pip setuptools pex wheel
|
||||
|
||||
.PHONY : clean test
|
||||
|
||||
test : dist/spacy-$(version).pex dist/pytest.pex
|
||||
( . $(VENV)/bin/activate ; \
|
||||
PEX_PATH=dist/spacy-$(version).pex ./dist/pytest.pex --pyargs spacy -x ; )
|
||||
|
||||
clean : setup.py
|
||||
source env3.6/bin/activate
|
||||
rm -rf dist/*
|
||||
rm -rf ./wheelhouse
|
||||
rm -rf $(VENV)
|
||||
python setup.py clean --all
|
||||
|
|
|
@ -2,7 +2,7 @@
|
|||
|
||||
### Step 1: Create a Knowledge Base (KB) and training data
|
||||
|
||||
Run `wikipedia_pretrain_kb.py`
|
||||
Run `wikidata_pretrain_kb.py`
|
||||
* This takes as input the locations of a **Wikipedia and a Wikidata dump**, and produces a **KB directory** + **training file**
|
||||
* WikiData: get `latest-all.json.bz2` from https://dumps.wikimedia.org/wikidatawiki/entities/
|
||||
* Wikipedia: get `enwiki-latest-pages-articles-multistream.xml.bz2` from https://dumps.wikimedia.org/enwiki/latest/ (or for any other language)
|
||||
|
|
|
@ -88,8 +88,8 @@ def read_text(bz2_loc, n=10000):
|
|||
break
|
||||
|
||||
|
||||
def get_matches(tokenizer, phrases, texts, max_length=6):
|
||||
matcher = PhraseMatcher(tokenizer.vocab, max_length=max_length)
|
||||
def get_matches(tokenizer, phrases, texts):
|
||||
matcher = PhraseMatcher(tokenizer.vocab)
|
||||
matcher.add("Phrase", None, *phrases)
|
||||
for text in texts:
|
||||
doc = tokenizer(text)
|
||||
|
|
|
@ -59,7 +59,7 @@ install_requires =
|
|||
|
||||
[options.extras_require]
|
||||
lookups =
|
||||
spacy_lookups_data>=0.0.5<0.2.0
|
||||
spacy_lookups_data>=0.0.5,<0.2.0
|
||||
cuda =
|
||||
cupy>=5.0.0b4
|
||||
cuda80 =
|
||||
|
|
|
@ -93,3 +93,5 @@ cdef enum attr_id_t:
|
|||
ENT_KB_ID = symbols.ENT_KB_ID
|
||||
MORPH
|
||||
ENT_ID = symbols.ENT_ID
|
||||
|
||||
IDX
|
||||
|
|
|
@ -89,6 +89,7 @@ IDS = {
|
|||
"PROB": PROB,
|
||||
"LANG": LANG,
|
||||
"MORPH": MORPH,
|
||||
"IDX": IDX
|
||||
}
|
||||
|
||||
|
||||
|
|
|
@ -405,12 +405,10 @@ def train(
|
|||
losses=losses,
|
||||
)
|
||||
except ValueError as e:
|
||||
msg.warn("Error during training")
|
||||
err = "Error during training"
|
||||
if init_tok2vec:
|
||||
msg.warn(
|
||||
"Did you provide the same parameters during 'train' as during 'pretrain'?"
|
||||
)
|
||||
msg.fail(f"Original error message: {e}", exits=1)
|
||||
err += " Did you provide the same parameters during 'train' as during 'pretrain'?"
|
||||
msg.fail(err, f"Original error message: {e}", exits=1)
|
||||
if raw_text:
|
||||
# If raw text is available, perform 'rehearsal' updates,
|
||||
# which use unlabelled data to reduce overfitting.
|
||||
|
@ -545,7 +543,40 @@ def train(
|
|||
with nlp.use_params(optimizer.averages):
|
||||
final_model_path = output_path / "model-final"
|
||||
nlp.to_disk(final_model_path)
|
||||
final_meta = srsly.read_json(output_path / "model-final" / "meta.json")
|
||||
meta_loc = output_path / "model-final" / "meta.json"
|
||||
final_meta = srsly.read_json(meta_loc)
|
||||
final_meta.setdefault("accuracy", {})
|
||||
final_meta["accuracy"].update(meta.get("accuracy", {}))
|
||||
final_meta.setdefault("speed", {})
|
||||
final_meta["speed"].setdefault("cpu", None)
|
||||
final_meta["speed"].setdefault("gpu", None)
|
||||
meta.setdefault("speed", {})
|
||||
meta["speed"].setdefault("cpu", None)
|
||||
meta["speed"].setdefault("gpu", None)
|
||||
# combine cpu and gpu speeds with the base model speeds
|
||||
if final_meta["speed"]["cpu"] and meta["speed"]["cpu"]:
|
||||
speed = _get_total_speed(
|
||||
[final_meta["speed"]["cpu"], meta["speed"]["cpu"]]
|
||||
)
|
||||
final_meta["speed"]["cpu"] = speed
|
||||
if final_meta["speed"]["gpu"] and meta["speed"]["gpu"]:
|
||||
speed = _get_total_speed(
|
||||
[final_meta["speed"]["gpu"], meta["speed"]["gpu"]]
|
||||
)
|
||||
final_meta["speed"]["gpu"] = speed
|
||||
# if there were no speeds to update, overwrite with meta
|
||||
if (
|
||||
final_meta["speed"]["cpu"] is None
|
||||
and final_meta["speed"]["gpu"] is None
|
||||
):
|
||||
final_meta["speed"].update(meta["speed"])
|
||||
# note: beam speeds are not combined with the base model
|
||||
if has_beam_widths:
|
||||
final_meta.setdefault("beam_accuracy", {})
|
||||
final_meta["beam_accuracy"].update(meta.get("beam_accuracy", {}))
|
||||
final_meta.setdefault("beam_speed", {})
|
||||
final_meta["beam_speed"].update(meta.get("beam_speed", {}))
|
||||
srsly.write_json(meta_loc, final_meta)
|
||||
msg.good("Saved model to output directory", final_model_path)
|
||||
with msg.loading("Creating best model..."):
|
||||
best_model_path = _collate_best_model(final_meta, output_path, best_pipes)
|
||||
|
@ -630,6 +661,8 @@ def _find_best(experiment_dir, component):
|
|||
if epoch_model.is_dir() and epoch_model.parts[-1] != "model-final":
|
||||
accs = srsly.read_json(epoch_model / "accuracy.json")
|
||||
scores = [accs.get(metric, 0.0) for metric in _get_metrics(component)]
|
||||
# remove per_type dicts from score list for max() comparison
|
||||
scores = [score for score in scores if isinstance(score, float)]
|
||||
accuracies.append((scores, epoch_model))
|
||||
if accuracies:
|
||||
return max(accuracies)[1]
|
||||
|
@ -641,13 +674,13 @@ def _get_metrics(component):
|
|||
if component == "parser":
|
||||
return ("las", "uas", "las_per_type", "token_acc", "sent_f")
|
||||
elif component == "tagger":
|
||||
return ("tags_acc",)
|
||||
return ("tags_acc", "token_acc")
|
||||
elif component == "ner":
|
||||
return ("ents_f", "ents_p", "ents_r", "ents_per_type")
|
||||
return ("ents_f", "ents_p", "ents_r", "ents_per_type", "token_acc")
|
||||
elif component == "senter":
|
||||
return ("sent_f", "sent_p", "sent_r")
|
||||
elif component == "textcat":
|
||||
return ("textcat_score",)
|
||||
return ("textcat_score", "token_acc")
|
||||
return ("token_acc",)
|
||||
|
||||
|
||||
|
@ -714,3 +747,12 @@ def _get_progress(
|
|||
if beam_width is not None:
|
||||
result.insert(1, beam_width)
|
||||
return result
|
||||
|
||||
|
||||
def _get_total_speed(speeds):
|
||||
seconds_per_word = 0.0
|
||||
for words_per_second in speeds:
|
||||
if words_per_second is None:
|
||||
return None
|
||||
seconds_per_word += 1.0 / words_per_second
|
||||
return 1.0 / seconds_per_word
|
||||
|
|
|
@ -142,10 +142,17 @@ def parse_deps(orig_doc, options={}):
|
|||
for span, tag, lemma, ent_type in spans:
|
||||
attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
|
||||
retokenizer.merge(span, attrs=attrs)
|
||||
if options.get("fine_grained"):
|
||||
words = [{"text": w.text, "tag": w.tag_} for w in doc]
|
||||
else:
|
||||
words = [{"text": w.text, "tag": w.pos_} for w in doc]
|
||||
fine_grained = options.get("fine_grained")
|
||||
add_lemma = options.get("add_lemma")
|
||||
words = [
|
||||
{
|
||||
"text": w.text,
|
||||
"tag": w.tag_ if fine_grained else w.pos_,
|
||||
"lemma": w.lemma_ if add_lemma else None,
|
||||
}
|
||||
for w in doc
|
||||
]
|
||||
|
||||
arcs = []
|
||||
for word in doc:
|
||||
if word.i < word.head.i:
|
||||
|
|
|
@ -1,6 +1,12 @@
|
|||
import uuid
|
||||
|
||||
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
|
||||
from .templates import (
|
||||
TPL_DEP_SVG,
|
||||
TPL_DEP_WORDS,
|
||||
TPL_DEP_WORDS_LEMMA,
|
||||
TPL_DEP_ARCS,
|
||||
TPL_ENTS,
|
||||
)
|
||||
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
|
||||
from ..util import minify_html, escape_html, registry
|
||||
from ..errors import Errors
|
||||
|
@ -80,7 +86,10 @@ class DependencyRenderer(object):
|
|||
self.width = self.offset_x + len(words) * self.distance
|
||||
self.height = self.offset_y + 3 * self.word_spacing
|
||||
self.id = render_id
|
||||
words = [self.render_word(w["text"], w["tag"], i) for i, w in enumerate(words)]
|
||||
words = [
|
||||
self.render_word(w["text"], w["tag"], w.get("lemma", None), i)
|
||||
for i, w in enumerate(words)
|
||||
]
|
||||
arcs = [
|
||||
self.render_arrow(a["label"], a["start"], a["end"], a["dir"], i)
|
||||
for i, a in enumerate(arcs)
|
||||
|
@ -98,7 +107,9 @@ class DependencyRenderer(object):
|
|||
lang=self.lang,
|
||||
)
|
||||
|
||||
def render_word(self, text, tag, i):
|
||||
def render_word(
|
||||
self, text, tag, lemma, i,
|
||||
):
|
||||
"""Render individual word.
|
||||
|
||||
text (unicode): Word text.
|
||||
|
@ -111,6 +122,10 @@ class DependencyRenderer(object):
|
|||
if self.direction == "rtl":
|
||||
x = self.width - x
|
||||
html_text = escape_html(text)
|
||||
if lemma is not None:
|
||||
return TPL_DEP_WORDS_LEMMA.format(
|
||||
text=html_text, tag=tag, lemma=lemma, x=x, y=y
|
||||
)
|
||||
return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y)
|
||||
|
||||
def render_arrow(self, label, start, end, direction, i):
|
||||
|
|
|
@ -14,6 +14,15 @@ TPL_DEP_WORDS = """
|
|||
"""
|
||||
|
||||
|
||||
TPL_DEP_WORDS_LEMMA = """
|
||||
<text class="displacy-token" fill="currentColor" text-anchor="middle" y="{y}">
|
||||
<tspan class="displacy-word" fill="currentColor" x="{x}">{text}</tspan>
|
||||
<tspan class="displacy-lemma" dy="2em" fill="currentColor" x="{x}">{lemma}</tspan>
|
||||
<tspan class="displacy-tag" dy="2em" fill="currentColor" x="{x}">{tag}</tspan>
|
||||
</text>
|
||||
"""
|
||||
|
||||
|
||||
TPL_DEP_ARCS = """
|
||||
<g class="displacy-arrow">
|
||||
<path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>
|
||||
|
|
|
@ -96,7 +96,10 @@ class Warnings(object):
|
|||
W027 = ("Found a large training file of {size} bytes. Note that it may "
|
||||
"be more efficient to split your training data into multiple "
|
||||
"smaller JSON files instead.")
|
||||
W028 = ("Skipping unsupported morphological feature(s): {feature}. "
|
||||
W028 = ("Doc.from_array was called with a vector of type '{type}', "
|
||||
"but is expecting one of type 'uint64' instead. This may result "
|
||||
"in problems with the vocab further on in the pipeline.")
|
||||
W029 = ("Skipping unsupported morphological feature(s): {feature}. "
|
||||
"Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or "
|
||||
"string \"Field1=Value1,Value2|Field2=Value3\".")
|
||||
|
||||
|
@ -531,6 +534,15 @@ class Errors(object):
|
|||
E188 = ("Could not match the gold entity links to entities in the doc - "
|
||||
"make sure the gold EL data refers to valid results of the "
|
||||
"named entity recognizer in the `nlp` pipeline.")
|
||||
E189 = ("Each argument to `get_doc` should be of equal length.")
|
||||
E190 = ("Token head out of range in `Doc.from_array()` for token index "
|
||||
"'{index}' with value '{value}' (equivalent to relative head "
|
||||
"index: '{rel_head_index}'). The head indices should be relative "
|
||||
"to the current token index rather than absolute indices in the "
|
||||
"array.")
|
||||
E191 = ("Invalid head: the head token must be from the same doc as the "
|
||||
"token itself.")
|
||||
|
||||
# TODO: fix numbering after merging develop into master
|
||||
E993 = ("The config for 'nlp' should include either a key 'name' to "
|
||||
"refer to an existing model by name or path, or a key 'lang' "
|
||||
|
|
|
@ -66,6 +66,7 @@ for orth in [
|
|||
"A/S",
|
||||
"B.C.",
|
||||
"BK.",
|
||||
"B.T.",
|
||||
"Dr.",
|
||||
"Boul.",
|
||||
"Chr.",
|
||||
|
@ -75,6 +76,7 @@ for orth in [
|
|||
"Hf.",
|
||||
"i/s",
|
||||
"I/S",
|
||||
"Inc.",
|
||||
"Kprs.",
|
||||
"L.A.",
|
||||
"Ll.",
|
||||
|
@ -145,6 +147,7 @@ for orth in [
|
|||
"bygn.",
|
||||
"c/o",
|
||||
"ca.",
|
||||
"cm.",
|
||||
"cand.",
|
||||
"d.d.",
|
||||
"d.m.",
|
||||
|
@ -168,10 +171,12 @@ for orth in [
|
|||
"dl.",
|
||||
"do.",
|
||||
"dobb.",
|
||||
"dr.",
|
||||
"dr.h.c",
|
||||
"dr.phil.",
|
||||
"ds.",
|
||||
"dvs.",
|
||||
"d.v.s.",
|
||||
"e.b.",
|
||||
"e.l.",
|
||||
"e.o.",
|
||||
|
@ -293,10 +298,14 @@ for orth in [
|
|||
"kap.",
|
||||
"kbh.",
|
||||
"kem.",
|
||||
"kg.",
|
||||
"kgs.",
|
||||
"kgl.",
|
||||
"kl.",
|
||||
"kld.",
|
||||
"km.",
|
||||
"km/t",
|
||||
"km/t.",
|
||||
"knsp.",
|
||||
"komm.",
|
||||
"kons.",
|
||||
|
@ -307,6 +316,7 @@ for orth in [
|
|||
"kt.",
|
||||
"ktr.",
|
||||
"kv.",
|
||||
"kvm.",
|
||||
"kvt.",
|
||||
"l.c.",
|
||||
"lab.",
|
||||
|
@ -353,6 +363,7 @@ for orth in [
|
|||
"nto.",
|
||||
"nuv.",
|
||||
"o/m",
|
||||
"o/m.",
|
||||
"o.a.",
|
||||
"o.fl.",
|
||||
"o.h.",
|
||||
|
@ -522,6 +533,7 @@ for orth in [
|
|||
"vejl.",
|
||||
"vh.",
|
||||
"vha.",
|
||||
"vind.",
|
||||
"vs.",
|
||||
"vsa.",
|
||||
"vær.",
|
||||
|
|
|
@ -1,5 +1,6 @@
|
|||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .norm_exceptions import NORM_EXCEPTIONS
|
||||
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
|
||||
from .punctuation import TOKENIZER_INFIXES
|
||||
from .tag_map import TAG_MAP
|
||||
from .stop_words import STOP_WORDS
|
||||
|
@ -19,6 +20,8 @@ class GermanDefaults(Language.Defaults):
|
|||
Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
|
||||
)
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
prefixes = TOKENIZER_PREFIXES
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
infixes = TOKENIZER_INFIXES
|
||||
tag_map = TAG_MAP
|
||||
stop_words = STOP_WORDS
|
||||
|
|
|
@ -1,7 +1,29 @@
|
|||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
|
||||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_PUNCT, LIST_QUOTES
|
||||
from ..char_classes import CURRENCY, UNITS, PUNCT
|
||||
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
||||
from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES
|
||||
|
||||
|
||||
_prefixes = ["``"] + BASE_TOKENIZER_PREFIXES
|
||||
|
||||
_suffixes = (
|
||||
["''", "/"]
|
||||
+ LIST_PUNCT
|
||||
+ LIST_ELLIPSES
|
||||
+ LIST_QUOTES
|
||||
+ LIST_ICONS
|
||||
+ [
|
||||
r"(?<=[0-9])\+",
|
||||
r"(?<=°[FfCcKk])\.",
|
||||
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
|
||||
r"(?<=[0-9])(?:{u})".format(u=UNITS),
|
||||
r"(?<=[{al}{e}{p}(?:{q})])\.".format(
|
||||
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
|
||||
),
|
||||
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
|
||||
]
|
||||
)
|
||||
|
||||
_quotes = CONCAT_QUOTES.replace("'", "")
|
||||
|
||||
_infixes = (
|
||||
|
@ -12,6 +34,7 @@ _infixes = (
|
|||
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[0-9{a}])\/(?=[0-9{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
|
||||
r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[0-9])-(?=[0-9])",
|
||||
|
@ -19,4 +42,6 @@ _infixes = (
|
|||
)
|
||||
|
||||
|
||||
TOKENIZER_PREFIXES = _prefixes
|
||||
TOKENIZER_SUFFIXES = _suffixes
|
||||
TOKENIZER_INFIXES = _infixes
|
||||
|
|
|
@ -157,6 +157,8 @@ for exc_data in [
|
|||
|
||||
|
||||
for orth in [
|
||||
"``",
|
||||
"''",
|
||||
"A.C.",
|
||||
"a.D.",
|
||||
"A.D.",
|
||||
|
@ -172,10 +174,13 @@ for orth in [
|
|||
"biol.",
|
||||
"Biol.",
|
||||
"ca.",
|
||||
"CDU/CSU",
|
||||
"Chr.",
|
||||
"Cie.",
|
||||
"c/o",
|
||||
"co.",
|
||||
"Co.",
|
||||
"d'",
|
||||
"D.C.",
|
||||
"Dipl.-Ing.",
|
||||
"Dipl.",
|
||||
|
@ -200,12 +205,18 @@ for orth in [
|
|||
"i.G.",
|
||||
"i.Tr.",
|
||||
"i.V.",
|
||||
"I.",
|
||||
"II.",
|
||||
"III.",
|
||||
"IV.",
|
||||
"Inc.",
|
||||
"Ing.",
|
||||
"jr.",
|
||||
"Jr.",
|
||||
"jun.",
|
||||
"jur.",
|
||||
"K.O.",
|
||||
"L'",
|
||||
"L.A.",
|
||||
"lat.",
|
||||
"M.A.",
|
||||
|
|
30
spacy/lang/eu/__init__.py
Normal file
30
spacy/lang/eu/__init__.py
Normal file
|
@ -0,0 +1,30 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from .punctuation import TOKENIZER_SUFFIXES
|
||||
from .tag_map import TAG_MAP
|
||||
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ...language import Language
|
||||
from ...attrs import LANG
|
||||
|
||||
|
||||
class BasqueDefaults(Language.Defaults):
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters.update(LEX_ATTRS)
|
||||
lex_attr_getters[LANG] = lambda text: "eu"
|
||||
|
||||
tokenizer_exceptions = BASE_EXCEPTIONS
|
||||
tag_map = TAG_MAP
|
||||
stop_words = STOP_WORDS
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
|
||||
|
||||
class Basque(Language):
|
||||
lang = "eu"
|
||||
Defaults = BasqueDefaults
|
||||
|
||||
|
||||
__all__ = ["Basque"]
|
14
spacy/lang/eu/examples.py
Normal file
14
spacy/lang/eu/examples.py
Normal file
|
@ -0,0 +1,14 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
>>> from spacy.lang.eu.examples import sentences
|
||||
>>> docs = nlp.pipe(sentences)
|
||||
"""
|
||||
|
||||
sentences = [
|
||||
"bilbon ko castinga egin da eta nik jakin ez zuetako inork egin al du edota parte hartu duen ezagunik ba al du",
|
||||
"gaur telebistan entzunda denok martetik gatoz hortaz martzianoak gara beno nire ustez batzuk beste batzuk baino martzianoagoak dira",
|
||||
]
|
79
spacy/lang/eu/lex_attrs.py
Normal file
79
spacy/lang/eu/lex_attrs.py
Normal file
|
@ -0,0 +1,79 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...attrs import LIKE_NUM
|
||||
|
||||
# Source http://mylanguages.org/basque_numbers.php
|
||||
|
||||
|
||||
_num_words = """
|
||||
bat
|
||||
bi
|
||||
hiru
|
||||
lau
|
||||
bost
|
||||
sei
|
||||
zazpi
|
||||
zortzi
|
||||
bederatzi
|
||||
hamar
|
||||
hamaika
|
||||
hamabi
|
||||
hamahiru
|
||||
hamalau
|
||||
hamabost
|
||||
hamasei
|
||||
hamazazpi
|
||||
Hemezortzi
|
||||
hemeretzi
|
||||
hogei
|
||||
ehun
|
||||
mila
|
||||
milioi
|
||||
""".split()
|
||||
|
||||
# source https://www.google.com/intl/ur/inputtools/try/
|
||||
|
||||
_ordinal_words = """
|
||||
lehen
|
||||
bigarren
|
||||
hirugarren
|
||||
laugarren
|
||||
bosgarren
|
||||
seigarren
|
||||
zazpigarren
|
||||
zortzigarren
|
||||
bederatzigarren
|
||||
hamargarren
|
||||
hamaikagarren
|
||||
hamabigarren
|
||||
hamahirugarren
|
||||
hamalaugarren
|
||||
hamabosgarren
|
||||
hamaseigarren
|
||||
hamazazpigarren
|
||||
hamazortzigarren
|
||||
hemeretzigarren
|
||||
hogeigarren
|
||||
behin
|
||||
""".split()
|
||||
|
||||
|
||||
def like_num(text):
|
||||
if text.startswith(("+", "-", "±", "~")):
|
||||
text = text[1:]
|
||||
text = text.replace(",", "").replace(".", "")
|
||||
if text.isdigit():
|
||||
return True
|
||||
if text.count("/") == 1:
|
||||
num, denom = text.split("/")
|
||||
if num.isdigit() and denom.isdigit():
|
||||
return True
|
||||
if text in _num_words:
|
||||
return True
|
||||
if text in _ordinal_words:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
LEX_ATTRS = {LIKE_NUM: like_num}
|
7
spacy/lang/eu/punctuation.py
Normal file
7
spacy/lang/eu/punctuation.py
Normal file
|
@ -0,0 +1,7 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..punctuation import TOKENIZER_SUFFIXES
|
||||
|
||||
|
||||
_suffixes = TOKENIZER_SUFFIXES
|
108
spacy/lang/eu/stop_words.py
Normal file
108
spacy/lang/eu/stop_words.py
Normal file
|
@ -0,0 +1,108 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
# Source: https://github.com/stopwords-iso/stopwords-eu
|
||||
# https://www.ranks.nl/stopwords/basque
|
||||
# https://www.mustgo.com/worldlanguages/basque/
|
||||
STOP_WORDS = set(
|
||||
"""
|
||||
al
|
||||
anitz
|
||||
arabera
|
||||
asko
|
||||
baina
|
||||
bat
|
||||
batean
|
||||
batek
|
||||
bati
|
||||
batzuei
|
||||
batzuek
|
||||
batzuetan
|
||||
batzuk
|
||||
bera
|
||||
beraiek
|
||||
berau
|
||||
berauek
|
||||
bere
|
||||
berori
|
||||
beroriek
|
||||
beste
|
||||
bezala
|
||||
da
|
||||
dago
|
||||
dira
|
||||
ditu
|
||||
du
|
||||
dute
|
||||
edo
|
||||
egin
|
||||
ere
|
||||
eta
|
||||
eurak
|
||||
ez
|
||||
gainera
|
||||
gu
|
||||
gutxi
|
||||
guzti
|
||||
haiei
|
||||
haiek
|
||||
haietan
|
||||
hainbeste
|
||||
hala
|
||||
han
|
||||
handik
|
||||
hango
|
||||
hara
|
||||
hari
|
||||
hark
|
||||
hartan
|
||||
hau
|
||||
hauei
|
||||
hauek
|
||||
hauetan
|
||||
hemen
|
||||
hemendik
|
||||
hemengo
|
||||
hi
|
||||
hona
|
||||
honek
|
||||
honela
|
||||
honetan
|
||||
honi
|
||||
hor
|
||||
hori
|
||||
horiei
|
||||
horiek
|
||||
horietan
|
||||
horko
|
||||
horra
|
||||
horrek
|
||||
horrela
|
||||
horretan
|
||||
horri
|
||||
hortik
|
||||
hura
|
||||
izan
|
||||
ni
|
||||
noiz
|
||||
nola
|
||||
non
|
||||
nondik
|
||||
nongo
|
||||
nor
|
||||
nora
|
||||
ze
|
||||
zein
|
||||
zen
|
||||
zenbait
|
||||
zenbat
|
||||
zer
|
||||
zergatik
|
||||
ziren
|
||||
zituen
|
||||
zu
|
||||
zuek
|
||||
zuen
|
||||
zuten
|
||||
""".split()
|
||||
)
|
71
spacy/lang/eu/tag_map.py
Normal file
71
spacy/lang/eu/tag_map.py
Normal file
|
@ -0,0 +1,71 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
|
||||
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON
|
||||
|
||||
TAG_MAP = {
|
||||
".": {POS: PUNCT, "PunctType": "peri"},
|
||||
",": {POS: PUNCT, "PunctType": "comm"},
|
||||
"-LRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"},
|
||||
"-RRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"},
|
||||
"``": {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"},
|
||||
'""': {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
|
||||
"''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
|
||||
":": {POS: PUNCT},
|
||||
"$": {POS: SYM, "Other": {"SymType": "currency"}},
|
||||
"#": {POS: SYM, "Other": {"SymType": "numbersign"}},
|
||||
"AFX": {POS: ADJ, "Hyph": "yes"},
|
||||
"CC": {POS: CCONJ, "ConjType": "coor"},
|
||||
"CD": {POS: NUM, "NumType": "card"},
|
||||
"DT": {POS: DET},
|
||||
"EX": {POS: ADV, "AdvType": "ex"},
|
||||
"FW": {POS: X, "Foreign": "yes"},
|
||||
"HYPH": {POS: PUNCT, "PunctType": "dash"},
|
||||
"IN": {POS: ADP},
|
||||
"JJ": {POS: ADJ, "Degree": "pos"},
|
||||
"JJR": {POS: ADJ, "Degree": "comp"},
|
||||
"JJS": {POS: ADJ, "Degree": "sup"},
|
||||
"LS": {POS: PUNCT, "NumType": "ord"},
|
||||
"MD": {POS: VERB, "VerbType": "mod"},
|
||||
"NIL": {POS: ""},
|
||||
"NN": {POS: NOUN, "Number": "sing"},
|
||||
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
|
||||
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
|
||||
"NNS": {POS: NOUN, "Number": "plur"},
|
||||
"PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
|
||||
"POS": {POS: PART, "Poss": "yes"},
|
||||
"PRP": {POS: PRON, "PronType": "prs"},
|
||||
"PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
|
||||
"RB": {POS: ADV, "Degree": "pos"},
|
||||
"RBR": {POS: ADV, "Degree": "comp"},
|
||||
"RBS": {POS: ADV, "Degree": "sup"},
|
||||
"RP": {POS: PART},
|
||||
"SP": {POS: SPACE},
|
||||
"SYM": {POS: SYM},
|
||||
"TO": {POS: PART, "PartType": "inf", "VerbForm": "inf"},
|
||||
"UH": {POS: INTJ},
|
||||
"VB": {POS: VERB, "VerbForm": "inf"},
|
||||
"VBD": {POS: VERB, "VerbForm": "fin", "Tense": "past"},
|
||||
"VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
|
||||
"VBN": {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"},
|
||||
"VBP": {POS: VERB, "VerbForm": "fin", "Tense": "pres"},
|
||||
"VBZ": {
|
||||
POS: VERB,
|
||||
"VerbForm": "fin",
|
||||
"Tense": "pres",
|
||||
"Number": "sing",
|
||||
"Person": 3,
|
||||
},
|
||||
"WDT": {POS: ADJ, "PronType": "int|rel"},
|
||||
"WP": {POS: NOUN, "PronType": "int|rel"},
|
||||
"WP$": {POS: ADJ, "Poss": "yes", "PronType": "int|rel"},
|
||||
"WRB": {POS: ADV, "PronType": "int|rel"},
|
||||
"ADD": {POS: X},
|
||||
"NFP": {POS: PUNCT},
|
||||
"GW": {POS: X},
|
||||
"XX": {POS: X},
|
||||
"BES": {POS: VERB},
|
||||
"HVS": {POS: VERB},
|
||||
"_SP": {POS: SPACE},
|
||||
}
|
|
@ -11,6 +11,7 @@ for exc_data in [
|
|||
{ORTH: "alv.", LEMMA: "arvonlisävero"},
|
||||
{ORTH: "ark.", LEMMA: "arkisin"},
|
||||
{ORTH: "as.", LEMMA: "asunto"},
|
||||
{ORTH: "eaa.", LEMMA: "ennen ajanlaskun alkua"},
|
||||
{ORTH: "ed.", LEMMA: "edellinen"},
|
||||
{ORTH: "esim.", LEMMA: "esimerkki"},
|
||||
{ORTH: "huom.", LEMMA: "huomautus"},
|
||||
|
@ -24,6 +25,7 @@ for exc_data in [
|
|||
{ORTH: "läh.", LEMMA: "lähettäjä"},
|
||||
{ORTH: "miel.", LEMMA: "mieluummin"},
|
||||
{ORTH: "milj.", LEMMA: "miljoona"},
|
||||
{ORTH: "Mm.", LEMMA: "muun muassa"},
|
||||
{ORTH: "mm.", LEMMA: "muun muassa"},
|
||||
{ORTH: "myöh.", LEMMA: "myöhempi"},
|
||||
{ORTH: "n.", LEMMA: "noin"},
|
||||
|
|
|
@ -1,5 +1,6 @@
|
|||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, TOKEN_MATCH
|
||||
from .punctuation import TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
|
||||
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
|
||||
from .punctuation import TOKENIZER_SUFFIXES
|
||||
from .tag_map import TAG_MAP
|
||||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
|
@ -24,6 +25,7 @@ class FrenchDefaults(Language.Defaults):
|
|||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
tag_map = TAG_MAP
|
||||
stop_words = STOP_WORDS
|
||||
prefixes = TOKENIZER_PREFIXES
|
||||
infixes = TOKENIZER_INFIXES
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
token_match = TOKEN_MATCH
|
||||
|
|
|
@ -1,12 +1,23 @@
|
|||
from ..punctuation import TOKENIZER_INFIXES
|
||||
from ..punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
|
||||
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
|
||||
from ..char_classes import CONCAT_QUOTES, UNITS, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
||||
from ..char_classes import merge_chars
|
||||
|
||||
|
||||
ELISION = " ' ’ ".strip().replace(" ", "").replace("\n", "")
|
||||
HYPHENS = r"- – — ‐ ‑".strip().replace(" ", "").replace("\n", "")
|
||||
ELISION = "' ’".replace(" ", "")
|
||||
HYPHENS = r"- – — ‐ ‑".replace(" ", "")
|
||||
_prefixes_elision = "d l n"
|
||||
_prefixes_elision += " " + _prefixes_elision.upper()
|
||||
_hyphen_suffixes = "ce clés elle en il ils je là moi nous on t vous"
|
||||
_hyphen_suffixes += " " + _hyphen_suffixes.upper()
|
||||
|
||||
|
||||
_prefixes = TOKENIZER_PREFIXES + [
|
||||
r"(?:({pe})[{el}])(?=[{a}])".format(
|
||||
a=ALPHA, el=ELISION, pe=merge_chars(_prefixes_elision)
|
||||
)
|
||||
]
|
||||
|
||||
_suffixes = (
|
||||
LIST_PUNCT
|
||||
+ LIST_ELLIPSES
|
||||
|
@ -14,7 +25,6 @@ _suffixes = (
|
|||
+ [
|
||||
r"(?<=[0-9])\+",
|
||||
r"(?<=°[FfCcKk])\.", # °C. -> ["°C", "."]
|
||||
r"(?<=[0-9])°[FfCcKk]", # 4°C -> ["4", "°C"]
|
||||
r"(?<=[0-9])%", # 4% -> ["4", "%"]
|
||||
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
|
||||
r"(?<=[0-9])(?:{u})".format(u=UNITS),
|
||||
|
@ -22,14 +32,17 @@ _suffixes = (
|
|||
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES
|
||||
),
|
||||
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
|
||||
r"(?<=[{a}])[{h}]({hs})".format(
|
||||
a=ALPHA, h=HYPHENS, hs=merge_chars(_hyphen_suffixes)
|
||||
),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
_infixes = TOKENIZER_INFIXES + [
|
||||
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
|
||||
]
|
||||
|
||||
|
||||
TOKENIZER_PREFIXES = _prefixes
|
||||
TOKENIZER_SUFFIXES = _suffixes
|
||||
TOKENIZER_INFIXES = _infixes
|
||||
|
|
|
@ -3,7 +3,7 @@ import re
|
|||
from .punctuation import ELISION, HYPHENS
|
||||
from ..tokenizer_exceptions import URL_PATTERN
|
||||
from ..char_classes import ALPHA_LOWER, ALPHA
|
||||
from ...symbols import ORTH, LEMMA, TAG
|
||||
from ...symbols import ORTH, LEMMA
|
||||
|
||||
# not using the large _tokenizer_exceptions_list by default as it slows down the tokenizer
|
||||
# from ._tokenizer_exceptions_list import FR_BASE_EXCEPTIONS
|
||||
|
@ -53,7 +53,28 @@ for exc_data in [
|
|||
_exc[exc_data[ORTH]] = [exc_data]
|
||||
|
||||
|
||||
for orth in ["etc."]:
|
||||
for orth in [
|
||||
"après-midi",
|
||||
"au-delà",
|
||||
"au-dessus",
|
||||
"celle-ci",
|
||||
"celles-ci",
|
||||
"celui-ci",
|
||||
"cf.",
|
||||
"ci-dessous",
|
||||
"elle-même",
|
||||
"en-dessous",
|
||||
"etc.",
|
||||
"jusque-là",
|
||||
"lui-même",
|
||||
"MM.",
|
||||
"No.",
|
||||
"peut-être",
|
||||
"pp.",
|
||||
"quelques-uns",
|
||||
"rendez-vous",
|
||||
"Vol.",
|
||||
]:
|
||||
_exc[orth] = [{ORTH: orth}]
|
||||
|
||||
|
||||
|
@ -69,7 +90,7 @@ for verb, verb_lemma in [
|
|||
for pronoun in ["elle", "il", "on"]:
|
||||
token = f"{orth}-t-{pronoun}"
|
||||
_exc[token] = [
|
||||
{LEMMA: verb_lemma, ORTH: orth, TAG: "VERB"},
|
||||
{LEMMA: verb_lemma, ORTH: orth}, # , TAG: "VERB"},
|
||||
{LEMMA: "t", ORTH: "-t"},
|
||||
{LEMMA: pronoun, ORTH: "-" + pronoun},
|
||||
]
|
||||
|
@ -78,7 +99,7 @@ for verb, verb_lemma in [("est", "être")]:
|
|||
for orth in [verb, verb.title()]:
|
||||
token = f"{orth}-ce"
|
||||
_exc[token] = [
|
||||
{LEMMA: verb_lemma, ORTH: orth, TAG: "VERB"},
|
||||
{LEMMA: verb_lemma, ORTH: orth}, # , TAG: "VERB"},
|
||||
{LEMMA: "ce", ORTH: "-ce"},
|
||||
]
|
||||
|
||||
|
@ -86,12 +107,29 @@ for verb, verb_lemma in [("est", "être")]:
|
|||
for pre, pre_lemma in [("qu'", "que"), ("n'", "ne")]:
|
||||
for orth in [pre, pre.title()]:
|
||||
_exc[f"{orth}est-ce"] = [
|
||||
{LEMMA: pre_lemma, ORTH: orth, TAG: "ADV"},
|
||||
{LEMMA: "être", ORTH: "est", TAG: "VERB"},
|
||||
{LEMMA: pre_lemma, ORTH: orth},
|
||||
{LEMMA: "être", ORTH: "est"},
|
||||
{LEMMA: "ce", ORTH: "-ce"},
|
||||
]
|
||||
|
||||
|
||||
for verb, pronoun in [("est", "il"), ("EST", "IL")]:
|
||||
token = "{}-{}".format(verb, pronoun)
|
||||
_exc[token] = [
|
||||
{LEMMA: "être", ORTH: verb},
|
||||
{LEMMA: pronoun, ORTH: "-" + pronoun},
|
||||
]
|
||||
|
||||
|
||||
for s, verb, pronoun in [("s", "est", "il"), ("S", "EST", "IL")]:
|
||||
token = "{}'{}-{}".format(s, verb, pronoun)
|
||||
_exc[token] = [
|
||||
{LEMMA: "se", ORTH: s + "'"},
|
||||
{LEMMA: "être", ORTH: verb},
|
||||
{LEMMA: pronoun, ORTH: "-" + pronoun},
|
||||
]
|
||||
|
||||
|
||||
_infixes_exc = []
|
||||
orig_elision = "'"
|
||||
orig_hyphen = "-"
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
from .stop_words import STOP_WORDS
|
||||
from .tag_map import TAG_MAP
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .punctuation import TOKENIZER_INFIXES
|
||||
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
|
||||
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ..norm_exceptions import BASE_NORMS
|
||||
|
@ -19,6 +19,7 @@ class ItalianDefaults(Language.Defaults):
|
|||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
stop_words = STOP_WORDS
|
||||
tag_map = TAG_MAP
|
||||
prefixes = TOKENIZER_PREFIXES
|
||||
infixes = TOKENIZER_INFIXES
|
||||
|
||||
|
||||
|
|
|
@ -1,12 +1,29 @@
|
|||
from ..punctuation import TOKENIZER_INFIXES
|
||||
from ..char_classes import ALPHA
|
||||
from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES
|
||||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
|
||||
from ..char_classes import ALPHA, HYPHENS, CONCAT_QUOTES
|
||||
from ..char_classes import ALPHA_LOWER, ALPHA_UPPER
|
||||
|
||||
|
||||
ELISION = " ' ’ ".strip().replace(" ", "")
|
||||
ELISION = "'’"
|
||||
|
||||
|
||||
_infixes = TOKENIZER_INFIXES + [
|
||||
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
|
||||
]
|
||||
_prefixes = [r"'[0-9][0-9]", r"[0-9]+°"] + BASE_TOKENIZER_PREFIXES
|
||||
|
||||
|
||||
_infixes = (
|
||||
LIST_ELLIPSES
|
||||
+ LIST_ICONS
|
||||
+ [
|
||||
r"(?<=[0-9])[+\-\*^](?=[0-9-])",
|
||||
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
|
||||
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
|
||||
),
|
||||
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])(?:{h})(?=[{al}])".format(a=ALPHA, h=HYPHENS, al=ALPHA_LOWER),
|
||||
r"(?<=[{a}0-9])[:<>=\/](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}][{el}])(?=[{a}0-9\"])".format(a=ALPHA, el=ELISION),
|
||||
]
|
||||
)
|
||||
|
||||
TOKENIZER_PREFIXES = _prefixes
|
||||
TOKENIZER_INFIXES = _infixes
|
||||
|
|
|
@ -1,5 +1,55 @@
|
|||
from ...symbols import ORTH, LEMMA
|
||||
|
||||
_exc = {"po'": [{ORTH: "po'", LEMMA: "poco"}]}
|
||||
_exc = {
|
||||
"all'art.": [{ORTH: "all'"}, {ORTH: "art."}],
|
||||
"dall'art.": [{ORTH: "dall'"}, {ORTH: "art."}],
|
||||
"dell'art.": [{ORTH: "dell'"}, {ORTH: "art."}],
|
||||
"L'art.": [{ORTH: "L'"}, {ORTH: "art."}],
|
||||
"l'art.": [{ORTH: "l'"}, {ORTH: "art."}],
|
||||
"nell'art.": [{ORTH: "nell'"}, {ORTH: "art."}],
|
||||
"po'": [{ORTH: "po'", LEMMA: "poco"}],
|
||||
"sett..": [{ORTH: "sett."}, {ORTH: "."}],
|
||||
}
|
||||
|
||||
for orth in [
|
||||
"..",
|
||||
"....",
|
||||
"al.",
|
||||
"all-path",
|
||||
"art.",
|
||||
"Art.",
|
||||
"artt.",
|
||||
"att.",
|
||||
"by-pass",
|
||||
"c.d.",
|
||||
"centro-sinistra",
|
||||
"check-up",
|
||||
"Civ.",
|
||||
"cm.",
|
||||
"Cod.",
|
||||
"col.",
|
||||
"Cost.",
|
||||
"d.C.",
|
||||
'de"',
|
||||
"distr.",
|
||||
"E'",
|
||||
"ecc.",
|
||||
"e-mail",
|
||||
"e/o",
|
||||
"etc.",
|
||||
"Jr.",
|
||||
"n°",
|
||||
"nord-est",
|
||||
"pag.",
|
||||
"Proc.",
|
||||
"prof.",
|
||||
"sett.",
|
||||
"s.p.a.",
|
||||
"ss.",
|
||||
"St.",
|
||||
"tel.",
|
||||
"week-end",
|
||||
]:
|
||||
_exc[orth] = [{ORTH: orth}]
|
||||
|
||||
TOKENIZER_EXCEPTIONS = _exc
|
||||
|
|
|
@ -2,11 +2,13 @@ from ..char_classes import LIST_ELLIPSES, LIST_ICONS, ALPHA, ALPHA_LOWER, ALPHA_
|
|||
|
||||
ELISION = " ' ’ ".strip().replace(" ", "")
|
||||
|
||||
abbrev = ("d", "D")
|
||||
|
||||
_infixes = (
|
||||
LIST_ELLIPSES
|
||||
+ LIST_ICONS
|
||||
+ [
|
||||
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION),
|
||||
r"(?<=^[{ab}][{el}])(?=[{a}])".format(ab=abbrev, a=ALPHA, el=ELISION),
|
||||
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
|
||||
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
|
||||
|
|
|
@ -7,6 +7,8 @@ _exc = {}
|
|||
|
||||
# translate / delete what is not necessary
|
||||
for exc_data in [
|
||||
{ORTH: "’t", LEMMA: "et", NORM: "et"},
|
||||
{ORTH: "’T", LEMMA: "et", NORM: "et"},
|
||||
{ORTH: "'t", LEMMA: "et", NORM: "et"},
|
||||
{ORTH: "'T", LEMMA: "et", NORM: "et"},
|
||||
{ORTH: "wgl.", LEMMA: "wannechgelift", NORM: "wannechgelift"},
|
||||
|
|
31
spacy/lang/lij/__init__.py
Normal file
31
spacy/lang/lij/__init__.py
Normal file
|
@ -0,0 +1,31 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .stop_words import STOP_WORDS
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .punctuation import TOKENIZER_INFIXES
|
||||
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ..norm_exceptions import BASE_NORMS
|
||||
from ...language import Language
|
||||
from ...attrs import LANG, NORM
|
||||
from ...util import update_exc, add_lookups
|
||||
|
||||
|
||||
class LigurianDefaults(Language.Defaults):
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters[LANG] = lambda text: "lij"
|
||||
lex_attr_getters[NORM] = add_lookups(
|
||||
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
|
||||
)
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
stop_words = STOP_WORDS
|
||||
infixes = TOKENIZER_INFIXES
|
||||
|
||||
|
||||
class Ligurian(Language):
|
||||
lang = "lij"
|
||||
Defaults = LigurianDefaults
|
||||
|
||||
|
||||
__all__ = ["Ligurian"]
|
18
spacy/lang/lij/examples.py
Normal file
18
spacy/lang/lij/examples.py
Normal file
|
@ -0,0 +1,18 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
>>> from spacy.lang.lij.examples import sentences
|
||||
>>> docs = nlp.pipe(sentences)
|
||||
"""
|
||||
|
||||
|
||||
sentences = [
|
||||
"Sciusciâ e sciorbî no se peu.",
|
||||
"Graçie di çetroin, che me son arrivæ.",
|
||||
"Vegnime apreuvo, che ve fasso pescâ di òmmi.",
|
||||
"Bella pe sempre l'ægua inta conchetta quande unn'agoggia d'ægua a se â trapaña.",
|
||||
]
|
15
spacy/lang/lij/punctuation.py
Normal file
15
spacy/lang/lij/punctuation.py
Normal file
|
@ -0,0 +1,15 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..punctuation import TOKENIZER_INFIXES
|
||||
from ..char_classes import ALPHA
|
||||
|
||||
|
||||
ELISION = " ' ’ ".strip().replace(" ", "").replace("\n", "")
|
||||
|
||||
|
||||
_infixes = TOKENIZER_INFIXES + [
|
||||
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
|
||||
]
|
||||
|
||||
TOKENIZER_INFIXES = _infixes
|
43
spacy/lang/lij/stop_words.py
Normal file
43
spacy/lang/lij/stop_words.py
Normal file
|
@ -0,0 +1,43 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
STOP_WORDS = set(
|
||||
"""
|
||||
a à â a-a a-e a-i a-o aiva aloa an ancheu ancon apreuvo ascì atra atre atri atro avanti avei
|
||||
|
||||
bella belle belli bello ben
|
||||
|
||||
ch' che chì chi ciù co-a co-e co-i co-o comm' comme con cösa coscì cöse
|
||||
|
||||
d' da da-a da-e da-i da-o dapeu de delongo derê di do doe doî donde dòppo
|
||||
|
||||
é e ê ea ean emmo en ëse
|
||||
|
||||
fin fiña
|
||||
|
||||
gh' ghe guæei
|
||||
|
||||
i î in insemme int' inta inte inti into
|
||||
|
||||
l' lê lì lô
|
||||
|
||||
m' ma manco me megio meno mezo mi
|
||||
|
||||
na n' ne ni ninte nisciun nisciuña no
|
||||
|
||||
o ò ô oua
|
||||
|
||||
parte pe pe-a pe-i pe-e pe-o perché pittin pö primma pròpio
|
||||
|
||||
quæ quand' quande quarche quella quelle quelli quello
|
||||
|
||||
s' sce scê sci sciâ sciô sciù se segge seu sò solo son sott' sta stæta stæte stæti stæto ste sti sto
|
||||
|
||||
tanta tante tanti tanto te ti torna tra tròppo tutta tutte tutti tutto
|
||||
|
||||
un uña unn' unna
|
||||
|
||||
za zu
|
||||
""".split()
|
||||
)
|
52
spacy/lang/lij/tokenizer_exceptions.py
Normal file
52
spacy/lang/lij/tokenizer_exceptions.py
Normal file
|
@ -0,0 +1,52 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
from ...symbols import ORTH, LEMMA
|
||||
|
||||
_exc = {}
|
||||
|
||||
for raw, lemma in [
|
||||
("a-a", "a-o"),
|
||||
("a-e", "a-o"),
|
||||
("a-o", "a-o"),
|
||||
("a-i", "a-o"),
|
||||
("co-a", "co-o"),
|
||||
("co-e", "co-o"),
|
||||
("co-i", "co-o"),
|
||||
("co-o", "co-o"),
|
||||
("da-a", "da-o"),
|
||||
("da-e", "da-o"),
|
||||
("da-i", "da-o"),
|
||||
("da-o", "da-o"),
|
||||
("pe-a", "pe-o"),
|
||||
("pe-e", "pe-o"),
|
||||
("pe-i", "pe-o"),
|
||||
("pe-o", "pe-o"),
|
||||
]:
|
||||
for orth in [raw, raw.capitalize()]:
|
||||
_exc[orth] = [{ORTH: orth, LEMMA: lemma}]
|
||||
|
||||
# Prefix + prepositions with à (e.g. "sott'a-o")
|
||||
|
||||
for prep, prep_lemma in [
|
||||
("a-a", "a-o"),
|
||||
("a-e", "a-o"),
|
||||
("a-o", "a-o"),
|
||||
("a-i", "a-o"),
|
||||
]:
|
||||
for prefix, prefix_lemma in [
|
||||
("sott'", "sotta"),
|
||||
("sott’", "sotta"),
|
||||
("contr'", "contra"),
|
||||
("contr’", "contra"),
|
||||
("ch'", "che"),
|
||||
("ch’", "che"),
|
||||
("s'", "se"),
|
||||
("s’", "se"),
|
||||
]:
|
||||
for prefix_orth in [prefix, prefix.capitalize()]:
|
||||
_exc[prefix_orth + prep] = [
|
||||
{ORTH: prefix_orth, LEMMA: prefix_lemma},
|
||||
{ORTH: prep, LEMMA: prep_lemma},
|
||||
]
|
||||
|
||||
TOKENIZER_EXCEPTIONS = _exc
|
|
@ -1,3 +1,4 @@
|
|||
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
|
@ -23,7 +24,13 @@ class LithuanianDefaults(Language.Defaults):
|
|||
)
|
||||
lex_attr_getters.update(LEX_ATTRS)
|
||||
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
infixes = TOKENIZER_INFIXES
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
mod_base_exceptions = {
|
||||
exc: val for exc, val in BASE_EXCEPTIONS.items() if not exc.endswith(".")
|
||||
}
|
||||
del mod_base_exceptions["8)"]
|
||||
tokenizer_exceptions = update_exc(mod_base_exceptions, TOKENIZER_EXCEPTIONS)
|
||||
stop_words = STOP_WORDS
|
||||
tag_map = TAG_MAP
|
||||
morph_rules = MORPH_RULES
|
||||
|
|
29
spacy/lang/lt/punctuation.py
Normal file
29
spacy/lang/lt/punctuation.py
Normal file
|
@ -0,0 +1,29 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..char_classes import LIST_ICONS, LIST_ELLIPSES
|
||||
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA
|
||||
from ..char_classes import HYPHENS
|
||||
from ..punctuation import TOKENIZER_SUFFIXES
|
||||
|
||||
|
||||
_infixes = (
|
||||
LIST_ELLIPSES
|
||||
+ LIST_ICONS
|
||||
+ [
|
||||
r"(?<=[0-9])[+\*^](?=[0-9-])",
|
||||
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
|
||||
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
|
||||
),
|
||||
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
|
||||
r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
_suffixes = ["\."] + list(TOKENIZER_SUFFIXES)
|
||||
|
||||
|
||||
TOKENIZER_INFIXES = _infixes
|
||||
TOKENIZER_SUFFIXES = _suffixes
|
|
@ -3,262 +3,264 @@ from ...symbols import ORTH
|
|||
_exc = {}
|
||||
|
||||
for orth in [
|
||||
"G.",
|
||||
"J. E.",
|
||||
"J. Em.",
|
||||
"J.E.",
|
||||
"J.Em.",
|
||||
"K.",
|
||||
"N.",
|
||||
"V.",
|
||||
"Vt.",
|
||||
"a.",
|
||||
"a.k.",
|
||||
"a.s.",
|
||||
"adv.",
|
||||
"akad.",
|
||||
"aklg.",
|
||||
"akt.",
|
||||
"al.",
|
||||
"ang.",
|
||||
"angl.",
|
||||
"aps.",
|
||||
"apskr.",
|
||||
"apyg.",
|
||||
"arbat.",
|
||||
"asist.",
|
||||
"asm.",
|
||||
"asm.k.",
|
||||
"asmv.",
|
||||
"atk.",
|
||||
"atsak.",
|
||||
"atsisk.",
|
||||
"atsisk.sąsk.",
|
||||
"atv.",
|
||||
"aut.",
|
||||
"avd.",
|
||||
"b.k.",
|
||||
"baud.",
|
||||
"biol.",
|
||||
"bkl.",
|
||||
"bot.",
|
||||
"bt.",
|
||||
"buv.",
|
||||
"ch.",
|
||||
"chem.",
|
||||
"corp.",
|
||||
"d.",
|
||||
"dab.",
|
||||
"dail.",
|
||||
"dek.",
|
||||
"deš.",
|
||||
"dir.",
|
||||
"dirig.",
|
||||
"doc.",
|
||||
"dol.",
|
||||
"dr.",
|
||||
"drp.",
|
||||
"dvit.",
|
||||
"dėst.",
|
||||
"dš.",
|
||||
"dž.",
|
||||
"e.b.",
|
||||
"e.bankas",
|
||||
"e.p.",
|
||||
"e.parašas",
|
||||
"e.paštas",
|
||||
"e.v.",
|
||||
"e.valdžia",
|
||||
"egz.",
|
||||
"eil.",
|
||||
"ekon.",
|
||||
"el.",
|
||||
"el.bankas",
|
||||
"el.p.",
|
||||
"el.parašas",
|
||||
"el.paštas",
|
||||
"el.valdžia",
|
||||
"etc.",
|
||||
"ež.",
|
||||
"fak.",
|
||||
"faks.",
|
||||
"feat.",
|
||||
"filol.",
|
||||
"filos.",
|
||||
"g.",
|
||||
"gen.",
|
||||
"geol.",
|
||||
"gerb.",
|
||||
"gim.",
|
||||
"gr.",
|
||||
"gv.",
|
||||
"gyd.",
|
||||
"gyv.",
|
||||
"habil.",
|
||||
"inc.",
|
||||
"insp.",
|
||||
"inž.",
|
||||
"ir pan.",
|
||||
"ir t. t.",
|
||||
"isp.",
|
||||
"istor.",
|
||||
"it.",
|
||||
"just.",
|
||||
"k.",
|
||||
"k. a.",
|
||||
"k.a.",
|
||||
"kab.",
|
||||
"kand.",
|
||||
"kart.",
|
||||
"kat.",
|
||||
"ketv.",
|
||||
"kh.",
|
||||
"kl.",
|
||||
"kln.",
|
||||
"km.",
|
||||
"kn.",
|
||||
"koresp.",
|
||||
"kpt.",
|
||||
"kr.",
|
||||
"kt.",
|
||||
"kub.",
|
||||
"kun.",
|
||||
"kv.",
|
||||
"kyš.",
|
||||
"l. e. p.",
|
||||
"l.e.p.",
|
||||
"lenk.",
|
||||
"liet.",
|
||||
"lot.",
|
||||
"lt.",
|
||||
"ltd.",
|
||||
"ltn.",
|
||||
"m.",
|
||||
"m.e..",
|
||||
"m.m.",
|
||||
"mat.",
|
||||
"med.",
|
||||
"mgnt.",
|
||||
"mgr.",
|
||||
"min.",
|
||||
"mjr.",
|
||||
"ml.",
|
||||
"mln.",
|
||||
"mlrd.",
|
||||
"mob.",
|
||||
"mok.",
|
||||
"moksl.",
|
||||
"mokyt.",
|
||||
"mot.",
|
||||
"mr.",
|
||||
"mst.",
|
||||
"mstl.",
|
||||
"mėn.",
|
||||
"nkt.",
|
||||
"no.",
|
||||
"nr.",
|
||||
"ntk.",
|
||||
"nuotr.",
|
||||
"op.",
|
||||
"org.",
|
||||
"orig.",
|
||||
"p.",
|
||||
"p.d.",
|
||||
"p.m.e.",
|
||||
"p.s.",
|
||||
"pab.",
|
||||
"pan.",
|
||||
"past.",
|
||||
"pav.",
|
||||
"pavad.",
|
||||
"per.",
|
||||
"perd.",
|
||||
"pirm.",
|
||||
"pl.",
|
||||
"plg.",
|
||||
"plk.",
|
||||
"pr.",
|
||||
"pr.Kr.",
|
||||
"pranc.",
|
||||
"proc.",
|
||||
"prof.",
|
||||
"prom.",
|
||||
"prot.",
|
||||
"psl.",
|
||||
"pss.",
|
||||
"pvz.",
|
||||
"pšt.",
|
||||
"r.",
|
||||
"raj.",
|
||||
"red.",
|
||||
"rez.",
|
||||
"rež.",
|
||||
"rus.",
|
||||
"rš.",
|
||||
"s.",
|
||||
"sav.",
|
||||
"saviv.",
|
||||
"sek.",
|
||||
"sekr.",
|
||||
"sen.",
|
||||
"sh.",
|
||||
"sk.",
|
||||
"skg.",
|
||||
"skv.",
|
||||
"skyr.",
|
||||
"sp.",
|
||||
"spec.",
|
||||
"sr.",
|
||||
"st.",
|
||||
"str.",
|
||||
"stud.",
|
||||
"sąs.",
|
||||
"t.",
|
||||
"t. p.",
|
||||
"t. y.",
|
||||
"t.p.",
|
||||
"t.t.",
|
||||
"t.y.",
|
||||
"techn.",
|
||||
"tel.",
|
||||
"teol.",
|
||||
"th.",
|
||||
"tir.",
|
||||
"trit.",
|
||||
"trln.",
|
||||
"tšk.",
|
||||
"tūks.",
|
||||
"tūkst.",
|
||||
"up.",
|
||||
"upl.",
|
||||
"v.s.",
|
||||
"vad.",
|
||||
"val.",
|
||||
"valg.",
|
||||
"ved.",
|
||||
"vert.",
|
||||
"vet.",
|
||||
"vid.",
|
||||
"virš.",
|
||||
"vlsč.",
|
||||
"vnt.",
|
||||
"vok.",
|
||||
"vs.",
|
||||
"vtv.",
|
||||
"vv.",
|
||||
"vyr.",
|
||||
"vyresn.",
|
||||
"zool.",
|
||||
"Įn",
|
||||
"įl.",
|
||||
"š.m.",
|
||||
"šnek.",
|
||||
"šv.",
|
||||
"švč.",
|
||||
"ž.ū.",
|
||||
"žin.",
|
||||
"žml.",
|
||||
"žr.",
|
||||
"n-tosios",
|
||||
"?!",
|
||||
# "G.",
|
||||
# "J. E.",
|
||||
# "J. Em.",
|
||||
# "J.E.",
|
||||
# "J.Em.",
|
||||
# "K.",
|
||||
# "N.",
|
||||
# "V.",
|
||||
# "Vt.",
|
||||
# "a.",
|
||||
# "a.k.",
|
||||
# "a.s.",
|
||||
# "adv.",
|
||||
# "akad.",
|
||||
# "aklg.",
|
||||
# "akt.",
|
||||
# "al.",
|
||||
# "ang.",
|
||||
# "angl.",
|
||||
# "aps.",
|
||||
# "apskr.",
|
||||
# "apyg.",
|
||||
# "arbat.",
|
||||
# "asist.",
|
||||
# "asm.",
|
||||
# "asm.k.",
|
||||
# "asmv.",
|
||||
# "atk.",
|
||||
# "atsak.",
|
||||
# "atsisk.",
|
||||
# "atsisk.sąsk.",
|
||||
# "atv.",
|
||||
# "aut.",
|
||||
# "avd.",
|
||||
# "b.k.",
|
||||
# "baud.",
|
||||
# "biol.",
|
||||
# "bkl.",
|
||||
# "bot.",
|
||||
# "bt.",
|
||||
# "buv.",
|
||||
# "ch.",
|
||||
# "chem.",
|
||||
# "corp.",
|
||||
# "d.",
|
||||
# "dab.",
|
||||
# "dail.",
|
||||
# "dek.",
|
||||
# "deš.",
|
||||
# "dir.",
|
||||
# "dirig.",
|
||||
# "doc.",
|
||||
# "dol.",
|
||||
# "dr.",
|
||||
# "drp.",
|
||||
# "dvit.",
|
||||
# "dėst.",
|
||||
# "dš.",
|
||||
# "dž.",
|
||||
# "e.b.",
|
||||
# "e.bankas",
|
||||
# "e.p.",
|
||||
# "e.parašas",
|
||||
# "e.paštas",
|
||||
# "e.v.",
|
||||
# "e.valdžia",
|
||||
# "egz.",
|
||||
# "eil.",
|
||||
# "ekon.",
|
||||
# "el.",
|
||||
# "el.bankas",
|
||||
# "el.p.",
|
||||
# "el.parašas",
|
||||
# "el.paštas",
|
||||
# "el.valdžia",
|
||||
# "etc.",
|
||||
# "ež.",
|
||||
# "fak.",
|
||||
# "faks.",
|
||||
# "feat.",
|
||||
# "filol.",
|
||||
# "filos.",
|
||||
# "g.",
|
||||
# "gen.",
|
||||
# "geol.",
|
||||
# "gerb.",
|
||||
# "gim.",
|
||||
# "gr.",
|
||||
# "gv.",
|
||||
# "gyd.",
|
||||
# "gyv.",
|
||||
# "habil.",
|
||||
# "inc.",
|
||||
# "insp.",
|
||||
# "inž.",
|
||||
# "ir pan.",
|
||||
# "ir t. t.",
|
||||
# "isp.",
|
||||
# "istor.",
|
||||
# "it.",
|
||||
# "just.",
|
||||
# "k.",
|
||||
# "k. a.",
|
||||
# "k.a.",
|
||||
# "kab.",
|
||||
# "kand.",
|
||||
# "kart.",
|
||||
# "kat.",
|
||||
# "ketv.",
|
||||
# "kh.",
|
||||
# "kl.",
|
||||
# "kln.",
|
||||
# "km.",
|
||||
# "kn.",
|
||||
# "koresp.",
|
||||
# "kpt.",
|
||||
# "kr.",
|
||||
# "kt.",
|
||||
# "kub.",
|
||||
# "kun.",
|
||||
# "kv.",
|
||||
# "kyš.",
|
||||
# "l. e. p.",
|
||||
# "l.e.p.",
|
||||
# "lenk.",
|
||||
# "liet.",
|
||||
# "lot.",
|
||||
# "lt.",
|
||||
# "ltd.",
|
||||
# "ltn.",
|
||||
# "m.",
|
||||
# "m.e..",
|
||||
# "m.m.",
|
||||
# "mat.",
|
||||
# "med.",
|
||||
# "mgnt.",
|
||||
# "mgr.",
|
||||
# "min.",
|
||||
# "mjr.",
|
||||
# "ml.",
|
||||
# "mln.",
|
||||
# "mlrd.",
|
||||
# "mob.",
|
||||
# "mok.",
|
||||
# "moksl.",
|
||||
# "mokyt.",
|
||||
# "mot.",
|
||||
# "mr.",
|
||||
# "mst.",
|
||||
# "mstl.",
|
||||
# "mėn.",
|
||||
# "nkt.",
|
||||
# "no.",
|
||||
# "nr.",
|
||||
# "ntk.",
|
||||
# "nuotr.",
|
||||
# "op.",
|
||||
# "org.",
|
||||
# "orig.",
|
||||
# "p.",
|
||||
# "p.d.",
|
||||
# "p.m.e.",
|
||||
# "p.s.",
|
||||
# "pab.",
|
||||
# "pan.",
|
||||
# "past.",
|
||||
# "pav.",
|
||||
# "pavad.",
|
||||
# "per.",
|
||||
# "perd.",
|
||||
# "pirm.",
|
||||
# "pl.",
|
||||
# "plg.",
|
||||
# "plk.",
|
||||
# "pr.",
|
||||
# "pr.Kr.",
|
||||
# "pranc.",
|
||||
# "proc.",
|
||||
# "prof.",
|
||||
# "prom.",
|
||||
# "prot.",
|
||||
# "psl.",
|
||||
# "pss.",
|
||||
# "pvz.",
|
||||
# "pšt.",
|
||||
# "r.",
|
||||
# "raj.",
|
||||
# "red.",
|
||||
# "rez.",
|
||||
# "rež.",
|
||||
# "rus.",
|
||||
# "rš.",
|
||||
# "s.",
|
||||
# "sav.",
|
||||
# "saviv.",
|
||||
# "sek.",
|
||||
# "sekr.",
|
||||
# "sen.",
|
||||
# "sh.",
|
||||
# "sk.",
|
||||
# "skg.",
|
||||
# "skv.",
|
||||
# "skyr.",
|
||||
# "sp.",
|
||||
# "spec.",
|
||||
# "sr.",
|
||||
# "st.",
|
||||
# "str.",
|
||||
# "stud.",
|
||||
# "sąs.",
|
||||
# "t.",
|
||||
# "t. p.",
|
||||
# "t. y.",
|
||||
# "t.p.",
|
||||
# "t.t.",
|
||||
# "t.y.",
|
||||
# "techn.",
|
||||
# "tel.",
|
||||
# "teol.",
|
||||
# "th.",
|
||||
# "tir.",
|
||||
# "trit.",
|
||||
# "trln.",
|
||||
# "tšk.",
|
||||
# "tūks.",
|
||||
# "tūkst.",
|
||||
# "up.",
|
||||
# "upl.",
|
||||
# "v.s.",
|
||||
# "vad.",
|
||||
# "val.",
|
||||
# "valg.",
|
||||
# "ved.",
|
||||
# "vert.",
|
||||
# "vet.",
|
||||
# "vid.",
|
||||
# "virš.",
|
||||
# "vlsč.",
|
||||
# "vnt.",
|
||||
# "vok.",
|
||||
# "vs.",
|
||||
# "vtv.",
|
||||
# "vv.",
|
||||
# "vyr.",
|
||||
# "vyresn.",
|
||||
# "zool.",
|
||||
# "Įn",
|
||||
# "įl.",
|
||||
# "š.m.",
|
||||
# "šnek.",
|
||||
# "šv.",
|
||||
# "švč.",
|
||||
# "ž.ū.",
|
||||
# "žin.",
|
||||
# "žml.",
|
||||
# "žr.",
|
||||
]:
|
||||
_exc[orth] = [{ORTH: orth}]
|
||||
|
||||
|
|
|
@ -1,4 +1,6 @@
|
|||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
|
||||
from .punctuation import TOKENIZER_SUFFIXES
|
||||
from .stop_words import STOP_WORDS
|
||||
from .morph_rules import MORPH_RULES
|
||||
from .syntax_iterators import SYNTAX_ITERATORS
|
||||
|
@ -18,6 +20,9 @@ class NorwegianDefaults(Language.Defaults):
|
|||
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
|
||||
)
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
prefixes = TOKENIZER_PREFIXES
|
||||
infixes = TOKENIZER_INFIXES
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
stop_words = STOP_WORDS
|
||||
morph_rules = MORPH_RULES
|
||||
tag_map = TAG_MAP
|
||||
|
|
|
@ -1,13 +1,29 @@
|
|||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
|
||||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_PUNCT, LIST_QUOTES
|
||||
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
||||
from ..punctuation import TOKENIZER_SUFFIXES
|
||||
from ..char_classes import CURRENCY, PUNCT, UNITS, LIST_CURRENCY
|
||||
|
||||
# Punctuation stolen from Danish
|
||||
|
||||
# Punctuation adapted from Danish
|
||||
_quotes = CONCAT_QUOTES.replace("'", "")
|
||||
_list_punct = [x for x in LIST_PUNCT if x != "#"]
|
||||
_list_icons = [x for x in LIST_ICONS if x != "°"]
|
||||
_list_icons = [x.replace("\\u00B0", "") for x in _list_icons]
|
||||
_list_quotes = [x for x in LIST_QUOTES if x != "\\'"]
|
||||
|
||||
|
||||
_prefixes = (
|
||||
["§", "%", "=", "—", "–", r"\+(?![0-9])"]
|
||||
+ _list_punct
|
||||
+ LIST_ELLIPSES
|
||||
+ LIST_QUOTES
|
||||
+ LIST_CURRENCY
|
||||
+ LIST_ICONS
|
||||
)
|
||||
|
||||
|
||||
_infixes = (
|
||||
LIST_ELLIPSES
|
||||
+ LIST_ICONS
|
||||
+ _list_icons
|
||||
+ [
|
||||
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
|
||||
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
||||
|
@ -18,13 +34,26 @@ _infixes = (
|
|||
]
|
||||
)
|
||||
|
||||
_suffixes = [
|
||||
suffix
|
||||
for suffix in TOKENIZER_SUFFIXES
|
||||
if suffix not in ["'s", "'S", "’s", "’S", r"\'"]
|
||||
]
|
||||
_suffixes += [r"(?<=[^sSxXzZ])\'"]
|
||||
_suffixes = (
|
||||
LIST_PUNCT
|
||||
+ LIST_ELLIPSES
|
||||
+ _list_quotes
|
||||
+ _list_icons
|
||||
+ ["—", "–"]
|
||||
+ [
|
||||
r"(?<=[0-9])\+",
|
||||
r"(?<=°[FfCcKk])\.",
|
||||
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
|
||||
r"(?<=[0-9])(?:{u})".format(u=UNITS),
|
||||
r"(?<=[{al}{e}{p}(?:{q})])\.".format(
|
||||
al=ALPHA_LOWER, e=r"%²\-\+", q=_quotes, p=PUNCT
|
||||
),
|
||||
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
|
||||
]
|
||||
+ [r"(?<=[^sSxXzZ])'"]
|
||||
)
|
||||
|
||||
|
||||
TOKENIZER_PREFIXES = _prefixes
|
||||
TOKENIZER_INFIXES = _infixes
|
||||
TOKENIZER_SUFFIXES = _suffixes
|
||||
|
|
|
@ -21,57 +21,80 @@ for exc_data in [
|
|||
|
||||
|
||||
for orth in [
|
||||
"adm.dir.",
|
||||
"a.m.",
|
||||
"andelsnr",
|
||||
"Ap.",
|
||||
"Aq.",
|
||||
"Ca.",
|
||||
"Chr.",
|
||||
"Co.",
|
||||
"Co.",
|
||||
"Dr.",
|
||||
"F.eks.",
|
||||
"Fr.p.",
|
||||
"Frp.",
|
||||
"Grl.",
|
||||
"Kr.",
|
||||
"Kr.F.",
|
||||
"Kr.F.s",
|
||||
"Mr.",
|
||||
"Mrs.",
|
||||
"Pb.",
|
||||
"Pr.",
|
||||
"Sp.",
|
||||
"Sp.",
|
||||
"St.",
|
||||
"a.m.",
|
||||
"ad.",
|
||||
"adm.dir.",
|
||||
"andelsnr",
|
||||
"b.c.",
|
||||
"bl.a.",
|
||||
"bla.",
|
||||
"bm.",
|
||||
"bnr.",
|
||||
"bto.",
|
||||
"c.c.",
|
||||
"ca.",
|
||||
"cand.mag.",
|
||||
"c.c.",
|
||||
"co.",
|
||||
"d.d.",
|
||||
"dept.",
|
||||
"d.m.",
|
||||
"dr.philos.",
|
||||
"dvs.",
|
||||
"d.y.",
|
||||
"E. coli",
|
||||
"dept.",
|
||||
"dr.",
|
||||
"dr.med.",
|
||||
"dr.philos.",
|
||||
"dr.psychol.",
|
||||
"dvs.",
|
||||
"e.Kr.",
|
||||
"e.l.",
|
||||
"eg.",
|
||||
"ekskl.",
|
||||
"e.Kr.",
|
||||
"el.",
|
||||
"e.l.",
|
||||
"et.",
|
||||
"etc.",
|
||||
"etg.",
|
||||
"ev.",
|
||||
"evt.",
|
||||
"f.",
|
||||
"f.Kr.",
|
||||
"f.eks.",
|
||||
"f.o.m.",
|
||||
"fhv.",
|
||||
"fk.",
|
||||
"f.Kr.",
|
||||
"f.o.m.",
|
||||
"foreg.",
|
||||
"fork.",
|
||||
"fv.",
|
||||
"fvt.",
|
||||
"g.",
|
||||
"gt.",
|
||||
"gl.",
|
||||
"gno.",
|
||||
"gnr.",
|
||||
"grl.",
|
||||
"gt.",
|
||||
"h.r.adv.",
|
||||
"hhv.",
|
||||
"hoh.",
|
||||
"hr.",
|
||||
"h.r.adv.",
|
||||
"ifb.",
|
||||
"ifm.",
|
||||
"iht.",
|
||||
|
@ -80,39 +103,45 @@ for orth in [
|
|||
"jf.",
|
||||
"jr.",
|
||||
"jun.",
|
||||
"juris.",
|
||||
"kfr.",
|
||||
"kgl.",
|
||||
"kgl.res.",
|
||||
"kl.",
|
||||
"komm.",
|
||||
"kr.",
|
||||
"kst.",
|
||||
"lat.",
|
||||
"lø.",
|
||||
"m.a.o.",
|
||||
"m.fl.",
|
||||
"m.m.",
|
||||
"m.v.",
|
||||
"ma.",
|
||||
"mag.art.",
|
||||
"m.a.o.",
|
||||
"md.",
|
||||
"mfl.",
|
||||
"mht.",
|
||||
"mill.",
|
||||
"min.",
|
||||
"m.m.",
|
||||
"mnd.",
|
||||
"moh.",
|
||||
"Mr.",
|
||||
"mrd.",
|
||||
"muh.",
|
||||
"mv.",
|
||||
"mva.",
|
||||
"n.å.",
|
||||
"ndf.",
|
||||
"no.",
|
||||
"nov.",
|
||||
"nr.",
|
||||
"nto.",
|
||||
"nyno.",
|
||||
"n.å.",
|
||||
"o.a.",
|
||||
"o.l.",
|
||||
"off.",
|
||||
"ofl.",
|
||||
"okt.",
|
||||
"o.l.",
|
||||
"on.",
|
||||
"op.",
|
||||
"org.",
|
||||
|
@ -120,14 +149,15 @@ for orth in [
|
|||
"ovf.",
|
||||
"p.",
|
||||
"p.a.",
|
||||
"Pb.",
|
||||
"p.g.a.",
|
||||
"p.m.",
|
||||
"p.t.",
|
||||
"pga.",
|
||||
"ph.d.",
|
||||
"pkt.",
|
||||
"p.m.",
|
||||
"pr.",
|
||||
"pst.",
|
||||
"p.t.",
|
||||
"pt.",
|
||||
"red.anm.",
|
||||
"ref.",
|
||||
"res.",
|
||||
|
@ -136,6 +166,10 @@ for orth in [
|
|||
"rv.",
|
||||
"s.",
|
||||
"s.d.",
|
||||
"s.k.",
|
||||
"s.k.",
|
||||
"s.u.",
|
||||
"s.å.",
|
||||
"sen.",
|
||||
"sep.",
|
||||
"siviling.",
|
||||
|
@ -145,16 +179,17 @@ for orth in [
|
|||
"sr.",
|
||||
"sst.",
|
||||
"st.",
|
||||
"stip.",
|
||||
"stk.",
|
||||
"st.meld.",
|
||||
"st.prp.",
|
||||
"stip.",
|
||||
"stk.",
|
||||
"stud.",
|
||||
"s.u.",
|
||||
"sv.",
|
||||
"sø.",
|
||||
"s.å.",
|
||||
"såk.",
|
||||
"sø.",
|
||||
"t.h.",
|
||||
"t.o.m.",
|
||||
"t.v.",
|
||||
"temp.",
|
||||
"ti.",
|
||||
"tils.",
|
||||
|
@ -162,7 +197,6 @@ for orth in [
|
|||
"tl;dr",
|
||||
"tlf.",
|
||||
"to.",
|
||||
"t.o.m.",
|
||||
"ult.",
|
||||
"utg.",
|
||||
"v.",
|
||||
|
@ -176,8 +210,10 @@ for orth in [
|
|||
"vol.",
|
||||
"vs.",
|
||||
"vsa.",
|
||||
"©NTB",
|
||||
"årg.",
|
||||
"årh.",
|
||||
"§§",
|
||||
]:
|
||||
_exc[orth] = [{ORTH: orth}]
|
||||
|
||||
|
|
|
@ -1,69 +1,47 @@
|
|||
from ...symbols import ORTH, NORM
|
||||
from ...symbols import ORTH
|
||||
|
||||
|
||||
_exc = {
|
||||
"às": [{ORTH: "à", NORM: "a"}, {ORTH: "s", NORM: "as"}],
|
||||
"ao": [{ORTH: "a"}, {ORTH: "o"}],
|
||||
"aos": [{ORTH: "a"}, {ORTH: "os"}],
|
||||
"àquele": [{ORTH: "à", NORM: "a"}, {ORTH: "quele", NORM: "aquele"}],
|
||||
"àquela": [{ORTH: "à", NORM: "a"}, {ORTH: "quela", NORM: "aquela"}],
|
||||
"àqueles": [{ORTH: "à", NORM: "a"}, {ORTH: "queles", NORM: "aqueles"}],
|
||||
"àquelas": [{ORTH: "à", NORM: "a"}, {ORTH: "quelas", NORM: "aquelas"}],
|
||||
"àquilo": [{ORTH: "à", NORM: "a"}, {ORTH: "quilo", NORM: "aquilo"}],
|
||||
"aonde": [{ORTH: "a"}, {ORTH: "onde"}],
|
||||
}
|
||||
|
||||
|
||||
# Contractions
|
||||
_per_pron = ["ele", "ela", "eles", "elas"]
|
||||
_dem_pron = [
|
||||
"este",
|
||||
"esta",
|
||||
"estes",
|
||||
"estas",
|
||||
"isto",
|
||||
"esse",
|
||||
"essa",
|
||||
"esses",
|
||||
"essas",
|
||||
"isso",
|
||||
"aquele",
|
||||
"aquela",
|
||||
"aqueles",
|
||||
"aquelas",
|
||||
"aquilo",
|
||||
]
|
||||
_und_pron = ["outro", "outra", "outros", "outras"]
|
||||
_adv = ["aqui", "aí", "ali", "além"]
|
||||
|
||||
|
||||
for orth in _per_pron + _dem_pron + _und_pron + _adv:
|
||||
_exc["d" + orth] = [{ORTH: "d", NORM: "de"}, {ORTH: orth}]
|
||||
|
||||
for orth in _per_pron + _dem_pron + _und_pron:
|
||||
_exc["n" + orth] = [{ORTH: "n", NORM: "em"}, {ORTH: orth}]
|
||||
_exc = {}
|
||||
|
||||
|
||||
for orth in [
|
||||
"Adm.",
|
||||
"Art.",
|
||||
"art.",
|
||||
"Av.",
|
||||
"av.",
|
||||
"Cia.",
|
||||
"dom.",
|
||||
"Dr.",
|
||||
"dr.",
|
||||
"e.g.",
|
||||
"E.g.",
|
||||
"E.G.",
|
||||
"e/ou",
|
||||
"ed.",
|
||||
"eng.",
|
||||
"etc.",
|
||||
"Fund.",
|
||||
"Gen.",
|
||||
"Gov.",
|
||||
"i.e.",
|
||||
"I.e.",
|
||||
"I.E.",
|
||||
"Inc.",
|
||||
"Jr.",
|
||||
"km/h",
|
||||
"Ltd.",
|
||||
"Mr.",
|
||||
"p.m.",
|
||||
"Ph.D.",
|
||||
"Rep.",
|
||||
"Rev.",
|
||||
"S/A",
|
||||
"Sen.",
|
||||
"Sr.",
|
||||
"sr.",
|
||||
"Sra.",
|
||||
"sra.",
|
||||
"vs.",
|
||||
"tel.",
|
||||
"pág.",
|
||||
|
|
|
@ -1,5 +1,7 @@
|
|||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .stop_words import STOP_WORDS
|
||||
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
|
||||
from .punctuation import TOKENIZER_SUFFIXES
|
||||
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ..norm_exceptions import BASE_NORMS
|
||||
|
@ -21,6 +23,9 @@ class RomanianDefaults(Language.Defaults):
|
|||
)
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
stop_words = STOP_WORDS
|
||||
prefixes = TOKENIZER_PREFIXES
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
infixes = TOKENIZER_INFIXES
|
||||
tag_map = TAG_MAP
|
||||
|
||||
|
||||
|
|
164
spacy/lang/ro/punctuation.py
Normal file
164
spacy/lang/ro/punctuation.py
Normal file
|
@ -0,0 +1,164 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import itertools
|
||||
|
||||
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY
|
||||
from ..char_classes import LIST_ICONS, CURRENCY
|
||||
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA, PUNCT
|
||||
|
||||
|
||||
_list_icons = [x for x in LIST_ICONS if x != "°"]
|
||||
_list_icons = [x.replace("\\u00B0", "") for x in _list_icons]
|
||||
|
||||
|
||||
_ro_variants = {
|
||||
"Ă": ["Ă", "A"],
|
||||
"Â": ["Â", "A"],
|
||||
"Î": ["Î", "I"],
|
||||
"Ș": ["Ș", "Ş", "S"],
|
||||
"Ț": ["Ț", "Ţ", "T"],
|
||||
}
|
||||
|
||||
|
||||
def _make_ro_variants(tokens):
|
||||
variants = []
|
||||
for token in tokens:
|
||||
upper_token = token.upper()
|
||||
upper_char_variants = [_ro_variants.get(c, [c]) for c in upper_token]
|
||||
upper_variants = ["".join(x) for x in itertools.product(*upper_char_variants)]
|
||||
for variant in upper_variants:
|
||||
variants.extend([variant, variant.lower(), variant.title()])
|
||||
return sorted(list(set(variants)))
|
||||
|
||||
|
||||
# UD_Romanian-RRT closed class prefixes
|
||||
# POS: ADP|AUX|CCONJ|DET|NUM|PART|PRON|SCONJ
|
||||
_ud_rrt_prefixes = [
|
||||
"a-",
|
||||
"c-",
|
||||
"ce-",
|
||||
"cu-",
|
||||
"d-",
|
||||
"de-",
|
||||
"dintr-",
|
||||
"e-",
|
||||
"făr-",
|
||||
"i-",
|
||||
"l-",
|
||||
"le-",
|
||||
"m-",
|
||||
"mi-",
|
||||
"n-",
|
||||
"ne-",
|
||||
"p-",
|
||||
"pe-",
|
||||
"prim-",
|
||||
"printr-",
|
||||
"s-",
|
||||
"se-",
|
||||
"te-",
|
||||
"v-",
|
||||
"într-",
|
||||
"ș-",
|
||||
"și-",
|
||||
"ți-",
|
||||
]
|
||||
_ud_rrt_prefix_variants = _make_ro_variants(_ud_rrt_prefixes)
|
||||
|
||||
|
||||
# UD_Romanian-RRT closed class suffixes without NUM
|
||||
# POS: ADP|AUX|CCONJ|DET|PART|PRON|SCONJ
|
||||
_ud_rrt_suffixes = [
|
||||
"-a",
|
||||
"-aceasta",
|
||||
"-ai",
|
||||
"-al",
|
||||
"-ale",
|
||||
"-alta",
|
||||
"-am",
|
||||
"-ar",
|
||||
"-astea",
|
||||
"-atâta",
|
||||
"-au",
|
||||
"-aș",
|
||||
"-ați",
|
||||
"-i",
|
||||
"-ilor",
|
||||
"-l",
|
||||
"-le",
|
||||
"-lea",
|
||||
"-mea",
|
||||
"-meu",
|
||||
"-mi",
|
||||
"-mă",
|
||||
"-n",
|
||||
"-ndărătul",
|
||||
"-ne",
|
||||
"-o",
|
||||
"-oi",
|
||||
"-or",
|
||||
"-s",
|
||||
"-se",
|
||||
"-si",
|
||||
"-te",
|
||||
"-ul",
|
||||
"-ului",
|
||||
"-un",
|
||||
"-uri",
|
||||
"-urile",
|
||||
"-urilor",
|
||||
"-veți",
|
||||
"-vă",
|
||||
"-ăștia",
|
||||
"-și",
|
||||
"-ți",
|
||||
]
|
||||
_ud_rrt_suffix_variants = _make_ro_variants(_ud_rrt_suffixes)
|
||||
|
||||
|
||||
_prefixes = (
|
||||
["§", "%", "=", "—", "–", r"\+(?![0-9])"]
|
||||
+ _ud_rrt_prefix_variants
|
||||
+ LIST_PUNCT
|
||||
+ LIST_ELLIPSES
|
||||
+ LIST_QUOTES
|
||||
+ LIST_CURRENCY
|
||||
+ LIST_ICONS
|
||||
)
|
||||
|
||||
|
||||
_suffixes = (
|
||||
_ud_rrt_suffix_variants
|
||||
+ LIST_PUNCT
|
||||
+ LIST_ELLIPSES
|
||||
+ LIST_QUOTES
|
||||
+ _list_icons
|
||||
+ ["—", "–"]
|
||||
+ [
|
||||
r"(?<=[0-9])\+",
|
||||
r"(?<=°[FfCcKk])\.",
|
||||
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
|
||||
r"(?<=[0-9{al}{e}{p}(?:{q})])\.".format(
|
||||
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
|
||||
),
|
||||
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
|
||||
]
|
||||
)
|
||||
|
||||
_infixes = (
|
||||
LIST_ELLIPSES
|
||||
+ _list_icons
|
||||
+ [
|
||||
r"(?<=[0-9])[+\*^](?=[0-9-])",
|
||||
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
|
||||
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
|
||||
),
|
||||
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA),
|
||||
]
|
||||
)
|
||||
|
||||
TOKENIZER_PREFIXES = _prefixes
|
||||
TOKENIZER_SUFFIXES = _suffixes
|
||||
TOKENIZER_INFIXES = _infixes
|
|
@ -1,4 +1,5 @@
|
|||
from ...symbols import ORTH
|
||||
from .punctuation import _make_ro_variants
|
||||
|
||||
|
||||
_exc = {}
|
||||
|
@ -42,8 +43,52 @@ for orth in [
|
|||
"dpdv",
|
||||
"șamd.",
|
||||
"ș.a.m.d.",
|
||||
# below: from UD_Romanian-RRT:
|
||||
"A.c.",
|
||||
"A.f.",
|
||||
"A.r.",
|
||||
"Al.",
|
||||
"Art.",
|
||||
"Aug.",
|
||||
"Bd.",
|
||||
"Dem.",
|
||||
"Dr.",
|
||||
"Fig.",
|
||||
"Fr.",
|
||||
"Gh.",
|
||||
"Gr.",
|
||||
"Lt.",
|
||||
"Nr.",
|
||||
"Obs.",
|
||||
"Prof.",
|
||||
"Sf.",
|
||||
"a.m.",
|
||||
"a.r.",
|
||||
"alin.",
|
||||
"art.",
|
||||
"d-l",
|
||||
"d-lui",
|
||||
"d-nei",
|
||||
"ex.",
|
||||
"fig.",
|
||||
"ian.",
|
||||
"lit.",
|
||||
"lt.",
|
||||
"p.a.",
|
||||
"p.m.",
|
||||
"pct.",
|
||||
"prep.",
|
||||
"sf.",
|
||||
"tel.",
|
||||
"univ.",
|
||||
"îngr.",
|
||||
"într-adevăr",
|
||||
"Șt.",
|
||||
"ș.a.",
|
||||
]:
|
||||
_exc[orth] = [{ORTH: orth}]
|
||||
# note: does not distinguish capitalized-only exceptions from others
|
||||
for variant in _make_ro_variants([orth]):
|
||||
_exc[variant] = [{ORTH: variant}]
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = _exc
|
||||
|
|
|
@ -13,6 +13,7 @@ import multiprocessing as mp
|
|||
from itertools import chain, cycle
|
||||
|
||||
from .tokenizer import Tokenizer
|
||||
from .tokens.underscore import Underscore
|
||||
from .vocab import Vocab
|
||||
from .lemmatizer import Lemmatizer
|
||||
from .lookups import Lookups
|
||||
|
@ -874,7 +875,10 @@ class Language(object):
|
|||
sender.send()
|
||||
|
||||
procs = [
|
||||
mp.Process(target=_apply_pipes, args=(self.make_doc, pipes, rch, sch))
|
||||
mp.Process(
|
||||
target=_apply_pipes,
|
||||
args=(self.make_doc, pipes, rch, sch, Underscore.get_state()),
|
||||
)
|
||||
for rch, sch in zip(texts_q, bytedocs_send_ch)
|
||||
]
|
||||
for proc in procs:
|
||||
|
@ -1146,16 +1150,19 @@ def _pipe(examples, proc, kwargs):
|
|||
yield ex
|
||||
|
||||
|
||||
def _apply_pipes(make_doc, pipes, reciever, sender):
|
||||
def _apply_pipes(make_doc, pipes, receiver, sender, underscore_state, vectors):
|
||||
"""Worker for Language.pipe
|
||||
|
||||
receiver (multiprocessing.Connection): Pipe to receive text. Usually
|
||||
created by `multiprocessing.Pipe()`
|
||||
sender (multiprocessing.Connection): Pipe to send doc. Usually created by
|
||||
`multiprocessing.Pipe()`
|
||||
underscore_state (tuple): The data in the Underscore class of the parent
|
||||
vectors (dict): The global vectors data, copied from the parent
|
||||
"""
|
||||
Underscore.load_state(underscore_state)
|
||||
while True:
|
||||
texts = reciever.get()
|
||||
texts = receiver.get()
|
||||
docs = (make_doc(text) for text in texts)
|
||||
for pipe in pipes:
|
||||
docs = pipe(docs)
|
||||
|
|
|
@ -664,6 +664,8 @@ def _get_attr_values(spec, string_store):
|
|||
continue
|
||||
if attr == "TEXT":
|
||||
attr = "ORTH"
|
||||
if attr == "IS_SENT_START":
|
||||
attr = "SENT_START"
|
||||
attr = IDS.get(attr)
|
||||
if isinstance(value, basestring):
|
||||
value = string_store.add(value)
|
||||
|
|
|
@ -365,7 +365,7 @@ class Tensorizer(Pipe):
|
|||
return sgd
|
||||
|
||||
|
||||
@component("tagger", assigns=["token.tag", "token.pos"])
|
||||
@component("tagger", assigns=["token.tag", "token.pos", "token.lemma"])
|
||||
class Tagger(Pipe):
|
||||
"""Pipeline component for part-of-speech tagging.
|
||||
|
||||
|
|
|
@ -464,3 +464,5 @@ cdef enum symbol_t:
|
|||
ENT_KB_ID
|
||||
MORPH
|
||||
ENT_ID
|
||||
|
||||
IDX
|
|
@ -89,6 +89,7 @@ IDS = {
|
|||
"SPACY": SPACY,
|
||||
"PROB": PROB,
|
||||
"LANG": LANG,
|
||||
"IDX": IDX,
|
||||
|
||||
"ADJ": ADJ,
|
||||
"ADP": ADP,
|
||||
|
|
|
@ -80,6 +80,11 @@ def es_tokenizer():
|
|||
return get_lang_class("es").Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def eu_tokenizer():
|
||||
return get_lang_class("eu").Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def fi_tokenizer():
|
||||
return get_lang_class("fi").Defaults.create_tokenizer()
|
||||
|
|
|
@ -63,3 +63,39 @@ def test_doc_array_to_from_string_attrs(en_vocab, attrs):
|
|||
words = ["An", "example", "sentence"]
|
||||
doc = Doc(en_vocab, words=words)
|
||||
Doc(en_vocab, words=words).from_array(attrs, doc.to_array(attrs))
|
||||
|
||||
|
||||
def test_doc_array_idx(en_vocab):
|
||||
"""Test that Doc.to_array can retrieve token start indices"""
|
||||
words = ["An", "example", "sentence"]
|
||||
offsets = Doc(en_vocab, words=words).to_array("IDX")
|
||||
assert offsets[0] == 0
|
||||
assert offsets[1] == 3
|
||||
assert offsets[2] == 11
|
||||
|
||||
|
||||
def test_doc_from_array_heads_in_bounds(en_vocab):
|
||||
"""Test that Doc.from_array doesn't set heads that are out of bounds."""
|
||||
words = ["This", "is", "a", "sentence", "."]
|
||||
doc = Doc(en_vocab, words=words)
|
||||
for token in doc:
|
||||
token.head = doc[0]
|
||||
|
||||
# correct
|
||||
arr = doc.to_array(["HEAD"])
|
||||
doc_from_array = Doc(en_vocab, words=words)
|
||||
doc_from_array.from_array(["HEAD"], arr)
|
||||
|
||||
# head before start
|
||||
arr = doc.to_array(["HEAD"])
|
||||
arr[0] = -1
|
||||
doc_from_array = Doc(en_vocab, words=words)
|
||||
with pytest.raises(ValueError):
|
||||
doc_from_array.from_array(["HEAD"], arr)
|
||||
|
||||
# head after end
|
||||
arr = doc.to_array(["HEAD"])
|
||||
arr[0] = 5
|
||||
doc_from_array = Doc(en_vocab, words=words)
|
||||
with pytest.raises(ValueError):
|
||||
doc_from_array.from_array(["HEAD"], arr)
|
||||
|
|
|
@ -145,10 +145,9 @@ def test_doc_api_runtime_error(en_tokenizer):
|
|||
# Example that caused run-time error while parsing Reddit
|
||||
# fmt: off
|
||||
text = "67% of black households are single parent \n\n72% of all black babies born out of wedlock \n\n50% of all black kids don\u2019t finish high school"
|
||||
deps = ["nsubj", "prep", "amod", "pobj", "ROOT", "amod", "attr", "",
|
||||
"nummod", "prep", "det", "amod", "pobj", "acl", "prep", "prep",
|
||||
"pobj", "", "nummod", "prep", "det", "amod", "pobj", "aux", "neg",
|
||||
"ROOT", "amod", "dobj"]
|
||||
deps = ["nummod", "nsubj", "prep", "amod", "pobj", "ROOT", "amod", "attr", "", "nummod", "appos", "prep", "det",
|
||||
"amod", "pobj", "acl", "prep", "prep", "pobj",
|
||||
"", "nummod", "nsubj", "prep", "det", "amod", "pobj", "aux", "neg", "ccomp", "amod", "dobj"]
|
||||
# fmt: on
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
|
||||
|
@ -272,19 +271,9 @@ def test_doc_is_nered(en_vocab):
|
|||
def test_doc_from_array_sent_starts(en_vocab):
|
||||
words = ["I", "live", "in", "New", "York", ".", "I", "like", "cats", "."]
|
||||
heads = [0, 0, 0, 0, 0, 0, 6, 6, 6, 6]
|
||||
deps = [
|
||||
"ROOT",
|
||||
"dep",
|
||||
"dep",
|
||||
"dep",
|
||||
"dep",
|
||||
"dep",
|
||||
"ROOT",
|
||||
"dep",
|
||||
"dep",
|
||||
"dep",
|
||||
"dep",
|
||||
]
|
||||
# fmt: off
|
||||
deps = ["ROOT", "dep", "dep", "dep", "dep", "dep", "ROOT", "dep", "dep", "dep", "dep"]
|
||||
# fmt: on
|
||||
doc = Doc(en_vocab, words=words)
|
||||
for i, (dep, head) in enumerate(zip(deps, heads)):
|
||||
doc[i].dep_ = dep
|
||||
|
|
|
@ -164,6 +164,11 @@ def test_doc_token_api_head_setter(en_tokenizer):
|
|||
assert doc[4].left_edge.i == 0
|
||||
assert doc[2].left_edge.i == 0
|
||||
|
||||
# head token must be from the same document
|
||||
doc2 = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
with pytest.raises(ValueError):
|
||||
doc[0].head = doc2[0]
|
||||
|
||||
|
||||
def test_is_sent_start(en_tokenizer):
|
||||
doc = en_tokenizer("This is a sentence. This is another.")
|
||||
|
@ -211,7 +216,7 @@ def test_token_api_conjuncts_chain(en_vocab):
|
|||
def test_token_api_conjuncts_simple(en_vocab):
|
||||
words = "They came and went .".split()
|
||||
heads = [1, 0, -1, -2, -1]
|
||||
deps = ["nsubj", "ROOT", "cc", "conj"]
|
||||
deps = ["nsubj", "ROOT", "cc", "conj", "dep"]
|
||||
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
|
||||
assert [w.text for w in doc[1].conjuncts] == ["went"]
|
||||
assert [w.text for w in doc[3].conjuncts] == ["came"]
|
||||
|
|
|
@ -4,6 +4,15 @@ from spacy.tokens import Doc, Span, Token
|
|||
from spacy.tokens.underscore import Underscore
|
||||
|
||||
|
||||
@pytest.fixture(scope="function", autouse=True)
|
||||
def clean_underscore():
|
||||
# reset the Underscore object after the test, to avoid having state copied across tests
|
||||
yield
|
||||
Underscore.doc_extensions = {}
|
||||
Underscore.span_extensions = {}
|
||||
Underscore.token_extensions = {}
|
||||
|
||||
|
||||
def test_create_doc_underscore():
|
||||
doc = Mock()
|
||||
doc.doc = doc
|
||||
|
|
|
@ -55,7 +55,8 @@ def test_da_tokenizer_norm_exceptions(da_tokenizer, text, norm):
|
|||
("Kristiansen c/o Madsen", 3),
|
||||
("Sprogteknologi a/s", 2),
|
||||
("De boede i A/B Bellevue", 5),
|
||||
("Rotorhastigheden er 3400 o/m.", 5),
|
||||
# note: skipping due to weirdness in UD_Danish-DDT
|
||||
# ("Rotorhastigheden er 3400 o/m.", 5),
|
||||
("Jeg købte billet t/r.", 5),
|
||||
("Murerarbejdsmand m/k søges", 3),
|
||||
("Netværket kører over TCP/IP", 4),
|
||||
|
|
22
spacy/tests/lang/eu/test_text.py
Normal file
22
spacy/tests/lang/eu/test_text.py
Normal file
|
@ -0,0 +1,22 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
def test_eu_tokenizer_handles_long_text(eu_tokenizer):
|
||||
text = """ta nere guitarra estrenatu ondoren"""
|
||||
tokens = eu_tokenizer(text)
|
||||
assert len(tokens) == 5
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"text,length",
|
||||
[
|
||||
("milesker ederra joan zen hitzaldia plazer hutsa", 7),
|
||||
("astelehen guztia sofan pasau biot", 5),
|
||||
],
|
||||
)
|
||||
def test_eu_tokenizer_handles_cnts(eu_tokenizer, text, length):
|
||||
tokens = eu_tokenizer(text)
|
||||
assert len(tokens) == length
|
|
@ -7,12 +7,22 @@ ABBREVIATION_TESTS = [
|
|||
["Hyvää", "uutta", "vuotta", "t.", "siht.", "Niemelä", "!"],
|
||||
),
|
||||
("Paino on n. 2.2 kg", ["Paino", "on", "n.", "2.2", "kg"]),
|
||||
(
|
||||
"Vuonna 1 eaa. tapahtui kauheita.",
|
||||
["Vuonna", "1", "eaa.", "tapahtui", "kauheita", "."],
|
||||
),
|
||||
]
|
||||
|
||||
HYPHENATED_TESTS = [
|
||||
(
|
||||
"1700-luvulle sijoittuva taide-elokuva",
|
||||
["1700-luvulle", "sijoittuva", "taide-elokuva"],
|
||||
"1700-luvulle sijoittuva taide-elokuva Wikimedia-säätiön Varsinais-Suomen",
|
||||
[
|
||||
"1700-luvulle",
|
||||
"sijoittuva",
|
||||
"taide-elokuva",
|
||||
"Wikimedia-säätiön",
|
||||
"Varsinais-Suomen",
|
||||
],
|
||||
)
|
||||
]
|
||||
|
||||
|
@ -23,6 +33,7 @@ ABBREVIATION_INFLECTION_TESTS = [
|
|||
),
|
||||
("ALV:n osuus on 24 %.", ["ALV:n", "osuus", "on", "24", "%", "."]),
|
||||
("Hiihtäjä oli kilpailun 14:s.", ["Hiihtäjä", "oli", "kilpailun", "14:s", "."]),
|
||||
("EU:n toimesta tehtiin jotain.", ["EU:n", "toimesta", "tehtiin", "jotain", "."]),
|
||||
]
|
||||
|
||||
|
||||
|
|
|
@ -12,11 +12,11 @@ def test_lt_tokenizer_handles_long_text(lt_tokenizer):
|
|||
[
|
||||
(
|
||||
"177R Parodų rūmai–Ozo g. nuo vasario 18 d. bus skelbiamas interneto tinklalapyje.",
|
||||
15,
|
||||
17,
|
||||
),
|
||||
(
|
||||
"ISM universiteto doc. dr. Ieva Augutytė-Kvedaravičienė pastebi, kad tyrimais nustatyti elgesio pokyčiai.",
|
||||
16,
|
||||
18,
|
||||
),
|
||||
],
|
||||
)
|
||||
|
@ -28,7 +28,7 @@ def test_lt_tokenizer_handles_punct_abbrev(lt_tokenizer, text, length):
|
|||
@pytest.mark.parametrize("text", ["km.", "pvz.", "biol."])
|
||||
def test_lt_tokenizer_abbrev_exceptions(lt_tokenizer, text):
|
||||
tokens = lt_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
assert len(tokens) == 2
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
|
|
|
@ -4,6 +4,8 @@ from mock import Mock
|
|||
from spacy.matcher import Matcher, DependencyMatcher
|
||||
from spacy.tokens import Doc, Token
|
||||
|
||||
from ..doc.test_underscore import clean_underscore # noqa: F401
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def matcher(en_vocab):
|
||||
|
@ -197,6 +199,7 @@ def test_matcher_any_token_operator(en_vocab):
|
|||
assert matches[2] == "test hello world"
|
||||
|
||||
|
||||
@pytest.mark.usefixtures("clean_underscore")
|
||||
def test_matcher_extension_attribute(en_vocab):
|
||||
matcher = Matcher(en_vocab)
|
||||
get_is_fruit = lambda token: token.text in ("apple", "banana")
|
||||
|
|
|
@ -31,6 +31,8 @@ TEST_PATTERNS = [
|
|||
([{"NORM": "a"}, {"POS": {"IN": ["NOUN"]}}], 0, 0),
|
||||
([{"_": {"foo": {"NOT_IN": ["bar", "baz"]}, "a": 5, "b": {">": 10}}}], 0, 0),
|
||||
([{"orth": "foo"}], 0, 0), # prev: xfail
|
||||
([{"IS_SENT_START": True}], 0, 0),
|
||||
([{"SENT_START": True}], 0, 0),
|
||||
]
|
||||
|
||||
|
||||
|
|
|
@ -31,23 +31,23 @@ BIG BROTHER IS WATCHING YOU, the caption beneath it ran.
|
|||
@pytest.fixture
|
||||
def heads():
|
||||
# fmt: off
|
||||
return [1, 1, 0, 3, 2, 1, -4, -1, -1, -7, -8, 1, -10, 2, 1, -3, -1, -15,
|
||||
-1, 1, 4, -1, 1, -3, 0, -1, 1, -2, -4, 1, -2, 1, -2, 3, -1, 1,
|
||||
-4, -13, -14, -1, -2, 2, 1, -3, -1, 1, -2, -9, -1, 3, 1, 1, -14,
|
||||
1, -2, 1, -2, -1, 1, -2, -6, -1, -1, -2, -1, -1, -42, -1, 2, 1,
|
||||
0, -1, 1, -2, -1, 2, 1, -4, -8, 0, 1, -2, -1, -1, 3, -1, 1, -6,
|
||||
9, 1, 7, -1, 1, -2, 3, 2, 1, -10, -1, 1, -2, -22, -1, 1, 0, -1,
|
||||
2, 1, -4, -1, -2, -1, 1, -2, -6, -7, 1, -9, -1, 2, -1, -3, -1,
|
||||
3, 2, 1, -4, -19, -24, 3, 2, 1, -4, -1, 1, 2, -1, -5, -34, 1, 0,
|
||||
-1, 1, -2, -4, 1, 0, 1, -2, -1, 1, -2, -6, 1, 9, -1, 1, -3, -1,
|
||||
-1, 3, 2, 1, 0, -1, -2, 7, -1, 5, 1, 3, -1, 1, -10, -1, -2, 1,
|
||||
-2, -15, 1, 0, -1, -1, 2, 1, -3, -1, -1, -2, -1, 1, -2, -12, 1,
|
||||
1, 0, 1, -2, -1, -2, -3, 9, -1, 2, -1, -4, 2, 1, -3, -4, -15, 2,
|
||||
1, -3, -1, 2, 1, -3, -8, -9, -1, -2, -1, -4, 1, -2, -3, 1, -2,
|
||||
-19, 17, 1, -2, 14, 13, 3, 2, 1, -4, 8, -1, 1, 5, -1, 2, 1, -3,
|
||||
return [1, 1, 0, 3, 2, 1, -4, -1, -1, -7, -8, 1, 2, 1, -12, -1, -2,
|
||||
-1, 1, 4, 3, 1, 1, 0, -1, 1, -2, -4, 1, -2, 1, -2, 3, -1, 1,
|
||||
-4, -13, -14, -1, -2, 2, 1, -3, -1, 1, -2, -9, -1, -11, 1, 1, -14,
|
||||
1, -2, 1, -2, -1, 1, -2, -6, -1, -1, -2, -1, -1, -42, -1, 1, 1,
|
||||
0, -1, 1, -2, -1, 2, 1, -4, -8, 18, 1, -2, -1, -1, 3, -1, 1, 10,
|
||||
9, 1, 7, -1, 1, -2, 3, 2, 1, 0, -1, 1, -2, -4, -1, 1, 0, -1,
|
||||
2, 1, -4, -1, 2, 1, 1, 1, -6, -11, 1, 20, -1, 2, -1, -3, -1,
|
||||
3, 2, 1, -4, -10, -11, 3, 2, 1, -4, -1, 1, -3, -1, 0, -1, 1, 0,
|
||||
-1, 1, -2, -4, 1, 0, 1, -2, -1, 1, -2, -6, 1, 9, -1, 1, 6, -1,
|
||||
-1, 3, 2, 1, 0, -1, -2, 7, -1, 2, 1, 3, -1, 1, -10, -1, -2, 1,
|
||||
-2, -5, 1, 0, -1, -1, 1, -2, -5, -1, -1, -2, -1, 1, -2, -12, 1,
|
||||
1, 0, 1, -2, -1, -4, -5, 18, -1, 2, -1, -4, 2, 1, -3, -4, -5, 2,
|
||||
1, -3, -1, 2, 1, -3, -17, -24, -1, -2, -1, -4, 1, -2, -3, 1, -2,
|
||||
-10, 17, 1, -2, 14, 13, 3, 2, 1, -4, 8, -1, 1, 5, -1, 2, 1, -3,
|
||||
0, -1, 1, -2, -4, 1, 0, -1, -1, 2, -1, -3, 1, -2, 1, -2, 3, 1,
|
||||
1, -4, -1, -2, 2, 1, -5, -19, -1, 1, 1, 0, 1, 6, -1, 1, -3, -1,
|
||||
-1, -8, -9, -1]
|
||||
1, -4, -1, -2, 2, 1, -3, -19, -1, 1, 1, 0, 0, 6, 5, 1, 3, -1,
|
||||
-1, 0, -1, -1]
|
||||
# fmt: on
|
||||
|
||||
|
||||
|
|
|
@ -48,7 +48,7 @@ def test_issue2203(en_vocab):
|
|||
tag_ids = [en_vocab.strings.add(tag) for tag in tags]
|
||||
lemma_ids = [en_vocab.strings.add(lemma) for lemma in lemmas]
|
||||
doc = Doc(en_vocab, words=words)
|
||||
# Work around lemma corrpution problem and set lemmas after tags
|
||||
# Work around lemma corruption problem and set lemmas after tags
|
||||
doc.from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
|
||||
doc.from_array("LEMMA", numpy.array(lemma_ids, dtype="uint64"))
|
||||
assert [t.tag_ for t in doc] == tags
|
||||
|
|
|
@ -121,7 +121,7 @@ def test_issue2772(en_vocab):
|
|||
words = "When we write or communicate virtually , we can hide our true feelings .".split()
|
||||
# A tree with a non-projective (i.e. crossing) arc
|
||||
# The arcs (0, 4) and (2, 9) cross.
|
||||
heads = [4, 1, 7, -1, -2, -1, 3, 2, 1, 0, -1, -2, -1]
|
||||
heads = [4, 1, 7, -1, -2, -1, 3, 2, 1, 0, 2, 1, -3, -4]
|
||||
deps = ["dep"] * len(heads)
|
||||
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
|
||||
assert doc[1].is_sent_start is None
|
||||
|
|
|
@ -24,7 +24,7 @@ def test_issue4590(en_vocab):
|
|||
|
||||
text = "The quick brown fox jumped over the lazy fox"
|
||||
heads = [3, 2, 1, 1, 0, -1, 2, 1, -3]
|
||||
deps = ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"]
|
||||
deps = ["det", "amod", "amod", "nsubj", "ROOT", "prep", "det", "amod", "pobj"]
|
||||
|
||||
doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps)
|
||||
|
||||
|
|
25
spacy/tests/regression/test_issue4725.py
Normal file
25
spacy/tests/regression/test_issue4725.py
Normal file
|
@ -0,0 +1,25 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import numpy
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.vocab import Vocab
|
||||
|
||||
|
||||
def test_issue4725():
|
||||
# ensures that this runs correctly and doesn't hang or crash because of the global vectors
|
||||
vocab = Vocab(vectors_name="test_vocab_add_vector")
|
||||
data = numpy.ndarray((5, 3), dtype="f")
|
||||
data[0] = 1.0
|
||||
data[1] = 2.0
|
||||
vocab.set_vector("cat", data[0])
|
||||
vocab.set_vector("dog", data[1])
|
||||
|
||||
nlp = English(vocab=vocab)
|
||||
ner = nlp.create_pipe("ner")
|
||||
nlp.add_pipe(ner)
|
||||
nlp.begin_training()
|
||||
docs = ["Kurt is in London."] * 10
|
||||
for _ in nlp.pipe(docs, batch_size=2, n_process=2):
|
||||
pass
|
43
spacy/tests/regression/test_issue4903.py
Normal file
43
spacy/tests/regression/test_issue4903.py
Normal file
|
@ -0,0 +1,43 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.tokens import Span, Doc
|
||||
|
||||
|
||||
class CustomPipe:
|
||||
name = "my_pipe"
|
||||
|
||||
def __init__(self):
|
||||
Span.set_extension("my_ext", getter=self._get_my_ext)
|
||||
Doc.set_extension("my_ext", default=None)
|
||||
|
||||
def __call__(self, doc):
|
||||
gathered_ext = []
|
||||
for sent in doc.sents:
|
||||
sent_ext = self._get_my_ext(sent)
|
||||
sent._.set("my_ext", sent_ext)
|
||||
gathered_ext.append(sent_ext)
|
||||
|
||||
doc._.set("my_ext", "\n".join(gathered_ext))
|
||||
|
||||
return doc
|
||||
|
||||
@staticmethod
|
||||
def _get_my_ext(span):
|
||||
return str(span.end)
|
||||
|
||||
|
||||
def test_issue4903():
|
||||
# ensures that this runs correctly and doesn't hang or crash on Windows / macOS
|
||||
|
||||
nlp = English()
|
||||
custom_component = CustomPipe()
|
||||
nlp.add_pipe(nlp.create_pipe("sentencizer"))
|
||||
nlp.add_pipe(custom_component, after="sentencizer")
|
||||
|
||||
text = ["I like bananas.", "Do you like them?", "No, I prefer wasabi."]
|
||||
docs = list(nlp.pipe(text, n_process=2))
|
||||
assert docs[0].text == "I like bananas."
|
||||
assert docs[1].text == "Do you like them?"
|
||||
assert docs[2].text == "No, I prefer wasabi."
|
|
@ -2,7 +2,7 @@ import pytest
|
|||
from spacy.language import Language
|
||||
|
||||
|
||||
def test_evaluate():
|
||||
def test_issue4924():
|
||||
nlp = Language()
|
||||
docs_golds = [("", {})]
|
||||
with pytest.raises(ValueError):
|
||||
|
|
35
spacy/tests/regression/test_issue5048.py
Normal file
35
spacy/tests/regression/test_issue5048.py
Normal file
|
@ -0,0 +1,35 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import numpy
|
||||
from spacy.tokens import Doc
|
||||
from spacy.attrs import DEP, POS, TAG
|
||||
|
||||
from ..util import get_doc
|
||||
|
||||
|
||||
def test_issue5048(en_vocab):
|
||||
words = ["This", "is", "a", "sentence"]
|
||||
pos_s = ["DET", "VERB", "DET", "NOUN"]
|
||||
spaces = [" ", " ", " ", ""]
|
||||
deps_s = ["dep", "adj", "nn", "atm"]
|
||||
tags_s = ["DT", "VBZ", "DT", "NN"]
|
||||
|
||||
strings = en_vocab.strings
|
||||
|
||||
for w in words:
|
||||
strings.add(w)
|
||||
deps = [strings.add(d) for d in deps_s]
|
||||
pos = [strings.add(p) for p in pos_s]
|
||||
tags = [strings.add(t) for t in tags_s]
|
||||
|
||||
attrs = [POS, DEP, TAG]
|
||||
array = numpy.array(list(zip(pos, deps, tags)), dtype="uint64")
|
||||
|
||||
doc = Doc(en_vocab, words=words, spaces=spaces)
|
||||
doc.from_array(attrs, array)
|
||||
v1 = [(token.text, token.pos_, token.tag_) for token in doc]
|
||||
|
||||
doc2 = get_doc(en_vocab, words=words, pos=pos_s, deps=deps_s, tags=tags_s)
|
||||
v2 = [(token.text, token.pos_, token.tag_) for token in doc2]
|
||||
assert v1 == v2
|
46
spacy/tests/regression/test_issue5082.py
Normal file
46
spacy/tests/regression/test_issue5082.py
Normal file
|
@ -0,0 +1,46 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import numpy as np
|
||||
from spacy.lang.en import English
|
||||
from spacy.pipeline import EntityRuler
|
||||
|
||||
|
||||
def test_issue5082():
|
||||
# Ensure the 'merge_entities' pipeline does something sensible for the vectors of the merged tokens
|
||||
nlp = English()
|
||||
vocab = nlp.vocab
|
||||
array1 = np.asarray([0.1, 0.5, 0.8], dtype=np.float32)
|
||||
array2 = np.asarray([-0.2, -0.6, -0.9], dtype=np.float32)
|
||||
array3 = np.asarray([0.3, -0.1, 0.7], dtype=np.float32)
|
||||
array4 = np.asarray([0.5, 0, 0.3], dtype=np.float32)
|
||||
array34 = np.asarray([0.4, -0.05, 0.5], dtype=np.float32)
|
||||
|
||||
vocab.set_vector("I", array1)
|
||||
vocab.set_vector("like", array2)
|
||||
vocab.set_vector("David", array3)
|
||||
vocab.set_vector("Bowie", array4)
|
||||
|
||||
text = "I like David Bowie"
|
||||
ruler = EntityRuler(nlp)
|
||||
patterns = [
|
||||
{"label": "PERSON", "pattern": [{"LOWER": "david"}, {"LOWER": "bowie"}]}
|
||||
]
|
||||
ruler.add_patterns(patterns)
|
||||
nlp.add_pipe(ruler)
|
||||
|
||||
parsed_vectors_1 = [t.vector for t in nlp(text)]
|
||||
assert len(parsed_vectors_1) == 4
|
||||
np.testing.assert_array_equal(parsed_vectors_1[0], array1)
|
||||
np.testing.assert_array_equal(parsed_vectors_1[1], array2)
|
||||
np.testing.assert_array_equal(parsed_vectors_1[2], array3)
|
||||
np.testing.assert_array_equal(parsed_vectors_1[3], array4)
|
||||
|
||||
merge_ents = nlp.create_pipe("merge_entities")
|
||||
nlp.add_pipe(merge_ents)
|
||||
|
||||
parsed_vectors_2 = [t.vector for t in nlp(text)]
|
||||
assert len(parsed_vectors_2) == 3
|
||||
np.testing.assert_array_equal(parsed_vectors_2[0], array1)
|
||||
np.testing.assert_array_equal(parsed_vectors_2[1], array2)
|
||||
np.testing.assert_array_equal(parsed_vectors_2[2], array34)
|
|
@ -12,12 +12,19 @@ def load_tokenizer(b):
|
|||
|
||||
|
||||
def test_serialize_custom_tokenizer(en_vocab, en_tokenizer):
|
||||
"""Test that custom tokenizer with not all functions defined can be
|
||||
serialized and deserialized correctly (see #2494)."""
|
||||
"""Test that custom tokenizer with not all functions defined or empty
|
||||
properties can be serialized and deserialized correctly (see #2494,
|
||||
#4991)."""
|
||||
tokenizer = Tokenizer(en_vocab, suffix_search=en_tokenizer.suffix_search)
|
||||
tokenizer_bytes = tokenizer.to_bytes()
|
||||
Tokenizer(en_vocab).from_bytes(tokenizer_bytes)
|
||||
|
||||
tokenizer = Tokenizer(en_vocab, rules={"ABC.": [{"ORTH": "ABC"}, {"ORTH": "."}]})
|
||||
tokenizer.rules = {}
|
||||
tokenizer_bytes = tokenizer.to_bytes()
|
||||
tokenizer_reloaded = Tokenizer(en_vocab).from_bytes(tokenizer_bytes)
|
||||
assert tokenizer_reloaded.rules == {}
|
||||
|
||||
|
||||
@pytest.mark.skip(reason="Currently unreliable across platforms")
|
||||
@pytest.mark.parametrize("text", ["I💜you", "they’re", "“hello”"])
|
||||
|
|
|
@ -28,10 +28,10 @@ def test_displacy_parse_deps(en_vocab):
|
|||
deps = displacy.parse_deps(doc)
|
||||
assert isinstance(deps, dict)
|
||||
assert deps["words"] == [
|
||||
{"text": "This", "tag": "DET"},
|
||||
{"text": "is", "tag": "AUX"},
|
||||
{"text": "a", "tag": "DET"},
|
||||
{"text": "sentence", "tag": "NOUN"},
|
||||
{"lemma": None, "text": words[0], "tag": pos[0]},
|
||||
{"lemma": None, "text": words[1], "tag": pos[1]},
|
||||
{"lemma": None, "text": words[2], "tag": pos[2]},
|
||||
{"lemma": None, "text": words[3], "tag": pos[3]},
|
||||
]
|
||||
assert deps["arcs"] == [
|
||||
{"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
|
||||
|
@ -72,7 +72,7 @@ def test_displacy_rtl():
|
|||
deps = ["foo", "bar", "foo", "baz"]
|
||||
heads = [1, 0, 1, -2]
|
||||
nlp = Persian()
|
||||
doc = get_doc(nlp.vocab, words=words, pos=pos, tags=pos, heads=heads, deps=deps)
|
||||
doc = get_doc(nlp.vocab, words=words, tags=pos, heads=heads, deps=deps)
|
||||
doc.ents = [Span(doc, 1, 3, label="TEST")]
|
||||
html = displacy.render(doc, page=True, style="dep")
|
||||
assert "direction: rtl" in html
|
||||
|
|
|
@ -4,8 +4,10 @@ import shutil
|
|||
import contextlib
|
||||
import srsly
|
||||
from pathlib import Path
|
||||
|
||||
from spacy import Errors
|
||||
from spacy.tokens import Doc, Span
|
||||
from spacy.attrs import POS, HEAD, DEP
|
||||
from spacy.attrs import POS, TAG, HEAD, DEP, LEMMA
|
||||
|
||||
|
||||
@contextlib.contextmanager
|
||||
|
@ -22,30 +24,56 @@ def make_tempdir():
|
|||
shutil.rmtree(str(d))
|
||||
|
||||
|
||||
def get_doc(vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=None):
|
||||
def get_doc(
|
||||
vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=None, lemmas=None
|
||||
):
|
||||
"""Create Doc object from given vocab, words and annotations."""
|
||||
pos = pos or [""] * len(words)
|
||||
tags = tags or [""] * len(words)
|
||||
heads = heads or [0] * len(words)
|
||||
deps = deps or [""] * len(words)
|
||||
for value in deps + tags + pos:
|
||||
if deps and not heads:
|
||||
heads = [0] * len(deps)
|
||||
headings = []
|
||||
values = []
|
||||
annotations = [pos, heads, deps, lemmas, tags]
|
||||
possible_headings = [POS, HEAD, DEP, LEMMA, TAG]
|
||||
for a, annot in enumerate(annotations):
|
||||
if annot is not None:
|
||||
if len(annot) != len(words):
|
||||
raise ValueError(Errors.E189)
|
||||
headings.append(possible_headings[a])
|
||||
if annot is not heads:
|
||||
values.extend(annot)
|
||||
for value in values:
|
||||
vocab.strings.add(value)
|
||||
|
||||
doc = Doc(vocab, words=words)
|
||||
attrs = doc.to_array([POS, HEAD, DEP])
|
||||
for i, (p, head, dep) in enumerate(zip(pos, heads, deps)):
|
||||
attrs[i, 0] = doc.vocab.strings[p]
|
||||
attrs[i, 1] = head
|
||||
attrs[i, 2] = doc.vocab.strings[dep]
|
||||
doc.from_array([POS, HEAD, DEP], attrs)
|
||||
|
||||
# if there are any other annotations, set them
|
||||
if headings:
|
||||
attrs = doc.to_array(headings)
|
||||
|
||||
j = 0
|
||||
for annot in annotations:
|
||||
if annot:
|
||||
if annot is heads:
|
||||
for i in range(len(words)):
|
||||
if attrs.ndim == 1:
|
||||
attrs[i] = heads[i]
|
||||
else:
|
||||
attrs[i, j] = heads[i]
|
||||
else:
|
||||
for i in range(len(words)):
|
||||
if attrs.ndim == 1:
|
||||
attrs[i] = doc.vocab.strings[annot[i]]
|
||||
else:
|
||||
attrs[i, j] = doc.vocab.strings[annot[i]]
|
||||
j += 1
|
||||
doc.from_array(headings, attrs)
|
||||
|
||||
# finally, set the entities
|
||||
if ents:
|
||||
doc.ents = [
|
||||
Span(doc, start, end, label=doc.vocab.strings[label])
|
||||
for start, end, label in ents
|
||||
]
|
||||
if tags:
|
||||
for token in doc:
|
||||
token.tag_ = tags[token.i]
|
||||
return doc
|
||||
|
||||
|
||||
|
@ -86,8 +114,7 @@ def assert_docs_equal(doc1, doc2):
|
|||
|
||||
assert [t.head.i for t in doc1] == [t.head.i for t in doc2]
|
||||
assert [t.dep for t in doc1] == [t.dep for t in doc2]
|
||||
if doc1.is_parsed and doc2.is_parsed:
|
||||
assert [s for s in doc1.sents] == [s for s in doc2.sents]
|
||||
assert [t.is_sent_start for t in doc1] == [t.is_sent_start for t in doc2]
|
||||
|
||||
assert [t.ent_type for t in doc1] == [t.ent_type for t in doc2]
|
||||
assert [t.ent_iob for t in doc1] == [t.ent_iob for t in doc2]
|
||||
|
|
|
@ -699,6 +699,7 @@ cdef class Tokenizer:
|
|||
|
||||
DOCS: https://spacy.io/api/tokenizer#to_disk
|
||||
"""
|
||||
path = util.ensure_path(path)
|
||||
with path.open("wb") as file_:
|
||||
file_.write(self.to_bytes(**kwargs))
|
||||
|
||||
|
@ -712,6 +713,7 @@ cdef class Tokenizer:
|
|||
|
||||
DOCS: https://spacy.io/api/tokenizer#from_disk
|
||||
"""
|
||||
path = util.ensure_path(path)
|
||||
with path.open("rb") as file_:
|
||||
bytes_data = file_.read()
|
||||
self.from_bytes(bytes_data, **kwargs)
|
||||
|
@ -756,21 +758,20 @@ cdef class Tokenizer:
|
|||
}
|
||||
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||
if data.get("prefix_search"):
|
||||
if "prefix_search" in data and isinstance(data["prefix_search"], str):
|
||||
self.prefix_search = re.compile(data["prefix_search"]).search
|
||||
if data.get("suffix_search"):
|
||||
if "suffix_search" in data and isinstance(data["suffix_search"], str):
|
||||
self.suffix_search = re.compile(data["suffix_search"]).search
|
||||
if data.get("infix_finditer"):
|
||||
if "infix_finditer" in data and isinstance(data["infix_finditer"], str):
|
||||
self.infix_finditer = re.compile(data["infix_finditer"]).finditer
|
||||
if data.get("token_match"):
|
||||
if "token_match" in data and isinstance(data["token_match"], str):
|
||||
self.token_match = re.compile(data["token_match"]).match
|
||||
if data.get("rules"):
|
||||
if "rules" in data and isinstance(data["rules"], dict):
|
||||
# make sure to hard reset the cache to remove data from the default exceptions
|
||||
self._rules = {}
|
||||
self._flush_cache()
|
||||
self._flush_specials()
|
||||
self._load_special_cases(data.get("rules", {}))
|
||||
|
||||
self._load_special_cases(data["rules"])
|
||||
return self
|
||||
|
||||
|
||||
|
|
|
@ -213,6 +213,10 @@ def _merge(Doc doc, merges):
|
|||
new_orth = ''.join([t.text_with_ws for t in spans[token_index]])
|
||||
if spans[token_index][-1].whitespace_:
|
||||
new_orth = new_orth[:-len(spans[token_index][-1].whitespace_)]
|
||||
# add the vector of the (merged) entity to the vocab
|
||||
if not doc.vocab.get_vector(new_orth).any():
|
||||
if doc.vocab.vectors_length > 0:
|
||||
doc.vocab.set_vector(new_orth, span.vector)
|
||||
token = tokens[token_index]
|
||||
lex = doc.vocab.get(doc.mem, new_orth)
|
||||
token.lex = lex
|
||||
|
|
|
@ -19,7 +19,7 @@ from ..lexeme cimport Lexeme, EMPTY_LEXEME
|
|||
from ..typedefs cimport attr_t, flags_t
|
||||
from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER
|
||||
from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
|
||||
from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, attr_id_t
|
||||
from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, IDX, attr_id_t
|
||||
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
|
||||
|
||||
from ..attrs import intify_attrs, IDS
|
||||
|
@ -68,6 +68,8 @@ cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
|
|||
return token.ent_id
|
||||
elif feat_name == ENT_KB_ID:
|
||||
return token.ent_kb_id
|
||||
elif feat_name == IDX:
|
||||
return token.idx
|
||||
else:
|
||||
return Lexeme.get_struct_attr(token.lex, feat_name)
|
||||
|
||||
|
@ -253,7 +255,7 @@ cdef class Doc:
|
|||
def is_nered(self):
|
||||
"""Check if the document has named entities set. Will return True if
|
||||
*any* of the tokens has a named entity tag set (even if the others are
|
||||
unknown values).
|
||||
unknown values), or if the document is empty.
|
||||
"""
|
||||
if len(self) == 0:
|
||||
return True
|
||||
|
@ -778,10 +780,12 @@ cdef class Doc:
|
|||
# Allow strings, e.g. 'lemma' or 'LEMMA'
|
||||
attrs = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_)
|
||||
for id_ in attrs]
|
||||
if array.dtype != numpy.uint64:
|
||||
user_warning(Warnings.W028.format(type=array.dtype))
|
||||
|
||||
if SENT_START in attrs and HEAD in attrs:
|
||||
raise ValueError(Errors.E032)
|
||||
cdef int i, col
|
||||
cdef int i, col, abs_head_index
|
||||
cdef attr_id_t attr_id
|
||||
cdef TokenC* tokens = self.c
|
||||
cdef int length = len(array)
|
||||
|
@ -795,6 +799,14 @@ cdef class Doc:
|
|||
attr_ids[i] = attr_id
|
||||
if len(array.shape) == 1:
|
||||
array = array.reshape((array.size, 1))
|
||||
# Check that all heads are within the document bounds
|
||||
if HEAD in attrs:
|
||||
col = attrs.index(HEAD)
|
||||
for i in range(length):
|
||||
# cast index to signed int
|
||||
abs_head_index = numpy.int32(array[i, col]) + i
|
||||
if abs_head_index < 0 or abs_head_index >= length:
|
||||
raise ValueError(Errors.E190.format(index=i, value=array[i, col], rel_head_index=numpy.int32(array[i, col])))
|
||||
# Do TAG first. This lets subsequent loop override stuff like POS, LEMMA
|
||||
if TAG in attrs:
|
||||
col = attrs.index(TAG)
|
||||
|
@ -865,7 +877,7 @@ cdef class Doc:
|
|||
|
||||
DOCS: https://spacy.io/api/doc#to_bytes
|
||||
"""
|
||||
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID] # TODO: ENT_KB_ID ?
|
||||
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID, NORM] # TODO: ENT_KB_ID ?
|
||||
if self.is_tagged:
|
||||
array_head.extend([TAG, POS])
|
||||
# If doc parsed add head and dep attribute
|
||||
|
@ -1166,6 +1178,7 @@ cdef int set_children_from_heads(TokenC* tokens, int length) except -1:
|
|||
heads_within_sents = _set_lr_kids_and_edges(tokens, length, loop_count)
|
||||
if loop_count > 10:
|
||||
warnings.warn(Warnings.W026)
|
||||
break
|
||||
loop_count += 1
|
||||
# Set sentence starts
|
||||
for i in range(length):
|
||||
|
|
|
@ -626,6 +626,9 @@ cdef class Token:
|
|||
# This function sets the head of self to new_head and updates the
|
||||
# counters for left/right dependents and left/right corner for the
|
||||
# new and the old head
|
||||
# Check that token is from the same document
|
||||
if self.doc != new_head.doc:
|
||||
raise ValueError(Errors.E191)
|
||||
# Do nothing if old head is new head
|
||||
if self.i + self.c.head == new_head.i:
|
||||
return
|
||||
|
|
|
@ -76,6 +76,14 @@ class Underscore(object):
|
|||
def _get_key(self, name):
|
||||
return ("._.", name, self._start, self._end)
|
||||
|
||||
@classmethod
|
||||
def get_state(cls):
|
||||
return cls.token_extensions, cls.span_extensions, cls.doc_extensions
|
||||
|
||||
@classmethod
|
||||
def load_state(cls, state):
|
||||
cls.token_extensions, cls.span_extensions, cls.doc_extensions = state
|
||||
|
||||
|
||||
def get_ext_args(**kwargs):
|
||||
"""Validate and convert arguments. Reused in Doc, Token and Span."""
|
||||
|
|
|
@ -349,44 +349,6 @@ cdef class Vectors:
|
|||
for i in range(len(queries)) ], dtype="uint64")
|
||||
return (keys, best_rows, scores)
|
||||
|
||||
def from_glove(self, path):
|
||||
"""Load GloVe vectors from a directory. Assumes binary format,
|
||||
that the vocab is in a vocab.txt, and that vectors are named
|
||||
vectors.{size}.[fd].bin, e.g. vectors.128.f.bin for 128d float32
|
||||
vectors, vectors.300.d.bin for 300d float64 (double) vectors, etc.
|
||||
By default GloVe outputs 64-bit vectors.
|
||||
|
||||
path (unicode / Path): The path to load the GloVe vectors from.
|
||||
RETURNS: A `StringStore` object, holding the key-to-string mapping.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#from_glove
|
||||
"""
|
||||
path = util.ensure_path(path)
|
||||
width = None
|
||||
for name in path.iterdir():
|
||||
if name.parts[-1].startswith("vectors"):
|
||||
_, dims, dtype, _2 = name.parts[-1].split('.')
|
||||
width = int(dims)
|
||||
break
|
||||
else:
|
||||
raise IOError(Errors.E061.format(filename=path))
|
||||
bin_loc = path / f"vectors.{dims}.{dtype}.bin"
|
||||
xp = get_array_module(self.data)
|
||||
self.data = None
|
||||
with bin_loc.open("rb") as file_:
|
||||
self.data = xp.fromfile(file_, dtype=dtype)
|
||||
if dtype != "float32":
|
||||
self.data = xp.ascontiguousarray(self.data, dtype="float32")
|
||||
if self.data.ndim == 1:
|
||||
self.data = self.data.reshape((self.data.size//width, width))
|
||||
n = 0
|
||||
strings = StringStore()
|
||||
with (path / "vocab.txt").open("r") as file_:
|
||||
for i, line in enumerate(file_):
|
||||
key = strings.add(line.strip())
|
||||
self.add(key, row=i)
|
||||
return strings
|
||||
|
||||
def to_disk(self, path, **kwargs):
|
||||
"""Save the current state to a directory.
|
||||
|
||||
|
|
|
@ -109,9 +109,9 @@ links) and check whether they are compatible with the currently installed
|
|||
version of spaCy. Should be run after upgrading spaCy via `pip install -U spacy`
|
||||
to ensure that all installed models are can be used with the new version. The
|
||||
command is also useful to detect out-of-sync model links resulting from links
|
||||
created in different virtual environments. It will a list of models, the
|
||||
installed versions, the latest compatible version (if out of date) and the
|
||||
commands for updating.
|
||||
created in different virtual environments. It will show a list of models and
|
||||
their installed versions. If any model is out of date, the latest compatible
|
||||
versions and command for updating are shown.
|
||||
|
||||
> #### Automated validation
|
||||
>
|
||||
|
@ -176,7 +176,7 @@ All output files generated by this command are compatible with
|
|||
|
||||
## Debug data {#debug-data new="2.2"}
|
||||
|
||||
Analyze, debug and validate your training and development data, get useful
|
||||
Analyze, debug, and validate your training and development data. Get useful
|
||||
stats, and find problems like invalid entity annotations, cyclic dependencies,
|
||||
low data labels and more.
|
||||
|
||||
|
@ -185,10 +185,11 @@ $ python -m spacy debug-data [lang] [train_path] [dev_path] [--base-model] [--pi
|
|||
```
|
||||
|
||||
| Argument | Type | Description |
|
||||
| -------------------------- | ---------- | -------------------------------------------------------------------------------------------------- |
|
||||
| ------------------------------------------------------ | ---------- | -------------------------------------------------------------------------------------------------- |
|
||||
| `lang` | positional | Model language. |
|
||||
| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. |
|
||||
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
|
||||
| `--tag-map-path`, `-tm` <Tag variant="new">2.2.3</Tag> | option | Location of JSON-formatted tag map. |
|
||||
| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
|
||||
| `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
|
||||
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
|
||||
|
@ -368,6 +369,7 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
|
|||
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
|
||||
| `--base-model`, `-b` <Tag variant="new">2.1</Tag> | option | Optional name of base model to update. Can be any loadable spaCy model. |
|
||||
| `--pipeline`, `-p` <Tag variant="new">2.1</Tag> | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
|
||||
| `--replace-components`, `-R` | flag | Replace components from the base model. |
|
||||
| `--vectors`, `-v` | option | Model to load vectors from. |
|
||||
| `--n-iter`, `-n` | option | Number of iterations (default: `30`). |
|
||||
| `--n-early-stopping`, `-ne` | option | Maximum number of training epochs without dev accuracy improvement. |
|
||||
|
@ -378,6 +380,13 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
|
|||
| `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. |
|
||||
| `--parser-multitasks`, `-pt` | option | Side objectives for parser CNN, e.g. `'dep'` or `'dep,tag'` |
|
||||
| `--entity-multitasks`, `-et` | option | Side objectives for NER CNN, e.g. `'dep'` or `'dep,tag'` |
|
||||
| `--width`, `-cw` <Tag variant="new">2.2.4</Tag> | option | Width of CNN layers of `Tok2Vec` component. |
|
||||
| `--conv-depth`, `-cd` <Tag variant="new">2.2.4</Tag> | option | Depth of CNN layers of `Tok2Vec` component. |
|
||||
| `--cnn-window`, `-cW` <Tag variant="new">2.2.4</Tag> | option | Window size for CNN layers of `Tok2Vec` component. |
|
||||
| `--cnn-pieces`, `-cP` <Tag variant="new">2.2.4</Tag> | option | Maxout size for CNN layers of `Tok2Vec` component. |
|
||||
| `--use-chars`, `-chr` <Tag variant="new">2.2.4</Tag> | flag | Whether to use character-based embedding of `Tok2Vec` component. |
|
||||
| `--bilstm-depth`, `-lstm` <Tag variant="new">2.2.4</Tag> | option | Depth of BiLSTM layers of `Tok2Vec` component (requires PyTorch). |
|
||||
| `--embed-rows`, `-er` <Tag variant="new">2.2.4</Tag> | option | Number of embedding rows of `Tok2Vec` component. |
|
||||
| `--noise-level`, `-nl` | option | Float indicating the amount of corruption for data augmentation. |
|
||||
| `--orth-variant-level`, `-ovl` <Tag variant="new">2.2</Tag> | option | Float indicating the orthography variation for data augmentation (e.g. `0.3` for making 30% of occurrences of some tokens subject to replacement). |
|
||||
| `--gold-preproc`, `-G` | flag | Use gold preprocessing. |
|
||||
|
@ -385,6 +394,7 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
|
|||
| `--textcat-multilabel`, `-TML` <Tag variant="new">2.2</Tag> | flag | Text classification classes aren't mutually exclusive (multilabel). |
|
||||
| `--textcat-arch`, `-ta` <Tag variant="new">2.2</Tag> | option | Text classification model architecture. Defaults to `"bow"`. |
|
||||
| `--textcat-positive-label`, `-tpl` <Tag variant="new">2.2</Tag> | option | Text classification positive label for binary classes with two labels. |
|
||||
| `--tag-map-path`, `-tm` <Tag variant="new">2.2.4</Tag> | option | Location of JSON-formatted tag map. |
|
||||
| `--verbose`, `-VV` <Tag variant="new">2.0.13</Tag> | flag | Show more detailed messages during training. |
|
||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||
| **CREATES** | model, pickle | A spaCy model on each epoch. |
|
||||
|
|
|
@ -7,9 +7,10 @@ source: spacy/tokens/doc.pyx
|
|||
|
||||
A `Doc` is a sequence of [`Token`](/api/token) objects. Access sentences and
|
||||
named entities, export annotations to numpy arrays, losslessly serialize to
|
||||
compressed binary strings. The `Doc` object holds an array of [`TokenC`](/api/cython-structs#tokenc) structs.
|
||||
The Python-level `Token` and [`Span`](/api/span) objects are views of this
|
||||
array, i.e. they don't own the data themselves.
|
||||
compressed binary strings. The `Doc` object holds an array of
|
||||
[`TokenC`](/api/cython-structs#tokenc) structs. The Python-level `Token` and
|
||||
[`Span`](/api/span) objects are views of this array, i.e. they don't own the
|
||||
data themselves.
|
||||
|
||||
## Doc.\_\_init\_\_ {#init tag="method"}
|
||||
|
||||
|
@ -198,10 +199,11 @@ the character indices don't map to a valid span.
|
|||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------------------------------- | ------------------------------------------------------- |
|
||||
| ------------------------------------ | ---------------------------------------- | --------------------------------------------------------------------- |
|
||||
| `start` | int | The index of the first character of the span. |
|
||||
| `end` | int | The index of the last character after the span. |
|
||||
| `label` | uint64 / unicode | A label to attach to the Span, e.g. for named entities. |
|
||||
| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. |
|
||||
| `kb_id` <Tag variant="new">2.2</Tag> | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. |
|
||||
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
|
||||
| **RETURNS** | `Span` | The newly constructed object or `None`. |
|
||||
|
||||
|
@ -655,10 +657,10 @@ The L2 norm of the document's vector representation.
|
|||
| `user_data` | - | A generic storage area, for user custom data. |
|
||||
| `lang` <Tag variant="new">2.1</Tag> | int | Language of the document's vocabulary. |
|
||||
| `lang_` <Tag variant="new">2.1</Tag> | unicode | Language of the document's vocabulary. |
|
||||
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. |
|
||||
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. |
|
||||
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. |
|
||||
| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if _any_ of the tokens has an entity tag set, even if the others are unknown. |
|
||||
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. Returns `True` if the `Doc` is empty. |
|
||||
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. Returns `True` if the `Doc` is empty. |
|
||||
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. Returns `True` if the `Doc` is empty. |
|
||||
| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if the `Doc` is empty, or if _any_ of the tokens has an entity tag set, even if the others are unknown. |
|
||||
| `sentiment` | float | The document's positivity/negativity score, if available. |
|
||||
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
|
||||
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
|
||||
|
|
|
@ -83,7 +83,8 @@ Find matches in the `Doc` and add them to the `doc.ents`. Typically, this
|
|||
happens automatically after the component has been added to the pipeline using
|
||||
[`nlp.add_pipe`](/api/language#add_pipe). If the entity ruler was initialized
|
||||
with `overwrite_ents=True`, existing entities will be replaced if they overlap
|
||||
with the matches.
|
||||
with the matches. When matches overlap in a Doc, the entity ruler prioritizes longer
|
||||
patterns over shorter, and if equal the match occuring first in the Doc is chosen.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
|
|
@ -172,6 +172,28 @@ Remove a previously registered extension.
|
|||
| `name` | unicode | Name of the extension. |
|
||||
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
|
||||
|
||||
## Span.char_span {#char_span tag="method" new="2.2.4"}
|
||||
|
||||
Create a `Span` object from the slice `span.text[start:end]`. Returns `None` if
|
||||
the character indices don't map to a valid span.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> doc = nlp("I like New York")
|
||||
> span = doc[1:4].char_span(5, 13, label="GPE")
|
||||
> assert span.text == "New York"
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------------------------------- | --------------------------------------------------------------------- |
|
||||
| `start` | int | The index of the first character of the span. |
|
||||
| `end` | int | The index of the last character after the span. |
|
||||
| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. |
|
||||
| `kb_id` | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. |
|
||||
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
|
||||
| **RETURNS** | `Span` | The newly constructed object or `None`. |
|
||||
|
||||
## Span.similarity {#similarity tag="method" model="vectors"}
|
||||
|
||||
Make a semantic similarity estimate. The default estimate is cosine similarity
|
||||
|
@ -294,7 +316,7 @@ Create a new `Doc` object corresponding to the `Span`, with a copy of the data.
|
|||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------------- | ----- | ---------------------------------------------------- |
|
||||
| ---------------- | ----- | ---------------------------------------------------- |
|
||||
| `copy_user_data` | bool | Whether or not to copy the original doc's user data. |
|
||||
| **RETURNS** | `Doc` | A `Doc` object of the `Span`'s content. |
|
||||
|
||||
|
|
|
@ -437,8 +437,8 @@ The L2 norm of the token's vector representation.
|
|||
| `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
|
||||
| `lower` | int | Lowercase form of the token. |
|
||||
| `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
|
||||
| `shape` | int | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
|
||||
| `shape_` | unicode | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
|
||||
| `shape` | int | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
|
||||
| `shape_` | unicode | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
|
||||
| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
|
||||
| `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. |
|
||||
| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |
|
||||
|
|
|
@ -237,8 +237,9 @@ If a setting is not present in the options, the default value will be used.
|
|||
> ```
|
||||
|
||||
| Name | Type | Description | Default |
|
||||
| ------------------ | ------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
|
||||
| ------------------------------------------ | ------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
|
||||
| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
|
||||
| `add_lemma` <Tag variant="new">2.2.4</Tag> | bool | Print the lemma's in a separate row below the token texts. | `False` |
|
||||
| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` |
|
||||
| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
|
||||
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
|
||||
|
|
|
@ -326,25 +326,6 @@ performed in chunks, to avoid consuming too much memory. You can set the
|
|||
| `sort` | bool | Whether to sort the entries returned by score. Defaults to `True`. |
|
||||
| **RETURNS** | tuple | The most similar entries as a `(keys, best_rows, scores)` tuple. |
|
||||
|
||||
## Vectors.from_glove {#from_glove tag="method"}
|
||||
|
||||
Load [GloVe](https://nlp.stanford.edu/projects/glove/) vectors from a directory.
|
||||
Assumes binary format, that the vocab is in a `vocab.txt`, and that vectors are
|
||||
named `vectors.{size}.[fd.bin]`, e.g. `vectors.128.f.bin` for 128d float32
|
||||
vectors, `vectors.300.d.bin` for 300d float64 (double) vectors, etc. By default
|
||||
GloVe outputs 64-bit vectors.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> vectors = Vectors()
|
||||
> vectors.from_glove("/path/to/glove_vectors")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | ---------------------------------------- |
|
||||
| `path` | unicode / `Path` | The path to load the GloVe vectors from. |
|
||||
|
||||
## Vectors.to_disk {#to_disk tag="method"}
|
||||
|
||||
Save the current state to a directory.
|
||||
|
|
|
@ -622,13 +622,13 @@ categorizer is to use the [`spacy train`](/api/cli#train) command-line utility.
|
|||
In order to use this, you'll need training and evaluation data in the
|
||||
[JSON format](/api/annotation#json-input) spaCy expects for training.
|
||||
|
||||
You can now train the model using a corpus for your language annotated with If
|
||||
your data is in one of the supported formats, the easiest solution might be to
|
||||
use the [`spacy convert`](/api/cli#convert) command-line utility. This supports
|
||||
several popular formats, including the IOB format for named entity recognition,
|
||||
the JSONL format produced by our annotation tool [Prodigy](https://prodi.gy),
|
||||
and the [CoNLL-U](http://universaldependencies.org/docs/format.html) format used
|
||||
by the [Universal Dependencies](http://universaldependencies.org/) corpus.
|
||||
If your data is in one of the supported formats, the easiest solution might be
|
||||
to use the [`spacy convert`](/api/cli#convert) command-line utility. This
|
||||
supports several popular formats, including the IOB format for named entity
|
||||
recognition, the JSONL format produced by our annotation tool
|
||||
[Prodigy](https://prodi.gy), and the
|
||||
[CoNLL-U](http://universaldependencies.org/docs/format.html) format used by the
|
||||
[Universal Dependencies](http://universaldependencies.org/) corpus.
|
||||
|
||||
One thing to keep in mind is that spaCy expects to train its models from **whole
|
||||
documents**, not just single sentences. If your corpus only contains single
|
||||
|
|
|
@ -968,7 +968,10 @@ pattern. The entity ruler accepts two types of patterns:
|
|||
The [`EntityRuler`](/api/entityruler) is a pipeline component that's typically
|
||||
added via [`nlp.add_pipe`](/api/language#add_pipe). When the `nlp` object is
|
||||
called on a text, it will find matches in the `doc` and add them as entities to
|
||||
the `doc.ents`, using the specified pattern label as the entity label.
|
||||
the `doc.ents`, using the specified pattern label as the entity label. If any
|
||||
matches were to overlap, the pattern matching most tokens takes priority. If
|
||||
they also happen to be equally long, then the match occuring first in the Doc is
|
||||
chosen.
|
||||
|
||||
```python
|
||||
### {executable="true"}
|
||||
|
@ -1119,7 +1122,7 @@ entityruler = EntityRuler(nlp)
|
|||
patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]
|
||||
|
||||
other_pipes = [p for p in nlp.pipe_names if p != "tagger"]
|
||||
with nlp.disable_pipes(*disable_pipes):
|
||||
with nlp.disable_pipes(*other_pipes):
|
||||
entityruler.add_patterns(patterns)
|
||||
```
|
||||
|
||||
|
|
|
@ -94,7 +94,7 @@ docs = list(doc_bin.get_docs(nlp.vocab))
|
|||
|
||||
If `store_user_data` is set to `True`, the `Doc.user_data` will be serialized as
|
||||
well, which includes the values of
|
||||
[extension attributes](/processing-pipelines#custom-components-attributes) (if
|
||||
[extension attributes](/usage/processing-pipelines#custom-components-attributes) (if
|
||||
they're serializable with msgpack).
|
||||
|
||||
<Infobox title="Important note on serializing extension attributes" variant="warning">
|
||||
|
|
Some files were not shown because too many files have changed in this diff Show More
Loading…
Reference in New Issue
Block a user