Mirror of https://github.com/explosion/spaCy.git (synced 2024-12-24 17:06:29 +03:00)

Commit 09d442f5ad: Merge remote-tracking branch 'upstream/master' into feature/ud-tokenization-da
.github/contributors/Baciccin.md (new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an "x" on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                    |
| ------------------------------ | ------------------------ |
| Name                           | Giovanni Battista Parodi |
| Company name (if applicable)   |                          |
| Title or role (if applicable)  |                          |
| Date                           | 2020-03-19               |
| GitHub username                | Baciccin                 |
| Website (optional)             |                          |
.github/contributors/dhpollack.md (new file, 106 lines)
@@ -0,0 +1,106 @@
(Standard spaCy contributor agreement text, identical to the agreement reproduced in Baciccin.md above; the individual-contributor statement is checked.)
## Contributor Details

| Field                          | Entry         |
| ------------------------------ | ------------- |
| Name                           | David Pollack |
| Company name (if applicable)   |               |
| Title or role (if applicable)  |               |
| Date                           | Mar 5, 2020   |
| GitHub username                | dhpollack     |
| Website (optional)             |               |
.github/contributors/guerda.md (new file, 106 lines)
@@ -0,0 +1,106 @@
(Standard spaCy contributor agreement text, identical to the agreement reproduced in Baciccin.md above; the individual-contributor statement is checked.)
## Contributor Details

| Field                          | Entry           |
| ------------------------------ | --------------- |
| Name                           | Philip Gillißen |
| Company name (if applicable)   |                 |
| Title or role (if applicable)  |                 |
| Date                           | 2020-03-24      |
| GitHub username                | guerda          |
| Website (optional)             |                 |
.github/contributors/mabraham.md (new file, 89 lines)
@@ -0,0 +1,89 @@
(Standard spaCy contributor agreement text as in Baciccin.md above, minus the introductory preamble; the individual-contributor statement is checked.)
## Contributor Details

| Field                          | Entry |
| ------------------------------ | ----- |
| Name                           |       |
| Company name (if applicable)   |       |
| Title or role (if applicable)  |       |
| Date                           |       |
| GitHub username                |       |
| Website (optional)             |       |
.github/contributors/merrcury.md (new file, 106 lines)
@@ -0,0 +1,106 @@
(Standard spaCy contributor agreement text, identical to the agreement reproduced in Baciccin.md above; the individual-contributor statement is checked.)
## Contributor Details

| Field                          | Entry         |
| ------------------------------ | ------------- |
| Name                           | Himanshu Garg |
| Company name (if applicable)   |               |
| Title or role (if applicable)  |               |
| Date                           | 2020-03-10    |
| GitHub username                | merrcury      |
| Website (optional)             |               |
.github/contributors/pinealan.md (new file, 106 lines)
@@ -0,0 +1,106 @@
(Standard spaCy contributor agreement text, identical to the agreement reproduced in Baciccin.md above; the individual-contributor statement is checked.)
## Contributor Details

| Field                          | Entry               |
| ------------------------------ | ------------------- |
| Name                           | Alan Chan           |
| Company name (if applicable)   |                     |
| Title or role (if applicable)  |                     |
| Date                           | 2020-03-15          |
| GitHub username                | pinealan            |
| Website (optional)             | http://pinealan.xyz |
.github/contributors/sloev.md (new file, 106 lines)
@@ -0,0 +1,106 @@
(Standard spaCy contributor agreement text as reproduced in Baciccin.md above, except that it names ExplosionAI UG (haftungsbeschränkt) rather than ExplosionAI GmbH; the individual-contributor statement is checked.)
## Contributor Details

| Field                          | Entry                   |
| ------------------------------ | ----------------------- |
| Name                           | Johannes Valbjørn       |
| Company name (if applicable)   |                         |
| Title or role (if applicable)  |                         |
| Date                           | 2020-03-13              |
| GitHub username                | sloev                   |
| Website (optional)             | https://sloev.github.io |
.gitignore (7 lines changed)
@@ -5,6 +5,11 @@ corpora/
keys/
*.json.gz

# Tests
spacy/tests/package/setup.cfg
spacy/tests/package/pyproject.toml
spacy/tests/package/requirements.txt

# Website
website/.cache/
website/public/
@@ -40,6 +45,7 @@ __pycache__/
.~env/
.venv
venv/
env3.*/
.dev
.denv
.pypyenv
@@ -56,6 +62,7 @@ lib64/
parts/
sdist/
var/
wheelhouse/
*.egg-info/
pip-wheel-metadata/
Pipfile.lock
LICENSE (2 lines changed)
@@ -1,6 +1,6 @@
 The MIT License (MIT)

-Copyright (C) 2016-2019 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
+Copyright (C) 2016-2020 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
Makefile (47 lines changed)
@@ -1,28 +1,37 @@
SHELL := /bin/bash
sha = $(shell "git" "rev-parse" "--short" "HEAD")
version = $(shell "bin/get-version.sh")
wheel = spacy-$(version)-cp36-cp36m-linux_x86_64.whl
PYVER := 3.6
VENV := ./env$(PYVER)

dist/spacy.pex : dist/spacy-$(sha).pex
    cp dist/spacy-$(sha).pex dist/spacy.pex
    chmod a+rx dist/spacy.pex
version := $(shell "bin/get-version.sh")

dist/spacy-$(sha).pex : dist/$(wheel)
    env3.6/bin/python -m pip install pex==1.5.3
    env3.6/bin/pex pytest dist/$(wheel) spacy_lookups_data -e spacy -o dist/spacy-$(sha).pex
dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
    $(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) jsonschema spacy_lookups_data
    chmod a+rx $@

dist/$(wheel) : setup.py spacy/*.py* spacy/*/*.py*
    python3.6 -m venv env3.6
    source env3.6/bin/activate
    env3.6/bin/pip install wheel
    env3.6/bin/pip install -r requirements.txt --no-cache-dir
    env3.6/bin/python setup.py build_ext --inplace
    env3.6/bin/python setup.py sdist
    env3.6/bin/python setup.py bdist_wheel
dist/pytest.pex : wheelhouse/pytest-*.whl
    $(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock
    chmod a+rx $@

.PHONY : clean
wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
    $(VENV)/bin/pip wheel . -w ./wheelhouse
    $(VENV)/bin/pip wheel jsonschema spacy_lookups_data -w ./wheelhouse
    touch $@

wheelhouse/pytest-%.whl : $(VENV)/bin/pex
    $(VENV)/bin/pip wheel pytest pytest-timeout mock -w ./wheelhouse

$(VENV)/bin/pex :
    python$(PYVER) -m venv $(VENV)
    $(VENV)/bin/pip install -U pip setuptools pex wheel

.PHONY : clean test

test : dist/spacy-$(version).pex dist/pytest.pex
    ( . $(VENV)/bin/activate ; \
    PEX_PATH=dist/spacy-$(version).pex ./dist/pytest.pex --pyargs spacy -x ; )

clean : setup.py
    source env3.6/bin/activate
    rm -rf dist/*
    rm -rf ./wheelhouse
    rm -rf $(VENV)
    python setup.py clean --all
@@ -48,7 +48,7 @@ jobs:
       imageName: 'vs2017-win2016'
       python.version: '3.6'
     Python36Mac:
-      imageName: 'macos-10.13'
+      imageName: 'macos-10.14'
       python.version: '3.6'
     # Don't test on 3.7 for now to speed up builds
     # Python37Linux:
@@ -67,7 +67,7 @@ jobs:
       imageName: 'vs2017-win2016'
       python.version: '3.8'
     Python38Mac:
-      imageName: 'macos-10.13'
+      imageName: 'macos-10.14'
       python.version: '3.8'
   maxParallel: 4
   pool:
@@ -2,7 +2,7 @@

 ### Step 1: Create a Knowledge Base (KB) and training data

-Run `wikipedia_pretrain_kb.py`
+Run `wikidata_pretrain_kb.py`
 * This takes as input the locations of a **Wikipedia and a Wikidata dump**, and produces a **KB directory** + **training file**
   * WikiData: get `latest-all.json.bz2` from https://dumps.wikimedia.org/wikidatawiki/entities/
   * Wikipedia: get `enwiki-latest-pages-articles-multistream.xml.bz2` from https://dumps.wikimedia.org/enwiki/latest/ (or for any other language)
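For orientation, a hedged sketch of the kind of `KnowledgeBase` object that Step 1 ultimately produces, using the documented spaCy 2.x `spacy.kb` API. The entity ID, frequency and vector values below are illustrative placeholders; the real pretraining script derives them from the dumps listed above.

```python
# Illustrative sketch only: spaCy 2.x KnowledgeBase API with made-up values.
from spacy.kb import KnowledgeBase
from spacy.lang.en import English

nlp = English()
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)

# The pretraining script fills these from the Wikidata/Wikipedia dumps;
# the numbers here are placeholders.
kb.add_entity(entity="Q42", freq=42, entity_vector=[0.1, 0.2, 0.3])
kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[1.0])

print(kb.get_size_entities(), "entities,", kb.get_size_aliases(), "aliases")
```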
@@ -88,8 +88,8 @@ def read_text(bz2_loc, n=10000):
             break


-def get_matches(tokenizer, phrases, texts, max_length=6):
-    matcher = PhraseMatcher(tokenizer.vocab, max_length=max_length)
+def get_matches(tokenizer, phrases, texts):
+    matcher = PhraseMatcher(tokenizer.vocab)
     matcher.add("Phrase", None, *phrases)
     for text in texts:
         doc = tokenizer(text)
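The hunk above drops the `max_length` argument. A minimal sketch of the equivalent spaCy 2.x `PhraseMatcher` usage it now relies on (the phrases and text are invented for illustration):

```python
# Illustrative sketch of the updated get_matches() call pattern (spaCy 2.x API).
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
phrases = [nlp(text) for text in ["natural language processing", "machine learning"]]

matcher = PhraseMatcher(nlp.vocab)     # no max_length argument any more
matcher.add("Phrase", None, *phrases)  # spaCy 2.x add() signature, as in the diff
doc = nlp("We like natural language processing and machine learning.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```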
@@ -5,7 +5,7 @@ thinc==7.4.0
 blis>=0.4.0,<0.5.0
 murmurhash>=0.28.0,<1.1.0
 wasabi>=0.4.0,<1.1.0
-srsly>=1.0.1,<1.1.0
+srsly>=1.0.2,<1.1.0
 catalogue>=0.0.7,<1.1.0
 # Third party dependencies
 numpy>=1.15.0
@@ -47,7 +47,7 @@ install_requires =
     thinc==7.4.0
     blis>=0.4.0,<0.5.0
     wasabi>=0.4.0,<1.1.0
-    srsly>=1.0.1,<1.1.0
+    srsly>=1.0.2,<1.1.0
     catalogue>=0.0.7,<1.1.0
     # Third-party dependencies
     tqdm>=4.38.0,<5.0.0
@@ -296,8 +296,7 @@ def link_vectors_to_models(vocab):
     key = (ops.device, vectors.name)
     if key in thinc.extra.load_nlp.VECTORS:
         if thinc.extra.load_nlp.VECTORS[key].shape != data.shape:
-            # This is a hack to avoid the problem in #3853. Maybe we should
-            # print a warning as well?
+            # This is a hack to avoid the problem in #3853.
             old_name = vectors.name
             new_name = vectors.name + "_%d" % data.shape[0]
             user_warning(Warnings.W019.format(old=old_name, new=new_name))
@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "2.2.4.dev0"
+__version__ = "2.2.4"
 __release__ = True
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
@@ -23,20 +23,17 @@ BLANK_MODEL_THRESHOLD = 2000


@plac.annotations(
    # fmt: off
    lang=("model language", "positional", None, str),
    train_path=("location of JSON-formatted training data", "positional", None, Path),
    dev_path=("location of JSON-formatted development data", "positional", None, Path),
    tag_map_path=("Location of JSON-formatted tag map", "option", "tm", Path),
    base_model=("name of model to update (optional)", "option", "b", str),
    pipeline=(
        "Comma-separated names of pipeline components to train",
        "option",
        "p",
        str,
    ),
    pipeline=("Comma-separated names of pipeline components to train", "option", "p", str),
    ignore_warnings=("Ignore warnings, only show stats and errors", "flag", "IW", bool),
    verbose=("Print additional information and explanations", "flag", "V", bool),
    no_format=("Don't pretty-print the results", "flag", "NF", bool),
    # fmt: on
)
def debug_data(
    lang,
@@ -235,13 +232,17 @@ def debug_data(

     if gold_train_data["ws_ents"]:
         msg.fail(
-            "{} invalid whitespace entity span(s)".format(gold_train_data["ws_ents"])
+            "{} invalid whitespace entity span(s)".format(
+                gold_train_data["ws_ents"]
+            )
         )
         has_ws_ents_error = True

     if gold_train_data["punct_ents"]:
         msg.warn(
-            "{} entity span(s) with punctuation".format(gold_train_data["punct_ents"])
+            "{} entity span(s) with punctuation".format(
+                gold_train_data["punct_ents"]
+            )
         )
         has_punct_ents_warning = True

@@ -592,7 +593,13 @@ def _compile_gold(train_docs, pipeline):
         if label.startswith(("B-", "U-", "L-")) and doc[i].is_space:
             # "Illegal" whitespace entity
             data["ws_ents"] += 1
-        if label.startswith(("B-", "U-", "L-")) and doc[i].text in [".", "'", "!", "?", ","]:
+        if label.startswith(("B-", "U-", "L-")) and doc[i].text in [
+            ".",
+            "'",
+            "!",
+            "?",
+            ",",
+        ]:
             # punctuation entity: could be replaced by whitespace when training with noise,
             # so add a warning to alert the user to this unexpected side effect.
             data["punct_ents"] += 1
@@ -554,7 +554,30 @@ def train(
         with nlp.use_params(optimizer.averages):
             final_model_path = output_path / "model-final"
             nlp.to_disk(final_model_path)
-        final_meta = srsly.read_json(output_path / "model-final" / "meta.json")
+        meta_loc = output_path / "model-final" / "meta.json"
+        final_meta = srsly.read_json(meta_loc)
+        final_meta.setdefault("accuracy", {})
+        final_meta["accuracy"].update(meta.get("accuracy", {}))
+        final_meta.setdefault("speed", {})
+        final_meta["speed"].setdefault("cpu", None)
+        final_meta["speed"].setdefault("gpu", None)
+        # combine cpu and gpu speeds with the base model speeds
+        if final_meta["speed"]["cpu"] and meta["speed"]["cpu"]:
+            speed = _get_total_speed([final_meta["speed"]["cpu"], meta["speed"]["cpu"]])
+            final_meta["speed"]["cpu"] = speed
+        if final_meta["speed"]["gpu"] and meta["speed"]["gpu"]:
+            speed = _get_total_speed([final_meta["speed"]["gpu"], meta["speed"]["gpu"]])
+            final_meta["speed"]["gpu"] = speed
+        # if there were no speeds to update, overwrite with meta
+        if final_meta["speed"]["cpu"] is None and final_meta["speed"]["gpu"] is None:
+            final_meta["speed"].update(meta["speed"])
+        # note: beam speeds are not combined with the base model
+        if has_beam_widths:
+            final_meta.setdefault("beam_accuracy", {})
+            final_meta["beam_accuracy"].update(meta.get("beam_accuracy", {}))
+            final_meta.setdefault("beam_speed", {})
+            final_meta["beam_speed"].update(meta.get("beam_speed", {}))
+        srsly.write_json(meta_loc, final_meta)
         msg.good("Saved model to output directory", final_model_path)
         with msg.loading("Creating best model..."):
             best_model_path = _collate_best_model(final_meta, output_path, best_pipes)
@@ -649,11 +672,11 @@ def _get_metrics(component):
     if component == "parser":
         return ("las", "uas", "las_per_type", "token_acc")
     elif component == "tagger":
-        return ("tags_acc",)
+        return ("tags_acc", "token_acc")
     elif component == "ner":
-        return ("ents_f", "ents_p", "ents_r", "ents_per_type")
+        return ("ents_f", "ents_p", "ents_r", "ents_per_type", "token_acc")
     elif component == "textcat":
-        return ("textcat_score",)
+        return ("textcat_score", "token_acc")
     return ("token_acc",)

@@ -709,3 +732,12 @@ def _get_progress(
     if beam_width is not None:
         result.insert(1, beam_width)
     return result
+
+
+def _get_total_speed(speeds):
+    seconds_per_word = 0.0
+    for words_per_second in speeds:
+        if words_per_second is None:
+            return None
+        seconds_per_word += 1.0 / words_per_second
+    return 1.0 / seconds_per_word
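A quick worked example of the new `_get_total_speed` helper (the speeds are invented): combining a base model measured at 10,000 words per second with an updated model at 20,000 words per second gives 1 / (1/10000 + 1/20000), roughly 6,667 words per second, i.e. speeds are combined by summing seconds-per-word.

```python
# Stand-alone copy of the helper above, run on invented speeds to show the arithmetic.
def _get_total_speed(speeds):
    seconds_per_word = 0.0
    for words_per_second in speeds:
        if words_per_second is None:
            return None
        seconds_per_word += 1.0 / words_per_second
    return 1.0 / seconds_per_word

print(_get_total_speed([10000, 20000]))  # ~6666.7 words per second combined
print(_get_total_speed([10000, None]))   # None: a missing measurement propagates
```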
@@ -107,6 +107,9 @@ class Warnings(object):
     W027 = ("Found a large training file of {size} bytes. Note that it may "
             "be more efficient to split your training data into multiple "
             "smaller JSON files instead.")
+    W028 = ("Doc.from_array was called with a vector of type '{type}', "
+            "but is expecting one of type 'uint64' instead. This may result "
+            "in problems with the vocab further on in the pipeline.")


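The new W028 warning concerns the dtype handed to `Doc.from_array`. A minimal round-trip sketch using the documented spaCy 2.x pattern (the attribute choice is arbitrary):

```python
# to_array() returns uint64 values; passing from_array() another dtype is what W028 warns about.
import spacy
from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = nlp("Basque is now supported")
arr = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])    # dtype: uint64
doc2 = Doc(doc.vocab, words=[t.text for t in doc])
doc2.from_array([LOWER, POS, ENT_TYPE, IS_ALPHA], arr)  # fine: expected dtype
# doc2.from_array([LOWER, POS, ENT_TYPE, IS_ALPHA], arr.astype("int32"))
# ...should trigger W028 on spaCy 2.2.4
```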
@@ -541,6 +544,14 @@ class Errors(object):
     E188 = ("Could not match the gold entity links to entities in the doc - "
             "make sure the gold EL data refers to valid results of the "
             "named entity recognizer in the `nlp` pipeline.")
+    E189 = ("Each argument to `get_doc` should be of equal length.")
+    E190 = ("Token head out of range in `Doc.from_array()` for token index "
+            "'{index}' with value '{value}' (equivalent to relative head "
+            "index: '{rel_head_index}'). The head indices should be relative "
+            "to the current token index rather than absolute indices in the "
+            "array.")
+    E191 = ("Invalid head: the head token must be from the same doc as the "
+            "token itself.")


 @add_codes
@@ -151,6 +151,8 @@ def align(tokens_a, tokens_b):
     cost = 0
     a2b = numpy.empty(len(tokens_a), dtype="i")
     b2a = numpy.empty(len(tokens_b), dtype="i")
+    a2b.fill(-1)
+    b2a.fill(-1)
     a2b_multi = {}
     b2a_multi = {}
     i = 0
@@ -160,7 +162,6 @@ def align(tokens_a, tokens_b):
     while i < len(tokens_a) and j < len(tokens_b):
         a = tokens_a[i][offset_a:]
         b = tokens_b[j][offset_b:]
-        a2b[i] = b2a[j] = -1
         if a == b:
             if offset_a == offset_b == 0:
                 a2b[i] = j
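For context, the newly pre-filled `-1` values mark positions `align` cannot map one-to-one. A small example of calling the helper on two tokenizations that differ only in how "'s" is split (mirrors the usage shown in the spaCy 2.2 alignment docs; exact outputs depend on that version):

```python
from spacy.gold import align

other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
cost, a2b, b2a, a2b_multi, b2a_multi = align(other_tokens, spacy_tokens)
print(a2b)        # one-to-one mapping into spacy_tokens; unmapped entries stay -1
print(a2b_multi)  # many-to-one cases, e.g. "'" and "s" both fold into "'s"
```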
@@ -4,10 +4,10 @@ from __future__ import unicode_literals
 from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_PUNCT, LIST_QUOTES
 from ..char_classes import LIST_CURRENCY, CURRENCY, UNITS, PUNCT
 from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
-from ..punctuation import _prefixes, _suffixes
+from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES


-_prefixes = ["``",] + list(_prefixes)
+_prefixes = ["``"] + BASE_TOKENIZER_PREFIXES

 _suffixes = (
     ["''", "/"]
spacy/lang/eu/__init__.py (new file, 30 lines)
@@ -0,0 +1,30 @@
# coding: utf8
from __future__ import unicode_literals

from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_SUFFIXES
from .tag_map import TAG_MAP

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language
from ...attrs import LANG


class BasqueDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "eu"

    tokenizer_exceptions = BASE_EXCEPTIONS
    tag_map = TAG_MAP
    stop_words = STOP_WORDS
    suffixes = TOKENIZER_SUFFIXES


class Basque(Language):
    lang = "eu"
    Defaults = BasqueDefaults


__all__ = ["Basque"]
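For reference, a minimal sketch of exercising the new Basque class once the module above is installed (the sample text is shortened from examples.py below; everything else is standard spaCy 2.x usage):

```python
# Minimal usage sketch, assuming spaCy 2.2.x with this new module on the path.
from spacy.lang.eu import Basque

nlp = Basque()  # spacy.blank("eu") becomes equivalent once the language is registered
doc = nlp("bilbon ko castinga egin da eta nik jakin ez")
print([token.text for token in doc])      # tokenization uses the eu suffix rules
print([token.is_stop for token in doc])   # stop words come from stop_words.py
print([token.like_num for token in doc])  # LIKE_NUM comes from lex_attrs.py
```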
spacy/lang/eu/examples.py (new file, 14 lines)
@@ -0,0 +1,14 @@
# coding: utf8
from __future__ import unicode_literals

"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.eu.examples import sentences
>>> docs = nlp.pipe(sentences)
"""

sentences = [
    "bilbon ko castinga egin da eta nik jakin ez zuetako inork egin al du edota parte hartu duen ezagunik ba al du",
    "gaur telebistan entzunda denok martetik gatoz hortaz martzianoak gara beno nire ustez batzuk beste batzuk baino martzianoagoak dira"
]
80
spacy/lang/eu/lex_attrs.py
Normal file

@@ -0,0 +1,80 @@
# coding: utf8
from __future__ import unicode_literals

from ...attrs import LIKE_NUM

# Source http://mylanguages.org/basque_numbers.php


_num_words = """
bat
bi
hiru
lau
bost
sei
zazpi
zortzi
bederatzi
hamar
hamaika
hamabi
hamahiru
hamalau
hamabost
hamasei
hamazazpi
Hemezortzi
hemeretzi
hogei
ehun
mila
milioi
""".split()

# source https://www.google.com/intl/ur/inputtools/try/

_ordinal_words = """
lehen
bigarren
hirugarren
laugarren
bosgarren
seigarren
zazpigarren
zortzigarren
bederatzigarren
hamargarren
hamaikagarren
hamabigarren
hamahirugarren
hamalaugarren
hamabosgarren
hamaseigarren
hamazazpigarren
hamazortzigarren
hemeretzigarren
hogeigarren
behin
""".split()


def like_num(text):
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if text in _num_words:
        return True
    if text in _ordinal_words:
        return True
    return False


LEX_ATTRS = {LIKE_NUM: like_num}
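A quick way to sanity-check the `like_num` getter defined above (a sketch; the sample tokens are illustrative and assume the new `eu` module is importable from a checkout that includes this commit):

```python
# Minimal check of the Basque LIKE_NUM getter added in this file.
from spacy.lang.eu.lex_attrs import like_num

print(like_num("hamar"))   # True  - listed cardinal ("ten")
print(like_num("3/4"))     # True  - numerator and denominator are both digits
print(like_num("etxea"))   # False - ordinary word
```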
7
spacy/lang/eu/punctuation.py
Normal file

@@ -0,0 +1,7 @@
# coding: utf8
from __future__ import unicode_literals

from ..punctuation import TOKENIZER_SUFFIXES


_suffixes = TOKENIZER_SUFFIXES
108
spacy/lang/eu/stop_words.py
Normal file

@@ -0,0 +1,108 @@
# encoding: utf8
from __future__ import unicode_literals

# Source: https://github.com/stopwords-iso/stopwords-eu
# https://www.ranks.nl/stopwords/basque
# https://www.mustgo.com/worldlanguages/basque/
STOP_WORDS = set(
"""
al
anitz
arabera
asko
baina
bat
batean
batek
bati
batzuei
batzuek
batzuetan
batzuk
bera
beraiek
berau
berauek
bere
berori
beroriek
beste
bezala
da
dago
dira
ditu
du
dute
edo
egin
ere
eta
eurak
ez
gainera
gu
gutxi
guzti
haiei
haiek
haietan
hainbeste
hala
han
handik
hango
hara
hari
hark
hartan
hau
hauei
hauek
hauetan
hemen
hemendik
hemengo
hi
hona
honek
honela
honetan
honi
hor
hori
horiei
horiek
horietan
horko
horra
horrek
horrela
horretan
horri
hortik
hura
izan
ni
noiz
nola
non
nondik
nongo
nor
nora
ze
zein
zen
zenbait
zenbat
zer
zergatik
ziren
zituen
zu
zuek
zuen
zuten
""".split()
)
71
spacy/lang/eu/tag_map.py
Normal file

@@ -0,0 +1,71 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON

TAG_MAP = {
    ".": {POS: PUNCT, "PunctType": "peri"},
    ",": {POS: PUNCT, "PunctType": "comm"},
    "-LRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"},
    "-RRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"},
    "``": {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"},
    '""': {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
    "''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
    ":": {POS: PUNCT},
    "$": {POS: SYM, "Other": {"SymType": "currency"}},
    "#": {POS: SYM, "Other": {"SymType": "numbersign"}},
    "AFX": {POS: ADJ, "Hyph": "yes"},
    "CC": {POS: CCONJ, "ConjType": "coor"},
    "CD": {POS: NUM, "NumType": "card"},
    "DT": {POS: DET},
    "EX": {POS: ADV, "AdvType": "ex"},
    "FW": {POS: X, "Foreign": "yes"},
    "HYPH": {POS: PUNCT, "PunctType": "dash"},
    "IN": {POS: ADP},
    "JJ": {POS: ADJ, "Degree": "pos"},
    "JJR": {POS: ADJ, "Degree": "comp"},
    "JJS": {POS: ADJ, "Degree": "sup"},
    "LS": {POS: PUNCT, "NumType": "ord"},
    "MD": {POS: VERB, "VerbType": "mod"},
    "NIL": {POS: ""},
    "NN": {POS: NOUN, "Number": "sing"},
    "NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
    "NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
    "NNS": {POS: NOUN, "Number": "plur"},
    "PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
    "POS": {POS: PART, "Poss": "yes"},
    "PRP": {POS: PRON, "PronType": "prs"},
    "PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
    "RB": {POS: ADV, "Degree": "pos"},
    "RBR": {POS: ADV, "Degree": "comp"},
    "RBS": {POS: ADV, "Degree": "sup"},
    "RP": {POS: PART},
    "SP": {POS: SPACE},
    "SYM": {POS: SYM},
    "TO": {POS: PART, "PartType": "inf", "VerbForm": "inf"},
    "UH": {POS: INTJ},
    "VB": {POS: VERB, "VerbForm": "inf"},
    "VBD": {POS: VERB, "VerbForm": "fin", "Tense": "past"},
    "VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
    "VBN": {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"},
    "VBP": {POS: VERB, "VerbForm": "fin", "Tense": "pres"},
    "VBZ": {
        POS: VERB,
        "VerbForm": "fin",
        "Tense": "pres",
        "Number": "sing",
        "Person": 3,
    },
    "WDT": {POS: ADJ, "PronType": "int|rel"},
    "WP": {POS: NOUN, "PronType": "int|rel"},
    "WP$": {POS: ADJ, "Poss": "yes", "PronType": "int|rel"},
    "WRB": {POS: ADV, "PronType": "int|rel"},
    "ADD": {POS: X},
    "NFP": {POS: PUNCT},
    "GW": {POS: X},
    "XX": {POS: X},
    "BES": {POS: VERB},
    "HVS": {POS: VERB},
    "_SP": {POS: SPACE},
}
31
spacy/lang/lij/__init__.py
Normal file

@@ -0,0 +1,31 @@
# coding: utf8
from __future__ import unicode_literals

from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .punctuation import TOKENIZER_INFIXES

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups


class LigurianDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: "lij"
    lex_attr_getters[NORM] = add_lookups(
        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
    )
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
    infixes = TOKENIZER_INFIXES


class Ligurian(Language):
    lang = "lij"
    Defaults = LigurianDefaults


__all__ = ["Ligurian"]
18
spacy/lang/lij/examples.py
Normal file

@@ -0,0 +1,18 @@
# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.lij.examples import sentences
>>> docs = nlp.pipe(sentences)
"""


sentences = [
    "Sciusciâ e sciorbî no se peu.",
    "Graçie di çetroin, che me son arrivæ.",
    "Vegnime apreuvo, che ve fasso pescâ di òmmi.",
    "Bella pe sempre l'ægua inta conchetta quande unn'agoggia d'ægua a se â trapaña.",
]
15
spacy/lang/lij/punctuation.py
Normal file

@@ -0,0 +1,15 @@
# coding: utf8
from __future__ import unicode_literals

from ..punctuation import TOKENIZER_INFIXES
from ..char_classes import ALPHA


ELISION = " ' ’ ".strip().replace(" ", "").replace("\n", "")


_infixes = TOKENIZER_INFIXES + [
    r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
]

TOKENIZER_INFIXES = _infixes
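To illustrate what the extra infix rule does, here is a rough sketch (the test string is an assumption; it relies on `ALPHA` from `spacy.lang.char_classes`):

```python
# The elision infix matches a zero-width position after <letter><apostrophe>,
# which is where the tokenizer will split forms like "l'ægua".
import re
from spacy.lang.char_classes import ALPHA

ELISION = "'’"  # same characters the module builds from " ' ’ "
infix = r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
print([m.start() for m in re.finditer(infix, "l'ægua")])  # [2]
```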
43
spacy/lang/lij/stop_words.py
Normal file

@@ -0,0 +1,43 @@
# coding: utf8
from __future__ import unicode_literals


STOP_WORDS = set(
"""
a à â a-a a-e a-i a-o aiva aloa an ancheu ancon apreuvo ascì atra atre atri atro avanti avei

bella belle belli bello ben

ch' che chì chi ciù co-a co-e co-i co-o comm' comme con cösa coscì cöse

d' da da-a da-e da-i da-o dapeu de delongo derê di do doe doî donde dòppo

é e ê ea ean emmo en ëse

fin fiña

gh' ghe guæei

i î in insemme int' inta inte inti into

l' lê lì lô

m' ma manco me megio meno mezo mi

na n' ne ni ninte nisciun nisciuña no

o ò ô oua

parte pe pe-a pe-i pe-e pe-o perché pittin pö primma pròpio

quæ quand' quande quarche quella quelle quelli quello

s' sce scê sci sciâ sciô sciù se segge seu sò solo son sott' sta stæta stæte stæti stæto ste sti sto

tanta tante tanti tanto te ti torna tra tròppo tutta tutte tutti tutto

un uña unn' unna

za zu
""".split()
)
52
spacy/lang/lij/tokenizer_exceptions.py
Normal file

@@ -0,0 +1,52 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import ORTH, LEMMA

_exc = {}

for raw, lemma in [
    ("a-a", "a-o"),
    ("a-e", "a-o"),
    ("a-o", "a-o"),
    ("a-i", "a-o"),
    ("co-a", "co-o"),
    ("co-e", "co-o"),
    ("co-i", "co-o"),
    ("co-o", "co-o"),
    ("da-a", "da-o"),
    ("da-e", "da-o"),
    ("da-i", "da-o"),
    ("da-o", "da-o"),
    ("pe-a", "pe-o"),
    ("pe-e", "pe-o"),
    ("pe-i", "pe-o"),
    ("pe-o", "pe-o"),
]:
    for orth in [raw, raw.capitalize()]:
        _exc[orth] = [{ORTH: orth, LEMMA: lemma}]

# Prefix + prepositions with à (e.g. "sott'a-o")

for prep, prep_lemma in [
    ("a-a", "a-o"),
    ("a-e", "a-o"),
    ("a-o", "a-o"),
    ("a-i", "a-o"),
]:
    for prefix, prefix_lemma in [
        ("sott'", "sotta"),
        ("sott’", "sotta"),
        ("contr'", "contra"),
        ("contr’", "contra"),
        ("ch'", "che"),
        ("ch’", "che"),
        ("s'", "se"),
        ("s’", "se"),
    ]:
        for prefix_orth in [prefix, prefix.capitalize()]:
            _exc[prefix_orth+prep] = [
                {ORTH: prefix_orth, LEMMA: prefix_lemma},
                {ORTH: prep, LEMMA: prep_lemma},
            ]

TOKENIZER_EXCEPTIONS = _exc
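As a rough illustration of the effect of these exceptions (assuming a spaCy checkout that includes the new `lij` subpackage), the contracted prefix and the preposition should come out as separate tokens:

```python
# Hypothetical spot check of the Ligurian tokenizer exceptions.
from spacy.lang.lij import Ligurian

nlp = Ligurian()
doc = nlp("Sott'a-o ponte")
print([t.text for t in doc])  # expected roughly: ["Sott'", 'a-o', 'ponte']
```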
@@ -3,6 +3,9 @@ from __future__ import absolute_import, unicode_literals

import random
import itertools

from thinc.extra import load_nlp

from spacy.util import minibatch
import weakref
import functools

@@ -754,8 +757,6 @@ class Language(object):

        DOCS: https://spacy.io/api/language#pipe
        """
        # raw_texts will be used later to stop iterator.
        texts, raw_texts = itertools.tee(texts)
        if is_python2 and n_process != 1:
            user_warning(Warnings.W023)
            n_process = 1

@@ -856,7 +857,7 @@ class Language(object):
        procs = [
            mp.Process(
                target=_apply_pipes,
                args=(self.make_doc, pipes, rch, sch, Underscore.get_state()),
                args=(self.make_doc, pipes, rch, sch, Underscore.get_state(), load_nlp.VECTORS),
            )
            for rch, sch in zip(texts_q, bytedocs_send_ch)
        ]

@@ -1112,7 +1113,7 @@ def _pipe(docs, proc, kwargs):
        yield doc


def _apply_pipes(make_doc, pipes, receiver, sender, underscore_state):
def _apply_pipes(make_doc, pipes, receiver, sender, underscore_state, vectors):
    """Worker for Language.pipe

    receiver (multiprocessing.Connection): Pipe to receive text. Usually

@@ -1120,8 +1121,10 @@ def _apply_pipes(make_doc, pipes, receiver, sender, underscore_state):
    sender (multiprocessing.Connection): Pipe to send doc. Usually created by
        `multiprocessing.Pipe()`
    underscore_state (tuple): The data in the Underscore class of the parent
    vectors (dict): The global vectors data, copied from the parent
    """
    Underscore.load_state(underscore_state)
    load_nlp.VECTORS = vectors
    while True:
        texts = receiver.get()
        docs = (make_doc(text) for text in texts)
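The point of threading `load_nlp.VECTORS` through to the workers is that vector data set on the parent's vocab stays usable when `n_process > 1`. A sketch of the scenario (mirroring the regression test added later in this commit; the toy vectors are assumptions):

```python
# Sketch: multiprocessing pipe() with vectors on the vocab should neither hang
# nor crash once each worker receives the parent's load_nlp.VECTORS copy.
import numpy
from spacy.lang.en import English
from spacy.vocab import Vocab

vocab = Vocab(vectors_name="demo_vectors")
vocab.set_vector("cat", numpy.ones((3,), dtype="f"))
nlp = English(vocab=vocab)

for doc in nlp.pipe(["A cat sat."] * 4, n_process=2):
    print(doc.text)
```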
@@ -170,6 +170,10 @@ TOKEN_PATTERN_SCHEMA = {
            "title": "Token is the first in a sentence",
            "$ref": "#/definitions/boolean_value",
        },
        "SENT_START": {
            "title": "Token is the first in a sentence",
            "$ref": "#/definitions/boolean_value",
        },
        "LIKE_NUM": {
            "title": "Token resembles a number",
            "$ref": "#/definitions/boolean_value",
@@ -670,6 +670,8 @@ def _get_attr_values(spec, string_store):
            continue
        if attr == "TEXT":
            attr = "ORTH"
        if attr == "IS_SENT_START":
            attr = "SENT_START"
        if attr not in TOKEN_PATTERN_SCHEMA["items"]["properties"]:
            raise ValueError(Errors.E152.format(attr=attr))
        attr = IDS.get(attr)
@@ -367,7 +367,7 @@ class Tensorizer(Pipe):
        return sgd


@component("tagger", assigns=["token.tag", "token.pos"])
@component("tagger", assigns=["token.tag", "token.pos", "token.lemma"])
class Tagger(Pipe):
    """Pipeline component for part-of-speech tagging.
@@ -606,7 +606,6 @@ cdef class Parser:
        if not hasattr(get_gold_tuples, '__call__'):
            gold_tuples = get_gold_tuples
            get_gold_tuples = lambda: gold_tuples
        cfg.setdefault('min_action_freq', 30)
        actions = self.moves.get_actions(gold_parses=get_gold_tuples(),
                                         min_freq=cfg.get('min_action_freq', 30),
                                         learn_tokens=self.cfg.get("learn_tokens", False))

@@ -616,8 +615,9 @@ cdef class Parser:
                if label not in actions[action]:
                    actions[action][label] = freq
        self.moves.initialize_actions(actions)
        cfg.setdefault('token_vector_width', 96)
        if self.model is True:
            cfg.setdefault('min_action_freq', 30)
            cfg.setdefault('token_vector_width', 96)
            self.model, cfg = self.Model(self.moves.n_moves, **cfg)
            if sgd is None:
                sgd = self.create_optimizer()

@@ -633,11 +633,11 @@ cdef class Parser:
            if pipeline is not None:
                self.init_multitask_objectives(get_gold_tuples, pipeline, sgd=sgd, **cfg)
            link_vectors_to_models(self.vocab)
            self.cfg.update(cfg)
        else:
            if sgd is None:
                sgd = self.create_optimizer()
            self.model.begin_training([])
        self.cfg.update(cfg)
        return sgd

    def to_disk(self, path, exclude=tuple(), **kwargs):
@@ -83,6 +83,11 @@ def es_tokenizer():
    return get_lang_class("es").Defaults.create_tokenizer()


@pytest.fixture(scope="session")
def eu_tokenizer():
    return get_lang_class("eu").Defaults.create_tokenizer()


@pytest.fixture(scope="session")
def fi_tokenizer():
    return get_lang_class("fi").Defaults.create_tokenizer()
@@ -77,3 +77,30 @@ def test_doc_array_idx(en_vocab):
    assert offsets[0] == 0
    assert offsets[1] == 3
    assert offsets[2] == 11


def test_doc_from_array_heads_in_bounds(en_vocab):
    """Test that Doc.from_array doesn't set heads that are out of bounds."""
    words = ["This", "is", "a", "sentence", "."]
    doc = Doc(en_vocab, words=words)
    for token in doc:
        token.head = doc[0]

    # correct
    arr = doc.to_array(["HEAD"])
    doc_from_array = Doc(en_vocab, words=words)
    doc_from_array.from_array(["HEAD"], arr)

    # head before start
    arr = doc.to_array(["HEAD"])
    arr[0] = -1
    doc_from_array = Doc(en_vocab, words=words)
    with pytest.raises(ValueError):
        doc_from_array.from_array(["HEAD"], arr)

    # head after end
    arr = doc.to_array(["HEAD"])
    arr[0] = 5
    doc_from_array = Doc(en_vocab, words=words)
    with pytest.raises(ValueError):
        doc_from_array.from_array(["HEAD"], arr)
@@ -150,10 +150,9 @@ def test_doc_api_runtime_error(en_tokenizer):
    # Example that caused run-time error while parsing Reddit
    # fmt: off
    text = "67% of black households are single parent \n\n72% of all black babies born out of wedlock \n\n50% of all black kids don\u2019t finish high school"
    deps = ["nsubj", "prep", "amod", "pobj", "ROOT", "amod", "attr", "",
            "nummod", "prep", "det", "amod", "pobj", "acl", "prep", "prep",
            "pobj", "", "nummod", "prep", "det", "amod", "pobj", "aux", "neg",
            "ROOT", "amod", "dobj"]
    deps = ["nummod", "nsubj", "prep", "amod", "pobj", "ROOT", "amod", "attr", "", "nummod", "appos", "prep", "det",
            "amod", "pobj", "acl", "prep", "prep", "pobj",
            "", "nummod", "nsubj", "prep", "det", "amod", "pobj", "aux", "neg", "ccomp", "amod", "dobj"]
    # fmt: on
    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)

@@ -277,7 +276,9 @@ def test_doc_is_nered(en_vocab):
def test_doc_from_array_sent_starts(en_vocab):
    words = ["I", "live", "in", "New", "York", ".", "I", "like", "cats", "."]
    heads = [0, 0, 0, 0, 0, 0, 6, 6, 6, 6]
    # fmt: off
    deps = ["ROOT", "dep", "dep", "dep", "dep", "dep", "ROOT", "dep", "dep", "dep", "dep"]
    # fmt: on
    doc = Doc(en_vocab, words=words)
    for i, (dep, head) in enumerate(zip(deps, heads)):
        doc[i].dep_ = dep
@@ -167,6 +167,11 @@ def test_doc_token_api_head_setter(en_tokenizer):
    assert doc[4].left_edge.i == 0
    assert doc[2].left_edge.i == 0

    # head token must be from the same document
    doc2 = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
    with pytest.raises(ValueError):
        doc[0].head = doc2[0]


def test_is_sent_start(en_tokenizer):
    doc = en_tokenizer("This is a sentence. This is another.")

@@ -214,7 +219,7 @@ def test_token_api_conjuncts_chain(en_vocab):
def test_token_api_conjuncts_simple(en_vocab):
    words = "They came and went .".split()
    heads = [1, 0, -1, -2, -1]
    deps = ["nsubj", "ROOT", "cc", "conj"]
    deps = ["nsubj", "ROOT", "cc", "conj", "dep"]
    doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
    assert [w.text for w in doc[1].conjuncts] == ["went"]
    assert [w.text for w in doc[3].conjuncts] == ["came"]
16
spacy/tests/lang/eu/test_text.py
Normal file

@@ -0,0 +1,16 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


def test_eu_tokenizer_handles_long_text(eu_tokenizer):
    text = """ta nere guitarra estrenatu ondoren"""
    tokens = eu_tokenizer(text)
    assert len(tokens) == 5


@pytest.mark.parametrize("text,length", [("milesker ederra joan zen hitzaldia plazer hutsa", 7), ("astelehen guztia sofan pasau biot", 5)])
def test_eu_tokenizer_handles_cnts(eu_tokenizer, text, length):
    tokens = eu_tokenizer(text)
    assert len(tokens) == length
@@ -34,6 +34,8 @@ TEST_PATTERNS = [
    ([{"LOWER": {"REGEX": "^X", "NOT_IN": ["XXX", "XY"]}}], 0, 0),
    ([{"NORM": "a"}, {"POS": {"IN": ["NOUN"]}}], 0, 0),
    ([{"_": {"foo": {"NOT_IN": ["bar", "baz"]}, "a": 5, "b": {">": 10}}}], 0, 0),
    ([{"IS_SENT_START": True}], 0, 0),
    ([{"SENT_START": True}], 0, 0),
]

XFAIL_TEST_PATTERNS = [([{"orth": "foo"}], 0, 0)]
@@ -34,23 +34,23 @@ BIG BROTHER IS WATCHING YOU, the caption beneath it ran.
@pytest.fixture
def heads():
    # fmt: off
    return [1, 1, 0, 3, 2, 1, -4, -1, -1, -7, -8, 1, -10, 2, 1, -3, -1, -15,
            -1, 1, 4, -1, 1, -3, 0, -1, 1, -2, -4, 1, -2, 1, -2, 3, -1, 1,
            -4, -13, -14, -1, -2, 2, 1, -3, -1, 1, -2, -9, -1, 3, 1, 1, -14,
            1, -2, 1, -2, -1, 1, -2, -6, -1, -1, -2, -1, -1, -42, -1, 2, 1,
            0, -1, 1, -2, -1, 2, 1, -4, -8, 0, 1, -2, -1, -1, 3, -1, 1, -6,
            9, 1, 7, -1, 1, -2, 3, 2, 1, -10, -1, 1, -2, -22, -1, 1, 0, -1,
            2, 1, -4, -1, -2, -1, 1, -2, -6, -7, 1, -9, -1, 2, -1, -3, -1,
            3, 2, 1, -4, -19, -24, 3, 2, 1, -4, -1, 1, 2, -1, -5, -34, 1, 0,
            -1, 1, -2, -4, 1, 0, 1, -2, -1, 1, -2, -6, 1, 9, -1, 1, -3, -1,
            -1, 3, 2, 1, 0, -1, -2, 7, -1, 5, 1, 3, -1, 1, -10, -1, -2, 1,
            -2, -15, 1, 0, -1, -1, 2, 1, -3, -1, -1, -2, -1, 1, -2, -12, 1,
            1, 0, 1, -2, -1, -2, -3, 9, -1, 2, -1, -4, 2, 1, -3, -4, -15, 2,
            1, -3, -1, 2, 1, -3, -8, -9, -1, -2, -1, -4, 1, -2, -3, 1, -2,
            -19, 17, 1, -2, 14, 13, 3, 2, 1, -4, 8, -1, 1, 5, -1, 2, 1, -3,
    return [1, 1, 0, 3, 2, 1, -4, -1, -1, -7, -8, 1, 2, 1, -12, -1, -2,
            -1, 1, 4, 3, 1, 1, 0, -1, 1, -2, -4, 1, -2, 1, -2, 3, -1, 1,
            -4, -13, -14, -1, -2, 2, 1, -3, -1, 1, -2, -9, -1, -11, 1, 1, -14,
            1, -2, 1, -2, -1, 1, -2, -6, -1, -1, -2, -1, -1, -42, -1, 1, 1,
            0, -1, 1, -2, -1, 2, 1, -4, -8, 18, 1, -2, -1, -1, 3, -1, 1, 10,
            9, 1, 7, -1, 1, -2, 3, 2, 1, 0, -1, 1, -2, -4, -1, 1, 0, -1,
            2, 1, -4, -1, 2, 1, 1, 1, -6, -11, 1, 20, -1, 2, -1, -3, -1,
            3, 2, 1, -4, -10, -11, 3, 2, 1, -4, -1, 1, -3, -1, 0, -1, 1, 0,
            -1, 1, -2, -4, 1, 0, 1, -2, -1, 1, -2, -6, 1, 9, -1, 1, 6, -1,
            -1, 3, 2, 1, 0, -1, -2, 7, -1, 2, 1, 3, -1, 1, -10, -1, -2, 1,
            -2, -5, 1, 0, -1, -1, 1, -2, -5, -1, -1, -2, -1, 1, -2, -12, 1,
            1, 0, 1, -2, -1, -4, -5, 18, -1, 2, -1, -4, 2, 1, -3, -4, -5, 2,
            1, -3, -1, 2, 1, -3, -17, -24, -1, -2, -1, -4, 1, -2, -3, 1, -2,
            -10, 17, 1, -2, 14, 13, 3, 2, 1, -4, 8, -1, 1, 5, -1, 2, 1, -3,
            0, -1, 1, -2, -4, 1, 0, -1, -1, 2, -1, -3, 1, -2, 1, -2, 3, 1,
            1, -4, -1, -2, 2, 1, -5, -19, -1, 1, 1, 0, 1, 6, -1, 1, -3, -1,
            -1, -8, -9, -1]
            1, -4, -1, -2, 2, 1, -3, -19, -1, 1, 1, 0, 0, 6, 5, 1, 3, -1,
            -1, 0, -1, -1]
    # fmt: on
@@ -48,7 +48,7 @@ def test_issue2203(en_vocab):
    tag_ids = [en_vocab.strings.add(tag) for tag in tags]
    lemma_ids = [en_vocab.strings.add(lemma) for lemma in lemmas]
    doc = Doc(en_vocab, words=words)
    # Work around lemma corrpution problem and set lemmas after tags
    # Work around lemma corruption problem and set lemmas after tags
    doc.from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
    doc.from_array("LEMMA", numpy.array(lemma_ids, dtype="uint64"))
    assert [t.tag_ for t in doc] == tags
@@ -124,7 +124,7 @@ def test_issue2772(en_vocab):
    words = "When we write or communicate virtually , we can hide our true feelings .".split()
    # A tree with a non-projective (i.e. crossing) arc
    # The arcs (0, 4) and (2, 9) cross.
    heads = [4, 1, 7, -1, -2, -1, 3, 2, 1, 0, -1, -2, -1]
    heads = [4, 1, 7, -1, -2, -1, 3, 2, 1, 0, 2, 1, -3, -4]
    deps = ["dep"] * len(heads)
    doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
    assert doc[1].is_sent_start is None
@@ -27,7 +27,7 @@ def test_issue4590(en_vocab):

    text = "The quick brown fox jumped over the lazy fox"
    heads = [3, 2, 1, 1, 0, -1, 2, 1, -3]
    deps = ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"]
    deps = ["det", "amod", "amod", "nsubj", "ROOT", "prep", "det", "amod", "pobj"]

    doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps)
26
spacy/tests/regression/test_issue4725.py
Normal file

@@ -0,0 +1,26 @@
# coding: utf8
from __future__ import unicode_literals

import numpy

from spacy.lang.en import English
from spacy.vocab import Vocab


def test_issue4725():
    # ensures that this runs correctly and doesn't hang or crash because of the global vectors
    vocab = Vocab(vectors_name="test_vocab_add_vector")
    data = numpy.ndarray((5, 3), dtype="f")
    data[0] = 1.0
    data[1] = 2.0
    vocab.set_vector("cat", data[0])
    vocab.set_vector("dog", data[1])

    nlp = English(vocab=vocab)
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    nlp.begin_training()
    docs = ["Kurt is in London."] * 10
    for _ in nlp.pipe(docs, batch_size=2, n_process=2):
        pass
@@ -3,7 +3,6 @@ from __future__ import unicode_literals

from spacy.lang.en import English
from spacy.pipeline import EntityRuler
from spacy.tokens.underscore import Underscore


def test_issue4849():
@@ -1,10 +1,8 @@
# coding: utf8
from __future__ import unicode_literals

import spacy
from spacy.lang.en import English
from spacy.tokens import Span, Doc
from spacy.tokens.underscore import Underscore


class CustomPipe:
35
spacy/tests/regression/test_issue5048.py
Normal file

@@ -0,0 +1,35 @@
# coding: utf8
from __future__ import unicode_literals

import numpy
from spacy.tokens import Doc
from spacy.attrs import DEP, POS, TAG

from ..util import get_doc


def test_issue5048(en_vocab):
    words = ["This", "is", "a", "sentence"]
    pos_s = ["DET", "VERB", "DET", "NOUN"]
    spaces = [" ", " ", " ", ""]
    deps_s = ["dep", "adj", "nn", "atm"]
    tags_s = ["DT", "VBZ", "DT", "NN"]

    strings = en_vocab.strings

    for w in words:
        strings.add(w)
    deps = [strings.add(d) for d in deps_s]
    pos = [strings.add(p) for p in pos_s]
    tags = [strings.add(t) for t in tags_s]

    attrs = [POS, DEP, TAG]
    array = numpy.array(list(zip(pos, deps, tags)), dtype="uint64")

    doc = Doc(en_vocab, words=words, spaces=spaces)
    doc.from_array(attrs, array)
    v1 = [(token.text, token.pos_, token.tag_) for token in doc]

    doc2 = get_doc(en_vocab, words=words, pos=pos_s, deps=deps_s, tags=tags_s)
    v2 = [(token.text, token.pos_, token.tag_) for token in doc2]
    assert v1 == v2
46
spacy/tests/regression/test_issue5082.py
Normal file

@@ -0,0 +1,46 @@
# coding: utf8
from __future__ import unicode_literals

import numpy as np
from spacy.lang.en import English
from spacy.pipeline import EntityRuler


def test_issue5082():
    # Ensure the 'merge_entities' pipeline does something sensible for the vectors of the merged tokens
    nlp = English()
    vocab = nlp.vocab
    array1 = np.asarray([0.1, 0.5, 0.8], dtype=np.float32)
    array2 = np.asarray([-0.2, -0.6, -0.9], dtype=np.float32)
    array3 = np.asarray([0.3, -0.1, 0.7], dtype=np.float32)
    array4 = np.asarray([0.5, 0, 0.3], dtype=np.float32)
    array34 = np.asarray([0.4, -0.05, 0.5], dtype=np.float32)

    vocab.set_vector("I", array1)
    vocab.set_vector("like", array2)
    vocab.set_vector("David", array3)
    vocab.set_vector("Bowie", array4)

    text = "I like David Bowie"
    ruler = EntityRuler(nlp)
    patterns = [
        {"label": "PERSON", "pattern": [{"LOWER": "david"}, {"LOWER": "bowie"}]}
    ]
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler)

    parsed_vectors_1 = [t.vector for t in nlp(text)]
    assert len(parsed_vectors_1) == 4
    np.testing.assert_array_equal(parsed_vectors_1[0], array1)
    np.testing.assert_array_equal(parsed_vectors_1[1], array2)
    np.testing.assert_array_equal(parsed_vectors_1[2], array3)
    np.testing.assert_array_equal(parsed_vectors_1[3], array4)

    merge_ents = nlp.create_pipe("merge_entities")
    nlp.add_pipe(merge_ents)

    parsed_vectors_2 = [t.vector for t in nlp(text)]
    assert len(parsed_vectors_2) == 3
    np.testing.assert_array_equal(parsed_vectors_2[0], array1)
    np.testing.assert_array_equal(parsed_vectors_2[1], array2)
    np.testing.assert_array_equal(parsed_vectors_2[2], array34)
@@ -15,12 +15,19 @@ def load_tokenizer(b):


def test_serialize_custom_tokenizer(en_vocab, en_tokenizer):
    """Test that custom tokenizer with not all functions defined can be
    serialized and deserialized correctly (see #2494)."""
    """Test that custom tokenizer with not all functions defined or empty
    properties can be serialized and deserialized correctly (see #2494,
    #4991)."""
    tokenizer = Tokenizer(en_vocab, suffix_search=en_tokenizer.suffix_search)
    tokenizer_bytes = tokenizer.to_bytes()
    Tokenizer(en_vocab).from_bytes(tokenizer_bytes)

    tokenizer = Tokenizer(en_vocab, rules={"ABC.": [{"ORTH": "ABC", "ORTH": "."}]})
    tokenizer.rules = {}
    tokenizer_bytes = tokenizer.to_bytes()
    tokenizer_reloaded = Tokenizer(en_vocab).from_bytes(tokenizer_bytes)
    assert tokenizer_reloaded.rules == {}


@pytest.mark.skip(reason="Currently unreliable across platforms")
@pytest.mark.parametrize("text", ["I💜you", "they’re", "“hello”"])
@@ -31,10 +31,10 @@ def test_displacy_parse_deps(en_vocab):
    deps = displacy.parse_deps(doc)
    assert isinstance(deps, dict)
    assert deps["words"] == [
        {"lemma": None, "text": "This", "tag": "DET"},
        {"lemma": None, "text": "is", "tag": "AUX"},
        {"lemma": None, "text": "a", "tag": "DET"},
        {"lemma": None, "text": "sentence", "tag": "NOUN"},
        {"lemma": None, "text": words[0], "tag": pos[0]},
        {"lemma": None, "text": words[1], "tag": pos[1]},
        {"lemma": None, "text": words[2], "tag": pos[2]},
        {"lemma": None, "text": words[3], "tag": pos[3]},
    ]
    assert deps["arcs"] == [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},

@@ -75,7 +75,7 @@ def test_displacy_rtl():
    deps = ["foo", "bar", "foo", "baz"]
    heads = [1, 0, 1, -2]
    nlp = Persian()
    doc = get_doc(nlp.vocab, words=words, pos=pos, tags=pos, heads=heads, deps=deps)
    doc = get_doc(nlp.vocab, words=words, tags=pos, heads=heads, deps=deps)
    doc.ents = [Span(doc, 1, 3, label="TEST")]
    html = displacy.render(doc, page=True, style="dep")
    assert "direction: rtl" in html
@@ -7,8 +7,10 @@ import shutil
import contextlib
import srsly
from pathlib import Path

from spacy import Errors
from spacy.tokens import Doc, Span
from spacy.attrs import POS, HEAD, DEP
from spacy.attrs import POS, TAG, HEAD, DEP, LEMMA
from spacy.compat import path2str


@@ -26,30 +28,54 @@ def make_tempdir():
    shutil.rmtree(path2str(d))


def get_doc(vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=None):
def get_doc(vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=None, lemmas=None):
    """Create Doc object from given vocab, words and annotations."""
    pos = pos or [""] * len(words)
    tags = tags or [""] * len(words)
    heads = heads or [0] * len(words)
    deps = deps or [""] * len(words)
    for value in deps + tags + pos:
    if deps and not heads:
        heads = [0] * len(deps)
    headings = []
    values = []
    annotations = [pos, heads, deps, lemmas, tags]
    possible_headings = [POS, HEAD, DEP, LEMMA, TAG]
    for a, annot in enumerate(annotations):
        if annot is not None:
            if len(annot) != len(words):
                raise ValueError(Errors.E189)
            headings.append(possible_headings[a])
            if annot is not heads:
                values.extend(annot)
    for value in values:
        vocab.strings.add(value)

    doc = Doc(vocab, words=words)
    attrs = doc.to_array([POS, HEAD, DEP])
    for i, (p, head, dep) in enumerate(zip(pos, heads, deps)):
        attrs[i, 0] = doc.vocab.strings[p]
        attrs[i, 1] = head
        attrs[i, 2] = doc.vocab.strings[dep]
    doc.from_array([POS, HEAD, DEP], attrs)

    # if there are any other annotations, set them
    if headings:
        attrs = doc.to_array(headings)

        j = 0
        for annot in annotations:
            if annot:
                if annot is heads:
                    for i in range(len(words)):
                        if attrs.ndim == 1:
                            attrs[i] = heads[i]
                        else:
                            attrs[i,j] = heads[i]
                else:
                    for i in range(len(words)):
                        if attrs.ndim == 1:
                            attrs[i] = doc.vocab.strings[annot[i]]
                        else:
                            attrs[i, j] = doc.vocab.strings[annot[i]]
                j += 1
        doc.from_array(headings, attrs)

    # finally, set the entities
    if ents:
        doc.ents = [
            Span(doc, start, end, label=doc.vocab.strings[label])
            for start, end, label in ents
        ]
    if tags:
        for token in doc:
            token.tag_ = tags[token.i]
    return doc


@@ -90,8 +116,7 @@ def assert_docs_equal(doc1, doc2):

    assert [t.head.i for t in doc1] == [t.head.i for t in doc2]
    assert [t.dep for t in doc1] == [t.dep for t in doc2]
    if doc1.is_parsed and doc2.is_parsed:
        assert [s for s in doc1.sents] == [s for s in doc2.sents]
    assert [t.is_sent_start for t in doc1] == [t.is_sent_start for t in doc2]

    assert [t.ent_type for t in doc1] == [t.ent_type for t in doc2]
    assert [t.ent_iob for t in doc1] == [t.ent_iob for t in doc2]
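For orientation, a sketch of how the reworked helper is meant to be called with the new `lemmas` argument (the sentence and annotations are made up; `en_vocab` stands for the usual test fixture):

```python
# Hypothetical call to the extended get_doc test helper.
doc = get_doc(
    en_vocab,
    words=["Ada", "wrote", "programs"],
    heads=[1, 0, -1],                    # relative head offsets, as elsewhere in the tests
    deps=["nsubj", "ROOT", "dobj"],
    tags=["NNP", "VBD", "NNS"],
    lemmas=["Ada", "write", "program"],
)
assert [t.lemma_ for t in doc] == ["Ada", "write", "program"]
```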
@@ -14,7 +14,7 @@ import re

from .tokens.doc cimport Doc
from .strings cimport hash_string
from .compat import unescape_unicode
from .compat import unescape_unicode, basestring_
from .attrs import intify_attrs
from .symbols import ORTH

@@ -508,6 +508,7 @@ cdef class Tokenizer:

        DOCS: https://spacy.io/api/tokenizer#to_disk
        """
        path = util.ensure_path(path)
        with path.open("wb") as file_:
            file_.write(self.to_bytes(**kwargs))

@@ -521,6 +522,7 @@ cdef class Tokenizer:

        DOCS: https://spacy.io/api/tokenizer#from_disk
        """
        path = util.ensure_path(path)
        with path.open("rb") as file_:
            bytes_data = file_.read()
        self.from_bytes(bytes_data, **kwargs)

@@ -568,22 +570,22 @@ cdef class Tokenizer:
        for key in ["prefix_search", "suffix_search", "infix_finditer"]:
            if key in data:
                data[key] = unescape_unicode(data[key])
        if data.get("prefix_search"):
        if "prefix_search" in data and isinstance(data["prefix_search"], basestring_):
            self.prefix_search = re.compile(data["prefix_search"]).search
        if data.get("suffix_search"):
        if "suffix_search" in data and isinstance(data["suffix_search"], basestring_):
            self.suffix_search = re.compile(data["suffix_search"]).search
        if data.get("infix_finditer"):
        if "infix_finditer" in data and isinstance(data["infix_finditer"], basestring_):
            self.infix_finditer = re.compile(data["infix_finditer"]).finditer
        if data.get("token_match"):
        if "token_match" in data and isinstance(data["token_match"], basestring_):
            self.token_match = re.compile(data["token_match"]).match
        if data.get("rules"):
        if "rules" in data and isinstance(data["rules"], dict):
            # make sure to hard reset the cache to remove data from the default exceptions
            self._rules = {}
            self._reset_cache([key for key in self._cache])
            self._reset_specials()
            self._cache = PreshMap()
            self._specials = PreshMap()
            self._load_special_tokenization(data.get("rules", {}))
            self._load_special_tokenization(data["rules"])

        return self
@@ -213,6 +213,10 @@ def _merge(Doc doc, merges):
        new_orth = ''.join([t.text_with_ws for t in spans[token_index]])
        if spans[token_index][-1].whitespace_:
            new_orth = new_orth[:-len(spans[token_index][-1].whitespace_)]
        # add the vector of the (merged) entity to the vocab
        if not doc.vocab.get_vector(new_orth).any():
            if doc.vocab.vectors_length > 0:
                doc.vocab.set_vector(new_orth, span.vector)
        token = tokens[token_index]
        lex = doc.vocab.get(doc.mem, new_orth)
        token.lex = lex
@@ -260,7 +260,7 @@ cdef class Doc:
    def is_nered(self):
        """Check if the document has named entities set. Will return True if
        *any* of the tokens has a named entity tag set (even if the others are
        unknown values).
        unknown values), or if the document is empty.
        """
        if len(self) == 0:
            return True

@@ -785,10 +785,12 @@ cdef class Doc:
        # Allow strings, e.g. 'lemma' or 'LEMMA'
        attrs = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_)
                 for id_ in attrs]
        if array.dtype != numpy.uint64:
            user_warning(Warnings.W028.format(type=array.dtype))

        if SENT_START in attrs and HEAD in attrs:
            raise ValueError(Errors.E032)
        cdef int i, col
        cdef int i, col, abs_head_index
        cdef attr_id_t attr_id
        cdef TokenC* tokens = self.c
        cdef int length = len(array)

@@ -802,6 +804,14 @@ cdef class Doc:
            attr_ids[i] = attr_id
        if len(array.shape) == 1:
            array = array.reshape((array.size, 1))
        # Check that all heads are within the document bounds
        if HEAD in attrs:
            col = attrs.index(HEAD)
            for i in range(length):
                # cast index to signed int
                abs_head_index = numpy.int32(array[i, col]) + i
                if abs_head_index < 0 or abs_head_index >= length:
                    raise ValueError(Errors.E190.format(index=i, value=array[i, col], rel_head_index=numpy.int32(array[i, col])))
        # Do TAG first. This lets subsequent loop override stuff like POS, LEMMA
        if TAG in attrs:
            col = attrs.index(TAG)

@@ -872,7 +882,7 @@ cdef class Doc:

        DOCS: https://spacy.io/api/doc#to_bytes
        """
        array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID]  # TODO: ENT_KB_ID ?
        array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID, NORM]  # TODO: ENT_KB_ID ?
        if self.is_tagged:
            array_head.extend([TAG, POS])
        # If doc parsed add head and dep attribute

@@ -1173,6 +1183,7 @@ cdef int set_children_from_heads(TokenC* tokens, int length) except -1:
        heads_within_sents = _set_lr_kids_and_edges(tokens, length, loop_count)
        if loop_count > 10:
            user_warning(Warnings.W026)
            break
        loop_count += 1
    # Set sentence starts
    for i in range(length):
@@ -623,6 +623,9 @@ cdef class Token:
        # This function sets the head of self to new_head and updates the
        # counters for left/right dependents and left/right corner for the
        # new and the old head
        # Check that token is from the same document
        if self.doc != new_head.doc:
            raise ValueError(Errors.E191)
        # Do nothing if old head is new head
        if self.i + self.c.head == new_head.i:
            return
@@ -109,9 +109,9 @@ links) and check whether they are compatible with the currently installed
version of spaCy. Should be run after upgrading spaCy via `pip install -U spacy`
to ensure that all installed models are can be used with the new version. The
command is also useful to detect out-of-sync model links resulting from links
created in different virtual environments. It will a list of models, the
installed versions, the latest compatible version (if out of date) and the
commands for updating.
created in different virtual environments. It will show a list of models and
their installed versions. If any model is out of date, the latest compatible
versions and command for updating are shown.

> #### Automated validation
>
@@ -176,7 +176,7 @@ All output files generated by this command are compatible with

## Debug data {#debug-data new="2.2"}

Analyze, debug and validate your training and development data, get useful
Analyze, debug, and validate your training and development data. Get useful
stats, and find problems like invalid entity annotations, cyclic dependencies,
low data labels and more.

@@ -184,16 +184,17 @@ low data labels and more.
$ python -m spacy debug-data [lang] [train_path] [dev_path] [--base-model] [--pipeline] [--ignore-warnings] [--verbose] [--no-format]
```

| Argument | Type | Description |
| -------------------------- | ---------- | --------------------------------------------------- |
| `lang` | positional | Model language. |
| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. |
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
| `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
| `--verbose`, `-V` | flag | Print additional information and explanations. |
| --no-format, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. |
| Argument | Type | Description |
| ------------------------------------------------------ | ---------- | --------------------------------------------------- |
| `lang` | positional | Model language. |
| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. |
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
| `--tag-map-path`, `-tm` <Tag variant="new">2.2.3</Tag> | option | Location of JSON-formatted tag map. |
| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
| `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
| `--verbose`, `-V` | flag | Print additional information and explanations. |
| --no-format, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. |

<Accordion title="Example output">
@@ -368,6 +369,7 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
| `--base-model`, `-b` <Tag variant="new">2.1</Tag> | option | Optional name of base model to update. Can be any loadable spaCy model. |
| `--pipeline`, `-p` <Tag variant="new">2.1</Tag> | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
| `--replace-components`, `-R` | flag | Replace components from the base model. |
| `--vectors`, `-v` | option | Model to load vectors from. |
| `--n-iter`, `-n` | option | Number of iterations (default: `30`). |
| `--n-early-stopping`, `-ne` | option | Maximum number of training epochs without dev accuracy improvement. |

@@ -378,6 +380,13 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
| `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. |
| `--parser-multitasks`, `-pt` | option | Side objectives for parser CNN, e.g. `'dep'` or `'dep,tag'` |
| `--entity-multitasks`, `-et` | option | Side objectives for NER CNN, e.g. `'dep'` or `'dep,tag'` |
| `--width`, `-cw` <Tag variant="new">2.2.4</Tag> | option | Width of CNN layers of `Tok2Vec` component. |
| `--conv-depth`, `-cd` <Tag variant="new">2.2.4</Tag> | option | Depth of CNN layers of `Tok2Vec` component. |
| `--cnn-window`, `-cW` <Tag variant="new">2.2.4</Tag> | option | Window size for CNN layers of `Tok2Vec` component. |
| `--cnn-pieces`, `-cP` <Tag variant="new">2.2.4</Tag> | option | Maxout size for CNN layers of `Tok2Vec` component. |
| `--use-chars`, `-chr` <Tag variant="new">2.2.4</Tag> | flag | Whether to use character-based embedding of `Tok2Vec` component. |
| `--bilstm-depth`, `-lstm` <Tag variant="new">2.2.4</Tag> | option | Depth of BiLSTM layers of `Tok2Vec` component (requires PyTorch). |
| `--embed-rows`, `-er` <Tag variant="new">2.2.4</Tag> | option | Number of embedding rows of `Tok2Vec` component. |
| `--noise-level`, `-nl` | option | Float indicating the amount of corruption for data augmentation. |
| `--orth-variant-level`, `-ovl` <Tag variant="new">2.2</Tag> | option | Float indicating the orthography variation for data augmentation (e.g. `0.3` for making 30% of occurrences of some tokens subject to replacement). |
| `--gold-preproc`, `-G` | flag | Use gold preprocessing. |

@@ -385,6 +394,7 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
| `--textcat-multilabel`, `-TML` <Tag variant="new">2.2</Tag> | flag | Text classification classes aren't mutually exclusive (multilabel). |
| `--textcat-arch`, `-ta` <Tag variant="new">2.2</Tag> | option | Text classification model architecture. Defaults to `"bow"`. |
| `--textcat-positive-label`, `-tpl` <Tag variant="new">2.2</Tag> | option | Text classification positive label for binary classes with two labels. |
| `--tag-map-path`, `-tm` <Tag variant="new">2.2.4</Tag> | option | Location of JSON-formatted tag map. |
| `--verbose`, `-VV` <Tag variant="new">2.0.13</Tag> | flag | Show more detailed messages during training. |
| `--help`, `-h` | flag | Show help message and available arguments. |
| **CREATES** | model, pickle | A spaCy model on each epoch. |
@@ -7,9 +7,10 @@ source: spacy/tokens/doc.pyx

A `Doc` is a sequence of [`Token`](/api/token) objects. Access sentences and
named entities, export annotations to numpy arrays, losslessly serialize to
compressed binary strings. The `Doc` object holds an array of [`TokenC`](/api/cython-structs#tokenc) structs.
The Python-level `Token` and [`Span`](/api/span) objects are views of this
array, i.e. they don't own the data themselves.
compressed binary strings. The `Doc` object holds an array of
[`TokenC`](/api/cython-structs#tokenc) structs. The Python-level `Token` and
[`Span`](/api/span) objects are views of this array, i.e. they don't own the
data themselves.

## Doc.\_\_init\_\_ {#init tag="method"}
@@ -197,13 +198,14 @@ the character indices don't map to a valid span.
> assert span.text == "New York"
> ```

| Name | Type | Description |
| ----------- | ---------------------------------------- | -------------------------------------------------------- |
| `start` | int | The index of the first character of the span. |
| `end` | int | The index of the last character after the span. |
| `label` | uint64 / unicode | A label to attach to the Span, e.g. for named entities. |
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
| **RETURNS** | `Span` | The newly constructed object or `None`. |
| Name | Type | Description |
| ------------------------------------ | ---------------------------------------- | ---------------------------------------------------------------------- |
| `start` | int | The index of the first character of the span. |
| `end` | int | The index of the last character after the span. |
| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. |
| `kb_id` <Tag variant="new">2.2</Tag> | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. |
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
| **RETURNS** | `Span` | The newly constructed object or `None`. |

## Doc.similarity {#similarity tag="method" model="vectors"}
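A short sketch of the new `kb_id` argument documented above (the label and knowledge-base ID are invented for illustration):

```python
# Hypothetical use of Doc.char_span with a knowledge-base ID.
import spacy

nlp = spacy.blank("en")
doc = nlp("I like New York")
span = doc.char_span(7, 15, label="GPE", kb_id="Q60")
print(span.text, span.label_, span.kb_id_)  # New York GPE Q60
```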
@@ -655,10 +657,10 @@ The L2 norm of the document's vector representation.
| `user_data` | - | A generic storage area, for user custom data. |
| `lang` <Tag variant="new">2.1</Tag> | int | Language of the document's vocabulary. |
| `lang_` <Tag variant="new">2.1</Tag> | unicode | Language of the document's vocabulary. |
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. |
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. |
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. |
| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if _any_ of the tokens has an entity tag set, even if the others are unknown. |
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. Returns `True` if the `Doc` is empty. |
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. Returns `True` if the `Doc` is empty. |
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. Returns `True` if the `Doc` is empty. |
| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if the `Doc` is empty, or if _any_ of the tokens has an entity tag set, even if the others are unknown. |
| `sentiment` | float | The document's positivity/negativity score, if available. |
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
@@ -172,6 +172,28 @@ Remove a previously registered extension.
| `name` | unicode | Name of the extension. |
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |

## Span.char_span {#char_span tag="method" new="2.2.4"}

Create a `Span` object from the slice `span.text[start:end]`. Returns `None` if
the character indices don't map to a valid span.

> #### Example
>
> ```python
> doc = nlp("I like New York")
> span = doc[1:4].char_span(5, 13, label="GPE")
> assert span.text == "New York"
> ```

| Name | Type | Description |
| ----------- | ---------------------------------------- | ---------------------------------------------------------------------- |
| `start` | int | The index of the first character of the span. |
| `end` | int | The index of the last character after the span. |
| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. |
| `kb_id` | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. |
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
| **RETURNS** | `Span` | The newly constructed object or `None`. |

## Span.similarity {#similarity tag="method" model="vectors"}

Make a semantic similarity estimate. The default estimate is cosine similarity
@@ -293,10 +315,10 @@ Create a new `Doc` object corresponding to the `Span`, with a copy of the data.
> assert doc2.text == "New York"
> ```

| Name | Type | Description |
| ----------------- | ----- | ---------------------------------------------------- |
| `copy_user_data` | bool | Whether or not to copy the original doc's user data. |
| **RETURNS** | `Doc` | A `Doc` object of the `Span`'s content. |
| Name | Type | Description |
| ---------------- | ----- | ---------------------------------------------------- |
| `copy_user_data` | bool | Whether or not to copy the original doc's user data. |
| **RETURNS** | `Doc` | A `Doc` object of the `Span`'s content. |

## Span.root {#root tag="property" model="parser"}
@ -236,22 +236,22 @@ If a setting is not present in the options, the default value will be used.
|
|||
> displacy.serve(doc, style="dep", options=options)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description | Default |
|
||||
| ------------------ | ------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
|
||||
| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
|
||||
| `add_lemma` | bool | Print the lemma's in a separate row below the token texts in the `dep` visualisation. | `False` |
|
||||
| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` |
|
||||
| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
|
||||
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
|
||||
| `color` | unicode | Text color (HEX, RGB or color names). | `'#000000'` |
|
||||
| `bg` | unicode | Background color (HEX, RGB or color names). | `'#ffffff'` |
|
||||
| `font` | unicode | Font name or font family for all text. | `'Arial'` |
|
||||
| `offset_x` | int | Spacing on left side of the SVG in px. | `50` |
|
||||
| `arrow_stroke` | int | Width of arrow path in px. | `2` |
|
||||
| `arrow_width` | int | Width of arrow head in px. | `10` / `8` (compact) |
|
||||
| `arrow_spacing` | int | Spacing between arrows in px to avoid overlaps. | `20` / `12` (compact) |
|
||||
| `word_spacing` | int | Vertical spacing between words and arcs in px. | `45` |
|
||||
| `distance` | int | Distance between words in px. | `175` / `150` (compact) |
|
||||
| Name | Type | Description | Default |
|
||||
| ------------------------------------------ | ------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
|
||||
| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
|
||||
| `add_lemma` <Tag variant="new">2.2.4</Tag> | bool | Print the lemma's in a separate row below the token texts. | `False` |
|
||||
| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` |
|
||||
| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
|
||||
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
|
||||
| `color` | unicode | Text color (HEX, RGB or color names). | `'#000000'` |
|
||||
| `bg` | unicode | Background color (HEX, RGB or color names). | `'#ffffff'` |
|
||||
| `font` | unicode | Font name or font family for all text. | `'Arial'` |
|
||||
| `offset_x` | int | Spacing on left side of the SVG in px. | `50` |
|
||||
| `arrow_stroke` | int | Width of arrow path in px. | `2` |
|
||||
| `arrow_width` | int | Width of arrow head in px. | `10` / `8` (compact) |
|
||||
| `arrow_spacing` | int | Spacing between arrows in px to avoid overlaps. | `20` / `12` (compact) |
|
||||
| `word_spacing` | int | Vertical spacing between words and arcs in px. | `45` |
|
||||
| `distance` | int | Distance between words in px. | `175` / `150` (compact) |

#### Named Entity Visualizer options {#displacy_options-ent}

@@ -622,13 +622,13 @@ categorizer is to use the [`spacy train`](/api/cli#train) command-line utility.

In order to use this, you'll need training and evaluation data in the
[JSON format](/api/annotation#json-input) spaCy expects for training.

You can now train the model using a corpus for your language annotated with If
your data is in one of the supported formats, the easiest solution might be to
use the [`spacy convert`](/api/cli#convert) command-line utility. This supports
several popular formats, including the IOB format for named entity recognition,
the JSONL format produced by our annotation tool [Prodigy](https://prodi.gy),
and the [CoNLL-U](http://universaldependencies.org/docs/format.html) format used
by the [Universal Dependencies](http://universaldependencies.org/) corpus.

If your data is in one of the supported formats, the easiest solution might be
to use the [`spacy convert`](/api/cli#convert) command-line utility. This
supports several popular formats, including the IOB format for named entity
recognition, the JSONL format produced by our annotation tool
[Prodigy](https://prodi.gy), and the
[CoNLL-U](http://universaldependencies.org/docs/format.html) format used by the
[Universal Dependencies](http://universaldependencies.org/) corpus.
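
A rough sketch of that conversion step with the CLI; the file names and paths here are placeholders, not part of this commit:

```bash
# Convert CoNLL-U training and dev data into the JSON format spaCy expects
python -m spacy convert ./train.conllu ./corpus --converter conllu
python -m spacy convert ./dev.conllu ./corpus --converter conllu
# The resulting ./corpus/train.json and ./corpus/dev.json can then be passed to `spacy train`
```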

One thing to keep in mind is that spaCy expects to train its models from **whole
documents**, not just single sentences. If your corpus only contains single

@@ -1119,7 +1119,7 @@ entityruler = EntityRuler(nlp)

patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]

other_pipes = [p for p in nlp.pipe_names if p != "tagger"]
with nlp.disable_pipes(*disable_pipes):
with nlp.disable_pipes(*other_pipes):
    entityruler.add_patterns(patterns)
```

@@ -94,7 +94,7 @@ docs = list(doc_bin.get_docs(nlp.vocab))

If `store_user_data` is set to `True`, the `Doc.user_data` will be serialized as
well, which includes the values of
[extension attributes](/processing-pipelines#custom-components-attributes) (if
[extension attributes](/usage/processing-pipelines#custom-components-attributes) (if
they're serializable with msgpack).
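
A minimal sketch of what this looks like in practice, assuming a custom extension attribute; the extension name and texts below are illustrative:

```python
import spacy
from spacy.tokens import Doc, DocBin

# Illustrative extension attribute; any msgpack-serializable value works
Doc.set_extension("subject", default=None)

nlp = spacy.blank("en")
doc = nlp("Serialize me, please.")
doc._.subject = "docs"

doc_bin = DocBin(store_user_data=True)   # also serialize Doc.user_data
doc_bin.add(doc)
bytes_data = doc_bin.to_bytes()

# Later, restore the docs together with their extension values
restored = DocBin(store_user_data=True).from_bytes(bytes_data)
docs = list(restored.get_docs(nlp.vocab))
print(docs[0]._.subject)                 # "docs", if the value round-tripped
```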

<Infobox title="Important note on serializing extension attributes" variant="warning">

@@ -95,6 +95,8 @@

      "has_examples": true
    },
    { "code": "hr", "name": "Croatian", "has_examples": true },
    { "code": "eu", "name": "Basque", "has_examples": true },
    { "code": "yo", "name": "Yoruba", "has_examples": true },
    { "code": "tr", "name": "Turkish", "example": "Bu bir cümledir.", "has_examples": true },
    { "code": "ca", "name": "Catalan", "example": "Això és una frase.", "has_examples": true },
    { "code": "he", "name": "Hebrew", "example": "זהו משפט.", "has_examples": true },
@@ -179,6 +181,12 @@

      "name": "Vietnamese",
      "dependencies": [{ "name": "Pyvi", "url": "https://github.com/trungtv/pyvi" }]
    },
    {
      "code": "lij",
      "name": "Ligurian",
      "example": "Sta chì a l'é unna fraxe.",
      "has_examples": true
    },
    {
      "code": "xx",
      "name": "Multi-language",

@@ -1,5 +1,33 @@

{
  "resources": [
    {
      "id": "spacy-stanza",
      "title": "spacy-stanza",
      "slogan": "Use the latest Stanza (StanfordNLP) research models directly in spaCy",
      "description": "This package wraps the Stanza (formerly StanfordNLP) library, so you can use Stanford's models as a spaCy pipeline. Using this wrapper, you'll be able to use the following annotations, computed by your pretrained `stanza` model:\n\n- Statistical tokenization (reflected in the `Doc` and its tokens)\n - Lemmatization (`token.lemma` and `token.lemma_`)\n - Part-of-speech tagging (`token.tag`, `token.tag_`, `token.pos`, `token.pos_`)\n - Dependency parsing (`token.dep`, `token.dep_`, `token.head`)\n - Named entity recognition (`doc.ents`, `token.ent_type`, `token.ent_type_`, `token.ent_iob`, `token.ent_iob_`)\n - Sentence segmentation (`doc.sents`)",
      "github": "explosion/spacy-stanza",
      "pip": "spacy-stanza",
      "thumb": "https://i.imgur.com/myhLjMJ.png",
      "code_example": [
        "import stanza",
        "from spacy_stanza import StanzaLanguage",
        "",
        "snlp = stanza.Pipeline(lang=\"en\")",
        "nlp = StanzaLanguage(snlp)",
        "",
        "doc = nlp(\"Barack Obama was born in Hawaii. He was elected president in 2008.\")",
        "for token in doc:",
        "    print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)",
        "print(doc.ents)"
      ],
      "category": ["pipeline", "standalone", "models", "research"],
      "author": "Explosion",
      "author_links": {
        "twitter": "explosion_ai",
        "github": "explosion",
        "website": "https://explosion.ai"
      }
    },
    {
      "id": "spacy-server",
      "title": "spaCy Server",

@@ -1965,6 +1993,79 @@

      },
      "category": ["pipeline"],
      "tags": ["phrase extraction", "ner", "summarization", "graph algorithms", "textrank"]
    },
    {
      "id": "spacy_syllables",
      "title": "Spacy Syllables",
      "slogan": "Multilingual syllable annotations",
      "description": "Spacy Syllables is a pipeline component that adds multilingual syllable annotations to Tokens. It uses Pyphen under the hood and has support for a long list of languages.",
      "github": "sloev/spacy-syllables",
      "pip": "spacy_syllables",
      "code_example": [
        "import spacy",
        "from spacy_syllables import SpacySyllables",
        "",
        "nlp = spacy.load('en_core_web_sm')",
        "syllables = SpacySyllables(nlp)",
        "nlp.add_pipe(syllables, after='tagger')",
        "",
        "doc = nlp('terribly long')",
        "",
        "data = [",
        "    (token.text, token._.syllables, token._.syllables_count)",
        "    for token in doc",
        "]",
        "",
        "assert data == [",
        "    ('terribly', ['ter', 'ri', 'bly'], 3),",
        "    ('long', ['long'], 1)",
        "]"
      ],
      "thumb": "https://raw.githubusercontent.com/sloev/spacy-syllables/master/logo.png",
      "author": "Johannes Valbjørn",
      "author_links": {
        "github": "sloev"
      },
      "category": ["pipeline"],
      "tags": ["syllables", "multilingual"]
    },
    {
      "id": "gobbli",
      "title": "gobbli",
      "slogan": "Deep learning for text classification doesn't have to be scary",
      "description": "gobbli is a Python library which wraps several modern deep learning models in a uniform interface that makes it easy to evaluate feasibility and conduct analyses. It leverages the abstractive powers of Docker to hide nearly all dependency management and functional differences between models from the user. It also contains an interactive app for exploring text data and evaluating classification models. spaCy's base text classification models, as well as models integrated from `spacy-transformers`, are available in the collection of classification models. In addition, spaCy is used for data augmentation and document embeddings.",
      "url": "https://github.com/rtiinternational/gobbli",
      "github": "rtiinternational/gobbli",
      "pip": "gobbli",
      "thumb": "https://i.postimg.cc/NGpzhrdr/gobbli-lg.png",
      "code_example": [
        "from gobbli.io import PredictInput, TrainInput",
        "from gobbli.model.bert import BERT",
        "",
        "train_input = TrainInput(",
        "    X_train=['This is a training document.', 'This is another training document.'],",
        "    y_train=['0', '1'],",
        "    X_valid=['This is a validation sentence.', 'This is another validation sentence.'],",
        "    y_valid=['1', '0'],",
        ")",
        "",
        "clf = BERT()",
        "",
        "# Set up classifier resources -- Docker image, etc.",
        "clf.build()",
        "",
        "# Train model",
        "train_output = clf.train(train_input)",
        "",
        "predict_input = PredictInput(",
        "    X=['Which class is this document?'],",
        "    labels=train_output.labels,",
        "    checkpoint=train_output.checkpoint,",
        ")",
        "",
        "predict_output = clf.predict(predict_input)"
      ],
      "category": ["standalone"]
    }
  ],

@@ -1,6 +1,6 @@

<svg xmlns="http://www.w3.org/2000/svg" width="220" height="37" viewBox="0 0 610 103">
  <defs>
    <radialGradient id="gradient_allenai1 "cx="75.721" cy="20.894" r="11.05" gradientUnits="userSpaceOnUse">
    <radialGradient id="gradient_allenai1" cx="75.721" cy="20.894" r="11.05" gradientUnits="userSpaceOnUse">
      <stop offset=".3" stop-color="#FDEA65" />
      <stop offset="1" stop-color="#FCB431" />
    </radialGradient>