Mirror of https://github.com/explosion/spaCy.git, synced 2024-12-25 17:36:30 +03:00

Merge master into develop. Big merge, many conflicts -- need to review

This commit is contained in commit 2c4a6d66fa.
.github/CONTRIBUTOR_AGREEMENT.md (vendored, 4 changed lines)

@@ -87,11 +87,11 @@ U.S. Federal law. Any choice of law rules will not apply.
 7. Please place an “x” on one of the applicable statement below. Please do NOT
 mark both statements:

-* [x] I am signing on behalf of myself as an individual and no other person
+* [ ] I am signing on behalf of myself as an individual and no other person
 or entity, including my employer, has or will have rights with respect to my
 contributions.

-* [x] I am signing on behalf of my employer or a legal entity and I have the
+* [ ] I am signing on behalf of my employer or a legal entity and I have the
 actual authority to contractually bind that entity.

 ## Contributor Details
.github/contributors/ivyleavedtoadflax.md (vendored, new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                  |
| ------------------------------ | ---------------------- |
| Name                           | Matthew Upson          |
| Company name (if applicable)   |                        |
| Title or role (if applicable)  |                        |
| Date                           | 2018-04-24             |
| GitHub username                | ivyleavedtoadflax      |
| Website (optional)             | www.machinegurning.com |
.github/contributors/katrinleinweber.md (vendored, new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
(The introduction, the "Contributor Agreement" heading and sections 1-6 are identical to the standard SCA text reproduced in .github/contributors/ivyleavedtoadflax.md above.)
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry            |
| ------------------------------ | ---------------- |
| Name                           | Katrin Leinweber |
| Company name (if applicable)   |                  |
| Title or role (if applicable)  |                  |
| Date                           | 2018-03-30       |
| GitHub username                | katrinleinweber  |
| Website (optional)             |                  |
.github/contributors/miroli.md (vendored, new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
(The introduction, the "Contributor Agreement" heading and sections 1-6 are identical to the standard SCA text reproduced in .github/contributors/ivyleavedtoadflax.md above.)
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry            |
| ------------------------------ | ---------------- |
| Name                           | Robin Linderborg |
| Company name (if applicable)   |                  |
| Title or role (if applicable)  |                  |
| Date                           | 2018-04-23       |
| GitHub username                | miroli           |
| Website (optional)             |                  |
.github/contributors/mollerhoj.md (vendored, new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
(The introduction, the "Contributor Agreement" heading and sections 1-6 are identical to the standard SCA text reproduced in .github/contributors/ivyleavedtoadflax.md above.)
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry               |
| ------------------------------ | ------------------- |
| Name                           | Jens Dahl Mollerhoj |
| Company name (if applicable)   |                     |
| Title or role (if applicable)  |                     |
| Date                           | 4/04/2018           |
| GitHub username                | mollerhoj           |
| Website (optional)             |                     |
.github/contributors/skrcode.md (vendored, new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
(The introduction, the "Contributor Agreement" heading and sections 1-6 are identical to the standard SCA text reproduced in .github/contributors/ivyleavedtoadflax.md above.)
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry       |
| ------------------------------ | ----------- |
| Name                           | Suraj Rajan |
| Company name (if applicable)   |             |
| Title or role (if applicable)  |             |
| Date                           | 31/Mar/2018 |
| GitHub username                | skrcode     |
| Website (optional)             |             |
.github/contributors/trungtv.md (vendored, new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
(The introduction, the "Contributor Agreement" heading and sections 1-6 are identical to the standard SCA text reproduced in .github/contributors/ivyleavedtoadflax.md above.)
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry              |
| ------------------------------ | ------------------ |
| Name                           | Viet-Trung Tran    |
| Company name (if applicable)   |                    |
| Title or role (if applicable)  |                    |
| Date                           | 2018-03-28         |
| GitHub username                | trungtv            |
| Website (optional)             | https://datalab.vn |
CITATION (new file, 6 lines)

@@ -0,0 +1,6 @@
@ARTICLE{spacy2,
  AUTHOR = {Honnibal, Matthew AND Montani, Ines},
  TITLE = {spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing},
  YEAR = {2017},
  JOURNAL = {To appear}
}
@@ -73,28 +73,8 @@ so it only becomes visible on click, making the issue easier to read and follow.
 ### Issue labels

 To distinguish issues that are opened by us, the maintainers, we usually add a
-💫 to the title. We also use the following system to tag our issues and pull
-requests:
-
-| Issue label | Description |
-| --- | --- |
-| [`bug`](https://github.com/explosion/spaCy/labels/bug) | Bugs and behaviour differing from documentation |
-| [`enhancement`](https://github.com/explosion/spaCy/labels/enhancement) | Feature requests and improvements |
-| [`install`](https://github.com/explosion/spaCy/labels/install) | Installation problems |
-| [`performance`](https://github.com/explosion/spaCy/labels/performance) | Accuracy, speed and memory use problems |
-| [`tests`](https://github.com/explosion/spaCy/labels/tests) | Missing or incorrect [tests](spacy/tests) |
-| [`docs`](https://github.com/explosion/spaCy/labels/docs), [`examples`](https://github.com/explosion/spaCy/labels/examples) | Issues related to the [documentation](https://spacy.io/docs) and [examples](spacy/examples) |
-| [`training`](https://github.com/explosion/spaCy/labels/training) | Issues related to training and updating models |
-| [`models`](https://github.com/explosion/spaCy/labels/models), `language / [name]` | Issues related to the specific [models](https://github.com/explosion/spacy-models), languages and data |
-| [`linux`](https://github.com/explosion/spaCy/labels/linux), [`osx`](https://github.com/explosion/spaCy/labels/osx), [`windows`](https://github.com/explosion/spaCy/labels/windows) | Issues related to the specific operating systems |
-| [`pip`](https://github.com/explosion/spaCy/labels/pip), [`conda`](https://github.com/explosion/spaCy/labels/conda) | Issues related to the specific package managers |
-| [`compat`](https://github.com/explosion/spaCy/labels/compat) | Cross-platform and cross-Python compatibility issues |
-| [`wip`](https://github.com/explosion/spaCy/labels/wip) | Work in progress, mostly used for pull requests |
-| [`v1`](https://github.com/explosion/spaCy/labels/v1) | Reports related to spaCy v1.x |
-| [`duplicate`](https://github.com/explosion/spaCy/labels/duplicate) | Duplicates, i.e. issues that have been reported before |
-| [`third-party`](https://github.com/explosion/spaCy/labels/third-party) | Issues related to third-party packages and services |
-| [`meta`](https://github.com/explosion/spaCy/labels/meta) | Meta topics, e.g. repo organisation and issue management |
-| [`help wanted`](https://github.com/explosion/spaCy/labels/help%20wanted), [`help wanted (easy)`](https://github.com/explosion/spaCy/labels/help%20wanted%20%28easy%29) | Requests for contributions |
+💫 to the title. [See this page](https://github.com/explosion/spaCy/labels)
+for an overview of the system we use to tag our issues and pull requests.

 ## Contributing to the code base
@@ -220,7 +200,7 @@ All Python code must be written in an **intersection of Python 2 and Python 3**.
 This is easy in Cython, but somewhat ugly in Python. Logic that deals with
 Python or platform compatibility should only live in
 [`spacy.compat`](spacy/compat.py). To distinguish them from the builtin
-functions, replacement functions are suffixed with an undersocre, for example
+functions, replacement functions are suffixed with an underscore, for example
 `unicode_`. If you need to access the user's version or platform information,
 for example to show more specific error messages, you can use the `is_config()`
 helper function.
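The guideline above names the compat helpers without showing them in use. A minimal sketch, not part of this commit, of how `unicode_` and `is_config()` might be used, assuming `is_config()` accepts platform and version keyword flags as the guide implies:

```python
# Hypothetical usage sketch of the spacy.compat helpers named in the
# contributing guide above; not code from this diff.
from spacy.compat import unicode_, is_config

def describe(value):
    text = unicode_(value)  # unicode-aware text on both Python 2 and 3
    if is_config(python2=True, windows=True):
        # show a more specific message for this version/platform combination
        return "Python 2 on Windows: " + text
    return text
```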
README.rst (34 changed lines)

@@ -12,11 +12,11 @@ integration. It's commercial open-source software, released under the MIT licens
 💫 **Version 2.0 out now!** `Check out the new features here. <https://spacy.io/usage/v2>`_

-.. image:: https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square
+.. image:: https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square&logo=travis
     :target: https://travis-ci.org/explosion/spaCy
     :alt: Build Status

-.. image:: https://img.shields.io/appveyor/ci/explosion/spaCy/master.svg?style=flat-square
+.. image:: https://img.shields.io/appveyor/ci/explosion/spaCy/master.svg?style=flat-square&logo=appveyor
     :target: https://ci.appveyor.com/project/explosion/spaCy
     :alt: Appveyor Build Status

@@ -28,11 +28,11 @@ integration. It's commercial open-source software, released under the MIT licens
     :target: https://pypi.python.org/pypi/spacy
     :alt: pypi Version

-.. image:: https://anaconda.org/conda-forge/spacy/badges/version.svg
+.. image:: https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square
     :target: https://anaconda.org/conda-forge/spacy
     :alt: conda Version

-.. image:: https://img.shields.io/badge/gitter-join%20chat%20%E2%86%92-09a3d5.svg?style=flat-square
+.. image:: https://img.shields.io/badge/chat-join%20%E2%86%92-09a3d5.svg?style=flat-square&logo=gitter-white
     :target: https://gitter.im/explosion/spaCy
     :alt: spaCy on Gitter

@@ -49,7 +49,7 @@ integration. It's commercial open-source software, released under the MIT licens
 `New in v2.0`_      New features, backwards incompatibilities and migration guide.
 `API Reference`_    The detailed reference for spaCy's API.
 `Models`_           Download statistical language models for spaCy.
-`Resources`_        Libraries, extensions, demos, books and courses.
+`Universe`_         Libraries, extensions, demos, books and courses.
 `Changelog`_        Changes and version history.
 `Contribute`_       How to contribute to the spaCy project and code base.
 =================== ===

@@ -59,7 +59,7 @@ integration. It's commercial open-source software, released under the MIT licens
 .. _Usage Guides: https://spacy.io/usage/
 .. _API Reference: https://spacy.io/api/
 .. _Models: https://spacy.io/models
-.. _Resources: https://spacy.io/usage/resources
+.. _Universe: https://spacy.io/universe
 .. _Changelog: https://spacy.io/usage/#changelog
 .. _Contribute: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md

@@ -308,18 +308,20 @@ VS 2010 (Python 3.4) and VS 2015 (Python 3.5).
 Run tests
 =========

-spaCy comes with an `extensive test suite <spacy/tests>`_. First, find out where
-spaCy is installed:
+spaCy comes with an `extensive test suite <spacy/tests>`_. In order to run the
+tests, you'll usually want to clone the repository and build spaCy from source.
+This will also install the required development dependencies and test utilities
+defined in the ``requirements.txt``.
+
+Alternatively, you can find out where spaCy is installed and run ``pytest`` on
+that directory. Don't forget to also install the test utilities via spaCy's
+``requirements.txt``:

 .. code:: bash

     python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"
-
-Then run ``pytest`` on that directory. The flags ``--vectors``, ``--slow``
-and ``--model`` are optional and enable additional tests:
-
-.. code:: bash
-
-    # make sure you are using recent pytest version
-    python -m pip install -U pytest
+    pip install -r path/to/requirements.txt
     python -m pytest <spacy-directory>
+
+See `the documentation <https://spacy.io/usage/#tests>`_ for more details and
+examples.
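The new paragraph above refers to the clone-and-build route without listing the commands. A rough sketch of that workflow, assuming the usual git/pip steps for building spaCy from source (not part of this diff):

.. code:: bash

    git clone https://github.com/explosion/spaCy
    cd spaCy
    pip install -r requirements.txt
    pip install -e .
    python -m pytest spacy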
@@ -9,6 +9,7 @@ coordinates. Can be extended with more details from the API.
 * Custom pipeline components: https://spacy.io//usage/processing-pipelines#custom-components

 Compatible with: spaCy v2.0.0+
+Prerequisites: pip install requests
 """
 from __future__ import unicode_literals, print_function
@@ -81,7 +81,6 @@ def main(model=None, new_model_name='animal', output_dir=None, n_iter=20):
     else:
         nlp = spacy.blank('en')  # create blank Language class
         print("Created blank 'en' model")

     # Add entity recognizer to model if it's not in the pipeline
     # nlp.create_pipe works for built-ins that are registered with spaCy
     if 'ner' not in nlp.pipe_names:

@@ -92,11 +91,18 @@ def main(model=None, new_model_name='animal', output_dir=None, n_iter=20):
         ner = nlp.get_pipe('ner')

     ner.add_label(LABEL)   # add new entity label to entity recognizer
+    if model is None:
+        optimizer = nlp.begin_training()
+    else:
+        # Note that 'begin_training' initializes the models, so it'll zero out
+        # existing entity types.
+        optimizer = nlp.entity.create_optimizer()

     # get names of other pipes to disable them during training
     other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
     with nlp.disable_pipes(*other_pipes):  # only train NER
-        optimizer = nlp.begin_training()
         for itn in range(n_iter):
             random.shuffle(TRAIN_DATA)
             losses = {}
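The hunk above moves optimizer creation out of the training loop so that a loaded model is not re-initialized. A condensed sketch of the resulting pattern; the model name and label below are placeholders rather than values taken from the example:

```python
# Sketch of the pattern this hunk introduces: begin_training() only for a
# blank model, otherwise keep the existing weights and create an optimizer.
import spacy

def get_optimizer(nlp, from_blank=False):
    if from_blank:
        # begin_training() initializes the models and would zero out any
        # entity types the loaded model already knows
        return nlp.begin_training()
    return nlp.entity.create_optimizer()

nlp = spacy.load('en_core_web_sm')       # hypothetical pretrained pipeline
nlp.get_pipe('ner').add_label('ANIMAL')  # hypothetical new label
optimizer = get_optimizer(nlp, from_blank=False)
```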
@@ -1,6 +1,6 @@
 #!/usr/bin/env python
 # coding: utf8
-"""Train a multi-label convolutional neural network text classifier on the
+"""Train a convolutional neural network text classifier on the
 IMDB dataset, using the TextCategorizer component. The dataset will be loaded
 automatically via Thinc's built-in dataset loader. The model is added to
 spacy.pipeline, and predictions are available via `doc.cats`. For more details,
@@ -9,10 +9,6 @@ cytoolz>=0.9.0,<0.10.0
 plac<1.0.0,>=0.9.6
 ujson>=1.35
 dill>=0.2,<0.3
-requests>=2.13.0,<3.0.0
 regex==2017.4.5
-ftfy>=4.4.2,<5.0.0
 pytest>=3.0.6,<4.0.0
 mock>=2.0.0,<3.0.0
-msgpack-python==0.5.4
-msgpack-numpy==0.4.1
setup.py (6 changed lines)

@@ -38,6 +38,7 @@ MOD_NAMES = [
     'spacy.tokens.doc',
     'spacy.tokens.span',
     'spacy.tokens.token',
+    'spacy.tokens._retokenize',
     'spacy.matcher',
     'spacy.syntax.ner',
     'spacy.symbols',

@@ -195,11 +196,6 @@ def setup_package():
             'pathlib',
             'ujson>=1.35',
             'dill>=0.2,<0.3',
-            'requests>=2.13.0,<3.0.0',
-            'regex==2017.4.5',
-            'ftfy>=4.4.2,<5.0.0',
-            'msgpack-python==0.5.4',
-            'msgpack-numpy==0.4.1'],
         setup_requires=['wheel'],
         classifiers=[
             'Development Status :: 5 - Production/Stable',
@@ -4,18 +4,14 @@ from __future__ import unicode_literals
 from .cli.info import info as cli_info
 from .glossary import explain
 from .about import __version__
+from .errors import Warnings, deprecation_warning
 from . import util


 def load(name, **overrides):
     depr_path = overrides.get('path')
     if depr_path not in (True, False, None):
-        util.deprecated(
-            "As of spaCy v2.0, the keyword argument `path=` is deprecated. "
-            "You can now call spacy.load with the path as its first argument, "
-            "and the model's meta.json will be used to determine the language "
-            "to load. For example:\nnlp = spacy.load('{}')".format(depr_path),
-            'error')
+        deprecation_warning(Warnings.W001.format(path=depr_path))
     return util.load_model(name, **overrides)
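The removed message text above already describes the replacement usage; a small illustration with a placeholder path, not taken from the diff:

```python
# spaCy v1 style, now deprecated and reported via Warnings.W001:
#     nlp = spacy.load('en', path='/path/to/model')
# spaCy v2 style: pass the model directory as the first argument and let the
# model's meta.json determine the language to load.
import spacy

nlp = spacy.load('/path/to/model')  # placeholder path
```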
27 spacy/_ml.py

@@ -23,6 +23,7 @@ from thinc.neural._classes.affine import _set_dimensions_if_needed
 import thinc.extra.load_nlp

 from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE
+from .errors import Errors
 from . import util


@@ -225,6 +226,11 @@ class PrecomputableAffine(Model):

 def link_vectors_to_models(vocab):
     vectors = vocab.vectors
+    if vectors.name is None:
+        vectors.name = VECTORS_KEY
+        print(
+            "Warning: Unnamed vectors -- this won't allow multiple vectors "
+            "models to be loaded. (Shape: (%d, %d))" % vectors.data.shape)
     ops = Model.ops
     for word in vocab:
         if word.orth in vectors.key2row:
@@ -234,11 +240,11 @@ def link_vectors_to_models(vocab):
     data = ops.asarray(vectors.data)
     # Set an entry here, so that vectors are accessed by StaticVectors
     # (unideal, I know)
-    thinc.extra.load_nlp.VECTORS[(ops.device, VECTORS_KEY)] = data
+    thinc.extra.load_nlp.VECTORS[(ops.device, vectors.name)] = data


 def Tok2Vec(width, embed_size, **kwargs):
-    pretrained_dims = kwargs.get('pretrained_dims', 0)
+    pretrained_vectors = kwargs.get('pretrained_vectors', None)
     cnn_maxout_pieces = kwargs.get('cnn_maxout_pieces', 2)
     cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
     with Model.define_operators({'>>': chain, '|': concatenate, '**': clone,
@@ -251,16 +257,16 @@ def Tok2Vec(width, embed_size, **kwargs):
                            name='embed_suffix')
         shape = HashEmbed(width, embed_size//2, column=cols.index(SHAPE),
                           name='embed_shape')
-        if pretrained_dims is not None and pretrained_dims >= 1:
-            glove = StaticVectors(VECTORS_KEY, width, column=cols.index(ID))
+        if pretrained_vectors is not None:
+            glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))

             embed = uniqued(
                 (glove | norm | prefix | suffix | shape)
-                >> LN(Maxout(width, width*5, pieces=3)), column=5)
+                >> LN(Maxout(width, width*5, pieces=3)), column=cols.index(ORTH))
         else:
             embed = uniqued(
                 (norm | prefix | suffix | shape)
-                >> LN(Maxout(width, width*4, pieces=3)), column=5)
+                >> LN(Maxout(width, width*4, pieces=3)), column=cols.index(ORTH))

         convolution = Residual(
             ExtractWindow(nW=1)
@@ -318,10 +324,10 @@ def _divide_array(X, size):


 def get_col(idx):
-    assert idx >= 0, idx
+    if idx < 0:
+        raise IndexError(Errors.E066.format(value=idx))

     def forward(X, drop=0.):
-        assert idx >= 0, idx
         if isinstance(X, numpy.ndarray):
             ops = NumpyOps()
         else:
@@ -329,7 +335,6 @@ def get_col(idx):
         output = ops.xp.ascontiguousarray(X[:, idx], dtype=X.dtype)

         def backward(y, sgd=None):
-            assert idx >= 0, idx
             dX = ops.allocate(X.shape)
             dX[:, idx] += y
             return dX
@@ -416,13 +421,13 @@ def build_tagger_model(nr_class, **cfg):
         token_vector_width = cfg['token_vector_width']
     else:
         token_vector_width = util.env_opt('token_vector_width', 128)
-    pretrained_dims = cfg.get('pretrained_dims', 0)
+    pretrained_vectors = cfg.get('pretrained_vectors')
     with Model.define_operators({'>>': chain, '+': add}):
         if 'tok2vec' in cfg:
             tok2vec = cfg['tok2vec']
         else:
             tok2vec = Tok2Vec(token_vector_width, embed_size,
-                              pretrained_dims=pretrained_dims)
+                              pretrained_vectors=pretrained_vectors)
         softmax = with_flatten(Softmax(nr_class, token_vector_width))
         model = (
             tok2vec
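Taken together, the _ml.py hunks replace the integer pretrained_dims switch with a pretrained_vectors name: link_vectors_to_models() now registers the vector table in thinc under vocab.vectors.name, and Tok2Vec()/build_tagger_model() look the table up by that same name. A rough sketch of how the two ends are meant to meet, assuming a model that ships word vectors is installed (the package name is an example, and calling these internals directly is for illustration only):

    import spacy
    from spacy._ml import Tok2Vec, link_vectors_to_models

    nlp = spacy.load('en_core_web_md')   # example: any model with word vectors
    link_vectors_to_models(nlp.vocab)    # registers data under nlp.vocab.vectors.name
    tok2vec = Tok2Vec(128, 2000, pretrained_vectors=nlp.vocab.vectors.name)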
spacy/about.py

@@ -11,7 +11,6 @@ __email__ = 'contact@explosion.ai'
 __license__ = 'MIT'
 __release__ = False

-__docs_models__ = 'https://spacy.io/usage/models'
 __download_url__ = 'https://github.com/explosion/spacy-models/releases/download'
 __compatibility__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json'
 __shortcuts__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts-v2.json'
74 spacy/cli/_messages.py (new file)

@@ -0,0 +1,74 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+
+class Messages(object):
+    M001 = ("Download successful but linking failed")
+    M002 = ("Creating a shortcut link for 'en' didn't work (maybe you "
+            "don't have admin permissions?), but you can still load the "
+            "model via its full package name: nlp = spacy.load('{name}')")
+    M003 = ("Server error ({code}: {desc})")
+    M004 = ("Couldn't fetch {desc}. Please find a model for your spaCy "
+            "installation (v{version}), and download it manually. For more "
+            "details, see the documentation: https://spacy.io/usage/models")
+    M005 = ("Compatibility error")
+    M006 = ("No compatible models found for v{version} of spaCy.")
+    M007 = ("No compatible model found for '{name}' (spaCy v{version}).")
+    M008 = ("Can't locate model data")
+    M009 = ("The data should be located in {path}")
+    M010 = ("Can't find the spaCy data path to create model symlink")
+    M011 = ("Make sure a directory `/data` exists within your spaCy "
+            "installation and try again. The data directory should be "
+            "located here:")
+    M012 = ("Link '{name}' already exists")
+    M013 = ("To overwrite an existing link, use the --force flag.")
+    M014 = ("Can't overwrite symlink '{name}'")
+    M015 = ("This can happen if your data directory contains a directory or "
+            "file of the same name.")
+    M016 = ("Error: Couldn't link model to '{name}'")
+    M017 = ("Creating a symlink in spacy/data failed. Make sure you have the "
+            "required permissions and try re-running the command as admin, or "
+            "use a virtualenv. You can still import the model as a module and "
+            "call its load() method, or create the symlink manually.")
+    M018 = ("Linking successful")
+    M019 = ("You can now load the model via spacy.load('{name}')")
+    M020 = ("Can't find model meta.json")
+    M021 = ("Couldn't fetch compatibility table.")
+    M022 = ("Can't find spaCy v{version} in compatibility table")
+    M023 = ("Installed models (spaCy v{version})")
+    M024 = ("No models found in your current environment.")
+    M025 = ("Use the following commands to update the model packages:")
+    M026 = ("The following models are not available for spaCy "
+            "v{version}: {models}")
+    M027 = ("You may also want to overwrite the incompatible links using the "
+            "`python -m spacy link` command with `--force`, or remove them "
+            "from the data directory. Data path: {path}")
+    M028 = ("Input file not found")
+    M029 = ("Output directory not found")
+    M030 = ("Unknown format")
+    M031 = ("Can't find converter for {converter}")
+    M032 = ("Generated output file {name}")
+    M033 = ("Created {n_docs} documents")
+    M034 = ("Evaluation data not found")
+    M035 = ("Visualization output directory not found")
+    M036 = ("Generated {n} parses as HTML")
+    M037 = ("Can't find words frequencies file")
+    M038 = ("Sucessfully compiled vocab")
+    M039 = ("{entries} entries, {vectors} vectors")
+    M040 = ("Output directory not found")
+    M041 = ("Loaded meta.json from file")
+    M042 = ("Successfully created package '{name}'")
+    M043 = ("To build the package, run `python setup.py sdist` in this "
+            "directory.")
+    M044 = ("Package directory already exists")
+    M045 = ("Please delete the directory and try again, or use the `--force` "
+            "flag to overwrite existing directories.")
+    M046 = ("Generating meta.json")
+    M047 = ("Enter the package settings for your model. The following "
+            "information will be read from your model data: pipeline, vectors.")
+    M048 = ("No '{key}' setting found in meta.json")
+    M049 = ("This setting is required to build your package.")
+    M050 = ("Training data not found")
+    M051 = ("Development data not found")
+    M052 = ("Not a valid meta.json format")
+    M053 = ("Expected dict but got: {meta_type}")
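The Messages class above is a plain namespace of user-facing strings with str.format placeholders; the CLI modules patched below fill the placeholders in at the call site. A small illustration (the model name is just an example value):

    from spacy.cli._messages import Messages

    print(Messages.M002.format(name='en_core_web_sm'))
    # Creating a shortcut link for 'en' didn't work (maybe you don't have admin
    # permissions?), but you can still load the model via its full package name:
    # nlp = spacy.load('en_core_web_sm')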
spacy/cli/convert.py

@@ -5,6 +5,7 @@ import plac
 from pathlib import Path

 from .converters import conllu2json, iob2json, conll_ner2json
+from ._messages import Messages
 from ..util import prints

 # Converters are matched by file extension. To add a converter, add a new
@@ -32,14 +33,14 @@ def convert(input_file, output_dir, n_sents=1, morphology=False, converter='auto
     input_path = Path(input_file)
     output_path = Path(output_dir)
     if not input_path.exists():
-        prints(input_path, title="Input file not found", exits=1)
+        prints(input_path, title=Messages.M028, exits=1)
     if not output_path.exists():
-        prints(output_path, title="Output directory not found", exits=1)
+        prints(output_path, title=Messages.M029, exits=1)
     if converter == 'auto':
         converter = input_path.suffix[1:]
     if converter not in CONVERTERS:
-        prints("Can't find converter for %s" % converter,
-               title="Unknown format", exits=1)
+        prints(Messages.M031.format(converter=converter),
+               title=Messages.M030, exits=1)
     func = CONVERTERS[converter]
     func(input_path, output_path,
          n_sents=n_sents, use_morphology=morphology)
spacy/cli/converters/conll_ner2json.py

@@ -1,6 +1,7 @@
 # coding: utf8
 from __future__ import unicode_literals

+from .._messages import Messages
 from ...compat import json_dumps, path2str
 from ...util import prints
 from ...gold import iob_to_biluo
@@ -18,8 +19,8 @@ def conll_ner2json(input_path, output_path, n_sents=10, use_morphology=False):
     output_file = output_path / output_filename
     with output_file.open('w', encoding='utf-8') as f:
         f.write(json_dumps(docs))
-    prints("Created %d documents" % len(docs),
-           title="Generated output file %s" % path2str(output_file))
+    prints(Messages.M033.format(n_docs=len(docs)),
+           title=Messages.M032.format(name=path2str(output_file)))


 def read_conll_ner(input_path):
spacy/cli/converters/conllu2json.py

@@ -1,6 +1,7 @@
 # coding: utf8
 from __future__ import unicode_literals

+from .._messages import Messages
 from ...compat import json_dumps, path2str
 from ...util import prints

@@ -32,8 +33,8 @@ def conllu2json(input_path, output_path, n_sents=10, use_morphology=False):
     output_file = output_path / output_filename
     with output_file.open('w', encoding='utf-8') as f:
         f.write(json_dumps(docs))
-    prints("Created %d documents" % len(docs),
-           title="Generated output file %s" % path2str(output_file))
+    prints(Messages.M033.format(n_docs=len(docs)),
+           title=Messages.M032.format(name=path2str(output_file)))


 def read_conllx(input_path, use_morphology=False, n=0):
spacy/cli/converters/iob2json.py

@@ -2,6 +2,7 @@
 from __future__ import unicode_literals
 from cytoolz import partition_all, concat

+from .._messages import Messages
 from ...compat import json_dumps, path2str
 from ...util import prints
 from ...gold import iob_to_biluo
@@ -18,8 +19,8 @@ def iob2json(input_path, output_path, n_sents=10, *a, **k):
     output_file = output_path / output_filename
     with output_file.open('w', encoding='utf-8') as f:
         f.write(json_dumps(docs))
-    prints("Created %d documents" % len(docs),
-           title="Generated output file %s" % path2str(output_file))
+    prints(Messages.M033.format(n_docs=len(docs)),
+           title=Messages.M032.format(name=path2str(output_file)))


 def read_iob(raw_sents):
spacy/cli/download.py

@@ -2,13 +2,15 @@
 from __future__ import unicode_literals

 import plac
-import requests
 import os
 import subprocess
 import sys
+import ujson

 from .link import link
+from ._messages import Messages
 from ..util import prints, get_package_path
+from ..compat import url_read, HTTPError
 from .. import about


@@ -31,9 +33,7 @@ def download(model, direct=False):
         version = get_version(model_name, compatibility)
         dl = download_model('{m}-{v}/{m}-{v}.tar.gz'.format(m=model_name,
                                                             v=version))
-        if dl != 0:
-            # if download subprocess doesn't return 0, exit with the respective
-            # exit code before doing anything else
+        if dl != 0:  # if download subprocess doesn't return 0, exit
             sys.exit(dl)
         try:
             # Get package path here because link uses
@@ -47,22 +47,16 @@ def download(model, direct=False):
             # Dirty, but since spacy.download and the auto-linking is
             # mostly a convenience wrapper, it's best to show a success
             # message and loading instructions, even if linking fails.
-            prints(
-                "Creating a shortcut link for 'en' didn't work (maybe "
-                "you don't have admin permissions?), but you can still "
-                "load the model via its full package name:",
-                "nlp = spacy.load('%s')" % model_name,
-                title="Download successful but linking failed")
+            prints(Messages.M001.format(name=model_name), title=Messages.M002)


 def get_json(url, desc):
-    r = requests.get(url)
-    if r.status_code != 200:
-        msg = ("Couldn't fetch %s. Please find a model for your spaCy "
-               "installation (v%s), and download it manually.")
-        prints(msg % (desc, about.__version__), about.__docs_models__,
-               title="Server error (%d)" % r.status_code, exits=1)
-    return r.json()
+    try:
+        data = url_read(url)
+    except HTTPError as e:
+        prints(Messages.M004.format(desc, about.__version__),
+               title=Messages.M003.format(e.code, e.reason), exits=1)
+    return ujson.loads(data)


 def get_compatibility():
@@ -71,17 +65,16 @@ def get_compatibility():
     comp_table = get_json(about.__compatibility__, "compatibility table")
     comp = comp_table['spacy']
     if version not in comp:
-        prints("No compatible models found for v%s of spaCy." % version,
-               title="Compatibility error", exits=1)
+        prints(Messages.M006.format(version=version), title=Messages.M005,
+               exits=1)
     return comp[version]


 def get_version(model, comp):
     model = model.rsplit('.dev', 1)[0]
     if model not in comp:
-        version = about.__version__
-        msg = "No compatible model found for '%s' (spaCy v%s)."
-        prints(msg % (model, version), title="Compatibility error", exits=1)
+        prints(Messages.M007.format(name=model, version=about.__version__),
+               title=Messages.M005, exits=1)
     return comp[model][0]

spacy/cli/evaluate.py

@@ -4,6 +4,7 @@ from __future__ import unicode_literals, division, print_function
 import plac
 from timeit import default_timer as timer

+from ._messages import Messages
 from ..gold import GoldCorpus
 from ..util import prints
 from .. import util
@@ -33,10 +34,9 @@ def evaluate(model, data_path, gpu_id=-1, gold_preproc=False, displacy_path=None
     data_path = util.ensure_path(data_path)
     displacy_path = util.ensure_path(displacy_path)
     if not data_path.exists():
-        prints(data_path, title="Evaluation data not found", exits=1)
+        prints(data_path, title=Messages.M034, exits=1)
     if displacy_path and not displacy_path.exists():
-        prints(displacy_path, title="Visualization output directory not found",
-               exits=1)
+        prints(displacy_path, title=Messages.M035, exits=1)
     corpus = GoldCorpus(data_path, data_path)
     nlp = util.load_model(model)
     dev_docs = list(corpus.dev_docs(nlp, gold_preproc=gold_preproc))
@@ -52,8 +52,7 @@ def evaluate(model, data_path, gpu_id=-1, gold_preproc=False, displacy_path=None
         render_ents = 'ner' in nlp.meta.get('pipeline', [])
         render_parses(docs, displacy_path, model_name=model,
                       limit=displacy_limit, deps=render_deps, ents=render_ents)
-        msg = "Generated %s parses as HTML" % displacy_limit
-        prints(displacy_path, title=msg)
+        prints(displacy_path, title=Messages.M036.format(n=displacy_limit))


 def render_parses(docs, output_path, model_name='', limit=250, deps=True,
spacy/cli/info.py

@@ -5,15 +5,17 @@ import plac
 import platform
 from pathlib import Path

+from ._messages import Messages
 from ..compat import path2str
-from .. import about
 from .. import util
+from .. import about


 @plac.annotations(
     model=("optional: shortcut link of model", "positional", None, str),
-    markdown=("generate Markdown for GitHub issues", "flag", "md", str))
-def info(model=None, markdown=False):
+    markdown=("generate Markdown for GitHub issues", "flag", "md", str),
+    silent=("don't print anything (just return)", "flag", "s"))
+def info(model=None, markdown=False, silent=False):
     """Print info about spaCy installation. If a model shortcut link is
     speficied as an argument, print model information. Flag --markdown
     prints details in Markdown for easy copy-pasting to GitHub issues.
@@ -25,21 +27,24 @@ def info(model=None, markdown=False):
         model_path = util.get_data_path() / model
         meta_path = model_path / 'meta.json'
         if not meta_path.is_file():
-            util.prints(meta_path, title="Can't find model meta.json", exits=1)
+            util.prints(meta_path, title=Messages.M020, exits=1)
         meta = util.read_json(meta_path)
         if model_path.resolve() != model_path:
             meta['link'] = path2str(model_path)
             meta['source'] = path2str(model_path.resolve())
         else:
             meta['source'] = path2str(model_path)
-        print_info(meta, 'model %s' % model, markdown)
-    else:
-        data = {'spaCy version': about.__version__,
-                'Location': path2str(Path(__file__).parent.parent),
-                'Platform': platform.platform(),
-                'Python version': platform.python_version(),
-                'Models': list_models()}
-        print_info(data, 'spaCy', markdown)
+        if not silent:
+            print_info(meta, 'model %s' % model, markdown)
+        return meta
+    data = {'spaCy version': about.__version__,
+            'Location': path2str(Path(__file__).parent.parent),
+            'Platform': platform.platform(),
+            'Python version': platform.python_version(),
+            'Models': list_models()}
+    if not silent:
+        print_info(data, 'spaCy', markdown)
+    return data


 def print_info(data, title, markdown):
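With the new silent flag, info() doubles as a programmatic helper: it still prints by default, but it now returns the collected dict so callers can inspect it without console output. A hedged usage sketch (the 'en' shortcut link is only an example):

    from spacy.cli.info import info

    data = info(silent=True)           # spaCy version, location, platform, models
    print(data['spaCy version'])

    meta = info('en', silent=True)     # with a model link, returns that model's meta.json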
spacy/cli/init_model.py

@@ -12,10 +12,16 @@ import tarfile
 import gzip
 import zipfile

-from ..compat import fix_text
+from ._messages import Messages
 from ..vectors import Vectors
+from ..errors import Errors, Warnings, user_warning
 from ..util import prints, ensure_path, get_lang_class

+try:
+    import ftfy
+except ImportError:
+    ftfy = None
+

 @plac.annotations(
     lang=("model language", "positional", None, str),
@@ -23,27 +29,26 @@ from ..util import prints, ensure_path, get_lang_class
     freqs_loc=("location of words frequencies file", "positional", None, Path),
     clusters_loc=("optional: location of brown clusters data",
                   "option", "c", str),
-    vectors_loc=("optional: location of vectors file in GenSim text format",
-                 "option", "v", str),
+    vectors_loc=("optional: location of vectors file in Word2Vec format "
+                 "(either as .txt or zipped as .zip or .tar.gz)", "option",
+                 "v", str),
     prune_vectors=("optional: number of vectors to prune to",
                    "option", "V", int)
 )
-def init_model(lang, output_dir, freqs_loc=None, clusters_loc=None, vectors_loc=None, prune_vectors=-1):
+def init_model(lang, output_dir, freqs_loc=None, clusters_loc=None,
+               vectors_loc=None, prune_vectors=-1):
     """
     Create a new model from raw data, like word frequencies, Brown clusters
     and word vectors.
     """
     if freqs_loc is not None and not freqs_loc.exists():
-        prints(freqs_loc, title="Can't find words frequencies file", exits=1)
+        prints(freqs_loc, title=Messages.M037, exits=1)
     clusters_loc = ensure_path(clusters_loc)
     vectors_loc = ensure_path(vectors_loc)

     probs, oov_prob = read_freqs(freqs_loc) if freqs_loc is not None else ({}, -20)
     vectors_data, vector_keys = read_vectors(vectors_loc) if vectors_loc else (None, None)
     clusters = read_clusters(clusters_loc) if clusters_loc else {}

     nlp = create_model(lang, probs, oov_prob, clusters, vectors_data, vector_keys, prune_vectors)

     if not output_dir.exists():
         output_dir.mkdir()
     nlp.to_disk(output_dir)
@@ -71,7 +76,6 @@ def create_model(lang, probs, oov_prob, clusters, vectors_data, vector_keys, pru
     nlp = lang_class()
     for lexeme in nlp.vocab:
         lexeme.rank = 0
-
     lex_added = 0
     for i, (word, prob) in enumerate(tqdm(sorted(probs.items(), key=lambda item: item[1], reverse=True))):
         lexeme = nlp.vocab[word]
@@ -91,15 +95,13 @@ def create_model(lang, probs, oov_prob, clusters, vectors_data, vector_keys, pru
         lexeme = nlp.vocab[word]
         lexeme.is_oov = False
         lex_added += 1
-
     if len(vectors_data):
         nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys)
     if prune_vectors >= 1:
         nlp.vocab.prune_vectors(prune_vectors)
     vec_added = len(nlp.vocab.vectors)
-    prints("{} entries, {} vectors".format(lex_added, vec_added),
-           title="Sucessfully compiled vocab")
+    prints(Messages.M039.format(entries=lex_added, vectors=vec_added),
+           title=Messages.M038)
     return nlp

@@ -114,8 +116,7 @@ def read_vectors(vectors_loc):
         pieces = line.rsplit(' ', vectors_data.shape[1]+1)
         word = pieces.pop(0)
         if len(pieces) != vectors_data.shape[1]:
-            print(word, repr(line))
-            raise ValueError("Bad line in file")
+            raise ValueError(Errors.E094.format(line_num=i, loc=vectors_loc))
         vectors_data[i] = numpy.asarray(pieces, dtype='f')
         vectors_keys.append(word)
     return vectors_data, vectors_keys
@@ -150,11 +151,14 @@ def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50):
 def read_clusters(clusters_loc):
     print("Reading clusters...")
     clusters = {}
+    if ftfy is None:
+        user_warning(Warnings.W004)
     with clusters_loc.open() as f:
         for line in tqdm(f):
             try:
                 cluster, word, freq = line.split()
-                word = fix_text(word)
+                if ftfy is not None:
+                    word = ftfy.fix_text(word)
             except ValueError:
                 continue
             # If the clusterer has only seen the word a few times, its
spacy/cli/link.py

@@ -4,6 +4,7 @@ from __future__ import unicode_literals
 import plac
 from pathlib import Path

+from ._messages import Messages
 from ..compat import symlink_to, path2str
 from ..util import prints
 from .. import util
@@ -24,40 +25,29 @@ def link(origin, link_name, force=False, model_path=None):
     else:
         model_path = Path(origin) if model_path is None else Path(model_path)
     if not model_path.exists():
-        prints("The data should be located in %s" % path2str(model_path),
-               title="Can't locate model data", exits=1)
+        prints(Messages.M009.format(path=path2str(model_path)),
+               title=Messages.M008, exits=1)
     data_path = util.get_data_path()
     if not data_path or not data_path.exists():
         spacy_loc = Path(__file__).parent.parent
-        prints("Make sure a directory `/data` exists within your spaCy "
-               "installation and try again. The data directory should be "
-               "located here:", path2str(spacy_loc), exits=1,
-               title="Can't find the spaCy data path to create model symlink")
+        prints(Messages.M011, spacy_loc, title=Messages.M010, exits=1)
     link_path = util.get_data_path() / link_name
     if link_path.is_symlink() and not force:
-        prints("To overwrite an existing link, use the --force flag.",
-               title="Link %s already exists" % link_name, exits=1)
+        prints(Messages.M013, title=Messages.M012.format(name=link_name),
+               exits=1)
     elif link_path.is_symlink():  # does a symlink exist?
         # NB: It's important to check for is_symlink here and not for exists,
         # because invalid/outdated symlinks would return False otherwise.
         link_path.unlink()
     elif link_path.exists():  # does it exist otherwise?
         # NB: Check this last because valid symlinks also "exist".
-        prints("This can happen if your data directory contains a directory "
-               "or file of the same name.", link_path,
-               title="Can't overwrite symlink %s" % link_name, exits=1)
+        prints(Messages.M015, link_path,
+               title=Messages.M014.format(name=link_name), exits=1)
+    msg = "%s --> %s" % (path2str(model_path), path2str(link_path))
     try:
         symlink_to(link_path, model_path)
     except:
         # This is quite dirty, but just making sure other errors are caught.
-        prints("Creating a symlink in spacy/data failed. Make sure you have "
-               "the required permissions and try re-running the command as "
-               "admin, or use a virtualenv. You can still import the model as "
-               "a module and call its load() method, or create the symlink "
-               "manually.",
-               "%s --> %s" % (path2str(model_path), path2str(link_path)),
-               title="Error: Couldn't link model to '%s'" % link_name)
+        prints(Messages.M017, msg, title=Messages.M016.format(name=link_name))
         raise
-    prints("%s --> %s" % (path2str(model_path), path2str(link_path)),
-           "You can now load the model via spacy.load('%s')" % link_name,
-           title="Linking successful")
+    prints(msg, Messages.M019.format(name=link_name), title=Messages.M018)
spacy/cli/package.py

@@ -5,6 +5,7 @@ import plac
 import shutil
 from pathlib import Path

+from ._messages import Messages
 from ..compat import path2str, json_dumps
 from ..util import prints
 from .. import util
@@ -31,17 +32,17 @@ def package(input_dir, output_dir, meta_path=None, create_meta=False,
     output_path = util.ensure_path(output_dir)
     meta_path = util.ensure_path(meta_path)
     if not input_path or not input_path.exists():
-        prints(input_path, title="Model directory not found", exits=1)
+        prints(input_path, title=Messages.M008, exits=1)
     if not output_path or not output_path.exists():
-        prints(output_path, title="Output directory not found", exits=1)
+        prints(output_path, title=Messages.M040, exits=1)
     if meta_path and not meta_path.exists():
-        prints(meta_path, title="meta.json not found", exits=1)
+        prints(meta_path, title=Messages.M020, exits=1)

     meta_path = meta_path or input_path / 'meta.json'
     if meta_path.is_file():
         meta = util.read_json(meta_path)
         if not create_meta:  # only print this if user doesn't want to overwrite
-            prints(meta_path, title="Loaded meta.json from file")
+            prints(meta_path, title=Messages.M041)
         else:
             meta = generate_meta(input_dir, meta)
     meta = validate_meta(meta, ['lang', 'name', 'version'])
@@ -57,9 +58,8 @@ def package(input_dir, output_dir, meta_path=None, create_meta=False,
     create_file(main_path / 'setup.py', TEMPLATE_SETUP)
     create_file(main_path / 'MANIFEST.in', TEMPLATE_MANIFEST)
     create_file(package_path / '__init__.py', TEMPLATE_INIT)
-    prints(main_path, "To build the package, run `python setup.py sdist` in "
-           "this directory.",
-           title="Successfully created package '%s'" % model_name_v)
+    prints(main_path, Messages.M043,
+           title=Messages.M042.format(name=model_name_v))


 def create_dirs(package_path, force):
@@ -67,10 +67,7 @@ def create_dirs(package_path, force):
         if force:
             shutil.rmtree(path2str(package_path))
         else:
-            prints(package_path, "Please delete the directory and try again, "
-                   "or use the --force flag to overwrite existing "
-                   "directories.", title="Package directory already exists",
-                   exits=1)
+            prints(package_path, Messages.M045, title=Messages.M044, exits=1)
     Path.mkdir(package_path, parents=True)


@@ -97,9 +94,7 @@ def generate_meta(model_path, existing_meta):
     meta['vectors'] = {'width': nlp.vocab.vectors_length,
                        'vectors': len(nlp.vocab.vectors),
                        'keys': nlp.vocab.vectors.n_keys}
-    prints("Enter the package settings for your model. The following "
-           "information will be read from your model data: pipeline, vectors.",
-           title="Generating meta.json")
+    prints(Messages.M047, title=Messages.M046)
     for setting, desc, default in settings:
         response = util.get_raw_input(desc, default)
         meta[setting] = default if response == '' and default else response
@@ -111,8 +106,7 @@ def generate_meta(model_path, existing_meta):
 def validate_meta(meta, keys):
     for key in keys:
         if key not in meta or meta[key] == '':
-            prints("This setting is required to build your package.",
-                   title='No "%s" setting found in meta.json' % key, exits=1)
+            prints(Messages.M049, title=Messages.M048.format(key=key), exits=1)
     return meta

spacy/cli/train.py

@@ -7,6 +7,7 @@ import tqdm
 from thinc.neural._classes.model import Model
 from timeit import default_timer as timer

+from ._messages import Messages
 from ..attrs import PROB, IS_OOV, CLUSTER, LANG
 from ..gold import GoldCorpus
 from ..util import prints, minibatch, minibatch_by_words
@@ -52,15 +53,15 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
     dev_path = util.ensure_path(dev_data)
     meta_path = util.ensure_path(meta_path)
     if not train_path.exists():
-        prints(train_path, title="Training data not found", exits=1)
+        prints(train_path, title=Messages.M050, exits=1)
     if dev_path and not dev_path.exists():
-        prints(dev_path, title="Development data not found", exits=1)
+        prints(dev_path, title=Messages.M051, exits=1)
     if meta_path is not None and not meta_path.exists():
-        prints(meta_path, title="meta.json not found", exits=1)
+        prints(meta_path, title=Messages.M020, exits=1)
     meta = util.read_json(meta_path) if meta_path else {}
     if not isinstance(meta, dict):
-        prints("Expected dict but got: {}".format(type(meta)),
-               title="Not a valid meta.json format", exits=1)
+        prints(Messages.M053.format(meta_type=type(meta)),
+               title=Messages.M052, exits=1)
     meta.setdefault('lang', lang)
     meta.setdefault('name', 'unnamed')

@@ -94,6 +95,7 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
     meta['pipeline'] = pipeline
     nlp.meta.update(meta)
     if vectors:
+        print("Load vectors model", vectors)
         util.load_model(vectors, vocab=nlp.vocab)
         for lex in nlp.vocab:
             values = {}
spacy/cli/validate.py

@@ -1,12 +1,13 @@
 # coding: utf8
 from __future__ import unicode_literals, print_function

-import requests
 import pkg_resources
 from pathlib import Path
 import sys
+import ujson

-from ..compat import path2str, locale_escape
+from ._messages import Messages
+from ..compat import path2str, locale_escape, url_read, HTTPError
 from ..util import prints, get_data_path, read_json
 from .. import about

@@ -15,16 +16,16 @@ def validate():
     """Validate that the currently installed version of spaCy is compatible
     with the installed models. Should be run after `pip install -U spacy`.
     """
-    r = requests.get(about.__compatibility__)
-    if r.status_code != 200:
-        prints("Couldn't fetch compatibility table.",
-               title="Server error (%d)" % r.status_code, exits=1)
-    compat = r.json()['spacy']
+    try:
+        data = url_read(about.__compatibility__)
+    except HTTPError as e:
+        title = Messages.M003.format(code=e.code, desc=e.reason)
+        prints(Messages.M021, title=title, exits=1)
+    compat = ujson.loads(data)['spacy']
     current_compat = compat.get(about.__version__)
     if not current_compat:
         prints(about.__compatibility__, exits=1,
-               title="Can't find spaCy v{} in compatibility table"
-               .format(about.__version__))
+               title=Messages.M022.format(version=about.__version__))
     all_models = set()
     for spacy_v, models in dict(compat).items():
         all_models.update(models.keys())
@@ -41,7 +42,7 @@ def validate():
     update_models = [m for m in incompat_models if m in current_compat]

     prints(path2str(Path(__file__).parent.parent),
-           title="Installed models (spaCy v{})".format(about.__version__))
+           title=Messages.M023.format(version=about.__version__))
     if model_links or model_pkgs:
         print(get_row('TYPE', 'NAME', 'MODEL', 'VERSION', ''))
         for name, data in model_pkgs.items():
@@ -49,23 +50,16 @@ def validate():
         for name, data in model_links.items():
             print(get_model_row(current_compat, name, data, 'link'))
     else:
-        prints("No models found in your current environment.", exits=0)
+        prints(Messages.M024, exits=0)

     if update_models:
         cmd = ' python -m spacy download {}'
-        print("\n Use the following commands to update the model packages:")
+        print("\n " + Messages.M025)
         print('\n'.join([cmd.format(pkg) for pkg in update_models]))

     if na_models:
-        prints("The following models are not available for spaCy v{}: {}"
-               .format(about.__version__, ', '.join(na_models)))
+        prints(Messages.M025.format(version=about.__version__,
+                                    models=', '.join(na_models)))

     if incompat_links:
-        prints("You may also want to overwrite the incompatible links using "
-               "the `python -m spacy link` command with `--force`, or remove "
-               "them from the data directory. Data path: {}"
-               .format(path2str(get_data_path())))
+        prints(Messages.M027.format(path=path2str(get_data_path())))

     if incompat_models or incompat_links:
         sys.exit(1)
spacy/compat.py

@@ -1,7 +1,6 @@
 # coding: utf8
 from __future__ import unicode_literals

-import ftfy
 import sys
 import ujson
 import itertools
@@ -34,11 +33,20 @@ try:
 except ImportError:
     from thinc.neural.optimizers import Adam as Optimizer

+try:
+    import urllib.request
+except ImportError:
+    import urllib2 as urllib
+
+try:
+    from urllib.error import HTTPError
+except ImportError:
+    from urllib2 import HTTPError
+
 pickle = pickle
 copy_reg = copy_reg
 CudaStream = CudaStream
 cupy = cupy
-fix_text = ftfy.fix_text
 copy_array = copy_array
 izip = getattr(itertools, 'izip', zip)

@@ -58,6 +66,7 @@ if is_python2:
     input_ = raw_input  # noqa: F821
     json_dumps = lambda data: ujson.dumps(data, indent=2, escape_forward_slashes=False).decode('utf8')
    path2str = lambda path: str(path).decode('utf8')
+    url_open = urllib.urlopen

 elif is_python3:
     bytes_ = bytes
@@ -66,6 +75,16 @@ elif is_python3:
     input_ = input
     json_dumps = lambda data: ujson.dumps(data, indent=2, escape_forward_slashes=False)
     path2str = lambda path: str(path)
+    url_open = urllib.request.urlopen
+
+
+def url_read(url):
+    file_ = url_open(url)
+    code = file_.getcode()
+    if code != 200:
+        raise HTTPError(url, code, "Cannot GET url", [], file_)
+    data = file_.read()
+    return data


 def b_to_str(b_str):
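The new compat helpers drop the requests dependency: url_open points at the urllib opener for the running Python version, and url_read() raises HTTPError for non-200 responses, which the CLI code catches. A sketch of the calling pattern used by download.py and validate.py above (the helper name fetch_json is hypothetical; the URL comes from about.__compatibility__):

    import ujson
    from spacy.compat import url_read, HTTPError
    from spacy import about

    def fetch_json(url):
        # Mirror get_json()/validate(): read the raw body, fail loudly on HTTP errors.
        try:
            data = url_read(url)
        except HTTPError as e:
            raise SystemExit("Server error ({}: {})".format(e.code, e.reason))
        return ujson.loads(data)

    compat_table = fetch_json(about.__compatibility__)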
spacy/displacy/__init__.py

@@ -4,6 +4,7 @@ from __future__ import unicode_literals
 from .render import DependencyRenderer, EntityRenderer
 from ..tokens import Doc
 from ..compat import b_to_str
+from ..errors import Errors, Warnings, user_warning
 from ..util import prints, is_in_jupyter


@@ -27,7 +28,7 @@ def render(docs, style='dep', page=False, minify=False, jupyter=IS_JUPYTER,
     factories = {'dep': (DependencyRenderer, parse_deps),
                  'ent': (EntityRenderer, parse_ents)}
     if style not in factories:
-        raise ValueError("Unknown style: %s" % style)
+        raise ValueError(Errors.E087.format(style=style))
     if isinstance(docs, Doc) or isinstance(docs, dict):
         docs = [docs]
     renderer, converter = factories[style]
@@ -57,12 +58,12 @@ def serve(docs, style='dep', page=True, minify=False, options={}, manual=False,
     render(docs, style=style, page=page, minify=minify, options=options,
            manual=manual)
     httpd = simple_server.make_server('0.0.0.0', port, app)
-    prints("Using the '%s' visualizer" % style,
-           title="Serving on port %d..." % port)
+    prints("Using the '{}' visualizer".format(style),
+           title="Serving on port {}...".format(port))
     try:
         httpd.serve_forever()
     except KeyboardInterrupt:
-        prints("Shutting down server on port %d." % port)
+        prints("Shutting down server on port {}.".format(port))
     finally:
         httpd.server_close()

@@ -83,6 +84,12 @@ def parse_deps(orig_doc, options={}):
     RETURNS (dict): Generated dependency parse keyed by words and arcs.
     """
     doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes())
+    if not doc.is_parsed:
+        user_warning(Warnings.W005)
+    if options.get('collapse_phrases', False):
+        for np in list(doc.noun_chunks):
+            np.merge(tag=np.root.tag_, lemma=np.root.lemma_,
+                     ent_type=np.root.ent_type_)
     if options.get('collapse_punct', True):
         spans = []
         for word in doc[:-1]:
@@ -120,6 +127,8 @@ def parse_ents(doc, options={}):
     """
     ents = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}
             for ent in doc.ents]
+    if not ents:
+        user_warning(Warnings.W006)
     title = (doc.user_data.get('title', None)
              if hasattr(doc, 'user_data') else None)
     return {'text': doc.text, 'ents': ents, 'title': title}
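The displacy changes warn instead of failing silently (W005 for unparsed Docs, W006 when there are no entities) and add a collapse_phrases option that merges noun chunks before rendering. A hedged usage sketch (the model name and text are example values):

    import spacy
    from spacy import displacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

    # collapse_phrases merges each noun chunk into one token before drawing arcs
    html = displacy.render(doc, style='dep', options={'collapse_phrases': True})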
313
spacy/errors.py
Normal file
313
spacy/errors.py
Normal file
|
@ -0,0 +1,313 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import os
|
||||||
|
import warnings
|
||||||
|
import inspect
|
||||||
|
|
||||||
|
|
||||||
|
def add_codes(err_cls):
|
||||||
|
"""Add error codes to string messages via class attribute names."""
|
||||||
|
class ErrorsWithCodes(object):
|
||||||
|
def __getattribute__(self, code):
|
||||||
|
msg = getattr(err_cls, code)
|
||||||
|
return '[{code}] {msg}'.format(code=code, msg=msg)
|
||||||
|
return ErrorsWithCodes()
|
||||||
|
|
||||||
|
|
||||||
|
@add_codes
|
||||||
|
class Warnings(object):
|
||||||
|
W001 = ("As of spaCy v2.0, the keyword argument `path=` is deprecated. "
|
||||||
|
"You can now call spacy.load with the path as its first argument, "
|
||||||
|
"and the model's meta.json will be used to determine the language "
|
||||||
|
"to load. For example:\nnlp = spacy.load('{path}')")
|
||||||
|
W002 = ("Tokenizer.from_list is now deprecated. Create a new Doc object "
|
||||||
|
"instead and pass in the strings as the `words` keyword argument, "
|
||||||
|
"for example:\nfrom spacy.tokens import Doc\n"
|
||||||
|
"doc = Doc(nlp.vocab, words=[...])")
|
||||||
|
W003 = ("Positional arguments to Doc.merge are deprecated. Instead, use "
|
||||||
|
"the keyword arguments, for example tag=, lemma= or ent_type=.")
|
||||||
|
W004 = ("No text fixing enabled. Run `pip install ftfy` to enable fixing "
|
||||||
|
"using ftfy.fix_text if necessary.")
|
||||||
|
W005 = ("Doc object not parsed. This means displaCy won't be able to "
|
||||||
|
"generate a dependency visualization for it. Make sure the Doc "
|
||||||
|
"was processed with a model that supports dependency parsing, and "
|
||||||
|
"not just a language class like `English()`. For more info, see "
|
||||||
|
"the docs:\nhttps://spacy.io/usage/models")
|
||||||
|
W006 = ("No entities to visualize found in Doc object. If this is "
|
||||||
|
"surprising to you, make sure the Doc was processed using a model "
|
||||||
|
"that supports named entity recognition, and check the `doc.ents` "
|
||||||
|
"property manually if necessary.")
|
||||||
|
|
||||||
|
|
||||||
|
@add_codes
class Errors(object):
    E001 = ("No component '{name}' found in pipeline. Available names: {opts}")
    E002 = ("Can't find factory for '{name}'. This usually happens when spaCy "
            "calls `nlp.create_pipe` with a component name that's not built "
            "in - for example, when constructing the pipeline from a model's "
            "meta.json. If you're using a custom component, you can write to "
            "`Language.factories['{name}']` or remove it from the model meta "
            "and add it via `nlp.add_pipe` instead.")
    E003 = ("Not a valid pipeline component. Expected callable, but "
            "got {component} (name: '{name}').")
    E004 = ("If you meant to add a built-in component, use `create_pipe`: "
            "`nlp.add_pipe(nlp.create_pipe('{component}'))`")
    E005 = ("Pipeline component '{name}' returned None. If you're using a "
            "custom component, maybe you forgot to return the processed Doc?")
    E006 = ("Invalid constraints. You can only set one of the following: "
            "before, after, first, last.")
    E007 = ("'{name}' already exists in pipeline. Existing names: {opts}")
    E008 = ("Some current components would be lost when restoring previous "
            "pipeline state. If you added components after calling "
            "`nlp.disable_pipes()`, you should remove them explicitly with "
            "`nlp.remove_pipe()` before the pipeline is restored. Names of "
            "the new components: {names}")
    E009 = ("The `update` method expects same number of docs and golds, but "
            "got: {n_docs} docs, {n_golds} golds.")
    E010 = ("Word vectors set to length 0. This may be because you don't have "
            "a model installed or loaded, or because your model doesn't "
            "include word vectors. For more info, see the docs:\n"
            "https://spacy.io/usage/models")
    E011 = ("Unknown operator: '{op}'. Options: {opts}")
    E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}")
    E013 = ("Error selecting action in matcher")
    E014 = ("Unknown tag ID: {tag}")
    E015 = ("Conflicting morphology exception for ({tag}, {orth}). Use "
            "`force=True` to overwrite.")
    E016 = ("MultitaskObjective target should be function or one of: dep, "
            "tag, ent, dep_tag_offset, ent_tag.")
    E017 = ("Can only add unicode or bytes. Got type: {value_type}")
    E018 = ("Can't retrieve string for hash '{hash_value}'.")
    E019 = ("Can't create transition with unknown action ID: {action}. Action "
            "IDs are enumerated in spacy/syntax/{src}.pyx.")
    E020 = ("Could not find a gold-standard action to supervise the "
            "dependency parser. The tree is non-projective (i.e. it has "
            "crossing arcs - see spacy/syntax/nonproj.pyx for definitions). "
            "The ArcEager transition system only supports projective trees. "
            "To learn non-projective representations, transform the data "
            "before training and after parsing. Either pass "
            "`make_projective=True` to the GoldParse class, or use "
            "spacy.syntax.nonproj.preprocess_training_data.")
    E021 = ("Could not find a gold-standard action to supervise the "
            "dependency parser. The GoldParse was projective. The transition "
            "system has {n_actions} actions. State at failure: {state}")
    E022 = ("Could not find a transition with the name '{name}' in the NER "
            "model.")
    E023 = ("Error cleaning up beam: The same state occurred twice at "
            "memory address {addr} and position {i}.")
    E024 = ("Could not find an optimal move to supervise the parser. Usually, "
            "this means the GoldParse was not correct. For example, are all "
            "labels added to the model?")
    E025 = ("String is too long: {length} characters. Max is 2**30.")
    E026 = ("Error accessing token at position {i}: out of bounds in Doc of "
            "length {length}.")
    E027 = ("Arguments 'words' and 'spaces' should be sequences of the same "
            "length, or 'spaces' should be left default at None. spaces "
            "should be a sequence of booleans, with True meaning that the "
            "word owns a ' ' character following it.")
    E028 = ("orths_and_spaces expects either a list of unicode strings or a "
            "list of (unicode, bool) tuples. Got bytes instance: {value}")
    E029 = ("noun_chunks requires the dependency parse, which requires a "
            "statistical model to be installed and loaded. For more info, see "
            "the documentation:\nhttps://spacy.io/usage/models")
    E030 = ("Sentence boundaries unset. You can add the 'sentencizer' "
            "component to the pipeline with: "
            "nlp.add_pipe(nlp.create_pipe('sentencizer')) "
            "Alternatively, add the dependency parser, or set sentence "
            "boundaries by setting doc[i].is_sent_start.")
    E031 = ("Invalid token: empty string ('') at position {i}.")
    E032 = ("Conflicting attributes specified in doc.from_array(): "
            "(HEAD, SENT_START). The HEAD attribute currently sets sentence "
            "boundaries implicitly, based on the tree structure. This means "
            "the HEAD attribute would potentially override the sentence "
            "boundaries set by SENT_START.")
    E033 = ("Cannot load into non-empty Doc of length {length}.")
    E034 = ("Doc.merge received {n_args} non-keyword arguments. Expected "
            "either 3 arguments (deprecated), or 0 (use keyword arguments).\n"
            "Arguments supplied:\n{args}\nKeyword arguments:{kwargs}")
    E035 = ("Error creating span with start {start} and end {end} for Doc of "
            "length {length}.")
    E036 = ("Error calculating span: Can't find a token starting at character "
            "offset {start}.")
    E037 = ("Error calculating span: Can't find a token ending at character "
            "offset {end}.")
    E038 = ("Error finding sentence for span. Infinite loop detected.")
    E039 = ("Array bounds exceeded while searching for root word. This likely "
            "means the parse tree is in an invalid state. Please report this "
            "issue here: http://github.com/explosion/spaCy/issues")
    E040 = ("Attempt to access token at {i}, max length {max_length}.")
    E041 = ("Invalid comparison operator: {op}. Likely a Cython bug?")
    E042 = ("Error accessing doc[{i}].nbor({j}), for doc of length {length}.")
    E043 = ("Refusing to write to token.sent_start if its document is parsed, "
            "because this may cause inconsistent state.")
    E044 = ("Invalid value for token.sent_start: {value}. Must be one of: "
            "None, True, False")
    E045 = ("Possibly infinite loop encountered while looking for {attr}.")
    E046 = ("Can't retrieve unregistered extension attribute '{name}'. Did "
            "you forget to call the `set_extension` method?")
    E047 = ("Can't assign a value to unregistered extension attribute "
            "'{name}'. Did you forget to call the `set_extension` method?")
    E048 = ("Can't import language {lang} from spacy.lang.")
    E049 = ("Can't find spaCy data directory: '{path}'. Check your "
            "installation and permissions, or use spacy.util.set_data_path "
            "to customise the location if necessary.")
    E050 = ("Can't find model '{name}'. It doesn't seem to be a shortcut "
            "link, a Python package or a valid path to a data directory.")
    E051 = ("Can't load '{name}'. If you're using a shortcut link, make sure "
            "it points to a valid package (not just a data directory).")
    E052 = ("Can't find model directory: {path}")
    E053 = ("Could not read meta.json from {path}")
    E054 = ("No valid '{setting}' setting found in model meta.json.")
    E055 = ("Invalid ORTH value in exception:\nKey: {key}\nOrths: {orths}")
    E056 = ("Invalid tokenizer exception: ORTH values combined don't match "
            "original string.\nKey: {key}\nOrths: {orths}")
    E057 = ("Stepped slices not supported in Span objects. Try: "
            "list(tokens)[start:stop:step] instead.")
    E058 = ("Could not retrieve vector for key {key}.")
    E059 = ("One (and only one) keyword arg must be set. Got: {kwargs}")
    E060 = ("Cannot add new key to vectors: the table is full. Current shape: "
            "({rows}, {cols}).")
    E061 = ("Bad file name: {filename}. Example of a valid file name: "
            "'vectors.128.f.bin'")
    E062 = ("Cannot find empty bit for new lexical flag. All bits between 0 "
            "and 63 are occupied. You can replace one by specifying the "
            "`flag_id` explicitly, e.g. "
            "`nlp.vocab.add_flag(your_func, flag_id=IS_ALPHA)`.")
    E063 = ("Invalid value for flag_id: {value}. Flag IDs must be between 1 "
            "and 63 (inclusive).")
    E064 = ("Error fetching a Lexeme from the Vocab. When looking up a "
            "string, the lexeme returned had an orth ID that did not match "
            "the query string. This means that the cached lexeme structs are "
            "mismatched to the string encoding table. The mismatch:\n"
            "Query string: {string}\nOrth cached: {orth}\nOrth ID: {orth_id}")
    E065 = ("Only one of the vector table's width and shape can be specified. "
            "Got width {width} and shape {shape}.")
    E066 = ("Error creating model helper for extracting columns. Can only "
            "extract columns by positive integer. Got: {value}.")
    E067 = ("Invalid BILUO tag sequence: Got a tag starting with 'I' (inside "
            "an entity) without a preceding 'B' (beginning of an entity). "
            "Tag sequence:\n{tags}")
    E068 = ("Invalid BILUO tag: '{tag}'.")
    E069 = ("Invalid gold-standard parse tree. Found cycle between word "
            "IDs: {cycle}")
    E070 = ("Invalid gold-standard data. Number of documents ({n_docs}) "
            "does not align with number of annotations ({n_annots}).")
    E071 = ("Error creating lexeme: specified orth ID ({orth}) does not "
            "match the one in the vocab ({vocab_orth}).")
    E072 = ("Error serializing lexeme: expected data length {length}, "
            "got {bad_length}.")
    E073 = ("Cannot assign vector of length {new_length}. Existing vectors "
            "are of length {length}. You can use `vocab.reset_vectors` to "
            "clear the existing vectors and resize the table.")
    E074 = ("Error interpreting compiled match pattern: patterns are expected "
            "to end with the attribute {attr}. Got: {bad_attr}.")
    E075 = ("Error accepting match: length ({length}) > maximum length "
            "(128,154).")
    E076 = ("Error setting tensor on Doc: tensor has {rows} rows, while Doc "
            "has {words} words.")
    E077 = ("Error computing {value}: number of Docs ({n_docs}) does not "
            "equal number of GoldParse objects ({n_golds}) in batch.")
    E078 = ("Error computing score: number of words in Doc ({words_doc}) does "
            "not equal number of words in GoldParse ({words_gold}).")
    E079 = ("Error computing states in beam: number of predicted beams "
            "({pbeams}) does not equal number of gold beams ({gbeams}).")
    E080 = ("Duplicate state found in beam: {key}.")
    E081 = ("Error getting gradient in beam: number of histories ({n_hist}) "
            "does not equal number of losses ({losses}).")
    E082 = ("Error deprojectivizing parse: number of heads ({n_heads}), "
            "projective heads ({n_proj_heads}) and labels ({n_labels}) do not "
            "match.")
    E083 = ("Error setting extension: only one of `default`, `method`, or "
            "`getter` (plus optional `setter`) is allowed. Got: {nr_defined}")
    E084 = ("Error assigning label ID {label} to span: not in StringStore.")
    E085 = ("Can't create lexeme for string '{string}'.")
    E086 = ("Error deserializing lexeme '{string}': orth ID {orth_id} does "
            "not match hash {hash_id} in StringStore.")
    E087 = ("Unknown displaCy style: {style}.")
    E088 = ("Text of length {length} exceeds maximum of {max_length}. The "
            "v2.x parser and NER models require roughly 1GB of temporary "
            "memory per 100,000 characters in the input. This means long "
            "texts may cause memory allocation errors. If you're not using "
            "the parser or NER, it's probably safe to increase the "
            "`nlp.max_length` limit. The limit is in number of characters, so "
            "you can check whether your inputs are too long by checking "
            "`len(text)`.")
    E089 = ("Extensions can't have a setter argument without a getter "
            "argument. Check the keyword arguments on `set_extension`.")
    E090 = ("Extension '{name}' already exists on {obj}. To overwrite the "
            "existing extension, set `force=True` on `{obj}.set_extension`.")
    E091 = ("Invalid extension attribute {name}: expected callable or None, "
            "but got: {value}")
    E092 = ("Could not find or assign name for word vectors. Usually, the "
            "name is read from the model's meta.json in vector.name. "
            "Alternatively, it is built from the 'lang' and 'name' keys in "
            "the meta.json. Vector names are required to avoid issue #1660.")
    E093 = ("token.ent_iob values make invalid sequence: I without B\n{seq}")
    E094 = ("Error reading line {line_num} in vectors file {loc}.")
@add_codes
class TempErrors(object):
    T001 = ("Max length currently 10 for phrase matching")
    T002 = ("Pattern length ({doc_len}) >= phrase_matcher.max_length "
            "(128,154). Length can be set on initialization, up to 10.")
    T003 = ("Resizing pre-trained Tagger models is not currently supported.")
    T004 = ("Currently parser depth is hard-coded to 1. Received: {value}.")
    T005 = ("Currently history size is hard-coded to 0. Received: {value}.")
    T006 = ("Currently history width is hard-coded to 0. Received: {value}.")
    T007 = ("Can't yet set {attr} from Span. Vote for this feature on the "
            "issue tracker: http://github.com/explosion/spaCy/issues")
    T008 = ("Bad configuration of Tagger. This is probably a bug within "
            "spaCy. We changed the name of an internal attribute for loading "
            "pre-trained vectors, and the class has been passed the old name "
            "(pretrained_dims) but not the new name (pretrained_vectors).")
class ModelsWarning(UserWarning):
    pass


WARNINGS = {
    'user': UserWarning,
    'deprecation': DeprecationWarning,
    'models': ModelsWarning,
}


def _get_warn_types(arg):
    if arg == '':  # don't show any warnings
        return []
    if not arg or arg == 'all':  # show all available warnings
        return WARNINGS.keys()
    return [w_type.strip() for w_type in arg.split(',')
            if w_type.strip() in WARNINGS]


SPACY_WARNING_FILTER = os.environ.get('SPACY_WARNING_FILTER', 'always')
SPACY_WARNING_TYPES = _get_warn_types(os.environ.get('SPACY_WARNING_TYPES'))


def user_warning(message):
    _warn(message, 'user')


def deprecation_warning(message):
    _warn(message, 'deprecation')


def models_warning(message):
    _warn(message, 'models')


def _warn(message, warn_type='user'):
    """
    message (unicode): The message to display.
    warn_type (unicode): The type of warning to show, used to look up the
        Warning category in WARNINGS.
    """
    if warn_type in SPACY_WARNING_TYPES:
        category = WARNINGS[warn_type]
        stack = inspect.stack()[-1]
        with warnings.catch_warnings():
            warnings.simplefilter(SPACY_WARNING_FILTER, category)
            warnings.warn_explicit(message, category, stack[1], stack[2])
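A brief usage sketch of the environment-driven filtering above (the shell values are hypothetical, not part of the diff): the warning types and the filter action are read once, when the module is first imported.

# Minimal sketch: configure spaCy's warnings before importing spacy.
import os
os.environ['SPACY_WARNING_TYPES'] = 'deprecation,models'   # drop plain 'user' warnings
os.environ['SPACY_WARNING_FILTER'] = 'once'                # any warnings.simplefilter action
import spacy   # spacy.errors reads both variables at import time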
@@ -17,6 +17,7 @@ import ujson
 from . import _align
 from .syntax import nonproj
 from .tokens import Doc
+from .errors import Errors
 from . import util
 from .util import minibatch, itershuffle
 from .compat import json_dumps
@@ -37,7 +38,8 @@ def tags_to_entities(tags):
         elif tag == '-':
             continue
         elif tag.startswith('I'):
-            assert start is not None, tags[:i]
+            if start is None:
+                raise ValueError(Errors.E067.format(tags=tags[:i]))
             continue
         if tag.startswith('U'):
             entities.append((tag[2:], i, i))
@@ -47,7 +49,7 @@ def tags_to_entities(tags):
             entities.append((tag[2:], start, i))
             start = None
         else:
-            raise Exception(tag)
+            raise ValueError(Errors.E068.format(tag=tag))
     return entities
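To illustrate the BILUO check this hunk introduces (the tag sequences below are hypothetical):

# Minimal sketch, assuming spaCy with the patched gold module.
from spacy.gold import tags_to_entities

tags = ['O', 'B-PER', 'L-PER', 'U-ORG', 'O']
print(tags_to_entities(tags))        # [('PER', 1, 2), ('ORG', 3, 3)]

bad = ['O', 'I-PER', 'L-PER']        # 'I' tag with no preceding 'B'
tags_to_entities(bad)                # now raises ValueError("[E067] ...") instead of failing an assert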
@@ -225,7 +227,9 @@ class GoldCorpus(object):

     @classmethod
     def _make_golds(cls, docs, paragraph_tuples, make_projective):
-        assert len(docs) == len(paragraph_tuples)
+        if len(docs) != len(paragraph_tuples):
+            raise ValueError(Errors.E070.format(n_docs=len(docs),
+                                                n_annots=len(paragraph_tuples)))
         if len(docs) == 1:
             return [GoldParse.from_annot_tuples(docs[0],
                                                 paragraph_tuples[0][0],
@@ -525,7 +529,7 @@ cdef class GoldParse:

         cycle = nonproj.contains_cycle(self.heads)
         if cycle is not None:
-            raise Exception("Cycle found: %s" % cycle)
+            raise ValueError(Errors.E069.format(cycle=cycle))

     def __len__(self):
         """Get the number of gold-standard tokens.
@@ -8,6 +8,7 @@ from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .morph_rules import MORPH_RULES
 from ..tag_map import TAG_MAP
+from .lemmatizer import LOOKUP

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from ..norm_exceptions import BASE_NORMS
@@ -28,6 +29,7 @@ class DanishDefaults(Language.Defaults):
     suffixes = TOKENIZER_SUFFIXES
     tag_map = TAG_MAP
     stop_words = STOP_WORDS
+    lemma_lookup = LOOKUP


 class Danish(Language):

692415	spacy/lang/da/lemmatizer.py	Normal file
File diff suppressed because it is too large. Load Diff
|
@ -286069,7 +286069,6 @@ LOOKUP = {
|
||||||
"sonnolente": "sonnolento",
|
"sonnolente": "sonnolento",
|
||||||
"sonnolenti": "sonnolento",
|
"sonnolenti": "sonnolento",
|
||||||
"sonnolenze": "sonnolenza",
|
"sonnolenze": "sonnolenza",
|
||||||
"sono": "sonare",
|
|
||||||
"sonora": "sonoro",
|
"sonora": "sonoro",
|
||||||
"sonore": "sonoro",
|
"sonore": "sonoro",
|
||||||
"sonori": "sonoro",
|
"sonori": "sonoro",
|
||||||
|
@ -333681,6 +333680,7 @@ LOOKUP = {
|
||||||
"zurliniane": "zurliniano",
|
"zurliniane": "zurliniano",
|
||||||
"zurliniani": "zurliniano",
|
"zurliniani": "zurliniano",
|
||||||
"àncore": "àncora",
|
"àncore": "àncora",
|
||||||
|
"sono": "essere",
|
||||||
"è": "essere",
|
"è": "essere",
|
||||||
"èlites": "èlite",
|
"èlites": "èlite",
|
||||||
"ère": "èra",
|
"ère": "èra",
|
||||||
|
|
|
@ -190262,7 +190262,6 @@ LOOKUP = {
|
||||||
"gämserna": "gäms",
|
"gämserna": "gäms",
|
||||||
"gämsernas": "gäms",
|
"gämsernas": "gäms",
|
||||||
"gämsers": "gäms",
|
"gämsers": "gäms",
|
||||||
"gäng": "gänga",
|
|
||||||
"gängad": "gänga",
|
"gängad": "gänga",
|
||||||
"gängade": "gängad",
|
"gängade": "gängad",
|
||||||
"gängades": "gängad",
|
"gängades": "gängad",
|
||||||
|
@ -651423,7 +651422,6 @@ LOOKUP = {
|
||||||
"åpnasts": "åpen",
|
"åpnasts": "åpen",
|
||||||
"åpne": "åpen",
|
"åpne": "åpen",
|
||||||
"åpnes": "åpen",
|
"åpnes": "åpen",
|
||||||
"år": "åra",
|
|
||||||
"åran": "åra",
|
"åran": "åra",
|
||||||
"årans": "åra",
|
"årans": "åra",
|
||||||
"åras": "åra",
|
"åras": "åra",
|
||||||
|
|
|
@@ -1,19 +1,53 @@
 # coding: utf8
 from __future__ import unicode_literals

-from ...attrs import LANG
+from ...attrs import LANG, NORM
+from ..norm_exceptions import BASE_NORMS
 from ...language import Language
 from ...tokens import Doc
+from .stop_words import STOP_WORDS
+from ...util import update_exc, add_lookups
+from .lex_attrs import LEX_ATTRS
+#from ..tokenizer_exceptions import BASE_EXCEPTIONS
+#from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS


 class VietnameseDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters[LANG] = lambda text: 'vi'  # for pickling
+    # add more norm exception dictionaries here
+    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
+
+    # overwrite functions for lexical attributes
+    lex_attr_getters.update(LEX_ATTRS)
+
+    # merge base exceptions and custom tokenizer exceptions
+    #tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
+    stop_words = STOP_WORDS
+    use_pyvi = True


 class Vietnamese(Language):
     lang = 'vi'
     Defaults = VietnameseDefaults  # override defaults

+    def make_doc(self, text):
+        if self.Defaults.use_pyvi:
+            try:
+                from pyvi import ViTokenizer
+            except ImportError:
+                msg = ("Pyvi not installed. Either set Vietnamese.use_pyvi = False, "
+                       "or install it https://pypi.python.org/pypi/pyvi")
+                raise ImportError(msg)
+            words, spaces = ViTokenizer.spacy_tokenize(text)
+            return Doc(self.vocab, words=words, spaces=spaces)
+        else:
+            words = []
+            spaces = []
+            doc = self.tokenizer(text)
+            for token in self.tokenizer(text):
+                words.extend(list(token.text))
+                spaces.extend([False]*len(token.text))
+                spaces[-1] = bool(token.whitespace_)
+            return Doc(self.vocab, words=words, spaces=spaces)
+

 __all__ = ['Vietnamese']
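As a usage sketch (the example sentence is hypothetical and assumes the optional pyvi package is installed), the new make_doc override pre-segments the text before building the Doc:

# Minimal sketch of the Vietnamese tokenization fallback above.
from spacy.lang.vi import Vietnamese

nlp = Vietnamese()
doc = nlp.make_doc(u'Hà Nội là thủ đô của Việt Nam')   # hypothetical sentence
print([t.text for t in doc])
# With pyvi installed, ViTokenizer keeps multi-syllable words together;
# set Vietnamese.Defaults.use_pyvi = False to fall back to character-level tokens.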
26	spacy/lang/vi/lex_attrs.py	Normal file
@@ -0,0 +1,26 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...attrs import LIKE_NUM
+
+
+_num_words = ['không', 'một', 'hai', 'ba', 'bốn', 'năm', 'sáu', 'bẩy',
+              'tám', 'chín', 'mười', 'trăm', 'tỷ']
+
+
+def like_num(text):
+    text = text.replace(',', '').replace('.', '')
+    if text.isdigit():
+        return True
+    if text.count('/') == 1:
+        num, denom = text.split('/')
+        if num.isdigit() and denom.isdigit():
+            return True
+    if text.lower() in _num_words:
+        return True
+    return False
+
+
+LEX_ATTRS = {
+    LIKE_NUM: like_num
+}
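For example, the attribute getter added here accepts digits, simple fractions and the listed Vietnamese number words (a small hedged illustration):

# Minimal sketch of the like_num getter above.
from spacy.lang.vi.lex_attrs import like_num

print(like_num('1.234'))   # True - separators are stripped first
print(like_num('ba'))      # True - listed in _num_words
print(like_num('xin'))     # False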
1951	spacy/lang/vi/stop_words.py	Normal file
File diff suppressed because it is too large. Load Diff

36	spacy/lang/vi/tag_map.py	Normal file
@@ -0,0 +1,36 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
+from ...symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ
+
+
+# Add a tag map
+# Documentation: https://spacy.io/docs/usage/adding-languages#tag-map
+# Universal Dependencies: http://universaldependencies.org/u/pos/all.html
+# The keys of the tag map should be strings in your tag set. The dictionary must
+# have an entry POS whose value is one of the Universal Dependencies tags.
+# Optionally, you can also include morphological features or other attributes.
+
+
+TAG_MAP = {
+    "ADV": {POS: ADV},
+    "NOUN": {POS: NOUN},
+    "ADP": {POS: ADP},
+    "PRON": {POS: PRON},
+    "SCONJ": {POS: SCONJ},
+    "PROPN": {POS: PROPN},
+    "DET": {POS: DET},
+    "SYM": {POS: SYM},
+    "INTJ": {POS: INTJ},
+    "PUNCT": {POS: PUNCT},
+    "NUM": {POS: NUM},
+    "AUX": {POS: AUX},
+    "X": {POS: X},
+    "CONJ": {POS: CONJ},
+    "CCONJ": {POS: CCONJ},
+    "ADJ": {POS: ADJ},
+    "VERB": {POS: VERB},
+    "PART": {POS: PART},
+    "SP": {POS: SPACE}
+}
|
|
@ -28,6 +28,7 @@ from .lang.punctuation import TOKENIZER_INFIXES
|
||||||
from .lang.tokenizer_exceptions import TOKEN_MATCH
|
from .lang.tokenizer_exceptions import TOKEN_MATCH
|
||||||
from .lang.tag_map import TAG_MAP
|
from .lang.tag_map import TAG_MAP
|
||||||
from .lang.lex_attrs import LEX_ATTRS, is_stop
|
from .lang.lex_attrs import LEX_ATTRS, is_stop
|
||||||
|
from .errors import Errors
|
||||||
from . import util
|
from . import util
|
||||||
from . import about
|
from . import about
|
||||||
|
|
||||||
|
@ -112,7 +113,7 @@ class Language(object):
|
||||||
'merge_subtokens': lambda nlp, **cfg: merge_subtokens,
|
'merge_subtokens': lambda nlp, **cfg: merge_subtokens,
|
||||||
}
|
}
|
||||||
|
|
||||||
def __init__(self, vocab=True, make_doc=True, meta={}, **kwargs):
|
def __init__(self, vocab=True, make_doc=True, max_length=10**6, meta={}, **kwargs):
|
||||||
"""Initialise a Language object.
|
"""Initialise a Language object.
|
||||||
|
|
||||||
vocab (Vocab): A `Vocab` object. If `True`, a vocab is created via
|
vocab (Vocab): A `Vocab` object. If `True`, a vocab is created via
|
||||||
|
@ -127,6 +128,15 @@ class Language(object):
|
||||||
string occurs in both, the component is not loaded.
|
string occurs in both, the component is not loaded.
|
||||||
meta (dict): Custom meta data for the Language class. Is written to by
|
meta (dict): Custom meta data for the Language class. Is written to by
|
||||||
models to add model meta data.
|
models to add model meta data.
|
||||||
|
        max_length (int): Maximum number of characters in a single text. The
            current v2 models may run out of memory on extremely long texts,
            due to large internal allocations. You should segment these texts
            into meaningful units, e.g. paragraphs, subsections etc, before
            passing them to spaCy. The default maximum length is 1,000,000
            characters (1 MB). As a rule of thumb, if all pipeline components
            are enabled, spaCy's default models currently require roughly 1GB
            of temporary memory per 100,000 characters in one text.
|
||||||
RETURNS (Language): The newly constructed object.
|
RETURNS (Language): The newly constructed object.
|
||||||
"""
|
"""
|
||||||
self._meta = dict(meta)
|
self._meta = dict(meta)
|
||||||
|
@ -134,12 +144,15 @@ class Language(object):
|
||||||
if vocab is True:
|
if vocab is True:
|
||||||
factory = self.Defaults.create_vocab
|
factory = self.Defaults.create_vocab
|
||||||
vocab = factory(self, **meta.get('vocab', {}))
|
vocab = factory(self, **meta.get('vocab', {}))
|
||||||
|
if vocab.vectors.name is None:
|
||||||
|
vocab.vectors.name = meta.get('vectors', {}).get('name')
|
||||||
self.vocab = vocab
|
self.vocab = vocab
|
||||||
if make_doc is True:
|
if make_doc is True:
|
||||||
factory = self.Defaults.create_tokenizer
|
factory = self.Defaults.create_tokenizer
|
||||||
make_doc = factory(self, **meta.get('tokenizer', {}))
|
make_doc = factory(self, **meta.get('tokenizer', {}))
|
||||||
self.tokenizer = make_doc
|
self.tokenizer = make_doc
|
||||||
self.pipeline = []
|
self.pipeline = []
|
||||||
|
self.max_length = max_length
|
||||||
self._optimizer = None
|
self._optimizer = None
|
||||||
|
|
||||||
@property
|
@property
|
||||||
|
@ -159,7 +172,8 @@ class Language(object):
|
||||||
self._meta.setdefault('license', '')
|
self._meta.setdefault('license', '')
|
||||||
self._meta['vectors'] = {'width': self.vocab.vectors_length,
|
self._meta['vectors'] = {'width': self.vocab.vectors_length,
|
||||||
'vectors': len(self.vocab.vectors),
|
'vectors': len(self.vocab.vectors),
|
||||||
'keys': self.vocab.vectors.n_keys}
|
'keys': self.vocab.vectors.n_keys,
|
||||||
|
'name': self.vocab.vectors.name}
|
||||||
self._meta['pipeline'] = self.pipe_names
|
self._meta['pipeline'] = self.pipe_names
|
||||||
return self._meta
|
return self._meta
|
||||||
|
|
||||||
|
@ -205,8 +219,7 @@ class Language(object):
|
||||||
for pipe_name, component in self.pipeline:
|
for pipe_name, component in self.pipeline:
|
||||||
if pipe_name == name:
|
if pipe_name == name:
|
||||||
return component
|
return component
|
||||||
msg = "No component '{}' found in pipeline. Available names: {}"
|
raise KeyError(Errors.E001.format(name=name, opts=self.pipe_names))
|
||||||
raise KeyError(msg.format(name, self.pipe_names))
|
|
||||||
|
|
||||||
def create_pipe(self, name, config=dict()):
|
def create_pipe(self, name, config=dict()):
|
||||||
"""Create a pipeline component from a factory.
|
"""Create a pipeline component from a factory.
|
||||||
|
@ -216,7 +229,7 @@ class Language(object):
|
||||||
RETURNS (callable): Pipeline component.
|
RETURNS (callable): Pipeline component.
|
||||||
"""
|
"""
|
||||||
if name not in self.factories:
|
if name not in self.factories:
|
||||||
raise KeyError("Can't find factory for '{}'.".format(name))
|
raise KeyError(Errors.E002.format(name=name))
|
||||||
factory = self.factories[name]
|
factory = self.factories[name]
|
||||||
return factory(self, **config)
|
return factory(self, **config)
|
||||||
|
|
||||||
|
@ -241,12 +254,9 @@ class Language(object):
|
||||||
>>> nlp.add_pipe(component, name='custom_name', last=True)
|
>>> nlp.add_pipe(component, name='custom_name', last=True)
|
||||||
"""
|
"""
|
||||||
if not hasattr(component, '__call__'):
|
if not hasattr(component, '__call__'):
|
||||||
msg = ("Not a valid pipeline component. Expected callable, but "
|
msg = Errors.E003.format(component=repr(component), name=name)
|
||||||
"got {}. ".format(repr(component)))
|
|
||||||
if isinstance(component, basestring_) and component in self.factories:
|
if isinstance(component, basestring_) and component in self.factories:
|
||||||
msg += ("If you meant to add a built-in component, use "
|
msg += Errors.E004.format(component=component)
|
||||||
"create_pipe: nlp.add_pipe(nlp.create_pipe('{}'))"
|
|
||||||
.format(component))
|
|
||||||
raise ValueError(msg)
|
raise ValueError(msg)
|
||||||
if name is None:
|
if name is None:
|
||||||
if hasattr(component, 'name'):
|
if hasattr(component, 'name'):
|
||||||
|
@ -259,11 +269,9 @@ class Language(object):
|
||||||
else:
|
else:
|
||||||
name = repr(component)
|
name = repr(component)
|
||||||
if name in self.pipe_names:
|
if name in self.pipe_names:
|
||||||
raise ValueError("'{}' already exists in pipeline.".format(name))
|
raise ValueError(Errors.E007.format(name=name, opts=self.pipe_names))
|
||||||
if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2:
|
if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2:
|
||||||
msg = ("Invalid constraints. You can only set one of the "
|
raise ValueError(Errors.E006)
|
||||||
"following: before, after, first, last.")
|
|
||||||
raise ValueError(msg)
|
|
||||||
pipe = (name, component)
|
pipe = (name, component)
|
||||||
if last or not any([first, before, after]):
|
if last or not any([first, before, after]):
|
||||||
self.pipeline.append(pipe)
|
self.pipeline.append(pipe)
|
||||||
|
@ -274,9 +282,8 @@ class Language(object):
|
||||||
elif after and after in self.pipe_names:
|
elif after and after in self.pipe_names:
|
||||||
self.pipeline.insert(self.pipe_names.index(after) + 1, pipe)
|
self.pipeline.insert(self.pipe_names.index(after) + 1, pipe)
|
||||||
else:
|
else:
|
||||||
msg = "Can't find '{}' in pipeline. Available names: {}"
|
raise ValueError(Errors.E001.format(name=before or after,
|
||||||
unfound = before or after
|
opts=self.pipe_names))
|
||||||
raise ValueError(msg.format(unfound, self.pipe_names))
|
|
||||||
|
|
||||||
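As the reworked messages (E003/E004) suggest, built-in components are created by name and then added as callables; a brief hedged sketch:

# Minimal sketch of the create_pipe/add_pipe pattern referenced by E003 and E004.
import spacy

nlp = spacy.blank('en')
nlp.add_pipe(nlp.create_pipe('sentencizer'))   # built-in factory, instantiated by name
try:
    nlp.add_pipe('sentencizer')                # a plain string is not a callable component
except ValueError as err:
    print(err)                                 # starts with "[E003] Not a valid pipeline component..."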
def has_pipe(self, name):
|
def has_pipe(self, name):
|
||||||
"""Check if a component name is present in the pipeline. Equivalent to
|
"""Check if a component name is present in the pipeline. Equivalent to
|
||||||
|
@ -294,8 +301,7 @@ class Language(object):
|
||||||
component (callable): Pipeline component.
|
component (callable): Pipeline component.
|
||||||
"""
|
"""
|
||||||
if name not in self.pipe_names:
|
if name not in self.pipe_names:
|
||||||
msg = "Can't find '{}' in pipeline. Available names: {}"
|
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
|
||||||
raise ValueError(msg.format(name, self.pipe_names))
|
|
||||||
self.pipeline[self.pipe_names.index(name)] = (name, component)
|
self.pipeline[self.pipe_names.index(name)] = (name, component)
|
||||||
|
|
||||||
def rename_pipe(self, old_name, new_name):
|
def rename_pipe(self, old_name, new_name):
|
||||||
|
@ -305,11 +311,9 @@ class Language(object):
|
||||||
new_name (unicode): New name of the component.
|
new_name (unicode): New name of the component.
|
||||||
"""
|
"""
|
||||||
if old_name not in self.pipe_names:
|
if old_name not in self.pipe_names:
|
||||||
msg = "Can't find '{}' in pipeline. Available names: {}"
|
raise ValueError(Errors.E001.format(name=old_name, opts=self.pipe_names))
|
||||||
raise ValueError(msg.format(old_name, self.pipe_names))
|
|
||||||
if new_name in self.pipe_names:
|
if new_name in self.pipe_names:
|
||||||
msg = "'{}' already exists in pipeline. Existing names: {}"
|
raise ValueError(Errors.E007.format(name=new_name, opts=self.pipe_names))
|
||||||
raise ValueError(msg.format(new_name, self.pipe_names))
|
|
||||||
i = self.pipe_names.index(old_name)
|
i = self.pipe_names.index(old_name)
|
||||||
self.pipeline[i] = (new_name, self.pipeline[i][1])
|
self.pipeline[i] = (new_name, self.pipeline[i][1])
|
||||||
|
|
||||||
|
@ -320,8 +324,7 @@ class Language(object):
|
||||||
RETURNS (tuple): A `(name, component)` tuple of the removed component.
|
RETURNS (tuple): A `(name, component)` tuple of the removed component.
|
||||||
"""
|
"""
|
||||||
if name not in self.pipe_names:
|
if name not in self.pipe_names:
|
||||||
msg = "Can't find '{}' in pipeline. Available names: {}"
|
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
|
||||||
raise ValueError(msg.format(name, self.pipe_names))
|
|
||||||
return self.pipeline.pop(self.pipe_names.index(name))
|
return self.pipeline.pop(self.pipe_names.index(name))
|
||||||
|
|
||||||
def __call__(self, text, disable=[]):
|
def __call__(self, text, disable=[]):
|
||||||
|
@@ -338,11 +341,18 @@ class Language(object):
             >>> tokens[0].text, tokens[0].head.tag_
             ('An', 'NN')
         """
+        if len(text) >= self.max_length:
+            raise ValueError(Errors.E088.format(length=len(text),
+                                                max_length=self.max_length))
         doc = self.make_doc(text)
         for name, proc in self.pipeline:
             if name in disable:
                 continue
+            if not hasattr(proc, '__call__'):
+                raise ValueError(Errors.E003.format(component=type(proc), name=name))
             doc = proc(doc)
+            if doc is None:
+                raise ValueError(Errors.E005.format(name=name))
         return doc
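A short sketch of how the new length guard behaves (the model name and numbers are hypothetical):

# Minimal sketch, assuming a loaded v2 pipeline.
import spacy

nlp = spacy.load('en_core_web_sm')      # hypothetical model
text = 'word ' * 300000                 # 1,500,000 characters, above the default 10**6 limit
try:
    nlp(text)
except ValueError as err:
    print(err)                          # "[E088] Text of length 1500000 exceeds maximum of 1000000. ..."
# If the parser and NER are not needed, the limit can be raised explicitly:
nlp.max_length = 2 * 10**6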
|
||||||
|
|
||||||
def disable_pipes(self, *names):
|
def disable_pipes(self, *names):
|
||||||
|
@ -384,8 +394,7 @@ class Language(object):
|
||||||
>>> state = nlp.update(docs, golds, sgd=optimizer)
|
>>> state = nlp.update(docs, golds, sgd=optimizer)
|
||||||
"""
|
"""
|
||||||
if len(docs) != len(golds):
|
if len(docs) != len(golds):
|
||||||
raise IndexError("Update expects same number of docs and golds "
|
raise IndexError(Errors.E009.format(n_docs=len(docs), n_golds=len(golds)))
|
||||||
"Got: %d, %d" % (len(docs), len(golds)))
|
|
||||||
if len(docs) == 0:
|
if len(docs) == 0:
|
||||||
return
|
return
|
||||||
if sgd is None:
|
if sgd is None:
|
||||||
|
@ -458,6 +467,8 @@ class Language(object):
|
||||||
else:
|
else:
|
||||||
device = None
|
device = None
|
||||||
link_vectors_to_models(self.vocab)
|
link_vectors_to_models(self.vocab)
|
||||||
|
if self.vocab.vectors.data.shape[1]:
|
||||||
|
cfg['pretrained_vectors'] = self.vocab.vectors.name
|
||||||
if sgd is None:
|
if sgd is None:
|
||||||
sgd = create_default_optimizer(Model.ops)
|
sgd = create_default_optimizer(Model.ops)
|
||||||
self._optimizer = sgd
|
self._optimizer = sgd
|
||||||
|
@ -626,9 +637,10 @@ class Language(object):
|
||||||
"""
|
"""
|
||||||
path = util.ensure_path(path)
|
path = util.ensure_path(path)
|
||||||
deserializers = OrderedDict((
|
deserializers = OrderedDict((
|
||||||
('vocab', lambda p: self.vocab.from_disk(p)),
|
('meta.json', lambda p: self.meta.update(util.read_json(p))),
|
||||||
|
('vocab', lambda p: (
|
||||||
|
self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self))),
|
||||||
('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)),
|
('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)),
|
||||||
('meta.json', lambda p: self.meta.update(util.read_json(p)))
|
|
||||||
))
|
))
|
||||||
for name, proc in self.pipeline:
|
for name, proc in self.pipeline:
|
||||||
if name in disable:
|
if name in disable:
|
||||||
|
@ -671,9 +683,10 @@ class Language(object):
|
||||||
RETURNS (Language): The `Language` object.
|
RETURNS (Language): The `Language` object.
|
||||||
"""
|
"""
|
||||||
deserializers = OrderedDict((
|
deserializers = OrderedDict((
|
||||||
('vocab', lambda b: self.vocab.from_bytes(b)),
|
('meta', lambda b: self.meta.update(ujson.loads(b))),
|
||||||
|
('vocab', lambda b: (
|
||||||
|
self.vocab.from_bytes(b) and _fix_pretrained_vectors_name(self))),
|
||||||
('tokenizer', lambda b: self.tokenizer.from_bytes(b, vocab=False)),
|
('tokenizer', lambda b: self.tokenizer.from_bytes(b, vocab=False)),
|
||||||
('meta', lambda b: self.meta.update(ujson.loads(b)))
|
|
||||||
))
|
))
|
||||||
for i, (name, proc) in enumerate(self.pipeline):
|
for i, (name, proc) in enumerate(self.pipeline):
|
||||||
if name in disable:
|
if name in disable:
|
||||||
|
@ -685,6 +698,27 @@ class Language(object):
|
||||||
return self
|
return self
|
||||||
|
|
||||||
|
|
||||||
|
def _fix_pretrained_vectors_name(nlp):
|
||||||
|
# TODO: Replace this once we handle vectors consistently as static
|
||||||
|
# data
|
||||||
|
if 'vectors' in nlp.meta and nlp.meta['vectors'].get('name'):
|
||||||
|
nlp.vocab.vectors.name = nlp.meta['vectors']['name']
|
||||||
|
elif not nlp.vocab.vectors.size:
|
||||||
|
nlp.vocab.vectors.name = None
|
||||||
|
elif 'name' in nlp.meta and 'lang' in nlp.meta:
|
||||||
|
vectors_name = '%s_%s.vectors' % (nlp.meta['lang'], nlp.meta['name'])
|
||||||
|
nlp.vocab.vectors.name = vectors_name
|
||||||
|
else:
|
||||||
|
raise ValueError(Errors.E092)
|
||||||
|
if nlp.vocab.vectors.size != 0:
|
||||||
|
link_vectors_to_models(nlp.vocab)
|
||||||
|
for name, proc in nlp.pipeline:
|
||||||
|
if not hasattr(proc, 'cfg'):
|
||||||
|
continue
|
||||||
|
proc.cfg.setdefault('deprecation_fixes', {})
|
||||||
|
proc.cfg['deprecation_fixes']['vectors_name'] = nlp.vocab.vectors.name
|
||||||
|
|
||||||
|
|
||||||
class DisabledPipes(list):
|
class DisabledPipes(list):
|
||||||
"""Manager for temporary pipeline disabling."""
|
"""Manager for temporary pipeline disabling."""
|
||||||
def __init__(self, nlp, *names):
|
def __init__(self, nlp, *names):
|
||||||
|
@ -711,14 +745,7 @@ class DisabledPipes(list):
|
||||||
if unexpected:
|
if unexpected:
|
||||||
# Don't change the pipeline if we're raising an error.
|
# Don't change the pipeline if we're raising an error.
|
||||||
self.nlp.pipeline = current
|
self.nlp.pipeline = current
|
||||||
msg = (
|
raise ValueError(Errors.E008.format(names=unexpected))
|
||||||
"Some current components would be lost when restoring "
|
|
||||||
"previous pipeline state. If you added components after "
|
|
||||||
"calling nlp.disable_pipes(), you should remove them "
|
|
||||||
"explicitly with nlp.remove_pipe() before the pipeline is "
|
|
||||||
"restore. Names of the new components: %s"
|
|
||||||
)
|
|
||||||
raise ValueError(msg % unexpected)
|
|
||||||
self[:] = []
|
self[:] = []
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -15,7 +15,7 @@ from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP
|
||||||
from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT, IS_CURRENCY, IS_OOV
|
from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT, IS_CURRENCY, IS_OOV
|
||||||
from .attrs cimport PROB
|
from .attrs cimport PROB
|
||||||
from .attrs import intify_attrs
|
from .attrs import intify_attrs
|
||||||
from . import about
|
from .errors import Errors
|
||||||
|
|
||||||
|
|
||||||
memset(&EMPTY_LEXEME, 0, sizeof(LexemeC))
|
memset(&EMPTY_LEXEME, 0, sizeof(LexemeC))
|
||||||
|
@ -37,7 +37,8 @@ cdef class Lexeme:
|
||||||
self.vocab = vocab
|
self.vocab = vocab
|
||||||
self.orth = orth
|
self.orth = orth
|
||||||
self.c = <LexemeC*><void*>vocab.get_by_orth(vocab.mem, orth)
|
self.c = <LexemeC*><void*>vocab.get_by_orth(vocab.mem, orth)
|
||||||
assert self.c.orth == orth
|
if self.c.orth != orth:
|
||||||
|
raise ValueError(Errors.E071.format(orth=orth, vocab_orth=self.c.orth))
|
||||||
|
|
||||||
def __richcmp__(self, other, int op):
|
def __richcmp__(self, other, int op):
|
||||||
if other is None:
|
if other is None:
|
||||||
|
@ -129,20 +130,25 @@ cdef class Lexeme:
|
||||||
lex_data = Lexeme.c_to_bytes(self.c)
|
lex_data = Lexeme.c_to_bytes(self.c)
|
||||||
start = <const char*>&self.c.flags
|
start = <const char*>&self.c.flags
|
||||||
end = <const char*>&self.c.sentiment + sizeof(self.c.sentiment)
|
end = <const char*>&self.c.sentiment + sizeof(self.c.sentiment)
|
||||||
assert (end-start) == sizeof(lex_data.data), (end-start, sizeof(lex_data.data))
|
if (end-start) != sizeof(lex_data.data):
|
||||||
|
raise ValueError(Errors.E072.format(length=end-start,
|
||||||
|
bad_length=sizeof(lex_data.data)))
|
||||||
byte_string = b'\0' * sizeof(lex_data.data)
|
byte_string = b'\0' * sizeof(lex_data.data)
|
||||||
byte_chars = <char*>byte_string
|
byte_chars = <char*>byte_string
|
||||||
for i in range(sizeof(lex_data.data)):
|
for i in range(sizeof(lex_data.data)):
|
||||||
byte_chars[i] = lex_data.data[i]
|
byte_chars[i] = lex_data.data[i]
|
||||||
assert len(byte_string) == sizeof(lex_data.data), (len(byte_string),
|
if len(byte_string) != sizeof(lex_data.data):
|
||||||
sizeof(lex_data.data))
|
raise ValueError(Errors.E072.format(length=len(byte_string),
|
||||||
|
bad_length=sizeof(lex_data.data)))
|
||||||
return byte_string
|
return byte_string
|
||||||
|
|
||||||
def from_bytes(self, bytes byte_string):
|
def from_bytes(self, bytes byte_string):
|
||||||
# This method doesn't really have a use-case --- wrote it for testing.
|
# This method doesn't really have a use-case --- wrote it for testing.
|
||||||
# Possibly delete? It puts the Lexeme out of synch with the vocab.
|
# Possibly delete? It puts the Lexeme out of synch with the vocab.
|
||||||
cdef SerializedLexemeC lex_data
|
cdef SerializedLexemeC lex_data
|
||||||
assert len(byte_string) == sizeof(lex_data.data)
|
if len(byte_string) != sizeof(lex_data.data):
|
||||||
|
raise ValueError(Errors.E072.format(length=len(byte_string),
|
||||||
|
bad_length=sizeof(lex_data.data)))
|
||||||
for i in range(len(byte_string)):
|
for i in range(len(byte_string)):
|
||||||
lex_data.data[i] = byte_string[i]
|
lex_data.data[i] = byte_string[i]
|
||||||
Lexeme.c_from_bytes(self.c, lex_data)
|
Lexeme.c_from_bytes(self.c, lex_data)
|
||||||
|
@ -169,16 +175,13 @@ cdef class Lexeme:
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
cdef int length = self.vocab.vectors_length
|
cdef int length = self.vocab.vectors_length
|
||||||
if length == 0:
|
if length == 0:
|
||||||
raise ValueError(
|
raise ValueError(Errors.E010)
|
||||||
"Word vectors set to length 0. This may be because you "
|
|
||||||
"don't have a model installed or loaded, or because your "
|
|
||||||
"model doesn't include word vectors. For more info, see "
|
|
||||||
"the documentation: \n%s\n" % about.__docs_models__
|
|
||||||
)
|
|
||||||
return self.vocab.get_vector(self.c.orth)
|
return self.vocab.get_vector(self.c.orth)
|
||||||
|
|
||||||
def __set__(self, vector):
|
def __set__(self, vector):
|
||||||
assert len(vector) == self.vocab.vectors_length
|
if len(vector) != self.vocab.vectors_length:
|
||||||
|
raise ValueError(Errors.E073.format(new_length=len(vector),
|
||||||
|
length=self.vocab.vectors_length))
|
||||||
self.vocab.set_vector(self.c.orth, vector)
|
self.vocab.set_vector(self.c.orth, vector)
|
||||||
|
|
||||||
property rank:
|
property rank:
|
||||||
|
|
|
@ -13,6 +13,8 @@ from .vocab cimport Vocab
|
||||||
from .tokens.doc cimport Doc
|
from .tokens.doc cimport Doc
|
||||||
from .tokens.doc cimport get_token_attr
|
from .tokens.doc cimport get_token_attr
|
||||||
from .attrs cimport ID, attr_id_t, NULL_ATTR
|
from .attrs cimport ID, attr_id_t, NULL_ATTR
|
||||||
|
from .errors import Errors, TempErrors
|
||||||
|
|
||||||
from .attrs import IDS
|
from .attrs import IDS
|
||||||
from .attrs import FLAG61 as U_ENT
|
from .attrs import FLAG61 as U_ENT
|
||||||
from .attrs import FLAG60 as B2_ENT
|
from .attrs import FLAG60 as B2_ENT
|
||||||
|
@ -321,6 +323,8 @@ cdef attr_t get_pattern_key(const TokenPatternC* pattern) nogil:
|
||||||
while pattern.nr_attr != 0:
|
while pattern.nr_attr != 0:
|
||||||
pattern += 1
|
pattern += 1
|
||||||
id_attr = pattern[0].attrs[0]
|
id_attr = pattern[0].attrs[0]
|
||||||
|
if id_attr.attr != ID:
|
||||||
|
raise ValueError(Errors.E074.format(attr=ID, bad_attr=id_attr.attr))
|
||||||
return id_attr.value
|
return id_attr.value
|
||||||
|
|
||||||
def _convert_strings(token_specs, string_store):
|
def _convert_strings(token_specs, string_store):
|
||||||
|
@ -341,8 +345,8 @@ def _convert_strings(token_specs, string_store):
|
||||||
if value in operators:
|
if value in operators:
|
||||||
ops = operators[value]
|
ops = operators[value]
|
||||||
else:
|
else:
|
||||||
msg = "Unknown operator '%s'. Options: %s"
|
keys = ', '.join(operators.keys())
|
||||||
raise KeyError(msg % (value, ', '.join(operators.keys())))
|
raise KeyError(Errors.E011.format(op=value, opts=keys))
|
||||||
if isinstance(attr, basestring):
|
if isinstance(attr, basestring):
|
||||||
attr = IDS.get(attr.upper())
|
attr = IDS.get(attr.upper())
|
||||||
if isinstance(value, basestring):
|
if isinstance(value, basestring):
|
||||||
|
@ -429,9 +433,7 @@ cdef class Matcher:
|
||||||
"""
|
"""
|
||||||
for pattern in patterns:
|
for pattern in patterns:
|
||||||
if len(pattern) == 0:
|
if len(pattern) == 0:
|
||||||
msg = ("Cannot add pattern for zero tokens to matcher.\n"
|
raise ValueError(Errors.E012.format(key=key))
|
||||||
"key: {key}\n")
|
|
||||||
raise ValueError(msg.format(key=key))
|
|
||||||
key = self._normalize_key(key)
|
key = self._normalize_key(key)
|
||||||
for pattern in patterns:
|
for pattern in patterns:
|
||||||
specs = _convert_strings(pattern, self.vocab.strings)
|
specs = _convert_strings(pattern, self.vocab.strings)
|
||||||
|
|
|
@ -9,6 +9,7 @@ from .attrs import LEMMA, intify_attrs
|
||||||
from .parts_of_speech cimport SPACE
|
from .parts_of_speech cimport SPACE
|
||||||
from .parts_of_speech import IDS as POS_IDS
|
from .parts_of_speech import IDS as POS_IDS
|
||||||
from .lexeme cimport Lexeme
|
from .lexeme cimport Lexeme
|
||||||
|
from .errors import Errors
|
||||||
|
|
||||||
|
|
||||||
def _normalize_props(props):
|
def _normalize_props(props):
|
||||||
|
@ -93,7 +94,7 @@ cdef class Morphology:
|
||||||
|
|
||||||
cdef int assign_tag_id(self, TokenC* token, int tag_id) except -1:
|
cdef int assign_tag_id(self, TokenC* token, int tag_id) except -1:
|
||||||
if tag_id > self.n_tags:
|
if tag_id > self.n_tags:
|
||||||
raise ValueError("Unknown tag ID: %s" % tag_id)
|
raise ValueError(Errors.E014.format(tag=tag_id))
|
||||||
# TODO: It's pretty arbitrary to put this logic here. I guess the
|
# TODO: It's pretty arbitrary to put this logic here. I guess the
|
||||||
# justification is that this is where the specific word and the tag
|
# justification is that this is where the specific word and the tag
|
||||||
# interact. Still, we should have a better way to enforce this rule, or
|
# interact. Still, we should have a better way to enforce this rule, or
|
||||||
|
@ -147,9 +148,7 @@ cdef class Morphology:
|
||||||
elif force:
|
elif force:
|
||||||
memset(cached, 0, sizeof(cached[0]))
|
memset(cached, 0, sizeof(cached[0]))
|
||||||
else:
|
else:
|
||||||
raise ValueError(
|
raise ValueError(Errors.E015.format(tag=tag_str, orth=orth_str))
|
||||||
"Conflicting morphology exception for (%s, %s). Use "
|
|
||||||
"force=True to overwrite." % (tag_str, orth_str))
|
|
||||||
|
|
||||||
cached.tag = rich_tag
|
cached.tag = rich_tag
|
||||||
# TODO: Refactor this to take arbitrary attributes.
|
# TODO: Refactor this to take arbitrary attributes.
|
||||||
|
|
|
@ -8,7 +8,9 @@ cimport numpy as np
|
||||||
import cytoolz
|
import cytoolz
|
||||||
from collections import OrderedDict
|
from collections import OrderedDict
|
||||||
import ujson
|
import ujson
|
||||||
import msgpack
|
|
||||||
|
from .util import msgpack
|
||||||
|
from .util import msgpack_numpy
|
||||||
|
|
||||||
from thinc.api import chain
|
from thinc.api import chain
|
||||||
from thinc.v2v import Affine, SELU, Softmax
|
from thinc.v2v import Affine, SELU, Softmax
|
||||||
|
@ -32,6 +34,7 @@ from .parts_of_speech import X
|
||||||
from ._ml import Tok2Vec, build_text_classifier, build_tagger_model
|
from ._ml import Tok2Vec, build_text_classifier, build_tagger_model
|
||||||
from ._ml import link_vectors_to_models, zero_init, flatten
|
from ._ml import link_vectors_to_models, zero_init, flatten
|
||||||
from ._ml import create_default_optimizer
|
from ._ml import create_default_optimizer
|
||||||
|
from .errors import Errors, TempErrors
|
||||||
from . import util
|
from . import util
|
||||||
|
|
||||||
|
|
||||||
|
@ -77,7 +80,7 @@ def merge_noun_chunks(doc):
|
||||||
RETURNS (Doc): The Doc object with merged noun chunks.
|
RETURNS (Doc): The Doc object with merged noun chunks.
|
||||||
"""
|
"""
|
||||||
if not doc.is_parsed:
|
if not doc.is_parsed:
|
||||||
return
|
return doc
|
||||||
spans = [(np.start_char, np.end_char, np.root.tag, np.root.dep)
|
spans = [(np.start_char, np.end_char, np.root.tag, np.root.dep)
|
||||||
for np in doc.noun_chunks]
|
for np in doc.noun_chunks]
|
||||||
for start, end, tag, dep in spans:
|
for start, end, tag, dep in spans:
|
||||||
|
@ -214,8 +217,10 @@ class Pipe(object):
|
||||||
def from_bytes(self, bytes_data, **exclude):
|
def from_bytes(self, bytes_data, **exclude):
|
||||||
"""Load the pipe from a bytestring."""
|
"""Load the pipe from a bytestring."""
|
||||||
def load_model(b):
|
def load_model(b):
|
||||||
|
# TODO: Remove this once we don't have to handle previous models
|
||||||
|
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
|
||||||
|
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
|
||||||
if self.model is True:
|
if self.model is True:
|
||||||
self.cfg.setdefault('pretrained_dims', self.vocab.vectors_length)
|
|
||||||
self.model = self.Model(**self.cfg)
|
self.model = self.Model(**self.cfg)
|
||||||
self.model.from_bytes(b)
|
self.model.from_bytes(b)
|
||||||
|
|
||||||
|
@ -239,8 +244,10 @@ class Pipe(object):
|
||||||
def from_disk(self, path, **exclude):
|
def from_disk(self, path, **exclude):
|
||||||
"""Load the pipe from disk."""
|
"""Load the pipe from disk."""
|
||||||
def load_model(p):
|
def load_model(p):
|
||||||
|
# TODO: Remove this once we don't have to handle previous models
|
||||||
|
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
|
||||||
|
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
|
||||||
if self.model is True:
|
if self.model is True:
|
||||||
self.cfg.setdefault('pretrained_dims', self.vocab.vectors_length)
|
|
||||||
self.model = self.Model(**self.cfg)
|
self.model = self.Model(**self.cfg)
|
||||||
self.model.from_bytes(p.open('rb').read())
|
self.model.from_bytes(p.open('rb').read())
|
||||||
|
|
||||||
|
@ -298,7 +305,6 @@ class Tensorizer(Pipe):
|
||||||
self.model = model
|
self.model = model
|
||||||
self.input_models = []
|
self.input_models = []
|
||||||
self.cfg = dict(cfg)
|
self.cfg = dict(cfg)
|
||||||
self.cfg['pretrained_dims'] = self.vocab.vectors.data.shape[1]
|
|
||||||
self.cfg.setdefault('cnn_maxout_pieces', 3)
|
self.cfg.setdefault('cnn_maxout_pieces', 3)
|
||||||
|
|
||||||
def __call__(self, doc):
|
def __call__(self, doc):
|
||||||
|
@ -343,7 +349,8 @@ class Tensorizer(Pipe):
|
||||||
tensors (object): Vector representation for each token in the docs.
|
tensors (object): Vector representation for each token in the docs.
|
||||||
"""
|
"""
|
||||||
for doc, tensor in zip(docs, tensors):
|
for doc, tensor in zip(docs, tensors):
|
||||||
assert tensor.shape[0] == len(doc)
|
if tensor.shape[0] != len(doc):
|
||||||
|
raise ValueError(Errors.E076.format(rows=tensor.shape[0], words=len(doc)))
|
||||||
doc.tensor = tensor
|
doc.tensor = tensor
|
||||||
|
|
||||||
def update(self, docs, golds, state=None, drop=0., sgd=None, losses=None):
|
def update(self, docs, golds, state=None, drop=0., sgd=None, losses=None):
|
||||||
|
@ -415,8 +422,6 @@ class Tagger(Pipe):
|
||||||
self.model = model
|
self.model = model
|
||||||
self.cfg = OrderedDict(sorted(cfg.items()))
|
self.cfg = OrderedDict(sorted(cfg.items()))
|
||||||
self.cfg.setdefault('cnn_maxout_pieces', 2)
|
self.cfg.setdefault('cnn_maxout_pieces', 2)
|
||||||
self.cfg.setdefault('pretrained_dims',
|
|
||||||
self.vocab.vectors.data.shape[1])
|
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def labels(self):
|
def labels(self):
|
||||||
|
@ -477,7 +482,7 @@ class Tagger(Pipe):
|
||||||
doc.extend_tensor(tensors[i].get())
|
doc.extend_tensor(tensors[i].get())
|
||||||
else:
|
else:
|
||||||
doc.extend_tensor(tensors[i])
|
doc.extend_tensor(tensors[i])
|
||||||
doc.is_tagged = True
|
doc.is_tagged = True
|
||||||
|
|
||||||
def update(self, docs, golds, drop=0., sgd=None, losses=None):
|
def update(self, docs, golds, drop=0., sgd=None, losses=None):
|
||||||
if losses is not None and self.name not in losses:
|
if losses is not None and self.name not in losses:
|
||||||
|
@ -527,8 +532,8 @@ class Tagger(Pipe):
|
||||||
vocab.morphology = Morphology(vocab.strings, new_tag_map,
|
vocab.morphology = Morphology(vocab.strings, new_tag_map,
|
||||||
vocab.morphology.lemmatizer,
|
vocab.morphology.lemmatizer,
|
||||||
exc=vocab.morphology.exc)
|
exc=vocab.morphology.exc)
|
||||||
|
self.cfg['pretrained_vectors'] = kwargs.get('pretrained_vectors')
|
||||||
if self.model is True:
|
if self.model is True:
|
||||||
self.cfg['pretrained_dims'] = self.vocab.vectors.data.shape[1]
|
|
||||||
self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
|
self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
|
||||||
link_vectors_to_models(self.vocab)
|
link_vectors_to_models(self.vocab)
|
||||||
if sgd is None:
|
if sgd is None:
|
||||||
|
@@ -537,6 +542,8 @@ class Tagger(Pipe):

     @classmethod
     def Model(cls, n_tags, **cfg):
+        if cfg.get('pretrained_dims') and not cfg.get('pretrained_vectors'):
+            raise ValueError(TempErrors.T008)
         return build_tagger_model(n_tags, **cfg)

     def add_label(self, label, values=None):

@@ -552,9 +559,7 @@ class Tagger(Pipe):
             # copy_array(larger.W[:smaller.nO], smaller.W)
             # copy_array(larger.b[:smaller.nO], smaller.b)
             # self.model._layers[-1] = larger
-            raise ValueError(
-                "Resizing pre-trained Tagger models is not "
-                "currently supported.")
+            raise ValueError(TempErrors.T003)
         tag_map = dict(self.vocab.morphology.tag_map)
         if values is None:
             values = {POS: "X"}

@ -584,6 +589,10 @@ class Tagger(Pipe):
|
||||||
|
|
||||||
def from_bytes(self, bytes_data, **exclude):
|
def from_bytes(self, bytes_data, **exclude):
|
||||||
def load_model(b):
|
def load_model(b):
|
||||||
|
# TODO: Remove this once we don't have to handle previous models
|
||||||
|
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
|
||||||
|
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
|
||||||
|
|
||||||
if self.model is True:
|
if self.model is True:
|
||||||
token_vector_width = util.env_opt(
|
token_vector_width = util.env_opt(
|
||||||
'token_vector_width',
|
'token_vector_width',
|
||||||
|
@ -609,7 +618,6 @@ class Tagger(Pipe):
|
||||||
return self
|
return self
|
||||||
|
|
||||||
def to_disk(self, path, **exclude):
|
def to_disk(self, path, **exclude):
|
||||||
self.cfg.setdefault('pretrained_dims', self.vocab.vectors.data.shape[1])
|
|
||||||
tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
|
tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
|
||||||
serialize = OrderedDict((
|
serialize = OrderedDict((
|
||||||
('vocab', lambda p: self.vocab.to_disk(p)),
|
('vocab', lambda p: self.vocab.to_disk(p)),
|
||||||
|
@ -622,6 +630,9 @@ class Tagger(Pipe):
|
||||||
|
|
||||||
def from_disk(self, path, **exclude):
|
def from_disk(self, path, **exclude):
|
||||||
def load_model(p):
|
def load_model(p):
|
||||||
|
# TODO: Remove this once we don't have to handle previous models
|
||||||
|
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
|
||||||
|
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
|
||||||
if self.model is True:
|
if self.model is True:
|
||||||
self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
|
self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
|
||||||
with p.open('rb') as file_:
|
with p.open('rb') as file_:
|
||||||
|
@ -669,12 +680,9 @@ class MultitaskObjective(Tagger):
|
||||||
elif hasattr(target, '__call__'):
|
elif hasattr(target, '__call__'):
|
||||||
self.make_label = target
|
self.make_label = target
|
||||||
else:
|
else:
|
||||||
raise ValueError("MultitaskObjective target should be function or "
|
raise ValueError(Errors.E016)
|
||||||
"one of: dep, tag, ent, sent_start, dep_tag_offset, ent_tag.")
|
|
||||||
self.cfg = dict(cfg)
|
self.cfg = dict(cfg)
|
||||||
self.cfg.setdefault('cnn_maxout_pieces', 2)
|
self.cfg.setdefault('cnn_maxout_pieces', 2)
|
||||||
self.cfg.setdefault('pretrained_dims',
|
|
||||||
self.vocab.vectors.data.shape[1])
|
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def labels(self):
|
def labels(self):
|
||||||
|
@@ -723,7 +731,9 @@ class MultitaskObjective(Tagger):
         return tokvecs, scores

     def get_loss(self, docs, golds, scores):
-        assert len(docs) == len(golds)
+        if len(docs) != len(golds):
+            raise ValueError(Errors.E077.format(value='loss', n_docs=len(docs),
+                                                n_golds=len(golds)))
         cdef int idx = 0
         correct = numpy.zeros((scores.shape[0],), dtype='i')
         guesses = scores.argmax(axis=1)

@@ -962,16 +972,17 @@ class TextCategorizer(Pipe):
         self.labels.append(label)
         return 1

-    def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None):
+    def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None,
+                       **kwargs):
         if pipeline and getattr(pipeline[0], 'name', None) == 'tensorizer':
             token_vector_width = pipeline[0].model.nO
         else:
             token_vector_width = 64
         if self.model is True:
-            self.cfg['pretrained_dims'] = self.vocab.vectors_length
-            self.cfg['nr_class'] = len(self.labels)
-            self.cfg['width'] = token_vector_width
-            self.model = self.Model(**self.cfg)
+            self.cfg['pretrained_vectors'] = kwargs.get('pretrained_vectors')
+            self.model = self.Model(len(self.labels), token_vector_width,
+                                    **self.cfg)
             link_vectors_to_models(self.vocab)
         if sgd is None:
             sgd = self.create_optimizer()

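Note: begin_training() now accepts **kwargs so a pretrained_vectors name can be forwarded into the model config, instead of copying vector widths into cfg. A hedged usage sketch; a blank pipeline is assumed, so the vectors name is simply None here:

    import spacy

    nlp = spacy.blank('en')
    textcat = nlp.create_pipe('textcat')
    textcat.add_label('POSITIVE')
    # Forwarded via **kwargs into self.cfg['pretrained_vectors'] before Model() is built.
    textcat.begin_training(pretrained_vectors=nlp.vocab.vectors.name)
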
@ -2,6 +2,7 @@
|
||||||
from __future__ import division, print_function, unicode_literals
|
from __future__ import division, print_function, unicode_literals
|
||||||
|
|
||||||
from .gold import tags_to_entities, GoldParse
|
from .gold import tags_to_entities, GoldParse
|
||||||
|
from .errors import Errors
|
||||||
|
|
||||||
|
|
||||||
class PRFScore(object):
|
class PRFScore(object):
|
||||||
|
@@ -85,8 +86,7 @@ class Scorer(object):

     def score(self, tokens, gold, verbose=False, punct_labels=('p', 'punct')):
         if len(tokens) != len(gold):
-            gold = GoldParse.from_annot_tuples(tokens, zip(*gold.orig_annot))
-        assert len(tokens) == len(gold)
+            raise ValueError(Errors.E078.format(words_doc=len(tokens), words_gold=len(gold)))
         gold_deps = set()
         gold_tags = set()
         gold_ents = set(tags_to_entities([annot[-1]

@ -13,6 +13,7 @@ from .symbols import IDS as SYMBOLS_BY_STR
|
||||||
from .symbols import NAMES as SYMBOLS_BY_INT
|
from .symbols import NAMES as SYMBOLS_BY_INT
|
||||||
from .typedefs cimport hash_t
|
from .typedefs cimport hash_t
|
||||||
from .compat import json_dumps
|
from .compat import json_dumps
|
||||||
|
from .errors import Errors
|
||||||
from . import util
|
from . import util
|
||||||
|
|
||||||
|
|
||||||
|
@ -59,7 +60,6 @@ cdef Utf8Str* _allocate(Pool mem, const unsigned char* chars, uint32_t length) e
|
||||||
string.p = <unsigned char*>mem.alloc(length + 1, sizeof(unsigned char))
|
string.p = <unsigned char*>mem.alloc(length + 1, sizeof(unsigned char))
|
||||||
string.p[0] = length
|
string.p[0] = length
|
||||||
memcpy(&string.p[1], chars, length)
|
memcpy(&string.p[1], chars, length)
|
||||||
assert string.s[0] >= sizeof(string.s) or string.s[0] == 0, string.s[0]
|
|
||||||
return string
|
return string
|
||||||
else:
|
else:
|
||||||
i = 0
|
i = 0
|
||||||
|
@ -69,7 +69,6 @@ cdef Utf8Str* _allocate(Pool mem, const unsigned char* chars, uint32_t length) e
|
||||||
string.p[i] = 255
|
string.p[i] = 255
|
||||||
string.p[n_length_bytes-1] = length % 255
|
string.p[n_length_bytes-1] = length % 255
|
||||||
memcpy(&string.p[n_length_bytes], chars, length)
|
memcpy(&string.p[n_length_bytes], chars, length)
|
||||||
assert string.s[0] >= sizeof(string.s) or string.s[0] == 0, string.s[0]
|
|
||||||
return string
|
return string
|
||||||
|
|
||||||
|
|
||||||
|
@@ -115,7 +114,7 @@ cdef class StringStore:
             self.hits.insert(key)
             utf8str = <Utf8Str*>self._map.get(key)
             if utf8str is NULL:
-                raise KeyError(string_or_id)
+                raise KeyError(Errors.E018.format(hash_value=string_or_id))
             else:
                 return decode_Utf8Str(utf8str)

@@ -136,8 +135,7 @@ cdef class StringStore:
             key = hash_utf8(string, len(string))
             self._intern_utf8(string, len(string))
         else:
-            raise TypeError(
-                "Can only add unicode or bytes. Got type: %s" % type(string))
+            raise TypeError(Errors.E017.format(value_type=type(string)))
         return key

     def __len__(self):

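Note: a short sketch of the StringStore behaviour these two hunks target; the strings and the unknown hash are illustrative:

    from spacy.strings import StringStore

    stringstore = StringStore([u'apple'])
    apple_hash = stringstore[u'apple']      # looking up a known string returns its hash
    try:
        stringstore[123456789012]           # an unknown hash now raises the formatted E018
    except KeyError as err:
        print(err)
    try:
        stringstore.add(3.14)               # non-text input now raises the formatted E017
    except TypeError as err:
        print(err)
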
@ -10,6 +10,7 @@ from thinc.extra.search cimport MaxViolation
|
||||||
|
|
||||||
from .transition_system cimport TransitionSystem, Transition
|
from .transition_system cimport TransitionSystem, Transition
|
||||||
from ..gold cimport GoldParse
|
from ..gold cimport GoldParse
|
||||||
|
from ..errors import Errors
|
||||||
from .stateclass cimport StateC, StateClass
|
from .stateclass cimport StateC, StateClass
|
||||||
|
|
||||||
|
|
||||||
|
@ -220,7 +221,8 @@ def get_states(pbeams, gbeams, beam_map, nr_update):
|
||||||
p_indices = []
|
p_indices = []
|
||||||
g_indices = []
|
g_indices = []
|
||||||
cdef Beam pbeam, gbeam
|
cdef Beam pbeam, gbeam
|
||||||
assert len(pbeams) == len(gbeams)
|
if len(pbeams) != len(gbeams):
|
||||||
|
raise ValueError(Errors.E079.format(pbeams=len(pbeams), gbeams=len(gbeams)))
|
||||||
for eg_id, (pbeam, gbeam) in enumerate(zip(pbeams, gbeams)):
|
for eg_id, (pbeam, gbeam) in enumerate(zip(pbeams, gbeams)):
|
||||||
p_indices.append([])
|
p_indices.append([])
|
||||||
g_indices.append([])
|
g_indices.append([])
|
||||||
|
@ -228,7 +230,8 @@ def get_states(pbeams, gbeams, beam_map, nr_update):
|
||||||
state = StateClass.borrow(<StateC*>pbeam.at(i))
|
state = StateClass.borrow(<StateC*>pbeam.at(i))
|
||||||
if not state.is_final():
|
if not state.is_final():
|
||||||
key = tuple([eg_id] + pbeam.histories[i])
|
key = tuple([eg_id] + pbeam.histories[i])
|
||||||
assert key not in seen, (key, seen)
|
if key in seen:
|
||||||
|
raise ValueError(Errors.E080.format(key=key))
|
||||||
seen[key] = len(states)
|
seen[key] = len(states)
|
||||||
p_indices[-1].append(len(states))
|
p_indices[-1].append(len(states))
|
||||||
states.append(state)
|
states.append(state)
|
||||||
|
@ -271,7 +274,8 @@ def get_gradient(nr_class, beam_maps, histories, losses):
|
||||||
for i in range(nr_step):
|
for i in range(nr_step):
|
||||||
grads.append(numpy.zeros((max(beam_maps[i].values())+1, nr_class),
|
grads.append(numpy.zeros((max(beam_maps[i].values())+1, nr_class),
|
||||||
dtype='f'))
|
dtype='f'))
|
||||||
assert len(histories) == len(losses)
|
if len(histories) != len(losses):
|
||||||
|
raise ValueError(Errors.E081.format(n_hist=len(histories), losses=len(losses)))
|
||||||
for eg_id, hists in enumerate(histories):
|
for eg_id, hists in enumerate(histories):
|
||||||
for loss, hist in zip(losses[eg_id], hists):
|
for loss, hist in zip(losses[eg_id], hists):
|
||||||
if loss == 0.0 or numpy.isnan(loss):
|
if loss == 0.0 or numpy.isnan(loss):
|
||||||
|
|
|
@ -16,6 +16,7 @@ from . import nonproj
|
||||||
from .transition_system cimport move_cost_func_t, label_cost_func_t
|
from .transition_system cimport move_cost_func_t, label_cost_func_t
|
||||||
from ..gold cimport GoldParse, GoldParseC
|
from ..gold cimport GoldParse, GoldParseC
|
||||||
from ..structs cimport TokenC
|
from ..structs cimport TokenC
|
||||||
|
from ..errors import Errors
|
||||||
|
|
||||||
# Calculate cost as gold/not gold. We don't use scalar value anyway.
|
# Calculate cost as gold/not gold. We don't use scalar value anyway.
|
||||||
cdef int BINARY_COSTS = 1
|
cdef int BINARY_COSTS = 1
|
||||||
|
@ -484,7 +485,7 @@ cdef class ArcEager(TransitionSystem):
|
||||||
t.do = Break.transition
|
t.do = Break.transition
|
||||||
t.get_cost = Break.cost
|
t.get_cost = Break.cost
|
||||||
else:
|
else:
|
||||||
raise Exception(move)
|
raise ValueError(Errors.E019.format(action=move, src='arc_eager'))
|
||||||
return t
|
return t
|
||||||
|
|
||||||
cdef int initialize_state(self, StateC* st) nogil:
|
cdef int initialize_state(self, StateC* st) nogil:
|
||||||
|
@@ -556,35 +557,13 @@ cdef class ArcEager(TransitionSystem):
                 is_valid[i] = False
                 costs[i] = 9000
         if n_gold < 1:
-            # Check label set --- leading cause
-            label_set = set([self.strings[self.c[i].label] for i in range(self.n_moves)])
-            for label_str in gold.labels:
-                if label_str is not None and label_str not in label_set:
-                    raise ValueError("Cannot get gold parser action: unknown label: %s" % label_str)
-            # Check projectivity --- other leading cause
-            if nonproj.is_nonproj_tree(gold.heads):
-                raise ValueError(
-                    "Could not find a gold-standard action to supervise the "
-                    "dependency parser. Likely cause: the tree is "
-                    "non-projective (i.e. it has crossing arcs -- see "
-                    "spacy/syntax/nonproj.pyx for definitions). The ArcEager "
-                    "transition system only supports projective trees. To "
-                    "learn non-projective representations, transform the data "
-                    "before training and after parsing. Either pass "
-                    "make_projective=True to the GoldParse class, or use "
-                    "spacy.syntax.nonproj.preprocess_training_data.")
+            # Check projectivity --- leading cause
+            if is_nonproj_tree(gold.heads):
+                raise ValueError(Errors.E020)
             else:
-                print(gold.orig_annot)
-                print(gold.words)
-                print(gold.heads)
-                print(gold.labels)
-                print(gold.sent_starts)
-                raise ValueError(
-                    "Could not find a gold-standard action to supervise the"
-                    "dependency parser. The GoldParse was projective. The "
-                    "transition system has %d actions. State at failure: %s"
-                    % (self.n_moves, stcls.print_state(gold.words)))
-        assert n_gold >= 1
+                failure_state = stcls.print_state(gold.words)
+                raise ValueError(Errors.E021.format(n_actions=self.n_moves,
+                                                    state=failure_state))

     def get_beam_annot(self, Beam beam):
         length = (<StateC*>beam.at(0)).length

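Note: E020 fires when the gold tree is non-projective, which the removed message used to explain at length (pass make_projective=True to GoldParse, or preprocess the training data). A small sketch of the projectivity check itself; the head arrays below are invented:

    from spacy.syntax import nonproj

    # Heads are absolute token indices; a token whose head is itself is the root.
    projective_heads = [1, 1, 1, 2]
    crossing_heads = [1, 1, 4, 1, 1]      # the arc 4->2 crosses the arc 1->3
    assert not nonproj.is_nonproj_tree(projective_heads)
    assert nonproj.is_nonproj_tree(crossing_heads)
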
@ -10,6 +10,7 @@ from ._state cimport StateC
|
||||||
from .transition_system cimport Transition
|
from .transition_system cimport Transition
|
||||||
from .transition_system cimport do_func_t
|
from .transition_system cimport do_func_t
|
||||||
from ..gold cimport GoldParseC, GoldParse
|
from ..gold cimport GoldParseC, GoldParse
|
||||||
|
from ..errors import Errors
|
||||||
|
|
||||||
|
|
||||||
cdef enum:
|
cdef enum:
|
||||||
|
@ -81,9 +82,7 @@ cdef class BiluoPushDown(TransitionSystem):
|
||||||
for (ids, words, tags, heads, labels, biluo), _ in sents:
|
for (ids, words, tags, heads, labels, biluo), _ in sents:
|
||||||
for i, ner_tag in enumerate(biluo):
|
for i, ner_tag in enumerate(biluo):
|
||||||
if ner_tag != 'O' and ner_tag != '-':
|
if ner_tag != 'O' and ner_tag != '-':
|
||||||
if ner_tag.count('-') != 1:
|
_, label = ner_tag.split('-', 1)
|
||||||
raise ValueError(ner_tag)
|
|
||||||
_, label = ner_tag.split('-')
|
|
||||||
for action in (BEGIN, IN, LAST, UNIT):
|
for action in (BEGIN, IN, LAST, UNIT):
|
||||||
actions[action][label] += 1
|
actions[action][label] += 1
|
||||||
return actions
|
return actions
|
||||||
|
@ -170,7 +169,7 @@ cdef class BiluoPushDown(TransitionSystem):
|
||||||
if self.c[i].move == move and self.c[i].label == label:
|
if self.c[i].move == move and self.c[i].label == label:
|
||||||
return self.c[i]
|
return self.c[i]
|
||||||
else:
|
else:
|
||||||
raise KeyError(name)
|
raise KeyError(Errors.E022.format(name=name))
|
||||||
|
|
||||||
cdef Transition init_transition(self, int clas, int move, attr_t label) except *:
|
cdef Transition init_transition(self, int clas, int move, attr_t label) except *:
|
||||||
# TODO: Apparent Cython bug here when we try to use the Transition()
|
# TODO: Apparent Cython bug here when we try to use the Transition()
|
||||||
|
@ -205,7 +204,7 @@ cdef class BiluoPushDown(TransitionSystem):
|
||||||
t.do = Out.transition
|
t.do = Out.transition
|
||||||
t.get_cost = Out.cost
|
t.get_cost = Out.cost
|
||||||
else:
|
else:
|
||||||
raise Exception(move)
|
raise ValueError(Errors.E019.format(action=move, src='ner'))
|
||||||
return t
|
return t
|
||||||
|
|
||||||
def add_action(self, int action, label_name, freq=None):
|
def add_action(self, int action, label_name, freq=None):
|
||||||
|
@ -227,7 +226,6 @@ cdef class BiluoPushDown(TransitionSystem):
|
||||||
self._size *= 2
|
self._size *= 2
|
||||||
self.c = <Transition*>self.mem.realloc(self.c, self._size * sizeof(self.c[0]))
|
self.c = <Transition*>self.mem.realloc(self.c, self._size * sizeof(self.c[0]))
|
||||||
self.c[self.n_moves] = self.init_transition(self.n_moves, action, label_id)
|
self.c[self.n_moves] = self.init_transition(self.n_moves, action, label_id)
|
||||||
assert self.c[self.n_moves].label == label_id
|
|
||||||
self.n_moves += 1
|
self.n_moves += 1
|
||||||
if self.labels.get(action, []):
|
if self.labels.get(action, []):
|
||||||
freq = min(0, min(self.labels[action].values()))
|
freq = min(0, min(self.labels[action].values()))
|
||||||
|
|
|
@ -35,6 +35,7 @@ from .._ml import link_vectors_to_models, create_default_optimizer
|
||||||
from ..compat import json_dumps, copy_array
|
from ..compat import json_dumps, copy_array
|
||||||
from ..tokens.doc cimport Doc
|
from ..tokens.doc cimport Doc
|
||||||
from ..gold cimport GoldParse
|
from ..gold cimport GoldParse
|
||||||
|
from ..errors import Errors, TempErrors
|
||||||
from .. import util
|
from .. import util
|
||||||
from .stateclass cimport StateClass
|
from .stateclass cimport StateClass
|
||||||
from ._state cimport StateC
|
from ._state cimport StateC
|
||||||
|
@ -244,7 +245,7 @@ cdef class Parser:
|
||||||
def Model(cls, nr_class, **cfg):
|
def Model(cls, nr_class, **cfg):
|
||||||
depth = util.env_opt('parser_hidden_depth', cfg.get('hidden_depth', 1))
|
depth = util.env_opt('parser_hidden_depth', cfg.get('hidden_depth', 1))
|
||||||
if depth != 1:
|
if depth != 1:
|
||||||
raise ValueError("Currently parser depth is hard-coded to 1.")
|
raise ValueError(TempErrors.T004.format(value=depth))
|
||||||
parser_maxout_pieces = util.env_opt('parser_maxout_pieces',
|
parser_maxout_pieces = util.env_opt('parser_maxout_pieces',
|
||||||
cfg.get('maxout_pieces', 2))
|
cfg.get('maxout_pieces', 2))
|
||||||
token_vector_width = util.env_opt('token_vector_width',
|
token_vector_width = util.env_opt('token_vector_width',
|
||||||
|
@@ -254,11 +255,12 @@ cdef class Parser:
         hist_size = util.env_opt('history_feats', cfg.get('hist_size', 0))
         hist_width = util.env_opt('history_width', cfg.get('hist_width', 0))
         if hist_size != 0:
-            raise ValueError("Currently history size is hard-coded to 0")
+            raise ValueError(TempErrors.T005.format(value=hist_size))
         if hist_width != 0:
-            raise ValueError("Currently history width is hard-coded to 0")
+            raise ValueError(TempErrors.T006.format(value=hist_width))
+        pretrained_vectors = cfg.get('pretrained_vectors', None)
         tok2vec = Tok2Vec(token_vector_width, embed_size,
-                          pretrained_dims=cfg.get('pretrained_dims', 0))
+                          pretrained_vectors=pretrained_vectors)
         tok2vec = chain(tok2vec, flatten)
         lower = PrecomputableAffine(hidden_width,
                                     nF=cls.nr_feature, nI=token_vector_width,

@ -277,6 +279,7 @@ cdef class Parser:
|
||||||
'token_vector_width': token_vector_width,
|
'token_vector_width': token_vector_width,
|
||||||
'hidden_width': hidden_width,
|
'hidden_width': hidden_width,
|
||||||
'maxout_pieces': parser_maxout_pieces,
|
'maxout_pieces': parser_maxout_pieces,
|
||||||
|
'pretrained_vectors': pretrained_vectors,
|
||||||
'hist_size': hist_size,
|
'hist_size': hist_size,
|
||||||
'hist_width': hist_width
|
'hist_width': hist_width
|
||||||
}
|
}
|
||||||
|
@ -296,9 +299,9 @@ cdef class Parser:
|
||||||
unless True (default), in which case a new instance is created with
|
unless True (default), in which case a new instance is created with
|
||||||
`Parser.Moves()`.
|
`Parser.Moves()`.
|
||||||
model (object): Defines how the parse-state is created, updated and
|
model (object): Defines how the parse-state is created, updated and
|
||||||
evaluated. The value is set to the .model attribute unless True
|
evaluated. The value is set to the .model attribute. If set to True
|
||||||
(default), in which case a new instance is created with
|
(default), a new instance will be created with `Parser.Model()`
|
||||||
`Parser.Model()`.
|
in parser.begin_training(), parser.from_disk() or parser.from_bytes().
|
||||||
**cfg: Arbitrary configuration parameters. Set to the `.cfg` attribute
|
**cfg: Arbitrary configuration parameters. Set to the `.cfg` attribute
|
||||||
"""
|
"""
|
||||||
self.vocab = vocab
|
self.vocab = vocab
|
||||||
|
@ -310,8 +313,7 @@ cdef class Parser:
|
||||||
cfg['beam_width'] = util.env_opt('beam_width', 1)
|
cfg['beam_width'] = util.env_opt('beam_width', 1)
|
||||||
if 'beam_density' not in cfg:
|
if 'beam_density' not in cfg:
|
||||||
cfg['beam_density'] = util.env_opt('beam_density', 0.0)
|
cfg['beam_density'] = util.env_opt('beam_density', 0.0)
|
||||||
if 'pretrained_dims' not in cfg:
|
cfg.setdefault('cnn_maxout_pieces', 3)
|
||||||
cfg['pretrained_dims'] = self.vocab.vectors.data.shape[1]
|
|
||||||
self.cfg = cfg
|
self.cfg = cfg
|
||||||
self.model = model
|
self.model = model
|
||||||
self._multitasks = []
|
self._multitasks = []
|
||||||
|
@@ -551,8 +553,13 @@ cdef class Parser:
     def update(self, docs, golds, drop=0., sgd=None, losses=None):
         if not any(self.moves.has_gold(gold) for gold in golds):
             return None
-        assert len(docs) == len(golds)
-        if self.cfg.get('beam_width', 1) >= 2 and numpy.random.random() >= 0.0:
+        if len(docs) != len(golds):
+            raise ValueError(Errors.E077.format(value='update', n_docs=len(docs),
+                                                n_golds=len(golds)))
+        # The probability we use beam update, instead of falling back to
+        # a greedy update
+        beam_update_prob = 1-self.cfg.get('beam_update_prob', 0.5)
+        if self.cfg.get('beam_width', 1) >= 2 and numpy.random.random() >= beam_update_prob:
             return self.update_beam(docs, golds,
                 self.cfg['beam_width'], self.cfg['beam_density'],
                 drop=drop, sgd=sgd, losses=losses)

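Note: a hedged sketch of the gating introduced above; with the assumed default beam_update_prob of 0.5, roughly half of the update calls take the beam path and the rest fall back to a greedy update.

    import numpy

    cfg = {'beam_width': 4, 'beam_update_prob': 0.5}   # illustrative values
    beam_update_prob = 1 - cfg.get('beam_update_prob', 0.5)
    use_beam = (cfg.get('beam_width', 1) >= 2
                and numpy.random.random() >= beam_update_prob)
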
@ -634,7 +641,6 @@ cdef class Parser:
|
||||||
if losses is not None and self.name not in losses:
|
if losses is not None and self.name not in losses:
|
||||||
losses[self.name] = 0.
|
losses[self.name] = 0.
|
||||||
lengths = [len(d) for d in docs]
|
lengths = [len(d) for d in docs]
|
||||||
assert min(lengths) >= 1
|
|
||||||
states = self.moves.init_batch(docs)
|
states = self.moves.init_batch(docs)
|
||||||
for gold in golds:
|
for gold in golds:
|
||||||
self.moves.preprocess_gold(gold)
|
self.moves.preprocess_gold(gold)
|
||||||
|
@ -846,7 +852,6 @@ cdef class Parser:
|
||||||
self.moves.initialize_actions(actions)
|
self.moves.initialize_actions(actions)
|
||||||
cfg.setdefault('token_vector_width', 128)
|
cfg.setdefault('token_vector_width', 128)
|
||||||
if self.model is True:
|
if self.model is True:
|
||||||
cfg['pretrained_dims'] = self.vocab.vectors_length
|
|
||||||
self.model, cfg = self.Model(self.moves.n_moves, **cfg)
|
self.model, cfg = self.Model(self.moves.n_moves, **cfg)
|
||||||
if sgd is None:
|
if sgd is None:
|
||||||
sgd = self.create_optimizer()
|
sgd = self.create_optimizer()
|
||||||
|
@ -910,9 +915,11 @@ cdef class Parser:
|
||||||
}
|
}
|
||||||
util.from_disk(path, deserializers, exclude)
|
util.from_disk(path, deserializers, exclude)
|
||||||
if 'model' not in exclude:
|
if 'model' not in exclude:
|
||||||
|
# TODO: Remove this once we don't have to handle previous models
|
||||||
|
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
|
||||||
|
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
|
||||||
path = util.ensure_path(path)
|
path = util.ensure_path(path)
|
||||||
if self.model is True:
|
if self.model is True:
|
||||||
self.cfg.setdefault('pretrained_dims', self.vocab.vectors_length)
|
|
||||||
self.model, cfg = self.Model(**self.cfg)
|
self.model, cfg = self.Model(**self.cfg)
|
||||||
else:
|
else:
|
||||||
cfg = {}
|
cfg = {}
|
||||||
|
@ -955,12 +962,13 @@ cdef class Parser:
|
||||||
))
|
))
|
||||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||||
if 'model' not in exclude:
|
if 'model' not in exclude:
|
||||||
|
# TODO: Remove this once we don't have to handle previous models
|
||||||
|
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
|
||||||
|
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
|
||||||
if self.model is True:
|
if self.model is True:
|
||||||
self.model, cfg = self.Model(**self.cfg)
|
self.model, cfg = self.Model(**self.cfg)
|
||||||
cfg['pretrained_dims'] = self.vocab.vectors_length
|
|
||||||
else:
|
else:
|
||||||
cfg = {}
|
cfg = {}
|
||||||
cfg['pretrained_dims'] = self.vocab.vectors_length
|
|
||||||
if 'tok2vec_model' in msg:
|
if 'tok2vec_model' in msg:
|
||||||
self.model[0].from_bytes(msg['tok2vec_model'])
|
self.model[0].from_bytes(msg['tok2vec_model'])
|
||||||
if 'lower_model' in msg:
|
if 'lower_model' in msg:
|
||||||
|
@ -1033,15 +1041,11 @@ def _cleanup(Beam beam):
|
||||||
del state
|
del state
|
||||||
seen.add(addr)
|
seen.add(addr)
|
||||||
else:
|
else:
|
||||||
print(i, addr)
|
raise ValueError(Errors.E023.format(addr=addr, i=i))
|
||||||
print(seen)
|
|
||||||
raise Exception
|
|
||||||
addr = <size_t>beam._states[i].content
|
addr = <size_t>beam._states[i].content
|
||||||
if addr not in seen:
|
if addr not in seen:
|
||||||
state = <StateC*>addr
|
state = <StateC*>addr
|
||||||
del state
|
del state
|
||||||
seen.add(addr)
|
seen.add(addr)
|
||||||
else:
|
else:
|
||||||
print(i, addr)
|
raise ValueError(Errors.E023.format(addr=addr, i=i))
|
||||||
print(seen)
|
|
||||||
raise Exception
|
|
||||||
|
|
|
@ -10,6 +10,7 @@ from __future__ import unicode_literals
|
||||||
from copy import copy
|
from copy import copy
|
||||||
|
|
||||||
from ..tokens.doc cimport Doc, set_children_from_heads
|
from ..tokens.doc cimport Doc, set_children_from_heads
|
||||||
|
from ..errors import Errors
|
||||||
|
|
||||||
|
|
||||||
DELIMITER = '||'
|
DELIMITER = '||'
|
||||||
|
@@ -146,7 +147,10 @@ cpdef deprojectivize(Doc doc):

def _decorate(heads, proj_heads, labels):
    # uses decoration scheme HEAD from Nivre & Nilsson 2005
-    assert(len(heads) == len(proj_heads) == len(labels))
+    if (len(heads) != len(proj_heads)) or (len(proj_heads) != len(labels)):
+        raise ValueError(Errors.E082.format(n_heads=len(heads),
+                                            n_proj_heads=len(proj_heads),
+                                            n_labels=len(labels)))
    deco_labels = []
    for tokenid, head in enumerate(heads):
        if head != proj_heads[tokenid]:

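Note: for reference, a minimal sketch of the HEAD decoration scheme the comment cites (Nivre & Nilsson 2005): a lifted token keeps its own label plus its original head's label, joined by the module's DELIMITER. This mirrors what _decorate builds; the loop below is a plain-Python paraphrase, not the Cython source:

    DELIMITER = '||'

    def decorate(heads, proj_heads, labels):
        deco_labels = []
        for tokenid, head in enumerate(heads):
            if head != proj_heads[tokenid]:
                # e.g. 'dobj||prep' marks a lifted dependent
                deco_labels.append(labels[tokenid] + DELIMITER + labels[head])
            else:
                deco_labels.append(labels[tokenid])
        return deco_labels
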
@ -12,6 +12,7 @@ from ..structs cimport TokenC
|
||||||
from .stateclass cimport StateClass
|
from .stateclass cimport StateClass
|
||||||
from ..typedefs cimport attr_t
|
from ..typedefs cimport attr_t
|
||||||
from ..compat import json_dumps
|
from ..compat import json_dumps
|
||||||
|
from ..errors import Errors
|
||||||
from .. import util
|
from .. import util
|
||||||
|
|
||||||
|
|
||||||
|
@ -73,10 +74,7 @@ cdef class TransitionSystem:
|
||||||
action.do(state.c, action.label)
|
action.do(state.c, action.label)
|
||||||
break
|
break
|
||||||
else:
|
else:
|
||||||
print(gold.words)
|
raise ValueError(Errors.E024)
|
||||||
print(gold.ner)
|
|
||||||
print(history)
|
|
||||||
raise ValueError("Could not find gold move")
|
|
||||||
return history
|
return history
|
||||||
|
|
||||||
cdef int initialize_state(self, StateC* state) nogil:
|
cdef int initialize_state(self, StateC* state) nogil:
|
||||||
|
@ -123,17 +121,7 @@ cdef class TransitionSystem:
|
||||||
else:
|
else:
|
||||||
costs[i] = 9000
|
costs[i] = 9000
|
||||||
if n_gold <= 0:
|
if n_gold <= 0:
|
||||||
print(gold.words)
|
raise ValueError(Errors.E024)
|
||||||
print(gold.ner)
|
|
||||||
print([gold.c.ner[i].clas for i in range(gold.length)])
|
|
||||||
print([gold.c.ner[i].move for i in range(gold.length)])
|
|
||||||
print([gold.c.ner[i].label for i in range(gold.length)])
|
|
||||||
print("Self labels",
|
|
||||||
[self.c[i].label for i in range(self.n_moves)])
|
|
||||||
raise ValueError(
|
|
||||||
"Could not find a gold-standard action to supervise "
|
|
||||||
"the entity recognizer. The transition system has "
|
|
||||||
"%d actions." % (self.n_moves))
|
|
||||||
|
|
||||||
def get_class_name(self, int clas):
|
def get_class_name(self, int clas):
|
||||||
act = self.c[clas]
|
act = self.c[clas]
|
||||||
|
@ -171,7 +159,6 @@ cdef class TransitionSystem:
|
||||||
self._size *= 2
|
self._size *= 2
|
||||||
self.c = <Transition*>self.mem.realloc(self.c, self._size * sizeof(self.c[0]))
|
self.c = <Transition*>self.mem.realloc(self.c, self._size * sizeof(self.c[0]))
|
||||||
self.c[self.n_moves] = self.init_transition(self.n_moves, action, label_id)
|
self.c[self.n_moves] = self.init_transition(self.n_moves, action, label_id)
|
||||||
assert self.c[self.n_moves].label == label_id
|
|
||||||
self.n_moves += 1
|
self.n_moves += 1
|
||||||
if self.labels.get(action, []):
|
if self.labels.get(action, []):
|
||||||
new_freq = min(self.labels[action].values())
|
new_freq = min(self.labels[action].values())
|
||||||
|
|
|
@ -19,7 +19,9 @@ _languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
|
||||||
_models = {'en': ['en_core_web_sm'],
|
_models = {'en': ['en_core_web_sm'],
|
||||||
'de': ['de_core_news_md'],
|
'de': ['de_core_news_md'],
|
||||||
'fr': ['fr_core_news_sm'],
|
'fr': ['fr_core_news_sm'],
|
||||||
'xx': ['xx_ent_web_md']}
|
'xx': ['xx_ent_web_md'],
|
||||||
|
'en_core_web_md': ['en_core_web_md'],
|
||||||
|
'es_core_news_md': ['es_core_news_md']}
|
||||||
|
|
||||||
|
|
||||||
# only used for tests that require loading the models
|
# only used for tests that require loading the models
|
||||||
|
@ -183,6 +185,9 @@ def pytest_addoption(parser):
|
||||||
|
|
||||||
for lang in _languages + ['all']:
|
for lang in _languages + ['all']:
|
||||||
parser.addoption("--%s" % lang, action="store_true", help="Use %s models" % lang)
|
parser.addoption("--%s" % lang, action="store_true", help="Use %s models" % lang)
|
||||||
|
for model in _models:
|
||||||
|
if model not in _languages:
|
||||||
|
parser.addoption("--%s" % model, action="store_true", help="Use %s model" % model)
|
||||||
|
|
||||||
|
|
||||||
def pytest_runtest_setup(item):
|
def pytest_runtest_setup(item):
|
||||||
|
|
13
spacy/tests/lang/da/test_lemma.py
Normal file
13
spacy/tests/lang/da/test_lemma.py
Normal file
|
@ -0,0 +1,13 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('string,lemma', [('affaldsgruppernes', 'affaldsgruppe'),
|
||||||
|
('detailhandelsstrukturernes', 'detailhandelsstruktur'),
|
||||||
|
('kolesterols', 'kolesterol'),
|
||||||
|
('åsyns', 'åsyn')])
|
||||||
|
def test_lemmatizer_lookup_assigns(da_tokenizer, string, lemma):
|
||||||
|
tokens = da_tokenizer(string)
|
||||||
|
assert tokens[0].lemma_ == lemma
|
12
spacy/tests/regression/test_issue1660.py
Normal file
12
spacy/tests/regression/test_issue1660.py
Normal file
|
@ -0,0 +1,12 @@
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
import pytest
|
||||||
|
from ...util import load_model
|
||||||
|
|
||||||
|
@pytest.mark.models("en_core_web_md")
|
||||||
|
@pytest.mark.models("es_core_news_md")
|
||||||
|
def test_models_with_different_vectors():
|
||||||
|
nlp = load_model('en_core_web_md')
|
||||||
|
doc = nlp(u'hello world')
|
||||||
|
nlp2 = load_model('es_core_news_md')
|
||||||
|
doc2 = nlp2(u'hola')
|
||||||
|
doc = nlp(u'hello world')
|
15
spacy/tests/regression/test_issue1967.py
Normal file
15
spacy/tests/regression/test_issue1967.py
Normal file
|
@ -0,0 +1,15 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from ...pipeline import EntityRecognizer
|
||||||
|
from ...vocab import Vocab
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('label', ['U-JOB-NAME'])
|
||||||
|
def test_issue1967(label):
|
||||||
|
ner = EntityRecognizer(Vocab())
|
||||||
|
entry = ([0], ['word'], ['tag'], [0], ['dep'], [label])
|
||||||
|
gold_parses = [(None, [(entry, None)])]
|
||||||
|
ner.moves.get_actions(gold_parses=gold_parses)
|
|
@ -17,6 +17,7 @@ def meta_data():
|
||||||
'email': 'email-in-fixture',
|
'email': 'email-in-fixture',
|
||||||
'url': 'url-in-fixture',
|
'url': 'url-in-fixture',
|
||||||
'license': 'license-in-fixture',
|
'license': 'license-in-fixture',
|
||||||
|
'vectors': {'width': 0, 'vectors': 0, 'keys': 0, 'name': None}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -10,8 +10,8 @@ from ..gold import GoldParse
|
||||||
|
|
||||||
|
|
||||||
def test_textcat_learns_multilabel():
|
def test_textcat_learns_multilabel():
|
||||||
random.seed(0)
|
random.seed(1)
|
||||||
numpy.random.seed(0)
|
numpy.random.seed(1)
|
||||||
docs = []
|
docs = []
|
||||||
nlp = English()
|
nlp = English()
|
||||||
vocab = nlp.vocab
|
vocab = nlp.vocab
|
||||||
|
|
|
@ -1,4 +1,11 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
from mock import Mock
|
from mock import Mock
|
||||||
|
|
||||||
|
from ..vocab import Vocab
|
||||||
|
from ..tokens import Doc, Span, Token
|
||||||
from ..tokens.underscore import Underscore
|
from ..tokens.underscore import Underscore
|
||||||
|
|
||||||
|
|
||||||
|
@ -51,3 +58,42 @@ def test_token_underscore_method():
|
||||||
None, None)
|
None, None)
|
||||||
token._ = Underscore(Underscore.token_extensions, token, start=token.idx)
|
token._ = Underscore(Underscore.token_extensions, token, start=token.idx)
|
||||||
assert token._.hello() == 'cheese'
|
assert token._.hello() == 'cheese'
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('obj', [Doc, Span, Token])
|
||||||
|
def test_doc_underscore_remove_extension(obj):
|
||||||
|
ext_name = 'to_be_removed'
|
||||||
|
obj.set_extension(ext_name, default=False)
|
||||||
|
assert obj.has_extension(ext_name)
|
||||||
|
obj.remove_extension(ext_name)
|
||||||
|
assert not obj.has_extension(ext_name)
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('obj', [Doc, Span, Token])
|
||||||
|
def test_underscore_raises_for_dup(obj):
|
||||||
|
obj.set_extension('test', default=None)
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
obj.set_extension('test', default=None)
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('invalid_kwargs', [
|
||||||
|
{'getter': None, 'setter': lambda: None},
|
||||||
|
{'default': None, 'method': lambda: None, 'getter': lambda: None},
|
||||||
|
{'setter': lambda: None},
|
||||||
|
{'default': None, 'method': lambda: None},
|
||||||
|
{'getter': True}])
|
||||||
|
def test_underscore_raises_for_invalid(invalid_kwargs):
|
||||||
|
invalid_kwargs['force'] = True
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
Doc.set_extension('test', **invalid_kwargs)
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('valid_kwargs', [
|
||||||
|
{'getter': lambda: None},
|
||||||
|
{'getter': lambda: None, 'setter': lambda: None},
|
||||||
|
{'default': 'hello'},
|
||||||
|
{'default': None},
|
||||||
|
{'method': lambda: None}])
|
||||||
|
def test_underscore_accepts_valid(valid_kwargs):
|
||||||
|
valid_kwargs['force'] = True
|
||||||
|
Doc.set_extension('test', **valid_kwargs)
|
||||||
|
|
|
@ -28,12 +28,38 @@ def vectors():
|
||||||
def data():
|
def data():
|
||||||
return numpy.asarray([[0.0, 1.0, 2.0], [3.0, -2.0, 4.0]], dtype='f')
|
return numpy.asarray([[0.0, 1.0, 2.0], [3.0, -2.0, 4.0]], dtype='f')
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def resize_data():
|
||||||
|
return numpy.asarray([[0.0, 1.0], [2.0, 3.0]], dtype='f')
|
||||||
|
|
||||||
@pytest.fixture()
|
@pytest.fixture()
|
||||||
def vocab(en_vocab, vectors):
|
def vocab(en_vocab, vectors):
|
||||||
add_vecs_to_vocab(en_vocab, vectors)
|
add_vecs_to_vocab(en_vocab, vectors)
|
||||||
return en_vocab
|
return en_vocab
|
||||||
|
|
||||||
|
def test_init_vectors_with_resize_shape(strings,resize_data):
|
||||||
|
v = Vectors(shape=(len(strings), 3))
|
||||||
|
v.resize(shape=resize_data.shape)
|
||||||
|
assert v.shape == resize_data.shape
|
||||||
|
assert v.shape != (len(strings), 3)
|
||||||
|
|
||||||
|
def test_init_vectors_with_resize_data(data,resize_data):
|
||||||
|
v = Vectors(data=data)
|
||||||
|
v.resize(shape=resize_data.shape)
|
||||||
|
assert v.shape == resize_data.shape
|
||||||
|
assert v.shape != data.shape
|
||||||
|
|
||||||
|
def test_get_vector_resize(strings, data,resize_data):
|
||||||
|
v = Vectors(data=data)
|
||||||
|
v.resize(shape=resize_data.shape)
|
||||||
|
strings = [hash_string(s) for s in strings]
|
||||||
|
for i, string in enumerate(strings):
|
||||||
|
v.add(string, row=i)
|
||||||
|
|
||||||
|
assert list(v[strings[0]]) == list(resize_data[0])
|
||||||
|
assert list(v[strings[0]]) != list(resize_data[1])
|
||||||
|
assert list(v[strings[1]]) != list(resize_data[0])
|
||||||
|
assert list(v[strings[1]]) == list(resize_data[1])
|
||||||
|
|
||||||
def test_init_vectors_with_data(strings, data):
|
def test_init_vectors_with_data(strings, data):
|
||||||
v = Vectors(data=data)
|
v = Vectors(data=data)
|
||||||
|
|
|
@ -13,6 +13,7 @@ cimport cython
|
||||||
|
|
||||||
from .tokens.doc cimport Doc
|
from .tokens.doc cimport Doc
|
||||||
from .strings cimport hash_string
|
from .strings cimport hash_string
|
||||||
|
from .errors import Errors, Warnings, deprecation_warning
|
||||||
from . import util
|
from . import util
|
||||||
|
|
||||||
|
|
||||||
|
@@ -63,11 +64,7 @@ cdef class Tokenizer:
         return (self.__class__, args, None, None)

     cpdef Doc tokens_from_list(self, list strings):
-        util.deprecated(
-            "Tokenizer.from_list is now deprecated. Create a new Doc "
-            "object instead and pass in the strings as the `words` keyword "
-            "argument, for example:\nfrom spacy.tokens import Doc\n"
-            "doc = Doc(nlp.vocab, words=[...])")
+        deprecation_warning(Warnings.W002)
         return Doc(self.vocab, words=strings)

     @cython.boundscheck(False)

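Note: Warnings.W002 replaces the long inline message, which pointed users at constructing a Doc directly. The recommended replacement, with an invented word list:

    from spacy.vocab import Vocab
    from spacy.tokens import Doc

    doc = Doc(Vocab(), words=[u'Hello', u'world'])
    assert [t.text for t in doc] == [u'Hello', u'world']
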
@@ -78,8 +75,7 @@ cdef class Tokenizer:
         RETURNS (Doc): A container for linguistic annotations.
         """
         if len(string) >= (2 ** 30):
-            msg = "String is too long: %d characters. Max is 2**30."
-            raise ValueError(msg % len(string))
+            raise ValueError(Errors.E025.format(length=len(string)))
         cdef int length = len(string)
         cdef Doc doc = Doc(self.vocab)
         if length == 0:

129
spacy/tokens/_retokenize.pyx
Normal file
129
spacy/tokens/_retokenize.pyx
Normal file
|
@ -0,0 +1,129 @@
|
||||||
|
# coding: utf8
|
||||||
|
# cython: infer_types=True
|
||||||
|
# cython: bounds_check=False
|
||||||
|
# cython: profile=True
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from libc.string cimport memcpy, memset
|
||||||
|
|
||||||
|
from .doc cimport Doc, set_children_from_heads, token_by_start, token_by_end
|
||||||
|
from .span cimport Span
|
||||||
|
from .token cimport Token
|
||||||
|
from ..lexeme cimport Lexeme, EMPTY_LEXEME
|
||||||
|
from ..structs cimport LexemeC, TokenC
|
||||||
|
from ..attrs cimport *
|
||||||
|
|
||||||
|
|
||||||
|
cdef class Retokenizer:
|
||||||
|
'''Helper class for doc.retokenize() context manager.'''
|
||||||
|
cdef Doc doc
|
||||||
|
cdef list merges
|
||||||
|
cdef list splits
|
||||||
|
def __init__(self, doc):
|
||||||
|
self.doc = doc
|
||||||
|
self.merges = []
|
||||||
|
self.splits = []
|
||||||
|
|
||||||
|
def merge(self, Span span, attrs=None):
|
||||||
|
'''Mark a span for merging. The attrs will be applied to the resulting
|
||||||
|
token.'''
|
||||||
|
self.merges.append((span.start_char, span.end_char, attrs))
|
||||||
|
|
||||||
|
def split(self, Token token, orths, attrs=None):
|
||||||
|
'''Mark a Token for splitting, into the specified orths. The attrs
|
||||||
|
will be applied to each subtoken.'''
|
||||||
|
self.splits.append((token.start_char, orths, attrs))
|
||||||
|
|
||||||
|
def __enter__(self):
|
||||||
|
self.merges = []
|
||||||
|
self.splits = []
|
||||||
|
return self
|
||||||
|
|
||||||
|
def __exit__(self, *args):
|
||||||
|
# Do the actual merging here
|
||||||
|
for start_char, end_char, attrs in self.merges:
|
||||||
|
start = token_by_start(self.doc.c, self.doc.length, start_char)
|
||||||
|
end = token_by_end(self.doc.c, self.doc.length, end_char)
|
||||||
|
_merge(self.doc, start, end+1, attrs)
|
||||||
|
for start_char, orths, attrs in self.splits:
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
|
||||||
|
def _merge(Doc doc, int start, int end, attributes):
|
||||||
|
"""Retokenize the document, such that the span at
|
||||||
|
`doc.text[start_idx : end_idx]` is merged into a single token. If
|
||||||
|
`start_idx` and `end_idx `do not mark start and end token boundaries,
|
||||||
|
the document remains unchanged.
|
||||||
|
|
||||||
|
start_idx (int): Character index of the start of the slice to merge.
|
||||||
|
end_idx (int): Character index after the end of the slice to merge.
|
||||||
|
**attributes: Attributes to assign to the merged token. By default,
|
||||||
|
attributes are inherited from the syntactic root of the span.
|
||||||
|
RETURNS (Token): The newly merged token, or `None` if the start and end
|
||||||
|
indices did not fall at token boundaries.
|
||||||
|
"""
|
||||||
|
cdef Span span = doc[start:end]
|
||||||
|
cdef int start_char = span.start_char
|
||||||
|
cdef int end_char = span.end_char
|
||||||
|
# Get LexemeC for newly merged token
|
||||||
|
new_orth = ''.join([t.text_with_ws for t in span])
|
||||||
|
if span[-1].whitespace_:
|
||||||
|
new_orth = new_orth[:-len(span[-1].whitespace_)]
|
||||||
|
cdef const LexemeC* lex = doc.vocab.get(doc.mem, new_orth)
|
||||||
|
# House the new merged token where it starts
|
||||||
|
cdef TokenC* token = &doc.c[start]
|
||||||
|
token.spacy = doc.c[end-1].spacy
|
||||||
|
for attr_name, attr_value in attributes.items():
|
||||||
|
if attr_name == TAG:
|
||||||
|
doc.vocab.morphology.assign_tag(token, attr_value)
|
||||||
|
else:
|
||||||
|
Token.set_struct_attr(token, attr_name, attr_value)
|
||||||
|
# Make sure ent_iob remains consistent
|
||||||
|
if doc.c[end].ent_iob == 1 and token.ent_iob in (0, 2):
|
||||||
|
if token.ent_type == doc.c[end].ent_type:
|
||||||
|
token.ent_iob = 3
|
||||||
|
else:
|
||||||
|
# If they're not the same entity type, let them be two entities
|
||||||
|
doc.c[end].ent_iob = 3
|
||||||
|
# Begin by setting all the head indices to absolute token positions
|
||||||
|
# This is easier to work with for now than the offsets
|
||||||
|
# Before thinking of something simpler, beware the case where a
|
||||||
|
# dependency bridges over the entity. Here the alignment of the
|
||||||
|
# tokens changes.
|
||||||
|
span_root = span.root.i
|
||||||
|
token.dep = span.root.dep
|
||||||
|
# We update token.lex after keeping span root and dep, since
|
||||||
|
# setting token.lex will change span.start and span.end properties
|
||||||
|
# as it modifies the character offsets in the doc
|
||||||
|
token.lex = lex
|
||||||
|
for i in range(doc.length):
|
||||||
|
doc.c[i].head += i
|
||||||
|
# Set the head of the merged token, and its dep relation, from the Span
|
||||||
|
token.head = doc.c[span_root].head
|
||||||
|
# Adjust deps before shrinking tokens
|
||||||
|
# Tokens which point into the merged token should now point to it
|
||||||
|
# Subtract the offset from all tokens which point to >= end
|
||||||
|
offset = (end - start) - 1
|
||||||
|
for i in range(doc.length):
|
||||||
|
head_idx = doc.c[i].head
|
||||||
|
if start <= head_idx < end:
|
||||||
|
doc.c[i].head = start
|
||||||
|
elif head_idx >= end:
|
||||||
|
doc.c[i].head -= offset
|
||||||
|
# Now compress the token array
|
||||||
|
for i in range(end, doc.length):
|
||||||
|
doc.c[i - offset] = doc.c[i]
|
||||||
|
for i in range(doc.length - offset, doc.length):
|
||||||
|
memset(&doc.c[i], 0, sizeof(TokenC))
|
||||||
|
doc.c[i].lex = &EMPTY_LEXEME
|
||||||
|
doc.length -= offset
|
||||||
|
for i in range(doc.length):
|
||||||
|
# ...And, set heads back to a relative position
|
||||||
|
doc.c[i].head -= i
|
||||||
|
# Set the left/right children, left/right edges
|
||||||
|
set_children_from_heads(doc.c, doc.length)
|
||||||
|
# Clear the cached Python objects
|
||||||
|
# Return the merged Python object
|
||||||
|
return doc[start]
|
||||||
|
|
||||||
|
|
|
@ -28,6 +28,8 @@ cdef int token_by_start(const TokenC* tokens, int length, int start_char) except
|
||||||
cdef int token_by_end(const TokenC* tokens, int length, int end_char) except -2
|
cdef int token_by_end(const TokenC* tokens, int length, int end_char) except -2
|
||||||
|
|
||||||
|
|
||||||
|
cdef int set_children_from_heads(TokenC* tokens, int length) except -1
|
||||||
|
|
||||||
cdef class Doc:
|
cdef class Doc:
|
||||||
cdef readonly Pool mem
|
cdef readonly Pool mem
|
||||||
cdef readonly Vocab vocab
|
cdef readonly Vocab vocab
|
||||||
|
|
|
@ -31,18 +31,19 @@ from ..attrs cimport ENT_TYPE, SENT_START
|
||||||
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
|
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
|
||||||
from ..util import normalize_slice
|
from ..util import normalize_slice
|
||||||
from ..compat import is_config, copy_reg, pickle, basestring_
|
from ..compat import is_config, copy_reg, pickle, basestring_
|
||||||
from .. import about
|
from ..errors import Errors, Warnings, deprecation_warning
|
||||||
from .. import util
|
from .. import util
|
||||||
from .underscore import Underscore
|
from .underscore import Underscore, get_ext_args
|
||||||
|
from ._retokenize import Retokenizer
|
||||||
|
|
||||||
DEF PADDING = 5
|
DEF PADDING = 5
|
||||||
|
|
||||||
|
|
||||||
cdef int bounds_check(int i, int length, int padding) except -1:
|
cdef int bounds_check(int i, int length, int padding) except -1:
|
||||||
if (i + padding) < 0:
|
if (i + padding) < 0:
|
||||||
raise IndexError
|
raise IndexError(Errors.E026.format(i=i, length=length))
|
||||||
if (i - padding) >= length:
|
if (i - padding) >= length:
|
||||||
raise IndexError
|
raise IndexError(Errors.E026.format(i=i, length=length))
|
||||||
|
|
||||||
|
|
||||||
cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
|
cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
|
||||||
|
@@ -94,11 +95,10 @@ cdef class Doc:
         spaces=[True, False, False])
     """
     @classmethod
-    def set_extension(cls, name, default=None, method=None,
-                      getter=None, setter=None):
-        nr_defined = sum(t is not None for t in (default, getter, setter, method))
-        assert nr_defined == 1
-        Underscore.doc_extensions[name] = (default, method, getter, setter)
+    def set_extension(cls, name, **kwargs):
+        if cls.has_extension(name) and not kwargs.get('force', False):
+            raise ValueError(Errors.E090.format(name=name, obj='Doc'))
+        Underscore.doc_extensions[name] = get_ext_args(**kwargs)

     @classmethod
     def get_extension(cls, name):

@@ -108,6 +108,12 @@ cdef class Doc:
     def has_extension(cls, name):
         return name in Underscore.doc_extensions

+    @classmethod
+    def remove_extension(cls, name):
+        if not cls.has_extension(name):
+            raise ValueError(Errors.E046.format(name=name))
+        return Underscore.doc_extensions.pop(name)
+
     def __init__(self, Vocab vocab, words=None, spaces=None, user_data=None,
                  orths_and_spaces=None):
         """Create a Doc object.

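Note: a short usage sketch of the extension API after these two hunks; the attribute name is invented:

    from spacy.tokens import Doc

    Doc.set_extension('is_greeting', default=False)
    assert Doc.has_extension('is_greeting')
    # Re-registering the same name now raises E090 unless force=True is passed.
    Doc.set_extension('is_greeting', default=False, force=True)
    # remove_extension unregisters it and hands back the stored
    # (default, method, getter, setter) tuple.
    Doc.remove_extension('is_greeting')
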
@ -154,11 +160,7 @@ cdef class Doc:
|
||||||
if spaces is None:
|
if spaces is None:
|
||||||
spaces = [True] * len(words)
|
spaces = [True] * len(words)
|
||||||
elif len(spaces) != len(words):
|
elif len(spaces) != len(words):
|
||||||
raise ValueError(
|
raise ValueError(Errors.E027)
|
||||||
"Arguments 'words' and 'spaces' should be sequences of "
|
|
||||||
"the same length, or 'spaces' should be left default at "
|
|
||||||
"None. spaces should be a sequence of booleans, with True "
|
|
||||||
"meaning that the word owns a ' ' character following it.")
|
|
||||||
orths_and_spaces = zip(words, spaces)
|
orths_and_spaces = zip(words, spaces)
|
||||||
if orths_and_spaces is not None:
|
if orths_and_spaces is not None:
|
||||||
for orth_space in orths_and_spaces:
|
for orth_space in orths_and_spaces:
|
||||||
|
@ -166,10 +168,7 @@ cdef class Doc:
|
||||||
orth = orth_space
|
orth = orth_space
|
||||||
has_space = True
|
has_space = True
|
||||||
elif isinstance(orth_space, bytes):
|
elif isinstance(orth_space, bytes):
|
||||||
raise ValueError(
|
raise ValueError(Errors.E028.format(value=orth_space))
|
||||||
"orths_and_spaces expects either List(unicode) or "
|
|
||||||
"List((unicode, bool)). "
|
|
||||||
"Got bytes instance: %s" % (str(orth_space)))
|
|
||||||
else:
|
else:
|
||||||
orth, has_space = orth_space
|
orth, has_space = orth_space
|
||||||
# Note that we pass self.mem here --- we have ownership, if LexemeC
|
# Note that we pass self.mem here --- we have ownership, if LexemeC
|
||||||
|
@ -437,10 +436,7 @@ cdef class Doc:
|
||||||
if token.ent_iob == 1:
|
if token.ent_iob == 1:
|
||||||
if start == -1:
|
if start == -1:
|
||||||
seq = ['%s|%s' % (t.text, t.ent_iob_) for t in self[i-5:i+5]]
|
seq = ['%s|%s' % (t.text, t.ent_iob_) for t in self[i-5:i+5]]
|
||||||
raise ValueError(
|
raise ValueError(Errors.E093.format(seq=' '.join(seq)))
|
||||||
"token.ent_iob values make invalid sequence: "
|
|
||||||
"I without B\n"
|
|
||||||
"{seq}".format(seq=' '.join(seq)))
|
|
||||||
elif token.ent_iob == 2 or token.ent_iob == 0:
|
elif token.ent_iob == 2 or token.ent_iob == 0:
|
||||||
if start != -1:
|
if start != -1:
|
||||||
output.append(Span(self, start, i, label=label))
|
output.append(Span(self, start, i, label=label))
|
||||||
|
@ -503,19 +499,16 @@ cdef class Doc:
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
if not self.is_parsed:
|
if not self.is_parsed:
|
||||||
raise ValueError(
|
raise ValueError(Errors.E029)
|
||||||
"noun_chunks requires the dependency parse, which "
|
|
||||||
"requires a statistical model to be installed and loaded. "
|
|
||||||
"For more info, see the "
|
|
||||||
"documentation: \n%s\n" % about.__docs_models__)
|
|
||||||
# Accumulate the result before beginning to iterate over it. This
|
# Accumulate the result before beginning to iterate over it. This
|
||||||
# prevents the tokenisation from being changed out from under us
|
# prevents the tokenisation from being changed out from under us
|
||||||
# during the iteration. The tricky thing here is that Span accepts
|
# during the iteration. The tricky thing here is that Span accepts
|
||||||
# its tokenisation changing, so it's okay once we have the Span
|
# its tokenisation changing, so it's okay once we have the Span
|
||||||
# objects. See Issue #375.
|
# objects. See Issue #375.
|
||||||
spans = []
|
spans = []
|
||||||
for start, end, label in self.noun_chunks_iterator(self):
|
if self.noun_chunks_iterator is not None:
|
||||||
spans.append(Span(self, start, end, label=label))
|
for start, end, label in self.noun_chunks_iterator(self):
|
||||||
|
spans.append(Span(self, start, end, label=label))
|
||||||
for span in spans:
|
for span in spans:
|
||||||
yield span
|
yield span
|
||||||
|
|
||||||
|
@ -532,12 +525,7 @@ cdef class Doc:
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
if not self.is_sentenced:
|
if not self.is_sentenced:
|
||||||
raise ValueError(
|
raise ValueError(Errors.E030)
|
||||||
"Sentence boundaries unset. You can add the 'sentencizer' "
|
|
||||||
"component to the pipeline with: "
|
|
||||||
"nlp.add_pipe(nlp.create_pipe('sentencizer')) "
|
|
||||||
"Alternatively, add the dependency parser, or set "
|
|
||||||
"sentence boundaries by setting doc[i].sent_start")
|
|
||||||
if 'sents' in self.user_hooks:
|
if 'sents' in self.user_hooks:
|
||||||
yield from self.user_hooks['sents'](self)
|
yield from self.user_hooks['sents'](self)
|
||||||
else:
|
else:
|
||||||
|
@ -567,7 +555,8 @@ cdef class Doc:
|
||||||
t.idx = (t-1).idx + (t-1).lex.length + (t-1).spacy
|
t.idx = (t-1).idx + (t-1).lex.length + (t-1).spacy
|
||||||
t.l_edge = self.length
|
t.l_edge = self.length
|
||||||
t.r_edge = self.length
|
t.r_edge = self.length
|
||||||
assert t.lex.orth != 0
|
if t.lex.orth == 0:
|
||||||
|
raise ValueError(Errors.E031.format(i=self.length))
|
||||||
t.spacy = has_space
|
t.spacy = has_space
|
||||||
self.length += 1
|
self.length += 1
|
||||||
return t.idx + t.lex.length + t.spacy
|
return t.idx + t.lex.length + t.spacy
|
||||||
|
@ -683,13 +672,7 @@ cdef class Doc:
|
||||||
|
|
||||||
def from_array(self, attrs, array):
|
def from_array(self, attrs, array):
|
||||||
if SENT_START in attrs and HEAD in attrs:
|
if SENT_START in attrs and HEAD in attrs:
|
||||||
raise ValueError(
|
raise ValueError(Errors.E032)
|
||||||
"Conflicting attributes specified in doc.from_array(): "
|
|
||||||
"(HEAD, SENT_START)\n"
|
|
||||||
"The HEAD attribute currently sets sentence boundaries "
|
|
||||||
"implicitly, based on the tree structure. This means the HEAD "
|
|
||||||
"attribute would potentially override the sentence boundaries "
|
|
||||||
"set by SENT_START.")
|
|
||||||
cdef int i, col
|
cdef int i, col
|
||||||
cdef attr_id_t attr_id
|
cdef attr_id_t attr_id
|
||||||
cdef TokenC* tokens = self.c
|
cdef TokenC* tokens = self.c
|
||||||
|
@ -827,7 +810,7 @@ cdef class Doc:
|
||||||
RETURNS (Doc): Itself.
|
RETURNS (Doc): Itself.
|
||||||
"""
|
"""
|
||||||
if self.length != 0:
|
if self.length != 0:
|
||||||
raise ValueError("Cannot load into non-empty Doc")
|
raise ValueError(Errors.E033.format(length=self.length))
|
||||||
deserializers = {
|
deserializers = {
|
||||||
'text': lambda b: None,
|
'text': lambda b: None,
|
||||||
'array_head': lambda b: None,
|
'array_head': lambda b: None,
|
||||||
|
@@ -878,7 +861,7 @@ cdef class Doc:
        computed by the models in the pipeline. Let's say a
        document with 30 words has a tensor with 128 dimensions
        per word. doc.tensor.shape will be (30, 128). After
-       calling doc.extend_tensor with an array of hape (30, 64),
+       calling doc.extend_tensor with an array of shape (30, 64),
        doc.tensor == (30, 192).
        '''
        xp = get_array_module(self.tensor)

@ -888,6 +871,18 @@ cdef class Doc:
|
||||||
else:
|
else:
|
||||||
self.tensor = xp.hstack((self.tensor, tensor))
|
self.tensor = xp.hstack((self.tensor, tensor))
|
||||||
|
|
||||||
|
def retokenize(self):
|
||||||
|
'''Context manager to handle retokenization of the Doc.
|
||||||
|
Modifications to the Doc's tokenization are stored, and then
|
||||||
|
made all at once when the context manager exits. This is
|
||||||
|
much more efficient, and less error-prone.
|
||||||
|
|
||||||
|
All views of the Doc (Span and Token) created before the
|
||||||
|
retokenization are invalidated, although they may accidentally
|
||||||
|
continue to work.
|
||||||
|
'''
|
||||||
|
return Retokenizer(self)
|
||||||
|
|
||||||
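`Doc.retokenize` is the new public entry point added in this hunk, and the rewritten body of `Doc.merge` further down now just delegates to it. A hedged usage sketch (text, indices and attributes are made up for illustration):

    import spacy

    nlp = spacy.blank('en')
    doc = nlp(u"I live in New York City")

    # Changes are collected and applied when the context manager exits,
    # so views created inside the block stay consistent.
    with doc.retokenize() as retokenizer:
        retokenizer.merge(doc[3:6], attrs={"LEMMA": u"New York City"})

    print([t.text for t in doc])  # ['I', 'live', 'in', 'New York City']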
def merge(self, int start_idx, int end_idx, *args, **attributes):
|
def merge(self, int start_idx, int end_idx, *args, **attributes):
|
||||||
"""Retokenize the document, such that the span at
|
"""Retokenize the document, such that the span at
|
||||||
`doc.text[start_idx : end_idx]` is merged into a single token. If
|
`doc.text[start_idx : end_idx]` is merged into a single token. If
|
||||||
|
@ -903,10 +898,7 @@ cdef class Doc:
|
||||||
"""
|
"""
|
||||||
cdef unicode tag, lemma, ent_type
|
cdef unicode tag, lemma, ent_type
|
||||||
if len(args) == 3:
|
if len(args) == 3:
|
||||||
util.deprecated(
|
deprecation_warning(Warnings.W003)
|
||||||
"Positional arguments to Doc.merge are deprecated. Instead, "
|
|
||||||
"use the keyword arguments, for example tag=, lemma= or "
|
|
||||||
"ent_type=.")
|
|
||||||
tag, lemma, ent_type = args
|
tag, lemma, ent_type = args
|
||||||
attributes[TAG] = tag
|
attributes[TAG] = tag
|
||||||
attributes[LEMMA] = lemma
|
attributes[LEMMA] = lemma
|
||||||
|
@ -920,13 +912,9 @@ cdef class Doc:
|
||||||
if 'ent_type' in attributes:
|
if 'ent_type' in attributes:
|
||||||
attributes[ENT_TYPE] = attributes['ent_type']
|
attributes[ENT_TYPE] = attributes['ent_type']
|
||||||
elif args:
|
elif args:
|
||||||
raise ValueError(
|
raise ValueError(Errors.E034.format(n_args=len(args),
|
||||||
"Doc.merge received %d non-keyword arguments. Expected either "
|
args=repr(args),
|
||||||
"3 arguments (deprecated), or 0 (use keyword arguments). "
|
kwargs=repr(attributes)))
|
||||||
"Arguments supplied:\n%s\n"
|
|
||||||
"Keyword arguments: %s\n" % (len(args), repr(args),
|
|
||||||
repr(attributes)))
|
|
||||||
|
|
||||||
# More deprecated attribute handling =/
|
# More deprecated attribute handling =/
|
||||||
if 'label' in attributes:
|
if 'label' in attributes:
|
||||||
attributes['ent_type'] = attributes.pop('label')
|
attributes['ent_type'] = attributes.pop('label')
|
||||||
|
@ -941,66 +929,8 @@ cdef class Doc:
|
||||||
return None
|
return None
|
||||||
# Currently we have the token index, we want the range-end index
|
# Currently we have the token index, we want the range-end index
|
||||||
end += 1
|
end += 1
|
||||||
cdef Span span = self[start:end]
|
with self.retokenize() as retokenizer:
|
||||||
# Get LexemeC for newly merged token
|
retokenizer.merge(self[start:end], attrs=attributes)
|
||||||
new_orth = ''.join([t.text_with_ws for t in span])
|
|
||||||
if span[-1].whitespace_:
|
|
||||||
new_orth = new_orth[:-len(span[-1].whitespace_)]
|
|
||||||
cdef const LexemeC* lex = self.vocab.get(self.mem, new_orth)
|
|
||||||
# House the new merged token where it starts
|
|
||||||
cdef TokenC* token = &self.c[start]
|
|
||||||
token.spacy = self.c[end-1].spacy
|
|
||||||
for attr_name, attr_value in attributes.items():
|
|
||||||
if attr_name == TAG:
|
|
||||||
self.vocab.morphology.assign_tag(token, attr_value)
|
|
||||||
else:
|
|
||||||
Token.set_struct_attr(token, attr_name, attr_value)
|
|
||||||
# Make sure ent_iob remains consistent
|
|
||||||
if self.c[end].ent_iob == 1 and token.ent_iob in (0, 2):
|
|
||||||
if token.ent_type == self.c[end].ent_type:
|
|
||||||
token.ent_iob = 3
|
|
||||||
else:
|
|
||||||
# If they're not the same entity type, let them be two entities
|
|
||||||
self.c[end].ent_iob = 3
|
|
||||||
# Begin by setting all the head indices to absolute token positions
|
|
||||||
# This is easier to work with for now than the offsets
|
|
||||||
# Before thinking of something simpler, beware the case where a
|
|
||||||
# dependency bridges over the entity. Here the alignment of the
|
|
||||||
# tokens changes.
|
|
||||||
span_root = span.root.i
|
|
||||||
token.dep = span.root.dep
|
|
||||||
# We update token.lex after keeping span root and dep, since
|
|
||||||
# setting token.lex will change span.start and span.end properties
|
|
||||||
# as it modifies the character offsets in the doc
|
|
||||||
token.lex = lex
|
|
||||||
for i in range(self.length):
|
|
||||||
self.c[i].head += i
|
|
||||||
# Set the head of the merged token, and its dep relation, from the Span
|
|
||||||
token.head = self.c[span_root].head
|
|
||||||
# Adjust deps before shrinking tokens
|
|
||||||
# Tokens which point into the merged token should now point to it
|
|
||||||
# Subtract the offset from all tokens which point to >= end
|
|
||||||
offset = (end - start) - 1
|
|
||||||
for i in range(self.length):
|
|
||||||
head_idx = self.c[i].head
|
|
||||||
if start <= head_idx < end:
|
|
||||||
self.c[i].head = start
|
|
||||||
elif head_idx >= end:
|
|
||||||
self.c[i].head -= offset
|
|
||||||
# Now compress the token array
|
|
||||||
for i in range(end, self.length):
|
|
||||||
self.c[i - offset] = self.c[i]
|
|
||||||
for i in range(self.length - offset, self.length):
|
|
||||||
memset(&self.c[i], 0, sizeof(TokenC))
|
|
||||||
self.c[i].lex = &EMPTY_LEXEME
|
|
||||||
self.length -= offset
|
|
||||||
for i in range(self.length):
|
|
||||||
# ...And, set heads back to a relative position
|
|
||||||
self.c[i].head -= i
|
|
||||||
# Set the left/right children, left/right edges
|
|
||||||
set_children_from_heads(self.c, self.length)
|
|
||||||
# Clear the cached Python objects
|
|
||||||
# Return the merged Python object
|
|
||||||
return self[start]
|
return self[start]
|
||||||
|
|
||||||
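With the body of `Doc.merge` reduced to a `retokenize()` call, the only behavioural change callers see is the `Warnings.W003` deprecation for positional arguments. A sketch of the old and new calling styles (offsets and labels are illustrative; the keyword form matches the updated `merge_ents` helper in the next file):

    import spacy

    nlp = spacy.blank('en')
    doc = nlp(u"I live in New York")
    start_char = doc[3].idx                  # start of "New"
    end_char = doc[4].idx + len(doc[4])      # end of "York"

    # Deprecated: three positional args now emit Warnings.W003
    # doc.merge(start_char, end_char, u"NNP", u"New York", u"GPE")

    # Preferred: keyword arguments only
    doc.merge(start_char, end_char, tag=u"NNP", lemma=u"New York", ent_type=u"GPE")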
def print_tree(self, light=False, flat=False):
|
def print_tree(self, light=False, flat=False):
|
||||||
|
|
|
@ -8,7 +8,7 @@ from ..symbols import HEAD, TAG, DEP, ENT_IOB, ENT_TYPE
|
||||||
def merge_ents(doc):
|
def merge_ents(doc):
|
||||||
"""Helper: merge adjacent entities into single tokens; modifies the doc."""
|
"""Helper: merge adjacent entities into single tokens; modifies the doc."""
|
||||||
for ent in doc.ents:
|
for ent in doc.ents:
|
||||||
ent.merge(ent.root.tag_, ent.text, ent.label_)
|
ent.merge(tag=ent.root.tag_, lemma=ent.text, ent_type=ent.label_)
|
||||||
return doc
|
return doc
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -16,16 +16,17 @@ from ..util import normalize_slice
|
||||||
from ..attrs cimport IS_PUNCT, IS_SPACE
|
from ..attrs cimport IS_PUNCT, IS_SPACE
|
||||||
from ..lexeme cimport Lexeme
|
from ..lexeme cimport Lexeme
|
||||||
from ..compat import is_config
|
from ..compat import is_config
|
||||||
from .. import about
|
from ..errors import Errors, TempErrors
|
||||||
from .underscore import Underscore
|
from .underscore import Underscore, get_ext_args
|
||||||
|
|
||||||
|
|
||||||
cdef class Span:
|
cdef class Span:
|
||||||
"""A slice from a Doc object."""
|
"""A slice from a Doc object."""
|
||||||
@classmethod
|
@classmethod
|
||||||
def set_extension(cls, name, default=None, method=None,
|
def set_extension(cls, name, **kwargs):
|
||||||
getter=None, setter=None):
|
if cls.has_extension(name) and not kwargs.get('force', False):
|
||||||
Underscore.span_extensions[name] = (default, method, getter, setter)
|
raise ValueError(Errors.E090.format(name=name, obj='Span'))
|
||||||
|
Underscore.span_extensions[name] = get_ext_args(**kwargs)
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
def get_extension(cls, name):
|
def get_extension(cls, name):
|
||||||
|
@ -35,6 +36,12 @@ cdef class Span:
|
||||||
def has_extension(cls, name):
|
def has_extension(cls, name):
|
||||||
return name in Underscore.span_extensions
|
return name in Underscore.span_extensions
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def remove_extension(cls, name):
|
||||||
|
if not cls.has_extension(name):
|
||||||
|
raise ValueError(Errors.E046.format(name=name))
|
||||||
|
return Underscore.span_extensions.pop(name)
|
||||||
|
|
||||||
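`Span.set_extension` now refuses to overwrite an existing extension unless `force=True` (`Errors.E090`), and `Span.remove_extension` is new. A short sketch with made-up extension names:

    from spacy.tokens import Span

    Span.set_extension('has_digit', getter=lambda span: any(t.is_digit for t in span))

    # Registering the same name again raises E090 unless force=True
    Span.set_extension('has_digit', default=False, force=True)

    # remove_extension pops and returns the (default, method, getter, setter) tuple
    removed = Span.remove_extension('has_digit')
    assert not Span.has_extension('has_digit')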
def __cinit__(self, Doc doc, int start, int end, attr_t label=0,
|
def __cinit__(self, Doc doc, int start, int end, attr_t label=0,
|
||||||
vector=None, vector_norm=None):
|
vector=None, vector_norm=None):
|
||||||
"""Create a `Span` object from the slice `doc[start : end]`.
|
"""Create a `Span` object from the slice `doc[start : end]`.
|
||||||
|
@ -48,8 +55,7 @@ cdef class Span:
|
||||||
RETURNS (Span): The newly constructed object.
|
RETURNS (Span): The newly constructed object.
|
||||||
"""
|
"""
|
||||||
if not (0 <= start <= end <= len(doc)):
|
if not (0 <= start <= end <= len(doc)):
|
||||||
raise IndexError
|
raise IndexError(Errors.E035.format(start=start, end=end, length=len(doc)))
|
||||||
|
|
||||||
self.doc = doc
|
self.doc = doc
|
||||||
self.start = start
|
self.start = start
|
||||||
self.start_char = self.doc[start].idx if start < self.doc.length else 0
|
self.start_char = self.doc[start].idx if start < self.doc.length else 0
|
||||||
|
@ -58,7 +64,8 @@ cdef class Span:
|
||||||
self.end_char = self.doc[end - 1].idx + len(self.doc[end - 1])
|
self.end_char = self.doc[end - 1].idx + len(self.doc[end - 1])
|
||||||
else:
|
else:
|
||||||
self.end_char = 0
|
self.end_char = 0
|
||||||
assert label in doc.vocab.strings, label
|
if label not in doc.vocab.strings:
|
||||||
|
raise ValueError(Errors.E084.format(label=label))
|
||||||
self.label = label
|
self.label = label
|
||||||
self._vector = vector
|
self._vector = vector
|
||||||
self._vector_norm = vector_norm
|
self._vector_norm = vector_norm
|
||||||
|
@ -267,11 +274,10 @@ cdef class Span:
|
||||||
or (self.doc.c[self.end-1].idx + self.doc.c[self.end-1].lex.length) != self.end_char:
|
or (self.doc.c[self.end-1].idx + self.doc.c[self.end-1].lex.length) != self.end_char:
|
||||||
start = token_by_start(self.doc.c, self.doc.length, self.start_char)
|
start = token_by_start(self.doc.c, self.doc.length, self.start_char)
|
||||||
if self.start == -1:
|
if self.start == -1:
|
||||||
raise IndexError("Error calculating span: Can't find start")
|
raise IndexError(Errors.E036.format(start=self.start_char))
|
||||||
end = token_by_end(self.doc.c, self.doc.length, self.end_char)
|
end = token_by_end(self.doc.c, self.doc.length, self.end_char)
|
||||||
if end == -1:
|
if end == -1:
|
||||||
raise IndexError("Error calculating span: Can't find end")
|
raise IndexError(Errors.E037.format(end=self.end_char))
|
||||||
|
|
||||||
self.start = start
|
self.start = start
|
||||||
self.end = end + 1
|
self.end = end + 1
|
||||||
|
|
||||||
|
@ -294,12 +300,11 @@ cdef class Span:
|
||||||
cdef int i
|
cdef int i
|
||||||
if self.doc.is_parsed:
|
if self.doc.is_parsed:
|
||||||
root = &self.doc.c[self.start]
|
root = &self.doc.c[self.start]
|
||||||
n = 0
|
|
||||||
while root.head != 0:
|
while root.head != 0:
|
||||||
root += root.head
|
root += root.head
|
||||||
n += 1
|
n += 1
|
||||||
if n >= self.doc.length:
|
if n >= self.doc.length:
|
||||||
raise RuntimeError
|
raise RuntimeError(Errors.E038)
|
||||||
return self.doc[root.l_edge:root.r_edge + 1]
|
return self.doc[root.l_edge:root.r_edge + 1]
|
||||||
elif self.doc.is_sentenced:
|
elif self.doc.is_sentenced:
|
||||||
# find start of the sentence
|
# find start of the sentence
|
||||||
|
@ -314,13 +319,7 @@ cdef class Span:
|
||||||
n += 1
|
n += 1
|
||||||
if n >= self.doc.length:
|
if n >= self.doc.length:
|
||||||
break
|
break
|
||||||
#
|
|
||||||
return self.doc[start:end]
|
return self.doc[start:end]
|
||||||
else:
|
|
||||||
raise ValueError(
|
|
||||||
"Access to sentence requires either the dependency parse "
|
|
||||||
"or sentence boundaries to be set by setting " +
|
|
||||||
"doc[i].is_sent_start = True")
|
|
||||||
|
|
||||||
property has_vector:
|
property has_vector:
|
||||||
"""RETURNS (bool): Whether a word vector is associated with the object.
|
"""RETURNS (bool): Whether a word vector is associated with the object.
|
||||||
|
@ -402,11 +401,7 @@ cdef class Span:
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
if not self.doc.is_parsed:
|
if not self.doc.is_parsed:
|
||||||
raise ValueError(
|
raise ValueError(Errors.E029)
|
||||||
"noun_chunks requires the dependency parse, which "
|
|
||||||
"requires a statistical model to be installed and loaded. "
|
|
||||||
"For more info, see the "
|
|
||||||
"documentation: \n%s\n" % about.__docs_models__)
|
|
||||||
# Accumulate the result before beginning to iterate over it. This
|
# Accumulate the result before beginning to iterate over it. This
|
||||||
# prevents the tokenisation from being changed out from under us
|
# prevents the tokenisation from being changed out from under us
|
||||||
# during the iteration. The tricky thing here is that Span accepts
|
# during the iteration. The tricky thing here is that Span accepts
|
||||||
|
@ -552,9 +547,7 @@ cdef class Span:
|
||||||
return self.root.ent_id
|
return self.root.ent_id
|
||||||
|
|
||||||
def __set__(self, hash_t key):
|
def __set__(self, hash_t key):
|
||||||
raise NotImplementedError(
|
raise NotImplementedError(TempErrors.T007.format(attr='ent_id'))
|
||||||
"Can't yet set ent_id from Span. Vote for this feature on "
|
|
||||||
"the issue tracker: http://github.com/explosion/spaCy/issues")
|
|
||||||
|
|
||||||
property ent_id_:
|
property ent_id_:
|
||||||
"""RETURNS (unicode): The (string) entity ID."""
|
"""RETURNS (unicode): The (string) entity ID."""
|
||||||
|
@ -562,9 +555,7 @@ cdef class Span:
|
||||||
return self.root.ent_id_
|
return self.root.ent_id_
|
||||||
|
|
||||||
def __set__(self, hash_t key):
|
def __set__(self, hash_t key):
|
||||||
raise NotImplementedError(
|
raise NotImplementedError(TempErrors.T007.format(attr='ent_id_'))
|
||||||
"Can't yet set ent_id_ from Span. Vote for this feature on the "
|
|
||||||
"issue tracker: http://github.com/explosion/spaCy/issues")
|
|
||||||
|
|
||||||
property orth_:
|
property orth_:
|
||||||
"""Verbatim text content (identical to Span.text). Exists mostly for
|
"""Verbatim text content (identical to Span.text). Exists mostly for
|
||||||
|
@ -612,9 +603,5 @@ cdef int _count_words_to_root(const TokenC* token, int sent_length) except -1:
|
||||||
token += token.head
|
token += token.head
|
||||||
n += 1
|
n += 1
|
||||||
if n >= sent_length:
|
if n >= sent_length:
|
||||||
raise RuntimeError(
|
raise RuntimeError(Errors.E039)
|
||||||
"Array bounds exceeded while searching for root word. This "
|
|
||||||
"likely means the parse tree is in an invalid state. Please "
|
|
||||||
"report this issue here: "
|
|
||||||
"http://github.com/explosion/spaCy/issues")
|
|
||||||
return n
|
return n
|
||||||
|
|
|
@ -6,6 +6,7 @@ from ..typedefs cimport attr_t, flags_t
|
||||||
from ..parts_of_speech cimport univ_pos_t
|
from ..parts_of_speech cimport univ_pos_t
|
||||||
from .doc cimport Doc
|
from .doc cimport Doc
|
||||||
from ..lexeme cimport Lexeme
|
from ..lexeme cimport Lexeme
|
||||||
|
from ..errors import Errors
|
||||||
|
|
||||||
|
|
||||||
cdef class Token:
|
cdef class Token:
|
||||||
|
@ -17,8 +18,7 @@ cdef class Token:
|
||||||
@staticmethod
|
@staticmethod
|
||||||
cdef inline Token cinit(Vocab vocab, const TokenC* token, int offset, Doc doc):
|
cdef inline Token cinit(Vocab vocab, const TokenC* token, int offset, Doc doc):
|
||||||
if offset < 0 or offset >= doc.length:
|
if offset < 0 or offset >= doc.length:
|
||||||
msg = "Attempt to access token at %d, max length %d"
|
raise IndexError(Errors.E040.format(i=offset, max_length=doc.length))
|
||||||
raise IndexError(msg % (offset, doc.length))
|
|
||||||
cdef Token self = Token.__new__(Token, vocab, doc, offset)
|
cdef Token self = Token.__new__(Token, vocab, doc, offset)
|
||||||
return self
|
return self
|
||||||
|
|
||||||
|
|
|
@ -19,18 +19,19 @@ from ..attrs cimport IS_OOV, IS_TITLE, IS_UPPER, IS_CURRENCY, LIKE_URL, LIKE_NUM
|
||||||
from ..attrs cimport IS_STOP, ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX
|
from ..attrs cimport IS_STOP, ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX
|
||||||
from ..attrs cimport LENGTH, CLUSTER, LEMMA, POS, TAG, DEP
|
from ..attrs cimport LENGTH, CLUSTER, LEMMA, POS, TAG, DEP
|
||||||
from ..compat import is_config
|
from ..compat import is_config
|
||||||
|
from ..errors import Errors
|
||||||
from .. import util
|
from .. import util
|
||||||
from .. import about
|
from .underscore import Underscore, get_ext_args
|
||||||
from .underscore import Underscore
|
|
||||||
|
|
||||||
|
|
||||||
cdef class Token:
|
cdef class Token:
|
||||||
"""An individual token – i.e. a word, punctuation symbol, whitespace,
|
"""An individual token – i.e. a word, punctuation symbol, whitespace,
|
||||||
etc."""
|
etc."""
|
||||||
@classmethod
|
@classmethod
|
||||||
def set_extension(cls, name, default=None, method=None,
|
def set_extension(cls, name, **kwargs):
|
||||||
getter=None, setter=None):
|
if cls.has_extension(name) and not kwargs.get('force', False):
|
||||||
Underscore.token_extensions[name] = (default, method, getter, setter)
|
raise ValueError(Errors.E090.format(name=name, obj='Token'))
|
||||||
|
Underscore.token_extensions[name] = get_ext_args(**kwargs)
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
def get_extension(cls, name):
|
def get_extension(cls, name):
|
||||||
|
@ -40,6 +41,12 @@ cdef class Token:
|
||||||
def has_extension(cls, name):
|
def has_extension(cls, name):
|
||||||
return name in Underscore.span_extensions
|
return name in Underscore.span_extensions
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def remove_extension(cls, name):
|
||||||
|
if not cls.has_extension(name):
|
||||||
|
raise ValueError(Errors.E046.format(name=name))
|
||||||
|
return Underscore.token_extensions.pop(name)
|
||||||
|
|
||||||
def __cinit__(self, Vocab vocab, Doc doc, int offset):
|
def __cinit__(self, Vocab vocab, Doc doc, int offset):
|
||||||
"""Construct a `Token` object.
|
"""Construct a `Token` object.
|
||||||
|
|
||||||
|
@ -106,7 +113,7 @@ cdef class Token:
|
||||||
elif op == 5:
|
elif op == 5:
|
||||||
return my >= their
|
return my >= their
|
||||||
else:
|
else:
|
||||||
raise ValueError(op)
|
raise ValueError(Errors.E041.format(op=op))
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def _(self):
|
def _(self):
|
||||||
|
@ -135,8 +142,7 @@ cdef class Token:
|
||||||
RETURNS (Token): The token at position `self.doc[self.i+i]`.
|
RETURNS (Token): The token at position `self.doc[self.i+i]`.
|
||||||
"""
|
"""
|
||||||
if self.i+i < 0 or (self.i+i >= len(self.doc)):
|
if self.i+i < 0 or (self.i+i >= len(self.doc)):
|
||||||
msg = "Error accessing doc[%d].nbor(%d), for doc of length %d"
|
raise IndexError(Errors.E042.format(i=self.i, j=i, length=len(self.doc)))
|
||||||
raise IndexError(msg % (self.i, i, len(self.doc)))
|
|
||||||
return self.doc[self.i+i]
|
return self.doc[self.i+i]
|
||||||
|
|
||||||
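`Token.nbor` keeps its bounds check but now raises `IndexError(Errors.E042)` instead of a hand-formatted message. For example:

    import spacy

    nlp = spacy.blank('en')
    doc = nlp(u"Give it back")
    assert doc[1].nbor(-1).text == u"Give"
    assert doc[1].nbor(1).text == u"back"
    try:
        doc[2].nbor(1)       # one position past the end of the Doc
    except IndexError as err:
        print(err)           # E042-style message including the doc length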
def similarity(self, other):
|
def similarity(self, other):
|
||||||
|
@ -354,14 +360,7 @@ cdef class Token:
|
||||||
|
|
||||||
property sent_start:
|
property sent_start:
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
# Raising a deprecation warning causes errors for autocomplete
|
# Raising a deprecation warning here causes errors for autocomplete
|
||||||
#util.deprecated(
|
|
||||||
# "Token.sent_start is now deprecated. Use Token.is_sent_start "
|
|
||||||
# "instead, which returns a boolean value or None if the answer "
|
|
||||||
# "is unknown – instead of a misleading 0 for False and 1 for "
|
|
||||||
# "True. It also fixes a quirk in the old logic that would "
|
|
||||||
# "always set the property to 0 for the first word of the "
|
|
||||||
# "document.")
|
|
||||||
# Handle broken backwards compatibility case: doc[0].sent_start
|
# Handle broken backwards compatibility case: doc[0].sent_start
|
||||||
# was False.
|
# was False.
|
||||||
if self.i == 0:
|
if self.i == 0:
|
||||||
|
@ -386,9 +385,7 @@ cdef class Token:
|
||||||
|
|
||||||
def __set__(self, value):
|
def __set__(self, value):
|
||||||
if self.doc.is_parsed:
|
if self.doc.is_parsed:
|
||||||
raise ValueError(
|
raise ValueError(Errors.E043)
|
||||||
"Refusing to write to token.sent_start if its document "
|
|
||||||
"is parsed, because this may cause inconsistent state.")
|
|
||||||
if value is None:
|
if value is None:
|
||||||
self.c.sent_start = 0
|
self.c.sent_start = 0
|
||||||
elif value is True:
|
elif value is True:
|
||||||
|
@ -396,8 +393,7 @@ cdef class Token:
|
||||||
elif value is False:
|
elif value is False:
|
||||||
self.c.sent_start = -1
|
self.c.sent_start = -1
|
||||||
else:
|
else:
|
||||||
raise ValueError("Invalid value for token.sent_start. Must be "
|
raise ValueError(Errors.E044.format(value=value))
|
||||||
"one of: None, True, False")
|
|
||||||
|
|
||||||
property lefts:
|
property lefts:
|
||||||
"""The leftward immediate children of the word, in the syntactic
|
"""The leftward immediate children of the word, in the syntactic
|
||||||
|
@ -415,8 +411,7 @@ cdef class Token:
|
||||||
nr_iter += 1
|
nr_iter += 1
|
||||||
# This is ugly, but it's a way to guard out infinite loops
|
# This is ugly, but it's a way to guard out infinite loops
|
||||||
if nr_iter >= 10000000:
|
if nr_iter >= 10000000:
|
||||||
raise RuntimeError("Possibly infinite loop encountered "
|
raise RuntimeError(Errors.E045.format(attr='token.lefts'))
|
||||||
"while looking for token.lefts")
|
|
||||||
|
|
||||||
property rights:
|
property rights:
|
||||||
"""The rightward immediate children of the word, in the syntactic
|
"""The rightward immediate children of the word, in the syntactic
|
||||||
|
@ -434,8 +429,7 @@ cdef class Token:
|
||||||
ptr -= 1
|
ptr -= 1
|
||||||
nr_iter += 1
|
nr_iter += 1
|
||||||
if nr_iter >= 10000000:
|
if nr_iter >= 10000000:
|
||||||
raise RuntimeError("Possibly infinite loop encountered "
|
raise RuntimeError(Errors.E045.format(attr='token.rights'))
|
||||||
"while looking for token.rights")
|
|
||||||
tokens.reverse()
|
tokens.reverse()
|
||||||
for t in tokens:
|
for t in tokens:
|
||||||
yield t
|
yield t
|
||||||
|
|
|
@ -3,6 +3,8 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
import functools
|
import functools
|
||||||
|
|
||||||
|
from ..errors import Errors
|
||||||
|
|
||||||
|
|
||||||
class Underscore(object):
|
class Underscore(object):
|
||||||
doc_extensions = {}
|
doc_extensions = {}
|
||||||
|
@ -23,7 +25,7 @@ class Underscore(object):
|
||||||
|
|
||||||
def __getattr__(self, name):
|
def __getattr__(self, name):
|
||||||
if name not in self._extensions:
|
if name not in self._extensions:
|
||||||
raise AttributeError(name)
|
raise AttributeError(Errors.E046.format(name=name))
|
||||||
default, method, getter, setter = self._extensions[name]
|
default, method, getter, setter = self._extensions[name]
|
||||||
if getter is not None:
|
if getter is not None:
|
||||||
return getter(self._obj)
|
return getter(self._obj)
|
||||||
|
@ -34,7 +36,7 @@ class Underscore(object):
|
||||||
|
|
||||||
def __setattr__(self, name, value):
|
def __setattr__(self, name, value):
|
||||||
if name not in self._extensions:
|
if name not in self._extensions:
|
||||||
raise AttributeError(name)
|
raise AttributeError(Errors.E047.format(name=name))
|
||||||
default, method, getter, setter = self._extensions[name]
|
default, method, getter, setter = self._extensions[name]
|
||||||
if setter is not None:
|
if setter is not None:
|
||||||
return setter(self._obj, value)
|
return setter(self._obj, value)
|
||||||
|
@ -52,3 +54,24 @@ class Underscore(object):
|
||||||
|
|
||||||
def _get_key(self, name):
|
def _get_key(self, name):
|
||||||
return ('._.', name, self._start, self._end)
|
return ('._.', name, self._start, self._end)
|
||||||
|
|
||||||
|
|
||||||
|
def get_ext_args(**kwargs):
|
||||||
|
"""Validate and convert arguments. Reused in Doc, Token and Span."""
|
||||||
|
default = kwargs.get('default')
|
||||||
|
getter = kwargs.get('getter')
|
||||||
|
setter = kwargs.get('setter')
|
||||||
|
method = kwargs.get('method')
|
||||||
|
if getter is None and setter is not None:
|
||||||
|
raise ValueError(Errors.E089)
|
||||||
|
valid_opts = ('default' in kwargs, method is not None, getter is not None)
|
||||||
|
nr_defined = sum(t is True for t in valid_opts)
|
||||||
|
if nr_defined != 1:
|
||||||
|
raise ValueError(Errors.E083.format(nr_defined=nr_defined))
|
||||||
|
if setter is not None and not hasattr(setter, '__call__'):
|
||||||
|
raise ValueError(Errors.E091.format(name='setter', value=repr(setter)))
|
||||||
|
if getter is not None and not hasattr(getter, '__call__'):
|
||||||
|
raise ValueError(Errors.E091.format(name='getter', value=repr(getter)))
|
||||||
|
if method is not None and not hasattr(method, '__call__'):
|
||||||
|
raise ValueError(Errors.E091.format(name='method', value=repr(method)))
|
||||||
|
return (default, method, getter, setter)
|
||||||
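`get_ext_args` centralises the checks shared by the `Doc`, `Token` and `Span` extension APIs: exactly one of `default`, `method` or `getter` may be supplied (`E083`), a `setter` requires a `getter` (`E089`), and each of them must be callable (`E091`). Illustrated through the public `set_extension` API (extension names are made up):

    from spacy.tokens import Doc

    # Valid: exactly one of default / method / getter
    Doc.set_extension('source_url', default=None)
    Doc.set_extension('char_len', getter=lambda doc: len(doc.text))
    Doc.set_extension('first_n', method=lambda doc, n: doc[:n])

    # Invalid: default and getter together -> ValueError (Errors.E083)
    try:
        Doc.set_extension('broken', default=0, getter=lambda doc: 0)
    except ValueError as err:
        print(err)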
|
|
|
@ -11,8 +11,6 @@ import sys
|
||||||
import textwrap
|
import textwrap
|
||||||
import random
|
import random
|
||||||
from collections import OrderedDict
|
from collections import OrderedDict
|
||||||
import inspect
|
|
||||||
import warnings
|
|
||||||
from thinc.neural._classes.model import Model
|
from thinc.neural._classes.model import Model
|
||||||
from thinc.neural.ops import NumpyOps
|
from thinc.neural.ops import NumpyOps
|
||||||
import functools
|
import functools
|
||||||
|
@ -23,10 +21,12 @@ import numpy.random
|
||||||
from .symbols import ORTH
|
from .symbols import ORTH
|
||||||
from .compat import cupy, CudaStream, path2str, basestring_, input_, unicode_
|
from .compat import cupy, CudaStream, path2str, basestring_, input_, unicode_
|
||||||
from .compat import import_file
|
from .compat import import_file
|
||||||
|
from .errors import Errors
|
||||||
|
|
||||||
import msgpack
|
# Import these directly from Thinc, so that we're sure we always have the
|
||||||
import msgpack_numpy
|
# same version.
|
||||||
msgpack_numpy.patch()
|
from thinc.neural._classes.model import msgpack
|
||||||
|
from thinc.neural._classes.model import msgpack_numpy
|
||||||
|
|
||||||
|
|
||||||
LANGUAGES = {}
|
LANGUAGES = {}
|
||||||
|
@ -50,8 +50,7 @@ def get_lang_class(lang):
|
||||||
try:
|
try:
|
||||||
module = importlib.import_module('.lang.%s' % lang, 'spacy')
|
module = importlib.import_module('.lang.%s' % lang, 'spacy')
|
||||||
except ImportError:
|
except ImportError:
|
||||||
msg = "Can't import language %s from spacy.lang."
|
raise ImportError(Errors.E048.format(lang=lang))
|
||||||
raise ImportError(msg % lang)
|
|
||||||
LANGUAGES[lang] = getattr(module, module.__all__[0])
|
LANGUAGES[lang] = getattr(module, module.__all__[0])
|
||||||
return LANGUAGES[lang]
|
return LANGUAGES[lang]
|
||||||
|
|
||||||
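`get_lang_class` behaves as before and only changes the error it raises for unknown codes (`Errors.E048`). Typical use, sketched:

    from spacy.util import get_lang_class

    English = get_lang_class('en')   # imports spacy.lang.en lazily and caches it
    nlp = English()

    try:
        get_lang_class('zz')         # no spacy.lang.zz module exists
    except ImportError as err:
        print(err)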
|
@ -108,7 +107,7 @@ def load_model(name, **overrides):
|
||||||
"""
|
"""
|
||||||
data_path = get_data_path()
|
data_path = get_data_path()
|
||||||
if not data_path or not data_path.exists():
|
if not data_path or not data_path.exists():
|
||||||
raise IOError("Can't find spaCy data path: %s" % path2str(data_path))
|
raise IOError(Errors.E049.format(path=path2str(data_path)))
|
||||||
if isinstance(name, basestring_): # in data dir / shortcut
|
if isinstance(name, basestring_): # in data dir / shortcut
|
||||||
if name in set([d.name for d in data_path.iterdir()]):
|
if name in set([d.name for d in data_path.iterdir()]):
|
||||||
return load_model_from_link(name, **overrides)
|
return load_model_from_link(name, **overrides)
|
||||||
|
@ -118,7 +117,7 @@ def load_model(name, **overrides):
|
||||||
return load_model_from_path(Path(name), **overrides)
|
return load_model_from_path(Path(name), **overrides)
|
||||||
elif hasattr(name, 'exists'): # Path or Path-like to model data
|
elif hasattr(name, 'exists'): # Path or Path-like to model data
|
||||||
return load_model_from_path(name, **overrides)
|
return load_model_from_path(name, **overrides)
|
||||||
raise IOError("Can't find model '%s'" % name)
|
raise IOError(Errors.E050.format(name=name))
|
||||||
|
|
||||||
|
|
||||||
def load_model_from_link(name, **overrides):
|
def load_model_from_link(name, **overrides):
|
||||||
|
@ -127,9 +126,7 @@ def load_model_from_link(name, **overrides):
|
||||||
try:
|
try:
|
||||||
cls = import_file(name, path)
|
cls = import_file(name, path)
|
||||||
except AttributeError:
|
except AttributeError:
|
||||||
raise IOError(
|
raise IOError(Errors.E051.format(name=name))
|
||||||
"Cant' load '%s'. If you're using a shortcut link, make sure it "
|
|
||||||
"points to a valid package (not just a data directory)." % name)
|
|
||||||
return cls.load(**overrides)
|
return cls.load(**overrides)
|
||||||
|
|
||||||
|
|
||||||
|
@ -173,8 +170,7 @@ def load_model_from_init_py(init_file, **overrides):
|
||||||
data_dir = '%s_%s-%s' % (meta['lang'], meta['name'], meta['version'])
|
data_dir = '%s_%s-%s' % (meta['lang'], meta['name'], meta['version'])
|
||||||
data_path = model_path / data_dir
|
data_path = model_path / data_dir
|
||||||
if not model_path.exists():
|
if not model_path.exists():
|
||||||
msg = "Can't find model directory: %s"
|
raise IOError(Errors.E052.format(path=path2str(data_path)))
|
||||||
raise ValueError(msg % path2str(data_path))
|
|
||||||
return load_model_from_path(data_path, meta, **overrides)
|
return load_model_from_path(data_path, meta, **overrides)
|
||||||
|
|
||||||
|
|
||||||
|
@ -186,16 +182,14 @@ def get_model_meta(path):
|
||||||
"""
|
"""
|
||||||
model_path = ensure_path(path)
|
model_path = ensure_path(path)
|
||||||
if not model_path.exists():
|
if not model_path.exists():
|
||||||
msg = "Can't find model directory: %s"
|
raise IOError(Errors.E052.format(path=path2str(model_path)))
|
||||||
raise ValueError(msg % path2str(model_path))
|
|
||||||
meta_path = model_path / 'meta.json'
|
meta_path = model_path / 'meta.json'
|
||||||
if not meta_path.is_file():
|
if not meta_path.is_file():
|
||||||
raise IOError("Could not read meta.json from %s" % meta_path)
|
raise IOError(Errors.E053.format(path=meta_path))
|
||||||
meta = read_json(meta_path)
|
meta = read_json(meta_path)
|
||||||
for setting in ['lang', 'name', 'version']:
|
for setting in ['lang', 'name', 'version']:
|
||||||
if setting not in meta or not meta[setting]:
|
if setting not in meta or not meta[setting]:
|
||||||
msg = "No valid '%s' setting found in model meta.json"
|
raise ValueError(Errors.E054.format(setting=setting))
|
||||||
raise ValueError(msg % setting)
|
|
||||||
return meta
|
return meta
|
||||||
|
|
||||||
|
|
||||||
|
@ -344,13 +338,10 @@ def update_exc(base_exceptions, *addition_dicts):
|
||||||
for orth, token_attrs in additions.items():
|
for orth, token_attrs in additions.items():
|
||||||
if not all(isinstance(attr[ORTH], unicode_)
|
if not all(isinstance(attr[ORTH], unicode_)
|
||||||
for attr in token_attrs):
|
for attr in token_attrs):
|
||||||
msg = "Invalid ORTH value in exception: key='%s', orths='%s'"
|
raise ValueError(Errors.E055.format(key=orth, orths=token_attrs))
|
||||||
raise ValueError(msg % (orth, token_attrs))
|
|
||||||
described_orth = ''.join(attr[ORTH] for attr in token_attrs)
|
described_orth = ''.join(attr[ORTH] for attr in token_attrs)
|
||||||
if orth != described_orth:
|
if orth != described_orth:
|
||||||
msg = ("Invalid tokenizer exception: ORTH values combined "
|
raise ValueError(Errors.E056.format(key=orth, orths=described_orth))
|
||||||
"don't match original string. key='%s', orths='%s'")
|
|
||||||
raise ValueError(msg % (orth, described_orth))
|
|
||||||
exc.update(additions)
|
exc.update(additions)
|
||||||
exc = expand_exc(exc, "'", "’")
|
exc = expand_exc(exc, "'", "’")
|
||||||
return exc
|
return exc
|
||||||
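`update_exc` still validates that every `ORTH` piece is unicode (`E055`) and that the pieces join back to the original key (`E056`); only the messages move into `Errors`. A small sketch of a valid and an invalid exception (the example strings are illustrative):

    from spacy.attrs import ORTH
    from spacy.util import update_exc
    from spacy.lang.tokenizer_exceptions import BASE_EXCEPTIONS

    # Valid: "gon" + "na" re-join to the key "gonna"
    good = {u"gonna": [{ORTH: u"gon"}, {ORTH: u"na"}]}
    exc = update_exc(BASE_EXCEPTIONS, good)

    # Invalid: "gun" + "na" != "gonna" -> ValueError (Errors.E056)
    bad = {u"gonna": [{ORTH: u"gun"}, {ORTH: u"na"}]}
    try:
        update_exc(BASE_EXCEPTIONS, bad)
    except ValueError as err:
        print(err)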
|
@ -380,8 +371,7 @@ def expand_exc(excs, search, replace):
|
||||||
|
|
||||||
def normalize_slice(length, start, stop, step=None):
|
def normalize_slice(length, start, stop, step=None):
|
||||||
if not (step is None or step == 1):
|
if not (step is None or step == 1):
|
||||||
raise ValueError("Stepped slices not supported in Span objects."
|
raise ValueError(Errors.E057)
|
||||||
"Try: list(tokens)[start:stop:step] instead.")
|
|
||||||
if start is None:
|
if start is None:
|
||||||
start = 0
|
start = 0
|
||||||
elif start < 0:
|
elif start < 0:
|
||||||
|
@ -392,7 +382,6 @@ def normalize_slice(length, start, stop, step=None):
|
||||||
elif stop < 0:
|
elif stop < 0:
|
||||||
stop += length
|
stop += length
|
||||||
stop = min(length, max(start, stop))
|
stop = min(length, max(start, stop))
|
||||||
assert 0 <= start <= stop <= length
|
|
||||||
return start, stop
|
return start, stop
|
||||||
|
|
||||||
|
|
||||||
|
@ -552,18 +541,6 @@ def from_disk(path, readers, exclude):
|
||||||
return path
|
return path
|
||||||
|
|
||||||
|
|
||||||
def deprecated(message, filter='always'):
|
|
||||||
"""Show a deprecation warning.
|
|
||||||
|
|
||||||
message (unicode): The message to display.
|
|
||||||
filter (unicode): Filter value.
|
|
||||||
"""
|
|
||||||
stack = inspect.stack()[-1]
|
|
||||||
with warnings.catch_warnings():
|
|
||||||
warnings.simplefilter(filter, DeprecationWarning)
|
|
||||||
warnings.warn_explicit(message, DeprecationWarning, stack[1], stack[2])
|
|
||||||
|
|
||||||
|
|
||||||
def print_table(data, title=None):
|
def print_table(data, title=None):
|
||||||
"""Print data in table format.
|
"""Print data in table format.
|
||||||
|
|
||||||
|
|
|
@ -1,24 +1,43 @@
|
||||||
# coding: utf8
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import functools
|
||||||
import numpy
|
import numpy
|
||||||
from collections import OrderedDict
|
from collections import OrderedDict
|
||||||
import msgpack
|
|
||||||
import msgpack_numpy
|
from .util import msgpack
|
||||||
msgpack_numpy.patch()
|
from .util import msgpack_numpy
|
||||||
|
|
||||||
cimport numpy as np
|
cimport numpy as np
|
||||||
from thinc.neural.util import get_array_module
|
from thinc.neural.util import get_array_module
|
||||||
from thinc.neural._classes.model import Model
|
from thinc.neural._classes.model import Model
|
||||||
|
|
||||||
from .strings cimport StringStore, hash_string
|
from .strings cimport StringStore, hash_string
|
||||||
from .compat import basestring_, path2str
|
from .compat import basestring_, path2str
|
||||||
|
from .errors import Errors
|
||||||
from . import util
|
from . import util
|
||||||
|
|
||||||
|
from cython.operator cimport dereference as deref
|
||||||
|
from libcpp.set cimport set as cppset
|
||||||
|
|
||||||
def unpickle_vectors(bytes_data):
|
def unpickle_vectors(bytes_data):
|
||||||
return Vectors().from_bytes(bytes_data)
|
return Vectors().from_bytes(bytes_data)
|
||||||
|
|
||||||
|
|
||||||
|
class GlobalRegistry(object):
|
||||||
|
'''Global store of vectors, to avoid repeatedly loading the data.'''
|
||||||
|
data = {}
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def register(cls, name, data):
|
||||||
|
cls.data[name] = data
|
||||||
|
return functools.partial(cls.get, name)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def get(cls, name):
|
||||||
|
return cls.data[name]
|
||||||
|
|
||||||
|
|
||||||
cdef class Vectors:
|
cdef class Vectors:
|
||||||
"""Store, save and load word vectors.
|
"""Store, save and load word vectors.
|
||||||
|
|
||||||
|
@ -31,18 +50,21 @@ cdef class Vectors:
|
||||||
the table need to be assigned --- so len(list(vectors.keys())) may be
|
the table need to be assigned --- so len(list(vectors.keys())) may be
|
||||||
greater or smaller than vectors.shape[0].
|
greater or smaller than vectors.shape[0].
|
||||||
"""
|
"""
|
||||||
|
cdef public object name
|
||||||
cdef public object data
|
cdef public object data
|
||||||
cdef public object key2row
|
cdef public object key2row
|
||||||
cdef public object _unset
|
cdef cppset[int] _unset
|
||||||
|
|
||||||
def __init__(self, *, shape=None, data=None, keys=None):
|
def __init__(self, *, shape=None, data=None, keys=None, name=None):
|
||||||
"""Create a new vector store.
|
"""Create a new vector store.
|
||||||
|
|
||||||
shape (tuple): Size of the table, as (# entries, # columns)
|
shape (tuple): Size of the table, as (# entries, # columns)
|
||||||
data (numpy.ndarray): The vector data.
|
data (numpy.ndarray): The vector data.
|
||||||
keys (iterable): A sequence of keys, aligned with the data.
|
keys (iterable): A sequence of keys, aligned with the data.
|
||||||
|
name (string): A name to identify the vectors table.
|
||||||
RETURNS (Vectors): The newly created object.
|
RETURNS (Vectors): The newly created object.
|
||||||
"""
|
"""
|
||||||
|
self.name = name
|
||||||
if data is None:
|
if data is None:
|
||||||
if shape is None:
|
if shape is None:
|
||||||
shape = (0,0)
|
shape = (0,0)
|
||||||
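The constructor gains a `name` keyword, which the `vocab.pyx` hunks further down use to decide whether to re-link vectors to models after deserialisation. Creating a named table, sketched after the existing docstring (data, keys and names are illustrative):

    import numpy
    from spacy.vectors import Vectors

    data = numpy.zeros((3, 50), dtype='f')
    keys = [u"apple", u"orange", u"pear"]
    vectors = Vectors(data=data, keys=keys, name='my_vectors')

    # or reserve empty rows to fill in later
    empty = Vectors(shape=(1000, 50), name='en_model.vectors')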
|
@ -50,9 +72,9 @@ cdef class Vectors:
|
||||||
self.data = data
|
self.data = data
|
||||||
self.key2row = OrderedDict()
|
self.key2row = OrderedDict()
|
||||||
if self.data is not None:
|
if self.data is not None:
|
||||||
self._unset = set(range(self.data.shape[0]))
|
self._unset = cppset[int]({i for i in range(self.data.shape[0])})
|
||||||
else:
|
else:
|
||||||
self._unset = set()
|
self._unset = cppset[int]()
|
||||||
if keys is not None:
|
if keys is not None:
|
||||||
for i, key in enumerate(keys):
|
for i, key in enumerate(keys):
|
||||||
self.add(key, row=i)
|
self.add(key, row=i)
|
||||||
|
@ -74,7 +96,7 @@ cdef class Vectors:
|
||||||
@property
|
@property
|
||||||
def is_full(self):
|
def is_full(self):
|
||||||
"""RETURNS (bool): `True` if no slots are available for new keys."""
|
"""RETURNS (bool): `True` if no slots are available for new keys."""
|
||||||
return len(self._unset) == 0
|
return self._unset.size() == 0
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def n_keys(self):
|
def n_keys(self):
|
||||||
|
@ -93,7 +115,7 @@ cdef class Vectors:
|
||||||
"""
|
"""
|
||||||
i = self.key2row[key]
|
i = self.key2row[key]
|
||||||
if i is None:
|
if i is None:
|
||||||
raise KeyError(key)
|
raise KeyError(Errors.E058.format(key=key))
|
||||||
else:
|
else:
|
||||||
return self.data[i]
|
return self.data[i]
|
||||||
|
|
||||||
|
@ -105,8 +127,8 @@ cdef class Vectors:
|
||||||
"""
|
"""
|
||||||
i = self.key2row[key]
|
i = self.key2row[key]
|
||||||
self.data[i] = vector
|
self.data[i] = vector
|
||||||
if i in self._unset:
|
if self._unset.count(i):
|
||||||
self._unset.remove(i)
|
self._unset.erase(self._unset.find(i))
|
||||||
|
|
||||||
def __iter__(self):
|
def __iter__(self):
|
||||||
"""Iterate over the keys in the table.
|
"""Iterate over the keys in the table.
|
||||||
|
@ -145,7 +167,7 @@ cdef class Vectors:
|
||||||
xp = get_array_module(self.data)
|
xp = get_array_module(self.data)
|
||||||
self.data = xp.resize(self.data, shape)
|
self.data = xp.resize(self.data, shape)
|
||||||
filled = {row for row in self.key2row.values()}
|
filled = {row for row in self.key2row.values()}
|
||||||
self._unset = {row for row in range(shape[0]) if row not in filled}
|
self._unset = cppset[int]({row for row in range(shape[0]) if row not in filled})
|
||||||
removed_items = []
|
removed_items = []
|
||||||
for key, row in list(self.key2row.items()):
|
for key, row in list(self.key2row.items()):
|
||||||
if row >= shape[0]:
|
if row >= shape[0]:
|
||||||
|
@ -169,7 +191,7 @@ cdef class Vectors:
|
||||||
YIELDS (ndarray): A vector in the table.
|
YIELDS (ndarray): A vector in the table.
|
||||||
"""
|
"""
|
||||||
for row, vector in enumerate(range(self.data.shape[0])):
|
for row, vector in enumerate(range(self.data.shape[0])):
|
||||||
if row not in self._unset:
|
if not self._unset.count(row):
|
||||||
yield vector
|
yield vector
|
||||||
|
|
||||||
def items(self):
|
def items(self):
|
||||||
|
@ -194,7 +216,8 @@ cdef class Vectors:
|
||||||
RETURNS: The requested key, keys, row or rows.
|
RETURNS: The requested key, keys, row or rows.
|
||||||
"""
|
"""
|
||||||
if sum(arg is None for arg in (key, keys, row, rows)) != 3:
|
if sum(arg is None for arg in (key, keys, row, rows)) != 3:
|
||||||
raise ValueError("One (and only one) keyword arg must be set.")
|
bad_kwargs = {'key': key, 'keys': keys, 'row': row, 'rows': rows}
|
||||||
|
raise ValueError(Errors.E059.format(kwargs=bad_kwargs))
|
||||||
xp = get_array_module(self.data)
|
xp = get_array_module(self.data)
|
||||||
if key is not None:
|
if key is not None:
|
||||||
if isinstance(key, basestring_):
|
if isinstance(key, basestring_):
|
||||||
|
@ -233,14 +256,14 @@ cdef class Vectors:
|
||||||
row = self.key2row[key]
|
row = self.key2row[key]
|
||||||
elif row is None:
|
elif row is None:
|
||||||
if self.is_full:
|
if self.is_full:
|
||||||
raise ValueError("Cannot add new key to vectors -- full")
|
raise ValueError(Errors.E060.format(rows=self.data.shape[0],
|
||||||
row = min(self._unset)
|
cols=self.data.shape[1]))
|
||||||
|
row = deref(self._unset.begin())
|
||||||
self.key2row[key] = row
|
self.key2row[key] = row
|
||||||
if vector is not None:
|
if vector is not None:
|
||||||
self.data[row] = vector
|
self.data[row] = vector
|
||||||
if row in self._unset:
|
if self._unset.count(row):
|
||||||
self._unset.remove(row)
|
self._unset.erase(self._unset.find(row))
|
||||||
return row
|
return row
|
||||||
|
|
||||||
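`add` now takes the next free row from the C++ set and reports a full table via `Errors.E060`. Behaviour sketch (keys and vectors are illustrative):

    import numpy
    from spacy.vectors import Vectors

    vectors = Vectors(shape=(2, 10))
    vectors.add(u"cat", vector=numpy.ones((10,), dtype='f'))
    vectors.add(u"dog", vector=numpy.zeros((10,), dtype='f'))
    assert vectors.is_full
    # A third new key without an explicit row would raise ValueError (Errors.E060)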
def most_similar(self, queries, *, batch_size=1024):
|
def most_similar(self, queries, *, batch_size=1024):
|
||||||
|
@ -297,7 +320,7 @@ cdef class Vectors:
|
||||||
width = int(dims)
|
width = int(dims)
|
||||||
break
|
break
|
||||||
else:
|
else:
|
||||||
raise IOError("Expected file named e.g. vectors.128.f.bin")
|
raise IOError(Errors.E061.format(filename=path))
|
||||||
bin_loc = path / 'vectors.{dims}.{dtype}.bin'.format(dims=dims,
|
bin_loc = path / 'vectors.{dims}.{dtype}.bin'.format(dims=dims,
|
||||||
dtype=dtype)
|
dtype=dtype)
|
||||||
xp = get_array_module(self.data)
|
xp = get_array_module(self.data)
|
||||||
|
@ -346,8 +369,8 @@ cdef class Vectors:
|
||||||
with path.open('rb') as file_:
|
with path.open('rb') as file_:
|
||||||
self.key2row = msgpack.load(file_)
|
self.key2row = msgpack.load(file_)
|
||||||
for key, row in self.key2row.items():
|
for key, row in self.key2row.items():
|
||||||
if row in self._unset:
|
if self._unset.count(row):
|
||||||
self._unset.remove(row)
|
self._unset.erase(self._unset.find(row))
|
||||||
|
|
||||||
def load_keys(path):
|
def load_keys(path):
|
||||||
if path.exists():
|
if path.exists():
|
||||||
|
|
|
@ -16,6 +16,7 @@ from .attrs cimport PROB, LANG, ORTH, TAG
|
||||||
from .structs cimport SerializedLexemeC
|
from .structs cimport SerializedLexemeC
|
||||||
|
|
||||||
from .compat import copy_reg, basestring_
|
from .compat import copy_reg, basestring_
|
||||||
|
from .errors import Errors
|
||||||
from .lemmatizer import Lemmatizer
|
from .lemmatizer import Lemmatizer
|
||||||
from .attrs import intify_attrs
|
from .attrs import intify_attrs
|
||||||
from .vectors import Vectors
|
from .vectors import Vectors
|
||||||
|
@ -100,15 +101,9 @@ cdef class Vocab:
|
||||||
flag_id = bit
|
flag_id = bit
|
||||||
break
|
break
|
||||||
else:
|
else:
|
||||||
raise ValueError(
|
raise ValueError(Errors.E062)
|
||||||
"Cannot find empty bit for new lexical flag. All bits "
|
|
||||||
"between 0 and 63 are occupied. You can replace one by "
|
|
||||||
"specifying the flag_id explicitly, e.g. "
|
|
||||||
"`nlp.vocab.add_flag(your_func, flag_id=IS_ALPHA`.")
|
|
||||||
elif flag_id >= 64 or flag_id < 1:
|
elif flag_id >= 64 or flag_id < 1:
|
||||||
raise ValueError(
|
raise ValueError(Errors.E063.format(value=flag_id))
|
||||||
"Invalid value for flag_id: %d. Flag IDs must be between "
|
|
||||||
"1 and 63 (inclusive)" % flag_id)
|
|
||||||
for lex in self:
|
for lex in self:
|
||||||
lex.set_flag(flag_id, flag_getter(lex.orth_))
|
lex.set_flag(flag_id, flag_getter(lex.orth_))
|
||||||
self.lex_attr_getters[flag_id] = flag_getter
|
self.lex_attr_getters[flag_id] = flag_getter
|
||||||
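`Vocab.add_flag` keeps its behaviour and only swaps the inline messages for `Errors.E062`/`E063`. Usage, sketched after the pattern in the spaCy docs:

    import spacy

    nlp = spacy.blank('en')
    is_fruit_getter = lambda text: text.lower() in (u"apple", u"orange", u"banana")
    IS_FRUIT = nlp.vocab.add_flag(is_fruit_getter)   # picks a free bit between 1 and 63

    doc = nlp(u"I ate an apple")
    assert doc[3].check_flag(IS_FRUIT)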
|
@ -127,8 +122,9 @@ cdef class Vocab:
|
||||||
cdef size_t addr
|
cdef size_t addr
|
||||||
if lex != NULL:
|
if lex != NULL:
|
||||||
if lex.orth != self.strings[string]:
|
if lex.orth != self.strings[string]:
|
||||||
raise LookupError.mismatched_strings(
|
raise KeyError(Errors.E064.format(string=lex.orth,
|
||||||
lex.orth, self.strings[string], string)
|
orth=self.strings[string],
|
||||||
|
orth_id=string))
|
||||||
return lex
|
return lex
|
||||||
else:
|
else:
|
||||||
return self._new_lexeme(mem, string)
|
return self._new_lexeme(mem, string)
|
||||||
|
@ -171,7 +167,8 @@ cdef class Vocab:
|
||||||
if not is_oov:
|
if not is_oov:
|
||||||
key = hash_string(string)
|
key = hash_string(string)
|
||||||
self._add_lex_to_vocab(key, lex)
|
self._add_lex_to_vocab(key, lex)
|
||||||
assert lex != NULL, string
|
if lex == NULL:
|
||||||
|
raise ValueError(Errors.E085.format(string=string))
|
||||||
return lex
|
return lex
|
||||||
|
|
||||||
cdef int _add_lex_to_vocab(self, hash_t key, const LexemeC* lex) except -1:
|
cdef int _add_lex_to_vocab(self, hash_t key, const LexemeC* lex) except -1:
|
||||||
|
@ -254,7 +251,7 @@ cdef class Vocab:
|
||||||
width, you have to call this to change the size of the vectors.
|
width, you have to call this to change the size of the vectors.
|
||||||
"""
|
"""
|
||||||
if width is not None and shape is not None:
|
if width is not None and shape is not None:
|
||||||
raise ValueError("Only one of width and shape can be specified")
|
raise ValueError(Errors.E065.format(width=width, shape=shape))
|
||||||
elif shape is not None:
|
elif shape is not None:
|
||||||
self.vectors = Vectors(shape=shape)
|
self.vectors = Vectors(shape=shape)
|
||||||
else:
|
else:
|
||||||
|
@ -381,7 +378,8 @@ cdef class Vocab:
|
||||||
self.lexemes_from_bytes(file_.read())
|
self.lexemes_from_bytes(file_.read())
|
||||||
if self.vectors is not None:
|
if self.vectors is not None:
|
||||||
self.vectors.from_disk(path, exclude='strings.json')
|
self.vectors.from_disk(path, exclude='strings.json')
|
||||||
link_vectors_to_models(self)
|
if self.vectors.name is not None:
|
||||||
|
link_vectors_to_models(self)
|
||||||
return self
|
return self
|
||||||
|
|
||||||
def to_bytes(self, **exclude):
|
def to_bytes(self, **exclude):
|
||||||
|
@ -421,6 +419,8 @@ cdef class Vocab:
|
||||||
('vectors', lambda b: serialize_vectors(b))
|
('vectors', lambda b: serialize_vectors(b))
|
||||||
))
|
))
|
||||||
util.from_bytes(bytes_data, setters, exclude)
|
util.from_bytes(bytes_data, setters, exclude)
|
||||||
|
if self.vectors.name is not None:
|
||||||
|
link_vectors_to_models(self)
|
||||||
return self
|
return self
|
||||||
|
|
||||||
def lexemes_to_bytes(self):
|
def lexemes_to_bytes(self):
|
||||||
|
@ -468,7 +468,10 @@ cdef class Vocab:
|
||||||
if ptr == NULL:
|
if ptr == NULL:
|
||||||
continue
|
continue
|
||||||
py_str = self.strings[lexeme.orth]
|
py_str = self.strings[lexeme.orth]
|
||||||
assert self.strings[py_str] == lexeme.orth, (py_str, lexeme.orth)
|
if self.strings[py_str] != lexeme.orth:
|
||||||
|
raise ValueError(Errors.E086.format(string=py_str,
|
||||||
|
orth_id=lexeme.orth,
|
||||||
|
hash_id=self.strings[py_str]))
|
||||||
key = hash_string(py_str)
|
key = hash_string(py_str)
|
||||||
self._by_hash.set(key, lexeme)
|
self._by_hash.set(key, lexeme)
|
||||||
self._by_orth.set(lexeme.orth, lexeme)
|
self._by_orth.set(lexeme.orth, lexeme)
|
||||||
|
@ -509,16 +512,3 @@ def unpickle_vocab(sstore, vectors, morphology, data_dir,
|
||||||
|
|
||||||
|
|
||||||
copy_reg.pickle(Vocab, pickle_vocab, unpickle_vocab)
|
copy_reg.pickle(Vocab, pickle_vocab, unpickle_vocab)
|
||||||
|
|
||||||
|
|
||||||
class LookupError(Exception):
|
|
||||||
@classmethod
|
|
||||||
def mismatched_strings(cls, id_, id_string, original_string):
|
|
||||||
return cls(
|
|
||||||
"Error fetching a Lexeme from the Vocab. When looking up a "
|
|
||||||
"string, the lexeme returned had an orth ID that did not match "
|
|
||||||
"the query string. This means that the cached lexeme structs are "
|
|
||||||
"mismatched to the string encoding table. The mismatched:\n"
|
|
||||||
"Query string: {}\n"
|
|
||||||
"Orth cached: {}\n"
|
|
||||||
"Orth ID: {}".format(repr(original_string), repr(id_string), id_))
|
|
||||||
|
|
|
@ -1,7 +1,7 @@
|
||||||
{
|
{
|
||||||
"globals": {
|
"globals": {
|
||||||
"title": "spaCy",
|
"title": "spaCy",
|
||||||
"description": "spaCy is a free open-source library featuring state-of-the-art speed and accuracy and a powerful Python API.",
|
"description": "spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.",
|
||||||
|
|
||||||
"SITENAME": "spaCy",
|
"SITENAME": "spaCy",
|
||||||
"SLOGAN": "Industrial-strength Natural Language Processing in Python",
|
"SLOGAN": "Industrial-strength Natural Language Processing in Python",
|
||||||
|
@ -10,10 +10,13 @@
|
||||||
|
|
||||||
"COMPANY": "Explosion AI",
|
"COMPANY": "Explosion AI",
|
||||||
"COMPANY_URL": "https://explosion.ai",
|
"COMPANY_URL": "https://explosion.ai",
|
||||||
"DEMOS_URL": "https://demos.explosion.ai",
|
"DEMOS_URL": "https://explosion.ai/demos",
|
||||||
"MODELS_REPO": "explosion/spacy-models",
|
"MODELS_REPO": "explosion/spacy-models",
|
||||||
|
"KERNEL_BINDER": "ines/spacy-binder",
|
||||||
|
"KERNEL_PYTHON": "python3",
|
||||||
|
|
||||||
"SPACY_VERSION": "2.0",
|
"SPACY_VERSION": "2.0",
|
||||||
|
"BINDER_VERSION": "2.0.11",
|
||||||
|
|
||||||
"SOCIAL": {
|
"SOCIAL": {
|
||||||
"twitter": "spacy_io",
|
"twitter": "spacy_io",
|
||||||
|
@ -26,7 +29,8 @@
|
||||||
"NAVIGATION": {
|
"NAVIGATION": {
|
||||||
"Usage": "/usage",
|
"Usage": "/usage",
|
||||||
"Models": "/models",
|
"Models": "/models",
|
||||||
"API": "/api"
|
"API": "/api",
|
||||||
|
"Universe": "/universe"
|
||||||
},
|
},
|
||||||
|
|
||||||
"FOOTER": {
|
"FOOTER": {
|
||||||
|
@ -34,7 +38,7 @@
|
||||||
"Usage": "/usage",
|
"Usage": "/usage",
|
||||||
"Models": "/models",
|
"Models": "/models",
|
||||||
"API Reference": "/api",
|
"API Reference": "/api",
|
||||||
"Resources": "/usage/resources"
|
"Universe": "/universe"
|
||||||
},
|
},
|
||||||
"Support": {
|
"Support": {
|
||||||
"Issue Tracker": "https://github.com/explosion/spaCy/issues",
|
"Issue Tracker": "https://github.com/explosion/spaCy/issues",
|
||||||
|
@ -82,8 +86,8 @@
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
|
|
||||||
"V_CSS": "2.0.1",
|
"V_CSS": "2.1.2",
|
||||||
"V_JS": "2.0.1",
|
"V_JS": "2.1.0",
|
||||||
"DEFAULT_SYNTAX": "python",
|
"DEFAULT_SYNTAX": "python",
|
||||||
"ANALYTICS": "UA-58931649-1",
|
"ANALYTICS": "UA-58931649-1",
|
||||||
"MAILCHIMP": {
|
"MAILCHIMP": {
|
||||||
|
|
|
@ -15,12 +15,39 @@
|
||||||
- MODEL_META = public.models._data.MODEL_META
|
- MODEL_META = public.models._data.MODEL_META
|
||||||
- MODEL_LICENSES = public.models._data.MODEL_LICENSES
|
- MODEL_LICENSES = public.models._data.MODEL_LICENSES
|
||||||
- MODEL_BENCHMARKS = public.models._data.MODEL_BENCHMARKS
|
- MODEL_BENCHMARKS = public.models._data.MODEL_BENCHMARKS
|
||||||
|
- EXAMPLE_SENT_LANGS = public.models._data.EXAMPLE_SENT_LANGS
|
||||||
- EXAMPLE_SENTENCES = public.models._data.EXAMPLE_SENTENCES
|
- EXAMPLE_SENTENCES = public.models._data.EXAMPLE_SENTENCES
|
||||||
|
|
||||||
- IS_PAGE = (SECTION != "index") && !landing
|
- IS_PAGE = (SECTION != "index") && !landing
|
||||||
- IS_MODELS = (SECTION == "models" && LANGUAGES[current.source])
|
- IS_MODELS = (SECTION == "models" && LANGUAGES[current.source])
|
||||||
- HAS_MODELS = IS_MODELS && CURRENT_MODELS.length
|
- HAS_MODELS = IS_MODELS && CURRENT_MODELS.length
|
||||||
|
|
||||||
|
//- Get page URL
|
||||||
|
|
||||||
|
- function getPageUrl() {
|
||||||
|
- var path = current.path;
|
||||||
|
- if(path[path.length - 1] == 'index') path = path.slice(0, path.length - 1);
|
||||||
|
- return `${SITE_URL}/${path.join('/')}`;
|
||||||
|
- }
|
||||||
|
|
||||||
|
//- Get pretty page title depending on section
|
||||||
|
|
||||||
|
- function getPageTitle() {
|
||||||
|
- var sections = ['api', 'usage', 'models'];
|
||||||
|
- if (sections.includes(SECTION)) {
|
||||||
|
- var titleSection = (SECTION == "api") ? 'API' : SECTION.charAt(0).toUpperCase() + SECTION.slice(1);
|
||||||
|
- return `${title} · ${SITENAME} ${titleSection} Documentation`;
|
||||||
|
- }
|
||||||
|
- else if (SECTION != 'index') return `${title} · ${SITENAME}`;
|
||||||
|
- return `${SITENAME} · ${SLOGAN}`;
|
||||||
|
- }
|
||||||
|
|
||||||
|
//- Get social image based on section and settings
|
||||||
|
|
||||||
|
- function getPageImage() {
|
||||||
|
- var img = (SECTION == 'api') ? 'api' : 'default';
|
||||||
|
- return `${SITE_URL}/assets/img/social/preview_${preview || img}.jpg`;
|
||||||
|
- }
|
||||||
|
|
||||||
//- Add prefixes to items of an array (for modifier CSS classes)
|
//- Add prefixes to items of an array (for modifier CSS classes)
|
||||||
array - [array] list of class names or options, e.g. ["foot"]
|
array - [array] list of class names or options, e.g. ["foot"]
|
||||||
|
|
|
@@ -7,7 +7,7 @@ include _functions
 id - [string] anchor assigned to section (used for breadcrumb navigation)

 mixin section(id)
-section.o-section(id="section-" + id data-section=id)
+section.o-section(id=id ? "section-" + id : null data-section=id)&attributes(attributes)
 block

@@ -143,7 +143,7 @@ mixin aside-wrapper(label, emoji)

 mixin aside(label, emoji)
 +aside-wrapper(label, emoji)
-.c-aside__text.u-text-small
+.c-aside__text.u-text-small&attributes(attributes)
 block

@@ -154,7 +154,7 @@ mixin aside(label, emoji)
 prompt - [string] prompt displayed before first line, e.g. "$"

 mixin aside-code(label, language, prompt)
-+aside-wrapper(label)
++aside-wrapper(label)&attributes(attributes)
 +code(false, language, prompt).o-no-block
 block

@@ -165,7 +165,7 @@ mixin aside-code(label, language, prompt)
 argument to be able to wrap it for spacing

 mixin infobox(label, emoji)
-aside.o-box.o-block.u-text-small
+aside.o-box.o-block.u-text-small&attributes(attributes)
 if label
 h3.u-heading.u-text-label.u-color-theme
 if emoji

@@ -242,7 +242,9 @@ mixin button(url, trusted, ...style)
 wrap - [boolean] wrap text and disable horizontal scrolling

 mixin code(label, language, prompt, height, icon, wrap)
-pre.c-code-block.o-block(class="lang-#{(language || DEFAULT_SYNTAX)}" class=icon ? "c-code-block--has-icon" : null style=height ? "height: #{height}px" : null)&attributes(attributes)
+- var lang = (language != "none") ? (language || DEFAULT_SYNTAX) : null
+- var lang_class = (language != "none") ? "lang-" + (language || DEFAULT_SYNTAX) : null
+pre.c-code-block.o-block(data-language=lang class=lang_class class=icon ? "c-code-block--has-icon" : null style=height ? "height: #{height}px" : null)&attributes(attributes)
 if label
 h4.u-text-label.u-text-label--dark=label
 if icon

@@ -253,6 +255,15 @@ mixin code(label, language, prompt, height, icon, wrap)
 code.c-code-block__content(class=wrap ? "u-wrap" : null data-prompt=prompt)
 block

+//- Executable code
+
+mixin code-exec(label, large)
+- label = (label || "Editable code example") + " (experimental)"
++terminal-wrapper(label, !large)
+figure.thebelab-wrapper
+span.thebelab-wrapper__text.u-text-tiny v#{BINDER_VERSION} · Python 3 · via #[+a("https://mybinder.org/").u-hide-link Binder]
++code(data-executable="true")&attributes(attributes)
+block
+
 //- Wrapper for code blocks to display old/new versions

@@ -658,12 +669,16 @@ mixin qs(data, style)
 //- Terminal-style code window
 label - [string] title displayed in top bar of terminal window

-mixin terminal(label, button_text, button_url)
-.x-terminal
-.x-terminal__icons: span
-.u-padding-small.u-text-label.u-text-center=label
-+code.x-terminal__code
+mixin terminal-wrapper(label, small)
+.x-terminal(class=small ? "x-terminal--small" : null)
+.x-terminal__icons(class=small ? "x-terminal__icons--small" : null): span
+.u-padding-small.u-text-center(class=small ? "u-text-tiny" : "u-text")
+strong=label
+block
+
+mixin terminal(label, button_text, button_url, exec)
++terminal-wrapper(label)
++code.x-terminal__code(data-executable=exec ? "" : null)
 block

 if button_text && button_url

@@ -10,10 +10,7 @@ nav.c-nav.u-text.js-nav(class=landing ? "c-nav--theme" : null)
 li.c-nav__menu__item(class=is_active ? "is-active" : null)
 +a(url)(tabindex=is_active ? "-1" : null)=item

-li.c-nav__menu__item.u-hidden-xs
-+a("https://survey.spacy.io", true) User Survey 2018
-
-li.c-nav__menu__item.u-hidden-xs
+li.c-nav__menu__item
 +a(gh("spaCy"))(aria-label="GitHub") #[+icon("github", 20)]

 progress.c-progress.js-progress(value="0" max="1")

@@ -1,77 +1,110 @@
 //- 💫 INCLUDES > MODELS PAGE TEMPLATE

 for id in CURRENT_MODELS
+- var comps = getModelComponents(id)
 +section(id)
-+grid("vcenter").o-no-block(id=id)
-+grid-col("two-thirds")
-+h(2)
-+a("#" + id).u-permalink=id
+section(data-vue=id data-model=id)
++grid("vcenter").o-no-block(id=id)
++grid-col("two-thirds")
++h(2)
++a("#" + id).u-permalink=id

 +grid-col("third").u-text-right
 .u-color-subtle.u-text-tiny
-+button(gh("spacy-models") + "/releases", true, "secondary", "small")(data-tpl=id data-tpl-key="download")
++button(gh("spacy-models") + "/releases", true, "secondary", "small")(v-bind:href="releaseUrl")
 | Release details
-.u-padding-small Latest: #[code(data-tpl=id data-tpl-key="version") n/a]
+.u-padding-small Latest: #[code(v-text="version") n/a]

 +aside-code("Installation", "bash", "$").
 python -m spacy download #{id}

-- var comps = getModelComponents(id)
-p(data-tpl=id data-tpl-key="description")
-div(data-tpl=id data-tpl-key="error")
-+infobox
+p(v-if="description" v-text="description")
+
++infobox(v-if="error")
 | Unable to load model details from GitHub. To find out more
 | about this model, see the overview of the
 | #[+a(gh("spacy-models") + "/releases") latest model releases].

-+table.o-block-small(data-tpl=id data-tpl-key="table")
-+row
-+cell #[+label Language]
-+cell #[+tag=comps.lang] #{LANGUAGES[comps.lang]}
-for comp, label in {"Type": comps.type, "Genre": comps.genre}
-+row
-+cell #[+label=label]
-+cell #[+tag=comp] #{MODEL_META[comp]}
-+row
-+cell #[+label Size]
-+cell #[+tag=comps.size] #[span(data-tpl=id data-tpl-key="size") #[em n/a]]
++table.o-block-small(v-bind:data-loading="loading")
++row
++cell #[+label Language]
++cell #[+tag=comps.lang] #{LANGUAGES[comps.lang]}
+for comp, label in {"Type": comps.type, "Genre": comps.genre}
++row
++cell #[+label=label]
++cell #[+tag=comp] #{MODEL_META[comp]}
++row
++cell #[+label Size]
++cell #[+tag=comps.size] #[span(v-text="sizeFull" v-if="sizeFull")] #[em(v-else="") n/a]

-each label in ["Pipeline", "Vectors", "Sources", "Author", "License"]
-- var field = label.toLowerCase()
-if field == "vectors"
-- field = "vecs"
-+row
-+cell.u-nowrap
-+label=label
-if MODEL_META[field]
-| #[+help(MODEL_META[field]).u-color-subtle]
-+cell
-span(data-tpl=id data-tpl-key=field) #[em n/a]
++row(v-if="pipeline && pipeline.length" v-cloak="")
++cell
++label Pipeline #[+help(MODEL_META.pipeline).u-color-subtle]
++cell
+span(v-for="(pipe, index) in pipeline" v-if="pipeline")
+code(v-text="pipe")
+span(v-if="index != pipeline.length - 1") ,

-+row(data-tpl=id data-tpl-key="compat-wrapper" hidden="")
-+cell
-+label Compat #[+help("Latest compatible model version for your spaCy installation").u-color-subtle]
-+cell
-.o-field.u-float-left
-select.o-field__select.u-text-small(data-tpl=id data-tpl-key="compat")
-div(data-tpl=id data-tpl-key="compat-versions")
++row(v-if="vectors" v-cloak="")
++cell
++label Vectors #[+help(MODEL_META.vectors).u-color-subtle]
++cell(v-text="vectors")

-section(data-tpl=id data-tpl-key="benchmarks" hidden="")
-+grid.o-block-small
++row(v-if="sources && sources.length" v-cloak="")
++cell
++label Sources #[+help(MODEL_META.sources).u-color-subtle]
++cell
+span(v-for="(source, index) in sources") {{ source }}
+span(v-if="index != sources.length - 1") ,
+
++row(v-if="author" v-cloak="")
++cell #[+label Author]
++cell
++a("")(v-bind:href="url" v-if="url" v-text="author")
+span(v-else="" v-text="author") {{ model.author }}
+
++row(v-if="license" v-cloak="")
++cell #[+label License]
++cell
++a("")(v-bind:href="modelLicenses[license]" v-if="modelLicenses[license]") {{ license }}
+span(v-else="") {{ license }}
+
++row(v-cloak="")
++cell #[+label Compat #[+help(MODEL_META.compat).u-color-subtle]]
++cell
+.o-field.u-float-left
+select.o-field__select.u-text-small(v-model="spacyVersion")
+option(v-for="version in orderedCompat" v-bind:value="version") spaCy v{{ version }}
+code(v-if="compatVersion" v-text="compatVersion")
+em(v-else="") not compatible
+
++grid.o-block-small(v-cloak="" v-if="hasAccuracy")
 for keys, label in MODEL_BENCHMARKS
-.u-flex-full.u-padding-small(data-tpl=id data-tpl-key=label.toLowerCase() hidden="")
+.u-flex-full.u-padding-small
 +table.o-block-small
 +row("head")
 +head-cell(colspan="2")=(MODEL_META["benchmark_" + label] || label)
 for label, field in keys
-+row(hidden="")
++row
 +cell.u-nowrap
 +label=label
 if MODEL_META[field]
 | #[+help(MODEL_META[field]).u-color-subtle]
-+cell("num")(data-tpl=id data-tpl-key=field)
-| n/a
++cell("num")
+span(v-if="#{field}" v-text="#{field}")
+em(v-if="!#{field}") n/a
+
+p.u-text-small.u-color-dark(v-if="notes" v-text="notes" v-cloak="")
+
+if comps.size == "sm" && EXAMPLE_SENT_LANGS.includes(comps.lang)
+section
++code-exec("Test the model live").
+import spacy
+from spacy.lang.#{comps.lang}.examples import sentences
+
+nlp = spacy.load('#{id}')
+doc = nlp(sentences[0])
+print(doc.text)
+for token in doc:
+print(token.text, token.pos_, token.dep_)

-p.u-text-small.u-color-dark(data-tpl=id data-tpl-key="notes")

@@ -1,86 +1,33 @@
 //- 💫 INCLUDES > SCRIPTS

-if quickstart
-script(src="/assets/js/vendor/quickstart.min.js")
+if IS_PAGE || SECTION == "index"
+script(type="text/x-thebe-config")
+| { bootstrap: true, binderOptions: { repo: "#{KERNEL_BINDER}"},
+| kernelOptions: { name: "#{KERNEL_PYTHON}" }}

-if IS_PAGE
-script(src="/assets/js/vendor/in-view.min.js")
+- scripts = ["vendor/prism.min", "vendor/vue.min"]
+- if (SECTION == "universe") scripts.push("vendor/vue-markdown.min")
+- if (quickstart) scripts.push("vendor/quickstart.min")
+- if (IS_PAGE) scripts.push("vendor/in-view.min")
+- if (IS_PAGE || SECTION == "index") scripts.push("vendor/thebelab.custom.min")
+
+for script in scripts
+script(src="/assets/js/" + script + ".js")
+script(src="/assets/js/main.js?v#{V_JS}" type=(environment == "deploy") ? null : "module")

 if environment == "deploy"
-script(async src="https://www.google-analytics.com/analytics.js")
-
-script(src="/assets/js/vendor/prism.min.js")
-
-if compare_models
-script(src="/assets/js/vendor/chart.min.js")
-
-script
-if quickstart
-| new Quickstart("#qs");
-
-if environment == "deploy"
+script(src="https://www.google-analytics.com/analytics.js", async)
+script
 | window.ga=window.ga||function(){
 | (ga.q=ga.q||[]).push(arguments)}; ga.l=+new Date;
 | ga('create', '#{ANALYTICS}', 'auto'); ga('send', 'pageview');

 if IS_PAGE
+script(src="https://sidecar.gitter.im/dist/sidecar.v1.js" async defer)
+script
 | ((window.gitter = {}).chat = {}).options = {
 | useStyles: false,
 | activationElement: '.js-gitter-button',
 | targetElement: '.js-gitter',
 | room: '!{SOCIAL.gitter}'
 | };

-if IS_PAGE
-script(src="https://sidecar.gitter.im/dist/sidecar.v1.js" async defer)
-
-
-//- JS modules – slightly hacky, but necessary to dynamically instantiate the
-classes with data from the Harp JSON files, while still being able to
-support older browsers that can't handle JS modules. More details:
-https://medium.com/dev-channel/es6-modules-in-chrome-canary-m60-ba588dfb8ab7
-
-- ProgressBar = "new ProgressBar('.js-progress');"
-- Accordion = "new Accordion('.js-accordion');"
-- Changelog = "new Changelog('" + SOCIAL.github + "', 'spacy');"
-- NavHighlighter = "new NavHighlighter('data-section', 'data-nav');"
-- GitHubEmbed = "new GitHubEmbed('" + SOCIAL.github + "', 'data-gh-embed');"
-- ModelLoader = "new ModelLoader('" + MODELS_REPO + "'," + JSON.stringify(CURRENT_MODELS) + "," + JSON.stringify(MODEL_LICENSES) + "," + JSON.stringify(MODEL_BENCHMARKS) + ");"
-- ModelComparer = "new ModelComparer('" + MODELS_REPO + "'," + JSON.stringify(MODEL_LICENSES) + "," + JSON.stringify(MODEL_BENCHMARKS) + "," + JSON.stringify(LANGUAGES) + "," + JSON.stringify(MODEL_META) + "," + JSON.stringify(default_models || false) + ");"
-
-if environment == "deploy"
-//- DEPLOY: use compiled rollup.js and instantiate classes directly
-script(src="/assets/js/rollup.js?v#{V_JS}")
-script
-!=ProgressBar
-if changelog
-!=Changelog
-if IS_PAGE
-!=NavHighlighter
-!=GitHubEmbed
-!=Accordion
-if HAS_MODELS
-!=ModelLoader
-if compare_models
-!=ModelComparer
-else
-//- DEVELOPMENT: Use ES6 modules
-script(type="module")
-| import ProgressBar from '/assets/js/progress.js';
-!=ProgressBar
-if changelog
-| import Changelog from '/assets/js/changelog.js';
-!=Changelog
-if IS_PAGE
-| import NavHighlighter from '/assets/js/nav-highlighter.js';
-!=NavHighlighter
-| import GitHubEmbed from '/assets/js/github-embed.js';
-!=GitHubEmbed
-| import Accordion from '/assets/js/accordion.js';
-!=Accordion
-if HAS_MODELS
-| import { ModelLoader } from '/assets/js/models.js';
-!=ModelLoader
-if compare_models
-| import { ModelComparer } from '/assets/js/models.js';
-!=ModelComparer

@@ -7,6 +7,12 @@ svg(style="position: absolute; visibility: hidden; width: 0; height: 0;" width="
 symbol#svg_github(viewBox="0 0 27 32")
path(d="M13.714 2.286q3.732 0 6.884 1.839t4.991 4.991 1.839 6.884q0 4.482-2.616 8.063t-6.759 4.955q-0.482 0.089-0.714-0.125t-0.232-0.536q0-0.054 0.009-1.366t0.009-2.402q0-1.732-0.929-2.536 1.018-0.107 1.83-0.321t1.679-0.696 1.446-1.188 0.946-1.875 0.366-2.688q0-2.125-1.411-3.679 0.661-1.625-0.143-3.643-0.5-0.161-1.446 0.196t-1.643 0.786l-0.679 0.429q-1.661-0.464-3.429-0.464t-3.429 0.464q-0.286-0.196-0.759-0.482t-1.491-0.688-1.518-0.241q-0.804 2.018-0.143 3.643-1.411 1.554-1.411 3.679 0 1.518 0.366 2.679t0.938 1.875 1.438 1.196 1.679 0.696 1.83 0.321q-0.696 0.643-0.875 1.839-0.375 0.179-0.804 0.268t-1.018 0.089-1.17-0.384-0.991-1.116q-0.339-0.571-0.866-0.929t-0.884-0.429l-0.357-0.054q-0.375 0-0.518 0.080t-0.089 0.205 0.161 0.25 0.232 0.214l0.125 0.089q0.393 0.179 0.777 0.679t0.563 0.911l0.179 0.411q0.232 0.679 0.786 1.098t1.196 0.536 1.241 0.125 0.991-0.063l0.411-0.071q0 0.679 0.009 1.58t0.009 0.973q0 0.321-0.232 0.536t-0.714 0.125q-4.143-1.375-6.759-4.955t-2.616-8.063q0-3.732 1.839-6.884t4.991-4.991 6.884-1.839zM5.196 21.982q0.054-0.125-0.125-0.214-0.179-0.054-0.232 0.036-0.054 0.125 0.125 0.214 0.161 0.107 0.232-0.036zM5.75 22.589q0.125-0.089-0.036-0.286-0.179-0.161-0.286-0.054-0.125 0.089 0.036 0.286 0.179 0.179 0.286 0.054zM6.286 23.393q0.161-0.125 0-0.339-0.143-0.232-0.304-0.107-0.161 0.089 0 0.321t0.304 0.125zM7.036 24.143q0.143-0.143-0.071-0.339-0.214-0.214-0.357-0.054-0.161 0.143 0.071 0.339 0.214 0.214 0.357 0.054zM8.054 24.589q0.054-0.196-0.232-0.286-0.268-0.071-0.339 0.125t0.232 0.268q0.268 0.107 0.339-0.107zM9.179 24.679q0-0.232-0.304-0.196-0.286 0-0.286 0.196 0 0.232 0.304 0.196 0.286 0 0.286-0.196zM10.214 24.5q-0.036-0.196-0.321-0.161-0.286 0.054-0.25 0.268t0.321 0.143 0.25-0.25z")

+symbol#svg_twitter(viewBox="0 0 30 32")
path(d="M28.929 7.286q-1.196 1.75-2.893 2.982 0.018 0.25 0.018 0.75 0 2.321-0.679 4.634t-2.063 4.437-3.295 3.759-4.607 2.607-5.768 0.973q-4.839 0-8.857-2.589 0.625 0.071 1.393 0.071 4.018 0 7.161-2.464-1.875-0.036-3.357-1.152t-2.036-2.848q0.589 0.089 1.089 0.089 0.768 0 1.518-0.196-2-0.411-3.313-1.991t-1.313-3.67v-0.071q1.214 0.679 2.607 0.732-1.179-0.786-1.875-2.054t-0.696-2.75q0-1.571 0.786-2.911 2.161 2.661 5.259 4.259t6.634 1.777q-0.143-0.679-0.143-1.321 0-2.393 1.688-4.080t4.080-1.688q2.5 0 4.214 1.821 1.946-0.375 3.661-1.393-0.661 2.054-2.536 3.179 1.661-0.179 3.321-0.893z")

+symbol#svg_website(viewBox="0 0 32 32")
path(d="M22.658 10.988h5.172c0.693 1.541 1.107 3.229 1.178 5.012h-5.934c-0.025-1.884-0.181-3.544-0.416-5.012zM20.398 3.896c2.967 1.153 5.402 3.335 6.928 6.090h-4.836c-0.549-2.805-1.383-4.799-2.092-6.090zM16.068 9.986v-6.996c1.066 0.047 2.102 0.216 3.092 0.493 0.75 1.263 1.719 3.372 2.33 6.503h-5.422zM9.489 22.014c-0.234-1.469-0.396-3.119-0.421-5.012h5.998v5.012h-5.577zM9.479 10.988h5.587v5.012h-6.004c0.025-1.886 0.183-3.543 0.417-5.012zM11.988 3.461c0.987-0.266 2.015-0.435 3.078-0.469v6.994h-5.422c0.615-3.148 1.591-5.265 2.344-6.525zM3.661 9.986c1.551-2.8 4.062-4.993 7.096-6.131-0.715 1.29-1.559 3.295-2.114 6.131h-4.982zM8.060 16h-6.060c0.066-1.781 0.467-3.474 1.158-5.012h5.316c-0.233 1.469-0.39 3.128-0.414 5.012zM8.487 22.014h-5.29c-0.694-1.543-1.139-3.224-1.204-5.012h6.071c0.024 1.893 0.188 3.541 0.423 5.012zM8.651 23.016c0.559 2.864 1.416 4.867 2.134 6.142-3.045-1.133-5.557-3.335-7.11-6.142h4.976zM15.066 23.016v6.994c-1.052-0.033-2.067-0.199-3.045-0.46-0.755-1.236-1.736-3.363-2.356-6.534h5.401zM21.471 23.016c-0.617 3.152-1.592 5.271-2.344 6.512-0.979 0.271-2.006 0.418-3.059 0.465v-6.977h5.403zM16.068 17.002h5.998c-0.023 1.893-0.188 3.542-0.422 5.012h-5.576v-5.012zM22.072 16h-6.004v-5.012h5.586c0.235 1.469 0.393 3.126 0.418 5.012zM23.070 17.002h5.926c-0.066 1.787-0.506 3.468-1.197 5.012h-5.152c0.234-1.471 0.398-3.119 0.423-5.012zM27.318 23.016c-1.521 2.766-3.967 4.949-6.947 6.1 0.715-1.276 1.561-3.266 2.113-6.1h4.834z")

 symbol#svg_code(viewBox="0 0 20 20")
path(d="M5.719 14.75c-0.236 0-0.474-0.083-0.664-0.252l-5.060-4.498 5.341-4.748c0.412-0.365 1.044-0.33 1.411 0.083s0.33 1.045-0.083 1.412l-3.659 3.253 3.378 3.002c0.413 0.367 0.45 0.999 0.083 1.412-0.197 0.223-0.472 0.336-0.747 0.336zM14.664 14.748l5.341-4.748-5.060-4.498c-0.413-0.367-1.045-0.33-1.411 0.083s-0.33 1.045 0.083 1.412l3.378 3.003-3.659 3.252c-0.413 0.367-0.45 0.999-0.083 1.412 0.197 0.223 0.472 0.336 0.747 0.336 0.236 0 0.474-0.083 0.664-0.252zM9.986 16.165l2-12c0.091-0.545-0.277-1.060-0.822-1.151-0.547-0.092-1.061 0.277-1.15 0.822l-2 12c-0.091 0.545 0.277 1.060 0.822 1.151 0.056 0.009 0.11 0.013 0.165 0.013 0.48 0 0.904-0.347 0.985-0.835z")
@@ -3,23 +3,15 @@
 include _includes/_mixins

 - title = IS_MODELS ? LANGUAGES[current.source] || title : title
-- social_title = (SECTION == "index") ? SITENAME + " - " + SLOGAN : title + " - " + SITENAME
-- social_img = SITE_URL + "/assets/img/social/preview_" + (preview || ALPHA ? "alpha" : "default") + ".jpg"
+- PAGE_URL = getPageUrl()
+- PAGE_TITLE = getPageTitle()
+- PAGE_IMAGE = getPageImage()

 doctype html
 html(lang="en")
 head
-title
-if SECTION == "api" || SECTION == "usage" || SECTION == "models"
-- var title_section = (SECTION == "api") ? "API" : SECTION.charAt(0).toUpperCase() + SECTION.slice(1)
-| #{title} | #{SITENAME} #{title_section} Documentation
-
-else if SECTION != "index"
-| #{title} | #{SITENAME}
-
-else
-| #{SITENAME} - #{SLOGAN}
+title=PAGE_TITLE

 meta(charset="utf-8")
 meta(name="viewport" content="width=device-width, initial-scale=1.0")
 meta(name="referrer" content="always")

@@ -27,23 +19,24 @@ html(lang="en")
 meta(property="og:type" content="website")
 meta(property="og:site_name" content=sitename)
-meta(property="og:url" content="#{SITE_URL}/#{current.path.join('/')}")
-meta(property="og:title" content=social_title)
+meta(property="og:url" content=PAGE_URL)
+meta(property="og:title" content=PAGE_TITLE)
 meta(property="og:description" content=description)
-meta(property="og:image" content=social_img)
+meta(property="og:image" content=PAGE_IMAGE)

 meta(name="twitter:card" content="summary_large_image")
 meta(name="twitter:site" content="@" + SOCIAL.twitter)
-meta(name="twitter:title" content=social_title)
+meta(name="twitter:title" content=PAGE_TITLE)
 meta(name="twitter:description" content=description)
-meta(name="twitter:image" content=social_img)
+meta(name="twitter:image" content=PAGE_IMAGE)

 link(rel="shortcut icon" href="/assets/img/favicon.ico")
 link(rel="icon" type="image/x-icon" href="/assets/img/favicon.ico")

 if SECTION == "api"
 link(href="/assets/css/style_green.css?v#{V_CSS}" rel="stylesheet")
+else if SECTION == "universe"
+link(href="/assets/css/style_purple.css?v#{V_CSS}" rel="stylesheet")
 else
 link(href="/assets/css/style.css?v#{V_CSS}" rel="stylesheet")

@@ -54,6 +47,9 @@ html(lang="en")
 if !landing
 include _includes/_page-docs

+else if SECTION == "universe"
+!=yield
+
 else
 main!=yield
 include _includes/_footer

@@ -29,7 +29,7 @@ p
 +ud-row("NUM", "numeral", "1, 2017, one, seventy-seven, IV, MMXIV")
 +ud-row("PART", "particle", "'s, not, ")
 +ud-row("PRON", "pronoun", "I, you, he, she, myself, themselves, somebody")
-+ud-row("PROPN", "proper noun", "Mary, John, Londin, NATO, HBO")
++ud-row("PROPN", "proper noun", "Mary, John, London, NATO, HBO")
 +ud-row("PUNCT", "punctuation", "., (, ), ?")
 +ud-row("SCONJ", "subordinating conjunction", "if, while, that")
 +ud-row("SYM", "symbol", "$, %, §, ©, +, −, ×, ÷, =, :), 😝")

@@ -1,5 +1,13 @@
 //- 💫 DOCS > API > ARCHITECTURE > NN MODEL ARCHITECTURE

+p
+| spaCy's statistical models have been custom-designed to give a
+| high-performance mix of speed and accuracy. The current architecture
+| hasn't been published yet, but in the meantime we prepared a video that
+| explains how the models work, with particular focus on NER.
+
++youtube("sqDHBH9IjRU")
+
 p
 | The parsing model is a blend of recent results. The two recent
 | inspirations have been the work of Eli Klipperwasser and Yoav Goldberg at

@@ -44,7 +52,7 @@ p
 +cell First two words of the buffer.

 +row
-+cell.u-nowrap
++cell
 | #[code S0L1], #[code S1L1], #[code S2L1], #[code B0L1],
 | #[code B1L1]#[br]
 | #[code S0L2], #[code S1L2], #[code S2L2], #[code B0L2],

@@ -54,7 +62,7 @@ p
 | #[code S2], #[code B0] and #[code B1].

 +row
-+cell.u-nowrap
++cell
 | #[code S0R1], #[code S1R1], #[code S2R1], #[code B0R1],
 | #[code B1R1]#[br]
 | #[code S0R2], #[code S1R2], #[code S2R2], #[code B0R2],

@@ -6,8 +6,7 @@ p
 | but somewhat ugly in Python. Logic that deals with Python or platform
 | compatibility only lives in #[code spacy.compat]. To distinguish them from
 | the builtin functions, replacement functions are suffixed with an
-| undersocre, e.e #[code unicode_]. For specific checks, spaCy uses the
-| #[code six] and #[code ftfy] packages.
+| undersocre, e.e #[code unicode_].

 +aside-code("Example").
 from spacy.compat import unicode_, json_dumps

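Aside (editor's illustration, not part of the commit): the hunk above documents the underscore-suffixed helpers in spacy.compat. A minimal usage sketch in Python, assuming only the two names already imported in the aside example, might look like this:

    # Illustrative sketch only; uses the two helpers shown in the aside above.
    from spacy.compat import unicode_, json_dumps

    text = unicode_("caf\u00e9")         # unicode_ is the text type of the running Python version
    print(isinstance(text, unicode_))    # True on both Python 2 and Python 3
    print(json_dumps({"text": text}))    # JSON serialization that behaves the same across versions
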
@@ -533,8 +533,10 @@ p
 +cell option
 +cell
 | Optional location of vectors file. Should be a tab-separated
-| file where the first column contains the word and the remaining
-| columns the values.
+| file in Word2Vec format where the first column contains the word
+| and the remaining columns the values. File can be provided in
+| #[code .txt] format or as a zipped text file in #[code .zip] or
+| #[code .tar.gz] format.

 +row
 +cell #[code --prune-vectors], #[code -V]

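Aside (editor's illustration, not part of the commit): the reworded cell above describes a tab-separated, Word2Vec-style layout with the word in the first column and the vector values after it. A minimal sketch of producing such a file; the file name and numbers here are made up:

    # Illustrative only: write a tiny vectors file in the layout described above,
    # one word per line, tab-separated, word first and the vector values after it.
    rows = [
        ("apple", [0.12, -0.33, 0.45]),
        ("banana", [0.08, -0.21, 0.51]),
    ]
    with open("vectors.txt", "w", encoding="utf8") as f:
        for word, values in rows:
            f.write(word + "\t" + "\t".join(str(v) for v in values) + "\n")
    # The .txt file could then be compressed to .zip or .tar.gz before being
    # passed to the CLI option documented in this hunk.
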
@@ -31,6 +31,7 @@
 $grid-gutter: 2rem

 margin-top: $grid-gutter
+min-width: 0 // hack to prevent overflow

 @include breakpoint(min, lg)
 display: flex

@@ -60,6 +60,13 @@
 padding-bottom: 4rem
 border-bottom: 1px dotted $color-subtle

+&.o-section--small
+overflow: auto
+
+&:not(:last-child)
+margin-bottom: 3.5rem
+padding-bottom: 2rem
+
 .o-block
 margin-bottom: 4rem

@@ -142,6 +149,14 @@
 .o-badge
 border-radius: 1em

+.o-thumb
+@include size(100px)
+overflow: hidden
+border-radius: 50%
+
+&.o-thumb--small
+@include size(35px)
+
 //- SVG

@@ -103,6 +103,9 @@
 &:hover
 color: $color-theme-dark

+.u-hand
+cursor: pointer
+
 .u-hide-link.u-hide-link
 border: none
 color: inherit

@@ -224,6 +227,7 @@
 $spinner-size: 75px
 $spinner-bar: 8px

+min-height: $spinner-size * 2
 position: relative

 & > *

@@ -245,10 +249,19 @@
 //- Hidden elements

-.u-hidden
-display: none
+.u-hidden,
+[v-cloak]
+display: none !important

 @each $breakpoint in (xs, sm, md)
 .u-hidden-#{$breakpoint}.u-hidden-#{$breakpoint}
 @include breakpoint(max, $breakpoint)
 display: none
+
+//- Transitions
+
+.u-fade-enter-active
+transition: opacity 0.5s
+
+.u-fade-enter
+opacity: 0

@@ -2,7 +2,8 @@
 //- Code block

-.c-code-block
+.c-code-block,
+.thebelab-cell
 background: $color-front
 color: darken($color-back, 20)
 padding: 0.75em 0

@@ -13,11 +14,11 @@
 white-space: pre
 direction: ltr

-&.c-code-block--has-icon
+.c-code-block--has-icon
 padding: 0
 display: flex
 border-top-left-radius: 0
 border-bottom-left-radius: 0

 .c-code-block__icon
 padding: 0 0 0 1rem

@@ -28,26 +29,66 @@
 &.c-code-block__icon--border
 border-left: 6px solid

 //- Code block content

-.c-code-block__content
+.c-code-block__content,
+.thebelab-input,
+.jp-OutputArea
 display: block
 font: normal normal 1.1rem/#{1.9} $font-code
 padding: 1em 2em

-&[data-prompt]:before,
+.c-code-block__content[data-prompt]:before,
 content: attr(data-prompt)
 margin-right: 0.65em
 display: inline-block
 vertical-align: middle
 opacity: 0.5

+//- Thebelab
+
+[data-executable]
+margin-bottom: 0
+
+.thebelab-input.thebelab-input
+padding: 3em 2em 1em
+
+.jp-OutputArea
+&:not(:empty)
+padding: 2rem 2rem 1rem
+border-top: 1px solid $color-dark
+margin-top: 2rem
+
+.entities, svg
+white-space: initial
+font-family: inherit
+
+.entities
+font-size: 1.35rem
+
+.jp-OutputArea pre
+font: inherit
+
+.jp-OutputPrompt.jp-OutputArea-prompt
+padding-top: 0.5em
+margin-right: 1rem
+font-family: inherit
+font-weight: bold
+
+.thebelab-run-button
+@extend .u-text-label, .u-text-label--dark
+
+.thebelab-wrapper
+position: relative
+
+.thebelab-wrapper__text
+@include position(absolute, top, right, 1.25rem, 1.25rem)
+color: $color-subtle-dark
+z-index: 10
+
 //- Code

-code
+code, .CodeMirror, .jp-RenderedText, .jp-OutputArea
 -webkit-font-smoothing: subpixel-antialiased
 -moz-osx-font-smoothing: auto

@@ -73,7 +114,7 @@ code
 text-shadow: none

-//- Syntax Highlighting
+//- Syntax Highlighting (Prism)

 [class*="language-"] .token
 &.comment, &.prolog, &.doctype, &.cdata, &.punctuation

@@ -103,3 +144,50 @@ code
 &.italic
 font-style: italic
+
+//- Syntax Highlighting (CodeMirror)
+
+.CodeMirror.cm-s-default
+background: $color-front
+color: darken($color-back, 20)
+
+.CodeMirror-selected
+background: $color-theme
+color: $color-back
+
+.CodeMirror-cursor
+border-left-color: currentColor
+
+.cm-variable-2
+color: inherit
+font-style: italic
+
+.cm-comment
+color: map-get($syntax-highlighting, comment)
+
+.cm-keyword, .cm-builtin
+color: map-get($syntax-highlighting, keyword)
+
+.cm-operator
+color: map-get($syntax-highlighting, operator)
+
+.cm-string
+color: map-get($syntax-highlighting, selector)
+
+.cm-number
+color: map-get($syntax-highlighting, number)
+
+.cm-def
+color: map-get($syntax-highlighting, function)
+
+//- Syntax highlighting (Jupyter)
+
+.jp-RenderedText pre
+.ansi-cyan-fg
+color: map-get($syntax-highlighting, function)
+
+.ansi-green-fg
+color: $color-green
+
+.ansi-red-fg
+color: map-get($syntax-highlighting, operator)

@@ -8,10 +8,20 @@
 width: 100%
 position: relative

+&.x-terminal--small
+background: $color-dark
+color: $color-subtle
+border-radius: 4px
+margin-bottom: 4rem
+
 .x-terminal__icons
+display: none
 position: absolute
 padding: 10px

+@include breakpoint(min, sm)
+display: block
+
 &:before,
 &:after,
 span

@@ -32,6 +42,12 @@
 content: ""
 background: $color-yellow

+&.x-terminal__icons--small
+&:before,
+&:after,
+span
+@include size(10px)
+
 .x-terminal__code
 margin: 0
 border: none

|
@ -9,7 +9,7 @@
|
||||||
display: flex
|
display: flex
|
||||||
justify-content: space-between
|
justify-content: space-between
|
||||||
flex-flow: row nowrap
|
flex-flow: row nowrap
|
||||||
padding: 0 2rem 0 1rem
|
padding: 0 0 0 1rem
|
||||||
z-index: 30
|
z-index: 30
|
||||||
width: 100%
|
width: 100%
|
||||||
box-shadow: $box-shadow
|
box-shadow: $box-shadow
|
||||||
|
@ -21,11 +21,20 @@
|
||||||
.c-nav__menu
|
.c-nav__menu
|
||||||
@include size(100%)
|
@include size(100%)
|
||||||
display: flex
|
display: flex
|
||||||
justify-content: flex-end
|
|
||||||
flex-flow: row nowrap
|
flex-flow: row nowrap
|
||||||
border-color: inherit
|
border-color: inherit
|
||||||
flex: 1
|
flex: 1
|
||||||
|
|
||||||
|
|
||||||
|
@include breakpoint(max, sm)
|
||||||
|
@include scroll-shadow-base($color-front)
|
||||||
|
overflow-x: auto
|
||||||
|
overflow-y: hidden
|
||||||
|
-webkit-overflow-scrolling: touch
|
||||||
|
|
||||||
|
@include breakpoint(min, md)
|
||||||
|
justify-content: flex-end
|
||||||
|
|
||||||
.c-nav__menu__item
|
.c-nav__menu__item
|
||||||
display: flex
|
display: flex
|
||||||
align-items: center
|
align-items: center
|
||||||
|
@ -39,6 +48,14 @@
|
||||||
&:not(:first-child)
|
&:not(:first-child)
|
||||||
margin-left: 2em
|
margin-left: 2em
|
||||||
|
|
||||||
|
&:last-child
|
||||||
|
@include scroll-shadow-cover(right, $color-back)
|
||||||
|
padding-right: 2rem
|
||||||
|
|
||||||
|
&:first-child
|
||||||
|
@include scroll-shadow-cover(left, $color-back)
|
||||||
|
padding-left: 2rem
|
||||||
|
|
||||||
&.is-active
|
&.is-active
|
||||||
color: $color-dark
|
color: $color-dark
|
||||||
pointer-events: none
|
pointer-events: none
|
||||||
|
|
|
@@ -26,7 +26,7 @@ $font-code: Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace
 // Colors

-$colors: ( blue: #09a3d5, green: #05b083 )
+$colors: ( blue: #09a3d5, green: #05b083, purple: #6542d1 )

 $color-back: #fff !default
 $color-front: #1a1e23 !default

4
website/assets/css/style_purple.sass
Normal file

@@ -0,0 +1,4 @@
+//- 💫 STYLESHEET (PURPLE)
+
+$theme: purple
+@import style

BIN
website/assets/img/pattern_purple.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 204 KiB

BIN
website/assets/img/resources/spacy-vis.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 8.5 KiB

BIN
website/assets/img/social/preview_api.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 144 KiB

BIN
website/assets/img/social/preview_universe.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 118 KiB

Some files were not shown because too many files have changed in this diff.