Mirror of https://github.com/explosion/spaCy.git (synced 2024-12-26 01:46:28 +03:00)

Update develop from master
Commit 2338e8c7fc
.github/CONTRIBUTOR_AGREEMENT.md (vendored): 6 changed lines
@@ -78,7 +78,7 @@ took place before the date you sign these terms.
   * each contribution shall be in compliance with U.S. export control laws and
     other applicable export and import laws. You agree to notify us if you
     become aware of any circumstance which would make any of the foregoing
-    representations inaccurate in any respect. We may publicly disclose your
+    representations inaccurate in any respect. We may publicly disclose your
     participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable

@@ -87,11 +87,11 @@ U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
   mark both statements:

-   * [x] I am signing on behalf of myself as an individual and no other person
+   * [ ] I am signing on behalf of myself as an individual and no other person
     or entity, including my employer, has or will have rights with respect to my
     contributions.

-   * [x] I am signing on behalf of my employer or a legal entity and I have the
+   * [ ] I am signing on behalf of my employer or a legal entity and I have the
     actual authority to contractually bind that entity.

## Contributor Details
.github/contributors/ivyleavedtoadflax.md (new file, vendored): 106 lines
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                  |
|------------------------------- | ---------------------- |
| Name                           | Matthew Upson          |
| Company name (if applicable)   |                        |
| Title or role (if applicable)  |                        |
| Date                           | 2018-04-24             |
| GitHub username                | ivyleavedtoadflax      |
| Website (optional)             | www.machinegurning.com |
.github/contributors/katrinleinweber.md (new file, vendored): 106 lines

The body of the file repeats the spaCy contributor agreement text shown in full above; the signed statement and contributor details are:

* [x] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry            |
|------------------------------- | ---------------- |
| Name                           | Katrin Leinweber |
| Company name (if applicable)   |                  |
| Title or role (if applicable)  |                  |
| Date                           | 2018-03-30       |
| GitHub username                | katrinleinweber  |
| Website (optional)             |                  |
.github/contributors/miroli.md (new file, vendored): 106 lines

The body of the file repeats the spaCy contributor agreement text shown in full above; the signed statement and contributor details are:

* [x] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry            |
|------------------------------- | ---------------- |
| Name                           | Robin Linderborg |
| Company name (if applicable)   |                  |
| Title or role (if applicable)  |                  |
| Date                           | 2018-04-23       |
| GitHub username                | miroli           |
| Website (optional)             |                  |
.github/contributors/mollerhoj.md (new file, vendored): 106 lines

The body of the file repeats the spaCy contributor agreement text shown in full above; the statement section and contributor details are:

* [ ] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry               |
|------------------------------- | ------------------- |
| Name                           | Jens Dahl Mollerhoj |
| Company name (if applicable)   |                     |
| Title or role (if applicable)  |                     |
| Date                           | 4/04/2018           |
| GitHub username                | mollerhoj           |
| Website (optional)             |                     |
.github/contributors/skrcode.md (new file, vendored): 106 lines

The body of the file repeats the spaCy contributor agreement text shown in full above; the signed statement and contributor details are:

* [x] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry       |
|------------------------------- | ----------- |
| Name                           | Suraj Rajan |
| Company name (if applicable)   |             |
| Title or role (if applicable)  |             |
| Date                           | 31/Mar/2018 |
| GitHub username                | skrcode     |
| Website (optional)             |             |
.github/contributors/trungtv.md (new file, vendored): 106 lines

The body of the file repeats the spaCy contributor agreement text shown in full above; the signed statement and contributor details are:

* [x] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry              |
|------------------------------- | ------------------ |
| Name                           | Viet-Trung Tran    |
| Company name (if applicable)   |                    |
| Title or role (if applicable)  |                    |
| Date                           | 2018-03-28         |
| GitHub username                | trungtv            |
| Website (optional)             | https://datalab.vn |
CITATION (new file): 6 lines

@ARTICLE{spacy2,
  AUTHOR = {Honnibal, Matthew AND Montani, Ines},
  TITLE = {spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing},
  YEAR = {2017},
  JOURNAL = {To appear}
}
CONTRIBUTING.md

@@ -73,28 +73,8 @@ so it only becomes visible on click, making the issue easier to read and follow.
### Issue labels

To distinguish issues that are opened by us, the maintainers, we usually add a
-💫 to the title. We also use the following system to tag our issues and pull
-requests:
-
-| Issue label | Description |
-| --- | --- |
-| [`bug`](https://github.com/explosion/spaCy/labels/bug) | Bugs and behaviour differing from documentation |
-| [`enhancement`](https://github.com/explosion/spaCy/labels/enhancement) | Feature requests and improvements |
-| [`install`](https://github.com/explosion/spaCy/labels/install) | Installation problems |
-| [`performance`](https://github.com/explosion/spaCy/labels/performance) | Accuracy, speed and memory use problems |
-| [`tests`](https://github.com/explosion/spaCy/labels/tests) | Missing or incorrect [tests](spacy/tests) |
-| [`docs`](https://github.com/explosion/spaCy/labels/docs), [`examples`](https://github.com/explosion/spaCy/labels/examples) | Issues related to the [documentation](https://spacy.io/docs) and [examples](spacy/examples) |
-| [`training`](https://github.com/explosion/spaCy/labels/training) | Issues related to training and updating models |
-| [`models`](https://github.com/explosion/spaCy/labels/models), `language / [name]` | Issues related to the specific [models](https://github.com/explosion/spacy-models), languages and data |
-| [`linux`](https://github.com/explosion/spaCy/labels/linux), [`osx`](https://github.com/explosion/spaCy/labels/osx), [`windows`](https://github.com/explosion/spaCy/labels/windows) | Issues related to the specific operating systems |
-| [`pip`](https://github.com/explosion/spaCy/labels/pip), [`conda`](https://github.com/explosion/spaCy/labels/conda) | Issues related to the specific package managers |
-| [`compat`](https://github.com/explosion/spaCy/labels/compat) | Cross-platform and cross-Python compatibility issues |
-| [`wip`](https://github.com/explosion/spaCy/labels/wip) | Work in progress, mostly used for pull requests |
-| [`v1`](https://github.com/explosion/spaCy/labels/v1) | Reports related to spaCy v1.x |
-| [`duplicate`](https://github.com/explosion/spaCy/labels/duplicate) | Duplicates, i.e. issues that have been reported before |
-| [`third-party`](https://github.com/explosion/spaCy/labels/third-party) | Issues related to third-party packages and services |
-| [`meta`](https://github.com/explosion/spaCy/labels/meta) | Meta topics, e.g. repo organisation and issue management |
-| [`help wanted`](https://github.com/explosion/spaCy/labels/help%20wanted), [`help wanted (easy)`](https://github.com/explosion/spaCy/labels/help%20wanted%20%28easy%29) | Requests for contributions |
+💫 to the title. [See this page](https://github.com/explosion/spaCy/labels)
+for an overview of the system we use to tag our issues and pull requests.

## Contributing to the code base
@@ -220,7 +200,7 @@ All Python code must be written in an **intersection of Python 2 and Python 3**.
This is easy in Cython, but somewhat ugly in Python. Logic that deals with
Python or platform compatibility should only live in
[`spacy.compat`](spacy/compat.py). To distinguish them from the builtin
-functions, replacement functions are suffixed with an undersocre, for example
+functions, replacement functions are suffixed with an underscore, for example
`unicode_`. If you need to access the user's version or platform information,
for example to show more specific error messages, you can use the `is_config()`
helper function.
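As a quick illustration of the `spacy.compat` helpers mentioned in that paragraph, here is a minimal sketch, assuming spaCy v2.x (the exact set of helpers can vary between versions):

```python
# Minimal sketch of spaCy's compatibility helpers (spaCy v2.x assumed).
from spacy.compat import unicode_, is_config

# unicode_ is the underscore-suffixed replacement for the builtin text type
# (unicode on Python 2, str on Python 3).
text = unicode_("café")

# is_config() reports the user's Python version / platform, which is handy
# when showing more specific error messages.
if is_config(python2=True, windows=True):
    print("Running under Python 2 on Windows")
```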
README.rst: 34 changed lines

@@ -12,11 +12,11 @@ integration. It's commercial open-source software, released under the MIT licens

💫 **Version 2.0 out now!** `Check out the new features here. <https://spacy.io/usage/v2>`_

-.. image:: https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square
+.. image:: https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square&logo=travis
    :target: https://travis-ci.org/explosion/spaCy
    :alt: Build Status

-.. image:: https://img.shields.io/appveyor/ci/explosion/spaCy/master.svg?style=flat-square
+.. image:: https://img.shields.io/appveyor/ci/explosion/spaCy/master.svg?style=flat-square&logo=appveyor
    :target: https://ci.appveyor.com/project/explosion/spaCy
    :alt: Appveyor Build Status

@@ -28,11 +28,11 @@ integration. It's commercial open-source software, released under the MIT licens
    :target: https://pypi.python.org/pypi/spacy
    :alt: pypi Version

-.. image:: https://anaconda.org/conda-forge/spacy/badges/version.svg
+.. image:: https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square
    :target: https://anaconda.org/conda-forge/spacy
    :alt: conda Version

-.. image:: https://img.shields.io/badge/gitter-join%20chat%20%E2%86%92-09a3d5.svg?style=flat-square
+.. image:: https://img.shields.io/badge/chat-join%20%E2%86%92-09a3d5.svg?style=flat-square&logo=gitter-white
    :target: https://gitter.im/explosion/spaCy
    :alt: spaCy on Gitter

@@ -49,7 +49,7 @@ integration. It's commercial open-source software, released under the MIT licens
`New in v2.0`_      New features, backwards incompatibilities and migration guide.
`API Reference`_    The detailed reference for spaCy's API.
`Models`_           Download statistical language models for spaCy.
-`Resources`_        Libraries, extensions, demos, books and courses.
+`Universe`_         Libraries, extensions, demos, books and courses.
`Changelog`_        Changes and version history.
`Contribute`_       How to contribute to the spaCy project and code base.
=================== ===

@@ -59,7 +59,7 @@ integration. It's commercial open-source software, released under the MIT licens
.. _Usage Guides: https://spacy.io/usage/
.. _API Reference: https://spacy.io/api/
.. _Models: https://spacy.io/models
-.. _Resources: https://spacy.io/usage/resources
+.. _Universe: https://spacy.io/universe
.. _Changelog: https://spacy.io/usage/#changelog
.. _Contribute: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md

@@ -308,18 +308,20 @@ VS 2010 (Python 3.4) and VS 2015 (Python 3.5).
Run tests
=========

-spaCy comes with an `extensive test suite <spacy/tests>`_. First, find out where
-spaCy is installed:
+spaCy comes with an `extensive test suite <spacy/tests>`_. In order to run the
+tests, you'll usually want to clone the repository and build spaCy from source.
+This will also install the required development dependencies and test utilities
+defined in the ``requirements.txt``.

+Alternatively, you can find out where spaCy is installed and run ``pytest`` on
+that directory. Don't forget to also install the test utilities via spaCy's
+``requirements.txt``:

.. code:: bash

    python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"

-Then run ``pytest`` on that directory. The flags ``--vectors``, ``--slow``
-and ``--model`` are optional and enable additional tests:
-
-.. code:: bash
-
-    # make sure you are using recent pytest version
-    python -m pip install -U pytest
+    pip install -r path/to/requirements.txt
    python -m pytest <spacy-directory>

+See `the documentation <https://spacy.io/usage/#tests>`_ for more details and
+examples.
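The same lookup-and-run flow described in that README section can also be driven from Python; a minimal sketch, assuming pytest and the packages from ``requirements.txt`` are installed:

```python
# Minimal sketch: locate the installed spaCy package and run its test suite,
# mirroring the shell commands shown in the README hunk above (assumes pytest
# and the test utilities from requirements.txt are installed).
import os
import pytest
import spacy

spacy_dir = os.path.dirname(spacy.__file__)
# Extra flags mentioned in the old text (e.g. "--slow") can be appended to the
# argument list to enable additional tests.
pytest.main([spacy_dir])
```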
examples/pipeline/custom_component_countries_api.py

@@ -9,6 +9,7 @@ coordinates. Can be extended with more details from the API.
* Custom pipeline components: https://spacy.io//usage/processing-pipelines#custom-components

Compatible with: spaCy v2.0.0+
Prerequisites: pip install requests
"""
from __future__ import unicode_literals, print_function
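Since this example revolves around custom pipeline components, here is a minimal, self-contained sketch of the pattern (spaCy v2.x API; the component name and sample text are just illustrations):

```python
# Minimal sketch of a custom pipeline component (spaCy v2.x).
import spacy

def print_length(doc):
    # Illustrative component: runs on every doc and passes it along unchanged.
    print("Doc length:", len(doc))
    return doc

nlp = spacy.blank('en')                                   # blank English pipeline
nlp.add_pipe(print_length, name='print_length', last=True)
doc = nlp(u"This is a sentence.")
```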
examples/training/train_new_entity_type.py

@@ -81,7 +81,6 @@ def main(model=None, new_model_name='animal', output_dir=None, n_iter=20):
    else:
        nlp = spacy.blank('en')  # create blank Language class
        print("Created blank 'en' model")

    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:

@@ -92,11 +91,18 @@ def main(model=None, new_model_name='animal', output_dir=None, n_iter=20):
        ner = nlp.get_pipe('ner')

    ner.add_label(LABEL)   # add new entity label to entity recognizer
+    if model is None:
+        optimizer = nlp.begin_training()
+    else:
+        # Note that 'begin_training' initializes the models, so it'll zero out
+        # existing entity types.
+        optimizer = nlp.entity.create_optimizer()

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
-        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
examples/training/train_textcat.py

@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding: utf8
-"""Train a multi-label convolutional neural network text classifier on the
+"""Train a convolutional neural network text classifier on the
IMDB dataset, using the TextCategorizer component. The dataset will be loaded
automatically via Thinc's built-in dataset loader. The model is added to
spacy.pipeline, and predictions are available via `doc.cats`. For more details,
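For context, the predictions this docstring refers to are exposed on the processed `Doc`; a minimal sketch of querying them (the model path is hypothetical and stands for the output directory the example script saves to):

```python
# Minimal sketch of reading TextCategorizer predictions via doc.cats
# (spaCy v2.x; '/path/to/textcat_model' is a hypothetical output directory).
import spacy

nlp = spacy.load('/path/to/textcat_model')
doc = nlp(u"This movie was surprisingly good")
print(doc.cats)  # e.g. {'POSITIVE': 0.95} for whatever labels the model was trained with
```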
fabfile.py (vendored): 29 changed lines
@@ -6,6 +6,7 @@ from pathlib import Path
from fabric.api import local, lcd, env, settings, prefix
from os import path, environ
import shutil
+import sys


PWD = path.dirname(__file__)

@@ -90,3 +91,31 @@ def train():
    args = environ.get('SPACY_TRAIN_ARGS', '')
    with virtualenv(VENV_DIR) as venv_local:
        venv_local('spacy train {args}'.format(args=args))
+
+
+def conll17(treebank_dir, experiment_dir, vectors_dir, config, corpus=''):
+    is_not_clean = local('git status --porcelain', capture=True)
+    if is_not_clean:
+        print("Repository is not clean")
+        print(is_not_clean)
+        sys.exit(1)
+    git_sha = local('git rev-parse --short HEAD', capture=True)
+    config_checksum = local('sha256sum {config}'.format(config=config), capture=True)
+    experiment_dir = Path(experiment_dir) / '{}--{}'.format(config_checksum[:6], git_sha)
+    if not experiment_dir.exists():
+        experiment_dir.mkdir()
+    test_data_dir = Path(treebank_dir) / 'ud-test-v2.0-conll2017'
+    assert test_data_dir.exists()
+    assert test_data_dir.is_dir()
+    if corpus:
+        corpora = [corpus]
+    else:
+        corpora = ['UD_English', 'UD_Chinese', 'UD_Japanese', 'UD_Vietnamese']
+
+    local('cp {config} {experiment_dir}/config.json'.format(config=config, experiment_dir=experiment_dir))
+    with virtualenv(VENV_DIR) as venv_local:
+        for corpus in corpora:
+            venv_local('spacy ud-train {treebank_dir} {experiment_dir} {config} {corpus} -v {vectors_dir}'.format(
+                treebank_dir=treebank_dir, experiment_dir=experiment_dir, config=config, corpus=corpus, vectors_dir=vectors_dir))
+            venv_local('spacy ud-run-test {test_data_dir} {experiment_dir} {corpus}'.format(
+                test_data_dir=test_data_dir, experiment_dir=experiment_dir, config=config, corpus=corpus))
requirements.txt

@@ -3,16 +3,12 @@ pathlib
numpy>=1.7
cymem>=1.30,<1.32
preshed>=1.0.0,<2.0.0
-thinc>=6.11.1.dev11,<6.12.0
+thinc>=6.11.1.dev12,<6.12.0
murmurhash>=0.28,<0.29
cytoolz>=0.9.0,<0.10.0
plac<1.0.0,>=0.9.6
ujson>=1.35
dill>=0.2,<0.3
requests>=2.13.0,<3.0.0
regex==2017.4.5
ftfy>=4.4.2,<5.0.0
pytest>=3.0.6,<4.0.0
mock>=2.0.0,<3.0.0
msgpack-python==0.5.4
msgpack-numpy==0.4.1
setup.py: 8 changed lines
@@ -38,6 +38,7 @@ MOD_NAMES = [
    'spacy.tokens.doc',
    'spacy.tokens.span',
    'spacy.tokens.token',
+    'spacy.tokens._retokenize',
    'spacy.matcher',
    'spacy.syntax.ner',
    'spacy.symbols',

@@ -194,12 +195,7 @@ def setup_package():
            'plac<1.0.0,>=0.9.6',
            'pathlib',
            'ujson>=1.35',
-            'dill>=0.2,<0.3',
-            'requests>=2.13.0,<3.0.0',
-            'regex==2017.4.5',
-            'ftfy>=4.4.2,<5.0.0',
-            'msgpack-python==0.5.4',
-            'msgpack-numpy==0.4.1'],
+            'dill>=0.2,<0.3'],
        setup_requires=['wheel'],
        classifiers=[
            'Development Status :: 5 - Production/Stable',
spacy/__init__.py

@@ -4,18 +4,14 @@ from __future__ import unicode_literals
from .cli.info import info as cli_info
from .glossary import explain
from .about import __version__
+from .errors import Warnings, deprecation_warning
from . import util


def load(name, **overrides):
    depr_path = overrides.get('path')
    if depr_path not in (True, False, None):
-        util.deprecated(
-            "As of spaCy v2.0, the keyword argument `path=` is deprecated. "
-            "You can now call spacy.load with the path as its first argument, "
-            "and the model's meta.json will be used to determine the language "
-            "to load. For example:\nnlp = spacy.load('{}')".format(depr_path),
-            'error')
+        deprecation_warning(Warnings.W001.format(path=depr_path))
    return util.load_model(name, **overrides)
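The deprecation handled above points users at the plain `spacy.load()` call; a minimal sketch of the supported usage (the model name and path are illustrative):

```python
# Minimal sketch of the non-deprecated spacy.load() usage referenced above.
import spacy

# Load by installed package name or shortcut link...
nlp = spacy.load('en_core_web_sm')   # assumes this model package is installed
# ...or pass a model directory directly; its meta.json determines the language.
nlp = spacy.load('/path/to/model')   # hypothetical path to a saved model
```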
spacy/_ml.py: 46 changed lines
@@ -23,6 +23,7 @@ from thinc.neural._classes.affine import _set_dimensions_if_needed
import thinc.extra.load_nlp

from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE
+from .errors import Errors
from . import util


@@ -42,8 +43,8 @@ def cosine(vec1, vec2):
def create_default_optimizer(ops, **cfg):
    learn_rate = util.env_opt('learn_rate', 0.001)
    beta1 = util.env_opt('optimizer_B1', 0.9)
-    beta2 = util.env_opt('optimizer_B2', 0.999)
-    eps = util.env_opt('optimizer_eps', 1e-08)
+    beta2 = util.env_opt('optimizer_B2', 0.9)
+    eps = util.env_opt('optimizer_eps', 1e-12)
    L2 = util.env_opt('L2_penalty', 1e-6)
    max_grad_norm = util.env_opt('grad_norm_clip', 1.)
    optimizer = Adam(ops, learn_rate, L2=L2, beta1=beta1,

@@ -157,7 +158,7 @@ class PrecomputableAffine(Model):
                sgd(self._mem.weights, self._mem.gradient, key=self.id)
            return dXf.reshape((dXf.shape[0], self.nF, self.nI))
        return Yf, backward

    def _add_padding(self, Yf):
        Yf_padded = self.ops.xp.vstack((self.pad, Yf))
        return Yf_padded

@@ -194,14 +195,15 @@
                                            size=tokvecs.size).reshape(tokvecs.shape)

        def predict(ids, tokvecs):
-            # nS ids. nW tokvecs
-            hiddens = model(tokvecs)  # (nW, f, o, p)
+            # nS ids. nW tokvecs. Exclude the padding array.
+            hiddens = model(tokvecs[:-1])  # (nW, f, o, p)
-            vectors = model.ops.allocate((ids.shape[0], model.nO * model.nP), dtype='f')
+            # need nS vectors
+            vectors = model.ops.allocate((ids.shape[0], model.nO, model.nP))
-            for i, feats in enumerate(ids):
-                for j, id_ in enumerate(feats):
-                    vectors[i] += hiddens[id_, j]
+            hiddens = hiddens.reshape((hiddens.shape[0] * model.nF, model.nO * model.nP))
+            model.ops.scatter_add(vectors, ids.flatten(), hiddens)
+            vectors = vectors.reshape((vectors.shape[0], model.nO, model.nP))
            vectors += model.b
            vectors = model.ops.asarray(vectors)
            if model.nP >= 2:
                return model.ops.maxout(vectors)[0]
            else:

@@ -225,6 +227,11 @@ class PrecomputableAffine(Model):

def link_vectors_to_models(vocab):
    vectors = vocab.vectors
+    if vectors.name is None:
+        vectors.name = VECTORS_KEY
+        print(
+            "Warning: Unnamed vectors -- this won't allow multiple vectors "
+            "models to be loaded. (Shape: (%d, %d))" % vectors.data.shape)
    ops = Model.ops
    for word in vocab:
        if word.orth in vectors.key2row:

@@ -234,11 +241,11 @@ def link_vectors_to_models(vocab):
    data = ops.asarray(vectors.data)
    # Set an entry here, so that vectors are accessed by StaticVectors
    # (unideal, I know)
-    thinc.extra.load_nlp.VECTORS[(ops.device, VECTORS_KEY)] = data
+    thinc.extra.load_nlp.VECTORS[(ops.device, vectors.name)] = data


def Tok2Vec(width, embed_size, **kwargs):
-    pretrained_dims = kwargs.get('pretrained_dims', 0)
+    pretrained_vectors = kwargs.get('pretrained_vectors', None)
    cnn_maxout_pieces = kwargs.get('cnn_maxout_pieces', 2)
    cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
    with Model.define_operators({'>>': chain, '|': concatenate, '**': clone,

@@ -251,16 +258,16 @@ def Tok2Vec(width, embed_size, **kwargs):
                          name='embed_suffix')
        shape = HashEmbed(width, embed_size//2, column=cols.index(SHAPE),
                          name='embed_shape')
-        if pretrained_dims is not None and pretrained_dims >= 1:
-            glove = StaticVectors(VECTORS_KEY, width, column=cols.index(ID))
+        if pretrained_vectors is not None:
+            glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))

            embed = uniqued(
                (glove | norm | prefix | suffix | shape)
-                >> LN(Maxout(width, width*5, pieces=3)), column=5)
+                >> LN(Maxout(width, width*5, pieces=3)), column=cols.index(ORTH))
        else:
            embed = uniqued(
                (norm | prefix | suffix | shape)
-                >> LN(Maxout(width, width*4, pieces=3)), column=5)
+                >> LN(Maxout(width, width*4, pieces=3)), column=cols.index(ORTH))

        convolution = Residual(
            ExtractWindow(nW=1)

@@ -318,10 +325,10 @@ def _divide_array(X, size):


def get_col(idx):
-    assert idx >= 0, idx
+    if idx < 0:
+        raise IndexError(Errors.E066.format(value=idx))

    def forward(X, drop=0.):
-        assert idx >= 0, idx
        if isinstance(X, numpy.ndarray):
            ops = NumpyOps()
        else:

@@ -329,7 +336,6 @@ def get_col(idx):
        output = ops.xp.ascontiguousarray(X[:, idx], dtype=X.dtype)

        def backward(y, sgd=None):
-            assert idx >= 0, idx
            dX = ops.allocate(X.shape)
            dX[:, idx] += y
            return dX

@@ -416,13 +422,13 @@ def build_tagger_model(nr_class, **cfg):
        token_vector_width = cfg['token_vector_width']
    else:
        token_vector_width = util.env_opt('token_vector_width', 128)
-    pretrained_dims = cfg.get('pretrained_dims', 0)
+    pretrained_vectors = cfg.get('pretrained_vectors')
    with Model.define_operators({'>>': chain, '+': add}):
        if 'tok2vec' in cfg:
            tok2vec = cfg['tok2vec']
        else:
            tok2vec = Tok2Vec(token_vector_width, embed_size,
-                              pretrained_dims=pretrained_dims)
+                              pretrained_vectors=pretrained_vectors)
        softmax = with_flatten(Softmax(nr_class, token_vector_width))
        model = (
            tok2vec
spacy/about.py

@@ -11,7 +11,6 @@ __email__ = 'contact@explosion.ai'
__license__ = 'MIT'
__release__ = False

__docs_models__ = 'https://spacy.io/usage/models'
__download_url__ = 'https://github.com/explosion/spacy-models/releases/download'
__compatibility__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json'
__shortcuts__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts-v2.json'
spacy/cli/_messages.py (new file, 74 lines)
@@ -0,0 +1,74 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+
+class Messages(object):
+    M001 = ("Download successful but linking failed")
+    M002 = ("Creating a shortcut link for 'en' didn't work (maybe you "
+            "don't have admin permissions?), but you can still load the "
+            "model via its full package name: nlp = spacy.load('{name}')")
+    M003 = ("Server error ({code}: {desc})")
+    M004 = ("Couldn't fetch {desc}. Please find a model for your spaCy "
+            "installation (v{version}), and download it manually. For more "
+            "details, see the documentation: https://spacy.io/usage/models")
+    M005 = ("Compatibility error")
+    M006 = ("No compatible models found for v{version} of spaCy.")
+    M007 = ("No compatible model found for '{name}' (spaCy v{version}).")
+    M008 = ("Can't locate model data")
+    M009 = ("The data should be located in {path}")
+    M010 = ("Can't find the spaCy data path to create model symlink")
+    M011 = ("Make sure a directory `/data` exists within your spaCy "
+            "installation and try again. The data directory should be "
+            "located here:")
+    M012 = ("Link '{name}' already exists")
+    M013 = ("To overwrite an existing link, use the --force flag.")
+    M014 = ("Can't overwrite symlink '{name}'")
+    M015 = ("This can happen if your data directory contains a directory or "
+            "file of the same name.")
+    M016 = ("Error: Couldn't link model to '{name}'")
+    M017 = ("Creating a symlink in spacy/data failed. Make sure you have the "
+            "required permissions and try re-running the command as admin, or "
+            "use a virtualenv. You can still import the model as a module and "
+            "call its load() method, or create the symlink manually.")
+    M018 = ("Linking successful")
+    M019 = ("You can now load the model via spacy.load('{name}')")
+    M020 = ("Can't find model meta.json")
+    M021 = ("Couldn't fetch compatibility table.")
+    M022 = ("Can't find spaCy v{version} in compatibility table")
+    M023 = ("Installed models (spaCy v{version})")
+    M024 = ("No models found in your current environment.")
+    M025 = ("Use the following commands to update the model packages:")
+    M026 = ("The following models are not available for spaCy "
+            "v{version}: {models}")
+    M027 = ("You may also want to overwrite the incompatible links using the "
+            "`python -m spacy link` command with `--force`, or remove them "
+            "from the data directory. Data path: {path}")
+    M028 = ("Input file not found")
+    M029 = ("Output directory not found")
+    M030 = ("Unknown format")
+    M031 = ("Can't find converter for {converter}")
+    M032 = ("Generated output file {name}")
+    M033 = ("Created {n_docs} documents")
+    M034 = ("Evaluation data not found")
+    M035 = ("Visualization output directory not found")
+    M036 = ("Generated {n} parses as HTML")
+    M037 = ("Can't find words frequencies file")
+    M038 = ("Sucessfully compiled vocab")
+    M039 = ("{entries} entries, {vectors} vectors")
+    M040 = ("Output directory not found")
+    M041 = ("Loaded meta.json from file")
+    M042 = ("Successfully created package '{name}'")
+    M043 = ("To build the package, run `python setup.py sdist` in this "
+            "directory.")
+    M044 = ("Package directory already exists")
+    M045 = ("Please delete the directory and try again, or use the `--force` "
+            "flag to overwrite existing directories.")
+    M046 = ("Generating meta.json")
+    M047 = ("Enter the package settings for your model. The following "
+            "information will be read from your model data: pipeline, vectors.")
+    M048 = ("No '{key}' setting found in meta.json")
+    M049 = ("This setting is required to build your package.")
+    M050 = ("Training data not found")
+    M051 = ("Development data not found")
+    M052 = ("Not a valid meta.json format")
+    M053 = ("Expected dict but got: {meta_type}")
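Aside: the CLI code that follows consumes these templates by filling the named placeholders with str.format and handing the result to the prints() helper as body or title. A rough, self-contained illustration (the class below copies two of the constants; the version string and status code are arbitrary example values, not from the diff):

class Messages(object):
    M003 = ("Server error ({code}: {desc})")
    M004 = ("Couldn't fetch {desc}. Please find a model for your spaCy "
            "installation (v{version}), and download it manually. For more "
            "details, see the documentation: https://spacy.io/usage/models")


title = Messages.M003.format(code=404, desc="Not Found")
body = Messages.M004.format(desc="compatibility table", version="2.0.0")
print(title)
print(body)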
@@ -5,6 +5,7 @@ import plac
 from pathlib import Path

 from .converters import conllu2json, iob2json, conll_ner2json
+from ._messages import Messages
 from ..util import prints

 # Converters are matched by file extension. To add a converter, add a new
@@ -32,14 +33,14 @@ def convert(input_file, output_dir, n_sents=1, morphology=False, converter='auto
     input_path = Path(input_file)
     output_path = Path(output_dir)
     if not input_path.exists():
-        prints(input_path, title="Input file not found", exits=1)
+        prints(input_path, title=Messages.M028, exits=1)
     if not output_path.exists():
-        prints(output_path, title="Output directory not found", exits=1)
+        prints(output_path, title=Messages.M029, exits=1)
     if converter == 'auto':
         converter = input_path.suffix[1:]
     if converter not in CONVERTERS:
-        prints("Can't find converter for %s" % converter,
-               title="Unknown format", exits=1)
+        prints(Messages.M031.format(converter=converter),
+               title=Messages.M030, exits=1)
     func = CONVERTERS[converter]
     func(input_path, output_path,
          n_sents=n_sents, use_morphology=morphology)
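Aside: convert() above dispatches on a CONVERTERS dict keyed by file extension, so converter='auto' just means "use the input file's suffix". A hedged sketch of that dispatch idea, with a placeholder converter function instead of the real conllu2json:

from pathlib import Path


def fake_conllu2json(input_path, output_path, **kwargs):
    # placeholder for a real converter; just reports what it would do
    print("would convert", input_path, "->", output_path)


CONVERTERS = {"conllu": fake_conllu2json, "conll": fake_conllu2json}


def convert(input_file, output_dir, converter="auto"):
    input_path = Path(input_file)
    if converter == "auto":
        converter = input_path.suffix[1:]      # '.conllu' -> 'conllu'
    if converter not in CONVERTERS:
        raise ValueError("Can't find converter for %s" % converter)
    CONVERTERS[converter](input_path, Path(output_dir))


convert("train.conllu", "out")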
@@ -1,6 +1,7 @@
 # coding: utf8
 from __future__ import unicode_literals

+from .._messages import Messages
 from ...compat import json_dumps, path2str
 from ...util import prints
 from ...gold import iob_to_biluo
@@ -18,8 +19,8 @@ def conll_ner2json(input_path, output_path, n_sents=10, use_morphology=False):
     output_file = output_path / output_filename
     with output_file.open('w', encoding='utf-8') as f:
         f.write(json_dumps(docs))
-    prints("Created %d documents" % len(docs),
-           title="Generated output file %s" % path2str(output_file))
+    prints(Messages.M033.format(n_docs=len(docs)),
+           title=Messages.M032.format(name=path2str(output_file)))


 def read_conll_ner(input_path):


@@ -1,6 +1,7 @@
 # coding: utf8
 from __future__ import unicode_literals

+from .._messages import Messages
 from ...compat import json_dumps, path2str
 from ...util import prints

@@ -32,8 +33,8 @@ def conllu2json(input_path, output_path, n_sents=10, use_morphology=False):
     output_file = output_path / output_filename
     with output_file.open('w', encoding='utf-8') as f:
         f.write(json_dumps(docs))
-    prints("Created %d documents" % len(docs),
-           title="Generated output file %s" % path2str(output_file))
+    prints(Messages.M033.format(n_docs=len(docs)),
+           title=Messages.M032.format(name=path2str(output_file)))


 def read_conllx(input_path, use_morphology=False, n=0):


@@ -2,6 +2,7 @@
 from __future__ import unicode_literals
 from cytoolz import partition_all, concat

+from .._messages import Messages
 from ...compat import json_dumps, path2str
 from ...util import prints
 from ...gold import iob_to_biluo
@@ -18,8 +19,8 @@ def iob2json(input_path, output_path, n_sents=10, *a, **k):
     output_file = output_path / output_filename
     with output_file.open('w', encoding='utf-8') as f:
         f.write(json_dumps(docs))
-    prints("Created %d documents" % len(docs),
-           title="Generated output file %s" % path2str(output_file))
+    prints(Messages.M033.format(n_docs=len(docs)),
+           title=Messages.M032.format(name=path2str(output_file)))


 def read_iob(raw_sents):
@@ -2,13 +2,15 @@
 from __future__ import unicode_literals

 import plac
-import requests
 import os
 import subprocess
 import sys
+import ujson

 from .link import link
+from ._messages import Messages
 from ..util import prints, get_package_path
+from ..compat import url_read, HTTPError
 from .. import about


@@ -31,9 +33,7 @@ def download(model, direct=False):
         version = get_version(model_name, compatibility)
         dl = download_model('{m}-{v}/{m}-{v}.tar.gz'.format(m=model_name,
                                                             v=version))
-        if dl != 0:
-            # if download subprocess doesn't return 0, exit with the respective
-            # exit code before doing anything else
+        if dl != 0:  # if download subprocess doesn't return 0, exit
             sys.exit(dl)
         try:
             # Get package path here because link uses
@@ -47,22 +47,16 @@ def download(model, direct=False):
             # Dirty, but since spacy.download and the auto-linking is
             # mostly a convenience wrapper, it's best to show a success
             # message and loading instructions, even if linking fails.
-            prints(
-                "Creating a shortcut link for 'en' didn't work (maybe "
-                "you don't have admin permissions?), but you can still "
-                "load the model via its full package name:",
-                "nlp = spacy.load('%s')" % model_name,
-                title="Download successful but linking failed")
+            prints(Messages.M001.format(name=model_name), title=Messages.M002)


 def get_json(url, desc):
-    r = requests.get(url)
-    if r.status_code != 200:
-        msg = ("Couldn't fetch %s. Please find a model for your spaCy "
-               "installation (v%s), and download it manually.")
-        prints(msg % (desc, about.__version__), about.__docs_models__,
-               title="Server error (%d)" % r.status_code, exits=1)
-    return r.json()
+    try:
+        data = url_read(url)
+    except HTTPError as e:
+        prints(Messages.M004.format(desc, about.__version__),
+               title=Messages.M003.format(e.code, e.reason), exits=1)
+    return ujson.loads(data)


 def get_compatibility():
@@ -71,17 +65,16 @@ def get_compatibility():
     comp_table = get_json(about.__compatibility__, "compatibility table")
     comp = comp_table['spacy']
     if version not in comp:
-        prints("No compatible models found for v%s of spaCy." % version,
-               title="Compatibility error", exits=1)
+        prints(Messages.M006.format(version=version), title=Messages.M005,
+               exits=1)
     return comp[version]


 def get_version(model, comp):
     model = model.rsplit('.dev', 1)[0]
     if model not in comp:
-        version = about.__version__
-        msg = "No compatible model found for '%s' (spaCy v%s)."
-        prints(msg % (model, version), title="Compatibility error", exits=1)
+        prints(Messages.M007.format(name=model, version=about.__version__),
+               title=Messages.M005, exits=1)
     return comp[model][0]
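Aside: get_json() above drops requests in favour of the url_read/HTTPError pair that the compat.py hunk further down defines. A minimal Python 3 sketch of that fetch pattern, assuming nothing beyond the standard library (json stands in for ujson, and the SystemExit message stands in for the prints/Messages call):

import json
from urllib.request import urlopen
from urllib.error import HTTPError


def url_read(url):
    # read the body, raising HTTPError for any non-200 response
    file_ = urlopen(url)
    code = file_.getcode()
    if code != 200:
        raise HTTPError(url, code, "Cannot GET url", [], file_)
    return file_.read()


def get_json(url, desc):
    try:
        data = url_read(url)
    except HTTPError as e:
        raise SystemExit("Server error ({}: {}) fetching {}".format(e.code, e.reason, desc))
    return json.loads(data)

# usage sketch: get_json("https://example.com/compatibility.json", "compatibility table")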
@@ -4,6 +4,7 @@ from __future__ import unicode_literals, division, print_function
 import plac
 from timeit import default_timer as timer

+from ._messages import Messages
 from ..gold import GoldCorpus
 from ..util import prints
 from .. import util
@@ -33,10 +34,9 @@ def evaluate(model, data_path, gpu_id=-1, gold_preproc=False, displacy_path=None
     data_path = util.ensure_path(data_path)
     displacy_path = util.ensure_path(displacy_path)
     if not data_path.exists():
-        prints(data_path, title="Evaluation data not found", exits=1)
+        prints(data_path, title=Messages.M034, exits=1)
     if displacy_path and not displacy_path.exists():
-        prints(displacy_path, title="Visualization output directory not found",
-               exits=1)
+        prints(displacy_path, title=Messages.M035, exits=1)
     corpus = GoldCorpus(data_path, data_path)
     nlp = util.load_model(model)
     dev_docs = list(corpus.dev_docs(nlp, gold_preproc=gold_preproc))
@@ -52,8 +52,7 @@ def evaluate(model, data_path, gpu_id=-1, gold_preproc=False, displacy_path=None
         render_ents = 'ner' in nlp.meta.get('pipeline', [])
         render_parses(docs, displacy_path, model_name=model,
                       limit=displacy_limit, deps=render_deps, ents=render_ents)
-        msg = "Generated %s parses as HTML" % displacy_limit
-        prints(displacy_path, title=msg)
+        prints(displacy_path, title=Messages.M036.format(n=displacy_limit))


 def render_parses(docs, output_path, model_name='', limit=250, deps=True,
@@ -5,15 +5,17 @@ import plac
 import platform
 from pathlib import Path

+from ._messages import Messages
 from ..compat import path2str
-from .. import about
 from .. import util
+from .. import about


 @plac.annotations(
     model=("optional: shortcut link of model", "positional", None, str),
-    markdown=("generate Markdown for GitHub issues", "flag", "md", str))
-def info(model=None, markdown=False):
+    markdown=("generate Markdown for GitHub issues", "flag", "md", str),
+    silent=("don't print anything (just return)", "flag", "s"))
+def info(model=None, markdown=False, silent=False):
     """Print info about spaCy installation. If a model shortcut link is
     speficied as an argument, print model information. Flag --markdown
     prints details in Markdown for easy copy-pasting to GitHub issues.
@@ -25,21 +27,24 @@ def info(model=None, markdown=False):
         model_path = util.get_data_path() / model
         meta_path = model_path / 'meta.json'
         if not meta_path.is_file():
-            util.prints(meta_path, title="Can't find model meta.json", exits=1)
+            util.prints(meta_path, title=Messages.M020, exits=1)
         meta = util.read_json(meta_path)
         if model_path.resolve() != model_path:
             meta['link'] = path2str(model_path)
             meta['source'] = path2str(model_path.resolve())
         else:
             meta['source'] = path2str(model_path)
-        print_info(meta, 'model %s' % model, markdown)
-    else:
-        data = {'spaCy version': about.__version__,
-                'Location': path2str(Path(__file__).parent.parent),
-                'Platform': platform.platform(),
-                'Python version': platform.python_version(),
-                'Models': list_models()}
+        if not silent:
+            print_info(meta, 'model %s' % model, markdown)
+        return meta
+    data = {'spaCy version': about.__version__,
+            'Location': path2str(Path(__file__).parent.parent),
+            'Platform': platform.platform(),
+            'Python version': platform.python_version(),
+            'Models': list_models()}
+    if not silent:
+        print_info(data, 'spaCy', markdown)
+    return data


 def print_info(data, title, markdown):
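Aside: the new silent flag above turns info() into something usable both as a CLI command and as a library call: it always returns the collected data and only prints when silent is False. A toy version of that contract (my own sketch, not spaCy's code):

import platform


def info(silent=False):
    data = {"Platform": platform.platform(),
            "Python version": platform.python_version()}
    if not silent:
        for key, value in data.items():
            print("{:<16} {}".format(key, value))
    return data


meta = info(silent=True)   # no output, just the dict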
|
@ -12,10 +12,16 @@ import tarfile
|
|||
import gzip
|
||||
import zipfile
|
||||
|
||||
from ..compat import fix_text
|
||||
from ._messages import Messages
|
||||
from ..vectors import Vectors
|
||||
from ..errors import Errors, Warnings, user_warning
|
||||
from ..util import prints, ensure_path, get_lang_class
|
||||
|
||||
try:
|
||||
import ftfy
|
||||
except ImportError:
|
||||
ftfy = None
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
lang=("model language", "positional", None, str),
|
||||
|
@ -23,27 +29,26 @@ from ..util import prints, ensure_path, get_lang_class
|
|||
freqs_loc=("location of words frequencies file", "positional", None, Path),
|
||||
clusters_loc=("optional: location of brown clusters data",
|
||||
"option", "c", str),
|
||||
vectors_loc=("optional: location of vectors file in GenSim text format",
|
||||
"option", "v", str),
|
||||
vectors_loc=("optional: location of vectors file in Word2Vec format "
|
||||
"(either as .txt or zipped as .zip or .tar.gz)", "option",
|
||||
"v", str),
|
||||
prune_vectors=("optional: number of vectors to prune to",
|
||||
"option", "V", int)
|
||||
)
|
||||
def init_model(lang, output_dir, freqs_loc=None, clusters_loc=None, vectors_loc=None, prune_vectors=-1):
|
||||
def init_model(lang, output_dir, freqs_loc=None, clusters_loc=None,
|
||||
vectors_loc=None, prune_vectors=-1):
|
||||
"""
|
||||
Create a new model from raw data, like word frequencies, Brown clusters
|
||||
and word vectors.
|
||||
"""
|
||||
if freqs_loc is not None and not freqs_loc.exists():
|
||||
prints(freqs_loc, title="Can't find words frequencies file", exits=1)
|
||||
prints(freqs_loc, title=Messages.M037, exits=1)
|
||||
clusters_loc = ensure_path(clusters_loc)
|
||||
vectors_loc = ensure_path(vectors_loc)
|
||||
|
||||
probs, oov_prob = read_freqs(freqs_loc) if freqs_loc is not None else ({}, -20)
|
||||
vectors_data, vector_keys = read_vectors(vectors_loc) if vectors_loc else (None, None)
|
||||
clusters = read_clusters(clusters_loc) if clusters_loc else {}
|
||||
|
||||
nlp = create_model(lang, probs, oov_prob, clusters, vectors_data, vector_keys, prune_vectors)
|
||||
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
nlp.to_disk(output_dir)
|
||||
|
@ -71,7 +76,6 @@ def create_model(lang, probs, oov_prob, clusters, vectors_data, vector_keys, pru
|
|||
nlp = lang_class()
|
||||
for lexeme in nlp.vocab:
|
||||
lexeme.rank = 0
|
||||
|
||||
lex_added = 0
|
||||
for i, (word, prob) in enumerate(tqdm(sorted(probs.items(), key=lambda item: item[1], reverse=True))):
|
||||
lexeme = nlp.vocab[word]
|
||||
|
@ -91,15 +95,13 @@ def create_model(lang, probs, oov_prob, clusters, vectors_data, vector_keys, pru
|
|||
lexeme = nlp.vocab[word]
|
||||
lexeme.is_oov = False
|
||||
lex_added += 1
|
||||
|
||||
if len(vectors_data):
|
||||
nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys)
|
||||
if prune_vectors >= 1:
|
||||
nlp.vocab.prune_vectors(prune_vectors)
|
||||
vec_added = len(nlp.vocab.vectors)
|
||||
|
||||
prints("{} entries, {} vectors".format(lex_added, vec_added),
|
||||
title="Sucessfully compiled vocab")
|
||||
prints(Messages.M039.format(entries=lex_added, vectors=vec_added),
|
||||
title=Messages.M038)
|
||||
return nlp
|
||||
|
||||
|
||||
|
@ -114,8 +116,7 @@ def read_vectors(vectors_loc):
|
|||
pieces = line.rsplit(' ', vectors_data.shape[1]+1)
|
||||
word = pieces.pop(0)
|
||||
if len(pieces) != vectors_data.shape[1]:
|
||||
print(word, repr(line))
|
||||
raise ValueError("Bad line in file")
|
||||
raise ValueError(Errors.E094.format(line_num=i, loc=vectors_loc))
|
||||
vectors_data[i] = numpy.asarray(pieces, dtype='f')
|
||||
vectors_keys.append(word)
|
||||
return vectors_data, vectors_keys
|
||||
|
@ -150,11 +151,14 @@ def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50):
|
|||
def read_clusters(clusters_loc):
|
||||
print("Reading clusters...")
|
||||
clusters = {}
|
||||
if ftfy is None:
|
||||
user_warning(Warnings.W004)
|
||||
with clusters_loc.open() as f:
|
||||
for line in tqdm(f):
|
||||
try:
|
||||
cluster, word, freq = line.split()
|
||||
word = fix_text(word)
|
||||
if ftfy is not None:
|
||||
word = ftfy.fix_text(word)
|
||||
except ValueError:
|
||||
continue
|
||||
# If the clusterer has only seen the word a few times, its
|
||||
|
|
|
@ -4,6 +4,7 @@ from __future__ import unicode_literals
|
|||
import plac
|
||||
from pathlib import Path
|
||||
|
||||
from ._messages import Messages
|
||||
from ..compat import symlink_to, path2str
|
||||
from ..util import prints
|
||||
from .. import util
|
||||
|
@ -24,40 +25,29 @@ def link(origin, link_name, force=False, model_path=None):
|
|||
else:
|
||||
model_path = Path(origin) if model_path is None else Path(model_path)
|
||||
if not model_path.exists():
|
||||
prints("The data should be located in %s" % path2str(model_path),
|
||||
title="Can't locate model data", exits=1)
|
||||
prints(Messages.M009.format(path=path2str(model_path)),
|
||||
title=Messages.M008, exits=1)
|
||||
data_path = util.get_data_path()
|
||||
if not data_path or not data_path.exists():
|
||||
spacy_loc = Path(__file__).parent.parent
|
||||
prints("Make sure a directory `/data` exists within your spaCy "
|
||||
"installation and try again. The data directory should be "
|
||||
"located here:", path2str(spacy_loc), exits=1,
|
||||
title="Can't find the spaCy data path to create model symlink")
|
||||
prints(Messages.M011, spacy_loc, title=Messages.M010, exits=1)
|
||||
link_path = util.get_data_path() / link_name
|
||||
if link_path.is_symlink() and not force:
|
||||
prints("To overwrite an existing link, use the --force flag.",
|
||||
title="Link %s already exists" % link_name, exits=1)
|
||||
prints(Messages.M013, title=Messages.M012.format(name=link_name),
|
||||
exits=1)
|
||||
elif link_path.is_symlink(): # does a symlink exist?
|
||||
# NB: It's important to check for is_symlink here and not for exists,
|
||||
# because invalid/outdated symlinks would return False otherwise.
|
||||
link_path.unlink()
|
||||
elif link_path.exists(): # does it exist otherwise?
|
||||
# NB: Check this last because valid symlinks also "exist".
|
||||
prints("This can happen if your data directory contains a directory "
|
||||
"or file of the same name.", link_path,
|
||||
title="Can't overwrite symlink %s" % link_name, exits=1)
|
||||
prints(Messages.M015, link_path,
|
||||
title=Messages.M014.format(name=link_name), exits=1)
|
||||
msg = "%s --> %s" % (path2str(model_path), path2str(link_path))
|
||||
try:
|
||||
symlink_to(link_path, model_path)
|
||||
except:
|
||||
# This is quite dirty, but just making sure other errors are caught.
|
||||
prints("Creating a symlink in spacy/data failed. Make sure you have "
|
||||
"the required permissions and try re-running the command as "
|
||||
"admin, or use a virtualenv. You can still import the model as "
|
||||
"a module and call its load() method, or create the symlink "
|
||||
"manually.",
|
||||
"%s --> %s" % (path2str(model_path), path2str(link_path)),
|
||||
title="Error: Couldn't link model to '%s'" % link_name)
|
||||
prints(Messages.M017, msg, title=Messages.M016.format(name=link_name))
|
||||
raise
|
||||
prints("%s --> %s" % (path2str(model_path), path2str(link_path)),
|
||||
"You can now load the model via spacy.load('%s')" % link_name,
|
||||
title="Linking successful")
|
||||
prints(msg, Messages.M019.format(name=link_name), title=Messages.M018)
|
||||
|
|
|
@ -5,6 +5,7 @@ import plac
|
|||
import shutil
|
||||
from pathlib import Path
|
||||
|
||||
from ._messages import Messages
|
||||
from ..compat import path2str, json_dumps
|
||||
from ..util import prints
|
||||
from .. import util
|
||||
|
@ -31,17 +32,17 @@ def package(input_dir, output_dir, meta_path=None, create_meta=False,
|
|||
output_path = util.ensure_path(output_dir)
|
||||
meta_path = util.ensure_path(meta_path)
|
||||
if not input_path or not input_path.exists():
|
||||
prints(input_path, title="Model directory not found", exits=1)
|
||||
prints(input_path, title=Messages.M008, exits=1)
|
||||
if not output_path or not output_path.exists():
|
||||
prints(output_path, title="Output directory not found", exits=1)
|
||||
prints(output_path, title=Messages.M040, exits=1)
|
||||
if meta_path and not meta_path.exists():
|
||||
prints(meta_path, title="meta.json not found", exits=1)
|
||||
prints(meta_path, title=Messages.M020, exits=1)
|
||||
|
||||
meta_path = meta_path or input_path / 'meta.json'
|
||||
if meta_path.is_file():
|
||||
meta = util.read_json(meta_path)
|
||||
if not create_meta: # only print this if user doesn't want to overwrite
|
||||
prints(meta_path, title="Loaded meta.json from file")
|
||||
prints(meta_path, title=Messages.M041)
|
||||
else:
|
||||
meta = generate_meta(input_dir, meta)
|
||||
meta = validate_meta(meta, ['lang', 'name', 'version'])
|
||||
|
@ -57,9 +58,8 @@ def package(input_dir, output_dir, meta_path=None, create_meta=False,
|
|||
create_file(main_path / 'setup.py', TEMPLATE_SETUP)
|
||||
create_file(main_path / 'MANIFEST.in', TEMPLATE_MANIFEST)
|
||||
create_file(package_path / '__init__.py', TEMPLATE_INIT)
|
||||
prints(main_path, "To build the package, run `python setup.py sdist` in "
|
||||
"this directory.",
|
||||
title="Successfully created package '%s'" % model_name_v)
|
||||
prints(main_path, Messages.M043,
|
||||
title=Messages.M042.format(name=model_name_v))
|
||||
|
||||
|
||||
def create_dirs(package_path, force):
|
||||
|
@ -67,10 +67,7 @@ def create_dirs(package_path, force):
|
|||
if force:
|
||||
shutil.rmtree(path2str(package_path))
|
||||
else:
|
||||
prints(package_path, "Please delete the directory and try again, "
|
||||
"or use the --force flag to overwrite existing "
|
||||
"directories.", title="Package directory already exists",
|
||||
exits=1)
|
||||
prints(package_path, Messages.M045, title=Messages.M044, exits=1)
|
||||
Path.mkdir(package_path, parents=True)
|
||||
|
||||
|
||||
|
@ -97,9 +94,7 @@ def generate_meta(model_path, existing_meta):
|
|||
meta['vectors'] = {'width': nlp.vocab.vectors_length,
|
||||
'vectors': len(nlp.vocab.vectors),
|
||||
'keys': nlp.vocab.vectors.n_keys}
|
||||
prints("Enter the package settings for your model. The following "
|
||||
"information will be read from your model data: pipeline, vectors.",
|
||||
title="Generating meta.json")
|
||||
prints(Messages.M047, title=Messages.M046)
|
||||
for setting, desc, default in settings:
|
||||
response = util.get_raw_input(desc, default)
|
||||
meta[setting] = default if response == '' and default else response
|
||||
|
@ -111,8 +106,7 @@ def generate_meta(model_path, existing_meta):
|
|||
def validate_meta(meta, keys):
|
||||
for key in keys:
|
||||
if key not in meta or meta[key] == '':
|
||||
prints("This setting is required to build your package.",
|
||||
title='No "%s" setting found in meta.json' % key, exits=1)
|
||||
prints(Messages.M049, title=Messages.M048.format(key=key), exits=1)
|
||||
return meta
|
||||
|
||||
|
||||
|
|
|
@ -7,6 +7,7 @@ import tqdm
|
|||
from thinc.neural._classes.model import Model
|
||||
from timeit import default_timer as timer
|
||||
|
||||
from ._messages import Messages
|
||||
from ..attrs import PROB, IS_OOV, CLUSTER, LANG
|
||||
from ..gold import GoldCorpus
|
||||
from ..util import prints, minibatch, minibatch_by_words
|
||||
|
@ -52,15 +53,15 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
|
|||
dev_path = util.ensure_path(dev_data)
|
||||
meta_path = util.ensure_path(meta_path)
|
||||
if not train_path.exists():
|
||||
prints(train_path, title="Training data not found", exits=1)
|
||||
prints(train_path, title=Messages.M050, exits=1)
|
||||
if dev_path and not dev_path.exists():
|
||||
prints(dev_path, title="Development data not found", exits=1)
|
||||
prints(dev_path, title=Messages.M051, exits=1)
|
||||
if meta_path is not None and not meta_path.exists():
|
||||
prints(meta_path, title="meta.json not found", exits=1)
|
||||
prints(meta_path, title=Messages.M020, exits=1)
|
||||
meta = util.read_json(meta_path) if meta_path else {}
|
||||
if not isinstance(meta, dict):
|
||||
prints("Expected dict but got: {}".format(type(meta)),
|
||||
title="Not a valid meta.json format", exits=1)
|
||||
prints(Messages.M053.format(meta_type=type(meta)),
|
||||
title=Messages.M052, exits=1)
|
||||
meta.setdefault('lang', lang)
|
||||
meta.setdefault('name', 'unnamed')
|
||||
|
||||
|
@ -94,6 +95,7 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
|
|||
meta['pipeline'] = pipeline
|
||||
nlp.meta.update(meta)
|
||||
if vectors:
|
||||
print("Load vectors model", vectors)
|
||||
util.load_model(vectors, vocab=nlp.vocab)
|
||||
for lex in nlp.vocab:
|
||||
values = {}
|
||||
|
|
315
spacy/cli/ud_run_test.py
Normal file
315
spacy/cli/ud_run_test.py
Normal file
|
@ -0,0 +1,315 @@
|
|||
'''Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes
|
||||
.conllu format for development data, allowing the official scorer to be used.
|
||||
'''
|
||||
from __future__ import unicode_literals
|
||||
import plac
|
||||
import tqdm
|
||||
from pathlib import Path
|
||||
import re
|
||||
import sys
|
||||
import json
|
||||
|
||||
import spacy
|
||||
import spacy.util
|
||||
from ..tokens import Token, Doc
|
||||
from ..gold import GoldParse
|
||||
from ..util import compounding, minibatch_by_words
|
||||
from ..syntax.nonproj import projectivize
|
||||
from ..matcher import Matcher
|
||||
from ..morphology import Fused_begin, Fused_inside
|
||||
from .. import displacy
|
||||
from collections import defaultdict, Counter
|
||||
from timeit import default_timer as timer
|
||||
|
||||
import itertools
|
||||
import random
|
||||
import numpy.random
|
||||
import cytoolz
|
||||
|
||||
from . import conll17_ud_eval
|
||||
|
||||
from .. import lang
|
||||
from .. import lang
|
||||
from ..lang import zh
|
||||
from ..lang import ja
|
||||
from ..lang import ru
|
||||
|
||||
|
||||
################
|
||||
# Data reading #
|
||||
################
|
||||
|
||||
space_re = re.compile('\s+')
|
||||
def split_text(text):
|
||||
return [space_re.sub(' ', par.strip()) for par in text.split('\n\n')]
|
||||
|
||||
|
||||
##############
|
||||
# Evaluation #
|
||||
##############
|
||||
|
||||
def read_conllu(file_):
|
||||
docs = []
|
||||
sent = []
|
||||
doc = []
|
||||
for line in file_:
|
||||
if line.startswith('# newdoc'):
|
||||
if doc:
|
||||
docs.append(doc)
|
||||
doc = []
|
||||
elif line.startswith('#'):
|
||||
continue
|
||||
elif not line.strip():
|
||||
if sent:
|
||||
doc.append(sent)
|
||||
sent = []
|
||||
else:
|
||||
sent.append(list(line.strip().split('\t')))
|
||||
if len(sent[-1]) != 10:
|
||||
print(repr(line))
|
||||
raise ValueError
|
||||
if sent:
|
||||
doc.append(sent)
|
||||
if doc:
|
||||
docs.append(doc)
|
||||
return docs
|
||||
|
||||
|
||||
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
|
||||
if text_loc.parts[-1].endswith('.conllu'):
|
||||
docs = []
|
||||
with text_loc.open() as file_:
|
||||
for conllu_doc in read_conllu(file_):
|
||||
for conllu_sent in conllu_doc:
|
||||
words = [line[1] for line in conllu_sent]
|
||||
docs.append(Doc(nlp.vocab, words=words))
|
||||
for name, component in nlp.pipeline:
|
||||
docs = list(component.pipe(docs))
|
||||
else:
|
||||
with text_loc.open('r', encoding='utf8') as text_file:
|
||||
texts = split_text(text_file.read())
|
||||
docs = list(nlp.pipe(texts))
|
||||
with sys_loc.open('w', encoding='utf8') as out_file:
|
||||
write_conllu(docs, out_file)
|
||||
with gold_loc.open('r', encoding='utf8') as gold_file:
|
||||
gold_ud = conll17_ud_eval.load_conllu(gold_file)
|
||||
with sys_loc.open('r', encoding='utf8') as sys_file:
|
||||
sys_ud = conll17_ud_eval.load_conllu(sys_file)
|
||||
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
|
||||
return docs, scores
|
||||
|
||||
|
||||
def write_conllu(docs, file_):
|
||||
merger = Matcher(docs[0].vocab)
|
||||
merger.add('SUBTOK', None, [{'DEP': 'subtok', 'op': '+'}])
|
||||
for i, doc in enumerate(docs):
|
||||
matches = merger(doc)
|
||||
spans = [doc[start:end+1] for _, start, end in matches]
|
||||
offsets = [(span.start_char, span.end_char) for span in spans]
|
||||
for start_char, end_char in offsets:
|
||||
doc.merge(start_char, end_char)
|
||||
# TODO: This shuldn't be necessary? Should be handled in merge
|
||||
for word in doc:
|
||||
if word.i == word.head.i:
|
||||
word.dep_ = 'ROOT'
|
||||
file_.write("# newdoc id = {i}\n".format(i=i))
|
||||
for j, sent in enumerate(doc.sents):
|
||||
file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
|
||||
file_.write("# text = {text}\n".format(text=sent.text))
|
||||
for k, token in enumerate(sent):
|
||||
file_.write(_get_token_conllu(token, k, len(sent)) + '\n')
|
||||
file_.write('\n')
|
||||
for word in sent:
|
||||
if word.head.i == word.i and word.dep_ == 'ROOT':
|
||||
break
|
||||
else:
|
||||
print("Rootless sentence!")
|
||||
print(sent)
|
||||
print(i)
|
||||
for w in sent:
|
||||
print(w.i, w.text, w.head.text, w.head.i, w.dep_)
|
||||
raise ValueError
|
||||
|
||||
|
||||
def _get_token_conllu(token, k, sent_len):
|
||||
if token.check_morph(Fused_begin) and (k+1 < sent_len):
|
||||
n = 1
|
||||
text = [token.text]
|
||||
while token.nbor(n).check_morph(Fused_inside):
|
||||
text.append(token.nbor(n).text)
|
||||
n += 1
|
||||
id_ = '%d-%d' % (k+1, (k+n))
|
||||
fields = [id_, ''.join(text)] + ['_'] * 8
|
||||
lines = ['\t'.join(fields)]
|
||||
else:
|
||||
lines = []
|
||||
if token.head.i == token.i:
|
||||
head = 0
|
||||
else:
|
||||
head = k + (token.head.i - token.i) + 1
|
||||
fields = [str(k+1), token.text, token.lemma_, token.pos_, token.tag_, '_',
|
||||
str(head), token.dep_.lower(), '_', '_']
|
||||
if token.check_morph(Fused_begin) and (k+1 < sent_len):
|
||||
if k == 0:
|
||||
fields[1] = token.norm_[0].upper() + token.norm_[1:]
|
||||
else:
|
||||
fields[1] = token.norm_
|
||||
elif token.check_morph(Fused_inside):
|
||||
fields[1] = token.norm_
|
||||
elif token._.split_start is not None:
|
||||
split_start = token._.split_start
|
||||
split_end = token._.split_end
|
||||
split_len = (split_end.i - split_start.i) + 1
|
||||
n_in_split = token.i - split_start.i
|
||||
subtokens = guess_fused_orths(split_start.text, [''] * split_len)
|
||||
fields[1] = subtokens[n_in_split]
|
||||
|
||||
lines.append('\t'.join(fields))
|
||||
return '\n'.join(lines)
|
||||
|
||||
|
||||
def guess_fused_orths(word, ud_forms):
|
||||
'''The UD data 'fused tokens' don't necessarily expand to keys that match
|
||||
the form. We need orths that exact match the string. Here we make a best
|
||||
effort to divide up the word.'''
|
||||
if word == ''.join(ud_forms):
|
||||
# Happy case: we get a perfect split, with each letter accounted for.
|
||||
return ud_forms
|
||||
elif len(word) == sum(len(subtoken) for subtoken in ud_forms):
|
||||
# Unideal, but at least lengths match.
|
||||
output = []
|
||||
remain = word
|
||||
for subtoken in ud_forms:
|
||||
assert len(subtoken) >= 1
|
||||
output.append(remain[:len(subtoken)])
|
||||
remain = remain[len(subtoken):]
|
||||
assert len(remain) == 0, (word, ud_forms, remain)
|
||||
return output
|
||||
else:
|
||||
# Let's say word is 6 long, and there are three subtokens. The orths
|
||||
# *must* equal the original string. Arbitrarily, split [4, 1, 1]
|
||||
first = word[:len(word)-(len(ud_forms)-1)]
|
||||
output = [first]
|
||||
remain = word[len(first):]
|
||||
for i in range(1, len(ud_forms)):
|
||||
assert remain
|
||||
output.append(remain[:1])
|
||||
remain = remain[1:]
|
||||
assert len(remain) == 0, (word, output, remain)
|
||||
return output
|
||||
|
||||
|
||||
|
||||
def print_results(name, ud_scores):
|
||||
fields = {}
|
||||
if ud_scores is not None:
|
||||
fields.update({
|
||||
'words': ud_scores['Words'].f1 * 100,
|
||||
'sents': ud_scores['Sentences'].f1 * 100,
|
||||
'tags': ud_scores['XPOS'].f1 * 100,
|
||||
'uas': ud_scores['UAS'].f1 * 100,
|
||||
'las': ud_scores['LAS'].f1 * 100,
|
||||
})
|
||||
else:
|
||||
fields.update({
|
||||
'words': 0.0,
|
||||
'sents': 0.0,
|
||||
'tags': 0.0,
|
||||
'uas': 0.0,
|
||||
'las': 0.0
|
||||
})
|
||||
tpl = '\t'.join((
|
||||
name,
|
||||
'{las:.1f}',
|
||||
'{uas:.1f}',
|
||||
'{tags:.1f}',
|
||||
'{sents:.1f}',
|
||||
'{words:.1f}',
|
||||
))
|
||||
print(tpl.format(**fields))
|
||||
return fields
|
||||
|
||||
|
||||
def get_token_split_start(token):
|
||||
if token.text == '':
|
||||
assert token.i != 0
|
||||
i = -1
|
||||
while token.nbor(i).text == '':
|
||||
i -= 1
|
||||
return token.nbor(i)
|
||||
elif (token.i+1) < len(token.doc) and token.nbor(1).text == '':
|
||||
return token
|
||||
else:
|
||||
return None
|
||||
|
||||
|
||||
def get_token_split_end(token):
|
||||
if (token.i+1) == len(token.doc):
|
||||
return token if token.text == '' else None
|
||||
elif token.text != '' and token.nbor(1).text != '':
|
||||
return None
|
||||
i = 1
|
||||
while (token.i+i) < len(token.doc) and token.nbor(i).text == '':
|
||||
i += 1
|
||||
return token.nbor(i-1)
|
||||
|
||||
|
||||
Token.set_extension('split_start', getter=get_token_split_start)
|
||||
Token.set_extension('split_end', getter=get_token_split_end)
|
||||
Token.set_extension('begins_fused', default=False)
|
||||
Token.set_extension('inside_fused', default=False)
|
||||
|
||||
|
||||
##################
|
||||
# Initialization #
|
||||
##################
|
||||
|
||||
|
||||
def load_nlp(experiments_dir, corpus):
|
||||
nlp = spacy.load(experiments_dir / corpus / 'best-model')
|
||||
return nlp
|
||||
|
||||
def initialize_pipeline(nlp, docs, golds, config, device):
|
||||
nlp.add_pipe(nlp.create_pipe('parser'))
|
||||
return nlp
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
test_data_dir=("Path to Universal Dependencies test data", "positional", None, Path),
|
||||
experiment_dir=("Parent directory with output model", "positional", None, Path),
|
||||
corpus=("UD corpus to evaluate, e.g. UD_English, UD_Spanish, etc", "positional", None, str),
|
||||
)
|
||||
def main(test_data_dir, experiment_dir, corpus):
|
||||
lang.zh.Chinese.Defaults.use_jieba = False
|
||||
lang.ja.Japanese.Defaults.use_janome = False
|
||||
lang.ru.Russian.Defaults.use_pymorphy2 = False
|
||||
|
||||
nlp = load_nlp(experiment_dir, corpus)
|
||||
|
||||
treebank_code = nlp.meta['treebank']
|
||||
for section in ('test', 'dev'):
|
||||
if section == 'dev':
|
||||
section_dir = 'conll17-ud-development-2017-03-19'
|
||||
else:
|
||||
section_dir = 'conll17-ud-test-2017-05-09'
|
||||
text_path = test_data_dir / 'input' / section_dir / (treebank_code+'.txt')
|
||||
udpipe_path = test_data_dir / 'input' / section_dir / (treebank_code+'-udpipe.conllu')
|
||||
gold_path = test_data_dir / 'gold' / section_dir / (treebank_code+'.conllu')
|
||||
|
||||
header = [section, 'LAS', 'UAS', 'TAG', 'SENT', 'WORD']
|
||||
print('\t'.join(header))
|
||||
inputs = {'gold': gold_path, 'udp': udpipe_path, 'raw': text_path}
|
||||
for input_type in ('udp', 'raw'):
|
||||
input_path = inputs[input_type]
|
||||
output_path = experiment_dir / corpus / '{section}.conllu'.format(section=section)
|
||||
|
||||
parsed_docs, test_scores = evaluate(nlp, input_path, gold_path, output_path)
|
||||
|
||||
accuracy = print_results(input_type, test_scores)
|
||||
acc_path = experiment_dir / corpus / '{section}-accuracy.json'.format(section=section)
|
||||
with open(acc_path, 'w') as file_:
|
||||
file_.write(json.dumps(accuracy, indent=2))
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
|
@ -247,12 +247,18 @@ Token.set_extension('inside_fused', default=False)
|
|||
##################
|
||||
|
||||
|
||||
def load_nlp(corpus, config):
|
||||
def load_nlp(corpus, config, vectors=None):
|
||||
lang = corpus.split('_')[0]
|
||||
nlp = spacy.blank(lang)
|
||||
if config.vectors:
|
||||
nlp.vocab.from_disk(Path(config.vectors) / 'vocab')
|
||||
if not vectors:
|
||||
raise ValueError("config asks for vectors, but no vectors "
|
||||
"directory set on command line (use -v)")
|
||||
if (Path(vectors) / corpus).exists():
|
||||
nlp.vocab.from_disk(Path(vectors) / corpus / 'vocab')
|
||||
nlp.meta['treebank'] = corpus
|
||||
return nlp
|
||||
|
||||
|
||||
def initialize_pipeline(nlp, docs, golds, config, device):
|
||||
nlp.add_pipe(nlp.create_pipe('parser'))
|
||||
|
@ -274,10 +280,12 @@ def initialize_pipeline(nlp, docs, golds, config, device):
|
|||
|
||||
class Config(object):
|
||||
def __init__(self, vectors=None, max_doc_length=10, multitask_tag=True,
|
||||
multitask_sent=True, nr_epoch=30, batch_size=1000, dropout=0.2):
|
||||
multitask_sent=True, multitask_dep=True, multitask_vectors=False,
|
||||
nr_epoch=30, batch_size=1000, dropout=0.2):
|
||||
for key, value in locals().items():
|
||||
setattr(self, key, value)
|
||||
|
||||
|
||||
|
||||
@classmethod
|
||||
def load(cls, loc):
|
||||
with Path(loc).open('r', encoding='utf8') as file_:
|
||||
|
@ -319,9 +327,11 @@ class TreebankPaths(object):
|
|||
parses_dir=("Directory to write the development parses", "positional", None, Path),
|
||||
config=("Path to json formatted config file", "positional"),
|
||||
limit=("Size limit", "option", "n", int),
|
||||
use_gpu=("Use GPU", "option", "g", int)
|
||||
use_gpu=("Use GPU", "option", "g", int),
|
||||
vectors_dir=("Path to directory with pre-trained vectors, named e.g. en/",
|
||||
"option", "v", Path),
|
||||
)
|
||||
def main(ud_dir, parses_dir, config, corpus, limit=0, use_gpu=-1):
|
||||
def main(ud_dir, parses_dir, config, corpus, limit=0, use_gpu=-1, vectors_dir=None):
|
||||
spacy.util.fix_random_seed()
|
||||
lang.zh.Chinese.Defaults.use_jieba = False
|
||||
lang.ja.Japanese.Defaults.use_janome = False
|
||||
|
@ -331,7 +341,7 @@ def main(ud_dir, parses_dir, config, corpus, limit=0, use_gpu=-1):
|
|||
if not (parses_dir / corpus).exists():
|
||||
(parses_dir / corpus).mkdir()
|
||||
print("Train and evaluate", corpus, "using lang", paths.lang)
|
||||
nlp = load_nlp(paths.lang, config)
|
||||
nlp = load_nlp(paths.lang, config, vectors=vectors_dir)
|
||||
|
||||
docs, golds = read_data(nlp, paths.train.conllu.open(), paths.train.text.open(),
|
||||
max_doc_length=config.max_doc_length, limit=limit)
|
||||
|
|
|
@ -1,12 +1,13 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import requests
|
||||
import pkg_resources
|
||||
from pathlib import Path
|
||||
import sys
|
||||
import ujson
|
||||
|
||||
from ..compat import path2str, locale_escape
|
||||
from ._messages import Messages
|
||||
from ..compat import path2str, locale_escape, url_read, HTTPError
|
||||
from ..util import prints, get_data_path, read_json
|
||||
from .. import about
|
||||
|
||||
|
@ -15,16 +16,16 @@ def validate():
|
|||
"""Validate that the currently installed version of spaCy is compatible
|
||||
with the installed models. Should be run after `pip install -U spacy`.
|
||||
"""
|
||||
r = requests.get(about.__compatibility__)
|
||||
if r.status_code != 200:
|
||||
prints("Couldn't fetch compatibility table.",
|
||||
title="Server error (%d)" % r.status_code, exits=1)
|
||||
compat = r.json()['spacy']
|
||||
try:
|
||||
data = url_read(about.__compatibility__)
|
||||
except HTTPError as e:
|
||||
title = Messages.M003.format(code=e.code, desc=e.reason)
|
||||
prints(Messages.M021, title=title, exits=1)
|
||||
compat = ujson.loads(data)['spacy']
|
||||
current_compat = compat.get(about.__version__)
|
||||
if not current_compat:
|
||||
prints(about.__compatibility__, exits=1,
|
||||
title="Can't find spaCy v{} in compatibility table"
|
||||
.format(about.__version__))
|
||||
title=Messages.M022.format(version=about.__version__))
|
||||
all_models = set()
|
||||
for spacy_v, models in dict(compat).items():
|
||||
all_models.update(models.keys())
|
||||
|
@ -41,7 +42,7 @@ def validate():
|
|||
update_models = [m for m in incompat_models if m in current_compat]
|
||||
|
||||
prints(path2str(Path(__file__).parent.parent),
|
||||
title="Installed models (spaCy v{})".format(about.__version__))
|
||||
title=Messages.M023.format(version=about.__version__))
|
||||
if model_links or model_pkgs:
|
||||
print(get_row('TYPE', 'NAME', 'MODEL', 'VERSION', ''))
|
||||
for name, data in model_pkgs.items():
|
||||
|
@ -49,23 +50,16 @@ def validate():
|
|||
for name, data in model_links.items():
|
||||
print(get_model_row(current_compat, name, data, 'link'))
|
||||
else:
|
||||
prints("No models found in your current environment.", exits=0)
|
||||
|
||||
prints(Messages.M024, exits=0)
|
||||
if update_models:
|
||||
cmd = ' python -m spacy download {}'
|
||||
print("\n Use the following commands to update the model packages:")
|
||||
print("\n " + Messages.M025)
|
||||
print('\n'.join([cmd.format(pkg) for pkg in update_models]))
|
||||
|
||||
if na_models:
|
||||
prints("The following models are not available for spaCy v{}: {}"
|
||||
.format(about.__version__, ', '.join(na_models)))
|
||||
|
||||
prints(Messages.M025.format(version=about.__version__,
|
||||
models=', '.join(na_models)))
|
||||
if incompat_links:
|
||||
prints("You may also want to overwrite the incompatible links using "
|
||||
"the `python -m spacy link` command with `--force`, or remove "
|
||||
"them from the data directory. Data path: {}"
|
||||
.format(path2str(get_data_path())))
|
||||
|
||||
prints(Messages.M027.format(path=path2str(get_data_path())))
|
||||
if incompat_models or incompat_links:
|
||||
sys.exit(1)
|
||||
|
||||
|
|
|
@ -1,7 +1,6 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import ftfy
|
||||
import sys
|
||||
import ujson
|
||||
import itertools
|
||||
|
@ -34,11 +33,20 @@ try:
|
|||
except ImportError:
|
||||
from thinc.neural.optimizers import Adam as Optimizer
|
||||
|
||||
try:
|
||||
import urllib.request
|
||||
except ImportError:
|
||||
import urllib2 as urllib
|
||||
|
||||
try:
|
||||
from urllib.error import HTTPError
|
||||
except ImportError:
|
||||
from urllib2 import HTTPError
|
||||
|
||||
pickle = pickle
|
||||
copy_reg = copy_reg
|
||||
CudaStream = CudaStream
|
||||
cupy = cupy
|
||||
fix_text = ftfy.fix_text
|
||||
copy_array = copy_array
|
||||
izip = getattr(itertools, 'izip', zip)
|
||||
|
||||
|
@ -58,6 +66,7 @@ if is_python2:
|
|||
input_ = raw_input # noqa: F821
|
||||
json_dumps = lambda data: ujson.dumps(data, indent=2, escape_forward_slashes=False).decode('utf8')
|
||||
path2str = lambda path: str(path).decode('utf8')
|
||||
url_open = urllib.urlopen
|
||||
|
||||
elif is_python3:
|
||||
bytes_ = bytes
|
||||
|
@ -66,6 +75,16 @@ elif is_python3:
|
|||
input_ = input
|
||||
json_dumps = lambda data: ujson.dumps(data, indent=2, escape_forward_slashes=False)
|
||||
path2str = lambda path: str(path)
|
||||
url_open = urllib.request.urlopen
|
||||
|
||||
|
||||
def url_read(url):
|
||||
file_ = url_open(url)
|
||||
code = file_.getcode()
|
||||
if code != 200:
|
||||
raise HTTPError(url, code, "Cannot GET url", [], file_)
|
||||
data = file_.read()
|
||||
return data
|
||||
|
||||
|
||||
def b_to_str(b_str):
|
||||
|
|
|
@ -4,6 +4,7 @@ from __future__ import unicode_literals
|
|||
from .render import DependencyRenderer, EntityRenderer
|
||||
from ..tokens import Doc
|
||||
from ..compat import b_to_str
|
||||
from ..errors import Errors, Warnings, user_warning
|
||||
from ..util import prints, is_in_jupyter
|
||||
|
||||
|
||||
|
@ -27,7 +28,7 @@ def render(docs, style='dep', page=False, minify=False, jupyter=IS_JUPYTER,
|
|||
factories = {'dep': (DependencyRenderer, parse_deps),
|
||||
'ent': (EntityRenderer, parse_ents)}
|
||||
if style not in factories:
|
||||
raise ValueError("Unknown style: %s" % style)
|
||||
raise ValueError(Errors.E087.format(style=style))
|
||||
if isinstance(docs, Doc) or isinstance(docs, dict):
|
||||
docs = [docs]
|
||||
renderer, converter = factories[style]
|
||||
|
@ -57,12 +58,12 @@ def serve(docs, style='dep', page=True, minify=False, options={}, manual=False,
|
|||
render(docs, style=style, page=page, minify=minify, options=options,
|
||||
manual=manual)
|
||||
httpd = simple_server.make_server('0.0.0.0', port, app)
|
||||
prints("Using the '%s' visualizer" % style,
|
||||
title="Serving on port %d..." % port)
|
||||
prints("Using the '{}' visualizer".format(style),
|
||||
title="Serving on port {}...".format(port))
|
||||
try:
|
||||
httpd.serve_forever()
|
||||
except KeyboardInterrupt:
|
||||
prints("Shutting down server on port %d." % port)
|
||||
prints("Shutting down server on port {}.".format(port))
|
||||
finally:
|
||||
httpd.server_close()
|
||||
|
||||
|
@ -83,6 +84,12 @@ def parse_deps(orig_doc, options={}):
|
|||
RETURNS (dict): Generated dependency parse keyed by words and arcs.
|
||||
"""
|
||||
doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes())
|
||||
if not doc.is_parsed:
|
||||
user_warning(Warnings.W005)
|
||||
if options.get('collapse_phrases', False):
|
||||
for np in list(doc.noun_chunks):
|
||||
np.merge(tag=np.root.tag_, lemma=np.root.lemma_,
|
||||
ent_type=np.root.ent_type_)
|
||||
if options.get('collapse_punct', True):
|
||||
spans = []
|
||||
for word in doc[:-1]:
|
||||
|
@ -120,6 +127,8 @@ def parse_ents(doc, options={}):
|
|||
"""
|
||||
ents = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}
|
||||
for ent in doc.ents]
|
||||
if not ents:
|
||||
user_warning(Warnings.W006)
|
||||
title = (doc.user_data.get('title', None)
|
||||
if hasattr(doc, 'user_data') else None)
|
||||
return {'text': doc.text, 'ents': ents, 'title': title}
|
||||
|
|
313
spacy/errors.py
Normal file
313
spacy/errors.py
Normal file
|
@ -0,0 +1,313 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import os
|
||||
import warnings
|
||||
import inspect
|
||||
|
||||
|
||||
def add_codes(err_cls):
|
||||
"""Add error codes to string messages via class attribute names."""
|
||||
class ErrorsWithCodes(object):
|
||||
def __getattribute__(self, code):
|
||||
msg = getattr(err_cls, code)
|
||||
return '[{code}] {msg}'.format(code=code, msg=msg)
|
||||
return ErrorsWithCodes()
|
||||
|
||||
|
||||
@add_codes
|
||||
class Warnings(object):
|
||||
W001 = ("As of spaCy v2.0, the keyword argument `path=` is deprecated. "
|
||||
"You can now call spacy.load with the path as its first argument, "
|
||||
"and the model's meta.json will be used to determine the language "
|
||||
"to load. For example:\nnlp = spacy.load('{path}')")
|
||||
W002 = ("Tokenizer.from_list is now deprecated. Create a new Doc object "
|
||||
"instead and pass in the strings as the `words` keyword argument, "
|
||||
"for example:\nfrom spacy.tokens import Doc\n"
|
||||
"doc = Doc(nlp.vocab, words=[...])")
|
||||
W003 = ("Positional arguments to Doc.merge are deprecated. Instead, use "
|
||||
"the keyword arguments, for example tag=, lemma= or ent_type=.")
|
||||
W004 = ("No text fixing enabled. Run `pip install ftfy` to enable fixing "
|
||||
"using ftfy.fix_text if necessary.")
|
||||
W005 = ("Doc object not parsed. This means displaCy won't be able to "
|
||||
"generate a dependency visualization for it. Make sure the Doc "
|
||||
"was processed with a model that supports dependency parsing, and "
|
||||
"not just a language class like `English()`. For more info, see "
|
||||
"the docs:\nhttps://spacy.io/usage/models")
|
||||
W006 = ("No entities to visualize found in Doc object. If this is "
|
||||
"surprising to you, make sure the Doc was processed using a model "
|
||||
"that supports named entity recognition, and check the `doc.ents` "
|
||||
"property manually if necessary.")
|
||||
|
||||
|
||||
@add_codes
|
||||
class Errors(object):
|
||||
E001 = ("No component '{name}' found in pipeline. Available names: {opts}")
|
||||
E002 = ("Can't find factory for '{name}'. This usually happens when spaCy "
|
||||
"calls `nlp.create_pipe` with a component name that's not built "
|
||||
"in - for example, when constructing the pipeline from a model's "
|
||||
"meta.json. If you're using a custom component, you can write to "
|
||||
"`Language.factories['{name}']` or remove it from the model meta "
|
||||
"and add it via `nlp.add_pipe` instead.")
|
||||
E003 = ("Not a valid pipeline component. Expected callable, but "
|
||||
"got {component} (name: '{name}').")
|
||||
E004 = ("If you meant to add a built-in component, use `create_pipe`: "
|
||||
"`nlp.add_pipe(nlp.create_pipe('{component}'))`")
|
||||
E005 = ("Pipeline component '{name}' returned None. If you're using a "
|
||||
"custom component, maybe you forgot to return the processed Doc?")
|
||||
E006 = ("Invalid constraints. You can only set one of the following: "
|
||||
"before, after, first, last.")
|
||||
E007 = ("'{name}' already exists in pipeline. Existing names: {opts}")
|
||||
E008 = ("Some current components would be lost when restoring previous "
|
||||
"pipeline state. If you added components after calling "
|
||||
"`nlp.disable_pipes()`, you should remove them explicitly with "
|
||||
"`nlp.remove_pipe()` before the pipeline is restored. Names of "
|
||||
"the new components: {names}")
|
||||
E009 = ("The `update` method expects same number of docs and golds, but "
|
||||
"got: {n_docs} docs, {n_golds} golds.")
|
||||
E010 = ("Word vectors set to length 0. This may be because you don't have "
|
||||
"a model installed or loaded, or because your model doesn't "
|
||||
"include word vectors. For more info, see the docs:\n"
|
||||
"https://spacy.io/usage/models")
|
||||
E011 = ("Unknown operator: '{op}'. Options: {opts}")
|
||||
E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}")
|
||||
E013 = ("Error selecting action in matcher")
|
||||
E014 = ("Uknown tag ID: {tag}")
|
||||
E015 = ("Conflicting morphology exception for ({tag}, {orth}). Use "
|
||||
"`force=True` to overwrite.")
|
||||
E016 = ("MultitaskObjective target should be function or one of: dep, "
|
||||
"tag, ent, dep_tag_offset, ent_tag.")
|
||||
E017 = ("Can only add unicode or bytes. Got type: {value_type}")
|
||||
E018 = ("Can't retrieve string for hash '{hash_value}'.")
|
||||
E019 = ("Can't create transition with unknown action ID: {action}. Action "
|
||||
"IDs are enumerated in spacy/syntax/{src}.pyx.")
|
||||
E020 = ("Could not find a gold-standard action to supervise the "
|
||||
"dependency parser. The tree is non-projective (i.e. it has "
|
||||
"crossing arcs - see spacy/syntax/nonproj.pyx for definitions). "
|
||||
"The ArcEager transition system only supports projective trees. "
|
||||
"To learn non-projective representations, transform the data "
|
||||
"before training and after parsing. Either pass "
|
||||
"`make_projective=True` to the GoldParse class, or use "
|
||||
"spacy.syntax.nonproj.preprocess_training_data.")
|
||||
E021 = ("Could not find a gold-standard action to supervise the "
|
||||
"dependency parser. The GoldParse was projective. The transition "
|
||||
"system has {n_actions} actions. State at failure: {state}")
|
||||
E022 = ("Could not find a transition with the name '{name}' in the NER "
|
||||
"model.")
|
||||
E023 = ("Error cleaning up beam: The same state occurred twice at "
|
||||
"memory address {addr} and position {i}.")
|
||||
E024 = ("Could not find an optimal move to supervise the parser. Usually, "
|
||||
"this means the GoldParse was not correct. For example, are all "
|
||||
"labels added to the model?")
|
||||
E025 = ("String is too long: {length} characters. Max is 2**30.")
|
||||
E026 = ("Error accessing token at position {i}: out of bounds in Doc of "
|
||||
"length {length}.")
|
||||
E027 = ("Arguments 'words' and 'spaces' should be sequences of the same "
|
||||
"length, or 'spaces' should be left default at None. spaces "
|
||||
"should be a sequence of booleans, with True meaning that the "
|
||||
"word owns a ' ' character following it.")
|
||||
E028 = ("orths_and_spaces expects either a list of unicode string or a "
|
||||
"list of (unicode, bool) tuples. Got bytes instance: {value}")
|
||||
E029 = ("noun_chunks requires the dependency parse, which requires a "
|
||||
"statistical model to be installed and loaded. For more info, see "
|
||||
"the documentation:\nhttps://spacy.io/usage/models")
|
||||
E030 = ("Sentence boundaries unset. You can add the 'sentencizer' "
|
||||
"component to the pipeline with: "
|
||||
"nlp.add_pipe(nlp.create_pipe('sentencizer')) "
|
||||
"Alternatively, add the dependency parser, or set sentence "
|
||||
"boundaries by setting doc[i].is_sent_start.")
|
||||
E031 = ("Invalid token: empty string ('') at position {i}.")
|
||||
E032 = ("Conflicting attributes specified in doc.from_array(): "
|
||||
"(HEAD, SENT_START). The HEAD attribute currently sets sentence "
|
||||
"boundaries implicitly, based on the tree structure. This means "
|
||||
"the HEAD attribute would potentially override the sentence "
|
||||
"boundaries set by SENT_START.")
|
||||
E033 = ("Cannot load into non-empty Doc of length {length}.")
|
||||
E034 = ("Doc.merge received {n_args} non-keyword arguments. Expected "
|
||||
"either 3 arguments (deprecated), or 0 (use keyword arguments).\n"
|
||||
"Arguments supplied:\n{args}\nKeyword arguments:{kwargs}")
|
||||
E035 = ("Error creating span with start {start} and end {end} for Doc of "
|
||||
"length {length}.")
|
||||
E036 = ("Error calculating span: Can't find a token starting at character "
|
||||
"offset {start}.")
|
||||
E037 = ("Error calculating span: Can't find a token ending at character "
|
||||
"offset {end}.")
|
||||
E038 = ("Error finding sentence for span. Infinite loop detected.")
|
||||
E039 = ("Array bounds exceeded while searching for root word. This likely "
|
||||
"means the parse tree is in an invalid state. Please report this "
|
||||
"issue here: http://github.com/explosion/spaCy/issues")
|
||||
E040 = ("Attempt to access token at {i}, max length {max_length}.")
|
||||
E041 = ("Invalid comparison operator: {op}. Likely a Cython bug?")
|
||||
E042 = ("Error accessing doc[{i}].nbor({j}), for doc of length {length}.")
|
||||
E043 = ("Refusing to write to token.sent_start if its document is parsed, "
|
||||
"because this may cause inconsistent state.")
|
||||
E044 = ("Invalid value for token.sent_start: {value}. Must be one of: "
|
||||
"None, True, False")
|
||||
E045 = ("Possibly infinite loop encountered while looking for {attr}.")
|
||||
E046 = ("Can't retrieve unregistered extension attribute '{name}'. Did "
|
||||
"you forget to call the `set_extension` method?")
|
||||
E047 = ("Can't assign a value to unregistered extension attribute "
|
||||
"'{name}'. Did you forget to call the `set_extension` method?")
|
||||
E048 = ("Can't import language {lang} from spacy.lang.")
|
||||
E049 = ("Can't find spaCy data directory: '{path}'. Check your "
|
||||
"installation and permissions, or use spacy.util.set_data_path "
|
||||
"to customise the location if necessary.")
|
||||
E050 = ("Can't find model '{name}'. It doesn't seem to be a shortcut "
|
||||
"link, a Python package or a valid path to a data directory.")
|
||||
E051 = ("Cant' load '{name}'. If you're using a shortcut link, make sure "
|
||||
"it points to a valid package (not just a data directory).")
|
||||
E052 = ("Can't find model directory: {path}")
|
||||
E053 = ("Could not read meta.json from {path}")
|
||||
E054 = ("No valid '{setting}' setting found in model meta.json.")
|
||||
E055 = ("Invalid ORTH value in exception:\nKey: {key}\nOrths: {orths}")
|
||||
E056 = ("Invalid tokenizer exception: ORTH values combined don't match "
|
||||
"original string.\nKey: {key}\nOrths: {orths}")
|
||||
E057 = ("Stepped slices not supported in Span objects. Try: "
|
||||
"list(tokens)[start:stop:step] instead.")
|
||||
E058 = ("Could not retrieve vector for key {key}.")
|
||||
E059 = ("One (and only one) keyword arg must be set. Got: {kwargs}")
|
||||
E060 = ("Cannot add new key to vectors: the table is full. Current shape: "
|
||||
"({rows}, {cols}).")
|
||||
E061 = ("Bad file name: {filename}. Example of a valid file name: "
|
||||
"'vectors.128.f.bin'")
|
||||
E062 = ("Cannot find empty bit for new lexical flag. All bits between 0 "
|
||||
"and 63 are occupied. You can replace one by specifying the "
|
||||
"`flag_id` explicitly, e.g. "
|
||||
"`nlp.vocab.add_flag(your_func, flag_id=IS_ALPHA`.")
|
||||
E063 = ("Invalid value for flag_id: {value}. Flag IDs must be between 1 "
|
||||
"and 63 (inclusive).")
|
||||
E064 = ("Error fetching a Lexeme from the Vocab. When looking up a "
|
||||
"string, the lexeme returned had an orth ID that did not match "
|
||||
"the query string. This means that the cached lexeme structs are "
|
||||
"mismatched to the string encoding table. The mismatched:\n"
|
||||
"Query string: {string}\nOrth cached: {orth}\nOrth ID: {orth_id}")
|
||||
E065 = ("Only one of the vector table's width and shape can be specified. "
|
||||
"Got width {width} and shape {shape}.")
|
||||
E066 = ("Error creating model helper for extracting columns. Can only "
|
||||
"extract columns by positive integer. Got: {value}.")
|
||||
E067 = ("Invalid BILUO tag sequence: Got a tag starting with 'I' (inside "
|
||||
"an entity) without a preceding 'B' (beginning of an entity). "
|
||||
"Tag sequence:\n{tags}")
|
||||
E068 = ("Invalid BILUO tag: '{tag}'.")
|
||||
E069 = ("Invalid gold-standard parse tree. Found cycle between word "
|
||||
"IDs: {cycle}")
|
||||
E070 = ("Invalid gold-standard data. Number of documents ({n_docs}) "
|
||||
"does not align with number of annotations ({n_annots}).")
|
||||
E071 = ("Error creating lexeme: specified orth ID ({orth}) does not "
|
||||
"match the one in the vocab ({vocab_orth}).")
|
||||
E072 = ("Error serializing lexeme: expected data length {length}, "
|
||||
"got {bad_length}.")
|
||||
E073 = ("Cannot assign vector of length {new_length}. Existing vectors "
|
||||
"are of length {length}. You can use `vocab.reset_vectors` to "
|
||||
"clear the existing vectors and resize the table.")
|
||||
E074 = ("Error interpreting compiled match pattern: patterns are expected "
|
||||
"to end with the attribute {attr}. Got: {bad_attr}.")
|
||||
E075 = ("Error accepting match: length ({length}) > maximum length "
|
||||
"({max_len}).")
|
||||
E076 = ("Error setting tensor on Doc: tensor has {rows} rows, while Doc "
|
||||
"has {words} words.")
|
||||
E077 = ("Error computing {value}: number of Docs ({n_docs}) does not "
|
||||
"equal number of GoldParse objects ({n_golds}) in batch.")
|
||||
E078 = ("Error computing score: number of words in Doc ({words_doc}) does "
|
||||
"not equal number of words in GoldParse ({words_gold}).")
|
||||
E079 = ("Error computing states in beam: number of predicted beams "
|
||||
"({pbeams}) does not equal number of gold beams ({gbeams}).")
|
||||
E080 = ("Duplicate state found in beam: {key}.")
|
||||
E081 = ("Error getting gradient in beam: number of histories ({n_hist}) "
|
||||
"does not equal number of losses ({losses}).")
|
||||
E082 = ("Error deprojectivizing parse: number of heads ({n_heads}), "
|
||||
"projective heads ({n_proj_heads}) and labels ({n_labels}) do not "
|
||||
"match.")
|
||||
E083 = ("Error setting extension: only one of `default`, `method`, or "
|
||||
"`getter` (plus optional `setter`) is allowed. Got: {nr_defined}")
|
||||
E084 = ("Error assigning label ID {label} to span: not in StringStore.")
|
||||
E085 = ("Can't create lexeme for string '{string}'.")
|
||||
E086 = ("Error deserializing lexeme '{string}': orth ID {orth_id} does "
|
||||
"not match hash {hash_id} in StringStore.")
|
||||
E087 = ("Unknown displaCy style: {style}.")
|
||||
E088 = ("Text of length {length} exceeds maximum of {max_length}. The "
|
||||
"v2.x parser and NER models require roughly 1GB of temporary "
|
||||
"memory per 100,000 characters in the input. This means long "
|
||||
"texts may cause memory allocation errors. If you're not using "
|
||||
"the parser or NER, it's probably safe to increase the "
|
||||
"`nlp.max_length` limit. The limit is in number of characters, so "
|
||||
"you can check whether your inputs are too long by checking "
|
||||
"`len(text)`.")
|
||||
E089 = ("Extensions can't have a setter argument without a getter "
|
||||
"argument. Check the keyword arguments on `set_extension`.")
|
||||
E090 = ("Extension '{name}' already exists on {obj}. To overwrite the "
|
||||
"existing extension, set `force=True` on `{obj}.set_extension`.")
|
||||
E091 = ("Invalid extension attribute {name}: expected callable or None, "
|
||||
"but got: {value}")
|
||||
E092 = ("Could not find or assign name for word vectors. Ususally, the "
|
||||
"name is read from the model's meta.json in vector.name. "
|
||||
"Alternatively, it is built from the 'lang' and 'name' keys in "
|
||||
"the meta.json. Vector names are required to avoid issue #1660.")
|
||||
E093 = ("token.ent_iob values make invalid sequence: I without B\n{seq}")
|
||||
E094 = ("Error reading line {line_num} in vectors file {loc}.")
|
||||
|
||||
|
||||
@add_codes
|
||||
class TempErrors(object):
|
||||
T001 = ("Max length currently 10 for phrase matching")
|
||||
T002 = ("Pattern length ({doc_len}) >= phrase_matcher.max_length "
|
||||
"({max_len}). Length can be set on initialization, up to 10.")
|
||||
T003 = ("Resizing pre-trained Tagger models is not currently supported.")
|
||||
T004 = ("Currently parser depth is hard-coded to 1. Received: {value}.")
|
||||
T005 = ("Currently history size is hard-coded to 0. Received: {value}.")
|
||||
T006 = ("Currently history width is hard-coded to 0. Received: {value}.")
|
||||
T007 = ("Can't yet set {attr} from Span. Vote for this feature on the "
|
||||
"issue tracker: http://github.com/explosion/spaCy/issues")
|
||||
T008 = ("Bad configuration of Tagger. This is probably a bug within "
|
||||
"spaCy. We changed the name of an internal attribute for loading "
|
||||
"pre-trained vectors, and the class has been passed the old name "
|
||||
"(pretrained_dims) but not the new name (pretrained_vectors).")
|
||||
|
||||
|
||||
class ModelsWarning(UserWarning):
|
||||
pass
|
||||
|
||||
|
||||
WARNINGS = {
|
||||
'user': UserWarning,
|
||||
'deprecation': DeprecationWarning,
|
||||
'models': ModelsWarning,
|
||||
}
|
||||
|
||||
|
||||
def _get_warn_types(arg):
|
||||
if arg == '': # don't show any warnings
|
||||
return []
|
||||
if not arg or arg == 'all': # show all available warnings
|
||||
return WARNINGS.keys()
|
||||
return [w_type.strip() for w_type in arg.split(',')
|
||||
if w_type.strip() in WARNINGS]
|
||||
|
||||
|
||||
SPACY_WARNING_FILTER = os.environ.get('SPACY_WARNING_FILTER', 'always')
|
||||
SPACY_WARNING_TYPES = _get_warn_types(os.environ.get('SPACY_WARNING_TYPES'))
|
||||
|
||||
|
||||
def user_warning(message):
|
||||
_warn(message, 'user')
|
||||
|
||||
|
||||
def deprecation_warning(message):
|
||||
_warn(message, 'deprecation')
|
||||
|
||||
|
||||
def models_warning(message):
|
||||
_warn(message, 'models')
|
||||
|
||||
|
||||
def _warn(message, warn_type='user'):
|
||||
"""
|
||||
message (unicode): The message to display.
|
||||
category (Warning): The Warning to show.
|
||||
"""
|
||||
if warn_type in SPACY_WARNING_TYPES:
|
||||
category = WARNINGS[warn_type]
|
||||
stack = inspect.stack()[-1]
|
||||
with warnings.catch_warnings():
|
||||
warnings.simplefilter(SPACY_WARNING_FILTER, category)
|
||||
warnings.warn_explicit(message, category, stack[1], stack[2])
|
|
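For context, a minimal sketch of how these templates and helpers are typically used elsewhere in the codebase; the `set_tensor` wrapper below is hypothetical and only illustrates the pattern:

    from spacy.errors import Errors, user_warning

    def set_tensor(doc, tensor):
        # Error codes are plain format strings, so callers fill in the details:
        if tensor.shape[0] != len(doc):
            raise ValueError(Errors.E076.format(rows=tensor.shape[0], words=len(doc)))
        doc.tensor = tensor

    # Warning filtering is driven by environment variables read at import time,
    # e.g. SPACY_WARNING_TYPES="user,deprecation" and SPACY_WARNING_FILTER="once".
    user_warning("Only shown if 'user' is in SPACY_WARNING_TYPES.")
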
@@ -17,6 +17,7 @@ import ujson
from . import _align
from .syntax import nonproj
from .tokens import Doc
from .errors import Errors
from . import util
from .util import minibatch, itershuffle
from .compat import json_dumps

@@ -37,7 +38,8 @@ def tags_to_entities(tags):
elif tag == '-':
continue
elif tag.startswith('I'):
assert start is not None, tags[:i]
if start is None:
raise ValueError(Errors.E067.format(tags=tags[:i]))
continue
if tag.startswith('U'):
entities.append((tag[2:], i, i))

@@ -47,7 +49,7 @@ def tags_to_entities(tags):
entities.append((tag[2:], start, i))
start = None
else:
raise Exception(tag)
raise ValueError(Errors.E068.format(tag=tag))
return entities
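To make the BILUO handling above concrete, a small usage sketch; the tag sequence is made up for illustration:

    from spacy.gold import tags_to_entities

    tags = ['O', 'B-PER', 'L-PER', 'U-LOC', 'O']
    # Returns (label, start, end) token offsets; an I-/L- tag without a
    # preceding B- now raises ValueError(E067) instead of tripping an assert.
    print(tags_to_entities(tags))   # [('PER', 1, 2), ('LOC', 3, 3)]
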
@@ -225,7 +227,9 @@ class GoldCorpus(object):
@classmethod
def _make_golds(cls, docs, paragraph_tuples, make_projective):
assert len(docs) == len(paragraph_tuples)
if len(docs) != len(paragraph_tuples):
raise ValueError(Errors.E070.format(n_docs=len(docs),
n_annots=len(paragraph_tuples)))
if len(docs) == 1:
return [GoldParse.from_annot_tuples(docs[0],
paragraph_tuples[0][0],

@@ -525,7 +529,7 @@ cdef class GoldParse:
cycle = nonproj.contains_cycle(self.heads)
if cycle is not None:
raise Exception("Cycle found: %s" % cycle)
raise ValueError(Errors.E069.format(cycle=cycle))

def __len__(self):
"""Get the number of gold-standard tokens.
@@ -8,6 +8,7 @@ from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .morph_rules import MORPH_RULES
from ..tag_map import TAG_MAP
from .lemmatizer import LOOKUP

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS

@@ -28,6 +29,7 @@ class DanishDefaults(Language.Defaults):
suffixes = TOKENIZER_SUFFIXES
tag_map = TAG_MAP
stop_words = STOP_WORDS
lemma_lookup = LOOKUP


class Danish(Language):
692415
spacy/lang/da/lemmatizer.py
Normal file
File diff suppressed because it is too large
@@ -286069,7 +286069,6 @@ LOOKUP = {
"sonnolente": "sonnolento",
"sonnolenti": "sonnolento",
"sonnolenze": "sonnolenza",
"sono": "sonare",
"sonora": "sonoro",
"sonore": "sonoro",
"sonori": "sonoro",

@@ -333681,6 +333680,7 @@ LOOKUP = {
"zurliniane": "zurliniano",
"zurliniani": "zurliniano",
"àncore": "àncora",
"sono": "essere",
"è": "essere",
"èlites": "èlite",
"ère": "èra",

@@ -190262,7 +190262,6 @@ LOOKUP = {
"gämserna": "gäms",
"gämsernas": "gäms",
"gämsers": "gäms",
"gäng": "gänga",
"gängad": "gänga",
"gängade": "gängad",
"gängades": "gängad",

@@ -651423,7 +651422,6 @@ LOOKUP = {
"åpnasts": "åpen",
"åpne": "åpen",
"åpnes": "åpen",
"år": "åra",
"åran": "åra",
"årans": "åra",
"åras": "åra",
@@ -1,19 +1,53 @@
# coding: utf8
from __future__ import unicode_literals

from ...attrs import LANG
from ...attrs import LANG, NORM
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...tokens import Doc
from .stop_words import STOP_WORDS
from ...util import update_exc, add_lookups
from .lex_attrs import LEX_ATTRS
#from ..tokenizer_exceptions import BASE_EXCEPTIONS
#from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS


class VietnameseDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'vi'  # for pickling
# add more norm exception dictionaries here
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)

# overwrite functions for lexical attributes
lex_attr_getters.update(LEX_ATTRS)

# merge base exceptions and custom tokenizer exceptions
#tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
use_pyvi = True


class Vietnamese(Language):
lang = 'vi'
Defaults = VietnameseDefaults  # override defaults

def make_doc(self, text):
if self.Defaults.use_pyvi:
try:
from pyvi import ViTokenizer
except ImportError:
msg = ("Pyvi not installed. Either set Vietnamese.use_pyvi = False, "
"or install it https://pypi.python.org/pypi/pyvi")
raise ImportError(msg)
words, spaces = ViTokenizer.spacy_tokenize(text)
return Doc(self.vocab, words=words, spaces=spaces)
else:
words = []
spaces = []
doc = self.tokenizer(text)
for token in self.tokenizer(text):
words.extend(list(token.text))
spaces.extend([False]*len(token.text))
spaces[-1] = bool(token.whitespace_)
return Doc(self.vocab, words=words, spaces=spaces)

__all__ = ['Vietnamese']
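A rough usage sketch for the new language class, assuming pyvi is installed (otherwise an ImportError is raised as shown above); the example sentence is arbitrary:

    from spacy.lang.vi import Vietnamese

    nlp = Vietnamese()                                 # requires pyvi for word segmentation
    doc = nlp.make_doc(u'Hà Nội là thủ đô của Việt Nam')
    print([t.text for t in doc])                       # tokens produced by ViTokenizer
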
26
spacy/lang/vi/lex_attrs.py
Normal file

@@ -0,0 +1,26 @@
# coding: utf8
from __future__ import unicode_literals

from ...attrs import LIKE_NUM


_num_words = ['không', 'một', 'hai', 'ba', 'bốn', 'năm', 'sáu', 'bẩy',
'tám', 'chín', 'mười', 'trăm', 'tỷ']


def like_num(text):
text = text.replace(',', '').replace('.', '')
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
return True
return False


LEX_ATTRS = {
LIKE_NUM: like_num
}
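A quick sanity-check sketch of the LIKE_NUM hook defined above:

    from spacy.lang.vi.lex_attrs import like_num

    print(like_num('1.000'))     # True - separators are stripped first
    print(like_num('3/4'))       # True - simple fractions count
    print(like_num('mười'))      # True - number word from _num_words
    print(like_num('xin chào'))  # False
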
1951
spacy/lang/vi/stop_words.py
Normal file
File diff suppressed because it is too large

36
spacy/lang/vi/tag_map.py
Normal file

@@ -0,0 +1,36 @@
# coding: utf8
from __future__ import unicode_literals

from ..symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
from ..symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ


# Add a tag map
# Documentation: https://spacy.io/docs/usage/adding-languages#tag-map
# Universal Dependencies: http://universaldependencies.org/u/pos/all.html
# The keys of the tag map should be strings in your tag set. The dictionary must
# have an entry POS whose value is one of the Universal Dependencies tags.
# Optionally, you can also include morphological features or other attributes.


TAG_MAP = {
"ADV": {POS: ADV},
"NOUN": {POS: NOUN},
"ADP": {POS: ADP},
"PRON": {POS: PRON},
"SCONJ": {POS: SCONJ},
"PROPN": {POS: PROPN},
"DET": {POS: DET},
"SYM": {POS: SYM},
"INTJ": {POS: INTJ},
"PUNCT": {POS: PUNCT},
"NUM": {POS: NUM},
"AUX": {POS: AUX},
"X": {POS: X},
"CONJ": {POS: CONJ},
"CCONJ": {POS: CCONJ},
"ADJ": {POS: ADJ},
"VERB": {POS: VERB},
"PART": {POS: PART},
"SP": {POS: SPACE}
}
@@ -28,6 +28,7 @@ from .lang.punctuation import TOKENIZER_INFIXES
from .lang.tokenizer_exceptions import TOKEN_MATCH
from .lang.tag_map import TAG_MAP
from .lang.lex_attrs import LEX_ATTRS, is_stop
from .errors import Errors
from . import util
from . import about

@@ -112,7 +113,7 @@ class Language(object):
'merge_subtokens': lambda nlp, **cfg: merge_subtokens,
}

def __init__(self, vocab=True, make_doc=True, meta={}, **kwargs):
def __init__(self, vocab=True, make_doc=True, max_length=10**6, meta={}, **kwargs):
"""Initialise a Language object.

vocab (Vocab): A `Vocab` object. If `True`, a vocab is created via

@@ -127,6 +128,15 @@ class Language(object):
string occurs in both, the component is not loaded.
meta (dict): Custom meta data for the Language class. Is written to by
models to add model meta data.
max_length (int):
Maximum number of characters in a single text. The current v2 models
may run out of memory on extremely long texts, due to large internal
allocations. You should segment these texts into meaningful units,
e.g. paragraphs, subsections etc, before passing them to spaCy.
Default maximum length is 1,000,000 characters (1MB). As a rule of
thumb, if all pipeline components are enabled, spaCy's default
models currently require roughly 1GB of temporary memory per
100,000 characters in one text.
RETURNS (Language): The newly constructed object.
"""
self._meta = dict(meta)
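A small sketch of how the new max_length setting can be used; the pipeline here is just the bare English class for illustration:

    from spacy.lang.en import English

    nlp = English(max_length=2 * 10**6)   # raise the limit at construction time
    # ...or adjust it later, e.g. when only tokenizing very long plain text:
    nlp.max_length = 5 * 10**6
    doc = nlp(u'some very long text ...')  # texts >= nlp.max_length raise E088
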
@@ -134,12 +144,15 @@ class Language(object):
if vocab is True:
factory = self.Defaults.create_vocab
vocab = factory(self, **meta.get('vocab', {}))
if vocab.vectors.name is None:
vocab.vectors.name = meta.get('vectors', {}).get('name')
self.vocab = vocab
if make_doc is True:
factory = self.Defaults.create_tokenizer
make_doc = factory(self, **meta.get('tokenizer', {}))
self.tokenizer = make_doc
self.pipeline = []
self.max_length = max_length
self._optimizer = None

@property

@@ -159,7 +172,8 @@ class Language(object):
self._meta.setdefault('license', '')
self._meta['vectors'] = {'width': self.vocab.vectors_length,
'vectors': len(self.vocab.vectors),
'keys': self.vocab.vectors.n_keys}
'keys': self.vocab.vectors.n_keys,
'name': self.vocab.vectors.name}
self._meta['pipeline'] = self.pipe_names
return self._meta

@@ -205,8 +219,7 @@ class Language(object):
for pipe_name, component in self.pipeline:
if pipe_name == name:
return component
msg = "No component '{}' found in pipeline. Available names: {}"
raise KeyError(msg.format(name, self.pipe_names))
raise KeyError(Errors.E001.format(name=name, opts=self.pipe_names))

def create_pipe(self, name, config=dict()):
"""Create a pipeline component from a factory.

@@ -216,7 +229,7 @@ class Language(object):
RETURNS (callable): Pipeline component.
"""
if name not in self.factories:
raise KeyError("Can't find factory for '{}'.".format(name))
raise KeyError(Errors.E002.format(name=name))
factory = self.factories[name]
return factory(self, **config)

@@ -241,12 +254,9 @@ class Language(object):
>>> nlp.add_pipe(component, name='custom_name', last=True)
"""
if not hasattr(component, '__call__'):
msg = ("Not a valid pipeline component. Expected callable, but "
"got {}. ".format(repr(component)))
msg = Errors.E003.format(component=repr(component), name=name)
if isinstance(component, basestring_) and component in self.factories:
msg += ("If you meant to add a built-in component, use "
"create_pipe: nlp.add_pipe(nlp.create_pipe('{}'))"
.format(component))
msg += Errors.E004.format(component=component)
raise ValueError(msg)
if name is None:
if hasattr(component, 'name'):

@@ -259,11 +269,9 @@ class Language(object):
else:
name = repr(component)
if name in self.pipe_names:
raise ValueError("'{}' already exists in pipeline.".format(name))
raise ValueError(Errors.E007.format(name=name, opts=self.pipe_names))
if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2:
msg = ("Invalid constraints. You can only set one of the "
"following: before, after, first, last.")
raise ValueError(msg)
raise ValueError(Errors.E006)
pipe = (name, component)
if last or not any([first, before, after]):
self.pipeline.append(pipe)

@@ -274,9 +282,8 @@ class Language(object):
elif after and after in self.pipe_names:
self.pipeline.insert(self.pipe_names.index(after) + 1, pipe)
else:
msg = "Can't find '{}' in pipeline. Available names: {}"
unfound = before or after
raise ValueError(msg.format(unfound, self.pipe_names))
raise ValueError(Errors.E001.format(name=before or after,
opts=self.pipe_names))

def has_pipe(self, name):
"""Check if a component name is present in the pipeline. Equivalent to

@@ -294,8 +301,7 @@ class Language(object):
component (callable): Pipeline component.
"""
if name not in self.pipe_names:
msg = "Can't find '{}' in pipeline. Available names: {}"
raise ValueError(msg.format(name, self.pipe_names))
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
self.pipeline[self.pipe_names.index(name)] = (name, component)

def rename_pipe(self, old_name, new_name):

@@ -305,11 +311,9 @@ class Language(object):
new_name (unicode): New name of the component.
"""
if old_name not in self.pipe_names:
msg = "Can't find '{}' in pipeline. Available names: {}"
raise ValueError(msg.format(old_name, self.pipe_names))
raise ValueError(Errors.E001.format(name=old_name, opts=self.pipe_names))
if new_name in self.pipe_names:
msg = "'{}' already exists in pipeline. Existing names: {}"
raise ValueError(msg.format(new_name, self.pipe_names))
raise ValueError(Errors.E007.format(name=new_name, opts=self.pipe_names))
i = self.pipe_names.index(old_name)
self.pipeline[i] = (new_name, self.pipeline[i][1])

@@ -320,8 +324,7 @@ class Language(object):
RETURNS (tuple): A `(name, component)` tuple of the removed component.
"""
if name not in self.pipe_names:
msg = "Can't find '{}' in pipeline. Available names: {}"
raise ValueError(msg.format(name, self.pipe_names))
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
return self.pipeline.pop(self.pipe_names.index(name))

def __call__(self, text, disable=[]):

@@ -338,11 +341,18 @@ class Language(object):
>>> tokens[0].text, tokens[0].head.tag_
('An', 'NN')
"""
if len(text) >= self.max_length:
raise ValueError(Errors.E088.format(length=len(text),
max_length=self.max_length))
doc = self.make_doc(text)
for name, proc in self.pipeline:
if name in disable:
continue
if not hasattr(proc, '__call__'):
raise ValueError(Errors.E003.format(component=type(proc), name=name))
doc = proc(doc)
if doc is None:
raise ValueError(Errors.E005.format(name=name))
return doc

def disable_pipes(self, *names):

@@ -384,8 +394,7 @@ class Language(object):
>>> state = nlp.update(docs, golds, sgd=optimizer)
"""
if len(docs) != len(golds):
raise IndexError("Update expects same number of docs and golds "
"Got: %d, %d" % (len(docs), len(golds)))
raise IndexError(Errors.E009.format(n_docs=len(docs), n_golds=len(golds)))
if len(docs) == 0:
return
if sgd is None:

@@ -458,6 +467,8 @@ class Language(object):
else:
device = None
link_vectors_to_models(self.vocab)
if self.vocab.vectors.data.shape[1]:
cfg['pretrained_vectors'] = self.vocab.vectors.name
if sgd is None:
sgd = create_default_optimizer(Model.ops)
self._optimizer = sgd

@@ -626,9 +637,10 @@ class Language(object):
"""
path = util.ensure_path(path)
deserializers = OrderedDict((
('vocab', lambda p: self.vocab.from_disk(p)),
('meta.json', lambda p: self.meta.update(util.read_json(p))),
('vocab', lambda p: (
self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self))),
('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)),
('meta.json', lambda p: self.meta.update(util.read_json(p)))
))
for name, proc in self.pipeline:
if name in disable:

@@ -671,9 +683,10 @@ class Language(object):
RETURNS (Language): The `Language` object.
"""
deserializers = OrderedDict((
('vocab', lambda b: self.vocab.from_bytes(b)),
('meta', lambda b: self.meta.update(ujson.loads(b))),
('vocab', lambda b: (
self.vocab.from_bytes(b) and _fix_pretrained_vectors_name(self))),
('tokenizer', lambda b: self.tokenizer.from_bytes(b, vocab=False)),
('meta', lambda b: self.meta.update(ujson.loads(b)))
))
for i, (name, proc) in enumerate(self.pipeline):
if name in disable:

@@ -685,6 +698,27 @@ class Language(object):
return self


def _fix_pretrained_vectors_name(nlp):
# TODO: Replace this once we handle vectors consistently as static
# data
if 'vectors' in nlp.meta and nlp.meta['vectors'].get('name'):
nlp.vocab.vectors.name = nlp.meta['vectors']['name']
elif not nlp.vocab.vectors.size:
nlp.vocab.vectors.name = None
elif 'name' in nlp.meta and 'lang' in nlp.meta:
vectors_name = '%s_%s.vectors' % (nlp.meta['lang'], nlp.meta['name'])
nlp.vocab.vectors.name = vectors_name
else:
raise ValueError(Errors.E092)
if nlp.vocab.vectors.size != 0:
link_vectors_to_models(nlp.vocab)
for name, proc in nlp.pipeline:
if not hasattr(proc, 'cfg'):
continue
proc.cfg.setdefault('deprecation_fixes', {})
proc.cfg['deprecation_fixes']['vectors_name'] = nlp.vocab.vectors.name


class DisabledPipes(list):
"""Manager for temporary pipeline disabling."""
def __init__(self, nlp, *names):

@@ -711,14 +745,7 @@ class DisabledPipes(list):
if unexpected:
# Don't change the pipeline if we're raising an error.
self.nlp.pipeline = current
msg = (
"Some current components would be lost when restoring "
"previous pipeline state. If you added components after "
"calling nlp.disable_pipes(), you should remove them "
"explicitly with nlp.remove_pipe() before the pipeline is "
"restore. Names of the new components: %s"
)
raise ValueError(msg % unexpected)
raise ValueError(Errors.E008.format(names=unexpected))
self[:] = []
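For reference, a minimal sketch of how DisabledPipes is typically driven via nlp.disable_pipes(); `nlp` is assumed to be an already-loaded pipeline and the component names depend on the model:

    # As a context manager, the disabled components are restored on exit:
    with nlp.disable_pipes('tagger', 'parser'):
        doc = nlp(u'Only the remaining components run here.')

    # Or keep the handle and restore explicitly later:
    disabled = nlp.disable_pipes('ner')
    doc = nlp(u'No entity recognition here.')
    disabled.restore()
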
@@ -15,7 +15,7 @@ from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP
from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT, IS_CURRENCY, IS_OOV
from .attrs cimport PROB
from .attrs import intify_attrs
from . import about
from .errors import Errors

memset(&EMPTY_LEXEME, 0, sizeof(LexemeC))

@@ -37,7 +38,8 @@ cdef class Lexeme:
self.vocab = vocab
self.orth = orth
self.c = <LexemeC*><void*>vocab.get_by_orth(vocab.mem, orth)
assert self.c.orth == orth
if self.c.orth != orth:
raise ValueError(Errors.E071.format(orth=orth, vocab_orth=self.c.orth))

def __richcmp__(self, other, int op):
if other is None:

@@ -129,20 +130,25 @@ cdef class Lexeme:
lex_data = Lexeme.c_to_bytes(self.c)
start = <const char*>&self.c.flags
end = <const char*>&self.c.sentiment + sizeof(self.c.sentiment)
assert (end-start) == sizeof(lex_data.data), (end-start, sizeof(lex_data.data))
if (end-start) != sizeof(lex_data.data):
raise ValueError(Errors.E072.format(length=end-start,
bad_length=sizeof(lex_data.data)))
byte_string = b'\0' * sizeof(lex_data.data)
byte_chars = <char*>byte_string
for i in range(sizeof(lex_data.data)):
byte_chars[i] = lex_data.data[i]
assert len(byte_string) == sizeof(lex_data.data), (len(byte_string),
sizeof(lex_data.data))
if len(byte_string) != sizeof(lex_data.data):
raise ValueError(Errors.E072.format(length=len(byte_string),
bad_length=sizeof(lex_data.data)))
return byte_string

def from_bytes(self, bytes byte_string):
# This method doesn't really have a use-case --- wrote it for testing.
# Possibly delete? It puts the Lexeme out of synch with the vocab.
cdef SerializedLexemeC lex_data
assert len(byte_string) == sizeof(lex_data.data)
if len(byte_string) != sizeof(lex_data.data):
raise ValueError(Errors.E072.format(length=len(byte_string),
bad_length=sizeof(lex_data.data)))
for i in range(len(byte_string)):
lex_data.data[i] = byte_string[i]
Lexeme.c_from_bytes(self.c, lex_data)

@@ -169,16 +175,13 @@ cdef class Lexeme:
def __get__(self):
cdef int length = self.vocab.vectors_length
if length == 0:
raise ValueError(
"Word vectors set to length 0. This may be because you "
"don't have a model installed or loaded, or because your "
"model doesn't include word vectors. For more info, see "
"the documentation: \n%s\n" % about.__docs_models__
)
raise ValueError(Errors.E010)
return self.vocab.get_vector(self.c.orth)

def __set__(self, vector):
assert len(vector) == self.vocab.vectors_length
if len(vector) != self.vocab.vectors_length:
raise ValueError(Errors.E073.format(new_length=len(vector),
length=self.vocab.vectors_length))
self.vocab.set_vector(self.c.orth, vector)

property rank:
@@ -13,6 +13,8 @@ from .vocab cimport Vocab
from .tokens.doc cimport Doc
from .tokens.doc cimport get_token_attr
from .attrs cimport ID, attr_id_t, NULL_ATTR
from .errors import Errors, TempErrors

from .attrs import IDS
from .attrs import FLAG61 as U_ENT
from .attrs import FLAG60 as B2_ENT

@@ -321,6 +323,9 @@ cdef attr_t get_pattern_key(const TokenPatternC* pattern) nogil:
while pattern.nr_attr != 0:
pattern += 1
id_attr = pattern[0].attrs[0]
if id_attr.attr != ID:
with gil:
raise ValueError(Errors.E074.format(attr=ID, bad_attr=id_attr.attr))
return id_attr.value

def _convert_strings(token_specs, string_store):

@@ -341,8 +346,8 @@ def _convert_strings(token_specs, string_store):
if value in operators:
ops = operators[value]
else:
msg = "Unknown operator '%s'. Options: %s"
raise KeyError(msg % (value, ', '.join(operators.keys())))
keys = ', '.join(operators.keys())
raise KeyError(Errors.E011.format(op=value, opts=keys))
if isinstance(attr, basestring):
attr = IDS.get(attr.upper())
if isinstance(value, basestring):

@@ -429,9 +434,7 @@ cdef class Matcher:
"""
for pattern in patterns:
if len(pattern) == 0:
msg = ("Cannot add pattern for zero tokens to matcher.\n"
"key: {key}\n")
raise ValueError(msg.format(key=key))
raise ValueError(Errors.E012.format(key=key))
key = self._normalize_key(key)
for pattern in patterns:
specs = _convert_strings(pattern, self.vocab.strings)
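To illustrate the pattern validation above (E011/E012), a hedged sketch of the v2-style Matcher usage; the key name and pattern are arbitrary, and `nlp` is assumed to be a loaded pipeline:

    from spacy.matcher import Matcher

    matcher = Matcher(nlp.vocab)
    pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP': '?'}, {'LOWER': 'world'}]
    matcher.add('HELLO_WORLD', None, pattern)   # an empty pattern list would raise E012
    matches = matcher(nlp(u'Hello, world!'))
    # An unknown value for 'OP' would raise KeyError(E011) during conversion.
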
@@ -9,6 +9,7 @@ from .attrs import LEMMA, intify_attrs
from .parts_of_speech cimport SPACE
from .parts_of_speech import IDS as POS_IDS
from .lexeme cimport Lexeme
from .errors import Errors


def _normalize_props(props):

@@ -93,7 +94,7 @@ cdef class Morphology:

cdef int assign_tag_id(self, TokenC* token, int tag_id) except -1:
if tag_id > self.n_tags:
raise ValueError("Unknown tag ID: %s" % tag_id)
raise ValueError(Errors.E014.format(tag=tag_id))
# TODO: It's pretty arbitrary to put this logic here. I guess the
# justification is that this is where the specific word and the tag
# interact. Still, we should have a better way to enforce this rule, or

@@ -129,7 +130,7 @@ cdef class Morphology:
tag (unicode): The part-of-speech tag to key the exception.
orth (unicode): The word-form to key the exception.
"""
# TODO: Currently we've assumed that we know the number of tags --
# RichTagC is an array, and _cache is a PreshMapArray
# This is really bad: it makes the morphology typed to the tagger
# classes, which is all wrong.

@@ -147,9 +148,7 @@ cdef class Morphology:
elif force:
memset(cached, 0, sizeof(cached[0]))
else:
raise ValueError(
"Conflicting morphology exception for (%s, %s). Use "
"force=True to overwrite." % (tag_str, orth_str))
raise ValueError(Errors.E015.format(tag=tag_str, orth=orth_str))

cached.tag = rich_tag
# TODO: Refactor this to take arbitrary attributes.
@@ -8,7 +8,9 @@ cimport numpy as np
import cytoolz
from collections import OrderedDict
import ujson
import msgpack

from .util import msgpack
from .util import msgpack_numpy

from thinc.api import chain
from thinc.v2v import Affine, SELU, Softmax

@@ -32,6 +34,7 @@ from .parts_of_speech import X
from ._ml import Tok2Vec, build_text_classifier, build_tagger_model
from ._ml import link_vectors_to_models, zero_init, flatten
from ._ml import create_default_optimizer
from .errors import Errors, TempErrors
from . import util

@@ -77,7 +80,7 @@ def merge_noun_chunks(doc):
RETURNS (Doc): The Doc object with merged noun chunks.
"""
if not doc.is_parsed:
return
return doc
spans = [(np.start_char, np.end_char, np.root.tag, np.root.dep)
for np in doc.noun_chunks]
for start, end, tag, dep in spans:

@@ -214,8 +217,10 @@ class Pipe(object):
def from_bytes(self, bytes_data, **exclude):
"""Load the pipe from a bytestring."""
def load_model(b):
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
if self.model is True:
self.cfg.setdefault('pretrained_dims', self.vocab.vectors_length)
self.model = self.Model(**self.cfg)
self.model.from_bytes(b)

@@ -239,8 +244,10 @@ class Pipe(object):
def from_disk(self, path, **exclude):
"""Load the pipe from disk."""
def load_model(p):
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
if self.model is True:
self.cfg.setdefault('pretrained_dims', self.vocab.vectors_length)
self.model = self.Model(**self.cfg)
self.model.from_bytes(p.open('rb').read())

@@ -298,7 +305,6 @@ class Tensorizer(Pipe):
self.model = model
self.input_models = []
self.cfg = dict(cfg)
self.cfg['pretrained_dims'] = self.vocab.vectors.data.shape[1]
self.cfg.setdefault('cnn_maxout_pieces', 3)

def __call__(self, doc):

@@ -343,7 +349,8 @@ class Tensorizer(Pipe):
tensors (object): Vector representation for each token in the docs.
"""
for doc, tensor in zip(docs, tensors):
assert tensor.shape[0] == len(doc)
if tensor.shape[0] != len(doc):
raise ValueError(Errors.E076.format(rows=tensor.shape[0], words=len(doc)))
doc.tensor = tensor

def update(self, docs, golds, state=None, drop=0., sgd=None, losses=None):

@@ -415,8 +422,6 @@ class Tagger(Pipe):
self.model = model
self.cfg = OrderedDict(sorted(cfg.items()))
self.cfg.setdefault('cnn_maxout_pieces', 2)
self.cfg.setdefault('pretrained_dims',
self.vocab.vectors.data.shape[1])

@property
def labels(self):

@@ -477,7 +482,7 @@ class Tagger(Pipe):
doc.extend_tensor(tensors[i].get())
else:
doc.extend_tensor(tensors[i])
doc.is_tagged = True

def update(self, docs, golds, drop=0., sgd=None, losses=None):
if losses is not None and self.name not in losses:

@@ -527,8 +532,8 @@ class Tagger(Pipe):
vocab.morphology = Morphology(vocab.strings, new_tag_map,
vocab.morphology.lemmatizer,
exc=vocab.morphology.exc)
self.cfg['pretrained_vectors'] = kwargs.get('pretrained_vectors')
if self.model is True:
self.cfg['pretrained_dims'] = self.vocab.vectors.data.shape[1]
self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
link_vectors_to_models(self.vocab)
if sgd is None:

@@ -537,6 +542,8 @@ class Tagger(Pipe):

@classmethod
def Model(cls, n_tags, **cfg):
if cfg.get('pretrained_dims') and not cfg.get('pretrained_vectors'):
raise ValueError(TempErrors.T008)
return build_tagger_model(n_tags, **cfg)

def add_label(self, label, values=None):

@@ -552,9 +559,7 @@ class Tagger(Pipe):
# copy_array(larger.W[:smaller.nO], smaller.W)
# copy_array(larger.b[:smaller.nO], smaller.b)
# self.model._layers[-1] = larger
raise ValueError(
"Resizing pre-trained Tagger models is not "
"currently supported.")
raise ValueError(TempErrors.T003)
tag_map = dict(self.vocab.morphology.tag_map)
if values is None:
values = {POS: "X"}

@@ -584,6 +589,10 @@ class Tagger(Pipe):

def from_bytes(self, bytes_data, **exclude):
def load_model(b):
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name

if self.model is True:
token_vector_width = util.env_opt(
'token_vector_width',

@@ -609,7 +618,6 @@ class Tagger(Pipe):
return self

def to_disk(self, path, **exclude):
self.cfg.setdefault('pretrained_dims', self.vocab.vectors.data.shape[1])
tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
serialize = OrderedDict((
('vocab', lambda p: self.vocab.to_disk(p)),

@@ -622,6 +630,9 @@ class Tagger(Pipe):

def from_disk(self, path, **exclude):
def load_model(p):
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
if self.model is True:
self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
with p.open('rb') as file_:

@@ -669,12 +680,9 @@ class MultitaskObjective(Tagger):
elif hasattr(target, '__call__'):
self.make_label = target
else:
raise ValueError("MultitaskObjective target should be function or "
"one of: dep, tag, ent, sent_start, dep_tag_offset, ent_tag.")
raise ValueError(Errors.E016)
self.cfg = dict(cfg)
self.cfg.setdefault('cnn_maxout_pieces', 2)
self.cfg.setdefault('pretrained_dims',
self.vocab.vectors.data.shape[1])

@property
def labels(self):

@@ -723,7 +731,9 @@ class MultitaskObjective(Tagger):
return tokvecs, scores

def get_loss(self, docs, golds, scores):
assert len(docs) == len(golds)
if len(docs) != len(golds):
raise ValueError(Errors.E077.format(value='loss', n_docs=len(docs),
n_golds=len(golds)))
cdef int idx = 0
correct = numpy.zeros((scores.shape[0],), dtype='i')
guesses = scores.argmax(axis=1)

@@ -878,8 +888,8 @@ class TextCategorizer(Pipe):
name = 'textcat'

@classmethod
def Model(cls, **cfg):
return build_text_classifier(**cfg)
def Model(cls, nr_class, **cfg):
return build_text_classifier(nr_class, **cfg)

def __init__(self, vocab, model=True, **cfg):
self.vocab = vocab

@@ -962,16 +972,16 @@ class TextCategorizer(Pipe):
self.labels.append(label)
return 1

def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None):
def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None,
**kwargs):
if pipeline and getattr(pipeline[0], 'name', None) == 'tensorizer':
token_vector_width = pipeline[0].model.nO
else:
token_vector_width = 64

if self.model is True:
self.cfg['pretrained_dims'] = self.vocab.vectors_length
self.cfg['nr_class'] = len(self.labels)
self.cfg['width'] = token_vector_width
self.model = self.Model(**self.cfg)
self.cfg['pretrained_vectors'] = kwargs.get('pretrained_vectors')
self.model = self.Model(len(self.labels), **self.cfg)
link_vectors_to_models(self.vocab)
if sgd is None:
sgd = self.create_optimizer()
@@ -2,6 +2,7 @@
from __future__ import division, print_function, unicode_literals

from .gold import tags_to_entities, GoldParse
from .errors import Errors


class PRFScore(object):

@@ -86,7 +87,6 @@ class Scorer(object):
def score(self, tokens, gold, verbose=False, punct_labels=('p', 'punct')):
if len(tokens) != len(gold):
gold = GoldParse.from_annot_tuples(tokens, zip(*gold.orig_annot))
assert len(tokens) == len(gold)
gold_deps = set()
gold_tags = set()
gold_ents = set(tags_to_entities([annot[-1]
@@ -13,6 +13,7 @@ from .symbols import IDS as SYMBOLS_BY_STR
from .symbols import NAMES as SYMBOLS_BY_INT
from .typedefs cimport hash_t
from .compat import json_dumps
from .errors import Errors
from . import util

@@ -59,7 +60,6 @@ cdef Utf8Str* _allocate(Pool mem, const unsigned char* chars, uint32_t length) e
string.p = <unsigned char*>mem.alloc(length + 1, sizeof(unsigned char))
string.p[0] = length
memcpy(&string.p[1], chars, length)
assert string.s[0] >= sizeof(string.s) or string.s[0] == 0, string.s[0]
return string
else:
i = 0

@@ -69,7 +69,6 @@ cdef Utf8Str* _allocate(Pool mem, const unsigned char* chars, uint32_t length) e
string.p[i] = 255
string.p[n_length_bytes-1] = length % 255
memcpy(&string.p[n_length_bytes], chars, length)
assert string.s[0] >= sizeof(string.s) or string.s[0] == 0, string.s[0]
return string

@@ -115,7 +114,7 @@ cdef class StringStore:
self.hits.insert(key)
utf8str = <Utf8Str*>self._map.get(key)
if utf8str is NULL:
raise KeyError(string_or_id)
raise KeyError(Errors.E018.format(hash_value=string_or_id))
else:
return decode_Utf8Str(utf8str)

@@ -136,8 +135,7 @@ cdef class StringStore:
key = hash_utf8(string, len(string))
self._intern_utf8(string, len(string))
else:
raise TypeError(
"Can only add unicode or bytes. Got type: %s" % type(string))
raise TypeError(Errors.E017.format(value_type=type(string)))
return key

def __len__(self):
|
|||
|
||||
from .transition_system cimport TransitionSystem, Transition
|
||||
from ..gold cimport GoldParse
|
||||
from ..errors import Errors
|
||||
from .stateclass cimport StateC, StateClass
|
||||
|
||||
|
||||
|
@ -220,7 +221,8 @@ def get_states(pbeams, gbeams, beam_map, nr_update):
|
|||
p_indices = []
|
||||
g_indices = []
|
||||
cdef Beam pbeam, gbeam
|
||||
assert len(pbeams) == len(gbeams)
|
||||
if len(pbeams) != len(gbeams):
|
||||
raise ValueError(Errors.E079.format(pbeams=len(pbeams), gbeams=len(gbeams)))
|
||||
for eg_id, (pbeam, gbeam) in enumerate(zip(pbeams, gbeams)):
|
||||
p_indices.append([])
|
||||
g_indices.append([])
|
||||
|
@ -228,7 +230,8 @@ def get_states(pbeams, gbeams, beam_map, nr_update):
|
|||
state = StateClass.borrow(<StateC*>pbeam.at(i))
|
||||
if not state.is_final():
|
||||
key = tuple([eg_id] + pbeam.histories[i])
|
||||
assert key not in seen, (key, seen)
|
||||
if key in seen:
|
||||
raise ValueError(Errors.E080.format(key=key))
|
||||
seen[key] = len(states)
|
||||
p_indices[-1].append(len(states))
|
||||
states.append(state)
|
||||
|
@ -271,7 +274,8 @@ def get_gradient(nr_class, beam_maps, histories, losses):
|
|||
for i in range(nr_step):
|
||||
grads.append(numpy.zeros((max(beam_maps[i].values())+1, nr_class),
|
||||
dtype='f'))
|
||||
assert len(histories) == len(losses)
|
||||
if len(histories) != len(losses):
|
||||
raise ValueError(Errors.E081.format(n_hist=len(histories), losses=len(losses)))
|
||||
for eg_id, hists in enumerate(histories):
|
||||
for loss, hist in zip(losses[eg_id], hists):
|
||||
if loss == 0.0 or numpy.isnan(loss):
|
||||
|
|
|
@@ -10,21 +10,25 @@ from collections import OrderedDict, defaultdict, Counter
from thinc.extra.search cimport Beam
import json

from .nonproj import is_nonproj_tree
from ..typedefs cimport hash_t, attr_t
from ..strings cimport hash_string
from .stateclass cimport StateClass
from ._state cimport StateC
from . import nonproj
from .transition_system cimport move_cost_func_t, label_cost_func_t
from ..gold cimport GoldParse, GoldParseC
from ..structs cimport TokenC
from ..errors import Errors

# Calculate cost as gold/not gold. We don't use scalar value anyway.
cdef int BINARY_COSTS = 1
cdef weight_t MIN_SCORE = -90000
cdef attr_t SUBTOK_LABEL = hash_string('subtok')

DEF NON_MONOTONIC = True
DEF USE_BREAK = True

cdef weight_t MIN_SCORE = -90000

# Break transition from here
# http://www.aclweb.org/anthology/P13-1074
cdef enum:

@@ -178,6 +182,8 @@ cdef class Reduce:
cdef class LeftArc:
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
if label == SUBTOK_LABEL and st.S(0) != (st.B(0)-1):
return 0
sent_start = st._sent[st.B_(0).l_edge].sent_start
return sent_start != 1

@@ -214,6 +220,8 @@ cdef class RightArc:
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
# If there's (perhaps partial) parse pre-set, don't allow cycle.
if label == SUBTOK_LABEL and st.S(0) != (st.B(0)-1):
return 0
sent_start = st._sent[st.B_(0).l_edge].sent_start
return sent_start != 1 and st.H(st.S(0)) != st.B(0)

@@ -364,6 +372,18 @@ cdef class ArcEager(TransitionSystem):
def __get__(self):
return (SHIFT, REDUCE, LEFT, RIGHT, BREAK)

def get_cost(self, StateClass state, GoldParse gold, action):
cdef Transition t = self.lookup_transition(action)
if not t.is_valid(state.c, t.label):
return 9000
else:
return t.get_cost(state, &gold.c, t.label)

def transition(self, StateClass state, action):
cdef Transition t = self.lookup_transition(action)
t.do(state.c, t.label)
return state

def is_gold_parse(self, StateClass state, GoldParse gold):
predicted = set()
truth = set()

@@ -435,7 +455,10 @@ cdef class ArcEager(TransitionSystem):
parses.append((prob, parse))
return parses

cdef Transition lookup_transition(self, object name) except *:
cdef Transition lookup_transition(self, object name_or_id) except *:
if isinstance(name_or_id, int):
return self.c[name_or_id]
name = name_or_id
if '-' in name:
move_str, label_str = name.split('-', 1)
label = self.strings[label_str]

@@ -455,6 +478,9 @@ cdef class ArcEager(TransitionSystem):
else:
return MOVE_NAMES[move]

def class_name(self, int i):
return self.move_name(self.c[i].move, self.c[i].label)

cdef Transition init_transition(self, int clas, int move, attr_t label) except *:
# TODO: Apparent Cython bug here when we try to use the Transition()
# constructor with the function pointers

@@ -484,7 +510,7 @@ cdef class ArcEager(TransitionSystem):
t.do = Break.transition
t.get_cost = Break.cost
else:
raise Exception(move)
raise ValueError(Errors.E019.format(action=move, src='arc_eager'))
return t

cdef int initialize_state(self, StateC* st) nogil:

@@ -516,7 +542,10 @@ cdef class ArcEager(TransitionSystem):
is_valid[BREAK] = Break.is_valid(st, 0)
cdef int i
for i in range(self.n_moves):
output[i] = is_valid[self.c[i].move]
if self.c[i].label == SUBTOK_LABEL:
output[i] = self.c[i].is_valid(st, self.c[i].label)
else:
output[i] = is_valid[self.c[i].move]

cdef int set_costs(self, int* is_valid, weight_t* costs,
StateClass stcls, GoldParse gold) except -1:

@@ -556,35 +585,13 @@ cdef class ArcEager(TransitionSystem):
is_valid[i] = False
costs[i] = 9000
if n_gold < 1:
# Check label set --- leading cause
label_set = set([self.strings[self.c[i].label] for i in range(self.n_moves)])
for label_str in gold.labels:
if label_str is not None and label_str not in label_set:
raise ValueError("Cannot get gold parser action: unknown label: %s" % label_str)
# Check projectivity --- other leading cause
if nonproj.is_nonproj_tree(gold.heads):
raise ValueError(
"Could not find a gold-standard action to supervise the "
"dependency parser. Likely cause: the tree is "
"non-projective (i.e. it has crossing arcs -- see "
"spacy/syntax/nonproj.pyx for definitions). The ArcEager "
"transition system only supports projective trees. To "
"learn non-projective representations, transform the data "
"before training and after parsing. Either pass "
"make_projective=True to the GoldParse class, or use "
"spacy.syntax.nonproj.preprocess_training_data.")
# Check projectivity --- leading cause
if is_nonproj_tree(gold.heads):
raise ValueError(Errors.E020)
else:
print(gold.orig_annot)
print(gold.words)
print(gold.heads)
print(gold.labels)
print(gold.sent_starts)
raise ValueError(
"Could not find a gold-standard action to supervise the"
"dependency parser. The GoldParse was projective. The "
"transition system has %d actions. State at failure: %s"
% (self.n_moves, stcls.print_state(gold.words)))
assert n_gold >= 1
failure_state = stcls.print_state(gold.words)
raise ValueError(Errors.E021.format(n_actions=self.n_moves,
state=failure_state))

def get_beam_annot(self, Beam beam):
length = (<StateC*>beam.at(0)).length
@@ -10,6 +10,7 @@ from ._state cimport StateC
from .transition_system cimport Transition
from .transition_system cimport do_func_t
from ..gold cimport GoldParseC, GoldParse
from ..errors import Errors


cdef enum:

@@ -81,9 +82,7 @@ cdef class BiluoPushDown(TransitionSystem):
for (ids, words, tags, heads, labels, biluo), _ in sents:
for i, ner_tag in enumerate(biluo):
if ner_tag != 'O' and ner_tag != '-':
if ner_tag.count('-') != 1:
raise ValueError(ner_tag)
_, label = ner_tag.split('-')
_, label = ner_tag.split('-', 1)
for action in (BEGIN, IN, LAST, UNIT):
actions[action][label] += 1
return actions

@@ -170,7 +169,7 @@ cdef class BiluoPushDown(TransitionSystem):
if self.c[i].move == move and self.c[i].label == label:
return self.c[i]
else:
raise KeyError(name)
raise KeyError(Errors.E022.format(name=name))

cdef Transition init_transition(self, int clas, int move, attr_t label) except *:
# TODO: Apparent Cython bug here when we try to use the Transition()

@@ -205,7 +204,7 @@ cdef class BiluoPushDown(TransitionSystem):
t.do = Out.transition
t.get_cost = Out.cost
else:
raise Exception(move)
raise ValueError(Errors.E019.format(action=move, src='ner'))
return t

def add_action(self, int action, label_name, freq=None):

@@ -227,7 +226,6 @@ cdef class BiluoPushDown(TransitionSystem):
self._size *= 2
self.c = <Transition*>self.mem.realloc(self.c, self._size * sizeof(self.c[0]))
self.c[self.n_moves] = self.init_transition(self.n_moves, action, label_id)
assert self.c[self.n_moves].label == label_id
self.n_moves += 1
if self.labels.get(action, []):
freq = min(0, min(self.labels[action].values()))
@@ -35,6 +35,7 @@ from .._ml import link_vectors_to_models, create_default_optimizer
from ..compat import json_dumps, copy_array
from ..tokens.doc cimport Doc
from ..gold cimport GoldParse
from ..errors import Errors, TempErrors
from .. import util
from .stateclass cimport StateClass
from ._state cimport StateC

@@ -244,7 +245,7 @@ cdef class Parser:
def Model(cls, nr_class, **cfg):
depth = util.env_opt('parser_hidden_depth', cfg.get('hidden_depth', 1))
if depth != 1:
raise ValueError("Currently parser depth is hard-coded to 1.")
raise ValueError(TempErrors.T004.format(value=depth))
parser_maxout_pieces = util.env_opt('parser_maxout_pieces',
cfg.get('maxout_pieces', 2))
token_vector_width = util.env_opt('token_vector_width',

@@ -254,11 +255,12 @@ cdef class Parser:
hist_size = util.env_opt('history_feats', cfg.get('hist_size', 0))
hist_width = util.env_opt('history_width', cfg.get('hist_width', 0))
if hist_size != 0:
raise ValueError("Currently history size is hard-coded to 0")
raise ValueError(TempErrors.T005.format(value=hist_size))
if hist_width != 0:
raise ValueError("Currently history width is hard-coded to 0")
raise ValueError(TempErrors.T006.format(value=hist_width))
pretrained_vectors = cfg.get('pretrained_vectors', None)
tok2vec = Tok2Vec(token_vector_width, embed_size,
pretrained_dims=cfg.get('pretrained_dims', 0))
pretrained_vectors=pretrained_vectors)
tok2vec = chain(tok2vec, flatten)
lower = PrecomputableAffine(hidden_width,
nF=cls.nr_feature, nI=token_vector_width,

@@ -277,6 +279,7 @@ cdef class Parser:
'token_vector_width': token_vector_width,
'hidden_width': hidden_width,
'maxout_pieces': parser_maxout_pieces,
'pretrained_vectors': pretrained_vectors,
'hist_size': hist_size,
'hist_width': hist_width
}

@@ -296,9 +299,9 @@ cdef class Parser:
unless True (default), in which case a new instance is created with
`Parser.Moves()`.
model (object): Defines how the parse-state is created, updated and
evaluated. The value is set to the .model attribute unless True
(default), in which case a new instance is created with
`Parser.Model()`.
evaluated. The value is set to the .model attribute. If set to True
(default), a new instance will be created with `Parser.Model()`
in parser.begin_training(), parser.from_disk() or parser.from_bytes().
**cfg: Arbitrary configuration parameters. Set to the `.cfg` attribute
"""
self.vocab = vocab

@@ -310,8 +313,7 @@ cdef class Parser:
cfg['beam_width'] = util.env_opt('beam_width', 1)
if 'beam_density' not in cfg:
cfg['beam_density'] = util.env_opt('beam_density', 0.0)
if 'pretrained_dims' not in cfg:
cfg['pretrained_dims'] = self.vocab.vectors.data.shape[1]
cfg.setdefault('cnn_maxout_pieces', 3)
self.cfg = cfg
self.model = model
self._multitasks = []

@@ -551,8 +553,13 @@ cdef class Parser:
def update(self, docs, golds, drop=0., sgd=None, losses=None):
if not any(self.moves.has_gold(gold) for gold in golds):
return None
assert len(docs) == len(golds)
if self.cfg.get('beam_width', 1) >= 2 and numpy.random.random() >= 0.5:
if len(docs) != len(golds):
raise ValueError(Errors.E077.format(value='update', n_docs=len(docs),
n_golds=len(golds)))
# The probability we use beam update, instead of falling back to
# a greedy update
beam_update_prob = 1-self.cfg.get('beam_update_prob', 0.5)
if self.cfg.get('beam_width', 1) >= 2 and numpy.random.random() >= beam_update_prob:
return self.update_beam(docs, golds,
self.cfg['beam_width'], self.cfg['beam_density'],
drop=drop, sgd=sgd, losses=losses)

@@ -595,7 +602,6 @@ cdef class Parser:
scores, bp_scores = vec2scores.begin_update(vector, drop=drop)

d_scores = self.get_batch_loss(states, golds, scores)
d_scores /= len(docs)
d_vector = bp_scores(d_scores, sgd=sgd)
if drop != 0:
d_vector *= mask

@@ -620,7 +626,7 @@ cdef class Parser:
break
self._make_updates(d_tokvecs,
bp_tokvecs, backprops, sgd, cuda_stream)

def update_beam(self, docs, golds, width=None, density=None,
drop=0., sgd=None, losses=None):
|
||||
if not any(self.moves.has_gold(gold) for gold in golds):
|
||||
|
@ -634,7 +640,6 @@ cdef class Parser:
|
|||
if losses is not None and self.name not in losses:
|
||||
losses[self.name] = 0.
|
||||
lengths = [len(d) for d in docs]
|
||||
assert min(lengths) >= 1
|
||||
states = self.moves.init_batch(docs)
|
||||
for gold in golds:
|
||||
self.moves.preprocess_gold(gold)
|
||||
|
@ -648,7 +653,6 @@ cdef class Parser:
|
|||
backprop_lower = []
|
||||
cdef float batch_size = len(docs)
|
||||
for i, d_scores in enumerate(states_d_scores):
|
||||
d_scores /= batch_size
|
||||
if losses is not None:
|
||||
losses[self.name] += (d_scores**2).sum()
|
||||
ids, bp_vectors, bp_scores = backprops[i]
|
||||
|
@ -846,7 +850,6 @@ cdef class Parser:
|
|||
self.moves.initialize_actions(actions)
|
||||
cfg.setdefault('token_vector_width', 128)
|
||||
if self.model is True:
|
||||
cfg['pretrained_dims'] = self.vocab.vectors_length
|
||||
self.model, cfg = self.Model(self.moves.n_moves, **cfg)
|
||||
if sgd is None:
|
||||
sgd = self.create_optimizer()
|
||||
|
@ -910,9 +913,11 @@ cdef class Parser:
|
|||
}
|
||||
util.from_disk(path, deserializers, exclude)
|
||||
if 'model' not in exclude:
|
||||
# TODO: Remove this once we don't have to handle previous models
|
||||
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
|
||||
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
|
||||
path = util.ensure_path(path)
|
||||
if self.model is True:
|
||||
self.cfg.setdefault('pretrained_dims', self.vocab.vectors_length)
|
||||
self.model, cfg = self.Model(**self.cfg)
|
||||
else:
|
||||
cfg = {}
|
||||
|
@ -955,12 +960,13 @@ cdef class Parser:
|
|||
))
|
||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||
if 'model' not in exclude:
|
||||
# TODO: Remove this once we don't have to handle previous models
|
||||
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
|
||||
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
|
||||
if self.model is True:
|
||||
self.model, cfg = self.Model(**self.cfg)
|
||||
cfg['pretrained_dims'] = self.vocab.vectors_length
|
||||
else:
|
||||
cfg = {}
|
||||
cfg['pretrained_dims'] = self.vocab.vectors_length
|
||||
if 'tok2vec_model' in msg:
|
||||
self.model[0].from_bytes(msg['tok2vec_model'])
|
||||
if 'lower_model' in msg:
|
||||
|
@ -1033,15 +1039,11 @@ def _cleanup(Beam beam):
|
|||
del state
|
||||
seen.add(addr)
|
||||
else:
|
||||
print(i, addr)
|
||||
print(seen)
|
||||
raise Exception
|
||||
raise ValueError(Errors.E023.format(addr=addr, i=i))
|
||||
addr = <size_t>beam._states[i].content
|
||||
if addr not in seen:
|
||||
state = <StateC*>addr
|
||||
del state
|
||||
seen.add(addr)
|
||||
else:
|
||||
print(i, addr)
|
||||
print(seen)
|
||||
raise Exception
|
||||
raise ValueError(Errors.E023.format(addr=addr, i=i))
|
||||
|
|
|
@@ -10,6 +10,7 @@ from __future__ import unicode_literals
from copy import copy

from ..tokens.doc cimport Doc, set_children_from_heads
from ..errors import Errors


DELIMITER = '||'

@@ -146,7 +147,10 @@ cpdef deprojectivize(Doc doc):

def _decorate(heads, proj_heads, labels):
# uses decoration scheme HEAD from Nivre & Nilsson 2005
assert(len(heads) == len(proj_heads) == len(labels))
if (len(heads) != len(proj_heads)) or (len(proj_heads) != len(labels)):
raise ValueError(Errors.E082.format(n_heads=len(heads),
                                    n_proj_heads=len(proj_heads),
                                    n_labels=len(labels)))
deco_labels = []
for tokenid, head in enumerate(heads):
if head != proj_heads[tokenid]:

@ -12,6 +12,7 @@ from ..structs cimport TokenC
|
|||
from .stateclass cimport StateClass
|
||||
from ..typedefs cimport attr_t
|
||||
from ..compat import json_dumps
|
||||
from ..errors import Errors
|
||||
from .. import util
|
||||
|
||||
|
||||
|
@ -73,10 +74,7 @@ cdef class TransitionSystem:
|
|||
action.do(state.c, action.label)
|
||||
break
|
||||
else:
|
||||
print(gold.words)
|
||||
print(gold.ner)
|
||||
print(history)
|
||||
raise ValueError("Could not find gold move")
|
||||
raise ValueError(Errors.E024)
|
||||
return history
|
||||
|
||||
cdef int initialize_state(self, StateC* state) nogil:
|
||||
|
@ -123,17 +121,7 @@ cdef class TransitionSystem:
|
|||
else:
|
||||
costs[i] = 9000
|
||||
if n_gold <= 0:
|
||||
print(gold.words)
|
||||
print(gold.ner)
|
||||
print([gold.c.ner[i].clas for i in range(gold.length)])
|
||||
print([gold.c.ner[i].move for i in range(gold.length)])
|
||||
print([gold.c.ner[i].label for i in range(gold.length)])
|
||||
print("Self labels",
|
||||
[self.c[i].label for i in range(self.n_moves)])
|
||||
raise ValueError(
|
||||
"Could not find a gold-standard action to supervise "
|
||||
"the entity recognizer. The transition system has "
|
||||
"%d actions." % (self.n_moves))
|
||||
raise ValueError(Errors.E024)
|
||||
|
||||
def get_class_name(self, int clas):
|
||||
act = self.c[clas]
|
||||
|
@ -171,7 +159,6 @@ cdef class TransitionSystem:
|
|||
self._size *= 2
|
||||
self.c = <Transition*>self.mem.realloc(self.c, self._size * sizeof(self.c[0]))
|
||||
self.c[self.n_moves] = self.init_transition(self.n_moves, action, label_id)
|
||||
assert self.c[self.n_moves].label == label_id
|
||||
self.n_moves += 1
|
||||
if self.labels.get(action, []):
|
||||
new_freq = min(self.labels[action].values())
|
||||
|
|
|
@@ -19,7 +19,9 @@ _languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
_models = {'en': ['en_core_web_sm'],
           'de': ['de_core_news_md'],
           'fr': ['fr_core_news_sm'],
           'xx': ['xx_ent_web_md']}
           'xx': ['xx_ent_web_md'],
           'en_core_web_md': ['en_core_web_md'],
           'es_core_news_md': ['es_core_news_md']}


# only used for tests that require loading the models

@@ -183,6 +185,9 @@ def pytest_addoption(parser):

for lang in _languages + ['all']:
parser.addoption("--%s" % lang, action="store_true", help="Use %s models" % lang)
for model in _models:
if model not in _languages:
parser.addoption("--%s" % model, action="store_true", help="Use %s model" % model)


def pytest_runtest_setup(item):

spacy/tests/lang/da/test_lemma.py (new file, 13 lines)
@@ -0,0 +1,13 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('string,lemma', [('affaldsgruppernes', 'affaldsgruppe'),
                                          ('detailhandelsstrukturernes', 'detailhandelsstruktur'),
                                          ('kolesterols', 'kolesterol'),
                                          ('åsyns', 'åsyn')])
def test_lemmatizer_lookup_assigns(da_tokenizer, string, lemma):
    tokens = da_tokenizer(string)
    assert tokens[0].lemma_ == lemma

@ -1,9 +1,74 @@
|
|||
from __future__ import unicode_literals
|
||||
import pytest
|
||||
|
||||
from ...vocab import Vocab
|
||||
from ...pipeline import DependencyParser
|
||||
from ...tokens import Doc
|
||||
from ...gold import GoldParse
|
||||
from ...syntax.nonproj import projectivize
|
||||
from ...syntax.stateclass import StateClass
|
||||
from ...syntax.arc_eager import ArcEager
|
||||
|
||||
|
||||
def get_sequence_costs(M, words, heads, deps, transitions):
|
||||
doc = Doc(Vocab(), words=words)
|
||||
gold = GoldParse(doc, heads=heads, deps=deps)
|
||||
state = StateClass(doc)
|
||||
M.preprocess_gold(gold)
|
||||
cost_history = []
|
||||
for gold_action in transitions:
|
||||
state_costs = {}
|
||||
for i in range(M.n_moves):
|
||||
name = M.class_name(i)
|
||||
state_costs[name] = M.get_cost(state, gold, i)
|
||||
M.transition(state, gold_action)
|
||||
cost_history.append(state_costs)
|
||||
return state, cost_history
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def vocab():
|
||||
return Vocab()
|
||||
|
||||
@pytest.fixture
|
||||
def arc_eager(vocab):
|
||||
moves = ArcEager(vocab.strings, ArcEager.get_actions())
|
||||
moves.add_action(2, 'left')
|
||||
moves.add_action(3, 'right')
|
||||
return moves
|
||||
|
||||
@pytest.fixture
|
||||
def words():
|
||||
return ['a', 'b']
|
||||
|
||||
@pytest.fixture
|
||||
def doc(words, vocab):
|
||||
if vocab is None:
|
||||
vocab = Vocab()
|
||||
return Doc(vocab, words=list(words))
|
||||
|
||||
@pytest.fixture
|
||||
def gold(doc, words):
|
||||
if len(words) == 2:
|
||||
return GoldParse(doc, words=['a', 'b'], heads=[0, 0], deps=['ROOT', 'right'])
|
||||
else:
|
||||
raise NotImplementedError
|
||||
|
||||
@pytest.mark.xfail
|
||||
def test_oracle_four_words(arc_eager, vocab):
|
||||
words = ['a', 'b', 'c', 'd']
|
||||
heads = [1, 1, 3, 3]
|
||||
deps = ['left', 'ROOT', 'left', 'ROOT']
|
||||
actions = ['L-left', 'B-ROOT', 'L-left']
|
||||
state, cost_history = get_sequence_costs(arc_eager, words, heads, deps, actions)
|
||||
assert state.is_final()
|
||||
for i, state_costs in enumerate(cost_history):
|
||||
# Check gold moves is 0 cost
|
||||
assert state_costs[actions[i]] == 0.0, actions[i]
|
||||
for other_action, cost in state_costs.items():
|
||||
if other_action != actions[i]:
|
||||
assert cost >= 1
|
||||
|
||||
|
||||
annot_tuples = [
|
||||
(0, 'When', 'WRB', 11, 'advmod', 'O'),
|
||||
|
|
spacy/tests/regression/test_issue1660.py (new file, 12 lines)
@@ -0,0 +1,12 @@
from __future__ import unicode_literals
import pytest
from ...util import load_model

@pytest.mark.models("en_core_web_md")
@pytest.mark.models("es_core_news_md")
def test_models_with_different_vectors():
    nlp = load_model('en_core_web_md')
    doc = nlp(u'hello world')
    nlp2 = load_model('es_core_news_md')
    doc2 = nlp2(u'hola')
    doc = nlp(u'hello world')

spacy/tests/regression/test_issue1967.py (new file, 15 lines)
@@ -0,0 +1,15 @@
# coding: utf8
from __future__ import unicode_literals

import pytest

from ...pipeline import EntityRecognizer
from ...vocab import Vocab


@pytest.mark.parametrize('label', ['U-JOB-NAME'])
def test_issue1967(label):
    ner = EntityRecognizer(Vocab())
    entry = ([0], ['word'], ['tag'], [0], ['dep'], [label])
    gold_parses = [(None, [(entry, None)])]
    ner.moves.get_actions(gold_parses=gold_parses)

@@ -17,6 +17,7 @@ def meta_data():
'email': 'email-in-fixture',
'url': 'url-in-fixture',
'license': 'license-in-fixture',
'vectors': {'width': 0, 'vectors': 0, 'keys': 0, 'name': None}
}

@@ -10,8 +10,8 @@ from ..gold import GoldParse

def test_textcat_learns_multilabel():
random.seed(0)
numpy.random.seed(0)
random.seed(5)
numpy.random.seed(5)
docs = []
nlp = English()
vocab = nlp.vocab

@ -1,4 +1,11 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from mock import Mock
|
||||
|
||||
from ..vocab import Vocab
|
||||
from ..tokens import Doc, Span, Token
|
||||
from ..tokens.underscore import Underscore
|
||||
|
||||
|
||||
|
@ -51,3 +58,42 @@ def test_token_underscore_method():
|
|||
None, None)
|
||||
token._ = Underscore(Underscore.token_extensions, token, start=token.idx)
|
||||
assert token._.hello() == 'cheese'
|
||||
|
||||
|
||||
@pytest.mark.parametrize('obj', [Doc, Span, Token])
|
||||
def test_doc_underscore_remove_extension(obj):
|
||||
ext_name = 'to_be_removed'
|
||||
obj.set_extension(ext_name, default=False)
|
||||
assert obj.has_extension(ext_name)
|
||||
obj.remove_extension(ext_name)
|
||||
assert not obj.has_extension(ext_name)
|
||||
|
||||
|
||||
@pytest.mark.parametrize('obj', [Doc, Span, Token])
|
||||
def test_underscore_raises_for_dup(obj):
|
||||
obj.set_extension('test', default=None)
|
||||
with pytest.raises(ValueError):
|
||||
obj.set_extension('test', default=None)
|
||||
|
||||
|
||||
@pytest.mark.parametrize('invalid_kwargs', [
|
||||
{'getter': None, 'setter': lambda: None},
|
||||
{'default': None, 'method': lambda: None, 'getter': lambda: None},
|
||||
{'setter': lambda: None},
|
||||
{'default': None, 'method': lambda: None},
|
||||
{'getter': True}])
|
||||
def test_underscore_raises_for_invalid(invalid_kwargs):
|
||||
invalid_kwargs['force'] = True
|
||||
with pytest.raises(ValueError):
|
||||
Doc.set_extension('test', **invalid_kwargs)
|
||||
|
||||
|
||||
@pytest.mark.parametrize('valid_kwargs', [
|
||||
{'getter': lambda: None},
|
||||
{'getter': lambda: None, 'setter': lambda: None},
|
||||
{'default': 'hello'},
|
||||
{'default': None},
|
||||
{'method': lambda: None}])
|
||||
def test_underscore_accepts_valid(valid_kwargs):
|
||||
valid_kwargs['force'] = True
|
||||
Doc.set_extension('test', **valid_kwargs)
|
||||
|
|
|
@@ -28,12 +28,38 @@ def vectors():
def data():
    return numpy.asarray([[0.0, 1.0, 2.0], [3.0, -2.0, 4.0]], dtype='f')

@pytest.fixture
def resize_data():
    return numpy.asarray([[0.0, 1.0], [2.0, 3.0]], dtype='f')

@pytest.fixture()
def vocab(en_vocab, vectors):
    add_vecs_to_vocab(en_vocab, vectors)
    return en_vocab

def test_init_vectors_with_resize_shape(strings,resize_data):
    v = Vectors(shape=(len(strings), 3))
    v.resize(shape=resize_data.shape)
    assert v.shape == resize_data.shape
    assert v.shape != (len(strings), 3)

def test_init_vectors_with_resize_data(data,resize_data):
    v = Vectors(data=data)
    v.resize(shape=resize_data.shape)
    assert v.shape == resize_data.shape
    assert v.shape != data.shape

def test_get_vector_resize(strings, data,resize_data):
    v = Vectors(data=data)
    v.resize(shape=resize_data.shape)
    strings = [hash_string(s) for s in strings]
    for i, string in enumerate(strings):
        v.add(string, row=i)

    assert list(v[strings[0]]) == list(resize_data[0])
    assert list(v[strings[0]]) != list(resize_data[1])
    assert list(v[strings[1]]) != list(resize_data[0])
    assert list(v[strings[1]]) == list(resize_data[1])

def test_init_vectors_with_data(strings, data):
    v = Vectors(data=data)

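The new tests above exercise Vectors.resize(), which reflows the existing values into the requested shape before rows are mapped to keys. The snippet below is a minimal sketch of that usage, not part of the commit; the key 123 and the values are arbitrary, and keys are passed as integer hashes, as in the tests.

# Illustrative example, not part of the commit.
import numpy
from spacy.vectors import Vectors

data = numpy.asarray([[0.0, 1.0, 2.0], [3.0, -2.0, 4.0]], dtype='f')
v = Vectors(data=data)
v.resize(shape=(2, 2))      # reflow the six stored values into a 2x2 table
v.add(123, row=0)           # map an (arbitrary) integer key onto row 0
assert list(v[123]) == [0.0, 1.0]
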
@@ -13,6 +13,7 @@ cimport cython

from .tokens.doc cimport Doc
from .strings cimport hash_string
from .errors import Errors, Warnings, deprecation_warning
from . import util

@@ -63,11 +64,7 @@ cdef class Tokenizer:
return (self.__class__, args, None, None)

cpdef Doc tokens_from_list(self, list strings):
util.deprecated(
    "Tokenizer.from_list is now deprecated. Create a new Doc "
    "object instead and pass in the strings as the `words` keyword "
    "argument, for example:\nfrom spacy.tokens import Doc\n"
    "doc = Doc(nlp.vocab, words=[...])")
deprecation_warning(Warnings.W002)
return Doc(self.vocab, words=strings)

@cython.boundscheck(False)

@@ -78,8 +75,7 @@ cdef class Tokenizer:
RETURNS (Doc): A container for linguistic annotations.
"""
if len(string) >= (2 ** 30):
msg = "String is too long: %d characters. Max is 2**30."
raise ValueError(msg % len(string))
raise ValueError(Errors.E025.format(length=len(string)))
cdef int length = len(string)
cdef Doc doc = Doc(self.vocab)
if length == 0:

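The hunk above deprecates Tokenizer.tokens_from_list and points users at constructing a Doc from pre-tokenized words instead. The snippet below is a minimal sketch of that replacement, not part of the commit; it uses a bare Vocab purely for illustration.

# Illustrative example, not part of the commit.
from spacy.vocab import Vocab
from spacy.tokens import Doc

words = ['Hello', 'world', '!']
doc = Doc(Vocab(), words=words)
assert [t.text for t in doc] == words
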
spacy/tokens/_retokenize.pyx (new file, 129 lines)
@ -0,0 +1,129 @@
|
|||
# coding: utf8
|
||||
# cython: infer_types=True
|
||||
# cython: bounds_check=False
|
||||
# cython: profile=True
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from libc.string cimport memcpy, memset
|
||||
|
||||
from .doc cimport Doc, set_children_from_heads, token_by_start, token_by_end
|
||||
from .span cimport Span
|
||||
from .token cimport Token
|
||||
from ..lexeme cimport Lexeme, EMPTY_LEXEME
|
||||
from ..structs cimport LexemeC, TokenC
|
||||
from ..attrs cimport *
|
||||
|
||||
|
||||
cdef class Retokenizer:
|
||||
'''Helper class for doc.retokenize() context manager.'''
|
||||
cdef Doc doc
|
||||
cdef list merges
|
||||
cdef list splits
|
||||
def __init__(self, doc):
|
||||
self.doc = doc
|
||||
self.merges = []
|
||||
self.splits = []
|
||||
|
||||
def merge(self, Span span, attrs=None):
|
||||
'''Mark a span for merging. The attrs will be applied to the resulting
|
||||
token.'''
|
||||
self.merges.append((span.start_char, span.end_char, attrs))
|
||||
|
||||
def split(self, Token token, orths, attrs=None):
|
||||
'''Mark a Token for splitting, into the specified orths. The attrs
|
||||
will be applied to each subtoken.'''
|
||||
self.splits.append((token.start_char, orths, attrs))
|
||||
|
||||
def __enter__(self):
|
||||
self.merges = []
|
||||
self.splits = []
|
||||
return self
|
||||
|
||||
def __exit__(self, *args):
|
||||
# Do the actual merging here
|
||||
for start_char, end_char, attrs in self.merges:
|
||||
start = token_by_start(self.doc.c, self.doc.length, start_char)
|
||||
end = token_by_end(self.doc.c, self.doc.length, end_char)
|
||||
_merge(self.doc, start, end+1, attrs)
|
||||
for start_char, orths, attrs in self.splits:
|
||||
raise NotImplementedError
|
||||
|
||||
|
||||
def _merge(Doc doc, int start, int end, attributes):
|
||||
"""Retokenize the document, such that the span at
|
||||
`doc.text[start_idx : end_idx]` is merged into a single token. If
|
||||
`start_idx` and `end_idx `do not mark start and end token boundaries,
|
||||
the document remains unchanged.
|
||||
|
||||
start_idx (int): Character index of the start of the slice to merge.
|
||||
end_idx (int): Character index after the end of the slice to merge.
|
||||
**attributes: Attributes to assign to the merged token. By default,
|
||||
attributes are inherited from the syntactic root of the span.
|
||||
RETURNS (Token): The newly merged token, or `None` if the start and end
|
||||
indices did not fall at token boundaries.
|
||||
"""
|
||||
cdef Span span = doc[start:end]
|
||||
cdef int start_char = span.start_char
|
||||
cdef int end_char = span.end_char
|
||||
# Get LexemeC for newly merged token
|
||||
new_orth = ''.join([t.text_with_ws for t in span])
|
||||
if span[-1].whitespace_:
|
||||
new_orth = new_orth[:-len(span[-1].whitespace_)]
|
||||
cdef const LexemeC* lex = doc.vocab.get(doc.mem, new_orth)
|
||||
# House the new merged token where it starts
|
||||
cdef TokenC* token = &doc.c[start]
|
||||
token.spacy = doc.c[end-1].spacy
|
||||
for attr_name, attr_value in attributes.items():
|
||||
if attr_name == TAG:
|
||||
doc.vocab.morphology.assign_tag(token, attr_value)
|
||||
else:
|
||||
Token.set_struct_attr(token, attr_name, attr_value)
|
||||
# Make sure ent_iob remains consistent
|
||||
if doc.c[end].ent_iob == 1 and token.ent_iob in (0, 2):
|
||||
if token.ent_type == doc.c[end].ent_type:
|
||||
token.ent_iob = 3
|
||||
else:
|
||||
# If they're not the same entity type, let them be two entities
|
||||
doc.c[end].ent_iob = 3
|
||||
# Begin by setting all the head indices to absolute token positions
|
||||
# This is easier to work with for now than the offsets
|
||||
# Before thinking of something simpler, beware the case where a
|
||||
# dependency bridges over the entity. Here the alignment of the
|
||||
# tokens changes.
|
||||
span_root = span.root.i
|
||||
token.dep = span.root.dep
|
||||
# We update token.lex after keeping span root and dep, since
|
||||
# setting token.lex will change span.start and span.end properties
|
||||
# as it modifies the character offsets in the doc
|
||||
token.lex = lex
|
||||
for i in range(doc.length):
|
||||
doc.c[i].head += i
|
||||
# Set the head of the merged token, and its dep relation, from the Span
|
||||
token.head = doc.c[span_root].head
|
||||
# Adjust deps before shrinking tokens
|
||||
# Tokens which point into the merged token should now point to it
|
||||
# Subtract the offset from all tokens which point to >= end
|
||||
offset = (end - start) - 1
|
||||
for i in range(doc.length):
|
||||
head_idx = doc.c[i].head
|
||||
if start <= head_idx < end:
|
||||
doc.c[i].head = start
|
||||
elif head_idx >= end:
|
||||
doc.c[i].head -= offset
|
||||
# Now compress the token array
|
||||
for i in range(end, doc.length):
|
||||
doc.c[i - offset] = doc.c[i]
|
||||
for i in range(doc.length - offset, doc.length):
|
||||
memset(&doc.c[i], 0, sizeof(TokenC))
|
||||
doc.c[i].lex = &EMPTY_LEXEME
|
||||
doc.length -= offset
|
||||
for i in range(doc.length):
|
||||
# ...And, set heads back to a relative position
|
||||
doc.c[i].head -= i
|
||||
# Set the left/right children, left/right edges
|
||||
set_children_from_heads(doc.c, doc.length)
|
||||
# Clear the cached Python objects
|
||||
# Return the merged Python object
|
||||
return doc[start]
|
||||
|
||||
|
|
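The new _retokenize.pyx above adds a Retokenizer helper whose merges are collected inside a context manager and applied all at once on exit, via Doc.retokenize(). The snippet below is a small usage sketch based on that behaviour, not part of the commit; the example words are arbitrary and no extra attributes are set.

# Illustrative example, not part of the commit.
from spacy.vocab import Vocab
from spacy.tokens import Doc

doc = Doc(Vocab(), words=['New', 'York', 'is', 'big'])
with doc.retokenize() as retokenizer:
    # Mark the span for merging; the merge happens when the block exits.
    retokenizer.merge(doc[0:2], attrs={})
assert doc[0].text == 'New York'
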
@@ -28,6 +28,8 @@ cdef int token_by_start(const TokenC* tokens, int length, int start_char) except
cdef int token_by_end(const TokenC* tokens, int length, int end_char) except -2


cdef int set_children_from_heads(TokenC* tokens, int length) except -1

cdef class Doc:
cdef readonly Pool mem
cdef readonly Vocab vocab

@ -31,18 +31,19 @@ from ..attrs cimport ENT_TYPE, SENT_START
|
|||
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
|
||||
from ..util import normalize_slice
|
||||
from ..compat import is_config, copy_reg, pickle, basestring_
|
||||
from .. import about
|
||||
from ..errors import Errors, Warnings, deprecation_warning
|
||||
from .. import util
|
||||
from .underscore import Underscore
|
||||
from .underscore import Underscore, get_ext_args
|
||||
from ._retokenize import Retokenizer
|
||||
|
||||
DEF PADDING = 5
|
||||
|
||||
|
||||
cdef int bounds_check(int i, int length, int padding) except -1:
|
||||
if (i + padding) < 0:
|
||||
raise IndexError
|
||||
raise IndexError(Errors.E026.format(i=i, length=length))
|
||||
if (i - padding) >= length:
|
||||
raise IndexError
|
||||
raise IndexError(Errors.E026.format(i=i, length=length))
|
||||
|
||||
|
||||
cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
|
||||
|
@ -94,11 +95,10 @@ cdef class Doc:
|
|||
spaces=[True, False, False])
|
||||
"""
|
||||
@classmethod
|
||||
def set_extension(cls, name, default=None, method=None,
|
||||
getter=None, setter=None):
|
||||
nr_defined = sum(t is not None for t in (default, getter, setter, method))
|
||||
assert nr_defined == 1
|
||||
Underscore.doc_extensions[name] = (default, method, getter, setter)
|
||||
def set_extension(cls, name, **kwargs):
|
||||
if cls.has_extension(name) and not kwargs.get('force', False):
|
||||
raise ValueError(Errors.E090.format(name=name, obj='Doc'))
|
||||
Underscore.doc_extensions[name] = get_ext_args(**kwargs)
|
||||
|
||||
@classmethod
|
||||
def get_extension(cls, name):
|
||||
|
@ -108,6 +108,12 @@ cdef class Doc:
|
|||
def has_extension(cls, name):
|
||||
return name in Underscore.doc_extensions
|
||||
|
||||
@classmethod
|
||||
def remove_extension(cls, name):
|
||||
if not cls.has_extension(name):
|
||||
raise ValueError(Errors.E046.format(name=name))
|
||||
return Underscore.doc_extensions.pop(name)
|
||||
|
||||
def __init__(self, Vocab vocab, words=None, spaces=None, user_data=None,
|
||||
orths_and_spaces=None):
|
||||
"""Create a Doc object.
|
||||
|
@ -154,11 +160,7 @@ cdef class Doc:
|
|||
if spaces is None:
|
||||
spaces = [True] * len(words)
|
||||
elif len(spaces) != len(words):
|
||||
raise ValueError(
|
||||
"Arguments 'words' and 'spaces' should be sequences of "
|
||||
"the same length, or 'spaces' should be left default at "
|
||||
"None. spaces should be a sequence of booleans, with True "
|
||||
"meaning that the word owns a ' ' character following it.")
|
||||
raise ValueError(Errors.E027)
|
||||
orths_and_spaces = zip(words, spaces)
|
||||
if orths_and_spaces is not None:
|
||||
for orth_space in orths_and_spaces:
|
||||
|
@ -166,10 +168,7 @@ cdef class Doc:
|
|||
orth = orth_space
|
||||
has_space = True
|
||||
elif isinstance(orth_space, bytes):
|
||||
raise ValueError(
|
||||
"orths_and_spaces expects either List(unicode) or "
|
||||
"List((unicode, bool)). "
|
||||
"Got bytes instance: %s" % (str(orth_space)))
|
||||
raise ValueError(Errors.E028.format(value=orth_space))
|
||||
else:
|
||||
orth, has_space = orth_space
|
||||
# Note that we pass self.mem here --- we have ownership, if LexemeC
|
||||
|
@ -319,7 +318,7 @@ cdef class Doc:
|
|||
break
|
||||
else:
|
||||
return 1.0
|
||||
|
||||
|
||||
if self.vector_norm == 0 or other.vector_norm == 0:
|
||||
return 0.0
|
||||
return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)
|
||||
|
@ -437,10 +436,7 @@ cdef class Doc:
|
|||
if token.ent_iob == 1:
|
||||
if start == -1:
|
||||
seq = ['%s|%s' % (t.text, t.ent_iob_) for t in self[i-5:i+5]]
|
||||
raise ValueError(
|
||||
"token.ent_iob values make invalid sequence: "
|
||||
"I without B\n"
|
||||
"{seq}".format(seq=' '.join(seq)))
|
||||
raise ValueError(Errors.E093.format(seq=' '.join(seq)))
|
||||
elif token.ent_iob == 2 or token.ent_iob == 0:
|
||||
if start != -1:
|
||||
output.append(Span(self, start, i, label=label))
|
||||
|
@ -503,19 +499,16 @@ cdef class Doc:
|
|||
"""
|
||||
def __get__(self):
|
||||
if not self.is_parsed:
|
||||
raise ValueError(
|
||||
"noun_chunks requires the dependency parse, which "
|
||||
"requires a statistical model to be installed and loaded. "
|
||||
"For more info, see the "
|
||||
"documentation: \n%s\n" % about.__docs_models__)
|
||||
raise ValueError(Errors.E029)
|
||||
# Accumulate the result before beginning to iterate over it. This
|
||||
# prevents the tokenisation from being changed out from under us
|
||||
# during the iteration. The tricky thing here is that Span accepts
|
||||
# its tokenisation changing, so it's okay once we have the Span
|
||||
# objects. See Issue #375.
|
||||
spans = []
|
||||
for start, end, label in self.noun_chunks_iterator(self):
|
||||
spans.append(Span(self, start, end, label=label))
|
||||
if self.noun_chunks_iterator is not None:
|
||||
for start, end, label in self.noun_chunks_iterator(self):
|
||||
spans.append(Span(self, start, end, label=label))
|
||||
for span in spans:
|
||||
yield span
|
||||
|
||||
|
@ -532,12 +525,7 @@ cdef class Doc:
|
|||
"""
|
||||
def __get__(self):
|
||||
if not self.is_sentenced:
|
||||
raise ValueError(
|
||||
"Sentence boundaries unset. You can add the 'sentencizer' "
|
||||
"component to the pipeline with: "
|
||||
"nlp.add_pipe(nlp.create_pipe('sentencizer')) "
|
||||
"Alternatively, add the dependency parser, or set "
|
||||
"sentence boundaries by setting doc[i].sent_start")
|
||||
raise ValueError(Errors.E030)
|
||||
if 'sents' in self.user_hooks:
|
||||
yield from self.user_hooks['sents'](self)
|
||||
else:
|
||||
|
@ -567,7 +555,8 @@ cdef class Doc:
|
|||
t.idx = (t-1).idx + (t-1).lex.length + (t-1).spacy
|
||||
t.l_edge = self.length
|
||||
t.r_edge = self.length
|
||||
assert t.lex.orth != 0
|
||||
if t.lex.orth == 0:
|
||||
raise ValueError(Errors.E031.format(i=self.length))
|
||||
t.spacy = has_space
|
||||
self.length += 1
|
||||
return t.idx + t.lex.length + t.spacy
|
||||
|
@ -683,13 +672,7 @@ cdef class Doc:
|
|||
|
||||
def from_array(self, attrs, array):
|
||||
if SENT_START in attrs and HEAD in attrs:
|
||||
raise ValueError(
|
||||
"Conflicting attributes specified in doc.from_array(): "
|
||||
"(HEAD, SENT_START)\n"
|
||||
"The HEAD attribute currently sets sentence boundaries "
|
||||
"implicitly, based on the tree structure. This means the HEAD "
|
||||
"attribute would potentially override the sentence boundaries "
|
||||
"set by SENT_START.")
|
||||
raise ValueError(Errors.E032)
|
||||
cdef int i, col
|
||||
cdef attr_id_t attr_id
|
||||
cdef TokenC* tokens = self.c
|
||||
|
@ -827,7 +810,7 @@ cdef class Doc:
|
|||
RETURNS (Doc): Itself.
|
||||
"""
|
||||
if self.length != 0:
|
||||
raise ValueError("Cannot load into non-empty Doc")
|
||||
raise ValueError(Errors.E033.format(length=self.length))
|
||||
deserializers = {
|
||||
'text': lambda b: None,
|
||||
'array_head': lambda b: None,
|
||||
|
@ -878,7 +861,7 @@ cdef class Doc:
|
|||
computed by the models in the pipeline. Let's say a
|
||||
document with 30 words has a tensor with 128 dimensions
|
||||
per word. doc.tensor.shape will be (30, 128). After
|
||||
calling doc.extend_tensor with an array of hape (30, 64),
|
||||
calling doc.extend_tensor with an array of shape (30, 64),
|
||||
doc.tensor == (30, 192).
|
||||
'''
|
||||
xp = get_array_module(self.tensor)
|
||||
|
@ -888,6 +871,18 @@ cdef class Doc:
|
|||
else:
|
||||
self.tensor = xp.hstack((self.tensor, tensor))
|
||||
|
||||
def retokenize(self):
|
||||
'''Context manager to handle retokenization of the Doc.
|
||||
Modifications to the Doc's tokenization are stored, and then
|
||||
made all at once when the context manager exits. This is
|
||||
much more efficient, and less error-prone.
|
||||
|
||||
All views of the Doc (Span and Token) created before the
|
||||
retokenization are invalidated, although they may accidentally
|
||||
continue to work.
|
||||
'''
|
||||
return Retokenizer(self)
|
||||
|
||||
def merge(self, int start_idx, int end_idx, *args, **attributes):
|
||||
"""Retokenize the document, such that the span at
|
||||
`doc.text[start_idx : end_idx]` is merged into a single token. If
|
||||
|
@ -903,10 +898,7 @@ cdef class Doc:
|
|||
"""
|
||||
cdef unicode tag, lemma, ent_type
|
||||
if len(args) == 3:
|
||||
util.deprecated(
|
||||
"Positional arguments to Doc.merge are deprecated. Instead, "
|
||||
"use the keyword arguments, for example tag=, lemma= or "
|
||||
"ent_type=.")
|
||||
deprecation_warning(Warnings.W003)
|
||||
tag, lemma, ent_type = args
|
||||
attributes[TAG] = tag
|
||||
attributes[LEMMA] = lemma
|
||||
|
@ -920,13 +912,9 @@ cdef class Doc:
|
|||
if 'ent_type' in attributes:
|
||||
attributes[ENT_TYPE] = attributes['ent_type']
|
||||
elif args:
|
||||
raise ValueError(
|
||||
"Doc.merge received %d non-keyword arguments. Expected either "
|
||||
"3 arguments (deprecated), or 0 (use keyword arguments). "
|
||||
"Arguments supplied:\n%s\n"
|
||||
"Keyword arguments: %s\n" % (len(args), repr(args),
|
||||
repr(attributes)))
|
||||
|
||||
raise ValueError(Errors.E034.format(n_args=len(args),
|
||||
args=repr(args),
|
||||
kwargs=repr(attributes)))
|
||||
# More deprecated attribute handling =/
|
||||
if 'label' in attributes:
|
||||
attributes['ent_type'] = attributes.pop('label')
|
||||
|
@ -941,66 +929,8 @@ cdef class Doc:
|
|||
return None
|
||||
# Currently we have the token index, we want the range-end index
|
||||
end += 1
|
||||
cdef Span span = self[start:end]
|
||||
# Get LexemeC for newly merged token
|
||||
new_orth = ''.join([t.text_with_ws for t in span])
|
||||
if span[-1].whitespace_:
|
||||
new_orth = new_orth[:-len(span[-1].whitespace_)]
|
||||
cdef const LexemeC* lex = self.vocab.get(self.mem, new_orth)
|
||||
# House the new merged token where it starts
|
||||
cdef TokenC* token = &self.c[start]
|
||||
token.spacy = self.c[end-1].spacy
|
||||
for attr_name, attr_value in attributes.items():
|
||||
if attr_name == TAG:
|
||||
self.vocab.morphology.assign_tag(token, attr_value)
|
||||
else:
|
||||
Token.set_struct_attr(token, attr_name, attr_value)
|
||||
# Make sure ent_iob remains consistent
|
||||
if self.c[end].ent_iob == 1 and token.ent_iob in (0, 2):
|
||||
if token.ent_type == self.c[end].ent_type:
|
||||
token.ent_iob = 3
|
||||
else:
|
||||
# If they're not the same entity type, let them be two entities
|
||||
self.c[end].ent_iob = 3
|
||||
# Begin by setting all the head indices to absolute token positions
|
||||
# This is easier to work with for now than the offsets
|
||||
# Before thinking of something simpler, beware the case where a
|
||||
# dependency bridges over the entity. Here the alignment of the
|
||||
# tokens changes.
|
||||
span_root = span.root.i
|
||||
token.dep = span.root.dep
|
||||
# We update token.lex after keeping span root and dep, since
|
||||
# setting token.lex will change span.start and span.end properties
|
||||
# as it modifies the character offsets in the doc
|
||||
token.lex = lex
|
||||
for i in range(self.length):
|
||||
self.c[i].head += i
|
||||
# Set the head of the merged token, and its dep relation, from the Span
|
||||
token.head = self.c[span_root].head
|
||||
# Adjust deps before shrinking tokens
|
||||
# Tokens which point into the merged token should now point to it
|
||||
# Subtract the offset from all tokens which point to >= end
|
||||
offset = (end - start) - 1
|
||||
for i in range(self.length):
|
||||
head_idx = self.c[i].head
|
||||
if start <= head_idx < end:
|
||||
self.c[i].head = start
|
||||
elif head_idx >= end:
|
||||
self.c[i].head -= offset
|
||||
# Now compress the token array
|
||||
for i in range(end, self.length):
|
||||
self.c[i - offset] = self.c[i]
|
||||
for i in range(self.length - offset, self.length):
|
||||
memset(&self.c[i], 0, sizeof(TokenC))
|
||||
self.c[i].lex = &EMPTY_LEXEME
|
||||
self.length -= offset
|
||||
for i in range(self.length):
|
||||
# ...And, set heads back to a relative position
|
||||
self.c[i].head -= i
|
||||
# Set the left/right children, left/right edges
|
||||
set_children_from_heads(self.c, self.length)
|
||||
# Clear the cached Python objects
|
||||
# Return the merged Python object
|
||||
with self.retokenize() as retokenizer:
|
||||
retokenizer.merge(self[start:end], attrs=attributes)
|
||||
return self[start]
|
||||
|
||||
def print_tree(self, light=False, flat=False):
|
||||
|
|
|
@@ -8,7 +8,7 @@ from ..symbols import HEAD, TAG, DEP, ENT_IOB, ENT_TYPE
def merge_ents(doc):
    """Helper: merge adjacent entities into single tokens; modifies the doc."""
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, ent.label_)
        ent.merge(tag=ent.root.tag_, lemma=ent.text, ent_type=ent.label_)
    return doc

@ -16,16 +16,17 @@ from ..util import normalize_slice
|
|||
from ..attrs cimport IS_PUNCT, IS_SPACE
|
||||
from ..lexeme cimport Lexeme
|
||||
from ..compat import is_config
|
||||
from .. import about
|
||||
from .underscore import Underscore
|
||||
from ..errors import Errors, TempErrors
|
||||
from .underscore import Underscore, get_ext_args
|
||||
|
||||
|
||||
cdef class Span:
|
||||
"""A slice from a Doc object."""
|
||||
@classmethod
|
||||
def set_extension(cls, name, default=None, method=None,
|
||||
getter=None, setter=None):
|
||||
Underscore.span_extensions[name] = (default, method, getter, setter)
|
||||
def set_extension(cls, name, **kwargs):
|
||||
if cls.has_extension(name) and not kwargs.get('force', False):
|
||||
raise ValueError(Errors.E090.format(name=name, obj='Span'))
|
||||
Underscore.span_extensions[name] = get_ext_args(**kwargs)
|
||||
|
||||
@classmethod
|
||||
def get_extension(cls, name):
|
||||
|
@ -35,6 +36,12 @@ cdef class Span:
|
|||
def has_extension(cls, name):
|
||||
return name in Underscore.span_extensions
|
||||
|
||||
@classmethod
|
||||
def remove_extension(cls, name):
|
||||
if not cls.has_extension(name):
|
||||
raise ValueError(Errors.E046.format(name=name))
|
||||
return Underscore.span_extensions.pop(name)
|
||||
|
||||
def __cinit__(self, Doc doc, int start, int end, attr_t label=0,
|
||||
vector=None, vector_norm=None):
|
||||
"""Create a `Span` object from the slice `doc[start : end]`.
|
||||
|
@ -48,8 +55,7 @@ cdef class Span:
|
|||
RETURNS (Span): The newly constructed object.
|
||||
"""
|
||||
if not (0 <= start <= end <= len(doc)):
|
||||
raise IndexError
|
||||
|
||||
raise IndexError(Errors.E035.format(start=start, end=end, length=len(doc)))
|
||||
self.doc = doc
|
||||
self.start = start
|
||||
self.start_char = self.doc[start].idx if start < self.doc.length else 0
|
||||
|
@ -58,7 +64,8 @@ cdef class Span:
|
|||
self.end_char = self.doc[end - 1].idx + len(self.doc[end - 1])
|
||||
else:
|
||||
self.end_char = 0
|
||||
assert label in doc.vocab.strings, label
|
||||
if label not in doc.vocab.strings:
|
||||
raise ValueError(Errors.E084.format(label=label))
|
||||
self.label = label
|
||||
self._vector = vector
|
||||
self._vector_norm = vector_norm
|
||||
|
@ -267,11 +274,10 @@ cdef class Span:
|
|||
or (self.doc.c[self.end-1].idx + self.doc.c[self.end-1].lex.length) != self.end_char:
|
||||
start = token_by_start(self.doc.c, self.doc.length, self.start_char)
|
||||
if self.start == -1:
|
||||
raise IndexError("Error calculating span: Can't find start")
|
||||
raise IndexError(Errors.E036.format(start=self.start_char))
|
||||
end = token_by_end(self.doc.c, self.doc.length, self.end_char)
|
||||
if end == -1:
|
||||
raise IndexError("Error calculating span: Can't find end")
|
||||
|
||||
raise IndexError(Errors.E037.format(end=self.end_char))
|
||||
self.start = start
|
||||
self.end = end + 1
|
||||
|
||||
|
@ -294,12 +300,11 @@ cdef class Span:
|
|||
cdef int i
|
||||
if self.doc.is_parsed:
|
||||
root = &self.doc.c[self.start]
|
||||
n = 0
|
||||
while root.head != 0:
|
||||
root += root.head
|
||||
n += 1
|
||||
if n >= self.doc.length:
|
||||
raise RuntimeError
|
||||
raise RuntimeError(Errors.E038)
|
||||
return self.doc[root.l_edge:root.r_edge + 1]
|
||||
elif self.doc.is_sentenced:
|
||||
# find start of the sentence
|
||||
|
@ -314,13 +319,7 @@ cdef class Span:
|
|||
n += 1
|
||||
if n >= self.doc.length:
|
||||
break
|
||||
#
|
||||
return self.doc[start:end]
|
||||
else:
|
||||
raise ValueError(
|
||||
"Access to sentence requires either the dependency parse "
|
||||
"or sentence boundaries to be set by setting " +
|
||||
"doc[i].is_sent_start = True")
|
||||
|
||||
property has_vector:
|
||||
"""RETURNS (bool): Whether a word vector is associated with the object.
|
||||
|
@ -402,11 +401,7 @@ cdef class Span:
|
|||
"""
|
||||
def __get__(self):
|
||||
if not self.doc.is_parsed:
|
||||
raise ValueError(
|
||||
"noun_chunks requires the dependency parse, which "
|
||||
"requires a statistical model to be installed and loaded. "
|
||||
"For more info, see the "
|
||||
"documentation: \n%s\n" % about.__docs_models__)
|
||||
raise ValueError(Errors.E029)
|
||||
# Accumulate the result before beginning to iterate over it. This
|
||||
# prevents the tokenisation from being changed out from under us
|
||||
# during the iteration. The tricky thing here is that Span accepts
|
||||
|
@ -552,9 +547,7 @@ cdef class Span:
|
|||
return self.root.ent_id
|
||||
|
||||
def __set__(self, hash_t key):
|
||||
raise NotImplementedError(
|
||||
"Can't yet set ent_id from Span. Vote for this feature on "
|
||||
"the issue tracker: http://github.com/explosion/spaCy/issues")
|
||||
raise NotImplementedError(TempErrors.T007.format(attr='ent_id'))
|
||||
|
||||
property ent_id_:
|
||||
"""RETURNS (unicode): The (string) entity ID."""
|
||||
|
@ -562,9 +555,7 @@ cdef class Span:
|
|||
return self.root.ent_id_
|
||||
|
||||
def __set__(self, hash_t key):
|
||||
raise NotImplementedError(
|
||||
"Can't yet set ent_id_ from Span. Vote for this feature on the "
|
||||
"issue tracker: http://github.com/explosion/spaCy/issues")
|
||||
raise NotImplementedError(TempErrors.T007.format(attr='ent_id_'))
|
||||
|
||||
property orth_:
|
||||
"""Verbatim text content (identical to Span.text). Exists mostly for
|
||||
|
@ -612,9 +603,5 @@ cdef int _count_words_to_root(const TokenC* token, int sent_length) except -1:
|
|||
token += token.head
|
||||
n += 1
|
||||
if n >= sent_length:
|
||||
raise RuntimeError(
|
||||
"Array bounds exceeded while searching for root word. This "
|
||||
"likely means the parse tree is in an invalid state. Please "
|
||||
"report this issue here: "
|
||||
"http://github.com/explosion/spaCy/issues")
|
||||
raise RuntimeError(Errors.E039)
|
||||
return n
|
||||
|
|
|
@@ -6,6 +6,7 @@ from ..typedefs cimport attr_t, flags_t
from ..parts_of_speech cimport univ_pos_t
from .doc cimport Doc
from ..lexeme cimport Lexeme
from ..errors import Errors


cdef class Token:

@@ -17,8 +18,7 @@ cdef class Token:
@staticmethod
cdef inline Token cinit(Vocab vocab, const TokenC* token, int offset, Doc doc):
if offset < 0 or offset >= doc.length:
msg = "Attempt to access token at %d, max length %d"
raise IndexError(msg % (offset, doc.length))
raise IndexError(Errors.E040.format(i=offset, max_length=doc.length))
cdef Token self = Token.__new__(Token, vocab, doc, offset)
return self

@ -19,26 +19,33 @@ from ..attrs cimport IS_OOV, IS_TITLE, IS_UPPER, IS_CURRENCY, LIKE_URL, LIKE_NUM
|
|||
from ..attrs cimport IS_STOP, ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX
|
||||
from ..attrs cimport LENGTH, CLUSTER, LEMMA, POS, TAG, DEP
|
||||
from ..compat import is_config
|
||||
from ..errors import Errors
|
||||
from .. import util
|
||||
from .. import about
|
||||
from .underscore import Underscore
|
||||
from .underscore import Underscore, get_ext_args
|
||||
|
||||
|
||||
cdef class Token:
|
||||
"""An individual token – i.e. a word, punctuation symbol, whitespace,
|
||||
etc."""
|
||||
@classmethod
|
||||
def set_extension(cls, name, default=None, method=None,
|
||||
getter=None, setter=None):
|
||||
Underscore.token_extensions[name] = (default, method, getter, setter)
|
||||
def set_extension(cls, name, **kwargs):
|
||||
if cls.has_extension(name) and not kwargs.get('force', False):
|
||||
raise ValueError(Errors.E090.format(name=name, obj='Token'))
|
||||
Underscore.token_extensions[name] = get_ext_args(**kwargs)
|
||||
|
||||
@classmethod
|
||||
def get_extension(cls, name):
|
||||
return Underscore.span_extensions.get(name)
|
||||
return Underscore.token_extensions.get(name)
|
||||
|
||||
@classmethod
|
||||
def has_extension(cls, name):
|
||||
return name in Underscore.span_extensions
|
||||
return name in Underscore.token_extensions
|
||||
|
||||
@classmethod
|
||||
def remove_extension(cls, name):
|
||||
if not cls.has_extension(name):
|
||||
raise ValueError(Errors.E046.format(name=name))
|
||||
return Underscore.token_extensions.pop(name)
|
||||
|
||||
def __cinit__(self, Vocab vocab, Doc doc, int offset):
|
||||
"""Construct a `Token` object.
|
||||
|
@ -106,7 +113,7 @@ cdef class Token:
|
|||
elif op == 5:
|
||||
return my >= their
|
||||
else:
|
||||
raise ValueError(op)
|
||||
raise ValueError(Errors.E041.format(op=op))
|
||||
|
||||
@property
|
||||
def _(self):
|
||||
|
@ -135,8 +142,7 @@ cdef class Token:
|
|||
RETURNS (Token): The token at position `self.doc[self.i+i]`.
|
||||
"""
|
||||
if self.i+i < 0 or (self.i+i >= len(self.doc)):
|
||||
msg = "Error accessing doc[%d].nbor(%d), for doc of length %d"
|
||||
raise IndexError(msg % (self.i, i, len(self.doc)))
|
||||
raise IndexError(Errors.E042.format(i=self.i, j=i, length=len(self.doc)))
|
||||
return self.doc[self.i+i]
|
||||
|
||||
def similarity(self, other):
|
||||
|
@ -354,14 +360,7 @@ cdef class Token:
|
|||
|
||||
property sent_start:
|
||||
def __get__(self):
|
||||
# Raising a deprecation warning causes errors for autocomplete
|
||||
#util.deprecated(
|
||||
# "Token.sent_start is now deprecated. Use Token.is_sent_start "
|
||||
# "instead, which returns a boolean value or None if the answer "
|
||||
# "is unknown – instead of a misleading 0 for False and 1 for "
|
||||
# "True. It also fixes a quirk in the old logic that would "
|
||||
# "always set the property to 0 for the first word of the "
|
||||
# "document.")
|
||||
# Raising a deprecation warning here causes errors for autocomplete
|
||||
# Handle broken backwards compatibility case: doc[0].sent_start
|
||||
# was False.
|
||||
if self.i == 0:
|
||||
|
@ -386,9 +385,7 @@ cdef class Token:
|
|||
|
||||
def __set__(self, value):
|
||||
if self.doc.is_parsed:
|
||||
raise ValueError(
|
||||
"Refusing to write to token.sent_start if its document "
|
||||
"is parsed, because this may cause inconsistent state.")
|
||||
raise ValueError(Errors.E043)
|
||||
if value is None:
|
||||
self.c.sent_start = 0
|
||||
elif value is True:
|
||||
|
@ -396,8 +393,7 @@ cdef class Token:
|
|||
elif value is False:
|
||||
self.c.sent_start = -1
|
||||
else:
|
||||
raise ValueError("Invalid value for token.sent_start. Must be "
|
||||
"one of: None, True, False")
|
||||
raise ValueError(Errors.E044.format(value=value))
|
||||
|
||||
property lefts:
|
||||
"""The leftward immediate children of the word, in the syntactic
|
||||
|
@ -415,8 +411,7 @@ cdef class Token:
|
|||
nr_iter += 1
|
||||
# This is ugly, but it's a way to guard out infinite loops
|
||||
if nr_iter >= 10000000:
|
||||
raise RuntimeError("Possibly infinite loop encountered "
|
||||
"while looking for token.lefts")
|
||||
raise RuntimeError(Errors.E045.format(attr='token.lefts'))
|
||||
|
||||
property rights:
|
||||
"""The rightward immediate children of the word, in the syntactic
|
||||
|
@ -434,8 +429,7 @@ cdef class Token:
|
|||
ptr -= 1
|
||||
nr_iter += 1
|
||||
if nr_iter >= 10000000:
|
||||
raise RuntimeError("Possibly infinite loop encountered "
|
||||
"while looking for token.rights")
|
||||
raise RuntimeError(Errors.E045.format(attr='token.rights'))
|
||||
tokens.reverse()
|
||||
for t in tokens:
|
||||
yield t
|
||||
|
|
|
@@ -3,6 +3,8 @@ from __future__ import unicode_literals

import functools

from ..errors import Errors


class Underscore(object):
    doc_extensions = {}

@@ -23,7 +25,7 @@ class Underscore(object):

def __getattr__(self, name):
if name not in self._extensions:
raise AttributeError(name)
raise AttributeError(Errors.E046.format(name=name))
default, method, getter, setter = self._extensions[name]
if getter is not None:
return getter(self._obj)

@@ -34,7 +36,7 @@ class Underscore(object):

def __setattr__(self, name, value):
if name not in self._extensions:
raise AttributeError(name)
raise AttributeError(Errors.E047.format(name=name))
default, method, getter, setter = self._extensions[name]
if setter is not None:
return setter(self._obj, value)

@@ -52,3 +54,24 @@ class Underscore(object):

def _get_key(self, name):
    return ('._.', name, self._start, self._end)


def get_ext_args(**kwargs):
    """Validate and convert arguments. Reused in Doc, Token and Span."""
    default = kwargs.get('default')
    getter = kwargs.get('getter')
    setter = kwargs.get('setter')
    method = kwargs.get('method')
    if getter is None and setter is not None:
        raise ValueError(Errors.E089)
    valid_opts = ('default' in kwargs, method is not None, getter is not None)
    nr_defined = sum(t is True for t in valid_opts)
    if nr_defined != 1:
        raise ValueError(Errors.E083.format(nr_defined=nr_defined))
    if setter is not None and not hasattr(setter, '__call__'):
        raise ValueError(Errors.E091.format(name='setter', value=repr(setter)))
    if getter is not None and not hasattr(getter, '__call__'):
        raise ValueError(Errors.E091.format(name='getter', value=repr(getter)))
    if method is not None and not hasattr(method, '__call__'):
        raise ValueError(Errors.E091.format(name='method', value=repr(method)))
    return (default, method, getter, setter)

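With get_ext_args in place, Doc, Span and Token share the same validation for custom extension attributes: exactly one of default, method or getter must be given, a setter without a getter is rejected, re-registering an existing name requires force=True, and remove_extension unregisters it. The snippet below is a small usage sketch of that behaviour, not part of the commit; the attribute names are arbitrary examples.

# Illustrative example, not part of the commit.
from spacy.vocab import Vocab
from spacy.tokens import Doc

Doc.set_extension('is_greeting', default=False)
Doc.set_extension('is_greeting', default=True, force=True)   # overwrite must be explicit

try:
    Doc.set_extension('broken', setter=lambda doc, value: None)  # setter without getter
except ValueError as err:
    print(err)

doc = Doc(Vocab(), words=['hello', 'world'])
doc._.is_greeting = True
assert doc._.is_greeting

Doc.remove_extension('is_greeting')
assert not Doc.has_extension('is_greeting')
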
@ -11,8 +11,6 @@ import sys
|
|||
import textwrap
|
||||
import random
|
||||
from collections import OrderedDict
|
||||
import inspect
|
||||
import warnings
|
||||
from thinc.neural._classes.model import Model
|
||||
from thinc.neural.ops import NumpyOps
|
||||
import functools
|
||||
|
@ -23,10 +21,12 @@ import numpy.random
|
|||
from .symbols import ORTH
|
||||
from .compat import cupy, CudaStream, path2str, basestring_, input_, unicode_
|
||||
from .compat import import_file
|
||||
from .errors import Errors
|
||||
|
||||
import msgpack
|
||||
import msgpack_numpy
|
||||
msgpack_numpy.patch()
|
||||
# Import these directly from Thinc, so that we're sure we always have the
|
||||
# same version.
|
||||
from thinc.neural._classes.model import msgpack
|
||||
from thinc.neural._classes.model import msgpack_numpy
|
||||
|
||||
|
||||
LANGUAGES = {}
|
||||
|
@ -50,8 +50,7 @@ def get_lang_class(lang):
|
|||
try:
|
||||
module = importlib.import_module('.lang.%s' % lang, 'spacy')
|
||||
except ImportError:
|
||||
msg = "Can't import language %s from spacy.lang."
|
||||
raise ImportError(msg % lang)
|
||||
raise ImportError(Errors.E048.format(lang=lang))
|
||||
LANGUAGES[lang] = getattr(module, module.__all__[0])
|
||||
return LANGUAGES[lang]
|
||||
|
||||
|
@ -108,7 +107,7 @@ def load_model(name, **overrides):
|
|||
"""
|
||||
data_path = get_data_path()
|
||||
if not data_path or not data_path.exists():
|
||||
raise IOError("Can't find spaCy data path: %s" % path2str(data_path))
|
||||
raise IOError(Errors.E049.format(path=path2str(data_path)))
|
||||
if isinstance(name, basestring_): # in data dir / shortcut
|
||||
if name in set([d.name for d in data_path.iterdir()]):
|
||||
return load_model_from_link(name, **overrides)
|
||||
|
@ -118,7 +117,7 @@ def load_model(name, **overrides):
|
|||
return load_model_from_path(Path(name), **overrides)
|
||||
elif hasattr(name, 'exists'): # Path or Path-like to model data
|
||||
return load_model_from_path(name, **overrides)
|
||||
raise IOError("Can't find model '%s'" % name)
|
||||
raise IOError(Errors.E050.format(name=name))
|
||||
|
||||
|
||||
def load_model_from_link(name, **overrides):
|
||||
|
@ -127,9 +126,7 @@ def load_model_from_link(name, **overrides):
|
|||
try:
|
||||
cls = import_file(name, path)
|
||||
except AttributeError:
|
||||
raise IOError(
|
||||
"Cant' load '%s'. If you're using a shortcut link, make sure it "
|
||||
"points to a valid package (not just a data directory)." % name)
|
||||
raise IOError(Errors.E051.format(name=name))
|
||||
return cls.load(**overrides)
|
||||
|
||||
|
||||
|
@ -173,8 +170,7 @@ def load_model_from_init_py(init_file, **overrides):
|
|||
data_dir = '%s_%s-%s' % (meta['lang'], meta['name'], meta['version'])
|
||||
data_path = model_path / data_dir
|
||||
if not model_path.exists():
|
||||
msg = "Can't find model directory: %s"
|
||||
raise ValueError(msg % path2str(data_path))
|
||||
raise IOError(Errors.E052.format(path=path2str(data_path)))
|
||||
return load_model_from_path(data_path, meta, **overrides)
|
||||
|
||||
|
||||
|
@ -186,16 +182,14 @@ def get_model_meta(path):
|
|||
"""
|
||||
model_path = ensure_path(path)
|
||||
if not model_path.exists():
|
||||
msg = "Can't find model directory: %s"
|
||||
raise ValueError(msg % path2str(model_path))
|
||||
raise IOError(Errors.E052.format(path=path2str(model_path)))
|
||||
meta_path = model_path / 'meta.json'
|
||||
if not meta_path.is_file():
|
||||
raise IOError("Could not read meta.json from %s" % meta_path)
|
||||
raise IOError(Errors.E053.format(path=meta_path))
|
||||
meta = read_json(meta_path)
|
||||
for setting in ['lang', 'name', 'version']:
|
||||
if setting not in meta or not meta[setting]:
|
||||
msg = "No valid '%s' setting found in model meta.json"
|
||||
raise ValueError(msg % setting)
|
||||
raise ValueError(Errors.E054.format(setting=setting))
|
||||
return meta
|
||||
|
||||
|
||||
|
@ -344,13 +338,10 @@ def update_exc(base_exceptions, *addition_dicts):
|
|||
for orth, token_attrs in additions.items():
|
||||
if not all(isinstance(attr[ORTH], unicode_)
|
||||
for attr in token_attrs):
|
||||
msg = "Invalid ORTH value in exception: key='%s', orths='%s'"
|
||||
raise ValueError(msg % (orth, token_attrs))
|
||||
raise ValueError(Errors.E055.format(key=orth, orths=token_attrs))
|
||||
described_orth = ''.join(attr[ORTH] for attr in token_attrs)
|
||||
if orth != described_orth:
|
||||
msg = ("Invalid tokenizer exception: ORTH values combined "
|
||||
"don't match original string. key='%s', orths='%s'")
|
||||
raise ValueError(msg % (orth, described_orth))
|
||||
raise ValueError(Errors.E056.format(key=orth, orths=described_orth))
|
||||
exc.update(additions)
|
||||
exc = expand_exc(exc, "'", "’")
|
||||
return exc
|
||||
|
@ -380,8 +371,7 @@ def expand_exc(excs, search, replace):
|
|||
|
||||
def normalize_slice(length, start, stop, step=None):
|
||||
if not (step is None or step == 1):
|
||||
raise ValueError("Stepped slices not supported in Span objects."
|
||||
"Try: list(tokens)[start:stop:step] instead.")
|
||||
raise ValueError(Errors.E057)
|
||||
if start is None:
|
||||
start = 0
|
||||
elif start < 0:
|
||||
|
@ -392,7 +382,6 @@ def normalize_slice(length, start, stop, step=None):
|
|||
elif stop < 0:
|
||||
stop += length
|
||||
stop = min(length, max(start, stop))
|
||||
assert 0 <= start <= stop <= length
|
||||
return start, stop
|
||||
|
||||
|
||||
|
@ -552,18 +541,6 @@ def from_disk(path, readers, exclude):
|
|||
return path
|
||||
|
||||
|
||||
def deprecated(message, filter='always'):
|
||||
"""Show a deprecation warning.
|
||||
|
||||
message (unicode): The message to display.
|
||||
filter (unicode): Filter value.
|
||||
"""
|
||||
stack = inspect.stack()[-1]
|
||||
with warnings.catch_warnings():
|
||||
warnings.simplefilter(filter, DeprecationWarning)
|
||||
warnings.warn_explicit(message, DeprecationWarning, stack[1], stack[2])
|
||||
|
||||
|
||||
def print_table(data, title=None):
|
||||
"""Print data in table format.
|
||||
|
||||
|
|
|
@ -1,24 +1,43 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import functools
|
||||
import numpy
|
||||
from collections import OrderedDict
|
||||
import msgpack
|
||||
import msgpack_numpy
|
||||
msgpack_numpy.patch()
|
||||
|
||||
from .util import msgpack
|
||||
from .util import msgpack_numpy
|
||||
|
||||
cimport numpy as np
|
||||
from thinc.neural.util import get_array_module
|
||||
from thinc.neural._classes.model import Model
|
||||
|
||||
from .strings cimport StringStore, hash_string
|
||||
from .compat import basestring_, path2str
|
||||
from .errors import Errors
|
||||
from . import util
|
||||
|
||||
from cython.operator cimport dereference as deref
|
||||
from libcpp.set cimport set as cppset
|
||||
|
||||
def unpickle_vectors(bytes_data):
|
||||
return Vectors().from_bytes(bytes_data)
|
||||
|
||||
|
||||
class GlobalRegistry(object):
|
||||
'''Global store of vectors, to avoid repeatedly loading the data.'''
|
||||
data = {}
|
||||
|
||||
@classmethod
|
||||
def register(cls, name, data):
|
||||
cls.data[name] = data
|
||||
return functools.partial(cls.get, name)
|
||||
|
||||
@classmethod
|
||||
def get(cls, name):
|
||||
return cls.data[name]
|
||||
|
||||
|
||||
cdef class Vectors:
|
||||
"""Store, save and load word vectors.
|
||||
|
||||
|
@ -31,18 +50,21 @@ cdef class Vectors:
the table need to be assigned --- so len(list(vectors.keys())) may be
greater or smaller than vectors.shape[0].
"""
cdef public object name
cdef public object data
cdef public object key2row
cdef public object _unset
cdef cppset[int] _unset

def __init__(self, *, shape=None, data=None, keys=None):
def __init__(self, *, shape=None, data=None, keys=None, name=None):
"""Create a new vector store.

shape (tuple): Size of the table, as (# entries, # columns)
data (numpy.ndarray): The vector data.
keys (iterable): A sequence of keys, aligned with the data.
name (string): A name to identify the vectors table.
RETURNS (Vectors): The newly created object.
"""
self.name = name
if data is None:
if shape is None:
shape = (0,0)

@ -50,9 +72,9 @@ cdef class Vectors:
self.data = data
self.key2row = OrderedDict()
if self.data is not None:
self._unset = set(range(self.data.shape[0]))
self._unset = cppset[int]({i for i in range(self.data.shape[0])})
else:
self._unset = set()
self._unset = cppset[int]()
if keys is not None:
for i, key in enumerate(keys):
self.add(key, row=i)
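The new name argument and the C-level cppset bookkeeping leave the public API essentially unchanged; constructing and filling a table still looks roughly like the sketch below, which assumes the spaCy v2.x Vectors API and uses made-up data and names.

import numpy
from spacy.vectors import Vectors

# Sketch: an empty 3x4 table with an identifying name (the new argument).
vectors = Vectors(shape=(3, 4), name='en_demo_vectors')
vectors.add('cat', vector=numpy.asarray([0.1, 0.2, 0.3, 0.4], dtype='f'))
assert not vectors.is_full   # two of the three rows are still unset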
@ -74,7 +96,7 @@ cdef class Vectors:
|
|||
@property
|
||||
def is_full(self):
|
||||
"""RETURNS (bool): `True` if no slots are available for new keys."""
|
||||
return len(self._unset) == 0
|
||||
return self._unset.size() == 0
|
||||
|
||||
@property
|
||||
def n_keys(self):
|
||||
|
@ -93,7 +115,7 @@ cdef class Vectors:
|
|||
"""
|
||||
i = self.key2row[key]
|
||||
if i is None:
|
||||
raise KeyError(key)
|
||||
raise KeyError(Errors.E058.format(key=key))
|
||||
else:
|
||||
return self.data[i]
|
||||
|
||||
|
@ -105,8 +127,8 @@ cdef class Vectors:
|
|||
"""
|
||||
i = self.key2row[key]
|
||||
self.data[i] = vector
|
||||
if i in self._unset:
|
||||
self._unset.remove(i)
|
||||
if self._unset.count(i):
|
||||
self._unset.erase(self._unset.find(i))
|
||||
|
||||
def __iter__(self):
|
||||
"""Iterate over the keys in the table.
|
||||
|
@ -145,7 +167,7 @@ cdef class Vectors:
|
|||
xp = get_array_module(self.data)
|
||||
self.data = xp.resize(self.data, shape)
|
||||
filled = {row for row in self.key2row.values()}
|
||||
self._unset = {row for row in range(shape[0]) if row not in filled}
|
||||
self._unset = cppset[int]({row for row in range(shape[0]) if row not in filled})
|
||||
removed_items = []
|
||||
for key, row in list(self.key2row.items()):
|
||||
if row >= shape[0]:
|
||||
|
@ -169,7 +191,7 @@ cdef class Vectors:
|
|||
YIELDS (ndarray): A vector in the table.
|
||||
"""
|
||||
for row, vector in enumerate(range(self.data.shape[0])):
|
||||
if row not in self._unset:
|
||||
if not self._unset.count(row):
|
||||
yield vector
|
||||
|
||||
def items(self):
|
||||
|
@ -194,7 +216,8 @@ cdef class Vectors:
|
|||
RETURNS: The requested key, keys, row or rows.
|
||||
"""
|
||||
if sum(arg is None for arg in (key, keys, row, rows)) != 3:
|
||||
raise ValueError("One (and only one) keyword arg must be set.")
|
||||
bad_kwargs = {'key': key, 'keys': keys, 'row': row, 'rows': rows}
|
||||
raise ValueError(Errors.E059.format(kwargs=bad_kwargs))
|
||||
xp = get_array_module(self.data)
|
||||
if key is not None:
|
||||
if isinstance(key, basestring_):
|
||||
|
@ -233,14 +256,14 @@ cdef class Vectors:
|
|||
row = self.key2row[key]
|
||||
elif row is None:
|
||||
if self.is_full:
|
||||
raise ValueError("Cannot add new key to vectors -- full")
|
||||
row = min(self._unset)
|
||||
|
||||
raise ValueError(Errors.E060.format(rows=self.data.shape[0],
|
||||
cols=self.data.shape[1]))
|
||||
row = deref(self._unset.begin())
|
||||
self.key2row[key] = row
|
||||
if vector is not None:
|
||||
self.data[row] = vector
|
||||
if row in self._unset:
|
||||
self._unset.remove(row)
|
||||
if self._unset.count(row):
|
||||
self._unset.erase(self._unset.find(row))
|
||||
return row
|
||||
|
||||
def most_similar(self, queries, *, batch_size=1024):
|
||||
|
@ -297,7 +320,7 @@ cdef class Vectors:
|
|||
width = int(dims)
|
||||
break
|
||||
else:
|
||||
raise IOError("Expected file named e.g. vectors.128.f.bin")
|
||||
raise IOError(Errors.E061.format(filename=path))
|
||||
bin_loc = path / 'vectors.{dims}.{dtype}.bin'.format(dims=dims,
|
||||
dtype=dtype)
|
||||
xp = get_array_module(self.data)
|
||||
|
@ -346,8 +369,8 @@ cdef class Vectors:
|
|||
with path.open('rb') as file_:
|
||||
self.key2row = msgpack.load(file_)
|
||||
for key, row in self.key2row.items():
|
||||
if row in self._unset:
|
||||
self._unset.remove(row)
|
||||
if self._unset.count(row):
|
||||
self._unset.erase(self._unset.find(row))
|
||||
|
||||
def load_keys(path):
|
||||
if path.exists():
|
||||
|
|
|
@ -16,6 +16,7 @@ from .attrs cimport PROB, LANG, ORTH, TAG
|
|||
from .structs cimport SerializedLexemeC
|
||||
|
||||
from .compat import copy_reg, basestring_
|
||||
from .errors import Errors
|
||||
from .lemmatizer import Lemmatizer
|
||||
from .attrs import intify_attrs
|
||||
from .vectors import Vectors
|
||||
|
@ -100,15 +101,9 @@ cdef class Vocab:
flag_id = bit
break
else:
raise ValueError(
"Cannot find empty bit for new lexical flag. All bits "
"between 0 and 63 are occupied. You can replace one by "
"specifying the flag_id explicitly, e.g. "
"`nlp.vocab.add_flag(your_func, flag_id=IS_ALPHA`.")
raise ValueError(Errors.E062)
elif flag_id >= 64 or flag_id < 1:
raise ValueError(
"Invalid value for flag_id: %d. Flag IDs must be between "
"1 and 63 (inclusive)" % flag_id)
raise ValueError(Errors.E063.format(value=flag_id))
for lex in self:
lex.set_flag(flag_id, flag_getter(lex.orth_))
self.lex_attr_getters[flag_id] = flag_getter
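The constraint the old messages described still holds: flag IDs must fall between 1 and 63, and add_flag picks a free bit when none is given. A usage sketch, assuming an nlp object with a loaded vocab:

# Sketch: register a custom boolean lexical flag and read it back off a token.
fruits = set(['apple', 'pear', 'banana'])
IS_FRUIT = nlp.vocab.add_flag(lambda text: text.lower() in fruits)
doc = nlp(u'I have an apple')
assert doc[3].check_flag(IS_FRUIT)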
@ -127,8 +122,9 @@ cdef class Vocab:
|
|||
cdef size_t addr
|
||||
if lex != NULL:
|
||||
if lex.orth != self.strings[string]:
|
||||
raise LookupError.mismatched_strings(
|
||||
lex.orth, self.strings[string], string)
|
||||
raise KeyError(Errors.E064.format(string=lex.orth,
|
||||
orth=self.strings[string],
|
||||
orth_id=string))
|
||||
return lex
|
||||
else:
|
||||
return self._new_lexeme(mem, string)
|
||||
|
@ -171,7 +167,8 @@ cdef class Vocab:
|
|||
if not is_oov:
|
||||
key = hash_string(string)
|
||||
self._add_lex_to_vocab(key, lex)
|
||||
assert lex != NULL, string
|
||||
if lex == NULL:
|
||||
raise ValueError(Errors.E085.format(string=string))
|
||||
return lex
|
||||
|
||||
cdef int _add_lex_to_vocab(self, hash_t key, const LexemeC* lex) except -1:
|
||||
|
@ -254,7 +251,7 @@ cdef class Vocab:
|
|||
width, you have to call this to change the size of the vectors.
|
||||
"""
|
||||
if width is not None and shape is not None:
|
||||
raise ValueError("Only one of width and shape can be specified")
|
||||
raise ValueError(Errors.E065.format(width=width, shape=shape))
|
||||
elif shape is not None:
|
||||
self.vectors = Vectors(shape=shape)
|
||||
else:
|
||||
|
@ -381,7 +378,8 @@ cdef class Vocab:
self.lexemes_from_bytes(file_.read())
if self.vectors is not None:
self.vectors.from_disk(path, exclude='strings.json')
link_vectors_to_models(self)
if self.vectors.name is not None:
link_vectors_to_models(self)
return self

def to_bytes(self, **exclude):

@ -421,6 +419,8 @@ cdef class Vocab:
('vectors', lambda b: serialize_vectors(b))
))
util.from_bytes(bytes_data, setters, exclude)
if self.vectors.name is not None:
link_vectors_to_models(self)
return self

def lexemes_to_bytes(self):
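Both deserialisation paths now re-link vectors to the models only when the table has a name. A small round-trip sketch, assuming an nlp pipeline whose vocab carries named vectors:

# Sketch: serialise the vocab and restore it into a fresh Vocab instance.
from spacy.vocab import Vocab

vocab_bytes = nlp.vocab.to_bytes()
restored = Vocab().from_bytes(vocab_bytes)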
@ -468,7 +468,10 @@ cdef class Vocab:
|
|||
if ptr == NULL:
|
||||
continue
|
||||
py_str = self.strings[lexeme.orth]
|
||||
assert self.strings[py_str] == lexeme.orth, (py_str, lexeme.orth)
|
||||
if self.strings[py_str] != lexeme.orth:
|
||||
raise ValueError(Errors.E086.format(string=py_str,
|
||||
orth_id=lexeme.orth,
|
||||
hash_id=self.strings[py_str]))
|
||||
key = hash_string(py_str)
|
||||
self._by_hash.set(key, lexeme)
|
||||
self._by_orth.set(lexeme.orth, lexeme)
|
||||
|
@ -509,16 +512,3 @@ def unpickle_vocab(sstore, vectors, morphology, data_dir,
|
|||
|
||||
|
||||
copy_reg.pickle(Vocab, pickle_vocab, unpickle_vocab)
|
||||
|
||||
|
||||
class LookupError(Exception):
|
||||
@classmethod
|
||||
def mismatched_strings(cls, id_, id_string, original_string):
|
||||
return cls(
|
||||
"Error fetching a Lexeme from the Vocab. When looking up a "
|
||||
"string, the lexeme returned had an orth ID that did not match "
|
||||
"the query string. This means that the cached lexeme structs are "
|
||||
"mismatched to the string encoding table. The mismatched:\n"
|
||||
"Query string: {}\n"
|
||||
"Orth cached: {}\n"
|
||||
"Orth ID: {}".format(repr(original_string), repr(id_string), id_))
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
{
|
||||
"globals": {
|
||||
"title": "spaCy",
|
||||
"description": "spaCy is a free open-source library featuring state-of-the-art speed and accuracy and a powerful Python API.",
|
||||
"description": "spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.",
|
||||
|
||||
"SITENAME": "spaCy",
|
||||
"SLOGAN": "Industrial-strength Natural Language Processing in Python",
|
||||
|
@ -10,10 +10,13 @@
|
|||
|
||||
"COMPANY": "Explosion AI",
|
||||
"COMPANY_URL": "https://explosion.ai",
|
||||
"DEMOS_URL": "https://demos.explosion.ai",
|
||||
"DEMOS_URL": "https://explosion.ai/demos",
|
||||
"MODELS_REPO": "explosion/spacy-models",
|
||||
"KERNEL_BINDER": "ines/spacy-binder",
|
||||
"KERNEL_PYTHON": "python3",
|
||||
|
||||
"SPACY_VERSION": "2.0",
|
||||
"BINDER_VERSION": "2.0.11",
|
||||
|
||||
"SOCIAL": {
|
||||
"twitter": "spacy_io",
|
||||
|
@ -26,7 +29,8 @@
|
|||
"NAVIGATION": {
|
||||
"Usage": "/usage",
|
||||
"Models": "/models",
|
||||
"API": "/api"
|
||||
"API": "/api",
|
||||
"Universe": "/universe"
|
||||
},
|
||||
|
||||
"FOOTER": {
|
||||
|
@ -34,7 +38,7 @@
|
|||
"Usage": "/usage",
|
||||
"Models": "/models",
|
||||
"API Reference": "/api",
|
||||
"Resources": "/usage/resources"
|
||||
"Universe": "/universe"
|
||||
},
|
||||
"Support": {
|
||||
"Issue Tracker": "https://github.com/explosion/spaCy/issues",
|
||||
|
@ -82,8 +86,8 @@
|
|||
}
|
||||
],
|
||||
|
||||
"V_CSS": "2.0.1",
|
||||
"V_JS": "2.0.1",
|
||||
"V_CSS": "2.1.2",
|
||||
"V_JS": "2.1.0",
|
||||
"DEFAULT_SYNTAX": "python",
|
||||
"ANALYTICS": "UA-58931649-1",
|
||||
"MAILCHIMP": {
|
||||
|
|
|
@ -15,12 +15,39 @@
|
|||
- MODEL_META = public.models._data.MODEL_META
|
||||
- MODEL_LICENSES = public.models._data.MODEL_LICENSES
|
||||
- MODEL_BENCHMARKS = public.models._data.MODEL_BENCHMARKS
|
||||
- EXAMPLE_SENT_LANGS = public.models._data.EXAMPLE_SENT_LANGS
|
||||
- EXAMPLE_SENTENCES = public.models._data.EXAMPLE_SENTENCES
|
||||
|
||||
- IS_PAGE = (SECTION != "index") && !landing
|
||||
- IS_MODELS = (SECTION == "models" && LANGUAGES[current.source])
|
||||
- HAS_MODELS = IS_MODELS && CURRENT_MODELS.length
|
||||
|
||||
//- Get page URL
|
||||
|
||||
- function getPageUrl() {
|
||||
- var path = current.path;
|
||||
- if(path[path.length - 1] == 'index') path = path.slice(0, path.length - 1);
|
||||
- return `${SITE_URL}/${path.join('/')}`;
|
||||
- }
|
||||
|
||||
//- Get pretty page title depending on section
|
||||
|
||||
- function getPageTitle() {
|
||||
- var sections = ['api', 'usage', 'models'];
|
||||
- if (sections.includes(SECTION)) {
|
||||
- var titleSection = (SECTION == "api") ? 'API' : SECTION.charAt(0).toUpperCase() + SECTION.slice(1);
|
||||
- return `${title} · ${SITENAME} ${titleSection} Documentation`;
|
||||
- }
|
||||
- else if (SECTION != 'index') return `${title} · ${SITENAME}`;
|
||||
- return `${SITENAME} · ${SLOGAN}`;
|
||||
- }
|
||||
|
||||
//- Get social image based on section and settings
|
||||
|
||||
- function getPageImage() {
|
||||
- var img = (SECTION == 'api') ? 'api' : 'default';
|
||||
- return `${SITE_URL}/assets/img/social/preview_${preview || img}.jpg`;
|
||||
- }
|
||||
|
||||
//- Add prefixes to items of an array (for modifier CSS classes)
|
||||
array - [array] list of class names or options, e.g. ["foot"]
|
||||
|
|
|
@ -7,7 +7,7 @@ include _functions
|
|||
id - [string] anchor assigned to section (used for breadcrumb navigation)
|
||||
|
||||
mixin section(id)
|
||||
section.o-section(id="section-" + id data-section=id)
|
||||
section.o-section(id=id ? "section-" + id : null data-section=id)&attributes(attributes)
|
||||
block
|
||||
|
||||
|
||||
|
@ -143,7 +143,7 @@ mixin aside-wrapper(label, emoji)
|
|||
|
||||
mixin aside(label, emoji)
|
||||
+aside-wrapper(label, emoji)
|
||||
.c-aside__text.u-text-small
|
||||
.c-aside__text.u-text-small&attributes(attributes)
|
||||
block
|
||||
|
||||
|
||||
|
@ -154,7 +154,7 @@ mixin aside(label, emoji)
|
|||
prompt - [string] prompt displayed before first line, e.g. "$"
|
||||
|
||||
mixin aside-code(label, language, prompt)
|
||||
+aside-wrapper(label)
|
||||
+aside-wrapper(label)&attributes(attributes)
|
||||
+code(false, language, prompt).o-no-block
|
||||
block
|
||||
|
||||
|
@ -165,7 +165,7 @@ mixin aside-code(label, language, prompt)
|
|||
argument to be able to wrap it for spacing
|
||||
|
||||
mixin infobox(label, emoji)
|
||||
aside.o-box.o-block.u-text-small
|
||||
aside.o-box.o-block.u-text-small&attributes(attributes)
|
||||
if label
|
||||
h3.u-heading.u-text-label.u-color-theme
|
||||
if emoji
|
||||
|
@ -242,7 +242,9 @@ mixin button(url, trusted, ...style)
|
|||
wrap - [boolean] wrap text and disable horizontal scrolling
|
||||
|
||||
mixin code(label, language, prompt, height, icon, wrap)
|
||||
pre.c-code-block.o-block(class="lang-#{(language || DEFAULT_SYNTAX)}" class=icon ? "c-code-block--has-icon" : null style=height ? "height: #{height}px" : null)&attributes(attributes)
|
||||
- var lang = (language != "none") ? (language || DEFAULT_SYNTAX) : null
|
||||
- var lang_class = (language != "none") ? "lang-" + (language || DEFAULT_SYNTAX) : null
|
||||
pre.c-code-block.o-block(data-language=lang class=lang_class class=icon ? "c-code-block--has-icon" : null style=height ? "height: #{height}px" : null)&attributes(attributes)
|
||||
if label
|
||||
h4.u-text-label.u-text-label--dark=label
|
||||
if icon
|
||||
|
@ -253,6 +255,15 @@ mixin code(label, language, prompt, height, icon, wrap)
|
|||
code.c-code-block__content(class=wrap ? "u-wrap" : null data-prompt=prompt)
|
||||
block
|
||||
|
||||
//- Executable code
|
||||
|
||||
mixin code-exec(label, large)
|
||||
- label = (label || "Editable code example") + " (experimental)"
|
||||
+terminal-wrapper(label, !large)
|
||||
figure.thebelab-wrapper
|
||||
span.thebelab-wrapper__text.u-text-tiny v#{BINDER_VERSION} · Python 3 · via #[+a("https://mybinder.org/").u-hide-link Binder]
|
||||
+code(data-executable="true")&attributes(attributes)
|
||||
block
|
||||
|
||||
//- Wrapper for code blocks to display old/new versions
|
||||
|
||||
|
@ -658,12 +669,16 @@ mixin qs(data, style)
|
|||
//- Terminal-style code window
|
||||
label - [string] title displayed in top bar of terminal window
|
||||
|
||||
mixin terminal(label, button_text, button_url)
|
||||
.x-terminal
|
||||
.x-terminal__icons: span
|
||||
.u-padding-small.u-text-label.u-text-center=label
|
||||
mixin terminal-wrapper(label, small)
|
||||
.x-terminal(class=small ? "x-terminal--small" : null)
|
||||
.x-terminal__icons(class=small ? "x-terminal__icons--small" : null): span
|
||||
.u-padding-small.u-text-center(class=small ? "u-text-tiny" : "u-text")
|
||||
strong=label
|
||||
block
|
||||
|
||||
+code.x-terminal__code
|
||||
mixin terminal(label, button_text, button_url, exec)
|
||||
+terminal-wrapper(label)
|
||||
+code.x-terminal__code(data-executable=exec ? "" : null)
|
||||
block
|
||||
|
||||
if button_text && button_url
|
||||
|
|
|
@ -10,10 +10,7 @@ nav.c-nav.u-text.js-nav(class=landing ? "c-nav--theme" : null)
|
|||
li.c-nav__menu__item(class=is_active ? "is-active" : null)
|
||||
+a(url)(tabindex=is_active ? "-1" : null)=item
|
||||
|
||||
li.c-nav__menu__item.u-hidden-xs
|
||||
+a("https://survey.spacy.io", true) User Survey 2018
|
||||
|
||||
li.c-nav__menu__item.u-hidden-xs
|
||||
li.c-nav__menu__item
|
||||
+a(gh("spaCy"))(aria-label="GitHub") #[+icon("github", 20)]
|
||||
|
||||
progress.c-progress.js-progress(value="0" max="1")
|
||||
|
|
|
@ -1,77 +1,110 @@
|
|||
//- 💫 INCLUDES > MODELS PAGE TEMPLATE
|
||||
|
||||
for id in CURRENT_MODELS
|
||||
- var comps = getModelComponents(id)
|
||||
+section(id)
|
||||
+grid("vcenter").o-no-block(id=id)
|
||||
+grid-col("two-thirds")
|
||||
+h(2)
|
||||
+a("#" + id).u-permalink=id
|
||||
section(data-vue=id data-model=id)
|
||||
+grid("vcenter").o-no-block(id=id)
|
||||
+grid-col("two-thirds")
|
||||
+h(2)
|
||||
+a("#" + id).u-permalink=id
|
||||
|
||||
+grid-col("third").u-text-right
|
||||
.u-color-subtle.u-text-tiny
|
||||
+button(gh("spacy-models") + "/releases", true, "secondary", "small")(data-tpl=id data-tpl-key="download")
|
||||
| Release details
|
||||
.u-padding-small Latest: #[code(data-tpl=id data-tpl-key="version") n/a]
|
||||
+grid-col("third").u-text-right
|
||||
.u-color-subtle.u-text-tiny
|
||||
+button(gh("spacy-models") + "/releases", true, "secondary", "small")(v-bind:href="releaseUrl")
|
||||
| Release details
|
||||
.u-padding-small Latest: #[code(v-text="version") n/a]
|
||||
|
||||
+aside-code("Installation", "bash", "$").
|
||||
python -m spacy download #{id}
|
||||
+aside-code("Installation", "bash", "$").
|
||||
python -m spacy download #{id}
|
||||
|
||||
- var comps = getModelComponents(id)
|
||||
p(v-if="description" v-text="description")
|
||||
|
||||
p(data-tpl=id data-tpl-key="description")
|
||||
|
||||
div(data-tpl=id data-tpl-key="error")
|
||||
+infobox
|
||||
+infobox(v-if="error")
|
||||
| Unable to load model details from GitHub. To find out more
|
||||
| about this model, see the overview of the
|
||||
| #[+a(gh("spacy-models") + "/releases") latest model releases].
|
||||
|
||||
+table.o-block-small(data-tpl=id data-tpl-key="table")
|
||||
+row
|
||||
+cell #[+label Language]
|
||||
+cell #[+tag=comps.lang] #{LANGUAGES[comps.lang]}
|
||||
for comp, label in {"Type": comps.type, "Genre": comps.genre}
|
||||
+table.o-block-small(v-bind:data-loading="loading")
|
||||
+row
|
||||
+cell #[+label=label]
|
||||
+cell #[+tag=comp] #{MODEL_META[comp]}
|
||||
+row
|
||||
+cell #[+label Size]
|
||||
+cell #[+tag=comps.size] #[span(data-tpl=id data-tpl-key="size") #[em n/a]]
|
||||
+cell #[+label Language]
|
||||
+cell #[+tag=comps.lang] #{LANGUAGES[comps.lang]}
|
||||
for comp, label in {"Type": comps.type, "Genre": comps.genre}
|
||||
+row
|
||||
+cell #[+label=label]
|
||||
+cell #[+tag=comp] #{MODEL_META[comp]}
|
||||
+row
|
||||
+cell #[+label Size]
|
||||
+cell #[+tag=comps.size] #[span(v-text="sizeFull" v-if="sizeFull")] #[em(v-else="") n/a]
|
||||
|
||||
each label in ["Pipeline", "Vectors", "Sources", "Author", "License"]
|
||||
- var field = label.toLowerCase()
|
||||
if field == "vectors"
|
||||
- field = "vecs"
|
||||
+row
|
||||
+cell.u-nowrap
|
||||
+label=label
|
||||
if MODEL_META[field]
|
||||
| #[+help(MODEL_META[field]).u-color-subtle]
|
||||
+row(v-if="pipeline && pipeline.length" v-cloak="")
|
||||
+cell
|
||||
span(data-tpl=id data-tpl-key=field) #[em n/a]
|
||||
+label Pipeline #[+help(MODEL_META.pipeline).u-color-subtle]
|
||||
+cell
|
||||
span(v-for="(pipe, index) in pipeline" v-if="pipeline")
|
||||
code(v-text="pipe")
|
||||
span(v-if="index != pipeline.length - 1") ,
|
||||
|
||||
+row(data-tpl=id data-tpl-key="compat-wrapper" hidden="")
|
||||
+cell
|
||||
+label Compat #[+help("Latest compatible model version for your spaCy installation").u-color-subtle]
|
||||
+cell
|
||||
.o-field.u-float-left
|
||||
select.o-field__select.u-text-small(data-tpl=id data-tpl-key="compat")
|
||||
div(data-tpl=id data-tpl-key="compat-versions")
|
||||
+row(v-if="vectors" v-cloak="")
|
||||
+cell
|
||||
+label Vectors #[+help(MODEL_META.vectors).u-color-subtle]
|
||||
+cell(v-text="vectors")
|
||||
|
||||
section(data-tpl=id data-tpl-key="benchmarks" hidden="")
|
||||
+grid.o-block-small
|
||||
+row(v-if="sources && sources.length" v-cloak="")
|
||||
+cell
|
||||
+label Sources #[+help(MODEL_META.sources).u-color-subtle]
|
||||
+cell
|
||||
span(v-for="(source, index) in sources") {{ source }}
|
||||
span(v-if="index != sources.length - 1") ,
|
||||
|
||||
+row(v-if="author" v-cloak="")
|
||||
+cell #[+label Author]
|
||||
+cell
|
||||
+a("")(v-bind:href="url" v-if="url" v-text="author")
|
||||
span(v-else="" v-text="author") {{ model.author }}
|
||||
|
||||
+row(v-if="license" v-cloak="")
|
||||
+cell #[+label License]
|
||||
+cell
|
||||
+a("")(v-bind:href="modelLicenses[license]" v-if="modelLicenses[license]") {{ license }}
|
||||
span(v-else="") {{ license }}
|
||||
|
||||
+row(v-cloak="")
|
||||
+cell #[+label Compat #[+help(MODEL_META.compat).u-color-subtle]]
|
||||
+cell
|
||||
.o-field.u-float-left
|
||||
select.o-field__select.u-text-small(v-model="spacyVersion")
|
||||
option(v-for="version in orderedCompat" v-bind:value="version") spaCy v{{ version }}
|
||||
code(v-if="compatVersion" v-text="compatVersion")
|
||||
em(v-else="") not compatible
|
||||
|
||||
+grid.o-block-small(v-cloak="" v-if="hasAccuracy")
|
||||
for keys, label in MODEL_BENCHMARKS
|
||||
.u-flex-full.u-padding-small(data-tpl=id data-tpl-key=label.toLowerCase() hidden="")
|
||||
.u-flex-full.u-padding-small
|
||||
+table.o-block-small
|
||||
+row("head")
|
||||
+head-cell(colspan="2")=(MODEL_META["benchmark_" + label] || label)
|
||||
for label, field in keys
|
||||
+row(hidden="")
|
||||
+row
|
||||
+cell.u-nowrap
|
||||
+label=label
|
||||
if MODEL_META[field]
|
||||
| #[+help(MODEL_META[field]).u-color-subtle]
|
||||
+cell("num")(data-tpl=id data-tpl-key=field)
|
||||
| n/a
|
||||
+cell("num")
|
||||
span(v-if="#{field}" v-text="#{field}")
|
||||
em(v-if="!#{field}") n/a
|
||||
|
||||
p.u-text-small.u-color-dark(v-if="notes" v-text="notes" v-cloak="")
|
||||
|
||||
if comps.size == "sm" && EXAMPLE_SENT_LANGS.includes(comps.lang)
|
||||
section
|
||||
+code-exec("Test the model live").
|
||||
import spacy
|
||||
from spacy.lang.#{comps.lang}.examples import sentences
|
||||
|
||||
nlp = spacy.load('#{id}')
|
||||
doc = nlp(sentences[0])
|
||||
print(doc.text)
|
||||
for token in doc:
|
||||
print(token.text, token.pos_, token.dep_)
|
||||
|
||||
p.u-text-small.u-color-dark(data-tpl=id data-tpl-key="notes")
|
||||
|
|
|
@ -1,86 +1,33 @@
|
|||
//- 💫 INCLUDES > SCRIPTS
|
||||
|
||||
if quickstart
|
||||
script(src="/assets/js/vendor/quickstart.min.js")
|
||||
if IS_PAGE || SECTION == "index"
|
||||
script(type="text/x-thebe-config")
|
||||
| { bootstrap: true, binderOptions: { repo: "#{KERNEL_BINDER}"},
|
||||
| kernelOptions: { name: "#{KERNEL_PYTHON}" }}
|
||||
|
||||
if IS_PAGE
|
||||
script(src="/assets/js/vendor/in-view.min.js")
|
||||
- scripts = ["vendor/prism.min", "vendor/vue.min"]
|
||||
- if (SECTION == "universe") scripts.push("vendor/vue-markdown.min")
|
||||
- if (quickstart) scripts.push("vendor/quickstart.min")
|
||||
- if (IS_PAGE) scripts.push("vendor/in-view.min")
|
||||
- if (IS_PAGE || SECTION == "index") scripts.push("vendor/thebelab.custom.min")
|
||||
|
||||
for script in scripts
|
||||
script(src="/assets/js/" + script + ".js")
|
||||
script(src="/assets/js/main.js?v#{V_JS}" type=(environment == "deploy") ? null : "module")
|
||||
|
||||
if environment == "deploy"
|
||||
script(async src="https://www.google-analytics.com/analytics.js")
|
||||
|
||||
script(src="/assets/js/vendor/prism.min.js")
|
||||
|
||||
if compare_models
|
||||
script(src="/assets/js/vendor/chart.min.js")
|
||||
|
||||
script
|
||||
if quickstart
|
||||
| new Quickstart("#qs");
|
||||
|
||||
if environment == "deploy"
|
||||
script(src="https://www.google-analytics.com/analytics.js", async)
|
||||
script
|
||||
| window.ga=window.ga||function(){
|
||||
| (ga.q=ga.q||[]).push(arguments)}; ga.l=+new Date;
|
||||
| ga('create', '#{ANALYTICS}', 'auto'); ga('send', 'pageview');
|
||||
|
||||
if IS_PAGE
|
||||
if IS_PAGE
|
||||
script(src="https://sidecar.gitter.im/dist/sidecar.v1.js" async defer)
|
||||
script
|
||||
| ((window.gitter = {}).chat = {}).options = {
|
||||
| useStyles: false,
|
||||
| activationElement: '.js-gitter-button',
|
||||
| targetElement: '.js-gitter',
|
||||
| room: '!{SOCIAL.gitter}'
|
||||
| };
|
||||
|
||||
if IS_PAGE
|
||||
script(src="https://sidecar.gitter.im/dist/sidecar.v1.js" async defer)
|
||||
|
||||
|
||||
//- JS modules – slightly hacky, but necessary to dynamically instantiate the
|
||||
classes with data from the Harp JSON files, while still being able to
|
||||
support older browsers that can't handle JS modules. More details:
|
||||
https://medium.com/dev-channel/es6-modules-in-chrome-canary-m60-ba588dfb8ab7
|
||||
|
||||
- ProgressBar = "new ProgressBar('.js-progress');"
|
||||
- Accordion = "new Accordion('.js-accordion');"
|
||||
- Changelog = "new Changelog('" + SOCIAL.github + "', 'spacy');"
|
||||
- NavHighlighter = "new NavHighlighter('data-section', 'data-nav');"
|
||||
- GitHubEmbed = "new GitHubEmbed('" + SOCIAL.github + "', 'data-gh-embed');"
|
||||
- ModelLoader = "new ModelLoader('" + MODELS_REPO + "'," + JSON.stringify(CURRENT_MODELS) + "," + JSON.stringify(MODEL_LICENSES) + "," + JSON.stringify(MODEL_BENCHMARKS) + ");"
|
||||
- ModelComparer = "new ModelComparer('" + MODELS_REPO + "'," + JSON.stringify(MODEL_LICENSES) + "," + JSON.stringify(MODEL_BENCHMARKS) + "," + JSON.stringify(LANGUAGES) + "," + JSON.stringify(MODEL_META) + "," + JSON.stringify(default_models || false) + ");"
|
||||
|
||||
if environment == "deploy"
|
||||
//- DEPLOY: use compiled rollup.js and instantiate classes directly
|
||||
script(src="/assets/js/rollup.js?v#{V_JS}")
|
||||
script
|
||||
!=ProgressBar
|
||||
if changelog
|
||||
!=Changelog
|
||||
if IS_PAGE
|
||||
!=NavHighlighter
|
||||
!=GitHubEmbed
|
||||
!=Accordion
|
||||
if HAS_MODELS
|
||||
!=ModelLoader
|
||||
if compare_models
|
||||
!=ModelComparer
|
||||
else
|
||||
//- DEVELOPMENT: Use ES6 modules
|
||||
script(type="module")
|
||||
| import ProgressBar from '/assets/js/progress.js';
|
||||
!=ProgressBar
|
||||
if changelog
|
||||
| import Changelog from '/assets/js/changelog.js';
|
||||
!=Changelog
|
||||
if IS_PAGE
|
||||
| import NavHighlighter from '/assets/js/nav-highlighter.js';
|
||||
!=NavHighlighter
|
||||
| import GitHubEmbed from '/assets/js/github-embed.js';
|
||||
!=GitHubEmbed
|
||||
| import Accordion from '/assets/js/accordion.js';
|
||||
!=Accordion
|
||||
if HAS_MODELS
|
||||
| import { ModelLoader } from '/assets/js/models.js';
|
||||
!=ModelLoader
|
||||
if compare_models
|
||||
| import { ModelComparer } from '/assets/js/models.js';
|
||||
!=ModelComparer
|
||||
|
|
|
@ -7,6 +7,12 @@ svg(style="position: absolute; visibility: hidden; width: 0; height: 0;" width="
|
|||
symbol#svg_github(viewBox="0 0 27 32")
|
||||
path(d="M13.714 2.286q3.732 0 6.884 1.839t4.991 4.991 1.839 6.884q0 4.482-2.616 8.063t-6.759 4.955q-0.482 0.089-0.714-0.125t-0.232-0.536q0-0.054 0.009-1.366t0.009-2.402q0-1.732-0.929-2.536 1.018-0.107 1.83-0.321t1.679-0.696 1.446-1.188 0.946-1.875 0.366-2.688q0-2.125-1.411-3.679 0.661-1.625-0.143-3.643-0.5-0.161-1.446 0.196t-1.643 0.786l-0.679 0.429q-1.661-0.464-3.429-0.464t-3.429 0.464q-0.286-0.196-0.759-0.482t-1.491-0.688-1.518-0.241q-0.804 2.018-0.143 3.643-1.411 1.554-1.411 3.679 0 1.518 0.366 2.679t0.938 1.875 1.438 1.196 1.679 0.696 1.83 0.321q-0.696 0.643-0.875 1.839-0.375 0.179-0.804 0.268t-1.018 0.089-1.17-0.384-0.991-1.116q-0.339-0.571-0.866-0.929t-0.884-0.429l-0.357-0.054q-0.375 0-0.518 0.080t-0.089 0.205 0.161 0.25 0.232 0.214l0.125 0.089q0.393 0.179 0.777 0.679t0.563 0.911l0.179 0.411q0.232 0.679 0.786 1.098t1.196 0.536 1.241 0.125 0.991-0.063l0.411-0.071q0 0.679 0.009 1.58t0.009 0.973q0 0.321-0.232 0.536t-0.714 0.125q-4.143-1.375-6.759-4.955t-2.616-8.063q0-3.732 1.839-6.884t4.991-4.991 6.884-1.839zM5.196 21.982q0.054-0.125-0.125-0.214-0.179-0.054-0.232 0.036-0.054 0.125 0.125 0.214 0.161 0.107 0.232-0.036zM5.75 22.589q0.125-0.089-0.036-0.286-0.179-0.161-0.286-0.054-0.125 0.089 0.036 0.286 0.179 0.179 0.286 0.054zM6.286 23.393q0.161-0.125 0-0.339-0.143-0.232-0.304-0.107-0.161 0.089 0 0.321t0.304 0.125zM7.036 24.143q0.143-0.143-0.071-0.339-0.214-0.214-0.357-0.054-0.161 0.143 0.071 0.339 0.214 0.214 0.357 0.054zM8.054 24.589q0.054-0.196-0.232-0.286-0.268-0.071-0.339 0.125t0.232 0.268q0.268 0.107 0.339-0.107zM9.179 24.679q0-0.232-0.304-0.196-0.286 0-0.286 0.196 0 0.232 0.304 0.196 0.286 0 0.286-0.196zM10.214 24.5q-0.036-0.196-0.321-0.161-0.286 0.054-0.25 0.268t0.321 0.143 0.25-0.25z")
|
||||
|
||||
symbol#svg_twitter(viewBox="0 0 30 32")
|
||||
path(d="M28.929 7.286q-1.196 1.75-2.893 2.982 0.018 0.25 0.018 0.75 0 2.321-0.679 4.634t-2.063 4.437-3.295 3.759-4.607 2.607-5.768 0.973q-4.839 0-8.857-2.589 0.625 0.071 1.393 0.071 4.018 0 7.161-2.464-1.875-0.036-3.357-1.152t-2.036-2.848q0.589 0.089 1.089 0.089 0.768 0 1.518-0.196-2-0.411-3.313-1.991t-1.313-3.67v-0.071q1.214 0.679 2.607 0.732-1.179-0.786-1.875-2.054t-0.696-2.75q0-1.571 0.786-2.911 2.161 2.661 5.259 4.259t6.634 1.777q-0.143-0.679-0.143-1.321 0-2.393 1.688-4.080t4.080-1.688q2.5 0 4.214 1.821 1.946-0.375 3.661-1.393-0.661 2.054-2.536 3.179 1.661-0.179 3.321-0.893z")
|
||||
|
||||
symbol#svg_website(viewBox="0 0 32 32")
|
||||
path(d="M22.658 10.988h5.172c0.693 1.541 1.107 3.229 1.178 5.012h-5.934c-0.025-1.884-0.181-3.544-0.416-5.012zM20.398 3.896c2.967 1.153 5.402 3.335 6.928 6.090h-4.836c-0.549-2.805-1.383-4.799-2.092-6.090zM16.068 9.986v-6.996c1.066 0.047 2.102 0.216 3.092 0.493 0.75 1.263 1.719 3.372 2.33 6.503h-5.422zM9.489 22.014c-0.234-1.469-0.396-3.119-0.421-5.012h5.998v5.012h-5.577zM9.479 10.988h5.587v5.012h-6.004c0.025-1.886 0.183-3.543 0.417-5.012zM11.988 3.461c0.987-0.266 2.015-0.435 3.078-0.469v6.994h-5.422c0.615-3.148 1.591-5.265 2.344-6.525zM3.661 9.986c1.551-2.8 4.062-4.993 7.096-6.131-0.715 1.29-1.559 3.295-2.114 6.131h-4.982zM8.060 16h-6.060c0.066-1.781 0.467-3.474 1.158-5.012h5.316c-0.233 1.469-0.39 3.128-0.414 5.012zM8.487 22.014h-5.29c-0.694-1.543-1.139-3.224-1.204-5.012h6.071c0.024 1.893 0.188 3.541 0.423 5.012zM8.651 23.016c0.559 2.864 1.416 4.867 2.134 6.142-3.045-1.133-5.557-3.335-7.11-6.142h4.976zM15.066 23.016v6.994c-1.052-0.033-2.067-0.199-3.045-0.46-0.755-1.236-1.736-3.363-2.356-6.534h5.401zM21.471 23.016c-0.617 3.152-1.592 5.271-2.344 6.512-0.979 0.271-2.006 0.418-3.059 0.465v-6.977h5.403zM16.068 17.002h5.998c-0.023 1.893-0.188 3.542-0.422 5.012h-5.576v-5.012zM22.072 16h-6.004v-5.012h5.586c0.235 1.469 0.393 3.126 0.418 5.012zM23.070 17.002h5.926c-0.066 1.787-0.506 3.468-1.197 5.012h-5.152c0.234-1.471 0.398-3.119 0.423-5.012zM27.318 23.016c-1.521 2.766-3.967 4.949-6.947 6.1 0.715-1.276 1.561-3.266 2.113-6.1h4.834z")
|
||||
|
||||
symbol#svg_code(viewBox="0 0 20 20")
|
||||
path(d="M5.719 14.75c-0.236 0-0.474-0.083-0.664-0.252l-5.060-4.498 5.341-4.748c0.412-0.365 1.044-0.33 1.411 0.083s0.33 1.045-0.083 1.412l-3.659 3.253 3.378 3.002c0.413 0.367 0.45 0.999 0.083 1.412-0.197 0.223-0.472 0.336-0.747 0.336zM14.664 14.748l5.341-4.748-5.060-4.498c-0.413-0.367-1.045-0.33-1.411 0.083s-0.33 1.045 0.083 1.412l3.378 3.003-3.659 3.252c-0.413 0.367-0.45 0.999-0.083 1.412 0.197 0.223 0.472 0.336 0.747 0.336 0.236 0 0.474-0.083 0.664-0.252zM9.986 16.165l2-12c0.091-0.545-0.277-1.060-0.822-1.151-0.547-0.092-1.061 0.277-1.15 0.822l-2 12c-0.091 0.545 0.277 1.060 0.822 1.151 0.056 0.009 0.11 0.013 0.165 0.013 0.48 0 0.904-0.347 0.985-0.835z")
|
||||
|
||||
|
|
|
@ -3,23 +3,15 @@
|
|||
include _includes/_mixins
|
||||
|
||||
- title = IS_MODELS ? LANGUAGES[current.source] || title : title
|
||||
- social_title = (SECTION == "index") ? SITENAME + " - " + SLOGAN : title + " - " + SITENAME
|
||||
- social_img = SITE_URL + "/assets/img/social/preview_" + (preview || ALPHA ? "alpha" : "default") + ".jpg"
|
||||
|
||||
- PAGE_URL = getPageUrl()
|
||||
- PAGE_TITLE = getPageTitle()
|
||||
- PAGE_IMAGE = getPageImage()
|
||||
|
||||
doctype html
|
||||
html(lang="en")
|
||||
head
|
||||
title
|
||||
if SECTION == "api" || SECTION == "usage" || SECTION == "models"
|
||||
- var title_section = (SECTION == "api") ? "API" : SECTION.charAt(0).toUpperCase() + SECTION.slice(1)
|
||||
| #{title} | #{SITENAME} #{title_section} Documentation
|
||||
|
||||
else if SECTION != "index"
|
||||
| #{title} | #{SITENAME}
|
||||
|
||||
else
|
||||
| #{SITENAME} - #{SLOGAN}
|
||||
|
||||
title=PAGE_TITLE
|
||||
meta(charset="utf-8")
|
||||
meta(name="viewport" content="width=device-width, initial-scale=1.0")
|
||||
meta(name="referrer" content="always")
|
||||
|
@ -27,23 +19,24 @@ html(lang="en")
|
|||
|
||||
meta(property="og:type" content="website")
|
||||
meta(property="og:site_name" content=sitename)
|
||||
meta(property="og:url" content="#{SITE_URL}/#{current.path.join('/')}")
|
||||
meta(property="og:title" content=social_title)
|
||||
meta(property="og:url" content=PAGE_URL)
|
||||
meta(property="og:title" content=PAGE_TITLE)
|
||||
meta(property="og:description" content=description)
|
||||
meta(property="og:image" content=social_img)
|
||||
meta(property="og:image" content=PAGE_IMAGE)
|
||||
|
||||
meta(name="twitter:card" content="summary_large_image")
|
||||
meta(name="twitter:site" content="@" + SOCIAL.twitter)
|
||||
meta(name="twitter:title" content=social_title)
|
||||
meta(name="twitter:title" content=PAGE_TITLE)
|
||||
meta(name="twitter:description" content=description)
|
||||
meta(name="twitter:image" content=social_img)
|
||||
meta(name="twitter:image" content=PAGE_IMAGE)
|
||||
|
||||
link(rel="shortcut icon" href="/assets/img/favicon.ico")
|
||||
link(rel="icon" type="image/x-icon" href="/assets/img/favicon.ico")
|
||||
|
||||
if SECTION == "api"
|
||||
link(href="/assets/css/style_green.css?v#{V_CSS}" rel="stylesheet")
|
||||
|
||||
else if SECTION == "universe"
|
||||
link(href="/assets/css/style_purple.css?v#{V_CSS}" rel="stylesheet")
|
||||
else
|
||||
link(href="/assets/css/style.css?v#{V_CSS}" rel="stylesheet")
|
||||
|
||||
|
@ -54,6 +47,9 @@ html(lang="en")
|
|||
if !landing
|
||||
include _includes/_page-docs
|
||||
|
||||
else if SECTION == "universe"
|
||||
!=yield
|
||||
|
||||
else
|
||||
main!=yield
|
||||
include _includes/_footer
|
||||
|
|
|
@ -29,7 +29,7 @@ p
|
|||
+ud-row("NUM", "numeral", "1, 2017, one, seventy-seven, IV, MMXIV")
|
||||
+ud-row("PART", "particle", "'s, not, ")
|
||||
+ud-row("PRON", "pronoun", "I, you, he, she, myself, themselves, somebody")
|
||||
+ud-row("PROPN", "proper noun", "Mary, John, Londin, NATO, HBO")
|
||||
+ud-row("PROPN", "proper noun", "Mary, John, London, NATO, HBO")
|
||||
+ud-row("PUNCT", "punctuation", "., (, ), ?")
|
||||
+ud-row("SCONJ", "subordinating conjunction", "if, while, that")
|
||||
+ud-row("SYM", "symbol", "$, %, §, ©, +, −, ×, ÷, =, :), 😝")
|
||||
|
|
|
@ -1,5 +1,13 @@
|
|||
//- 💫 DOCS > API > ARCHITECTURE > NN MODEL ARCHITECTURE
|
||||
|
||||
p
|
||||
| spaCy's statistical models have been custom-designed to give a
|
||||
| high-performance mix of speed and accuracy. The current architecture
|
||||
| hasn't been published yet, but in the meantime we prepared a video that
|
||||
| explains how the models work, with particular focus on NER.
|
||||
|
||||
+youtube("sqDHBH9IjRU")
|
||||
|
||||
p
|
||||
| The parsing model is a blend of recent results. The two recent
|
||||
| inspirations have been the work of Eli Klipperwasser and Yoav Goldberg at
|
||||
|
@ -44,7 +52,7 @@ p
|
|||
+cell First two words of the buffer.
|
||||
|
||||
+row
|
||||
+cell.u-nowrap
|
||||
+cell
|
||||
| #[code S0L1], #[code S1L1], #[code S2L1], #[code B0L1],
|
||||
| #[code B1L1]#[br]
|
||||
| #[code S0L2], #[code S1L2], #[code S2L2], #[code B0L2],
|
||||
|
@ -54,7 +62,7 @@ p
|
|||
| #[code S2], #[code B0] and #[code B1].
|
||||
|
||||
+row
|
||||
+cell.u-nowrap
|
||||
+cell
|
||||
| #[code S0R1], #[code S1R1], #[code S2R1], #[code B0R1],
|
||||
| #[code B1R1]#[br]
|
||||
| #[code S0R2], #[code S1R2], #[code S2R2], #[code B0R2],
|
||||
|
|
|
@ -6,8 +6,7 @@ p
|
|||
| but somewhat ugly in Python. Logic that deals with Python or platform
|
||||
| compatibility only lives in #[code spacy.compat]. To distinguish them from
|
||||
| the builtin functions, replacement functions are suffixed with an
|
||||
| undersocre, e.e #[code unicode_]. For specific checks, spaCy uses the
|
||||
| #[code six] and #[code ftfy] packages.
|
||||
| underscore, e.g. #[code unicode_].
|
||||
|
||||
+aside-code("Example").
|
||||
from spacy.compat import unicode_, json_dumps
|
||||
|
|
|
@ -533,8 +533,10 @@ p
+cell option
+cell
| Optional location of vectors file. Should be a tab-separated
| file where the first column contains the word and the remaining
| columns the values.
| file in Word2Vec format where the first column contains the word
| and the remaining columns the values. File can be provided in
| #[code .txt] format or as a zipped text file in #[code .zip] or
| #[code .tar.gz] format.

+row
+cell #[code --prune-vectors], #[code -V]
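For illustration, the plain-text format this option describes is roughly one word per line followed by its vector values. The snippet below writes a tiny hypothetical file in that shape; the file name and numbers are made up.

# Hypothetical example: write a minimal plain-text vectors file, one word per
# line followed by its whitespace-separated vector values.
rows = [
    ('cat', [0.22, -0.43, 0.51, -0.04]),
    ('dog', [0.20, -0.40, 0.53, -0.06]),
]
with open('vectors.txt', 'w') as f_:
    for word, values in rows:
        f_.write(word + ' ' + ' '.join(str(v) for v in values) + '\n')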
|
|
|
@ -31,6 +31,7 @@
|
|||
$grid-gutter: 2rem
|
||||
|
||||
margin-top: $grid-gutter
|
||||
min-width: 0 // hack to prevent overflow
|
||||
|
||||
@include breakpoint(min, lg)
|
||||
display: flex
|
||||
|
|
|
@ -60,6 +60,13 @@
|
|||
padding-bottom: 4rem
|
||||
border-bottom: 1px dotted $color-subtle
|
||||
|
||||
&.o-section--small
|
||||
overflow: auto
|
||||
|
||||
&:not(:last-child)
|
||||
margin-bottom: 3.5rem
|
||||
padding-bottom: 2rem
|
||||
|
||||
.o-block
|
||||
margin-bottom: 4rem
|
||||
|
||||
|
@ -142,6 +149,14 @@
|
|||
.o-badge
|
||||
border-radius: 1em
|
||||
|
||||
.o-thumb
|
||||
@include size(100px)
|
||||
overflow: hidden
|
||||
border-radius: 50%
|
||||
|
||||
&.o-thumb--small
|
||||
@include size(35px)
|
||||
|
||||
|
||||
//- SVG
|
||||
|
||||
|
|
|
@ -103,6 +103,9 @@
|
|||
&:hover
|
||||
color: $color-theme-dark
|
||||
|
||||
.u-hand
|
||||
cursor: pointer
|
||||
|
||||
.u-hide-link.u-hide-link
|
||||
border: none
|
||||
color: inherit
|
||||
|
@ -224,6 +227,7 @@
|
|||
$spinner-size: 75px
|
||||
$spinner-bar: 8px
|
||||
|
||||
min-height: $spinner-size * 2
|
||||
position: relative
|
||||
|
||||
& > *
|
||||
|
@ -245,10 +249,19 @@
|
|||
|
||||
//- Hidden elements
|
||||
|
||||
.u-hidden
|
||||
display: none
|
||||
.u-hidden,
|
||||
[v-cloak]
|
||||
display: none !important
|
||||
|
||||
@each $breakpoint in (xs, sm, md)
|
||||
.u-hidden-#{$breakpoint}.u-hidden-#{$breakpoint}
|
||||
@include breakpoint(max, $breakpoint)
|
||||
display: none
|
||||
|
||||
//- Transitions
|
||||
|
||||
.u-fade-enter-active
|
||||
transition: opacity 0.5s
|
||||
|
||||
.u-fade-enter
|
||||
opacity: 0
|
||||
|
|
|
@ -2,7 +2,8 @@
|
|||
|
||||
//- Code block
|
||||
|
||||
.c-code-block
|
||||
.c-code-block,
|
||||
.thebelab-cell
|
||||
background: $color-front
|
||||
color: darken($color-back, 20)
|
||||
padding: 0.75em 0
|
||||
|
@ -13,11 +14,11 @@
|
|||
white-space: pre
|
||||
direction: ltr
|
||||
|
||||
&.c-code-block--has-icon
|
||||
padding: 0
|
||||
display: flex
|
||||
border-top-left-radius: 0
|
||||
border-bottom-left-radius: 0
|
||||
.c-code-block--has-icon
|
||||
padding: 0
|
||||
display: flex
|
||||
border-top-left-radius: 0
|
||||
border-bottom-left-radius: 0
|
||||
|
||||
.c-code-block__icon
|
||||
padding: 0 0 0 1rem
|
||||
|
@ -28,26 +29,66 @@
|
|||
&.c-code-block__icon--border
|
||||
border-left: 6px solid
|
||||
|
||||
|
||||
|
||||
//- Code block content
|
||||
|
||||
.c-code-block__content
|
||||
.c-code-block__content,
|
||||
.thebelab-input,
|
||||
.jp-OutputArea
|
||||
display: block
|
||||
font: normal normal 1.1rem/#{1.9} $font-code
|
||||
padding: 1em 2em
|
||||
|
||||
&[data-prompt]:before,
|
||||
content: attr(data-prompt)
|
||||
margin-right: 0.65em
|
||||
display: inline-block
|
||||
vertical-align: middle
|
||||
opacity: 0.5
|
||||
.c-code-block__content[data-prompt]:before,
|
||||
content: attr(data-prompt)
|
||||
margin-right: 0.65em
|
||||
display: inline-block
|
||||
vertical-align: middle
|
||||
opacity: 0.5
|
||||
|
||||
//- Thebelab
|
||||
|
||||
[data-executable]
|
||||
margin-bottom: 0
|
||||
|
||||
.thebelab-input.thebelab-input
|
||||
padding: 3em 2em 1em
|
||||
|
||||
.jp-OutputArea
|
||||
&:not(:empty)
|
||||
padding: 2rem 2rem 1rem
|
||||
border-top: 1px solid $color-dark
|
||||
margin-top: 2rem
|
||||
|
||||
.entities, svg
|
||||
white-space: initial
|
||||
font-family: inherit
|
||||
|
||||
.entities
|
||||
font-size: 1.35rem
|
||||
|
||||
.jp-OutputArea pre
|
||||
font: inherit
|
||||
|
||||
.jp-OutputPrompt.jp-OutputArea-prompt
|
||||
padding-top: 0.5em
|
||||
margin-right: 1rem
|
||||
font-family: inherit
|
||||
font-weight: bold
|
||||
|
||||
.thebelab-run-button
|
||||
@extend .u-text-label, .u-text-label--dark
|
||||
|
||||
.thebelab-wrapper
|
||||
position: relative
|
||||
|
||||
.thebelab-wrapper__text
|
||||
@include position(absolute, top, right, 1.25rem, 1.25rem)
|
||||
color: $color-subtle-dark
|
||||
z-index: 10
|
||||
|
||||
//- Code
|
||||
|
||||
code
|
||||
code, .CodeMirror, .jp-RenderedText, .jp-OutputArea
|
||||
-webkit-font-smoothing: subpixel-antialiased
|
||||
-moz-osx-font-smoothing: auto
|
||||
|
||||
|
@ -73,7 +114,7 @@ code
|
|||
text-shadow: none
|
||||
|
||||
|
||||
//- Syntax Highlighting
|
||||
//- Syntax Highlighting (Prism)
|
||||
|
||||
[class*="language-"] .token
|
||||
&.comment, &.prolog, &.doctype, &.cdata, &.punctuation
|
||||
|
@ -103,3 +144,50 @@ code
|
|||
|
||||
&.italic
|
||||
font-style: italic
|
||||
|
||||
//- Syntax Highlighting (CodeMirror)
|
||||
|
||||
.CodeMirror.cm-s-default
|
||||
background: $color-front
|
||||
color: darken($color-back, 20)
|
||||
|
||||
.CodeMirror-selected
|
||||
background: $color-theme
|
||||
color: $color-back
|
||||
|
||||
.CodeMirror-cursor
|
||||
border-left-color: currentColor
|
||||
|
||||
.cm-variable-2
|
||||
color: inherit
|
||||
font-style: italic
|
||||
|
||||
.cm-comment
|
||||
color: map-get($syntax-highlighting, comment)
|
||||
|
||||
.cm-keyword, .cm-builtin
|
||||
color: map-get($syntax-highlighting, keyword)
|
||||
|
||||
.cm-operator
|
||||
color: map-get($syntax-highlighting, operator)
|
||||
|
||||
.cm-string
|
||||
color: map-get($syntax-highlighting, selector)
|
||||
|
||||
.cm-number
|
||||
color: map-get($syntax-highlighting, number)
|
||||
|
||||
.cm-def
|
||||
color: map-get($syntax-highlighting, function)
|
||||
|
||||
//- Syntax highlighting (Jupyter)
|
||||
|
||||
.jp-RenderedText pre
|
||||
.ansi-cyan-fg
|
||||
color: map-get($syntax-highlighting, function)
|
||||
|
||||
.ansi-green-fg
|
||||
color: $color-green
|
||||
|
||||
.ansi-red-fg
|
||||
color: map-get($syntax-highlighting, operator)
|
||||
|
|
|
@ -8,10 +8,20 @@
|
|||
width: 100%
|
||||
position: relative
|
||||
|
||||
&.x-terminal--small
|
||||
background: $color-dark
|
||||
color: $color-subtle
|
||||
border-radius: 4px
|
||||
margin-bottom: 4rem
|
||||
|
||||
.x-terminal__icons
|
||||
display: none
|
||||
position: absolute
|
||||
padding: 10px
|
||||
|
||||
@include breakpoint(min, sm)
|
||||
display: block
|
||||
|
||||
&:before,
|
||||
&:after,
|
||||
span
|
||||
|
@ -32,6 +42,12 @@
|
|||
content: ""
|
||||
background: $color-yellow
|
||||
|
||||
&.x-terminal__icons--small
|
||||
&:before,
|
||||
&:after,
|
||||
span
|
||||
@include size(10px)
|
||||
|
||||
.x-terminal__code
|
||||
margin: 0
|
||||
border: none
|
||||
|
|
|
@ -9,7 +9,7 @@
|
|||
display: flex
|
||||
justify-content: space-between
|
||||
flex-flow: row nowrap
|
||||
padding: 0 2rem 0 1rem
|
||||
padding: 0 0 0 1rem
|
||||
z-index: 30
|
||||
width: 100%
|
||||
box-shadow: $box-shadow
|
||||
|
@ -21,11 +21,20 @@
|
|||
.c-nav__menu
|
||||
@include size(100%)
|
||||
display: flex
|
||||
justify-content: flex-end
|
||||
flex-flow: row nowrap
|
||||
border-color: inherit
|
||||
flex: 1
|
||||
|
||||
|
||||
@include breakpoint(max, sm)
|
||||
@include scroll-shadow-base($color-front)
|
||||
overflow-x: auto
|
||||
overflow-y: hidden
|
||||
-webkit-overflow-scrolling: touch
|
||||
|
||||
@include breakpoint(min, md)
|
||||
justify-content: flex-end
|
||||
|
||||
.c-nav__menu__item
|
||||
display: flex
|
||||
align-items: center
|
||||
|
@ -39,6 +48,14 @@
|
|||
&:not(:first-child)
|
||||
margin-left: 2em
|
||||
|
||||
&:last-child
|
||||
@include scroll-shadow-cover(right, $color-back)
|
||||
padding-right: 2rem
|
||||
|
||||
&:first-child
|
||||
@include scroll-shadow-cover(left, $color-back)
|
||||
padding-left: 2rem
|
||||
|
||||
&.is-active
|
||||
color: $color-dark
|
||||
pointer-events: none
|
||||
|
|
|
@ -26,7 +26,7 @@ $font-code: Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace
|
|||
|
||||
// Colors
|
||||
|
||||
$colors: ( blue: #09a3d5, green: #05b083 )
|
||||
$colors: ( blue: #09a3d5, green: #05b083, purple: #6542d1 )
|
||||
|
||||
$color-back: #fff !default
|
||||
$color-front: #1a1e23 !default
|
||||
|
|
4
website/assets/css/style_purple.sass
Normal file
4
website/assets/css/style_purple.sass
Normal file
|
@ -0,0 +1,4 @@
|
|||
//- 💫 STYLESHEET (PURPLE)
|
||||
|
||||
$theme: purple
|
||||
@import style
|
Some files were not shown because too many files have changed in this diff.