Update develop from master

Matthew Honnibal 2018-05-02 01:35:59 +00:00
commit 2338e8c7fc
159 changed files with 699781 additions and 2579 deletions


@ -78,7 +78,7 @@ took place before the date you sign these terms.
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
@ -87,11 +87,11 @@ U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
- * [x] I am signing on behalf of myself as an individual and no other person
+ * [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
- * [x] I am signing on behalf of my employer or a legal entity and I have the
+ * [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details


@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name |Matthew Upson |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date |2018-04-24 |
| GitHub username |ivyleavedtoadflax |
| Website (optional) |www.machinegurning.com|

106
.github/contributors/katrinleinweber.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Katrin Leinweber |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-03-30 |
| GitHub username | katrinleinweber |
| Website (optional) | |

106
.github/contributors/miroli.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Robin Linderborg |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-04-23 |
| GitHub username | miroli |
| Website (optional) | |

106
.github/contributors/mollerhoj.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jens Dahl Mollerhoj |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 4/04/2018 |
| GitHub username | mollerhoj |
| Website (optional) | |

106
.github/contributors/skrcode.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Suraj Rajan |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 31/Mar/2018 |
| GitHub username | skrcode |
| Website (optional) | |

106
.github/contributors/trungtv.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Viet-Trung Tran |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-03-28 |
| GitHub username | trungtv |
| Website (optional) | https://datalab.vn |

6
CITATION Normal file
View File

@ -0,0 +1,6 @@
@ARTICLE{spacy2,
AUTHOR = {Honnibal, Matthew AND Montani, Ines},
TITLE = {spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing},
YEAR = {2017},
JOURNAL = {To appear}
}

View File

@ -73,28 +73,8 @@ so it only becomes visible on click, making the issue easier to read and follow.
### Issue labels ### Issue labels
To distinguish issues that are opened by us, the maintainers, we usually add a To distinguish issues that are opened by us, the maintainers, we usually add a
💫 to the title. We also use the following system to tag our issues and pull 💫 to the title. [See this page](https://github.com/explosion/spaCy/labels)
requests: for an overview of the system we use to tag our issues and pull requests.
| Issue label | Description |
| --- | --- |
| [`bug`](https://github.com/explosion/spaCy/labels/bug) | Bugs and behaviour differing from documentation |
| [`enhancement`](https://github.com/explosion/spaCy/labels/enhancement) | Feature requests and improvements |
| [`install`](https://github.com/explosion/spaCy/labels/install) | Installation problems |
| [`performance`](https://github.com/explosion/spaCy/labels/performance) | Accuracy, speed and memory use problems |
| [`tests`](https://github.com/explosion/spaCy/labels/tests) | Missing or incorrect [tests](spacy/tests) |
| [`docs`](https://github.com/explosion/spaCy/labels/docs), [`examples`](https://github.com/explosion/spaCy/labels/examples) | Issues related to the [documentation](https://spacy.io/docs) and [examples](spacy/examples) |
| [`training`](https://github.com/explosion/spaCy/labels/training) | Issues related to training and updating models |
| [`models`](https://github.com/explosion/spaCy/labels/models), `language / [name]` | Issues related to the specific [models](https://github.com/explosion/spacy-models), languages and data |
| [`linux`](https://github.com/explosion/spaCy/labels/linux), [`osx`](https://github.com/explosion/spaCy/labels/osx), [`windows`](https://github.com/explosion/spaCy/labels/windows) | Issues related to the specific operating systems |
| [`pip`](https://github.com/explosion/spaCy/labels/pip), [`conda`](https://github.com/explosion/spaCy/labels/conda) | Issues related to the specific package managers |
| [`compat`](https://github.com/explosion/spaCy/labels/compat) | Cross-platform and cross-Python compatibility issues |
| [`wip`](https://github.com/explosion/spaCy/labels/wip) | Work in progress, mostly used for pull requests |
| [`v1`](https://github.com/explosion/spaCy/labels/v1) | Reports related to spaCy v1.x |
| [`duplicate`](https://github.com/explosion/spaCy/labels/duplicate) | Duplicates, i.e. issues that have been reported before |
| [`third-party`](https://github.com/explosion/spaCy/labels/third-party) | Issues related to third-party packages and services |
| [`meta`](https://github.com/explosion/spaCy/labels/meta) | Meta topics, e.g. repo organisation and issue management |
| [`help wanted`](https://github.com/explosion/spaCy/labels/help%20wanted), [`help wanted (easy)`](https://github.com/explosion/spaCy/labels/help%20wanted%20%28easy%29) | Requests for contributions |
## Contributing to the code base ## Contributing to the code base
@ -220,7 +200,7 @@ All Python code must be written in an **intersection of Python 2 and Python 3**.
This is easy in Cython, but somewhat ugly in Python. Logic that deals with This is easy in Cython, but somewhat ugly in Python. Logic that deals with
Python or platform compatibility should only live in Python or platform compatibility should only live in
[`spacy.compat`](spacy/compat.py). To distinguish them from the builtin [`spacy.compat`](spacy/compat.py). To distinguish them from the builtin
functions, replacement functions are suffixed with an undersocre, for example functions, replacement functions are suffixed with an underscore, for example
`unicode_`. If you need to access the user's version or platform information, `unicode_`. If you need to access the user's version or platform information,
for example to show more specific error messages, you can use the `is_config()` for example to show more specific error messages, you can use the `is_config()`
helper function. helper function.
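
The compatibility convention described above can be sketched as follows. This is a minimal, self-contained illustration, not the contents of `spacy/compat.py`: the definitions of `unicode_` and `is_config` below are assumptions modelled on the names mentioned in the text, and `describe_platform` is a hypothetical helper added only for the example.

```python
import sys

# Hypothetical stand-ins for the flags a compat module might expose.
is_python2 = sys.version_info[0] == 2
is_python3 = sys.version_info[0] == 3
is_windows = sys.platform.startswith('win')
is_linux = sys.platform.startswith('linux')
is_osx = sys.platform == 'darwin'

# Replacement for a builtin, suffixed with an underscore so it cannot be
# confused with the builtin itself.
if is_python2:
    unicode_ = unicode  # noqa: F821 (this name only exists on Python 2)
else:
    unicode_ = str


def is_config(python2=None, python3=None, windows=None, linux=None, osx=None):
    """Return True if every setting that was passed matches this interpreter."""
    checks = [(python2, is_python2), (python3, is_python3),
              (windows, is_windows), (linux, is_linux), (osx, is_osx)]
    return all(wanted == actual for wanted, actual in checks
               if wanted is not None)


def describe_platform():
    # Hypothetical helper: choose a more specific message per configuration.
    if is_config(python2=True, windows=True):
        return 'Python 2 on Windows'
    return 'Python {}'.format(sys.version_info[0])
```

Keeping all version and platform checks behind one helper means calling code never needs to inspect `sys.version_info` or `sys.platform` directly.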

View File

@ -12,11 +12,11 @@ integration. It's commercial open-source software, released under the MIT licens
💫 **Version 2.0 out now!** `Check out the new features here. <https://spacy.io/usage/v2>`_ 💫 **Version 2.0 out now!** `Check out the new features here. <https://spacy.io/usage/v2>`_
.. image:: https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square .. image:: https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square&logo=travis
:target: https://travis-ci.org/explosion/spaCy :target: https://travis-ci.org/explosion/spaCy
:alt: Build Status :alt: Build Status
.. image:: https://img.shields.io/appveyor/ci/explosion/spaCy/master.svg?style=flat-square .. image:: https://img.shields.io/appveyor/ci/explosion/spaCy/master.svg?style=flat-square&logo=appveyor
:target: https://ci.appveyor.com/project/explosion/spaCy :target: https://ci.appveyor.com/project/explosion/spaCy
:alt: Appveyor Build Status :alt: Appveyor Build Status
@ -28,11 +28,11 @@ integration. It's commercial open-source software, released under the MIT licens
:target: https://pypi.python.org/pypi/spacy :target: https://pypi.python.org/pypi/spacy
:alt: pypi Version :alt: pypi Version
.. image:: https://anaconda.org/conda-forge/spacy/badges/version.svg .. image:: https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square
:target: https://anaconda.org/conda-forge/spacy :target: https://anaconda.org/conda-forge/spacy
:alt: conda Version :alt: conda Version
.. image:: https://img.shields.io/badge/gitter-join%20chat%20%E2%86%92-09a3d5.svg?style=flat-square .. image:: https://img.shields.io/badge/chat-join%20%E2%86%92-09a3d5.svg?style=flat-square&logo=gitter-white
:target: https://gitter.im/explosion/spaCy :target: https://gitter.im/explosion/spaCy
:alt: spaCy on Gitter :alt: spaCy on Gitter
@ -49,7 +49,7 @@ integration. It's commercial open-source software, released under the MIT licens
`New in v2.0`_ New features, backwards incompatibilities and migration guide. `New in v2.0`_ New features, backwards incompatibilities and migration guide.
`API Reference`_ The detailed reference for spaCy's API. `API Reference`_ The detailed reference for spaCy's API.
`Models`_ Download statistical language models for spaCy. `Models`_ Download statistical language models for spaCy.
`Resources`_ Libraries, extensions, demos, books and courses. `Universe`_ Libraries, extensions, demos, books and courses.
`Changelog`_ Changes and version history. `Changelog`_ Changes and version history.
`Contribute`_ How to contribute to the spaCy project and code base. `Contribute`_ How to contribute to the spaCy project and code base.
=================== === =================== ===
@ -59,7 +59,7 @@ integration. It's commercial open-source software, released under the MIT licens
.. _Usage Guides: https://spacy.io/usage/ .. _Usage Guides: https://spacy.io/usage/
.. _API Reference: https://spacy.io/api/ .. _API Reference: https://spacy.io/api/
.. _Models: https://spacy.io/models .. _Models: https://spacy.io/models
.. _Resources: https://spacy.io/usage/resources .. _Universe: https://spacy.io/universe
.. _Changelog: https://spacy.io/usage/#changelog .. _Changelog: https://spacy.io/usage/#changelog
.. _Contribute: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md .. _Contribute: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md
@ -308,18 +308,20 @@ VS 2010 (Python 3.4) and VS 2015 (Python 3.5).
Run tests Run tests
========= =========
spaCy comes with an `extensive test suite <spacy/tests>`_. First, find out where spaCy comes with an `extensive test suite <spacy/tests>`_. In order to run the
spaCy is installed: tests, you'll usually want to clone the repository and build spaCy from source.
This will also install the required development dependencies and test utilities
defined in the ``requirements.txt``.
Alternatively, you can find out where spaCy is installed and run ``pytest`` on
that directory. Don't forget to also install the test utilities via spaCy's
``requirements.txt``:
.. code:: bash .. code:: bash
python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))" python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"
pip install -r path/to/requirements.txt
Then run ``pytest`` on that directory. The flags ``--vectors``, ``--slow``
and ``--model`` are optional and enable additional tests:
.. code:: bash
# make sure you are using recent pytest version
python -m pip install -U pytest
python -m pytest <spacy-directory> python -m pytest <spacy-directory>
See `the documentation <https://spacy.io/usage/#tests>`_ for more details and
examples.

View File

@ -9,6 +9,7 @@ coordinates. Can be extended with more details from the API.
* Custom pipeline components: https://spacy.io/usage/processing-pipelines#custom-components * Custom pipeline components: https://spacy.io/usage/processing-pipelines#custom-components
Compatible with: spaCy v2.0.0+ Compatible with: spaCy v2.0.0+
Prerequisites: pip install requests
""" """
from __future__ import unicode_literals, print_function from __future__ import unicode_literals, print_function

View File

@ -81,7 +81,6 @@ def main(model=None, new_model_name='animal', output_dir=None, n_iter=20):
else: else:
nlp = spacy.blank('en') # create blank Language class nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model") print("Created blank 'en' model")
# Add entity recognizer to model if it's not in the pipeline # Add entity recognizer to model if it's not in the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy # nlp.create_pipe works for built-ins that are registered with spaCy
if 'ner' not in nlp.pipe_names: if 'ner' not in nlp.pipe_names:
@ -92,11 +91,18 @@ def main(model=None, new_model_name='animal', output_dir=None, n_iter=20):
ner = nlp.get_pipe('ner') ner = nlp.get_pipe('ner')
ner.add_label(LABEL) # add new entity label to entity recognizer ner.add_label(LABEL) # add new entity label to entity recognizer
if model is None:
optimizer = nlp.begin_training()
else:
# Note that 'begin_training' initializes the models, so it'll zero out
# existing entity types.
optimizer = nlp.entity.create_optimizer()
# get names of other pipes to disable them during training # get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner'] other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes): # only train NER with nlp.disable_pipes(*other_pipes): # only train NER
optimizer = nlp.begin_training()
for itn in range(n_iter): for itn in range(n_iter):
random.shuffle(TRAIN_DATA) random.shuffle(TRAIN_DATA)
losses = {} losses = {}

View File

@ -1,6 +1,6 @@
#!/usr/bin/env python #!/usr/bin/env python
# coding: utf8 # coding: utf8
"""Train a multi-label convolutional neural network text classifier on the """Train a convolutional neural network text classifier on the
IMDB dataset, using the TextCategorizer component. The dataset will be loaded IMDB dataset, using the TextCategorizer component. The dataset will be loaded
automatically via Thinc's built-in dataset loader. The model is added to automatically via Thinc's built-in dataset loader. The model is added to
spacy.pipeline, and predictions are available via `doc.cats`. For more details, spacy.pipeline, and predictions are available via `doc.cats`. For more details,

29
fabfile.py vendored
View File

@ -6,6 +6,7 @@ from pathlib import Path
from fabric.api import local, lcd, env, settings, prefix from fabric.api import local, lcd, env, settings, prefix
from os import path, environ from os import path, environ
import shutil import shutil
import sys
PWD = path.dirname(__file__) PWD = path.dirname(__file__)
@ -90,3 +91,31 @@ def train():
args = environ.get('SPACY_TRAIN_ARGS', '') args = environ.get('SPACY_TRAIN_ARGS', '')
with virtualenv(VENV_DIR) as venv_local: with virtualenv(VENV_DIR) as venv_local:
venv_local('spacy train {args}'.format(args=args)) venv_local('spacy train {args}'.format(args=args))
def conll17(treebank_dir, experiment_dir, vectors_dir, config, corpus=''):
is_not_clean = local('git status --porcelain', capture=True)
if is_not_clean:
print("Repository is not clean")
print(is_not_clean)
sys.exit(1)
git_sha = local('git rev-parse --short HEAD', capture=True)
config_checksum = local('sha256sum {config}'.format(config=config), capture=True)
experiment_dir = Path(experiment_dir) / '{}--{}'.format(config_checksum[:6], git_sha)
if not experiment_dir.exists():
experiment_dir.mkdir()
test_data_dir = Path(treebank_dir) / 'ud-test-v2.0-conll2017'
assert test_data_dir.exists()
assert test_data_dir.is_dir()
if corpus:
corpora = [corpus]
else:
corpora = ['UD_English', 'UD_Chinese', 'UD_Japanese', 'UD_Vietnamese']
local('cp {config} {experiment_dir}/config.json'.format(config=config, experiment_dir=experiment_dir))
with virtualenv(VENV_DIR) as venv_local:
for corpus in corpora:
venv_local('spacy ud-train {treebank_dir} {experiment_dir} {config} {corpus} -v {vectors_dir}'.format(
treebank_dir=treebank_dir, experiment_dir=experiment_dir, config=config, corpus=corpus, vectors_dir=vectors_dir))
venv_local('spacy ud-run-test {test_data_dir} {experiment_dir} {corpus}'.format(
test_data_dir=test_data_dir, experiment_dir=experiment_dir, config=config, corpus=corpus))
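
The naming scheme used by `conll17` for experiment directories (the first six characters of the config checksum, two hyphens, then the short git SHA) can be sketched without shelling out. This is an illustration under stated assumptions: it uses `hashlib` in place of the `sha256sum` command, takes the git SHA as a plain argument instead of calling `git rev-parse`, and `experiment_subdir` is a hypothetical helper that does not exist in the fabfile.

```python
import hashlib
from pathlib import Path


def experiment_subdir(base_dir, config_bytes, git_sha):
    """Build '<first 6 hex chars of sha256(config)>--<git sha>' under base_dir."""
    checksum = hashlib.sha256(config_bytes).hexdigest()
    return Path(base_dir) / '{}--{}'.format(checksum[:6], git_sha)


# Hypothetical inputs, for illustration only.
path = experiment_subdir('experiments', b'{"lang": "en"}', 'abc1234')
print(path.name)
```

Because the directory name depends on both the config contents and the commit, two runs only share a directory when they used the same settings and the same code.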

View File

@ -3,16 +3,12 @@ pathlib
numpy>=1.7 numpy>=1.7
cymem>=1.30,<1.32 cymem>=1.30,<1.32
preshed>=1.0.0,<2.0.0 preshed>=1.0.0,<2.0.0
thinc>=6.11.1.dev11,<6.12.0 thinc>=6.11.1.dev12,<6.12.0
murmurhash>=0.28,<0.29 murmurhash>=0.28,<0.29
cytoolz>=0.9.0,<0.10.0 cytoolz>=0.9.0,<0.10.0
plac<1.0.0,>=0.9.6 plac<1.0.0,>=0.9.6
ujson>=1.35 ujson>=1.35
dill>=0.2,<0.3 dill>=0.2,<0.3
requests>=2.13.0,<3.0.0
regex==2017.4.5 regex==2017.4.5
ftfy>=4.4.2,<5.0.0
pytest>=3.0.6,<4.0.0 pytest>=3.0.6,<4.0.0
mock>=2.0.0,<3.0.0 mock>=2.0.0,<3.0.0
msgpack-python==0.5.4
msgpack-numpy==0.4.1

View File

@ -38,6 +38,7 @@ MOD_NAMES = [
'spacy.tokens.doc', 'spacy.tokens.doc',
'spacy.tokens.span', 'spacy.tokens.span',
'spacy.tokens.token', 'spacy.tokens.token',
'spacy.tokens._retokenize',
'spacy.matcher', 'spacy.matcher',
'spacy.syntax.ner', 'spacy.syntax.ner',
'spacy.symbols', 'spacy.symbols',
@ -194,12 +195,7 @@ def setup_package():
'plac<1.0.0,>=0.9.6', 'plac<1.0.0,>=0.9.6',
'pathlib', 'pathlib',
'ujson>=1.35', 'ujson>=1.35',
'dill>=0.2,<0.3', 'dill>=0.2,<0.3'],
'requests>=2.13.0,<3.0.0',
'regex==2017.4.5',
'ftfy>=4.4.2,<5.0.0',
'msgpack-python==0.5.4',
'msgpack-numpy==0.4.1'],
setup_requires=['wheel'], setup_requires=['wheel'],
classifiers=[ classifiers=[
'Development Status :: 5 - Production/Stable', 'Development Status :: 5 - Production/Stable',

View File

@ -4,18 +4,14 @@ from __future__ import unicode_literals
from .cli.info import info as cli_info from .cli.info import info as cli_info
from .glossary import explain from .glossary import explain
from .about import __version__ from .about import __version__
from .errors import Warnings, deprecation_warning
from . import util from . import util
def load(name, **overrides): def load(name, **overrides):
depr_path = overrides.get('path') depr_path = overrides.get('path')
if depr_path not in (True, False, None): if depr_path not in (True, False, None):
util.deprecated( deprecation_warning(Warnings.W001.format(path=depr_path))
"As of spaCy v2.0, the keyword argument `path=` is deprecated. "
"You can now call spacy.load with the path as its first argument, "
"and the model's meta.json will be used to determine the language "
"to load. For example:\nnlp = spacy.load('{}')".format(depr_path),
'error')
return util.load_model(name, **overrides) return util.load_model(name, **overrides)

View File

@ -23,6 +23,7 @@ from thinc.neural._classes.affine import _set_dimensions_if_needed
import thinc.extra.load_nlp import thinc.extra.load_nlp
from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE
from .errors import Errors
from . import util from . import util
@ -42,8 +43,8 @@ def cosine(vec1, vec2):
def create_default_optimizer(ops, **cfg): def create_default_optimizer(ops, **cfg):
learn_rate = util.env_opt('learn_rate', 0.001) learn_rate = util.env_opt('learn_rate', 0.001)
beta1 = util.env_opt('optimizer_B1', 0.9) beta1 = util.env_opt('optimizer_B1', 0.9)
beta2 = util.env_opt('optimizer_B2', 0.999) beta2 = util.env_opt('optimizer_B2', 0.9)
eps = util.env_opt('optimizer_eps', 1e-08) eps = util.env_opt('optimizer_eps', 1e-12)
L2 = util.env_opt('L2_penalty', 1e-6) L2 = util.env_opt('L2_penalty', 1e-6)
max_grad_norm = util.env_opt('grad_norm_clip', 1.) max_grad_norm = util.env_opt('grad_norm_clip', 1.)
optimizer = Adam(ops, learn_rate, L2=L2, beta1=beta1, optimizer = Adam(ops, learn_rate, L2=L2, beta1=beta1,
@ -157,7 +158,7 @@ class PrecomputableAffine(Model):
sgd(self._mem.weights, self._mem.gradient, key=self.id) sgd(self._mem.weights, self._mem.gradient, key=self.id)
return dXf.reshape((dXf.shape[0], self.nF, self.nI)) return dXf.reshape((dXf.shape[0], self.nF, self.nI))
return Yf, backward return Yf, backward
def _add_padding(self, Yf): def _add_padding(self, Yf):
Yf_padded = self.ops.xp.vstack((self.pad, Yf)) Yf_padded = self.ops.xp.vstack((self.pad, Yf))
return Yf_padded return Yf_padded
@ -194,14 +195,15 @@ class PrecomputableAffine(Model):
size=tokvecs.size).reshape(tokvecs.shape) size=tokvecs.size).reshape(tokvecs.shape)
def predict(ids, tokvecs): def predict(ids, tokvecs):
# nS ids. nW tokvecs # nS ids. nW tokvecs. Exclude the padding array.
hiddens = model(tokvecs) # (nW, f, o, p) hiddens = model(tokvecs[:-1]) # (nW, f, o, p)
vectors = model.ops.allocate((ids.shape[0], model.nO * model.nP), dtype='f')
# need nS vectors # need nS vectors
vectors = model.ops.allocate((ids.shape[0], model.nO, model.nP)) hiddens = hiddens.reshape((hiddens.shape[0] * model.nF, model.nO * model.nP))
for i, feats in enumerate(ids): model.ops.scatter_add(vectors, ids.flatten(), hiddens)
for j, id_ in enumerate(feats): vectors = vectors.reshape((vectors.shape[0], model.nO, model.nP))
vectors[i] += hiddens[id_, j]
vectors += model.b vectors += model.b
vectors = model.ops.asarray(vectors)
if model.nP >= 2: if model.nP >= 2:
return model.ops.maxout(vectors)[0] return model.ops.maxout(vectors)[0]
else: else:
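
The hunk above replaces a nested Python loop with a single `scatter_add` call. As a rough sketch of what a scatter-add does, the example below uses NumPy's `np.add.at` as a stand-in. It assumes that `model.ops.scatter_add(out, ids, inputs)` accumulates `inputs[k]` into `out[ids[k]]`; it does not claim to reproduce the exact indexing of `predict`, and both function names are hypothetical.

```python
import numpy as np


def scatter_add_loop(table, indices, values):
    # Reference version: one Python-level addition per row of `values`.
    for k, row in enumerate(indices):
        table[row] += values[k]
    return table


def scatter_add_vectorised(table, indices, values):
    # np.add.at accumulates correctly even when an index repeats,
    # unlike `table[indices] += values`, which applies only one update
    # per repeated index.
    np.add.at(table, indices, values)
    return table


indices = np.array([0, 2, 0, 1])
values = np.arange(8, dtype='f').reshape((4, 2))
looped = scatter_add_loop(np.zeros((3, 2), dtype='f'), indices, values)
vectorised = scatter_add_vectorised(np.zeros((3, 2), dtype='f'), indices, values)
print(np.array_equal(looped, vectorised))
```

Index `0` appears twice in `indices`, so the first row of the result receives two contributions; this is the case where plain fancy-index assignment would silently drop one of them.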
@ -225,6 +227,11 @@ class PrecomputableAffine(Model):
def link_vectors_to_models(vocab): def link_vectors_to_models(vocab):
vectors = vocab.vectors vectors = vocab.vectors
if vectors.name is None:
vectors.name = VECTORS_KEY
print(
"Warning: Unnamed vectors -- this won't allow multiple vectors "
"models to be loaded. (Shape: (%d, %d))" % vectors.data.shape)
ops = Model.ops ops = Model.ops
for word in vocab: for word in vocab:
if word.orth in vectors.key2row: if word.orth in vectors.key2row:
@ -234,11 +241,11 @@ def link_vectors_to_models(vocab):
data = ops.asarray(vectors.data) data = ops.asarray(vectors.data)
# Set an entry here, so that vectors are accessed by StaticVectors # Set an entry here, so that vectors are accessed by StaticVectors
# (unideal, I know) # (unideal, I know)
thinc.extra.load_nlp.VECTORS[(ops.device, VECTORS_KEY)] = data thinc.extra.load_nlp.VECTORS[(ops.device, vectors.name)] = data
def Tok2Vec(width, embed_size, **kwargs): def Tok2Vec(width, embed_size, **kwargs):
pretrained_dims = kwargs.get('pretrained_dims', 0) pretrained_vectors = kwargs.get('pretrained_vectors', None)
cnn_maxout_pieces = kwargs.get('cnn_maxout_pieces', 2) cnn_maxout_pieces = kwargs.get('cnn_maxout_pieces', 2)
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH] cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
with Model.define_operators({'>>': chain, '|': concatenate, '**': clone, with Model.define_operators({'>>': chain, '|': concatenate, '**': clone,
@ -251,16 +258,16 @@ def Tok2Vec(width, embed_size, **kwargs):
name='embed_suffix') name='embed_suffix')
shape = HashEmbed(width, embed_size//2, column=cols.index(SHAPE), shape = HashEmbed(width, embed_size//2, column=cols.index(SHAPE),
name='embed_shape') name='embed_shape')
if pretrained_dims is not None and pretrained_dims >= 1: if pretrained_vectors is not None:
glove = StaticVectors(VECTORS_KEY, width, column=cols.index(ID)) glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))
embed = uniqued( embed = uniqued(
(glove | norm | prefix | suffix | shape) (glove | norm | prefix | suffix | shape)
>> LN(Maxout(width, width*5, pieces=3)), column=5) >> LN(Maxout(width, width*5, pieces=3)), column=cols.index(ORTH))
else: else:
embed = uniqued( embed = uniqued(
(norm | prefix | suffix | shape) (norm | prefix | suffix | shape)
>> LN(Maxout(width, width*4, pieces=3)), column=5) >> LN(Maxout(width, width*4, pieces=3)), column=cols.index(ORTH))
convolution = Residual( convolution = Residual(
ExtractWindow(nW=1) ExtractWindow(nW=1)
@ -318,10 +325,10 @@ def _divide_array(X, size):
def get_col(idx): def get_col(idx):
assert idx >= 0, idx if idx < 0:
raise IndexError(Errors.E066.format(value=idx))
def forward(X, drop=0.): def forward(X, drop=0.):
assert idx >= 0, idx
if isinstance(X, numpy.ndarray): if isinstance(X, numpy.ndarray):
ops = NumpyOps() ops = NumpyOps()
else: else:
@ -329,7 +336,6 @@ def get_col(idx):
output = ops.xp.ascontiguousarray(X[:, idx], dtype=X.dtype) output = ops.xp.ascontiguousarray(X[:, idx], dtype=X.dtype)
def backward(y, sgd=None): def backward(y, sgd=None):
assert idx >= 0, idx
dX = ops.allocate(X.shape) dX = ops.allocate(X.shape)
dX[:, idx] += y dX[:, idx] += y
return dX return dX
@ -416,13 +422,13 @@ def build_tagger_model(nr_class, **cfg):
token_vector_width = cfg['token_vector_width'] token_vector_width = cfg['token_vector_width']
else: else:
token_vector_width = util.env_opt('token_vector_width', 128) token_vector_width = util.env_opt('token_vector_width', 128)
pretrained_dims = cfg.get('pretrained_dims', 0) pretrained_vectors = cfg.get('pretrained_vectors')
with Model.define_operators({'>>': chain, '+': add}): with Model.define_operators({'>>': chain, '+': add}):
if 'tok2vec' in cfg: if 'tok2vec' in cfg:
tok2vec = cfg['tok2vec'] tok2vec = cfg['tok2vec']
else: else:
tok2vec = Tok2Vec(token_vector_width, embed_size, tok2vec = Tok2Vec(token_vector_width, embed_size,
pretrained_dims=pretrained_dims) pretrained_vectors=pretrained_vectors)
softmax = with_flatten(Softmax(nr_class, token_vector_width)) softmax = with_flatten(Softmax(nr_class, token_vector_width))
model = ( model = (
tok2vec tok2vec

View File

@ -11,7 +11,6 @@ __email__ = 'contact@explosion.ai'
__license__ = 'MIT' __license__ = 'MIT'
__release__ = False __release__ = False
__docs_models__ = 'https://spacy.io/usage/models'
__download_url__ = 'https://github.com/explosion/spacy-models/releases/download' __download_url__ = 'https://github.com/explosion/spacy-models/releases/download'
__compatibility__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json' __compatibility__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json'
__shortcuts__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts-v2.json' __shortcuts__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts-v2.json'

74
spacy/cli/_messages.py Normal file
View File

@ -0,0 +1,74 @@
# coding: utf8
from __future__ import unicode_literals
class Messages(object):
M001 = ("Download successful but linking failed")
M002 = ("Creating a shortcut link for 'en' didn't work (maybe you "
"don't have admin permissions?), but you can still load the "
"model via its full package name: nlp = spacy.load('{name}')")
M003 = ("Server error ({code}: {desc})")
M004 = ("Couldn't fetch {desc}. Please find a model for your spaCy "
"installation (v{version}), and download it manually. For more "
"details, see the documentation: https://spacy.io/usage/models")
M005 = ("Compatibility error")
M006 = ("No compatible models found for v{version} of spaCy.")
M007 = ("No compatible model found for '{name}' (spaCy v{version}).")
M008 = ("Can't locate model data")
M009 = ("The data should be located in {path}")
M010 = ("Can't find the spaCy data path to create model symlink")
M011 = ("Make sure a directory `/data` exists within your spaCy "
"installation and try again. The data directory should be "
"located here:")
M012 = ("Link '{name}' already exists")
M013 = ("To overwrite an existing link, use the --force flag.")
M014 = ("Can't overwrite symlink '{name}'")
M015 = ("This can happen if your data directory contains a directory or "
"file of the same name.")
M016 = ("Error: Couldn't link model to '{name}'")
M017 = ("Creating a symlink in spacy/data failed. Make sure you have the "
"required permissions and try re-running the command as admin, or "
"use a virtualenv. You can still import the model as a module and "
"call its load() method, or create the symlink manually.")
M018 = ("Linking successful")
M019 = ("You can now load the model via spacy.load('{name}')")
M020 = ("Can't find model meta.json")
M021 = ("Couldn't fetch compatibility table.")
M022 = ("Can't find spaCy v{version} in compatibility table")
M023 = ("Installed models (spaCy v{version})")
M024 = ("No models found in your current environment.")
M025 = ("Use the following commands to update the model packages:")
M026 = ("The following models are not available for spaCy "
"v{version}: {models}")
M027 = ("You may also want to overwrite the incompatible links using the "
"`python -m spacy link` command with `--force`, or remove them "
"from the data directory. Data path: {path}")
M028 = ("Input file not found")
M029 = ("Output directory not found")
M030 = ("Unknown format")
M031 = ("Can't find converter for {converter}")
M032 = ("Generated output file {name}")
M033 = ("Created {n_docs} documents")
M034 = ("Evaluation data not found")
M035 = ("Visualization output directory not found")
M036 = ("Generated {n} parses as HTML")
M037 = ("Can't find words frequencies file")
M038 = ("Successfully compiled vocab")
M039 = ("{entries} entries, {vectors} vectors")
M040 = ("Output directory not found")
M041 = ("Loaded meta.json from file")
M042 = ("Successfully created package '{name}'")
M043 = ("To build the package, run `python setup.py sdist` in this "
"directory.")
M044 = ("Package directory already exists")
M045 = ("Please delete the directory and try again, or use the `--force` "
"flag to overwrite existing directories.")
M046 = ("Generating meta.json")
M047 = ("Enter the package settings for your model. The following "
"information will be read from your model data: pipeline, vectors.")
M048 = ("No '{key}' setting found in meta.json")
M049 = ("This setting is required to build your package.")
M050 = ("Training data not found")
M051 = ("Development data not found")
M052 = ("Not a valid meta.json format")
M053 = ("Expected dict but got: {meta_type}")
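
A short sketch of how a message catalogue like this is typically consumed: the templates live in one class and call sites fill in the placeholders with `str.format`. The two templates are copied from the file above; the `report` helper and its arguments are hypothetical and only serve the example.

```python
class Messages(object):
    # Two templates copied from the catalogue above.
    M032 = ("Generated output file {name}")
    M033 = ("Created {n_docs} documents")


def report(n_docs, output_name):
    # Hypothetical helper: returns (title, text) instead of printing them.
    title = Messages.M032.format(name=output_name)
    text = Messages.M033.format(n_docs=n_docs)
    return title, text


title, text = report(3, 'train.json')
print(title)  # Generated output file train.json
print(text)   # Created 3 documents
```

Named placeholders keep each call site readable and make it harder to pass arguments in the wrong order than with `%`-style formatting.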

View File

@ -5,6 +5,7 @@ import plac
from pathlib import Path from pathlib import Path
from .converters import conllu2json, iob2json, conll_ner2json from .converters import conllu2json, iob2json, conll_ner2json
from ._messages import Messages
from ..util import prints from ..util import prints
# Converters are matched by file extension. To add a converter, add a new # Converters are matched by file extension. To add a converter, add a new
@ -32,14 +33,14 @@ def convert(input_file, output_dir, n_sents=1, morphology=False, converter='auto
input_path = Path(input_file) input_path = Path(input_file)
output_path = Path(output_dir) output_path = Path(output_dir)
if not input_path.exists(): if not input_path.exists():
prints(input_path, title="Input file not found", exits=1) prints(input_path, title=Messages.M028, exits=1)
if not output_path.exists(): if not output_path.exists():
prints(output_path, title="Output directory not found", exits=1) prints(output_path, title=Messages.M029, exits=1)
if converter == 'auto': if converter == 'auto':
converter = input_path.suffix[1:] converter = input_path.suffix[1:]
if converter not in CONVERTERS: if converter not in CONVERTERS:
prints("Can't find converter for %s" % converter, prints(Messages.M031.format(converter=converter),
title="Unknown format", exits=1) title=Messages.M030, exits=1)
func = CONVERTERS[converter] func = CONVERTERS[converter]
func(input_path, output_path, func(input_path, output_path,
n_sents=n_sents, use_morphology=morphology) n_sents=n_sents, use_morphology=morphology)
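
The extension-based lookup in `convert` can be sketched in isolation. This is a simplified illustration, not the CLI code: the converter functions and the `pick_converter` helper are hypothetical, and the sketch raises `ValueError` where the real command prints a message and exits.

```python
from pathlib import Path


def convert_conllu(path):
    # Placeholder converter: only reports which branch was taken.
    return 'conllu:{}'.format(path.name)


def convert_iob(path):
    return 'iob:{}'.format(path.name)


# Hypothetical registry; keys are file extensions without the leading dot.
CONVERTERS = {'conllu': convert_conllu, 'iob': convert_iob}


def pick_converter(input_file, converter='auto'):
    input_path = Path(input_file)
    if converter == 'auto':
        converter = input_path.suffix[1:]  # '.iob' -> 'iob'
    if converter not in CONVERTERS:
        raise ValueError("Can't find converter for {}".format(converter))
    return CONVERTERS[converter]


result = pick_converter('data/train.iob')(Path('data/train.iob'))
print(result)  # iob:train.iob
```

Passing an explicit `converter` name overrides the extension, which is useful when a file's suffix does not match its format.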

View File

@ -1,6 +1,7 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
from .._messages import Messages
from ...compat import json_dumps, path2str from ...compat import json_dumps, path2str
from ...util import prints from ...util import prints
from ...gold import iob_to_biluo from ...gold import iob_to_biluo
@ -18,8 +19,8 @@ def conll_ner2json(input_path, output_path, n_sents=10, use_morphology=False):
output_file = output_path / output_filename output_file = output_path / output_filename
with output_file.open('w', encoding='utf-8') as f: with output_file.open('w', encoding='utf-8') as f:
f.write(json_dumps(docs)) f.write(json_dumps(docs))
prints("Created %d documents" % len(docs), prints(Messages.M033.format(n_docs=len(docs)),
title="Generated output file %s" % path2str(output_file)) title=Messages.M032.format(name=path2str(output_file)))
def read_conll_ner(input_path): def read_conll_ner(input_path):

View File

@ -1,6 +1,7 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
from .._messages import Messages
from ...compat import json_dumps, path2str from ...compat import json_dumps, path2str
from ...util import prints from ...util import prints
@ -32,8 +33,8 @@ def conllu2json(input_path, output_path, n_sents=10, use_morphology=False):
output_file = output_path / output_filename output_file = output_path / output_filename
with output_file.open('w', encoding='utf-8') as f: with output_file.open('w', encoding='utf-8') as f:
f.write(json_dumps(docs)) f.write(json_dumps(docs))
prints("Created %d documents" % len(docs), prints(Messages.M033.format(n_docs=len(docs)),
title="Generated output file %s" % path2str(output_file)) title=Messages.M032.format(name=path2str(output_file)))
def read_conllx(input_path, use_morphology=False, n=0): def read_conllx(input_path, use_morphology=False, n=0):

View File

@ -2,6 +2,7 @@
from __future__ import unicode_literals from __future__ import unicode_literals
from cytoolz import partition_all, concat from cytoolz import partition_all, concat
from .._messages import Messages
from ...compat import json_dumps, path2str from ...compat import json_dumps, path2str
from ...util import prints from ...util import prints
from ...gold import iob_to_biluo from ...gold import iob_to_biluo
@ -18,8 +19,8 @@ def iob2json(input_path, output_path, n_sents=10, *a, **k):
output_file = output_path / output_filename output_file = output_path / output_filename
with output_file.open('w', encoding='utf-8') as f: with output_file.open('w', encoding='utf-8') as f:
f.write(json_dumps(docs)) f.write(json_dumps(docs))
prints("Created %d documents" % len(docs), prints(Messages.M033.format(n_docs=len(docs)),
title="Generated output file %s" % path2str(output_file)) title=Messages.M032.format(name=path2str(output_file)))
def read_iob(raw_sents): def read_iob(raw_sents):


@ -2,13 +2,15 @@
from __future__ import unicode_literals from __future__ import unicode_literals
import plac import plac
import requests
import os import os
import subprocess import subprocess
import sys import sys
import ujson
from .link import link from .link import link
from ._messages import Messages
from ..util import prints, get_package_path from ..util import prints, get_package_path
from ..compat import url_read, HTTPError
from .. import about from .. import about
@ -31,9 +33,7 @@ def download(model, direct=False):
version = get_version(model_name, compatibility) version = get_version(model_name, compatibility)
dl = download_model('{m}-{v}/{m}-{v}.tar.gz'.format(m=model_name, dl = download_model('{m}-{v}/{m}-{v}.tar.gz'.format(m=model_name,
v=version)) v=version))
if dl != 0: if dl != 0: # if download subprocess doesn't return 0, exit
# if download subprocess doesn't return 0, exit with the respective
# exit code before doing anything else
sys.exit(dl) sys.exit(dl)
try: try:
# Get package path here because link uses # Get package path here because link uses
@ -47,22 +47,16 @@ def download(model, direct=False):
# Dirty, but since spacy.download and the auto-linking is # Dirty, but since spacy.download and the auto-linking is
# mostly a convenience wrapper, it's best to show a success # mostly a convenience wrapper, it's best to show a success
# message and loading instructions, even if linking fails. # message and loading instructions, even if linking fails.
prints( prints(Messages.M001.format(name=model_name), title=Messages.M002)
"Creating a shortcut link for 'en' didn't work (maybe "
"you don't have admin permissions?), but you can still "
"load the model via its full package name:",
"nlp = spacy.load('%s')" % model_name,
title="Download successful but linking failed")
def get_json(url, desc): def get_json(url, desc):
r = requests.get(url) try:
if r.status_code != 200: data = url_read(url)
msg = ("Couldn't fetch %s. Please find a model for your spaCy " except HTTPError as e:
"installation (v%s), and download it manually.") prints(Messages.M004.format(desc, about.__version__),
prints(msg % (desc, about.__version__), about.__docs_models__, title=Messages.M003.format(e.code, e.reason), exits=1)
title="Server error (%d)" % r.status_code, exits=1) return ujson.loads(data)
return r.json()
def get_compatibility(): def get_compatibility():
@ -71,17 +65,16 @@ def get_compatibility():
comp_table = get_json(about.__compatibility__, "compatibility table") comp_table = get_json(about.__compatibility__, "compatibility table")
comp = comp_table['spacy'] comp = comp_table['spacy']
if version not in comp: if version not in comp:
prints("No compatible models found for v%s of spaCy." % version, prints(Messages.M006.format(version=version), title=Messages.M005,
title="Compatibility error", exits=1) exits=1)
return comp[version] return comp[version]
def get_version(model, comp): def get_version(model, comp):
model = model.rsplit('.dev', 1)[0] model = model.rsplit('.dev', 1)[0]
if model not in comp: if model not in comp:
version = about.__version__ prints(Messages.M007.format(name=model, version=about.__version__),
msg = "No compatible model found for '%s' (spaCy v%s)." title=Messages.M005, exits=1)
prints(msg % (model, version), title="Compatibility error", exits=1)
return comp[model][0] return comp[model][0]
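
The `get_json` change above swaps `requests` for a `url_read` helper and catches `HTTPError`. A minimal sketch of that pattern using only the standard library follows. The `url_read` defined here is a hypothetical stand-in for the helper imported from `compat`, the `reader` parameter is added purely so the function can be exercised offline, and the error text is illustrative rather than the wording in `Messages`.

```python
import json
from urllib.error import HTTPError
from urllib.request import urlopen


def url_read(url):
    # Hypothetical stand-in for the compat helper: fetch a URL, return text.
    with urlopen(url) as response:
        return response.read().decode('utf8')


def get_json(url, desc, reader=url_read):
    """Fetch and parse JSON, exiting with a readable message on HTTP errors."""
    try:
        data = reader(url)
    except HTTPError as e:
        # e.code is the HTTP status, e.reason the server's reason phrase.
        raise SystemExit("Couldn't fetch %s: server error (%d: %s)"
                         % (desc, e.code, e.reason))
    return json.loads(data)
```

Passing a fake `reader` makes the error path testable without network access; the real command would presumably call it with the default.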


@ -4,6 +4,7 @@ from __future__ import unicode_literals, division, print_function
import plac import plac
from timeit import default_timer as timer from timeit import default_timer as timer
from ._messages import Messages
from ..gold import GoldCorpus from ..gold import GoldCorpus
from ..util import prints from ..util import prints
from .. import util from .. import util
@ -33,10 +34,9 @@ def evaluate(model, data_path, gpu_id=-1, gold_preproc=False, displacy_path=None
data_path = util.ensure_path(data_path) data_path = util.ensure_path(data_path)
displacy_path = util.ensure_path(displacy_path) displacy_path = util.ensure_path(displacy_path)
if not data_path.exists(): if not data_path.exists():
prints(data_path, title="Evaluation data not found", exits=1) prints(data_path, title=Messages.M034, exits=1)
if displacy_path and not displacy_path.exists(): if displacy_path and not displacy_path.exists():
prints(displacy_path, title="Visualization output directory not found", prints(displacy_path, title=Messages.M035, exits=1)
exits=1)
corpus = GoldCorpus(data_path, data_path) corpus = GoldCorpus(data_path, data_path)
nlp = util.load_model(model) nlp = util.load_model(model)
dev_docs = list(corpus.dev_docs(nlp, gold_preproc=gold_preproc)) dev_docs = list(corpus.dev_docs(nlp, gold_preproc=gold_preproc))
@ -52,8 +52,7 @@ def evaluate(model, data_path, gpu_id=-1, gold_preproc=False, displacy_path=None
render_ents = 'ner' in nlp.meta.get('pipeline', []) render_ents = 'ner' in nlp.meta.get('pipeline', [])
render_parses(docs, displacy_path, model_name=model, render_parses(docs, displacy_path, model_name=model,
limit=displacy_limit, deps=render_deps, ents=render_ents) limit=displacy_limit, deps=render_deps, ents=render_ents)
msg = "Generated %s parses as HTML" % displacy_limit prints(displacy_path, title=Messages.M036.format(n=displacy_limit))
prints(displacy_path, title=msg)
def render_parses(docs, output_path, model_name='', limit=250, deps=True, def render_parses(docs, output_path, model_name='', limit=250, deps=True,


@ -5,15 +5,17 @@ import plac
import platform import platform
from pathlib import Path from pathlib import Path
from ._messages import Messages
from ..compat import path2str from ..compat import path2str
from .. import about
from .. import util from .. import util
from .. import about
@plac.annotations( @plac.annotations(
model=("optional: shortcut link of model", "positional", None, str), model=("optional: shortcut link of model", "positional", None, str),
markdown=("generate Markdown for GitHub issues", "flag", "md", str)) markdown=("generate Markdown for GitHub issues", "flag", "md", str),
def info(model=None, markdown=False): silent=("don't print anything (just return)", "flag", "s"))
def info(model=None, markdown=False, silent=False):
"""Print info about spaCy installation. If a model shortcut link is """Print info about spaCy installation. If a model shortcut link is
specified as an argument, print model information. Flag --markdown specified as an argument, print model information. Flag --markdown
prints details in Markdown for easy copy-pasting to GitHub issues. prints details in Markdown for easy copy-pasting to GitHub issues.
@ -25,21 +27,24 @@ def info(model=None, markdown=False):
model_path = util.get_data_path() / model model_path = util.get_data_path() / model
meta_path = model_path / 'meta.json' meta_path = model_path / 'meta.json'
if not meta_path.is_file(): if not meta_path.is_file():
util.prints(meta_path, title="Can't find model meta.json", exits=1) util.prints(meta_path, title=Messages.M020, exits=1)
meta = util.read_json(meta_path) meta = util.read_json(meta_path)
if model_path.resolve() != model_path: if model_path.resolve() != model_path:
meta['link'] = path2str(model_path) meta['link'] = path2str(model_path)
meta['source'] = path2str(model_path.resolve()) meta['source'] = path2str(model_path.resolve())
else: else:
meta['source'] = path2str(model_path) meta['source'] = path2str(model_path)
print_info(meta, 'model %s' % model, markdown) if not silent:
else: print_info(meta, 'model %s' % model, markdown)
data = {'spaCy version': about.__version__, return meta
'Location': path2str(Path(__file__).parent.parent), data = {'spaCy version': about.__version__,
'Platform': platform.platform(), 'Location': path2str(Path(__file__).parent.parent),
'Python version': platform.python_version(), 'Platform': platform.platform(),
'Models': list_models()} 'Python version': platform.python_version(),
'Models': list_models()}
if not silent:
print_info(data, 'spaCy', markdown) print_info(data, 'spaCy', markdown)
return data
def print_info(data, title, markdown): def print_info(data, title, markdown):


@ -12,10 +12,16 @@ import tarfile
import gzip import gzip
import zipfile import zipfile
from ..compat import fix_text from ._messages import Messages
from ..vectors import Vectors from ..vectors import Vectors
from ..errors import Errors, Warnings, user_warning
from ..util import prints, ensure_path, get_lang_class from ..util import prints, ensure_path, get_lang_class
try:
import ftfy
except ImportError:
ftfy = None
@plac.annotations( @plac.annotations(
lang=("model language", "positional", None, str), lang=("model language", "positional", None, str),
@ -23,27 +29,26 @@ from ..util import prints, ensure_path, get_lang_class
freqs_loc=("location of words frequencies file", "positional", None, Path), freqs_loc=("location of words frequencies file", "positional", None, Path),
clusters_loc=("optional: location of brown clusters data", clusters_loc=("optional: location of brown clusters data",
"option", "c", str), "option", "c", str),
vectors_loc=("optional: location of vectors file in GenSim text format", vectors_loc=("optional: location of vectors file in Word2Vec format "
"option", "v", str), "(either as .txt or zipped as .zip or .tar.gz)", "option",
"v", str),
prune_vectors=("optional: number of vectors to prune to", prune_vectors=("optional: number of vectors to prune to",
"option", "V", int) "option", "V", int)
) )
def init_model(lang, output_dir, freqs_loc=None, clusters_loc=None, vectors_loc=None, prune_vectors=-1): def init_model(lang, output_dir, freqs_loc=None, clusters_loc=None,
vectors_loc=None, prune_vectors=-1):
""" """
Create a new model from raw data, like word frequencies, Brown clusters Create a new model from raw data, like word frequencies, Brown clusters
and word vectors. and word vectors.
""" """
if freqs_loc is not None and not freqs_loc.exists(): if freqs_loc is not None and not freqs_loc.exists():
prints(freqs_loc, title="Can't find words frequencies file", exits=1) prints(freqs_loc, title=Messages.M037, exits=1)
clusters_loc = ensure_path(clusters_loc) clusters_loc = ensure_path(clusters_loc)
vectors_loc = ensure_path(vectors_loc) vectors_loc = ensure_path(vectors_loc)
probs, oov_prob = read_freqs(freqs_loc) if freqs_loc is not None else ({}, -20) probs, oov_prob = read_freqs(freqs_loc) if freqs_loc is not None else ({}, -20)
vectors_data, vector_keys = read_vectors(vectors_loc) if vectors_loc else (None, None) vectors_data, vector_keys = read_vectors(vectors_loc) if vectors_loc else (None, None)
clusters = read_clusters(clusters_loc) if clusters_loc else {} clusters = read_clusters(clusters_loc) if clusters_loc else {}
nlp = create_model(lang, probs, oov_prob, clusters, vectors_data, vector_keys, prune_vectors) nlp = create_model(lang, probs, oov_prob, clusters, vectors_data, vector_keys, prune_vectors)
if not output_dir.exists(): if not output_dir.exists():
output_dir.mkdir() output_dir.mkdir()
nlp.to_disk(output_dir) nlp.to_disk(output_dir)
@ -71,7 +76,6 @@ def create_model(lang, probs, oov_prob, clusters, vectors_data, vector_keys, pru
nlp = lang_class() nlp = lang_class()
for lexeme in nlp.vocab: for lexeme in nlp.vocab:
lexeme.rank = 0 lexeme.rank = 0
lex_added = 0 lex_added = 0
for i, (word, prob) in enumerate(tqdm(sorted(probs.items(), key=lambda item: item[1], reverse=True))): for i, (word, prob) in enumerate(tqdm(sorted(probs.items(), key=lambda item: item[1], reverse=True))):
lexeme = nlp.vocab[word] lexeme = nlp.vocab[word]
@ -91,15 +95,13 @@ def create_model(lang, probs, oov_prob, clusters, vectors_data, vector_keys, pru
lexeme = nlp.vocab[word] lexeme = nlp.vocab[word]
lexeme.is_oov = False lexeme.is_oov = False
lex_added += 1 lex_added += 1
if len(vectors_data): if len(vectors_data):
nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys) nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys)
if prune_vectors >= 1: if prune_vectors >= 1:
nlp.vocab.prune_vectors(prune_vectors) nlp.vocab.prune_vectors(prune_vectors)
vec_added = len(nlp.vocab.vectors) vec_added = len(nlp.vocab.vectors)
prints(Messages.M039.format(entries=lex_added, vectors=vec_added),
prints("{} entries, {} vectors".format(lex_added, vec_added), title=Messages.M038)
title="Sucessfully compiled vocab")
return nlp return nlp
@ -114,8 +116,7 @@ def read_vectors(vectors_loc):
pieces = line.rsplit(' ', vectors_data.shape[1]+1) pieces = line.rsplit(' ', vectors_data.shape[1]+1)
word = pieces.pop(0) word = pieces.pop(0)
if len(pieces) != vectors_data.shape[1]: if len(pieces) != vectors_data.shape[1]:
print(word, repr(line)) raise ValueError(Errors.E094.format(line_num=i, loc=vectors_loc))
raise ValueError("Bad line in file")
vectors_data[i] = numpy.asarray(pieces, dtype='f') vectors_data[i] = numpy.asarray(pieces, dtype='f')
vectors_keys.append(word) vectors_keys.append(word)
return vectors_data, vectors_keys return vectors_data, vectors_keys
@ -150,11 +151,14 @@ def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50):
def read_clusters(clusters_loc): def read_clusters(clusters_loc):
print("Reading clusters...") print("Reading clusters...")
clusters = {} clusters = {}
if ftfy is None:
user_warning(Warnings.W004)
with clusters_loc.open() as f: with clusters_loc.open() as f:
for line in tqdm(f): for line in tqdm(f):
try: try:
cluster, word, freq = line.split() cluster, word, freq = line.split()
word = fix_text(word) if ftfy is not None:
word = ftfy.fix_text(word)
except ValueError: except ValueError:
continue continue
# If the clusterer has only seen the word a few times, its # If the clusterer has only seen the word a few times, its
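
The `read_clusters` change above makes `ftfy` an optional dependency: the import is attempted once at module level, and text fixing is skipped when the package is missing. A minimal sketch of that pattern is below. The helper names `fix_word` and `read_cluster_line` are hypothetical, and the sketch omits the warning the commit issues when `ftfy` is unavailable.

```python
try:
    import ftfy
except ImportError:
    # Optional dependency: fall back to leaving text untouched.
    ftfy = None


def fix_word(word):
    """Repair mis-encoded text when ftfy is installed, else return as-is."""
    if ftfy is None:
        return word
    return ftfy.fix_text(word)


def read_cluster_line(line):
    """Parse a 'cluster word freq' line; return None if it is malformed."""
    try:
        cluster, word, freq = line.split()
        freq = int(freq)
    except ValueError:
        # Wrong number of fields or a non-integer frequency.
        return None
    return cluster, fix_word(word), freq
```

Keeping the `ftfy is None` check in one helper means callers do not need to know whether the package is installed.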


@ -4,6 +4,7 @@ from __future__ import unicode_literals
import plac import plac
from pathlib import Path from pathlib import Path
from ._messages import Messages
from ..compat import symlink_to, path2str from ..compat import symlink_to, path2str
from ..util import prints from ..util import prints
from .. import util from .. import util
@ -24,40 +25,29 @@ def link(origin, link_name, force=False, model_path=None):
else: else:
model_path = Path(origin) if model_path is None else Path(model_path) model_path = Path(origin) if model_path is None else Path(model_path)
if not model_path.exists(): if not model_path.exists():
prints("The data should be located in %s" % path2str(model_path), prints(Messages.M009.format(path=path2str(model_path)),
title="Can't locate model data", exits=1) title=Messages.M008, exits=1)
data_path = util.get_data_path() data_path = util.get_data_path()
if not data_path or not data_path.exists(): if not data_path or not data_path.exists():
spacy_loc = Path(__file__).parent.parent spacy_loc = Path(__file__).parent.parent
prints("Make sure a directory `/data` exists within your spaCy " prints(Messages.M011, spacy_loc, title=Messages.M010, exits=1)
"installation and try again. The data directory should be "
"located here:", path2str(spacy_loc), exits=1,
title="Can't find the spaCy data path to create model symlink")
link_path = util.get_data_path() / link_name link_path = util.get_data_path() / link_name
if link_path.is_symlink() and not force: if link_path.is_symlink() and not force:
prints("To overwrite an existing link, use the --force flag.", prints(Messages.M013, title=Messages.M012.format(name=link_name),
title="Link %s already exists" % link_name, exits=1) exits=1)
elif link_path.is_symlink(): # does a symlink exist? elif link_path.is_symlink(): # does a symlink exist?
# NB: It's important to check for is_symlink here and not for exists, # NB: It's important to check for is_symlink here and not for exists,
# because invalid/outdated symlinks would return False otherwise. # because invalid/outdated symlinks would return False otherwise.
link_path.unlink() link_path.unlink()
elif link_path.exists(): # does it exist otherwise? elif link_path.exists(): # does it exist otherwise?
# NB: Check this last because valid symlinks also "exist". # NB: Check this last because valid symlinks also "exist".
prints("This can happen if your data directory contains a directory " prints(Messages.M015, link_path,
"or file of the same name.", link_path, title=Messages.M014.format(name=link_name), exits=1)
title="Can't overwrite symlink %s" % link_name, exits=1) msg = "%s --> %s" % (path2str(model_path), path2str(link_path))
try: try:
symlink_to(link_path, model_path) symlink_to(link_path, model_path)
except: except:
# This is quite dirty, but just making sure other errors are caught. # This is quite dirty, but just making sure other errors are caught.
prints("Creating a symlink in spacy/data failed. Make sure you have " prints(Messages.M017, msg, title=Messages.M016.format(name=link_name))
"the required permissions and try re-running the command as "
"admin, or use a virtualenv. You can still import the model as "
"a module and call its load() method, or create the symlink "
"manually.",
"%s --> %s" % (path2str(model_path), path2str(link_path)),
title="Error: Couldn't link model to '%s'" % link_name)
raise raise
prints("%s --> %s" % (path2str(model_path), path2str(link_path)), prints(msg, Messages.M019.format(name=link_name), title=Messages.M018)
"You can now load the model via spacy.load('%s')" % link_name,
title="Linking successful")
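
Across these CLI modules, inline `%`-formatted strings are replaced by named entries on a `Messages` class that are filled in with `str.format`. A minimal sketch of the pattern, assuming a hypothetical two-entry catalog: the attribute names below are invented for illustration, and the texts are taken from the removed lines above, not from the actual `_messages.py`.

```python
class Messages(object):
    # Hypothetical catalog entries; the real IDs (M018, M019, ...) and their
    # exact wording are defined in spacy/cli/_messages.py.
    LINK_TITLE = "Linking successful"
    LINK_HINT = "You can now load the model via spacy.load('{name}')"


def link_success(link_name):
    """Return the (title, hint) pair for a successful link."""
    return Messages.LINK_TITLE, Messages.LINK_HINT.format(name=link_name)


title, hint = link_success('en')
print(title)
print(hint)
```

Named placeholders such as `{name}` keep the call sites readable and let the message text change without touching the code that fills it in.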


@ -5,6 +5,7 @@ import plac
import shutil import shutil
from pathlib import Path from pathlib import Path
from ._messages import Messages
from ..compat import path2str, json_dumps from ..compat import path2str, json_dumps
from ..util import prints from ..util import prints
from .. import util from .. import util
@ -31,17 +32,17 @@ def package(input_dir, output_dir, meta_path=None, create_meta=False,
output_path = util.ensure_path(output_dir) output_path = util.ensure_path(output_dir)
meta_path = util.ensure_path(meta_path) meta_path = util.ensure_path(meta_path)
if not input_path or not input_path.exists(): if not input_path or not input_path.exists():
prints(input_path, title="Model directory not found", exits=1) prints(input_path, title=Messages.M008, exits=1)
if not output_path or not output_path.exists(): if not output_path or not output_path.exists():
prints(output_path, title="Output directory not found", exits=1) prints(output_path, title=Messages.M040, exits=1)
if meta_path and not meta_path.exists(): if meta_path and not meta_path.exists():
prints(meta_path, title="meta.json not found", exits=1) prints(meta_path, title=Messages.M020, exits=1)
meta_path = meta_path or input_path / 'meta.json' meta_path = meta_path or input_path / 'meta.json'
if meta_path.is_file(): if meta_path.is_file():
meta = util.read_json(meta_path) meta = util.read_json(meta_path)
if not create_meta: # only print this if user doesn't want to overwrite if not create_meta: # only print this if user doesn't want to overwrite
prints(meta_path, title="Loaded meta.json from file") prints(meta_path, title=Messages.M041)
else: else:
meta = generate_meta(input_dir, meta) meta = generate_meta(input_dir, meta)
meta = validate_meta(meta, ['lang', 'name', 'version']) meta = validate_meta(meta, ['lang', 'name', 'version'])
@ -57,9 +58,8 @@ def package(input_dir, output_dir, meta_path=None, create_meta=False,
create_file(main_path / 'setup.py', TEMPLATE_SETUP) create_file(main_path / 'setup.py', TEMPLATE_SETUP)
create_file(main_path / 'MANIFEST.in', TEMPLATE_MANIFEST) create_file(main_path / 'MANIFEST.in', TEMPLATE_MANIFEST)
create_file(package_path / '__init__.py', TEMPLATE_INIT) create_file(package_path / '__init__.py', TEMPLATE_INIT)
prints(main_path, "To build the package, run `python setup.py sdist` in " prints(main_path, Messages.M043,
"this directory.", title=Messages.M042.format(name=model_name_v))
title="Successfully created package '%s'" % model_name_v)
def create_dirs(package_path, force): def create_dirs(package_path, force):
@ -67,10 +67,7 @@ def create_dirs(package_path, force):
if force: if force:
shutil.rmtree(path2str(package_path)) shutil.rmtree(path2str(package_path))
else: else:
prints(package_path, "Please delete the directory and try again, " prints(package_path, Messages.M045, title=Messages.M044, exits=1)
"or use the --force flag to overwrite existing "
"directories.", title="Package directory already exists",
exits=1)
Path.mkdir(package_path, parents=True) Path.mkdir(package_path, parents=True)
@ -97,9 +94,7 @@ def generate_meta(model_path, existing_meta):
meta['vectors'] = {'width': nlp.vocab.vectors_length, meta['vectors'] = {'width': nlp.vocab.vectors_length,
'vectors': len(nlp.vocab.vectors), 'vectors': len(nlp.vocab.vectors),
'keys': nlp.vocab.vectors.n_keys} 'keys': nlp.vocab.vectors.n_keys}
prints("Enter the package settings for your model. The following " prints(Messages.M047, title=Messages.M046)
"information will be read from your model data: pipeline, vectors.",
title="Generating meta.json")
for setting, desc, default in settings: for setting, desc, default in settings:
response = util.get_raw_input(desc, default) response = util.get_raw_input(desc, default)
meta[setting] = default if response == '' and default else response meta[setting] = default if response == '' and default else response
@ -111,8 +106,7 @@ def generate_meta(model_path, existing_meta):
def validate_meta(meta, keys): def validate_meta(meta, keys):
for key in keys: for key in keys:
if key not in meta or meta[key] == '': if key not in meta or meta[key] == '':
prints("This setting is required to build your package.", prints(Messages.M049, title=Messages.M048.format(key=key), exits=1)
title='No "%s" setting found in meta.json' % key, exits=1)
return meta return meta


@ -7,6 +7,7 @@ import tqdm
from thinc.neural._classes.model import Model from thinc.neural._classes.model import Model
from timeit import default_timer as timer from timeit import default_timer as timer
from ._messages import Messages
from ..attrs import PROB, IS_OOV, CLUSTER, LANG from ..attrs import PROB, IS_OOV, CLUSTER, LANG
from ..gold import GoldCorpus from ..gold import GoldCorpus
from ..util import prints, minibatch, minibatch_by_words from ..util import prints, minibatch, minibatch_by_words
@ -52,15 +53,15 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
dev_path = util.ensure_path(dev_data) dev_path = util.ensure_path(dev_data)
meta_path = util.ensure_path(meta_path) meta_path = util.ensure_path(meta_path)
if not train_path.exists(): if not train_path.exists():
prints(train_path, title="Training data not found", exits=1) prints(train_path, title=Messages.M050, exits=1)
if dev_path and not dev_path.exists(): if dev_path and not dev_path.exists():
prints(dev_path, title="Development data not found", exits=1) prints(dev_path, title=Messages.M051, exits=1)
if meta_path is not None and not meta_path.exists(): if meta_path is not None and not meta_path.exists():
prints(meta_path, title="meta.json not found", exits=1) prints(meta_path, title=Messages.M020, exits=1)
meta = util.read_json(meta_path) if meta_path else {} meta = util.read_json(meta_path) if meta_path else {}
if not isinstance(meta, dict): if not isinstance(meta, dict):
prints("Expected dict but got: {}".format(type(meta)), prints(Messages.M053.format(meta_type=type(meta)),
title="Not a valid meta.json format", exits=1) title=Messages.M052, exits=1)
meta.setdefault('lang', lang) meta.setdefault('lang', lang)
meta.setdefault('name', 'unnamed') meta.setdefault('name', 'unnamed')
@ -94,6 +95,7 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
meta['pipeline'] = pipeline meta['pipeline'] = pipeline
nlp.meta.update(meta) nlp.meta.update(meta)
if vectors: if vectors:
print("Load vectors model", vectors)
util.load_model(vectors, vocab=nlp.vocab) util.load_model(vectors, vocab=nlp.vocab)
for lex in nlp.vocab: for lex in nlp.vocab:
values = {} values = {}

315
spacy/cli/ud_run_test.py Normal file

@ -0,0 +1,315 @@
'''Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes
.conllu format for development data, allowing the official scorer to be used.
'''
from __future__ import unicode_literals
import plac
import tqdm
from pathlib import Path
import re
import sys
import json
import spacy
import spacy.util
from ..tokens import Token, Doc
from ..gold import GoldParse
from ..util import compounding, minibatch_by_words
from ..syntax.nonproj import projectivize
from ..matcher import Matcher
from ..morphology import Fused_begin, Fused_inside
from .. import displacy
from collections import defaultdict, Counter
from timeit import default_timer as timer
import itertools
import random
import numpy.random
import cytoolz
from . import conll17_ud_eval
from .. import lang
from ..lang import zh
from ..lang import ja
from ..lang import ru
################
# Data reading #
################
space_re = re.compile(r'\s+')
def split_text(text):
return [space_re.sub(' ', par.strip()) for par in text.split('\n\n')]
##############
# Evaluation #
##############
def read_conllu(file_):
docs = []
sent = []
doc = []
for line in file_:
if line.startswith('# newdoc'):
if doc:
docs.append(doc)
doc = []
elif line.startswith('#'):
continue
elif not line.strip():
if sent:
doc.append(sent)
sent = []
else:
sent.append(list(line.strip().split('\t')))
if len(sent[-1]) != 10:
print(repr(line))
raise ValueError
if sent:
doc.append(sent)
if doc:
docs.append(doc)
return docs
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
if text_loc.parts[-1].endswith('.conllu'):
docs = []
with text_loc.open() as file_:
for conllu_doc in read_conllu(file_):
for conllu_sent in conllu_doc:
words = [line[1] for line in conllu_sent]
docs.append(Doc(nlp.vocab, words=words))
for name, component in nlp.pipeline:
docs = list(component.pipe(docs))
else:
with text_loc.open('r', encoding='utf8') as text_file:
texts = split_text(text_file.read())
docs = list(nlp.pipe(texts))
with sys_loc.open('w', encoding='utf8') as out_file:
write_conllu(docs, out_file)
with gold_loc.open('r', encoding='utf8') as gold_file:
gold_ud = conll17_ud_eval.load_conllu(gold_file)
with sys_loc.open('r', encoding='utf8') as sys_file:
sys_ud = conll17_ud_eval.load_conllu(sys_file)
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
return docs, scores
def write_conllu(docs, file_):
merger = Matcher(docs[0].vocab)
merger.add('SUBTOK', None, [{'DEP': 'subtok', 'op': '+'}])
for i, doc in enumerate(docs):
matches = merger(doc)
spans = [doc[start:end+1] for _, start, end in matches]
offsets = [(span.start_char, span.end_char) for span in spans]
for start_char, end_char in offsets:
doc.merge(start_char, end_char)
# TODO: This shouldn't be necessary? Should be handled in merge
for word in doc:
if word.i == word.head.i:
word.dep_ = 'ROOT'
file_.write("# newdoc id = {i}\n".format(i=i))
for j, sent in enumerate(doc.sents):
file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
file_.write("# text = {text}\n".format(text=sent.text))
for k, token in enumerate(sent):
file_.write(_get_token_conllu(token, k, len(sent)) + '\n')
file_.write('\n')
for word in sent:
if word.head.i == word.i and word.dep_ == 'ROOT':
break
else:
print("Rootless sentence!")
print(sent)
print(i)
for w in sent:
print(w.i, w.text, w.head.text, w.head.i, w.dep_)
raise ValueError
def _get_token_conllu(token, k, sent_len):
if token.check_morph(Fused_begin) and (k+1 < sent_len):
n = 1
text = [token.text]
while token.nbor(n).check_morph(Fused_inside):
text.append(token.nbor(n).text)
n += 1
id_ = '%d-%d' % (k+1, (k+n))
fields = [id_, ''.join(text)] + ['_'] * 8
lines = ['\t'.join(fields)]
else:
lines = []
if token.head.i == token.i:
head = 0
else:
head = k + (token.head.i - token.i) + 1
fields = [str(k+1), token.text, token.lemma_, token.pos_, token.tag_, '_',
str(head), token.dep_.lower(), '_', '_']
if token.check_morph(Fused_begin) and (k+1 < sent_len):
if k == 0:
fields[1] = token.norm_[0].upper() + token.norm_[1:]
else:
fields[1] = token.norm_
elif token.check_morph(Fused_inside):
fields[1] = token.norm_
elif token._.split_start is not None:
split_start = token._.split_start
split_end = token._.split_end
split_len = (split_end.i - split_start.i) + 1
n_in_split = token.i - split_start.i
subtokens = guess_fused_orths(split_start.text, [''] * split_len)
fields[1] = subtokens[n_in_split]
lines.append('\t'.join(fields))
return '\n'.join(lines)
def guess_fused_orths(word, ud_forms):
'''The UD data 'fused tokens' don't necessarily expand to keys that match
the form. We need orths that exact match the string. Here we make a best
effort to divide up the word.'''
if word == ''.join(ud_forms):
# Happy case: we get a perfect split, with each letter accounted for.
return ud_forms
elif len(word) == sum(len(subtoken) for subtoken in ud_forms):
# Unideal, but at least lengths match.
output = []
remain = word
for subtoken in ud_forms:
assert len(subtoken) >= 1
output.append(remain[:len(subtoken)])
remain = remain[len(subtoken):]
assert len(remain) == 0, (word, ud_forms, remain)
return output
else:
# Let's say word is 6 long, and there are three subtokens. The orths
# *must* equal the original string. Arbitrarily, split [4, 1, 1]
first = word[:len(word)-(len(ud_forms)-1)]
output = [first]
remain = word[len(first):]
for i in range(1, len(ud_forms)):
assert remain
output.append(remain[:1])
remain = remain[1:]
assert len(remain) == 0, (word, output, remain)
return output
def print_results(name, ud_scores):
fields = {}
if ud_scores is not None:
fields.update({
'words': ud_scores['Words'].f1 * 100,
'sents': ud_scores['Sentences'].f1 * 100,
'tags': ud_scores['XPOS'].f1 * 100,
'uas': ud_scores['UAS'].f1 * 100,
'las': ud_scores['LAS'].f1 * 100,
})
else:
fields.update({
'words': 0.0,
'sents': 0.0,
'tags': 0.0,
'uas': 0.0,
'las': 0.0
})
tpl = '\t'.join((
name,
'{las:.1f}',
'{uas:.1f}',
'{tags:.1f}',
'{sents:.1f}',
'{words:.1f}',
))
print(tpl.format(**fields))
return fields
def get_token_split_start(token):
if token.text == '':
assert token.i != 0
i = -1
while token.nbor(i).text == '':
i -= 1
return token.nbor(i)
elif (token.i+1) < len(token.doc) and token.nbor(1).text == '':
return token
else:
return None
def get_token_split_end(token):
if (token.i+1) == len(token.doc):
return token if token.text == '' else None
elif token.text != '' and token.nbor(1).text != '':
return None
i = 1
while (token.i+i) < len(token.doc) and token.nbor(i).text == '':
i += 1
return token.nbor(i-1)
Token.set_extension('split_start', getter=get_token_split_start)
Token.set_extension('split_end', getter=get_token_split_end)
Token.set_extension('begins_fused', default=False)
Token.set_extension('inside_fused', default=False)
##################
# Initialization #
##################
def load_nlp(experiments_dir, corpus):
    nlp = spacy.load(experiments_dir / corpus / 'best-model')
    return nlp


def initialize_pipeline(nlp, docs, golds, config, device):
    nlp.add_pipe(nlp.create_pipe('parser'))
    return nlp


@plac.annotations(
    test_data_dir=("Path to Universal Dependencies test data", "positional", None, Path),
    experiment_dir=("Parent directory with output model", "positional", None, Path),
    corpus=("UD corpus to evaluate, e.g. UD_English, UD_Spanish, etc", "positional", None, str),
)
def main(test_data_dir, experiment_dir, corpus):
    lang.zh.Chinese.Defaults.use_jieba = False
    lang.ja.Japanese.Defaults.use_janome = False
    lang.ru.Russian.Defaults.use_pymorphy2 = False
    nlp = load_nlp(experiment_dir, corpus)
    treebank_code = nlp.meta['treebank']
    for section in ('test', 'dev'):
        if section == 'dev':
            section_dir = 'conll17-ud-development-2017-03-19'
        else:
            section_dir = 'conll17-ud-test-2017-05-09'
        text_path = test_data_dir / 'input' / section_dir / (treebank_code+'.txt')
        udpipe_path = test_data_dir / 'input' / section_dir / (treebank_code+'-udpipe.conllu')
        gold_path = test_data_dir / 'gold' / section_dir / (treebank_code+'.conllu')
        header = [section, 'LAS', 'UAS', 'TAG', 'SENT', 'WORD']
        print('\t'.join(header))
        inputs = {'gold': gold_path, 'udp': udpipe_path, 'raw': text_path}
        for input_type in ('udp', 'raw'):
            input_path = inputs[input_type]
            output_path = experiment_dir / corpus / '{section}.conllu'.format(section=section)
            parsed_docs, test_scores = evaluate(nlp, input_path, gold_path, output_path)
            accuracy = print_results(input_type, test_scores)
            acc_path = experiment_dir / corpus / '{section}-accuracy.json'.format(section=section)
            with open(acc_path, 'w') as file_:
                file_.write(json.dumps(accuracy, indent=2))


if __name__ == '__main__':
    plac.call(main)
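
The two getters above can be hard to follow without a Doc at hand. The following is a minimal sketch of the same index logic over a plain list of strings, assuming (as the getters suggest) that a fused token keeps its surface text on the first subtoken and that every following subtoken has empty text. The helper names `split_start` and `split_end` and the sample sentence are illustrative only and are not part of the script.

```python
def split_start(words, i):
    """Index of the token that opens the fused group containing words[i],
    or None if words[i] is not part of a fused group."""
    if words[i] == '':
        # Continuation subtoken: walk left to the nearest token with text.
        j = i - 1
        while j >= 0 and words[j] == '':
            j -= 1
        if j < 0:
            raise ValueError("sequence starts with a continuation subtoken")
        return j
    if i + 1 < len(words) and words[i + 1] == '':
        # This token has text and is followed by a continuation subtoken.
        return i
    return None


def split_end(words, i):
    """Index of the last subtoken of the fused group containing words[i],
    or None if words[i] is not part of a fused group."""
    if i + 1 == len(words):
        return i if words[i] == '' else None
    if words[i] != '' and words[i + 1] != '':
        return None
    j = 1
    while i + j < len(words) and words[i + j] == '':
        j += 1
    return i + j - 1


# Hypothetical input: 'del' is a fused token with one continuation subtoken.
words = ['Vamos', 'del', '', 'mar']
starts = [split_start(words, i) for i in range(len(words))]
ends = [split_end(words, i) for i in range(len(words))]
print(starts)  # [None, 1, 1, None]
print(ends)    # [None, 2, 2, None]
```

Both subtokens of the fused group report the same start and end, which is what lets the evaluation code recover the whole group from any of its members.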

View File

@ -247,12 +247,18 @@ Token.set_extension('inside_fused', default=False)
################## ##################
def load_nlp(corpus, config): def load_nlp(corpus, config, vectors=None):
lang = corpus.split('_')[0] lang = corpus.split('_')[0]
nlp = spacy.blank(lang) nlp = spacy.blank(lang)
if config.vectors: if config.vectors:
nlp.vocab.from_disk(Path(config.vectors) / 'vocab') if not vectors:
raise ValueError("config asks for vectors, but no vectors "
"directory set on command line (use -v)")
if (Path(vectors) / corpus).exists():
nlp.vocab.from_disk(Path(vectors) / corpus / 'vocab')
nlp.meta['treebank'] = corpus
return nlp return nlp
def initialize_pipeline(nlp, docs, golds, config, device): def initialize_pipeline(nlp, docs, golds, config, device):
nlp.add_pipe(nlp.create_pipe('parser')) nlp.add_pipe(nlp.create_pipe('parser'))
@ -274,10 +280,12 @@ def initialize_pipeline(nlp, docs, golds, config, device):
class Config(object): class Config(object):
def __init__(self, vectors=None, max_doc_length=10, multitask_tag=True, def __init__(self, vectors=None, max_doc_length=10, multitask_tag=True,
multitask_sent=True, nr_epoch=30, batch_size=1000, dropout=0.2): multitask_sent=True, multitask_dep=True, multitask_vectors=False,
nr_epoch=30, batch_size=1000, dropout=0.2):
for key, value in locals().items(): for key, value in locals().items():
setattr(self, key, value) setattr(self, key, value)
@classmethod @classmethod
def load(cls, loc): def load(cls, loc):
with Path(loc).open('r', encoding='utf8') as file_: with Path(loc).open('r', encoding='utf8') as file_:
@ -319,9 +327,11 @@ class TreebankPaths(object):
parses_dir=("Directory to write the development parses", "positional", None, Path), parses_dir=("Directory to write the development parses", "positional", None, Path),
config=("Path to json formatted config file", "positional"), config=("Path to json formatted config file", "positional"),
limit=("Size limit", "option", "n", int), limit=("Size limit", "option", "n", int),
use_gpu=("Use GPU", "option", "g", int) use_gpu=("Use GPU", "option", "g", int),
vectors_dir=("Path to directory with pre-trained vectors, named e.g. en/",
"option", "v", Path),
) )
def main(ud_dir, parses_dir, config, corpus, limit=0, use_gpu=-1): def main(ud_dir, parses_dir, config, corpus, limit=0, use_gpu=-1, vectors_dir=None):
spacy.util.fix_random_seed() spacy.util.fix_random_seed()
lang.zh.Chinese.Defaults.use_jieba = False lang.zh.Chinese.Defaults.use_jieba = False
lang.ja.Japanese.Defaults.use_janome = False lang.ja.Japanese.Defaults.use_janome = False
@ -331,7 +341,7 @@ def main(ud_dir, parses_dir, config, corpus, limit=0, use_gpu=-1):
if not (parses_dir / corpus).exists(): if not (parses_dir / corpus).exists():
(parses_dir / corpus).mkdir() (parses_dir / corpus).mkdir()
print("Train and evaluate", corpus, "using lang", paths.lang) print("Train and evaluate", corpus, "using lang", paths.lang)
nlp = load_nlp(paths.lang, config) nlp = load_nlp(paths.lang, config, vectors=vectors_dir)
docs, golds = read_data(nlp, paths.train.conllu.open(), paths.train.text.open(), docs, golds = read_data(nlp, paths.train.conllu.open(), paths.train.text.open(),
max_doc_length=config.max_doc_length, limit=limit) max_doc_length=config.max_doc_length, limit=limit)
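
The new `vectors` argument changes how the vocab location is chosen: the config only says whether vectors are wanted, and the directory comes from the command line. Below is a small sketch of that decision in isolation, assuming a layout of `<vectors_dir>/<corpus>/vocab`. The function name `resolve_vocab_dir` is hypothetical and spaCy itself is not loaded.

```python
from pathlib import Path
import tempfile


def resolve_vocab_dir(corpus, wants_vectors, vectors_dir=None):
    """Return the vocab directory to load, or None if nothing should be loaded."""
    if not wants_vectors:
        return None
    if not vectors_dir:
        # Mirrors the error raised when the config asks for vectors but -v is missing.
        raise ValueError("config asks for vectors, but no vectors directory was given")
    corpus_dir = Path(vectors_dir) / corpus
    # A missing per-corpus directory is skipped silently rather than treated as an error.
    return corpus_dir / 'vocab' if corpus_dir.exists() else None


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'en').mkdir()
    found = resolve_vocab_dir('en', True, tmp)
    missing = resolve_vocab_dir('de', True, tmp)
```

Note the asymmetry: a missing `-v` option is an error, while a missing per-language subdirectory just means training proceeds without pre-trained vectors.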

View File

@ -1,12 +1,13 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals, print_function from __future__ import unicode_literals, print_function
import requests
import pkg_resources import pkg_resources
from pathlib import Path from pathlib import Path
import sys import sys
import ujson
from ..compat import path2str, locale_escape from ._messages import Messages
from ..compat import path2str, locale_escape, url_read, HTTPError
from ..util import prints, get_data_path, read_json from ..util import prints, get_data_path, read_json
from .. import about from .. import about
@ -15,16 +16,16 @@ def validate():
"""Validate that the currently installed version of spaCy is compatible """Validate that the currently installed version of spaCy is compatible
with the installed models. Should be run after `pip install -U spacy`. with the installed models. Should be run after `pip install -U spacy`.
""" """
r = requests.get(about.__compatibility__) try:
if r.status_code != 200: data = url_read(about.__compatibility__)
prints("Couldn't fetch compatibility table.", except HTTPError as e:
title="Server error (%d)" % r.status_code, exits=1) title = Messages.M003.format(code=e.code, desc=e.reason)
compat = r.json()['spacy'] prints(Messages.M021, title=title, exits=1)
compat = ujson.loads(data)['spacy']
current_compat = compat.get(about.__version__) current_compat = compat.get(about.__version__)
if not current_compat: if not current_compat:
prints(about.__compatibility__, exits=1, prints(about.__compatibility__, exits=1,
title="Can't find spaCy v{} in compatibility table" title=Messages.M022.format(version=about.__version__))
.format(about.__version__))
all_models = set() all_models = set()
for spacy_v, models in dict(compat).items(): for spacy_v, models in dict(compat).items():
all_models.update(models.keys()) all_models.update(models.keys())
@ -41,7 +42,7 @@ def validate():
update_models = [m for m in incompat_models if m in current_compat] update_models = [m for m in incompat_models if m in current_compat]
prints(path2str(Path(__file__).parent.parent), prints(path2str(Path(__file__).parent.parent),
title="Installed models (spaCy v{})".format(about.__version__)) title=Messages.M023.format(version=about.__version__))
if model_links or model_pkgs: if model_links or model_pkgs:
print(get_row('TYPE', 'NAME', 'MODEL', 'VERSION', '')) print(get_row('TYPE', 'NAME', 'MODEL', 'VERSION', ''))
for name, data in model_pkgs.items(): for name, data in model_pkgs.items():
@ -49,23 +50,16 @@ def validate():
for name, data in model_links.items(): for name, data in model_links.items():
print(get_model_row(current_compat, name, data, 'link')) print(get_model_row(current_compat, name, data, 'link'))
else: else:
prints("No models found in your current environment.", exits=0) prints(Messages.M024, exits=0)
if update_models: if update_models:
cmd = ' python -m spacy download {}' cmd = ' python -m spacy download {}'
print("\n Use the following commands to update the model packages:") print("\n " + Messages.M025)
print('\n'.join([cmd.format(pkg) for pkg in update_models])) print('\n'.join([cmd.format(pkg) for pkg in update_models]))
if na_models: if na_models:
prints("The following models are not available for spaCy v{}: {}" prints(Messages.M026.format(version=about.__version__,
.format(about.__version__, ', '.join(na_models))) models=', '.join(na_models)))
if incompat_links: if incompat_links:
prints("You may also want to overwrite the incompatible links using " prints(Messages.M027.format(path=path2str(get_data_path())))
"the `python -m spacy link` command with `--force`, or remove "
"them from the data directory. Data path: {}"
.format(path2str(get_data_path())))
if incompat_models or incompat_links: if incompat_models or incompat_links:
sys.exit(1) sys.exit(1)

View File

@ -1,7 +1,6 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
import ftfy
import sys import sys
import ujson import ujson
import itertools import itertools
@ -34,11 +33,20 @@ try:
except ImportError: except ImportError:
from thinc.neural.optimizers import Adam as Optimizer from thinc.neural.optimizers import Adam as Optimizer
try:
import urllib.request
except ImportError:
import urllib2 as urllib
try:
from urllib.error import HTTPError
except ImportError:
from urllib2 import HTTPError
pickle = pickle pickle = pickle
copy_reg = copy_reg copy_reg = copy_reg
CudaStream = CudaStream CudaStream = CudaStream
cupy = cupy cupy = cupy
fix_text = ftfy.fix_text
copy_array = copy_array copy_array = copy_array
izip = getattr(itertools, 'izip', zip) izip = getattr(itertools, 'izip', zip)
@ -58,6 +66,7 @@ if is_python2:
input_ = raw_input # noqa: F821 input_ = raw_input # noqa: F821
json_dumps = lambda data: ujson.dumps(data, indent=2, escape_forward_slashes=False).decode('utf8') json_dumps = lambda data: ujson.dumps(data, indent=2, escape_forward_slashes=False).decode('utf8')
path2str = lambda path: str(path).decode('utf8') path2str = lambda path: str(path).decode('utf8')
url_open = urllib.urlopen
elif is_python3: elif is_python3:
bytes_ = bytes bytes_ = bytes
@ -66,6 +75,16 @@ elif is_python3:
input_ = input input_ = input
json_dumps = lambda data: ujson.dumps(data, indent=2, escape_forward_slashes=False) json_dumps = lambda data: ujson.dumps(data, indent=2, escape_forward_slashes=False)
path2str = lambda path: str(path) path2str = lambda path: str(path)
url_open = urllib.request.urlopen
def url_read(url):
file_ = url_open(url)
code = file_.getcode()
if code != 200:
raise HTTPError(url, code, "Cannot GET url", [], file_)
data = file_.read()
return data
def b_to_str(b_str): def b_to_str(b_str):
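
For readers who only need the Python 3 path, here is a minimal sketch of the `url_read` helper introduced in this hunk. The `opener` parameter and the `FakeResponse` class are assumptions added so the example runs without network access; they are not part of spaCy's compat module. Also note that the standard `urlopen` already raises `HTTPError` for 4xx and 5xx responses, so the explicit status check mainly covers other non-200 codes.

```python
import io
from urllib.error import HTTPError
from urllib.request import urlopen


def url_read(url, opener=urlopen):
    """Fetch url and return the raw response body as bytes."""
    response = opener(url)
    code = response.getcode()
    if code != 200:
        raise HTTPError(url, code, "Cannot GET url", {}, response)
    return response.read()


class FakeResponse(io.BytesIO):
    """Hypothetical in-memory stand-in for an HTTP response (illustration only)."""

    def __init__(self, body, code):
        super().__init__(body)
        self._code = code

    def getcode(self):
        return self._code


# The URL is a placeholder; the fake opener never touches the network.
data = url_read('http://example.invalid/compat.json',
                opener=lambda url: FakeResponse(b'{"spacy": {}}', 200))
```

The helper returns bytes, so callers such as `validate()` still have to decode or parse the result themselves.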

View File

@ -4,6 +4,7 @@ from __future__ import unicode_literals
from .render import DependencyRenderer, EntityRenderer from .render import DependencyRenderer, EntityRenderer
from ..tokens import Doc from ..tokens import Doc
from ..compat import b_to_str from ..compat import b_to_str
from ..errors import Errors, Warnings, user_warning
from ..util import prints, is_in_jupyter from ..util import prints, is_in_jupyter
@ -27,7 +28,7 @@ def render(docs, style='dep', page=False, minify=False, jupyter=IS_JUPYTER,
factories = {'dep': (DependencyRenderer, parse_deps), factories = {'dep': (DependencyRenderer, parse_deps),
'ent': (EntityRenderer, parse_ents)} 'ent': (EntityRenderer, parse_ents)}
if style not in factories: if style not in factories:
raise ValueError("Unknown style: %s" % style) raise ValueError(Errors.E087.format(style=style))
if isinstance(docs, Doc) or isinstance(docs, dict): if isinstance(docs, Doc) or isinstance(docs, dict):
docs = [docs] docs = [docs]
renderer, converter = factories[style] renderer, converter = factories[style]
@ -57,12 +58,12 @@ def serve(docs, style='dep', page=True, minify=False, options={}, manual=False,
render(docs, style=style, page=page, minify=minify, options=options, render(docs, style=style, page=page, minify=minify, options=options,
manual=manual) manual=manual)
httpd = simple_server.make_server('0.0.0.0', port, app) httpd = simple_server.make_server('0.0.0.0', port, app)
prints("Using the '%s' visualizer" % style, prints("Using the '{}' visualizer".format(style),
title="Serving on port %d..." % port) title="Serving on port {}...".format(port))
try: try:
httpd.serve_forever() httpd.serve_forever()
except KeyboardInterrupt: except KeyboardInterrupt:
prints("Shutting down server on port %d." % port) prints("Shutting down server on port {}.".format(port))
finally: finally:
httpd.server_close() httpd.server_close()
@ -83,6 +84,12 @@ def parse_deps(orig_doc, options={}):
RETURNS (dict): Generated dependency parse keyed by words and arcs. RETURNS (dict): Generated dependency parse keyed by words and arcs.
""" """
doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes()) doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes())
if not doc.is_parsed:
user_warning(Warnings.W005)
if options.get('collapse_phrases', False):
for np in list(doc.noun_chunks):
np.merge(tag=np.root.tag_, lemma=np.root.lemma_,
ent_type=np.root.ent_type_)
if options.get('collapse_punct', True): if options.get('collapse_punct', True):
spans = [] spans = []
for word in doc[:-1]: for word in doc[:-1]:
@ -120,6 +127,8 @@ def parse_ents(doc, options={}):
""" """
ents = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_} ents = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}
for ent in doc.ents] for ent in doc.ents]
if not ents:
user_warning(Warnings.W006)
title = (doc.user_data.get('title', None) title = (doc.user_data.get('title', None)
if hasattr(doc, 'user_data') else None) if hasattr(doc, 'user_data') else None)
return {'text': doc.text, 'ents': ents, 'title': title} return {'text': doc.text, 'ents': ents, 'title': title}

313
spacy/errors.py Normal file
View File

@ -0,0 +1,313 @@
# coding: utf8
from __future__ import unicode_literals
import os
import warnings
import inspect
def add_codes(err_cls):
    """Add error codes to string messages via class attribute names."""
    class ErrorsWithCodes(object):
        def __getattribute__(self, code):
            msg = getattr(err_cls, code)
            return '[{code}] {msg}'.format(code=code, msg=msg)
    return ErrorsWithCodes()
@add_codes
class Warnings(object):
W001 = ("As of spaCy v2.0, the keyword argument `path=` is deprecated. "
"You can now call spacy.load with the path as its first argument, "
"and the model's meta.json will be used to determine the language "
"to load. For example:\nnlp = spacy.load('{path}')")
W002 = ("Tokenizer.from_list is now deprecated. Create a new Doc object "
"instead and pass in the strings as the `words` keyword argument, "
"for example:\nfrom spacy.tokens import Doc\n"
"doc = Doc(nlp.vocab, words=[...])")
W003 = ("Positional arguments to Doc.merge are deprecated. Instead, use "
"the keyword arguments, for example tag=, lemma= or ent_type=.")
W004 = ("No text fixing enabled. Run `pip install ftfy` to enable fixing "
"using ftfy.fix_text if necessary.")
W005 = ("Doc object not parsed. This means displaCy won't be able to "
"generate a dependency visualization for it. Make sure the Doc "
"was processed with a model that supports dependency parsing, and "
"not just a language class like `English()`. For more info, see "
"the docs:\nhttps://spacy.io/usage/models")
W006 = ("No entities to visualize found in Doc object. If this is "
"surprising to you, make sure the Doc was processed using a model "
"that supports named entity recognition, and check the `doc.ents` "
"property manually if necessary.")
@add_codes
class Errors(object):
E001 = ("No component '{name}' found in pipeline. Available names: {opts}")
E002 = ("Can't find factory for '{name}'. This usually happens when spaCy "
"calls `nlp.create_pipe` with a component name that's not built "
"in - for example, when constructing the pipeline from a model's "
"meta.json. If you're using a custom component, you can write to "
"`Language.factories['{name}']` or remove it from the model meta "
"and add it via `nlp.add_pipe` instead.")
E003 = ("Not a valid pipeline component. Expected callable, but "
"got {component} (name: '{name}').")
E004 = ("If you meant to add a built-in component, use `create_pipe`: "
"`nlp.add_pipe(nlp.create_pipe('{component}'))`")
E005 = ("Pipeline component '{name}' returned None. If you're using a "
"custom component, maybe you forgot to return the processed Doc?")
E006 = ("Invalid constraints. You can only set one of the following: "
"before, after, first, last.")
E007 = ("'{name}' already exists in pipeline. Existing names: {opts}")
E008 = ("Some current components would be lost when restoring previous "
"pipeline state. If you added components after calling "
"`nlp.disable_pipes()`, you should remove them explicitly with "
"`nlp.remove_pipe()` before the pipeline is restored. Names of "
"the new components: {names}")
E009 = ("The `update` method expects same number of docs and golds, but "
"got: {n_docs} docs, {n_golds} golds.")
E010 = ("Word vectors set to length 0. This may be because you don't have "
"a model installed or loaded, or because your model doesn't "
"include word vectors. For more info, see the docs:\n"
"https://spacy.io/usage/models")
E011 = ("Unknown operator: '{op}'. Options: {opts}")
E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}")
E013 = ("Error selecting action in matcher")
E014 = ("Unknown tag ID: {tag}")
E015 = ("Conflicting morphology exception for ({tag}, {orth}). Use "
"`force=True` to overwrite.")
E016 = ("MultitaskObjective target should be function or one of: dep, "
"tag, ent, dep_tag_offset, ent_tag.")
E017 = ("Can only add unicode or bytes. Got type: {value_type}")
E018 = ("Can't retrieve string for hash '{hash_value}'.")
E019 = ("Can't create transition with unknown action ID: {action}. Action "
"IDs are enumerated in spacy/syntax/{src}.pyx.")
E020 = ("Could not find a gold-standard action to supervise the "
"dependency parser. The tree is non-projective (i.e. it has "
"crossing arcs - see spacy/syntax/nonproj.pyx for definitions). "
"The ArcEager transition system only supports projective trees. "
"To learn non-projective representations, transform the data "
"before training and after parsing. Either pass "
"`make_projective=True` to the GoldParse class, or use "
"spacy.syntax.nonproj.preprocess_training_data.")
E021 = ("Could not find a gold-standard action to supervise the "
"dependency parser. The GoldParse was projective. The transition "
"system has {n_actions} actions. State at failure: {state}")
E022 = ("Could not find a transition with the name '{name}' in the NER "
"model.")
E023 = ("Error cleaning up beam: The same state occurred twice at "
"memory address {addr} and position {i}.")
E024 = ("Could not find an optimal move to supervise the parser. Usually, "
"this means the GoldParse was not correct. For example, are all "
"labels added to the model?")
E025 = ("String is too long: {length} characters. Max is 2**30.")
E026 = ("Error accessing token at position {i}: out of bounds in Doc of "
"length {length}.")
E027 = ("Arguments 'words' and 'spaces' should be sequences of the same "
"length, or 'spaces' should be left default at None. spaces "
"should be a sequence of booleans, with True meaning that the "
"word owns a ' ' character following it.")
E028 = ("orths_and_spaces expects either a list of unicode strings or a "
"list of (unicode, bool) tuples. Got bytes instance: {value}")
E029 = ("noun_chunks requires the dependency parse, which requires a "
"statistical model to be installed and loaded. For more info, see "
"the documentation:\nhttps://spacy.io/usage/models")
E030 = ("Sentence boundaries unset. You can add the 'sentencizer' "
"component to the pipeline with: "
"nlp.add_pipe(nlp.create_pipe('sentencizer')) "
"Alternatively, add the dependency parser, or set sentence "
"boundaries by setting doc[i].is_sent_start.")
E031 = ("Invalid token: empty string ('') at position {i}.")
E032 = ("Conflicting attributes specified in doc.from_array(): "
"(HEAD, SENT_START). The HEAD attribute currently sets sentence "
"boundaries implicitly, based on the tree structure. This means "
"the HEAD attribute would potentially override the sentence "
"boundaries set by SENT_START.")
E033 = ("Cannot load into non-empty Doc of length {length}.")
E034 = ("Doc.merge received {n_args} non-keyword arguments. Expected "
"either 3 arguments (deprecated), or 0 (use keyword arguments).\n"
"Arguments supplied:\n{args}\nKeyword arguments:{kwargs}")
E035 = ("Error creating span with start {start} and end {end} for Doc of "
"length {length}.")
E036 = ("Error calculating span: Can't find a token starting at character "
"offset {start}.")
E037 = ("Error calculating span: Can't find a token ending at character "
"offset {end}.")
E038 = ("Error finding sentence for span. Infinite loop detected.")
E039 = ("Array bounds exceeded while searching for root word. This likely "
"means the parse tree is in an invalid state. Please report this "
"issue here: http://github.com/explosion/spaCy/issues")
E040 = ("Attempt to access token at {i}, max length {max_length}.")
E041 = ("Invalid comparison operator: {op}. Likely a Cython bug?")
E042 = ("Error accessing doc[{i}].nbor({j}), for doc of length {length}.")
E043 = ("Refusing to write to token.sent_start if its document is parsed, "
"because this may cause inconsistent state.")
E044 = ("Invalid value for token.sent_start: {value}. Must be one of: "
"None, True, False")
E045 = ("Possibly infinite loop encountered while looking for {attr}.")
E046 = ("Can't retrieve unregistered extension attribute '{name}'. Did "
"you forget to call the `set_extension` method?")
E047 = ("Can't assign a value to unregistered extension attribute "
"'{name}'. Did you forget to call the `set_extension` method?")
E048 = ("Can't import language {lang} from spacy.lang.")
E049 = ("Can't find spaCy data directory: '{path}'. Check your "
"installation and permissions, or use spacy.util.set_data_path "
"to customise the location if necessary.")
E050 = ("Can't find model '{name}'. It doesn't seem to be a shortcut "
"link, a Python package or a valid path to a data directory.")
E051 = ("Can't load '{name}'. If you're using a shortcut link, make sure "
"it points to a valid package (not just a data directory).")
E052 = ("Can't find model directory: {path}")
E053 = ("Could not read meta.json from {path}")
E054 = ("No valid '{setting}' setting found in model meta.json.")
E055 = ("Invalid ORTH value in exception:\nKey: {key}\nOrths: {orths}")
E056 = ("Invalid tokenizer exception: ORTH values combined don't match "
"original string.\nKey: {key}\nOrths: {orths}")
E057 = ("Stepped slices not supported in Span objects. Try: "
"list(tokens)[start:stop:step] instead.")
E058 = ("Could not retrieve vector for key {key}.")
E059 = ("One (and only one) keyword arg must be set. Got: {kwargs}")
E060 = ("Cannot add new key to vectors: the table is full. Current shape: "
"({rows}, {cols}).")
E061 = ("Bad file name: {filename}. Example of a valid file name: "
"'vectors.128.f.bin'")
E062 = ("Cannot find empty bit for new lexical flag. All bits between 0 "
"and 63 are occupied. You can replace one by specifying the "
"`flag_id` explicitly, e.g. "
"`nlp.vocab.add_flag(your_func, flag_id=IS_ALPHA)`.")
E063 = ("Invalid value for flag_id: {value}. Flag IDs must be between 1 "
"and 63 (inclusive).")
E064 = ("Error fetching a Lexeme from the Vocab. When looking up a "
"string, the lexeme returned had an orth ID that did not match "
"the query string. This means that the cached lexeme structs are "
"mismatched to the string encoding table. The mismatched:\n"
"Query string: {string}\nOrth cached: {orth}\nOrth ID: {orth_id}")
E065 = ("Only one of the vector table's width and shape can be specified. "
"Got width {width} and shape {shape}.")
E066 = ("Error creating model helper for extracting columns. Can only "
"extract columns by positive integer. Got: {value}.")
E067 = ("Invalid BILUO tag sequence: Got a tag starting with 'I' (inside "
"an entity) without a preceding 'B' (beginning of an entity). "
"Tag sequence:\n{tags}")
E068 = ("Invalid BILUO tag: '{tag}'.")
E069 = ("Invalid gold-standard parse tree. Found cycle between word "
"IDs: {cycle}")
E070 = ("Invalid gold-standard data. Number of documents ({n_docs}) "
"does not align with number of annotations ({n_annots}).")
E071 = ("Error creating lexeme: specified orth ID ({orth}) does not "
"match the one in the vocab ({vocab_orth}).")
E072 = ("Error serializing lexeme: expected data length {length}, "
"got {bad_length}.")
E073 = ("Cannot assign vector of length {new_length}. Existing vectors "
"are of length {length}. You can use `vocab.reset_vectors` to "
"clear the existing vectors and resize the table.")
E074 = ("Error interpreting compiled match pattern: patterns are expected "
"to end with the attribute {attr}. Got: {bad_attr}.")
E075 = ("Error accepting match: length ({length}) > maximum length "
"({max_len}).")
E076 = ("Error setting tensor on Doc: tensor has {rows} rows, while Doc "
"has {words} words.")
E077 = ("Error computing {value}: number of Docs ({n_docs}) does not "
"equal number of GoldParse objects ({n_golds}) in batch.")
E078 = ("Error computing score: number of words in Doc ({words_doc}) does "
"not equal number of words in GoldParse ({words_gold}).")
E079 = ("Error computing states in beam: number of predicted beams "
"({pbeams}) does not equal number of gold beams ({gbeams}).")
E080 = ("Duplicate state found in beam: {key}.")
E081 = ("Error getting gradient in beam: number of histories ({n_hist}) "
"does not equal number of losses ({losses}).")
E082 = ("Error deprojectivizing parse: number of heads ({n_heads}), "
"projective heads ({n_proj_heads}) and labels ({n_labels}) do not "
"match.")
E083 = ("Error setting extension: only one of `default`, `method`, or "
"`getter` (plus optional `setter`) is allowed. Got: {nr_defined}")
E084 = ("Error assigning label ID {label} to span: not in StringStore.")
E085 = ("Can't create lexeme for string '{string}'.")
E086 = ("Error deserializing lexeme '{string}': orth ID {orth_id} does "
"not match hash {hash_id} in StringStore.")
E087 = ("Unknown displaCy style: {style}.")
E088 = ("Text of length {length} exceeds maximum of {max_length}. The "
"v2.x parser and NER models require roughly 1GB of temporary "
"memory per 100,000 characters in the input. This means long "
"texts may cause memory allocation errors. If you're not using "
"the parser or NER, it's probably safe to increase the "
"`nlp.max_length` limit. The limit is in number of characters, so "
"you can check whether your inputs are too long by checking "
"`len(text)`.")
E089 = ("Extensions can't have a setter argument without a getter "
"argument. Check the keyword arguments on `set_extension`.")
E090 = ("Extension '{name}' already exists on {obj}. To overwrite the "
"existing extension, set `force=True` on `{obj}.set_extension`.")
E091 = ("Invalid extension attribute {name}: expected callable or None, "
"but got: {value}")
E092 = ("Could not find or assign name for word vectors. Usually, the "
"name is read from the model's meta.json in vector.name. "
"Alternatively, it is built from the 'lang' and 'name' keys in "
"the meta.json. Vector names are required to avoid issue #1660.")
E093 = ("token.ent_iob values make invalid sequence: I without B\n{seq}")
E094 = ("Error reading line {line_num} in vectors file {loc}.")
@add_codes
class TempErrors(object):
T001 = ("Max length currently 10 for phrase matching")
T002 = ("Pattern length ({doc_len}) >= phrase_matcher.max_length "
"({max_len}). Length can be set on initialization, up to 10.")
T003 = ("Resizing pre-trained Tagger models is not currently supported.")
T004 = ("Currently parser depth is hard-coded to 1. Received: {value}.")
T005 = ("Currently history size is hard-coded to 0. Received: {value}.")
T006 = ("Currently history width is hard-coded to 0. Received: {value}.")
T007 = ("Can't yet set {attr} from Span. Vote for this feature on the "
"issue tracker: http://github.com/explosion/spaCy/issues")
T008 = ("Bad configuration of Tagger. This is probably a bug within "
"spaCy. We changed the name of an internal attribute for loading "
"pre-trained vectors, and the class has been passed the old name "
"(pretrained_dims) but not the new name (pretrained_vectors).")
class ModelsWarning(UserWarning):
    pass


WARNINGS = {
    'user': UserWarning,
    'deprecation': DeprecationWarning,
    'models': ModelsWarning,
}


def _get_warn_types(arg):
    if arg == '':  # don't show any warnings
        return []
    if not arg or arg == 'all':  # show all available warnings
        return WARNINGS.keys()
    return [w_type.strip() for w_type in arg.split(',')
            if w_type.strip() in WARNINGS]


SPACY_WARNING_FILTER = os.environ.get('SPACY_WARNING_FILTER', 'always')
SPACY_WARNING_TYPES = _get_warn_types(os.environ.get('SPACY_WARNING_TYPES'))


def user_warning(message):
    _warn(message, 'user')


def deprecation_warning(message):
    _warn(message, 'deprecation')


def models_warning(message):
    _warn(message, 'models')


def _warn(message, warn_type='user'):
    """
    message (unicode): The message to display.
    warn_type (unicode): Key of the warning category in WARNINGS
        ('user', 'deprecation' or 'models').
    """
    if warn_type in SPACY_WARNING_TYPES:
        category = WARNINGS[warn_type]
        stack = inspect.stack()[-1]
        with warnings.catch_warnings():
            warnings.simplefilter(SPACY_WARNING_FILTER, category)
            warnings.warn_explicit(message, category, stack[1], stack[2])
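
The `add_codes` decorator is the piece that makes `Errors.E087.format(style=style)` produce a message prefixed with its code. The sketch below reproduces the decorator as it appears in this file and applies it to a made-up class. `DemoErrors` and its `D001` message are hypothetical and exist only for this illustration.

```python
def add_codes(err_cls):
    """Add error codes to string messages via class attribute names."""
    class ErrorsWithCodes(object):
        def __getattribute__(self, code):
            # Look the message up on the decorated class, then prefix its code.
            msg = getattr(err_cls, code)
            return '[{code}] {msg}'.format(code=code, msg=msg)
    # The decorated name is rebound to an instance, not a class.
    return ErrorsWithCodes()


@add_codes
class DemoErrors(object):
    D001 = "Unknown style: {style}."


message = DemoErrors.D001.format(style='tree')
print(message)  # [D001] Unknown style: tree.
```

Because the prefix is added on attribute access, the placeholders in the message are still unfilled at that point, so callers can keep using `.format(...)` exactly as before.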

View File

@ -17,6 +17,7 @@ import ujson
from . import _align from . import _align
from .syntax import nonproj from .syntax import nonproj
from .tokens import Doc from .tokens import Doc
from .errors import Errors
from . import util from . import util
from .util import minibatch, itershuffle from .util import minibatch, itershuffle
from .compat import json_dumps from .compat import json_dumps
@ -37,7 +38,8 @@ def tags_to_entities(tags):
elif tag == '-': elif tag == '-':
continue continue
elif tag.startswith('I'): elif tag.startswith('I'):
assert start is not None, tags[:i] if start is None:
raise ValueError(Errors.E067.format(tags=tags[:i]))
continue continue
if tag.startswith('U'): if tag.startswith('U'):
entities.append((tag[2:], i, i)) entities.append((tag[2:], i, i))
@ -47,7 +49,7 @@ def tags_to_entities(tags):
entities.append((tag[2:], start, i)) entities.append((tag[2:], start, i))
start = None start = None
else: else:
raise Exception(tag) raise ValueError(Errors.E068.format(tag=tag))
return entities return entities
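
Replacing the bare `assert` and `raise Exception(tag)` with `ValueError` means malformed tag sequences now fail with a readable message even when Python runs with assertions disabled. Here is a simplified, standalone sketch of the BILUO-to-span conversion. The handling of `None`, `'O'` and `'B'` tags is assumed from the visible branches, and the error strings are placeholders rather than the real E067 and E068 texts.

```python
def tags_to_entities(tags):
    """Convert BILUO tags to (label, start, end) tuples with inclusive ends."""
    entities = []
    start = None
    for i, tag in enumerate(tags):
        if tag is None or tag == '-':
            continue  # missing annotation: nothing to record
        if tag.startswith('O'):
            start = None
        elif tag.startswith('I'):
            if start is None:
                raise ValueError("'I' tag without a preceding 'B': %r" % (tags[:i],))
        elif tag.startswith('U'):
            entities.append((tag[2:], i, i))
        elif tag.startswith('B'):
            start = i
        elif tag.startswith('L'):
            entities.append((tag[2:], start, i))
            start = None
        else:
            raise ValueError("Invalid BILUO tag: %r" % tag)
    return entities


spans = tags_to_entities(['O', 'B-ORG', 'I-ORG', 'L-ORG', 'U-GPE'])
print(spans)  # [('ORG', 1, 3), ('GPE', 4, 4)]
```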
@ -225,7 +227,9 @@ class GoldCorpus(object):
@classmethod @classmethod
def _make_golds(cls, docs, paragraph_tuples, make_projective): def _make_golds(cls, docs, paragraph_tuples, make_projective):
assert len(docs) == len(paragraph_tuples) if len(docs) != len(paragraph_tuples):
raise ValueError(Errors.E070.format(n_docs=len(docs),
n_annots=len(paragraph_tuples)))
if len(docs) == 1: if len(docs) == 1:
return [GoldParse.from_annot_tuples(docs[0], return [GoldParse.from_annot_tuples(docs[0],
paragraph_tuples[0][0], paragraph_tuples[0][0],
@ -525,7 +529,7 @@ cdef class GoldParse:
cycle = nonproj.contains_cycle(self.heads) cycle = nonproj.contains_cycle(self.heads)
if cycle is not None: if cycle is not None:
raise Exception("Cycle found: %s" % cycle) raise ValueError(Errors.E069.format(cycle=cycle))
def __len__(self): def __len__(self):
"""Get the number of gold-standard tokens. """Get the number of gold-standard tokens.

View File

@ -8,6 +8,7 @@ from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
from .morph_rules import MORPH_RULES from .morph_rules import MORPH_RULES
from ..tag_map import TAG_MAP from ..tag_map import TAG_MAP
from .lemmatizer import LOOKUP
from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS from ..norm_exceptions import BASE_NORMS
@ -28,6 +29,7 @@ class DanishDefaults(Language.Defaults):
suffixes = TOKENIZER_SUFFIXES suffixes = TOKENIZER_SUFFIXES
tag_map = TAG_MAP tag_map = TAG_MAP
stop_words = STOP_WORDS stop_words = STOP_WORDS
lemma_lookup = LOOKUP
class Danish(Language): class Danish(Language):

692415
spacy/lang/da/lemmatizer.py Normal file

File diff suppressed because it is too large

View File

@ -286069,7 +286069,6 @@ LOOKUP = {
"sonnolente": "sonnolento", "sonnolente": "sonnolento",
"sonnolenti": "sonnolento", "sonnolenti": "sonnolento",
"sonnolenze": "sonnolenza", "sonnolenze": "sonnolenza",
"sono": "sonare",
"sonora": "sonoro", "sonora": "sonoro",
"sonore": "sonoro", "sonore": "sonoro",
"sonori": "sonoro", "sonori": "sonoro",
@ -333681,6 +333680,7 @@ LOOKUP = {
"zurliniane": "zurliniano", "zurliniane": "zurliniano",
"zurliniani": "zurliniano", "zurliniani": "zurliniano",
"àncore": "àncora", "àncore": "àncora",
"sono": "essere",
"è": "essere", "è": "essere",
"èlites": "èlite", "èlites": "èlite",
"ère": "èra", "ère": "èra",

View File

@ -190262,7 +190262,6 @@ LOOKUP = {
"gämserna": "gäms", "gämserna": "gäms",
"gämsernas": "gäms", "gämsernas": "gäms",
"gämsers": "gäms", "gämsers": "gäms",
"gäng": "gänga",
"gängad": "gänga", "gängad": "gänga",
"gängade": "gängad", "gängade": "gängad",
"gängades": "gängad", "gängades": "gängad",
@ -651423,7 +651422,6 @@ LOOKUP = {
"åpnasts": "åpen", "åpnasts": "åpen",
"åpne": "åpen", "åpne": "åpen",
"åpnes": "åpen", "åpnes": "åpen",
"år": "åra",
"åran": "åra", "åran": "åra",
"årans": "åra", "årans": "åra",
"åras": "åra", "åras": "åra",

@ -1,19 +1,53 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
from ...attrs import LANG from ...attrs import LANG, NORM
from ..norm_exceptions import BASE_NORMS
from ...language import Language from ...language import Language
from ...tokens import Doc from ...tokens import Doc
from .stop_words import STOP_WORDS
from ...util import update_exc, add_lookups
from .lex_attrs import LEX_ATTRS
#from ..tokenizer_exceptions import BASE_EXCEPTIONS
#from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
class VietnameseDefaults(Language.Defaults): class VietnameseDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'vi' # for pickling lex_attr_getters[LANG] = lambda text: 'vi' # for pickling
# add more norm exception dictionaries here
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
# overwrite functions for lexical attributes
lex_attr_getters.update(LEX_ATTRS)
# merge base exceptions and custom tokenizer exceptions
#tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
use_pyvi = True
class Vietnamese(Language): class Vietnamese(Language):
lang = 'vi' lang = 'vi'
Defaults = VietnameseDefaults # override defaults Defaults = VietnameseDefaults # override defaults
def make_doc(self, text):
if self.Defaults.use_pyvi:
try:
from pyvi import ViTokenizer
except ImportError:
msg = ("Pyvi not installed. Either set "
"Vietnamese.Defaults.use_pyvi = False, or install it from "
"https://pypi.python.org/pypi/pyvi")
raise ImportError(msg)
words, spaces = ViTokenizer.spacy_tokenize(text)
return Doc(self.vocab, words=words, spaces=spaces)
else:
words = []
spaces = []
for token in self.tokenizer(text):
words.extend(list(token.text))
spaces.extend([False]*len(token.text))
spaces[-1] = bool(token.whitespace_)
return Doc(self.vocab, words=words, spaces=spaces)
__all__ = ['Vietnamese'] __all__ = ['Vietnamese']
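
The fallback branch of `make_doc` above (used when `use_pyvi` is false) splits every token into single characters and keeps the trailing whitespace only on the last character of each token. A minimal sketch of that bookkeeping, assuming tokens are given as plain `(text, whitespace)` pairs rather than spaCy `Token` objects; the helper name `split_into_characters` is hypothetical and the empty-token guard is an addition not present in the diff:

```python
def split_into_characters(tokens):
    """Turn (text, trailing_whitespace) pairs into per-character words/spaces.

    Hypothetical helper mirroring the non-pyvi branch of make_doc.
    """
    words = []
    spaces = []
    for text, whitespace in tokens:
        if not text:
            # Guard added for this sketch: nothing to split.
            continue
        # Every character becomes its own "word" ...
        words.extend(list(text))
        # ... with no space after it, except after the token's last character.
        spaces.extend([False] * len(text))
        spaces[-1] = bool(whitespace)
    return words, spaces


words, spaces = split_into_characters([("xin", " "), ("chao", "")])
print(words)
print(spaces)
```

The two lists have the same length, which is what `Doc(vocab, words=words, spaces=spaces)` expects.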

@ -0,0 +1,26 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = ['không', 'một', 'hai', 'ba', 'bốn', 'năm', 'sáu', 'bẩy',
'tám', 'chín', 'mười', 'trăm', 'tỷ']
def like_num(text):
text = text.replace(',', '').replace('.', '')
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
return True
return False
LEX_ATTRS = {
LIKE_NUM: like_num
}
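
To illustrate what the new `like_num` getter accepts, the sketch below repeats the function from the hunk above as a standalone script (without the `LIKE_NUM` attribute import) and calls it on a few strings. The sample inputs are illustrative only.

```python
_num_words = ['không', 'một', 'hai', 'ba', 'bốn', 'năm', 'sáu', 'bẩy',
              'tám', 'chín', 'mười', 'trăm', 'tỷ']


def like_num(text):
    # Thousands separators and decimal points are ignored.
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    # Simple fractions such as "3/4".
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    # Spelled-out number words, case-insensitive.
    if text.lower() in _num_words:
        return True
    return False


for sample in ['10,000', '3/4', 'Hai', 'xin', '3/4/5']:
    print(sample, like_num(sample))
```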

1951
spacy/lang/vi/stop_words.py Normal file

File diff suppressed because it is too large

36
spacy/lang/vi/tag_map.py Normal file
@ -0,0 +1,36 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
from ...symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ
# Add a tag map
# Documentation: https://spacy.io/docs/usage/adding-languages#tag-map
# Universal Dependencies: http://universaldependencies.org/u/pos/all.html
# The keys of the tag map should be strings in your tag set. The dictionary must
# have an entry POS whose value is one of the Universal Dependencies tags.
# Optionally, you can also include morphological features or other attributes.
TAG_MAP = {
"ADV": {POS: ADV},
"NOUN": {POS: NOUN},
"ADP": {POS: ADP},
"PRON": {POS: PRON},
"SCONJ": {POS: SCONJ},
"PROPN": {POS: PROPN},
"DET": {POS: DET},
"SYM": {POS: SYM},
"INTJ": {POS: INTJ},
"PUNCT": {POS: PUNCT},
"NUM": {POS: NUM},
"AUX": {POS: AUX},
"X": {POS: X},
"CONJ": {POS: CONJ},
"CCONJ": {POS: CCONJ},
"ADJ": {POS: ADJ},
"VERB": {POS: VERB},
"PART": {POS: PART},
"SP": {POS: SPACE}
}

@ -28,6 +28,7 @@ from .lang.punctuation import TOKENIZER_INFIXES
from .lang.tokenizer_exceptions import TOKEN_MATCH from .lang.tokenizer_exceptions import TOKEN_MATCH
from .lang.tag_map import TAG_MAP from .lang.tag_map import TAG_MAP
from .lang.lex_attrs import LEX_ATTRS, is_stop from .lang.lex_attrs import LEX_ATTRS, is_stop
from .errors import Errors
from . import util from . import util
from . import about from . import about
@ -112,7 +113,7 @@ class Language(object):
'merge_subtokens': lambda nlp, **cfg: merge_subtokens, 'merge_subtokens': lambda nlp, **cfg: merge_subtokens,
} }
def __init__(self, vocab=True, make_doc=True, meta={}, **kwargs): def __init__(self, vocab=True, make_doc=True, max_length=10**6, meta={}, **kwargs):
"""Initialise a Language object. """Initialise a Language object.
vocab (Vocab): A `Vocab` object. If `True`, a vocab is created via vocab (Vocab): A `Vocab` object. If `True`, a vocab is created via
@ -127,6 +128,15 @@ class Language(object):
string occurs in both, the component is not loaded. string occurs in both, the component is not loaded.
meta (dict): Custom meta data for the Language class. Is written to by meta (dict): Custom meta data for the Language class. Is written to by
models to add model meta data. models to add model meta data.
max_length (int):
Maximum number of characters in a single text. The current v2 models
may run out of memory on extremely long texts, due to large internal
allocations. You should segment these texts into meaningful units,
e.g. paragraphs, subsections etc., before passing them to spaCy.
Default maximum length is 1,000,000 characters (1mb). As a rule of
thumb, if all pipeline components are enabled, spaCy's default
models currently require roughly 1GB of temporary memory per
100,000 characters in one text.
RETURNS (Language): The newly constructed object. RETURNS (Language): The newly constructed object.
""" """
self._meta = dict(meta) self._meta = dict(meta)
@ -134,12 +144,15 @@ class Language(object):
if vocab is True: if vocab is True:
factory = self.Defaults.create_vocab factory = self.Defaults.create_vocab
vocab = factory(self, **meta.get('vocab', {})) vocab = factory(self, **meta.get('vocab', {}))
if vocab.vectors.name is None:
vocab.vectors.name = meta.get('vectors', {}).get('name')
self.vocab = vocab self.vocab = vocab
if make_doc is True: if make_doc is True:
factory = self.Defaults.create_tokenizer factory = self.Defaults.create_tokenizer
make_doc = factory(self, **meta.get('tokenizer', {})) make_doc = factory(self, **meta.get('tokenizer', {}))
self.tokenizer = make_doc self.tokenizer = make_doc
self.pipeline = [] self.pipeline = []
self.max_length = max_length
self._optimizer = None self._optimizer = None
@property @property
@ -159,7 +172,8 @@ class Language(object):
self._meta.setdefault('license', '') self._meta.setdefault('license', '')
self._meta['vectors'] = {'width': self.vocab.vectors_length, self._meta['vectors'] = {'width': self.vocab.vectors_length,
'vectors': len(self.vocab.vectors), 'vectors': len(self.vocab.vectors),
'keys': self.vocab.vectors.n_keys} 'keys': self.vocab.vectors.n_keys,
'name': self.vocab.vectors.name}
self._meta['pipeline'] = self.pipe_names self._meta['pipeline'] = self.pipe_names
return self._meta return self._meta
@ -205,8 +219,7 @@ class Language(object):
for pipe_name, component in self.pipeline: for pipe_name, component in self.pipeline:
if pipe_name == name: if pipe_name == name:
return component return component
msg = "No component '{}' found in pipeline. Available names: {}" raise KeyError(Errors.E001.format(name=name, opts=self.pipe_names))
raise KeyError(msg.format(name, self.pipe_names))
def create_pipe(self, name, config=dict()): def create_pipe(self, name, config=dict()):
"""Create a pipeline component from a factory. """Create a pipeline component from a factory.
@ -216,7 +229,7 @@ class Language(object):
RETURNS (callable): Pipeline component. RETURNS (callable): Pipeline component.
""" """
if name not in self.factories: if name not in self.factories:
raise KeyError("Can't find factory for '{}'.".format(name)) raise KeyError(Errors.E002.format(name=name))
factory = self.factories[name] factory = self.factories[name]
return factory(self, **config) return factory(self, **config)
@ -241,12 +254,9 @@ class Language(object):
>>> nlp.add_pipe(component, name='custom_name', last=True) >>> nlp.add_pipe(component, name='custom_name', last=True)
""" """
if not hasattr(component, '__call__'): if not hasattr(component, '__call__'):
msg = ("Not a valid pipeline component. Expected callable, but " msg = Errors.E003.format(component=repr(component), name=name)
"got {}. ".format(repr(component)))
if isinstance(component, basestring_) and component in self.factories: if isinstance(component, basestring_) and component in self.factories:
msg += ("If you meant to add a built-in component, use " msg += Errors.E004.format(component=component)
"create_pipe: nlp.add_pipe(nlp.create_pipe('{}'))"
.format(component))
raise ValueError(msg) raise ValueError(msg)
if name is None: if name is None:
if hasattr(component, 'name'): if hasattr(component, 'name'):
@ -259,11 +269,9 @@ class Language(object):
else: else:
name = repr(component) name = repr(component)
if name in self.pipe_names: if name in self.pipe_names:
raise ValueError("'{}' already exists in pipeline.".format(name)) raise ValueError(Errors.E007.format(name=name, opts=self.pipe_names))
if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2: if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2:
msg = ("Invalid constraints. You can only set one of the " raise ValueError(Errors.E006)
"following: before, after, first, last.")
raise ValueError(msg)
pipe = (name, component) pipe = (name, component)
if last or not any([first, before, after]): if last or not any([first, before, after]):
self.pipeline.append(pipe) self.pipeline.append(pipe)
@ -274,9 +282,8 @@ class Language(object):
elif after and after in self.pipe_names: elif after and after in self.pipe_names:
self.pipeline.insert(self.pipe_names.index(after) + 1, pipe) self.pipeline.insert(self.pipe_names.index(after) + 1, pipe)
else: else:
msg = "Can't find '{}' in pipeline. Available names: {}" raise ValueError(Errors.E001.format(name=before or after,
unfound = before or after opts=self.pipe_names))
raise ValueError(msg.format(unfound, self.pipe_names))
def has_pipe(self, name): def has_pipe(self, name):
"""Check if a component name is present in the pipeline. Equivalent to """Check if a component name is present in the pipeline. Equivalent to
@ -294,8 +301,7 @@ class Language(object):
component (callable): Pipeline component. component (callable): Pipeline component.
""" """
if name not in self.pipe_names: if name not in self.pipe_names:
msg = "Can't find '{}' in pipeline. Available names: {}" raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
raise ValueError(msg.format(name, self.pipe_names))
self.pipeline[self.pipe_names.index(name)] = (name, component) self.pipeline[self.pipe_names.index(name)] = (name, component)
def rename_pipe(self, old_name, new_name): def rename_pipe(self, old_name, new_name):
@ -305,11 +311,9 @@ class Language(object):
new_name (unicode): New name of the component. new_name (unicode): New name of the component.
""" """
if old_name not in self.pipe_names: if old_name not in self.pipe_names:
msg = "Can't find '{}' in pipeline. Available names: {}" raise ValueError(Errors.E001.format(name=old_name, opts=self.pipe_names))
raise ValueError(msg.format(old_name, self.pipe_names))
if new_name in self.pipe_names: if new_name in self.pipe_names:
msg = "'{}' already exists in pipeline. Existing names: {}" raise ValueError(Errors.E007.format(name=new_name, opts=self.pipe_names))
raise ValueError(msg.format(new_name, self.pipe_names))
i = self.pipe_names.index(old_name) i = self.pipe_names.index(old_name)
self.pipeline[i] = (new_name, self.pipeline[i][1]) self.pipeline[i] = (new_name, self.pipeline[i][1])
@ -320,8 +324,7 @@ class Language(object):
RETURNS (tuple): A `(name, component)` tuple of the removed component. RETURNS (tuple): A `(name, component)` tuple of the removed component.
""" """
if name not in self.pipe_names: if name not in self.pipe_names:
msg = "Can't find '{}' in pipeline. Available names: {}" raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
raise ValueError(msg.format(name, self.pipe_names))
return self.pipeline.pop(self.pipe_names.index(name)) return self.pipeline.pop(self.pipe_names.index(name))
def __call__(self, text, disable=[]): def __call__(self, text, disable=[]):
@ -338,11 +341,18 @@ class Language(object):
>>> tokens[0].text, tokens[0].head.tag_ >>> tokens[0].text, tokens[0].head.tag_
('An', 'NN') ('An', 'NN')
""" """
if len(text) >= self.max_length:
raise ValueError(Errors.E088.format(length=len(text),
max_length=self.max_length))
doc = self.make_doc(text) doc = self.make_doc(text)
for name, proc in self.pipeline: for name, proc in self.pipeline:
if name in disable: if name in disable:
continue continue
if not hasattr(proc, '__call__'):
raise ValueError(Errors.E003.format(component=type(proc), name=name))
doc = proc(doc) doc = proc(doc)
if doc is None:
raise ValueError(Errors.E005.format(name=name))
return doc return doc
def disable_pipes(self, *names): def disable_pipes(self, *names):
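
The hunk above makes `__call__` raise a `ValueError` once `len(text) >= self.max_length`, and the docstring recommends segmenting long texts first. One possible way to do that is sketched below. The helper `split_long_text` and the choice of blank lines as paragraph boundaries are assumptions for illustration, not part of this commit.

```python
def split_long_text(text, max_length=10**6):
    """Yield paragraph-based chunks that are each shorter than max_length.

    Hypothetical helper: paragraphs are assumed to be separated by a
    blank line. Chunks stay strictly below max_length because the check
    in Language.__call__ rejects len(text) >= max_length.
    """
    chunk = ""
    for para in text.split("\n\n"):
        candidate = para if not chunk else chunk + "\n\n" + para
        if len(candidate) < max_length:
            # Paragraph still fits: keep growing the current chunk.
            chunk = candidate
            continue
        if chunk:
            yield chunk
        if len(para) >= max_length:
            # A single paragraph is too long; it needs a finer split.
            raise ValueError("Paragraph of %d characters exceeds limit %d"
                             % (len(para), max_length))
        chunk = para
    if chunk:
        yield chunk


chunks = list(split_long_text("aaaa\n\nbbbb\n\ncc", max_length=11))
print(chunks)
```

Each chunk could then be passed to the pipeline separately, for example through `nlp.pipe(chunks)`.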
@ -384,8 +394,7 @@ class Language(object):
>>> state = nlp.update(docs, golds, sgd=optimizer) >>> state = nlp.update(docs, golds, sgd=optimizer)
""" """
if len(docs) != len(golds): if len(docs) != len(golds):
raise IndexError("Update expects same number of docs and golds " raise IndexError(Errors.E009.format(n_docs=len(docs), n_golds=len(golds)))
"Got: %d, %d" % (len(docs), len(golds)))
if len(docs) == 0: if len(docs) == 0:
return return
if sgd is None: if sgd is None:
@ -458,6 +467,8 @@ class Language(object):
else: else:
device = None device = None
link_vectors_to_models(self.vocab) link_vectors_to_models(self.vocab)
if self.vocab.vectors.data.shape[1]:
cfg['pretrained_vectors'] = self.vocab.vectors.name
if sgd is None: if sgd is None:
sgd = create_default_optimizer(Model.ops) sgd = create_default_optimizer(Model.ops)
self._optimizer = sgd self._optimizer = sgd
@ -626,9 +637,10 @@ class Language(object):
""" """
path = util.ensure_path(path) path = util.ensure_path(path)
deserializers = OrderedDict(( deserializers = OrderedDict((
('vocab', lambda p: self.vocab.from_disk(p)), ('meta.json', lambda p: self.meta.update(util.read_json(p))),
('vocab', lambda p: (
self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self))),
('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)), ('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)),
('meta.json', lambda p: self.meta.update(util.read_json(p)))
)) ))
for name, proc in self.pipeline: for name, proc in self.pipeline:
if name in disable: if name in disable:
@ -671,9 +683,10 @@ class Language(object):
RETURNS (Language): The `Language` object. RETURNS (Language): The `Language` object.
""" """
deserializers = OrderedDict(( deserializers = OrderedDict((
('vocab', lambda b: self.vocab.from_bytes(b)), ('meta', lambda b: self.meta.update(ujson.loads(b))),
('vocab', lambda b: (
self.vocab.from_bytes(b) and _fix_pretrained_vectors_name(self))),
('tokenizer', lambda b: self.tokenizer.from_bytes(b, vocab=False)), ('tokenizer', lambda b: self.tokenizer.from_bytes(b, vocab=False)),
('meta', lambda b: self.meta.update(ujson.loads(b)))
)) ))
for i, (name, proc) in enumerate(self.pipeline): for i, (name, proc) in enumerate(self.pipeline):
if name in disable: if name in disable:
@ -685,6 +698,27 @@ class Language(object):
return self return self
def _fix_pretrained_vectors_name(nlp):
# TODO: Replace this once we handle vectors consistently as static
# data
if 'vectors' in nlp.meta and nlp.meta['vectors'].get('name'):
nlp.vocab.vectors.name = nlp.meta['vectors']['name']
elif not nlp.vocab.vectors.size:
nlp.vocab.vectors.name = None
elif 'name' in nlp.meta and 'lang' in nlp.meta:
vectors_name = '%s_%s.vectors' % (nlp.meta['lang'], nlp.meta['name'])
nlp.vocab.vectors.name = vectors_name
else:
raise ValueError(Errors.E092)
if nlp.vocab.vectors.size != 0:
link_vectors_to_models(nlp.vocab)
for name, proc in nlp.pipeline:
if not hasattr(proc, 'cfg'):
continue
proc.cfg.setdefault('deprecation_fixes', {})
proc.cfg['deprecation_fixes']['vectors_name'] = nlp.vocab.vectors.name
class DisabledPipes(list): class DisabledPipes(list):
"""Manager for temporary pipeline disabling.""" """Manager for temporary pipeline disabling."""
def __init__(self, nlp, *names): def __init__(self, nlp, *names):
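
The order of checks in `_fix_pretrained_vectors_name` above can be read as a small decision procedure: an explicit name in the meta wins, an empty vectors table gets no name, and otherwise a name is derived from `lang` and `name`. A minimal sketch of just that decision, assuming a plain `meta` dict and an integer `vectors_size` in place of the `nlp` object; the function name `resolve_vectors_name` and the error text are hypothetical:

```python
def resolve_vectors_name(meta, vectors_size):
    """Return the vectors name implied by the meta data, or None.

    Hypothetical pure-function version of the branching in
    _fix_pretrained_vectors_name; it does not touch any model state.
    """
    vectors_meta = meta.get('vectors', {})
    if vectors_meta.get('name'):
        # 1. An explicit name stored in the meta takes precedence.
        return vectors_meta['name']
    if not vectors_size:
        # 2. No vectors loaded, so there is nothing to name.
        return None
    if 'name' in meta and 'lang' in meta:
        # 3. Fall back to a name derived from the model's identity.
        return '%s_%s.vectors' % (meta['lang'], meta['name'])
    # 4. Vectors exist but no name can be determined.
    raise ValueError("Cannot determine a name for the vectors")


print(resolve_vectors_name({'vectors': {'name': 'custom'}}, 100))
print(resolve_vectors_name({}, 0))
print(resolve_vectors_name({'lang': 'en', 'name': 'core_web_sm'}, 100))
```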
@ -711,14 +745,7 @@ class DisabledPipes(list):
if unexpected: if unexpected:
# Don't change the pipeline if we're raising an error. # Don't change the pipeline if we're raising an error.
self.nlp.pipeline = current self.nlp.pipeline = current
msg = ( raise ValueError(Errors.E008.format(names=unexpected))
"Some current components would be lost when restoring "
"previous pipeline state. If you added components after "
"calling nlp.disable_pipes(), you should remove them "
"explicitly with nlp.remove_pipe() before the pipeline is "
"restore. Names of the new components: %s"
)
raise ValueError(msg % unexpected)
self[:] = [] self[:] = []
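
Throughout this file the inline message strings are replaced by templates looked up on `Errors` and filled in with keyword arguments, e.g. `Errors.E001.format(name=name, opts=self.pipe_names)`. The sketch below shows the general pattern with a stand-in class. `ErrorCatalogue` and its message texts are assumptions for illustration and are not the contents of `spacy/errors.py`.

```python
class ErrorCatalogue(object):
    """Hypothetical stand-in for a central catalogue of message templates."""
    E001 = "No component '{name}' found in pipeline. Available names: {opts}"
    E002 = "Can't find factory for '{name}'."


def get_pipe(pipeline, name):
    """Look up a component by name in a list of (name, component) pairs."""
    for pipe_name, component in pipeline:
        if pipe_name == name:
            return component
    names = [pipe_name for pipe_name, _ in pipeline]
    # The template is filled by keyword, so call sites stay short and
    # the wording lives in one place.
    raise KeyError(ErrorCatalogue.E001.format(name=name, opts=names))


pipeline = [('tagger', len)]
print(get_pipe(pipeline, 'tagger') is len)
try:
    get_pipe(pipeline, 'ner')
except KeyError as err:
    print(err.args[0])
```

Keeping the templates in one class also lets the same message be reused by several call sites, as the diff does with `E001` in `get_pipe`, `add_pipe`, `replace_pipe`, `rename_pipe` and `remove_pipe`.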

@ -15,7 +15,7 @@ from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP
from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT, IS_CURRENCY, IS_OOV from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT, IS_CURRENCY, IS_OOV
from .attrs cimport PROB from .attrs cimport PROB
from .attrs import intify_attrs from .attrs import intify_attrs
from . import about from .errors import Errors
memset(&EMPTY_LEXEME, 0, sizeof(LexemeC)) memset(&EMPTY_LEXEME, 0, sizeof(LexemeC))
@ -37,7 +37,8 @@ cdef class Lexeme:
self.vocab = vocab self.vocab = vocab
self.orth = orth self.orth = orth
self.c = <LexemeC*><void*>vocab.get_by_orth(vocab.mem, orth) self.c = <LexemeC*><void*>vocab.get_by_orth(vocab.mem, orth)
assert self.c.orth == orth if self.c.orth != orth:
raise ValueError(Errors.E071.format(orth=orth, vocab_orth=self.c.orth))
def __richcmp__(self, other, int op): def __richcmp__(self, other, int op):
if other is None: if other is None:
@ -129,20 +130,25 @@ cdef class Lexeme:
lex_data = Lexeme.c_to_bytes(self.c) lex_data = Lexeme.c_to_bytes(self.c)
start = <const char*>&self.c.flags start = <const char*>&self.c.flags
end = <const char*>&self.c.sentiment + sizeof(self.c.sentiment) end = <const char*>&self.c.sentiment + sizeof(self.c.sentiment)
assert (end-start) == sizeof(lex_data.data), (end-start, sizeof(lex_data.data)) if (end-start) != sizeof(lex_data.data):
raise ValueError(Errors.E072.format(length=end-start,
bad_length=sizeof(lex_data.data)))
byte_string = b'\0' * sizeof(lex_data.data) byte_string = b'\0' * sizeof(lex_data.data)
byte_chars = <char*>byte_string byte_chars = <char*>byte_string
for i in range(sizeof(lex_data.data)): for i in range(sizeof(lex_data.data)):
byte_chars[i] = lex_data.data[i] byte_chars[i] = lex_data.data[i]
assert len(byte_string) == sizeof(lex_data.data), (len(byte_string), if len(byte_string) != sizeof(lex_data.data):
sizeof(lex_data.data)) raise ValueError(Errors.E072.format(length=len(byte_string),
bad_length=sizeof(lex_data.data)))
return byte_string return byte_string
def from_bytes(self, bytes byte_string): def from_bytes(self, bytes byte_string):
# This method doesn't really have a use-case --- wrote it for testing. # This method doesn't really have a use-case --- wrote it for testing.
# Possibly delete? It puts the Lexeme out of synch with the vocab. # Possibly delete? It puts the Lexeme out of synch with the vocab.
cdef SerializedLexemeC lex_data cdef SerializedLexemeC lex_data
assert len(byte_string) == sizeof(lex_data.data) if len(byte_string) != sizeof(lex_data.data):
raise ValueError(Errors.E072.format(length=len(byte_string),
bad_length=sizeof(lex_data.data)))
for i in range(len(byte_string)): for i in range(len(byte_string)):
lex_data.data[i] = byte_string[i] lex_data.data[i] = byte_string[i]
Lexeme.c_from_bytes(self.c, lex_data) Lexeme.c_from_bytes(self.c, lex_data)
@ -169,16 +175,13 @@ cdef class Lexeme:
def __get__(self): def __get__(self):
cdef int length = self.vocab.vectors_length cdef int length = self.vocab.vectors_length
if length == 0: if length == 0:
raise ValueError( raise ValueError(Errors.E010)
"Word vectors set to length 0. This may be because you "
"don't have a model installed or loaded, or because your "
"model doesn't include word vectors. For more info, see "
"the documentation: \n%s\n" % about.__docs_models__
)
return self.vocab.get_vector(self.c.orth) return self.vocab.get_vector(self.c.orth)
def __set__(self, vector): def __set__(self, vector):
assert len(vector) == self.vocab.vectors_length if len(vector) != self.vocab.vectors_length:
raise ValueError(Errors.E073.format(new_length=len(vector),
length=self.vocab.vectors_length))
self.vocab.set_vector(self.c.orth, vector) self.vocab.set_vector(self.c.orth, vector)
property rank: property rank:

@ -13,6 +13,8 @@ from .vocab cimport Vocab
from .tokens.doc cimport Doc from .tokens.doc cimport Doc
from .tokens.doc cimport get_token_attr from .tokens.doc cimport get_token_attr
from .attrs cimport ID, attr_id_t, NULL_ATTR from .attrs cimport ID, attr_id_t, NULL_ATTR
from .errors import Errors, TempErrors
from .attrs import IDS from .attrs import IDS
from .attrs import FLAG61 as U_ENT from .attrs import FLAG61 as U_ENT
from .attrs import FLAG60 as B2_ENT from .attrs import FLAG60 as B2_ENT
@ -321,6 +323,9 @@ cdef attr_t get_pattern_key(const TokenPatternC* pattern) nogil:
while pattern.nr_attr != 0: while pattern.nr_attr != 0:
pattern += 1 pattern += 1
id_attr = pattern[0].attrs[0] id_attr = pattern[0].attrs[0]
if id_attr.attr != ID:
with gil:
raise ValueError(Errors.E074.format(attr=ID, bad_attr=id_attr.attr))
return id_attr.value return id_attr.value
def _convert_strings(token_specs, string_store): def _convert_strings(token_specs, string_store):
@ -341,8 +346,8 @@ def _convert_strings(token_specs, string_store):
if value in operators: if value in operators:
ops = operators[value] ops = operators[value]
else: else:
msg = "Unknown operator '%s'. Options: %s" keys = ', '.join(operators.keys())
raise KeyError(msg % (value, ', '.join(operators.keys()))) raise KeyError(Errors.E011.format(op=value, opts=keys))
if isinstance(attr, basestring): if isinstance(attr, basestring):
attr = IDS.get(attr.upper()) attr = IDS.get(attr.upper())
if isinstance(value, basestring): if isinstance(value, basestring):
@ -429,9 +434,7 @@ cdef class Matcher:
""" """
for pattern in patterns: for pattern in patterns:
if len(pattern) == 0: if len(pattern) == 0:
msg = ("Cannot add pattern for zero tokens to matcher.\n" raise ValueError(Errors.E012.format(key=key))
"key: {key}\n")
raise ValueError(msg.format(key=key))
key = self._normalize_key(key) key = self._normalize_key(key)
for pattern in patterns: for pattern in patterns:
specs = _convert_strings(pattern, self.vocab.strings) specs = _convert_strings(pattern, self.vocab.strings)

@ -9,6 +9,7 @@ from .attrs import LEMMA, intify_attrs
from .parts_of_speech cimport SPACE from .parts_of_speech cimport SPACE
from .parts_of_speech import IDS as POS_IDS from .parts_of_speech import IDS as POS_IDS
from .lexeme cimport Lexeme from .lexeme cimport Lexeme
from .errors import Errors
def _normalize_props(props): def _normalize_props(props):
@ -93,7 +94,7 @@ cdef class Morphology:
cdef int assign_tag_id(self, TokenC* token, int tag_id) except -1: cdef int assign_tag_id(self, TokenC* token, int tag_id) except -1:
if tag_id > self.n_tags: if tag_id > self.n_tags:
raise ValueError("Unknown tag ID: %s" % tag_id) raise ValueError(Errors.E014.format(tag=tag_id))
# TODO: It's pretty arbitrary to put this logic here. I guess the # TODO: It's pretty arbitrary to put this logic here. I guess the
# justification is that this is where the specific word and the tag # justification is that this is where the specific word and the tag
# interact. Still, we should have a better way to enforce this rule, or # interact. Still, we should have a better way to enforce this rule, or
@ -129,7 +130,7 @@ cdef class Morphology:
tag (unicode): The part-of-speech tag to key the exception. tag (unicode): The part-of-speech tag to key the exception.
orth (unicode): The word-form to key the exception. orth (unicode): The word-form to key the exception.
""" """
# TODO: Currently we've assumed that we know the number of tags -- # TODO: Currently we've assumed that we know the number of tags --
# RichTagC is an array, and _cache is a PreshMapArray # RichTagC is an array, and _cache is a PreshMapArray
# This is really bad: it makes the morphology typed to the tagger # This is really bad: it makes the morphology typed to the tagger
# classes, which is all wrong. # classes, which is all wrong.
@ -147,9 +148,7 @@ cdef class Morphology:
elif force: elif force:
memset(cached, 0, sizeof(cached[0])) memset(cached, 0, sizeof(cached[0]))
else: else:
raise ValueError( raise ValueError(Errors.E015.format(tag=tag_str, orth=orth_str))
"Conflicting morphology exception for (%s, %s). Use "
"force=True to overwrite." % (tag_str, orth_str))
cached.tag = rich_tag cached.tag = rich_tag
# TODO: Refactor this to take arbitrary attributes. # TODO: Refactor this to take arbitrary attributes.

@ -8,7 +8,9 @@ cimport numpy as np
import cytoolz import cytoolz
from collections import OrderedDict from collections import OrderedDict
import ujson import ujson
import msgpack
from .util import msgpack
from .util import msgpack_numpy
from thinc.api import chain from thinc.api import chain
from thinc.v2v import Affine, SELU, Softmax from thinc.v2v import Affine, SELU, Softmax
@ -32,6 +34,7 @@ from .parts_of_speech import X
from ._ml import Tok2Vec, build_text_classifier, build_tagger_model from ._ml import Tok2Vec, build_text_classifier, build_tagger_model
from ._ml import link_vectors_to_models, zero_init, flatten from ._ml import link_vectors_to_models, zero_init, flatten
from ._ml import create_default_optimizer from ._ml import create_default_optimizer
from .errors import Errors, TempErrors
from . import util from . import util
@ -77,7 +80,7 @@ def merge_noun_chunks(doc):
RETURNS (Doc): The Doc object with merged noun chunks. RETURNS (Doc): The Doc object with merged noun chunks.
""" """
if not doc.is_parsed: if not doc.is_parsed:
return return doc
spans = [(np.start_char, np.end_char, np.root.tag, np.root.dep) spans = [(np.start_char, np.end_char, np.root.tag, np.root.dep)
for np in doc.noun_chunks] for np in doc.noun_chunks]
for start, end, tag, dep in spans: for start, end, tag, dep in spans:
@ -214,8 +217,10 @@ class Pipe(object):
def from_bytes(self, bytes_data, **exclude): def from_bytes(self, bytes_data, **exclude):
"""Load the pipe from a bytestring.""" """Load the pipe from a bytestring."""
def load_model(b): def load_model(b):
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
if self.model is True: if self.model is True:
self.cfg.setdefault('pretrained_dims', self.vocab.vectors_length)
self.model = self.Model(**self.cfg) self.model = self.Model(**self.cfg)
self.model.from_bytes(b) self.model.from_bytes(b)
@ -239,8 +244,10 @@ class Pipe(object):
def from_disk(self, path, **exclude): def from_disk(self, path, **exclude):
"""Load the pipe from disk.""" """Load the pipe from disk."""
def load_model(p): def load_model(p):
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
if self.model is True: if self.model is True:
self.cfg.setdefault('pretrained_dims', self.vocab.vectors_length)
self.model = self.Model(**self.cfg) self.model = self.Model(**self.cfg)
self.model.from_bytes(p.open('rb').read()) self.model.from_bytes(p.open('rb').read())
@ -298,7 +305,6 @@ class Tensorizer(Pipe):
self.model = model self.model = model
self.input_models = [] self.input_models = []
self.cfg = dict(cfg) self.cfg = dict(cfg)
self.cfg['pretrained_dims'] = self.vocab.vectors.data.shape[1]
self.cfg.setdefault('cnn_maxout_pieces', 3) self.cfg.setdefault('cnn_maxout_pieces', 3)
def __call__(self, doc): def __call__(self, doc):
@ -343,7 +349,8 @@ class Tensorizer(Pipe):
tensors (object): Vector representation for each token in the docs. tensors (object): Vector representation for each token in the docs.
""" """
for doc, tensor in zip(docs, tensors): for doc, tensor in zip(docs, tensors):
assert tensor.shape[0] == len(doc) if tensor.shape[0] != len(doc):
raise ValueError(Errors.E076.format(rows=tensor.shape[0], words=len(doc)))
doc.tensor = tensor doc.tensor = tensor
def update(self, docs, golds, state=None, drop=0., sgd=None, losses=None): def update(self, docs, golds, state=None, drop=0., sgd=None, losses=None):
@ -415,8 +422,6 @@ class Tagger(Pipe):
self.model = model self.model = model
self.cfg = OrderedDict(sorted(cfg.items())) self.cfg = OrderedDict(sorted(cfg.items()))
self.cfg.setdefault('cnn_maxout_pieces', 2) self.cfg.setdefault('cnn_maxout_pieces', 2)
self.cfg.setdefault('pretrained_dims',
self.vocab.vectors.data.shape[1])
@property @property
def labels(self): def labels(self):
@ -477,7 +482,7 @@ class Tagger(Pipe):
doc.extend_tensor(tensors[i].get()) doc.extend_tensor(tensors[i].get())
else: else:
doc.extend_tensor(tensors[i]) doc.extend_tensor(tensors[i])
doc.is_tagged = True doc.is_tagged = True
def update(self, docs, golds, drop=0., sgd=None, losses=None): def update(self, docs, golds, drop=0., sgd=None, losses=None):
if losses is not None and self.name not in losses: if losses is not None and self.name not in losses:
@ -527,8 +532,8 @@ class Tagger(Pipe):
vocab.morphology = Morphology(vocab.strings, new_tag_map, vocab.morphology = Morphology(vocab.strings, new_tag_map,
vocab.morphology.lemmatizer, vocab.morphology.lemmatizer,
exc=vocab.morphology.exc) exc=vocab.morphology.exc)
self.cfg['pretrained_vectors'] = kwargs.get('pretrained_vectors')
if self.model is True: if self.model is True:
self.cfg['pretrained_dims'] = self.vocab.vectors.data.shape[1]
self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg) self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
link_vectors_to_models(self.vocab) link_vectors_to_models(self.vocab)
if sgd is None: if sgd is None:
@ -537,6 +542,8 @@ class Tagger(Pipe):
@classmethod @classmethod
def Model(cls, n_tags, **cfg): def Model(cls, n_tags, **cfg):
if cfg.get('pretrained_dims') and not cfg.get('pretrained_vectors'):
raise ValueError(TempErrors.T008)
return build_tagger_model(n_tags, **cfg) return build_tagger_model(n_tags, **cfg)
def add_label(self, label, values=None): def add_label(self, label, values=None):
@ -552,9 +559,7 @@ class Tagger(Pipe):
# copy_array(larger.W[:smaller.nO], smaller.W) # copy_array(larger.W[:smaller.nO], smaller.W)
# copy_array(larger.b[:smaller.nO], smaller.b) # copy_array(larger.b[:smaller.nO], smaller.b)
# self.model._layers[-1] = larger # self.model._layers[-1] = larger
raise ValueError( raise ValueError(TempErrors.T003)
"Resizing pre-trained Tagger models is not "
"currently supported.")
tag_map = dict(self.vocab.morphology.tag_map) tag_map = dict(self.vocab.morphology.tag_map)
if values is None: if values is None:
values = {POS: "X"} values = {POS: "X"}
@ -584,6 +589,10 @@ class Tagger(Pipe):
def from_bytes(self, bytes_data, **exclude): def from_bytes(self, bytes_data, **exclude):
def load_model(b): def load_model(b):
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
if self.model is True: if self.model is True:
token_vector_width = util.env_opt( token_vector_width = util.env_opt(
'token_vector_width', 'token_vector_width',
@ -609,7 +618,6 @@ class Tagger(Pipe):
return self return self
def to_disk(self, path, **exclude): def to_disk(self, path, **exclude):
self.cfg.setdefault('pretrained_dims', self.vocab.vectors.data.shape[1])
tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items())) tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
serialize = OrderedDict(( serialize = OrderedDict((
('vocab', lambda p: self.vocab.to_disk(p)), ('vocab', lambda p: self.vocab.to_disk(p)),
@ -622,6 +630,9 @@ class Tagger(Pipe):
def from_disk(self, path, **exclude): def from_disk(self, path, **exclude):
def load_model(p): def load_model(p):
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
if self.model is True: if self.model is True:
self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg) self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
with p.open('rb') as file_: with p.open('rb') as file_:
@ -669,12 +680,9 @@ class MultitaskObjective(Tagger):
elif hasattr(target, '__call__'): elif hasattr(target, '__call__'):
self.make_label = target self.make_label = target
else: else:
raise ValueError("MultitaskObjective target should be function or " raise ValueError(Errors.E016)
"one of: dep, tag, ent, sent_start, dep_tag_offset, ent_tag.")
self.cfg = dict(cfg) self.cfg = dict(cfg)
self.cfg.setdefault('cnn_maxout_pieces', 2) self.cfg.setdefault('cnn_maxout_pieces', 2)
self.cfg.setdefault('pretrained_dims',
self.vocab.vectors.data.shape[1])
@property @property
def labels(self): def labels(self):
@ -723,7 +731,9 @@ class MultitaskObjective(Tagger):
return tokvecs, scores return tokvecs, scores
def get_loss(self, docs, golds, scores): def get_loss(self, docs, golds, scores):
assert len(docs) == len(golds) if len(docs) != len(golds):
raise ValueError(Errors.E077.format(value='loss', n_docs=len(docs),
n_golds=len(golds)))
cdef int idx = 0 cdef int idx = 0
correct = numpy.zeros((scores.shape[0],), dtype='i') correct = numpy.zeros((scores.shape[0],), dtype='i')
guesses = scores.argmax(axis=1) guesses = scores.argmax(axis=1)
@ -878,8 +888,8 @@ class TextCategorizer(Pipe):
name = 'textcat' name = 'textcat'
@classmethod @classmethod
def Model(cls, **cfg): def Model(cls, nr_class, **cfg):
return build_text_classifier(**cfg) return build_text_classifier(nr_class, **cfg)
def __init__(self, vocab, model=True, **cfg): def __init__(self, vocab, model=True, **cfg):
self.vocab = vocab self.vocab = vocab
@ -962,16 +972,16 @@ class TextCategorizer(Pipe):
self.labels.append(label) self.labels.append(label)
return 1 return 1
def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None): def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None,
**kwargs):
if pipeline and getattr(pipeline[0], 'name', None) == 'tensorizer': if pipeline and getattr(pipeline[0], 'name', None) == 'tensorizer':
token_vector_width = pipeline[0].model.nO token_vector_width = pipeline[0].model.nO
else: else:
token_vector_width = 64 token_vector_width = 64
if self.model is True: if self.model is True:
self.cfg['pretrained_dims'] = self.vocab.vectors_length self.cfg['pretrained_vectors'] = kwargs.get('pretrained_vectors')
self.cfg['nr_class'] = len(self.labels) self.model = self.Model(len(self.labels), **self.cfg)
self.cfg['width'] = token_vector_width
self.model = self.Model(**self.cfg)
link_vectors_to_models(self.vocab) link_vectors_to_models(self.vocab)
if sgd is None: if sgd is None:
sgd = self.create_optimizer() sgd = self.create_optimizer()
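
Note on the `load_model` hunks in this file: they add a fallback for models saved before this commit. When `cfg` carries a non-zero `pretrained_dims` but no `pretrained_vectors` key, the name of the vocab's vectors is filled in. A minimal standalone sketch of that rule follows, assuming a plain dict for the config and a hypothetical `vectors_name` argument in place of `self.vocab.vectors.name`; the helper name is made up for illustration, since the commit does this inline.

```python
def fill_pretrained_vectors(cfg, vectors_name):
    """Return a copy of cfg with 'pretrained_vectors' set for older configs.

    Hypothetical helper: the diff performs this check inline on self.cfg.
    """
    cfg = dict(cfg)
    # Old models stored only the vector width, never the vectors' name.
    if cfg.get('pretrained_dims') and 'pretrained_vectors' not in cfg:
        cfg['pretrained_vectors'] = vectors_name
    return cfg


old_style = fill_pretrained_vectors({'pretrained_dims': 300}, 'example.vectors')
new_style = fill_pretrained_vectors(
    {'pretrained_dims': 300, 'pretrained_vectors': 'custom.vectors'},
    'example.vectors')
no_vectors = fill_pretrained_vectors({'pretrained_dims': 0}, 'example.vectors')
```

A config that already names its vectors is left alone, and a config with `pretrained_dims` of 0 never gains the key.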


@ -2,6 +2,7 @@
from __future__ import division, print_function, unicode_literals from __future__ import division, print_function, unicode_literals
from .gold import tags_to_entities, GoldParse from .gold import tags_to_entities, GoldParse
from .errors import Errors
class PRFScore(object): class PRFScore(object):
@ -86,7 +87,6 @@ class Scorer(object):
def score(self, tokens, gold, verbose=False, punct_labels=('p', 'punct')): def score(self, tokens, gold, verbose=False, punct_labels=('p', 'punct')):
if len(tokens) != len(gold): if len(tokens) != len(gold):
gold = GoldParse.from_annot_tuples(tokens, zip(*gold.orig_annot)) gold = GoldParse.from_annot_tuples(tokens, zip(*gold.orig_annot))
assert len(tokens) == len(gold)
gold_deps = set() gold_deps = set()
gold_tags = set() gold_tags = set()
gold_ents = set(tags_to_entities([annot[-1] gold_ents = set(tags_to_entities([annot[-1]

View File

@ -13,6 +13,7 @@ from .symbols import IDS as SYMBOLS_BY_STR
from .symbols import NAMES as SYMBOLS_BY_INT from .symbols import NAMES as SYMBOLS_BY_INT
from .typedefs cimport hash_t from .typedefs cimport hash_t
from .compat import json_dumps from .compat import json_dumps
from .errors import Errors
from . import util from . import util
@ -59,7 +60,6 @@ cdef Utf8Str* _allocate(Pool mem, const unsigned char* chars, uint32_t length) e
string.p = <unsigned char*>mem.alloc(length + 1, sizeof(unsigned char)) string.p = <unsigned char*>mem.alloc(length + 1, sizeof(unsigned char))
string.p[0] = length string.p[0] = length
memcpy(&string.p[1], chars, length) memcpy(&string.p[1], chars, length)
assert string.s[0] >= sizeof(string.s) or string.s[0] == 0, string.s[0]
return string return string
else: else:
i = 0 i = 0
@ -69,7 +69,6 @@ cdef Utf8Str* _allocate(Pool mem, const unsigned char* chars, uint32_t length) e
string.p[i] = 255 string.p[i] = 255
string.p[n_length_bytes-1] = length % 255 string.p[n_length_bytes-1] = length % 255
memcpy(&string.p[n_length_bytes], chars, length) memcpy(&string.p[n_length_bytes], chars, length)
assert string.s[0] >= sizeof(string.s) or string.s[0] == 0, string.s[0]
return string return string
@ -115,7 +114,7 @@ cdef class StringStore:
self.hits.insert(key) self.hits.insert(key)
utf8str = <Utf8Str*>self._map.get(key) utf8str = <Utf8Str*>self._map.get(key)
if utf8str is NULL: if utf8str is NULL:
raise KeyError(string_or_id) raise KeyError(Errors.E018.format(hash_value=string_or_id))
else: else:
return decode_Utf8Str(utf8str) return decode_Utf8Str(utf8str)
@ -136,8 +135,7 @@ cdef class StringStore:
key = hash_utf8(string, len(string)) key = hash_utf8(string, len(string))
self._intern_utf8(string, len(string)) self._intern_utf8(string, len(string))
else: else:
raise TypeError( raise TypeError(Errors.E017.format(value_type=type(string)))
"Can only add unicode or bytes. Got type: %s" % type(string))
return key return key
def __len__(self): def __len__(self):
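
Note on the `Errors.E017.format(...)` calls in this file: the inline message strings move into a shared `Errors` class and are filled in with `str.format` at the raise site. The `errors` module itself is not part of this excerpt, so the class below is a hypothetical stand-in. Only the E017 wording is taken from the removed line; the `[E017]` prefix and the `add_string` helper are assumptions.

```python
class Errors(object):
    # Hypothetical stand-in for the real errors module, which is not shown here.
    E017 = "[E017] Can only add unicode or bytes. Got type: {value_type}"


def add_string(store, string):
    """Add a str/bytes value to a set, raising a coded TypeError otherwise."""
    if not isinstance(string, (str, bytes)):
        raise TypeError(Errors.E017.format(value_type=type(string)))
    store.add(string)
    return string


store = set()
add_string(store, 'hello')
try:
    add_string(store, 42)
except TypeError as err:
    message = str(err)
```

Keeping the templates in one place means the wording can change without touching every raise site, and the code gives users a stable string to search for.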


@ -10,6 +10,7 @@ from thinc.extra.search cimport MaxViolation
from .transition_system cimport TransitionSystem, Transition from .transition_system cimport TransitionSystem, Transition
from ..gold cimport GoldParse from ..gold cimport GoldParse
from ..errors import Errors
from .stateclass cimport StateC, StateClass from .stateclass cimport StateC, StateClass
@ -220,7 +221,8 @@ def get_states(pbeams, gbeams, beam_map, nr_update):
p_indices = [] p_indices = []
g_indices = [] g_indices = []
cdef Beam pbeam, gbeam cdef Beam pbeam, gbeam
assert len(pbeams) == len(gbeams) if len(pbeams) != len(gbeams):
raise ValueError(Errors.E079.format(pbeams=len(pbeams), gbeams=len(gbeams)))
for eg_id, (pbeam, gbeam) in enumerate(zip(pbeams, gbeams)): for eg_id, (pbeam, gbeam) in enumerate(zip(pbeams, gbeams)):
p_indices.append([]) p_indices.append([])
g_indices.append([]) g_indices.append([])
@ -228,7 +230,8 @@ def get_states(pbeams, gbeams, beam_map, nr_update):
state = StateClass.borrow(<StateC*>pbeam.at(i)) state = StateClass.borrow(<StateC*>pbeam.at(i))
if not state.is_final(): if not state.is_final():
key = tuple([eg_id] + pbeam.histories[i]) key = tuple([eg_id] + pbeam.histories[i])
assert key not in seen, (key, seen) if key in seen:
raise ValueError(Errors.E080.format(key=key))
seen[key] = len(states) seen[key] = len(states)
p_indices[-1].append(len(states)) p_indices[-1].append(len(states))
states.append(state) states.append(state)
@ -271,7 +274,8 @@ def get_gradient(nr_class, beam_maps, histories, losses):
for i in range(nr_step): for i in range(nr_step):
grads.append(numpy.zeros((max(beam_maps[i].values())+1, nr_class), grads.append(numpy.zeros((max(beam_maps[i].values())+1, nr_class),
dtype='f')) dtype='f'))
assert len(histories) == len(losses) if len(histories) != len(losses):
raise ValueError(Errors.E081.format(n_hist=len(histories), losses=len(losses)))
for eg_id, hists in enumerate(histories): for eg_id, hists in enumerate(histories):
for loss, hist in zip(losses[eg_id], hists): for loss, hist in zip(losses[eg_id], hists):
if loss == 0.0 or numpy.isnan(loss): if loss == 0.0 or numpy.isnan(loss):
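
Note on the `assert` replacements in this file: in plain Python, `assert` statements are skipped entirely when the interpreter runs with `-O`, and a failing one only reports the expression. An explicit check with `raise ValueError(...)` always runs and can state both lengths. A minimal sketch of the pattern follows; the message wording is an assumption, because the text of `Errors.E079` is not shown in this diff.

```python
def check_beam_counts(pbeams, gbeams):
    """Raise ValueError if the predicted and gold beam lists differ in length."""
    if len(pbeams) != len(gbeams):
        # Hypothetical wording; the real template lives in Errors.E079.
        raise ValueError(
            "Mismatched beams: {pbeams} predicted vs {gbeams} gold".format(
                pbeams=len(pbeams), gbeams=len(gbeams)))


check_beam_counts([1, 2], ['a', 'b'])  # equal lengths: no error
try:
    check_beam_counts([1, 2, 3], ['a'])
except ValueError as err:
    mismatch_message = str(err)
```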


@ -10,21 +10,25 @@ from collections import OrderedDict, defaultdict, Counter
from thinc.extra.search cimport Beam from thinc.extra.search cimport Beam
import json import json
from .nonproj import is_nonproj_tree
from ..typedefs cimport hash_t, attr_t
from ..strings cimport hash_string
from .stateclass cimport StateClass from .stateclass cimport StateClass
from ._state cimport StateC from ._state cimport StateC
from . import nonproj from . import nonproj
from .transition_system cimport move_cost_func_t, label_cost_func_t from .transition_system cimport move_cost_func_t, label_cost_func_t
from ..gold cimport GoldParse, GoldParseC from ..gold cimport GoldParse, GoldParseC
from ..structs cimport TokenC from ..structs cimport TokenC
from ..errors import Errors
# Calculate cost as gold/not gold. We don't use scalar value anyway. # Calculate cost as gold/not gold. We don't use scalar value anyway.
cdef int BINARY_COSTS = 1 cdef int BINARY_COSTS = 1
cdef weight_t MIN_SCORE = -90000
cdef attr_t SUBTOK_LABEL = hash_string('subtok')
DEF NON_MONOTONIC = True DEF NON_MONOTONIC = True
DEF USE_BREAK = True DEF USE_BREAK = True
cdef weight_t MIN_SCORE = -90000
# Break transition from here # Break transition from here
# http://www.aclweb.org/anthology/P13-1074 # http://www.aclweb.org/anthology/P13-1074
cdef enum: cdef enum:
@ -178,6 +182,8 @@ cdef class Reduce:
cdef class LeftArc: cdef class LeftArc:
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) nogil:
if label == SUBTOK_LABEL and st.S(0) != (st.B(0)-1):
return 0
sent_start = st._sent[st.B_(0).l_edge].sent_start sent_start = st._sent[st.B_(0).l_edge].sent_start
return sent_start != 1 return sent_start != 1
@ -214,6 +220,8 @@ cdef class RightArc:
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) nogil:
# If there's (perhaps partial) parse pre-set, don't allow cycle. # If there's (perhaps partial) parse pre-set, don't allow cycle.
if label == SUBTOK_LABEL and st.S(0) != (st.B(0)-1):
return 0
sent_start = st._sent[st.B_(0).l_edge].sent_start sent_start = st._sent[st.B_(0).l_edge].sent_start
return sent_start != 1 and st.H(st.S(0)) != st.B(0) return sent_start != 1 and st.H(st.S(0)) != st.B(0)
@ -364,6 +372,18 @@ cdef class ArcEager(TransitionSystem):
def __get__(self): def __get__(self):
return (SHIFT, REDUCE, LEFT, RIGHT, BREAK) return (SHIFT, REDUCE, LEFT, RIGHT, BREAK)
def get_cost(self, StateClass state, GoldParse gold, action):
cdef Transition t = self.lookup_transition(action)
if not t.is_valid(state.c, t.label):
return 9000
else:
return t.get_cost(state, &gold.c, t.label)
def transition(self, StateClass state, action):
cdef Transition t = self.lookup_transition(action)
t.do(state.c, t.label)
return state
def is_gold_parse(self, StateClass state, GoldParse gold): def is_gold_parse(self, StateClass state, GoldParse gold):
predicted = set() predicted = set()
truth = set() truth = set()
@ -435,7 +455,10 @@ cdef class ArcEager(TransitionSystem):
parses.append((prob, parse)) parses.append((prob, parse))
return parses return parses
cdef Transition lookup_transition(self, object name) except *: cdef Transition lookup_transition(self, object name_or_id) except *:
if isinstance(name_or_id, int):
return self.c[name_or_id]
name = name_or_id
if '-' in name: if '-' in name:
move_str, label_str = name.split('-', 1) move_str, label_str = name.split('-', 1)
label = self.strings[label_str] label = self.strings[label_str]
@ -455,6 +478,9 @@ cdef class ArcEager(TransitionSystem):
else: else:
return MOVE_NAMES[move] return MOVE_NAMES[move]
def class_name(self, int i):
return self.move_name(self.c[i].move, self.c[i].label)
cdef Transition init_transition(self, int clas, int move, attr_t label) except *: cdef Transition init_transition(self, int clas, int move, attr_t label) except *:
# TODO: Apparent Cython bug here when we try to use the Transition() # TODO: Apparent Cython bug here when we try to use the Transition()
# constructor with the function pointers # constructor with the function pointers
@ -484,7 +510,7 @@ cdef class ArcEager(TransitionSystem):
t.do = Break.transition t.do = Break.transition
t.get_cost = Break.cost t.get_cost = Break.cost
else: else:
raise Exception(move) raise ValueError(Errors.E019.format(action=move, src='arc_eager'))
return t return t
cdef int initialize_state(self, StateC* st) nogil: cdef int initialize_state(self, StateC* st) nogil:
@ -516,7 +542,10 @@ cdef class ArcEager(TransitionSystem):
is_valid[BREAK] = Break.is_valid(st, 0) is_valid[BREAK] = Break.is_valid(st, 0)
cdef int i cdef int i
for i in range(self.n_moves): for i in range(self.n_moves):
output[i] = is_valid[self.c[i].move] if self.c[i].label == SUBTOK_LABEL:
output[i] = self.c[i].is_valid(st, self.c[i].label)
else:
output[i] = is_valid[self.c[i].move]
cdef int set_costs(self, int* is_valid, weight_t* costs, cdef int set_costs(self, int* is_valid, weight_t* costs,
StateClass stcls, GoldParse gold) except -1: StateClass stcls, GoldParse gold) except -1:
@ -556,35 +585,13 @@ cdef class ArcEager(TransitionSystem):
is_valid[i] = False is_valid[i] = False
costs[i] = 9000 costs[i] = 9000
if n_gold < 1: if n_gold < 1:
# Check label set --- leading cause # Check projectivity --- leading cause
label_set = set([self.strings[self.c[i].label] for i in range(self.n_moves)]) if is_nonproj_tree(gold.heads):
for label_str in gold.labels: raise ValueError(Errors.E020)
if label_str is not None and label_str not in label_set:
raise ValueError("Cannot get gold parser action: unknown label: %s" % label_str)
# Check projectivity --- other leading cause
if nonproj.is_nonproj_tree(gold.heads):
raise ValueError(
"Could not find a gold-standard action to supervise the "
"dependency parser. Likely cause: the tree is "
"non-projective (i.e. it has crossing arcs -- see "
"spacy/syntax/nonproj.pyx for definitions). The ArcEager "
"transition system only supports projective trees. To "
"learn non-projective representations, transform the data "
"before training and after parsing. Either pass "
"make_projective=True to the GoldParse class, or use "
"spacy.syntax.nonproj.preprocess_training_data.")
else: else:
print(gold.orig_annot) failure_state = stcls.print_state(gold.words)
print(gold.words) raise ValueError(Errors.E021.format(n_actions=self.n_moves,
print(gold.heads) state=failure_state))
print(gold.labels)
print(gold.sent_starts)
raise ValueError(
"Could not find a gold-standard action to supervise the"
"dependency parser. The GoldParse was projective. The "
"transition system has %d actions. State at failure: %s"
% (self.n_moves, stcls.print_state(gold.words)))
assert n_gold >= 1
def get_beam_annot(self, Beam beam): def get_beam_annot(self, Beam beam):
length = (<StateC*>beam.at(0)).length length = (<StateC*>beam.at(0)).length
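
Note on the `lookup_transition` hunk in this file: the method now accepts either an integer class id, which indexes the transition table directly, or a `MOVE-label` name, which is split on the first hyphen. The sketch below mirrors that dispatch in plain Python. The list of `(move, label)` tuples is a hypothetical stand-in for the C array `self.c`, and the empty-string label for unlabelled moves is an assumption.

```python
def lookup_transition(transitions, name_or_id):
    """Find a transition by integer id or by 'MOVE-label' name."""
    if isinstance(name_or_id, int):
        return transitions[name_or_id]
    name = name_or_id
    if '-' in name:
        # Split once, so labels that contain hyphens stay intact.
        move, label = name.split('-', 1)
    else:
        move, label = name, ''
    for transition in transitions:
        if transition == (move, label):
            return transition
    raise KeyError(name)


table = [('S', ''), ('L', 'nsubj'), ('R', 'dobj')]
by_id = lookup_transition(table, 1)
by_name = lookup_transition(table, 'R-dobj')
unlabelled = lookup_transition(table, 'S')
```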


@ -10,6 +10,7 @@ from ._state cimport StateC
from .transition_system cimport Transition from .transition_system cimport Transition
from .transition_system cimport do_func_t from .transition_system cimport do_func_t
from ..gold cimport GoldParseC, GoldParse from ..gold cimport GoldParseC, GoldParse
from ..errors import Errors
cdef enum: cdef enum:
@ -81,9 +82,7 @@ cdef class BiluoPushDown(TransitionSystem):
for (ids, words, tags, heads, labels, biluo), _ in sents: for (ids, words, tags, heads, labels, biluo), _ in sents:
for i, ner_tag in enumerate(biluo): for i, ner_tag in enumerate(biluo):
if ner_tag != 'O' and ner_tag != '-': if ner_tag != 'O' and ner_tag != '-':
if ner_tag.count('-') != 1: _, label = ner_tag.split('-', 1)
raise ValueError(ner_tag)
_, label = ner_tag.split('-')
for action in (BEGIN, IN, LAST, UNIT): for action in (BEGIN, IN, LAST, UNIT):
actions[action][label] += 1 actions[action][label] += 1
return actions return actions
@ -170,7 +169,7 @@ cdef class BiluoPushDown(TransitionSystem):
if self.c[i].move == move and self.c[i].label == label: if self.c[i].move == move and self.c[i].label == label:
return self.c[i] return self.c[i]
else: else:
raise KeyError(name) raise KeyError(Errors.E022.format(name=name))
cdef Transition init_transition(self, int clas, int move, attr_t label) except *: cdef Transition init_transition(self, int clas, int move, attr_t label) except *:
# TODO: Apparent Cython bug here when we try to use the Transition() # TODO: Apparent Cython bug here when we try to use the Transition()
@ -205,7 +204,7 @@ cdef class BiluoPushDown(TransitionSystem):
t.do = Out.transition t.do = Out.transition
t.get_cost = Out.cost t.get_cost = Out.cost
else: else:
raise Exception(move) raise ValueError(Errors.E019.format(action=move, src='ner'))
return t return t
def add_action(self, int action, label_name, freq=None): def add_action(self, int action, label_name, freq=None):
@ -227,7 +226,6 @@ cdef class BiluoPushDown(TransitionSystem):
self._size *= 2 self._size *= 2
self.c = <Transition*>self.mem.realloc(self.c, self._size * sizeof(self.c[0])) self.c = <Transition*>self.mem.realloc(self.c, self._size * sizeof(self.c[0]))
self.c[self.n_moves] = self.init_transition(self.n_moves, action, label_id) self.c[self.n_moves] = self.init_transition(self.n_moves, action, label_id)
assert self.c[self.n_moves].label == label_id
self.n_moves += 1 self.n_moves += 1
if self.labels.get(action, []): if self.labels.get(action, []):
freq = min(0, min(self.labels[action].values())) freq = min(0, min(self.labels[action].values()))
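
Note on the `get_actions` hunk in this file: the old code rejected any BILUO tag with more than one hyphen, while `ner_tag.split('-', 1)` splits only at the first hyphen, so entity labels that themselves contain hyphens are accepted. A small sketch of the difference, with `split_biluo` as a hypothetical helper name and made-up example tags:

```python
def split_biluo(ner_tag):
    """Split a BILUO tag into (prefix, label) at the first hyphen only."""
    prefix, label = ner_tag.split('-', 1)
    return prefix, label


simple = split_biluo('B-ORG')
hyphenated = split_biluo('U-WORK-OF-ART')
# Without maxsplit the same tag yields four pieces and cannot be unpacked
# into two names.
naive_parts = 'U-WORK-OF-ART'.split('-')
```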


@ -35,6 +35,7 @@ from .._ml import link_vectors_to_models, create_default_optimizer
from ..compat import json_dumps, copy_array from ..compat import json_dumps, copy_array
from ..tokens.doc cimport Doc from ..tokens.doc cimport Doc
from ..gold cimport GoldParse from ..gold cimport GoldParse
from ..errors import Errors, TempErrors
from .. import util from .. import util
from .stateclass cimport StateClass from .stateclass cimport StateClass
from ._state cimport StateC from ._state cimport StateC
@ -244,7 +245,7 @@ cdef class Parser:
def Model(cls, nr_class, **cfg): def Model(cls, nr_class, **cfg):
depth = util.env_opt('parser_hidden_depth', cfg.get('hidden_depth', 1)) depth = util.env_opt('parser_hidden_depth', cfg.get('hidden_depth', 1))
if depth != 1: if depth != 1:
raise ValueError("Currently parser depth is hard-coded to 1.") raise ValueError(TempErrors.T004.format(value=depth))
parser_maxout_pieces = util.env_opt('parser_maxout_pieces', parser_maxout_pieces = util.env_opt('parser_maxout_pieces',
cfg.get('maxout_pieces', 2)) cfg.get('maxout_pieces', 2))
token_vector_width = util.env_opt('token_vector_width', token_vector_width = util.env_opt('token_vector_width',
@ -254,11 +255,12 @@ cdef class Parser:
hist_size = util.env_opt('history_feats', cfg.get('hist_size', 0)) hist_size = util.env_opt('history_feats', cfg.get('hist_size', 0))
hist_width = util.env_opt('history_width', cfg.get('hist_width', 0)) hist_width = util.env_opt('history_width', cfg.get('hist_width', 0))
if hist_size != 0: if hist_size != 0:
raise ValueError("Currently history size is hard-coded to 0") raise ValueError(TempErrors.T005.format(value=hist_size))
if hist_width != 0: if hist_width != 0:
raise ValueError("Currently history width is hard-coded to 0") raise ValueError(TempErrors.T006.format(value=hist_width))
pretrained_vectors = cfg.get('pretrained_vectors', None)
tok2vec = Tok2Vec(token_vector_width, embed_size, tok2vec = Tok2Vec(token_vector_width, embed_size,
pretrained_dims=cfg.get('pretrained_dims', 0)) pretrained_vectors=pretrained_vectors)
tok2vec = chain(tok2vec, flatten) tok2vec = chain(tok2vec, flatten)
lower = PrecomputableAffine(hidden_width, lower = PrecomputableAffine(hidden_width,
nF=cls.nr_feature, nI=token_vector_width, nF=cls.nr_feature, nI=token_vector_width,
@ -277,6 +279,7 @@ cdef class Parser:
'token_vector_width': token_vector_width, 'token_vector_width': token_vector_width,
'hidden_width': hidden_width, 'hidden_width': hidden_width,
'maxout_pieces': parser_maxout_pieces, 'maxout_pieces': parser_maxout_pieces,
'pretrained_vectors': pretrained_vectors,
'hist_size': hist_size, 'hist_size': hist_size,
'hist_width': hist_width 'hist_width': hist_width
} }
@ -296,9 +299,9 @@ cdef class Parser:
unless True (default), in which case a new instance is created with unless True (default), in which case a new instance is created with
`Parser.Moves()`. `Parser.Moves()`.
model (object): Defines how the parse-state is created, updated and model (object): Defines how the parse-state is created, updated and
evaluated. The value is set to the .model attribute unless True evaluated. The value is set to the .model attribute. If set to True
(default), in which case a new instance is created with (default), a new instance will be created with `Parser.Model()`
`Parser.Model()`. in parser.begin_training(), parser.from_disk() or parser.from_bytes().
**cfg: Arbitrary configuration parameters. Set to the `.cfg` attribute **cfg: Arbitrary configuration parameters. Set to the `.cfg` attribute
""" """
self.vocab = vocab self.vocab = vocab
@ -310,8 +313,7 @@ cdef class Parser:
cfg['beam_width'] = util.env_opt('beam_width', 1) cfg['beam_width'] = util.env_opt('beam_width', 1)
if 'beam_density' not in cfg: if 'beam_density' not in cfg:
cfg['beam_density'] = util.env_opt('beam_density', 0.0) cfg['beam_density'] = util.env_opt('beam_density', 0.0)
if 'pretrained_dims' not in cfg: cfg.setdefault('cnn_maxout_pieces', 3)
cfg['pretrained_dims'] = self.vocab.vectors.data.shape[1]
self.cfg = cfg self.cfg = cfg
self.model = model self.model = model
self._multitasks = [] self._multitasks = []
@ -551,8 +553,13 @@ cdef class Parser:
def update(self, docs, golds, drop=0., sgd=None, losses=None): def update(self, docs, golds, drop=0., sgd=None, losses=None):
if not any(self.moves.has_gold(gold) for gold in golds): if not any(self.moves.has_gold(gold) for gold in golds):
return None return None
assert len(docs) == len(golds) if len(docs) != len(golds):
if self.cfg.get('beam_width', 1) >= 2 and numpy.random.random() >= 0.5: raise ValueError(Errors.E077.format(value='update', n_docs=len(docs),
n_golds=len(golds)))
# The probability we use beam update, instead of falling back to
# a greedy update
beam_update_prob = 1-self.cfg.get('beam_update_prob', 0.5)
if self.cfg.get('beam_width', 1) >= 2 and numpy.random.random() >= beam_update_prob:
return self.update_beam(docs, golds, return self.update_beam(docs, golds,
self.cfg['beam_width'], self.cfg['beam_density'], self.cfg['beam_width'], self.cfg['beam_density'],
drop=drop, sgd=sgd, losses=losses) drop=drop, sgd=sgd, losses=losses)
@ -595,7 +602,6 @@ cdef class Parser:
scores, bp_scores = vec2scores.begin_update(vector, drop=drop) scores, bp_scores = vec2scores.begin_update(vector, drop=drop)
d_scores = self.get_batch_loss(states, golds, scores) d_scores = self.get_batch_loss(states, golds, scores)
d_scores /= len(docs)
d_vector = bp_scores(d_scores, sgd=sgd) d_vector = bp_scores(d_scores, sgd=sgd)
if drop != 0: if drop != 0:
d_vector *= mask d_vector *= mask
@ -620,7 +626,7 @@ cdef class Parser:
break break
self._make_updates(d_tokvecs, self._make_updates(d_tokvecs,
bp_tokvecs, backprops, sgd, cuda_stream) bp_tokvecs, backprops, sgd, cuda_stream)
def update_beam(self, docs, golds, width=None, density=None, def update_beam(self, docs, golds, width=None, density=None,
drop=0., sgd=None, losses=None): drop=0., sgd=None, losses=None):
if not any(self.moves.has_gold(gold) for gold in golds): if not any(self.moves.has_gold(gold) for gold in golds):
@ -634,7 +640,6 @@ cdef class Parser:
if losses is not None and self.name not in losses: if losses is not None and self.name not in losses:
losses[self.name] = 0. losses[self.name] = 0.
lengths = [len(d) for d in docs] lengths = [len(d) for d in docs]
assert min(lengths) >= 1
states = self.moves.init_batch(docs) states = self.moves.init_batch(docs)
for gold in golds: for gold in golds:
self.moves.preprocess_gold(gold) self.moves.preprocess_gold(gold)
@ -648,7 +653,6 @@ cdef class Parser:
backprop_lower = [] backprop_lower = []
cdef float batch_size = len(docs) cdef float batch_size = len(docs)
for i, d_scores in enumerate(states_d_scores): for i, d_scores in enumerate(states_d_scores):
d_scores /= batch_size
if losses is not None: if losses is not None:
losses[self.name] += (d_scores**2).sum() losses[self.name] += (d_scores**2).sum()
ids, bp_vectors, bp_scores = backprops[i] ids, bp_vectors, bp_scores = backprops[i]
@ -846,7 +850,6 @@ cdef class Parser:
self.moves.initialize_actions(actions) self.moves.initialize_actions(actions)
cfg.setdefault('token_vector_width', 128) cfg.setdefault('token_vector_width', 128)
if self.model is True: if self.model is True:
cfg['pretrained_dims'] = self.vocab.vectors_length
self.model, cfg = self.Model(self.moves.n_moves, **cfg) self.model, cfg = self.Model(self.moves.n_moves, **cfg)
if sgd is None: if sgd is None:
sgd = self.create_optimizer() sgd = self.create_optimizer()
@ -910,9 +913,11 @@ cdef class Parser:
} }
util.from_disk(path, deserializers, exclude) util.from_disk(path, deserializers, exclude)
if 'model' not in exclude: if 'model' not in exclude:
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
path = util.ensure_path(path) path = util.ensure_path(path)
if self.model is True: if self.model is True:
self.cfg.setdefault('pretrained_dims', self.vocab.vectors_length)
self.model, cfg = self.Model(**self.cfg) self.model, cfg = self.Model(**self.cfg)
else: else:
cfg = {} cfg = {}
@ -955,12 +960,13 @@ cdef class Parser:
)) ))
msg = util.from_bytes(bytes_data, deserializers, exclude) msg = util.from_bytes(bytes_data, deserializers, exclude)
if 'model' not in exclude: if 'model' not in exclude:
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
if self.model is True: if self.model is True:
self.model, cfg = self.Model(**self.cfg) self.model, cfg = self.Model(**self.cfg)
cfg['pretrained_dims'] = self.vocab.vectors_length
else: else:
cfg = {} cfg = {}
cfg['pretrained_dims'] = self.vocab.vectors_length
if 'tok2vec_model' in msg: if 'tok2vec_model' in msg:
self.model[0].from_bytes(msg['tok2vec_model']) self.model[0].from_bytes(msg['tok2vec_model'])
if 'lower_model' in msg: if 'lower_model' in msg:
@ -1033,15 +1039,11 @@ def _cleanup(Beam beam):
del state del state
seen.add(addr) seen.add(addr)
else: else:
print(i, addr) raise ValueError(Errors.E023.format(addr=addr, i=i))
print(seen)
raise Exception
addr = <size_t>beam._states[i].content addr = <size_t>beam._states[i].content
if addr not in seen: if addr not in seen:
state = <StateC*>addr state = <StateC*>addr
del state del state
seen.add(addr) seen.add(addr)
else: else:
print(i, addr) raise ValueError(Errors.E023.format(addr=addr, i=i))
print(seen)
raise Exception
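
Note on the `update` hunk in this file: the hard-coded 0.5 becomes a `beam_update_prob` setting. The code draws a uniform number in [0, 1) and uses the beam update when the draw is at least `1 - beam_update_prob`, which happens with probability `beam_update_prob`. The sketch below isolates that decision. It uses the standard library `random` module instead of `numpy.random` to stay dependency-free, and the function name is hypothetical.

```python
import random


def use_beam_update(cfg, rng=random):
    """Decide whether to run the beam update rather than the greedy one."""
    threshold = 1 - cfg.get('beam_update_prob', 0.5)
    # rng.random() is in [0, 1), so it is >= threshold with
    # probability beam_update_prob.
    return cfg.get('beam_width', 1) >= 2 and rng.random() >= threshold


always = use_beam_update({'beam_width': 4, 'beam_update_prob': 1.0})
never = use_beam_update({'beam_width': 4, 'beam_update_prob': 0.0})
greedy_only = use_beam_update({'beam_width': 1, 'beam_update_prob': 1.0})
```

With `beam_update_prob` at 1.0 the threshold is 0 and every draw passes; at 0.0 the threshold is 1.0 and no draw can reach it. A `beam_width` below 2 disables the beam update regardless.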


@ -10,6 +10,7 @@ from __future__ import unicode_literals
from copy import copy from copy import copy
from ..tokens.doc cimport Doc, set_children_from_heads from ..tokens.doc cimport Doc, set_children_from_heads
from ..errors import Errors
DELIMITER = '||' DELIMITER = '||'
@ -146,7 +147,10 @@ cpdef deprojectivize(Doc doc):
def _decorate(heads, proj_heads, labels): def _decorate(heads, proj_heads, labels):
# uses decoration scheme HEAD from Nivre & Nilsson 2005 # uses decoration scheme HEAD from Nivre & Nilsson 2005
assert(len(heads) == len(proj_heads) == len(labels)) if (len(heads) != len(proj_heads)) or (len(proj_heads) != len(labels)):
raise ValueError(Errors.E082.format(n_heads=len(heads),
n_proj_heads=len(proj_heads),
n_labels=len(labels)))
deco_labels = [] deco_labels = []
for tokenid, head in enumerate(heads): for tokenid, head in enumerate(heads):
if head != proj_heads[tokenid]: if head != proj_heads[tokenid]:

View File

@ -12,6 +12,7 @@ from ..structs cimport TokenC
from .stateclass cimport StateClass from .stateclass cimport StateClass
from ..typedefs cimport attr_t from ..typedefs cimport attr_t
from ..compat import json_dumps from ..compat import json_dumps
from ..errors import Errors
from .. import util from .. import util
@ -73,10 +74,7 @@ cdef class TransitionSystem:
action.do(state.c, action.label) action.do(state.c, action.label)
break break
else: else:
print(gold.words) raise ValueError(Errors.E024)
print(gold.ner)
print(history)
raise ValueError("Could not find gold move")
return history return history
cdef int initialize_state(self, StateC* state) nogil: cdef int initialize_state(self, StateC* state) nogil:
@ -123,17 +121,7 @@ cdef class TransitionSystem:
else: else:
costs[i] = 9000 costs[i] = 9000
if n_gold <= 0: if n_gold <= 0:
print(gold.words) raise ValueError(Errors.E024)
print(gold.ner)
print([gold.c.ner[i].clas for i in range(gold.length)])
print([gold.c.ner[i].move for i in range(gold.length)])
print([gold.c.ner[i].label for i in range(gold.length)])
print("Self labels",
[self.c[i].label for i in range(self.n_moves)])
raise ValueError(
"Could not find a gold-standard action to supervise "
"the entity recognizer. The transition system has "
"%d actions." % (self.n_moves))
def get_class_name(self, int clas): def get_class_name(self, int clas):
act = self.c[clas] act = self.c[clas]
@ -171,7 +159,6 @@ cdef class TransitionSystem:
self._size *= 2 self._size *= 2
self.c = <Transition*>self.mem.realloc(self.c, self._size * sizeof(self.c[0])) self.c = <Transition*>self.mem.realloc(self.c, self._size * sizeof(self.c[0]))
self.c[self.n_moves] = self.init_transition(self.n_moves, action, label_id) self.c[self.n_moves] = self.init_transition(self.n_moves, action, label_id)
assert self.c[self.n_moves].label == label_id
self.n_moves += 1 self.n_moves += 1
if self.labels.get(action, []): if self.labels.get(action, []):
new_freq = min(self.labels[action].values()) new_freq = min(self.labels[action].values())

View File

@ -19,7 +19,9 @@ _languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
_models = {'en': ['en_core_web_sm'], _models = {'en': ['en_core_web_sm'],
'de': ['de_core_news_md'], 'de': ['de_core_news_md'],
'fr': ['fr_core_news_sm'], 'fr': ['fr_core_news_sm'],
'xx': ['xx_ent_web_md']} 'xx': ['xx_ent_web_md'],
'en_core_web_md': ['en_core_web_md'],
'es_core_news_md': ['es_core_news_md']}
# only used for tests that require loading the models # only used for tests that require loading the models
@ -183,6 +185,9 @@ def pytest_addoption(parser):
for lang in _languages + ['all']: for lang in _languages + ['all']:
parser.addoption("--%s" % lang, action="store_true", help="Use %s models" % lang) parser.addoption("--%s" % lang, action="store_true", help="Use %s models" % lang)
for model in _models:
if model not in _languages:
parser.addoption("--%s" % model, action="store_true", help="Use %s model" % model)
def pytest_runtest_setup(item): def pytest_runtest_setup(item):
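
Note on the `pytest_addoption` hunk in this file: every language code (plus `all`) still gets a flag, and each key of `_models` that is not a language code now gets one too, which is what makes the new `en_core_web_md` and `es_core_news_md` entries selectable. The sketch below only computes the flag names and does not call pytest. The language list is shortened and assumed to contain `xx`; `option_names` is a hypothetical helper.

```python
def option_names(languages, models):
    """List the command-line flags the two loops would register."""
    names = ['--%s' % lang for lang in languages + ['all']]
    # Model keys that are not language codes get a flag of their own.
    names += ['--%s' % model for model in models if model not in languages]
    return names


languages = ['en', 'de', 'fr', 'xx']  # shortened; the real list is longer
models = {'en': ['en_core_web_sm'],
          'de': ['de_core_news_md'],
          'fr': ['fr_core_news_sm'],
          'xx': ['xx_ent_web_md'],
          'en_core_web_md': ['en_core_web_md'],
          'es_core_news_md': ['es_core_news_md']}
flags = option_names(languages, models)
```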


@ -0,0 +1,13 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize('string,lemma', [('affaldsgruppernes', 'affaldsgruppe'),
('detailhandelsstrukturernes', 'detailhandelsstruktur'),
('kolesterols', 'kolesterol'),
('åsyns', 'åsyn')])
def test_lemmatizer_lookup_assigns(da_tokenizer, string, lemma):
tokens = da_tokenizer(string)
assert tokens[0].lemma_ == lemma

View File

@ -1,9 +1,74 @@
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest
from ...vocab import Vocab from ...vocab import Vocab
from ...pipeline import DependencyParser from ...pipeline import DependencyParser
from ...tokens import Doc from ...tokens import Doc
from ...gold import GoldParse from ...gold import GoldParse
from ...syntax.nonproj import projectivize from ...syntax.nonproj import projectivize
from ...syntax.stateclass import StateClass
from ...syntax.arc_eager import ArcEager
def get_sequence_costs(M, words, heads, deps, transitions):
doc = Doc(Vocab(), words=words)
gold = GoldParse(doc, heads=heads, deps=deps)
state = StateClass(doc)
M.preprocess_gold(gold)
cost_history = []
for gold_action in transitions:
state_costs = {}
for i in range(M.n_moves):
name = M.class_name(i)
state_costs[name] = M.get_cost(state, gold, i)
M.transition(state, gold_action)
cost_history.append(state_costs)
return state, cost_history
@pytest.fixture
def vocab():
return Vocab()
@pytest.fixture
def arc_eager(vocab):
moves = ArcEager(vocab.strings, ArcEager.get_actions())
moves.add_action(2, 'left')
moves.add_action(3, 'right')
return moves
@pytest.fixture
def words():
return ['a', 'b']
@pytest.fixture
def doc(words, vocab):
if vocab is None:
vocab = Vocab()
return Doc(vocab, words=list(words))
@pytest.fixture
def gold(doc, words):
if len(words) == 2:
return GoldParse(doc, words=['a', 'b'], heads=[0, 0], deps=['ROOT', 'right'])
else:
raise NotImplementedError
@pytest.mark.xfail
def test_oracle_four_words(arc_eager, vocab):
words = ['a', 'b', 'c', 'd']
heads = [1, 1, 3, 3]
deps = ['left', 'ROOT', 'left', 'ROOT']
actions = ['L-left', 'B-ROOT', 'L-left']
state, cost_history = get_sequence_costs(arc_eager, words, heads, deps, actions)
assert state.is_final()
for i, state_costs in enumerate(cost_history):
# Check that the gold move has zero cost
assert state_costs[actions[i]] == 0.0, actions[i]
for other_action, cost in state_costs.items():
if other_action != actions[i]:
assert cost >= 1
annot_tuples = [ annot_tuples = [
(0, 'When', 'WRB', 11, 'advmod', 'O'), (0, 'When', 'WRB', 11, 'advmod', 'O'),

View File

@ -0,0 +1,12 @@
from __future__ import unicode_literals
import pytest
from ...util import load_model
@pytest.mark.models("en_core_web_md")
@pytest.mark.models("es_core_news_md")
def test_models_with_different_vectors():
nlp = load_model('en_core_web_md')
doc = nlp(u'hello world')
nlp2 = load_model('es_core_news_md')
doc2 = nlp2(u'hola')
doc = nlp(u'hello world')

View File

@ -0,0 +1,15 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from ...pipeline import EntityRecognizer
from ...vocab import Vocab
@pytest.mark.parametrize('label', ['U-JOB-NAME'])
def test_issue1967(label):
ner = EntityRecognizer(Vocab())
entry = ([0], ['word'], ['tag'], [0], ['dep'], [label])
gold_parses = [(None, [(entry, None)])]
ner.moves.get_actions(gold_parses=gold_parses)

View File

@ -17,6 +17,7 @@ def meta_data():
'email': 'email-in-fixture', 'email': 'email-in-fixture',
'url': 'url-in-fixture', 'url': 'url-in-fixture',
'license': 'license-in-fixture', 'license': 'license-in-fixture',
'vectors': {'width': 0, 'vectors': 0, 'keys': 0, 'name': None}
} }

View File

@ -10,8 +10,8 @@ from ..gold import GoldParse
def test_textcat_learns_multilabel(): def test_textcat_learns_multilabel():
random.seed(0) random.seed(5)
numpy.random.seed(0) numpy.random.seed(5)
docs = [] docs = []
nlp = English() nlp = English()
vocab = nlp.vocab vocab = nlp.vocab

View File

@ -1,4 +1,11 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from mock import Mock from mock import Mock
from ..vocab import Vocab
from ..tokens import Doc, Span, Token
from ..tokens.underscore import Underscore from ..tokens.underscore import Underscore
@ -51,3 +58,42 @@ def test_token_underscore_method():
None, None) None, None)
token._ = Underscore(Underscore.token_extensions, token, start=token.idx) token._ = Underscore(Underscore.token_extensions, token, start=token.idx)
assert token._.hello() == 'cheese' assert token._.hello() == 'cheese'
@pytest.mark.parametrize('obj', [Doc, Span, Token])
def test_doc_underscore_remove_extension(obj):
ext_name = 'to_be_removed'
obj.set_extension(ext_name, default=False)
assert obj.has_extension(ext_name)
obj.remove_extension(ext_name)
assert not obj.has_extension(ext_name)
@pytest.mark.parametrize('obj', [Doc, Span, Token])
def test_underscore_raises_for_dup(obj):
obj.set_extension('test', default=None)
with pytest.raises(ValueError):
obj.set_extension('test', default=None)
@pytest.mark.parametrize('invalid_kwargs', [
{'getter': None, 'setter': lambda: None},
{'default': None, 'method': lambda: None, 'getter': lambda: None},
{'setter': lambda: None},
{'default': None, 'method': lambda: None},
{'getter': True}])
def test_underscore_raises_for_invalid(invalid_kwargs):
invalid_kwargs['force'] = True
with pytest.raises(ValueError):
Doc.set_extension('test', **invalid_kwargs)
@pytest.mark.parametrize('valid_kwargs', [
{'getter': lambda: None},
{'getter': lambda: None, 'setter': lambda: None},
{'default': 'hello'},
{'default': None},
{'method': lambda: None}])
def test_underscore_accepts_valid(valid_kwargs):
valid_kwargs['force'] = True
Doc.set_extension('test', **valid_kwargs)
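
Note on the cases above: the two parametrized lists pin down which keyword combinations `set_extension` is expected to accept or reject. The diff does not include `get_ext_args` itself, so the following is only a sketch of a validator that is consistent with those cases. The name `validate_ext_args` and the error messages are hypothetical and are not taken from the library.

```python
def validate_ext_args(**kwargs):
    """Hypothetical validator mirroring the test cases; not the library's code."""
    kwargs = dict(kwargs)
    # 'force' only controls overwriting an existing extension, not validity.
    kwargs.pop('force', None)
    default = kwargs.get('default')
    method = kwargs.get('method')
    getter = kwargs.get('getter')
    setter = kwargs.get('setter')
    # A setter is only meaningful together with a getter.
    if setter is not None and getter is None:
        raise ValueError("setter requires a getter")
    # Exactly one of default / method / getter may be given. 'default' is
    # counted by presence, so that default=None remains a valid choice.
    nr_defined = sum(['default' in kwargs, method is not None, getter is not None])
    if nr_defined != 1:
        raise ValueError("expected exactly one of default, method, getter; "
                         "got %d" % nr_defined)
    # Anything that will be called later must be callable.
    for name, value in (('method', method), ('getter', getter), ('setter', setter)):
        if value is not None and not callable(value):
            raise ValueError("%s must be callable" % name)
    return (default, method, getter, setter)
```

Counting `default` by key presence rather than by value is what lets `{'default': None}` pass while `{'default': None, 'method': ...}` fails, which is the distinction the two test lists draw.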

View File

@ -28,12 +28,38 @@ def vectors():
def data(): def data():
return numpy.asarray([[0.0, 1.0, 2.0], [3.0, -2.0, 4.0]], dtype='f') return numpy.asarray([[0.0, 1.0, 2.0], [3.0, -2.0, 4.0]], dtype='f')
@pytest.fixture
def resize_data():
return numpy.asarray([[0.0, 1.0], [2.0, 3.0]], dtype='f')
@pytest.fixture() @pytest.fixture()
def vocab(en_vocab, vectors): def vocab(en_vocab, vectors):
add_vecs_to_vocab(en_vocab, vectors) add_vecs_to_vocab(en_vocab, vectors)
return en_vocab return en_vocab
def test_init_vectors_with_resize_shape(strings,resize_data):
v = Vectors(shape=(len(strings), 3))
v.resize(shape=resize_data.shape)
assert v.shape == resize_data.shape
assert v.shape != (len(strings), 3)
def test_init_vectors_with_resize_data(data,resize_data):
v = Vectors(data=data)
v.resize(shape=resize_data.shape)
assert v.shape == resize_data.shape
assert v.shape != data.shape
def test_get_vector_resize(strings, data,resize_data):
v = Vectors(data=data)
v.resize(shape=resize_data.shape)
strings = [hash_string(s) for s in strings]
for i, string in enumerate(strings):
v.add(string, row=i)
assert list(v[strings[0]]) == list(resize_data[0])
assert list(v[strings[0]]) != list(resize_data[1])
assert list(v[strings[1]]) != list(resize_data[0])
assert list(v[strings[1]]) == list(resize_data[1])
def test_init_vectors_with_data(strings, data): def test_init_vectors_with_data(strings, data):
v = Vectors(data=data) v = Vectors(data=data)

View File

@ -13,6 +13,7 @@ cimport cython
from .tokens.doc cimport Doc from .tokens.doc cimport Doc
from .strings cimport hash_string from .strings cimport hash_string
from .errors import Errors, Warnings, deprecation_warning
from . import util from . import util
@ -63,11 +64,7 @@ cdef class Tokenizer:
return (self.__class__, args, None, None) return (self.__class__, args, None, None)
cpdef Doc tokens_from_list(self, list strings): cpdef Doc tokens_from_list(self, list strings):
util.deprecated( deprecation_warning(Warnings.W002)
"Tokenizer.from_list is now deprecated. Create a new Doc "
"object instead and pass in the strings as the `words` keyword "
"argument, for example:\nfrom spacy.tokens import Doc\n"
"doc = Doc(nlp.vocab, words=[...])")
return Doc(self.vocab, words=strings) return Doc(self.vocab, words=strings)
@cython.boundscheck(False) @cython.boundscheck(False)
@ -78,8 +75,7 @@ cdef class Tokenizer:
RETURNS (Doc): A container for linguistic annotations. RETURNS (Doc): A container for linguistic annotations.
""" """
if len(string) >= (2 ** 30): if len(string) >= (2 ** 30):
msg = "String is too long: %d characters. Max is 2**30." raise ValueError(Errors.E025.format(length=len(string)))
raise ValueError(msg % len(string))
cdef int length = len(string) cdef int length = len(string)
cdef Doc doc = Doc(self.vocab) cdef Doc doc = Doc(self.vocab)
if length == 0: if length == 0:
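
Note on the hunks above: inline message strings are replaced by named entries such as `Errors.E025`, formatted at the raise site. The error module is not part of this section, so the following is only a sketch of the pattern. The class body is a hypothetical stand-in: the wording comes from the message removed in this hunk, the `[E025]` prefix is an assumption, and `check_length` is an illustrative helper, not a library function.

```python
class Errors(object):
    # Hypothetical stand-in for the real error catalogue. The wording is
    # taken from the removed inline message; the "[E025]" prefix is assumed.
    E025 = "[E025] String is too long: {length} characters. Max is 2**30."


def check_length(string, max_length=2 ** 30):
    """Raise ValueError if `string` reaches the limit, else return its length."""
    if len(string) >= max_length:
        # The template lives in one place; call sites only supply the values.
        raise ValueError(Errors.E025.format(length=len(string)))
    return len(string)
```

Keeping the templates in one class means a message can be reworded or looked up by its code without touching every call site.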

View File

@ -0,0 +1,129 @@
# coding: utf8
# cython: infer_types=True
# cython: boundscheck=False
# cython: profile=True
from __future__ import unicode_literals
from libc.string cimport memcpy, memset
from .doc cimport Doc, set_children_from_heads, token_by_start, token_by_end
from .span cimport Span
from .token cimport Token
from ..lexeme cimport Lexeme, EMPTY_LEXEME
from ..structs cimport LexemeC, TokenC
from ..attrs cimport *
cdef class Retokenizer:
'''Helper class for doc.retokenize() context manager.'''
cdef Doc doc
cdef list merges
cdef list splits
def __init__(self, doc):
self.doc = doc
self.merges = []
self.splits = []
def merge(self, Span span, attrs=None):
'''Mark a span for merging. The attrs will be applied to the resulting
token.'''
self.merges.append((span.start_char, span.end_char, attrs))
def split(self, Token token, orths, attrs=None):
'''Mark a Token for splitting, into the specified orths. The attrs
will be applied to each subtoken.'''
self.splits.append((token.start_char, orths, attrs))
def __enter__(self):
self.merges = []
self.splits = []
return self
def __exit__(self, *args):
# Do the actual merging here
for start_char, end_char, attrs in self.merges:
start = token_by_start(self.doc.c, self.doc.length, start_char)
end = token_by_end(self.doc.c, self.doc.length, end_char)
_merge(self.doc, start, end+1, attrs)
for start_char, orths, attrs in self.splits:
raise NotImplementedError
def _merge(Doc doc, int start, int end, attributes):
"""Retokenize the document, such that the span `doc[start : end]` is
merged into a single token.
doc (Doc): The document to modify in place.
start (int): Token index of the start of the span to merge.
end (int): Token index after the end of the span to merge.
attributes (dict): Attributes to assign to the merged token. By default,
attributes are inherited from the syntactic root of the span.
RETURNS (Token): The newly merged token.
"""
cdef Span span = doc[start:end]
cdef int start_char = span.start_char
cdef int end_char = span.end_char
# Get LexemeC for newly merged token
new_orth = ''.join([t.text_with_ws for t in span])
if span[-1].whitespace_:
new_orth = new_orth[:-len(span[-1].whitespace_)]
cdef const LexemeC* lex = doc.vocab.get(doc.mem, new_orth)
# House the new merged token where it starts
cdef TokenC* token = &doc.c[start]
token.spacy = doc.c[end-1].spacy
for attr_name, attr_value in attributes.items():
if attr_name == TAG:
doc.vocab.morphology.assign_tag(token, attr_value)
else:
Token.set_struct_attr(token, attr_name, attr_value)
# Make sure ent_iob remains consistent
if doc.c[end].ent_iob == 1 and token.ent_iob in (0, 2):
if token.ent_type == doc.c[end].ent_type:
token.ent_iob = 3
else:
# If they're not the same entity type, let them be two entities
doc.c[end].ent_iob = 3
# Begin by setting all the head indices to absolute token positions
# This is easier to work with for now than the offsets
# Before thinking of something simpler, beware the case where a
# dependency bridges over the entity. Here the alignment of the
# tokens changes.
span_root = span.root.i
token.dep = span.root.dep
# We update token.lex after keeping span root and dep, since
# setting token.lex will change span.start and span.end properties
# as it modifies the character offsets in the doc
token.lex = lex
for i in range(doc.length):
doc.c[i].head += i
# Set the head of the merged token, and its dep relation, from the Span
token.head = doc.c[span_root].head
# Adjust deps before shrinking tokens
# Tokens which point into the merged token should now point to it
# Subtract the offset from all tokens which point to >= end
offset = (end - start) - 1
for i in range(doc.length):
head_idx = doc.c[i].head
if start <= head_idx < end:
doc.c[i].head = start
elif head_idx >= end:
doc.c[i].head -= offset
# Now compress the token array
for i in range(end, doc.length):
doc.c[i - offset] = doc.c[i]
for i in range(doc.length - offset, doc.length):
memset(&doc.c[i], 0, sizeof(TokenC))
doc.c[i].lex = &EMPTY_LEXEME
doc.length -= offset
for i in range(doc.length):
# ...And, set heads back to a relative position
doc.c[i].head -= i
# Set the left/right children, left/right edges
set_children_from_heads(doc.c, doc.length)
# Clear the cached Python objects
# Return the merged Python object
return doc[start]
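
Note on `_merge` above: the head bookkeeping (relative offsets to absolute positions, remap, compress, back to relative offsets) can be followed more easily on plain lists. The following is a pure-Python sketch of that step only. `merge_heads` is a hypothetical helper, and passing the span root in as `root` is a simplification: the diff obtains it from `span.root.i`.

```python
def merge_heads(heads, start, end, root):
    """Merge tokens [start, end) and return the new relative head offsets.

    heads: relative offsets (head index minus token index), one per token.
    root: index of the span's syntactic root (assumed to be supplied by
          the caller in this sketch).
    """
    n = len(heads)
    # 1. Relative offsets -> absolute token positions.
    absolute = [heads[i] + i for i in range(n)]
    # 2. The merged token lives at `start` and takes the span root's head.
    absolute[start] = absolute[root]
    # 3. Heads pointing into the span now point at the merged token; heads
    #    pointing at or past `end` shift left by the number of removed tokens.
    offset = (end - start) - 1
    for i in range(n):
        if start <= absolute[i] < end:
            absolute[i] = start
        elif absolute[i] >= end:
            absolute[i] -= offset
    # 4. Compress: drop the tokens absorbed into the merged token.
    absolute = absolute[:start + 1] + absolute[end:]
    # 5. Absolute positions -> relative offsets again.
    return [head - i for i, head in enumerate(absolute)]


# "New York is big": New -> York, York -> is, is -> is (root), big -> is
print(merge_heads([1, 1, 0, -1], start=0, end=2, root=1))  # [1, 0, -1]
```

After merging "New York", the merged token attaches to "is", "is" remains its own head, and "big" still attaches to "is".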

View File

@ -28,6 +28,8 @@ cdef int token_by_start(const TokenC* tokens, int length, int start_char) except
cdef int token_by_end(const TokenC* tokens, int length, int end_char) except -2 cdef int token_by_end(const TokenC* tokens, int length, int end_char) except -2
cdef int set_children_from_heads(TokenC* tokens, int length) except -1
cdef class Doc: cdef class Doc:
cdef readonly Pool mem cdef readonly Pool mem
cdef readonly Vocab vocab cdef readonly Vocab vocab

View File

@ -31,18 +31,19 @@ from ..attrs cimport ENT_TYPE, SENT_START
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
from ..util import normalize_slice from ..util import normalize_slice
from ..compat import is_config, copy_reg, pickle, basestring_ from ..compat import is_config, copy_reg, pickle, basestring_
from .. import about from ..errors import Errors, Warnings, deprecation_warning
from .. import util from .. import util
from .underscore import Underscore from .underscore import Underscore, get_ext_args
from ._retokenize import Retokenizer
DEF PADDING = 5 DEF PADDING = 5
cdef int bounds_check(int i, int length, int padding) except -1: cdef int bounds_check(int i, int length, int padding) except -1:
if (i + padding) < 0: if (i + padding) < 0:
raise IndexError raise IndexError(Errors.E026.format(i=i, length=length))
if (i - padding) >= length: if (i - padding) >= length:
raise IndexError raise IndexError(Errors.E026.format(i=i, length=length))
cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil: cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
@ -94,11 +95,10 @@ cdef class Doc:
spaces=[True, False, False]) spaces=[True, False, False])
""" """
@classmethod @classmethod
def set_extension(cls, name, default=None, method=None, def set_extension(cls, name, **kwargs):
getter=None, setter=None): if cls.has_extension(name) and not kwargs.get('force', False):
nr_defined = sum(t is not None for t in (default, getter, setter, method)) raise ValueError(Errors.E090.format(name=name, obj='Doc'))
assert nr_defined == 1 Underscore.doc_extensions[name] = get_ext_args(**kwargs)
Underscore.doc_extensions[name] = (default, method, getter, setter)
@classmethod @classmethod
def get_extension(cls, name): def get_extension(cls, name):
@ -108,6 +108,12 @@ cdef class Doc:
def has_extension(cls, name): def has_extension(cls, name):
return name in Underscore.doc_extensions return name in Underscore.doc_extensions
@classmethod
def remove_extension(cls, name):
if not cls.has_extension(name):
raise ValueError(Errors.E046.format(name=name))
return Underscore.doc_extensions.pop(name)
def __init__(self, Vocab vocab, words=None, spaces=None, user_data=None, def __init__(self, Vocab vocab, words=None, spaces=None, user_data=None,
orths_and_spaces=None): orths_and_spaces=None):
"""Create a Doc object. """Create a Doc object.
@ -154,11 +160,7 @@ cdef class Doc:
if spaces is None: if spaces is None:
spaces = [True] * len(words) spaces = [True] * len(words)
elif len(spaces) != len(words): elif len(spaces) != len(words):
raise ValueError( raise ValueError(Errors.E027)
"Arguments 'words' and 'spaces' should be sequences of "
"the same length, or 'spaces' should be left default at "
"None. spaces should be a sequence of booleans, with True "
"meaning that the word owns a ' ' character following it.")
orths_and_spaces = zip(words, spaces) orths_and_spaces = zip(words, spaces)
if orths_and_spaces is not None: if orths_and_spaces is not None:
for orth_space in orths_and_spaces: for orth_space in orths_and_spaces:
@ -166,10 +168,7 @@ cdef class Doc:
orth = orth_space orth = orth_space
has_space = True has_space = True
elif isinstance(orth_space, bytes): elif isinstance(orth_space, bytes):
raise ValueError( raise ValueError(Errors.E028.format(value=orth_space))
"orths_and_spaces expects either List(unicode) or "
"List((unicode, bool)). "
"Got bytes instance: %s" % (str(orth_space)))
else: else:
orth, has_space = orth_space orth, has_space = orth_space
# Note that we pass self.mem here --- we have ownership, if LexemeC # Note that we pass self.mem here --- we have ownership, if LexemeC
@ -319,7 +318,7 @@ cdef class Doc:
break break
else: else:
return 1.0 return 1.0
if self.vector_norm == 0 or other.vector_norm == 0: if self.vector_norm == 0 or other.vector_norm == 0:
return 0.0 return 0.0
return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm) return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)
@ -437,10 +436,7 @@ cdef class Doc:
if token.ent_iob == 1: if token.ent_iob == 1:
if start == -1: if start == -1:
seq = ['%s|%s' % (t.text, t.ent_iob_) for t in self[i-5:i+5]] seq = ['%s|%s' % (t.text, t.ent_iob_) for t in self[i-5:i+5]]
raise ValueError( raise ValueError(Errors.E093.format(seq=' '.join(seq)))
"token.ent_iob values make invalid sequence: "
"I without B\n"
"{seq}".format(seq=' '.join(seq)))
elif token.ent_iob == 2 or token.ent_iob == 0: elif token.ent_iob == 2 or token.ent_iob == 0:
if start != -1: if start != -1:
output.append(Span(self, start, i, label=label)) output.append(Span(self, start, i, label=label))
@ -503,19 +499,16 @@ cdef class Doc:
""" """
def __get__(self): def __get__(self):
if not self.is_parsed: if not self.is_parsed:
raise ValueError( raise ValueError(Errors.E029)
"noun_chunks requires the dependency parse, which "
"requires a statistical model to be installed and loaded. "
"For more info, see the "
"documentation: \n%s\n" % about.__docs_models__)
# Accumulate the result before beginning to iterate over it. This # Accumulate the result before beginning to iterate over it. This
# prevents the tokenisation from being changed out from under us # prevents the tokenisation from being changed out from under us
# during the iteration. The tricky thing here is that Span accepts # during the iteration. The tricky thing here is that Span accepts
# its tokenisation changing, so it's okay once we have the Span # its tokenisation changing, so it's okay once we have the Span
# objects. See Issue #375. # objects. See Issue #375.
spans = [] spans = []
for start, end, label in self.noun_chunks_iterator(self): if self.noun_chunks_iterator is not None:
spans.append(Span(self, start, end, label=label)) for start, end, label in self.noun_chunks_iterator(self):
spans.append(Span(self, start, end, label=label))
for span in spans: for span in spans:
yield span yield span
@ -532,12 +525,7 @@ cdef class Doc:
""" """
def __get__(self): def __get__(self):
if not self.is_sentenced: if not self.is_sentenced:
raise ValueError( raise ValueError(Errors.E030)
"Sentence boundaries unset. You can add the 'sentencizer' "
"component to the pipeline with: "
"nlp.add_pipe(nlp.create_pipe('sentencizer')) "
"Alternatively, add the dependency parser, or set "
"sentence boundaries by setting doc[i].sent_start")
if 'sents' in self.user_hooks: if 'sents' in self.user_hooks:
yield from self.user_hooks['sents'](self) yield from self.user_hooks['sents'](self)
else: else:
@ -567,7 +555,8 @@ cdef class Doc:
t.idx = (t-1).idx + (t-1).lex.length + (t-1).spacy t.idx = (t-1).idx + (t-1).lex.length + (t-1).spacy
t.l_edge = self.length t.l_edge = self.length
t.r_edge = self.length t.r_edge = self.length
assert t.lex.orth != 0 if t.lex.orth == 0:
raise ValueError(Errors.E031.format(i=self.length))
t.spacy = has_space t.spacy = has_space
self.length += 1 self.length += 1
return t.idx + t.lex.length + t.spacy return t.idx + t.lex.length + t.spacy
@ -683,13 +672,7 @@ cdef class Doc:
def from_array(self, attrs, array): def from_array(self, attrs, array):
if SENT_START in attrs and HEAD in attrs: if SENT_START in attrs and HEAD in attrs:
raise ValueError( raise ValueError(Errors.E032)
"Conflicting attributes specified in doc.from_array(): "
"(HEAD, SENT_START)\n"
"The HEAD attribute currently sets sentence boundaries "
"implicitly, based on the tree structure. This means the HEAD "
"attribute would potentially override the sentence boundaries "
"set by SENT_START.")
cdef int i, col cdef int i, col
cdef attr_id_t attr_id cdef attr_id_t attr_id
cdef TokenC* tokens = self.c cdef TokenC* tokens = self.c
@ -827,7 +810,7 @@ cdef class Doc:
RETURNS (Doc): Itself. RETURNS (Doc): Itself.
""" """
if self.length != 0: if self.length != 0:
raise ValueError("Cannot load into non-empty Doc") raise ValueError(Errors.E033.format(length=self.length))
deserializers = { deserializers = {
'text': lambda b: None, 'text': lambda b: None,
'array_head': lambda b: None, 'array_head': lambda b: None,
@ -878,7 +861,7 @@ cdef class Doc:
computed by the models in the pipeline. Let's say a computed by the models in the pipeline. Let's say a
document with 30 words has a tensor with 128 dimensions document with 30 words has a tensor with 128 dimensions
per word. doc.tensor.shape will be (30, 128). After per word. doc.tensor.shape will be (30, 128). After
calling doc.extend_tensor with an array of hape (30, 64), calling doc.extend_tensor with an array of shape (30, 64),
doc.tensor.shape == (30, 192). doc.tensor.shape == (30, 192).
''' '''
xp = get_array_module(self.tensor) xp = get_array_module(self.tensor)
@ -888,6 +871,18 @@ cdef class Doc:
else: else:
self.tensor = xp.hstack((self.tensor, tensor)) self.tensor = xp.hstack((self.tensor, tensor))
def retokenize(self):
'''Context manager to handle retokenization of the Doc.
Modifications to the Doc's tokenization are stored, and then
made all at once when the context manager exits. This is
much more efficient and less error-prone than applying each
modification as it is requested.
All views of the Doc (Span and Token) created before the
retokenization are invalidated, although they may accidentally
continue to work.
'''
return Retokenizer(self)
def merge(self, int start_idx, int end_idx, *args, **attributes): def merge(self, int start_idx, int end_idx, *args, **attributes):
"""Retokenize the document, such that the span at """Retokenize the document, such that the span at
`doc.text[start_idx : end_idx]` is merged into a single token. If `doc.text[start_idx : end_idx]` is merged into a single token. If
@ -903,10 +898,7 @@ cdef class Doc:
""" """
cdef unicode tag, lemma, ent_type cdef unicode tag, lemma, ent_type
if len(args) == 3: if len(args) == 3:
util.deprecated( deprecation_warning(Warnings.W003)
"Positional arguments to Doc.merge are deprecated. Instead, "
"use the keyword arguments, for example tag=, lemma= or "
"ent_type=.")
tag, lemma, ent_type = args tag, lemma, ent_type = args
attributes[TAG] = tag attributes[TAG] = tag
attributes[LEMMA] = lemma attributes[LEMMA] = lemma
@ -920,13 +912,9 @@ cdef class Doc:
if 'ent_type' in attributes: if 'ent_type' in attributes:
attributes[ENT_TYPE] = attributes['ent_type'] attributes[ENT_TYPE] = attributes['ent_type']
elif args: elif args:
raise ValueError( raise ValueError(Errors.E034.format(n_args=len(args),
"Doc.merge received %d non-keyword arguments. Expected either " args=repr(args),
"3 arguments (deprecated), or 0 (use keyword arguments). " kwargs=repr(attributes)))
"Arguments supplied:\n%s\n"
"Keyword arguments: %s\n" % (len(args), repr(args),
repr(attributes)))
# More deprecated attribute handling =/ # More deprecated attribute handling =/
if 'label' in attributes: if 'label' in attributes:
attributes['ent_type'] = attributes.pop('label') attributes['ent_type'] = attributes.pop('label')
@ -941,66 +929,8 @@ cdef class Doc:
return None return None
# Currently we have the token index, we want the range-end index # Currently we have the token index, we want the range-end index
end += 1 end += 1
cdef Span span = self[start:end] with self.retokenize() as retokenizer:
# Get LexemeC for newly merged token retokenizer.merge(self[start:end], attrs=attributes)
new_orth = ''.join([t.text_with_ws for t in span])
if span[-1].whitespace_:
new_orth = new_orth[:-len(span[-1].whitespace_)]
cdef const LexemeC* lex = self.vocab.get(self.mem, new_orth)
# House the new merged token where it starts
cdef TokenC* token = &self.c[start]
token.spacy = self.c[end-1].spacy
for attr_name, attr_value in attributes.items():
if attr_name == TAG:
self.vocab.morphology.assign_tag(token, attr_value)
else:
Token.set_struct_attr(token, attr_name, attr_value)
# Make sure ent_iob remains consistent
if self.c[end].ent_iob == 1 and token.ent_iob in (0, 2):
if token.ent_type == self.c[end].ent_type:
token.ent_iob = 3
else:
# If they're not the same entity type, let them be two entities
self.c[end].ent_iob = 3
# Begin by setting all the head indices to absolute token positions
# This is easier to work with for now than the offsets
# Before thinking of something simpler, beware the case where a
# dependency bridges over the entity. Here the alignment of the
# tokens changes.
span_root = span.root.i
token.dep = span.root.dep
# We update token.lex after keeping span root and dep, since
# setting token.lex will change span.start and span.end properties
# as it modifies the character offsets in the doc
token.lex = lex
for i in range(self.length):
self.c[i].head += i
# Set the head of the merged token, and its dep relation, from the Span
token.head = self.c[span_root].head
# Adjust deps before shrinking tokens
# Tokens which point into the merged token should now point to it
# Subtract the offset from all tokens which point to >= end
offset = (end - start) - 1
for i in range(self.length):
head_idx = self.c[i].head
if start <= head_idx < end:
self.c[i].head = start
elif head_idx >= end:
self.c[i].head -= offset
# Now compress the token array
for i in range(end, self.length):
self.c[i - offset] = self.c[i]
for i in range(self.length - offset, self.length):
memset(&self.c[i], 0, sizeof(TokenC))
self.c[i].lex = &EMPTY_LEXEME
self.length -= offset
for i in range(self.length):
# ...And, set heads back to a relative position
self.c[i].head -= i
# Set the left/right children, left/right edges
set_children_from_heads(self.c, self.length)
# Clear the cached Python objects
# Return the merged Python object
return self[start] return self[start]
def print_tree(self, light=False, flat=False): def print_tree(self, light=False, flat=False):
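
Note on `Doc.retokenize` above: requests are queued inside the `with` block and applied together on exit. The following is a pure-Python sketch of that queue-then-apply pattern on a list of words. `ListRetokenizer` is a hypothetical analogue, not the class from the diff. It assumes non-overlapping spans and applies them right to left so that indices stay valid, whereas the diff stores character offsets and looks the token positions up again at exit.

```python
class ListRetokenizer(object):
    """Hypothetical list-based analogue of the Retokenizer helper."""

    def __init__(self, words):
        self.words = words
        self.merges = []

    def merge(self, start, end):
        # Only record the request; the word list is not modified yet.
        self.merges.append((start, end))

    def __enter__(self):
        self.merges = []
        return self

    def __exit__(self, *args):
        # Apply right to left so that earlier indices remain valid.
        # Assumes the queued spans do not overlap.
        for start, end in sorted(self.merges, reverse=True):
            self.words[start:end] = [' '.join(self.words[start:end])]


words = ['I', 'like', 'New', 'York', 'in', 'early', 'autumn']
with ListRetokenizer(words) as retokenizer:
    retokenizer.merge(2, 4)
    retokenizer.merge(5, 7)
    # Nothing has changed yet; both indices refer to the original list.
print(words)  # ['I', 'like', 'New York', 'in', 'early autumn']
```

Because every request is expressed against the original tokenization, callers do not need to recompute indices after each merge.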

View File

@ -8,7 +8,7 @@ from ..symbols import HEAD, TAG, DEP, ENT_IOB, ENT_TYPE
def merge_ents(doc): def merge_ents(doc):
"""Helper: merge adjacent entities into single tokens; modifies the doc.""" """Helper: merge adjacent entities into single tokens; modifies the doc."""
for ent in doc.ents: for ent in doc.ents:
ent.merge(ent.root.tag_, ent.text, ent.label_) ent.merge(tag=ent.root.tag_, lemma=ent.text, ent_type=ent.label_)
return doc return doc

View File

@ -16,16 +16,17 @@ from ..util import normalize_slice
from ..attrs cimport IS_PUNCT, IS_SPACE from ..attrs cimport IS_PUNCT, IS_SPACE
from ..lexeme cimport Lexeme from ..lexeme cimport Lexeme
from ..compat import is_config from ..compat import is_config
from .. import about from ..errors import Errors, TempErrors
from .underscore import Underscore from .underscore import Underscore, get_ext_args
cdef class Span: cdef class Span:
"""A slice from a Doc object.""" """A slice from a Doc object."""
@classmethod @classmethod
def set_extension(cls, name, default=None, method=None, def set_extension(cls, name, **kwargs):
getter=None, setter=None): if cls.has_extension(name) and not kwargs.get('force', False):
Underscore.span_extensions[name] = (default, method, getter, setter) raise ValueError(Errors.E090.format(name=name, obj='Span'))
Underscore.span_extensions[name] = get_ext_args(**kwargs)
@classmethod @classmethod
def get_extension(cls, name): def get_extension(cls, name):
@ -35,6 +36,12 @@ cdef class Span:
def has_extension(cls, name): def has_extension(cls, name):
return name in Underscore.span_extensions return name in Underscore.span_extensions
@classmethod
def remove_extension(cls, name):
if not cls.has_extension(name):
raise ValueError(Errors.E046.format(name=name))
return Underscore.span_extensions.pop(name)
def __cinit__(self, Doc doc, int start, int end, attr_t label=0, def __cinit__(self, Doc doc, int start, int end, attr_t label=0,
vector=None, vector_norm=None): vector=None, vector_norm=None):
"""Create a `Span` object from the slice `doc[start : end]`. """Create a `Span` object from the slice `doc[start : end]`.
@ -48,8 +55,7 @@ cdef class Span:
RETURNS (Span): The newly constructed object. RETURNS (Span): The newly constructed object.
""" """
if not (0 <= start <= end <= len(doc)): if not (0 <= start <= end <= len(doc)):
raise IndexError raise IndexError(Errors.E035.format(start=start, end=end, length=len(doc)))
self.doc = doc self.doc = doc
self.start = start self.start = start
self.start_char = self.doc[start].idx if start < self.doc.length else 0 self.start_char = self.doc[start].idx if start < self.doc.length else 0
@ -58,7 +64,8 @@ cdef class Span:
self.end_char = self.doc[end - 1].idx + len(self.doc[end - 1]) self.end_char = self.doc[end - 1].idx + len(self.doc[end - 1])
else: else:
self.end_char = 0 self.end_char = 0
assert label in doc.vocab.strings, label if label not in doc.vocab.strings:
raise ValueError(Errors.E084.format(label=label))
self.label = label self.label = label
self._vector = vector self._vector = vector
self._vector_norm = vector_norm self._vector_norm = vector_norm
@ -267,11 +274,10 @@ cdef class Span:
or (self.doc.c[self.end-1].idx + self.doc.c[self.end-1].lex.length) != self.end_char: or (self.doc.c[self.end-1].idx + self.doc.c[self.end-1].lex.length) != self.end_char:
start = token_by_start(self.doc.c, self.doc.length, self.start_char) start = token_by_start(self.doc.c, self.doc.length, self.start_char)
if self.start == -1: if self.start == -1:
raise IndexError("Error calculating span: Can't find start") raise IndexError(Errors.E036.format(start=self.start_char))
end = token_by_end(self.doc.c, self.doc.length, self.end_char) end = token_by_end(self.doc.c, self.doc.length, self.end_char)
if end == -1: if end == -1:
raise IndexError("Error calculating span: Can't find end") raise IndexError(Errors.E037.format(end=self.end_char))
self.start = start self.start = start
self.end = end + 1 self.end = end + 1
@ -294,12 +300,11 @@ cdef class Span:
cdef int i cdef int i
if self.doc.is_parsed: if self.doc.is_parsed:
root = &self.doc.c[self.start] root = &self.doc.c[self.start]
n = 0
while root.head != 0: while root.head != 0:
root += root.head root += root.head
n += 1 n += 1
if n >= self.doc.length: if n >= self.doc.length:
raise RuntimeError raise RuntimeError(Errors.E038)
return self.doc[root.l_edge:root.r_edge + 1] return self.doc[root.l_edge:root.r_edge + 1]
elif self.doc.is_sentenced: elif self.doc.is_sentenced:
# find start of the sentence # find start of the sentence
@ -314,13 +319,7 @@ cdef class Span:
n += 1 n += 1
if n >= self.doc.length: if n >= self.doc.length:
break break
#
return self.doc[start:end] return self.doc[start:end]
else:
raise ValueError(
"Access to sentence requires either the dependency parse "
"or sentence boundaries to be set by setting " +
"doc[i].is_sent_start = True")
property has_vector: property has_vector:
"""RETURNS (bool): Whether a word vector is associated with the object. """RETURNS (bool): Whether a word vector is associated with the object.
@ -402,11 +401,7 @@ cdef class Span:
""" """
def __get__(self): def __get__(self):
if not self.doc.is_parsed: if not self.doc.is_parsed:
raise ValueError( raise ValueError(Errors.E029)
"noun_chunks requires the dependency parse, which "
"requires a statistical model to be installed and loaded. "
"For more info, see the "
"documentation: \n%s\n" % about.__docs_models__)
# Accumulate the result before beginning to iterate over it. This # Accumulate the result before beginning to iterate over it. This
# prevents the tokenisation from being changed out from under us # prevents the tokenisation from being changed out from under us
# during the iteration. The tricky thing here is that Span accepts # during the iteration. The tricky thing here is that Span accepts
@ -552,9 +547,7 @@ cdef class Span:
return self.root.ent_id return self.root.ent_id
def __set__(self, hash_t key): def __set__(self, hash_t key):
raise NotImplementedError( raise NotImplementedError(TempErrors.T007.format(attr='ent_id'))
"Can't yet set ent_id from Span. Vote for this feature on "
"the issue tracker: http://github.com/explosion/spaCy/issues")
property ent_id_: property ent_id_:
"""RETURNS (unicode): The (string) entity ID.""" """RETURNS (unicode): The (string) entity ID."""
@ -562,9 +555,7 @@ cdef class Span:
return self.root.ent_id_ return self.root.ent_id_
def __set__(self, hash_t key): def __set__(self, hash_t key):
raise NotImplementedError( raise NotImplementedError(TempErrors.T007.format(attr='ent_id_'))
"Can't yet set ent_id_ from Span. Vote for this feature on the "
"issue tracker: http://github.com/explosion/spaCy/issues")
property orth_: property orth_:
"""Verbatim text content (identical to Span.text). Exists mostly for """Verbatim text content (identical to Span.text). Exists mostly for
@ -612,9 +603,5 @@ cdef int _count_words_to_root(const TokenC* token, int sent_length) except -1:
token += token.head token += token.head
n += 1 n += 1
if n >= sent_length: if n >= sent_length:
raise RuntimeError( raise RuntimeError(Errors.E039)
"Array bounds exceeded while searching for root word. This "
"likely means the parse tree is in an invalid state. Please "
"report this issue here: "
"http://github.com/explosion/spaCy/issues")
return n return n
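
The hunks above replace inline message strings with lookups such as `Errors.E036.format(start=...)`. A minimal sketch of how a central message catalogue of this kind could work follows. It assumes a hypothetical `DemoErrors` class and a hypothetical `find_start` helper; the message texts are invented for illustration, since the real texts live in the errors module that this diff imports from and are not shown here.

```python
class DemoErrors(object):
    # Hypothetical templates: the codes mirror the diff, the texts are illustrative only.
    E036 = "Error calculating span: can't find a token starting at character offset {start}."
    E037 = "Error calculating span: can't find a token ending at character offset {end}."


def find_start(token_starts, start_char):
    """Return the index of the token that begins at start_char.

    token_starts is a list of character offsets, one per token.
    """
    for i, idx in enumerate(token_starts):
        if idx == start_char:
            return i
    # The offending value is interpolated into the message via str.format.
    raise IndexError(DemoErrors.E036.format(start=start_char))
```

Keeping the templates in one place means a call site only passes the values to interpolate, and a message can be reworded without touching the code that raises it.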


@ -6,6 +6,7 @@ from ..typedefs cimport attr_t, flags_t
from ..parts_of_speech cimport univ_pos_t from ..parts_of_speech cimport univ_pos_t
from .doc cimport Doc from .doc cimport Doc
from ..lexeme cimport Lexeme from ..lexeme cimport Lexeme
from ..errors import Errors
cdef class Token: cdef class Token:
@ -17,8 +18,7 @@ cdef class Token:
@staticmethod @staticmethod
cdef inline Token cinit(Vocab vocab, const TokenC* token, int offset, Doc doc): cdef inline Token cinit(Vocab vocab, const TokenC* token, int offset, Doc doc):
if offset < 0 or offset >= doc.length: if offset < 0 or offset >= doc.length:
msg = "Attempt to access token at %d, max length %d" raise IndexError(Errors.E040.format(i=offset, max_length=doc.length))
raise IndexError(msg % (offset, doc.length))
cdef Token self = Token.__new__(Token, vocab, doc, offset) cdef Token self = Token.__new__(Token, vocab, doc, offset)
return self return self


@ -19,26 +19,33 @@ from ..attrs cimport IS_OOV, IS_TITLE, IS_UPPER, IS_CURRENCY, LIKE_URL, LIKE_NUM
from ..attrs cimport IS_STOP, ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX from ..attrs cimport IS_STOP, ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX
from ..attrs cimport LENGTH, CLUSTER, LEMMA, POS, TAG, DEP from ..attrs cimport LENGTH, CLUSTER, LEMMA, POS, TAG, DEP
from ..compat import is_config from ..compat import is_config
from ..errors import Errors
from .. import util from .. import util
from .. import about from .underscore import Underscore, get_ext_args
from .underscore import Underscore
cdef class Token: cdef class Token:
"""An individual token i.e. a word, punctuation symbol, whitespace, """An individual token i.e. a word, punctuation symbol, whitespace,
etc.""" etc."""
@classmethod @classmethod
def set_extension(cls, name, default=None, method=None, def set_extension(cls, name, **kwargs):
getter=None, setter=None): if cls.has_extension(name) and not kwargs.get('force', False):
Underscore.token_extensions[name] = (default, method, getter, setter) raise ValueError(Errors.E090.format(name=name, obj='Token'))
Underscore.token_extensions[name] = get_ext_args(**kwargs)
@classmethod @classmethod
def get_extension(cls, name): def get_extension(cls, name):
return Underscore.span_extensions.get(name) return Underscore.token_extensions.get(name)
@classmethod @classmethod
def has_extension(cls, name): def has_extension(cls, name):
return name in Underscore.span_extensions return name in Underscore.token_extensions
@classmethod
def remove_extension(cls, name):
if not cls.has_extension(name):
raise ValueError(Errors.E046.format(name=name))
return Underscore.token_extensions.pop(name)
def __cinit__(self, Vocab vocab, Doc doc, int offset): def __cinit__(self, Vocab vocab, Doc doc, int offset):
"""Construct a `Token` object. """Construct a `Token` object.
@ -106,7 +113,7 @@ cdef class Token:
elif op == 5: elif op == 5:
return my >= their return my >= their
else: else:
raise ValueError(op) raise ValueError(Errors.E041.format(op=op))
@property @property
def _(self): def _(self):
@ -135,8 +142,7 @@ cdef class Token:
RETURNS (Token): The token at position `self.doc[self.i+i]`. RETURNS (Token): The token at position `self.doc[self.i+i]`.
""" """
if self.i+i < 0 or (self.i+i >= len(self.doc)): if self.i+i < 0 or (self.i+i >= len(self.doc)):
msg = "Error accessing doc[%d].nbor(%d), for doc of length %d" raise IndexError(Errors.E042.format(i=self.i, j=i, length=len(self.doc)))
raise IndexError(msg % (self.i, i, len(self.doc)))
return self.doc[self.i+i] return self.doc[self.i+i]
def similarity(self, other): def similarity(self, other):
@ -354,14 +360,7 @@ cdef class Token:
property sent_start: property sent_start:
def __get__(self): def __get__(self):
# Raising a deprecation warning causes errors for autocomplete # Raising a deprecation warning here causes errors for autocomplete
#util.deprecated(
# "Token.sent_start is now deprecated. Use Token.is_sent_start "
# "instead, which returns a boolean value or None if the answer "
# "is unknown instead of a misleading 0 for False and 1 for "
# "True. It also fixes a quirk in the old logic that would "
# "always set the property to 0 for the first word of the "
# "document.")
# Handle broken backwards compatibility case: doc[0].sent_start # Handle broken backwards compatibility case: doc[0].sent_start
# was False. # was False.
if self.i == 0: if self.i == 0:
@ -386,9 +385,7 @@ cdef class Token:
def __set__(self, value): def __set__(self, value):
if self.doc.is_parsed: if self.doc.is_parsed:
raise ValueError( raise ValueError(Errors.E043)
"Refusing to write to token.sent_start if its document "
"is parsed, because this may cause inconsistent state.")
if value is None: if value is None:
self.c.sent_start = 0 self.c.sent_start = 0
elif value is True: elif value is True:
@ -396,8 +393,7 @@ cdef class Token:
elif value is False: elif value is False:
self.c.sent_start = -1 self.c.sent_start = -1
else: else:
raise ValueError("Invalid value for token.sent_start. Must be " raise ValueError(Errors.E044.format(value=value))
"one of: None, True, False")
property lefts: property lefts:
"""The leftward immediate children of the word, in the syntactic """The leftward immediate children of the word, in the syntactic
@ -415,8 +411,7 @@ cdef class Token:
nr_iter += 1 nr_iter += 1
# This is ugly, but it's a way to guard out infinite loops # This is ugly, but it's a way to guard out infinite loops
if nr_iter >= 10000000: if nr_iter >= 10000000:
raise RuntimeError("Possibly infinite loop encountered " raise RuntimeError(Errors.E045.format(attr='token.lefts'))
"while looking for token.lefts")
property rights: property rights:
"""The rightward immediate children of the word, in the syntactic """The rightward immediate children of the word, in the syntactic
@ -434,8 +429,7 @@ cdef class Token:
ptr -= 1 ptr -= 1
nr_iter += 1 nr_iter += 1
if nr_iter >= 10000000: if nr_iter >= 10000000:
raise RuntimeError("Possibly infinite loop encountered " raise RuntimeError(Errors.E045.format(attr='token.rights'))
"while looking for token.rights")
tokens.reverse() tokens.reverse()
for t in tokens: for t in tokens:
yield t yield t


@ -3,6 +3,8 @@ from __future__ import unicode_literals
import functools import functools
from ..errors import Errors
class Underscore(object): class Underscore(object):
doc_extensions = {} doc_extensions = {}
@ -23,7 +25,7 @@ class Underscore(object):
def __getattr__(self, name): def __getattr__(self, name):
if name not in self._extensions: if name not in self._extensions:
raise AttributeError(name) raise AttributeError(Errors.E046.format(name=name))
default, method, getter, setter = self._extensions[name] default, method, getter, setter = self._extensions[name]
if getter is not None: if getter is not None:
return getter(self._obj) return getter(self._obj)
@ -34,7 +36,7 @@ class Underscore(object):
def __setattr__(self, name, value): def __setattr__(self, name, value):
if name not in self._extensions: if name not in self._extensions:
raise AttributeError(name) raise AttributeError(Errors.E047.format(name=name))
default, method, getter, setter = self._extensions[name] default, method, getter, setter = self._extensions[name]
if setter is not None: if setter is not None:
return setter(self._obj, value) return setter(self._obj, value)
@ -52,3 +54,24 @@ class Underscore(object):
def _get_key(self, name): def _get_key(self, name):
return ('._.', name, self._start, self._end) return ('._.', name, self._start, self._end)
def get_ext_args(**kwargs):
"""Validate and convert arguments. Reused in Doc, Token and Span."""
default = kwargs.get('default')
getter = kwargs.get('getter')
setter = kwargs.get('setter')
method = kwargs.get('method')
if getter is None and setter is not None:
raise ValueError(Errors.E089)
valid_opts = ('default' in kwargs, method is not None, getter is not None)
nr_defined = sum(t is True for t in valid_opts)
if nr_defined != 1:
raise ValueError(Errors.E083.format(nr_defined=nr_defined))
if setter is not None and not hasattr(setter, '__call__'):
raise ValueError(Errors.E091.format(name='setter', value=repr(setter)))
if getter is not None and not hasattr(getter, '__call__'):
raise ValueError(Errors.E091.format(name='getter', value=repr(getter)))
if method is not None and not hasattr(method, '__call__'):
raise ValueError(Errors.E091.format(name='method', value=repr(method)))
return (default, method, getter, setter)
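
The validation rules in `get_ext_args` can be tried out in plain Python. The sketch below follows the logic shown in the hunk, but the function name `get_ext_args_sketch` and the error texts are placeholders, because the real messages behind `Errors.E083`, `E089` and `E091` are not part of this diff.

```python
def get_ext_args_sketch(**kwargs):
    """Validate extension arguments; mirrors the rules in the hunk above."""
    default = kwargs.get('default')
    getter = kwargs.get('getter')
    setter = kwargs.get('setter')
    method = kwargs.get('method')
    # A setter on its own is rejected: it needs a matching getter.
    if getter is None and setter is not None:
        raise ValueError("placeholder: setter given without getter")
    # Exactly one of default / method / getter must be supplied. Note that
    # 'default' counts as supplied whenever the keyword is present, even if
    # its value is None.
    valid_opts = ('default' in kwargs, method is not None, getter is not None)
    nr_defined = sum(t is True for t in valid_opts)
    if nr_defined != 1:
        raise ValueError("placeholder: expected 1 option, got %d" % nr_defined)
    # Any function-like argument must be callable.
    for name, value in (('setter', setter), ('getter', getter), ('method', method)):
        if value is not None and not callable(value):
            raise ValueError("placeholder: %s is not callable: %r" % (name, value))
    return (default, method, getter, setter)
```

For example, `get_ext_args_sketch(default=False)` returns `(False, None, None, None)`, while passing both `default` and `getter`, or no arguments at all, raises `ValueError`.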


@ -11,8 +11,6 @@ import sys
import textwrap import textwrap
import random import random
from collections import OrderedDict from collections import OrderedDict
import inspect
import warnings
from thinc.neural._classes.model import Model from thinc.neural._classes.model import Model
from thinc.neural.ops import NumpyOps from thinc.neural.ops import NumpyOps
import functools import functools
@ -23,10 +21,12 @@ import numpy.random
from .symbols import ORTH from .symbols import ORTH
from .compat import cupy, CudaStream, path2str, basestring_, input_, unicode_ from .compat import cupy, CudaStream, path2str, basestring_, input_, unicode_
from .compat import import_file from .compat import import_file
from .errors import Errors
import msgpack # Import these directly from Thinc, so that we're sure we always have the
import msgpack_numpy # same version.
msgpack_numpy.patch() from thinc.neural._classes.model import msgpack
from thinc.neural._classes.model import msgpack_numpy
LANGUAGES = {} LANGUAGES = {}
@ -50,8 +50,7 @@ def get_lang_class(lang):
try: try:
module = importlib.import_module('.lang.%s' % lang, 'spacy') module = importlib.import_module('.lang.%s' % lang, 'spacy')
except ImportError: except ImportError:
msg = "Can't import language %s from spacy.lang." raise ImportError(Errors.E048.format(lang=lang))
raise ImportError(msg % lang)
LANGUAGES[lang] = getattr(module, module.__all__[0]) LANGUAGES[lang] = getattr(module, module.__all__[0])
return LANGUAGES[lang] return LANGUAGES[lang]
@ -108,7 +107,7 @@ def load_model(name, **overrides):
""" """
data_path = get_data_path() data_path = get_data_path()
if not data_path or not data_path.exists(): if not data_path or not data_path.exists():
raise IOError("Can't find spaCy data path: %s" % path2str(data_path)) raise IOError(Errors.E049.format(path=path2str(data_path)))
if isinstance(name, basestring_): # in data dir / shortcut if isinstance(name, basestring_): # in data dir / shortcut
if name in set([d.name for d in data_path.iterdir()]): if name in set([d.name for d in data_path.iterdir()]):
return load_model_from_link(name, **overrides) return load_model_from_link(name, **overrides)
@ -118,7 +117,7 @@ def load_model(name, **overrides):
return load_model_from_path(Path(name), **overrides) return load_model_from_path(Path(name), **overrides)
elif hasattr(name, 'exists'): # Path or Path-like to model data elif hasattr(name, 'exists'): # Path or Path-like to model data
return load_model_from_path(name, **overrides) return load_model_from_path(name, **overrides)
raise IOError("Can't find model '%s'" % name) raise IOError(Errors.E050.format(name=name))
def load_model_from_link(name, **overrides): def load_model_from_link(name, **overrides):
@ -127,9 +126,7 @@ def load_model_from_link(name, **overrides):
try: try:
cls = import_file(name, path) cls = import_file(name, path)
except AttributeError: except AttributeError:
raise IOError( raise IOError(Errors.E051.format(name=name))
"Cant' load '%s'. If you're using a shortcut link, make sure it "
"points to a valid package (not just a data directory)." % name)
return cls.load(**overrides) return cls.load(**overrides)
@ -173,8 +170,7 @@ def load_model_from_init_py(init_file, **overrides):
data_dir = '%s_%s-%s' % (meta['lang'], meta['name'], meta['version']) data_dir = '%s_%s-%s' % (meta['lang'], meta['name'], meta['version'])
data_path = model_path / data_dir data_path = model_path / data_dir
if not model_path.exists(): if not model_path.exists():
msg = "Can't find model directory: %s" raise IOError(Errors.E052.format(path=path2str(data_path)))
raise ValueError(msg % path2str(data_path))
return load_model_from_path(data_path, meta, **overrides) return load_model_from_path(data_path, meta, **overrides)
@ -186,16 +182,14 @@ def get_model_meta(path):
""" """
model_path = ensure_path(path) model_path = ensure_path(path)
if not model_path.exists(): if not model_path.exists():
msg = "Can't find model directory: %s" raise IOError(Errors.E052.format(path=path2str(model_path)))
raise ValueError(msg % path2str(model_path))
meta_path = model_path / 'meta.json' meta_path = model_path / 'meta.json'
if not meta_path.is_file(): if not meta_path.is_file():
raise IOError("Could not read meta.json from %s" % meta_path) raise IOError(Errors.E053.format(path=meta_path))
meta = read_json(meta_path) meta = read_json(meta_path)
for setting in ['lang', 'name', 'version']: for setting in ['lang', 'name', 'version']:
if setting not in meta or not meta[setting]: if setting not in meta or not meta[setting]:
msg = "No valid '%s' setting found in model meta.json" raise ValueError(Errors.E054.format(setting=setting))
raise ValueError(msg % setting)
return meta return meta
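
The required-settings check at the end of `get_model_meta` can be illustrated without touching the file system. This is a sketch only: `check_meta` is a hypothetical helper name, and the message is the pre-change string from the removed side of the hunk rather than the new `Errors.E054` text.

```python
REQUIRED_SETTINGS = ['lang', 'name', 'version']


def check_meta(meta):
    """Raise ValueError if a required setting is missing or empty."""
    for setting in REQUIRED_SETTINGS:
        # Both a missing key and a falsy value (e.g. '') are rejected.
        if setting not in meta or not meta[setting]:
            raise ValueError(
                "No valid '%s' setting found in model meta.json" % setting)
    return meta
```

A dict such as `{'lang': 'en', 'name': 'demo', 'version': '0.0.1'}` passes unchanged, whereas leaving out `version` or setting `name` to an empty string raises `ValueError`.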
@ -344,13 +338,10 @@ def update_exc(base_exceptions, *addition_dicts):
for orth, token_attrs in additions.items(): for orth, token_attrs in additions.items():
if not all(isinstance(attr[ORTH], unicode_) if not all(isinstance(attr[ORTH], unicode_)
for attr in token_attrs): for attr in token_attrs):
msg = "Invalid ORTH value in exception: key='%s', orths='%s'" raise ValueError(Errors.E055.format(key=orth, orths=token_attrs))
raise ValueError(msg % (orth, token_attrs))
described_orth = ''.join(attr[ORTH] for attr in token_attrs) described_orth = ''.join(attr[ORTH] for attr in token_attrs)
if orth != described_orth: if orth != described_orth:
msg = ("Invalid tokenizer exception: ORTH values combined " raise ValueError(Errors.E056.format(key=orth, orths=described_orth))
"don't match original string. key='%s', orths='%s'")
raise ValueError(msg % (orth, described_orth))
exc.update(additions) exc.update(additions)
exc = expand_exc(exc, "'", "") exc = expand_exc(exc, "'", "")
return exc return exc
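
The two checks in `update_exc` amount to: every `ORTH` value must be a string, and the values joined together must reproduce the exception key. A minimal sketch, assuming a plain string key `'orth'` as a stand-in for the `ORTH` symbol, a hypothetical helper name `check_exception`, and placeholder error texts:

```python
ORTH = 'orth'  # stand-in for the ORTH symbol imported in the real module


def check_exception(orth, token_attrs):
    """Validate one tokenizer exception and return the joined ORTH values."""
    if not all(isinstance(attr[ORTH], str) for attr in token_attrs):
        raise ValueError("placeholder: non-string ORTH value for key %r" % orth)
    described_orth = ''.join(attr[ORTH] for attr in token_attrs)
    # The pieces must concatenate back to exactly the original string.
    if orth != described_orth:
        raise ValueError(
            "placeholder: key %r does not match orths %r" % (orth, described_orth))
    return described_orth
```

Splitting `"don't"` into `"do"` and `"n't"` passes, while splitting it into `"do"` and `"not"` fails because the pieces no longer spell the key.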
@ -380,8 +371,7 @@ def expand_exc(excs, search, replace):
def normalize_slice(length, start, stop, step=None): def normalize_slice(length, start, stop, step=None):
if not (step is None or step == 1): if not (step is None or step == 1):
raise ValueError("Stepped slices not supported in Span objects." raise ValueError(Errors.E057)
"Try: list(tokens)[start:stop:step] instead.")
if start is None: if start is None:
start = 0 start = 0
elif start < 0: elif start < 0:
@ -392,7 +382,6 @@ def normalize_slice(length, start, stop, step=None):
elif stop < 0: elif stop < 0:
stop += length stop += length
stop = min(length, max(start, stop)) stop = min(length, max(start, stop))
assert 0 <= start <= stop <= length
return start, stop return start, stop
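
This hunk drops the `assert 0 <= start <= stop <= length` line from `normalize_slice`. The sketch below shows why the clamping makes that invariant hold anyway. Part of the function falls between two hunks and is not visible in this diff, so the lines marked as assumed are a guess at the hidden part, and the error text is the pre-change string with its missing space restored.

```python
def normalize_slice_sketch(length, start, stop, step=None):
    if not (step is None or step == 1):
        raise ValueError("Stepped slices not supported in Span objects. "
                         "Try: list(tokens)[start:stop:step] instead.")
    if start is None:
        start = 0
    elif start < 0:
        start += length
    start = min(length, max(0, start))  # assumed: not visible in the diff
    if stop is None:                    # assumed: not visible in the diff
        stop = length                   # assumed: not visible in the diff
    elif stop < 0:
        stop += length
    # Clamping stop into [start, length] guarantees start <= stop <= length.
    stop = min(length, max(start, stop))
    return start, stop
```

With `length=10`, a slice of `(-3, None)` becomes `(7, 10)`, `(2, -1)` becomes `(2, 9)`, and an inverted slice such as `(5, 2)` collapses to the empty range `(5, 5)`.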
@ -552,18 +541,6 @@ def from_disk(path, readers, exclude):
return path return path
def deprecated(message, filter='always'):
"""Show a deprecation warning.
message (unicode): The message to display.
filter (unicode): Filter value.
"""
stack = inspect.stack()[-1]
with warnings.catch_warnings():
warnings.simplefilter(filter, DeprecationWarning)
warnings.warn_explicit(message, DeprecationWarning, stack[1], stack[2])
def print_table(data, title=None): def print_table(data, title=None):
"""Print data in table format. """Print data in table format.


@ -1,24 +1,43 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
import functools
import numpy import numpy
from collections import OrderedDict from collections import OrderedDict
import msgpack
import msgpack_numpy from .util import msgpack
msgpack_numpy.patch() from .util import msgpack_numpy
cimport numpy as np cimport numpy as np
from thinc.neural.util import get_array_module from thinc.neural.util import get_array_module
from thinc.neural._classes.model import Model from thinc.neural._classes.model import Model
from .strings cimport StringStore, hash_string from .strings cimport StringStore, hash_string
from .compat import basestring_, path2str from .compat import basestring_, path2str
from .errors import Errors
from . import util from . import util
from cython.operator cimport dereference as deref
from libcpp.set cimport set as cppset
def unpickle_vectors(bytes_data): def unpickle_vectors(bytes_data):
return Vectors().from_bytes(bytes_data) return Vectors().from_bytes(bytes_data)
class GlobalRegistry(object):
'''Global store of vectors, to avoid repeatedly loading the data.'''
data = {}
@classmethod
def register(cls, name, data):
cls.data[name] = data
return functools.partial(cls.get, name)
@classmethod
def get(cls, name):
return cls.data[name]
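
`GlobalRegistry.register` stores the data under a name and hands back a zero-argument callable built with `functools.partial`. The class below is copied from the hunk so it can be run outside Cython; the name `'demo_vectors'` and the toy data are made up for the example.

```python
import functools


class GlobalRegistry(object):
    '''Global store of vectors, to avoid repeatedly loading the data.'''
    data = {}

    @classmethod
    def register(cls, name, data):
        cls.data[name] = data
        # The returned callable looks the data up by name at call time.
        return functools.partial(cls.get, name)

    @classmethod
    def get(cls, name):
        return cls.data[name]


# Hypothetical usage: register a tiny table and fetch it back lazily.
get_demo = GlobalRegistry.register('demo_vectors', [[0.0, 1.0]])
```

Because the callable resolves the name on each call, re-registering `'demo_vectors'` with new data changes what `get_demo()` returns.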
cdef class Vectors: cdef class Vectors:
"""Store, save and load word vectors. """Store, save and load word vectors.
@ -31,18 +50,21 @@ cdef class Vectors:
the table need to be assigned --- so len(list(vectors.keys())) may be the table need to be assigned --- so len(list(vectors.keys())) may be
greater or smaller than vectors.shape[0]. greater or smaller than vectors.shape[0].
""" """
cdef public object name
cdef public object data cdef public object data
cdef public object key2row cdef public object key2row
cdef public object _unset cdef cppset[int] _unset
def __init__(self, *, shape=None, data=None, keys=None): def __init__(self, *, shape=None, data=None, keys=None, name=None):
"""Create a new vector store. """Create a new vector store.
shape (tuple): Size of the table, as (# entries, # columns) shape (tuple): Size of the table, as (# entries, # columns)
data (numpy.ndarray): The vector data. data (numpy.ndarray): The vector data.
keys (iterable): A sequence of keys, aligned with the data. keys (iterable): A sequence of keys, aligned with the data.
name (string): A name to identify the vectors table.
RETURNS (Vectors): The newly created object. RETURNS (Vectors): The newly created object.
""" """
self.name = name
if data is None: if data is None:
if shape is None: if shape is None:
shape = (0,0) shape = (0,0)
@ -50,9 +72,9 @@ cdef class Vectors:
self.data = data self.data = data
self.key2row = OrderedDict() self.key2row = OrderedDict()
if self.data is not None: if self.data is not None:
self._unset = set(range(self.data.shape[0])) self._unset = cppset[int]({i for i in range(self.data.shape[0])})
else: else:
self._unset = set() self._unset = cppset[int]()
if keys is not None: if keys is not None:
for i, key in enumerate(keys): for i, key in enumerate(keys):
self.add(key, row=i) self.add(key, row=i)
@ -74,7 +96,7 @@ cdef class Vectors:
@property @property
def is_full(self): def is_full(self):
"""RETURNS (bool): `True` if no slots are available for new keys.""" """RETURNS (bool): `True` if no slots are available for new keys."""
return len(self._unset) == 0 return self._unset.size() == 0
@property @property
def n_keys(self): def n_keys(self):
@ -93,7 +115,7 @@ cdef class Vectors:
""" """
i = self.key2row[key] i = self.key2row[key]
if i is None: if i is None:
raise KeyError(key) raise KeyError(Errors.E058.format(key=key))
else: else:
return self.data[i] return self.data[i]
@ -105,8 +127,8 @@ cdef class Vectors:
""" """
i = self.key2row[key] i = self.key2row[key]
self.data[i] = vector self.data[i] = vector
if i in self._unset: if self._unset.count(i):
self._unset.remove(i) self._unset.erase(self._unset.find(i))
def __iter__(self): def __iter__(self):
"""Iterate over the keys in the table. """Iterate over the keys in the table.
@ -145,7 +167,7 @@ cdef class Vectors:
xp = get_array_module(self.data) xp = get_array_module(self.data)
self.data = xp.resize(self.data, shape) self.data = xp.resize(self.data, shape)
filled = {row for row in self.key2row.values()} filled = {row for row in self.key2row.values()}
self._unset = {row for row in range(shape[0]) if row not in filled} self._unset = cppset[int]({row for row in range(shape[0]) if row not in filled})
removed_items = [] removed_items = []
for key, row in list(self.key2row.items()): for key, row in list(self.key2row.items()):
if row >= shape[0]: if row >= shape[0]:
@ -169,7 +191,7 @@ cdef class Vectors:
YIELDS (ndarray): A vector in the table. YIELDS (ndarray): A vector in the table.
""" """
for row, vector in enumerate(range(self.data.shape[0])): for row, vector in enumerate(range(self.data.shape[0])):
if row not in self._unset: if not self._unset.count(row):
yield vector yield vector
def items(self): def items(self):
@ -194,7 +216,8 @@ cdef class Vectors:
RETURNS: The requested key, keys, row or rows. RETURNS: The requested key, keys, row or rows.
""" """
if sum(arg is None for arg in (key, keys, row, rows)) != 3: if sum(arg is None for arg in (key, keys, row, rows)) != 3:
raise ValueError("One (and only one) keyword arg must be set.") bad_kwargs = {'key': key, 'keys': keys, 'row': row, 'rows': rows}
raise ValueError(Errors.E059.format(kwargs=bad_kwargs))
xp = get_array_module(self.data) xp = get_array_module(self.data)
if key is not None: if key is not None:
if isinstance(key, basestring_): if isinstance(key, basestring_):
@ -233,14 +256,14 @@ cdef class Vectors:
row = self.key2row[key] row = self.key2row[key]
elif row is None: elif row is None:
if self.is_full: if self.is_full:
raise ValueError("Cannot add new key to vectors -- full") raise ValueError(Errors.E060.format(rows=self.data.shape[0],
row = min(self._unset) cols=self.data.shape[1]))
row = deref(self._unset.begin())
self.key2row[key] = row self.key2row[key] = row
if vector is not None: if vector is not None:
self.data[row] = vector self.data[row] = vector
if row in self._unset: if self._unset.count(row):
self._unset.remove(row) self._unset.erase(self._unset.find(row))
return row return row
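
The row bookkeeping in `add` can be sketched in pure Python. The commit switches `_unset` from a Python `set` to a C++ `set[int]`, which is ordered, so `deref(self._unset.begin())` yields the smallest free row just as `min(self._unset)` did. The `RowAllocator` class below is hypothetical, keeps the Python `set`, and simplifies the original in two labeled ways.

```python
class RowAllocator(object):
    """Hypothetical stand-in for the key-to-row logic of the vectors table."""

    def __init__(self, n_rows):
        self.key2row = {}
        self._unset = set(range(n_rows))  # rows not yet assigned

    @property
    def is_full(self):
        return len(self._unset) == 0

    def add(self, key, row=None):
        # Assumed: the condition for reusing an existing row is above the
        # visible hunk, so this branch is a guess at it.
        if row is None and key in self.key2row:
            row = self.key2row[key]
        elif row is None:
            if self.is_full:
                raise ValueError("Cannot add new key to vectors -- full")
            row = min(self._unset)  # lowest free row
        self.key2row[key] = row
        # Simplification: the row is marked as used on every add, whereas the
        # original touches _unset alongside writing a vector.
        self._unset.discard(row)
        return row
```

With two rows, adding `'x'` and `'y'` yields rows 0 and 1, adding `'x'` again returns 0, and adding a third key raises `ValueError`.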
def most_similar(self, queries, *, batch_size=1024): def most_similar(self, queries, *, batch_size=1024):
@ -297,7 +320,7 @@ cdef class Vectors:
width = int(dims) width = int(dims)
break break
else: else:
raise IOError("Expected file named e.g. vectors.128.f.bin") raise IOError(Errors.E061.format(filename=path))
bin_loc = path / 'vectors.{dims}.{dtype}.bin'.format(dims=dims, bin_loc = path / 'vectors.{dims}.{dtype}.bin'.format(dims=dims,
dtype=dtype) dtype=dtype)
xp = get_array_module(self.data) xp = get_array_module(self.data)
@ -346,8 +369,8 @@ cdef class Vectors:
with path.open('rb') as file_: with path.open('rb') as file_:
self.key2row = msgpack.load(file_) self.key2row = msgpack.load(file_)
for key, row in self.key2row.items(): for key, row in self.key2row.items():
if row in self._unset: if self._unset.count(row):
self._unset.remove(row) self._unset.erase(self._unset.find(row))
def load_keys(path): def load_keys(path):
if path.exists(): if path.exists():


@ -16,6 +16,7 @@ from .attrs cimport PROB, LANG, ORTH, TAG
from .structs cimport SerializedLexemeC from .structs cimport SerializedLexemeC
from .compat import copy_reg, basestring_ from .compat import copy_reg, basestring_
from .errors import Errors
from .lemmatizer import Lemmatizer from .lemmatizer import Lemmatizer
from .attrs import intify_attrs from .attrs import intify_attrs
from .vectors import Vectors from .vectors import Vectors
@ -100,15 +101,9 @@ cdef class Vocab:
flag_id = bit flag_id = bit
break break
else: else:
raise ValueError( raise ValueError(Errors.E062)
"Cannot find empty bit for new lexical flag. All bits "
"between 0 and 63 are occupied. You can replace one by "
"specifying the flag_id explicitly, e.g. "
"`nlp.vocab.add_flag(your_func, flag_id=IS_ALPHA`.")
elif flag_id >= 64 or flag_id < 1: elif flag_id >= 64 or flag_id < 1:
raise ValueError( raise ValueError(Errors.E063.format(value=flag_id))
"Invalid value for flag_id: %d. Flag IDs must be between "
"1 and 63 (inclusive)" % flag_id)
for lex in self: for lex in self:
lex.set_flag(flag_id, flag_getter(lex.orth_)) lex.set_flag(flag_id, flag_getter(lex.orth_))
self.lex_attr_getters[flag_id] = flag_getter self.lex_attr_getters[flag_id] = flag_getter
@ -127,8 +122,9 @@ cdef class Vocab:
cdef size_t addr cdef size_t addr
if lex != NULL: if lex != NULL:
if lex.orth != self.strings[string]: if lex.orth != self.strings[string]:
raise LookupError.mismatched_strings( raise KeyError(Errors.E064.format(string=lex.orth,
lex.orth, self.strings[string], string) orth=self.strings[string],
orth_id=string))
return lex return lex
else: else:
return self._new_lexeme(mem, string) return self._new_lexeme(mem, string)
@ -171,7 +167,8 @@ cdef class Vocab:
if not is_oov: if not is_oov:
key = hash_string(string) key = hash_string(string)
self._add_lex_to_vocab(key, lex) self._add_lex_to_vocab(key, lex)
assert lex != NULL, string if lex == NULL:
raise ValueError(Errors.E085.format(string=string))
return lex return lex
cdef int _add_lex_to_vocab(self, hash_t key, const LexemeC* lex) except -1: cdef int _add_lex_to_vocab(self, hash_t key, const LexemeC* lex) except -1:
@ -254,7 +251,7 @@ cdef class Vocab:
width, you have to call this to change the size of the vectors. width, you have to call this to change the size of the vectors.
""" """
if width is not None and shape is not None: if width is not None and shape is not None:
raise ValueError("Only one of width and shape can be specified") raise ValueError(Errors.E065.format(width=width, shape=shape))
elif shape is not None: elif shape is not None:
self.vectors = Vectors(shape=shape) self.vectors = Vectors(shape=shape)
else: else:
@ -381,7 +378,8 @@ cdef class Vocab:
self.lexemes_from_bytes(file_.read()) self.lexemes_from_bytes(file_.read())
if self.vectors is not None: if self.vectors is not None:
self.vectors.from_disk(path, exclude='strings.json') self.vectors.from_disk(path, exclude='strings.json')
link_vectors_to_models(self) if self.vectors.name is not None:
link_vectors_to_models(self)
return self return self
def to_bytes(self, **exclude): def to_bytes(self, **exclude):
@ -421,6 +419,8 @@ cdef class Vocab:
('vectors', lambda b: serialize_vectors(b)) ('vectors', lambda b: serialize_vectors(b))
)) ))
util.from_bytes(bytes_data, setters, exclude) util.from_bytes(bytes_data, setters, exclude)
if self.vectors.name is not None:
link_vectors_to_models(self)
return self return self
def lexemes_to_bytes(self): def lexemes_to_bytes(self):
@ -468,7 +468,10 @@ cdef class Vocab:
if ptr == NULL: if ptr == NULL:
continue continue
py_str = self.strings[lexeme.orth] py_str = self.strings[lexeme.orth]
assert self.strings[py_str] == lexeme.orth, (py_str, lexeme.orth) if self.strings[py_str] != lexeme.orth:
raise ValueError(Errors.E086.format(string=py_str,
orth_id=lexeme.orth,
hash_id=self.strings[py_str]))
key = hash_string(py_str) key = hash_string(py_str)
self._by_hash.set(key, lexeme) self._by_hash.set(key, lexeme)
self._by_orth.set(lexeme.orth, lexeme) self._by_orth.set(lexeme.orth, lexeme)
@ -509,16 +512,3 @@ def unpickle_vocab(sstore, vectors, morphology, data_dir,
copy_reg.pickle(Vocab, pickle_vocab, unpickle_vocab) copy_reg.pickle(Vocab, pickle_vocab, unpickle_vocab)
class LookupError(Exception):
@classmethod
def mismatched_strings(cls, id_, id_string, original_string):
return cls(
"Error fetching a Lexeme from the Vocab. When looking up a "
"string, the lexeme returned had an orth ID that did not match "
"the query string. This means that the cached lexeme structs are "
"mismatched to the string encoding table. The mismatched:\n"
"Query string: {}\n"
"Orth cached: {}\n"
"Orth ID: {}".format(repr(original_string), repr(id_string), id_))


@ -1,7 +1,7 @@
{ {
"globals": { "globals": {
"title": "spaCy", "title": "spaCy",
"description": "spaCy is a free open-source library featuring state-of-the-art speed and accuracy and a powerful Python API.", "description": "spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.",
"SITENAME": "spaCy", "SITENAME": "spaCy",
"SLOGAN": "Industrial-strength Natural Language Processing in Python", "SLOGAN": "Industrial-strength Natural Language Processing in Python",
@ -10,10 +10,13 @@
"COMPANY": "Explosion AI", "COMPANY": "Explosion AI",
"COMPANY_URL": "https://explosion.ai", "COMPANY_URL": "https://explosion.ai",
"DEMOS_URL": "https://demos.explosion.ai", "DEMOS_URL": "https://explosion.ai/demos",
"MODELS_REPO": "explosion/spacy-models", "MODELS_REPO": "explosion/spacy-models",
"KERNEL_BINDER": "ines/spacy-binder",
"KERNEL_PYTHON": "python3",
"SPACY_VERSION": "2.0", "SPACY_VERSION": "2.0",
"BINDER_VERSION": "2.0.11",
"SOCIAL": { "SOCIAL": {
"twitter": "spacy_io", "twitter": "spacy_io",
@ -26,7 +29,8 @@
"NAVIGATION": { "NAVIGATION": {
"Usage": "/usage", "Usage": "/usage",
"Models": "/models", "Models": "/models",
"API": "/api" "API": "/api",
"Universe": "/universe"
}, },
"FOOTER": { "FOOTER": {
@ -34,7 +38,7 @@
"Usage": "/usage", "Usage": "/usage",
"Models": "/models", "Models": "/models",
"API Reference": "/api", "API Reference": "/api",
"Resources": "/usage/resources" "Universe": "/universe"
}, },
"Support": { "Support": {
"Issue Tracker": "https://github.com/explosion/spaCy/issues", "Issue Tracker": "https://github.com/explosion/spaCy/issues",
@ -82,8 +86,8 @@
} }
], ],
"V_CSS": "2.0.1", "V_CSS": "2.1.2",
"V_JS": "2.0.1", "V_JS": "2.1.0",
"DEFAULT_SYNTAX": "python", "DEFAULT_SYNTAX": "python",
"ANALYTICS": "UA-58931649-1", "ANALYTICS": "UA-58931649-1",
"MAILCHIMP": { "MAILCHIMP": {


@ -15,12 +15,39 @@
- MODEL_META = public.models._data.MODEL_META - MODEL_META = public.models._data.MODEL_META
- MODEL_LICENSES = public.models._data.MODEL_LICENSES - MODEL_LICENSES = public.models._data.MODEL_LICENSES
- MODEL_BENCHMARKS = public.models._data.MODEL_BENCHMARKS - MODEL_BENCHMARKS = public.models._data.MODEL_BENCHMARKS
- EXAMPLE_SENT_LANGS = public.models._data.EXAMPLE_SENT_LANGS
- EXAMPLE_SENTENCES = public.models._data.EXAMPLE_SENTENCES - EXAMPLE_SENTENCES = public.models._data.EXAMPLE_SENTENCES
- IS_PAGE = (SECTION != "index") && !landing - IS_PAGE = (SECTION != "index") && !landing
- IS_MODELS = (SECTION == "models" && LANGUAGES[current.source]) - IS_MODELS = (SECTION == "models" && LANGUAGES[current.source])
- HAS_MODELS = IS_MODELS && CURRENT_MODELS.length - HAS_MODELS = IS_MODELS && CURRENT_MODELS.length
//- Get page URL
- function getPageUrl() {
- var path = current.path;
- if(path[path.length - 1] == 'index') path = path.slice(0, path.length - 1);
- return `${SITE_URL}/${path.join('/')}`;
- }
//- Get pretty page title depending on section
- function getPageTitle() {
- var sections = ['api', 'usage', 'models'];
- if (sections.includes(SECTION)) {
- var titleSection = (SECTION == "api") ? 'API' : SECTION.charAt(0).toUpperCase() + SECTION.slice(1);
- return `${title} · ${SITENAME} ${titleSection} Documentation`;
- }
- else if (SECTION != 'index') return `${title} · ${SITENAME}`;
- return `${SITENAME} · ${SLOGAN}`;
- }
//- Get social image based on section and settings
- function getPageImage() {
- var img = (SECTION == 'api') ? 'api' : 'default';
- return `${SITE_URL}/assets/img/social/preview_${preview || img}.jpg`;
- }
//- Add prefixes to items of an array (for modifier CSS classes) //- Add prefixes to items of an array (for modifier CSS classes)
array - [array] list of class names or options, e.g. ["foot"] array - [array] list of class names or options, e.g. ["foot"]
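The three page helpers added in this hunk (`getPageUrl`, `getPageTitle`, `getPageImage`) can be sanity-checked outside the template. Below is a rough Python mirror of their logic, not the site's code. The value of `SITE_URL` is an assumption (the diff does not show it), and the function and parameter names are hypothetical stand-ins for the Jade globals `current.path`, `title`, `SECTION` and `preview`.

```python
# Python mirror of the Jade helper functions, for illustration only.
SITENAME = "spaCy"
SLOGAN = "Industrial-strength Natural Language Processing in Python"
SITE_URL = "https://spacy.io"  # assumed value, not shown in the diff


def get_page_url(path, site_url=SITE_URL):
    """Join path segments into a URL, dropping a trailing 'index' segment."""
    parts = list(path)
    if parts and parts[-1] == "index":
        parts = parts[:-1]
    return "{}/{}".format(site_url, "/".join(parts))


def get_page_title(title, section):
    """Build the page title depending on the section."""
    if section in ("api", "usage", "models"):
        # "api" is upper-cased as a whole, other sections are capitalised.
        title_section = "API" if section == "api" else section[0].upper() + section[1:]
        return "{} · {} {} Documentation".format(title, SITENAME, title_section)
    if section != "index":
        return "{} · {}".format(title, SITENAME)
    return "{} · {}".format(SITENAME, SLOGAN)


def get_page_image(section, preview=None, site_url=SITE_URL):
    """Pick the social preview image: explicit preview first, then section default."""
    img = "api" if section == "api" else "default"
    return "{}/assets/img/social/preview_{}.jpg".format(site_url, preview or img)


print(get_page_url(["usage", "index"]))
print(get_page_title("Doc", "api"))
print(get_page_image("usage"))
```

One difference from the template version: the Python sketch guards against an empty path, which the JavaScript helper does not need to do because indexing an empty array returns `undefined` rather than raising.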

View File

@ -7,7 +7,7 @@ include _functions
id - [string] anchor assigned to section (used for breadcrumb navigation) id - [string] anchor assigned to section (used for breadcrumb navigation)
mixin section(id) mixin section(id)
section.o-section(id="section-" + id data-section=id) section.o-section(id=id ? "section-" + id : null data-section=id)&attributes(attributes)
block block
@ -143,7 +143,7 @@ mixin aside-wrapper(label, emoji)
mixin aside(label, emoji) mixin aside(label, emoji)
+aside-wrapper(label, emoji) +aside-wrapper(label, emoji)
.c-aside__text.u-text-small .c-aside__text.u-text-small&attributes(attributes)
block block
@ -154,7 +154,7 @@ mixin aside(label, emoji)
prompt - [string] prompt displayed before first line, e.g. "$" prompt - [string] prompt displayed before first line, e.g. "$"
mixin aside-code(label, language, prompt) mixin aside-code(label, language, prompt)
+aside-wrapper(label) +aside-wrapper(label)&attributes(attributes)
+code(false, language, prompt).o-no-block +code(false, language, prompt).o-no-block
block block
@ -165,7 +165,7 @@ mixin aside-code(label, language, prompt)
argument to be able to wrap it for spacing argument to be able to wrap it for spacing
mixin infobox(label, emoji) mixin infobox(label, emoji)
aside.o-box.o-block.u-text-small aside.o-box.o-block.u-text-small&attributes(attributes)
if label if label
h3.u-heading.u-text-label.u-color-theme h3.u-heading.u-text-label.u-color-theme
if emoji if emoji
@ -242,7 +242,9 @@ mixin button(url, trusted, ...style)
wrap - [boolean] wrap text and disable horizontal scrolling wrap - [boolean] wrap text and disable horizontal scrolling
mixin code(label, language, prompt, height, icon, wrap) mixin code(label, language, prompt, height, icon, wrap)
pre.c-code-block.o-block(class="lang-#{(language || DEFAULT_SYNTAX)}" class=icon ? "c-code-block--has-icon" : null style=height ? "height: #{height}px" : null)&attributes(attributes) - var lang = (language != "none") ? (language || DEFAULT_SYNTAX) : null
- var lang_class = (language != "none") ? "lang-" + (language || DEFAULT_SYNTAX) : null
pre.c-code-block.o-block(data-language=lang class=lang_class class=icon ? "c-code-block--has-icon" : null style=height ? "height: #{height}px" : null)&attributes(attributes)
if label if label
h4.u-text-label.u-text-label--dark=label h4.u-text-label.u-text-label--dark=label
if icon if icon
@ -253,6 +255,15 @@ mixin code(label, language, prompt, height, icon, wrap)
code.c-code-block__content(class=wrap ? "u-wrap" : null data-prompt=prompt) code.c-code-block__content(class=wrap ? "u-wrap" : null data-prompt=prompt)
block block
//- Executable code
mixin code-exec(label, large)
- label = (label || "Editable code example") + " (experimental)"
+terminal-wrapper(label, !large)
figure.thebelab-wrapper
span.thebelab-wrapper__text.u-text-tiny v#{BINDER_VERSION} &middot; Python 3 &middot; via #[+a("https://mybinder.org/").u-hide-link Binder]
+code(data-executable="true")&attributes(attributes)
block
//- Wrapper for code blocks to display old/new versions //- Wrapper for code blocks to display old/new versions
@ -658,12 +669,16 @@ mixin qs(data, style)
//- Terminal-style code window //- Terminal-style code window
label - [string] title displayed in top bar of terminal window label - [string] title displayed in top bar of terminal window
mixin terminal(label, button_text, button_url) mixin terminal-wrapper(label, small)
.x-terminal .x-terminal(class=small ? "x-terminal--small" : null)
.x-terminal__icons: span .x-terminal__icons(class=small ? "x-terminal__icons--small" : null): span
.u-padding-small.u-text-label.u-text-center=label .u-padding-small.u-text-center(class=small ? "u-text-tiny" : "u-text")
strong=label
block
+code.x-terminal__code mixin terminal(label, button_text, button_url, exec)
+terminal-wrapper(label)
+code.x-terminal__code(data-executable=exec ? "" : null)
block block
if button_text && button_url if button_text && button_url

View File

@ -10,10 +10,7 @@ nav.c-nav.u-text.js-nav(class=landing ? "c-nav--theme" : null)
li.c-nav__menu__item(class=is_active ? "is-active" : null) li.c-nav__menu__item(class=is_active ? "is-active" : null)
+a(url)(tabindex=is_active ? "-1" : null)=item +a(url)(tabindex=is_active ? "-1" : null)=item
li.c-nav__menu__item.u-hidden-xs li.c-nav__menu__item
+a("https://survey.spacy.io", true) User Survey 2018
li.c-nav__menu__item.u-hidden-xs
+a(gh("spaCy"))(aria-label="GitHub") #[+icon("github", 20)] +a(gh("spaCy"))(aria-label="GitHub") #[+icon("github", 20)]
progress.c-progress.js-progress(value="0" max="1") progress.c-progress.js-progress(value="0" max="1")

View File

@ -1,77 +1,110 @@
//- 💫 INCLUDES > MODELS PAGE TEMPLATE //- 💫 INCLUDES > MODELS PAGE TEMPLATE
for id in CURRENT_MODELS for id in CURRENT_MODELS
- var comps = getModelComponents(id)
+section(id) +section(id)
+grid("vcenter").o-no-block(id=id) section(data-vue=id data-model=id)
+grid-col("two-thirds") +grid("vcenter").o-no-block(id=id)
+h(2) +grid-col("two-thirds")
+a("#" + id).u-permalink=id +h(2)
+a("#" + id).u-permalink=id
+grid-col("third").u-text-right +grid-col("third").u-text-right
.u-color-subtle.u-text-tiny .u-color-subtle.u-text-tiny
+button(gh("spacy-models") + "/releases", true, "secondary", "small")(data-tpl=id data-tpl-key="download") +button(gh("spacy-models") + "/releases", true, "secondary", "small")(v-bind:href="releaseUrl")
| Release details | Release details
.u-padding-small Latest: #[code(data-tpl=id data-tpl-key="version") n/a] .u-padding-small Latest: #[code(v-text="version") n/a]
+aside-code("Installation", "bash", "$"). +aside-code("Installation", "bash", "$").
python -m spacy download #{id} python -m spacy download #{id}
- var comps = getModelComponents(id) p(v-if="description" v-text="description")
p(data-tpl=id data-tpl-key="description") +infobox(v-if="error")
div(data-tpl=id data-tpl-key="error")
+infobox
| Unable to load model details from GitHub. To find out more | Unable to load model details from GitHub. To find out more
| about this model, see the overview of the | about this model, see the overview of the
| #[+a(gh("spacy-models") + "/releases") latest model releases]. | #[+a(gh("spacy-models") + "/releases") latest model releases].
+table.o-block-small(data-tpl=id data-tpl-key="table") +table.o-block-small(v-bind:data-loading="loading")
+row
+cell #[+label Language]
+cell #[+tag=comps.lang] #{LANGUAGES[comps.lang]}
for comp, label in {"Type": comps.type, "Genre": comps.genre}
+row +row
+cell #[+label=label] +cell #[+label Language]
+cell #[+tag=comp] #{MODEL_META[comp]} +cell #[+tag=comps.lang] #{LANGUAGES[comps.lang]}
+row for comp, label in {"Type": comps.type, "Genre": comps.genre}
+cell #[+label Size] +row
+cell #[+tag=comps.size] #[span(data-tpl=id data-tpl-key="size") #[em n/a]] +cell #[+label=label]
+cell #[+tag=comp] #{MODEL_META[comp]}
+row
+cell #[+label Size]
+cell #[+tag=comps.size] #[span(v-text="sizeFull" v-if="sizeFull")] #[em(v-else="") n/a]
each label in ["Pipeline", "Vectors", "Sources", "Author", "License"] +row(v-if="pipeline && pipeline.length" v-cloak="")
- var field = label.toLowerCase()
if field == "vectors"
- field = "vecs"
+row
+cell.u-nowrap
+label=label
if MODEL_META[field]
| #[+help(MODEL_META[field]).u-color-subtle]
+cell +cell
span(data-tpl=id data-tpl-key=field) #[em n/a] +label Pipeline #[+help(MODEL_META.pipeline).u-color-subtle]
+cell
span(v-for="(pipe, index) in pipeline" v-if="pipeline")
code(v-text="pipe")
span(v-if="index != pipeline.length - 1") ,&nbsp;
+row(data-tpl=id data-tpl-key="compat-wrapper" hidden="") +row(v-if="vectors" v-cloak="")
+cell +cell
+label Compat #[+help("Latest compatible model version for your spaCy installation").u-color-subtle] +label Vectors #[+help(MODEL_META.vectors).u-color-subtle]
+cell +cell(v-text="vectors")
.o-field.u-float-left
select.o-field__select.u-text-small(data-tpl=id data-tpl-key="compat")
div(data-tpl=id data-tpl-key="compat-versions") &nbsp;
section(data-tpl=id data-tpl-key="benchmarks" hidden="") +row(v-if="sources && sources.length" v-cloak="")
+grid.o-block-small +cell
+label Sources #[+help(MODEL_META.sources).u-color-subtle]
+cell
span(v-for="(source, index) in sources") {{ source }}
span(v-if="index != sources.length - 1") ,&nbsp;
+row(v-if="author" v-cloak="")
+cell #[+label Author]
+cell
+a("")(v-bind:href="url" v-if="url" v-text="author")
span(v-else="" v-text="author") {{ model.author }}
+row(v-if="license" v-cloak="")
+cell #[+label License]
+cell
+a("")(v-bind:href="modelLicenses[license]" v-if="modelLicenses[license]") {{ license }}
span(v-else="") {{ license }}
+row(v-cloak="")
+cell #[+label Compat #[+help(MODEL_META.compat).u-color-subtle]]
+cell
.o-field.u-float-left
select.o-field__select.u-text-small(v-model="spacyVersion")
option(v-for="version in orderedCompat" v-bind:value="version") spaCy v{{ version }}
code(v-if="compatVersion" v-text="compatVersion")
em(v-else="") not compatible
+grid.o-block-small(v-cloak="" v-if="hasAccuracy")
for keys, label in MODEL_BENCHMARKS for keys, label in MODEL_BENCHMARKS
.u-flex-full.u-padding-small(data-tpl=id data-tpl-key=label.toLowerCase() hidden="") .u-flex-full.u-padding-small
+table.o-block-small +table.o-block-small
+row("head") +row("head")
+head-cell(colspan="2")=(MODEL_META["benchmark_" + label] || label) +head-cell(colspan="2")=(MODEL_META["benchmark_" + label] || label)
for label, field in keys for label, field in keys
+row(hidden="") +row
+cell.u-nowrap +cell.u-nowrap
+label=label +label=label
if MODEL_META[field] if MODEL_META[field]
| #[+help(MODEL_META[field]).u-color-subtle] | #[+help(MODEL_META[field]).u-color-subtle]
+cell("num")(data-tpl=id data-tpl-key=field) +cell("num")
| n/a span(v-if="#{field}" v-text="#{field}")
em(v-if="!#{field}") n/a
p.u-text-small.u-color-dark(v-if="notes" v-text="notes" v-cloak="")
if comps.size == "sm" && EXAMPLE_SENT_LANGS.includes(comps.lang)
section
+code-exec("Test the model live").
import spacy
from spacy.lang.#{comps.lang}.examples import sentences
nlp = spacy.load('#{id}')
doc = nlp(sentences[0])
print(doc.text)
for token in doc:
print(token.text, token.pos_, token.dep_)
p.u-text-small.u-color-dark(data-tpl=id data-tpl-key="notes")

View File

@ -1,86 +1,33 @@
//- 💫 INCLUDES > SCRIPTS //- 💫 INCLUDES > SCRIPTS
if quickstart if IS_PAGE || SECTION == "index"
script(src="/assets/js/vendor/quickstart.min.js") script(type="text/x-thebe-config")
| { bootstrap: true, binderOptions: { repo: "#{KERNEL_BINDER}"},
| kernelOptions: { name: "#{KERNEL_PYTHON}" }}
if IS_PAGE - scripts = ["vendor/prism.min", "vendor/vue.min"]
script(src="/assets/js/vendor/in-view.min.js") - if (SECTION == "universe") scripts.push("vendor/vue-markdown.min")
- if (quickstart) scripts.push("vendor/quickstart.min")
- if (IS_PAGE) scripts.push("vendor/in-view.min")
- if (IS_PAGE || SECTION == "index") scripts.push("vendor/thebelab.custom.min")
for script in scripts
script(src="/assets/js/" + script + ".js")
script(src="/assets/js/main.js?v#{V_JS}" type=(environment == "deploy") ? null : "module")
if environment == "deploy" if environment == "deploy"
script(async src="https://www.google-analytics.com/analytics.js") script(src="https://www.google-analytics.com/analytics.js", async)
script
script(src="/assets/js/vendor/prism.min.js")
if compare_models
script(src="/assets/js/vendor/chart.min.js")
script
if quickstart
| new Quickstart("#qs");
if environment == "deploy"
| window.ga=window.ga||function(){ | window.ga=window.ga||function(){
| (ga.q=ga.q||[]).push(arguments)}; ga.l=+new Date; | (ga.q=ga.q||[]).push(arguments)}; ga.l=+new Date;
| ga('create', '#{ANALYTICS}', 'auto'); ga('send', 'pageview'); | ga('create', '#{ANALYTICS}', 'auto'); ga('send', 'pageview');
if IS_PAGE if IS_PAGE
script(src="https://sidecar.gitter.im/dist/sidecar.v1.js" async defer)
script
| ((window.gitter = {}).chat = {}).options = { | ((window.gitter = {}).chat = {}).options = {
| useStyles: false, | useStyles: false,
| activationElement: '.js-gitter-button', | activationElement: '.js-gitter-button',
| targetElement: '.js-gitter', | targetElement: '.js-gitter',
| room: '!{SOCIAL.gitter}' | room: '!{SOCIAL.gitter}'
| }; | };
if IS_PAGE
script(src="https://sidecar.gitter.im/dist/sidecar.v1.js" async defer)
//- JS modules slightly hacky, but necessary to dynamically instantiate the
classes with data from the Harp JSON files, while still being able to
support older browsers that can't handle JS modules. More details:
https://medium.com/dev-channel/es6-modules-in-chrome-canary-m60-ba588dfb8ab7
- ProgressBar = "new ProgressBar('.js-progress');"
- Accordion = "new Accordion('.js-accordion');"
- Changelog = "new Changelog('" + SOCIAL.github + "', 'spacy');"
- NavHighlighter = "new NavHighlighter('data-section', 'data-nav');"
- GitHubEmbed = "new GitHubEmbed('" + SOCIAL.github + "', 'data-gh-embed');"
- ModelLoader = "new ModelLoader('" + MODELS_REPO + "'," + JSON.stringify(CURRENT_MODELS) + "," + JSON.stringify(MODEL_LICENSES) + "," + JSON.stringify(MODEL_BENCHMARKS) + ");"
- ModelComparer = "new ModelComparer('" + MODELS_REPO + "'," + JSON.stringify(MODEL_LICENSES) + "," + JSON.stringify(MODEL_BENCHMARKS) + "," + JSON.stringify(LANGUAGES) + "," + JSON.stringify(MODEL_META) + "," + JSON.stringify(default_models || false) + ");"
if environment == "deploy"
//- DEPLOY: use compiled rollup.js and instantiate classes directly
script(src="/assets/js/rollup.js?v#{V_JS}")
script
!=ProgressBar
if changelog
!=Changelog
if IS_PAGE
!=NavHighlighter
!=GitHubEmbed
!=Accordion
if HAS_MODELS
!=ModelLoader
if compare_models
!=ModelComparer
else
//- DEVELOPMENT: Use ES6 modules
script(type="module")
| import ProgressBar from '/assets/js/progress.js';
!=ProgressBar
if changelog
| import Changelog from '/assets/js/changelog.js';
!=Changelog
if IS_PAGE
| import NavHighlighter from '/assets/js/nav-highlighter.js';
!=NavHighlighter
| import GitHubEmbed from '/assets/js/github-embed.js';
!=GitHubEmbed
| import Accordion from '/assets/js/accordion.js';
!=Accordion
if HAS_MODELS
| import { ModelLoader } from '/assets/js/models.js';
!=ModelLoader
if compare_models
| import { ModelComparer } from '/assets/js/models.js';
!=ModelComparer

View File

@ -7,6 +7,12 @@ svg(style="position: absolute; visibility: hidden; width: 0; height: 0;" width="
symbol#svg_github(viewBox="0 0 27 32") symbol#svg_github(viewBox="0 0 27 32")
path(d="M13.714 2.286q3.732 0 6.884 1.839t4.991 4.991 1.839 6.884q0 4.482-2.616 8.063t-6.759 4.955q-0.482 0.089-0.714-0.125t-0.232-0.536q0-0.054 0.009-1.366t0.009-2.402q0-1.732-0.929-2.536 1.018-0.107 1.83-0.321t1.679-0.696 1.446-1.188 0.946-1.875 0.366-2.688q0-2.125-1.411-3.679 0.661-1.625-0.143-3.643-0.5-0.161-1.446 0.196t-1.643 0.786l-0.679 0.429q-1.661-0.464-3.429-0.464t-3.429 0.464q-0.286-0.196-0.759-0.482t-1.491-0.688-1.518-0.241q-0.804 2.018-0.143 3.643-1.411 1.554-1.411 3.679 0 1.518 0.366 2.679t0.938 1.875 1.438 1.196 1.679 0.696 1.83 0.321q-0.696 0.643-0.875 1.839-0.375 0.179-0.804 0.268t-1.018 0.089-1.17-0.384-0.991-1.116q-0.339-0.571-0.866-0.929t-0.884-0.429l-0.357-0.054q-0.375 0-0.518 0.080t-0.089 0.205 0.161 0.25 0.232 0.214l0.125 0.089q0.393 0.179 0.777 0.679t0.563 0.911l0.179 0.411q0.232 0.679 0.786 1.098t1.196 0.536 1.241 0.125 0.991-0.063l0.411-0.071q0 0.679 0.009 1.58t0.009 0.973q0 0.321-0.232 0.536t-0.714 0.125q-4.143-1.375-6.759-4.955t-2.616-8.063q0-3.732 1.839-6.884t4.991-4.991 6.884-1.839zM5.196 21.982q0.054-0.125-0.125-0.214-0.179-0.054-0.232 0.036-0.054 0.125 0.125 0.214 0.161 0.107 0.232-0.036zM5.75 22.589q0.125-0.089-0.036-0.286-0.179-0.161-0.286-0.054-0.125 0.089 0.036 0.286 0.179 0.179 0.286 0.054zM6.286 23.393q0.161-0.125 0-0.339-0.143-0.232-0.304-0.107-0.161 0.089 0 0.321t0.304 0.125zM7.036 24.143q0.143-0.143-0.071-0.339-0.214-0.214-0.357-0.054-0.161 0.143 0.071 0.339 0.214 0.214 0.357 0.054zM8.054 24.589q0.054-0.196-0.232-0.286-0.268-0.071-0.339 0.125t0.232 0.268q0.268 0.107 0.339-0.107zM9.179 24.679q0-0.232-0.304-0.196-0.286 0-0.286 0.196 0 0.232 0.304 0.196 0.286 0 0.286-0.196zM10.214 24.5q-0.036-0.196-0.321-0.161-0.286 0.054-0.25 0.268t0.321 0.143 0.25-0.25z") path(d="M13.714 2.286q3.732 0 6.884 1.839t4.991 4.991 1.839 6.884q0 4.482-2.616 8.063t-6.759 4.955q-0.482 0.089-0.714-0.125t-0.232-0.536q0-0.054 0.009-1.366t0.009-2.402q0-1.732-0.929-2.536 1.018-0.107 1.83-0.321t1.679-0.696 1.446-1.188 0.946-1.875 
0.366-2.688q0-2.125-1.411-3.679 0.661-1.625-0.143-3.643-0.5-0.161-1.446 0.196t-1.643 0.786l-0.679 0.429q-1.661-0.464-3.429-0.464t-3.429 0.464q-0.286-0.196-0.759-0.482t-1.491-0.688-1.518-0.241q-0.804 2.018-0.143 3.643-1.411 1.554-1.411 3.679 0 1.518 0.366 2.679t0.938 1.875 1.438 1.196 1.679 0.696 1.83 0.321q-0.696 0.643-0.875 1.839-0.375 0.179-0.804 0.268t-1.018 0.089-1.17-0.384-0.991-1.116q-0.339-0.571-0.866-0.929t-0.884-0.429l-0.357-0.054q-0.375 0-0.518 0.080t-0.089 0.205 0.161 0.25 0.232 0.214l0.125 0.089q0.393 0.179 0.777 0.679t0.563 0.911l0.179 0.411q0.232 0.679 0.786 1.098t1.196 0.536 1.241 0.125 0.991-0.063l0.411-0.071q0 0.679 0.009 1.58t0.009 0.973q0 0.321-0.232 0.536t-0.714 0.125q-4.143-1.375-6.759-4.955t-2.616-8.063q0-3.732 1.839-6.884t4.991-4.991 6.884-1.839zM5.196 21.982q0.054-0.125-0.125-0.214-0.179-0.054-0.232 0.036-0.054 0.125 0.125 0.214 0.161 0.107 0.232-0.036zM5.75 22.589q0.125-0.089-0.036-0.286-0.179-0.161-0.286-0.054-0.125 0.089 0.036 0.286 0.179 0.179 0.286 0.054zM6.286 23.393q0.161-0.125 0-0.339-0.143-0.232-0.304-0.107-0.161 0.089 0 0.321t0.304 0.125zM7.036 24.143q0.143-0.143-0.071-0.339-0.214-0.214-0.357-0.054-0.161 0.143 0.071 0.339 0.214 0.214 0.357 0.054zM8.054 24.589q0.054-0.196-0.232-0.286-0.268-0.071-0.339 0.125t0.232 0.268q0.268 0.107 0.339-0.107zM9.179 24.679q0-0.232-0.304-0.196-0.286 0-0.286 0.196 0 0.232 0.304 0.196 0.286 0 0.286-0.196zM10.214 24.5q-0.036-0.196-0.321-0.161-0.286 0.054-0.25 0.268t0.321 0.143 0.25-0.25z")
symbol#svg_twitter(viewBox="0 0 30 32")
path(d="M28.929 7.286q-1.196 1.75-2.893 2.982 0.018 0.25 0.018 0.75 0 2.321-0.679 4.634t-2.063 4.437-3.295 3.759-4.607 2.607-5.768 0.973q-4.839 0-8.857-2.589 0.625 0.071 1.393 0.071 4.018 0 7.161-2.464-1.875-0.036-3.357-1.152t-2.036-2.848q0.589 0.089 1.089 0.089 0.768 0 1.518-0.196-2-0.411-3.313-1.991t-1.313-3.67v-0.071q1.214 0.679 2.607 0.732-1.179-0.786-1.875-2.054t-0.696-2.75q0-1.571 0.786-2.911 2.161 2.661 5.259 4.259t6.634 1.777q-0.143-0.679-0.143-1.321 0-2.393 1.688-4.080t4.080-1.688q2.5 0 4.214 1.821 1.946-0.375 3.661-1.393-0.661 2.054-2.536 3.179 1.661-0.179 3.321-0.893z")
symbol#svg_website(viewBox="0 0 32 32")
path(d="M22.658 10.988h5.172c0.693 1.541 1.107 3.229 1.178 5.012h-5.934c-0.025-1.884-0.181-3.544-0.416-5.012zM20.398 3.896c2.967 1.153 5.402 3.335 6.928 6.090h-4.836c-0.549-2.805-1.383-4.799-2.092-6.090zM16.068 9.986v-6.996c1.066 0.047 2.102 0.216 3.092 0.493 0.75 1.263 1.719 3.372 2.33 6.503h-5.422zM9.489 22.014c-0.234-1.469-0.396-3.119-0.421-5.012h5.998v5.012h-5.577zM9.479 10.988h5.587v5.012h-6.004c0.025-1.886 0.183-3.543 0.417-5.012zM11.988 3.461c0.987-0.266 2.015-0.435 3.078-0.469v6.994h-5.422c0.615-3.148 1.591-5.265 2.344-6.525zM3.661 9.986c1.551-2.8 4.062-4.993 7.096-6.131-0.715 1.29-1.559 3.295-2.114 6.131h-4.982zM8.060 16h-6.060c0.066-1.781 0.467-3.474 1.158-5.012h5.316c-0.233 1.469-0.39 3.128-0.414 5.012zM8.487 22.014h-5.29c-0.694-1.543-1.139-3.224-1.204-5.012h6.071c0.024 1.893 0.188 3.541 0.423 5.012zM8.651 23.016c0.559 2.864 1.416 4.867 2.134 6.142-3.045-1.133-5.557-3.335-7.11-6.142h4.976zM15.066 23.016v6.994c-1.052-0.033-2.067-0.199-3.045-0.46-0.755-1.236-1.736-3.363-2.356-6.534h5.401zM21.471 23.016c-0.617 3.152-1.592 5.271-2.344 6.512-0.979 0.271-2.006 0.418-3.059 0.465v-6.977h5.403zM16.068 17.002h5.998c-0.023 1.893-0.188 3.542-0.422 5.012h-5.576v-5.012zM22.072 16h-6.004v-5.012h5.586c0.235 1.469 0.393 3.126 0.418 5.012zM23.070 17.002h5.926c-0.066 1.787-0.506 3.468-1.197 5.012h-5.152c0.234-1.471 0.398-3.119 0.423-5.012zM27.318 23.016c-1.521 2.766-3.967 4.949-6.947 6.1 0.715-1.276 1.561-3.266 2.113-6.1h4.834z")
symbol#svg_code(viewBox="0 0 20 20") symbol#svg_code(viewBox="0 0 20 20")
path(d="M5.719 14.75c-0.236 0-0.474-0.083-0.664-0.252l-5.060-4.498 5.341-4.748c0.412-0.365 1.044-0.33 1.411 0.083s0.33 1.045-0.083 1.412l-3.659 3.253 3.378 3.002c0.413 0.367 0.45 0.999 0.083 1.412-0.197 0.223-0.472 0.336-0.747 0.336zM14.664 14.748l5.341-4.748-5.060-4.498c-0.413-0.367-1.045-0.33-1.411 0.083s-0.33 1.045 0.083 1.412l3.378 3.003-3.659 3.252c-0.413 0.367-0.45 0.999-0.083 1.412 0.197 0.223 0.472 0.336 0.747 0.336 0.236 0 0.474-0.083 0.664-0.252zM9.986 16.165l2-12c0.091-0.545-0.277-1.060-0.822-1.151-0.547-0.092-1.061 0.277-1.15 0.822l-2 12c-0.091 0.545 0.277 1.060 0.822 1.151 0.056 0.009 0.11 0.013 0.165 0.013 0.48 0 0.904-0.347 0.985-0.835z") path(d="M5.719 14.75c-0.236 0-0.474-0.083-0.664-0.252l-5.060-4.498 5.341-4.748c0.412-0.365 1.044-0.33 1.411 0.083s0.33 1.045-0.083 1.412l-3.659 3.253 3.378 3.002c0.413 0.367 0.45 0.999 0.083 1.412-0.197 0.223-0.472 0.336-0.747 0.336zM14.664 14.748l5.341-4.748-5.060-4.498c-0.413-0.367-1.045-0.33-1.411 0.083s-0.33 1.045 0.083 1.412l3.378 3.003-3.659 3.252c-0.413 0.367-0.45 0.999-0.083 1.412 0.197 0.223 0.472 0.336 0.747 0.336 0.236 0 0.474-0.083 0.664-0.252zM9.986 16.165l2-12c0.091-0.545-0.277-1.060-0.822-1.151-0.547-0.092-1.061 0.277-1.15 0.822l-2 12c-0.091 0.545 0.277 1.060 0.822 1.151 0.056 0.009 0.11 0.013 0.165 0.013 0.48 0 0.904-0.347 0.985-0.835z")

View File

@ -3,23 +3,15 @@
include _includes/_mixins include _includes/_mixins
- title = IS_MODELS ? LANGUAGES[current.source] || title : title - title = IS_MODELS ? LANGUAGES[current.source] || title : title
- social_title = (SECTION == "index") ? SITENAME + " - " + SLOGAN : title + " - " + SITENAME
- social_img = SITE_URL + "/assets/img/social/preview_" + (preview || ALPHA ? "alpha" : "default") + ".jpg" - PAGE_URL = getPageUrl()
- PAGE_TITLE = getPageTitle()
- PAGE_IMAGE = getPageImage()
doctype html doctype html
html(lang="en") html(lang="en")
head head
title title=PAGE_TITLE
if SECTION == "api" || SECTION == "usage" || SECTION == "models"
- var title_section = (SECTION == "api") ? "API" : SECTION.charAt(0).toUpperCase() + SECTION.slice(1)
| #{title} | #{SITENAME} #{title_section} Documentation
else if SECTION != "index"
| #{title} | #{SITENAME}
else
| #{SITENAME} - #{SLOGAN}
meta(charset="utf-8") meta(charset="utf-8")
meta(name="viewport" content="width=device-width, initial-scale=1.0") meta(name="viewport" content="width=device-width, initial-scale=1.0")
meta(name="referrer" content="always") meta(name="referrer" content="always")
@ -27,23 +19,24 @@ html(lang="en")
meta(property="og:type" content="website") meta(property="og:type" content="website")
meta(property="og:site_name" content=sitename) meta(property="og:site_name" content=sitename)
meta(property="og:url" content="#{SITE_URL}/#{current.path.join('/')}") meta(property="og:url" content=PAGE_URL)
meta(property="og:title" content=social_title) meta(property="og:title" content=PAGE_TITLE)
meta(property="og:description" content=description) meta(property="og:description" content=description)
meta(property="og:image" content=social_img) meta(property="og:image" content=PAGE_IMAGE)
meta(name="twitter:card" content="summary_large_image") meta(name="twitter:card" content="summary_large_image")
meta(name="twitter:site" content="@" + SOCIAL.twitter) meta(name="twitter:site" content="@" + SOCIAL.twitter)
meta(name="twitter:title" content=social_title) meta(name="twitter:title" content=PAGE_TITLE)
meta(name="twitter:description" content=description) meta(name="twitter:description" content=description)
meta(name="twitter:image" content=social_img) meta(name="twitter:image" content=PAGE_IMAGE)
link(rel="shortcut icon" href="/assets/img/favicon.ico") link(rel="shortcut icon" href="/assets/img/favicon.ico")
link(rel="icon" type="image/x-icon" href="/assets/img/favicon.ico") link(rel="icon" type="image/x-icon" href="/assets/img/favicon.ico")
if SECTION == "api" if SECTION == "api"
link(href="/assets/css/style_green.css?v#{V_CSS}" rel="stylesheet") link(href="/assets/css/style_green.css?v#{V_CSS}" rel="stylesheet")
else if SECTION == "universe"
link(href="/assets/css/style_purple.css?v#{V_CSS}" rel="stylesheet")
else else
link(href="/assets/css/style.css?v#{V_CSS}" rel="stylesheet") link(href="/assets/css/style.css?v#{V_CSS}" rel="stylesheet")
@ -54,6 +47,9 @@ html(lang="en")
if !landing if !landing
include _includes/_page-docs include _includes/_page-docs
else if SECTION == "universe"
!=yield
else else
main!=yield main!=yield
include _includes/_footer include _includes/_footer

View File

@ -29,7 +29,7 @@ p
+ud-row("NUM", "numeral", "1, 2017, one, seventy-seven, IV, MMXIV") +ud-row("NUM", "numeral", "1, 2017, one, seventy-seven, IV, MMXIV")
+ud-row("PART", "particle", "'s, not, ") +ud-row("PART", "particle", "'s, not, ")
+ud-row("PRON", "pronoun", "I, you, he, she, myself, themselves, somebody") +ud-row("PRON", "pronoun", "I, you, he, she, myself, themselves, somebody")
+ud-row("PROPN", "proper noun", "Mary, John, Londin, NATO, HBO") +ud-row("PROPN", "proper noun", "Mary, John, London, NATO, HBO")
+ud-row("PUNCT", "punctuation", "., (, ), ?") +ud-row("PUNCT", "punctuation", "., (, ), ?")
+ud-row("SCONJ", "subordinating conjunction", "if, while, that") +ud-row("SCONJ", "subordinating conjunction", "if, while, that")
+ud-row("SYM", "symbol", "$, %, §, ©, +, , ×, ÷, =, :), 😝") +ud-row("SYM", "symbol", "$, %, §, ©, +, , ×, ÷, =, :), 😝")

View File

@ -1,5 +1,13 @@
//- 💫 DOCS > API > ARCHITECTURE > NN MODEL ARCHITECTURE //- 💫 DOCS > API > ARCHITECTURE > NN MODEL ARCHITECTURE
p
| spaCy's statistical models have been custom-designed to give a
| high-performance mix of speed and accuracy. The current architecture
| hasn't been published yet, but in the meantime we prepared a video that
| explains how the models work, with particular focus on NER.
+youtube("sqDHBH9IjRU")
p p
| The parsing model is a blend of recent results. The two recent | The parsing model is a blend of recent results. The two recent
 | inspirations have been the work of Eli Kiperwasser and Yoav Goldberg at | inspirations have been the work of Eli Kiperwasser and Yoav Goldberg at
@ -44,7 +52,7 @@ p
+cell First two words of the buffer. +cell First two words of the buffer.
+row +row
+cell.u-nowrap +cell
| #[code S0L1], #[code S1L1], #[code S2L1], #[code B0L1], | #[code S0L1], #[code S1L1], #[code S2L1], #[code B0L1],
| #[code B1L1]#[br] | #[code B1L1]#[br]
| #[code S0L2], #[code S1L2], #[code S2L2], #[code B0L2], | #[code S0L2], #[code S1L2], #[code S2L2], #[code B0L2],
@ -54,7 +62,7 @@ p
| #[code S2], #[code B0] and #[code B1]. | #[code S2], #[code B0] and #[code B1].
+row +row
+cell.u-nowrap +cell
| #[code S0R1], #[code S1R1], #[code S2R1], #[code B0R1], | #[code S0R1], #[code S1R1], #[code S2R1], #[code B0R1],
| #[code B1R1]#[br] | #[code B1R1]#[br]
| #[code S0R2], #[code S1R2], #[code S2R2], #[code B0R2], | #[code S0R2], #[code S1R2], #[code S2R2], #[code B0R2],

View File

@ -6,8 +6,7 @@ p
| but somewhat ugly in Python. Logic that deals with Python or platform | but somewhat ugly in Python. Logic that deals with Python or platform
| compatibility only lives in #[code spacy.compat]. To distinguish them from | compatibility only lives in #[code spacy.compat]. To distinguish them from
| the builtin functions, replacement functions are suffixed with an | the builtin functions, replacement functions are suffixed with an
 | underscore, e.g. #[code unicode_]. For specific checks, spaCy uses the | underscore, e.g. #[code unicode_].
| #[code six] and #[code ftfy] packages.
+aside-code("Example"). +aside-code("Example").
from spacy.compat import unicode_, json_dumps from spacy.compat import unicode_, json_dumps
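The underscore-suffix convention described above could look roughly like the following. This is a minimal sketch of the pattern, not the contents of `spacy.compat`: the helper `to_text` is hypothetical, and only the naming idea (a trailing underscore to avoid shadowing builtins) is taken from the text.

```python
import sys

is_python2 = sys.version_info[0] == 2

# Replacement names carry a trailing underscore so they never shadow builtins.
if is_python2:
    unicode_ = unicode  # noqa: F821 (only defined on Python 2)
    basestring_ = basestring  # noqa: F821
else:
    unicode_ = str
    basestring_ = str


def to_text(value, encoding="utf8"):
    """Hypothetical helper: return value as text on either Python version."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return unicode_(value)


print(to_text(b"hello"), to_text(42))
```

Keeping the version check in one module means the rest of the code base can import `unicode_` and never branch on the interpreter version itself.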

View File

@ -533,8 +533,10 @@ p
+cell option +cell option
+cell +cell
| Optional location of vectors file. Should be a tab-separated | Optional location of vectors file. Should be a tab-separated
| file where the first column contains the word and the remaining | file in Word2Vec format where the first column contains the word
| columns the values. | and the remaining columns the values. File can be provided in
| #[code .txt] format or as a zipped text file in #[code .zip] or
| #[code .tar.gz] format.
+row +row
+cell #[code --prune-vectors], #[code -V] +cell #[code --prune-vectors], #[code -V]
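To make the vectors file layout described above concrete, here is a small sketch that reads a tab-separated file where the first column is the word and the remaining columns are the values. It is not spaCy's loader: `read_vectors` is a hypothetical name, the sample data is made up, and a header line or archive handling (`.zip`, `.tar.gz`) is not covered.

```python
import io


def read_vectors(file_):
    """Parse lines of the form 'word<TAB>v1<TAB>v2...' into a dict of float lists."""
    vectors = {}
    for line in file_:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        pieces = line.split("\t")
        word, values = pieces[0], pieces[1:]
        vectors[word] = [float(value) for value in values]
    return vectors


# Illustrative in-memory file; a real file would be opened with io.open(path, encoding="utf8").
sample = io.StringIO(u"apple\t0.1\t0.2\t0.3\norange\t0.4\t0.5\t0.6\n")
vectors = read_vectors(sample)
print(sorted(vectors), len(vectors["apple"]))
```

Every row is expected to have the same number of values. A stricter reader would probably check that and raise on a mismatch.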

View File

@ -31,6 +31,7 @@
$grid-gutter: 2rem $grid-gutter: 2rem
margin-top: $grid-gutter margin-top: $grid-gutter
min-width: 0 // hack to prevent overflow
@include breakpoint(min, lg) @include breakpoint(min, lg)
display: flex display: flex

View File

@ -60,6 +60,13 @@
padding-bottom: 4rem padding-bottom: 4rem
border-bottom: 1px dotted $color-subtle border-bottom: 1px dotted $color-subtle
&.o-section--small
overflow: auto
&:not(:last-child)
margin-bottom: 3.5rem
padding-bottom: 2rem
.o-block .o-block
margin-bottom: 4rem margin-bottom: 4rem
@ -142,6 +149,14 @@
.o-badge .o-badge
border-radius: 1em border-radius: 1em
.o-thumb
@include size(100px)
overflow: hidden
border-radius: 50%
&.o-thumb--small
@include size(35px)
//- SVG //- SVG

View File

@ -103,6 +103,9 @@
&:hover &:hover
color: $color-theme-dark color: $color-theme-dark
.u-hand
cursor: pointer
.u-hide-link.u-hide-link .u-hide-link.u-hide-link
border: none border: none
color: inherit color: inherit
@ -224,6 +227,7 @@
$spinner-size: 75px $spinner-size: 75px
$spinner-bar: 8px $spinner-bar: 8px
min-height: $spinner-size * 2
position: relative position: relative
& > * & > *
@ -245,10 +249,19 @@
//- Hidden elements //- Hidden elements
.u-hidden .u-hidden,
display: none [v-cloak]
display: none !important
@each $breakpoint in (xs, sm, md) @each $breakpoint in (xs, sm, md)
.u-hidden-#{$breakpoint}.u-hidden-#{$breakpoint} .u-hidden-#{$breakpoint}.u-hidden-#{$breakpoint}
@include breakpoint(max, $breakpoint) @include breakpoint(max, $breakpoint)
display: none display: none
//- Transitions
.u-fade-enter-active
transition: opacity 0.5s
.u-fade-enter
opacity: 0

View File

@ -2,7 +2,8 @@
//- Code block //- Code block
.c-code-block .c-code-block,
.thebelab-cell
background: $color-front background: $color-front
color: darken($color-back, 20) color: darken($color-back, 20)
padding: 0.75em 0 padding: 0.75em 0
@ -13,11 +14,11 @@
white-space: pre white-space: pre
direction: ltr direction: ltr
&.c-code-block--has-icon .c-code-block--has-icon
padding: 0 padding: 0
display: flex display: flex
border-top-left-radius: 0 border-top-left-radius: 0
border-bottom-left-radius: 0 border-bottom-left-radius: 0
.c-code-block__icon .c-code-block__icon
padding: 0 0 0 1rem padding: 0 0 0 1rem
@@ -28,26 +29,66 @@
&.c-code-block__icon--border &.c-code-block__icon--border
border-left: 6px solid border-left: 6px solid
//- Code block content //- Code block content
.c-code-block__content .c-code-block__content,
.thebelab-input,
.jp-OutputArea
display: block display: block
font: normal normal 1.1rem/#{1.9} $font-code font: normal normal 1.1rem/#{1.9} $font-code
padding: 1em 2em padding: 1em 2em
&[data-prompt]:before, .c-code-block__content[data-prompt]:before,
content: attr(data-prompt) content: attr(data-prompt)
margin-right: 0.65em margin-right: 0.65em
display: inline-block display: inline-block
vertical-align: middle vertical-align: middle
opacity: 0.5 opacity: 0.5
//- Thebelab
[data-executable]
margin-bottom: 0
.thebelab-input.thebelab-input
padding: 3em 2em 1em
.jp-OutputArea
&:not(:empty)
padding: 2rem 2rem 1rem
border-top: 1px solid $color-dark
margin-top: 2rem
.entities, svg
white-space: initial
font-family: inherit
.entities
font-size: 1.35rem
.jp-OutputArea pre
font: inherit
.jp-OutputPrompt.jp-OutputArea-prompt
padding-top: 0.5em
margin-right: 1rem
font-family: inherit
font-weight: bold
.thebelab-run-button
@extend .u-text-label, .u-text-label--dark
.thebelab-wrapper
position: relative
.thebelab-wrapper__text
@include position(absolute, top, right, 1.25rem, 1.25rem)
color: $color-subtle-dark
z-index: 10
//- Code //- Code
code code, .CodeMirror, .jp-RenderedText, .jp-OutputArea
-webkit-font-smoothing: subpixel-antialiased -webkit-font-smoothing: subpixel-antialiased
-moz-osx-font-smoothing: auto -moz-osx-font-smoothing: auto
@@ -73,7 +114,7 @@ code
text-shadow: none text-shadow: none
//- Syntax Highlighting //- Syntax Highlighting (Prism)
[class*="language-"] .token [class*="language-"] .token
&.comment, &.prolog, &.doctype, &.cdata, &.punctuation &.comment, &.prolog, &.doctype, &.cdata, &.punctuation
@@ -103,3 +144,50 @@ code
&.italic &.italic
font-style: italic font-style: italic
//- Syntax Highlighting (CodeMirror)
.CodeMirror.cm-s-default
background: $color-front
color: darken($color-back, 20)
.CodeMirror-selected
background: $color-theme
color: $color-back
.CodeMirror-cursor
border-left-color: currentColor
.cm-variable-2
color: inherit
font-style: italic
.cm-comment
color: map-get($syntax-highlighting, comment)
.cm-keyword, .cm-builtin
color: map-get($syntax-highlighting, keyword)
.cm-operator
color: map-get($syntax-highlighting, operator)
.cm-string
color: map-get($syntax-highlighting, selector)
.cm-number
color: map-get($syntax-highlighting, number)
.cm-def
color: map-get($syntax-highlighting, function)
//- Syntax highlighting (Jupyter)
.jp-RenderedText pre
.ansi-cyan-fg
color: map-get($syntax-highlighting, function)
.ansi-green-fg
color: $color-green
.ansi-red-fg
color: map-get($syntax-highlighting, operator)

@@ -8,10 +8,20 @@
width: 100% width: 100%
position: relative position: relative
&.x-terminal--small
background: $color-dark
color: $color-subtle
border-radius: 4px
margin-bottom: 4rem
.x-terminal__icons .x-terminal__icons
display: none
position: absolute position: absolute
padding: 10px padding: 10px
@include breakpoint(min, sm)
display: block
&:before, &:before,
&:after, &:after,
span span
@@ -32,6 +42,12 @@
content: "" content: ""
background: $color-yellow background: $color-yellow
&.x-terminal__icons--small
&:before,
&:after,
span
@include size(10px)
.x-terminal__code .x-terminal__code
margin: 0 margin: 0
border: none border: none

@@ -9,7 +9,7 @@
display: flex display: flex
justify-content: space-between justify-content: space-between
flex-flow: row nowrap flex-flow: row nowrap
padding: 0 2rem 0 1rem padding: 0 0 0 1rem
z-index: 30 z-index: 30
width: 100% width: 100%
box-shadow: $box-shadow box-shadow: $box-shadow
@@ -21,11 +21,20 @@
.c-nav__menu .c-nav__menu
@include size(100%) @include size(100%)
display: flex display: flex
justify-content: flex-end
flex-flow: row nowrap flex-flow: row nowrap
border-color: inherit border-color: inherit
flex: 1 flex: 1
@include breakpoint(max, sm)
@include scroll-shadow-base($color-front)
overflow-x: auto
overflow-y: hidden
-webkit-overflow-scrolling: touch
@include breakpoint(min, md)
justify-content: flex-end
.c-nav__menu__item .c-nav__menu__item
display: flex display: flex
align-items: center align-items: center
@@ -39,6 +48,14 @@
&:not(:first-child) &:not(:first-child)
margin-left: 2em margin-left: 2em
&:last-child
@include scroll-shadow-cover(right, $color-back)
padding-right: 2rem
&:first-child
@include scroll-shadow-cover(left, $color-back)
padding-left: 2rem
&.is-active &.is-active
color: $color-dark color: $color-dark
pointer-events: none pointer-events: none

@@ -26,7 +26,7 @@ $font-code: Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace
// Colors // Colors
$colors: ( blue: #09a3d5, green: #05b083 ) $colors: ( blue: #09a3d5, green: #05b083, purple: #6542d1 )
$color-back: #fff !default $color-back: #fff !default
$color-front: #1a1e23 !default $color-front: #1a1e23 !default

@@ -0,0 +1,4 @@
//- 💫 STYLESHEET (PURPLE)
$theme: purple
@import style

Some files were not shown because too many files have changed in this diff