Update develop from master

This commit is contained in:
Matthew Honnibal 2018-05-02 01:35:59 +00:00
commit 2338e8c7fc
159 changed files with 699781 additions and 2579 deletions

View File

@ -87,11 +87,11 @@ U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details

View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name |Matthew Upson |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date |2018-04-24 |
| GitHub username |ivyleavedtoadflax |
| Website (optional) |www.machinegurning.com|

106
.github/contributors/katrinleinweber.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Katrin Leinweber |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-03-30 |
| GitHub username | katrinleinweber |
| Website (optional) | |

106
.github/contributors/miroli.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Robin Linderborg |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-04-23 |
| GitHub username | miroli |
| Website (optional) | |

106
.github/contributors/mollerhoj.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jens Dahl Mollerhoj |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 4/04/2018 |
| GitHub username | mollerhoj |
| Website (optional) | |

106
.github/contributors/skrcode.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Suraj Rajan |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 31/Mar/2018 |
| GitHub username | skrcode |
| Website (optional) | |

106
.github/contributors/trungtv.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Viet-Trung Tran |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-03-28 |
| GitHub username | trungtv |
| Website (optional) | https://datalab.vn |

6
CITATION Normal file
View File

@ -0,0 +1,6 @@
@ARTICLE{spacy2,
AUTHOR = {Honnibal, Matthew AND Montani, Ines},
TITLE = {spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing},
YEAR = {2017},
JOURNAL = {To appear}
}

View File

@ -73,28 +73,8 @@ so it only becomes visible on click, making the issue easier to read and follow.
### Issue labels
To distinguish issues that are opened by us, the maintainers, we usually add a
💫 to the title. We also use the following system to tag our issues and pull
requests:
| Issue label | Description |
| --- | --- |
| [`bug`](https://github.com/explosion/spaCy/labels/bug) | Bugs and behaviour differing from documentation |
| [`enhancement`](https://github.com/explosion/spaCy/labels/enhancement) | Feature requests and improvements |
| [`install`](https://github.com/explosion/spaCy/labels/install) | Installation problems |
| [`performance`](https://github.com/explosion/spaCy/labels/performance) | Accuracy, speed and memory use problems |
| [`tests`](https://github.com/explosion/spaCy/labels/tests) | Missing or incorrect [tests](spacy/tests) |
| [`docs`](https://github.com/explosion/spaCy/labels/docs), [`examples`](https://github.com/explosion/spaCy/labels/examples) | Issues related to the [documentation](https://spacy.io/docs) and [examples](spacy/examples) |
| [`training`](https://github.com/explosion/spaCy/labels/training) | Issues related to training and updating models |
| [`models`](https://github.com/explosion/spaCy/labels/models), `language / [name]` | Issues related to the specific [models](https://github.com/explosion/spacy-models), languages and data |
| [`linux`](https://github.com/explosion/spaCy/labels/linux), [`osx`](https://github.com/explosion/spaCy/labels/osx), [`windows`](https://github.com/explosion/spaCy/labels/windows) | Issues related to the specific operating systems |
| [`pip`](https://github.com/explosion/spaCy/labels/pip), [`conda`](https://github.com/explosion/spaCy/labels/conda) | Issues related to the specific package managers |
| [`compat`](https://github.com/explosion/spaCy/labels/compat) | Cross-platform and cross-Python compatibility issues |
| [`wip`](https://github.com/explosion/spaCy/labels/wip) | Work in progress, mostly used for pull requests |
| [`v1`](https://github.com/explosion/spaCy/labels/v1) | Reports related to spaCy v1.x |
| [`duplicate`](https://github.com/explosion/spaCy/labels/duplicate) | Duplicates, i.e. issues that have been reported before |
| [`third-party`](https://github.com/explosion/spaCy/labels/third-party) | Issues related to third-party packages and services |
| [`meta`](https://github.com/explosion/spaCy/labels/meta) | Meta topics, e.g. repo organisation and issue management |
| [`help wanted`](https://github.com/explosion/spaCy/labels/help%20wanted), [`help wanted (easy)`](https://github.com/explosion/spaCy/labels/help%20wanted%20%28easy%29) | Requests for contributions |
💫 to the title. [See this page](https://github.com/explosion/spaCy/labels)
for an overview of the system we use to tag our issues and pull requests.
## Contributing to the code base
@ -220,7 +200,7 @@ All Python code must be written in an **intersection of Python 2 and Python 3**.
This is easy in Cython, but somewhat ugly in Python. Logic that deals with
Python or platform compatibility should only live in
[`spacy.compat`](spacy/compat.py). To distinguish them from the builtin
functions, replacement functions are suffixed with an undersocre, for example
functions, replacement functions are suffixed with an underscore, for example
`unicode_`. If you need to access the user's version or platform information,
for example to show more specific error messages, you can use the `is_config()`
helper function.

View File

@ -12,11 +12,11 @@ integration. It's commercial open-source software, released under the MIT licens
💫 **Version 2.0 out now!** `Check out the new features here. <https://spacy.io/usage/v2>`_
.. image:: https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square
.. image:: https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square&logo=travis
:target: https://travis-ci.org/explosion/spaCy
:alt: Build Status
.. image:: https://img.shields.io/appveyor/ci/explosion/spaCy/master.svg?style=flat-square
.. image:: https://img.shields.io/appveyor/ci/explosion/spaCy/master.svg?style=flat-square&logo=appveyor
:target: https://ci.appveyor.com/project/explosion/spaCy
:alt: Appveyor Build Status
@ -28,11 +28,11 @@ integration. It's commercial open-source software, released under the MIT licens
:target: https://pypi.python.org/pypi/spacy
:alt: pypi Version
.. image:: https://anaconda.org/conda-forge/spacy/badges/version.svg
.. image:: https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square
:target: https://anaconda.org/conda-forge/spacy
:alt: conda Version
.. image:: https://img.shields.io/badge/gitter-join%20chat%20%E2%86%92-09a3d5.svg?style=flat-square
.. image:: https://img.shields.io/badge/chat-join%20%E2%86%92-09a3d5.svg?style=flat-square&logo=gitter-white
:target: https://gitter.im/explosion/spaCy
:alt: spaCy on Gitter
@ -49,7 +49,7 @@ integration. It's commercial open-source software, released under the MIT licens
`New in v2.0`_ New features, backwards incompatibilities and migration guide.
`API Reference`_ The detailed reference for spaCy's API.
`Models`_ Download statistical language models for spaCy.
`Resources`_ Libraries, extensions, demos, books and courses.
`Universe`_ Libraries, extensions, demos, books and courses.
`Changelog`_ Changes and version history.
`Contribute`_ How to contribute to the spaCy project and code base.
=================== ===
@ -59,7 +59,7 @@ integration. It's commercial open-source software, released under the MIT licens
.. _Usage Guides: https://spacy.io/usage/
.. _API Reference: https://spacy.io/api/
.. _Models: https://spacy.io/models
.. _Resources: https://spacy.io/usage/resources
.. _Universe: https://spacy.io/universe
.. _Changelog: https://spacy.io/usage/#changelog
.. _Contribute: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md
@ -308,18 +308,20 @@ VS 2010 (Python 3.4) and VS 2015 (Python 3.5).
Run tests
=========
spaCy comes with an `extensive test suite <spacy/tests>`_. First, find out where
spaCy is installed:
spaCy comes with an `extensive test suite <spacy/tests>`_. In order to run the
tests, you'll usually want to clone the repository and build spaCy from source.
This will also install the required development dependencies and test utilities
defined in the ``requirements.txt``.
Alternatively, you can find out where spaCy is installed and run ``pytest`` on
that directory. Don't forget to also install the test utilities via spaCy's
``requirements.txt``:
.. code:: bash
python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"
Then run ``pytest`` on that directory. The flags ``--vectors``, ``--slow``
and ``--model`` are optional and enable additional tests:
.. code:: bash
# make sure you are using recent pytest version
python -m pip install -U pytest
pip install -r path/to/requirements.txt
python -m pytest <spacy-directory>
See `the documentation <https://spacy.io/usage/#tests>`_ for more details and
examples.

View File

@ -9,6 +9,7 @@ coordinates. Can be extended with more details from the API.
* Custom pipeline components: https://spacy.io//usage/processing-pipelines#custom-components
Compatible with: spaCy v2.0.0+
Prerequisites: pip install requests
"""
from __future__ import unicode_literals, print_function

View File

@ -81,7 +81,6 @@ def main(model=None, new_model_name='animal', output_dir=None, n_iter=20):
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
# Add entity recognizer to model if it's not in the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'ner' not in nlp.pipe_names:
@ -92,11 +91,18 @@ def main(model=None, new_model_name='animal', output_dir=None, n_iter=20):
ner = nlp.get_pipe('ner')
ner.add_label(LABEL) # add new entity label to entity recognizer
if model is None:
optimizer = nlp.begin_training()
else:
# Note that 'begin_training' initializes the models, so it'll zero out
# existing entity types.
optimizer = nlp.entity.create_optimizer()
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes): # only train NER
optimizer = nlp.begin_training()
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}

View File

@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding: utf8
"""Train a multi-label convolutional neural network text classifier on the
"""Train a convolutional neural network text classifier on the
IMDB dataset, using the TextCategorizer component. The dataset will be loaded
automatically via Thinc's built-in dataset loader. The model is added to
spacy.pipeline, and predictions are available via `doc.cats`. For more details,

29
fabfile.py vendored
View File

@ -6,6 +6,7 @@ from pathlib import Path
from fabric.api import local, lcd, env, settings, prefix
from os import path, environ
import shutil
import sys
PWD = path.dirname(__file__)
@ -90,3 +91,31 @@ def train():
args = environ.get('SPACY_TRAIN_ARGS', '')
with virtualenv(VENV_DIR) as venv_local:
venv_local('spacy train {args}'.format(args=args))
def conll17(treebank_dir, experiment_dir, vectors_dir, config, corpus=''):
is_not_clean = local('git status --porcelain', capture=True)
if is_not_clean:
print("Repository is not clean")
print(is_not_clean)
sys.exit(1)
git_sha = local('git rev-parse --short HEAD', capture=True)
config_checksum = local('sha256sum {config}'.format(config=config), capture=True)
experiment_dir = Path(experiment_dir) / '{}--{}'.format(config_checksum[:6], git_sha)
if not experiment_dir.exists():
experiment_dir.mkdir()
test_data_dir = Path(treebank_dir) / 'ud-test-v2.0-conll2017'
assert test_data_dir.exists()
assert test_data_dir.is_dir()
if corpus:
corpora = [corpus]
else:
corpora = ['UD_English', 'UD_Chinese', 'UD_Japanese', 'UD_Vietnamese']
local('cp {config} {experiment_dir}/config.json'.format(config=config, experiment_dir=experiment_dir))
with virtualenv(VENV_DIR) as venv_local:
for corpus in corpora:
venv_local('spacy ud-train {treebank_dir} {experiment_dir} {config} {corpus} -v {vectors_dir}'.format(
treebank_dir=treebank_dir, experiment_dir=experiment_dir, config=config, corpus=corpus, vectors_dir=vectors_dir))
venv_local('spacy ud-run-test {test_data_dir} {experiment_dir} {corpus}'.format(
test_data_dir=test_data_dir, experiment_dir=experiment_dir, config=config, corpus=corpus))

View File

@ -3,16 +3,12 @@ pathlib
numpy>=1.7
cymem>=1.30,<1.32
preshed>=1.0.0,<2.0.0
thinc>=6.11.1.dev11,<6.12.0
thinc>=6.11.1.dev12,<6.12.0
murmurhash>=0.28,<0.29
cytoolz>=0.9.0,<0.10.0
plac<1.0.0,>=0.9.6
ujson>=1.35
dill>=0.2,<0.3
requests>=2.13.0,<3.0.0
regex==2017.4.5
ftfy>=4.4.2,<5.0.0
pytest>=3.0.6,<4.0.0
mock>=2.0.0,<3.0.0
msgpack-python==0.5.4
msgpack-numpy==0.4.1

View File

@ -38,6 +38,7 @@ MOD_NAMES = [
'spacy.tokens.doc',
'spacy.tokens.span',
'spacy.tokens.token',
'spacy.tokens._retokenize',
'spacy.matcher',
'spacy.syntax.ner',
'spacy.symbols',
@ -194,12 +195,7 @@ def setup_package():
'plac<1.0.0,>=0.9.6',
'pathlib',
'ujson>=1.35',
'dill>=0.2,<0.3',
'requests>=2.13.0,<3.0.0',
'regex==2017.4.5',
'ftfy>=4.4.2,<5.0.0',
'msgpack-python==0.5.4',
'msgpack-numpy==0.4.1'],
'dill>=0.2,<0.3'],
setup_requires=['wheel'],
classifiers=[
'Development Status :: 5 - Production/Stable',

View File

@ -4,18 +4,14 @@ from __future__ import unicode_literals
from .cli.info import info as cli_info
from .glossary import explain
from .about import __version__
from .errors import Warnings, deprecation_warning
from . import util
def load(name, **overrides):
depr_path = overrides.get('path')
if depr_path not in (True, False, None):
util.deprecated(
"As of spaCy v2.0, the keyword argument `path=` is deprecated. "
"You can now call spacy.load with the path as its first argument, "
"and the model's meta.json will be used to determine the language "
"to load. For example:\nnlp = spacy.load('{}')".format(depr_path),
'error')
deprecation_warning(Warnings.W001.format(path=depr_path))
return util.load_model(name, **overrides)

View File

@ -23,6 +23,7 @@ from thinc.neural._classes.affine import _set_dimensions_if_needed
import thinc.extra.load_nlp
from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE
from .errors import Errors
from . import util
@ -42,8 +43,8 @@ def cosine(vec1, vec2):
def create_default_optimizer(ops, **cfg):
learn_rate = util.env_opt('learn_rate', 0.001)
beta1 = util.env_opt('optimizer_B1', 0.9)
beta2 = util.env_opt('optimizer_B2', 0.999)
eps = util.env_opt('optimizer_eps', 1e-08)
beta2 = util.env_opt('optimizer_B2', 0.9)
eps = util.env_opt('optimizer_eps', 1e-12)
L2 = util.env_opt('L2_penalty', 1e-6)
max_grad_norm = util.env_opt('grad_norm_clip', 1.)
optimizer = Adam(ops, learn_rate, L2=L2, beta1=beta1,
@ -194,14 +195,15 @@ class PrecomputableAffine(Model):
size=tokvecs.size).reshape(tokvecs.shape)
def predict(ids, tokvecs):
# nS ids. nW tokvecs
hiddens = model(tokvecs) # (nW, f, o, p)
# nS ids. nW tokvecs. Exclude the padding array.
hiddens = model(tokvecs[:-1]) # (nW, f, o, p)
vectors = model.ops.allocate((ids.shape[0], model.nO * model.nP), dtype='f')
# need nS vectors
vectors = model.ops.allocate((ids.shape[0], model.nO, model.nP))
for i, feats in enumerate(ids):
for j, id_ in enumerate(feats):
vectors[i] += hiddens[id_, j]
hiddens = hiddens.reshape((hiddens.shape[0] * model.nF, model.nO * model.nP))
model.ops.scatter_add(vectors, ids.flatten(), hiddens)
vectors = vectors.reshape((vectors.shape[0], model.nO, model.nP))
vectors += model.b
vectors = model.ops.asarray(vectors)
if model.nP >= 2:
return model.ops.maxout(vectors)[0]
else:
@ -225,6 +227,11 @@ class PrecomputableAffine(Model):
def link_vectors_to_models(vocab):
vectors = vocab.vectors
if vectors.name is None:
vectors.name = VECTORS_KEY
print(
"Warning: Unnamed vectors -- this won't allow multiple vectors "
"models to be loaded. (Shape: (%d, %d))" % vectors.data.shape)
ops = Model.ops
for word in vocab:
if word.orth in vectors.key2row:
@ -234,11 +241,11 @@ def link_vectors_to_models(vocab):
data = ops.asarray(vectors.data)
# Set an entry here, so that vectors are accessed by StaticVectors
# (unideal, I know)
thinc.extra.load_nlp.VECTORS[(ops.device, VECTORS_KEY)] = data
thinc.extra.load_nlp.VECTORS[(ops.device, vectors.name)] = data
def Tok2Vec(width, embed_size, **kwargs):
pretrained_dims = kwargs.get('pretrained_dims', 0)
pretrained_vectors = kwargs.get('pretrained_vectors', None)
cnn_maxout_pieces = kwargs.get('cnn_maxout_pieces', 2)
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
with Model.define_operators({'>>': chain, '|': concatenate, '**': clone,
@ -251,16 +258,16 @@ def Tok2Vec(width, embed_size, **kwargs):
name='embed_suffix')
shape = HashEmbed(width, embed_size//2, column=cols.index(SHAPE),
name='embed_shape')
if pretrained_dims is not None and pretrained_dims >= 1:
glove = StaticVectors(VECTORS_KEY, width, column=cols.index(ID))
if pretrained_vectors is not None:
glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))
embed = uniqued(
(glove | norm | prefix | suffix | shape)
>> LN(Maxout(width, width*5, pieces=3)), column=5)
>> LN(Maxout(width, width*5, pieces=3)), column=cols.index(ORTH))
else:
embed = uniqued(
(norm | prefix | suffix | shape)
>> LN(Maxout(width, width*4, pieces=3)), column=5)
>> LN(Maxout(width, width*4, pieces=3)), column=cols.index(ORTH))
convolution = Residual(
ExtractWindow(nW=1)
@ -318,10 +325,10 @@ def _divide_array(X, size):
def get_col(idx):
assert idx >= 0, idx
if idx < 0:
raise IndexError(Errors.E066.format(value=idx))
def forward(X, drop=0.):
assert idx >= 0, idx
if isinstance(X, numpy.ndarray):
ops = NumpyOps()
else:
@ -329,7 +336,6 @@ def get_col(idx):
output = ops.xp.ascontiguousarray(X[:, idx], dtype=X.dtype)
def backward(y, sgd=None):
assert idx >= 0, idx
dX = ops.allocate(X.shape)
dX[:, idx] += y
return dX
@ -416,13 +422,13 @@ def build_tagger_model(nr_class, **cfg):
token_vector_width = cfg['token_vector_width']
else:
token_vector_width = util.env_opt('token_vector_width', 128)
pretrained_dims = cfg.get('pretrained_dims', 0)
pretrained_vectors = cfg.get('pretrained_vectors')
with Model.define_operators({'>>': chain, '+': add}):
if 'tok2vec' in cfg:
tok2vec = cfg['tok2vec']
else:
tok2vec = Tok2Vec(token_vector_width, embed_size,
pretrained_dims=pretrained_dims)
pretrained_vectors=pretrained_vectors)
softmax = with_flatten(Softmax(nr_class, token_vector_width))
model = (
tok2vec

View File

@ -11,7 +11,6 @@ __email__ = 'contact@explosion.ai'
__license__ = 'MIT'
__release__ = False
__docs_models__ = 'https://spacy.io/usage/models'
__download_url__ = 'https://github.com/explosion/spacy-models/releases/download'
__compatibility__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json'
__shortcuts__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts-v2.json'

74
spacy/cli/_messages.py Normal file
View File

@ -0,0 +1,74 @@
# coding: utf8
from __future__ import unicode_literals
class Messages(object):
M001 = ("Download successful but linking failed")
M002 = ("Creating a shortcut link for 'en' didn't work (maybe you "
"don't have admin permissions?), but you can still load the "
"model via its full package name: nlp = spacy.load('{name}')")
M003 = ("Server error ({code}: {desc})")
M004 = ("Couldn't fetch {desc}. Please find a model for your spaCy "
"installation (v{version}), and download it manually. For more "
"details, see the documentation: https://spacy.io/usage/models")
M005 = ("Compatibility error")
M006 = ("No compatible models found for v{version} of spaCy.")
M007 = ("No compatible model found for '{name}' (spaCy v{version}).")
M008 = ("Can't locate model data")
M009 = ("The data should be located in {path}")
M010 = ("Can't find the spaCy data path to create model symlink")
M011 = ("Make sure a directory `/data` exists within your spaCy "
"installation and try again. The data directory should be "
"located here:")
M012 = ("Link '{name}' already exists")
M013 = ("To overwrite an existing link, use the --force flag.")
M014 = ("Can't overwrite symlink '{name}'")
M015 = ("This can happen if your data directory contains a directory or "
"file of the same name.")
M016 = ("Error: Couldn't link model to '{name}'")
M017 = ("Creating a symlink in spacy/data failed. Make sure you have the "
"required permissions and try re-running the command as admin, or "
"use a virtualenv. You can still import the model as a module and "
"call its load() method, or create the symlink manually.")
M018 = ("Linking successful")
M019 = ("You can now load the model via spacy.load('{name}')")
M020 = ("Can't find model meta.json")
M021 = ("Couldn't fetch compatibility table.")
M022 = ("Can't find spaCy v{version} in compatibility table")
M023 = ("Installed models (spaCy v{version})")
M024 = ("No models found in your current environment.")
M025 = ("Use the following commands to update the model packages:")
M026 = ("The following models are not available for spaCy "
"v{version}: {models}")
M027 = ("You may also want to overwrite the incompatible links using the "
"`python -m spacy link` command with `--force`, or remove them "
"from the data directory. Data path: {path}")
M028 = ("Input file not found")
M029 = ("Output directory not found")
M030 = ("Unknown format")
M031 = ("Can't find converter for {converter}")
M032 = ("Generated output file {name}")
M033 = ("Created {n_docs} documents")
M034 = ("Evaluation data not found")
M035 = ("Visualization output directory not found")
M036 = ("Generated {n} parses as HTML")
M037 = ("Can't find words frequencies file")
M038 = ("Sucessfully compiled vocab")
M039 = ("{entries} entries, {vectors} vectors")
M040 = ("Output directory not found")
M041 = ("Loaded meta.json from file")
M042 = ("Successfully created package '{name}'")
M043 = ("To build the package, run `python setup.py sdist` in this "
"directory.")
M044 = ("Package directory already exists")
M045 = ("Please delete the directory and try again, or use the `--force` "
"flag to overwrite existing directories.")
M046 = ("Generating meta.json")
M047 = ("Enter the package settings for your model. The following "
"information will be read from your model data: pipeline, vectors.")
M048 = ("No '{key}' setting found in meta.json")
M049 = ("This setting is required to build your package.")
M050 = ("Training data not found")
M051 = ("Development data not found")
M052 = ("Not a valid meta.json format")
M053 = ("Expected dict but got: {meta_type}")

View File

@ -5,6 +5,7 @@ import plac
from pathlib import Path
from .converters import conllu2json, iob2json, conll_ner2json
from ._messages import Messages
from ..util import prints
# Converters are matched by file extension. To add a converter, add a new
@ -32,14 +33,14 @@ def convert(input_file, output_dir, n_sents=1, morphology=False, converter='auto
input_path = Path(input_file)
output_path = Path(output_dir)
if not input_path.exists():
prints(input_path, title="Input file not found", exits=1)
prints(input_path, title=Messages.M028, exits=1)
if not output_path.exists():
prints(output_path, title="Output directory not found", exits=1)
prints(output_path, title=Messages.M029, exits=1)
if converter == 'auto':
converter = input_path.suffix[1:]
if converter not in CONVERTERS:
prints("Can't find converter for %s" % converter,
title="Unknown format", exits=1)
prints(Messages.M031.format(converter=converter),
title=Messages.M030, exits=1)
func = CONVERTERS[converter]
func(input_path, output_path,
n_sents=n_sents, use_morphology=morphology)

View File

@ -1,6 +1,7 @@
# coding: utf8
from __future__ import unicode_literals
from .._messages import Messages
from ...compat import json_dumps, path2str
from ...util import prints
from ...gold import iob_to_biluo
@ -18,8 +19,8 @@ def conll_ner2json(input_path, output_path, n_sents=10, use_morphology=False):
output_file = output_path / output_filename
with output_file.open('w', encoding='utf-8') as f:
f.write(json_dumps(docs))
prints("Created %d documents" % len(docs),
title="Generated output file %s" % path2str(output_file))
prints(Messages.M033.format(n_docs=len(docs)),
title=Messages.M032.format(name=path2str(output_file)))
def read_conll_ner(input_path):

View File

@ -1,6 +1,7 @@
# coding: utf8
from __future__ import unicode_literals
from .._messages import Messages
from ...compat import json_dumps, path2str
from ...util import prints
@ -32,8 +33,8 @@ def conllu2json(input_path, output_path, n_sents=10, use_morphology=False):
output_file = output_path / output_filename
with output_file.open('w', encoding='utf-8') as f:
f.write(json_dumps(docs))
prints("Created %d documents" % len(docs),
title="Generated output file %s" % path2str(output_file))
prints(Messages.M033.format(n_docs=len(docs)),
title=Messages.M032.format(name=path2str(output_file)))
def read_conllx(input_path, use_morphology=False, n=0):

View File

@ -2,6 +2,7 @@
from __future__ import unicode_literals
from cytoolz import partition_all, concat
from .._messages import Messages
from ...compat import json_dumps, path2str
from ...util import prints
from ...gold import iob_to_biluo
@ -18,8 +19,8 @@ def iob2json(input_path, output_path, n_sents=10, *a, **k):
output_file = output_path / output_filename
with output_file.open('w', encoding='utf-8') as f:
f.write(json_dumps(docs))
prints("Created %d documents" % len(docs),
title="Generated output file %s" % path2str(output_file))
prints(Messages.M033.format(n_docs=len(docs)),
title=Messages.M032.format(name=path2str(output_file)))
def read_iob(raw_sents):

View File

@ -2,13 +2,15 @@
from __future__ import unicode_literals
import plac
import requests
import os
import subprocess
import sys
import ujson
from .link import link
from ._messages import Messages
from ..util import prints, get_package_path
from ..compat import url_read, HTTPError
from .. import about
@ -31,9 +33,7 @@ def download(model, direct=False):
version = get_version(model_name, compatibility)
dl = download_model('{m}-{v}/{m}-{v}.tar.gz'.format(m=model_name,
v=version))
if dl != 0:
# if download subprocess doesn't return 0, exit with the respective
# exit code before doing anything else
if dl != 0: # if download subprocess doesn't return 0, exit
sys.exit(dl)
try:
# Get package path here because link uses
@ -47,22 +47,16 @@ def download(model, direct=False):
# Dirty, but since spacy.download and the auto-linking is
# mostly a convenience wrapper, it's best to show a success
# message and loading instructions, even if linking fails.
prints(
"Creating a shortcut link for 'en' didn't work (maybe "
"you don't have admin permissions?), but you can still "
"load the model via its full package name:",
"nlp = spacy.load('%s')" % model_name,
title="Download successful but linking failed")
prints(Messages.M001.format(name=model_name), title=Messages.M002)
def get_json(url, desc):
r = requests.get(url)
if r.status_code != 200:
msg = ("Couldn't fetch %s. Please find a model for your spaCy "
"installation (v%s), and download it manually.")
prints(msg % (desc, about.__version__), about.__docs_models__,
title="Server error (%d)" % r.status_code, exits=1)
return r.json()
try:
data = url_read(url)
except HTTPError as e:
prints(Messages.M004.format(desc, about.__version__),
title=Messages.M003.format(e.code, e.reason), exits=1)
return ujson.loads(data)
def get_compatibility():
@ -71,17 +65,16 @@ def get_compatibility():
comp_table = get_json(about.__compatibility__, "compatibility table")
comp = comp_table['spacy']
if version not in comp:
prints("No compatible models found for v%s of spaCy." % version,
title="Compatibility error", exits=1)
prints(Messages.M006.format(version=version), title=Messages.M005,
exits=1)
return comp[version]
def get_version(model, comp):
model = model.rsplit('.dev', 1)[0]
if model not in comp:
version = about.__version__
msg = "No compatible model found for '%s' (spaCy v%s)."
prints(msg % (model, version), title="Compatibility error", exits=1)
prints(Messages.M007.format(name=model, version=about.__version__),
title=Messages.M005, exits=1)
return comp[model][0]

View File

@ -4,6 +4,7 @@ from __future__ import unicode_literals, division, print_function
import plac
from timeit import default_timer as timer
from ._messages import Messages
from ..gold import GoldCorpus
from ..util import prints
from .. import util
@ -33,10 +34,9 @@ def evaluate(model, data_path, gpu_id=-1, gold_preproc=False, displacy_path=None
data_path = util.ensure_path(data_path)
displacy_path = util.ensure_path(displacy_path)
if not data_path.exists():
prints(data_path, title="Evaluation data not found", exits=1)
prints(data_path, title=Messages.M034, exits=1)
if displacy_path and not displacy_path.exists():
prints(displacy_path, title="Visualization output directory not found",
exits=1)
prints(displacy_path, title=Messages.M035, exits=1)
corpus = GoldCorpus(data_path, data_path)
nlp = util.load_model(model)
dev_docs = list(corpus.dev_docs(nlp, gold_preproc=gold_preproc))
@ -52,8 +52,7 @@ def evaluate(model, data_path, gpu_id=-1, gold_preproc=False, displacy_path=None
render_ents = 'ner' in nlp.meta.get('pipeline', [])
render_parses(docs, displacy_path, model_name=model,
limit=displacy_limit, deps=render_deps, ents=render_ents)
msg = "Generated %s parses as HTML" % displacy_limit
prints(displacy_path, title=msg)
prints(displacy_path, title=Messages.M036.format(n=displacy_limit))
def render_parses(docs, output_path, model_name='', limit=250, deps=True,

View File

@ -5,15 +5,17 @@ import plac
import platform
from pathlib import Path
from ._messages import Messages
from ..compat import path2str
from .. import about
from .. import util
from .. import about
@plac.annotations(
model=("optional: shortcut link of model", "positional", None, str),
markdown=("generate Markdown for GitHub issues", "flag", "md", str))
def info(model=None, markdown=False):
markdown=("generate Markdown for GitHub issues", "flag", "md", str),
silent=("don't print anything (just return)", "flag", "s"))
def info(model=None, markdown=False, silent=False):
"""Print info about spaCy installation. If a model shortcut link is
speficied as an argument, print model information. Flag --markdown
prints details in Markdown for easy copy-pasting to GitHub issues.
@ -25,21 +27,24 @@ def info(model=None, markdown=False):
model_path = util.get_data_path() / model
meta_path = model_path / 'meta.json'
if not meta_path.is_file():
util.prints(meta_path, title="Can't find model meta.json", exits=1)
util.prints(meta_path, title=Messages.M020, exits=1)
meta = util.read_json(meta_path)
if model_path.resolve() != model_path:
meta['link'] = path2str(model_path)
meta['source'] = path2str(model_path.resolve())
else:
meta['source'] = path2str(model_path)
print_info(meta, 'model %s' % model, markdown)
else:
data = {'spaCy version': about.__version__,
'Location': path2str(Path(__file__).parent.parent),
'Platform': platform.platform(),
'Python version': platform.python_version(),
'Models': list_models()}
if not silent:
print_info(meta, 'model %s' % model, markdown)
return meta
data = {'spaCy version': about.__version__,
'Location': path2str(Path(__file__).parent.parent),
'Platform': platform.platform(),
'Python version': platform.python_version(),
'Models': list_models()}
if not silent:
print_info(data, 'spaCy', markdown)
return data
def print_info(data, title, markdown):

View File

@ -12,10 +12,16 @@ import tarfile
import gzip
import zipfile
from ..compat import fix_text
from ._messages import Messages
from ..vectors import Vectors
from ..errors import Errors, Warnings, user_warning
from ..util import prints, ensure_path, get_lang_class
try:
import ftfy
except ImportError:
ftfy = None
@plac.annotations(
lang=("model language", "positional", None, str),
@ -23,27 +29,26 @@ from ..util import prints, ensure_path, get_lang_class
freqs_loc=("location of words frequencies file", "positional", None, Path),
clusters_loc=("optional: location of brown clusters data",
"option", "c", str),
vectors_loc=("optional: location of vectors file in GenSim text format",
"option", "v", str),
vectors_loc=("optional: location of vectors file in Word2Vec format "
"(either as .txt or zipped as .zip or .tar.gz)", "option",
"v", str),
prune_vectors=("optional: number of vectors to prune to",
"option", "V", int)
)
def init_model(lang, output_dir, freqs_loc=None, clusters_loc=None, vectors_loc=None, prune_vectors=-1):
def init_model(lang, output_dir, freqs_loc=None, clusters_loc=None,
vectors_loc=None, prune_vectors=-1):
"""
Create a new model from raw data, like word frequencies, Brown clusters
and word vectors.
"""
if freqs_loc is not None and not freqs_loc.exists():
prints(freqs_loc, title="Can't find words frequencies file", exits=1)
prints(freqs_loc, title=Messages.M037, exits=1)
clusters_loc = ensure_path(clusters_loc)
vectors_loc = ensure_path(vectors_loc)
probs, oov_prob = read_freqs(freqs_loc) if freqs_loc is not None else ({}, -20)
vectors_data, vector_keys = read_vectors(vectors_loc) if vectors_loc else (None, None)
clusters = read_clusters(clusters_loc) if clusters_loc else {}
nlp = create_model(lang, probs, oov_prob, clusters, vectors_data, vector_keys, prune_vectors)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
@ -71,7 +76,6 @@ def create_model(lang, probs, oov_prob, clusters, vectors_data, vector_keys, pru
nlp = lang_class()
for lexeme in nlp.vocab:
lexeme.rank = 0
lex_added = 0
for i, (word, prob) in enumerate(tqdm(sorted(probs.items(), key=lambda item: item[1], reverse=True))):
lexeme = nlp.vocab[word]
@ -91,15 +95,13 @@ def create_model(lang, probs, oov_prob, clusters, vectors_data, vector_keys, pru
lexeme = nlp.vocab[word]
lexeme.is_oov = False
lex_added += 1
if len(vectors_data):
nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys)
if prune_vectors >= 1:
nlp.vocab.prune_vectors(prune_vectors)
vec_added = len(nlp.vocab.vectors)
prints("{} entries, {} vectors".format(lex_added, vec_added),
title="Sucessfully compiled vocab")
prints(Messages.M039.format(entries=lex_added, vectors=vec_added),
title=Messages.M038)
return nlp
@ -114,8 +116,7 @@ def read_vectors(vectors_loc):
pieces = line.rsplit(' ', vectors_data.shape[1]+1)
word = pieces.pop(0)
if len(pieces) != vectors_data.shape[1]:
print(word, repr(line))
raise ValueError("Bad line in file")
raise ValueError(Errors.E094.format(line_num=i, loc=vectors_loc))
vectors_data[i] = numpy.asarray(pieces, dtype='f')
vectors_keys.append(word)
return vectors_data, vectors_keys
@ -150,11 +151,14 @@ def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50):
def read_clusters(clusters_loc):
print("Reading clusters...")
clusters = {}
if ftfy is None:
user_warning(Warnings.W004)
with clusters_loc.open() as f:
for line in tqdm(f):
try:
cluster, word, freq = line.split()
word = fix_text(word)
if ftfy is not None:
word = ftfy.fix_text(word)
except ValueError:
continue
# If the clusterer has only seen the word a few times, its

View File

@ -4,6 +4,7 @@ from __future__ import unicode_literals
import plac
from pathlib import Path
from ._messages import Messages
from ..compat import symlink_to, path2str
from ..util import prints
from .. import util
@ -24,40 +25,29 @@ def link(origin, link_name, force=False, model_path=None):
else:
model_path = Path(origin) if model_path is None else Path(model_path)
if not model_path.exists():
prints("The data should be located in %s" % path2str(model_path),
title="Can't locate model data", exits=1)
prints(Messages.M009.format(path=path2str(model_path)),
title=Messages.M008, exits=1)
data_path = util.get_data_path()
if not data_path or not data_path.exists():
spacy_loc = Path(__file__).parent.parent
prints("Make sure a directory `/data` exists within your spaCy "
"installation and try again. The data directory should be "
"located here:", path2str(spacy_loc), exits=1,
title="Can't find the spaCy data path to create model symlink")
prints(Messages.M011, spacy_loc, title=Messages.M010, exits=1)
link_path = util.get_data_path() / link_name
if link_path.is_symlink() and not force:
prints("To overwrite an existing link, use the --force flag.",
title="Link %s already exists" % link_name, exits=1)
prints(Messages.M013, title=Messages.M012.format(name=link_name),
exits=1)
elif link_path.is_symlink(): # does a symlink exist?
# NB: It's important to check for is_symlink here and not for exists,
# because invalid/outdated symlinks would return False otherwise.
link_path.unlink()
elif link_path.exists(): # does it exist otherwise?
# NB: Check this last because valid symlinks also "exist".
prints("This can happen if your data directory contains a directory "
"or file of the same name.", link_path,
title="Can't overwrite symlink %s" % link_name, exits=1)
prints(Messages.M015, link_path,
title=Messages.M014.format(name=link_name), exits=1)
msg = "%s --> %s" % (path2str(model_path), path2str(link_path))
try:
symlink_to(link_path, model_path)
except:
# This is quite dirty, but just making sure other errors are caught.
prints("Creating a symlink in spacy/data failed. Make sure you have "
"the required permissions and try re-running the command as "
"admin, or use a virtualenv. You can still import the model as "
"a module and call its load() method, or create the symlink "
"manually.",
"%s --> %s" % (path2str(model_path), path2str(link_path)),
title="Error: Couldn't link model to '%s'" % link_name)
prints(Messages.M017, msg, title=Messages.M016.format(name=link_name))
raise
prints("%s --> %s" % (path2str(model_path), path2str(link_path)),
"You can now load the model via spacy.load('%s')" % link_name,
title="Linking successful")
prints(msg, Messages.M019.format(name=link_name), title=Messages.M018)

View File

@ -5,6 +5,7 @@ import plac
import shutil
from pathlib import Path
from ._messages import Messages
from ..compat import path2str, json_dumps
from ..util import prints
from .. import util
@ -31,17 +32,17 @@ def package(input_dir, output_dir, meta_path=None, create_meta=False,
output_path = util.ensure_path(output_dir)
meta_path = util.ensure_path(meta_path)
if not input_path or not input_path.exists():
prints(input_path, title="Model directory not found", exits=1)
prints(input_path, title=Messages.M008, exits=1)
if not output_path or not output_path.exists():
prints(output_path, title="Output directory not found", exits=1)
prints(output_path, title=Messages.M040, exits=1)
if meta_path and not meta_path.exists():
prints(meta_path, title="meta.json not found", exits=1)
prints(meta_path, title=Messages.M020, exits=1)
meta_path = meta_path or input_path / 'meta.json'
if meta_path.is_file():
meta = util.read_json(meta_path)
if not create_meta: # only print this if user doesn't want to overwrite
prints(meta_path, title="Loaded meta.json from file")
prints(meta_path, title=Messages.M041)
else:
meta = generate_meta(input_dir, meta)
meta = validate_meta(meta, ['lang', 'name', 'version'])
@ -57,9 +58,8 @@ def package(input_dir, output_dir, meta_path=None, create_meta=False,
create_file(main_path / 'setup.py', TEMPLATE_SETUP)
create_file(main_path / 'MANIFEST.in', TEMPLATE_MANIFEST)
create_file(package_path / '__init__.py', TEMPLATE_INIT)
prints(main_path, "To build the package, run `python setup.py sdist` in "
"this directory.",
title="Successfully created package '%s'" % model_name_v)
prints(main_path, Messages.M043,
title=Messages.M042.format(name=model_name_v))
def create_dirs(package_path, force):
@ -67,10 +67,7 @@ def create_dirs(package_path, force):
if force:
shutil.rmtree(path2str(package_path))
else:
prints(package_path, "Please delete the directory and try again, "
"or use the --force flag to overwrite existing "
"directories.", title="Package directory already exists",
exits=1)
prints(package_path, Messages.M045, title=Messages.M044, exits=1)
Path.mkdir(package_path, parents=True)
@ -97,9 +94,7 @@ def generate_meta(model_path, existing_meta):
meta['vectors'] = {'width': nlp.vocab.vectors_length,
'vectors': len(nlp.vocab.vectors),
'keys': nlp.vocab.vectors.n_keys}
prints("Enter the package settings for your model. The following "
"information will be read from your model data: pipeline, vectors.",
title="Generating meta.json")
prints(Messages.M047, title=Messages.M046)
for setting, desc, default in settings:
response = util.get_raw_input(desc, default)
meta[setting] = default if response == '' and default else response
@ -111,8 +106,7 @@ def generate_meta(model_path, existing_meta):
def validate_meta(meta, keys):
for key in keys:
if key not in meta or meta[key] == '':
prints("This setting is required to build your package.",
title='No "%s" setting found in meta.json' % key, exits=1)
prints(Messages.M049, title=Messages.M048.format(key=key), exits=1)
return meta

View File

@ -7,6 +7,7 @@ import tqdm
from thinc.neural._classes.model import Model
from timeit import default_timer as timer
from ._messages import Messages
from ..attrs import PROB, IS_OOV, CLUSTER, LANG
from ..gold import GoldCorpus
from ..util import prints, minibatch, minibatch_by_words
@ -52,15 +53,15 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
dev_path = util.ensure_path(dev_data)
meta_path = util.ensure_path(meta_path)
if not train_path.exists():
prints(train_path, title="Training data not found", exits=1)
prints(train_path, title=Messages.M050, exits=1)
if dev_path and not dev_path.exists():
prints(dev_path, title="Development data not found", exits=1)
prints(dev_path, title=Messages.M051, exits=1)
if meta_path is not None and not meta_path.exists():
prints(meta_path, title="meta.json not found", exits=1)
prints(meta_path, title=Messages.M020, exits=1)
meta = util.read_json(meta_path) if meta_path else {}
if not isinstance(meta, dict):
prints("Expected dict but got: {}".format(type(meta)),
title="Not a valid meta.json format", exits=1)
prints(Messages.M053.format(meta_type=type(meta)),
title=Messages.M052, exits=1)
meta.setdefault('lang', lang)
meta.setdefault('name', 'unnamed')
@ -94,6 +95,7 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
meta['pipeline'] = pipeline
nlp.meta.update(meta)
if vectors:
print("Load vectors model", vectors)
util.load_model(vectors, vocab=nlp.vocab)
for lex in nlp.vocab:
values = {}

315
spacy/cli/ud_run_test.py Normal file
View File

@ -0,0 +1,315 @@
'''Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes
.conllu format for development data, allowing the official scorer to be used.
'''
from __future__ import unicode_literals
import plac
import tqdm
from pathlib import Path
import re
import sys
import json
import spacy
import spacy.util
from ..tokens import Token, Doc
from ..gold import GoldParse
from ..util import compounding, minibatch_by_words
from ..syntax.nonproj import projectivize
from ..matcher import Matcher
from ..morphology import Fused_begin, Fused_inside
from .. import displacy
from collections import defaultdict, Counter
from timeit import default_timer as timer
import itertools
import random
import numpy.random
import cytoolz
from . import conll17_ud_eval
from .. import lang
from .. import lang
from ..lang import zh
from ..lang import ja
from ..lang import ru
################
# Data reading #
################
space_re = re.compile('\s+')
def split_text(text):
return [space_re.sub(' ', par.strip()) for par in text.split('\n\n')]
##############
# Evaluation #
##############
def read_conllu(file_):
docs = []
sent = []
doc = []
for line in file_:
if line.startswith('# newdoc'):
if doc:
docs.append(doc)
doc = []
elif line.startswith('#'):
continue
elif not line.strip():
if sent:
doc.append(sent)
sent = []
else:
sent.append(list(line.strip().split('\t')))
if len(sent[-1]) != 10:
print(repr(line))
raise ValueError
if sent:
doc.append(sent)
if doc:
docs.append(doc)
return docs
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
if text_loc.parts[-1].endswith('.conllu'):
docs = []
with text_loc.open() as file_:
for conllu_doc in read_conllu(file_):
for conllu_sent in conllu_doc:
words = [line[1] for line in conllu_sent]
docs.append(Doc(nlp.vocab, words=words))
for name, component in nlp.pipeline:
docs = list(component.pipe(docs))
else:
with text_loc.open('r', encoding='utf8') as text_file:
texts = split_text(text_file.read())
docs = list(nlp.pipe(texts))
with sys_loc.open('w', encoding='utf8') as out_file:
write_conllu(docs, out_file)
with gold_loc.open('r', encoding='utf8') as gold_file:
gold_ud = conll17_ud_eval.load_conllu(gold_file)
with sys_loc.open('r', encoding='utf8') as sys_file:
sys_ud = conll17_ud_eval.load_conllu(sys_file)
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
return docs, scores
def write_conllu(docs, file_):
merger = Matcher(docs[0].vocab)
merger.add('SUBTOK', None, [{'DEP': 'subtok', 'op': '+'}])
for i, doc in enumerate(docs):
matches = merger(doc)
spans = [doc[start:end+1] for _, start, end in matches]
offsets = [(span.start_char, span.end_char) for span in spans]
for start_char, end_char in offsets:
doc.merge(start_char, end_char)
# TODO: This shuldn't be necessary? Should be handled in merge
for word in doc:
if word.i == word.head.i:
word.dep_ = 'ROOT'
file_.write("# newdoc id = {i}\n".format(i=i))
for j, sent in enumerate(doc.sents):
file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
file_.write("# text = {text}\n".format(text=sent.text))
for k, token in enumerate(sent):
file_.write(_get_token_conllu(token, k, len(sent)) + '\n')
file_.write('\n')
for word in sent:
if word.head.i == word.i and word.dep_ == 'ROOT':
break
else:
print("Rootless sentence!")
print(sent)
print(i)
for w in sent:
print(w.i, w.text, w.head.text, w.head.i, w.dep_)
raise ValueError
def _get_token_conllu(token, k, sent_len):
if token.check_morph(Fused_begin) and (k+1 < sent_len):
n = 1
text = [token.text]
while token.nbor(n).check_morph(Fused_inside):
text.append(token.nbor(n).text)
n += 1
id_ = '%d-%d' % (k+1, (k+n))
fields = [id_, ''.join(text)] + ['_'] * 8
lines = ['\t'.join(fields)]
else:
lines = []
if token.head.i == token.i:
head = 0
else:
head = k + (token.head.i - token.i) + 1
fields = [str(k+1), token.text, token.lemma_, token.pos_, token.tag_, '_',
str(head), token.dep_.lower(), '_', '_']
if token.check_morph(Fused_begin) and (k+1 < sent_len):
if k == 0:
fields[1] = token.norm_[0].upper() + token.norm_[1:]
else:
fields[1] = token.norm_
elif token.check_morph(Fused_inside):
fields[1] = token.norm_
elif token._.split_start is not None:
split_start = token._.split_start
split_end = token._.split_end
split_len = (split_end.i - split_start.i) + 1
n_in_split = token.i - split_start.i
subtokens = guess_fused_orths(split_start.text, [''] * split_len)
fields[1] = subtokens[n_in_split]
lines.append('\t'.join(fields))
return '\n'.join(lines)
def guess_fused_orths(word, ud_forms):
'''The UD data 'fused tokens' don't necessarily expand to keys that match
the form. We need orths that exact match the string. Here we make a best
effort to divide up the word.'''
if word == ''.join(ud_forms):
# Happy case: we get a perfect split, with each letter accounted for.
return ud_forms
elif len(word) == sum(len(subtoken) for subtoken in ud_forms):
# Unideal, but at least lengths match.
output = []
remain = word
for subtoken in ud_forms:
assert len(subtoken) >= 1
output.append(remain[:len(subtoken)])
remain = remain[len(subtoken):]
assert len(remain) == 0, (word, ud_forms, remain)
return output
else:
# Let's say word is 6 long, and there are three subtokens. The orths
# *must* equal the original string. Arbitrarily, split [4, 1, 1]
first = word[:len(word)-(len(ud_forms)-1)]
output = [first]
remain = word[len(first):]
for i in range(1, len(ud_forms)):
assert remain
output.append(remain[:1])
remain = remain[1:]
assert len(remain) == 0, (word, output, remain)
return output
def print_results(name, ud_scores):
fields = {}
if ud_scores is not None:
fields.update({
'words': ud_scores['Words'].f1 * 100,
'sents': ud_scores['Sentences'].f1 * 100,
'tags': ud_scores['XPOS'].f1 * 100,
'uas': ud_scores['UAS'].f1 * 100,
'las': ud_scores['LAS'].f1 * 100,
})
else:
fields.update({
'words': 0.0,
'sents': 0.0,
'tags': 0.0,
'uas': 0.0,
'las': 0.0
})
tpl = '\t'.join((
name,
'{las:.1f}',
'{uas:.1f}',
'{tags:.1f}',
'{sents:.1f}',
'{words:.1f}',
))
print(tpl.format(**fields))
return fields
def get_token_split_start(token):
if token.text == '':
assert token.i != 0
i = -1
while token.nbor(i).text == '':
i -= 1
return token.nbor(i)
elif (token.i+1) < len(token.doc) and token.nbor(1).text == '':
return token
else:
return None
def get_token_split_end(token):
if (token.i+1) == len(token.doc):
return token if token.text == '' else None
elif token.text != '' and token.nbor(1).text != '':
return None
i = 1
while (token.i+i) < len(token.doc) and token.nbor(i).text == '':
i += 1
return token.nbor(i-1)
Token.set_extension('split_start', getter=get_token_split_start)
Token.set_extension('split_end', getter=get_token_split_end)
Token.set_extension('begins_fused', default=False)
Token.set_extension('inside_fused', default=False)
##################
# Initialization #
##################
def load_nlp(experiments_dir, corpus):
nlp = spacy.load(experiments_dir / corpus / 'best-model')
return nlp
def initialize_pipeline(nlp, docs, golds, config, device):
nlp.add_pipe(nlp.create_pipe('parser'))
return nlp
@plac.annotations(
test_data_dir=("Path to Universal Dependencies test data", "positional", None, Path),
experiment_dir=("Parent directory with output model", "positional", None, Path),
corpus=("UD corpus to evaluate, e.g. UD_English, UD_Spanish, etc", "positional", None, str),
)
def main(test_data_dir, experiment_dir, corpus):
lang.zh.Chinese.Defaults.use_jieba = False
lang.ja.Japanese.Defaults.use_janome = False
lang.ru.Russian.Defaults.use_pymorphy2 = False
nlp = load_nlp(experiment_dir, corpus)
treebank_code = nlp.meta['treebank']
for section in ('test', 'dev'):
if section == 'dev':
section_dir = 'conll17-ud-development-2017-03-19'
else:
section_dir = 'conll17-ud-test-2017-05-09'
text_path = test_data_dir / 'input' / section_dir / (treebank_code+'.txt')
udpipe_path = test_data_dir / 'input' / section_dir / (treebank_code+'-udpipe.conllu')
gold_path = test_data_dir / 'gold' / section_dir / (treebank_code+'.conllu')
header = [section, 'LAS', 'UAS', 'TAG', 'SENT', 'WORD']
print('\t'.join(header))
inputs = {'gold': gold_path, 'udp': udpipe_path, 'raw': text_path}
for input_type in ('udp', 'raw'):
input_path = inputs[input_type]
output_path = experiment_dir / corpus / '{section}.conllu'.format(section=section)
parsed_docs, test_scores = evaluate(nlp, input_path, gold_path, output_path)
accuracy = print_results(input_type, test_scores)
acc_path = experiment_dir / corpus / '{section}-accuracy.json'.format(section=section)
with open(acc_path, 'w') as file_:
file_.write(json.dumps(accuracy, indent=2))
if __name__ == '__main__':
plac.call(main)

View File

@ -247,13 +247,19 @@ Token.set_extension('inside_fused', default=False)
##################
def load_nlp(corpus, config):
def load_nlp(corpus, config, vectors=None):
lang = corpus.split('_')[0]
nlp = spacy.blank(lang)
if config.vectors:
nlp.vocab.from_disk(Path(config.vectors) / 'vocab')
if not vectors:
raise ValueError("config asks for vectors, but no vectors "
"directory set on command line (use -v)")
if (Path(vectors) / corpus).exists():
nlp.vocab.from_disk(Path(vectors) / corpus / 'vocab')
nlp.meta['treebank'] = corpus
return nlp
def initialize_pipeline(nlp, docs, golds, config, device):
nlp.add_pipe(nlp.create_pipe('parser'))
if config.multitask_tag:
@ -274,10 +280,12 @@ def initialize_pipeline(nlp, docs, golds, config, device):
class Config(object):
def __init__(self, vectors=None, max_doc_length=10, multitask_tag=True,
multitask_sent=True, nr_epoch=30, batch_size=1000, dropout=0.2):
multitask_sent=True, multitask_dep=True, multitask_vectors=False,
nr_epoch=30, batch_size=1000, dropout=0.2):
for key, value in locals().items():
setattr(self, key, value)
@classmethod
def load(cls, loc):
with Path(loc).open('r', encoding='utf8') as file_:
@ -319,9 +327,11 @@ class TreebankPaths(object):
parses_dir=("Directory to write the development parses", "positional", None, Path),
config=("Path to json formatted config file", "positional"),
limit=("Size limit", "option", "n", int),
use_gpu=("Use GPU", "option", "g", int)
use_gpu=("Use GPU", "option", "g", int),
vectors_dir=("Path to directory with pre-trained vectors, named e.g. en/",
"option", "v", Path),
)
def main(ud_dir, parses_dir, config, corpus, limit=0, use_gpu=-1):
def main(ud_dir, parses_dir, config, corpus, limit=0, use_gpu=-1, vectors_dir=None):
spacy.util.fix_random_seed()
lang.zh.Chinese.Defaults.use_jieba = False
lang.ja.Japanese.Defaults.use_janome = False
@ -331,7 +341,7 @@ def main(ud_dir, parses_dir, config, corpus, limit=0, use_gpu=-1):
if not (parses_dir / corpus).exists():
(parses_dir / corpus).mkdir()
print("Train and evaluate", corpus, "using lang", paths.lang)
nlp = load_nlp(paths.lang, config)
nlp = load_nlp(paths.lang, config, vectors=vectors_dir)
docs, golds = read_data(nlp, paths.train.conllu.open(), paths.train.text.open(),
max_doc_length=config.max_doc_length, limit=limit)

View File

@ -1,12 +1,13 @@
# coding: utf8
from __future__ import unicode_literals, print_function
import requests
import pkg_resources
from pathlib import Path
import sys
import ujson
from ..compat import path2str, locale_escape
from ._messages import Messages
from ..compat import path2str, locale_escape, url_read, HTTPError
from ..util import prints, get_data_path, read_json
from .. import about
@ -15,16 +16,16 @@ def validate():
"""Validate that the currently installed version of spaCy is compatible
with the installed models. Should be run after `pip install -U spacy`.
"""
r = requests.get(about.__compatibility__)
if r.status_code != 200:
prints("Couldn't fetch compatibility table.",
title="Server error (%d)" % r.status_code, exits=1)
compat = r.json()['spacy']
try:
data = url_read(about.__compatibility__)
except HTTPError as e:
title = Messages.M003.format(code=e.code, desc=e.reason)
prints(Messages.M021, title=title, exits=1)
compat = ujson.loads(data)['spacy']
current_compat = compat.get(about.__version__)
if not current_compat:
prints(about.__compatibility__, exits=1,
title="Can't find spaCy v{} in compatibility table"
.format(about.__version__))
title=Messages.M022.format(version=about.__version__))
all_models = set()
for spacy_v, models in dict(compat).items():
all_models.update(models.keys())
@ -41,7 +42,7 @@ def validate():
update_models = [m for m in incompat_models if m in current_compat]
prints(path2str(Path(__file__).parent.parent),
title="Installed models (spaCy v{})".format(about.__version__))
title=Messages.M023.format(version=about.__version__))
if model_links or model_pkgs:
print(get_row('TYPE', 'NAME', 'MODEL', 'VERSION', ''))
for name, data in model_pkgs.items():
@ -49,23 +50,16 @@ def validate():
for name, data in model_links.items():
print(get_model_row(current_compat, name, data, 'link'))
else:
prints("No models found in your current environment.", exits=0)
prints(Messages.M024, exits=0)
if update_models:
cmd = ' python -m spacy download {}'
print("\n Use the following commands to update the model packages:")
print("\n " + Messages.M025)
print('\n'.join([cmd.format(pkg) for pkg in update_models]))
if na_models:
prints("The following models are not available for spaCy v{}: {}"
.format(about.__version__, ', '.join(na_models)))
prints(Messages.M025.format(version=about.__version__,
models=', '.join(na_models)))
if incompat_links:
prints("You may also want to overwrite the incompatible links using "
"the `python -m spacy link` command with `--force`, or remove "
"them from the data directory. Data path: {}"
.format(path2str(get_data_path())))
prints(Messages.M027.format(path=path2str(get_data_path())))
if incompat_models or incompat_links:
sys.exit(1)

View File

@ -1,7 +1,6 @@
# coding: utf8
from __future__ import unicode_literals
import ftfy
import sys
import ujson
import itertools
@ -34,11 +33,20 @@ try:
except ImportError:
from thinc.neural.optimizers import Adam as Optimizer
try:
import urllib.request
except ImportError:
import urllib2 as urllib
try:
from urllib.error import HTTPError
except ImportError:
from urllib2 import HTTPError
pickle = pickle
copy_reg = copy_reg
CudaStream = CudaStream
cupy = cupy
fix_text = ftfy.fix_text
copy_array = copy_array
izip = getattr(itertools, 'izip', zip)
@ -58,6 +66,7 @@ if is_python2:
input_ = raw_input # noqa: F821
json_dumps = lambda data: ujson.dumps(data, indent=2, escape_forward_slashes=False).decode('utf8')
path2str = lambda path: str(path).decode('utf8')
url_open = urllib.urlopen
elif is_python3:
bytes_ = bytes
@ -66,6 +75,16 @@ elif is_python3:
input_ = input
json_dumps = lambda data: ujson.dumps(data, indent=2, escape_forward_slashes=False)
path2str = lambda path: str(path)
url_open = urllib.request.urlopen
def url_read(url):
file_ = url_open(url)
code = file_.getcode()
if code != 200:
raise HTTPError(url, code, "Cannot GET url", [], file_)
data = file_.read()
return data
def b_to_str(b_str):

View File

@ -4,6 +4,7 @@ from __future__ import unicode_literals
from .render import DependencyRenderer, EntityRenderer
from ..tokens import Doc
from ..compat import b_to_str
from ..errors import Errors, Warnings, user_warning
from ..util import prints, is_in_jupyter
@ -27,7 +28,7 @@ def render(docs, style='dep', page=False, minify=False, jupyter=IS_JUPYTER,
factories = {'dep': (DependencyRenderer, parse_deps),
'ent': (EntityRenderer, parse_ents)}
if style not in factories:
raise ValueError("Unknown style: %s" % style)
raise ValueError(Errors.E087.format(style=style))
if isinstance(docs, Doc) or isinstance(docs, dict):
docs = [docs]
renderer, converter = factories[style]
@ -57,12 +58,12 @@ def serve(docs, style='dep', page=True, minify=False, options={}, manual=False,
render(docs, style=style, page=page, minify=minify, options=options,
manual=manual)
httpd = simple_server.make_server('0.0.0.0', port, app)
prints("Using the '%s' visualizer" % style,
title="Serving on port %d..." % port)
prints("Using the '{}' visualizer".format(style),
title="Serving on port {}...".format(port))
try:
httpd.serve_forever()
except KeyboardInterrupt:
prints("Shutting down server on port %d." % port)
prints("Shutting down server on port {}.".format(port))
finally:
httpd.server_close()
@ -83,6 +84,12 @@ def parse_deps(orig_doc, options={}):
RETURNS (dict): Generated dependency parse keyed by words and arcs.
"""
doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes())
if not doc.is_parsed:
user_warning(Warnings.W005)
if options.get('collapse_phrases', False):
for np in list(doc.noun_chunks):
np.merge(tag=np.root.tag_, lemma=np.root.lemma_,
ent_type=np.root.ent_type_)
if options.get('collapse_punct', True):
spans = []
for word in doc[:-1]:
@ -120,6 +127,8 @@ def parse_ents(doc, options={}):
"""
ents = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}
for ent in doc.ents]
if not ents:
user_warning(Warnings.W006)
title = (doc.user_data.get('title', None)
if hasattr(doc, 'user_data') else None)
return {'text': doc.text, 'ents': ents, 'title': title}

313
spacy/errors.py Normal file
View File

@ -0,0 +1,313 @@
# coding: utf8
from __future__ import unicode_literals
import os
import warnings
import inspect
def add_codes(err_cls):
"""Add error codes to string messages via class attribute names."""
class ErrorsWithCodes(object):
def __getattribute__(self, code):
msg = getattr(err_cls, code)
return '[{code}] {msg}'.format(code=code, msg=msg)
return ErrorsWithCodes()
@add_codes
class Warnings(object):
W001 = ("As of spaCy v2.0, the keyword argument `path=` is deprecated. "
"You can now call spacy.load with the path as its first argument, "
"and the model's meta.json will be used to determine the language "
"to load. For example:\nnlp = spacy.load('{path}')")
W002 = ("Tokenizer.from_list is now deprecated. Create a new Doc object "
"instead and pass in the strings as the `words` keyword argument, "
"for example:\nfrom spacy.tokens import Doc\n"
"doc = Doc(nlp.vocab, words=[...])")
W003 = ("Positional arguments to Doc.merge are deprecated. Instead, use "
"the keyword arguments, for example tag=, lemma= or ent_type=.")
W004 = ("No text fixing enabled. Run `pip install ftfy` to enable fixing "
"using ftfy.fix_text if necessary.")
W005 = ("Doc object not parsed. This means displaCy won't be able to "
"generate a dependency visualization for it. Make sure the Doc "
"was processed with a model that supports dependency parsing, and "
"not just a language class like `English()`. For more info, see "
"the docs:\nhttps://spacy.io/usage/models")
W006 = ("No entities to visualize found in Doc object. If this is "
"surprising to you, make sure the Doc was processed using a model "
"that supports named entity recognition, and check the `doc.ents` "
"property manually if necessary.")
@add_codes
class Errors(object):
E001 = ("No component '{name}' found in pipeline. Available names: {opts}")
E002 = ("Can't find factory for '{name}'. This usually happens when spaCy "
"calls `nlp.create_pipe` with a component name that's not built "
"in - for example, when constructing the pipeline from a model's "
"meta.json. If you're using a custom component, you can write to "
"`Language.factories['{name}']` or remove it from the model meta "
"and add it via `nlp.add_pipe` instead.")
E003 = ("Not a valid pipeline component. Expected callable, but "
"got {component} (name: '{name}').")
E004 = ("If you meant to add a built-in component, use `create_pipe`: "
"`nlp.add_pipe(nlp.create_pipe('{component}'))`")
E005 = ("Pipeline component '{name}' returned None. If you're using a "
"custom component, maybe you forgot to return the processed Doc?")
E006 = ("Invalid constraints. You can only set one of the following: "
"before, after, first, last.")
E007 = ("'{name}' already exists in pipeline. Existing names: {opts}")
E008 = ("Some current components would be lost when restoring previous "
"pipeline state. If you added components after calling "
"`nlp.disable_pipes()`, you should remove them explicitly with "
"`nlp.remove_pipe()` before the pipeline is restored. Names of "
"the new components: {names}")
E009 = ("The `update` method expects same number of docs and golds, but "
"got: {n_docs} docs, {n_golds} golds.")
E010 = ("Word vectors set to length 0. This may be because you don't have "
"a model installed or loaded, or because your model doesn't "
"include word vectors. For more info, see the docs:\n"
"https://spacy.io/usage/models")
E011 = ("Unknown operator: '{op}'. Options: {opts}")
E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}")
E013 = ("Error selecting action in matcher")
E014 = ("Uknown tag ID: {tag}")
E015 = ("Conflicting morphology exception for ({tag}, {orth}). Use "
"`force=True` to overwrite.")
E016 = ("MultitaskObjective target should be function or one of: dep, "
"tag, ent, dep_tag_offset, ent_tag.")
E017 = ("Can only add unicode or bytes. Got type: {value_type}")
E018 = ("Can't retrieve string for hash '{hash_value}'.")
E019 = ("Can't create transition with unknown action ID: {action}. Action "
"IDs are enumerated in spacy/syntax/{src}.pyx.")
E020 = ("Could not find a gold-standard action to supervise the "
"dependency parser. The tree is non-projective (i.e. it has "
"crossing arcs - see spacy/syntax/nonproj.pyx for definitions). "
"The ArcEager transition system only supports projective trees. "
"To learn non-projective representations, transform the data "
"before training and after parsing. Either pass "
"`make_projective=True` to the GoldParse class, or use "
"spacy.syntax.nonproj.preprocess_training_data.")
E021 = ("Could not find a gold-standard action to supervise the "
"dependency parser. The GoldParse was projective. The transition "
"system has {n_actions} actions. State at failure: {state}")
E022 = ("Could not find a transition with the name '{name}' in the NER "
"model.")
E023 = ("Error cleaning up beam: The same state occurred twice at "
"memory address {addr} and position {i}.")
E024 = ("Could not find an optimal move to supervise the parser. Usually, "
"this means the GoldParse was not correct. For example, are all "
"labels added to the model?")
E025 = ("String is too long: {length} characters. Max is 2**30.")
E026 = ("Error accessing token at position {i}: out of bounds in Doc of "
"length {length}.")
E027 = ("Arguments 'words' and 'spaces' should be sequences of the same "
"length, or 'spaces' should be left default at None. spaces "
"should be a sequence of booleans, with True meaning that the "
"word owns a ' ' character following it.")
E028 = ("orths_and_spaces expects either a list of unicode string or a "
"list of (unicode, bool) tuples. Got bytes instance: {value}")
E029 = ("noun_chunks requires the dependency parse, which requires a "
"statistical model to be installed and loaded. For more info, see "
"the documentation:\nhttps://spacy.io/usage/models")
E030 = ("Sentence boundaries unset. You can add the 'sentencizer' "
"component to the pipeline with: "
"nlp.add_pipe(nlp.create_pipe('sentencizer')) "
"Alternatively, add the dependency parser, or set sentence "
"boundaries by setting doc[i].is_sent_start.")
E031 = ("Invalid token: empty string ('') at position {i}.")
E032 = ("Conflicting attributes specified in doc.from_array(): "
"(HEAD, SENT_START). The HEAD attribute currently sets sentence "
"boundaries implicitly, based on the tree structure. This means "
"the HEAD attribute would potentially override the sentence "
"boundaries set by SENT_START.")
E033 = ("Cannot load into non-empty Doc of length {length}.")
E034 = ("Doc.merge received {n_args} non-keyword arguments. Expected "
"either 3 arguments (deprecated), or 0 (use keyword arguments).\n"
"Arguments supplied:\n{args}\nKeyword arguments:{kwargs}")
E035 = ("Error creating span with start {start} and end {end} for Doc of "
"length {length}.")
E036 = ("Error calculating span: Can't find a token starting at character "
"offset {start}.")
E037 = ("Error calculating span: Can't find a token ending at character "
"offset {end}.")
E038 = ("Error finding sentence for span. Infinite loop detected.")
E039 = ("Array bounds exceeded while searching for root word. This likely "
"means the parse tree is in an invalid state. Please report this "
"issue here: http://github.com/explosion/spaCy/issues")
E040 = ("Attempt to access token at {i}, max length {max_length}.")
E041 = ("Invalid comparison operator: {op}. Likely a Cython bug?")
E042 = ("Error accessing doc[{i}].nbor({j}), for doc of length {length}.")
E043 = ("Refusing to write to token.sent_start if its document is parsed, "
"because this may cause inconsistent state.")
E044 = ("Invalid value for token.sent_start: {value}. Must be one of: "
"None, True, False")
E045 = ("Possibly infinite loop encountered while looking for {attr}.")
E046 = ("Can't retrieve unregistered extension attribute '{name}'. Did "
"you forget to call the `set_extension` method?")
E047 = ("Can't assign a value to unregistered extension attribute "
"'{name}'. Did you forget to call the `set_extension` method?")
E048 = ("Can't import language {lang} from spacy.lang.")
E049 = ("Can't find spaCy data directory: '{path}'. Check your "
"installation and permissions, or use spacy.util.set_data_path "
"to customise the location if necessary.")
E050 = ("Can't find model '{name}'. It doesn't seem to be a shortcut "
"link, a Python package or a valid path to a data directory.")
E051 = ("Cant' load '{name}'. If you're using a shortcut link, make sure "
"it points to a valid package (not just a data directory).")
E052 = ("Can't find model directory: {path}")
E053 = ("Could not read meta.json from {path}")
E054 = ("No valid '{setting}' setting found in model meta.json.")
E055 = ("Invalid ORTH value in exception:\nKey: {key}\nOrths: {orths}")
E056 = ("Invalid tokenizer exception: ORTH values combined don't match "
"original string.\nKey: {key}\nOrths: {orths}")
E057 = ("Stepped slices not supported in Span objects. Try: "
"list(tokens)[start:stop:step] instead.")
E058 = ("Could not retrieve vector for key {key}.")
E059 = ("One (and only one) keyword arg must be set. Got: {kwargs}")
E060 = ("Cannot add new key to vectors: the table is full. Current shape: "
"({rows}, {cols}).")
E061 = ("Bad file name: {filename}. Example of a valid file name: "
"'vectors.128.f.bin'")
E062 = ("Cannot find empty bit for new lexical flag. All bits between 0 "
"and 63 are occupied. You can replace one by specifying the "
"`flag_id` explicitly, e.g. "
"`nlp.vocab.add_flag(your_func, flag_id=IS_ALPHA`.")
E063 = ("Invalid value for flag_id: {value}. Flag IDs must be between 1 "
"and 63 (inclusive).")
E064 = ("Error fetching a Lexeme from the Vocab. When looking up a "
"string, the lexeme returned had an orth ID that did not match "
"the query string. This means that the cached lexeme structs are "
"mismatched to the string encoding table. The mismatched:\n"
"Query string: {string}\nOrth cached: {orth}\nOrth ID: {orth_id}")
E065 = ("Only one of the vector table's width and shape can be specified. "
"Got width {width} and shape {shape}.")
E066 = ("Error creating model helper for extracting columns. Can only "
"extract columns by positive integer. Got: {value}.")
E067 = ("Invalid BILUO tag sequence: Got a tag starting with 'I' (inside "
"an entity) without a preceding 'B' (beginning of an entity). "
"Tag sequence:\n{tags}")
E068 = ("Invalid BILUO tag: '{tag}'.")
E069 = ("Invalid gold-standard parse tree. Found cycle between word "
"IDs: {cycle}")
E070 = ("Invalid gold-standard data. Number of documents ({n_docs}) "
"does not align with number of annotations ({n_annots}).")
E071 = ("Error creating lexeme: specified orth ID ({orth}) does not "
"match the one in the vocab ({vocab_orth}).")
E072 = ("Error serializing lexeme: expected data length {length}, "
"got {bad_length}.")
E073 = ("Cannot assign vector of length {new_length}. Existing vectors "
"are of length {length}. You can use `vocab.reset_vectors` to "
"clear the existing vectors and resize the table.")
E074 = ("Error interpreting compiled match pattern: patterns are expected "
"to end with the attribute {attr}. Got: {bad_attr}.")
E075 = ("Error accepting match: length ({length}) > maximum length "
"({max_len}).")
E076 = ("Error setting tensor on Doc: tensor has {rows} rows, while Doc "
"has {words} words.")
E077 = ("Error computing {value}: number of Docs ({n_docs}) does not "
"equal number of GoldParse objects ({n_golds}) in batch.")
E078 = ("Error computing score: number of words in Doc ({words_doc}) does "
"not equal number of words in GoldParse ({words_gold}).")
E079 = ("Error computing states in beam: number of predicted beams "
"({pbeams}) does not equal number of gold beams ({gbeams}).")
E080 = ("Duplicate state found in beam: {key}.")
E081 = ("Error getting gradient in beam: number of histories ({n_hist}) "
"does not equal number of losses ({losses}).")
E082 = ("Error deprojectivizing parse: number of heads ({n_heads}), "
"projective heads ({n_proj_heads}) and labels ({n_labels}) do not "
"match.")
E083 = ("Error setting extension: only one of `default`, `method`, or "
"`getter` (plus optional `setter`) is allowed. Got: {nr_defined}")
E084 = ("Error assigning label ID {label} to span: not in StringStore.")
E085 = ("Can't create lexeme for string '{string}'.")
E086 = ("Error deserializing lexeme '{string}': orth ID {orth_id} does "
"not match hash {hash_id} in StringStore.")
E087 = ("Unknown displaCy style: {style}.")
E088 = ("Text of length {length} exceeds maximum of {max_length}. The "
"v2.x parser and NER models require roughly 1GB of temporary "
"memory per 100,000 characters in the input. This means long "
"texts may cause memory allocation errors. If you're not using "
"the parser or NER, it's probably safe to increase the "
"`nlp.max_length` limit. The limit is in number of characters, so "
"you can check whether your inputs are too long by checking "
"`len(text)`.")
E089 = ("Extensions can't have a setter argument without a getter "
"argument. Check the keyword arguments on `set_extension`.")
E090 = ("Extension '{name}' already exists on {obj}. To overwrite the "
"existing extension, set `force=True` on `{obj}.set_extension`.")
E091 = ("Invalid extension attribute {name}: expected callable or None, "
"but got: {value}")
E092 = ("Could not find or assign name for word vectors. Ususally, the "
"name is read from the model's meta.json in vector.name. "
"Alternatively, it is built from the 'lang' and 'name' keys in "
"the meta.json. Vector names are required to avoid issue #1660.")
E093 = ("token.ent_iob values make invalid sequence: I without B\n{seq}")
E094 = ("Error reading line {line_num} in vectors file {loc}.")
@add_codes
class TempErrors(object):
T001 = ("Max length currently 10 for phrase matching")
T002 = ("Pattern length ({doc_len}) >= phrase_matcher.max_length "
"({max_len}). Length can be set on initialization, up to 10.")
T003 = ("Resizing pre-trained Tagger models is not currently supported.")
T004 = ("Currently parser depth is hard-coded to 1. Received: {value}.")
T005 = ("Currently history size is hard-coded to 0. Received: {value}.")
T006 = ("Currently history width is hard-coded to 0. Received: {value}.")
T007 = ("Can't yet set {attr} from Span. Vote for this feature on the "
"issue tracker: http://github.com/explosion/spaCy/issues")
T008 = ("Bad configuration of Tagger. This is probably a bug within "
"spaCy. We changed the name of an internal attribute for loading "
"pre-trained vectors, and the class has been passed the old name "
"(pretrained_dims) but not the new name (pretrained_vectors).")
class ModelsWarning(UserWarning):
pass
WARNINGS = {
'user': UserWarning,
'deprecation': DeprecationWarning,
'models': ModelsWarning,
}
def _get_warn_types(arg):
if arg == '': # don't show any warnings
return []
if not arg or arg == 'all': # show all available warnings
return WARNINGS.keys()
return [w_type.strip() for w_type in arg.split(',')
if w_type.strip() in WARNINGS]
SPACY_WARNING_FILTER = os.environ.get('SPACY_WARNING_FILTER', 'always')
SPACY_WARNING_TYPES = _get_warn_types(os.environ.get('SPACY_WARNING_TYPES'))
def user_warning(message):
_warn(message, 'user')
def deprecation_warning(message):
_warn(message, 'deprecation')
def models_warning(message):
_warn(message, 'models')
def _warn(message, warn_type='user'):
"""
message (unicode): The message to display.
category (Warning): The Warning to show.
"""
if warn_type in SPACY_WARNING_TYPES:
category = WARNINGS[warn_type]
stack = inspect.stack()[-1]
with warnings.catch_warnings():
warnings.simplefilter(SPACY_WARNING_FILTER, category)
warnings.warn_explicit(message, category, stack[1], stack[2])

View File

@ -17,6 +17,7 @@ import ujson
from . import _align
from .syntax import nonproj
from .tokens import Doc
from .errors import Errors
from . import util
from .util import minibatch, itershuffle
from .compat import json_dumps
@ -37,7 +38,8 @@ def tags_to_entities(tags):
elif tag == '-':
continue
elif tag.startswith('I'):
assert start is not None, tags[:i]
if start is None:
raise ValueError(Errors.E067.format(tags=tags[:i]))
continue
if tag.startswith('U'):
entities.append((tag[2:], i, i))
@ -47,7 +49,7 @@ def tags_to_entities(tags):
entities.append((tag[2:], start, i))
start = None
else:
raise Exception(tag)
raise ValueError(Errors.E068.format(tag=tag))
return entities
@ -225,7 +227,9 @@ class GoldCorpus(object):
@classmethod
def _make_golds(cls, docs, paragraph_tuples, make_projective):
assert len(docs) == len(paragraph_tuples)
if len(docs) != len(paragraph_tuples):
raise ValueError(Errors.E070.format(n_docs=len(docs),
n_annots=len(paragraph_tuples)))
if len(docs) == 1:
return [GoldParse.from_annot_tuples(docs[0],
paragraph_tuples[0][0],
@ -525,7 +529,7 @@ cdef class GoldParse:
cycle = nonproj.contains_cycle(self.heads)
if cycle is not None:
raise Exception("Cycle found: %s" % cycle)
raise ValueError(Errors.E069.format(cycle=cycle))
def __len__(self):
"""Get the number of gold-standard tokens.

View File

@ -8,6 +8,7 @@ from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .morph_rules import MORPH_RULES
from ..tag_map import TAG_MAP
from .lemmatizer import LOOKUP
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
@ -28,6 +29,7 @@ class DanishDefaults(Language.Defaults):
suffixes = TOKENIZER_SUFFIXES
tag_map = TAG_MAP
stop_words = STOP_WORDS
lemma_lookup = LOOKUP
class Danish(Language):

692415
spacy/lang/da/lemmatizer.py Normal file

File diff suppressed because it is too large Load Diff

View File

@ -286069,7 +286069,6 @@ LOOKUP = {
"sonnolente": "sonnolento",
"sonnolenti": "sonnolento",
"sonnolenze": "sonnolenza",
"sono": "sonare",
"sonora": "sonoro",
"sonore": "sonoro",
"sonori": "sonoro",
@ -333681,6 +333680,7 @@ LOOKUP = {
"zurliniane": "zurliniano",
"zurliniani": "zurliniano",
"àncore": "àncora",
"sono": "essere",
"è": "essere",
"èlites": "èlite",
"ère": "èra",

View File

@ -190262,7 +190262,6 @@ LOOKUP = {
"gämserna": "gäms",
"gämsernas": "gäms",
"gämsers": "gäms",
"gäng": "gänga",
"gängad": "gänga",
"gängade": "gängad",
"gängades": "gängad",
@ -651423,7 +651422,6 @@ LOOKUP = {
"åpnasts": "åpen",
"åpne": "åpen",
"åpnes": "åpen",
"år": "åra",
"åran": "åra",
"årans": "åra",
"åras": "åra",

View File

@ -1,19 +1,53 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LANG
from ...attrs import LANG, NORM
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...tokens import Doc
from .stop_words import STOP_WORDS
from ...util import update_exc, add_lookups
from .lex_attrs import LEX_ATTRS
#from ..tokenizer_exceptions import BASE_EXCEPTIONS
#from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
class VietnameseDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'vi' # for pickling
# add more norm exception dictionaries here
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
# overwrite functions for lexical attributes
lex_attr_getters.update(LEX_ATTRS)
# merge base exceptions and custom tokenizer exceptions
#tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
use_pyvi = True
class Vietnamese(Language):
lang = 'vi'
Defaults = VietnameseDefaults # override defaults
def make_doc(self, text):
if self.Defaults.use_pyvi:
try:
from pyvi import ViTokenizer
except ImportError:
msg = ("Pyvi not installed. Either set Vietnamese.use_pyvi = False, "
"or install it https://pypi.python.org/pypi/pyvi")
raise ImportError(msg)
words, spaces = ViTokenizer.spacy_tokenize(text)
return Doc(self.vocab, words=words, spaces=spaces)
else:
words = []
spaces = []
doc = self.tokenizer(text)
for token in self.tokenizer(text):
words.extend(list(token.text))
spaces.extend([False]*len(token.text))
spaces[-1] = bool(token.whitespace_)
return Doc(self.vocab, words=words, spaces=spaces)
__all__ = ['Vietnamese']

View File

@ -0,0 +1,26 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = ['không', 'một', 'hai', 'ba', 'bốn', 'năm', 'sáu', 'bẩy',
'tám', 'chín', 'mười', 'trăm', 'tỷ']
def like_num(text):
text = text.replace(',', '').replace('.', '')
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
return True
return False
LEX_ATTRS = {
LIKE_NUM: like_num
}

1951
spacy/lang/vi/stop_words.py Normal file

File diff suppressed because it is too large Load Diff

36
spacy/lang/vi/tag_map.py Normal file
View File

@ -0,0 +1,36 @@
# coding: utf8
from __future__ import unicode_literals
from ..symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
from ..symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ
# Add a tag map
# Documentation: https://spacy.io/docs/usage/adding-languages#tag-map
# Universal Dependencies: http://universaldependencies.org/u/pos/all.html
# The keys of the tag map should be strings in your tag set. The dictionary must
# have an entry POS whose value is one of the Universal Dependencies tags.
# Optionally, you can also include morphological features or other attributes.
TAG_MAP = {
"ADV": {POS: ADV},
"NOUN": {POS: NOUN},
"ADP": {POS: ADP},
"PRON": {POS: PRON},
"SCONJ": {POS: SCONJ},
"PROPN": {POS: PROPN},
"DET": {POS: DET},
"SYM": {POS: SYM},
"INTJ": {POS: INTJ},
"PUNCT": {POS: PUNCT},
"NUM": {POS: NUM},
"AUX": {POS: AUX},
"X": {POS: X},
"CONJ": {POS: CONJ},
"CCONJ": {POS: CCONJ},
"ADJ": {POS: ADJ},
"VERB": {POS: VERB},
"PART": {POS: PART},
"SP": {POS: SPACE}
}

View File

@ -28,6 +28,7 @@ from .lang.punctuation import TOKENIZER_INFIXES
from .lang.tokenizer_exceptions import TOKEN_MATCH
from .lang.tag_map import TAG_MAP
from .lang.lex_attrs import LEX_ATTRS, is_stop
from .errors import Errors
from . import util
from . import about
@ -112,7 +113,7 @@ class Language(object):
'merge_subtokens': lambda nlp, **cfg: merge_subtokens,
}
def __init__(self, vocab=True, make_doc=True, meta={}, **kwargs):
def __init__(self, vocab=True, make_doc=True, max_length=10**6, meta={}, **kwargs):
"""Initialise a Language object.
vocab (Vocab): A `Vocab` object. If `True`, a vocab is created via
@ -127,6 +128,15 @@ class Language(object):
string occurs in both, the component is not loaded.
meta (dict): Custom meta data for the Language class. Is written to by
models to add model meta data.
max_length (int) :
Maximum number of characters in a single text. The current v2 models
may run out memory on extremely long texts, due to large internal
allocations. You should segment these texts into meaningful units,
e.g. paragraphs, subsections etc, before passing them to spaCy.
Default maximum length is 1,000,000 characters (1mb). As a rule of
thumb, if all pipeline components are enabled, spaCy's default
models currently requires roughly 1GB of temporary memory per
100,000 characters in one text.
RETURNS (Language): The newly constructed object.
"""
self._meta = dict(meta)
@ -134,12 +144,15 @@ class Language(object):
if vocab is True:
factory = self.Defaults.create_vocab
vocab = factory(self, **meta.get('vocab', {}))
if vocab.vectors.name is None:
vocab.vectors.name = meta.get('vectors', {}).get('name')
self.vocab = vocab
if make_doc is True:
factory = self.Defaults.create_tokenizer
make_doc = factory(self, **meta.get('tokenizer', {}))
self.tokenizer = make_doc
self.pipeline = []
self.max_length = max_length
self._optimizer = None
@property
@ -159,7 +172,8 @@ class Language(object):
self._meta.setdefault('license', '')
self._meta['vectors'] = {'width': self.vocab.vectors_length,
'vectors': len(self.vocab.vectors),
'keys': self.vocab.vectors.n_keys}
'keys': self.vocab.vectors.n_keys,
'name': self.vocab.vectors.name}
self._meta['pipeline'] = self.pipe_names
return self._meta
@ -205,8 +219,7 @@ class Language(object):
for pipe_name, component in self.pipeline:
if pipe_name == name:
return component
msg = "No component '{}' found in pipeline. Available names: {}"
raise KeyError(msg.format(name, self.pipe_names))
raise KeyError(Errors.E001.format(name=name, opts=self.pipe_names))
def create_pipe(self, name, config=dict()):
"""Create a pipeline component from a factory.
@ -216,7 +229,7 @@ class Language(object):
RETURNS (callable): Pipeline component.
"""
if name not in self.factories:
raise KeyError("Can't find factory for '{}'.".format(name))
raise KeyError(Errors.E002.format(name=name))
factory = self.factories[name]
return factory(self, **config)
@ -241,12 +254,9 @@ class Language(object):
>>> nlp.add_pipe(component, name='custom_name', last=True)
"""
if not hasattr(component, '__call__'):
msg = ("Not a valid pipeline component. Expected callable, but "
"got {}. ".format(repr(component)))
msg = Errors.E003.format(component=repr(component), name=name)
if isinstance(component, basestring_) and component in self.factories:
msg += ("If you meant to add a built-in component, use "
"create_pipe: nlp.add_pipe(nlp.create_pipe('{}'))"
.format(component))
msg += Errors.E004.format(component=component)
raise ValueError(msg)
if name is None:
if hasattr(component, 'name'):
@ -259,11 +269,9 @@ class Language(object):
else:
name = repr(component)
if name in self.pipe_names:
raise ValueError("'{}' already exists in pipeline.".format(name))
raise ValueError(Errors.E007.format(name=name, opts=self.pipe_names))
if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2:
msg = ("Invalid constraints. You can only set one of the "
"following: before, after, first, last.")
raise ValueError(msg)
raise ValueError(Errors.E006)
pipe = (name, component)
if last or not any([first, before, after]):
self.pipeline.append(pipe)
@ -274,9 +282,8 @@ class Language(object):
elif after and after in self.pipe_names:
self.pipeline.insert(self.pipe_names.index(after) + 1, pipe)
else:
msg = "Can't find '{}' in pipeline. Available names: {}"
unfound = before or after
raise ValueError(msg.format(unfound, self.pipe_names))
raise ValueError(Errors.E001.format(name=before or after,
opts=self.pipe_names))
def has_pipe(self, name):
"""Check if a component name is present in the pipeline. Equivalent to
@ -294,8 +301,7 @@ class Language(object):
component (callable): Pipeline component.
"""
if name not in self.pipe_names:
msg = "Can't find '{}' in pipeline. Available names: {}"
raise ValueError(msg.format(name, self.pipe_names))
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
self.pipeline[self.pipe_names.index(name)] = (name, component)
def rename_pipe(self, old_name, new_name):
@ -305,11 +311,9 @@ class Language(object):
new_name (unicode): New name of the component.
"""
if old_name not in self.pipe_names:
msg = "Can't find '{}' in pipeline. Available names: {}"
raise ValueError(msg.format(old_name, self.pipe_names))
raise ValueError(Errors.E001.format(name=old_name, opts=self.pipe_names))
if new_name in self.pipe_names:
msg = "'{}' already exists in pipeline. Existing names: {}"
raise ValueError(msg.format(new_name, self.pipe_names))
raise ValueError(Errors.E007.format(name=new_name, opts=self.pipe_names))
i = self.pipe_names.index(old_name)
self.pipeline[i] = (new_name, self.pipeline[i][1])
@ -320,8 +324,7 @@ class Language(object):
RETURNS (tuple): A `(name, component)` tuple of the removed component.
"""
if name not in self.pipe_names:
msg = "Can't find '{}' in pipeline. Available names: {}"
raise ValueError(msg.format(name, self.pipe_names))
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
return self.pipeline.pop(self.pipe_names.index(name))
def __call__(self, text, disable=[]):
@ -338,11 +341,18 @@ class Language(object):
>>> tokens[0].text, tokens[0].head.tag_
('An', 'NN')
"""
if len(text) >= self.max_length:
raise ValueError(Errors.E088.format(length=len(text),
max_length=self.max_length))
doc = self.make_doc(text)
for name, proc in self.pipeline:
if name in disable:
continue
if not hasattr(proc, '__call__'):
raise ValueError(Errors.E003.format(component=type(proc), name=name))
doc = proc(doc)
if doc is None:
raise ValueError(Errors.E005.format(name=name))
return doc
def disable_pipes(self, *names):
@ -384,8 +394,7 @@ class Language(object):
>>> state = nlp.update(docs, golds, sgd=optimizer)
"""
if len(docs) != len(golds):
raise IndexError("Update expects same number of docs and golds "
"Got: %d, %d" % (len(docs), len(golds)))
raise IndexError(Errors.E009.format(n_docs=len(docs), n_golds=len(golds)))
if len(docs) == 0:
return
if sgd is None:
@ -458,6 +467,8 @@ class Language(object):
else:
device = None
link_vectors_to_models(self.vocab)
if self.vocab.vectors.data.shape[1]:
cfg['pretrained_vectors'] = self.vocab.vectors.name
if sgd is None:
sgd = create_default_optimizer(Model.ops)
self._optimizer = sgd
@ -626,9 +637,10 @@ class Language(object):
"""
path = util.ensure_path(path)
deserializers = OrderedDict((
('vocab', lambda p: self.vocab.from_disk(p)),
('meta.json', lambda p: self.meta.update(util.read_json(p))),
('vocab', lambda p: (
self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self))),
('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)),
('meta.json', lambda p: self.meta.update(util.read_json(p)))
))
for name, proc in self.pipeline:
if name in disable:
@ -671,9 +683,10 @@ class Language(object):
RETURNS (Language): The `Language` object.
"""
deserializers = OrderedDict((
('vocab', lambda b: self.vocab.from_bytes(b)),
('meta', lambda b: self.meta.update(ujson.loads(b))),
('vocab', lambda b: (
self.vocab.from_bytes(b) and _fix_pretrained_vectors_name(self))),
('tokenizer', lambda b: self.tokenizer.from_bytes(b, vocab=False)),
('meta', lambda b: self.meta.update(ujson.loads(b)))
))
for i, (name, proc) in enumerate(self.pipeline):
if name in disable:
@ -685,6 +698,27 @@ class Language(object):
return self
def _fix_pretrained_vectors_name(nlp):
# TODO: Replace this once we handle vectors consistently as static
# data
if 'vectors' in nlp.meta and nlp.meta['vectors'].get('name'):
nlp.vocab.vectors.name = nlp.meta['vectors']['name']
elif not nlp.vocab.vectors.size:
nlp.vocab.vectors.name = None
elif 'name' in nlp.meta and 'lang' in nlp.meta:
vectors_name = '%s_%s.vectors' % (nlp.meta['lang'], nlp.meta['name'])
nlp.vocab.vectors.name = vectors_name
else:
raise ValueError(Errors.E092)
if nlp.vocab.vectors.size != 0:
link_vectors_to_models(nlp.vocab)
for name, proc in nlp.pipeline:
if not hasattr(proc, 'cfg'):
continue
proc.cfg.setdefault('deprecation_fixes', {})
proc.cfg['deprecation_fixes']['vectors_name'] = nlp.vocab.vectors.name
class DisabledPipes(list):
"""Manager for temporary pipeline disabling."""
def __init__(self, nlp, *names):
@ -711,14 +745,7 @@ class DisabledPipes(list):
if unexpected:
# Don't change the pipeline if we're raising an error.
self.nlp.pipeline = current
msg = (
"Some current components would be lost when restoring "
"previous pipeline state. If you added components after "
"calling nlp.disable_pipes(), you should remove them "
"explicitly with nlp.remove_pipe() before the pipeline is "
"restore. Names of the new components: %s"
)
raise ValueError(msg % unexpected)
raise ValueError(Errors.E008.format(names=unexpected))
self[:] = []

View File

@ -15,7 +15,7 @@ from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP
from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT, IS_CURRENCY, IS_OOV
from .attrs cimport PROB
from .attrs import intify_attrs
from . import about
from .errors import Errors
memset(&EMPTY_LEXEME, 0, sizeof(LexemeC))
@ -37,7 +37,8 @@ cdef class Lexeme:
self.vocab = vocab
self.orth = orth
self.c = <LexemeC*><void*>vocab.get_by_orth(vocab.mem, orth)
assert self.c.orth == orth
if self.c.orth != orth:
raise ValueError(Errors.E071.format(orth=orth, vocab_orth=self.c.orth))
def __richcmp__(self, other, int op):
if other is None:
@ -129,20 +130,25 @@ cdef class Lexeme:
lex_data = Lexeme.c_to_bytes(self.c)
start = <const char*>&self.c.flags
end = <const char*>&self.c.sentiment + sizeof(self.c.sentiment)
assert (end-start) == sizeof(lex_data.data), (end-start, sizeof(lex_data.data))
if (end-start) != sizeof(lex_data.data):
raise ValueError(Errors.E072.format(length=end-start,
bad_length=sizeof(lex_data.data)))
byte_string = b'\0' * sizeof(lex_data.data)
byte_chars = <char*>byte_string
for i in range(sizeof(lex_data.data)):
byte_chars[i] = lex_data.data[i]
assert len(byte_string) == sizeof(lex_data.data), (len(byte_string),
sizeof(lex_data.data))
if len(byte_string) != sizeof(lex_data.data):
raise ValueError(Errors.E072.format(length=len(byte_string),
bad_length=sizeof(lex_data.data)))
return byte_string
def from_bytes(self, bytes byte_string):
# This method doesn't really have a use-case --- wrote it for testing.
# Possibly delete? It puts the Lexeme out of synch with the vocab.
cdef SerializedLexemeC lex_data
assert len(byte_string) == sizeof(lex_data.data)
if len(byte_string) != sizeof(lex_data.data):
raise ValueError(Errors.E072.format(length=len(byte_string),
bad_length=sizeof(lex_data.data)))
for i in range(len(byte_string)):
lex_data.data[i] = byte_string[i]
Lexeme.c_from_bytes(self.c, lex_data)
@ -169,16 +175,13 @@ cdef class Lexeme:
def __get__(self):
cdef int length = self.vocab.vectors_length
if length == 0:
raise ValueError(
"Word vectors set to length 0. This may be because you "
"don't have a model installed or loaded, or because your "
"model doesn't include word vectors. For more info, see "
"the documentation: \n%s\n" % about.__docs_models__
)
raise ValueError(Errors.E010)
return self.vocab.get_vector(self.c.orth)
def __set__(self, vector):
assert len(vector) == self.vocab.vectors_length
if len(vector) != self.vocab.vectors_length:
raise ValueError(Errors.E073.format(new_length=len(vector),
length=self.vocab.vectors_length))
self.vocab.set_vector(self.c.orth, vector)
property rank:

View File

@ -13,6 +13,8 @@ from .vocab cimport Vocab
from .tokens.doc cimport Doc
from .tokens.doc cimport get_token_attr
from .attrs cimport ID, attr_id_t, NULL_ATTR
from .errors import Errors, TempErrors
from .attrs import IDS
from .attrs import FLAG61 as U_ENT
from .attrs import FLAG60 as B2_ENT
@ -321,6 +323,9 @@ cdef attr_t get_pattern_key(const TokenPatternC* pattern) nogil:
while pattern.nr_attr != 0:
pattern += 1
id_attr = pattern[0].attrs[0]
if id_attr.attr != ID:
with gil:
raise ValueError(Errors.E074.format(attr=ID, bad_attr=id_attr.attr))
return id_attr.value
def _convert_strings(token_specs, string_store):
@ -341,8 +346,8 @@ def _convert_strings(token_specs, string_store):
if value in operators:
ops = operators[value]
else:
msg = "Unknown operator '%s'. Options: %s"
raise KeyError(msg % (value, ', '.join(operators.keys())))
keys = ', '.join(operators.keys())
raise KeyError(Errors.E011.format(op=value, opts=keys))
if isinstance(attr, basestring):
attr = IDS.get(attr.upper())
if isinstance(value, basestring):
@ -429,9 +434,7 @@ cdef class Matcher:
"""
for pattern in patterns:
if len(pattern) == 0:
msg = ("Cannot add pattern for zero tokens to matcher.\n"
"key: {key}\n")
raise ValueError(msg.format(key=key))
raise ValueError(Errors.E012.format(key=key))
key = self._normalize_key(key)
for pattern in patterns:
specs = _convert_strings(pattern, self.vocab.strings)

View File

@ -9,6 +9,7 @@ from .attrs import LEMMA, intify_attrs
from .parts_of_speech cimport SPACE
from .parts_of_speech import IDS as POS_IDS
from .lexeme cimport Lexeme
from .errors import Errors
def _normalize_props(props):
@ -93,7 +94,7 @@ cdef class Morphology:
cdef int assign_tag_id(self, TokenC* token, int tag_id) except -1:
if tag_id > self.n_tags:
raise ValueError("Unknown tag ID: %s" % tag_id)
raise ValueError(Errors.E014.format(tag=tag_id))
# TODO: It's pretty arbitrary to put this logic here. I guess the
# justification is that this is where the specific word and the tag
# interact. Still, we should have a better way to enforce this rule, or
@ -147,9 +148,7 @@ cdef class Morphology:
elif force:
memset(cached, 0, sizeof(cached[0]))
else:
raise ValueError(
"Conflicting morphology exception for (%s, %s). Use "
"force=True to overwrite." % (tag_str, orth_str))
raise ValueError(Errors.E015.format(tag=tag_str, orth=orth_str))
cached.tag = rich_tag
# TODO: Refactor this to take arbitrary attributes.

View File

@ -8,7 +8,9 @@ cimport numpy as np
import cytoolz
from collections import OrderedDict
import ujson
import msgpack
from .util import msgpack
from .util import msgpack_numpy
from thinc.api import chain
from thinc.v2v import Affine, SELU, Softmax
@ -32,6 +34,7 @@ from .parts_of_speech import X
from ._ml import Tok2Vec, build_text_classifier, build_tagger_model
from ._ml import link_vectors_to_models, zero_init, flatten
from ._ml import create_default_optimizer
from .errors import Errors, TempErrors
from . import util
@ -77,7 +80,7 @@ def merge_noun_chunks(doc):
RETURNS (Doc): The Doc object with merged noun chunks.
"""
if not doc.is_parsed:
return
return doc
spans = [(np.start_char, np.end_char, np.root.tag, np.root.dep)
for np in doc.noun_chunks]
for start, end, tag, dep in spans:
@ -214,8 +217,10 @@ class Pipe(object):
def from_bytes(self, bytes_data, **exclude):
"""Load the pipe from a bytestring."""
def load_model(b):
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
if self.model is True:
self.cfg.setdefault('pretrained_dims', self.vocab.vectors_length)
self.model = self.Model(**self.cfg)
self.model.from_bytes(b)
@ -239,8 +244,10 @@ class Pipe(object):
def from_disk(self, path, **exclude):
"""Load the pipe from disk."""
def load_model(p):
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
if self.model is True:
self.cfg.setdefault('pretrained_dims', self.vocab.vectors_length)
self.model = self.Model(**self.cfg)
self.model.from_bytes(p.open('rb').read())
@ -298,7 +305,6 @@ class Tensorizer(Pipe):
self.model = model
self.input_models = []
self.cfg = dict(cfg)
self.cfg['pretrained_dims'] = self.vocab.vectors.data.shape[1]
self.cfg.setdefault('cnn_maxout_pieces', 3)
def __call__(self, doc):
@ -343,7 +349,8 @@ class Tensorizer(Pipe):
tensors (object): Vector representation for each token in the docs.
"""
for doc, tensor in zip(docs, tensors):
assert tensor.shape[0] == len(doc)
if tensor.shape[0] != len(doc):
raise ValueError(Errors.E076.format(rows=tensor.shape[0], words=len(doc)))
doc.tensor = tensor
def update(self, docs, golds, state=None, drop=0., sgd=None, losses=None):
@ -415,8 +422,6 @@ class Tagger(Pipe):
self.model = model
self.cfg = OrderedDict(sorted(cfg.items()))
self.cfg.setdefault('cnn_maxout_pieces', 2)
self.cfg.setdefault('pretrained_dims',
self.vocab.vectors.data.shape[1])
@property
def labels(self):
@ -477,7 +482,7 @@ class Tagger(Pipe):
doc.extend_tensor(tensors[i].get())
else:
doc.extend_tensor(tensors[i])
doc.is_tagged = True
doc.is_tagged = True
def update(self, docs, golds, drop=0., sgd=None, losses=None):
if losses is not None and self.name not in losses:
@ -527,8 +532,8 @@ class Tagger(Pipe):
vocab.morphology = Morphology(vocab.strings, new_tag_map,
vocab.morphology.lemmatizer,
exc=vocab.morphology.exc)
self.cfg['pretrained_vectors'] = kwargs.get('pretrained_vectors')
if self.model is True:
self.cfg['pretrained_dims'] = self.vocab.vectors.data.shape[1]
self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
link_vectors_to_models(self.vocab)
if sgd is None:
@ -537,6 +542,8 @@ class Tagger(Pipe):
@classmethod
def Model(cls, n_tags, **cfg):
if cfg.get('pretrained_dims') and not cfg.get('pretrained_vectors'):
raise ValueError(TempErrors.T008)
return build_tagger_model(n_tags, **cfg)
def add_label(self, label, values=None):
@ -552,9 +559,7 @@ class Tagger(Pipe):
# copy_array(larger.W[:smaller.nO], smaller.W)
# copy_array(larger.b[:smaller.nO], smaller.b)
# self.model._layers[-1] = larger
raise ValueError(
"Resizing pre-trained Tagger models is not "
"currently supported.")
raise ValueError(TempErrors.T003)
tag_map = dict(self.vocab.morphology.tag_map)
if values is None:
values = {POS: "X"}
@ -584,6 +589,10 @@ class Tagger(Pipe):
def from_bytes(self, bytes_data, **exclude):
def load_model(b):
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
if self.model is True:
token_vector_width = util.env_opt(
'token_vector_width',
@ -609,7 +618,6 @@ class Tagger(Pipe):
return self
def to_disk(self, path, **exclude):
self.cfg.setdefault('pretrained_dims', self.vocab.vectors.data.shape[1])
tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
serialize = OrderedDict((
('vocab', lambda p: self.vocab.to_disk(p)),
@ -622,6 +630,9 @@ class Tagger(Pipe):
def from_disk(self, path, **exclude):
def load_model(p):
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
if self.model is True:
self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
with p.open('rb') as file_:
@ -669,12 +680,9 @@ class MultitaskObjective(Tagger):
elif hasattr(target, '__call__'):
self.make_label = target
else:
raise ValueError("MultitaskObjective target should be function or "
"one of: dep, tag, ent, sent_start, dep_tag_offset, ent_tag.")
raise ValueError(Errors.E016)
self.cfg = dict(cfg)
self.cfg.setdefault('cnn_maxout_pieces', 2)
self.cfg.setdefault('pretrained_dims',
self.vocab.vectors.data.shape[1])
@property
def labels(self):
@ -723,7 +731,9 @@ class MultitaskObjective(Tagger):
return tokvecs, scores
def get_loss(self, docs, golds, scores):
assert len(docs) == len(golds)
if len(docs) != len(golds):
raise ValueError(Errors.E077.format(value='loss', n_docs=len(docs),
n_golds=len(golds)))
cdef int idx = 0
correct = numpy.zeros((scores.shape[0],), dtype='i')
guesses = scores.argmax(axis=1)
@ -878,8 +888,8 @@ class TextCategorizer(Pipe):
name = 'textcat'
@classmethod
def Model(cls, **cfg):
return build_text_classifier(**cfg)
def Model(cls, nr_class, **cfg):
return build_text_classifier(nr_class, **cfg)
def __init__(self, vocab, model=True, **cfg):
self.vocab = vocab
@ -962,16 +972,16 @@ class TextCategorizer(Pipe):
self.labels.append(label)
return 1
def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None):
def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None,
**kwargs):
if pipeline and getattr(pipeline[0], 'name', None) == 'tensorizer':
token_vector_width = pipeline[0].model.nO
else:
token_vector_width = 64
if self.model is True:
self.cfg['pretrained_dims'] = self.vocab.vectors_length
self.cfg['nr_class'] = len(self.labels)
self.cfg['width'] = token_vector_width
self.model = self.Model(**self.cfg)
self.cfg['pretrained_vectors'] = kwargs.get('pretrained_vectors')
self.model = self.Model(len(self.labels), **self.cfg)
link_vectors_to_models(self.vocab)
if sgd is None:
sgd = self.create_optimizer()

View File

@ -2,6 +2,7 @@
from __future__ import division, print_function, unicode_literals
from .gold import tags_to_entities, GoldParse
from .errors import Errors
class PRFScore(object):
@ -86,7 +87,6 @@ class Scorer(object):
def score(self, tokens, gold, verbose=False, punct_labels=('p', 'punct')):
if len(tokens) != len(gold):
gold = GoldParse.from_annot_tuples(tokens, zip(*gold.orig_annot))
assert len(tokens) == len(gold)
gold_deps = set()
gold_tags = set()
gold_ents = set(tags_to_entities([annot[-1]

View File

@ -13,6 +13,7 @@ from .symbols import IDS as SYMBOLS_BY_STR
from .symbols import NAMES as SYMBOLS_BY_INT
from .typedefs cimport hash_t
from .compat import json_dumps
from .errors import Errors
from . import util
@ -59,7 +60,6 @@ cdef Utf8Str* _allocate(Pool mem, const unsigned char* chars, uint32_t length) e
string.p = <unsigned char*>mem.alloc(length + 1, sizeof(unsigned char))
string.p[0] = length
memcpy(&string.p[1], chars, length)
assert string.s[0] >= sizeof(string.s) or string.s[0] == 0, string.s[0]
return string
else:
i = 0
@ -69,7 +69,6 @@ cdef Utf8Str* _allocate(Pool mem, const unsigned char* chars, uint32_t length) e
string.p[i] = 255
string.p[n_length_bytes-1] = length % 255
memcpy(&string.p[n_length_bytes], chars, length)
assert string.s[0] >= sizeof(string.s) or string.s[0] == 0, string.s[0]
return string
@ -115,7 +114,7 @@ cdef class StringStore:
self.hits.insert(key)
utf8str = <Utf8Str*>self._map.get(key)
if utf8str is NULL:
raise KeyError(string_or_id)
raise KeyError(Errors.E018.format(hash_value=string_or_id))
else:
return decode_Utf8Str(utf8str)
@ -136,8 +135,7 @@ cdef class StringStore:
key = hash_utf8(string, len(string))
self._intern_utf8(string, len(string))
else:
raise TypeError(
"Can only add unicode or bytes. Got type: %s" % type(string))
raise TypeError(Errors.E017.format(value_type=type(string)))
return key
def __len__(self):

View File

@ -10,6 +10,7 @@ from thinc.extra.search cimport MaxViolation
from .transition_system cimport TransitionSystem, Transition
from ..gold cimport GoldParse
from ..errors import Errors
from .stateclass cimport StateC, StateClass
@ -220,7 +221,8 @@ def get_states(pbeams, gbeams, beam_map, nr_update):
p_indices = []
g_indices = []
cdef Beam pbeam, gbeam
assert len(pbeams) == len(gbeams)
if len(pbeams) != len(gbeams):
raise ValueError(Errors.E079.format(pbeams=len(pbeams), gbeams=len(gbeams)))
for eg_id, (pbeam, gbeam) in enumerate(zip(pbeams, gbeams)):
p_indices.append([])
g_indices.append([])
@ -228,7 +230,8 @@ def get_states(pbeams, gbeams, beam_map, nr_update):
state = StateClass.borrow(<StateC*>pbeam.at(i))
if not state.is_final():
key = tuple([eg_id] + pbeam.histories[i])
assert key not in seen, (key, seen)
if key in seen:
raise ValueError(Errors.E080.format(key=key))
seen[key] = len(states)
p_indices[-1].append(len(states))
states.append(state)
@ -271,7 +274,8 @@ def get_gradient(nr_class, beam_maps, histories, losses):
for i in range(nr_step):
grads.append(numpy.zeros((max(beam_maps[i].values())+1, nr_class),
dtype='f'))
assert len(histories) == len(losses)
if len(histories) != len(losses):
raise ValueError(Errors.E081.format(n_hist=len(histories), losses=len(losses)))
for eg_id, hists in enumerate(histories):
for loss, hist in zip(losses[eg_id], hists):
if loss == 0.0 or numpy.isnan(loss):

View File

@ -10,21 +10,25 @@ from collections import OrderedDict, defaultdict, Counter
from thinc.extra.search cimport Beam
import json
from .nonproj import is_nonproj_tree
from ..typedefs cimport hash_t, attr_t
from ..strings cimport hash_string
from .stateclass cimport StateClass
from ._state cimport StateC
from . import nonproj
from .transition_system cimport move_cost_func_t, label_cost_func_t
from ..gold cimport GoldParse, GoldParseC
from ..structs cimport TokenC
from ..errors import Errors
# Calculate cost as gold/not gold. We don't use scalar value anyway.
cdef int BINARY_COSTS = 1
cdef weight_t MIN_SCORE = -90000
cdef attr_t SUBTOK_LABEL = hash_string('subtok')
DEF NON_MONOTONIC = True
DEF USE_BREAK = True
cdef weight_t MIN_SCORE = -90000
# Break transition from here
# http://www.aclweb.org/anthology/P13-1074
cdef enum:
@ -178,6 +182,8 @@ cdef class Reduce:
cdef class LeftArc:
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
if label == SUBTOK_LABEL and st.S(0) != (st.B(0)-1):
return 0
sent_start = st._sent[st.B_(0).l_edge].sent_start
return sent_start != 1
@ -214,6 +220,8 @@ cdef class RightArc:
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
# If there's (perhaps partial) parse pre-set, don't allow cycle.
if label == SUBTOK_LABEL and st.S(0) != (st.B(0)-1):
return 0
sent_start = st._sent[st.B_(0).l_edge].sent_start
return sent_start != 1 and st.H(st.S(0)) != st.B(0)
@ -364,6 +372,18 @@ cdef class ArcEager(TransitionSystem):
def __get__(self):
return (SHIFT, REDUCE, LEFT, RIGHT, BREAK)
def get_cost(self, StateClass state, GoldParse gold, action):
cdef Transition t = self.lookup_transition(action)
if not t.is_valid(state.c, t.label):
return 9000
else:
return t.get_cost(state, &gold.c, t.label)
def transition(self, StateClass state, action):
cdef Transition t = self.lookup_transition(action)
t.do(state.c, t.label)
return state
def is_gold_parse(self, StateClass state, GoldParse gold):
predicted = set()
truth = set()
@ -435,7 +455,10 @@ cdef class ArcEager(TransitionSystem):
parses.append((prob, parse))
return parses
cdef Transition lookup_transition(self, object name) except *:
cdef Transition lookup_transition(self, object name_or_id) except *:
if isinstance(name_or_id, int):
return self.c[name_or_id]
name = name_or_id
if '-' in name:
move_str, label_str = name.split('-', 1)
label = self.strings[label_str]
@ -455,6 +478,9 @@ cdef class ArcEager(TransitionSystem):
else:
return MOVE_NAMES[move]
def class_name(self, int i):
return self.move_name(self.c[i].move, self.c[i].label)
cdef Transition init_transition(self, int clas, int move, attr_t label) except *:
# TODO: Apparent Cython bug here when we try to use the Transition()
# constructor with the function pointers
@ -484,7 +510,7 @@ cdef class ArcEager(TransitionSystem):
t.do = Break.transition
t.get_cost = Break.cost
else:
raise Exception(move)
raise ValueError(Errors.E019.format(action=move, src='arc_eager'))
return t
cdef int initialize_state(self, StateC* st) nogil:
@ -516,7 +542,10 @@ cdef class ArcEager(TransitionSystem):
is_valid[BREAK] = Break.is_valid(st, 0)
cdef int i
for i in range(self.n_moves):
output[i] = is_valid[self.c[i].move]
if self.c[i].label == SUBTOK_LABEL:
output[i] = self.c[i].is_valid(st, self.c[i].label)
else:
output[i] = is_valid[self.c[i].move]
cdef int set_costs(self, int* is_valid, weight_t* costs,
StateClass stcls, GoldParse gold) except -1:
@ -556,35 +585,13 @@ cdef class ArcEager(TransitionSystem):
is_valid[i] = False
costs[i] = 9000
if n_gold < 1:
# Check label set --- leading cause
label_set = set([self.strings[self.c[i].label] for i in range(self.n_moves)])
for label_str in gold.labels:
if label_str is not None and label_str not in label_set:
raise ValueError("Cannot get gold parser action: unknown label: %s" % label_str)
# Check projectivity --- other leading cause
if nonproj.is_nonproj_tree(gold.heads):
raise ValueError(
"Could not find a gold-standard action to supervise the "
"dependency parser. Likely cause: the tree is "
"non-projective (i.e. it has crossing arcs -- see "
"spacy/syntax/nonproj.pyx for definitions). The ArcEager "
"transition system only supports projective trees. To "
"learn non-projective representations, transform the data "
"before training and after parsing. Either pass "
"make_projective=True to the GoldParse class, or use "
"spacy.syntax.nonproj.preprocess_training_data.")
# Check projectivity --- leading cause
if is_nonproj_tree(gold.heads):
raise ValueError(Errors.E020)
else:
print(gold.orig_annot)
print(gold.words)
print(gold.heads)
print(gold.labels)
print(gold.sent_starts)
raise ValueError(
"Could not find a gold-standard action to supervise the"
"dependency parser. The GoldParse was projective. The "
"transition system has %d actions. State at failure: %s"
% (self.n_moves, stcls.print_state(gold.words)))
assert n_gold >= 1
failure_state = stcls.print_state(gold.words)
raise ValueError(Errors.E021.format(n_actions=self.n_moves,
state=failure_state))
def get_beam_annot(self, Beam beam):
length = (<StateC*>beam.at(0)).length

View File

@ -10,6 +10,7 @@ from ._state cimport StateC
from .transition_system cimport Transition
from .transition_system cimport do_func_t
from ..gold cimport GoldParseC, GoldParse
from ..errors import Errors
cdef enum:
@ -81,9 +82,7 @@ cdef class BiluoPushDown(TransitionSystem):
for (ids, words, tags, heads, labels, biluo), _ in sents:
for i, ner_tag in enumerate(biluo):
if ner_tag != 'O' and ner_tag != '-':
if ner_tag.count('-') != 1:
raise ValueError(ner_tag)
_, label = ner_tag.split('-')
_, label = ner_tag.split('-', 1)
for action in (BEGIN, IN, LAST, UNIT):
actions[action][label] += 1
return actions
@ -170,7 +169,7 @@ cdef class BiluoPushDown(TransitionSystem):
if self.c[i].move == move and self.c[i].label == label:
return self.c[i]
else:
raise KeyError(name)
raise KeyError(Errors.E022.format(name=name))
cdef Transition init_transition(self, int clas, int move, attr_t label) except *:
# TODO: Apparent Cython bug here when we try to use the Transition()
@ -205,7 +204,7 @@ cdef class BiluoPushDown(TransitionSystem):
t.do = Out.transition
t.get_cost = Out.cost
else:
raise Exception(move)
raise ValueError(Errors.E019.format(action=move, src='ner'))
return t
def add_action(self, int action, label_name, freq=None):
@ -227,7 +226,6 @@ cdef class BiluoPushDown(TransitionSystem):
self._size *= 2
self.c = <Transition*>self.mem.realloc(self.c, self._size * sizeof(self.c[0]))
self.c[self.n_moves] = self.init_transition(self.n_moves, action, label_id)
assert self.c[self.n_moves].label == label_id
self.n_moves += 1
if self.labels.get(action, []):
freq = min(0, min(self.labels[action].values()))

View File

@ -35,6 +35,7 @@ from .._ml import link_vectors_to_models, create_default_optimizer
from ..compat import json_dumps, copy_array
from ..tokens.doc cimport Doc
from ..gold cimport GoldParse
from ..errors import Errors, TempErrors
from .. import util
from .stateclass cimport StateClass
from ._state cimport StateC
@ -244,7 +245,7 @@ cdef class Parser:
def Model(cls, nr_class, **cfg):
depth = util.env_opt('parser_hidden_depth', cfg.get('hidden_depth', 1))
if depth != 1:
raise ValueError("Currently parser depth is hard-coded to 1.")
raise ValueError(TempErrors.T004.format(value=depth))
parser_maxout_pieces = util.env_opt('parser_maxout_pieces',
cfg.get('maxout_pieces', 2))
token_vector_width = util.env_opt('token_vector_width',
@ -254,11 +255,12 @@ cdef class Parser:
hist_size = util.env_opt('history_feats', cfg.get('hist_size', 0))
hist_width = util.env_opt('history_width', cfg.get('hist_width', 0))
if hist_size != 0:
raise ValueError("Currently history size is hard-coded to 0")
raise ValueError(TempErrors.T005.format(value=hist_size))
if hist_width != 0:
raise ValueError("Currently history width is hard-coded to 0")
raise ValueError(TempErrors.T006.format(value=hist_width))
pretrained_vectors = cfg.get('pretrained_vectors', None)
tok2vec = Tok2Vec(token_vector_width, embed_size,
pretrained_dims=cfg.get('pretrained_dims', 0))
pretrained_vectors=pretrained_vectors)
tok2vec = chain(tok2vec, flatten)
lower = PrecomputableAffine(hidden_width,
nF=cls.nr_feature, nI=token_vector_width,
@ -277,6 +279,7 @@ cdef class Parser:
'token_vector_width': token_vector_width,
'hidden_width': hidden_width,
'maxout_pieces': parser_maxout_pieces,
'pretrained_vectors': pretrained_vectors,
'hist_size': hist_size,
'hist_width': hist_width
}
@ -296,9 +299,9 @@ cdef class Parser:
unless True (default), in which case a new instance is created with
`Parser.Moves()`.
model (object): Defines how the parse-state is created, updated and
evaluated. The value is set to the .model attribute unless True
(default), in which case a new instance is created with
`Parser.Model()`.
evaluated. The value is set to the .model attribute. If set to True
(default), a new instance will be created with `Parser.Model()`
in parser.begin_training(), parser.from_disk() or parser.from_bytes().
**cfg: Arbitrary configuration parameters. Set to the `.cfg` attribute
"""
self.vocab = vocab
@ -310,8 +313,7 @@ cdef class Parser:
cfg['beam_width'] = util.env_opt('beam_width', 1)
if 'beam_density' not in cfg:
cfg['beam_density'] = util.env_opt('beam_density', 0.0)
if 'pretrained_dims' not in cfg:
cfg['pretrained_dims'] = self.vocab.vectors.data.shape[1]
cfg.setdefault('cnn_maxout_pieces', 3)
self.cfg = cfg
self.model = model
self._multitasks = []
@ -551,8 +553,13 @@ cdef class Parser:
def update(self, docs, golds, drop=0., sgd=None, losses=None):
if not any(self.moves.has_gold(gold) for gold in golds):
return None
assert len(docs) == len(golds)
if self.cfg.get('beam_width', 1) >= 2 and numpy.random.random() >= 0.5:
if len(docs) != len(golds):
raise ValueError(Errors.E077.format(value='update', n_docs=len(docs),
n_golds=len(golds)))
# The probability we use beam update, instead of falling back to
# a greedy update
beam_update_prob = 1-self.cfg.get('beam_update_prob', 0.5)
if self.cfg.get('beam_width', 1) >= 2 and numpy.random.random() >= beam_update_prob:
return self.update_beam(docs, golds,
self.cfg['beam_width'], self.cfg['beam_density'],
drop=drop, sgd=sgd, losses=losses)
@ -595,7 +602,6 @@ cdef class Parser:
scores, bp_scores = vec2scores.begin_update(vector, drop=drop)
d_scores = self.get_batch_loss(states, golds, scores)
d_scores /= len(docs)
d_vector = bp_scores(d_scores, sgd=sgd)
if drop != 0:
d_vector *= mask
@ -634,7 +640,6 @@ cdef class Parser:
if losses is not None and self.name not in losses:
losses[self.name] = 0.
lengths = [len(d) for d in docs]
assert min(lengths) >= 1
states = self.moves.init_batch(docs)
for gold in golds:
self.moves.preprocess_gold(gold)
@ -648,7 +653,6 @@ cdef class Parser:
backprop_lower = []
cdef float batch_size = len(docs)
for i, d_scores in enumerate(states_d_scores):
d_scores /= batch_size
if losses is not None:
losses[self.name] += (d_scores**2).sum()
ids, bp_vectors, bp_scores = backprops[i]
@ -846,7 +850,6 @@ cdef class Parser:
self.moves.initialize_actions(actions)
cfg.setdefault('token_vector_width', 128)
if self.model is True:
cfg['pretrained_dims'] = self.vocab.vectors_length
self.model, cfg = self.Model(self.moves.n_moves, **cfg)
if sgd is None:
sgd = self.create_optimizer()
@ -910,9 +913,11 @@ cdef class Parser:
}
util.from_disk(path, deserializers, exclude)
if 'model' not in exclude:
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
path = util.ensure_path(path)
if self.model is True:
self.cfg.setdefault('pretrained_dims', self.vocab.vectors_length)
self.model, cfg = self.Model(**self.cfg)
else:
cfg = {}
@ -955,12 +960,13 @@ cdef class Parser:
))
msg = util.from_bytes(bytes_data, deserializers, exclude)
if 'model' not in exclude:
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
if self.model is True:
self.model, cfg = self.Model(**self.cfg)
cfg['pretrained_dims'] = self.vocab.vectors_length
else:
cfg = {}
cfg['pretrained_dims'] = self.vocab.vectors_length
if 'tok2vec_model' in msg:
self.model[0].from_bytes(msg['tok2vec_model'])
if 'lower_model' in msg:
@ -1033,15 +1039,11 @@ def _cleanup(Beam beam):
del state
seen.add(addr)
else:
print(i, addr)
print(seen)
raise Exception
raise ValueError(Errors.E023.format(addr=addr, i=i))
addr = <size_t>beam._states[i].content
if addr not in seen:
state = <StateC*>addr
del state
seen.add(addr)
else:
print(i, addr)
print(seen)
raise Exception
raise ValueError(Errors.E023.format(addr=addr, i=i))

View File

@ -10,6 +10,7 @@ from __future__ import unicode_literals
from copy import copy
from ..tokens.doc cimport Doc, set_children_from_heads
from ..errors import Errors
DELIMITER = '||'
@ -146,7 +147,10 @@ cpdef deprojectivize(Doc doc):
def _decorate(heads, proj_heads, labels):
# uses decoration scheme HEAD from Nivre & Nilsson 2005
assert(len(heads) == len(proj_heads) == len(labels))
if (len(heads) != len(proj_heads)) or (len(proj_heads) != len(labels)):
raise ValueError(Errors.E082.format(n_heads=len(heads),
n_proj_heads=len(proj_heads),
n_labels=len(labels)))
deco_labels = []
for tokenid, head in enumerate(heads):
if head != proj_heads[tokenid]:

View File

@ -12,6 +12,7 @@ from ..structs cimport TokenC
from .stateclass cimport StateClass
from ..typedefs cimport attr_t
from ..compat import json_dumps
from ..errors import Errors
from .. import util
@ -73,10 +74,7 @@ cdef class TransitionSystem:
action.do(state.c, action.label)
break
else:
print(gold.words)
print(gold.ner)
print(history)
raise ValueError("Could not find gold move")
raise ValueError(Errors.E024)
return history
cdef int initialize_state(self, StateC* state) nogil:
@ -123,17 +121,7 @@ cdef class TransitionSystem:
else:
costs[i] = 9000
if n_gold <= 0:
print(gold.words)
print(gold.ner)
print([gold.c.ner[i].clas for i in range(gold.length)])
print([gold.c.ner[i].move for i in range(gold.length)])
print([gold.c.ner[i].label for i in range(gold.length)])
print("Self labels",
[self.c[i].label for i in range(self.n_moves)])
raise ValueError(
"Could not find a gold-standard action to supervise "
"the entity recognizer. The transition system has "
"%d actions." % (self.n_moves))
raise ValueError(Errors.E024)
def get_class_name(self, int clas):
act = self.c[clas]
@ -171,7 +159,6 @@ cdef class TransitionSystem:
self._size *= 2
self.c = <Transition*>self.mem.realloc(self.c, self._size * sizeof(self.c[0]))
self.c[self.n_moves] = self.init_transition(self.n_moves, action, label_id)
assert self.c[self.n_moves].label == label_id
self.n_moves += 1
if self.labels.get(action, []):
new_freq = min(self.labels[action].values())

View File

@ -19,7 +19,9 @@ _languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
_models = {'en': ['en_core_web_sm'],
'de': ['de_core_news_md'],
'fr': ['fr_core_news_sm'],
'xx': ['xx_ent_web_md']}
'xx': ['xx_ent_web_md'],
'en_core_web_md': ['en_core_web_md'],
'es_core_news_md': ['es_core_news_md']}
# only used for tests that require loading the models
@ -183,6 +185,9 @@ def pytest_addoption(parser):
for lang in _languages + ['all']:
parser.addoption("--%s" % lang, action="store_true", help="Use %s models" % lang)
for model in _models:
if model not in _languages:
parser.addoption("--%s" % model, action="store_true", help="Use %s model" % model)
def pytest_runtest_setup(item):

View File

@ -0,0 +1,13 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize('string,lemma', [('affaldsgruppernes', 'affaldsgruppe'),
('detailhandelsstrukturernes', 'detailhandelsstruktur'),
('kolesterols', 'kolesterol'),
('åsyns', 'åsyn')])
def test_lemmatizer_lookup_assigns(da_tokenizer, string, lemma):
tokens = da_tokenizer(string)
assert tokens[0].lemma_ == lemma

View File

@ -1,9 +1,74 @@
from __future__ import unicode_literals
import pytest
from ...vocab import Vocab
from ...pipeline import DependencyParser
from ...tokens import Doc
from ...gold import GoldParse
from ...syntax.nonproj import projectivize
from ...syntax.stateclass import StateClass
from ...syntax.arc_eager import ArcEager
def get_sequence_costs(M, words, heads, deps, transitions):
doc = Doc(Vocab(), words=words)
gold = GoldParse(doc, heads=heads, deps=deps)
state = StateClass(doc)
M.preprocess_gold(gold)
cost_history = []
for gold_action in transitions:
state_costs = {}
for i in range(M.n_moves):
name = M.class_name(i)
state_costs[name] = M.get_cost(state, gold, i)
M.transition(state, gold_action)
cost_history.append(state_costs)
return state, cost_history
@pytest.fixture
def vocab():
return Vocab()
@pytest.fixture
def arc_eager(vocab):
moves = ArcEager(vocab.strings, ArcEager.get_actions())
moves.add_action(2, 'left')
moves.add_action(3, 'right')
return moves
@pytest.fixture
def words():
return ['a', 'b']
@pytest.fixture
def doc(words, vocab):
if vocab is None:
vocab = Vocab()
return Doc(vocab, words=list(words))
@pytest.fixture
def gold(doc, words):
if len(words) == 2:
return GoldParse(doc, words=['a', 'b'], heads=[0, 0], deps=['ROOT', 'right'])
else:
raise NotImplementedError
@pytest.mark.xfail
def test_oracle_four_words(arc_eager, vocab):
words = ['a', 'b', 'c', 'd']
heads = [1, 1, 3, 3]
deps = ['left', 'ROOT', 'left', 'ROOT']
actions = ['L-left', 'B-ROOT', 'L-left']
state, cost_history = get_sequence_costs(arc_eager, words, heads, deps, actions)
assert state.is_final()
for i, state_costs in enumerate(cost_history):
# Check gold moves is 0 cost
assert state_costs[actions[i]] == 0.0, actions[i]
for other_action, cost in state_costs.items():
if other_action != actions[i]:
assert cost >= 1
annot_tuples = [
(0, 'When', 'WRB', 11, 'advmod', 'O'),

View File

@ -0,0 +1,12 @@
from __future__ import unicode_literals
import pytest
from ...util import load_model
@pytest.mark.models("en_core_web_md")
@pytest.mark.models("es_core_news_md")
def test_models_with_different_vectors():
nlp = load_model('en_core_web_md')
doc = nlp(u'hello world')
nlp2 = load_model('es_core_news_md')
doc2 = nlp2(u'hola')
doc = nlp(u'hello world')

View File

@ -0,0 +1,15 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from ...pipeline import EntityRecognizer
from ...vocab import Vocab
@pytest.mark.parametrize('label', ['U-JOB-NAME'])
def test_issue1967(label):
ner = EntityRecognizer(Vocab())
entry = ([0], ['word'], ['tag'], [0], ['dep'], [label])
gold_parses = [(None, [(entry, None)])]
ner.moves.get_actions(gold_parses=gold_parses)

View File

@ -17,6 +17,7 @@ def meta_data():
'email': 'email-in-fixture',
'url': 'url-in-fixture',
'license': 'license-in-fixture',
'vectors': {'width': 0, 'vectors': 0, 'keys': 0, 'name': None}
}

View File

@ -10,8 +10,8 @@ from ..gold import GoldParse
def test_textcat_learns_multilabel():
random.seed(0)
numpy.random.seed(0)
random.seed(5)
numpy.random.seed(5)
docs = []
nlp = English()
vocab = nlp.vocab

View File

@ -1,4 +1,11 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from mock import Mock
from ..vocab import Vocab
from ..tokens import Doc, Span, Token
from ..tokens.underscore import Underscore
@ -51,3 +58,42 @@ def test_token_underscore_method():
None, None)
token._ = Underscore(Underscore.token_extensions, token, start=token.idx)
assert token._.hello() == 'cheese'
@pytest.mark.parametrize('obj', [Doc, Span, Token])
def test_doc_underscore_remove_extension(obj):
ext_name = 'to_be_removed'
obj.set_extension(ext_name, default=False)
assert obj.has_extension(ext_name)
obj.remove_extension(ext_name)
assert not obj.has_extension(ext_name)
@pytest.mark.parametrize('obj', [Doc, Span, Token])
def test_underscore_raises_for_dup(obj):
obj.set_extension('test', default=None)
with pytest.raises(ValueError):
obj.set_extension('test', default=None)
@pytest.mark.parametrize('invalid_kwargs', [
{'getter': None, 'setter': lambda: None},
{'default': None, 'method': lambda: None, 'getter': lambda: None},
{'setter': lambda: None},
{'default': None, 'method': lambda: None},
{'getter': True}])
def test_underscore_raises_for_invalid(invalid_kwargs):
invalid_kwargs['force'] = True
with pytest.raises(ValueError):
Doc.set_extension('test', **invalid_kwargs)
@pytest.mark.parametrize('valid_kwargs', [
{'getter': lambda: None},
{'getter': lambda: None, 'setter': lambda: None},
{'default': 'hello'},
{'default': None},
{'method': lambda: None}])
def test_underscore_accepts_valid(valid_kwargs):
valid_kwargs['force'] = True
Doc.set_extension('test', **valid_kwargs)

View File

@ -28,12 +28,38 @@ def vectors():
def data():
return numpy.asarray([[0.0, 1.0, 2.0], [3.0, -2.0, 4.0]], dtype='f')
@pytest.fixture
def resize_data():
return numpy.asarray([[0.0, 1.0], [2.0, 3.0]], dtype='f')
@pytest.fixture()
def vocab(en_vocab, vectors):
add_vecs_to_vocab(en_vocab, vectors)
return en_vocab
def test_init_vectors_with_resize_shape(strings,resize_data):
v = Vectors(shape=(len(strings), 3))
v.resize(shape=resize_data.shape)
assert v.shape == resize_data.shape
assert v.shape != (len(strings), 3)
def test_init_vectors_with_resize_data(data,resize_data):
v = Vectors(data=data)
v.resize(shape=resize_data.shape)
assert v.shape == resize_data.shape
assert v.shape != data.shape
def test_get_vector_resize(strings, data,resize_data):
v = Vectors(data=data)
v.resize(shape=resize_data.shape)
strings = [hash_string(s) for s in strings]
for i, string in enumerate(strings):
v.add(string, row=i)
assert list(v[strings[0]]) == list(resize_data[0])
assert list(v[strings[0]]) != list(resize_data[1])
assert list(v[strings[1]]) != list(resize_data[0])
assert list(v[strings[1]]) == list(resize_data[1])
def test_init_vectors_with_data(strings, data):
v = Vectors(data=data)

View File

@ -13,6 +13,7 @@ cimport cython
from .tokens.doc cimport Doc
from .strings cimport hash_string
from .errors import Errors, Warnings, deprecation_warning
from . import util
@ -63,11 +64,7 @@ cdef class Tokenizer:
return (self.__class__, args, None, None)
cpdef Doc tokens_from_list(self, list strings):
util.deprecated(
"Tokenizer.from_list is now deprecated. Create a new Doc "
"object instead and pass in the strings as the `words` keyword "
"argument, for example:\nfrom spacy.tokens import Doc\n"
"doc = Doc(nlp.vocab, words=[...])")
deprecation_warning(Warnings.W002)
return Doc(self.vocab, words=strings)
@cython.boundscheck(False)
@ -78,8 +75,7 @@ cdef class Tokenizer:
RETURNS (Doc): A container for linguistic annotations.
"""
if len(string) >= (2 ** 30):
msg = "String is too long: %d characters. Max is 2**30."
raise ValueError(msg % len(string))
raise ValueError(Errors.E025.format(length=len(string)))
cdef int length = len(string)
cdef Doc doc = Doc(self.vocab)
if length == 0:

View File

@ -0,0 +1,129 @@
# coding: utf8
# cython: infer_types=True
# cython: bounds_check=False
# cython: profile=True
from __future__ import unicode_literals
from libc.string cimport memcpy, memset
from .doc cimport Doc, set_children_from_heads, token_by_start, token_by_end
from .span cimport Span
from .token cimport Token
from ..lexeme cimport Lexeme, EMPTY_LEXEME
from ..structs cimport LexemeC, TokenC
from ..attrs cimport *
cdef class Retokenizer:
'''Helper class for doc.retokenize() context manager.'''
cdef Doc doc
cdef list merges
cdef list splits
def __init__(self, doc):
self.doc = doc
self.merges = []
self.splits = []
def merge(self, Span span, attrs=None):
'''Mark a span for merging. The attrs will be applied to the resulting
token.'''
self.merges.append((span.start_char, span.end_char, attrs))
def split(self, Token token, orths, attrs=None):
'''Mark a Token for splitting, into the specified orths. The attrs
will be applied to each subtoken.'''
self.splits.append((token.start_char, orths, attrs))
def __enter__(self):
self.merges = []
self.splits = []
return self
def __exit__(self, *args):
# Do the actual merging here
for start_char, end_char, attrs in self.merges:
start = token_by_start(self.doc.c, self.doc.length, start_char)
end = token_by_end(self.doc.c, self.doc.length, end_char)
_merge(self.doc, start, end+1, attrs)
for start_char, orths, attrs in self.splits:
raise NotImplementedError
def _merge(Doc doc, int start, int end, attributes):
"""Retokenize the document, such that the span at
`doc.text[start_idx : end_idx]` is merged into a single token. If
`start_idx` and `end_idx `do not mark start and end token boundaries,
the document remains unchanged.
start_idx (int): Character index of the start of the slice to merge.
end_idx (int): Character index after the end of the slice to merge.
**attributes: Attributes to assign to the merged token. By default,
attributes are inherited from the syntactic root of the span.
RETURNS (Token): The newly merged token, or `None` if the start and end
indices did not fall at token boundaries.
"""
cdef Span span = doc[start:end]
cdef int start_char = span.start_char
cdef int end_char = span.end_char
# Get LexemeC for newly merged token
new_orth = ''.join([t.text_with_ws for t in span])
if span[-1].whitespace_:
new_orth = new_orth[:-len(span[-1].whitespace_)]
cdef const LexemeC* lex = doc.vocab.get(doc.mem, new_orth)
# House the new merged token where it starts
cdef TokenC* token = &doc.c[start]
token.spacy = doc.c[end-1].spacy
for attr_name, attr_value in attributes.items():
if attr_name == TAG:
doc.vocab.morphology.assign_tag(token, attr_value)
else:
Token.set_struct_attr(token, attr_name, attr_value)
# Make sure ent_iob remains consistent
if doc.c[end].ent_iob == 1 and token.ent_iob in (0, 2):
if token.ent_type == doc.c[end].ent_type:
token.ent_iob = 3
else:
# If they're not the same entity type, let them be two entities
doc.c[end].ent_iob = 3
# Begin by setting all the head indices to absolute token positions
# This is easier to work with for now than the offsets
# Before thinking of something simpler, beware the case where a
# dependency bridges over the entity. Here the alignment of the
# tokens changes.
span_root = span.root.i
token.dep = span.root.dep
# We update token.lex after keeping span root and dep, since
# setting token.lex will change span.start and span.end properties
# as it modifies the character offsets in the doc
token.lex = lex
for i in range(doc.length):
doc.c[i].head += i
# Set the head of the merged token, and its dep relation, from the Span
token.head = doc.c[span_root].head
# Adjust deps before shrinking tokens
# Tokens which point into the merged token should now point to it
# Subtract the offset from all tokens which point to >= end
offset = (end - start) - 1
for i in range(doc.length):
head_idx = doc.c[i].head
if start <= head_idx < end:
doc.c[i].head = start
elif head_idx >= end:
doc.c[i].head -= offset
# Now compress the token array
for i in range(end, doc.length):
doc.c[i - offset] = doc.c[i]
for i in range(doc.length - offset, doc.length):
memset(&doc.c[i], 0, sizeof(TokenC))
doc.c[i].lex = &EMPTY_LEXEME
doc.length -= offset
for i in range(doc.length):
# ...And, set heads back to a relative position
doc.c[i].head -= i
# Set the left/right children, left/right edges
set_children_from_heads(doc.c, doc.length)
# Clear the cached Python objects
# Return the merged Python object
return doc[start]

View File

@ -28,6 +28,8 @@ cdef int token_by_start(const TokenC* tokens, int length, int start_char) except
cdef int token_by_end(const TokenC* tokens, int length, int end_char) except -2
cdef int set_children_from_heads(TokenC* tokens, int length) except -1
cdef class Doc:
cdef readonly Pool mem
cdef readonly Vocab vocab

View File

@ -31,18 +31,19 @@ from ..attrs cimport ENT_TYPE, SENT_START
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
from ..util import normalize_slice
from ..compat import is_config, copy_reg, pickle, basestring_
from .. import about
from ..errors import Errors, Warnings, deprecation_warning
from .. import util
from .underscore import Underscore
from .underscore import Underscore, get_ext_args
from ._retokenize import Retokenizer
DEF PADDING = 5
cdef int bounds_check(int i, int length, int padding) except -1:
if (i + padding) < 0:
raise IndexError
raise IndexError(Errors.E026.format(i=i, length=length))
if (i - padding) >= length:
raise IndexError
raise IndexError(Errors.E026.format(i=i, length=length))
cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
@ -94,11 +95,10 @@ cdef class Doc:
spaces=[True, False, False])
"""
@classmethod
def set_extension(cls, name, default=None, method=None,
getter=None, setter=None):
nr_defined = sum(t is not None for t in (default, getter, setter, method))
assert nr_defined == 1
Underscore.doc_extensions[name] = (default, method, getter, setter)
def set_extension(cls, name, **kwargs):
if cls.has_extension(name) and not kwargs.get('force', False):
raise ValueError(Errors.E090.format(name=name, obj='Doc'))
Underscore.doc_extensions[name] = get_ext_args(**kwargs)
@classmethod
def get_extension(cls, name):
@ -108,6 +108,12 @@ cdef class Doc:
def has_extension(cls, name):
return name in Underscore.doc_extensions
@classmethod
def remove_extension(cls, name):
if not cls.has_extension(name):
raise ValueError(Errors.E046.format(name=name))
return Underscore.doc_extensions.pop(name)
def __init__(self, Vocab vocab, words=None, spaces=None, user_data=None,
orths_and_spaces=None):
"""Create a Doc object.
@ -154,11 +160,7 @@ cdef class Doc:
if spaces is None:
spaces = [True] * len(words)
elif len(spaces) != len(words):
raise ValueError(
"Arguments 'words' and 'spaces' should be sequences of "
"the same length, or 'spaces' should be left default at "
"None. spaces should be a sequence of booleans, with True "
"meaning that the word owns a ' ' character following it.")
raise ValueError(Errors.E027)
orths_and_spaces = zip(words, spaces)
if orths_and_spaces is not None:
for orth_space in orths_and_spaces:
@ -166,10 +168,7 @@ cdef class Doc:
orth = orth_space
has_space = True
elif isinstance(orth_space, bytes):
raise ValueError(
"orths_and_spaces expects either List(unicode) or "
"List((unicode, bool)). "
"Got bytes instance: %s" % (str(orth_space)))
raise ValueError(Errors.E028.format(value=orth_space))
else:
orth, has_space = orth_space
# Note that we pass self.mem here --- we have ownership, if LexemeC
@ -437,10 +436,7 @@ cdef class Doc:
if token.ent_iob == 1:
if start == -1:
seq = ['%s|%s' % (t.text, t.ent_iob_) for t in self[i-5:i+5]]
raise ValueError(
"token.ent_iob values make invalid sequence: "
"I without B\n"
"{seq}".format(seq=' '.join(seq)))
raise ValueError(Errors.E093.format(seq=' '.join(seq)))
elif token.ent_iob == 2 or token.ent_iob == 0:
if start != -1:
output.append(Span(self, start, i, label=label))
@ -503,19 +499,16 @@ cdef class Doc:
"""
def __get__(self):
if not self.is_parsed:
raise ValueError(
"noun_chunks requires the dependency parse, which "
"requires a statistical model to be installed and loaded. "
"For more info, see the "
"documentation: \n%s\n" % about.__docs_models__)
raise ValueError(Errors.E029)
# Accumulate the result before beginning to iterate over it. This
# prevents the tokenisation from being changed out from under us
# during the iteration. The tricky thing here is that Span accepts
# its tokenisation changing, so it's okay once we have the Span
# objects. See Issue #375.
spans = []
for start, end, label in self.noun_chunks_iterator(self):
spans.append(Span(self, start, end, label=label))
if self.noun_chunks_iterator is not None:
for start, end, label in self.noun_chunks_iterator(self):
spans.append(Span(self, start, end, label=label))
for span in spans:
yield span
@ -532,12 +525,7 @@ cdef class Doc:
"""
def __get__(self):
if not self.is_sentenced:
raise ValueError(
"Sentence boundaries unset. You can add the 'sentencizer' "
"component to the pipeline with: "
"nlp.add_pipe(nlp.create_pipe('sentencizer')) "
"Alternatively, add the dependency parser, or set "
"sentence boundaries by setting doc[i].sent_start")
raise ValueError(Errors.E030)
if 'sents' in self.user_hooks:
yield from self.user_hooks['sents'](self)
else:
@ -567,7 +555,8 @@ cdef class Doc:
t.idx = (t-1).idx + (t-1).lex.length + (t-1).spacy
t.l_edge = self.length
t.r_edge = self.length
assert t.lex.orth != 0
if t.lex.orth == 0:
raise ValueError(Errors.E031.format(i=self.length))
t.spacy = has_space
self.length += 1
return t.idx + t.lex.length + t.spacy
@ -683,13 +672,7 @@ cdef class Doc:
def from_array(self, attrs, array):
if SENT_START in attrs and HEAD in attrs:
raise ValueError(
"Conflicting attributes specified in doc.from_array(): "
"(HEAD, SENT_START)\n"
"The HEAD attribute currently sets sentence boundaries "
"implicitly, based on the tree structure. This means the HEAD "
"attribute would potentially override the sentence boundaries "
"set by SENT_START.")
raise ValueError(Errors.E032)
cdef int i, col
cdef attr_id_t attr_id
cdef TokenC* tokens = self.c
@ -827,7 +810,7 @@ cdef class Doc:
RETURNS (Doc): Itself.
"""
if self.length != 0:
raise ValueError("Cannot load into non-empty Doc")
raise ValueError(Errors.E033.format(length=self.length))
deserializers = {
'text': lambda b: None,
'array_head': lambda b: None,
@ -878,7 +861,7 @@ cdef class Doc:
computed by the models in the pipeline. Let's say a
document with 30 words has a tensor with 128 dimensions
per word. doc.tensor.shape will be (30, 128). After
calling doc.extend_tensor with an array of hape (30, 64),
calling doc.extend_tensor with an array of shape (30, 64),
doc.tensor == (30, 192).
'''
xp = get_array_module(self.tensor)
@ -888,6 +871,18 @@ cdef class Doc:
else:
self.tensor = xp.hstack((self.tensor, tensor))
def retokenize(self):
'''Context manager to handle retokenization of the Doc.
Modifications to the Doc's tokenization are stored, and then
made all at once when the context manager exits. This is
much more efficient, and less error-prone.
All views of the Doc (Span and Token) created before the
retokenization are invalidated, although they may accidentally
continue to work.
'''
return Retokenizer(self)
def merge(self, int start_idx, int end_idx, *args, **attributes):
"""Retokenize the document, such that the span at
`doc.text[start_idx : end_idx]` is merged into a single token. If
@ -903,10 +898,7 @@ cdef class Doc:
"""
cdef unicode tag, lemma, ent_type
if len(args) == 3:
util.deprecated(
"Positional arguments to Doc.merge are deprecated. Instead, "
"use the keyword arguments, for example tag=, lemma= or "
"ent_type=.")
deprecation_warning(Warnings.W003)
tag, lemma, ent_type = args
attributes[TAG] = tag
attributes[LEMMA] = lemma
@ -920,13 +912,9 @@ cdef class Doc:
if 'ent_type' in attributes:
attributes[ENT_TYPE] = attributes['ent_type']
elif args:
raise ValueError(
"Doc.merge received %d non-keyword arguments. Expected either "
"3 arguments (deprecated), or 0 (use keyword arguments). "
"Arguments supplied:\n%s\n"
"Keyword arguments: %s\n" % (len(args), repr(args),
repr(attributes)))
raise ValueError(Errors.E034.format(n_args=len(args),
args=repr(args),
kwargs=repr(attributes)))
# More deprecated attribute handling =/
if 'label' in attributes:
attributes['ent_type'] = attributes.pop('label')
@ -941,66 +929,8 @@ cdef class Doc:
return None
# Currently we have the token index, we want the range-end index
end += 1
cdef Span span = self[start:end]
# Get LexemeC for newly merged token
new_orth = ''.join([t.text_with_ws for t in span])
if span[-1].whitespace_:
new_orth = new_orth[:-len(span[-1].whitespace_)]
cdef const LexemeC* lex = self.vocab.get(self.mem, new_orth)
# House the new merged token where it starts
cdef TokenC* token = &self.c[start]
token.spacy = self.c[end-1].spacy
for attr_name, attr_value in attributes.items():
if attr_name == TAG:
self.vocab.morphology.assign_tag(token, attr_value)
else:
Token.set_struct_attr(token, attr_name, attr_value)
# Make sure ent_iob remains consistent
if self.c[end].ent_iob == 1 and token.ent_iob in (0, 2):
if token.ent_type == self.c[end].ent_type:
token.ent_iob = 3
else:
# If they're not the same entity type, let them be two entities
self.c[end].ent_iob = 3
# Begin by setting all the head indices to absolute token positions
# This is easier to work with for now than the offsets
# Before thinking of something simpler, beware the case where a
# dependency bridges over the entity. Here the alignment of the
# tokens changes.
span_root = span.root.i
token.dep = span.root.dep
# We update token.lex after keeping span root and dep, since
# setting token.lex will change span.start and span.end properties
# as it modifies the character offsets in the doc
token.lex = lex
for i in range(self.length):
self.c[i].head += i
# Set the head of the merged token, and its dep relation, from the Span
token.head = self.c[span_root].head
# Adjust deps before shrinking tokens
# Tokens which point into the merged token should now point to it
# Subtract the offset from all tokens which point to >= end
offset = (end - start) - 1
for i in range(self.length):
head_idx = self.c[i].head
if start <= head_idx < end:
self.c[i].head = start
elif head_idx >= end:
self.c[i].head -= offset
# Now compress the token array
for i in range(end, self.length):
self.c[i - offset] = self.c[i]
for i in range(self.length - offset, self.length):
memset(&self.c[i], 0, sizeof(TokenC))
self.c[i].lex = &EMPTY_LEXEME
self.length -= offset
for i in range(self.length):
# ...And, set heads back to a relative position
self.c[i].head -= i
# Set the left/right children, left/right edges
set_children_from_heads(self.c, self.length)
# Clear the cached Python objects
# Return the merged Python object
with self.retokenize() as retokenizer:
retokenizer.merge(self[start:end], attrs=attributes)
return self[start]
def print_tree(self, light=False, flat=False):

View File

@ -8,7 +8,7 @@ from ..symbols import HEAD, TAG, DEP, ENT_IOB, ENT_TYPE
def merge_ents(doc):
"""Helper: merge adjacent entities into single tokens; modifies the doc."""
for ent in doc.ents:
ent.merge(ent.root.tag_, ent.text, ent.label_)
ent.merge(tag=ent.root.tag_, lemma=ent.text, ent_type=ent.label_)
return doc

View File

@ -16,16 +16,17 @@ from ..util import normalize_slice
from ..attrs cimport IS_PUNCT, IS_SPACE
from ..lexeme cimport Lexeme
from ..compat import is_config
from .. import about
from .underscore import Underscore
from ..errors import Errors, TempErrors
from .underscore import Underscore, get_ext_args
cdef class Span:
"""A slice from a Doc object."""
@classmethod
def set_extension(cls, name, default=None, method=None,
getter=None, setter=None):
Underscore.span_extensions[name] = (default, method, getter, setter)
def set_extension(cls, name, **kwargs):
if cls.has_extension(name) and not kwargs.get('force', False):
raise ValueError(Errors.E090.format(name=name, obj='Span'))
Underscore.span_extensions[name] = get_ext_args(**kwargs)
@classmethod
def get_extension(cls, name):
@ -35,6 +36,12 @@ cdef class Span:
def has_extension(cls, name):
return name in Underscore.span_extensions
@classmethod
def remove_extension(cls, name):
if not cls.has_extension(name):
raise ValueError(Errors.E046.format(name=name))
return Underscore.span_extensions.pop(name)
def __cinit__(self, Doc doc, int start, int end, attr_t label=0,
vector=None, vector_norm=None):
"""Create a `Span` object from the slice `doc[start : end]`.
@ -48,8 +55,7 @@ cdef class Span:
RETURNS (Span): The newly constructed object.
"""
if not (0 <= start <= end <= len(doc)):
raise IndexError
raise IndexError(Errors.E035.format(start=start, end=end, length=len(doc)))
self.doc = doc
self.start = start
self.start_char = self.doc[start].idx if start < self.doc.length else 0
@ -58,7 +64,8 @@ cdef class Span:
self.end_char = self.doc[end - 1].idx + len(self.doc[end - 1])
else:
self.end_char = 0
assert label in doc.vocab.strings, label
if label not in doc.vocab.strings:
raise ValueError(Errors.E084.format(label=label))
self.label = label
self._vector = vector
self._vector_norm = vector_norm
@ -267,11 +274,10 @@ cdef class Span:
or (self.doc.c[self.end-1].idx + self.doc.c[self.end-1].lex.length) != self.end_char:
start = token_by_start(self.doc.c, self.doc.length, self.start_char)
if self.start == -1:
raise IndexError("Error calculating span: Can't find start")
raise IndexError(Errors.E036.format(start=self.start_char))
end = token_by_end(self.doc.c, self.doc.length, self.end_char)
if end == -1:
raise IndexError("Error calculating span: Can't find end")
raise IndexError(Errors.E037.format(end=self.end_char))
self.start = start
self.end = end + 1
@ -294,12 +300,11 @@ cdef class Span:
cdef int i
if self.doc.is_parsed:
root = &self.doc.c[self.start]
n = 0
while root.head != 0:
root += root.head
n += 1
if n >= self.doc.length:
raise RuntimeError
raise RuntimeError(Errors.E038)
return self.doc[root.l_edge:root.r_edge + 1]
elif self.doc.is_sentenced:
# find start of the sentence
@ -314,13 +319,7 @@ cdef class Span:
n += 1
if n >= self.doc.length:
break
#
return self.doc[start:end]
else:
raise ValueError(
"Access to sentence requires either the dependency parse "
"or sentence boundaries to be set by setting " +
"doc[i].is_sent_start = True")
property has_vector:
"""RETURNS (bool): Whether a word vector is associated with the object.
@ -402,11 +401,7 @@ cdef class Span:
"""
def __get__(self):
if not self.doc.is_parsed:
raise ValueError(
"noun_chunks requires the dependency parse, which "
"requires a statistical model to be installed and loaded. "
"For more info, see the "
"documentation: \n%s\n" % about.__docs_models__)
raise ValueError(Errors.E029)
# Accumulate the result before beginning to iterate over it. This
# prevents the tokenisation from being changed out from under us
# during the iteration. The tricky thing here is that Span accepts
@ -552,9 +547,7 @@ cdef class Span:
return self.root.ent_id
def __set__(self, hash_t key):
raise NotImplementedError(
"Can't yet set ent_id from Span. Vote for this feature on "
"the issue tracker: http://github.com/explosion/spaCy/issues")
raise NotImplementedError(TempErrors.T007.format(attr='ent_id'))
property ent_id_:
"""RETURNS (unicode): The (string) entity ID."""
@ -562,9 +555,7 @@ cdef class Span:
return self.root.ent_id_
def __set__(self, hash_t key):
raise NotImplementedError(
"Can't yet set ent_id_ from Span. Vote for this feature on the "
"issue tracker: http://github.com/explosion/spaCy/issues")
raise NotImplementedError(TempErrors.T007.format(attr='ent_id_'))
property orth_:
"""Verbatim text content (identical to Span.text). Exists mostly for
@ -612,9 +603,5 @@ cdef int _count_words_to_root(const TokenC* token, int sent_length) except -1:
token += token.head
n += 1
if n >= sent_length:
raise RuntimeError(
"Array bounds exceeded while searching for root word. This "
"likely means the parse tree is in an invalid state. Please "
"report this issue here: "
"http://github.com/explosion/spaCy/issues")
raise RuntimeError(Errors.E039)
return n

View File

@ -6,6 +6,7 @@ from ..typedefs cimport attr_t, flags_t
from ..parts_of_speech cimport univ_pos_t
from .doc cimport Doc
from ..lexeme cimport Lexeme
from ..errors import Errors
cdef class Token:
@ -17,8 +18,7 @@ cdef class Token:
@staticmethod
cdef inline Token cinit(Vocab vocab, const TokenC* token, int offset, Doc doc):
if offset < 0 or offset >= doc.length:
msg = "Attempt to access token at %d, max length %d"
raise IndexError(msg % (offset, doc.length))
raise IndexError(Errors.E040.format(i=offset, max_length=doc.length))
cdef Token self = Token.__new__(Token, vocab, doc, offset)
return self

View File

@ -19,26 +19,33 @@ from ..attrs cimport IS_OOV, IS_TITLE, IS_UPPER, IS_CURRENCY, LIKE_URL, LIKE_NUM
from ..attrs cimport IS_STOP, ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX
from ..attrs cimport LENGTH, CLUSTER, LEMMA, POS, TAG, DEP
from ..compat import is_config
from ..errors import Errors
from .. import util
from .. import about
from .underscore import Underscore
from .underscore import Underscore, get_ext_args
cdef class Token:
"""An individual token i.e. a word, punctuation symbol, whitespace,
etc."""
@classmethod
def set_extension(cls, name, default=None, method=None,
getter=None, setter=None):
Underscore.token_extensions[name] = (default, method, getter, setter)
def set_extension(cls, name, **kwargs):
if cls.has_extension(name) and not kwargs.get('force', False):
raise ValueError(Errors.E090.format(name=name, obj='Token'))
Underscore.token_extensions[name] = get_ext_args(**kwargs)
@classmethod
def get_extension(cls, name):
return Underscore.span_extensions.get(name)
return Underscore.token_extensions.get(name)
@classmethod
def has_extension(cls, name):
return name in Underscore.span_extensions
return name in Underscore.token_extensions
@classmethod
def remove_extension(cls, name):
if not cls.has_extension(name):
raise ValueError(Errors.E046.format(name=name))
return Underscore.token_extensions.pop(name)
def __cinit__(self, Vocab vocab, Doc doc, int offset):
"""Construct a `Token` object.
@ -106,7 +113,7 @@ cdef class Token:
elif op == 5:
return my >= their
else:
raise ValueError(op)
raise ValueError(Errors.E041.format(op=op))
@property
def _(self):
@ -135,8 +142,7 @@ cdef class Token:
RETURNS (Token): The token at position `self.doc[self.i+i]`.
"""
if self.i+i < 0 or (self.i+i >= len(self.doc)):
msg = "Error accessing doc[%d].nbor(%d), for doc of length %d"
raise IndexError(msg % (self.i, i, len(self.doc)))
raise IndexError(Errors.E042.format(i=self.i, j=i, length=len(self.doc)))
return self.doc[self.i+i]
def similarity(self, other):
@ -354,14 +360,7 @@ cdef class Token:
property sent_start:
def __get__(self):
# Raising a deprecation warning causes errors for autocomplete
#util.deprecated(
# "Token.sent_start is now deprecated. Use Token.is_sent_start "
# "instead, which returns a boolean value or None if the answer "
# "is unknown instead of a misleading 0 for False and 1 for "
# "True. It also fixes a quirk in the old logic that would "
# "always set the property to 0 for the first word of the "
# "document.")
# Raising a deprecation warning here causes errors for autocomplete
# Handle broken backwards compatibility case: doc[0].sent_start
# was False.
if self.i == 0:
@ -386,9 +385,7 @@ cdef class Token:
def __set__(self, value):
if self.doc.is_parsed:
raise ValueError(
"Refusing to write to token.sent_start if its document "
"is parsed, because this may cause inconsistent state.")
raise ValueError(Errors.E043)
if value is None:
self.c.sent_start = 0
elif value is True:
@ -396,8 +393,7 @@ cdef class Token:
elif value is False:
self.c.sent_start = -1
else:
raise ValueError("Invalid value for token.sent_start. Must be "
"one of: None, True, False")
raise ValueError(Errors.E044.format(value=value))
property lefts:
"""The leftward immediate children of the word, in the syntactic
@ -415,8 +411,7 @@ cdef class Token:
nr_iter += 1
# This is ugly, but it's a way to guard out infinite loops
if nr_iter >= 10000000:
raise RuntimeError("Possibly infinite loop encountered "
"while looking for token.lefts")
raise RuntimeError(Errors.E045.format(attr='token.lefts'))
property rights:
"""The rightward immediate children of the word, in the syntactic
@ -434,8 +429,7 @@ cdef class Token:
ptr -= 1
nr_iter += 1
if nr_iter >= 10000000:
raise RuntimeError("Possibly infinite loop encountered "
"while looking for token.rights")
raise RuntimeError(Errors.E045.format(attr='token.rights'))
tokens.reverse()
for t in tokens:
yield t

View File

@ -3,6 +3,8 @@ from __future__ import unicode_literals
import functools
from ..errors import Errors
class Underscore(object):
doc_extensions = {}
@ -23,7 +25,7 @@ class Underscore(object):
def __getattr__(self, name):
if name not in self._extensions:
raise AttributeError(name)
raise AttributeError(Errors.E046.format(name=name))
default, method, getter, setter = self._extensions[name]
if getter is not None:
return getter(self._obj)
@ -34,7 +36,7 @@ class Underscore(object):
def __setattr__(self, name, value):
if name not in self._extensions:
raise AttributeError(name)
raise AttributeError(Errors.E047.format(name=name))
default, method, getter, setter = self._extensions[name]
if setter is not None:
return setter(self._obj, value)
@ -52,3 +54,24 @@ class Underscore(object):
def _get_key(self, name):
return ('._.', name, self._start, self._end)
def get_ext_args(**kwargs):
"""Validate and convert arguments. Reused in Doc, Token and Span."""
default = kwargs.get('default')
getter = kwargs.get('getter')
setter = kwargs.get('setter')
method = kwargs.get('method')
if getter is None and setter is not None:
raise ValueError(Errors.E089)
valid_opts = ('default' in kwargs, method is not None, getter is not None)
nr_defined = sum(t is True for t in valid_opts)
if nr_defined != 1:
raise ValueError(Errors.E083.format(nr_defined=nr_defined))
if setter is not None and not hasattr(setter, '__call__'):
raise ValueError(Errors.E091.format(name='setter', value=repr(setter)))
if getter is not None and not hasattr(getter, '__call__'):
raise ValueError(Errors.E091.format(name='getter', value=repr(getter)))
if method is not None and not hasattr(method, '__call__'):
raise ValueError(Errors.E091.format(name='method', value=repr(method)))
return (default, method, getter, setter)

View File

@ -11,8 +11,6 @@ import sys
import textwrap
import random
from collections import OrderedDict
import inspect
import warnings
from thinc.neural._classes.model import Model
from thinc.neural.ops import NumpyOps
import functools
@ -23,10 +21,12 @@ import numpy.random
from .symbols import ORTH
from .compat import cupy, CudaStream, path2str, basestring_, input_, unicode_
from .compat import import_file
from .errors import Errors
import msgpack
import msgpack_numpy
msgpack_numpy.patch()
# Import these directly from Thinc, so that we're sure we always have the
# same version.
from thinc.neural._classes.model import msgpack
from thinc.neural._classes.model import msgpack_numpy
LANGUAGES = {}
@ -50,8 +50,7 @@ def get_lang_class(lang):
try:
module = importlib.import_module('.lang.%s' % lang, 'spacy')
except ImportError:
msg = "Can't import language %s from spacy.lang."
raise ImportError(msg % lang)
raise ImportError(Errors.E048.format(lang=lang))
LANGUAGES[lang] = getattr(module, module.__all__[0])
return LANGUAGES[lang]
@ -108,7 +107,7 @@ def load_model(name, **overrides):
"""
data_path = get_data_path()
if not data_path or not data_path.exists():
raise IOError("Can't find spaCy data path: %s" % path2str(data_path))
raise IOError(Errors.E049.format(path=path2str(data_path)))
if isinstance(name, basestring_): # in data dir / shortcut
if name in set([d.name for d in data_path.iterdir()]):
return load_model_from_link(name, **overrides)
@ -118,7 +117,7 @@ def load_model(name, **overrides):
return load_model_from_path(Path(name), **overrides)
elif hasattr(name, 'exists'): # Path or Path-like to model data
return load_model_from_path(name, **overrides)
raise IOError("Can't find model '%s'" % name)
raise IOError(Errors.E050.format(name=name))
def load_model_from_link(name, **overrides):
@ -127,9 +126,7 @@ def load_model_from_link(name, **overrides):
try:
cls = import_file(name, path)
except AttributeError:
raise IOError(
"Cant' load '%s'. If you're using a shortcut link, make sure it "
"points to a valid package (not just a data directory)." % name)
raise IOError(Errors.E051.format(name=name))
return cls.load(**overrides)
@ -173,8 +170,7 @@ def load_model_from_init_py(init_file, **overrides):
data_dir = '%s_%s-%s' % (meta['lang'], meta['name'], meta['version'])
data_path = model_path / data_dir
if not model_path.exists():
msg = "Can't find model directory: %s"
raise ValueError(msg % path2str(data_path))
raise IOError(Errors.E052.format(path=path2str(data_path)))
return load_model_from_path(data_path, meta, **overrides)
@ -186,16 +182,14 @@ def get_model_meta(path):
"""
model_path = ensure_path(path)
if not model_path.exists():
msg = "Can't find model directory: %s"
raise ValueError(msg % path2str(model_path))
raise IOError(Errors.E052.format(path=path2str(model_path)))
meta_path = model_path / 'meta.json'
if not meta_path.is_file():
raise IOError("Could not read meta.json from %s" % meta_path)
raise IOError(Errors.E053.format(path=meta_path))
meta = read_json(meta_path)
for setting in ['lang', 'name', 'version']:
if setting not in meta or not meta[setting]:
msg = "No valid '%s' setting found in model meta.json"
raise ValueError(msg % setting)
raise ValueError(Errors.E054.format(setting=setting))
return meta
@ -344,13 +338,10 @@ def update_exc(base_exceptions, *addition_dicts):
for orth, token_attrs in additions.items():
if not all(isinstance(attr[ORTH], unicode_)
for attr in token_attrs):
msg = "Invalid ORTH value in exception: key='%s', orths='%s'"
raise ValueError(msg % (orth, token_attrs))
raise ValueError(Errors.E055.format(key=orth, orths=token_attrs))
described_orth = ''.join(attr[ORTH] for attr in token_attrs)
if orth != described_orth:
msg = ("Invalid tokenizer exception: ORTH values combined "
"don't match original string. key='%s', orths='%s'")
raise ValueError(msg % (orth, described_orth))
raise ValueError(Errors.E056.format(key=orth, orths=described_orth))
exc.update(additions)
exc = expand_exc(exc, "'", "")
return exc
@ -380,8 +371,7 @@ def expand_exc(excs, search, replace):
def normalize_slice(length, start, stop, step=None):
if not (step is None or step == 1):
raise ValueError("Stepped slices not supported in Span objects."
"Try: list(tokens)[start:stop:step] instead.")
raise ValueError(Errors.E057)
if start is None:
start = 0
elif start < 0:
@ -392,7 +382,6 @@ def normalize_slice(length, start, stop, step=None):
elif stop < 0:
stop += length
stop = min(length, max(start, stop))
assert 0 <= start <= stop <= length
return start, stop
@ -552,18 +541,6 @@ def from_disk(path, readers, exclude):
return path
def deprecated(message, filter='always'):
"""Show a deprecation warning.
message (unicode): The message to display.
filter (unicode): Filter value.
"""
stack = inspect.stack()[-1]
with warnings.catch_warnings():
warnings.simplefilter(filter, DeprecationWarning)
warnings.warn_explicit(message, DeprecationWarning, stack[1], stack[2])
def print_table(data, title=None):
"""Print data in table format.

View File

@ -1,24 +1,43 @@
# coding: utf8
from __future__ import unicode_literals
import functools
import numpy
from collections import OrderedDict
import msgpack
import msgpack_numpy
msgpack_numpy.patch()
from .util import msgpack
from .util import msgpack_numpy
cimport numpy as np
from thinc.neural.util import get_array_module
from thinc.neural._classes.model import Model
from .strings cimport StringStore, hash_string
from .compat import basestring_, path2str
from .errors import Errors
from . import util
from cython.operator cimport dereference as deref
from libcpp.set cimport set as cppset
def unpickle_vectors(bytes_data):
return Vectors().from_bytes(bytes_data)
class GlobalRegistry(object):
'''Global store of vectors, to avoid repeatedly loading the data.'''
data = {}
@classmethod
def register(cls, name, data):
cls.data[name] = data
return functools.partial(cls.get, name)
@classmethod
def get(cls, name):
return cls.data[name]
cdef class Vectors:
"""Store, save and load word vectors.
@ -31,18 +50,21 @@ cdef class Vectors:
the table need to be assigned --- so len(list(vectors.keys())) may be
greater or smaller than vectors.shape[0].
"""
cdef public object name
cdef public object data
cdef public object key2row
cdef public object _unset
cdef cppset[int] _unset
def __init__(self, *, shape=None, data=None, keys=None):
def __init__(self, *, shape=None, data=None, keys=None, name=None):
"""Create a new vector store.
shape (tuple): Size of the table, as (# entries, # columns)
data (numpy.ndarray): The vector data.
keys (iterable): A sequence of keys, aligned with the data.
name (string): A name to identify the vectors table.
RETURNS (Vectors): The newly created object.
"""
self.name = name
if data is None:
if shape is None:
shape = (0,0)
@ -50,9 +72,9 @@ cdef class Vectors:
self.data = data
self.key2row = OrderedDict()
if self.data is not None:
self._unset = set(range(self.data.shape[0]))
self._unset = cppset[int]({i for i in range(self.data.shape[0])})
else:
self._unset = set()
self._unset = cppset[int]()
if keys is not None:
for i, key in enumerate(keys):
self.add(key, row=i)
@ -74,7 +96,7 @@ cdef class Vectors:
@property
def is_full(self):
"""RETURNS (bool): `True` if no slots are available for new keys."""
return len(self._unset) == 0
return self._unset.size() == 0
@property
def n_keys(self):
@ -93,7 +115,7 @@ cdef class Vectors:
"""
i = self.key2row[key]
if i is None:
raise KeyError(key)
raise KeyError(Errors.E058.format(key=key))
else:
return self.data[i]
@ -105,8 +127,8 @@ cdef class Vectors:
"""
i = self.key2row[key]
self.data[i] = vector
if i in self._unset:
self._unset.remove(i)
if self._unset.count(i):
self._unset.erase(self._unset.find(i))
def __iter__(self):
"""Iterate over the keys in the table.
@ -145,7 +167,7 @@ cdef class Vectors:
xp = get_array_module(self.data)
self.data = xp.resize(self.data, shape)
filled = {row for row in self.key2row.values()}
self._unset = {row for row in range(shape[0]) if row not in filled}
self._unset = cppset[int]({row for row in range(shape[0]) if row not in filled})
removed_items = []
for key, row in list(self.key2row.items()):
if row >= shape[0]:
@ -169,7 +191,7 @@ cdef class Vectors:
YIELDS (ndarray): A vector in the table.
"""
for row, vector in enumerate(range(self.data.shape[0])):
if row not in self._unset:
if not self._unset.count(row):
yield vector
def items(self):
@ -194,7 +216,8 @@ cdef class Vectors:
RETURNS: The requested key, keys, row or rows.
"""
if sum(arg is None for arg in (key, keys, row, rows)) != 3:
raise ValueError("One (and only one) keyword arg must be set.")
bad_kwargs = {'key': key, 'keys': keys, 'row': row, 'rows': rows}
raise ValueError(Errors.E059.format(kwargs=bad_kwargs))
xp = get_array_module(self.data)
if key is not None:
if isinstance(key, basestring_):
@ -233,14 +256,14 @@ cdef class Vectors:
row = self.key2row[key]
elif row is None:
if self.is_full:
raise ValueError("Cannot add new key to vectors -- full")
row = min(self._unset)
raise ValueError(Errors.E060.format(rows=self.data.shape[0],
cols=self.data.shape[1]))
row = deref(self._unset.begin())
self.key2row[key] = row
if vector is not None:
self.data[row] = vector
if row in self._unset:
self._unset.remove(row)
if self._unset.count(row):
self._unset.erase(self._unset.find(row))
return row
def most_similar(self, queries, *, batch_size=1024):
@ -297,7 +320,7 @@ cdef class Vectors:
width = int(dims)
break
else:
raise IOError("Expected file named e.g. vectors.128.f.bin")
raise IOError(Errors.E061.format(filename=path))
bin_loc = path / 'vectors.{dims}.{dtype}.bin'.format(dims=dims,
dtype=dtype)
xp = get_array_module(self.data)
@ -346,8 +369,8 @@ cdef class Vectors:
with path.open('rb') as file_:
self.key2row = msgpack.load(file_)
for key, row in self.key2row.items():
if row in self._unset:
self._unset.remove(row)
if self._unset.count(row):
self._unset.erase(self._unset.find(row))
def load_keys(path):
if path.exists():

View File

@ -16,6 +16,7 @@ from .attrs cimport PROB, LANG, ORTH, TAG
from .structs cimport SerializedLexemeC
from .compat import copy_reg, basestring_
from .errors import Errors
from .lemmatizer import Lemmatizer
from .attrs import intify_attrs
from .vectors import Vectors
@ -100,15 +101,9 @@ cdef class Vocab:
flag_id = bit
break
else:
raise ValueError(
"Cannot find empty bit for new lexical flag. All bits "
"between 0 and 63 are occupied. You can replace one by "
"specifying the flag_id explicitly, e.g. "
"`nlp.vocab.add_flag(your_func, flag_id=IS_ALPHA`.")
raise ValueError(Errors.E062)
elif flag_id >= 64 or flag_id < 1:
raise ValueError(
"Invalid value for flag_id: %d. Flag IDs must be between "
"1 and 63 (inclusive)" % flag_id)
raise ValueError(Errors.E063.format(value=flag_id))
for lex in self:
lex.set_flag(flag_id, flag_getter(lex.orth_))
self.lex_attr_getters[flag_id] = flag_getter
@ -127,8 +122,9 @@ cdef class Vocab:
cdef size_t addr
if lex != NULL:
if lex.orth != self.strings[string]:
raise LookupError.mismatched_strings(
lex.orth, self.strings[string], string)
raise KeyError(Errors.E064.format(string=lex.orth,
orth=self.strings[string],
orth_id=string))
return lex
else:
return self._new_lexeme(mem, string)
@ -171,7 +167,8 @@ cdef class Vocab:
if not is_oov:
key = hash_string(string)
self._add_lex_to_vocab(key, lex)
assert lex != NULL, string
if lex == NULL:
raise ValueError(Errors.E085.format(string=string))
return lex
cdef int _add_lex_to_vocab(self, hash_t key, const LexemeC* lex) except -1:
@ -254,7 +251,7 @@ cdef class Vocab:
width, you have to call this to change the size of the vectors.
"""
if width is not None and shape is not None:
raise ValueError("Only one of width and shape can be specified")
raise ValueError(Errors.E065.format(width=width, shape=shape))
elif shape is not None:
self.vectors = Vectors(shape=shape)
else:
@ -381,7 +378,8 @@ cdef class Vocab:
self.lexemes_from_bytes(file_.read())
if self.vectors is not None:
self.vectors.from_disk(path, exclude='strings.json')
link_vectors_to_models(self)
if self.vectors.name is not None:
link_vectors_to_models(self)
return self
def to_bytes(self, **exclude):
@ -421,6 +419,8 @@ cdef class Vocab:
('vectors', lambda b: serialize_vectors(b))
))
util.from_bytes(bytes_data, setters, exclude)
if self.vectors.name is not None:
link_vectors_to_models(self)
return self
def lexemes_to_bytes(self):
@ -468,7 +468,10 @@ cdef class Vocab:
if ptr == NULL:
continue
py_str = self.strings[lexeme.orth]
assert self.strings[py_str] == lexeme.orth, (py_str, lexeme.orth)
if self.strings[py_str] != lexeme.orth:
raise ValueError(Errors.E086.format(string=py_str,
orth_id=lexeme.orth,
hash_id=self.strings[py_str]))
key = hash_string(py_str)
self._by_hash.set(key, lexeme)
self._by_orth.set(lexeme.orth, lexeme)
@ -509,16 +512,3 @@ def unpickle_vocab(sstore, vectors, morphology, data_dir,
copy_reg.pickle(Vocab, pickle_vocab, unpickle_vocab)
class LookupError(Exception):
@classmethod
def mismatched_strings(cls, id_, id_string, original_string):
return cls(
"Error fetching a Lexeme from the Vocab. When looking up a "
"string, the lexeme returned had an orth ID that did not match "
"the query string. This means that the cached lexeme structs are "
"mismatched to the string encoding table. The mismatched:\n"
"Query string: {}\n"
"Orth cached: {}\n"
"Orth ID: {}".format(repr(original_string), repr(id_string), id_))

View File

@ -1,7 +1,7 @@
{
"globals": {
"title": "spaCy",
"description": "spaCy is a free open-source library featuring state-of-the-art speed and accuracy and a powerful Python API.",
"description": "spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.",
"SITENAME": "spaCy",
"SLOGAN": "Industrial-strength Natural Language Processing in Python",
@ -10,10 +10,13 @@
"COMPANY": "Explosion AI",
"COMPANY_URL": "https://explosion.ai",
"DEMOS_URL": "https://demos.explosion.ai",
"DEMOS_URL": "https://explosion.ai/demos",
"MODELS_REPO": "explosion/spacy-models",
"KERNEL_BINDER": "ines/spacy-binder",
"KERNEL_PYTHON": "python3",
"SPACY_VERSION": "2.0",
"BINDER_VERSION": "2.0.11",
"SOCIAL": {
"twitter": "spacy_io",
@ -26,7 +29,8 @@
"NAVIGATION": {
"Usage": "/usage",
"Models": "/models",
"API": "/api"
"API": "/api",
"Universe": "/universe"
},
"FOOTER": {
@ -34,7 +38,7 @@
"Usage": "/usage",
"Models": "/models",
"API Reference": "/api",
"Resources": "/usage/resources"
"Universe": "/universe"
},
"Support": {
"Issue Tracker": "https://github.com/explosion/spaCy/issues",
@ -82,8 +86,8 @@
}
],
"V_CSS": "2.0.1",
"V_JS": "2.0.1",
"V_CSS": "2.1.2",
"V_JS": "2.1.0",
"DEFAULT_SYNTAX": "python",
"ANALYTICS": "UA-58931649-1",
"MAILCHIMP": {

View File

@ -15,12 +15,39 @@
- MODEL_META = public.models._data.MODEL_META
- MODEL_LICENSES = public.models._data.MODEL_LICENSES
- MODEL_BENCHMARKS = public.models._data.MODEL_BENCHMARKS
- EXAMPLE_SENT_LANGS = public.models._data.EXAMPLE_SENT_LANGS
- EXAMPLE_SENTENCES = public.models._data.EXAMPLE_SENTENCES
- IS_PAGE = (SECTION != "index") && !landing
- IS_MODELS = (SECTION == "models" && LANGUAGES[current.source])
- HAS_MODELS = IS_MODELS && CURRENT_MODELS.length
//- Get page URL
- function getPageUrl() {
- var path = current.path;
- if(path[path.length - 1] == 'index') path = path.slice(0, path.length - 1);
- return `${SITE_URL}/${path.join('/')}`;
- }
//- Get pretty page title depending on section
- function getPageTitle() {
- var sections = ['api', 'usage', 'models'];
- if (sections.includes(SECTION)) {
- var titleSection = (SECTION == "api") ? 'API' : SECTION.charAt(0).toUpperCase() + SECTION.slice(1);
- return `${title} · ${SITENAME} ${titleSection} Documentation`;
- }
- else if (SECTION != 'index') return `${title} · ${SITENAME}`;
- return `${SITENAME} · ${SLOGAN}`;
- }
//- Get social image based on section and settings
- function getPageImage() {
- var img = (SECTION == 'api') ? 'api' : 'default';
- return `${SITE_URL}/assets/img/social/preview_${preview || img}.jpg`;
- }
//- Add prefixes to items of an array (for modifier CSS classes)
array - [array] list of class names or options, e.g. ["foot"]

View File

@ -7,7 +7,7 @@ include _functions
id - [string] anchor assigned to section (used for breadcrumb navigation)
mixin section(id)
section.o-section(id="section-" + id data-section=id)
section.o-section(id=id ? "section-" + id : null data-section=id)&attributes(attributes)
block
@ -143,7 +143,7 @@ mixin aside-wrapper(label, emoji)
mixin aside(label, emoji)
+aside-wrapper(label, emoji)
.c-aside__text.u-text-small
.c-aside__text.u-text-small&attributes(attributes)
block
@ -154,7 +154,7 @@ mixin aside(label, emoji)
prompt - [string] prompt displayed before first line, e.g. "$"
mixin aside-code(label, language, prompt)
+aside-wrapper(label)
+aside-wrapper(label)&attributes(attributes)
+code(false, language, prompt).o-no-block
block
@ -165,7 +165,7 @@ mixin aside-code(label, language, prompt)
argument to be able to wrap it for spacing
mixin infobox(label, emoji)
aside.o-box.o-block.u-text-small
aside.o-box.o-block.u-text-small&attributes(attributes)
if label
h3.u-heading.u-text-label.u-color-theme
if emoji
@ -242,7 +242,9 @@ mixin button(url, trusted, ...style)
wrap - [boolean] wrap text and disable horizontal scrolling
mixin code(label, language, prompt, height, icon, wrap)
pre.c-code-block.o-block(class="lang-#{(language || DEFAULT_SYNTAX)}" class=icon ? "c-code-block--has-icon" : null style=height ? "height: #{height}px" : null)&attributes(attributes)
- var lang = (language != "none") ? (language || DEFAULT_SYNTAX) : null
- var lang_class = (language != "none") ? "lang-" + (language || DEFAULT_SYNTAX) : null
pre.c-code-block.o-block(data-language=lang class=lang_class class=icon ? "c-code-block--has-icon" : null style=height ? "height: #{height}px" : null)&attributes(attributes)
if label
h4.u-text-label.u-text-label--dark=label
if icon
@ -253,6 +255,15 @@ mixin code(label, language, prompt, height, icon, wrap)
code.c-code-block__content(class=wrap ? "u-wrap" : null data-prompt=prompt)
block
//- Executable code
mixin code-exec(label, large)
- label = (label || "Editable code example") + " (experimental)"
+terminal-wrapper(label, !large)
figure.thebelab-wrapper
span.thebelab-wrapper__text.u-text-tiny v#{BINDER_VERSION} &middot; Python 3 &middot; via #[+a("https://mybinder.org/").u-hide-link Binder]
+code(data-executable="true")&attributes(attributes)
block
//- Wrapper for code blocks to display old/new versions
@ -658,12 +669,16 @@ mixin qs(data, style)
//- Terminal-style code window
label - [string] title displayed in top bar of terminal window
mixin terminal(label, button_text, button_url)
.x-terminal
.x-terminal__icons: span
.u-padding-small.u-text-label.u-text-center=label
mixin terminal-wrapper(label, small)
.x-terminal(class=small ? "x-terminal--small" : null)
.x-terminal__icons(class=small ? "x-terminal__icons--small" : null): span
.u-padding-small.u-text-center(class=small ? "u-text-tiny" : "u-text")
strong=label
block
+code.x-terminal__code
mixin terminal(label, button_text, button_url, exec)
+terminal-wrapper(label)
+code.x-terminal__code(data-executable=exec ? "" : null)
block
if button_text && button_url

View File

@ -10,10 +10,7 @@ nav.c-nav.u-text.js-nav(class=landing ? "c-nav--theme" : null)
li.c-nav__menu__item(class=is_active ? "is-active" : null)
+a(url)(tabindex=is_active ? "-1" : null)=item
li.c-nav__menu__item.u-hidden-xs
+a("https://survey.spacy.io", true) User Survey 2018
li.c-nav__menu__item.u-hidden-xs
li.c-nav__menu__item
+a(gh("spaCy"))(aria-label="GitHub") #[+icon("github", 20)]
progress.c-progress.js-progress(value="0" max="1")

View File

@ -1,77 +1,110 @@
//- 💫 INCLUDES > MODELS PAGE TEMPLATE
for id in CURRENT_MODELS
- var comps = getModelComponents(id)
+section(id)
+grid("vcenter").o-no-block(id=id)
+grid-col("two-thirds")
+h(2)
+a("#" + id).u-permalink=id
section(data-vue=id data-model=id)
+grid("vcenter").o-no-block(id=id)
+grid-col("two-thirds")
+h(2)
+a("#" + id).u-permalink=id
+grid-col("third").u-text-right
.u-color-subtle.u-text-tiny
+button(gh("spacy-models") + "/releases", true, "secondary", "small")(data-tpl=id data-tpl-key="download")
| Release details
.u-padding-small Latest: #[code(data-tpl=id data-tpl-key="version") n/a]
+grid-col("third").u-text-right
.u-color-subtle.u-text-tiny
+button(gh("spacy-models") + "/releases", true, "secondary", "small")(v-bind:href="releaseUrl")
| Release details
.u-padding-small Latest: #[code(v-text="version") n/a]
+aside-code("Installation", "bash", "$").
python -m spacy download #{id}
+aside-code("Installation", "bash", "$").
python -m spacy download #{id}
- var comps = getModelComponents(id)
p(v-if="description" v-text="description")
p(data-tpl=id data-tpl-key="description")
div(data-tpl=id data-tpl-key="error")
+infobox
+infobox(v-if="error")
| Unable to load model details from GitHub. To find out more
| about this model, see the overview of the
| #[+a(gh("spacy-models") + "/releases") latest model releases].
+table.o-block-small(data-tpl=id data-tpl-key="table")
+row
+cell #[+label Language]
+cell #[+tag=comps.lang] #{LANGUAGES[comps.lang]}
for comp, label in {"Type": comps.type, "Genre": comps.genre}
+table.o-block-small(v-bind:data-loading="loading")
+row
+cell #[+label=label]
+cell #[+tag=comp] #{MODEL_META[comp]}
+row
+cell #[+label Size]
+cell #[+tag=comps.size] #[span(data-tpl=id data-tpl-key="size") #[em n/a]]
+cell #[+label Language]
+cell #[+tag=comps.lang] #{LANGUAGES[comps.lang]}
for comp, label in {"Type": comps.type, "Genre": comps.genre}
+row
+cell #[+label=label]
+cell #[+tag=comp] #{MODEL_META[comp]}
+row
+cell #[+label Size]
+cell #[+tag=comps.size] #[span(v-text="sizeFull" v-if="sizeFull")] #[em(v-else="") n/a]
each label in ["Pipeline", "Vectors", "Sources", "Author", "License"]
- var field = label.toLowerCase()
if field == "vectors"
- field = "vecs"
+row
+cell.u-nowrap
+label=label
if MODEL_META[field]
| #[+help(MODEL_META[field]).u-color-subtle]
+row(v-if="pipeline && pipeline.length" v-cloak="")
+cell
span(data-tpl=id data-tpl-key=field) #[em n/a]
+label Pipeline #[+help(MODEL_META.pipeline).u-color-subtle]
+cell
span(v-for="(pipe, index) in pipeline" v-if="pipeline")
code(v-text="pipe")
span(v-if="index != pipeline.length - 1") ,&nbsp;
+row(data-tpl=id data-tpl-key="compat-wrapper" hidden="")
+cell
+label Compat #[+help("Latest compatible model version for your spaCy installation").u-color-subtle]
+cell
.o-field.u-float-left
select.o-field__select.u-text-small(data-tpl=id data-tpl-key="compat")
div(data-tpl=id data-tpl-key="compat-versions") &nbsp;
+row(v-if="vectors" v-cloak="")
+cell
+label Vectors #[+help(MODEL_META.vectors).u-color-subtle]
+cell(v-text="vectors")
section(data-tpl=id data-tpl-key="benchmarks" hidden="")
+grid.o-block-small
+row(v-if="sources && sources.length" v-cloak="")
+cell
+label Sources #[+help(MODEL_META.sources).u-color-subtle]
+cell
span(v-for="(source, index) in sources") {{ source }}
span(v-if="index != sources.length - 1") ,&nbsp;
+row(v-if="author" v-cloak="")
+cell #[+label Author]
+cell
+a("")(v-bind:href="url" v-if="url" v-text="author")
span(v-else="" v-text="author") {{ model.author }}
+row(v-if="license" v-cloak="")
+cell #[+label License]
+cell
+a("")(v-bind:href="modelLicenses[license]" v-if="modelLicenses[license]") {{ license }}
span(v-else="") {{ license }}
+row(v-cloak="")
+cell #[+label Compat #[+help(MODEL_META.compat).u-color-subtle]]
+cell
.o-field.u-float-left
select.o-field__select.u-text-small(v-model="spacyVersion")
option(v-for="version in orderedCompat" v-bind:value="version") spaCy v{{ version }}
code(v-if="compatVersion" v-text="compatVersion")
em(v-else="") not compatible
+grid.o-block-small(v-cloak="" v-if="hasAccuracy")
for keys, label in MODEL_BENCHMARKS
.u-flex-full.u-padding-small(data-tpl=id data-tpl-key=label.toLowerCase() hidden="")
.u-flex-full.u-padding-small
+table.o-block-small
+row("head")
+head-cell(colspan="2")=(MODEL_META["benchmark_" + label] || label)
for label, field in keys
+row(hidden="")
+row
+cell.u-nowrap
+label=label
if MODEL_META[field]
| #[+help(MODEL_META[field]).u-color-subtle]
+cell("num")(data-tpl=id data-tpl-key=field)
| n/a
+cell("num")
span(v-if="#{field}" v-text="#{field}")
em(v-if="!#{field}") n/a
p.u-text-small.u-color-dark(v-if="notes" v-text="notes" v-cloak="")
if comps.size == "sm" && EXAMPLE_SENT_LANGS.includes(comps.lang)
section
+code-exec("Test the model live").
import spacy
from spacy.lang.#{comps.lang}.examples import sentences
nlp = spacy.load('#{id}')
doc = nlp(sentences[0])
print(doc.text)
for token in doc:
print(token.text, token.pos_, token.dep_)
p.u-text-small.u-color-dark(data-tpl=id data-tpl-key="notes")

View File

@ -1,86 +1,33 @@
//- 💫 INCLUDES > SCRIPTS
if quickstart
script(src="/assets/js/vendor/quickstart.min.js")
if IS_PAGE || SECTION == "index"
script(type="text/x-thebe-config")
| { bootstrap: true, binderOptions: { repo: "#{KERNEL_BINDER}"},
| kernelOptions: { name: "#{KERNEL_PYTHON}" }}
if IS_PAGE
script(src="/assets/js/vendor/in-view.min.js")
- scripts = ["vendor/prism.min", "vendor/vue.min"]
- if (SECTION == "universe") scripts.push("vendor/vue-markdown.min")
- if (quickstart) scripts.push("vendor/quickstart.min")
- if (IS_PAGE) scripts.push("vendor/in-view.min")
- if (IS_PAGE || SECTION == "index") scripts.push("vendor/thebelab.custom.min")
for script in scripts
script(src="/assets/js/" + script + ".js")
script(src="/assets/js/main.js?v#{V_JS}" type=(environment == "deploy") ? null : "module")
if environment == "deploy"
script(async src="https://www.google-analytics.com/analytics.js")
script(src="/assets/js/vendor/prism.min.js")
if compare_models
script(src="/assets/js/vendor/chart.min.js")
script
if quickstart
| new Quickstart("#qs");
if environment == "deploy"
script(src="https://www.google-analytics.com/analytics.js", async)
script
| window.ga=window.ga||function(){
| (ga.q=ga.q||[]).push(arguments)}; ga.l=+new Date;
| ga('create', '#{ANALYTICS}', 'auto'); ga('send', 'pageview');
if IS_PAGE
if IS_PAGE
script(src="https://sidecar.gitter.im/dist/sidecar.v1.js" async defer)
script
| ((window.gitter = {}).chat = {}).options = {
| useStyles: false,
| activationElement: '.js-gitter-button',
| targetElement: '.js-gitter',
| room: '!{SOCIAL.gitter}'
| };
if IS_PAGE
script(src="https://sidecar.gitter.im/dist/sidecar.v1.js" async defer)
//- JS modules slightly hacky, but necessary to dynamically instantiate the
classes with data from the Harp JSON files, while still being able to
support older browsers that can't handle JS modules. More details:
https://medium.com/dev-channel/es6-modules-in-chrome-canary-m60-ba588dfb8ab7
- ProgressBar = "new ProgressBar('.js-progress');"
- Accordion = "new Accordion('.js-accordion');"
- Changelog = "new Changelog('" + SOCIAL.github + "', 'spacy');"
- NavHighlighter = "new NavHighlighter('data-section', 'data-nav');"
- GitHubEmbed = "new GitHubEmbed('" + SOCIAL.github + "', 'data-gh-embed');"
- ModelLoader = "new ModelLoader('" + MODELS_REPO + "'," + JSON.stringify(CURRENT_MODELS) + "," + JSON.stringify(MODEL_LICENSES) + "," + JSON.stringify(MODEL_BENCHMARKS) + ");"
- ModelComparer = "new ModelComparer('" + MODELS_REPO + "'," + JSON.stringify(MODEL_LICENSES) + "," + JSON.stringify(MODEL_BENCHMARKS) + "," + JSON.stringify(LANGUAGES) + "," + JSON.stringify(MODEL_META) + "," + JSON.stringify(default_models || false) + ");"
if environment == "deploy"
//- DEPLOY: use compiled rollup.js and instantiate classes directly
script(src="/assets/js/rollup.js?v#{V_JS}")
script
!=ProgressBar
if changelog
!=Changelog
if IS_PAGE
!=NavHighlighter
!=GitHubEmbed
!=Accordion
if HAS_MODELS
!=ModelLoader
if compare_models
!=ModelComparer
else
//- DEVELOPMENT: Use ES6 modules
script(type="module")
| import ProgressBar from '/assets/js/progress.js';
!=ProgressBar
if changelog
| import Changelog from '/assets/js/changelog.js';
!=Changelog
if IS_PAGE
| import NavHighlighter from '/assets/js/nav-highlighter.js';
!=NavHighlighter
| import GitHubEmbed from '/assets/js/github-embed.js';
!=GitHubEmbed
| import Accordion from '/assets/js/accordion.js';
!=Accordion
if HAS_MODELS
| import { ModelLoader } from '/assets/js/models.js';
!=ModelLoader
if compare_models
| import { ModelComparer } from '/assets/js/models.js';
!=ModelComparer

View File

@ -7,6 +7,12 @@ svg(style="position: absolute; visibility: hidden; width: 0; height: 0;" width="
symbol#svg_github(viewBox="0 0 27 32")
path(d="M13.714 2.286q3.732 0 6.884 1.839t4.991 4.991 1.839 6.884q0 4.482-2.616 8.063t-6.759 4.955q-0.482 0.089-0.714-0.125t-0.232-0.536q0-0.054 0.009-1.366t0.009-2.402q0-1.732-0.929-2.536 1.018-0.107 1.83-0.321t1.679-0.696 1.446-1.188 0.946-1.875 0.366-2.688q0-2.125-1.411-3.679 0.661-1.625-0.143-3.643-0.5-0.161-1.446 0.196t-1.643 0.786l-0.679 0.429q-1.661-0.464-3.429-0.464t-3.429 0.464q-0.286-0.196-0.759-0.482t-1.491-0.688-1.518-0.241q-0.804 2.018-0.143 3.643-1.411 1.554-1.411 3.679 0 1.518 0.366 2.679t0.938 1.875 1.438 1.196 1.679 0.696 1.83 0.321q-0.696 0.643-0.875 1.839-0.375 0.179-0.804 0.268t-1.018 0.089-1.17-0.384-0.991-1.116q-0.339-0.571-0.866-0.929t-0.884-0.429l-0.357-0.054q-0.375 0-0.518 0.080t-0.089 0.205 0.161 0.25 0.232 0.214l0.125 0.089q0.393 0.179 0.777 0.679t0.563 0.911l0.179 0.411q0.232 0.679 0.786 1.098t1.196 0.536 1.241 0.125 0.991-0.063l0.411-0.071q0 0.679 0.009 1.58t0.009 0.973q0 0.321-0.232 0.536t-0.714 0.125q-4.143-1.375-6.759-4.955t-2.616-8.063q0-3.732 1.839-6.884t4.991-4.991 6.884-1.839zM5.196 21.982q0.054-0.125-0.125-0.214-0.179-0.054-0.232 0.036-0.054 0.125 0.125 0.214 0.161 0.107 0.232-0.036zM5.75 22.589q0.125-0.089-0.036-0.286-0.179-0.161-0.286-0.054-0.125 0.089 0.036 0.286 0.179 0.179 0.286 0.054zM6.286 23.393q0.161-0.125 0-0.339-0.143-0.232-0.304-0.107-0.161 0.089 0 0.321t0.304 0.125zM7.036 24.143q0.143-0.143-0.071-0.339-0.214-0.214-0.357-0.054-0.161 0.143 0.071 0.339 0.214 0.214 0.357 0.054zM8.054 24.589q0.054-0.196-0.232-0.286-0.268-0.071-0.339 0.125t0.232 0.268q0.268 0.107 0.339-0.107zM9.179 24.679q0-0.232-0.304-0.196-0.286 0-0.286 0.196 0 0.232 0.304 0.196 0.286 0 0.286-0.196zM10.214 24.5q-0.036-0.196-0.321-0.161-0.286 0.054-0.25 0.268t0.321 0.143 0.25-0.25z")
symbol#svg_twitter(viewBox="0 0 30 32")
path(d="M28.929 7.286q-1.196 1.75-2.893 2.982 0.018 0.25 0.018 0.75 0 2.321-0.679 4.634t-2.063 4.437-3.295 3.759-4.607 2.607-5.768 0.973q-4.839 0-8.857-2.589 0.625 0.071 1.393 0.071 4.018 0 7.161-2.464-1.875-0.036-3.357-1.152t-2.036-2.848q0.589 0.089 1.089 0.089 0.768 0 1.518-0.196-2-0.411-3.313-1.991t-1.313-3.67v-0.071q1.214 0.679 2.607 0.732-1.179-0.786-1.875-2.054t-0.696-2.75q0-1.571 0.786-2.911 2.161 2.661 5.259 4.259t6.634 1.777q-0.143-0.679-0.143-1.321 0-2.393 1.688-4.080t4.080-1.688q2.5 0 4.214 1.821 1.946-0.375 3.661-1.393-0.661 2.054-2.536 3.179 1.661-0.179 3.321-0.893z")
symbol#svg_website(viewBox="0 0 32 32")
path(d="M22.658 10.988h5.172c0.693 1.541 1.107 3.229 1.178 5.012h-5.934c-0.025-1.884-0.181-3.544-0.416-5.012zM20.398 3.896c2.967 1.153 5.402 3.335 6.928 6.090h-4.836c-0.549-2.805-1.383-4.799-2.092-6.090zM16.068 9.986v-6.996c1.066 0.047 2.102 0.216 3.092 0.493 0.75 1.263 1.719 3.372 2.33 6.503h-5.422zM9.489 22.014c-0.234-1.469-0.396-3.119-0.421-5.012h5.998v5.012h-5.577zM9.479 10.988h5.587v5.012h-6.004c0.025-1.886 0.183-3.543 0.417-5.012zM11.988 3.461c0.987-0.266 2.015-0.435 3.078-0.469v6.994h-5.422c0.615-3.148 1.591-5.265 2.344-6.525zM3.661 9.986c1.551-2.8 4.062-4.993 7.096-6.131-0.715 1.29-1.559 3.295-2.114 6.131h-4.982zM8.060 16h-6.060c0.066-1.781 0.467-3.474 1.158-5.012h5.316c-0.233 1.469-0.39 3.128-0.414 5.012zM8.487 22.014h-5.29c-0.694-1.543-1.139-3.224-1.204-5.012h6.071c0.024 1.893 0.188 3.541 0.423 5.012zM8.651 23.016c0.559 2.864 1.416 4.867 2.134 6.142-3.045-1.133-5.557-3.335-7.11-6.142h4.976zM15.066 23.016v6.994c-1.052-0.033-2.067-0.199-3.045-0.46-0.755-1.236-1.736-3.363-2.356-6.534h5.401zM21.471 23.016c-0.617 3.152-1.592 5.271-2.344 6.512-0.979 0.271-2.006 0.418-3.059 0.465v-6.977h5.403zM16.068 17.002h5.998c-0.023 1.893-0.188 3.542-0.422 5.012h-5.576v-5.012zM22.072 16h-6.004v-5.012h5.586c0.235 1.469 0.393 3.126 0.418 5.012zM23.070 17.002h5.926c-0.066 1.787-0.506 3.468-1.197 5.012h-5.152c0.234-1.471 0.398-3.119 0.423-5.012zM27.318 23.016c-1.521 2.766-3.967 4.949-6.947 6.1 0.715-1.276 1.561-3.266 2.113-6.1h4.834z")
symbol#svg_code(viewBox="0 0 20 20")
path(d="M5.719 14.75c-0.236 0-0.474-0.083-0.664-0.252l-5.060-4.498 5.341-4.748c0.412-0.365 1.044-0.33 1.411 0.083s0.33 1.045-0.083 1.412l-3.659 3.253 3.378 3.002c0.413 0.367 0.45 0.999 0.083 1.412-0.197 0.223-0.472 0.336-0.747 0.336zM14.664 14.748l5.341-4.748-5.060-4.498c-0.413-0.367-1.045-0.33-1.411 0.083s-0.33 1.045 0.083 1.412l3.378 3.003-3.659 3.252c-0.413 0.367-0.45 0.999-0.083 1.412 0.197 0.223 0.472 0.336 0.747 0.336 0.236 0 0.474-0.083 0.664-0.252zM9.986 16.165l2-12c0.091-0.545-0.277-1.060-0.822-1.151-0.547-0.092-1.061 0.277-1.15 0.822l-2 12c-0.091 0.545 0.277 1.060 0.822 1.151 0.056 0.009 0.11 0.013 0.165 0.013 0.48 0 0.904-0.347 0.985-0.835z")

View File

@ -3,23 +3,15 @@
include _includes/_mixins
- title = IS_MODELS ? LANGUAGES[current.source] || title : title
- social_title = (SECTION == "index") ? SITENAME + " - " + SLOGAN : title + " - " + SITENAME
- social_img = SITE_URL + "/assets/img/social/preview_" + (preview || ALPHA ? "alpha" : "default") + ".jpg"
- PAGE_URL = getPageUrl()
- PAGE_TITLE = getPageTitle()
- PAGE_IMAGE = getPageImage()
doctype html
html(lang="en")
head
title
if SECTION == "api" || SECTION == "usage" || SECTION == "models"
- var title_section = (SECTION == "api") ? "API" : SECTION.charAt(0).toUpperCase() + SECTION.slice(1)
| #{title} | #{SITENAME} #{title_section} Documentation
else if SECTION != "index"
| #{title} | #{SITENAME}
else
| #{SITENAME} - #{SLOGAN}
title=PAGE_TITLE
meta(charset="utf-8")
meta(name="viewport" content="width=device-width, initial-scale=1.0")
meta(name="referrer" content="always")
@ -27,23 +19,24 @@ html(lang="en")
meta(property="og:type" content="website")
meta(property="og:site_name" content=sitename)
meta(property="og:url" content="#{SITE_URL}/#{current.path.join('/')}")
meta(property="og:title" content=social_title)
meta(property="og:url" content=PAGE_URL)
meta(property="og:title" content=PAGE_TITLE)
meta(property="og:description" content=description)
meta(property="og:image" content=social_img)
meta(property="og:image" content=PAGE_IMAGE)
meta(name="twitter:card" content="summary_large_image")
meta(name="twitter:site" content="@" + SOCIAL.twitter)
meta(name="twitter:title" content=social_title)
meta(name="twitter:title" content=PAGE_TITLE)
meta(name="twitter:description" content=description)
meta(name="twitter:image" content=social_img)
meta(name="twitter:image" content=PAGE_IMAGE)
link(rel="shortcut icon" href="/assets/img/favicon.ico")
link(rel="icon" type="image/x-icon" href="/assets/img/favicon.ico")
if SECTION == "api"
link(href="/assets/css/style_green.css?v#{V_CSS}" rel="stylesheet")
else if SECTION == "universe"
link(href="/assets/css/style_purple.css?v#{V_CSS}" rel="stylesheet")
else
link(href="/assets/css/style.css?v#{V_CSS}" rel="stylesheet")
@ -54,6 +47,9 @@ html(lang="en")
if !landing
include _includes/_page-docs
else if SECTION == "universe"
!=yield
else
main!=yield
include _includes/_footer

View File

@ -29,7 +29,7 @@ p
+ud-row("NUM", "numeral", "1, 2017, one, seventy-seven, IV, MMXIV")
+ud-row("PART", "particle", "'s, not, ")
+ud-row("PRON", "pronoun", "I, you, he, she, myself, themselves, somebody")
+ud-row("PROPN", "proper noun", "Mary, John, Londin, NATO, HBO")
+ud-row("PROPN", "proper noun", "Mary, John, London, NATO, HBO")
+ud-row("PUNCT", "punctuation", "., (, ), ?")
+ud-row("SCONJ", "subordinating conjunction", "if, while, that")
+ud-row("SYM", "symbol", "$, %, §, ©, +, , ×, ÷, =, :), 😝")

View File

@ -1,5 +1,13 @@
//- 💫 DOCS > API > ARCHITECTURE > NN MODEL ARCHITECTURE
p
| spaCy's statistical models have been custom-designed to give a
| high-performance mix of speed and accuracy. The current architecture
| hasn't been published yet, but in the meantime we prepared a video that
| explains how the models work, with particular focus on NER.
+youtube("sqDHBH9IjRU")
p
| The parsing model is a blend of recent results. The two recent
| inspirations have been the work of Eli Klipperwasser and Yoav Goldberg at
@ -44,7 +52,7 @@ p
+cell First two words of the buffer.
+row
+cell.u-nowrap
+cell
| #[code S0L1], #[code S1L1], #[code S2L1], #[code B0L1],
| #[code B1L1]#[br]
| #[code S0L2], #[code S1L2], #[code S2L2], #[code B0L2],
@ -54,7 +62,7 @@ p
| #[code S2], #[code B0] and #[code B1].
+row
+cell.u-nowrap
+cell
| #[code S0R1], #[code S1R1], #[code S2R1], #[code B0R1],
| #[code B1R1]#[br]
| #[code S0R2], #[code S1R2], #[code S2R2], #[code B0R2],

View File

@ -6,8 +6,7 @@ p
| but somewhat ugly in Python. Logic that deals with Python or platform
| compatibility only lives in #[code spacy.compat]. To distinguish them from
| the builtin functions, replacement functions are suffixed with an
| undersocre, e.e #[code unicode_]. For specific checks, spaCy uses the
| #[code six] and #[code ftfy] packages.
| undersocre, e.e #[code unicode_].
+aside-code("Example").
from spacy.compat import unicode_, json_dumps

View File

@ -533,8 +533,10 @@ p
+cell option
+cell
| Optional location of vectors file. Should be a tab-separated
| file where the first column contains the word and the remaining
| columns the values.
| file in Word2Vec format where the first column contains the word
| and the remaining columns the values. File can be provided in
| #[code .txt] format or as a zipped text file in #[code .zip] or
| #[code .tar.gz] format.
+row
+cell #[code --prune-vectors], #[code -V]

View File

@ -31,6 +31,7 @@
$grid-gutter: 2rem
margin-top: $grid-gutter
min-width: 0 // hack to prevent overflow
@include breakpoint(min, lg)
display: flex

View File

@ -60,6 +60,13 @@
padding-bottom: 4rem
border-bottom: 1px dotted $color-subtle
&.o-section--small
overflow: auto
&:not(:last-child)
margin-bottom: 3.5rem
padding-bottom: 2rem
.o-block
margin-bottom: 4rem
@ -142,6 +149,14 @@
.o-badge
border-radius: 1em
.o-thumb
@include size(100px)
overflow: hidden
border-radius: 50%
&.o-thumb--small
@include size(35px)
//- SVG

View File

@ -103,6 +103,9 @@
&:hover
color: $color-theme-dark
.u-hand
cursor: pointer
.u-hide-link.u-hide-link
border: none
color: inherit
@ -224,6 +227,7 @@
$spinner-size: 75px
$spinner-bar: 8px
min-height: $spinner-size * 2
position: relative
& > *
@ -245,10 +249,19 @@
//- Hidden elements
.u-hidden
display: none
.u-hidden,
[v-cloak]
display: none !important
@each $breakpoint in (xs, sm, md)
.u-hidden-#{$breakpoint}.u-hidden-#{$breakpoint}
@include breakpoint(max, $breakpoint)
display: none
//- Transitions
.u-fade-enter-active
transition: opacity 0.5s
.u-fade-enter
opacity: 0

View File

@ -2,7 +2,8 @@
//- Code block
.c-code-block
.c-code-block,
.thebelab-cell
background: $color-front
color: darken($color-back, 20)
padding: 0.75em 0
@ -13,11 +14,11 @@
white-space: pre
direction: ltr
&.c-code-block--has-icon
padding: 0
display: flex
border-top-left-radius: 0
border-bottom-left-radius: 0
.c-code-block--has-icon
padding: 0
display: flex
border-top-left-radius: 0
border-bottom-left-radius: 0
.c-code-block__icon
padding: 0 0 0 1rem
@ -28,26 +29,66 @@
&.c-code-block__icon--border
border-left: 6px solid
//- Code block content
.c-code-block__content
.c-code-block__content,
.thebelab-input,
.jp-OutputArea
display: block
font: normal normal 1.1rem/#{1.9} $font-code
padding: 1em 2em
&[data-prompt]:before,
content: attr(data-prompt)
margin-right: 0.65em
display: inline-block
vertical-align: middle
opacity: 0.5
.c-code-block__content[data-prompt]:before,
content: attr(data-prompt)
margin-right: 0.65em
display: inline-block
vertical-align: middle
opacity: 0.5
//- Thebelab
[data-executable]
margin-bottom: 0
.thebelab-input.thebelab-input
padding: 3em 2em 1em
.jp-OutputArea
&:not(:empty)
padding: 2rem 2rem 1rem
border-top: 1px solid $color-dark
margin-top: 2rem
.entities, svg
white-space: initial
font-family: inherit
.entities
font-size: 1.35rem
.jp-OutputArea pre
font: inherit
.jp-OutputPrompt.jp-OutputArea-prompt
padding-top: 0.5em
margin-right: 1rem
font-family: inherit
font-weight: bold
.thebelab-run-button
@extend .u-text-label, .u-text-label--dark
.thebelab-wrapper
position: relative
.thebelab-wrapper__text
@include position(absolute, top, right, 1.25rem, 1.25rem)
color: $color-subtle-dark
z-index: 10
//- Code
code
code, .CodeMirror, .jp-RenderedText, .jp-OutputArea
-webkit-font-smoothing: subpixel-antialiased
-moz-osx-font-smoothing: auto
@ -73,7 +114,7 @@ code
text-shadow: none
//- Syntax Highlighting
//- Syntax Highlighting (Prism)
[class*="language-"] .token
&.comment, &.prolog, &.doctype, &.cdata, &.punctuation
@ -103,3 +144,50 @@ code
&.italic
font-style: italic
//- Syntax Highlighting (CodeMirror)
.CodeMirror.cm-s-default
background: $color-front
color: darken($color-back, 20)
.CodeMirror-selected
background: $color-theme
color: $color-back
.CodeMirror-cursor
border-left-color: currentColor
.cm-variable-2
color: inherit
font-style: italic
.cm-comment
color: map-get($syntax-highlighting, comment)
.cm-keyword, .cm-builtin
color: map-get($syntax-highlighting, keyword)
.cm-operator
color: map-get($syntax-highlighting, operator)
.cm-string
color: map-get($syntax-highlighting, selector)
.cm-number
color: map-get($syntax-highlighting, number)
.cm-def
color: map-get($syntax-highlighting, function)
//- Syntax highlighting (Jupyter)
.jp-RenderedText pre
.ansi-cyan-fg
color: map-get($syntax-highlighting, function)
.ansi-green-fg
color: $color-green
.ansi-red-fg
color: map-get($syntax-highlighting, operator)

View File

@ -8,10 +8,20 @@
width: 100%
position: relative
&.x-terminal--small
background: $color-dark
color: $color-subtle
border-radius: 4px
margin-bottom: 4rem
.x-terminal__icons
display: none
position: absolute
padding: 10px
@include breakpoint(min, sm)
display: block
&:before,
&:after,
span
@ -32,6 +42,12 @@
content: ""
background: $color-yellow
&.x-terminal__icons--small
&:before,
&:after,
span
@include size(10px)
.x-terminal__code
margin: 0
border: none

View File

@ -9,7 +9,7 @@
display: flex
justify-content: space-between
flex-flow: row nowrap
padding: 0 2rem 0 1rem
padding: 0 0 0 1rem
z-index: 30
width: 100%
box-shadow: $box-shadow
@ -21,11 +21,20 @@
.c-nav__menu
@include size(100%)
display: flex
justify-content: flex-end
flex-flow: row nowrap
border-color: inherit
flex: 1
@include breakpoint(max, sm)
@include scroll-shadow-base($color-front)
overflow-x: auto
overflow-y: hidden
-webkit-overflow-scrolling: touch
@include breakpoint(min, md)
justify-content: flex-end
.c-nav__menu__item
display: flex
align-items: center
@ -39,6 +48,14 @@
&:not(:first-child)
margin-left: 2em
&:last-child
@include scroll-shadow-cover(right, $color-back)
padding-right: 2rem
&:first-child
@include scroll-shadow-cover(left, $color-back)
padding-left: 2rem
&.is-active
color: $color-dark
pointer-events: none

View File

@ -26,7 +26,7 @@ $font-code: Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace
// Colors
$colors: ( blue: #09a3d5, green: #05b083 )
$colors: ( blue: #09a3d5, green: #05b083, purple: #6542d1 )
$color-back: #fff !default
$color-front: #1a1e23 !default

View File

@ -0,0 +1,4 @@
//- 💫 STYLESHEET (PURPLE)
$theme: purple
@import style

Some files were not shown because too many files have changed in this diff Show More