Merge changes to parser and _ml

Explosion Bot 2017-10-28 11:52:10 +02:00
commit b22e42af7f
111 changed files with 3075 additions and 3657 deletions


@@ -87,8 +87,8 @@ U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect my
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
@@ -98,9 +98,9 @@ mark both statements:
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Shuvanon Razik |
| Name | |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 3/12/2017 |
| GitHub username | shuvanon |
| Date | |
| GitHub username | |
| Website (optional) | |


@@ -1,20 +1,19 @@
<!--- Provide a general summary of your changes in the Title -->
<!--- Provide a general summary of your changes in the title. -->
## Description
<!--- Use this section to describe your changes and how they're affecting the code. -->
<!-- If your changes required testing, include information about the testing environment and the tests you ran. -->
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine, just
include a note to let us know. -->
### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->
## Types of changes
<!--- What types of changes does your code introduce? Put an `x` in all applicable boxes: -->
- [ ] **Bug fix** (non-breaking change fixing an issue)
- [ ] **New feature** (non-breaking change adding functionality to spaCy)
- [ ] **Breaking change** (fix or feature causing change to spaCy's existing functionality)
- [ ] **Documentation** (addition to documentation of spaCy)
## Checklist:
<!--- Go over all the following points, and put an `x` in all applicable boxes: -->
- [ ] My change requires a change to spaCy's documentation.
- [ ] I have updated the documentation accordingly.
- [ ] I have added tests to cover my changes.
- [ ] All new and existing tests passed.
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.

.github/contributors/demfier.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Gaurav Sahu |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2017-10-18 |
| GitHub username | demfier |
| Website (optional) | |

.github/contributors/honnibal.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Matthew Honnibal |
| Company name (if applicable) | Explosion AI |
| Title or role (if applicable) | Founder |
| Date | 2017-10-18 |
| GitHub username | honnibal |
| Website (optional) | https://explosion.ai |

.github/contributors/ines.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Ines Montani |
| Company name (if applicable) | Explosion AI |
| Title or role (if applicable) | Founder |
| Date | 2017/10/18 |
| GitHub username | ines |
| Website (optional) | https://explosion.ai |

.github/contributors/jerbob92.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jeroen Bobbeldijk |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 22-10-2017 |
| GitHub username | jerbob92 |
| Website (optional) | |

.github/contributors/johnhaley81.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | John Haley |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 19/10/2017 |
| GitHub username | johnhaley81 |
| Website (optional) | |

.github/contributors/mdcclv.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------------------- |
| Name | Orion Montoya |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 04-10-2017 |
| GitHub username | mdcclv |
| Website (optional) | http://www.mdcclv.com/ |

.github/contributors/polm.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Paul McCann |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2017-10-14 |
| GitHub username | polm |
| Website (optional) | http://dampfkraft.com |

.github/contributors/shuvanon.md (new file, 108 lines)

@@ -0,0 +1,108 @@
<!-- This agreement was mistakenly submitted as an update to the CONTRIBUTOR_AGREEMENT.md template. Commit: 8a2d22222dec5cf910df5a378cbcd9ea2ab53ec4. It was therefore moved over manually. -->
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Shuvanon Razik |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 3/12/2017 |
| GitHub username | shuvanon |
| Website (optional) | |

.github/contributors/yuukos.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Alexey Kim |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 13-12-2017 |
| GitHub username | yuukos |
| Website (optional) | |


@@ -3,6 +3,8 @@
This is a list of everyone who has made significant contributions to spaCy, in alphabetical order. Thanks a lot for the great work!
* Adam Bittlingmayer, [@bittlingmayer](https://github.com/bittlingmayer)
* Alexey Kim, [@yuukos](https://github.com/yuukos)
* Alexis Eidelman, [@AlexisEidelman](https://github.com/AlexisEidelman)
* Andreas Grivas, [@andreasgrv](https://github.com/andreasgrv)
* Andrew Poliakov, [@pavlin99th](https://github.com/pavlin99th)
* Aniruddha Adhikary [@aniruddha-adhikary](https://github.com/aniruddha-adhikary)
@@ -16,6 +18,7 @@ This is a list of everyone who has made significant contributions to spaCy, in a
* Daniel Vila Suero, [@dvsrepo](https://github.com/dvsrepo)
* Dmytro Sadovnychyi, [@sadovnychyi](https://github.com/sadovnychyi)
* Eric Zhao, [@ericzhao28](https://github.com/ericzhao28)
* Francisco Aranda, [@frascuchon](https://github.com/frascuchon)
* Greg Baker, [@solresol](https://github.com/solresol)
* Grégory Howard, [@Gregory-Howard](https://github.com/Gregory-Howard)
* György Orosz, [@oroszgy](https://github.com/oroszgy)
@@ -24,6 +27,9 @@ This is a list of everyone who has made significant contributions to spaCy, in a
* Ines Montani, [@ines](https://github.com/ines)
* J Nicolas Schrading, [@NSchrading](https://github.com/NSchrading)
* Janneke van der Zwaan, [@jvdzwaan](https://github.com/jvdzwaan)
* Jim Geovedi, [@geovedi](https://github.com/geovedi)
* Jim Regan, [@jimregan](https://github.com/jimregan)
* Jeffrey Gerard, [@IamJeffG](https://github.com/IamJeffG)
* Jordan Suchow, [@suchow](https://github.com/suchow)
* Josh Reeter, [@jreeter](https://github.com/jreeter)
* Juan Miguel Cejuela, [@juanmirocks](https://github.com/juanmirocks)
@@ -38,6 +44,8 @@ This is a list of everyone who has made significant contributions to spaCy, in a
* Michael Wallin, [@wallinm1](https://github.com/wallinm1)
* Miguel Almeida, [@mamoit](https://github.com/mamoit)
* Oleg Zd, [@olegzd](https://github.com/olegzd)
* Orion Montoya, [@mdcclv](https://github.com/mdcclv)
* Paul O'Leary McCann, [@polm](https://github.com/polm)
* Pokey Rule, [@pokey](https://github.com/pokey)
* Raphaël Bournhonesque, [@raphael0202](https://github.com/raphael0202)
* Rob van Nieuwpoort, [@RvanNieuwpoort](https://github.com/RvanNieuwpoort)
@@ -45,12 +53,18 @@ This is a list of everyone who has made significant contributions to spaCy, in a
* Sam Bozek, [@sambozek](https://github.com/sambozek)
* Sasho Savkov, [@savkov](https://github.com/savkov)
* Shuvanon Razik, [@shuvanon](https://github.com/shuvanon)
* Swier, [@swierh](https://github.com/swierh)
* Thomas Tanon, [@Tpt](https://github.com/Tpt)
* Tiago Rodrigues, [@TiagoMRodrigues](https://github.com/TiagoMRodrigues)
* Vimos Tan, [@Vimos](https://github.com/Vimos)
* Vsevolod Solovyov, [@vsolovyov](https://github.com/vsolovyov)
* Wah Loon Keng, [@kengz](https://github.com/kengz)
* Wannaphong Phatthiyaphaibun, [@wannaphongcom](https://github.com/wannaphongcom)
* Willem van Hage, [@wrvhage](https://github.com/wrvhage)
* Wolfgang Seeker, [@wbwseeker](https://github.com/wbwseeker)
* Yam, [@hscspring](https://github.com/hscspring)
* Yanhao Yang, [@YanhaoYang](https://github.com/YanhaoYang)
* Yasuaki Uechi, [@uetchy](https://github.com/uetchy)
* Yu-chun Huang, [@galaxyh](https://github.com/galaxyh)
* Yubing Dong, [@tomtung](https://github.com/tomtung)
* Yuval Pinter, [@yuvalpinter](https://github.com/yuvalpinter)


@@ -1,15 +1,16 @@
spaCy: Industrial-strength NLP
******************************
spaCy is a library for advanced natural language processing in Python and
Cython. spaCy is built on the very latest research, but it isn't researchware.
It was designed from day one to be used in real products. spaCy currently supports
English, German, French and Spanish, as well as tokenization for Italian,
Portuguese, Dutch, Swedish, Finnish, Norwegian, Danish, Hungarian, Polish,
Bengali, Hebrew, Chinese and Japanese. It's commercial open-source software,
released under the MIT license.
spaCy is a library for advanced Natural Language Processing in Python and Cython.
It's built on the very latest research, and was designed from day one to be
used in real products. spaCy comes with
`pre-trained statistical models <https://alpha.spacy.io/models>`_ and word
vectors, and currently supports tokenization for **20+ languages**. It features
the **fastest syntactic parser** in the world, convolutional **neural network models**
for tagging, parsing and **named entity recognition** and easy **deep learning**
integration. It's commercial open-source software, released under the MIT license.
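As a quick illustration of the pipeline described above, a minimal usage sketch might look like the following; the model name ``en_core_web_sm`` and the example sentence are assumptions for illustration, not part of this commit.

.. code:: python

    # Minimal sketch: load a pre-trained pipeline and inspect its annotations.
    # Assumes a model package such as en_core_web_sm has been installed, e.g.
    # via `python -m spacy download en_core_web_sm`.
    import spacy

    nlp = spacy.load('en_core_web_sm')   # tokenizer, tagger, parser, NER
    doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

    for token in doc:
        # non-destructive tokenization with part-of-speech and dependency labels
        print(token.text, token.pos_, token.dep_, token.head.text)

    for ent in doc.ents:
        # named entities predicted by the statistical model
        print(ent.text, ent.label_)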
💫 **Version 1.8 out now!** `Read the release notes here. <https://github.com/explosion/spaCy/releases/>`_
💫 **Version 2.0 out now!** `Check out the new features here. <https://alpha.spacy.io/usage/v2>`_
.. image:: https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square
:target: https://travis-ci.org/explosion/spaCy
@@ -38,68 +39,72 @@ released under the MIT license.
📖 Documentation
================
=================== ===
`Usage Workflows`_ How to use spaCy and its features.
`API Reference`_ The detailed reference for spaCy's API.
`Troubleshooting`_ Common problems and solutions for beginners.
`Tutorials`_ End-to-end examples, with code you can modify and run.
`Showcase & Demos`_ Demos, libraries and products from the spaCy community.
`Contribute`_ How to contribute to the spaCy project and code base.
=================== ===
=================== ===
`spaCy 101`_ New to spaCy? Here's everything you need to know!
`Usage Guides`_ How to use spaCy and its features.
`New in v2.0`_ New features, backwards incompatibilities and migration guide.
`API Reference`_ The detailed reference for spaCy's API.
`Models`_ Download statistical language models for spaCy.
`Resources`_ Libraries, extensions, demos, books and courses.
`Changelog`_ Changes and version history.
`Contribute`_ How to contribute to the spaCy project and code base.
=================== ===
.. _Usage Workflows: https://spacy.io/docs/usage/
.. _API Reference: https://spacy.io/docs/api/
.. _Troubleshooting: https://spacy.io/docs/usage/troubleshooting
.. _Tutorials: https://spacy.io/docs/usage/tutorials
.. _Showcase & Demos: https://spacy.io/docs/usage/showcase
.. _spaCy 101: https://alpha.spacy.io/usage/spacy-101
.. _New in v2.0: https://alpha.spacy.io/usage/v2#migrating
.. _Usage Guides: https://alpha.spacy.io/usage/
.. _API Reference: https://alpha.spacy.io/api/
.. _Models: https://alpha.spacy.io/models
.. _Resources: https://alpha.spacy.io/usage/resources
.. _Changelog: https://alpha.spacy.io/usage/#changelog
.. _Contribute: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md
💬 Where to ask questions
==========================
The spaCy project is maintained by `@honnibal <https://github.com/honnibal>`_
and `@ines <https://github.com/ines>`_. Please understand that we won't be able
to provide individual support via email. We also believe that help is much more
valuable if it's shared publicly, so that more people can benefit from it.
====================== ===
**Bug reports** `GitHub issue tracker`_
**Usage questions** `StackOverflow`_, `Gitter chat`_, `Reddit user group`_
**General discussion** `Gitter chat`_, `Reddit user group`_
**Commercial support** contact@explosion.ai
**Bug Reports** `GitHub Issue Tracker`_
**Usage Questions** `StackOverflow`_, `Gitter Chat`_, `Reddit User Group`_
**General Discussion** `Gitter Chat`_, `Reddit User Group`_
====================== ===
.. _GitHub issue tracker: https://github.com/explosion/spaCy/issues
.. _GitHub Issue Tracker: https://github.com/explosion/spaCy/issues
.. _StackOverflow: http://stackoverflow.com/questions/tagged/spacy
.. _Gitter chat: https://gitter.im/explosion/spaCy
.. _Reddit user group: https://www.reddit.com/r/spacynlp
.. _Gitter Chat: https://gitter.im/explosion/spaCy
.. _Reddit User Group: https://www.reddit.com/r/spacynlp
Features
========
* Non-destructive **tokenization**
* Syntax-driven sentence segmentation
* Pre-trained **word vectors**
* Part-of-speech tagging
* **Fastest syntactic parser** in the world
* **Named entity** recognition
* Labelled dependency parsing
* Convenient string-to-int mapping
* Export to numpy data arrays
* GIL-free **multi-threading**
* Efficient binary serialization
* Non-destructive **tokenization**
* Support for **20+ languages**
* Pre-trained `statistical models <https://alpha.spacy.io/models>`_ and word vectors
* Easy **deep learning** integration
* Statistical models for **English**, **German**, **French** and **Spanish**
* Part-of-speech tagging
* Labelled dependency parsing
* Syntax-driven sentence segmentation
* Built in **visualizers** for syntax and NER
* Convenient string-to-hash mapping
* Export to numpy data arrays
* Efficient binary serialization
* Easy **model packaging** and deployment
* State-of-the-art speed
* Robust, rigorously evaluated accuracy
See `facts, figures and benchmarks <https://spacy.io/docs/api/>`_.
📖 **For more details, see the** `facts, figures and benchmarks <https://alpha.spacy.io/usage/facts-figures>`_.
Top Performance
---------------
Install spaCy
=============
* Fastest in the world: <50ms per document. No faster system has ever been
announced.
* Accuracy within 1% of the current state of the art on all tasks performed
(parsing, named entity recognition, part-of-speech tagging). The only more
accurate systems are an order of magnitude slower or more.
Supports
--------
For detailed installation instructions, see
the `documentation <https://alpha.spacy.io/usage>`_.
==================== ===
**Operating system** macOS / OS X, Linux, Windows (Cygwin, MinGW, Visual Studio)
@@ -110,12 +115,6 @@ Supports
.. _pip: https://pypi.python.org/pypi/spacy
.. _conda: https://anaconda.org/conda-forge/spacy
Install spaCy
=============
Installation requires a working build environment. See notes on Ubuntu,
macOS/OS X and Windows for details.
pip
---
@@ -123,7 +122,7 @@ Using pip, spaCy releases are currently only available as source packages.
.. code:: bash
pip install -U spacy
pip install spacy
When using pip it is generally recommended to install packages in a ``virtualenv``
to avoid modifying system state:
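A minimal sketch of such a setup, assuming ``virtualenv`` is installed:

.. code:: bash

    virtualenv .env              # create the environment
    source .env/bin/activate     # activate it (Linux/macOS)
    pip install spacy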
@@ -149,25 +148,41 @@ For the feedstock including the build recipe and configuration,
check out `this repository <https://github.com/conda-forge/spacy-feedstock>`_.
Improvements and pull requests to the recipe and setup are always appreciated.
Updating spaCy
--------------
Some updates to spaCy may require downloading new statistical models. If you're
running spaCy v2.0 or higher, you can use the ``validate`` command to check if
your installed models are compatible and if not, print details on how to update
them:
.. code:: bash
pip install -U spacy
spacy validate
If you've trained your own models, keep in mind that your training and runtime
inputs must match. After updating spaCy, we recommend **retraining your models**
with the new version.
📖 **For details on upgrading from spaCy 1.x to spaCy 2.x, see the**
`migration guide <https://alpha.spacy.io/usage/v2#migrating>`_.
Download models
===============
As of v1.7.0, models for spaCy can be installed as **Python packages**.
This means that they're a component of your application, just like any
other module. They're versioned and can be defined as a dependency in your
``requirements.txt``. Models can be installed from a download URL or
a local directory, manually or via pip. Their data can be located anywhere on
your file system. To make a model available to spaCy, all you need to do is
create a "shortcut link", an internal alias that tells spaCy where to find the
data files for a specific model name.
other module. Models can be installed using spaCy's ``download`` command,
or manually by pointing pip to a path or URL.
======================= ===
`spaCy Models`_ Available models, latest releases and direct download.
`Available Models`_ Detailed model descriptions, accuracy figures and benchmarks.
`Models Documentation`_ Detailed usage instructions.
======================= ===
.. _spaCy Models: https://github.com/explosion/spacy-models/releases/
.. _Models Documentation: https://spacy.io/docs/usage/models
.. _Available Models: https://alpha.spacy.io/models
.. _Models Documentation: https://alpha.spacy.io/docs/usage/models
.. code:: bash
@@ -175,17 +190,10 @@ data files for a specific model name.
python -m spacy download en
# download best-matching version of specific model for your spaCy installation
python -m spacy download en_core_web_md
python -m spacy download en_core_web_lg
# pip install .tar.gz archive from path or URL
pip install /Users/you/en_core_web_md-1.2.0.tar.gz
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_md-1.2.0/en_core_web_md-1.2.0.tar.gz
# set up shortcut link to load installed package as "en_default"
python -m spacy link en_core_web_md en_default
# set up shortcut link to load local model as "my_amazing_model"
python -m spacy link /Users/you/data my_amazing_model
pip install /Users/you/en_core_web_sm-2.0.0.tar.gz
Loading and using models
------------------------
@@ -199,24 +207,24 @@ To load a model, use ``spacy.load()`` with the model's shortcut link:
doc = nlp(u'This is a sentence.')
If you've installed a model via pip, you can also ``import`` it directly and
then call its ``load()`` method with no arguments. This should also work for
older models in previous versions of spaCy.
then call its ``load()`` method:
.. code:: python
import spacy
import en_core_web_md
import en_core_web_sm
nlp = en_core_web_md.load()
nlp = en_core_web_sm.load()
doc = nlp(u'This is a sentence.')
📖 **For more info and examples, check out the** `models documentation <https://spacy.io/docs/usage/models>`_.
📖 **For more info and examples, check out the**
`models documentation <https://alpha.spacy.io/docs/usage/models>`_.
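As a quick reference, here's a minimal end-to-end sketch, assuming the small
English model ``en_core_web_sm`` has been downloaded as shown above:

.. code:: python

    import spacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(u'Apple is looking at buying a U.K. startup for $1 billion.')
    for token in doc:
        print(token.text, token.pos_, token.dep_)  # tokens, tags, dependencies
    for ent in doc.ents:
        print(ent.text, ent.label_)                # named entities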
Support for older versions
--------------------------
If you're using an older version (v1.6.0 or below), you can still download and
install the old models from within spaCy using ``python -m spacy.en.download all``
If you're using an older version (``v1.6.0`` or below), you can still download
and install the old models from within spaCy using ``python -m spacy.en.download all``
or ``python -m spacy.de.download all``. The ``.tar.gz`` archives are also
`attached to the v1.6.0 release <https://github.com/explosion/spaCy/tree/v1.6.0>`_.
To download and install the models manually, unpack the archive, drop the
@@ -248,11 +256,13 @@ details.
pip install -r requirements.txt
pip install -e .
Compared to regular install via pip `requirements.txt <requirements.txt>`_
Compared to regular install via pip, `requirements.txt <requirements.txt>`_
additionally installs developer dependencies such as Cython.
Instead of the above verbose commands, you can also use the following
`Fabric <http://www.fabfile.org/>`_ commands:
`Fabric <http://www.fabfile.org/>`_ commands. All commands assume that your
``virtualenv`` is located in a directory ``.env``. If you're using a different
directory, you can change it via the environment variable ``VENV_DIR``, for
example ``VENV_DIR=".custom-env" fab clean make``.
============= ===
``fab env`` Create ``virtualenv`` and delete previous one, if it exists.
@@ -261,14 +271,6 @@ Instead of the above verbose commands, you can also use the following
``fab test`` Run basic tests, aborting after first failure.
============= ===
All commands assume that your ``virtualenv`` is located in a directory ``.env``.
If you're using a different directory, you can change it via the environment
variable ``VENV_DIR``, for example:
.. code:: bash
VENV_DIR=".custom-env" fab clean make
Ubuntu
------
@@ -310,76 +312,4 @@ and ``--model`` are optional and enable additional tests:
# make sure you are using recent pytest version
python -m pip install -U pytest
python -m pytest <spacy-directory>
🛠 Changelog
============
=========== ============== ===========
Version Date Description
=========== ============== ===========
`v1.8.2`_ ``2017-04-26`` French model and small improvements
`v1.8.1`_ ``2017-04-23`` Saving, loading and training bug fixes
`v1.8.0`_ ``2017-04-16`` Better NER training, saving and loading
`v1.7.5`_ ``2017-04-07`` Bug fixes and new CLI commands
`v1.7.3`_ ``2017-03-26`` Alpha support for Hebrew, new CLI commands and bug fixes
`v1.7.2`_ ``2017-03-20`` Small fixes to beam parser and model linking
`v1.7.1`_ ``2017-03-19`` Fix data download for system installation
`v1.7.0`_ ``2017-03-18`` New 50 MB model, CLI, better downloads and lots of bug fixes
`v1.6.0`_ ``2017-01-16`` Improvements to tokenizer and tests
`v1.5.0`_ ``2016-12-27`` Alpha support for Swedish and Hungarian
`v1.4.0`_ ``2016-12-18`` Improved language data and alpha Dutch support
`v1.3.0`_ ``2016-12-03`` Improve API consistency
`v1.2.0`_ ``2016-11-04`` Alpha tokenizers for Chinese, French, Spanish, Italian and Portuguese
`v1.1.0`_ ``2016-10-23`` Bug fixes and adjustments
`v1.0.0`_ ``2016-10-18`` Support for deep learning workflows and entity-aware rule matcher
`v0.101.0`_ ``2016-05-10`` Fixed German model
`v0.100.7`_ ``2016-05-05`` German support
`v0.100.6`_ ``2016-03-08`` Add support for GloVe vectors
`v0.100.5`_ ``2016-02-07`` Fix incorrect use of header file
`v0.100.4`_ ``2016-02-07`` Fix OSX problem introduced in 0.100.3
`v0.100.3`_ ``2016-02-06`` Multi-threading, faster loading and bugfixes
`v0.100.2`_ ``2016-01-21`` Fix data version lock
`v0.100.1`_ ``2016-01-21`` Fix install for OSX
`v0.100`_ ``2016-01-19`` Revise setup.py, better model downloads, bug fixes
`v0.99`_ ``2015-11-08`` Improve span merging, internal refactoring
`v0.98`_ ``2015-11-03`` Smaller package, bug fixes
`v0.97`_ ``2015-10-23`` Load the StringStore from a json list, instead of a text file
`v0.96`_ ``2015-10-19`` Hotfix to .merge method
`v0.95`_ ``2015-10-18`` Bug fixes
`v0.94`_ ``2015-10-09`` Fix memory and parse errors
`v0.93`_ ``2015-09-22`` Bug fixes to word vectors
=========== ============== ===========
.. _v1.8.2: https://github.com/explosion/spaCy/releases/tag/v1.8.2
.. _v1.8.1: https://github.com/explosion/spaCy/releases/tag/v1.8.1
.. _v1.8.0: https://github.com/explosion/spaCy/releases/tag/v1.8.0
.. _v1.7.5: https://github.com/explosion/spaCy/releases/tag/v1.7.5
.. _v1.7.3: https://github.com/explosion/spaCy/releases/tag/v1.7.3
.. _v1.7.2: https://github.com/explosion/spaCy/releases/tag/v1.7.2
.. _v1.7.1: https://github.com/explosion/spaCy/releases/tag/v1.7.1
.. _v1.7.0: https://github.com/explosion/spaCy/releases/tag/v1.7.0
.. _v1.6.0: https://github.com/explosion/spaCy/releases/tag/v1.6.0
.. _v1.5.0: https://github.com/explosion/spaCy/releases/tag/v1.5.0
.. _v1.4.0: https://github.com/explosion/spaCy/releases/tag/v1.4.0
.. _v1.3.0: https://github.com/explosion/spaCy/releases/tag/v1.3.0
.. _v1.2.0: https://github.com/explosion/spaCy/releases/tag/v1.2.0
.. _v1.1.0: https://github.com/explosion/spaCy/releases/tag/v1.1.0
.. _v1.0.0: https://github.com/explosion/spaCy/releases/tag/v1.0.0
.. _v0.101.0: https://github.com/explosion/spaCy/releases/tag/0.101.0
.. _v0.100.7: https://github.com/explosion/spaCy/releases/tag/0.100.7
.. _v0.100.6: https://github.com/explosion/spaCy/releases/tag/0.100.6
.. _v0.100.5: https://github.com/explosion/spaCy/releases/tag/0.100.5
.. _v0.100.4: https://github.com/explosion/spaCy/releases/tag/0.100.4
.. _v0.100.3: https://github.com/explosion/spaCy/releases/tag/0.100.3
.. _v0.100.2: https://github.com/explosion/spaCy/releases/tag/0.100.2
.. _v0.100.1: https://github.com/explosion/spaCy/releases/tag/0.100.1
.. _v0.100: https://github.com/explosion/spaCy/releases/tag/0.100
.. _v0.99: https://github.com/explosion/spaCy/releases/tag/0.99
.. _v0.98: https://github.com/explosion/spaCy/releases/tag/0.98
.. _v0.97: https://github.com/explosion/spaCy/releases/tag/0.97
.. _v0.96: https://github.com/explosion/spaCy/releases/tag/0.96
.. _v0.95: https://github.com/explosion/spaCy/releases/tag/0.95
.. _v0.94: https://github.com/explosion/spaCy/releases/tag/0.94
.. _v0.93: https://github.com/explosion/spaCy/releases/tag/0.93
View File
@@ -2,20 +2,18 @@
# spaCy examples
The examples are Python scripts with well-behaved command line interfaces. For a full list of spaCy tutorials and code snippets, see the [documentation](https://spacy.io/docs/usage/tutorials).
The examples are Python scripts with well-behaved command line interfaces. For
more detailed usage guides, see the [documentation](https://alpha.spacy.io/usage/).
## How to run an example
For example, to run the [`nn_text_class.py`](nn_text_class.py) script, do:
To see the available arguments, you can use the `--help` or `-h` flag:
```bash
$ python examples/nn_text_class.py
usage: nn_text_class.py [-h] [-d 3] [-H 300] [-i 5] [-w 40000] [-b 24]
[-r 0.3] [-p 1e-05] [-e 0.005]
data_dir
nn_text_class.py: error: too few arguments
$ python examples/training/train_ner.py --help
```
You can print detailed help with the `-h` argument.
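For instance, a full training run with the script's optional arguments might
look something like this (the output path is just an example):

```bash
python examples/training/train_ner.py -o /tmp/ner_model -n 20
```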
While we try to keep the examples up to date, they are not currently exercised by the test suite, as some of them require significant data downloads or take time to train. If you find that an example is no longer running, [please tell us](https://github.com/explosion/spaCy/issues)! We know there's nothing worse than trying to figure out what you're doing wrong, and it turns out your code was never the problem.
While we try to keep the examples up to date, they are not currently exercised
by the test suite, as some of them require significant data downloads or take
time to train. If you find that an example is no longer running,
[please tell us](https://github.com/explosion/spaCy/issues)! We know there's
nothing worse than trying to figure out what you're doing wrong, and it turns
out your code was never the problem.
View File
@@ -1,37 +0,0 @@
# encoding: utf8
from __future__ import unicode_literals, print_function
from math import sqrt
from numpy import dot
from numpy.linalg import norm
def handle_tweet(spacy, tweet_data, query):
text = tweet_data.get('text', u'')
# Twython returns either bytes or unicode, depending on tweet.
# ಠ_ಠ #APIshaming
try:
match_tweet(spacy, text, query)
except TypeError:
match_tweet(spacy, text.decode('utf8'), query)
def match_tweet(spacy, text, query):
def get_vector(word):
return spacy.vocab[word].repvec
tweet = spacy(text)
tweet = [w.repvec for w in tweet if w.is_alpha and w.lower_ != query]
if tweet:
accept = map(get_vector, 'child classroom teach'.split())
reject = map(get_vector, 'mouth hands giveaway'.split())
y = sum(max(cos(w1, w2), 0) for w1 in tweet for w2 in accept)
n = sum(max(cos(w1, w2), 0) for w1 in tweet for w2 in reject)
if (y / (y + n)) >= 0.5 or True:
print(text)
def cos(v1, v2):
return dot(v1, v2) / (norm(v1) * norm(v2))
View File
@@ -1,59 +0,0 @@
"""Issue #252
Question:
In the documents and tutorials the main thing I haven't found is examples on how to break sentences down into small sub thoughts/chunks. The noun_chunks is handy, but having examples on using the token.head to find small (near-complete) sentence chunks would be neat.
Lets take the example sentence on https://displacy.spacy.io/displacy/index.html
displaCy uses CSS and JavaScript to show you how computers understand language
This sentence has two main parts (XCOMP & CCOMP) according to the breakdown:
[displaCy] uses CSS and Javascript [to + show]
&
show you how computers understand [language]
I'm assuming that we can use the token.head to build these groups. In one of your examples you had the following function.
def dependency_labels_to_root(token):
'''Walk up the syntactic tree, collecting the arc labels.'''
dep_labels = []
while token.head is not token:
dep_labels.append(token.dep)
token = token.head
return dep_labels
"""
from __future__ import print_function, unicode_literals
# Answer:
# The easiest way is to find the head of the subtree you want, and then use the
# `.subtree`, `.children`, `.lefts` and `.rights` iterators. `.subtree` is the
# one that does what you're asking for most directly:
from spacy.en import English
nlp = English()
doc = nlp(u'displaCy uses CSS and JavaScript to show you how computers understand language')
for word in doc:
if word.dep_ in ('xcomp', 'ccomp'):
print(''.join(w.text_with_ws for w in word.subtree))
# It'd probably be better for `word.subtree` to return a `Span` object instead
# of a generator over the tokens. If you want the `Span` you can get it via the
# `.right_edge` and `.left_edge` properties. The `Span` object is nice because
# you can easily get a vector, merge it, etc.
doc = nlp(u'displaCy uses CSS and JavaScript to show you how computers understand language')
for word in doc:
if word.dep_ in ('xcomp', 'ccomp'):
subtree_span = doc[word.left_edge.i : word.right_edge.i + 1]
print(subtree_span.text, '|', subtree_span.root.text)
print(subtree_span.similarity(doc))
print(subtree_span.similarity(subtree_span.root))
# You might also want to select a head, and then select a start and end position by
# walking along its children. You could then take the `.left_edge` and `.right_edge`
# of those tokens, and use it to calculate a span.
View File
@@ -1,59 +0,0 @@
import plac
from spacy.en import English
from spacy.parts_of_speech import NOUN
from spacy.parts_of_speech import ADP as PREP
def _span_to_tuple(span):
start = span[0].idx
end = span[-1].idx + len(span[-1])
tag = span.root.tag_
text = span.text
label = span.label_
return (start, end, tag, text, label)
def merge_spans(spans, doc):
# This is a bit awkward atm. What we're doing here is merging the entities,
# so that each only takes up a single token. But an entity is a Span, and
# each Span is a view into the doc. When we merge a span, we invalidate
# the other spans. This will get fixed --- but for now the solution
# is to gather the information first, before merging.
tuples = [_span_to_tuple(span) for span in spans]
for span_tuple in tuples:
doc.merge(*span_tuple)
def extract_currency_relations(doc):
merge_spans(doc.ents, doc)
merge_spans(doc.noun_chunks, doc)
relations = []
for money in filter(lambda w: w.ent_type_ == 'MONEY', doc):
if money.dep_ in ('attr', 'dobj'):
subject = [w for w in money.head.lefts if w.dep_ == 'nsubj']
if subject:
subject = subject[0]
relations.append((subject, money))
elif money.dep_ == 'pobj' and money.head.dep_ == 'prep':
relations.append((money.head.head, money))
return relations
def main():
nlp = English()
texts = [
u'Net income was $9.4 million compared to the prior year of $2.7 million.',
u'Revenue exceeded twelve billion dollars, with a loss of $1b.',
]
for text in texts:
doc = nlp(text)
relations = extract_currency_relations(doc)
for r1, r2 in relations:
print(r1.text, r2.ent_type_, r2.text)
if __name__ == '__main__':
plac.call(main)
View File
@@ -0,0 +1,62 @@
#!/usr/bin/env python
# coding: utf8
"""
A simple example of extracting relations between phrases and entities using
spaCy's named entity recognizer and the dependency parse. Here, we extract
money and currency values (entities labelled as MONEY) and then check the
dependency tree to find the noun phrase they are referring to, for example:
$9.4 million --> Net income.
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
import spacy
TEXTS = [
'Net income was $9.4 million compared to the prior year of $2.7 million.',
'Revenue exceeded twelve billion dollars, with a loss of $1b.',
]
@plac.annotations(
model=("Model to load (needs parser and NER)", "positional", None, str))
def main(model='en_core_web_sm'):
nlp = spacy.load(model)
print("Loaded model '%s'" % model)
print("Processing %d texts" % len(TEXTS))
for text in TEXTS:
doc = nlp(text)
relations = extract_currency_relations(doc)
for r1, r2 in relations:
print('{:<10}\t{}\t{}'.format(r1.text, r2.ent_type_, r2.text))
def extract_currency_relations(doc):
# merge entities and noun chunks into one token
for span in [*list(doc.ents), *list(doc.noun_chunks)]:
span.merge()
relations = []
for money in filter(lambda w: w.ent_type_ == 'MONEY', doc):
if money.dep_ in ('attr', 'dobj'):
subject = [w for w in money.head.lefts if w.dep_ == 'nsubj']
if subject:
subject = subject[0]
relations.append((subject, money))
elif money.dep_ == 'pobj' and money.head.dep_ == 'prep':
relations.append((money.head.head, money))
return relations
if __name__ == '__main__':
plac.call(main)
# Expected output:
# Net income MONEY $9.4 million
# the prior year MONEY $2.7 million
# Revenue MONEY twelve billion dollars
# a loss MONEY 1b
View File
@@ -0,0 +1,65 @@
#!/usr/bin/env python
# coding: utf8
"""
This example shows how to navigate the parse tree including subtrees attached
to a word.
Based on issue #252:
"In the documents and tutorials the main thing I haven't found is
examples on how to break sentences down into small sub thoughts/chunks. The
noun_chunks is handy, but having examples on using the token.head to find small
(near-complete) sentence chunks would be neat. Lets take the example sentence:
"displaCy uses CSS and JavaScript to show you how computers understand language"
This sentence has two main parts (XCOMP & CCOMP) according to the breakdown:
[displaCy] uses CSS and Javascript [to + show]
show you how computers understand [language]
I'm assuming that we can use the token.head to build these groups."
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
import spacy
@plac.annotations(
model=("Model to load", "positional", None, str))
def main(model='en_core_web_sm'):
nlp = spacy.load(model)
print("Loaded model '%s'" % model)
doc = nlp("displaCy uses CSS and JavaScript to show you how computers "
"understand language")
# The easiest way is to find the head of the subtree you want, and then use
# the `.subtree`, `.children`, `.lefts` and `.rights` iterators. `.subtree`
# is the one that does what you're asking for most directly:
for word in doc:
if word.dep_ in ('xcomp', 'ccomp'):
print(''.join(w.text_with_ws for w in word.subtree))
# It'd probably be better for `word.subtree` to return a `Span` object
# instead of a generator over the tokens. If you want the `Span` you can
# get it via the `.right_edge` and `.left_edge` properties. The `Span`
# object is nice because you can easily get a vector, merge it, etc.
for word in doc:
if word.dep_ in ('xcomp', 'ccomp'):
subtree_span = doc[word.left_edge.i : word.right_edge.i + 1]
print(subtree_span.text, '|', subtree_span.root.text)
# You might also want to select a head, and then select a start and end
# position by walking along its children. You could then take the
# `.left_edge` and `.right_edge` of those tokens, and use it to calculate
# a span.
if __name__ == '__main__':
plac.call(main)
# Expected output:
# to show you how computers understand language
# how computers understand language
# to show you how computers understand language | show
# how computers understand language | understand
View File
@@ -4,22 +4,24 @@ The idea is to associate each word in the vocabulary with a tag, noting whether
they begin, end, or are inside at least one pattern. An additional tag is used
for single-word patterns. Complete patterns are also stored in a hash set.
When we process a document, we look up the words in the vocabulary, to associate
the words with the tags. We then search for tag-sequences that correspond to
valid candidates. Finally, we look up the candidates in the hash set.
When we process a document, we look up the words in the vocabulary, to
associate the words with the tags. We then search for tag-sequences that
correspond to valid candidates. Finally, we look up the candidates in the hash
set.
For instance, to search for the phrases "Barack Hussein Obama" and "Hilary Clinton", we
would associate "Barack" and "Hilary" with the B tag, Hussein with the I tag,
and Obama and Clinton with the L tag.
For instance, to search for the phrases "Barack Hussein Obama" and "Hilary
Clinton", we would associate "Barack" and "Hilary" with the B tag, Hussein with
the I tag, and Obama and Clinton with the L tag.
The document "Barack Clinton and Hilary Clinton" would have the tag sequence
[{B}, {L}, {}, {B}, {L}], so we'd get two matches. However, only the second candidate
is in the phrase dictionary, so only one is returned as a match.
[{B}, {L}, {}, {B}, {L}], so we'd get two matches. However, only the second
candidate is in the phrase dictionary, so only one is returned as a match.
The algorithm is O(n) at run-time for document of length n because we're only ever
matching over the tag patterns. So no matter how many phrases we're looking for,
our pattern set stays very small (exact size depends on the maximum length we're
looking for, as the query language currently has no quantifiers)
The algorithm is O(n) at run-time for a document of length n because we're only
ever matching over the tag patterns. So no matter how many phrases we're
looking for, our pattern set stays very small (exact size depends on the
maximum length we're looking for, as the query language currently has no
quantifiers).
The example expects a .bz2 file from the Reddit corpus, and a patterns file,
formatted in jsonl as a sequence of entries like this:
@@ -32,11 +34,9 @@ formatted in jsonl as a sequence of entries like this:
{"text":"Argentina"}
"""
from __future__ import print_function, unicode_literals, division
from bz2 import BZ2File
import time
import math
import codecs
import plac
import ujson
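# As a rough, self-contained illustration of the B/I/L/U tagging scheme
# described in the docstring above (a hypothetical helper for explanation only;
# the real matching in this example is done by spaCy's PhraseMatcher):
def toy_phrase_match(words, phrases):
    tags = {}         # word -> set of tags: B(egin), I(nside), L(ast), U(nit)
    complete = set()  # full phrases, for the final hash-set lookup
    for phrase in phrases:
        complete.add(tuple(phrase))
        if len(phrase) == 1:
            tags.setdefault(phrase[0], set()).add('U')
        else:
            tags.setdefault(phrase[0], set()).add('B')
            for word in phrase[1:-1]:
                tags.setdefault(word, set()).add('I')
            tags.setdefault(phrase[-1], set()).add('L')
    matches = []
    for i, word in enumerate(words):
        if 'U' in tags.get(word, ()) and (word,) in complete:
            matches.append((i, i + 1))
        if 'B' not in tags.get(word, ()):
            continue
        for j in range(i + 1, len(words)):
            word_tags = tags.get(words[j], ())
            if 'L' in word_tags and tuple(words[i:j + 1]) in complete:
                matches.append((i, j + 1))
            if 'I' not in word_tags:
                break
    return matches

# toy_phrase_match('Barack Clinton and Hilary Clinton'.split(),
#                  [['Barack', 'Hussein', 'Obama'], ['Hilary', 'Clinton']])
# --> [(3, 5)]: two tag-sequence candidates, but only the second is in the set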
@@ -44,6 +44,24 @@ from spacy.matcher import PhraseMatcher
import spacy
@plac.annotations(
patterns_loc=("Path to gazetteer", "positional", None, str),
text_loc=("Path to Reddit corpus file", "positional", None, str),
n=("Number of texts to read", "option", "n", int),
lang=("Language class to initialise", "option", "l", str))
def main(patterns_loc, text_loc, n=10000, lang='en'):
nlp = spacy.blank('en')
nlp.vocab.lex_attr_getters = {}
phrases = read_gazetteer(nlp.tokenizer, patterns_loc)
count = 0
t1 = time.time()
for ent_id, text in get_matches(nlp.tokenizer, phrases,
read_text(text_loc, n=n)):
count += 1
t2 = time.time()
print("%d docs in %.3f s. %d matches" % (n, (t2 - t1), count))
def read_gazetteer(tokenizer, loc, n=-1):
for i, line in enumerate(open(loc)):
data = ujson.loads(line.strip())
@@ -75,18 +93,6 @@ def get_matches(tokenizer, phrases, texts, max_length=6):
yield (ent_id, doc[start:end].text)
def main(patterns_loc, text_loc, n=10000):
nlp = spacy.blank('en')
nlp.vocab.lex_attr_getters = {}
phrases = read_gazetteer(nlp.tokenizer, patterns_loc)
count = 0
t1 = time.time()
for ent_id, text in get_matches(nlp.tokenizer, phrases, read_text(text_loc, n=n)):
count += 1
t2 = time.time()
print("%d docs in %.3f s. %d matches" % (n, (t2 - t1), count))
if __name__ == '__main__':
if False:
import cProfile
View File
@@ -1,5 +0,0 @@
An example of inventory counting using SpaCy.io NLP library. Meant to show how to instantiate Spacy's English class, and allow reusability by reloading the main module.
In the future, a better implementation of this library would be to apply machine learning to each query and learn what to classify as the quantitative statement (55 kgs OF), vs the actual item of count (how likely is a preposition object to be the item of count if x,y,z qualifications appear in the statement).
View File
@@ -1,35 +0,0 @@
class Inventory:
"""
Inventory class - a struct{} like feature to house inventory counts
across modules.
"""
originalQuery = None
item = ""
unit = ""
amount = ""
def __init__(self, statement):
"""
Constructor - only takes in the original query/statement
:return: new Inventory object
"""
self.originalQuery = statement
pass
def __str__(self):
return str(self.amount) + ' ' + str(self.unit) + ' ' + str(self.item)
def printInfo(self):
print '-------------Inventory Count------------'
print "Original Query: " + str(self.originalQuery)
print 'Amount: ' + str(self.amount)
print 'Unit: ' + str(self.unit)
print 'Item: ' + str(self.item)
print '----------------------------------------'
def isValid(self):
if not self.item or not self.unit or not self.amount or not self.originalQuery:
return False
else:
return True
View File
@@ -1,92 +0,0 @@
from inventory import Inventory
def runTest(nlp):
testset = []
testset += [nlp(u'6 lobster cakes')]
testset += [nlp(u'6 avacados')]
testset += [nlp(u'fifty five carrots')]
testset += [nlp(u'i have 55 carrots')]
testset += [nlp(u'i got me some 9 cabbages')]
testset += [nlp(u'i got 65 kgs of carrots')]
result = []
for doc in testset:
c = decodeInventoryEntry_level1(doc)
if not c.isValid():
c = decodeInventoryEntry_level2(doc)
result.append(c)
for i in result:
i.printInfo()
def decodeInventoryEntry_level1(document):
"""
Decodes a basic entry such as: '6 lobster cake' or '6' cakes
@param document : NLP Doc object
:return: Status if decoded correctly (true, false), and Inventory object
"""
count = Inventory(str(document))
for token in document:
if token.pos_ == (u'NOUN' or u'NNS' or u'NN'):
item = str(token)
for child in token.children:
if child.dep_ == u'compound' or child.dep_ == u'ad':
item = str(child) + str(item)
elif child.dep_ == u'nummod':
count.amount = str(child).strip()
for numerical_child in child.children:
# this isn't arithmetic rather than treating it such as a string
count.amount = str(numerical_child) + str(count.amount).strip()
else:
print "WARNING: unknown child: " + str(child) + ':'+str(child.dep_)
count.item = item
count.unit = item
return count
def decodeInventoryEntry_level2(document):
"""
Entry level 2, a more complicated parsing scheme that covers examples such as
'i have 80 boxes of freshly baked pies'
@document @param document : NLP Doc object
:return: Status if decoded correctly (true, false), and Inventory object-
"""
count = Inventory(str(document))
for token in document:
# Look for a preposition object that is a noun (this is the item we are counting).
# If found, look at its' dependency (if a preposition that is not indicative of
# inventory location, the dependency of the preposition must be a noun
if token.dep_ == (u'pobj' or u'meta') and token.pos_ == (u'NOUN' or u'NNS' or u'NN'):
item = ''
# Go through all the token's children, these are possible adjectives and other add-ons
# this deals with cases such as 'hollow rounded waffle pancakes"
for i in token.children:
item += ' ' + str(i)
item += ' ' + str(token)
count.item = item
# Get the head of the item:
if token.head.dep_ != u'prep':
# Break out of the loop, this is a confusing entry
break
else:
amountUnit = token.head.head
count.unit = str(amountUnit)
for inner in amountUnit.children:
if inner.pos_ == u'NUM':
count.amount += str(inner)
return count
View File
@@ -1,30 +0,0 @@
import inventoryCount as mainModule
import os
from spacy.en import English
if __name__ == '__main__':
"""
Main module for this example - loads the English main NLP class,
and keeps it in RAM while waiting for the user to re-run it. Allows the
developer to re-edit their module under testing without having
to wait as long to load the English class
"""
# Set the NLP object here for the parameters you want to see,
# or just leave it blank and get all the opts
print "Loading English module... this will take a while."
nlp = English()
print "Done loading English module."
while True:
try:
reload(mainModule)
mainModule.runTest(nlp)
raw_input('================ To reload main module, press Enter ================')
except Exception, e:
print "Unexpected error: " + str(e)
continue
View File
@@ -1,161 +0,0 @@
from __future__ import unicode_literals, print_function
import spacy.en
import spacy.matcher
from spacy.attrs import ORTH, TAG, LOWER, IS_ALPHA, FLAG63
import plac
def main():
nlp = spacy.en.English()
example = u"I prefer Siri to Google Now. I'll google now to find out how the google now service works."
before = nlp(example)
print("Before")
for ent in before.ents:
print(ent.text, ent.label_, [w.tag_ for w in ent])
# Output:
# Google ORG [u'NNP']
# google ORG [u'VB']
# google ORG [u'NNP']
nlp.matcher.add(
"GoogleNow", # Entity ID: Not really used at the moment.
"PRODUCT", # Entity type: should be one of the types in the NER data
{"wiki_en": "Google_Now"}, # Arbitrary attributes. Currently unused.
[ # List of patterns that can be Surface Forms of the entity
# This Surface Form matches "Google Now", verbatim
[ # Each Surface Form is a list of Token Specifiers.
{ # This Token Specifier matches tokens whose orth field is "Google"
ORTH: "Google"
},
{ # This Token Specifier matches tokens whose orth field is "Now"
ORTH: "Now"
}
],
[ # This Surface Form matches "google now", verbatim, and requires
# "google" to have the NNP tag. This helps prevent the pattern from
# matching cases like "I will google now to look up the time"
{
ORTH: "google",
TAG: "NNP"
},
{
ORTH: "now"
}
]
]
)
after = nlp(example)
print("After")
for ent in after.ents:
print(ent.text, ent.label_, [w.tag_ for w in ent])
# Output
# Google Now PRODUCT [u'NNP', u'RB']
# google ORG [u'VB']
# google now PRODUCT [u'NNP', u'RB']
#
# You can customize attribute values in the lexicon, and then refer to the
# new attributes in your Token Specifiers.
# This is particularly good for word-set membership.
#
australian_capitals = ['Brisbane', 'Sydney', 'Canberra', 'Melbourne', 'Hobart',
'Darwin', 'Adelaide', 'Perth']
# Internally, the tokenizer immediately maps each token to a pointer to a
# LexemeC struct. These structs hold various features, e.g. the integer IDs
# of the normalized string forms.
# For our purposes, the key attribute is a 64-bit integer, used as a bit field.
# spaCy currently only uses 12 of the bits for its built-in features, so
# the others are available for use. It's best to use the higher bits, as
# future versions of spaCy may add more flags. For instance, we might add
# a built-in IS_MONTH flag, taking up FLAG13. So, we bind our user-field to
# FLAG63 here.
is_australian_capital = FLAG63
# Now we need to set the flag value. It's False on all tokens by default,
# so we just need to set it to True for the tokens we want.
# Here we iterate over the strings, and set it on only the literal matches.
for string in australian_capitals:
lexeme = nlp.vocab[string]
lexeme.set_flag(is_australian_capital, True)
print('Sydney', nlp.vocab[u'Sydney'].check_flag(is_australian_capital))
print('sydney', nlp.vocab[u'sydney'].check_flag(is_australian_capital))
# If we want case-insensitive matching, we have to be a little bit more
# round-about, as there's no case-insensitive index to the vocabulary. So
# we have to iterate over the vocabulary.
# We'll be looking up attribute IDs in this set a lot, so it's good to pre-build it
target_ids = {nlp.vocab.strings[s.lower()] for s in australian_capitals}
for lexeme in nlp.vocab:
if lexeme.lower in target_ids:
lexeme.set_flag(is_australian_capital, True)
print('Sydney', nlp.vocab[u'Sydney'].check_flag(is_australian_capital))
print('sydney', nlp.vocab[u'sydney'].check_flag(is_australian_capital))
print('SYDNEY', nlp.vocab[u'SYDNEY'].check_flag(is_australian_capital))
# Output
# Sydney True
# sydney False
# Sydney True
# sydney True
# SYDNEY True
#
# The key thing to note here is that we're setting these attributes once,
# over the vocabulary --- and then reusing them at run-time. This means the
# amortized complexity of anything we do this way is going to be O(1). You
# can match over expressions that need to have sets with tens of thousands
# of values, e.g. "all the street names in Germany", and you'll still have
# O(1) complexity. Most regular expression algorithms don't scale well to
# this sort of problem.
#
# Now, let's use this in a pattern
nlp.matcher.add("AuCitySportsTeam", "ORG", {},
[
[
{LOWER: "the"},
{is_australian_capital: True},
{TAG: "NNS"}
],
[
{LOWER: "the"},
{is_australian_capital: True},
{TAG: "NNPS"}
],
[
{LOWER: "the"},
{IS_ALPHA: True}, # Allow a word in between, e.g. The Western Sydney
{is_australian_capital: True},
{TAG: "NNS"}
],
[
{LOWER: "the"},
{IS_ALPHA: True}, # Allow a word in between, e.g. The Western Sydney
{is_australian_capital: True},
{TAG: "NNPS"}
]
])
doc = nlp(u'The pattern should match the Brisbane Broncos and the South Darwin Spiders, but not the Colorado Boulders')
for ent in doc.ents:
print(ent.text, ent.label_)
# Output
# the Brisbane Broncos ORG
# the South Darwin Spiders ORG
# Output
# Before
# Google ORG [u'NNP']
# google ORG [u'VB']
# google ORG [u'NNP']
# After
# Google Now PRODUCT [u'NNP', u'RB']
# google ORG [u'VB']
# google now PRODUCT [u'NNP', u'RB']
# Sydney True
# sydney False
# Sydney True
# sydney True
# SYDNEY True
# the Brisbane Broncos ORG
# the South Darwin Spiders ORG
if __name__ == '__main__':
main()
View File
@@ -1,74 +0,0 @@
from __future__ import print_function, unicode_literals, division
import io
import bz2
import logging
from toolz import partition
from os import path
import re
import spacy.en
from spacy.tokens import Doc
from joblib import Parallel, delayed
import plac
import ujson
def parallelize(func, iterator, n_jobs, extra, backend='multiprocessing'):
extra = tuple(extra)
return Parallel(n_jobs=n_jobs, backend=backend)(delayed(func)(*(item + extra))
for item in iterator)
def iter_comments(loc):
with bz2.BZ2File(loc) as file_:
for i, line in enumerate(file_):
yield ujson.loads(line)['body']
pre_format_re = re.compile(r'^[\`\*\~]')
post_format_re = re.compile(r'[\`\*\~]$')
url_re = re.compile(r'\[([^]]+)\]\(%%URL\)')
link_re = re.compile(r'\[([^]]+)\]\(https?://[^\)]+\)')
def strip_meta(text):
text = link_re.sub(r'\1', text)
text = text.replace('&gt;', '>').replace('&lt;', '<')
text = pre_format_re.sub('', text)
text = post_format_re.sub('', text)
return text.strip()
def save_parses(batch_id, input_, out_dir, n_threads, batch_size):
out_loc = path.join(out_dir, '%d.bin' % batch_id)
if path.exists(out_loc):
return None
print('Batch', batch_id)
nlp = spacy.en.English()
nlp.matcher = None
with open(out_loc, 'wb') as file_:
texts = (strip_meta(text) for text in input_)
texts = (text for text in texts if text.strip())
for doc in nlp.pipe(texts, batch_size=batch_size, n_threads=n_threads):
file_.write(doc.to_bytes())
@plac.annotations(
in_loc=("Location of input file"),
out_dir=("Location of input file"),
n_process=("Number of processes", "option", "p", int),
n_thread=("Number of threads per process", "option", "t", int),
batch_size=("Number of texts to accumulate in a buffer", "option", "b", int)
)
def main(in_loc, out_dir, n_process=1, n_thread=4, batch_size=100):
if not path.exists(out_dir):
path.join(out_dir)
if n_process >= 2:
texts = partition(200000, iter_comments(in_loc))
parallelize(save_parses, enumerate(texts), n_process, [out_dir, n_thread, batch_size],
backend='multiprocessing')
else:
save_parses(0, iter_comments(in_loc), out_dir, n_thread, batch_size)
if __name__ == '__main__':
plac.call(main)
View File
@@ -1,35 +1,60 @@
#!/usr/bin/env python
# coding: utf-8
"""This example contains several snippets of methods that can be set via custom
Doc, Token or Span attributes in spaCy v2.0. Attribute methods act like
they're "bound" to the object and are partially applied i.e. the object
they're called on is passed in as the first argument."""
from __future__ import unicode_literals
they're called on is passed in as the first argument.
* Custom pipeline components: https://alpha.spacy.io/usage/processing-pipelines#custom-components
Developed for: spaCy 2.0.0a17
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
from spacy.lang.en import English
from spacy.tokens import Doc, Span
from spacy import displacy
from pathlib import Path
@plac.annotations(
output_dir=("Output directory for saved HTML", "positional", None, Path))
def main(output_dir=None):
nlp = English() # start off with blank English class
Doc.set_extension('overlap', method=overlap_tokens)
doc1 = nlp(u"Peach emoji is where it has always been.")
doc2 = nlp(u"Peach is the superior emoji.")
print("Text 1:", doc1.text)
print("Text 2:", doc2.text)
print("Overlapping tokens:", doc1._.overlap(doc2))
Doc.set_extension('to_html', method=to_html)
doc = nlp(u"This is a sentence about Apple.")
# add entity manually for demo purposes, to make it work without a model
doc.ents = [Span(doc, 5, 6, label=nlp.vocab.strings['ORG'])]
print("Text:", doc.text)
doc._.to_html(output=output_dir, style='ent')
def to_html(doc, output='/tmp', style='dep'):
"""Doc method extension for saving the current state as a displaCy
visualization.
"""
# generate filename from first six non-punct tokens
file_name = '-'.join([w.text for w in doc[:6] if not w.is_punct]) + '.html'
output_path = Path(output) / file_name
html = displacy.render(doc, style=style, page=True) # render markup
output_path.open('w', encoding='utf-8').write(html) # save to file
print('Saved HTML to {}'.format(output_path))
Doc.set_extension('to_html', method=to_html)
nlp = English()
doc = nlp(u"This is a sentence about Apple.")
# add entity manually for demo purposes, to make it work without a model
doc.ents = [Span(doc, 5, 6, label=nlp.vocab.strings['ORG'])]
doc._.to_html(style='ent')
if output is not None:
output_path = Path(output)
if not output_path.exists():
output_path.mkdir()
output_file = Path(output) / file_name
output_file.open('w', encoding='utf-8').write(html) # save to file
print('Saved HTML to {}'.format(output_file))
else:
print(html)
def overlap_tokens(doc, other_doc):
@@ -43,10 +68,10 @@ def overlap_tokens(doc, other_doc):
return overlap
Doc.set_extension('overlap', method=overlap_tokens)
if __name__ == '__main__':
plac.call(main)
nlp = English()
doc1 = nlp(u"Peach emoji is where it has always been.")
doc2 = nlp(u"Peach is the superior emoji.")
tokens = doc1._.overlap(doc2)
print(tokens)
# Expected output:
# Text 1: Peach emoji is where it has always been.
# Text 2: Peach is the superior emoji.
# Overlapping tokens: [Peach, emoji, is, .]
View File
@@ -1,21 +1,45 @@
# coding: utf-8
from __future__ import unicode_literals
#!/usr/bin/env python
# coding: utf8
"""Example of a spaCy v2.0 pipeline component that requests all countries via
the REST Countries API, merges country names into one token, assigns entity
labels and sets attributes on country tokens, e.g. the capital and lat/lng
coordinates. Can be extended with more details from the API.
* REST Countries API: https://restcountries.eu (Mozilla Public License MPL 2.0)
* Custom pipeline components: https://alpha.spacy.io/usage/processing-pipelines#custom-components
Developed for: spaCy 2.0.0a17
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import requests
import plac
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token
class RESTCountriesComponent(object):
"""Example of a spaCy v2.0 pipeline component that requests all countries
via the REST Countries API, merges country names into one token, assigns
entity labels and sets attributes on country tokens, e.g. the capital and
lat/lng coordinates. Can be extended with more details from the API.
def main():
# For simplicity, we start off with only the blank English Language class
# and no model or pre-defined pipeline loaded.
nlp = English()
rest_countries = RESTCountriesComponent(nlp) # initialise component
nlp.add_pipe(rest_countries) # add it to the pipeline
doc = nlp(u"Some text about Colombia and the Czech Republic")
print('Pipeline', nlp.pipe_names) # pipeline contains component name
print('Doc has countries', doc._.has_country) # Doc contains countries
for token in doc:
if token._.is_country:
print(token.text, token._.country_capital, token._.country_latlng,
token._.country_flag) # country data
print('Entities', [(e.text, e.label_) for e in doc.ents]) # entities
REST Countries API: https://restcountries.eu
API License: Mozilla Public License MPL 2.0
class RESTCountriesComponent(object):
"""spaCy v2.0 pipeline component that requests all countries via
the REST Countries API, merges country names into one token, assigns entity
labels and sets attributes on country tokens.
"""
name = 'rest_countries' # component name, will show up in the pipeline
@@ -90,19 +114,12 @@ class RESTCountriesComponent(object):
return any([t._.get('is_country') for t in tokens])
# For simplicity, we start off with only the blank English Language class and
# no model or pre-defined pipeline loaded.
if __name__ == '__main__':
plac.call(main)
nlp = English()
rest_countries = RESTCountriesComponent(nlp) # initialise component
nlp.add_pipe(rest_countries) # add it to the pipeline
doc = nlp(u"Some text about Colombia and the Czech Republic")
print('Pipeline', nlp.pipe_names) # pipeline contains component name
print('Doc has countries', doc._.has_country) # Doc contains countries
for token in doc:
if token._.is_country:
print(token.text, token._.country_capital, token._.country_latlng,
token._.country_flag) # country data
print('Entities', [(e.text, e.label_) for e in doc.ents]) # all countries are entities
# Expected output:
# Pipeline ['rest_countries']
# Doc has countries True
# Colombia Bogotá [4.0, -72.0] https://restcountries.eu/data/col.svg
# Czech Republic Prague [49.75, 15.5] https://restcountries.eu/data/cze.svg
# Entities [('Colombia', 'GPE'), ('Czech Republic', 'GPE')]
View File
@@ -1,11 +1,45 @@
# coding: utf-8
from __future__ import unicode_literals
#!/usr/bin/env python
# coding: utf8
"""Example of a spaCy v2.0 pipeline component that sets entity annotations
based on list of single or multiple-word company names. Companies are
labelled as ORG and their spans are merged into one token. Additionally,
._.has_tech_org and ._.is_tech_org is set on the Doc/Span and Token
respectively.
* Custom pipeline components: https://alpha.spacy.io/usage/processing-pipelines#custom-components
Developed for: spaCy 2.0.0a17
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token
@plac.annotations(
text=("Text to process", "positional", None, str),
companies=("Names of technology companies", "positional", None, str))
def main(text="Alphabet Inc. is the company behind Google.", *companies):
# For simplicity, we start off with only the blank English Language class
# and no model or pre-defined pipeline loaded.
nlp = English()
if not companies: # set default companies if none are set via args
companies = ['Alphabet Inc.', 'Google', 'Netflix', 'Apple'] # etc.
component = TechCompanyRecognizer(nlp, companies) # initialise component
nlp.add_pipe(component, last=True) # add last to the pipeline
doc = nlp(text)
print('Pipeline', nlp.pipe_names) # pipeline contains component name
print('Tokens', [t.text for t in doc]) # company names from the list are merged
print('Doc has_tech_org', doc._.has_tech_org) # Doc contains tech orgs
print('Token 0 is_tech_org', doc[0]._.is_tech_org) # "Alphabet Inc." is a tech org
print('Token 1 is_tech_org', doc[1]._.is_tech_org) # "is" is not
print('Entities', [(e.text, e.label_) for e in doc.ents]) # all orgs are entities
class TechCompanyRecognizer(object):
"""Example of a spaCy v2.0 pipeline component that sets entity annotations
based on list of single or multiple-word company names. Companies are
@@ -67,19 +101,13 @@ class TechCompanyRecognizer(object):
return any([t._.get('is_tech_org') for t in tokens])
# For simplicity, we start off with only the blank English Language class and
# no model or pre-defined pipeline loaded.
if __name__ == '__main__':
plac.call(main)
nlp = English()
companies = ['Alphabet Inc.', 'Google', 'Netflix', 'Apple'] # etc.
component = TechCompanyRecognizer(nlp, companies) # initialise component
nlp.add_pipe(component, last=True) # add it to the pipeline as the last element
doc = nlp(u"Alphabet Inc. is the company behind Google.")
print('Pipeline', nlp.pipe_names) # pipeline contains component name
print('Tokens', [t.text for t in doc]) # company names from the list are merged
print('Doc has_tech_org', doc._.has_tech_org) # Doc contains tech orgs
print('Token 0 is_tech_org', doc[0]._.is_tech_org) # "Alphabet Inc." is a tech org
print('Token 1 is_tech_org', doc[1]._.is_tech_org) # "is" is not
print('Entities', [(e.text, e.label_) for e in doc.ents]) # all orgs are entities
# Expected output:
# Pipeline ['tech_companies']
# Tokens ['Alphabet Inc.', 'is', 'the', 'company', 'behind', 'Google', '.']
# Doc has_tech_org True
# Token 0 is_tech_org True
# Token 1 is_tech_org False
# Entities [('Alphabet Inc.', 'ORG'), ('Google', 'ORG')]
View File
@@ -0,0 +1,73 @@
"""
Example of multi-processing with Joblib. Here, we're exporting
part-of-speech-tagged, true-cased, (very roughly) sentence-separated text, with
each "sentence" on a newline, and spaces between tokens. Data is loaded from
the IMDB movie reviews dataset and will be loaded automatically via Thinc's
built-in dataset loader.
Last updated for: spaCy 2.0.0a18
"""
from __future__ import print_function, unicode_literals
from toolz import partition_all
from pathlib import Path
from joblib import Parallel, delayed
import thinc.extra.datasets
import plac
import spacy
@plac.annotations(
output_dir=("Output directory", "positional", None, Path),
model=("Model name (needs tagger)", "positional", None, str),
n_jobs=("Number of workers", "option", "n", int),
batch_size=("Batch-size for each process", "option", "b", int),
limit=("Limit of entries from the dataset", "option", "l", int))
def main(output_dir, model='en_core_web_sm', n_jobs=4, batch_size=1000,
limit=10000):
nlp = spacy.load(model) # load spaCy model
print("Loaded model '%s'" % model)
if not output_dir.exists():
output_dir.mkdir()
# load and pre-process the IMBD dataset
print("Loading IMDB data...")
data, _ = thinc.extra.datasets.imdb()
texts, _ = zip(*data[-limit:])
partitions = partition_all(batch_size, texts)
items = ((i, [nlp(text) for text in texts], output_dir) for i, texts
in enumerate(partitions))
Parallel(n_jobs=n_jobs)(delayed(transform_texts)(*item) for item in items)
def transform_texts(batch_id, docs, output_dir):
out_path = Path(output_dir) / ('%d.txt' % batch_id)
if out_path.exists(): # return None in case same batch is called again
return None
print('Processing batch', batch_id)
with out_path.open('w', encoding='utf8') as f:
for doc in docs:
f.write(' '.join(represent_word(w) for w in doc if not w.is_space))
f.write('\n')
print('Saved {} texts to {}.txt'.format(len(docs), batch_id))
def represent_word(word):
text = word.text
# True-case, i.e. try to normalize sentence-initial capitals.
# Only do this if the lower-cased form is more probable.
if text.istitle() and is_sent_begin(word) \
and word.prob < word.doc.vocab[text.lower()].prob:
text = text.lower()
return text + '|' + word.tag_
def is_sent_begin(word):
if word.i == 0:
return True
elif word.i >= 2 and word.nbor(-1).text in ('.', '!', '?', '...'):
return True
else:
return False
if __name__ == '__main__':
plac.call(main)
View File
@@ -1,90 +0,0 @@
"""
Print part-of-speech tagged, true-cased, (very roughly) sentence-separated
text, with each "sentence" on a newline, and spaces between tokens. Supports
multi-processing.
"""
from __future__ import print_function, unicode_literals, division
import io
import bz2
import logging
from toolz import partition
from os import path
import spacy.en
from joblib import Parallel, delayed
import plac
import ujson
def parallelize(func, iterator, n_jobs, extra):
extra = tuple(extra)
return Parallel(n_jobs=n_jobs)(delayed(func)(*(item + extra)) for item in iterator)
def iter_texts_from_json_bz2(loc):
"""
Iterator of unicode strings, one per document (here, a comment).
Expects a a path to a BZ2 file, which should be new-line delimited JSON. The
document text should be in a string field titled 'body'.
This is the data format of the Reddit comments corpus.
"""
with bz2.BZ2File(loc) as file_:
for i, line in enumerate(file_):
yield ujson.loads(line)['body']
def transform_texts(batch_id, input_, out_dir):
out_loc = path.join(out_dir, '%d.txt' % batch_id)
if path.exists(out_loc):
return None
print('Batch', batch_id)
nlp = spacy.en.English(parser=False, entity=False)
with io.open(out_loc, 'w', encoding='utf8') as file_:
for text in input_:
doc = nlp(text)
file_.write(' '.join(represent_word(w) for w in doc if not w.is_space))
file_.write('\n')
def represent_word(word):
text = word.text
# True-case, i.e. try to normalize sentence-initial capitals.
# Only do this if the lower-cased form is more probable.
if text.istitle() \
and is_sent_begin(word) \
and word.prob < word.doc.vocab[text.lower()].prob:
text = text.lower()
return text + '|' + word.tag_
def is_sent_begin(word):
# It'd be nice to have some heuristics like these in the library, for these
# times where we don't care so much about accuracy of SBD, and we don't want
# to parse
if word.i == 0:
return True
elif word.i >= 2 and word.nbor(-1).text in ('.', '!', '?', '...'):
return True
else:
return False
@plac.annotations(
in_loc=("Location of input file"),
out_dir=("Location of input file"),
n_workers=("Number of workers", "option", "n", int),
batch_size=("Batch-size for each process", "option", "b", int)
)
def main(in_loc, out_dir, n_workers=4, batch_size=100000):
if not path.exists(out_dir):
path.join(out_dir)
texts = partition(batch_size, iter_texts_from_json_bz2(in_loc))
parallelize(transform_texts, enumerate(texts), n_workers, [out_dir])
if __name__ == '__main__':
plac.call(main)
View File
@@ -1,22 +0,0 @@
# Load NER
from __future__ import unicode_literals
import spacy
import pathlib
from spacy.pipeline import EntityRecognizer
from spacy.vocab import Vocab
def load_model(model_dir):
model_dir = pathlib.Path(model_dir)
nlp = spacy.load('en', parser=False, entity=False, add_vectors=False)
with (model_dir / 'vocab' / 'strings.json').open('r', encoding='utf8') as file_:
nlp.vocab.strings.load(file_)
nlp.vocab.load_lexemes(model_dir / 'vocab' / 'lexemes.bin')
ner = EntityRecognizer.load(model_dir, nlp.vocab, require=True)
return (nlp, ner)
(nlp, ner) = load_model('ner')
doc = nlp.make_doc('Who is Shaka Khan?')
nlp.tagger(doc)
ner(doc)
for word in doc:
print(word.text, word.orth, word.lower, word.tag_, word.ent_type_, word.ent_iob)
View File
@@ -0,0 +1,157 @@
#!/usr/bin/env python
# coding: utf-8
"""Using the parser to recognise your own semantics
spaCy's parser component can be trained to predict any type of tree
structure over your input text. You can also predict trees over whole documents
or chat logs, with connections between the sentence-roots used to annotate
discourse structure. In this example, we'll build a message parser for a common
"chat intent": finding local businesses. Our message semantics will have the
following types of relations: ROOT, PLACE, QUALITY, ATTRIBUTE, TIME, LOCATION.
"show me the best hotel in berlin"
('show', 'ROOT', 'show')
('best', 'QUALITY', 'hotel') --> hotel with QUALITY best
('hotel', 'PLACE', 'show') --> show PLACE hotel
('berlin', 'LOCATION', 'hotel') --> hotel with LOCATION berlin
"""
from __future__ import unicode_literals, print_function
import plac
import random
import spacy
from spacy.gold import GoldParse
from spacy.tokens import Doc
from pathlib import Path
# training data: words, head and dependency labels
# for no relation, we simply chose an arbitrary dependency label, e.g. '-'
TRAIN_DATA = [
(
['find', 'a', 'cafe', 'with', 'great', 'wifi'],
[0, 2, 0, 5, 5, 2], # index of token head
['ROOT', '-', 'PLACE', '-', 'QUALITY', 'ATTRIBUTE']
),
(
['find', 'a', 'hotel', 'near', 'the', 'beach'],
[0, 2, 0, 5, 5, 2],
['ROOT', '-', 'PLACE', 'QUALITY', '-', 'ATTRIBUTE']
),
(
['find', 'me', 'the', 'closest', 'gym', 'that', "'s", 'open', 'late'],
[0, 0, 4, 4, 0, 6, 4, 6, 6],
['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', '-', 'ATTRIBUTE', 'TIME']
),
(
['show', 'me', 'the', 'cheapest', 'store', 'that', 'sells', 'flowers'],
[0, 0, 4, 4, 0, 4, 4, 4], # attach "flowers" to store!
['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', '-', 'PRODUCT']
),
(
['find', 'a', 'nice', 'restaurant', 'in', 'london'],
[0, 3, 3, 0, 3, 3],
['ROOT', '-', 'QUALITY', 'PLACE', '-', 'LOCATION']
),
(
['show', 'me', 'the', 'coolest', 'hostel', 'in', 'berlin'],
[0, 0, 4, 4, 0, 4, 4],
['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', 'LOCATION']
),
(
['find', 'a', 'good', 'italian', 'restaurant', 'near', 'work'],
[0, 4, 4, 4, 0, 4, 5],
['ROOT', '-', 'QUALITY', 'ATTRIBUTE', 'PLACE', 'ATTRIBUTE', 'LOCATION']
)
]
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=100):
"""Load the model, set up the pipeline and train the parser."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
# add the parser to the pipeline if it doesn't exist
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'parser' not in nlp.pipe_names:
parser = nlp.create_pipe('parser')
nlp.add_pipe(parser, first=True)
# otherwise, get it, so we can add labels to it
else:
parser = nlp.get_pipe('parser')
for _, _, deps in TRAIN_DATA:
for dep in deps:
parser.add_label(dep)
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'parser']
with nlp.disable_pipes(*other_pipes): # only train parser
optimizer = nlp.begin_training(lambda: [])
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for words, heads, deps in TRAIN_DATA:
doc = Doc(nlp.vocab, words=words)
gold = GoldParse(doc, heads=heads, deps=deps)
nlp.update([doc], [gold], sgd=optimizer, losses=losses)
print(losses)
# test the trained model
test_model(nlp)
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
test_model(nlp2)
def test_model(nlp):
texts = ["find a hotel with good wifi",
"find me the cheapest gym near work",
"show me the best hotel in berlin"]
docs = nlp.pipe(texts)
for doc in docs:
print(doc.text)
print([(t.text, t.dep_, t.head.text) for t in doc if t.dep_ != '-'])
if __name__ == '__main__':
plac.call(main)
# Expected output:
# find a hotel with good wifi
# [
# ('find', 'ROOT', 'find'),
# ('hotel', 'PLACE', 'find'),
# ('good', 'QUALITY', 'wifi'),
# ('wifi', 'ATTRIBUTE', 'hotel')
# ]
# find me the cheapest gym near work
# [
# ('find', 'ROOT', 'find'),
# ('cheapest', 'QUALITY', 'gym'),
# ('gym', 'PLACE', 'find')
# ]
# show me the best hotel in berlin
# [
# ('show', 'ROOT', 'show'),
# ('best', 'QUALITY', 'hotel'),
# ('hotel', 'PLACE', 'show'),
# ('berlin', 'LOCATION', 'hotel')
# ]
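A note on consuming this output: the (token, relation, head) triples printed by test_model can be folded into a flat slot mapping for whatever handles the intent downstream. The helper below is a hypothetical sketch, not part of the example file; it assumes only the triple format shown in the expected output above.
def triples_to_slots(triples):
    """Group (token, relation, head) triples into a {relation: [tokens]} dict."""
    slots = {}
    for token, relation, head in triples:
        slots.setdefault(relation, []).append(token)
    return slots
# For "show me the best hotel in berlin" the expected triples above give:
# {'ROOT': ['show'], 'QUALITY': ['best'], 'PLACE': ['hotel'], 'LOCATION': ['berlin']}
print(triples_to_slots([('show', 'ROOT', 'show'), ('best', 'QUALITY', 'hotel'),
                        ('hotel', 'PLACE', 'show'), ('berlin', 'LOCATION', 'hotel')]))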

View File

@ -1,13 +1,103 @@
#!/usr/bin/env python
# coding: utf8
"""
Example of training spaCy's named entity recognizer, starting off with an
existing model or a blank model.
For more details, see the documentation:
* Training: https://alpha.spacy.io/usage/training
* NER: https://alpha.spacy.io/usage/linguistic-features#named-entities
Developed for: spaCy 2.0.0a18
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
from spacy.lang.en import English
import spacy
from spacy.gold import GoldParse, biluo_tags_from_offsets
# training data
TRAIN_DATA = [
('Who is Shaka Khan?', [(7, 17, 'PERSON')]),
('I like London and Berlin.', [(7, 13, 'LOC'), (18, 24, 'LOC')])
]
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=100):
"""Load the model, set up the pipeline and train the entity recognizer."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
# create the built-in pipeline components and add them to the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'ner' not in nlp.pipe_names:
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner, last=True)
# function that allows begin_training to get the training data
get_data = lambda: reformat_train_data(nlp.tokenizer, TRAIN_DATA)
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes): # only train NER
optimizer = nlp.begin_training(get_data)
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for raw_text, entity_offsets in TRAIN_DATA:
doc = nlp.make_doc(raw_text)
gold = GoldParse(doc, entities=entity_offsets)
nlp.update(
[doc], # Batch of Doc objects
[gold], # Batch of GoldParse objects
drop=0.5, # Dropout -- make it harder to memorise data
sgd=optimizer, # Callable to update weights
losses=losses)
print(losses)
# test the trained model
for text, _ in TRAIN_DATA:
doc = nlp(text)
print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
for text, _ in TRAIN_DATA:
doc = nlp2(text)
print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])
def reformat_train_data(tokenizer, examples):
"""Reformat data to match JSON format"""
"""Reformat data to match JSON format.
https://alpha.spacy.io/api/annotation#json-input
tokenizer (Tokenizer): Tokenizer to process the raw text.
examples (list): The training data.
RETURNS (list): The reformatted training data."""
output = []
for i, (text, entity_offsets) in enumerate(examples):
doc = tokenizer(text)
@ -21,49 +111,5 @@ def reformat_train_data(tokenizer, examples):
return output
def main(model_dir=None):
train_data = [
(
'Who is Shaka Khan?',
[(len('Who is '), len('Who is Shaka Khan'), 'PERSON')]
),
(
'I like London and Berlin.',
[(len('I like '), len('I like London'), 'LOC'),
(len('I like London and '), len('I like London and Berlin'), 'LOC')]
)
]
nlp = English(pipeline=['tensorizer', 'ner'])
get_data = lambda: reformat_train_data(nlp.tokenizer, train_data)
optimizer = nlp.begin_training(get_data)
for itn in range(100):
random.shuffle(train_data)
losses = {}
for raw_text, entity_offsets in train_data:
doc = nlp.make_doc(raw_text)
gold = GoldParse(doc, entities=entity_offsets)
nlp.update(
[doc], # Batch of Doc objects
[gold], # Batch of GoldParse objects
drop=0.5, # Dropout -- make it harder to memorise data
sgd=optimizer, # Callable to update weights
losses=losses)
print(losses)
print("Save to", model_dir)
nlp.to_disk(model_dir)
print("Load from", model_dir)
nlp = spacy.lang.en.English(pipeline=['tensorizer', 'ner'])
nlp.from_disk(model_dir)
for raw_text, _ in train_data:
doc = nlp(raw_text)
for word in doc:
print(word.text, word.ent_type_, word.ent_iob_)
if __name__ == '__main__':
import plac
plac.call(main)
# Who "" 2
# is "" 2
# Shaka "" PERSON 3
# Khan "" PERSON 1
# ? "" 2
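Since the updated example imports biluo_tags_from_offsets alongside GoldParse, a quick way to sanity-check offset-based annotations is to convert them into per-token BILUO tags. A minimal sketch, assuming a blank 'en' pipeline tokenizes the example text the usual way:
import spacy
from spacy.gold import biluo_tags_from_offsets
nlp = spacy.blank('en')
doc = nlp.make_doc('Who is Shaka Khan?')
# (7, 17, 'PERSON') covers "Shaka Khan"; with the default tokenization this
# should print something like ['O', 'O', 'B-PERSON', 'L-PERSON', 'O']
print(biluo_tags_from_offsets(doc, [(7, 17, 'PERSON')]))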

View File

@ -1,206 +0,0 @@
#!/usr/bin/env python
'''Example of training a named entity recognition system from scratch using spaCy.
This example is written to be self-contained and reasonably transparent.
To achieve that, it duplicates some of spaCy's internal functionality.
Specifically, in this example, we don't use spaCy's built-in Language class to
wire together the Vocab, Tokenizer and EntityRecognizer. Instead, we write
our own simple Pipeline class, so that it's easier to see how the pieces
interact.
Input data:
https://www.lt.informatik.tu-darmstadt.de/fileadmin/user_upload/Group_LangTech/data/GermEval2014_complete_data.zip
Developed for: spaCy 1.7.1
Last tested for: spaCy 2.0.0a13
'''
from __future__ import unicode_literals, print_function
import plac
from pathlib import Path
import random
import json
import tqdm
from thinc.neural.optimizers import Adam
from thinc.neural.ops import NumpyOps
from spacy.vocab import Vocab
from spacy.pipeline import TokenVectorEncoder, NeuralEntityRecognizer
from spacy.tokenizer import Tokenizer
from spacy.tokens import Doc
from spacy.attrs import *
from spacy.gold import GoldParse
from spacy.gold import iob_to_biluo
from spacy.gold import minibatch
from spacy.scorer import Scorer
import spacy.util
try:
unicode
except NameError:
unicode = str
spacy.util.set_env_log(True)
def init_vocab():
return Vocab(
lex_attr_getters={
LOWER: lambda string: string.lower(),
NORM: lambda string: string.lower(),
PREFIX: lambda string: string[0],
SUFFIX: lambda string: string[-3:],
})
class Pipeline(object):
def __init__(self, vocab=None, tokenizer=None, entity=None):
if vocab is None:
vocab = init_vocab()
if tokenizer is None:
tokenizer = Tokenizer(vocab, {}, None, None, None)
if entity is None:
entity = NeuralEntityRecognizer(vocab)
self.vocab = vocab
self.tokenizer = tokenizer
self.entity = entity
self.pipeline = [self.entity]
def begin_training(self):
for model in self.pipeline:
model.begin_training([])
optimizer = Adam(NumpyOps(), 0.001)
return optimizer
def __call__(self, input_):
doc = self.make_doc(input_)
for process in self.pipeline:
process(doc)
return doc
def make_doc(self, input_):
if isinstance(input_, bytes):
input_ = input_.decode('utf8')
if isinstance(input_, unicode):
return self.tokenizer(input_)
else:
return Doc(self.vocab, words=input_)
def make_gold(self, input_, annotations):
doc = self.make_doc(input_)
gold = GoldParse(doc, entities=annotations)
return gold
def update(self, inputs, annots, sgd, losses=None, drop=0.):
if losses is None:
losses = {}
docs = [self.make_doc(input_) for input_ in inputs]
golds = [self.make_gold(input_, annot) for input_, annot in
zip(inputs, annots)]
self.entity.update(docs, golds, drop=drop,
sgd=sgd, losses=losses)
return losses
def evaluate(self, examples):
scorer = Scorer()
for input_, annot in examples:
gold = self.make_gold(input_, annot)
doc = self(input_)
scorer.score(doc, gold)
return scorer.scores
def to_disk(self, path):
path = Path(path)
if not path.exists():
path.mkdir()
elif not path.is_dir():
raise IOError("Can't save pipeline to %s\nNot a directory" % path)
self.vocab.to_disk(path / 'vocab')
self.entity.to_disk(path / 'ner')
def from_disk(self, path):
path = Path(path)
if not path.exists():
raise IOError("Cannot load pipeline from %s\nDoes not exist" % path)
if not path.is_dir():
raise IOError("Cannot load pipeline from %s\nNot a directory" % path)
self.vocab = self.vocab.from_disk(path / 'vocab')
self.entity = self.entity.from_disk(path / 'ner')
def train(nlp, train_examples, dev_examples, nr_epoch=5):
sgd = nlp.begin_training()
print("Iter", "Loss", "P", "R", "F")
for i in range(nr_epoch):
random.shuffle(train_examples)
losses = {}
for batch in minibatch(tqdm.tqdm(train_examples, leave=False), size=8):
inputs, annots = zip(*batch)
nlp.update(list(inputs), list(annots), sgd, losses=losses)
scores = nlp.evaluate(dev_examples)
report_scores(i+1, losses['ner'], scores)
def report_scores(i, loss, scores):
precision = '%.2f' % scores['ents_p']
recall = '%.2f' % scores['ents_r']
f_measure = '%.2f' % scores['ents_f']
print('Epoch %d: %d %s %s %s' % (
i, int(loss), precision, recall, f_measure))
def read_examples(path):
path = Path(path)
with path.open() as file_:
sents = file_.read().strip().split('\n\n')
for sent in sents:
sent = sent.strip()
if not sent:
continue
tokens = sent.split('\n')
while tokens and tokens[0].startswith('#'):
tokens.pop(0)
words = []
iob = []
for token in tokens:
if token.strip():
pieces = token.split('\t')
words.append(pieces[1])
iob.append(pieces[2])
yield words, iob_to_biluo(iob)
def get_labels(examples):
labels = set()
for words, tags in examples:
for tag in tags:
if '-' in tag:
labels.add(tag.split('-')[1])
return sorted(labels)
@plac.annotations(
model_dir=("Path to save the model", "positional", None, Path),
train_loc=("Path to your training data", "positional", None, Path),
dev_loc=("Path to your development data", "positional", None, Path),
)
def main(model_dir, train_loc, dev_loc, nr_epoch=30):
print(model_dir, train_loc, dev_loc)
train_examples = list(read_examples(train_loc))
dev_examples = read_examples(dev_loc)
nlp = Pipeline()
for label in get_labels(train_examples):
nlp.entity.add_label(label)
print("Add label", label)
train(nlp, train_examples, list(dev_examples), nr_epoch)
nlp.to_disk(model_dir)
if __name__ == '__main__':
plac.call(main)
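For reference, read_examples above expects GermEval-style sentence blocks: sentences separated by blank lines, one tab-separated token per line with the surface form in the second column and the IOB tag in the third, optionally preceded by '#' comment lines. The snippet below is a hypothetical re-implementation of that parsing on an in-memory string (the tokens are made up), just to illustrate the expected layout:
SAMPLE = """# comment line
1\tSchartau\tB-PER
2\tsagte\tO

1\tBerlin\tB-LOC
2\t.\tO"""
def parse_blocks(text):
    for block in text.strip().split('\n\n'):
        lines = [line for line in block.strip().split('\n')
                 if line and not line.startswith('#')]
        words = [line.split('\t')[1] for line in lines]
        iob = [line.split('\t')[2] for line in lines]
        yield words, iob
for words, iob in parse_blocks(SAMPLE):
    print(words, iob)
# ['Schartau', 'sagte'] ['B-PER', 'O']
# ['Berlin', '.'] ['B-LOC', 'O']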

View File

@ -21,104 +21,120 @@ After training your model, you can save it to a directory. We recommend
wrapping models as Python packages, for ease of deployment.
For more details, see the documentation:
* Training the Named Entity Recognizer: https://spacy.io/docs/usage/train-ner
* Saving and loading models: https://spacy.io/docs/usage/saving-loading
* Training: https://alpha.spacy.io/usage/training
* NER: https://alpha.spacy.io/usage/linguistic-features#named-entities
Developed for: spaCy 1.7.6
Last updated for: spaCy 2.0.0a13
Developed for: spaCy 2.0.0a18
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import random
import spacy
from spacy.gold import GoldParse, minibatch
from spacy.pipeline import NeuralEntityRecognizer
from spacy.pipeline import TokenVectorEncoder
# new entity label
LABEL = 'ANIMAL'
# training data
TRAIN_DATA = [
("Horses are too tall and they pretend to care about your feelings",
[(0, 6, 'ANIMAL')]),
("Do they bite?", []),
("horses are too tall and they pretend to care about your feelings",
[(0, 6, 'ANIMAL')]),
("horses pretend to care about your feelings", [(0, 6, 'ANIMAL')]),
("they pretend to care about your feelings, those horses",
[(48, 54, 'ANIMAL')]),
("horses?", [(0, 6, 'ANIMAL')])
]
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
new_model_name=("New model name for model meta.", "option", "nm", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, new_model_name='animal', output_dir=None, n_iter=50):
"""Set up the pipeline and entity recognizer, and train the new entity."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
# Add entity recognizer to model if it's not in the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'ner' not in nlp.pipe_names:
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
# otherwise, get it, so we can add labels to it
else:
ner = nlp.get_pipe('ner')
ner.add_label(LABEL) # add new entity label to entity recognizer
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes): # only train NER
random.seed(0)
optimizer = nlp.begin_training(lambda: [])
for itn in range(n_iter):
losses = {}
gold_parses = get_gold_parses(nlp.make_doc, TRAIN_DATA)
for batch in minibatch(gold_parses, size=3):
docs, golds = zip(*batch)
nlp.update(docs, golds, losses=losses, sgd=optimizer,
drop=0.35)
print(losses)
# test the trained model
test_text = 'Do you like horses?'
doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
print(ent.label_, ent.text)
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.meta['name'] = new_model_name # rename model
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc2 = nlp2(test_text)
for ent in doc2.ents:
print(ent.label_, ent.text)
def get_gold_parses(tokenizer, train_data):
'''Shuffle and create GoldParse objects'''
"""Shuffle and create GoldParse objects.
tokenizer (Tokenizer): Tokenizer to process the raw text.
train_data (list): The training data.
YIELDS (tuple): (doc, gold) tuples.
"""
random.shuffle(train_data)
for raw_text, entity_offsets in train_data:
doc = tokenizer(raw_text)
gold = GoldParse(doc, entities=entity_offsets)
yield doc, gold
def train_ner(nlp, train_data, output_dir):
random.seed(0)
optimizer = nlp.begin_training(lambda: [])
nlp.meta['name'] = 'en_ent_animal'
for itn in range(50):
losses = {}
for batch in minibatch(get_gold_parses(nlp.make_doc, train_data), size=3):
docs, golds = zip(*batch)
nlp.update(docs, golds, losses=losses, sgd=optimizer, drop=0.35)
print(losses)
if not output_dir:
return
elif not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
def main(model_name, output_directory=None):
print("Creating initial model", model_name)
nlp = spacy.blank(model_name)
if output_directory is not None:
output_directory = Path(output_directory)
train_data = [
(
"Horses are too tall and they pretend to care about your feelings",
[(0, 6, 'ANIMAL')],
),
(
"Do they bite?",
[],
),
(
"horses are too tall and they pretend to care about your feelings",
[(0, 6, 'ANIMAL')]
),
(
"horses pretend to care about your feelings",
[(0, 6, 'ANIMAL')]
),
(
"they pretend to care about your feelings, those horses",
[(48, 54, 'ANIMAL')]
),
(
"horses?",
[(0, 6, 'ANIMAL')]
)
]
nlp.add_pipe(TokenVectorEncoder(nlp.vocab))
ner = NeuralEntityRecognizer(nlp.vocab)
ner.add_label('ANIMAL')
nlp.add_pipe(ner)
train_ner(nlp, train_data, output_directory)
# Test that the entity is recognized
text = 'Do you like horses?'
print("Ents in 'Do you like horses?':")
doc = nlp(text)
for ent in doc.ents:
print(ent.label_, ent.text)
if output_directory:
print("Loading from", output_directory)
nlp2 = spacy.load(output_directory)
doc2 = nlp2('Do you like horses?')
for ent in doc2.ents:
print(ent.label_, ent.text)
if __name__ == '__main__':
import plac
plac.call(main)
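The loop above feeds (doc, gold) pairs to nlp.update in batches of three via minibatch. The grouping pattern itself is simple; the self-contained sketch below uses a plain generator rather than spaCy's helper, purely to illustrate it:
def chunks(items, size):
    """Yield successive lists of up to `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
pairs = [('doc%d' % i, 'gold%d' % i) for i in range(7)]
for batch in chunks(pairs, 3):
    docs, golds = zip(*batch)
    print(docs, golds)   # three batches: sizes 3, 3 and 1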

View File

@ -1,75 +1,109 @@
#!/usr/bin/env python
# coding: utf8
"""
Example of training spaCy's dependency parser, starting off with an existing model
or a blank model.
For more details, see the documentation:
* Training: https://alpha.spacy.io/usage/training
* Dependency Parse: https://alpha.spacy.io/usage/linguistic-features#dependency-parse
Developed for: spaCy 2.0.0a18
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import json
import pathlib
import plac
import random
from pathlib import Path
import spacy
from spacy.pipeline import DependencyParser
from spacy.gold import GoldParse
from spacy.tokens import Doc
def train_parser(nlp, train_data, left_labels, right_labels):
parser = DependencyParser(
nlp.vocab,
left_labels=left_labels,
right_labels=right_labels)
for itn in range(1000):
random.shuffle(train_data)
loss = 0
for words, heads, deps in train_data:
doc = Doc(nlp.vocab, words=words)
gold = GoldParse(doc, heads=heads, deps=deps)
loss += parser.update(doc, gold)
parser.model.end_training()
return parser
# training data
TRAIN_DATA = [
(
['They', 'trade', 'mortgage', '-', 'backed', 'securities', '.'],
[1, 1, 4, 4, 5, 1, 1],
['nsubj', 'ROOT', 'compound', 'punct', 'nmod', 'dobj', 'punct']
),
(
['I', 'like', 'London', 'and', 'Berlin', '.'],
[1, 1, 1, 2, 2, 1],
['nsubj', 'ROOT', 'dobj', 'cc', 'conj', 'punct']
)
]
def main(model_dir=None):
if model_dir is not None:
model_dir = pathlib.Path(model_dir)
if not model_dir.exists():
model_dir.mkdir()
assert model_dir.is_dir()
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=1000):
"""Load the model, set up the pipeline and train the parser."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
nlp = spacy.load('en', tagger=False, parser=False, entity=False, add_vectors=False)
# add the parser to the pipeline if it doesn't exist
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'parser' not in nlp.pipe_names:
parser = nlp.create_pipe('parser')
nlp.add_pipe(parser, first=True)
# otherwise, get it, so we can add labels to it
else:
parser = nlp.get_pipe('parser')
train_data = [
(
['They', 'trade', 'mortgage', '-', 'backed', 'securities', '.'],
[1, 1, 4, 4, 5, 1, 1],
['nsubj', 'ROOT', 'compound', 'punct', 'nmod', 'dobj', 'punct']
),
(
['I', 'like', 'London', 'and', 'Berlin', '.'],
[1, 1, 1, 2, 2, 1],
['nsubj', 'ROOT', 'dobj', 'cc', 'conj', 'punct']
)
]
left_labels = set()
right_labels = set()
for _, heads, deps in train_data:
for i, (head, dep) in enumerate(zip(heads, deps)):
if i < head:
left_labels.add(dep)
elif i > head:
right_labels.add(dep)
parser = train_parser(nlp, train_data, sorted(left_labels), sorted(right_labels))
# add labels to the parser
for _, _, deps in TRAIN_DATA:
for dep in deps:
parser.add_label(dep)
doc = Doc(nlp.vocab, words=['I', 'like', 'securities', '.'])
parser(doc)
for word in doc:
print(word.text, word.dep_, word.head.text)
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'parser']
with nlp.disable_pipes(*other_pipes): # only train parser
optimizer = nlp.begin_training(lambda: [])
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for words, heads, deps in TRAIN_DATA:
doc = Doc(nlp.vocab, words=words)
gold = GoldParse(doc, heads=heads, deps=deps)
nlp.update([doc], [gold], sgd=optimizer, losses=losses)
print(losses)
if model_dir is not None:
with (model_dir / 'config.json').open('w') as file_:
json.dump(parser.cfg, file_)
parser.model.dump(str(model_dir / 'model'))
# test the trained model
test_text = "I like securities."
doc = nlp(test_text)
print('Dependencies', [(t.text, t.dep_, t.head.text) for t in doc])
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc = nlp2(test_text)
print('Dependencies', [(t.text, t.dep_, t.head.text) for t in doc])
if __name__ == '__main__':
main()
# I nsubj like
# like ROOT like
# securities dobj like
# . cc securities
plac.call(main)
# expected result:
# [
# ('I', 'nsubj', 'like'),
# ('like', 'ROOT', 'like'),
# ('securities', 'dobj', 'like'),
# ('.', 'punct', 'like')
# ]
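Worth spelling out: each entry in the heads list is the index of that token's head, and a token whose head is its own index is the root. A standalone sketch of how the second training sentence's indices translate into arcs:
words = ['I', 'like', 'London', 'and', 'Berlin', '.']
heads = [1, 1, 1, 2, 2, 1]
deps = ['nsubj', 'ROOT', 'dobj', 'cc', 'conj', 'punct']
for i, (head, dep) in enumerate(zip(heads, deps)):
    if i == head:
        print('%s is the root' % words[i])
    else:
        print('%s --%s--> %s' % (words[head], dep, words[i]))
# like --nsubj--> I
# like is the root
# like --dobj--> London
# London --cc--> and
# London --conj--> Berlin
# like --punct--> .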

View File

@ -1,18 +1,28 @@
"""A quick example for training a part-of-speech tagger, without worrying
about the tokenization, or other language-specific customizations."""
#!/usr/bin/env python
# coding: utf8
"""
A simple example for training a part-of-speech tagger with a custom tag map.
To allow us to update the tag map with our custom one, this example starts off
with a blank Language class and modifies its defaults.
from __future__ import unicode_literals
from __future__ import print_function
For more details, see the documentation:
* Training: https://alpha.spacy.io/usage/training
* POS Tagging: https://alpha.spacy.io/usage/linguistic-features#pos-tagging
Developed for: spaCy 2.0.0a18
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
from spacy.vocab import Vocab
from spacy.tagger import Tagger
import spacy
from spacy.util import get_lang_class
from spacy.tokens import Doc
from spacy.gold import GoldParse
import random
# You need to define a mapping from your data's part-of-speech tag names to the
# Universal Part-of-Speech tag set, as spaCy includes an enum of these tags.
@ -28,54 +38,67 @@ TAG_MAP = {
# Usually you'll read this in, of course. Data formats vary.
# Ensure your strings are unicode.
DATA = [
(
["I", "like", "green", "eggs"],
["N", "V", "J", "N"]
),
(
["Eat", "blue", "ham"],
["V", "J", "N"]
)
TRAIN_DATA = [
(["I", "like", "green", "eggs"], ["N", "V", "J", "N"]),
(["Eat", "blue", "ham"], ["V", "J", "N"])
]
def ensure_dir(path):
if not path.exists():
path.mkdir()
@plac.annotations(
lang=("ISO Code of language to use", "option", "l", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(lang='en', output_dir=None, n_iter=25):
"""Create a new model, set up the pipeline and train the tagger. In order to
train the tagger with a custom tag map, we're creating a new Language
instance with a custom vocab.
"""
lang_cls = get_lang_class(lang) # get Language class
lang_cls.Defaults.tag_map.update(TAG_MAP) # add tag map to defaults
nlp = lang_cls() # initialise Language class
# add the tagger to the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
tagger = nlp.create_pipe('tagger')
nlp.add_pipe(tagger)
def main(output_dir=None):
optimizer = nlp.begin_training(lambda: [])
for i in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for words, tags in TRAIN_DATA:
doc = Doc(nlp.vocab, words=words)
gold = GoldParse(doc, tags=tags)
nlp.update([doc], [gold], sgd=optimizer, losses=losses)
print(losses)
# test the trained model
test_text = "I like blue eggs"
doc = nlp(test_text)
print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
ensure_dir(output_dir)
ensure_dir(output_dir / "pos")
ensure_dir(output_dir / "vocab")
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
vocab = Vocab(tag_map=TAG_MAP)
# The default_templates argument is where features are specified. See
# spacy/tagger.pyx for the defaults.
tagger = Tagger(vocab)
for i in range(25):
for words, tags in DATA:
doc = Doc(vocab, words=words)
gold = GoldParse(doc, tags=tags)
tagger.update(doc, gold)
random.shuffle(DATA)
tagger.model.end_training()
doc = Doc(vocab, orths_and_spaces=zip(["I", "like", "blue", "eggs"], [True] * 4))
tagger(doc)
for word in doc:
print(word.text, word.tag_, word.pos_)
if output_dir is not None:
tagger.model.dump(str(output_dir / 'pos' / 'model'))
with (output_dir / 'vocab' / 'strings.json').open('w') as file_:
tagger.vocab.strings.dump(file_)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc = nlp2(test_text)
print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])
if __name__ == '__main__':
plac.call(main)
# I V VERB
# like V VERB
# blue N NOUN
# eggs N NOUN
# Expected output:
# [
# ('I', 'N', 'NOUN'),
# ('like', 'V', 'VERB'),
# ('blue', 'J', 'ADJ'),
# ('eggs', 'N', 'NOUN')
# ]
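The TAG_MAP definition itself is collapsed in the hunk above. For orientation, a map covering just the three tags used in TRAIN_DATA would look roughly like the sketch below (the exact contents of the example file may differ); each custom tag maps to a dict whose 'pos' value names the Universal POS tag, matching the NOUN/VERB/ADJ values in the expected output:
TAG_MAP = {
    'N': {'pos': 'NOUN'},
    'V': {'pos': 'VERB'},
    'J': {'pos': 'ADJ'}
}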

View File

@ -1,58 +1,119 @@
'''Train a multi-label convolutional neural network text classifier,
using the spacy.pipeline.TextCategorizer component. The model is then added
to spacy.pipeline, and predictions are available at `doc.cats`.
'''
from __future__ import unicode_literals
#!/usr/bin/env python
# coding: utf8
"""Train a multi-label convolutional neural network text classifier on the
IMDB dataset, using the TextCategorizer component. The dataset will be loaded
automatically via Thinc's built-in dataset loader. The model is added to
spacy.pipeline, and predictions are available via `doc.cats`.
For more details, see the documentation:
* Training: https://alpha.spacy.io/usage/training
* Text classification: https://alpha.spacy.io/usage/text-classification
Developed for: spaCy 2.0.0a18
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
import random
import tqdm
from thinc.neural.optimizers import Adam
from thinc.neural.ops import NumpyOps
from pathlib import Path
import thinc.extra.datasets
import spacy.lang.en
import spacy
from spacy.gold import GoldParse, minibatch
from spacy.util import compounding
from spacy.pipeline import TextCategorizer
# TODO: Remove this once we're not supporting models trained with thinc <6.9.0
import thinc.neural._classes.layernorm
thinc.neural._classes.layernorm.set_compat_six_eight(False)
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=20):
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
def train_textcat(tokenizer, textcat,
train_texts, train_cats, dev_texts, dev_cats,
n_iter=20):
'''
Train the TextCategorizer without associated pipeline.
'''
textcat.begin_training()
optimizer = Adam(NumpyOps(), 0.001)
train_docs = [tokenizer(text) for text in train_texts]
# add the text classifier to the pipeline if it doesn't exist
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'textcat' not in nlp.pipe_names:
# textcat = nlp.create_pipe('textcat')
textcat = TextCategorizer(nlp.vocab, labels=['POSITIVE'])
nlp.add_pipe(textcat, last=True)
# otherwise, get it, so we can add labels to it
else:
textcat = nlp.get_pipe('textcat')
# add label to text classifier
# textcat.add_label('POSITIVE')
# load the IMDB dataset
print("Loading IMDB data...")
(train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=2000)
train_docs = [nlp.tokenizer(text) for text in train_texts]
train_gold = [GoldParse(doc, cats=cats) for doc, cats in
zip(train_docs, train_cats)]
train_data = list(zip(train_docs, train_gold))
batch_sizes = compounding(4., 128., 1.001)
for i in range(n_iter):
losses = {}
# Progress bar and minibatching
batches = minibatch(tqdm.tqdm(train_data, leave=False), size=batch_sizes)
for batch in batches:
docs, golds = zip(*batch)
textcat.update(docs, golds, sgd=optimizer, drop=0.2,
losses=losses)
with textcat.model.use_params(optimizer.averages):
scores = evaluate(tokenizer, textcat, dev_texts, dev_cats)
yield losses['textcat'], scores
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
with nlp.disable_pipes(*other_pipes): # only train textcat
optimizer = nlp.begin_training(lambda: [])
print("Training the model...")
print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))
for i in range(n_iter):
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(train_data, size=compounding(4., 128., 1.001))
for batch in batches:
docs, golds = zip(*batch)
nlp.update(docs, golds, sgd=optimizer, drop=0.2, losses=losses)
with textcat.model.use_params(optimizer.averages):
# evaluate on the dev data split off in load_data()
scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
print('{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}'  # print a simple table
.format(losses['textcat'], scores['textcat_p'],
scores['textcat_r'], scores['textcat_f']))
# test the trained model
test_text = "This movie sucked"
doc = nlp(test_text)
print(test_text, doc.cats)
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc2 = nlp2(test_text)
print(test_text, doc2.cats)
def load_data(limit=0, split=0.8):
"""Load data from the IMDB dataset."""
# Partition off part of the train data for evaluation
train_data, _ = thinc.extra.datasets.imdb()
random.shuffle(train_data)
train_data = train_data[-limit:]
texts, labels = zip(*train_data)
cats = [{'POSITIVE': bool(y)} for y in labels]
split = int(len(train_data) * split)
return (texts[:split], cats[:split]), (texts[split:], cats[split:])
def evaluate(tokenizer, textcat, texts, cats):
docs = (tokenizer(text) for text in texts)
tp = 1e-8 # True positives
fp = 1e-8 # False positives
fn = 1e-8 # False negatives
tn = 1e-8 # True negatives
tp = 1e-8 # True positives
fp = 1e-8 # False positives
fn = 1e-8 # False negatives
tn = 1e-8 # True negatives
for i, doc in enumerate(textcat.pipe(docs)):
gold = cats[i]
for label, score in doc.cats.items():
@ -66,55 +127,10 @@ def evaluate(tokenizer, textcat, texts, cats):
tn += 1
elif score < 0.5 and gold[label] >= 0.5:
fn += 1
precis = tp / (tp + fp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
fscore = 2 * (precis * recall) / (precis + recall)
return {'textcat_p': precis, 'textcat_r': recall, 'textcat_f': fscore}
def load_data(limit=0):
# Partition off part of the train data --- avoid running experiments
# against test.
train_data, _ = thinc.extra.datasets.imdb()
random.shuffle(train_data)
train_data = train_data[-limit:]
texts, labels = zip(*train_data)
cats = [{'POSITIVE': bool(y)} for y in labels]
split = int(len(train_data) * 0.8)
train_texts = texts[:split]
train_cats = cats[:split]
dev_texts = texts[split:]
dev_cats = cats[split:]
return (train_texts, train_cats), (dev_texts, dev_cats)
def main(model_loc=None):
nlp = spacy.lang.en.English()
tokenizer = nlp.tokenizer
textcat = TextCategorizer(tokenizer.vocab, labels=['POSITIVE'])
print("Load IMDB data")
(train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=2000)
print("Itn.\tLoss\tP\tR\tF")
progress = '{i:d} {loss:.3f} {textcat_p:.3f} {textcat_r:.3f} {textcat_f:.3f}'
for i, (loss, scores) in enumerate(train_textcat(tokenizer, textcat,
train_texts, train_cats,
dev_texts, dev_cats, n_iter=20)):
print(progress.format(i=i, loss=loss, **scores))
# How to save, load and use
nlp.pipeline.append(textcat)
if model_loc is not None:
nlp.to_disk(model_loc)
nlp = spacy.load(model_loc)
doc = nlp(u'This movie sucked!')
print(doc.cats)
f_score = 2 * (precision * recall) / (precision + recall)
return {'textcat_p': precision, 'textcat_r': recall, 'textcat_f': f_score}
if __name__ == '__main__':
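The evaluate helper accumulates tp/fp/fn/tn per label (seeded with 1e-8 so the divisions can never hit zero) and derives precision, recall and F-score. A standalone sketch of that arithmetic on made-up counts:
tp, fp, fn = 42.0, 8.0, 14.0   # hypothetical counts from a dev run
precision = tp / (tp + fp)     # 0.84
recall = tp / (tp + fn)        # 0.75
f_score = 2 * (precision * recall) / (precision + recall)
print('P=%.2f R=%.2f F=%.2f' % (precision, recall, f_score))   # P=0.84 R=0.75 F=0.79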

View File

@ -1,36 +0,0 @@
# encoding: utf8
from __future__ import unicode_literals, print_function
import plac
import codecs
import pathlib
import random
import twython
import spacy.en
import _handler
class Connection(twython.TwythonStreamer):
def __init__(self, keys_dir, nlp, query):
keys_dir = pathlib.Path(keys_dir)
read = lambda fn: (keys_dir / (fn + '.txt')).open().read().strip()
api_key = map(read, ['key', 'secret', 'token', 'token_secret'])
twython.TwythonStreamer.__init__(self, *api_key)
self.nlp = nlp
self.query = query
def on_success(self, data):
_handler.handle_tweet(self.nlp, data, self.query)
if random.random() >= 0.1:
reload(_handler)
def main(keys_dir, term):
nlp = spacy.en.English()
twitter = Connection(keys_dir, nlp, term)
twitter.statuses.filter(track=term, language='en')
if __name__ == '__main__':
plac.call(main)

View File

@ -1,16 +1,19 @@
'''Load vectors for a language trained using FastText
#!/usr/bin/env python
# coding: utf8
"""Load vectors for a language trained using fastText
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
'''
"""
from __future__ import unicode_literals
import plac
import numpy
import spacy.language
from spacy.language import Language
@plac.annotations(
vectors_loc=("Path to vectors", "positional", None, str))
def main(vectors_loc):
nlp = spacy.language.Language()
nlp = Language()
with open(vectors_loc, 'rb') as file_:
header = file_.readline()
@ -18,7 +21,7 @@ def main(vectors_loc):
nlp.vocab.clear_vectors(int(nr_dim))
for line in file_:
line = line.decode('utf8')
pieces = line.split()
pieces = line.split()
word = pieces[0]
vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
nlp.vocab.set_vector(word, vector)
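Once the loop above has filled nlp.vocab, the vectors can be read back out and compared. A self-contained sketch with tiny made-up vectors (it assumes Vocab.get_vector is available alongside the set_vector call used above):
import numpy
from spacy.language import Language
nlp = Language()
nlp.vocab.clear_vectors(3)   # hypothetical 3-dimensional vectors
nlp.vocab.set_vector('king', numpy.asarray([0.1, 0.2, 0.3], dtype='f'))
nlp.vocab.set_vector('queen', numpy.asarray([0.1, 0.25, 0.28], dtype='f'))
v1 = nlp.vocab.get_vector('king')
v2 = nlp.vocab.get_vector('queen')
# cosine similarity of the two stored vectors
print(numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2)))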

View File

@ -30,19 +30,14 @@ MOD_NAMES = [
'spacy.syntax._state',
'spacy.syntax._beam_utils',
'spacy.tokenizer',
'spacy._cfile',
'spacy.syntax.parser',
'spacy.syntax.nn_parser',
'spacy.syntax.beam_parser',
'spacy.syntax.nonproj',
'spacy.syntax.transition_system',
'spacy.syntax.arc_eager',
'spacy.syntax._parse_features',
'spacy.gold',
'spacy.tokens.doc',
'spacy.tokens.span',
'spacy.tokens.token',
'spacy.cfile',
'spacy.matcher',
'spacy.syntax.ner',
'spacy.symbols',
@ -67,7 +62,7 @@ LINK_OPTIONS = {
# I don't understand this very well yet. See Issue #267
# Fingers crossed!
USE_OPENMP_DEFAULT = '1' if sys.platform != 'darwin' else None
USE_OPENMP_DEFAULT = '0' if sys.platform != 'darwin' else None
if os.environ.get('USE_OPENMP', USE_OPENMP_DEFAULT) == '1':
if sys.platform == 'darwin':
COMPILE_OPTIONS['other'].append('-fopenmp')

View File

@ -1,26 +0,0 @@
from libc.stdio cimport fopen, fclose, fread, fwrite, FILE
from cymem.cymem cimport Pool
cdef class CFile:
cdef FILE* fp
cdef bint is_open
cdef Pool mem
cdef int size # For compatibility with subclass
cdef int _capacity # For compatibility with subclass
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1
cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *
cdef class StringCFile(CFile):
cdef unsigned char* data
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1
cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *

View File

@ -1,88 +0,0 @@
from libc.stdio cimport fopen, fclose, fread, fwrite, FILE
from libc.string cimport memcpy
cdef class CFile:
def __init__(self, loc, mode, on_open_error=None):
if isinstance(mode, unicode):
mode_str = mode.encode('ascii')
else:
mode_str = mode
if hasattr(loc, 'as_posix'):
loc = loc.as_posix()
self.mem = Pool()
cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
self.fp = fopen(<char*>bytes_loc, mode_str)
if self.fp == NULL:
if on_open_error is not None:
on_open_error()
else:
raise IOError("Could not open binary file %s" % bytes_loc)
self.is_open = True
def __dealloc__(self):
if self.is_open:
fclose(self.fp)
def close(self):
fclose(self.fp)
self.is_open = False
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1:
st = fread(dest, elem_size, number, self.fp)
if st != number:
raise IOError
cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1:
st = fwrite(src, elem_size, number, self.fp)
if st != number:
raise IOError
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *:
cdef void* dest = mem.alloc(number, elem_size)
self.read_into(dest, number, elem_size)
return dest
def write_unicode(self, unicode value):
cdef bytes py_bytes = value.encode('utf8')
cdef char* chars = <char*>py_bytes
self.write(sizeof(char), len(py_bytes), chars)
cdef class StringCFile:
def __init__(self, mode, bytes data=b'', on_open_error=None):
self.mem = Pool()
self.is_open = 'w' in mode
self._capacity = max(len(data), 8)
self.size = len(data)
self.data = <unsigned char*>self.mem.alloc(1, self._capacity)
for i in range(len(data)):
self.data[i] = data[i]
def close(self):
self.is_open = False
def string_data(self):
return (self.data-self.size)[:self.size]
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1:
memcpy(dest, self.data, elem_size * number)
self.data += elem_size * number
cdef int write_from(self, void* src, size_t elem_size, size_t number) except -1:
write_size = number * elem_size
if (self.size + write_size) >= self._capacity:
self._capacity = (self.size + write_size) * 2
self.data = <unsigned char*>self.mem.realloc(self.data, self._capacity)
memcpy(&self.data[self.size], src, elem_size * number)
self.size += write_size
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *:
cdef void* dest = mem.alloc(number, elem_size)
self.read_into(dest, number, elem_size)
return dest
def write_unicode(self, unicode value):
cdef bytes py_bytes = value.encode('utf8')
cdef char* chars = <char*>py_bytes
self.write(sizeof(char), len(py_bytes), chars)

View File

@ -96,7 +96,6 @@ def _zero_init(model):
@layerize
def _preprocess_doc(docs, drop=0.):
keys = [doc.to_array([LOWER]) for doc in docs]
keys = [a[:, 0] for a in keys]
ops = Model.ops
lengths = ops.asarray([arr.shape[0] for arr in keys])
keys = ops.xp.concatenate(keys)
@ -128,31 +127,34 @@ class PrecomputableAffine(Model):
self.nF = nF
def begin_update(self, X, drop=0.):
tensordot = self.ops.xp.tensordot
ascontiguous = self.ops.xp.ascontiguousarray
if self.nP == 1:
Yf = tensordot(X, self.W, axes=[[1], [2]])
else:
Yf = tensordot(X, self.W, axes=[[1], [3]])
Yf = self.ops.dot(X,
self.W.reshape((self.nF*self.nO*self.nP, self.nI)).T)
Yf = Yf.reshape((X.shape[0], self.nF, self.nO, self.nP))
def backward(dY_ids, sgd=None):
dY, ids = dY_ids
Xf = X[ids]
if self.nP == 1:
dXf = tensordot(dY, self.W, axes=[[1], [1]])
else:
dXf = tensordot(dY, self.W, axes=[[1,2], [1,2]])
dW = tensordot(dY, Xf, axes=[[0], [0]])
# (o, p, f, i) --> (f, o, p, i)
if self.nP == 1:
self.d_W += dW.transpose((1, 0, 2))
else:
self.d_W += dW.transpose((2, 0, 1, 3))
Xf = Xf.reshape((Xf.shape[0], self.nF * self.nI))
self.d_b += dY.sum(axis=0)
dY = dY.reshape((dY.shape[0], self.nO*self.nP))
Wopfi = self.W.transpose((1, 2, 0, 3))
Wopfi = self.ops.xp.ascontiguousarray(Wopfi)
Wopfi = Wopfi.reshape((self.nO*self.nP, self.nF * self.nI))
dXf = self.ops.dot(dY.reshape((dY.shape[0], self.nO*self.nP)), Wopfi)
# Reuse the buffer
dWopfi = Wopfi; dWopfi.fill(0.)
self.ops.xp.dot(dY.T, Xf, out=dWopfi)
dWopfi = dWopfi.reshape((self.nO, self.nP, self.nF, self.nI))
# (o, p, f, i) --> (f, o, p, i)
self.d_W += dWopfi.transpose((2, 0, 1, 3))
if sgd is not None:
sgd(self._mem.weights, self._mem.gradient, key=self.id)
return dXf
return dXf.reshape((dXf.shape[0], self.nF, self.nI))
return Yf, backward
@staticmethod
@ -176,12 +178,9 @@ class PrecomputableAffine(Model):
size=tokvecs.size).reshape(tokvecs.shape)
def predict(ids, tokvecs):
hiddens = model(tokvecs)
if model.nP == 1:
vector = model.ops.allocate((hiddens.shape[0], model.nO))
else:
vector = model.ops.allocate((hiddens.shape[0], model.nO, model.nP))
model.ops.scatter_add(vector, ids, hiddens)
hiddens = model(tokvecs) # (b, f, o, p)
vector = model.ops.allocate((hiddens.shape[0], model.nO, model.nP))
model.ops.xp.add.at(vector, ids, hiddens)
vector += model.b
if model.nP >= 2:
return model.ops.maxout(vector)[0]
@ -329,8 +328,7 @@ def Tok2Vec(width, embed_size, **kwargs):
tok2vec = (
FeatureExtracter(cols)
>> with_flatten(
embed >> (convolution ** 4), pad=4)
>> with_flatten(embed >> (convolution ** 4), pad=4)
)
# Work around thinc API limitations :(. TODO: Revise in Thinc 7
@ -359,58 +357,12 @@ def reapply(layer, n_times):
return wrap(reapply_fwd, layer)
def asarray(ops, dtype):
def forward(X, drop=0.):
return ops.asarray(X, dtype=dtype), None
return layerize(forward)
def foreach(layer):
def forward(Xs, drop=0.):
results = []
backprops = []
for X in Xs:
result, bp = layer.begin_update(X, drop=drop)
results.append(result)
backprops.append(bp)
def backward(d_results, sgd=None):
dXs = []
for d_result, backprop in zip(d_results, backprops):
dXs.append(backprop(d_result, sgd))
return dXs
return results, backward
model = layerize(forward)
model._layers.append(layer)
return model
def rebatch(size, layer):
ops = layer.ops
def forward(X, drop=0.):
if X.shape[0] < size:
return layer.begin_update(X)
parts = _divide_array(X, size)
results, bp_results = zip(*[layer.begin_update(p, drop=drop)
for p in parts])
y = ops.flatten(results)
def backward(dy, sgd=None):
d_parts = [bp(y, sgd=sgd) for bp, y in
zip(bp_results, _divide_array(dy, size))]
try:
dX = ops.flatten(d_parts)
except TypeError:
dX = None
except ValueError:
dX = None
return dX
return y, backward
model = layerize(forward)
model._layers.append(layer)
return model
def _divide_array(X, size):
parts = []
index = 0
@ -473,46 +425,6 @@ def get_token_vectors(tokens_attrs_vectors, drop=0.):
return vectors, backward
def fine_tune(embedding, combine=None):
if combine is not None:
raise NotImplementedError(
"fine_tune currently only supports addition. Set combine=None")
def fine_tune_fwd(docs_tokvecs, drop=0.):
docs, tokvecs = docs_tokvecs
lengths = model.ops.asarray([len(doc) for doc in docs], dtype='i')
vecs, bp_vecs = embedding.begin_update(docs, drop=drop)
flat_tokvecs = embedding.ops.flatten(tokvecs)
flat_vecs = embedding.ops.flatten(vecs)
output = embedding.ops.unflatten(
(model.mix[0] * flat_tokvecs + model.mix[1] * flat_vecs), lengths)
def fine_tune_bwd(d_output, sgd=None):
flat_grad = model.ops.flatten(d_output)
model.d_mix[0] += flat_tokvecs.dot(flat_grad.T).sum()
model.d_mix[1] += flat_vecs.dot(flat_grad.T).sum()
bp_vecs([d_o * model.mix[1] for d_o in d_output], sgd=sgd)
if sgd is not None:
sgd(model._mem.weights, model._mem.gradient, key=model.id)
return [d_o * model.mix[0] for d_o in d_output]
return output, fine_tune_bwd
def fine_tune_predict(docs_tokvecs):
docs, tokvecs = docs_tokvecs
vecs = embedding(docs)
return [model.mix[0]*tv+model.mix[1]*v
for tv, v in zip(tokvecs, vecs)]
model = wrap(fine_tune_fwd, embedding)
model.mix = model._mem.add((model.id, 'mix'), (2,))
model.mix.fill(0.5)
model.d_mix = model._mem.add_gradient((model.id, 'd_mix'), (model.id, 'mix'))
model.predict = fine_tune_predict
return model
@layerize
def flatten(seqs, drop=0.):
if isinstance(seqs[0], numpy.ndarray):
@ -552,18 +464,19 @@ def zero_init(model):
@layerize
def preprocess_doc(docs, drop=0.):
keys = [doc.to_array([LOWER]) for doc in docs]
keys = [a[:, 0] for a in keys]
ops = Model.ops
lengths = ops.asarray([arr.shape[0] for arr in keys])
keys = ops.xp.concatenate(keys)
vals = ops.allocate(keys.shape[0]) + 1
return (keys, vals, lengths), None
def getitem(i):
def getitem_fwd(X, drop=0.):
return X[i], None
return layerize(getitem_fwd)
def build_tagger_model(nr_class, **cfg):
embed_size = util.env_opt('embed_size', 7000)
if 'token_vector_width' in cfg:
@ -603,29 +516,6 @@ def SpacyVectors(docs, drop=0.):
return batch, None
def foreach(layer, drop_factor=1.0):
'''Map a layer across elements in a list'''
def foreach_fwd(Xs, drop=0.):
drop *= drop_factor
ys = []
backprops = []
for X in Xs:
y, bp_y = layer.begin_update(X, drop=drop)
ys.append(y)
backprops.append(bp_y)
def foreach_bwd(d_ys, sgd=None):
d_Xs = []
for d_y, bp_y in zip(d_ys, backprops):
if bp_y is not None and bp_y is not None:
d_Xs.append(d_y, sgd=sgd)
else:
d_Xs.append(None)
return d_Xs
return ys, foreach_bwd
model = wrap(foreach_fwd, layer)
return model
def build_text_classifier(nr_class, width=64, **cfg):
nr_vector = cfg.get('nr_vector', 5000)
pretrained_dims = cfg.get('pretrained_dims', 0)
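On the PrecomputableAffine change earlier in this file: the new begin_update expresses the same contraction as the old tensordot, but reshaped so it runs as a single plain matrix multiply. A numpy sketch of the equivalence for the nP > 1 case, assuming W has shape (nF, nO, nP, nI) as the old axes=[[1], [3]] implies:
import numpy
nF, nO, nP, nI, batch = 2, 3, 4, 5, 7
X = numpy.random.rand(batch, nI).astype('f')
W = numpy.random.rand(nF, nO, nP, nI).astype('f')
Yf_tensordot = numpy.tensordot(X, W, axes=[[1], [3]])   # (batch, nF, nO, nP)
Yf_dot = X.dot(W.reshape((nF * nO * nP, nI)).T)         # (batch, nF*nO*nP)
Yf_dot = Yf_dot.reshape((batch, nF, nO, nP))
assert numpy.allclose(Yf_tensordot, Yf_dot, atol=1e-5)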

View File

@ -1,33 +0,0 @@
from libc.stdio cimport fopen, fclose, fread, fwrite, FILE
from cymem.cymem cimport Pool
cdef class CFile:
cdef FILE* fp
cdef unsigned char* data
cdef int is_open
cdef Pool mem
cdef int size # For compatibility with subclass
cdef int i # For compatibility with subclass
cdef int _capacity # For compatibility with subclass
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1
cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *
cdef class StringCFile:
cdef unsigned char* data
cdef int is_open
cdef Pool mem
cdef int size # For compatibility with subclass
cdef int i # For compatibility with subclass
cdef int _capacity # For compatibility with subclass
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1
cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *

View File

@ -1,103 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
from libc.stdio cimport fopen, fclose, fread, fwrite
from libc.string cimport memcpy
cdef class CFile:
def __init__(self, loc, mode, on_open_error=None):
if isinstance(mode, unicode):
mode_str = mode.encode('ascii')
else:
mode_str = mode
if hasattr(loc, 'as_posix'):
loc = loc.as_posix()
self.mem = Pool()
cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
self.fp = fopen(<char*>bytes_loc, mode_str)
if self.fp == NULL:
if on_open_error is not None:
on_open_error()
else:
raise IOError("Could not open binary file %s" % bytes_loc)
self.is_open = True
def __dealloc__(self):
if self.is_open:
fclose(self.fp)
def close(self):
fclose(self.fp)
self.is_open = False
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1:
st = fread(dest, elem_size, number, self.fp)
if st != number:
raise IOError
cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1:
st = fwrite(src, elem_size, number, self.fp)
if st != number:
raise IOError
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *:
cdef void* dest = mem.alloc(number, elem_size)
self.read_into(dest, number, elem_size)
return dest
def write_unicode(self, unicode value):
cdef bytes py_bytes = value.encode('utf8')
cdef char* chars = <char*>py_bytes
self.write(sizeof(char), len(py_bytes), chars)
cdef class StringCFile:
def __init__(self, bytes data, mode, on_open_error=None):
self.mem = Pool()
self.is_open = 1 if 'w' in mode else 0
self._capacity = max(len(data), 8)
self.size = len(data)
self.i = 0
self.data = <unsigned char*>self.mem.alloc(1, self._capacity)
for i in range(len(data)):
self.data[i] = data[i]
def __dealloc__(self):
# Important to override this -- or
# we try to close a non-existent file pointer!
pass
def close(self):
self.is_open = False
def string_data(self):
cdef bytes byte_string = b'\0' * (self.size)
bytes_ptr = <char*>byte_string
for i in range(self.size):
bytes_ptr[i] = self.data[i]
print(byte_string)
return byte_string
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1:
if self.i+(number * elem_size) < self.size:
memcpy(dest, &self.data[self.i], elem_size * number)
self.i += elem_size * number
cdef int write_from(self, void* src, size_t elem_size, size_t number) except -1:
write_size = number * elem_size
if (self.size + write_size) >= self._capacity:
self._capacity = (self.size + write_size) * 2
self.data = <unsigned char*>self.mem.realloc(self.data, self._capacity)
memcpy(&self.data[self.size], src, write_size)
self.size += write_size
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *:
cdef void* dest = mem.alloc(number, elem_size)
self.read_into(dest, number, elem_size)
return dest
def write_unicode(self, unicode value):
cdef bytes py_bytes = value.encode('utf8')
cdef char* chars = <char*>py_bytes
self.write(sizeof(char), len(py_bytes), chars)

View File

@ -1,8 +1,11 @@
# coding: utf8
from __future__ import unicode_literals
import bz2
import gzip
try:
import bz2
import gzip
except ImportError:
pass
import math
from ast import literal_eval
from pathlib import Path

View File

@ -43,7 +43,7 @@ def package(cmd, input_dir, output_dir, meta_path=None, create_meta=False, force
prints(meta_path, title="Reading meta.json from file")
meta = util.read_json(meta_path)
else:
meta = generate_meta()
meta = generate_meta(input_dir)
meta = validate_meta(meta, ['lang', 'name', 'version'])
model_name = meta['lang'] + '_' + meta['name']
@ -77,7 +77,8 @@ def create_file(file_path, contents):
file_path.open('w', encoding='utf-8').write(contents)
def generate_meta():
def generate_meta(model_path):
meta = {}
settings = [('lang', 'Model language', 'en'),
('name', 'Model name', 'model'),
('version', 'Model version', '0.0.0'),
@ -87,31 +88,21 @@ def generate_meta():
('email', 'Author email', False),
('url', 'Author website', False),
('license', 'License', 'CC BY-NC 3.0')]
prints("Enter the package settings for your model.", title="Generating meta.json")
meta = {}
nlp = util.load_model_from_path(Path(model_path))
meta['pipeline'] = nlp.pipe_names
meta['vectors'] = {'width': nlp.vocab.vectors_length,
'entries': len(nlp.vocab.vectors)}
prints("Enter the package settings for your model. The following "
"information will be read from your model data: pipeline, vectors.",
title="Generating meta.json")
for setting, desc, default in settings:
response = util.get_raw_input(desc, default)
meta[setting] = default if response == '' and default else response
meta['pipeline'] = generate_pipeline()
if about.__title__ != 'spacy':
meta['parent_package'] = about.__title__
return meta
def generate_pipeline():
prints("If set to 'True', the default pipeline is used. If set to 'False', "
"the pipeline will be disabled. Components should be specified as a "
"comma-separated list of component names, e.g. tagger, "
"parser, ner. For more information, see the docs on processing pipelines.",
title="Enter your model's pipeline components")
pipeline = util.get_raw_input("Pipeline components", True)
subs = {'True': True, 'False': False}
if pipeline in subs:
return subs[pipeline]
else:
return [p.strip() for p in pipeline.split(',')]
def validate_meta(meta, keys):
for key in keys:
if key not in meta or meta[key] == '':

View File

@ -144,7 +144,10 @@ def train(cmd, lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
file_.write(json_dumps(scorer.scores))
meta_loc = output_path / ('model%d' % i) / 'meta.json'
meta['accuracy'] = scorer.scores
meta['speed'] = {'nwords': nwords, 'cpu':cpu_wps, 'gpu': gpu_wps}
meta['speed'] = {'nwords': nwords, 'cpu': cpu_wps,
'gpu': gpu_wps}
meta['vectors'] = {'width': nlp.vocab.vectors_length,
'entries': len(nlp.vocab.vectors)}
meta['lang'] = nlp.lang
meta['pipeline'] = pipeline
meta['spacy_version'] = '>=%s' % about.__version__

View File

@ -30,6 +30,10 @@ try:
except ImportError:
cupy = None
try:
from thinc.neural.optimizers import Optimizer
except ImportError:
from thinc.neural.optimizers import Adam as Optimizer
pickle = pickle
copy_reg = copy_reg

View File

@ -3,6 +3,16 @@ from __future__ import unicode_literals
def explain(term):
"""Get a description for a given POS tag, dependency label or entity type.
term (unicode): The term to explain.
RETURNS (unicode): The explanation, or `None` if not found in the glossary.
EXAMPLE:
>>> spacy.explain(u'NORP')
>>> doc = nlp(u'Hello world')
>>> print([(w.text, w.tag_, spacy.explain(w.tag_)) for w in doc])
"""
if term in GLOSSARY:
return GLOSSARY[term]
@ -283,6 +293,7 @@ GLOSSARY = {
'PRODUCT': 'Objects, vehicles, foods, etc. (not services)',
'EVENT': 'Named hurricanes, battles, wars, sports events, etc.',
'WORK_OF_ART': 'Titles of books, songs, etc.',
'LAW': 'Named documents made into laws.',
'LANGUAGE': 'Any named language',
'DATE': 'Absolute or relative dates or periods',
'TIME': 'Times smaller than a day',

View File

@ -12,11 +12,11 @@ MORPH_RULES = {
'কি': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Gender': 'Neut', 'PronType': 'Int', 'Case': 'Acc'},
'সে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Three', 'PronType': 'Prs', 'Case': 'Nom'},
'কিসে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Gender': 'Neut', 'PronType': 'Int', 'Case': 'Acc'},
'কাদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'PronType': 'Int', 'Case': 'Acc'},
'তাকে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Three', 'PronType': 'Prs', 'Case': 'Acc'},
'স্বয়ং': {LEMMA: PRON_LEMMA, 'Reflex': 'Yes', 'PronType': 'Ref'},
'কোনগুলো': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Gender': 'Neut', 'PronType': 'Int', 'Case': 'Acc'},
'তুমি': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Nom'},
'তুই': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Nom'},
'তাদেরকে': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Three', 'PronType': 'Prs', 'Case': 'Acc'},
'আমরা': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'One ', 'PronType': 'Prs', 'Case': 'Nom'},
'যিনি': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'PronType': 'Rel', 'Case': 'Nom'},
@ -24,12 +24,15 @@ MORPH_RULES = {
'কোন': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'PronType': 'Int', 'Case': 'Acc'},
'কারা': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'PronType': 'Int', 'Case': 'Acc'},
'তোমাকে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Acc'},
'তোকে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Acc'},
'খোদ': {LEMMA: PRON_LEMMA, 'Reflex': 'Yes', 'PronType': 'Ref'},
'কে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'PronType': 'Int', 'Case': 'Acc'},
'যারা': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'PronType': 'Rel', 'Case': 'Nom'},
'যে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'PronType': 'Rel', 'Case': 'Nom'},
'তোমরা': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Nom'},
'তোরা': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Nom'},
'তোমাদেরকে': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Acc'},
'তোদেরকে': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Acc'},
'আপন': {LEMMA: PRON_LEMMA, 'Reflex': 'Yes', 'PronType': 'Ref'},
'': {LEMMA: PRON_LEMMA, 'PronType': 'Dem'},
'নিজ': {LEMMA: PRON_LEMMA, 'Reflex': 'Yes', 'PronType': 'Ref'},
@ -42,6 +45,10 @@ MORPH_RULES = {
'আমার': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'One', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'মোর': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'One', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'মোদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'One', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'তার': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Three', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'তোমাদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Two', 'PronType': 'Prs', 'Poss': 'Yes',
@ -50,7 +57,13 @@ MORPH_RULES = {
'Case': 'Nom'},
'তোমার': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Two', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'তোর': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Two', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'তাদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Three', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'কাদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'PronType': 'Int', 'Case': 'Acc'},
'তোদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Two', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'যাদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'PronType': 'Int', 'Case': 'Acc'},
}
}

View File

@ -22,7 +22,7 @@ STOP_WORDS = set("""
ি
ি
তখন তত তথ তব তব রপর রই হল িনই
িি িি ি মন
িি িি ি মন
কব কব
ি ি ি ি ি ি ি ি ওয ওয খত
ি ি ওয় ওয় ি
@ -32,7 +32,7 @@ STOP_WORDS = set("""
ফল ি
বছর বদল বর বলত বলল বলল বল বল বল বল বস বহ ি িি ি িষযি যবহ বকতব বন ি
মত মত মত মধযভ মধ মধ মধ মন যম
মত মত মত মধযভ মধ মধ মধ মন যম
যখন যত যতট যথ যদি যদি ওয ওয িি
মন
রকম রয রয়

View File

@ -3,6 +3,9 @@ from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .morph_rules import MORPH_RULES
from ..tag_map import TAG_MAP
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
@ -13,9 +16,12 @@ from ...util import update_exc, add_lookups
class DanishDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: 'da'
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
# morph_rules = MORPH_RULES
tag_map = TAG_MAP
stop_words = STOP_WORDS

View File

@ -0,0 +1,52 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
# Source http://fjern-uv.dk/tal.php
_num_words = """nul
en et to tre fire fem seks syv otte ni ti
elleve tolv tretten fjorten femten seksten sytten atten nitten tyve
enogtyve toogtyve treogtyve fireogtyve femogtyve seksogtyve syvogtyve otteogtyve niogtyve tredive
enogtredive toogtredive treogtredive fireogtredive femogtredive seksogtredive syvogtredive otteogtredive niogtredive fyrre
enogfyrre toogfyrre treogfyrre fireogfyrre femogfyrre seksogfyrre syvogfyrre otteogfyrre niogfyrre halvtreds
enoghalvtreds tooghalvtreds treoghalvtreds fireoghalvtreds femoghalvtreds seksoghalvtreds syvoghalvtreds otteoghalvtreds nioghalvtreds tres
enogtres toogtres treogtres fireogtres femogtres seksogtres syvogtres otteogtres niogtres halvfjerds
enoghalvfjerds tooghalvfjerds treoghalvfjerds fireoghalvfjerds femoghalvfjerds seksoghalvfjerds syvoghalvfjerds otteoghalvfjerds nioghalvfjerds firs
enogfirs toogfirs treogfirs fireogfirs femogfirs seksogfirs syvogfirs otteogfirs niogfirs halvfems
enoghalvfems tooghalvfems treoghalvfems fireoghalvfems femoghalvfems seksoghalvfems syvoghalvfems otteoghalvfems nioghalvfems hundrede
million milliard billion billiard trillion trilliard
""".split()
# source http://www.duda.dk/video/dansk/grammatik/talord/talord.html
_ordinal_words = """nulte
første anden tredje fjerde femte sjette syvende ottende niende tiende
elfte tolvte trettende fjortende femtende sekstende syttende attende nittende tyvende
enogtyvende toogtyvende treogtyvende fireogtyvende femogtyvende seksogtyvende syvogtyvende otteogtyvende niogtyvende tredivte enogtredivte toogtredivte treogtredivte fireogtredivte femogtredivte seksogtredivte syvogtredivte otteogtredivte niogtredivte fyrretyvende
enogfyrretyvende toogfyrretyvende treogfyrretyvende fireogfyrretyvende femogfyrretyvende seksogfyrretyvende syvogfyrretyvende otteogfyrretyvende niogfyrretyvende halvtredsindstyvende enoghalvtredsindstyvende
tooghalvtredsindstyvende treoghalvtredsindstyvende fireoghalvtredsindstyvende femoghalvtredsindstyvende seksoghalvtredsindstyvende syvoghalvtredsindstyvende otteoghalvtredsindstyvende nioghalvtredsindstyvende
tresindstyvende enogtresindstyvende toogtresindstyvende treogtresindstyvende fireogtresindstyvende femogtresindstyvende seksogtresindstyvende syvogtresindstyvende otteogtresindstyvende niogtresindstyvende halvfjerdsindstyvende
enoghalvfjerdsindstyvende tooghalvfjerdsindstyvende treoghalvfjerdsindstyvende fireoghalvfjerdsindstyvende femoghalvfjerdsindstyvende seksoghalvfjerdsindstyvende syvoghalvfjerdsindstyvende otteoghalvfjerdsindstyvende nioghalvfjerdsindstyvende firsindstyvende
enogfirsindstyvende toogfirsindstyvende treogfirsindstyvende fireogfirsindstyvende femogfirsindstyvende seksogfirsindstyvende syvogfirsindstyvende otteogfirsindstyvende niogfirsindstyvende halvfemsindstyvende
enoghalvfemsindstyvende tooghalvfemsindstyvende treoghalvfemsindstyvende fireoghalvfemsindstyvende femoghalvfemsindstyvende seksoghalvfemsindstyvende syvoghalvfemsindstyvende otteoghalvfemsindstyvende nioghalvfemsindstyvende
""".split()
def like_num(text):
text = text.replace(',', '').replace('.', '')
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
return True
if text in _ordinal_words:
return True
return False
LEX_ATTRS = {
LIKE_NUM: like_num
}
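A few hedged spot checks of the new Danish like_num (values chosen for illustration, run against the module above):

assert like_num('tre')      # cardinal number word
assert like_num('fjerde')   # ordinal number word
assert like_num('3.500')    # separators are stripped before isdigit()
assert like_num('1/2')      # simple fractions
assert not like_num('hund')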

View File

@ -0,0 +1,41 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import LEMMA
from ...deprecated import PRON_LEMMA
MORPH_RULES = {
"PRON": {
"jeg": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Nom"},
"mig": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc"},
"du": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two"},
"han": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Nom"},
"ham": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Acc"},
"hun": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Nom"},
"hende": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Acc"},
"den": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut"},
"det": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut"},
"vi": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Nom"},
"os": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Acc"},
"de": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Nom"},
"dem": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc"},
"min": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"},
"din": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"},
"hans": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Poss": "Yes", "Reflex": "Yes"},
"hendes": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Poss": "Yes", "Reflex": "Yes"},
"dens": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut", "Poss": "Yes", "Reflex": "Yes"},
"dets": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut", "Poss": "Yes", "Reflex": "Yes"},
"vores": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"deres": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
},
"VERB": {
"er": {LEMMA: "være", "VerbForm": "Fin", "Tense": "Pres"},
"var": {LEMMA: "være", "VerbForm": "Fin", "Tense": "Past"}
}
}
for tag, rules in MORPH_RULES.items():
for key, attrs in dict(rules).items():
rules[key.title()] = attrs
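The loop above simply mirrors every rule under its title-cased form so sentence-initial tokens get the same analysis; a tiny sketch of the effect:

assert 'Jeg' in MORPH_RULES['PRON']
assert MORPH_RULES['VERB']['Er'] == MORPH_RULES['VERB']['er']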

View File

@ -1,47 +1,46 @@
# encoding: utf8
from __future__ import unicode_literals
# Source: https://github.com/stopwords-iso/stopwords-da
# Source: Handpicked by Jens Dahl Møllerhøj.
STOP_WORDS = set("""
ad af aldrig alle alt anden andet andre at
af aldrig alene alle allerede alligevel alt altid anden andet andre at
bare begge blev blive bliver
bag begge blandt blev blive bliver burde bør
da de dem den denne der deres det dette dig din dine disse dit dog du
da de dem den denne dens der derefter deres derfor derfra deri dermed derpå derved det dette dig din dine disse dog du
efter ej eller en end ene eneste enhver er et
efter egen eller ellers en end endnu ene eneste enhver ens enten er et
far fem fik fire flere fleste for fordi forrige fra får før
flere flest fleste for foran fordi forrige fra før først
god godt
gennem gjorde gjort god gør gøre gørende
ham han hans har havde have hej helt hende hendes her hos hun hvad hvem hver
hvilken hvis hvor hvordan hvorfor hvornår
ham han hans har havde have hel heller hen hende hendes henover her herefter heri hermed herpå hun hvad hvem hver hvilke hvilken hvilkes hvis hvor hvordan hvorefter hvorfor hvorfra hvorhen hvori hvorimod hvornår hvorved
i ikke ind ingen intet
i igen igennem ikke imellem imens imod ind indtil ingen intet
ja jeg jer jeres jo
jeg jer jeres jo
kan kom komme kommer kun kunne
kan kom kommer kun kunne
lad lav lidt lige lille
lad langs lav lave lavet lidt lige ligesom lille længere
man mand mange med meget men mens mere mig min mine mit mod
man mange med meget mellem men mens mere mest mig min mindre mindst mine mit måske
ned nej ni nogen noget nogle nu ny nyt når nær næste næsten
ned nemlig nogen nogensinde noget nogle nok nu ny nyt nær næste næsten
og også okay om op os otte over
og også om omkring op os over overalt
se seks selv ser ses sig sige sin sine sit skal skulle som stor store syv
sådan
samme sammen selv selvom senere ses siden sig sige skal skulle som stadig synes syntes sådan således
tag tage thi ti til to tre
temmelig tidligere til tilbage tit
ud under
ud uden udover under undtagen
var ved vi vil ville vor vores være været
var ved vi via vil ville vore vores vær være været
øvrigt
""".split())

View File

@ -1,11 +1,27 @@
# encoding: utf8
from __future__ import unicode_literals
from ...symbols import ORTH, LEMMA
from ...symbols import ORTH, LEMMA, NORM
_exc = {}
for exc_data in [
{ORTH: "Kbh.", LEMMA: "København", NORM: "København"},
{ORTH: "Jan.", LEMMA: "januar", NORM: "januar"},
{ORTH: "Feb.", LEMMA: "februar", NORM: "februar"},
{ORTH: "Mar.", LEMMA: "marts", NORM: "marts"},
{ORTH: "Apr.", LEMMA: "april", NORM: "april"},
{ORTH: "Maj.", LEMMA: "maj", NORM: "maj"},
{ORTH: "Jun.", LEMMA: "juni", NORM: "juni"},
{ORTH: "Jul.", LEMMA: "juli", NORM: "juli"},
{ORTH: "Aug.", LEMMA: "august", NORM: "august"},
{ORTH: "Sep.", LEMMA: "september", NORM: "september"},
{ORTH: "Okt.", LEMMA: "oktober", NORM: "oktober"},
{ORTH: "Nov.", LEMMA: "november", NORM: "november"},
{ORTH: "Dec.", LEMMA: "december", NORM: "december"}]:
_exc[exc_data[ORTH]] = [dict(exc_data)]
for orth in [
"A/S", "beg.", "bl.a.", "ca.", "d.s.s.", "dvs.", "f.eks.", "fr.", "hhv.",

View File

@ -16,7 +16,7 @@ call can cannot ca could
did do does doing done down due during
each eight either eleven else elsewhere empty enough etc even ever every
each eight either eleven else elsewhere empty enough even ever every
everyone everything everywhere except
few fifteen fifty first five for former formerly forty four from front full
@ -27,7 +27,7 @@ get give go
had has have he hence her here hereafter hereby herein hereupon hers herself
him himself his how however hundred
i if in inc indeed into is it its itself
i if in indeed into is it its itself
keep

View File

@ -1,9 +1,9 @@
# coding: utf8
from __future__ import absolute_import, unicode_literals
from contextlib import contextmanager
import copy
from thinc.neural import Model
from thinc.neural.optimizers import Adam
import random
import ujson
from collections import OrderedDict
@ -16,11 +16,11 @@ from .tokenizer import Tokenizer
from .vocab import Vocab
from .tagger import Tagger
from .lemmatizer import Lemmatizer
from .syntax.parser import get_templates
from .pipeline import NeuralDependencyParser, TokenVectorEncoder, NeuralTagger
from .pipeline import NeuralEntityRecognizer, SimilarityHook, TextCategorizer
from .pipeline import DependencyParser, Tensorizer, Tagger
from .pipeline import EntityRecognizer, SimilarityHook, TextCategorizer
from .compat import Optimizer
from .compat import json_dumps, izip, copy_reg
from .scorer import Scorer
from ._ml import link_vectors_to_models
@ -75,9 +75,6 @@ class BaseDefaults(object):
infixes = tuple(TOKENIZER_INFIXES)
tag_map = dict(TAG_MAP)
tokenizer_exceptions = {}
parser_features = get_templates('parser')
entity_features = get_templates('ner')
tagger_features = Tagger.feature_templates # TODO -- fix this
stop_words = set()
lemma_rules = {}
lemma_exc = {}
@ -102,9 +99,9 @@ class Language(object):
factories = {
'tokenizer': lambda nlp: nlp.Defaults.create_tokenizer(nlp),
'tensorizer': lambda nlp, **cfg: TokenVectorEncoder(nlp.vocab, **cfg),
'tagger': lambda nlp, **cfg: NeuralTagger(nlp.vocab, **cfg),
'parser': lambda nlp, **cfg: NeuralDependencyParser(nlp.vocab, **cfg),
'ner': lambda nlp, **cfg: NeuralEntityRecognizer(nlp.vocab, **cfg),
'tagger': lambda nlp, **cfg: Tagger(nlp.vocab, **cfg),
'parser': lambda nlp, **cfg: DependencyParser(nlp.vocab, **cfg),
'ner': lambda nlp, **cfg: EntityRecognizer(nlp.vocab, **cfg),
'similarity': lambda nlp, **cfg: SimilarityHook(nlp.vocab, **cfg),
'textcat': lambda nlp, **cfg: TextCategorizer(nlp.vocab, **cfg)
}
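Since the factories table now points at the merged classes, here is a hedged sketch of constructing components straight from it (assumes an existing `nlp` object providing the shared vocab):

tagger = Language.factories['tagger'](nlp)
parser = Language.factories['parser'](nlp)
ner = Language.factories['ner'](nlp)
# each factory only needs the nlp object (for its vocab) plus optional **cfg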
@ -127,6 +124,7 @@ class Language(object):
RETURNS (Language): The newly constructed object.
"""
self._meta = dict(meta)
self._path = None
if vocab is True:
factory = self.Defaults.create_vocab
vocab = factory(self, **meta.get('vocab', {}))
@ -142,10 +140,14 @@ class Language(object):
bytes_data = self.to_bytes(vocab=False)
return (unpickle_language, (self.vocab, self.meta, bytes_data))
@property
def path(self):
return self._path
@property
def meta(self):
self._meta.setdefault('lang', self.vocab.lang)
self._meta.setdefault('name', '')
self._meta.setdefault('name', 'model')
self._meta.setdefault('version', '0.0.0')
self._meta.setdefault('spacy_version', about.__version__)
self._meta.setdefault('description', '')
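A small sketch of the defaults the meta property now fills in (a bare Language instance is assumed; note the 'name' default changes from '' to 'model' in this diff):

nlp = Language()
assert nlp.meta['name'] == 'model'
assert nlp.meta['version'] == '0.0.0'
assert nlp.meta['lang'] == nlp.vocab.lang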
@ -329,6 +331,29 @@ class Language(object):
doc = proc(doc)
return doc
def disable_pipes(self, *names):
'''Disable one or more pipeline components.
If used as a context manager, the pipeline will be restored to its initial
state at the end of the block. Otherwise, a DisabledPipes object is
returned that has a `.restore()` method you can use to undo your
changes.
EXAMPLE:
>>> nlp.add_pipe('parser')
>>> nlp.add_pipe('tagger')
>>> with nlp.disable_pipes('parser', 'tagger'):
>>> assert not nlp.has_pipe('parser')
>>> assert nlp.has_pipe('parser')
>>> disabled = nlp.disable_pipes('parser')
>>> assert len(disabled) == 1
>>> assert not nlp.has_pipe('parser')
>>> disabled.restore()
>>> assert nlp.has_pipe('parser')
'''
return DisabledPipes(self, *names)
def make_doc(self, text):
return self.tokenizer(text)
@ -354,7 +379,8 @@ class Language(object):
return
if sgd is None:
if self._optimizer is None:
self._optimizer = Adam(Model.ops, 0.001)
self._optimizer = Optimizer(Model.ops, 0.001,
beta1=0.9, beta2=0.0, nesterov=True)
sgd = self._optimizer
grads = {}
def get_grads(W, dW, key=None):
@ -395,8 +421,8 @@ class Language(object):
eps = util.env_opt('optimizer_eps', 1e-08)
L2 = util.env_opt('L2_penalty', 1e-6)
max_grad_norm = util.env_opt('grad_norm_clip', 1.)
self._optimizer = Adam(Model.ops, learn_rate, L2=L2, beta1=beta1,
beta2=beta2, eps=eps)
self._optimizer = Optimizer(Model.ops, learn_rate, L2=L2, beta1=beta1,
beta2=beta2, eps=eps, nesterov=True)
self._optimizer.max_grad_norm = max_grad_norm
self._optimizer.device = device
return self._optimizer
@ -435,7 +461,7 @@ class Language(object):
eps = util.env_opt('optimizer_eps', 1e-08)
L2 = util.env_opt('L2_penalty', 1e-6)
max_grad_norm = util.env_opt('grad_norm_clip', 1.)
self._optimizer = Adam(Model.ops, learn_rate, L2=L2, beta1=beta1,
self._optimizer = Optimizer(Model.ops, learn_rate, L2=L2, beta1=beta1,
beta2=beta2, eps=eps)
self._optimizer.max_grad_norm = max_grad_norm
self._optimizer.device = device
@ -611,6 +637,7 @@ class Language(object):
if not (path / 'vocab').exists():
exclude['vocab'] = True
util.from_disk(path, deserializers, exclude)
self._path = path
return self
def to_bytes(self, disable=[], **exclude):
@ -655,6 +682,42 @@ class Language(object):
return self
class DisabledPipes(list):
'''Manager for temporary pipeline disabling.'''
def __init__(self, nlp, *names):
self.nlp = nlp
self.names = names
# Important! Not deep copy -- we just want the container (but we also
# want to support people providing arbitrarily typed nlp.pipeline
# objects.)
self.original_pipeline = copy.copy(nlp.pipeline)
list.__init__(self)
self.extend(nlp.remove_pipe(name) for name in names)
def __enter__(self):
return self
def __exit__(self, *args):
self.restore()
def restore(self):
'''Restore the pipeline to its state when DisabledPipes was created.'''
current, self.nlp.pipeline = self.nlp.pipeline, self.original_pipeline
unexpected = [name for name, pipe in current if not self.nlp.has_pipe(name)]
if unexpected:
# Don't change the pipeline if we're raising an error.
self.nlp.pipeline = current
msg = (
"Some current components would be lost when restoring "
"previous pipeline state. If you added components after "
"calling nlp.disable_pipes(), you should remove them "
"explicitly with nlp.remove_pipe() before the pipeline is "
"restore. Names of the new components: %s"
)
raise ValueError(msg % unexpected)
self[:] = []
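Putting disable_pipes and DisabledPipes.restore() together, a hedged usage sketch (assumes the pipeline actually contains 'tagger', 'parser' and 'ner' components):

# As a context manager: components come back automatically
with nlp.disable_pipes('tagger', 'parser'):
    doc = nlp(u'Only the remaining components run on this text.')

# As a handle: restore explicitly when you are done
disabled = nlp.disable_pipes('ner')
assert not nlp.has_pipe('ner')
disabled.restore()
assert nlp.has_pipe('ner')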
def unpickle_language(vocab, meta, bytes_data):
lang = Language(vocab=vocab)
lang.from_bytes(bytes_data)

View File

@ -198,7 +198,6 @@ cdef class Matcher:
cdef public object _patterns
cdef public object _entities
cdef public object _callbacks
cdef public object _acceptors
def __init__(self, vocab):
"""Create the Matcher.
@ -209,7 +208,6 @@ cdef class Matcher:
"""
self._patterns = {}
self._entities = {}
self._acceptors = {}
self._callbacks = {}
self.vocab = vocab
self.mem = Pool()
@ -232,7 +230,7 @@ cdef class Matcher:
key (unicode): The match ID.
RETURNS (bool): Whether the matcher contains rules for this match ID.
"""
return len(self._patterns)
return self._normalize_key(key) in self._patterns
def add(self, key, on_match, *patterns):
"""Add a match-rule to the matcher. A match-rule consists of: an ID key,
@ -257,6 +255,10 @@ cdef class Matcher:
and '*' patterns in a row and their matches overlap, the first
operator will behave non-greedily. This quirk in the semantics
makes the matcher more efficient, by avoiding the need for back-tracking.
key (unicode): The match ID.
on_match (callable): Callback executed on match.
*patterns (list): List of token descriptions.
"""
for pattern in patterns:
if len(pattern) == 0:
@ -473,15 +475,34 @@ cdef class PhraseMatcher:
self._callbacks = {}
def __len__(self):
raise NotImplementedError
"""Get the number of rules added to the matcher. Note that this only
returns the number of rules (identical with the number of IDs), not the
number of individual patterns.
RETURNS (int): The number of rules.
"""
return len(self.phrase_ids)
def __contains__(self, key):
raise NotImplementedError
"""Check whether the matcher contains rules for a match ID.
key (unicode): The match ID.
RETURNS (bool): Whether the matcher contains rules for this match ID.
"""
cdef hash_t ent_id = self.matcher._normalize_key(key)
return ent_id in self._callbacks
def __reduce__(self):
return (self.__class__, (self.vocab,), None, None)
def add(self, key, on_match, *docs):
"""Add a match-rule to the matcher. A match-rule consists of: an ID key,
an on_match callback, and one or more patterns.
key (unicode): The match ID.
on_match (callable): Callback executed on match.
*docs (Doc): `Doc` objects representing match patterns.
"""
cdef Doc doc
for doc in docs:
if len(doc) >= self.max_length:
@ -510,6 +531,13 @@ cdef class PhraseMatcher:
self.phrase_ids.set(phrase_hash, <void*>ent_id)
def __call__(self, Doc doc):
"""Find all sequences matching the supplied patterns on the `Doc`.
doc (Doc): The document to match over.
RETURNS (list): A list of `(key, start, end)` tuples,
describing the matches. A match tuple describes a span
`doc[start:end]`. The `key` is the integer ID of the match rule.
"""
matches = []
for _, start, end in self.matcher(doc):
ent_id = self.accept_match(doc, start, end)
@ -522,6 +550,14 @@ cdef class PhraseMatcher:
return matches
def pipe(self, stream, batch_size=1000, n_threads=2):
"""Match a stream of documents, yielding them in turn.
docs (iterable): A stream of documents.
batch_size (int): The number of documents to accumulate into a working set.
n_threads (int): The number of threads with which to work on the buffer
in parallel, if the `Matcher` implementation supports multi-threading.
YIELDS (Doc): Documents, in order.
"""
for doc in stream:
self(doc)
yield doc
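With __len__, __contains__ and the documented add() now in place on both matchers, a hedged end-to-end sketch (the keys, patterns and texts are made up, and an `nlp` object is assumed for the shared vocab):

from spacy.matcher import Matcher, PhraseMatcher

matcher = Matcher(nlp.vocab)
matcher.add('HelloWorld', None, [{'LOWER': 'hello'}, {'LOWER': 'world'}])
assert 'HelloWorld' in matcher
matches = matcher(nlp(u'Hello world!'))  # list of (key, start, end) tuples

phrase_matcher = PhraseMatcher(nlp.vocab)
phrase_matcher.add('CITY', None, nlp(u'New York'), nlp(u'San Francisco'))
assert 'CITY' in phrase_matcher
doc = nlp(u'I live in New York')
city_spans = [doc[start:end] for key, start, end in phrase_matcher(doc)]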

View File

@ -1,21 +0,0 @@
from .syntax.parser cimport Parser
#from .syntax.beam_parser cimport BeamParser
from .syntax.ner cimport BiluoPushDown
from .syntax.arc_eager cimport ArcEager
from .tagger cimport Tagger
cdef class EntityRecognizer(Parser):
pass
cdef class DependencyParser(Parser):
pass
#cdef class BeamEntityRecognizer(BeamParser):
# pass
#
#
#cdef class BeamDependencyParser(BeamParser):
# pass

View File

@ -26,11 +26,8 @@ from thinc.neural.util import to_categorical
from thinc.neural._classes.difference import Siamese, CauchySimilarity
from .tokens.doc cimport Doc
from .syntax.parser cimport Parser as LinearParser
from .syntax.nn_parser cimport Parser as NeuralParser
from .syntax.nn_parser cimport Parser
from .syntax import nonproj
from .syntax.parser import get_templates as get_feature_templates
from .syntax.beam_parser cimport BeamParser
from .syntax.ner cimport BiluoPushDown
from .syntax.arc_eager cimport ArcEager
from .tagger import Tagger
@ -42,7 +39,7 @@ from .syntax import nonproj
from .compat import json_dumps
from .attrs import ID, LOWER, PREFIX, SUFFIX, SHAPE, TAG, DEP, POS
from ._ml import rebatch, Tok2Vec, flatten
from ._ml import Tok2Vec, flatten
from ._ml import build_text_classifier, build_tagger_model
from ._ml import link_vectors_to_models
from .parts_of_speech import X
@ -86,7 +83,7 @@ class SentenceSegmenter(object):
yield doc[start : len(doc)]
class BaseThincComponent(object):
class Pipe(object):
name = None
@classmethod
@ -217,7 +214,7 @@ def _load_cfg(path):
return {}
class TokenVectorEncoder(BaseThincComponent):
class Tensorizer(Pipe):
"""Assign position-sensitive vectors to tokens, using a CNN or RNN."""
name = 'tensorizer'
@ -329,7 +326,7 @@ class TokenVectorEncoder(BaseThincComponent):
link_vectors_to_models(self.vocab)
class NeuralTagger(BaseThincComponent):
class Tagger(Pipe):
name = 'tagger'
def __init__(self, vocab, model=True, **cfg):
self.vocab = vocab
@ -420,8 +417,6 @@ class NeuralTagger(BaseThincComponent):
new_tag_map[tag] = orig_tag_map[tag]
else:
new_tag_map[tag] = {POS: X}
if 'SP' not in new_tag_map:
new_tag_map['SP'] = orig_tag_map.get('SP', {POS: X})
cdef Vocab vocab = self.vocab
if new_tag_map:
vocab.morphology = Morphology(vocab.strings, new_tag_map,
@ -513,7 +508,11 @@ class NeuralTagger(BaseThincComponent):
return self
class NeuralLabeller(NeuralTagger):
class MultitaskObjective(Tagger):
'''Assist training of a parser or tagger by training a side-objective.
Experimental
'''
name = 'nn_labeller'
def __init__(self, vocab, model=True, target='dep_tag_offset', **cfg):
self.vocab = vocab
@ -532,7 +531,7 @@ class NeuralLabeller(NeuralTagger):
self.make_label = target
else:
raise ValueError(
"NeuralLabeller target should be function or one of "
"MultitaskObjective target should be function or one of "
"['dep', 'tag', 'ent', 'dep_tag_offset', 'ent_tag']")
self.cfg = dict(cfg)
self.cfg.setdefault('cnn_maxout_pieces', 2)
@ -622,7 +621,7 @@ class NeuralLabeller(NeuralTagger):
return '%s-%s' % (tags[i], ents[i])
class SimilarityHook(BaseThincComponent):
class SimilarityHook(Pipe):
"""
Experimental
@ -674,7 +673,7 @@ class SimilarityHook(BaseThincComponent):
link_vectors_to_models(self.vocab)
class TextCategorizer(BaseThincComponent):
class TextCategorizer(Pipe):
name = 'textcat'
@classmethod
@ -752,45 +751,7 @@ class TextCategorizer(BaseThincComponent):
link_vectors_to_models(self.vocab)
cdef class EntityRecognizer(LinearParser):
"""Annotate named entities on Doc objects."""
TransitionSystem = BiluoPushDown
feature_templates = get_feature_templates('ner')
def add_label(self, label):
LinearParser.add_label(self, label)
if isinstance(label, basestring):
label = self.vocab.strings[label]
cdef class BeamEntityRecognizer(BeamParser):
"""Annotate named entities on Doc objects."""
TransitionSystem = BiluoPushDown
feature_templates = get_feature_templates('ner')
def add_label(self, label):
LinearParser.add_label(self, label)
if isinstance(label, basestring):
label = self.vocab.strings[label]
cdef class DependencyParser(LinearParser):
TransitionSystem = ArcEager
feature_templates = get_feature_templates('basic')
def add_label(self, label):
LinearParser.add_label(self, label)
if isinstance(label, basestring):
label = self.vocab.strings[label]
@property
def postprocesses(self):
return [nonproj.deprojectivize]
cdef class NeuralDependencyParser(NeuralParser):
cdef class DependencyParser(Parser):
name = 'parser'
TransitionSystem = ArcEager
@ -800,17 +761,17 @@ cdef class NeuralDependencyParser(NeuralParser):
def init_multitask_objectives(self, gold_tuples, pipeline, **cfg):
for target in []:
labeller = NeuralLabeller(self.vocab, target=target)
labeller = MultitaskObjective(self.vocab, target=target)
tok2vec = self.model[0]
labeller.begin_training(gold_tuples, pipeline=pipeline, tok2vec=tok2vec)
pipeline.append(labeller)
self._multitasks.append(labeller)
def __reduce__(self):
return (NeuralDependencyParser, (self.vocab, self.moves, self.model), None, None)
return (DependencyParser, (self.vocab, self.moves, self.model), None, None)
cdef class NeuralEntityRecognizer(NeuralParser):
cdef class EntityRecognizer(Parser):
name = 'ner'
TransitionSystem = BiluoPushDown
@ -818,31 +779,14 @@ cdef class NeuralEntityRecognizer(NeuralParser):
def init_multitask_objectives(self, gold_tuples, pipeline, **cfg):
for target in []:
labeller = NeuralLabeller(self.vocab, target=target)
labeller = MultitaskObjective(self.vocab, target=target)
tok2vec = self.model[0]
labeller.begin_training(gold_tuples, pipeline=pipeline, tok2vec=tok2vec)
pipeline.append(labeller)
self._multitasks.append(labeller)
def __reduce__(self):
return (NeuralEntityRecognizer, (self.vocab, self.moves, self.model), None, None)
return (EntityRecognizer, (self.vocab, self.moves, self.model), None, None)
cdef class BeamDependencyParser(BeamParser):
TransitionSystem = ArcEager
feature_templates = get_feature_templates('basic')
def add_label(self, label):
Parser.add_label(self, label)
if isinstance(label, basestring):
label = self.vocab.strings[label]
@property
def postprocesses(self):
return [nonproj.deprojectivize]
__all__ = ['Tagger', 'DependencyParser', 'EntityRecognizer', 'BeamDependencyParser',
'BeamEntityRecognizer', 'TokenVectorEnoder']
__all__ = ['Tagger', 'DependencyParser', 'EntityRecognizer', 'Tensorizer']

View File

@ -1,259 +0,0 @@
from thinc.typedefs cimport atom_t
from .stateclass cimport StateClass
from ._state cimport StateC
cdef int fill_context(atom_t* context, const StateC* state) nogil
# Context elements
# Ensure each token's attributes are listed: w, p, c, c6, c4. The order
# is referenced by incrementing the enum...
# Tokens are listed in left-to-right order.
#cdef size_t* SLOTS = [
# S2w, S1w,
# S0l0w, S0l2w, S0lw,
# S0w,
# S0r0w, S0r2w, S0rw,
# N0l0w, N0l2w, N0lw,
# P2w, P1w,
# N0w, N1w, N2w, N3w, 0
#]
# NB: The order of the enum is _NOT_ arbitrary!!
cpdef enum:
S2w
S2W
S2p
S2c
S2c4
S2c6
S2L
S2_prefix
S2_suffix
S2_shape
S2_ne_iob
S2_ne_type
S1w
S1W
S1p
S1c
S1c4
S1c6
S1L
S1_prefix
S1_suffix
S1_shape
S1_ne_iob
S1_ne_type
S1rw
S1rW
S1rp
S1rc
S1rc4
S1rc6
S1rL
S1r_prefix
S1r_suffix
S1r_shape
S1r_ne_iob
S1r_ne_type
S0lw
S0lW
S0lp
S0lc
S0lc4
S0lc6
S0lL
S0l_prefix
S0l_suffix
S0l_shape
S0l_ne_iob
S0l_ne_type
S0l2w
S0l2W
S0l2p
S0l2c
S0l2c4
S0l2c6
S0l2L
S0l2_prefix
S0l2_suffix
S0l2_shape
S0l2_ne_iob
S0l2_ne_type
S0w
S0W
S0p
S0c
S0c4
S0c6
S0L
S0_prefix
S0_suffix
S0_shape
S0_ne_iob
S0_ne_type
S0r2w
S0r2W
S0r2p
S0r2c
S0r2c4
S0r2c6
S0r2L
S0r2_prefix
S0r2_suffix
S0r2_shape
S0r2_ne_iob
S0r2_ne_type
S0rw
S0rW
S0rp
S0rc
S0rc4
S0rc6
S0rL
S0r_prefix
S0r_suffix
S0r_shape
S0r_ne_iob
S0r_ne_type
N0l2w
N0l2W
N0l2p
N0l2c
N0l2c4
N0l2c6
N0l2L
N0l2_prefix
N0l2_suffix
N0l2_shape
N0l2_ne_iob
N0l2_ne_type
N0lw
N0lW
N0lp
N0lc
N0lc4
N0lc6
N0lL
N0l_prefix
N0l_suffix
N0l_shape
N0l_ne_iob
N0l_ne_type
N0w
N0W
N0p
N0c
N0c4
N0c6
N0L
N0_prefix
N0_suffix
N0_shape
N0_ne_iob
N0_ne_type
N1w
N1W
N1p
N1c
N1c4
N1c6
N1L
N1_prefix
N1_suffix
N1_shape
N1_ne_iob
N1_ne_type
N2w
N2W
N2p
N2c
N2c4
N2c6
N2L
N2_prefix
N2_suffix
N2_shape
N2_ne_iob
N2_ne_type
P1w
P1W
P1p
P1c
P1c4
P1c6
P1L
P1_prefix
P1_suffix
P1_shape
P1_ne_iob
P1_ne_type
P2w
P2W
P2p
P2c
P2c4
P2c6
P2L
P2_prefix
P2_suffix
P2_shape
P2_ne_iob
P2_ne_type
E0w
E0W
E0p
E0c
E0c4
E0c6
E0L
E0_prefix
E0_suffix
E0_shape
E0_ne_iob
E0_ne_type
E1w
E1W
E1p
E1c
E1c4
E1c6
E1L
E1_prefix
E1_suffix
E1_shape
E1_ne_iob
E1_ne_type
# Misc features at the end
dist
N0lv
S0lv
S0rv
S1lv
S1rv
S0_has_head
S1_has_head
S2_has_head
CONTEXT_SIZE

View File

@ -1,419 +0,0 @@
"""
Fill an array, context, with every _atomic_ value our features reference.
We then write the _actual features_ as tuples of the atoms. The machinery
that translates from the tuples to feature-extractors (which pick the values
out of "context") is in features/extractor.pyx
The atomic feature names are listed in a big enum, so that the feature tuples
can refer to them.
"""
# coding: utf-8
from __future__ import unicode_literals
from libc.string cimport memset
from itertools import combinations
from cymem.cymem cimport Pool
from ..structs cimport TokenC
from .stateclass cimport StateClass
from ._state cimport StateC
cdef inline void fill_token(atom_t* context, const TokenC* token) nogil:
if token is NULL:
context[0] = 0
context[1] = 0
context[2] = 0
context[3] = 0
context[4] = 0
context[5] = 0
context[6] = 0
context[7] = 0
context[8] = 0
context[9] = 0
context[10] = 0
context[11] = 0
else:
context[0] = token.lex.orth
context[1] = token.lemma
context[2] = token.tag
context[3] = token.lex.cluster
# We've read in the string little-endian, so now we can take & (2**n)-1
# to get the first n bits of the cluster.
# e.g. s = "1110010101"
# s = ''.join(reversed(s))
# first_4_bits = int(s, 2)
# print first_4_bits
# 5
# print "{0:b}".format(prefix).ljust(4, '0')
# 1110
# What we're doing here is picking a number where all bits are 1, e.g.
# 15 is 1111, 63 is 111111 and doing bitwise AND, so getting all bits in
# the source that are set to 1.
context[4] = token.lex.cluster & 15
context[5] = token.lex.cluster & 63
context[6] = token.dep if token.head != 0 else 0
context[7] = token.lex.prefix
context[8] = token.lex.suffix
context[9] = token.lex.shape
context[10] = token.ent_iob
context[11] = token.ent_type
cdef int fill_context(atom_t* ctxt, const StateC* st) nogil:
# Take care to fill every element of context!
# We could memset, but this makes it very easy to have broken features that
# make almost no impact on accuracy. If instead they're unset, the impact
# tends to be dramatic, so we get an obvious regression to fix...
fill_token(&ctxt[S2w], st.S_(2))
fill_token(&ctxt[S1w], st.S_(1))
fill_token(&ctxt[S1rw], st.R_(st.S(1), 1))
fill_token(&ctxt[S0lw], st.L_(st.S(0), 1))
fill_token(&ctxt[S0l2w], st.L_(st.S(0), 2))
fill_token(&ctxt[S0w], st.S_(0))
fill_token(&ctxt[S0r2w], st.R_(st.S(0), 2))
fill_token(&ctxt[S0rw], st.R_(st.S(0), 1))
fill_token(&ctxt[N0lw], st.L_(st.B(0), 1))
fill_token(&ctxt[N0l2w], st.L_(st.B(0), 2))
fill_token(&ctxt[N0w], st.B_(0))
fill_token(&ctxt[N1w], st.B_(1))
fill_token(&ctxt[N2w], st.B_(2))
fill_token(&ctxt[P1w], st.safe_get(st.B(0)-1))
fill_token(&ctxt[P2w], st.safe_get(st.B(0)-2))
fill_token(&ctxt[E0w], st.E_(0))
fill_token(&ctxt[E1w], st.E_(1))
if st.stack_depth() >= 1 and not st.eol():
ctxt[dist] = min_(st.B(0) - st.E(0), 5)
else:
ctxt[dist] = 0
ctxt[N0lv] = min_(st.n_L(st.B(0)), 5)
ctxt[S0lv] = min_(st.n_L(st.S(0)), 5)
ctxt[S0rv] = min_(st.n_R(st.S(0)), 5)
ctxt[S1lv] = min_(st.n_L(st.S(1)), 5)
ctxt[S1rv] = min_(st.n_R(st.S(1)), 5)
ctxt[S0_has_head] = 0
ctxt[S1_has_head] = 0
ctxt[S2_has_head] = 0
if st.stack_depth() >= 1:
ctxt[S0_has_head] = st.has_head(st.S(0)) + 1
if st.stack_depth() >= 2:
ctxt[S1_has_head] = st.has_head(st.S(1)) + 1
if st.stack_depth() >= 3:
ctxt[S2_has_head] = st.has_head(st.S(2)) + 1
cdef inline int min_(int a, int b) nogil:
return a if a > b else b
ner = (
(N0W,),
(P1W,),
(N1W,),
(P2W,),
(N2W,),
(P1W, N0W,),
(N0W, N1W),
(N0_prefix,),
(N0_suffix,),
(P1_shape,),
(N0_shape,),
(N1_shape,),
(P1_shape, N0_shape,),
(N0_shape, P1_shape,),
(P1_shape, N0_shape, N1_shape),
(N2_shape,),
(P2_shape,),
#(P2_norm, P1_norm, W_norm),
#(P1_norm, W_norm, N1_norm),
#(W_norm, N1_norm, N2_norm)
(P2p,),
(P1p,),
(N0p,),
(N1p,),
(N2p,),
(P1p, N0p),
(N0p, N1p),
(P2p, P1p, N0p),
(P1p, N0p, N1p),
(N0p, N1p, N2p),
(P2c,),
(P1c,),
(N0c,),
(N1c,),
(N2c,),
(P1c, N0c),
(N0c, N1c),
(E0W,),
(E0c,),
(E0p,),
(E0W, N0W),
(E0c, N0W),
(E0p, N0W),
(E0p, P1p, N0p),
(E0c, P1c, N0c),
(E0w, P1c),
(E0p, P1p),
(E0c, P1c),
(E0p, E1p),
(E0c, P1p),
(E1W,),
(E1c,),
(E1p,),
(E0W, E1W),
(E0W, E1p,),
(E0p, E1W,),
(E0p, E1W),
(P1_ne_iob,),
(P1_ne_iob, P1_ne_type),
(N0w, P1_ne_iob, P1_ne_type),
(N0_shape,),
(N1_shape,),
(N2_shape,),
(P1_shape,),
(P2_shape,),
(N0_prefix,),
(N0_suffix,),
(P1_ne_iob,),
(P2_ne_iob,),
(P1_ne_iob, P2_ne_iob),
(P1_ne_iob, P1_ne_type),
(P2_ne_iob, P2_ne_type),
(N0w, P1_ne_iob, P1_ne_type),
(N0w, N1w),
)
unigrams = (
(S2W, S2p),
(S2c6, S2p),
(S1W, S1p),
(S1c6, S1p),
(S0W, S0p),
(S0c6, S0p),
(N0W, N0p),
(N0p,),
(N0c,),
(N0c6, N0p),
(N0L,),
(N1W, N1p),
(N1c6, N1p),
(N2W, N2p),
(N2c6, N2p),
(S0r2W, S0r2p),
(S0r2c6, S0r2p),
(S0r2L,),
(S0rW, S0rp),
(S0rc6, S0rp),
(S0rL,),
(S0l2W, S0l2p),
(S0l2c6, S0l2p),
(S0l2L,),
(S0lW, S0lp),
(S0lc6, S0lp),
(S0lL,),
(N0l2W, N0l2p),
(N0l2c6, N0l2p),
(N0l2L,),
(N0lW, N0lp),
(N0lc6, N0lp),
(N0lL,),
)
s0_n0 = (
(S0W, S0p, N0W, N0p),
(S0c, S0p, N0c, N0p),
(S0c6, S0p, N0c6, N0p),
(S0c4, S0p, N0c4, N0p),
(S0p, N0p),
(S0W, N0p),
(S0p, N0W),
(S0W, N0c),
(S0c, N0W),
(S0p, N0c),
(S0c, N0p),
(S0W, S0rp, N0p),
(S0p, S0rp, N0p),
(S0p, N0lp, N0W),
(S0p, N0lp, N0p),
(S0L, N0p),
(S0p, S0rL, N0p),
(S0p, N0lL, N0p),
(S0p, S0rv, N0p),
(S0p, N0lv, N0p),
(S0c6, S0rL, S0r2L, N0p),
(S0p, N0lL, N0l2L, N0p),
)
s1_s0 = (
(S1p, S0p),
(S1p, S0p, S0_has_head),
(S1W, S0p),
(S1W, S0p, S0_has_head),
(S1c, S0p),
(S1c, S0p, S0_has_head),
(S1p, S1rL, S0p),
(S1p, S1rL, S0p, S0_has_head),
(S1p, S0lL, S0p),
(S1p, S0lL, S0p, S0_has_head),
(S1p, S0lL, S0l2L, S0p),
(S1p, S0lL, S0l2L, S0p, S0_has_head),
(S1L, S0L, S0W),
(S1L, S0L, S0p),
(S1p, S1L, S0L, S0p),
(S1p, S0p),
)
s1_n0 = (
(S1p, N0p),
(S1c, N0c),
(S1c, N0p),
(S1p, N0c),
(S1W, S1p, N0p),
(S1p, N0W, N0p),
(S1c6, S1p, N0c6, N0p),
(S1L, N0p),
(S1p, S1rL, N0p),
(S1p, S1rp, N0p),
)
s0_n1 = (
(S0p, N1p),
(S0c, N1c),
(S0c, N1p),
(S0p, N1c),
(S0W, S0p, N1p),
(S0p, N1W, N1p),
(S0c6, S0p, N1c6, N1p),
(S0L, N1p),
(S0p, S0rL, N1p),
)
n0_n1 = (
(N0W, N0p, N1W, N1p),
(N0W, N0p, N1p),
(N0p, N1W, N1p),
(N0c, N0p, N1c, N1p),
(N0c6, N0p, N1c6, N1p),
(N0c, N1c),
(N0p, N1c),
)
tree_shape = (
(dist,),
(S0p, S0_has_head, S1_has_head, S2_has_head),
(S0p, S0lv, S0rv),
(N0p, N0lv),
)
trigrams = (
(N0p, N1p, N2p),
(S0p, S0lp, S0l2p),
(S0p, S0rp, S0r2p),
(S0p, S1p, S2p),
(S1p, S0p, N0p),
(S0p, S0lp, N0p),
(S0p, N0p, N0lp),
(N0p, N0lp, N0l2p),
(S0W, S0p, S0rL, S0r2L),
(S0p, S0rL, S0r2L),
(S0W, S0p, S0lL, S0l2L),
(S0p, S0lL, S0l2L),
(N0W, N0p, N0lL, N0l2L),
(N0p, N0lL, N0l2L),
)
words = (
S2w,
S1w,
S1rw,
S0lw,
S0l2w,
S0w,
S0r2w,
S0rw,
N0lw,
N0l2w,
N0w,
N1w,
N2w,
P1w,
P2w
)
tags = (
S2p,
S1p,
S1rp,
S0lp,
S0l2p,
S0p,
S0r2p,
S0rp,
N0lp,
N0l2p,
N0p,
N1p,
N2p,
P1p,
P2p
)
labels = (
S2L,
S1L,
S1rL,
S0lL,
S0l2L,
S0L,
S0r2L,
S0rL,
N0lL,
N0l2L,
N0L,
N1L,
N2L,
P1L,
P2L
)

View File

@ -1,10 +0,0 @@
from .parser cimport Parser
from ..structs cimport TokenC
from thinc.typedefs cimport weight_t
cdef class BeamParser(Parser):
cdef public int beam_width
cdef public weight_t beam_density
cdef int _parseC(self, TokenC* tokens, int length, int nr_feat, int nr_class) except -1

View File

@ -1,239 +0,0 @@
"""
MALT-style dependency parser
"""
# cython: profile=True
# cython: experimental_cpp_class_def=True
# cython: cdivision=True
# cython: infer_types=True
# coding: utf-8
from __future__ import unicode_literals, print_function
cimport cython
from cpython.ref cimport PyObject, Py_INCREF, Py_XDECREF
from libc.stdint cimport uint32_t, uint64_t
from libc.string cimport memset, memcpy
from libc.stdlib cimport rand
from libc.math cimport log, exp, isnan, isinf
from cymem.cymem cimport Pool, Address
from murmurhash.mrmr cimport real_hash64 as hash64
from thinc.typedefs cimport weight_t, class_t, feat_t, atom_t, hash_t
from thinc.linear.features cimport ConjunctionExtracter
from thinc.structs cimport FeatureC, ExampleC
from thinc.extra.search cimport Beam, MaxViolation
from thinc.extra.eg cimport Example
from thinc.extra.mb cimport Minibatch
from ..structs cimport TokenC
from ..tokens.doc cimport Doc
from ..strings cimport StringStore
from .transition_system cimport TransitionSystem, Transition
from ..gold cimport GoldParse
from . import _parse_features
from ._parse_features cimport CONTEXT_SIZE
from ._parse_features cimport fill_context
from .stateclass cimport StateClass
from .parser cimport Parser
DEBUG = False
def set_debug(val):
global DEBUG
DEBUG = val
def get_templates(name):
pf = _parse_features
if name == 'ner':
return pf.ner
elif name == 'debug':
return pf.unigrams
else:
return (pf.unigrams + pf.s0_n0 + pf.s1_n0 + pf.s1_s0 + pf.s0_n1 + pf.n0_n1 + \
pf.tree_shape + pf.trigrams)
cdef int BEAM_WIDTH = 16
cdef weight_t BEAM_DENSITY = 0.001
cdef class BeamParser(Parser):
def __init__(self, *args, **kwargs):
self.beam_width = kwargs.get('beam_width', BEAM_WIDTH)
self.beam_density = kwargs.get('beam_density', BEAM_DENSITY)
Parser.__init__(self, *args, **kwargs)
cdef int parseC(self, TokenC* tokens, int length, int nr_feat) nogil:
with gil:
self._parseC(tokens, length, nr_feat, self.moves.n_moves)
cdef int _parseC(self, TokenC* tokens, int length, int nr_feat, int nr_class) except -1:
cdef Beam beam = Beam(self.moves.n_moves, self.beam_width, min_density=self.beam_density)
# TODO: How do we handle new labels here? This increases nr_class
beam.initialize(self.moves.init_beam_state, length, tokens)
beam.check_done(_check_final_state, NULL)
if beam.is_done:
_cleanup(beam)
return 0
while not beam.is_done:
self._advance_beam(beam, None, False)
state = <StateClass>beam.at(0)
self.moves.finalize_state(state.c)
for i in range(length):
tokens[i] = state.c._sent[i]
_cleanup(beam)
def update(self, Doc tokens, GoldParse gold_parse, itn=0):
self.moves.preprocess_gold(gold_parse)
cdef Beam pred = Beam(self.moves.n_moves, self.beam_width)
pred.initialize(self.moves.init_beam_state, tokens.length, tokens.c)
pred.check_done(_check_final_state, NULL)
# Hack for NER
for i in range(pred.size):
stcls = <StateClass>pred.at(i)
self.moves.initialize_state(stcls.c)
cdef Beam gold = Beam(self.moves.n_moves, self.beam_width, min_density=0.0)
gold.initialize(self.moves.init_beam_state, tokens.length, tokens.c)
gold.check_done(_check_final_state, NULL)
violn = MaxViolation()
while not pred.is_done and not gold.is_done:
# We search separately here, to allow for ambiguity in the gold parse.
self._advance_beam(pred, gold_parse, False)
self._advance_beam(gold, gold_parse, True)
violn.check_crf(pred, gold)
if pred.loss > 0 and pred.min_score > (gold.score + self.model.time):
break
else:
# The non-monotonic oracle makes it difficult to ensure final costs are
# correct. Therefore do final correction
for i in range(pred.size):
if self.moves.is_gold_parse(<StateClass>pred.at(i), gold_parse):
pred._states[i].loss = 0.0
elif pred._states[i].loss == 0.0:
pred._states[i].loss = 1.0
violn.check_crf(pred, gold)
if pred.size < 1:
raise Exception("No candidates", tokens.length)
if gold.size < 1:
raise Exception("No gold", tokens.length)
if pred.loss == 0:
self.model.update_from_histories(self.moves, tokens, [(0.0, [])])
elif True:
#_check_train_integrity(pred, gold, gold_parse, self.moves)
histories = list(zip(violn.p_probs, violn.p_hist)) + \
list(zip(violn.g_probs, violn.g_hist))
self.model.update_from_histories(self.moves, tokens, histories, min_grad=0.001**(itn+1))
else:
self.model.update_from_histories(self.moves, tokens,
[(1.0, violn.p_hist[0]), (-1.0, violn.g_hist[0])])
_cleanup(pred)
_cleanup(gold)
return pred.loss
def _advance_beam(self, Beam beam, GoldParse gold, bint follow_gold):
cdef atom_t[CONTEXT_SIZE] context
cdef Pool mem = Pool()
features = <FeatureC*>mem.alloc(self.model.nr_feat, sizeof(FeatureC))
if False:
mb = Minibatch(self.model.widths, beam.size)
for i in range(beam.size):
stcls = <StateClass>beam.at(i)
if stcls.c.is_final():
nr_feat = 0
else:
nr_feat = self.model.set_featuresC(context, features, stcls.c)
self.moves.set_valid(beam.is_valid[i], stcls.c)
mb.c.push_back(features, nr_feat, beam.costs[i], beam.is_valid[i], 0)
self.model(mb)
for i in range(beam.size):
memcpy(beam.scores[i], mb.c.scores(i), mb.c.nr_out() * sizeof(beam.scores[i][0]))
else:
for i in range(beam.size):
stcls = <StateClass>beam.at(i)
if not stcls.is_final():
nr_feat = self.model.set_featuresC(context, features, stcls.c)
self.moves.set_valid(beam.is_valid[i], stcls.c)
self.model.set_scoresC(beam.scores[i], features, nr_feat)
if gold is not None:
n_gold = 0
lines = []
for i in range(beam.size):
stcls = <StateClass>beam.at(i)
if not stcls.c.is_final():
self.moves.set_costs(beam.is_valid[i], beam.costs[i], stcls, gold)
if follow_gold:
for j in range(self.moves.n_moves):
if beam.costs[i][j] >= 1:
beam.is_valid[i][j] = 0
lines.append((stcls.B(0), stcls.B(1),
stcls.B_(0).ent_iob, stcls.B_(1).ent_iob,
stcls.B_(1).sent_start,
j,
beam.is_valid[i][j], 'set invalid',
beam.costs[i][j], self.moves.c[j].move, self.moves.c[j].label))
n_gold += 1 if beam.is_valid[i][j] else 0
if follow_gold and n_gold == 0:
raise Exception("No gold")
if follow_gold:
beam.advance(_transition_state, NULL, <void*>self.moves.c)
else:
beam.advance(_transition_state, _hash_state, <void*>self.moves.c)
beam.check_done(_check_final_state, NULL)
# These are passed as callbacks to thinc.search.Beam
cdef int _transition_state(void* _dest, void* _src, class_t clas, void* _moves) except -1:
dest = <StateClass>_dest
src = <StateClass>_src
moves = <const Transition*>_moves
dest.clone(src)
moves[clas].do(dest.c, moves[clas].label)
cdef int _check_final_state(void* _state, void* extra_args) except -1:
return (<StateClass>_state).is_final()
def _cleanup(Beam beam):
for i in range(beam.width):
Py_XDECREF(<PyObject*>beam._states[i].content)
Py_XDECREF(<PyObject*>beam._parents[i].content)
cdef hash_t _hash_state(void* _state, void* _) except 0:
state = <StateClass>_state
if state.c.is_final():
return 1
else:
return state.c.hash()
def _check_train_integrity(Beam pred, Beam gold, GoldParse gold_parse, TransitionSystem moves):
for i in range(pred.size):
if not pred._states[i].is_done or pred._states[i].loss == 0:
continue
state = <StateClass>pred.at(i)
if moves.is_gold_parse(state, gold_parse) == True:
for dep in gold_parse.orig_annot:
print(dep[1], dep[3], dep[4])
print("Cost", pred._states[i].loss)
for j in range(gold_parse.length):
print(gold_parse.orig_annot[j][1], state.H(j), moves.strings[state.safe_get(j).dep])
acts = [moves.c[clas].move for clas in pred.histories[i]]
labels = [moves.c[clas].label for clas in pred.histories[i]]
print([moves.move_name(move, label) for move, label in zip(acts, labels)])
raise Exception("Predicted state is gold-standard")
for i in range(gold.size):
if not gold._states[i].is_done:
continue
state = <StateClass>gold.at(i)
if moves.is_gold(state, gold_parse) == False:
print("Truth")
for dep in gold_parse.orig_annot:
print(dep[1], dep[3], dep[4])
print("Predicted good")
for j in range(gold_parse.length):
print(gold_parse.orig_annot[j][1], state.H(j), moves.strings[state.safe_get(j).dep])
raise Exception("Gold parse is not gold-standard")

View File

@ -47,15 +47,12 @@ from thinc.neural.util import get_array_module
from .. import util
from ..util import get_async, get_cuda_stream
from .._ml import zero_init, PrecomputableAffine
from .._ml import Tok2Vec, doc2feats, rebatch, fine_tune
from .._ml import Tok2Vec, doc2feats
from .._ml import Residual, drop_layer, flatten
from .._ml import link_vectors_to_models
from .._ml import HistoryFeatures
from ..compat import json_dumps, copy_array
from . import _parse_features
from ._parse_features cimport CONTEXT_SIZE
from ._parse_features cimport fill_context
from .stateclass cimport StateClass
from ._state cimport StateC
from . import nonproj
@ -261,7 +258,7 @@ cdef class Parser:
hist_width = util.env_opt('history_width', cfg.get('hist_width', 0))
if hist_size != 0:
raise ValueError("Currently history size is hard-coded to 0")
if hist_width != 0:
if hist_width != 0:
raise ValueError("Currently history width is hard-coded to 0")
tok2vec = Tok2Vec(token_vector_width, embed_size,
pretrained_dims=cfg.get('pretrained_dims', 0))
@ -434,8 +431,7 @@ cdef class Parser:
cdef int nr_hidden = hidden_weights.shape[0]
cdef int nr_task = states.size()
with nogil:
for i in cython.parallel.prange(nr_task, num_threads=2,
schedule='guided'):
for i in range(nr_task):
self._parseC(states[i],
feat_weights, bias, hW, hb,
nr_class, nr_hidden, nr_feat, nr_piece)
@ -454,7 +450,6 @@ cdef class Parser:
with gil:
PyErr_SetFromErrno(MemoryError)
PyErr_CheckSignals()
while not state.is_final():
state.set_context_tokens(token_ids, nr_feat)
memset(vectors, 0, nr_hidden * nr_piece * sizeof(float))
@ -696,9 +691,10 @@ cdef class Parser:
xp = get_array_module(d_tokvecs)
for ids, d_vector, bp_vector in backprops:
d_state_features = bp_vector(d_vector, sgd=sgd)
mask = ids >= 0
d_state_features *= mask.reshape(ids.shape + (1,))
self.model[0].ops.scatter_add(d_tokvecs, ids * mask,
ids = ids.flatten()
d_state_features = d_state_features.reshape(
(ids.size, d_state_features.shape[2]))
self.model[0].ops.scatter_add(d_tokvecs, ids,
d_state_features)
bp_tokvecs(d_tokvecs, sgd=sgd)
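The reshaped update above just flattens the (state, feature) axes and accumulates every gradient row onto the token it was drawn from; a small numpy sketch of that accumulation (np.add.at is assumed to behave like ops.scatter_add here):

import numpy as np

d_tokvecs = np.zeros((5, 4), dtype='f')      # one row per token, width 4
ids = np.array([0, 1, 1, 2, 3, 3])           # flattened state-feature token ids
d_state_features = np.ones((ids.size, 4), dtype='f')
np.add.at(d_tokvecs, ids, d_state_features)  # repeated ids accumulate into the same row
assert d_tokvecs[1, 0] == 2.0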

View File

@ -1,24 +0,0 @@
from thinc.linear.avgtron cimport AveragedPerceptron
from thinc.typedefs cimport atom_t
from thinc.structs cimport FeatureC
from .stateclass cimport StateClass
from .arc_eager cimport TransitionSystem
from ..vocab cimport Vocab
from ..tokens.doc cimport Doc
from ..structs cimport TokenC
from ._state cimport StateC
cdef class ParserModel(AveragedPerceptron):
cdef int set_featuresC(self, atom_t* context, FeatureC* features,
const StateC* state) nogil
cdef class Parser:
cdef readonly Vocab vocab
cdef readonly ParserModel model
cdef readonly TransitionSystem moves
cdef readonly object cfg
cdef int parseC(self, TokenC* tokens, int length, int nr_feat) nogil

View File

@ -1,526 +0,0 @@
"""
MALT-style dependency parser
"""
# coding: utf-8
# cython: infer_types=True
from __future__ import unicode_literals
from collections import Counter
import ujson
cimport cython
cimport cython.parallel
import numpy.random
from cpython.ref cimport PyObject, Py_INCREF, Py_XDECREF
from cpython.exc cimport PyErr_CheckSignals
from libc.stdint cimport uint32_t, uint64_t
from libc.string cimport memset, memcpy
from libc.stdlib cimport malloc, calloc, free
from thinc.typedefs cimport weight_t, class_t, feat_t, atom_t, hash_t
from thinc.linear.avgtron cimport AveragedPerceptron
from thinc.linalg cimport VecVec
from thinc.structs cimport SparseArrayC, FeatureC, ExampleC
from thinc.extra.eg cimport Example
from cymem.cymem cimport Pool, Address
from murmurhash.mrmr cimport hash64
from preshed.maps cimport MapStruct
from preshed.maps cimport map_get
from . import _parse_features
from ._parse_features cimport CONTEXT_SIZE
from ._parse_features cimport fill_context
from .stateclass cimport StateClass
from ._state cimport StateC
from .transition_system import OracleError
from .transition_system cimport TransitionSystem, Transition
from ..structs cimport TokenC
from ..tokens.doc cimport Doc
from ..strings cimport StringStore
from ..gold cimport GoldParse
USE_FTRL = True
DEBUG = False
def set_debug(val):
global DEBUG
DEBUG = val
def get_templates(name):
pf = _parse_features
if name == 'ner':
return pf.ner
elif name == 'debug':
return pf.unigrams
elif name.startswith('embed'):
return (pf.words, pf.tags, pf.labels)
else:
return (pf.unigrams + pf.s0_n0 + pf.s1_n0 + pf.s1_s0 + pf.s0_n1 + pf.n0_n1 + \
pf.tree_shape + pf.trigrams)
cdef class ParserModel(AveragedPerceptron):
cdef int set_featuresC(self, atom_t* context, FeatureC* features,
const StateC* state) nogil:
fill_context(context, state)
nr_feat = self.extracter.set_features(features, context)
return nr_feat
def update(self, Example eg, itn=0):
"""
Does regression on negative cost. Sort of cute?
"""
self.time += 1
cdef int best = arg_max_if_gold(eg.c.scores, eg.c.costs, eg.c.nr_class)
cdef int guess = eg.guess
if guess == best or best == -1:
return 0.0
cdef FeatureC feat
cdef int clas
cdef weight_t gradient
if USE_FTRL:
for feat in eg.c.features[:eg.c.nr_feat]:
for clas in range(eg.c.nr_class):
if eg.c.is_valid[clas] and eg.c.scores[clas] >= eg.c.scores[best]:
gradient = eg.c.scores[clas] + eg.c.costs[clas]
self.update_weight_ftrl(feat.key, clas, feat.value * gradient)
else:
for feat in eg.c.features[:eg.c.nr_feat]:
self.update_weight(feat.key, guess, feat.value * eg.c.costs[guess])
self.update_weight(feat.key, best, -feat.value * eg.c.costs[guess])
return eg.c.costs[guess]
def update_from_histories(self, TransitionSystem moves, Doc doc, histories, weight_t min_grad=0.0):
cdef Pool mem = Pool()
features = <FeatureC*>mem.alloc(self.nr_feat, sizeof(FeatureC))
cdef StateClass stcls
cdef class_t clas
self.time += 1
cdef atom_t[CONTEXT_SIZE] atoms
histories = [(grad, hist) for grad, hist in histories if abs(grad) >= min_grad and hist]
if not histories:
return None
gradient = [Counter() for _ in range(max([max(h)+1 for _, h in histories]))]
for d_loss, history in histories:
stcls = StateClass.init(doc.c, doc.length)
moves.initialize_state(stcls.c)
for clas in history:
nr_feat = self.set_featuresC(atoms, features, stcls.c)
clas_grad = gradient[clas]
for feat in features[:nr_feat]:
clas_grad[feat.key] += d_loss * feat.value
moves.c[clas].do(stcls.c, moves.c[clas].label)
cdef feat_t key
cdef weight_t d_feat
for clas, clas_grad in enumerate(gradient):
for key, d_feat in clas_grad.items():
if d_feat != 0:
self.update_weight_ftrl(key, clas, d_feat)
cdef class Parser:
"""
Base class of the DependencyParser and EntityRecognizer.
"""
@classmethod
def load(cls, path, Vocab vocab, TransitionSystem=None, require=False, **cfg):
"""
Load the statistical model from the supplied path.
Arguments:
path (Path):
The path to load from.
vocab (Vocab):
The vocabulary. Must be shared by the documents to be processed.
require (bool):
Whether to raise an error if the files are not found.
Returns (Parser):
The newly constructed object.
"""
with (path / 'config.json').open() as file_:
cfg = ujson.load(file_)
# TODO: remove this shim when we don't have to support older data
if 'labels' in cfg and 'actions' not in cfg:
cfg['actions'] = cfg.pop('labels')
# TODO: remove this shim when we don't have to support older data
for action_name, labels in dict(cfg.get('actions', {})).items():
# We need this to be sorted
if isinstance(labels, dict):
labels = list(sorted(labels.keys()))
cfg['actions'][action_name] = labels
self = cls(vocab, TransitionSystem=TransitionSystem, model=None, **cfg)
if (path / 'model').exists():
self.model.load(str(path / 'model'))
elif require:
raise IOError(
"Required file %s/model not found when loading" % str(path))
return self
def __init__(self, Vocab vocab, TransitionSystem=None, ParserModel model=None, **cfg):
"""
Create a Parser.
Arguments:
vocab (Vocab):
The vocabulary object. Must be shared with documents to be processed.
model (thinc.linear.AveragedPerceptron):
The statistical model.
Returns (Parser):
The newly constructed object.
"""
if TransitionSystem is None:
TransitionSystem = self.TransitionSystem
self.vocab = vocab
cfg['actions'] = TransitionSystem.get_actions(**cfg)
self.moves = TransitionSystem(vocab.strings, cfg['actions'])
# TODO: Remove this when we no longer need to support old-style models
if isinstance(cfg.get('features'), basestring):
cfg['features'] = get_templates(cfg['features'])
elif 'features' not in cfg:
cfg['features'] = self.feature_templates
self.model = ParserModel(cfg['features'])
self.model.l1_penalty = cfg.get('L1', 0.0)
self.model.learn_rate = cfg.get('learn_rate', 0.001)
self.cfg = cfg
# TODO: This is a pretty hacky fix to the problem of adding more
# labels. The issue is they come in out of order, if labels are
# added during training
for label in cfg.get('extra_labels', []):
self.add_label(label)
def __reduce__(self):
return (Parser, (self.vocab, self.moves, self.model), None, None)
def __call__(self, Doc tokens):
"""
Apply the entity recognizer, setting the annotations onto the Doc object.
Arguments:
doc (Doc): The document to be processed.
Returns:
None
"""
cdef int nr_feat = self.model.nr_feat
with nogil:
status = self.parseC(tokens.c, tokens.length, nr_feat)
# Check for KeyboardInterrupt etc. Untested
PyErr_CheckSignals()
if status != 0:
raise ParserStateError(tokens)
self.moves.finalize_doc(tokens)
def pipe(self, stream, int batch_size=1000, int n_threads=2):
"""
Process a stream of documents.
Arguments:
stream: The sequence of documents to process.
batch_size (int):
The number of documents to accumulate into a working set.
n_threads (int):
The number of threads with which to work on the buffer in parallel.
Yields (Doc): Documents, in order.
"""
cdef Pool mem = Pool()
cdef TokenC** doc_ptr = <TokenC**>mem.alloc(batch_size, sizeof(TokenC*))
cdef int* lengths = <int*>mem.alloc(batch_size, sizeof(int))
cdef Doc doc
cdef int i
cdef int nr_feat = self.model.nr_feat
cdef int status
queue = []
for doc in stream:
doc_ptr[len(queue)] = doc.c
lengths[len(queue)] = doc.length
queue.append(doc)
if len(queue) == batch_size:
with nogil:
for i in cython.parallel.prange(batch_size, num_threads=n_threads):
status = self.parseC(doc_ptr[i], lengths[i], nr_feat)
if status != 0:
with gil:
raise ParserStateError(queue[i])
PyErr_CheckSignals()
for doc in queue:
self.moves.finalize_doc(doc)
yield doc
queue = []
batch_size = len(queue)
with nogil:
for i in cython.parallel.prange(batch_size, num_threads=n_threads):
status = self.parseC(doc_ptr[i], lengths[i], nr_feat)
if status != 0:
with gil:
raise ParserStateError(queue[i])
PyErr_CheckSignals()
for doc in queue:
self.moves.finalize_doc(doc)
yield doc
cdef int parseC(self, TokenC* tokens, int length, int nr_feat) nogil:
state = new StateC(tokens, length)
# NB: This can change self.moves.n_moves!
# I think this causes memory errors if called by .pipe()
self.moves.initialize_state(state)
nr_class = self.moves.n_moves
cdef ExampleC eg
eg.nr_feat = nr_feat
eg.nr_atom = CONTEXT_SIZE
eg.nr_class = nr_class
eg.features = <FeatureC*>calloc(sizeof(FeatureC), nr_feat)
eg.atoms = <atom_t*>calloc(sizeof(atom_t), CONTEXT_SIZE)
eg.scores = <weight_t*>calloc(sizeof(weight_t), nr_class)
eg.is_valid = <int*>calloc(sizeof(int), nr_class)
cdef int i
while not state.is_final():
eg.nr_feat = self.model.set_featuresC(eg.atoms, eg.features, state)
self.moves.set_valid(eg.is_valid, state)
self.model.set_scoresC(eg.scores, eg.features, eg.nr_feat)
guess = VecVec.arg_max_if_true(eg.scores, eg.is_valid, eg.nr_class)
if guess < 0:
return 1
action = self.moves.c[guess]
action.do(state, action.label)
memset(eg.scores, 0, sizeof(eg.scores[0]) * eg.nr_class)
for i in range(eg.nr_class):
eg.is_valid[i] = 1
self.moves.finalize_state(state)
for i in range(length):
tokens[i] = state._sent[i]
del state
free(eg.features)
free(eg.atoms)
free(eg.scores)
free(eg.is_valid)
return 0
def update(self, Doc tokens, GoldParse gold, itn=0, double drop=0.0):
"""
Update the statistical model.
Arguments:
doc (Doc):
The example document for the update.
gold (GoldParse):
The gold-standard annotations, to calculate the loss.
Returns (float):
The loss on this example.
"""
self.moves.preprocess_gold(gold)
cdef StateClass stcls = StateClass.init(tokens.c, tokens.length)
self.moves.initialize_state(stcls.c)
cdef Pool mem = Pool()
cdef Example eg = Example(
nr_class=self.moves.n_moves,
nr_atom=CONTEXT_SIZE,
nr_feat=self.model.nr_feat)
cdef weight_t loss = 0
cdef Transition action
cdef double dropout_rate = self.cfg.get('dropout', drop)
while not stcls.is_final():
eg.c.nr_feat = self.model.set_featuresC(eg.c.atoms, eg.c.features,
stcls.c)
dropout(eg.c.features, eg.c.nr_feat, dropout_rate)
self.moves.set_costs(eg.c.is_valid, eg.c.costs, stcls, gold)
self.model.set_scoresC(eg.c.scores, eg.c.features, eg.c.nr_feat)
guess = VecVec.arg_max_if_true(eg.c.scores, eg.c.is_valid, eg.c.nr_class)
self.model.update(eg)
action = self.moves.c[guess]
action.do(stcls.c, action.label)
loss += eg.costs[guess]
eg.fill_scores(0, eg.c.nr_class)
eg.fill_costs(0, eg.c.nr_class)
eg.fill_is_valid(1, eg.c.nr_class)
self.moves.finalize_state(stcls.c)
return loss
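    # Hedged training-loop sketch (assumes `train_data` yields (Doc, GoldParse)
    # pairs and `n_iter` is defined; the names are illustrative, not part of
    # this module):
    #     for itn in range(n_iter):
    #         for doc, gold in train_data:
    #             loss = parser.update(doc, gold, itn=itn, drop=0.1)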
def step_through(self, Doc doc, GoldParse gold=None):
"""
Set up a stepwise state, to introspect and control the transition sequence.
Arguments:
doc (Doc): The document to step through.
            gold (GoldParse): Optional gold parse.
Returns (StepwiseState):
A state object, to step through the annotation process.
"""
return StepwiseState(self, doc, gold=gold)
def from_transition_sequence(self, Doc doc, sequence):
"""Control the annotations on a document by specifying a transition sequence
to follow.
Arguments:
doc (Doc): The document to annotate.
sequence: A sequence of action names, as unicode strings.
Returns: None
"""
with self.step_through(doc) as stepwise:
for transition in sequence:
stepwise.transition(transition)
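    # Usage sketch (the action names below are illustrative assumptions; the
    # valid names depend on the transition system and the labels added to it):
    #     parser.from_transition_sequence(doc, [u'S', u'L-nsubj', u'S', u'D'])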
def add_label(self, label):
# Doesn't set label into serializer -- subclasses override it to do that.
for action in self.moves.action_types:
added = self.moves.add_action(action, label)
if added:
# Important that the labels be stored as a list! We need the
# order, or the model goes out of synch
self.cfg.setdefault('extra_labels', []).append(label)
cdef int dropout(FeatureC* feats, int nr_feat, float prob) except -1:
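    # Each feature is zeroed when its uniform draw falls below `prob`;
    # surviving feature values are rescaled by 1.0 / prob.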
if prob <= 0 or prob >= 1.:
return 0
cdef double[::1] py_probs = numpy.random.uniform(0., 1., nr_feat)
cdef double* probs = &py_probs[0]
for i in range(nr_feat):
if probs[i] >= prob:
feats[i].value /= prob
else:
feats[i].value = 0.
cdef class StepwiseState:
cdef readonly StateClass stcls
cdef readonly Example eg
cdef readonly Doc doc
cdef readonly GoldParse gold
cdef readonly Parser parser
def __init__(self, Parser parser, Doc doc, GoldParse gold=None):
self.parser = parser
self.doc = doc
if gold is not None:
self.gold = gold
self.parser.moves.preprocess_gold(self.gold)
else:
self.gold = GoldParse(doc)
self.stcls = StateClass.init(doc.c, doc.length)
self.parser.moves.initialize_state(self.stcls.c)
self.eg = Example(
nr_class=self.parser.moves.n_moves,
nr_atom=CONTEXT_SIZE,
nr_feat=self.parser.model.nr_feat)
def __enter__(self):
return self
def __exit__(self, type, value, traceback):
self.finish()
@property
def is_final(self):
return self.stcls.is_final()
@property
def stack(self):
return self.stcls.stack
@property
def queue(self):
return self.stcls.queue
@property
def heads(self):
return [self.stcls.H(i) for i in range(self.stcls.c.length)]
@property
def deps(self):
return [self.doc.vocab.strings[self.stcls.c._sent[i].dep]
for i in range(self.stcls.c.length)]
@property
def costs(self):
"""
Find the action-costs for the current state.
"""
if not self.gold:
raise ValueError("Can't set costs: No GoldParse provided")
self.parser.moves.set_costs(self.eg.c.is_valid, self.eg.c.costs,
self.stcls, self.gold)
costs = {}
for i in range(self.parser.moves.n_moves):
if not self.eg.c.is_valid[i]:
continue
transition = self.parser.moves.c[i]
name = self.parser.moves.move_name(transition.move, transition.label)
costs[name] = self.eg.c.costs[i]
return costs
def predict(self):
self.eg.reset()
self.eg.c.nr_feat = self.parser.model.set_featuresC(self.eg.c.atoms, self.eg.c.features,
self.stcls.c)
self.parser.moves.set_valid(self.eg.c.is_valid, self.stcls.c)
self.parser.model.set_scoresC(self.eg.c.scores,
self.eg.c.features, self.eg.c.nr_feat)
cdef Transition action = self.parser.moves.c[self.eg.guess]
return self.parser.moves.move_name(action.move, action.label)
def transition(self, action_name=None):
if action_name is None:
action_name = self.predict()
moves = {'S': 0, 'D': 1, 'L': 2, 'R': 3}
if action_name == '_':
action_name = self.predict()
action = self.parser.moves.lookup_transition(action_name)
elif action_name == 'L' or action_name == 'R':
self.predict()
move = moves[action_name]
clas = _arg_max_clas(self.eg.c.scores, move, self.parser.moves.c,
self.eg.c.nr_class)
action = self.parser.moves.c[clas]
else:
action = self.parser.moves.lookup_transition(action_name)
action.do(self.stcls.c, action.label)
def finish(self):
if self.stcls.is_final():
self.parser.moves.finalize_state(self.stcls.c)
self.doc.set_parse(self.stcls.c._sent)
self.parser.moves.finalize_doc(self.doc)
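    # Illustrative sketch (not part of the original source): stepping through
    # a parse manually via Parser.step_through():
    #     with parser.step_through(doc) as state:
    #         while not state.is_final:
    #             state.transition(state.predict())
    #     # on exit, finish() writes the parse back onto the doc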
class ParserStateError(ValueError):
def __init__(self, doc):
ValueError.__init__(self,
"Error analysing doc -- no valid actions available. This should "
"never happen, so please report the error on the issue tracker. "
"Here's the thread to do so --- reopen it if it's closed:\n"
"https://github.com/spacy-io/spaCy/issues/429\n"
"Please include the text that the parser failed on, which is:\n"
"%s" % repr(doc.text))
cdef int arg_max_if_gold(const weight_t* scores, const weight_t* costs, int n) nogil:
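    # Return the index of the highest-scoring action with zero (non-positive)
    # cost, i.e. one consistent with the gold parse, or -1 if none exists.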
cdef int best = -1
for i in range(n):
if costs[i] <= 0:
if best == -1 or scores[i] > scores[best]:
best = i
return best
cdef int _arg_max_clas(const weight_t* scores, int move, const Transition* actions,
int nr_class) except -1:
cdef weight_t score = 0
cdef int mode = -1
cdef int i
for i in range(nr_class):
if actions[i].move == move and (mode == -1 or scores[i] >= score):
mode = i
score = scores[i]
return mode

View File

@@ -117,6 +117,9 @@ def he_tokenizer():
def nb_tokenizer():
return util.get_lang_class('nb').Defaults.create_tokenizer()
@pytest.fixture
def da_tokenizer():
return util.get_lang_class('da').Defaults.create_tokenizer()
@pytest.fixture
def ja_tokenizer():

View File

@@ -10,7 +10,8 @@ import pytest
def test_doc_add_entities_set_ents_iob(en_vocab):
text = ["This", "is", "a", "lion"]
doc = get_doc(en_vocab, text)
ner = EntityRecognizer(en_vocab, features=[(2,), (3,)])
ner = EntityRecognizer(en_vocab)
ner.begin_training([])
ner(doc)
assert len(list(doc.ents)) == 0

View File

View File

@@ -0,0 +1,15 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize('text', ["ca.", "m.a.o.", "Jan.", "Dec."])
def test_da_tokenizer_handles_abbr(da_tokenizer, text):
tokens = da_tokenizer(text)
assert len(tokens) == 1
def test_da_tokenizer_handles_exc_in_text(da_tokenizer):
text = "Det er bl.a. ikke meningen"
tokens = da_tokenizer(text)
assert len(tokens) == 5
assert tokens[2].text == "bl.a."

View File

@@ -0,0 +1,27 @@
# coding: utf-8
"""Test that longer and mixed texts are tokenized correctly."""
from __future__ import unicode_literals
import pytest
def test_da_tokenizer_handles_long_text(da_tokenizer):
text = """Der var så dejligt ude på landet. Det var sommer, kornet stod gult, havren grøn,
høet var rejst i stakke nede i de grønne enge, og der gik storken sine lange,
røde ben og snakkede ægyptisk, for det sprog havde han lært af sin moder.
Rundt om ager og eng var der store skove, og midt i skovene dybe søer; jo, der var rigtignok dejligt derude landet!"""
tokens = da_tokenizer(text)
assert len(tokens) == 84
@pytest.mark.parametrize('text,match', [
('10', True), ('1', True), ('10.000', True), ('10.00', True),
('999,0', True), ('en', True), ('treoghalvfemsindstyvende', True), ('hundrede', True),
('hund', False), (',', False), ('1/2', True)])
def test_lex_attrs_like_number(da_tokenizer, text, match):
tokens = da_tokenizer(text)
assert len(tokens) == 1
print(tokens[0])
assert tokens[0].like_num == match

View File

@@ -9,7 +9,7 @@ from ...attrs import NORM
from ...gold import GoldParse
from ...vocab import Vocab
from ...tokens import Doc
from ...pipeline import NeuralDependencyParser
from ...pipeline import DependencyParser
numpy.random.seed(0)
@@ -21,7 +21,7 @@ def vocab():
@pytest.fixture
def parser(vocab):
parser = NeuralDependencyParser(vocab)
parser = DependencyParser(vocab)
parser.cfg['token_vector_width'] = 8
parser.cfg['hidden_width'] = 30
parser.cfg['hist_size'] = 0

View File

@@ -6,7 +6,7 @@ import numpy
from ..._ml import chain, Tok2Vec, doc2feats
from ...vocab import Vocab
from ...pipeline import TokenVectorEncoder
from ...pipeline import Tensorizer
from ...syntax.arc_eager import ArcEager
from ...syntax.nn_parser import Parser
from ...tokens.doc import Doc

View File

@@ -8,7 +8,7 @@ from ...attrs import NORM
from ...gold import GoldParse
from ...vocab import Vocab
from ...tokens import Doc
from ...pipeline import NeuralDependencyParser
from ...pipeline import DependencyParser
@pytest.fixture
def vocab():
@@ -16,7 +16,7 @@ def vocab():
@pytest.fixture
def parser(vocab):
parser = NeuralDependencyParser(vocab)
parser = DependencyParser(vocab)
parser.cfg['token_vector_width'] = 4
parser.cfg['hidden_width'] = 32
#parser.add_label('right')

View File

@@ -1,11 +1,11 @@
import pytest
from ...pipeline import NeuralDependencyParser
from ...pipeline import DependencyParser
@pytest.fixture
def parser(en_vocab):
parser = NeuralDependencyParser(en_vocab)
parser = DependencyParser(en_vocab)
parser.add_label('nsubj')
parser.model, cfg = parser.Model(parser.moves.n_moves)
parser.cfg.update(cfg)
@@ -14,7 +14,7 @@ def parser(en_vocab):
@pytest.fixture
def blank_parser(en_vocab):
parser = NeuralDependencyParser(en_vocab)
parser = DependencyParser(en_vocab)
return parser

View File

@@ -82,3 +82,21 @@ def test_remove_pipe(nlp, name):
assert not len(nlp.pipeline)
assert removed_name == name
assert removed_component == new_pipe
@pytest.mark.parametrize('name', ['my_component'])
def test_disable_pipes_method(nlp, name):
nlp.add_pipe(new_pipe, name=name)
assert nlp.has_pipe(name)
disabled = nlp.disable_pipes(name)
assert not nlp.has_pipe(name)
disabled.restore()
@pytest.mark.parametrize('name', ['my_component'])
def test_disable_pipes_context(nlp, name):
nlp.add_pipe(new_pipe, name=name)
assert nlp.has_pipe(name)
with nlp.disable_pipes(name):
assert not nlp.has_pipe(name)
assert nlp.has_pipe(name)

View File

@@ -1,11 +1,10 @@
import pytest
import spacy
#@pytest.mark.models('en')
@pytest.mark.models('en')
def test_issue1305():
'''Test lemmatization of English VBZ'''
nlp = spacy.load('en_core_web_sm')
assert nlp.vocab.morphology.lemmatizer('works', 'verb') == ['work']
doc = nlp(u'This app works well')
print([(w.text, w.tag_) for w in doc])
assert doc[2].lemma_ == 'work'

View File

@@ -2,8 +2,8 @@
from __future__ import unicode_literals
from ..util import make_tempdir
from ...pipeline import NeuralDependencyParser as DependencyParser
from ...pipeline import NeuralEntityRecognizer as EntityRecognizer
from ...pipeline import DependencyParser
from ...pipeline import EntityRecognizer
import pytest

View File

@@ -2,7 +2,7 @@
from __future__ import unicode_literals
from ..util import make_tempdir
from ...pipeline import NeuralTagger as Tagger
from ...pipeline import Tagger
import pytest

View File

@@ -2,7 +2,7 @@
from __future__ import unicode_literals
from ..util import make_tempdir
from ...pipeline import TokenVectorEncoder as Tensorizer
from ...pipeline import Tensorizer
import pytest

View File

@@ -64,6 +64,12 @@ def test_matcher_init(en_vocab, words):
assert matcher(doc) == []
def test_matcher_contains(matcher):
matcher.add('TEST', None, [{'ORTH': 'test'}])
assert 'TEST' in matcher
assert 'TEST2' not in matcher
def test_matcher_no_match(matcher):
words = ["I", "like", "cheese", "."]
doc = get_doc(matcher.vocab, words)
@@ -112,7 +118,8 @@ def test_matcher_empty_dict(en_vocab):
matcher.add('A.', None, [{'ORTH': 'a'}, {}])
matches = matcher(doc)
assert matches[0][1:] == (0, 2)
def test_matcher_operator_shadow(en_vocab):
matcher = Matcher(en_vocab)
abc = ["a", "b", "c"]
@@ -123,7 +130,8 @@ def test_matcher_operator_shadow(en_vocab):
matches = matcher(doc)
assert len(matches) == 1
assert matches[0][1:] == (0, 3)
def test_matcher_phrase_matcher(en_vocab):
words = ["Google", "Now"]
doc = get_doc(en_vocab, words)
@@ -134,6 +142,22 @@ def test_matcher_phrase_matcher(en_vocab):
assert len(matcher(doc)) == 1
def test_phrase_matcher_length(en_vocab):
matcher = PhraseMatcher(en_vocab)
assert len(matcher) == 0
matcher.add('TEST', None, get_doc(en_vocab, ['test']))
assert len(matcher) == 1
matcher.add('TEST2', None, get_doc(en_vocab, ['test2']))
assert len(matcher) == 2
def test_phrase_matcher_contains(en_vocab):
matcher = PhraseMatcher(en_vocab)
matcher.add('TEST', None, get_doc(en_vocab, ['test']))
assert 'TEST' in matcher
assert 'TEST2' not in matcher
def test_matcher_match_zero(matcher):
words1 = 'He said , " some words " ...'.split()
words2 = 'He said , " some three words " ...'.split()

View File

@@ -63,11 +63,8 @@ cdef class Tokenizer:
return (self.__class__, args, None, None)
cpdef Doc tokens_from_list(self, list strings):
# TODO: deprecation warning
return Doc(self.vocab, words=strings)
#raise NotImplementedError(
# "Method deprecated in 1.0.\n"
# "Old: tokenizer.tokens_from_list(strings)\n"
# "New: Doc(tokenizer.vocab, words=strings)")
@cython.boundscheck(False)
def __call__(self, unicode string):

View File

@@ -1,7 +1,6 @@
# coding: utf8
from __future__ import unicode_literals
import bz2
import ujson
import re
import numpy
@@ -16,7 +15,6 @@ from .lexeme cimport EMPTY_LEXEME
from .lexeme cimport Lexeme
from .strings cimport hash_string
from .typedefs cimport attr_t
from .cfile cimport CFile
from .tokens.token cimport Token
from .attrs cimport PROB, LANG
from .structs cimport SerializedLexemeC

View File

@@ -181,7 +181,7 @@ mixin codepen(slug, height, default_tab)
alt_file - [string] alternative file path used in footer and link button
height - [integer] height of code preview in px
mixin github(repo, file, alt_file, height, language)
mixin github(repo, file, height, alt_file, language)
- var branch = ALPHA ? "develop" : "master"
- var height = height || 250

View File

@@ -38,7 +38,7 @@ for id in CURRENT_MODELS
+cell #[+label Size]
+cell #[+tag=comps.size] #[span(data-tpl=id data-tpl-key="size") #[em n/a]]
each label in ["Pipeline", "Sources", "Author", "License"]
each label in ["Pipeline", "Vectors", "Sources", "Author", "License"]
- var field = label.toLowerCase()
+row
+cell.u-nowrap

View File

@@ -1,6 +1,6 @@
//- 💫 DOCS > API > ANNOTATION > BILUO
+table([ "Tag", "Description" ])
+table(["Tag", "Description"])
+row
+cell #[code #[span.u-color-theme B] EGIN]
+cell The first token of a multi-token entity.

View File

@@ -13,7 +13,9 @@ p
| that are part of an entity are set to the entity label, prefixed by the
| BILUO marker. For example #[code "B-ORG"] describes the first token of
| a multi-token #[code ORG] entity and #[code "U-PERSON"] a single
| token representing a #[code PERSON] entity
| token representing a #[code PERSON] entity. The
| #[+api("goldparse#biluo_tags_from_offsets") #[code biluo_tags_from_offsets]]
| function can help you convert entity offsets to the right format.
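//- Hedged usage sketch (added for illustration, not in the original docs):
    converting character offsets to BILUO tags. The `nlp` object and the
    example offsets below are assumptions.
+aside-code("Example").
    from spacy.gold import biluo_tags_from_offsets
    doc = nlp(u'I like London.')
    entities = [(7, 13, u'LOC')]
    tags = biluo_tags_from_offsets(doc, entities)
    # tags == ['O', 'O', 'U-LOC', 'O']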
+code("Example structure").
[{

View File

@@ -136,7 +136,7 @@ p
| #[+src(gh("spacy", "spacy/glossary.py")) #[code glossary.py]].
+aside-code("Example").
spacy.explain('NORP')
spacy.explain(u'NORP')
# Nationalities or religious or political groups
doc = nlp(u'Hello world')

View File

@@ -2,4 +2,5 @@
include ../_includes/_mixins
//- This class inherits from Pipe, so this page uses the template in pipe.jade.
!=partial("pipe", { subclass: "DependencyParser", short: "parser", pipeline_id: "parser" })

View File

@@ -2,4 +2,5 @@
include ../_includes/_mixins
//- This class inherits from Pipe, so this page uses the template in pipe.jade.
!=partial("pipe", { subclass: "EntityRecognizer", short: "ner", pipeline_id: "ner" })

View File

@@ -229,6 +229,7 @@ p
+cell Config parameters.
+h(2, "preprocess_gold") Language.preprocess_gold
+tag method
p
| Can be called before training to pre-process gold data. By default, it
@@ -440,6 +441,37 @@ p
+cell tuple
+cell A #[code (name, component)] tuple of the removed component.
+h(2, "disable_pipes") Language.disable_pipes
+tag contextmanager
+tag-new(2)
p
| Disable one or more pipeline components. If used as a context manager,
| the pipeline will be restored to the initial state at the end of the
| block. Otherwise, a #[code DisabledPipes] object is returned that has a
| #[code .restore()] method you can use to undo your changes.
+aside-code("Example").
with nlp.disable_pipes('tagger', 'parser'):
optimizer = nlp.begin_training(gold_tuples)
disabled = nlp.disable_pipes('tagger', 'parser')
optimizer = nlp.begin_training(gold_tuples)
disabled.restore()
+table(["Name", "Type", "Description"])
+row
+cell #[code *disabled]
+cell unicode
+cell Names of pipeline components to disable.
+row("foot")
+cell returns
+cell #[code DisabledPipes]
+cell
| The disabled pipes that can be restored by calling the object's
| #[code .restore()] method.
+h(2, "to_disk") Language.to_disk
+tag method
+tag-new(2)
@@ -609,6 +641,14 @@ p Load state from a binary string.
| Custom meta data for the Language class. If a model is loaded,
| contains meta data of the model.
+row
+cell #[code path]
+tag-new(2)
+cell #[code Path]
+cell
| Path to the model data directory, if a model is loaded. Otherwise
| #[code None].
+h(2, "class-attributes") Class attributes
+table(["Name", "Type", "Description"])

View File

@@ -304,6 +304,21 @@ p Modify the pipe's model, to use the given parameter values.
| The parameter values to use in the model. At the end of the
| context, the original parameters are restored.
+h(2, "add_label") #{CLASSNAME}.add_label
+tag method
p Add a new label to the pipe.
+aside-code("Example").
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
#{VARNAME}.add_label('MY_LABEL')
+table(["Name", "Type", "Description"])
+row
+cell #[code label]
+cell unicode
+cell The label to add.
+h(2, "to_disk") #{CLASSNAME}.to_disk
+tag method

View File

@@ -2,4 +2,5 @@
include ../_includes/_mixins
//- This class inherits from Pipe, so this page uses the template in pipe.jade.
!=partial("pipe", { subclass: "Tagger", pipeline_id: "tagger" })

View File

@@ -2,4 +2,5 @@
include ../_includes/_mixins
//- This class inherits from Pipe, so this page uses the template in pipe.jade.
!=partial("pipe", { subclass: "Tensorizer", pipeline_id: "tensorizer" })

View File

@@ -16,4 +16,5 @@ p
| before a logistic activation is applied elementwise. The value of each
| output neuron is the probability that some class is present.
//- This class inherits from Pipe, so this page uses the template in pipe.jade.
!=partial("pipe", { subclass: "TextCategorizer", short: "textcat", pipeline_id: "textcat" })

Some files were not shown because too many files have changed in this diff.