This commit is contained in:
Jim O'Regan 2017-10-31 14:07:45 +00:00
commit 41dd29e48e
522 changed files with 21740 additions and 18776 deletions

View File

@ -1 +1,55 @@
environment:
matrix:
# For Python versions available on Appveyor, see
# http://www.appveyor.com/docs/installed-software#python
# The list here is complete (excluding Python 2.6, which
# isn't covered by this document) at the time of writing.
- PYTHON: "C:\\Python27"
#- PYTHON: "C:\\Python33"
#- PYTHON: "C:\\Python34"
#- PYTHON: "C:\\Python35"
#- PYTHON: "C:\\Python27-x64"
#- PYTHON: "C:\\Python33-x64"
#- DISTUTILS_USE_SDK: "1"
#- PYTHON: "C:\\Python34-x64"
#- DISTUTILS_USE_SDK: "1"
#- PYTHON: "C:\\Python35-x64"
- PYTHON: "C:\\Python36-x64"
install:
# We need wheel installed to build wheels
- "%PYTHON%\\python.exe -m pip install wheel"
- "%PYTHON%\\python.exe -m pip install cython"
- "%PYTHON%\\python.exe -m pip install -r requirements.txt"
- "%PYTHON%\\python.exe -m pip install -e ."
build: off
test_script:
# Put your test command here.
# If you don't need to build C extensions on 64-bit Python 3.3 or 3.4,
# you can remove "build.cmd" from the front of the command, as it's
# only needed to support those cases.
# Note that you must use the environment variable %PYTHON% to refer to
# the interpreter you're using - Appveyor does not do anything special
# to put the Python version you want to use on PATH.
- "%PYTHON%\\python.exe -m pytest spacy/"
after_test:
# This step builds your wheels.
# Again, you only need build.cmd if you're building C extensions for
# 64-bit Python 3.3/3.4. And you need to use %PYTHON% to get the correct
# interpreter
- "%PYTHON%\\python.exe setup.py bdist_wheel"
artifacts:
# bdist_wheel puts your built wheel in the dist directory
- path: dist\*
#on_success:
# You can use this step to upload your artifacts to a public website.
# See Appveyor's documentation for more details. Or you can simply
# access your wheels from the Appveyor "artifacts" tab for your build.

11
.buildkite/sdist.yml Normal file
View File

@ -0,0 +1,11 @@
steps:
-
command: "fab env clean make test sdist"
label: ":dizzy: :python:"
artifact_paths: "dist/*.tar.gz"
- wait
- trigger: "spacy-sdist-against-models"
label: ":dizzy: :hammer:"
build:
env:
SPACY_VERSION: "{$SPACY_VERSION}"

View File

@ -87,8 +87,8 @@ U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect my
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
@ -98,9 +98,9 @@ mark both statements:
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Shuvanon Razik |
| Name | |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 3/12/2017 |
| GitHub username | shuvanon |
| Date | |
| GitHub username | |
| Website (optional) | |

View File

@ -1,20 +1,19 @@
<!--- Provide a general summary of your changes in the Title -->
<!--- Provide a general summary of your changes in the title. -->
## Description
<!--- Use this section to describe your changes and how they're affecting the code. -->
<!-- If your changes required testing, include information about the testing environment and the tests you ran. -->
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine just
include a note to let us know. -->
### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->
## Types of changes
<!--- What types of changes does your code introduce? Put an `x` in all applicable boxes.: -->
- [ ] **Bug fix** (non-breaking change fixing an issue)
- [ ] **New feature** (non-breaking change adding functionality to spaCy)
- [ ] **Breaking change** (fix or feature causing change to spaCy's existing functionality)
- [ ] **Documentation** (addition to documentation of spaCy)
## Checklist:
<!--- Go over all the following points, and put an `x` in all applicable boxes.: -->
- [ ] My change requires a change to spaCy's documentation.
- [ ] I have updated the documentation accordingly.
- [ ] I have added tests to cover my changes.
- [ ] All new and existing tests passed.
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.

106
.github/contributors/demfier.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Gaurav Sahu |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2017-10-18 |
| GitHub username | demfier |
| Website (optional) | |

106
.github/contributors/honnibal.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Matthew Honnibal |
| Company name (if applicable) | Explosion AI |
| Title or role (if applicable) | Founder |
| Date | 2017-10-18 |
| GitHub username | honnibal |
| Website (optional) | https://explosion.ai |

106
.github/contributors/ines.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Ines Montani |
| Company name (if applicable) | Explosion AI |
| Title or role (if applicable) | Founder |
| Date | 2017/10/18 |
| GitHub username | ines |
| Website (optional) | https://explosion.ai |

106
.github/contributors/jerbob92.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jeroen Bobbeldijk |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 22-10-2017 |
| GitHub username | jerbob92 |
| Website (optional) | |

106
.github/contributors/johnhaley81.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | John Haley |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 19/10/2017 |
| GitHub username | johnhaley81 |
| Website (optional) | |

106
.github/contributors/mdcclv.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------------------- |
| Name | Orion Montoya |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 04-10-2017 |
| GitHub username | mdcclv |
| Website (optional) | http://www.mdcclv.com/ |

106
.github/contributors/polm.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Paul McCann |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2017-10-14 |
| GitHub username | polm |
| Website (optional) | http://dampfkraft.com|

View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Ramanan Balakrishnan |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2017-10-19 |
| GitHub username | ramananbalakrishnan |
| Website (optional) | |

108
.github/contributors/shuvanon.md vendored Normal file
View File

@ -0,0 +1,108 @@
<!-- This agreement was mistakenly submitted as an update to the CONTRIBUTOR_AGREEMENT.md template. Commit: 8a2d22222dec5cf910df5a378cbcd9ea2ab53ec4. It was therefore moved over manually. -->
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Shuvanon Razik |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 3/12/2017 |
| GitHub username | shuvanon |
| Website (optional) | |

106
.github/contributors/yuukos.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Alexey Kim |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 13-12-2017 |
| GitHub username | yuukos |
| Website (optional) | |

4
.gitignore vendored
View File

@ -1,14 +1,12 @@
# spaCy
spacy/data/
corpora/
models/
/models/
keys/
# Website
website/www/
website/_deploy.sh
website/package.json
website/announcement.jade
website/.gitignore
# Cython / C extensions

View File

@ -3,6 +3,8 @@
This is a list of everyone who has made significant contributions to spaCy, in alphabetical order. Thanks a lot for the great work!
* Adam Bittlingmayer, [@bittlingmayer](https://github.com/bittlingmayer)
* Alexey Kim, [@yuukos](https://github.com/yuukos)
* Alexis Eidelman, [@AlexisEidelman](https://github.com/AlexisEidelman)
* Andreas Grivas, [@andreasgrv](https://github.com/andreasgrv)
* Andrew Poliakov, [@pavlin99th](https://github.com/pavlin99th)
* Aniruddha Adhikary [@aniruddha-adhikary](https://github.com/aniruddha-adhikary)
@ -16,6 +18,7 @@ This is a list of everyone who has made significant contributions to spaCy, in a
* Daniel Vila Suero, [@dvsrepo](https://github.com/dvsrepo)
* Dmytro Sadovnychyi, [@sadovnychyi](https://github.com/sadovnychyi)
* Eric Zhao, [@ericzhao28](https://github.com/ericzhao28)
* Francisco Aranda, [@frascuchon](https://github.com/frascuchon)
* Greg Baker, [@solresol](https://github.com/solresol)
* Grégory Howard, [@Gregory-Howard](https://github.com/Gregory-Howard)
* György Orosz, [@oroszgy](https://github.com/oroszgy)
@ -24,6 +27,9 @@ This is a list of everyone who has made significant contributions to spaCy, in a
* Ines Montani, [@ines](https://github.com/ines)
* J Nicolas Schrading, [@NSchrading](https://github.com/NSchrading)
* Janneke van der Zwaan, [@jvdzwaan](https://github.com/jvdzwaan)
* Jim Geovedi, [@geovedi](https://github.com/geovedi)
* Jim Regan, [@jimregan](https://github.com/jimregan)
* Jeffrey Gerard, [@IamJeffG](https://github.com/IamJeffG)
* Jordan Suchow, [@suchow](https://github.com/suchow)
* Josh Reeter, [@jreeter](https://github.com/jreeter)
* Juan Miguel Cejuela, [@juanmirocks](https://github.com/juanmirocks)
@ -38,6 +44,8 @@ This is a list of everyone who has made significant contributions to spaCy, in a
* Michael Wallin, [@wallinm1](https://github.com/wallinm1)
* Miguel Almeida, [@mamoit](https://github.com/mamoit)
* Oleg Zd, [@olegzd](https://github.com/olegzd)
* Orion Montoya, [@mdcclv](https://github.com/mdcclv)
* Paul O'Leary McCann, [@polm](https://github.com/polm)
* Pokey Rule, [@pokey](https://github.com/pokey)
* Raphaël Bournhonesque, [@raphael0202](https://github.com/raphael0202)
* Rob van Nieuwpoort, [@RvanNieuwpoort](https://github.com/RvanNieuwpoort)
@ -45,12 +53,18 @@ This is a list of everyone who has made significant contributions to spaCy, in a
* Sam Bozek, [@sambozek](https://github.com/sambozek)
* Sasho Savkov, [@savkov](https://github.com/savkov)
* Shuvanon Razik, [@shuvanon](https://github.com/shuvanon)
* Swier, [@swierh](https://github.com/swierh)
* Thomas Tanon, [@Tpt](https://github.com/Tpt)
* Tiago Rodrigues, [@TiagoMRodrigues](https://github.com/TiagoMRodrigues)
* Vimos Tan, [@Vimos](https://github.com/Vimos)
* Vsevolod Solovyov, [@vsolovyov](https://github.com/vsolovyov)
* Wah Loon Keng, [@kengz](https://github.com/kengz)
* Wannaphong Phatthiyaphaibun, [@wannaphongcom](https://github.com/wannaphongcom)
* Willem van Hage, [@wrvhage](https://github.com/wrvhage)
* Wolfgang Seeker, [@wbwseeker](https://github.com/wbwseeker)
* Yam, [@hscspring](https://github.com/hscspring)
* Yanhao Yang, [@YanhaoYang](https://github.com/YanhaoYang)
* Yasuaki Uechi, [@uetchy](https://github.com/uetchy)
* Yu-chun Huang, [@galaxyh](https://github.com/galaxyh)
* Yubing Dong, [@tomtung](https://github.com/tomtung)
* Yuval Pinter, [@yuvalpinter](https://github.com/yuvalpinter)

View File

@ -1,15 +1,16 @@
spaCy: Industrial-strength NLP
******************************
spaCy is a library for advanced natural language processing in Python and
Cython. spaCy is built on the very latest research, but it isn't researchware.
It was designed from day one to be used in real products. spaCy currently supports
English, German, French and Spanish, as well as tokenization for Italian,
Portuguese, Dutch, Swedish, Finnish, Norwegian, Danish, Hungarian, Polish,
Bengali, Hebrew, Chinese and Japanese. It's commercial open-source software,
released under the MIT license.
spaCy is a library for advanced Natural Language Processing in Python and Cython.
It's built on the very latest research, and was designed from day one to be
used in real products. spaCy comes with
`pre-trained statistical models <https://alpha.spacy.io/models>`_ and word
vectors, and currently supports tokenization for **20+ languages**. It features
the **fastest syntactic parser** in the world, convolutional **neural network models**
for tagging, parsing and **named entity recognition** and easy **deep learning**
integration. It's commercial open-source software, released under the MIT license.
💫 **Version 1.8 out now!** `Read the release notes here. <https://github.com/explosion/spaCy/releases/>`_
💫 **Version 2.0 out now!** `Check out the new features here. <https://alpha.spacy.io/usage/v2>`_
.. image:: https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square
:target: https://travis-ci.org/explosion/spaCy
@ -38,68 +39,72 @@ released under the MIT license.
📖 Documentation
================
=================== ===
`Usage Workflows`_ How to use spaCy and its features.
`API Reference`_ The detailed reference for spaCy's API.
`Troubleshooting`_ Common problems and solutions for beginners.
`Tutorials`_ End-to-end examples, with code you can modify and run.
`Showcase & Demos`_ Demos, libraries and products from the spaCy community.
`Contribute`_ How to contribute to the spaCy project and code base.
=================== ===
=================== ===
`spaCy 101`_ New to spaCy? Here's everything you need to know!
`Usage Guides`_ How to use spaCy and its features.
`New in v2.0`_ New features, backwards incompatibilities and migration guide.
`API Reference`_ The detailed reference for spaCy's API.
`Models`_ Download statistical language models for spaCy.
`Resources`_ Libraries, extensions, demos, books and courses.
`Changelog`_ Changes and version history.
`Contribute`_ How to contribute to the spaCy project and code base.
=================== ===
.. _Usage Workflows: https://spacy.io/docs/usage/
.. _API Reference: https://spacy.io/docs/api/
.. _Troubleshooting: https://spacy.io/docs/usage/troubleshooting
.. _Tutorials: https://spacy.io/docs/usage/tutorials
.. _Showcase & Demos: https://spacy.io/docs/usage/showcase
.. _spaCy 101: https://alpha.spacy.io/usage/spacy-101
.. _New in v2.0: https://alpha.spacy.io/usage/v2#migrating
.. _Usage Guides: https://alpha.spacy.io/usage/
.. _API Reference: https://alpha.spacy.io/api/
.. _Models: https://alpha.spacy.io/models
.. _Resources: https://alpha.spacy.io/usage/resources
.. _Changelog: https://alpha.spacy.io/usage/#changelog
.. _Contribute: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md
💬 Where to ask questions
==========================
The spaCy project is maintained by `@honnibal <https://github.com/honnibal>`_
and `@ines <https://github.com/ines>`_. Please understand that we won't be able
to provide individual support via email. We also believe that help is much more
valuable if it's shared publicly, so that more people can benefit from it.
====================== ===
**Bug reports** `GitHub issue tracker`_
**Usage questions** `StackOverflow`_, `Gitter chat`_, `Reddit user group`_
**General discussion** `Gitter chat`_, `Reddit user group`_
**Commercial support** contact@explosion.ai
**Bug Reports** `GitHub Issue Tracker`_
**Usage Questions** `StackOverflow`_, `Gitter Chat`_, `Reddit User Group`_
**General Discussion** `Gitter Chat`_, `Reddit User Group`_
====================== ===
.. _GitHub issue tracker: https://github.com/explosion/spaCy/issues
.. _GitHub Issue Tracker: https://github.com/explosion/spaCy/issues
.. _StackOverflow: http://stackoverflow.com/questions/tagged/spacy
.. _Gitter chat: https://gitter.im/explosion/spaCy
.. _Reddit user group: https://www.reddit.com/r/spacynlp
.. _Gitter Chat: https://gitter.im/explosion/spaCy
.. _Reddit User Group: https://www.reddit.com/r/spacynlp
Features
========
* Non-destructive **tokenization**
* Syntax-driven sentence segmentation
* Pre-trained **word vectors**
* Part-of-speech tagging
* **Fastest syntactic parser** in the world
* **Named entity** recognition
* Labelled dependency parsing
* Convenient string-to-int mapping
* Export to numpy data arrays
* GIL-free **multi-threading**
* Efficient binary serialization
* Non-destructive **tokenization**
* Support for **20+ languages**
* Pre-trained `statistical models <https://alpha.spacy.io/models>`_ and word vectors
* Easy **deep learning** integration
* Statistical models for **English**, **German**, **French** and **Spanish**
* Part-of-speech tagging
* Labelled dependency parsing
* Syntax-driven sentence segmentation
* Built in **visualizers** for syntax and NER
* Convenient string-to-hash mapping
* Export to numpy data arrays
* Efficient binary serialization
* Easy **model packaging** and deployment
* State-of-the-art speed
* Robust, rigorously evaluated accuracy
See `facts, figures and benchmarks <https://spacy.io/docs/api/>`_.
📖 **For more details, see the** `facts, figures and benchmarks <https://alpha.spacy.io/usage/facts-figures>`_.
Top Performance
---------------
Install spaCy
=============
* Fastest in the world: <50ms per document. No faster system has ever been
announced.
* Accuracy within 1% of the current state of the art on all tasks performed
(parsing, named entity recognition, part-of-speech tagging). The only more
accurate systems are an order of magnitude slower or more.
Supports
--------
For detailed installation instructions, see
the `documentation <https://alpha.spacy.io/usage>`_.
==================== ===
**Operating system** macOS / OS X, Linux, Windows (Cygwin, MinGW, Visual Studio)
@ -110,12 +115,6 @@ Supports
.. _pip: https://pypi.python.org/pypi/spacy
.. _conda: https://anaconda.org/conda-forge/spacy
Install spaCy
=============
Installation requires a working build environment. See notes on Ubuntu,
macOS/OS X and Windows for details.
pip
---
@ -123,7 +122,7 @@ Using pip, spaCy releases are currently only available as source packages.
.. code:: bash
pip install -U spacy
pip install spacy
When using pip it is generally recommended to install packages in a ``virtualenv``
to avoid modifying system state:
@ -149,25 +148,41 @@ For the feedstock including the build recipe and configuration,
check out `this repository <https://github.com/conda-forge/spacy-feedstock>`_.
Improvements and pull requests to the recipe and setup are always appreciated.
Updating spaCy
--------------
Some updates to spaCy may require downloading new statistical models. If you're
running spaCy v2.0 or higher, you can use the ``validate`` command to check if
your installed models are compatible and if not, print details on how to update
them:
.. code:: bash
pip install -U spacy
spacy validate
If you've trained your own models, keep in mind that your training and runtime
inputs must match. After updating spaCy, we recommend **retraining your models**
with the new version.
📖 **For details on upgrading from spaCy 1.x to spaCy 2.x, see the**
`migration guide <https://alpha.spacy.io/usage/v2#migrating>`_.
Download models
===============
As of v1.7.0, models for spaCy can be installed as **Python packages**.
This means that they're a component of your application, just like any
other module. They're versioned and can be defined as a dependency in your
``requirements.txt``. Models can be installed from a download URL or
a local directory, manually or via pip. Their data can be located anywhere on
your file system. To make a model available to spaCy, all you need to do is
create a "shortcut link", an internal alias that tells spaCy where to find the
data files for a specific model name.
other module. Models can be installed using spaCy's ``download`` command,
or manually by pointing pip to a path or URL.
======================= ===
`spaCy Models`_ Available models, latest releases and direct download.
`Available Models`_ Detailed model descriptions, accuracy figures and benchmarks.
`Models Documentation`_ Detailed usage instructions.
======================= ===
.. _spaCy Models: https://github.com/explosion/spacy-models/releases/
.. _Models Documentation: https://spacy.io/docs/usage/models
.. _Available Models: https://alpha.spacy.io/models
.. _Models Documentation: https://alpha.spacy.io/docs/usage/models
.. code:: bash
@ -175,17 +190,10 @@ data files for a specific model name.
python -m spacy download en
# download best-matching version of specific model for your spaCy installation
python -m spacy download en_core_web_md
python -m spacy download en_core_web_lg
# pip install .tar.gz archive from path or URL
pip install /Users/you/en_core_web_md-1.2.0.tar.gz
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_md-1.2.0/en_core_web_md-1.2.0.tar.gz
# set up shortcut link to load installed package as "en_default"
python -m spacy link en_core_web_md en_default
# set up shortcut link to load local model as "my_amazing_model"
python -m spacy link /Users/you/data my_amazing_model
pip install /Users/you/en_core_web_sm-2.0.0.tar.gz
Loading and using models
------------------------
@ -199,24 +207,24 @@ To load a model, use ``spacy.load()`` with the model's shortcut link:
doc = nlp(u'This is a sentence.')
If you've installed a model via pip, you can also ``import`` it directly and
then call its ``load()`` method with no arguments. This should also work for
older models in previous versions of spaCy.
then call its ``load()`` method:
.. code:: python
import spacy
import en_core_web_md
import en_core_web_sm
nlp = en_core_web_md.load()
nlp = en_core_web_.load()
doc = nlp(u'This is a sentence.')
📖 **For more info and examples, check out the** `models documentation <https://spacy.io/docs/usage/models>`_.
📖 **For more info and examples, check out the**
`models documentation <https://alpha.spacy.io/docs/usage/models>`_.
Support for older versions
--------------------------
If you're using an older version (v1.6.0 or below), you can still download and
install the old models from within spaCy using ``python -m spacy.en.download all``
If you're using an older version (``v1.6.0`` or below), you can still download
and install the old models from within spaCy using ``python -m spacy.en.download all``
or ``python -m spacy.de.download all``. The ``.tar.gz`` archives are also
`attached to the v1.6.0 release <https://github.com/explosion/spaCy/tree/v1.6.0>`_.
To download and install the models manually, unpack the archive, drop the
@ -248,11 +256,13 @@ details.
pip install -r requirements.txt
pip install -e .
Compared to regular install via pip `requirements.txt <requirements.txt>`_
Compared to regular install via pip, `requirements.txt <requirements.txt>`_
additionally installs developer dependencies such as Cython.
Instead of the above verbose commands, you can also use the following
`Fabric <http://www.fabfile.org/>`_ commands:
`Fabric <http://www.fabfile.org/>`_ commands. All commands assume that your
``virtualenv`` is located in a directory ``.env``. If you're using a different
directory, you can change it via the environment variable ``VENV_DIR``, for
example ``VENV_DIR=".custom-env" fab clean make``.
============= ===
``fab env`` Create ``virtualenv`` and delete previous one, if it exists.
@ -261,14 +271,6 @@ Instead of the above verbose commands, you can also use the following
``fab test`` Run basic tests, aborting after first failure.
============= ===
All commands assume that your ``virtualenv`` is located in a directory ``.env``.
If you're using a different directory, you can change it via the environment
variable ``VENV_DIR``, for example:
.. code:: bash
VENV_DIR=".custom-env" fab clean make
Ubuntu
------
@ -310,76 +312,4 @@ and ``--model`` are optional and enable additional tests:
# make sure you are using recent pytest version
python -m pip install -U pytest
python -m pytest <spacy-directory>
🛠 Changelog
============
=========== ============== ===========
Version Date Description
=========== ============== ===========
`v1.8.2`_ ``2017-04-26`` French model and small improvements
`v1.8.1`_ ``2017-04-23`` Saving, loading and training bug fixes
`v1.8.0`_ ``2017-04-16`` Better NER training, saving and loading
`v1.7.5`_ ``2017-04-07`` Bug fixes and new CLI commands
`v1.7.3`_ ``2017-03-26`` Alpha support for Hebrew, new CLI commands and bug fixes
`v1.7.2`_ ``2017-03-20`` Small fixes to beam parser and model linking
`v1.7.1`_ ``2017-03-19`` Fix data download for system installation
`v1.7.0`_ ``2017-03-18`` New 50 MB model, CLI, better downloads and lots of bug fixes
`v1.6.0`_ ``2017-01-16`` Improvements to tokenizer and tests
`v1.5.0`_ ``2016-12-27`` Alpha support for Swedish and Hungarian
`v1.4.0`_ ``2016-12-18`` Improved language data and alpha Dutch support
`v1.3.0`_ ``2016-12-03`` Improve API consistency
`v1.2.0`_ ``2016-11-04`` Alpha tokenizers for Chinese, French, Spanish, Italian and Portuguese
`v1.1.0`_ ``2016-10-23`` Bug fixes and adjustments
`v1.0.0`_ ``2016-10-18`` Support for deep learning workflows and entity-aware rule matcher
`v0.101.0`_ ``2016-05-10`` Fixed German model
`v0.100.7`_ ``2016-05-05`` German support
`v0.100.6`_ ``2016-03-08`` Add support for GloVe vectors
`v0.100.5`_ ``2016-02-07`` Fix incorrect use of header file
`v0.100.4`_ ``2016-02-07`` Fix OSX problem introduced in 0.100.3
`v0.100.3`_ ``2016-02-06`` Multi-threading, faster loading and bugfixes
`v0.100.2`_ ``2016-01-21`` Fix data version lock
`v0.100.1`_ ``2016-01-21`` Fix install for OSX
`v0.100`_ ``2016-01-19`` Revise setup.py, better model downloads, bug fixes
`v0.99`_ ``2015-11-08`` Improve span merging, internal refactoring
`v0.98`_ ``2015-11-03`` Smaller package, bug fixes
`v0.97`_ ``2015-10-23`` Load the StringStore from a json list, instead of a text file
`v0.96`_ ``2015-10-19`` Hotfix to .merge method
`v0.95`_ ``2015-10-18`` Bug fixes
`v0.94`_ ``2015-10-09`` Fix memory and parse errors
`v0.93`_ ``2015-09-22`` Bug fixes to word vectors
=========== ============== ===========
.. _v1.8.2: https://github.com/explosion/spaCy/releases/tag/v1.8.2
.. _v1.8.1: https://github.com/explosion/spaCy/releases/tag/v1.8.1
.. _v1.8.0: https://github.com/explosion/spaCy/releases/tag/v1.8.0
.. _v1.7.5: https://github.com/explosion/spaCy/releases/tag/v1.7.5
.. _v1.7.3: https://github.com/explosion/spaCy/releases/tag/v1.7.3
.. _v1.7.2: https://github.com/explosion/spaCy/releases/tag/v1.7.2
.. _v1.7.1: https://github.com/explosion/spaCy/releases/tag/v1.7.1
.. _v1.7.0: https://github.com/explosion/spaCy/releases/tag/v1.7.0
.. _v1.6.0: https://github.com/explosion/spaCy/releases/tag/v1.6.0
.. _v1.5.0: https://github.com/explosion/spaCy/releases/tag/v1.5.0
.. _v1.4.0: https://github.com/explosion/spaCy/releases/tag/v1.4.0
.. _v1.3.0: https://github.com/explosion/spaCy/releases/tag/v1.3.0
.. _v1.2.0: https://github.com/explosion/spaCy/releases/tag/v1.2.0
.. _v1.1.0: https://github.com/explosion/spaCy/releases/tag/v1.1.0
.. _v1.0.0: https://github.com/explosion/spaCy/releases/tag/v1.0.0
.. _v0.101.0: https://github.com/explosion/spaCy/releases/tag/0.101.0
.. _v0.100.7: https://github.com/explosion/spaCy/releases/tag/0.100.7
.. _v0.100.6: https://github.com/explosion/spaCy/releases/tag/0.100.6
.. _v0.100.5: https://github.com/explosion/spaCy/releases/tag/0.100.5
.. _v0.100.4: https://github.com/explosion/spaCy/releases/tag/0.100.4
.. _v0.100.3: https://github.com/explosion/spaCy/releases/tag/0.100.3
.. _v0.100.2: https://github.com/explosion/spaCy/releases/tag/0.100.2
.. _v0.100.1: https://github.com/explosion/spaCy/releases/tag/0.100.1
.. _v0.100: https://github.com/explosion/spaCy/releases/tag/0.100
.. _v0.99: https://github.com/explosion/spaCy/releases/tag/0.99
.. _v0.98: https://github.com/explosion/spaCy/releases/tag/0.98
.. _v0.97: https://github.com/explosion/spaCy/releases/tag/0.97
.. _v0.96: https://github.com/explosion/spaCy/releases/tag/0.96
.. _v0.95: https://github.com/explosion/spaCy/releases/tag/0.95
.. _v0.94: https://github.com/explosion/spaCy/releases/tag/0.94
.. _v0.93: https://github.com/explosion/spaCy/releases/tag/0.93

View File

@ -1,93 +0,0 @@
#!/usr/bin/env python
from __future__ import unicode_literals, print_function
import plac
import joblib
from os import path
import os
import bz2
import ujson
from preshed.counter import PreshCounter
from joblib import Parallel, delayed
import io
from spacy.en import English
from spacy.strings import StringStore
from spacy.attrs import ORTH
from spacy.tokenizer import Tokenizer
from spacy.vocab import Vocab
def iter_comments(loc):
with bz2.BZ2File(loc) as file_:
for line in file_:
yield ujson.loads(line)
def count_freqs(input_loc, output_loc):
print(output_loc)
vocab = English.default_vocab(get_lex_attr=None)
tokenizer = Tokenizer.from_dir(vocab,
path.join(English.default_data_dir(), 'tokenizer'))
counts = PreshCounter()
for json_comment in iter_comments(input_loc):
doc = tokenizer(json_comment['body'])
doc.count_by(ORTH, counts=counts)
with io.open(output_loc, 'w', 'utf8') as file_:
for orth, freq in counts:
string = tokenizer.vocab.strings[orth]
if not string.isspace():
file_.write('%d\t%s\n' % (freq, string))
def parallelize(func, iterator, n_jobs):
Parallel(n_jobs=n_jobs)(delayed(func)(*item) for item in iterator)
def merge_counts(locs, out_loc):
string_map = StringStore()
counts = PreshCounter()
for loc in locs:
with io.open(loc, 'r', encoding='utf8') as file_:
for line in file_:
freq, word = line.strip().split('\t', 1)
orth = string_map[word]
counts.inc(orth, int(freq))
with io.open(out_loc, 'w', encoding='utf8') as file_:
for orth, count in counts:
string = string_map[orth]
file_.write('%d\t%s\n' % (count, string))
@plac.annotations(
input_loc=("Location of input file list"),
freqs_dir=("Directory for frequency files"),
output_loc=("Location for output file"),
n_jobs=("Number of workers", "option", "n", int),
skip_existing=("Skip inputs where an output file exists", "flag", "s", bool),
)
def main(input_loc, freqs_dir, output_loc, n_jobs=2, skip_existing=False):
tasks = []
outputs = []
for input_path in open(input_loc):
input_path = input_path.strip()
if not input_path:
continue
filename = input_path.split('/')[-1]
output_path = path.join(freqs_dir, filename.replace('bz2', 'freq'))
outputs.append(output_path)
if not path.exists(output_path) or not skip_existing:
tasks.append((input_path, output_path))
if tasks:
parallelize(count_freqs, tasks, n_jobs)
print("Merge")
merge_counts(outputs, output_loc)
if __name__ == '__main__':
plac.call(main)

View File

@ -1,89 +0,0 @@
#!/usr/bin/env python
from __future__ import unicode_literals
from xml.etree import cElementTree as ElementTree
import json
import re
import plac
from pathlib import Path
from os import path
escaped_tokens = {
'-LRB-': '(',
'-RRB-': ')',
'-LSB-': '[',
'-RSB-': ']',
'-LCB-': '{',
'-RCB-': '}',
}
def read_parses(parse_loc):
offset = 0
doc = []
for parse in open(str(parse_loc) + '.dep').read().strip().split('\n\n'):
parse = _adjust_token_ids(parse, offset)
offset += len(parse.split('\n'))
doc.append(parse)
return doc
def _adjust_token_ids(parse, offset):
output = []
for line in parse.split('\n'):
pieces = line.split()
pieces[0] = str(int(pieces[0]) + offset)
pieces[5] = str(int(pieces[5]) + offset) if pieces[5] != '0' else '0'
output.append('\t'.join(pieces))
return '\n'.join(output)
def _fmt_doc(filename, paras):
return {'id': filename, 'paragraphs': [_fmt_para(*para) for para in paras]}
def _fmt_para(raw, sents):
return {'raw': raw, 'sentences': [_fmt_sent(sent) for sent in sents]}
def _fmt_sent(sent):
return {
'tokens': [_fmt_token(*t.split()) for t in sent.strip().split('\n')],
'brackets': []}
def _fmt_token(id_, word, hyph, pos, ner, head, dep, blank1, blank2, blank3):
head = int(head) - 1
id_ = int(id_) - 1
head = (head - id_) if head != -1 else 0
return {'id': id_, 'orth': word, 'tag': pos, 'dep': dep, 'head': head}
tags_re = re.compile(r'<[\w\?/][^>]+>')
def main(out_dir, ewtb_dir='/usr/local/data/eng_web_tbk'):
ewtb_dir = Path(ewtb_dir)
out_dir = Path(out_dir)
if not out_dir.exists():
out_dir.mkdir()
for genre_dir in ewtb_dir.joinpath('data').iterdir():
#if 'answers' in str(genre_dir): continue
parse_dir = genre_dir.joinpath('penntree')
docs = []
for source_loc in genre_dir.joinpath('source').joinpath('source_original').iterdir():
filename = source_loc.parts[-1].replace('.sgm.sgm', '')
filename = filename.replace('.xml', '')
filename = filename.replace('.txt', '')
parse_loc = parse_dir.joinpath(filename + '.xml.tree')
parses = read_parses(parse_loc)
source = source_loc.open().read().strip()
if 'answers' in str(genre_dir):
source = tags_re.sub('', source).strip()
docs.append(_fmt_doc(filename, [[source, parses]]))
out_loc = out_dir.joinpath(genre_dir.parts[-1] + '.json')
with open(str(out_loc), 'w') as out_file:
out_file.write(json.dumps(docs, indent=4))
if __name__ == '__main__':
plac.call(main)

View File

@ -1,32 +0,0 @@
import io
import plac
from spacy.en import English
def main(text_loc):
with io.open(text_loc, 'r', encoding='utf8') as file_:
text = file_.read()
NLU = English()
for paragraph in text.split('\n\n'):
tokens = NLU(paragraph)
ent_starts = {}
ent_ends = {}
for span in tokens.ents:
ent_starts[span.start] = span.label_
ent_ends[span.end] = span.label_
output = []
for token in tokens:
if token.i in ent_starts:
output.append('<%s>' % ent_starts[token.i])
output.append(token.orth_)
if (token.i+1) in ent_ends:
output.append('</%s>' % ent_ends[token.i+1])
output.append('\n\n')
print ' '.join(output)
if __name__ == '__main__':
plac.call(main)

View File

@ -1,157 +0,0 @@
#!/usr/bin/env python
from __future__ import division
from __future__ import unicode_literals
import os
from os import path
import shutil
import io
import random
import time
import gzip
import plac
import cProfile
import pstats
import spacy.util
from spacy.en import English
from spacy.gold import GoldParse
from spacy.syntax.util import Config
from spacy.syntax.arc_eager import ArcEager
from spacy.syntax.parser import Parser
from spacy.scorer import Scorer
from spacy.tagger import Tagger
# Last updated for spaCy v0.97
def read_conll(file_):
"""Read a standard CoNLL/MALT-style format"""
sents = []
for sent_str in file_.read().strip().split('\n\n'):
ids = []
words = []
heads = []
labels = []
tags = []
for i, line in enumerate(sent_str.split('\n')):
word, pos_string, head_idx, label = _parse_line(line)
words.append(word)
if head_idx < 0:
head_idx = i
ids.append(i)
heads.append(head_idx)
labels.append(label)
tags.append(pos_string)
text = ' '.join(words)
annot = (ids, words, tags, heads, labels, ['O'] * len(ids))
sents.append((None, [(annot, [])]))
return sents
def _parse_line(line):
pieces = line.split()
if len(pieces) == 4:
word, pos, head_idx, label = pieces
head_idx = int(head_idx)
elif len(pieces) == 15:
id_ = int(pieces[0].split('_')[-1])
word = pieces[1]
pos = pieces[4]
head_idx = int(pieces[8])-1
label = pieces[10]
else:
id_ = int(pieces[0].split('_')[-1])
word = pieces[1]
pos = pieces[4]
head_idx = int(pieces[6])-1
label = pieces[7]
if head_idx == 0:
label = 'ROOT'
return word, pos, head_idx, label
def score_model(scorer, nlp, raw_text, annot_tuples, verbose=False):
tokens = nlp.tokenizer.tokens_from_list(annot_tuples[1])
nlp.tagger(tokens)
nlp.parser(tokens)
gold = GoldParse(tokens, annot_tuples, make_projective=False)
scorer.score(tokens, gold, verbose=verbose, punct_labels=('--', 'p', 'punct'))
def train(Language, gold_tuples, model_dir, n_iter=15, feat_set=u'basic', seed=0,
gold_preproc=False, force_gold=False):
dep_model_dir = path.join(model_dir, 'deps')
pos_model_dir = path.join(model_dir, 'pos')
if path.exists(dep_model_dir):
shutil.rmtree(dep_model_dir)
if path.exists(pos_model_dir):
shutil.rmtree(pos_model_dir)
os.mkdir(dep_model_dir)
os.mkdir(pos_model_dir)
Config.write(dep_model_dir, 'config', features=feat_set, seed=seed,
labels=ArcEager.get_labels(gold_tuples))
nlp = Language(data_dir=model_dir, tagger=False, parser=False, entity=False)
nlp.tagger = Tagger.blank(nlp.vocab, Tagger.default_templates())
nlp.parser = Parser.from_dir(dep_model_dir, nlp.vocab.strings, ArcEager)
print("Itn.\tP.Loss\tUAS\tNER F.\tTag %\tToken %")
for itn in range(n_iter):
scorer = Scorer()
loss = 0
for _, sents in gold_tuples:
for annot_tuples, _ in sents:
if len(annot_tuples[1]) == 1:
continue
score_model(scorer, nlp, None, annot_tuples, verbose=False)
tokens = nlp.tokenizer.tokens_from_list(annot_tuples[1])
nlp.tagger(tokens)
gold = GoldParse(tokens, annot_tuples, make_projective=True)
if not gold.is_projective:
raise Exception(
"Non-projective sentence in training, after we should "
"have enforced projectivity: %s" % annot_tuples
)
loss += nlp.parser.train(tokens, gold)
nlp.tagger.train(tokens, gold.tags)
random.shuffle(gold_tuples)
print('%d:\t%d\t%.3f\t%.3f\t%.3f' % (itn, loss, scorer.uas,
scorer.tags_acc, scorer.token_acc))
print('end training')
nlp.end_training(model_dir)
print('done')
@plac.annotations(
train_loc=("Location of CoNLL 09 formatted training file"),
dev_loc=("Location of CoNLL 09 formatted development file"),
model_dir=("Location of output model directory"),
eval_only=("Skip training, and only evaluate", "flag", "e", bool),
n_iter=("Number of training iterations", "option", "i", int),
)
def main(train_loc, dev_loc, model_dir, n_iter=15):
with io.open(train_loc, 'r', encoding='utf8') as file_:
train_sents = read_conll(file_)
if not eval_only:
train(English, train_sents, model_dir, n_iter=n_iter)
nlp = English(data_dir=model_dir)
dev_sents = read_conll(io.open(dev_loc, 'r', encoding='utf8'))
scorer = Scorer()
for _, sents in dev_sents:
for annot_tuples, _ in sents:
score_model(scorer, nlp, None, annot_tuples)
print('TOK', 100-scorer.token_acc)
print('POS', scorer.tags_acc)
print('UAS', scorer.uas)
print('LAS', scorer.las)
if __name__ == '__main__':
plac.call(main)

View File

@ -1,187 +0,0 @@
#!/usr/bin/env python
from __future__ import division
from __future__ import unicode_literals
from __future__ import print_function
import os
from os import path
import shutil
import io
import random
import plac
import re
import spacy.util
from spacy.syntax.util import Config
from spacy.gold import read_json_file
from spacy.gold import GoldParse
from spacy.gold import merge_sents
from spacy.scorer import Scorer
from spacy.syntax.arc_eager import ArcEager
from spacy.syntax.ner import BiluoPushDown
from spacy.tagger import Tagger
from spacy.syntax.parser import Parser
from spacy.syntax.nonproj import PseudoProjectivity
def _corrupt(c, noise_level):
if random.random() >= noise_level:
return c
elif c == ' ':
return '\n'
elif c == '\n':
return ' '
elif c in ['.', "'", "!", "?"]:
return ''
else:
return c.lower()
def add_noise(orig, noise_level):
if random.random() >= noise_level:
return orig
elif type(orig) == list:
corrupted = [_corrupt(word, noise_level) for word in orig]
corrupted = [w for w in corrupted if w]
return corrupted
else:
return ''.join(_corrupt(c, noise_level) for c in orig)
def score_model(scorer, nlp, raw_text, annot_tuples, verbose=False):
if raw_text is None:
tokens = nlp.tokenizer.tokens_from_list(annot_tuples[1])
else:
tokens = nlp.tokenizer(raw_text)
nlp.tagger(tokens)
nlp.entity(tokens)
nlp.parser(tokens)
gold = GoldParse(tokens, annot_tuples)
scorer.score(tokens, gold, verbose=verbose)
def train(Language, train_data, dev_data, model_dir, tagger_cfg, parser_cfg, entity_cfg,
n_iter=15, seed=0, gold_preproc=False, n_sents=0, corruption_level=0):
print("Itn.\tN weight\tN feats\tUAS\tNER F.\tTag %\tToken %")
format_str = '{:d}\t{:d}\t{:d}\t{uas:.3f}\t{ents_f:.3f}\t{tags_acc:.3f}\t{token_acc:.3f}'
with Language.train(model_dir, train_data,
tagger_cfg, parser_cfg, entity_cfg) as trainer:
loss = 0
for itn, epoch in enumerate(trainer.epochs(n_iter, gold_preproc=gold_preproc,
augment_data=None)):
for doc, gold in epoch:
trainer.update(doc, gold)
dev_scores = trainer.evaluate(dev_data, gold_preproc=gold_preproc)
print(format_str.format(itn, trainer.nlp.parser.model.nr_weight,
trainer.nlp.parser.model.nr_active_feat, **dev_scores.scores))
def evaluate(Language, gold_tuples, model_dir, gold_preproc=False, verbose=False,
beam_width=None, cand_preproc=None):
print("Load parser", model_dir)
nlp = Language(path=model_dir)
if nlp.lang == 'de':
nlp.vocab.morphology.lemmatizer = lambda string,pos: set([string])
if beam_width is not None:
nlp.parser.cfg.beam_width = beam_width
scorer = Scorer()
for raw_text, sents in gold_tuples:
if gold_preproc:
raw_text = None
else:
sents = merge_sents(sents)
for annot_tuples, brackets in sents:
if raw_text is None:
tokens = nlp.tokenizer.tokens_from_list(annot_tuples[1])
nlp.tagger(tokens)
nlp.parser(tokens)
nlp.entity(tokens)
else:
tokens = nlp(raw_text)
gold = GoldParse.from_annot_tuples(tokens, annot_tuples)
scorer.score(tokens, gold, verbose=verbose)
return scorer
def write_parses(Language, dev_loc, model_dir, out_loc):
nlp = Language(data_dir=model_dir)
gold_tuples = read_json_file(dev_loc)
scorer = Scorer()
out_file = io.open(out_loc, 'w', 'utf8')
for raw_text, sents in gold_tuples:
sents = _merge_sents(sents)
for annot_tuples, brackets in sents:
if raw_text is None:
tokens = nlp.tokenizer.tokens_from_list(annot_tuples[1])
nlp.tagger(tokens)
nlp.entity(tokens)
nlp.parser(tokens)
else:
tokens = nlp(raw_text)
#gold = GoldParse(tokens, annot_tuples)
#scorer.score(tokens, gold, verbose=False)
for sent in tokens.sents:
for t in sent:
if not t.is_space:
out_file.write(
'%d\t%s\t%s\t%s\t%s\n' % (t.i, t.orth_, t.tag_, t.head.orth_, t.dep_)
)
out_file.write('\n')
@plac.annotations(
language=("The language to train", "positional", None, str, ['en','de', 'zh']),
train_loc=("Location of training file or directory"),
dev_loc=("Location of development file or directory"),
model_dir=("Location of output model directory",),
eval_only=("Skip training, and only evaluate", "flag", "e", bool),
corruption_level=("Amount of noise to add to training data", "option", "c", float),
gold_preproc=("Use gold-standard sentence boundaries in training?", "flag", "g", bool),
out_loc=("Out location", "option", "o", str),
n_sents=("Number of training sentences", "option", "n", int),
n_iter=("Number of training iterations", "option", "i", int),
verbose=("Verbose error reporting", "flag", "v", bool),
debug=("Debug mode", "flag", "d", bool),
pseudoprojective=("Use pseudo-projective parsing", "flag", "p", bool),
L1=("L1 regularization penalty", "option", "L", float),
)
def main(language, train_loc, dev_loc, model_dir, n_sents=0, n_iter=15, out_loc="", verbose=False,
debug=False, corruption_level=0.0, gold_preproc=False, eval_only=False, pseudoprojective=False,
L1=1e-6):
parser_cfg = dict(locals())
tagger_cfg = dict(locals())
entity_cfg = dict(locals())
lang = spacy.util.get_lang_class(language)
parser_cfg['features'] = lang.Defaults.parser_features
entity_cfg['features'] = lang.Defaults.entity_features
if not eval_only:
gold_train = list(read_json_file(train_loc))
gold_dev = list(read_json_file(dev_loc))
if n_sents > 0:
gold_train = gold_train[:n_sents]
train(lang, gold_train, gold_dev, model_dir, tagger_cfg, parser_cfg, entity_cfg,
n_sents=n_sents, gold_preproc=gold_preproc, corruption_level=corruption_level,
n_iter=n_iter)
if out_loc:
write_parses(lang, dev_loc, model_dir, out_loc)
scorer = evaluate(lang, list(read_json_file(dev_loc)),
model_dir, gold_preproc=gold_preproc, verbose=verbose)
print('TOK', scorer.token_acc)
print('POS', scorer.tags_acc)
print('UAS', scorer.uas)
print('LAS', scorer.las)
print('NER P', scorer.ents_p)
print('NER R', scorer.ents_r)
print('NER F', scorer.ents_f)
if __name__ == '__main__':
plac.call(main)

View File

@ -1,201 +0,0 @@
from __future__ import unicode_literals, print_function
import plac
import json
import random
import pathlib
from spacy.tokens import Doc
from spacy.syntax.nonproj import PseudoProjectivity
from spacy.language import Language
from spacy.gold import GoldParse
from spacy.tagger import Tagger
from spacy.pipeline import DependencyParser, TokenVectorEncoder
from spacy.syntax.parser import get_templates
from spacy.syntax.arc_eager import ArcEager
from spacy.scorer import Scorer
from spacy.language_data.tag_map import TAG_MAP as DEFAULT_TAG_MAP
import spacy.attrs
import io
from thinc.neural.ops import CupyOps
from thinc.neural import Model
from spacy.es import Spanish
from spacy.attrs import POS
from thinc.neural import Model
try:
import cupy
from thinc.neural.ops import CupyOps
except:
cupy = None
def read_conllx(loc, n=0):
with io.open(loc, 'r', encoding='utf8') as file_:
text = file_.read()
i = 0
for sent in text.strip().split('\n\n'):
lines = sent.strip().split('\n')
if lines:
while lines[0].startswith('#'):
lines.pop(0)
tokens = []
for line in lines:
id_, word, lemma, pos, tag, morph, head, dep, _1, \
_2 = line.split('\t')
if '-' in id_ or '.' in id_:
continue
try:
id_ = int(id_) - 1
head = (int(head) - 1) if head != '0' else id_
dep = 'ROOT' if dep == 'root' else dep #'unlabelled'
tag = pos+'__'+dep+'__'+morph
Spanish.Defaults.tag_map[tag] = {POS: pos}
tokens.append((id_, word, tag, head, dep, 'O'))
except:
raise
tuples = [list(t) for t in zip(*tokens)]
yield (None, [[tuples, []]])
i += 1
if n >= 1 and i >= n:
break
def score_model(vocab, encoder, parser, Xs, ys, verbose=False):
scorer = Scorer()
correct = 0.
total = 0.
for doc, gold in zip(Xs, ys):
doc = Doc(vocab, words=[w.text for w in doc])
encoder(doc)
parser(doc)
PseudoProjectivity.deprojectivize(doc)
scorer.score(doc, gold, verbose=verbose)
for token, tag in zip(doc, gold.tags):
if '_' in token.tag_:
univ_guess, _ = token.tag_.split('_', 1)
else:
univ_guess = ''
univ_truth, _ = tag.split('_', 1)
correct += univ_guess == univ_truth
total += 1
return scorer
def organize_data(vocab, train_sents):
Xs = []
ys = []
for _, doc_sents in train_sents:
for (ids, words, tags, heads, deps, ner), _ in doc_sents:
doc = Doc(vocab, words=words)
gold = GoldParse(doc, tags=tags, heads=heads, deps=deps)
Xs.append(doc)
ys.append(gold)
return Xs, ys
def main(lang_name, train_loc, dev_loc, model_dir, clusters_loc=None):
LangClass = spacy.util.get_lang_class(lang_name)
train_sents = list(read_conllx(train_loc))
dev_sents = list(read_conllx(dev_loc))
train_sents = PseudoProjectivity.preprocess_training_data(train_sents)
actions = ArcEager.get_actions(gold_parses=train_sents)
features = get_templates('basic')
model_dir = pathlib.Path(model_dir)
if not model_dir.exists():
model_dir.mkdir()
if not (model_dir / 'deps').exists():
(model_dir / 'deps').mkdir()
if not (model_dir / 'pos').exists():
(model_dir / 'pos').mkdir()
with (model_dir / 'deps' / 'config.json').open('wb') as file_:
file_.write(
json.dumps(
{'pseudoprojective': True, 'labels': actions, 'features': features}).encode('utf8'))
vocab = LangClass.Defaults.create_vocab()
if not (model_dir / 'vocab').exists():
(model_dir / 'vocab').mkdir()
else:
if (model_dir / 'vocab' / 'strings.json').exists():
with (model_dir / 'vocab' / 'strings.json').open() as file_:
vocab.strings.load(file_)
if (model_dir / 'vocab' / 'lexemes.bin').exists():
vocab.load_lexemes(model_dir / 'vocab' / 'lexemes.bin')
if clusters_loc is not None:
clusters_loc = pathlib.Path(clusters_loc)
with clusters_loc.open() as file_:
for line in file_:
try:
cluster, word, freq = line.split()
except ValueError:
continue
lex = vocab[word]
lex.cluster = int(cluster[::-1], 2)
# Populate vocab
for _, doc_sents in train_sents:
for (ids, words, tags, heads, deps, ner), _ in doc_sents:
for word in words:
_ = vocab[word]
for dep in deps:
_ = vocab[dep]
for tag in tags:
_ = vocab[tag]
if vocab.morphology.tag_map:
for tag in tags:
vocab.morphology.tag_map[tag] = {POS: tag.split('__', 1)[0]}
tagger = Tagger(vocab)
encoder = TokenVectorEncoder(vocab, width=64)
parser = DependencyParser(vocab, actions=actions, features=features, L1=0.0)
Xs, ys = organize_data(vocab, train_sents)
dev_Xs, dev_ys = organize_data(vocab, dev_sents)
with encoder.model.begin_training(Xs[:100], ys[:100]) as (trainer, optimizer):
docs = list(Xs)
for doc in docs:
encoder(doc)
nn_loss = [0.]
def track_progress():
with encoder.tagger.use_params(optimizer.averages):
with parser.model.use_params(optimizer.averages):
scorer = score_model(vocab, encoder, parser, dev_Xs, dev_ys)
itn = len(nn_loss)
print('%d:\t%.3f\t%.3f\t%.3f' % (itn, nn_loss[-1], scorer.uas, scorer.tags_acc))
nn_loss.append(0.)
track_progress()
trainer.each_epoch.append(track_progress)
trainer.batch_size = 24
trainer.nb_epoch = 40
for docs, golds in trainer.iterate(Xs, ys, progress_bar=True):
docs = [Doc(vocab, words=[w.text for w in doc]) for doc in docs]
tokvecs, upd_tokvecs = encoder.begin_update(docs)
for doc, tokvec in zip(docs, tokvecs):
doc.tensor = tokvec
d_tokvecs = parser.update(docs, golds, sgd=optimizer)
upd_tokvecs(d_tokvecs, sgd=optimizer)
encoder.update(docs, golds, sgd=optimizer)
nlp = LangClass(vocab=vocab, parser=parser)
scorer = score_model(vocab, encoder, parser, read_conllx(dev_loc))
print('%d:\t%.3f\t%.3f\t%.3f' % (itn, scorer.uas, scorer.las, scorer.tags_acc))
#nlp.end_training(model_dir)
#scorer = score_model(vocab, tagger, parser, read_conllx(dev_loc))
#print('%d:\t%.3f\t%.3f\t%.3f' % (itn, scorer.uas, scorer.las, scorer.tags_acc))
if __name__ == '__main__':
import cProfile
import pstats
if 1:
plac.call(main)
else:
cProfile.runctx("plac.call(main)", globals(), locals(), "Profile.prof")
s = pstats.Stats("Profile.prof")
s.strip_dirs().sort_stats("time").print_stats()
plac.call(main)

View File

@ -1,194 +0,0 @@
"""Convert OntoNotes into a json format.
doc: {
id: string,
paragraphs: [{
raw: string,
sents: [int],
tokens: [{
start: int,
tag: string,
head: int,
dep: string}],
ner: [{
start: int,
end: int,
label: string}],
brackets: [{
start: int,
end: int,
label: string}]}]}
Consumes output of spacy/munge/align_raw.py
"""
from __future__ import unicode_literals
import plac
import json
from os import path
import os
import re
import io
from collections import defaultdict
from spacy.munge import read_ptb
from spacy.munge import read_conll
from spacy.munge import read_ner
def _iter_raw_files(raw_loc):
files = json.load(open(raw_loc))
for f in files:
yield f
def format_doc(file_id, raw_paras, ptb_text, dep_text, ner_text):
ptb_sents = read_ptb.split(ptb_text)
dep_sents = read_conll.split(dep_text)
if len(ptb_sents) != len(dep_sents):
return None
if ner_text is not None:
ner_sents = read_ner.split(ner_text)
else:
ner_sents = [None] * len(ptb_sents)
i = 0
doc = {'id': file_id}
if raw_paras is None:
doc['paragraphs'] = [format_para(None, ptb_sents, dep_sents, ner_sents)]
#for ptb_sent, dep_sent, ner_sent in zip(ptb_sents, dep_sents, ner_sents):
# doc['paragraphs'].append(format_para(None, [ptb_sent], [dep_sent], [ner_sent]))
else:
doc['paragraphs'] = []
for raw_sents in raw_paras:
para = format_para(
' '.join(raw_sents).replace('<SEP>', ''),
ptb_sents[i:i+len(raw_sents)],
dep_sents[i:i+len(raw_sents)],
ner_sents[i:i+len(raw_sents)])
if para['sentences']:
doc['paragraphs'].append(para)
i += len(raw_sents)
return doc
def format_para(raw_text, ptb_sents, dep_sents, ner_sents):
para = {'raw': raw_text, 'sentences': []}
offset = 0
assert len(ptb_sents) == len(dep_sents) == len(ner_sents)
for ptb_text, dep_text, ner_text in zip(ptb_sents, dep_sents, ner_sents):
_, deps = read_conll.parse(dep_text, strip_bad_periods=True)
if deps and 'VERB' in [t['tag'] for t in deps]:
continue
if ner_text is not None:
_, ner = read_ner.parse(ner_text, strip_bad_periods=True)
else:
ner = ['-' for _ in deps]
_, brackets = read_ptb.parse(ptb_text, strip_bad_periods=True)
# Necessary because the ClearNLP converter deletes EDITED words.
if len(ner) != len(deps):
ner = ['-' for _ in deps]
para['sentences'].append(format_sentence(deps, ner, brackets))
return para
def format_sentence(deps, ner, brackets):
sent = {'tokens': [], 'brackets': []}
for token_id, (token, token_ent) in enumerate(zip(deps, ner)):
sent['tokens'].append(format_token(token_id, token, token_ent))
for label, start, end in brackets:
if start != end:
sent['brackets'].append({
'label': label,
'first': start,
'last': (end-1)})
return sent
def format_token(token_id, token, ner):
assert token_id == token['id']
head = (token['head'] - token_id) if token['head'] != -1 else 0
return {
'id': token_id,
'orth': token['word'],
'tag': token['tag'],
'head': head,
'dep': token['dep'],
'ner': ner}
def read_file(*pieces):
loc = path.join(*pieces)
if not path.exists(loc):
return None
else:
return io.open(loc, 'r', encoding='utf8').read().strip()
def get_file_names(section_dir, subsection):
filenames = []
for fn in os.listdir(path.join(section_dir, subsection)):
filenames.append(fn.rsplit('.', 1)[0])
return list(sorted(set(filenames)))
def read_wsj_with_source(onto_dir, raw_dir):
# Now do WSJ, with source alignment
onto_dir = path.join(onto_dir, 'data', 'english', 'annotations', 'nw', 'wsj')
docs = {}
for i in range(25):
section = str(i) if i >= 10 else ('0' + str(i))
raw_loc = path.join(raw_dir, 'wsj%s.json' % section)
for j, (filename, raw_paras) in enumerate(_iter_raw_files(raw_loc)):
if section == '00':
j += 1
if section == '04' and filename == '55':
continue
ptb = read_file(onto_dir, section, '%s.parse' % filename)
dep = read_file(onto_dir, section, '%s.parse.dep' % filename)
ner = read_file(onto_dir, section, '%s.name' % filename)
if ptb is not None and dep is not None:
docs[filename] = format_doc(filename, raw_paras, ptb, dep, ner)
return docs
def get_doc(onto_dir, file_path, wsj_docs):
filename = file_path.rsplit('/', 1)[1]
if filename in wsj_docs:
return wsj_docs[filename]
else:
ptb = read_file(onto_dir, file_path + '.parse')
dep = read_file(onto_dir, file_path + '.parse.dep')
ner = read_file(onto_dir, file_path + '.name')
if ptb is not None and dep is not None:
return format_doc(filename, None, ptb, dep, ner)
else:
return None
def read_ids(loc):
return open(loc).read().strip().split('\n')
def main(onto_dir, raw_dir, out_dir):
wsj_docs = read_wsj_with_source(onto_dir, raw_dir)
for partition in ('train', 'test', 'development'):
ids = read_ids(path.join(onto_dir, '%s.id' % partition))
docs_by_genre = defaultdict(list)
for file_path in ids:
doc = get_doc(onto_dir, file_path, wsj_docs)
if doc is not None:
genre = file_path.split('/')[3]
docs_by_genre[genre].append(doc)
part_dir = path.join(out_dir, partition)
if not path.exists(part_dir):
os.mkdir(part_dir)
for genre, docs in sorted(docs_by_genre.items()):
out_loc = path.join(part_dir, genre + '.json')
with open(out_loc, 'w') as file_:
json.dump(docs, file_, indent=4)
if __name__ == '__main__':
plac.call(main)

View File

@ -1,13 +0,0 @@
"""Read a vector file, and prepare it as binary data, for easy consumption"""
import plac
from spacy.vocab import write_binary_vectors
def main(in_loc, out_loc):
write_binary_vectors(in_loc, out_loc)
if __name__ == '__main__':
plac.call(main)

View File

@ -1,175 +0,0 @@
#!/usr/bin/env python
from __future__ import division
from __future__ import unicode_literals
from __future__ import print_function
import os
from os import path
import shutil
import codecs
import random
import plac
import re
import spacy.util
from spacy.en import English
from spacy.tagger import Tagger
from spacy.syntax.util import Config
from spacy.gold import read_json_file
from spacy.gold import GoldParse
from spacy.scorer import Scorer
def score_model(scorer, nlp, raw_text, annot_tuples):
if raw_text is None:
tokens = nlp.tokenizer.tokens_from_list(annot_tuples[1])
else:
tokens = nlp.tokenizer(raw_text)
nlp.tagger(tokens)
gold = GoldParse(tokens, annot_tuples)
scorer.score(tokens, gold)
def _merge_sents(sents):
m_deps = [[], [], [], [], [], []]
m_brackets = []
i = 0
for (ids, words, tags, heads, labels, ner), brackets in sents:
m_deps[0].extend(id_ + i for id_ in ids)
m_deps[1].extend(words)
m_deps[2].extend(tags)
m_deps[3].extend(head + i for head in heads)
m_deps[4].extend(labels)
m_deps[5].extend(ner)
m_brackets.extend((b['first'] + i, b['last'] + i, b['label']) for b in brackets)
i += len(ids)
return [(m_deps, m_brackets)]
def train(Language, gold_tuples, model_dir, n_iter=15, feat_set=u'basic',
seed=0, gold_preproc=False, n_sents=0, corruption_level=0,
beam_width=1, verbose=False,
use_orig_arc_eager=False):
if n_sents > 0:
gold_tuples = gold_tuples[:n_sents]
templates = Tagger.default_templates()
nlp = Language(data_dir=model_dir, tagger=False)
nlp.tagger = Tagger.blank(nlp.vocab, templates)
print("Itn.\tP.Loss\tUAS\tNER F.\tTag %\tToken %")
for itn in range(n_iter):
scorer = Scorer()
loss = 0
for raw_text, sents in gold_tuples:
if gold_preproc:
raw_text = None
else:
sents = _merge_sents(sents)
for annot_tuples, ctnt in sents:
words = annot_tuples[1]
gold_tags = annot_tuples[2]
score_model(scorer, nlp, raw_text, annot_tuples)
if raw_text is None:
tokens = nlp.tokenizer.tokens_from_list(words)
else:
tokens = nlp.tokenizer(raw_text)
loss += nlp.tagger.train(tokens, gold_tags)
random.shuffle(gold_tuples)
print('%d:\t%d\t%.3f\t%.3f\t%.3f\t%.3f' % (itn, loss, scorer.uas, scorer.ents_f,
scorer.tags_acc,
scorer.token_acc))
nlp.end_training(model_dir)
def evaluate(Language, gold_tuples, model_dir, gold_preproc=False, verbose=False,
beam_width=None):
nlp = Language(data_dir=model_dir)
if beam_width is not None:
nlp.parser.cfg.beam_width = beam_width
scorer = Scorer()
for raw_text, sents in gold_tuples:
if gold_preproc:
raw_text = None
else:
sents = _merge_sents(sents)
for annot_tuples, brackets in sents:
if raw_text is None:
tokens = nlp.tokenizer.tokens_from_list(annot_tuples[1])
nlp.tagger(tokens)
nlp.entity(tokens)
nlp.parser(tokens)
else:
tokens = nlp(raw_text, merge_mwes=False)
gold = GoldParse(tokens, annot_tuples)
scorer.score(tokens, gold, verbose=verbose)
return scorer
def write_parses(Language, dev_loc, model_dir, out_loc, beam_width=None):
nlp = Language(data_dir=model_dir)
if beam_width is not None:
nlp.parser.cfg.beam_width = beam_width
gold_tuples = read_json_file(dev_loc)
scorer = Scorer()
out_file = codecs.open(out_loc, 'w', 'utf8')
for raw_text, sents in gold_tuples:
sents = _merge_sents(sents)
for annot_tuples, brackets in sents:
if raw_text is None:
tokens = nlp.tokenizer.tokens_from_list(annot_tuples[1])
nlp.tagger(tokens)
nlp.entity(tokens)
nlp.parser(tokens)
else:
tokens = nlp(raw_text, merge_mwes=False)
gold = GoldParse(tokens, annot_tuples)
scorer.score(tokens, gold, verbose=False)
for t in tokens:
out_file.write(
'%s\t%s\t%s\t%s\n' % (t.orth_, t.tag_, t.head.orth_, t.dep_)
)
return scorer
@plac.annotations(
train_loc=("Location of training file or directory"),
dev_loc=("Location of development file or directory"),
model_dir=("Location of output model directory",),
eval_only=("Skip training, and only evaluate", "flag", "e", bool),
corruption_level=("Amount of noise to add to training data", "option", "c", float),
gold_preproc=("Use gold-standard sentence boundaries in training?", "flag", "g", bool),
out_loc=("Out location", "option", "o", str),
n_sents=("Number of training sentences", "option", "n", int),
n_iter=("Number of training iterations", "option", "i", int),
verbose=("Verbose error reporting", "flag", "v", bool),
debug=("Debug mode", "flag", "d", bool),
)
def main(train_loc, dev_loc, model_dir, n_sents=0, n_iter=15, out_loc="", verbose=False,
debug=False, corruption_level=0.0, gold_preproc=False, eval_only=False):
if not eval_only:
gold_train = list(read_json_file(train_loc))
train(English, gold_train, model_dir,
feat_set='basic' if not debug else 'debug',
gold_preproc=gold_preproc, n_sents=n_sents,
corruption_level=corruption_level, n_iter=n_iter,
verbose=verbose)
#if out_loc:
# write_parses(English, dev_loc, model_dir, out_loc, beam_width=beam_width)
scorer = evaluate(English, list(read_json_file(dev_loc)),
model_dir, gold_preproc=gold_preproc, verbose=verbose)
print('TOK', scorer.token_acc)
print('POS', scorer.tags_acc)
print('UAS', scorer.uas)
print('LAS', scorer.las)
print('NER P', scorer.ents_p)
print('NER R', scorer.ents_r)
print('NER F', scorer.ents_f)
if __name__ == '__main__':
plac.call(main)

View File

@ -1,160 +0,0 @@
#!/usr/bin/env python
from __future__ import division
from __future__ import unicode_literals
import os
from os import path
import shutil
import io
import random
import time
import gzip
import ujson
import plac
import cProfile
import pstats
import spacy.util
from spacy.de import German
from spacy.gold import GoldParse
from spacy.tagger import Tagger
from spacy.scorer import PRFScore
from spacy.tagger import P2_orth, P2_cluster, P2_shape, P2_prefix, P2_suffix, P2_pos, P2_lemma, P2_flags
from spacy.tagger import P1_orth, P1_cluster, P1_shape, P1_prefix, P1_suffix, P1_pos, P1_lemma, P1_flags
from spacy.tagger import W_orth, W_cluster, W_shape, W_prefix, W_suffix, W_pos, W_lemma, W_flags
from spacy.tagger import N1_orth, N1_cluster, N1_shape, N1_prefix, N1_suffix, N1_pos, N1_lemma, N1_flags
from spacy.tagger import N2_orth, N2_cluster, N2_shape, N2_prefix, N2_suffix, N2_pos, N2_lemma, N2_flags, N_CONTEXT_FIELDS
def default_templates():
return spacy.tagger.Tagger.default_templates()
def default_templates_without_clusters():
return (
(W_orth,),
(P1_lemma, P1_pos),
(P2_lemma, P2_pos),
(N1_orth,),
(N2_orth,),
(W_suffix,),
(W_prefix,),
(P1_pos,),
(P2_pos,),
(P1_pos, P2_pos),
(P1_pos, W_orth),
(P1_suffix,),
(N1_suffix,),
(W_shape,),
(W_flags,),
(N1_flags,),
(N2_flags,),
(P1_flags,),
(P2_flags,),
)
def make_tagger(vocab, templates):
model = spacy.tagger.TaggerModel(templates)
return spacy.tagger.Tagger(vocab,model)
def read_conll(file_):
def sentences():
words, tags = [], []
for line in file_:
line = line.strip()
if line:
word, tag = line.split('\t')[1::3][:2] # get column 1 and 4 (CoNLL09)
words.append(word)
tags.append(tag)
elif words:
yield words, tags
words, tags = [], []
if words:
yield words, tags
return [ s for s in sentences() ]
def score_model(score, nlp, words, gold_tags):
tokens = nlp.tokenizer.tokens_from_list(words)
assert(len(tokens) == len(gold_tags))
nlp.tagger(tokens)
for token, gold_tag in zip(tokens,gold_tags):
score.score_set(set([token.tag_]),set([gold_tag]))
def train(Language, train_sents, dev_sents, model_dir, n_iter=15, seed=21):
# make shuffling deterministic
random.seed(seed)
# set up directory for model
pos_model_dir = path.join(model_dir, 'pos')
if path.exists(pos_model_dir):
shutil.rmtree(pos_model_dir)
os.mkdir(pos_model_dir)
nlp = Language(data_dir=model_dir, tagger=False, parser=False, entity=False)
nlp.tagger = make_tagger(nlp.vocab,default_templates())
print("Itn.\ttrain acc %\tdev acc %")
for itn in range(n_iter):
# train on train set
#train_acc = PRFScore()
correct, total = 0., 0.
for words, gold_tags in train_sents:
tokens = nlp.tokenizer.tokens_from_list(words)
correct += nlp.tagger.train(tokens, gold_tags)
total += len(words)
train_acc = correct/total
# test on dev set
dev_acc = PRFScore()
for words, gold_tags in dev_sents:
score_model(dev_acc, nlp, words, gold_tags)
random.shuffle(train_sents)
print('%d:\t%6.2f\t%6.2f' % (itn, 100*train_acc, 100*dev_acc.precision))
print('end training')
nlp.end_training(model_dir)
print('done')
@plac.annotations(
train_loc=("Location of CoNLL 09 formatted training file"),
dev_loc=("Location of CoNLL 09 formatted development file"),
model_dir=("Location of output model directory"),
eval_only=("Skip training, and only evaluate", "flag", "e", bool),
n_iter=("Number of training iterations", "option", "i", int),
)
def main(train_loc, dev_loc, model_dir, eval_only=False, n_iter=15):
# training
if not eval_only:
with io.open(train_loc, 'r', encoding='utf8') as trainfile_, \
io.open(dev_loc, 'r', encoding='utf8') as devfile_:
train_sents = read_conll(trainfile_)
dev_sents = read_conll(devfile_)
train(German, train_sents, dev_sents, model_dir, n_iter=n_iter)
# testing
with io.open(dev_loc, 'r', encoding='utf8') as file_:
dev_sents = read_conll(file_)
nlp = German(data_dir=model_dir)
dev_acc = PRFScore()
for words, gold_tags in dev_sents:
score_model(dev_acc, nlp, words, gold_tags)
print('POS: %6.2f %%' % (100*dev_acc.precision))
if __name__ == '__main__':
plac.call(main)

View File

@ -2,20 +2,18 @@
# spaCy examples
The examples are Python scripts with well-behaved command line interfaces. For a full list of spaCy tutorials and code snippets, see the [documentation](https://spacy.io/docs/usage/tutorials).
The examples are Python scripts with well-behaved command line interfaces. For
more detailed usage guides, see the [documentation](https://alpha.spacy.io/usage/).
## How to run an example
For example, to run the [`nn_text_class.py`](nn_text_class.py) script, do:
To see the available arguments, you can use the `--help` or `-h` flag:
```bash
$ python examples/nn_text_class.py
usage: nn_text_class.py [-h] [-d 3] [-H 300] [-i 5] [-w 40000] [-b 24]
[-r 0.3] [-p 1e-05] [-e 0.005]
data_dir
nn_text_class.py: error: too few arguments
$ python examples/training/train_ner.py --help
```
You can print detailed help with the `-h` argument.
While we try to keep the examples up to date, they are not currently exercised by the test suite, as some of them require significant data downloads or take time to train. If you find that an example is no longer running, [please tell us](https://github.com/explosion/spaCy/issues)! We know there's nothing worse than trying to figure out what you're doing wrong, and it turns out your code was never the problem.
While we try to keep the examples up to date, they are not currently exercised
by the test suite, as some of them require significant data downloads or take
time to train. If you find that an example is no longer running,
[please tell us](https://github.com/explosion/spaCy/issues)! We know there's
nothing worse than trying to figure out what you're doing wrong, and it turns
out your code was never the problem.

View File

@ -1,37 +0,0 @@
# encoding: utf8
from __future__ import unicode_literals, print_function
from math import sqrt
from numpy import dot
from numpy.linalg import norm
def handle_tweet(spacy, tweet_data, query):
text = tweet_data.get('text', u'')
# Twython returns either bytes or unicode, depending on tweet.
# ಠ_ಠ #APIshaming
try:
match_tweet(spacy, text, query)
except TypeError:
match_tweet(spacy, text.decode('utf8'), query)
def match_tweet(spacy, text, query):
def get_vector(word):
return spacy.vocab[word].repvec
tweet = spacy(text)
tweet = [w.repvec for w in tweet if w.is_alpha and w.lower_ != query]
if tweet:
accept = map(get_vector, 'child classroom teach'.split())
reject = map(get_vector, 'mouth hands giveaway'.split())
y = sum(max(cos(w1, w2), 0) for w1 in tweet for w2 in accept)
n = sum(max(cos(w1, w2), 0) for w1 in tweet for w2 in reject)
if (y / (y + n)) >= 0.5 or True:
print(text)
def cos(v1, v2):
return dot(v1, v2) / (norm(v1) * norm(v2))

View File

@ -1,322 +0,0 @@
'''WIP --- Doesn't work well yet'''
import plac
import random
import six
import cProfile
import pstats
import pathlib
import cPickle as pickle
from itertools import izip
import spacy
import cytoolz
import cupy as xp
import cupy.cuda
import chainer.cuda
import chainer.links as L
import chainer.functions as F
from chainer import Chain, Variable, report
import chainer.training
import chainer.optimizers
from chainer.training import extensions
from chainer.iterators import SerialIterator
from chainer.datasets import TupleDataset
class SentimentAnalyser(object):
@classmethod
def load(cls, path, nlp, max_length=100):
raise NotImplementedError
#with (path / 'config.json').open() as file_:
# model = model_from_json(file_.read())
#with (path / 'model').open('rb') as file_:
# lstm_weights = pickle.load(file_)
#embeddings = get_embeddings(nlp.vocab)
#model.set_weights([embeddings] + lstm_weights)
#return cls(model, max_length=max_length)
def __init__(self, model, max_length=100):
self._model = model
self.max_length = max_length
def __call__(self, doc):
X = get_features([doc], self.max_length)
y = self._model.predict(X)
self.set_sentiment(doc, y)
def pipe(self, docs, batch_size=1000, n_threads=2):
for minibatch in cytoolz.partition_all(batch_size, docs):
minibatch = list(minibatch)
sentences = []
for doc in minibatch:
sentences.extend(doc.sents)
Xs = get_features(sentences, self.max_length)
ys = self._model.predict(Xs)
for sent, label in zip(sentences, ys):
sent.doc.sentiment += label - 0.5
for doc in minibatch:
yield doc
def set_sentiment(self, doc, y):
doc.sentiment = float(y[0])
# Sentiment has a native slot for a single float.
# For arbitrary data storage, there's:
# doc.user_data['my_data'] = y
class Classifier(Chain):
def __init__(self, predictor):
super(Classifier, self).__init__(predictor=predictor)
def __call__(self, x, t):
y = self.predictor(x)
loss = F.softmax_cross_entropy(y, t)
accuracy = F.accuracy(y, t)
report({'loss': loss, 'accuracy': accuracy}, self)
return loss
class SentimentModel(Chain):
def __init__(self, nlp, shape, **settings):
Chain.__init__(self,
embed=_Embed(shape['nr_vector'], shape['nr_dim'], shape['nr_hidden'],
set_vectors=lambda arr: set_vectors(arr, nlp.vocab)),
encode=_Encode(shape['nr_hidden'], shape['nr_hidden']),
attend=_Attend(shape['nr_hidden'], shape['nr_hidden']),
predict=_Predict(shape['nr_hidden'], shape['nr_class']))
self.to_gpu(0)
def __call__(self, sentence):
return self.predict(
self.attend(
self.encode(
self.embed(sentence))))
class _Embed(Chain):
def __init__(self, nr_vector, nr_dim, nr_out, set_vectors=None):
Chain.__init__(self,
embed=L.EmbedID(nr_vector, nr_dim, initialW=set_vectors),
project=L.Linear(None, nr_out, nobias=True))
self.embed.W.volatile = False
def __call__(self, sentence):
return [self.project(self.embed(ts)) for ts in F.transpose(sentence)]
class _Encode(Chain):
def __init__(self, nr_in, nr_out):
Chain.__init__(self,
fwd=L.LSTM(nr_in, nr_out),
bwd=L.LSTM(nr_in, nr_out),
mix=L.Bilinear(nr_out, nr_out, nr_out))
def __call__(self, sentence):
self.fwd.reset_state()
fwds = map(self.fwd, sentence)
self.bwd.reset_state()
bwds = reversed(map(self.bwd, reversed(sentence)))
return [F.elu(self.mix(f, b)) for f, b in zip(fwds, bwds)]
class _Attend(Chain):
def __init__(self, nr_in, nr_out):
Chain.__init__(self)
def __call__(self, sentence):
sent = sum(sentence)
return sent
class _Predict(Chain):
def __init__(self, nr_in, nr_out):
Chain.__init__(self,
l1=L.Linear(nr_in, nr_in),
l2=L.Linear(nr_in, nr_out))
def __call__(self, vector):
vector = self.l1(vector)
vector = F.elu(vector)
vector = self.l2(vector)
return vector
class SentenceDataset(TupleDataset):
def __init__(self, nlp, texts, labels, max_length):
self.max_length = max_length
sents, labels = self._get_labelled_sentences(
nlp.pipe(texts, batch_size=5000, n_threads=3),
labels)
TupleDataset.__init__(self,
get_features(sents, max_length),
labels)
def __getitem__(self, index):
batches = [dataset[index] for dataset in self._datasets]
if isinstance(index, slice):
length = len(batches[0])
returns = [tuple([batch[i] for batch in batches])
for i in six.moves.range(length)]
return returns
else:
return tuple(batches)
def _get_labelled_sentences(self, docs, doc_labels):
labels = []
sentences = []
for doc, y in izip(docs, doc_labels):
for sent in doc.sents:
sentences.append(sent)
labels.append(y)
return sentences, xp.asarray(labels, dtype='i')
class DocDataset(TupleDataset):
def __init__(self, nlp, texts, labels):
self.max_length = max_length
DatasetMixin.__init__(self,
get_features(
nlp.pipe(texts, batch_size=5000, n_threads=3), self.max_length),
labels)
def read_data(data_dir, limit=0):
examples = []
for subdir, label in (('pos', 1), ('neg', 0)):
for filename in (data_dir / subdir).iterdir():
with filename.open() as file_:
text = file_.read()
examples.append((text, label))
random.shuffle(examples)
if limit >= 1:
examples = examples[:limit]
return zip(*examples) # Unzips into two lists
def get_features(docs, max_length):
docs = list(docs)
Xs = xp.zeros((len(docs), max_length), dtype='i')
for i, doc in enumerate(docs):
j = 0
for token in doc:
if token.has_vector and not token.is_punct and not token.is_space:
Xs[i, j] = token.norm
j += 1
if j >= max_length:
break
return Xs
def set_vectors(vectors, vocab):
for lex in vocab:
if lex.has_vector and (lex.rank+1) < vectors.shape[0]:
lex.norm = lex.rank+1
vectors[lex.rank + 1] = lex.vector
else:
lex.norm = 0
return vectors
def train(train_texts, train_labels, dev_texts, dev_labels,
lstm_shape, lstm_settings, lstm_optimizer, batch_size=100, nb_epoch=5,
by_sentence=True):
nlp = spacy.load('en', entity=False)
if 'nr_vector' not in lstm_shape:
lstm_shape['nr_vector'] = max(lex.rank+1 for lex in nlp.vocab if lex.has_vector)
if 'nr_dim' not in lstm_shape:
lstm_shape['nr_dim'] = nlp.vocab.vectors_length
print("Make model")
model = Classifier(SentimentModel(nlp, lstm_shape, **lstm_settings))
print("Parsing texts...")
if by_sentence:
train_data = SentenceDataset(nlp, train_texts, train_labels, lstm_shape['max_length'])
dev_data = SentenceDataset(nlp, dev_texts, dev_labels, lstm_shape['max_length'])
else:
train_data = DocDataset(nlp, train_texts, train_labels)
dev_data = DocDataset(nlp, dev_texts, dev_labels)
train_iter = SerialIterator(train_data, batch_size=batch_size,
shuffle=True, repeat=True)
dev_iter = SerialIterator(dev_data, batch_size=batch_size,
shuffle=False, repeat=False)
optimizer = chainer.optimizers.Adam()
optimizer.setup(model)
updater = chainer.training.StandardUpdater(train_iter, optimizer, device=0)
trainer = chainer.training.Trainer(updater, (1, 'epoch'), out='result')
trainer.extend(extensions.Evaluator(dev_iter, model, device=0))
trainer.extend(extensions.LogReport())
trainer.extend(extensions.PrintReport([
'epoch', 'main/accuracy', 'validation/main/accuracy']))
trainer.extend(extensions.ProgressBar())
trainer.run()
def evaluate(model_dir, texts, labels, max_length=100):
def create_pipeline(nlp):
'''
This could be a lambda, but named functions are easier to read in Python.
'''
return [nlp.tagger, nlp.parser, SentimentAnalyser.load(model_dir, nlp,
max_length=max_length)]
nlp = spacy.load('en')
nlp.pipeline = create_pipeline(nlp)
correct = 0
i = 0
for doc in nlp.pipe(texts, batch_size=1000, n_threads=4):
correct += bool(doc.sentiment >= 0.5) == bool(labels[i])
i += 1
return float(correct) / i
@plac.annotations(
train_dir=("Location of training file or directory"),
dev_dir=("Location of development file or directory"),
model_dir=("Location of output model directory",),
is_runtime=("Demonstrate run-time usage", "flag", "r", bool),
nr_hidden=("Number of hidden units", "option", "H", int),
max_length=("Maximum sentence length", "option", "L", int),
dropout=("Dropout", "option", "d", float),
learn_rate=("Learn rate", "option", "e", float),
nb_epoch=("Number of training epochs", "option", "i", int),
batch_size=("Size of minibatches for training LSTM", "option", "b", int),
nr_examples=("Limit to N examples", "option", "n", int)
)
def main(model_dir, train_dir, dev_dir,
is_runtime=False,
nr_hidden=64, max_length=100, # Shape
dropout=0.5, learn_rate=0.001, # General NN config
nb_epoch=5, batch_size=32, nr_examples=-1): # Training params
model_dir = pathlib.Path(model_dir)
train_dir = pathlib.Path(train_dir)
dev_dir = pathlib.Path(dev_dir)
if is_runtime:
dev_texts, dev_labels = read_data(dev_dir)
acc = evaluate(model_dir, dev_texts, dev_labels, max_length=max_length)
print(acc)
else:
print("Read data")
train_texts, train_labels = read_data(train_dir, limit=nr_examples)
dev_texts, dev_labels = read_data(dev_dir, limit=nr_examples)
print("Using GPU 0")
#chainer.cuda.get_device(0).use()
train_labels = xp.asarray(train_labels, dtype='i')
dev_labels = xp.asarray(dev_labels, dtype='i')
lstm = train(train_texts, train_labels, dev_texts, dev_labels,
{'nr_hidden': nr_hidden, 'max_length': max_length, 'nr_class': 2,
'nr_vector': 5000},
{'dropout': 0.5, 'lr': learn_rate},
{},
nb_epoch=nb_epoch, batch_size=batch_size)
if __name__ == '__main__':
#cProfile.runctx("plac.call(main)", globals(), locals(), "Profile.prof")
#s = pstats.Stats("Profile.prof")
#s.strip_dirs().sort_stats("time").print_stats()
plac.call(main)

View File

@ -1,59 +0,0 @@
"""Issue #252
Question:
In the documents and tutorials the main thing I haven't found is examples on how to break sentences down into small sub thoughts/chunks. The noun_chunks is handy, but having examples on using the token.head to find small (near-complete) sentence chunks would be neat.
Lets take the example sentence on https://displacy.spacy.io/displacy/index.html
displaCy uses CSS and JavaScript to show you how computers understand language
This sentence has two main parts (XCOMP & CCOMP) according to the breakdown:
[displaCy] uses CSS and Javascript [to + show]
&
show you how computers understand [language]
I'm assuming that we can use the token.head to build these groups. In one of your examples you had the following function.
def dependency_labels_to_root(token):
'''Walk up the syntactic tree, collecting the arc labels.'''
dep_labels = []
while token.head is not token:
dep_labels.append(token.dep)
token = token.head
return dep_labels
"""
from __future__ import print_function, unicode_literals
# Answer:
# The easiest way is to find the head of the subtree you want, and then use the
# `.subtree`, `.children`, `.lefts` and `.rights` iterators. `.subtree` is the
# one that does what you're asking for most directly:
from spacy.en import English
nlp = English()
doc = nlp(u'displaCy uses CSS and JavaScript to show you how computers understand language')
for word in doc:
if word.dep_ in ('xcomp', 'ccomp'):
print(''.join(w.text_with_ws for w in word.subtree))
# It'd probably be better for `word.subtree` to return a `Span` object instead
# of a generator over the tokens. If you want the `Span` you can get it via the
# `.right_edge` and `.left_edge` properties. The `Span` object is nice because
# you can easily get a vector, merge it, etc.
doc = nlp(u'displaCy uses CSS and JavaScript to show you how computers understand language')
for word in doc:
if word.dep_ in ('xcomp', 'ccomp'):
subtree_span = doc[word.left_edge.i : word.right_edge.i + 1]
print(subtree_span.text, '|', subtree_span.root.text)
print(subtree_span.similarity(doc))
print(subtree_span.similarity(subtree_span.root))
# You might also want to select a head, and then select a start and end position by
# walking along its children. You could then take the `.left_edge` and `.right_edge`
# of those tokens, and use it to calculate a span.

View File

@ -1,59 +0,0 @@
import plac
from spacy.en import English
from spacy.parts_of_speech import NOUN
from spacy.parts_of_speech import ADP as PREP
def _span_to_tuple(span):
start = span[0].idx
end = span[-1].idx + len(span[-1])
tag = span.root.tag_
text = span.text
label = span.label_
return (start, end, tag, text, label)
def merge_spans(spans, doc):
# This is a bit awkward atm. What we're doing here is merging the entities,
# so that each only takes up a single token. But an entity is a Span, and
# each Span is a view into the doc. When we merge a span, we invalidate
# the other spans. This will get fixed --- but for now the solution
# is to gather the information first, before merging.
tuples = [_span_to_tuple(span) for span in spans]
for span_tuple in tuples:
doc.merge(*span_tuple)
def extract_currency_relations(doc):
merge_spans(doc.ents, doc)
merge_spans(doc.noun_chunks, doc)
relations = []
for money in filter(lambda w: w.ent_type_ == 'MONEY', doc):
if money.dep_ in ('attr', 'dobj'):
subject = [w for w in money.head.lefts if w.dep_ == 'nsubj']
if subject:
subject = subject[0]
relations.append((subject, money))
elif money.dep_ == 'pobj' and money.head.dep_ == 'prep':
relations.append((money.head.head, money))
return relations
def main():
nlp = English()
texts = [
u'Net income was $9.4 million compared to the prior year of $2.7 million.',
u'Revenue exceeded twelve billion dollars, with a loss of $1b.',
]
for text in texts:
doc = nlp(text)
relations = extract_currency_relations(doc)
for r1, r2 in relations:
print(r1.text, r2.ent_type_, r2.text)
if __name__ == '__main__':
plac.call(main)

View File

@ -0,0 +1,62 @@
#!/usr/bin/env python
# coding: utf8
"""
A simple example of extracting relations between phrases and entities using
spaCy's named entity recognizer and the dependency parse. Here, we extract
money and currency values (entities labelled as MONEY) and then check the
dependency tree to find the noun phrase they are referring to for example:
$9.4 million --> Net income.
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
import spacy
TEXTS = [
'Net income was $9.4 million compared to the prior year of $2.7 million.',
'Revenue exceeded twelve billion dollars, with a loss of $1b.',
]
@plac.annotations(
model=("Model to load (needs parser and NER)", "positional", None, str))
def main(model='en_core_web_sm'):
nlp = spacy.load(model)
print("Loaded model '%s'" % model)
print("Processing %d texts" % len(TEXTS))
for text in TEXTS:
doc = nlp(text)
relations = extract_currency_relations(doc)
for r1, r2 in relations:
print('{:<10}\t{}\t{}'.format(r1.text, r2.ent_type_, r2.text))
def extract_currency_relations(doc):
# merge entities and noun chunks into one token
for span in [*list(doc.ents), *list(doc.noun_chunks)]:
span.merge()
relations = []
for money in filter(lambda w: w.ent_type_ == 'MONEY', doc):
if money.dep_ in ('attr', 'dobj'):
subject = [w for w in money.head.lefts if w.dep_ == 'nsubj']
if subject:
subject = subject[0]
relations.append((subject, money))
elif money.dep_ == 'pobj' and money.head.dep_ == 'prep':
relations.append((money.head.head, money))
return relations
if __name__ == '__main__':
plac.call(main)
# Expected output:
# Net income MONEY $9.4 million
# the prior year MONEY $2.7 million
# Revenue MONEY twelve billion dollars
# a loss MONEY 1b

View File

@ -0,0 +1,65 @@
#!/usr/bin/env python
# coding: utf8
"""
This example shows how to navigate the parse tree including subtrees attached
to a word.
Based on issue #252:
"In the documents and tutorials the main thing I haven't found is
examples on how to break sentences down into small sub thoughts/chunks. The
noun_chunks is handy, but having examples on using the token.head to find small
(near-complete) sentence chunks would be neat. Lets take the example sentence:
"displaCy uses CSS and JavaScript to show you how computers understand language"
This sentence has two main parts (XCOMP & CCOMP) according to the breakdown:
[displaCy] uses CSS and Javascript [to + show]
show you how computers understand [language]
I'm assuming that we can use the token.head to build these groups."
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
import spacy
@plac.annotations(
model=("Model to load", "positional", None, str))
def main(model='en_core_web_sm'):
nlp = spacy.load(model)
print("Loaded model '%s'" % model)
doc = nlp("displaCy uses CSS and JavaScript to show you how computers "
"understand language")
# The easiest way is to find the head of the subtree you want, and then use
# the `.subtree`, `.children`, `.lefts` and `.rights` iterators. `.subtree`
# is the one that does what you're asking for most directly:
for word in doc:
if word.dep_ in ('xcomp', 'ccomp'):
print(''.join(w.text_with_ws for w in word.subtree))
# It'd probably be better for `word.subtree` to return a `Span` object
# instead of a generator over the tokens. If you want the `Span` you can
# get it via the `.right_edge` and `.left_edge` properties. The `Span`
# object is nice because you can easily get a vector, merge it, etc.
for word in doc:
if word.dep_ in ('xcomp', 'ccomp'):
subtree_span = doc[word.left_edge.i : word.right_edge.i + 1]
print(subtree_span.text, '|', subtree_span.root.text)
# You might also want to select a head, and then select a start and end
# position by walking along its children. You could then take the
# `.left_edge` and `.right_edge` of those tokens, and use it to calculate
# a span.
if __name__ == '__main__':
plac.call(main)
# Expected output:
# to show you how computers understand language
# how computers understand language
# to show you how computers understand language | show
# how computers understand language | understand

View File

@ -0,0 +1,104 @@
"""Match a large set of multi-word expressions in O(1) time.
The idea is to associate each word in the vocabulary with a tag, noting whether
they begin, end, or are inside at least one pattern. An additional tag is used
for single-word patterns. Complete patterns are also stored in a hash set.
When we process a document, we look up the words in the vocabulary, to
associate the words with the tags. We then search for tag-sequences that
correspond to valid candidates. Finally, we look up the candidates in the hash
set.
For instance, to search for the phrases "Barack Hussein Obama" and "Hilary
Clinton", we would associate "Barack" and "Hilary" with the B tag, Hussein with
the I tag, and Obama and Clinton with the L tag.
The document "Barack Clinton and Hilary Clinton" would have the tag sequence
[{B}, {L}, {}, {B}, {L}], so we'd get two matches. However, only the second
candidate is in the phrase dictionary, so only one is returned as a match.
The algorithm is O(n) at run-time for document of length n because we're only
ever matching over the tag patterns. So no matter how many phrases we're
looking for, our pattern set stays very small (exact size depends on the
maximum length we're looking for, as the query language currently has no
quantifiers).
The example expects a .bz2 file from the Reddit corpus, and a patterns file,
formatted in jsonl as a sequence of entries like this:
{"text":"Anchorage"}
{"text":"Angola"}
{"text":"Ann Arbor"}
{"text":"Annapolis"}
{"text":"Appalachia"}
{"text":"Argentina"}
"""
from __future__ import print_function, unicode_literals, division
from bz2 import BZ2File
import time
import plac
import ujson
from spacy.matcher import PhraseMatcher
import spacy
@plac.annotations(
patterns_loc=("Path to gazetteer", "positional", None, str),
text_loc=("Path to Reddit corpus file", "positional", None, str),
n=("Number of texts to read", "option", "n", int),
lang=("Language class to initialise", "option", "l", str))
def main(patterns_loc, text_loc, n=10000, lang='en'):
nlp = spacy.blank('en')
nlp.vocab.lex_attr_getters = {}
phrases = read_gazetteer(nlp.tokenizer, patterns_loc)
count = 0
t1 = time.time()
for ent_id, text in get_matches(nlp.tokenizer, phrases,
read_text(text_loc, n=n)):
count += 1
t2 = time.time()
print("%d docs in %.3f s. %d matches" % (n, (t2 - t1), count))
def read_gazetteer(tokenizer, loc, n=-1):
for i, line in enumerate(open(loc)):
data = ujson.loads(line.strip())
phrase = tokenizer(data['text'])
for w in phrase:
_ = tokenizer.vocab[w.text]
if len(phrase) >= 2:
yield phrase
def read_text(bz2_loc, n=10000):
with BZ2File(bz2_loc) as file_:
for i, line in enumerate(file_):
data = ujson.loads(line)
yield data['body']
if i >= n:
break
def get_matches(tokenizer, phrases, texts, max_length=6):
matcher = PhraseMatcher(tokenizer.vocab, max_length=max_length)
matcher.add('Phrase', None, *phrases)
for text in texts:
doc = tokenizer(text)
for w in doc:
_ = doc.vocab[w.text]
matches = matcher(doc)
for ent_id, start, end in matches:
yield (ent_id, doc[start:end].text)
if __name__ == '__main__':
if False:
import cProfile
import pstats
cProfile.runctx("plac.call(main)", globals(), locals(), "Profile.prof")
s = pstats.Stats("Profile.prof")
s.strip_dirs().sort_stats("time").print_stats()
else:
plac.call(main)

View File

@ -1,5 +0,0 @@
An example of inventory counting using SpaCy.io NLP library. Meant to show how to instantiate Spacy's English class, and allow reusability by reloading the main module.
In the future, a better implementation of this library would be to apply machine learning to each query and learn what to classify as the quantitative statement (55 kgs OF), vs the actual item of count (how likely is a preposition object to be the item of count if x,y,z qualifications appear in the statement).

View File

@ -1,35 +0,0 @@
class Inventory:
"""
Inventory class - a struct{} like feature to house inventory counts
across modules.
"""
originalQuery = None
item = ""
unit = ""
amount = ""
def __init__(self, statement):
"""
Constructor - only takes in the original query/statement
:return: new Inventory object
"""
self.originalQuery = statement
pass
def __str__(self):
return str(self.amount) + ' ' + str(self.unit) + ' ' + str(self.item)
def printInfo(self):
print '-------------Inventory Count------------'
print "Original Query: " + str(self.originalQuery)
print 'Amount: ' + str(self.amount)
print 'Unit: ' + str(self.unit)
print 'Item: ' + str(self.item)
print '----------------------------------------'
def isValid(self):
if not self.item or not self.unit or not self.amount or not self.originalQuery:
return False
else:
return True

View File

@ -1,92 +0,0 @@
from inventory import Inventory
def runTest(nlp):
testset = []
testset += [nlp(u'6 lobster cakes')]
testset += [nlp(u'6 avacados')]
testset += [nlp(u'fifty five carrots')]
testset += [nlp(u'i have 55 carrots')]
testset += [nlp(u'i got me some 9 cabbages')]
testset += [nlp(u'i got 65 kgs of carrots')]
result = []
for doc in testset:
c = decodeInventoryEntry_level1(doc)
if not c.isValid():
c = decodeInventoryEntry_level2(doc)
result.append(c)
for i in result:
i.printInfo()
def decodeInventoryEntry_level1(document):
"""
Decodes a basic entry such as: '6 lobster cake' or '6' cakes
@param document : NLP Doc object
:return: Status if decoded correctly (true, false), and Inventory object
"""
count = Inventory(str(document))
for token in document:
if token.pos_ == (u'NOUN' or u'NNS' or u'NN'):
item = str(token)
for child in token.children:
if child.dep_ == u'compound' or child.dep_ == u'ad':
item = str(child) + str(item)
elif child.dep_ == u'nummod':
count.amount = str(child).strip()
for numerical_child in child.children:
# this isn't arithmetic rather than treating it such as a string
count.amount = str(numerical_child) + str(count.amount).strip()
else:
print "WARNING: unknown child: " + str(child) + ':'+str(child.dep_)
count.item = item
count.unit = item
return count
def decodeInventoryEntry_level2(document):
"""
Entry level 2, a more complicated parsing scheme that covers examples such as
'i have 80 boxes of freshly baked pies'
@document @param document : NLP Doc object
:return: Status if decoded correctly (true, false), and Inventory object-
"""
count = Inventory(str(document))
for token in document:
# Look for a preposition object that is a noun (this is the item we are counting).
# If found, look at its' dependency (if a preposition that is not indicative of
# inventory location, the dependency of the preposition must be a noun
if token.dep_ == (u'pobj' or u'meta') and token.pos_ == (u'NOUN' or u'NNS' or u'NN'):
item = ''
# Go through all the token's children, these are possible adjectives and other add-ons
# this deals with cases such as 'hollow rounded waffle pancakes"
for i in token.children:
item += ' ' + str(i)
item += ' ' + str(token)
count.item = item
# Get the head of the item:
if token.head.dep_ != u'prep':
# Break out of the loop, this is a confusing entry
break
else:
amountUnit = token.head.head
count.unit = str(amountUnit)
for inner in amountUnit.children:
if inner.pos_ == u'NUM':
count.amount += str(inner)
return count

View File

@ -1,30 +0,0 @@
import inventoryCount as mainModule
import os
from spacy.en import English
if __name__ == '__main__':
"""
Main module for this example - loads the English main NLP class,
and keeps it in RAM while waiting for the user to re-run it. Allows the
developer to re-edit their module under testing without having
to wait as long to load the English class
"""
# Set the NLP object here for the parameters you want to see,
# or just leave it blank and get all the opts
print "Loading English module... this will take a while."
nlp = English()
print "Done loading English module."
while True:
try:
reload(mainModule)
mainModule.runTest(nlp)
raw_input('================ To reload main module, press Enter ================')
except Exception, e:
print "Unexpected error: " + str(e)
continue

View File

@ -1,161 +0,0 @@
from __future__ import unicode_literals, print_function
import spacy.en
import spacy.matcher
from spacy.attrs import ORTH, TAG, LOWER, IS_ALPHA, FLAG63
import plac
def main():
nlp = spacy.en.English()
example = u"I prefer Siri to Google Now. I'll google now to find out how the google now service works."
before = nlp(example)
print("Before")
for ent in before.ents:
print(ent.text, ent.label_, [w.tag_ for w in ent])
# Output:
# Google ORG [u'NNP']
# google ORG [u'VB']
# google ORG [u'NNP']
nlp.matcher.add(
"GoogleNow", # Entity ID: Not really used at the moment.
"PRODUCT", # Entity type: should be one of the types in the NER data
{"wiki_en": "Google_Now"}, # Arbitrary attributes. Currently unused.
[ # List of patterns that can be Surface Forms of the entity
# This Surface Form matches "Google Now", verbatim
[ # Each Surface Form is a list of Token Specifiers.
{ # This Token Specifier matches tokens whose orth field is "Google"
ORTH: "Google"
},
{ # This Token Specifier matches tokens whose orth field is "Now"
ORTH: "Now"
}
],
[ # This Surface Form matches "google now", verbatim, and requires
# "google" to have the NNP tag. This helps prevent the pattern from
# matching cases like "I will google now to look up the time"
{
ORTH: "google",
TAG: "NNP"
},
{
ORTH: "now"
}
]
]
)
after = nlp(example)
print("After")
for ent in after.ents:
print(ent.text, ent.label_, [w.tag_ for w in ent])
# Output
# Google Now PRODUCT [u'NNP', u'RB']
# google ORG [u'VB']
# google now PRODUCT [u'NNP', u'RB']
#
# You can customize attribute values in the lexicon, and then refer to the
# new attributes in your Token Specifiers.
# This is particularly good for word-set membership.
#
australian_capitals = ['Brisbane', 'Sydney', 'Canberra', 'Melbourne', 'Hobart',
'Darwin', 'Adelaide', 'Perth']
# Internally, the tokenizer immediately maps each token to a pointer to a
# LexemeC struct. These structs hold various features, e.g. the integer IDs
# of the normalized string forms.
# For our purposes, the key attribute is a 64-bit integer, used as a bit field.
# spaCy currently only uses 12 of the bits for its built-in features, so
# the others are available for use. It's best to use the higher bits, as
# future versions of spaCy may add more flags. For instance, we might add
# a built-in IS_MONTH flag, taking up FLAG13. So, we bind our user-field to
# FLAG63 here.
is_australian_capital = FLAG63
# Now we need to set the flag value. It's False on all tokens by default,
# so we just need to set it to True for the tokens we want.
# Here we iterate over the strings, and set it on only the literal matches.
for string in australian_capitals:
lexeme = nlp.vocab[string]
lexeme.set_flag(is_australian_capital, True)
print('Sydney', nlp.vocab[u'Sydney'].check_flag(is_australian_capital))
print('sydney', nlp.vocab[u'sydney'].check_flag(is_australian_capital))
# If we want case-insensitive matching, we have to be a little bit more
# round-about, as there's no case-insensitive index to the vocabulary. So
# we have to iterate over the vocabulary.
# We'll be looking up attribute IDs in this set a lot, so it's good to pre-build it
target_ids = {nlp.vocab.strings[s.lower()] for s in australian_capitals}
for lexeme in nlp.vocab:
if lexeme.lower in target_ids:
lexeme.set_flag(is_australian_capital, True)
print('Sydney', nlp.vocab[u'Sydney'].check_flag(is_australian_capital))
print('sydney', nlp.vocab[u'sydney'].check_flag(is_australian_capital))
print('SYDNEY', nlp.vocab[u'SYDNEY'].check_flag(is_australian_capital))
# Output
# Sydney True
# sydney False
# Sydney True
# sydney True
# SYDNEY True
#
# The key thing to note here is that we're setting these attributes once,
# over the vocabulary --- and then reusing them at run-time. This means the
# amortized complexity of anything we do this way is going to be O(1). You
# can match over expressions that need to have sets with tens of thousands
# of values, e.g. "all the street names in Germany", and you'll still have
# O(1) complexity. Most regular expression algorithms don't scale well to
# this sort of problem.
#
# Now, let's use this in a pattern
nlp.matcher.add("AuCitySportsTeam", "ORG", {},
[
[
{LOWER: "the"},
{is_australian_capital: True},
{TAG: "NNS"}
],
[
{LOWER: "the"},
{is_australian_capital: True},
{TAG: "NNPS"}
],
[
{LOWER: "the"},
{IS_ALPHA: True}, # Allow a word in between, e.g. The Western Sydney
{is_australian_capital: True},
{TAG: "NNS"}
],
[
{LOWER: "the"},
{IS_ALPHA: True}, # Allow a word in between, e.g. The Western Sydney
{is_australian_capital: True},
{TAG: "NNPS"}
]
])
doc = nlp(u'The pattern should match the Brisbane Broncos and the South Darwin Spiders, but not the Colorado Boulders')
for ent in doc.ents:
print(ent.text, ent.label_)
# Output
# the Brisbane Broncos ORG
# the South Darwin Spiders ORG
# Output
# Before
# Google ORG [u'NNP']
# google ORG [u'VB']
# google ORG [u'NNP']
# After
# Google Now PRODUCT [u'NNP', u'RB']
# google ORG [u'VB']
# google now PRODUCT [u'NNP', u'RB']
# Sydney True
# sydney False
# Sydney True
# sydney True
# SYDNEY True
# the Brisbane Broncos ORG
# the South Darwin Spiders ORG
if __name__ == '__main__':
main()

View File

@ -1,98 +0,0 @@
"""Match a large set of multi-word expressions in O(1) time.
The idea is to associate each word in the vocabulary with a tag, noting whether
they begin, end, or are inside at least one pattern. An additional tag is used
for single-word patterns. Complete patterns are also stored in a hash set.
When we process a document, we look up the words in the vocabulary, to associate
the words with the tags. We then search for tag-sequences that correspond to
valid candidates. Finally, we look up the candidates in the hash set.
For instance, to search for the phrases "Barack Hussein Obama" and "Hilary Clinton", we
would associate "Barack" and "Hilary" with the B tag, Hussein with the I tag,
and Obama and Clinton with the L tag.
The document "Barack Clinton and Hilary Clinton" would have the tag sequence
[{B}, {L}, {}, {B}, {L}], so we'd get two matches. However, only the second candidate
is in the phrase dictionary, so only one is returned as a match.
The algorithm is O(n) at run-time for document of length n because we're only ever
matching over the tag patterns. So no matter how many phrases we're looking for,
our pattern set stays very small (exact size depends on the maximum length we're
looking for, as the query language currently has no quantifiers)
"""
from __future__ import print_function, unicode_literals, division
from ast import literal_eval
from bz2 import BZ2File
import time
import math
import codecs
import plac
from preshed.maps import PreshMap
from preshed.counter import PreshCounter
from spacy.strings import hash_string
from spacy.en import English
from spacy.matcher import PhraseMatcher
def read_gazetteer(tokenizer, loc, n=-1):
for i, line in enumerate(open(loc)):
phrase = literal_eval('u' + line.strip())
if ' (' in phrase and phrase.endswith(')'):
phrase = phrase.split(' (', 1)[0]
if i >= n:
break
phrase = tokenizer(phrase)
if all((t.is_lower and t.prob >= -10) for t in phrase):
continue
if len(phrase) >= 2:
yield phrase
def read_text(bz2_loc):
with BZ2File(bz2_loc) as file_:
for line in file_:
yield line.decode('utf8')
def get_matches(tokenizer, phrases, texts, max_length=6):
matcher = PhraseMatcher(tokenizer.vocab, phrases, max_length=max_length)
print("Match")
for text in texts:
doc = tokenizer(text)
matches = matcher(doc)
for mwe in doc.ents:
yield mwe
def main(patterns_loc, text_loc, counts_loc, n=10000000):
nlp = English(parser=False, tagger=False, entity=False)
print("Make matcher")
phrases = read_gazetteer(nlp.tokenizer, patterns_loc, n=n)
counts = PreshCounter()
t1 = time.time()
for mwe in get_matches(nlp.tokenizer, phrases, read_text(text_loc)):
counts.inc(hash_string(mwe.text), 1)
t2 = time.time()
print("10m tokens in %d s" % (t2 - t1))
with codecs.open(counts_loc, 'w', 'utf8') as file_:
for phrase in read_gazetteer(nlp.tokenizer, patterns_loc, n=n):
text = phrase.string
key = hash_string(text)
count = counts[key]
if count != 0:
file_.write('%d\t%s\n' % (count, text))
if __name__ == '__main__':
if False:
import cProfile
import pstats
cProfile.runctx("plac.call(main)", globals(), locals(), "Profile.prof")
s = pstats.Stats("Profile.prof")
s.strip_dirs().sort_stats("time").print_stats()
else:
plac.call(main)

View File

@ -1,281 +0,0 @@
"""This script expects something like a binary sentiment data set, such as
that available here: `http://www.cs.cornell.edu/people/pabo/movie-review-data/`
It expects a directory structure like: `data_dir/train/{pos|neg}`
and `data_dir/test/{pos|neg}`. Put (say) 90% of the files in the former
and the remainder in the latter.
"""
from __future__ import unicode_literals
from __future__ import print_function
from __future__ import division
from collections import defaultdict
from pathlib import Path
import numpy
import plac
import spacy.en
def read_data(nlp, data_dir):
for subdir, label in (('pos', 1), ('neg', 0)):
for filename in (data_dir / subdir).iterdir():
text = filename.open().read()
doc = nlp(text)
if len(doc) >= 1:
yield doc, label
def partition(examples, split_size):
examples = list(examples)
numpy.random.shuffle(examples)
n_docs = len(examples)
split = int(n_docs * split_size)
return examples[:split], examples[split:]
def minibatch(data, bs=24):
for i in range(0, len(data), bs):
yield data[i:i+bs]
class Extractor(object):
def __init__(self, nlp, vector_length, dropout=0.3):
self.nlp = nlp
self.dropout = dropout
self.vector = numpy.zeros((vector_length, ))
def doc2bow(self, doc, dropout=None):
if dropout is None:
dropout = self.dropout
bow = defaultdict(int)
all_words = defaultdict(int)
for word in doc:
if numpy.random.random() >= dropout and not word.is_punct:
bow[word.lower] += 1
all_words[word.lower] += 1
if sum(bow.values()) >= 1:
return bow
else:
return all_words
def bow2vec(self, bow, E):
self.vector.fill(0)
n = 0
for orth_id, freq in bow.items():
self.vector += self.nlp.vocab[self.nlp.vocab.strings[orth_id]].vector * freq
# Apply the fine-tuning we've learned
if orth_id < E.shape[0]:
self.vector += E[orth_id] * freq
n += freq
return self.vector / n
class NeuralNetwork(object):
def __init__(self, depth, width, n_classes, n_vocab, extracter, optimizer):
self.depth = depth
self.width = width
self.n_classes = n_classes
self.weights = Params.random(depth, width, width, n_classes, n_vocab)
self.doc2bow = extracter.doc2bow
self.bow2vec = extracter.bow2vec
self.optimizer = optimizer
self._gradient = Params.zero(depth, width, width, n_classes, n_vocab)
self._activity = numpy.zeros((depth, width))
def train(self, batch):
activity = self._activity
gradient = self._gradient
activity.fill(0)
gradient.data.fill(0)
loss = 0
word_freqs = defaultdict(int)
for doc, label in batch:
word_ids = self.doc2bow(doc)
vector = self.bow2vec(word_ids, self.weights.E)
self.forward(activity, vector)
loss += self.backprop(vector, gradient, activity, word_ids, label)
for w, freq in word_ids.items():
word_freqs[w] += freq
self.optimizer(self.weights, gradient, len(batch), word_freqs)
return loss
def predict(self, doc):
actv = self._activity
actv.fill(0)
W = self.weights.W
b = self.weights.b
E = self.weights.E
vector = self.bow2vec(self.doc2bow(doc, dropout=0.0), E)
self.forward(actv, vector)
return numpy.argmax(softmax(actv[-1], W[-1], b[-1]))
def forward(self, actv, in_):
actv.fill(0)
W = self.weights.W; b = self.weights.b
actv[0] = relu(in_, W[0], b[0])
for i in range(1, self.depth):
actv[i] = relu(actv[i-1], W[i], b[i])
def backprop(self, input_vector, gradient, activity, ids, label):
W = self.weights.W
b = self.weights.b
target = numpy.zeros(self.n_classes)
target[label] = 1.0
pred = softmax(activity[-1], W[-1], b[-1])
delta = pred - target
for i in range(self.depth, 0, -1):
gradient.b[i] += delta
gradient.W[i] += numpy.outer(delta, activity[i-1])
delta = d_relu(activity[i-1]) * W[i].T.dot(delta)
gradient.b[0] += delta
gradient.W[0] += numpy.outer(delta, input_vector)
tuning = W[0].T.dot(delta).reshape((self.width,)) / len(ids)
for w, freq in ids.items():
if w < gradient.E.shape[0]:
gradient.E[w] += tuning * freq
return -sum(target * numpy.log(pred))
def softmax(actvn, W, b):
w = W.dot(actvn) + b
ew = numpy.exp(w - max(w))
return (ew / sum(ew)).ravel()
def relu(actvn, W, b):
x = W.dot(actvn) + b
return x * (x > 0)
def d_relu(x):
return x > 0
class Adagrad(object):
def __init__(self, lr, rho):
self.eps = 1e-3
# initial learning rate
self.learning_rate = lr
self.rho = rho
# stores sum of squared gradients
#self.h = numpy.zeros(self.dim)
#self._curr_rate = numpy.zeros(self.h.shape)
self.h = None
self._curr_rate = None
def __call__(self, weights, gradient, batch_size, word_freqs):
if self.h is None:
self.h = numpy.zeros(gradient.data.shape)
self._curr_rate = numpy.zeros(gradient.data.shape)
self.L2_penalty(gradient, weights, word_freqs)
update = self.rescale(gradient.data / batch_size)
weights.data -= update
def rescale(self, gradient):
if self.h is None:
self.h = numpy.zeros(gradient.data.shape)
self._curr_rate = numpy.zeros(gradient.data.shape)
self._curr_rate.fill(0)
self.h += gradient ** 2
self._curr_rate = self.learning_rate / (numpy.sqrt(self.h) + self.eps)
return self._curr_rate * gradient
def L2_penalty(self, gradient, weights, word_freqs):
# L2 Regularization
for i in range(len(weights.W)):
gradient.W[i] += weights.W[i] * self.rho
gradient.b[i] += weights.b[i] * self.rho
for w, freq in word_freqs.items():
if w < gradient.E.shape[0]:
gradient.E[w] += weights.E[w] * self.rho
class Params(object):
@classmethod
def zero(cls, depth, n_embed, n_hidden, n_labels, n_vocab):
return cls(depth, n_embed, n_hidden, n_labels, n_vocab, lambda x: numpy.zeros((x,)))
@classmethod
def random(cls, depth, nE, nH, nL, nV):
return cls(depth, nE, nH, nL, nV, lambda x: (numpy.random.rand(x) * 2 - 1) * 0.08)
def __init__(self, depth, n_embed, n_hidden, n_labels, n_vocab, initializer):
nE = n_embed; nH = n_hidden; nL = n_labels; nV = n_vocab
n_weights = sum([
(nE * nH) + nH,
(nH * nH + nH) * depth,
(nH * nL) + nL,
(nV * nE)
])
self.data = initializer(n_weights)
self.W = []
self.b = []
i = self._add_layer(0, nE, nH)
for _ in range(1, depth):
i = self._add_layer(i, nH, nH)
i = self._add_layer(i, nL, nH)
self.E = self.data[i : i + (nV * nE)].reshape((nV, nE))
self.E.fill(0)
def _add_layer(self, start, x, y):
end = start + (x * y)
self.W.append(self.data[start : end].reshape((x, y)))
self.b.append(self.data[end : end + x].reshape((x, )))
return end + x
@plac.annotations(
data_dir=("Data directory", "positional", None, Path),
n_iter=("Number of iterations (epochs)", "option", "i", int),
width=("Size of hidden layers", "option", "H", int),
depth=("Depth", "option", "d", int),
dropout=("Drop-out rate", "option", "r", float),
rho=("Regularization penalty", "option", "p", float),
eta=("Learning rate", "option", "e", float),
batch_size=("Batch size", "option", "b", int),
vocab_size=("Number of words to fine-tune", "option", "w", int),
)
def main(data_dir, depth=3, width=300, n_iter=5, vocab_size=40000,
batch_size=24, dropout=0.3, rho=1e-5, eta=0.005):
n_classes = 2
print("Loading")
nlp = spacy.en.English(parser=False)
train_data, dev_data = partition(read_data(nlp, data_dir / 'train'), 0.8)
print("Begin training")
extracter = Extractor(nlp, width, dropout=0.3)
optimizer = Adagrad(eta, rho)
model = NeuralNetwork(depth, width, n_classes, vocab_size, extracter, optimizer)
prev_best = 0
best_weights = None
for epoch in range(n_iter):
numpy.random.shuffle(train_data)
train_loss = 0.0
for batch in minibatch(train_data, bs=batch_size):
train_loss += model.train(batch)
n_correct = sum(model.predict(x) == y for x, y in dev_data)
print(epoch, train_loss, n_correct / len(dev_data))
if n_correct >= prev_best:
best_weights = model.weights.data.copy()
prev_best = n_correct
model.weights.data = best_weights
print("Evaluating")
eval_data = list(read_data(nlp, data_dir / 'test'))
n_correct = sum(model.predict(x) == y for x, y in eval_data)
print(n_correct / len(eval_data))
if __name__ == '__main__':
#import cProfile
#import pstats
#cProfile.runctx("main(Path('data/aclImdb'))", globals(), locals(), "Profile.prof")
#s = pstats.Stats("Profile.prof")
#s.strip_dirs().sort_stats("time").print_stats(100)
plac.call(main)

View File

@ -1,74 +0,0 @@
from __future__ import print_function, unicode_literals, division
import io
import bz2
import logging
from toolz import partition
from os import path
import re
import spacy.en
from spacy.tokens import Doc
from joblib import Parallel, delayed
import plac
import ujson
def parallelize(func, iterator, n_jobs, extra, backend='multiprocessing'):
extra = tuple(extra)
return Parallel(n_jobs=n_jobs, backend=backend)(delayed(func)(*(item + extra))
for item in iterator)
def iter_comments(loc):
with bz2.BZ2File(loc) as file_:
for i, line in enumerate(file_):
yield ujson.loads(line)['body']
pre_format_re = re.compile(r'^[\`\*\~]')
post_format_re = re.compile(r'[\`\*\~]$')
url_re = re.compile(r'\[([^]]+)\]\(%%URL\)')
link_re = re.compile(r'\[([^]]+)\]\(https?://[^\)]+\)')
def strip_meta(text):
text = link_re.sub(r'\1', text)
text = text.replace('&gt;', '>').replace('&lt;', '<')
text = pre_format_re.sub('', text)
text = post_format_re.sub('', text)
return text.strip()
def save_parses(batch_id, input_, out_dir, n_threads, batch_size):
out_loc = path.join(out_dir, '%d.bin' % batch_id)
if path.exists(out_loc):
return None
print('Batch', batch_id)
nlp = spacy.en.English()
nlp.matcher = None
with open(out_loc, 'wb') as file_:
texts = (strip_meta(text) for text in input_)
texts = (text for text in texts if text.strip())
for doc in nlp.pipe(texts, batch_size=batch_size, n_threads=n_threads):
file_.write(doc.to_bytes())
@plac.annotations(
in_loc=("Location of input file"),
out_dir=("Location of input file"),
n_process=("Number of processes", "option", "p", int),
n_thread=("Number of threads per process", "option", "t", int),
batch_size=("Number of texts to accumulate in a buffer", "option", "b", int)
)
def main(in_loc, out_dir, n_process=1, n_thread=4, batch_size=100):
if not path.exists(out_dir):
path.join(out_dir)
if n_process >= 2:
texts = partition(200000, iter_comments(in_loc))
parallelize(save_parses, enumerate(texts), n_process, [out_dir, n_thread, batch_size],
backend='multiprocessing')
else:
save_parses(0, iter_comments(in_loc), out_dir, n_thread, batch_size)
if __name__ == '__main__':
plac.call(main)

View File

@ -0,0 +1,77 @@
#!/usr/bin/env python
# coding: utf-8
"""This example contains several snippets of methods that can be set via custom
Doc, Token or Span attributes in spaCy v2.0. Attribute methods act like
they're "bound" to the object and are partially applied i.e. the object
they're called on is passed in as the first argument.
* Custom pipeline components: https://alpha.spacy.io//usage/processing-pipelines#custom-components
Developed for: spaCy 2.0.0a17
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
from spacy.lang.en import English
from spacy.tokens import Doc, Span
from spacy import displacy
from pathlib import Path
@plac.annotations(
output_dir=("Output directory for saved HTML", "positional", None, Path))
def main(output_dir=None):
nlp = English() # start off with blank English class
Doc.set_extension('overlap', method=overlap_tokens)
doc1 = nlp(u"Peach emoji is where it has always been.")
doc2 = nlp(u"Peach is the superior emoji.")
print("Text 1:", doc1.text)
print("Text 2:", doc2.text)
print("Overlapping tokens:", doc1._.overlap(doc2))
Doc.set_extension('to_html', method=to_html)
doc = nlp(u"This is a sentence about Apple.")
# add entity manually for demo purposes, to make it work without a model
doc.ents = [Span(doc, 5, 6, label=nlp.vocab.strings['ORG'])]
print("Text:", doc.text)
doc._.to_html(output=output_dir, style='ent')
def to_html(doc, output='/tmp', style='dep'):
"""Doc method extension for saving the current state as a displaCy
visualization.
"""
# generate filename from first six non-punct tokens
file_name = '-'.join([w.text for w in doc[:6] if not w.is_punct]) + '.html'
html = displacy.render(doc, style=style, page=True) # render markup
if output is not None:
output_path = Path(output)
if not output_path.exists():
output_path.mkdir()
output_file = Path(output) / file_name
output_file.open('w', encoding='utf-8').write(html) # save to file
print('Saved HTML to {}'.format(output_file))
else:
print(html)
def overlap_tokens(doc, other_doc):
"""Get the tokens from the original Doc that are also in the comparison Doc.
"""
overlap = []
other_tokens = [token.text for token in other_doc]
for token in doc:
if token.text in other_tokens:
overlap.append(token)
return overlap
if __name__ == '__main__':
plac.call(main)
# Expected output:
# Text 1: Peach emoji is where it has always been.
# Text 2: Peach is the superior emoji.
# Overlapping tokens: [Peach, emoji, is, .]

View File

@ -0,0 +1,125 @@
#!/usr/bin/env python
# coding: utf8
"""Example of a spaCy v2.0 pipeline component that requests all countries via
the REST Countries API, merges country names into one token, assigns entity
labels and sets attributes on country tokens, e.g. the capital and lat/lng
coordinates. Can be extended with more details from the API.
* REST Countries API: https://restcountries.eu (Mozilla Public License MPL 2.0)
* Custom pipeline components: https://alpha.spacy.io//usage/processing-pipelines#custom-components
Developed for: spaCy 2.0.0a17
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import requests
import plac
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token
def main():
# For simplicity, we start off with only the blank English Language class
# and no model or pre-defined pipeline loaded.
nlp = English()
rest_countries = RESTCountriesComponent(nlp) # initialise component
nlp.add_pipe(rest_countries) # add it to the pipeline
doc = nlp(u"Some text about Colombia and the Czech Republic")
print('Pipeline', nlp.pipe_names) # pipeline contains component name
print('Doc has countries', doc._.has_country) # Doc contains countries
for token in doc:
if token._.is_country:
print(token.text, token._.country_capital, token._.country_latlng,
token._.country_flag) # country data
print('Entities', [(e.text, e.label_) for e in doc.ents]) # entities
class RESTCountriesComponent(object):
"""spaCy v2.0 pipeline component that requests all countries via
the REST Countries API, merges country names into one token, assigns entity
labels and sets attributes on country tokens.
"""
name = 'rest_countries' # component name, will show up in the pipeline
def __init__(self, nlp, label='GPE'):
"""Initialise the pipeline component. The shared nlp instance is used
to initialise the matcher with the shared vocab, get the label ID and
generate Doc objects as phrase match patterns.
"""
# Make request once on initialisation and store the data
r = requests.get('https://restcountries.eu/rest/v2/all')
r.raise_for_status() # make sure requests raises an error if it fails
countries = r.json()
# Convert API response to dict keyed by country name for easy lookup
# This could also be extended using the alternative and foreign language
# names provided by the API
self.countries = {c['name']: c for c in countries}
self.label = nlp.vocab.strings[label] # get entity label ID
# Set up the PhraseMatcher with Doc patterns for each country name
patterns = [nlp(c) for c in self.countries.keys()]
self.matcher = PhraseMatcher(nlp.vocab)
self.matcher.add('COUNTRIES', None, *patterns)
# Register attribute on the Token. We'll be overwriting this based on
# the matches, so we're only setting a default value, not a getter.
# If no default value is set, it defaults to None.
Token.set_extension('is_country', default=False)
Token.set_extension('country_capital')
Token.set_extension('country_latlng')
Token.set_extension('country_flag')
# Register attributes on Doc and Span via a getter that checks if one of
# the contained tokens is set to is_country == True.
Doc.set_extension('has_country', getter=self.has_country)
Span.set_extension('has_country', getter=self.has_country)
def __call__(self, doc):
"""Apply the pipeline component on a Doc object and modify it if matches
are found. Return the Doc, so it can be processed by the next component
in the pipeline, if available.
"""
matches = self.matcher(doc)
spans = [] # keep the spans for later so we can merge them afterwards
for _, start, end in matches:
# Generate Span representing the entity & set label
entity = Span(doc, start, end, label=self.label)
spans.append(entity)
# Set custom attribute on each token of the entity
# Can be extended with other data returned by the API, like
# currencies, country code, flag, calling code etc.
for token in entity:
token._.set('is_country', True)
token._.set('country_capital', self.countries[entity.text]['capital'])
token._.set('country_latlng', self.countries[entity.text]['latlng'])
token._.set('country_flag', self.countries[entity.text]['flag'])
# Overwrite doc.ents and add entity be careful not to replace!
doc.ents = list(doc.ents) + [entity]
for span in spans:
# Iterate over all spans and merge them into one token. This is done
# after setting the entities otherwise, it would cause mismatched
# indices!
span.merge()
return doc # don't forget to return the Doc!
def has_country(self, tokens):
"""Getter for Doc and Span attributes. Returns True if one of the tokens
is a country. Since the getter is only called when we access the
attribute, we can refer to the Token's 'is_country' attribute here,
which is already set in the processing step."""
return any([t._.get('is_country') for t in tokens])
if __name__ == '__main__':
plac.call(main)
# Expected output:
# Pipeline ['rest_countries']
# Doc has countries True
# Colombia Bogotá [4.0, -72.0] https://restcountries.eu/data/col.svg
# Czech Republic Prague [49.75, 15.5] https://restcountries.eu/data/cze.svg
# Entities [('Colombia', 'GPE'), ('Czech Republic', 'GPE')]

View File

@ -0,0 +1,113 @@
#!/usr/bin/env python
# coding: utf8
"""Example of a spaCy v2.0 pipeline component that sets entity annotations
based on list of single or multiple-word company names. Companies are
labelled as ORG and their spans are merged into one token. Additionally,
._.has_tech_org and ._.is_tech_org is set on the Doc/Span and Token
respectively.
* Custom pipeline components: https://alpha.spacy.io//usage/processing-pipelines#custom-components
Developed for: spaCy 2.0.0a17
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token
@plac.annotations(
text=("Text to process", "positional", None, str),
companies=("Names of technology companies", "positional", None, str))
def main(text="Alphabet Inc. is the company behind Google.", *companies):
# For simplicity, we start off with only the blank English Language class
# and no model or pre-defined pipeline loaded.
nlp = English()
if not companies: # set default companies if none are set via args
companies = ['Alphabet Inc.', 'Google', 'Netflix', 'Apple'] # etc.
component = TechCompanyRecognizer(nlp, companies) # initialise component
nlp.add_pipe(component, last=True) # add last to the pipeline
doc = nlp(text)
print('Pipeline', nlp.pipe_names) # pipeline contains component name
print('Tokens', [t.text for t in doc]) # company names from the list are merged
print('Doc has_tech_org', doc._.has_tech_org) # Doc contains tech orgs
print('Token 0 is_tech_org', doc[0]._.is_tech_org) # "Alphabet Inc." is a tech org
print('Token 1 is_tech_org', doc[1]._.is_tech_org) # "is" is not
print('Entities', [(e.text, e.label_) for e in doc.ents]) # all orgs are entities
class TechCompanyRecognizer(object):
"""Example of a spaCy v2.0 pipeline component that sets entity annotations
based on list of single or multiple-word company names. Companies are
labelled as ORG and their spans are merged into one token. Additionally,
._.has_tech_org and ._.is_tech_org is set on the Doc/Span and Token
respectively."""
name = 'tech_companies' # component name, will show up in the pipeline
def __init__(self, nlp, companies=tuple(), label='ORG'):
"""Initialise the pipeline component. The shared nlp instance is used
to initialise the matcher with the shared vocab, get the label ID and
generate Doc objects as phrase match patterns.
"""
self.label = nlp.vocab.strings[label] # get entity label ID
# Set up the PhraseMatcher it can now take Doc objects as patterns,
# so even if the list of companies is long, it's very efficient
patterns = [nlp(org) for org in companies]
self.matcher = PhraseMatcher(nlp.vocab)
self.matcher.add('TECH_ORGS', None, *patterns)
# Register attribute on the Token. We'll be overwriting this based on
# the matches, so we're only setting a default value, not a getter.
Token.set_extension('is_tech_org', default=False)
# Register attributes on Doc and Span via a getter that checks if one of
# the contained tokens is set to is_tech_org == True.
Doc.set_extension('has_tech_org', getter=self.has_tech_org)
Span.set_extension('has_tech_org', getter=self.has_tech_org)
def __call__(self, doc):
"""Apply the pipeline component on a Doc object and modify it if matches
are found. Return the Doc, so it can be processed by the next component
in the pipeline, if available.
"""
matches = self.matcher(doc)
spans = [] # keep the spans for later so we can merge them afterwards
for _, start, end in matches:
# Generate Span representing the entity & set label
entity = Span(doc, start, end, label=self.label)
spans.append(entity)
# Set custom attribute on each token of the entity
for token in entity:
token._.set('is_tech_org', True)
# Overwrite doc.ents and add entity be careful not to replace!
doc.ents = list(doc.ents) + [entity]
for span in spans:
# Iterate over all spans and merge them into one token. This is done
# after setting the entities otherwise, it would cause mismatched
# indices!
span.merge()
return doc # don't forget to return the Doc!
def has_tech_org(self, tokens):
"""Getter for Doc and Span attributes. Returns True if one of the tokens
is a tech org. Since the getter is only called when we access the
attribute, we can refer to the Token's 'is_tech_org' attribute here,
which is already set in the processing step."""
return any([t._.get('is_tech_org') for t in tokens])
if __name__ == '__main__':
plac.call(main)
# Expected output:
# Pipeline ['tech_companies']
# Tokens ['Alphabet Inc.', 'is', 'the', 'company', 'behind', 'Google', '.']
# Doc has_tech_org True
# Token 0 is_tech_org True
# Token 1 is_tech_org False
# Entities [('Alphabet Inc.', 'ORG'), ('Google', 'ORG')]

View File

@ -0,0 +1,73 @@
"""
Example of multi-processing with Joblib. Here, we're exporting
part-of-speech-tagged, true-cased, (very roughly) sentence-separated text, with
each "sentence" on a newline, and spaces between tokens. Data is loaded from
the IMDB movie reviews dataset and will be loaded automatically via Thinc's
built-in dataset loader.
Last updated for: spaCy 2.0.0a18
"""
from __future__ import print_function, unicode_literals
from toolz import partition_all
from pathlib import Path
from joblib import Parallel, delayed
import thinc.extra.datasets
import plac
import spacy
@plac.annotations(
output_dir=("Output directory", "positional", None, Path),
model=("Model name (needs tagger)", "positional", None, str),
n_jobs=("Number of workers", "option", "n", int),
batch_size=("Batch-size for each process", "option", "b", int),
limit=("Limit of entries from the dataset", "option", "l", int))
def main(output_dir, model='en_core_web_sm', n_jobs=4, batch_size=1000,
limit=10000):
nlp = spacy.load(model) # load spaCy model
print("Loaded model '%s'" % model)
if not output_dir.exists():
output_dir.mkdir()
# load and pre-process the IMBD dataset
print("Loading IMDB data...")
data, _ = thinc.extra.datasets.imdb()
texts, _ = zip(*data[-limit:])
partitions = partition_all(batch_size, texts)
items = ((i, [nlp(text) for text in texts], output_dir) for i, texts
in enumerate(partitions))
Parallel(n_jobs=n_jobs)(delayed(transform_texts)(*item) for item in items)
def transform_texts(batch_id, docs, output_dir):
out_path = Path(output_dir) / ('%d.txt' % batch_id)
if out_path.exists(): # return None in case same batch is called again
return None
print('Processing batch', batch_id)
with out_path.open('w', encoding='utf8') as f:
for doc in docs:
f.write(' '.join(represent_word(w) for w in doc if not w.is_space))
f.write('\n')
print('Saved {} texts to {}.txt'.format(len(docs), batch_id))
def represent_word(word):
text = word.text
# True-case, i.e. try to normalize sentence-initial capitals.
# Only do this if the lower-cased form is more probable.
if text.istitle() and is_sent_begin(word) \
and word.prob < word.doc.vocab[text.lower()].prob:
text = text.lower()
return text + '|' + word.tag_
def is_sent_begin(word):
if word.i == 0:
return True
elif word.i >= 2 and word.nbor(-1).text in ('.', '!', '?', '...'):
return True
else:
return False
if __name__ == '__main__':
plac.call(main)

View File

@ -1,90 +0,0 @@
"""
Print part-of-speech tagged, true-cased, (very roughly) sentence-separated
text, with each "sentence" on a newline, and spaces between tokens. Supports
multi-processing.
"""
from __future__ import print_function, unicode_literals, division
import io
import bz2
import logging
from toolz import partition
from os import path
import spacy.en
from joblib import Parallel, delayed
import plac
import ujson
def parallelize(func, iterator, n_jobs, extra):
extra = tuple(extra)
return Parallel(n_jobs=n_jobs)(delayed(func)(*(item + extra)) for item in iterator)
def iter_texts_from_json_bz2(loc):
"""
Iterator of unicode strings, one per document (here, a comment).
Expects a a path to a BZ2 file, which should be new-line delimited JSON. The
document text should be in a string field titled 'body'.
This is the data format of the Reddit comments corpus.
"""
with bz2.BZ2File(loc) as file_:
for i, line in enumerate(file_):
yield ujson.loads(line)['body']
def transform_texts(batch_id, input_, out_dir):
out_loc = path.join(out_dir, '%d.txt' % batch_id)
if path.exists(out_loc):
return None
print('Batch', batch_id)
nlp = spacy.en.English(parser=False, entity=False)
with io.open(out_loc, 'w', encoding='utf8') as file_:
for text in input_:
doc = nlp(text)
file_.write(' '.join(represent_word(w) for w in doc if not w.is_space))
file_.write('\n')
def represent_word(word):
text = word.text
# True-case, i.e. try to normalize sentence-initial capitals.
# Only do this if the lower-cased form is more probable.
if text.istitle() \
and is_sent_begin(word) \
and word.prob < word.doc.vocab[text.lower()].prob:
text = text.lower()
return text + '|' + word.tag_
def is_sent_begin(word):
# It'd be nice to have some heuristics like these in the library, for these
# times where we don't care so much about accuracy of SBD, and we don't want
# to parse
if word.i == 0:
return True
elif word.i >= 2 and word.nbor(-1).text in ('.', '!', '?', '...'):
return True
else:
return False
@plac.annotations(
in_loc=("Location of input file"),
out_dir=("Location of input file"),
n_workers=("Number of workers", "option", "n", int),
batch_size=("Batch-size for each process", "option", "b", int)
)
def main(in_loc, out_dir, n_workers=4, batch_size=100000):
if not path.exists(out_dir):
path.join(out_dir)
texts = partition(batch_size, iter_texts_from_json_bz2(in_loc))
parallelize(transform_texts, enumerate(texts), n_workers, [out_dir])
if __name__ == '__main__':
plac.call(main)

View File

@ -1,22 +0,0 @@
# Load NER
from __future__ import unicode_literals
import spacy
import pathlib
from spacy.pipeline import EntityRecognizer
from spacy.vocab import Vocab
def load_model(model_dir):
model_dir = pathlib.Path(model_dir)
nlp = spacy.load('en', parser=False, entity=False, add_vectors=False)
with (model_dir / 'vocab' / 'strings.json').open('r', encoding='utf8') as file_:
nlp.vocab.strings.load(file_)
nlp.vocab.load_lexemes(model_dir / 'vocab' / 'lexemes.bin')
ner = EntityRecognizer.load(model_dir, nlp.vocab, require=True)
return (nlp, ner)
(nlp, ner) = load_model('ner')
doc = nlp.make_doc('Who is Shaka Khan?')
nlp.tagger(doc)
ner(doc)
for word in doc:
print(word.text, word.orth, word.lower, word.tag_, word.ent_type_, word.ent_iob)

View File

@ -0,0 +1,157 @@
#!/usr/bin/env python
# coding: utf-8
"""Using the parser to recognise your own semantics
spaCy's parser component can be used to trained to predict any type of tree
structure over your input text. You can also predict trees over whole documents
or chat logs, with connections between the sentence-roots used to annotate
discourse structure. In this example, we'll build a message parser for a common
"chat intent": finding local businesses. Our message semantics will have the
following types of relations: ROOT, PLACE, QUALITY, ATTRIBUTE, TIME, LOCATION.
"show me the best hotel in berlin"
('show', 'ROOT', 'show')
('best', 'QUALITY', 'hotel') --> hotel with QUALITY best
('hotel', 'PLACE', 'show') --> show PLACE hotel
('berlin', 'LOCATION', 'hotel') --> hotel with LOCATION berlin
"""
from __future__ import unicode_literals, print_function
import plac
import random
import spacy
from spacy.gold import GoldParse
from spacy.tokens import Doc
from pathlib import Path
# training data: words, head and dependency labels
# for no relation, we simply chose an arbitrary dependency label, e.g. '-'
TRAIN_DATA = [
(
['find', 'a', 'cafe', 'with', 'great', 'wifi'],
[0, 2, 0, 5, 5, 2], # index of token head
['ROOT', '-', 'PLACE', '-', 'QUALITY', 'ATTRIBUTE']
),
(
['find', 'a', 'hotel', 'near', 'the', 'beach'],
[0, 2, 0, 5, 5, 2],
['ROOT', '-', 'PLACE', 'QUALITY', '-', 'ATTRIBUTE']
),
(
['find', 'me', 'the', 'closest', 'gym', 'that', "'s", 'open', 'late'],
[0, 0, 4, 4, 0, 6, 4, 6, 6],
['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', '-', 'ATTRIBUTE', 'TIME']
),
(
['show', 'me', 'the', 'cheapest', 'store', 'that', 'sells', 'flowers'],
[0, 0, 4, 4, 0, 4, 4, 4], # attach "flowers" to store!
['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', '-', 'PRODUCT']
),
(
['find', 'a', 'nice', 'restaurant', 'in', 'london'],
[0, 3, 3, 0, 3, 3],
['ROOT', '-', 'QUALITY', 'PLACE', '-', 'LOCATION']
),
(
['show', 'me', 'the', 'coolest', 'hostel', 'in', 'berlin'],
[0, 0, 4, 4, 0, 4, 4],
['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', 'LOCATION']
),
(
['find', 'a', 'good', 'italian', 'restaurant', 'near', 'work'],
[0, 4, 4, 4, 0, 4, 5],
['ROOT', '-', 'QUALITY', 'ATTRIBUTE', 'PLACE', 'ATTRIBUTE', 'LOCATION']
)
]
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=100):
"""Load the model, set up the pipeline and train the parser."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
# add the parser to the pipeline if it doesn't exist
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'parser' not in nlp.pipe_names:
parser = nlp.create_pipe('parser')
nlp.add_pipe(parser, first=True)
# otherwise, get it, so we can add labels to it
else:
parser = nlp.get_pipe('parser')
for _, _, deps in TRAIN_DATA:
for dep in deps:
parser.add_label(dep)
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'parser']
with nlp.disable_pipes(*other_pipes): # only train parser
optimizer = nlp.begin_training(lambda: [])
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for words, heads, deps in TRAIN_DATA:
doc = Doc(nlp.vocab, words=words)
gold = GoldParse(doc, heads=heads, deps=deps)
nlp.update([doc], [gold], sgd=optimizer, losses=losses)
print(losses)
# test the trained model
test_model(nlp)
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
test_model(nlp2)
def test_model(nlp):
texts = ["find a hotel with good wifi",
"find me the cheapest gym near work",
"show me the best hotel in berlin"]
docs = nlp.pipe(texts)
for doc in docs:
print(doc.text)
print([(t.text, t.dep_, t.head.text) for t in doc if t.dep_ != '-'])
if __name__ == '__main__':
plac.call(main)
# Expected output:
# find a hotel with good wifi
# [
# ('find', 'ROOT', 'find'),
# ('hotel', 'PLACE', 'find'),
# ('good', 'QUALITY', 'wifi'),
# ('wifi', 'ATTRIBUTE', 'hotel')
# ]
# find me the cheapest gym near work
# [
# ('find', 'ROOT', 'find'),
# ('cheapest', 'QUALITY', 'gym'),
# ('gym', 'PLACE', 'find')
# ]
# show me the best hotel in berlin
# [
# ('show', 'ROOT', 'show'),
# ('best', 'QUALITY', 'hotel'),
# ('hotel', 'PLACE', 'show'),
# ('berlin', 'LOCATION', 'hotel')
# ]

View File

@ -1,13 +1,103 @@
#!/usr/bin/env python
# coding: utf8
"""
Example of training spaCy's named entity recognizer, starting off with an
existing model or a blank model.
For more details, see the documentation:
* Training: https://alpha.spacy.io/usage/training
* NER: https://alpha.spacy.io/usage/linguistic-features#named-entities
Developed for: spaCy 2.0.0a18
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
from spacy.lang.en import English
import spacy
from spacy.gold import GoldParse, biluo_tags_from_offsets
# training data
TRAIN_DATA = [
('Who is Shaka Khan?', [(7, 17, 'PERSON')]),
('I like London and Berlin.', [(7, 13, 'LOC'), (18, 24, 'LOC')])
]
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=100):
"""Load the model, set up the pipeline and train the entity recognizer."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
# create the built-in pipeline components and add them to the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'ner' not in nlp.pipe_names:
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner, last=True)
# function that allows begin_training to get the training data
get_data = lambda: reformat_train_data(nlp.tokenizer, TRAIN_DATA)
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes): # only train NER
optimizer = nlp.begin_training(get_data)
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for raw_text, entity_offsets in TRAIN_DATA:
doc = nlp.make_doc(raw_text)
gold = GoldParse(doc, entities=entity_offsets)
nlp.update(
[doc], # Batch of Doc objects
[gold], # Batch of GoldParse objects
drop=0.5, # Dropout -- make it harder to memorise data
sgd=optimizer, # Callable to update weights
losses=losses)
print(losses)
# test the trained model
for text, _ in TRAIN_DATA:
doc = nlp(text)
print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
for text, _ in TRAIN_DATA:
doc = nlp2(text)
print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])
def reformat_train_data(tokenizer, examples):
"""Reformat data to match JSON format"""
"""Reformat data to match JSON format.
https://alpha.spacy.io/api/annotation#json-input
tokenizer (Tokenizer): Tokenizer to process the raw text.
examples (list): The trainig data.
RETURNS (list): The reformatted training data."""
output = []
for i, (text, entity_offsets) in enumerate(examples):
doc = tokenizer(text)
@ -21,49 +111,5 @@ def reformat_train_data(tokenizer, examples):
return output
def main(model_dir=None):
train_data = [
(
'Who is Shaka Khan?',
[(len('Who is '), len('Who is Shaka Khan'), 'PERSON')]
),
(
'I like London and Berlin.',
[(len('I like '), len('I like London'), 'LOC'),
(len('I like London and '), len('I like London and Berlin'), 'LOC')]
)
]
nlp = English(pipeline=['tensorizer', 'ner'])
get_data = lambda: reformat_train_data(nlp.tokenizer, train_data)
optimizer = nlp.begin_training(get_data)
for itn in range(100):
random.shuffle(train_data)
losses = {}
for raw_text, entity_offsets in train_data:
doc = nlp.make_doc(raw_text)
gold = GoldParse(doc, entities=entity_offsets)
nlp.update(
[doc], # Batch of Doc objects
[gold], # Batch of GoldParse objects
drop=0.5, # Dropout -- make it harder to memorise data
sgd=optimizer, # Callable to update weights
losses=losses)
print(losses)
print("Save to", model_dir)
nlp.to_disk(model_dir)
print("Load from", model_dir)
nlp = spacy.lang.en.English(pipeline=['tensorizer', 'ner'])
nlp.from_disk(model_dir)
for raw_text, _ in train_data:
doc = nlp(raw_text)
for word in doc:
print(word.text, word.ent_type_, word.ent_iob_)
if __name__ == '__main__':
import plac
plac.call(main)
# Who "" 2
# is "" 2
# Shaka "" PERSON 3
# Khan "" PERSON 1
# ? "" 2

View File

@ -1,244 +0,0 @@
#!/usr/bin/env python
'''Example of training a named entity recognition system from scratch using spaCy
This example is written to be self-contained and reasonably transparent.
To achieve that, it duplicates some of spaCy's internal functionality.
Specifically, in this example, we don't use spaCy's built-in Language class to
wire together the Vocab, Tokenizer and EntityRecognizer. Instead, we write
our own simle Pipeline class, so that it's easier to see how the pieces
interact.
Input data:
https://www.lt.informatik.tu-darmstadt.de/fileadmin/user_upload/Group_LangTech/data/GermEval2014_complete_data.zip
Developed for: spaCy 1.7.1
Last tested for: spaCy 1.7.1
'''
from __future__ import unicode_literals, print_function
import plac
from pathlib import Path
import random
import json
import spacy.orth as orth_funcs
from spacy.vocab import Vocab
from spacy.pipeline import BeamEntityRecognizer
from spacy.pipeline import EntityRecognizer
from spacy.tokenizer import Tokenizer
from spacy.tokens import Doc
from spacy.attrs import *
from spacy.gold import GoldParse
from spacy.gold import _iob_to_biluo as iob_to_biluo
from spacy.scorer import Scorer
try:
unicode
except NameError:
unicode = str
def init_vocab():
return Vocab(
lex_attr_getters={
LOWER: lambda string: string.lower(),
SHAPE: orth_funcs.word_shape,
PREFIX: lambda string: string[0],
SUFFIX: lambda string: string[-3:],
CLUSTER: lambda string: 0,
IS_ALPHA: orth_funcs.is_alpha,
IS_ASCII: orth_funcs.is_ascii,
IS_DIGIT: lambda string: string.isdigit(),
IS_LOWER: orth_funcs.is_lower,
IS_PUNCT: orth_funcs.is_punct,
IS_SPACE: lambda string: string.isspace(),
IS_TITLE: orth_funcs.is_title,
IS_UPPER: orth_funcs.is_upper,
IS_STOP: lambda string: False,
IS_OOV: lambda string: True
})
def save_vocab(vocab, path):
path = Path(path)
if not path.exists():
path.mkdir()
elif not path.is_dir():
raise IOError("Can't save vocab to %s\nNot a directory" % path)
with (path / 'strings.json').open('w') as file_:
vocab.strings.dump(file_)
vocab.dump((path / 'lexemes.bin').as_posix())
def load_vocab(path):
path = Path(path)
if not path.exists():
raise IOError("Cannot load vocab from %s\nDoes not exist" % path)
if not path.is_dir():
raise IOError("Cannot load vocab from %s\nNot a directory" % path)
return Vocab.load(path)
def init_ner_model(vocab, features=None):
if features is None:
features = tuple(EntityRecognizer.feature_templates)
return EntityRecognizer(vocab, features=features)
def save_ner_model(model, path):
path = Path(path)
if not path.exists():
path.mkdir()
if not path.is_dir():
raise IOError("Can't save model to %s\nNot a directory" % path)
model.model.dump((path / 'model').as_posix())
with (path / 'config.json').open('w') as file_:
data = json.dumps(model.cfg)
if not isinstance(data, unicode):
data = data.decode('utf8')
file_.write(data)
def load_ner_model(vocab, path):
return EntityRecognizer.load(path, vocab)
class Pipeline(object):
@classmethod
def load(cls, path):
path = Path(path)
if not path.exists():
raise IOError("Cannot load pipeline from %s\nDoes not exist" % path)
if not path.is_dir():
raise IOError("Cannot load pipeline from %s\nNot a directory" % path)
vocab = load_vocab(path)
tokenizer = Tokenizer(vocab, {}, None, None, None)
ner_model = load_ner_model(vocab, path / 'ner')
return cls(vocab, tokenizer, ner_model)
def __init__(self, vocab=None, tokenizer=None, entity=None):
if vocab is None:
vocab = init_vocab()
if tokenizer is None:
tokenizer = Tokenizer(vocab, {}, None, None, None)
if entity is None:
entity = init_ner_model(self.vocab)
self.vocab = vocab
self.tokenizer = tokenizer
self.entity = entity
self.pipeline = [self.entity]
def __call__(self, input_):
doc = self.make_doc(input_)
for process in self.pipeline:
process(doc)
return doc
def make_doc(self, input_):
if isinstance(input_, bytes):
input_ = input_.decode('utf8')
if isinstance(input_, unicode):
return self.tokenizer(input_)
else:
return Doc(self.vocab, words=input_)
def make_gold(self, input_, annotations):
doc = self.make_doc(input_)
gold = GoldParse(doc, entities=annotations)
return gold
def update(self, input_, annot):
doc = self.make_doc(input_)
gold = self.make_gold(input_, annot)
for ner in gold.ner:
if ner not in (None, '-', 'O'):
action, label = ner.split('-', 1)
self.entity.add_label(label)
return self.entity.update(doc, gold)
def evaluate(self, examples):
scorer = Scorer()
for input_, annot in examples:
gold = self.make_gold(input_, annot)
doc = self(input_)
scorer.score(doc, gold)
return scorer.scores
def average_weights(self):
self.entity.model.end_training()
def save(self, path):
path = Path(path)
if not path.exists():
path.mkdir()
elif not path.is_dir():
raise IOError("Can't save pipeline to %s\nNot a directory" % path)
save_vocab(self.vocab, path / 'vocab')
save_ner_model(self.entity, path / 'ner')
def train(nlp, train_examples, dev_examples, ctx, nr_epoch=5):
next_epoch = train_examples
print("Iter", "Loss", "P", "R", "F")
for i in range(nr_epoch):
this_epoch = next_epoch
next_epoch = []
loss = 0
for input_, annot in this_epoch:
loss += nlp.update(input_, annot)
if (i+1) < nr_epoch:
next_epoch.append((input_, annot))
random.shuffle(next_epoch)
scores = nlp.evaluate(dev_examples)
report_scores(i, loss, scores)
nlp.average_weights()
scores = nlp.evaluate(dev_examples)
report_scores(channels, i+1, loss, scores)
def report_scores(i, loss, scores):
precision = '%.2f' % scores['ents_p']
recall = '%.2f' % scores['ents_r']
f_measure = '%.2f' % scores['ents_f']
print('%d %s %s %s' % (int(loss), precision, recall, f_measure))
def read_examples(path):
path = Path(path)
with path.open() as file_:
sents = file_.read().strip().split('\n\n')
for sent in sents:
if not sent.strip():
continue
tokens = sent.split('\n')
while tokens and tokens[0].startswith('#'):
tokens.pop(0)
words = []
iob = []
for token in tokens:
if token.strip():
pieces = token.split()
words.append(pieces[1])
iob.append(pieces[2])
yield words, iob_to_biluo(iob)
@plac.annotations(
model_dir=("Path to save the model", "positional", None, Path),
train_loc=("Path to your training data", "positional", None, Path),
dev_loc=("Path to your development data", "positional", None, Path),
)
def main(model_dir=Path('/home/matt/repos/spaCy/spacy/data/de-1.0.0'),
train_loc=None, dev_loc=None, nr_epoch=30):
train_examples = read_examples(train_loc)
dev_examples = read_examples(dev_loc)
nlp = Pipeline.load(model_dir)
train(nlp, train_examples, list(dev_examples), ctx, nr_epoch)
nlp.save(model_dir)
if __name__ == '__main__':
main()

View File

@ -21,111 +21,120 @@ After training your model, you can save it to a directory. We recommend
wrapping models as Python packages, for ease of deployment.
For more details, see the documentation:
* Training the Named Entity Recognizer: https://spacy.io/docs/usage/train-ner
* Saving and loading models: https://spacy.io/docs/usage/saving-loading
* Training: https://alpha.spacy.io/usage/training
* NER: https://alpha.spacy.io/usage/linguistic-features#named-entities
Developed for: spaCy 1.7.6
Last tested for: spaCy 1.7.6
Developed for: spaCy 2.0.0a18
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import random
import spacy
from spacy.gold import GoldParse
from spacy.tagger import Tagger
from spacy.gold import GoldParse, minibatch
def train_ner(nlp, train_data, output_dir):
# Add new words to vocab
for raw_text, _ in train_data:
doc = nlp.make_doc(raw_text)
for word in doc:
_ = nlp.vocab[word.orth]
random.seed(0)
# You may need to change the learning rate. It's generally difficult to
# guess what rate you should set, especially when you have limited data.
nlp.entity.model.learn_rate = 0.001
for itn in range(1000):
random.shuffle(train_data)
loss = 0.
for raw_text, entity_offsets in train_data:
gold = GoldParse(doc, entities=entity_offsets)
# By default, the GoldParse class assumes that the entities
# described by offset are complete, and all other words should
# have the tag 'O'. You can tell it to make no assumptions
# about the tag of a word by giving it the tag '-'.
# However, this allows a trivial solution to the current
# learning problem: if words are either 'any tag' or 'ANIMAL',
# the model can learn that all words can be tagged 'ANIMAL'.
#for i in range(len(gold.ner)):
#if not gold.ner[i].endswith('ANIMAL'):
# gold.ner[i] = '-'
doc = nlp.make_doc(raw_text)
nlp.tagger(doc)
# As of 1.9, spaCy's parser now lets you supply a dropout probability
# This might help the model generalize better from only a few
# examples.
loss += nlp.entity.update(doc, gold, drop=0.9)
if loss == 0:
break
# This step averages the model's weights. This may or may not be good for
# your situation --- it's empirical.
nlp.end_training()
if output_dir:
if not output_dir.exists():
output_dir.mkdir()
nlp.save_to_directory(output_dir)
# new entity label
LABEL = 'ANIMAL'
# training data
TRAIN_DATA = [
("Horses are too tall and they pretend to care about your feelings",
[(0, 6, 'ANIMAL')]),
("Do they bite?", []),
("horses are too tall and they pretend to care about your feelings",
[(0, 6, 'ANIMAL')]),
("horses pretend to care about your feelings", [(0, 6, 'ANIMAL')]),
("they pretend to care about your feelings, those horses",
[(48, 54, 'ANIMAL')]),
("horses?", [(0, 6, 'ANIMAL')])
]
def main(model_name, output_directory=None):
print("Loading initial model", model_name)
nlp = spacy.load(model_name)
if output_directory is not None:
output_directory = Path(output_directory)
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
new_model_name=("New model name for model meta.", "option", "nm", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, new_model_name='animal', output_dir=None, n_iter=50):
"""Set up the pipeline and entity recognizer, and train the new entity."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
train_data = [
(
"Horses are too tall and they pretend to care about your feelings",
[(0, 6, 'ANIMAL')],
),
(
"horses are too tall and they pretend to care about your feelings",
[(0, 6, 'ANIMAL')]
),
(
"horses pretend to care about your feelings",
[(0, 6, 'ANIMAL')]
),
(
"they pretend to care about your feelings, those horses",
[(48, 54, 'ANIMAL')]
),
(
"horses?",
[(0, 6, 'ANIMAL')]
)
# Add entity recognizer to model if it's not in the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'ner' not in nlp.pipe_names:
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
# otherwise, get it, so we can add labels to it
else:
ner = nlp.get_pipe('ner')
]
nlp.entity.add_label('ANIMAL')
train_ner(nlp, train_data, output_directory)
ner.add_label(LABEL) # add new entity label to entity recognizer
# Test that the entity is recognized
doc = nlp('Do you like horses?')
print("Ents in 'Do you like horses?':")
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes): # only train NER
random.seed(0)
optimizer = nlp.begin_training(lambda: [])
for itn in range(n_iter):
losses = {}
gold_parses = get_gold_parses(nlp.make_doc, TRAIN_DATA)
for batch in minibatch(gold_parses, size=3):
docs, golds = zip(*batch)
nlp.update(docs, golds, losses=losses, sgd=optimizer,
drop=0.35)
print(losses)
# test the trained model
test_text = 'Do you like horses?'
doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
print(ent.label_, ent.text)
if output_directory:
print("Loading from", output_directory)
nlp2 = spacy.load('en', path=output_directory)
nlp2.entity.add_label('ANIMAL')
doc2 = nlp2('Do you like horses?')
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.meta['name'] = new_model_name # rename model
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc2 = nlp2(test_text)
for ent in doc2.ents:
print(ent.label_, ent.text)
def get_gold_parses(tokenizer, train_data):
"""Shuffle and create GoldParse objects.
tokenizer (Tokenizer): Tokenizer to processs the raw text.
train_data (list): The training data.
YIELDS (tuple): (doc, gold) tuples.
"""
random.shuffle(train_data)
for raw_text, entity_offsets in train_data:
doc = tokenizer(raw_text)
gold = GoldParse(doc, entities=entity_offsets)
yield doc, gold
if __name__ == '__main__':
import plac
plac.call(main)

View File

@ -1,75 +1,109 @@
#!/usr/bin/env python
# coding: utf8
"""
Example of training spaCy dependency parser, starting off with an existing model
or a blank model.
For more details, see the documentation:
* Training: https://alpha.spacy.io/usage/training
* Dependency Parse: https://alpha.spacy.io/usage/linguistic-features#dependency-parse
Developed for: spaCy 2.0.0a18
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import json
import pathlib
import plac
import random
from pathlib import Path
import spacy
from spacy.pipeline import DependencyParser
from spacy.gold import GoldParse
from spacy.tokens import Doc
def train_parser(nlp, train_data, left_labels, right_labels):
parser = DependencyParser(
nlp.vocab,
left_labels=left_labels,
right_labels=right_labels)
for itn in range(1000):
random.shuffle(train_data)
loss = 0
for words, heads, deps in train_data:
doc = Doc(nlp.vocab, words=words)
gold = GoldParse(doc, heads=heads, deps=deps)
loss += parser.update(doc, gold)
parser.model.end_training()
return parser
# training data
TRAIN_DATA = [
(
['They', 'trade', 'mortgage', '-', 'backed', 'securities', '.'],
[1, 1, 4, 4, 5, 1, 1],
['nsubj', 'ROOT', 'compound', 'punct', 'nmod', 'dobj', 'punct']
),
(
['I', 'like', 'London', 'and', 'Berlin', '.'],
[1, 1, 1, 2, 2, 1],
['nsubj', 'ROOT', 'dobj', 'cc', 'conj', 'punct']
)
]
def main(model_dir=None):
if model_dir is not None:
model_dir = pathlib.Path(model_dir)
if not model_dir.exists():
model_dir.mkdir()
assert model_dir.is_dir()
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=1000):
"""Load the model, set up the pipeline and train the parser."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
nlp = spacy.load('en', tagger=False, parser=False, entity=False, add_vectors=False)
# add the parser to the pipeline if it doesn't exist
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'parser' not in nlp.pipe_names:
parser = nlp.create_pipe('parser')
nlp.add_pipe(parser, first=True)
# otherwise, get it, so we can add labels to it
else:
parser = nlp.get_pipe('parser')
train_data = [
(
['They', 'trade', 'mortgage', '-', 'backed', 'securities', '.'],
[1, 1, 4, 4, 5, 1, 1],
['nsubj', 'ROOT', 'compound', 'punct', 'nmod', 'dobj', 'punct']
),
(
['I', 'like', 'London', 'and', 'Berlin', '.'],
[1, 1, 1, 2, 2, 1],
['nsubj', 'ROOT', 'dobj', 'cc', 'conj', 'punct']
)
]
left_labels = set()
right_labels = set()
for _, heads, deps in train_data:
for i, (head, dep) in enumerate(zip(heads, deps)):
if i < head:
left_labels.add(dep)
elif i > head:
right_labels.add(dep)
parser = train_parser(nlp, train_data, sorted(left_labels), sorted(right_labels))
# add labels to the parser
for _, _, deps in TRAIN_DATA:
for dep in deps:
parser.add_label(dep)
doc = Doc(nlp.vocab, words=['I', 'like', 'securities', '.'])
parser(doc)
for word in doc:
print(word.text, word.dep_, word.head.text)
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'parser']
with nlp.disable_pipes(*other_pipes): # only train parser
optimizer = nlp.begin_training(lambda: [])
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for words, heads, deps in TRAIN_DATA:
doc = Doc(nlp.vocab, words=words)
gold = GoldParse(doc, heads=heads, deps=deps)
nlp.update([doc], [gold], sgd=optimizer, losses=losses)
print(losses)
if model_dir is not None:
with (model_dir / 'config.json').open('w') as file_:
json.dump(parser.cfg, file_)
parser.model.dump(str(model_dir / 'model'))
# test the trained model
test_text = "I like securities."
doc = nlp(test_text)
print('Dependencies', [(t.text, t.dep_, t.head.text) for t in doc])
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc = nlp2(test_text)
print('Dependencies', [(t.text, t.dep_, t.head.text) for t in doc])
if __name__ == '__main__':
main()
# I nsubj like
# like ROOT like
# securities dobj like
# . cc securities
plac.call(main)
# expected result:
# [
# ('I', 'nsubj', 'like'),
# ('like', 'ROOT', 'like'),
# ('securities', 'dobj', 'like'),
# ('.', 'punct', 'like')
# ]

View File

@ -1,18 +1,28 @@
"""A quick example for training a part-of-speech tagger, without worrying
about the tokenization, or other language-specific customizations."""
#!/usr/bin/env python
# coding: utf8
"""
A simple example for training a part-of-speech tagger with a custom tag map.
To allow us to update the tag map with our custom one, this example starts off
with a blank Language class and modifies its defaults.
from __future__ import unicode_literals
from __future__ import print_function
For more details, see the documentation:
* Training: https://alpha.spacy.io/usage/training
* POS Tagging: https://alpha.spacy.io/usage/linguistic-features#pos-tagging
Developed for: spaCy 2.0.0a18
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
from spacy.vocab import Vocab
from spacy.tagger import Tagger
import spacy
from spacy.util import get_lang_class
from spacy.tokens import Doc
from spacy.gold import GoldParse
import random
# You need to define a mapping from your data's part-of-speech tag names to the
# Universal Part-of-Speech tag set, as spaCy includes an enum of these tags.
@ -28,54 +38,67 @@ TAG_MAP = {
# Usually you'll read this in, of course. Data formats vary.
# Ensure your strings are unicode.
DATA = [
(
["I", "like", "green", "eggs"],
["N", "V", "J", "N"]
),
(
["Eat", "blue", "ham"],
["V", "J", "N"]
)
TRAIN_DATA = [
(["I", "like", "green", "eggs"], ["N", "V", "J", "N"]),
(["Eat", "blue", "ham"], ["V", "J", "N"])
]
def ensure_dir(path):
if not path.exists():
path.mkdir()
@plac.annotations(
lang=("ISO Code of language to use", "option", "l", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(lang='en', output_dir=None, n_iter=25):
"""Create a new model, set up the pipeline and train the tagger. In order to
train the tagger with a custom tag map, we're creating a new Language
instance with a custom vocab.
"""
lang_cls = get_lang_class(lang) # get Language class
lang_cls.Defaults.tag_map.update(TAG_MAP) # add tag map to defaults
nlp = lang_cls() # initialise Language class
# add the tagger to the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
tagger = nlp.create_pipe('tagger')
nlp.add_pipe(tagger)
def main(output_dir=None):
optimizer = nlp.begin_training(lambda: [])
for i in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for words, tags in TRAIN_DATA:
doc = Doc(nlp.vocab, words=words)
gold = GoldParse(doc, tags=tags)
nlp.update([doc], [gold], sgd=optimizer, losses=losses)
print(losses)
# test the trained model
test_text = "I like blue eggs"
doc = nlp(test_text)
print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
ensure_dir(output_dir)
ensure_dir(output_dir / "pos")
ensure_dir(output_dir / "vocab")
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
vocab = Vocab(tag_map=TAG_MAP)
# The default_templates argument is where features are specified. See
# spacy/tagger.pyx for the defaults.
tagger = Tagger(vocab)
for i in range(25):
for words, tags in DATA:
doc = Doc(vocab, words=words)
gold = GoldParse(doc, tags=tags)
tagger.update(doc, gold)
random.shuffle(DATA)
tagger.model.end_training()
doc = Doc(vocab, orths_and_spaces=zip(["I", "like", "blue", "eggs"], [True] * 4))
tagger(doc)
for word in doc:
print(word.text, word.tag_, word.pos_)
if output_dir is not None:
tagger.model.dump(str(output_dir / 'pos' / 'model'))
with (output_dir / 'vocab' / 'strings.json').open('w') as file_:
tagger.vocab.strings.dump(file_)
# test the save model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc = nlp2(test_text)
print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])
if __name__ == '__main__':
plac.call(main)
# I V VERB
# like V VERB
# blue N NOUN
# eggs N NOUN
# Expected output:
# [
# ('I', 'N', 'NOUN'),
# ('like', 'V', 'VERB'),
# ('blue', 'J', 'ADJ'),
# ('eggs', 'N', 'NOUN')
# ]

View File

@ -1,108 +1,136 @@
from __future__ import unicode_literals
#!/usr/bin/env python
# coding: utf8
"""Train a multi-label convolutional neural network text classifier on the
IMDB dataset, using the TextCategorizer component. The dataset will be loaded
automatically via Thinc's built-in dataset loader. The model is added to
spacy.pipeline, and predictions are available via `doc.cats`.
For more details, see the documentation:
* Training: https://alpha.spacy.io/usage/training
* Text classification: https://alpha.spacy.io/usage/text-classification
Developed for: spaCy 2.0.0a18
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
import random
import tqdm
from thinc.neural.optimizers import Adam
from thinc.neural.ops import NumpyOps
from pathlib import Path
import thinc.extra.datasets
import spacy.lang.en
import spacy
from spacy.gold import GoldParse, minibatch
from spacy.util import compounding
from spacy.pipeline import TextCategorizer
def train_textcat(tokenizer, textcat,
train_texts, train_cats, dev_texts, dev_cats,
n_iter=20):
'''
Train the TextCategorizer without associated pipeline.
'''
textcat.begin_training()
optimizer = Adam(NumpyOps(), 0.001)
train_docs = [tokenizer(text) for text in train_texts]
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=20):
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
# add the text classifier to the pipeline if it doesn't exist
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'textcat' not in nlp.pipe_names:
# textcat = nlp.create_pipe('textcat')
textcat = TextCategorizer(nlp.vocab, labels=['POSITIVE'])
nlp.add_pipe(textcat, last=True)
# otherwise, get it, so we can add labels to it
else:
textcat = nlp.get_pipe('textcat')
# add label to text classifier
# textcat.add_label('POSITIVE')
# load the IMBD dataset
print("Loading IMDB data...")
(train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=2000)
train_docs = [nlp.tokenizer(text) for text in train_texts]
train_gold = [GoldParse(doc, cats=cats) for doc, cats in
zip(train_docs, train_cats)]
train_data = zip(train_docs, train_gold)
batch_sizes = compounding(4., 128., 1.001)
for i in range(n_iter):
losses = {}
train_data = tqdm.tqdm(train_data, leave=False) # Progress bar
for batch in minibatch(train_data, size=batch_sizes):
docs, golds = zip(*batch)
textcat.update((docs, None), golds, sgd=optimizer, drop=0.2,
losses=losses)
with textcat.model.use_params(optimizer.averages):
scores = evaluate(tokenizer, textcat, dev_texts, dev_cats)
yield losses['textcat'], scores
train_data = list(zip(train_docs, train_gold))
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
with nlp.disable_pipes(*other_pipes): # only train textcat
optimizer = nlp.begin_training(lambda: [])
print("Training the model...")
print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))
for i in range(n_iter):
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(train_data, size=compounding(4., 128., 1.001))
for batch in batches:
docs, golds = zip(*batch)
nlp.update(docs, golds, sgd=optimizer, drop=0.2, losses=losses)
with textcat.model.use_params(optimizer.averages):
# evaluate on the dev data split off in load_data()
scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
print('{0:.3f}\t{0:.3f}\t{0:.3f}\t{0:.3f}' # print a simple table
.format(losses['textcat'], scores['textcat_p'],
scores['textcat_r'], scores['textcat_f']))
# test the trained model
test_text = "This movie sucked"
doc = nlp(test_text)
print(test_text, doc.cats)
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc2 = nlp2(test_text)
print(test_text, doc2.cats)
def load_data(limit=0, split=0.8):
"""Load data from the IMDB dataset."""
# Partition off part of the train data for evaluation
train_data, _ = thinc.extra.datasets.imdb()
random.shuffle(train_data)
train_data = train_data[-limit:]
texts, labels = zip(*train_data)
cats = [{'POSITIVE': bool(y)} for y in labels]
split = int(len(train_data) * split)
return (texts[:split], cats[:split]), (texts[split:], cats[split:])
def evaluate(tokenizer, textcat, texts, cats):
docs = (tokenizer(text) for text in texts)
tp = 1e-8 # True positives
fp = 1e-8 # False positives
fn = 1e-8 # False negatives
tn = 1e-8 # True negatives
tp = 1e-8 # True positives
fp = 1e-8 # False positives
fn = 1e-8 # False negatives
tn = 1e-8 # True negatives
for i, doc in enumerate(textcat.pipe(docs)):
gold = cats[i]
for label, score in doc.cats.items():
if score >= 0.5 and label in gold:
if label not in gold:
continue
if score >= 0.5 and gold[label] >= 0.5:
tp += 1.
elif score >= 0.5 and label not in gold:
elif score >= 0.5 and gold[label] < 0.5:
fp += 1.
elif score < 0.5 and label not in gold:
elif score < 0.5 and gold[label] < 0.5:
tn += 1
if score < 0.5 and label in gold:
elif score < 0.5 and gold[label] >= 0.5:
fn += 1
precis = tp / (tp + fp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
fscore = 2 * (precis * recall) / (precis + recall)
return {'textcat_p': precis, 'textcat_r': recall, 'textcat_f': fscore}
def load_data():
# Partition off part of the train data --- avoid running experiments
# against test.
train_data, _ = thinc.extra.datasets.imdb()
random.shuffle(train_data)
texts, labels = zip(*train_data)
cats = [(['POSITIVE'] if y else []) for y in labels]
split = int(len(train_data) * 0.8)
train_texts = texts[:split]
train_cats = cats[:split]
dev_texts = texts[split:]
dev_cats = cats[split:]
return (train_texts, train_cats), (dev_texts, dev_cats)
def main(model_loc=None):
nlp = spacy.lang.en.English()
tokenizer = nlp.tokenizer
textcat = TextCategorizer(tokenizer.vocab, labels=['POSITIVE'])
print("Load IMDB data")
(train_texts, train_cats), (dev_texts, dev_cats) = load_data()
print("Itn.\tLoss\tP\tR\tF")
progress = '{i:d} {loss:.3f} {textcat_p:.3f} {textcat_r:.3f} {textcat_f:.3f}'
for i, (loss, scores) in enumerate(train_textcat(tokenizer, textcat,
train_texts, train_cats,
dev_texts, dev_cats, n_iter=20)):
print(progress.format(i=i, loss=loss, **scores))
# How to save, load and use
nlp.pipeline.append(textcat)
if model_loc is not None:
nlp.to_disk(model_loc)
nlp = spacy.load(model_loc)
doc = nlp(u'This movie sucked!')
print(doc.cats)
f_score = 2 * (precision * recall) / (precision + recall)
return {'textcat_p': precision, 'textcat_r': recall, 'textcat_f': f_score}
if __name__ == '__main__':

View File

@ -0,0 +1,641 @@
[
{
"id": "wsj_0200",
"paragraphs": [
{
"raw": "In an Oct. 19 review of \"The Misanthrope\" at Chicago's Goodman Theatre (\"Revitalized Classics Take the Stage in Windy City,\" Leisure & Arts), the role of Celimene, played by Kim Cattrall, was mistakenly attributed to Christina Haag. Ms. Haag plays Elianti.",
"sentences": [
{
"tokens": [
{
"head": 44,
"dep": "prep",
"tag": "IN",
"orth": "In",
"ner": "O",
"id": 0
},
{
"head": 3,
"dep": "det",
"tag": "DT",
"orth": "an",
"ner": "O",
"id": 1
},
{
"head": 2,
"dep": "nmod",
"tag": "NNP",
"orth": "Oct.",
"ner": "B-DATE",
"id": 2
},
{
"head": -1,
"dep": "nummod",
"tag": "CD",
"orth": "19",
"ner": "L-DATE",
"id": 3
},
{
"head": -4,
"dep": "pobj",
"tag": "NN",
"orth": "review",
"ner": "O",
"id": 4
},
{
"head": -1,
"dep": "prep",
"tag": "IN",
"orth": "of",
"ner": "O",
"id": 5
},
{
"head": 2,
"dep": "punct",
"tag": "``",
"orth": "``",
"ner": "O",
"id": 6
},
{
"head": 1,
"dep": "det",
"tag": "DT",
"orth": "The",
"ner": "B-WORK_OF_ART",
"id": 7
},
{
"head": -3,
"dep": "pobj",
"tag": "NN",
"orth": "Misanthrope",
"ner": "L-WORK_OF_ART",
"id": 8
},
{
"head": -1,
"dep": "punct",
"tag": "''",
"orth": "''",
"ner": "O",
"id": 9
},
{
"head": -2,
"dep": "prep",
"tag": "IN",
"orth": "at",
"ner": "O",
"id": 10
},
{
"head": 3,
"dep": "poss",
"tag": "NNP",
"orth": "Chicago",
"ner": "U-GPE",
"id": 11
},
{
"head": -1,
"dep": "case",
"tag": "POS",
"orth": "'s",
"ner": "O",
"id": 12
},
{
"head": 1,
"dep": "compound",
"tag": "NNP",
"orth": "Goodman",
"ner": "B-FAC",
"id": 13
},
{
"head": -4,
"dep": "pobj",
"tag": "NNP",
"orth": "Theatre",
"ner": "L-FAC",
"id": 14
},
{
"head": 4,
"dep": "punct",
"tag": "-LRB-",
"orth": "(",
"ner": "O",
"id": 15
},
{
"head": 3,
"dep": "punct",
"tag": "``",
"orth": "``",
"ner": "O",
"id": 16
},
{
"head": 1,
"dep": "amod",
"tag": "VBN",
"orth": "Revitalized",
"ner": "B-WORK_OF_ART",
"id": 17
},
{
"head": 1,
"dep": "nsubj",
"tag": "NNS",
"orth": "Classics",
"ner": "I-WORK_OF_ART",
"id": 18
},
{
"head": -15,
"dep": "appos",
"tag": "VBP",
"orth": "Take",
"ner": "I-WORK_OF_ART",
"id": 19
},
{
"head": 1,
"dep": "det",
"tag": "DT",
"orth": "the",
"ner": "I-WORK_OF_ART",
"id": 20
},
{
"head": -2,
"dep": "dobj",
"tag": "NN",
"orth": "Stage",
"ner": "I-WORK_OF_ART",
"id": 21
},
{
"head": -3,
"dep": "prep",
"tag": "IN",
"orth": "in",
"ner": "I-WORK_OF_ART",
"id": 22
},
{
"head": 1,
"dep": "compound",
"tag": "NNP",
"orth": "Windy",
"ner": "I-WORK_OF_ART",
"id": 23
},
{
"head": -2,
"dep": "pobj",
"tag": "NNP",
"orth": "City",
"ner": "L-WORK_OF_ART",
"id": 24
},
{
"head": -6,
"dep": "punct",
"tag": ",",
"orth": ",",
"ner": "O",
"id": 25
},
{
"head": -7,
"dep": "punct",
"tag": "''",
"orth": "''",
"ner": "O",
"id": 26
},
{
"head": -8,
"dep": "npadvmod",
"tag": "NN",
"orth": "Leisure",
"ner": "B-ORG",
"id": 27
},
{
"head": -1,
"dep": "cc",
"tag": "CC",
"orth": "&",
"ner": "I-ORG",
"id": 28
},
{
"head": -2,
"dep": "conj",
"tag": "NNS",
"orth": "Arts",
"ner": "L-ORG",
"id": 29
},
{
"head": -11,
"dep": "punct",
"tag": "-RRB-",
"orth": ")",
"ner": "O",
"id": 30
},
{
"head": 13,
"dep": "punct",
"tag": ",",
"orth": ",",
"ner": "O",
"id": 31
},
{
"head": 1,
"dep": "det",
"tag": "DT",
"orth": "the",
"ner": "O",
"id": 32
},
{
"head": 11,
"dep": "nsubjpass",
"tag": "NN",
"orth": "role",
"ner": "O",
"id": 33
},
{
"head": -1,
"dep": "prep",
"tag": "IN",
"orth": "of",
"ner": "O",
"id": 34
},
{
"head": -1,
"dep": "pobj",
"tag": "NNP",
"orth": "Celimene",
"ner": "U-PERSON",
"id": 35
},
{
"head": -3,
"dep": "punct",
"tag": ",",
"orth": ",",
"ner": "O",
"id": 36
},
{
"head": -4,
"dep": "acl",
"tag": "VBN",
"orth": "played",
"ner": "O",
"id": 37
},
{
"head": -1,
"dep": "agent",
"tag": "IN",
"orth": "by",
"ner": "O",
"id": 38
},
{
"head": 1,
"dep": "compound",
"tag": "NNP",
"orth": "Kim",
"ner": "B-PERSON",
"id": 39
},
{
"head": -2,
"dep": "pobj",
"tag": "NNP",
"orth": "Cattrall",
"ner": "L-PERSON",
"id": 40
},
{
"head": -8,
"dep": "punct",
"tag": ",",
"orth": ",",
"ner": "O",
"id": 41
},
{
"head": 2,
"dep": "auxpass",
"tag": "VBD",
"orth": "was",
"ner": "O",
"id": 42
},
{
"head": 1,
"dep": "advmod",
"tag": "RB",
"orth": "mistakenly",
"ner": "O",
"id": 43
},
{
"head": 0,
"dep": "root",
"tag": "VBN",
"orth": "attributed",
"ner": "O",
"id": 44
},
{
"head": -1,
"dep": "prep",
"tag": "IN",
"orth": "to",
"ner": "O",
"id": 45
},
{
"head": 1,
"dep": "compound",
"tag": "NNP",
"orth": "Christina",
"ner": "B-PERSON",
"id": 46
},
{
"head": -2,
"dep": "pobj",
"tag": "NNP",
"orth": "Haag",
"ner": "L-PERSON",
"id": 47
},
{
"head": -4,
"dep": "punct",
"tag": ".",
"orth": ".",
"ner": "O",
"id": 48
}
],
"brackets": [
{
"first": 2,
"last": 3,
"label": "NML"
},
{
"first": 1,
"last": 4,
"label": "NP"
},
{
"first": 7,
"last": 8,
"label": "NP-TTL"
},
{
"first": 11,
"last": 12,
"label": "NP"
},
{
"first": 11,
"last": 14,
"label": "NP"
},
{
"first": 10,
"last": 14,
"label": "PP-LOC"
},
{
"first": 6,
"last": 14,
"label": "NP"
},
{
"first": 5,
"last": 14,
"label": "PP"
},
{
"first": 1,
"last": 14,
"label": "NP"
},
{
"first": 17,
"last": 18,
"label": "NP-SBJ"
},
{
"first": 20,
"last": 21,
"label": "NP"
},
{
"first": 23,
"last": 24,
"label": "NP"
},
{
"first": 22,
"last": 24,
"label": "PP-LOC"
},
{
"first": 19,
"last": 24,
"label": "VP"
},
{
"first": 17,
"last": 24,
"label": "S-HLN"
},
{
"first": 27,
"last": 29,
"label": "NP-TMP"
},
{
"first": 15,
"last": 30,
"label": "NP"
},
{
"first": 1,
"last": 30,
"label": "NP"
},
{
"first": 0,
"last": 30,
"label": "PP-LOC"
},
{
"first": 32,
"last": 33,
"label": "NP"
},
{
"first": 35,
"last": 35,
"label": "NP"
},
{
"first": 34,
"last": 35,
"label": "PP"
},
{
"first": 32,
"last": 35,
"label": "NP"
},
{
"first": 39,
"last": 40,
"label": "NP-LGS"
},
{
"first": 38,
"last": 40,
"label": "PP"
},
{
"first": 37,
"last": 40,
"label": "VP"
},
{
"first": 32,
"last": 41,
"label": "NP-SBJ-2"
},
{
"first": 43,
"last": 43,
"label": "ADVP-MNR"
},
{
"first": 46,
"last": 47,
"label": "NP"
},
{
"first": 45,
"last": 47,
"label": "PP-CLR"
},
{
"first": 44,
"last": 47,
"label": "VP"
},
{
"first": 42,
"last": 47,
"label": "VP"
},
{
"first": 0,
"last": 48,
"label": "S"
}
]
},
{
"tokens": [
{
"head": 1,
"dep": "compound",
"tag": "NNP",
"orth": "Ms.",
"ner": "O",
"id": 0
},
{
"head": 1,
"dep": "nsubj",
"tag": "NNP",
"orth": "Haag",
"ner": "U-PERSON",
"id": 1
},
{
"head": 0,
"dep": "root",
"tag": "VBZ",
"orth": "plays",
"ner": "O",
"id": 2
},
{
"head": -1,
"dep": "dobj",
"tag": "NNP",
"orth": "Elianti",
"ner": "U-PERSON",
"id": 3
},
{
"head": -2,
"dep": "punct",
"tag": ".",
"orth": ".",
"ner": "O",
"id": 4
}
],
"brackets": [
{
"first": 0,
"last": 1,
"label": "NP-SBJ"
},
{
"first": 3,
"last": 3,
"label": "NP"
},
{
"first": 2,
"last": 3,
"label": "VP"
},
{
"first": 0,
"last": 4,
"label": "S"
}
]
}
]
}
]
}
]

View File

@ -0,0 +1,21 @@
{"lang": "en", "settings": {"oov_prob": -20.502029418945312}}
{"orth": ".", "id": 1, "lower": ".", "norm": ".", "shape": ".", "prefix": ".", "suffix": ".", "length": 1, "cluster": "8", "prob": -3.0678977966308594, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": true, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": ",", "id": 2, "lower": ",", "norm": ",", "shape": ",", "prefix": ",", "suffix": ",", "length": 1, "cluster": "4", "prob": -3.4549596309661865, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": true, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "the", "id": 3, "lower": "the", "norm": "the", "shape": "xxx", "prefix": "t", "suffix": "the", "length": 3, "cluster": "11", "prob": -3.528766632080078, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "I", "id": 4, "lower": "i", "norm": "I", "shape": "X", "prefix": "I", "suffix": "I", "length": 1, "cluster": "346", "prob": -3.791565179824829, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": false, "is_space": false, "is_title": true, "is_upper": true, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "to", "id": 5, "lower": "to", "norm": "to", "shape": "xx", "prefix": "t", "suffix": "to", "length": 2, "cluster": "12", "prob": -3.8560216426849365, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "a", "id": 6, "lower": "a", "norm": "a", "shape": "x", "prefix": "a", "suffix": "a", "length": 1, "cluster": "19", "prob": -3.92978835105896, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "and", "id": 7, "lower": "and", "norm": "and", "shape": "xxx", "prefix": "a", "suffix": "and", "length": 3, "cluster": "20", "prob": -4.113108158111572, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "of", "id": 8, "lower": "of", "norm": "of", "shape": "xx", "prefix": "o", "suffix": "of", "length": 2, "cluster": "28", "prob": -4.27587366104126, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "you", "id": 9, "lower": "you", "norm": "you", "shape": "xxx", "prefix": "y", "suffix": "you", "length": 3, "cluster": "602", "prob": -4.373791217803955, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "it", "id": 10, "lower": "it", "norm": "it", "shape": "xx", "prefix": "i", "suffix": "it", "length": 2, "cluster": "474", "prob": -4.388050079345703, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "is", "id": 11, "lower": "is", "norm": "is", "shape": "xx", "prefix": "i", "suffix": "is", "length": 2, "cluster": "762", "prob": -4.457748889923096, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "that", "id": 12, "lower": "that", "norm": "that", "shape": "xxxx", "prefix": "t", "suffix": "hat", "length": 4, "cluster": "84", "prob": -4.464504718780518, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "\n\n", "id": 0, "lower": "\n\n", "norm": "\n\n", "shape": "\n\n", "prefix": "\n", "suffix": "\n\n", "length": 2, "cluster": "0", "prob": -4.606560707092285, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": false, "is_space": true, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "in", "id": 13, "lower": "in", "norm": "in", "shape": "xx", "prefix": "i", "suffix": "in", "length": 2, "cluster": "60", "prob": -4.619071960449219, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "'s", "id": 14, "lower": "'s", "norm": "'s", "shape": "'x", "prefix": "'", "suffix": "'s", "length": 2, "cluster": "52", "prob": -4.830559253692627, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "n't", "id": 15, "lower": "n't", "norm": "n't", "shape": "x'x", "prefix": "n", "suffix": "n't", "length": 3, "cluster": "74", "prob": -4.859938621520996, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "for", "id": 16, "lower": "for", "norm": "for", "shape": "xxx", "prefix": "f", "suffix": "for", "length": 3, "cluster": "508", "prob": -4.8801093101501465, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "\"", "id": 17, "lower": "\"", "norm": "\"", "shape": "\"", "prefix": "\"", "suffix": "\"", "length": 1, "cluster": "0", "prob": -5.02677583694458, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": true, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": true, "is_left_punct": true, "is_right_punct": true}
{"orth": "?", "id": 18, "lower": "?", "norm": "?", "shape": "?", "prefix": "?", "suffix": "?", "length": 1, "cluster": "0", "prob": -5.05924654006958, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": true, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": " ", "id": 0, "lower": " ", "norm": " ", "shape": " ", "prefix": " ", "suffix": " ", "length": 1, "cluster": "0", "prob": -5.129165172576904, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": false, "is_space": true, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}

View File

@ -1,36 +0,0 @@
# encoding: utf8
from __future__ import unicode_literals, print_function
import plac
import codecs
import pathlib
import random
import twython
import spacy.en
import _handler
class Connection(twython.TwythonStreamer):
def __init__(self, keys_dir, nlp, query):
keys_dir = pathlib.Path(keys_dir)
read = lambda fn: (keys_dir / (fn + '.txt')).open().read().strip()
api_key = map(read, ['key', 'secret', 'token', 'token_secret'])
twython.TwythonStreamer.__init__(self, *api_key)
self.nlp = nlp
self.query = query
def on_success(self, data):
_handler.handle_tweet(self.nlp, data, self.query)
if random.random() >= 0.1:
reload(_handler)
def main(keys_dir, term):
nlp = spacy.en.English()
twitter = Connection(keys_dir, nlp, term)
twitter.statuses.filter(track=term, language='en')
if __name__ == '__main__':
plac.call(main)

View File

@ -0,0 +1,33 @@
#!/usr/bin/env python
# coding: utf8
"""Load vectors for a language trained using fastText
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
"""
from __future__ import unicode_literals
import plac
import numpy
import from spacy.language import Language
@plac.annotations(
vectors_loc=("Path to vectors", "positional", None, str))
def main(vectors_loc):
nlp = Language()
with open(vectors_loc, 'rb') as file_:
header = file_.readline()
nr_row, nr_dim = header.split()
nlp.vocab.clear_vectors(int(nr_dim))
for line in file_:
line = line.decode('utf8')
pieces = line.split()
word = pieces[0]
vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
nlp.vocab.set_vector(word, vector)
doc = nlp(u'class colspan')
print(doc[0].similarity(doc[1]))
if __name__ == '__main__':
plac.call(main)

5
fabfile.py vendored
View File

@ -14,6 +14,7 @@ VENV_DIR = path.join(PWD, ENV)
def env(lang='python2.7'):
if path.exists(VENV_DIR):
local('rm -rf {env}'.format(env=VENV_DIR))
local('pip install virtualenv')
local('python -m virtualenv -p {lang} {env}'.format(lang=lang, env=VENV_DIR))
@ -32,6 +33,10 @@ def make():
local('pip install -r requirements.txt')
local('python setup.py build_ext --inplace')
def sdist():
with virtualenv(VENV_DIR):
with lcd(path.dirname(__file__)):
local('python setup.py sdist')
def clean():
with lcd(path.dirname(__file__)):

View File

@ -1,9 +1,9 @@
cython<0.24
cython>=0.24,<0.27.0
pathlib
numpy>=1.7
cymem>=1.30,<1.32
preshed>=1.0.0,<2.0.0
thinc>=6.8.0,<6.9.0
thinc>=6.10.0,<6.11.0
murmurhash>=0.28,<0.29
plac<1.0.0,>=0.9.6
six
@ -13,7 +13,7 @@ requests>=2.13.0,<3.0.0
regex==2017.4.5
ftfy>=4.4.2,<5.0.0
pytest>=3.0.6,<4.0.0
pip>=9.0.0,<10.0.0
mock>=2.0.0,<3.0.0
msgpack-python
msgpack-numpy
html5lib==1.0b8

View File

@ -24,25 +24,19 @@ MOD_NAMES = [
'spacy.vocab',
'spacy.attrs',
'spacy.morphology',
'spacy.tagger',
'spacy.pipeline',
'spacy.syntax.stateclass',
'spacy.syntax._state',
'spacy.syntax._beam_utils',
'spacy.tokenizer',
'spacy._cfile',
'spacy.syntax.parser',
'spacy.syntax.nn_parser',
'spacy.syntax.beam_parser',
'spacy.syntax.nonproj',
'spacy.syntax.transition_system',
'spacy.syntax.arc_eager',
'spacy.syntax._parse_features',
'spacy.gold',
'spacy.tokens.doc',
'spacy.tokens.span',
'spacy.tokens.token',
'spacy.cfile',
'spacy.matcher',
'spacy.syntax.ner',
'spacy.symbols',
@ -53,7 +47,8 @@ MOD_NAMES = [
COMPILE_OPTIONS = {
'msvc': ['/Ox', '/EHsc'],
'mingw32' : ['-O3', '-Wno-strict-prototypes', '-Wno-unused-function'],
'other' : ['-O3', '-Wno-strict-prototypes', '-Wno-unused-function']
'other' : ['-O3', '-Wno-strict-prototypes', '-Wno-unused-function',
'-march=native']
}
@ -66,7 +61,7 @@ LINK_OPTIONS = {
# I don't understand this very well yet. See Issue #267
# Fingers crossed!
USE_OPENMP_DEFAULT = '1' if sys.platform != 'darwin' else None
USE_OPENMP_DEFAULT = '0' if sys.platform != 'darwin' else None
if os.environ.get('USE_OPENMP', USE_OPENMP_DEFAULT) == '1':
if sys.platform == 'darwin':
COMPILE_OPTIONS['other'].append('-fopenmp')
@ -195,9 +190,8 @@ def setup_package():
'murmurhash>=0.28,<0.29',
'cymem>=1.30,<1.32',
'preshed>=1.0.0,<2.0.0',
'thinc>=6.8.0,<6.9.0',
'thinc>=6.10.0,<6.11.0',
'plac<1.0.0,>=0.9.6',
'pip>=9.0.0,<10.0.0',
'six',
'pathlib',
'ujson>=1.35',

View File

@ -3,12 +3,12 @@ from __future__ import unicode_literals
from .cli.info import info as cli_info
from .glossary import explain
from .deprecated import resolve_load_name
from .about import __version__
from . import util
def load(name, **overrides):
from .deprecated import resolve_load_name
name = resolve_load_name(name, **overrides)
return util.load_model(name, **overrides)

View File

@ -1,13 +1,13 @@
# coding: utf8
from __future__ import print_function
# NB! This breaks in plac on Python 2!!
#from __future__ import unicode_literals
# from __future__ import unicode_literals
if __name__ == '__main__':
import plac
import sys
from spacy.cli import download, link, info, package, train, convert, model
from spacy.cli import profile
from spacy.cli import vocab, profile, evaluate, validate
from spacy.util import prints
commands = {
@ -15,10 +15,13 @@ if __name__ == '__main__':
'link': link,
'info': info,
'train': train,
'evaluate': evaluate,
'convert': convert,
'package': package,
'model': model,
'vocab': vocab,
'profile': profile,
'validate': validate
}
if len(sys.argv) == 1:
prints(', '.join(commands), title="Available commands", exits=1)

View File

@ -1,26 +0,0 @@
from libc.stdio cimport fopen, fclose, fread, fwrite, FILE
from cymem.cymem cimport Pool
cdef class CFile:
cdef FILE* fp
cdef bint is_open
cdef Pool mem
cdef int size # For compatibility with subclass
cdef int _capacity # For compatibility with subclass
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1
cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *
cdef class StringCFile(CFile):
cdef unsigned char* data
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1
cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *

View File

@ -1,88 +0,0 @@
from libc.stdio cimport fopen, fclose, fread, fwrite, FILE
from libc.string cimport memcpy
cdef class CFile:
def __init__(self, loc, mode, on_open_error=None):
if isinstance(mode, unicode):
mode_str = mode.encode('ascii')
else:
mode_str = mode
if hasattr(loc, 'as_posix'):
loc = loc.as_posix()
self.mem = Pool()
cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
self.fp = fopen(<char*>bytes_loc, mode_str)
if self.fp == NULL:
if on_open_error is not None:
on_open_error()
else:
raise IOError("Could not open binary file %s" % bytes_loc)
self.is_open = True
def __dealloc__(self):
if self.is_open:
fclose(self.fp)
def close(self):
fclose(self.fp)
self.is_open = False
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1:
st = fread(dest, elem_size, number, self.fp)
if st != number:
raise IOError
cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1:
st = fwrite(src, elem_size, number, self.fp)
if st != number:
raise IOError
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *:
cdef void* dest = mem.alloc(number, elem_size)
self.read_into(dest, number, elem_size)
return dest
def write_unicode(self, unicode value):
cdef bytes py_bytes = value.encode('utf8')
cdef char* chars = <char*>py_bytes
self.write(sizeof(char), len(py_bytes), chars)
cdef class StringCFile:
def __init__(self, mode, bytes data=b'', on_open_error=None):
self.mem = Pool()
self.is_open = 'w' in mode
self._capacity = max(len(data), 8)
self.size = len(data)
self.data = <unsigned char*>self.mem.alloc(1, self._capacity)
for i in range(len(data)):
self.data[i] = data[i]
def close(self):
self.is_open = False
def string_data(self):
return (self.data-self.size)[:self.size]
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1:
memcpy(dest, self.data, elem_size * number)
self.data += elem_size * number
cdef int write_from(self, void* src, size_t elem_size, size_t number) except -1:
write_size = number * elem_size
if (self.size + write_size) >= self._capacity:
self._capacity = (self.size + write_size) * 2
self.data = <unsigned char*>self.mem.realloc(self.data, self._capacity)
memcpy(&self.data[self.size], src, elem_size * number)
self.size += write_size
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *:
cdef void* dest = mem.alloc(number, elem_size)
self.read_into(dest, number, elem_size)
return dest
def write_unicode(self, unicode value):
cdef bytes py_bytes = value.encode('utf8')
cdef char* chars = <char*>py_bytes
self.write(sizeof(char), len(py_bytes), chars)

View File

@ -1,43 +1,42 @@
import ujson
from thinc.api import add, layerize, chain, clone, concatenate, with_flatten
from thinc.neural import Model, Maxout, Softmax, Affine
from thinc.neural._classes.hash_embed import HashEmbed
from thinc.neural.ops import NumpyOps, CupyOps
from thinc.neural.util import get_array_module
import random
import cytoolz
# coding: utf8
from __future__ import unicode_literals
import numpy
from thinc.v2v import Model, Maxout, Softmax, Affine, ReLu
from thinc.i2v import HashEmbed, StaticVectors
from thinc.t2t import ExtractWindow, ParametricAttention
from thinc.t2v import Pooling, sum_pool
from thinc.misc import Residual
from thinc.misc import LayerNorm as LN
from thinc.api import add, layerize, chain, clone, concatenate, with_flatten
from thinc.api import FeatureExtracter, with_getitem, flatten_add_lengths
from thinc.api import uniqued, wrap, noop
from thinc.linear.linear import LinearModel
from thinc.neural.ops import NumpyOps, CupyOps
from thinc.neural.util import get_array_module, copy_array
from thinc.neural._lsuv import svd_orthonormal
from thinc.neural._classes.convolution import ExtractWindow
from thinc.neural._classes.static_vectors import StaticVectors
from thinc.neural._classes.batchnorm import BatchNorm as BN
from thinc.neural._classes.layernorm import LayerNorm as LN
from thinc.neural._classes.resnet import Residual
from thinc.neural import ReLu
from thinc.neural._classes.selu import SELU
from thinc import describe
from thinc.describe import Dimension, Synapses, Biases, Gradient
from thinc.neural._classes.affine import _set_dimensions_if_needed
from thinc.api import FeatureExtracter, with_getitem
from thinc.neural.pooling import Pooling, max_pool, mean_pool, sum_pool
from thinc.neural._classes.attention import ParametricAttention
from thinc.linear.linear import LinearModel
from thinc.api import uniqued, wrap, flatten_add_lengths
import thinc.extra.load_nlp
from thinc.neural._lsuv import svd_orthonormal
from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE, TAG, DEP, CLUSTER
from .tokens.doc import Doc
from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE
from . import util
import numpy
import io
VECTORS_KEY = 'spacy_pretrained_vectors'
@layerize
def _flatten_add_lengths(seqs, pad=0, drop=0.):
ops = Model.ops
lengths = ops.asarray([len(seq) for seq in seqs], dtype='i')
def finish_update(d_X, sgd=None):
return ops.unflatten(d_X, lengths, pad=pad)
X = ops.flatten(seqs, pad=pad)
return (X, lengths), finish_update
@ -51,33 +50,14 @@ def _logistic(X, drop=0.):
X = xp.minimum(X, 10., X)
X = xp.maximum(X, -10., X)
Y = 1. / (1. + xp.exp(-X))
def logistic_bwd(dY, sgd=None):
dX = dY * (Y * (1-Y))
return dX
return Y, logistic_bwd
@layerize
def add_tuples(X, drop=0.):
"""Give inputs of sequence pairs, where each sequence is (vals, length),
sum the values, returning a single sequence.
If input is:
((vals1, length), (vals2, length)
Output is:
(vals1+vals2, length)
vals are a single tensor for the whole batch.
"""
(vals1, length1), (vals2, length2) = X
assert length1 == length2
def add_tuples_bwd(dY, sgd=None):
return (dY, dY)
return (vals1+vals2, length), add_tuples_bwd
def _zero_init(model):
def _zero_init_impl(self, X, y):
self.W.fill(0)
@ -90,7 +70,6 @@ def _zero_init(model):
@layerize
def _preprocess_doc(docs, drop=0.):
keys = [doc.to_array([LOWER]) for doc in docs]
keys = [a[:, 0] for a in keys]
ops = Model.ops
lengths = ops.asarray([arr.shape[0] for arr in keys])
keys = ops.xp.concatenate(keys)
@ -98,78 +77,25 @@ def _preprocess_doc(docs, drop=0.):
return (keys, vals, lengths), None
def _init_for_precomputed(W, ops):
if (W**2).sum() != 0.:
return
reshaped = W.reshape((W.shape[1], W.shape[0] * W.shape[2]))
ops.xavier_uniform_init(reshaped)
W[:] = reshaped.reshape(W.shape)
@describe.on_data(_set_dimensions_if_needed)
@describe.on_data(_set_dimensions_if_needed,
lambda model, X, y: model.init_weights(model))
@describe.attributes(
nI=Dimension("Input size"),
nF=Dimension("Number of features"),
nO=Dimension("Output size"),
nP=Dimension("Maxout pieces"),
W=Synapses("Weights matrix",
lambda obj: (obj.nF, obj.nO, obj.nI),
lambda W, ops: _init_for_precomputed(W, ops)),
b=Biases("Bias vector",
lambda obj: (obj.nO,)),
d_W=Gradient("W"),
d_b=Gradient("b")
)
class PrecomputableAffine(Model):
def __init__(self, nO=None, nI=None, nF=None, **kwargs):
Model.__init__(self, **kwargs)
self.nO = nO
self.nI = nI
self.nF = nF
def begin_update(self, X, drop=0.):
# X: (b, i)
# Yf: (b, f, i)
# dY: (b, o)
# dYf: (b, f, o)
#Yf = numpy.einsum('bi,foi->bfo', X, self.W)
Yf = self.ops.xp.tensordot(
X, self.W, axes=[[1], [2]])
Yf += self.b
def backward(dY_ids, sgd=None):
tensordot = self.ops.xp.tensordot
dY, ids = dY_ids
Xf = X[ids]
#dXf = numpy.einsum('bo,foi->bfi', dY, self.W)
dXf = tensordot(dY, self.W, axes=[[1], [1]])
#dW = numpy.einsum('bo,bfi->ofi', dY, Xf)
dW = tensordot(dY, Xf, axes=[[0], [0]])
# ofi -> foi
self.d_W += dW.transpose((1, 0, 2))
self.d_b += dY.sum(axis=0)
if sgd is not None:
sgd(self._mem.weights, self._mem.gradient, key=self.id)
return dXf
return Yf, backward
@describe.on_data(_set_dimensions_if_needed)
@describe.attributes(
nI=Dimension("Input size"),
nF=Dimension("Number of features"),
nP=Dimension("Number of pieces"),
nO=Dimension("Output size"),
W=Synapses("Weights matrix",
lambda obj: (obj.nF, obj.nO, obj.nP, obj.nI),
lambda W, ops: ops.xavier_uniform_init(W)),
lambda obj: (obj.nF, obj.nO, obj.nP, obj.nI)),
b=Biases("Bias vector",
lambda obj: (obj.nO, obj.nP)),
pad=Synapses("Pad",
lambda obj: (1, obj.nF, obj.nO, obj.nP),
lambda M, ops: ops.normal_init(M, 1.)),
d_W=Gradient("W"),
d_b=Gradient("b")
)
class PrecomputableMaxouts(Model):
def __init__(self, nO=None, nI=None, nF=None, nP=3, **kwargs):
d_pad=Gradient("pad"),
d_b=Gradient("b"))
class PrecomputableAffine(Model):
def __init__(self, nO=None, nI=None, nF=None, nP=None, **kwargs):
Model.__init__(self, **kwargs)
self.nO = nO
self.nP = nP
@ -177,142 +103,195 @@ class PrecomputableMaxouts(Model):
self.nF = nF
def begin_update(self, X, drop=0.):
# X: (b, i)
# Yfp: (b, f, o, p)
# Xf: (f, b, i)
# dYp: (b, o, p)
# W: (f, o, p, i)
# b: (o, p)
Yf = self.ops.xp.dot(X,
self.W.reshape((self.nF*self.nO*self.nP, self.nI)).T)
Yf = Yf.reshape((Yf.shape[0], self.nF, self.nO, self.nP))
Yf = self._add_padding(Yf)
# bi,opfi->bfop
# bop,fopi->bfi
# bop,fbi->opfi : fopi
tensordot = self.ops.xp.tensordot
ascontiguous = self.ops.xp.ascontiguousarray
Yfp = tensordot(X, self.W, axes=[[1], [3]])
Yfp += self.b
def backward(dYp_ids, sgd=None):
dYp, ids = dYp_ids
def backward(dY_ids, sgd=None):
dY, ids = dY_ids
dY, ids = self._backprop_padding(dY, ids)
Xf = X[ids]
Xf = Xf.reshape((Xf.shape[0], self.nF * self.nI))
dXf = tensordot(dYp, self.W, axes=[[1, 2], [1,2]])
dW = tensordot(dYp, Xf, axes=[[0], [0]])
self.d_b += dY.sum(axis=0)
dY = dY.reshape((dY.shape[0], self.nO*self.nP))
self.d_W += dW.transpose((2, 0, 1, 3))
self.d_b += dYp.sum(axis=0)
Wopfi = self.W.transpose((1, 2, 0, 3))
Wopfi = self.ops.xp.ascontiguousarray(Wopfi)
Wopfi = Wopfi.reshape((self.nO*self.nP, self.nF * self.nI))
dXf = self.ops.dot(dY.reshape((dY.shape[0], self.nO*self.nP)), Wopfi)
# Reuse the buffer
dWopfi = Wopfi; dWopfi.fill(0.)
self.ops.xp.dot(dY.T, Xf, out=dWopfi)
dWopfi = dWopfi.reshape((self.nO, self.nP, self.nF, self.nI))
# (o, p, f, i) --> (f, o, p, i)
self.d_W += dWopfi.transpose((2, 0, 1, 3))
if sgd is not None:
sgd(self._mem.weights, self._mem.gradient, key=self.id)
return dXf
return Yfp, backward
return dXf.reshape((dXf.shape[0], self.nF, self.nI))
return Yf, backward
def _add_padding(self, Yf):
Yf_padded = self.ops.xp.vstack((self.pad, Yf))
return Yf_padded[1:]
def _backprop_padding(self, dY, ids):
for i in range(ids.shape[0]):
for j in range(ids.shape[1]):
if ids[i, j] < 0:
self.d_pad[0, j] += dY[i, j]
return dY, ids
def drop_layer(layer, factor=2.):
def drop_layer_fwd(X, drop=0.):
if drop <= 0.:
return layer.begin_update(X, drop=drop)
else:
coinflip = layer.ops.xp.random.random()
if (coinflip / factor) >= drop:
return layer.begin_update(X, drop=drop)
@staticmethod
def init_weights(model):
'''This is like the 'layer sequential unit variance', but instead
of taking the actual inputs, we randomly generate whitened data.
Why's this all so complicated? We have a huge number of inputs,
and the maxout unit makes guessing the dynamics tricky. Instead
we set the maxout weights to values that empirically result in
whitened outputs given whitened inputs.
'''
if (model.W**2).sum() != 0.:
return
model.ops.normal_init(model.W, model.nF * model.nI, inplace=True)
ids = numpy.zeros((5000, model.nF), dtype='i')
ids += numpy.asarray(numpy.random.uniform(0, 1000, ids.shape), dtype='i')
tokvecs = numpy.zeros((5000, model.nI), dtype='f')
tokvecs += numpy.random.normal(loc=0., scale=1.,
size=tokvecs.size).reshape(tokvecs.shape)
def predict(ids, tokvecs):
# nS ids. nW tokvecs
hiddens = model(tokvecs) # (nW, f, o, p)
# need nS vectors
vectors = model.ops.allocate((ids.shape[0], model.nO, model.nP))
for i, feats in enumerate(ids):
for j, id_ in enumerate(feats):
vectors[i] += hiddens[id_, j]
vectors += model.b
if model.nP >= 2:
return model.ops.maxout(vectors)[0]
else:
return X, lambda dX, sgd=None: dX
return vectors * (vectors >= 0)
model = wrap(drop_layer_fwd, layer)
model.predict = layer
return model
tol_var = 0.01
tol_mean = 0.01
t_max = 10
t_i = 0
for t_i in range(t_max):
acts1 = predict(ids, tokvecs)
var = numpy.var(acts1)
mean = numpy.mean(acts1)
if abs(var - 1.0) >= tol_var:
model.W /= numpy.sqrt(var)
elif abs(mean) >= tol_mean:
model.b -= mean
else:
break
def Tok2Vec(width, embed_size, preprocess=None):
def link_vectors_to_models(vocab):
vectors = vocab.vectors
ops = Model.ops
for word in vocab:
if word.orth in vectors.key2row:
word.rank = vectors.key2row[word.orth]
else:
word.rank = 0
data = ops.asarray(vectors.data)
# Set an entry here, so that vectors are accessed by StaticVectors
# (unideal, I know)
thinc.extra.load_nlp.VECTORS[(ops.device, VECTORS_KEY)] = data
def Tok2Vec(width, embed_size, **kwargs):
pretrained_dims = kwargs.get('pretrained_dims', 0)
cnn_maxout_pieces = kwargs.get('cnn_maxout_pieces', 2)
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
with Model.define_operators({'>>': chain, '|': concatenate, '**': clone, '+': add}):
norm = HashEmbed(width, embed_size, column=cols.index(NORM), name='embed_norm')
prefix = HashEmbed(width, embed_size//2, column=cols.index(PREFIX), name='embed_prefix')
suffix = HashEmbed(width, embed_size//2, column=cols.index(SUFFIX), name='embed_suffix')
shape = HashEmbed(width, embed_size//2, column=cols.index(SHAPE), name='embed_shape')
with Model.define_operators({'>>': chain, '|': concatenate, '**': clone,
'+': add, '*': reapply}):
norm = HashEmbed(width, embed_size, column=cols.index(NORM),
name='embed_norm')
prefix = HashEmbed(width, embed_size//2, column=cols.index(PREFIX),
name='embed_prefix')
suffix = HashEmbed(width, embed_size//2, column=cols.index(SUFFIX),
name='embed_suffix')
shape = HashEmbed(width, embed_size//2, column=cols.index(SHAPE),
name='embed_shape')
if pretrained_dims is not None and pretrained_dims >= 1:
glove = StaticVectors(VECTORS_KEY, width, column=cols.index(ID))
embed = uniqued(
(glove | norm | prefix | suffix | shape)
>> LN(Maxout(width, width*5, pieces=3)), column=5)
else:
embed = uniqued(
(norm | prefix | suffix | shape)
>> LN(Maxout(width, width*4, pieces=3)), column=5)
convolution = Residual(
ExtractWindow(nW=1)
>> LN(Maxout(width, width*3, pieces=cnn_maxout_pieces))
)
embed = (norm | prefix | suffix | shape ) >> LN(Maxout(width, width*4, pieces=3))
tok2vec = (
with_flatten(
asarray(Model.ops, dtype='uint64')
>> uniqued(embed, column=5)
>> Residual(
(ExtractWindow(nW=1) >> LN(Maxout(width, width*3)))
) ** 4, pad=4
FeatureExtracter(cols)
>> with_flatten(
embed
>> convolution ** 4, pad=4
)
)
if preprocess not in (False, None):
tok2vec = preprocess >> tok2vec
# Work around thinc API limitations :(. TODO: Revise in Thinc 7
tok2vec.nO = width
tok2vec.embed = embed
return tok2vec
def reapply(layer, n_times):
def reapply_fwd(X, drop=0.):
backprops = []
for i in range(n_times):
Y, backprop = layer.begin_update(X, drop=drop)
X = Y
backprops.append(backprop)
def reapply_bwd(dY, sgd=None):
dX = None
for backprop in reversed(backprops):
dY = backprop(dY, sgd=sgd)
if dX is None:
dX = dY
else:
dX += dY
return dX
return Y, reapply_bwd
return wrap(reapply_fwd, layer)
def asarray(ops, dtype):
def forward(X, drop=0.):
return ops.asarray(X, dtype=dtype), None
return layerize(forward)
def foreach(layer):
def forward(Xs, drop=0.):
results = []
backprops = []
for X in Xs:
result, bp = layer.begin_update(X, drop=drop)
results.append(result)
backprops.append(bp)
def backward(d_results, sgd=None):
dXs = []
for d_result, backprop in zip(d_results, backprops):
dXs.append(backprop(d_result, sgd))
return dXs
return results, backward
model = layerize(forward)
model._layers.append(layer)
return model
def rebatch(size, layer):
ops = layer.ops
def forward(X, drop=0.):
if X.shape[0] < size:
return layer.begin_update(X)
parts = _divide_array(X, size)
results, bp_results = zip(*[layer.begin_update(p, drop=drop)
for p in parts])
y = ops.flatten(results)
def backward(dy, sgd=None):
d_parts = [bp(y, sgd=sgd) for bp, y in
zip(bp_results, _divide_array(dy, size))]
try:
dX = ops.flatten(d_parts)
except TypeError:
dX = None
except ValueError:
dX = None
return dX
return y, backward
model = layerize(forward)
model._layers.append(layer)
return model
def _divide_array(X, size):
parts = []
index = 0
while index < len(X):
parts.append(X[index : index + size])
parts.append(X[index:index + size])
index += size
return parts
def get_col(idx):
assert idx >= 0, idx
def forward(X, drop=0.):
assert idx >= 0, idx
if isinstance(X, numpy.ndarray):
@ -320,30 +299,28 @@ def get_col(idx):
else:
ops = CupyOps()
output = ops.xp.ascontiguousarray(X[:, idx], dtype=X.dtype)
def backward(y, sgd=None):
assert idx >= 0, idx
dX = ops.allocate(X.shape)
dX[:, idx] += y
return dX
return output, backward
return layerize(forward)
def zero_init(model):
def _hook(self, X, y=None):
self.W.fill(0)
model.on_data_hooks.append(_hook)
return model
def doc2feats(cols=None):
if cols is None:
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
def forward(docs, drop=0.):
feats = []
for doc in docs:
feats.append(doc.to_array(cols))
return feats, None
model = layerize(forward)
model.cols = cols
return model
@ -357,68 +334,14 @@ def print_shape(prefix):
@layerize
def get_token_vectors(tokens_attrs_vectors, drop=0.):
ops = Model.ops
tokens, attrs, vectors = tokens_attrs_vectors
def backward(d_output, sgd=None):
return (tokens, d_output)
return vectors, backward
def fine_tune(embedding, combine=None):
if combine is not None:
raise NotImplementedError(
"fine_tune currently only supports addition. Set combine=None")
def fine_tune_fwd(docs_tokvecs, drop=0.):
docs, tokvecs = docs_tokvecs
lengths = model.ops.asarray([len(doc) for doc in docs], dtype='i')
vecs, bp_vecs = embedding.begin_update(docs, drop=drop)
flat_tokvecs = embedding.ops.flatten(tokvecs)
flat_vecs = embedding.ops.flatten(vecs)
output = embedding.ops.unflatten(
(model.mix[0] * flat_tokvecs + model.mix[1] * flat_vecs), lengths)
def fine_tune_bwd(d_output, sgd=None):
flat_grad = model.ops.flatten(d_output)
model.d_mix[0] += flat_tokvecs.dot(flat_grad.T).sum()
model.d_mix[1] += flat_vecs.dot(flat_grad.T).sum()
bp_vecs([d_o * model.mix[1] for d_o in d_output], sgd=sgd)
if sgd is not None:
sgd(model._mem.weights, model._mem.gradient, key=model.id)
return [d_o * model.mix[0] for d_o in d_output]
return output, fine_tune_bwd
def fine_tune_predict(docs_tokvecs):
docs, tokvecs = docs_tokvecs
vecs = embedding(docs)
return [model.mix[0]*tv+model.mix[1]*v
for tv, v in zip(tokvecs, vecs)]
model = wrap(fine_tune_fwd, embedding)
model.mix = model._mem.add((model.id, 'mix'), (2,))
model.mix.fill(0.5)
model.d_mix = model._mem.add_gradient((model.id, 'd_mix'), (model.id, 'mix'))
model.predict = fine_tune_predict
return model
@layerize
def flatten(seqs, drop=0.):
if isinstance(seqs[0], numpy.ndarray):
ops = NumpyOps()
elif hasattr(CupyOps.xp, 'ndarray') and isinstance(seqs[0], CupyOps.xp.ndarray):
ops = CupyOps()
else:
raise ValueError("Unable to flatten sequence of type %s" % type(seqs[0]))
lengths = [len(seq) for seq in seqs]
def finish_update(d_X, sgd=None):
return ops.unflatten(d_X, lengths)
X = ops.xp.vstack(seqs)
return X, finish_update
@layerize
def logistic(X, drop=0.):
xp = get_array_module(X)
@ -428,9 +351,11 @@ def logistic(X, drop=0.):
X = xp.minimum(X, 10., X)
X = xp.maximum(X, -10., X)
Y = 1. / (1. + xp.exp(-X))
def logistic_bwd(dY, sgd=None):
dX = dY * (Y * (1-Y))
return dX
return Y, logistic_bwd
@ -440,42 +365,47 @@ def zero_init(model):
model.on_data_hooks.append(_zero_init_impl)
return model
@layerize
def preprocess_doc(docs, drop=0.):
keys = [doc.to_array([LOWER]) for doc in docs]
keys = [a[:, 0] for a in keys]
ops = Model.ops
lengths = ops.asarray([arr.shape[0] for arr in keys])
keys = ops.xp.concatenate(keys)
vals = ops.allocate(keys.shape[0]) + 1
return (keys, vals, lengths), None
def getitem(i):
def getitem_fwd(X, drop=0.):
return X[i], None
return layerize(getitem_fwd)
def build_tagger_model(nr_class, token_vector_width, **cfg):
embed_size = util.env_opt('embed_size', 7500)
with Model.define_operators({'>>': chain, '+': add}):
# Input: (doc, tensor) tuples
private_tok2vec = Tok2Vec(token_vector_width, embed_size, preprocess=doc2feats())
def build_tagger_model(nr_class, **cfg):
embed_size = util.env_opt('embed_size', 7000)
if 'token_vector_width' in cfg:
token_vector_width = cfg['token_vector_width']
else:
token_vector_width = util.env_opt('token_vector_width', 128)
pretrained_dims = cfg.get('pretrained_dims', 0)
with Model.define_operators({'>>': chain, '+': add}):
if 'tok2vec' in cfg:
tok2vec = cfg['tok2vec']
else:
tok2vec = Tok2Vec(token_vector_width, embed_size,
pretrained_dims=pretrained_dims)
model = (
fine_tune(private_tok2vec)
>> with_flatten(
Maxout(token_vector_width, token_vector_width)
>> Softmax(nr_class, token_vector_width)
)
tok2vec
>> with_flatten(Softmax(nr_class, token_vector_width))
)
model.nI = None
model.tok2vec = tok2vec
return model
@layerize
def SpacyVectors(docs, drop=0.):
xp = get_array_module(docs[0].vocab.vectors.data)
width = docs[0].vocab.vectors.data.shape[1]
batch = []
for doc in docs:
indices = numpy.zeros((len(doc),), dtype='i')
@ -489,40 +419,16 @@ def SpacyVectors(docs, drop=0.):
return batch, None
def foreach(layer, drop_factor=1.0):
'''Map a layer across elements in a list'''
def foreach_fwd(Xs, drop=0.):
drop *= drop_factor
ys = []
backprops = []
for X in Xs:
y, bp_y = layer.begin_update(X, drop=drop)
ys.append(y)
backprops.append(bp_y)
def foreach_bwd(d_ys, sgd=None):
d_Xs = []
for d_y, bp_y in zip(d_ys, backprops):
if bp_y is not None and bp_y is not None:
d_Xs.append(d_y, sgd=sgd)
else:
d_Xs.append(None)
return d_Xs
return ys, foreach_bwd
model = wrap(foreach_fwd, layer)
return model
def build_text_classifier(nr_class, width=64, **cfg):
nr_vector = cfg.get('nr_vector', 5000)
pretrained_dims = cfg.get('pretrained_dims', 0)
with Model.define_operators({'>>': chain, '+': add, '|': concatenate,
'**': clone}):
if cfg.get('low_data'):
model = (
SpacyVectors
>> flatten_add_lengths
>> with_getitem(0,
Affine(width, 300)
)
>> with_getitem(0, Affine(width, pretrained_dims))
>> ParametricAttention(width)
>> Pooling(sum_pool)
>> Residual(ReLu(width, width)) ** 2
@ -531,7 +437,6 @@ def build_text_classifier(nr_class, width=64, **cfg):
)
return model
lower = HashEmbed(width, nr_vector, column=1)
prefix = HashEmbed(width//2, nr_vector, column=2)
suffix = HashEmbed(width//2, nr_vector, column=3)
@ -548,18 +453,24 @@ def build_text_classifier(nr_class, width=64, **cfg):
)
)
static_vectors = (
SpacyVectors
>> with_flatten(Affine(width, 300))
)
cnn_model = (
if pretrained_dims:
static_vectors = (
SpacyVectors
>> with_flatten(Affine(width, pretrained_dims))
)
# TODO Make concatenate support lists
concatenate_lists(trained_vectors, static_vectors)
vectors = concatenate_lists(trained_vectors, static_vectors)
vectors_width = width*2
else:
vectors = trained_vectors
vectors_width = width
static_vectors = None
cnn_model = (
vectors
>> with_flatten(
LN(Maxout(width, width*2))
LN(Maxout(width, vectors_width))
>> Residual(
(ExtractWindow(nW=1) >> zero_init(Maxout(width, width*3)))
(ExtractWindow(nW=1) >> LN(Maxout(width, width*3)))
) ** 2, pad=2
)
>> flatten_add_lengths
@ -579,39 +490,44 @@ def build_text_classifier(nr_class, width=64, **cfg):
>> zero_init(Affine(nr_class, nr_class*2, drop_factor=0.0))
>> logistic
)
model.nO = nr_class
model.lsuv = False
return model
@layerize
def flatten(seqs, drop=0.):
ops = Model.ops
lengths = ops.asarray([len(seq) for seq in seqs], dtype='i')
def finish_update(d_X, sgd=None):
return ops.unflatten(d_X, lengths, pad=0)
X = ops.flatten(seqs, pad=0)
return X, finish_update
def concatenate_lists(*layers, **kwargs): # pragma: no cover
'''Compose two or more models `f`, `g`, etc, such that their outputs are
def concatenate_lists(*layers, **kwargs): # pragma: no cover
"""Compose two or more models `f`, `g`, etc, such that their outputs are
concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))`
'''
"""
if not layers:
return noop()
drop_factor = kwargs.get('drop_factor', 1.0)
ops = layers[0].ops
layers = [chain(layer, flatten) for layer in layers]
concat = concatenate(*layers)
def concatenate_lists_fwd(Xs, drop=0.):
drop *= drop_factor
lengths = ops.asarray([len(X) for X in Xs], dtype='i')
flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop)
ys = ops.unflatten(flat_y, lengths)
def concatenate_lists_bwd(d_ys, sgd=None):
return bp_flat_y(ops.flatten(d_ys), sgd=sgd)
return ys, concatenate_lists_bwd
model = wrap(concatenate_lists_fwd, concat)
return model

View File

@ -3,15 +3,16 @@
# https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py
__title__ = 'spacy-nightly'
__version__ = '2.0.0a13'
__version__ = '2.0.0a18'
__summary__ = 'Industrial-strength Natural Language Processing (NLP) with Python and Cython'
__uri__ = 'https://spacy.io'
__author__ = 'Explosion AI'
__email__ = 'contact@explosion.ai'
__license__ = 'MIT'
__release__ = False
__docs_models__ = 'https://spacy.io/docs/usage/models'
__docs_models__ = 'https://alpha.spacy.io/usage/models'
__download_url__ = 'https://github.com/explosion/spacy-models/releases/download'
__compatibility__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json'
__shortcuts__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts.json'
__shortcuts__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts-nightly.json'
__model_files__ = 'https://raw.githubusercontent.com/explosion/spacy-dev-resources/develop/templates/model/'

View File

@ -1,5 +1,5 @@
# Reserve 64 values for flag features
cpdef enum attr_id_t:
cdef enum attr_id_t:
NULL_ATTR
IS_ALPHA
IS_ASCII

View File

@ -94,23 +94,19 @@ IDS = {
# ATTR IDs, in order of the symbol
NAMES = [key for key, value in sorted(IDS.items(), key=lambda item: item[1])]
locals().update(IDS)
def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
"""
Normalize a dictionary of attributes, converting them to ints.
Arguments:
stringy_attrs (dict):
Dictionary keyed by attribute string names. Values can be ints or strings.
strings_map (StringStore):
Defaults to None. If provided, encodes string values into ints.
Returns:
inty_attrs (dict):
Attributes dictionary with keys and optionally values converted to
ints.
stringy_attrs (dict): Dictionary keyed by attribute string names. Values
can be ints or strings.
strings_map (StringStore): Defaults to None. If provided, encodes string
values into ints.
RETURNS (dict): Attributes dictionary with keys and optionally values
converted to ints.
"""
inty_attrs = {}
if _do_deprecated:

View File

@ -1,33 +0,0 @@
from libc.stdio cimport fopen, fclose, fread, fwrite, FILE
from cymem.cymem cimport Pool
cdef class CFile:
cdef FILE* fp
cdef unsigned char* data
cdef int is_open
cdef Pool mem
cdef int size # For compatibility with subclass
cdef int i # For compatibility with subclass
cdef int _capacity # For compatibility with subclass
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1
cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *
cdef class StringCFile:
cdef unsigned char* data
cdef int is_open
cdef Pool mem
cdef int size # For compatibility with subclass
cdef int i # For compatibility with subclass
cdef int _capacity # For compatibility with subclass
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1
cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *

View File

@ -1,103 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
from libc.stdio cimport fopen, fclose, fread, fwrite
from libc.string cimport memcpy
cdef class CFile:
def __init__(self, loc, mode, on_open_error=None):
if isinstance(mode, unicode):
mode_str = mode.encode('ascii')
else:
mode_str = mode
if hasattr(loc, 'as_posix'):
loc = loc.as_posix()
self.mem = Pool()
cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
self.fp = fopen(<char*>bytes_loc, mode_str)
if self.fp == NULL:
if on_open_error is not None:
on_open_error()
else:
raise IOError("Could not open binary file %s" % bytes_loc)
self.is_open = True
def __dealloc__(self):
if self.is_open:
fclose(self.fp)
def close(self):
fclose(self.fp)
self.is_open = False
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1:
st = fread(dest, elem_size, number, self.fp)
if st != number:
raise IOError
cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1:
st = fwrite(src, elem_size, number, self.fp)
if st != number:
raise IOError
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *:
cdef void* dest = mem.alloc(number, elem_size)
self.read_into(dest, number, elem_size)
return dest
def write_unicode(self, unicode value):
cdef bytes py_bytes = value.encode('utf8')
cdef char* chars = <char*>py_bytes
self.write(sizeof(char), len(py_bytes), chars)
cdef class StringCFile:
def __init__(self, bytes data, mode, on_open_error=None):
self.mem = Pool()
self.is_open = 1 if 'w' in mode else 0
self._capacity = max(len(data), 8)
self.size = len(data)
self.i = 0
self.data = <unsigned char*>self.mem.alloc(1, self._capacity)
for i in range(len(data)):
self.data[i] = data[i]
def __dealloc__(self):
# Important to override this -- or
# we try to close a non-existant file pointer!
pass
def close(self):
self.is_open = False
def string_data(self):
cdef bytes byte_string = b'\0' * (self.size)
bytes_ptr = <char*>byte_string
for i in range(self.size):
bytes_ptr[i] = self.data[i]
print(byte_string)
return byte_string
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1:
if self.i+(number * elem_size) < self.size:
memcpy(dest, &self.data[self.i], elem_size * number)
self.i += elem_size * number
cdef int write_from(self, void* src, size_t elem_size, size_t number) except -1:
write_size = number * elem_size
if (self.size + write_size) >= self._capacity:
self._capacity = (self.size + write_size) * 2
self.data = <unsigned char*>self.mem.realloc(self.data, self._capacity)
memcpy(&self.data[self.size], src, write_size)
self.size += write_size
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *:
cdef void* dest = mem.alloc(number, elem_size)
self.read_into(dest, number, elem_size)
return dest
def write_unicode(self, unicode value):
cdef bytes py_bytes = value.encode('utf8')
cdef char* chars = <char*>py_bytes
self.write(sizeof(char), len(py_bytes), chars)

View File

@ -4,5 +4,8 @@ from .link import link
from .package import package
from .profile import profile
from .train import train
from .evaluate import evaluate
from .convert import convert
from .model import model
from .vocab import make_vocab as vocab
from .validate import validate

View File

@ -4,17 +4,17 @@ from __future__ import unicode_literals
import plac
from pathlib import Path
from .converters import conllu2json, iob2json
from .converters import conllu2json, iob2json, conll_ner2json
from ..util import prints
# Converters are matched by file extension. To add a converter, add a new entry
# to this dict with the file extension mapped to the converter function imported
# from /converters.
# Converters are matched by file extension. To add a converter, add a new
# entry to this dict with the file extension mapped to the converter function
# imported from /converters.
CONVERTERS = {
'.conllu': conllu2json,
'.conll': conllu2json,
'.iob': iob2json
'conllu': conllu2json,
'conll': conllu2json,
'ner': conll_ner2json,
'iob': iob2json,
}
@ -22,9 +22,10 @@ CONVERTERS = {
input_file=("input file", "positional", None, str),
output_dir=("output directory for converted file", "positional", None, str),
n_sents=("Number of sentences per doc", "option", "n", int),
morphology=("Enable appending morphology to tags", "flag", "m", bool)
)
def convert(cmd, input_file, output_dir, n_sents=1, morphology=False):
converter=("Name of converter (auto, iob, conllu or ner)", "option", "c", str),
morphology=("Enable appending morphology to tags", "flag", "m", bool))
def convert(cmd, input_file, output_dir, n_sents=1, morphology=False,
converter='auto'):
"""
Convert files into JSON format for use with train command and other
experiment management functions.
@ -35,9 +36,11 @@ def convert(cmd, input_file, output_dir, n_sents=1, morphology=False):
prints(input_path, title="Input file not found", exits=1)
if not output_path.exists():
prints(output_path, title="Output directory not found", exits=1)
file_ext = input_path.suffix
if not file_ext in CONVERTERS:
prints("Can't find converter for %s" % input_path.parts[-1],
title="Unknown format", exits=1)
CONVERTERS[file_ext](input_path, output_path,
n_sents=n_sents, use_morphology=morphology)
if converter == 'auto':
converter = input_path.suffix[1:]
if converter not in CONVERTERS:
prints("Can't find converter for %s" % converter,
title="Unknown format", exits=1)
func = CONVERTERS[converter]
func(input_path, output_path,
n_sents=n_sents, use_morphology=morphology)

View File

@ -1,2 +1,3 @@
from .conllu2json import conllu2json
from .iob2json import iob2json
from .conll_ner2json import conll_ner2json

View File

@ -0,0 +1,51 @@
# coding: utf8
from __future__ import unicode_literals
from ...compat import json_dumps, path2str
from ...util import prints
from ...gold import iob_to_biluo
def conll_ner2json(input_path, output_path, n_sents=10, use_morphology=False):
"""
Convert files in the CoNLL-2003 NER format into JSON format for use with
train cli.
"""
docs = read_conll_ner(input_path)
output_filename = input_path.parts[-1].replace(".conll", "") + ".json"
output_filename = input_path.parts[-1].replace(".conll", "") + ".json"
output_file = output_path / output_filename
with output_file.open('w', encoding='utf-8') as f:
f.write(json_dumps(docs))
prints("Created %d documents" % len(docs),
title="Generated output file %s" % path2str(output_file))
def read_conll_ner(input_path):
text = input_path.open('r', encoding='utf-8').read()
i = 0
delimit_docs = '-DOCSTART- -X- O O'
output_docs = []
for doc in text.strip().split(delimit_docs):
doc = doc.strip()
if not doc:
continue
output_doc = []
for sent in doc.split('\n\n'):
sent = sent.strip()
if not sent:
continue
lines = [line.strip() for line in sent.split('\n') if line.strip()]
words, tags, chunks, iob_ents = zip(*[line.split() for line in lines])
biluo_ents = iob_to_biluo(iob_ents)
output_doc.append({'tokens': [
{'orth': w, 'tag': tag, 'ner': ent} for (w, tag, ent) in
zip(words, tags, biluo_ents)
]})
output_docs.append({
'id': len(output_docs),
'paragraphs': [{'sentences': output_doc}]
})
output_doc = []
return output_docs

View File

@ -1,5 +1,6 @@
# coding: utf8
from __future__ import unicode_literals
from cytoolz import partition_all, concat
from ...compat import json_dumps, path2str
from ...util import prints
@ -10,11 +11,9 @@ def iob2json(input_path, output_path, n_sents=10, *a, **k):
"""
Convert IOB files into JSON format for use with train cli.
"""
# TODO: This isn't complete yet -- need to map from IOB to
# BILUO
with input_path.open('r', encoding='utf8') as file_:
docs = read_iob(file_)
sentences = read_iob(file_)
docs = merge_sentences(sentences, n_sents)
output_filename = input_path.parts[-1].replace(".iob", ".json")
output_file = output_path / output_filename
with output_file.open('w', encoding='utf-8') as f:
@ -23,9 +22,9 @@ def iob2json(input_path, output_path, n_sents=10, *a, **k):
title="Generated output file %s" % path2str(output_file))
def read_iob(file_):
def read_iob(raw_sents):
sentences = []
for line in file_:
for line in raw_sents:
if not line.strip():
continue
tokens = [t.split('|') for t in line.split()]
@ -43,3 +42,15 @@ def read_iob(file_):
paragraphs = [{'sentences': [sent]} for sent in sentences]
docs = [{'id': 0, 'paragraphs': [para]} for para in paragraphs]
return docs
def merge_sentences(docs, n_sents):
counter = 0
merged = []
for group in partition_all(n_sents, docs):
group = list(group)
first = group.pop(0)
to_extend = first['paragraphs'][0]['sentences']
for sent in group[1:]:
to_extend.extend(sent['paragraphs'][0]['sentences'])
merged.append(first)
return merged

View File

@ -13,10 +13,9 @@ from .. import about
@plac.annotations(
model=("model to download (shortcut or model name)", "positional", None, str),
model=("model to download, shortcut or name)", "positional", None, str),
direct=("force direct download. Needs model name with version and won't "
"perform compatibility check", "flag", "d", bool)
)
"perform compatibility check", "flag", "d", bool))
def download(cmd, model, direct=False):
"""
Download compatible model from default download path using pip. Model
@ -30,21 +29,25 @@ def download(cmd, model, direct=False):
model_name = shortcuts.get(model, model)
compatibility = get_compatibility()
version = get_version(model_name, compatibility)
dl = download_model('{m}-{v}/{m}-{v}.tar.gz'.format(m=model_name, v=version))
dl = download_model('{m}-{v}/{m}-{v}.tar.gz'.format(m=model_name,
v=version))
if dl == 0:
try:
# Get package path here because link uses
# pip.get_installed_distributions() to check if model is a package,
# which fails if model was just installed via subprocess
# pip.get_installed_distributions() to check if model is a
# package, which fails if model was just installed via
# subprocess
package_path = get_package_path(model_name)
link(None, model_name, model, force=True, model_path=package_path)
link(None, model_name, model, force=True,
model_path=package_path)
except:
# Dirty, but since spacy.download and the auto-linking is mostly
# a convenience wrapper, it's best to show a success message and
# loading instructions, even if linking fails.
prints("Creating a shortcut link for 'en' didn't work (maybe you "
"don't have admin permissions?), but you can still load "
"the model via its full package name:",
# Dirty, but since spacy.download and the auto-linking is
# mostly a convenience wrapper, it's best to show a success
# message and loading instructions, even if linking fails.
prints(
"Creating a shortcut link for 'en' didn't work (maybe "
"you don't have admin permissions?), but you can still "
"load the model via its full package name:",
"nlp = spacy.load('%s')" % model_name,
title="Download successful")
@ -52,9 +55,10 @@ def download(cmd, model, direct=False):
def get_json(url, desc):
r = requests.get(url)
if r.status_code != 200:
prints("Couldn't fetch %s. Please find a model for your spaCy installation "
"(v%s), and download it manually." % (desc, about.__version__),
about.__docs_models__, title="Server error (%d)" % r.status_code, exits=1)
msg = ("Couldn't fetch %s. Please find a model for your spaCy "
"installation (v%s), and download it manually.")
prints(msg % (desc, about.__version__), about.__docs_models__,
title="Server error (%d)" % r.status_code, exits=1)
return r.json()
@ -71,13 +75,13 @@ def get_compatibility():
def get_version(model, comp):
if model not in comp:
version = about.__version__
prints("No compatible model found for '%s' (spaCy v%s)." % (model, version),
title="Compatibility error", exits=1)
msg = "No compatible model found for '%s' (spaCy v%s)."
prints(msg % (model, version), title="Compatibility error", exits=1)
return comp[model][0]
def download_model(filename):
download_url = about.__download_url__ + '/' + filename
return subprocess.call([sys.executable, '-m',
'pip', 'install', '--no-cache-dir', download_url],
env=os.environ.copy())
return subprocess.call(
[sys.executable, '-m', 'pip', 'install', '--no-cache-dir',
download_url], env=os.environ.copy())

113
spacy/cli/evaluate.py Normal file
View File

@ -0,0 +1,113 @@
# coding: utf8
from __future__ import unicode_literals, division, print_function
import plac
from timeit import default_timer as timer
import random
import numpy.random
from ..gold import GoldCorpus
from ..util import prints
from .. import util
from .. import displacy
random.seed(0)
numpy.random.seed(0)
@plac.annotations(
model=("model name or path", "positional", None, str),
data_path=("location of JSON-formatted evaluation data", "positional",
None, str),
gold_preproc=("use gold preprocessing", "flag", "G", bool),
gpu_id=("use GPU", "option", "g", int),
displacy_path=("directory to output rendered parses as HTML", "option",
"dp", str),
displacy_limit=("limit of parses to render as HTML", "option", "dl", int))
def evaluate(cmd, model, data_path, gpu_id=-1, gold_preproc=False,
displacy_path=None, displacy_limit=25):
"""
Evaluate a model. To render a sample of parses in a HTML file, set an
output directory as the displacy_path argument.
"""
if gpu_id >= 0:
util.use_gpu(gpu_id)
util.set_env_log(False)
data_path = util.ensure_path(data_path)
displacy_path = util.ensure_path(displacy_path)
if not data_path.exists():
prints(data_path, title="Evaluation data not found", exits=1)
if displacy_path and not displacy_path.exists():
prints(displacy_path, title="Visualization output directory not found",
exits=1)
corpus = GoldCorpus(data_path, data_path)
nlp = util.load_model(model)
dev_docs = list(corpus.dev_docs(nlp, gold_preproc=gold_preproc))
begin = timer()
scorer = nlp.evaluate(dev_docs, verbose=False)
end = timer()
nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs)
print_results(scorer, time=end - begin, words=nwords,
wps=nwords / (end - begin))
if displacy_path:
docs, golds = zip(*dev_docs)
render_deps = 'parser' in nlp.meta.get('pipeline', [])
render_ents = 'ner' in nlp.meta.get('pipeline', [])
render_parses(docs, displacy_path, model_name=model,
limit=displacy_limit, deps=render_deps, ents=render_ents)
msg = "Generated %s parses as HTML" % displacy_limit
prints(displacy_path, title=msg)
def render_parses(docs, output_path, model_name='', limit=250, deps=True,
ents=True):
docs[0].user_data['title'] = model_name
if ents:
with (output_path / 'entities.html').open('w') as file_:
html = displacy.render(docs[:limit], style='ent', page=True)
file_.write(html)
if deps:
with (output_path / 'parses.html').open('w') as file_:
html = displacy.render(docs[:limit], style='dep', page=True,
options={'compact': True})
file_.write(html)
def print_progress(itn, losses, dev_scores, wps=0.0):
scores = {}
for col in ['dep_loss', 'tag_loss', 'uas', 'tags_acc', 'token_acc',
'ents_p', 'ents_r', 'ents_f', 'wps']:
scores[col] = 0.0
scores['dep_loss'] = losses.get('parser', 0.0)
scores['ner_loss'] = losses.get('ner', 0.0)
scores['tag_loss'] = losses.get('tagger', 0.0)
scores.update(dev_scores)
scores['wps'] = wps
tpl = '\t'.join((
'{:d}',
'{dep_loss:.3f}',
'{ner_loss:.3f}',
'{uas:.3f}',
'{ents_p:.3f}',
'{ents_r:.3f}',
'{ents_f:.3f}',
'{tags_acc:.3f}',
'{token_acc:.3f}',
'{wps:.1f}'))
print(tpl.format(itn, **scores))
def print_results(scorer, time, words, wps):
results = {
'Time': '%.2f s' % time,
'Words': words,
'Words/s': '%.0f' % wps,
'TOK': '%.2f' % scorer.token_acc,
'POS': '%.2f' % scorer.tags_acc,
'UAS': '%.2f' % scorer.uas,
'LAS': '%.2f' % scorer.las,
'NER P': '%.2f' % scorer.ents_p,
'NER R': '%.2f' % scorer.ents_r,
'NER F': '%.2f' % scorer.ents_f}
util.print_table(results, title="Results")

View File

@ -12,8 +12,7 @@ from .. import util
@plac.annotations(
model=("optional: shortcut link of model", "positional", None, str),
markdown=("generate Markdown for GitHub issues", "flag", "md", str)
)
markdown=("generate Markdown for GitHub issues", "flag", "md", str))
def info(cmd, model=None, markdown=False):
"""Print info about spaCy installation. If a model shortcut link is
speficied as an argument, print model information. Flag --markdown

View File

@ -12,8 +12,7 @@ from .. import util
@plac.annotations(
origin=("package name or local path to model", "positional", None, str),
link_name=("name of shortuct link to create", "positional", None, str),
force=("force overwriting of existing link", "flag", "f", bool)
)
force=("force overwriting of existing link", "flag", "f", bool))
def link(cmd, origin, link_name, force=False, model_path=None):
"""
Create a symlink for models within the spacy/data directory. Accepts
@ -27,6 +26,13 @@ def link(cmd, origin, link_name, force=False, model_path=None):
if not model_path.exists():
prints("The data should be located in %s" % path2str(model_path),
title="Can't locate model data", exits=1)
data_path = util.get_data_path()
if not data_path or not data_path.exists():
spacy_loc = Path(__file__).parent.parent
prints("Make sure a directory `/data` exists within your spaCy "
"installation and try again. The data directory should be "
"located here:", path2str(spacy_loc), exits=1,
title="Can't find the spaCy data path to create model symlink")
link_path = util.get_data_path() / link_name
if link_path.exists() and not force:
prints("To overwrite an existing link, use the --force flag.",
@ -39,8 +45,9 @@ def link(cmd, origin, link_name, force=False, model_path=None):
# This is quite dirty, but just making sure other errors are caught.
prints("Creating a symlink in spacy/data failed. Make sure you have "
"the required permissions and try re-running the command as "
"admin, or use a virtualenv. You can still import the model as a "
"module and call its load() method, or create the symlink manually.",
"admin, or use a virtualenv. You can still import the model as "
"a module and call its load() method, or create the symlink "
"manually.",
"%s --> %s" % (path2str(model_path), path2str(link_path)),
title="Error: Couldn't link model to '%s'" % link_name)
raise

View File

@ -1,8 +1,11 @@
# coding: utf8
from __future__ import unicode_literals
import bz2
import gzip
try:
import bz2
import gzip
except ImportError:
pass
import math
from ast import literal_eval
from pathlib import Path

View File

@ -16,10 +16,13 @@ from .. import about
input_dir=("directory with model data", "positional", None, str),
output_dir=("output parent directory", "positional", None, str),
meta_path=("path to meta.json", "option", "m", str),
create_meta=("create meta.json, even if one exists in directory", "flag", "c", bool),
force=("force overwriting of existing folder in output directory", "flag", "f", bool)
)
def package(cmd, input_dir, output_dir, meta_path=None, create_meta=False, force=False):
create_meta=("create meta.json, even if one exists in directory if "
"existing meta is found, entries are shown as defaults in "
"the command line prompt", "flag", "c", bool),
force=("force overwriting of existing model directory in output directory",
"flag", "f", bool))
def package(cmd, input_dir, output_dir, meta_path=None, create_meta=False,
force=False):
"""
Generate Python package for model data, including meta and required
installation files. A new directory will be created in the specified
@ -39,26 +42,28 @@ def package(cmd, input_dir, output_dir, meta_path=None, create_meta=False, force
template_manifest = get_template('MANIFEST.in')
template_init = get_template('xx_model_name/__init__.py')
meta_path = meta_path or input_path / 'meta.json'
if not create_meta and meta_path.is_file():
prints(meta_path, title="Reading meta.json from file")
if meta_path.is_file():
meta = util.read_json(meta_path)
else:
meta = generate_meta()
if not create_meta: # only print this if user doesn't want to overwrite
prints(meta_path, title="Loaded meta.json from file")
else:
meta = generate_meta(input_dir, meta)
meta = validate_meta(meta, ['lang', 'name', 'version'])
model_name = meta['lang'] + '_' + meta['name']
model_name_v = model_name + '-' + meta['version']
main_path = output_path / model_name_v
package_path = main_path / model_name
create_dirs(package_path, force)
shutil.copytree(path2str(input_path), path2str(package_path / model_name_v))
shutil.copytree(path2str(input_path),
path2str(package_path / model_name_v))
create_file(main_path / 'meta.json', json_dumps(meta))
create_file(main_path / 'setup.py', template_setup)
create_file(main_path / 'MANIFEST.in', template_manifest)
create_file(package_path / '__init__.py', template_init)
prints(main_path, "To build the package, run `python setup.py sdist` in this "
"directory.", title="Successfully created package '%s'" % model_name_v)
prints(main_path, "To build the package, run `python setup.py sdist` in "
"this directory.",
title="Successfully created package '%s'" % model_name_v)
def create_dirs(package_path, force):
@ -66,9 +71,10 @@ def create_dirs(package_path, force):
if force:
shutil.rmtree(path2str(package_path))
else:
prints(package_path, "Please delete the directory and try again, or "
"use the --force flag to overwrite existing directories.",
title="Package directory already exists", exits=1)
prints(package_path, "Please delete the directory and try again, "
"or use the --force flag to overwrite existing "
"directories.", title="Package directory already exists",
exits=1)
Path.mkdir(package_path, parents=True)
@ -77,38 +83,34 @@ def create_file(file_path, contents):
file_path.open('w', encoding='utf-8').write(contents)
def generate_meta():
settings = [('lang', 'Model language', 'en'),
('name', 'Model name', 'model'),
('version', 'Model version', '0.0.0'),
('spacy_version', 'Required spaCy version', '>=%s,<3.0.0' % about.__version__),
('description', 'Model description', False),
('author', 'Author', False),
('email', 'Author email', False),
('url', 'Author website', False),
('license', 'License', 'CC BY-NC 3.0')]
prints("Enter the package settings for your model.", title="Generating meta.json")
meta = {}
def generate_meta(model_path, existing_meta):
meta = existing_meta or {}
settings = [('lang', 'Model language', meta.get('lang', 'en')),
('name', 'Model name', meta.get('name', 'model')),
('version', 'Model version', meta.get('version', '0.0.0')),
('spacy_version', 'Required spaCy version',
'>=%s,<3.0.0' % about.__version__),
('description', 'Model description',
meta.get('description', False)),
('author', 'Author', meta.get('author', False)),
('email', 'Author email', meta.get('email', False)),
('url', 'Author website', meta.get('url', False)),
('license', 'License', meta.get('license', 'CC BY-SA 3.0'))]
nlp = util.load_model_from_path(Path(model_path))
meta['pipeline'] = nlp.pipe_names
meta['vectors'] = {'width': nlp.vocab.vectors_length,
'entries': len(nlp.vocab.vectors)}
prints("Enter the package settings for your model. The following "
"information will be read from your model data: pipeline, vectors.",
title="Generating meta.json")
for setting, desc, default in settings:
response = util.get_raw_input(desc, default)
meta[setting] = default if response == '' and default else response
meta['pipeline'] = generate_pipeline()
if about.__title__ != 'spacy':
meta['parent_package'] = about.__title__
return meta
def generate_pipeline():
prints("If set to 'True', the default pipeline is used. If set to 'False', "
"the pipeline will be disabled. Components should be specified as a "
"comma-separated list of component names, e.g. tensorizer, tagger, "
"parser, ner. For more information, see the docs on processing pipelines.",
title="Enter your model's pipeline components")
pipeline = util.get_raw_input("Pipeline components", True)
replace = {'True': True, 'False': False}
return replace[pipeline] if pipeline in replace else pipeline.split(', ')
def validate_meta(meta, keys):
for key in keys:
if key not in meta or meta[key] == '':

View File

@ -27,15 +27,15 @@ def read_inputs(loc):
@plac.annotations(
lang=("model/language", "positional", None, str),
inputs=("Location of input file", "positional", None, read_inputs)
)
inputs=("Location of input file", "positional", None, read_inputs))
def profile(cmd, lang, inputs=None):
"""
Profile a spaCy pipeline, to find out which functions take the most time.
"""
nlp = spacy.load(lang)
nlp = spacy.load(lang)
texts = list(cytoolz.take(10000, inputs))
cProfile.runctx("parse_texts(nlp, texts)", globals(), locals(), "Profile.prof")
cProfile.runctx("parse_texts(nlp, texts)", globals(), locals(),
"Profile.prof")
s = pstats.Stats("Profile.prof")
s.strip_dirs().sort_stats("time").print_stats()

View File

@ -2,42 +2,48 @@
from __future__ import unicode_literals, division, print_function
import plac
import json
from collections import defaultdict
import cytoolz
from pathlib import Path
import dill
import tqdm
from thinc.neural.optimizers import linear_decay
from thinc.neural._classes.model import Model
from timeit import default_timer as timer
import random
import numpy.random
from ..tokens.doc import Doc
from ..scorer import Scorer
from ..gold import GoldParse, merge_sents
from ..gold import GoldCorpus, minibatch
from ..util import prints
from .. import util
from .. import about
from .. import displacy
from ..compat import json_dumps
random.seed(0)
numpy.random.seed(0)
@plac.annotations(
lang=("model language", "positional", None, str),
output_dir=("output directory to store model in", "positional", None, str),
train_data=("location of JSON-formatted training data", "positional", None, str),
dev_data=("location of JSON-formatted development data (optional)", "positional", None, str),
train_data=("location of JSON-formatted training data", "positional",
None, str),
dev_data=("location of JSON-formatted development data (optional)",
"positional", None, str),
n_iter=("number of iterations", "option", "n", int),
n_sents=("number of sentences", "option", "ns", int),
use_gpu=("Use GPU", "option", "g", int),
resume=("Whether to resume training", "flag", "R", bool),
vectors=("Model to load vectors from", "option", "v"),
vectors_limit=("Truncate to N vectors (requires -v)", "option", None, int),
no_tagger=("Don't train tagger", "flag", "T", bool),
no_parser=("Don't train parser", "flag", "P", bool),
no_entities=("Don't train NER", "flag", "N", bool),
gold_preproc=("Use gold preprocessing", "flag", "G", bool),
)
def train(cmd, lang, output_dir, train_data, dev_data, n_iter=20, n_sents=0,
use_gpu=-1, resume=False, no_tagger=False, no_parser=False, no_entities=False,
gold_preproc=False):
version=("Model version", "option", "V", str),
meta_path=("Optional path to meta.json. All relevant properties will be "
"overwritten.", "option", "m", Path))
def train(cmd, lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
use_gpu=-1, vectors=None, vectors_limit=None, no_tagger=False,
no_parser=False, no_entities=False, gold_preproc=False,
version="0.0.0", meta_path=None):
"""
Train a model. Expects data in spaCy's JSON format.
"""
@ -46,19 +52,29 @@ def train(cmd, lang, output_dir, train_data, dev_data, n_iter=20, n_sents=0,
output_path = util.ensure_path(output_dir)
train_path = util.ensure_path(train_data)
dev_path = util.ensure_path(dev_data)
meta_path = util.ensure_path(meta_path)
if not output_path.exists():
output_path.mkdir()
if not train_path.exists():
prints(train_path, title="Training data not found", exits=1)
if dev_path and not dev_path.exists():
prints(dev_path, title="Development data not found", exits=1)
if meta_path is not None and not meta_path.exists():
prints(meta_path, title="meta.json not found", exits=1)
meta = util.read_json(meta_path) if meta_path else {}
if not isinstance(meta, dict):
prints("Expected dict but got: {}".format(type(meta)),
title="Not a valid meta.json format", exits=1)
meta.setdefault('lang', lang)
meta.setdefault('name', 'unnamed')
lang_class = util.get_lang_class(lang)
pipeline = ['token_vectors', 'tags', 'dependencies', 'entities']
if no_tagger and 'tags' in pipeline: pipeline.remove('tags')
if no_parser and 'dependencies' in pipeline: pipeline.remove('dependencies')
if no_entities and 'entities' in pipeline: pipeline.remove('entities')
pipeline = ['tagger', 'parser', 'ner']
if no_tagger and 'tagger' in pipeline:
pipeline.remove('tagger')
if no_parser and 'parser' in pipeline:
pipeline.remove('parser')
if no_entities and 'ner' in pipeline:
pipeline.remove('ner')
# Take dropout and batch size as generators of values -- dropout
# starts high and decays sharply, to force the optimizer to explore.
@ -68,55 +84,91 @@ def train(cmd, lang, output_dir, train_data, dev_data, n_iter=20, n_sents=0,
util.env_opt('dropout_to', 0.2),
util.env_opt('dropout_decay', 0.0))
batch_sizes = util.compounding(util.env_opt('batch_from', 1),
util.env_opt('batch_to', 64),
util.env_opt('batch_to', 16),
util.env_opt('batch_compound', 1.001))
if resume:
prints(output_path / 'model9.pickle', title="Resuming training")
nlp = dill.load((output_path / 'model9.pickle').open('rb'))
else:
nlp = lang_class(pipeline=pipeline)
corpus = GoldCorpus(train_path, dev_path, limit=n_sents)
n_train_words = corpus.count_train()
lang_class = util.get_lang_class(lang)
nlp = lang_class()
meta['pipeline'] = pipeline
nlp.meta.update(meta)
if vectors:
util.load_model(vectors, vocab=nlp.vocab)
if vectors_limit is not None:
nlp.vocab.prune_vectors(vectors_limit)
for name in pipeline:
nlp.add_pipe(nlp.create_pipe(name), name=name)
optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)
nlp._optimizer = None
print("Itn.\tLoss\tUAS\tNER P.\tNER R.\tNER F.\tTag %\tToken %")
print("Itn.\tP.Loss\tN.Loss\tUAS\tNER P.\tNER R.\tNER F.\tTag %\tToken %")
try:
train_docs = corpus.train_docs(nlp, projectivize=True, noise_level=0.0,
gold_preproc=gold_preproc, max_length=0)
train_docs = list(train_docs)
for i in range(n_iter):
if resume:
i += 20
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
train_docs = corpus.train_docs(nlp, projectivize=True, noise_level=0.0,
gold_preproc=gold_preproc, max_length=0)
losses = {}
for batch in minibatch(train_docs, size=batch_sizes):
docs, golds = zip(*batch)
nlp.update(docs, golds, sgd=optimizer,
drop=next(dropout_rates), losses=losses,
update_shared=True)
drop=next(dropout_rates), losses=losses)
pbar.update(sum(len(doc) for doc in docs))
with nlp.use_params(optimizer.averages):
util.set_env_log(False)
epoch_model_path = output_path / ('model%d' % i)
nlp.to_disk(epoch_model_path)
nlp_loaded = lang_class(pipeline=pipeline)
nlp_loaded = nlp_loaded.from_disk(epoch_model_path)
scorer = nlp_loaded.evaluate(
corpus.dev_docs(
nlp_loaded = util.load_model_from_path(epoch_model_path)
dev_docs = list(corpus.dev_docs(
nlp_loaded,
gold_preproc=gold_preproc))
acc_loc =(output_path / ('model%d' % i) / 'accuracy.json')
nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs)
start_time = timer()
scorer = nlp_loaded.evaluate(dev_docs)
end_time = timer()
if use_gpu < 0:
gpu_wps = None
cpu_wps = nwords/(end_time-start_time)
else:
gpu_wps = nwords/(end_time-start_time)
with Model.use_device('cpu'):
nlp_loaded = util.load_model_from_path(epoch_model_path)
dev_docs = list(corpus.dev_docs(
nlp_loaded, gold_preproc=gold_preproc))
start_time = timer()
scorer = nlp_loaded.evaluate(dev_docs)
end_time = timer()
cpu_wps = nwords/(end_time-start_time)
acc_loc = (output_path / ('model%d' % i) / 'accuracy.json')
with acc_loc.open('w') as file_:
file_.write(json_dumps(scorer.scores))
meta_loc = output_path / ('model%d' % i) / 'meta.json'
meta['accuracy'] = scorer.scores
meta['speed'] = {'nwords': nwords, 'cpu': cpu_wps,
'gpu': gpu_wps}
meta['vectors'] = {'width': nlp.vocab.vectors_length,
'entries': len(nlp.vocab.vectors)}
meta['lang'] = nlp.lang
meta['pipeline'] = pipeline
meta['spacy_version'] = '>=%s' % about.__version__
meta.setdefault('name', 'model%d' % i)
meta.setdefault('version', version)
with meta_loc.open('w') as file_:
file_.write(json_dumps(meta))
util.set_env_log(True)
print_progress(i, losses, scorer.scores)
print_progress(i, losses, scorer.scores, cpu_wps=cpu_wps,
gpu_wps=gpu_wps)
finally:
print("Saving model...")
with (output_path / 'model-final.pickle').open('wb') as file_:
with nlp.use_params(optimizer.averages):
dill.dump(nlp, file_, -1)
try:
with (output_path / 'model-final.pickle').open('wb') as file_:
with nlp.use_params(optimizer.averages):
dill.dump(nlp, file_, -1)
except:
print("Error saving model")
def _render_parses(i, to_render):
@ -129,25 +181,30 @@ def _render_parses(i, to_render):
file_.write(html)
def print_progress(itn, losses, dev_scores, wps=0.0):
def print_progress(itn, losses, dev_scores, cpu_wps=0.0, gpu_wps=0.0):
scores = {}
for col in ['dep_loss', 'tag_loss', 'uas', 'tags_acc', 'token_acc',
'ents_p', 'ents_r', 'ents_f', 'wps']:
'ents_p', 'ents_r', 'ents_f', 'cpu_wps', 'gpu_wps']:
scores[col] = 0.0
scores['dep_loss'] = losses.get('parser', 0.0)
scores['ner_loss'] = losses.get('ner', 0.0)
scores['tag_loss'] = losses.get('tagger', 0.0)
scores.update(dev_scores)
scores['wps'] = wps
scores['cpu_wps'] = cpu_wps
scores['gpu_wps'] = gpu_wps or 0.0
tpl = '\t'.join((
'{:d}',
'{dep_loss:.3f}',
'{ner_loss:.3f}',
'{uas:.3f}',
'{ents_p:.3f}',
'{ents_r:.3f}',
'{ents_f:.3f}',
'{tags_acc:.3f}',
'{token_acc:.3f}',
'{wps:.1f}'))
'{cpu_wps:.1f}',
'{gpu_wps:.1f}',
))
print(tpl.format(itn, **scores))

126
spacy/cli/validate.py Normal file
View File

@ -0,0 +1,126 @@
# coding: utf8
from __future__ import unicode_literals, print_function
import requests
import pkg_resources
from pathlib import Path
from ..compat import path2str, locale_escape
from ..util import prints, get_data_path, read_json
from .. import about
def validate(cmd):
"""Validate that the currently installed version of spaCy is compatible
with the installed models. Should be run after `pip install -U spacy`.
"""
r = requests.get(about.__compatibility__)
if r.status_code != 200:
prints("Couldn't fetch compatibility table.",
title="Server error (%d)" % r.status_code, exits=1)
compat = r.json()['spacy']
all_models = set()
for spacy_v, models in dict(compat).items():
all_models.update(models.keys())
for model, model_vs in models.items():
compat[spacy_v][model] = [reformat_version(v) for v in model_vs]
current_compat = compat[about.__version__]
model_links = get_model_links(current_compat)
model_pkgs = get_model_pkgs(current_compat, all_models)
incompat_links = {l for l, d in model_links.items() if not d['compat']}
incompat_models = {d['name'] for _, d in model_pkgs.items()
if not d['compat']}
incompat_models.update([d['name'] for _, d in model_links.items()
if not d['compat']])
na_models = [m for m in incompat_models if m not in current_compat]
update_models = [m for m in incompat_models if m in current_compat]
prints(path2str(Path(__file__).parent.parent),
title="Installed models (spaCy v{})".format(about.__version__))
if model_links or model_pkgs:
print(get_row('TYPE', 'NAME', 'MODEL', 'VERSION', ''))
for name, data in model_pkgs.items():
print(get_model_row(current_compat, name, data, 'package'))
for name, data in model_links.items():
print(get_model_row(current_compat, name, data, 'link'))
else:
prints("No models found in your current environment.", exits=0)
if update_models:
cmd = ' python -m spacy download {}'
print("\n Use the following commands to update the model packages:")
print('\n'.join([cmd.format(pkg) for pkg in update_models]))
if na_models:
prints("The following models are not available for spaCy v{}: {}"
.format(about.__version__, ', '.join(na_models)))
if incompat_links:
prints("You may also want to overwrite the incompatible links using "
"the `spacy link` command with `--force`, or remove them from "
"the data directory. Data path: {}"
.format(path2str(get_data_path())))
def get_model_links(compat):
links = {}
data_path = get_data_path()
if data_path:
models = [p for p in data_path.iterdir() if is_model_path(p)]
for model in models:
meta_path = Path(model) / 'meta.json'
if not meta_path.exists():
continue
meta = read_json(meta_path)
link = model.parts[-1]
name = meta['lang'] + '_' + meta['name']
links[link] = {'name': name, 'version': meta['version'],
'compat': is_compat(compat, name, meta['version'])}
return links
def get_model_pkgs(compat, all_models):
pkgs = {}
for pkg_name, pkg_data in pkg_resources.working_set.by_key.items():
package = pkg_name.replace('-', '_')
if package in all_models:
version = pkg_data.version
pkgs[pkg_name] = {'name': package, 'version': version,
'compat': is_compat(compat, package, version)}
return pkgs
def get_model_row(compat, name, data, type='package'):
tpl_red = '\x1b[38;5;1m{}\x1b[0m'
tpl_green = '\x1b[38;5;2m{}\x1b[0m'
if data['compat']:
comp = tpl_green.format(locale_escape('', errors='ignore'))
version = tpl_green.format(data['version'])
else:
comp = '--> {}'.format(compat.get(data['name'], ['n/a'])[0])
version = tpl_red.format(data['version'])
return get_row(type, name, data['name'], version, comp)
def get_row(*args):
tpl_row = ' {:<10}' + (' {:<20}' * 4)
return tpl_row.format(*args)
def is_model_path(model_path):
exclude = ['cache', 'pycache', '__pycache__']
name = model_path.parts[-1]
return (model_path.is_dir() and name not in exclude
and not name.startswith('.'))
def is_compat(compat, name, version):
return name in compat and version in compat[name]
def reformat_version(version):
"""Hack to reformat old versions ending on '-alpha' to match pip format."""
if version.endswith('-alpha'):
return version.replace('-alpha', 'a0')
return version.replace('-alpha', 'a')

54
spacy/cli/vocab.py Normal file
View File

@ -0,0 +1,54 @@
# coding: utf8
from __future__ import unicode_literals
import plac
import json
import spacy
import numpy
from pathlib import Path
from ..util import prints, ensure_path
@plac.annotations(
lang=("model language", "positional", None, str),
output_dir=("model output directory", "positional", None, Path),
lexemes_loc=("location of JSONL-formatted lexical data", "positional",
None, Path),
vectors_loc=("optional: location of vectors data, as numpy .npz",
"positional", None, str))
def make_vocab(cmd, lang, output_dir, lexemes_loc, vectors_loc=None):
"""Compile a vocabulary from a lexicon jsonl file and word vectors."""
if not lexemes_loc.exists():
prints(lexemes_loc, title="Can't find lexical data", exits=1)
vectors_loc = ensure_path(vectors_loc)
nlp = spacy.blank(lang)
for word in nlp.vocab:
word.rank = 0
lex_added = 0
vec_added = 0
with lexemes_loc.open() as file_:
for line in file_:
if line.strip():
attrs = json.loads(line)
if 'settings' in attrs:
nlp.vocab.cfg.update(attrs['settings'])
else:
lex = nlp.vocab[attrs['orth']]
lex.set_attrs(**attrs)
assert lex.rank == attrs['id']
lex_added += 1
if vectors_loc is not None:
vector_data = numpy.load(open(vectors_loc, 'rb'))
nlp.vocab.clear_vectors(width=vector_data.shape[1])
for word in nlp.vocab:
if word.rank:
nlp.vocab.vectors.add(word.orth_, row=word.rank,
vector=vector_data[word.rank])
vec_added += 1
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
prints("{} entries, {} vectors".format(lex_added, vec_added), output_dir,
title="Sucessfully compiled vocab and vectors, and saved model")
return nlp

View File

@ -6,6 +6,7 @@ import ftfy
import sys
import ujson
import itertools
import locale
from thinc.neural.util import copy_array
@ -29,6 +30,10 @@ try:
except ImportError:
cupy = None
try:
from thinc.neural.optimizers import Optimizer
except ImportError:
from thinc.neural.optimizers import Adam as Optimizer
pickle = pickle
copy_reg = copy_reg
@ -86,15 +91,15 @@ def symlink_to(orig, dest):
def is_config(python2=None, python3=None, windows=None, linux=None, osx=None):
return ((python2 == None or python2 == is_python2) and
(python3 == None or python3 == is_python3) and
(windows == None or windows == is_windows) and
(linux == None or linux == is_linux) and
(osx == None or osx == is_osx))
return ((python2 is None or python2 == is_python2) and
(python3 is None or python3 == is_python3) and
(windows is None or windows == is_windows) and
(linux is None or linux == is_linux) and
(osx is None or osx == is_osx))
def normalize_string_keys(old):
'''Given a dictionary, make sure keys are unicode strings, not bytes.'''
"""Given a dictionary, make sure keys are unicode strings, not bytes."""
new = {}
for key, value in old.items():
if isinstance(key, bytes_):
@ -113,3 +118,12 @@ def import_file(name, loc):
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
return module
def locale_escape(string, errors='replace'):
'''
Mangle non-supported characters, for savages with ascii terminals.
'''
encoding = locale.getpreferredencoding()
string = string.encode(encoding, errors).decode('utf8')
return string

View File

@ -24,7 +24,7 @@ def depr_model_download(lang):
def resolve_load_name(name, **overrides):
"""Resolve model loading if deprecated path kwarg is specified in overrides.
"""Resolve model loading if deprecated path kwarg in overrides.
name (unicode): Name of model to load.
**overrides: Overrides specified in spacy.load().
@ -32,8 +32,9 @@ def resolve_load_name(name, **overrides):
"""
if overrides.get('path') not in (None, False, True):
name = overrides.get('path')
prints("To load a model from a path, you can now use the first argument. "
"The model meta is used to load the required Language class.",
"OLD: spacy.load('en', path='/some/path')", "NEW: spacy.load('/some/path')",
prints("To load a model from a path, you can now use the first "
"argument. The model meta is used to load the Language class.",
"OLD: spacy.load('en', path='/some/path')",
"NEW: spacy.load('/some/path')",
title="Warning: deprecated argument 'path'")
return name

View File

@ -12,7 +12,7 @@ IS_JUPYTER = is_in_jupyter()
def render(docs, style='dep', page=False, minify=False, jupyter=IS_JUPYTER,
options={}, manual=False):
options={}, manual=False):
"""Render displaCy visualisation.
docs (list or Doc): Document(s) to visualise.
@ -21,7 +21,7 @@ def render(docs, style='dep', page=False, minify=False, jupyter=IS_JUPYTER,
minify (bool): Minify HTML markup.
jupyter (bool): Experimental, use Jupyter's `display()` to output markup.
options (dict): Visualiser-specific options, e.g. colors.
manual (bool): Don't parse `Doc` and instead, expect a dict or list of dicts.
manual (bool): Don't parse `Doc` and instead expect a dict/list of dicts.
RETURNS (unicode): Rendered HTML markup.
"""
factories = {'dep': (DependencyRenderer, parse_deps),
@ -35,7 +35,7 @@ def render(docs, style='dep', page=False, minify=False, jupyter=IS_JUPYTER,
parsed = [converter(doc, options) for doc in docs] if not manual else docs
_html['parsed'] = renderer.render(parsed, page=page, minify=minify).strip()
html = _html['parsed']
if jupyter: # return HTML rendered by IPython display()
if jupyter: # return HTML rendered by IPython display()
from IPython.core.display import display, HTML
return display(HTML(html))
return html
@ -50,13 +50,15 @@ def serve(docs, style='dep', page=True, minify=False, options={}, manual=False,
page (bool): Render markup as full HTML page.
minify (bool): Minify HTML markup.
options (dict): Visualiser-specific options, e.g. colors.
manual (bool): Don't parse `Doc` and instead, expect a dict or list of dicts.
manual (bool): Don't parse `Doc` and instead expect a dict/list of dicts.
port (int): Port to serve visualisation.
"""
from wsgiref import simple_server
render(docs, style=style, page=page, minify=minify, options=options, manual=manual)
render(docs, style=style, page=page, minify=minify, options=options,
manual=manual)
httpd = simple_server.make_server('0.0.0.0', port, app)
prints("Using the '%s' visualizer" % style, title="Serving on port %d..." % port)
prints("Using the '%s' visualizer" % style,
title="Serving on port %d..." % port)
try:
httpd.serve_forever()
except KeyboardInterrupt:
@ -67,7 +69,8 @@ def serve(docs, style='dep', page=True, minify=False, options={}, manual=False,
def app(environ, start_response):
# headers and status need to be bytes in Python 2, see #1227
headers = [(b_to_str(b'Content-type'), b_to_str(b'text/html; charset=utf-8'))]
headers = [(b_to_str(b'Content-type'),
b_to_str(b'text/html; charset=utf-8'))]
start_response(b_to_str(b'200 OK'), headers)
res = _html['parsed'].encode(encoding='utf-8')
return [res]
@ -89,9 +92,9 @@ def parse_deps(orig_doc, options={}):
end = word.i + 1
while end < len(doc) and doc[end].is_punct:
end += 1
span = doc[start : end]
span = doc[start:end]
spans.append((span.start_char, span.end_char, word.tag_,
word.lemma_, word.ent_type_))
word.lemma_, word.ent_type_))
for span_props in spans:
doc.merge(*span_props)
words = [{'text': w.text, 'tag': w.tag_} for w in doc]
@ -113,6 +116,7 @@ def parse_ents(doc, options={}):
RETURNS (dict): Generated entities keyed by text (original text) and ents.
"""
ents = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}
for ent in doc.ents]
title = doc.user_data.get('title', None) if hasattr(doc, 'user_data') else None
for ent in doc.ents]
title = (doc.user_data.get('title', None)
if hasattr(doc, 'user_data') else None)
return {'text': doc.text, 'ents': ents, 'title': title}

View File

@ -14,13 +14,15 @@ class DependencyRenderer(object):
"""Initialise dependency renderer.
options (dict): Visualiser-specific options (compact, word_spacing,
arrow_spacing, arrow_width, arrow_stroke, distance,
offset_x, color, bg, font)
arrow_spacing, arrow_width, arrow_stroke, distance, offset_x,
color, bg, font)
"""
self.compact = options.get('compact', False)
self.word_spacing = options.get('word_spacing', 45)
self.arrow_spacing = options.get('arrow_spacing', 12 if self.compact else 20)
self.arrow_width = options.get('arrow_width', 6 if self.compact else 10)
self.arrow_spacing = options.get('arrow_spacing',
12 if self.compact else 20)
self.arrow_width = options.get('arrow_width',
6 if self.compact else 10)
self.arrow_stroke = options.get('arrow_stroke', 2)
self.distance = options.get('distance', 150 if self.compact else 175)
self.offset_x = options.get('offset_x', 50)
@ -39,7 +41,8 @@ class DependencyRenderer(object):
rendered = [self.render_svg(i, p['words'], p['arcs'])
for i, p in enumerate(parsed)]
if page:
content = ''.join([TPL_FIGURE.format(content=svg) for svg in rendered])
content = ''.join([TPL_FIGURE.format(content=svg)
for svg in rendered])
markup = TPL_PAGE.format(content=content)
else:
markup = ''.join(rendered)
@ -63,12 +66,13 @@ class DependencyRenderer(object):
self.id = render_id
words = [self.render_word(w['text'], w['tag'], i)
for i, w in enumerate(words)]
arcs = [self.render_arrow(a['label'], a['start'], a['end'], a['dir'], i)
arcs = [self.render_arrow(a['label'], a['start'],
a['end'], a['dir'], i)
for i, a in enumerate(arcs)]
content = ''.join(words) + ''.join(arcs)
return TPL_DEP_SVG.format(id=self.id, width=self.width, height=self.height,
color=self.color, bg=self.bg, font=self.font,
content=content)
return TPL_DEP_SVG.format(id=self.id, width=self.width,
height=self.height, color=self.color,
bg=self.bg, font=self.font, content=content)
def render_word(self, text, tag, i):
"""Render individual word.
@ -96,7 +100,7 @@ class DependencyRenderer(object):
x_start = self.offset_x+start*self.distance+self.arrow_spacing
y = self.offset_y
x_end = (self.offset_x+(end-start)*self.distance+start*self.distance
-self.arrow_spacing*(self.highest_level-level)/4)
- self.arrow_spacing*(self.highest_level-level)/4)
y_curve = self.offset_y-level*self.distance/2
if self.compact:
y_curve = self.offset_y-level*self.distance/6
@ -133,8 +137,10 @@ class DependencyRenderer(object):
if direction is 'left':
pos1, pos2, pos3 = (x, x-self.arrow_width+2, x+self.arrow_width-2)
else:
pos1, pos2, pos3 = (end, end+self.arrow_width-2, end-self.arrow_width+2)
arrowhead = (pos1, y+2, pos2, y-self.arrow_width, pos3, y-self.arrow_width)
pos1, pos2, pos3 = (end, end+self.arrow_width-2,
end-self.arrow_width+2)
arrowhead = (pos1, y+2, pos2, y-self.arrow_width, pos3,
y-self.arrow_width)
return "M{},{} L{},{} {},{}".format(*arrowhead)
def get_levels(self, arcs):
@ -159,9 +165,10 @@ class EntityRenderer(object):
"""
colors = {'ORG': '#7aecec', 'PRODUCT': '#bfeeb7', 'GPE': '#feca74',
'LOC': '#ff9561', 'PERSON': '#aa9cfc', 'NORP': '#c887fb',
'FACILITY': '#9cc9cc', 'EVENT': '#ffeb80', 'LANGUAGE': '#ff8197',
'WORK_OF_ART': '#f0d0ff', 'DATE': '#bfe1d9', 'TIME': '#bfe1d9',
'MONEY': '#e4e7d2', 'QUANTITY': '#e4e7d2', 'ORDINAL': '#e4e7d2',
'FACILITY': '#9cc9cc', 'EVENT': '#ffeb80', 'LAW': '#ff8197',
'LANGUAGE': '#ff8197', 'WORK_OF_ART': '#f0d0ff',
'DATE': '#bfe1d9', 'TIME': '#bfe1d9', 'MONEY': '#e4e7d2',
'QUANTITY': '#e4e7d2', 'ORDINAL': '#e4e7d2',
'CARDINAL': '#e4e7d2', 'PERCENT': '#e4e7d2'}
colors.update(options.get('colors', {}))
self.default_color = '#ddd'
@ -176,9 +183,11 @@ class EntityRenderer(object):
minify (bool): Minify HTML markup.
RETURNS (unicode): Rendered HTML markup.
"""
rendered = [self.render_ents(p['text'], p['ents'], p.get('title', None)) for p in parsed]
rendered = [self.render_ents(p['text'], p['ents'],
p.get('title', None)) for p in parsed]
if page:
docs = ''.join([TPL_FIGURE.format(content=doc) for doc in rendered])
docs = ''.join([TPL_FIGURE.format(content=doc)
for doc in rendered])
markup = TPL_PAGE.format(content=docs)
else:
markup = ''.join(rendered)

View File

@ -3,6 +3,16 @@ from __future__ import unicode_literals
def explain(term):
"""Get a description for a given POS tag, dependency label or entity type.
term (unicode): The term to explain.
RETURNS (unicode): The explanation, or `None` if not found in the glossary.
EXAMPLE:
>>> spacy.explain(u'NORP')
>>> doc = nlp(u'Hello world')
>>> print([w.text, w.tag_, spacy.explain(w.tag_) for w in doc])
"""
if term in GLOSSARY:
return GLOSSARY[term]
@ -254,7 +264,6 @@ GLOSSARY = {
'nk': 'noun kernel element',
'nmc': 'numerical component',
'oa': 'accusative object',
'oa': 'second accusative object',
'oc': 'clausal object',
'og': 'genitive object',
'op': 'prepositional object',
@ -283,6 +292,7 @@ GLOSSARY = {
'PRODUCT': 'Objects, vehicles, foods, etc. (not services)',
'EVENT': 'Named hurricanes, battles, wars, sports events, etc.',
'WORK_OF_ART': 'Titles of books, songs, etc.',
'LAW': 'Named documents made into laws.',
'LANGUAGE': 'Any named language',
'DATE': 'Absolute or relative dates or periods',
'TIME': 'Times smaller than a day',
@ -290,5 +300,15 @@ GLOSSARY = {
'MONEY': 'Monetary values, including unit',
'QUANTITY': 'Measurements, as of weight or distance',
'ORDINAL': '"first", "second", etc.',
'CARDINAL': 'Numerals that do not fall under another type'
'CARDINAL': 'Numerals that do not fall under another type',
# Named Entity Recognition
# Wikipedia
# http://www.sciencedirect.com/science/article/pii/S0004370212000276
# https://pdfs.semanticscholar.org/5744/578cc243d92287f47448870bb426c66cc941.pdf
'PER': 'Named person or family.',
'MISC': ('Miscellaneous entities, e.g. events, nationalities, '
'products or works of art'),
}

View File

@ -2,16 +2,15 @@
# coding: utf8
from __future__ import unicode_literals, print_function
import io
import re
import ujson
import random
import cytoolz
import itertools
from .syntax import nonproj
from .util import ensure_path
from . import util
from .tokens import Doc
from . import util
def tags_to_entities(tags):
@ -53,7 +52,8 @@ def merge_sents(sents):
m_deps[3].extend(head + i for head in heads)
m_deps[4].extend(labels)
m_deps[5].extend(ner)
m_brackets.extend((b['first'] + i, b['last'] + i, b['label']) for b in brackets)
m_brackets.extend((b['first'] + i, b['last'] + i, b['label'])
for b in brackets)
i += len(ids)
return [(m_deps, m_brackets)]
@ -79,6 +79,8 @@ def align(cand_words, gold_words):
punct_re = re.compile(r'\W')
def _min_edit_path(cand_words, gold_words):
cdef:
Pool mem
@ -97,9 +99,9 @@ def _min_edit_path(cand_words, gold_words):
mem = Pool()
n_cand = len(cand_words)
n_gold = len(gold_words)
# Levenshtein distance, except we need the history, and we may want different
# costs.
# Mark operations with a string, and score the history using _edit_cost.
# Levenshtein distance, except we need the history, and we may want
# different costs. Mark operations with a string, and score the history
# using _edit_cost.
previous_row = []
prev_costs = <int*>mem.alloc(n_gold + 1, sizeof(int))
curr_costs = <int*>mem.alloc(n_gold + 1, sizeof(int))
@ -143,12 +145,16 @@ def _min_edit_path(cand_words, gold_words):
def minibatch(items, size=8):
'''Iterate over batches of items. `size` may be an iterator,
"""Iterate over batches of items. `size` may be an iterator,
so that batch-size can vary on each step.
'''
"""
if isinstance(size, int):
size_ = itertools.repeat(8)
else:
size_ = size
items = iter(items)
while True:
batch_size = next(size) #if hasattr(size, '__next__') else size
batch_size = next(size_)
batch = list(cytoolz.take(int(batch_size), items))
if len(batch) == 0:
break
@ -163,6 +169,7 @@ class GoldCorpus(object):
train_path (unicode or Path): File or directory of training data.
dev_path (unicode or Path): File or directory of development data.
RETURNS (GoldCorpus): The newly created object.
"""
self.train_path = util.ensure_path(train_path)
self.dev_path = util.ensure_path(dev_path)
@ -208,7 +215,7 @@ class GoldCorpus(object):
train_tuples = self.train_tuples
if projectivize:
train_tuples = nonproj.preprocess_training_data(
self.train_tuples)
self.train_tuples, label_freq_cutoff=100)
random.shuffle(train_tuples)
gold_docs = self.iter_gold_docs(nlp, train_tuples, gold_preproc,
max_length=max_length,
@ -217,7 +224,6 @@ class GoldCorpus(object):
def dev_docs(self, nlp, gold_preproc=False):
gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc)
#gold_docs = nlp.preprocess_gold(gold_docs)
yield from gold_docs
@classmethod
@ -228,7 +234,6 @@ class GoldCorpus(object):
raw_text = None
else:
paragraph_tuples = merge_sents(paragraph_tuples)
docs = cls._make_docs(nlp, raw_text, paragraph_tuples,
gold_preproc, noise_level=noise_level)
golds = cls._make_golds(docs, paragraph_tuples)
@ -243,17 +248,20 @@ class GoldCorpus(object):
raw_text = add_noise(raw_text, noise_level)
return [nlp.make_doc(raw_text)]
else:
return [Doc(nlp.vocab, words=add_noise(sent_tuples[1], noise_level))
for (sent_tuples, brackets) in paragraph_tuples]
return [Doc(nlp.vocab,
words=add_noise(sent_tuples[1], noise_level))
for (sent_tuples, brackets) in paragraph_tuples]
@classmethod
def _make_golds(cls, docs, paragraph_tuples):
assert len(docs) == len(paragraph_tuples)
if len(docs) == 1:
return [GoldParse.from_annot_tuples(docs[0], paragraph_tuples[0][0])]
return [GoldParse.from_annot_tuples(docs[0],
paragraph_tuples[0][0])]
else:
return [GoldParse.from_annot_tuples(doc, sent_tuples)
for doc, (sent_tuples, brackets) in zip(docs, paragraph_tuples)]
for doc, (sent_tuples, brackets)
in zip(docs, paragraph_tuples)]
@staticmethod
def walk_corpus(path):
@ -300,7 +308,7 @@ def _corrupt(c, noise_level):
def read_json_file(loc, docs_filter=None, limit=None):
loc = ensure_path(loc)
loc = util.ensure_path(loc)
if loc.is_dir():
for filename in loc.iterdir():
yield from read_json_file(loc / filename, limit=limit)
@ -325,16 +333,16 @@ def read_json_file(loc, docs_filter=None, limit=None):
for i, token in enumerate(sent['tokens']):
words.append(token['orth'])
ids.append(i)
tags.append(token.get('tag','-'))
heads.append(token.get('head',0) + i)
labels.append(token.get('dep',''))
tags.append(token.get('tag', '-'))
heads.append(token.get('head', 0) + i)
labels.append(token.get('dep', ''))
# Ensure ROOT label is case-insensitive
if labels[-1].lower() == 'root':
labels[-1] = 'ROOT'
ner.append(token.get('ner', '-'))
sents.append([
[ids, words, tags, heads, labels, ner],
sent.get('brackets', [])])
sent.get('brackets', [])])
if sents:
yield [paragraph.get('raw', None), sents]
@ -377,28 +385,34 @@ cdef class GoldParse:
@classmethod
def from_annot_tuples(cls, doc, annot_tuples, make_projective=False):
_, words, tags, heads, deps, entities = annot_tuples
return cls(doc, words=words, tags=tags, heads=heads, deps=deps, entities=entities,
make_projective=make_projective)
return cls(doc, words=words, tags=tags, heads=heads, deps=deps,
entities=entities, make_projective=make_projective)
def __init__(self, doc, annot_tuples=None, words=None, tags=None, heads=None,
deps=None, entities=None, make_projective=False,
cats=tuple()):
def __init__(self, doc, annot_tuples=None, words=None, tags=None,
heads=None, deps=None, entities=None, make_projective=False,
cats=None):
"""Create a GoldParse.
doc (Doc): The document the annotations refer to.
words (iterable): A sequence of unicode word strings.
tags (iterable): A sequence of strings, representing tag annotations.
heads (iterable): A sequence of integers, representing syntactic head offsets.
deps (iterable): A sequence of strings, representing the syntactic relation types.
heads (iterable): A sequence of integers, representing syntactic
head offsets.
deps (iterable): A sequence of strings, representing the syntactic
relation types.
entities (iterable): A sequence of named entity annotations, either as
BILUO tag strings, or as `(start_char, end_char, label)` tuples,
representing the entity positions.
cats (iterable): A sequence of labels for text classification. Each
label may be a string or an int, or a `(start_char, end_char, label)`
cats (dict): Labels for text classification. Each key in the dictionary
may be a string or an int, or a `(start_char, end_char, label)`
tuple, indicating that the label is applied to only part of the
document (usually a sentence). Unlike entity annotations, label
annotations can overlap, i.e. a single word can be covered by
multiple labelled spans.
multiple labelled spans. The TextCategorizer component expects
true examples of a label to have the value 1.0, and negative
examples of a label to have the value 0.0. Labels not in the
dictionary are treated as missing - the gradient for those labels
will be zero.
RETURNS (GoldParse): The newly constructed object.
"""
if words is None:
@ -429,7 +443,7 @@ cdef class GoldParse:
self.c.sent_start = <int*>self.mem.alloc(len(doc), sizeof(int))
self.c.ner = <Transition*>self.mem.alloc(len(doc), sizeof(Transition))
self.cats = list(cats)
self.cats = {} if cats is None else dict(cats)
self.words = [None] * len(doc)
self.tags = [None] * len(doc)
self.heads = [None] * len(doc)
@ -462,11 +476,11 @@ cdef class GoldParse:
self.ner[i] = entities[gold_i]
cycle = nonproj.contains_cycle(self.heads)
if cycle != None:
if cycle is not None:
raise Exception("Cycle found: %s" % cycle)
if make_projective:
proj_heads,_ = nonproj.projectivize(self.heads, self.labels)
proj_heads, _ = nonproj.projectivize(self.heads, self.labels)
self.heads = proj_heads
def __len__(self):
@ -489,20 +503,19 @@ cdef class GoldParse:
def biluo_tags_from_offsets(doc, entities, missing='O'):
"""Encode labelled spans into per-token tags, using the Begin/In/Last/Unit/Out
scheme (BILUO).
"""Encode labelled spans into per-token tags, using the
Begin/In/Last/Unit/Out scheme (BILUO).
doc (Doc): The document that the entity offsets refer to. The output tags
will refer to the token boundaries within the document.
entities (iterable): A sequence of `(start, end, label)` triples. `start` and
`end` should be character-offset integers denoting the slice into the
original string.
entities (iterable): A sequence of `(start, end, label)` triples. `start`
and `end` should be character-offset integers denoting the slice into
the original string.
RETURNS (list): A list of unicode strings, describing the tags. Each tag
string will be of the form either "", "O" or "{action}-{label}", where
action is one of "B", "I", "L", "U". The string "-" is used where the
entity offsets don't align with the tokenization in the `Doc` object. The
training algorithm will view these as missing values. "O" denotes a
entity offsets don't align with the tokenization in the `Doc` object.
The training algorithm will view these as missing values. "O" denotes a
non-entity token. "B" denotes the beginning of a multi-token entity,
"I" the inside of an entity of three or more tokens, and "L" the end
of an entity of two or more tokens. "U" denotes a single-token entity.

View File

@ -16,15 +16,13 @@ from ...util import update_exc
class BengaliDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'bn'
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
tag_map = TAG_MAP
stop_words = STOP_WORDS
lemma_rules = LEMMA_RULES
prefixes = tuple(TOKENIZER_PREFIXES)
suffixes = tuple(TOKENIZER_SUFFIXES)
infixes = tuple(TOKENIZER_INFIXES)
prefixes = TOKENIZER_PREFIXES
suffixes = TOKENIZER_SUFFIXES
infixes = TOKENIZER_INFIXES
class Bengali(Language):

View File

@ -12,11 +12,11 @@ MORPH_RULES = {
'কি': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Gender': 'Neut', 'PronType': 'Int', 'Case': 'Acc'},
'সে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Three', 'PronType': 'Prs', 'Case': 'Nom'},
'কিসে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Gender': 'Neut', 'PronType': 'Int', 'Case': 'Acc'},
'কাদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'PronType': 'Int', 'Case': 'Acc'},
'তাকে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Three', 'PronType': 'Prs', 'Case': 'Acc'},
'স্বয়ং': {LEMMA: PRON_LEMMA, 'Reflex': 'Yes', 'PronType': 'Ref'},
'কোনগুলো': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Gender': 'Neut', 'PronType': 'Int', 'Case': 'Acc'},
'তুমি': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Nom'},
'তুই': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Nom'},
'তাদেরকে': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Three', 'PronType': 'Prs', 'Case': 'Acc'},
'আমরা': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'One ', 'PronType': 'Prs', 'Case': 'Nom'},
'যিনি': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'PronType': 'Rel', 'Case': 'Nom'},
@ -24,12 +24,15 @@ MORPH_RULES = {
'কোন': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'PronType': 'Int', 'Case': 'Acc'},
'কারা': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'PronType': 'Int', 'Case': 'Acc'},
'তোমাকে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Acc'},
'তোকে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Acc'},
'খোদ': {LEMMA: PRON_LEMMA, 'Reflex': 'Yes', 'PronType': 'Ref'},
'কে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'PronType': 'Int', 'Case': 'Acc'},
'যারা': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'PronType': 'Rel', 'Case': 'Nom'},
'যে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'PronType': 'Rel', 'Case': 'Nom'},
'তোমরা': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Nom'},
'তোরা': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Nom'},
'তোমাদেরকে': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Acc'},
'তোদেরকে': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Acc'},
'আপন': {LEMMA: PRON_LEMMA, 'Reflex': 'Yes', 'PronType': 'Ref'},
'': {LEMMA: PRON_LEMMA, 'PronType': 'Dem'},
'নিজ': {LEMMA: PRON_LEMMA, 'Reflex': 'Yes', 'PronType': 'Ref'},
@ -42,6 +45,10 @@ MORPH_RULES = {
'আমার': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'One', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'মোর': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'One', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'মোদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'One', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'তার': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Three', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'তোমাদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Two', 'PronType': 'Prs', 'Poss': 'Yes',
@ -50,7 +57,13 @@ MORPH_RULES = {
'Case': 'Nom'},
'তোমার': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Two', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'তোর': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Two', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'তাদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Three', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'কাদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'PronType': 'Int', 'Case': 'Acc'},
'তোদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Two', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'যাদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'PronType': 'Int', 'Case': 'Acc'},
}
}

View File

@ -22,7 +22,7 @@ STOP_WORDS = set("""
ি
ি
তখন তত তথ তব তব রপর রই হল িনই
িি িি ি মন
িি িি ি মন
কব কব
ি ি ি ি ি ি ি ি ওয ওয খত
ি ি ওয় ওয় ি
@ -32,7 +32,7 @@ STOP_WORDS = set("""
ফল ি
বছর বদল বর বলত বলল বলল বল বল বল বল বস বহ ি িি ি িষযি যবহ বকতব বন ি
মত মত মত মধযভ মধ মধ মধ মন যম
মত মত মত মধযভ মধ মধ মধ মন যম
যখন যত যতট যথ যদি যদি ওয ওয িি
মন
রকম রয রয়

View File

@ -29,11 +29,19 @@ _units = ('km km² km³ m m² m³ dm dm² dm³ cm cm² cm³ mm mm² mm³ ha µm
'kg g mg µg t lb oz m/s km/h kmh mph hPa Pa mbar mb MB kb KB gb GB tb '
'TB T G M K %')
_currency = r'\$ £ € ¥ ฿ US\$ C\$ A\$'
_punct = r'… , : ; \! \? ¿ ¡ \( \) \[ \] \{ \} < > _ # \* &'
_quotes = r'\' \'\' " ” “ `` ` ´ , „ » «'
_hyphens = '- — -- ---'
# These expressions contain various unicode variations, including characters
# used in Chinese (see #1333, #1340, #1351) unless there are cross-language
# conflicts, spaCy's base tokenizer should handle all of those by default
_punct = r'… …… , : ; \! \? ¿ ¡ \( \) \[ \] \{ \} < > _ # \* & 。 · ।'
_quotes = r'\' \'\' " ” “ `` ` ´ , „ » « 「 」 『 』 【 】 《 》 〈 〉'
_hyphens = '- — -- --- —— ~'
# Various symbols like dingbats, but also emoji
# Details: https://www.compart.com/en/unicode/category/So
_other_symbols = r'[\p{So}]'
UNITS = merge_chars(_units)
CURRENCY = merge_chars(_currency)
QUOTES = merge_chars(_quotes)

View File

@ -3,6 +3,9 @@ from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .morph_rules import MORPH_RULES
from ..tag_map import TAG_MAP
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
@ -13,11 +16,13 @@ from ...util import update_exc, add_lookups
class DanishDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: 'da'
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = set(STOP_WORDS)
# morph_rules = MORPH_RULES
tag_map = TAG_MAP
stop_words = STOP_WORDS
class Danish(Language):

View File

@ -0,0 +1,52 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
# Source http://fjern-uv.dk/tal.php
_num_words = """nul
en et to tre fire fem seks syv otte ni ti
elleve tolv tretten fjorten femten seksten sytten atten nitten tyve
enogtyve toogtyve treogtyve fireogtyve femogtyve seksogtyve syvogtyve otteogtyve niogtyve tredive
enogtredive toogtredive treogtredive fireogtredive femogtredive seksogtredive syvogtredive otteogtredive niogtredive fyrre
enogfyrre toogfyrre treogfyrre fireogfyrre femgogfyrre seksogfyrre syvogfyrre otteogfyrre niogfyrre halvtreds
enoghalvtreds tooghalvtreds treoghalvtreds fireoghalvtreds femoghalvtreds seksoghalvtreds syvoghalvtreds otteoghalvtreds nioghalvtreds tres
enogtres toogtres treogtres fireogtres femogtres seksogtres syvogtres otteogtres niogtres halvfjerds
enoghalvfjerds tooghalvfjerds treoghalvfjerds fireoghalvfjerds femoghalvfjerds seksoghalvfjerds syvoghalvfjerds otteoghalvfjerds nioghalvfjerds firs
enogfirs toogfirs treogfirs fireogfirs femogfirs seksogfirs syvogfirs otteogfirs niogfirs halvfems
enoghalvfems tooghalvfems treoghalvfems fireoghalvfems femoghalvfems seksoghalvfems syvoghalvfems otteoghalvfems nioghalvfems hundrede
million milliard billion billiard trillion trilliard
""".split()
# source http://www.duda.dk/video/dansk/grammatik/talord/talord.html
_ordinal_words = """nulte
første anden tredje fjerde femte sjette syvende ottende niende tiende
elfte tolvte trettende fjortende femtende sekstende syttende attende nittende tyvende
enogtyvende toogtyvende treogtyvende fireogtyvende femogtyvende seksogtyvende syvogtyvende otteogtyvende niogtyvende tredivte enogtredivte toogtredivte treogtredivte fireogtredivte femogtredivte seksogtredivte syvogtredivte otteogtredivte niogtredivte fyrretyvende
enogfyrretyvende toogfyrretyvende treogfyrretyvende fireogfyrretyvende femogfyrretyvende seksogfyrretyvende syvogfyrretyvende otteogfyrretyvende niogfyrretyvende halvtredsindstyvende enoghalvtredsindstyvende
tooghalvtredsindstyvende treoghalvtredsindstyvende fireoghalvtredsindstyvende femoghalvtredsindstyvende seksoghalvtredsindstyvende syvoghalvtredsindstyvende otteoghalvtredsindstyvende nioghalvtredsindstyvende
tresindstyvende enogtresindstyvende toogtresindstyvende treogtresindstyvende fireogtresindstyvende femogtresindstyvende seksogtresindstyvende syvogtresindstyvende otteogtresindstyvende niogtresindstyvende halvfjerdsindstyvende
enoghalvfjerdsindstyvende tooghalvfjerdsindstyvende treoghalvfjerdsindstyvende fireoghalvfjerdsindstyvende femoghalvfjerdsindstyvende seksoghalvfjerdsindstyvende syvoghalvfjerdsindstyvende otteoghalvfjerdsindstyvende nioghalvfjerdsindstyvende firsindstyvende
enogfirsindstyvende toogfirsindstyvende treogfirsindstyvende fireogfirsindstyvende femogfirsindstyvende seksogfirsindstyvende syvogfirsindstyvende otteogfirsindstyvende niogfirsindstyvende halvfemsindstyvende
enoghalvfemsindstyvende tooghalvfemsindstyvende treoghalvfemsindstyvende fireoghalvfemsindstyvende femoghalvfemsindstyvende seksoghalvfemsindstyvende syvoghalvfemsindstyvende otteoghalvfemsindstyvende nioghalvfemsindstyvende
""".split()
def like_num(text):
text = text.replace(',', '').replace('.', '')
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
return True
if text in _ordinal_words:
return True
return False
LEX_ATTRS = {
LIKE_NUM: like_num
}

Some files were not shown because too many files have changed in this diff Show More