Merge remote-tracking branch 'origin/develop' into rliaw-develop

Author: Richard Liaw, 2020-06-30 13:50:03 -07:00
Commit: 610dfd85c2
235 changed files with 8908 additions and 5314 deletions

.github/contributors/Arvindcheenu.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Arvind Srinivasan |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-06-13 |
| GitHub username | arvindcheenu |
| Website (optional) | |

.github/contributors/JannisTriesToCode.md (new file)

(Standard spaCy contributor agreement; text identical to the copy above, signed as an individual.)
## Contributor Details
| Field | Entry |
|------------------------------- | ----------------------------- |
| Name | Jannis Rauschke |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 22.05.2020 |
| GitHub username | JannisTriesToCode |
| Website (optional) | https://twitter.com/JRauschke |

.github/contributors/MartinoMensio.md (updated)

@@ -99,8 +99,8 @@ mark both statements:
 | Field | Entry |
 |------------------------------- | -------------------- |
 | Name | Martino Mensio |
-| Company name (if applicable) | Polytechnic University of Turin |
-| Title or role (if applicable) | Student |
+| Company name (if applicable) | The Open University |
+| Title or role (if applicable) | PhD Student |
 | Date | 17 November 2017 |
 | GitHub username | MartinoMensio |
 | Website (optional) | https://martinomensio.github.io/ |

.github/contributors/R1j1t.md (new file)

(Standard spaCy contributor agreement; text identical to the copy above, signed as an individual.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Rajat |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 24 May 2020 |
| GitHub username | R1j1t |
| Website (optional) | |

.github/contributors/hiroshi-matsuda-rit.md (new file)

(Standard spaCy contributor agreement; text identical to the copy above, signed as an individual.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Hiroshi Matsuda |
| Company name (if applicable) | Megagon Labs, Tokyo |
| Title or role (if applicable) | Research Scientist |
| Date | June 6, 2020 |
| GitHub username | hiroshi-matsuda-rit |
| Website (optional) | |

.github/contributors/jonesmartins.md (new file)

(Standard spaCy contributor agreement; text identical to the copy above, signed as an individual.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jones Martins |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-06-10 |
| GitHub username | jonesmartins |
| Website (optional) | |

.github/contributors/leomrocha.md (new file)

(Standard spaCy contributor agreement; text identical to the copy above, signed as an individual.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Leonardo M. Rocha |
| Company name (if applicable) | |
| Title or role (if applicable) | Eng. |
| Date | 31/05/2020 |
| GitHub username | leomrocha |
| Website (optional) | |

.github/contributors/lfiedler.md (new file)

(Standard spaCy contributor agreement; text identical to the copy above, signed as an individual.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Leander Fiedler |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 06 April 2020 |
| GitHub username | lfiedler |
| Website (optional) | |

.github/contributors/mahnerak.md (new file)

(Standard spaCy contributor agreement; text identical to the copy above, signed as an individual.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Karen Hambardzumyan |
| Company name (if applicable) | YerevaNN |
| Title or role (if applicable) | Researcher |
| Date | 2020-06-19 |
| GitHub username | mahnerak |
| Website (optional)              | https://mahnerak.com/ |

.github/contributors/myavrum.md (new file)

(Standard spaCy contributor agreement; text identical to the copy above, signed as an individual.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Marat M. Yavrumyan |
| Company name (if applicable) | YSU, UD_Armenian Project |
| Title or role (if applicable) | Dr., Principal Investigator |
| Date | 2020-06-19 |
| GitHub username | myavrum |
| Website (optional) | http://armtreebank.yerevann.com/ |

.github/contributors/theudas.md (new file)

(Standard spaCy contributor agreement, identical to the copy above except that it names ExplosionAI UG (haftungsbeschränkt) as the contracting entity; signed as an individual.)
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Philipp Sodmann |
| Company name (if applicable) | Empolis |
| Title or role (if applicable) | |
| Date | 2017-05-06 |
| GitHub username | theudas |
| Website (optional) | |

.github/workflows/issue-manager.yml (new file)

@@ -0,0 +1,29 @@
name: Issue Manager

on:
  schedule:
    - cron: "0 0 * * *"
  issue_comment:
    types:
      - created
      - edited
  issues:
    types:
      - labeled

jobs:
  issue-manager:
    runs-on: ubuntu-latest
    steps:
      - uses: tiangolo/issue-manager@0.2.1
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          config: >
            {
              "resolved": {
                "delay": "P7D",
                "message": "This issue has been automatically closed because it was answered and there was no follow-up discussion.",
                "remove_label_on_comment": true,
                "remove_label_on_close": true
              }
            }

Makefile

@@ -5,8 +5,9 @@ VENV := ./env$(PYVER)
 version := $(shell "bin/get-version.sh")
 dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
-	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) spacy_lookups_data
+	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) spacy-lookups-data jieba pkuseg==0.0.22 sudachipy sudachidict_core
 	chmod a+rx $@
+	cp $@ dist/spacy.pex
 dist/pytest.pex : wheelhouse/pytest-*.whl
 	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock
@@ -14,7 +15,7 @@ dist/pytest.pex : wheelhouse/pytest-*.whl
 wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
 	$(VENV)/bin/pip wheel . -w ./wheelhouse
-	$(VENV)/bin/pip wheel spacy_lookups_data -w ./wheelhouse
+	$(VENV)/bin/pip wheel spacy-lookups-data jieba pkuseg==0.0.22 sudachipy sudachidict_core -w ./wheelhouse
 	touch $@
 wheelhouse/pytest-%.whl : $(VENV)/bin/pex
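
A hedged aside on the dependency additions in this hunk: bundling jieba, pkuseg, SudachiPy and SudachiDict-core into the pex appears intended to make Chinese and Japanese tokenization work out of the box. The snippet below is an illustrative sketch under that assumption, not part of the diff; it assumes those wheels are importable in the environment.

```python
# Sketch only: assumes jieba/pkuseg (Chinese) and SudachiPy + SudachiDict-core
# (Japanese) are installed, as bundled by the Makefile change above.
import spacy

zh = spacy.blank("zh")   # Chinese tokenization backed by jieba/pkuseg in v2.3
ja = spacy.blank("ja")   # Japanese tokenization backed by SudachiPy
print([t.text for t in zh("我喜欢自然语言处理")])
print([t.text for t in ja("自然言語処理が好きです")])
```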

README.md

@@ -6,12 +6,12 @@ spaCy is a library for advanced Natural Language Processing in Python and
 Cython. It's built on the very latest research, and was designed from day one to
 be used in real products. spaCy comes with
 [pretrained statistical models](https://spacy.io/models) and word vectors, and
-currently supports tokenization for **50+ languages**. It features
+currently supports tokenization for **60+ languages**. It features
 state-of-the-art speed, convolutional **neural network models** for tagging,
 parsing and **named entity recognition** and easy **deep learning** integration.
 It's commercial open-source software, released under the MIT license.

-💫 **Version 2.2 out now!**
+💫 **Version 2.3 out now!**
 [Check out the release notes here.](https://github.com/explosion/spaCy/releases)

 [![Azure Pipelines](<https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build+(3.x)>)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
@@ -31,7 +31,7 @@ It's commercial open-source software, released under the MIT license.
 | --------------- | -------------------------------------------------------------- |
 | [spaCy 101]     | New to spaCy? Here's everything you need to know!               |
 | [Usage Guides]  | How to use spaCy and its features.                              |
-| [New in v2.2]   | New features, backwards incompatibilities and migration guide.  |
+| [New in v2.3]   | New features, backwards incompatibilities and migration guide.  |
 | [API Reference] | The detailed reference for spaCy's API.                         |
 | [Models]        | Download statistical language models for spaCy.                 |
 | [Universe]      | Libraries, extensions, demos, books and courses.                |
@@ -39,7 +39,7 @@ It's commercial open-source software, released under the MIT license.
 | [Contribute]    | How to contribute to the spaCy project and code base.           |

 [spacy 101]: https://spacy.io/usage/spacy-101
-[new in v2.2]: https://spacy.io/usage/v2-2
+[new in v2.3]: https://spacy.io/usage/v2-3
 [usage guides]: https://spacy.io/usage/
 [api reference]: https://spacy.io/api/
 [models]: https://spacy.io/models
@@ -119,12 +119,13 @@ of `v2.0.13`).
 pip install spacy
 ```

-To install additional data tables for lemmatization in **spaCy v2.2+** you can
-run `pip install spacy[lookups]` or install
+To install additional data tables for lemmatization and normalization in
+**spaCy v2.2+** you can run `pip install spacy[lookups]` or install
 [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
 separately. The lookups package is needed to create blank models with
-lemmatization data, and to lemmatize in languages that don't yet come with
-pretrained models and aren't powered by third-party libraries.
+lemmatization data for v2.2+ plus normalization data for v2.3+, and to
+lemmatize in languages that don't yet come with pretrained models and aren't
+powered by third-party libraries.

 When using pip it is generally recommended to install packages in a virtual
 environment to avoid modifying system state:
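
A minimal sketch of what the lookups extra enables, assuming `spacy` and `spacy-lookups-data` are installed; the sentence and the printed lemmas are illustrative only:

```python
import spacy

# Blank pipeline, no statistical model: lemmas come from the lookup tables
# shipped in spacy-lookups-data (installed via `pip install spacy[lookups]`).
nlp = spacy.blank("en")
doc = nlp("The cats were running")
print([token.lemma_ for token in doc])  # e.g. ['the', 'cat', 'be', 'run']
```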

View File

@ -14,7 +14,7 @@ import spacy
import spacy.util import spacy.util
from bin.ud import conll17_ud_eval from bin.ud import conll17_ud_eval
from spacy.tokens import Token, Doc from spacy.tokens import Token, Doc
from spacy.gold import GoldParse, Example from spacy.gold import Example
from spacy.util import compounding, minibatch, minibatch_by_words from spacy.util import compounding, minibatch, minibatch_by_words
from spacy.syntax.nonproj import projectivize from spacy.syntax.nonproj import projectivize
from spacy.matcher import Matcher from spacy.matcher import Matcher
@ -78,22 +78,21 @@ def read_data(
head = int(head) - 1 if head != "0" else id_ head = int(head) - 1 if head != "0" else id_
sent["words"].append(word) sent["words"].append(word)
sent["tags"].append(tag) sent["tags"].append(tag)
sent["morphology"].append(_parse_morph_string(morph)) sent["morphs"].append(_compile_morph_string(morph, pos))
sent["morphology"][-1].add("POS_%s" % pos)
sent["heads"].append(head) sent["heads"].append(head)
sent["deps"].append("ROOT" if dep == "root" else dep) sent["deps"].append("ROOT" if dep == "root" else dep)
sent["spaces"].append(space_after == "_") sent["spaces"].append(space_after == "_")
sent["entities"] = ["-"] * len(sent["words"]) sent["entities"] = ["-"] * len(sent["words"]) # TODO: doc-level format
sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"]) sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"])
if oracle_segments: if oracle_segments:
docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"])) docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"]))
golds.append(GoldParse(docs[-1], **sent)) golds.append(sent)
assert golds[-1].morphology is not None assert golds[-1]["morphs"] is not None
sent_annots.append(sent) sent_annots.append(sent)
if raw_text and max_doc_length and len(sent_annots) >= max_doc_length: if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
doc, gold = _make_gold(nlp, None, sent_annots) doc, gold = _make_gold(nlp, None, sent_annots)
assert gold.morphology is not None assert gold["morphs"] is not None
sent_annots = [] sent_annots = []
docs.append(doc) docs.append(doc)
golds.append(gold) golds.append(gold)
@@ -109,17 +108,10 @@ def read_data(
     return golds_to_gold_data(docs, golds)


-def _parse_morph_string(morph_string):
+def _compile_morph_string(morph_string, pos):
     if morph_string == '_':
-        return set()
-    output = []
-    replacements = {'1': 'one', '2': 'two', '3': 'three'}
-    for feature in morph_string.split('|'):
-        key, value = feature.split('=')
-        value = replacements.get(value, value)
-        value = value.split(',')[0]
-        output.append('%s_%s' % (key, value.lower()))
-    return set(output)
+        return f"POS={pos}"
+    return morph_string + f"|POS={pos}"
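
For reference, the replacement helper now returns a single FEATS-style string instead of a feature set. A quick illustration with made-up values:

```python
def _compile_morph_string(morph_string, pos):
    # Same logic as the new helper above: append POS to the UD FEATS string.
    if morph_string == "_":
        return f"POS={pos}"
    return morph_string + f"|POS={pos}"

print(_compile_morph_string("_", "NOUN"))                     # -> "POS=NOUN"
print(_compile_morph_string("Case=Nom|Number=Sing", "NOUN"))  # -> "Case=Nom|Number=Sing|POS=NOUN"
```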
def read_conllu(file_): def read_conllu(file_):
@ -151,28 +143,27 @@ def read_conllu(file_):
def _make_gold(nlp, text, sent_annots, drop_deps=0.0): def _make_gold(nlp, text, sent_annots, drop_deps=0.0):
# Flatten the conll annotations, and adjust the head indices # Flatten the conll annotations, and adjust the head indices
flat = defaultdict(list) gold = defaultdict(list)
sent_starts = [] sent_starts = []
for sent in sent_annots: for sent in sent_annots:
flat["heads"].extend(len(flat["words"])+head for head in sent["heads"]) gold["heads"].extend(len(gold["words"])+head for head in sent["heads"])
for field in ["words", "tags", "deps", "morphology", "entities", "spaces"]: for field in ["words", "tags", "deps", "morphs", "entities", "spaces"]:
flat[field].extend(sent[field]) gold[field].extend(sent[field])
sent_starts.append(True) sent_starts.append(True)
sent_starts.extend([False] * (len(sent["words"]) - 1)) sent_starts.extend([False] * (len(sent["words"]) - 1))
# Construct text if necessary # Construct text if necessary
assert len(flat["words"]) == len(flat["spaces"]) assert len(gold["words"]) == len(gold["spaces"])
if text is None: if text is None:
text = "".join( text = "".join(
word + " " * space for word, space in zip(flat["words"], flat["spaces"]) word + " " * space for word, space in zip(gold["words"], gold["spaces"])
) )
doc = nlp.make_doc(text) doc = nlp.make_doc(text)
flat.pop("spaces") gold.pop("spaces")
gold = GoldParse(doc, **flat) gold["sent_starts"] = sent_starts
gold.sent_starts = sent_starts for i in range(len(gold["heads"])):
for i in range(len(gold.heads)):
if random.random() < drop_deps: if random.random() < drop_deps:
gold.heads[i] = None gold["heads"][i] = None
gold.labels[i] = None gold["labels"][i] = None
return doc, gold return doc, gold
@@ -183,15 +174,10 @@ def _make_gold(nlp, text, sent_annots, drop_deps=0.0):
 def golds_to_gold_data(docs, golds):
-    """Get out the training data format used by begin_training, given the
-    GoldParse objects."""
+    """Get out the training data format used by begin_training"""
     data = []
     for doc, gold in zip(docs, golds):
-        example = Example(doc=doc)
-        example.add_doc_annotation(cats=gold.cats)
-        token_annotation_dict = gold.orig.to_dict()
-        example.add_token_annotation(**token_annotation_dict)
-        example.goldparse = gold
+        example = Example.from_dict(doc, dict(gold))
         data.append(example)
     return data
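
A hedged sketch of the `Example.from_dict` call used above, with a made-up two-token annotation dict whose field names mirror the ones collected by `_make_gold`:

```python
import spacy
from spacy.gold import Example

nlp = spacy.blank("en")
doc = nlp.make_doc("She runs")
annotations = {
    "words": ["She", "runs"],
    "tags": ["PRP", "VBZ"],
    "heads": [1, 1],
    "deps": ["nsubj", "ROOT"],
}
example = Example.from_dict(doc, annotations)  # reference Doc is built from the dict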
@ -359,9 +345,8 @@ def initialize_pipeline(nlp, examples, config, device):
nlp.parser.add_multitask_objective("tag") nlp.parser.add_multitask_objective("tag")
if config.multitask_sent: if config.multitask_sent:
nlp.parser.add_multitask_objective("sent_start") nlp.parser.add_multitask_objective("sent_start")
for ex in examples: for eg in examples:
gold = ex.gold for tag in eg.get_aligned("TAG", as_string=True):
for tag in gold.tags:
if tag is not None: if tag is not None:
nlp.tagger.add_label(tag) nlp.tagger.add_label(tag)
if torch is not None and device != -1: if torch is not None and device != -1:
@ -495,10 +480,6 @@ def main(
Token.set_extension("begins_fused", default=False) Token.set_extension("begins_fused", default=False)
Token.set_extension("inside_fused", default=False) Token.set_extension("inside_fused", default=False)
Token.set_extension("get_conllu_lines", method=get_token_conllu)
Token.set_extension("begins_fused", default=False)
Token.set_extension("inside_fused", default=False)
spacy.util.fix_random_seed() spacy.util.fix_random_seed()
lang.zh.Chinese.Defaults.use_jieba = False lang.zh.Chinese.Defaults.use_jieba = False
lang.ja.Japanese.Defaults.use_janome = False lang.ja.Japanese.Defaults.use_janome = False
@ -541,10 +522,10 @@ def main(
else: else:
batches = minibatch(examples, size=batch_sizes) batches = minibatch(examples, size=batch_sizes)
losses = {} losses = {}
n_train_words = sum(len(ex.doc) for ex in examples) n_train_words = sum(len(eg.predicted) for eg in examples)
with tqdm.tqdm(total=n_train_words, leave=False) as pbar: with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
for batch in batches: for batch in batches:
pbar.update(sum(len(ex.doc) for ex in batch)) pbar.update(sum(len(ex.predicted) for ex in batch))
nlp.parser.cfg["beam_update_prob"] = next(beam_prob) nlp.parser.cfg["beam_update_prob"] = next(beam_prob)
nlp.update( nlp.update(
batch, batch,

View File

@@ -5,17 +5,16 @@
 # data is passed in sentence-by-sentence via some prior preprocessing.
 gold_preproc = false
 # Limitations on training document length or number of examples.
-max_length = 0
+max_length = 5000
 limit = 0
 # Data augmentation
 orth_variant_level = 0.0
-noise_level = 0.0
 dropout = 0.1
 # Controls early-stopping. 0 or -1 mean unlimited.
 patience = 1600
 max_epochs = 0
 max_steps = 20000
-eval_frequency = 400
+eval_frequency = 200
 # Other settings
 seed = 0
 accumulate_gradient = 1
@@ -41,15 +40,15 @@ beta2 = 0.999
 L2_is_weight_decay = true
 L2 = 0.01
 grad_clip = 1.0
-use_averages = true
+use_averages = false
 eps = 1e-8
-learn_rate = 0.001
+#learn_rate = 0.001

-#[optimizer.learn_rate]
-#@schedules = "warmup_linear.v1"
-#warmup_steps = 250
-#total_steps = 20000
-#initial_rate = 0.001
+[optimizer.learn_rate]
+@schedules = "warmup_linear.v1"
+warmup_steps = 250
+total_steps = 20000
+initial_rate = 0.001

 [nlp]
 lang = "en"
@@ -58,15 +57,11 @@ vectors = null
 [nlp.pipeline.tok2vec]
 factory = "tok2vec"

-[nlp.pipeline.senter]
-factory = "senter"
-
 [nlp.pipeline.ner]
 factory = "ner"
 learn_tokens = false
 min_action_freq = 1
-beam_width = 1
-beam_update_prob = 1.0

 [nlp.pipeline.tagger]
 factory = "tagger"
@@ -74,16 +69,7 @@ factory = "tagger"
 [nlp.pipeline.parser]
 factory = "parser"
 learn_tokens = false
-min_action_freq = 1
-beam_width = 1
-beam_update_prob = 1.0
-
-[nlp.pipeline.senter.model]
-@architectures = "spacy.Tagger.v1"
-
-[nlp.pipeline.senter.model.tok2vec]
-@architectures = "spacy.Tok2VecTensors.v1"
-width = ${nlp.pipeline.tok2vec.model:width}
+min_action_freq = 30

 [nlp.pipeline.tagger.model]
 @architectures = "spacy.Tagger.v1"
@@ -96,8 +82,8 @@ width = ${nlp.pipeline.tok2vec.model:width}
 @architectures = "spacy.TransitionBasedParser.v1"
 nr_feature_tokens = 8
 hidden_width = 128
-maxout_pieces = 3
-use_upper = false
+maxout_pieces = 2
+use_upper = true

 [nlp.pipeline.parser.model.tok2vec]
 @architectures = "spacy.Tok2VecTensors.v1"
@@ -107,8 +93,8 @@ width = ${nlp.pipeline.tok2vec.model:width}
 @architectures = "spacy.TransitionBasedParser.v1"
 nr_feature_tokens = 3
 hidden_width = 128
-maxout_pieces = 3
-use_upper = false
+maxout_pieces = 2
+use_upper = true

 [nlp.pipeline.ner.model.tok2vec]
 @architectures = "spacy.Tok2VecTensors.v1"
@@ -117,10 +103,10 @@ width = ${nlp.pipeline.tok2vec.model:width}
 [nlp.pipeline.tok2vec.model]
 @architectures = "spacy.HashEmbedCNN.v1"
 pretrained_vectors = ${nlp:vectors}
-width = 256
-depth = 6
+width = 128
+depth = 4
 window_size = 1
-embed_size = 10000
+embed_size = 7000
 maxout_pieces = 3
 subword_features = true
-dropout = null
+dropout = ${training:dropout}

View File

@@ -9,7 +9,6 @@ max_length = 0
 limit = 0
 # Data augmentation
 orth_variant_level = 0.0
-noise_level = 0.0
 dropout = 0.1
 # Controls early-stopping. 0 or -1 mean unlimited.
 patience = 1600

View File

@ -0,0 +1,80 @@
# Training hyper-parameters and additional features.
[training]
# Whether to train on sequences with 'gold standard' sentence boundaries
# and tokens. If you set this to true, take care to ensure your run-time
# data is passed in sentence-by-sentence via some prior preprocessing.
gold_preproc = false
# Limitations on training document length or number of examples.
max_length = 5000
limit = 0
# Data augmentation
orth_variant_level = 0.0
dropout = 0.2
# Controls early-stopping. 0 or -1 mean unlimited.
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 500
# Other settings
seed = 0
accumulate_gradient = 1
use_pytorch_for_gpu_memory = false
# Control how scores are printed and checkpoints are evaluated.
scores = ["speed", "ents_p", "ents_r", "ents_f"]
score_weights = {"ents_f": 1.0}
# These settings are invalid for the transformer models.
init_tok2vec = null
discard_oversize = false
omit_extra_lookups = false
[training.batch_size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = false
L2 = 1e-6
grad_clip = 1.0
use_averages = true
eps = 1e-8
learn_rate = 0.001
#[optimizer.learn_rate]
#@schedules = "warmup_linear.v1"
#warmup_steps = 250
#total_steps = 20000
#initial_rate = 0.001
[nlp]
lang = "en"
vectors = null
[nlp.pipeline.ner]
factory = "ner"
learn_tokens = false
min_action_freq = 1
beam_width = 1
beam_update_prob = 1.0
[nlp.pipeline.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 3
hidden_width = 64
maxout_pieces = 2
use_upper = true
[nlp.pipeline.ner.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = ${nlp:vectors}
width = 96
depth = 4
window_size = 1
embed_size = 2000
maxout_pieces = 3
subword_features = true
dropout = ${training:dropout}
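
A hedged sketch of inspecting a config like the one above with Thinc's `Config` class (the file name is hypothetical; thinc `8.0.0a11` is the version pinned elsewhere in this diff):

```python
from thinc.api import Config

# Load the config file shown above and read back a couple of values.
config = Config().from_disk("ner_config.cfg")  # hypothetical path
print(config["training"]["dropout"])           # 0.2
print(config["nlp"]["pipeline"]["ner"]["model"]["@architectures"])  # spacy.TransitionBasedParser.v1
```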

View File

@@ -6,7 +6,6 @@ init_tok2vec = null
 vectors = null
 max_epochs = 100
 orth_variant_level = 0.0
-noise_level = 0.0
 gold_preproc = true
 max_length = 0
 use_gpu = 0

View File

@@ -6,7 +6,6 @@ init_tok2vec = null
 vectors = null
 max_epochs = 100
 orth_variant_level = 0.0
-noise_level = 0.0
 gold_preproc = true
 max_length = 0
 use_gpu = -1

View File

@ -12,7 +12,7 @@ import tqdm
import spacy import spacy
import spacy.util import spacy.util
from spacy.tokens import Token, Doc from spacy.tokens import Token, Doc
from spacy.gold import GoldParse, Example from spacy.gold import Example
from spacy.syntax.nonproj import projectivize from spacy.syntax.nonproj import projectivize
from collections import defaultdict from collections import defaultdict
from spacy.matcher import Matcher from spacy.matcher import Matcher
@ -33,31 +33,6 @@ random.seed(0)
numpy.random.seed(0) numpy.random.seed(0)
def minibatch_by_words(examples, size=5000):
random.shuffle(examples)
if isinstance(size, int):
size_ = itertools.repeat(size)
else:
size_ = size
examples = iter(examples)
while True:
batch_size = next(size_)
batch = []
while batch_size >= 0:
try:
example = next(examples)
except StopIteration:
if batch:
yield batch
return
batch_size -= len(example.doc)
batch.append(example)
if batch:
yield batch
else:
break
################ ################
# Data reading # # Data reading #
################ ################
@ -110,7 +85,7 @@ def read_data(
sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"]) sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"])
if oracle_segments: if oracle_segments:
docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"])) docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"]))
golds.append(GoldParse(docs[-1], **sent)) golds.append(sent)
sent_annots.append(sent) sent_annots.append(sent)
if raw_text and max_doc_length and len(sent_annots) >= max_doc_length: if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
@ -159,20 +134,19 @@ def read_conllu(file_):
def _make_gold(nlp, text, sent_annots): def _make_gold(nlp, text, sent_annots):
# Flatten the conll annotations, and adjust the head indices # Flatten the conll annotations, and adjust the head indices
flat = defaultdict(list) gold = defaultdict(list)
for sent in sent_annots: for sent in sent_annots:
flat["heads"].extend(len(flat["words"]) + head for head in sent["heads"]) gold["heads"].extend(len(gold["words"]) + head for head in sent["heads"])
for field in ["words", "tags", "deps", "entities", "spaces"]: for field in ["words", "tags", "deps", "entities", "spaces"]:
flat[field].extend(sent[field]) gold[field].extend(sent[field])
# Construct text if necessary # Construct text if necessary
assert len(flat["words"]) == len(flat["spaces"]) assert len(gold["words"]) == len(gold["spaces"])
if text is None: if text is None:
text = "".join( text = "".join(
word + " " * space for word, space in zip(flat["words"], flat["spaces"]) word + " " * space for word, space in zip(gold["words"], gold["spaces"])
) )
doc = nlp.make_doc(text) doc = nlp.make_doc(text)
flat.pop("spaces") gold.pop("spaces")
gold = GoldParse(doc, **flat)
return doc, gold return doc, gold
@ -182,15 +156,10 @@ def _make_gold(nlp, text, sent_annots):
def golds_to_gold_data(docs, golds): def golds_to_gold_data(docs, golds):
"""Get out the training data format used by begin_training, given the """Get out the training data format used by begin_training."""
GoldParse objects."""
data = [] data = []
for doc, gold in zip(docs, golds): for doc, gold in zip(docs, golds):
example = Example(doc=doc) example = Example.from_dict(doc, gold)
example.add_doc_annotation(cats=gold.cats)
token_annotation_dict = gold.orig.to_dict()
example.add_token_annotation(**token_annotation_dict)
example.goldparse = gold
data.append(example) data.append(example)
return data return data
@ -313,15 +282,15 @@ def initialize_pipeline(nlp, examples, config):
nlp.parser.add_multitask_objective("sent_start") nlp.parser.add_multitask_objective("sent_start")
nlp.parser.moves.add_action(2, "subtok") nlp.parser.moves.add_action(2, "subtok")
nlp.add_pipe(nlp.create_pipe("tagger")) nlp.add_pipe(nlp.create_pipe("tagger"))
for ex in examples: for eg in examples:
for tag in ex.gold.tags: for tag in eg.get_aligned("TAG", as_string=True):
if tag is not None: if tag is not None:
nlp.tagger.add_label(tag) nlp.tagger.add_label(tag)
# Replace labels that didn't make the frequency cutoff # Replace labels that didn't make the frequency cutoff
actions = set(nlp.parser.labels) actions = set(nlp.parser.labels)
label_set = set([act.split("-")[1] for act in actions if "-" in act]) label_set = set([act.split("-")[1] for act in actions if "-" in act])
for ex in examples: for eg in examples:
gold = ex.gold gold = eg.gold
for i, label in enumerate(gold.labels): for i, label in enumerate(gold.labels):
if label is not None and label not in label_set: if label is not None and label not in label_set:
gold.labels[i] = label.split("||")[0] gold.labels[i] = label.split("||")[0]
@ -415,13 +384,12 @@ def main(ud_dir, parses_dir, config, corpus, limit=0):
optimizer = initialize_pipeline(nlp, examples, config) optimizer = initialize_pipeline(nlp, examples, config)
for i in range(config.nr_epoch): for i in range(config.nr_epoch):
docs = [nlp.make_doc(example.doc.text) for example in examples] batches = spacy.minibatch_by_words(examples, size=config.batch_size)
batches = minibatch_by_words(examples, size=config.batch_size)
losses = {} losses = {}
n_train_words = sum(len(doc) for doc in docs) n_train_words = sum(len(eg.reference.doc) for eg in examples)
with tqdm.tqdm(total=n_train_words, leave=False) as pbar: with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
for batch in batches: for batch in batches:
pbar.update(sum(len(ex.doc) for ex in batch)) pbar.update(sum(len(eg.reference.doc) for eg in batch))
nlp.update( nlp.update(
examples=batch, sgd=optimizer, drop=config.dropout, losses=losses, examples=batch, sgd=optimizer, drop=config.dropout, losses=losses,
) )

View File

@@ -30,7 +30,7 @@ ENTITIES = {"Q2146908": ("American golfer", 342), "Q7381115": ("publisher", 17)}
     model=("Model name, should have pretrained word embeddings", "positional", None, str),
     output_dir=("Optional output directory", "option", "o", Path),
 )
-def main(model=None, output_dir=None):
+def main(model, output_dir=None):
     """Load the model and create the KB with pre-defined entity encodings.
     If an output_dir is provided, the KB will be stored there in a file 'kb'.
     The updated vocab will also be written to a directory in the output_dir."""

View File

@ -24,8 +24,10 @@ import random
import plac import plac
import spacy import spacy
import os.path import os.path
from spacy.gold.example import Example
from spacy.tokens import Doc from spacy.tokens import Doc
from spacy.gold import read_json_file, GoldParse from spacy.gold import read_json_file
random.seed(0) random.seed(0)
@ -59,17 +61,15 @@ def main(n_iter=10):
print(nlp.pipeline) print(nlp.pipeline)
print("Create data", len(TRAIN_DATA)) print("Create data", len(TRAIN_DATA))
optimizer = nlp.begin_training(get_examples=lambda: TRAIN_DATA) optimizer = nlp.begin_training()
for itn in range(n_iter): for itn in range(n_iter):
random.shuffle(TRAIN_DATA) random.shuffle(TRAIN_DATA)
losses = {} losses = {}
for example in TRAIN_DATA: for example_dict in TRAIN_DATA:
for token_annotation in example.token_annotations: doc = Doc(nlp.vocab, words=example_dict["words"])
doc = Doc(nlp.vocab, words=token_annotation.words) example = Example.from_dict(doc, example_dict)
gold = GoldParse.from_annotation(doc, example.doc_annotation, token_annotation)
nlp.update( nlp.update(
examples=[(doc, gold)], # 1 example examples=[example], # 1 example
drop=0.2, # dropout - make it harder to memorise data drop=0.2, # dropout - make it harder to memorise data
sgd=optimizer, # callable to update weights sgd=optimizer, # callable to update weights
losses=losses, losses=losses,
@ -77,9 +77,9 @@ def main(n_iter=10):
print(losses.get("nn_labeller", 0.0), losses["ner"]) print(losses.get("nn_labeller", 0.0), losses["ner"])
# test the trained model # test the trained model
for example in TRAIN_DATA: for example_dict in TRAIN_DATA:
if example.text is not None: if "text" in example_dict:
doc = nlp(example.text) doc = nlp(example_dict["text"])
print("Entities", [(ent.text, ent.label_) for ent in doc.ents]) print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc]) print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

View File

@ -4,9 +4,10 @@ import random
import warnings import warnings
import srsly import srsly
import spacy import spacy
from spacy.gold import GoldParse from spacy.gold import Example
from spacy.util import minibatch, compounding from spacy.util import minibatch, compounding
# TODO: further fix & test this script for v.3 ? (read_gold_data is never called)
LABEL = "ANIMAL" LABEL = "ANIMAL"
TRAIN_DATA = [ TRAIN_DATA = [
@ -36,15 +37,13 @@ def read_raw_data(nlp, jsonl_loc):
def read_gold_data(nlp, gold_loc): def read_gold_data(nlp, gold_loc):
docs = [] examples = []
golds = []
for json_obj in srsly.read_jsonl(gold_loc): for json_obj in srsly.read_jsonl(gold_loc):
doc = nlp.make_doc(json_obj["text"]) doc = nlp.make_doc(json_obj["text"])
ents = [(ent["start"], ent["end"], ent["label"]) for ent in json_obj["spans"]] ents = [(ent["start"], ent["end"], ent["label"]) for ent in json_obj["spans"]]
gold = GoldParse(doc, entities=ents) example = Example.from_dict(doc, {"entities": ents})
docs.append(doc) examples.append(example)
golds.append(gold) return examples
return list(zip(docs, golds))
def main(model_name, unlabelled_loc): def main(model_name, unlabelled_loc):

View File

@@ -2,7 +2,7 @@
 # coding: utf-8
 """Using the parser to recognise your own semantics

-spaCy's parser component can be used to trained to predict any type of tree
+spaCy's parser component can be trained to predict any type of tree
 structure over your input text. You can also predict trees over whole documents
 or chat logs, with connections between the sentence-roots used to annotate
 discourse structure. In this example, we'll build a message parser for a common

View File

@@ -56,7 +56,7 @@ def main(model=None, output_dir=None, n_iter=100):
             print("Add label", ent[2])
             ner.add_label(ent[2])

-    with nlp.select_pipes(enable="ner") and warnings.catch_warnings():
+    with nlp.select_pipes(enable="simple_ner") and warnings.catch_warnings():
         # show warnings for misaligned entity spans once
         warnings.filterwarnings("once", category=UserWarning, module="spacy")

View File

@@ -19,7 +19,7 @@ from ml_datasets import loaders
 import spacy
 from spacy import util
 from spacy.util import minibatch, compounding
-from spacy.gold import Example, GoldParse
+from spacy.gold import Example


 @plac.annotations(
@@ -62,11 +62,10 @@ def main(config_path, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=Non
     train_examples = []
     for text, cats in zip(train_texts, train_cats):
         doc = nlp.make_doc(text)
-        gold = GoldParse(doc, cats=cats)
+        example = Example.from_dict(doc, {"cats": cats})
         for cat in cats:
             textcat.add_label(cat)
-        ex = Example.from_gold(gold, doc=doc)
-        train_examples.append(ex)
+        train_examples.append(example)

     with nlp.select_pipes(enable="textcat"):  # only train textcat
         optimizer = nlp.begin_training()

View File

@@ -6,7 +6,7 @@ requires = [
     "cymem>=2.0.2,<2.1.0",
     "preshed>=3.0.2,<3.1.0",
     "murmurhash>=0.28.0,<1.1.0",
-    "thinc==8.0.0a9",
+    "thinc==8.0.0a11",
     "blis>=0.4.0,<0.5.0"
 ]
 build-backend = "setuptools.build_meta"

View File

@@ -1,17 +1,17 @@
 # Our libraries
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
-thinc==8.0.0a9
+thinc==8.0.0a11
 blis>=0.4.0,<0.5.0
 ml_datasets>=0.1.1
 murmurhash>=0.28.0,<1.1.0
-wasabi>=0.4.0,<1.1.0
-srsly>=2.0.0,<3.0.0
+wasabi>=0.7.0,<1.1.0
+srsly>=2.1.0,<3.0.0
 catalogue>=0.0.7,<1.1.0
+typer>=0.3.0,<1.0.0
 # Third party dependencies
 numpy>=1.15.0
 requests>=2.13.0,<3.0.0
-plac>=0.9.6,<1.2.0
 tqdm>=4.38.0,<5.0.0
 pydantic>=1.3.0,<2.0.0
 # Official Python utilities

View File

@@ -36,22 +36,21 @@ setup_requires =
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
     murmurhash>=0.28.0,<1.1.0
-    thinc==8.0.0a9
+    thinc==8.0.0a11
 install_requires =
     # Our libraries
     murmurhash>=0.28.0,<1.1.0
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
-    thinc==8.0.0a9
+    thinc==8.0.0a11
     blis>=0.4.0,<0.5.0
-    wasabi>=0.4.0,<1.1.0
-    srsly>=2.0.0,<3.0.0
+    wasabi>=0.7.0,<1.1.0
+    srsly>=2.1.0,<3.0.0
     catalogue>=0.0.7,<1.1.0
-    ml_datasets>=0.1.1
+    typer>=0.3.0,<1.0.0
     # Third-party dependencies
     tqdm>=4.38.0,<5.0.0
     numpy>=1.15.0
-    plac>=0.9.6,<1.2.0
     requests>=2.13.0,<3.0.0
     pydantic>=1.3.0,<2.0.0
     # Official Python utilities
@@ -61,7 +60,7 @@ install_requires =
 [options.extras_require]
 lookups =
-    spacy_lookups_data>=0.3.1,<0.4.0
+    spacy_lookups_data>=0.3.2,<0.4.0
 cuda =
     cupy>=5.0.0b4,<9.0.0
 cuda80 =
@@ -80,7 +79,8 @@ cuda102 =
     cupy-cuda102>=5.0.0b4,<9.0.0
 # Language tokenizers with external dependencies
 ja =
-    fugashi>=0.1.3
+    sudachipy>=0.4.5
+    sudachidict_core>=20200330
 ko =
     natto-py==0.9.0
 th =

View File

@@ -23,6 +23,8 @@ Options.docstrings = True
 PACKAGES = find_packages()
 MOD_NAMES = [
+    "spacy.gold.align",
+    "spacy.gold.example",
     "spacy.parts_of_speech",
     "spacy.strings",
     "spacy.lexeme",
@@ -37,11 +39,10 @@ MOD_NAMES = [
     "spacy.tokenizer",
     "spacy.syntax.nn_parser",
     "spacy.syntax._parser_model",
-    "spacy.syntax._beam_utils",
     "spacy.syntax.nonproj",
     "spacy.syntax.transition_system",
     "spacy.syntax.arc_eager",
-    "spacy.gold",
+    "spacy.gold.gold_io",
     "spacy.tokens.doc",
     "spacy.tokens.span",
     "spacy.tokens.token",
@@ -120,7 +121,7 @@ class build_ext_subclass(build_ext, build_ext_options):
 def clean(path):
     for path in path.glob("**/*"):
-        if path.is_file() and path.suffix in (".so", ".cpp"):
+        if path.is_file() and path.suffix in (".so", ".cpp", ".html"):
             print(f"Deleting {path.name}")
             path.unlink()

View File

@@ -8,7 +8,7 @@ warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
 from thinc.api import prefer_gpu, require_gpu

 from . import pipeline
-from .cli.info import info as cli_info
+from .cli.info import info
 from .glossary import explain
 from .about import __version__
 from .errors import Errors, Warnings
@@ -34,7 +34,3 @@ def load(name, **overrides):
 def blank(name, **kwargs):
     LangClass = util.get_lang_class(name)
     return LangClass(**kwargs)
-
-
-def info(model=None, markdown=False, silent=False):
-    return cli_info(model, markdown, silent)

View File

@@ -1,31 +1,4 @@
 if __name__ == "__main__":
-    import plac
-    import sys
-    from wasabi import msg
-    from spacy.cli import download, link, info, package, pretrain, convert
-    from spacy.cli import init_model, profile, evaluate, validate, debug_data
-    from spacy.cli import train_cli
-
-    commands = {
-        "download": download,
-        "link": link,
-        "info": info,
-        "train": train_cli,
-        "pretrain": pretrain,
-        "debug-data": debug_data,
-        "evaluate": evaluate,
-        "convert": convert,
-        "package": package,
-        "init-model": init_model,
-        "profile": profile,
-        "validate": validate,
-    }
-    if len(sys.argv) == 1:
-        msg.info("Available commands", ", ".join(commands), exits=1)
-    command = sys.argv.pop(1)
-    sys.argv[0] = f"spacy {command}"
-    if command in commands:
-        plac.call(commands[command], sys.argv[1:])
-    else:
-        available = f"Available: {', '.join(commands)}"
-        msg.fail(f"Unknown command: {command}", available, exits=1)
+    from spacy.cli import setup_cli
+
+    setup_cli()

View File

@@ -1,7 +1,8 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "3.0.0.dev9"
+__version__ = "3.0.0.dev12"
 __release__ = True
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
 __shortcuts__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts-v2.json"
+__projects__ = "https://github.com/explosion/spacy-boilerplates"

View File

@@ -1,19 +1,28 @@
 from wasabi import msg

+from ._app import app, setup_cli  # noqa: F401
+
+# These are the actual functions, NOT the wrapped CLI commands. The CLI commands
+# are registered automatically and won't have to be imported here.
 from .download import download  # noqa: F401
 from .info import info  # noqa: F401
 from .package import package  # noqa: F401
 from .profile import profile  # noqa: F401
-from .train_from_config import train_cli  # noqa: F401
+from .train import train_cli  # noqa: F401
 from .pretrain import pretrain  # noqa: F401
 from .debug_data import debug_data  # noqa: F401
 from .evaluate import evaluate  # noqa: F401
 from .convert import convert  # noqa: F401
 from .init_model import init_model  # noqa: F401
 from .validate import validate  # noqa: F401
+from .project import project_clone, project_assets, project_run  # noqa: F401
+from .project import project_run_all  # noqa: F401


+@app.command("link", no_args_is_help=True, deprecated=True, hidden=True)
 def link(*args, **kwargs):
+    """As of spaCy v3.0, model symlinks are deprecated. You can load models
+    using their full names or from a directory path."""
     msg.warn(
         "As of spaCy v3.0, model symlinks are deprecated. You can load models "
         "using their full names or from a directory path."

spacy/cli/_app.py (new file, 24 lines)
View File

@ -0,0 +1,24 @@
import typer
from typer.main import get_command
COMMAND = "python -m spacy"
NAME = "spacy"
HELP = """spaCy Command-line Interface
DOCS: https://spacy.io/api/cli
"""
app = typer.Typer(name=NAME, help=HELP)
# Wrappers for Typer's annotations. Initially created to set defaults and to
# keep the names short, but not needed at the moment.
Arg = typer.Argument
Opt = typer.Option
def setup_cli() -> None:
# Ensure that the help messages always display the correct prompt
command = get_command(app)
command(prog_name=COMMAND)
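
A hedged sketch of how a command hooks into the shared `app` object defined above. The `hello` command is invented purely for illustration; the `Arg`/`Opt` pattern mirrors the real commands elsewhere in this diff:

```python
from spacy.cli._app import app, Arg, Opt, setup_cli


@app.command("hello")
def hello_cli(
    # fmt: off
    name: str = Arg(..., help="Name to greet"),
    shout: bool = Opt(False, "--shout", "-S", help="Uppercase the greeting"),
    # fmt: on
):
    """Toy command showing the Arg/Opt wrappers (not part of spaCy)."""
    greeting = f"Hello, {name}!"
    print(greeting.upper() if shout else greeting)


if __name__ == "__main__":
    setup_cli()  # dispatches to the registered commands
```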

View File

@ -1,88 +1,115 @@
from typing import Optional
from enum import Enum
from pathlib import Path from pathlib import Path
from wasabi import Printer from wasabi import Printer
import srsly import srsly
import re import re
import sys
from .converters import conllu2json, iob2json, conll_ner2json from ._app import app, Arg, Opt
from .converters import ner_jsonl2json from ..gold import docs_to_json
from ..tokens import DocBin
from ..gold.converters import iob2docs, conll_ner2docs, json2docs
# Converters are matched by file extension except for ner/iob, which are # Converters are matched by file extension except for ner/iob, which are
# matched by file extension and content. To add a converter, add a new # matched by file extension and content. To add a converter, add a new
# entry to this dict with the file extension mapped to the converter function # entry to this dict with the file extension mapped to the converter function
# imported from /converters. # imported from /converters.
CONVERTERS = { CONVERTERS = {
"conllubio": conllu2json, # "conllubio": conllu2docs, TODO
"conllu": conllu2json, # "conllu": conllu2docs, TODO
"conll": conllu2json, # "conll": conllu2docs, TODO
"ner": conll_ner2json, "ner": conll_ner2docs,
"iob": iob2json, "iob": iob2docs,
"jsonl": ner_jsonl2json, "json": json2docs,
} }
# File types
FILE_TYPES = ("json", "jsonl", "msg") # File types that can be written to stdout
FILE_TYPES_STDOUT = ("json", "jsonl") FILE_TYPES_STDOUT = ("json")
def convert( class FileTypes(str, Enum):
json = "json"
spacy = "spacy"
@app.command("convert")
def convert_cli(
# fmt: off # fmt: off
input_file: ("Input file", "positional", None, str), input_path: str = Arg(..., help="Input file or directory", exists=True),
output_dir: ("Output directory. '-' for stdout.", "positional", None, str) = "-", output_dir: Path = Arg("-", help="Output directory. '-' for stdout.", allow_dash=True, exists=True),
file_type: (f"Type of data to produce: {FILE_TYPES}", "option", "t", str, FILE_TYPES) = "json", file_type: FileTypes = Opt("spacy", "--file-type", "-t", help="Type of data to produce"),
n_sents: ("Number of sentences per doc (0 to disable)", "option", "n", int) = 1, n_sents: int = Opt(1, "--n-sents", "-n", help="Number of sentences per doc (0 to disable)"),
seg_sents: ("Segment sentences (for -c ner)", "flag", "s") = False, seg_sents: bool = Opt(False, "--seg-sents", "-s", help="Segment sentences (for -c ner)"),
model: ("Model for sentence segmentation (for -s)", "option", "b", str) = None, model: Optional[str] = Opt(None, "--model", "-b", help="Model for sentence segmentation (for -s)"),
morphology: ("Enable appending morphology to tags", "flag", "m", bool) = False, morphology: bool = Opt(False, "--morphology", "-m", help="Enable appending morphology to tags"),
merge_subtokens: ("Merge CoNLL-U subtokens", "flag", "T", bool) = False, merge_subtokens: bool = Opt(False, "--merge-subtokens", "-T", help="Merge CoNLL-U subtokens"),
converter: (f"Converter: {tuple(CONVERTERS.keys())}", "option", "c", str) = "auto", converter: str = Opt("auto", "--converter", "-c", help=f"Converter: {tuple(CONVERTERS.keys())}"),
ner_map_path: ("NER tag mapping (as JSON-encoded dict of entity types)", "option", "N", Path) = None, ner_map: Optional[Path] = Opt(None, "--ner-map", "-N", help="NER tag mapping (as JSON-encoded dict of entity types)", exists=True),
lang: ("Language (if tokenizer required)", "option", "l", str) = None, lang: Optional[str] = Opt(None, "--lang", "-l", help="Language (if tokenizer required)"),
# fmt: on # fmt: on
): ):
""" """
Convert files into JSON format for use with train command and other Convert files into json or DocBin format for use with train command and other
experiment management functions. If no output_dir is specified, the data experiment management functions. If no output_dir is specified, the data
is written to stdout, so you can pipe them forward to a JSON file: is written to stdout, so you can pipe them forward to a JSON file:
$ spacy convert some_file.conllu > some_file.json $ spacy convert some_file.conllu > some_file.json
""" """
no_print = output_dir == "-" if isinstance(file_type, FileTypes):
msg = Printer(no_print=no_print) # We get an instance of the FileTypes from the CLI so we need its string value
input_path = Path(input_file) file_type = file_type.value
if file_type not in FILE_TYPES_STDOUT and output_dir == "-": input_path = Path(input_path)
# TODO: support msgpack via stdout in srsly? output_dir = "-" if output_dir == Path("-") else output_dir
msg.fail( cli_args = locals()
f"Can't write .{file_type} data to stdout", silent = output_dir == "-"
"Please specify an output directory.", msg = Printer(no_print=silent)
exits=1, verify_cli_args(msg, **cli_args)
converter = _get_converter(msg, converter, input_path)
convert(
input_path,
output_dir,
file_type=file_type,
n_sents=n_sents,
seg_sents=seg_sents,
model=model,
morphology=morphology,
merge_subtokens=merge_subtokens,
converter=converter,
ner_map=ner_map,
lang=lang,
silent=silent,
msg=msg,
) )
if not input_path.exists():
msg.fail("Input file not found", input_path, exits=1)
if output_dir != "-" and not Path(output_dir).exists(): def convert(
msg.fail("Output directory not found", output_dir, exits=1) input_path: Path,
input_data = input_path.open("r", encoding="utf-8").read() output_dir: Path,
if converter == "auto": *,
converter = input_path.suffix[1:] file_type: str = "json",
if converter == "ner" or converter == "iob": n_sents: int = 1,
converter_autodetect = autodetect_ner_format(input_data) seg_sents: bool = False,
if converter_autodetect == "ner": model: Optional[str] = None,
msg.info("Auto-detected token-per-line NER format") morphology: bool = False,
converter = converter_autodetect merge_subtokens: bool = False,
elif converter_autodetect == "iob": converter: str = "auto",
msg.info("Auto-detected sentence-per-line NER format") ner_map: Optional[Path] = None,
converter = converter_autodetect lang: Optional[str] = None,
else: silent: bool = True,
msg.warn( msg: Optional[Path] = None,
"Can't automatically detect NER format. Conversion may not succeed. See https://spacy.io/api/cli#convert" ) -> None:
) if not msg:
if converter not in CONVERTERS: msg = Printer(no_print=silent)
msg.fail(f"Can't find converter for {converter}", exits=1) ner_map = srsly.read_json(ner_map) if ner_map is not None else None
ner_map = None
if ner_map_path is not None: for input_loc in walk_directory(input_path):
ner_map = srsly.read_json(ner_map_path) input_data = input_loc.open("r", encoding="utf-8").read()
# Use converter function to convert data # Use converter function to convert data
func = CONVERTERS[converter] func = CONVERTERS[converter]
data = func( docs = func(
input_data, input_data,
n_sents=n_sents, n_sents=n_sents,
seg_sents=seg_sents, seg_sents=seg_sents,
@ -90,29 +117,41 @@ def convert(
merge_subtokens=merge_subtokens, merge_subtokens=merge_subtokens,
lang=lang, lang=lang,
model=model, model=model,
no_print=no_print, no_print=silent,
ner_map=ner_map, ner_map=ner_map,
) )
if output_dir != "-": if output_dir == "-":
# Export data to a file _print_docs_to_stdout(docs, file_type)
suffix = f".{file_type}"
output_file = Path(output_dir) / Path(input_path.parts[-1]).with_suffix(suffix)
if file_type == "json":
srsly.write_json(output_file, data)
elif file_type == "jsonl":
srsly.write_jsonl(output_file, data)
elif file_type == "msg":
srsly.write_msgpack(output_file, data)
msg.good(f"Generated output file ({len(data)} documents): {output_file}")
else: else:
# Print to stdout if input_loc != input_path:
if file_type == "json": subpath = input_loc.relative_to(input_path)
srsly.write_json("-", data) output_file = Path(output_dir) / subpath.with_suffix(f".{file_type}")
elif file_type == "jsonl": else:
srsly.write_jsonl("-", data) output_file = Path(output_dir) / input_loc.parts[-1]
output_file = output_file.with_suffix(f".{file_type}")
_write_docs_to_file(docs, output_file, file_type)
msg.good(f"Generated output file ({len(docs)} documents): {output_file}")
def autodetect_ner_format(input_data): def _print_docs_to_stdout(docs, output_type):
if output_type == "json":
srsly.write_json("-", docs_to_json(docs))
else:
sys.stdout.buffer.write(DocBin(docs=docs).to_bytes())
def _write_docs_to_file(docs, output_file, output_type):
if not output_file.parent.exists():
output_file.parent.mkdir(parents=True)
if output_type == "json":
srsly.write_json(output_file, docs_to_json(docs))
else:
data = DocBin(docs=docs).to_bytes()
with output_file.open("wb") as file_:
file_.write(data)
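
A hedged sketch of the `DocBin` round-trip behind the new binary output path above (the sentences are made up):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
docs = [nlp.make_doc("Berlin is a city"), nlp.make_doc("She runs")]

data = DocBin(docs=docs).to_bytes()                        # what _write_docs_to_file stores
docs_back = DocBin().from_bytes(data).get_docs(nlp.vocab)  # generator of reconstructed Docs
print([doc.text for doc in docs_back])
```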
def autodetect_ner_format(input_data: str) -> str:
# guess format from the first 20 lines # guess format from the first 20 lines
lines = input_data.split("\n")[:20] lines = input_data.split("\n")[:20]
format_guesses = {"ner": 0, "iob": 0} format_guesses = {"ner": 0, "iob": 0}
@ -129,3 +168,86 @@ def autodetect_ner_format(input_data):
if format_guesses["ner"] == 0 and format_guesses["iob"] > 0: if format_guesses["ner"] == 0 and format_guesses["iob"] > 0:
return "iob" return "iob"
return None return None
def walk_directory(path):
if not path.is_dir():
return [path]
paths = [path]
locs = []
seen = set()
for path in paths:
if str(path) in seen:
continue
seen.add(str(path))
if path.parts[-1].startswith("."):
continue
elif path.is_dir():
paths.extend(path.iterdir())
else:
locs.append(path)
return locs
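
A short usage note for `walk_directory` above: it recursively collects file paths, skips anything whose name starts with a dot, and returns a single-element list when given a plain file. The directory layout in the comment is hypothetical:

```python
from pathlib import Path

# Hypothetical layout: corpus/train.conllu, corpus/dev.conllu, corpus/.cache/...
for loc in walk_directory(Path("corpus")):
    print(loc)  # the two .conllu files; the hidden .cache directory is skipped
```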
def verify_cli_args(
msg,
input_path,
output_dir,
file_type,
n_sents,
seg_sents,
model,
morphology,
merge_subtokens,
converter,
ner_map,
lang,
):
input_path = Path(input_path)
if file_type not in FILE_TYPES_STDOUT and output_dir == "-":
# TODO: support msgpack via stdout in srsly?
msg.fail(
f"Can't write .{file_type} data to stdout",
"Please specify an output directory.",
exits=1,
)
if not input_path.exists():
msg.fail("Input file not found", input_path, exits=1)
if output_dir != "-" and not Path(output_dir).exists():
msg.fail("Output directory not found", output_dir, exits=1)
if input_path.is_dir():
input_locs = walk_directory(input_path)
if len(input_locs) == 0:
msg.fail("No input files in directory", input_path, exits=1)
file_types = list(set([loc.suffix[1:] for loc in input_locs]))
if len(file_types) >= 2:
file_types = ",".join(file_types)
msg.fail("All input files must be same type", file_types, exits=1)
converter = _get_converter(msg, converter, input_path)
if converter not in CONVERTERS:
msg.fail(f"Can't find converter for {converter}", exits=1)
return converter
def _get_converter(msg, converter, input_path):
if input_path.is_dir():
input_path = walk_directory(input_path)[0]
if converter == "auto":
converter = input_path.suffix[1:]
if converter == "ner" or converter == "iob":
with input_path.open() as file_:
input_data = file_.read()
converter_autodetect = autodetect_ner_format(input_data)
if converter_autodetect == "ner":
msg.info("Auto-detected token-per-line NER format")
converter = converter_autodetect
elif converter_autodetect == "iob":
msg.info("Auto-detected sentence-per-line NER format")
converter = converter_autodetect
else:
msg.warn(
"Can't automatically detect NER format. "
"Conversion may not succeed. "
"See https://spacy.io/api/cli#convert"
)
return converter

View File

@ -1,4 +0,0 @@
from .conllu2json import conllu2json # noqa: F401
from .iob2json import iob2json # noqa: F401
from .conll_ner2json import conll_ner2json # noqa: F401
from .jsonl2json import ner_jsonl2json # noqa: F401

View File

@ -1,65 +0,0 @@
from wasabi import Printer
from ...gold import iob_to_biluo
from ...util import minibatch
from .conll_ner2json import n_sents_info
def iob2json(input_data, n_sents=10, no_print=False, *args, **kwargs):
"""
Convert IOB files with one sentence per line and tags separated with '|'
into JSON format for use with train cli. IOB and IOB2 are accepted.
Sample formats:
I|O like|O London|I-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
I|O like|O London|B-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
I|PRP|O like|VBP|O London|NNP|I-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
I|PRP|O like|VBP|O London|NNP|B-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
"""
msg = Printer(no_print=no_print)
docs = read_iob(input_data.split("\n"))
if n_sents > 0:
n_sents_info(msg, n_sents)
docs = merge_sentences(docs, n_sents)
return docs
def read_iob(raw_sents):
sentences = []
for line in raw_sents:
if not line.strip():
continue
tokens = [t.split("|") for t in line.split()]
if len(tokens[0]) == 3:
words, pos, iob = zip(*tokens)
elif len(tokens[0]) == 2:
words, iob = zip(*tokens)
pos = ["-"] * len(words)
else:
raise ValueError(
"The sentence-per-line IOB/IOB2 file is not formatted correctly. Try checking whitespace and delimiters. See https://spacy.io/api/cli#convert"
)
biluo = iob_to_biluo(iob)
sentences.append(
[
{"orth": w, "tag": p, "ner": ent}
for (w, p, ent) in zip(words, pos, biluo)
]
)
sentences = [{"tokens": sent} for sent in sentences]
paragraphs = [{"sentences": [sent]} for sent in sentences]
docs = [{"id": i, "paragraphs": [para]} for i, para in enumerate(paragraphs)]
return docs
def merge_sentences(docs, n_sents):
merged = []
for group in minibatch(docs, size=n_sents):
group = list(group)
first = group.pop(0)
to_extend = first["paragraphs"][0]["sentences"]
for sent in group:
to_extend.extend(sent["paragraphs"][0]["sentences"])
merged.append(first)
return merged

View File

@ -1,50 +0,0 @@
import srsly
from ...gold import docs_to_json
from ...util import get_lang_class, minibatch
def ner_jsonl2json(input_data, lang=None, n_sents=10, use_morphology=False, **_):
if lang is None:
raise ValueError("No --lang specified, but tokenization required")
json_docs = []
input_examples = [srsly.json_loads(line) for line in input_data.strip().split("\n")]
nlp = get_lang_class(lang)()
sentencizer = nlp.create_pipe("sentencizer")
for i, batch in enumerate(minibatch(input_examples, size=n_sents)):
docs = []
for record in batch:
raw_text = record["text"]
if "entities" in record:
ents = record["entities"]
else:
ents = record["spans"]
ents = [(e["start"], e["end"], e["label"]) for e in ents]
doc = nlp.make_doc(raw_text)
sentencizer(doc)
spans = [doc.char_span(s, e, label=L) for s, e, L in ents]
doc.ents = _cleanup_spans(spans)
docs.append(doc)
json_docs.append(docs_to_json(docs, id=i))
return json_docs
def _cleanup_spans(spans):
output = []
seen = set()
for span in spans:
if span is not None:
# Trim whitespace
while len(span) and span[0].is_space:
span = span[1:]
while len(span) and span[-1].is_space:
span = span[:-1]
if not len(span):
continue
for i in range(span.start, span.end):
if i in seen:
break
else:
output.append(span)
seen.update(range(span.start, span.end))
return output

View File

@ -1,11 +1,14 @@
from typing import Optional, List, Sequence, Dict, Any, Tuple
from pathlib import Path from pathlib import Path
from collections import Counter from collections import Counter
import sys import sys
import srsly import srsly
from wasabi import Printer, MESSAGES from wasabi import Printer, MESSAGES
from ..gold import GoldCorpus from ._app import app, Arg, Opt
from ..gold import Corpus, Example
from ..syntax import nonproj from ..syntax import nonproj
from ..language import Language
from ..util import load_model, get_lang_class from ..util import load_model, get_lang_class
@ -18,17 +21,18 @@ BLANK_MODEL_MIN_THRESHOLD = 100
BLANK_MODEL_THRESHOLD = 2000 BLANK_MODEL_THRESHOLD = 2000
def debug_data( @app.command("debug-data")
def debug_data_cli(
# fmt: off # fmt: off
lang: ("Model language", "positional", None, str), lang: str = Arg(..., help="Model language"),
train_path: ("Location of JSON-formatted training data", "positional", None, Path), train_path: Path = Arg(..., help="Location of JSON-formatted training data", exists=True),
dev_path: ("Location of JSON-formatted development data", "positional", None, Path), dev_path: Path = Arg(..., help="Location of JSON-formatted development data", exists=True),
tag_map_path: ("Location of JSON-formatted tag map", "option", "tm", Path) = None, tag_map_path: Optional[Path] = Opt(None, "--tag-map-path", "-tm", help="Location of JSON-formatted tag map", exists=True, dir_okay=False),
base_model: ("Name of model to update (optional)", "option", "b", str) = None, base_model: Optional[str] = Opt(None, "--base-model", "-b", help="Name of model to update (optional)"),
pipeline: ("Comma-separated names of pipeline components to train", "option", "p", str) = "tagger,parser,ner", pipeline: str = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of pipeline components to train"),
ignore_warnings: ("Ignore warnings, only show stats and errors", "flag", "IW", bool) = False, ignore_warnings: bool = Opt(False, "--ignore-warnings", "-IW", help="Ignore warnings, only show stats and errors"),
verbose: ("Print additional information and explanations", "flag", "V", bool) = False, verbose: bool = Opt(False, "--verbose", "-V", help="Print additional information and explanations"),
no_format: ("Don't pretty-print the results", "flag", "NF", bool) = False, no_format: bool = Opt(False, "--no-format", "-NF", help="Don't pretty-print the results"),
# fmt: on # fmt: on
): ):
""" """
@ -36,8 +40,36 @@ def debug_data(
stats, and find problems like invalid entity annotations, cyclic stats, and find problems like invalid entity annotations, cyclic
dependencies, low data labels and more. dependencies, low data labels and more.
""" """
msg = Printer(pretty=not no_format, ignore_warnings=ignore_warnings) debug_data(
lang,
train_path,
dev_path,
tag_map_path=tag_map_path,
base_model=base_model,
pipeline=[p.strip() for p in pipeline.split(",")],
ignore_warnings=ignore_warnings,
verbose=verbose,
no_format=no_format,
silent=False,
)
def debug_data(
lang: str,
train_path: Path,
dev_path: Path,
*,
tag_map_path: Optional[Path] = None,
base_model: Optional[str] = None,
pipeline: List[str] = ["tagger", "parser", "ner"],
ignore_warnings: bool = False,
verbose: bool = False,
no_format: bool = True,
silent: bool = True,
):
msg = Printer(
no_print=silent, pretty=not no_format, ignore_warnings=ignore_warnings
)
# Make sure all files and paths exists if they are needed
if not train_path.exists():
msg.fail("Training data not found", train_path, exits=1)
@@ -49,7 +81,6 @@ def debug_data(
tag_map = srsly.read_json(tag_map_path)
# Initialize the model and pipeline
if base_model:
nlp = load_model(base_model)
else:
@@ -68,12 +99,9 @@ def debug_data(
loading_train_error_message = ""
loading_dev_error_message = ""
with msg.loading("Loading corpus..."):
corpus = Corpus(train_path, dev_path)
try:
train_dataset = list(corpus.train_dataset(nlp))
except ValueError as e:
loading_train_error_message = f"Training data cannot be loaded: {e}"
try:
@@ -89,11 +117,9 @@ def debug_data(
msg.good("Corpus is loadable")
# Create all gold data here to avoid iterating over the train_dataset constantly
gold_train_data = _compile_gold(train_dataset, pipeline, nlp, make_proj=True)
gold_train_unpreprocessed_data = _compile_gold(train_dataset, pipeline, nlp, make_proj=False)
gold_dev_data = _compile_gold(dev_dataset, pipeline, nlp, make_proj=True)
train_texts = gold_train_data["texts"]
dev_texts = gold_dev_data["texts"]
@@ -446,7 +472,7 @@ def debug_data(
sys.exit(1)
def _load_file(file_path: Path, msg: Printer) -> None:
file_name = file_path.parts[-1]
if file_path.suffix == ".json":
with msg.loading(f"Loading {file_name}..."):
@@ -465,7 +491,9 @@ def _load_file(file_path, msg):
)
def _compile_gold(
examples: Sequence[Example], pipeline: List[str], nlp: Language, make_proj: bool
) -> Dict[str, Any]:
data = {
"ner": Counter(),
"cats": Counter(),
@@ -484,20 +512,20 @@ def _compile_gold(examples, pipeline, nlp):
"n_cats_multilabel": 0,
"texts": set(),
}
for eg in examples:
gold = eg.reference
doc = eg.predicted
valid_words = [x for x in gold if x is not None]
data["words"].update(valid_words)
data["n_words"] += len(valid_words)
data["n_misaligned_words"] += len(gold) - len(valid_words)
data["texts"].add(doc.text)
if len(nlp.vocab.vectors):
for word in valid_words:
if nlp.vocab.strings[word] not in nlp.vocab.vectors:
data["words_missing_vectors"].update([word])
if "ner" in pipeline:
for i, label in enumerate(eg.get_aligned_ner()):
if label is None:
continue
if label.startswith(("B-", "U-", "L-")) and doc[i].is_space:
@@ -523,32 +551,34 @@ def _compile_gold(examples, pipeline, nlp):
if list(gold.cats.values()).count(1.0) != 1:
data["n_cats_multilabel"] += 1
if "tagger" in pipeline:
tags = eg.get_aligned("TAG", as_string=True)
data["tags"].update([x for x in tags if x is not None])
if "parser" in pipeline:
aligned_heads, aligned_deps = eg.get_aligned_parse(projectivize=make_proj)
data["deps"].update([x for x in aligned_deps if x is not None])
for i, (dep, head) in enumerate(zip(aligned_deps, aligned_heads)):
if head == i:
data["roots"].update([dep])
data["n_sents"] += 1
if nonproj.is_nonproj_tree(aligned_heads):
data["n_nonproj"] += 1
if nonproj.contains_cycle(aligned_heads):
data["n_cycles"] += 1
return data
def _format_labels(labels: List[Tuple[str, int]], counts: bool = False) -> str:
if counts:
return ", ".join([f"'{l}' ({c})" for l, c in labels])
return ", ".join([f"'{l}'" for l in labels])
def _get_examples_without_label(data: Sequence[Example], label: str) -> int:
count = 0
for eg in data:
labels = [
label.split("-")[1]
for label in eg.get_aligned_ner()
if label not in ("O", "-", None)
]
if label not in labels:
@@ -556,7 +586,7 @@ def _get_examples_without_label(data, label):
return count
def _get_labels_from_model(nlp: Language, pipe_name: str) -> Sequence[str]:
if pipe_name not in nlp.pipe_names:
return set()
pipe = nlp.get_pipe(pipe_name)
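Since debug_data is now a plain function behind the Typer command, it can be driven from Python as well as from the CLI. A minimal sketch, assuming the function lives in spacy.cli.debug_data and using hypothetical data paths:

from pathlib import Path
from spacy.cli.debug_data import debug_data  # import path assumed

# Hypothetical training/dev files; silent=False makes the wasabi Printer show the report
debug_data(
    "en",
    Path("train.json"),
    Path("dev.json"),
    pipeline=["tagger", "parser", "ner"],
    silent=False,
)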
View File
@@ -1,23 +1,36 @@
from typing import Optional, Sequence, Union
import requests
import sys
from wasabi import msg
import typer
from ._app import app, Arg, Opt
from .. import about
from ..util import is_package, get_base_version, run_command
@app.command(
"download",
context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
)
def download_cli(
# fmt: off
ctx: typer.Context,
model: str = Arg(..., help="Model to download (shortcut or name)"),
direct: bool = Opt(False, "--direct", "-d", "-D", help="Force direct download of name + version"),
# fmt: on
):
"""
Download compatible model from default download path using pip. If --direct
flag is set, the command expects the full model name with version.
For direct downloads, the compatibility check will be skipped. All
additional arguments provided to this command will be passed to `pip install`
on model installation.
"""
download(model, direct, *ctx.args)
def download(model: str, direct: bool = False, *pip_args) -> None:
if not is_package("spacy") and "--no-deps" not in pip_args:
msg.warn(
"Skipping model package dependencies and setting `--no-deps`. "
@@ -33,22 +46,20 @@ def download(
components = model.split("-")
model_name = "".join(components[:-1])
version = components[-1]
download_model(dl_tpl.format(m=model_name, v=version), pip_args)
else:
shortcuts = get_json(about.__shortcuts__, "available shortcuts")
model_name = shortcuts.get(model, model)
compatibility = get_compatibility()
version = get_version(model_name, compatibility)
download_model(dl_tpl.format(m=model_name, v=version), pip_args)
msg.good(
"Download and installation successful",
f"You can now load the model via spacy.load('{model_name}')",
)
def get_json(url: str, desc: str) -> Union[dict, list]:
r = requests.get(url)
if r.status_code != 200:
msg.fail(
@@ -62,7 +73,7 @@ def get_json(url, desc):
return r.json()
def get_compatibility() -> dict:
version = get_base_version(about.__version__)
comp_table = get_json(about.__compatibility__, "compatibility table")
comp = comp_table["spacy"]
@@ -71,7 +82,7 @@ def get_compatibility():
return comp[version]
def get_version(model: str, comp: dict) -> str:
model = get_base_version(model)
if model not in comp:
msg.fail(
@@ -81,10 +92,12 @@ def get_version(model, comp):
return comp[model][0]
def download_model(
filename: str, user_pip_args: Optional[Sequence[str]] = None
) -> None:
download_url = about.__download_url__ + "/" + filename
pip_args = ["--no-cache-dir"]
if user_pip_args:
pip_args.extend(user_pip_args)
cmd = [sys.executable, "-m", "pip", "install"] + pip_args + [download_url]
run_command(cmd)
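With the Typer context allowing extra args, anything after the model name on the command line is forwarded to pip install; programmatically the same goes through *pip_args. A minimal sketch, assuming the function is importable from spacy.cli.download and using a hypothetical package name:

from spacy.cli.download import download  # import path assumed

# direct=False is passed positionally so the remaining strings become *pip_args for pip install
download("en_core_web_sm", False, "--no-deps", "--quiet")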
View File
@@ -1,46 +1,75 @@
from typing import Optional, List, Dict
from timeit import default_timer as timer
from wasabi import Printer
from pathlib import Path
import re
import srsly
from ..gold import Corpus
from ..tokens import Doc
from ._app import app, Arg, Opt
from ..scorer import Scorer
from .. import util
from .. import displacy
@app.command("evaluate")
def evaluate_cli(
# fmt: off
model: str = Arg(..., help="Model name or path"),
data_path: Path = Arg(..., help="Location of JSON-formatted evaluation data", exists=True),
output: Optional[Path] = Opt(None, "--output", "-o", help="Output JSON file for metrics", dir_okay=False),
gpu_id: int = Opt(-1, "--gpu-id", "-g", help="Use GPU"),
gold_preproc: bool = Opt(False, "--gold-preproc", "-G", help="Use gold preprocessing"),
displacy_path: Optional[Path] = Opt(None, "--displacy-path", "-dp", help="Directory to output rendered parses as HTML", exists=True, file_okay=False),
displacy_limit: int = Opt(25, "--displacy-limit", "-dl", help="Limit of parses to render as HTML"),
# fmt: on
):
"""
Evaluate a model. To render a sample of parses in an HTML file, set an
output directory as the displacy_path argument.
"""
evaluate(
model,
data_path,
output=output,
gpu_id=gpu_id,
gold_preproc=gold_preproc,
displacy_path=displacy_path,
displacy_limit=displacy_limit,
silent=False,
)
def evaluate(
model: str,
data_path: Path,
output: Optional[Path],
gpu_id: int = -1,
gold_preproc: bool = False,
displacy_path: Optional[Path] = None,
displacy_limit: int = 25,
silent: bool = True,
) -> Scorer:
msg = Printer(no_print=silent, pretty=not silent)
util.fix_random_seed()
if gpu_id >= 0:
util.use_gpu(gpu_id)
util.set_env_log(False)
data_path = util.ensure_path(data_path)
output_path = util.ensure_path(output)
displacy_path = util.ensure_path(displacy_path)
if not data_path.exists():
msg.fail("Evaluation data not found", data_path, exits=1)
if displacy_path and not displacy_path.exists():
msg.fail("Visualization output directory not found", displacy_path, exits=1)
corpus = Corpus(data_path, data_path)
if model.startswith("blank:"):
nlp = util.get_lang_class(model.replace("blank:", ""))()
else:
nlp = util.load_model(model)
dev_dataset = list(corpus.dev_dataset(nlp, gold_preproc=gold_preproc))
begin = timer()
scorer = nlp.evaluate(dev_dataset, verbose=False)
end = timer()
nwords = sum(len(ex.predicted) for ex in dev_dataset)
results = {
"Time": f"{end - begin:.2f} s",
"Words": nwords,
@@ -60,10 +89,22 @@ def evaluate(
"Sent R": f"{scorer.sent_r:.2f}",
"Sent F": f"{scorer.sent_f:.2f}",
}
data = {re.sub(r"[\s/]", "_", k.lower()): v for k, v in results.items()}
msg.table(results, title="Results")
if scorer.ents_per_type:
data["ents_per_type"] = scorer.ents_per_type
print_ents_per_type(msg, scorer.ents_per_type)
if scorer.textcats_f_per_cat:
data["textcats_f_per_cat"] = scorer.textcats_f_per_cat
print_textcats_f_per_cat(msg, scorer.textcats_f_per_cat)
if scorer.textcats_auc_per_cat:
data["textcats_auc_per_cat"] = scorer.textcats_auc_per_cat
print_textcats_auc_per_cat(msg, scorer.textcats_auc_per_cat)
if displacy_path:
docs = [ex.predicted for ex in dev_dataset]
render_deps = "parser" in nlp.meta.get("pipeline", [])
render_ents = "ner" in nlp.meta.get("pipeline", [])
render_parses(
@@ -75,11 +116,21 @@ def evaluate(
ents=render_ents,
)
msg.good(f"Generated {displacy_limit} parses as HTML", displacy_path)
if output_path is not None:
srsly.write_json(output_path, data)
msg.good(f"Saved results to {output_path}")
return data
def render_parses(
docs: List[Doc],
output_path: Path,
model_name: str = "",
limit: int = 250,
deps: bool = True,
ents: bool = True,
):
docs[0].user_data["title"] = model_name
if ents:
html = displacy.render(docs[:limit], style="ent", page=True)
@@ -91,3 +142,40 @@ def render_parses(docs, output_path, model_name="", limit=250, deps=True, ents=T
)
with (output_path / "parses.html").open("w", encoding="utf8") as file_:
file_.write(html)
def print_ents_per_type(msg: Printer, scores: Dict[str, Dict[str, float]]) -> None:
data = [
(k, f"{v['p']:.2f}", f"{v['r']:.2f}", f"{v['f']:.2f}")
for k, v in scores.items()
]
msg.table(
data,
header=("", "P", "R", "F"),
aligns=("l", "r", "r", "r"),
title="NER (per type)",
)
def print_textcats_f_per_cat(msg: Printer, scores: Dict[str, Dict[str, float]]) -> None:
data = [
(k, f"{v['p']:.2f}", f"{v['r']:.2f}", f"{v['f']:.2f}")
for k, v in scores.items()
]
msg.table(
data,
header=("", "P", "R", "F"),
aligns=("l", "r", "r", "r"),
title="Textcat F (per type)",
)
def print_textcats_auc_per_cat(
msg: Printer, scores: Dict[str, Dict[str, float]]
) -> None:
msg.table(
[(k, f"{v['roc_auc_score']:.2f}") for k, v in scores.items()],
header=("", "ROC AUC"),
aligns=("l", "r"),
title="Textcat ROC AUC (per label)",
)
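Instead of the old return_scores flag, evaluate now always collects a dict of normalized result keys and can also write it to JSON via --output. A minimal sketch of programmatic use, with a hypothetical model and data paths (import path assumed):

from pathlib import Path
from spacy.cli.evaluate import evaluate  # import path assumed

# Keys are the table rows lowercased, with spaces/slashes replaced by underscores
scores = evaluate("en_core_web_sm", Path("dev.json"), output=Path("metrics.json"))
print(scores["words"], scores.get("sent_f"))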
View File
@@ -1,24 +1,80 @@
from typing import Optional, Dict, Any, Union
import platform
from pathlib import Path
from wasabi import Printer
import srsly
from ._app import app, Arg, Opt
from .. import util
from .. import about
@app.command("info")
def info_cli(
# fmt: off
model: Optional[str] = Arg(None, help="Optional model name"),
markdown: bool = Opt(False, "--markdown", "-md", help="Generate Markdown for GitHub issues"),
silent: bool = Opt(False, "--silent", "-s", "-S", help="Don't print anything (just return)"),
# fmt: on
):
"""
Print info about spaCy installation. If a model is specified as an argument,
print model information. Flag --markdown prints details in Markdown for easy
copy-pasting to GitHub issues.
"""
info(model, markdown=markdown, silent=silent)
def info(
model: Optional[str] = None, *, markdown: bool = False, silent: bool = True
) -> Union[str, dict]:
msg = Printer(no_print=silent, pretty=not silent)
if model: if model:
title = f"Info about model '{model}'"
data = info_model(model, silent=silent)
else:
title = "Info about spaCy"
data = info_spacy()
raw_data = {k.lower().replace(" ", "_"): v for k, v in data.items()}
if "Models" in data and isinstance(data["Models"], dict):
data["Models"] = ", ".join(f"{n} ({v})" for n, v in data["Models"].items())
markdown_data = get_markdown(data, title=title)
if markdown:
if not silent:
print(markdown_data)
return markdown_data
if not silent:
table_data = dict(data)
msg.table(table_data, title=title)
return raw_data
def info_spacy() -> Dict[str, Any]:
"""Generate info about the current spaCy installation.
RETURNS (dict): The spaCy info.
"""
all_models = {}
for pkg_name in util.get_installed_models():
package = pkg_name.replace("-", "_")
all_models[package] = util.get_package_version(pkg_name)
return {
"spaCy version": about.__version__,
"Location": str(Path(__file__).parent.parent),
"Platform": platform.platform(),
"Python version": platform.python_version(),
"Models": all_models,
}
def info_model(model: str, *, silent: bool = True) -> Dict[str, Any]:
"""Generate info about a specific model.
model (str): Model name or path.
silent (bool): Don't print anything, just return.
RETURNS (dict): The model meta.
"""
msg = Printer(no_print=silent, pretty=not silent)
if util.is_package(model):
model_path = util.get_package_path(model)
else:
@@ -32,46 +88,22 @@ def info(
meta["source"] = str(model_path.resolve())
else:
meta["source"] = str(model_path)
return {k: v for k, v in meta.items() if k not in ("accuracy", "speed")}
def get_markdown(data: Dict[str, Any], title: Optional[str] = None) -> str:
"""Get data in GitHub-flavoured Markdown format for issues etc.
data (dict or list of tuples): Label/value pairs.
title (str / None): Title, will be rendered as headline 2.
RETURNS (str): The Markdown string.
"""
markdown = []
for key, value in data.items():
if isinstance(value, str) and Path(value).exists():
continue
markdown.append(f"* **{key}:** {value}")
result = "\n{}\n".format("\n".join(markdown))
if title:
result = f"\n## {title}\n{result}"
return result
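info is split the same way: info_cli prints, while info returns either the raw dict (keys lowercased with underscores) or, with markdown=True, the Markdown string. A minimal sketch (import path assumed):

from spacy.cli.info import info  # import path assumed

data = info(silent=True)                       # e.g. data["spacy_version"], data["models"]
issue_text = info(markdown=True, silent=True)  # Markdown block for pasting into GitHub issues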
View File
@@ -1,3 +1,4 @@
from typing import Optional, List, Dict, Any, Union, IO
import math
from tqdm import tqdm
import numpy
@@ -9,10 +10,12 @@ import gzip
import zipfile
import srsly
import warnings
from wasabi import Printer
from ._app import app, Arg, Opt
from ..vectors import Vectors
from ..errors import Errors, Warnings
from ..language import Language
from ..util import ensure_path, get_lang_class, load_model, OOV_RANK
from ..lookups import Lookups
@@ -25,20 +28,21 @@ except ImportError:
DEFAULT_OOV_PROB = -20
@app.command("init-model")
def init_model_cli(
# fmt: off
lang: str = Arg(..., help="Model language"),
output_dir: Path = Arg(..., help="Model output directory"),
freqs_loc: Optional[Path] = Arg(None, help="Location of words frequencies file", exists=True),
clusters_loc: Optional[Path] = Opt(None, "--clusters-loc", "-c", help="Optional location of brown clusters data", exists=True),
jsonl_loc: Optional[Path] = Opt(None, "--jsonl-loc", "-j", help="Location of JSONL-formatted attributes file", exists=True),
vectors_loc: Optional[Path] = Opt(None, "--vectors-loc", "-v", help="Optional vectors file in Word2Vec format", exists=True),
prune_vectors: int = Opt(-1 , "--prune-vectors", "-V", help="Optional number of vectors to prune to"),
truncate_vectors: int = Opt(0, "--truncate-vectors", "-t", help="Optional number of vectors to truncate to when reading in vectors file"),
vectors_name: Optional[str] = Opt(None, "--vectors-name", "-vn", help="Optional name for the word vectors, e.g. en_core_web_lg.vectors"),
model_name: Optional[str] = Opt(None, "--model-name", "-mn", help="Optional name for the model meta"),
omit_extra_lookups: bool = Opt(False, "--omit-extra-lookups", "-OEL", help="Don't include extra lookups in model"),
base_model: Optional[str] = Opt(None, "--base-model", "-b", help="Base model (for languages with custom tokenizers)")
# fmt: on
):
"""
@@ -46,6 +50,38 @@ def init_model(
and word vectors. If vectors are provided in Word2Vec format, they can
be either a .txt or zipped as a .zip or .tar.gz.
"""
init_model(
lang,
output_dir,
freqs_loc=freqs_loc,
clusters_loc=clusters_loc,
jsonl_loc=jsonl_loc,
prune_vectors=prune_vectors,
truncate_vectors=truncate_vectors,
vectors_name=vectors_name,
model_name=model_name,
omit_extra_lookups=omit_extra_lookups,
base_model=base_model,
silent=False,
)
def init_model(
lang: str,
output_dir: Path,
freqs_loc: Optional[Path] = None,
clusters_loc: Optional[Path] = None,
jsonl_loc: Optional[Path] = None,
vectors_loc: Optional[Path] = None,
prune_vectors: int = -1,
truncate_vectors: int = 0,
vectors_name: Optional[str] = None,
model_name: Optional[str] = None,
omit_extra_lookups: bool = False,
base_model: Optional[str] = None,
silent: bool = True,
) -> Language:
msg = Printer(no_print=silent, pretty=not silent)
if jsonl_loc is not None:
if freqs_loc is not None or clusters_loc is not None:
settings = ["-j"]
@@ -68,7 +104,7 @@ def init_model(
freqs_loc = ensure_path(freqs_loc)
if freqs_loc is not None and not freqs_loc.exists():
msg.fail("Can't find words frequencies file", freqs_loc, exits=1)
lex_attrs = read_attrs_from_deprecated(msg, freqs_loc, clusters_loc)
with msg.loading("Creating model..."):
nlp = create_model(lang, lex_attrs, name=model_name, base_model=base_model)
@@ -83,7 +119,9 @@ def init_model(
msg.good("Successfully created model")
if vectors_loc is not None:
add_vectors(
msg, nlp, vectors_loc, truncate_vectors, prune_vectors, vectors_name
)
vec_added = len(nlp.vocab.vectors)
lex_added = len(nlp.vocab)
msg.good(
@@ -95,7 +133,7 @@ def init_model(
return nlp
def open_file(loc: Union[str, Path]) -> IO:
"""Handle .gz, .tar.gz or unzipped files"""
loc = ensure_path(loc)
if tarfile.is_tarfile(str(loc)):
@@ -111,7 +149,9 @@ def open_file(loc):
return loc.open("r", encoding="utf8")
def read_attrs_from_deprecated(
msg: Printer, freqs_loc: Optional[Path], clusters_loc: Optional[Path]
) -> List[Dict[str, Any]]:
if freqs_loc is not None:
with msg.loading("Counting frequencies..."):
probs, _ = read_freqs(freqs_loc)
@@ -139,7 +179,12 @@ def read_attrs_from_deprecated(freqs_loc, clusters_loc):
return lex_attrs
def create_model(
lang: str,
lex_attrs: List[Dict[str, Any]],
name: Optional[str] = None,
base_model: Optional[Union[str, Path]] = None,
) -> Language:
if base_model:
nlp = load_model(base_model)
# keep the tokenizer but remove any existing pipeline components due to
@@ -166,7 +211,14 @@ def create_model(lang, lex_attrs, name=None, base_model=None):
return nlp
def add_vectors(
msg: Printer,
nlp: Language,
vectors_loc: Optional[Path],
truncate_vectors: int,
prune_vectors: int,
name: Optional[str] = None,
) -> None:
vectors_loc = ensure_path(vectors_loc)
if vectors_loc and vectors_loc.parts[-1].endswith(".npz"):
nlp.vocab.vectors = Vectors(data=numpy.load(vectors_loc.open("rb")))
@@ -176,7 +228,7 @@ def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None):
else:
if vectors_loc:
with msg.loading(f"Reading vectors from {vectors_loc}"):
vectors_data, vector_keys = read_vectors(msg, vectors_loc)
msg.good(f"Loaded vectors from {vectors_loc}")
else:
vectors_data, vector_keys = (None, None)
@@ -195,7 +247,7 @@ def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None):
nlp.vocab.prune_vectors(prune_vectors)
def read_vectors(msg: Printer, vectors_loc: Path, truncate_vectors: int = 0):
f = open_file(vectors_loc)
shape = tuple(int(size) for size in next(f).split())
if truncate_vectors >= 1:
@@ -215,7 +267,9 @@ def read_vectors(vectors_loc, truncate_vectors=0):
return vectors_data, vectors_keys
def read_freqs(
freqs_loc: Path, max_length: int = 100, min_doc_freq: int = 5, min_freq: int = 50
):
counts = PreshCounter()
total = 0
with freqs_loc.open() as f:
@@ -244,7 +298,7 @@ def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50):
return probs, oov_prob
def read_clusters(clusters_loc: Path) -> dict:
clusters = {}
if ftfy is None:
warnings.warn(Warnings.W004)
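init_model likewise returns the constructed Language object when called from Python, so the CLI wrapper only handles argument parsing and printing. A minimal sketch with hypothetical paths (import path assumed):

from pathlib import Path
from spacy.cli.init_model import init_model  # import path assumed

# Hypothetical frequencies file; silent=False shows the wasabi progress messages
nlp = init_model("en", Path("models/en_base"), freqs_loc=Path("freqs.txt"), silent=False)
print(nlp.lang, len(nlp.vocab))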
View File
@@ -1,19 +1,25 @@
from typing import Optional, Union, Any, Dict
import shutil
from pathlib import Path
from wasabi import Printer, get_raw_input
import srsly
import sys
from ._app import app, Arg, Opt
from ..schemas import validate, ModelMetaSchema
from .. import util
from .. import about
@app.command("package")
def package_cli(
# fmt: off
input_dir: Path = Arg(..., help="Directory with model data", exists=True, file_okay=False),
output_dir: Path = Arg(..., help="Output parent directory", exists=True, file_okay=False),
meta_path: Optional[Path] = Opt(None, "--meta-path", "--meta", "-m", help="Path to meta.json", exists=True, dir_okay=False),
create_meta: bool = Opt(False, "--create-meta", "-c", "-C", help="Create meta.json, even if one exists"),
version: Optional[str] = Opt(None, "--version", "-v", help="Package version to override meta"),
force: bool = Opt(False, "--force", "-f", "-F", help="Force overwriting existing model in output directory"),
# fmt: on
):
"""
@@ -23,6 +29,27 @@ def package(
set and a meta.json already exists in the output directory, the existing
values will be used as the defaults in the command-line prompt.
"""
package(
input_dir,
output_dir,
meta_path=meta_path,
version=version,
create_meta=create_meta,
force=force,
silent=False,
)
def package(
input_dir: Path,
output_dir: Path,
meta_path: Optional[Path] = None,
version: Optional[str] = None,
create_meta: bool = False,
force: bool = False,
silent: bool = True,
) -> None:
msg = Printer(no_print=silent, pretty=not silent)
input_path = util.ensure_path(input_dir)
output_path = util.ensure_path(output_dir)
meta_path = util.ensure_path(meta_path)
@@ -33,23 +60,23 @@ def package(
if meta_path and not meta_path.exists():
msg.fail("Can't find model meta.json", meta_path, exits=1)
meta_path = meta_path or input_dir / "meta.json"
if not meta_path.exists() or not meta_path.is_file():
msg.fail("Can't load model meta.json", meta_path, exits=1)
meta = srsly.read_json(meta_path)
meta = get_meta(input_dir, meta)
if version is not None:
meta["version"] = version
if not create_meta:  # only print if user doesn't want to overwrite
msg.good("Loaded meta.json from file", meta_path)
else:
meta = generate_meta(meta, msg)
errors = validate(ModelMetaSchema, meta)
if errors:
msg.fail("Invalid model meta.json", "\n".join(errors), exits=1)
model_name = meta["lang"] + "_" + meta["name"]
model_name_v = model_name + "-" + meta["version"]
main_path = output_dir / model_name_v
package_path = main_path / model_name
if package_path.exists():
@@ -63,32 +90,37 @@ def package(
exits=1,
)
Path.mkdir(package_path, parents=True)
shutil.copytree(str(input_dir), str(package_path / model_name_v))
create_file(main_path / "meta.json", srsly.json_dumps(meta, indent=2))
create_file(main_path / "setup.py", TEMPLATE_SETUP)
create_file(main_path / "MANIFEST.in", TEMPLATE_MANIFEST)
create_file(package_path / "__init__.py", TEMPLATE_INIT)
msg.good(f"Successfully created package '{model_name_v}'", main_path)
with util.working_dir(main_path):
util.run_command([sys.executable, "setup.py", "sdist"])
zip_file = main_path / "dist" / f"{model_name_v}.tar.gz"
msg.good(f"Successfully created zipped Python package", zip_file)
def create_file(file_path: Path, contents: str) -> None:
file_path.touch()
file_path.open("w", encoding="utf-8").write(contents)
def get_meta(
model_path: Union[str, Path], existing_meta: Dict[str, Any]
) -> Dict[str, Any]:
meta = {
"lang": "en",
"name": "model",
"version": "0.0.0",
"description": None,
"author": None,
"email": None,
"url": None,
"license": "MIT",
}
meta.update(existing_meta)
nlp = util.load_model_from_path(Path(model_path))
meta["spacy_version"] = util.get_model_version_range(about.__version__)
meta["pipeline"] = nlp.pipe_names
@@ -98,6 +130,23 @@ def generate_meta(model_path, existing_meta, msg):
"keys": nlp.vocab.vectors.n_keys,
"name": nlp.vocab.vectors.name,
}
if about.__title__ != "spacy":
meta["parent_package"] = about.__title__
return meta
def generate_meta(existing_meta: Dict[str, Any], msg: Printer) -> Dict[str, Any]:
meta = existing_meta or {}
settings = [
("lang", "Model language", meta.get("lang", "en")),
("name", "Model name", meta.get("name", "model")),
("version", "Model version", meta.get("version", "0.0.0")),
("description", "Model description", meta.get("description", None)),
("author", "Author", meta.get("author", None)),
("email", "Author email", meta.get("email", None)),
("url", "Author website", meta.get("url", None)),
("license", "License", meta.get("license", "MIT")),
]
msg.divider("Generating meta.json")
msg.text(
"Enter the package settings for your model. The following information "
@@ -106,8 +155,6 @@ def generate_meta(model_path, existing_meta, msg):
for setting, desc, default in settings:
response = get_raw_input(desc, default)
meta[setting] = default if response == "" and default else response
return meta
@@ -158,12 +205,12 @@ def setup_package():
setup(
name=model_name,
description=meta.get('description'),
author=meta.get('author'),
author_email=meta.get('email'),
url=meta.get('url'),
version=meta['version'],
license=meta.get('license'),
packages=[model_name],
package_data={model_name: list_files(model_dir)},
install_requires=list_requirements(meta),
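package now validates the meta against ModelMetaSchema, accepts a version override, and builds the sdist itself instead of telling the user to run setup.py. A minimal sketch of the programmatic equivalent, with hypothetical directories (import path assumed):

from pathlib import Path
from spacy.cli.package import package  # import path assumed

# Overrides the version from meta.json; the built .tar.gz ends up in the package's dist/ folder
package(Path("training/model-best"), Path("packages"), version="0.0.1", force=True, silent=False)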
View File
@@ -1,14 +1,15 @@
from typing import Optional
import random
import numpy
import time
import re
from collections import Counter
from pathlib import Path
from thinc.api import Linear, Maxout, chain, list2array, use_pytorch_for_gpu_memory
from wasabi import msg
import srsly
from ._app import app, Arg, Opt
from ..errors import Errors
from ..ml.models.multi_task import build_masked_language_model
from ..tokens import Doc
@@ -17,25 +18,17 @@ from .. import util
from ..gold import Example
@app.command("pretrain")
def pretrain_cli(
# fmt: off
texts_loc: Path = Arg(..., help="Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the key 'tokens'", exists=True),
vectors_model: str = Arg(..., help="Name or path to spaCy model with vectors to learn from"),
output_dir: Path = Arg(..., help="Directory to write models to on each epoch"),
config_path: Path = Arg(..., help="Path to config file", exists=True, dir_okay=False),
use_gpu: int = Opt(-1, "--use-gpu", "-g", help="Use GPU"),
resume_path: Optional[Path] = Opt(None, "--resume-path", "-r", help="Path to pretrained weights from which to resume pretraining"),
epoch_resume: Optional[int] = Opt(None, "--epoch-resume", "-er", help="The epoch to resume counting from when using '--resume_path'. Prevents unintended overwriting of existing weight files."),
# fmt: on
):
"""
Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components,
@@ -52,6 +45,26 @@ def pretrain(
all settings are the same between pretraining and training. Ideally,
this is done by using the same config file for both commands.
"""
pretrain(
texts_loc,
vectors_model,
output_dir,
config_path,
use_gpu=use_gpu,
resume_path=resume_path,
epoch_resume=epoch_resume,
)
def pretrain(
texts_loc: Path,
vectors_model: str,
output_dir: Path,
config_path: Path,
use_gpu: int = -1,
resume_path: Optional[Path] = None,
epoch_resume: Optional[int] = None,
):
if not config_path or not config_path.exists():
msg.fail("Config file not found", config_path, exits=1)
@@ -166,8 +179,7 @@ def pretrain(
skip_counter = 0
loss_func = pretrain_config["loss_func"]
for epoch in range(epoch_resume, pretrain_config["max_epochs"]):
batches = util.minibatch_by_words(texts, size=pretrain_config["batch_size"])
for batch_id, batch in enumerate(batches):
docs, count = make_docs(
nlp,
View File
@@ -1,3 +1,4 @@
from typing import Optional, Sequence, Union, Iterator
import tqdm
from pathlib import Path
import srsly
@@ -5,17 +6,19 @@ import cProfile
import pstats
import sys
import itertools
from wasabi import msg, Printer
from ._app import app, Arg, Opt
from ..language import Language
from ..util import load_model
@app.command("profile")
def profile_cli(
# fmt: off
model: str = Arg(..., help="Model to load"),
inputs: Optional[Path] = Arg(None, help="Location of input file. '-' for stdin.", exists=True, allow_dash=True),
n_texts: int = Opt(10000, "--n-texts", "-n", help="Maximum number of texts to use if available"),
# fmt: on
):
"""
@@ -24,6 +27,18 @@ def profile(
It can either be provided as a JSONL file, or be read from sys.stdin.
If no input file is specified, the IMDB dataset is loaded via Thinc.
"""
profile(model, inputs=inputs, n_texts=n_texts)
def profile(model: str, inputs: Optional[Path] = None, n_texts: int = 10000) -> None:
try:
import ml_datasets
except ImportError:
msg.fail(
"This command requires the ml_datasets library to be installed:"
"pip install ml_datasets",
exits=1,
)
if inputs is not None:
inputs = _read_inputs(inputs, msg)
if inputs is None:
@@ -43,12 +58,12 @@ def profile(
s.strip_dirs().sort_stats("time").print_stats()
def parse_texts(nlp: Language, texts: Sequence[str]) -> None:
for doc in nlp.pipe(tqdm.tqdm(texts), batch_size=16):
pass
def _read_inputs(loc: Union[Path, str], msg: Printer) -> Iterator[str]:
if loc == "-":
msg.info("Reading input from sys.stdin")
file_ = sys.stdin
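Because ml_datasets is only needed for the IMDB fallback, it is now imported lazily inside profile. A minimal sketch of profiling against your own input file so the optional dependency is never touched (hypothetical paths, import path assumed):

from pathlib import Path
from spacy.cli.profile import profile  # import path assumed

# texts.jsonl is a hypothetical JSONL file of input texts, as consumed by _read_inputs
profile("en_core_web_sm", inputs=Path("texts.jsonl"), n_texts=1000)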
704
spacy/cli/project.py Normal file
View File
@@ -0,0 +1,704 @@
from typing import List, Dict, Any, Optional, Sequence
import typer
import srsly
from pathlib import Path
from wasabi import msg
import subprocess
import os
import re
import shutil
import sys
import requests
import tqdm
from ._app import app, Arg, Opt, COMMAND, NAME
from .. import about
from ..schemas import ProjectConfigSchema, validate
from ..util import ensure_path, run_command, make_tempdir, working_dir
from ..util import get_hash, get_checksum, split_command
CONFIG_FILE = "project.yml"
DVC_CONFIG = "dvc.yaml"
DVC_DIR = ".dvc"
DIRS = [
"assets",
"metas",
"configs",
"packages",
"metrics",
"scripts",
"notebooks",
"training",
"corpus",
]
CACHES = [
Path.home() / ".torch",
Path.home() / ".caches" / "torch",
os.environ.get("TORCH_HOME"),
Path.home() / ".keras",
]
DVC_CONFIG_COMMENT = """# This file is auto-generated by spaCy based on your project.yml. Do not edit
# it directly and edit the project.yml instead and re-run the project."""
CLI_HELP = f"""Command-line interface for spaCy projects and working with project
templates. You'd typically start by cloning a project template to a local
directory and fetching its assets like datasets etc. See the project's
{CONFIG_FILE} for the available commands. Under the hood, spaCy uses DVC (Data
Version Control) to manage input and output files and to ensure steps are only
re-run if their inputs change.
"""
project_cli = typer.Typer(help=CLI_HELP, no_args_is_help=True)
@project_cli.callback(invoke_without_command=True)
def callback(ctx: typer.Context):
"""This runs before every project command and ensures DVC is installed."""
ensure_dvc()
################
# CLI COMMANDS #
################
@project_cli.command("clone")
def project_clone_cli(
# fmt: off
name: str = Arg(..., help="The name of the template to fetch"),
dest: Path = Arg(Path.cwd(), help="Where to download and work. Defaults to current working directory.", exists=False),
repo: str = Opt(about.__projects__, "--repo", "-r", help="The repository to look in."),
git: bool = Opt(False, "--git", "-G", help="Initialize project as a Git repo"),
no_init: bool = Opt(False, "--no-init", "-NI", help="Don't initialize the project with DVC"),
# fmt: on
):
"""Clone a project template from a repository. Calls into "git" and will
only download the files from the given subdirectory. The GitHub repo
defaults to the official spaCy template repo, but can be customized
(including using a private repo). Setting the --git flag will also
initialize the project directory as a Git repo. If the project is intended
to be a Git repo, it should be initialized with Git first, before
initializing DVC (Data Version Control). This allows DVC to integrate with
Git.
"""
if dest == Path.cwd():
dest = dest / name
project_clone(name, dest, repo=repo, git=git, no_init=no_init)
@project_cli.command("init")
def project_init_cli(
# fmt: off
path: Path = Arg(Path.cwd(), help="Path to cloned project. Defaults to current working directory.", exists=True, file_okay=False),
git: bool = Opt(False, "--git", "-G", help="Initialize project as a Git repo"),
force: bool = Opt(False, "--force", "-F", help="Force initialization"),
# fmt: on
):
"""Initialize a project directory with DVC and optionally Git. This should
typically be taken care of automatically when you run the "project clone"
command, but you can also run it separately. If the project is intended to
be a Git repo, it should be initialized with Git first, before initializing
DVC. This allows DVC to integrate with Git.
"""
project_init(path, git=git, force=force, silent=True)
@project_cli.command("assets")
def project_assets_cli(
# fmt: off
project_dir: Path = Arg(Path.cwd(), help="Path to cloned project. Defaults to current working directory.", exists=True, file_okay=False),
# fmt: on
):
"""Use DVC (Data Version Control) to fetch project assets. Assets are
defined in the "assets" section of the project config. If possible, DVC
will try to track the files so you can pull changes from upstream. It will
also try and store the checksum so the assets are versioned. If the file
can't be tracked or checked, it will be downloaded without DVC. If a checksum
is provided in the project config, the file is only downloaded if no local
file with the same checksum exists.
"""
project_assets(project_dir)
@project_cli.command(
"run-all",
context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
)
def project_run_all_cli(
# fmt: off
ctx: typer.Context,
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
show_help: bool = Opt(False, "--help", help="Show help message and available subcommands")
# fmt: on
):
"""Run all commands defined in the project. This command will use DVC and
the defined outputs and dependencies in the project config to determine
which steps need to be re-run and where to start. This means you're only
re-generating data if the inputs have changed.
This command calls into "dvc repro" and all additional arguments are passed
to the "dvc repro" command: https://dvc.org/doc/command-reference/repro
"""
if show_help:
print_run_help(project_dir)
else:
project_run_all(project_dir, *ctx.args)
@project_cli.command(
"run", context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
)
def project_run_cli(
# fmt: off
ctx: typer.Context,
subcommand: str = Arg(None, help="Name of command defined in project config"),
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
show_help: bool = Opt(False, "--help", help="Show help message and available subcommands")
# fmt: on
):
"""Run a named script defined in the project config. If the command is
part of the default pipeline defined in the "run" section, DVC is used to
determine whether the step should re-run if its inputs have changed, or
whether everything is up to date. If the script is not part of the default
pipeline, it will be called separately without DVC.
If DVC is used, the command calls into "dvc repro" and all additional
arguments are passed to the "dvc repro" command:
https://dvc.org/doc/command-reference/repro
"""
if show_help or not subcommand:
print_run_help(project_dir, subcommand)
else:
project_run(project_dir, subcommand, *ctx.args)
@project_cli.command("exec", hidden=True)
def project_exec_cli(
# fmt: off
subcommand: str = Arg(..., help="Name of command defined in project config"),
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
# fmt: on
):
"""Execute a command defined in the project config. This CLI command is
only called internally in auto-generated DVC pipelines, as a shortcut for
multi-step commands in the project config. You typically shouldn't have to
call it yourself. To run a command, call "run" or "run-all".
"""
project_exec(project_dir, subcommand)
@project_cli.command("update-dvc")
def project_update_dvc_cli(
# fmt: off
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
verbose: bool = Opt(False, "--verbose", "-V", help="Print more info"),
force: bool = Opt(False, "--force", "-F", help="Force update DVC config"),
# fmt: on
):
"""Update the auto-generated DVC config file. Uses the steps defined in the
"run" section of the project config. This typically happens automatically
when running a command, but can also be triggered manually if needed.
"""
config = load_project_config(project_dir)
updated = update_dvc_config(project_dir, config, verbose=verbose, force=force)
if updated:
msg.good(f"Updated DVC config from {CONFIG_FILE}")
else:
msg.info(f"No changes found in {CONFIG_FILE}, no update needed")
app.add_typer(project_cli, name="project")
#################
# CLI FUNCTIONS #
#################
def project_clone(
name: str,
dest: Path,
*,
repo: str = about.__projects__,
git: bool = False,
no_init: bool = False,
) -> None:
"""Clone a project template from a repository.
name (str): Name of subdirectory to clone.
dest (Path): Destination path of cloned project.
repo (str): URL of Git repo containing project templates.
git (bool): Initialize project as Git repo. Should be set to True if project
is intended as a repo, since it will allow DVC to integrate with Git.
no_init (bool): Don't initialize DVC and Git automatically. If True, the
"init" command or "git init" and "dvc init" need to be run manually.
"""
dest = ensure_path(dest)
check_clone(name, dest, repo)
project_dir = dest.resolve()
# We're using Git and sparse checkout to only clone the files we need
with make_tempdir() as tmp_dir:
cmd = f"git clone {repo} {tmp_dir} --no-checkout --depth 1 --config core.sparseCheckout=true"
try:
run_command(cmd)
except SystemExit:
err = f"Could not clone the repo '{repo}' into the temp dir '{tmp_dir}'"
msg.fail(err)
with (tmp_dir / ".git" / "info" / "sparse-checkout").open("w") as f:
f.write(name)
run_command(["git", "-C", str(tmp_dir), "fetch"])
run_command(["git", "-C", str(tmp_dir), "checkout"])
shutil.move(str(tmp_dir / Path(name).name), str(project_dir))
msg.good(f"Cloned project '{name}' from {repo} into {project_dir}")
for sub_dir in DIRS:
dir_path = project_dir / sub_dir
if not dir_path.exists():
dir_path.mkdir(parents=True)
if not no_init:
project_init(project_dir, git=git, force=True, silent=True)
msg.good(f"Your project is now ready!", dest)
print(f"To fetch the assets, run:\n{COMMAND} project assets {dest}")
def project_init(
project_dir: Path,
*,
git: bool = False,
force: bool = False,
silent: bool = False,
analytics: bool = False,
):
"""Initialize a project as a DVC and (optionally) as a Git repo.
project_dir (Path): Path to project directory.
git (bool): Also call "git init" to initialize directory as a Git repo.
silent (bool): Don't print any output (via DVC).
analytics (bool): Opt-in to DVC analytics (defaults to False).
"""
with working_dir(project_dir) as cwd:
if git:
run_command(["git", "init"])
init_cmd = ["dvc", "init"]
if silent:
init_cmd.append("--quiet")
if not git:
init_cmd.append("--no-scm")
if force:
init_cmd.append("--force")
run_command(init_cmd)
# We don't want to have analytics on by default; our users should
# opt-in explicitly. If they want it, they can always enable it.
if not analytics:
run_command(["dvc", "config", "core.analytics", "false"])
# Remove unused and confusing plot templates from .dvc directory
# TODO: maybe we shouldn't do this, but it's otherwise super confusing
# once you commit your changes via Git and it creates a bunch of files
# that have no purpose
plots_dir = cwd / DVC_DIR / "plots"
if plots_dir.exists():
shutil.rmtree(str(plots_dir))
config = load_project_config(cwd)
setup_check_dvc(cwd, config)
def project_assets(project_dir: Path) -> None:
"""Fetch assets for a project using DVC if possible.
project_dir (Path): Path to project directory.
"""
project_path = ensure_path(project_dir)
config = load_project_config(project_path)
setup_check_dvc(project_path, config)
assets = config.get("assets", {})
if not assets:
msg.warn(f"No assets specified in {CONFIG_FILE}", exits=0)
msg.info(f"Fetching {len(assets)} asset(s)")
variables = config.get("variables", {})
fetched_assets = []
for asset in assets:
url = asset["url"].format(**variables)
dest = asset["dest"].format(**variables)
fetched_path = fetch_asset(project_path, url, dest, asset.get("checksum"))
if fetched_path:
fetched_assets.append(str(fetched_path))
if fetched_assets:
with working_dir(project_path):
run_command(["dvc", "add", *fetched_assets, "--external"])
def fetch_asset(
project_path: Path, url: str, dest: Path, checksum: Optional[str] = None
) -> Optional[Path]:
"""Fetch an asset from a given URL or path. Will try to import the file
using DVC's import-url if possible (fully tracked and versioned) and falls
back to get-url (versioned) and a non-DVC download if necessary. If a
checksum is provided and a local file exists, it's only re-downloaded if the
checksum doesn't match.
project_path (Path): Path to project directory.
url (str): URL or path to asset.
checksum (Optional[str]): Optional expected checksum of local file.
RETURNS (Optional[Path]): The path to the fetched asset or None if fetching
the asset failed.
"""
url = convert_asset_url(url)
dest_path = (project_path / dest).resolve()
if dest_path.exists() and checksum:
# If there's already a file, check for checksum
# TODO: add support for caches (dvc import-url with local path)
if checksum == get_checksum(dest_path):
msg.good(f"Skipping download with matching checksum: {dest}")
return dest_path
with working_dir(project_path):
try:
# If these fail, we don't want to output an error or info message.
# Try importing with source tracking first (dvc import-url), then a
# plain DVC download (dvc get-url), then a regular non-DVC download.
try:
dvc_cmd = ["dvc", "import-url", url, str(dest_path)]
print(subprocess.check_output(dvc_cmd, stderr=subprocess.DEVNULL))
except subprocess.CalledProcessError:
dvc_cmd = ["dvc", "get-url", url, str(dest_path)]
print(subprocess.check_output(dvc_cmd, stderr=subprocess.DEVNULL))
except subprocess.CalledProcessError:
try:
download_file(url, dest_path)
except requests.exceptions.HTTPError as e:
msg.fail(f"Download failed: {dest}", e)
return None
if checksum and checksum != get_checksum(dest_path):
msg.warn(f"Checksum doesn't match value defined in {CONFIG_FILE}: {dest}")
msg.good(f"Fetched asset {dest}")
return dest_path
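# Sketch of the checksum short-circuit above, with a hypothetical MD5-based stand-in
# for the get_checksum() helper defined elsewhere in this module.
import hashlib
from pathlib import Path
from typing import Optional

def md5_checksum(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()

def needs_download(dest_path: Path, checksum: Optional[str]) -> bool:
    # Skip the download only if the file already exists AND its checksum matches
    # the value declared in the project config.
    if dest_path.exists() and checksum:
        return checksum != md5_checksum(dest_path)
    return True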
def project_run_all(project_dir: Path, *dvc_args) -> None:
"""Run all commands defined in the project using DVC.
project_dir (Path): Path to project directory.
*dvc_args: Other arguments passed to "dvc repro".
"""
config = load_project_config(project_dir)
setup_check_dvc(project_dir, config)
dvc_cmd = ["dvc", "repro", *dvc_args]
with working_dir(project_dir):
run_command(dvc_cmd)
def print_run_help(project_dir: Path, subcommand: Optional[str] = None) -> None:
"""Simulate a CLI help prompt using the info available in the project config.
project_dir (Path): The project directory.
subcommand (Optional[str]): The subcommand or None. If a subcommand is
provided, the subcommand help is shown. Otherwise, the top-level help
and a list of available commands is printed.
"""
config = load_project_config(project_dir)
setup_check_dvc(project_dir, config)
config_commands = config.get("commands", [])
commands = {cmd["name"]: cmd for cmd in config_commands}
if subcommand:
validate_subcommand(commands.keys(), subcommand)
print(f"Usage: {COMMAND} project run {subcommand} {project_dir}")
help_text = commands[subcommand].get("help")
if help_text:
msg.text(f"\n{help_text}\n")
else:
print(f"\nAvailable commands in {CONFIG_FILE}")
print(f"Usage: {COMMAND} project run [COMMAND] {project_dir}")
msg.table([(cmd["name"], cmd.get("help", "")) for cmd in config_commands])
msg.text("Run all commands defined in the 'run' block of the project config:")
print(f"{COMMAND} project run-all {project_dir}")
def project_run(project_dir: Path, subcommand: str, *dvc_args) -> None:
"""Run a named script defined in the project config. If the script is part
of the default pipeline (defined in the "run" section), DVC is used to
execute the command, so it can determine whether to rerun it. It then
calls into "exec" to execute it.
project_dir (Path): Path to project directory.
subcommand (str): Name of command to run.
*dvc_args: Other arguments passed to "dvc repro".
"""
config = load_project_config(project_dir)
setup_check_dvc(project_dir, config)
config_commands = config.get("commands", [])
variables = config.get("variables", {})
commands = {cmd["name"]: cmd for cmd in config_commands}
validate_subcommand(commands.keys(), subcommand)
if subcommand in config.get("run", []):
# This is one of the pipeline commands tracked in DVC
dvc_cmd = ["dvc", "repro", subcommand, *dvc_args]
with working_dir(project_dir):
run_command(dvc_cmd)
else:
cmd = commands[subcommand]
# Deps in non-DVC commands aren't tracked, but if they're defined,
# make sure they exist before running the command
for dep in cmd.get("deps", []):
if not (project_dir / dep).exists():
err = f"Missing dependency specified by command '{subcommand}': {dep}"
msg.fail(err, exits=1)
with working_dir(project_dir):
run_commands(cmd["script"], variables)
def project_exec(project_dir: Path, subcommand: str):
"""Execute a command defined in the project config.
project_dir (Path): Path to project directory.
subcommand (str): Name of command to run.
"""
config = load_project_config(project_dir)
config_commands = config.get("commands", [])
variables = config.get("variables", {})
commands = {cmd["name"]: cmd for cmd in config_commands}
with working_dir(project_dir):
run_commands(commands[subcommand]["script"], variables)
###########
# HELPERS #
###########
def load_project_config(path: Path) -> Dict[str, Any]:
"""Load the project config file from a directory and validate it.
path (Path): The path to the project directory.
RETURNS (Dict[str, Any]): The loaded project config.
"""
config_path = path / CONFIG_FILE
if not config_path.exists():
msg.fail("Can't find project config", config_path, exits=1)
invalid_err = f"Invalid project config in {CONFIG_FILE}"
try:
config = srsly.read_yaml(config_path)
except ValueError as e:
msg.fail(invalid_err, e, exits=1)
errors = validate(ProjectConfigSchema, config)
if errors:
msg.fail(invalid_err, "\n".join(errors), exits=1)
return config
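# Illustrative shape of the dict returned by load_project_config() for a minimal
# project config, limited to the keys this module reads (variables, assets,
# commands, run). All names, paths and values are made up for the example.
example_config = {
    "variables": {"lang": "en", "name": "my_model"},
    "assets": [
        {
            "url": "https://example.com/data.jsonl",
            "dest": "assets/data.jsonl",
            "checksum": "abc123",  # optional
        }
    ],
    "commands": [
        {
            "name": "preprocess",
            "help": "Convert the raw data",
            "script": ["python scripts/preprocess.py assets/data.jsonl corpus/train.json"],
            "deps": ["assets/data.jsonl"],
            "outputs": ["corpus/train.json"],
        }
    ],
    "run": ["preprocess"],  # commands tracked via DVC and run in sequence
}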
def update_dvc_config(
path: Path,
config: Dict[str, Any],
verbose: bool = False,
silent: bool = False,
force: bool = False,
) -> bool:
"""Re-run the DVC commands in dry mode and update dvc.yaml file in the
project directory. The file is auto-generated based on the config. The
first line of the auto-generated file specifies the hash of the config
dict, so if any of the config values change, the DVC config is regenerated.
path (Path): The path to the project directory.
config (Dict[str, Any]): The loaded project config.
verbose (bool): Whether to print additional info (via DVC).
silent (bool): Don't output anything (via DVC).
force (bool): Force update, even if hashes match.
RETURNS (bool): Whether the DVC config file was updated.
"""
config_hash = get_hash(config)
path = path.resolve()
dvc_config_path = path / DVC_CONFIG
if dvc_config_path.exists():
# Check if the file was generated using the current config; if not, regenerate it
with dvc_config_path.open("r", encoding="utf8") as f:
ref_hash = f.readline().strip().replace("# ", "")
if ref_hash == config_hash and not force:
return False # Nothing has changed in project config, don't need to update
dvc_config_path.unlink()
variables = config.get("variables", {})
commands = []
# We only want to include commands that are part of the main list of "run"
# commands in project.yml and should be run in sequence
config_commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
for name in config.get("run", []):
validate_subcommand(config_commands.keys(), name)
command = config_commands[name]
deps = command.get("deps", [])
outputs = command.get("outputs", [])
outputs_no_cache = command.get("outputs_no_cache", [])
if not deps and not outputs and not outputs_no_cache:
continue
# Default to "." as the project path since dvc.yaml is auto-generated
# and we don't want arbitrary paths in there
project_cmd = ["python", "-m", NAME, "project", ".", "exec", name]
deps_cmd = [c for cl in [["-d", p] for p in deps] for c in cl]
outputs_cmd = [c for cl in [["-o", p] for p in outputs] for c in cl]
outputs_nc_cmd = [c for cl in [["-O", p] for p in outputs_no_cache] for c in cl]
dvc_cmd = ["dvc", "run", "-n", name, "-w", str(path), "--no-exec"]
if verbose:
dvc_cmd.append("--verbose")
if silent:
dvc_cmd.append("--quiet")
full_cmd = [*dvc_cmd, *deps_cmd, *outputs_cmd, *outputs_nc_cmd, *project_cmd]
commands.append(" ".join(full_cmd))
with working_dir(path):
run_commands(commands, variables, silent=True)
with dvc_config_path.open("r+", encoding="utf8") as f:
content = f.read()
f.seek(0, 0)
f.write(f"# {config_hash}\n{DVC_CONFIG_COMMENT}\n{content}")
return True
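# Worked example of the flattening comprehensions above: each dependency or output
# becomes a ("-d"/"-o"/"-O", path) pair, flattened into one argument list for "dvc run".
deps = ["assets/data.jsonl", "corpus/train.json"]
deps_cmd = [c for cl in [["-d", p] for p in deps] for c in cl]
assert deps_cmd == ["-d", "assets/data.jsonl", "-d", "corpus/train.json"]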
def ensure_dvc() -> None:
"""Ensure that the "dvc" command is available and show an error if not."""
try:
subprocess.run(["dvc", "--version"], stdout=subprocess.DEVNULL)
except Exception:
msg.fail(
"spaCy projects require DVC (Data Version Control) and the 'dvc' command",
"You can install the Python package from pip (pip install dvc) or "
"conda (conda install -c conda-forge dvc). For more details, see the "
"documentation: https://dvc.org/doc/install",
exits=1,
)
def setup_check_dvc(project_dir: Path, config: Dict[str, Any]) -> None:
"""Check that the project is set up correctly with DVC and update its
config if needed. Will raise an error if the project is not an initialized
DVC project.
project_dir (Path): The path to the project directory.
config (Dict[str, Any]): The loaded project config.
"""
if not project_dir.exists():
msg.fail(f"Can't find project directory: {project_dir}")
if not (project_dir / ".dvc").exists():
msg.fail(
"Project not initialized as a DVC project.",
f"Make sure that the project template was cloned correctly. To "
f"initialize the project directory manually, you can run: "
f"{COMMAND} project init {project_dir}",
exits=1,
)
with msg.loading("Updating DVC config..."):
updated = update_dvc_config(project_dir, config, silent=True)
if updated:
msg.good(f"Updated DVC config from changed {CONFIG_FILE}")
def run_commands(
commands: List[str] = tuple(), variables: Dict[str, str] = {}, silent: bool = False
) -> None:
"""Run a sequence of commands in a subprocess, in order.
commands (List[str]): The string commands.
variables (Dict[str, str]): Dictionary of variable names, mapped to their
values. Will be used to substitute format string variables in the
commands.
silent (bool): Don't print the commands.
"""
for command in commands:
# Substitute variables, e.g. "./{NAME}.json"
command = command.format(**variables)
command = split_command(command)
# Not sure if this is needed or a good idea. Motivation: users may often
# reference "python" or "pip" in their config commands, and we want those to
# resolve to the Python interpreter spaCy is running under (and the pip in
# the same env), not some other Python/pip. This also keeps commands portable
# if user 1 writes "python3" (because that's how their system is set up) and
# user 2, who doesn't have that alias, re-runs the command.
if len(command) and command[0] in ("python", "python3"):
command[0] = sys.executable
elif len(command) and command[0] in ("pip", "pip3"):
command = [sys.executable, "-m", "pip", *command[1:]]
if not silent:
print(f"Running command: {' '.join(command)}")
run_command(command)
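# Sketch of the per-command rewriting above, assuming split_command() behaves like
# shlex.split(); "python"/"pip" are remapped so commands run in spaCy's own environment.
# The command string and variable are made up for the example.
import shlex
import sys

command = "python scripts/train.py ./{name}.json".format(name="my_model")
parts = shlex.split(command)
if parts and parts[0] in ("python", "python3"):
    parts[0] = sys.executable
elif parts and parts[0] in ("pip", "pip3"):
    parts = [sys.executable, "-m", "pip", *parts[1:]]
print(parts)  # e.g. ['/usr/bin/python3', 'scripts/train.py', './my_model.json']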
def convert_asset_url(url: str) -> str:
"""Check and convert the asset URL if needed.
url (str): The asset URL.
RETURNS (str): The converted URL.
"""
# If the asset URL is a regular GitHub URL, it's likely a mistake
if re.match(r"(http(s?))://github.com", url):
converted = url.replace("github.com", "raw.githubusercontent.com")
converted = re.sub(r"/(tree|blob)/", "/", converted)
msg.warn(
"Downloading from a regular GitHub URL. This will only download "
"the source of the page, not the actual file. Converting the URL "
"to a raw URL.",
converted,
)
return converted
return url
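# Worked example of the conversion above: a regular GitHub "blob" URL is rewritten
# to its raw.githubusercontent.com equivalent so the actual file is downloaded.
import re

url = "https://github.com/user/repo/blob/master/data/file.json"
converted = url.replace("github.com", "raw.githubusercontent.com")
converted = re.sub(r"/(tree|blob)/", "/", converted)
assert converted == "https://raw.githubusercontent.com/user/repo/master/data/file.json"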
def check_clone(name: str, dest: Path, repo: str) -> None:
"""Check and validate that the destination path can be used to clone. Will
check that Git is available and that the destination path is suitable.
name (str): Name of the directory to clone from the repo.
dest (Path): Local destination of cloned directory.
repo (str): URL of the repo to clone from.
"""
try:
subprocess.run(["git", "--version"], stdout=subprocess.DEVNULL)
except Exception:
msg.fail(
f"Cloning spaCy project templates requires Git and the 'git' command. ",
f"To clone a project without Git, copy the files from the '{name}' "
f"directory in the {repo} to {dest} manually and then run:",
f"{COMMAND} project init {dest}",
exits=1,
)
if not dest:
msg.fail(f"Not a valid directory to clone project: {dest}", exits=1)
if dest.exists():
# Directory already exists (not allowed, clone needs to create it)
msg.fail(f"Can't clone project, directory already exists: {dest}", exits=1)
if not dest.parent.exists():
# We're not creating parents, parent dir should exist
msg.fail(
f"Can't clone project, parent directory doesn't exist: {dest.parent}",
exits=1,
)
def validate_subcommand(commands: Sequence[str], subcommand: str) -> None:
"""Check that a subcommand is valid and defined. Raises an error otherwise.
commands (Sequence[str]): The available commands.
subcommand (str): The subcommand.
"""
if subcommand not in commands:
msg.fail(
f"Can't find command '{subcommand}' in {CONFIG_FILE}. "
f"Available commands: {', '.join(commands)}",
exits=1,
)
def download_file(url: str, dest: Path, chunk_size: int = 1024) -> None:
"""Download a file using requests.
url (str): The URL of the file.
dest (Path): The destination path.
chunk_size (int): The size of chunks to read/write.
"""
response = requests.get(url, stream=True)
response.raise_for_status()
total = int(response.headers.get("content-length", 0))
progress_settings = {
"total": total,
"unit": "iB",
"unit_scale": True,
"unit_divisor": chunk_size,
"leave": False,
}
with dest.open("wb") as f, tqdm.tqdm(**progress_settings) as bar:
for data in response.iter_content(chunk_size=chunk_size):
size = f.write(data)
bar.update(size)
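# Example usage of download_file() above; the URL and destination are placeholders.
# The parent directory must exist, since the file is opened for writing directly.
from pathlib import Path

dest = Path("assets") / "some-large-file.bin"
dest.parent.mkdir(parents=True, exist_ok=True)
download_file("https://example.com/some-large-file.bin", dest)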


@ -2,9 +2,8 @@ from typing import Optional, Dict, List, Union, Sequence
from timeit import default_timer as timer from timeit import default_timer as timer
import srsly import srsly
from pydantic import BaseModel, FilePath
import plac
import tqdm import tqdm
from pydantic import BaseModel, FilePath
from pathlib import Path from pathlib import Path
from wasabi import msg from wasabi import msg
import thinc import thinc
@ -12,11 +11,17 @@ import thinc.schedules
from thinc.api import Model, use_pytorch_for_gpu_memory from thinc.api import Model, use_pytorch_for_gpu_memory
import random import random
from ..gold import GoldCorpus from ._app import app, Arg, Opt
from ..gold import Corpus
from ..lookups import Lookups from ..lookups import Lookups
from .. import util from .. import util
from ..errors import Errors from ..errors import Errors
from ..ml import models # don't remove - required to load the built-in architectures
# Don't remove - required to load the built-in architectures
from ..ml import models # noqa: F401
# from ..schemas import ConfigSchema # TODO: include?
registry = util.registry registry = util.registry
@ -114,41 +119,24 @@ class ConfigSchema(BaseModel):
extra = "allow" extra = "allow"
@plac.annotations( @app.command("train")
# fmt: off
train_path=("Location of JSON-formatted training data", "positional", None, Path),
dev_path=("Location of JSON-formatted development data", "positional", None, Path),
config_path=("Path to config file", "positional", None, Path),
output_path=("Output directory to store model in", "option", "o", Path),
init_tok2vec=(
"Path to pretrained weights for the tok2vec components. See 'spacy pretrain'. Experimental.", "option", "t2v",
Path),
raw_text=("Path to jsonl file with unlabelled text documents.", "option", "rt", Path),
verbose=("Display more information for debugging purposes", "flag", "VV", bool),
use_gpu=("Use GPU", "option", "g", int),
num_workers=("Parallel Workers", "option", "j", int),
strategy=("Distributed training strategy (requires spacy_ray)", "option", "strategy", str),
ray_address=(
"Address of the Ray cluster. Multi-node training (requires spacy_ray)",
"option", "address", str),
tag_map_path=("Location of JSON-formatted tag map", "option", "tm", Path),
omit_extra_lookups=("Don't include extra lookups in model", "flag", "OEL", bool),
# fmt: on
)
def train_cli( def train_cli(
train_path, # fmt: off
dev_path, train_path: Path = Arg(..., help="Location of JSON-formatted training data", exists=True),
config_path, dev_path: Path = Arg(..., help="Location of JSON-formatted development data", exists=True),
output_path=None, config_path: Path = Arg(..., help="Path to config file", exists=True),
init_tok2vec=None, output_path: Optional[Path] = Opt(None, "--output-path", "-o", help="Output directory to store model in"),
raw_text=None, code_path: Optional[Path] = Opt(None, "--code-path", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
verbose=False, init_tok2vec: Optional[Path] = Opt(None, "--init-tok2vec", "-t2v", help="Path to pretrained weights for the tok2vec components. See 'spacy pretrain'. Experimental."),
use_gpu=-1, raw_text: Optional[Path] = Opt(None, "--raw-text", "-rt", help="Path to jsonl file with unlabelled text documents."),
num_workers=1, verbose: bool = Opt(False, "--verbose", "-VV", help="Display more information for debugging purposes"),
strategy="allreduce", use_gpu: int = Opt(-1, "--use-gpu", "-g", help="Use GPU"),
ray_address=None, num_workers: int = Opt(None, "-j", help="Parallel Workers"),
tag_map_path=None, strategy: str = Opt(None, "--strategy", help="Distributed training strategy (requires spacy_ray)"),
omit_extra_lookups=False, ray_address: str = Opt(None, "--address", help="Address of the Ray cluster. Multi-node training (requires spacy_ray)"),
tag_map_path: Optional[Path] = Opt(None, "--tag-map-path", "-tm", help="Location of JSON-formatted tag map"),
omit_extra_lookups: bool = Opt(False, "--omit-extra-lookups", "-OEL", help="Don't include extra lookups in model"),
# fmt: on
): ):
""" """
Train or update a spaCy model. Requires data to be formatted in spaCy's Train or update a spaCy model. Requires data to be formatted in spaCy's
@ -156,26 +144,8 @@ def train_cli(
command. command.
""" """
util.set_env_log(verbose) util.set_env_log(verbose)
verify_cli_args(**locals())
# Make sure all files and paths exists if they are needed
if not config_path or not config_path.exists():
msg.fail("Config file not found", config_path, exits=1)
if not train_path or not train_path.exists():
msg.fail("Training data not found", train_path, exits=1)
if not dev_path or not dev_path.exists():
msg.fail("Development data not found", dev_path, exits=1)
if output_path is not None:
if not output_path.exists():
output_path.mkdir()
msg.good(f"Created output directory: {output_path}")
elif output_path.exists() and [p for p in output_path.iterdir() if p.is_dir()]:
msg.warn(
"Output directory is not empty.",
"This can lead to unintended side effects when saving the model. "
"Please use an empty directory or a different path instead. If "
"the specified output path doesn't exist, the directory will be "
"created for you.",
)
if raw_text is not None: if raw_text is not None:
raw_text = list(srsly.read_jsonl(raw_text)) raw_text = list(srsly.read_jsonl(raw_text))
tag_map = {} tag_map = {}
@ -184,8 +154,6 @@ def train_cli(
weights_data = None weights_data = None
if init_tok2vec is not None: if init_tok2vec is not None:
if not init_tok2vec.exists():
msg.fail("Can't find pretrained tok2vec", init_tok2vec, exits=1)
with init_tok2vec.open("rb") as file_: with init_tok2vec.open("rb") as file_:
weights_data = file_.read() weights_data = file_.read()
@ -214,17 +182,17 @@ def train_cli(
train(**train_args) train(**train_args)
def train( def train(
config_path, config_path: Path,
data_paths, data_paths: Dict[str, Path],
raw_text=None, raw_text: Optional[Path] = None,
output_path=None, output_path: Optional[Path] = None,
tag_map=None, tag_map: Optional[Path] = None,
weights_data=None, weights_data: Optional[bytes] = None,
omit_extra_lookups=False, omit_extra_lookups: bool = False,
disable_tqdm=False, disable_tqdm: bool = False,
remote_optimizer=None, remote_optimizer: Optimizer = None,
randomization_index=0 randomization_index: int = 0
): ) -> None:
msg.info(f"Loading config from: {config_path}") msg.info(f"Loading config from: {config_path}")
# Read the config first without creating objects, to get to the original nlp_config # Read the config first without creating objects, to get to the original nlp_config
config = util.load_config(config_path, create_objects=False) config = util.load_config(config_path, create_objects=False)
@ -243,69 +211,20 @@ def train(
if remote_optimizer: if remote_optimizer:
optimizer = remote_optimizer optimizer = remote_optimizer
limit = training["limit"] limit = training["limit"]
msg.info("Loading training corpus") corpus = Corpus(data_paths["train"], data_paths["dev"], limit=limit)
corpus = GoldCorpus(data_paths["train"], data_paths["dev"], limit=limit)
# verify textcat config
if "textcat" in nlp_config["pipeline"]: if "textcat" in nlp_config["pipeline"]:
textcat_labels = set(nlp.get_pipe("textcat").labels) verify_textcat_config(nlp, nlp_config)
textcat_multilabel = not nlp_config["pipeline"]["textcat"]["model"]["exclusive_classes"]
# check whether the setting 'exclusive_classes' corresponds to the provided training data
if textcat_multilabel:
multilabel_found = False
for ex in corpus.train_examples:
cats = ex.doc_annotation.cats
textcat_labels.update(cats.keys())
if list(cats.values()).count(1.0) != 1:
multilabel_found = True
if not multilabel_found:
msg.warn(
"The textcat training instances look like they have "
"mutually exclusive classes. Set 'exclusive_classes' "
"to 'true' in the config to train a classifier with "
"mutually exclusive classes more accurately."
)
else:
for ex in corpus.train_examples:
cats = ex.doc_annotation.cats
textcat_labels.update(cats.keys())
if list(cats.values()).count(1.0) != 1:
msg.fail(
"Some textcat training instances do not have exactly "
"one positive label. Set 'exclusive_classes' "
"to 'false' in the config to train a classifier with classes "
"that are not mutually exclusive."
)
msg.info(f"Initialized textcat component for {len(textcat_labels)} unique labels")
nlp.get_pipe("textcat").labels = tuple(textcat_labels)
# if 'positive_label' is provided: double check whether it's in the data and the task is binary
if nlp_config["pipeline"]["textcat"].get("positive_label", None):
textcat_labels = nlp.get_pipe("textcat").cfg.get("labels", [])
pos_label = nlp_config["pipeline"]["textcat"]["positive_label"]
if pos_label not in textcat_labels:
msg.fail(
f"The textcat's 'positive_label' config setting '{pos_label}' "
f"does not match any label in the training data.",
exits=1,
)
if len(textcat_labels) != 2:
msg.fail(
f"A textcat 'positive_label' '{pos_label}' was "
f"provided for training data that does not appear to be a "
f"binary classification problem with two labels.",
exits=1,
)
if training.get("resume", False): if training.get("resume", False):
msg.info("Resuming training") msg.info("Resuming training")
nlp.resume_training() nlp.resume_training()
else: else:
msg.info(f"Initializing the nlp pipeline: {nlp.pipe_names}") msg.info(f"Initializing the nlp pipeline: {nlp.pipe_names}")
nlp.begin_training( train_examples = list(corpus.train_dataset(
lambda: corpus.train_examples nlp,
) shuffle=False,
gold_preproc=training["gold_preproc"]
))
nlp.begin_training(lambda: train_examples)
# Update tag map with provided mapping # Update tag map with provided mapping
nlp.vocab.morphology.tag_map.update(tag_map) nlp.vocab.morphology.tag_map.update(tag_map)
@ -332,11 +251,11 @@ def train(
tok2vec = tok2vec.get(subpath) tok2vec = tok2vec.get(subpath)
if not tok2vec: if not tok2vec:
msg.fail( msg.fail(
f"Could not locate the tok2vec model at {tok2vec_path}.", f"Could not locate the tok2vec model at {tok2vec_path}.", exits=1,
exits=1,
) )
tok2vec.from_bytes(weights_data) tok2vec.from_bytes(weights_data)
msg.info("Loading training corpus")
train_batches = create_train_batches(nlp, corpus, training, randomization_index) train_batches = create_train_batches(nlp, corpus, training, randomization_index)
evaluate = create_evaluation_callback(nlp, optimizer, corpus, training) evaluate = create_evaluation_callback(nlp, optimizer, corpus, training)
@ -369,18 +288,15 @@ def train(
update_meta(training, nlp, info) update_meta(training, nlp, info)
nlp.to_disk(output_path / "model-best") nlp.to_disk(output_path / "model-best")
progress = tqdm.tqdm(**tqdm_args) progress = tqdm.tqdm(**tqdm_args)
# Clean up the objects to faciliate garbage collection.
for eg in batch:
eg.doc = None
eg.goldparse = None
eg.doc_annotation = None
eg.token_annotation = None
except Exception as e: except Exception as e:
if output_path is not None:
msg.warn( msg.warn(
f"Aborting and saving the final best model. " f"Aborting and saving the final best model. "
f"Encountered exception: {str(e)}", f"Encountered exception: {str(e)}",
exits=1, exits=1,
) )
else:
raise e
finally: finally:
if output_path is not None: if output_path is not None:
final_model_path = output_path / "model-final" final_model_path = output_path / "model-final"
@ -393,23 +309,22 @@ def train(
def create_train_batches(nlp, corpus, cfg, randomization_index): def create_train_batches(nlp, corpus, cfg, randomization_index):
epochs_todo = cfg.get("max_epochs", 0) max_epochs = cfg.get("max_epochs", 0)
while True: train_examples = list(corpus.train_dataset(
train_examples = list(
corpus.train_dataset(
nlp, nlp,
noise_level=0.0, # I think this is deprecated? shuffle=True,
orth_variant_level=cfg["orth_variant_level"],
gold_preproc=cfg["gold_preproc"], gold_preproc=cfg["gold_preproc"],
max_length=cfg["max_length"], max_length=cfg["max_length"]
ignore_misaligned=True, ))
)
) epoch = 0
while True:
if len(train_examples) == 0: if len(train_examples) == 0:
raise ValueError(Errors.E988) raise ValueError(Errors.E988)
for _ in range(randomization_index): for _ in range(randomization_index):
random.random() random.random()
random.shuffle(train_examples) random.shuffle(train_examples)
epoch += 1
batches = util.minibatch_by_words( batches = util.minibatch_by_words(
train_examples, train_examples,
size=cfg["batch_size"], size=cfg["batch_size"],
@ -418,15 +333,12 @@ def create_train_batches(nlp, corpus, cfg, randomization_index):
# make sure the minibatch_by_words result is not empty, or we'll have an infinite training loop # make sure the minibatch_by_words result is not empty, or we'll have an infinite training loop
try: try:
first = next(batches) first = next(batches)
yield first yield epoch, first
except StopIteration: except StopIteration:
raise ValueError(Errors.E986) raise ValueError(Errors.E986)
for batch in batches: for batch in batches:
yield batch yield epoch, batch
epochs_todo -= 1 if max_epochs >= 1 and epoch >= max_epochs:
# We intentionally compare exactly to 0 here, so that max_epochs < 1
# will not break.
if epochs_todo == 0:
break break
@ -437,7 +349,8 @@ def create_evaluation_callback(nlp, optimizer, corpus, cfg):
nlp, gold_preproc=cfg["gold_preproc"], ignore_misaligned=True nlp, gold_preproc=cfg["gold_preproc"], ignore_misaligned=True
) )
) )
n_words = sum(len(ex.doc) for ex in dev_examples)
n_words = sum(len(ex.predicted) for ex in dev_examples)
start_time = timer() start_time = timer()
if optimizer.averages: if optimizer.averages:
@ -453,7 +366,11 @@ def create_evaluation_callback(nlp, optimizer, corpus, cfg):
try: try:
weighted_score = sum(scores[s] * weights.get(s, 0.0) for s in weights) weighted_score = sum(scores[s] * weights.get(s, 0.0) for s in weights)
except KeyError as e: except KeyError as e:
raise KeyError(Errors.E983.format(dict_name='score_weights', key=str(e), keys=list(scores.keys()))) raise KeyError(
Errors.E983.format(
dict="score_weights", key=str(e), keys=list(scores.keys())
)
)
scores["speed"] = wps scores["speed"] = wps
return weighted_score, scores return weighted_score, scores
@ -494,7 +411,7 @@ def train_while_improving(
Every iteration, the function yields out a tuple with: Every iteration, the function yields out a tuple with:
* batch: A zipped sequence of Tuple[Doc, GoldParse] pairs. * batch: A list of Example objects.
* info: A dict with various information about the last update (see below). * info: A dict with various information about the last update (see below).
* is_best_checkpoint: A value in None, False, True, indicating whether this * is_best_checkpoint: A value in None, False, True, indicating whether this
was the best evaluation so far. You should use this to save the model was the best evaluation so far. You should use this to save the model
@ -526,7 +443,7 @@ def train_while_improving(
(nlp.make_doc(rt["text"]) for rt in raw_text), size=8 (nlp.make_doc(rt["text"]) for rt in raw_text), size=8
) )
for step, batch in enumerate(train_data): for step, (epoch, batch) in enumerate(train_data):
dropout = next(dropouts) dropout = next(dropouts)
with nlp.select_pipes(enable=to_enable): with nlp.select_pipes(enable=to_enable):
for subbatch in subdivide_batch(batch, accumulate_gradient): for subbatch in subdivide_batch(batch, accumulate_gradient):
@ -548,6 +465,7 @@ def train_while_improving(
score, other_scores = (None, None) score, other_scores = (None, None)
is_best_checkpoint = None is_best_checkpoint = None
info = { info = {
"epoch": epoch,
"step": step, "step": step,
"score": score, "score": score,
"other_scores": other_scores, "other_scores": other_scores,
@ -568,7 +486,7 @@ def train_while_improving(
def subdivide_batch(batch, accumulate_gradient): def subdivide_batch(batch, accumulate_gradient):
batch = list(batch) batch = list(batch)
batch.sort(key=lambda eg: len(eg.doc)) batch.sort(key=lambda eg: len(eg.predicted))
sub_len = len(batch) // accumulate_gradient sub_len = len(batch) // accumulate_gradient
start = 0 start = 0
for i in range(accumulate_gradient): for i in range(accumulate_gradient):
@ -586,9 +504,9 @@ def setup_printer(training, nlp):
score_widths = [max(len(col), 6) for col in score_cols] score_widths = [max(len(col), 6) for col in score_cols]
loss_cols = [f"Loss {pipe}" for pipe in nlp.pipe_names] loss_cols = [f"Loss {pipe}" for pipe in nlp.pipe_names]
loss_widths = [max(len(col), 8) for col in loss_cols] loss_widths = [max(len(col), 8) for col in loss_cols]
table_header = ["#"] + loss_cols + score_cols + ["Score"] table_header = ["E", "#"] + loss_cols + score_cols + ["Score"]
table_header = [col.upper() for col in table_header] table_header = [col.upper() for col in table_header]
table_widths = [6] + loss_widths + score_widths + [6] table_widths = [3, 6] + loss_widths + score_widths + [6]
table_aligns = ["r" for _ in table_widths] table_aligns = ["r" for _ in table_widths]
msg.row(table_header, widths=table_widths) msg.row(table_header, widths=table_widths)
@ -602,17 +520,25 @@ def setup_printer(training, nlp):
] ]
except KeyError as e: except KeyError as e:
raise KeyError( raise KeyError(
Errors.E983.format(dict_name='scores (losses)', key=str(e), keys=list(info["losses"].keys()))) Errors.E983.format(
dict="scores (losses)", key=str(e), keys=list(info["losses"].keys())
)
)
try: try:
scores = [ scores = [
"{0:.2f}".format(float(info["other_scores"][col])) "{0:.2f}".format(float(info["other_scores"][col])) for col in score_cols
for col in score_cols
] ]
except KeyError as e: except KeyError as e:
raise KeyError(Errors.E983.format(dict_name='scores (other)', key=str(e), keys=list(info["other_scores"].keys()))) raise KeyError(
Errors.E983.format(
dict="scores (other)",
key=str(e),
keys=list(info["other_scores"].keys()),
)
)
data = ( data = (
[info["step"]] + losses + scores + ["{0:.2f}".format(float(info["score"]))] [info["epoch"], info["step"]] + losses + scores + ["{0:.2f}".format(float(info["score"]))]
) )
msg.row(data, widths=table_widths, aligns=table_aligns) msg.row(data, widths=table_widths, aligns=table_aligns)
@ -626,3 +552,67 @@ def update_meta(training, nlp, info):
nlp.meta["performance"][metric] = info["other_scores"][metric] nlp.meta["performance"][metric] = info["other_scores"][metric]
for pipe_name in nlp.pipe_names: for pipe_name in nlp.pipe_names:
nlp.meta["performance"][f"{pipe_name}_loss"] = info["losses"][pipe_name] nlp.meta["performance"][f"{pipe_name}_loss"] = info["losses"][pipe_name]
def verify_cli_args(
train_path,
dev_path,
config_path,
output_path=None,
code_path=None,
init_tok2vec=None,
raw_text=None,
verbose=False,
use_gpu=-1,
tag_map_path=None,
omit_extra_lookups=False,
):
# Make sure all files and paths exists if they are needed
if not config_path or not config_path.exists():
msg.fail("Config file not found", config_path, exits=1)
if not train_path or not train_path.exists():
msg.fail("Training data not found", train_path, exits=1)
if not dev_path or not dev_path.exists():
msg.fail("Development data not found", dev_path, exits=1)
if output_path is not None:
if not output_path.exists():
output_path.mkdir()
msg.good(f"Created output directory: {output_path}")
elif output_path.exists() and [p for p in output_path.iterdir() if p.is_dir()]:
msg.warn(
"Output directory is not empty.",
"This can lead to unintended side effects when saving the model. "
"Please use an empty directory or a different path instead. If "
"the specified output path doesn't exist, the directory will be "
"created for you.",
)
if code_path is not None:
if not code_path.exists():
msg.fail("Path to Python code not found", code_path, exits=1)
try:
util.import_file("python_code", code_path)
except Exception as e:
msg.fail(f"Couldn't load Python code: {code_path}", e, exits=1)
if init_tok2vec is not None and not init_tok2vec.exists():
msg.fail("Can't find pretrained tok2vec", init_tok2vec, exits=1)
def verify_textcat_config(nlp, nlp_config):
# if 'positive_label' is provided: double check whether it's in the data and
# the task is binary
if nlp_config["pipeline"]["textcat"].get("positive_label", None):
textcat_labels = nlp.get_pipe("textcat").cfg.get("labels", [])
pos_label = nlp_config["pipeline"]["textcat"]["positive_label"]
if pos_label not in textcat_labels:
msg.fail(
f"The textcat's 'positive_label' config setting '{pos_label}' "
f"does not match any label in the training data.",
exits=1,
)
if len(textcat_labels) != 2:
msg.fail(
f"A textcat 'positive_label' '{pos_label}' was "
f"provided for training data that does not appear to be a "
f"binary classification problem with two labels.",
exits=1,
)


@ -1,18 +1,25 @@
from typing import Tuple
from pathlib import Path from pathlib import Path
import sys import sys
import requests import requests
from wasabi import msg from wasabi import msg, Printer
from ._app import app
from .. import about from .. import about
from ..util import get_package_version, get_installed_models, get_base_version from ..util import get_package_version, get_installed_models, get_base_version
from ..util import get_package_path, get_model_meta, is_compatible_version from ..util import get_package_path, get_model_meta, is_compatible_version
def validate(): @app.command("validate")
def validate_cli():
""" """
Validate that the currently installed version of spaCy is compatible Validate that the currently installed version of spaCy is compatible
with the installed models. Should be run after `pip install -U spacy`. with the installed models. Should be run after `pip install -U spacy`.
""" """
validate()
def validate() -> None:
model_pkgs, compat = get_model_pkgs() model_pkgs, compat = get_model_pkgs()
spacy_version = get_base_version(about.__version__) spacy_version = get_base_version(about.__version__)
current_compat = compat.get(spacy_version, {}) current_compat = compat.get(spacy_version, {})
@ -55,7 +62,8 @@ def validate():
sys.exit(1) sys.exit(1)
def get_model_pkgs(): def get_model_pkgs(silent: bool = False) -> Tuple[dict, dict]:
msg = Printer(no_print=silent, pretty=not silent)
with msg.loading("Loading compatibility table..."): with msg.loading("Loading compatibility table..."):
r = requests.get(about.__compatibility__) r = requests.get(about.__compatibility__)
if r.status_code != 200: if r.status_code != 200:
@ -93,7 +101,7 @@ def get_model_pkgs():
return pkgs, compat return pkgs, compat
def reformat_version(version): def reformat_version(version: str) -> str:
"""Hack to reformat old versions ending on '-alpha' to match pip format.""" """Hack to reformat old versions ending on '-alpha' to match pip format."""
if version.endswith("-alpha"): if version.endswith("-alpha"):
return version.replace("-alpha", "a0") return version.replace("-alpha", "a0")


@ -3,7 +3,7 @@ def add_codes(err_cls):
class ErrorsWithCodes(err_cls): class ErrorsWithCodes(err_cls):
def __getattribute__(self, code): def __getattribute__(self, code):
msg = super().__getattribute__(code) msg = super(ErrorsWithCodes, self).__getattribute__(code)
if code.startswith("__"): # python system attributes like __class__ if code.startswith("__"): # python system attributes like __class__
return msg return msg
else: else:
@ -111,8 +111,31 @@ class Warnings(object):
"`spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)`" "`spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)`"
" to check the alignment. Misaligned entities ('-') will be " " to check the alignment. Misaligned entities ('-') will be "
"ignored during training.") "ignored during training.")
W031 = ("Model '{model}' ({model_version}) requires spaCy {version} and "
"is incompatible with the current spaCy version ({current}). This "
"may lead to unexpected results or runtime errors. To resolve "
"this, download a newer compatible model or retrain your custom "
"model with the current spaCy version. For more details and "
"available updates, run: python -m spacy validate")
W032 = ("Unable to determine model compatibility for model '{model}' "
"({model_version}) with the current spaCy version ({current}). "
"This may lead to unexpected results or runtime errors. To resolve "
"this, download a newer compatible model or retrain your custom "
"model with the current spaCy version. For more details and "
"available updates, run: python -m spacy validate")
W033 = ("Training a new {model} using a model with no lexeme normalization "
"table. This may degrade the performance of the model to some "
"degree. If this is intentional or the language you're using "
"doesn't have a normalization table, please ignore this warning. "
"If this is surprising, make sure you have the spacy-lookups-data "
"package installed. The languages with lexeme normalization tables "
"are currently: da, de, el, en, id, lb, pt, ru, sr, ta, th.")
# TODO: fix numbering after merging develop into master # TODO: fix numbering after merging develop into master
W091 = ("Could not clean/remove the temp directory at {dir}: {msg}.")
W092 = ("Ignoring annotations for sentence starts, as dependency heads are set.")
W093 = ("Could not find any data to train the {name} on. Is your "
"input data correctly formatted ?")
W094 = ("Model '{model}' ({model_version}) specifies an under-constrained " W094 = ("Model '{model}' ({model_version}) specifies an under-constrained "
"spaCy version requirement: {version}. This can lead to compatibility " "spaCy version requirement: {version}. This can lead to compatibility "
"problems with older versions, or as new spaCy versions are " "problems with older versions, or as new spaCy versions are "
@ -133,7 +156,7 @@ class Warnings(object):
"so a default configuration was used.") "so a default configuration was used.")
W099 = ("Expected 'dict' type for the 'model' argument of pipe '{pipe}', " W099 = ("Expected 'dict' type for the 'model' argument of pipe '{pipe}', "
"but got '{type}' instead, so ignoring it.") "but got '{type}' instead, so ignoring it.")
W100 = ("Skipping unsupported morphological feature(s): {feature}. " W100 = ("Skipping unsupported morphological feature(s): '{feature}'. "
"Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or " "Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or "
"string \"Field1=Value1,Value2|Field2=Value3\".") "string \"Field1=Value1,Value2|Field2=Value3\".")
@ -161,18 +184,13 @@ class Errors(object):
"`nlp.select_pipes()`, you should remove them explicitly with " "`nlp.select_pipes()`, you should remove them explicitly with "
"`nlp.remove_pipe()` before the pipeline is restored. Names of " "`nlp.remove_pipe()` before the pipeline is restored. Names of "
"the new components: {names}") "the new components: {names}")
E009 = ("The `update` method expects same number of docs and golds, but "
"got: {n_docs} docs, {n_golds} golds.")
E010 = ("Word vectors set to length 0. This may be because you don't have " E010 = ("Word vectors set to length 0. This may be because you don't have "
"a model installed or loaded, or because your model doesn't " "a model installed or loaded, or because your model doesn't "
"include word vectors. For more info, see the docs:\n" "include word vectors. For more info, see the docs:\n"
"https://spacy.io/usage/models") "https://spacy.io/usage/models")
E011 = ("Unknown operator: '{op}'. Options: {opts}") E011 = ("Unknown operator: '{op}'. Options: {opts}")
E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}") E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}")
E013 = ("Error selecting action in matcher")
E014 = ("Unknown tag ID: {tag}") E014 = ("Unknown tag ID: {tag}")
E015 = ("Conflicting morphology exception for ({tag}, {orth}). Use "
"`force=True` to overwrite.")
E016 = ("MultitaskObjective target should be function or one of: dep, " E016 = ("MultitaskObjective target should be function or one of: dep, "
"tag, ent, dep_tag_offset, ent_tag.") "tag, ent, dep_tag_offset, ent_tag.")
E017 = ("Can only add unicode or bytes. Got type: {value_type}") E017 = ("Can only add unicode or bytes. Got type: {value_type}")
@ -180,21 +198,8 @@ class Errors(object):
"refers to an issue with the `Vocab` or `StringStore`.") "refers to an issue with the `Vocab` or `StringStore`.")
E019 = ("Can't create transition with unknown action ID: {action}. Action " E019 = ("Can't create transition with unknown action ID: {action}. Action "
"IDs are enumerated in spacy/syntax/{src}.pyx.") "IDs are enumerated in spacy/syntax/{src}.pyx.")
E020 = ("Could not find a gold-standard action to supervise the "
"dependency parser. The tree is non-projective (i.e. it has "
"crossing arcs - see spacy/syntax/nonproj.pyx for definitions). "
"The ArcEager transition system only supports projective trees. "
"To learn non-projective representations, transform the data "
"before training and after parsing. Either pass "
"`make_projective=True` to the GoldParse class, or use "
"spacy.syntax.nonproj.preprocess_training_data.")
E021 = ("Could not find a gold-standard action to supervise the "
"dependency parser. The GoldParse was projective. The transition "
"system has {n_actions} actions. State at failure: {state}")
E022 = ("Could not find a transition with the name '{name}' in the NER " E022 = ("Could not find a transition with the name '{name}' in the NER "
"model.") "model.")
E023 = ("Error cleaning up beam: The same state occurred twice at "
"memory address {addr} and position {i}.")
E024 = ("Could not find an optimal move to supervise the parser. Usually, " E024 = ("Could not find an optimal move to supervise the parser. Usually, "
"this means that the model can't be updated in a way that's valid " "this means that the model can't be updated in a way that's valid "
"and satisfies the correct annotations specified in the GoldParse. " "and satisfies the correct annotations specified in the GoldParse. "
@ -238,7 +243,6 @@ class Errors(object):
"offset {start}.") "offset {start}.")
E037 = ("Error calculating span: Can't find a token ending at character " E037 = ("Error calculating span: Can't find a token ending at character "
"offset {end}.") "offset {end}.")
E038 = ("Error finding sentence for span. Infinite loop detected.")
E039 = ("Array bounds exceeded while searching for root word. This likely " E039 = ("Array bounds exceeded while searching for root word. This likely "
"means the parse tree is in an invalid state. Please report this " "means the parse tree is in an invalid state. Please report this "
"issue here: http://github.com/explosion/spaCy/issues") "issue here: http://github.com/explosion/spaCy/issues")
@ -269,8 +273,6 @@ class Errors(object):
E059 = ("One (and only one) keyword arg must be set. Got: {kwargs}") E059 = ("One (and only one) keyword arg must be set. Got: {kwargs}")
E060 = ("Cannot add new key to vectors: the table is full. Current shape: " E060 = ("Cannot add new key to vectors: the table is full. Current shape: "
"({rows}, {cols}).") "({rows}, {cols}).")
E061 = ("Bad file name: {filename}. Example of a valid file name: "
"'vectors.128.f.bin'")
E062 = ("Cannot find empty bit for new lexical flag. All bits between 0 " E062 = ("Cannot find empty bit for new lexical flag. All bits between 0 "
"and 63 are occupied. You can replace one by specifying the " "and 63 are occupied. You can replace one by specifying the "
"`flag_id` explicitly, e.g. " "`flag_id` explicitly, e.g. "
@ -284,39 +286,17 @@ class Errors(object):
"Query string: {string}\nOrth cached: {orth}\nOrth ID: {orth_id}") "Query string: {string}\nOrth cached: {orth}\nOrth ID: {orth_id}")
E065 = ("Only one of the vector table's width and shape can be specified. " E065 = ("Only one of the vector table's width and shape can be specified. "
"Got width {width} and shape {shape}.") "Got width {width} and shape {shape}.")
E066 = ("Error creating model helper for extracting columns. Can only "
"extract columns by positive integer. Got: {value}.")
E067 = ("Invalid BILUO tag sequence: Got a tag starting with 'I' (inside " E067 = ("Invalid BILUO tag sequence: Got a tag starting with 'I' (inside "
"an entity) without a preceding 'B' (beginning of an entity). " "an entity) without a preceding 'B' (beginning of an entity). "
"Tag sequence:\n{tags}") "Tag sequence:\n{tags}")
E068 = ("Invalid BILUO tag: '{tag}'.") E068 = ("Invalid BILUO tag: '{tag}'.")
E069 = ("Invalid gold-standard parse tree. Found cycle between word "
"IDs: {cycle} (tokens: {cycle_tokens}) in the document starting "
"with tokens: {doc_tokens}.")
E070 = ("Invalid gold-standard data. Number of documents ({n_docs}) "
"does not align with number of annotations ({n_annots}).")
E071 = ("Error creating lexeme: specified orth ID ({orth}) does not " E071 = ("Error creating lexeme: specified orth ID ({orth}) does not "
"match the one in the vocab ({vocab_orth}).") "match the one in the vocab ({vocab_orth}).")
E072 = ("Error serializing lexeme: expected data length {length}, "
"got {bad_length}.")
E073 = ("Cannot assign vector of length {new_length}. Existing vectors " E073 = ("Cannot assign vector of length {new_length}. Existing vectors "
"are of length {length}. You can use `vocab.reset_vectors` to " "are of length {length}. You can use `vocab.reset_vectors` to "
"clear the existing vectors and resize the table.") "clear the existing vectors and resize the table.")
E074 = ("Error interpreting compiled match pattern: patterns are expected " E074 = ("Error interpreting compiled match pattern: patterns are expected "
"to end with the attribute {attr}. Got: {bad_attr}.") "to end with the attribute {attr}. Got: {bad_attr}.")
E075 = ("Error accepting match: length ({length}) > maximum length "
"({max_len}).")
E076 = ("Error setting tensor on Doc: tensor has {rows} rows, while Doc "
"has {words} words.")
E077 = ("Error computing {value}: number of Docs ({n_docs}) does not "
"equal number of GoldParse objects ({n_golds}) in batch.")
E078 = ("Error computing score: number of words in Doc ({words_doc}) does "
"not equal number of words in GoldParse ({words_gold}).")
E079 = ("Error computing states in beam: number of predicted beams "
"({pbeams}) does not equal number of gold beams ({gbeams}).")
E080 = ("Duplicate state found in beam: {key}.")
E081 = ("Error getting gradient in beam: number of histories ({n_hist}) "
"does not equal number of losses ({losses}).")
E082 = ("Error deprojectivizing parse: number of heads ({n_heads}), " E082 = ("Error deprojectivizing parse: number of heads ({n_heads}), "
"projective heads ({n_proj_heads}) and labels ({n_labels}) do not " "projective heads ({n_proj_heads}) and labels ({n_labels}) do not "
"match.") "match.")
@ -324,8 +304,6 @@ class Errors(object):
"`getter` (plus optional `setter`) is allowed. Got: {nr_defined}") "`getter` (plus optional `setter`) is allowed. Got: {nr_defined}")
E084 = ("Error assigning label ID {label} to span: not in StringStore.") E084 = ("Error assigning label ID {label} to span: not in StringStore.")
E085 = ("Can't create lexeme for string '{string}'.") E085 = ("Can't create lexeme for string '{string}'.")
E086 = ("Error deserializing lexeme '{string}': orth ID {orth_id} does "
"not match hash {hash_id} in StringStore.")
E087 = ("Unknown displaCy style: {style}.") E087 = ("Unknown displaCy style: {style}.")
E088 = ("Text of length {length} exceeds maximum of {max_length}. The " E088 = ("Text of length {length} exceeds maximum of {max_length}. The "
"v2.x parser and NER models require roughly 1GB of temporary " "v2.x parser and NER models require roughly 1GB of temporary "
@ -367,7 +345,6 @@ class Errors(object):
E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A " E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A "
"token can only be part of one entity, so make sure the entities " "token can only be part of one entity, so make sure the entities "
"you're setting don't overlap.") "you're setting don't overlap.")
E104 = ("Can't find JSON schema for '{name}'.")
E105 = ("The Doc.print_tree() method is now deprecated. Please use " E105 = ("The Doc.print_tree() method is now deprecated. Please use "
"Doc.to_json() instead or write your own function.") "Doc.to_json() instead or write your own function.")
E106 = ("Can't find doc._.{attr} attribute specified in the underscore " E106 = ("Can't find doc._.{attr} attribute specified in the underscore "
@ -390,8 +367,6 @@ class Errors(object):
"practically no advantage over pickling the parent Doc directly. " "practically no advantage over pickling the parent Doc directly. "
"So instead of pickling the span, pickle the Doc it belongs to or " "So instead of pickling the span, pickle the Doc it belongs to or "
"use Span.as_doc to convert the span to a standalone Doc object.") "use Span.as_doc to convert the span to a standalone Doc object.")
E113 = ("The newly split token can only have one root (head = 0).")
E114 = ("The newly split token needs to have a root (head = 0).")
E115 = ("All subtokens must have associated heads.") E115 = ("All subtokens must have associated heads.")
E116 = ("Cannot currently add labels to pretrained text classifier. Add " E116 = ("Cannot currently add labels to pretrained text classifier. Add "
"labels before training begins. This functionality was available " "labels before training begins. This functionality was available "
@ -414,12 +389,9 @@ class Errors(object):
"equal to span length ({span_len}).") "equal to span length ({span_len}).")
E122 = ("Cannot find token to be split. Did it get merged?") E122 = ("Cannot find token to be split. Did it get merged?")
E123 = ("Cannot find head of token to be split. Did it get merged?") E123 = ("Cannot find head of token to be split. Did it get merged?")
E124 = ("Cannot read from file: {path}. Supported formats: {formats}")
E125 = ("Unexpected value: {value}") E125 = ("Unexpected value: {value}")
E126 = ("Unexpected matcher predicate: '{bad}'. Expected one of: {good}. " E126 = ("Unexpected matcher predicate: '{bad}'. Expected one of: {good}. "
"This is likely a bug in spaCy, so feel free to open an issue.") "This is likely a bug in spaCy, so feel free to open an issue.")
E127 = ("Cannot create phrase pattern representation for length 0. This "
"is likely a bug in spaCy.")
E128 = ("Unsupported serialization argument: '{arg}'. The use of keyword " E128 = ("Unsupported serialization argument: '{arg}'. The use of keyword "
"arguments to exclude fields from being serialized or deserialized " "arguments to exclude fields from being serialized or deserialized "
"is now deprecated. Please use the `exclude` argument instead. " "is now deprecated. Please use the `exclude` argument instead. "
@ -461,8 +433,6 @@ class Errors(object):
"provided {found}.") "provided {found}.")
E143 = ("Labels for component '{name}' not initialized. Did you forget to " E143 = ("Labels for component '{name}' not initialized. Did you forget to "
"call add_label()?") "call add_label()?")
E144 = ("Could not find parameter `{param}` when building the entity "
"linker model.")
E145 = ("Error reading `{param}` from input file.") E145 = ("Error reading `{param}` from input file.")
E146 = ("Could not access `{path}`.") E146 = ("Could not access `{path}`.")
E147 = ("Unexpected error in the {method} functionality of the " E147 = ("Unexpected error in the {method} functionality of the "
@ -474,8 +444,6 @@ class Errors(object):
"the component matches the model being loaded.") "the component matches the model being loaded.")
E150 = ("The language of the `nlp` object and the `vocab` should be the " E150 = ("The language of the `nlp` object and the `vocab` should be the "
"same, but found '{nlp}' and '{vocab}' respectively.") "same, but found '{nlp}' and '{vocab}' respectively.")
E151 = ("Trying to call nlp.update without required annotation types. "
"Expected top-level keys: {exp}. Got: {unexp}.")
E152 = ("The attribute {attr} is not supported for token patterns. " E152 = ("The attribute {attr} is not supported for token patterns. "
"Please use the option validate=True with Matcher, PhraseMatcher, " "Please use the option validate=True with Matcher, PhraseMatcher, "
"or EntityRuler for more details.") "or EntityRuler for more details.")
@ -512,11 +480,6 @@ class Errors(object):
"that case.") "that case.")
E166 = ("Can only merge DocBins with the same pre-defined attributes.\n" E166 = ("Can only merge DocBins with the same pre-defined attributes.\n"
"Current DocBin: {current}\nOther DocBin: {other}") "Current DocBin: {current}\nOther DocBin: {other}")
E167 = ("Unknown morphological feature: '{feat}' ({feat_id}). This can "
"happen if the tagger was trained with a different set of "
"morphological features. If you're using a pretrained model, make "
"sure that your models are up to date:\npython -m spacy validate")
E168 = ("Unknown field: {field}")
E169 = ("Can't find module: {module}") E169 = ("Can't find module: {module}")
E170 = ("Cannot apply transition {name}: invalid for the current state.") E170 = ("Cannot apply transition {name}: invalid for the current state.")
E171 = ("Matcher.add received invalid on_match callback argument: expected " E171 = ("Matcher.add received invalid on_match callback argument: expected "
@ -527,8 +490,6 @@ class Errors(object):
E173 = ("As of v2.2, the Lemmatizer is initialized with an instance of " E173 = ("As of v2.2, the Lemmatizer is initialized with an instance of "
"Lookups containing the lemmatization tables. See the docs for " "Lookups containing the lemmatization tables. See the docs for "
"details: https://spacy.io/api/lemmatizer#init") "details: https://spacy.io/api/lemmatizer#init")
E174 = ("Architecture '{name}' not found in registry. Available "
"names: {names}")
E175 = ("Can't remove rule for unknown match pattern ID: {key}") E175 = ("Can't remove rule for unknown match pattern ID: {key}")
E176 = ("Alias '{alias}' is not defined in the Knowledge Base.") E176 = ("Alias '{alias}' is not defined in the Knowledge Base.")
E177 = ("Ill-formed IOB input detected: {tag}") E177 = ("Ill-formed IOB input detected: {tag}")
@ -556,9 +517,6 @@ class Errors(object):
"{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.") "{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.")
E186 = ("'{tok_a}' and '{tok_b}' are different texts.") E186 = ("'{tok_a}' and '{tok_b}' are different texts.")
E187 = ("Only unicode strings are supported as labels.") E187 = ("Only unicode strings are supported as labels.")
E188 = ("Could not match the gold entity links to entities in the doc - "
"make sure the gold EL data refers to valid results of the "
"named entity recognizer in the `nlp` pipeline.")
E189 = ("Each argument to `get_doc` should be of equal length.") E189 = ("Each argument to `get_doc` should be of equal length.")
E190 = ("Token head out of range in `Doc.from_array()` for token index " E190 = ("Token head out of range in `Doc.from_array()` for token index "
"'{index}' with value '{value}' (equivalent to relative head " "'{index}' with value '{value}' (equivalent to relative head "
@ -578,12 +536,32 @@ class Errors(object):
E197 = ("Row out of bounds, unable to add row {row} for key {key}.") E197 = ("Row out of bounds, unable to add row {row} for key {key}.")
E198 = ("Unable to return {n} most similar vectors for the current vectors " E198 = ("Unable to return {n} most similar vectors for the current vectors "
"table, which contains {n_rows} vectors.") "table, which contains {n_rows} vectors.")
E199 = ("Unable to merge 0-length span at doc[{start}:{end}].")
# TODO: fix numbering after merging develop into master # TODO: fix numbering after merging develop into master
E983 = ("Invalid key for '{dict_name}': {key}. Available keys: " E970 = ("Can not execute command '{str_command}'. Do you have '{tool}' installed?")
E971 = ("Found incompatible lengths in Doc.from_array: {array_length} for the "
"array and {doc_length} for the Doc itself.")
E972 = ("Example.__init__ got None for '{arg}'. Requires Doc.")
E973 = ("Unexpected type for NER data")
E974 = ("Unknown {obj} attribute: {key}")
E975 = ("The method Example.from_dict expects a Doc as first argument, "
"but got {type}")
E976 = ("The method Example.from_dict expects a dict as second argument, "
"but received None.")
E977 = ("Can not compare a MorphAnalysis with a string object. "
"This is likely a bug in spaCy, so feel free to open an issue.")
E978 = ("The {method} method of component {name} takes a list of Example objects, "
"but found {types} instead.")
E979 = ("Cannot convert {type} to an Example object.")
E980 = ("Each link annotation should refer to a dictionary with at most one "
"identifier mapping to 1.0, and all others to 0.0.")
E981 = ("The offsets of the annotations for 'links' need to refer exactly "
"to the offsets of the 'entities' annotations.")
E982 = ("The 'ent_iob' attribute of a Token should be an integer indexing "
"into {values}, but found {value}.")
E983 = ("Invalid key for '{dict}': {key}. Available keys: "
"{keys}") "{keys}")
E984 = ("Could not parse the {input} - double check the data is written "
"in the correct format as expected by spaCy.")
E985 = ("The pipeline component '{component}' is already available in the base " E985 = ("The pipeline component '{component}' is already available in the base "
"model. The settings in the component block in the config file are " "model. The settings in the component block in the config file are "
"being ignored. If you want to replace this component instead, set " "being ignored. If you want to replace this component instead, set "
@ -615,22 +593,13 @@ class Errors(object):
E997 = ("Tokenizer special cases are not allowed to modify the text. " E997 = ("Tokenizer special cases are not allowed to modify the text. "
"This would map '{chunk}' to '{orth}' given token attributes " "This would map '{chunk}' to '{orth}' given token attributes "
"'{token_attrs}'.") "'{token_attrs}'.")
E998 = ("To create GoldParse objects from Example objects without a "
"Doc, get_gold_parses() should be called with a Vocab object.")
E999 = ("Encountered an unexpected format for the dictionary holding "
"gold annotations: {gold_dict}")
@add_codes @add_codes
class TempErrors(object): class TempErrors(object):
T003 = ("Resizing pretrained Tagger models is not currently supported.") T003 = ("Resizing pretrained Tagger models is not currently supported.")
T004 = ("Currently parser depth is hard-coded to 1. Received: {value}.")
T007 = ("Can't yet set {attr} from Span. Vote for this feature on the " T007 = ("Can't yet set {attr} from Span. Vote for this feature on the "
"issue tracker: http://github.com/explosion/spaCy/issues") "issue tracker: http://github.com/explosion/spaCy/issues")
T008 = ("Bad configuration of Tagger. This is probably a bug within "
"spaCy. We changed the name of an internal attribute for loading "
"pretrained vectors, and the class has been passed the old name "
"(pretrained_dims) but not the new name (pretrained_vectors).")
# fmt: on # fmt: on

View File

@ -1,68 +0,0 @@
from cymem.cymem cimport Pool
from .typedefs cimport attr_t
from .syntax.transition_system cimport Transition
from .tokens import Doc
cdef struct GoldParseC:
int* tags
int* heads
int* has_dep
int* sent_start
attr_t* labels
int** brackets
Transition* ner
cdef class GoldParse:
cdef Pool mem
cdef GoldParseC c
cdef readonly TokenAnnotation orig
cdef int length
cdef public int loss
cdef public list words
cdef public list tags
cdef public list pos
cdef public list morphs
cdef public list lemmas
cdef public list sent_starts
cdef public list heads
cdef public list labels
cdef public dict orths
cdef public list ner
cdef public dict brackets
cdef public dict cats
cdef public dict links
cdef readonly list cand_to_gold
cdef readonly list gold_to_cand
cdef class TokenAnnotation:
cdef public list ids
cdef public list words
cdef public list tags
cdef public list pos
cdef public list morphs
cdef public list lemmas
cdef public list heads
cdef public list deps
cdef public list entities
cdef public list sent_starts
cdef public dict brackets_by_start
cdef class DocAnnotation:
cdef public object cats
cdef public object links
cdef class Example:
cdef public object doc
cdef public TokenAnnotation token_annotation
cdef public DocAnnotation doc_annotation
cdef public object goldparse

File diff suppressed because it is too large

0
spacy/gold/__init__.pxd Normal file
View File

11
spacy/gold/__init__.py Normal file
View File

@ -0,0 +1,11 @@
from .corpus import Corpus
from .example import Example
from .align import align
from .iob_utils import iob_to_biluo, biluo_to_iob
from .iob_utils import biluo_tags_from_offsets, offsets_from_biluo_tags
from .iob_utils import spans_from_biluo_tags
from .iob_utils import tags_to_entities
from .gold_io import docs_to_json
from .gold_io import read_json_file

8
spacy/gold/align.pxd Normal file
View File

@ -0,0 +1,8 @@
cdef class Alignment:
cdef public object cost
cdef public object i2j
cdef public object j2i
cdef public object i2j_multi
cdef public object j2i_multi
cdef public object cand_to_gold
cdef public object gold_to_cand

101
spacy/gold/align.pyx Normal file
View File

@ -0,0 +1,101 @@
import numpy
from ..errors import Errors, AlignmentError
cdef class Alignment:
def __init__(self, spacy_words, gold_words):
# Do many-to-one alignment for misaligned tokens.
# If we over-segment, we'll have one gold word that covers a sequence
# of predicted words
# If we under-segment, we'll have one predicted word that covers a
# sequence of gold words.
# If we "mis-segment", we'll have a sequence of predicted words covering
# a sequence of gold words. That's many-to-many -- we don't do that
# except for NER spans where the start and end can be aligned.
cost, i2j, j2i, i2j_multi, j2i_multi = align(spacy_words, gold_words)
self.cost = cost
self.i2j = i2j
self.j2i = j2i
self.i2j_multi = i2j_multi
self.j2i_multi = j2i_multi
self.cand_to_gold = [(j if j >= 0 else None) for j in i2j]
self.gold_to_cand = [(i if i >= 0 else None) for i in j2i]
def align(tokens_a, tokens_b):
"""Calculate alignment tables between two tokenizations.
tokens_a (List[str]): The candidate tokenization.
tokens_b (List[str]): The reference tokenization.
RETURNS: (tuple): A 5-tuple consisting of the following information:
* cost (int): The number of misaligned tokens.
* a2b (List[int]): Mapping of indices in `tokens_a` to indices in `tokens_b`.
For instance, if `a2b[4] == 6`, that means that `tokens_a[4]` aligns
to `tokens_b[6]`. If there's no one-to-one alignment for a token,
it has the value -1.
* b2a (List[int]): The same as `a2b`, but mapping the other direction.
* a2b_multi (Dict[int, int]): A dictionary mapping indices in `tokens_a`
to indices in `tokens_b`, where multiple tokens of `tokens_a` align to
the same token of `tokens_b`.
* b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other
direction.
"""
tokens_a = _normalize_for_alignment(tokens_a)
tokens_b = _normalize_for_alignment(tokens_b)
cost = 0
a2b = numpy.empty(len(tokens_a), dtype="i")
b2a = numpy.empty(len(tokens_b), dtype="i")
a2b.fill(-1)
b2a.fill(-1)
a2b_multi = {}
b2a_multi = {}
i = 0
j = 0
offset_a = 0
offset_b = 0
while i < len(tokens_a) and j < len(tokens_b):
a = tokens_a[i][offset_a:]
b = tokens_b[j][offset_b:]
if a == b:
if offset_a == offset_b == 0:
a2b[i] = j
b2a[j] = i
elif offset_a == 0:
cost += 2
a2b_multi[i] = j
elif offset_b == 0:
cost += 2
b2a_multi[j] = i
offset_a = offset_b = 0
i += 1
j += 1
elif a == "":
assert offset_a == 0
cost += 1
i += 1
elif b == "":
assert offset_b == 0
cost += 1
j += 1
elif b.startswith(a):
cost += 1
if offset_a == 0:
a2b_multi[i] = j
i += 1
offset_a = 0
offset_b += len(a)
elif a.startswith(b):
cost += 1
if offset_b == 0:
b2a_multi[j] = i
j += 1
offset_b = 0
offset_a += len(b)
else:
assert "".join(tokens_a) != "".join(tokens_b)
raise AlignmentError(Errors.E186.format(tok_a=tokens_a, tok_b=tokens_b))
return cost, a2b, b2a, a2b_multi, b2a_multi
def _normalize_for_alignment(tokens):
return [w.replace(" ", "").lower() for w in tokens]
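For reference, a minimal sketch of how the alignment behaves when the reference tokenization over-segments a word. This is only an illustration; it assumes `align` is imported via the package `__init__.py` shown above.

from spacy.gold import align

cand = ["her", "genome", "sequencing", "."]      # candidate tokenization
gold = ["her", "genome", "sequenc", "ing", "."]  # reference tokenization
cost, a2b, b2a, a2b_multi, b2a_multi = align(cand, gold)
# "her", "genome" and "." align one-to-one, so a2b == [0, 1, -1, 4].
# "sequencing" covers two gold tokens, so b2a_multi == {2: 2, 3: 2}
# and the mismatch is reflected in a non-zero cost.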

111
spacy/gold/augment.py Normal file
View File

@ -0,0 +1,111 @@
import random
import itertools
def make_orth_variants_example(nlp, example, orth_variant_level=0.0): # TODO: naming
raw_text = example.text
orig_dict = example.to_dict()
variant_text, variant_token_annot = make_orth_variants(
nlp, raw_text, orig_dict["token_annotation"], orth_variant_level
)
doc = nlp.make_doc(variant_text)
orig_dict["token_annotation"] = variant_token_annot
return example.from_dict(doc, orig_dict)
def make_orth_variants(nlp, raw_text, orig_token_dict, orth_variant_level=0.0):
if random.random() >= orth_variant_level:
return raw_text, orig_token_dict
if not orig_token_dict:
return raw_text, orig_token_dict
raw = raw_text
token_dict = orig_token_dict
lower = False
if random.random() >= 0.5:
lower = True
if raw is not None:
raw = raw.lower()
ndsv = nlp.Defaults.single_orth_variants
ndpv = nlp.Defaults.paired_orth_variants
words = token_dict.get("words", [])
tags = token_dict.get("tags", [])
# keep unmodified if words or tags are not defined
if words and tags:
if lower:
words = [w.lower() for w in words]
# single variants
punct_choices = [random.choice(x["variants"]) for x in ndsv]
for word_idx in range(len(words)):
for punct_idx in range(len(ndsv)):
if (
tags[word_idx] in ndsv[punct_idx]["tags"]
and words[word_idx] in ndsv[punct_idx]["variants"]
):
words[word_idx] = punct_choices[punct_idx]
# paired variants
punct_choices = [random.choice(x["variants"]) for x in ndpv]
for word_idx in range(len(words)):
for punct_idx in range(len(ndpv)):
if tags[word_idx] in ndpv[punct_idx]["tags"] and words[
word_idx
] in itertools.chain.from_iterable(ndpv[punct_idx]["variants"]):
# backup option: random left vs. right from pair
pair_idx = random.choice([0, 1])
# best option: rely on paired POS tags like `` / ''
if len(ndpv[punct_idx]["tags"]) == 2:
pair_idx = ndpv[punct_idx]["tags"].index(tags[word_idx])
# next best option: rely on position in variants
# (may not be unambiguous, so order of variants matters)
else:
for pair in ndpv[punct_idx]["variants"]:
if words[word_idx] in pair:
pair_idx = pair.index(words[word_idx])
words[word_idx] = punct_choices[punct_idx][pair_idx]
token_dict["words"] = words
token_dict["tags"] = tags
# modify raw
if raw is not None:
variants = []
for single_variants in ndsv:
variants.extend(single_variants["variants"])
for paired_variants in ndpv:
variants.extend(
list(itertools.chain.from_iterable(paired_variants["variants"]))
)
# store variants in reverse length order to be able to prioritize
# longer matches (e.g., "---" before "--")
variants = sorted(variants, key=lambda x: len(x))
variants.reverse()
variant_raw = ""
raw_idx = 0
# add initial whitespace
while raw_idx < len(raw) and raw[raw_idx].isspace():
variant_raw += raw[raw_idx]
raw_idx += 1
for word in words:
match_found = False
# skip whitespace words
if word.isspace():
match_found = True
# add identical word
elif word not in variants and raw[raw_idx:].startswith(word):
variant_raw += word
raw_idx += len(word)
match_found = True
# add variant word
else:
for variant in variants:
if not match_found and raw[raw_idx:].startswith(variant):
raw_idx += len(variant)
variant_raw += word
match_found = True
# something went wrong, abort
# (add a warning message?)
if not match_found:
return raw_text, orig_token_dict
# add following whitespace
while raw_idx < len(raw) and raw[raw_idx].isspace():
variant_raw += raw[raw_idx]
raw_idx += 1
raw = variant_raw
return raw, token_dict
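A rough sketch of applying this augmenter to a single Example during training. Illustrative only: it assumes an existing `nlp` object whose `Defaults` define `single_orth_variants`/`paired_orth_variants` (for example the English defaults) and an existing `example`.

from spacy.gold.augment import make_orth_variants_example

# `nlp` and `example` are assumed to exist already (see note above).
augmented = make_orth_variants_example(nlp, example, orth_variant_level=0.5)
# Roughly half the time the text and token annotations are rewritten with
# alternative punctuation/casing variants; otherwise the original
# annotations are passed through unchanged.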

View File

@ -0,0 +1,6 @@
from .iob2docs import iob2docs # noqa: F401
from .conll_ner2docs import conll_ner2docs # noqa: F401
from .json2docs import json2docs
# TODO: Update this one
# from .conllu2docs import conllu2docs # noqa: F401

View File

@ -1,17 +1,18 @@
 from wasabi import Printer

+from .. import tags_to_entities
 from ...gold import iob_to_biluo
 from ...lang.xx import MultiLanguage
-from ...tokens.doc import Doc
+from ...tokens import Doc, Span
 from ...util import load_model


-def conll_ner2json(
+def conll_ner2docs(
     input_data, n_sents=10, seg_sents=False, model=None, no_print=False, **kwargs
 ):
     """
     Convert files in the CoNLL-2003 NER format and similar
-    whitespace-separated columns into JSON format for use with train cli.
+    whitespace-separated columns into Doc objects.

     The first column is the tokens, the final column is the IOB tags. If an
     additional second column is present, the second column is the tags.
@ -81,17 +82,25 @@ def conll_ner2json(
             "No document delimiters found. Use `-n` to automatically group "
             "sentences into documents."
         )
+    if model:
+        nlp = load_model(model)
+    else:
+        nlp = MultiLanguage()
+
     output_docs = []
-    for doc in input_data.strip().split(doc_delimiter):
-        doc = doc.strip()
-        if not doc:
+    for conll_doc in input_data.strip().split(doc_delimiter):
+        conll_doc = conll_doc.strip()
+        if not conll_doc:
             continue
-        output_doc = []
-        for sent in doc.split("\n\n"):
-            sent = sent.strip()
-            if not sent:
+        words = []
+        sent_starts = []
+        pos_tags = []
+        biluo_tags = []
+        for conll_sent in conll_doc.split("\n\n"):
+            conll_sent = conll_sent.strip()
+            if not conll_sent:
                 continue
-            lines = [line.strip() for line in sent.split("\n") if line.strip()]
+            lines = [line.strip() for line in conll_sent.split("\n") if line.strip()]
             cols = list(zip(*[line.split() for line in lines]))
             if len(cols) < 2:
                 raise ValueError(
@ -99,25 +108,19 @@ def conll_ner2json(
                     "Try checking whitespace and delimiters. See "
                     "https://spacy.io/api/cli#convert"
                 )
-            words = cols[0]
-            iob_ents = cols[-1]
-            if len(cols) > 2:
-                tags = cols[1]
-            else:
-                tags = ["-"] * len(words)
-            biluo_ents = iob_to_biluo(iob_ents)
-            output_doc.append(
-                {
-                    "tokens": [
-                        {"orth": w, "tag": tag, "ner": ent}
-                        for (w, tag, ent) in zip(words, tags, biluo_ents)
-                    ]
-                }
-            )
-        output_docs.append(
-            {"id": len(output_docs), "paragraphs": [{"sentences": output_doc}]}
-        )
-        output_doc = []
+            length = len(cols[0])
+            words.extend(cols[0])
+            sent_starts.extend([True] + [False] * (length - 1))
+            biluo_tags.extend(iob_to_biluo(cols[-1]))
+            pos_tags.extend(cols[1] if len(cols) > 2 else ["-"] * length)
+
+        doc = Doc(nlp.vocab, words=words)
+        for i, token in enumerate(doc):
+            token.tag_ = pos_tags[i]
+            token.is_sent_start = sent_starts[i]
+        entities = tags_to_entities(biluo_tags)
+        doc.ents = [Span(doc, start=s, end=e + 1, label=L) for L, s, e in entities]
+        output_docs.append(doc)
     return output_docs

View File

@ -1,10 +1,10 @@
 import re

+from .conll_ner2docs import n_sents_info
 from ...gold import Example
-from ...gold import iob_to_biluo, spans_from_biluo_tags, biluo_tags_from_offsets
+from ...gold import iob_to_biluo, spans_from_biluo_tags
 from ...language import Language
 from ...tokens import Doc, Token
-from .conll_ner2json import n_sents_info
 from wasabi import Printer

@ -12,7 +12,6 @@ def conllu2json(
     input_data,
     n_sents=10,
     append_morphology=False,
-    lang=None,
     ner_map=None,
     merge_subtokens=False,
     no_print=False,
@ -44,10 +43,7 @@ def conllu2json(
             raw += example.text
         sentences.append(
             generate_sentence(
-                example.token_annotation,
-                has_ner_tags,
-                MISC_NER_PATTERN,
-                ner_map=ner_map,
+                example.to_dict(), has_ner_tags, MISC_NER_PATTERN, ner_map=ner_map,
             )
         )
         # Real-sized documents could be extracted using the comments on the
@ -145,21 +141,22 @@ def get_entities(lines, tag_pattern, ner_map=None):
     return iob_to_biluo(iob)


-def generate_sentence(token_annotation, has_ner_tags, tag_pattern, ner_map=None):
+def generate_sentence(example_dict, has_ner_tags, tag_pattern, ner_map=None):
     sentence = {}
     tokens = []
-    for i, id_ in enumerate(token_annotation.ids):
+    token_annotation = example_dict["token_annotation"]
+    for i, id_ in enumerate(token_annotation["ids"]):
         token = {}
         token["id"] = id_
-        token["orth"] = token_annotation.get_word(i)
-        token["tag"] = token_annotation.get_tag(i)
-        token["pos"] = token_annotation.get_pos(i)
-        token["lemma"] = token_annotation.get_lemma(i)
-        token["morph"] = token_annotation.get_morph(i)
-        token["head"] = token_annotation.get_head(i) - id_
-        token["dep"] = token_annotation.get_dep(i)
+        token["orth"] = token_annotation["words"][i]
+        token["tag"] = token_annotation["tags"][i]
+        token["pos"] = token_annotation["pos"][i]
+        token["lemma"] = token_annotation["lemmas"][i]
+        token["morph"] = token_annotation["morphs"][i]
+        token["head"] = token_annotation["heads"][i] - i
+        token["dep"] = token_annotation["deps"][i]
         if has_ner_tags:
-            token["ner"] = token_annotation.get_entity(i)
+            token["ner"] = example_dict["doc_annotation"]["entities"][i]
         tokens.append(token)
     sentence["tokens"] = tokens
     return sentence
@ -267,40 +264,25 @@ def example_from_conllu_sentence(
         doc = merge_conllu_subtokens(lines, doc)

     # create Example from custom Doc annotation
-    ids, words, tags, heads, deps = [], [], [], [], []
-    pos, lemmas, morphs, spaces = [], [], [], []
+    words, spaces, tags, morphs, lemmas = [], [], [], [], []
     for i, t in enumerate(doc):
-        ids.append(i)
         words.append(t._.merged_orth)
+        lemmas.append(t._.merged_lemma)
+        spaces.append(t._.merged_spaceafter)
+        morphs.append(t._.merged_morph)
         if append_morphology and t._.merged_morph:
             tags.append(t.tag_ + "__" + t._.merged_morph)
         else:
             tags.append(t.tag_)
-        pos.append(t.pos_)
-        morphs.append(t._.merged_morph)
-        lemmas.append(t._.merged_lemma)
-        heads.append(t.head.i)
-        deps.append(t.dep_)
-        spaces.append(t._.merged_spaceafter)
-    ent_offsets = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
-    ents = biluo_tags_from_offsets(doc, ent_offsets)
-    raw = ""
-    for word, space in zip(words, spaces):
-        raw += word
-        if space:
-            raw += " "
-    example = Example(doc=raw)
-    example.set_token_annotation(
-        ids=ids,
-        words=words,
-        tags=tags,
-        pos=pos,
-        morphs=morphs,
-        lemmas=lemmas,
-        heads=heads,
-        deps=deps,
-        entities=ents,
-    )
+
+    doc_x = Doc(vocab, words=words, spaces=spaces)
+    ref_dict = Example(doc_x, reference=doc).to_dict()
+    ref_dict["words"] = words
+    ref_dict["lemmas"] = lemmas
+    ref_dict["spaces"] = spaces
+    ref_dict["tags"] = tags
+    ref_dict["morphs"] = morphs
+    example = Example.from_dict(doc_x, ref_dict)
     return example

View File

@ -0,0 +1,64 @@
from wasabi import Printer
from .conll_ner2docs import n_sents_info
from ...gold import iob_to_biluo, tags_to_entities
from ...tokens import Doc, Span
from ...util import minibatch
def iob2docs(input_data, vocab, n_sents=10, no_print=False, *args, **kwargs):
"""
Convert IOB files with one sentence per line and tags separated with '|'
into Doc objects so they can be saved. IOB and IOB2 are accepted.
Sample formats:
I|O like|O London|I-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
I|O like|O London|B-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
I|PRP|O like|VBP|O London|NNP|I-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
I|PRP|O like|VBP|O London|NNP|B-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
"""
msg = Printer(no_print=no_print)
if n_sents > 0:
n_sents_info(msg, n_sents)
docs = read_iob(input_data.split("\n"), vocab, n_sents)
return docs
def read_iob(raw_sents, vocab, n_sents):
docs = []
for group in minibatch(raw_sents, size=n_sents):
tokens = []
words = []
tags = []
iob = []
sent_starts = []
for line in group:
if not line.strip():
continue
sent_tokens = [t.split("|") for t in line.split()]
if len(sent_tokens[0]) == 3:
sent_words, sent_tags, sent_iob = zip(*sent_tokens)
elif len(sent_tokens[0]) == 2:
sent_words, sent_iob = zip(*sent_tokens)
sent_tags = ["-"] * len(sent_words)
else:
raise ValueError(
"The sentence-per-line IOB/IOB2 file is not formatted correctly. Try checking whitespace and delimiters. See https://spacy.io/api/cli#convert"
)
words.extend(sent_words)
tags.extend(sent_tags)
iob.extend(sent_iob)
tokens.extend(sent_tokens)
sent_starts.append(True)
sent_starts.extend([False for _ in sent_words[1:]])
doc = Doc(vocab, words=words)
for i, tag in enumerate(tags):
doc[i].tag_ = tag
for i, sent_start in enumerate(sent_starts):
doc[i].is_sent_start = sent_start
biluo = iob_to_biluo(iob)
entities = tags_to_entities(biluo)
doc.ents = [Span(doc, start=s, end=e+1, label=L) for (L, s, e) in entities]
docs.append(doc)
return docs
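A minimal usage sketch for this converter. It assumes the converters package is importable as `spacy.gold.converters` (per the package `__init__.py` shown above); the input string mirrors the sample formats in the docstring.

from spacy.vocab import Vocab
from spacy.gold.converters import iob2docs

data = "I|O like|O London|B-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O"
docs = iob2docs(data, Vocab(), n_sents=10)
# One Doc is produced per group of up to n_sents lines; after the BILUO
# conversion, doc.ents should end up with "London" and "New York City"
# as GPE spans.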

View File

@ -0,0 +1,24 @@
import srsly
from ..gold_io import json_iterate, json_to_annotations
from ..example import annotations2doc
from ..example import _fix_legacy_dict_data, _parse_example_dict_data
from ...util import load_model
from ...lang.xx import MultiLanguage
def json2docs(input_data, model=None, **kwargs):
nlp = load_model(model) if model is not None else MultiLanguage()
if not isinstance(input_data, bytes):
if not isinstance(input_data, str):
input_data = srsly.json_dumps(input_data)
input_data = input_data.encode("utf8")
docs = []
for json_doc in json_iterate(input_data):
for json_para in json_to_annotations(json_doc):
example_dict = _fix_legacy_dict_data(json_para)
tok_dict, doc_dict = _parse_example_dict_data(example_dict)
if json_para.get("raw"):
assert tok_dict.get("SPACY")
doc = annotations2doc(nlp.vocab, tok_dict, doc_dict)
docs.append(doc)
return docs
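A short sketch of converting existing v2-style JSON training data with this helper. The file name is illustrative, the import path assumes `spacy.gold.converters`, and the data must follow the "paragraphs"/"sentences" schema that docs_to_json() produces.

import srsly
from spacy.gold.converters import json2docs

json_data = srsly.read_json("train.json")  # placeholder path
docs = json2docs(json_data)  # falls back to MultiLanguage() when no model is passed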

122
spacy/gold/corpus.py Normal file
View File

@ -0,0 +1,122 @@
import random
from .. import util
from .example import Example
from ..tokens import DocBin, Doc
class Corpus:
"""An annotated corpus, reading train and dev datasets from
the DocBin (.spacy) format.
DOCS: https://spacy.io/api/goldcorpus
"""
def __init__(self, train_loc, dev_loc, limit=0):
"""Create a Corpus.
train (str / Path): File or directory of training data.
dev (str / Path): File or directory of development data.
limit (int): Max. number of examples returned
RETURNS (Corpus): The newly created object.
"""
self.train_loc = train_loc
self.dev_loc = dev_loc
self.limit = limit
@staticmethod
def walk_corpus(path):
path = util.ensure_path(path)
if not path.is_dir():
return [path]
paths = [path]
locs = []
seen = set()
for path in paths:
if str(path) in seen:
continue
seen.add(str(path))
if path.parts[-1].startswith("."):
continue
elif path.is_dir():
paths.extend(path.iterdir())
elif path.parts[-1].endswith(".spacy"):
locs.append(path)
return locs
def make_examples(self, nlp, reference_docs, max_length=0):
for reference in reference_docs:
if len(reference) >= max_length >= 1:
if reference.is_sentenced:
for ref_sent in reference.sents:
yield Example(
nlp.make_doc(ref_sent.text),
ref_sent.as_doc()
)
else:
yield Example(
nlp.make_doc(reference.text),
reference
)
def make_examples_gold_preproc(self, nlp, reference_docs):
for reference in reference_docs:
if reference.is_sentenced:
ref_sents = [sent.as_doc() for sent in reference.sents]
else:
ref_sents = [reference]
for ref_sent in ref_sents:
yield Example(
Doc(
nlp.vocab,
words=[w.text for w in ref_sent],
spaces=[bool(w.whitespace_) for w in ref_sent]
),
ref_sent
)
def read_docbin(self, vocab, locs):
""" Yield training examples as example dicts """
i = 0
for loc in locs:
loc = util.ensure_path(loc)
if loc.parts[-1].endswith(".spacy"):
with loc.open("rb") as file_:
doc_bin = DocBin().from_bytes(file_.read())
docs = doc_bin.get_docs(vocab)
for doc in docs:
if len(doc):
yield doc
i += 1
if self.limit >= 1 and i >= self.limit:
break
def count_train(self, nlp):
"""Returns count of words in train examples"""
n = 0
i = 0
for example in self.train_dataset(nlp):
n += len(example.predicted)
if self.limit >= 0 and i >= self.limit:
break
i += 1
return n
def train_dataset(self, nlp, *, shuffle=True, gold_preproc=False,
max_length=0, **kwargs):
ref_docs = self.read_docbin(nlp.vocab, self.walk_corpus(self.train_loc))
if gold_preproc:
examples = self.make_examples_gold_preproc(nlp, ref_docs)
else:
examples = self.make_examples(nlp, ref_docs, max_length)
if shuffle:
examples = list(examples)
random.shuffle(examples)
yield from examples
def dev_dataset(self, nlp, *, gold_preproc=False, **kwargs):
ref_docs = self.read_docbin(nlp.vocab, self.walk_corpus(self.dev_loc))
if gold_preproc:
examples = self.make_examples_gold_preproc(nlp, ref_docs)
else:
examples = self.make_examples(nlp, ref_docs, max_length=0)
yield from examples
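A minimal sketch of reading a corpus with this class. The .spacy paths are placeholders and must point at serialized DocBin files.

import spacy
from spacy.gold import Corpus

nlp = spacy.blank("en")
corpus = Corpus("./train.spacy", "./dev.spacy", limit=0)  # placeholder paths
dev_examples = list(corpus.dev_dataset(nlp, gold_preproc=True))
# With gold_preproc=True each reference sentence becomes its own Example,
# whose predicted side is a Doc rebuilt from the reference tokens and spaces.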

8
spacy/gold/example.pxd Normal file
View File

@ -0,0 +1,8 @@
from ..tokens.doc cimport Doc
from .align cimport Alignment
cdef class Example:
cdef readonly Doc x
cdef readonly Doc y
cdef readonly Alignment _alignment

432
spacy/gold/example.pyx Normal file
View File

@ -0,0 +1,432 @@
import warnings
import numpy
from ..tokens.doc cimport Doc
from ..tokens.span cimport Span
from ..tokens.span import Span
from ..attrs import IDS
from .align cimport Alignment
from .iob_utils import biluo_to_iob, biluo_tags_from_offsets, biluo_tags_from_doc
from .iob_utils import spans_from_biluo_tags
from .align import Alignment
from ..errors import Errors, Warnings
from ..syntax import nonproj
cpdef Doc annotations2doc(vocab, tok_annot, doc_annot):
""" Create a Doc from dictionaries with token and doc annotations. Assumes ORTH & SPACY are set. """
attrs, array = _annot2array(vocab, tok_annot, doc_annot)
output = Doc(vocab, words=tok_annot["ORTH"], spaces=tok_annot["SPACY"])
if "entities" in doc_annot:
_add_entities_to_doc(output, doc_annot["entities"])
if array.size:
output = output.from_array(attrs, array)
# links are currently added with ENT_KB_ID on the token level
output.cats.update(doc_annot.get("cats", {}))
return output
cdef class Example:
def __init__(self, Doc predicted, Doc reference, *, Alignment alignment=None):
""" Doc can either be text, or an actual Doc """
if predicted is None:
raise TypeError(Errors.E972.format(arg="predicted"))
if reference is None:
raise TypeError(Errors.E972.format(arg="reference"))
self.x = predicted
self.y = reference
self._alignment = alignment
property predicted:
def __get__(self):
return self.x
def __set__(self, doc):
self.x = doc
property reference:
def __get__(self):
return self.y
def __set__(self, doc):
self.y = doc
def copy(self):
return Example(
self.x.copy(),
self.y.copy()
)
@classmethod
def from_dict(cls, Doc predicted, dict example_dict):
if example_dict is None:
raise ValueError(Errors.E976)
if not isinstance(predicted, Doc):
raise TypeError(Errors.E975.format(type=type(predicted)))
example_dict = _fix_legacy_dict_data(example_dict)
tok_dict, doc_dict = _parse_example_dict_data(example_dict)
if "ORTH" not in tok_dict:
tok_dict["ORTH"] = [tok.text for tok in predicted]
tok_dict["SPACY"] = [tok.whitespace_ for tok in predicted]
if not _has_field(tok_dict, "SPACY"):
spaces = _guess_spaces(predicted.text, tok_dict["ORTH"])
return Example(
predicted,
annotations2doc(predicted.vocab, tok_dict, doc_dict)
)
@property
def alignment(self):
if self._alignment is None:
spacy_words = [token.orth_ for token in self.predicted]
gold_words = [token.orth_ for token in self.reference]
if gold_words == []:
gold_words = spacy_words
self._alignment = Alignment(spacy_words, gold_words)
return self._alignment
def get_aligned(self, field, as_string=False):
"""Return an aligned array for a token attribute."""
i2j_multi = self.alignment.i2j_multi
cand_to_gold = self.alignment.cand_to_gold
vocab = self.reference.vocab
gold_values = self.reference.to_array([field])
output = [None] * len(self.predicted)
for i, gold_i in enumerate(cand_to_gold):
if self.predicted[i].text.isspace():
output[i] = None
if gold_i is None:
if i in i2j_multi:
output[i] = gold_values[i2j_multi[i]]
else:
output[i] = None
else:
output[i] = gold_values[gold_i]
if as_string and field not in ["ENT_IOB", "SENT_START"]:
output = [vocab.strings[o] if o is not None else o for o in output]
return output
def get_aligned_parse(self, projectivize=True):
cand_to_gold = self.alignment.cand_to_gold
gold_to_cand = self.alignment.gold_to_cand
aligned_heads = [None] * self.x.length
aligned_deps = [None] * self.x.length
heads = [token.head.i for token in self.y]
deps = [token.dep_ for token in self.y]
if projectivize:
heads, deps = nonproj.projectivize(heads, deps)
for cand_i in range(self.x.length):
gold_i = cand_to_gold[cand_i]
if gold_i is not None: # Alignment found
gold_head = gold_to_cand[heads[gold_i]]
if gold_head is not None:
aligned_heads[cand_i] = gold_head
aligned_deps[cand_i] = deps[gold_i]
return aligned_heads, aligned_deps
def get_aligned_ner(self):
if not self.y.is_nered:
return [None] * len(self.x) # should this be 'missing' instead of 'None' ?
x_text = self.x.text
# Get a list of entities, and make spans for non-entity tokens.
# We then work through the spans in order, trying to find them in
# the text and using that to get the offset. Any token that doesn't
# get a tag set this way is tagged None.
# This could maybe be improved? It at least feels easy to reason about.
y_spans = list(self.y.ents)
y_spans.sort()
x_text_offset = 0
x_spans = []
for y_span in y_spans:
if x_text.count(y_span.text) >= 1:
start_char = x_text.index(y_span.text) + x_text_offset
end_char = start_char + len(y_span.text)
x_span = self.x.char_span(start_char, end_char, label=y_span.label)
if x_span is not None:
x_spans.append(x_span)
x_text = self.x.text[end_char:]
x_text_offset = end_char
x_tags = biluo_tags_from_offsets(
self.x,
[(e.start_char, e.end_char, e.label_) for e in x_spans],
missing=None
)
gold_to_cand = self.alignment.gold_to_cand
for token in self.y:
if token.ent_iob_ == "O":
cand_i = gold_to_cand[token.i]
if cand_i is not None and x_tags[cand_i] is None:
x_tags[cand_i] = "O"
i2j_multi = self.alignment.i2j_multi
for i, tag in enumerate(x_tags):
if tag is None and i in i2j_multi:
gold_i = i2j_multi[i]
if gold_i is not None and self.y[gold_i].ent_iob_ == "O":
x_tags[i] = "O"
return x_tags
def to_dict(self):
return {
"doc_annotation": {
"cats": dict(self.reference.cats),
"entities": biluo_tags_from_doc(self.reference),
"links": self._links_to_dict()
},
"token_annotation": {
"ids": [t.i+1 for t in self.reference],
"words": [t.text for t in self.reference],
"tags": [t.tag_ for t in self.reference],
"lemmas": [t.lemma_ for t in self.reference],
"pos": [t.pos_ for t in self.reference],
"morphs": [t.morph_ for t in self.reference],
"heads": [t.head.i for t in self.reference],
"deps": [t.dep_ for t in self.reference],
"sent_starts": [int(bool(t.is_sent_start)) for t in self.reference]
}
}
def _links_to_dict(self):
links = {}
for ent in self.reference.ents:
if ent.kb_id_:
links[(ent.start_char, ent.end_char)] = {ent.kb_id_: 1.0}
return links
def split_sents(self):
""" Split the token annotations into multiple Examples based on
sent_starts and return a list of the new Examples"""
if not self.reference.is_sentenced:
return [self]
sent_starts = self.get_aligned("SENT_START")
sent_starts.append(1) # appending virtual start of a next sentence to facilitate search
output = []
pred_start = 0
for sent in self.reference.sents:
new_ref = sent.as_doc()
pred_end = sent_starts.index(1, pred_start+1) # find where the next sentence starts
new_pred = self.predicted[pred_start : pred_end].as_doc()
output.append(Example(new_pred, new_ref))
pred_start = pred_end
return output
property text:
def __get__(self):
return self.x.text
def __str__(self):
return str(self.to_dict())
def __repr__(self):
return str(self.to_dict())
def _annot2array(vocab, tok_annot, doc_annot):
attrs = []
values = []
for key, value in doc_annot.items():
if value:
if key == "entities":
pass
elif key == "links":
entities = doc_annot.get("entities", {})
if not entities:
raise ValueError(Errors.E981)
ent_kb_ids = _parse_links(vocab, tok_annot["ORTH"], value, entities)
tok_annot["ENT_KB_ID"] = ent_kb_ids
elif key == "cats":
pass
else:
raise ValueError(Errors.E974.format(obj="doc", key=key))
for key, value in tok_annot.items():
if key not in IDS:
raise ValueError(Errors.E974.format(obj="token", key=key))
elif key in ["ORTH", "SPACY"]:
pass
elif key == "HEAD":
attrs.append(key)
values.append([h-i for i, h in enumerate(value)])
elif key == "SENT_START":
attrs.append(key)
values.append(value)
elif key == "MORPH":
attrs.append(key)
values.append([vocab.morphology.add(v) for v in value])
else:
attrs.append(key)
values.append([vocab.strings.add(v) for v in value])
array = numpy.asarray(values, dtype="uint64")
return attrs, array.T
def _add_entities_to_doc(doc, ner_data):
if ner_data is None:
return
elif ner_data == []:
doc.ents = []
elif isinstance(ner_data[0], tuple):
return _add_entities_to_doc(
doc,
biluo_tags_from_offsets(doc, ner_data)
)
elif isinstance(ner_data[0], str) or ner_data[0] is None:
return _add_entities_to_doc(
doc,
spans_from_biluo_tags(doc, ner_data)
)
elif isinstance(ner_data[0], Span):
# Ugh, this is super messy. Really hard to set O entities
doc.ents = ner_data
doc.ents = [span for span in ner_data if span.label_]
else:
raise ValueError(Errors.E973)
def _parse_example_dict_data(example_dict):
return (
example_dict["token_annotation"],
example_dict["doc_annotation"]
)
def _fix_legacy_dict_data(example_dict):
token_dict = example_dict.get("token_annotation", {})
doc_dict = example_dict.get("doc_annotation", {})
for key, value in example_dict.items():
if value:
if key in ("token_annotation", "doc_annotation"):
pass
elif key == "ids":
pass
elif key in ("cats", "links"):
doc_dict[key] = value
elif key in ("ner", "entities"):
doc_dict["entities"] = value
else:
token_dict[key] = value
# Remap keys
remapping = {
"words": "ORTH",
"tags": "TAG",
"pos": "POS",
"lemmas": "LEMMA",
"deps": "DEP",
"heads": "HEAD",
"sent_starts": "SENT_START",
"morphs": "MORPH",
"spaces": "SPACY",
}
old_token_dict = token_dict
token_dict = {}
for key, value in old_token_dict.items():
if key in ("text", "ids", "brackets"):
pass
elif key in remapping:
token_dict[remapping[key]] = value
else:
raise KeyError(Errors.E983.format(key=key, dict="token_annotation", keys=remapping.keys()))
text = example_dict.get("text", example_dict.get("raw"))
if _has_field(token_dict, "ORTH") and not _has_field(token_dict, "SPACY"):
token_dict["SPACY"] = _guess_spaces(text, token_dict["ORTH"])
if "HEAD" in token_dict and "SENT_START" in token_dict:
# If heads are set, we don't also redundantly specify SENT_START.
token_dict.pop("SENT_START")
warnings.warn(Warnings.W092)
return {
"token_annotation": token_dict,
"doc_annotation": doc_dict
}
def _has_field(annot, field):
if field not in annot:
return False
elif annot[field] is None:
return False
elif len(annot[field]) == 0:
return False
elif all([value is None for value in annot[field]]):
return False
else:
return True
def _parse_ner_tags(biluo_or_offsets, vocab, words, spaces):
if isinstance(biluo_or_offsets[0], (list, tuple)):
# Convert to biluo if necessary
# This is annoying but to convert the offsets we need a Doc
# that has the target tokenization.
reference = Doc(vocab, words=words, spaces=spaces)
biluo = biluo_tags_from_offsets(reference, biluo_or_offsets)
else:
biluo = biluo_or_offsets
ent_iobs = []
ent_types = []
for iob_tag in biluo_to_iob(biluo):
if iob_tag in (None, "-"):
ent_iobs.append("")
ent_types.append("")
else:
ent_iobs.append(iob_tag.split("-")[0])
if iob_tag.startswith("I") or iob_tag.startswith("B"):
ent_types.append(iob_tag.split("-", 1)[1])
else:
ent_types.append("")
return ent_iobs, ent_types
def _parse_links(vocab, words, links, entities):
reference = Doc(vocab, words=words)
starts = {token.idx: token.i for token in reference}
ends = {token.idx + len(token): token.i for token in reference}
ent_kb_ids = ["" for _ in reference]
entity_map = [(ent[0], ent[1]) for ent in entities]
# links annotations need to refer 1-1 to entity annotations - throw error otherwise
for index, annot_dict in links.items():
start_char, end_char = index
if (start_char, end_char) not in entity_map:
raise ValueError(Errors.E981)
for index, annot_dict in links.items():
true_kb_ids = []
for key, value in annot_dict.items():
if value == 1.0:
true_kb_ids.append(key)
if len(true_kb_ids) > 1:
raise ValueError(Errors.E980)
if len(true_kb_ids) == 1:
start_char, end_char = index
start_token = starts.get(start_char)
end_token = ends.get(end_char)
for i in range(start_token, end_token+1):
ent_kb_ids[i] = true_kb_ids[0]
return ent_kb_ids
def _guess_spaces(text, words):
if text is None:
return [True] * len(words)
spaces = []
text_pos = 0
# align words with text
for word in words:
try:
word_start = text[text_pos:].index(word)
except ValueError:
spaces.append(True)
continue
text_pos += word_start + len(word)
if text_pos < len(text) and text[text_pos] == " ":
spaces.append(True)
else:
spaces.append(False)
return spaces
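As a quick illustration of the dict format accepted above, a sketch using the legacy-style keys that _fix_legacy_dict_data remaps (token texts, tags, and character-offset entities):

import spacy
from spacy.gold import Example

nlp = spacy.blank("en")
predicted = nlp.make_doc("I like London")
example = Example.from_dict(
    predicted,
    {
        "words": ["I", "like", "London"],
        "tags": ["PRP", "VBP", "NNP"],
        "entities": [(7, 13, "GPE")],  # character offsets into the text
    },
)
ner = example.get_aligned_ner()  # -> ["O", "O", "U-GPE"]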

199
spacy/gold/gold_io.pyx Normal file
View File

@ -0,0 +1,199 @@
import warnings
import srsly
from .. import util
from ..errors import Warnings
from ..tokens import Doc
from .iob_utils import biluo_tags_from_offsets, tags_to_entities
import json
def docs_to_json(docs, doc_id=0, ner_missing_tag="O"):
"""Convert a list of Doc objects into the JSON-serializable format used by
the spacy train command.
docs (iterable / Doc): The Doc object(s) to convert.
doc_id (int): Id for the JSON.
RETURNS (dict): The data in spaCy's JSON format
- each input doc will be treated as a paragraph in the output doc
"""
if isinstance(docs, Doc):
docs = [docs]
json_doc = {"id": doc_id, "paragraphs": []}
for i, doc in enumerate(docs):
json_para = {'raw': doc.text, "sentences": [], "cats": [], "entities": [], "links": []}
for cat, val in doc.cats.items():
json_cat = {"label": cat, "value": val}
json_para["cats"].append(json_cat)
for ent in doc.ents:
ent_tuple = (ent.start_char, ent.end_char, ent.label_)
json_para["entities"].append(ent_tuple)
if ent.kb_id_:
link_dict = {(ent.start_char, ent.end_char): {ent.kb_id_: 1.0}}
json_para["links"].append(link_dict)
ent_offsets = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
biluo_tags = biluo_tags_from_offsets(doc, ent_offsets, missing=ner_missing_tag)
for j, sent in enumerate(doc.sents):
json_sent = {"tokens": [], "brackets": []}
for token in sent:
json_token = {"id": token.i, "orth": token.text, "space": token.whitespace_}
if doc.is_tagged:
json_token["tag"] = token.tag_
json_token["pos"] = token.pos_
json_token["morph"] = token.morph_
json_token["lemma"] = token.lemma_
if doc.is_parsed:
json_token["head"] = token.head.i-token.i
json_token["dep"] = token.dep_
json_sent["tokens"].append(json_token)
json_para["sentences"].append(json_sent)
json_doc["paragraphs"].append(json_para)
return json_doc
def read_json_file(loc, docs_filter=None, limit=None):
"""Read Example dictionaries from a json file or directory."""
loc = util.ensure_path(loc)
if loc.is_dir():
for filename in loc.iterdir():
yield from read_json_file(loc / filename, limit=limit)
else:
with loc.open("rb") as file_:
utf8_str = file_.read()
for json_doc in json_iterate(utf8_str):
if docs_filter is not None and not docs_filter(json_doc):
continue
for json_paragraph in json_to_annotations(json_doc):
yield json_paragraph
def json_to_annotations(doc):
"""Convert an item in the JSON-formatted training data to the format
used by Example.
doc (dict): One entry in the training data.
YIELDS (tuple): The reformatted data - one training example per paragraph
"""
for paragraph in doc["paragraphs"]:
example = {"text": paragraph.get("raw", None)}
words = []
spaces = []
ids = []
tags = []
ner_tags = []
pos = []
morphs = []
lemmas = []
heads = []
labels = []
sent_starts = []
brackets = []
for sent in paragraph["sentences"]:
sent_start_i = len(words)
for i, token in enumerate(sent["tokens"]):
words.append(token["orth"])
spaces.append(token.get("space", None))
ids.append(token.get('id', sent_start_i + i))
tags.append(token.get("tag", None))
pos.append(token.get("pos", None))
morphs.append(token.get("morph", None))
lemmas.append(token.get("lemma", None))
if "head" in token:
heads.append(token["head"] + sent_start_i + i)
else:
heads.append(None)
if "dep" in token:
labels.append(token["dep"])
# Ensure ROOT label is case-insensitive
if labels[-1].lower() == "root":
labels[-1] = "ROOT"
else:
labels.append(None)
ner_tags.append(token.get("ner", None))
if i == 0:
sent_starts.append(1)
else:
sent_starts.append(0)
if "brackets" in sent:
brackets.extend((b["first"] + sent_start_i,
b["last"] + sent_start_i, b["label"])
for b in sent["brackets"])
example["token_annotation"] = dict(
ids=ids,
words=words,
spaces=spaces,
sent_starts=sent_starts,
brackets=brackets
)
# avoid including dummy values that looks like gold info was present
if any(tags):
example["token_annotation"]["tags"] = tags
if any(pos):
example["token_annotation"]["pos"] = pos
if any(morphs):
example["token_annotation"]["morphs"] = morphs
if any(lemmas):
example["token_annotation"]["lemmas"] = lemmas
if any(head is not None for head in heads):
example["token_annotation"]["heads"] = heads
if any(labels):
example["token_annotation"]["deps"] = labels
cats = {}
for cat in paragraph.get("cats", {}):
cats[cat["label"]] = cat["value"]
example["doc_annotation"] = dict(
cats=cats,
entities=ner_tags,
links=paragraph.get("links", [])
)
yield example
def json_iterate(bytes utf8_str):
# We should've made these files jsonl...But since we didn't, parse out
# the docs one-by-one to reduce memory usage.
# It's okay to read in the whole file -- just don't parse it into JSON.
cdef long file_length = len(utf8_str)
if file_length > 2 ** 30:
warnings.warn(Warnings.W027.format(size=file_length))
raw = <char*>utf8_str
cdef int square_depth = 0
cdef int curly_depth = 0
cdef int inside_string = 0
cdef int escape = 0
cdef long start = -1
cdef char c
cdef char quote = ord('"')
cdef char backslash = ord("\\")
cdef char open_square = ord("[")
cdef char close_square = ord("]")
cdef char open_curly = ord("{")
cdef char close_curly = ord("}")
for i in range(file_length):
c = raw[i]
if escape:
escape = False
continue
if c == backslash:
escape = True
continue
if c == quote:
inside_string = not inside_string
continue
if inside_string:
continue
if c == open_square:
square_depth += 1
elif c == close_square:
square_depth -= 1
elif c == open_curly:
if square_depth == 1 and curly_depth == 0:
start = i
curly_depth += 1
elif c == close_curly:
curly_depth -= 1
if square_depth == 1 and curly_depth == 0:
substr = utf8_str[start : i + 1].decode("utf8")
yield srsly.json_loads(substr)
start = -1
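A rough round-trip sketch for the helpers above. Illustrative only: it assumes an existing `nlp` pipeline whose Docs carry sentence boundaries (docs_to_json() iterates doc.sents), and the file name is a placeholder.

import srsly
from spacy.gold import docs_to_json, read_json_file

# `nlp` is assumed to set sentence boundaries (e.g. via a parser or sentencizer).
doc = nlp("Anna lives in Berlin.")
json_doc = docs_to_json([doc], doc_id=0)
srsly.write_json("corpus.json", [json_doc])        # json_iterate expects a top-level list
annotations = list(read_json_file("corpus.json"))  # one annotation dict per paragraph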

209
spacy/gold/iob_utils.py Normal file
View File

@ -0,0 +1,209 @@
import warnings
from ..errors import Errors, Warnings
from ..tokens import Span
def iob_to_biluo(tags):
out = []
tags = list(tags)
while tags:
out.extend(_consume_os(tags))
out.extend(_consume_ent(tags))
return out
def biluo_to_iob(tags):
out = []
for tag in tags:
if tag is None:
out.append(tag)
else:
tag = tag.replace("U-", "B-", 1).replace("L-", "I-", 1)
out.append(tag)
return out
def _consume_os(tags):
while tags and tags[0] == "O":
yield tags.pop(0)
def _consume_ent(tags):
if not tags:
return []
tag = tags.pop(0)
target_in = "I" + tag[1:]
target_last = "L" + tag[1:]
length = 1
while tags and tags[0] in {target_in, target_last}:
length += 1
tags.pop(0)
label = tag[2:]
if length == 1:
if len(label) == 0:
raise ValueError(Errors.E177.format(tag=tag))
return ["U-" + label]
else:
start = "B-" + label
end = "L-" + label
middle = [f"I-{label}" for _ in range(1, length - 1)]
return [start] + middle + [end]
def biluo_tags_from_doc(doc, missing="O"):
return biluo_tags_from_offsets(
doc,
[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents],
missing=missing,
)
def biluo_tags_from_offsets(doc, entities, missing="O"):
"""Encode labelled spans into per-token tags, using the
Begin/In/Last/Unit/Out scheme (BILUO).
doc (Doc): The document that the entity offsets refer to. The output tags
will refer to the token boundaries within the document.
entities (iterable): A sequence of `(start, end, label)` triples. `start`
and `end` should be character-offset integers denoting the slice into
the original string.
RETURNS (list): A list of unicode strings, describing the tags. Each tag
string will be of the form either "", "O" or "{action}-{label}", where
action is one of "B", "I", "L", "U". The string "-" is used where the
entity offsets don't align with the tokenization in the `Doc` object.
The training algorithm will view these as missing values. "O" denotes a
non-entity token. "B" denotes the beginning of a multi-token entity,
"I" the inside of an entity of three or more tokens, and "L" the end
of an entity of two or more tokens. "U" denotes a single-token entity.
EXAMPLE:
>>> text = 'I like London.'
>>> entities = [(len('I like '), len('I like London'), 'LOC')]
>>> doc = nlp.tokenizer(text)
>>> tags = biluo_tags_from_offsets(doc, entities)
>>> assert tags == ["O", "O", 'U-LOC', "O"]
"""
# Ensure no overlapping entity labels exist
tokens_in_ents = {}
starts = {token.idx: token.i for token in doc}
ends = {token.idx + len(token): token.i for token in doc}
biluo = ["-" for _ in doc]
# Handle entity cases
for start_char, end_char, label in entities:
if not label:
for s in starts: # account for many-to-one
if s >= start_char and s < end_char:
biluo[starts[s]] = "O"
else:
for token_index in range(start_char, end_char):
if token_index in tokens_in_ents.keys():
raise ValueError(
Errors.E103.format(
span1=(
tokens_in_ents[token_index][0],
tokens_in_ents[token_index][1],
tokens_in_ents[token_index][2],
),
span2=(start_char, end_char, label),
)
)
tokens_in_ents[token_index] = (start_char, end_char, label)
start_token = starts.get(start_char)
end_token = ends.get(end_char)
# Only interested if the tokenization is correct
if start_token is not None and end_token is not None:
if start_token == end_token:
biluo[start_token] = f"U-{label}"
else:
biluo[start_token] = f"B-{label}"
for i in range(start_token + 1, end_token):
biluo[i] = f"I-{label}"
biluo[end_token] = f"L-{label}"
# Now distinguish the O cases from ones where we miss the tokenization
entity_chars = set()
for start_char, end_char, label in entities:
for i in range(start_char, end_char):
entity_chars.add(i)
for token in doc:
for i in range(token.idx, token.idx + len(token)):
if i in entity_chars:
break
else:
biluo[token.i] = missing
if "-" in biluo and missing != "-":
ent_str = str(entities)
warnings.warn(
Warnings.W030.format(
text=doc.text[:50] + "..." if len(doc.text) > 50 else doc.text,
entities=ent_str[:50] + "..." if len(ent_str) > 50 else ent_str,
)
)
return biluo
def spans_from_biluo_tags(doc, tags):
"""Encode per-token tags following the BILUO scheme into Span object, e.g.
to overwrite the doc.ents.
doc (Doc): The document that the BILUO tags refer to.
entities (iterable): A sequence of BILUO tags with each tag describing one
token. Each tags string will be of the form of either "", "O" or
"{action}-{label}", where action is one of "B", "I", "L", "U".
RETURNS (list): A sequence of Span objects.
"""
token_offsets = tags_to_entities(tags)
spans = []
for label, start_idx, end_idx in token_offsets:
span = Span(doc, start_idx, end_idx + 1, label=label)
spans.append(span)
return spans
def offsets_from_biluo_tags(doc, tags):
"""Encode per-token tags following the BILUO scheme into entity offsets.
doc (Doc): The document that the BILUO tags refer to.
entities (iterable): A sequence of BILUO tags with each tag describing one
token. Each tags string will be of the form of either "", "O" or
"{action}-{label}", where action is one of "B", "I", "L", "U".
RETURNS (list): A sequence of `(start, end, label)` triples. `start` and
`end` will be character-offset integers denoting the slice into the
original string.
"""
spans = spans_from_biluo_tags(doc, tags)
return [(span.start_char, span.end_char, span.label_) for span in spans]
def tags_to_entities(tags):
""" Note that the end index returned by this function is inclusive.
To use it for Span creation, increment the end by 1."""
entities = []
start = None
for i, tag in enumerate(tags):
if tag is None:
continue
if tag.startswith("O"):
# TODO: We shouldn't be getting these malformed inputs. Fix this.
if start is not None:
start = None
else:
entities.append(("", i, i))
continue
elif tag == "-":
continue
elif tag.startswith("I"):
if start is None:
raise ValueError(Errors.E067.format(tags=tags[: i + 1]))
continue
if tag.startswith("U"):
entities.append((tag[2:], i, i))
elif tag.startswith("B"):
start = i
elif tag.startswith("L"):
entities.append((tag[2:], start, i))
start = None
else:
raise ValueError(Errors.E068.format(tag=tag))
return entities
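A small worked example of the two helpers the converters rely on; note that tags_to_entities() returns inclusive end indices, while Span expects an exclusive end.

from spacy.gold import iob_to_biluo, tags_to_entities

iob = ["B-LOC", "I-LOC", "B-GPE"]
biluo = iob_to_biluo(iob)           # ["B-LOC", "L-LOC", "U-GPE"]
entities = tags_to_entities(biluo)  # [("LOC", 0, 1), ("GPE", 2, 2)]
# To build spans: Span(doc, start, end + 1, label=label) for each triple.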

View File

@ -446,6 +446,8 @@ cdef class Writer:
             assert not path.isdir(loc), f"{loc} is directory"
         if isinstance(loc, Path):
             loc = bytes(loc)
+        if path.exists(loc):
+            assert not path.isdir(loc), "%s is directory." % loc
         cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
         self._fp = fopen(<char*>bytes_loc, 'wb')
         if not self._fp:
@ -487,10 +489,10 @@ cdef class Writer:
 cdef class Reader:
     def __init__(self, object loc):
-        assert path.exists(loc)
-        assert not path.isdir(loc)
         if isinstance(loc, Path):
             loc = bytes(loc)
+        assert path.exists(loc)
+        assert not path.isdir(loc)
         cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
         self._fp = fopen(<char*>bytes_loc, 'rb')
         if not self._fp:

View File

@ -20,29 +20,25 @@ def noun_chunks(doclike):
     conj = doc.vocab.strings.add("conj")
     nmod = doc.vocab.strings.add("nmod")
     np_label = doc.vocab.strings.add("NP")
-    seen = set()
+    prev_end = -1
     for i, word in enumerate(doclike):
         if word.pos not in (NOUN, PROPN, PRON):
             continue
         # Prevent nested chunks from being produced
-        if word.i in seen:
+        if word.left_edge.i <= prev_end:
             continue
         if word.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
             flag = False
             if word.pos == NOUN:
                 # check for patterns such as γραμμή παραγωγής
                 for potential_nmod in word.rights:
                     if potential_nmod.dep == nmod:
-                        seen.update(
-                            j for j in range(word.left_edge.i, potential_nmod.i + 1)
-                        )
+                        prev_end = potential_nmod.i
                         yield word.left_edge.i, potential_nmod.i + 1, np_label
                         flag = True
                         break
             if flag is False:
-                seen.update(j for j in range(word.left_edge.i, word.i + 1))
+                prev_end = word.i
                 yield word.left_edge.i, word.i + 1, np_label
         elif word.dep == conj:
             # covers the case: έχει όμορφα και έξυπνα παιδιά
@ -51,9 +47,7 @@ def noun_chunks(doclike):
             head = head.head
         # If the head is an NP, and we're coordinated to it, we're an NP
         if head.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.i + 1))
+            prev_end = word.i
             yield word.left_edge.i, word.i + 1, np_label

View File

@ -25,17 +25,15 @@ def noun_chunks(doclike):
     np_deps = [doc.vocab.strings.add(label) for label in labels]
     conj = doc.vocab.strings.add("conj")
     np_label = doc.vocab.strings.add("NP")
-    seen = set()
+    prev_end = -1
     for i, word in enumerate(doclike):
         if word.pos not in (NOUN, PROPN, PRON):
             continue
         # Prevent nested chunks from being produced
-        if word.i in seen:
+        if word.left_edge.i <= prev_end:
             continue
         if word.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.i + 1))
+            prev_end = word.i
             yield word.left_edge.i, word.i + 1, np_label
         elif word.dep == conj:
             head = word.head
@ -43,9 +41,7 @@ def noun_chunks(doclike):
             head = head.head
         # If the head is an NP, and we're coordinated to it, we're an NP
         if head.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.i + 1))
+            prev_end = word.i
             yield word.left_edge.i, word.i + 1, np_label

View File

@ -136,7 +136,19 @@ for pron in ["he", "she", "it"]:

 # W-words, relative pronouns, prepositions etc.

-for word in ["who", "what", "when", "where", "why", "how", "there", "that"]:
+for word in [
+    "who",
+    "what",
+    "when",
+    "where",
+    "why",
+    "how",
+    "there",
+    "that",
+    "this",
+    "these",
+    "those",
+]:
     for orth in [word, word.title()]:
         _exc[orth + "'s"] = [
             {ORTH: orth, LEMMA: word, NORM: word},
@ -396,6 +408,8 @@ _other_exc = {
         {ORTH: "Let", LEMMA: "let", NORM: "let"},
         {ORTH: "'s", LEMMA: PRON_LEMMA, NORM: "us"},
     ],
+    "c'mon": [{ORTH: "c'm", NORM: "come", LEMMA: "come"}, {ORTH: "on"}],
+    "C'mon": [{ORTH: "C'm", NORM: "come", LEMMA: "come"}, {ORTH: "on"}],
 }

 _exc.update(_other_exc)

View File

@ -14,5 +14,9 @@ sentences = [
     "El gato come pescado.",
     "Veo al hombre con el telescopio.",
     "La araña come moscas.",
-    "El pingüino incuba en su nido.",
+    "El pingüino incuba en su nido sobre el hielo.",
+    "¿Dónde estais?",
+    "¿Quién es el presidente Francés?",
+    "¿Dónde está encuentra la capital de Argentina?",
+    "¿Cuándo nació José de San Martín?",
 ]

View File

@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES
 from ..char_classes import LIST_ICONS, CURRENCY, LIST_UNITS, PUNCT
 from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA


@@ -7,8 +7,12 @@ _exc = {
 for exc_data in [
+    {ORTH: "n°", LEMMA: "número"},
+    {ORTH: "°C", LEMMA: "grados Celcius"},
     {ORTH: "aprox.", LEMMA: "aproximadamente"},
     {ORTH: "dna.", LEMMA: "docena"},
+    {ORTH: "dpto.", LEMMA: "departamento"},
+    {ORTH: "ej.", LEMMA: "ejemplo"},
     {ORTH: "esq.", LEMMA: "esquina"},
     {ORTH: "pág.", LEMMA: "página"},
     {ORTH: "p.ej.", LEMMA: "por ejemplo"},
@@ -16,6 +20,7 @@ for exc_data in [
     {ORTH: "Vd.", LEMMA: PRON_LEMMA, NORM: "usted"},
     {ORTH: "Uds.", LEMMA: PRON_LEMMA, NORM: "ustedes"},
     {ORTH: "Vds.", LEMMA: PRON_LEMMA, NORM: "ustedes"},
+    {ORTH: "vol.", NORM: "volúmen"},
 ]:
     _exc[exc_data[ORTH]] = [exc_data]
@@ -35,10 +40,14 @@ for h in range(1, 12 + 1):
 for orth in [
     "a.C.",
     "a.J.C.",
+    "d.C.",
+    "d.J.C.",
     "apdo.",
     "Av.",
     "Avda.",
     "Cía.",
+    "Dr.",
+    "Dra.",
     "EE.UU.",
     "etc.",
     "fig.",
@@ -54,9 +63,9 @@ for orth in [
     "Prof.",
     "Profa.",
     "q.e.p.d.",
-    "S.A.",
+    "Q.E.P.D." "S.A.",
     "S.L.",
-    "s.s.s.",
+    "S.R.L." "s.s.s.",
     "Sr.",
     "Sra.",
     "Srta.",


@@ -25,17 +25,15 @@ def noun_chunks(doclike):
     np_deps = [doc.vocab.strings.add(label) for label in labels]
     conj = doc.vocab.strings.add("conj")
     np_label = doc.vocab.strings.add("NP")
-    seen = set()
+    prev_end = -1
     for i, word in enumerate(doclike):
         if word.pos not in (NOUN, PROPN, PRON):
             continue
         # Prevent nested chunks from being produced
-        if word.i in seen:
+        if word.left_edge.i <= prev_end:
             continue
         if word.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.i + 1))
+            prev_end = word.i
             yield word.left_edge.i, word.i + 1, np_label
         elif word.dep == conj:
             head = word.head
@@ -43,9 +41,7 @@ def noun_chunks(doclike):
             head = head.head
             # If the head is an NP, and we're coordinated to it, we're an NP
             if head.dep in np_deps:
-                if any(w.i in seen for w in word.subtree):
-                    continue
-                seen.update(j for j in range(word.left_edge.i, word.i + 1))
+                prev_end = word.i
                 yield word.left_edge.i, word.i + 1, np_label


@@ -531,7 +531,6 @@ FR_BASE_EXCEPTIONS = [
     "Beaumont-Hamel",
     "Beaumont-Louestault",
     "Beaumont-Monteux",
-    "Beaumont-Pied-de-Bœuf",
     "Beaumont-Pied-de-Bœuf",
     "Beaumont-Sardolles",
     "Beaumont-Village",
@@ -948,7 +947,7 @@ FR_BASE_EXCEPTIONS = [
     "Buxières-sous-les-Côtes",
     "Buzy-Darmont",
     "Byhleguhre-Byhlen",
     "Bœurs-en-Othe",
     "Bâle-Campagne",
     "Bâle-Ville",
     "Béard-Géovreissiat",
@@ -1586,11 +1585,11 @@ FR_BASE_EXCEPTIONS = [
     "Cruci-Falgardiens",
     "Cruquius-Oost",
     "Cruviers-Lascours",
     "Crèvecœur-en-Auge",
     "Crèvecœur-en-Brie",
     "Crèvecœur-le-Grand",
     "Crèvecœur-le-Petit",
     "Crèvecœur-sur-l'Escaut",
     "Crécy-Couvé",
     "Créon-d'Armagnac",
     "Cubjac-Auvézère-Val-d'Ans",
@@ -1616,7 +1615,7 @@ FR_BASE_EXCEPTIONS = [
     "Cuxac-Cabardès",
     "Cuxac-d'Aude",
     "Cuyk-Sainte-Agathe",
     "Cœuvres-et-Valsery",
     "Céaux-d'Allègre",
     "Céleste-Empire",
     "Cénac-et-Saint-Julien",
@@ -1679,7 +1678,7 @@ FR_BASE_EXCEPTIONS = [
     "Devrai-Gondragnières",
     "Dhuys et Morin-en-Brie",
     "Diane-Capelle",
     "Dieffenbach-lès-Wœrth",
     "Diekhusen-Fahrstedt",
     "Diennes-Aubigny",
     "Diensdorf-Radlow",
@@ -1752,7 +1751,7 @@ FR_BASE_EXCEPTIONS = [
     "Durdat-Larequille",
     "Durfort-Lacapelette",
     "Durfort-et-Saint-Martin-de-Sossenac",
     "Dœuil-sur-le-Mignon",
     "Dão-Lafões",
     "Débats-Rivière-d'Orpra",
     "Décines-Charpieu",
@@ -2687,8 +2686,8 @@ FR_BASE_EXCEPTIONS = [
     "Kuhlen-Wendorf",
     "KwaZulu-Natal",
     "Kyzyl-Arvat",
     "Kœur-la-Grande",
     "Kœur-la-Petite",
     "Kölln-Reisiek",
     "Königsbach-Stein",
     "Königshain-Wiederau",
@@ -4024,7 +4023,7 @@ FR_BASE_EXCEPTIONS = [
     "Marcilly-d'Azergues",
     "Marcillé-Raoul",
     "Marcillé-Robert",
     "Marcq-en-Barœul",
     "Marcy-l'Etoile",
     "Marcy-l'Étoile",
     "Mareil-Marly",
@@ -4258,7 +4257,7 @@ FR_BASE_EXCEPTIONS = [
     "Monlezun-d'Armagnac",
     "Monléon-Magnoac",
     "Monnetier-Mornex",
     "Mons-en-Barœul",
     "Monsempron-Libos",
     "Monsteroux-Milieu",
     "Montacher-Villegardin",
@@ -4348,7 +4347,7 @@ FR_BASE_EXCEPTIONS = [
     "Mornay-Berry",
     "Mortain-Bocage",
     "Morteaux-Couliboeuf",
-    "Morteaux-Coulibœuf",
     "Morteaux-Coulibœuf",
     "Mortes-Frontières",
     "Mory-Montcrux",
@@ -4391,7 +4390,7 @@ FR_BASE_EXCEPTIONS = [
     "Muncq-Nieurlet",
     "Murtin-Bogny",
     "Murtin-et-le-Châtelet",
     "Mœurs-Verdey",
     "Ménestérol-Montignac",
     "Ménil'muche",
     "Ménil-Annelles",
@@ -4612,7 +4611,7 @@ FR_BASE_EXCEPTIONS = [
     "Neuves-Maisons",
     "Neuvic-Entier",
     "Neuvicq-Montguyon",
     "Neuville-lès-Lœuilly",
     "Neuvy-Bouin",
     "Neuvy-Deux-Clochers",
     "Neuvy-Grandchamp",
@@ -4773,8 +4772,8 @@ FR_BASE_EXCEPTIONS = [
     "Nuncq-Hautecôte",
     "Nurieux-Volognat",
     "Nuthe-Urstromtal",
     "Nœux-les-Mines",
     "Nœux-lès-Auxi",
     "Nâves-Parmelan",
     "Nézignan-l'Evêque",
     "Nézignan-l'Évêque",
@@ -5343,7 +5342,7 @@ FR_BASE_EXCEPTIONS = [
     "Quincy-Voisins",
     "Quincy-sous-le-Mont",
     "Quint-Fonsegrives",
-    "Quœux-Haut-Maînil",
     "Quœux-Haut-Maînil",
     "Qwa-Qwa",
     "R.-V.",
@@ -5631,12 +5630,12 @@ FR_BASE_EXCEPTIONS = [
     "Saint Aulaye-Puymangou",
     "Saint Geniez d'Olt et d'Aubrac",
     "Saint Martin de l'If",
     "Saint-Denœux",
     "Saint-Jean-de-Bœuf",
     "Saint-Martin-le-Nœud",
     "Saint-Michel-Tubœuf",
     "Saint-Paul - Flaugnac",
     "Saint-Pierre-de-Bœuf",
     "Saint-Thegonnec Loc-Eguiner",
     "Sainte-Alvère-Saint-Laurent Les Bâtons",
     "Salignac-Eyvignes",
@@ -6208,7 +6207,7 @@ FR_BASE_EXCEPTIONS = [
     "Tite-Live",
     "Titisee-Neustadt",
     "Tobel-Tägerschen",
     "Togny-aux-Bœufs",
     "Tongre-Notre-Dame",
     "Tonnay-Boutonne",
     "Tonnay-Charente",
@@ -6336,7 +6335,7 @@ FR_BASE_EXCEPTIONS = [
     "Vals-près-le-Puy",
     "Valverde-Enrique",
     "Valzin-en-Petite-Montagne",
     "Vandœuvre-lès-Nancy",
     "Varces-Allières-et-Risset",
     "Varenne-l'Arconce",
     "Varenne-sur-le-Doubs",
@@ -6457,9 +6456,9 @@ FR_BASE_EXCEPTIONS = [
     "Villenave-d'Ornon",
     "Villequier-Aumont",
     "Villerouge-Termenès",
     "Villers-aux-Nœuds",
     "Villez-sur-le-Neubourg",
     "Villiers-en-Désœuvre",
     "Villieu-Loyes-Mollon",
     "Villingen-Schwenningen",
     "Villié-Morgon",
@@ -6467,7 +6466,7 @@ FR_BASE_EXCEPTIONS = [
     "Vilosnes-Haraumont",
     "Vilters-Wangs",
     "Vincent-Froideville",
-    "Vincy-Manœuvre",
     "Vincy-Manœuvre",
     "Vincy-Reuil-et-Magny",
     "Vindrac-Alayrac",
@@ -6511,8 +6510,8 @@ FR_BASE_EXCEPTIONS = [
     "Vrigne-Meusiens",
     "Vrijhoeve-Capelle",
     "Vuisternens-devant-Romont",
     "Vœlfling-lès-Bouzonville",
     "Vœuil-et-Giget",
     "Vélez-Blanco",
     "Vélez-Málaga",
     "Vélez-Rubio",
@@ -6615,7 +6614,7 @@ FR_BASE_EXCEPTIONS = [
     "Wust-Fischbeck",
     "Wutha-Farnroda",
     "Wy-dit-Joli-Village",
     "Wœlfling-lès-Sarreguemines",
     "Wünnewil-Flamatt",
     "X-SAMPA",
     "X-arbre",


@@ -24,17 +24,15 @@ def noun_chunks(doclike):
     np_deps = [doc.vocab.strings[label] for label in labels]
     conj = doc.vocab.strings.add("conj")
     np_label = doc.vocab.strings.add("NP")
-    seen = set()
+    prev_end = -1
     for i, word in enumerate(doclike):
         if word.pos not in (NOUN, PROPN, PRON):
             continue
         # Prevent nested chunks from being produced
-        if word.i in seen:
+        if word.left_edge.i <= prev_end:
            continue
         if word.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+            prev_end = word.right_edge.i
             yield word.left_edge.i, word.right_edge.i + 1, np_label
         elif word.dep == conj:
             head = word.head
@@ -42,9 +40,7 @@ def noun_chunks(doclike):
             head = head.head
             # If the head is an NP, and we're coordinated to it, we're an NP
             if head.dep in np_deps:
-                if any(w.i in seen for w in word.subtree):
-                    continue
-                seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+                prev_end = word.right_edge.i
                 yield word.left_edge.i, word.right_edge.i + 1, np_label


@@ -1,7 +1,6 @@
 import re

 from .punctuation import ELISION, HYPHENS
-from ..tokenizer_exceptions import URL_PATTERN
 from ..char_classes import ALPHA_LOWER, ALPHA
 from ...symbols import ORTH, LEMMA
@@ -452,9 +451,6 @@ _regular_exp += [
     for hc in _hyphen_combination
 ]

-# URLs
-_regular_exp.append(URL_PATTERN)
-
 TOKENIZER_EXCEPTIONS = _exc
 TOKEN_MATCH = re.compile(


@@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 from .stop_words import STOP_WORDS

 from ...language import Language


@@ -1,7 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 """
 Example sentences to test spaCy and its language models.


@@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 STOP_WORDS = set(
     """
 એમ


@@ -7,7 +7,6 @@ _concat_icons = CONCAT_ICONS.replace("\u00B0", "")

 _currency = r"\$¢£€¥฿"
 _quotes = CONCAT_QUOTES.replace("'", "")
-_units = UNITS.replace("%", "")

 _prefixes = (
     LIST_PUNCT
@@ -18,7 +17,8 @@ _prefixes = (
 )

 _suffixes = (
-    LIST_PUNCT
+    [r"\+"]
+    + LIST_PUNCT
     + LIST_ELLIPSES
     + LIST_QUOTES
     + [_concat_icons]
@@ -26,7 +26,7 @@ _suffixes = (
         r"(?<=[0-9])\+",
         r"(?<=°[FfCcKk])\.",
         r"(?<=[0-9])(?:[{c}])".format(c=_currency),
-        r"(?<=[0-9])(?:{u})".format(u=_units),
+        r"(?<=[0-9])(?:{u})".format(u=UNITS),
         r"(?<=[{al}{e}{q}(?:{c})])\.".format(
             al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, c=_currency
         ),
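Dropping the locally patched `_units` means the stock `UNITS` character class now feeds the number-suffix rule directly. A rough standalone check of just that regex (a sketch assuming `UNITS` can be imported from `spacy.lang.char_classes`; not the full Hungarian tokenizer):

import re
from spacy.lang.char_classes import UNITS

suffix_re = re.compile(r"(?<=[0-9])(?:{u})$".format(u=UNITS))
# if "%" is part of UNITS, both of these should report a match
print(bool(suffix_re.search("100%")), bool(suffix_re.search("100km")))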


@@ -1,7 +1,6 @@
 import re

 from ..punctuation import ALPHA_LOWER, CURRENCY
-from ..tokenizer_exceptions import URL_PATTERN
 from ...symbols import ORTH
@@ -646,4 +645,4 @@ _nums = r"(({ne})|({t})|({on})|({c}))({s})?".format(

 TOKENIZER_EXCEPTIONS = _exc
-TOKEN_MATCH = re.compile(r"^({u})|({n})$".format(u=URL_PATTERN, n=_nums)).match
+TOKEN_MATCH = re.compile(r"^{n}$".format(n=_nums)).match


@@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .tag_map import TAG_MAP


@@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 """
 Example sentences to test spaCy and its language models.
 >>> from spacy.lang.hy.examples import sentences


@@ -1,12 +1,9 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 from ...attrs import LIKE_NUM

 _num_words = [
-    "զրօ",
-    "մէկ",
+    "զրո",
+    "մեկ",
     "երկու",
     "երեք",
     "չորս",
@@ -28,10 +25,10 @@ _num_words = [
     "քսան" "երեսուն",
     "քառասուն",
     "հիսուն",
-    "վաթցսուն",
+    "վաթսուն",
     "յոթանասուն",
     "ութսուն",
-    "ինիսուն",
+    "իննսուն",
     "հարյուր",
     "հազար",
     "միլիոն",


@@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 STOP_WORDS = set(
     """
 նա


@@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 from ...symbols import POS, ADJ, NUM, DET, ADV, ADP, X, VERB, NOUN
 from ...symbols import PROPN, PART, INTJ, PRON, SCONJ, AUX, CCONJ


@@ -24,17 +24,15 @@ def noun_chunks(doclike):
     np_deps = [doc.vocab.strings[label] for label in labels]
     conj = doc.vocab.strings.add("conj")
     np_label = doc.vocab.strings.add("NP")
-    seen = set()
+    prev_end = -1
     for i, word in enumerate(doclike):
         if word.pos not in (NOUN, PROPN, PRON):
             continue
         # Prevent nested chunks from being produced
-        if word.i in seen:
+        if word.left_edge.i <= prev_end:
             continue
         if word.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+            prev_end = word.right_edge.i
             yield word.left_edge.i, word.right_edge.i + 1, np_label
         elif word.dep == conj:
             head = word.head
@@ -42,9 +40,7 @@ def noun_chunks(doclike):
             head = head.head
             # If the head is an NP, and we're coordinated to it, we're an NP
             if head.dep in np_deps:
-                if any(w.i in seen for w in word.subtree):
-                    continue
-                seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+                prev_end = word.right_edge.i
                 yield word.left_edge.i, word.right_edge.i + 1, np_label


@@ -1,111 +1,266 @@
-import re
-from collections import namedtuple
+import srsly
+from collections import namedtuple, OrderedDict

 from .stop_words import STOP_WORDS
+from .syntax_iterators import SYNTAX_ITERATORS
 from .tag_map import TAG_MAP
+from .tag_orth_map import TAG_ORTH_MAP
+from .tag_bigram_map import TAG_BIGRAM_MAP
 from ...attrs import LANG
-from ...language import Language
-from ...tokens import Doc
 from ...compat import copy_reg
+from ...errors import Errors
+from ...language import Language
+from ...symbols import POS
+from ...tokens import Doc
 from ...util import DummyTokenizer
+from ... import util
+
+# Hold the attributes we need with convenient names
+DetailedToken = namedtuple("DetailedToken", ["surface", "pos", "lemma"])

 # Handling for multiple spaces in a row is somewhat awkward, this simplifies
 # the flow by creating a dummy with the same interface.
-DummyNode = namedtuple("DummyNode", ["surface", "pos", "feature"])
-DummyNodeFeatures = namedtuple("DummyNodeFeatures", ["lemma"])
-DummySpace = DummyNode(" ", " ", DummyNodeFeatures(" "))
+DummyNode = namedtuple("DummyNode", ["surface", "pos", "lemma"])
+DummySpace = DummyNode(" ", " ", " ")


-def try_fugashi_import():
-    """Fugashi is required for Japanese support, so check for it.
-    It it's not available blow up and explain how to fix it."""
+def try_sudachi_import(split_mode="A"):
+    """SudachiPy is required for Japanese support, so check for it.
+    It it's not available blow up and explain how to fix it.
+    split_mode should be one of these values: "A", "B", "C", None->"A"."""
     try:
-        import fugashi
-
-        return fugashi
+        from sudachipy import dictionary, tokenizer
+
+        split_mode = {
+            None: tokenizer.Tokenizer.SplitMode.A,
+            "A": tokenizer.Tokenizer.SplitMode.A,
+            "B": tokenizer.Tokenizer.SplitMode.B,
+            "C": tokenizer.Tokenizer.SplitMode.C,
+        }[split_mode]
+        tok = dictionary.Dictionary().create(mode=split_mode)
+        return tok
     except ImportError:
         raise ImportError(
-            "Japanese support requires Fugashi: " "https://github.com/polm/fugashi"
+            "Japanese support requires SudachiPy and SudachiDict-core "
+            "(https://github.com/WorksApplications/SudachiPy). "
+            "Install with `pip install sudachipy sudachidict_core` or "
+            "install spaCy with `pip install spacy[ja]`."
         )


-def resolve_pos(token):
+def resolve_pos(orth, pos, next_pos):
     """If necessary, add a field to the POS tag for UD mapping.
     Under Universal Dependencies, sometimes the same Unidic POS tag can
     be mapped differently depending on the literal token or its context
-    in the sentence. This function adds information to the POS tag to
-    resolve ambiguous mappings.
+    in the sentence. This function returns resolved POSs for both token
+    and next_token by tuple.
     """
-    # this is only used for consecutive ascii spaces
-    if token.surface == " ":
-        return "空白"
-
-    # TODO: This is a first take. The rules here are crude approximations.
-    # For many of these, full dependencies are needed to properly resolve
-    # PoS mappings.
-    if token.pos == "連体詞,*,*,*":
-        if re.match(r"[こそあど此其彼]の", token.surface):
-            return token.pos + ",DET"
-        if re.match(r"[こそあど此其彼]", token.surface):
-            return token.pos + ",PRON"
-        return token.pos + ",ADJ"
-    return token.pos
+    # Some tokens have their UD tag decided based on the POS of the following
+    # token.
+
+    # orth based rules
+    if pos[0] in TAG_ORTH_MAP:
+        orth_map = TAG_ORTH_MAP[pos[0]]
+        if orth in orth_map:
+            return orth_map[orth], None
+
+    # tag bi-gram mapping
+    if next_pos:
+        tag_bigram = pos[0], next_pos[0]
+        if tag_bigram in TAG_BIGRAM_MAP:
+            bipos = TAG_BIGRAM_MAP[tag_bigram]
+            if bipos[0] is None:
+                return TAG_MAP[pos[0]][POS], bipos[1]
+            else:
+                return bipos
+
+    return TAG_MAP[pos[0]][POS], None


-def get_words_and_spaces(tokenizer, text):
-    """Get the individual tokens that make up the sentence and handle white space.
-    Japanese doesn't usually use white space, and MeCab's handling of it for
-    multiple spaces in a row is somewhat awkward.
+# Use a mapping of paired punctuation to avoid splitting quoted sentences.
+pairpunct = {"「": "」", "『": "』", "【": "】"}
+
+
+def separate_sentences(doc):
+    """Given a doc, mark tokens that start sentences based on Unidic tags.
     """
-    tokens = tokenizer.parseToNodeList(text)
+    stack = []  # save paired punctuation
+    for i, token in enumerate(doc[:-2]):
+        # Set all tokens after the first to false by default. This is necessary
+        # for the doc code to be aware we've done sentencization, see
+        # `is_sentenced`.
+        token.sent_start = i == 0
+        if token.tag_:
+            if token.tag_ == "補助記号-括弧開":
+                ts = str(token)
+                if ts in pairpunct:
+                    stack.append(pairpunct[ts])
+                elif stack and ts == stack[-1]:
+                    stack.pop()
+            if token.tag_ == "補助記号-句点":
+                next_token = doc[i + 1]
+                if next_token.tag_ != token.tag_ and not stack:
+                    next_token.sent_start = True
+
+
+def get_dtokens(tokenizer, text):
+    tokens = tokenizer.tokenize(text)

     words = []
-    spaces = []
-    for token in tokens:
-        # If there's more than one space, spaces after the first become tokens
-        for ii in range(len(token.white_space) - 1):
-            words.append(DummySpace)
-            spaces.append(False)
-
-        words.append(token)
-        spaces.append(bool(token.white_space))
-    return words, spaces
+    for ti, token in enumerate(tokens):
+        tag = "-".join([xx for xx in token.part_of_speech()[:4] if xx != "*"])
+        inf = "-".join([xx for xx in token.part_of_speech()[4:] if xx != "*"])
+        dtoken = DetailedToken(token.surface(), (tag, inf), token.dictionary_form())
+        if ti > 0 and words[-1].pos[0] == "空白" and tag == "空白":
+            # don't add multiple space tokens in a row
+            continue
+        words.append(dtoken)
+
+    # remove empty tokens. These can be produced with characters like … that
+    # Sudachi normalizes internally.
+    words = [ww for ww in words if len(ww.surface) > 0]
+    return words
+
+
+def get_words_lemmas_tags_spaces(dtokens, text, gap_tag=("空白", "")):
+    words = [x.surface for x in dtokens]
+    if "".join("".join(words).split()) != "".join(text.split()):
+        raise ValueError(Errors.E194.format(text=text, words=words))
+    text_words = []
+    text_lemmas = []
+    text_tags = []
+    text_spaces = []
+    text_pos = 0
+    # handle empty and whitespace-only texts
+    if len(words) == 0:
+        return text_words, text_lemmas, text_tags, text_spaces
+    elif len([word for word in words if not word.isspace()]) == 0:
+        assert text.isspace()
+        text_words = [text]
+        text_lemmas = [text]
+        text_tags = [gap_tag]
+        text_spaces = [False]
+        return text_words, text_lemmas, text_tags, text_spaces
+    # normalize words to remove all whitespace tokens
+    norm_words, norm_dtokens = zip(
+        *[
+            (word, dtokens)
+            for word, dtokens in zip(words, dtokens)
+            if not word.isspace()
+        ]
+    )
+    # align words with text
+    for word, dtoken in zip(norm_words, norm_dtokens):
+        try:
+            word_start = text[text_pos:].index(word)
+        except ValueError:
+            raise ValueError(Errors.E194.format(text=text, words=words))
+        if word_start > 0:
+            w = text[text_pos : text_pos + word_start]
+            text_words.append(w)
+            text_lemmas.append(w)
+            text_tags.append(gap_tag)
+            text_spaces.append(False)
+            text_pos += word_start
+        text_words.append(word)
+        text_lemmas.append(dtoken.lemma)
+        text_tags.append(dtoken.pos)
+        text_spaces.append(False)
+        text_pos += len(word)
+        if text_pos < len(text) and text[text_pos] == " ":
+            text_spaces[-1] = True
+            text_pos += 1
+    if text_pos < len(text):
+        w = text[text_pos:]
+        text_words.append(w)
+        text_lemmas.append(w)
+        text_tags.append(gap_tag)
+        text_spaces.append(False)
+    return text_words, text_lemmas, text_tags, text_spaces


 class JapaneseTokenizer(DummyTokenizer):
-    def __init__(self, cls, nlp=None):
+    def __init__(self, cls, nlp=None, config={}):
         self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
-        self.tokenizer = try_fugashi_import().Tagger()
-        self.tokenizer.parseToNodeList("")  # see #2901
+        self.split_mode = config.get("split_mode", None)
+        self.tokenizer = try_sudachi_import(self.split_mode)

     def __call__(self, text):
-        dtokens, spaces = get_words_and_spaces(self.tokenizer, text)
-        words = [x.surface for x in dtokens]
+        dtokens = get_dtokens(self.tokenizer, text)
+
+        words, lemmas, unidic_tags, spaces = get_words_lemmas_tags_spaces(dtokens, text)
         doc = Doc(self.vocab, words=words, spaces=spaces)
-        unidic_tags = []
-        for token, dtoken in zip(doc, dtokens):
-            unidic_tags.append(dtoken.pos)
-            token.tag_ = resolve_pos(dtoken)
+        next_pos = None
+        for idx, (token, lemma, unidic_tag) in enumerate(zip(doc, lemmas, unidic_tags)):
+            token.tag_ = unidic_tag[0]
+            if next_pos:
+                token.pos = next_pos
+                next_pos = None
+            else:
+                token.pos, next_pos = resolve_pos(
+                    token.orth_,
+                    unidic_tag,
+                    unidic_tags[idx + 1] if idx + 1 < len(unidic_tags) else None,
+                )
+
             # if there's no lemma info (it's an unk) just use the surface
-            token.lemma_ = dtoken.feature.lemma or dtoken.surface
+            token.lemma_ = lemma
         doc.user_data["unidic_tags"] = unidic_tags
+
         return doc

+    def _get_config(self):
+        config = OrderedDict((("split_mode", self.split_mode),))
+        return config
+
+    def _set_config(self, config={}):
+        self.split_mode = config.get("split_mode", None)
+
+    def to_bytes(self, **kwargs):
+        serializers = OrderedDict(
+            (("cfg", lambda: srsly.json_dumps(self._get_config())),)
+        )
+        return util.to_bytes(serializers, [])
+
+    def from_bytes(self, data, **kwargs):
+        deserializers = OrderedDict(
+            (("cfg", lambda b: self._set_config(srsly.json_loads(b))),)
+        )
+        util.from_bytes(data, deserializers, [])
+        self.tokenizer = try_sudachi_import(self.split_mode)
+        return self
+
+    def to_disk(self, path, **kwargs):
+        path = util.ensure_path(path)
+        serializers = OrderedDict(
+            (("cfg", lambda p: srsly.write_json(p, self._get_config())),)
+        )
+        return util.to_disk(path, serializers, [])
+
+    def from_disk(self, path, **kwargs):
+        path = util.ensure_path(path)
+        serializers = OrderedDict(
+            (("cfg", lambda p: self._set_config(srsly.read_json(p))),)
+        )
+        util.from_disk(path, serializers, [])
+        self.tokenizer = try_sudachi_import(self.split_mode)
+

 class JapaneseDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters[LANG] = lambda _text: "ja"
     stop_words = STOP_WORDS
     tag_map = TAG_MAP
+    syntax_iterators = SYNTAX_ITERATORS
     writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}

     @classmethod
-    def create_tokenizer(cls, nlp=None):
-        return JapaneseTokenizer(cls, nlp)
+    def create_tokenizer(cls, nlp=None, config={}):
+        return JapaneseTokenizer(cls, nlp, config)


 class Japanese(Language):
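The tokenizer now delegates segmentation to SudachiPy, and the `split_mode` config value selects how aggressively compounds are split. A small hedged sketch of what `try_sudachi_import` sets up (assumes `sudachipy` and `sudachidict_core` are installed, as the error message above instructs); A is the finest segmentation, C the coarsest:

from sudachipy import dictionary, tokenizer

tok = dictionary.Dictionary().create()
text = "選挙管理委員会"
for name in ("A", "B", "C"):
    mode = getattr(tokenizer.Tokenizer.SplitMode, name)
    print(name, [m.surface() for m in tok.tokenize(text, mode)])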

spacy/lang/ja/bunsetu.py (new file)

@@ -0,0 +1,176 @@
POS_PHRASE_MAP = {
"NOUN": "NP",
"NUM": "NP",
"PRON": "NP",
"PROPN": "NP",
"VERB": "VP",
"ADJ": "ADJP",
"ADV": "ADVP",
"CCONJ": "CCONJP",
}
# return value: [(bunsetu_tokens, phrase_type={'NP', 'VP', 'ADJP', 'ADVP'}, phrase_tokens)]
def yield_bunsetu(doc, debug=False):
bunsetu = []
bunsetu_may_end = False
phrase_type = None
phrase = None
prev = None
prev_tag = None
prev_dep = None
prev_head = None
for t in doc:
pos = t.pos_
pos_type = POS_PHRASE_MAP.get(pos, None)
tag = t.tag_
dep = t.dep_
head = t.head.i
if debug:
print(
t.i,
t.orth_,
pos,
pos_type,
dep,
head,
bunsetu_may_end,
phrase_type,
phrase,
bunsetu,
)
# DET is always an individual bunsetu
if pos == "DET":
if bunsetu:
yield bunsetu, phrase_type, phrase
yield [t], None, None
bunsetu = []
bunsetu_may_end = False
phrase_type = None
phrase = None
# PRON or Open PUNCT always splits bunsetu
elif tag == "補助記号-括弧開":
if bunsetu:
yield bunsetu, phrase_type, phrase
bunsetu = [t]
bunsetu_may_end = True
phrase_type = None
phrase = None
# bunsetu head not appeared
elif phrase_type is None:
if bunsetu and prev_tag == "補助記号-読点":
yield bunsetu, phrase_type, phrase
bunsetu = []
bunsetu_may_end = False
phrase_type = None
phrase = None
bunsetu.append(t)
if pos_type: # begin phrase
phrase = [t]
phrase_type = pos_type
if pos_type in {"ADVP", "CCONJP"}:
bunsetu_may_end = True
# entering new bunsetu
elif pos_type and (
pos_type != phrase_type
or bunsetu_may_end # different phrase type arises # same phrase type but bunsetu already ended
):
# exceptional case: NOUN to VERB
if (
phrase_type == "NP"
and pos_type == "VP"
and prev_dep == "compound"
and prev_head == t.i
):
bunsetu.append(t)
phrase_type = "VP"
phrase.append(t)
# exceptional case: VERB to NOUN
elif (
phrase_type == "VP"
and pos_type == "NP"
and (
prev_dep == "compound"
and prev_head == t.i
or dep == "compound"
and prev == head
or prev_dep == "nmod"
and prev_head == t.i
)
):
bunsetu.append(t)
phrase_type = "NP"
phrase.append(t)
else:
yield bunsetu, phrase_type, phrase
bunsetu = [t]
bunsetu_may_end = False
phrase_type = pos_type
phrase = [t]
# NOUN bunsetu
elif phrase_type == "NP":
bunsetu.append(t)
if not bunsetu_may_end and (
(
(pos_type == "NP" or pos == "SYM")
and (prev_head == t.i or prev_head == head)
and prev_dep in {"compound", "nummod"}
)
or (
pos == "PART"
and (prev == head or prev_head == head)
and dep == "mark"
)
):
phrase.append(t)
else:
bunsetu_may_end = True
# VERB bunsetu
elif phrase_type == "VP":
bunsetu.append(t)
if (
not bunsetu_may_end
and pos == "VERB"
and prev_head == t.i
and prev_dep == "compound"
):
phrase.append(t)
else:
bunsetu_may_end = True
# ADJ bunsetu
elif phrase_type == "ADJP" and tag != "連体詞":
bunsetu.append(t)
if not bunsetu_may_end and (
(
pos == "NOUN"
and (prev_head == t.i or prev_head == head)
and prev_dep in {"amod", "compound"}
)
or (
pos == "PART"
and (prev == head or prev_head == head)
and dep == "mark"
)
):
phrase.append(t)
else:
bunsetu_may_end = True
# other bunsetu
else:
bunsetu.append(t)
prev = t.i
prev_tag = t.tag_
prev_dep = t.dep_
prev_head = head
if bunsetu:
yield bunsetu, phrase_type, phrase
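`yield_bunsetu` walks an already parsed `Doc`, so it needs POS tags and dependencies. A hedged usage sketch (assumes a trained Japanese pipeline such as ja_core_news_sm is installed; the output shape follows the return-value comment at the top of the file):

import spacy
from spacy.lang.ja.bunsetu import yield_bunsetu

nlp = spacy.load("ja_core_news_sm")
doc = nlp("私は本を読みました。")
for tokens, phrase_type, phrase in yield_bunsetu(doc):
    print([t.orth_ for t in tokens], phrase_type)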


@@ -0,0 +1,54 @@
from ...symbols import NOUN, PROPN, PRON, VERB
# XXX this can probably be pruned a bit
labels = [
"nsubj",
"nmod",
"dobj",
"nsubjpass",
"pcomp",
"pobj",
"obj",
"obl",
"dative",
"appos",
"attr",
"ROOT",
]
def noun_chunks(obj):
"""
Detect base noun phrases from a dependency parse. Works on both Doc and Span.
"""
doc = obj.doc # Ensure works on both Doc and Span.
np_deps = [doc.vocab.strings.add(label) for label in labels]
doc.vocab.strings.add("conj")
np_label = doc.vocab.strings.add("NP")
seen = set()
for i, word in enumerate(obj):
if word.pos not in (NOUN, PROPN, PRON):
continue
# Prevent nested chunks from being produced
if word.i in seen:
continue
if word.dep in np_deps:
unseen = [w.i for w in word.subtree if w.i not in seen]
if not unseen:
continue
# this takes care of particles etc.
seen.update(j.i for j in word.subtree)
# This avoids duplicating embedded clauses
seen.update(range(word.i + 1))
# if the head of this is a verb, mark that and rights seen
# Don't do the subtree as that can hide other phrases
if word.head.pos == VERB:
seen.add(word.head.i)
seen.update(w.i for w in word.head.rights)
yield unseen[0], word.i + 1, np_label
SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}


@@ -0,0 +1,28 @@
from ...symbols import ADJ, AUX, NOUN, PART, VERB
# mapping from tag bi-gram to pos of previous token
TAG_BIGRAM_MAP = {
# This covers only small part of AUX.
("形容詞-非自立可能", "助詞-終助詞"): (AUX, None),
("名詞-普通名詞-形状詞可能", "助動詞"): (ADJ, None),
# ("副詞", "名詞-普通名詞-形状詞可能"): (None, ADJ),
# This covers acl, advcl, obl and root, but has side effect for compound.
("名詞-普通名詞-サ変可能", "動詞-非自立可能"): (VERB, AUX),
# This covers almost all of the deps
("名詞-普通名詞-サ変形状詞可能", "動詞-非自立可能"): (VERB, AUX),
("名詞-普通名詞-副詞可能", "動詞-非自立可能"): (None, VERB),
("副詞", "動詞-非自立可能"): (None, VERB),
("形容詞-一般", "動詞-非自立可能"): (None, VERB),
("形容詞-非自立可能", "動詞-非自立可能"): (None, VERB),
("接頭辞", "動詞-非自立可能"): (None, VERB),
("助詞-係助詞", "動詞-非自立可能"): (None, VERB),
("助詞-副助詞", "動詞-非自立可能"): (None, VERB),
("助詞-格助詞", "動詞-非自立可能"): (None, VERB),
("補助記号-読点", "動詞-非自立可能"): (None, VERB),
("形容詞-一般", "接尾辞-名詞的-一般"): (None, PART),
("助詞-格助詞", "形状詞-助動詞語幹"): (None, NOUN),
("連体詞", "形状詞-助動詞語幹"): (None, NOUN),
("動詞-一般", "助詞-副助詞"): (None, PART),
("動詞-非自立可能", "助詞-副助詞"): (None, PART),
("助動詞", "助詞-副助詞"): (None, PART),
}


@@ -1,79 +1,68 @@
-from ...symbols import POS, PUNCT, INTJ, X, ADJ, AUX, ADP, PART, SCONJ, NOUN
-from ...symbols import SYM, PRON, VERB, ADV, PROPN, NUM, DET, SPACE
+from ...symbols import POS, PUNCT, INTJ, ADJ, AUX, ADP, PART, SCONJ, NOUN
+from ...symbols import SYM, PRON, VERB, ADV, PROPN, NUM, DET, SPACE, CCONJ

 TAG_MAP = {
     # Explanation of Unidic tags:
     # https://www.gavo.t.u-tokyo.ac.jp/~mine/japanese/nlp+slp/UNIDIC_manual.pdf
-    # Universal Dependencies Mapping:
+    # Universal Dependencies Mapping: (Some of the entries in this mapping are updated to v2.6 in the list below)
     # http://universaldependencies.org/ja/overview/morphology.html
     # http://universaldependencies.org/ja/pos/all.html
-    "記号,一般,*,*": {
-        POS: PUNCT
-    },  # this includes characters used to represent sounds like ドレミ
-    "記号,文字,*,*": {
-        POS: PUNCT
-    },  # this is for Greek and Latin characters used as sumbols, as in math
-    "感動詞,フィラー,*,*": {POS: INTJ},
-    "感動詞,一般,*,*": {POS: INTJ},
-    # this is specifically for unicode full-width space
-    "空白,*,*,*": {POS: X},
-    # This is used when sequential half-width spaces are present
+    "記号-一般": {POS: NOUN},  # this includes characters used to represent sounds like ドレミ
+    "記号-文字": {
+        POS: NOUN
+    },  # this is for Greek and Latin characters having some meanings, or used as symbols, as in math
+    "感動詞-フィラー": {POS: INTJ},
+    "感動詞-一般": {POS: INTJ},
     "空白": {POS: SPACE},
-    "形状詞,一般,*,*": {POS: ADJ},
-    "形状詞,タリ,*,*": {POS: ADJ},
-    "形状詞,助動詞語幹,*,*": {POS: ADJ},
-    "形容詞,一般,*,*": {POS: ADJ},
-    "形容詞,非自立可能,*,*": {POS: AUX},  # XXX ADJ if alone, AUX otherwise
-    "助詞,格助詞,*,*": {POS: ADP},
-    "助詞,係助詞,*,*": {POS: ADP},
-    "助詞,終助詞,*,*": {POS: PART},
-    "助詞,準体助詞,*,*": {POS: SCONJ},  # の as in 走るのが速い
-    "助詞,接続助詞,*,*": {POS: SCONJ},  # verb ending て
-    "助詞,副助詞,*,*": {POS: PART},  # ばかり, つつ after a verb
-    "助動詞,*,*,*": {POS: AUX},
-    "接続詞,*,*,*": {POS: SCONJ},  # XXX: might need refinement
-    "接頭辞,*,*,*": {POS: NOUN},
-    "接尾辞,形状詞的,*,*": {POS: ADJ},  # がち, チック
-    "接尾辞,形容詞的,*,*": {POS: ADJ},  # -らしい
-    "接尾辞,動詞的,*,*": {POS: NOUN},  # -じみ
-    "接尾辞,名詞的,サ変可能,*": {POS: NOUN},  # XXX see 名詞,普通名詞,サ変可能,*
-    "接尾辞,名詞的,一般,*": {POS: NOUN},
-    "接尾辞,名詞的,助数詞,*": {POS: NOUN},
-    "接尾辞,名詞的,副詞可能,*": {POS: NOUN},  # -後, -過ぎ
-    "代名詞,*,*,*": {POS: PRON},
-    "動詞,一般,*,*": {POS: VERB},
-    "動詞,非自立可能,*,*": {POS: VERB},  # XXX VERB if alone, AUX otherwise
-    "動詞,非自立可能,*,*,AUX": {POS: AUX},
-    "動詞,非自立可能,*,*,VERB": {POS: VERB},
-    "副詞,*,*,*": {POS: ADV},
-    "補助記号,ＡＡ,一般,*": {POS: SYM},  # text art
-    "補助記号,ＡＡ,顔文字,*": {POS: SYM},  # kaomoji
-    "補助記号,一般,*,*": {POS: SYM},
-    "補助記号,括弧開,*,*": {POS: PUNCT},  # open bracket
-    "補助記号,括弧閉,*,*": {POS: PUNCT},  # close bracket
-    "補助記号,句点,*,*": {POS: PUNCT},  # period or other EOS marker
-    "補助記号,読点,*,*": {POS: PUNCT},  # comma
-    "名詞,固有名詞,一般,*": {POS: PROPN},  # general proper noun
-    "名詞,固有名詞,人名,一般": {POS: PROPN},  # person's name
-    "名詞,固有名詞,人名,姓": {POS: PROPN},  # surname
-    "名詞,固有名詞,人名,名": {POS: PROPN},  # first name
-    "名詞,固有名詞,地名,一般": {POS: PROPN},  # place name
-    "名詞,固有名詞,地名,国": {POS: PROPN},  # country name
-    "名詞,助動詞語幹,*,*": {POS: AUX},
-    "名詞,数詞,*,*": {POS: NUM},  # includes Chinese numerals
-    "名詞,普通名詞,サ変可能,*": {POS: NOUN},  # XXX: sometimes VERB in UDv2; suru-verb noun
-    "名詞,普通名詞,サ変可能,*,NOUN": {POS: NOUN},
-    "名詞,普通名詞,サ変可能,*,VERB": {POS: VERB},
-    "名詞,普通名詞,サ変形状詞可能,*": {POS: NOUN},  # ex: 下手
-    "名詞,普通名詞,一般,*": {POS: NOUN},
-    "名詞,普通名詞,形状詞可能,*": {POS: NOUN},  # XXX: sometimes ADJ in UDv2
-    "名詞,普通名詞,形状詞可能,*,NOUN": {POS: NOUN},
-    "名詞,普通名詞,形状詞可能,*,ADJ": {POS: ADJ},
-    "名詞,普通名詞,助数詞可能,*": {POS: NOUN},  # counter / unit
-    "名詞,普通名詞,副詞可能,*": {POS: NOUN},
-    "連体詞,*,*,*": {POS: ADJ},  # XXX this has exceptions based on literal token
-    "連体詞,*,*,*,ADJ": {POS: ADJ},
-    "連体詞,*,*,*,PRON": {POS: PRON},
-    "連体詞,*,*,*,DET": {POS: DET},
+    "形状詞-一般": {POS: ADJ},
+    "形状詞-タリ": {POS: ADJ},
+    "形状詞-助動詞語幹": {POS: AUX},
+    "形容詞-一般": {POS: ADJ},
+    "形容詞-非自立可能": {POS: ADJ},  # XXX ADJ if alone, AUX otherwise
+    "助詞-格助詞": {POS: ADP},
+    "助詞-係助詞": {POS: ADP},
+    "助詞-終助詞": {POS: PART},
+    "助詞-準体助詞": {POS: SCONJ},  # の as in 走るのが速い
+    "助詞-接続助詞": {POS: SCONJ},  # verb ending て0
+    "助詞-副助詞": {POS: ADP},  # ばかり, つつ after a verb
+    "助動詞": {POS: AUX},
+    "接続詞": {POS: CCONJ},  # XXX: might need refinement
+    "接頭辞": {POS: NOUN},
+    "接尾辞-形状詞的": {POS: PART},  # がち, チック
+    "接尾辞-形容詞的": {POS: AUX},  # -らしい
+    "接尾辞-動詞的": {POS: PART},  # -じみ
+    "接尾辞-名詞的-サ変可能": {POS: NOUN},  # XXX see 名詞,普通名詞,サ変可能,*
+    "接尾辞-名詞的-一般": {POS: NOUN},
+    "接尾辞-名詞的-助数詞": {POS: NOUN},
+    "接尾辞-名詞的-副詞可能": {POS: NOUN},  # -後, -過ぎ
+    "代名詞": {POS: PRON},
+    "動詞-一般": {POS: VERB},
+    "動詞-非自立可能": {POS: AUX},  # XXX VERB if alone, AUX otherwise
+    "副詞": {POS: ADV},
+    "補助記号-ＡＡ-一般": {POS: SYM},  # text art
+    "補助記号-ＡＡ-顔文字": {POS: PUNCT},  # kaomoji
+    "補助記号-一般": {POS: SYM},
+    "補助記号-括弧開": {POS: PUNCT},  # open bracket
+    "補助記号-括弧閉": {POS: PUNCT},  # close bracket
+    "補助記号-句点": {POS: PUNCT},  # period or other EOS marker
+    "補助記号-読点": {POS: PUNCT},  # comma
+    "名詞-固有名詞-一般": {POS: PROPN},  # general proper noun
+    "名詞-固有名詞-人名-一般": {POS: PROPN},  # person's name
+    "名詞-固有名詞-人名-姓": {POS: PROPN},  # surname
+    "名詞-固有名詞-人名-名": {POS: PROPN},  # first name
+    "名詞-固有名詞-地名-一般": {POS: PROPN},  # place name
+    "名詞-固有名詞-地名-国": {POS: PROPN},  # country name
+    "名詞-助動詞語幹": {POS: AUX},
+    "名詞-数詞": {POS: NUM},  # includes Chinese numerals
+    "名詞-普通名詞-サ変可能": {POS: NOUN},  # XXX: sometimes VERB in UDv2; suru-verb noun
+    "名詞-普通名詞-サ変形状詞可能": {POS: NOUN},
+    "名詞-普通名詞-一般": {POS: NOUN},
+    "名詞-普通名詞-形状詞可能": {POS: NOUN},  # XXX: sometimes ADJ in UDv2
+    "名詞-普通名詞-助数詞可能": {POS: NOUN},  # counter / unit
+    "名詞-普通名詞-副詞可能": {POS: NOUN},
+    "連体詞": {POS: DET},  # XXX this has exceptions based on literal token
+    # GSD tags. These aren't in Unidic, but we need them for the GSD data.
+    "外国語": {POS: PROPN},  # Foreign words
+    "絵文字・記号等": {POS: SYM},  # emoji / kaomoji ^^;
 }


@@ -0,0 +1,22 @@
from ...symbols import DET, PART, PRON, SPACE, X
# mapping from tag bi-gram to pos of previous token
TAG_ORTH_MAP = {
"空白": {" ": SPACE, " ": X},
"助詞-副助詞": {"たり": PART},
"連体詞": {
"あの": DET,
"かの": DET,
"この": DET,
"その": DET,
"どの": DET,
"彼の": DET,
"此の": DET,
"其の": DET,
"ある": PRON,
"こんな": PRON,
"そんな": PRON,
"どんな": PRON,
"あらゆる": PRON,
},
}
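Both maps are consulted by `resolve_pos` in `__init__.py` above: the orth map overrides the tag by literal token first, then the bi-gram map adjusts it based on the following token's tag. A tiny hedged illustration of that lookup order (module paths assumed to sit alongside bunsetu.py under spacy/lang/ja; values are spaCy symbol IDs):

from spacy.lang.ja.tag_orth_map import TAG_ORTH_MAP
from spacy.lang.ja.tag_bigram_map import TAG_BIGRAM_MAP
from spacy.symbols import DET

print(TAG_ORTH_MAP["連体詞"]["この"] == DET)  # orth-based override wins first
print(TAG_BIGRAM_MAP[("副詞", "動詞-非自立可能")])  # (None, VERB): only the next token's POS is forced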


@@ -1,7 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 """
 Example sentences to test spaCy and its language models.


@@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 from .stop_words import STOP_WORDS

 from ...language import Language


@@ -1,7 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 """
 Example sentences to test spaCy and its language models.

Some files were not shown because too many files have changed in this diff.