Merge branch 'develop' into pr/6333

Ines Montani 2020-12-17 10:19:28 +11:00
commit 47c1ec678b
152 changed files with 10130 additions and 1895 deletions


@@ -1,15 +1,18 @@
<!--- Please provide a summary in the title and describe your issue here.
Is this a bug or feature request? If a bug, include all the steps that led to the issue.
If you're looking for help with your code, consider posting a question on Stack Overflow instead:
http://stackoverflow.com/questions/tagged/spacy -->
If you're looking for help with your code, consider posting a question here:
- GitHub Discussions: https://github.com/explosion/spaCy/discussions
- Stack Overflow: http://stackoverflow.com/questions/tagged/spacy
-->
## Your Environment
<!-- Include details of your environment. If you're using spaCy 1.7+, you can also type
`python -m spacy info --markdown` and copy-paste the result here.-->
* Operating System:
* Python Version Used:
* spaCy Version Used:
* Environment Information:
- Operating System:
- Python Version Used:
- spaCy Version Used:
- Environment Information:


@@ -1,11 +0,0 @@
---
name: "\U0001F381 Feature Request"
about: Do you have an idea for an improvement, a new feature or a plugin?
---
## Feature description
<!-- Please describe the feature: Which area of the library is it related to? What specific solution would you like? -->
## Could the feature be a [custom component](https://spacy.io/usage/processing-pipelines#custom-components) or [spaCy plugin](https://spacy.io/universe)?
If so, we will tag it as [`project idea`](https://github.com/explosion/spaCy/labels/project%20idea) so other users can take it on.

19
.github/ISSUE_TEMPLATE/04_other.md vendored Normal file

@@ -0,0 +1,19 @@
---
name: "\U0001F4AC Anything else?"
about: For feature and project ideas, general usage questions or help with your code, please post on the GitHub Discussions board instead.
---
<!-- Describe your issue here. Please keep in mind that the GitHub issue tracker is mostly intended for reports related to the spaCy code base and source, and for bugs and enhancements. If you're looking for help with your code, consider posting a question here:
- GitHub Discussions: https://github.com/explosion/spaCy/discussions
- Stack Overflow: http://stackoverflow.com/questions/tagged/spacy
-->
## Your Environment
<!-- Include details of your environment. If you're using spaCy 1.7+, you can also type `python -m spacy info --markdown` and copy-paste the result here.-->
- Operating System:
- Python Version Used:
- spaCy Version Used:
- Environment Information:


@@ -1,15 +0,0 @@
---
name: "\U0001F4AC Anything else?"
about: For general usage questions or help with your code, please consider
posting on Stack Overflow instead.
---
<!-- Describe your issue here. Please keep in mind that the GitHub issue tracker is mostly intended for reports related to the spaCy code base and source, and for bugs and feature requests. If you're looking for help with your code, consider posting a question on Stack Overflow instead: http://stackoverflow.com/questions/tagged/spacy -->
## Your Environment
<!-- Include details of your environment. If you're using spaCy 1.7+, you can also type `python -m spacy info --markdown` and copy-paste the result here.-->
* Operating System:
* Python Version Used:
* spaCy Version Used:
* Environment Information:

108
.github/contributors/KKsharma99.md vendored Normal file

@@ -0,0 +1,108 @@
<!-- This agreement was mistakenly submitted as an update to the CONTRIBUTOR_AGREEMENT.md template. Commit: 8a2d22222dec5cf910df5a378cbcd9ea2ab53ec4. It was therefore moved over manually. -->
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Kunal Sharma |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 10/19/2020 |
| GitHub username | KKsharma99 |
| Website (optional) | |

106
.github/contributors/borijang.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Borijan Georgievski |
| Company name (if applicable) | Netcetera |
| Title or role (if applicable) | Deta Scientist |
| Date | 2020.10.09 |
| GitHub username | borijang |
| Website (optional) | |

106
.github/contributors/danielvasic.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Daniel Vasić |
| Company name (if applicable) | University of Mostar |
| Title or role (if applicable) | Teaching asistant |
| Date | 13/10/2020 |
| GitHub username | danielvasic |
| Website (optional) | |

106
.github/contributors/forest1988.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Yusuke Mori |
| Company name (if applicable) | |
| Title or role (if applicable) | Ph.D. student |
| Date | 2020/11/22 |
| GitHub username | forest1988 |
| Website (optional) | https://forest1988.github.io |

106
.github/contributors/jabortell.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jacob Bortell |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-11-20 |
| GitHub username | jabortell |
| Website (optional) | |

106
.github/contributors/revuel.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Miguel Revuelta |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-11-17 |
| GitHub username | revuel |
| Website (optional) | |

106
.github/contributors/robertsipek.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Robert Šípek |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 22.10.2020 |
| GitHub username | @robertsipek |
| Website (optional) | |

106
.github/contributors/vha14.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Vu Ha |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 10-23-2020 |
| GitHub username | vha14 |
| Website (optional) | |

106
.github/contributors/walterhenry.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Walter Henry |
| Company name (if applicable) | ExplosionAI GmbH |
| Title or role (if applicable) | Executive Assistant |
| Date | September 14, 2020 |
| GitHub username | walterhenry |
| Website (optional) | |

View File

@ -26,11 +26,11 @@ also often include helpful tips and solutions to common problems. You should
also check the [troubleshooting guide](https://spacy.io/usage/#troubleshooting)
to see if your problem is already listed there.
If you're looking for help with your code, consider posting a question on
[Stack Overflow](http://stackoverflow.com/questions/tagged/spacy) instead. If you
tag it `spacy` and `python`, more people will see it and hopefully be able to
help. Please understand that we won't be able to provide individual support via
email. We also believe that help is much more valuable if it's **shared publicly**,
If you're looking for help with your code, consider posting a question on the
[GitHub Discussions board](https://github.com/explosion/spaCy/discussions) or
[Stack Overflow](http://stackoverflow.com/questions/tagged/spacy). Please
understand that we won't be able to provide individual support via email. We
also believe that help is much more valuable if it's **shared publicly**,
so that more people can benefit from it.
### Submitting issues

View File

@ -61,12 +61,14 @@ much more valuable if it's shared publicly, so that more people can benefit from
it.
| Type | Platforms |
| ----------------------- | ---------------------- |
| ------------------------------- | --------------------------------------- |
| 🚨 **Bug Reports** | [GitHub Issue Tracker] |
| 🎁 **Feature Requests** | [GitHub Issue Tracker] |
| 👩‍💻 **Usage Questions** | [Stack Overflow] |
| 🎁 **Feature Requests & Ideas** | [GitHub Discussions] |
| 👩‍💻 **Usage Questions** | [GitHub Discussions] · [Stack Overflow] |
| 🗯 **General Discussion** | [GitHub Discussions] |
[github issue tracker]: https://github.com/explosion/spaCy/issues
[github discussions]: https://github.com/explosion/spaCy/discussions
[stack overflow]: https://stackoverflow.com/questions/tagged/spacy
## Features
@ -126,6 +128,7 @@ environment to avoid modifying system state:
```bash
python -m venv .env
source .env/bin/activate
pip install -U pip setuptools wheel
pip install spacy
```
@ -224,16 +227,28 @@ do that depends on your system. See notes on Ubuntu, OS X and Windows for
details.
```bash
# make sure you are using the latest pip
python -m pip install -U pip
git clone https://github.com/explosion/spaCy
cd spaCy
python -m venv .env
source .env/bin/activate
export PYTHONPATH=`pwd`
# make sure you are using the latest pip
python -m pip install -U pip setuptools wheel
pip install .
```
To install with extras:
```bash
pip install .[lookups,cuda102]
```
To install all dependencies required for development:
```bash
pip install -r requirements.txt
python setup.py build_ext --inplace
```
Compared to regular install via pip, [requirements.txt](requirements.txt)
@ -271,14 +286,13 @@ tests, you'll usually want to clone the repository and build spaCy from source.
This will also install the required development dependencies and test utilities
defined in the `requirements.txt`.
Alternatively, you can find out where spaCy is installed and run `pytest` on
that directory. Don't forget to also install the test utilities via spaCy's
Alternatively, you can run `pytest` on the tests from within the installed
`spacy` package. Don't forget to also install the test utilities via spaCy's
`requirements.txt`:
```bash
python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"
pip install -r path/to/requirements.txt
python -m pytest <spacy-directory>
pip install -r requirements.txt
python -m pytest --pyargs spacy
```
See [the documentation](https://spacy.io/usage#tests) for more details and

View File

@ -2,76 +2,75 @@ trigger:
batch: true
branches:
include:
- '*'
- "*"
exclude:
- 'spacy.io'
- "spacy.io"
paths:
exclude:
- 'website/*'
- '*.md'
- "website/*"
- "*.md"
pr:
paths:
exclude:
- 'website/*'
- '*.md'
- "website/*"
- "*.md"
jobs:
# Perform basic checks for most important errors (syntax etc.) Uses the config
# defined in .flake8 and overwrites the selected codes.
- job: 'Validate'
- job: "Validate"
pool:
vmImage: 'ubuntu-16.04'
vmImage: "ubuntu-16.04"
steps:
- task: UsePythonVersion@0
inputs:
versionSpec: '3.7'
versionSpec: "3.7"
- script: |
pip install flake8==3.5.0
python -m flake8 spacy --count --select=E901,E999,F821,F822,F823 --show-source --statistics
displayName: 'flake8'
displayName: "flake8"
- job: 'Test'
dependsOn: 'Validate'
- job: "Test"
dependsOn: "Validate"
strategy:
matrix:
Python36Linux:
imageName: 'ubuntu-16.04'
python.version: '3.6'
imageName: "ubuntu-16.04"
python.version: "3.6"
Python36Windows:
imageName: 'vs2017-win2016'
python.version: '3.6'
imageName: "vs2017-win2016"
python.version: "3.6"
Python36Mac:
imageName: 'macos-10.14'
python.version: '3.6'
imageName: "macos-10.14"
python.version: "3.6"
# Don't test on 3.7 for now to speed up builds
Python37Linux:
imageName: 'ubuntu-16.04'
python.version: '3.7'
imageName: "ubuntu-16.04"
python.version: "3.7"
Python37Windows:
imageName: 'vs2017-win2016'
python.version: '3.7'
imageName: "vs2017-win2016"
python.version: "3.7"
Python37Mac:
imageName: 'macos-10.14'
python.version: '3.7'
imageName: "macos-10.14"
python.version: "3.7"
Python38Linux:
imageName: 'ubuntu-16.04'
python.version: '3.8'
imageName: "ubuntu-16.04"
python.version: "3.8"
Python38Windows:
imageName: 'vs2017-win2016'
python.version: '3.8'
imageName: "vs2017-win2016"
python.version: "3.8"
Python38Mac:
imageName: 'macos-10.14'
python.version: '3.8'
imageName: "macos-10.14"
python.version: "3.8"
Python39Linux:
imageName: 'ubuntu-16.04'
python.version: '3.9'
imageName: "ubuntu-16.04"
python.version: "3.9"
Python39Windows:
imageName: 'vs2017-win2016'
python.version: '3.9'
imageName: "vs2017-win2016"
python.version: "3.9"
Python39Mac:
imageName: 'macos-10.14'
python.version: '3.9'
imageName: "macos-10.14"
python.version: "3.9"
maxParallel: 4
pool:
vmImage: $(imageName)
@ -79,35 +78,35 @@ jobs:
steps:
- task: UsePythonVersion@0
inputs:
versionSpec: '$(python.version)'
architecture: 'x64'
versionSpec: "$(python.version)"
architecture: "x64"
- script: |
python -m pip install -U setuptools
pip install -r requirements.txt
displayName: 'Install dependencies'
displayName: "Install dependencies"
- script: |
python setup.py build_ext --inplace
python setup.py sdist --formats=gztar
displayName: 'Compile and build sdist'
displayName: "Compile and build sdist"
- task: DeleteFiles@1
inputs:
contents: 'spacy'
displayName: 'Delete source directory'
contents: "spacy"
displayName: "Delete source directory"
- script: |
pip freeze > installed.txt
pip uninstall -y -r installed.txt
displayName: 'Uninstall all packages'
displayName: "Uninstall all packages"
- bash: |
SDIST=$(python -c "import os;print(os.listdir('./dist')[-1])" 2>&1)
pip install dist/$SDIST
displayName: 'Install from sdist'
displayName: "Install from sdist"
- script: |
pip install -r requirements.txt
python -m pytest --pyargs spacy
displayName: 'Run tests'
displayName: "Run tests"

5
build-constraints.txt Normal file
View File

@ -0,0 +1,5 @@
# build version constraints for use with wheelwright + multibuild
numpy==1.15.0; python_version<='3.7'
numpy==1.17.3; python_version=='3.8'
numpy==1.19.3; python_version=='3.9'
numpy; python_version>='3.10'
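
Review note: the four constraint lines above pin a different numpy release per Python version through environment markers. As a rough illustration of how those markers partition the version space, here is a minimal sketch. The function `numpy_build_pin` is a hypothetical helper written for this note; it is not part of spaCy, pip, or wheelwright, and pip evaluates the markers itself.

```python
from typing import Optional, Tuple


def numpy_build_pin(python_version: Tuple[int, int]) -> Optional[str]:
    """Return the numpy version pinned for a (major, minor) Python version.

    Hypothetical helper that mirrors the markers in build-constraints.txt.
    Returns None when the constraint leaves numpy unpinned.
    """
    if python_version <= (3, 7):
        return "1.15.0"  # numpy==1.15.0; python_version<='3.7'
    if python_version == (3, 8):
        return "1.17.3"  # numpy==1.17.3; python_version=='3.8'
    if python_version == (3, 9):
        return "1.19.3"  # numpy==1.19.3; python_version=='3.9'
    return None  # numpy; python_version>='3.10' (no pin)


# Usage: look up the pin for a few interpreter versions.
pins = {version: numpy_build_pin(version) for version in [(3, 6), (3, 8), (3, 10)]}
```

Every Python version falls into exactly one branch, so the markers never produce two conflicting pins for the same interpreter.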

View File

@ -36,3 +36,44 @@ DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
scikit-learn
------------
* Files: scorer.py
The following implementation of roc_auc_score() is adapted from
scikit-learn, which is distributed under the following license:
New BSD License
Copyright (c) 2007–2019 The scikit-learn developers.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
a. Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
b. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
c. Neither the name of the Scikit-learn Developers nor the names of
its contributors may be used to endorse or promote products
derived from this software without specific prior written
permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
DAMAGE.

View File

@ -3,6 +3,8 @@ redirects = [
{from = "https://spacy.netlify.com/*", to="https://spacy.io/:splat", force = true },
# Subdomain for branches
{from = "https://nightly.spacy.io/*", to="https://nightly-spacy-io.spacy.io/:splat", force = true, status = 200},
# TODO: update this with the v2 branch build once v3 is live (status = 200)
{from = "https://v2.spacy.io/*", to="https://spacy.io/:splat", force = true},
# Old subdomains
{from = "https://survey.spacy.io/*", to = "https://spacy.io", force = true},
{from = "http://survey.spacy.io/*", to = "https://spacy.io", force = true},

View File

@ -1,13 +1,16 @@
[build-system]
requires = [
"setuptools",
"wheel",
"cython>=0.25",
"cymem>=2.0.2,<2.1.0",
"preshed>=3.0.2,<3.1.0",
"murmurhash>=0.28.0,<1.1.0",
"thinc>=8.0.0rc0,<8.1.0",
"thinc>=8.0.0rc2,<8.1.0",
"blis>=0.4.0,<0.8.0",
"pathy"
"pathy",
"numpy==1.15.0; python_version<='3.7'",
"numpy==1.17.3; python_version=='3.8'",
"numpy==1.19.3; python_version=='3.9'",
"numpy; python_version>='3.10'",
]
build-backend = "setuptools.build_meta"

View File

@ -1,7 +1,7 @@
# Our libraries
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc>=8.0.0rc0,<8.1.0
thinc>=8.0.0rc2,<8.1.0
blis>=0.4.0,<0.8.0
ml_datasets==0.2.0a0
murmurhash>=0.28.0,<1.1.0
@ -15,6 +15,7 @@ numpy>=1.15.0
requests>=2.13.0,<3.0.0
tqdm>=4.38.0,<5.0.0
pydantic>=1.5.0,<1.7.0
jinja2
# Official Python utilities
setuptools
packaging>=20.0
@ -26,4 +27,4 @@ pytest>=4.6.5
pytest-timeout>=1.3.0,<2.0.0
mock>=2.0.0,<3.0.0
flake8>=3.5.0,<3.6.0
jinja2
hypothesis

View File

@ -34,13 +34,13 @@ setup_requires =
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
murmurhash>=0.28.0,<1.1.0
thinc>=8.0.0rc0,<8.1.0
thinc>=8.0.0rc2,<8.1.0
install_requires =
# Our libraries
murmurhash>=0.28.0,<1.1.0
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc>=8.0.0rc0,<8.1.0
thinc>=8.0.0rc2,<8.1.0
blis>=0.4.0,<0.8.0
wasabi>=0.8.0,<1.1.0
srsly>=2.3.0,<3.0.0
@ -86,6 +86,10 @@ cuda101 =
cupy-cuda101>=5.0.0b4,<9.0.0
cuda102 =
cupy-cuda102>=5.0.0b4,<9.0.0
cuda110 =
cupy-cuda110>=5.0.0b4,<9.0.0
cuda111 =
cupy-cuda111>=5.0.0b4,<9.0.0
# Language tokenizers with external dependencies
ja =
sudachipy>=0.4.9
@ -94,8 +98,6 @@ ko =
natto-py==0.9.0
th =
pythainlp>=2.0
zh =
spacy-pkuseg==0.0.26
[bdist_wheel]
universal = false

View File

@ -2,9 +2,9 @@
from setuptools import Extension, setup, find_packages
import sys
import platform
import numpy
from distutils.command.build_ext import build_ext
from distutils.sysconfig import get_python_inc
import numpy
from pathlib import Path
import shutil
from Cython.Build import cythonize
@ -48,6 +48,7 @@ MOD_NAMES = [
"spacy.pipeline._parser_internals._state",
"spacy.pipeline._parser_internals.stateclass",
"spacy.pipeline._parser_internals.transition_system",
"spacy.pipeline._parser_internals._beam_utils",
"spacy.tokenizer",
"spacy.training.align",
"spacy.training.gold_io",
@ -194,8 +195,8 @@ def setup_package():
print(f"Copied {copy_file} -> {target_dir}")
include_dirs = [
get_python_inc(plat_specific=True),
numpy.get_include(),
get_python_inc(plat_specific=True),
]
ext_modules = []
for name in MOD_NAMES:
@ -212,7 +213,7 @@ def setup_package():
ext_modules=ext_modules,
cmdclass={"build_ext": build_ext_subclass},
include_dirs=include_dirs,
package_data={"": ["*.pyx", "*.pxd", "*.pxi", "*.cpp"]},
package_data={"": ["*.pyx", "*.pxd", "*.pxi"]},
)

View File

@ -7,7 +7,7 @@ warnings.filterwarnings("ignore", message="numpy.dtype size changed") # noqa
warnings.filterwarnings("ignore", message="numpy.ufunc size changed") # noqa
# These are imported as part of the API
from thinc.api import prefer_gpu, require_gpu # noqa: F401
from thinc.api import prefer_gpu, require_gpu, require_cpu # noqa: F401
from thinc.api import Config
from . import pipeline # noqa: F401

View File

@ -272,7 +272,11 @@ def show_validation_error(
msg.fail(title)
print(err.text.strip())
if hint_fill and "value_error.missing" in err.error_types:
config_path = file_path if file_path is not None else "config.cfg"
config_path = (
file_path
if file_path is not None and str(file_path) != "-"
else "config.cfg"
)
msg.text(
"If your config contains missing values, you can run the 'init "
"fill-config' command to fill in all the defaults, if possible:",

View File

@ -5,6 +5,7 @@ from wasabi import Printer
import srsly
import re
import sys
import itertools
from ._util import app, Arg, Opt
from ..training import docs_to_json
@ -130,15 +131,16 @@ def convert(
)
doc_files.append((input_loc, docs))
if concatenate:
all_docs = []
for _, docs in doc_files:
all_docs.extend(docs)
all_docs = itertools.chain.from_iterable([docs for _, docs in doc_files])
doc_files = [(input_path, all_docs)]
for input_loc, docs in doc_files:
if file_type == "json":
data = [docs_to_json(docs)]
len_docs = len(data)
else:
data = DocBin(docs=docs, store_user_data=True).to_bytes()
db = DocBin(docs=docs, store_user_data=True)
len_docs = len(db)
data = db.to_bytes()
if output_dir == "-":
_print_docs_to_stdout(data, file_type)
else:
@ -149,7 +151,7 @@ def convert(
output_file = Path(output_dir) / input_loc.parts[-1]
output_file = output_file.with_suffix(f".{file_type}")
_write_docs_to_file(data, output_file, file_type)
msg.good(f"Generated output file ({len(docs)} documents): {output_file}")
msg.good(f"Generated output file ({len_docs} documents): {output_file}")
def _print_docs_to_stdout(data: Any, output_type: str) -> None:
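
Review note: the hunks above replace the eagerly built `all_docs` list with `itertools.chain.from_iterable(...)`, which is presumably why the success message no longer calls `len(docs)`. A chain object is a lazy iterator and has no length. The sketch below uses plain strings as stand-ins for `Doc` objects, and the file names are made up for illustration.

```python
import itertools

# Stand-in data: each entry pairs an input name with its list of "docs".
doc_files = [("a.conllu", ["doc1", "doc2"]), ("b.conllu", ["doc3"])]

# Lazily chain all per-file lists into one iterator, as in the diff above.
all_docs = itertools.chain.from_iterable(docs for _, docs in doc_files)

# A chain object does not support len(), so this raises TypeError.
try:
    len(all_docs)
    has_len = True
except TypeError:
    has_len = False

# The count is only known once the iterator has been consumed into a container.
consumed = list(all_docs)
count = len(consumed)
```

In the diff, the role of `consumed` is played by the `DocBin`, which takes in the documents and can then report its own length.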

View File

@ -19,7 +19,7 @@ from .. import util
def debug_config_cli(
# fmt: off
ctx: typer.Context, # This is only used to read additional arguments
config_path: Path = Arg(..., help="Path to config file", exists=True),
config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
code_path: Optional[Path] = Opt(None, "--code-path", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
show_funcs: bool = Opt(False, "--show-functions", "-F", help="Show an overview of all registered functions used in the config and where they come from (modules, files etc.)"),
show_vars: bool = Opt(False, "--show-variables", "-V", help="Show an overview of all variables referenced in the config and their values. This will also reflect variables overwritten on the CLI.")

View File

@ -37,7 +37,7 @@ BLANK_MODEL_THRESHOLD = 2000
def debug_data_cli(
# fmt: off
ctx: typer.Context, # This is only used to read additional arguments
config_path: Path = Arg(..., help="Path to config file", exists=True),
config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
code_path: Optional[Path] = Opt(None, "--code-path", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
ignore_warnings: bool = Opt(False, "--ignore-warnings", "-IW", help="Ignore warnings, only show stats and errors"),
verbose: bool = Opt(False, "--verbose", "-V", help="Print additional information and explanations"),

View File

@ -22,7 +22,7 @@ from .. import util
def debug_model_cli(
# fmt: off
ctx: typer.Context, # This is only used to read additional arguments
config_path: Path = Arg(..., help="Path to config file", exists=True),
config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
component: str = Arg(..., help="Name of the pipeline component of which the model should be analysed"),
layers: str = Opt("", "--layers", "-l", help="Comma-separated names of layer IDs to print"),
dimensions: bool = Opt(False, "--dimensions", "-DIM", help="Show dimensions"),

View File

@ -35,7 +35,7 @@ def download_cli(
def download(model: str, direct: bool = False, *pip_args) -> None:
if not is_package("spacy") and "--no-deps" not in pip_args:
if not (is_package("spacy") or is_package("spacy-nightly")) and "--no-deps" not in pip_args:
msg.warn(
"Skipping pipeline package dependencies and setting `--no-deps`. "
"You don't seem to have the spaCy package itself installed "

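The changed condition treats either distribution name (`spacy` or `spacy-nightly`) as proof that spaCy is installed. The sketch below shows the same decision with the standard library only; `is_installed` and `needs_no_deps` are hypothetical helpers written for this illustration and are not spaCy's `is_package`:

```python
from importlib import metadata
from typing import Sequence


def is_installed(name: str) -> bool:
    """Return True if a distribution with this name is installed."""
    try:
        metadata.distribution(name)
    except metadata.PackageNotFoundError:
        return False
    return True


def needs_no_deps(
    pip_args: Sequence[str],
    candidates: Sequence[str] = ("spacy", "spacy-nightly"),
) -> bool:
    """Decide whether `--no-deps` should be added automatically.

    It is added only if the user did not pass it already and none of
    the candidate distributions is installed.
    """
    if "--no-deps" in pip_args:
        return False
    return not any(is_installed(name) for name in candidates)
```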
View File

@ -5,6 +5,7 @@ from wasabi import Printer, diff_strings
from thinc.api import Config
import srsly
import re
from jinja2 import Template
from .. import util
from ..language import DEFAULT_CONFIG_PRETRAIN_PATH
@ -29,7 +30,7 @@ def init_config_cli(
lang: Optional[str] = Opt("en", "--lang", "-l", help="Two-letter code of the language to use"),
pipeline: Optional[str] = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of trainable pipeline components to include (without 'tok2vec' or 'transformer')"),
optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."),
cpu: bool = Opt(False, "--cpu", "-C", help="Whether the model needs to run on CPU. This will impact the choice of architecture, pretrained weights and related hyperparameters."),
gpu: bool = Opt(False, "--gpu", "-G", help="Whether the model can run on GPU. This will impact the choice of architecture, pretrained weights and related hyperparameters."),
pretraining: bool = Opt(False, "--pretraining", "-pt", help="Include config for pretraining (with 'spacy pretrain')"),
# fmt: on
):
@ -44,14 +45,16 @@ def init_config_cli(
if isinstance(optimize, Optimizations): # instance of enum from the CLI
optimize = optimize.value
pipeline = string_to_list(pipeline)
init_config(
output_file,
is_stdout = str(output_file) == "-"
config = init_config(
lang=lang,
pipeline=pipeline,
optimize=optimize,
cpu=cpu,
gpu=gpu,
pretraining=pretraining,
silent=is_stdout,
)
save_config(config, output_file, is_stdout=is_stdout)
@init_cli.command("fill-config")
@ -117,20 +120,15 @@ def fill_config(
def init_config(
output_file: Path,
*,
lang: str,
pipeline: List[str],
optimize: str,
cpu: bool,
gpu: bool,
pretraining: bool = False,
) -> None:
is_stdout = str(output_file) == "-"
msg = Printer(no_print=is_stdout)
try:
from jinja2 import Template
except ImportError:
msg.fail("This command requires jinja2", "pip install jinja2", exits=1)
silent: bool = True,
) -> Config:
msg = Printer(no_print=silent)
with TEMPLATE_PATH.open("r") as f:
template = Template(f.read())
# Filter out duplicates since tok2vec and transformer are added by template
@ -140,7 +138,7 @@ def init_config(
"lang": lang,
"components": pipeline,
"optimize": optimize,
"hardware": "cpu" if cpu else "gpu",
"hardware": "gpu" if gpu else "cpu",
"transformer_data": reco["transformer"],
"word_vectors": reco["word_vectors"],
"has_letters": reco["has_letters"],
@ -176,7 +174,7 @@ def init_config(
pretrain_config = util.load_config(DEFAULT_CONFIG_PRETRAIN_PATH)
config = pretrain_config.merge(config)
msg.good("Auto-filled config with all values")
save_config(config, output_file, is_stdout=is_stdout)
return config
def save_config(
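
After this hunk, `init_config` only builds and returns the config, and the CLI wrapper is responsible for writing it. The hardware default also flips: CPU is assumed unless `--gpu` is passed. The following is a rough sketch of that split using a plain dict instead of a thinc `Config`; `build_config` and `write_config` are hypothetical names, not functions from the code base:

```python
import sys
from pathlib import Path
from typing import Dict


def build_config(*, lang: str, gpu: bool) -> Dict[str, str]:
    """Build the settings without touching the file system."""
    return {"lang": lang, "hardware": "gpu" if gpu else "cpu"}


def write_config(config: Dict[str, str], output_file: Path) -> str:
    """Serialize the settings and write them to a file, or to stdout for '-'."""
    text = "\n".join(f"{key} = {value}" for key, value in config.items())
    if str(output_file) == "-":
        sys.stdout.write(text + "\n")
    else:
        output_file.write_text(text, encoding="utf8")
    return text


config = build_config(lang="en", gpu=False)
```

Keeping the builder free of I/O means it can be called from Python code and tests without an output path, which appears to be the motivation for returning the config here.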

View File

@ -62,7 +62,7 @@ def update_lexemes(nlp: Language, jsonl_loc: Path) -> None:
def init_pipeline_cli(
# fmt: off
ctx: typer.Context, # This is only used to read additional arguments
config_path: Path = Arg(..., help="Path to config file", exists=True),
config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
output_path: Path = Arg(..., help="Output directory for the prepared data"),
code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"),
@ -88,7 +88,7 @@ def init_pipeline_cli(
def init_labels_cli(
# fmt: off
ctx: typer.Context, # This is only used to read additional arguments
config_path: Path = Arg(..., help="Path to config file", exists=True),
config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
output_path: Path = Arg(..., help="Output directory for the labels"),
code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"),

View File

@ -1,4 +1,4 @@
from typing import Optional, Union, Any, Dict
from typing import Optional, Union, Any, Dict, List
import shutil
from pathlib import Path
from wasabi import Printer, get_raw_input
@ -16,6 +16,7 @@ def package_cli(
# fmt: off
input_dir: Path = Arg(..., help="Directory with pipeline data", exists=True, file_okay=False),
output_dir: Path = Arg(..., help="Output parent directory", exists=True, file_okay=False),
code_paths: Optional[str] = Opt(None, "--code", "-c", help="Comma-separated paths to Python files with additional code (registered functions) to be included in the package"),
meta_path: Optional[Path] = Opt(None, "--meta-path", "--meta", "-m", help="Path to meta.json", exists=True, dir_okay=False),
create_meta: bool = Opt(False, "--create-meta", "-c", "-C", help="Create meta.json, even if one exists"),
name: Optional[str] = Opt(None, "--name", "-n", help="Package name to override meta"),
@ -33,12 +34,22 @@ def package_cli(
After packaging, "python setup.py sdist" is run in the package directory,
which will create a .tar.gz archive that can be installed via "pip install".
If additional code files are provided (e.g. Python files containing custom
registered functions like pipeline components), they are copied into the
package and imported in the __init__.py.
DOCS: https://nightly.spacy.io/api/cli#package
"""
code_paths = (
[Path(p.strip()) for p in code_paths.split(",")]
if code_paths is not None
else []
)
package(
input_dir,
output_dir,
meta_path=meta_path,
code_paths=code_paths,
name=name,
version=version,
create_meta=create_meta,
@ -52,6 +63,7 @@ def package(
input_dir: Path,
output_dir: Path,
meta_path: Optional[Path] = None,
code_paths: List[Path] = [],
name: Optional[str] = None,
version: Optional[str] = None,
create_meta: bool = False,
@ -67,6 +79,14 @@ def package(
msg.fail("Can't locate pipeline data", input_path, exits=1)
if not output_path or not output_path.exists():
msg.fail("Output directory not found", output_path, exits=1)
for code_path in code_paths:
if not code_path.exists():
msg.fail("Can't find code file", code_path, exits=1)
# Import the code here so it's available when model is loaded (via
# get_meta helper). Also verifies that everything works
util.import_file(code_path.stem, code_path)
if code_paths:
msg.good(f"Including {len(code_paths)} Python module(s) with custom code")
if meta_path and not meta_path.exists():
msg.fail("Can't find pipeline meta.json", meta_path, exits=1)
meta_path = meta_path or input_dir / "meta.json"
@ -103,10 +123,20 @@ def package(
)
Path.mkdir(package_path, parents=True)
shutil.copytree(str(input_dir), str(package_path / model_name_v))
license_path = package_path / model_name_v / "LICENSE"
if license_path.exists():
shutil.move(str(license_path), str(main_path))
imports = []
for code_path in code_paths:
imports.append(code_path.stem)
shutil.copy(str(code_path), str(package_path))
create_file(main_path / "meta.json", srsly.json_dumps(meta, indent=2))
create_file(main_path / "setup.py", TEMPLATE_SETUP)
create_file(main_path / "MANIFEST.in", TEMPLATE_MANIFEST)
create_file(package_path / "__init__.py", TEMPLATE_INIT)
init_py = TEMPLATE_INIT.format(
imports="\n".join(f"from . import {m}" for m in imports)
)
create_file(package_path / "__init__.py", init_py)
msg.good(f"Successfully created package '{model_name_v}'", main_path)
if create_sdist:
with util.working_dir(main_path):
@ -238,7 +268,7 @@ if __name__ == '__main__':
TEMPLATE_MANIFEST = """
include meta.json
include config.cfg
include LICENSE
""".strip()
@ -246,6 +276,7 @@ TEMPLATE_INIT = """
from pathlib import Path
from spacy.util import load_model_from_init_py, get_model_meta
{imports}
__version__ = get_model_meta(Path(__file__).parent)['version']
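
The `--code` handling above does two things: it splits the comma-separated option into paths, and it turns each file stem into a relative import that is substituted into the `{imports}` slot of the `__init__.py` template. A small sketch of both steps, with a reduced stand-in template (`INIT_TEMPLATE`, `parse_code_paths` and `render_init` are names invented for this example):

```python
from pathlib import Path
from typing import List, Optional

# Reduced stand-in for TEMPLATE_INIT; only the {imports} slot matters here.
INIT_TEMPLATE = "from pathlib import Path\n{imports}\n"


def parse_code_paths(code_paths: Optional[str]) -> List[Path]:
    """Split a comma-separated CLI value into a list of paths."""
    if code_paths is None:
        return []
    return [Path(p.strip()) for p in code_paths.split(",")]


def render_init(paths: List[Path]) -> str:
    """Render one relative import per code file, keyed by the file stem."""
    imports = "\n".join(f"from . import {p.stem}" for p in paths)
    return INIT_TEMPLATE.format(imports=imports)


paths = parse_code_paths("functions.py, extra/components.py")
init_py = render_init(paths)
```

Because the files are copied flat into the package directory and imported by stem, two code files with the same file name in different directories would collide, and each stem needs to be a valid Python module name.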

View File

@ -17,7 +17,7 @@ from ..util import load_config
def pretrain_cli(
# fmt: off
ctx: typer.Context, # This is only used to read additional arguments
config_path: Path = Arg(..., help="Path to config file", exists=True, dir_okay=False),
config_path: Path = Arg(..., help="Path to config file", exists=True, dir_okay=False, allow_dash=True),
output_dir: Path = Arg(..., help="Directory to write weights to on each epoch"),
code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
resume_path: Optional[Path] = Opt(None, "--resume-path", "-r", help="Path to pretrained weights from which to resume pretraining"),
@ -79,7 +79,7 @@ def pretrain_cli(
def verify_cli_args(config_path, output_dir, resume_path, epoch_resume):
if not config_path or not config_path.exists():
if not config_path or (str(config_path) != "-" and not config_path.exists()):
msg.fail("Config file not found", config_path, exits=1)
if output_dir.exists() and [p for p in output_dir.iterdir()]:
if resume_path:
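
With `allow_dash=True`, a single `-` is accepted as the config path, so the existence check has to skip it. A minimal sketch of the updated condition as a standalone predicate (`config_path_is_usable` is a hypothetical helper name used only here):

```python
from pathlib import Path
from typing import Optional


def config_path_is_usable(config_path: Optional[Path]) -> bool:
    """Mirror the check in the diff: '-' is accepted without existing on disk."""
    if not config_path:
        return False
    if str(config_path) == "-":
        return True
    return config_path.exists()
```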

View File

@ -19,6 +19,7 @@ lang = "{{ lang }}"
{%- set full_pipeline = ["transformer" if use_transformer else "tok2vec"] + components %}
pipeline = {{ full_pipeline|pprint()|replace("'", '"')|safe }}
tokenizer = {"@tokenizers": "spacy.Tokenizer.v1"}
batch_size = {{ 128 if hardware == "gpu" else 1000 }}
[components]
@ -143,6 +144,9 @@ nO = null
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
[components.textcat.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false

View File

@ -18,7 +18,7 @@ from .. import util
def train_cli(
# fmt: off
ctx: typer.Context, # This is only used to read additional arguments
config_path: Path = Arg(..., help="Path to config file", exists=True),
config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
output_path: Optional[Path] = Opt(None, "--output", "--output-path", "-o", help="Output directory to store trained pipeline in"),
code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"),
@ -41,7 +41,7 @@ def train_cli(
"""
util.logger.setLevel(logging.DEBUG if verbose else logging.INFO)
# Make sure all files and paths exist if they are needed
if not config_path or not config_path.exists():
if not config_path or (str(config_path) != "-" and not config_path.exists()):
msg.fail("Config file not found", config_path, exits=1)
if output_path is not None and not output_path.exists():
output_path.mkdir(parents=True)

View File

@ -20,6 +20,8 @@ disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
# Default batch size to use with nlp.pipe and nlp.evaluate
batch_size = 1000
[nlp.tokenizer]
@tokenizers = "spacy.Tokenizer.v1"

View File

@ -17,7 +17,9 @@ tolerance = 0.2
get_length = null
[pretraining.objective]
type = "characters"
@architectures = "spacy.PretrainCharacters.v1"
maxout_pieces = 3
hidden_size = 300
n_characters = 4
[pretraining.optimizer]

View File

@ -119,14 +119,19 @@ class Warnings:
"call the {matcher} on each Doc object.")
W107 = ("The property `Doc.{prop}` is deprecated. Use "
"`Doc.has_annotation(\"{attr}\")` instead.")
W108 = ("The rule-based lemmatizer did not find POS annotation for the "
"token '{text}'. Check that your pipeline includes components that "
"assign token.pos, typically 'tagger'+'attribute_ruler' or "
"'morphologizer'.")
@add_codes
class Errors:
E001 = ("No component '{name}' found in pipeline. Available names: {opts}")
E002 = ("Can't find factory for '{name}' for language {lang} ({lang_code}). "
"This usually happens when spaCy calls `nlp.{method}` with custom "
"This usually happens when spaCy calls `nlp.{method}` with a custom "
"component name that's not registered on the current language class. "
"If you're using a Transformer, make sure to install 'spacy-transformers'. "
"If you're using a custom component, make sure you've added the "
"decorator `@Language.component` (for function components) or "
"`@Language.factory` (for class components).\n\nAvailable "
@ -456,6 +461,9 @@ class Errors:
"issue tracker: http://github.com/explosion/spaCy/issues")
# TODO: fix numbering after merging develop into master
E896 = ("There was an error using the static vectors. Ensure that the vectors "
"of the vocab are properly initialized, or set 'include_static_vectors' "
"to False.")
E897 = ("Field '{field}' should be a dot-notation string referring to the "
"relevant section in the config, but found type {type} instead.")
E898 = ("Can't serialize trainable pipe '{name}': the `model` attribute "
@ -483,8 +491,8 @@ class Errors:
"has been applied.")
E905 = ("Cannot initialize StaticVectors layer: nM dimension unset. This "
"dimension refers to the width of the vectors table.")
E906 = ("Unexpected `loss` value in pretraining objective: {loss_type}")
E907 = ("Unexpected `objective_type` value in pretraining objective: {objective_type}")
E906 = ("Unexpected `loss` value in pretraining objective: '{found}'. Supported values "
"are: {supported}")
E908 = ("Can't set `spaces` without `words` in `Doc.__init__`.")
E909 = ("Expected {name} in parser internals. This is likely a bug in spaCy.")
E910 = ("Encountered NaN value when computing loss for component '{name}'.")
@ -712,6 +720,10 @@ class Errors:
E1013 = ("Invalid morph: the MorphAnalysis must have the same vocab as the "
"token itself. To set the morph from this MorphAnalysis, set from "
"the string value with: `token.set_morph(str(other_morph))`.")
E1014 = ("Error loading DocBin data. It doesn't look like the data is in "
"DocBin (.spacy) format. If your data is in spaCy v2's JSON "
"training format, convert it using `python -m spacy convert "
"file.json .`.")
# Deprecated model shortcuts, only used in errors and warnings
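
The reworded `E906` takes two placeholders, `found` and `supported`, instead of the old `loss_type`. A sketch of how such a message might be raised is shown below; the list of supported losses and the `validate_loss` helper are assumptions made for illustration, not taken from this diff:

```python
E906 = (
    "Unexpected `loss` value in pretraining objective: '{found}'. "
    "Supported values are: {supported}"
)

# Illustrative list only; the real set of supported losses is not shown here.
SUPPORTED_LOSSES = ["L2", "cosine"]


def validate_loss(loss: str) -> str:
    """Return the loss name unchanged, or raise with the formatted message."""
    if loss not in SUPPORTED_LOSSES:
        raise ValueError(E906.format(found=loss, supported=SUPPORTED_LOSSES))
    return loss
```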

View File

@ -210,8 +210,12 @@ _ukrainian_lower = r"а-щюяіїєґ"
_ukrainian_upper = r"А-ЩЮЯІЇЄҐ"
_ukrainian = r"а-щюяіїєґА-ЩЮЯІЇЄҐ"
_upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian_upper
_lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower
_macedonian_lower = r"ѓѕјљњќѐѝ"
_macedonian_upper = r"ЃЅЈЉЊЌЀЍ"
_macedonian = r"ѓѕјљњќѐѝЃЅЈЉЊЌЀЍ"
_upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian_upper + _macedonian_upper
_lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower + _macedonian_lower
_uncased = (
_bengali
@ -226,7 +230,7 @@ _uncased = (
+ _cjk
)
ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased)
ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _macedonian + _uncased)
ALPHA_LOWER = group_chars(_lower + _uncased)
ALPHA_UPPER = group_chars(_upper + _uncased)
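
The Macedonian letters need to be listed separately because they fall outside the basic Cyrillic range. The sketch below illustrates the effect with plain `re` character classes; it does not use spaCy's `group_chars`, and the `а-я` range is a simplified stand-in for the existing Cyrillic definitions:

```python
import re

_macedonian_lower = r"ѓѕјљњќѐѝ"
_cyrillic_lower = r"а-я"  # simplified stand-in, not spaCy's actual ranges

# Character class without and with the Macedonian-specific letters.
without_mk = re.compile(f"[{_cyrillic_lower}]+")
with_mk = re.compile(f"[{_cyrillic_lower}{_macedonian_lower}]+")

# "љубов" and "ѕвезда" start with letters outside the а-я range.
words = ["љубов", "ѕвезда"]
matched_without = [bool(without_mk.fullmatch(w)) for w in words]
matched_with = [bool(with_mk.fullmatch(w)) for w in words]
```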

View File

@ -1,9 +1,16 @@
from .stop_words import STOP_WORDS
from .tag_map import TAG_MAP
from ...language import Language
from ...attrs import LANG
from .lex_attrs import LEX_ATTRS
from ...language import Language
class CzechDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: "cs"
tag_map = TAG_MAP
stop_words = STOP_WORDS
lex_attr_getters = LEX_ATTRS

4312
spacy/lang/cs/tag_map.py Normal file

File diff suppressed because it is too large

View File

@ -6,10 +6,21 @@ from ...tokens import Doc, Span
def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
"""Detect base noun phrases from a dependency parse. Works on Doc and Span."""
# fmt: off
labels = ["nsubj", "dobj", "nsubjpass", "pcomp", "pobj", "dative", "appos", "attr", "ROOT"]
# fmt: on
"""
Detect base noun phrases from a dependency parse. Works on both Doc and Span.
"""
labels = [
"oprd",
"nsubj",
"dobj",
"nsubjpass",
"pcomp",
"pobj",
"dative",
"appos",
"attr",
"ROOT",
]
doc = doclike.doc # Ensure works on both Doc and Span.
if not doc.has_annotation("DEP"):
raise ValueError(Errors.E029)

48
spacy/lang/mk/__init__.py Normal file
View File

@ -0,0 +1,48 @@
from typing import Optional
from thinc.api import Model
from .lemmatizer import MacedonianLemmatizer
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .lex_attrs import LEX_ATTRS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language
from ...attrs import LANG
from ...util import update_exc
from ...lookups import Lookups
class MacedonianDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: "mk"
# Optional: replace flags with custom functions, e.g. like_num()
lex_attr_getters.update(LEX_ATTRS)
# Merge base exceptions and custom tokenizer exceptions
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
@classmethod
def create_lemmatizer(cls, nlp=None, lookups=None):
if lookups is None:
lookups = Lookups()
return MacedonianLemmatizer(lookups)
class Macedonian(Language):
lang = "mk"
Defaults = MacedonianDefaults
@Macedonian.factory(
"lemmatizer",
assigns=["token.lemma"],
default_config={"model": None, "mode": "rule"},
default_score_weights={"lemma_acc": 1.0},
)
def make_lemmatizer(nlp: Language, model: Optional[Model], name: str, mode: str):
return MacedonianLemmatizer(nlp.vocab, model, name, mode=mode)
__all__ = ["Macedonian"]

View File

@ -0,0 +1,55 @@
from typing import List
from collections import OrderedDict
from ...pipeline import Lemmatizer
from ...tokens import Token
class MacedonianLemmatizer(Lemmatizer):
def rule_lemmatize(self, token: Token) -> List[str]:
string = token.text
univ_pos = token.pos_.lower()
morphology = token.morph.to_dict()
if univ_pos in ("", "eol", "space"):
return [string.lower()]
if string[-3:] == 'јќи':
string = string[:-3]
univ_pos = "verb"
if callable(self.is_base_form) and self.is_base_form(univ_pos, morphology):
return [string.lower()]
index_table = self.lookups.get_table("lemma_index", {})
exc_table = self.lookups.get_table("lemma_exc", {})
rules_table = self.lookups.get_table("lemma_rules", {})
if not any((index_table.get(univ_pos), exc_table.get(univ_pos), rules_table.get(univ_pos))):
if univ_pos == "propn":
return [string]
else:
return [string.lower()]
index = index_table.get(univ_pos, {})
exceptions = exc_table.get(univ_pos, {})
rules = rules_table.get(univ_pos, [])
orig = string
string = string.lower()
forms = []
for old, new in rules:
if string.endswith(old):
form = string[: len(string) - len(old)] + new
if not form:
continue
if form in index or not form.isalpha():
forms.append(form)
forms = list(OrderedDict.fromkeys(forms))
for form in exceptions.get(string, []):
if form not in forms:
forms.insert(0, form)
if not forms:
forms.append(orig)
return forms
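
The core of `rule_lemmatize` is the suffix-rewrite loop: each `(old, new)` rule that matches the end of the lowercased token produces a candidate, which is kept if it appears in the index or is not purely alphabetic; exceptions are then placed first, and the original string is the fallback. Below is a standalone sketch of that loop with made-up English toy data (the tables are illustrative, not real lookup data):

```python
from collections import OrderedDict
from typing import Dict, List, Sequence, Set, Tuple


def apply_suffix_rules(
    string: str,
    index: Set[str],
    exceptions: Dict[str, List[str]],
    rules: Sequence[Tuple[str, str]],
) -> List[str]:
    orig = string
    string = string.lower()
    forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[: len(string) - len(old)] + new
            # Keep known forms, and anything that is not purely alphabetic.
            if form and (form in index or not form.isalpha()):
                forms.append(form)
    # Remove duplicates while keeping the first-seen order.
    forms = list(OrderedDict.fromkeys(forms))
    # Exceptions take priority and are inserted at the front.
    for form in exceptions.get(string, []):
        if form not in forms:
            forms.insert(0, form)
    # Fall back to the original string, with its casing intact.
    return forms or [orig]


# Toy data, invented for this example.
index = {"walk", "study"}
exceptions = {"went": ["go"]}
rules = [("ed", ""), ("ies", "y"), ("s", "")]
```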

View File

@ -0,0 +1,55 @@
from ...attrs import LIKE_NUM
_num_words = [
"нула", "еден", "една", "едно", "два", "две", "три", "четири", "пет", "шест", "седум", "осум", "девет", "десет",
"единаесет", "дванаесет", "тринаесет", "четиринаесет", "петнаесет", "шеснаесет", "седумнаесет", "осумнаесет",
"деветнаесет", "дваесет", "триесет", "четириесет", "педесет", "шеесет", "седумдесет", "осумдесет", "деведесет",
"сто", "двесте", "триста", "четиристотини", "петстотини", "шестотини", "седумстотини", "осумстотини",
"деветстотини", "илјада", "илјади", 'милион', 'милиони', 'милијарда', 'милијарди', 'билион', 'билиони',
"двајца", "тројца", "четворица", "петмина", "шестмина", "седуммина", "осуммина", "деветмина", "обата", "обајцата",
"прв", "втор", "трет", "четврт", "седм", "осм", "двестоти",
"два-три", "два-триесет", "два-триесетмина", "два-тринаесет", "два-тројца", "две-три", "две-тристотини",
"пет-шеесет", "пет-шеесетмина", "пет-шеснаесетмина", "пет-шест", "пет-шестмина", "пет-шестотини", "петина",
"осмина", "седум-осум", "седум-осумдесет", "седум-осуммина", "седум-осумнаесет", "седум-осумнаесетмина",
"три-четириесет", "три-четиринаесет", "шеесет", "шеесетина", "шеесетмина", "шеснаесет", "шеснаесетмина",
"шест-седум", "шест-седумдесет", "шест-седумнаесет", "шест-седумстотини", "шестоти", "шестотини"
]
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
text_lower = text.lower()
if text_lower in _num_words:
return True
if text_lower.endswith(("а", "о", "и")):
if text_lower[:-1] in _num_words:
return True
if text_lower.endswith(("ти", "та", "то", "на")):
if text_lower[:-2] in _num_words:
return True
if text_lower.endswith(("ата", "иот", "ите", "ина", "чки")):
if text_lower[:-3] in _num_words:
return True
if text_lower.endswith(("мина", "тина")):
if text_lower[:-4] in _num_words:
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}
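
To see how the suffix checks in `like_num` behave, here is a condensed variant that loops over the same suffixes instead of using one `if` block per suffix length. It uses only a short excerpt of the number words, so it is a sketch of the logic rather than a replacement for the function above:

```python
_num_words = {"два", "три", "пет", "шест"}  # short excerpt of the full list
_suffixes = (
    "а", "о", "и",
    "ти", "та", "то", "на",
    "ата", "иот", "ите", "ина", "чки",
    "мина", "тина",
)


def like_num(text: str) -> bool:
    # Drop one leading sign and any thousands or decimal separators.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as 3/4.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    text_lower = text.lower()
    if text_lower in _num_words:
        return True
    # A number word followed by one of the known suffixes.
    return any(
        text_lower.endswith(suffix) and text_lower[: -len(suffix)] in _num_words
        for suffix in _suffixes
    )
```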

815
spacy/lang/mk/stop_words.py Normal file
View File

@ -0,0 +1,815 @@
STOP_WORDS = set(
"""
а
абре
аи
ако
алало
ам
ама
аман
ами
амин
априли-ли-ли
ау
аух
ауч
ах
аха
аха-ха
аш
ашколсум
ашколсун
ај
ајде
ајс
аџаба
бавно
бам
бам-бум
бап
бар
баре
барем
бау
бау-бау
баш
бај
бе
беа
бев
бевме
бевте
без
безбели
бездруго
белки
беше
би
бидејќи
бим
бис
бла
блазе
богами
божем
боц
браво
бравос
бре
бреј
брзо
бришка
бррр
бу
бум
буф
буц
бујрум
ваа
вам
варај
варда
вас
вај
ве
велат
вели
версус
веќе
ви
виа
види
вие
вистина
витос
внатре
во
воз
вон
впрочем
врв
вред
време
врз
всушност
втор
галиба
ги
гитла
го
годе
годишник
горе
гра
гуц
гљу
да
даан
дава
дал
дали
дан
два
дваесет
дванаесет
двајца
две
двесте
движам
движат
движи
движиме
движите
движиш
де
деведесет
девет
деветнаесет
деветстотини
деветти
дека
дел
делми
демек
десет
десетина
десетти
деситици
дејгиди
дејди
ди
дилми
дин
дип
дно
до
доволно
додека
додуша
докај
доколку
доправено
доправи
досамоти
доста
држи
дрн
друг
друга
другата
други
другиот
другите
друго
другото
дум
дур
дури
е
евала
еве
евет
ега
егиди
еден
едикојси
единаесет
единствено
еднаш
едно
ексик
ела
елбете
елем
ели
ем
еми
ене
ете
еурека
ех
еј
жими
жити
за
завал
заврши
зад
задека
задоволна
задржи
заедно
зар
зарад
заради
заре
зарем
затоа
зашто
згора
зема
земе
земува
зер
значи
зошто
зуј
и
иако
из
извезен
изгледа
измеѓу
износ
или
или-или
илјада
илјади
им
има
имаа
имаат
имавме
имавте
имам
имаме
имате
имаш
имаше
име
имено
именува
имплицира
имплицираат
имплицирам
имплицираме
имплицирате
имплицираш
инаку
индицира
исечок
исклучен
исклучена
исклучени
исклучено
искористен
искористена
искористени
искористено
искористи
искрај
исти
исто
итака
итн
их
иха
ихуу
иш
ишала
иј
ка
каде
кажува
како
каков
камоли
кај
ква
ки
кит
кло
клум
кога
кого
кого-годе
кое
кои
количество
количина
колку
кому
кон
користена
користени
користено
користи
кот
котрр
кош-кош
кој
која
којзнае
којшто
кр-кр-кр
крај
крек
крз
крк
крц
куку
кукуригу
куш
ле
лебами
леле
лели
ли
лиду
луп
ма
макар
малку
марш
мат
мац
машала
ме
мене
место
меѓу
меѓувреме
меѓутоа
ми
мое
може
можеби
молам
моли
мор
мора
море
мори
мразец
му
муклец
мутлак
муц
мјау
на
навидум
навистина
над
надвор
назад
накај
накрај
нали
нам
наместо
наоколу
направено
направи
напред
нас
наспоред
наспрема
наспроти
насред
натаму
натема
начин
наш
наша
наше
наши
нај
најдоцна
најмалку
најмногу
не
неа
него
негов
негова
негови
негово
незе
нека
некаде
некако
некаков
некого
некое
некои
неколку
некому
некој
некојси
нели
немој
нему
неоти
нечиј
нешто
нејзе
нејзин
нејзини
нејзино
нејсе
ни
нив
нивен
нивна
нивни
нивно
ние
низ
никаде
никако
никогаш
никого
никому
никој
ним
нити
нито
ниту
ничиј
ништо
но
нѐ
о
обр
ова
ова-она
оваа
овај
овде
овега
овие
овој
од
одавде
оди
однесува
односно
одошто
околу
олеле
олкацок
он
она
онаа
онака
онаков
онде
они
оние
оно
оној
оп
освем
освен
осем
осми
осум
осумдесет
осумнаесет
осумстотини
отаде
оти
откако
откај
откога
отколку
оттаму
оттука
оф
ох
ој
па
пак
папа
пардон
пате-ќуте
пати
пау
паче
пеесет
пеки
пет
петнаесет
петстотини
петти
пи
пи-пи
пис
плас
плус
по
побавно
поблиску
побрзо
побуни
повеќе
повторно
под
подалеку
подолу
подоцна
подруго
позади
поинаква
поинакви
поинакво
поинаков
поинаку
покаже
покажува
покрај
полно
помалку
помеѓу
понатаму
понекогаш
понекој
поради
поразличен
поразлична
поразлични
поразлично
поседува
после
последен
последна
последни
последно
поспоро
потег
потоа
пошироко
прави
празно
прв
пред
през
преку
претежно
претходен
претходна
претходни
претходник
претходно
при
присвои
притоа
причинува
пријатно
просто
против
прр
пст
пук
пусто
пуф
пуј
пфуј
пшт
ради
различен
различна
различни
различно
разни
разоружен
разредлив
рамките
рамнообразно
растревожено
растреперено
расчувствувано
ратоборно
рече
роден
с
сакан
сам
сама
сами
самите
само
самоти
свое
свои
свој
своја
се
себе
себеси
сега
седми
седум
седумдесет
седумнаесет
седумстотини
секаде
секаков
секи
секогаш
секого
секому
секој
секојдневно
сем
сенешто
сепак
сериозен
сериозна
сериозни
сериозно
сет
сечиј
сешто
си
сиктер
сиот
сип
сиреч
сите
сичко
скок
скоро
скрц
следбеник
следбеничка
следен
следователно
следствено
сме
со
соне
сопствен
сопствена
сопствени
сопствено
сосе
сосем
сполај
според
споро
спрема
спроти
спротив
сред
среде
среќно
срочен
сст
става
ставаат
ставам
ставаме
ставате
ставаш
стави
сте
сто
стоп
страна
сум
сума
супер
сус
сѐ
та
таа
така
таква
такви
таков
тамам
таму
тангар-мангар
тандар-мандар
тап
твое
те
тебе
тебека
тек
текот
ти
тие
тизе
тик-так
тики
тоа
тогаш
тој
трак
трака-трука
трас
треба
трет
три
триесет
тринаест
триста
труп
трупа
трус
ту
тука
туку
тукушто
туф
у
уа
убаво
уви
ужасно
уз
ура
уу
уф
уха
уш
уште
фазен
фала
фил
филан
фис
фиу
фиљан
фоб
фон
ха
ха-ха
хе
хеј
хи
хм
хо
цак
цап
целина
цело
цигу-лигу
циц
чекај
често
четврт
четири
четириесет
четиринаесет
четирстотини
чие
чии
чик
чик-чирик
чини
чиш
чиј
чија
чијшто
чкрап
чому
чук
чукш
чуму
чунки
шеесет
шеснаесет
шест
шести
шестотини
ширум
шлак
шлап
шлапа-шлупа
шлуп
шмрк
што
штогоде
штом
штотуку
штрак
штрап
штрап-штруп
шуќур
ѓиди
ѓоа
ѓоамити
ѕан
ѕе
ѕин
ја
јадец
јазе
јали
јас
јаска
јок
ќе
ќешки
ѝ
џагара-магара
џанам
џив-џив
""".split()
)

View File

@ -0,0 +1,100 @@
from ...symbols import ORTH, NORM
_exc = {}
_abbr_exc = [
{ORTH: "м", NORM: "метар"},
{ORTH: "мм", NORM: "милиметар"},
{ORTH: "цм", NORM: "центиметар"},
{ORTH: "см", NORM: "сантиметар"},
{ORTH: "дм", NORM: "дециметар"},
{ORTH: "км", NORM: "километар"},
{ORTH: "кг", NORM: "килограм"},
{ORTH: "дкг", NORM: "декаграм"},
{ORTH: "дг", NORM: "дециграм"},
{ORTH: "мг", NORM: "милиграм"},
{ORTH: "г", NORM: "грам"},
{ORTH: "т", NORM: "тон"},
{ORTH: "кл", NORM: "килолитар"},
{ORTH: "хл", NORM: "хектолитар"},
{ORTH: "дкл", NORM: "декалитар"},
{ORTH: "л", NORM: "литар"},
{ORTH: "дл", NORM: "децилитар"}
]
for abbr in _abbr_exc:
_exc[abbr[ORTH]] = [abbr]
_abbr_line_exc = [
{ORTH: "д-р", NORM: "доктор"},
{ORTH: "м-р", NORM: "магистер"},
{ORTH: "г-ѓа", NORM: "госпоѓа"},
{ORTH: "г-ца", NORM: "госпоѓица"},
{ORTH: "г-дин", NORM: "господин"},
]
for abbr in _abbr_line_exc:
_exc[abbr[ORTH]] = [abbr]
_abbr_dot_exc = [
{ORTH: "в.", NORM: "век"},
{ORTH: "в.д.", NORM: "вршител на должност"},
{ORTH: "г.", NORM: "година"},
{ORTH: "г.г.", NORM: "господин господин"},
{ORTH: "м.р.", NORM: "машки род"},
{ORTH: "ж.р.", NORM: "женски род"},
{ORTH: "с.р.", NORM: "среден род"},
{ORTH: "н.е.", NORM: "наша ера"},
{ORTH: "о.г.", NORM: "оваа година"},
{ORTH: "о.м.", NORM: "овој месец"},
{ORTH: "с.", NORM: "село"},
{ORTH: "т.", NORM: "точка"},
{ORTH: "т.е.", NORM: "то ест"},
{ORTH: "т.н.", NORM: "таканаречен"},
{ORTH: "бр.", NORM: "број"},
{ORTH: "гр.", NORM: "град"},
{ORTH: "др.", NORM: "другар"},
{ORTH: "и др.", NORM: "и друго"},
{ORTH: "и сл.", NORM: "и слично"},
{ORTH: "кн.", NORM: "книга"},
{ORTH: "мн.", NORM: "множина"},
{ORTH: "на пр.", NORM: "на пример"},
{ORTH: "св.", NORM: "свети"},
{ORTH: "сп.", NORM: "списание"},
{ORTH: "с.", NORM: "страница"},
{ORTH: "стр.", NORM: "страница"},
{ORTH: "чл.", NORM: "член"},
{ORTH: "арх.", NORM: "архитект"},
{ORTH: "бел.", NORM: "белешка"},
{ORTH: "гимн.", NORM: "гимназија"},
{ORTH: "ден.", NORM: "денар"},
{ORTH: "ул.", NORM: "улица"},
{ORTH: "инж.", NORM: "инженер"},
{ORTH: "проф.", NORM: "професор"},
{ORTH: "студ.", NORM: "студент"},
{ORTH: "бот.", NORM: "ботаника"},
{ORTH: "мат.", NORM: "математика"},
{ORTH: "мед.", NORM: "медицина"},
{ORTH: "прил.", NORM: "прилог"},
{ORTH: "прид.", NORM: "придавка"},
{ORTH: "сврз.", NORM: "сврзник"},
{ORTH: "физ.", NORM: "физика"},
{ORTH: "хем.", NORM: "хемија"},
{ORTH: "пр. н.", NORM: "природни науки"},
{ORTH: "истор.", NORM: "историја"},
{ORTH: "геогр.", NORM: "географија"},
{ORTH: "литер.", NORM: "литература"},
]
for abbr in _abbr_dot_exc:
_exc[abbr[ORTH]] = [abbr]
TOKENIZER_EXCEPTIONS = _exc

View File

@ -1,4 +1,4 @@
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, TOKEN_MATCH
from .stop_words import STOP_WORDS
from .syntax_iterators import SYNTAX_ITERATORS
from .lex_attrs import LEX_ATTRS
@ -9,6 +9,7 @@ class TurkishDefaults(Language.Defaults):
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
lex_attr_getters = LEX_ATTRS
stop_words = STOP_WORDS
token_match = TOKEN_MATCH
syntax_iterators = SYNTAX_ITERATORS

View File

@ -1,119 +1,181 @@
from ..tokenizer_exceptions import BASE_EXCEPTIONS
import re
from ..punctuation import ALPHA_LOWER, ALPHA
from ...symbols import ORTH, NORM
from ...util import update_exc
_exc = {"sağol": [{ORTH: "sağ"}, {ORTH: "ol", NORM: "olun"}]}
_exc = {}
for exc_data in [
{ORTH: "A.B.D.", NORM: "Amerika Birleşik Devletleri"},
{ORTH: "Alb.", NORM: "Albay"},
{ORTH: "Ar.Gör.", NORM: "Araştırma Görevlisi"},
{ORTH: "Arş.Gör.", NORM: "Araştırma Görevlisi"},
{ORTH: "Asb.", NORM: "Astsubay"},
{ORTH: "Astsb.", NORM: "Astsubay"},
{ORTH: "As.İz.", NORM: "Askeri İnzibat"},
{ORTH: "Atğm", NORM: "Asteğmen"},
{ORTH: "Av.", NORM: "Avukat"},
{ORTH: "Apt.", NORM: "Apartmanı"},
{ORTH: "Bçvş.", NORM: "Başçavuş"},
_abbr_period_exc = [
{ORTH: "A.B.D.", NORM: "Amerika"},
{ORTH: "Alb.", NORM: "albay"},
{ORTH: "Ank.", NORM: "Ankara"},
{ORTH: "Ar.Gör."},
{ORTH: "Arş.Gör."},
{ORTH: "Asb.", NORM: "astsubay"},
{ORTH: "Astsb.", NORM: "astsubay"},
{ORTH: "As.İz."},
{ORTH: "as.iz."},
{ORTH: "Atğm", NORM: "asteğmen"},
{ORTH: "Av.", NORM: "avukat"},
{ORTH: "Apt.", NORM: "apartmanı"},
{ORTH: "apt.", NORM: "apartmanı"},
{ORTH: "Bçvş.", NORM: "başçavuş"},
{ORTH: "bçvş.", NORM: "başçavuş"},
{ORTH: "bk.", NORM: "bakınız"},
{ORTH: "bknz.", NORM: "bakınız"},
{ORTH: "Bnb.", NORM: "Binbaşı"},
{ORTH: "Bnb.", NORM: "binbaşı"},
{ORTH: "bnb.", NORM: "binbaşı"},
{ORTH: "Böl.", NORM: "Bölümü"},
{ORTH: "Bşk.", NORM: "Başkanlığı"},
{ORTH: "Bştbp.", NORM: "Baştabip"},
{ORTH: "Bul.", NORM: "Bulvarı"},
{ORTH: "Cad.", NORM: "Caddesi"},
{ORTH: "Böl.", NORM: "bölümü"},
{ORTH: "böl.", NORM: "bölümü"},
{ORTH: "Bşk.", NORM: "başkanlığı"},
{ORTH: "bşk.", NORM: "başkanlığı"},
{ORTH: "Bştbp.", NORM: "baştabip"},
{ORTH: "bştbp.", NORM: "baştabip"},
{ORTH: "Bul.", NORM: "bulvarı"},
{ORTH: "bul.", NORM: "bulvarı"},
{ORTH: "Cad.", NORM: "caddesi"},
{ORTH: "cad.", NORM: "caddesi"},
{ORTH: "çev.", NORM: "çeviren"},
{ORTH: "Çvş.", NORM: "Çavuş"},
{ORTH: "Çvş.", NORM: "çavuş"},
{ORTH: "çvş.", NORM: "çavuş"},
{ORTH: "dak.", NORM: "dakika"},
{ORTH: "dk.", NORM: "dakika"},
{ORTH: "Doç.", NORM: "Doçent"},
{ORTH: "doğ.", NORM: "doğum tarihi"},
{ORTH: "Doç.", NORM: "doçent"},
{ORTH: "doğ."},
{ORTH: "Dr.", NORM: "doktor"},
{ORTH: "dr.", NORM: "doktor"},
{ORTH: "drl.", NORM: "derleyen"},
{ORTH: "Dz.", NORM: "Deniz"},
{ORTH: "Dz.K.K.lığı", NORM: "Deniz Kuvvetleri Komutanlığı"},
{ORTH: "Dz.Kuv.", NORM: "Deniz Kuvvetleri"},
{ORTH: "Dz.Kuv.K.", NORM: "Deniz Kuvvetleri Komutanlığı"},
{ORTH: "Dz.", NORM: "deniz"},
{ORTH: "Dz.K.K.lığı"},
{ORTH: "Dz.Kuv."},
{ORTH: "Dz.Kuv.K."},
{ORTH: "dzl.", NORM: "düzenleyen"},
{ORTH: "Ecz.", NORM: "Eczanesi"},
{ORTH: "Ecz.", NORM: "eczanesi"},
{ORTH: "ecz.", NORM: "eczanesi"},
{ORTH: "ekon.", NORM: "ekonomi"},
{ORTH: "Fak.", NORM: "Fakültesi"},
{ORTH: "Gn.", NORM: "Genel"},
{ORTH: "Fak.", NORM: "fakültesi"},
{ORTH: "Gn.", NORM: "genel"},
{ORTH: "Gnkur.", NORM: "Genelkurmay"},
{ORTH: "Gn.Kur.", NORM: "Genelkurmay"},
{ORTH: "gr.", NORM: "gram"},
{ORTH: "Hst.", NORM: "Hastanesi"},
{ORTH: "Hs.Uzm.", NORM: "Hesap Uzmanı"},
{ORTH: "Hst.", NORM: "hastanesi"},
{ORTH: "hst.", NORM: "hastanesi"},
{ORTH: "Hs.Uzm."},
{ORTH: "huk.", NORM: "hukuk"},
{ORTH: "Hv.", NORM: "Hava"},
{ORTH: "Hv.K.K.lığı", NORM: "Hava Kuvvetleri Komutanlığı"},
{ORTH: "Hv.Kuv.", NORM: "Hava Kuvvetleri"},
{ORTH: "Hv.Kuv.K.", NORM: "Hava Kuvvetleri Komutanlığı"},
{ORTH: "Hz.", NORM: "Hazreti"},
{ORTH: "Hz.Öz.", NORM: "Hizmete Özel"},
{ORTH: "İng.", NORM: "İngilizce"},
{ORTH: "Jeol.", NORM: "Jeoloji"},
{ORTH: "Hv.", NORM: "hava"},
{ORTH: "Hv.K.K.lığı"},
{ORTH: "Hv.Kuv."},
{ORTH: "Hv.Kuv.K."},
{ORTH: "Hz.", NORM: "hazreti"},
{ORTH: "Hz.Öz."},
{ORTH: "İng.", NORM: "ingilizce"},
{ORTH: "İst.", NORM: "İstanbul"},
{ORTH: "Jeol.", NORM: "jeoloji"},
{ORTH: "jeol.", NORM: "jeoloji"},
{ORTH: "Korg.", NORM: "Korgeneral"},
{ORTH: "Kur.", NORM: "Kurmay"},
{ORTH: "Kur.Bşk.", NORM: "Kurmay Başkanı"},
{ORTH: "Kuv.", NORM: "Kuvvetleri"},
{ORTH: "Ltd.", NORM: "Limited"},
{ORTH: "Mah.", NORM: "Mahallesi"},
{ORTH: "Korg.", NORM: "korgeneral"},
{ORTH: "Kur.", NORM: "kurmay"},
{ORTH: "Kur.Bşk."},
{ORTH: "Kuv.", NORM: "kuvvetleri"},
{ORTH: "Ltd.", NORM: "limited"},
{ORTH: "ltd.", NORM: "limited"},
{ORTH: "Mah.", NORM: "mahallesi"},
{ORTH: "mah.", NORM: "mahallesi"},
{ORTH: "max.", NORM: "maksimum"},
{ORTH: "min.", NORM: "minimum"},
{ORTH: "Müh.", NORM: "Mühendisliği"},
{ORTH: "Müh.", NORM: "mühendisliği"},
{ORTH: "müh.", NORM: "mühendisliği"},
{ORTH: "MÖ.", NORM: "Milattan Önce"},
{ORTH: "Onb.", NORM: "Onbaşı"},
{ORTH: "Ord.", NORM: "Ordinaryüs"},
{ORTH: "Org.", NORM: "Orgeneral"},
{ORTH: "Ped.", NORM: "Pedagoji"},
{ORTH: "Prof.", NORM: "Profesör"},
{ORTH: "Sb.", NORM: "Subay"},
{ORTH: "Sn.", NORM: "Sayın"},
{ORTH: "M.Ö."},
{ORTH: "M.S."},
{ORTH: "Onb.", NORM: "onbaşı"},
{ORTH: "Ord.", NORM: "ordinaryüs"},
{ORTH: "Org.", NORM: "orgeneral"},
{ORTH: "Ped.", NORM: "pedagoji"},
{ORTH: "Prof.", NORM: "profesör"},
{ORTH: "prof.", NORM: "profesör"},
{ORTH: "Sb.", NORM: "subay"},
{ORTH: "Sn.", NORM: "sayın"},
{ORTH: "sn.", NORM: "saniye"},
{ORTH: "Sok.", NORM: "Sokak"},
{ORTH: "Şb.", NORM: "Şube"},
{ORTH: "Şti.", NORM: "Şirketi"},
{ORTH: "Tbp.", NORM: "Tabip"},
{ORTH: "T.C.", NORM: "Türkiye Cumhuriyeti"},
{ORTH: "Tel.", NORM: "Telefon"},
{ORTH: "Sok.", NORM: "sokak"},
{ORTH: "sok.", NORM: "sokak"},
{ORTH: "Şb.", NORM: "şube"},
{ORTH: "şb.", NORM: "şube"},
{ORTH: "Şti.", NORM: "şirketi"},
{ORTH: "şti.", NORM: "şirketi"},
{ORTH: "Tbp.", NORM: "tabip"},
{ORTH: "tbp.", NORM: "tabip"},
{ORTH: "T.C."},
{ORTH: "Tel.", NORM: "telefon"},
{ORTH: "tel.", NORM: "telefon"},
{ORTH: "telg.", NORM: "telgraf"},
{ORTH: "Tğm.", NORM: "Teğmen"},
{ORTH: "Tğm.", NORM: "teğmen"},
{ORTH: "tğm.", NORM: "teğmen"},
{ORTH: "tic.", NORM: "ticaret"},
{ORTH: "Tug.", NORM: "Tugay"},
{ORTH: "Tuğg.", NORM: "Tuğgeneral"},
{ORTH: "Tümg.", NORM: "Tümgeneral"},
{ORTH: "Uzm.", NORM: "Uzman"},
{ORTH: "Üçvş.", NORM: "Üstçavuş"},
{ORTH: "Üni.", NORM: "Üniversitesi"},
{ORTH: "Ütğm.", NORM: "Üsteğmen"},
{ORTH: "vb.", NORM: "ve benzeri"},
{ORTH: "Tug.", NORM: "tugay"},
{ORTH: "Tuğg.", NORM: "tuğgeneral"},
{ORTH: "Tümg.", NORM: "tümgeneral"},
{ORTH: "Uzm.", NORM: "uzman"},
{ORTH: "Üçvş.", NORM: "üstçavuş"},
{ORTH: "Üni.", NORM: "üniversitesi"},
{ORTH: "Ütğm.", NORM: "üsteğmen"},
{ORTH: "vb."},
{ORTH: "vs.", NORM: "vesaire"},
{ORTH: "Yard.", NORM: "Yardımcı"},
{ORTH: "Yar.", NORM: "Yardımcı"},
{ORTH: "Yd.Sb.", NORM: "Yedek Subay"},
{ORTH: "Yard.Doç.", NORM: "Yardımcı Doçent"},
{ORTH: "Yar.Doç.", NORM: "Yardımcı Doçent"},
{ORTH: "Yb.", NORM: "Yarbay"},
{ORTH: "Yrd.", NORM: "Yardımcı"},
{ORTH: "Yrd.Doç.", NORM: "Yardımcı Doçent"},
{ORTH: "Y.Müh.", NORM: "Yüksek mühendis"},
{ORTH: "Y.Mim.", NORM: "Yüksek mimar"},
]:
_exc[exc_data[ORTH]] = [exc_data]
{ORTH: "Yard.", NORM: "yardımcı"},
{ORTH: "Yar.", NORM: "yardımcı"},
{ORTH: "Yd.Sb."},
{ORTH: "Yard.Doç."},
{ORTH: "Yar.Doç."},
{ORTH: "Yb.", NORM: "yarbay"},
{ORTH: "Yrd.", NORM: "yardımcı"},
{ORTH: "Yrd.Doç."},
{ORTH: "Y.Müh."},
{ORTH: "Y.Mim."},
{ORTH: "yy.", NORM: "yüzyıl"},
]
for abbr in _abbr_period_exc:
_exc[abbr[ORTH]] = [abbr]
_abbr_exc = [
{ORTH: "AB", NORM: "Avrupa Birliği"},
{ORTH: "ABD", NORM: "Amerika"},
{ORTH: "ABS", NORM: "fren"},
{ORTH: "AOÇ"},
{ORTH: "ASKİ"},
{ORTH: "Bağ-kur", NORM: "Bağkur"},
{ORTH: "BDDK"},
{ORTH: "BJK", NORM: "Beşiktaş"},
{ORTH: "ESA", NORM: "Avrupa uzay ajansı"},
{ORTH: "FB", NORM: "Fenerbahçe"},
{ORTH: "GATA"},
{ORTH: "GS", NORM: "Galatasaray"},
{ORTH: "İSKİ"},
{ORTH: "KBB"},
{ORTH: "RTÜK", NORM: "radyo ve televizyon üst kurulu"},
{ORTH: "TBMM"},
{ORTH: "TC"},
{ORTH: "TÜİK", NORM: "Türkiye istatistik kurumu"},
{ORTH: "YÖK"},
]
for abbr in _abbr_exc:
_exc[abbr[ORTH]] = [abbr]
for orth in ["Dr.", "yy."]:
_exc[orth] = [{ORTH: orth}]
_num = r"[+-]?\d+([,.]\d+)*"
_ord_num = r"(\d+\.)"
_date = r"(((\d{1,2}[./-]){2})?(\d{4})|(\d{1,2}[./]\d{1,2}(\.)?))"
_dash_num = r"(([{al}\d]+/\d+)|(\d+/[{al}]))".format(al=ALPHA)
_roman_num = "M{0,3}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})"
_roman_ord = r"({rn})\.".format(rn=_roman_num)
_time_exp = r"\d+(:\d+)*"
TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
_inflections = r"'[{al}]+".format(al=ALPHA_LOWER)
_abbrev_inflected = r"[{a}]+\.'[{al}]+".format(a=ALPHA, al=ALPHA_LOWER)
_nums = r"(({d})|({dn})|({te})|({on})|({n})|({ro})|({rn}))({inf})?".format(d=_date, dn=_dash_num, te=_time_exp, on=_ord_num, n=_num, ro=_roman_ord, rn=_roman_num, inf=_inflections)
TOKENIZER_EXCEPTIONS = _exc
TOKEN_MATCH = re.compile(r"^({abbr})|({n})$".format(n=_nums, abbr=_abbrev_inflected)).match
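
The new `TOKEN_MATCH` can be tried out in isolation. The sketch below copies the patterns from the hunk above, but replaces spaCy's `ALPHA` and `ALPHA_LOWER` character classes with small hand-written stand-ins (an assumption made so the example runs without spaCy). Note that in Python's `re` syntax `^A|B$` parses as `(^A)|(B$)`, so the end anchor applies only to the numeric alternative.

```python
import re

# Simplified stand-ins for spaCy's ALPHA / ALPHA_LOWER (assumption).
ALPHA_LOWER = "a-zçğıöşü"
ALPHA = "a-zA-ZçğıöşüÇĞİÖŞÜ"

# Patterns copied from the diff.
_num = r"[+-]?\d+([,.]\d+)*"
_ord_num = r"(\d+\.)"
_date = r"(((\d{1,2}[./-]){2})?(\d{4})|(\d{1,2}[./]\d{1,2}(\.)?))"
_dash_num = r"(([{al}\d]+/\d+)|(\d+/[{al}]))".format(al=ALPHA)
_roman_num = "M{0,3}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})"
_roman_ord = r"({rn})\.".format(rn=_roman_num)
_time_exp = r"\d+(:\d+)*"
_inflections = r"'[{al}]+".format(al=ALPHA_LOWER)
_abbrev_inflected = r"[{a}]+\.'[{al}]+".format(a=ALPHA, al=ALPHA_LOWER)
_nums = r"(({d})|({dn})|({te})|({on})|({n})|({ro})|({rn}))({inf})?".format(
    d=_date, dn=_dash_num, te=_time_exp, on=_ord_num,
    n=_num, ro=_roman_ord, rn=_roman_num, inf=_inflections,
)

token_match = re.compile(
    r"^({abbr})|({n})$".format(n=_nums, abbr=_abbrev_inflected)
).match

# Inflected abbreviation, inflected number, clock time, and a plain word.
results = {s: token_match(s) is not None
           for s in ["Prof.'un", "3'te", "12:30", "merhaba"]}
```

Strings accepted by `token_match` are kept as one token, so inflected forms such as `Prof.'un` are not split at the period or the apostrophe.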

View File

@ -17,7 +17,7 @@ from ... import util
# fmt: off
_PKUSEG_INSTALL_MSG = "install spacy-pkuseg with `pip install spacy-pkuseg==0.0.26`"
_PKUSEG_INSTALL_MSG = "install spacy-pkuseg with `pip install \"spacy-pkuseg>=0.0.27,<0.1.0\"` or `conda install -c conda-forge \"spacy-pkuseg>=0.0.27,<0.1.0\"`"
# fmt: on
DEFAULT_CONFIG = """

View File

@ -121,6 +121,7 @@ class Language:
max_length: int = 10 ** 6,
meta: Dict[str, Any] = {},
create_tokenizer: Optional[Callable[["Language"], Callable[[str], Doc]]] = None,
batch_size: int = 1000,
**kwargs,
) -> None:
"""Initialise a Language object.
@ -138,6 +139,7 @@ class Language:
100,000 characters in one text.
create_tokenizer (Callable): Function that takes the nlp object and
returns a tokenizer.
batch_size (int): Default batch size for pipe and evaluate.
DOCS: https://nightly.spacy.io/api/language#init
"""
@ -173,6 +175,7 @@ class Language:
tokenizer_cfg = {"tokenizer": self._config["nlp"]["tokenizer"]}
create_tokenizer = registry.resolve(tokenizer_cfg)["tokenizer"]
self.tokenizer = create_tokenizer(self)
self.batch_size = batch_size
def __init_subclass__(cls, **kwargs):
super().__init_subclass__(**kwargs)
@ -968,10 +971,6 @@ class Language:
DOCS: https://nightly.spacy.io/api/language#call
"""
if len(text) > self.max_length:
raise ValueError(
Errors.E088.format(length=len(text), max_length=self.max_length)
)
doc = self.make_doc(text)
if component_cfg is None:
component_cfg = {}
@ -1045,6 +1044,11 @@ class Language:
text (str): The text to process.
RETURNS (Doc): The processed doc.
"""
if len(text) > self.max_length:
raise ValueError(
Errors.E088.format(length=len(text), max_length=self.max_length)
)
return self.tokenizer(text)
return self.tokenizer(text)
def update(
@ -1267,7 +1271,7 @@ class Language:
self,
examples: Iterable[Example],
*,
batch_size: int = 256,
batch_size: Optional[int] = None,
scorer: Optional[Scorer] = None,
component_cfg: Optional[Dict[str, Dict[str, Any]]] = None,
scorer_cfg: Optional[Dict[str, Any]] = None,
@ -1275,7 +1279,7 @@ class Language:
"""Evaluate a model's pipeline components.
examples (Iterable[Example]): `Example` objects.
batch_size (int): Batch size to use.
batch_size (Optional[int]): Batch size to use.
scorer (Optional[Scorer]): Scorer to use. If not passed in, a new one
will be created.
component_cfg (dict): An optional dictionary with extra keyword
@ -1287,6 +1291,8 @@ class Language:
DOCS: https://nightly.spacy.io/api/language#evaluate
"""
validate_examples(examples, "Language.evaluate")
if batch_size is None:
batch_size = self.batch_size
if component_cfg is None:
component_cfg = {}
if scorer_cfg is None:
@ -1365,7 +1371,7 @@ class Language:
texts: Iterable[str],
*,
as_tuples: bool = False,
batch_size: int = 1000,
batch_size: Optional[int] = None,
disable: Iterable[str] = SimpleFrozenList(),
component_cfg: Optional[Dict[str, Dict[str, Any]]] = None,
n_process: int = 1,
@ -1376,7 +1382,7 @@ class Language:
as_tuples (bool): If set to True, inputs should be a sequence of
(text, context) tuples. Output will then be a sequence of
(doc, context) tuples. Defaults to False.
batch_size (int): The number of texts to buffer.
batch_size (Optional[int]): The number of texts to buffer.
disable (List[str]): Names of the pipeline components to disable.
component_cfg (Dict[str, Dict]): An optional dictionary with extra keyword
arguments for specific components.
@ -1403,6 +1409,8 @@ class Language:
return
if component_cfg is None:
component_cfg = {}
if batch_size is None:
batch_size = self.batch_size
pipes = (
[]
@ -1617,6 +1625,7 @@ class Language:
nlp.add_pipe(source_name, source=source_nlps[model], name=pipe_name)
disabled_pipes = [*config["nlp"]["disabled"], *disable]
nlp._disabled = set(p for p in disabled_pipes if p not in exclude)
nlp.batch_size = config["nlp"]["batch_size"]
nlp.config = filled if auto_fill else config
if after_pipeline_creation is not None:
nlp = after_pipeline_creation(nlp)
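
The `batch_size` changes in this file follow one pattern: the object stores a default, and `pipe` and `evaluate` fall back to it when the caller passes `None`. A minimal sketch of that pattern, using a hypothetical `MiniPipeline` class that is not part of spaCy:

```python
from typing import Iterable, Iterator, List, Optional


class MiniPipeline:
    """Hypothetical stand-in for Language, reduced to the batching logic."""

    def __init__(self, batch_size: int = 1000) -> None:
        # Object-level default, as set in Language.__init__ in the diff.
        self.batch_size = batch_size

    def pipe(
        self, texts: Iterable[str], *, batch_size: Optional[int] = None
    ) -> Iterator[List[str]]:
        # Same fallback as in the diff: None means "use the stored default".
        if batch_size is None:
            batch_size = self.batch_size
        batch: List[str] = []
        for text in texts:
            batch.append(text)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch


nlp = MiniPipeline(batch_size=2)
default_sizes = [len(b) for b in nlp.pipe("abcde")]
override_sizes = [len(b) for b in nlp.pipe("abcde", batch_size=4)]
```

Using `None` as the signature default, rather than a literal number, is what lets a value loaded from the config (`config["nlp"]["batch_size"]`) take effect without every call site repeating it.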

View File

@ -26,6 +26,7 @@ cdef enum quantifier_t:
ZERO_PLUS
ONE
ONE_PLUS
FINAL_ID
cdef struct AttrValueC:

View File

@ -2,7 +2,7 @@
from typing import List
from libcpp.vector cimport vector
from libc.stdint cimport int32_t
from libc.stdint cimport int32_t, int8_t
from libc.string cimport memset, memcmp
from cymem.cymem cimport Pool
from murmurhash.mrmr cimport hash64
@ -308,7 +308,7 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
# avoid any processing or mem alloc if the document is empty
return output
if len(predicates) > 0:
predicate_cache = <char*>mem.alloc(length * len(predicates), sizeof(char))
predicate_cache = <int8_t*>mem.alloc(length * len(predicates), sizeof(int8_t))
if extensions is not None and len(extensions) >= 1:
nr_extra_attr = max(extensions.values()) + 1
extra_attr_values = <attr_t*>mem.alloc(length * nr_extra_attr, sizeof(attr_t))
@ -349,7 +349,7 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& matches,
char* cached_py_predicates,
int8_t* cached_py_predicates,
Token token, const attr_t* extra_attrs, py_predicates) except *:
cdef int q = 0
cdef vector[PatternStateC] new_states
@ -421,7 +421,7 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match
states.push_back(new_states[i])
cdef int update_predicate_cache(char* cache,
cdef int update_predicate_cache(int8_t* cache,
const TokenPatternC* pattern, Token token, predicates) except -1:
# If the state references any extra predicates, check whether they match.
# These are cached, so that we don't call these potentially expensive
@ -459,7 +459,7 @@ cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states)
cdef action_t get_action(PatternStateC state,
const TokenC* token, const attr_t* extra_attrs,
const char* predicate_matches) nogil:
const int8_t* predicate_matches) nogil:
"""We need to consider:
a) Does the token match the specification? [Yes, No]
b) What's the quantifier? [1, 0+, ?]
@ -517,7 +517,7 @@ cdef action_t get_action(PatternStateC state,
Problem: If a quantifier is matching, we're adding a lot of open partials
"""
cdef char is_match
cdef int8_t is_match
is_match = get_is_match(state, token, extra_attrs, predicate_matches)
quantifier = get_quantifier(state)
is_final = get_is_final(state)
@ -569,9 +569,9 @@ cdef action_t get_action(PatternStateC state,
return RETRY
cdef char get_is_match(PatternStateC state,
cdef int8_t get_is_match(PatternStateC state,
const TokenC* token, const attr_t* extra_attrs,
const char* predicate_matches) nogil:
const int8_t* predicate_matches) nogil:
for i in range(state.pattern.nr_py):
if predicate_matches[state.pattern.py_predicates[i]] == -1:
return 0
@ -586,8 +586,8 @@ cdef char get_is_match(PatternStateC state,
return True
cdef char get_is_final(PatternStateC state) nogil:
if state.pattern[1].nr_attr == 0 and state.pattern[1].attrs != NULL:
cdef int8_t get_is_final(PatternStateC state) nogil:
if state.pattern[1].quantifier == FINAL_ID:
id_attr = state.pattern[1].attrs[0]
if id_attr.attr != ID:
with gil:
@ -597,7 +597,7 @@ cdef char get_is_final(PatternStateC state) nogil:
return 0
cdef char get_quantifier(PatternStateC state) nogil:
cdef int8_t get_quantifier(PatternStateC state) nogil:
return state.pattern.quantifier
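
A plausible reason for replacing `char` with `int8_t` (the diff itself does not state one) is that the signedness of plain `char` is implementation-defined in C, while `get_is_match` compares cached values against `-1`. The sketch below uses `ctypes` to show what happens to `-1` when it is stored in a signed versus an unsigned 8-bit slot.

```python
import ctypes

# Sentinel compared against in get_is_match (predicate_matches[...] == -1).
MISS = -1

# Stored in a signed 8-bit integer, the sentinel survives a round trip.
as_signed = ctypes.c_int8(MISS).value

# Stored in an unsigned 8-bit integer, it wraps around, so a later
# comparison with -1 would never be true.
as_unsigned = ctypes.c_uint8(MISS).value
```

On platforms where `char` is unsigned, the second case is what a `char*` cache would hold, which is why an explicitly signed type is the safer choice for a cache that needs a negative marker.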
@ -626,36 +626,20 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
pattern[i].nr_py = len(predicates)
pattern[i].key = hash64(pattern[i].attrs, pattern[i].nr_attr * sizeof(AttrValueC), 0)
i = len(token_specs)
# Even though here, nr_attr == 0, we're storing the ID value in attrs[0] (bug-prone, thread carefully!)
pattern[i].attrs = <AttrValueC*>mem.alloc(2, sizeof(AttrValueC))
# Use quantifier to identify final ID pattern node (rather than previous
# uninitialized quantifier == 0/ZERO + nr_attr == 0 + non-zero-length attrs)
pattern[i].quantifier = FINAL_ID
pattern[i].attrs = <AttrValueC*>mem.alloc(1, sizeof(AttrValueC))
pattern[i].attrs[0].attr = ID
pattern[i].attrs[0].value = entity_id
pattern[i].nr_attr = 0
pattern[i].nr_attr = 1
pattern[i].nr_extra_attr = 0
pattern[i].nr_py = 0
return pattern
cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
# There have been a few bugs here. We used to have two functions,
# get_ent_id and get_pattern_key that tried to do the same thing. These
# are now unified to try to solve the "ghost match" problem.
# Below is the previous implementation of get_ent_id and the comment on it,
# preserved for reference while we figure out whether the heisenbug in the
# matcher is resolved.
#
#
# cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
# # The code was originally designed to always have pattern[1].attrs.value
# # be the ent_id when we get to the end of a pattern. However, Issue #2671
# # showed this wasn't the case when we had a reject-and-continue before a
# # match.
# # The patch to #2671 was wrong though, which came up in #3839.
# while pattern.attrs.attr != ID:
# pattern += 1
# return pattern.attrs.value
while pattern.nr_attr != 0 or pattern.nr_extra_attr != 0 or pattern.nr_py != 0 \
or pattern.quantifier != ZERO:
while pattern.quantifier != FINAL_ID:
pattern += 1
id_attr = pattern[0].attrs[0]
if id_attr.attr != ID:
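
The `FINAL_ID` change can be pictured as a sentinel-terminated array: every pattern ends with one extra node whose quantifier marks it as the holder of the match ID, and `get_ent_id` scans forward until it reaches that node. The Python sketch below is a simplified model with hypothetical names (`PatternNode`, `init_pattern`); the enum values are assumed for illustration and only their distinctness matters.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class Quantifier(IntEnum):
    # Values assumed for illustration; FINAL_ID is the new member.
    ZERO = 0
    ZERO_ONE = 1
    ZERO_PLUS = 2
    ONE = 3
    ONE_PLUS = 4
    FINAL_ID = 5


@dataclass
class PatternNode:
    quantifier: Quantifier
    ent_id: Optional[int] = None  # only set on the final node


def init_pattern(ent_id: int, quantifiers: List[Quantifier]) -> List[PatternNode]:
    nodes = [PatternNode(q) for q in quantifiers]
    # Terminating node: identified by its quantifier alone.
    nodes.append(PatternNode(Quantifier.FINAL_ID, ent_id))
    return nodes


def get_ent_id(nodes: List[PatternNode], start: int = 0) -> Optional[int]:
    i = start
    while nodes[i].quantifier != Quantifier.FINAL_ID:
        i += 1
    return nodes[i].ent_id


pattern = init_pattern(42, [Quantifier.ONE, Quantifier.ZERO, Quantifier.ZERO_PLUS])
found = get_ent_id(pattern, start=1)
```

The pattern above deliberately contains a `ZERO` node: with an explicit sentinel value, the scan cannot mistake a regular node for the terminator, which the previous combination of `quantifier == ZERO` and zero counts could not guarantee.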

View File

@ -23,10 +23,7 @@ def forward(model: Model, docs, is_train: bool):
keys, vals = model.ops.xp.unique(keys, return_counts=True)
batch_keys.append(keys)
batch_vals.append(vals)
# The dtype here matches what thinc is expecting -- which differs per
# platform (by int definition). This should be fixed once the problem
# is fixed on Thinc's side.
lengths = model.ops.asarray([arr.shape[0] for arr in batch_keys], dtype=numpy.int_)
lengths = model.ops.asarray([arr.shape[0] for arr in batch_keys], dtype="int32")
batch_keys = model.ops.xp.concatenate(batch_keys)
batch_vals = model.ops.asarray(model.ops.xp.concatenate(batch_vals), dtype="f")

View File

@ -1,4 +1,5 @@
from .entity_linker import * # noqa
from .multi_task import * # noqa
from .parser import * # noqa
from .tagger import * # noqa
from .textcat import * # noqa

View File

@ -1,7 +1,14 @@
from typing import Optional, Iterable, Tuple, List, TYPE_CHECKING
import numpy
from typing import Optional, Iterable, Tuple, List, Callable, TYPE_CHECKING
from thinc.api import chain, Maxout, LayerNorm, Softmax, Linear, zero_init, Model
from thinc.api import MultiSoftmax, list2array
from thinc.api import to_categorical, CosineDistance, L2Distance
from ...util import registry
from ...errors import Errors
from ...attrs import ID
import numpy
from functools import partial
if TYPE_CHECKING:
# This lets us add type hints for mypy etc. without causing circular imports
@ -9,6 +16,74 @@ if TYPE_CHECKING:
from ...tokens import Doc # noqa: F401
@registry.architectures.register("spacy.PretrainVectors.v1")
def create_pretrain_vectors(
maxout_pieces: int, hidden_size: int, loss: str
) -> Callable[["Vocab", Model], Model]:
def create_vectors_objective(vocab: "Vocab", tok2vec: Model) -> Model:
model = build_cloze_multi_task_model(
vocab, tok2vec, hidden_size=hidden_size, maxout_pieces=maxout_pieces
)
model.attrs["loss"] = create_vectors_loss()
return model
def create_vectors_loss() -> Callable:
if loss == "cosine":
distance = CosineDistance(normalize=True, ignore_zeros=True)
return partial(get_vectors_loss, distance=distance)
elif loss == "L2":
distance = L2Distance(normalize=True)
return partial(get_vectors_loss, distance=distance)
else:
raise ValueError(Errors.E906.format(found=loss, supported="'cosine', 'L2'"))
return create_vectors_objective
@registry.architectures.register("spacy.PretrainCharacters.v1")
def create_pretrain_characters(
maxout_pieces: int, hidden_size: int, n_characters: int
) -> Callable[["Vocab", Model], Model]:
def create_characters_objective(vocab: "Vocab", tok2vec: Model) -> Model:
model = build_cloze_characters_multi_task_model(
vocab,
tok2vec,
hidden_size=hidden_size,
maxout_pieces=maxout_pieces,
nr_char=n_characters,
)
model.attrs["loss"] = partial(get_characters_loss, nr_char=n_characters)
return model
return create_characters_objective
def get_vectors_loss(ops, docs, prediction, distance):
"""Compute a loss based on a distance between the documents' vectors and
the prediction.
"""
# The simplest way to implement this would be to vstack the
# token.vector values, but that's a bit inefficient, especially on GPU.
# Instead we fetch the index into the vectors table for each of our tokens,
# and look them up all at once. This prevents data copying.
ids = ops.flatten([doc.to_array(ID).ravel() for doc in docs])
target = docs[0].vocab.vectors.data[ids]
d_target, loss = distance(prediction, target)
return loss, d_target
def get_characters_loss(ops, docs, prediction, nr_char):
"""Compute a loss based on a number of characters predicted from the docs."""
target_ids = numpy.vstack([doc.to_utf8_array(nr_char=nr_char) for doc in docs])
target_ids = target_ids.reshape((-1,))
target = ops.asarray(to_categorical(target_ids, n_classes=256), dtype="f")
target = target.reshape((-1, 256 * nr_char))
diff = prediction - target
loss = (diff ** 2).sum()
d_target = diff / float(prediction.shape[0])
return loss, d_target
def build_multi_task_model(
tok2vec: Model,
maxout_pieces: int,
@ -33,23 +108,19 @@ def build_multi_task_model(
def build_cloze_multi_task_model(
vocab: "Vocab",
tok2vec: Model,
maxout_pieces: int,
hidden_size: int,
nO: Optional[int] = None,
vocab: "Vocab", tok2vec: Model, maxout_pieces: int, hidden_size: int
) -> Model:
# nO = vocab.vectors.data.shape[1]
nO = vocab.vectors.data.shape[1]
output_layer = chain(
list2array(),
Maxout(
nO=nO,
nO=hidden_size,
nI=tok2vec.get_dim("nO"),
nP=maxout_pieces,
normalize=True,
dropout=0.0,
),
Linear(nO=nO, nI=nO, init_W=zero_init),
Linear(nO=nO, nI=hidden_size, init_W=zero_init),
)
model = chain(tok2vec, output_layer)
model = build_masked_language_model(vocab, model)
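
To make the arithmetic in `get_characters_loss` concrete, here is a NumPy-only sketch of the same computation. It swaps thinc's `to_categorical` for a small hand-written `one_hot` helper (an assumption, so the example has no thinc dependency) and takes the byte IDs directly instead of calling `doc.to_utf8_array`.

```python
import numpy


def one_hot(ids: numpy.ndarray, n_classes: int) -> numpy.ndarray:
    # Hypothetical helper standing in for thinc's to_categorical.
    out = numpy.zeros((ids.shape[0], n_classes), dtype="f")
    out[numpy.arange(ids.shape[0]), ids] = 1.0
    return out


def characters_loss(prediction: numpy.ndarray, target_ids: numpy.ndarray, nr_char: int):
    # One 256-way one-hot vector per predicted byte, concatenated per token.
    target = one_hot(target_ids.reshape((-1,)), 256)
    target = target.reshape((-1, 256 * nr_char))
    diff = prediction - target
    loss = float((diff ** 2).sum())
    d_target = diff / float(prediction.shape[0])
    return loss, d_target


# One token, two bytes per token ("h" = 104, "i" = 105).
nr_char = 2
target_ids = numpy.array([[104, 105]])
all_zero = numpy.zeros((1, 256 * nr_char), dtype="f")
loss_zero, grad_zero = characters_loss(all_zero, target_ids, nr_char)

# A prediction equal to the one-hot target has zero loss.
perfect = one_hot(target_ids.reshape((-1,)), 256).reshape((1, -1))
loss_perfect, _ = characters_loss(perfect, target_ids, nr_char)
```

With an all-zero prediction, the only non-zero squared differences are the two target positions, so the loss equals the number of bytes being predicted.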

View File

@ -61,14 +61,14 @@ def build_bow_text_classifier(
@registry.architectures.register("spacy.TextCatEnsemble.v2")
def build_text_classifier(
def build_text_classifier_v2(
tok2vec: Model[List[Doc], List[Floats2d]],
linear_model: Model[List[Doc], Floats2d],
nO: Optional[int] = None,
) -> Model[List[Doc], Floats2d]:
exclusive_classes = not linear_model.attrs["multi_label"]
with Model.define_operators({">>": chain, "|": concatenate}):
width = tok2vec.get_dim("nO")
width = tok2vec.maybe_get_dim("nO")
cnn_model = (
tok2vec
>> list2ragged()
@ -94,7 +94,7 @@ def build_text_classifier(
# TODO: move to legacy
@registry.architectures.register("spacy.TextCatEnsemble.v1")
def build_text_classifier(
def build_text_classifier_v1(
width: int,
embed_size: int,
pretrained_vectors: Optional[bool],

View File

@ -42,9 +42,13 @@ def forward(
rows = model.ops.flatten(
[doc.vocab.vectors.find(keys=doc.to_array(key_attr)) for doc in docs]
)
try:
vectors_data = model.ops.gemm(model.ops.as_contig(V[rows]), W, trans2=True)
except ValueError:
raise RuntimeError(Errors.E896)
output = Ragged(
model.ops.gemm(model.ops.as_contig(V[rows]), W, trans2=True),
model.ops.asarray([len(doc) for doc in docs], dtype="i"),
vectors_data,
model.ops.asarray([len(doc) for doc in docs], dtype="i")
)
mask = None
if is_train:
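
The `try`/`except` added above turns a low-level shape error into a message about the vectors table. A minimal NumPy sketch of the same idea, with a hypothetical function name and a placeholder error message (the real text lives in `Errors.E896` and is not reproduced here):

```python
import numpy


def project_vectors(V: numpy.ndarray, rows: numpy.ndarray, W: numpy.ndarray) -> numpy.ndarray:
    """Look up rows of the vectors table V and project them with W."""
    try:
        # Equivalent in spirit to gemm(V[rows], W, trans2=True).
        return V[rows] @ W.T
    except ValueError:
        # Placeholder message (assumption), not the text of E896.
        raise RuntimeError(
            "Static vectors have an unexpected shape; check that they are initialized."
        ) from None


W = numpy.zeros((4, 3), dtype="f")       # projects 3-dim vectors to 4 dims
rows = numpy.array([0, 2])
good_V = numpy.ones((10, 3), dtype="f")  # width matches W
bad_V = numpy.ones((10, 5), dtype="f")   # width does not match W

projected = project_vectors(good_V, rows, W)
```

Calling `project_vectors(bad_V, rows, W)` raises `RuntimeError` rather than the bare `ValueError` from the matrix product, which points the user at the vectors rather than at the linear algebra.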

View File

@ -0,0 +1,6 @@
from ...typedefs cimport class_t, hash_t
# These are passed as callbacks to thinc.search.Beam
cdef int transition_state(void* _dest, void* _src, class_t clas, void* _moves) except -1
cdef int check_final_state(void* _state, void* extra_args) except -1

View File

@ -0,0 +1,296 @@
# cython: infer_types=True
# cython: profile=True
cimport numpy as np
import numpy
from cpython.ref cimport PyObject, Py_XDECREF
from thinc.extra.search cimport Beam
from thinc.extra.search import MaxViolation
from thinc.extra.search cimport MaxViolation
from ...typedefs cimport hash_t, class_t
from .transition_system cimport TransitionSystem, Transition
from ...errors import Errors
from .stateclass cimport StateC, StateClass
# These are passed as callbacks to thinc.search.Beam
cdef int transition_state(void* _dest, void* _src, class_t clas, void* _moves) except -1:
dest = <StateC*>_dest
src = <StateC*>_src
moves = <const Transition*>_moves
dest.clone(src)
moves[clas].do(dest, moves[clas].label)
cdef int check_final_state(void* _state, void* extra_args) except -1:
state = <StateC*>_state
return state.is_final()
cdef class BeamBatch(object):
cdef public TransitionSystem moves
cdef public object states
cdef public object docs
cdef public object golds
cdef public object beams
def __init__(self, TransitionSystem moves, states, golds,
int width, float density=0.):
cdef StateClass state
self.moves = moves
self.states = states
self.docs = [state.doc for state in states]
self.golds = golds
self.beams = []
cdef Beam beam
cdef StateC* st
for state in states:
beam = Beam(self.moves.n_moves, width, min_density=density)
beam.initialize(self.moves.init_beam_state,
self.moves.del_beam_state, state.c.length,
<void*>state.c._sent)
for i in range(beam.width):
st = <StateC*>beam.at(i)
st.offset = state.c.offset
beam.check_done(check_final_state, NULL)
self.beams.append(beam)
@property
def is_done(self):
return all(b.is_done for b in self.beams)
def __getitem__(self, i):
return self.beams[i]
def __len__(self):
return len(self.beams)
def get_states(self):
cdef Beam beam
cdef StateC* state
cdef StateClass stcls
states = []
for beam, doc in zip(self, self.docs):
for i in range(beam.size):
state = <StateC*>beam.at(i)
stcls = StateClass.borrow(state, doc)
states.append(stcls)
return states
def get_unfinished_states(self):
return [st for st in self.get_states() if not st.is_final()]
def advance(self, float[:, ::1] scores, follow_gold=False):
cdef Beam beam
cdef int nr_class = scores.shape[1]
cdef const float* c_scores = &scores[0, 0]
docs = self.docs
for i, beam in enumerate(self):
if not beam.is_done:
nr_state = self._set_scores(beam, c_scores, nr_class)
assert nr_state
if self.golds is not None:
self._set_costs(
beam,
docs[i],
self.golds[i],
follow_gold=follow_gold
)
c_scores += nr_state * nr_class
beam.advance(transition_state, NULL, <void*>self.moves.c)
beam.check_done(check_final_state, NULL)
cdef int _set_scores(self, Beam beam, const float* scores, int nr_class) except -1:
cdef int nr_state = 0
for i in range(beam.size):
state = <StateC*>beam.at(i)
if not state.is_final():
for j in range(nr_class):
beam.scores[i][j] = scores[nr_state * nr_class + j]
self.moves.set_valid(beam.is_valid[i], state)
nr_state += 1
else:
for j in range(beam.nr_class):
beam.scores[i][j] = 0
beam.costs[i][j] = 0
return nr_state
def _set_costs(self, Beam beam, doc, gold, int follow_gold=False):
cdef const StateC* state
for i in range(beam.size):
state = <const StateC*>beam.at(i)
if state.is_final():
for j in range(beam.nr_class):
beam.is_valid[i][j] = 0
beam.costs[i][j] = 9000
else:
self.moves.set_costs(beam.is_valid[i], beam.costs[i],
state, gold)
if follow_gold:
min_cost = 0
for j in range(beam.nr_class):
if beam.is_valid[i][j] and beam.costs[i][j] < min_cost:
min_cost = beam.costs[i][j]
for j in range(beam.nr_class):
if beam.costs[i][j] > min_cost:
beam.is_valid[i][j] = 0
def update_beam(TransitionSystem moves, states, golds, model, int width, beam_density=0.0):
cdef MaxViolation violn
pbeam = BeamBatch(moves, states, golds, width=width, density=beam_density)
gbeam = BeamBatch(moves, states, golds, width=width, density=0.0)
cdef StateClass state
beam_maps = []
backprops = []
violns = [MaxViolation() for _ in range(len(states))]
dones = [False for _ in states]
while not pbeam.is_done or not gbeam.is_done:
# The beam maps let us find the right row in the flattened scores
# array for each state. States are identified by (example id,
# history). We keep a different beam map for each step (since we'll
# have a flat scores array for each step). The beam map will let us
# take the per-state losses, and compute the gradient for each (step,
# state, class).
# Gather all states from the two beams in a list. Some states may occur
# in both beams. To figure out which beam each state belonged to,
# we keep two lists of indices, p_indices and g_indices.
states, p_indices, g_indices, beam_map = get_unique_states(pbeam, gbeam)
beam_maps.append(beam_map)
if not states:
break
# Now that we have our flat list of states, feed them through the model
scores, bp_scores = model.begin_update(states)
assert scores.size != 0
# Store the callbacks for the backward pass
backprops.append(bp_scores)
# Unpack the scores for the two beams. The indices arrays
# tell us which example and state the scores-row refers to.
# Now advance the states in the beams. The gold beam is constrained to
# follow only gold analyses.
if not pbeam.is_done:
pbeam.advance(model.ops.as_contig(scores[p_indices]))
if not gbeam.is_done:
gbeam.advance(model.ops.as_contig(scores[g_indices]), follow_gold=True)
# Track the "maximum violation", to use in the update.
for i, violn in enumerate(violns):
if not dones[i]:
violn.check_crf(pbeam[i], gbeam[i])
if pbeam[i].is_done and gbeam[i].is_done:
dones[i] = True
histories = []
grads = []
for violn in violns:
if violn.p_hist:
histories.append(violn.p_hist + violn.g_hist)
d_loss = [d_l * violn.cost for d_l in violn.p_probs + violn.g_probs]
grads.append(d_loss)
else:
histories.append([])
grads.append([])
loss = 0.0
states_d_scores = get_gradient(moves.n_moves, beam_maps, histories, grads)
for i, (d_scores, bp_scores) in enumerate(zip(states_d_scores, backprops)):
loss += (d_scores**2).mean()
bp_scores(d_scores)
return loss
def collect_states(beams, docs):
cdef StateClass state
cdef Beam beam
states = []
for state_or_beam, doc in zip(beams, docs):
if isinstance(state_or_beam, StateClass):
states.append(state_or_beam)
else:
beam = state_or_beam
state = StateClass.borrow(<StateC*>beam.at(0), doc)
states.append(state)
return states
def get_unique_states(pbeams, gbeams):
seen = {}
states = []
p_indices = []
g_indices = []
beam_map = {}
docs = pbeams.docs
cdef Beam pbeam, gbeam
if len(pbeams) != len(gbeams):
raise ValueError(Errors.E079.format(pbeams=len(pbeams), gbeams=len(gbeams)))
for eg_id, (pbeam, gbeam, doc) in enumerate(zip(pbeams, gbeams, docs)):
if not pbeam.is_done:
for i in range(pbeam.size):
state = StateClass.borrow(<StateC*>pbeam.at(i), doc)
if not state.is_final():
key = tuple([eg_id] + pbeam.histories[i])
if key in seen:
raise ValueError(Errors.E080.format(key=key))
seen[key] = len(states)
p_indices.append(len(states))
states.append(state)
beam_map.update(seen)
if not gbeam.is_done:
for i in range(gbeam.size):
state = StateClass.borrow(<StateC*>gbeam.at(i), doc)
if not state.is_final():
key = tuple([eg_id] + gbeam.histories[i])
if key in seen:
g_indices.append(seen[key])
else:
g_indices.append(len(states))
beam_map[key] = len(states)
states.append(state)
p_indices = numpy.asarray(p_indices, dtype='i')
g_indices = numpy.asarray(g_indices, dtype='i')
return states, p_indices, g_indices, beam_map
def get_gradient(nr_class, beam_maps, histories, losses):
"""The global model assigns a loss to each parse. The beam scores
are additive, so the same gradient is applied to each action
in the history. This gives the gradient of a single *action*
for a beam state -- so we have "the gradient of loss for taking
action i given history H."
Histories: Each history is a list of actions
Each candidate has a history
Each beam has multiple candidates
Each batch has multiple beams
So history is list of lists of lists of ints
"""
grads = []
nr_steps = []
for eg_id, hists in enumerate(histories):
nr_step = 0
for loss, hist in zip(losses[eg_id], hists):
assert not numpy.isnan(loss)
if loss != 0.0:
nr_step = max(nr_step, len(hist))
nr_steps.append(nr_step)
for i in range(max(nr_steps)):
grads.append(numpy.zeros((max(beam_maps[i].values())+1, nr_class),
dtype='f'))
if len(histories) != len(losses):
raise ValueError(Errors.E081.format(n_hist=len(histories), losses=len(losses)))
for eg_id, hists in enumerate(histories):
for loss, hist in zip(losses[eg_id], hists):
assert not numpy.isnan(loss)
if loss == 0.0:
continue
key = tuple([eg_id])
# Adjust loss for length
# We need to do this because each state in a short path is scored
# multiple times, as we add in the average cost when we run out
# of actions.
avg_loss = loss / len(hist)
loss += avg_loss * (nr_steps[eg_id] - len(hist))
for step, clas in enumerate(hist):
i = beam_maps[step][key]
# At this step, in state i, taking action clas
# resulted in this loss
grads[step][i, clas] += loss
key = key + tuple([clas])
return grads
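
The length adjustment and the key construction in `get_gradient` can be sketched in plain Python. This is a minimal illustration, not the library's code: the helper names `length_adjusted_loss` and `history_keys` are hypothetical. It assumes, as in the function above, that a short history is padded up to the longest history in its example by adding the average per-step loss for each missing step, and that the beam-map key at a given step is the example id followed by the actions taken before that step.

```python
def length_adjusted_loss(loss, hist_len, nr_step):
    """Scale a loss as if the history had nr_step actions.

    Mirrors `loss += (loss / len(hist)) * (nr_steps[eg_id] - len(hist))`.
    """
    avg_loss = loss / hist_len
    return loss + avg_loss * (nr_step - hist_len)


def history_keys(eg_id, hist):
    """Return the beam-map key used at each step of a history.

    The key at step k is (eg_id, a_0, ..., a_{k-1}): it identifies the
    state *before* action a_k is applied.
    """
    key = (eg_id,)
    keys = []
    for clas in hist:
        keys.append(key)
        key = key + (clas,)
    return keys


if __name__ == "__main__":
    # A history of 2 actions in an example whose longest history has 4.
    print(length_adjusted_loss(2.0, 2, 4))
    # Keys looked up for history [3, 1] in example 0.
    print(history_keys(0, [3, 1]))
```

Each key returned by `history_keys` is what `beam_maps[step][key]` would be indexed with to find the row of `grads[step]` that receives the loss.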

View File

@ -1,6 +1,9 @@
from libc.string cimport memcpy, memset
from libc.stdlib cimport calloc, free
from libc.stdint cimport uint32_t, uint64_t
cimport libcpp
from libcpp.vector cimport vector
from libcpp.set cimport set
from cpython.exc cimport PyErr_CheckSignals, PyErr_SetFromErrno
from murmurhash.mrmr cimport hash64
@ -14,89 +17,48 @@ from ...typedefs cimport attr_t
cdef inline bint is_space_token(const TokenC* token) nogil:
return Lexeme.c_check_flag(token.lex, IS_SPACE)
cdef struct RingBufferC:
int[8] data
int i
int default
cdef inline int ring_push(RingBufferC* ring, int value) nogil:
ring.data[ring.i] = value
ring.i += 1
if ring.i >= 8:
ring.i = 0
cdef inline int ring_get(RingBufferC* ring, int i) nogil:
if i >= ring.i:
return ring.default
else:
return ring.data[ring.i-i]
cdef struct ArcC:
int head
int child
attr_t label
cdef cppclass StateC:
int* _stack
int* _buffer
bint* shifted
TokenC* _sent
SpanC* _ents
int* _heads
const TokenC* _sent
vector[int] _stack
vector[int] _rebuffer
vector[SpanC] _ents
vector[ArcC] _left_arcs
vector[ArcC] _right_arcs
vector[libcpp.bool] _unshiftable
set[int] _sent_starts
TokenC _empty_token
RingBufferC _hist
int length
int offset
int _s_i
int _b_i
int _e_i
int _break
__init__(const TokenC* sent, int length) nogil:
cdef int PADDING = 5
this._buffer = <int*>calloc(length + (PADDING * 2), sizeof(int))
this._stack = <int*>calloc(length + (PADDING * 2), sizeof(int))
this.shifted = <bint*>calloc(length + (PADDING * 2), sizeof(bint))
this._sent = <TokenC*>calloc(length + (PADDING * 2), sizeof(TokenC))
this._ents = <SpanC*>calloc(length + (PADDING * 2), sizeof(SpanC))
if not (this._buffer and this._stack and this.shifted
and this._sent and this._ents):
this._sent = sent
this._heads = <int*>calloc(length, sizeof(int))
if not (this._sent and this._heads):
with gil:
PyErr_SetFromErrno(MemoryError)
PyErr_CheckSignals()
memset(&this._hist, 0, sizeof(this._hist))
this.offset = 0
cdef int i
for i in range(length + (PADDING * 2)):
this._ents[i].end = -1
this._sent[i].l_edge = i
this._sent[i].r_edge = i
for i in range(PADDING):
this._sent[i].lex = &EMPTY_LEXEME
this._sent += PADDING
this._ents += PADDING
this._buffer += PADDING
this._stack += PADDING
this.shifted += PADDING
this.length = length
this._break = -1
this._s_i = 0
this._b_i = 0
this._e_i = 0
for i in range(length):
this._buffer[i] = i
this._heads[i] = -1
this._unshiftable.push_back(0)
memset(&this._empty_token, 0, sizeof(TokenC))
this._empty_token.lex = &EMPTY_LEXEME
for i in range(length):
this._sent[i] = sent[i]
this._buffer[i] = i
for i in range(length, length+PADDING):
this._sent[i].lex = &EMPTY_LEXEME
__dealloc__():
cdef int PADDING = 5
free(this._sent - PADDING)
free(this._ents - PADDING)
free(this._buffer - PADDING)
free(this._stack - PADDING)
free(this.shifted - PADDING)
free(this._heads)
void set_context_tokens(int* ids, int n) nogil:
cdef int i, j
if n == 1:
if this.B(0) >= 0:
ids[0] = this.B(0)
@ -145,22 +107,18 @@ cdef cppclass StateC:
ids[11] = this.R(this.S(1), 1)
ids[12] = this.R(this.S(1), 2)
elif n == 6:
for i in range(6):
ids[i] = -1
if this.B(0) >= 0:
ids[0] = this.B(0)
ids[1] = this.B(0)-1
else:
ids[0] = -1
ids[1] = -1
ids[2] = this.B(1)
ids[3] = this.E(0)
if ids[3] >= 1:
ids[4] = this.E(0)-1
else:
ids[4] = -1
if (ids[3]+1) < this.length:
ids[5] = this.E(0)+1
else:
ids[5] = -1
if this.entity_is_open():
ent = this.get_ent()
j = 1
for i in range(ent.start, this.B(0)):
ids[j] = i
j += 1
if j >= 6:
break
else:
# TODO error =/
pass
@ -171,329 +129,256 @@ cdef cppclass StateC:
ids[i] = -1
int S(int i) nogil const:
if i >= this._s_i:
if i >= this._stack.size():
return -1
return this._stack[this._s_i - (i+1)]
elif i < 0:
return -1
return this._stack.at(this._stack.size() - (i+1))
int B(int i) nogil const:
if (i + this._b_i) >= this.length:
if i < 0:
return -1
return this._buffer[this._b_i + i]
const TokenC* S_(int i) nogil const:
return this.safe_get(this.S(i))
elif i < this._rebuffer.size():
return this._rebuffer.at(this._rebuffer.size() - (i+1))
else:
b_i = this._b_i + (i - this._rebuffer.size())
if b_i >= this.length:
return -1
else:
return b_i
const TokenC* B_(int i) nogil const:
return this.safe_get(this.B(i))
const TokenC* H_(int i) nogil const:
return this.safe_get(this.H(i))
const TokenC* E_(int i) nogil const:
return this.safe_get(this.E(i))
const TokenC* L_(int i, int idx) nogil const:
return this.safe_get(this.L(i, idx))
const TokenC* R_(int i, int idx) nogil const:
return this.safe_get(this.R(i, idx))
const TokenC* safe_get(int i) nogil const:
if i < 0 or i >= this.length:
return &this._empty_token
else:
return &this._sent[i]
int H(int i) nogil const:
if i < 0 or i >= this.length:
void get_arcs(vector[ArcC]* arcs) nogil const:
for i in range(this._left_arcs.size()):
arc = this._left_arcs.at(i)
if arc.head != -1 and arc.child != -1:
arcs.push_back(arc)
for i in range(this._right_arcs.size()):
arc = this._right_arcs.at(i)
if arc.head != -1 and arc.child != -1:
arcs.push_back(arc)
int H(int child) nogil const:
if child >= this.length or child < 0:
return -1
return this._sent[i].head + i
else:
return this._heads[child]
int E(int i) nogil const:
if this._e_i <= 0 or this._e_i >= this.length:
if this._ents.size() == 0:
return -1
if i < 0 or i >= this._e_i:
return -1
return this._ents[this._e_i - (i+1)].start
int L(int i, int idx) nogil const:
if idx < 1:
return -1
if i < 0 or i >= this.length:
return -1
cdef const TokenC* target = &this._sent[i]
if target.l_kids < <uint32_t>idx:
return -1
cdef const TokenC* ptr = &this._sent[target.l_edge]
while ptr < target:
# If this head is still to the right of us, we can skip to it
# No token that's between this token and this head could be our
# child.
if (ptr.head >= 1) and (ptr + ptr.head) < target:
ptr += ptr.head
elif ptr + ptr.head == target:
idx -= 1
if idx == 0:
return ptr - this._sent
ptr += 1
else:
ptr += 1
return -1
return this._ents.back().start
int R(int i, int idx) nogil const:
if idx < 1:
int L(int head, int idx) nogil const:
if idx < 1 or this._left_arcs.size() == 0:
return -1
if i < 0 or i >= this.length:
cdef vector[int] lefts
for i in range(this._left_arcs.size()):
arc = this._left_arcs.at(i)
if arc.head == head and arc.child != -1 and arc.child < head:
lefts.push_back(arc.child)
idx = (<int>lefts.size()) - idx
if idx < 0:
return -1
cdef const TokenC* target = &this._sent[i]
if target.r_kids < <uint32_t>idx:
return -1
cdef const TokenC* ptr = &this._sent[target.r_edge]
while ptr > target:
# If this head is still to the right of us, we can skip to it
# No token that's between this token and this head could be our
# child.
if (ptr.head < 0) and ((ptr + ptr.head) > target):
ptr += ptr.head
elif ptr + ptr.head == target:
idx -= 1
if idx == 0:
return ptr - this._sent
ptr -= 1
else:
ptr -= 1
return lefts.at(idx)
int R(int head, int idx) nogil const:
if idx < 1 or this._right_arcs.size() == 0:
return -1
cdef vector[int] rights
for i in range(this._right_arcs.size()):
arc = this._right_arcs.at(i)
if arc.head == head and arc.child != -1 and arc.child > head:
rights.push_back(arc.child)
idx = (<int>rights.size()) - idx
if idx < 0:
return -1
else:
return rights.at(idx)
bint empty() nogil const:
return this._s_i <= 0
return this._stack.size() == 0
bint eol() nogil const:
return this.buffer_length() == 0
bint at_break() nogil const:
return this._break != -1
bint is_final() nogil const:
return this.stack_depth() <= 0 and this._b_i >= this.length
return this.stack_depth() <= 0 and this.eol()
bint has_head(int i) nogil const:
return this.safe_get(i).head != 0
int cannot_sent_start(int word) nogil const:
if word < 0 or word >= this.length:
return 0
elif this._sent[word].sent_start == -1:
return 1
else:
return 0
int n_L(int i) nogil const:
return this.safe_get(i).l_kids
int is_sent_start(int word) nogil const:
if word < 0 or word >= this.length:
return 0
elif this._sent[word].sent_start == 1:
return 1
elif this._sent_starts.count(word) >= 1:
return 1
else:
return 0
int n_R(int i) nogil const:
return this.safe_get(i).r_kids
void set_sent_start(int word, int value) nogil:
if value >= 1:
this._sent_starts.insert(word)
bint has_head(int child) nogil const:
return this._heads[child] >= 0
int l_edge(int word) nogil const:
return word
int r_edge(int word) nogil const:
return word
int n_L(int head) nogil const:
cdef int n = 0
for i in range(this._left_arcs.size()):
arc = this._left_arcs.at(i)
if arc.head == head and arc.child != -1 and arc.child < arc.head:
n += 1
return n
int n_R(int head) nogil const:
cdef int n = 0
for i in range(this._right_arcs.size()):
arc = this._right_arcs.at(i)
if arc.head == head and arc.child != -1 and arc.child > arc.head:
n += 1
return n
bint stack_is_connected() nogil const:
return False
bint entity_is_open() nogil const:
if this._e_i < 1:
if this._ents.size() == 0:
return False
return this._ents[this._e_i-1].end == -1
else:
return this._ents.back().end == -1
int stack_depth() nogil const:
return this._s_i
return this._stack.size()
int buffer_length() nogil const:
if this._break != -1:
return this._break - this._b_i
else:
return this.length - this._b_i
uint64_t hash() nogil const:
cdef TokenC[11] sig
sig[0] = this.S_(2)[0]
sig[1] = this.S_(1)[0]
sig[2] = this.R_(this.S(1), 1)[0]
sig[3] = this.L_(this.S(0), 1)[0]
sig[4] = this.L_(this.S(0), 2)[0]
sig[5] = this.S_(0)[0]
sig[6] = this.R_(this.S(0), 2)[0]
sig[7] = this.R_(this.S(0), 1)[0]
sig[8] = this.B_(0)[0]
sig[9] = this.E_(0)[0]
sig[10] = this.E_(1)[0]
return hash64(sig, sizeof(sig), this._s_i) \
+ hash64(<void*>&this._hist, sizeof(RingBufferC), 1)
void push_hist(int act) nogil:
ring_push(&this._hist, act+1)
int get_hist(int i) nogil:
return ring_get(&this._hist, i)
void push() nogil:
if this.B(0) != -1:
this._stack[this._s_i] = this.B(0)
this._s_i += 1
b0 = this.B(0)
if this._rebuffer.size():
b0 = this._rebuffer.back()
this._rebuffer.pop_back()
else:
b0 = this._b_i
this._b_i += 1
if this.safe_get(this.B_(0).l_edge).sent_start == 1:
this.set_break(this.B_(0).l_edge)
if this._b_i > this._break:
this._break = -1
this._stack.push_back(b0)
void pop() nogil:
if this._s_i >= 1:
this._s_i -= 1
this._stack.pop_back()
void force_final() nogil:
# This should only be used in desperate situations, as it may leave
# the analysis in an unexpected state.
this._s_i = 0
this._stack.clear()
this._b_i = this.length
void unshift() nogil:
this._b_i -= 1
this._buffer[this._b_i] = this.S(0)
this._s_i -= 1
this.shifted[this.B(0)] = True
s0 = this._stack.back()
this._unshiftable[s0] = 1
this._rebuffer.push_back(s0)
this._stack.pop_back()
int is_unshiftable(int item) nogil const:
if item >= this._unshiftable.size():
return 0
else:
return this._unshiftable.at(item)
void set_reshiftable(int item) nogil:
if item < this._unshiftable.size():
this._unshiftable[item] = 0
void add_arc(int head, int child, attr_t label) nogil:
if this.has_head(child):
this.del_arc(this.H(child), child)
cdef int dist = head - child
this._sent[child].head = dist
this._sent[child].dep = label
cdef int i
if child > head:
this._sent[head].r_kids += 1
# Some transition systems can have a word in the buffer have a
# rightward child, e.g. from Unshift.
this._sent[head].r_edge = this._sent[child].r_edge
i = 0
while this.has_head(head) and i < this.length:
head = this.H(head)
this._sent[head].r_edge = this._sent[child].r_edge
i += 1 # Guard against infinite loops
cdef ArcC arc
arc.head = head
arc.child = child
arc.label = label
if head > child:
this._left_arcs.push_back(arc)
else:
this._sent[head].l_kids += 1
this._sent[head].l_edge = this._sent[child].l_edge
this._right_arcs.push_back(arc)
this._heads[child] = head
void del_arc(int h_i, int c_i) nogil:
cdef int dist = h_i - c_i
cdef TokenC* h = &this._sent[h_i]
cdef int i = 0
if c_i > h_i:
# this.R_(h_i, 2) returns the second-rightmost child token of h_i
# If we have more than 2 rightmost children, our 2nd rightmost child's
# rightmost edge is going to be our new rightmost edge.
h.r_edge = this.R_(h_i, 2).r_edge if h.r_kids >= 2 else h_i
h.r_kids -= 1
new_edge = h.r_edge
# Correct upwards in the tree --- see Issue #251
while h.head < 0 and i < this.length: # Guard infinite loop
h += h.head
h.r_edge = new_edge
i += 1
cdef vector[ArcC]* arcs
if h_i > c_i:
arcs = &this._left_arcs
else:
# Same logic applies for left edge, but we don't need to walk up
# the tree, as the head is off the stack.
h.l_edge = this.L_(h_i, 2).l_edge if h.l_kids >= 2 else h_i
h.l_kids -= 1
arcs = &this._right_arcs
if arcs.size() == 0:
return
arc = arcs.back()
if arc.head == h_i and arc.child == c_i:
arcs.pop_back()
else:
for i in range(arcs.size()-1):
arc = arcs.at(i)
if arc.head == h_i and arc.child == c_i:
arc.head = -1
arc.child = -1
arc.label = 0
break
SpanC get_ent() nogil const:
cdef SpanC ent
if this._ents.size() == 0:
ent.start = 0
ent.end = 0
ent.label = 0
return ent
else:
return this._ents.back()
void open_ent(attr_t label) nogil:
this._ents[this._e_i].start = this.B(0)
this._ents[this._e_i].label = label
this._ents[this._e_i].end = -1
this._e_i += 1
cdef SpanC ent
ent.start = this.B(0)
ent.label = label
ent.end = -1
this._ents.push_back(ent)
void close_ent() nogil:
# Note that we don't decrement _e_i here! We want to maintain all
# entities, not over-write them...
this._ents[this._e_i-1].end = this.B(0)+1
this._sent[this.B(0)].ent_iob = 1
void set_ent_tag(int i, int ent_iob, attr_t ent_type) nogil:
if 0 <= i < this.length:
this._sent[i].ent_iob = ent_iob
this._sent[i].ent_type = ent_type
void set_break(int i) nogil:
if 0 <= i < this.length:
this._sent[i].sent_start = 1
this._break = this._b_i
this._ents.back().end = this.B(0)+1
void clone(const StateC* src) nogil:
this.length = src.length
memcpy(this._sent, src._sent, this.length * sizeof(TokenC))
memcpy(this._stack, src._stack, this.length * sizeof(int))
memcpy(this._buffer, src._buffer, this.length * sizeof(int))
memcpy(this._ents, src._ents, this.length * sizeof(SpanC))
memcpy(this.shifted, src.shifted, this.length * sizeof(this.shifted[0]))
this._sent = src._sent
this._stack = src._stack
this._rebuffer = src._rebuffer
this._sent_starts = src._sent_starts
this._unshiftable = src._unshiftable
memcpy(this._heads, src._heads, this.length * sizeof(this._heads[0]))
this._ents = src._ents
this._left_arcs = src._left_arcs
this._right_arcs = src._right_arcs
this._b_i = src._b_i
this._s_i = src._s_i
this._e_i = src._e_i
this._break = src._break
this.offset = src.offset
this._empty_token = src._empty_token
void fast_forward() nogil:
# space token attachment policy:
# - attach space tokens always to the last preceding real token
# - except if it's the beginning of a sentence, then attach to the first following
# - boundary case: a document containing multiple space tokens but nothing else,
# then make the last space token the head of all others
while is_space_token(this.B_(0)) \
or this.buffer_length() == 0 \
or this.stack_depth() == 0:
if this.buffer_length() == 0:
# remove the last sentence's root from the stack
if this.stack_depth() == 1:
this.pop()
# parser got stuck: reduce stack or unshift
elif this.stack_depth() > 1:
if this.has_head(this.S(0)):
this.pop()
else:
this.unshift()
# stack is empty but there is another sentence on the buffer
elif (this.length - this._b_i) >= 1:
this.push()
else: # stack empty and nothing else coming
break
elif is_space_token(this.B_(0)):
# the normal case: we're somewhere inside a sentence
if this.stack_depth() > 0:
# assert not is_space_token(this.S_(0))
# attach all coming space tokens to their last preceding
# real token (which should be on the top of the stack)
while is_space_token(this.B_(0)):
this.add_arc(this.S(0),this.B(0),0)
this.push()
this.pop()
# the rare case: we're at the beginning of a document:
# space tokens are attached to the first real token on the buffer
elif this.stack_depth() == 0:
# store all space tokens on the stack until a real token shows up
# or the last token on the buffer is reached
while is_space_token(this.B_(0)) and this.buffer_length() > 1:
this.push()
# empty the stack by attaching all space tokens to the
# first token on the buffer
# boundary case: if all tokens are space tokens, the last one
# becomes the head of all others
while this.stack_depth() > 0:
this.add_arc(this.B(0),this.S(0),0)
this.pop()
# move the first token onto the stack
this.push()
elif this.stack_depth() == 0:
# for one token sentences (?)
if this.buffer_length() == 1:
this.push()
this.pop()
# with an empty stack and a non-empty buffer
# only shift is valid anyway
elif (this.length - this._b_i) >= 1:
this.push()
else: # can this even happen?
break
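
The vector-based buffer introduced in this file can be modelled in a few lines of Python. This is a simplified sketch under stated assumptions, not the Cython implementation: `ToyState` is a hypothetical name, and it only covers `B`, `S`, `push` and `unshift` as they appear in the added lines (a `_rebuffer` stack for tokens returned from the stack, plus a `_b_i` cursor over the untouched part of the sentence). Arcs, entities and sentence starts are left out.

```python
class ToyState:
    """Toy model of the stack / rebuffer / cursor layout of the new StateC."""

    def __init__(self, length):
        self.length = length
        self.stack = []      # token indices, top of stack is the last item
        self.rebuffer = []   # tokens pushed back by unshift(), last item is B(0)
        self.b_i = 0         # first token that has never left the buffer

    def S(self, i):
        # i-th item from the top of the stack, or -1 if out of range
        if i < 0 or i >= len(self.stack):
            return -1
        return self.stack[len(self.stack) - (i + 1)]

    def B(self, i):
        # Rebuffered tokens come first, then the untouched tokens from b_i.
        if i < 0:
            return -1
        if i < len(self.rebuffer):
            return self.rebuffer[len(self.rebuffer) - (i + 1)]
        b_i = self.b_i + (i - len(self.rebuffer))
        return -1 if b_i >= self.length else b_i

    def push(self):
        # Take B(0) from the rebuffer if there is one, else advance the cursor.
        if self.rebuffer:
            b0 = self.rebuffer.pop()
        else:
            b0 = self.b_i
            self.b_i += 1
        self.stack.append(b0)

    def unshift(self):
        # Move the top of the stack back to the front of the buffer.
        self.rebuffer.append(self.stack.pop())


if __name__ == "__main__":
    st = ToyState(3)
    st.push()
    st.push()
    st.unshift()
    print(st.S(0), st.B(0), st.B(1), st.B(2))
```

The point of the design, as far as the diff shows, is that `unshift` no longer has to rewrite a `_buffer` array in place: returned tokens live in their own vector and the original sentence order is recovered from the cursor.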

View File

@ -1,11 +1,7 @@
from .stateclass cimport StateClass
from ._state cimport StateC
from ...typedefs cimport weight_t, attr_t
from .transition_system cimport Transition, TransitionSystem
cdef class ArcEager(TransitionSystem):
pass
cdef weight_t push_cost(StateClass stcls, const void* _gold, int target) nogil
cdef weight_t arc_cost(StateClass stcls, const void* _gold, int head, int child) nogil

View File

@ -14,16 +14,11 @@ from ._state cimport StateC
from ...errors import Errors
# Calculate cost as gold/not gold. We don't use scalar value anyway.
cdef int BINARY_COSTS = 1
cdef weight_t MIN_SCORE = -90000
cdef attr_t SUBTOK_LABEL = hash_string(u'subtok')
DEF NON_MONOTONIC = True
DEF USE_BREAK = True
# Break transition from here
# http://www.aclweb.org/anthology/P13-1074
cdef enum:
SHIFT
REDUCE
@ -61,9 +56,11 @@ cdef struct GoldParseStateC:
int32_t* n_kids
int32_t length
int32_t stride
weight_t push_cost
weight_t pop_cost
cdef GoldParseStateC create_gold_state(Pool mem, StateClass stcls,
cdef GoldParseStateC create_gold_state(Pool mem, const StateC* state,
heads, labels, sent_starts) except *:
cdef GoldParseStateC gs
gs.length = len(heads)
@ -142,10 +139,12 @@ cdef GoldParseStateC create_gold_state(Pool mem, StateClass stcls,
if head != i:
gs.kids[head][js[head]] = i
js[head] += 1
gs.push_cost = push_cost(state, &gs)
gs.pop_cost = pop_cost(state, &gs)
return gs
cdef void update_gold_state(GoldParseStateC* gs, StateClass stcls) nogil:
cdef void update_gold_state(GoldParseStateC* gs, const StateC* s) nogil:
for i in range(gs.length):
gs.state_bits[i] = set_state_flag(
gs.state_bits[i],
@ -160,9 +159,9 @@ cdef void update_gold_state(GoldParseStateC* gs, StateClass stcls) nogil:
gs.n_kids_in_stack[i] = 0
gs.n_kids_in_buffer[i] = 0
for i in range(stcls.stack_depth()):
s_i = stcls.S(i)
if not is_head_unknown(gs, s_i):
for i in range(s.stack_depth()):
s_i = s.S(i)
if not is_head_unknown(gs, s_i) and gs.heads[s_i] != s_i:
gs.n_kids_in_stack[gs.heads[s_i]] += 1
for kid in gs.kids[s_i][:gs.n_kids[s_i]]:
gs.state_bits[kid] = set_state_flag(
@ -170,9 +169,11 @@ cdef void update_gold_state(GoldParseStateC* gs, StateClass stcls) nogil:
HEAD_IN_STACK,
1
)
for i in range(stcls.buffer_length()):
b_i = stcls.B(i)
if not is_head_unknown(gs, b_i):
for i in range(s.buffer_length()):
b_i = s.B(i)
if s.is_sent_start(b_i):
break
if not is_head_unknown(gs, b_i) and gs.heads[b_i] != b_i:
gs.n_kids_in_buffer[gs.heads[b_i]] += 1
for kid in gs.kids[b_i][:gs.n_kids[b_i]]:
gs.state_bits[kid] = set_state_flag(
@ -180,6 +181,8 @@ cdef void update_gold_state(GoldParseStateC* gs, StateClass stcls) nogil:
HEAD_IN_BUFFER,
1
)
gs.push_cost = push_cost(s, gs)
gs.pop_cost = pop_cost(s, gs)
cdef class ArcEagerGold:
@ -191,17 +194,17 @@ cdef class ArcEagerGold:
heads, labels = example.get_aligned_parse(projectivize=True)
labels = [label if label is not None else "" for label in labels]
labels = [example.x.vocab.strings.add(label) for label in labels]
sent_starts = example.get_aligned("SENT_START")
assert len(heads) == len(labels) == len(sent_starts)
self.c = create_gold_state(self.mem, stcls, heads, labels, sent_starts)
sent_starts = example.get_aligned_sent_starts()
assert len(heads) == len(labels) == len(sent_starts), (len(heads), len(labels), len(sent_starts))
self.c = create_gold_state(self.mem, stcls.c, heads, labels, sent_starts)
def update(self, StateClass stcls):
update_gold_state(&self.c, stcls)
update_gold_state(&self.c, stcls.c)
cdef int check_state_gold(char state_bits, char flag) nogil:
cdef char one = 1
return state_bits & (one << flag)
return 1 if (state_bits & (one << flag)) else 0
cdef int set_state_flag(char state_bits, char flag, int value) nogil:
@ -232,41 +235,30 @@ cdef int is_sent_start_unknown(const GoldParseStateC* gold, int i) nogil:
# Helper functions for the arc-eager oracle
cdef weight_t push_cost(StateClass stcls, const void* _gold, int target) nogil:
gold = <const GoldParseStateC*>_gold
cdef weight_t push_cost(const StateC* state, const GoldParseStateC* gold) nogil:
cdef weight_t cost = 0
if is_head_in_stack(gold, target):
b0 = state.B(0)
if b0 < 0:
return 9000
if is_head_in_stack(gold, b0):
cost += 1
cost += gold.n_kids_in_stack[target]
if Break.is_valid(stcls.c, 0) and Break.move_cost(stcls, gold) == 0:
cost += gold.n_kids_in_stack[b0]
if Break.is_valid(state, 0) and is_sent_start(gold, state.B(1)):
cost += 1
return cost
cdef weight_t pop_cost(StateClass stcls, const void* _gold, int target) nogil:
gold = <const GoldParseStateC*>_gold
cdef weight_t pop_cost(const StateC* state, const GoldParseStateC* gold) nogil:
cdef weight_t cost = 0
if is_head_in_buffer(gold, target):
cost += 1
cost += gold[0].n_kids_in_buffer[target]
if Break.is_valid(stcls.c, 0) and Break.move_cost(stcls, gold) == 0:
s0 = state.S(0)
if s0 < 0:
return 9000
if is_head_in_buffer(gold, s0):
cost += 1
cost += gold.n_kids_in_buffer[s0]
return cost
cdef weight_t arc_cost(StateClass stcls, const void* _gold, int head, int child) nogil:
gold = <const GoldParseStateC*>_gold
if arc_is_gold(gold, head, child):
return 0
elif stcls.H(child) == gold.heads[child]:
return 1
# Head in buffer
elif is_head_in_buffer(gold, child):
return 1
else:
return 0
cdef bint arc_is_gold(const GoldParseStateC* gold, int head, int child) nogil:
if is_head_unknown(gold, child):
return True
@ -276,7 +268,7 @@ cdef bint arc_is_gold(const GoldParseStateC* gold, int head, int child) nogil:
return False
cdef bint label_is_gold(const GoldParseStateC* gold, int head, int child, attr_t label) nogil:
cdef bint label_is_gold(const GoldParseStateC* gold, int child, attr_t label) nogil:
if is_head_unknown(gold, child):
return True
elif label == 0:
@ -292,218 +284,251 @@ cdef bint _is_gold_root(const GoldParseStateC* gold, int word) nogil:
cdef class Shift:
"""Move the first word of the buffer onto the stack and mark it as "shifted"
Validity:
* If stack is empty
* At least two words in sentence
* Word has not been shifted before
Cost: push_cost
Action:
* Mark B[0] as 'shifted'
* Push stack
* Advance buffer
"""
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
sent_start = st._sent[st.B_(0).l_edge].sent_start
return st.buffer_length() >= 2 and not st.shifted[st.B(0)] and sent_start != 1
if st.stack_depth() == 0:
return 1
elif st.buffer_length() < 2:
return 0
elif st.is_sent_start(st.B(0)):
return 0
elif st.is_unshiftable(st.B(0)):
return 0
else:
return 1
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
st.push()
st.fast_forward()
@staticmethod
cdef weight_t cost(StateClass st, const void* _gold, attr_t label) nogil:
cdef weight_t cost(const StateC* state, const void* _gold, attr_t label) nogil:
gold = <const GoldParseStateC*>_gold
return Shift.move_cost(st, gold) + Shift.label_cost(st, gold, label)
@staticmethod
cdef inline weight_t move_cost(StateClass s, const void* _gold) nogil:
gold = <const GoldParseStateC*>_gold
return push_cost(s, gold, s.B(0))
@staticmethod
cdef inline weight_t label_cost(StateClass s, const void* _gold, attr_t label) nogil:
return 0
return gold.push_cost
cdef class Reduce:
"""
Pop from the stack. If it has no head and the stack isn't empty, place
it back on the buffer.
Validity:
* Stack not empty
* Buffer not empty
* Stack depth 1 and cannot sent start l_edge(st.B(0))
Cost:
* If B[0] is the start of a sentence, cost is 0
* Arcs between stack and buffer
* If arc has no head, we're saving arcs between S[0] and S[1:], so decrement
cost by those arcs.
"""
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
return st.stack_depth() >= 2
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
if st.has_head(st.S(0)):
st.pop()
else:
st.unshift()
st.fast_forward()
@staticmethod
cdef weight_t cost(StateClass s, const void* _gold, attr_t label) nogil:
gold = <const GoldParseStateC*>_gold
return Reduce.move_cost(s, gold) + Reduce.label_cost(s, gold, label)
@staticmethod
cdef inline weight_t move_cost(StateClass st, const void* _gold) nogil:
gold = <const GoldParseStateC*>_gold
s0 = st.S(0)
cost = pop_cost(st, gold, s0)
return_to_buffer = not st.has_head(s0)
if return_to_buffer:
# Decrement cost for the arcs we save, as we'll be putting this
# back to the buffer
if is_head_in_stack(gold, s0):
cost -= 1
cost -= gold.n_kids_in_stack[s0]
if Break.is_valid(st.c, 0) and Break.move_cost(st, gold) == 0:
cost -= 1
return cost
@staticmethod
cdef inline weight_t label_cost(StateClass s, const void* gold, attr_t label) nogil:
return 0
cdef class LeftArc:
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
if label == SUBTOK_LABEL and st.S(0) != (st.B(0)-1):
return 0
sent_start = st._sent[st.B_(0).l_edge].sent_start
return sent_start != 1
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
st.add_arc(st.B(0), st.S(0), label)
st.pop()
st.fast_forward()
@staticmethod
cdef inline weight_t cost(StateClass s, const void* _gold, attr_t label) nogil:
gold = <const GoldParseStateC*>_gold
return LeftArc.move_cost(s, gold) + LeftArc.label_cost(s, gold, label)
@staticmethod
cdef inline weight_t move_cost(StateClass s, const GoldParseStateC* gold) nogil:
cdef weight_t cost = 0
s0 = s.S(0)
b0 = s.B(0)
if arc_is_gold(gold, b0, s0):
# Have a negative cost if we 'recover' from the wrong dependency
return 0 if not s.has_head(s0) else -1
else:
# Account for deps we might lose between S0 and stack
if not s.has_head(s0):
cost += gold.n_kids_in_stack[s0]
if is_head_in_buffer(gold, s0):
cost += 1
return cost + pop_cost(s, gold, s.S(0)) + arc_cost(s, gold, s.B(0), s.S(0))
@staticmethod
cdef inline weight_t label_cost(StateClass s, const GoldParseStateC* gold, attr_t label) nogil:
return arc_is_gold(gold, s.B(0), s.S(0)) and not label_is_gold(gold, s.B(0), s.S(0), label)
cdef class RightArc:
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
# If there's (perhaps partial) parse pre-set, don't allow cycle.
if label == SUBTOK_LABEL and st.S(0) != (st.B(0)-1):
return 0
sent_start = st._sent[st.B_(0).l_edge].sent_start
return sent_start != 1 and st.H(st.S(0)) != st.B(0)
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
st.add_arc(st.S(0), st.B(0), label)
st.push()
st.fast_forward()
@staticmethod
cdef inline weight_t cost(StateClass s, const void* _gold, attr_t label) nogil:
gold = <const GoldParseStateC*>_gold
return RightArc.move_cost(s, gold) + RightArc.label_cost(s, gold, label)
@staticmethod
cdef inline weight_t move_cost(StateClass s, const void* _gold) nogil:
gold = <const GoldParseStateC*>_gold
if arc_is_gold(gold, s.S(0), s.B(0)):
return 0
elif s.c.shifted[s.B(0)]:
return push_cost(s, gold, s.B(0))
else:
return push_cost(s, gold, s.B(0)) + arc_cost(s, gold, s.S(0), s.B(0))
@staticmethod
cdef weight_t label_cost(StateClass s, const void* _gold, attr_t label) nogil:
gold = <const GoldParseStateC*>_gold
return arc_is_gold(gold, s.S(0), s.B(0)) and not label_is_gold(gold, s.S(0), s.B(0), label)
cdef class Break:
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef int i
if not USE_BREAK:
if st.stack_depth() == 0:
return False
elif st.at_break():
return False
elif st.stack_depth() < 1:
return False
elif st.B_(0).l_edge < 0:
return False
elif st._sent[st.B_(0).l_edge].sent_start < 0:
elif st.buffer_length() == 0:
return True
elif st.stack_depth() == 1 and st.cannot_sent_start(st.l_edge(st.B(0))):
return False
else:
return True
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
st.set_break(st.B_(0).l_edge)
st.fast_forward()
@staticmethod
cdef weight_t cost(StateClass s, const void* _gold, attr_t label) nogil:
gold = <const GoldParseStateC*>_gold
return Break.move_cost(s, gold) + Break.label_cost(s, gold, label)
@staticmethod
cdef inline weight_t move_cost(StateClass s, const void* _gold) nogil:
gold = <const GoldParseStateC*>_gold
cost = 0
for i in range(s.stack_depth()):
S_i = s.S(i)
cost += gold.n_kids_in_buffer[S_i]
if is_head_in_buffer(gold, S_i):
cost += 1
# It's weird not to check the gold sentence boundaries but if we do,
# we can't account for "sunk costs", i.e. situations where we're already
# wrong.
s0_root = _get_root(s.S(0), gold)
b0_root = _get_root(s.B(0), gold)
if s0_root != b0_root or s0_root == -1 or b0_root == -1:
return cost
if st.has_head(st.S(0)) or st.stack_depth() == 1:
st.pop()
else:
return cost + 1
st.unshift()
@staticmethod
cdef inline weight_t label_cost(StateClass s, const void* gold, attr_t label) nogil:
cdef weight_t cost(const StateC* state, const void* _gold, attr_t label) nogil:
gold = <const GoldParseStateC*>_gold
if state.is_sent_start(state.B(0)):
return 0
s0 = state.S(0)
cost = gold.pop_cost
if not state.has_head(s0):
# Decrement cost for the arcs we save, as we'll be putting this
# back to the buffer
if is_head_in_stack(gold, s0):
cost -= 1
cost -= gold.n_kids_in_stack[s0]
return cost
cdef int _get_root(int word, const GoldParseStateC* gold) nogil:
if is_head_unknown(gold, word):
return -1
while gold.heads[word] != word and word >= 0:
word = gold.heads[word]
if is_head_unknown(gold, word):
return -1
cdef class LeftArc:
"""Add an arc between B[0] and S[0], replacing the previous head of S[0] if
one is set. Pop S[0] from the stack.
Validity:
* len(S) >= 1
* len(B) >= 1
* not is_sent_start(B[0])
Cost:
pop_cost - Arc(B[0], S[0], label) + (Arc(S[1], S[0]) if H(S[0]) else Arcs(S, S[0]))
"""
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
if st.stack_depth() == 0:
return 0
elif st.buffer_length() == 0:
return 0
elif st.is_sent_start(st.B(0)):
return 0
elif label == SUBTOK_LABEL and st.S(0) != (st.B(0)-1):
return 0
else:
return word
return 1
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
st.add_arc(st.B(0), st.S(0), label)
# If we change the stack, it's okay to remove the shifted mark, as
# we can't get in an infinite loop this way.
st.set_reshiftable(st.B(0))
st.pop()
@staticmethod
cdef inline weight_t cost(const StateC* state, const void* _gold, attr_t label) nogil:
gold = <const GoldParseStateC*>_gold
cdef weight_t cost = gold.pop_cost
s0 = state.S(0)
s1 = state.S(1)
b0 = state.B(0)
if state.has_head(s0):
# Increment cost if we're clobbering a correct arc
cost += gold.heads[s0] == s1
else:
# If there's no head, we're losing arcs between S0 and S[1:].
cost += is_head_in_stack(gold, s0)
cost += gold.n_kids_in_stack[s0]
if b0 != -1 and s0 != -1 and gold.heads[s0] == b0:
cost -= 1
cost += not label_is_gold(gold, s0, label)
return cost
cdef class RightArc:
"""
Add an arc from S[0] to B[0]. Push B[0].
Validity:
* len(S) >= 1
* len(B) >= 1
* not is_sent_start(B[0])
Cost:
push_cost + (not shifted[b0] and Arc(B[1:], B[0])) - Arc(S[0], B[0], label)
"""
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
if st.stack_depth() == 0:
return 0
elif st.buffer_length() == 0:
return 0
elif st.is_sent_start(st.B(0)):
return 0
elif label == SUBTOK_LABEL and st.S(0) != (st.B(0)-1):
# If there's (perhaps partial) parse pre-set, don't allow cycle.
return 0
else:
return 1
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
st.add_arc(st.S(0), st.B(0), label)
st.push()
@staticmethod
cdef inline weight_t cost(const StateC* state, const void* _gold, attr_t label) nogil:
gold = <const GoldParseStateC*>_gold
cost = gold.push_cost
s0 = state.S(0)
b0 = state.B(0)
if s0 != -1 and b0 != -1 and gold.heads[b0] == s0:
cost -= 1
cost += not label_is_gold(gold, b0, label)
elif is_head_in_buffer(gold, b0) and not state.is_unshiftable(b0):
cost += 1
return cost
cdef class Break:
"""Mark the second word of the buffer as the start of a
sentence.
Validity:
* len(buffer) >= 2
* B[1] == B[0] + 1
* not is_sent_start(B[1])
* not cannot_sent_start(B[1])
Action:
* mark_sent_start(B[1])
Cost:
* not is_sent_start(B[1])
* Arcs between B[0] and B[1:]
* Arcs between S and B[1]
"""
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef int i
if st.buffer_length() < 2:
return False
elif st.B(1) != st.B(0) + 1:
return False
elif st.is_sent_start(st.B(1)):
return False
elif st.cannot_sent_start(st.B(1)):
return False
else:
return True
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
st.set_sent_start(st.B(1), 1)
@staticmethod
cdef weight_t cost(const StateC* state, const void* _gold, attr_t label) nogil:
gold = <const GoldParseStateC*>_gold
cdef int b0 = state.B(0)
cdef int cost = 0
cdef int si
for i in range(state.stack_depth()):
si = state.S(i)
if is_head_in_buffer(gold, si):
cost += 1
cost += gold.n_kids_in_buffer[si]
# We need to score into B[1:], so subtract deps that are at b0
if gold.heads[b0] == si:
cost -= 1
if gold.heads[si] == b0:
cost -= 1
if not is_sent_start(gold, state.B(1)) \
and not is_sent_start_unknown(gold, state.B(1)):
cost += 1
return cost
cdef void* _init_state(Pool mem, int length, void* tokens) except NULL:
st = new StateC(<const TokenC*>tokens, length)
for i in range(st.length):
if st._sent[i].dep == 0:
st._sent[i].l_edge = i
st._sent[i].r_edge = i
st._sent[i].head = 0
st._sent[i].dep = 0
st._sent[i].l_kids = 0
st._sent[i].r_kids = 0
st.fast_forward()
return <void*>st
@ -515,6 +540,8 @@ cdef int _del_state(Pool mem, void* state, void* x) except -1:
cdef class ArcEager(TransitionSystem):
def __init__(self, *args, **kwargs):
TransitionSystem.__init__(self, *args, **kwargs)
self.init_beam_state = _init_state
self.del_beam_state = _del_state
@classmethod
def get_actions(cls, **kwargs):
@ -537,7 +564,7 @@ cdef class ArcEager(TransitionSystem):
label = 'ROOT'
if head == child:
actions[BREAK][label] += 1
elif head < child:
if head < child:
actions[RIGHT][label] += 1
actions[REDUCE][''] += 1
elif head > child:
@ -567,8 +594,14 @@ cdef class ArcEager(TransitionSystem):
t.do(state.c, t.label)
return state
def is_gold_parse(self, StateClass state, gold):
raise NotImplementedError
def is_gold_parse(self, StateClass state, ArcEagerGold gold):
for i in range(state.c.length):
token = state.c.safe_get(i)
if not arc_is_gold(&gold.c, i, i+token.head):
return False
elif not label_is_gold(&gold.c, i, token.dep):
return False
return True
def init_gold(self, StateClass state, Example example):
gold = ArcEagerGold(self, state, example)
@ -576,6 +609,7 @@ cdef class ArcEager(TransitionSystem):
return gold
def init_gold_batch(self, examples):
# TODO: Projectivity?
all_states = self.init_batch([eg.predicted for eg in examples])
golds = []
states = []
@ -662,24 +696,13 @@ cdef class ArcEager(TransitionSystem):
raise ValueError(Errors.E019.format(action=move, src='arc_eager'))
return t
cdef int initialize_state(self, StateC* st) nogil:
for i in range(st.length):
if st._sent[i].dep == 0:
st._sent[i].l_edge = i
st._sent[i].r_edge = i
st._sent[i].head = 0
st._sent[i].dep = 0
st._sent[i].l_kids = 0
st._sent[i].r_kids = 0
st.fast_forward()
cdef int finalize_state(self, StateC* st) nogil:
cdef int i
for i in range(st.length):
if st._sent[i].head == 0:
st._sent[i].dep = self.root_label
def finalize_doc(self, Doc doc):
def set_annotations(self, StateClass state, Doc doc):
for arc in state.arcs:
doc.c[arc["child"]].head = arc["head"] - arc["child"]
doc.c[arc["child"]].dep = arc["label"]
for i in range(doc.length):
if doc.c[i].head == 0:
doc.c[i].dep = self.root_label
set_children_from_heads(doc.c, 0, doc.length)
def has_gold(self, Example eg, start=0, end=None):
@ -690,7 +713,7 @@ cdef class ArcEager(TransitionSystem):
return False
cdef int set_valid(self, int* output, const StateC* st) nogil:
cdef bint[N_MOVES] is_valid
cdef int[N_MOVES] is_valid
is_valid[SHIFT] = Shift.is_valid(st, 0)
is_valid[REDUCE] = Reduce.is_valid(st, 0)
is_valid[LEFT] = LeftArc.is_valid(st, 0)
@ -710,29 +733,31 @@ cdef class ArcEager(TransitionSystem):
gold_state = gold_.c
n_gold = 0
if self.c[i].is_valid(stcls.c, self.c[i].label):
cost = self.c[i].get_cost(stcls, &gold_state, self.c[i].label)
cost = self.c[i].get_cost(stcls.c, &gold_state, self.c[i].label)
else:
cost = 9000
return cost
cdef int set_costs(self, int* is_valid, weight_t* costs,
StateClass stcls, gold) except -1:
const StateC* state, gold) except -1:
if not isinstance(gold, ArcEagerGold):
raise TypeError(Errors.E909.format(name="ArcEagerGold"))
cdef ArcEagerGold gold_ = gold
gold_.update(stcls)
gold_state = gold_.c
update_gold_state(&gold_state, state)
self.set_valid(is_valid, state)
cdef int n_gold = 0
for i in range(self.n_moves):
if self.c[i].is_valid(stcls.c, self.c[i].label):
is_valid[i] = True
costs[i] = self.c[i].get_cost(stcls, &gold_state, self.c[i].label)
if is_valid[i]:
costs[i] = self.c[i].get_cost(state, &gold_state, self.c[i].label)
if costs[i] <= 0:
n_gold += 1
else:
is_valid[i] = False
costs[i] = 9000
if n_gold < 1:
for i in range(self.n_moves):
print(self.get_class_name(i), is_valid[i], costs[i])
print("Gold sent starts?", is_sent_start(&gold_state, state.B(0)), is_sent_start(&gold_state, state.B(1)))
raise ValueError
def get_oracle_sequence_from_state(self, StateClass state, ArcEagerGold gold, _debug=None):
@ -748,12 +773,13 @@ cdef class ArcEager(TransitionSystem):
failed = False
while not state.is_final():
try:
self.set_costs(is_valid, costs, state, gold)
self.set_costs(is_valid, costs, state.c, gold)
except ValueError:
failed = True
break
min_cost = min(costs[i] for i in range(self.n_moves))
for i in range(self.n_moves):
if is_valid[i] and costs[i] <= 0:
if is_valid[i] and costs[i] <= min_cost:
action = self.c[i]
history.append(i)
s0 = state.S(0)
@ -762,9 +788,7 @@ cdef class ArcEager(TransitionSystem):
example = _debug
debug_log.append(" ".join((
self.get_class_name(i),
"S0=", (example.x[s0].text if s0 >= 0 else "__"),
"B0=", (example.x[b0].text if b0 >= 0 else "__"),
"S0 head?", str(state.has_head(state.S(0))),
state.print_state()
)))
action.do(state.c, action.label)
break
@ -783,6 +807,8 @@ cdef class ArcEager(TransitionSystem):
print("Aligned heads")
for i, head in enumerate(aligned_heads):
print(example.x[i], example.x[head] if head is not None else "__")
print("Aligned sent starts")
print(example.get_aligned_sent_starts())
print("Predicted tokens")
print([(w.i, w.text) for w in example.x])
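
The hunk in `get_oracle_sequence_from_state` above appears to swap the `costs[i] <= 0` test for `costs[i] <= min_cost`, so the oracle still advances when no zero-cost action exists. A minimal sketch of that selection rule in plain Python follows. The function name `pick_oracle_action` is hypothetical, and the value 9000 for invalid moves is taken from the diff; this is an illustration, not the library's code.

```python
from typing import Optional, Sequence

INVALID_COST = 9000.0  # cost assigned to invalid moves in the diff above


def pick_oracle_action(
    is_valid: Sequence[bool], costs: Sequence[float]
) -> Optional[int]:
    """Return the index of the first valid action with minimal cost.

    Mirrors the loop in the diff: the minimum is taken over all costs,
    then the first valid action whose cost does not exceed it is chosen.
    Returns None when no action is valid.
    """
    if not costs:
        return None
    min_cost = min(costs)
    for i, (valid, cost) in enumerate(zip(is_valid, costs)):
        if valid and cost <= min_cost:
            return i
    return None


if __name__ == "__main__":
    # No zero-cost action is available, but the cheapest valid one is chosen.
    print(pick_oracle_action([True, True, False], [2.0, 1.0, INVALID_COST]))
```

With the older `<= 0` test, the example call would find no action at all, which is the situation the `min_cost` comparison seems intended to avoid.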

View File

@ -3,9 +3,12 @@ from cymem.cymem cimport Pool
from collections import Counter
from ...tokens.doc cimport Doc
from ...tokens.span import Span
from ...typedefs cimport weight_t, attr_t
from ...lexeme cimport Lexeme
from ...attrs cimport IS_SPACE
from ...structs cimport TokenC
from ...training.example cimport Example
from .stateclass cimport StateClass
from ._state cimport StateC
@ -46,17 +49,17 @@ cdef class BiluoGold:
def __init__(self, BiluoPushDown moves, StateClass stcls, Example example):
self.mem = Pool()
self.c = create_gold_state(self.mem, moves, stcls, example)
self.c = create_gold_state(self.mem, moves, stcls.c, example)
def update(self, StateClass stcls):
update_gold_state(&self.c, stcls)
update_gold_state(&self.c, stcls.c)
cdef GoldNERStateC create_gold_state(
Pool mem,
BiluoPushDown moves,
StateClass stcls,
const StateC* stcls,
Example example
) except *:
cdef GoldNERStateC gs
@ -67,7 +70,7 @@ cdef GoldNERStateC create_gold_state(
return gs
cdef void update_gold_state(GoldNERStateC* gs, StateClass stcls) except *:
cdef void update_gold_state(GoldNERStateC* gs, const StateC* state) except *:
# We don't need to update each time, unlike the parser.
pass
@ -75,14 +78,15 @@ cdef void update_gold_state(GoldNERStateC* gs, StateClass stcls) except *:
cdef do_func_t[N_MOVES] do_funcs
cdef bint _entity_is_sunk(StateClass st, Transition* golds) nogil:
if not st.entity_is_open():
cdef bint _entity_is_sunk(const StateC* state, Transition* golds) nogil:
if not state.entity_is_open():
return False
cdef const Transition* gold = &golds[st.E(0)]
cdef const Transition* gold = &golds[state.E(0)]
ent = state.get_ent()
if gold.move != BEGIN and gold.move != UNIT:
return True
elif gold.label != st.E_(0).ent_type:
elif gold.label != ent.label:
return True
else:
return False
@ -228,15 +232,18 @@ cdef class BiluoPushDown(TransitionSystem):
self.labels[action][label_name] = -1
return 1
cdef int initialize_state(self, StateC* st) nogil:
# This is especially necessary when we use limited training data.
for i in range(st.length):
if st._sent[i].ent_type != 0:
with gil:
self.add_action(BEGIN, st._sent[i].ent_type)
self.add_action(IN, st._sent[i].ent_type)
self.add_action(UNIT, st._sent[i].ent_type)
self.add_action(LAST, st._sent[i].ent_type)
def set_annotations(self, StateClass state, Doc doc):
cdef int i
ents = []
for i in range(state.c._ents.size()):
ent = state.c._ents.at(i)
if ent.start != -1 and ent.end != -1:
ents.append(Span(doc, ent.start, ent.end, label=ent.label))
doc.set_ents(ents, default="unmodified")
# Set non-blocked tokens to O
for i in range(doc.length):
if doc.c[i].ent_iob == 0:
doc.c[i].ent_iob = 2
def init_gold(self, StateClass state, Example example):
return BiluoGold(self, state, example)
@ -255,26 +262,25 @@ cdef class BiluoPushDown(TransitionSystem):
gold_state = gold_.c
n_gold = 0
if self.c[i].is_valid(stcls.c, self.c[i].label):
cost = self.c[i].get_cost(stcls, &gold_state, self.c[i].label)
cost = self.c[i].get_cost(stcls.c, &gold_state, self.c[i].label)
else:
cost = 9000
return cost
cdef int set_costs(self, int* is_valid, weight_t* costs,
StateClass stcls, gold) except -1:
const StateC* state, gold) except -1:
if not isinstance(gold, BiluoGold):
raise TypeError(Errors.E909.format(name="BiluoGold"))
cdef BiluoGold gold_ = gold
gold_.update(stcls)
gold_state = gold_.c
update_gold_state(&gold_state, state)
n_gold = 0
self.set_valid(is_valid, state)
for i in range(self.n_moves):
if self.c[i].is_valid(stcls.c, self.c[i].label):
is_valid[i] = 1
costs[i] = self.c[i].get_cost(stcls, &gold_state, self.c[i].label)
if is_valid[i]:
costs[i] = self.c[i].get_cost(state, &gold_state, self.c[i].label)
n_gold += costs[i] <= 0
else:
is_valid[i] = 0
costs[i] = 9000
if n_gold < 1:
raise ValueError
@ -290,7 +296,7 @@ cdef class Missing:
pass
@staticmethod
cdef weight_t cost(StateClass s, const void* _gold, attr_t label) nogil:
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil:
return 9000
@ -299,10 +305,10 @@ cdef class Begin:
cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef int preset_ent_iob = st.B_(0).ent_iob
cdef attr_t preset_ent_label = st.B_(0).ent_type
# If we're the last token of the input, we can't B -- must U or O.
if st.B(1) == -1:
if st.entity_is_open():
return False
elif st.entity_is_open():
if st.buffer_length() < 2:
# If we're the last token of the input, we can't B -- must U or O.
return False
elif label == 0:
return False
@ -337,12 +343,11 @@ cdef class Begin:
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
st.open_ent(label)
st.set_ent_tag(st.B(0), 3, label)
st.push()
st.pop()
@staticmethod
cdef weight_t cost(StateClass s, const void* _gold, attr_t label) nogil:
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil:
gold = <GoldNERStateC*>_gold
cdef int g_act = gold.ner[s.B(0)].move
cdef attr_t g_tag = gold.ner[s.B(0)].label
@ -366,16 +371,17 @@ cdef class Begin:
cdef class In:
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
if not st.entity_is_open():
return False
if st.buffer_length() < 2:
# If we're at the end, we can't I.
return False
ent = st.get_ent()
cdef int preset_ent_iob = st.B_(0).ent_iob
cdef attr_t preset_ent_label = st.B_(0).ent_type
if label == 0:
return False
elif st.E_(0).ent_type != label:
return False
elif not st.entity_is_open():
return False
elif st.B(1) == -1:
# If we're at the end, we can't I.
elif ent.label != label:
return False
elif preset_ent_iob == 3:
return False
@ -401,12 +407,11 @@ cdef class In:
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
st.set_ent_tag(st.B(0), 1, label)
st.push()
st.pop()
@staticmethod
cdef weight_t cost(StateClass s, const void* _gold, attr_t label) nogil:
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil:
gold = <GoldNERStateC*>_gold
move = IN
cdef int next_act = gold.ner[s.B(1)].move if s.B(1) >= 0 else OUT
@ -457,7 +462,7 @@ cdef class Last:
# Otherwise, force acceptance, even if we're across a sentence
# boundary or the token is whitespace.
return True
elif st.E_(0).ent_type != label:
elif st.get_ent().label != label:
return False
elif st.B_(1).ent_iob == 1:
# If a preset entity has I next, we can't L here.
@ -468,12 +473,11 @@ cdef class Last:
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
st.close_ent()
st.set_ent_tag(st.B(0), 1, label)
st.push()
st.pop()
@staticmethod
cdef weight_t cost(StateClass s, const void* _gold, attr_t label) nogil:
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil:
gold = <GoldNERStateC*>_gold
move = LAST
@ -537,12 +541,11 @@ cdef class Unit:
cdef int transition(StateC* st, attr_t label) nogil:
st.open_ent(label)
st.close_ent()
st.set_ent_tag(st.B(0), 3, label)
st.push()
st.pop()
@staticmethod
cdef weight_t cost(StateClass s, const void* _gold, attr_t label) nogil:
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil:
gold = <GoldNERStateC*>_gold
cdef int g_act = gold.ner[s.B(0)].move
cdef attr_t g_tag = gold.ner[s.B(0)].label
@ -578,12 +581,11 @@ cdef class Out:
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
st.set_ent_tag(st.B(0), 2, 0)
st.push()
st.pop()
@staticmethod
cdef weight_t cost(StateClass s, const void* _gold, attr_t label) nogil:
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil:
gold = <GoldNERStateC*>_gold
cdef int g_act = gold.ner[s.B(0)].move
cdef attr_t g_tag = gold.ner[s.B(0)].label
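
The new `BiluoPushDown.set_annotations` above reads finished entities from the state and keeps only those with both a start and an end. The sketch below shows the same idea on a plain list of per-token moves. It is a simplification under stated assumptions: `spans_from_biluo` is a hypothetical helper, moves are given as single-letter strings rather than transition IDs, and span ends are treated as exclusive.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def spans_from_biluo(actions: List[Tuple[str, str]]) -> List[Span]:
    """Convert one (move, label) pair per token into labelled spans.

    B opens an entity, L closes it, U opens and closes on one token.
    I and O neither open nor close anything. An entity that is opened
    but never closed is dropped, like the start/end != -1 check above.
    """
    spans: List[Span] = []
    open_start = -1
    open_label = ""
    for i, (move, label) in enumerate(actions):
        if move == "B":
            open_start, open_label = i, label
        elif move == "L" and open_start != -1:
            spans.append((open_start, i + 1, open_label))
            open_start = -1
        elif move == "U":
            spans.append((i, i + 1, label))
    return spans


if __name__ == "__main__":
    moves = [("O", ""), ("B", "ORG"), ("I", "ORG"), ("L", "ORG"), ("U", "GPE")]
    print(spans_from_biluo(moves))
```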

View File

@ -2,30 +2,24 @@ from cymem.cymem cimport Pool
from ...structs cimport TokenC, SpanC
from ...typedefs cimport attr_t
from ...tokens.doc cimport Doc
from ._state cimport StateC
cdef class StateClass:
cdef Pool mem
cdef StateC* c
cdef readonly Doc doc
cdef int _borrowed
@staticmethod
cdef inline StateClass init(const TokenC* sent, int length):
cdef inline StateClass borrow(StateC* ptr, Doc doc):
cdef StateClass self = StateClass()
self.c = new StateC(sent, length)
return self
@staticmethod
cdef inline StateClass borrow(StateC* ptr):
cdef StateClass self = StateClass()
del self.c
self.c = ptr
self._borrowed = 1
self.doc = doc
return self
@staticmethod
cdef inline StateClass init_offset(const TokenC* sent, int length, int
offset):
@ -33,105 +27,3 @@ cdef class StateClass:
self.c = new StateC(sent, length)
self.c.offset = offset
return self
cdef inline int S(self, int i) nogil:
return self.c.S(i)
cdef inline int B(self, int i) nogil:
return self.c.B(i)
cdef inline const TokenC* S_(self, int i) nogil:
return self.c.S_(i)
cdef inline const TokenC* B_(self, int i) nogil:
return self.c.B_(i)
cdef inline const TokenC* H_(self, int i) nogil:
return self.c.H_(i)
cdef inline const TokenC* E_(self, int i) nogil:
return self.c.E_(i)
cdef inline const TokenC* L_(self, int i, int idx) nogil:
return self.c.L_(i, idx)
cdef inline const TokenC* R_(self, int i, int idx) nogil:
return self.c.R_(i, idx)
cdef inline const TokenC* safe_get(self, int i) nogil:
return self.c.safe_get(i)
cdef inline int H(self, int i) nogil:
return self.c.H(i)
cdef inline int E(self, int i) nogil:
return self.c.E(i)
cdef inline int L(self, int i, int idx) nogil:
return self.c.L(i, idx)
cdef inline int R(self, int i, int idx) nogil:
return self.c.R(i, idx)
cdef inline bint empty(self) nogil:
return self.c.empty()
cdef inline bint eol(self) nogil:
return self.c.eol()
cdef inline bint at_break(self) nogil:
return self.c.at_break()
cdef inline bint has_head(self, int i) nogil:
return self.c.has_head(i)
cdef inline int n_L(self, int i) nogil:
return self.c.n_L(i)
cdef inline int n_R(self, int i) nogil:
return self.c.n_R(i)
cdef inline bint stack_is_connected(self) nogil:
return False
cdef inline bint entity_is_open(self) nogil:
return self.c.entity_is_open()
cdef inline int stack_depth(self) nogil:
return self.c.stack_depth()
cdef inline int buffer_length(self) nogil:
return self.c.buffer_length()
cdef inline void push(self) nogil:
self.c.push()
cdef inline void pop(self) nogil:
self.c.pop()
cdef inline void unshift(self) nogil:
self.c.unshift()
cdef inline void add_arc(self, int head, int child, attr_t label) nogil:
self.c.add_arc(head, child, label)
cdef inline void del_arc(self, int head, int child) nogil:
self.c.del_arc(head, child)
cdef inline void open_ent(self, attr_t label) nogil:
self.c.open_ent(label)
cdef inline void close_ent(self) nogil:
self.c.close_ent()
cdef inline void set_ent_tag(self, int i, int ent_iob, attr_t ent_type) nogil:
self.c.set_ent_tag(i, ent_iob, ent_type)
cdef inline void set_break(self, int i) nogil:
self.c.set_break(i)
cdef inline void clone(self, StateClass src) nogil:
self.c.clone(src.c)
cdef inline void fast_forward(self) nogil:
self.c.fast_forward()

View File

@ -1,17 +1,20 @@
# cython: infer_types=True
import numpy
from libcpp.vector cimport vector
from ._state cimport ArcC
from ...tokens.doc cimport Doc
cdef class StateClass:
def __init__(self, Doc doc=None, int offset=0):
cdef Pool mem = Pool()
self.mem = mem
self._borrowed = 0
if doc is not None:
self.c = new StateC(doc.c, doc.length)
self.c.offset = offset
self.doc = doc
else:
self.doc = None
def __dealloc__(self):
if self._borrowed != 1:
@ -19,36 +22,157 @@ cdef class StateClass:
@property
def stack(self):
return {self.S(i) for i in range(self.c._s_i)}
return [self.S(i) for i in range(self.c.stack_depth())]
@property
def queue(self):
return {self.B(i) for i in range(self.c.buffer_length())}
return [self.B(i) for i in range(self.c.buffer_length())]
@property
def token_vector_lenth(self):
return self.doc.tensor.shape[1]
@property
def history(self):
hist = numpy.ndarray((8,), dtype='i')
for i in range(8):
hist[i] = self.c.get_hist(i+1)
return hist
def arcs(self):
cdef vector[ArcC] arcs
self.c.get_arcs(&arcs)
return list(arcs)
#py_arcs = []
#for arc in arcs:
# if arc.head != -1 and arc.child != -1:
# py_arcs.append((arc.head, arc.child, arc.label))
#return arcs
def add_arc(self, int head, int child, int label):
self.c.add_arc(head, child, label)
def del_arc(self, int head, int child):
self.c.del_arc(head, child)
def H(self, int child):
return self.c.H(child)
def L(self, int head, int idx):
return self.c.L(head, idx)
def R(self, int head, int idx):
return self.c.R(head, idx)
@property
def _b_i(self):
return self.c._b_i
@property
def length(self):
return self.c.length
def is_final(self):
return self.c.is_final()
def copy(self):
cdef StateClass new_state = StateClass.init(self.c._sent, self.c.length)
cdef StateClass new_state = StateClass(doc=self.doc, offset=self.c.offset)
new_state.c.clone(self.c)
return new_state
def print_state(self, words):
def print_state(self):
words = [token.text for token in self.doc]
words = list(words) + ['_']
top = f"{words[self.S(0)]}_{self.S_(0).head}"
second = f"{words[self.S(1)]}_{self.S_(1).head}"
third = f"{words[self.S(2)]}_{self.S_(2).head}"
n0 = words[self.B(0)]
n1 = words[self.B(1)]
return ' '.join((third, second, top, '|', n0, n1))
bools = ["F", "T"]
sent_starts = [bools[self.c.is_sent_start(i)] for i in range(len(self.doc))]
shifted = [1 if self.c.is_unshiftable(i) else 0 for i in range(self.c.length)]
shifted.append("")
sent_starts.append("")
top = f"{self.S(0)}{words[self.S(0)]}_{words[self.H(self.S(0))]}_{shifted[self.S(0)]}"
second = f"{self.S(1)}{words[self.S(1)]}_{words[self.H(self.S(1))]}_{shifted[self.S(1)]}"
third = f"{self.S(2)}{words[self.S(2)]}_{words[self.H(self.S(2))]}_{shifted[self.S(2)]}"
n0 = f"{self.B(0)}{words[self.B(0)]}_{sent_starts[self.B(0)]}_{shifted[self.B(0)]}"
n1 = f"{self.B(1)}{words[self.B(1)]}_{sent_starts[self.B(1)]}_{shifted[self.B(1)]}"
return ' '.join((str(self.stack_depth()), str(self.buffer_length()), third, second, top, '|', n0, n1))
def S(self, int i):
return self.c.S(i)
def B(self, int i):
return self.c.B(i)
def H(self, int i):
return self.c.H(i)
def E(self, int i):
return self.c.E(i)
def L(self, int i, int idx):
return self.c.L(i, idx)
def R(self, int i, int idx):
return self.c.R(i, idx)
def S_(self, int i):
return self.doc[self.c.S(i)]
def B_(self, int i):
return self.doc[self.c.B(i)]
def H_(self, int i):
return self.doc[self.c.H(i)]
def E_(self, int i):
return self.doc[self.c.E(i)]
def L_(self, int i, int idx):
return self.doc[self.c.L(i, idx)]
def R_(self, int i, int idx):
return self.doc[self.c.R(i, idx)]
def empty(self):
return self.c.empty()
def eol(self):
return self.c.eol()
def at_break(self):
return False
#return self.c.at_break()
def has_head(self, int i):
return self.c.has_head(i)
def n_L(self, int i):
return self.c.n_L(i)
def n_R(self, int i):
return self.c.n_R(i)
def entity_is_open(self):
return self.c.entity_is_open()
def stack_depth(self):
return self.c.stack_depth()
def buffer_length(self):
return self.c.buffer_length()
def push(self):
self.c.push()
def pop(self):
self.c.pop()
def unshift(self):
self.c.unshift()
def add_arc(self, int head, int child, attr_t label):
self.c.add_arc(head, child, label)
def del_arc(self, int head, int child):
self.c.del_arc(head, child)
def open_ent(self, attr_t label):
self.c.open_ent(label)
def close_ent(self):
self.c.close_ent()
def clone(self, StateClass src):
self.c.clone(src.c)

View File

@ -16,14 +16,14 @@ cdef struct Transition:
weight_t score
bint (*is_valid)(const StateC* state, attr_t label) nogil
weight_t (*get_cost)(StateClass state, const void* gold, attr_t label) nogil
weight_t (*get_cost)(const StateC* state, const void* gold, attr_t label) nogil
int (*do)(StateC* state, attr_t label) nogil
ctypedef weight_t (*get_cost_func_t)(StateClass state, const void* gold,
ctypedef weight_t (*get_cost_func_t)(const StateC* state, const void* gold,
attr_tlabel) nogil
ctypedef weight_t (*move_cost_func_t)(StateClass state, const void* gold) nogil
ctypedef weight_t (*label_cost_func_t)(StateClass state, const void*
ctypedef weight_t (*move_cost_func_t)(const StateC* state, const void* gold) nogil
ctypedef weight_t (*label_cost_func_t)(const StateC* state, const void*
gold, attr_t label) nogil
ctypedef int (*do_func_t)(StateC* state, attr_t label) nogil
@ -41,9 +41,8 @@ cdef class TransitionSystem:
cdef public attr_t root_label
cdef public freqs
cdef public object labels
cdef int initialize_state(self, StateC* state) nogil
cdef int finalize_state(self, StateC* state) nogil
cdef init_state_t init_beam_state
cdef del_state_t del_beam_state
cdef Transition lookup_transition(self, object name) except *
@ -52,4 +51,4 @@ cdef class TransitionSystem:
cdef int set_valid(self, int* output, const StateC* st) nogil
cdef int set_costs(self, int* is_valid, weight_t* costs,
StateClass state, gold) except -1
const StateC* state, gold) except -1

View File

@ -5,6 +5,7 @@ from cymem.cymem cimport Pool
from collections import Counter
import srsly
from . cimport _beam_utils
from ...typedefs cimport weight_t, attr_t
from ...tokens.doc cimport Doc
from ...structs cimport TokenC
@ -44,6 +45,8 @@ cdef class TransitionSystem:
if labels_by_action:
self.initialize_actions(labels_by_action, min_freq=min_freq)
self.root_label = self.strings.add('ROOT')
self.init_beam_state = _init_state
self.del_beam_state = _del_state
def __reduce__(self):
return (self.__class__, (self.strings, self.labels), None, None)
@ -54,7 +57,6 @@ cdef class TransitionSystem:
offset = 0
for doc in docs:
state = StateClass(doc, offset=offset)
self.initialize_state(state.c)
states.append(state)
offset += len(doc)
return states
@ -80,7 +82,7 @@ cdef class TransitionSystem:
history = []
debug_log = []
while not state.is_final():
self.set_costs(is_valid, costs, state, gold)
self.set_costs(is_valid, costs, state.c, gold)
for i in range(self.n_moves):
if is_valid[i] and costs[i] <= 0:
action = self.c[i]
@ -124,15 +126,6 @@ cdef class TransitionSystem:
action = self.lookup_transition(name)
action.do(state.c, action.label)
cdef int initialize_state(self, StateC* state) nogil:
pass
cdef int finalize_state(self, StateC* state) nogil:
pass
def finalize_doc(self, doc):
pass
cdef Transition lookup_transition(self, object name) except *:
raise NotImplementedError
@ -151,7 +144,7 @@ cdef class TransitionSystem:
is_valid[i] = self.c[i].is_valid(st, self.c[i].label)
cdef int set_costs(self, int* is_valid, weight_t* costs,
StateClass stcls, gold) except -1:
const StateC* state, gold) except -1:
raise NotImplementedError
def get_class_name(self, int clas):

View File

@ -105,6 +105,93 @@ def make_parser(
update_with_oracle_cut_size=update_with_oracle_cut_size,
multitasks=[],
learn_tokens=learn_tokens,
min_action_freq=min_action_freq,
beam_width=1,
beam_density=0.0,
beam_update_prob=0.0,
)
@Language.factory(
"beam_parser",
assigns=["token.dep", "token.head", "token.is_sent_start", "doc.sents"],
default_config={
"beam_width": 8,
"beam_density": 0.01,
"beam_update_prob": 0.5,
"moves": None,
"update_with_oracle_cut_size": 100,
"learn_tokens": False,
"min_action_freq": 30,
"model": DEFAULT_PARSER_MODEL,
},
default_score_weights={
"dep_uas": 0.5,
"dep_las": 0.5,
"dep_las_per_type": None,
"sents_p": None,
"sents_r": None,
"sents_f": 0.0,
},
)
def make_beam_parser(
nlp: Language,
name: str,
model: Model,
moves: Optional[list],
update_with_oracle_cut_size: int,
learn_tokens: bool,
min_action_freq: int,
beam_width: int,
beam_density: float,
beam_update_prob: float,
):
"""Create a transition-based DependencyParser component that uses beam-search.
The dependency parser jointly learns sentence segmentation and labelled
dependency parsing, and can optionally learn to merge tokens that had been
over-segmented by the tokenizer.
The parser uses a variant of the non-monotonic arc-eager transition-system
described by Honnibal and Johnson (2014), with the addition of a "break"
transition to perform the sentence segmentation. Nivre's pseudo-projective
dependency transformation is used to allow the parser to predict
non-projective parses.
The parser is trained using a global objective. That is, it learns to assign
probabilities to whole parses.
model (Model): The model for the transition-based parser. The model needs
to have a specific substructure of named components --- see the
spacy.ml.tb_framework.TransitionModel for details.
moves (List[str]): A list of transition names. Inferred from the data if not
provided.
beam_width (int): The number of candidate analyses to maintain.
beam_density (float): The minimum ratio between the scores of the first and
last candidates in the beam. This allows the parser to avoid exploring
candidates that are too far behind. This is mostly intended to improve
efficiency, but it can also improve accuracy as deeper search is not
always better.
beam_update_prob (float): The chance of making a beam update, instead of a
greedy update. Greedy updates are an approximation for the beam updates,
and are faster to compute.
learn_tokens (bool): Whether to learn to merge subtokens that are split
relative to the gold standard. Experimental.
min_action_freq (int): The minimum frequency of labelled actions to retain.
Rarer labelled actions have their label backed-off to "dep". While this
primarily affects the label accuracy, it can also affect the attachment
structure, as the labels are used to represent the pseudo-projectivity
transformation.
"""
return DependencyParser(
nlp.vocab,
model,
name,
moves=moves,
update_with_oracle_cut_size=update_with_oracle_cut_size,
beam_width=beam_width,
beam_density=beam_density,
beam_update_prob=beam_update_prob,
multitasks=[],
learn_tokens=learn_tokens,
min_action_freq=min_action_freq
)
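
The `beam_width` and `beam_density` settings documented in the docstring above can be illustrated with a small pruning routine. This is a rough sketch only: `prune_beam` is a hypothetical name, scores are assumed to be positive (for example probabilities), and the actual beam implementation in the library may differ in detail.

```python
from typing import List


def prune_beam(scores: List[float], beam_width: int, beam_density: float) -> List[float]:
    """Keep at most beam_width candidates, best first, and drop any whose
    score ratio to the best candidate falls below beam_density.

    Assumes all scores are strictly positive.
    """
    if beam_width < 1 or not scores:
        return []
    ranked = sorted(scores, reverse=True)[:beam_width]
    best = ranked[0]
    return [score for score in ranked if score / best >= beam_density]


if __name__ == "__main__":
    candidates = [0.5, 0.3, 0.004, 0.1]
    # The width limit alone removes the weakest candidate.
    print(prune_beam(candidates, beam_width=3, beam_density=0.01))
    # With a wider beam, the density threshold removes it instead.
    print(prune_beam(candidates, beam_width=4, beam_density=0.01))
```

A density of 0.0, as passed by the greedy `parser` factory above, disables the ratio check, and a width of 1 reduces the search to greedy decoding.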

View File

@ -30,7 +30,7 @@ default_model_config = """
pretrained_vectors = null
width = 96
depth = 2
embed_size = 300
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true

View File

@ -261,7 +261,11 @@ class EntityRuler(Pipe):
# disable the nlp components after this one in case they hadn't been initialized / deserialised yet
try:
current_index = self.nlp.pipe_names.index(self.name)
current_index = -1
for i, (name, pipe) in enumerate(self.nlp.pipeline):
if self == pipe:
current_index = i
break
subsequent_pipes = [
pipe for pipe in self.nlp.pipe_names[current_index + 1 :]
]
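
The `EntityRuler` hunk above looks the component up by comparing objects rather than by name. A minimal sketch of that lookup on a list of `(name, component)` pairs is shown below; `index_of_component` and `names_after` are hypothetical helpers, and identity comparison with `is` stands in for the `==` used in the diff.

```python
from typing import Any, List, Tuple


def index_of_component(pipeline: List[Tuple[str, Any]], component: Any) -> int:
    """Return the position of component in the pipeline, or -1 if absent."""
    for i, (_name, pipe) in enumerate(pipeline):
        if pipe is component:
            return i
    return -1


def names_after(pipeline: List[Tuple[str, Any]], component: Any) -> List[str]:
    """Names of all components that come after the given one.

    If the component is not found, the index is -1, so every name is
    returned, matching the current_index + 1 slice in the diff.
    """
    start = index_of_component(pipeline, component) + 1
    return [name for name, _pipe in pipeline[start:]]


if __name__ == "__main__":
    ruler, tagger, parser = object(), object(), object()
    pipeline = [("tagger", tagger), ("my_ruler", ruler), ("parser", parser)]
    print(names_after(pipeline, ruler))
```

Looking the object up directly avoids depending on the component's registered name matching `self.name`, which seems to be the motivation for the change.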

View File

@ -4,7 +4,7 @@ from thinc.api import Model
from pathlib import Path
from .pipe import Pipe
from ..errors import Errors
from ..errors import Errors, Warnings
from ..language import Language
from ..training import Example
from ..lookups import Lookups, load_lookups
@ -197,6 +197,8 @@ class Lemmatizer(Pipe):
string = token.text
univ_pos = token.pos_.lower()
if univ_pos in ("", "eol", "space"):
if univ_pos == "":
logger.warn(Warnings.W108.format(text=string))
return [string.lower()]
# See Issue #435 for example of where this logic is required.
if self.is_base_form(token):

View File

@ -67,9 +67,6 @@ class Morphologizer(Tagger):
vocab: Vocab,
model: Model,
name: str = "morphologizer",
*,
labels_morph: Optional[dict] = None,
labels_pos: Optional[dict] = None,
):
"""Initialize a morphologizer.
@ -77,8 +74,6 @@ class Morphologizer(Tagger):
model (thinc.api.Model): The Thinc Model powering the pipeline component.
name (str): The component instance name, used to add entries to the
losses during training.
labels_morph (dict): Mapping of morph + POS tags to morph labels.
labels_pos (dict): Mapping of morph + POS tags to POS tags.
DOCS: https://nightly.spacy.io/api/morphologizer#init
"""
@ -90,11 +85,8 @@ class Morphologizer(Tagger):
# store mappings from morph+POS labels to token-level annotations:
# 1) labels_morph stores a mapping from morph+POS->morph
# 2) labels_pos stores a mapping from morph+POS->POS
cfg = {"labels_morph": labels_morph or {}, "labels_pos": labels_pos or {}}
cfg = {"labels_morph": {}, "labels_pos": {}}
self.cfg = dict(sorted(cfg.items()))
# add mappings for empty morph
self.cfg["labels_morph"][Morphology.EMPTY_MORPH] = Morphology.EMPTY_MORPH
self.cfg["labels_pos"][Morphology.EMPTY_MORPH] = POS_IDS[""]
@property
def labels(self):
@ -201,8 +193,8 @@ class Morphologizer(Tagger):
doc_tag_ids = doc_tag_ids.get()
for j, tag_id in enumerate(doc_tag_ids):
morph = self.labels[tag_id]
doc.c[j].morph = self.vocab.morphology.add(self.cfg["labels_morph"][morph])
doc.c[j].pos = self.cfg["labels_pos"][morph]
doc.c[j].morph = self.vocab.morphology.add(self.cfg["labels_morph"].get(morph, 0))
doc.c[j].pos = self.cfg["labels_pos"].get(morph, 0)
def get_loss(self, examples, scores):
"""Find the loss and gradient of loss for the batch of documents and
@ -228,8 +220,8 @@ class Morphologizer(Tagger):
# doesn't, so if either is None, treat both as None here so that
# truths doesn't end up with an unknown morph+POS combination
if pos is None or morph is None:
pos = None
morph = None
label = None
else:
label_dict = Morphology.feats_to_dict(morph)
if pos:
label_dict[self.POS_FEAT] = pos

View File

@ -47,7 +47,7 @@ class MultitaskObjective(Tagger):
side-objective.
"""
def __init__(self, vocab, model, name="nn_labeller", *, labels, target):
def __init__(self, vocab, model, name="nn_labeller", *, target):
self.vocab = vocab
self.model = model
self.name = name
@ -67,7 +67,7 @@ class MultitaskObjective(Tagger):
self.make_label = target
else:
raise ValueError(Errors.E016)
cfg = {"labels": labels or {}, "target": target}
cfg = {"labels": {}, "target": target}
self.cfg = dict(cfg)
@property
@ -81,10 +81,13 @@ class MultitaskObjective(Tagger):
def set_annotations(self, docs, dep_ids):
pass
def initialize(self, get_examples, nlp=None):
def initialize(self, get_examples, nlp=None, labels=None):
if not hasattr(get_examples, "__call__"):
err = Errors.E930.format(name="MultitaskObjective", obj=type(get_examples))
raise ValueError(err)
if labels is not None:
self.labels = labels
else:
for example in get_examples():
for token in example.y:
label = self.make_label(token)
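
The hunks above move `labels` out of the constructor and into `initialize`. A minimal sketch of that pattern follows. `ToyLabeller` and its integer label IDs are hypothetical stand-ins, not spaCy classes, and the examples are plain lists of strings rather than `Example` objects.

```python
from typing import Callable, Dict, Iterable, List, Optional


class ToyLabeller:
    """Hypothetical component whose labels are set in initialize(),
    not in __init__()."""

    def __init__(self, name: str = "toy_labeller") -> None:
        self.name = name
        self.labels: Dict[str, int] = {}

    def initialize(
        self,
        get_examples: Callable[[], Iterable[List[str]]],
        labels: Optional[Dict[str, int]] = None,
    ) -> None:
        if not callable(get_examples):
            raise ValueError("get_examples must be callable")
        if labels is not None:
            # Labels supplied up front: use them and skip the data pass.
            self.labels = dict(labels)
        else:
            # Otherwise infer the label set from the training examples.
            for example in get_examples():
                for label in example:
                    if label not in self.labels:
                        self.labels[label] = len(self.labels)


inferred = ToyLabeller()
inferred.initialize(lambda: [["NOUN", "VERB"], ["NOUN"]])

preset = ToyLabeller()
preset.initialize(lambda: [], labels={"X": 0})
```

Passing the labels explicitly avoids iterating over the data a second time when the label set is already known.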

View File

@ -82,6 +82,79 @@ def make_ner(
multitasks=[],
min_action_freq=1,
learn_tokens=False,
beam_width=1,
beam_density=0.0,
beam_update_prob=0.0,
)
@Language.factory(
"beam_ner",
assigns=["doc.ents", "token.ent_iob", "token.ent_type"],
default_config={
"moves": None,
"update_with_oracle_cut_size": 100,
"model": DEFAULT_NER_MODEL,
"beam_density": 0.01,
"beam_update_prob": 0.5,
"beam_width": 32
},
default_score_weights={"ents_f": 1.0, "ents_p": 0.0, "ents_r": 0.0, "ents_per_type": None},
)
def make_beam_ner(
nlp: Language,
name: str,
model: Model,
moves: Optional[list],
update_with_oracle_cut_size: int,
beam_width: int,
beam_density: float,
beam_update_prob: float,
):
"""Create a transition-based EntityRecognizer component that uses beam search.
The entity recognizer identifies non-overlapping labelled spans of tokens.
The transition-based algorithm used encodes certain assumptions that are
effective for "traditional" named entity recognition tasks, but may not be
a good fit for every span identification problem. Specifically, the loss
function optimizes for whole entity accuracy, so if your inter-annotator
agreement on boundary tokens is low, the component will likely perform poorly
on your problem. The transition-based algorithm also assumes that the most
decisive information about your entities will be close to their initial tokens.
If your entities are long and characterised by tokens in their middle, the
component will likely do poorly on your task.
model (Model): The model for the transition-based parser. The model needs
to have a specific substructure of named components --- see the
spacy.ml.tb_framework.TransitionModel for details.
moves (list[str]): A list of transition names. Inferred from the data if not
provided.
update_with_oracle_cut_size (int):
During training, cut long sequences into shorter segments by creating
intermediate states based on the gold-standard history. The model is
not very sensitive to this parameter, so you usually won't need to change
it. 100 is a good default.
beam_width (int): The number of candidate analyses to maintain.
beam_density (float): The minimum ratio between the scores of the first and
last candidates in the beam. This allows the parser to avoid exploring
candidates that are too far behind. This is mostly intended to improve
efficiency, but it can also improve accuracy as deeper search is not
always better.
beam_update_prob (float): The chance of making a beam update, instead of a
greedy update. Greedy updates are an approximation for the beam updates,
and are faster to compute.
"""
return EntityRecognizer(
nlp.vocab,
model,
name,
moves=moves,
update_with_oracle_cut_size=update_with_oracle_cut_size,
multitasks=[],
min_action_freq=1,
learn_tokens=False,
beam_width=beam_width,
beam_density=beam_density,
beam_update_prob=beam_update_prob,
)
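
The docstring above describes `beam_width` as the number of candidates kept and `beam_density` as a minimum score ratio between the best and the last candidate. The toy search below is one way to read those two settings. It is a sketch under assumptions, not the `thinc.extra.search.Beam` implementation: `beam_search` and `toy_scores` are hypothetical names, and the scorer returns a fixed probability distribution instead of model output.

```python
import math
from typing import Callable, Dict, List, Tuple

# (cumulative log-probability, actions taken so far)
Candidate = Tuple[float, List[str]]


def beam_search(
    step_scores: Callable[[List[str]], Dict[str, float]],
    n_steps: int,
    beam_width: int,
    beam_density: float = 0.0,
) -> List[Candidate]:
    """Keep at most `beam_width` candidates per step, and drop candidates
    whose probability is below `beam_density` times the best candidate's."""
    beam: List[Candidate] = [(0.0, [])]
    for _ in range(n_steps):
        expanded: List[Candidate] = []
        for logp, history in beam:
            for action, prob in step_scores(history).items():
                expanded.append((logp + math.log(prob), history + [action]))
        expanded.sort(key=lambda cand: cand[0], reverse=True)
        best_logp = expanded[0][0]
        beam = [
            cand
            for cand in expanded[:beam_width]
            if math.exp(cand[0] - best_logp) >= beam_density
        ]
    return beam


def toy_scores(history: List[str]) -> Dict[str, float]:
    # Hypothetical scorer: the same distribution at every step.
    return {"O": 0.6, "B-ORG": 0.3, "U-ORG": 0.1}


greedy = beam_search(toy_scores, n_steps=2, beam_width=1)
wide = beam_search(toy_scores, n_steps=2, beam_width=32)
pruned = beam_search(toy_scores, n_steps=2, beam_width=32, beam_density=0.2)
```

With `beam_width=1` the search reduces to greedy decoding. Raising `beam_density` discards low-scoring candidates early, which is the efficiency trade-off the docstring mentions.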

View File

@ -61,14 +61,13 @@ class Tagger(TrainablePipe):
DOCS: https://nightly.spacy.io/api/tagger
"""
def __init__(self, vocab, model, name="tagger", *, labels=None):
def __init__(self, vocab, model, name="tagger"):
"""Initialize a part-of-speech tagger.
vocab (Vocab): The shared vocabulary.
model (thinc.api.Model): The Thinc Model powering the pipeline component.
name (str): The component instance name, used to add entries to the
losses during training.
labels (List): The set of labels. Defaults to None.
DOCS: https://nightly.spacy.io/api/tagger#init
"""
@ -76,7 +75,7 @@ class Tagger(TrainablePipe):
self.model = model
self.name = name
self._rehearsal_model = None
cfg = {"labels": labels or []}
cfg = {"labels": []}
self.cfg = dict(sorted(cfg.items()))
@property

View File

@ -4,13 +4,14 @@ from cymem.cymem cimport Pool
cimport numpy as np
from itertools import islice
from libcpp.vector cimport vector
from libc.string cimport memset
from libc.string cimport memset, memcpy
from libc.stdlib cimport calloc, free
import random
from typing import Optional
import srsly
from thinc.api import set_dropout_rate
from thinc.api import set_dropout_rate, CupyOps
from thinc.extra.search cimport Beam
import numpy.random
import numpy
import warnings
@ -22,6 +23,8 @@ from ..ml.parser_model cimport WeightsC, ActivationsC, SizesC, cpu_log_loss
from ..ml.parser_model cimport get_c_weights, get_c_sizes
from ..tokens.doc cimport Doc
from .trainable_pipe import TrainablePipe
from ._parser_internals cimport _beam_utils
from ._parser_internals import _beam_utils
from ..training import validate_examples, validate_get_examples
from ..errors import Errors, Warnings
@ -41,9 +44,12 @@ cdef class Parser(TrainablePipe):
moves=None,
*,
update_with_oracle_cut_size,
multitasks=tuple(),
min_action_freq,
learn_tokens,
beam_width=1,
beam_density=0.0,
beam_update_prob=0.0,
multitasks=tuple(),
):
"""Create a Parser.
@ -61,7 +67,10 @@ cdef class Parser(TrainablePipe):
"update_with_oracle_cut_size": update_with_oracle_cut_size,
"multitasks": list(multitasks),
"min_action_freq": min_action_freq,
"learn_tokens": learn_tokens
"learn_tokens": learn_tokens,
"beam_width": beam_width,
"beam_density": beam_density,
"beam_update_prob": beam_update_prob
}
if moves is None:
# defined by EntityRecognizer as a BiluoPushDown
@ -183,7 +192,15 @@ cdef class Parser(TrainablePipe):
result = self.moves.init_batch(docs)
self._resize()
return result
if self.cfg["beam_width"] == 1:
return self.greedy_parse(docs, drop=0.0)
else:
return self.beam_parse(
docs,
drop=0.0,
beam_width=self.cfg["beam_width"],
beam_density=self.cfg["beam_density"]
)
def greedy_parse(self, docs, drop=0.):
cdef vector[StateC*] states
@ -207,6 +224,31 @@ cdef class Parser(TrainablePipe):
del model
return batch
def beam_parse(self, docs, int beam_width, float drop=0., beam_density=0.):
cdef Beam beam
cdef Doc doc
batch = _beam_utils.BeamBatch(
self.moves,
self.moves.init_batch(docs),
None,
beam_width,
density=beam_density
)
# This is pretty dirty, but the NER can resize itself in init_batch,
# if labels are missing. We therefore have to check whether we need to
# expand our model output.
self._resize()
model = self.model.predict(docs)
while not batch.is_done:
states = batch.get_unfinished_states()
if not states:
break
scores = model.predict(states)
batch.advance(scores)
model.clear_memory()
del model
return list(batch)
cdef void _parseC(self, StateC** states,
WeightsC weights, SizesC sizes) nogil:
cdef int i, j
@ -227,14 +269,13 @@ cdef class Parser(TrainablePipe):
unfinished.clear()
free_activations(&activations)
def set_annotations(self, docs, states):
def set_annotations(self, docs, states_or_beams):
cdef StateClass state
cdef Beam beam
cdef Doc doc
states = _beam_utils.collect_states(states_or_beams, docs)
for i, (state, doc) in enumerate(zip(states, docs)):
self.moves.finalize_state(state.c)
for j in range(doc.length):
doc.c[j] = state.c._sent[j]
self.moves.finalize_doc(doc)
self.moves.set_annotations(state, doc)
for hook in self.postprocesses:
hook(doc)
@ -265,7 +306,6 @@ cdef class Parser(TrainablePipe):
else:
action = self.moves.c[guess]
action.do(states[i], action.label)
states[i].push_hist(guess)
free(is_valid)
def update(self, examples, *, drop=0., set_annotations=False, sgd=None, losses=None):
@ -276,13 +316,23 @@ cdef class Parser(TrainablePipe):
validate_examples(examples, "Parser.update")
for multitask in self._multitasks:
multitask.update(examples, drop=drop, sgd=sgd)
n_examples = len([eg for eg in examples if self.moves.has_gold(eg)])
if n_examples == 0:
return losses
set_dropout_rate(self.model, drop)
# Prepare the stepwise model, and get the callback for finishing the batch
model, backprop_tok2vec = self.model.begin_update(
[eg.predicted for eg in examples])
# The probability that we make a beam update instead of falling back to
# a greedy update
beam_update_prob = self.cfg["beam_update_prob"]
if self.cfg['beam_width'] >= 2 and numpy.random.random() < beam_update_prob:
return self.update_beam(
examples,
beam_width=self.cfg["beam_width"],
set_annotations=set_annotations,
sgd=sgd,
losses=losses,
beam_density=self.cfg["beam_density"]
)
max_moves = self.cfg["update_with_oracle_cut_size"]
if max_moves >= 1:
# Chop sequences into lengths of this many words, to make the
@ -296,6 +346,8 @@ cdef class Parser(TrainablePipe):
states, golds, _ = self.moves.init_gold_batch(examples)
if not states:
return losses
model, backprop_tok2vec = self.model.begin_update([eg.x for eg in examples])
all_states = list(states)
states_golds = list(zip(states, golds))
n_moves = 0
@ -379,6 +431,27 @@ cdef class Parser(TrainablePipe):
del tutor
return losses
def update_beam(self, examples, *, beam_width,
drop=0., sgd=None, losses=None, set_annotations=False, beam_density=0.0):
states, golds, _ = self.moves.init_gold_batch(examples)
if not states:
return losses
# Prepare the stepwise model, and get the callback for finishing the batch
model, backprop_tok2vec = self.model.begin_update(
[eg.predicted for eg in examples])
loss = _beam_utils.update_beam(
self.moves,
states,
golds,
model,
beam_width,
beam_density=beam_density,
)
losses[self.name] += loss
backprop_tok2vec(golds)
if sgd is not None:
self.finish_update(sgd)
def get_batch_loss(self, states, golds, float[:, ::1] scores, losses):
cdef StateClass state
cdef Pool mem = Pool()
@ -396,7 +469,7 @@ cdef class Parser(TrainablePipe):
for i, (state, gold) in enumerate(zip(states, golds)):
memset(is_valid, 0, self.moves.n_moves * sizeof(int))
memset(costs, 0, self.moves.n_moves * sizeof(float))
self.moves.set_costs(is_valid, costs, state, gold)
self.moves.set_costs(is_valid, costs, state.c, gold)
for j in range(self.moves.n_moves):
if costs[j] <= 0.0 and j in unseen_classes:
unseen_classes.remove(j)
@ -539,7 +612,6 @@ cdef class Parser(TrainablePipe):
for clas in oracle_actions[i:i+max_length]:
action = self.moves.c[clas]
action.do(state.c, action.label)
state.c.push_hist(action.clas)
if state.is_final():
break
if self.moves.has_gold(eg, start_state.B(0), state.B(0)):
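
The parser hunks above switch between greedy and beam code paths based on `beam_width` and `beam_update_prob`. The two conditions can be isolated as below. This is only a sketch: `choose_parse` and `choose_update` are hypothetical helper names, and the standard-library `random.Random` is used in place of the `numpy.random.random()` call in the diff.

```python
import random


def choose_parse(beam_width: int) -> str:
    """Prediction is greedy only when the beam has width 1."""
    return "greedy" if beam_width == 1 else "beam"


def choose_update(beam_width: int, beam_update_prob: float, rng: random.Random) -> str:
    """Training makes a beam update with probability `beam_update_prob`,
    and only if the beam is at least 2 wide; otherwise it stays greedy."""
    if beam_width >= 2 and rng.random() < beam_update_prob:
        return "beam"
    return "greedy"


rng = random.Random(0)
# With the "beam_ner" defaults from the factory above (width 32, prob 0.5),
# roughly half of the updates would take the beam path.
choices = [choose_update(32, 0.5, rng) for _ in range(1000)]
beam_fraction = choices.count("beam") / len(choices)
```

Because the defaults of the plain `ner` factory are `beam_width=1` and `beam_update_prob=0.0`, that component always takes the greedy path under these conditions.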

View File

@ -273,6 +273,7 @@ class ModelMetaSchema(BaseModel):
version: StrictStr = Field(..., title="Model version")
spacy_version: StrictStr = Field("", title="Compatible spaCy version identifier")
parent_package: StrictStr = Field("spacy", title="Name of parent spaCy package, e.g. spacy or spacy-nightly")
requirements: List[StrictStr] = Field([], title="Additional Python package dependencies, used for the Python package setup")
pipeline: List[StrictStr] = Field([], title="Names of pipeline components")
description: StrictStr = Field("", title="Model description")
license: StrictStr = Field("", title="Model license")
@ -329,6 +330,7 @@ class ConfigSchemaNlp(BaseModel):
before_creation: Optional[Callable[[Type["Language"]], Type["Language"]]] = Field(..., title="Optional callback to modify Language class before initialization")
after_creation: Optional[Callable[["Language"], "Language"]] = Field(..., title="Optional callback to modify nlp object after creation and before the pipeline is constructed")
after_pipeline_creation: Optional[Callable[["Language"], "Language"]] = Field(..., title="Optional callback to modify nlp object after the pipeline is constructed")
batch_size: Optional[int] = Field(..., title="Default batch size")
# fmt: on
class Config:
@ -351,9 +353,7 @@ class ConfigSchemaPretrain(BaseModel):
batcher: Batcher = Field(..., title="Batcher for the training data")
component: str = Field(..., title="Component to find the layer to pretrain")
layer: str = Field(..., title="Layer to pretrain. Whole model if empty.")
# TODO: use a more detailed schema for this?
objective: Dict[str, Any] = Field(..., title="Pretraining objective")
objective: Callable[["Vocab", "Model"], "Model"] = Field(..., title="A function that creates the pretraining objective.")
# fmt: on
class Config:

View File

@ -512,7 +512,7 @@ class Scorer:
negative_labels (Iterable[str]): The string values that refer to no annotation (e.g. "NIL")
RETURNS (Dict[str, Any]): A dictionary containing the scores.
DOCS (TODO): https://nightly.spacy.io/api/scorer#score_links
DOCS: https://nightly.spacy.io/api/scorer#score_links
"""
f_per_type = {}
for example in examples:
@ -720,44 +720,10 @@ def get_ner_prf(examples: Iterable[Example]) -> Dict[str, Any]:
}
#############################################################################
#
# The following implementation of roc_auc_score() is adapted from
# scikit-learn, which is distributed under the following license:
#
# New BSD License
#
# scikit-learn, which is distributed under the New BSD License.
# Copyright (c) 2007–2019 The scikit-learn developers.
# All rights reserved.
#
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# a. Redistributions of source code must retain the above copyright notice,
# this list of conditions and the following disclaimer.
# b. Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# c. Neither the name of the Scikit-learn Developers nor the names of
# its contributors may be used to endorse or promote products
# derived from this software without specific prior written
# permission.
#
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR
# ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
# LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
# OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
# DAMAGE.
# See licenses/3rd_party_licenses.txt
def _roc_auc_score(y_true, y_score):
"""Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC)
from prediction scores.

View File

@ -109,12 +109,12 @@ Loading the models is expensive and not necessary if you're not actually testing
```python
def test_doc_token_api_strings(en_vocab):
text = "Give it back! He pleaded."
words = ["Give", "it", "back", "!", "He", "pleaded", "."]
pos = ['VERB', 'PRON', 'PART', 'PUNCT', 'PRON', 'VERB', 'PUNCT']
heads = [0, 0, 0, 0, 5, 5, 5]
deps = ['ROOT', 'dobj', 'prt', 'punct', 'nsubj', 'ROOT', 'punct']
doc = Doc(en_vocab, [t.text for t in tokens], pos=pos, heads=heads, deps=deps)
doc = Doc(en_vocab, words=words, pos=pos, heads=heads, deps=deps)
assert doc[0].text == 'Give'
assert doc[0].lower_ == 'give'
assert doc[0].pos_ == 'VERB'

View File

@ -172,6 +172,11 @@ def lt_tokenizer():
return get_lang_class("lt")().tokenizer
@pytest.fixture(scope="session")
def mk_tokenizer():
return get_lang_class("mk")().tokenizer
@pytest.fixture(scope="session")
def ml_tokenizer():
return get_lang_class("ml")().tokenizer

View File

@ -123,6 +123,7 @@ def test_doc_api_serialize(en_tokenizer, text):
tokens[0].norm_ = "norm"
tokens.ents = [(tokens.vocab.strings["PRODUCT"], 0, 1)]
tokens[0].ent_kb_id_ = "ent_kb_id"
tokens[0].ent_id_ = "ent_id"
new_tokens = Doc(tokens.vocab).from_bytes(tokens.to_bytes())
assert tokens.text == new_tokens.text
assert [t.text for t in tokens] == [t.text for t in new_tokens]
@ -130,6 +131,7 @@ def test_doc_api_serialize(en_tokenizer, text):
assert new_tokens[0].lemma_ == "lemma"
assert new_tokens[0].norm_ == "norm"
assert new_tokens[0].ent_kb_id_ == "ent_kb_id"
assert new_tokens[0].ent_id_ == "ent_id"
new_tokens = Doc(tokens.vocab).from_bytes(
tokens.to_bytes(exclude=["tensor"]), exclude=["tensor"]

View File

@ -416,6 +416,13 @@ def test_doc_retokenizer_merge_lex_attrs(en_vocab):
assert doc[1].is_stop
assert not doc[0].is_stop
assert not doc[1].like_num
# Test that norm is only set on tokens
doc = Doc(en_vocab, words=["eins", "zwei", "!", "!"])
assert doc[0].norm_ == "eins"
with doc.retokenize() as retokenizer:
retokenizer.merge(doc[0:1], attrs={"norm": "1"})
assert doc[0].norm_ == "1"
assert en_vocab["eins"].norm_ == "eins"
def test_retokenize_skip_duplicates(en_vocab):

View File

@ -0,0 +1,84 @@
import pytest
from spacy.lang.mk.lex_attrs import like_num
def test_tokenizer_handles_long_text(mk_tokenizer):
text = """
Во организациските работи или на нашите собранија со членството, никој од нас не зборуваше за
организацијата и идеологијата. Работна беше нашата работа, а не идеолошка. Што се однесува до социјализмот на
Делчев, неговата дејност зборува сама за себе - спротивно. Во суштина, водачите си имаа свои основни погледи и
свои разбирања за положбата и работите, коишто стоеја пред нив и ги завршуваа со голема упорност, настојчивост и
насоченост. Значи, идеологија имаше, само што нивната идеологија имаше своја оригиналност. Македонија денеска,
чиста рожба на животот и положбата во Македонија, кои му служеа како база на неговите побуди, беше дејност која
имаше потреба од ум за да си најде своја смисла. Таквата идеологија и заемното дејство на умот и срцето му
помогнаа на Делчев да не се занесе по патот на својата идеологија... Во суштина, Организацијата и нејзините
водачи имаа свои разбирања за работите и положбата во идеен поглед, но тоа беше врската, животот и положбата во
Македонија и го внесуваа во својата идеологија гласот на своето срце, и на крај, прибегнуваа до умот,
за да најдат смисла или да ѝ дадат. Тоа содејство и заемен сооднос на умот и срцето му помогнаа на Делчев да ја
држи својата идеологија во сообразност со положбата на работите... Водачите навистина направија една жртва
бидејќи на населението не му зборуваа за своите мисли и идеи. Тие се одрекоа од секаква субјективност во своите
мисли. Целта беше да не се зголемуваат целите и задачите како и преданоста во работата. Населението не можеше да
ги разбере овие идеи...
"""
tokens = mk_tokenizer(text)
assert len(tokens) == 297
@pytest.mark.parametrize(
"word,match",
[
("10", True),
("1", True),
("10.000", True),
("1000", True),
("бројка", False),
("999,0", True),
("еден", True),
("два", True),
("цифра", False),
("десет", True),
("сто", True),
("број", False),
("илјада", True),
("илјади", True),
("милион", True),
(",", False),
("милијарда", True),
("билион", True),
]
)
def test_mk_lex_attrs_like_number(mk_tokenizer, word, match):
tokens = mk_tokenizer(word)
assert len(tokens) == 1
assert tokens[0].like_num == match
@pytest.mark.parametrize(
"word",
[
"двесте",
"два-три",
"пет-шест"
]
)
def test_mk_lex_attrs_capitals(word):
assert like_num(word)
assert like_num(word.upper())
@pytest.mark.parametrize(
"word",
[
"првиот",
"втора",
"четврт",
"четвртата",
"петти",
"петто",
"стоти",
"шеесетите",
"седумдесетите"
]
)
def test_mk_lex_attrs_like_number_for_ordinal(word):
assert like_num(word)

View File

@ -2,6 +2,27 @@ import pytest
from spacy.lang.tr.lex_attrs import like_num
def test_tr_tokenizer_handles_long_text(tr_tokenizer):
text = """Pamuk nasıl ipliğe dönüştürülür?
Sıkıştırılmış balyalar halindeki pamuk, iplik fabrikasına getirildiğinde hem
lifleri birbirine dolaşmıştır, hem de tarladan toplanırken araya bitkinin
parçaları karışmıştır. Üstelik balyalardaki pamuğun cinsi aynı olsa bile kalitesi
değişeceğinden, önce bütün balyaların birbirine karıştırılarak harmanlanması gerekir.
Daha sonra pamuk yığınları, liflerin açılıp temizlenmesi için tek bir birim halinde
birleştirilmiş çeşitli makinelerden geçirilir.Bunlardan biri, dönen tokmaklarıyla
pamuğu dövüp kabartarak dağınık yumaklar haline getiren ve liflerin arasındaki yabancı
maddeleri temizleyen hallaç makinesidir. Daha sonra tarak makinesine giren pamuk demetleri,
herbirinin yüzeyinde yüzbinlerce incecik iğne bulunan döner silindirlerin arasından geçerek lif lif ayrılır
ve tül inceliğinde gevşek bir örtüye dönüşür. Ama bir sonraki makine bu lifleri dağınık
ve gevşek bir biçimde birbirine yaklaştırarak 2 cm eninde bir pamuk şeridi haline getirir."""
tokens = tr_tokenizer(text)
assert len(tokens) == 146
@pytest.mark.parametrize(
"word",
[

View File

@ -0,0 +1,152 @@
import pytest
ABBREV_TESTS = [
("Dr. Murat Bey ile görüştüm.", ["Dr.", "Murat", "Bey", "ile", "görüştüm", "."]),
("Dr.la görüştüm.", ["Dr.la", "görüştüm", "."]),
("Dr.'la görüştüm.", ["Dr.'la", "görüştüm", "."]),
("TBMM'de çalışıyormuş.", ["TBMM'de", "çalışıyormuş", "."]),
("Hem İst. hem Ank. bu konuda gayet iyi durumda.", ["Hem", "İst.", "hem", "Ank.", "bu", "konuda", "gayet", "iyi", "durumda", "."]),
("Hem İst. hem Ank.'da yağış var.", ["Hem", "İst.", "hem", "Ank.'da", "yağış", "var", "."]),
("Dr.", ["Dr."]),
("Yrd.Doç.", ["Yrd.Doç."]),
("Prof.'un", ["Prof.'un"]),
("Böl.'nde", ["Böl.'nde"]),
]
URL_TESTS = [
("Bizler de www.duygu.com.tr adında bir websitesi kurduk.", ["Bizler", "de", "www.duygu.com.tr", "adında", "bir", "websitesi", "kurduk", "."]),
("Bizler de https://www.duygu.com.tr adında bir websitesi kurduk.", ["Bizler", "de", "https://www.duygu.com.tr", "adında", "bir", "websitesi", "kurduk", "."]),
("Bizler de www.duygu.com.tr'dan satın aldık.", ["Bizler", "de", "www.duygu.com.tr'dan", "satın", "aldık", "."]),
("Bizler de https://www.duygu.com.tr'dan satın aldık.", ["Bizler", "de", "https://www.duygu.com.tr'dan", "satın", "aldık", "."]),
]
NUMBER_TESTS = [
("Rakamla 6 yazılıydı.", ["Rakamla", "6", "yazılıydı", "."]),
("Hava -4 dereceydi.", ["Hava", "-4", "dereceydi", "."]),
("Hava sıcaklığı -4ten +6ya yükseldi.", ["Hava", "sıcaklığı", "-4ten", "+6ya", "yükseldi", "."]),
("Hava sıcaklığı -4'ten +6'ya yükseldi.", ["Hava", "sıcaklığı", "-4'ten", "+6'ya", "yükseldi", "."]),
("Yarışta 6. oldum.", ["Yarışta", "6.", "oldum", "."]),
("Yarışta 438547745. oldum.", ["Yarışta", "438547745.", "oldum", "."]),
("Kitap IV. Murat hakkında.",["Kitap", "IV.", "Murat", "hakkında", "."]),
#("Bana söylediği sayı 6.", ["Bana", "söylediği", "sayı", "6", "."]),
("Saat 6'da buluşalım.", ["Saat", "6'da", "buluşalım", "."]),
("Saat 6dan sonra buluşalım.", ["Saat", "6dan", "sonra", "buluşalım", "."]),
("6.dan sonra saymadım.", ["6.dan", "sonra", "saymadım", "."]),
("6.'dan sonra saymadım.", ["6.'dan", "sonra", "saymadım", "."]),
("Saat 6'ydı.", ["Saat", "6'ydı", "."]),
("5'te", ["5'te"]),
("6'da", ["6'da"]),
("9dan", ["9dan"]),
("19'da", ["19'da"]),
("VI'da", ["VI'da"]),
("5.", ["5."]),
("72.", ["72."]),
("VI.", ["VI."]),
("6.'dan", ["6.'dan"]),
("19.'dan", ["19.'dan"]),
("6.dan", ["6.dan"]),
("16.dan", ["16.dan"]),
("VI.'dan", ["VI.'dan"]),
("VI.dan", ["VI.dan"]),
("Hepsi 1994 yılında oldu.", ["Hepsi", "1994", "yılında", "oldu", "."]),
("Hepsi 1994'te oldu.", ["Hepsi", "1994'te", "oldu", "."]),
("2/3 tarihli faturayı bulamadım.", ["2/3", "tarihli", "faturayı", "bulamadım", "."]),
("2.3 tarihli faturayı bulamadım.", ["2.3", "tarihli", "faturayı", "bulamadım", "."]),
("2.3. tarihli faturayı bulamadım.", ["2.3.", "tarihli", "faturayı", "bulamadım", "."]),
("2/3/2020 tarihli faturayı bulamadm.", ["2/3/2020", "tarihli", "faturayı", "bulamadm", "."]),
("2/3/1987 tarihinden beri burda yaşıyorum.", ["2/3/1987", "tarihinden", "beri", "burda", "yaşıyorum", "."]),
("2-3-1987 tarihinden beri burdayım.", ["2-3-1987", "tarihinden", "beri", "burdayım", "."]),
("2.3.1987 tarihinden beri burdayım.", ["2.3.1987", "tarihinden", "beri", "burdayım", "."]),
("Bu olay 2005-2006 tarihleri arasında oldu.", ["Bu", "olay", "2005", "-", "2006", "tarihleri", "arasında", "oldu", "."]),
("Bu olay 4/12/2005-21/3/2006 tarihleri arasında oldu.", ["Bu", "olay", "4/12/2005", "-", "21/3/2006", "tarihleri", "arasında", "oldu", ".",]),
("Ek fıkra: 5/11/2003-4999/3 maddesine göre uygundur.", ["Ek", "fıkra", ":", "5/11/2003", "-", "4999/3", "maddesine", "göre", "uygundur", "."]),
("2/A alanları: 6831 sayılı Kanunun 2nci maddesinin birinci fıkrasının (A) bendine göre", ["2/A", "alanları", ":", "6831", "sayılı", "Kanunun", "2nci", "maddesinin", "birinci", "fıkrasının", "(", "A", ")", "bendine", "göre"]),
("ŞEHİTTEĞMENKALMAZ Cad. No: 2/311", ["ŞEHİTTEĞMENKALMAZ", "Cad.", "No", ":", "2/311"]),
("2-3-2025", ["2-3-2025",]),
("2/3/2025", ["2/3/2025"]),
("Yıllardır 0.5 uç kullanıyorum.", ["Yıllardır", "0.5", "uç", "kullanıyorum", "."]),
("Kan değerlerim 0.5-0.7 arasıydı.", ["Kan", "değerlerim", "0.5", "-", "0.7", "arasıydı", "."]),
("0.5", ["0.5"]),
("1/2", ["1/2"]),
("%1", ["%", "1"]),
("%1lik", ["%", "1lik"]),
("%1'lik", ["%", "1'lik"]),
("%1lik dilim", ["%", "1lik", "dilim"]),
("%1'lik dilim", ["%", "1'lik", "dilim"]),
("%1.5", ["%", "1.5"]),
#("%1-%2 arası büyüme bekleniyor.", ["%", "1", "-", "%", "2", "arası", "büyüme", "bekleniyor", "."]),
("%1-2 arası büyüme bekliyoruz.", ["%", "1", "-", "2", "arası", "büyüme", "bekliyoruz", "."]),
("%11-12 arası büyüme bekliyoruz.", ["%", "11", "-", "12", "arası", "büyüme", "bekliyoruz", "."]),
("%1.5luk büyüme bekliyoruz.", ["%", "1.5luk", "büyüme", "bekliyoruz", "."]),
("Saat 1-2 arası gelin lütfen.", ["Saat", "1", "-", "2", "arası", "gelin", "lütfen", "."]),
("Saat 15:30 gibi buluşalım.", ["Saat", "15:30", "gibi", "buluşalım", "."]),
("Saat 15:30'da buluşalım.", ["Saat", "15:30'da", "buluşalım", "."]),
("Saat 15.30'da buluşalım.", ["Saat", "15.30'da", "buluşalım", "."]),
("Saat 15.30da buluşalım.", ["Saat", "15.30da", "buluşalım", "."]),
("Saat 15 civarı buluşalım.", ["Saat", "15", "civarı", "buluşalım", "."]),
("9daki otobüse binsek mi?", ["9daki", "otobüse", "binsek", "mi", "?"]),
("Okulumuz 3-B şubesi", ["Okulumuz", "3-B", "şubesi"]),
("Okulumuz 3/B şubesi", ["Okulumuz", "3/B", "şubesi"]),
("Okulumuz 3B şubesi", ["Okulumuz", "3B", "şubesi"]),
("Okulumuz 3b şubesi", ["Okulumuz", "3b", "şubesi"]),
("Antonio Gaudí 20. yüzyılda, 1904-1914 yılları arasında on yıl süren bir reform süreci getirmiştir.", ["Antonio", "Gaudí", "20.", "yüzyılda", ",", "1904", "-", "1914", "yılları", "arasında", "on", "yıl", "süren", "bir", "reform", "süreci", "getirmiştir", "."]),
("Dizel yakıtın avro bölgesi ortalaması olan 1,165 avroya kıyasla litre başına 1,335 avroya mal olduğunu gösteriyor.", ["Dizel", "yakıtın", "avro", "bölgesi", "ortalaması", "olan", "1,165", "avroya", "kıyasla", "litre", "başına", "1,335", "avroya", "mal", "olduğunu", "gösteriyor", "."]),
("Marcus Antonius M.Ö. 1 Ocak 49'da, Sezar'dan Vali'nin kendisini barış dostu ilan ettiği bir bildiri yayınlamıştır.", ["Marcus", "Antonius", "M.Ö.", "1", "Ocak", "49'da", ",", "Sezar'dan", "Vali'nin", "kendisini", "barış", "dostu", "ilan", "ettiği", "bir", "bildiri", "yayınlamıştır", "."])
]
PUNCT_TESTS = [
("Gitmedim dedim ya!", ["Gitmedim", "dedim", "ya", "!"]),
("Gitmedim dedim ya!!", ["Gitmedim", "dedim", "ya", "!", "!"]),
("Gitsek mi?", ["Gitsek", "mi", "?"]),
("Gitsek mi??", ["Gitsek", "mi", "?", "?"]),
("Gitsek mi?!?", ["Gitsek", "mi", "?", "!", "?"]),
("Ankara - Antalya arası otobüs işliyor.", ["Ankara", "-", "Antalya", "arası", "otobüs", "işliyor", "."]),
("Ankara-Antalya arası otobüs işliyor.", ["Ankara", "-", "Antalya", "arası", "otobüs", "işliyor", "."]),
("Sen--ben, ya da onlar.", ["Sen", "--", "ben", ",", "ya", "da", "onlar", "."]),
("Senden, benden, bizden şarkısını biliyor musun?", ["Senden", ",", "benden", ",", "bizden", "şarkısını", "biliyor", "musun", "?"]),
("Akif'le geldik, sonra da o ayrıldı.", ["Akif'le", "geldik", ",", "sonra", "da", "o", "ayrıldı", "."]),
("Bu adam ne dedi şimdi???", ["Bu", "adam", "ne", "dedi", "şimdi", "?", "?", "?"]),
("Yok hasta olmuş, yok annesi hastaymış, bahaneler işte...", ["Yok", "hasta", "olmuş", ",", "yok", "annesi", "hastaymış", ",", "bahaneler", "işte", "..."]),
("Ankara'dan İstanbul'a ... bir aşk hikayesi.", ["Ankara'dan", "İstanbul'a", "...", "bir", "aşk", "hikayesi", "."]),
("Ahmet'te", ["Ahmet'te"]),
("İstanbul'da", ["İstanbul'da"]),
]
GENERAL_TESTS = [
("1914'teki Endurance seferinde, Sir Ernest Shackleton'ın kaptanlığını yaptığı İngiliz Endurance gemisi yirmi sekiz kişi ile Antarktika'yı geçmek üzere yelken açtı.", ["1914'teki", "Endurance", "seferinde", ",", "Sir", "Ernest", "Shackleton'ın", "kaptanlığını", "yaptığı", "İngiliz", "Endurance", "gemisi", "yirmi", "sekiz", "kişi", "ile", "Antarktika'yı", "geçmek", "üzere", "yelken", "açtı", "."]),
("Danışılan \"%100 Cospedal\" olduğunu belirtti.", ["Danışılan", '"', "%", "100", "Cospedal", '"', "olduğunu", "belirtti", "."]),
("1976'da parkur artık kullanılmıyordu; 1990'da ise bir yangın, daha sonraları ahırlarla birlikte yıkılacak olan tahta tribünlerden geri kalanları da yok etmişti.", ["1976'da", "parkur", "artık", "kullanılmıyordu", ";", "1990'da", "ise", "bir", "yangın", ",", "daha", "sonraları", "ahırlarla", "birlikte", "yıkılacak", "olan", "tahta", "tribünlerden", "geri", "kalanları", "da", "yok", "etmişti", "."]),
("Dahiyane bir ameliyat ve zorlu bir rehabilitasyon sürecinden sonra, tamamen iyileştim.", ["Dahiyane", "bir", "ameliyat", "ve", "zorlu", "bir", "rehabilitasyon", "sürecinden", "sonra", ",", "tamamen", "iyileştim", "."]),
("Yaklaşık iki hafta süren bireysel erken oy kullanma döneminin ardından 5,7 milyondan fazla Floridalı sandık başına gitti.", ["Yaklaşık", "iki", "hafta", "süren", "bireysel", "erken", "oy", "kullanma", "döneminin", "ardından", "5,7", "milyondan", "fazla", "Floridalı", "sandık", "başına", "gitti", "."]),
("Ancak, bu ABD Çevre Koruma Ajansı'nın dünyayı bu konularda uyarmasının ardından ortaya çıktı.", ["Ancak", ",", "bu", "ABD", "Çevre", "Koruma", "Ajansı'nın", "dünyayı", "bu", "konularda", "uyarmasının", "ardından", "ortaya", "çıktı", "."]),
("Ortalama şansa ve 10.000 Sterlin değerinde tahvillere sahip bir yatırımcı yılda 125 Sterlin ikramiye kazanabilir.", ["Ortalama", "şansa", "ve", "10.000", "Sterlin", "değerinde", "tahvillere", "sahip", "bir", "yatırımcı", "yılda", "125", "Sterlin", "ikramiye", "kazanabilir", "."]),
("Granit adaları; Seyşeller ve Tioman ile Saint Helena gibi volkanik adaları kapsar." , ["Granit", "adaları", ";", "Seyşeller", "ve", "Tioman", "ile", "Saint", "Helena", "gibi", "volkanik", "adaları", "kapsar", "."]),
("Barış antlaşmasıyla İspanya, Amerika'ya Porto Riko, Guam ve Filipinler kolonilerini devretti.", ["Barış", "antlaşmasıyla", "İspanya", ",", "Amerika'ya", "Porto", "Riko", ",", "Guam", "ve", "Filipinler", "kolonilerini", "devretti", "."]),
("Makedonya\'nın sınır bölgelerini güvence altına alan Philip, büyük bir Makedon ordusu kurdu ve uzun bir fetih seferi için Trakya\'ya doğru yürüdü.", ["Makedonya\'nın", "sınır", "bölgelerini", "güvence", "altına", "alan", "Philip", ",", "büyük", "bir", "Makedon", "ordusu", "kurdu", "ve", "uzun", "bir", "fetih", "seferi", "için", "Trakya\'ya", "doğru", "yürüdü", "."]),
("Fransız gazetesi Le Figaro'ya göre bu hükumet planı sayesinde 42 milyon Euro kazanç sağlanabilir ve elde edilen paranın 15.5 milyonu ulusal güvenlik için kullanılabilir.", ["Fransız", "gazetesi", "Le", "Figaro'ya", "göre", "bu", "hükumet", "planı", "sayesinde", "42", "milyon", "Euro", "kazanç", "sağlanabilir", "ve", "elde", "edilen", "paranın", "15.5", "milyonu", "ulusal", "güvenlik", "için", "kullanılabilir", "."]),
("Ortalama şansa ve 10.000 Sterlin değerinde tahvillere sahip bir yatırımcı yılda 125 Sterlin ikramiye kazanabilir.", ["Ortalama", "şansa", "ve", "10.000", "Sterlin", "değerinde", "tahvillere", "sahip", "bir", "yatırımcı", "yılda", "125", "Sterlin", "ikramiye", "kazanabilir", "."]),
("3 Kasım Salı günü, Ankara Belediye Başkanı 2014'te hükümetle birlikte oluşturulan kentsel gelişim anlaşmasını askıya alma kararı verdi.", ["3", "Kasım", "Salı", "günü", ",", "Ankara", "Belediye", "Başkanı", "2014'te", "hükümetle", "birlikte", "oluşturulan", "kentsel", "gelişim", "anlaşmasını", "askıya", "alma", "kararı", "verdi", "."]),
("Stalin, Abakumov'u Beria'nın enerji bakanlıkları üzerindeki baskınlığına karşı MGB içinde kendi ağını kurmaya teşvik etmeye başlamıştı.", ["Stalin", ",", "Abakumov'u", "Beria'nın", "enerji", "bakanlıkları", "üzerindeki", "baskınlığına", "karşı", "MGB", "içinde", "kendi", "ağını", "kurmaya", "teşvik", "etmeye", "başlamıştı", "."]),
("Güney Avrupa'daki kazı alanlarının çoğunluğu gibi, bu bulgu M.Ö. 5. yüzyılın başlar", ["Güney", "Avrupa'daki", "kazı", "alanlarının", "çoğunluğu", "gibi", ",", "bu", "bulgu", "M.Ö.", "5.", "yüzyılın", "başlar"]),
("Sağlığın bozulması Hitchcock hayatının son yirmi yılında üretimini azalttı.", ["Sağlığın", "bozulması", "Hitchcock", "hayatının", "son", "yirmi", "yılında", "üretimini", "azalttı", "."]),
]

TESTS = (ABBREV_TESTS + URL_TESTS + NUMBER_TESTS + PUNCT_TESTS + GENERAL_TESTS)


@pytest.mark.parametrize("text,expected_tokens", TESTS)
def test_tr_tokenizer_handles_allcases(tr_tokenizer, text, expected_tokens):
    tokens = tr_tokenizer(text)
    token_list = [token.text for token in tokens if not token.is_space]
    print(token_list)
    assert expected_tokens == token_list

View File

@ -457,6 +457,7 @@ def test_attr_pipeline_checks(en_vocab):
([{"IS_LEFT_PUNCT": True}], "``"),
([{"IS_RIGHT_PUNCT": True}], "''"),
([{"IS_STOP": True}], "the"),
([{"SPACY": True}], "the"),
([{"LIKE_NUM": True}], "1"),
([{"LIKE_URL": True}], "http://example.com"),
([{"LIKE_EMAIL": True}], "mail@example.com"),

View File

@ -4,7 +4,9 @@ from pathlib import Path
def test_build_dependencies():
# Check that library requirements are pinned exactly the same across different setup files.
# TODO: correct checks for numpy rather than ignoring
libs_ignore_requirements = [
"numpy",
"pytest",
"pytest-timeout",
"mock",
@ -12,6 +14,7 @@ def test_build_dependencies():
]
# ignore language-specific packages that shouldn't be installed by all
libs_ignore_setup = [
"numpy",
"fugashi",
"natto-py",
"pythainlp",
@ -67,7 +70,7 @@ def test_build_dependencies():
line = line.strip().strip(",").strip('"')
if not line.startswith("#"):
lib, v = _parse_req(line)
if lib:
if lib and lib not in libs_ignore_requirements:
req_v = req_dict.get(lib, None)
assert (lib + v) == (lib + req_v), (
"{} has different version in pyproject.toml and in requirements.txt: "

View File

@ -7,6 +7,7 @@ from spacy.tokens import Doc
from spacy.pipeline._parser_internals.nonproj import projectivize
from spacy.pipeline._parser_internals.arc_eager import ArcEager
from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
from spacy.pipeline._parser_internals.stateclass import StateClass
def get_sequence_costs(M, words, heads, deps, transitions):
@ -47,15 +48,24 @@ def test_oracle_four_words(arc_eager, vocab):
for dep in deps:
arc_eager.add_action(2, dep) # Left
arc_eager.add_action(3, dep) # Right
actions = ["L-left", "B-ROOT", "L-left"]
actions = ["S", "L-left", "B-ROOT", "S", "D", "S", "L-left", "S", "D"]
state, cost_history = get_sequence_costs(arc_eager, words, heads, deps, actions)
expected_gold = [
["S"],
["B-ROOT", "L-left"],
["B-ROOT"],
["S"],
["D"],
["S"],
["L-left"],
["S"],
["D"]
]
assert state.is_final()
for i, state_costs in enumerate(cost_history):
# Check gold moves is 0 cost
assert state_costs[actions[i]] == 0.0, actions[i]
for other_action, cost in state_costs.items():
if other_action != actions[i]:
assert cost >= 1, (i, other_action)
golds = [act for act, cost in state_costs.items() if cost < 1]
assert golds == expected_gold[i], (i, golds, expected_gold[i])
annot_tuples = [
@ -169,12 +179,15 @@ def test_oracle_dev_sentence(vocab, arc_eager):
. punct said
"""
expected_transitions = [
"S", # Shift "Rolls-Royce"
"S", # Shift 'Motor'
"S", # Shift 'Cars'
"L-nn", # Attach 'Cars' to 'Inc.'
"L-nn", # Attach 'Motor' to 'Inc.'
"L-nn", # Attach 'Rolls-Royce' to 'Inc.', force shift
"L-nn", # Attach 'Rolls-Royce' to 'Inc.'
"S", # Shift "Inc."
"L-nsubj", # Attach 'Inc.' to 'said'
"S", # Shift 'said'
"S", # Shift 'it'
"L-nsubj", # Attach 'it' to 'expects'
"R-ccomp", # Attach 'expects' to 'said'
@ -204,6 +217,8 @@ def test_oracle_dev_sentence(vocab, arc_eager):
"D", # Reduce "steady"
"D", # Reduce "expects"
"R-punct", # Attach "." to "said"
"D", # Reduce "."
"D", # Reduce "said"
]
gold_words = []
@ -221,10 +236,40 @@ def test_oracle_dev_sentence(vocab, arc_eager):
for dep in gold_deps:
arc_eager.add_action(2, dep) # Left
arc_eager.add_action(3, dep) # Right
doc = Doc(Vocab(), words=gold_words)
example = Example.from_dict(doc, {"heads": gold_heads, "deps": gold_deps})
ae_oracle_actions = arc_eager.get_oracle_sequence(example)
ae_oracle_actions = arc_eager.get_oracle_sequence(example, _debug=False)
ae_oracle_actions = [arc_eager.get_class_name(i) for i in ae_oracle_actions]
assert ae_oracle_actions == expected_transitions


def test_oracle_bad_tokenization(vocab, arc_eager):
    words_deps_heads = """
        [catalase] dep is
        : punct is
        that nsubj is
        is root is
        bad comp is
    """

    gold_words = []
    gold_deps = []
    gold_heads = []
    for line in words_deps_heads.strip().split("\n"):
        line = line.strip()
        if not line:
            continue
        word, dep, head = line.split()
        gold_words.append(word)
        gold_deps.append(dep)
        gold_heads.append(head)
    gold_heads = [gold_words.index(head) for head in gold_heads]
    for dep in gold_deps:
        arc_eager.add_action(2, dep)  # Left
        arc_eager.add_action(3, dep)  # Right
    reference = Doc(Vocab(), words=gold_words, deps=gold_deps, heads=gold_heads)
    predicted = Doc(reference.vocab, words=["[", "catalase", "]", ":", "that", "is", "bad"])
    example = Example(predicted=predicted, reference=reference)
    ae_oracle_actions = arc_eager.get_oracle_sequence(example, _debug=False)
    ae_oracle_actions = [arc_eager.get_class_name(i) for i in ae_oracle_actions]
    assert ae_oracle_actions
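
Illustrative note, not part of this commit: the oracle tests above list transition names such as "S", "L-nsubj", "R-ccomp" and "D". The sketch below replays such a sequence with a textbook arc-eager interpretation (Shift, Left-arc, Right-arc, Reduce). It is an assumption-laden toy: `run_arc_eager` is a hypothetical helper, and spaCy's `ArcEager` has additional moves (for example the Break move "B-ROOT") and constraints that are not modelled here.

```python
def run_arc_eager(n_tokens, actions):
    """Replay textbook arc-eager moves and return {child: (head, label)}.

    Hypothetical helper for illustration only; not spaCy's ArcEager.
    """
    stack = []
    buffer = list(range(n_tokens))
    arcs = {}
    for action in actions:
        move, _, label = action.partition("-")
        if move == "S":
            # Shift: move the first buffer token onto the stack.
            stack.append(buffer.pop(0))
        elif move == "L":
            # Left-arc: stack top becomes a child of the first buffer token.
            child = stack.pop()
            arcs[child] = (buffer[0], label)
        elif move == "R":
            # Right-arc: first buffer token becomes a child of the stack top,
            # and is pushed so it can take its own dependents.
            child = buffer.pop(0)
            arcs[child] = (stack[-1], label)
            stack.append(child)
        elif move == "D":
            # Reduce: pop a token that already has its head.
            stack.pop()
        else:
            raise ValueError(f"Unknown move: {action}")
    return arcs


# "Rats bite things": 'bite' is the root, 'Rats' its subject, 'things' its object.
arcs = run_arc_eager(3, ["S", "L-nsubj", "S", "R-dobj", "D"])
print(arcs)
```

Each token index maps to its head index and label, which is the same information the tests pass in as `heads` and `deps`.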

View File

@ -54,7 +54,7 @@ def tsys(vocab, entity_types):
def test_get_oracle_moves(tsys, doc, entity_annots):
example = Example.from_dict(doc, {"entities": entity_annots})
act_classes = tsys.get_oracle_sequence(example)
act_classes = tsys.get_oracle_sequence(example, _debug=False)
names = [tsys.get_class_name(act) for act in act_classes]
assert names == ["U-PERSON", "O", "O", "B-GPE", "L-GPE", "O"]

View File

@ -0,0 +1,144 @@
# coding: utf8
from __future__ import unicode_literals

import pytest
import hypothesis
import hypothesis.strategies
import numpy
from spacy.vocab import Vocab
from spacy.language import Language
from spacy.pipeline import DependencyParser
from spacy.pipeline._parser_internals.arc_eager import ArcEager
from spacy.tokens import Doc
from spacy.pipeline._parser_internals._beam_utils import BeamBatch
from spacy.pipeline._parser_internals.stateclass import StateClass
from spacy.training import Example
from thinc.tests.strategies import ndarrays_of_shape


@pytest.fixture(scope="module")
def vocab():
    return Vocab()


@pytest.fixture(scope="module")
def moves(vocab):
    aeager = ArcEager(vocab.strings, {})
    aeager.add_action(0, "")
    aeager.add_action(1, "")
    aeager.add_action(2, "nsubj")
    aeager.add_action(2, "punct")
    aeager.add_action(2, "aux")
    aeager.add_action(2, "nsubjpass")
    aeager.add_action(3, "dobj")
    aeager.add_action(2, "aux")
    aeager.add_action(4, "ROOT")
    return aeager


@pytest.fixture(scope="module")
def docs(vocab):
    return [
        Doc(
            vocab,
            words=["Rats", "bite", "things"],
            heads=[1, 1, 1],
            deps=["nsubj", "ROOT", "dobj"],
            sent_starts=[True, False, False]
        )
    ]


@pytest.fixture(scope="module")
def examples(docs):
    return [Example(doc, doc.copy()) for doc in docs]


@pytest.fixture
def states(docs):
    return [StateClass(doc) for doc in docs]


@pytest.fixture
def tokvecs(docs, vector_size):
    output = []
    for doc in docs:
        vec = numpy.random.uniform(-0.1, 0.1, (len(doc), vector_size))
        output.append(numpy.asarray(vec))
    return output


@pytest.fixture(scope="module")
def batch_size(docs):
    return len(docs)


@pytest.fixture(scope="module")
def beam_width():
    return 4


@pytest.fixture(params=[0.0, 0.5, 1.0])
def beam_density(request):
    return request.param


@pytest.fixture
def vector_size():
    return 6


@pytest.fixture
def beam(moves, examples, beam_width):
    states, golds, _ = moves.init_gold_batch(examples)
    return BeamBatch(moves, states, golds, width=beam_width, density=0.0)


@pytest.fixture
def scores(moves, batch_size, beam_width):
    return numpy.asarray(
        numpy.concatenate(
            [
                numpy.random.uniform(-0.1, 0.1, (beam_width, moves.n_moves))
                for _ in range(batch_size)
            ]
        ), dtype="float32")


def test_create_beam(beam):
    pass


def test_beam_advance(beam, scores):
    beam.advance(scores)


def test_beam_advance_too_few_scores(beam, scores):
    n_state = sum(len(beam) for beam in beam)
    scores = scores[:n_state]
    with pytest.raises(IndexError):
        beam.advance(scores[:-1])


def test_beam_parse(examples, beam_width):
    nlp = Language()
    parser = nlp.add_pipe("beam_parser")
    parser.cfg["beam_width"] = beam_width
    parser.add_label("nsubj")
    parser.initialize(lambda: examples)
    doc = nlp.make_doc("Australia is a country")
    parser(doc)


@hypothesis.given(hyp=hypothesis.strategies.data())
def test_beam_density(moves, examples, beam_width, hyp):
    beam_density = float(hyp.draw(hypothesis.strategies.floats(0.0, 1.0, width=32)))
    states, golds, _ = moves.init_gold_batch(examples)
    beam = BeamBatch(moves, states, golds, width=beam_width, density=beam_density)
    n_state = sum(len(beam) for beam in beam)
    scores = hyp.draw(ndarrays_of_shape((n_state, moves.n_moves)))
    beam.advance(scores)
    for b in beam:
        beam_probs = b.probs
        assert b.min_density == beam_density
        assert beam_probs[-1] >= beam_probs[0] * beam_density
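
Illustrative note, not part of this commit: the last assertion in `test_beam_density` checks that the weakest surviving candidate is at least `beam_density` times as probable as the best one. A minimal sketch of a pruning rule that would satisfy this invariant is shown below. The function `prune_by_density` is hypothetical and only reflects one reading of the assertion; it makes no claim about how `BeamBatch` is actually implemented.

```python
def prune_by_density(probs, width, density):
    """Keep at most `width` candidates whose probability is at least
    `density` times the probability of the best candidate.

    Hypothetical helper for illustration only; not spaCy's BeamBatch logic.
    """
    ranked = sorted(probs, reverse=True)[:width]
    if not ranked:
        return []
    threshold = ranked[0] * density
    return [p for p in ranked if p >= threshold]


kept = prune_by_density([0.5, 0.3, 0.1, 0.05, 0.05], width=4, density=0.5)
print(kept)

# The same invariant the test asserts on `b.probs`:
assert kept[-1] >= kept[0] * 0.5
```

With `density=0.0` only the width limit applies, and with `density=1.0` only candidates tied with the best one survive.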

View File

@ -22,6 +22,7 @@ def _parser_example(parser):
@pytest.fixture
def parser(vocab):
vocab.strings.add("ROOT")
config = {
"learn_tokens": False,
"min_action_freq": 30,
@ -76,13 +77,16 @@ def test_sents_1_2(parser):
def test_sents_1_3(parser):
doc = Doc(parser.vocab, words=["a", "b", "c", "d"])
doc[1].sent_start = True
doc[3].sent_start = True
doc[0].is_sent_start = True
doc[1].is_sent_start = True
doc[2].is_sent_start = None
doc[3].is_sent_start = True
doc = parser(doc)
assert len(list(doc.sents)) >= 3
doc = Doc(parser.vocab, words=["a", "b", "c", "d"])
doc[1].sent_start = True
doc[2].sent_start = False
doc[3].sent_start = True
doc[0].is_sent_start = True
doc[1].is_sent_start = True
doc[2].is_sent_start = False
doc[3].is_sent_start = True
doc = parser(doc)
assert len(list(doc.sents)) == 3

View File

@ -0,0 +1,74 @@
import pytest

from spacy.tokens.doc import Doc
from spacy.vocab import Vocab
from spacy.pipeline._parser_internals.stateclass import StateClass


@pytest.fixture
def vocab():
    return Vocab()


@pytest.fixture
def doc(vocab):
    return Doc(vocab, words=["a", "b", "c", "d"])


def test_init_state(doc):
    state = StateClass(doc)
    assert state.stack == []
    assert state.queue == list(range(len(doc)))
    assert not state.is_final()
    assert state.buffer_length() == 4


def test_push_pop(doc):
    state = StateClass(doc)
    state.push()
    assert state.buffer_length() == 3
    assert state.stack == [0]
    assert 0 not in state.queue
    state.push()
    assert state.stack == [1, 0]
    assert 1 not in state.queue
    assert state.buffer_length() == 2
    state.pop()
    assert state.stack == [0]
    assert 1 not in state.queue


def test_stack_depth(doc):
    state = StateClass(doc)
    assert state.stack_depth() == 0
    assert state.buffer_length() == len(doc)
    state.push()
    assert state.buffer_length() == 3
    assert state.stack_depth() == 1


def test_H(doc):
    state = StateClass(doc)
    assert state.H(0) == -1
    state.add_arc(1, 0, 0)
    assert state.arcs == [{"head": 1, "child": 0, "label": 0}]
    assert state.H(0) == 1
    state.add_arc(3, 1, 0)
    assert state.H(1) == 3


def test_L(doc):
    state = StateClass(doc)
    assert state.L(2, 1) == -1
    state.add_arc(2, 1, 0)
    assert state.arcs == [{"head": 2, "child": 1, "label": 0}]
    assert state.L(2, 1) == 1
    state.add_arc(2, 0, 0)
    assert state.L(2, 1) == 0
    assert state.n_L(2) == 2


def test_R(doc):
    state = StateClass(doc)
    assert state.R(0, 1) == -1
    state.add_arc(0, 1, 0)
    assert state.arcs == [{"head": 0, "child": 1, "label": 0}]
    assert state.R(0, 1) == 1
    state.add_arc(0, 2, 0)
    assert state.R(0, 1) == 2
    assert state.n_R(0) == 2
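
Illustrative note, not part of this commit: the tests above pin down how the parser state behaves, namely that `push` moves the first queued token onto the stack, the stack is reported top-first, `pop` discards the top without returning it to the queue, and `H` returns -1 for a token with no head. The pure-Python class below mirrors only those observable behaviours. `ToyState` is a hypothetical stand-in written for this note and is not how `StateClass` is implemented.

```python
class ToyState:
    """Toy stand-in for a transition-parser state (not spaCy's StateClass)."""

    def __init__(self, n_tokens):
        self.stack = []                     # top of the stack is element 0
        self.queue = list(range(n_tokens))  # tokens still waiting in the buffer
        self.heads = {}                     # child index -> head index

    def push(self):
        # Move the first buffer token onto the top of the stack.
        self.stack.insert(0, self.queue.pop(0))

    def pop(self):
        # Discard the top of the stack; it does not go back to the buffer.
        return self.stack.pop(0)

    def add_arc(self, head, child):
        # Record that `child` is attached to `head` (labels omitted here).
        self.heads[child] = head

    def H(self, child):
        # Head of `child`, or -1 if no arc has been added for it.
        return self.heads.get(child, -1)


state = ToyState(4)
state.push()
state.push()
print(state.stack, state.queue)
state.pop()
state.add_arc(1, 0)
print(state.stack, state.H(0), state.H(1))
```

After two pushes the stack reads `[1, 0]`, matching the order asserted in `test_push_pop`.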

View File

@ -197,3 +197,21 @@ def test_entity_ruler_overlapping_spans(nlp):
doc = ruler(nlp.make_doc("foo bar baz"))
assert len(doc.ents) == 1
assert doc.ents[0].label_ == "FOOBAR"
@pytest.mark.parametrize("n_process", [1, 2])
def test_entity_ruler_multiprocessing(nlp, n_process):
texts = [
"I enjoy eating Pizza Hut pizza."
]
patterns = [
{"label": "FASTFOOD", "pattern": "Pizza Hut", "id": "1234"}
]
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)
for doc in nlp.pipe(texts, n_process=n_process):
for ent in doc.ents:
assert ent.ent_id_ == "1234"

View File

@ -1,4 +1,6 @@
import pytest
import logging
import mock
from spacy import util, registry
from spacy.lang.en import English
from spacy.lookups import Lookups
@ -54,9 +56,18 @@ def test_lemmatizer_config(nlp):
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "rule"})
nlp.initialize()
# warning if no POS assigned
doc = nlp.make_doc("coping")
logger = logging.getLogger("spacy")
with mock.patch.object(logger, "warn") as mock_warn:
doc = lemmatizer(doc)
mock_warn.assert_called_once()
# works with POS
doc = nlp.make_doc("coping")
doc[0].pos_ = "VERB"
assert doc[0].lemma_ == ""
doc[0].pos_ = "VERB"
doc = lemmatizer(doc)
doc = lemmatizer(doc)
assert doc[0].text == "coping"
assert doc[0].lemma_ == "cope"

View File

@ -116,3 +116,23 @@ def test_overfitting_IO():
no_batch_deps = [doc.to_array([MORPH]) for doc in [nlp(text) for text in texts]]
assert_equal(batch_deps_1, batch_deps_2)
assert_equal(batch_deps_1, no_batch_deps)
# Test without POS
nlp.remove_pipe("morphologizer")
nlp.add_pipe("morphologizer")
for example in train_examples:
for token in example.reference:
token.pos_ = ""
optimizer = nlp.initialize(get_examples=lambda: train_examples)
for i in range(50):
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
assert losses["morphologizer"] < 0.00001
# Test the trained model
test_text = "I like blue ham"
doc = nlp(test_text)
gold_morphs = ["Feat=N", "Feat=V", "", ""]
gold_pos_tags = ["", "", "", ""]
assert [str(t.morph) for t in doc] == gold_morphs
assert [t.pos_ for t in doc] == gold_pos_tags

Some files were not shown because too many files have changed in this diff