Mirror of https://github.com/explosion/spaCy.git

Commit f2ea3e3ea2: Merge branch 'master' into feature/nel-wiki
.github/contributors/ameyuuno.md (new file, vendored, 106 lines)
@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

   * you hereby assign to us joint ownership, and to the extent that such
   assignment is or becomes invalid, ineffective or unenforceable, you hereby
   grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
   royalty-free, unrestricted license to exercise all rights under those
   copyrights. This includes, at our option, the right to sublicense these same
   rights to third parties through multiple levels of sublicensees or other
   licensing arrangements;

   * you agree that each of us can do all things in relation to your
   contribution as if each of us were the sole owners, and if one of us makes
   a derivative work of your contribution, the one who makes the derivative
   work (or has it made) will be the sole owner of that derivative work;

   * you agree that you will not assert any moral rights in your contribution
   against us, our licensees or transferees;

   * you agree that we may register a copyright in your contribution and
   exercise all ownership rights associated with it; and

   * you agree that neither of us has any duty to consult with, obtain the
   consent of, pay or render an accounting to the other for any use or
   distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

   * make, have made, use, sell, offer to sell, import, and otherwise transfer
   your contribution in whole or in part, alone or in combination with or
   included in any product, work or materials arising out of the project to
   which your contribution was submitted, and

   * at our option, to sublicense these same rights to third parties through
   multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

   * each contribution that you submit is and shall be an original work of
   authorship and you can legally grant the rights set out in this SCA;

   * to the best of your knowledge, each contribution will not violate any
   third party's copyrights, trademarks, patents, or other intellectual
   property rights; and

   * each contribution shall be in compliance with U.S. export control laws and
   other applicable export and import laws. You agree to notify us if you
   become aware of any circumstance which would make any of the foregoing
   representations inaccurate in any respect. We may publicly disclose your
   participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an "x" in one of the applicable statements below. Please do NOT
mark both statements:

   * [x] I am signing on behalf of myself as an individual and no other person
   or entity, including my employer, has or will have rights with respect to my
   contributions.

   * [ ] I am signing on behalf of my employer or a legal entity and I have the
   actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Alexey Kim           |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2019-07-09           |
| GitHub username                | ameyuuno             |
| Website (optional)             | https://ameyuuno.io  |
.github/contributors/askhogan.md (new file, vendored, 106 lines)

Same standard spaCy contributor agreement text as above, except that the "us"
entity reads [ExplosionAI GmbH](https://explosion.ai/legal). Signed as an
individual.

## Contributor Details

| Field                          | Entry               |
|------------------------------- | ------------------- |
| Name                           | Patrick Hogan       |
| Company name (if applicable)   |                     |
| Title or role (if applicable)  |                     |
| Date                           | 7/7/2019            |
| GitHub username                | askhogan@gmail.com  |
| Website (optional)             |                     |
.github/contributors/khellan.md (new file, vendored, 106 lines)

Same standard spaCy contributor agreement text as above (ExplosionAI GmbH).
Signed as an individual.

## Contributor Details

| Field                          | Entry           |
|------------------------------- | --------------- |
| Name                           | Knut O. Hellan  |
| Company name (if applicable)   |                 |
| Title or role (if applicable)  |                 |
| Date                           | 02.07.2019      |
| GitHub username                | khellan         |
| Website (optional)             | knuthellan.com  |
.github/contributors/kognate.md (new file, vendored, 106 lines)

Same standard spaCy contributor agreement text as above (ExplosionAI GmbH).
Signed as an individual.

## Contributor Details

| Field                          | Entry            |
|------------------------------- | ---------------- |
| Name                           | Joshua B. Smith  |
| Company name (if applicable)   |                  |
| Title or role (if applicable)  |                  |
| Date                           | July 7, 2019     |
| GitHub username                | kognate          |
| Website (optional)             |                  |
.github/contributors/rokasramas.md (new file, vendored, 106 lines)

Same standard spaCy contributor agreement text as above (ExplosionAI GmbH).
Signed on behalf of an employer (TokenMill).

## Contributor Details

| Field                          | Entry                   |
|------------------------------- | ----------------------- |
| Name                           | Rokas Ramanauskas       |
| Company name (if applicable)   | TokenMill               |
| Title or role (if applicable)  | Software Engineer       |
| Date                           | 2019-07-02              |
| GitHub username                | rokasramas              |
| Website (optional)             | http://www.tokenmill.lt |
CITATION (10 lines)
@@ -1,6 +1,6 @@
-@ARTICLE{spacy2,
-  AUTHOR = {Honnibal, Matthew AND Montani, Ines},
-  TITLE = {spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing},
-  YEAR = {2017},
-  JOURNAL = {To appear}
+@unpublished{spacy2,
+  AUTHOR = {Honnibal, Matthew and Montani, Ines},
+  TITLE = {{spaCy 2}: Natural language understanding with {B}loom embeddings, convolutional neural networks and incremental parsing},
+  YEAR = {2017},
+  Note = {To appear}
 }
@@ -51,7 +51,6 @@ def filter_spans(spans):

 def extract_currency_relations(doc):
     # Merge entities and noun chunks into one token
-    seen_tokens = set()
     spans = list(doc.ents) + list(doc.noun_chunks)
     spans = filter_spans(spans)
     with doc.retokenize() as retokenizer:
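For context, the `filter_spans` helper named in the hunk header keeps the longest non-overlapping spans, which is why the per-call `seen_tokens` bookkeeping could be dropped. A minimal sketch of the idea (not necessarily spaCy's exact implementation):

def filter_spans(spans):
    # Prefer longer spans; drop any span that overlaps tokens already claimed
    sorted_spans = sorted(spans, key=lambda s: s.end - s.start, reverse=True)
    result = []
    seen_tokens = set()
    for span in sorted_spans:
        if span.start not in seen_tokens and span.end - 1 not in seen_tokens:
            result.append(span)
            seen_tokens.update(range(span.start, span.end))
    return result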
@@ -5,6 +5,7 @@ import plac
 import random
 import numpy
 import time
+import re
 from collections import Counter
 from pathlib import Path
 from thinc.v2v import Affine, Maxout

@@ -23,19 +24,39 @@ from .train import _load_pretrained_tok2vec

 @plac.annotations(
-    texts_loc=("Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the "
-               "key 'tokens'", "positional", None, str),
+    texts_loc=(
+        "Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the "
+        "key 'tokens'",
+        "positional",
+        None,
+        str,
+    ),
     vectors_model=("Name or path to spaCy model with vectors to learn from"),
     output_dir=("Directory to write models to on each epoch", "positional", None, str),
     width=("Width of CNN layers", "option", "cw", int),
     depth=("Depth of CNN layers", "option", "cd", int),
     embed_rows=("Number of embedding rows", "option", "er", int),
-    loss_func=("Loss function to use for the objective. Either 'L2' or 'cosine'", "option", "L", str),
+    loss_func=(
+        "Loss function to use for the objective. Either 'L2' or 'cosine'",
+        "option",
+        "L",
+        str,
+    ),
     use_vectors=("Whether to use the static vectors as input features", "flag", "uv"),
     dropout=("Dropout rate", "option", "d", float),
     batch_size=("Number of words per training batch", "option", "bs", int),
-    max_length=("Max words per example. Longer examples are discarded", "option", "xw", int),
-    min_length=("Min words per example. Shorter examples are discarded", "option", "nw", int),
+    max_length=(
+        "Max words per example. Longer examples are discarded",
+        "option",
+        "xw",
+        int,
+    ),
+    min_length=(
+        "Min words per example. Shorter examples are discarded",
+        "option",
+        "nw",
+        int,
+    ),
     seed=("Seed for random number generators", "option", "s", int),
     n_iter=("Number of iterations to pretrain", "option", "i", int),
     n_save_every=("Save model every X batches.", "option", "se", int),

@@ -45,6 +66,13 @@ from .train import _load_pretrained_tok2vec
         "t2v",
         Path,
     ),
+    epoch_start=(
+        "The epoch to start counting at. Only relevant when using '--init-tok2vec' and the given weight file has been "
+        "renamed. Prevents unintended overwriting of existing weight files.",
+        "option",
+        "es",
+        int
+    ),
 )
 def pretrain(
     texts_loc,

@@ -63,6 +91,7 @@ def pretrain(
     seed=0,
     n_save_every=None,
     init_tok2vec=None,
+    epoch_start=None,
 ):
     """
     Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components,

@@ -131,9 +160,29 @@ def pretrain(
     if init_tok2vec is not None:
         components = _load_pretrained_tok2vec(nlp, init_tok2vec)
         msg.text("Loaded pretrained tok2vec for: {}".format(components))
+        # Parse the epoch number from the given weight file
+        model_name = re.search(r"model\d+\.bin", str(init_tok2vec))
+        if model_name:
+            # Default weight file name so read epoch_start from it by cutting off 'model' and '.bin'
+            epoch_start = int(model_name.group(0)[5:][:-4]) + 1
+        else:
+            if not epoch_start:
+                msg.fail(
+                    "You have to use the '--epoch-start' argument when using a renamed weight file for "
+                    "'--init-tok2vec'", exits=True
+                )
+            elif epoch_start < 0:
+                msg.fail(
+                    "The argument '--epoch-start' has to be greater or equal to 0. '%d' is invalid" % epoch_start,
+                    exits=True
+                )
+    else:
+        # Without '--init-tok2vec' the '--epoch-start' argument is ignored
+        epoch_start = 0

     optimizer = create_default_optimizer(model.ops)
     tracker = ProgressTracker(frequency=10000)
-    msg.divider("Pre-training tok2vec layer")
+    msg.divider("Pre-training tok2vec layer - starting at epoch %d" % epoch_start)
     row_settings = {"widths": (3, 10, 10, 6, 4), "aligns": ("r", "r", "r", "r", "r")}
     msg.row(("#", "# Words", "Total Loss", "Loss", "w/s"), **row_settings)

@@ -154,7 +203,7 @@ def pretrain(
             file_.write(srsly.json_dumps(log) + "\n")

     skip_counter = 0
-    for epoch in range(n_iter):
+    for epoch in range(epoch_start, n_iter + epoch_start):
         for batch_id, batch in enumerate(
             util.minibatch_by_words(((text, None) for text in texts), size=batch_size)
         ):
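The resume logic above recovers the epoch number from the default weight-file name with a regex. A standalone sketch of just that parsing step (hypothetical helper and file names; the real code lives in the hunk above):

import re

def parse_epoch_start(init_tok2vec):
    # Default weight files are named like 'model123.bin'; resume at epoch 124
    match = re.search(r"model(\d+)\.bin", str(init_tok2vec))
    if match:
        return int(match.group(1)) + 1
    return None  # renamed file: the caller must pass --epoch-start explicitly

assert parse_epoch_start("output/model7.bin") == 8
assert parse_epoch_start("output/renamed-weights.bin") is None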
@@ -116,7 +116,7 @@ def parse_deps(orig_doc, options={}):
     doc (Doc): Document to parse.
     RETURNS (dict): Generated dependency parse keyed by words and arcs.
     """
-    doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes())
+    doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes(exclude=["user_data"]))
    if not doc.is_parsed:
        user_warning(Warnings.W005)
    if options.get("collapse_phrases", False):
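The `exclude=["user_data"]` change matters because `user_data` can hold values that do not survive serialization. A minimal sketch of the situation it presumably guards against (assumes a blank pipeline; the lambda stands in for any unserializable value):

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = nlp("hello world")
doc.user_data["callback"] = lambda: None  # not serializable

# to_bytes() without the exclude may fail on such user_data values;
# copying the Doc for rendering now skips them entirely:
copy = Doc(doc.vocab).from_bytes(doc.to_bytes(exclude=["user_data"]))
print(copy.text)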
@@ -537,6 +537,7 @@ for orth in [
     "Sen.",
     "St.",
     "vs.",
+    "v.s."
 ]:
     _exc[orth] = [{ORTH: orth}]
@@ -5,7 +5,7 @@ from __future__ import unicode_literals
 """
 Example sentences to test spaCy and its language models.

->>> from spacy.lang.en.examples import sentences
+>>> from spacy.lang.id.examples import sentences
 >>> docs = nlp.pipe(sentences)
 """
@@ -1,15 +1,37 @@
 # coding: utf8
 from __future__ import unicode_literals

 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .tag_map import TAG_MAP
 from .lemmatizer import LOOKUP
 from .morph_rules import MORPH_RULES

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG
+from ...attrs import LANG, NORM
 from ...util import update_exc, add_lookups


 def _return_lt(_):
     return "lt"


 class LithuanianDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
-    lex_attr_getters[LANG] = lambda text: "lt"
+    lex_attr_getters[LANG] = _return_lt
     lex_attr_getters[NORM] = add_lookups(
         Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
     )
     lex_attr_getters.update(LEX_ATTRS)

     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
     tag_map = TAG_MAP
     morph_rules = MORPH_RULES
     lemma_lookup = LOOKUP


 class Lithuanian(Language):
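Swapping the `lambda` for the module-level `_return_lt` presumably makes the language defaults picklable; a quick illustration of the difference:

import pickle

def _return_lt(_):
    return "lt"

pickle.dumps(_return_lt)  # fine: module-level functions pickle by reference

try:
    pickle.dumps(lambda text: "lt")
except (pickle.PicklingError, AttributeError) as err:
    print("lambda is not picklable:", err)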
spacy/lang/lt/examples.py (new file, 22 lines)
@@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.lt.examples import sentences
>>> docs = nlp.pipe(sentences)
"""


sentences = [
    "Jaunikis pirmąją vestuvinę naktį iškeitė į areštinės gultą",
    "Bepiločiai automobiliai išnaikins vairavimo mokyklas, autoservisus ir eismo nelaimes",
    "Vilniuje galvojama uždrausti naudoti skėčius",
    "Londonas yra didelis miestas Jungtinėje Karalystėje",
    "Kur tu?",
    "Kas yra Prancūzijos prezidentas?",
    "Kokia yra Jungtinių Amerikos Valstijų sostinė?",
    "Kada gimė Dalia Grybauskaitė?",
]
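With the language class, data files and tokenizer exceptions in place, Lithuanian can be smoke-tested end to end; a minimal sketch (assumes this branch is installed):

import spacy
from spacy.lang.lt.examples import sentences

nlp = spacy.blank("lt")
doc = nlp(sentences[4])  # "Kur tu?"
print([token.text for token in doc])

# Abbreviations from the new tokenizer_exceptions.py stay single tokens:
assert len(nlp("pvz.")) == 1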
spacy/lang/lt/lemmatizer.py (new file, 234227 lines): diff suppressed because it is too large
spacy/lang/lt/lex_attrs.py (new file, 1153 lines): diff suppressed because it is too large
spacy/lang/lt/morph_rules.py (new file, 3075 lines): diff suppressed because it is too large
(one additional file diff suppressed because it is too large; file name not shown in this view)
spacy/lang/lt/tag_map.py (new file, 4798 lines): diff suppressed because it is too large
spacy/lang/lt/tokenizer_exceptions.py (new file, 268 lines)
@@ -0,0 +1,268 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import ORTH

_exc = {}

for orth in [
    "G.", "J. E.", "J. Em.", "J.E.", "J.Em.", "K.", "N.", "V.",
    "Vt.", "a.", "a.k.", "a.s.", "adv.", "akad.", "aklg.", "akt.",
    "al.", "ang.", "angl.", "aps.", "apskr.", "apyg.", "arbat.", "asist.",
    "asm.", "asm.k.", "asmv.", "atk.", "atsak.", "atsisk.", "atsisk.sąsk.", "atv.",
    "aut.", "avd.", "b.k.", "baud.", "biol.", "bkl.", "bot.", "bt.",
    "buv.", "ch.", "chem.", "corp.", "d.", "dab.", "dail.", "dek.",
    "deš.", "dir.", "dirig.", "doc.", "dol.", "dr.", "drp.", "dvit.",
    "dėst.", "dš.", "dž.", "e.b.", "e.bankas", "e.p.", "e.parašas", "e.paštas",
    "e.v.", "e.valdžia", "egz.", "eil.", "ekon.", "el.", "el.bankas", "el.p.",
    "el.parašas", "el.paštas", "el.valdžia", "etc.", "ež.", "fak.", "faks.", "feat.",
    "filol.", "filos.", "g.", "gen.", "geol.", "gerb.", "gim.", "gr.",
    "gv.", "gyd.", "gyv.", "habil.", "inc.", "insp.", "inž.", "ir pan.",
    "ir t. t.", "isp.", "istor.", "it.", "just.", "k.", "k. a.", "k.a.",
    "kab.", "kand.", "kart.", "kat.", "ketv.", "kh.", "kl.", "kln.",
    "km.", "kn.", "koresp.", "kpt.", "kr.", "kt.", "kub.", "kun.",
    "kv.", "kyš.", "l. e. p.", "l.e.p.", "lenk.", "liet.", "lot.", "lt.",
    "ltd.", "ltn.", "m.", "m.e..", "m.m.", "mat.", "med.", "mgnt.",
    "mgr.", "min.", "mjr.", "ml.", "mln.", "mlrd.", "mob.", "mok.",
    "moksl.", "mokyt.", "mot.", "mr.", "mst.", "mstl.", "mėn.", "nkt.",
    "no.", "nr.", "ntk.", "nuotr.", "op.", "org.", "orig.", "p.",
    "p.d.", "p.m.e.", "p.s.", "pab.", "pan.", "past.", "pav.", "pavad.",
    "per.", "perd.", "pirm.", "pl.", "plg.", "plk.", "pr.", "pr.Kr.",
    "pranc.", "proc.", "prof.", "prom.", "prot.", "psl.", "pss.", "pvz.",
    "pšt.", "r.", "raj.", "red.", "rez.", "rež.", "rus.", "rš.",
    "s.", "sav.", "saviv.", "sek.", "sekr.", "sen.", "sh.", "sk.",
    "skg.", "skv.", "skyr.", "sp.", "spec.", "sr.", "st.", "str.",
    "stud.", "sąs.", "t.", "t. p.", "t. y.", "t.p.", "t.t.", "t.y.",
    "techn.", "tel.", "teol.", "th.", "tir.", "trit.", "trln.", "tšk.",
    "tūks.", "tūkst.", "up.", "upl.", "v.s.", "vad.", "val.", "valg.",
    "ved.", "vert.", "vet.", "vid.", "virš.", "vlsč.", "vnt.", "vok.",
    "vs.", "vtv.", "vv.", "vyr.", "vyresn.", "zool.", "Įn", "įl.",
    "š.m.", "šnek.", "šv.", "švč.", "ž.ū.", "žin.", "žml.", "žr.",
]:
    _exc[orth] = [{ORTH: orth}]

TOKENIZER_EXCEPTIONS = _exc
@@ -22,6 +22,7 @@ NOUN_RULES = [
 VERB_RULES = [
     ["er", "e"],  # vasker -> vaske
     ["et", "e"],  # vasket -> vaske
+    ["a", "e"],   # vaska -> vaske
     ["es", "e"],  # vaskes -> vaske
     ["te", "e"],  # stekte -> steke
     ["år", "å"],  # får -> få
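These lemmatizer rules are plain suffix rewrites; a toy helper showing how one rule applies (hypothetical illustration, not spaCy's actual rule engine):

def apply_rule(word, old_suffix, new_suffix):
    # e.g. ("vaska", "a", "e") -> "vaske", matching the new rule above
    if word.endswith(old_suffix):
        return word[: len(word) - len(old_suffix)] + new_suffix
    return word

assert apply_rule("vaska", "a", "e") == "vaske"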
@@ -10,7 +10,15 @@ _exc = {}
 for exc_data in [
     {ORTH: "jan.", LEMMA: "januar"},
     {ORTH: "feb.", LEMMA: "februar"},
     {ORTH: "mar.", LEMMA: "mars"},
     {ORTH: "apr.", LEMMA: "april"},
     {ORTH: "jun.", LEMMA: "juni"},
     {ORTH: "jul.", LEMMA: "juli"},
     {ORTH: "aug.", LEMMA: "august"},
     {ORTH: "sep.", LEMMA: "september"},
     {ORTH: "okt.", LEMMA: "oktober"},
     {ORTH: "nov.", LEMMA: "november"},
     {ORTH: "des.", LEMMA: "desember"},
 ]:
     _exc[exc_data[ORTH]] = [exc_data]

@@ -18,11 +26,13 @@ for exc_data in [
 for orth in [
     "adm.dir.",
+    "a.m.",
     "andelsnr",
     "Aq.",
+    "b.c.",
     "bl.a.",
     "bla.",
     "bm.",
     "bnr.",
     "bto.",
     "ca.",
     "cand.mag.",

@@ -41,6 +51,7 @@ for orth in [
     "el.",
     "e.l.",
     "et.",
+    "etc.",
     "etg.",
     "ev.",
     "evt.",

@@ -76,6 +87,7 @@ for orth in [
     "kgl.res.",
     "kl.",
     "komm.",
+    "kr.",
     "kst.",
     "lø.",
     "ma.",

@@ -106,6 +118,7 @@ for orth in [
     "o.l.",
     "on.",
     "op.",
+    "org.",
     "osv.",
     "ovf.",
     "p.",

@@ -130,6 +143,7 @@ for orth in [
     "sep.",
     "siviling.",
     "sms.",
+    "snr.",
     "spm.",
     "sr.",
     "sst.",
spacy/lang/sq/examples.py (new file, 18 lines)
@@ -0,0 +1,18 @@
# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.sq.examples import sentences
>>> docs = nlp.pipe(sentences)
"""


sentences = [
    "Apple po shqyrton blerjen e nje shoqërie të U.K. për 1 miliard dollarë",
    "Makinat autonome ndryshojnë përgjegjësinë e sigurimit ndaj prodhuesve",
    "San Francisko konsideron ndalimin e robotëve të shpërndarjes",
    "Londra është një qytet i madh në Mbretërinë e Bashkuar.",
]
@@ -1,15 +1,17 @@
 # coding: utf8
 from __future__ import unicode_literals

-from collections import defaultdict
+from collections import defaultdict, OrderedDict
 import srsly

 from ..errors import Errors
 from ..compat import basestring_
-from ..util import ensure_path
+from ..util import ensure_path, to_disk, from_disk
 from ..tokens import Span
 from ..matcher import Matcher, PhraseMatcher

+DEFAULT_ENT_ID_SEP = '||'
+

 class EntityRuler(object):
     """The EntityRuler lets you add spans to the `Doc.ents` using token-based

@@ -24,7 +26,7 @@ class EntityRuler(object):

     name = "entity_ruler"

-    def __init__(self, nlp, **cfg):
+    def __init__(self, nlp, phrase_matcher_attr=None, **cfg):
         """Initialize the entity ruler. If patterns are supplied here, they
         need to be a list of dictionaries with a `"label"` and `"pattern"`
         key. A pattern can either be a token pattern (list) or a phrase pattern

@@ -32,6 +34,8 @@ class EntityRuler(object):

         nlp (Language): The shared nlp object to pass the vocab to the matchers
             and process phrase patterns.
+        phrase_matcher_attr (int / unicode): Token attribute to match on, passed
+            to the internal PhraseMatcher as `attr`
         patterns (iterable): Optional patterns to load in.
         overwrite_ents (bool): If existing entities are present, e.g. entities
             added by the model, overwrite them by matches if necessary.

@@ -47,8 +51,13 @@ class EntityRuler(object):
         self.token_patterns = defaultdict(list)
         self.phrase_patterns = defaultdict(list)
         self.matcher = Matcher(nlp.vocab)
-        self.phrase_matcher = PhraseMatcher(nlp.vocab)
-        self.ent_id_sep = cfg.get("ent_id_sep", "||")
+        if phrase_matcher_attr is not None:
+            self.phrase_matcher_attr = phrase_matcher_attr
+            self.phrase_matcher = PhraseMatcher(nlp.vocab, attr=self.phrase_matcher_attr)
+        else:
+            self.phrase_matcher_attr = None
+            self.phrase_matcher = PhraseMatcher(nlp.vocab)
+        self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
         patterns = cfg.get("patterns")
         if patterns is not None:
             self.add_patterns(patterns)

@@ -196,7 +205,7 @@ class EntityRuler(object):

     def _create_label(self, label, ent_id):
         """Join Entity label with ent_id if the pattern has an `id` attribute

         RETURNS (str): The ent_label joined with configured `ent_id_sep`
         """
         if isinstance(ent_id, basestring_):

@@ -212,8 +221,17 @@ class EntityRuler(object):

         DOCS: https://spacy.io/api/entityruler#from_bytes
         """
-        patterns = srsly.msgpack_loads(patterns_bytes)
-        self.add_patterns(patterns)
+        cfg = srsly.msgpack_loads(patterns_bytes)
+        if isinstance(cfg, dict):
+            self.add_patterns(cfg.get('patterns', cfg))
+            self.overwrite = cfg.get('overwrite', False)
+            self.phrase_matcher_attr = cfg.get('phrase_matcher_attr', None)
+            if self.phrase_matcher_attr is not None:
+                self.phrase_matcher = PhraseMatcher(self.nlp.vocab,
+                                                    attr=self.phrase_matcher_attr)
+            self.ent_id_sep = cfg.get('ent_id_sep', DEFAULT_ENT_ID_SEP)
+        else:
+            self.add_patterns(cfg)
         return self

     def to_bytes(self, **kwargs):

@@ -223,7 +241,13 @@ class EntityRuler(object):

         DOCS: https://spacy.io/api/entityruler#to_bytes
         """
-        return srsly.msgpack_dumps(self.patterns)
+        serial = OrderedDict((
+            ('overwrite', self.overwrite),
+            ('ent_id_sep', self.ent_id_sep),
+            ('phrase_matcher_attr', self.phrase_matcher_attr),
+            ('patterns', self.patterns)))
+        return srsly.msgpack_dumps(serial)

     def from_disk(self, path, **kwargs):
         """Load the entity ruler from a file. Expects a file containing

@@ -236,9 +260,23 @@ class EntityRuler(object):
         DOCS: https://spacy.io/api/entityruler#from_disk
         """
         path = ensure_path(path)
-        path = path.with_suffix(".jsonl")
-        patterns = srsly.read_jsonl(path)
-        self.add_patterns(patterns)
+        if path.is_file():
+            patterns = srsly.read_jsonl(path)
+            self.add_patterns(patterns)
+        else:
+            cfg = {}
+            deserializers = {
+                'patterns': lambda p: self.add_patterns(srsly.read_jsonl(p.with_suffix('.jsonl'))),
+                'cfg': lambda p: cfg.update(srsly.read_json(p))
+            }
+            from_disk(path, deserializers, {})
+            self.overwrite = cfg.get('overwrite', False)
+            self.phrase_matcher_attr = cfg.get('phrase_matcher_attr')
+            self.ent_id_sep = cfg.get('ent_id_sep', DEFAULT_ENT_ID_SEP)
+
+            if self.phrase_matcher_attr is not None:
+                self.phrase_matcher = PhraseMatcher(self.nlp.vocab,
+                                                    attr=self.phrase_matcher_attr)
         return self

     def to_disk(self, path, **kwargs):

@@ -251,6 +289,13 @@ class EntityRuler(object):

         DOCS: https://spacy.io/api/entityruler#to_disk
         """
+        cfg = {'overwrite': self.overwrite,
+               'phrase_matcher_attr': self.phrase_matcher_attr,
+               'ent_id_sep': self.ent_id_sep}
+        serializers = {
+            'patterns': lambda p: srsly.write_jsonl(p.with_suffix('.jsonl'),
+                                                    self.patterns),
+            'cfg': lambda p: srsly.write_json(p, cfg)
+        }
         path = ensure_path(path)
-        path = path.with_suffix(".jsonl")
-        srsly.write_jsonl(path, self.patterns)
+        to_disk(path, serializers, {})
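Taken together, the new `phrase_matcher_attr` option and the richer serialization can be exercised like this (a minimal sketch against this branch; behavior matches the tests further below):

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
# Match phrase patterns on the LOWER attribute, i.e. case-insensitively
ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER",
                    patterns=[{"label": "ORG", "pattern": "apple"}])
nlp.add_pipe(ruler)

doc = nlp("Apple is opening a new store.")
print([(ent.text, ent.label_) for ent in doc.ents])  # [('Apple', 'ORG')]

# The attr now round-trips through to_bytes()/from_bytes():
restored = EntityRuler(nlp).from_bytes(ruler.to_bytes())
assert restored.phrase_matcher_attr == "LOWER"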
@@ -1003,7 +1003,7 @@ cdef class DependencyParser(Parser):

     @property
     def postprocesses(self):
-        return [nonproj.deprojectivize]  # , merge_subtokens]
+        return [nonproj.deprojectivize]

     def add_multitask_objective(self, target):
         if target == "cloze":

@@ -1398,5 +1398,5 @@ class Sentencizer(object):
         self.punct_chars = cfg.get("punct_chars", self.default_punct_chars)
         return self


 __all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "EntityLinker", "Sentencizer"]
@@ -52,6 +52,7 @@ class Scorer(object):
         self.labelled = PRFScore()
         self.tags = PRFScore()
         self.ner = PRFScore()
+        self.ner_per_ents = dict()
         self.eval_punct = eval_punct

     @property

@@ -104,6 +105,15 @@ class Scorer(object):
             "ents_f": self.ents_f,
             "tags_acc": self.tags_acc,
             "token_acc": self.token_acc,
+            "ents_per_type": self.__scores_per_ents(),
         }

+    def __scores_per_ents(self):
+        """RETURNS (dict): Scores per NER entity
+        """
+        return {
+            k: {"p": v.precision * 100, "r": v.recall * 100, "f": v.fscore * 100}
+            for k, v in self.ner_per_ents.items()
+        }
+
     def score(self, doc, gold, verbose=False, punct_labels=("p", "punct")):

@@ -149,13 +159,31 @@ class Scorer(object):
             cand_deps.add((gold_i, gold_head, token.dep_.lower()))
         if "-" not in [token[-1] for token in gold.orig_annot]:
             cand_ents = set()
+            current_ent = {k.label_: set() for k in doc.ents}
+            current_gold = {k.label_: set() for k in doc.ents}
             for ent in doc.ents:
+                if ent.label_ not in self.ner_per_ents:
+                    self.ner_per_ents[ent.label_] = PRFScore()
                 first = gold.cand_to_gold[ent.start]
                 last = gold.cand_to_gold[ent.end - 1]
                 if first is None or last is None:
                     self.ner.fp += 1
+                    self.ner_per_ents[ent.label_].fp += 1
                 else:
                     cand_ents.add((ent.label_, first, last))
+                    current_ent[ent.label_].add(
+                        tuple(x for x in cand_ents if x[0] == ent.label_)
+                    )
+                    current_gold[ent.label_].add(
+                        tuple(x for x in gold_ents if x[0] == ent.label_)
+                    )
+            # Scores per ent
+            [
+                v.score_set(current_ent[k], current_gold[k])
+                for k, v in self.ner_per_ents.items()
+                if k in current_ent
+            ]
+            # Score for all ents
             self.ner.score_set(cand_ents, gold_ents)
             self.tags.score_set(cand_tags, gold_tags)
             self.labelled.score_set(cand_deps, gold_deps)
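The per-type results surface under a new `ents_per_type` key in `Scorer.scores`. A sketch of reading them (assumes a spaCy 2.x NER-capable model is installed; the model name and offsets are illustrative):

import spacy
from spacy.scorer import Scorer
from spacy.gold import GoldParse

nlp = spacy.load("en_core_web_sm")  # any pipeline with an NER component
scorer = Scorer()
doc = nlp("Apple is looking at buying U.K. startup")
gold = GoldParse(doc, entities=[(0, 5, "ORG"), (27, 31, "GPE")])
scorer.score(doc, gold)
print(scorer.scores["ents_per_type"])
# e.g. {'ORG': {'p': 100.0, 'r': 100.0, 'f': 100.0}, 'GPE': {...}}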
@@ -124,6 +124,16 @@ def ja_tokenizer():
     return get_lang_class("ja").Defaults.create_tokenizer()


+@pytest.fixture(scope="session")
+def lt_tokenizer():
+    return get_lang_class("lt").Defaults.create_tokenizer()
+
+
+@pytest.fixture(scope="session")
+def lt_lemmatizer():
+    return get_lang_class("lt").Defaults.create_lemmatizer()
+
+
 @pytest.fixture(scope="session")
 def nb_tokenizer():
     return get_lang_class("nb").Defaults.create_tokenizer()
spacy/tests/lang/lt/__init__.py (new empty file)
spacy/tests/lang/lt/test_lemmatizer.py (new file, 15 lines)
@@ -0,0 +1,15 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize("tokens,lemmas", [
    (["Galime", "vadinti", "gerovės", "valstybe", ",", "turime", "išvystytą", "socialinę", "apsaugą", ",",
      "sveikatos", "apsaugą", "ir", "prieinamą", "švietimą", "."],
     ["galėti", "vadintas", "gerovė", "valstybė", ",", "turėti", "išvystytas", "socialinis",
      "apsauga", ",", "sveikata", "apsauga", "ir", "prieinamas", "švietimas", "."]),
    (["taip", ",", "uoliai", "tyrinėjau", "ir", "pasirinkau", "geriausią", "variantą", "."],
     ["taip", ",", "uolus", "tyrinėti", "ir", "pasirinkti", "geras", "variantas", "."])])
def test_lt_lemmatizer(lt_lemmatizer, tokens, lemmas):
    assert lemmas == [lt_lemmatizer.lookup(token) for token in tokens]
44
spacy/tests/lang/lt/test_text.py
Normal file
44
spacy/tests/lang/lt/test_text.py
Normal file
|
@@ -0,0 +1,44 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


def test_lt_tokenizer_handles_long_text(lt_tokenizer):
    text = """Tokios sausros kriterijus atitinka pirmadienį atlikti skaičiavimai, palyginus faktinį ir žemiausią
vidutinį daugiametį vandens lygį. Nustatyta, kad iš 48 šalies vandens matavimo stočių 28-iose stotyse vandens lygis
yra žemesnis arba lygus žemiausiam vidutiniam daugiamečiam šiltojo laikotarpio vandens lygiui."""
    tokens = lt_tokenizer(text.replace("\n", ""))
    assert len(tokens) == 42


@pytest.mark.parametrize('text,length', [
    ("177R Parodų rūmai–Ozo g. nuo vasario 18 d. bus skelbiamas interneto tinklalapyje.", 15),
    ("ISM universiteto doc. dr. Ieva Augutytė-Kvedaravičienė pastebi, kad tyrimais nustatyti elgesio pokyčiai.", 16)])
def test_lt_tokenizer_handles_punct_abbrev(lt_tokenizer, text, length):
    tokens = lt_tokenizer(text)
    assert len(tokens) == length


@pytest.mark.parametrize("text", ["km.", "pvz.", "biol."])
def test_lt_tokenizer_abbrev_exceptions(lt_tokenizer, text):
    tokens = lt_tokenizer(text)
    assert len(tokens) == 1


@pytest.mark.parametrize("text,match", [
    ("10", True),
    ("1", True),
    ("10,000", True),
    ("10,00", True),
    ("999.0", True),
    ("vienas", True),
    ("du", True),
    ("milijardas", True),
    ("šuo", False),
    (",", False),
    ("1/2", True)])
def test_lt_lex_attrs_like_number(lt_tokenizer, text, match):
    tokens = lt_tokenizer(text)
    assert len(tokens) == 1
    assert tokens[0].like_num == match
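The `like_num` cases above cover digits, digit groups with separators ("10,000", "999.0"), fractions ("1/2"), and Lithuanian number words ("vienas", "du", "milijardas"). A hedged sketch of a lexical attribute consistent with those cases, where `_NUM_WORDS` is a tiny illustrative stand-in rather than the real word list:

```python
# Hedged sketch of a like_num lexical attribute matching the test cases
# above; not the actual spacy.lang.lt.lex_attrs implementation.
_NUM_WORDS = {"vienas", "du", "milijardas"}


def like_num(text):
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text.lower() in _NUM_WORDS
```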
@@ -106,5 +106,24 @@ def test_entity_ruler_serialize_bytes(nlp, patterns):
    assert len(new_ruler) == 0
    assert len(new_ruler.labels) == 0
    new_ruler = new_ruler.from_bytes(ruler_bytes)
    assert len(new_ruler) == len(patterns)
    assert len(new_ruler.labels) == 4
    assert len(new_ruler.patterns) == len(ruler.patterns)
    for pattern in ruler.patterns:
        assert pattern in new_ruler.patterns
    assert new_ruler.labels == ruler.labels


def test_entity_ruler_serialize_phrase_matcher_attr_bytes(nlp, patterns):
    ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER", patterns=patterns)
    assert len(ruler) == len(patterns)
    assert len(ruler.labels) == 4
    ruler_bytes = ruler.to_bytes()
    new_ruler = EntityRuler(nlp)
    assert len(new_ruler) == 0
    assert len(new_ruler.labels) == 0
    assert new_ruler.phrase_matcher_attr is None
    new_ruler = new_ruler.from_bytes(ruler_bytes)
    assert len(new_ruler) == len(patterns)
    assert len(new_ruler.labels) == 4
    assert new_ruler.phrase_matcher_attr == "LOWER"
86
spacy/tests/regression/test_issue3526.py
Normal file
@@ -0,0 +1,86 @@
# coding: utf8
from __future__ import unicode_literals

import pytest
from spacy.tokens import Span
from spacy.language import Language
from spacy.pipeline import EntityRuler
from spacy import load
import srsly
from ..util import make_tempdir


@pytest.fixture
def patterns():
    return [
        {"label": "HELLO", "pattern": "hello world"},
        {"label": "BYE", "pattern": [{"LOWER": "bye"}, {"LOWER": "bye"}]},
        {"label": "HELLO", "pattern": [{"ORTH": "HELLO"}]},
        {"label": "COMPLEX", "pattern": [{"ORTH": "foo", "OP": "*"}]},
        {"label": "TECH_ORG", "pattern": "Apple", "id": "a1"},
    ]


@pytest.fixture
def add_ent():
    def add_ent_component(doc):
        doc.ents = [Span(doc, 0, 3, label=doc.vocab.strings["ORG"])]
        return doc

    return add_ent_component


def test_entity_ruler_existing_overwrite_serialize_bytes(patterns, en_vocab):
    nlp = Language(vocab=en_vocab)
    ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
    ruler_bytes = ruler.to_bytes()
    assert len(ruler) == len(patterns)
    assert len(ruler.labels) == 4
    assert ruler.overwrite
    new_ruler = EntityRuler(nlp)
    new_ruler = new_ruler.from_bytes(ruler_bytes)
    assert len(new_ruler) == len(ruler)
    assert len(new_ruler.labels) == 4
    assert new_ruler.overwrite == ruler.overwrite
    assert new_ruler.ent_id_sep == ruler.ent_id_sep


def test_entity_ruler_existing_bytes_old_format_safe(patterns, en_vocab):
    nlp = Language(vocab=en_vocab)
    ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
    bytes_old_style = srsly.msgpack_dumps(ruler.patterns)
    new_ruler = EntityRuler(nlp)
    new_ruler = new_ruler.from_bytes(bytes_old_style)
    assert len(new_ruler) == len(ruler)
    for pattern in ruler.patterns:
        assert pattern in new_ruler.patterns
    assert new_ruler.overwrite is not ruler.overwrite


def test_entity_ruler_from_disk_old_format_safe(patterns, en_vocab):
    nlp = Language(vocab=en_vocab)
    ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
    with make_tempdir() as tmpdir:
        out_file = tmpdir / "entity_ruler.jsonl"
        srsly.write_jsonl(out_file, ruler.patterns)
        new_ruler = EntityRuler(nlp)
        new_ruler = new_ruler.from_disk(out_file)
        for pattern in ruler.patterns:
            assert pattern in new_ruler.patterns
        assert len(new_ruler) == len(ruler)
        assert new_ruler.overwrite is not ruler.overwrite


def test_entity_ruler_in_pipeline_from_issue(patterns, en_vocab):
    nlp = Language(vocab=en_vocab)
    ruler = EntityRuler(nlp, overwrite_ents=True)
    ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
    nlp.add_pipe(ruler)
    with make_tempdir() as tmpdir:
        nlp.to_disk(tmpdir)
        assert nlp.pipeline[-1][-1].patterns == [{"label": "ORG", "pattern": "Apple"}]
        assert nlp.pipeline[-1][-1].overwrite is True
        nlp2 = load(tmpdir)
        assert nlp2.pipeline[-1][-1].patterns == [{"label": "ORG", "pattern": "Apple"}]
        assert nlp2.pipeline[-1][-1].overwrite is True
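The two `old_format_safe` tests above pin down backward compatibility: older serializations were a bare msgpack list of pattern dicts, while newer ones also carry config such as the overwrite flag. An illustrative helper showing that shape check (`from_bytes_compat` is invented for the example, not EntityRuler's actual method):

```python
# Invented helper illustrating the old-vs-new byte format handling these
# tests guard; not EntityRuler's actual from_bytes() implementation.
import srsly


def from_bytes_compat(patterns_bytes):
    data = srsly.msgpack_loads(patterns_bytes)
    if isinstance(data, list):
        # Old format: a bare list of pattern dicts. No config is present,
        # so settings like overwrite_ents keep their defaults.
        return data, {}
    # Newer format: a dict carrying config alongside the patterns.
    cfg = {k: v for k, v in data.items() if k != "patterns"}
    return data.get("patterns", []), cfg
```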
15
spacy/tests/regression/test_issue3882.py
Normal file
@@ -0,0 +1,15 @@
# coding: utf8
from __future__ import unicode_literals

from spacy.displacy import parse_deps
from spacy.tokens import Doc


def test_issue3882(en_vocab):
    """Test that displaCy doesn't serialize the doc.user_data when making a
    copy of the Doc.
    """
    doc = Doc(en_vocab, words=["Hello", "world"])
    doc.is_parsed = True
    doc.user_data["test"] = set()
    parse_deps(doc)
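The `set()` in `user_data` is the tripwire here: sets are not JSON-serializable, so any code path that tried to serialize `user_data` while copying the `Doc` would raise. An illustrative check of that failure mode (this is plain `json` behavior, not displaCy internals):

```python
# Plain-json illustration of why an unserializable value in user_data
# matters; displaCy's actual copying logic is not shown here.
import json

try:
    json.dumps({"test": set()})
except TypeError:
    print("sets cannot be JSON-serialized")
```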
@@ -284,9 +284,9 @@ same between pretraining and training. The API and errors around this need some
improvement.

```bash
-$ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width]
-[--depth] [--embed-rows] [--loss_func] [--dropout] [--seed] [--n-iter] [--use-vectors]
-[--n-save_every]
+$ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
+[--width] [--depth] [--embed-rows] [--loss_func] [--dropout] [--batch-size] [--max-length] [--min-length]
+[--seed] [--n-iter] [--use-vectors] [--n-save_every] [--init-tok2vec] [--epoch-start]
```

| Argument | Type | Description |

@@ -306,7 +306,8 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width]
| `--n-iter`, `-i` | option | Number of iterations to pretrain. |
| `--use-vectors`, `-uv` | flag | Whether to use the static vectors as input features. |
| `--n-save-every`, `-se` | option | Save model every X batches. |
| `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. |
+| `--epoch-start`, `-es` <Tag variant="new">2.1.5</Tag> | option | The epoch to start counting at. Only relevant when using `--init-tok2vec` and the given weight file has been renamed. Prevents unintended overwriting of existing weight files. |
| **CREATES** | weights | The pre-trained weights that can be used to initialize `spacy train`. |
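As a concrete illustration of the new `--epoch-start` flag, a resumed run might look like the following; the corpus path, vectors model, output directory, weights filename and epoch number are all placeholders, not values from this diff:

```bash
# Hypothetical invocation resuming pretraining from a renamed weights file.
$ python -m spacy pretrain texts.jsonl en_vectors_web_lg ./pretrain_output \
    --init-tok2vec ./pretrain_output/model345.bin --epoch-start 346
```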

### JSONL format for raw text {#pretrain-jsonl}

@@ -34,6 +34,7 @@ be a token pattern (list) or a phrase pattern (string). For example:
| ---------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `nlp` | `Language` | The shared nlp object to pass the vocab to the matchers and process phrase patterns. |
| `patterns` | iterable | Optional patterns to load in. |
+| `phrase_matcher_attr` | int / unicode | Optional attribute to pass to the internal [`PhraseMatcher`](/api/phrasematcher). Defaults to `None`. |
| `overwrite_ents` | bool | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. |
| `**cfg` | - | Other config parameters. If pipeline component is loaded as part of a model pipeline, this will include all keyword arguments passed to `spacy.load`. |
| **RETURNS** | `EntityRuler` | The newly constructed object. |
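A minimal sketch of constructing an `EntityRuler` with the documented `phrase_matcher_attr` argument; the pattern and example text are invented for illustration:

```python
# Minimal sketch (spaCy v2.x API); the pattern and text are invented.
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
patterns = [{"label": "ORG", "pattern": "Apple"}]
# Matching on the LOWER attribute makes the phrase pattern case-insensitive.
ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER", patterns=patterns)
nlp.add_pipe(ruler)

doc = nlp("apple hired a new engineer")
print([(ent.text, ent.label_) for ent in doc.ents])  # [('apple', 'ORG')]
```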

@@ -305,11 +305,11 @@ match on the uppercase versions, in case someone has written it as "Google i/o".

```python
### {executable="true"}
-import spacy
+from spacy.lang.en import English
from spacy.matcher import Matcher
from spacy.tokens import Span

-nlp = spacy.load("en_core_web_sm")
+nlp = English()
matcher = Matcher(nlp.vocab)

def add_event_ent(matcher, doc, i, matches):

@@ -322,7 +322,7 @@ def add_event_ent(matcher, doc, i, matches):

pattern = [{"ORTH": "Google"}, {"ORTH": "I"}, {"ORTH": "/"}, {"ORTH": "O"}]
matcher.add("GoogleIO", add_event_ent, pattern)
-doc = nlp(u"This is a text about Google I/O.")
+doc = nlp(u"This is a text about Google I/O")
matches = matcher(doc)
```
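The `add_event_ent` callback registered with `matcher.add()` above follows the on_match contract: it is invoked with the matcher, the doc, the index of the current match, and the full list of matches. A hedged sketch of a callback body (the span handling here is illustrative, not the docs' full example):

```python
# Hedged sketch of the on_match callback contract used above; the body is
# illustrative, not the complete add_event_ent from the docs.
def add_event_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start:end]
    print(span.text)  # e.g. "Google I/O"
```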

@@ -106,7 +106,12 @@
{ "code": "hi", "name": "Hindi", "example": "यह एक वाक्य है।", "has_examples": true },
{ "code": "kn", "name": "Kannada" },
{ "code": "ta", "name": "Tamil", "has_examples": true },
-{ "code": "id", "name": "Indonesian", "has_examples": true },
+{
+    "code": "id",
+    "name": "Indonesian",
+    "example": "Ini adalah sebuah kalimat.",
+    "has_examples": true
+},
{ "code": "tl", "name": "Tagalog" },
{ "code": "af", "name": "Afrikaans" },
{ "code": "bg", "name": "Bulgarian" },

@@ -116,7 +121,12 @@
{ "code": "lv", "name": "Latvian" },
{ "code": "sk", "name": "Slovak" },
{ "code": "sl", "name": "Slovenian" },
-{ "code": "sq", "name": "Albanian" },
+{
+    "code": "sq",
+    "name": "Albanian",
+    "example": "Kjo është një fjali.",
+    "has_examples": true
+},
{ "code": "et", "name": "Estonian" },
{
    "code": "th",