Mirror of https://github.com/explosion/spaCy.git, synced 2025-01-12 02:06:31 +03:00
Merge branch 'master' into develop
Commit 330c039106
106
.github/contributors/BigstickCarpet.md
vendored
Normal file
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+* you hereby assign to us joint ownership, and to the extent that such
+assignment is or becomes invalid, ineffective or unenforceable, you hereby
+grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+royalty-free, unrestricted license to exercise all rights under those
+copyrights. This includes, at our option, the right to sublicense these same
+rights to third parties through multiple levels of sublicensees or other
+licensing arrangements;
+
+* you agree that each of us can do all things in relation to your
+contribution as if each of us were the sole owners, and if one of us makes
+a derivative work of your contribution, the one who makes the derivative
+work (or has it made will be the sole owner of that derivative work;
+
+* you agree that you will not assert any moral rights in your contribution
+against us, our licensees or transferees;
+
+* you agree that we may register a copyright in your contribution and
+exercise all ownership rights associated with it; and
+
+* you agree that neither of us has any duty to consult with, obtain the
+consent of, pay or render an accounting to the other for any use or
+distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+* make, have made, use, sell, offer to sell, import, and otherwise transfer
+your contribution in whole or in part, alone or in combination with or
+included in any product, work or materials arising out of the project to
+which your contribution was submitted, and
+
+* at our option, to sublicense these same rights to third parties through
+multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+* Each contribution that you submit is and shall be an original work of
+authorship and you can legally grant the rights set out in this SCA;
+
+* to the best of your knowledge, each contribution will not violate any
+third party's copyrights, trademarks, patents, or other intellectual
+property rights; and
+
+* each contribution shall be in compliance with U.S. export control laws and
+other applicable export and import laws. You agree to notify us if you
+become aware of any circumstance which would make any of the foregoing
+representations inaccurate in any respect. We may publicly disclose your
+participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+* [ X] I am signing on behalf of myself as an individual and no other person
+or entity, including my employer, has or will have rights with respect to my
+contributions.
+
+* [ ] I am signing on behalf of my employer or a legal entity and I have the
+actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | James Messinger |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | May 23, 2018 |
+| GitHub username | BigstickCarpet |
+| Website (optional) | |
106
.github/contributors/aristorinjuang.md
vendored
Normal file
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+* you hereby assign to us joint ownership, and to the extent that such
+assignment is or becomes invalid, ineffective or unenforceable, you hereby
+grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+royalty-free, unrestricted license to exercise all rights under those
+copyrights. This includes, at our option, the right to sublicense these same
+rights to third parties through multiple levels of sublicensees or other
+licensing arrangements;
+
+* you agree that each of us can do all things in relation to your
+contribution as if each of us were the sole owners, and if one of us makes
+a derivative work of your contribution, the one who makes the derivative
+work (or has it made will be the sole owner of that derivative work;
+
+* you agree that you will not assert any moral rights in your contribution
+against us, our licensees or transferees;
+
+* you agree that we may register a copyright in your contribution and
+exercise all ownership rights associated with it; and
+
+* you agree that neither of us has any duty to consult with, obtain the
+consent of, pay or render an accounting to the other for any use or
+distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+* make, have made, use, sell, offer to sell, import, and otherwise transfer
+your contribution in whole or in part, alone or in combination with or
+included in any product, work or materials arising out of the project to
+which your contribution was submitted, and
+
+* at our option, to sublicense these same rights to third parties through
+multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+* Each contribution that you submit is and shall be an original work of
+authorship and you can legally grant the rights set out in this SCA;
+
+* to the best of your knowledge, each contribution will not violate any
+third party's copyrights, trademarks, patents, or other intellectual
+property rights; and
+
+* each contribution shall be in compliance with U.S. export control laws and
+other applicable export and import laws. You agree to notify us if you
+become aware of any circumstance which would make any of the foregoing
+representations inaccurate in any respect. We may publicly disclose your
+participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+* [x] I am signing on behalf of myself as an individual and no other person
+or entity, including my employer, has or will have rights with respect to my
+contributions.
+
+* [x] I am signing on behalf of my employer or a legal entity and I have the
+actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------------- |
+| Name | Aristo Rinjuang |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | May 22, 2018 |
+| GitHub username | aristorinjuang |
+| Website (optional) | https://aristorinjuang.com |
106
.github/contributors/armsp.md
vendored
Normal file
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+* you hereby assign to us joint ownership, and to the extent that such
+assignment is or becomes invalid, ineffective or unenforceable, you hereby
+grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+royalty-free, unrestricted license to exercise all rights under those
+copyrights. This includes, at our option, the right to sublicense these same
+rights to third parties through multiple levels of sublicensees or other
+licensing arrangements;
+
+* you agree that each of us can do all things in relation to your
+contribution as if each of us were the sole owners, and if one of us makes
+a derivative work of your contribution, the one who makes the derivative
+work (or has it made will be the sole owner of that derivative work;
+
+* you agree that you will not assert any moral rights in your contribution
+against us, our licensees or transferees;
+
+* you agree that we may register a copyright in your contribution and
+exercise all ownership rights associated with it; and
+
+* you agree that neither of us has any duty to consult with, obtain the
+consent of, pay or render an accounting to the other for any use or
+distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+* make, have made, use, sell, offer to sell, import, and otherwise transfer
+your contribution in whole or in part, alone or in combination with or
+included in any product, work or materials arising out of the project to
+which your contribution was submitted, and
+
+* at our option, to sublicense these same rights to third parties through
+multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+* Each contribution that you submit is and shall be an original work of
+authorship and you can legally grant the rights set out in this SCA;
+
+* to the best of your knowledge, each contribution will not violate any
+third party's copyrights, trademarks, patents, or other intellectual
+property rights; and
+
+* each contribution shall be in compliance with U.S. export control laws and
+other applicable export and import laws. You agree to notify us if you
+become aware of any circumstance which would make any of the foregoing
+representations inaccurate in any respect. We may publicly disclose your
+participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+* [x] I am signing on behalf of myself as an individual and no other person
+or entity, including my employer, has or will have rights with respect to my
+contributions.
+
+* [ ] I am signing on behalf of my employer or a legal entity and I have the
+actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Shantam |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 21/5/2018 |
+| GitHub username | armsp |
+| Website (optional) | |
106
.github/contributors/idealley.md
vendored
Normal file
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+* you hereby assign to us joint ownership, and to the extent that such
+assignment is or becomes invalid, ineffective or unenforceable, you hereby
+grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+royalty-free, unrestricted license to exercise all rights under those
+copyrights. This includes, at our option, the right to sublicense these same
+rights to third parties through multiple levels of sublicensees or other
+licensing arrangements;
+
+* you agree that each of us can do all things in relation to your
+contribution as if each of us were the sole owners, and if one of us makes
+a derivative work of your contribution, the one who makes the derivative
+work (or has it made will be the sole owner of that derivative work;
+
+* you agree that you will not assert any moral rights in your contribution
+against us, our licensees or transferees;
+
+* you agree that we may register a copyright in your contribution and
+exercise all ownership rights associated with it; and
+
+* you agree that neither of us has any duty to consult with, obtain the
+consent of, pay or render an accounting to the other for any use or
+distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+* make, have made, use, sell, offer to sell, import, and otherwise transfer
+your contribution in whole or in part, alone or in combination with or
+included in any product, work or materials arising out of the project to
+which your contribution was submitted, and
+
+* at our option, to sublicense these same rights to third parties through
+multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+* Each contribution that you submit is and shall be an original work of
+authorship and you can legally grant the rights set out in this SCA;
+
+* to the best of your knowledge, each contribution will not violate any
+third party's copyrights, trademarks, patents, or other intellectual
+property rights; and
+
+* each contribution shall be in compliance with U.S. export control laws and
+other applicable export and import laws. You agree to notify us if you
+become aware of any circumstance which would make any of the foregoing
+representations inaccurate in any respect. We may publicly disclose your
+participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+* [x] I am signing on behalf of myself as an individual and no other person
+or entity, including my employer, has or will have rights with respect to my
+contributions.
+
+* [x] I am signing on behalf of my employer or a legal entity and I have the
+actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Pouyt Samuel |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 26.05.2018 |
+| GitHub username | Idealley |
+| Website (optional) | |
@@ -118,7 +118,7 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
     optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)
     nlp._optimizer = None

-    print("Itn.\tP.Loss\tN.Loss\tUAS\tNER P.\tNER R.\tNER F.\tTag %\tToken %")
+    print("Itn. Dep Loss NER Loss UAS NER P. NER R. NER F. Tag % Token % CPU WPS GPU WPS")
     try:
         for i in range(n_iter):
             train_docs = corpus.train_docs(nlp, noise_level=0.0,
@@ -208,17 +208,17 @@ def print_progress(itn, losses, dev_scores, cpu_wps=0.0, gpu_wps=0.0):
     scores.update(dev_scores)
     scores['cpu_wps'] = cpu_wps
     scores['gpu_wps'] = gpu_wps or 0.0
-    tpl = '\t'.join((
-        '{:d}',
-        '{dep_loss:.3f}',
-        '{ner_loss:.3f}',
-        '{uas:.3f}',
-        '{ents_p:.3f}',
-        '{ents_r:.3f}',
-        '{ents_f:.3f}',
-        '{tags_acc:.3f}',
-        '{token_acc:.3f}',
-        '{cpu_wps:.1f}',
+    tpl = ''.join((
+        '{:<6d}',
+        '{dep_loss:<10.3f}',
+        '{ner_loss:<10.3f}',
+        '{uas:<8.3f}',
+        '{ents_p:<8.3f}',
+        '{ents_r:<8.3f}',
+        '{ents_f:<8.3f}',
+        '{tags_acc:<8.3f}',
+        '{token_acc:<9.3f}',
+        '{cpu_wps:<9.1f}',
         '{gpu_wps:.1f}',
     ))
     print(tpl.format(itn, **scores))
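The `print_progress` change swaps tab-separated fields for left-aligned, fixed-width ones, so each value pads itself out to its column. The sketch below reuses the column templates from the new side of the hunk. The `format_row` helper and the score values are made up for illustration and are not part of the commit.

```python
# Column templates copied from the new side of the print_progress hunk.
# '<' left-aligns a value inside the given width, e.g. '{:<6d}' pads an
# integer to 6 characters and '{uas:<8.3f}' pads a 3-decimal float to 8.
COLUMNS = (
    '{:<6d}',
    '{dep_loss:<10.3f}',
    '{ner_loss:<10.3f}',
    '{uas:<8.3f}',
    '{ents_p:<8.3f}',
    '{ents_r:<8.3f}',
    '{ents_f:<8.3f}',
    '{tags_acc:<8.3f}',
    '{token_acc:<9.3f}',
    '{cpu_wps:<9.1f}',
    '{gpu_wps:.1f}',
)


def format_row(itn, scores):
    """Hypothetical helper: join the templates with no separator and fill them."""
    return ''.join(COLUMNS).format(itn, **scores)


# Made-up scores, for illustration only.
example_scores = {
    'dep_loss': 12.5, 'ner_loss': 3.25, 'uas': 91.234,
    'ents_p': 80.0, 'ents_r': 79.5, 'ents_f': 79.75,
    'tags_acc': 96.1, 'token_acc': 100.0,
    'cpu_wps': 12345.6, 'gpu_wps': 0.0,
}

row = format_row(3, example_scores)
print(row)
```

Because the padding is part of each field, the columns stay aligned as long as every value fits within its width; a value wider than its field simply pushes the rest of the row to the right.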
@@ -4,19 +4,10 @@ from __future__ import unicode_literals
 from ...attrs import LIKE_NUM


-_num_words = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
-              'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
-              'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty',
-              'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety',
-              'hundred', 'thousand', 'million', 'billion', 'trillion', 'quadrillion',
-              'gajillion', 'bazillion',
-              'nol', 'satu', 'dua', 'tiga', 'empat', 'lima', 'enam', 'tujuh',
-              'delapan', 'sembilan', 'sepuluh', 'sebelas', 'duabelas', 'tigabelas',
-              'empatbelas', 'limabelas', 'enambelas', 'tujuhbelas', 'delapanbelas',
-              'sembilanbelas', 'duapuluh', 'seratus', 'seribu', 'sejuta',
-              'ribu', 'rb', 'juta', 'jt', 'miliar', 'biliun', 'triliun',
-              'kuadriliun', 'kuintiliun', 'sekstiliun', 'septiliun', 'oktiliun',
-              'noniliun', 'desiliun']
+_num_words = ['nol', 'satu', 'dua', 'tiga', 'empat', 'lima', 'enam', 'tujuh',
+              'delapan', 'sembilan', 'sepuluh', 'sebelas', 'belas', 'puluh',
+              'ratus', 'ribu', 'juta', 'miliar', 'biliun', 'triliun', 'kuadriliun',
+              'kuintiliun', 'sekstiliun', 'septiliun', 'oktiliun', 'noniliun', 'desiliun']


 def like_num(text):
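The hunk ends at the `def like_num(text):` line, so the function body is not visible here. As a rough sketch of how a lexical attribute like this could consult `_num_words`, the version below checks digits, simple fractions, list membership and a hyphenated prefix form. Every rule in the body is an assumption made for illustration, not the code in the commit.

```python
# Word list copied from the new side of the hunk.
_num_words = ['nol', 'satu', 'dua', 'tiga', 'empat', 'lima', 'enam', 'tujuh',
              'delapan', 'sembilan', 'sepuluh', 'sebelas', 'belas', 'puluh',
              'ratus', 'ribu', 'juta', 'miliar', 'biliun', 'triliun', 'kuadriliun',
              'kuintiliun', 'sekstiliun', 'septiliun', 'oktiliun', 'noniliun', 'desiliun']


def like_num(text):
    """Illustrative guess at a number-likeness check (not the commit's body)."""
    # Drop thousands and decimal separators so "1.000" and "1,5" count as digits.
    stripped = text.replace(',', '').replace('.', '')
    if stripped.isdigit():
        return True
    # Simple fractions such as "3/4".
    if stripped.count('/') == 1:
        num, denom = stripped.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    # Spelled-out number words, case-insensitively.
    if stripped.lower() in _num_words:
        return True
    # Hyphenated forms such as "ke-2" or "ke-dua": test the part after the hyphen.
    if stripped.count('-') == 1:
        _, num = stripped.split('-')
        if num.isdigit() or num.lower() in _num_words:
            return True
    return False


for token in ['10', '1.000', '3/4', 'Dua', 'ke-2', 'rumah']:
    print(token, like_num(token))
```

With the trimmed list, compound forms that were previously listed verbatim (for example `duabelas`) no longer match by plain membership, which is worth keeping in mind when reading the removed lines.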
@@ -1,14 +1,7 @@
 # coding: utf8
 from __future__ import unicode_literals

-_exc = {
-    "Rp": "$",
-    "IDR": "$",
-    "RMB": "$",
-    "USD": "$",
-    "AUD": "$",
-    "GBP": "$",
-}
+_exc = {}

 NORM_EXCEPTIONS = {}

@@ -5,7 +5,7 @@ import regex as re

 from ._tokenizer_exceptions_list import ID_BASE_EXCEPTIONS
 from ..tokenizer_exceptions import URL_PATTERN
-from ...symbols import ORTH
+from ...symbols import ORTH, LEMMA, NORM


 _exc = {}
@@ -29,17 +29,58 @@ for orth in ID_BASE_EXCEPTIONS:
     orth_caps = '-'.join([part.upper() for part in orth.split('-')])
     _exc[orth_caps] = [{ORTH: orth_caps}]

+for exc_data in [
+    {ORTH: "CKG", LEMMA: "Cakung", NORM: "Cakung"},
+    {ORTH: "CGP", LEMMA: "Grogol Petamburan", NORM: "Grogol Petamburan"},
+    {ORTH: "KSU", LEMMA: "Kepulauan Seribu Utara", NORM: "Kepulauan Seribu Utara"},
+    {ORTH: "KYB", LEMMA: "Kebayoran Baru", NORM: "Kebayoran Baru"},
+    {ORTH: "TJP", LEMMA: "Tanjungpriok", NORM: "Tanjungpriok"},
+    {ORTH: "TNA", LEMMA: "Tanah Abang", NORM: "Tanah Abang"},
+
+    {ORTH: "BEK", LEMMA: "Bengkayang", NORM: "Bengkayang"},
+    {ORTH: "KTP", LEMMA: "Ketapang", NORM: "Ketapang"},
+    {ORTH: "MPW", LEMMA: "Mempawah", NORM: "Mempawah"},
+    {ORTH: "NGP", LEMMA: "Nanga Pinoh", NORM: "Nanga Pinoh"},
+    {ORTH: "NBA", LEMMA: "Ngabang", NORM: "Ngabang"},
+    {ORTH: "PTK", LEMMA: "Pontianak", NORM: "Pontianak"},
+    {ORTH: "PTS", LEMMA: "Putussibau", NORM: "Putussibau"},
+    {ORTH: "SBS", LEMMA: "Sambas", NORM: "Sambas"},
+    {ORTH: "SAG", LEMMA: "Sanggau", NORM: "Sanggau"},
+    {ORTH: "SED", LEMMA: "Sekadau", NORM: "Sekadau"},
+    {ORTH: "SKW", LEMMA: "Singkawang", NORM: "Singkawang"},
+    {ORTH: "STG", LEMMA: "Sintang", NORM: "Sintang"},
+    {ORTH: "SKD", LEMMA: "Sukadane", NORM: "Sukadane"},
+    {ORTH: "SRY", LEMMA: "Sungai Raya", NORM: "Sungai Raya"},
+
+    {ORTH: "Jan.", LEMMA: "Januari", NORM: "Januari"},
+    {ORTH: "Feb.", LEMMA: "Februari", NORM: "Februari"},
+    {ORTH: "Mar.", LEMMA: "Maret", NORM: "Maret"},
+    {ORTH: "Apr.", LEMMA: "April", NORM: "April"},
+    {ORTH: "Jun.", LEMMA: "Juni", NORM: "Juni"},
+    {ORTH: "Jul.", LEMMA: "Juli", NORM: "Juli"},
+    {ORTH: "Agu.", LEMMA: "Agustus", NORM: "Agustus"},
+    {ORTH: "Ags.", LEMMA: "Agustus", NORM: "Agustus"},
+    {ORTH: "Sep.", LEMMA: "September", NORM: "September"},
+    {ORTH: "Okt.", LEMMA: "Oktober", NORM: "Oktober"},
+    {ORTH: "Nov.", LEMMA: "November", NORM: "November"},
+    {ORTH: "Des.", LEMMA: "Desember", NORM: "Desember"}]:
+    _exc[exc_data[ORTH]] = [exc_data]

 for orth in [
-    "'d", "a.m.", "Adm.", "Bros.", "co.", "Co.", "Corp.", "D.C.", "Dr.", "e.g.",
-    "E.g.", "E.G.", "Gen.", "Gov.", "i.e.", "I.e.", "I.E.", "Inc.", "Jr.",
-    "Ltd.", "Md.", "Messrs.", "Mo.", "Mont.", "Mr.", "Mrs.", "Ms.", "p.m.",
-    "Ph.D.", "Rep.", "Rev.", "Sen.", "St.", "vs.",
+    "A.AB.", "A.Ma.", "A.Md.", "A.Md.Keb.", "A.Md.Kep.", "A.P.",
     "B.A.", "B.Ch.E.", "B.Sc.", "Dr.", "Dra.", "Drs.", "Hj.", "Ka.", "Kp.",
-    "M.Ag.", "M.Hum.", "M.Kes,", "M.Kom.", "M.M.", "M.P.", "M.Pd.", "M.Sc.",
-    "M.Si.", "M.Sn.", "M.T.", "M.Th.", "No.", "Pjs.", "Plt.", "R.A.", "S.Ag.",
-    "S.E.", "S.H.", "S.Hut.", "S.K.M.", "S.Kedg.", "S.Kedh.", "S.Kom.",
-    "S.Pd.", "S.Pol.", "S.Psi.", "S.S.", "S.Sos.", "S.T.", "S.Tekp.", "S.Th.",
+    "M.AB", "M.Ag.", "M.AP", "M.Arl", "M.A.R.S", "M.Hum.", "M.I.Kom.", "M.Kes,",
+    "M.Kom.", "M.M.", "M.P.", "M.Pd.", "M.Psi.", "M.Psi.T.", "M.Sc.", "M.SArl",
+    "M.Si.", "M.Sn.", "M.T.", "M.Th.", "No.", "Pjs.", "Plt.", "R.A.",
+    "S.AB", "S.AP", "S.Adm", "S.Ag.", "S.Agr", "S.Ant", "S.Arl", "S.Ars",
+    "S.A.R.S", "S.Ds", "S.E.", "S.E.I.", "S.Farm", "S.Gz.", "S.H.", "S.Han",
+    "S.H.Int", "S.Hum", "S.Hut.", "S.In.", "S.IK.", "S.I.Kom.", "S.I.P", "S.IP",
+    "S.P.", "S.Pt", "S.Psi", "S.Ptk", "S.Keb", "S.Ked", "S.Kep", "S.KG", "S.KH",
+    "S.Kel", "S.K.M.", "S.Kedg.", "S.Kedh.", "S.Kom.", "S.KPM", "S.Mb", "S.Mat",
+    "S.Par", "S.Pd.", "S.Pd.I.", "S.Pd.SD", "S.Pol.", "S.Psi.", "S.S.", "S.SArl.",
+    "S.Sn", "S.Si.", "S.Si.Teol.", "S.SI.", "S.ST.", "S.ST.Han", "S.STP", "S.Sos.",
+    "S.Sy.", "S.T.", "S.T.Han", "S.Th.", "S.Th.I" "S.TI.", "S.T.P.", "S.TrK",
+    "S.Tekp.", "S.Th.",
     "a.l.", "a.n.", "a.s.", "b.d.", "d.a.", "d.l.", "d/h", "dkk.", "dll.",
     "dr.", "drh.", "ds.", "dsb.", "dst.", "faks.", "fax.", "hlm.", "i/o",
     "n.b.", "p.p." "pjs.", "s.d.", "tel.", "u.p.",
|
"n.b.", "p.p." "pjs.", "s.d.", "tel.", "u.p.",
|
||||||
|
|
42
spacy/lang/ro/lex_attrs.py
Normal file
|
@ -0,0 +1,42 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ...attrs import LIKE_NUM
|
||||||
|
|
||||||
|
|
||||||
|
_num_words = set("""
|
||||||
|
zero unu doi două trei patru cinci șase șapte opt nouă zece
|
||||||
|
unsprezece doisprezece douăsprezece treisprezece patrusprezece cincisprezece șaisprezece șaptesprezece optsprezece nouăsprezece
|
||||||
|
douăzeci treizeci patruzeci cincizeci șaizeci șaptezeci optzeci nouăzeci
|
||||||
|
sută mie milion miliard bilion trilion cvadrilion catralion cvintilion sextilion septilion enșpemii
|
||||||
|
""".split())
|
||||||
|
|
||||||
|
_ordinal_words = set("""
|
||||||
|
primul doilea treilea patrulea cincilea șaselea șaptelea optulea nouălea zecelea
|
||||||
|
prima doua treia patra cincia șasea șaptea opta noua zecea
|
||||||
|
unsprezecelea doisprezecelea treisprezecelea patrusprezecelea cincisprezecelea șaisprezecelea șaptesprezecelea optsprezecelea nouăsprezecelea
|
||||||
|
unsprezecea douăsprezecea treisprezecea patrusprezecea cincisprezecea șaisprezecea șaptesprezecea optsprezecea nouăsprezecea
|
||||||
|
douăzecilea treizecilea patruzecilea cincizecilea șaizecilea șaptezecilea optzecilea nouăzecilea sutălea
|
||||||
|
douăzecea treizecea patruzecea cincizecea șaizecea șaptezecea optzecea nouăzecea suta
|
||||||
|
miilea mielea mia milionulea milioana miliardulea miliardelea miliarda enșpemia
|
||||||
|
""".split())
|
||||||
|
|
||||||
|
|
||||||
|
def like_num(text):
|
||||||
|
text = text.replace(',', '').replace('.', '')
|
||||||
|
if text.isdigit():
|
||||||
|
return True
|
||||||
|
if text.count('/') == 1:
|
||||||
|
num, denom = text.split('/')
|
||||||
|
if num.isdigit() and denom.isdigit():
|
||||||
|
return True
|
||||||
|
if text.lower() in _num_words:
|
||||||
|
return True
|
||||||
|
if text.lower() in _ordinal_words:
|
||||||
|
return True
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
LEX_ATTRS = {
|
||||||
|
LIKE_NUM: like_num
|
||||||
|
}
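To show how the `like_num` function above behaves, here is a minimal standalone sketch. It copies the same logic but uses a deliberately small subset of the word lists, and it drops the spaCy import so it can run on its own. The sample inputs are made up for illustration.

```python
# Small subsets of the Romanian word lists from the diff (not the full sets).
_num_words = set("zero unu doi două trei zece sută mie".split())
_ordinal_words = set("primul prima doilea doua".split())


def like_num(text):
    # Strip thousands/decimal separators before checking for digits.
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    # Accept simple fractions such as 1/2 (exactly one slash).
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    # Fall back to the cardinal and ordinal word lists (case-insensitive).
    lowered = text.lower()
    return lowered in _num_words or lowered in _ordinal_words


for sample in ["10.000", "1/2", "Trei", "primul", "carte", "1/2/3"]:
    print(sample, like_num(sample))
```

The first four samples are treated as number-like. `carte` is not in either word list, and `1/2/3` has more than one slash, so both are rejected.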
|
|
@ -9,8 +9,9 @@ _exc = {}
|
||||||
|
|
||||||
# Source: https://en.wiktionary.org/wiki/Category:Romanian_abbreviations
|
# Source: https://en.wiktionary.org/wiki/Category:Romanian_abbreviations
|
||||||
for orth in [
|
for orth in [
|
||||||
"1-a", "1-ul", "10-a", "10-lea", "2-a", "3-a", "3-lea", "6-lea",
|
"1-a", "2-a", "3-a", "4-a", "5-a", "6-a", "7-a", "8-a", "9-a", "10-a", "11-a", "12-a",
|
||||||
"d-voastră", "dvs.", "Rom.", "str."]:
|
"1-ul", "2-lea", "3-lea", "4-lea", "5-lea", "6-lea", "7-lea", "8-lea", "9-lea", "10-lea", "11-lea", "12-lea",
|
||||||
|
"d-voastră", "dvs.", "ing.", "dr.", "Rom.", "str.", "nr.", "etc.", "d.p.d.v.", "dpdv", "șamd.", "ș.a.m.d."]:
|
||||||
_exc[orth] = [{ORTH: orth}]
|
_exc[orth] = [{ORTH: orth}]
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -15,7 +15,7 @@ from .. import util
|
||||||
# here if it's using spaCy's tokenizer (not a different library)
|
# here if it's using spaCy's tokenizer (not a different library)
|
||||||
# TODO: re-implement generic tokenizer tests
|
# TODO: re-implement generic tokenizer tests
|
||||||
_languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
|
_languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
|
||||||
'it', 'nb', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr', 'ar', 'xx']
|
'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'xx']
|
||||||
|
|
||||||
_models = {'en': ['en_core_web_sm'],
|
_models = {'en': ['en_core_web_sm'],
|
||||||
'de': ['de_core_news_md'],
|
'de': ['de_core_news_md'],
|
||||||
|
|
25
spacy/tests/lang/ro/test_tokenizer.py
Normal file
|
@ -0,0 +1,25 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
DEFAULT_TESTS = [
|
||||||
|
('Adresa este str. Principală nr. 5.', ['Adresa', 'este', 'str.', 'Principală', 'nr.', '5', '.']),
|
||||||
|
('Teste, etc.', ['Teste', ',', 'etc.']),
|
||||||
|
('Lista, ș.a.m.d.', ['Lista', ',', 'ș.a.m.d.']),
|
||||||
|
('Și d.p.d.v. al...', ['Și', 'd.p.d.v.', 'al', '...'])
|
||||||
|
]
|
||||||
|
|
||||||
|
NUMBER_TESTS = [
|
||||||
|
('Clasa a 4-a.', ['Clasa', 'a', '4-a', '.']),
|
||||||
|
('Al 12-lea ceas.', ['Al', '12-lea', 'ceas', '.'])
|
||||||
|
]
|
||||||
|
|
||||||
|
TESTCASES = DEFAULT_TESTS + NUMBER_TESTS
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
|
||||||
|
def test_tokenizer_handles_testcases(ro_tokenizer, text, expected_tokens):
|
||||||
|
tokens = ro_tokenizer(text)
|
||||||
|
token_list = [token.text for token in tokens if not token.is_space]
|
||||||
|
assert expected_tokens == token_list
|
|
@ -53,7 +53,7 @@ p
|
||||||
+tag-new(2)
|
+tag-new(2)
|
||||||
|
|
||||||
p
|
p
|
||||||
| The populate a model's vocabulary, you can use the
|
| To populate a model's vocabulary, you can use the
|
||||||
| #[+api("cli#vocab") #[code spacy vocab]] command and load in a
|
| #[+api("cli#vocab") #[code spacy vocab]] command and load in a
|
||||||
| #[+a("https://jsonlines.readthedocs.io/en/latest/") newline-delimited JSON]
|
| #[+a("https://jsonlines.readthedocs.io/en/latest/") newline-delimited JSON]
|
||||||
| (JSONL) file containing one lexical entry per line. The first line
|
| (JSONL) file containing one lexical entry per line. The first line
|
||||||
|
|
|
@ -16,7 +16,9 @@
|
||||||
|
|
||||||
+qs({package: 'source'}) git clone https://github.com/explosion/spaCy
|
+qs({package: 'source'}) git clone https://github.com/explosion/spaCy
|
||||||
+qs({package: 'source'}) cd spaCy
|
+qs({package: 'source'}) cd spaCy
|
||||||
+qs({package: 'source'}) export PYTHONPATH=`pwd`
|
+qs({package: 'source', os: 'mac'}) export PYTHONPATH=`pwd`
|
||||||
|
+qs({package: 'source', os: 'linux'}) export PYTHONPATH=`pwd`
|
||||||
|
+qs({package: 'source', os: 'windows'}) set PYTHONPATH=/path/to/spaCy
|
||||||
+qs({package: 'source'}) pip install -r requirements.txt
|
+qs({package: 'source'}) pip install -r requirements.txt
|
||||||
+qs({package: 'source'}) python setup.py build_ext --inplace
|
+qs({package: 'source'}) python setup.py build_ext --inplace
|
||||||
|
|
||||||
|
|
|
@ -184,7 +184,7 @@ p
|
||||||
|
|
||||||
p
|
p
|
||||||
| In versions before v2.1.0, the semantics of the #[code +] and #[code *] operators
|
| In versions before v2.1.0, the semantics of the #[code +] and #[code *] operators
|
||||||
| behave inconsistently. They were usually interpretted
|
| behave inconsistently. They were usually interpreted
|
||||||
| "greedily", i.e. longer matches are returned where possible. However, if
|
| "greedily", i.e. longer matches are returned where possible. However, if
|
||||||
| you specify two #[code +] and #[code *] patterns in a row and their
|
| you specify two #[code +] and #[code *] patterns in a row and their
|
||||||
| matches overlap, the first operator will behave non-greedily. This quirk
|
| matches overlap, the first operator will behave non-greedily. This quirk
|
||||||
|
@ -260,41 +260,6 @@ p
|
||||||
doc = nlp(u"This is a text about Google I/O 2015.")
|
doc = nlp(u"This is a text about Google I/O 2015.")
|
||||||
matches = matcher(doc)
|
matches = matcher(doc)
|
||||||
|
|
||||||
p
|
|
||||||
| In addition to mentions of "Google I/O", your data also contains some
|
|
||||||
| annoying pre-processing artefacts, like leftover HTML line breaks
|
|
||||||
| (e.g. #[code <br>] or #[code <BR/>]). While you're at it,
|
|
||||||
| you want to merge those into one token and flag them, to make sure you
|
|
||||||
| can easily ignore them later. So you add a second pattern and pass in a
|
|
||||||
| function #[code merge_and_flag]:
|
|
||||||
|
|
||||||
+code-exec.
|
|
||||||
import spacy
|
|
||||||
from spacy.matcher import Matcher
|
|
||||||
from spacy.tokens import Token
|
|
||||||
|
|
||||||
nlp = spacy.load('en_core_web_sm')
|
|
||||||
matcher = Matcher(nlp.vocab)
|
|
||||||
# register a new token extension to flag bad HTML
|
|
||||||
Token.set_extension('bad_html', default=False)
|
|
||||||
|
|
||||||
def merge_and_flag(matcher, doc, i, matches):
|
|
||||||
match_id, start, end = matches[i]
|
|
||||||
span = doc[start : end]
|
|
||||||
span.merge(is_stop=True) # merge (and mark it as a stop word, just in case)
|
|
||||||
for token in span:
|
|
||||||
token._.bad_html = True # mark token as bad HTML
|
|
||||||
print(span.text)
|
|
||||||
|
|
||||||
matcher.add('BAD_HTML', merge_and_flag,
|
|
||||||
[{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}],
|
|
||||||
[{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}])
|
|
||||||
|
|
||||||
doc = nlp(u"Hello<br>world!")
|
|
||||||
matches = matcher(doc)
|
|
||||||
for token in doc:
|
|
||||||
print(token.text, token._.bad_html)
|
|
||||||
|
|
||||||
+aside("Tip: Visualizing matches")
|
+aside("Tip: Visualizing matches")
|
||||||
| When working with entities, you can use #[+api("top-level#displacy") displaCy]
|
| When working with entities, you can use #[+api("top-level#displacy") displaCy]
|
||||||
| to quickly generate a NER visualization from your updated #[code Doc],
|
| to quickly generate a NER visualization from your updated #[code Doc],
|
||||||
|
@ -315,7 +280,7 @@ p
|
||||||
| that was matched, and invoke it.
|
| that was matched, and invoke it.
|
||||||
|
|
||||||
+code.
|
+code.
|
||||||
doc = nlp(LOTS_OF_TEXT)
|
doc = nlp(YOUR_TEXT_HERE)
|
||||||
matcher(doc)
|
matcher(doc)
|
||||||
|
|
||||||
p
|
p
|
||||||
|
@ -348,6 +313,69 @@ p
|
||||||
| A list of #[code (match_id, start, end)] tuples, describing the
|
| A list of #[code (match_id, start, end)] tuples, describing the
|
||||||
| matches. A match tuple describes a span #[code doc[start:end]].
|
| matches. A match tuple describes a span #[code doc[start:end]].
|
||||||
|
|
||||||
|
+h(3, "matcher-pipeline") Using custom pipeline components
|
||||||
|
|
||||||
|
p
|
||||||
|
| Let's say your data also contains some annoying pre-processing artefacts,
|
||||||
|
| like leftover HTML line breaks (e.g. #[code <br>] or
|
||||||
|
| #[code <BR/>]). To make your text easier to analyse, you want to
|
||||||
|
| merge those into one token and flag them, to make sure you
|
||||||
|
| can ignore them later. Ideally, this should all be done automatically
|
||||||
|
| as you process the text. You can achieve this by adding a
|
||||||
|
| #[+a("/usage/processing-pipelines#custom-components") custom pipeline component]
|
||||||
|
| that's called on each #[code Doc] object, merges the leftover HTML spans
|
||||||
|
| and sets an attribute #[code bad_html] on the token.
|
||||||
|
|
||||||
|
+code-exec.
|
||||||
|
import spacy
|
||||||
|
from spacy.matcher import Matcher
|
||||||
|
from spacy.tokens import Token
|
||||||
|
|
||||||
|
# we're using a class because the component needs to be initialised with
|
||||||
|
# the shared vocab via the nlp object
|
||||||
|
class BadHTMLMerger(object):
|
||||||
|
def __init__(self, nlp):
|
||||||
|
# register a new token extension to flag bad HTML
|
||||||
|
Token.set_extension('bad_html', default=False)
|
||||||
|
self.matcher = Matcher(nlp.vocab)
|
||||||
|
self.matcher.add('BAD_HTML', None,
|
||||||
|
[{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}],
|
||||||
|
[{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}])
|
||||||
|
|
||||||
|
def __call__(self, doc):
|
||||||
|
# this method is invoked when the component is called on a Doc
|
||||||
|
matches = self.matcher(doc)
|
||||||
|
spans = [] # collect the matched spans here
|
||||||
|
for match_id, start, end in matches:
|
||||||
|
spans.append(doc[start:end])
|
||||||
|
for span in spans:
|
||||||
|
span.merge(is_stop=True) # merge (and mark it as a stop word)
|
||||||
|
for token in span:
|
||||||
|
token._.bad_html = True # mark token as bad HTML
|
||||||
|
return doc
|
||||||
|
|
||||||
|
nlp = spacy.load('en_core_web_sm')
|
||||||
|
html_merger = BadHTMLMerger(nlp)
|
||||||
|
nlp.add_pipe(html_merger, last=True) # add component to the pipeline
|
||||||
|
doc = nlp(u"Hello<br>world! <br/> This is a test.")
|
||||||
|
for token in doc:
|
||||||
|
print(token.text, token._.bad_html)
|
||||||
|
|
||||||
|
p
|
||||||
|
| Instead of hard-coding the patterns into the component, you could also
|
||||||
|
| make it take a path to a JSON file containing the patterns. This lets
|
||||||
|
| you reuse the component with different patterns, depending on your
|
||||||
|
| application:
|
||||||
|
|
||||||
|
+code.
|
||||||
|
html_merger = BadHTMLMerger(nlp, path='/path/to/patterns.json')
|
||||||
|
|
||||||
|
+infobox
|
||||||
|
| For more details and examples of how to
|
||||||
|
| #[strong create custom pipeline components] and
|
||||||
|
| #[strong extension attributes], see the
|
||||||
|
| #[+a("/usage/processing-pipelines") usage guide].
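The docs text above suggests passing a path to a JSON file instead of hard-coding the patterns, but the `BadHTMLMerger` shown in the diff does not accept a `path` argument. One possible way to support it is sketched below, assuming the file holds a JSON list of token patterns. The helper name `load_patterns` and the file layout are assumptions, not part of spaCy. Only the standard library is used so the loading step can be tried without a model.

```python
import json
import os
import tempfile


def load_patterns(path):
    """Read a JSON file that holds a list of token patterns (hypothetical helper)."""
    with open(path, "r", encoding="utf8") as f:
        patterns = json.load(f)
    if not isinstance(patterns, list):
        raise ValueError("expected a JSON list of patterns")
    return patterns


# Write the two patterns from the example above to a temporary file.
example = [
    [{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}],
    [{"ORTH": "<"}, {"LOWER": "br/"}, {"ORTH": ">"}],
]
tmp_dir = tempfile.mkdtemp()
patterns_path = os.path.join(tmp_dir, "patterns.json")
with open(patterns_path, "w", encoding="utf8") as f:
    json.dump(example, f)

patterns = load_patterns(patterns_path)
print(len(patterns))

# Inside the component's __init__, the loaded patterns could then be
# registered the same way as the hard-coded ones:
#     self.matcher.add('BAD_HTML', None, *patterns)
```

Keeping the loader separate from the component makes it straightforward to check a patterns file before it is wired into the pipeline.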
|
||||||
|
|
||||||
+h(3, "regex") Using regular expressions
|
+h(3, "regex") Using regular expressions
|
||||||
|
|
||||||
p
|
p
|
||||||
|
|
|
@ -52,7 +52,7 @@ p
|
||||||
|
|
||||||
+code(false, "bash").
|
+code(false, "bash").
|
||||||
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz
|
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz
|
||||||
python -m spacy init-model /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz
|
python -m spacy init-model en /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz
|
||||||
|
|
||||||
p
|
p
|
||||||
| This will output a spaCy model in the directory
|
| This will output a spaCy model in the directory
|
||||||
|
|