mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-12 02:06:31 +03:00
Merge branch 'master' into develop
This commit is contained in:
commit
330c039106
106
.github/contributors/BigstickCarpet.md
vendored
Normal file
106
.github/contributors/BigstickCarpet.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [ X] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | James Messinger |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | May 23, 2018 |
|
||||
| GitHub username | BigstickCarpet |
|
||||
| Website (optional) | |
|
106
.github/contributors/aristorinjuang.md
vendored
Normal file
106
.github/contributors/aristorinjuang.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [x] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------------- |
|
||||
| Name | Aristo Rinjuang |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | May 22, 2018 |
|
||||
| GitHub username | aristorinjuang |
|
||||
| Website (optional) | https://aristorinjuang.com |
|
106
.github/contributors/armsp.md
vendored
Normal file
106
.github/contributors/armsp.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Shantam |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 21/5/2018 |
|
||||
| GitHub username | armsp |
|
||||
| Website (optional) | |
|
106
.github/contributors/idealley.md
vendored
Normal file
106
.github/contributors/idealley.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [x] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Pouyt Samuel |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 26.05.2018 |
|
||||
| GitHub username | Idealley |
|
||||
| Website (optional) | |
|
|
@ -118,7 +118,7 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
|
|||
optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)
|
||||
nlp._optimizer = None
|
||||
|
||||
print("Itn.\tP.Loss\tN.Loss\tUAS\tNER P.\tNER R.\tNER F.\tTag %\tToken %")
|
||||
print("Itn. Dep Loss NER Loss UAS NER P. NER R. NER F. Tag % Token % CPU WPS GPU WPS")
|
||||
try:
|
||||
for i in range(n_iter):
|
||||
train_docs = corpus.train_docs(nlp, noise_level=0.0,
|
||||
|
@ -208,17 +208,17 @@ def print_progress(itn, losses, dev_scores, cpu_wps=0.0, gpu_wps=0.0):
|
|||
scores.update(dev_scores)
|
||||
scores['cpu_wps'] = cpu_wps
|
||||
scores['gpu_wps'] = gpu_wps or 0.0
|
||||
tpl = '\t'.join((
|
||||
'{:d}',
|
||||
'{dep_loss:.3f}',
|
||||
'{ner_loss:.3f}',
|
||||
'{uas:.3f}',
|
||||
'{ents_p:.3f}',
|
||||
'{ents_r:.3f}',
|
||||
'{ents_f:.3f}',
|
||||
'{tags_acc:.3f}',
|
||||
'{token_acc:.3f}',
|
||||
'{cpu_wps:.1f}',
|
||||
tpl = ''.join((
|
||||
'{:<6d}',
|
||||
'{dep_loss:<10.3f}',
|
||||
'{ner_loss:<10.3f}',
|
||||
'{uas:<8.3f}',
|
||||
'{ents_p:<8.3f}',
|
||||
'{ents_r:<8.3f}',
|
||||
'{ents_f:<8.3f}',
|
||||
'{tags_acc:<8.3f}',
|
||||
'{token_acc:<9.3f}',
|
||||
'{cpu_wps:<9.1f}',
|
||||
'{gpu_wps:.1f}',
|
||||
))
|
||||
print(tpl.format(itn, **scores))
|
||||
|
|
|
@ -4,19 +4,10 @@ from __future__ import unicode_literals
|
|||
from ...attrs import LIKE_NUM
|
||||
|
||||
|
||||
_num_words = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
|
||||
'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
|
||||
'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty',
|
||||
'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety',
|
||||
'hundred', 'thousand', 'million', 'billion', 'trillion', 'quadrillion',
|
||||
'gajillion', 'bazillion',
|
||||
'nol', 'satu', 'dua', 'tiga', 'empat', 'lima', 'enam', 'tujuh',
|
||||
'delapan', 'sembilan', 'sepuluh', 'sebelas', 'duabelas', 'tigabelas',
|
||||
'empatbelas', 'limabelas', 'enambelas', 'tujuhbelas', 'delapanbelas',
|
||||
'sembilanbelas', 'duapuluh', 'seratus', 'seribu', 'sejuta',
|
||||
'ribu', 'rb', 'juta', 'jt', 'miliar', 'biliun', 'triliun',
|
||||
'kuadriliun', 'kuintiliun', 'sekstiliun', 'septiliun', 'oktiliun',
|
||||
'noniliun', 'desiliun']
|
||||
_num_words = ['nol', 'satu', 'dua', 'tiga', 'empat', 'lima', 'enam', 'tujuh',
|
||||
'delapan', 'sembilan', 'sepuluh', 'sebelas', 'belas', 'puluh',
|
||||
'ratus', 'ribu', 'juta', 'miliar', 'biliun', 'triliun', 'kuadriliun',
|
||||
'kuintiliun', 'sekstiliun', 'septiliun', 'oktiliun', 'noniliun', 'desiliun']
|
||||
|
||||
|
||||
def like_num(text):
|
||||
|
|
|
@ -1,14 +1,7 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
_exc = {
|
||||
"Rp": "$",
|
||||
"IDR": "$",
|
||||
"RMB": "$",
|
||||
"USD": "$",
|
||||
"AUD": "$",
|
||||
"GBP": "$",
|
||||
}
|
||||
_exc = {}
|
||||
|
||||
NORM_EXCEPTIONS = {}
|
||||
|
||||
|
|
|
@ -5,7 +5,7 @@ import regex as re
|
|||
|
||||
from ._tokenizer_exceptions_list import ID_BASE_EXCEPTIONS
|
||||
from ..tokenizer_exceptions import URL_PATTERN
|
||||
from ...symbols import ORTH
|
||||
from ...symbols import ORTH, LEMMA, NORM
|
||||
|
||||
|
||||
_exc = {}
|
||||
|
@ -29,17 +29,58 @@ for orth in ID_BASE_EXCEPTIONS:
|
|||
orth_caps = '-'.join([part.upper() for part in orth.split('-')])
|
||||
_exc[orth_caps] = [{ORTH: orth_caps}]
|
||||
|
||||
for exc_data in [
|
||||
{ORTH: "CKG", LEMMA: "Cakung", NORM: "Cakung"},
|
||||
{ORTH: "CGP", LEMMA: "Grogol Petamburan", NORM: "Grogol Petamburan"},
|
||||
{ORTH: "KSU", LEMMA: "Kepulauan Seribu Utara", NORM: "Kepulauan Seribu Utara"},
|
||||
{ORTH: "KYB", LEMMA: "Kebayoran Baru", NORM: "Kebayoran Baru"},
|
||||
{ORTH: "TJP", LEMMA: "Tanjungpriok", NORM: "Tanjungpriok"},
|
||||
{ORTH: "TNA", LEMMA: "Tanah Abang", NORM: "Tanah Abang"},
|
||||
|
||||
{ORTH: "BEK", LEMMA: "Bengkayang", NORM: "Bengkayang"},
|
||||
{ORTH: "KTP", LEMMA: "Ketapang", NORM: "Ketapang"},
|
||||
{ORTH: "MPW", LEMMA: "Mempawah", NORM: "Mempawah"},
|
||||
{ORTH: "NGP", LEMMA: "Nanga Pinoh", NORM: "Nanga Pinoh"},
|
||||
{ORTH: "NBA", LEMMA: "Ngabang", NORM: "Ngabang"},
|
||||
{ORTH: "PTK", LEMMA: "Pontianak", NORM: "Pontianak"},
|
||||
{ORTH: "PTS", LEMMA: "Putussibau", NORM: "Putussibau"},
|
||||
{ORTH: "SBS", LEMMA: "Sambas", NORM: "Sambas"},
|
||||
{ORTH: "SAG", LEMMA: "Sanggau", NORM: "Sanggau"},
|
||||
{ORTH: "SED", LEMMA: "Sekadau", NORM: "Sekadau"},
|
||||
{ORTH: "SKW", LEMMA: "Singkawang", NORM: "Singkawang"},
|
||||
{ORTH: "STG", LEMMA: "Sintang", NORM: "Sintang"},
|
||||
{ORTH: "SKD", LEMMA: "Sukadane", NORM: "Sukadane"},
|
||||
{ORTH: "SRY", LEMMA: "Sungai Raya", NORM: "Sungai Raya"},
|
||||
|
||||
{ORTH: "Jan.", LEMMA: "Januari", NORM: "Januari"},
|
||||
{ORTH: "Feb.", LEMMA: "Februari", NORM: "Februari"},
|
||||
{ORTH: "Mar.", LEMMA: "Maret", NORM: "Maret"},
|
||||
{ORTH: "Apr.", LEMMA: "April", NORM: "April"},
|
||||
{ORTH: "Jun.", LEMMA: "Juni", NORM: "Juni"},
|
||||
{ORTH: "Jul.", LEMMA: "Juli", NORM: "Juli"},
|
||||
{ORTH: "Agu.", LEMMA: "Agustus", NORM: "Agustus"},
|
||||
{ORTH: "Ags.", LEMMA: "Agustus", NORM: "Agustus"},
|
||||
{ORTH: "Sep.", LEMMA: "September", NORM: "September"},
|
||||
{ORTH: "Okt.", LEMMA: "Oktober", NORM: "Oktober"},
|
||||
{ORTH: "Nov.", LEMMA: "November", NORM: "November"},
|
||||
{ORTH: "Des.", LEMMA: "Desember", NORM: "Desember"}]:
|
||||
_exc[exc_data[ORTH]] = [exc_data]
|
||||
|
||||
for orth in [
|
||||
"'d", "a.m.", "Adm.", "Bros.", "co.", "Co.", "Corp.", "D.C.", "Dr.", "e.g.",
|
||||
"E.g.", "E.G.", "Gen.", "Gov.", "i.e.", "I.e.", "I.E.", "Inc.", "Jr.",
|
||||
"Ltd.", "Md.", "Messrs.", "Mo.", "Mont.", "Mr.", "Mrs.", "Ms.", "p.m.",
|
||||
"Ph.D.", "Rep.", "Rev.", "Sen.", "St.", "vs.",
|
||||
"A.AB.", "A.Ma.", "A.Md.", "A.Md.Keb.", "A.Md.Kep.", "A.P.",
|
||||
"B.A.", "B.Ch.E.", "B.Sc.", "Dr.", "Dra.", "Drs.", "Hj.", "Ka.", "Kp.",
|
||||
"M.Ag.", "M.Hum.", "M.Kes,", "M.Kom.", "M.M.", "M.P.", "M.Pd.", "M.Sc.",
|
||||
"M.Si.", "M.Sn.", "M.T.", "M.Th.", "No.", "Pjs.", "Plt.", "R.A.", "S.Ag.",
|
||||
"S.E.", "S.H.", "S.Hut.", "S.K.M.", "S.Kedg.", "S.Kedh.", "S.Kom.",
|
||||
"S.Pd.", "S.Pol.", "S.Psi.", "S.S.", "S.Sos.", "S.T.", "S.Tekp.", "S.Th.",
|
||||
"M.AB", "M.Ag.", "M.AP", "M.Arl", "M.A.R.S", "M.Hum.", "M.I.Kom.", "M.Kes,",
|
||||
"M.Kom.", "M.M.", "M.P.", "M.Pd.", "M.Psi.", "M.Psi.T.", "M.Sc.", "M.SArl",
|
||||
"M.Si.", "M.Sn.", "M.T.", "M.Th.", "No.", "Pjs.", "Plt.", "R.A.",
|
||||
"S.AB", "S.AP", "S.Adm", "S.Ag.", "S.Agr", "S.Ant", "S.Arl", "S.Ars",
|
||||
"S.A.R.S", "S.Ds", "S.E.", "S.E.I.", "S.Farm", "S.Gz.", "S.H.", "S.Han",
|
||||
"S.H.Int", "S.Hum", "S.Hut.", "S.In.", "S.IK.", "S.I.Kom.", "S.I.P", "S.IP",
|
||||
"S.P.", "S.Pt", "S.Psi", "S.Ptk", "S.Keb", "S.Ked", "S.Kep", "S.KG", "S.KH",
|
||||
"S.Kel", "S.K.M.", "S.Kedg.", "S.Kedh.", "S.Kom.", "S.KPM", "S.Mb", "S.Mat",
|
||||
"S.Par", "S.Pd.", "S.Pd.I.", "S.Pd.SD", "S.Pol.", "S.Psi.", "S.S.", "S.SArl.",
|
||||
"S.Sn", "S.Si.", "S.Si.Teol.", "S.SI.", "S.ST.", "S.ST.Han", "S.STP", "S.Sos.",
|
||||
"S.Sy.", "S.T.", "S.T.Han", "S.Th.", "S.Th.I" "S.TI.", "S.T.P.", "S.TrK",
|
||||
"S.Tekp.", "S.Th.",
|
||||
"a.l.", "a.n.", "a.s.", "b.d.", "d.a.", "d.l.", "d/h", "dkk.", "dll.",
|
||||
"dr.", "drh.", "ds.", "dsb.", "dst.", "faks.", "fax.", "hlm.", "i/o",
|
||||
"n.b.", "p.p." "pjs.", "s.d.", "tel.", "u.p.",
|
||||
|
|
42
spacy/lang/ro/lex_attrs.py
Normal file
42
spacy/lang/ro/lex_attrs.py
Normal file
|
@ -0,0 +1,42 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...attrs import LIKE_NUM
|
||||
|
||||
|
||||
_num_words = set("""
|
||||
zero unu doi două trei patru cinci șase șapte opt nouă zece
|
||||
unsprezece doisprezece douăsprezece treisprezece patrusprezece cincisprezece șaisprezece șaptesprezece optsprezece nouăsprezece
|
||||
douăzeci treizeci patruzeci cincizeci șaizeci șaptezeci optzeci nouăzeci
|
||||
sută mie milion miliard bilion trilion cvadrilion catralion cvintilion sextilion septilion enșpemii
|
||||
""".split())
|
||||
|
||||
_ordinal_words = set("""
|
||||
primul doilea treilea patrulea cincilea șaselea șaptelea optulea nouălea zecelea
|
||||
prima doua treia patra cincia șasea șaptea opta noua zecea
|
||||
unsprezecelea doisprezecelea treisprezecelea patrusprezecelea cincisprezecelea șaisprezecelea șaptesprezecelea optsprezecelea nouăsprezecelea
|
||||
unsprezecea douăsprezecea treisprezecea patrusprezecea cincisprezecea șaisprezecea șaptesprezecea optsprezecea nouăsprezecea
|
||||
douăzecilea treizecilea patruzecilea cincizecilea șaizecilea șaptezecilea optzecilea nouăzecilea sutălea
|
||||
douăzecea treizecea patruzecea cincizecea șaizecea șaptezecea optzecea nouăzecea suta
|
||||
miilea mielea mia milionulea milioana miliardulea miliardelea miliarda enșpemia
|
||||
""".split())
|
||||
|
||||
|
||||
def like_num(text):
|
||||
text = text.replace(',', '').replace('.', '')
|
||||
if text.isdigit():
|
||||
return True
|
||||
if text.count('/') == 1:
|
||||
num, denom = text.split('/')
|
||||
if num.isdigit() and denom.isdigit():
|
||||
return True
|
||||
if text.lower() in _num_words:
|
||||
return True
|
||||
if text.lower() in _ordinal_words:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
LEX_ATTRS = {
|
||||
LIKE_NUM: like_num
|
||||
}
|
|
@ -9,8 +9,9 @@ _exc = {}
|
|||
|
||||
# Source: https://en.wiktionary.org/wiki/Category:Romanian_abbreviations
|
||||
for orth in [
|
||||
"1-a", "1-ul", "10-a", "10-lea", "2-a", "3-a", "3-lea", "6-lea",
|
||||
"d-voastră", "dvs.", "Rom.", "str."]:
|
||||
"1-a", "2-a", "3-a", "4-a", "5-a", "6-a", "7-a", "8-a", "9-a", "10-a", "11-a", "12-a",
|
||||
"1-ul", "2-lea", "3-lea", "4-lea", "5-lea", "6-lea", "7-lea", "8-lea", "9-lea", "10-lea", "11-lea", "12-lea",
|
||||
"d-voastră", "dvs.", "ing.", "dr.", "Rom.", "str.", "nr.", "etc.", "d.p.d.v.", "dpdv", "șamd.", "ș.a.m.d."]:
|
||||
_exc[orth] = [{ORTH: orth}]
|
||||
|
||||
|
||||
|
|
|
@ -15,7 +15,7 @@ from .. import util
|
|||
# here if it's using spaCy's tokenizer (not a different library)
|
||||
# TODO: re-implement generic tokenizer tests
|
||||
_languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
|
||||
'it', 'nb', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr', 'ar', 'xx']
|
||||
'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'xx']
|
||||
|
||||
_models = {'en': ['en_core_web_sm'],
|
||||
'de': ['de_core_news_md'],
|
||||
|
|
25
spacy/tests/lang/ro/test_tokenizer.py
Normal file
25
spacy/tests/lang/ro/test_tokenizer.py
Normal file
|
@ -0,0 +1,25 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
DEFAULT_TESTS = [
|
||||
('Adresa este str. Principală nr. 5.', ['Adresa', 'este', 'str.', 'Principală', 'nr.', '5', '.']),
|
||||
('Teste, etc.', ['Teste', ',', 'etc.']),
|
||||
('Lista, ș.a.m.d.', ['Lista', ',', 'ș.a.m.d.']),
|
||||
('Și d.p.d.v. al...', ['Și', 'd.p.d.v.', 'al', '...'])
|
||||
]
|
||||
|
||||
NUMBER_TESTS = [
|
||||
('Clasa a 4-a.', ['Clasa', 'a', '4-a', '.']),
|
||||
('Al 12-lea ceas.', ['Al', '12-lea', 'ceas', '.'])
|
||||
]
|
||||
|
||||
TESTCASES = DEFAULT_TESTS + NUMBER_TESTS
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
|
||||
def test_tokenizer_handles_testcases(ro_tokenizer, text, expected_tokens):
|
||||
tokens = ro_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
|
@ -53,7 +53,7 @@ p
|
|||
+tag-new(2)
|
||||
|
||||
p
|
||||
| The populate a model's vocabulary, you can use the
|
||||
| To populate a model's vocabulary, you can use the
|
||||
| #[+api("cli#vocab") #[code spacy vocab]] command and load in a
|
||||
| #[+a("https://jsonlines.readthedocs.io/en/latest/") newline-delimited JSON]
|
||||
| (JSONL) file containing one lexical entry per line. The first line
|
||||
|
|
|
@ -16,7 +16,9 @@
|
|||
|
||||
+qs({package: 'source'}) git clone https://github.com/explosion/spaCy
|
||||
+qs({package: 'source'}) cd spaCy
|
||||
+qs({package: 'source'}) export PYTHONPATH=`pwd`
|
||||
+qs({package: 'source', os: 'mac'}) export PYTHONPATH=`pwd`
|
||||
+qs({package: 'source', os: 'linux'}) export PYTHONPATH=`pwd`
|
||||
+qs({package: 'source', os: 'windows'}) set PYTHONPATH=/path/to/spaCy
|
||||
+qs({package: 'source'}) pip install -r requirements.txt
|
||||
+qs({package: 'source'}) python setup.py build_ext --inplace
|
||||
|
||||
|
|
|
@ -184,7 +184,7 @@ p
|
|||
|
||||
p
|
||||
| In versions before v2.1.0, the semantics of the #[code +] and #[code *] operators
|
||||
| behave inconsistently. They were usually interpretted
|
||||
| behave inconsistently. They were usually interpreted
|
||||
| "greedily", i.e. longer matches are returned where possible. However, if
|
||||
| you specify two #[code +] and #[code *] patterns in a row and their
|
||||
| matches overlap, the first operator will behave non-greedily. This quirk
|
||||
|
@ -260,41 +260,6 @@ p
|
|||
doc = nlp(u"This is a text about Google I/O 2015.")
|
||||
matches = matcher(doc)
|
||||
|
||||
p
|
||||
| In addition to mentions of "Google I/O", your data also contains some
|
||||
| annoying pre-processing artefacts, like leftover HTML line breaks
|
||||
| (e.g. #[code <br>] or #[code <BR/>]). While you're at it,
|
||||
| you want to merge those into one token and flag them, to make sure you
|
||||
| can easily ignore them later. So you add a second pattern and pass in a
|
||||
| function #[code merge_and_flag]:
|
||||
|
||||
+code-exec.
|
||||
import spacy
|
||||
from spacy.matcher import Matcher
|
||||
from spacy.tokens import Token
|
||||
|
||||
nlp = spacy.load('en_core_web_sm')
|
||||
matcher = Matcher(nlp.vocab)
|
||||
# register a new token extension to flag bad HTML
|
||||
Token.set_extension('bad_html', default=False)
|
||||
|
||||
def merge_and_flag(matcher, doc, i, matches):
|
||||
match_id, start, end = matches[i]
|
||||
span = doc[start : end]
|
||||
span.merge(is_stop=True) # merge (and mark it as a stop word, just in case)
|
||||
for token in span:
|
||||
token._.bad_html = True # mark token as bad HTML
|
||||
print(span.text)
|
||||
|
||||
matcher.add('BAD_HTML', merge_and_flag,
|
||||
[{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}],
|
||||
[{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}])
|
||||
|
||||
doc = nlp(u"Hello<br>world!")
|
||||
matches = matcher(doc)
|
||||
for token in doc:
|
||||
print(token.text, token._.bad_html)
|
||||
|
||||
+aside("Tip: Visualizing matches")
|
||||
| When working with entities, you can use #[+api("top-level#displacy") displaCy]
|
||||
| to quickly generate a NER visualization from your updated #[code Doc],
|
||||
|
@ -315,7 +280,7 @@ p
|
|||
| that was matched, and invoke it.
|
||||
|
||||
+code.
|
||||
doc = nlp(LOTS_OF_TEXT)
|
||||
doc = nlp(YOUR_TEXT_HERE)
|
||||
matcher(doc)
|
||||
|
||||
p
|
||||
|
@ -348,6 +313,69 @@ p
|
|||
| A list of #[code (match_id, start, end)] tuples, describing the
|
||||
| matches. A match tuple describes a span #[code doc[start:end]].
|
||||
|
||||
+h(3, "matcher-pipeline") Using custom pipeline components
|
||||
|
||||
p
|
||||
| Let's say your data also contains some annoying pre-processing artefacts,
|
||||
| like leftover HTML line breaks (e.g. #[code <br>] or
|
||||
| #[code <BR/>]). To make your text easier to analyse, you want to
|
||||
| merge those into one token and flag them, to make sure you
|
||||
| can ignore them later. Ideally, this should all be done automatically
|
||||
| as you process the text. You can achieve this by adding a
|
||||
| #[+a("/usage/processing-pipelines#custom-components") custom pipeline component]
|
||||
| that's called on each #[code Doc] object, merges the leftover HTML spans
|
||||
| and sets an attribute #[code bad_html] on the token.
|
||||
|
||||
+code-exec.
|
||||
import spacy
|
||||
from spacy.matcher import Matcher
|
||||
from spacy.tokens import Token
|
||||
|
||||
# we're using a class because the component needs to be initialised with
|
||||
# the shared vocab via the nlp object
|
||||
class BadHTMLMerger(object):
|
||||
def __init__(self, nlp):
|
||||
# register a new token extension to flag bad HTML
|
||||
Token.set_extension('bad_html', default=False)
|
||||
self.matcher = Matcher(nlp.vocab)
|
||||
self.matcher.add('BAD_HTML', None,
|
||||
[{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}],
|
||||
[{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}])
|
||||
|
||||
def __call__(self, doc):
|
||||
# this method is invoked when the component is called on a Doc
|
||||
matches = self.matcher(doc)
|
||||
spans = [] # collect the matched spans here
|
||||
for match_id, start, end in matches:
|
||||
spans.append(doc[start:end])
|
||||
for span in spans:
|
||||
span.merge(is_stop=True) # merge (and mark it as a stop word)
|
||||
for token in span:
|
||||
token._.bad_html = True # mark token as bad HTML
|
||||
return doc
|
||||
|
||||
nlp = spacy.load('en_core_web_sm')
|
||||
html_merger = BadHTMLMerger(nlp)
|
||||
nlp.add_pipe(html_merger, last=True) # add component to the pipeline
|
||||
doc = nlp(u"Hello<br>world! <br/> This is a test.")
|
||||
for token in doc:
|
||||
print(token.text, token._.bad_html)
|
||||
|
||||
p
|
||||
| Instead of hard-coding the patterns into the component, you could also
|
||||
| make it take a path to a JSON file containing the patterns. This lets
|
||||
| you reuse the component with different patterns, depending on your
|
||||
| application:
|
||||
|
||||
+code.
|
||||
html_merger = BadHTMLMerger(nlp, path='/path/to/patterns.json')
|
||||
|
||||
+infobox
|
||||
| For more details and examples of how to
|
||||
| #[strong create custom pipeline components] and
|
||||
| #[strong extension attributes], see the
|
||||
| #[+a("/usage/processing-pipelines") usage guide].
|
||||
|
||||
+h(3, "regex") Using regular expressions
|
||||
|
||||
p
|
||||
|
|
|
@ -52,7 +52,7 @@ p
|
|||
|
||||
+code(false, "bash").
|
||||
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz
|
||||
python -m spacy init-model /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz
|
||||
python -m spacy init-model en /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz
|
||||
|
||||
p
|
||||
| This will output a spaCy model in the directory
|
||||
|
|
Loading…
Reference in New Issue
Block a user