Merge branch 'master' into develop

Ines Montani, 2018-12-18 13:48:10 +01:00
commit 61d09c481b
56 changed files with 691923 additions and 334997 deletions

.github/contributors/Brixjohn.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [X] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Brixter John Lumabi |
| Company name (if applicable) | Stratpoint |
| Title or role (if applicable) | Software Developer |
| Date | 18 December 2018 |
| GitHub username | Brixjohn |
| Website (optional) | |

.github/contributors/amperinet.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement

(Agreement text identical to the copy in .github/contributors/Brixjohn.md above; only the signed statement and the contributor details differ.)
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ----------------------- |
| Name | Amandine Périnet |
| Company name (if applicable) | 365Talents |
| Title or role (if applicable) | Data Science Researcher |
| Date | 12/12/2018 |
| GitHub username | amperinet |
| Website (optional) | |

.github/contributors/beatesi.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement

(Agreement text identical to the copy in .github/contributors/Brixjohn.md above; only the signed statement and the contributor details differ.)
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Beate Sildnes |
| Company name (if applicable) | NAV |
| Title or role (if applicable) | Data Scientist |
| Date | 04.12.2018 |
| GitHub username | beatesi |
| Website (optional) | |

.github/contributors/chezou.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement

(Agreement text identical to the copy in .github/contributors/Brixjohn.md above; only the signed statement and the contributor details differ.)
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Aki Ariga |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 07/12/2018 |
| GitHub username | chezou |
| Website (optional) | chezo.uno |

.github/contributors/svlandeg.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement

(Agreement text identical to the copy in .github/contributors/Brixjohn.md above; only the signed statement and the contributor details differ.)
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Sofie Van Landeghem |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 29 Nov 2018 |
| GitHub username | svlandeg |
| Website (optional) | |

.github/contributors/wxv.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement

(Agreement text identical to the copy in .github/contributors/Brixjohn.md above; only the signed statement and the contributor details differ.)
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jason Xu |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-11-29 |
| GitHub username | wxv |
| Website (optional) | |


@@ -20,9 +20,10 @@ import os
import importlib
from keras import backend as K
def set_keras_backend(backend):
if K.backend() != backend:
os.environ["KERAS_BACKEND"] = backend
importlib.reload(K)
assert K.backend() == backend
if backend == "tensorflow":
@@ -32,6 +33,7 @@ def set_keras_backend(backend):
K.set_session(K.tf.Session(config=cfg))
K.clear_session()
set_keras_backend("tensorflow")
@@ -40,9 +42,8 @@ def train(train_loc, dev_loc, shape, settings):
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
print("Loading spaCy")
nlp = spacy.load("en_vectors_web_lg")
assert nlp.path is not None
print("Processing texts...")
train_X = create_dataset(nlp, train_texts1, train_texts2, 100, shape[0])
dev_X = create_dataset(nlp, dev_texts1, dev_texts2, 100, shape[0])
@@ -54,29 +55,28 @@ def train(train_loc, dev_loc, shape, settings):
model.fit(
train_X,
train_labels,
validation_data=(dev_X, dev_labels),
epochs=settings["nr_epoch"],
batch_size=settings["batch_size"],
)
if not (nlp.path / "similarity").exists():
(nlp.path / "similarity").mkdir()
print("Saving to", nlp.path / "similarity")
weights = model.get_weights()
# remove the embedding matrix. We can reconstruct it.
del weights[1]
with (nlp.path / "similarity" / "model").open("wb") as file_:
pickle.dump(weights, file_)
with (nlp.path / "similarity" / "config.json").open("w") as file_:
file_.write(model.to_json())
def evaluate(dev_loc, shape):
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
nlp = spacy.load("en_vectors_web_lg")
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / "similarity", nlp, shape[0]))
total = 0.0
correct = 0.0
for text1, text2, label in zip(dev_texts1, dev_texts2, dev_labels):
doc1 = nlp(text1)
doc2 = nlp(text2)
@@ -88,11 +88,11 @@ def evaluate(dev_loc, shape):
def demo(shape):
nlp = spacy.load("en_vectors_web_lg")
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / "similarity", nlp, shape[0]))
doc1 = nlp(u"The king of France is bald.")
doc2 = nlp(u"France has no king.")
print("Sentence 1:", doc1)
print("Sentence 2:", doc2)
@@ -101,30 +101,31 @@ def demo(shape):
print("Entailment type:", entailment_type, "(Confidence:", confidence, ")")
LABELS = {"entailment": 0, "contradiction": 1, "neutral": 2}
def read_snli(path):
texts1 = []
texts2 = []
labels = []
with open(path, "r") as file_:
for line in file_:
eg = json.loads(line)
label = eg["gold_label"]
if label == "-": # per Parikh, ignore - SNLI entries
continue
texts1.append(eg["sentence1"])
texts2.append(eg["sentence2"])
labels.append(LABELS[label])
return texts1, texts2, to_categorical(np.asarray(labels, dtype="int32"))
def create_dataset(nlp, texts, hypotheses, num_unk, max_length):
sents = texts + hypotheses
sents_as_ids = []
for sent in sents:
doc = nlp(sent)
word_ids = []
for i, token in enumerate(doc):
# skip odd spaces from tokenizer
if token.has_vector and token.vector_norm == 0:
@@ -140,13 +141,12 @@ def create_dataset(nlp, texts, hypotheses, num_unk, max_length):
word_ids.append(token.rank % num_unk + 1)
# there must be a simpler way of generating padded arrays from lists...
word_id_vec = np.zeros((max_length), dtype="int")
clipped_len = min(max_length, len(word_ids))
word_id_vec[:clipped_len] = word_ids[:clipped_len]
sents_as_ids.append(word_id_vec)
return [np.array(sents_as_ids[: len(texts)]), np.array(sents_as_ids[len(texts) :])]
@plac.annotations(
@@ -159,39 +159,49 @@ def create_dataset(nlp, texts, hypotheses, num_unk, max_length):
learn_rate=("Learning rate", "option", "r", float),
batch_size=("Batch size for neural network training", "option", "b", int),
nr_epoch=("Number of training epochs", "option", "e", int),
entail_dir=("Direction of entailment", "option", "D", str, ["both", "left", "right"])
entail_dir=(
"Direction of entailment",
"option",
"D",
str,
["both", "left", "right"],
),
)
def main(
mode,
train_loc,
dev_loc,
max_length=50,
nr_hidden=200,
dropout=0.2,
learn_rate=0.001,
batch_size=1024,
nr_epoch=10,
entail_dir="both",
):
shape = (max_length, nr_hidden, 3)
settings = {
"lr": learn_rate,
"dropout": dropout,
"batch_size": batch_size,
"nr_epoch": nr_epoch,
"entail_dir": entail_dir,
}
if mode == "train":
if train_loc == None or dev_loc == None:
print("Train mode requires paths to training and development data sets.")
sys.exit(1)
train(train_loc, dev_loc, shape, settings)
elif mode == "evaluate":
if dev_loc == None:
print("Evaluate mode requires paths to test data set.")
sys.exit(1)
correct, total = evaluate(dev_loc, shape)
print(correct, "/", total, correct / total)
else:
demo(shape)
if __name__ == "__main__":
plac.call(main)
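For readers unfamiliar with plac, the `@plac.annotations` block above maps each parameter of `main` to a command-line flag. A minimal, self-contained sketch of the same pattern (hypothetical script, not part of this diff):

```python
import plac


@plac.annotations(
    # each value is (help, kind, abbreviation, type), as in the diff above
    mode=("One of: train, evaluate, demo", "positional", None, str),
    nr_epoch=("Number of training epochs", "option", "e", int),
)
def main(mode, nr_epoch=10):
    print(mode, nr_epoch)


if __name__ == "__main__":
    # plac builds the CLI from the annotations, so this runs as
    # e.g. `python sketch.py train -e 5`
    plac.call(main)
```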


@@ -5,11 +5,12 @@ import numpy as np
from keras import layers, Model, models, optimizers
from keras import backend as K
def build_model(vectors, shape, settings):
max_length, nr_hidden, nr_class = shape
input1 = layers.Input(shape=(max_length,), dtype="int32", name="words1")
input2 = layers.Input(shape=(max_length,), dtype="int32", name="words2")
# embeddings (projected)
embed = create_embedding(vectors, max_length, nr_hidden)
@@ -23,11 +24,11 @@ def build_model(vectors, shape, settings):
G = create_feedforward(nr_hidden)
if settings["entail_dir"] == "both":
norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
alpha = layers.dot([norm_weights_a, a], axes=1)
beta = layers.dot([norm_weights_b, b], axes=1)
# step 2: compare
comp1 = layers.concatenate([a, beta])
@@ -40,7 +41,7 @@ def build_model(vectors, shape, settings):
v2_sum = layers.Lambda(sum_word)(v2)
concat = layers.concatenate([v1_sum, v2_sum])
elif settings["entail_dir"] == "left":
norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
alpha = layers.dot([norm_weights_a, a], axes=1)
comp2 = layers.concatenate([b, alpha])
@@ -50,7 +51,7 @@ def build_model(vectors, shape, settings):
else:
norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
beta = layers.dot([norm_weights_b, b], axes=1)
comp1 = layers.concatenate([a, beta])
v1 = layers.TimeDistributed(G)(comp1)
v1_sum = layers.Lambda(sum_word)(v1)
@@ -58,80 +59,86 @@ def build_model(vectors, shape, settings):
H = create_feedforward(nr_hidden)
out = H(concat)
out = layers.Dense(nr_class, activation="softmax")(out)
model = Model([input1, input2], out)
model.compile(
optimizer=optimizers.Adam(lr=settings["lr"]),
loss="categorical_crossentropy",
metrics=["accuracy"],
)
return model
def create_embedding(vectors, max_length, projected_dim):
return models.Sequential(
[
layers.Embedding(
vectors.shape[0],
vectors.shape[1],
input_length=max_length,
weights=[vectors],
trainable=False,
),
layers.TimeDistributed(
layers.Dense(projected_dim, activation=None, use_bias=False)
),
]
)
def create_feedforward(num_units=200, activation="relu", dropout_rate=0.2):
return models.Sequential(
[
layers.Dense(num_units, activation=activation),
layers.Dropout(dropout_rate),
layers.Dense(num_units, activation=activation),
layers.Dropout(dropout_rate),
]
)
def normalizer(axis):
def _normalize(att_weights):
exp_weights = K.exp(att_weights)
sum_weights = K.sum(exp_weights, axis=axis, keepdims=True)
return exp_weights / sum_weights
return _normalize
def sum_word(x):
return K.sum(x, axis=1)
def test_build_model():
vectors = np.ndarray((100, 8), dtype="float32")
shape = (10, 16, 3)
settings = {"lr": 0.001, "dropout": 0.2, "gru_encode": True, "entail_dir": "both"}
model = build_model(vectors, shape, settings)
def test_fit_model():
def _generate_X(nr_example, length, nr_vector):
X1 = np.ndarray((nr_example, length), dtype="int32")
X1 *= X1 < nr_vector
X1 *= 0 <= X1
X2 = np.ndarray((nr_example, length), dtype="int32")
X2 *= X2 < nr_vector
X2 *= 0 <= X2
return [X1, X2]
def _generate_Y(nr_example, nr_class):
ys = np.zeros((nr_example, nr_class), dtype="int32")
for i in range(nr_example):
ys[i, i % nr_class] = 1
return ys
vectors = np.ndarray((100, 8), dtype="float32")
shape = (10, 16, 3)
settings = {"lr": 0.001, "dropout": 0.2, "gru_encode": True, "entail_dir": "both"}
model = build_model(vectors, shape, settings)
train_X = _generate_X(20, shape[0], vectors.shape[0])

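The `normalizer(axis)` helper above is a hand-rolled softmax over the chosen attention axis. A minimal NumPy sketch of the same computation, with plain arrays standing in for Keras tensors:

```python
import numpy as np


def normalize(att_weights, axis):
    # exponentiate, then divide by the sum along `axis`, as in normalizer()
    exp_weights = np.exp(att_weights)
    return exp_weights / np.sum(exp_weights, axis=axis, keepdims=True)


w = np.array([[1.0, 2.0], [0.5, 0.5]])
print(normalize(w, axis=1).sum(axis=1))  # each row now sums to 1.0
```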

@@ -59,7 +59,7 @@ def main(model=None, output_dir=None, n_iter=100):
# reset and initialize the weights randomly but only if we're
# training a new model
if model is None:
nlp.begin_training()
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}


@@ -90,7 +90,8 @@ def main(model=None, output_dir=None, n_iter=20, n_texts=2000):
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
with nlp.use_params(optimizer.averages):
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model

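The change above writes the model to disk inside `nlp.use_params(optimizer.averages)`, so the averaged weights rather than the last raw update are saved. A hedged sketch of the pattern, assuming a spaCy v2 pipeline and the optimizer returned by `begin_training` (the blank pipeline and output path here are illustrative only):

```python
from pathlib import Path

import spacy

nlp = spacy.blank("en")  # hypothetical blank pipeline
optimizer = nlp.begin_training()
output_dir = Path("model")  # hypothetical output location
if not output_dir.exists():
    output_dir.mkdir()
with nlp.use_params(optimizer.averages):
    # parameters are temporarily swapped for their moving averages here
    nlp.to_disk(output_dir)
```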

@@ -98,6 +98,14 @@ if os.environ.get("USE_OPENMP", USE_OPENMP_DEFAULT) == "1":
COMPILE_OPTIONS["other"].append("-fopenmp")
LINK_OPTIONS["other"].append("-fopenmp")
if sys.platform == "darwin":
# On Mac, use libc++ because Apple deprecated use of
# libstdc++
COMPILE_OPTIONS["other"].append("-stdlib=libc++")
LINK_OPTIONS["other"].append("-lc++")
# g++ (used by unix compiler on mac) links to libstdc++ as a default lib.
# See: https://stackoverflow.com/questions/1653047/avoid-linking-to-libstdc
LINK_OPTIONS["other"].append("-nodefaultlibs")
# By subclassing build_extensions we have the actual compiler that will be used which is really known only after finalize_options
# http://stackoverflow.com/questions/724664/python-distutils-how-to-get-a-compiler-that-is-going-to-be-used
@@ -183,6 +191,7 @@ def setup_package():
for mod_name in MOD_NAMES:
mod_path = mod_name.replace(".", "/") + ".cpp"
extra_link_args = []
extra_compile_args = []
# ???
# Imported from patch from @mikepb
# See Issue #267. Running blind here...


@@ -4,6 +4,8 @@ from __future__ import unicode_literals
from ...gold import iob_to_biluo
from ...util import minibatch
import re
def iob2json(input_data, n_sents=10, *args, **kwargs):
"""
@@ -25,7 +27,8 @@ def read_iob(raw_sents):
for line in raw_sents:
if not line.strip():
continue
# tokens = [t.split("|") for t in line.split()]
tokens = [re.split("[^\w\-]", line.strip())]
if len(tokens[0]) == 3:
words, pos, iob = zip(*tokens)
else:

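A quick check of what the new pattern yields on a hypothetical pipe-delimited IOB line (the sample input is an assumption, not taken from the diff):

```python
import re

line = "Paris|NNP|B-LOC"
print(re.split(r"[^\w\-]", line.strip()))
# -> ['Paris', 'NNP', 'B-LOC']: '|' falls in the separator class, while the
# '-' in 'B-LOC' survives because '\-' is excluded from it
```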

@@ -1,6 +1,7 @@
# coding: utf8
from __future__ import unicode_literals
# Stop words from HAZM package
STOP_WORDS = set(

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large


@@ -1,86 +1,369 @@
# coding: utf8
from __future__ import unicode_literals

AUXILIARY_VERBS_IRREG = {
"été": ("être",),
"suis": ("être",),
"es": ("être",),
"est": ("être",),
"sommes": ("être",),
"êtes": ("être",),
"sont": ("être",),
"étais": ("être",),
"étais": ("être",),
"était": ("être",),
"étions": ("être",),
"étiez": ("être",),
"étaient": ("être",),
"fus": ("être",),
"fut": ("être",),
"fûmes": ("être",),
"fûtes": ("être",),
"furent": ("être",),
"serai": ("être",),
"seras": ("être",),
"sera": ("être",),
"serons": ("être",),
"serez": ("être",),
"seront": ("être",),
"serais": ("être",),
"serait": ("être",),
"serions": ("être",),
"seriez": ("être",),
"seraient": ("être",),
"sois": ("être",),
"soit": ("être",),
"soyons": ("être",),
"soyez": ("être",),
"soient": ("être",),
"fusse": ("être",),
"fusses": ("être",),
"fût": ("être",),
"fussions": ("être",),
"fussiez": ("être",),
"fussent": ("être",),
"étant": ("être",),
"ai": ("avoir",),
"as": ("avoir",),
"a": ("avoir",),
"avons": ("avoir",),
"avez": ("avoir",),
"ont": ("avoir",),
"avais": ("avoir",),
"avait": ("avoir",),
"avions": ("avoir",),
"aviez": ("avoir",),
"avaient": ("avoir",),
"eus": ("avoir",),
"eut": ("avoir",),
"eûmes": ("avoir",),
"eûtes": ("avoir",),
"eurent": ("avoir",),
"aurai": ("avoir",),
"auras": ("avoir",),
"aura": ("avoir",),
"aurons": ("avoir",),
"aurez": ("avoir",),
"auront": ("avoir",),
"aurais": ("avoir",),
"aurait": ("avoir",),
"aurions": ("avoir",),
"auriez": ("avoir",),
"auraient": ("avoir",),
"aie": ("avoir",),
"aies": ("avoir",),
"ait": ("avoir",),
"ayons": ("avoir",),
"ayez": ("avoir",),
"aient": ("avoir",),
"eusse": ("avoir",),
"eusses": ("avoir",),
"eût": ("avoir",),
"eussions": ("avoir",),
"eussiez": ("avoir",),
"eussent": ("avoir",),
"ayant": ("avoir",),
"eu": ("avoir",),
"eue": ("avoir",),
"eues": ("avoir",),
"devaient": ("devoir",),
"devais": ("devoir",),
"devait": ("devoir",),
"devant": ("devoir",),
"devez": ("devoir",),
"deviez": ("devoir",),
"devions": ("devoir",),
"devons": ("devoir",),
"devra": ("devoir",),
"devrai": ("devoir",),
"devraient": ("devoir",),
"devrais": ("devoir",),
"devrait": ("devoir",),
"devras": ("devoir",),
"devrez": ("devoir",),
"devriez": ("devoir",),
"devrions": ("devoir",),
"devrons": ("devoir",),
"devront": ("devoir",),
"dois": ("devoir",),
"doit": ("devoir",),
"doive": ("devoir",),
"doivent": ("devoir",),
"doives": ("devoir",),
"": ("devoir",),
"due": ("devoir",),
"dues": ("devoir",),
"dûmes": ("devoir",),
"durent": ("devoir",),
"dus": ("devoir",),
"dûs": ("devoir",),
"dusse": ("devoir",),
"dussent": ("devoir",),
"dusses": ("devoir",),
"dussiez": ("devoir",),
"dussions": ("devoir",),
"dut": ("devoir",),
"dût": ("devoir",),
"dûtes": ("devoir",),
"peut": ("pouvoir",),
"peuvent": ("pouvoir",),
"peux": ("pouvoir",),
"pourraient": ("pouvoir",),
"pourrai": ("pouvoir",),
"pourrais": ("pouvoir",),
"pourrait": ("pouvoir",),
"pourra": ("pouvoir",),
"pourras": ("pouvoir",),
"pourrez": ("pouvoir",),
"pourriez": ("pouvoir",),
"pourrions": ("pouvoir",),
"pourrons": ("pouvoir",),
"pourront": ("pouvoir",),
"pouvaient": ("pouvoir",),
"pouvais": ("pouvoir",),
"pouvait": ("pouvoir",),
"pouvez": ("pouvoir",),
"pouviez": ("pouvoir",),
"pouvions": ("pouvoir",),
"pouvons": ("pouvoir",),
"pûmes": ("pouvoir",),
"pu": ("pouvoir",),
"purent": ("pouvoir",),
"pus": ("pouvoir",),
"pûtes": ("pouvoir",),
"put": ("pouvoir",),
"pouvant": ("pouvoir",),
"puisse": ("pouvoir",),
"puissions": ("pouvoir",),
"puissiez": ("pouvoir",),
"puissent": ("pouvoir",),
"pusse": ("pouvoir",),
"pusses": ("pouvoir",),
"pussions": ("pouvoir",),
"pussiez": ("pouvoir",),
"pussent": ("pouvoir",),
"faisaient": ("faire",),
"faisais": ("faire",),
"faisait": ("faire",),
"faisant": ("faire",),
"fais": ("faire",),
"faisiez": ("faire",),
"faisions": ("faire",),
"faisons": ("faire",),
"faite": ("faire",),
"faites": ("faire",),
"fait": ("faire",),
"faits": ("faire",),
"fasse": ("faire",),
"fassent": ("faire",),
"fasses": ("faire",),
"fassiez": ("faire",),
"fassions": ("faire",),
"fera": ("faire",),
"feraient": ("faire",),
"ferai": ("faire",),
"ferais": ("faire",),
"ferait": ("faire",),
"feras": ("faire",),
"ferez": ("faire",),
"feriez": ("faire",),
"ferions": ("faire",),
"ferons": ("faire",),
"feront": ("faire",),
"fîmes": ("faire",),
"firent": ("faire",),
"fis": ("faire",),
"fisse": ("faire",),
"fissent": ("faire",),
"fisses": ("faire",),
"fissiez": ("faire",),
"fissions": ("faire",),
"fîtes": ("faire",),
"fit": ("faire",),
"fît": ("faire",),
"font": ("faire",),
"veuillent": ("vouloir",),
"veuilles": ("vouloir",),
"veuille": ("vouloir",),
"veuillez": ("vouloir",),
"veuillons": ("vouloir",),
"veulent": ("vouloir",),
"veut": ("vouloir",),
"veux": ("vouloir",),
"voudraient": ("vouloir",),
"voudrais": ("vouloir",),
"voudrait": ("vouloir",),
"voudrai": ("vouloir",),
"voudras": ("vouloir",),
"voudra": ("vouloir",),
"voudrez": ("vouloir",),
"voudriez": ("vouloir",),
"voudrions": ("vouloir",),
"voudrons": ("vouloir",),
"voudront": ("vouloir",),
"voulaient": ("vouloir",),
"voulais": ("vouloir",),
"voulait": ("vouloir",),
"voulant": ("vouloir",),
"voulez": ("vouloir",),
"vouliez": ("vouloir",),
"voulions": ("vouloir",),
"voulons": ("vouloir",),
"voulues": ("vouloir",),
"voulue": ("vouloir",),
"voulûmes": ("vouloir",),
"voulurent": ("vouloir",),
"voulussent": ("vouloir",),
"voulusses": ("vouloir",),
"voulusse": ("vouloir",),
"voulussiez": ("vouloir",),
"voulussions": ("vouloir",),
"voulus": ("vouloir",),
"voulûtes": ("vouloir",),
"voulut": ("vouloir",),
"voulût": ("vouloir",),
"voulu": ("vouloir",),
"sachant": ("savoir",),
"sachent": ("savoir",),
"sache": ("savoir",),
"saches": ("savoir",),
"sachez": ("savoir",),
"sachiez": ("savoir",),
"sachions": ("savoir",),
"sachons": ("savoir",),
"sais": ("savoir",),
"sait": ("savoir",),
"sauraient": ("savoir",),
"saurai": ("savoir",),
"saurais": ("savoir",),
"saurait": ("savoir",),
"saura": ("savoir",),
"sauras": ("savoir",),
"saurez": ("savoir",),
"sauriez": ("savoir",),
"saurions": ("savoir",),
"saurons": ("savoir",),
"sauront": ("savoir",),
"savaient": ("savoir",),
"savais": ("savoir",),
"savait": ("savoir",),
"savent": ("savoir",),
"savez": ("savoir",),
"saviez": ("savoir",),
"savions": ("savoir",),
"savons": ("savoir",),
"sue": ("savoir",),
"sues": ("savoir",),
"sûmes": ("savoir",),
"surent": ("savoir",),
"su": ("savoir",),
"sus": ("savoir",),
"sussent": ("savoir",),
"susse": ("savoir",),
"susses": ("savoir",),
"sussiez": ("savoir",),
"sussions": ("savoir",),
"sûtes": ("savoir",),
"sut": ("savoir",),
"sût": ("savoir",),
"venaient": ("venir",),
"venais": ("venir",),
"venait": ("venir",),
"venant": ("venir",),
"venez": ("venir",),
"veniez": ("venir",),
"venions": ("venir",),
"venons": ("venir",),
"venues": ("venir",),
"venue": ("venir",),
"venus": ("venir",),
"venu": ("venir",),
"viendraient": ("venir",),
"viendrais": ("venir",),
"viendrait": ("venir",),
"viendrai": ("venir",),
"viendras": ("venir",),
"viendra": ("venir",),
"viendrez": ("venir",),
"viendriez": ("venir",),
"viendrions": ("venir",),
"viendrons": ("venir",),
"viendront": ("venir",),
"viennent": ("venir",),
"viennes": ("venir",),
"vienne": ("venir",),
"viens": ("venir",),
"vient": ("venir",),
"vînmes": ("venir",),
"vinrent": ("venir",),
"vinssent": ("venir",),
"vinsses": ("venir",),
"vinsse": ("venir",),
"vinssiez": ("venir",),
"vinssions": ("venir",),
"vins": ("venir",),
"vîntes": ("venir",),
"vint": ("venir",),
"vînt": ("venir",),
"aille": ("aller",),
"aillent": ("aller",),
"ailles": ("aller",),
"alla": ("aller",),
"allai": ("aller",),
"allaient": ("aller",),
"allais": ("aller",),
"allait": ("aller",),
"allâmes": ("aller",),
"allant": ("aller",),
"allas": ("aller",),
"allasse": ("aller",),
"allassent": ("aller",),
"allasses": ("aller",),
"allassiez": ("aller",),
"allassions": ("aller",),
"allât": ("aller",),
"allâtes": ("aller",),
"allé": ("aller",),
"allée": ("aller",),
"allées": ("aller",),
"allèrent": ("aller",),
"allés": ("aller",),
"allez": ("aller",),
"allons": ("aller",),
"ira": ("aller",),
"irai": ("aller",),
"iraient": ("aller",),
"irais": ("aller",),
"irait": ("aller",),
"iras": ("aller",),
"irez": ("aller",),
"iriez": ("aller",),
"irions": ("aller",),
"irons": ("aller",),
"iront": ("aller",),
"va": ("aller",),
"vais": ("aller",),
"vas": ("aller",),
"vont": ("aller",)
}

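These tables map inflected surface forms directly to lemma tuples. A minimal sketch of the lookup (plain dict access; spaCy's full lemmatizer also applies the suffix rules and exception tables shown elsewhere in this diff):

```python
# hypothetical helper; AUXILIARY_VERBS_IRREG is the table defined above
def lemmatize_aux(form):
    return AUXILIARY_VERBS_IRREG.get(form, (form,))


print(lemmatize_aux("étaient"))    # -> ('être',)
print(lemmatize_aux("pourrions"))  # -> ('pouvoir',)
```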

@@ -2,10 +2,113 @@
from __future__ import unicode_literals
ADJECTIVE_RULES = [["s", ""], ["e", ""], ["es", ""]]
ADJECTIVE_RULES = [
["a", "a"],
["aux", "al"],
["c", "c"],
["d", "d"],
["e", ""],
["é", "é"],
["eux", "eux"],
["f", "f"],
["i", "i"],
["ï", "ï"],
["l", "l"],
["m", "m"],
["n", "n"],
["o", "o"],
["p", "p"],
["r", "r"],
["s", ""],
["t", "t"],
["u", "u"],
["y", "y"],
]
NOUN_RULES = [["s", ""]]
NOUN_RULES = [
["a", "a"],
["à", "à"],
["â", "â"],
["b", "b"],
["c", "c"],
["ç", "ç"],
["d", "d"],
["e", "e"],
["é", "é"],
["è", "è"],
["ê", "ê"],
["ë", "ë"],
["f", "f"],
["g", "g"],
["h", "h"],
["i", "i"],
["î", "î"],
["ï", "ï"],
["j", "j"],
["k", "k"],
["l", "l"],
["m", "m"],
["n", "n"],
["o", "o"],
["ô", "ö"],
["ö", "ö"],
["p", "p"],
["q", "q"],
["r", "r"],
["t", "t"],
["u", "u"],
["û", "û"],
["v", "v"],
["w", "w"],
["y", "y"],
["z", "z"],
["as", "a"],
["aux", "au"],
["cs", "c"],
["chs", "ch"],
["ds", "d"],
["és", "é"],
["es", "e"],
["eux", "eu"],
["fs", "f"],
["gs", "g"],
["hs", "h"],
["is", "i"],
["ïs", "ï"],
["js", "j"],
["ks", "k"],
["ls", "l"],
["ms", "m"],
["ns", "n"],
["oux", "ou"],
["os", "o"],
["ps", "p"],
["qs", "q"],
["rs", "r"],
["ses", "se"],
["se", "se"],
["ts", "t"],
["us", "u"],
["vs", "v"],
["ws", "w"],
["ys", "y"],
["nt(e", "nt"],
["nt(e)", "nt"],
["al(e", "ale"],
["é(", "é"],
["é(e", "é"],
["é.e", "é"],
["el(le", "el"],
["eurs(rices", "eur"],
["eur(rice", "eur"],
["eux(se", "eux"],
["ial(e", "ial"],
["er(ère", "er"],
["eur(se", "eur"],
["teur(trice", "teur"],
["teurs(trices", "teur"],
]
VERB_RULES = [
@@ -47,4 +150,11 @@ VERB_RULES = [
["assiez", "er"],
["assent", "er"],
["ant", "er"],
["ante", "er"],
["ants", "er"],
["antes", "er"],
["u(er", "u"],
["és(ées", "er"],
["é()e", "er"],
["é()", "er"],
]
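Each rule above is an `[old_suffix, new_suffix]` pair. A simplified sketch of how such rules rewrite a word, assuming longest-suffix-first matching (spaCy's actual Lemmatizer additionally validates candidates against an index and exception tables):

```python
def apply_suffix_rules(word, rules):
    # try the longest matching suffix first, then rewrite it
    for old, new in sorted(rules, key=lambda r: -len(r[0])):
        if word.endswith(old):
            return word[: len(word) - len(old)] + new
    return word


sample_rules = [["s", ""], ["aux", "au"], ["eux", "eu"]]  # subset of NOUN_RULES
print(apply_suffix_rules("tuyaux", sample_rules))   # -> 'tuyau'
print(apply_suffix_rules("cheveux", sample_rules))  # -> 'cheveu'
```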

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large


@@ -94,15 +94,19 @@ for pre, pre_lemma in [("qu'", "que"), ("n'", "ne")]:
_infixes_exc = []
orig_elision = "'"
orig_hyphen = "-"
# loop through the elision and hyphen characters, and try to substitute the ones that weren't used in the original list
for infix in FR_BASE_EXCEPTIONS:
variants_infix = {infix}
for elision_char in [x for x in ELISION if x != orig_elision]:
variants_infix.update(
[word.replace(orig_elision, elision_char) for word in variants_infix]
)
for hyphen_char in [x for x in ["-", "‐"] if x != orig_hyphen]:
variants_infix.update(
[word.replace(orig_hyphen, hyphen_char) for word in variants_infix]
)
variants_infix.update([upper_first_letter(word) for word in variants_infix])
_infixes_exc.extend(variants_infix)
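
To make the effect of this loop concrete, here is a self-contained sketch; it assumes ELISION contains the ASCII and the typographic apostrophe, and the base exception "entr'acte" is a hypothetical stand-in for a real FR_BASE_EXCEPTIONS entry:

ELISION = "'’"  # assumption for this sketch
base = "entr'acte"  # hypothetical entry
variants = {base}
for elision_char in [c for c in ELISION if c != "'"]:
    variants.update([w.replace("'", elision_char) for w in variants])
# mimic upper_first_letter: add capitalised variants
variants.update([w[0].upper() + w[1:] for w in variants])
print(sorted(variants))
# ["Entr'acte", 'Entr’acte', "entr'acte", 'entr’acte']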
@ -327,7 +331,9 @@ _regular_exp = [
"^chape[{hyphen}]chut[{alpha}]+$".format(hyphen=HYPHENS, alpha=ALPHA_LOWER),
"^down[{hyphen}]load[{alpha}]*$".format(hyphen=HYPHENS, alpha=ALPHA_LOWER),
"^[ée]tats[{hyphen}]uni[{alpha}]*$".format(hyphen=HYPHENS, alpha=ALPHA_LOWER),
"^droits?[{hyphen}]de[{hyphen}]l'homm[{alpha}]+$".format(hyphen=HYPHENS, alpha=ALPHA_LOWER),
"^droits?[{hyphen}]de[{hyphen}]l'homm[{alpha}]+$".format(
hyphen=HYPHENS, alpha=ALPHA_LOWER
),
"^fac[{hyphen}]simil[{alpha}]*$".format(hyphen=HYPHENS, alpha=ALPHA_LOWER),
"^fleur[{hyphen}]bleuis[{alpha}]+$".format(hyphen=HYPHENS, alpha=ALPHA_LOWER),
"^flic[{hyphen}]flaqu[{alpha}]+$".format(hyphen=HYPHENS, alpha=ALPHA_LOWER),
@ -380,25 +386,32 @@ _regular_exp += [
]
# catching cases like entr'abat
_elision_prefix = ['r?é?entr', 'grande?s?', 'r']
_elision_prefix = ["r?é?entr", "grande?s?", "r"]
_regular_exp += [
"^{prefix}[{elision}][{alpha}][{alpha}{elision}{hyphen}\-]*$".format(
prefix=p,
elision=ELISION,
hyphen=_other_hyphens,
alpha=ALPHA_LOWER,
prefix=p, elision=ELISION, hyphen=_other_hyphens, alpha=ALPHA_LOWER
)
for p in _elision_prefix
]
# catching cases like saut-de-ski, pet-en-l'air
_hyphen_combination = ['l[èe]s?', 'la', 'en', 'des?', 'd[eu]', 'sur', 'sous', 'aux?', 'à', 'et', "près", "saint"]
_hyphen_combination = [
"l[èe]s?",
"la",
"en",
"des?",
"d[eu]",
"sur",
"sous",
"aux?",
"à",
"et",
"près",
"saint",
]
_regular_exp += [
"^[{alpha}]+[{hyphen}]{hyphen_combo}[{hyphen}](?:l[{elision}])?[{alpha}]+$".format(
hyphen_combo=hc,
elision=ELISION,
hyphen=HYPHENS,
alpha=ALPHA_LOWER,
hyphen_combo=hc, elision=ELISION, hyphen=HYPHENS, alpha=ALPHA_LOWER
)
for hc in _hyphen_combination
]
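
Each entry in _hyphen_combination above yields its own compiled pattern. An illustration with simplified stand-ins for the HYPHENS, ELISION and ALPHA_LOWER character classes (assumptions for this sketch, not the real values):

import re

HYPHENS = "-"  # simplified assumption
ELISION = "'’"  # simplified assumption
ALPHA_LOWER = "a-zà-ÿ"  # simplified assumption

pattern = "^[{alpha}]+[{hyphen}]de[{hyphen}](?:l[{elision}])?[{alpha}]+$".format(
    hyphen=HYPHENS, elision=ELISION, alpha=ALPHA_LOWER
)
print(bool(re.match(pattern, "saut-de-ski")))  # True
print(bool(re.match(pattern, "pet-en-l'air")))  # False; "en" is covered by its own pattern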

View File

@ -1,3 +1,10 @@
"""
Slang and abbreviations
List of Indonesian words that are often misspelled, from:
https://id.wikipedia.org/wiki/Wikipedia:Daftar_kosakata_bahasa_Indonesia_yang_sering_salah_dieja
"""
# coding: utf8
from __future__ import unicode_literals

View File

@ -1,3 +1,6 @@
"""
List of stop words in Bahasa Indonesia.
"""
# coding: utf8
from __future__ import unicode_literals

View File

@ -1,3 +1,7 @@
"""
Daftar singkatan dan Akronim dari:
https://id.wiktionary.org/wiki/Wiktionary:Daftar_singkatan_dan_akronim_bahasa_Indonesia#A
"""
# coding: utf8
from __future__ import unicode_literals

File diff suppressed because it is too large

View File

@ -1,6 +1,6 @@
# coding: utf8
"""
All wordforms are extracted from Norsk Ordbank in Norwegian Bokmål 2005
All wordforms are extracted from Norsk Ordbank in Norwegian Bokmål 2005, updated 20180627
(CLARINO NB - Språkbanken), Nasjonalbiblioteket, Norway:
https://www.nb.no/sprakbanken/show?serial=oai%3Anb.no%3Asbr-5&lang=en
License:
@ -15,9 +15,7 @@ ADVERBS_WORDFORMS = {
'à la grecque': ('à la grecque',),
'à la mode': ('à la mode',),
'òg': ('òg',),
'a': ('a',),
'a cappella': ('a cappella',),
'a conto': ('a conto',),
'a konto': ('a konto',),
'a posteriori': ('a posteriori',),
'a prima vista': ('a prima vista',),
@ -34,6 +32,12 @@ ADVERBS_WORDFORMS = {
'ad undas': ('ad undas',),
'adagio': ('adagio',),
'akkurat': ('akkurat',),
'aktenfor': ('aktenfor',),
'aktenfra': ('aktenfra',),
'akter': ('akter',),
'akterinn': ('akterinn',),
'akterover': ('akterover',),
'akterut': ('akterut',),
'al fresco': ('al fresco',),
'al secco': ('al secco',),
'aldeles': ('aldeles',),
@ -46,6 +50,9 @@ ADVERBS_WORDFORMS = {
'allegro': ('allegro',),
'aller': ('aller',),
'allerede': ('allerede',),
'allesteds': ('allesteds',),
'allestedsfra': ('allestedsfra',),
'allestedshen': ('allestedshen',),
'allikevel': ('allikevel',),
'alltid': ('alltid',),
'alltids': ('alltids',),
@ -60,8 +67,12 @@ ADVERBS_WORDFORMS = {
'andelsvis': ('andelsvis',),
'andfares': ('andfares',),
'andføttes': ('andføttes',),
'annensteds': ('annensteds',),
'annenstedsfra': ('annenstedsfra',),
'annenstedshen': ('annenstedshen',),
'annetsteds': ('annetsteds',),
'annetstedsfra': ('annetstedsfra',),
'annetstedshen': ('annetstedshen',),
'anno': ('anno',),
'anslagsvis': ('anslagsvis',),
@ -72,21 +83,35 @@ ADVERBS_WORDFORMS = {
'apropos': ('apropos',),
'argende': ('argende',),
'at': ('at',),
'att': ('att',),
'attende': ('attende',),
'atter': ('atter',),
'attpåtil': ('attpåtil',),
'attåt': ('attåt',),
'au': ('au',),
'aust': ('aust',),
'austa': ('austa',),
'austafjells': ('austafjells',),
'av gårde': ('av gårde',),
'av sted': ('av sted',),
'avdelingsvis': ('avdelingsvis',),
'avdragsvis': ('avdragsvis',),
'avhendes': ('avhendes',),
'avhends': ('avhends',),
'avsatsvis': ('avsatsvis',),
'babord': ('babord',),
'bakfra': ('bakfra',),
'bakk': ('bakk',),
'baklengs': ('baklengs',),
'bakover': ('bakover',),
'bakut': ('bakut',),
'bare': ('bare',),
'bataljonsvis': ('bataljonsvis',),
'beint fram': ('beint fram',),
'bekende': ('bekende',),
'belgende': ('belgende',),
'bent fram': ('bent fram',),
'bent frem': ('bent frem',),
'betids': ('betids',),
'bi': ('bi',),
'bidevind': ('bidevind',),
@ -102,17 +127,21 @@ ADVERBS_WORDFORMS = {
'bom': ('bom',),
'bommende': ('bommende',),
'bona fide': ('bona fide',),
'bort': ('bort',),
'borte': ('borte',),
'bortimot': ('bortimot',),
'brennfort': ('brennfort',),
'brutto': ('brutto',),
'bråtevis': ('bråtevis',),
'bums': ('bums',),
'buntevis': ('buntevis',),
'buntvis': ('buntvis',),
'bus': ('bus',),
'bygdimellom': ('bygdimellom',),
'cantabile': ('cantabile',),
'cf': ('cf',),
'cif': ('cif',),
'cirka': ('cirka',),
'comme il faut': ('comme il faut',),
'crescendo': ('crescendo',),
'da': ('da',),
'dagevis': ('dagevis',),
@ -127,18 +156,38 @@ ADVERBS_WORDFORMS = {
'delkredere': ('delkredere',),
'dels': ('dels',),
'delvis': ('delvis',),
'den gang': ('den gang',),
'der': ('der',),
'der borte': ('der borte',),
'der hen': ('der hen',),
'der inne': ('der inne',),
'der nede': ('der nede',),
'der oppe': ('der oppe',),
'der ute': ('der ute',),
'derav': ('derav',),
'deretter': ('deretter',),
'derfor': ('derfor',),
'derfra': ('derfra',),
'deri': ('deri',),
'deriblant': ('deriblant',),
'derifra': ('derifra',),
'derimot': ('derimot',),
'dermed': ('dermed',),
'dernest': ('dernest',),
'derom': ('derom',),
'derpå': ('derpå',),
'dertil': ('dertil',),
'derved': ('derved',),
'dess': ('dess',),
'dessuten': ('dessuten',),
'dessverre': ('dessverre',),
'desto': ('desto',),
'diminuendo': ('diminuendo',),
'dis': ('dis',),
'dit': ('dit',),
'dit hen': ('dit hen',),
'ditover': ('ditover',),
'ditto': ('ditto',),
'dog': ('dog',),
'dolce': ('dolce',),
'dorgende': ('dorgende',),
@ -158,10 +207,10 @@ ADVERBS_WORDFORMS = {
'eitrende': ('eitrende',),
'eks': ('eks',),
'eksempelvis': ('eksempelvis',),
'eksklusiv': ('eksklusiv',),
'eksklusive': ('eksklusive',),
'ekspress': ('ekspress',),
'ekstempore': ('ekstempore',),
'eldende': ('eldende',),
'ellers': ('ellers',),
'en': ('en',),
'en bloc': ('en bloc',),
@ -175,6 +224,8 @@ ADVERBS_WORDFORMS = {
'enda': ('enda',),
'endatil': ('endatil',),
'ende': ('ende',),
'ende fram': ('ende fram',),
'ende frem': ('ende frem',),
'ender': ('ender',),
'endog': ('endog',),
'ene': ('ene',),
@ -183,10 +234,12 @@ ADVERBS_WORDFORMS = {
'enkom': ('enkom',),
'enn': ('enn',),
'ennå': ('ennå',),
'ensteds': ('ensteds',),
'eo ipso': ('eo ipso',),
'ergo': ('ergo',),
'et cetera': ('et cetera',),
'etappevis': ('etappevis',),
'etsteds': ('etsteds',),
'etterhånden': ('etterhånden',),
'etterpå': ('etterpå',),
'etterskottsvis': ('etterskottsvis',),
@ -195,9 +248,10 @@ ADVERBS_WORDFORMS = {
'ex auditorio': ('ex auditorio',),
'ex cathedra': ('ex cathedra',),
'ex officio': ('ex officio',),
'exit': ('exit',),
'f.o.r.': ('f.o.r.',),
'fas': ('fas',),
'fatt': ('fatt',),
'feil': ('feil',),
'femti-femti': ('femti-femti',),
'fifty-fifty': ('fifty-fifty',),
@ -208,44 +262,64 @@ ADVERBS_WORDFORMS = {
'flunkende': ('flunkende',),
'flust': ('flust',),
'fly': ('fly',),
'fløyten': ('fløyten',),
'fob': ('fob',),
'for': ('for',),
'for hånden': ('for hånden',),
'for lengst': ('for lengst',),
'for resten': ('for resten',),
'for så vidt': ('for så vidt',),
'for tida': ('for tida',),
'for tiden': ('for tiden',),
'for visst': ('for visst',),
'for øvrig': ('for øvrig',),
'fordevind': ('fordevind',),
'fordum': ('fordum',),
'fore': ('fore',),
'forfra': ('forfra',),
'forhakkende': ('forhakkende',),
'forholdsvis': ('forholdsvis',),
'forhåpentlig': ('forhåpentlig',),
'forhåpentligvis': ('forhåpentligvis',),
'forlengs': ('forlengs',),
'formelig': ('formelig',),
'forover': ('forover',),
'forresten': ('forresten',),
'forsøksvis': ('forsøksvis',),
'fort': ('fort',),
'fortere': ('fort',),
'fortest': ('fort',),
'forte': ('forte',),
'fortfarende': ('fortfarende',),
'fortissimo': ('fortissimo',),
'fortrinnsvis': ('fortrinnsvis',),
'forut': ('forut',),
'fra borde': ('fra borde',),
'fram': ('fram',),
'framføre': ('framføre',),
'framleis': ('framleis',),
'framlengs': ('framlengs',),
'framme': ('framme',),
'framstupes': ('framstupes',),
'framstups': ('framstups',),
'franko': ('franko',),
'free on board': ('free on board',),
'free on rail': ('free on rail',),
'frem': ('frem',),
'fremad': ('fremad',),
'fremdeles': ('fremdeles',),
'fremlengs': ('fremlengs',),
'fremme': ('fremme',),
'fremstupes': ('fremstupes',),
'fremstups': ('fremstups',),
'furioso': ('furioso',),
'fylkesvis': ('fylkesvis',),
'følgelig': ('følgelig',),
'føre': ('føre',),
'først': ('først',),
'ganske': ('ganske',),
'gardimellom': ('gardimellom',),
'gatelangs': ('gatelangs',),
'gid': ('gid',),
'givetvis': ('givetvis',),
'gjerne': ('gjerne',),
@ -267,17 +341,56 @@ ADVERBS_WORDFORMS = {
'gørrende': ('gørrende',),
'hakk': ('hakk',),
'hakkende': ('hakkende',),
'halvveges': ('halvveges',),
'halvvegs': ('halvvegs',),
'halvveis': ('halvveis',),
'haugevis': ('haugevis',),
'heden': ('heden',),
'heim': ('heim',),
'heim att': ('heim att',),
'heiman': ('heiman',),
'heime': ('heime',),
'heimefra': ('heimefra',),
'heimetter': ('heimetter',),
'heimom': ('heimom',),
'heimover': ('heimover',),
'heldigvis': ('heldigvis',),
'heller': ('heller',),
'helst': ('helst',),
'hen': ('hen',),
'henholdsvis': ('henholdsvis',),
'henne': ('henne',),
'her': ('her',),
'herav': ('herav',),
'heretter': ('heretter',),
'herfra': ('herfra',),
'heri': ('heri',),
'heriblant': ('heriblant',),
'herifra': ('herifra',),
'herigjennom': ('herigjennom',),
'herimot': ('herimot',),
'hermed': ('hermed',),
'herom': ('herom',),
'herover': ('herover',),
'herpå': ('herpå',),
'herre': ('herre',),
'hersens': ('hersens',),
'hertil': ('hertil',),
'herunder': ('herunder',),
'herved': ('herved',),
'himlende': ('himlende',),
'hisset': ('hisset',),
'hist': ('hist',),
'hit': ('hit',),
'hitover': ('hitover',),
'hittil': ('hittil',),
'hjem': ('hjem',),
'hjemad': ('hjemad',),
'hjemetter': ('hjemetter',),
'hjemme': ('hjemme',),
'hjemmefra': ('hjemmefra',),
'hjemom': ('hjemom',),
'hjemover': ('hjemover',),
'hodekulls': ('hodekulls',),
'hodestupes': ('hodestupes',),
'hodestups': ('hodestups',),
@ -288,15 +401,41 @@ ADVERBS_WORDFORMS = {
'hundretusenvis': ('hundretusenvis',),
'hundrevis': ('hundrevis',),
'hurra-meg-rundt': ('hurra-meg-rundt',),
'husimellom': ('husimellom',),
'hvi': ('hvi',),
'hvor': ('hvor',),
'hvor hen': ('hvor hen',),
'hvorav': ('hvorav',),
'hvordan': ('hvordan',),
'hvoretter': ('hvoretter',),
'hvorfor': ('hvorfor',),
'hvorfra': ('hvorfra',),
'hvori': ('hvori',),
'hvoriblant': ('hvoriblant',),
'hvorimot': ('hvorimot',),
'hvorledes': ('hvorledes',),
'hvormed': ('hvormed',),
'hvorom': ('hvorom',),
'hvorpå': ('hvorpå',),
'hånt': ('hånt',),
'høylig': ('høylig',),
'høyst': ('høyst',),
'i aften': ('i aften',),
'i aftes': ('i aftes',),
'i alle fall': ('i alle fall',),
'i dag': ('i dag',),
'i fjor': ('i fjor',),
'i fleng': ('i fleng',),
'i forfjor': ('i forfjor',),
'i forgårs': ('i forgårs',),
'i gjerde': ('i gjerde',),
'i gjære': ('i gjære',),
'i grunnen': ('i grunnen',),
'i går': ('i går',),
'i hende': ('i hende',),
'i hjel': ('i hjel',),
'i hug': ('i hug',),
'i huleste': ('i huleste',),
'i stedet': ('i stedet',),
'iallfall': ('iallfall',),
'ibidem': ('ibidem',),
@ -304,7 +443,7 @@ ADVERBS_WORDFORMS = {
'igjen': ('igjen',),
'ikke': ('ikke',),
'ildende': ('ildende',),
'ille': ('ille',),
'imens': ('imens',),
'imidlertid': ('imidlertid',),
'in absentia': ('in absentia',),
@ -334,10 +473,22 @@ ADVERBS_WORDFORMS = {
'in vivo': ('in vivo',),
'ingenlunde': ('ingenlunde',),
'ingensteds': ('ingensteds',),
'inklusiv': ('inklusiv',),
'inklusive': ('inklusive',),
'inkognito': ('inkognito',),
'inn': ('inn',),
'innad': ('innad',),
'innafra': ('innafra',),
'innalands': ('innalands',),
'innaskjærs': ('innaskjærs',),
'inne': ('inne',),
'innenat': ('innenat',),
'innenfra': ('innenfra',),
'innenlands': ('innenlands',),
'innenskjærs': ('innenskjærs',),
'innledningsvis': ('innledningsvis',),
'innleiingsvis': ('innleiingsvis',),
'innomhus': ('innomhus',),
'isteden': ('isteden',),
'især': ('især',),
'item': ('item',),
@ -380,12 +531,26 @@ ADVERBS_WORDFORMS = {
'lagerfritt': ('lagerfritt',),
'lagom': ('lagom',),
'lagvis': ('lagvis',),
'landimellom': ('landimellom',),
'landverts': ('landverts',),
'langt': ('langt',),
'lenger': ('langt',),
'lengst': ('langt',),
'langveges': ('langveges',),
'langvegesfra': ('langvegesfra',),
'langvegs': ('langvegs',),
'langvegsfra': ('langvegsfra',),
'langveis': ('langveis',),
'langveisfra': ('langveisfra',),
'larghetto': ('larghetto',),
'largo': ('largo',),
'lassevis': ('lassevis',),
'legato': ('legato',),
'leilighetsvis': ('leilighetsvis',),
'lell': ('lell',),
'lenge': ('lenge',),
'lenger': ('lenge',),
'lengst': ('lenge',),
'lenger': ('lenger',),
'liddelig': ('liddelig',),
'like': ('like',),
@ -408,19 +573,25 @@ ADVERBS_WORDFORMS = {
'maestoso': ('maestoso',),
'mala fide': ('mala fide',),
'malapropos': ('malapropos',),
'mannemellom': ('mannemellom',),
'massevis': ('massevis',),
'med rette': ('med rette',),
'medio': ('medio',),
'medium': ('medium',),
'medsols': ('medsols',),
'medstrøms': ('medstrøms',),
'meget': ('meget',),
'mengdevis': ('mengdevis',),
'metervis': ('metervis',),
'mezzoforte': ('mezzoforte',),
'midsommers': ('midsommers',),
'midt': ('midt',),
'midtfjords': ('midtfjords',),
'midtskips': ('midtskips',),
'midtsommers': ('midtsommers',),
'midtveges': ('midtveges',),
'midtvegs': ('midtvegs',),
'midtveis': ('midtveis',),
'midtvinters': ('midtvinters',),
'midvinters': ('midvinters',),
'milevis': ('milevis',),
@ -445,6 +616,13 @@ ADVERBS_WORDFORMS = {
'naturligvis': ('naturligvis',),
'nauende': ('nauende',),
'navnlig': ('navnlig',),
'ned': ('ned',),
'nedad': ('nedad',),
'nedatil': ('nedatil',),
'nede': ('nede',),
'nedentil': ('nedentil',),
'nedenunder': ('nedenunder',),
'nedstrøms': ('nedstrøms',),
'neigu': ('neigu',),
'neimen': ('neimen',),
'nemlig': ('nemlig',),
@ -452,31 +630,46 @@ ADVERBS_WORDFORMS = {
'nesegrus': ('nesegrus',),
'nest': ('nest',),
'nesten': ('nesten',),
'netto': ('netto',),
'nettopp': ('nettopp',),
'noenlunde': ('noenlunde',),
'noensinne': ('noensinne',),
'noensteds': ('noensteds',),
'nok': ('nok',),
'noksom': ('noksom',),
'nokså': ('nokså',),
'non stop': ('non stop',),
'nonstop': ('nonstop',),
'nord': ('nord',),
'nordafjells': ('nordafjells',),
'nordaust': ('nordaust',),
'nordenfjells': ('nordenfjells',),
'nordost': ('nordost',),
'nordvest': ('nordvest',),
'nordøst': ('nordøst',),
'notabene': ('notabene',),
'nu': ('nu',),
'nylig': ('nylig',),
'nyss': ('nyss',),
'nå': ('nå',),
'når': ('når',),
'nåvel': ('nåvel',),
'nær': ('nær',),
'nærere': ('nær',),
'nærmere': ('nær',),
'nærest': ('nær',),
'nærmest': ('nær',),
'nære': ('nære',),
'nærere': ('nærere',),
'nærest': ('nærest',),
'nærme': ('nærme',),
'nærmere': ('nærmere',),
'nærmest': ('nærmest',),
'nødig': ('nødig',),
'nødigere': ('nødig',),
'nødigst': ('nødig',),
'nødvendigvis': ('nødvendigvis',),
'offside': ('offside',),
'ofte': ('ofte',),
'oftere': ('ofte',),
'oftest': ('ofte',),
'også': ('også',),
'om att': ('om att',),
'om igjen': ('om igjen',),
@ -485,11 +678,18 @@ ADVERBS_WORDFORMS = {
'omsonst': ('omsonst',),
'omtrent': ('omtrent',),
'onnimellom': ('onnimellom',),
'opp': ('opp',),
'opp att': ('opp att',),
'opp ned': ('opp ned',),
'oppad': ('oppad',),
'oppe': ('oppe',),
'oppstrøms': ('oppstrøms',),
'ost': ('ost',),
'ovabords': ('ovabords',),
'ovatil': ('ovatil',),
'oven': ('oven',),
'ovenbords': ('ovenbords',),
'oventil': ('oventil',),
'overalt': ('overalt',),
'overens': ('overens',),
'overhodet': ('overhodet',),
@ -506,8 +706,6 @@ ADVERBS_WORDFORMS = {
'partout': ('partout',),
'parvis': ('parvis',),
'per capita': ('per capita',),
'peu à peu': ('peu à peu',),
'peu om peu': ('peu om peu',),
'pianissimo': ('pianissimo',),
'piano': ('piano',),
'pinende': ('pinende',),
@ -554,7 +752,6 @@ ADVERBS_WORDFORMS = {
'respektive': ('respektive',),
'rettsøles': ('rettsøles',),
'reverenter': ('reverenter',),
'riktig nok': ('riktig nok',),
'riktignok': ('riktignok',),
'rimeligvis': ('rimeligvis',),
'ringside': ('ringside',),
@ -567,6 +764,8 @@ ADVERBS_WORDFORMS = {
'saktelig': ('saktelig',),
'saktens': ('saktens',),
'sammen': ('sammen',),
'sammesteds': ('sammesteds',),
'sammestedsfra': ('sammestedsfra',),
'samstundes': ('samstundes',),
'samt': ('samt',),
'sann': ('sann',),
@ -578,6 +777,7 @@ ADVERBS_WORDFORMS = {
'senhøstes': ('senhøstes',),
'sia': ('sia',),
'sic': ('sic',),
'sidelangs': ('sidelangs',),
'sidelengs': ('sidelengs',),
'siden': ('siden',),
'sideveges': ('sideveges',),
@ -587,9 +787,9 @@ ADVERBS_WORDFORMS = {
'silde': ('silde',),
'simpelthen': ('simpelthen',),
'sine anno': ('sine anno',),
'sistpå': ('sistpå',),
'sjelden': ('sjelden',),
'sjøleies': ('sjøleies',),
'sjøleis': ('sjøleis',),
'sjøverts': ('sjøverts',),
'skeis': ('skeis',),
'skiftevis': ('skiftevis',),
@ -607,6 +807,9 @@ ADVERBS_WORDFORMS = {
'smekk': ('smekk',),
'smellende': ('smellende',),
'småningom': ('småningom',),
'snart': ('snart',),
'snarere': ('snart',),
'snarest': ('snart',),
'sneisevis': ('sneisevis',),
'snesevis': ('snesevis',),
'snuft': ('snuft',),
@ -616,6 +819,7 @@ ADVERBS_WORDFORMS = {
'snyte': ('snyte',),
'solo': ('solo',),
'sommerstid': ('sommerstid',),
'sommesteds': ('sommesteds',),
'spenna': ('spenna',),
'spent': ('spent',),
'spika': ('spika',),
@ -651,6 +855,7 @@ ADVERBS_WORDFORMS = {
'styggelig': ('styggelig',),
'styggende': ('styggende',),
'stykkevis': ('stykkevis',),
'styrbord': ('styrbord',),
'støtt': ('støtt',),
'støtvis': ('støtvis',),
'støytvis': ('støytvis',),
@ -658,6 +863,12 @@ ADVERBS_WORDFORMS = {
'summa summarum': ('summa summarum',),
'surr': ('surr',),
'svinaktig': ('svinaktig',),
'svint': ('svint',),
'svintere': ('svint',),
'svintest': ('svint',),
'syd': ('syd',),
'sydost': ('sydost',),
'sydvest': ('sydvest',),
'sydøst': ('sydøst',),
'synderlig': ('synderlig',),
'': ('',),
@ -672,6 +883,13 @@ ADVERBS_WORDFORMS = {
'søkk': ('søkk',),
'søkkende': ('søkkende',),
'sønder': ('sønder',),
'sønna': ('sønna',),
'sønnafjells': ('sønnafjells',),
'sønnenfjells': ('sønnenfjells',),
'sør': ('sør',),
'søraust': ('søraust',),
'sørvest': ('sørvest',),
'sørøst': ('sørøst',),
'takimellom': ('takimellom',),
'takomtil': ('takomtil',),
'temmelig': ('temmelig',),
@ -679,10 +897,15 @@ ADVERBS_WORDFORMS = {
'tidligdags': ('tidligdags',),
'tidsnok': ('tidsnok',),
'tidvis': ('tidvis',),
'til like': ('til like',),
'tilbake': ('tilbake',),
'tilfeldigvis': ('tilfeldigvis',),
'tilmed': ('tilmed',),
'tilnærmelsesvis': ('tilnærmelsesvis',),
'timevis': ('timevis',),
'titt': ('titt',),
'tiere': ('titt',),
'tiest': ('titt',),
'tjokkende': ('tjokkende',),
'tomreipes': ('tomreipes',),
'tott': ('tott',),
@ -695,44 +918,55 @@ ADVERBS_WORDFORMS = {
'trutt': ('trutt',),
'turevis': ('turevis',),
'turvis': ('turvis',),
'tusenfold': ('tusenfold',),
'tusenvis': ('tusenvis',),
'tvers': ('tvers',),
'tvert': ('tvert',),
'tydeligvis': ('tydeligvis',),
'tynnevis': ('tynnevis',),
'tålig': ('tålig',),
'tønnevis': ('tønnevis',),
'ufravendt': ('ufravendt',),
'ugjerne': ('ugjerne',),
'uheldigvis': ('uheldigvis',),
'ukevis': ('ukevis',),
'ultimo': ('ultimo',),
'ulykkeligvis': ('ulykkeligvis',),
'uløyves': ('uløyves',),
'undas': ('undas',),
'underhånden': ('underhånden',),
'undertiden': ('undertiden',),
'undervegs': ('undervegs',),
'underveis': ('underveis',),
'unntakelsesvis': ('unntakelsesvis',),
'unntaksvis': ('unntaksvis',),
'ustyggelig': ('ustyggelig',),
'ut': ('ut',),
'utaboks': ('utaboks',),
'utad': ('utad',),
'utalands': ('utalands',),
'utbygdes': ('utbygdes',),
'utdragsvis': ('utdragsvis',),
'ute': ('ute',),
'utelukkende': ('utelukkende',),
'utenat': ('utenat',),
'utenboks': ('utenboks',),
'utenlands': ('utenlands',),
'utomhus': ('utomhus',),
'uvegerlig': ('uvegerlig',),
'uviselig': ('uviselig',),
'uvislig': ('uvislig',),
'va banque': ('va banque',),
'vanligvis': ('vanligvis',),
'vann': ('vann',),
'vekevis': ('vekevis',),
'ved like': ('ved like',),
'veggimellom': ('veggimellom',),
'vekk': ('vekk',),
'vekke': ('vekke',),
'vekselvis': ('vekselvis',),
'vel': ('vel',),
'vest': ('vest',),
'vesta': ('vesta',),
'vestafjells': ('vestafjells',),
'vestenfjells': ('vestenfjells',),
'vibrato': ('vibrato',),
'vice versa': ('vice versa',),
'vide': ('vide',),
@ -741,7 +975,6 @@ ADVERBS_WORDFORMS = {
'viselig': ('viselig',),
'visselig': ('visselig',),
'visst': ('visst',),
'visst nok': ('visst nok',),
'visstnok': ('visstnok',),
'vivace': ('vivace',),
'vonlig': ('vonlig',),
@ -754,40 +987,183 @@ ADVERBS_WORDFORMS = {
'årlig års': ('årlig års',),
'åssen': ('åssen',),
'ørende': ('ørende',),
'øst': ('øst',),
'østa': ('østa',),
'østafjells': ('østafjells',),
'østenfjells': ('østenfjells',),
'øyensynlig': ('øyensynlig',),
'antageligvis': ('antageligvis',),
'coolly': ('coolly',),
'kor': ('kor',),
'korfor': ('korfor',),
'medels': ('medels',),
'nasegrus': ('nasegrus',),
'overimorgen': ('overimorgen',),
'unntagelsesvis': ('unntagelsesvis',),
'åffer': ('åffer',),
'sist': ('sist',),
'seinhaustes': ('seinhaustes',),
'stetse': ('stetse',),
'stikk': ('stikk',),
'storlig': ('storlig',),
'A': ('A',),
'for': ('for',),
'still going strong': ('still going strong',),
'til og med': ('til og med',),
'i hu': ('i hu',),
'dengang': ('dengang',),
'derborte': ('derborte',),
'derefter': ('derefter',),
'derinne': ('derinne',),
'dernede': ('dernede',),
'deromkring': ('deromkring',),
'etterhvert': ('etterhvert',),
'fordømrade': ('fordømrade',),
'foreksempel': ('foreksempel',),
'forsåvidt': ('forsåvidt',),
'forøvrig': ('forøvrig',),
'herefter': ('herefter',),
'hvertfall': ('hvertfall',),
'idag': ('idag',),
'ifjor': ('ifjor',),
'i gang': ('i gang',),
'igår': ('igår',),
'ihvertfall': ('ihvertfall',),
'ikveld': ('ikveld',),
'iland': ('iland',),
'imorgen': ('imorgen',),
'imøte': ('imøte',),
'inatt': ('inatt',),
'iorden': ('iorden',),
'istand': ('istand',),
'istedet': ('istedet',),
'javisst': ('javisst',),
'neivisst': ('neivisst',),
'fortsatt': ('fortsatt',),
'slik': ('slik',),
'sådan': ('sådan',),
'sånn': ('sånn',),
'for eksempel': ('for eksempel',),
'fra barnsbein av': ('fra barnsbein av',),
'fra barnsben av': ('fra barnsben av',),
'fra oven': ('fra oven',),
'på vidvanke': ('på vidvanke',),
'rubb og stubb': ('rubb og stubb',),
'akterifra': ('akterifra',),
'andsynes': ('andsynes',),
'austenom': ('austenom',),
'avslutningsvis': ('avslutningsvis',),
'bøttevis': ('bøttevis',),
'bakenfra': ('bakenfra',),
'bakenom': ('bakenom',),
'baki': ('baki',),
'bedriftsvis': ('bedriftsvis',),
'beklageligvis': ('beklageligvis',),
'benveges': ('benveges',),
'benveies': ('benveies',),
'bistrende': ('bistrende',),
'bitvis': ('bitvis',),
'bortenom': ('bortenom',),
'bortmed': ('bortmed',),
'bråfort': ('bråfort',),
'bunkevis': ('bunkevis',),
'ca': ('ca',),
'derigjennom': ('derigjennom',),
'derover': ('derover',),
'dessuaktet': ('dessuaktet',),
'distriktsvis': ('distriktsvis',),
'doloroso': ('doloroso',),
'erfaringsvis': ('erfaringsvis',),
'falskelig': ('falskelig',),
'fjellstøtt': ('fjellstøtt',),
'flekkvis': ('flekkvis',),
'flerveis': ('flerveis',),
'forholdvis': ('forholdvis',),
'fornemmelig': ('fornemmelig',),
'fornuftigvis': ('fornuftigvis',),
'forsiktigvis': ('forsiktigvis',),
'forskottsvis': ('forskottsvis',),
'forskuddsvis': ('forskuddsvis',),
'forutsetningsvis': ('forutsetningsvis',),
'framt': ('framt',),
'fremt': ('fremt',),
'godhetsfullt': ('godhetsfullt',),
'hvortil': ('hvortil',),
'hvorunder': ('hvorunder',),
'hvorved': ('hvorved',),
'iltrende': ('iltrende',),
'innatil': ('innatil',),
'innentil': ('innentil',),
'innigjennom': ('innigjennom',),
'kilometervis': ('kilometervis',),
'klattvis': ('klattvis',),
'kolonnevis': ('kolonnevis',),
'kommunevis': ('kommunevis',),
'listelig': ('listelig',),
'lusende': ('lusende',),
'mildelig': ('mildelig',),
'milevidt': ('milevidt',),
'nordøstover': ('nordøstover',),
'ovenover': ('ovenover',),
'periodevis': ('periodevis',),
'pirende': ('pirende',),
'priori': ('priori',),
'rettnok': ('rettnok',),
'rykkvis': ('rykkvis',),
'sørøstover': ('sørøstover',),
'sørvestover': ('sørvestover',),
'sedvanligvis': ('sedvanligvis',),
'seksjonsvis': ('seksjonsvis',),
'styggfort': ('styggfort',),
'stykkomtil': ('stykkomtil',),
'sydvestover': ('sydvestover',),
'terminvis': ('terminvis',),
'tertialvis': ('tertialvis',),
'utdannelsesmessig': ('utdannelsesmessig',),
'vis-à-vis': ('vis-à-vis',),
'før': ('før',),
'jo': ('jo',),
'såvel': ('såvel',),
'efterhvert': ('efterhvert',),
'liksom': ('liksom',),
'dann og vann': ('dann og vann',),
'jaggu': ('jaggu',),
'joggu': ('joggu',),
'knekk': ('knekk',),
'live': ('live',),
'og': ('og',),
'sabla': ('sabla',),
'sikksakk': ('sikksakk',),
'stadig': ('stadig',),
'rett og slett': ('rett og slett',),
'såvidt': ('såvidt',),
'for moro skyld': ('for moro skyld',),
'omlag': ('omlag',),
'nattestid': ('nattestid',),
'sørpe': ('sørpe',),
'A.': ('A.',),
'selv': ('selv',),
'forlengst': ('forlengst',),
'sjøl': ('sjøl',),
'drita': ('drita',),
'ennu': ('ennu',),
'skauleies': ('skauleies',),
'da capo': ('da capo',),
'iallefall': ('iallefall',),
'til alters': ('til alters',),
'pokka': ('pokka',),
'tilslutt': ('tilslutt',),
'i steden': ('i steden',),
'm.a.': ('m.a.',),
'til syvende og sist': ('til syvende og sist',),
'i en fei': ('i en fei',),
'ender og da': ('ender og da',),
'ender og gang': ('ender og gang',),
'fra arilds tid': ('fra arilds tid',),
'i hør og heim': ('i hør og heim',),
'for fote': ('for fote',),
'natterstid': ('natterstid',),
'natterstider': ('natterstider',),
'høgstdags': ('høgstdags',),
'høgstnattes': ('høgstnattes',),
'beint frem': ('beint frem',),
'beintfrem': ('beintfrem',),
'beinveges': ('beinveges',),
'beinvegs': ('beinvegs',),
'beinveis': ('beinveis',),
'benvegs': ('benvegs',),
'benveis': ('benveis',),
'en garde': ('en garde',),
'etter hvert': ('etter hvert',),
'framåt': ('framåt',),
'krittende': ('krittende',),
'kvivitt': ('kvivitt',),
@ -801,5 +1177,14 @@ ADVERBS_WORDFORMS = {
'til sammen': ('til sammen',),
'tomrepes': ('tomrepes',),
'medurs': ('medurs',),
'moturs': ('moturs',)
'moturs': ('moturs',),
'til ansvar': ('til ansvar',),
'til ansvars': ('til ansvars',),
'til fullnads': ('til fullnads',),
'concertando': ('concertando',),
'lesto': ('lesto',),
'tardando': ('tardando',),
'natters tid': ('natters tid',),
'natters tider': ('natters tider',),
'snydens': ('snydens',)
}
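
Each wordform maps to a tuple of possible lemmas, so inflected forms such as 'fortere' and 'fortest' resolve to their base 'fort'. A minimal lookup sketch (the fallback to the surface form is an assumption of this sketch, not necessarily what spaCy's lemmatizer does):

ADVERBS_WORDFORMS_MINI = {  # a few entries copied from the table above
    'fort': ('fort',),
    'fortere': ('fort',),
    'fortest': ('fort',),
}

def lookup_lemma(form):
    # Unknown forms fall back to themselves.
    return ADVERBS_WORDFORMS_MINI.get(form, (form,))[0]

print(lookup_lemma('fortest'))  # fort
print(lookup_lemma('sakte'))  # sakte (no entry in the mini-table)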

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large

73
spacy/lang/tl/__init__.py Normal file
View File

@ -0,0 +1,73 @@
# coding: utf8
from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
# uncomment if files are available
# from .norm_exceptions import NORM_EXCEPTIONS
from .tag_map import TAG_MAP
# from .morph_rules import MORPH_RULES
# uncomment if lookup-based lemmatizer is available
from .lemmatizer import LOOKUP
# from ...lemmatizerlookup import Lemmatizer
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups
def _return_tl(_):
return 'tl'
# Create a Language subclass
# Documentation: https://spacy.io/docs/usage/adding-languages
# This file should be placed in spacy/lang/xx (ISO code of language).
# Before submitting a pull request, make sure to remove all comments from the
# language data files, and run at least the basic tokenizer tests. Simply add the
# language ID to the list of languages in spacy/tests/conftest.py to include it
# in the basic tokenizer sanity tests. You can optionally add a fixture for the
# language's tokenizer and add more specific tests. For more info, see the
# tests documentation: https://github.com/explosion/spaCy/tree/master/spacy/tests
class TagalogDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = _return_tl # ISO code
# add more norm exception dictionaries here
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
# overwrite functions for lexical attributes
lex_attr_getters.update(LEX_ATTRS)
# add custom tokenizer exceptions to base exceptions
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
# add stop words
stop_words = STOP_WORDS
# if available: add tag map
# tag_map = dict(TAG_MAP)
# if available: add morph rules
# morph_rules = dict(MORPH_RULES)
# if available: add lookup lemmatizer
# @classmethod
# def create_lemmatizer(cls, nlp=None):
# return Lemmatizer(LOOKUP)
class Tagalog(Language):
lang = 'tl' # ISO code
Defaults = TagalogDefaults # set Defaults to custom language defaults
# set default export; this allows the language class to be lazy-loaded
__all__ = ['Tagalog']
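
With the defaults above in place, the class can be smoke-tested directly; with no pipeline components added, calling it just tokenizes. The sentence is an arbitrary example and relies on the lowercase "siya'y" exception defined in tokenizer_exceptions.py:

from spacy.lang.tl import Tagalog

nlp = Tagalog()
doc = nlp("siya'y pumunta sa Maynila.")
print([t.text for t in doc])
# should print something like: ['siya', "'y", 'pumunta', 'sa', 'Maynila', '.']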

View File

@ -0,0 +1,18 @@
# coding: utf8
from __future__ import unicode_literals
# Adding a lemmatizer lookup table
# Documentation: https://spacy.io/docs/usage/adding-languages#lemmatizer
# Entries should be added in the following format:
LOOKUP = {
"kaugnayan": "ugnay",
"sangkatauhan": "tao",
"kanayunan": "nayon",
"pandaigdigan": "daigdig",
"kasaysayan": "saysay",
"kabayanihan": "bayani",
"karuwagan": "duwag"
}
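
A lookup lemmatizer of this kind reduces to a dictionary get with the word itself as the fallback; a one-function sketch:

def lemmatize(word):
    # Unknown words lemmatize to themselves.
    return LOOKUP.get(word, word)

print(lemmatize("kasaysayan"))  # saysay
print(lemmatize("bahay"))  # bahay ("house", a word with no entry)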

View File

@ -0,0 +1,43 @@
# coding: utf8
from __future__ import unicode_literals
# import the symbols for the attrs you want to overwrite
from ...attrs import LIKE_NUM
# Overwriting functions for lexical attributes
# Documentation: https://spacy.io/docs/usage/adding-languages#lex-attrs
# Most of these functions, like is_lower or like_url, should be language-
# independent. Others, like like_num (which includes both digits and number
# words), require customisation.
# Example: check if token resembles a number
_num_words = ['sero', 'isa', 'dalawa', 'tatlo', 'apat', 'lima', 'anim', 'pito',
'walo', 'siyam', 'sampu', 'labing-isa', 'labindalawa', 'labintatlo', 'labing-apat',
'labinlima', 'labing-anim', 'labimpito', 'labing-walo', 'labinsiyam', 'dalawampu',
'tatlumpu', 'apatnapu', 'limampu', 'animnapu', 'pitumpu', 'walumpu', 'siyamnapu',
'daan', 'libo', 'milyon', 'bilyon', 'trilyon', 'quadrilyon',
'gajilyon', 'bazilyon']
def like_num(text):
text = text.replace(',', '').replace('.', '')
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
return True
return False
# Create dictionary of functions to overwrite. The default lex_attr_getters are
# updated with this one, so only the functions defined here are overwritten.
LEX_ATTRS = {
LIKE_NUM: like_num
}
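
A few checks against the function above; each result follows directly from its logic and the _num_words list:

print(like_num("10"))  # True: plain digits
print(like_num("3/4"))  # True: a fraction of digits
print(like_num("labindalawa"))  # True: listed number word
print(like_num("bahay"))  # False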

162
spacy/lang/tl/stop_words.py Normal file
View File

@ -0,0 +1,162 @@
# encoding: utf8
from __future__ import unicode_literals
# Add stop words
# Documentation: https://spacy.io/docs/usage/adding-languages#stop-words
# To improve readability, words should be ordered alphabetically and separated
# by spaces and newlines. When adding stop words from an online source, always
# include the link in a comment. Make sure to proofread and double-check the
# words; lists available online are often known to contain mistakes.
# data from https://github.com/stopwords-iso/stopwords-tl/blob/master/stopwords-tl.txt
STOP_WORDS = set("""
akin
aking
ako
alin
am
amin
aming
ang
ano
anumang
apat
at
atin
ating
ay
bababa
bago
bakit
bawat
bilang
dahil
dalawa
dapat
din
dito
doon
gagawin
gayunman
ginagawa
ginawa
ginawang
gumawa
gusto
habang
hanggang
hindi
huwag
iba
ibaba
ibabaw
ibig
ikaw
ilagay
ilalim
ilan
inyong
isa
isang
itaas
ito
iyo
iyon
iyong
ka
kahit
kailangan
kailanman
kami
kanila
kanilang
kanino
kanya
kanyang
kapag
kapwa
karamihan
katiyakan
katulad
kaya
kaysa
ko
kong
kulang
kumuha
kung
laban
lahat
lamang
likod
lima
maaari
maaaring
maging
mahusay
makita
marami
marapat
masyado
may
mayroon
mga
minsan
mismo
mula
muli
na
nabanggit
naging
nagkaroon
nais
nakita
namin
napaka
narito
nasaan
ng
ngayon
ni
nila
nilang
nito
niya
niyang
noon
o
pa
paano
pababa
paggawa
pagitan
pagkakaroon
pagkatapos
palabas
pamamagitan
panahon
pangalawa
para
paraan
pareho
pataas
pero
pumunta
pumupunta
sa
saan
sabi
sabihin
sarili
sila
sino
siya
tatlo
tayo
tulad
tungkol
una
walang
""".split())

36
spacy/lang/tl/tag_map.py Normal file
View File

@ -0,0 +1,36 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
from ...symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ
# Add a tag map
# Documentation: https://spacy.io/docs/usage/adding-languages#tag-map
# Universal Dependencies: http://universaldependencies.org/u/pos/all.html
# The keys of the tag map should be strings in your tag set. The dictionary must
# have an entry POS whose value is one of the Universal Dependencies tags.
# Optionally, you can also include morphological features or other attributes.
TAG_MAP = {
"ADV": {POS: ADV},
"NOUN": {POS: NOUN},
"ADP": {POS: ADP},
"PRON": {POS: PRON},
"SCONJ": {POS: SCONJ},
"PROPN": {POS: PROPN},
"DET": {POS: DET},
"SYM": {POS: SYM},
"INTJ": {POS: INTJ},
"PUNCT": {POS: PUNCT},
"NUM": {POS: NUM},
"AUX": {POS: AUX},
"X": {POS: X},
"CONJ": {POS: CONJ},
"CCONJ": {POS: CCONJ},
"ADJ": {POS: ADJ},
"VERB": {POS: VERB},
"PART": {POS: PART},
"SP": {POS: SPACE}
}

View File

@ -0,0 +1,48 @@
# coding: utf8
from __future__ import unicode_literals
# import symbols if you need to use more, add them here
from ...symbols import ORTH, LEMMA, TAG, NORM, ADP, DET
# Add tokenizer exceptions
# Documentation: https://spacy.io/docs/usage/adding-languages#tokenizer-exceptions
# Feel free to use custom logic to generate repetitive exceptions more efficiently.
# If an exception is split into more than one token, the ORTH values combined always
# need to match the original string.
# Exceptions should be added in the following format:
_exc = {
"tayo'y": [
{ORTH: "tayo", LEMMA: "tayo"},
{ORTH: "'y", LEMMA: "ay"}],
"isa'y": [
{ORTH: "isa", LEMMA: "isa"},
{ORTH: "'y", LEMMA: "ay"}],
"baya'y": [
{ORTH: "baya", LEMMA: "bayan"},
{ORTH: "'y", LEMMA: "ay"}],
"sa'yo": [
{ORTH: "sa", LEMMA: "sa"},
{ORTH: "'yo", LEMMA: "iyo"}],
"ano'ng": [
{ORTH: "ano", LEMMA: "ano"},
{ORTH: "'ng", LEMMA: "ang"}],
"siya'y": [
{ORTH: "siya", LEMMA: "siya"},
{ORTH: "'y", LEMMA: "ay"}],
"nawa'y": [
{ORTH: "nawa", LEMMA: "nawa"},
{ORTH: "'y", LEMMA: "ay"}],
"papa'no": [
{ORTH: "papa'no", LEMMA: "papaano"}],
"'di": [
{ORTH: "'di", LEMMA: "hindi"}]
}
# To keep things clean and readable, it's recommended to only declare the
# TOKENIZER_EXCEPTIONS at the bottom:
TOKENIZER_EXCEPTIONS = _exc
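
A quick self-check of the invariant stated above, that the ORTH values of the sub-tokens always combine back into the original string:

for key, tokens in TOKENIZER_EXCEPTIONS.items():
    assert "".join(t[ORTH] for t in tokens) == key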

View File

@ -468,7 +468,7 @@ class Language(object):
EXAMPLE:
>>> raw_text_batches = minibatch(raw_texts)
>>> for labelled_batch in minibatch(zip(train_docs, train_golds)):
>>> docs, golds = zip(*labelled_batch)
>>> nlp.update(docs, golds)
>>> raw_batch = [nlp.make_doc(text) for text in next(raw_text_batches)]
>>> nlp.rehearse(raw_batch)
@ -554,7 +554,7 @@ class Language(object):
def resume_training(self, sgd=None, **cfg):
"""Continue training a pre-trained model.
Create and return an optimizer, and initialize "rehearsal" for any pipeline
component that has a .rehearse() method. Rehearsal is used to prevent
models from "forgetting" their initialised "knowledge". To perform

View File

@ -291,6 +291,8 @@ cdef char get_quantifier(PatternStateC state) nogil:
DEF PADDING = 5
cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id,
object token_specs) except NULL:

View File

@ -426,7 +426,7 @@ cdef class Parser:
n_scores = 0.
loss = 0.
non_zeroed_classes = self._rehearsal_model.upper.W.any(axis=1)
while states:
targets, _ = tutor.begin_update(states)
guesses, backprop = model.begin_update(states)
d_scores = (targets - guesses) / targets.shape[0]

View File

@ -189,6 +189,25 @@ def test_doc_api_merge(en_tokenizer):
assert doc[5].text_with_ws == "all night"
assert doc[5].tag_ == "NAMED"
# merge both with bulk merge
doc = en_tokenizer(text)
assert len(doc) == 9
with doc.retokenize() as retokenizer:
retokenizer.merge(
doc[4:7], attrs={"tag": "NAMED", "lemma": "LEMMA", "ent_type": "TYPE"}
)
retokenizer.merge(
doc[7:9], attrs={"tag": "NAMED", "lemma": "LEMMA", "ent_type": "TYPE"}
)
assert len(doc) == 6
assert doc[4].text == "the beach boys"
assert doc[4].text_with_ws == "the beach boys "
assert doc[4].tag_ == "NAMED"
assert doc[5].text == "all night"
assert doc[5].text_with_ws == "all night"
assert doc[5].tag_ == "NAMED"
def test_doc_api_merge_children(en_tokenizer):
"""Test that attachments work correctly after merging."""

View File

@ -67,6 +67,22 @@ def test_spans_merge_non_disjoint(en_tokenizer):
)
def test_spans_merge_non_disjoint(en_tokenizer):
text = "Los Angeles start."
tokens = en_tokenizer(text)
doc = get_doc(tokens.vocab, [t.text for t in tokens])
with pytest.raises(ValueError):
with doc.retokenize() as retokenizer:
retokenizer.merge(
doc[0:2],
attrs={"tag": "NNP", "lemma": "Los Angeles", "ent_type": "GPE"},
)
retokenizer.merge(
doc[0:1],
attrs={"tag": "NNP", "lemma": "Los Angeles", "ent_type": "GPE"},
)
def test_span_np_merges(en_tokenizer):
text = "displaCy is a parse tool built with Javascript"
heads = [1, 0, 2, 1, -3, -1, -1, -1]

View File

@ -5,15 +5,36 @@ import pytest
@pytest.mark.parametrize(
"text", ["aujourd'hui", "Aujourd'hui", "prud'hommes", "prudhommal",
"audio-numérique", "Audio-numérique",
"entr'amis", "entr'abat", "rentr'ouvertes", "grand'hamien",
"Châteauneuf-la-Forêt", "Château-Guibert",
"11-septembre", "11-Septembre", "refox-trottâmes",
"K-POP", "K-Pop", "K-pop", "z'yeutes",
"black-outeront", "états-unienne",
"courtes-pattes", "court-pattes",
"saut-de-ski", "Écourt-Saint-Quentin", "Bout-de-l'Îlien", "pet-en-l'air"]
"text",
[
"aujourd'hui",
"Aujourd'hui",
"prud'hommes",
"prudhommal",
"audio-numérique",
"Audio-numérique",
"entr'amis",
"entr'abat",
"rentr'ouvertes",
"grand'hamien",
"Châteauneuf-la-Forêt",
"Château-Guibert",
"11-septembre",
"11-Septembre",
"refox-trottâmes",
"K-POP",
"K-Pop",
"K-pop",
"z'yeutes",
"black-outeront",
"états-unienne",
"courtes-pattes",
"court-pattes",
"saut-de-ski",
"Écourt-Saint-Quentin",
"Bout-de-l'Îlien",
"pet-en-l'air",
],
)
def test_fr_tokenizer_infix_exceptions(fr_tokenizer, text):
tokens = fr_tokenizer(text)

View File

@ -0,0 +1,89 @@
# coding: utf-8
from __future__ import unicode_literals
import json
from tempfile import NamedTemporaryFile
import pytest
from ...cli.train import train
def test_cli_trained_model_can_be_saved(tmpdir):
lang = 'nl'
output_dir = str(tmpdir)
train_file = NamedTemporaryFile('wb', dir=output_dir, delete=False)
train_corpus = [
{
"id": "identifier_0",
"paragraphs": [
{
"raw": "Jan houdt van Marie.\n",
"sentences": [
{
"tokens": [
{
"id": 0,
"dep": "nsubj",
"head": 1,
"tag": "NOUN",
"orth": "Jan",
"ner": "B-PER"
},
{
"id": 1,
"dep": "ROOT",
"head": 0,
"tag": "VERB",
"orth": "houdt",
"ner": "O"
},
{
"id": 2,
"dep": "case",
"head": 1,
"tag": "ADP",
"orth": "van",
"ner": "O"
},
{
"id": 3,
"dep": "obj",
"head": -2,
"tag": "NOUN",
"orth": "Marie",
"ner": "B-PER"
},
{
"id": 4,
"dep": "punct",
"head": -3,
"tag": "PUNCT",
"orth": ".",
"ner": "O"
},
{
"id": 5,
"dep": "",
"head": -1,
"tag": "SPACE",
"orth": "\n",
"ner": "O"
}
],
"brackets": []
}
]
}
]
}
]
train_file.write(json.dumps(train_corpus).encode('utf-8'))
train_file.close()
train_data = train_file.name
dev_data = train_data
# spacy train -n 1 -g -1 nl output_nl training_corpus.json training_corpus.json
train(lang, output_dir, train_data, dev_data, n_iter=1)
assert True

View File

@ -0,0 +1,36 @@
'''Test issue that arises when too many labels are added to NER model.'''
from __future__ import unicode_literals
import random
from ...lang.en import English
def train_model(train_data, entity_types):
nlp = English(pipeline=[])
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for entity_type in list(entity_types):
ner.add_label(entity_type)
optimizer = nlp.begin_training()
# Start training
for i in range(20):
losses = {}
index = 0
random.shuffle(train_data)
for statement, entities in train_data:
nlp.update([statement], [entities], sgd=optimizer, losses=losses, drop=0.5)
return nlp
def test_train_with_many_entity_types():
train_data = []
train_data.extend([("One sentence", {"entities": []})])
entity_types = [str(i) for i in range(1000)]
model = train_model(train_data, entity_types)

View File

@ -0,0 +1,40 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
import os
from pathlib import Path
from ..compat import symlink_to, symlink_remove, path2str
def target_local_path():
return "./foo-target"
def link_local_path():
return "./foo-symlink"
@pytest.fixture(scope="function")
def setup_target(request):
target = Path(target_local_path())
if not target.exists():
os.mkdir(path2str(target))
# yield -- need to cleanup even if assertion fails
# https://github.com/pytest-dev/pytest/issues/2508#issuecomment-309934240
def cleanup():
symlink_remove(Path(link_local_path()))
os.rmdir(target_local_path())
request.addfinalizer(cleanup)
def test_create_symlink_windows(setup_target):
target = Path(target_local_path())
link = Path(link_local_path())
assert target.exists()
symlink_to(link, target)
assert link.exists()

View File

@ -865,7 +865,7 @@ cdef class Token:
return Lexeme.c_check_flag(self.c.lex, IS_LEFT_PUNCT)
property is_right_punct:
"""RETURNS (bool): Whether the token is a left punctuation mark."""
"""RETURNS (bool): Whether the token is a right punctuation mark."""
def __get__(self):
return Lexeme.c_check_flag(self.c.lex, IS_RIGHT_PUNCT)

View File

@ -2,7 +2,7 @@
p
| Models trained on the
| #[+a("https://catalog.ldc.upenn.edu/ldc2013t19") OntoNotes 5] corpus
| #[+a("https://catalog.ldc.upenn.edu/LDC2013T19") OntoNotes 5] corpus
| support the following entity types:
+table(["Type", "Description"])

View File

@ -245,7 +245,7 @@ p The following file format converters are available:
+row
+cell #[code iob]
+cell IOB named entity recognition format.
+cell IOB or IOB2 named entity recognition format.
+h(3, "train") Train

View File

@ -352,6 +352,7 @@ p Retokenize the document, such that the span is merged into a single token.
+h(2, "ents") Span.ents
+tag property
+tag-model("NER")
+tag-new("2.0.12")
p
| Iterate over the entities in the span. Yields named-entity

View File

@ -714,7 +714,7 @@ p The L2 norm of the token's vector representation.
+cell bool
+cell
| Does the token consist of ASCII characters? Equivalent to
| #[code [any(ord(c) >= 128 for c in token.text)]].
| #[code all(ord(c) &lt; 128 for c in token.text)].
+row
+cell #[code is_digit]

View File

@ -91,8 +91,8 @@ p
p
| spaCy can be installed on GPU by specifying #[code spacy[cuda]],
| #[code spacy[cuda90]], #[code spacy[cuda91]], #[code spacy[cuda92]] or
| #[code spacy[cuda10]]. If you know your cuda version, using the more
| #[code spacy[cuda90]], #[code spacy[cuda91]] or #[code spacy[cuda92]].
| If you know your cuda version, using the more
| explicit specifier allows cupy to be installed via wheel, saving some
| compilation time. The specifiers should install two libraries:
| #[+a("https://cupy.chainer.org") #[code cupy]] and

View File

@ -206,7 +206,8 @@ p
nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab)
terminology_list = ['Barack Obama', 'Angela Merkel', 'Washington, D.C.']
patterns = [nlp(text) for text in terminology_list]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terminology_list]
matcher.add('TerminologyList', None, *patterns)
doc = nlp(u"German Chancellor Angela Merkel and US President Barack Obama "

View File

@ -44,7 +44,7 @@ p
+list.o-no-block
+item #[strong Chinese]: #[+a("https://github.com/fxsjy/jieba") Jieba]
+item #[strong Japanese]: #[+a("https://github.com/taku910/mecab") MeCab]
+item #[strong Japanese]: #[+a("https://github.com/taku910/mecab") MeCab] with #[+a("http://unidic.ninjal.ac.jp/back_number#unidic_cwj") Unidic]
+item #[strong Thai]: #[+a("https://github.com/wannaphongcom/pythainlp") pythainlp]
+item #[strong Vietnamese]: #[+a("https://github.com/trungtv/pyvi") Pyvi]
+item #[strong Russian]: #[+a("https://github.com/kmike/pymorphy2") pymorphy2]

View File

@ -72,7 +72,7 @@ p
name = 'entity_matcher'
def __init__(self, nlp, terms, label):
patterns = [nlp(text) for text in terms]
patterns = [nlp.make_doc(text) for text in terms]
self.matcher = PhraseMatcher(nlp.vocab)
self.matcher.add(label, None, *patterns)

View File

@ -240,7 +240,7 @@ p
+code-new.
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp(text) for text in large_terminology_list]
patterns = [nlp.make_doc(text) for text in large_terminology_list]
matcher.add('PRODUCT', None, *patterns)
+code-old.