Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match

Adriane Boyd 2020-05-22 12:18:00 +02:00
commit 730fa493a4
143 changed files with 2003 additions and 8059 deletions

.github/contributors/ilivans.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Ilia Ivanov |
| Company name (if applicable) | Chattermill |
| Title or role (if applicable) | DL Engineer |
| Date | 2020-05-14 |
| GitHub username | ilivans |
| Website (optional) | |

.github/contributors/kevinlu1248.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Kevin Lu |
| Company name (if applicable) | |
| Title or role (if applicable) | Student |
| Date | |
| GitHub username | kevinlu1248 |
| Website (optional) | |

.github/contributors/lfiedler.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Leander Fiedler |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 06 April 2020 |
| GitHub username | lfiedler |
| Website (optional) | |

.github/contributors/osori.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Ilkyu Ju |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-05-17 |
| GitHub username | osori |
| Website (optional) | |

.github/contributors/thoppe.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Travis Hoppe |
| Company name (if applicable) | |
| Title or role (if applicable) | Data Scientist |
| Date | 07 May 2020 |
| GitHub username | thoppe |
| Website (optional) | http://thoppe.github.io/ |

.github/contributors/vishnupriyavr.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Vishnu Priya VR |
| Company name (if applicable) | Uniphore |
| Title or role (if applicable) | NLP/AI Engineer |
| Date | 2020-05-03 |
| GitHub username | vishnupriyavr |
| Website (optional) | |


@@ -1,6 +1,7 @@
 """Prevent catastrophic forgetting with rehearsal updates."""
 import plac
 import random
+import warnings
 import srsly
 import spacy
 from spacy.gold import GoldParse
@@ -66,7 +67,10 @@ def main(model_name, unlabelled_loc):
     pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
     other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
     sizes = compounding(1.0, 4.0, 1.001)
-    with nlp.disable_pipes(*other_pipes):
+    with nlp.disable_pipes(*other_pipes) and warnings.catch_warnings():
+        # show warnings for misaligned entity spans once
+        warnings.filterwarnings("once", category=UserWarning, module='spacy')
         for itn in range(n_iter):
             random.shuffle(TRAIN_DATA)
             random.shuffle(raw_docs)
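
Note on the new `with` line in this hunk: it joins two context managers with `and`. In Python, `a and b` evaluates to a single object, so a `with` statement written that way enters only one of the two; a comma enters both. The following is a minimal sketch of that language behavior, not of spaCy itself. `tracked` is a hypothetical stand-in and is not part of spaCy; whether the difference matters for `nlp.disable_pipes` depends on when that call does its work, which this diff does not show.

```python
from contextlib import contextmanager

events = []


@contextmanager
def tracked(name):
    # Hypothetical stand-in for a context manager such as nlp.disable_pipes(...);
    # it only records when it is entered and exited.
    events.append("enter " + name)
    try:
        yield
    finally:
        events.append("exit " + name)


# `a and b` yields one object (here b, because a is truthy),
# so only the second context manager is entered and exited.
with tracked("first") and tracked("second"):
    pass
and_events = list(events)

events.clear()

# A comma enters both, left to right, and exits them in reverse order.
with tracked("first"), tracked("second"):
    pass
comma_events = list(events)

print(and_events)
print(comma_events)
```

With the comma form, both managers clean up when the block ends, which is usually what is wanted when one of them undoes a temporary change.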


@@ -64,7 +64,7 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
     """Create a blank model with the specified vocab, set up the pipeline and train the entity linker.
     The `vocab` should be the one used during creation of the KB."""
     vocab = Vocab().from_disk(vocab_path)
-    # create blank Language class with correct vocab
+    # create blank English model with correct vocab
     nlp = spacy.blank("en", vocab=vocab)
     nlp.vocab.vectors.name = "spacy_pretrained_vectors"
     print("Created blank 'en' model with vocab from '%s'" % vocab_path)

View File

@ -8,12 +8,13 @@ For more details, see the documentation:
* NER: https://spacy.io/usage/linguistic-features#named-entities * NER: https://spacy.io/usage/linguistic-features#named-entities
Compatible with: spaCy v2.0.0+ Compatible with: spaCy v2.0.0+
Last tested with: v2.1.0 Last tested with: v2.2.4
""" """
from __future__ import unicode_literals, print_function from __future__ import unicode_literals, print_function
import plac import plac
import random import random
import warnings
from pathlib import Path from pathlib import Path
import spacy import spacy
from spacy.util import minibatch, compounding from spacy.util import minibatch, compounding
@ -57,7 +58,11 @@ def main(model=None, output_dir=None, n_iter=100):
# get names of other pipes to disable them during training # get names of other pipes to disable them during training
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"] pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes): # only train NER # only train NER
with nlp.disable_pipes(*other_pipes) and warnings.catch_warnings():
# show warnings for misaligned entity spans once
warnings.filterwarnings("once", category=UserWarning, module='spacy')
# reset and initialize the weights randomly but only if we're # reset and initialize the weights randomly but only if we're
# training a new model # training a new model
if model is None: if model is None:
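
The `"once"` filter added in this hunk can be tried out in isolation. This is a minimal sketch using only the standard `warnings` module. The helper name `count_shown` and the messages passed to it are made up for illustration, and the `module='spacy'` argument from the diff is left out so that the filter matches warnings raised by the example itself.

```python
import warnings


def count_shown(action, message, repeats=3):
    """Emit the same UserWarning `repeats` times and count how many get through."""
    # record=True collects the warnings that pass the filters instead of printing
    # them; leaving the block restores the previous filter list.
    with warnings.catch_warnings(record=True) as caught:
        warnings.filterwarnings(action, category=UserWarning)
        for _ in range(repeats):
            warnings.warn(message, UserWarning)
    return len(caught)
```

Calling `count_shown("once", ...)` with a message that has not been seen before should report a single warning, while `count_shown("always", ...)` reports every repeat. Python keeps track of "once" messages for the lifetime of the process, so a second call with the same message is expected to report none.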

View File

@ -24,12 +24,13 @@ For more details, see the documentation:
* NER: https://spacy.io/usage/linguistic-features#named-entities * NER: https://spacy.io/usage/linguistic-features#named-entities
Compatible with: spaCy v2.1.0+ Compatible with: spaCy v2.1.0+
Last tested with: v2.1.0 Last tested with: v2.2.4
""" """
from __future__ import unicode_literals, print_function from __future__ import unicode_literals, print_function
import plac import plac
import random import random
import warnings
from pathlib import Path from pathlib import Path
import spacy import spacy
from spacy.util import minibatch, compounding from spacy.util import minibatch, compounding
@ -97,7 +98,11 @@ def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
# get names of other pipes to disable them during training # get names of other pipes to disable them during training
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"] pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes): # only train NER # only train NER
with nlp.disable_pipes(*other_pipes) and warnings.catch_warnings():
# show warnings for misaligned entity spans once
warnings.filterwarnings("once", category=UserWarning, module='spacy')
sizes = compounding(1.0, 4.0, 1.001) sizes = compounding(1.0, 4.0, 1.001)
# batch up the examples using spaCy's minibatch # batch up the examples using spaCy's minibatch
for itn in range(n_iter): for itn in range(n_iter):

View File

@ -59,7 +59,7 @@ install_requires =
[options.extras_require] [options.extras_require]
lookups = lookups =
spacy_lookups_data>=0.0.5,<0.2.0 spacy_lookups_data>=0.3.1,<0.4.0
cuda = cuda =
cupy>=5.0.0b4,<9.0.0 cupy>=5.0.0b4,<9.0.0
cuda80 = cuda80 =

View File

@ -279,13 +279,14 @@ class PrecomputableAffine(Model):
break break
def link_vectors_to_models(vocab): def link_vectors_to_models(vocab, skip_rank=False):
vectors = vocab.vectors vectors = vocab.vectors
if vectors.name is None: if vectors.name is None:
vectors.name = VECTORS_KEY vectors.name = VECTORS_KEY
if vectors.data.size != 0: if vectors.data.size != 0:
warnings.warn(Warnings.W020.format(shape=vectors.data.shape)) warnings.warn(Warnings.W020.format(shape=vectors.data.shape))
ops = Model.ops ops = Model.ops
if not skip_rank:
for word in vocab: for word in vocab:
if word.orth in vectors.key2row: if word.orth in vectors.key2row:
word.rank = vectors.key2row[word.orth] word.rank = vectors.key2row[word.orth]

View File

@ -15,7 +15,7 @@ cdef enum attr_id_t:
LIKE_NUM LIKE_NUM
LIKE_EMAIL LIKE_EMAIL
IS_STOP IS_STOP
IS_OOV IS_OOV_DEPRECATED
IS_BRACKET IS_BRACKET
IS_QUOTE IS_QUOTE
IS_LEFT_PUNCT IS_LEFT_PUNCT

View File

@ -16,7 +16,7 @@ IDS = {
"LIKE_NUM": LIKE_NUM, "LIKE_NUM": LIKE_NUM,
"LIKE_EMAIL": LIKE_EMAIL, "LIKE_EMAIL": LIKE_EMAIL,
"IS_STOP": IS_STOP, "IS_STOP": IS_STOP,
"IS_OOV": IS_OOV, "IS_OOV_DEPRECATED": IS_OOV_DEPRECATED,
"IS_BRACKET": IS_BRACKET, "IS_BRACKET": IS_BRACKET,
"IS_QUOTE": IS_QUOTE, "IS_QUOTE": IS_QUOTE,
"IS_LEFT_PUNCT": IS_LEFT_PUNCT, "IS_LEFT_PUNCT": IS_LEFT_PUNCT,

View File

@ -187,12 +187,17 @@ def debug_data(
n_missing_vectors = sum(gold_train_data["words_missing_vectors"].values()) n_missing_vectors = sum(gold_train_data["words_missing_vectors"].values())
msg.warn( msg.warn(
"{} words in training data without vectors ({:0.2f}%)".format( "{} words in training data without vectors ({:0.2f}%)".format(
n_missing_vectors, n_missing_vectors, n_missing_vectors / gold_train_data["n_words"],
n_missing_vectors / gold_train_data["n_words"],
), ),
) )
msg.text( msg.text(
"10 most common words without vectors: {}".format(_format_labels(gold_train_data["words_missing_vectors"].most_common(10), counts=True)), show=verbose, "10 most common words without vectors: {}".format(
_format_labels(
gold_train_data["words_missing_vectors"].most_common(10),
counts=True,
)
),
show=verbose,
) )
else: else:
msg.info("No word vectors present in the model") msg.info("No word vectors present in the model")
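
The message in this hunk formats `n_missing_vectors / gold_train_data["n_words"]` with `{:0.2f}%`, which prints the ratio itself followed by a percent sign. The sketch below compares that with Python's `%` format type, assuming the intent is to display a percentage. The function name `describe_missing` and the shortened message text are hypothetical.

```python
def describe_missing(n_missing, n_words):
    """Format the same ratio with the spec from the diff and with the % type."""
    ratio = n_missing / n_words
    # Spec used in the diff: the raw ratio with a literal "%" appended
    as_written = "{} words ({:0.2f}%)".format(n_missing, ratio)
    # The "%" format type multiplies by 100 before formatting
    as_percent = "{} words ({:.2%})".format(n_missing, ratio)
    return as_written, as_percent
```

For 50 missing words out of 200, the first form gives `0.25%` and the second gives `25.00%`.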

View File

@ -2,7 +2,6 @@
from __future__ import unicode_literals, division, print_function from __future__ import unicode_literals, division, print_function
import plac import plac
import spacy
from timeit import default_timer as timer from timeit import default_timer as timer
from wasabi import msg from wasabi import msg
@ -45,7 +44,7 @@ def evaluate(
msg.fail("Visualization output directory not found", displacy_path, exits=1) msg.fail("Visualization output directory not found", displacy_path, exits=1)
corpus = GoldCorpus(data_path, data_path) corpus = GoldCorpus(data_path, data_path)
if model.startswith("blank:"): if model.startswith("blank:"):
nlp = spacy.blank(model.replace("blank:", "")) nlp = util.get_lang_class(model.replace("blank:", ""))()
else: else:
nlp = util.load_model(model) nlp = util.load_model(model)
dev_docs = list(corpus.dev_docs(nlp, gold_preproc=gold_preproc)) dev_docs = list(corpus.dev_docs(nlp, gold_preproc=gold_preproc))

View File

@ -17,7 +17,9 @@ from wasabi import msg
from ..vectors import Vectors from ..vectors import Vectors
from ..errors import Errors, Warnings from ..errors import Errors, Warnings
from ..util import ensure_path, get_lang_class, OOV_RANK from ..util import ensure_path, get_lang_class, load_model, OOV_RANK
from ..lookups import Lookups
try: try:
import ftfy import ftfy
@ -49,6 +51,8 @@ DEFAULT_OOV_PROB = -20
str, str,
), ),
model_name=("Optional name for the model meta", "option", "mn", str), model_name=("Optional name for the model meta", "option", "mn", str),
omit_extra_lookups=("Don't include extra lookups in model", "flag", "OEL", bool),
base_model=("Base model (for languages with custom tokenizers)", "option", "b", str),
) )
def init_model( def init_model(
lang, lang,
@ -61,6 +65,8 @@ def init_model(
prune_vectors=-1, prune_vectors=-1,
vectors_name=None, vectors_name=None,
model_name=None, model_name=None,
omit_extra_lookups=False,
base_model=None,
): ):
""" """
Create a new model from raw data, like word frequencies, Brown clusters Create a new model from raw data, like word frequencies, Brown clusters
@ -92,7 +98,16 @@ def init_model(
lex_attrs = read_attrs_from_deprecated(freqs_loc, clusters_loc) lex_attrs = read_attrs_from_deprecated(freqs_loc, clusters_loc)
with msg.loading("Creating model..."): with msg.loading("Creating model..."):
nlp = create_model(lang, lex_attrs, name=model_name) nlp = create_model(lang, lex_attrs, name=model_name, base_model=base_model)
# Create empty extra lexeme tables so the data from spacy-lookups-data
# isn't loaded if these features are accessed
if omit_extra_lookups:
nlp.vocab.lookups_extra = Lookups()
nlp.vocab.lookups_extra.add_table("lexeme_cluster")
nlp.vocab.lookups_extra.add_table("lexeme_prob")
nlp.vocab.lookups_extra.add_table("lexeme_settings")
msg.good("Successfully created model") msg.good("Successfully created model")
if vectors_loc is not None: if vectors_loc is not None:
add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, vectors_name) add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, vectors_name)
@ -152,20 +167,23 @@ def read_attrs_from_deprecated(freqs_loc, clusters_loc):
return lex_attrs return lex_attrs
def create_model(lang, lex_attrs, name=None): def create_model(lang, lex_attrs, name=None, base_model=None):
if base_model:
nlp = load_model(base_model)
# keep the tokenizer but remove any existing pipeline components due to
# potentially conflicting vectors
for pipe in nlp.pipe_names:
nlp.remove_pipe(pipe)
else:
lang_class = get_lang_class(lang) lang_class = get_lang_class(lang)
nlp = lang_class() nlp = lang_class()
for lexeme in nlp.vocab: for lexeme in nlp.vocab:
lexeme.rank = OOV_RANK lexeme.rank = OOV_RANK
lex_added = 0
for attrs in lex_attrs: for attrs in lex_attrs:
if "settings" in attrs: if "settings" in attrs:
continue continue
lexeme = nlp.vocab[attrs["orth"]] lexeme = nlp.vocab[attrs["orth"]]
lexeme.set_attrs(**attrs) lexeme.set_attrs(**attrs)
lexeme.is_oov = False
lex_added += 1
lex_added += 1
if len(nlp.vocab): if len(nlp.vocab):
oov_prob = min(lex.prob for lex in nlp.vocab) - 1 oov_prob = min(lex.prob for lex in nlp.vocab) - 1
else: else:
@ -181,7 +199,7 @@ def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None):
if vectors_loc and vectors_loc.parts[-1].endswith(".npz"): if vectors_loc and vectors_loc.parts[-1].endswith(".npz"):
nlp.vocab.vectors = Vectors(data=numpy.load(vectors_loc.open("rb"))) nlp.vocab.vectors = Vectors(data=numpy.load(vectors_loc.open("rb")))
for lex in nlp.vocab: for lex in nlp.vocab:
if lex.rank: if lex.rank and lex.rank != OOV_RANK:
nlp.vocab.vectors.add(lex.orth, row=lex.rank) nlp.vocab.vectors.add(lex.orth, row=lex.rank)
else: else:
if vectors_loc: if vectors_loc:
@ -193,8 +211,7 @@ def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None):
if vector_keys is not None: if vector_keys is not None:
for word in vector_keys: for word in vector_keys:
if word not in nlp.vocab: if word not in nlp.vocab:
lexeme = nlp.vocab[word] nlp.vocab[word]
lexeme.is_oov = False
if vectors_data is not None: if vectors_data is not None:
nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys) nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys)
if name is None: if name is None:

View File

@ -15,9 +15,9 @@ import random
from .._ml import create_default_optimizer from .._ml import create_default_optimizer
from ..util import use_gpu as set_gpu from ..util import use_gpu as set_gpu
from ..attrs import PROB, IS_OOV, CLUSTER, LANG
from ..gold import GoldCorpus from ..gold import GoldCorpus
from ..compat import path2str from ..compat import path2str
from ..lookups import Lookups
from .. import util from .. import util
from .. import about from .. import about
@ -58,6 +58,7 @@ from .. import about
textcat_arch=("Textcat model architecture", "option", "ta", str), textcat_arch=("Textcat model architecture", "option", "ta", str),
textcat_positive_label=("Textcat positive label for binary classes with two labels", "option", "tpl", str), textcat_positive_label=("Textcat positive label for binary classes with two labels", "option", "tpl", str),
tag_map_path=("Location of JSON-formatted tag map", "option", "tm", Path), tag_map_path=("Location of JSON-formatted tag map", "option", "tm", Path),
omit_extra_lookups=("Don't include extra lookups in model", "flag", "OEL", bool),
verbose=("Display more information for debug", "flag", "VV", bool), verbose=("Display more information for debug", "flag", "VV", bool),
debug=("Run data diagnostics before training", "flag", "D", bool), debug=("Run data diagnostics before training", "flag", "D", bool),
# fmt: on # fmt: on
@ -97,6 +98,7 @@ def train(
textcat_arch="bow", textcat_arch="bow",
textcat_positive_label=None, textcat_positive_label=None,
tag_map_path=None, tag_map_path=None,
omit_extra_lookups=False,
verbose=False, verbose=False,
debug=False, debug=False,
): ):
@ -248,6 +250,14 @@ def train(
# Update tag map with provided mapping # Update tag map with provided mapping
nlp.vocab.morphology.tag_map.update(tag_map) nlp.vocab.morphology.tag_map.update(tag_map)
# Create empty extra lexeme tables so the data from spacy-lookups-data
# isn't loaded if these features are accessed
if omit_extra_lookups:
nlp.vocab.lookups_extra = Lookups()
nlp.vocab.lookups_extra.add_table("lexeme_cluster")
nlp.vocab.lookups_extra.add_table("lexeme_prob")
nlp.vocab.lookups_extra.add_table("lexeme_settings")
if vectors: if vectors:
msg.text("Loading vector from model '{}'".format(vectors)) msg.text("Loading vector from model '{}'".format(vectors))
_load_vectors(nlp, vectors) _load_vectors(nlp, vectors)
@ -630,15 +640,6 @@ def _create_progress_bar(total):
def _load_vectors(nlp, vectors): def _load_vectors(nlp, vectors):
util.load_model(vectors, vocab=nlp.vocab) util.load_model(vectors, vocab=nlp.vocab)
for lex in nlp.vocab:
values = {}
for attr, func in nlp.vocab.lex_attr_getters.items():
# These attrs are expected to be set by data. Others should
# be set by calling the language functions.
if attr not in (CLUSTER, PROB, IS_OOV, LANG):
values[lex.vocab.strings[attr]] = func(lex.orth_)
lex.set_attrs(**values)
lex.is_oov = False
def _load_pretrained_tok2vec(nlp, loc): def _load_pretrained_tok2vec(nlp, loc):

View File

@ -1,12 +1,16 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
def add_codes(err_cls): def add_codes(err_cls):
"""Add error codes to string messages via class attribute names.""" """Add error codes to string messages via class attribute names."""
class ErrorsWithCodes(object): class ErrorsWithCodes(err_cls):
def __getattribute__(self, code): def __getattribute__(self, code):
msg = getattr(err_cls, code) msg = super(ErrorsWithCodes, self).__getattribute__(code)
if code.startswith("__"): # python system attributes like __class__
return msg
else:
return "[{code}] {msg}".format(code=code, msg=msg) return "[{code}] {msg}".format(code=code, msg=msg)
return ErrorsWithCodes() return ErrorsWithCodes()
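
To see what the revised `add_codes` does, here is a self-contained sketch that copies the new version from this hunk and applies it to a hypothetical `DemoErrors` class (not part of spaCy). Attribute names become the bracketed code prefix, while dunder attributes such as `__class__` are returned unchanged.

```python
def add_codes(err_cls):
    """Add error codes to string messages via class attribute names."""

    class ErrorsWithCodes(err_cls):
        def __getattribute__(self, code):
            msg = super(ErrorsWithCodes, self).__getattribute__(code)
            if code.startswith("__"):  # python system attributes like __class__
                return msg
            return "[{code}] {msg}".format(code=code, msg=msg)

    return ErrorsWithCodes()


@add_codes
class DemoErrors(object):  # hypothetical example class
    E001 = "Row {row} is out of bounds."


# The decorator replaces the class with an instance, so attributes are read directly
message = DemoErrors.E001.format(row=3)
```

Because system attributes pass through untouched, `DemoErrors.__class__` is still a real class object rather than a formatted string, which is presumably what the added `startswith("__")` branch is for.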
@ -106,6 +110,11 @@ class Warnings(object):
"in problems with the vocab further on in the pipeline.") "in problems with the vocab further on in the pipeline.")
W029 = ("Unable to align tokens with entities from character offsets. " W029 = ("Unable to align tokens with entities from character offsets. "
"Discarding entity annotation for the text: {text}.") "Discarding entity annotation for the text: {text}.")
W030 = ("Some entities could not be aligned in the text \"{text}\" with "
"entities \"{entities}\". Use "
"`spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)`"
" to check the alignment. Misaligned entities ('-') will be "
"ignored during training.")
@add_codes @add_codes
@ -555,6 +564,9 @@ class Errors(object):
E195 = ("Matcher can be called on {good} only, got {got}.") E195 = ("Matcher can be called on {good} only, got {got}.")
E196 = ("Refusing to write to token.is_sent_end. Sentence boundaries can " E196 = ("Refusing to write to token.is_sent_end. Sentence boundaries can "
"only be fixed with token.is_sent_start.") "only be fixed with token.is_sent_start.")
E197 = ("Row out of bounds, unable to add row {row} for key {key}.")
E198 = ("Unable to return {n} most similar vectors for the current vectors "
"table, which contains {n_rows} vectors.")
@add_codes @add_codes

View File

@ -658,7 +658,15 @@ cdef class GoldParse:
entdoc = None entdoc = None
# avoid allocating memory if the doc does not contain any tokens # avoid allocating memory if the doc does not contain any tokens
if self.length > 0: if self.length == 0:
self.words = []
self.tags = []
self.heads = []
self.labels = []
self.ner = []
self.morphology = []
else:
if words is None: if words is None:
words = [token.text for token in doc] words = [token.text for token in doc]
if tags is None: if tags is None:
@ -949,6 +957,12 @@ def biluo_tags_from_offsets(doc, entities, missing="O"):
break break
else: else:
biluo[token.i] = missing biluo[token.i] = missing
if "-" in biluo:
ent_str = str(entities)
warnings.warn(Warnings.W030.format(
text=doc.text[:50] + "..." if len(doc.text) > 50 else doc.text,
entities=ent_str[:50] + "..." if len(ent_str) > 50 else ent_str
))
return biluo return biluo
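
The two truncation expressions passed to `W030.format(...)` rely on the conditional expression binding more loosely than `+`. A minimal sketch of the same pattern, using a hypothetical helper named `shorten`:

```python
def shorten(text, limit=50):
    """Cut `text` to `limit` characters and append "..." only when it was cut."""
    # Parsed as: (text[:limit] + "...") if len(text) > limit else text
    return text[:limit] + "..." if len(text) > limit else text
```

A 60-character string comes back as its first 50 characters plus `...`, while a string of 50 characters or fewer is returned unchanged.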

View File

@ -6,7 +6,7 @@ from libcpp.vector cimport vector
from libc.stdint cimport int32_t, int64_t from libc.stdint cimport int32_t, int64_t
from libc.stdio cimport FILE from libc.stdio cimport FILE
from spacy.vocab cimport Vocab from .vocab cimport Vocab
from .typedefs cimport hash_t from .typedefs cimport hash_t
from .structs cimport KBEntryC, AliasC from .structs cimport KBEntryC, AliasC
@ -169,4 +169,3 @@ cdef class Reader:
cdef int read_alias(self, int64_t* entry_index, float* prob) except -1 cdef int read_alias(self, int64_t* entry_index, float* prob) except -1
cdef int _read(self, void* value, size_t size) except -1 cdef int _read(self, void* value, size_t size) except -1

View File

@ -1,23 +1,20 @@
# cython: infer_types=True # cython: infer_types=True
# cython: profile=True # cython: profile=True
# coding: utf8 # coding: utf8
import warnings
from spacy.errors import Errors, Warnings
from pathlib import Path
from cymem.cymem cimport Pool from cymem.cymem cimport Pool
from preshed.maps cimport PreshMap from preshed.maps cimport PreshMap
from cpython.exc cimport PyErr_SetFromErrno from cpython.exc cimport PyErr_SetFromErrno
from libc.stdio cimport fopen, fclose, fread, fwrite, feof, fseek from libc.stdio cimport fopen, fclose, fread, fwrite, feof, fseek
from libc.stdint cimport int32_t, int64_t from libc.stdint cimport int32_t, int64_t
from libcpp.vector cimport vector
import warnings
from os import path
from pathlib import Path
from .typedefs cimport hash_t from .typedefs cimport hash_t
from os import path from .errors import Errors, Warnings
from libcpp.vector cimport vector
cdef class Candidate: cdef class Candidate:
@ -448,10 +445,10 @@ cdef class KnowledgeBase:
cdef class Writer: cdef class Writer:
def __init__(self, object loc): def __init__(self, object loc):
if path.exists(loc):
assert not path.isdir(loc), "%s is directory." % loc
if isinstance(loc, Path): if isinstance(loc, Path):
loc = bytes(loc) loc = bytes(loc)
if path.exists(loc):
assert not path.isdir(loc), "%s is directory." % loc
cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
self._fp = fopen(<char*>bytes_loc, 'wb') self._fp = fopen(<char*>bytes_loc, 'wb')
if not self._fp: if not self._fp:
@ -493,10 +490,10 @@ cdef class Writer:
cdef class Reader: cdef class Reader:
def __init__(self, object loc): def __init__(self, object loc):
assert path.exists(loc)
assert not path.isdir(loc)
if isinstance(loc, Path): if isinstance(loc, Path):
loc = bytes(loc) loc = bytes(loc)
assert path.exists(loc)
assert not path.isdir(loc)
cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
self._fp = fopen(<char*>bytes_loc, 'rb') self._fp = fopen(<char*>bytes_loc, 'rb')
if not self._fp: if not self._fp:
@ -586,5 +583,3 @@ cdef class Reader:
cdef int _read(self, void* value, size_t size) except -1: cdef int _read(self, void* value, size_t size) except -1:
status = fread(value, size, 1, self._fp) status = fread(value, size, 1, self._fp)
return status return status

View File

@ -2,7 +2,6 @@
from __future__ import unicode_literals from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .norm_exceptions import NORM_EXCEPTIONS
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
@ -10,19 +9,15 @@ from .morph_rules import MORPH_RULES
from ..tag_map import TAG_MAP from ..tag_map import TAG_MAP
from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language from ...language import Language
from ...attrs import LANG, NORM from ...attrs import LANG
from ...util import update_exc, add_lookups from ...util import update_exc
class DanishDefaults(Language.Defaults): class DanishDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS) lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: "da" lex_attr_getters[LANG] = lambda text: "da"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
morph_rules = MORPH_RULES morph_rules = MORPH_RULES
infixes = TOKENIZER_INFIXES infixes = TOKENIZER_INFIXES

View File

@ -1,527 +0,0 @@
# coding: utf8
"""
Special-case rules for normalizing tokens to improve the model's predictions.
For example 'mysterium' vs 'mysterie' and similar.
"""
from __future__ import unicode_literals
# Sources:
# 1: https://dsn.dk/retskrivning/om-retskrivningsordbogen/mere-om-retskrivningsordbogen-2012/endrede-stave-og-ordformer/
# 2: http://www.tjerry-korrektur.dk/ord-med-flere-stavemaader/
_exc = {
# Alternative spelling
"a-kraft-værk": "a-kraftværk", # 1
"ålborg": "aalborg", # 2
"århus": "aarhus",
"accessoirer": "accessoires", # 1
"affektert": "affekteret", # 1
"afrikander": "afrikaaner", # 1
"aftabuere": "aftabuisere", # 1
"aftabuering": "aftabuisering", # 1
"akvarium": "akvarie", # 1
"alenefader": "alenefar", # 1
"alenemoder": "alenemor", # 1
"alkoholambulatorium": "alkoholambulatorie", # 1
"ambulatorium": "ambulatorie", # 1
"ananassene": "ananasserne", # 2
"anførelsestegn": "anførselstegn", # 1
"anseelig": "anselig", # 2
"antioxydant": "antioxidant", # 1
"artrig": "artsrig", # 1
"auditorium": "auditorie", # 1
"avocado": "avokado", # 2
"bagerst": "bagest", # 2
"bagstræv": "bagstræb", # 1
"bagstræver": "bagstræber", # 1
"bagstræverisk": "bagstræberisk", # 1
"balde": "balle", # 2
"barselorlov": "barselsorlov", # 1
"barselvikar": "barselsvikar", # 1
"baskien": "baskerlandet", # 1
"bayrisk": "bayersk", # 1
"bedstefader": "bedstefar", # 1
"bedstemoder": "bedstemor", # 1
"behefte": "behæfte", # 1
"beheftelse": "behæftelse", # 1
"bidragydende": "bidragsydende", # 1
"bidragyder": "bidragsyder", # 1
"billiondel": "billiontedel", # 1
"blaseret": "blasert", # 1
"bleskifte": "bleskift", # 1
"blodbroder": "blodsbroder", # 2
"blyantspidser": "blyantsspidser", # 2
"boligministerium": "boligministerie", # 1
"borhul": "borehul", # 1
"broder": "bror", # 2
"buldog": "bulldog", # 2
"bådhus": "bådehus", # 1
"børnepleje": "barnepleje", # 1
"børneseng": "barneseng", # 1
"børnestol": "barnestol", # 1
"cairo": "kairo", # 1
"cambodia": "cambodja", # 1
"cambodianer": "cambodjaner", # 1
"cambodiansk": "cambodjansk", # 1
"camouflage": "kamuflage", # 2
"campylobacter": "kampylobakter", # 1
"centeret": "centret", # 2
"chefskahyt": "chefkahyt", # 1
"chefspost": "chefpost", # 1
"chefssekretær": "chefsekretær", # 1
"chefsstol": "chefstol", # 1
"cirkulærskrivelse": "cirkulæreskrivelse", # 1
"cognacsglas": "cognacglas", # 1
"columnist": "kolumnist", # 1
"cricket": "kricket", # 2
"dagplejemoder": "dagplejemor", # 1
"damaskesdug": "damaskdug", # 1
"damp-barn": "dampbarn", # 1
"delfinarium": "delfinarie", # 1
"dentallaboratorium": "dentallaboratorie", # 1
"diaramme": "diasramme", # 1
"diaré": "diarré", # 1
"dioxyd": "dioxid", # 1
"dommedagsprædiken": "dommedagspræken", # 1
"donut": "doughnut", # 2
"driftmæssig": "driftsmæssig", # 1
"driftsikker": "driftssikker", # 1
"driftsikring": "driftssikring", # 1
"drikkejogurt": "drikkeyoghurt", # 1
"drivein": "drive-in", # 1
"driveinbiograf": "drive-in-biograf", # 1
"drøvel": "drøbel", # 1
"dødskriterium": "dødskriterie", # 1
"e-mail-adresse": "e-mailadresse", # 1
"e-post-adresse": "e-postadresse", # 1
"egypten": "ægypten", # 2
"ekskommunicere": "ekskommunikere", # 1
"eksperimentarium": "eksperimentarie", # 1
"elsass": "Alsace", # 1
"elsasser": "alsacer", # 1
"elsassisk": "alsacisk", # 1
"elvetal": "ellevetal", # 1
"elvetiden": "ellevetiden", # 1
"elveårig": "elleveårig", # 1
"elveårs": "elleveårs", # 1
"elveårsbarn": "elleveårsbarn", # 1
"elvte": "ellevte", # 1
"elvtedel": "ellevtedel", # 1
"energiministerium": "energiministerie", # 1
"erhvervsministerium": "erhvervsministerie", # 1
"espaliere": "spaliere", # 2
"evangelium": "evangelie", # 1
"fagministerium": "fagministerie", # 1
"fakse": "faxe", # 1
"fangstkvota": "fangstkvote", # 1
"fader": "far", # 2
"farbroder": "farbror", # 1
"farfader": "farfar", # 1
"farmoder": "farmor", # 1
"federal": "føderal", # 1
"federalisering": "føderalisering", # 1
"federalisme": "føderalisme", # 1
"federalist": "føderalist", # 1
"federalistisk": "føderalistisk", # 1
"federation": "føderation", # 1
"federativ": "føderativ", # 1
"fejlbeheftet": "fejlbehæftet", # 1
"femetagers": "femetages", # 2
"femhundredekroneseddel": "femhundredkroneseddel", # 2
"filmpremiere": "filmpræmiere", # 2
"finansimperium": "finansimperie", # 1
"finansministerium": "finansministerie", # 1
"firehjulstræk": "firhjulstræk", # 2
"fjernstudium": "fjernstudie", # 1
"formalier": "formalia", # 1
"formandsskift": "formandsskifte", # 1
"fornemst": "fornemmest", # 2
"fornuftparti": "fornuftsparti", # 1
"fornuftstridig": "fornuftsstridig", # 1
"fornuftvæsen": "fornuftsvæsen", # 1
"fornuftægteskab": "fornuftsægteskab", # 1
"forretningsministerium": "forretningsministerie", # 1
"forskningsministerium": "forskningsministerie", # 1
"forstudium": "forstudie", # 1
"forsvarsministerium": "forsvarsministerie", # 1
"frilægge": "fritlægge", # 1
"frilæggelse": "fritlæggelse", # 1
"frilægning": "fritlægning", # 1
"fristille": "fritstille", # 1
"fristilling": "fritstilling", # 1
"fuldttegnet": "fuldtegnet", # 1
"fødestedskriterium": "fødestedskriterie", # 1
"fødevareministerium": "fødevareministerie", # 1
"følesløs": "følelsesløs", # 1
"følgeligt": "følgelig", # 1
"førne": "førn", # 1
"gearskift": "gearskifte", # 2
"gladeligt": "gladelig", # 1
"glosehefte": "glosehæfte", # 1
"glædeløs": "glædesløs", # 1
"gonoré": "gonorré", # 1
"grangiveligt": "grangivelig", # 1
"grundliggende": "grundlæggende", # 2
"grønsag": "grøntsag", # 2
"gudbenådet": "gudsbenådet", # 1
"gudfader": "gudfar", # 1
"gudmoder": "gudmor", # 1
"gulvmop": "gulvmoppe", # 1
"gymnasium": "gymnasie", # 1
"hackning": "hacking", # 1
"halvbroder": "halvbror", # 1
"halvelvetiden": "halvellevetiden", # 1
"handelsgymnasium": "handelsgymnasie", # 1
"hefte": "hæfte", # 1
"hefteklamme": "hæfteklamme", # 1
"heftelse": "hæftelse", # 1
"heftemaskine": "hæftemaskine", # 1
"heftepistol": "hæftepistol", # 1
"hefteplaster": "hæfteplaster", # 1
"heftestraf": "hæftestraf", # 1
"heftning": "hæftning", # 1
"helbroder": "helbror", # 1
"hjemmeklasse": "hjemklasse", # 1
"hjulspin": "hjulspind", # 1
"huggevåben": "hugvåben", # 1
"hulmurisolering": "hulmursisolering", # 1
"hurtiggående": "hurtigtgående", # 2
"hurtigttørrende": "hurtigtørrende", # 2
"husmoder": "husmor", # 1
"hydroxyd": "hydroxid", # 1
"håndmikser": "håndmixer", # 1
"højtaler": "højttaler", # 2
"hønemoder": "hønemor", # 1
"ide": "idé", # 2
"imperium": "imperie", # 1
"imponerthed": "imponerethed", # 1
"inbox": "indboks", # 2
"indenrigsministerium": "indenrigsministerie", # 1
"indhefte": "indhæfte", # 1
"indheftning": "indhæftning", # 1
"indicium": "indicie", # 1
"indkassere": "inkassere", # 2
"iota": "jota", # 1
"jobskift": "jobskifte", # 1
"jogurt": "yoghurt", # 1
"jukeboks": "jukebox", # 1
"justitsministerium": "justitsministerie", # 1
"kalorifere": "kalorifer", # 1
"kandidatstipendium": "kandidatstipendie", # 1
"kannevas": "kanvas", # 1
"kaperssauce": "kaperssovs", # 1
"kigge": "kikke", # 2
"kirkeministerium": "kirkeministerie", # 1
"klapmydse": "klapmyds", # 1
"klimakterium": "klimakterie", # 1
"klogeligt": "klogelig", # 1
"knivblad": "knivsblad", # 1
"kollegaer": "kolleger", # 2
"kollegium": "kollegie", # 1
"kollegiehefte": "kollegiehæfte", # 1
"kollokviumx": "kollokvium", # 1
"kommissorium": "kommissorie", # 1
"kompendium": "kompendie", # 1
"komplicerthed": "komplicerethed", # 1
"konfederation": "konføderation", # 1
"konfedereret": "konfødereret", # 1
"konferensstudium": "konferensstudie", # 1
"konservatorium": "konservatorie", # 1
"konsulere": "konsultere", # 1
"kradsbørstig": "krasbørstig", # 2
"kravsspecifikation": "kravspecifikation", # 1
"krematorium": "krematorie", # 1
"krep": "crepe", # 1
"krepnylon": "crepenylon", # 1
"kreppapir": "crepepapir", # 1
"kricket": "cricket", # 2
"kriterium": "kriterie", # 1
"kroat": "kroater", # 2
"kroki": "croquis", # 1
"kronprinsepar": "kronprinspar", # 2
"kropdoven": "kropsdoven", # 1
"kroplus": "kropslus", # 1
"krøllefedt": "krølfedt", # 1
"kulturministerium": "kulturministerie", # 1
"kuponhefte": "kuponhæfte", # 1
"kvota": "kvote", # 1
"kvotaordning": "kvoteordning", # 1
"laboratorium": "laboratorie", # 1
"laksfarve": "laksefarve", # 1
"laksfarvet": "laksefarvet", # 1
"laksrød": "lakserød", # 1
"laksyngel": "lakseyngel", # 1
"laksørred": "lakseørred", # 1
"landbrugsministerium": "landbrugsministerie", # 1
"landskampstemning": "landskampsstemning", # 1
"langust": "languster", # 1
"lappegrejer": "lappegrej", # 1
"lavløn": "lavtløn", # 1
"lillebroder": "lillebror", # 1
"linear": "lineær", # 1
"loftlampe": "loftslampe", # 2
"log-in": "login", # 1
"login": "log-in", # 2
"lovmedholdig": "lovmedholdelig", # 1
"ludder": "luder", # 2
"lysholder": "lyseholder", # 1
"lægeskifte": "lægeskift", # 1
"lærvillig": "lærevillig", # 1
"løgsauce": "løgsovs", # 1
"madmoder": "madmor", # 1
"majonæse": "mayonnaise", # 1
"mareridtagtig": "mareridtsagtig", # 1
"margen": "margin", # 2
"martyrium": "martyrie", # 1
"mellemstatlig": "mellemstatslig", # 1
"menneskene": "menneskerne", # 2
"metropolis": "metropol", # 1
"miks": "mix", # 1
"mikse": "mixe", # 1
"miksepult": "mixerpult", # 1
"mikser": "mixer", # 1
"mikserpult": "mixerpult", # 1
"mikslån": "mixlån", # 1
"miksning": "mixning", # 1
"miljøministerium": "miljøministerie", # 1
"milliarddel": "milliardtedel", # 1
"milliondel": "milliontedel", # 1
"ministerium": "ministerie", # 1
"mop": "moppe", # 1
"moder": "mor", # 2
"moratorium": "moratorie", # 1
"morbroder": "morbror", # 1
"morfader": "morfar", # 1
"mormoder": "mormor", # 1
"musikkonservatorium": "musikkonservatorie", # 1
"muslingskal": "muslingeskal", # 1
"mysterium": "mysterie", # 1
"naturalieydelse": "naturalydelse", # 1
"naturalieøkonomi": "naturaløkonomi", # 1
"navnebroder": "navnebror", # 1
"nerium": "nerie", # 1
"nådeløs": "nådesløs", # 1
"nærforestående": "nærtforestående", # 1
"nærstående": "nærtstående", # 1
"observatorium": "observatorie", # 1
"oldefader": "oldefar", # 1
"oldemoder": "oldemor", # 1
"opgraduere": "opgradere", # 1
"opgraduering": "opgradering", # 1
"oratorium": "oratorie", # 1
"overbookning": "overbooking", # 1
"overpræsidium": "overpræsidie", # 1
"overstatlig": "overstatslig", # 1
"oxyd": "oxid", # 1
"oxydere": "oxidere", # 1
"oxydering": "oxidering", # 1
"pakkenellike": "pakkenelliker", # 1
"papirtynd": "papirstynd", # 1
"pastoralseminarium": "pastoralseminarie", # 1
"peanutsene": "peanuttene", # 2
"penalhus": "pennalhus", # 2
"pensakrav": "pensumkrav", # 1
"pepperoni": "peperoni", # 1
"peruaner": "peruvianer", # 1
"petrole": "petrol", # 1
"piltast": "piletast", # 1
"piltaste": "piletast", # 1
"planetarium": "planetarie", # 1
"plasteret": "plastret", # 2
"plastic": "plastik", # 2
"play-off-kamp": "playoffkamp", # 1
"plejefader": "plejefar", # 1
"plejemoder": "plejemor", # 1
"podium": "podie", # 2
"praha": "prag", # 2
"preciøs": "pretiøs", # 2
"privilegium": "privilegie", # 1
"progredere": "progrediere", # 1
"præsidium": "præsidie", # 1
"psykodelisk": "psykedelisk", # 1
"pudsegrejer": "pudsegrej", # 1
"referensgruppe": "referencegruppe", # 1
"referensramme": "referenceramme", # 1
"refugium": "refugie", # 1
"registeret": "registret", # 2
"remedium": "remedie", # 1
"remiks": "remix", # 1
"reservert": "reserveret", # 1
"ressortministerium": "ressortministerie", # 1
"ressource": "resurse", # 2
"resætte": "resette", # 1
"rettelig": "retteligt", # 1
"rettetaste": "rettetast", # 1
"returtaste": "returtast", # 1
"risici": "risikoer", # 2
"roll-on": "rollon", # 1
"rollehefte": "rollehæfte", # 1
"rostbøf": "roastbeef", # 1
"rygsæksturist": "rygsækturist", # 1
"rødstjært": "rødstjert", # 1
"saddel": "sadel", # 2
"samaritan": "samaritaner", # 2
"sanatorium": "sanatorie", # 1
"sauce": "sovs", # 1
"scanning": "skanning", # 2
"sceneskifte": "sceneskift", # 1
"scilla": "skilla", # 1
"sejflydende": "sejtflydende", # 1
"selvstudium": "selvstudie", # 1
"seminarium": "seminarie", # 1
"sennepssauce": "sennepssovs ", # 1
"servitutbeheftet": "servitutbehæftet", # 1
"sit-in": "sitin", # 1
"skatteministerium": "skatteministerie", # 1
"skifer": "skiffer", # 2
"skyldsfølelse": "skyldfølelse", # 1
"skysauce": "skysovs", # 1
"sladdertaske": "sladretaske", # 2
"sladdervorn": "sladrevorn", # 2
"slagsbroder": "slagsbror", # 1
"slettetaste": "slettetast", # 1
"smørsauce": "smørsovs", # 1
"snitsel": "schnitzel", # 1
"snobbeeffekt": "snobeffekt", # 2
"socialministerium": "socialministerie", # 1
"solarium": "solarie", # 1
"soldebroder": "soldebror", # 1
"spagetti": "spaghetti", # 1
"spagettistrop": "spaghettistrop", # 1
"spagettiwestern": "spaghettiwestern", # 1
"spin-off": "spinoff", # 1
"spinnefiskeri": "spindefiskeri", # 1
"spolorm": "spoleorm", # 1
"sproglaboratorium": "sproglaboratorie", # 1
"spækbræt": "spækkebræt", # 2
"stand-in": "standin", # 1
"stand-up-comedy": "standupcomedy", # 1
"stand-up-komiker": "standupkomiker", # 1
"statsministerium": "statsministerie", # 1
"stedbroder": "stedbror", # 1
"stedfader": "stedfar", # 1
"stedmoder": "stedmor", # 1
"stilehefte": "stilehæfte", # 1
"stipendium": "stipendie", # 1
"stjært": "stjert", # 1
"stjærthage": "stjerthage", # 1
"storebroder": "storebror", # 1
"stortå": "storetå", # 1
"strabads": "strabadser", # 1
"strømlinjet": "strømlinet", # 1
"studium": "studie", # 1
"stænkelap": "stænklap", # 1
"sundhedsministerium": "sundhedsministerie", # 1
"suppositorium": "suppositorie", # 1
"svejts": "schweiz", # 1
"svejtser": "schweizer", # 1
"svejtserfranc": "schweizerfranc", # 1
"svejtserost": "schweizerost", # 1
"svejtsisk": "schweizisk", # 1
"svigerfader": "svigerfar", # 1
"svigermoder": "svigermor", # 1
"svirebroder": "svirebror", # 1
"symposium": "symposie", # 1
"sælarium": "sælarie", # 1
"søreme": "sørme", # 2
"søterritorium": "søterritorie", # 1
"t-bone-steak": "t-bonesteak", # 1
"tabgivende": "tabsgivende", # 1
"tabuere": "tabuisere", # 1
"tabuering": "tabuisering", # 1
"tackle": "takle", # 2
"tackling": "takling", # 2
"taifun": "tyfon", # 1
"take-off": "takeoff", # 1
"taknemlig": "taknemmelig", # 2
"talehørelærer": "tale-høre-lærer", # 1
"talehøreundervisning": "tale-høre-undervisning", # 1
"tandstik": "tandstikker", # 1
"tao": "dao", # 1
"taoisme": "daoisme", # 1
"taoist": "daoist", # 1
"taoistisk": "daoistisk", # 1
"taverne": "taverna", # 1
"teateret": "teatret", # 2
"tekno": "techno", # 1
"temposkifte": "temposkift", # 1
"terrarium": "terrarie", # 1
"territorium": "territorie", # 1
"tesis": "tese", # 1
"tidsstudium": "tidsstudie", # 1
"tipoldefader": "tipoldefar", # 1
"tipoldemoder": "tipoldemor", # 1
"tomatsauce": "tomatsovs", # 1
"tonart": "toneart", # 1
"trafikministerium": "trafikministerie", # 1
"tredve": "tredive", # 1
"tredver": "trediver", # 1
"tredveårig": "trediveårig", # 1
"tredveårs": "trediveårs", # 1
"tredveårsfødselsdag": "trediveårsfødselsdag", # 1
"tredvte": "tredivte", # 1
"tredvtedel": "tredivtedel", # 1
"troldunge": "troldeunge", # 1
"trommestikke": "trommestik", # 1
"trubadur": "troubadour", # 2
"trøstepræmie": "trøstpræmie", # 2
"tummerum": "trummerum", # 1
"tumultuarisk": "tumultarisk", # 1
"tunghørighed": "tunghørhed", # 1
"tus": "tusch", # 2
"tusind": "tusinde", # 2
"tvillingbroder": "tvillingebror", # 1
"tvillingbror": "tvillingebror", # 1
"tvillingebroder": "tvillingebror", # 1
"ubeheftet": "ubehæftet", # 1
"udenrigsministerium": "udenrigsministerie", # 1
"udhulning": "udhuling", # 1
"udslaggivende": "udslagsgivende", # 1
"udspekulert": "udspekuleret", # 1
"udviklingsministerium": "udviklingsministerie", # 1
"uforpligtigende": "uforpligtende", # 1
"uheldvarslende": "uheldsvarslende", # 1
"uimponerthed": "uimponerethed", # 1
"undervisningsministerium": "undervisningsministerie", # 1
"unægtelig": "unægteligt", # 1
"urinale": "urinal", # 1
"uvederheftig": "uvederhæftig", # 1
"vabel": "vable", # 2
"vadi": "wadi", # 1
"vaklevorn": "vakkelvorn", # 1
"vanadin": "vanadium", # 1
"vaselin": "vaseline", # 1
"vederheftig": "vederhæftig", # 1
"vedhefte": "vedhæfte", # 1
"velar": "velær", # 1
"videndeling": "vidensdeling", # 2
"vinkelanførelsestegn": "vinkelanførselstegn", # 1
"vipstjært": "vipstjert", # 1
"vismut": "bismut", # 1
"visvas": "vissevasse", # 1
"voksværk": "vokseværk", # 1
"værtdyr": "værtsdyr", # 1
"værtplante": "værtsplante", # 1
"wienersnitsel": "wienerschnitzel", # 1
"yderliggående": "yderligtgående", # 2
"zombi": "zombie", # 1
"ægbakke": "æggebakke", # 1
"ægformet": "æggeformet", # 1
"ægleder": "æggeleder", # 1
"ækvilibrist": "ekvilibrist", # 2
"æselsøre": "æseløre", # 1
"øjehule": "øjenhule", # 1
"øjelåg": "øjenlåg", # 1
"øjeåbner": "øjenåbner", # 1
"økonomiministerium": "økonomiministerie", # 1
"ørenring": "ørering", # 2
"øvehefte": "øvehæfte", # 1
}
NORM_EXCEPTIONS = {}
for string, norm in _exc.items():
NORM_EXCEPTIONS[string] = norm
NORM_EXCEPTIONS[string.title()] = norm
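
Note on the loop above: each exception is registered twice, once as written and once title-cased, so that capitalised forms resolve to the same norm. A minimal sketch of that expansion, assuming a three-entry subset of the table; the subset and the `norm` helper are illustrative and not part of the removed module:

```python
# Illustrative subset of the Danish exception table shown above.
_exc = {
    "kricket": "cricket",
    "moder": "mor",
    "log-in": "login",
}

NORM_EXCEPTIONS = {}
for string, norm_value in _exc.items():
    NORM_EXCEPTIONS[string] = norm_value
    # str.title() capitalises the first letter of every alphabetic run,
    # so "log-in" is registered as "Log-In", not "Log-in".
    NORM_EXCEPTIONS[string.title()] = norm_value


def norm(token_text):
    """Hypothetical helper: return the norm, or the text itself if unlisted."""
    return NORM_EXCEPTIONS.get(token_text, token_text)
```

Only the exact and title-cased spellings are covered; an all-caps token such as "KRICKET" falls through unchanged in this sketch.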


@ -6,7 +6,7 @@ Source: https://forkortelse.dk/ and various others.
from __future__ import unicode_literals from __future__ import unicode_literals
from ...symbols import ORTH, LEMMA, NORM, TAG, PUNCT from ...symbols import ORTH, LEMMA, NORM
_exc = {} _exc = {}
@ -52,7 +52,7 @@ for exc_data in [
{ORTH: "Ons.", LEMMA: "onsdag"}, {ORTH: "Ons.", LEMMA: "onsdag"},
{ORTH: "Fre.", LEMMA: "fredag"}, {ORTH: "Fre.", LEMMA: "fredag"},
{ORTH: "Lør.", LEMMA: "lørdag"}, {ORTH: "Lør.", LEMMA: "lørdag"},
{ORTH: "og/eller", LEMMA: "og/eller", NORM: "og/eller", TAG: "CC"}, {ORTH: "og/eller", LEMMA: "og/eller", NORM: "og/eller"},
]: ]:
_exc[exc_data[ORTH]] = [exc_data] _exc[exc_data[ORTH]] = [exc_data]
@ -577,7 +577,7 @@ for h in range(1, 31 + 1):
for period in ["."]: for period in ["."]:
_exc["%d%s" % (h, period)] = [{ORTH: "%d." % h}] _exc["%d%s" % (h, period)] = [{ORTH: "%d." % h}]
_custom_base_exc = {"i.": [{ORTH: "i", LEMMA: "i", NORM: "i"}, {ORTH: ".", TAG: PUNCT}]} _custom_base_exc = {"i.": [{ORTH: "i", LEMMA: "i", NORM: "i"}, {ORTH: "."}]}
_exc.update(_custom_base_exc) _exc.update(_custom_base_exc)
TOKENIZER_EXCEPTIONS = _exc TOKENIZER_EXCEPTIONS = _exc


@ -2,7 +2,6 @@
from __future__ import unicode_literals from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .norm_exceptions import NORM_EXCEPTIONS
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
from .punctuation import TOKENIZER_INFIXES from .punctuation import TOKENIZER_INFIXES
from .tag_map import TAG_MAP from .tag_map import TAG_MAP
@ -10,18 +9,14 @@ from .stop_words import STOP_WORDS
from .syntax_iterators import SYNTAX_ITERATORS from .syntax_iterators import SYNTAX_ITERATORS
from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language from ...language import Language
from ...attrs import LANG, NORM from ...attrs import LANG
from ...util import update_exc, add_lookups from ...util import update_exc
class GermanDefaults(Language.Defaults): class GermanDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: "de" lex_attr_getters[LANG] = lambda text: "de"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
prefixes = TOKENIZER_PREFIXES prefixes = TOKENIZER_PREFIXES
suffixes = TOKENIZER_SUFFIXES suffixes = TOKENIZER_SUFFIXES
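
Note on the removed `add_lookups(...)` call above: it wrapped the default `NORM` getter so that lookup tables were consulted first, in the order given. The sketch below is a simplified stand-in for that behaviour, not the library code. `add_lookups_sketch`, the lower-casing default getter and the `BASE_NORMS` entry are all assumptions; only the "daß" mapping comes from the German exception file removed in this commit.

```python
def add_lookups_sketch(default_func, *lookups):
    """Simplified stand-in for spacy.util.add_lookups (assumed behaviour)."""
    def getter(string):
        # The first table that contains the string wins.
        for table in lookups:
            if string in table:
                return table[string]
        # No table matched: fall back to the default getter.
        return default_func(string)
    return getter


NORM_EXCEPTIONS = {"daß": "dass", "Daß": "dass"}
BASE_NORMS = {"’": "'"}  # hypothetical entry for illustration

get_norm = add_lookups_sketch(lambda s: s.lower(), NORM_EXCEPTIONS, BASE_NORMS)
```

Because the tables are checked in argument order, the German defaults (exceptions before base norms) and the Greek or English defaults (base norms before exceptions) would resolve a string present in both tables differently.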


@ -1,16 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
# Here we only want to include the absolute most common words. Otherwise,
# this list would get impossibly long for German especially considering the
# old vs. new spelling rules, and all possible cases.
_exc = {"daß": "dass"}
NORM_EXCEPTIONS = {}
for string, norm in _exc.items():
NORM_EXCEPTIONS[string] = norm
NORM_EXCEPTIONS[string.title()] = norm


@ -2,9 +2,10 @@
from __future__ import unicode_literals from __future__ import unicode_literals
from ...symbols import NOUN, PROPN, PRON from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors
def noun_chunks(obj): def noun_chunks(doclike):
""" """
Detect base noun phrases from a dependency parse. Works on both Doc and Span. Detect base noun phrases from a dependency parse. Works on both Doc and Span.
""" """
@ -27,13 +28,17 @@ def noun_chunks(obj):
"og", "og",
"app", "app",
] ]
doc = obj.doc # Ensure works on both Doc and Span. doc = doclike.doc # Ensure works on both Doc and Span.
if not doc.is_parsed:
raise ValueError(Errors.E029)
np_label = doc.vocab.strings.add("NP") np_label = doc.vocab.strings.add("NP")
np_deps = set(doc.vocab.strings.add(label) for label in labels) np_deps = set(doc.vocab.strings.add(label) for label in labels)
close_app = doc.vocab.strings.add("nk") close_app = doc.vocab.strings.add("nk")
rbracket = 0 rbracket = 0
for i, word in enumerate(obj): for i, word in enumerate(doclike):
if i < rbracket: if i < rbracket:
continue continue
if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps: if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps:


@ -10,21 +10,16 @@ from .lemmatizer import GreekLemmatizer
from .syntax_iterators import SYNTAX_ITERATORS from .syntax_iterators import SYNTAX_ITERATORS
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..tokenizer_exceptions import BASE_EXCEPTIONS
from .norm_exceptions import NORM_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language from ...language import Language
from ...lookups import Lookups from ...lookups import Lookups
from ...attrs import LANG, NORM from ...attrs import LANG
from ...util import update_exc, add_lookups from ...util import update_exc
class GreekDefaults(Language.Defaults): class GreekDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS) lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: "el" lex_attr_getters[LANG] = lambda text: "el"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS stop_words = STOP_WORDS
tag_map = TAG_MAP tag_map = TAG_MAP

File diff suppressed because it is too large


@ -2,9 +2,10 @@
from __future__ import unicode_literals from __future__ import unicode_literals
from ...symbols import NOUN, PROPN, PRON from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors
def noun_chunks(obj): def noun_chunks(doclike):
""" """
Detect base noun phrases. Works on both Doc and Span. Detect base noun phrases. Works on both Doc and Span.
""" """
@ -13,34 +14,34 @@ def noun_chunks(obj):
# obj tag corrects some DEP tagger mistakes. # obj tag corrects some DEP tagger mistakes.
# Further improvement of the models will eliminate the need for this tag. # Further improvement of the models will eliminate the need for this tag.
labels = ["nsubj", "obj", "iobj", "appos", "ROOT", "obl"] labels = ["nsubj", "obj", "iobj", "appos", "ROOT", "obl"]
doc = obj.doc # Ensure works on both Doc and Span. doc = doclike.doc # Ensure works on both Doc and Span.
if not doc.is_parsed:
raise ValueError(Errors.E029)
np_deps = [doc.vocab.strings.add(label) for label in labels] np_deps = [doc.vocab.strings.add(label) for label in labels]
conj = doc.vocab.strings.add("conj") conj = doc.vocab.strings.add("conj")
nmod = doc.vocab.strings.add("nmod") nmod = doc.vocab.strings.add("nmod")
np_label = doc.vocab.strings.add("NP") np_label = doc.vocab.strings.add("NP")
seen = set() prev_end = -1
for i, word in enumerate(obj): for i, word in enumerate(doclike):
if word.pos not in (NOUN, PROPN, PRON): if word.pos not in (NOUN, PROPN, PRON):
continue continue
# Prevent nested chunks from being produced # Prevent nested chunks from being produced
if word.i in seen: if word.left_edge.i <= prev_end:
continue continue
if word.dep in np_deps: if word.dep in np_deps:
if any(w.i in seen for w in word.subtree):
continue
flag = False flag = False
if word.pos == NOUN: if word.pos == NOUN:
# check for patterns such as γραμμή παραγωγής # check for patterns such as γραμμή παραγωγής
for potential_nmod in word.rights: for potential_nmod in word.rights:
if potential_nmod.dep == nmod: if potential_nmod.dep == nmod:
seen.update( prev_end = potential_nmod.i
j for j in range(word.left_edge.i, potential_nmod.i + 1)
)
yield word.left_edge.i, potential_nmod.i + 1, np_label yield word.left_edge.i, potential_nmod.i + 1, np_label
flag = True flag = True
break break
if flag is False: if flag is False:
seen.update(j for j in range(word.left_edge.i, word.i + 1)) prev_end = word.i
yield word.left_edge.i, word.i + 1, np_label yield word.left_edge.i, word.i + 1, np_label
elif word.dep == conj: elif word.dep == conj:
# covers the case: έχει όμορφα και έξυπνα παιδιά # covers the case: έχει όμορφα και έξυπνα παιδιά
@ -49,9 +50,7 @@ def noun_chunks(obj):
head = head.head head = head.head
# If the head is an NP, and we're coordinated to it, we're an NP # If the head is an NP, and we're coordinated to it, we're an NP
if head.dep in np_deps: if head.dep in np_deps:
if any(w.i in seen for w in word.subtree): prev_end = word.i
continue
seen.update(j for j in range(word.left_edge.i, word.i + 1))
yield word.left_edge.i, word.i + 1, np_label yield word.left_edge.i, word.i + 1, np_label


@ -2,7 +2,6 @@
from __future__ import unicode_literals from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .norm_exceptions import NORM_EXCEPTIONS
from .tag_map import TAG_MAP from .tag_map import TAG_MAP
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
@ -10,10 +9,9 @@ from .morph_rules import MORPH_RULES
from .syntax_iterators import SYNTAX_ITERATORS from .syntax_iterators import SYNTAX_ITERATORS
from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language from ...language import Language
from ...attrs import LANG, NORM from ...attrs import LANG
from ...util import update_exc, add_lookups from ...util import update_exc
def _return_en(_): def _return_en(_):
@ -24,9 +22,6 @@ class EnglishDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS) lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = _return_en lex_attr_getters[LANG] = _return_en
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
tag_map = TAG_MAP tag_map = TAG_MAP
stop_words = STOP_WORDS stop_words = STOP_WORDS

File diff suppressed because it is too large


@ -2,9 +2,10 @@
from __future__ import unicode_literals from __future__ import unicode_literals
from ...symbols import NOUN, PROPN, PRON from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors
def noun_chunks(obj): def noun_chunks(doclike):
""" """
Detect base noun phrases from a dependency parse. Works on both Doc and Span. Detect base noun phrases from a dependency parse. Works on both Doc and Span.
""" """
@ -19,21 +20,23 @@ def noun_chunks(obj):
"attr", "attr",
"ROOT", "ROOT",
] ]
doc = obj.doc # Ensure works on both Doc and Span. doc = doclike.doc # Ensure works on both Doc and Span.
if not doc.is_parsed:
raise ValueError(Errors.E029)
np_deps = [doc.vocab.strings.add(label) for label in labels] np_deps = [doc.vocab.strings.add(label) for label in labels]
conj = doc.vocab.strings.add("conj") conj = doc.vocab.strings.add("conj")
np_label = doc.vocab.strings.add("NP") np_label = doc.vocab.strings.add("NP")
seen = set() prev_end = -1
for i, word in enumerate(obj): for i, word in enumerate(doclike):
if word.pos not in (NOUN, PROPN, PRON): if word.pos not in (NOUN, PROPN, PRON):
continue continue
# Prevent nested chunks from being produced # Prevent nested chunks from being produced
if word.i in seen: if word.left_edge.i <= prev_end:
continue continue
if word.dep in np_deps: if word.dep in np_deps:
if any(w.i in seen for w in word.subtree): prev_end = word.i
continue
seen.update(j for j in range(word.left_edge.i, word.i + 1))
yield word.left_edge.i, word.i + 1, np_label yield word.left_edge.i, word.i + 1, np_label
elif word.dep == conj: elif word.dep == conj:
head = word.head head = word.head
@ -41,9 +44,7 @@ def noun_chunks(obj):
head = head.head head = head.head
# If the head is an NP, and we're coordinated to it, we're an NP # If the head is an NP, and we're coordinated to it, we're an NP
if head.dep in np_deps: if head.dep in np_deps:
if any(w.i in seen for w in word.subtree): prev_end = word.i
continue
seen.update(j for j in range(word.left_edge.i, word.i + 1))
yield word.left_edge.i, word.i + 1, np_label yield word.left_edge.i, word.i + 1, np_label
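
Note on the `seen` to `prev_end` change above: because tokens are visited in document order, a single integer holding the last token of the previous chunk is enough to reject any candidate whose left edge falls inside it. A minimal sketch of that check on plain index pairs, with no spaCy objects; `non_overlapping` and the sample `candidates` are hypothetical names for illustration:

```python
def non_overlapping(candidates):
    """Yield (start, end) spans, skipping any that begin inside an earlier one.

    `candidates` are (left_edge, last_token) index pairs in token order,
    both inclusive, mirroring word.left_edge.i and word.i in the diff.
    """
    prev_end = -1
    for left_edge, last in candidates:
        # Prevent nested chunks: the candidate starts at or before the
        # end of the chunk that was emitted last.
        if left_edge <= prev_end:
            continue
        prev_end = last
        yield left_edge, last + 1  # end-exclusive, like the yield in the diff


candidates = [(0, 1), (1, 3), (4, 6), (5, 6), (7, 7)]
spans = list(non_overlapping(candidates))
```

Here `spans` is `[(0, 2), (4, 7), (7, 8)]`. The sketch relies on candidates arriving in left-to-right order; with unordered input a set of seen indices, as in the old code, would still be needed.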


@ -77,12 +77,12 @@ for pron in ["i", "you", "he", "she", "it", "we", "they"]:
_exc[orth + "'d"] = [ _exc[orth + "'d"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", NORM: "would", TAG: "MD"}, {ORTH: "'d", NORM: "'d"},
] ]
_exc[orth + "d"] = [ _exc[orth + "d"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", NORM: "would", TAG: "MD"}, {ORTH: "d", NORM: "'d"},
] ]
_exc[orth + "'d've"] = [ _exc[orth + "'d've"] = [
@ -195,7 +195,10 @@ for word in ["who", "what", "when", "where", "why", "how", "there", "that"]:
{ORTH: "'d", NORM: "'d"}, {ORTH: "'d", NORM: "'d"},
] ]
_exc[orth + "d"] = [{ORTH: orth, LEMMA: word, NORM: word}, {ORTH: "d"}] _exc[orth + "d"] = [
{ORTH: orth, LEMMA: word, NORM: word},
{ORTH: "d", NORM: "'d"},
]
_exc[orth + "'d've"] = [ _exc[orth + "'d've"] = [
{ORTH: orth, LEMMA: word, NORM: word}, {ORTH: orth, LEMMA: word, NORM: word},


@ -5,7 +5,6 @@ from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES
from ..char_classes import LIST_ICONS, CURRENCY, LIST_UNITS, PUNCT from ..char_classes import LIST_ICONS, CURRENCY, LIST_UNITS, PUNCT
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA
from ..char_classes import merge_chars from ..char_classes import merge_chars
from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES
_list_units = [u for u in LIST_UNITS if u != "%"] _list_units = [u for u in LIST_UNITS if u != "%"]


@ -2,10 +2,15 @@
from __future__ import unicode_literals from __future__ import unicode_literals
from ...symbols import NOUN, PROPN, PRON, VERB, AUX from ...symbols import NOUN, PROPN, PRON, VERB, AUX
from ...errors import Errors
def noun_chunks(obj): def noun_chunks(doclike):
doc = obj.doc doc = doclike.doc
if not doc.is_parsed:
raise ValueError(Errors.E029)
if not len(doc): if not len(doc):
return return
np_label = doc.vocab.strings.add("NP") np_label = doc.vocab.strings.add("NP")
@ -16,7 +21,7 @@ def noun_chunks(obj):
np_right_deps = [doc.vocab.strings.add(label) for label in right_labels] np_right_deps = [doc.vocab.strings.add(label) for label in right_labels]
stop_deps = [doc.vocab.strings.add(label) for label in stop_labels] stop_deps = [doc.vocab.strings.add(label) for label in stop_labels]
token = doc[0] token = doc[0]
while token and token.i < len(doc): while token and token.i < len(doclike):
if token.pos in [PROPN, NOUN, PRON]: if token.pos in [PROPN, NOUN, PRON]:
left, right = noun_bounds( left, right = noun_bounds(
doc, token, np_left_deps, np_right_deps, stop_deps doc, token, np_left_deps, np_right_deps, stop_deps


@ -10,6 +10,7 @@ from .lex_attrs import LEX_ATTRS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .tag_map import TAG_MAP from .tag_map import TAG_MAP
from .punctuation import TOKENIZER_SUFFIXES from .punctuation import TOKENIZER_SUFFIXES
from .syntax_iterators import SYNTAX_ITERATORS
class PersianDefaults(Language.Defaults): class PersianDefaults(Language.Defaults):
@ -24,6 +25,7 @@ class PersianDefaults(Language.Defaults):
tag_map = TAG_MAP tag_map = TAG_MAP
suffixes = TOKENIZER_SUFFIXES suffixes = TOKENIZER_SUFFIXES
writing_system = {"direction": "rtl", "has_case": False, "has_letters": True} writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}
syntax_iterators = SYNTAX_ITERATORS
class Persian(Language): class Persian(Language):


@ -2,9 +2,10 @@
from __future__ import unicode_literals from __future__ import unicode_literals
from ...symbols import NOUN, PROPN, PRON from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors
def noun_chunks(obj): def noun_chunks(doclike):
""" """
Detect base noun phrases from a dependency parse. Works on both Doc and Span. Detect base noun phrases from a dependency parse. Works on both Doc and Span.
""" """
@ -19,21 +20,23 @@ def noun_chunks(obj):
"attr", "attr",
"ROOT", "ROOT",
] ]
doc = obj.doc # Ensure works on both Doc and Span. doc = doclike.doc # Ensure works on both Doc and Span.
if not doc.is_parsed:
raise ValueError(Errors.E029)
np_deps = [doc.vocab.strings.add(label) for label in labels] np_deps = [doc.vocab.strings.add(label) for label in labels]
conj = doc.vocab.strings.add("conj") conj = doc.vocab.strings.add("conj")
np_label = doc.vocab.strings.add("NP") np_label = doc.vocab.strings.add("NP")
seen = set() prev_end = -1
for i, word in enumerate(obj): for i, word in enumerate(doclike):
if word.pos not in (NOUN, PROPN, PRON): if word.pos not in (NOUN, PROPN, PRON):
continue continue
# Prevent nested chunks from being produced # Prevent nested chunks from being produced
if word.i in seen: if word.left_edge.i <= prev_end:
continue continue
if word.dep in np_deps: if word.dep in np_deps:
if any(w.i in seen for w in word.subtree): prev_end = word.i
continue
seen.update(j for j in range(word.left_edge.i, word.i + 1))
yield word.left_edge.i, word.i + 1, np_label yield word.left_edge.i, word.i + 1, np_label
elif word.dep == conj: elif word.dep == conj:
head = word.head head = word.head
@ -41,9 +44,7 @@ def noun_chunks(obj):
head = head.head head = head.head
# If the head is an NP, and we're coordinated to it, we're an NP # If the head is an NP, and we're coordinated to it, we're an NP
if head.dep in np_deps: if head.dep in np_deps:
if any(w.i in seen for w in word.subtree): prev_end = word.i
continue
seen.update(j for j in range(word.left_edge.i, word.i + 1))
yield word.left_edge.i, word.i + 1, np_label yield word.left_edge.i, word.i + 1, np_label


@ -2,9 +2,10 @@
from __future__ import unicode_literals from __future__ import unicode_literals
from ...symbols import NOUN, PROPN, PRON from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors
def noun_chunks(obj): def noun_chunks(doclike):
""" """
Detect base noun phrases from a dependency parse. Works on both Doc and Span. Detect base noun phrases from a dependency parse. Works on both Doc and Span.
""" """
@ -18,21 +19,23 @@ def noun_chunks(obj):
"nmod", "nmod",
"nmod:poss", "nmod:poss",
] ]
doc = obj.doc # Ensure works on both Doc and Span. doc = doclike.doc # Ensure works on both Doc and Span.
if not doc.is_parsed:
raise ValueError(Errors.E029)
np_deps = [doc.vocab.strings[label] for label in labels] np_deps = [doc.vocab.strings[label] for label in labels]
conj = doc.vocab.strings.add("conj") conj = doc.vocab.strings.add("conj")
np_label = doc.vocab.strings.add("NP") np_label = doc.vocab.strings.add("NP")
seen = set() prev_end = -1
for i, word in enumerate(obj): for i, word in enumerate(doclike):
if word.pos not in (NOUN, PROPN, PRON): if word.pos not in (NOUN, PROPN, PRON):
continue continue
# Prevent nested chunks from being produced # Prevent nested chunks from being produced
if word.i in seen: if word.left_edge.i <= prev_end:
continue continue
if word.dep in np_deps: if word.dep in np_deps:
if any(w.i in seen for w in word.subtree): prev_end = word.right_edge.i
continue
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
yield word.left_edge.i, word.right_edge.i + 1, np_label yield word.left_edge.i, word.right_edge.i + 1, np_label
elif word.dep == conj: elif word.dep == conj:
head = word.head head = word.head
@ -40,9 +43,7 @@ def noun_chunks(obj):
head = head.head head = head.head
# If the head is an NP, and we're coordinated to it, we're an NP # If the head is an NP, and we're coordinated to it, we're an NP
if head.dep in np_deps: if head.dep in np_deps:
if any(w.i in seen for w in word.subtree): prev_end = word.right_edge.i
continue
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
yield word.left_edge.i, word.right_edge.i + 1, np_label yield word.left_edge.i, word.right_edge.i + 1, np_label


@ -1,11 +1,12 @@
# coding: utf8
from __future__ import unicode_literals
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
from .tag_map import TAG_MAP from .tag_map import TAG_MAP
from ...attrs import LANG from ...attrs import LANG
from ...language import Language from ...language import Language
from ...tokens import Doc
class ArmenianDefaults(Language.Defaults): class ArmenianDefaults(Language.Defaults):


@ -1,6 +1,6 @@
# coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.
>>> from spacy.lang.hy.examples import sentences >>> from spacy.lang.hy.examples import sentences


@ -1,3 +1,4 @@
# coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
from ...attrs import LIKE_NUM from ...attrs import LIKE_NUM


@ -1,6 +1,6 @@
# coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
STOP_WORDS = set( STOP_WORDS = set(
""" """
նա նա


@ -1,7 +1,7 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
from ...symbols import POS, SYM, ADJ, NUM, DET, ADV, ADP, X, VERB, NOUN from ...symbols import POS, ADJ, NUM, DET, ADV, ADP, X, VERB, NOUN
from ...symbols import PROPN, PART, INTJ, PRON, SCONJ, AUX, CCONJ from ...symbols import PROPN, PART, INTJ, PRON, SCONJ, AUX, CCONJ
TAG_MAP = { TAG_MAP = {
@ -716,7 +716,7 @@ TAG_MAP = {
POS: NOUN, POS: NOUN,
"Animacy": "Nhum", "Animacy": "Nhum",
"Case": "Dat", "Case": "Dat",
"Number": "Coll", # "Number": "Coll",
"Number": "Sing", "Number": "Sing",
"Person": "1", "Person": "1",
}, },
@ -815,7 +815,7 @@ TAG_MAP = {
"Animacy": "Nhum", "Animacy": "Nhum",
"Case": "Nom", "Case": "Nom",
"Definite": "Def", "Definite": "Def",
"Number": "Plur", # "Number": "Plur",
"Number": "Sing", "Number": "Sing",
"Poss": "Yes", "Poss": "Yes",
}, },
@ -880,7 +880,7 @@ TAG_MAP = {
POS: NOUN, POS: NOUN,
"Animacy": "Nhum", "Animacy": "Nhum",
"Case": "Nom", "Case": "Nom",
"Number": "Plur", # "Number": "Plur",
"Number": "Sing", "Number": "Sing",
"Person": "2", "Person": "2",
}, },
@ -1223,9 +1223,9 @@ TAG_MAP = {
"PRON_Case=Nom|Number=Sing|Number=Plur|Person=3|Person=1|PronType=Emp": { "PRON_Case=Nom|Number=Sing|Number=Plur|Person=3|Person=1|PronType=Emp": {
POS: PRON, POS: PRON,
"Case": "Nom", "Case": "Nom",
"Number": "Sing", # "Number": "Sing",
"Number": "Plur", "Number": "Plur",
"Person": "3", # "Person": "3",
"Person": "1", "Person": "1",
"PronType": "Emp", "PronType": "Emp",
}, },


@ -4,25 +4,20 @@ from __future__ import unicode_literals
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .punctuation import TOKENIZER_SUFFIXES, TOKENIZER_PREFIXES, TOKENIZER_INFIXES from .punctuation import TOKENIZER_SUFFIXES, TOKENIZER_PREFIXES, TOKENIZER_INFIXES
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .norm_exceptions import NORM_EXCEPTIONS
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
from .syntax_iterators import SYNTAX_ITERATORS from .syntax_iterators import SYNTAX_ITERATORS
from .tag_map import TAG_MAP from .tag_map import TAG_MAP
from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language from ...language import Language
from ...attrs import LANG, NORM from ...attrs import LANG
from ...util import update_exc, add_lookups from ...util import update_exc
class IndonesianDefaults(Language.Defaults): class IndonesianDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: "id" lex_attr_getters[LANG] = lambda text: "id"
lex_attr_getters.update(LEX_ATTRS) lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS stop_words = STOP_WORDS
prefixes = TOKENIZER_PREFIXES prefixes = TOKENIZER_PREFIXES


@ -1,532 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
# List of frequently misspelled words
# https://id.wikipedia.org/wiki/Wikipedia:Daftar_kosakata_bahasa_Indonesia_yang_sering_salah_dieja
_exc = {
# Slang and abbreviations
"silahkan": "silakan",
"yg": "yang",
"kalo": "kalau",
"cawu": "caturwulan",
"ok": "oke",
"gak": "tidak",
"enggak": "tidak",
"nggak": "tidak",
"ndak": "tidak",
"ngga": "tidak",
"dgn": "dengan",
"tdk": "tidak",
"jg": "juga",
"klo": "kalau",
"denger": "dengar",
"pinter": "pintar",
"krn": "karena",
"nemuin": "menemukan",
"jgn": "jangan",
"udah": "sudah",
"sy": "saya",
"udh": "sudah",
"dapetin": "mendapatkan",
"ngelakuin": "melakukan",
"ngebuat": "membuat",
"membikin": "membuat",
"bikin": "buat",
# List of frequently misspelled words
"malpraktik": "malapraktik",
"malfungsi": "malafungsi",
"malserap": "malaserap",
"maladaptasi": "malaadaptasi",
"malsuai": "malasuai",
"maldistribusi": "maladistribusi",
"malgizi": "malagizi",
"malsikap": "malasikap",
"memperhatikan": "memerhatikan",
"akte": "akta",
"cemilan": "camilan",
"esei": "esai",
"frase": "frasa",
"kafeteria": "kafetaria",
"ketapel": "katapel",
"kenderaan": "kendaraan",
"menejemen": "manajemen",
"menejer": "manajer",
"mesjid": "masjid",
"rebo": "rabu",
"seksama": "saksama",
"senggama": "sanggama",
"sekedar": "sekadar",
"seprei": "seprai",
"semedi": "semadi",
"samadi": "semadi",
"amandemen": "amendemen",
"algoritma": "algoritme",
"aritmatika": "aritmetika",
"metoda": "metode",
"materai": "meterai",
"meterei": "meterai",
"kalendar": "kalender",
"kadaluwarsa": "kedaluwarsa",
"katagori": "kategori",
"parlamen": "parlemen",
"sekular": "sekuler",
"selular": "seluler",
"sirkular": "sirkuler",
"survai": "survei",
"survey": "survei",
"aktuil": "aktual",
"formil": "formal",
"trotoir": "trotoar",
"komersiil": "komersial",
"komersil": "komersial",
"tradisionil": "tradisionial",
"orisinil": "orisinal",
"orijinil": "orisinal",
"afdol": "afdal",
"antri": "antre",
"apotik": "apotek",
"atlit": "atlet",
"atmosfir": "atmosfer",
"cidera": "cedera",
"cendikiawan": "cendekiawan",
"cepet": "cepat",
"cinderamata": "cenderamata",
"debet": "debit",
"difinisi": "definisi",
"dekrit": "dekret",
"disain": "desain",
"diskripsi": "deskripsi",
"diskotik": "diskotek",
"eksim": "eksem",
"exim": "eksem",
"faidah": "faedah",
"ekstrim": "ekstrem",
"ekstrimis": "ekstremis",
"komplit": "komplet",
"konkrit": "konkret",
"kongkrit": "konkret",
"kongkret": "konkret",
"kridit": "kredit",
"musium": "museum",
"pinalti": "penalti",
"piranti": "peranti",
"pinsil": "pensil",
"personil": "personel",
"sistim": "sistem",
"teoritis": "teoretis",
"vidio": "video",
"cengkeh": "cengkih",
"desertasi": "disertasi",
"hakekat": "hakikat",
"intelejen": "intelijen",
"kaedah": "kaidah",
"kempes": "kempis",
"kementrian": "kementerian",
"ledeng": "leding",
"nasehat": "nasihat",
"penasehat": "penasihat",
"praktek": "praktik",
"praktekum": "praktikum",
"resiko": "risiko",
"retsleting": "ritsleting",
"senen": "senin",
"amuba": "ameba",
"punggawa": "penggawa",
"surban": "serban",
"nomer": "nomor",
"sorban": "serban",
"bis": "bus",
"agribisnis": "agrobisnis",
"kantung": "kantong",
"khutbah": "khotbah",
"mandur": "mandor",
"rubuh": "roboh",
"pastur": "pastor",
"supir": "sopir",
"goncang": "guncang",
"goa": "gua",
"kaos": "kaus",
"kokoh": "kukuh",
"komulatif": "kumulatif",
"kolomnis": "kolumnis",
"korma": "kurma",
"lobang": "lubang",
"limo": "limusin",
"limosin": "limusin",
"mangkok": "mangkuk",
"saos": "saus",
"sop": "sup",
"sorga": "surga",
"tegor": "tegur",
"telor": "telur",
"obrak-abrik": "ubrak-abrik",
"ekwivalen": "ekuivalen",
"frekwensi": "frekuensi",
"konsekwensi": "konsekuensi",
"kwadran": "kuadran",
"kwadrat": "kuadrat",
"kwalifikasi": "kualifikasi",
"kwalitas": "kualitas",
"kwalitet": "kualitas",
"kwalitatif": "kualitatif",
"kwantitas": "kuantitas",
"kwantitatif": "kuantitatif",
"kwantum": "kuantum",
"kwartal": "kuartal",
"kwintal": "kuintal",
"kwitansi": "kuitansi",
"kwatir": "khawatir",
"kuatir": "khawatir",
"jadual": "jadwal",
"hirarki": "hierarki",
"karir": "karier",
"aktip": "aktif",
"daptar": "daftar",
"efektip": "efektif",
"epektif": "efektif",
"epektip": "efektif",
"Pebruari": "Februari",
"pisik": "fisik",
"pondasi": "fondasi",
"photo": "foto",
"photokopi": "fotokopi",
"hapal": "hafal",
"insap": "insaf",
"insyaf": "insaf",
"konperensi": "konferensi",
"kreatip": "kreatif",
"kreativ": "kreatif",
"maap": "maaf",
"napsu": "nafsu",
"negatip": "negatif",
"negativ": "negatif",
"objektip": "objektif",
"obyektip": "objektif",
"obyektif": "objektif",
"pasip": "pasif",
"pasiv": "pasif",
"positip": "positif",
"positiv": "positif",
"produktip": "produktif",
"produktiv": "produktif",
"sarap": "saraf",
"sertipikat": "sertifikat",
"subjektip": "subjektif",
"subyektip": "subjektif",
"subyektif": "subjektif",
"tarip": "tarif",
"transitip": "transitif",
"transitiv": "transitif",
"faham": "paham",
"fikir": "pikir",
"berfikir": "berpikir",
"telefon": "telepon",
"telfon": "telepon",
"telpon": "telepon",
"tilpon": "telepon",
"nafas": "napas",
"bernafas": "bernapas",
"pernafasan": "pernapasan",
"vermak": "permak",
"vulpen": "pulpen",
"aktifis": "aktivis",
"konfeksi": "konveksi",
"motifasi": "motivasi",
"Nopember": "November",
"propinsi": "provinsi",
"babtis": "baptis",
"jerembab": "jerembap",
"lembab": "lembap",
"sembab": "sembap",
"saptu": "sabtu",
"tekat": "tekad",
"bejad": "bejat",
"nekad": "nekat",
"otoped": "otopet",
"skuad": "skuat",
"jenius": "genius",
"marjin": "margin",
"marjinal": "marginal",
"obyek": "objek",
"subyek": "subjek",
"projek": "proyek",
"azas": "asas",
"ijasah": "ijazah",
"jenasah": "jenazah",
"plasa": "plaza",
"bathin": "batin",
"Katholik": "Katolik",
"orthografi": "ortografi",
"pathogen": "patogen",
"theologi": "teologi",
"ijin": "izin",
"rejeki": "rezeki",
"rejim": "rezim",
"jaman": "zaman",
"jamrud": "zamrud",
"jinah": "zina",
"perjinahan": "perzinaan",
"anugrah": "anugerah",
"cendrawasih": "cenderawasih",
"jendral": "jenderal",
"kripik": "keripik",
"krupuk": "kerupuk",
"ksatria": "kesatria",
"mentri": "menteri",
"negri": "negeri",
"Prancis": "Perancis",
"sebrang": "seberang",
"menyebrang": "menyeberang",
"Sumatra": "Sumatera",
"trampil": "terampil",
"isteri": "istri",
"justeru": "justru",
"perajurit": "prajurit",
"putera": "putra",
"puteri": "putri",
"samudera": "samudra",
"sastera": "sastra",
"sutera": "sutra",
"terompet": "trompet",
"iklas": "ikhlas",
"iktisar": "ikhtisar",
"kafilah": "khafilah",
"kawatir": "khawatir",
"kotbah": "khotbah",
"kusyuk": "khusyuk",
"makluk": "makhluk",
"mahluk": "makhluk",
"mahkluk": "makhluk",
"nahkoda": "nakhoda",
"nakoda": "nakhoda",
"tahta": "takhta",
"takhyul": "takhayul",
"tahyul": "takhayul",
"tahayul": "takhayul",
"akhli": "ahli",
"anarkhi": "anarki",
"kharisma": "karisma",
"kharismatik": "karismatik",
"mahsud": "maksud",
"makhsud": "maksud",
"rakhmat": "rahmat",
"tekhnik": "teknik",
"tehnik": "teknik",
"tehnologi": "teknologi",
"ikhwal": "ihwal",
"expor": "ekspor",
"extra": "ekstra",
"komplex": "komplek",
"sex": "seks",
"taxi": "taksi",
"extasi": "ekstasi",
"syaraf": "saraf",
"syurga": "surga",
"mashur": "masyhur",
"masyur": "masyhur",
"mahsyur": "masyhur",
"mashyur": "masyhur",
"muadzin": "muazin",
"adzan": "azan",
"ustadz": "ustaz",
"ustad": "ustaz",
"ustadzah": "ustaz",
"dzikir": "zikir",
"dzuhur": "zuhur",
"dhuhur": "zuhur",
"zhuhur": "zuhur",
"analisa": "analisis",
"diagnosa": "diagnosis",
"hipotesa": "hipotesis",
"sintesa": "sintesis",
"aktiviti": "aktivitas",
"aktifitas": "aktivitas",
"efektifitas": "efektivitas",
"komuniti": "komunitas",
"kreatifitas": "kreativitas",
"produktifitas": "produktivitas",
"realiti": "realitas",
"realita": "realitas",
"selebriti": "selebritas",
"spotifitas": "sportivitas",
"universiti": "universitas",
"utiliti": "utilitas",
"validiti": "validitas",
"dilokalisir": "dilokalisasi",
"didramatisir": "didramatisasi",
"dipolitisir": "dipolitisasi",
"dinetralisir": "dinetralisasi",
"dikonfrontir": "dikonfrontasi",
"mendominir": "mendominasi",
"koordinir": "koordinasi",
"proklamir": "proklamasi",
"terorganisir": "terorganisasi",
"terealisir": "terealisasi",
"robah": "ubah",
"dirubah": "diubah",
"merubah": "mengubah",
"terlanjur": "telanjur",
"terlantar": "telantar",
"penglepasan": "pelepasan",
"pelihatan": "penglihatan",
"pemukiman": "permukiman",
"pengrumahan": "perumahan",
"penyewaan": "persewaan",
"menyintai": "mencintai",
"menyolok": "mencolok",
"contek": "sontek",
"mencontek": "menyontek",
"pungkir": "mungkir",
"dipungkiri": "dimungkiri",
"kupungkiri": "kumungkiri",
"kaupungkiri": "kaumungkiri",
"nampak": "tampak",
"nampaknya": "tampaknya",
"nongkrong": "tongkrong",
"berternak": "beternak",
"berterbangan": "beterbangan",
"berserta": "beserta",
"berperkara": "beperkara",
"berpergian": "bepergian",
"berkerja": "bekerja",
"berberapa": "beberapa",
"terbersit": "tebersit",
"terpercaya": "tepercaya",
"terperdaya": "teperdaya",
"terpercik": "tepercik",
"terpergok": "tepergok",
"aksesoris": "aksesori",
"handal": "andal",
"hantar": "antar",
"panutan": "anutan",
"atsiri": "asiri",
"bhakti": "bakti",
"china": "cina",
"dharma": "darma",
"diktaktor": "diktator",
"eksport": "ekspor",
"hembus": "embus",
"hadits": "hadis",
"hadist": "hadits",
"harafiah": "harfiah",
"himbau": "imbau",
"import": "impor",
"inget": "ingat",
"hisap": "isap",
"interprestasi": "interpretasi",
"kangker": "kanker",
"konggres": "kongres",
"lansekap": "lanskap",
"maghrib": "magrib",
"emak": "mak",
"moderen": "modern",
"pasport": "paspor",
"perduli": "peduli",
"ramadhan": "ramadan",
"rapih": "rapi",
"Sansekerta": "Sanskerta",
"shalat": "salat",
"sholat": "salat",
"silahkan": "silakan",
"standard": "standar",
"hutang": "utang",
"zinah": "zina",
"ambulan": "ambulans",
"antartika": "sntarktika",
"arteri": "arteria",
"asik": "asyik",
"australi": "australia",
"denga": "dengan",
"depo": "depot",
"detil": "detail",
"ensiklopedi": "ensiklopedia",
"elit": "elite",
"frustasi": "frustrasi",
"gladi": "geladi",
"greget": "gereget",
"itali": "italia",
"karna": "karena",
"klenteng": "kelenteng",
"erling": "kerling",
"kontruksi": "konstruksi",
"masal": "massal",
"merk": "merek",
"respon": "respons",
"diresponi": "direspons",
"skak": "sekak",
"stir": "setir",
"singapur": "singapura",
"standarisasi": "standardisasi",
"varitas": "varietas",
"amphibi": "amfibi",
"anjlog": "anjlok",
"alpukat": "avokad",
"alpokat": "avokad",
"bolpen": "pulpen",
"cabe": "cabai",
"cabay": "cabai",
"ceret": "cerek",
"differensial": "diferensial",
"duren": "durian",
"faksimili": "faksimile",
"faksimil": "faksimile",
"graha": "gerha",
"goblog": "goblok",
"gombrong": "gombroh",
"horden": "gorden",
"korden": "gorden",
"gubug": "gubuk",
"imaginasi": "imajinasi",
"jerigen": "jeriken",
"jirigen": "jeriken",
"carut-marut": "karut-marut",
"kwota": "kuota",
"mahzab": "mazhab",
"mempesona": "memesona",
"milyar": "miliar",
"missi": "misi",
"nenas": "nanas",
"negoisasi": "negosiasi",
"automotif": "otomotif",
"pararel": "paralel",
"paska": "pasca",
"prosen": "persen",
"pete": "petai",
"petay": "petai",
"proffesor": "profesor",
"rame": "ramai",
"rapot": "rapor",
"rileks": "relaks",
"rileksasi": "relaksasi",
"renumerasi": "remunerasi",
"seketaris": "sekretaris",
"sekertaris": "sekretaris",
"sensorik": "sensoris",
"sentausa": "sentosa",
"strawberi": "stroberi",
"strawbery": "stroberi",
"taqwa": "takwa",
"tauco": "taoco",
"tauge": "taoge",
"toge": "taoge",
"tauladan": "teladan",
"taubat": "tobat",
"trilyun": "triliun",
"vissi": "visi",
"coklat": "cokelat",
"narkotika": "narkotik",
"oase": "oasis",
"politisi": "politikus",
"terong": "terung",
"wool": "wol",
"himpit": "impit",
"mujizat": "mukjizat",
"mujijat": "mukjizat",
"yag": "yang",
}
NORM_EXCEPTIONS = {}
for string, norm in _exc.items():
NORM_EXCEPTIONS[string] = norm
NORM_EXCEPTIONS[string.title()] = norm
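For context, the removed module above registered every entry twice, once as written and once title-cased, so that sentence-initial spellings normalise to the same lower-case form. A minimal sketch of that pattern follows; the helper name `build_norm_exceptions` and the two-entry sample table are illustrative assumptions, not part of the original file.

```python
def build_norm_exceptions(exc):
    """Return a lookup table with each key in its given and title-cased form.

    Hypothetical helper for illustration; the removed module did this inline.
    """
    table = {}
    for string, norm in exc.items():
        table[string] = norm
        # "yg" -> "Yg": the title-cased key maps to the same norm
        table[string.title()] = norm
    return table


# Tiny sample taken from the slang entries in the removed table
_sample = {"yg": "yang", "kalo": "kalau"}
NORM_EXCEPTIONS = build_norm_exceptions(_sample)
```

Note that the norm itself stays lower-case, so `"Yg"` and `"yg"` both resolve to `"yang"`.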


@ -2,9 +2,10 @@
from __future__ import unicode_literals from __future__ import unicode_literals
from ...symbols import NOUN, PROPN, PRON from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors
def noun_chunks(obj): def noun_chunks(doclike):
""" """
Detect base noun phrases from a dependency parse. Works on both Doc and Span. Detect base noun phrases from a dependency parse. Works on both Doc and Span.
""" """
@ -18,21 +19,23 @@ def noun_chunks(obj):
"nmod", "nmod",
"nmod:poss", "nmod:poss",
] ]
doc = obj.doc # Ensure works on both Doc and Span. doc = doclike.doc # Ensure works on both Doc and Span.
if not doc.is_parsed:
raise ValueError(Errors.E029)
np_deps = [doc.vocab.strings[label] for label in labels] np_deps = [doc.vocab.strings[label] for label in labels]
conj = doc.vocab.strings.add("conj") conj = doc.vocab.strings.add("conj")
np_label = doc.vocab.strings.add("NP") np_label = doc.vocab.strings.add("NP")
seen = set() prev_end = -1
for i, word in enumerate(obj): for i, word in enumerate(doclike):
if word.pos not in (NOUN, PROPN, PRON): if word.pos not in (NOUN, PROPN, PRON):
continue continue
# Prevent nested chunks from being produced # Prevent nested chunks from being produced
if word.i in seen: if word.left_edge.i <= prev_end:
continue continue
if word.dep in np_deps: if word.dep in np_deps:
if any(w.i in seen for w in word.subtree): prev_end = word.right_edge.i
continue
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
yield word.left_edge.i, word.right_edge.i + 1, np_label yield word.left_edge.i, word.right_edge.i + 1, np_label
elif word.dep == conj: elif word.dep == conj:
head = word.head head = word.head
@ -40,9 +43,7 @@ def noun_chunks(obj):
head = head.head head = head.head
# If the head is an NP, and we're coordinated to it, we're an NP # If the head is an NP, and we're coordinated to it, we're an NP
if head.dep in np_deps: if head.dep in np_deps:
if any(w.i in seen for w in word.subtree): prev_end = word.right_edge.i
continue
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
yield word.left_edge.i, word.right_edge.i + 1, np_label yield word.left_edge.i, word.right_edge.i + 1, np_label
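The hunk above replaces the `seen` set with a single `prev_end` index: a candidate is skipped whenever its left edge does not lie strictly after the end of the last chunk that was emitted. The sketch below isolates that check on plain index pairs. It assumes every candidate would otherwise qualify as a noun phrase and that candidates arrive in token order; the function name `filter_nested` and the sample spans are hypothetical.

```python
def filter_nested(candidates):
    """Yield (start, end_exclusive) spans, dropping any that overlap the previous one.

    `candidates` holds (left_edge, right_edge) pairs with inclusive token
    indices, mirroring word.left_edge.i and word.right_edge.i in the diff.
    """
    prev_end = -1
    for left, right in candidates:
        # Same test as `word.left_edge.i <= prev_end` in the new code
        if left <= prev_end:
            continue
        prev_end = right
        yield left, right + 1


spans = list(filter_nested([(0, 2), (1, 1), (3, 5), (5, 6), (7, 7)]))
```

Compared with the set-based version, this keeps constant extra state per call, at the cost of relying on candidates being visited left to right.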


@ -9,8 +9,8 @@ Example sentences to test spaCy and its language models.
""" """
sentences = [ sentences = [
"애플이 영국의 신생 기업을 10억 달러에 구매를 고려중이다.", "애플이 영국의 스타트업을 10억 달러에 인수하는 것을 알아보고 있다.",
"자동 운전 자동차의 손해 배상 책임에 자동차 메이커에 일정한 부담을 요구하겠다.", "자율주행 자동차의 손해 배상 책임이 제조 업체로 옮겨 가다",
"자동 배달 로봇이 보도를 주행하는 것을 샌프란시스코시가 금지를 검토중이라고 합니다.", "샌프란시스코 시가 자동 배달 로봇의 보도 주행 금지를 검토 중이라고 합니다.",
"런던은 영국의 수도이자 가장 큰 도시입니다.", "런던은 영국의 수도이자 가장 큰 도시입니다.",
] ]


@ -2,26 +2,21 @@
from __future__ import unicode_literals from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .norm_exceptions import NORM_EXCEPTIONS
from .punctuation import TOKENIZER_INFIXES from .punctuation import TOKENIZER_INFIXES
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
from .tag_map import TAG_MAP from .tag_map import TAG_MAP
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language from ...language import Language
from ...attrs import LANG, NORM from ...attrs import LANG
from ...util import update_exc, add_lookups from ...util import update_exc
class LuxembourgishDefaults(Language.Defaults): class LuxembourgishDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS) lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: "lb" lex_attr_getters[LANG] = lambda text: "lb"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS stop_words = STOP_WORDS
tag_map = TAG_MAP tag_map = TAG_MAP


@ -1,16 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
# TODO
# norm exceptions: find a possibility to deal with the zillions of spelling
# variants (vläicht = vlaicht, vleicht, viläicht, viläischt, etc. etc.)
# here one could include the most common spelling mistakes
_exc = {"dass": "datt", "viläicht": "vläicht"}
NORM_EXCEPTIONS = {}
for string, norm in _exc.items():
NORM_EXCEPTIONS[string] = norm
NORM_EXCEPTIONS[string.title()] = norm


@ -186,10 +186,6 @@ def suffix(string):
return string[-3:] return string[-3:]
def cluster(string):
return 0
def is_alpha(string): def is_alpha(string):
return string.isalpha() return string.isalpha()
@ -218,20 +214,11 @@ def is_stop(string, stops=set()):
return string.lower() in stops return string.lower() in stops
def is_oov(string):
return True
def get_prob(string):
return -20.0
LEX_ATTRS = { LEX_ATTRS = {
attrs.LOWER: lower, attrs.LOWER: lower,
attrs.NORM: lower, attrs.NORM: lower,
attrs.PREFIX: prefix, attrs.PREFIX: prefix,
attrs.SUFFIX: suffix, attrs.SUFFIX: suffix,
attrs.CLUSTER: cluster,
attrs.IS_ALPHA: is_alpha, attrs.IS_ALPHA: is_alpha,
attrs.IS_DIGIT: is_digit, attrs.IS_DIGIT: is_digit,
attrs.IS_LOWER: is_lower, attrs.IS_LOWER: is_lower,
@ -239,8 +226,6 @@ LEX_ATTRS = {
attrs.IS_TITLE: is_title, attrs.IS_TITLE: is_title,
attrs.IS_UPPER: is_upper, attrs.IS_UPPER: is_upper,
attrs.IS_STOP: is_stop, attrs.IS_STOP: is_stop,
attrs.IS_OOV: is_oov,
attrs.PROB: get_prob,
attrs.LIKE_EMAIL: like_email, attrs.LIKE_EMAIL: like_email,
attrs.LIKE_NUM: like_num, attrs.LIKE_NUM: like_num,
attrs.IS_PUNCT: is_punct, attrs.IS_PUNCT: is_punct,


@ -55,7 +55,7 @@ _num_words = [
"തൊണ്ണൂറ് ", "തൊണ്ണൂറ് ",
"നുറ് ", "നുറ് ",
"ആയിരം ", "ആയിരം ",
"പത്തുലക്ഷം" "പത്തുലക്ഷം",
] ]


@ -3,7 +3,6 @@ from __future__ import unicode_literals
STOP_WORDS = set( STOP_WORDS = set(
""" """
അത അത
ഇത ഇത


@ -2,9 +2,10 @@
from __future__ import unicode_literals from __future__ import unicode_literals
from ...symbols import NOUN, PROPN, PRON from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors
def noun_chunks(obj): def noun_chunks(doclike):
""" """
Detect base noun phrases from a dependency parse. Works on both Doc and Span. Detect base noun phrases from a dependency parse. Works on both Doc and Span.
""" """
@ -18,21 +19,23 @@ def noun_chunks(obj):
"nmod", "nmod",
"nmod:poss", "nmod:poss",
] ]
doc = obj.doc # Ensure works on both Doc and Span. doc = doclike.doc # Ensure works on both Doc and Span.
if not doc.is_parsed:
raise ValueError(Errors.E029)
np_deps = [doc.vocab.strings[label] for label in labels] np_deps = [doc.vocab.strings[label] for label in labels]
conj = doc.vocab.strings.add("conj") conj = doc.vocab.strings.add("conj")
np_label = doc.vocab.strings.add("NP") np_label = doc.vocab.strings.add("NP")
seen = set() prev_end = -1
for i, word in enumerate(obj): for i, word in enumerate(doclike):
if word.pos not in (NOUN, PROPN, PRON): if word.pos not in (NOUN, PROPN, PRON):
continue continue
# Prevent nested chunks from being produced # Prevent nested chunks from being produced
if word.i in seen: if word.left_edge.i <= prev_end:
continue continue
if word.dep in np_deps: if word.dep in np_deps:
if any(w.i in seen for w in word.subtree): prev_end = word.right_edge.i
continue
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
yield word.left_edge.i, word.right_edge.i + 1, np_label yield word.left_edge.i, word.right_edge.i + 1, np_label
elif word.dep == conj: elif word.dep == conj:
head = word.head head = word.head
@ -40,9 +43,7 @@ def noun_chunks(obj):
head = head.head head = head.head
# If the head is an NP, and we're coordinated to it, we're an NP # If the head is an NP, and we're coordinated to it, we're an NP
if head.dep in np_deps: if head.dep in np_deps:
if any(w.i in seen for w in word.subtree): prev_end = word.right_edge.i
continue
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
yield word.left_edge.i, word.right_edge.i + 1, np_label yield word.left_edge.i, word.right_edge.i + 1, np_label


@ -1,17 +1,19 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
from .punctuation import TOKENIZER_INFIXES from .punctuation import TOKENIZER_SUFFIXES
from .tag_map import TAG_MAP from .tag_map import TAG_MAP
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
from .lemmatizer import PolishLemmatizer
from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS from ..norm_exceptions import BASE_NORMS
from ...language import Language from ...language import Language
from ...attrs import LANG, NORM from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups from ...util import add_lookups
from ...lookups import Lookups
class PolishDefaults(Language.Defaults): class PolishDefaults(Language.Defaults):
@ -21,10 +23,21 @@ class PolishDefaults(Language.Defaults):
lex_attr_getters[NORM] = add_lookups( lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
) )
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) mod_base_exceptions = {
exc: val for exc, val in BASE_EXCEPTIONS.items() if not exc.endswith(".")
}
tokenizer_exceptions = mod_base_exceptions
stop_words = STOP_WORDS stop_words = STOP_WORDS
tag_map = TAG_MAP tag_map = TAG_MAP
prefixes = TOKENIZER_PREFIXES
infixes = TOKENIZER_INFIXES infixes = TOKENIZER_INFIXES
suffixes = TOKENIZER_SUFFIXES
@classmethod
def create_lemmatizer(cls, nlp=None, lookups=None):
if lookups is None:
lookups = Lookups()
return PolishLemmatizer(lookups)
class Polish(Language): class Polish(Language):
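The new `PolishDefaults` no longer merges language-specific exceptions; it only filters the shared base table, dropping every key that ends in a period so that trailing dots are left to the suffix rules. A small sketch of that filter on a made-up table follows. The sample keys and the plain-string `"ORTH"` field are assumptions for illustration; the real `BASE_EXCEPTIONS` uses spaCy's symbol constants.

```python
# Hypothetical stand-in for the shared base exceptions table
BASE_EXCEPTIONS_SAMPLE = {
    "a.m.": [{"ORTH": "a.m."}],
    "e.g.": [{"ORTH": "e.g."}],
    ":)": [{"ORTH": ":)"}],
    "\n": [{"ORTH": "\n"}],
}

# Same comprehension as in the diff: keep only keys without a trailing period
mod_base_exceptions = {
    exc: val for exc, val in BASE_EXCEPTIONS_SAMPLE.items() if not exc.endswith(".")
}
```

Emoticons and whitespace entries survive the filter, while abbreviation-style keys such as `"a.m."` are removed.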

File diff suppressed because it is too large

106
spacy/lang/pl/lemmatizer.py Normal file

@ -0,0 +1,106 @@
# coding: utf-8
from __future__ import unicode_literals
from ...lemmatizer import Lemmatizer
from ...parts_of_speech import NAMES
class PolishLemmatizer(Lemmatizer):
# This lemmatizer implements lookup lemmatization based on
# the Morfeusz dictionary (morfeusz.sgjp.pl/en) by Institute of Computer Science PAS
# It utilizes some prefix based improvements for
# verb and adjectives lemmatization, as well as case-sensitive
# lemmatization for nouns
def __init__(self, lookups, *args, **kwargs):
# this lemmatizer is lookup based, so it does not require an index, exceptionlist, or rules
super(PolishLemmatizer, self).__init__(lookups)
self.lemma_lookups = {}
for tag in [
"ADJ",
"ADP",
"ADV",
"AUX",
"NOUN",
"NUM",
"PART",
"PRON",
"VERB",
"X",
]:
self.lemma_lookups[tag] = self.lookups.get_table(
"lemma_lookup_" + tag.lower(), {}
)
self.lemma_lookups["DET"] = self.lemma_lookups["X"]
self.lemma_lookups["PROPN"] = self.lemma_lookups["NOUN"]
def __call__(self, string, univ_pos, morphology=None):
if isinstance(univ_pos, int):
univ_pos = NAMES.get(univ_pos, "X")
univ_pos = univ_pos.upper()
if univ_pos == "NOUN":
return self.lemmatize_noun(string, morphology)
if univ_pos != "PROPN":
string = string.lower()
if univ_pos == "ADJ":
return self.lemmatize_adj(string, morphology)
elif univ_pos == "VERB":
return self.lemmatize_verb(string, morphology)
lemma_dict = self.lemma_lookups.get(univ_pos, {})
return [lemma_dict.get(string, string.lower())]
def lemmatize_adj(self, string, morphology):
# this method utilizes different procedures for adjectives
# with 'nie' and 'naj' prefixes
lemma_dict = self.lemma_lookups["ADJ"]
if string[:3] == "nie":
search_string = string[3:]
if search_string[:3] == "naj":
naj_search_string = search_string[3:]
if naj_search_string in lemma_dict:
return [lemma_dict[naj_search_string]]
if search_string in lemma_dict:
return [lemma_dict[search_string]]
if string[:3] == "naj":
naj_search_string = string[3:]
if naj_search_string in lemma_dict:
return [lemma_dict[naj_search_string]]
return [lemma_dict.get(string, string)]
def lemmatize_verb(self, string, morphology):
# this method utilizes a different procedure for verbs
# with 'nie' prefix
lemma_dict = self.lemma_lookups["VERB"]
if string[:3] == "nie":
search_string = string[3:]
if search_string in lemma_dict:
return [lemma_dict[search_string]]
return [lemma_dict.get(string, string)]
def lemmatize_noun(self, string, morphology):
# this method is case-sensitive, in order to work
# for incorrectly tagged proper names
lemma_dict = self.lemma_lookups["NOUN"]
if string != string.lower():
if string.lower() in lemma_dict:
return [lemma_dict[string.lower()]]
elif string in lemma_dict:
return [lemma_dict[string]]
return [string.lower()]
return [lemma_dict.get(string, string)]
def lookup(self, string, orth=None):
return string.lower()
def lemmatize(self, string, index, exceptions, rules):
raise NotImplementedError
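The adjective branch of the lemmatizer above tries progressively shorter strings: it strips a leading "nie", then a leading "naj", and falls back to the unmodified string when no stripped form is in the table. The sketch below reproduces that control flow as a standalone function over a plain dict. The two-entry table is a toy assumption and is not taken from the Morfeusz data.

```python
def lemmatize_adj(string, lemma_dict):
    """Look up an adjective, retrying without the 'nie' and 'naj' prefixes."""
    if string[:3] == "nie":
        search_string = string[3:]
        if search_string[:3] == "naj":
            naj_search_string = search_string[3:]
            if naj_search_string in lemma_dict:
                return [lemma_dict[naj_search_string]]
        if search_string in lemma_dict:
            return [lemma_dict[search_string]]
    if string[:3] == "naj":
        naj_search_string = string[3:]
        if naj_search_string in lemma_dict:
            return [lemma_dict[naj_search_string]]
    # No prefix rule applied: plain lookup, defaulting to the input itself
    return [lemma_dict.get(string, string)]


# Toy lookup table (assumed entries, for illustration only)
ADJ_TABLE = {"dobry": "dobry", "lepszy": "dobry"}
```

With this table, `lemmatize_adj("najlepszy", ADJ_TABLE)` and `lemmatize_adj("niedobry", ADJ_TABLE)` both resolve through the stripped form, while an unknown word is returned unchanged.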


@ -1,23 +0,0 @@
Copyright (c) 2019, Marcin Miłkowski
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


@ -1,22 +1,48 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
from ..char_classes import LIST_ELLIPSES, CONCAT_ICONS from ..char_classes import LIST_ELLIPSES, LIST_PUNCT, LIST_HYPHENS
from ..char_classes import LIST_ICONS, LIST_QUOTES, CURRENCY, UNITS, PUNCT
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES
_quotes = CONCAT_QUOTES.replace("'", "") _quotes = CONCAT_QUOTES.replace("'", "")
_prefixes = _prefixes = [
r"(długo|krótko|jedno|dwu|trzy|cztero)-"
] + BASE_TOKENIZER_PREFIXES
_infixes = ( _infixes = (
LIST_ELLIPSES LIST_ELLIPSES
+ [CONCAT_ICONS] + LIST_ICONS
+ LIST_HYPHENS
+ [ + [
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER), r"(?<=[0-9{al}])\.(?=[0-9{au}])".format(al=ALPHA, au=ALPHA_UPPER),
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA), r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA), r"(?<=[{a}])[:<>=\/](?=[{a}])".format(a=ALPHA),
r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA), r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}])([{q}\)\]\(\[])(?=[\-{a}])".format(a=ALPHA, q=CONCAT_QUOTES), r"(?<=[{a}])([{q}\)\]\(\[])(?=[\-{a}])".format(a=ALPHA, q=_quotes),
] ]
) )
_suffixes = (
["''", "’’", r"\.", "…"]
+ LIST_PUNCT
+ LIST_QUOTES
+ LIST_ICONS
+ [
r"(?<=[0-9])\+",
r"(?<=°[FfCcKk])\.",
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
r"(?<=[0-9])(?:{u})".format(u=UNITS),
r"(?<=[0-9{al}{e}{p}(?:{q})])\.".format(
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
),
r"(?<=[{au}])\.".format(au=ALPHA_UPPER),
]
)
TOKENIZER_PREFIXES = _prefixes
TOKENIZER_INFIXES = _infixes TOKENIZER_INFIXES = _infixes
TOKENIZER_SUFFIXES = _suffixes
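The new Polish prefix rule splits off hyphenated combining forms such as "dwu-" or "krótko-" from the start of a token. The sketch below exercises only that one pattern with the standard `re` module. Anchoring it with `^` and the helper `split_prefix` are assumptions made for the example and only approximate how a tokenizer would apply the rule.

```python
import re

# Pattern copied from the diff; the "^" anchor is added for this sketch
PL_PREFIX = r"(długo|krótko|jedno|dwu|trzy|cztero)-"
prefix_re = re.compile("^" + PL_PREFIX)


def split_prefix(token):
    """Return (prefix, rest) if the token starts with a listed form, else (None, token)."""
    match = prefix_re.match(token)
    if match is None:
        return None, token
    return match.group(0), token[match.end():]


result_match = split_prefix("dwu-osobowy")
result_accented = split_prefix("krótko-terminowy")
result_none = split_prefix("auto-mat")
```

Only the six listed forms are split; any other hyphenated token is left for the infix rules to handle.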


@ -1,26 +0,0 @@
# encoding: utf8
from __future__ import unicode_literals
from ._tokenizer_exceptions_list import PL_BASE_EXCEPTIONS
from ...symbols import POS, ADV, NOUN, ORTH, LEMMA, ADJ
_exc = {}
for exc_data in [
{ORTH: "m.in.", LEMMA: "między innymi", POS: ADV},
{ORTH: "inż.", LEMMA: "inżynier", POS: NOUN},
{ORTH: "mgr.", LEMMA: "magister", POS: NOUN},
{ORTH: "tzn.", LEMMA: "to znaczy", POS: ADV},
{ORTH: "tj.", LEMMA: "to jest", POS: ADV},
{ORTH: "tzw.", LEMMA: "tak zwany", POS: ADJ},
]:
_exc[exc_data[ORTH]] = [exc_data]
for orth in ["w.", "r."]:
_exc[orth] = [{ORTH: orth}]
for orth in PL_BASE_EXCEPTIONS:
_exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = _exc


@ -5,22 +5,17 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
from .tag_map import TAG_MAP from .tag_map import TAG_MAP
from .norm_exceptions import NORM_EXCEPTIONS
from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..tokenizer_exceptions import BASE_EXCEPTIONS
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES from .punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES
from ..norm_exceptions import BASE_NORMS
from ...language import Language from ...language import Language
from ...attrs import LANG, NORM from ...attrs import LANG
from ...util import update_exc, add_lookups from ...util import update_exc
class PortugueseDefaults(Language.Defaults): class PortugueseDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: "pt" lex_attr_getters[LANG] = lambda text: "pt"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
)
lex_attr_getters.update(LEX_ATTRS) lex_attr_getters.update(LEX_ATTRS)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS stop_words = STOP_WORDS


@ -1,23 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
# These exceptions are used to add NORM values based on a token's ORTH value.
# Individual languages can also add their own exceptions and overwrite them -
# for example, British vs. American spelling in English.
# Norms are only set if no alternative is provided in the tokenizer exceptions.
# Note that this does not change any other token attributes. Its main purpose
# is to normalise the word representations so that equivalent tokens receive
# similar representations. For example: $ and € are very different, but they're
# both currency symbols. By normalising currency symbols to $, all symbols are
# seen as similar, no matter how common they are in the training data.
NORM_EXCEPTIONS = {
"R$": "$", # Real
"r$": "$", # Real
"Cz$": "$", # Cruzado
"cz$": "$", # Cruzado
"NCz$": "$", # Cruzado Novo
"ncz$": "$", # Cruzado Novo
}


@ -3,26 +3,21 @@ from __future__ import unicode_literals, print_function
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .norm_exceptions import NORM_EXCEPTIONS
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
from .tag_map import TAG_MAP from .tag_map import TAG_MAP
from .lemmatizer import RussianLemmatizer from .lemmatizer import RussianLemmatizer
from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS from ...util import update_exc
from ...util import update_exc, add_lookups
from ...language import Language from ...language import Language
from ...lookups import Lookups from ...lookups import Lookups
from ...attrs import LANG, NORM from ...attrs import LANG
class RussianDefaults(Language.Defaults): class RussianDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS) lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: "ru" lex_attr_getters[LANG] = lambda text: "ru"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS stop_words = STOP_WORDS
tag_map = TAG_MAP tag_map = TAG_MAP


@ -1,36 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
_exc = {
# Slang
"прив": "привет",
"дарова": "привет",
"дак": "так",
"дык": "так",
"здарова": "привет",
"пакедава": "пока",
"пакедаво": "пока",
"ща": "сейчас",
"спс": "спасибо",
"пжлст": "пожалуйста",
"плиз": "пожалуйста",
"ладненько": "ладно",
"лады": "ладно",
"лан": "ладно",
"ясн": "ясно",
"всм": "всмысле",
"хош": "хочешь",
"хаюшки": "привет",
"оч": "очень",
"че": "что",
"чо": "что",
"шо": "что",
}
NORM_EXCEPTIONS = {}
for string, norm in _exc.items():
NORM_EXCEPTIONS[string] = norm
NORM_EXCEPTIONS[string.title()] = norm
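Note on the loop above: each slang form is registered twice, once as written and once title-cased, so that a sentence-initial token receives the same norm. A minimal sketch of that pattern follows, using two entries copied from the table above. `build_norm_exceptions` is a hypothetical helper name used only for illustration; it is not part of spaCy.

```python
def build_norm_exceptions(exc):
    """Return a dict mapping each key and its title-cased form to the same norm."""
    norm_exceptions = {}
    for string, norm in exc.items():
        norm_exceptions[string] = norm
        # str.title() is Unicode-aware, so Cyrillic keys are capitalised as well.
        norm_exceptions[string.title()] = norm
    return norm_exceptions


_exc = {"спс": "спасибо", "оч": "очень"}
NORM_EXCEPTIONS = build_norm_exceptions(_exc)
print(sorted(NORM_EXCEPTIONS))
```

The Tamil table later in this diff skips the `title()` step, which is consistent with Tamil script having no letter case.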


@ -3,22 +3,17 @@ from __future__ import unicode_literals
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .norm_exceptions import NORM_EXCEPTIONS
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language from ...language import Language
from ...attrs import LANG, NORM from ...attrs import LANG
from ...util import update_exc, add_lookups from ...util import update_exc
class SerbianDefaults(Language.Defaults): class SerbianDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS) lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: "sr" lex_attr_getters[LANG] = lambda text: "sr"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS stop_words = STOP_WORDS


@ -1,26 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
_exc = {
# Slang
"ћале": "отац",
"кева": "мајка",
"смор": "досада",
"кец": "јединица",
"тебра": "брат",
"штребер": "ученик",
"факс": "факултет",
"профа": "професор",
"бус": "аутобус",
"пискарало": "службеник",
"бакутанер": "бака",
"џибер": "простак",
}
NORM_EXCEPTIONS = {}
for string, norm in _exc.items():
NORM_EXCEPTIONS[string] = norm
NORM_EXCEPTIONS[string.title()] = norm


@ -40,7 +40,7 @@ _num_words = [
"miljard", "miljard",
"biljon", "biljon",
"biljard", "biljard",
"kvadriljon" "kvadriljon",
] ]


@ -2,9 +2,10 @@
from __future__ import unicode_literals from __future__ import unicode_literals
from ...symbols import NOUN, PROPN, PRON from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors
def noun_chunks(obj): def noun_chunks(doclike):
""" """
Detect base noun phrases from a dependency parse. Works on both Doc and Span. Detect base noun phrases from a dependency parse. Works on both Doc and Span.
""" """
@ -19,21 +20,23 @@ def noun_chunks(obj):
"nmod", "nmod",
"nmod:poss", "nmod:poss",
] ]
doc = obj.doc # Ensure works on both Doc and Span. doc = doclike.doc # Ensure works on both Doc and Span.
if not doc.is_parsed:
raise ValueError(Errors.E029)
np_deps = [doc.vocab.strings[label] for label in labels] np_deps = [doc.vocab.strings[label] for label in labels]
conj = doc.vocab.strings.add("conj") conj = doc.vocab.strings.add("conj")
np_label = doc.vocab.strings.add("NP") np_label = doc.vocab.strings.add("NP")
seen = set() prev_end = -1
for i, word in enumerate(obj): for i, word in enumerate(doclike):
if word.pos not in (NOUN, PROPN, PRON): if word.pos not in (NOUN, PROPN, PRON):
continue continue
# Prevent nested chunks from being produced # Prevent nested chunks from being produced
if word.i in seen: if word.left_edge.i <= prev_end:
continue continue
if word.dep in np_deps: if word.dep in np_deps:
if any(w.i in seen for w in word.subtree): prev_end = word.right_edge.i
continue
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
yield word.left_edge.i, word.right_edge.i + 1, np_label yield word.left_edge.i, word.right_edge.i + 1, np_label
elif word.dep == conj: elif word.dep == conj:
head = word.head head = word.head
@ -41,9 +44,7 @@ def noun_chunks(obj):
head = head.head head = head.head
# If the head is an NP, and we're coordinated to it, we're an NP # If the head is an NP, and we're coordinated to it, we're an NP
if head.dep in np_deps: if head.dep in np_deps:
if any(w.i in seen for w in word.subtree): prev_end = word.right_edge.i
continue
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
yield word.left_edge.i, word.right_edge.i + 1, np_label yield word.left_edge.i, word.right_edge.i + 1, np_label
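Note on the `prev_end` change above: the new code tracks only the right edge of the last chunk it yielded, instead of a `seen` set of every covered token index. The sketch below isolates that idea in plain Python under one assumption: candidates arrive in document order as `(left_edge, right_edge)` pairs with inclusive indices. `non_nested_spans` is a hypothetical name, and the sketch leaves out the dependency-label checks that the real function performs.

```python
def non_nested_spans(candidates):
    """Yield (start, end) spans, skipping any that begin inside an earlier one.

    `candidates` holds (left_edge, right_edge) token indices, both inclusive,
    in document order. The yielded `end` is exclusive, as in the diff.
    """
    prev_end = -1
    for left, right in candidates:
        # A candidate that starts at or before the last yielded right edge
        # would overlap or nest inside the previous chunk, so drop it.
        if left <= prev_end:
            continue
        prev_end = right
        yield left, right + 1


spans = list(non_nested_spans([(0, 2), (1, 1), (3, 5), (5, 6), (7, 7)]))
print(spans)
```

A single integer is enough here because tokens are visited left to right. The set-based version did the same job but stored one entry per covered token.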


@ -1,7 +1,7 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
from ...symbols import LEMMA, NORM, ORTH, PRON_LEMMA, PUNCT, TAG from ...symbols import LEMMA, NORM, ORTH, PRON_LEMMA
_exc = {} _exc = {}
@ -155,6 +155,6 @@ for orth in ABBREVIATIONS:
# Sentences ending in "i." (as in "... peka i."), "m." (as in "...än 2000 m."), # Sentences ending in "i." (as in "... peka i."), "m." (as in "...än 2000 m."),
# should be tokenized as two separate tokens. # should be tokenized as two separate tokens.
for orth in ["i", "m"]: for orth in ["i", "m"]:
_exc[orth + "."] = [{ORTH: orth, LEMMA: orth, NORM: orth}, {ORTH: ".", TAG: PUNCT}] _exc[orth + "."] = [{ORTH: orth, LEMMA: orth, NORM: orth}, {ORTH: "."}]
TOKENIZER_EXCEPTIONS = _exc TOKENIZER_EXCEPTIONS = _exc


@ -1,139 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
_exc = {
# Regional words normal
# Sri Lanka - Wikipedia
"இங்க": "இங்கே",
"வாங்க": "வாருங்கள்",
"ஒண்டு": "ஒன்று",
"கண்டு": "கன்று",
"கொண்டு": "கொன்று",
"பண்டி": "பன்றி",
"பச்ச": "பச்சை",
"அம்பது": "ஐம்பது",
"வெச்ச": "வைத்து",
"வச்ச": "வைத்து",
"வச்சி": "வைத்து",
"வாளைப்பழம்": "வாழைப்பழம்",
"மண்ணு": "மண்",
"பொன்னு": "பொன்",
"சாவல்": "சேவல்",
"அங்கால": "அங்கு ",
"அசுப்பு": "நடமாட்டம்",
"எழுவான் கரை": "எழுவான்கரை",
"ஓய்யாரம்": "எழில் ",
"ஒளும்பு": "எழும்பு",
"ஓர்மை": "துணிவு",
"கச்சை": "கோவணம்",
"கடப்பு": "தெருவாசல்",
"சுள்ளி": "காய்ந்த குச்சி",
"திறாவுதல்": "தடவுதல்",
"நாசமறுப்பு": "தொல்லை",
"பரிசாரி": "வைத்தியன்",
"பறவாதி": "பேராசைக்காரன்",
"பிசினி": "உலோபி ",
"விசர்": "பைத்தியம்",
"ஏனம்": "பாத்திரம்",
"ஏலா": "இயலாது",
"ஒசில்": "அழகு",
"ஒள்ளுப்பம்": "கொஞ்சம்",
# Sri Lankan and Indian
"குத்துமதிப்பு": "",
"நூனாயம்": "நூல்நயம்",
"பைய": "மெதுவாக",
"மண்டை": "தலை",
"வெள்ளனே": "சீக்கிரம்",
"உசுப்பு": "எழுப்பு",
"ஆணம்": "குழம்பு",
"உறக்கம்": "தூக்கம்",
"பஸ்": "பேருந்து",
"களவு": "திருட்டு ",
# relationship
"புருசன்": "கணவன்",
"பொஞ்சாதி": "மனைவி",
"புள்ள": "பிள்ளை",
"பிள்ள": "பிள்ளை",
"ஆம்பிளப்புள்ள": "ஆண் பிள்ளை",
"பொம்பிளப்புள்ள": "பெண் பிள்ளை",
"அண்ணாச்சி": "அண்ணா",
"அக்காச்சி": "அக்கா",
"தங்கச்சி": "தங்கை",
# difference words
"பொடியன்": "சிறுவன்",
"பொட்டை": "சிறுமி",
"பிறகு": "பின்பு",
"டக்கென்டு": "விரைவாக",
"கெதியா": "விரைவாக",
"கிறுகி": "திரும்பி",
"போயித்து வாறன்": "போய் வருகிறேன்",
"வருவாங்களா": "வருவார்களா",
# Regular spoken forms
"சொல்லு": "சொல்",
"கேளு": "கேள்",
"சொல்லுங்க": "சொல்லுங்கள்",
"கேளுங்க": "கேளுங்கள்",
"நீங்கள்": "நீ",
"உன்": "உன்னுடைய",
# Portuguese formal words
"அலவாங்கு": "கடப்பாரை",
"ஆசுப்பத்திரி": "மருத்துவமனை",
"உரோதை": "சில்லு",
"கடுதாசி": "கடிதம்",
"கதிரை": "நாற்காலி",
"குசினி": "அடுக்களை",
"கோப்பை": "கிண்ணம்",
"சப்பாத்து": "காலணி",
"தாச்சி": "இரும்புச் சட்டி",
"துவாய்": "துவாலை",
"தவறணை": "மதுக்கடை",
"பீப்பா": "மரத்தாழி",
"யன்னல்": "சாளரம்",
"வாங்கு": "மரஇருக்கை",
# Dutch formal words
"இறாக்கை": "பற்சட்டம்",
"இலாட்சி": "இழுப்பறை",
"கந்தோர்": "பணிமனை",
"நொத்தாரிசு": "ஆவண எழுத்துபதிவாளர்",
# English formal words
"இஞ்சினியர்": "பொறியியலாளர்",
"சூப்பு": "ரசம்",
"செக்": "காசோலை",
"சேட்டு": "மேற்ச்சட்டை",
"மார்க்கட்டு": "சந்தை",
"விண்ணன்": "கெட்டிக்காரன்",
# Arabic formal words
"ஈமான்": "நம்பிக்கை",
"சுன்னத்து": "விருத்தசேதனம்",
"செய்த்தான்": "பிசாசு",
"மவுத்து": "இறப்பு",
"ஹலால்": "அங்கீகரிக்கப்பட்டது",
"கறாம்": "நிராகரிக்கப்பட்டது",
# Persian, Hindustani and Hindi formal words
"சுமார்": "கிட்டத்தட்ட",
"சிப்பாய்": "போர்வீரன்",
"சிபார்சு": "சிபாரிசு",
"ஜமீன்": "பணக்காரா்",
"அசல்": "மெய்யான",
"அந்தஸ்து": "கௌரவம்",
"ஆஜர்": "சமா்ப்பித்தல்",
"உசார்": "எச்சரிக்கை",
"அச்சா": "நல்ல",
# English words used in text conversations
"bcoz": "ஏனெனில்",
"bcuz": "ஏனெனில்",
"fav": "விருப்பமான",
"morning": "காலை வணக்கம்",
"gdeveng": "மாலை வணக்கம்",
"gdnyt": "இரவு வணக்கம்",
"gdnit": "இரவு வணக்கம்",
"plz": "தயவு செய்து",
"pls": "தயவு செய்து",
"thx": "நன்றி",
"thanx": "நன்றி",
}
NORM_EXCEPTIONS = {}
for string, norm in _exc.items():
NORM_EXCEPTIONS[string] = norm


@ -4,14 +4,12 @@ from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .tag_map import TAG_MAP from .tag_map import TAG_MAP
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .norm_exceptions import NORM_EXCEPTIONS
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
from ..norm_exceptions import BASE_NORMS from ...attrs import LANG
from ...attrs import LANG, NORM
from ...language import Language from ...language import Language
from ...tokens import Doc from ...tokens import Doc
from ...util import DummyTokenizer, add_lookups from ...util import DummyTokenizer
class ThaiTokenizer(DummyTokenizer): class ThaiTokenizer(DummyTokenizer):
@ -37,9 +35,6 @@ class ThaiDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS) lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda _text: "th" lex_attr_getters[LANG] = lambda _text: "th"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
)
tokenizer_exceptions = dict(TOKENIZER_EXCEPTIONS) tokenizer_exceptions = dict(TOKENIZER_EXCEPTIONS)
tag_map = TAG_MAP tag_map = TAG_MAP
stop_words = STOP_WORDS stop_words = STOP_WORDS


@ -1,113 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
_exc = {
# Tone marks that do not match the pronunciation (ผันอักษรและเสียงไม่ตรงกับรูปวรรณยุกต์)
"สนุ๊กเกอร์": "สนุกเกอร์",
"โน้ต": "โน้ต",
# Misspelled out of laziness or haste (สะกดผิดเพราะขี้เกียจพิมพ์ หรือเร่งรีบ)
"โทสับ": "โทรศัพท์",
"พุ่งนี้": "พรุ่งนี้",
# Strange (ให้ดูแปลกตา)
"ชะมะ": "ใช่ไหม",
"ชิมิ": "ใช่ไหม",
"ชะ": "ใช่ไหม",
"ช่ายมะ": "ใช่ไหม",
"ป่าว": "เปล่า",
"ป่ะ": "เปล่า",
"ปล่าว": "เปล่า",
"คัย": "ใคร",
"ไค": "ใคร",
"คราย": "ใคร",
"เตง": "ตัวเอง",
"ตะเอง": "ตัวเอง",
"รึ": "หรือ",
"เหรอ": "หรือ",
"หรา": "หรือ",
"หรอ": "หรือ",
"ชั้น": "ฉัน",
"ชั้ล": "ฉัน",
"ช้าน": "ฉัน",
"เทอ": "เธอ",
"เทอร์": "เธอ",
"เทอว์": "เธอ",
"แกร": "แก",
"ป๋ม": "ผม",
"บ่องตง": "บอกตรงๆ",
"ถ่ามตง": "ถามตรงๆ",
"ต่อมตง": "ตอบตรงๆ",
"เพิ่ล": "เพื่อน",
"จอบอ": "จอบอ",
"ดั้ย": "ได้",
"ขอบคุง": "ขอบคุณ",
"ยังงัย": "ยังไง",
"Inw": "เทพ",
"uou": "นอน",
"Lกรีeu": "เกรียน",
# Misspelled to express emotions (คำที่สะกดผิดเพื่อแสดงอารมณ์)
"เปงราย": "เป็นอะไร",
"เปนรัย": "เป็นอะไร",
"เปงรัย": "เป็นอะไร",
"เป็นอัลไล": "เป็นอะไร",
"ทามมาย": "ทำไม",
"ทามมัย": "ทำไม",
"จังรุย": "จังเลย",
"จังเยย": "จังเลย",
"จุงเบย": "จังเลย",
"ไม่รู้": "มะรุ",
"เฮ่ย": "เฮ้ย",
"เห้ย": "เฮ้ย",
"น่าร็อค": "น่ารัก",
"น่าร๊าก": "น่ารัก",
"ตั้ลล๊าก": "น่ารัก",
"คือร๊ะ": "คืออะไร",
"โอป่ะ": "โอเคหรือเปล่า",
"น่ามคาน": "น่ารำคาญ",
"น่ามสาร": "น่าสงสาร",
"วงวาร": "สงสาร",
"บับว่า": "แบบว่า",
"อัลไล": "อะไร",
"อิจ": "อิจฉา",
# Misspelled to soften rude words or to evade software profanity filters (คำที่สะกดผิดเพื่อลดความหยาบของคำ หรืออาจใช้หลีกเลี่ยงการกรองคำหยาบของซอฟต์แวร์)
"กรู": "กู",
"กุ": "กู",
"กรุ": "กู",
"ตู": "กู",
"ตรู": "กู",
"มรึง": "มึง",
"เมิง": "มึง",
"มืง": "มึง",
"มุง": "มึง",
"สาด": "สัตว์",
"สัส": "สัตว์",
"สัก": "สัตว์",
"แสรด": "สัตว์",
"โคโตะ": "โคตร",
"โคด": "โคตร",
"โครต": "โคตร",
"โคตะระ": "โคตร",
"พ่อง": "พ่อมึง",
"แม่เมิง": "แม่มึง",
"เชี่ย": "เหี้ย",
# Onomatopoeic words (คำเลียนเสียง โดยส่วนใหญ่จะเพิ่มทัณฑฆาต หรือซ้ำตัวอักษร)
"แอร๊ยย": "อ๊าย",
"อร๊ายยย": "อ๊าย",
"มันส์": "มัน",
"วู๊วววววววว์": "วู้",
# Acronym (แบบคำย่อ)
"หมาลัย": "มหาวิทยาลัย",
"วิดวะ": "วิศวะ",
"สินสาด ": "ศิลปศาสตร์",
"สินกำ ": "ศิลปกรรมศาสตร์",
"เสารีย์ ": "อนุเสาวรีย์ชัยสมรภูมิ",
"เมกา ": "อเมริกา",
"มอไซค์ ": "มอเตอร์ไซค์",
}
NORM_EXCEPTIONS = {}
for string, norm in _exc.items():
NORM_EXCEPTIONS[string] = norm
NORM_EXCEPTIONS[string.title()] = norm


@ -38,7 +38,6 @@ TAG_MAP = {
"NNPC": {POS: PROPN}, "NNPC": {POS: PROPN},
"NNC": {POS: NOUN}, "NNC": {POS: NOUN},
"PSP": {POS: ADP}, "PSP": {POS: ADP},
".": {POS: PUNCT}, ".": {POS: PUNCT},
",": {POS: PUNCT}, ",": {POS: PUNCT},
"-LRB-": {POS: PUNCT}, "-LRB-": {POS: PUNCT},

View File

@ -104,6 +104,23 @@ class ChineseTokenizer(DummyTokenizer):
(words, spaces) = util.get_words_and_spaces(words, text) (words, spaces) = util.get_words_and_spaces(words, text)
return Doc(self.vocab, words=words, spaces=spaces) return Doc(self.vocab, words=words, spaces=spaces)
def pkuseg_update_user_dict(self, words, reset=False):
if self.pkuseg_seg:
if reset:
try:
import pkuseg
self.pkuseg_seg.preprocesser = pkuseg.Preprocesser(None)
except ImportError:
if self.use_pkuseg:
msg = (
"pkuseg not installed: unable to reset pkuseg "
"user dict. Please " + _PKUSEG_INSTALL_MSG
)
raise ImportError(msg)
for word in words:
self.pkuseg_seg.preprocesser.insert(word.strip(), "")
def _get_config(self): def _get_config(self):
config = OrderedDict( config = OrderedDict(
( (
@ -152,21 +169,16 @@ class ChineseTokenizer(DummyTokenizer):
return util.to_bytes(serializers, []) return util.to_bytes(serializers, [])
def from_bytes(self, data, **kwargs): def from_bytes(self, data, **kwargs):
pkuseg_features_b = b"" pkuseg_data = {"features_b": b"", "weights_b": b"", "processors_data": None}
pkuseg_weights_b = b""
pkuseg_processors_data = None
def deserialize_pkuseg_features(b): def deserialize_pkuseg_features(b):
nonlocal pkuseg_features_b pkuseg_data["features_b"] = b
pkuseg_features_b = b
def deserialize_pkuseg_weights(b): def deserialize_pkuseg_weights(b):
nonlocal pkuseg_weights_b pkuseg_data["weights_b"] = b
pkuseg_weights_b = b
def deserialize_pkuseg_processors(b): def deserialize_pkuseg_processors(b):
nonlocal pkuseg_processors_data pkuseg_data["processors_data"] = srsly.msgpack_loads(b)
pkuseg_processors_data = srsly.msgpack_loads(b)
deserializers = OrderedDict( deserializers = OrderedDict(
( (
@ -178,13 +190,13 @@ class ChineseTokenizer(DummyTokenizer):
) )
util.from_bytes(data, deserializers, []) util.from_bytes(data, deserializers, [])
if pkuseg_features_b and pkuseg_weights_b: if pkuseg_data["features_b"] and pkuseg_data["weights_b"]:
with tempfile.TemporaryDirectory() as tempdir: with tempfile.TemporaryDirectory() as tempdir:
tempdir = Path(tempdir) tempdir = Path(tempdir)
with open(tempdir / "features.pkl", "wb") as fileh: with open(tempdir / "features.pkl", "wb") as fileh:
fileh.write(pkuseg_features_b) fileh.write(pkuseg_data["features_b"])
with open(tempdir / "weights.npz", "wb") as fileh: with open(tempdir / "weights.npz", "wb") as fileh:
fileh.write(pkuseg_weights_b) fileh.write(pkuseg_data["weights_b"])
try: try:
import pkuseg import pkuseg
except ImportError: except ImportError:
@ -193,13 +205,9 @@ class ChineseTokenizer(DummyTokenizer):
+ _PKUSEG_INSTALL_MSG + _PKUSEG_INSTALL_MSG
) )
self.pkuseg_seg = pkuseg.pkuseg(str(tempdir)) self.pkuseg_seg = pkuseg.pkuseg(str(tempdir))
if pkuseg_processors_data: if pkuseg_data["processors_data"]:
( processors_data = pkuseg_data["processors_data"]
user_dict, (user_dict, do_process, common_words, other_words) = processors_data
do_process,
common_words,
other_words,
) = pkuseg_processors_data
self.pkuseg_seg.preprocesser = pkuseg.Preprocesser(user_dict) self.pkuseg_seg.preprocesser = pkuseg.Preprocesser(user_dict)
self.pkuseg_seg.postprocesser.do_process = do_process self.pkuseg_seg.postprocesser.do_process = do_process
self.pkuseg_seg.postprocesser.common_words = set(common_words) self.pkuseg_seg.postprocesser.common_words = set(common_words)
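Note on the `from_bytes` change above: the three `nonlocal` variables are replaced by one dict that the nested deserializers mutate. This is presumably because `nonlocal` is Python 3 syntax and this codebase still imports from `__future__`, but that motive is an assumption. The sketch below shows the pattern on its own; `collect` and the callback names are hypothetical.

```python
def collect(parts):
    """Run one callback per (key, value) pair and return what the callbacks stored."""
    state = {"features_b": b"", "weights_b": b""}

    def store_features(b):
        # Mutating the dict needs no `nonlocal`: the name `state` is never rebound.
        state["features_b"] = b

    def store_weights(b):
        state["weights_b"] = b

    callbacks = {"features": store_features, "weights": store_weights}
    for key, value in parts:
        callbacks[key](value)
    return state


result = collect([("features", b"\x01"), ("weights", b"\x02")])
print(result)
```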


@ -4,10 +4,7 @@ from __future__ import absolute_import, unicode_literals
import random import random
import itertools import itertools
import warnings import warnings
from thinc.extra import load_nlp from thinc.extra import load_nlp
from spacy.util import minibatch
import weakref import weakref
import functools import functools
from collections import OrderedDict from collections import OrderedDict
@ -28,10 +25,11 @@ from .compat import izip, basestring_, is_python2, class_types
from .gold import GoldParse from .gold import GoldParse
from .scorer import Scorer from .scorer import Scorer
from ._ml import link_vectors_to_models, create_default_optimizer from ._ml import link_vectors_to_models, create_default_optimizer
from .attrs import IS_STOP, LANG from .attrs import IS_STOP, LANG, NORM
from .lang.punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES from .lang.punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
from .lang.punctuation import TOKENIZER_INFIXES from .lang.punctuation import TOKENIZER_INFIXES
from .lang.tokenizer_exceptions import TOKEN_MATCH, TOKEN_MATCH_WITH_AFFIXES from .lang.tokenizer_exceptions import TOKEN_MATCH, TOKEN_MATCH_WITH_AFFIXES
from .lang.norm_exceptions import BASE_NORMS
from .lang.tag_map import TAG_MAP from .lang.tag_map import TAG_MAP
from .tokens import Doc from .tokens import Doc
from .lang.lex_attrs import LEX_ATTRS, is_stop from .lang.lex_attrs import LEX_ATTRS, is_stop
@ -77,6 +75,11 @@ class BaseDefaults(object):
lemmatizer=lemmatizer, lemmatizer=lemmatizer,
lookups=lookups, lookups=lookups,
) )
vocab.lex_attr_getters[NORM] = util.add_lookups(
vocab.lex_attr_getters.get(NORM, LEX_ATTRS[NORM]),
BASE_NORMS,
vocab.lookups.get_table("lexeme_norm"),
)
for tag_str, exc in cls.morph_rules.items(): for tag_str, exc in cls.morph_rules.items():
for orth_str, attrs in exc.items(): for orth_str, attrs in exc.items():
vocab.morphology.add_special_case(tag_str, orth_str, attrs) vocab.morphology.add_special_case(tag_str, orth_str, attrs)
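Note on the `NORM` getter above: the per-language `norm_exceptions.py` modules are dropped, and the getter is instead built once from the default getter, `BASE_NORMS`, and the `lexeme_norm` lookup table. The sketch below shows how such a fallback chain can behave: the first table containing the string wins, otherwise the default getter runs. `add_lookups_sketch` is a hypothetical stand-in written for illustration, not a copy of `util.add_lookups`, and the two tables hold illustrative entries only.

```python
import functools


def _lookup_or_default(default_func, lookups, string):
    # Return the first table hit; otherwise fall back to the default getter.
    for table in lookups:
        if string in table:
            return table[string]
    return default_func(string)


def add_lookups_sketch(default_func, *lookups):
    """Build a getter that consults each lookup table before the default."""
    return functools.partial(_lookup_or_default, default_func, lookups)


base_norms = {"€": "$", "£": "$"}        # illustrative shared table
lexeme_norm = {"R$": "$", "cz$": "$"}    # illustrative language-specific table
get_norm = add_lookups_sketch(lambda s: s.lower(), base_norms, lexeme_norm)

print(get_norm("€"), get_norm("R$"), get_norm("Hello"))
```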
@ -417,7 +420,7 @@ class Language(object):
def __call__(self, text, disable=[], component_cfg=None): def __call__(self, text, disable=[], component_cfg=None):
"""Apply the pipeline to some text. The text can span multiple sentences, """Apply the pipeline to some text. The text can span multiple sentences,
and can contain arbtrary whitespace. Alignment into the original string and can contain arbitrary whitespace. Alignment into the original string
is preserved. is preserved.
text (unicode): The text to be processed. text (unicode): The text to be processed.
@ -849,7 +852,7 @@ class Language(object):
*[mp.Pipe(False) for _ in range(n_process)] *[mp.Pipe(False) for _ in range(n_process)]
) )
batch_texts = minibatch(texts, batch_size) batch_texts = util.minibatch(texts, batch_size)
# Sender sends texts to the workers. # Sender sends texts to the workers.
# This is necessary to properly handle infinite length of texts. # This is necessary to properly handle infinite length of texts.
# (In this case, all data cannot be sent to the workers at once) # (In this case, all data cannot be sent to the workers at once)
@ -907,9 +910,8 @@ class Language(object):
serializers["tokenizer"] = lambda p: self.tokenizer.to_disk( serializers["tokenizer"] = lambda p: self.tokenizer.to_disk(
p, exclude=["vocab"] p, exclude=["vocab"]
) )
serializers["meta.json"] = lambda p: p.open("w").write( serializers["meta.json"] = lambda p: srsly.write_json(p, self.meta)
srsly.json_dumps(self.meta)
)
for name, proc in self.pipeline: for name, proc in self.pipeline:
if not hasattr(proc, "name"): if not hasattr(proc, "name"):
continue continue
@ -973,7 +975,9 @@ class Language(object):
serializers = OrderedDict() serializers = OrderedDict()
serializers["vocab"] = lambda: self.vocab.to_bytes() serializers["vocab"] = lambda: self.vocab.to_bytes()
serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"]) serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"])
serializers["meta.json"] = lambda: srsly.json_dumps(OrderedDict(sorted(self.meta.items()))) serializers["meta.json"] = lambda: srsly.json_dumps(
OrderedDict(sorted(self.meta.items()))
)
for name, proc in self.pipeline: for name, proc in self.pipeline:
if name in exclude: if name in exclude:
continue continue
@ -1075,7 +1079,7 @@ def _fix_pretrained_vectors_name(nlp):
else: else:
raise ValueError(Errors.E092) raise ValueError(Errors.E092)
if nlp.vocab.vectors.size != 0: if nlp.vocab.vectors.size != 0:
link_vectors_to_models(nlp.vocab) link_vectors_to_models(nlp.vocab, skip_rank=True)
for name, proc in nlp.pipeline: for name, proc in nlp.pipeline:
if not hasattr(proc, "cfg"): if not hasattr(proc, "cfg"):
continue continue


@ -6,6 +6,7 @@ from collections import OrderedDict
from .symbols import NOUN, VERB, ADJ, PUNCT, PROPN from .symbols import NOUN, VERB, ADJ, PUNCT, PROPN
from .errors import Errors from .errors import Errors
from .lookups import Lookups from .lookups import Lookups
from .parts_of_speech import NAMES as UPOS_NAMES
class Lemmatizer(object): class Lemmatizer(object):
@ -43,17 +44,11 @@ class Lemmatizer(object):
lookup_table = self.lookups.get_table("lemma_lookup", {}) lookup_table = self.lookups.get_table("lemma_lookup", {})
if "lemma_rules" not in self.lookups: if "lemma_rules" not in self.lookups:
return [lookup_table.get(string, string)] return [lookup_table.get(string, string)]
if univ_pos in (NOUN, "NOUN", "noun"): if isinstance(univ_pos, int):
univ_pos = "noun" univ_pos = UPOS_NAMES.get(univ_pos, "X")
elif univ_pos in (VERB, "VERB", "verb"): univ_pos = univ_pos.lower()
univ_pos = "verb"
elif univ_pos in (ADJ, "ADJ", "adj"): if univ_pos in ("", "eol", "space"):
univ_pos = "adj"
elif univ_pos in (PUNCT, "PUNCT", "punct"):
univ_pos = "punct"
elif univ_pos in (PROPN, "PROPN"):
return [string]
else:
return [string.lower()] return [string.lower()]
# See Issue #435 for example of where this logic is required. # See Issue #435 for example of where this logic is required.
if self.is_base_form(univ_pos, morphology): if self.is_base_form(univ_pos, morphology):
@ -61,6 +56,11 @@ class Lemmatizer(object):
index_table = self.lookups.get_table("lemma_index", {}) index_table = self.lookups.get_table("lemma_index", {})
exc_table = self.lookups.get_table("lemma_exc", {}) exc_table = self.lookups.get_table("lemma_exc", {})
rules_table = self.lookups.get_table("lemma_rules", {}) rules_table = self.lookups.get_table("lemma_rules", {})
if not any((index_table.get(univ_pos), exc_table.get(univ_pos), rules_table.get(univ_pos))):
if univ_pos == "propn":
return [string]
else:
return [string.lower()]
lemmas = self.lemmatize( lemmas = self.lemmatize(
string, string,
index_table.get(univ_pos, {}), index_table.get(univ_pos, {}),
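Note on the `univ_pos` handling above: the chain of `elif` branches is replaced by a single normalisation step. Integer IDs are mapped to names through a table, and the result is lowercased so it can be used as a key into the lemma tables. The sketch below isolates that step. The ID-to-name table is a hypothetical stand-in with made-up integers, not spaCy's real symbol IDs.

```python
def normalize_pos(univ_pos, names):
    """Map an integer POS ID, or a string in any case, to a lowercase name."""
    if isinstance(univ_pos, int):
        # Unknown IDs fall back to "X", mirroring the default in the hunk.
        univ_pos = names.get(univ_pos, "X")
    return univ_pos.lower()


# Made-up IDs for illustration only.
NAMES = {1: "NOUN", 2: "VERB", 3: "PROPN"}

print(normalize_pos(1, NAMES), normalize_pos("VERB", NAMES), normalize_pos(99, NAMES))
```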

View File

@ -1,8 +1,8 @@
from .typedefs cimport attr_t, hash_t, flags_t, len_t, tag_t from .typedefs cimport attr_t, hash_t, flags_t, len_t, tag_t
from .attrs cimport attr_id_t from .attrs cimport attr_id_t
from .attrs cimport ID, ORTH, LOWER, NORM, SHAPE, PREFIX, SUFFIX, LENGTH, CLUSTER, LANG from .attrs cimport ID, ORTH, LOWER, NORM, SHAPE, PREFIX, SUFFIX, LENGTH, LANG
from .structs cimport LexemeC, SerializedLexemeC from .structs cimport LexemeC
from .strings cimport StringStore from .strings cimport StringStore
from .vocab cimport Vocab from .vocab cimport Vocab
@ -24,22 +24,6 @@ cdef class Lexeme:
self.vocab = vocab self.vocab = vocab
self.orth = lex.orth self.orth = lex.orth
@staticmethod
cdef inline SerializedLexemeC c_to_bytes(const LexemeC* lex) nogil:
cdef SerializedLexemeC lex_data
buff = <const unsigned char*>&lex.flags
end = <const unsigned char*>&lex.sentiment + sizeof(lex.sentiment)
for i in range(sizeof(lex_data.data)):
lex_data.data[i] = buff[i]
return lex_data
@staticmethod
cdef inline void c_from_bytes(LexemeC* lex, SerializedLexemeC lex_data) nogil:
buff = <unsigned char*>&lex.flags
end = <unsigned char*>&lex.sentiment + sizeof(lex.sentiment)
for i in range(sizeof(lex_data.data)):
buff[i] = lex_data.data[i]
@staticmethod @staticmethod
cdef inline void set_struct_attr(LexemeC* lex, attr_id_t name, attr_t value) nogil: cdef inline void set_struct_attr(LexemeC* lex, attr_id_t name, attr_t value) nogil:
if name < (sizeof(flags_t) * 8): if name < (sizeof(flags_t) * 8):
@ -56,8 +40,6 @@ cdef class Lexeme:
lex.prefix = value lex.prefix = value
elif name == SUFFIX: elif name == SUFFIX:
lex.suffix = value lex.suffix = value
elif name == CLUSTER:
lex.cluster = value
elif name == LANG: elif name == LANG:
lex.lang = value lex.lang = value
@ -84,8 +66,6 @@ cdef class Lexeme:
return lex.suffix return lex.suffix
elif feat_name == LENGTH: elif feat_name == LENGTH:
return lex.length return lex.length
elif feat_name == CLUSTER:
return lex.cluster
elif feat_name == LANG: elif feat_name == LANG:
return lex.lang return lex.lang
else: else:


@ -17,7 +17,7 @@ from .typedefs cimport attr_t, flags_t
from .attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE from .attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE
from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP
from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT
from .attrs cimport IS_CURRENCY, IS_OOV, PROB from .attrs cimport IS_CURRENCY
from .attrs import intify_attrs from .attrs import intify_attrs
from .errors import Errors, Warnings from .errors import Errors, Warnings
@ -89,11 +89,10 @@ cdef class Lexeme:
cdef attr_id_t attr cdef attr_id_t attr
attrs = intify_attrs(attrs) attrs = intify_attrs(attrs)
for attr, value in attrs.items(): for attr, value in attrs.items():
if attr == PROB: # skip PROB, e.g. from lexemes.jsonl
self.c.prob = value if isinstance(value, float):
elif attr == CLUSTER: continue
self.c.cluster = int(value) elif isinstance(value, (int, long)):
elif isinstance(value, int) or isinstance(value, long):
Lexeme.set_struct_attr(self.c, attr, value) Lexeme.set_struct_attr(self.c, attr, value)
else: else:
Lexeme.set_struct_attr(self.c, attr, self.vocab.strings.add(value)) Lexeme.set_struct_attr(self.c, attr, self.vocab.strings.add(value))
@ -137,34 +136,6 @@ cdef class Lexeme:
xp = get_array_module(vector) xp = get_array_module(vector)
return (xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm)) return (xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm))
def to_bytes(self):
lex_data = Lexeme.c_to_bytes(self.c)
start = <const char*>&self.c.flags
end = <const char*>&self.c.sentiment + sizeof(self.c.sentiment)
if (end-start) != sizeof(lex_data.data):
raise ValueError(Errors.E072.format(length=end-start,
bad_length=sizeof(lex_data.data)))
byte_string = b"\0" * sizeof(lex_data.data)
byte_chars = <char*>byte_string
for i in range(sizeof(lex_data.data)):
byte_chars[i] = lex_data.data[i]
if len(byte_string) != sizeof(lex_data.data):
raise ValueError(Errors.E072.format(length=len(byte_string),
bad_length=sizeof(lex_data.data)))
return byte_string
def from_bytes(self, bytes byte_string):
# This method doesn't really have a use-case --- wrote it for testing.
# Possibly delete? It puts the Lexeme out of synch with the vocab.
cdef SerializedLexemeC lex_data
if len(byte_string) != sizeof(lex_data.data):
raise ValueError(Errors.E072.format(length=len(byte_string),
bad_length=sizeof(lex_data.data)))
for i in range(len(byte_string)):
lex_data.data[i] = byte_string[i]
Lexeme.c_from_bytes(self.c, lex_data)
self.orth = self.c.orth
@property @property
def has_vector(self): def has_vector(self):
"""RETURNS (bool): Whether a word vector is associated with the object. """RETURNS (bool): Whether a word vector is associated with the object.
@ -208,10 +179,14 @@ cdef class Lexeme:
"""RETURNS (float): A scalar value indicating the positivity or """RETURNS (float): A scalar value indicating the positivity or
negativity of the lexeme.""" negativity of the lexeme."""
def __get__(self): def __get__(self):
return self.c.sentiment sentiment_table = self.vocab.lookups.get_table("lexeme_sentiment", {})
return sentiment_table.get(self.c.orth, 0.0)
def __set__(self, float sentiment): def __set__(self, float x):
self.c.sentiment = sentiment if "lexeme_sentiment" not in self.vocab.lookups:
self.vocab.lookups.add_table("lexeme_sentiment")
sentiment_table = self.vocab.lookups.get_table("lexeme_sentiment")
sentiment_table[self.c.orth] = x
@property @property
def orth_(self): def orth_(self):
@ -241,6 +216,10 @@ cdef class Lexeme:
return self.c.norm return self.c.norm
def __set__(self, attr_t x): def __set__(self, attr_t x):
if "lexeme_norm" not in self.vocab.lookups:
self.vocab.lookups.add_table("lexeme_norm")
norm_table = self.vocab.lookups.get_table("lexeme_norm")
norm_table[self.c.orth] = self.vocab.strings[x]
self.c.norm = x self.c.norm = x
property shape: property shape:
@ -276,10 +255,12 @@ cdef class Lexeme:
property cluster: property cluster:
"""RETURNS (int): Brown cluster ID.""" """RETURNS (int): Brown cluster ID."""
def __get__(self): def __get__(self):
return self.c.cluster cluster_table = self.vocab.load_extra_lookups("lexeme_cluster")
return cluster_table.get(self.c.orth, 0)
def __set__(self, attr_t x): def __set__(self, int x):
self.c.cluster = x cluster_table = self.vocab.load_extra_lookups("lexeme_cluster")
cluster_table[self.c.orth] = x
property lang: property lang:
"""RETURNS (uint64): Language of the parent vocabulary.""" """RETURNS (uint64): Language of the parent vocabulary."""
@ -293,10 +274,14 @@ cdef class Lexeme:
"""RETURNS (float): Smoothed log probability estimate of the lexeme's """RETURNS (float): Smoothed log probability estimate of the lexeme's
type.""" type."""
def __get__(self): def __get__(self):
return self.c.prob prob_table = self.vocab.load_extra_lookups("lexeme_prob")
settings_table = self.vocab.load_extra_lookups("lexeme_settings")
default_oov_prob = settings_table.get("oov_prob", -20.0)
return prob_table.get(self.c.orth, default_oov_prob)
def __set__(self, float x): def __set__(self, float x):
self.c.prob = x prob_table = self.vocab.load_extra_lookups("lexeme_prob")
prob_table[self.c.orth] = x
property lower_: property lower_:
"""RETURNS (unicode): Lowercase form of the word.""" """RETURNS (unicode): Lowercase form of the word."""
@ -314,7 +299,7 @@ cdef class Lexeme:
return self.vocab.strings[self.c.norm] return self.vocab.strings[self.c.norm]
def __set__(self, unicode x): def __set__(self, unicode x):
self.c.norm = self.vocab.strings.add(x) self.norm = self.vocab.strings.add(x)
property shape_: property shape_:
"""RETURNS (unicode): Transform of the word's string, to show """RETURNS (unicode): Transform of the word's string, to show
@ -362,13 +347,10 @@ cdef class Lexeme:
def __set__(self, flags_t x): def __set__(self, flags_t x):
self.c.flags = x self.c.flags = x
property is_oov: @property
def is_oov(self):
"""RETURNS (bool): Whether the lexeme is out-of-vocabulary.""" """RETURNS (bool): Whether the lexeme is out-of-vocabulary."""
def __get__(self): return self.orth not in self.vocab.vectors
return Lexeme.c_check_flag(self.c, IS_OOV)
def __set__(self, attr_t x):
Lexeme.c_set_flag(self.c, IS_OOV, x)
property is_stop: property is_stop:
"""RETURNS (bool): Whether the lexeme is a stop word.""" """RETURNS (bool): Whether the lexeme is a stop word."""

View File

@ -124,7 +124,7 @@ class Lookups(object):
self._tables[key].update(value) self._tables[key].update(value)
return self return self
def to_disk(self, path, **kwargs): def to_disk(self, path, filename="lookups.bin", **kwargs):
"""Save the lookups to a directory as lookups.bin. Expects a path to a """Save the lookups to a directory as lookups.bin. Expects a path to a
directory, which will be created if it doesn't exist. directory, which will be created if it doesn't exist.
@ -136,11 +136,11 @@ class Lookups(object):
path = ensure_path(path) path = ensure_path(path)
if not path.exists(): if not path.exists():
path.mkdir() path.mkdir()
filepath = path / "lookups.bin" filepath = path / filename
with filepath.open("wb") as file_: with filepath.open("wb") as file_:
file_.write(self.to_bytes()) file_.write(self.to_bytes())
def from_disk(self, path, **kwargs): def from_disk(self, path, filename="lookups.bin", **kwargs):
"""Load lookups from a directory containing a lookups.bin. Will skip """Load lookups from a directory containing a lookups.bin. Will skip
loading if the file doesn't exist. loading if the file doesn't exist.
@ -150,7 +150,7 @@ class Lookups(object):
DOCS: https://spacy.io/api/lookups#from_disk DOCS: https://spacy.io/api/lookups#from_disk
""" """
path = ensure_path(path) path = ensure_path(path)
filepath = path / "lookups.bin" filepath = path / filename
if filepath.exists(): if filepath.exists():
with filepath.open("rb") as file_: with filepath.open("rb") as file_:
data = file_.read() data = file_.read()
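Note on the `to_disk` / `from_disk` hunks above: the new `filename` argument lets several lookup files live in one directory while the default stays `lookups.bin`. Below is a small sketch of that behaviour. `TinyLookups` is a hypothetical class that serializes with JSON purely for illustration; it does not reproduce spaCy's `to_bytes` / `from_bytes` format.

```python
import json
from pathlib import Path


class TinyLookups:
    """Illustrative stand-in for a container of named lookup tables."""

    def __init__(self):
        self.tables = {}

    def to_disk(self, path, filename="lookups.bin"):
        path = Path(path)
        if not path.exists():
            path.mkdir()
        filepath = path / filename
        with filepath.open("wb") as file_:
            file_.write(json.dumps(self.tables).encode("utf8"))

    def from_disk(self, path, filename="lookups.bin"):
        filepath = Path(path) / filename
        # As in the diff, loading is skipped if the file does not exist.
        if filepath.exists():
            with filepath.open("rb") as file_:
                self.tables = json.loads(file_.read().decode("utf8"))
        return self
```

Calling `to_disk(path, filename="lookups_extra.bin")` and then `from_disk(path)` with the default name leaves the receiving object empty, which mirrors the skip-if-missing behaviour described in the docstring.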

View File

@ -213,28 +213,28 @@ cdef class Matcher:
else: else:
yield doc yield doc
def __call__(self, object doc_or_span): def __call__(self, object doclike):
"""Find all token sequences matching the supplied pattern. """Find all token sequences matching the supplied pattern.
doc_or_span (Doc or Span): The document to match over. doclike (Doc or Span): The document to match over.
RETURNS (list): A list of `(key, start, end)` tuples, RETURNS (list): A list of `(key, start, end)` tuples,
describing the matches. A match tuple describes a span describing the matches. A match tuple describes a span
`doc[start:end]`. The `label_id` and `key` are both integers. `doc[start:end]`. The `label_id` and `key` are both integers.
""" """
if isinstance(doc_or_span, Doc): if isinstance(doclike, Doc):
doc = doc_or_span doc = doclike
length = len(doc) length = len(doc)
elif isinstance(doc_or_span, Span): elif isinstance(doclike, Span):
doc = doc_or_span.doc doc = doclike.doc
length = doc_or_span.end - doc_or_span.start length = doclike.end - doclike.start
else: else:
raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doc_or_span).__name__)) raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))
if len(set([LEMMA, POS, TAG]) & self._seen_attrs) > 0 \ if len(set([LEMMA, POS, TAG]) & self._seen_attrs) > 0 \
and not doc.is_tagged: and not doc.is_tagged:
raise ValueError(Errors.E155.format()) raise ValueError(Errors.E155.format())
if DEP in self._seen_attrs and not doc.is_parsed: if DEP in self._seen_attrs and not doc.is_parsed:
raise ValueError(Errors.E156.format()) raise ValueError(Errors.E156.format())
matches = find_matches(&self.patterns[0], self.patterns.size(), doc_or_span, length, matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length,
extensions=self._extensions, predicates=self._extra_predicates) extensions=self._extensions, predicates=self._extra_predicates)
for i, (key, start, end) in enumerate(matches): for i, (key, start, end) in enumerate(matches):
on_match = self._callbacks.get(key, None) on_match = self._callbacks.get(key, None)
@ -257,7 +257,7 @@ def unpickle_matcher(vocab, patterns, callbacks):
return matcher return matcher
cdef find_matches(TokenPatternC** patterns, int n, object doc_or_span, int length, extensions=None, predicates=tuple()): cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple()):
"""Find matches in a doc, with a compiled array of patterns. Matches are """Find matches in a doc, with a compiled array of patterns. Matches are
returned as a list of (id, start, end) tuples. returned as a list of (id, start, end) tuples.
@ -286,7 +286,7 @@ cdef find_matches(TokenPatternC** patterns, int n, object doc_or_span, int lengt
else: else:
nr_extra_attr = 0 nr_extra_attr = 0
extra_attr_values = <attr_t*>mem.alloc(length, sizeof(attr_t)) extra_attr_values = <attr_t*>mem.alloc(length, sizeof(attr_t))
for i, token in enumerate(doc_or_span): for i, token in enumerate(doclike):
for name, index in extensions.items(): for name, index in extensions.items():
value = token._.get(name) value = token._.get(name)
if isinstance(value, basestring): if isinstance(value, basestring):
@ -298,7 +298,7 @@ cdef find_matches(TokenPatternC** patterns, int n, object doc_or_span, int lengt
for j in range(n): for j in range(n):
states.push_back(PatternStateC(patterns[j], i, 0)) states.push_back(PatternStateC(patterns[j], i, 0))
transition_states(states, matches, predicate_cache, transition_states(states, matches, predicate_cache,
doc_or_span[i], extra_attr_values, predicates) doclike[i], extra_attr_values, predicates)
extra_attr_values += nr_extra_attr extra_attr_values += nr_extra_attr
predicate_cache += len(predicates) predicate_cache += len(predicates)
# Handle matches that end in 0-width patterns # Handle matches that end in 0-width patterns

View File

@ -203,7 +203,7 @@ class Pipe(object):
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg) serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
serialize["vocab"] = lambda p: self.vocab.to_disk(p) serialize["vocab"] = lambda p: self.vocab.to_disk(p)
if self.model not in (None, True, False): if self.model not in (None, True, False):
serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes()) serialize["model"] = lambda p: self.model.to_disk(p)
exclude = util.get_serialization_exclude(serialize, exclude, kwargs) exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
util.to_disk(path, serialize, exclude) util.to_disk(path, serialize, exclude)
@ -626,7 +626,7 @@ class Tagger(Pipe):
serialize = OrderedDict(( serialize = OrderedDict((
("vocab", lambda p: self.vocab.to_disk(p)), ("vocab", lambda p: self.vocab.to_disk(p)),
("tag_map", lambda p: srsly.write_msgpack(p, tag_map)), ("tag_map", lambda p: srsly.write_msgpack(p, tag_map)),
("model", lambda p: p.open("wb").write(self.model.to_bytes())), ("model", lambda p: self.model.to_disk(p)),
("cfg", lambda p: srsly.write_json(p, self.cfg)) ("cfg", lambda p: srsly.write_json(p, self.cfg))
)) ))
exclude = util.get_serialization_exclude(serialize, exclude, kwargs) exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
@ -1395,7 +1395,7 @@ class EntityLinker(Pipe):
serialize["vocab"] = lambda p: self.vocab.to_disk(p) serialize["vocab"] = lambda p: self.vocab.to_disk(p)
serialize["kb"] = lambda p: self.kb.dump(p) serialize["kb"] = lambda p: self.kb.dump(p)
if self.model not in (None, True, False): if self.model not in (None, True, False):
serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes()) serialize["model"] = lambda p: self.model.to_disk(p)
exclude = util.get_serialization_exclude(serialize, exclude, kwargs) exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
util.to_disk(path, serialize, exclude) util.to_disk(path, serialize, exclude)

View File

@ -23,29 +23,6 @@ cdef struct LexemeC:
attr_t prefix attr_t prefix
attr_t suffix attr_t suffix
attr_t cluster
float prob
float sentiment
cdef struct SerializedLexemeC:
unsigned char[8 + 8*10 + 4 + 4] data
# sizeof(flags_t) # flags
# + sizeof(attr_t) # lang
# + sizeof(attr_t) # id
# + sizeof(attr_t) # length
# + sizeof(attr_t) # orth
# + sizeof(attr_t) # lower
# + sizeof(attr_t) # norm
# + sizeof(attr_t) # shape
# + sizeof(attr_t) # prefix
# + sizeof(attr_t) # suffix
# + sizeof(attr_t) # cluster
# + sizeof(float) # prob
# + sizeof(float) # cluster
# + sizeof(float) # l2_norm
cdef struct SpanC: cdef struct SpanC:
hash_t id hash_t id

View File

@ -12,7 +12,7 @@ cdef enum symbol_t:
LIKE_NUM LIKE_NUM
LIKE_EMAIL LIKE_EMAIL
IS_STOP IS_STOP
IS_OOV IS_OOV_DEPRECATED
IS_BRACKET IS_BRACKET
IS_QUOTE IS_QUOTE
IS_LEFT_PUNCT IS_LEFT_PUNCT

View File

@ -17,7 +17,7 @@ IDS = {
"LIKE_NUM": LIKE_NUM, "LIKE_NUM": LIKE_NUM,
"LIKE_EMAIL": LIKE_EMAIL, "LIKE_EMAIL": LIKE_EMAIL,
"IS_STOP": IS_STOP, "IS_STOP": IS_STOP,
"IS_OOV": IS_OOV, "IS_OOV_DEPRECATED": IS_OOV_DEPRECATED,
"IS_BRACKET": IS_BRACKET, "IS_BRACKET": IS_BRACKET,
"IS_QUOTE": IS_QUOTE, "IS_QUOTE": IS_QUOTE,
"IS_LEFT_PUNCT": IS_LEFT_PUNCT, "IS_LEFT_PUNCT": IS_LEFT_PUNCT,

View File

@ -9,7 +9,6 @@ import numpy
cimport cython.parallel cimport cython.parallel
import numpy.random import numpy.random
cimport numpy as np cimport numpy as np
from itertools import islice
from cpython.ref cimport PyObject, Py_XDECREF from cpython.ref cimport PyObject, Py_XDECREF
from cpython.exc cimport PyErr_CheckSignals, PyErr_SetFromErrno from cpython.exc cimport PyErr_CheckSignals, PyErr_SetFromErrno
from libc.math cimport exp from libc.math cimport exp
@ -621,15 +620,15 @@ cdef class Parser:
self.model, cfg = self.Model(self.moves.n_moves, **cfg) self.model, cfg = self.Model(self.moves.n_moves, **cfg)
if sgd is None: if sgd is None:
sgd = self.create_optimizer() sgd = self.create_optimizer()
doc_sample = [] docs = []
gold_sample = [] golds = []
for raw_text, annots_brackets in islice(get_gold_tuples(), 1000): for raw_text, annots_brackets in get_gold_tuples():
for annots, brackets in annots_brackets: for annots, brackets in annots_brackets:
ids, words, tags, heads, deps, ents = annots ids, words, tags, heads, deps, ents = annots
doc_sample.append(Doc(self.vocab, words=words)) docs.append(Doc(self.vocab, words=words))
gold_sample.append(GoldParse(doc_sample[-1], words=words, tags=tags, golds.append(GoldParse(docs[-1], words=words, tags=tags,
heads=heads, deps=deps, entities=ents)) heads=heads, deps=deps, entities=ents))
self.model.begin_training(doc_sample, gold_sample) self.model.begin_training(docs, golds)
if pipeline is not None: if pipeline is not None:
self.init_multitask_objectives(get_gold_tuples, pipeline, sgd=sgd, **cfg) self.init_multitask_objectives(get_gold_tuples, pipeline, sgd=sgd, **cfg)
link_vectors_to_models(self.vocab) link_vectors_to_models(self.vocab)

View File

@ -88,6 +88,11 @@ def eu_tokenizer():
return get_lang_class("eu").Defaults.create_tokenizer() return get_lang_class("eu").Defaults.create_tokenizer()
@pytest.fixture(scope="session")
def fa_tokenizer():
return get_lang_class("fa").Defaults.create_tokenizer()
@pytest.fixture(scope="session") @pytest.fixture(scope="session")
def fi_tokenizer(): def fi_tokenizer():
return get_lang_class("fi").Defaults.create_tokenizer() return get_lang_class("fi").Defaults.create_tokenizer()
@ -107,6 +112,7 @@ def ga_tokenizer():
def gu_tokenizer(): def gu_tokenizer():
return get_lang_class("gu").Defaults.create_tokenizer() return get_lang_class("gu").Defaults.create_tokenizer()
@pytest.fixture(scope="session") @pytest.fixture(scope="session")
def he_tokenizer(): def he_tokenizer():
return get_lang_class("he").Defaults.create_tokenizer() return get_lang_class("he").Defaults.create_tokenizer()
@ -241,7 +247,9 @@ def yo_tokenizer():
@pytest.fixture(scope="session") @pytest.fixture(scope="session")
def zh_tokenizer_char(): def zh_tokenizer_char():
return get_lang_class("zh").Defaults.create_tokenizer(config={"use_jieba": False, "use_pkuseg": False}) return get_lang_class("zh").Defaults.create_tokenizer(
config={"use_jieba": False, "use_pkuseg": False}
)
@pytest.fixture(scope="session") @pytest.fixture(scope="session")
@ -253,7 +261,9 @@ def zh_tokenizer_jieba():
@pytest.fixture(scope="session") @pytest.fixture(scope="session")
def zh_tokenizer_pkuseg(): def zh_tokenizer_pkuseg():
pytest.importorskip("pkuseg") pytest.importorskip("pkuseg")
return get_lang_class("zh").Defaults.create_tokenizer(config={"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}) return get_lang_class("zh").Defaults.create_tokenizer(
config={"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}
)
@pytest.fixture(scope="session") @pytest.fixture(scope="session")

View File

@ -50,7 +50,9 @@ def test_create_from_words_and_text(vocab):
assert [t.text for t in doc] == [" ", "'", "dogs", "'", "\n\n", "run", " "] assert [t.text for t in doc] == [" ", "'", "dogs", "'", "\n\n", "run", " "]
assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""] assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""]
assert doc.text == text assert doc.text == text
assert [t.text for t in doc if not t.text.isspace()] == [word for word in words if not word.isspace()] assert [t.text for t in doc if not t.text.isspace()] == [
word for word in words if not word.isspace()
]
# partial whitespace in words # partial whitespace in words
words = [" ", "'", "dogs", "'", "\n\n", "run", " "] words = [" ", "'", "dogs", "'", "\n\n", "run", " "]
@ -60,7 +62,9 @@ def test_create_from_words_and_text(vocab):
assert [t.text for t in doc] == [" ", "'", "dogs", "'", "\n\n", "run", " "] assert [t.text for t in doc] == [" ", "'", "dogs", "'", "\n\n", "run", " "]
assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""] assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""]
assert doc.text == text assert doc.text == text
assert [t.text for t in doc if not t.text.isspace()] == [word for word in words if not word.isspace()] assert [t.text for t in doc if not t.text.isspace()] == [
word for word in words if not word.isspace()
]
# non-standard whitespace tokens # non-standard whitespace tokens
words = [" ", " ", "'", "dogs", "'", "\n\n", "run"] words = [" ", " ", "'", "dogs", "'", "\n\n", "run"]
@ -70,7 +74,9 @@ def test_create_from_words_and_text(vocab):
assert [t.text for t in doc] == [" ", "'", "dogs", "'", "\n\n", "run", " "] assert [t.text for t in doc] == [" ", "'", "dogs", "'", "\n\n", "run", " "]
assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""] assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""]
assert doc.text == text assert doc.text == text
assert [t.text for t in doc if not t.text.isspace()] == [word for word in words if not word.isspace()] assert [t.text for t in doc if not t.text.isspace()] == [
word for word in words if not word.isspace()
]
# mismatch between words and text # mismatch between words and text
with pytest.raises(ValueError): with pytest.raises(ValueError):

View File

@ -181,6 +181,7 @@ def test_is_sent_start(en_tokenizer):
doc.is_parsed = True doc.is_parsed = True
assert len(list(doc.sents)) == 2 assert len(list(doc.sents)) == 2
def test_is_sent_end(en_tokenizer): def test_is_sent_end(en_tokenizer):
doc = en_tokenizer("This is a sentence. This is another.") doc = en_tokenizer("This is a sentence. This is another.")
assert doc[4].is_sent_end is None assert doc[4].is_sent_end is None
@ -213,6 +214,7 @@ def test_token0_has_sent_start_true():
assert doc[1].is_sent_start is None assert doc[1].is_sent_start is None
assert not doc.is_sentenced assert not doc.is_sentenced
def test_tokenlast_has_sent_end_true(): def test_tokenlast_has_sent_end_true():
doc = Doc(Vocab(), words=["hello", "world"]) doc = Doc(Vocab(), words=["hello", "world"])
assert doc[0].is_sent_end is None assert doc[0].is_sent_end is None

View File

@ -37,14 +37,6 @@ def test_da_tokenizer_handles_custom_base_exc(da_tokenizer):
assert tokens[7].text == "." assert tokens[7].text == "."
@pytest.mark.parametrize(
"text,norm", [("akvarium", "akvarie"), ("bedstemoder", "bedstemor")]
)
def test_da_tokenizer_norm_exceptions(da_tokenizer, text, norm):
tokens = da_tokenizer(text)
assert tokens[0].norm_ == norm
@pytest.mark.parametrize( @pytest.mark.parametrize(
"text,n_tokens", "text,n_tokens",
[ [

View File

@ -22,17 +22,3 @@ def test_de_tokenizer_handles_exc_in_text(de_tokenizer):
assert len(tokens) == 6 assert len(tokens) == 6
assert tokens[2].text == "z.Zt." assert tokens[2].text == "z.Zt."
assert tokens[2].lemma_ == "zur Zeit" assert tokens[2].lemma_ == "zur Zeit"
@pytest.mark.parametrize(
"text,norms", [("vor'm", ["vor", "dem"]), ("du's", ["du", "es"])]
)
def test_de_tokenizer_norm_exceptions(de_tokenizer, text, norms):
tokens = de_tokenizer(text)
assert [token.norm_ for token in tokens] == norms
@pytest.mark.parametrize("text,norm", [("daß", "dass")])
def test_de_lex_attrs_norm_exceptions(de_tokenizer, text, norm):
tokens = de_tokenizer(text)
assert tokens[0].norm_ == norm

View File

@ -0,0 +1,16 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
def test_noun_chunks_is_parsed_de(de_tokenizer):
"""Test that noun_chunks raises ValueError for 'de' language if Doc is not parsed.
To check this test, we're constructing a Doc
with a new Vocab here and forcing is_parsed to 'False'
to make sure the noun chunks don't run.
"""
doc = de_tokenizer("Er lag auf seinem")
doc.is_parsed = False
with pytest.raises(ValueError):
list(doc.noun_chunks)

View File

@ -0,0 +1,16 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
def test_noun_chunks_is_parsed_el(el_tokenizer):
"""Test that noun_chunks raises ValueError for 'el' language if Doc is not parsed.
To check this test, we're constructing a Doc
with a new Vocab here and forcing is_parsed to 'False'
to make sure the noun chunks don't run.
"""
doc = el_tokenizer("είναι χώρα της νοτιοανατολικής")
doc.is_parsed = False
with pytest.raises(ValueError):
list(doc.noun_chunks)

View File

@ -118,6 +118,7 @@ def test_en_tokenizer_norm_exceptions(en_tokenizer, text, norms):
assert [token.norm_ for token in tokens] == norms assert [token.norm_ for token in tokens] == norms
@pytest.mark.skip
@pytest.mark.parametrize( @pytest.mark.parametrize(
"text,norm", [("radicalised", "radicalized"), ("cuz", "because")] "text,norm", [("radicalised", "radicalized"), ("cuz", "because")]
) )

View File

@ -6,9 +6,24 @@ from spacy.attrs import HEAD, DEP
from spacy.symbols import nsubj, dobj, amod, nmod, conj, cc, root from spacy.symbols import nsubj, dobj, amod, nmod, conj, cc, root
from spacy.lang.en.syntax_iterators import SYNTAX_ITERATORS from spacy.lang.en.syntax_iterators import SYNTAX_ITERATORS
import pytest
from ...util import get_doc from ...util import get_doc
def test_noun_chunks_is_parsed(en_tokenizer):
"""Test that noun_chunks raises ValueError for 'en' language if Doc is not parsed.
To check this test, we're constructing a Doc
with a new Vocab here and forcing is_parsed to 'False'
to make sure the noun chunks don't run.
"""
doc = en_tokenizer("This is a sentence")
doc.is_parsed = False
with pytest.raises(ValueError):
list(doc.noun_chunks)
def test_en_noun_chunks_not_nested(en_vocab): def test_en_noun_chunks_not_nested(en_vocab):
words = ["Peter", "has", "chronic", "command", "and", "control", "issues"] words = ["Peter", "has", "chronic", "command", "and", "control", "issues"]
heads = [1, 0, 4, 3, -1, -2, -5] heads = [1, 0, 4, 3, -1, -2, -5]

View File

@ -0,0 +1,16 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
def test_noun_chunks_is_parsed_es(es_tokenizer):
"""Test that noun_chunks raises ValueError for 'es' language if Doc is not parsed.
To check this test, we're constructing a Doc
with a new Vocab here and forcing is_parsed to 'False'
to make sure the noun chunks don't run.
"""
doc = es_tokenizer("en Oxford este verano")
doc.is_parsed = False
with pytest.raises(ValueError):
list(doc.noun_chunks)

View File


@ -0,0 +1,17 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
def test_noun_chunks_is_parsed_fa(fa_tokenizer):
"""Test that noun_chunks raises ValueError for 'fa' language if Doc is not parsed.
To check this test, we're constructing a Doc
with a new Vocab here and forcing is_parsed to 'False'
to make sure the noun chunks don't run.
"""
doc = fa_tokenizer("این یک جمله نمونه می باشد.")
doc.is_parsed = False
with pytest.raises(ValueError):
list(doc.noun_chunks)

View File

@ -0,0 +1,16 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
def test_noun_chunks_is_parsed_fr(fr_tokenizer):
"""Test that noun_chunks raises ValueError for 'fr' language if Doc is not parsed.
To check this test, we're constructing a Doc
with a new Vocab here and forcing is_parsed to 'False'
to make sure the noun chunks don't run.
"""
doc = fr_tokenizer("trouver des travaux antérieurs")
doc.is_parsed = False
with pytest.raises(ValueError):
list(doc.noun_chunks)

View File

@ -3,17 +3,16 @@ from __future__ import unicode_literals
import pytest import pytest
def test_gu_tokenizer_handlers_long_text(gu_tokenizer): def test_gu_tokenizer_handlers_long_text(gu_tokenizer):
text = """પશ્ચિમ ભારતમાં આવેલું ગુજરાત રાજ્ય જે વ્યક્તિઓની માતૃભૂમિ છે""" text = """પશ્ચિમ ભારતમાં આવેલું ગુજરાત રાજ્ય જે વ્યક્તિઓની માતૃભૂમિ છે"""
tokens = gu_tokenizer(text) tokens = gu_tokenizer(text)
assert len(tokens) == 9 assert len(tokens) == 9
@pytest.mark.parametrize( @pytest.mark.parametrize(
"text,length", "text,length",
[ [("ગુજરાતીઓ ખાવાના શોખીન માનવામાં આવે છે", 6), ("ખેતરની ખેડ કરવામાં આવે છે.", 5)],
("ગુજરાતીઓ ખાવાના શોખીન માનવામાં આવે છે", 6),
("ખેતરની ખેડ કરવામાં આવે છે.", 5),
],
) )
def test_gu_tokenizer_handles_cnts(gu_tokenizer, text, length): def test_gu_tokenizer_handles_cnts(gu_tokenizer, text, length):
tokens = gu_tokenizer(text) tokens = gu_tokenizer(text)
