Merge branch 'master' into spacy.io

This commit is contained in:
Ines Montani 2021-01-14 11:38:24 +11:00
commit 30cb6ced1e
35 changed files with 1472 additions and 55 deletions

106
.github/contributors/AMArostegui.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Antonio Miras |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 11/01/2020 |
| GitHub username | AMArostegui |
| Website (optional) | |

106
.github/contributors/alexcombessie.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Alex COMBESSIE |
| Company name (if applicable) | Dataiku |
| Title or role (if applicable) | R&D Engineer |
| Date | 2020-10-27 |
| GitHub username | alexcombessie |
| Website (optional) | |

107
.github/contributors/cristianasp.md vendored Normal file
View File

@ -0,0 +1,107 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Cristiana S Parada |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-11-04 |
| GitHub username | cristianasp |
| Website (optional) | |

106
.github/contributors/lorenanda.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Lorena Ciutacu |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-12-23 |
| GitHub username | lorenanda |
| Website (optional) | lorenaciutacu.com/ |

106
.github/contributors/ophelielacroix.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|-------------------------------|-----------------|
| Name | Ophélie Lacroix |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | |
| GitHub username | ophelielacroix |
| Website (optional) | |

106
.github/contributors/yosiasz.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Josiah Solomon |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-12-15 |
| GitHub username | yosiasz |
| Website (optional) | |

34
spacy/lang/am/__init__.py Normal file
View File

@ -0,0 +1,34 @@
# coding: utf8
from __future__ import unicode_literals
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_SUFFIXES
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups
class AmharicDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: "am"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
suffixes = TOKENIZER_SUFFIXES
writing_system = {"direction": "ltr", "has_case": False, "has_letters": True}
class Amharic(Language):
lang = "am"
Defaults = AmharicDefaults
__all__ = ["Amharic"]

22
spacy/lang/am/examples.py Normal file
View File

@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.am.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"አፕል የዩኬን ጅምር ድርጅት በ 1 ቢሊዮን ዶላር ለመግዛት አስቧል።",
"የራስ ገዝ መኪኖች የኢንሹራንስ ኃላፊነትን ወደ አምራቾች ያዛውራሉ",
"ሳን ፍራንሲስኮ የእግረኛ መንገድ አቅርቦት ሮቦቶችን ማገድን ይመለከታል",
"ለንደን በእንግሊዝ የምትገኝ ትልቅ ከተማ ናት።",
"የት ነህ?",
"የፈረንሳይ ፕሬዝዳንት ማናቸው?",
"የአሜሪካ ዋና ከተማ ምንድነው?",
"ባራክ ኦባማ መቼ ተወለደ?",
]

104
spacy/lang/am/lex_attrs.py Normal file
View File

@ -0,0 +1,104 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = [
"ዜሮ",
"አንድ",
"ሁለት",
"ሶስት",
"አራት",
"አምስት",
"ስድስት",
"ሰባት",
"ስምት",
"ዘጠኝ",
"አስር",
"አስራ አንድ",
"አስራ ሁለት",
"አስራ ሶስት",
"አስራ አራት",
"አስራ አምስት",
"አስራ ስድስት",
"አስራ ሰባት",
"አስራ ስምንት",
"አስራ ዘጠኝ",
"ሃያ",
"ሰላሳ",
"አርባ",
"ሃምሳ",
"ስልሳ",
"ሰባ",
"ሰማንያ",
"ዘጠና",
"መቶ",
"ሺህ",
"ሚሊዮን",
"ቢሊዮን",
"ትሪሊዮን",
"ኳድሪሊዮን",
"ገጅሊዮን",
"ባዝሊዮን"
]
_ordinal_words = [
"አንደኛ",
"ሁለተኛ",
"ሶስተኛ",
"አራተኛ",
"አምስተኛ",
"ስድስተኛ",
"ሰባተኛ",
"ስምንተኛ",
"ዘጠነኛ",
"አስረኛ",
"አስራ አንደኛ",
"አስራ ሁለተኛ",
"አስራ ሶስተኛ",
"አስራ አራተኛ",
"አስራ አምስተኛ",
"አስራ ስድስተኛ",
"አስራ ሰባተኛ",
"አስራ ስምንተኛ",
"አስራ ዘጠነኛ",
"ሃያኛ",
"ሰላሳኛ"
"አርባኛ",
"አምሳኛ",
"ስድሳኛ",
"ሰባኛ",
"ሰማንያኛ",
"ዘጠናኛ",
"መቶኛ",
"ሺኛ",
"ሚሊዮንኛ",
"ቢሊዮንኛ",
"ትሪሊዮንኛ"
]
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
text_lower = text.lower()
if text_lower in _num_words:
return True
# Check ordinal number
if text_lower in _ordinal_words:
return True
if text_lower.endswith(""):
if text_lower[:-2].isdigit():
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}

View File

@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
from ..char_classes import UNITS, ALPHA_UPPER
_list_punct = LIST_PUNCT + "፡ ። ፣ ፤ ፥ ፦ ፧".strip().split()
_suffixes = (
_list_punct
+ LIST_ELLIPSES
+ LIST_QUOTES
+ [
r"(?<=[0-9])\+",
# Amharic is written from Left-To-Right
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
r"(?<=[0-9])(?:{u})".format(u=UNITS),
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
]
)
TOKENIZER_SUFFIXES = _suffixes

View File

@ -0,0 +1,10 @@
# coding: utf8
from __future__ import unicode_literals
# Stop words
STOP_WORDS = set(
"""
ግን አንቺ አንተ እናንተ ያንተ ያንቺ የናንተ ራስህን ራስሽን ራሳችሁን
""".split()
)

View File

@ -0,0 +1,25 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import ORTH, LEMMA, NORM, PRON_LEMMA
_exc = {}
for exc_data in [
{ORTH: "ት/ቤት", LEMMA: "ትምህርት ቤት"},
{ORTH: "ወ/ሮ", LEMMA: PRON_LEMMA, NORM: "ወይዘሮ"},
]:
_exc[exc_data[ORTH]] = [exc_data]
for orth in [
"ዓ.ም.",
"ኪ.ሜ.",
]:
_exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = _exc

View File

@ -5,6 +5,8 @@ split_chars = lambda char: list(char.strip().split(" "))
merge_chars = lambda char: char.strip().replace(" ", "|") merge_chars = lambda char: char.strip().replace(" ", "|")
group_chars = lambda char: char.strip().replace(" ", "") group_chars = lambda char: char.strip().replace(" ", "")
_ethiopic = r"\u1200-\u137F"
_bengali = r"\u0980-\u09FF" _bengali = r"\u0980-\u09FF"
_hebrew = r"\u0591-\u05F4\uFB1D-\uFB4F" _hebrew = r"\u0591-\u05F4\uFB1D-\uFB4F"
@ -221,7 +223,8 @@ _upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian
_lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower + _macedonian_lower _lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower + _macedonian_lower
_uncased = ( _uncased = (
_bengali _ethiopic
+ _bengali
+ _hebrew + _hebrew
+ _persian + _persian
+ _sinhala + _sinhala

View File

@ -7,6 +7,7 @@ from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
from .morph_rules import MORPH_RULES from .morph_rules import MORPH_RULES
from ..tag_map import TAG_MAP from ..tag_map import TAG_MAP
from .syntax_iterators import SYNTAX_ITERATORS
from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language from ...language import Language
@ -24,6 +25,7 @@ class DanishDefaults(Language.Defaults):
suffixes = TOKENIZER_SUFFIXES suffixes = TOKENIZER_SUFFIXES
tag_map = TAG_MAP tag_map = TAG_MAP
stop_words = STOP_WORDS stop_words = STOP_WORDS
syntax_iterators = SYNTAX_ITERATORS
class Danish(Language): class Danish(Language):

View File

@ -0,0 +1,81 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import NOUN, PROPN, PRON, VERB, AUX
from ...errors import Errors
def noun_chunks(doclike):
def is_verb_token(tok):
return tok.pos in [VERB, AUX]
def next_token(tok):
try:
return tok.nbor()
except IndexError:
return None
def get_left_bound(doc, root):
left_bound = root
for tok in reversed(list(root.lefts)):
if tok.dep in np_left_deps:
left_bound = tok
return left_bound
def get_right_bound(doc, root):
right_bound = root
for tok in root.rights:
if tok.dep in np_right_deps:
right = get_right_bound(doc, tok)
if list(
filter(
lambda t: is_verb_token(t) or t.dep in stop_deps,
doc[root.i : right.i],
)
):
break
else:
right_bound = right
return right_bound
def get_bounds(doc, root):
return get_left_bound(doc, root), get_right_bound(doc, root)
doc = doclike.doc
if not doc.is_parsed:
raise ValueError(Errors.E029)
if not len(doc):
return
left_labels = [
"det",
"fixed",
"nmod:poss",
"amod",
"flat",
"goeswith",
"nummod",
"appos",
]
right_labels = ["fixed", "nmod:poss", "amod", "flat", "goeswith", "nummod", "appos"]
stop_labels = ["punct"]
np_label = doc.vocab.strings.add("NP")
np_left_deps = [doc.vocab.strings.add(label) for label in left_labels]
np_right_deps = [doc.vocab.strings.add(label) for label in right_labels]
stop_deps = [doc.vocab.strings.add(label) for label in stop_labels]
chunks = []
prev_right = -1
for token in doclike:
if token.pos in [PROPN, NOUN, PRON]:
left, right = get_bounds(doc, token)
if left.i <= prev_right:
continue
yield left.i, right.i + 1, np_label
prev_right = right.i
SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}

View File

@ -319,7 +319,6 @@ for exc_data in [
# Other contractions with leading apostrophe # Other contractions with leading apostrophe
for exc_data in [ for exc_data in [
{ORTH: "cause", NORM: "because"},
{ORTH: "em", LEMMA: PRON_LEMMA, NORM: "them"}, {ORTH: "em", LEMMA: PRON_LEMMA, NORM: "them"},
{ORTH: "ll", LEMMA: "will", NORM: "will"}, {ORTH: "ll", LEMMA: "will", NORM: "will"},
{ORTH: "nuff", LEMMA: "enough", NORM: "enough"}, {ORTH: "nuff", LEMMA: "enough", NORM: "enough"},

View File

@ -4,87 +4,81 @@ from __future__ import unicode_literals
STOP_WORDS = set( STOP_WORDS = set(
""" """
a à â abord absolument afin ah ai aie ailleurs ainsi ait allaient allo allons a à â abord afin ah ai aie ainsi ait allaient allons
allô alors anterieur anterieure anterieures apres après as assez attendu au alors anterieur anterieure anterieures apres après as assez attendu au
aucun aucune aujourd aujourd'hui aupres auquel aura auraient aurait auront aucun aucune aujourd aujourd'hui aupres auquel aura auraient aurait auront
aussi autre autrefois autrement autres autrui aux auxquelles auxquels avaient aussi autre autrement autres autrui aux auxquelles auxquels avaient
avais avait avant avec avoir avons ayant avais avait avant avec avoir avons ayant
bah bas basee bat beau beaucoup bien bigre boum bravo brrr bas basee bat
c' c ça car ce ceci cela celle celle-ci celle-là celles celles-ci celles-là celui c' c ça car ce ceci cela celle celle-ci celle-là celles celles-ci celles-là celui
celui-ci celui- cent cependant certain certaine certaines certains certes ces celui-ci celui- cent cependant certain certaine certaines certains certes ces
cet cette ceux ceux-ci ceux- chacun chacune chaque cher chers chez chiche cet cette ceux ceux-ci ceux- chacun chacune chaque chez ci cinq cinquantaine cinquante
chut chère chères ci cinq cinquantaine cinquante cinquantième cinquième clac cinquantième cinquième combien comme comment compris concernant
clic combien comme comment comparable comparables compris concernant contre
couic crac
d' d da dans de debout dedans dehors deja delà depuis dernier derniere derriere d' d da dans de debout dedans dehors deja delà depuis derriere
derrière des desormais desquelles desquels dessous dessus deux deuxième derrière des desormais desquelles desquels dessous dessus deux deuxième
deuxièmement devant devers devra different differentes differents différent deuxièmement devant devers devra different differentes differents différent
différente différentes différents dire directe directement dit dite dits divers différente différentes différents dire directe directement dit dite dits divers
diverse diverses dix dix-huit dix-neuf dix-sept dixième doit doivent donc dont diverse diverses dix dix-huit dix-neuf dix-sept dixième doit doivent donc dont
douze douzième dring du duquel durant dès désormais douze douzième du duquel durant dès désormais
effet egale egalement egales eh elle elle-même elles elles-mêmes en encore effet egale egalement egales eh elle elle-même elles elles-mêmes en encore
enfin entre envers environ es ès est et etaient étaient etais étais etait était enfin entre envers environ es ès est et etaient étaient etais étais etait était
etant étant etc été etre être eu euh eux eux-mêmes exactement excepté extenso etant étant etc été etre être eu eux eux-mêmes exactement excepté
exterieur
fais faisaient faisant fait façon feront fi flac floc font fais faisaient faisant fait façon feront font
gens gens
ha hein hem hep hi ho holà hop hormis hors hou houp hue hui huit huitième hum ha hem hep hi ho hormis hors hou houp hue hui huit huitième
hurrah hélas i il ils importe i il ils importe
j' j je jusqu jusque juste j' j je jusqu jusque juste
l' l la laisser laquelle las le lequel les lesquelles lesquels leur leurs longtemps l' l la laisser laquelle le lequel les lesquelles lesquels leur leurs longtemps
lors lorsque lui lui-meme lui-même lès lors lorsque lui lui-meme lui-même lès
m' m ma maint maintenant mais malgre malgré maximale me meme memes merci mes mien m' m ma maint maintenant mais malgre me meme memes merci mes mien
mienne miennes miens mille mince minimale moi moi-meme moi-même moindres moins mienne miennes miens mille moi moi-meme moi-même moindres moins
mon moyennant même mêmes mon même mêmes
n' n na naturel naturelle naturelles ne neanmoins necessaire necessairement neuf n' n na ne neanmoins neuvième ni nombreuses nombreux nos notamment
neuvième ni nombreuses nombreux non nos notamment notre nous nous-mêmes nouveau notre nous nous-mêmes nouvea nul néanmoins nôtre nôtres
nul néanmoins nôtre nôtres
o ô oh ohé ollé olé on ont onze onzième ore ou ouf ouias oust ouste outre o ô on ont onze onzième ore ou ouias oust outre
ouvert ouverte ouverts ouvert ouverte ouverts
paf pan par parce parfois parle parlent parler parmi parseme partant par parce parfois parle parlent parler parmi parseme partant
particulier particulière particulièrement pas passé pendant pense permet pas pendant pense permet personne peu peut peuvent peux plus
personne peu peut peuvent peux pff pfft pfut pif pire plein plouf plus plusieurs plutôt possible possibles pour pourquoi
plusieurs plutôt possessif possessifs possible possibles pouah pour pourquoi
pourrais pourrait pouvait prealable precisement premier première premièrement pourrais pourrait pouvait prealable precisement premier première premièrement
pres probable probante procedant proche près psitt pu puis puisque pur pure pres procedant proche près pu puis puisque
qu' qu quand quant quant-à-soi quanta quarante quatorze quatre quatre-vingt qu' qu quand quant quant-à-soi quanta quarante quatorze quatre quatre-vingt
quatrième quatrièmement que quel quelconque quelle quelles quelqu'un quelque quatrième quatrièmement que quel quelconque quelle quelles quelqu'un quelque
quelques quels qui quiconque quinze quoi quoique quelques quels qui quiconque quinze quoi quoique
rare rarement rares relative relativement remarquable rend rendre restant reste relative relativement rend rendre restant reste
restent restrictif retour revoici revoilà rien restent retour revoici revoilà
s' s sa sacrebleu sait sans sapristi sauf se sein seize selon semblable semblaient s' s sa sait sans sauf se seize selon semblable semblaient
semble semblent sent sept septième sera seraient serait seront ses seul seule semble semblent sent sept septième sera seraient serait seront ses seul seule
seulement si sien sienne siennes siens sinon six sixième soi soi-même soit seulement si sien sienne siennes siens sinon six sixième soi soi-même soit
soixante son sont sous souvent specifique specifiques speculatif stop soixante son sont sous souvent specifique specifiques stop
strictement subtiles suffisant suffisante suffit suis suit suivant suivante suffisant suffisante suffit suis suit suivant suivante
suivantes suivants suivre superpose sur surtout suivantes suivants suivre sur surtout
t' t ta tac tant tardive te tel telle tellement telles tels tenant tend tenir tente t' t ta tant te tel telle tellement telles tels tenant tend tenir tente
tes tic tien tienne tiennes tiens toc toi toi-même ton touchant toujours tous tes tien tienne tiennes tiens toi toi-même ton touchant toujours tous
tout toute toutefois toutes treize trente tres trois troisième troisièmement tout toute toutes treize trente tres trois troisième troisièmement
trop très tsoin tsouin tu tu
un une unes uniformement unique uniques uns un une unes uns
va vais vas vers via vif vifs vingt vivat vive vives vlan voici voilà vont vos va vais vas vers via vingt voici voilà vont vos
votre vous vous-mêmes vu vôtre vôtres votre vous vous-mêmes vu vôtre vôtres
zut
""".split() """.split()
) )

View File

@ -4,7 +4,7 @@ from __future__ import unicode_literals
STOP_WORDS = set( STOP_WORDS = set(
""" """
à às área acerca ademais adeus agora ainda algo algumas alguns ali além ambas ambos antes a à às área acerca ademais adeus agora ainda algo algumas alguns ali além ambas ambos antes
ao aos apenas apoia apoio apontar após aquela aquelas aquele aqueles aqui aquilo ao aos apenas apoia apoio apontar após aquela aquelas aquele aqueles aqui aquilo
as assim através atrás até as assim através atrás até
@ -18,7 +18,7 @@ da daquela daquele dar das de debaixo demais dentro depois des desde dessa desse
desta deste deve devem deverá dez dezanove dezasseis dezassete dezoito diante desta deste deve devem deverá dez dezanove dezasseis dezassete dezoito diante
direita disso diz dizem dizer do dois dos doze duas dão direita disso diz dizem dizer do dois dos doze duas dão
é és ela elas ele eles em embora enquanto entre então era essa essas esse esses esta e é és ela elas ele eles em embora enquanto entre então era essa essas esse esses esta
estado estar estará estas estava este estes esteve estive estivemos estiveram estado estar estará estas estava este estes esteve estive estivemos estiveram
estiveste estivestes estou está estás estão eu eventual exemplo estiveste estivestes estou está estás estão eu eventual exemplo
@ -40,7 +40,7 @@ na nada naquela naquele nas nem nenhuma nessa nesse nesta neste no nos nossa
nossas nosso nossos nova novas nove novo novos num numa nunca nuns não nível nós nossas nosso nossos nova novas nove novo novos num numa nunca nuns não nível nós
número números número números
obrigada obrigado oitava oitavo oito onde ontem onze ora os ou outra outras outros o obrigada obrigado oitava oitavo oito onde ontem onze ora os ou outra outras outros
para parece parte partir pegar pela pelas pelo pelos perto pode podem poder poderá para parece parte partir pegar pela pelas pelo pelos perto pode podem poder poderá
podia pois ponto pontos por porquanto porque porquê portanto porém posição podia pois ponto pontos por porquanto porque porquê portanto porém posição
@ -63,8 +63,8 @@ tudo tão têm
um uma umas uns usa usar último um uma umas uns usa usar último
vai vais valor veja vem vens ver vez vezes vinda vindo vinte você vocês vos vossa vai vais valor veja vem vens ver vez vezes vinda vindo vinte você vocês vos vossa
vossas vosso vossos vários vão vêm vós vossas vosso vossos vários vão vêm vós
zero zero
""".split() """.split()
) )

View File

@ -12,11 +12,13 @@ aceasta
această această
aceea aceea
aceeasi aceeasi
aceeași
acei acei
aceia aceia
acel acel
acela acela
acelasi acelasi
același
acele acele
acelea acelea
acest acest
@ -28,12 +30,11 @@ acestia
acestui acestui
aceşti aceşti
aceştia aceştia
acești
aceștia
acolo acolo
acord acord
acum acum
adica adica
adică
ai ai
aia aia
aibă aibă
@ -57,6 +58,8 @@ alături
am am
anume anume
apoi apoi
apai
apăi
ar ar
are are
as as
@ -154,7 +157,9 @@ că
căci căci
cărei cărei
căror căror
cărora
cărui cărui
căruia
către către
d d
da da
@ -179,6 +184,8 @@ deşi
deși deși
din din
dinaintea dinaintea
dincolo
dincoace
dintr dintr
dintr- dintr-
dintre dintre
@ -190,6 +197,10 @@ drept
dupa dupa
după după
deunaseara
deunăseară
deunazi
deunăzi
e e
ea ea
ei ei
@ -224,7 +235,6 @@ geaba
graţie graţie
grație grație
h h
halbă
i i
ia ia
iar iar
@ -236,6 +246,7 @@ in
inainte inainte
inapoi inapoi
inca inca
incotro
incit incit
insa insa
intr intr
@ -256,6 +267,10 @@ m
ma ma
mai mai
mare mare
macar
măcar
mata
matale
mea mea
mei mei
mele mele
@ -278,11 +293,18 @@ mâine
mîine mîine
n n
na
ne ne
neincetat
neîncetat
nevoie nevoie
ni ni
nici nici
nicidecum
nicidecat
nicidecât
niciodata niciodata
niciodată
nicăieri nicăieri
nimeni nimeni
nimeri nimeri
@ -304,6 +326,10 @@ noștri
nu nu
numai numai
o o
odata
odată
odinioara
odinioară
opt opt
or or
ori ori
@ -318,7 +344,9 @@ oricît
oriunde oriunde
p p
pai pai
păi
parca parca
parcă
patra patra
patru patru
patrulea patrulea
@ -335,13 +363,11 @@ prima
primul primul
prin prin
printr- printr-
printre
putini putini
puţin puţin
puţina puţina
puţină puţină
puțin
puțina
puțină
până până
pînă pînă
r r
@ -419,6 +445,7 @@ unuia
unul unul
v v
va va
vai
vi vi
voastre voastre
voastră voastră

34
spacy/lang/ti/__init__.py Normal file
View File

@ -0,0 +1,34 @@
# coding: utf8
from __future__ import unicode_literals
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_SUFFIXES
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups
class TigrinyaDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: "ti"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
suffixes = TOKENIZER_SUFFIXES
writing_system = {"direction": "ltr", "has_case": False, "has_letters": True}
class Tigrinya(Language):
lang = "ti"
Defaults = TigrinyaDefaults
__all__ = ["Tigrinya"]

22
spacy/lang/ti/examples.py Normal file
View File

@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.ti.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"አፕል ብዩኬ ትርከብ ንግድ ብ1 ቢሊዮን ዶላር ንምግዛዕ ሐሲባ።",
"ፈላማይ ክታበት ኮቪድ 19 ተጀሚሩ፤ሓዱሽ ተስፋ ሂቡ ኣሎ",
"ቻንስለር ጀርመን ኣንገላ መርከል ዝርግሓ ቫይረስ ኮሮና ንምክልካል ጽኑዕ እገዳ ክግበር ጸዊዓ",
"ለንደን ብዓዲ እንግሊዝ ትርከብ ዓባይ ከተማ እያ።",
"ናበይ አለኻ፧",
"ናይ ፈረንሳይ ፕሬዝዳንት መን እዩ፧",
"ናይ አሜሪካ ዋና ከተማ እንታይ እያ፧",
"ኦባማ መዓስ ተወሊዱ፧",
]

104
spacy/lang/ti/lex_attrs.py Normal file
View File

@ -0,0 +1,104 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = [
"ዜሮ",
"ሐደ",
"ክልተ",
"ሰለስተ",
"ኣርባዕተ",
"ሓሙሽተ",
"ሽድሽተ",
"ሸውዓተ",
"ሽሞንተ",
"ትሽዓተ",
"ኣሰርተ",
"ኣሰርተ ሐደ",
"ኣሰርተ ክልተ",
"ኣሰርተ ሰለስተ",
"ኣሰርተ ኣርባዕተ",
"ኣሰርተ ሓሙሽተ",
"ኣሰርተ ሽድሽተ",
"ኣሰርተ ሸውዓተ",
"ኣሰርተ ሽሞንተ",
"ኣሰርተ ትሽዓተ",
"ዕስራ",
"ሰላሳ",
"ኣርብዓ",
"ሃምሳ",
"ስልሳ",
"ሰብዓ",
"ሰማንያ",
"ተስዓ",
"ሚእቲ",
"ሺሕ",
"ሚልዮን",
"ቢልዮን",
"ትሪልዮን",
"ኳድሪልዮን",
"ገጅልዮን",
"ባዝልዮን"
]
_ordinal_words = [
"ቀዳማይ",
"ካልኣይ",
"ሳልሳይ",
"ራብኣይ",
"ሓምሻይ",
"ሻድሻይ",
"ሻውዓይ",
"ሻምናይ",
"ዘጠነኛ",
"አስረኛ",
"ኣሰርተ አንደኛ",
"ኣሰርተ ሁለተኛ",
"ኣሰርተ ሶስተኛ",
"ኣሰርተ አራተኛ",
"ኣሰርተ አምስተኛ",
"ኣሰርተ ስድስተኛ",
"ኣሰርተ ሰባተኛ",
"ኣሰርተ ስምንተኛ",
"ኣሰርተ ዘጠነኛ",
"ሃያኛ",
"ሰላሳኛ"
"አርባኛ",
"አምሳኛ",
"ስድሳኛ",
"ሰባኛ",
"ሰማንያኛ",
"ዘጠናኛ",
"መቶኛ",
"ሺኛ",
"ሚሊዮንኛ",
"ቢሊዮንኛ",
"ትሪሊዮንኛ"
]
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
text_lower = text.lower()
if text_lower in _num_words:
return True
# Check ordinal number
if text_lower in _ordinal_words:
return True
if text_lower.endswith(""):
if text_lower[:-2].isdigit():
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}

View File

@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
from ..char_classes import UNITS, ALPHA_UPPER
_list_punct = LIST_PUNCT + "፡ ። ፣ ፤ ፥ ፦ ፧".strip().split()
_suffixes = (
_list_punct
+ LIST_ELLIPSES
+ LIST_QUOTES
+ [
r"(?<=[0-9])\+",
# Tigrinya is written from Left-To-Right
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
r"(?<=[0-9])(?:{u})".format(u=UNITS),
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
]
)
TOKENIZER_SUFFIXES = _suffixes

View File

@ -0,0 +1,10 @@
# coding: utf8
from __future__ import unicode_literals
# Stop words
STOP_WORDS = set(
"""
ግን ግና ንስኻ ንስኺ ንስኻትክን ንስኻትኩም ናትካ ናትኪ ናትክን ናትኩም
""".split()
)

View File

@ -0,0 +1,26 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import ORTH, LEMMA, NORM, PRON_LEMMA
_exc = {}
for exc_data in [
{ORTH: "ት/ቤት", LEMMA: "ትምህርት ቤት"},
{ORTH: "ወ/ሮ", LEMMA: PRON_LEMMA, NORM: "ወይዘሮ"},
{ORTH: "ወ/ሪ", LEMMA: PRON_LEMMA, NORM: "ወይዘሪት"},
]:
_exc[exc_data[ORTH]] = [exc_data]
for orth in [
"ዓ.ም.",
"ኪ.ሜ.",
]:
_exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = _exc

View File

@ -31,6 +31,9 @@ def pytest_runtest_setup(item):
def tokenizer(): def tokenizer():
return get_lang_class("xx").Defaults.create_tokenizer() return get_lang_class("xx").Defaults.create_tokenizer()
@pytest.fixture(scope="session")
def am_tokenizer():
return get_lang_class("am").Defaults.create_tokenizer()
@pytest.fixture(scope="session") @pytest.fixture(scope="session")
def ar_tokenizer(): def ar_tokenizer():
@ -242,6 +245,9 @@ def th_tokenizer():
pytest.importorskip("pythainlp") pytest.importorskip("pythainlp")
return get_lang_class("th").Defaults.create_tokenizer() return get_lang_class("th").Defaults.create_tokenizer()
@pytest.fixture(scope="session")
def ti_tokenizer():
return get_lang_class("ti").Defaults.create_tokenizer()
@pytest.fixture(scope="session") @pytest.fixture(scope="session")
def tr_tokenizer(): def tr_tokenizer():

View File

View File

View File

@ -0,0 +1,55 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.lang.am.lex_attrs import like_num
def test_am_tokenizer_handles_long_text(am_tokenizer):
text = """ሆሴ ሙጂካ በበጋ ወቅት በኦክስፎርድ ንግግር አንድያቀርቡ ሲጋበዙ ጭንቅላታቸው "ፈነዳ"
እጅግ ጥንታዊ የእንግሊዝኛ ተናጋሪ ዩኒቨርስቲ በአስር ሺዎች የሚቆጠሩ ዩሮዎችን ለተማሪዎች በማስተማር የሚያስከፍለው
እና ከማርጋሬት ታቸር እስከ ስቲቨን ሆኪንግ በአዳራሾቻቸው ውስጥ ንግግር ያደረጉበት የትምህርት ማዕከል በሞንቴቪዴኦ
በሚገኘው የመንግስት ትምህርት ቤት የሰለጠኑትን የ81 ዓመቱ አዛውንት አገልግሎት ጠየቁ"""
tokens = am_tokenizer(text)
assert len(tokens) == 56
@pytest.mark.parametrize(
"text,length",
[
("ሆሴ ሙጂካ ለምን ተመረጠ?", 5),
("“በፍፁም?”", 4),
("""አዎ! ሆዜ አርካዲዮ ቡንዲያ “እንሂድ” ሲል መለሰ።""", 11),
("እነሱ በግምት 10ኪ.ሜ. ሮጡ።", 7),
("እና ከዚያ ለምን...", 4),
],
)
def test_am_tokenizer_handles_cnts(am_tokenizer, text, length):
tokens = am_tokenizer(text)
assert len(tokens) == length
@pytest.mark.parametrize(
"text,match",
[
("10", True),
("1", True),
("10.000", True),
("1000", True),
("999,0", True),
("አንድ", True),
("ሁለት", True),
("ትሪሊዮን", True),
("ውሻ", False),
(",", False),
("1/2", True),
],
)
def test_lex_attrs_like_number(am_tokenizer, text, match):
tokens = am_tokenizer(text)
assert len(tokens) == 1
assert tokens[0].like_num == match

View File

@ -0,0 +1,61 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from ...util import get_doc
def test_noun_chunks_is_parsed(da_tokenizer):
"""Test that noun_chunks raises Value Error for 'da' language if Doc is not parsed.
To check this test, we're constructing a Doc
with a new Vocab here and forcing is_parsed to 'False'
to make sure the noun chunks don't run.
"""
doc = da_tokenizer("Det er en sætning")
doc.is_parsed = False
with pytest.raises(ValueError):
list(doc.noun_chunks)
DA_NP_TEST_EXAMPLES = [
(
"Hun elsker at plukker frugt.",
['PRON', 'VERB', 'PART', 'VERB', 'NOUN', 'PUNCT'],
['nsubj', 'ROOT', 'mark', 'obj', 'obj', 'punct'],
[1, 0, 1, -2, -1, -4],
["Hun", "frugt"],
),
(
"Påfugle er de smukkeste fugle.",
['NOUN', 'AUX', 'DET', 'ADJ', 'NOUN', 'PUNCT'],
['nsubj', 'cop', 'det', 'amod', 'ROOT', 'punct'],
[4, 3, 2, 1, 0, -1],
["Påfugle", "de smukkeste fugle"],
),
(
"Rikke og Jacob Jensen glæder sig til en hyggelig skovtur",
['PROPN', 'CCONJ', 'PROPN', 'PROPN', 'VERB', 'PRON', 'ADP', 'DET', 'ADJ', 'NOUN'],
['nsubj', 'cc', 'conj', 'flat', 'ROOT', 'obj', 'case', 'det', 'amod', 'obl'],
[4, 1, -2, -1, 0, -1, 3, 2, 1, -5],
["Rikke", "Jacob Jensen", "sig", "en hyggelig skovtur"],
),
]
@pytest.mark.parametrize(
"text,pos,deps,heads,expected_noun_chunks", DA_NP_TEST_EXAMPLES
)
def test_da_noun_chunks(da_tokenizer, text, pos, deps, heads, expected_noun_chunks):
tokens = da_tokenizer(text)
assert len(heads) == len(pos)
doc = get_doc(
tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps, pos=pos
)
noun_chunks = list(doc.noun_chunks)
assert len(noun_chunks) == len(expected_noun_chunks)
for i, np in enumerate(noun_chunks):
assert np.text == expected_noun_chunks[i]

View File

@ -111,7 +111,15 @@ def test_en_tokenizer_handles_times(en_tokenizer, text):
@pytest.mark.parametrize( @pytest.mark.parametrize(
"text,norms", [("I'm", ["i", "am"]), ("shan't", ["shall", "not"])] "text,norms",
[
("I'm", ["i", "am"]),
("shan't", ["shall", "not"]),
(
"Many factors cause cancer 'cause it is complex",
["many", "factors", "cause", "cancer", "because", "it", "is", "complex"],
),
],
) )
def test_en_tokenizer_norm_exceptions(en_tokenizer, text, norms): def test_en_tokenizer_norm_exceptions(en_tokenizer, text, norms):
tokens = en_tokenizer(text) tokens = en_tokenizer(text)

View File

View File

View File

@ -0,0 +1,55 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.lang.ti.lex_attrs import like_num
def test_ti_tokenizer_handles_long_text(ti_tokenizer):
text = """ቻንስለር ጀርመን ኣንገላ መርከል ኣብታ ሃገር ቁጽሪ መትሓዝቲ ኮቪድ መዓልታዊ ክብረ መዝገብ ድሕሪ ምህራሙ- ጽኑዕ እገዳ ክግበር ጸዊዓ።
መርከል ሎሚ ንታሕታዋይ ባይቶ ሃገራ ክትገልጽ ከላ ኣብ ወሳኒ ምዕራፍ ቃልሲ ኢና ዘለና-ዳሕራዋይ ማዕበል ካብቲ ቀዳማይ ክገድድ ይኽእል` ኢላ
ትካል ምክልኻል ተላገብቲ ሕማማት ጀርመን ኣብ ዝሓለፈ 24 ሰዓታት ኣብ ምልእቲ ጀርመር 590 ሰባት ብኮቪድ19 ምሟቶም ኣፍሊጡ`
ቻንስለር ኣንጀላ መርከል ኣብ እዋን በዓላት ልደት ስድራቤታት ክተኣኻኸባ ዝፍቀደለን` እንተኾነ ድሕሪኡ ኣብ ዘሎ ግዜ ግን እቲ እገዳታት ክትግበር ትደሊ"""
tokens = ti_tokenizer(text)
assert len(tokens) == 85
@pytest.mark.parametrize(
"text,length",
[
("ቻንስለር ጀርመን ኣንገላ መርከል፧", 5),
("“ስድራቤታት፧”", 4),
("""ኣብ እዋን በዓላት ልደት ስድራቤታት ክተኣኻኸባ ዝፍቀደለን`ኳ እንተኾነ።""", 9),
("ብግምት 10ኪ.ሜ. ጎይዩ።", 6),
("ኣብ ዝሓለፈ 24 ሰዓታት...", 5),
],
)
def test_ti_tokenizer_handles_cnts(ti_tokenizer, text, length):
tokens = ti_tokenizer(text)
assert len(tokens) == length
@pytest.mark.parametrize(
"text,match",
[
("10", True),
("1", True),
("10.000", True),
("1000", True),
("999,0", True),
("ሐደ", True),
("ክልተ", True),
("ትሪልዮን", True),
("ከልቢ", False),
(",", False),
("1/2", True),
],
)
def test_lex_attrs_like_number(ti_tokenizer, text, match):
tokens = ti_tokenizer(text)
assert len(tokens) == 1
assert tokens[0].like_num == match

View File

@ -2670,6 +2670,60 @@
}, },
"category": ["scientific", "research", "standalone"], "category": ["scientific", "research", "standalone"],
"tags": ["Evolutionary Computation", "Grammatical Evolution"] "tags": ["Evolutionary Computation", "Grammatical Evolution"]
},
{
"id": "SpacyDotNet",
"title": "spaCy .NET Wrapper",
"slogan": "SpacyDotNet is a .NET Core compatible wrapper for spaCy, based on Python.NET",
"description": "This projects relies on [Python.NET](http://pythonnet.github.io/) to interop with spaCy. It's not meant to be a complete and exhaustive implementation of all spaCy features and [APIs](https://spacy.io/api). Although it should be enough for basic tasks, it's considered as a starting point if you need to build a complex project using spaCy in .NET Most of the basic features in _Spacy101_ are available. All `Container` classes are present (`Doc`, `Token`, `Span` and `Lexeme`) with their basic properties/methods running and also `Vocab` and `StringStore` in a limited form. Anyway, any developer should be ready to add the missing properties or classes in a very straightforward manner.",
"github": "AMArostegui/SpacyDotNet",
"thumb": "https://raw.githubusercontent.com/AMArostegui/SpacyDotNet/master/cslogo.png",
"code_example": [
"var spacy = new Spacy();",
"",
"var nlp = spacy.Load(\"en_core_web_sm\");",
"var doc = nlp.GetDocument(\"Apple is looking at buying U.K. startup for $1 billion\");",
"",
"foreach (Token token in doc.Tokens)",
" Console.WriteLine($\"{token.Text} {token.Lemma} {token.PoS} {token.Tag} {token.Dep} {token.Shape} {token.IsAlpha} {token.IsStop}\");",
"",
"Console.WriteLine(\"\");",
"foreach (Span ent in doc.Ents)",
" Console.WriteLine($\"{ent.Text} {ent.StartChar} {ent.EndChar} {ent.Label}\");",
"",
"nlp = spacy.Load(\"en_core_web_md\");",
"var tokens = nlp.GetDocument(\"dog cat banana afskfsd\");",
"",
"Console.WriteLine(\"\");",
"foreach (Token token in tokens.Tokens)",
" Console.WriteLine($\"{token.Text} {token.HasVector} {token.VectorNorm}, {token.IsOov}\");",
"",
"tokens = nlp.GetDocument(\"dog cat banana\");",
"Console.WriteLine(\"\");",
"foreach (Token token1 in tokens.Tokens)",
"{",
" foreach (Token token2 in tokens.Tokens)",
" Console.WriteLine($\"{token1.Text} {token2.Text} {token1.Similarity(token2) }\");",
"}",
"",
"doc = nlp.GetDocument(\"I love coffee\");",
"Console.WriteLine(\"\");",
"Console.WriteLine(doc.Vocab.Strings[\"coffee\"]);",
"Console.WriteLine(doc.Vocab.Strings[3197928453018144401]);",
"",
"Console.WriteLine(\"\");",
"foreach (Token word in doc.Tokens)",
"{",
" var lexeme = doc.Vocab[word.Text];",
" Console.WriteLine($@\"{lexeme.Text} {lexeme.Orth} {lexeme.Shape} {lexeme.Prefix} {lexeme.Suffix} {lexeme.IsAlpha} {lexeme.IsDigit} {lexeme.IsTitle} {lexeme.Lang}\");",
"}"
],
"code_language": "csharp",
"author": "Antonio Miras",
"author_links": {
"github": "AMArostegui"
},
"category": ["nonpython"]
} }
], ],