mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-24 00:46:28 +03:00
Adding support for Yoruba Language (#4614)
* Adding Support for Yoruba * test text * Updated test string. * Fixing encoding declaration. * Adding encoding to stop_words.py * Added contributor agreement and removed iranlowo. * Added removed test files and removed iranlowo to keep project bare. * Returned CONTRIBUTING.md to default state. * Added delted conftest entries * Tidy up and auto-format * Revert CONTRIBUTING.md Co-authored-by: Ines Montani <ines@ines.io>
This commit is contained in:
parent
1b838d1313
commit
a741de7cf6
106
.github/contributors/Olamyy.md
vendored
Normal file
106
.github/contributors/Olamyy.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [ x ] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Olamilekan Wahab |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 8/11/2019 |
|
||||
| GitHub username | Olamyy |
|
||||
| Website (optional) | |
|
24
spacy/lang/yo/__init__.py
Normal file
24
spacy/lang/yo/__init__.py
Normal file
|
@ -0,0 +1,24 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ...language import Language
|
||||
from ...attrs import LANG
|
||||
|
||||
|
||||
class YorubaDefaults(Language.Defaults):
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters.update(LEX_ATTRS)
|
||||
lex_attr_getters[LANG] = lambda text: "yo"
|
||||
stop_words = STOP_WORDS
|
||||
tokenizer_exceptions = BASE_EXCEPTIONS
|
||||
|
||||
|
||||
class Yoruba(Language):
|
||||
lang = "yo"
|
||||
Defaults = YorubaDefaults
|
||||
|
||||
|
||||
__all__ = ["Yoruba"]
|
26
spacy/lang/yo/examples.py
Normal file
26
spacy/lang/yo/examples.py
Normal file
|
@ -0,0 +1,26 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
>>> from spacy.lang.yo.examples import sentences
|
||||
>>> docs = nlp.pipe(sentences)
|
||||
"""
|
||||
|
||||
# 1. https://yo.wikipedia.org/wiki/Wikipedia:%C3%80y%E1%BB%8Dk%C3%A0_p%C3%A0t%C3%A0k%C3%AC
|
||||
# 2.https://yo.wikipedia.org/wiki/Oj%C3%BAew%C3%A9_%C3%80k%E1%BB%8D%CC%81k%E1%BB%8D%CC%81
|
||||
# 3. https://www.bbc.com/yoruba
|
||||
|
||||
sentences = [
|
||||
"Ìjọba Tanzania fi Ajìjàgbara Ọmọ Orílẹ̀-èdèe Uganda sí àtìmọ́lé",
|
||||
"Olúṣẹ́gun Ọbásanjọ́, tí ó jẹ́ Ààrẹ ìjọba ológun àná (láti ọdún 1976 sí 1979), tí ó sì tún ṣe Ààrẹ ìjọba alágbádá tí ìbò gbé wọlé (ní ọdún 1999 sí 2007), kúndùn láti máa bu ẹnu àtẹ́ lu àwọn "
|
||||
"ètò ìjọba Ààrẹ orílẹ̀-èdè Nàìjíríà tí ó jẹ tẹ̀lé e.",
|
||||
"Akin Alabi rán ẹnu mọ́ agbárá Adárí Òsìsẹ̀, àwọn ọmọ Nàìjíríà dẹnu bò ó",
|
||||
"Ta ló leè dúró s'ẹ́gbẹ̀ẹ́ Okunnu láì rẹ́rìín?",
|
||||
"Dídarapọ̀ mọ́n ìpolongo",
|
||||
"Bi a se n so, omobinrin ni oruko ni ojo kejo bee naa ni omokunrin ni oruko ni ojo kesan.",
|
||||
"Oríṣìíríṣìí nǹkan ló le yọrí sí orúkọ tí a sọ ọmọ",
|
||||
"Gbogbo won ni won ni oriki ti won",
|
||||
]
|
115
spacy/lang/yo/lex_attrs.py
Normal file
115
spacy/lang/yo/lex_attrs.py
Normal file
|
@ -0,0 +1,115 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import unicodedata
|
||||
|
||||
from ...attrs import LIKE_NUM
|
||||
|
||||
|
||||
_num_words = [
|
||||
"ení",
|
||||
"oókàn",
|
||||
"ọ̀kanlá",
|
||||
"ẹ́ẹdọ́gbọ̀n",
|
||||
"àádọ́fà",
|
||||
"ẹ̀walélúɡba",
|
||||
"egbèje",
|
||||
"ẹgbàárin",
|
||||
"èjì",
|
||||
"eéjì",
|
||||
"èjìlá",
|
||||
"ọgbọ̀n,",
|
||||
"ọgọ́fà",
|
||||
"ọ̀ọ́dúrún",
|
||||
"ẹgbẹ̀jọ",
|
||||
"ẹ̀ẹ́dẹ́ɡbàárùn",
|
||||
"ẹ̀ta",
|
||||
"ẹẹ́ta",
|
||||
"ẹ̀talá",
|
||||
"aárùndílogójì",
|
||||
"àádóje",
|
||||
"irinwó",
|
||||
"ẹgbẹ̀sàn",
|
||||
"ẹgbàárùn",
|
||||
"ẹ̀rin",
|
||||
"ẹẹ́rin",
|
||||
"ẹ̀rinlá",
|
||||
"ogójì",
|
||||
"ogóje",
|
||||
"ẹ̀ẹ́dẹ́gbẹ̀ta",
|
||||
"ẹgbàá",
|
||||
"ẹgbàájọ",
|
||||
"àrún",
|
||||
"aárùn",
|
||||
"ẹ́ẹdógún",
|
||||
"àádọ́ta",
|
||||
"àádọ́jọ",
|
||||
"ẹgbẹ̀ta",
|
||||
"ẹgboókànlá",
|
||||
"ẹgbàawǎ",
|
||||
"ẹ̀fà",
|
||||
"ẹẹ́fà",
|
||||
"ẹẹ́rìndílógún",
|
||||
"ọgọ́ta",
|
||||
"ọgọ́jọ",
|
||||
"ọ̀ọ́dẹ́gbẹ̀rin",
|
||||
"ẹgbẹ́ẹdógún",
|
||||
"ọkẹ́marun",
|
||||
"èje",
|
||||
"etàdílógún",
|
||||
"àádọ́rin",
|
||||
"àádọ́sán",
|
||||
"ẹgbẹ̀rin",
|
||||
"ẹgbàajì",
|
||||
"ẹgbẹ̀ẹgbẹ̀rún",
|
||||
"ẹ̀jọ",
|
||||
"ẹẹ́jọ",
|
||||
"eéjìdílógún",
|
||||
"ọgọ́rin",
|
||||
"ọgọsàn",
|
||||
"ẹ̀ẹ́dẹ́gbẹ̀rún",
|
||||
"ẹgbẹ́ẹdọ́gbọ̀n",
|
||||
"ọgọ́rùn ọkẹ́",
|
||||
"ẹ̀sán",
|
||||
"ẹẹ́sàn",
|
||||
"oókàndílógún",
|
||||
"àádọ́rùn",
|
||||
"ẹ̀wadilúɡba",
|
||||
"ẹgbẹ̀rún",
|
||||
"ẹgbàáta",
|
||||
"ẹ̀wá",
|
||||
"ẹẹ́wàá",
|
||||
"ogún",
|
||||
"ọgọ́rùn",
|
||||
"igba",
|
||||
"ẹgbẹ̀fà",
|
||||
"ẹ̀ẹ́dẹ́ɡbarin",
|
||||
]
|
||||
|
||||
|
||||
def strip_accents_text(text):
|
||||
"""
|
||||
Converts the string to NFD, separates & returns only the base characters
|
||||
:param text:
|
||||
:return: input string without diacritic adornments on base characters
|
||||
"""
|
||||
return "".join(
|
||||
c for c in unicodedata.normalize("NFD", text) if unicodedata.category(c) != "Mn"
|
||||
)
|
||||
|
||||
|
||||
def like_num(text):
|
||||
text = text.replace(",", "").replace(".", "")
|
||||
num_markers = ["dí", "dọ", "lé", "dín", "di", "din", "le", "do"]
|
||||
if any(mark in text for mark in num_markers):
|
||||
return True
|
||||
text = strip_accents_text(text)
|
||||
_num_words_stripped = [strip_accents_text(num) for num in _num_words]
|
||||
if text.isdigit():
|
||||
return True
|
||||
if text in _num_words_stripped or text.lower() in _num_words_stripped:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
LEX_ATTRS = {LIKE_NUM: like_num}
|
12
spacy/lang/yo/stop_words.py
Normal file
12
spacy/lang/yo/stop_words.py
Normal file
|
@ -0,0 +1,12 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
# stop words as whitespace-separated list.
|
||||
# Source: https://raw.githubusercontent.com/dohliam/more-stoplists/master/yo/yo.txt
|
||||
|
||||
STOP_WORDS = set(
|
||||
"a an b bá bí bẹ̀rẹ̀ d e f fún fẹ́ g gbogbo i inú j jù jẹ jẹ́ k kan kì kí kò "
|
||||
"l láti lè lọ m mi mo máa mọ̀ n ni náà ní nígbà nítorí nǹkan o p padà pé "
|
||||
"púpọ̀ pẹ̀lú r rẹ̀ s sì sí sínú t ti tí u w wà wá wọn wọ́n y yìí à àti àwọn á "
|
||||
"è é ì í ò òun ó ù ú ń ńlá ǹ ̀ ́ ̣ ṣ ṣe ṣé ṣùgbọ́n ẹ ẹmọ́ ọ ọjọ́ ọ̀pọ̀lọpọ̀".split()
|
||||
)
|
|
@ -219,8 +219,14 @@ def uk_tokenizer():
|
|||
def ur_tokenizer():
|
||||
return get_lang_class("ur").Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def yo_tokenizer():
|
||||
return get_lang_class("yo").Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def zh_tokenizer():
|
||||
pytest.importorskip("jieba")
|
||||
return get_lang_class("zh").Defaults.create_tokenizer()
|
||||
|
||||
|
|
|
@ -87,4 +87,4 @@ def test_lex_attrs_like_url(text, match):
|
|||
],
|
||||
)
|
||||
def test_lex_attrs_word_shape(text, shape):
|
||||
assert word_shape(text) == shape
|
||||
assert word_shape(text) == shape
|
|
@ -11,7 +11,7 @@ from spacy.util import get_lang_class
|
|||
LANGUAGES = ["af", "ar", "bg", "bn", "ca", "cs", "da", "de", "el", "en", "es",
|
||||
"et", "fa", "fi", "fr", "ga", "he", "hi", "hr", "hu", "id", "is",
|
||||
"it", "kn", "lt", "lv", "nb", "nl", "pl", "pt", "ro", "si", "sk",
|
||||
"sl", "sq", "sr", "sv", "ta", "te", "tl", "tr", "tt", "ur"]
|
||||
"sl", "sq", "sr", "sv", "ta", "te", "tl", "tr", "tt", "ur", 'yo']
|
||||
# fmt: on
|
||||
|
||||
|
||||
|
|
0
spacy/tests/lang/yo/__init__.py
Normal file
0
spacy/tests/lang/yo/__init__.py
Normal file
32
spacy/tests/lang/yo/test_text.py
Normal file
32
spacy/tests/lang/yo/test_text.py
Normal file
|
@ -0,0 +1,32 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.lang.yo.lex_attrs import like_num
|
||||
|
||||
|
||||
def test_yo_tokenizer_handles_long_text(yo_tokenizer):
|
||||
text = """Àwọn ọmọ ìlú tí wọ́n ń ṣàmúlò ayélujára ti bẹ̀rẹ̀ ìkọkúkọ sórí àwòrán ààrẹ Nkurunziza nínú ìfẹ̀hónúhàn pẹ̀lú àmì ìdámọ̀: Nkurunziza àti Burundi:
|
||||
Ọmọ ilé ẹ̀kọ́ gíga ní ẹ̀wọ̀n fún kíkọ ìkọkúkọ sí orí àwòrán Ààrẹ .
|
||||
Bí mo bá ṣe èyí ní Burundi , ó ṣe é ṣe kí a fi mí sí àtìmọ́lé
|
||||
Ìjọba Burundi fi akẹ́kọ̀ọ́bìnrin sí àtìmọ́lé látàrí ẹ̀sùn ìkọkúkọ sí orí àwòrán ààrẹ. A túwíìtì àwòrán ìkọkúkọ wa ní ìbánikẹ́dùn ìṣẹ̀lẹ̀ náà.
|
||||
Wọ́n ní kí a dán an wò, kí a kọ nǹkan sí orí àwòrán ààrẹ mo sì ṣe bẹ́ẹ̀. Mo ní ìgbóyà wípé ẹnikẹ́ni kò ní mú mi níbí.
|
||||
Ìfòfinlíle mú àtakò"""
|
||||
tokens = yo_tokenizer(text)
|
||||
assert len(tokens) == 121
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"text,match",
|
||||
[("ení", True), ("ogun", True), ("mewadinlogun", True), ("ten", False)],
|
||||
)
|
||||
def test_lex_attrs_like_number(yo_tokenizer, text, match):
|
||||
tokens = yo_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
assert tokens[0].like_num == match
|
||||
|
||||
|
||||
@pytest.mark.parametrize("word", ["eji", "ejila", "ogun", "aárùn"])
|
||||
def test_yo_lex_attrs_capitals(word):
|
||||
assert like_num(word)
|
||||
assert like_num(word.upper())
|
Loading…
Reference in New Issue
Block a user