Mirror of https://github.com/explosion/spaCy.git, synced 2024-12-24 00:46:28 +03:00
Add Amharic አማርኛ Language support (#6583)
* Add Amharic to spaCy
* Clean up
* Add some PRON_LEMMA
* Add Tigrinya support
* Remove text_noun_chunks
* Tigrinya support
* Added some more details for ti
* Fix unit test
* Add Amharic char range
* Changes from review
* Amharic and Tigrinya share the same Unicode block
* Get rid of _amharic/_tigrinya in char_classes

Co-authored-by: Josiah Solomon <jsolomon@meteorcomm.com>
Parent: 292c1d6a73
Commit: cf52510631
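As a quick sketch of what this change enables (assuming a spaCy build that includes these files; `Amharic` and `Tigrinya` are the classes added below):

    # Sketch only: load the new languages and tokenize an example sentence.
    from spacy.lang.am import Amharic
    from spacy.lang.ti import Tigrinya

    nlp_am = Amharic()
    doc = nlp_am("ለንደን በእንግሊዝ የምትገኝ ትልቅ ከተማ ናት።")  # "London is a big city in England."
    print([t.text for t in doc])

    nlp_ti = Tigrinya()
    print([t.text for t in nlp_ti("ናበይ አለኻ፧")])  # "Where are you?"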
.github/contributors/yosiasz.md (new file, vendored, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|--------------------------------|----------------------|
| Name                           | Josiah Solomon       |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2020-12-15           |
| GitHub username                | yosiasz              |
| Website (optional)             |                      |
spacy/lang/am/__init__.py (new file, 34 lines)
@@ -0,0 +1,34 @@
# coding: utf8
from __future__ import unicode_literals

from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_SUFFIXES

from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups


class AmharicDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "am"
    lex_attr_getters[NORM] = add_lookups(
        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
    )
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
    suffixes = TOKENIZER_SUFFIXES
    writing_system = {"direction": "ltr", "has_case": False, "has_letters": True}


class Amharic(Language):
    lang = "am"
    Defaults = AmharicDefaults


__all__ = ["Amharic"]
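The Defaults pattern above is what the test fixtures below rely on; a minimal sketch of creating a tokenizer straight from the class (the v2-style API used in this diff's conftest.py changes):

    from spacy.lang.am import Amharic

    # Defaults.create_tokenizer() builds a tokenizer with the suffixes and
    # exceptions configured above, without loading a full pipeline.
    tokenizer = Amharic.Defaults.create_tokenizer()
    tokens = tokenizer("የት ነህ?")
    print([t.text for t in tokens])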
spacy/lang/am/examples.py (new file, 22 lines)
@@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.am.examples import sentences
>>> docs = nlp.pipe(sentences)
"""


sentences = [
    "አፕል የዩኬን ጅምር ድርጅት በ 1 ቢሊዮን ዶላር ለመግዛት አስቧል።",
    "የራስ ገዝ መኪኖች የኢንሹራንስ ኃላፊነትን ወደ አምራቾች ያዛውራሉ",
    "ሳን ፍራንሲስኮ የእግረኛ መንገድ አቅርቦት ሮቦቶችን ማገድን ይመለከታል",
    "ለንደን በእንግሊዝ የምትገኝ ትልቅ ከተማ ናት።",
    "የት ነህ?",
    "የፈረንሳይ ፕሬዝዳንት ማናቸው?",
    "የአሜሪካ ዋና ከተማ ምንድነው?",
    "ባራክ ኦባማ መቼ ተወለደ?",
]
spacy/lang/am/lex_attrs.py (new file, 104 lines)
@@ -0,0 +1,104 @@
# coding: utf8
from __future__ import unicode_literals

from ...attrs import LIKE_NUM


_num_words = [
    "ዜሮ",
    "አንድ",
    "ሁለት",
    "ሶስት",
    "አራት",
    "አምስት",
    "ስድስት",
    "ሰባት",
    "ስምት",
    "ዘጠኝ",
    "አስር",
    "አስራ አንድ",
    "አስራ ሁለት",
    "አስራ ሶስት",
    "አስራ አራት",
    "አስራ አምስት",
    "አስራ ስድስት",
    "አስራ ሰባት",
    "አስራ ስምንት",
    "አስራ ዘጠኝ",
    "ሃያ",
    "ሰላሳ",
    "አርባ",
    "ሃምሳ",
    "ስልሳ",
    "ሰባ",
    "ሰማንያ",
    "ዘጠና",
    "መቶ",
    "ሺህ",
    "ሚሊዮን",
    "ቢሊዮን",
    "ትሪሊዮን",
    "ኳድሪሊዮን",
    "ገጅሊዮን",
    "ባዝሊዮን",
]

_ordinal_words = [
    "አንደኛ",
    "ሁለተኛ",
    "ሶስተኛ",
    "አራተኛ",
    "አምስተኛ",
    "ስድስተኛ",
    "ሰባተኛ",
    "ስምንተኛ",
    "ዘጠነኛ",
    "አስረኛ",
    "አስራ አንደኛ",
    "አስራ ሁለተኛ",
    "አስራ ሶስተኛ",
    "አስራ አራተኛ",
    "አስራ አምስተኛ",
    "አስራ ስድስተኛ",
    "አስራ ሰባተኛ",
    "አስራ ስምንተኛ",
    "አስራ ዘጠነኛ",
    "ሃያኛ",
    "ሰላሳኛ",
    "አርባኛ",
    "አምሳኛ",
    "ስድሳኛ",
    "ሰባኛ",
    "ሰማንያኛ",
    "ዘጠናኛ",
    "መቶኛ",
    "ሺኛ",
    "ሚሊዮንኛ",
    "ቢሊዮንኛ",
    "ትሪሊዮንኛ",
]


def like_num(text):
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True

    text_lower = text.lower()
    if text_lower in _num_words:
        return True

    # Check ordinal number
    if text_lower in _ordinal_words:
        return True
    if text_lower.endswith("ኛ"):
        # "ኛ" is a single character, so strip exactly one character
        if text_lower[:-1].isdigit():
            return True

    return False


LEX_ATTRS = {LIKE_NUM: like_num}
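A few spot checks of `like_num`, as a sketch (the inputs come from the word lists above):

    from spacy.lang.am.lex_attrs import like_num

    print(like_num("10"))      # True: plain digits
    print(like_num("1/2"))     # True: simple fraction
    print(like_num("አንድ"))     # True: "one", listed in _num_words
    print(like_num("ሶስተኛ"))   # True: "third", listed in _ordinal_words
    print(like_num("ውሻ"))      # False: "dog" is not number-like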
spacy/lang/am/punctuation.py (new file, 22 lines)
@@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals

from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
from ..char_classes import UNITS, ALPHA_UPPER

_list_punct = LIST_PUNCT + "፡ ። ፣ ፤ ፥ ፦ ፧".strip().split()

_suffixes = (
    _list_punct
    + LIST_ELLIPSES
    + LIST_QUOTES
    + [
        r"(?<=[0-9])\+",
        # Amharic is written from Left-To-Right
        r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
        r"(?<=[0-9])(?:{u})".format(u=UNITS),
        r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
    ]
)

TOKENIZER_SUFFIXES = _suffixes
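These suffix patterns peel punctuation, currency symbols and units off the end of a token. A rough illustration of the lookbehind idea with plain `re` (the `[$€£]` class here is a stand-in for spaCy's `CURRENCY` definition, not the real one):

    import re

    # "(?<=[0-9])" means: only treat the symbol as a suffix when it
    # directly follows a digit, as in "100$".
    pattern = re.compile(r"(?<=[0-9])[$€£]")
    print(pattern.search("100$").group(0))  # "$" would be split off as its own token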
spacy/lang/am/stop_words.py (new file, 10 lines)
@@ -0,0 +1,10 @@
# coding: utf8
from __future__ import unicode_literals


# Stop words
STOP_WORDS = set(
    """
ግን አንቺ አንተ እናንተ ያንተ ያንቺ የናንተ ራስህን ራስሽን ራሳችሁን
""".split()
)
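Once the language class picks these up, the corresponding tokens should be flagged as stop words; a sketch assuming the v2-style API used throughout this diff:

    from spacy.lang.am import Amharic

    nlp = Amharic()
    doc = nlp("ግን")        # "but", listed in STOP_WORDS above
    print(doc[0].is_stop)   # expected: True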
spacy/lang/am/tokenizer_exceptions.py (new file, 25 lines)
@@ -0,0 +1,25 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import ORTH, LEMMA, NORM, PRON_LEMMA


_exc = {}


for exc_data in [
    {ORTH: "ት/ቤት", LEMMA: "ትምህርት ቤት"},
    {ORTH: "ወ/ሮ", LEMMA: PRON_LEMMA, NORM: "ወይዘሮ"},
]:
    _exc[exc_data[ORTH]] = [exc_data]


for orth in [
    "ዓ.ም.",
    "ኪ.ሜ.",
]:
    _exc[orth] = [{ORTH: orth}]


TOKENIZER_EXCEPTIONS = _exc
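The effect of these exceptions is that the listed strings survive tokenization whole instead of being split at "/" or ".". A sketch, again assuming the Amharic class added in this diff:

    from spacy.lang.am import Amharic

    nlp = Amharic()
    # Without the exception, the suffix rules could split off the final ".";
    # the special case keeps the abbreviation whole.
    print([t.text for t in nlp("ኪ.ሜ.")])  # expected: ["ኪ.ሜ."] ("km")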
spacy/lang/char_classes.py (modified)

@@ -5,6 +5,8 @@ split_chars = lambda char: list(char.strip().split(" "))
 merge_chars = lambda char: char.strip().replace(" ", "|")
 group_chars = lambda char: char.strip().replace(" ", "")

+_ethiopic = r"\u1200-\u137F"
+
 _bengali = r"\u0980-\u09FF"

 _hebrew = r"\u0591-\u05F4\uFB1D-\uFB4F"

@@ -221,7 +223,8 @@ _upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian
 _lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower + _macedonian_lower

 _uncased = (
-    _bengali
+    _ethiopic
+    + _bengali
     + _hebrew
     + _persian
     + _sinhala
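Amharic and Tigrinya share the Ethiopic Unicode block, which is why a single `_ethiopic` range serves both languages (as the commit message notes). A quick sanity check of the range itself:

    import re

    # U+1200-U+137F is the Ethiopic block, covering both Amharic and Tigrinya.
    ethiopic = re.compile(r"[\u1200-\u137F]")
    print(bool(ethiopic.match("ሀ")))  # True: Ethiopic letter
    print(bool(ethiopic.match("a")))  # False: Latin letter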
spacy/lang/ti/__init__.py (new file, 34 lines)
@@ -0,0 +1,34 @@
# coding: utf8
from __future__ import unicode_literals

from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_SUFFIXES

from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups


class TigrinyaDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "ti"
    lex_attr_getters[NORM] = add_lookups(
        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
    )
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
    suffixes = TOKENIZER_SUFFIXES
    writing_system = {"direction": "ltr", "has_case": False, "has_letters": True}


class Tigrinya(Language):
    lang = "ti"
    Defaults = TigrinyaDefaults


__all__ = ["Tigrinya"]
spacy/lang/ti/examples.py (new file, 22 lines)
@@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.ti.examples import sentences
>>> docs = nlp.pipe(sentences)
"""


sentences = [
    "አፕል ብዩኬ ትርከብ ንግድ ብ1 ቢሊዮን ዶላር ንምግዛዕ ሐሲባ።",
    "ፈላማይ ክታበት ኮቪድ 19 ተጀሚሩ፤ሓዱሽ ተስፋ ሂቡ ኣሎ",
    "ቻንስለር ጀርመን ኣንገላ መርከል ዝርግሓ ቫይረስ ኮሮና ንምክልካል ጽኑዕ እገዳ ክግበር ጸዊዓ",
    "ለንደን ብዓዲ እንግሊዝ ትርከብ ዓባይ ከተማ እያ።",
    "ናበይ አለኻ፧",
    "ናይ ፈረንሳይ ፕሬዝዳንት መን እዩ፧",
    "ናይ አሜሪካ ዋና ከተማ እንታይ እያ፧",
    "ኦባማ መዓስ ተወሊዱ፧",
]
spacy/lang/ti/lex_attrs.py (new file, 104 lines)
@@ -0,0 +1,104 @@
# coding: utf8
from __future__ import unicode_literals

from ...attrs import LIKE_NUM


_num_words = [
    "ዜሮ",
    "ሐደ",
    "ክልተ",
    "ሰለስተ",
    "ኣርባዕተ",
    "ሓሙሽተ",
    "ሽድሽተ",
    "ሸውዓተ",
    "ሽሞንተ",
    "ትሽዓተ",
    "ኣሰርተ",
    "ኣሰርተ ሐደ",
    "ኣሰርተ ክልተ",
    "ኣሰርተ ሰለስተ",
    "ኣሰርተ ኣርባዕተ",
    "ኣሰርተ ሓሙሽተ",
    "ኣሰርተ ሽድሽተ",
    "ኣሰርተ ሸውዓተ",
    "ኣሰርተ ሽሞንተ",
    "ኣሰርተ ትሽዓተ",
    "ዕስራ",
    "ሰላሳ",
    "ኣርብዓ",
    "ሃምሳ",
    "ስልሳ",
    "ሰብዓ",
    "ሰማንያ",
    "ተስዓ",
    "ሚእቲ",
    "ሺሕ",
    "ሚልዮን",
    "ቢልዮን",
    "ትሪልዮን",
    "ኳድሪልዮን",
    "ገጅልዮን",
    "ባዝልዮን",
]

_ordinal_words = [
    "ቀዳማይ",
    "ካልኣይ",
    "ሳልሳይ",
    "ራብኣይ",
    "ሓምሻይ",
    "ሻድሻይ",
    "ሻውዓይ",
    "ሻምናይ",
    "ዘጠነኛ",
    "አስረኛ",
    "ኣሰርተ አንደኛ",
    "ኣሰርተ ሁለተኛ",
    "ኣሰርተ ሶስተኛ",
    "ኣሰርተ አራተኛ",
    "ኣሰርተ አምስተኛ",
    "ኣሰርተ ስድስተኛ",
    "ኣሰርተ ሰባተኛ",
    "ኣሰርተ ስምንተኛ",
    "ኣሰርተ ዘጠነኛ",
    "ሃያኛ",
    "ሰላሳኛ",
    "አርባኛ",
    "አምሳኛ",
    "ስድሳኛ",
    "ሰባኛ",
    "ሰማንያኛ",
    "ዘጠናኛ",
    "መቶኛ",
    "ሺኛ",
    "ሚሊዮንኛ",
    "ቢሊዮንኛ",
    "ትሪሊዮንኛ",
]


def like_num(text):
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True

    text_lower = text.lower()
    if text_lower in _num_words:
        return True

    # Check ordinal number
    if text_lower in _ordinal_words:
        return True
    if text_lower.endswith("ኛ"):
        # "ኛ" is a single character, so strip exactly one character
        if text_lower[:-1].isdigit():
            return True

    return False


LEX_ATTRS = {LIKE_NUM: like_num}
spacy/lang/ti/punctuation.py (new file, 22 lines)
@@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals

from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
from ..char_classes import UNITS, ALPHA_UPPER

_list_punct = LIST_PUNCT + "፡ ። ፣ ፤ ፥ ፦ ፧".strip().split()

_suffixes = (
    _list_punct
    + LIST_ELLIPSES
    + LIST_QUOTES
    + [
        r"(?<=[0-9])\+",
        # Tigrinya is written from Left-To-Right
        r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
        r"(?<=[0-9])(?:{u})".format(u=UNITS),
        r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
    ]
)

TOKENIZER_SUFFIXES = _suffixes
spacy/lang/ti/stop_words.py (new file, 10 lines)
@@ -0,0 +1,10 @@
# coding: utf8
from __future__ import unicode_literals


# Stop words
STOP_WORDS = set(
    """
ግን ግና ንስኻ ንስኺ ንስኻትክን ንስኻትኩም ናትካ ናትኪ ናትክን ናትኩም
""".split()
)
spacy/lang/ti/tokenizer_exceptions.py (new file, 26 lines)
@@ -0,0 +1,26 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import ORTH, LEMMA, NORM, PRON_LEMMA


_exc = {}


for exc_data in [
    {ORTH: "ት/ቤት", LEMMA: "ትምህርት ቤት"},
    {ORTH: "ወ/ሮ", LEMMA: PRON_LEMMA, NORM: "ወይዘሮ"},
    {ORTH: "ወ/ሪ", LEMMA: PRON_LEMMA, NORM: "ወይዘሪት"},
]:
    _exc[exc_data[ORTH]] = [exc_data]


for orth in [
    "ዓ.ም.",
    "ኪ.ሜ.",
]:
    _exc[orth] = [{ORTH: orth}]


TOKENIZER_EXCEPTIONS = _exc
spacy/tests/conftest.py (modified)

@@ -31,6 +31,9 @@ def pytest_runtest_setup(item):
 def tokenizer():
     return get_lang_class("xx").Defaults.create_tokenizer()

+@pytest.fixture(scope="session")
+def am_tokenizer():
+    return get_lang_class("am").Defaults.create_tokenizer()

 @pytest.fixture(scope="session")
 def ar_tokenizer():

@@ -242,6 +245,9 @@ def th_tokenizer():
     pytest.importorskip("pythainlp")
     return get_lang_class("th").Defaults.create_tokenizer()

+@pytest.fixture(scope="session")
+def ti_tokenizer():
+    return get_lang_class("ti").Defaults.create_tokenizer()

 @pytest.fixture(scope="session")
 def tr_tokenizer():
spacy/tests/lang/am/__init__.py (new file, empty)
spacy/tests/lang/am/test_exception.py (new file, empty)
spacy/tests/lang/am/test_text.py (new file, 55 lines)
@@ -0,0 +1,55 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest
from spacy.lang.am.lex_attrs import like_num


def test_am_tokenizer_handles_long_text(am_tokenizer):
    text = """ሆሴ ሙጂካ በበጋ ወቅት በኦክስፎርድ ንግግር አንድያቀርቡ ሲጋበዙ ጭንቅላታቸው "ፈነዳ"።

“እጅግ ጥንታዊ” የእንግሊዝኛ ተናጋሪ ዩኒቨርስቲ፣ በአስር ሺዎች የሚቆጠሩ ዩሮዎችን ለተማሪዎች በማስተማር የሚያስከፍለው

እና ከማርጋሬት ታቸር እስከ ስቲቨን ሆኪንግ በአዳራሾቻቸው ውስጥ ንግግር ያደረጉበት የትምህርት ማዕከል፣ በሞንቴቪዴኦ

በሚገኘው የመንግስት ትምህርት ቤት የሰለጠኑትን የ81 ዓመቱ አዛውንት አገልግሎት ጠየቁ።"""
    tokens = am_tokenizer(text)

    assert len(tokens) == 56


@pytest.mark.parametrize(
    "text,length",
    [
        ("ሆሴ ሙጂካ ለምን ተመረጠ?", 5),
        ("“በፍፁም?”", 4),
        ("""አዎ! ሆዜ አርካዲዮ ቡንዲያ “እንሂድ” ሲል መለሰ።""", 11),
        ("እነሱ በግምት 10ኪ.ሜ. ሮጡ።", 7),
        ("እና ከዚያ ለምን...", 4),
    ],
)
def test_am_tokenizer_handles_cnts(am_tokenizer, text, length):
    tokens = am_tokenizer(text)
    assert len(tokens) == length


@pytest.mark.parametrize(
    "text,match",
    [
        ("10", True),
        ("1", True),
        ("10.000", True),
        ("1000", True),
        ("999,0", True),
        ("አንድ", True),
        ("ሁለት", True),
        ("ትሪሊዮን", True),
        ("ውሻ", False),
        (",", False),
        ("1/2", True),
    ],
)
def test_lex_attrs_like_number(am_tokenizer, text, match):
    tokens = am_tokenizer(text)
    assert len(tokens) == 1
    assert tokens[0].like_num == match
spacy/tests/lang/ti/__init__.py (new file, empty)
spacy/tests/lang/ti/test_exception.py (new file, empty)
spacy/tests/lang/ti/test_text.py (new file, 55 lines)
@@ -0,0 +1,55 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest
from spacy.lang.ti.lex_attrs import like_num


def test_ti_tokenizer_handles_long_text(ti_tokenizer):
    text = """ቻንስለር ጀርመን ኣንገላ መርከል ኣብታ ሃገር ቁጽሪ መትሓዝቲ ኮቪድ መዓልታዊ ክብረ መዝገብ ድሕሪ ምህራሙ- ጽኑዕ እገዳ ክግበር ጸዊዓ።

መርከል ሎሚ ንታሕታዋይ ባይቶ ሃገራ ክትገልጽ ከላ፡ ኣብ ወሳኒ ምዕራፍ ቃልሲ ኢና ዘለና-ዳሕራዋይ ማዕበል ካብቲ ቀዳማይ ክገድድ ይኽእል`ዩ ኢላ።

ትካል ምክልኻል ተላገብቲ ሕማማት ጀርመን፡ ኣብ ዝሓለፈ 24 ሰዓታት ኣብ ምልእቲ ጀርመር 590 ሰባት ብኮቪድ19 ምሟቶም ኣፍሊጡ`ሎ።

ቻንስለር ኣንጀላ መርከል ኣብ እዋን በዓላት ልደት ስድራቤታት ክተኣኻኸባ ዝፍቀደለን`ኳ እንተኾነ ድሕሪኡ ኣብ ዘሎ ግዜ ግን እቲ እገዳታት ክትግበር ትደሊ።"""
    tokens = ti_tokenizer(text)

    assert len(tokens) == 85


@pytest.mark.parametrize(
    "text,length",
    [
        ("ቻንስለር ጀርመን ኣንገላ መርከል፧", 5),
        ("“ስድራቤታት፧”", 4),
        ("""ኣብ እዋን በዓላት ልደት ስድራቤታት ክተኣኻኸባ ዝፍቀደለን`ኳ እንተኾነ።""", 9),
        ("ብግምት 10ኪ.ሜ. ጎይዩ።", 6),
        ("ኣብ ዝሓለፈ 24 ሰዓታት...", 5),
    ],
)
def test_ti_tokenizer_handles_cnts(ti_tokenizer, text, length):
    tokens = ti_tokenizer(text)
    assert len(tokens) == length


@pytest.mark.parametrize(
    "text,match",
    [
        ("10", True),
        ("1", True),
        ("10.000", True),
        ("1000", True),
        ("999,0", True),
        ("ሐደ", True),
        ("ክልተ", True),
        ("ትሪልዮን", True),
        ("ከልቢ", False),
        (",", False),
        ("1/2", True),
    ],
)
def test_lex_attrs_like_number(ti_tokenizer, text, match):
    tokens = ti_tokenizer(text)
    assert len(tokens) == 1
    assert tokens[0].like_num == match