spaCy/spacy/tests/lang/ht/test_exceptions.py
Jeff Adolphe 41e07772dc
Added Haitian Creole (ht) Language Support to spaCy (#13807)
This PR adds official support for Haitian Creole (ht) to spaCy's spacy/lang module.
It includes:

    Added all core language data files for spacy/lang/ht:
        tokenizer_exceptions.py
        punctuation.py
        lex_attrs.py
        syntax_iterators.py
        lemmatizer.py
        stop_words.py
        tag_map.py

    Unit tests for the tokenizer and noun chunking (test_tokenizer.py, test_noun_chunking.py, etc.). All 58 tests I added under spacy/tests/lang/ht pass with pytest.

    Basic tokenizer rules adapted for Haitian Creole orthography and informal contractions.

    Custom like_num attribute supporting Haitian number formats (e.g., "3yèm").

    Support for common informal apostrophe usage (e.g., "m'ap", "n'ap", "di'm").

    Ensured no breakages in other language modules.

    Followed spaCy coding style (PEP8, Black).

This provides a foundation for Haitian Creole NLP development using spaCy.
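
To illustrate the like_num item in the list above, here is a minimal sketch of how a digit-plus-suffix ordinal such as "3yèm" could be recognized. The function name like_num_sketch, the small number-word set, and the exact rules are assumptions made for this example; this is not the code in spacy/lang/ht/lex_attrs.py.

```python
# Hypothetical helper for illustration only; not spaCy's implementation.
_NUM_WORDS = {"youn", "de", "twa", "kat", "senk"}  # small illustrative subset
_ORDINAL_SUFFIX = "yèm"


def like_num_sketch(text: str) -> bool:
    """Return True if text looks like a number under this sketch's assumed rules."""
    text = text.strip().lower()
    # Plain digits, tolerating thousands separators and a decimal point.
    if text.replace(",", "").replace(".", "").isdigit():
        return True
    # Digit-plus-suffix ordinals such as "3yèm".
    if text.endswith(_ORDINAL_SUFFIX) and text[: -len(_ORDINAL_SUFFIX)].isdigit():
        return True
    # Spelled-out number words from the illustrative set.
    return text in _NUM_WORDS
```

Under these assumptions, like_num_sketch("3yèm") and like_num_sketch("twa") are true, while a bare "yèm" or an ordinary word such as "pale" is not.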
2025-05-28 17:23:38 +02:00

import pytest


def test_ht_tokenizer_handles_basic_contraction(ht_tokenizer):
    text = "m'ap ri"
    tokens = ht_tokenizer(text)
    assert len(tokens) == 3
    assert tokens[0].text == "m'"
    assert tokens[1].text == "ap"
    assert tokens[2].text == "ri"

    text = "mwen di'w non!"
    tokens = ht_tokenizer(text)
    assert len(tokens) == 5
    assert tokens[0].text == "mwen"
    assert tokens[1].text == "di"
    assert tokens[2].text == "'w"
    assert tokens[3].text == "non"
    assert tokens[4].text == "!"


@pytest.mark.parametrize("text", ["Dr."])
def test_ht_tokenizer_handles_basic_abbreviation(ht_tokenizer, text):
    tokens = ht_tokenizer(text)
    assert len(tokens) == 1
    assert tokens[0].text == text


def test_ht_tokenizer_full_sentence(ht_tokenizer):
    text = "Si'm ka vini, m'ap pale ak li."
    tokens = [t.text for t in ht_tokenizer(text)]
    assert tokens == [
        "Si", "'m", "ka", "vini", ",", "m'", "ap", "pale", "ak", "li", "."
    ]
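

The tests above encode a splitting convention: a leading clitic keeps its apostrophe ("m'ap" becomes "m'" + "ap"), and a trailing clitic keeps its apostrophe too ("di'w" becomes "di" + "'w"). The sketch below reproduces that convention with a plain regular expression so the expected token boundaries can be checked without spaCy. The name toy_tokenize and the clitic letter set are assumptions for this example; spaCy's Haitian Creole tokenizer relies on its own exception and punctuation rules rather than this pattern.

```python
import re

# Hypothetical set of single-letter pronoun clitics, assumed for this sketch only.
_CLITICS = "mnlwy"

_TOKEN_RE = re.compile(
    rf"""
    [{_CLITICS}]'(?=\w)    # leading clitic keeps the apostrophe: m' in m'ap
    | '[{_CLITICS}]\b      # trailing clitic keeps the apostrophe: 'w in di'w
    | \w+                  # ordinary word
    | [^\w\s]              # any other single punctuation mark
    """,
    re.VERBOSE | re.IGNORECASE,
)


def toy_tokenize(text):
    """Split text into token strings using the toy pattern above."""
    return _TOKEN_RE.findall(text)


# The same inputs the tests use.
print(toy_tokenize("m'ap ri"))
print(toy_tokenize("mwen di'w non!"))
print(toy_tokenize("Si'm ka vini, m'ap pale ak li."))
```

This toy pattern has no notion of abbreviations, so it splits "Dr." into "Dr" and ".", whereas the abbreviation test above expects a single token. Handling that case needs an exceptions table, which is what tokenizer_exceptions.py provides in the real module.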