mirror of
https://github.com/explosion/spaCy.git
synced 2025-07-13 17:52:31 +03:00
This PR adds official support for Haitian Creole (ht) to spaCy's spacy/lang module. It includes:

- Core language data files for spacy/lang/ht: tokenizer_exceptions.py, punctuation.py, lex_attrs.py, syntax_iterators.py, lemmatizer.py, stop_words.py, tag_map.py.
- Unit tests for the tokenizer and noun chunking (test_tokenizer.py, test_noun_chunking.py, etc.). All 58 tests I wrote under spacy/tests/lang/ht pass with pytest.
- Basic tokenizer rules adapted for Haitian Creole orthography and informal contractions.
- A custom like_num attribute supporting Haitian number formats (e.g., "3yèm").
- Support for common informal apostrophe usage (e.g., "m'ap", "n'ap", "di'm").

No other language modules were broken, and the code follows spaCy's coding style (PEP8, Black).

This provides a foundation for Haitian Creole NLP development using spaCy.
51 lines
761 B
Python
STOP_WORDS = set(
    """
a ak an ankò ant apre ap atò avan avanlè
byen bò byenke

chak

de depi deja deja

e en epi èske

fò fòk

gen genyen

ki kisa kilès kote koukou konsa konbyen konn konnen kounye kouman

la l laa le lè li lye lò

m m' mwen

nan nap nou n'

ou oumenm

pa paske pami pandan pito pou pral preske pwiske

se selman si sou sòt

ta tap tankou te toujou tou tan tout toutotan twòp tèl

w w' wi wè

y y' yo yon yonn

non o oh eh

sa san si swa si

men mèsi oswa osinon

    """
    .split()
)

# Add common contractions in each apostrophe variant (straight and curly)
contractions = ["m'", "n'", "w'", "y'", "l'", "t'", "k'"]
for apostrophe in ["'", "’", "‘"]:
    for word in contractions:
        STOP_WORDS.add(word.replace("'", apostrophe))
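The loop above can be tried in isolation. The following is a minimal sketch, not part of the file: it uses a small illustrative subset of the word list, and the names `BASE_STOP_WORDS`, `expand_contractions`, and `is_stop` are hypothetical helpers introduced only for this example. It shows how each contraction ends up in the set once per apostrophe glyph, so a lookup succeeds whether the input text uses a straight or a curly apostrophe.

```python
# Illustrative subset only (assumption); the real list is STOP_WORDS above.
BASE_STOP_WORDS = {"m", "mwen", "nou", "yo"}
CONTRACTIONS = ["m'", "n'"]
# Straight apostrophe, right curly quote, left curly quote.
APOSTROPHES = ["'", "\u2019", "\u2018"]


def expand_contractions(contractions, apostrophes):
    """Return every contraction spelled with every apostrophe glyph."""
    return {c.replace("'", a) for c in contractions for a in apostrophes}


def is_stop(token, stop_words):
    """Case-insensitive membership test (hypothetical helper)."""
    return token.lower() in stop_words


stop_words = BASE_STOP_WORDS | expand_contractions(CONTRACTIONS, APOSTROPHES)

print(sorted(stop_words))
print(is_stop("M\u2019", stop_words))  # curly-apostrophe form of "m'"
print(is_stop("manje", stop_words))    # a content word, not in the subset
```

Building the variants with a set comprehension keeps the expansion free of side effects, whereas the file itself mutates `STOP_WORDS` in place; both approaches produce the same entries.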