spaCy/spacy/lang/tl/tokenizer_exceptions.py
Brixjohn 52f3c95004 Added alpha support for Tagalog language (#3062)
I have added alpha support for the Tagalog language from the Philippines. It is the basis for the country's national language, Filipino. I have based the format heavily on the EN and ES languages.

I have provided several words in the lemmatizer lookup table, added stop words from a source, translated numeric words to their Tagalog counterparts, added some tokenizer exceptions, and kept the tag map the same as the English language.
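For context on the numeric words: spaCy alpha languages typically expose them through a lex_attrs.py module with a LIKE_NUM predicate. The sketch below follows the EN/ES layout (a module that would live at spacy/lang/tl/lex_attrs.py); the numerals shown are only an illustrative handful, not necessarily the exact list submitted here:

# coding: utf8
from __future__ import unicode_literals

from ...attrs import LIKE_NUM

# Illustrative subset of Tagalog numerals (one through ten).
_num_words = [
    "isa", "dalawa", "tatlo", "apat", "lima",
    "anim", "pito", "walo", "siyam", "sampu",
]


def like_num(text):
    # Normalise digit separators before the digit check.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as "3/4" also count as number-like.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text.lower() in _num_words


LEX_ATTRS = {LIKE_NUM: like_num}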

While the alpha language passed the preliminary tests you provided, I think it needs more data to be useful in most cases.

* Added alpha support for Tagalog language

* Edited contributor template

* Included SCA; Reverted templates

* Fixed SCA template

* Fixed changes in SCA template
2018-12-18 13:08:38 +01:00

# coding: utf8
from __future__ import unicode_literals

# Import further symbols here if you need them.
from ...symbols import ORTH, LEMMA, TAG, NORM, ADP, DET

# Add tokenizer exceptions
# Documentation: https://spacy.io/docs/usage/adding-languages#tokenizer-exceptions
# Feel free to use custom logic to generate repetitive exceptions more efficiently.
# If an exception is split into more than one token, the ORTH values combined
# always need to match the original string.
# Exceptions should be added in the following format:

_exc = {
    "tayo'y": [
        {ORTH: "tayo", LEMMA: "tayo"},
        {ORTH: "'y", LEMMA: "ay"}],
    "isa'y": [
        {ORTH: "isa", LEMMA: "isa"},
        {ORTH: "'y", LEMMA: "ay"}],
    "baya'y": [
        {ORTH: "baya", LEMMA: "bayan"},
        {ORTH: "'y", LEMMA: "ay"}],
    "sa'yo": [
        {ORTH: "sa", LEMMA: "sa"},
        {ORTH: "'yo", LEMMA: "iyo"}],
    "ano'ng": [
        {ORTH: "ano", LEMMA: "ano"},
        {ORTH: "'ng", LEMMA: "ang"}],
    "siya'y": [
        {ORTH: "siya", LEMMA: "siya"},
        {ORTH: "'y", LEMMA: "ay"}],
    "nawa'y": [
        {ORTH: "nawa", LEMMA: "nawa"},
        {ORTH: "'y", LEMMA: "ay"}],
    "papa'no": [
        {ORTH: "papa'no", LEMMA: "papaano"}],
    "'di": [
        {ORTH: "'di", LEMMA: "hindi"}],
}

# To keep things clean and readable, it's recommended to only declare the
# TOKENIZER_EXCEPTIONS at the bottom:
TOKENIZER_EXCEPTIONS = _exc
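
A quick way to verify the exceptions is to run text through the tokenizer. A minimal sketch, assuming the new language class is importable as spacy.lang.tl.Tagalog once this PR is merged:

from spacy.lang.tl import Tagalog

nlp = Tagalog()
doc = nlp("siya'y mabait")
# The "siya'y" exception splits the contraction and assigns LEMMA "ay" to the clitic.
print([t.text for t in doc])  # ['siya', "'y", 'mabait']

Note that tokenizer exceptions match the exact ORTH string, so capitalised variants such as "Siya'y" are not covered by the lowercase entries above.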