spaCy/spacy/lang/lb/tokenizer_exceptions.py

# coding: utf8
from __future__ import unicode_literals

from ...symbols import ORTH, LEMMA, NORM

# TODO: treat other apostrophes within words as part of the word,
# e.g. [op d'mannst], [fir d'éischt] (= exceptions)

_exc = {}

# TODO: translate the remaining entries and delete any that are not needed
for exc_data in [
    {ORTH: "'t", LEMMA: "et", NORM: "et"},
    {ORTH: "'T", LEMMA: "et", NORM: "et"},
    {ORTH: "wgl.", LEMMA: "wannechgelift", NORM: "wannechgelift"},
    {ORTH: "M.", LEMMA: "Monsieur", NORM: "Monsieur"},
    {ORTH: "Mme.", LEMMA: "Madame", NORM: "Madame"},
    {ORTH: "Dr.", LEMMA: "Dokter", NORM: "Dokter"},
    {ORTH: "Tel.", LEMMA: "Telefon", NORM: "Telefon"},
    {ORTH: "asw.", LEMMA: "an sou weider", NORM: "an sou weider"},
    {ORTH: "etc.", LEMMA: "et cetera", NORM: "et cetera"},
    {ORTH: "bzw.", LEMMA: "bezéiungsweis", NORM: "bezéiungsweis"},
    {ORTH: "Jan.", LEMMA: "Januar", NORM: "Januar"},
]:
    _exc[exc_data[ORTH]] = [exc_data]


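# Abbreviations that only need to stay a single token (to be extended).
# "Dr." and "etc." are defined above with LEMMA and NORM, so they are not
# repeated here: re-adding them would overwrite those entries.
for orth in [
    "z.B.",
    "Dipl.",
    "i.e.",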
    "o.k.",
    "O.K.",
    "p.a.",
    "p.s.",
    "P.S.",
    "phil.",
    "q.e.d.",
    "R.I.P.",
    "rer.",
    "sen.",
    "ë.a.",
    "U.S.",
    "U.S.A.",
]:
    _exc[orth] = [{ORTH: orth}]

TOKENIZER_EXCEPTIONS = _exc
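
As a rough illustration of the table the two loops above build, the sketch below reproduces the pattern without importing spaCy. The string constants `ORTH`, `LEMMA`, and `NORM` are stand-ins for the symbols imported from `...symbols`, and `build_exceptions` is a hypothetical helper, not part of spaCy. It uses `setdefault` for the bare entries, so a string that appears in both groups keeps its lemma and norm.

```python
# Stand-ins for spaCy's attribute symbols (assumption: the real module
# imports these from spacy.symbols rather than using plain strings).
ORTH, LEMMA, NORM = "ORTH", "LEMMA", "NORM"


def build_exceptions(annotated, bare):
    """Map each exception string to a one-token analysis.

    annotated: dicts that carry ORTH plus LEMMA/NORM.
    bare: strings that only need to stay a single token.
    """
    exc = {}
    for exc_data in annotated:
        exc[exc_data[ORTH]] = [exc_data]
    for orth in bare:
        # setdefault keeps an annotated entry if the string appears twice.
        exc.setdefault(orth, [{ORTH: orth}])
    return exc


table = build_exceptions(
    annotated=[{ORTH: "Dr.", LEMMA: "Dokter", NORM: "Dokter"}],
    bare=["z.B.", "Dr."],
)
print(table["Dr."])   # the annotated entry is kept
print(table["z.B."])  # the bare entry carries only ORTH
```

Plain assignment in the second loop, as in the module above, would instead replace the annotated `"Dr."` entry with one that has no lemma or norm.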