spaCy/spacy/lang/lb/tokenizer_exceptions.py

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...symbols import ORTH, LEMMA, NORM
from ...util import update_exc


# TODO: treat other apostrophes within words as part of the word,
# e.g. [op d'mannst], [fir d'éischt] (= exceptions)

_exc = {}

# Exceptions with LEMMA and NORM overrides.
# TODO: translate the remaining entries and delete those that are not needed
for exc_data in [
    {ORTH: "’t", LEMMA: "et", NORM: "et"},
    {ORTH: "’T", LEMMA: "et", NORM: "et"},
    {ORTH: "'t", LEMMA: "et", NORM: "et"},
    {ORTH: "'T", LEMMA: "et", NORM: "et"},
    {ORTH: "wgl.", LEMMA: "wannechgelift", NORM: "wannechgelift"},
    {ORTH: "M.", LEMMA: "Monsieur", NORM: "Monsieur"},
    {ORTH: "Mme.", LEMMA: "Madame", NORM: "Madame"},
    {ORTH: "Dr.", LEMMA: "Dokter", NORM: "Dokter"},
    {ORTH: "Tel.", LEMMA: "Telefon", NORM: "Telefon"},
    {ORTH: "asw.", LEMMA: "an sou weider", NORM: "an sou weider"},
    {ORTH: "etc.", LEMMA: "et cetera", NORM: "et cetera"},
    {ORTH: "bzw.", LEMMA: "bezéiungsweis", NORM: "bezéiungsweis"},
    {ORTH: "Jan.", LEMMA: "Januar", NORM: "Januar"},
]:
    _exc[exc_data[ORTH]] = [exc_data]


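# Abbreviations kept as single tokens, with no LEMMA or NORM override.
# "Dr." and "etc." are defined above and are not repeated here, since a second
# assignment to the same key would overwrite their LEMMA and NORM.
# To be extended.
for orth in [
    "z.B.",
    "Dipl.",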
    "i.e.",
    "o.k.",
    "O.K.",
    "p.a.",
    "p.s.",
    "P.S.",
    "phil.",
    "q.e.d.",
    "R.I.P.",
    "rer.",
    "sen.",
    "ë.a.",
    "U.S.",
    "U.S.A.",
]:
    _exc[orth] = [{ORTH: orth}]

TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
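
The pattern used in this file can be tried outside spaCy with a small stand-alone sketch. It makes several assumptions: plain strings stand in for the `ORTH`, `LEMMA` and `NORM` symbols, `merge_exceptions` is a hypothetical, simplified stand-in for `update_exc` (it is not spaCy's implementation), and the base entry is made up for illustration.

```python
# Assumption: plain strings stand in for spaCy's ORTH / LEMMA / NORM symbols.
ORTH, LEMMA, NORM = "ORTH", "LEMMA", "NORM"


def merge_exceptions(base, additions):
    """Hypothetical stand-in for update_exc: validate additions, then merge."""
    merged = dict(base)
    for key, tokens in additions.items():
        # Each key must equal the concatenation of its tokens' ORTH values.
        described = "".join(token[ORTH] for token in tokens)
        if key != described:
            raise ValueError(f"{key!r} does not match its tokens: {described!r}")
    merged.update(additions)
    return merged


exc = {}

# Entries with attribute overrides, keyed by their surface form.
for exc_data in [
    {ORTH: "Dr.", LEMMA: "Dokter", NORM: "Dokter"},
    {ORTH: "asw.", LEMMA: "an sou weider", NORM: "an sou weider"},
]:
    exc[exc_data[ORTH]] = [exc_data]

# Entries that only keep the string together as one token.
for orth in ["z.B.", "U.S.A."]:
    exc[orth] = [{ORTH: orth}]

base = {":)": [{ORTH: ":)"}]}  # made-up base entry, for illustration only
exceptions = merge_exceptions(base, exc)
```

Because both loops write into the same dictionary, a string listed in both would end up with only the entry from the later loop, so each surface form is best defined once.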