Initial commit: New language Luxembourgish (lb) (#4424)

* new language: Luxembourgish (lb)
* update
* update
* Update and rename .github/CONTRIBUTOR_AGREEMENT.md to .github/contributors/PeterGilles.md
* Update and rename .github/contributors/PeterGilles.md to .github/CONTRIBUTOR_AGREEMENT.md
* Update norm_exceptions.py
* Delete README.md
* moved test_lemma.py
* deactivated 'lemma_lookup = LOOKUP'
* update
* Update conftest.py
* update
* tests updated
* import unicode_literals
* Update spacy/tests/lang/lb/test_text.py
  Co-Authored-By: Ines Montani <ines@ines.io>
* Create PeterGilles.md
2026-02-17 04:30:49 +03:00 · 2019-10-14 12:27:50 +02:00 · 2019-10-14 12:27:50 +02:00 · 428887b8f2
commit 428887b8f2
parent 98a961a60e
14 changed files with 608 additions and 1 deletions
--- a/.github/contributors/PeterGilles.md
+++ b/.github/contributors/PeterGilles.md
@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [X] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           |  Peter Gilles        |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           |  10.10.              |
+| GitHub username                |  Peter Gilles        |
+| Website (optional)             |                      |
--- a/spacy/lang/lb/__init__.py
+++ b/spacy/lang/lb/__init__.py
@ -0,0 +1,37 @@
+# coding: utf8
+
+from __future__ import unicode_literals
+
+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS 
+from .norm_exceptions import NORM_EXCEPTIONS
+from .punctuation import TOKENIZER_INFIXES
+from .lex_attrs import LEX_ATTRS
+from .tag_map import TAG_MAP
+from .stop_words import STOP_WORDS
+#from .lemmatizer import LOOKUP
+#from .syntax_iterators import SYNTAX_ITERATORS
+
+from ..tokenizer_exceptions import BASE_EXCEPTIONS
+from ..norm_exceptions import BASE_NORMS
+from ...language import Language
+from ...attrs import LANG, NORM
+from ...util import update_exc, add_lookups
+
+
+class LuxembourgishDefaults(Language.Defaults):
+    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
+    lex_attr_getters.update(LEX_ATTRS)
+    lex_attr_getters[LANG] = lambda text: 'lb'
+    # Check the language-specific norm exceptions first, then the shared base norms.
+    lex_attr_getters[NORM] = add_lookups(
+        Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
+    )
+    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
+    stop_words = STOP_WORDS
+    tag_map = TAG_MAP
+    # Not activated yet: custom suffixes and the lookup lemmatizer.
+    #suffixes = TOKENIZER_SUFFIXES
+    #lemma_lookup = LOOKUP
+
+
+class Luxembourgish(Language):
+    lang = 'lb'
+    Defaults = LuxembourgishDefaults
+
+
+__all__ = ['Luxembourgish']
--- a/spacy/lang/lb/examples.py
+++ b/spacy/lang/lb/examples.py
@ -0,0 +1,18 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+"""
+Example sentences to test spaCy and its language models.
+
+>>> from spacy.lang.lb.examples import sentences
+>>> docs = nlp.pipe(sentences)
+"""
+
+sentences = [
+    "An der Zäit hunn sech den Nordwand an d’Sonn gestridden, wie vun hinnen zwee wuel méi staark wier, wéi e Wanderer, deen an ee waarme Mantel agepak war, iwwert de Wee koum.",
+    "Si goufen sech eens, dass deejéinege fir de Stäerkste gëlle sollt, deen de Wanderer forcéiere géif, säi Mantel auszedoen.",
+    "Den Nordwand huet mat aller Force geblosen, awer wat e méi geblosen huet, wat de Wanderer sech méi a säi Mantel agewéckelt huet.",
+    "Um Enn huet den Nordwand säi Kampf opginn.",
+    "Dunn huet d’Sonn d’Loft mat hire frëndleche Strale gewiermt, a schonn no kuerzer Zäit huet de Wanderer säi Mantel ausgedoen.",
+    "Do huet den Nordwand missen zouginn, dass d’Sonn vun hinnen zwee de Stäerkste wier."
+]
--- a/spacy/lang/lb/lex_attrs.py
+++ b/spacy/lang/lb/lex_attrs.py
@ -0,0 +1,41 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...attrs import LIKE_NUM
+
+
+_num_words = set("""
+null eent zwee dräi véier fënnef sechs siwen aacht néng zéng eelef zwielef dräizéng
+véierzéng foffzéng siechzéng siwwenzéng uechtzéng uechzéng nonnzéng nongzéng zwanzeg
+drësseg véierzeg foffzeg sechzeg siechzeg siwenzeg achtzeg achzeg uechtzeg uechzeg nonnzeg
+honnert dausend millioun milliard billioun billiard trillioun trilliard
+""".split())
+
+_ordinal_words = set("""
+éischten zweeten drëtten véierten fënneften sechsten siwenten aachten néngten zéngten eeleften
+zwieleften dräizéngten véierzéngten foffzéngten siechzéngten uechtzéngten uechzéngten nonnzéngten nongzéngten zwanzegsten
+drëssegsten véierzegsten foffzegsten siechzegsten siwenzegsten uechzegsten nonnzegsten
+honnertsten dausendsten milliounsten
+milliardsten billiounsten billiardsten trilliounsten trilliardsten
+""".split())
+
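+def like_num(text):
+    """Check whether text resembles a number."""
+    # Ignore thousands and decimal separators, e.g. "1.000,5".
+    text = text.replace(',', '').replace('.', '')
+    if text.isdigit():
+        return True
+    # Simple fractions such as "3/4".
+    if text.count('/') == 1:
+        num, denom = text.split('/')
+        if num.isdigit() and denom.isdigit():
+            return True
+    # Compare case-insensitively so capitalised forms such as "Millioun" match.
+    text_lower = text.lower()
+    if text_lower in _num_words or text_lower in _ordinal_words:
+        return True
+    return False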
+
+
+LEX_ATTRS = {
+    LIKE_NUM: like_num
+}
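Review note: the checks in `like_num` can be tried without spaCy. The following is a minimal standalone sketch, assuming a small hypothetical word list (`NUM_WORDS`) in place of the full `_num_words` and `_ordinal_words` sets; the case-insensitive comparison is an assumption of this sketch, made so that capitalised number nouns also match.

```python
# Standalone sketch of the like_num checks; NUM_WORDS is a small,
# hypothetical subset of the word lists in lex_attrs.py.
NUM_WORDS = {"null", "eent", "zwee", "dräi", "honnert", "éischten"}


def like_num(text):
    """Return True if text looks like a number."""
    # Drop thousands/decimal separators so "1.000,5" reduces to digits.
    stripped = text.replace(",", "").replace(".", "")
    if stripped.isdigit():
        return True
    # Simple fractions such as "3/4".
    if stripped.count("/") == 1:
        num, denom = stripped.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Assumption of this sketch: compare case-insensitively.
    return stripped.lower() in NUM_WORDS


print(like_num("1.000,5"))  # True
print(like_num("3/4"))      # True
print(like_num("Zwee"))     # True
print(like_num("Mantel"))   # False
```

In spaCy itself the function is registered under `LIKE_NUM` and surfaces as `token.like_num`; the sketch only mirrors the string logic.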
--- a/spacy/lang/lb/norm_exceptions.py
+++ b/spacy/lang/lb/norm_exceptions.py
@ -0,0 +1,20 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+# TODO
+# norm exceptions: find a way to deal with the many spelling variants (vläicht = vlaicht, vleicht, viläicht, viläischt, etc.)
+
+# here one could include the most common spelling mistakes
+
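+_exc = {
+    "datt": "dass",
+    # Other spellings of this abbreviation (e.g. "weg.") need their own key:
+    # a dict literal keeps only the last value given for a repeated key.
+    "wgl.": "wegl.",
+    "vläicht": "viläicht",
+}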
+
+
+NORM_EXCEPTIONS = {}
+
+for string, norm in _exc.items():
+    NORM_EXCEPTIONS[string] = norm
+    NORM_EXCEPTIONS[string.title()] = norm
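Review note: the loop above registers each exception twice, once as written and once title-cased, so that sentence-initial spellings normalise the same way. A minimal sketch of that expansion, assuming a hypothetical two-entry table (`exc`) in place of `_exc`:

```python
# Hypothetical two-entry table standing in for _exc.
exc = {"datt": "dass", "vläicht": "viläicht"}

norm_exceptions = {}
for string, norm in exc.items():
    norm_exceptions[string] = norm
    # str.title() upper-cases the first letter of each word and lower-cases
    # the rest, which covers the sentence-initial spelling.
    norm_exceptions[string.title()] = norm

print(sorted(norm_exceptions))  # ['Datt', 'Vläicht', 'datt', 'vläicht']
```

All-caps spellings are not covered by this expansion and would need their own entries.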
--- a/spacy/lang/lb/punctuation.py
+++ b/spacy/lang/lb/punctuation.py
@ -0,0 +1,25 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ..char_classes import LIST_ELLIPSES, LIST_ICONS
+from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
+
+
+_quotes = CONCAT_QUOTES.replace("'", "")
+
+_infixes = (
+    LIST_ELLIPSES
+    + LIST_ICONS
+    + [
+        r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
+        r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
+        r'(?<=[{a}])[:;<>=](?=[{a}])'.format(a=ALPHA),
+        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
+        r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
+        r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
+        r"(?<=[0-9])-(?=[0-9])",
+    ]
+)
+
+
+TOKENIZER_INFIXES = _infixes
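Review note: each infix pattern uses a lookbehind and a lookahead, so only the separator itself is matched and the surrounding characters stay in their tokens. The sketch below tries a few of the patterns with Python's `re` module. It assumes plain ASCII ranges in place of spaCy's `ALPHA`, `ALPHA_LOWER` and `ALPHA_UPPER` classes (which cover far more characters), and `infix_hits` is a hypothetical helper, not part of spaCy.

```python
import re

# Assumption: plain ASCII classes stand in for spaCy's ALPHA, ALPHA_LOWER
# and ALPHA_UPPER character classes.
ALPHA_LOWER = "a-z"
ALPHA_UPPER = "A-Z"
ALPHA = ALPHA_LOWER + ALPHA_UPPER

infix_patterns = [
    r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
    r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
    r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
    r"(?<=[0-9])-(?=[0-9])",
]
infix_re = re.compile("|".join(infix_patterns))


def infix_hits(text):
    """Return the infix separators matched in text (hypothetical helper)."""
    return [m.group() for m in infix_re.finditer(text)]


print(infix_hits("Enn.Den"))    # ['.']
print(infix_hits("jo,nee"))     # [',']
print(infix_hits("2019-10"))    # ['-']
print(infix_hits("Nord-Wand"))  # []
```

A single hyphen between letters is deliberately left alone by these patterns, so hyphenated words stay together.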
--- a/spacy/lang/lb/stop_words.py
+++ b/spacy/lang/lb/stop_words.py
@ -0,0 +1,212 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+STOP_WORDS = set("""
+a
+à
+äis
+är
+ärt
+äert
+ären
+all
+allem
+alles
+alleguer
+als
+also
+am
+an
+anerefalls
+ass
+aus
+awer
+bei
+beim
+bis
+d'
+dach
+datt
+däin
+där
+dat
+de
+dee
+den
+deel
+deem
+deen
+deene
+déi
+deng
+denger
+dem
+der
+dësem
+di
+dir
+do
+da
+dann
+domat
+dozou
+drop
+du
+duerch
+duerno
+e
+ee
+em
+een
+eent
+ë
+en
+ënner
+ëm
+ech
+eis
+eise
+eisen
+eiser
+eises
+eisereen
+esou
+eng
+enger
+engem
+entweder
+et
+eréischt
+falls
+fir
+géint
+géif
+gëtt
+gët
+geet
+gi
+ginn
+gouf
+gouff
+goung
+hat
+haten
+hatt
+hätt
+hei
+hu
+huet
+hun
+hunn
+hiren
+hien
+hin
+hier
+hir
+jidderen
+jiddereen
+jiddwereen
+jiddereng
+jiddwerengen
+jo
+ins
+iech
+iwwer
+kann
+kee
+keen
+kënne
+kënnt
+kéng
+kéngen
+kéngem
+koum
+kuckt
+mam
+mat
+ma
+mä
+mech
+méi
+mécht
+meng
+menger
+mer
+mir
+muss
+nach
+nämmlech
+nämmelech
+näischt
+nawell
+nëmme
+nëmmen
+net
+nees
+nee
+no
+nu
+nom
+och
+oder
+ons
+onsen
+onser
+onsereen
+onst
+om
+op
+ouni
+säi
+säin
+schonn
+schonns
+si
+sid
+sie
+se
+sech
+seng
+senge
+sengem
+senger
+selwecht
+selwer
+sinn
+sollten
+souguer
+sou
+soss
+sot
+'t
+tëscht
+u
+un
+um
+virdrun
+vu
+vum
+vun
+wann
+war
+waren
+was
+wat
+wëllt
+weider
+wéi
+wéini
+wéinst
+wi
+wollt
+wou
+wouhin
+zanter
+ze
+zu
+zum
+zwar
+""".split())
--- a/spacy/lang/lb/tag_map.py
+++ b/spacy/lang/lb/tag_map.py
@ -0,0 +1,28 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...symbols import POS, PUNCT, ADJ, CONJ, SCONJ, NUM, DET, ADV, ADP, X, VERB
+from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX
+
+# TODO: the tag map still uses POS tags from an internal training set.
+# These tags have to be changed to match those of Universal Dependencies.
+
+TAG_MAP = {
+    "$": {POS: PUNCT},
+    "ADJ": {POS: ADJ},
+    "AV": {POS: ADV},
+    "APPR": {POS: ADP, "AdpType": "prep"},
+    "APPRART": {POS: ADP, "AdpType": "prep", "PronType": "art"},
+    "D": {POS: DET, "PronType": "art"},
+    "KO": {POS: CONJ},
+    "N": {POS: NOUN},
+    "P": {POS: ADV},
+    "TRUNC": {POS: X, "Hyph": "yes"},
+    "AUX": {POS: AUX},
+    "V": {POS: VERB},
+    "MV": {POS: VERB, "VerbType": "mod"},
+    "PTK": {POS: PART},
+    "INTER": {POS: PART},
+    "NUM": {POS: NUM},
+    "_SP": {POS: SPACE},
+}
--- a/spacy/lang/lb/tokenizer_exceptions.py
+++ b/spacy/lang/lb/tokenizer_exceptions.py
@ -0,0 +1,47 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA
+from ..punctuation import TOKENIZER_PREFIXES
+
+# TODO
+# tokenize cliticised definite article "d'" as token of its own: d'Kanner > [d'] [Kanner]
+# treat other apostrophes within words as part of the word: [op d'mannst], [fir d'éischt] (= exceptions)
+
+# How should the tokenisation exception for the articles d' / D' be written? This one is not working.
+_prefixes = [prefix for prefix in TOKENIZER_PREFIXES if prefix not in ["d'", "D'", "d’", "D’", r"\' "]]
+
+
+_exc = {
+    "d'mannst": [
+        {ORTH: "d'", LEMMA: "d'"},
+        {ORTH: "mannst", LEMMA: "mann", NORM: "mann"}],
+    "d'éischt": [
+        {ORTH: "d'", LEMMA: "d'"},
+        {ORTH: "éischt", LEMMA: "éischt", NORM: "éischt"}]
+}
+
+# translate / delete what is not necessary
+# what does PRON_LEMMA mean?
+for exc_data in [
+    {ORTH: "wgl.", LEMMA: "wann ech gelift", NORM: "wann ech gelift"},
+    {ORTH: "M.", LEMMA: "Monsieur", NORM: "Monsieur"},
+    {ORTH: "Mme.", LEMMA: "Madame", NORM: "Madame"},
+    {ORTH: "Dr.", LEMMA: "Dokter", NORM: "Dokter"},
+    {ORTH: "Tel.", LEMMA: "Telefon", NORM: "Telefon"},
+    {ORTH: "asw.", LEMMA: "an sou weider", NORM: "an sou weider"},
+    {ORTH: "etc.", LEMMA: "et cetera", NORM: "et cetera"},
+    {ORTH: "bzw.", LEMMA: "bezéiungsweis", NORM: "bezéiungsweis"},
+    {ORTH: "Jan.", LEMMA: "Januar", NORM: "Januar"}]:
+    _exc[exc_data[ORTH]] = [exc_data]
+
+
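+# to be extended
+# Abbreviations that already have a LEMMA/NORM entry above ("Dr.", "etc.")
+# are not repeated here, since a second assignment would overwrite them.
+for orth in [
+    "z.B.", "Dipl.", "i.e.", "o.k.", "O.K.", "p.a.", "p.s.", "P.S.", "phil.",
+    "q.e.d.", "R.I.P.", "rer.", "sen.", "ë.a.", "U.S.", "U.S.A."]:
+    _exc[orth] = [{ORTH: orth}]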
+
+
+TOKENIZER_PREFIXES = _prefixes
+TOKENIZER_EXCEPTIONS = _exc
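Review note: the exception table is filled in two passes that both assign by key, so an orth listed in both passes ends up with whatever the later pass wrote. The sketch below is one possible way to guard against that, using `dict.setdefault`. Plain string keys stand in for spaCy's `ORTH` and `LEMMA` symbols, which is an assumption made to keep the example self-contained.

```python
# Plain string keys stand in for spaCy's ORTH / LEMMA symbols (assumption).
exc = {}

# First pass: abbreviations with an expanded lemma.
for data in [
    {"orth": "Dr.", "lemma": "Dokter"},
    {"orth": "Tel.", "lemma": "Telefon"},
]:
    exc[data["orth"]] = [data]

# Second pass: abbreviations that only need to stay a single token.
for orth in ["z.B.", "Dr."]:
    # setdefault only inserts when the key is missing, so the "Dr." entry
    # from the first pass keeps its lemma.
    exc.setdefault(orth, [{"orth": orth}])

print(exc["Dr."])   # [{'orth': 'Dr.', 'lemma': 'Dokter'}]
print(exc["z.B."])  # [{'orth': 'z.B.'}]
```

Plain assignment in the second pass would instead replace the richer entry with the bare one.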
--- a/spacy/tests/conftest.py
+++ b/spacy/tests/conftest.py
@ -134,7 +134,10 @@ def ko_tokenizer():
    pytest.importorskip("natto")
    return get_lang_class("ko").Defaults.create_tokenizer()

-
+@pytest.fixture(scope="session")
+def lb_tokenizer():
+    return get_lang_class("lb").Defaults.create_tokenizer()
+    
@pytest.fixture(scope="session")
 def lt_tokenizer():
    return get_lang_class("lt").Defaults.create_tokenizer()
--- a/spacy/tests/lang/lb/__init__.py
+++ b/spacy/tests/lang/lb/__init__.py
--- a/spacy/tests/lang/lb/test_exceptions.py
+++ b/spacy/tests/lang/lb/test_exceptions.py
@ -0,0 +1,12 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+@pytest.mark.parametrize("text", ["z.B.", "Jan."])
+def test_lb_tokenizer_handles_abbr(lb_tokenizer, text):
+    tokens = lb_tokenizer(text)
+    assert len(tokens) == 1
+
--- a/spacy/tests/lang/lb/test_prefix_suffix_infix.py
+++ b/spacy/tests/lang/lb/test_prefix_suffix_infix.py
@ -0,0 +1,26 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+@pytest.mark.parametrize("text,length", [("z.B.", 1), ("zb.", 2), ("(z.B.", 2)])
+def test_lb_tokenizer_splits_prefix_interact(lb_tokenizer, text, length):
+    tokens = lb_tokenizer(text)
+    assert len(tokens) == length
+
+
+@pytest.mark.parametrize("text", ["z.B.)"])
+def test_lb_tokenizer_splits_suffix_interact(lb_tokenizer, text):
+    tokens = lb_tokenizer(text)
+    assert len(tokens) == 2
+
+
+@pytest.mark.parametrize("text", ["(z.B.)"])
+def test_lb_tokenizer_splits_even_wrap_interact(lb_tokenizer, text):
+    tokens = lb_tokenizer(text)
+    assert len(tokens) == 3
+
+
+
--- a/spacy/tests/lang/lb/test_text.py
+++ b/spacy/tests/lang/lb/test_text.py
@ -0,0 +1,32 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_lb_tokenizer_handles_long_text(lb_tokenizer):
+    text = """Den Nordwand an d'Sonn
+
+An der Zäit hunn sech den Nordwand an d’Sonn gestridden, wie vun hinnen zwee wuel méi staark wier, wéi e Wanderer, deen an ee waarme Mantel agepak war, iwwert de Wee koum. Si goufen sech eens, dass deejéinege fir de Stäerkste gëlle sollt, deen de Wanderer forcéiere géif, säi Mantel auszedoen.",
+
+Den Nordwand huet mat aller Force geblosen, awer wat e méi geblosen huet, wat de Wanderer sech méi a säi Mantel agewéckelt huet. Um Enn huet den Nordwand säi Kampf opginn.
+
+Dunn huet d’Sonn d’Loft mat hire frëndleche Strale gewiermt, a schonn no kuerzer Zäit huet de Wanderer säi Mantel ausgedoen.
+
+Do huet den Nordwand missen zouginn, dass d’Sonn vun hinnen zwee de Stäerkste wier."""
+
+    tokens = lb_tokenizer(text)
+    assert len(tokens) == 143
+
+
+@pytest.mark.parametrize(
+    "text,length",
+    [
+        ("»Wat ass mat mir geschitt?«, huet hie geduecht.", 13),
+        ("“Dëst fréi Opstoen”, denkt hien, “mécht ee ganz duercherneen. ", 15),
+    ],
+)
+def test_lb_tokenizer_handles_examples(lb_tokenizer, text, length):
+    tokens = lb_tokenizer(text)
+    assert len(tokens) == length