spaCy/spacy/lang/lb/lex_attrs.py

# coding: utf8
from __future__ import unicode_literals

from ...attrs import LIKE_NUM


_num_words = set(
    """
null eent zwee dräi véier fënnef sechs siwen aacht néng zéng eelef zwielef dräizéng
véierzéng foffzéng siechzéng siwwenzéng uechtzéng uechzéng nonnzéng nongzéng zwanzeg drësseg véierzeg foffzeg sechzeg siechzeg siwenzeg achtzeg achzeg uechtzeg uechzeg nonnzeg
honnert dausend millioun milliard billioun billiard trillioun trilliard
""".split()
)

_ordinal_words = set(
    """
éischten zweeten drëtten véierten fënneften sechsten siwenten aachten néngten zéngten eeleften
zwieleften dräizéngten véierzéngten foffzéngten siechzéngten uechtzéngten uechzéngten nonnzéngten nongzéngten zwanzegsten
drëssegsten véierzegsten foffzegsten siechzegsten siwenzegsten uechzegsten nonnzegsten
honnertsten dausendsten milliounsten
milliardsten billiounsten billiardsten trilliounsten trilliardsten
""".split()
)


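def like_num(text):
    """
    Check whether text resembles a number: a run of digits (ignoring "," and
    "." separators), a simple fraction such as "3/4", or a Luxembourgish
    cardinal or ordinal number word. Word matching is case-sensitive.
    """
    # Strip thousands and decimal separators before the digit check.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Accept fractions with exactly one slash and digits on both sides.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Fall back to the cardinal and ordinal word lists.
    if text in _num_words:
        return True
    if text in _ordinal_words:
        return True
    return False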


LEX_ATTRS = {LIKE_NUM: like_num}
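
The behaviour of `like_num` above can be tried outside spaCy with a small standalone sketch. The names `like_num_demo` and `_demo_num_words` are hypothetical and exist only for this illustration; the word list is a shortened subset of `_num_words`, and the ordinal lookup is left out for brevity.

```python
# Standalone sketch of the like_num heuristic; not part of spaCy itself.
# _demo_num_words is a hypothetical, shortened word list for illustration.
_demo_num_words = {"null", "eent", "zwee", "dräi", "honnert", "dausend"}


def like_num_demo(text):
    """Return True if text looks like digits, a fraction, or a number word."""
    # Strip separators so that "1.000,5" is checked as "10005".
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Accept simple fractions such as "3/4" (exactly one slash).
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Case-sensitive lookup, mirroring the function above.
    return text in _demo_num_words


if __name__ == "__main__":
    for token in ["10", "1.000,5", "3/4", "zwee", "Zwee", "Haus", "1/2/3"]:
        print(token, like_num_demo(token))
```

Because the lookup is case-sensitive, a capitalised token such as "Zwee" is not matched in this sketch, and a token with more than one slash is not treated as a fraction.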