mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-26 09:14:32 +03:00
Initial commit: New language Luxembourgish (lb) (#4424)
* new language: Luxembourgish (lb)
* update
* update
* Update and rename .github/CONTRIBUTOR_AGREEMENT.md to .github/contributors/PeterGilles.md
* Update and rename .github/contributors/PeterGilles.md to .github/CONTRIBUTOR_AGREEMENT.md
* Update norm_exceptions.py
* Delete README.md
* moved test_lemma.py
* deactivated 'lemma_lookup = LOOKUP'
* update
* Update conftest.py
* update
* tests updated
* import unicode_literals
* Update spacy/tests/lang/lb/test_text.py

Co-Authored-By: Ines Montani <ines@ines.io>

* Create PeterGilles.md
This commit is contained in:
parent
98a961a60e
commit
428887b8f2
106
.github/contributors/PeterGilles.md
vendored
Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [X] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Peter Gilles         |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 10.10.               |
| GitHub username                | Peter Gilles         |
| Website (optional)             |                      |
37
spacy/lang/lb/__init__.py
Normal file
@@ -0,0 +1,37 @@
# coding: utf8

from __future__ import unicode_literals

from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .norm_exceptions import NORM_EXCEPTIONS
from .punctuation import TOKENIZER_INFIXES
from .lex_attrs import LEX_ATTRS
from .tag_map import TAG_MAP
from .stop_words import STOP_WORDS
#from .lemmatizer import LOOKUP
#from .syntax_iterators import SYNTAX_ITERATORS

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups


class LuxembourgishDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: 'lb'
    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
    #suffixes = TOKENIZER_SUFFIXES
    #lemma_lookup = LOOKUP


class Luxembourgish(Language):
    lang = 'lb'
    Defaults = LuxembourgishDefaults


__all__ = ['Luxembourgish']
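The defaults class in `spacy/lang/lb/__init__.py` imports `NORM_EXCEPTIONS` but passes only `BASE_NORMS` to `add_lookups`. The sketch below is a simplified stand-in for that helper, not spaCy's implementation: it assumes the lookup tables are consulted in order and the default getter is the fallback. The names `add_lookups_sketch`, `BASE_NORMS_SAMPLE` and `NORM_EXCEPTIONS_SAMPLE`, and the use of `str.lower` as the default getter, are assumptions made for illustration.

```python
from functools import partial


def _lookup_or_default(default_func, lookups, string):
    # Return the first table hit; otherwise fall back to the default getter.
    for table in lookups:
        if string in table:
            return table[string]
    return default_func(string)


def add_lookups_sketch(default_func, *lookups):
    # Hypothetical stand-in for the add_lookups helper used in the commit.
    return partial(_lookup_or_default, default_func, lookups)


# Tiny sample tables (assumptions, not the real spaCy data).
BASE_NORMS_SAMPLE = {"’": "'"}
NORM_EXCEPTIONS_SAMPLE = {"datt": "dass"}

# As committed: only the base table is consulted.
norm_as_committed = add_lookups_sketch(str.lower, BASE_NORMS_SAMPLE)
# Possible alternative: consult the language-specific table first.
norm_with_lb_table = add_lookups_sketch(
    str.lower, NORM_EXCEPTIONS_SAMPLE, BASE_NORMS_SAMPLE
)

print(norm_as_committed("datt"))
print(norm_with_lb_table("datt"))
```

Under these assumptions, the Luxembourgish norm table only takes effect once it is passed to the helper alongside the base table.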
18
spacy/lang/lb/examples.py
Normal file
@@ -0,0 +1,18 @@
# coding: utf8
from __future__ import unicode_literals

"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.lb.examples import sentences
>>> docs = nlp.pipe(sentences)
"""

sentences = [
    "An der Zäit hunn sech den Nordwand an d’Sonn gestridden, wie vun hinnen zwee wuel méi staark wier, wéi e Wanderer, deen an ee waarme Mantel agepak war, iwwert de Wee koum.",
    "Si goufen sech eens, dass deejéinege fir de Stäerkste gëlle sollt, deen de Wanderer forcéiere géif, säi Mantel auszedoen.",
    "Den Nordwand huet mat aller Force geblosen, awer wat e méi geblosen huet, wat de Wanderer sech méi a säi Mantel agewéckelt huet.",
    "Um Enn huet den Nordwand säi Kampf opginn.",
    "Dunn huet d’Sonn d’Loft mat hire frëndleche Strale gewiermt, a schonn no kuerzer Zäit huet de Wanderer säi Mantel ausgedoen.",
    "Do huet den Nordwand missen zouginn, dass d’Sonn vun hinnen zwee de Stäerkste wier."
]
41
spacy/lang/lb/lex_attrs.py
Normal file
@@ -0,0 +1,41 @@
# coding: utf8
from __future__ import unicode_literals

from ...attrs import LIKE_NUM


_num_words = set("""
null eent zwee dräi véier fënnef sechs ziwen aacht néng zéng eelef zwielef dräizéng
véierzéng foffzéng siechzéng siwwenzéng uechtzeng uechzeng nonnzéng nongzéng zwanzeg drësseg véierzeg foffzeg sechzeg siechzeg siwenzeg achtzeg achzeg uechtzeg uechzeg nonnzeg
honnert dausend millioun milliard billioun billiard trillioun triliard
""".split())

_ordinal_words = set("""
éischten zweeten drëtten véierten fënneften sechsten siwenten aachten néngten zéngten eeleften
zwieleften dräizéngten véierzéngten foffzéngten siechzéngten uechtzéngen uechzéngten nonnzéngten nongzéngten zwanzegsten
drëssegsten véierzegsten foffzegsten siechzegsten siwenzegsten uechzegsten nonnzegsten
honnertsten dausendsten milliounsten
milliardsten billiounsten billiardsten trilliounsten trilliardsten
""".split())

def like_num(text):
    """
    check if text resembles a number
    """
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    if text in _num_words:
        return True
    if text in _ordinal_words:
        return True
    return False


LEX_ATTRS = {
    LIKE_NUM: like_num
}
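To illustrate how `like_num` treats separators, fractions and number words, here is a minimal standalone sketch that runs without spaCy. It copies the function's logic but uses only a small subset of the word lists; `NUM_WORDS_SUBSET` and `ORDINAL_WORDS_SUBSET` are hypothetical names that are not part of the commit.

```python
# Standalone sketch of the like_num logic; runs without spaCy.
# NUM_WORDS_SUBSET and ORDINAL_WORDS_SUBSET are hypothetical, shortened lists.
NUM_WORDS_SUBSET = {"null", "eent", "zwee", "dräi", "honnert", "dausend"}
ORDINAL_WORDS_SUBSET = {"éischten", "zweeten", "drëtten"}


def like_num(text):
    """Return True if text looks like a number."""
    # Drop separators so that "10.000" and "1,5" are treated as digit strings.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as "1/2".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Exact, case-sensitive match against the word lists.
    return text in NUM_WORDS_SUBSET or text in ORDINAL_WORDS_SUBSET


for token in ["10.000", "1/2", "zwee", "Zwee", "Mantel"]:
    print(token, like_num(token))
```

Because the word lookup is case-sensitive, a sentence-initial `Zwee` is not recognised in this sketch; lowercasing the text before the lookup would be one way to cover that case.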
20
spacy/lang/lb/norm_exceptions.py
Normal file
@@ -0,0 +1,20 @@
# coding: utf8
from __future__ import unicode_literals

# TODO
# norm exceptions: find a way to deal with the zillions of spelling variants (vläicht = vlaicht, vleicht, viläicht, viläischt, etc. etc.)

# here one could include the most common spelling mistakes

_exc = {
    "datt": "dass",
    "wgl.": "weg.",  # overwritten by the duplicate key on the next line
    "wgl.": "wegl.",
    "vläicht": "viläicht"}


NORM_EXCEPTIONS = {}

for string, norm in _exc.items():
    NORM_EXCEPTIONS[string] = norm
    NORM_EXCEPTIONS[string.title()] = norm
25
spacy/lang/lb/punctuation.py
Normal file
@@ -0,0 +1,25 @@
# coding: utf8
from __future__ import unicode_literals

from ..char_classes import LIST_ELLIPSES, LIST_ICONS
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER


_quotes = CONCAT_QUOTES.replace("'", "")

_infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
        r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
        r'(?<=[{a}])[:;<>=](?=[{a}])'.format(a=ALPHA),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
        r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
        r"(?<=[0-9])-(?=[0-9])",
    ]
)


TOKENIZER_INFIXES = _infixes
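The infix patterns rely on lookbehind and lookahead assertions, so a punctuation character is only treated as an infix when it sits between the right kinds of characters. The sketch below shows the idea with a few of the patterns and plain `re`. The ASCII-only character classes, `INFIX_PATTERNS` and the helper `split_on_infixes` are assumptions for illustration; spaCy's own character classes and tokenizer are more involved.

```python
import re

# Simplified stand-ins for spaCy's character classes (assumption: ASCII only).
ALPHA = "a-zA-Z"
ALPHA_LOWER = "a-z"
ALPHA_UPPER = "A-Z"

# A subset of the patterns from punctuation.py.
INFIX_PATTERNS = [
    r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
    r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
    r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
    r"(?<=[0-9])-(?=[0-9])",
]
INFIX_RE = re.compile("|".join(INFIX_PATTERNS))


def split_on_infixes(text):
    """Split text at every infix match, keeping the infix as its own piece."""
    pieces = []
    start = 0
    for match in INFIX_RE.finditer(text):
        if match.start() > start:
            pieces.append(text[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(text):
        pieces.append(text[start:])
    return pieces


for sample in ["Haus,Gaart", "10-20", "gutt.Dat", "Nord--Wand", "z.B."]:
    print(sample, split_on_infixes(sample))
```

Note that a bare pattern list like this also splits `z.B.` after the `z`, which suggests why abbreviations of that shape are listed separately as tokenizer exceptions.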
212
spacy/lang/lb/stop_words.py
Normal file
@@ -0,0 +1,212 @@
# coding: utf8
from __future__ import unicode_literals

STOP_WORDS = set("""
a
à
äis
är
ärt
äert
ären
all
allem
alles
alleguer
als
also
am
an
anerefalls
ass
aus
awer
bei
beim
bis
bis
d'
dach
datt
däin
där
dat
de
dee
den
deel
deem
deen
deene
déi
den
deng
denger
dem
der
dësem
di
dir
do
da
dann
domat
dozou
drop
du
duerch
duerno
e
ee
em
een
eent
ë
en
ënner
ëm
ech
eis
eise
eisen
eiser
eises
eisereen
esou
een
eng
enger
engem
entweder
et
eréischt
falls
fir
géint
géif
gëtt
gët
geet
gi
ginn
gouf
gouff
goung
hat
haten
hatt
hätt
hei
hu
huet
hun
hunn
hiren
hien
hin
hier
hir
jidderen
jiddereen
jiddwereen
jiddereng
jiddwerengen
jo
ins
iech
iwwer
kann
kee
keen
kënne
kënnt
kéng
kéngen
kéngem
koum
kuckt
mam
mat
ma
mä
mech
méi
mécht
meng
menger
mer
mir
muss
nach
nämmlech
nämmelech
näischt
nawell
nëmme
nëmmen
net
nees
nee
no
nu
nom
och
oder
ons
onsen
onser
onsereen
onst
om
op
ouni
säi
säin
schonn
schonns
si
sid
sie
se
sech
seng
senge
sengem
senger
selwecht
selwer
sinn
sollten
souguer
sou
soss
sot
't
tëscht
u
un
um
virdrun
vu
vum
vun
wann
war
waren
was
wat
wëllt
weider
wéi
wéini
wéinst
wi
wollt
wou
wouhin
zanter
ze
zu
zum
zwar
""".split())
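The stop-word list is built by splitting a whitespace-separated block and wrapping the result in `set`. A small sketch of that pattern follows; `SAMPLE` is a hypothetical excerpt in which `bis` and `den` are repeated on purpose, mirroring the repeated entries in the full list.

```python
# Sketch: building a stop-word set from a whitespace-separated block.
# SAMPLE is a hypothetical excerpt, not the full list.
SAMPLE = """
a
bis
bis
d'
den
deel
den
"""

words = SAMPLE.split()
stop_words = set(words)

print(len(words))           # entries as written, including repeats
print(len(stop_words))      # distinct entries after deduplication
print("d'" in stop_words)   # straight apostrophe, as written in the list
print("d’" in stop_words)   # typographic apostrophe is a different string
print("Den" in stop_words)  # set membership is case-sensitive
```

Repeated entries are harmless because the set drops them. The apostrophe check may matter more in practice: the example sentences in this commit use the typographic form `d’`, while the list contains `d'`.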
28
spacy/lang/lb/tag_map.py
Normal file
@@ -0,0 +1,28 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import POS, PUNCT, ADJ, CONJ, SCONJ, NUM, DET, ADV, ADP, X, VERB
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX

# TODO: tag map is still using POS tags from an internal training set.
# These POS tags have to be modified to match those from Universal Dependencies

TAG_MAP = {
    "$": {POS: PUNCT},
    "ADJ": {POS: ADJ},
    "AV": {POS: ADV},
    "APPR": {POS: ADP, "AdpType": "prep"},
    "APPRART": {POS: ADP, "AdpType": "prep", "PronType": "art"},
    "D": {POS: DET, "PronType": "art"},
    "KO": {POS: CONJ},
    "N": {POS: NOUN},
    "P": {POS: ADV},
    "TRUNC": {POS: X, "Hyph": "yes"},
    "AUX": {POS: AUX},
    "V": {POS: VERB},
    "MV": {POS: VERB, "VerbType": "mod"},
    "PTK": {POS: PART},
    "INTER": {POS: PART},
    "NUM": {POS: NUM},
    "_SP": {POS: SPACE},
}
47
spacy/lang/lb/tokenizer_exceptions.py
Normal file
@@ -0,0 +1,47 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA
from ..punctuation import TOKENIZER_PREFIXES

# TODO
# tokenize cliticised definite article "d'" as token of its own: d'Kanner > [d'] [Kanner]
# treat other apostrophes within words as part of the word: [op d'mannst], [fir d'éischt] (= exceptions)

# how to write the tokenisation exception for the articles d' / D' ? This one is not working.
_prefixes = [prefix for prefix in TOKENIZER_PREFIXES if prefix not in ["d'", "D'", "d’", "D’", r"\' "]]


_exc = {
    "d'mannst": [
        {ORTH: "d'", LEMMA: "d'"},
        {ORTH: "mannst", LEMMA: "mann", NORM: "mann"}],
    "d'éischt": [
        {ORTH: "d'", LEMMA: "d'"},
        {ORTH: "éischt", LEMMA: "éischt", NORM: "éischt"}]
}

# translate / delete what is not necessary
# what does PRON_LEMMA mean?
for exc_data in [
    {ORTH: "wgl.", LEMMA: "wann ech gelift", NORM: "wann ech gelieft"},
    {ORTH: "M.", LEMMA: "Monsieur", NORM: "Monsieur"},
    {ORTH: "Mme.", LEMMA: "Madame", NORM: "Madame"},
    {ORTH: "Dr.", LEMMA: "Dokter", NORM: "Dokter"},
    {ORTH: "Tel.", LEMMA: "Telefon", NORM: "Telefon"},
    {ORTH: "asw.", LEMMA: "an sou weider", NORM: "an sou weider"},
    {ORTH: "etc.", LEMMA: "et cetera", NORM: "et cetera"},
    {ORTH: "bzw.", LEMMA: "bezéiungsweis", NORM: "bezéiungsweis"},
    {ORTH: "Jan.", LEMMA: "Januar", NORM: "Januar"}]:
    _exc[exc_data[ORTH]] = [exc_data]


# to be extended
# note: "Dr." and "etc." are also set in the loop above; the entries below replace them
for orth in [
    "z.B.", "Dipl.", "Dr.", "etc.", "i.e.", "o.k.", "O.K.", "p.a.", "p.s.", "P.S.", "phil.",
    "q.e.d.", "R.I.P.", "rer.", "sen.", "ë.a.", "U.S.", "U.S.A."]:
    _exc[orth] = [{ORTH: orth}]


TOKENIZER_PREFIXES = _prefixes
TOKENIZER_EXCEPTIONS = _exc
@@ -134,7 +134,10 @@ def ko_tokenizer():
    pytest.importorskip("natto")
    return get_lang_class("ko").Defaults.create_tokenizer()


@pytest.fixture(scope="session")
def lb_tokenizer():
    return get_lang_class("lb").Defaults.create_tokenizer()

@pytest.fixture(scope="session")
def lt_tokenizer():
    return get_lang_class("lt").Defaults.create_tokenizer()
0
spacy/tests/lang/lb/__init__.py
Normal file
12
spacy/tests/lang/lb/test_exceptions.py
Normal file
@@ -0,0 +1,12 @@
# coding: utf-8
# from __future__ import unicolb_literals
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize("text", ["z.B.", "Jan."])
def test_lb_tokenizer_handles_abbr(lb_tokenizer, text):
    tokens = lb_tokenizer(text)
    assert len(tokens) == 1
26
spacy/tests/lang/lb/test_prefix_suffix_infix.py
Normal file
@@ -0,0 +1,26 @@
# coding: utf-8
#from __future__ import unicolb_literals
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize("text,length", [("z.B.", 1), ("zb.", 2), ("(z.B.", 2)])
def test_lb_tokenizer_splits_prefix_interact(lb_tokenizer, text, length):
    tokens = lb_tokenizer(text)
    assert len(tokens) == length


@pytest.mark.parametrize("text", ["z.B.)"])
def test_lb_tokenizer_splits_suffix_interact(lb_tokenizer, text):
    tokens = lb_tokenizer(text)
    assert len(tokens) == 2


@pytest.mark.parametrize("text", ["(z.B.)"])
def test_lb_tokenizer_splits_even_wrap_interact(lb_tokenizer, text):
    tokens = lb_tokenizer(text)
    assert len(tokens) == 3
32
spacy/tests/lang/lb/test_text.py
Normal file
@@ -0,0 +1,32 @@
# coding: utf-8
from __future__ import unicode_literals
from __future__ import unicode_literals

import pytest


def test_lb_tokenizer_handles_long_text(lb_tokenizer):
    text = """Den Nordwand an d'Sonn

An der Zäit hunn sech den Nordwand an d’Sonn gestridden, wie vun hinnen zwee wuel méi staark wier, wéi e Wanderer, deen an ee waarme Mantel agepak war, iwwert de Wee koum. Si goufen sech eens, dass deejéinege fir de Stäerkste gëlle sollt, deen de Wanderer forcéiere géif, säi Mantel auszedoen.",

Den Nordwand huet mat aller Force geblosen, awer wat e méi geblosen huet, wat de Wanderer sech méi a säi Mantel agewéckelt huet. Um Enn huet den Nordwand säi Kampf opginn.

Dunn huet d’Sonn d’Loft mat hire frëndleche Strale gewiermt, a schonn no kuerzer Zäit huet de Wanderer säi Mantel ausgedoen.

Do huet den Nordwand missen zouginn, dass d’Sonn vun hinnen zwee de Stäerkste wier."""

    tokens = lb_tokenizer(text)
    assert len(tokens) == 143


@pytest.mark.parametrize(
    "text,length",
    [
        ("»Wat ass mat mir geschitt?«, huet hie geduecht.", 13),
        ("“Dëst fréi Opstoen”, denkt hien, “mécht ee ganz duercherneen. ", 15),
    ],
)
def test_lb_tokenizer_handles_examples(lb_tokenizer, text, length):
    tokens = lb_tokenizer(text)
    assert len(tokens) == length