mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-11 17:56:30 +03:00
Initial commit: New language Luxembourgish (lb) (#4424)
* new language: Luxembourgish (lb)
* update
* update
* Update and rename .github/CONTRIBUTOR_AGREEMENT.md to .github/contributors/PeterGilles.md
* Update and rename .github/contributors/PeterGilles.md to .github/CONTRIBUTOR_AGREEMENT.md
* Update norm_exceptions.py
* Delete README.md
* moved test_lemma.py
* deactivated 'lemma_lookup = LOOKUP'
* update
* Update conftest.py
* update
* tests updated
* import unicode_literals
* Update spacy/tests/lang/lb/test_text.py

Co-Authored-By: Ines Montani <ines@ines.io>

* Create PeterGilles.md
This commit is contained in:
parent
98a961a60e
commit
428887b8f2
106
.github/contributors/PeterGilles.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
||||||
|
# spaCy contributor agreement
|
||||||
|
|
||||||
|
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||||
|
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||||
|
The SCA applies to any contribution that you make to any product or project
|
||||||
|
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||||
|
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||||
|
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||||
|
**"you"** shall mean the person or entity identified below.
|
||||||
|
|
||||||
|
If you agree to be bound by these terms, fill in the information requested
|
||||||
|
below and include the filled-in version with your first pull request, under the
|
||||||
|
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||||
|
should be your GitHub username, with the extension `.md`. For example, the user
|
||||||
|
example_user would create the file `.github/contributors/example_user.md`.
|
||||||
|
|
||||||
|
Read this agreement carefully before signing. These terms and conditions
|
||||||
|
constitute a binding legal agreement.
|
||||||
|
|
||||||
|
## Contributor Agreement
|
||||||
|
|
||||||
|
1. The term "contribution" or "contributed materials" means any source code,
|
||||||
|
object code, patch, tool, sample, graphic, specification, manual,
|
||||||
|
documentation, or any other material posted or submitted by you to the project.
|
||||||
|
|
||||||
|
2. With respect to any worldwide copyrights, or copyright applications and
|
||||||
|
registrations, in your contribution:
|
||||||
|
|
||||||
|
* you hereby assign to us joint ownership, and to the extent that such
|
||||||
|
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||||
|
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||||
|
royalty-free, unrestricted license to exercise all rights under those
|
||||||
|
copyrights. This includes, at our option, the right to sublicense these same
|
||||||
|
rights to third parties through multiple levels of sublicensees or other
|
||||||
|
licensing arrangements;
|
||||||
|
|
||||||
|
* you agree that each of us can do all things in relation to your
|
||||||
|
contribution as if each of us were the sole owners, and if one of us makes
|
||||||
|
a derivative work of your contribution, the one who makes the derivative
|
||||||
|
work (or has it made) will be the sole owner of that derivative work;
|
||||||
|
|
||||||
|
* you agree that you will not assert any moral rights in your contribution
|
||||||
|
against us, our licensees or transferees;
|
||||||
|
|
||||||
|
* you agree that we may register a copyright in your contribution and
|
||||||
|
exercise all ownership rights associated with it; and
|
||||||
|
|
||||||
|
* you agree that neither of us has any duty to consult with, obtain the
|
||||||
|
consent of, pay or render an accounting to the other for any use or
|
||||||
|
distribution of your contribution.
|
||||||
|
|
||||||
|
3. With respect to any patents you own, or that you can license without payment
|
||||||
|
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||||
|
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||||
|
|
||||||
|
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||||
|
your contribution in whole or in part, alone or in combination with or
|
||||||
|
included in any product, work or materials arising out of the project to
|
||||||
|
which your contribution was submitted, and
|
||||||
|
|
||||||
|
* at our option, to sublicense these same rights to third parties through
|
||||||
|
multiple levels of sublicensees or other licensing arrangements.
|
||||||
|
|
||||||
|
4. Except as set out above, you keep all right, title, and interest in your
|
||||||
|
contribution. The rights that you grant to us under these terms are effective
|
||||||
|
on the date you first submitted a contribution to us, even if your submission
|
||||||
|
took place before the date you sign these terms.
|
||||||
|
|
||||||
|
5. You covenant, represent, warrant and agree that:
|
||||||
|
|
||||||
|
* Each contribution that you submit is and shall be an original work of
|
||||||
|
authorship and you can legally grant the rights set out in this SCA;
|
||||||
|
|
||||||
|
* to the best of your knowledge, each contribution will not violate any
|
||||||
|
third party's copyrights, trademarks, patents, or other intellectual
|
||||||
|
property rights; and
|
||||||
|
|
||||||
|
* each contribution shall be in compliance with U.S. export control laws and
|
||||||
|
other applicable export and import laws. You agree to notify us if you
|
||||||
|
become aware of any circumstance which would make any of the foregoing
|
||||||
|
representations inaccurate in any respect. We may publicly disclose your
|
||||||
|
participation in the project, including the fact that you have signed the SCA.
|
||||||
|
|
||||||
|
6. This SCA is governed by the laws of the State of California and applicable
|
||||||
|
U.S. Federal law. Any choice of law rules will not apply.
|
||||||
|
|
||||||
|
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||||
|
mark both statements:
|
||||||
|
|
||||||
|
* [X] I am signing on behalf of myself as an individual and no other person
|
||||||
|
or entity, including my employer, has or will have rights with respect to my
|
||||||
|
contributions.
|
||||||
|
|
||||||
|
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||||
|
actual authority to contractually bind that entity.
|
||||||
|
|
||||||
|
## Contributor Details
|
||||||
|
|
||||||
|
| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Peter Gilles         |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 10.10.               |
| GitHub username                | Peter Gilles         |
| Website (optional)             |                      |
|
37
spacy/lang/lb/__init__.py
Normal file
|
@ -0,0 +1,37 @@
|
||||||
|
# coding: utf8
|
||||||
|
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||||
|
from .norm_exceptions import NORM_EXCEPTIONS
|
||||||
|
from .punctuation import TOKENIZER_INFIXES
|
||||||
|
from .lex_attrs import LEX_ATTRS
|
||||||
|
from .tag_map import TAG_MAP
|
||||||
|
from .stop_words import STOP_WORDS
|
||||||
|
#from .lemmatizer import LOOKUP
|
||||||
|
#from .syntax_iterators import SYNTAX_ITERATORS
|
||||||
|
|
||||||
|
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||||
|
from ..norm_exceptions import BASE_NORMS
|
||||||
|
from ...language import Language
|
||||||
|
from ...attrs import LANG, NORM
|
||||||
|
from ...util import update_exc, add_lookups
|
||||||
|
|
||||||
|
|
||||||
|
class LuxembourgishDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: 'lb'
    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
    #suffixes = TOKENIZER_SUFFIXES
    #lemma_lookup = LOOKUP


class Luxembourgish(Language):
    lang = 'lb'
    Defaults = LuxembourgishDefaults
|
||||||
|
|
||||||
|
|
||||||
|
__all__ = ['Luxembourgish']
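
Note on the `dict(...)` copy in `LuxembourgishDefaults`: copying the inherited mapping before updating it keeps the language-specific getters from leaking into the base class. The following is a minimal standalone sketch of that pattern. `BaseDefaults` and `LbDefaults` are hypothetical stand-ins, not spaCy classes, and the `"lang"` key is a plain string used only for illustration.

```python
# Hypothetical stand-ins for Language.Defaults and LuxembourgishDefaults;
# none of these names come from spaCy itself.
class BaseDefaults:
    lex_attr_getters = {"lang": lambda text: "xx"}


class LbDefaults(BaseDefaults):
    # Copy first, so the assignment below does not modify BaseDefaults.
    lex_attr_getters = dict(BaseDefaults.lex_attr_getters)
    lex_attr_getters["lang"] = lambda text: "lb"


base_lang = BaseDefaults.lex_attr_getters["lang"]("Moien")
lb_lang = LbDefaults.lex_attr_getters["lang"]("Moien")
print(base_lang, lb_lang)
```

Without the copy, both classes would share one dictionary and the base getter would be overwritten as well.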
|
18
spacy/lang/lb/examples.py
Normal file
|
@ -0,0 +1,18 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
"""
|
||||||
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
>>> from spacy.lang.lb.examples import sentences
|
||||||
|
>>> docs = nlp.pipe(sentences)
|
||||||
|
"""
|
||||||
|
|
||||||
|
sentences = [
|
||||||
|
"An der Zäit hunn sech den Nordwand an d’Sonn gestridden, wie vun hinnen zwee wuel méi staark wier, wéi e Wanderer, deen an ee waarme Mantel agepak war, iwwert de Wee koum.",
|
||||||
|
"Si goufen sech eens, dass deejéinege fir de Stäerkste gëlle sollt, deen de Wanderer forcéiere géif, säi Mantel auszedoen.",
|
||||||
|
"Den Nordwand huet mat aller Force geblosen, awer wat e méi geblosen huet, wat de Wanderer sech méi a säi Mantel agewéckelt huet.",
|
||||||
|
"Um Enn huet den Nordwand säi Kampf opginn.",
|
||||||
|
"Dunn huet d’Sonn d’Loft mat hire frëndleche Strale gewiermt, a schonn no kuerzer Zäit huet de Wanderer säi Mantel ausgedoen.",
|
||||||
|
"Do huet den Nordwand missen zouginn, dass d’Sonn vun hinnen zwee de Stäerkste wier."
|
||||||
|
]
|
41
spacy/lang/lb/lex_attrs.py
Normal file
|
@ -0,0 +1,41 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ...attrs import LIKE_NUM
|
||||||
|
|
||||||
|
|
||||||
|
_num_words = set("""
|
||||||
|
null eent zwee dräi véier fënnef sechs ziwen aacht néng zéng eelef zwielef dräizéng
|
||||||
|
véierzéng foffzéng siechzéng siwwenzéng uechtzeng uechzeng nonnzéng nongzéng zwanzeg drësseg véierzeg foffzeg sechzeg siechzeg siwenzeg achtzeg achzeg uechtzeg uechzeg nonnzeg
|
||||||
|
honnert dausend millioun milliard billioun billiard trillioun trilliard
|
||||||
|
""".split())
|
||||||
|
|
||||||
|
_ordinal_words = set("""
|
||||||
|
éischten zweeten drëtten véierten fënneften sechsten siwenten aachten néngten zéngten eeleften
|
||||||
|
zwieleften dräizéngten véierzéngten foffzéngten siechzéngten uechtzéngten uechzéngten nonnzéngten nongzéngten zwanzegsten
|
||||||
|
drëssegsten véierzegsten foffzegsten siechzegsten siwenzegsten uechzegsten nonnzegsten
|
||||||
|
honnertsten dausendsten milliounsten
|
||||||
|
milliardsten billiounsten billiardsten trilliounsten trilliardsten
|
||||||
|
""".split())
|
||||||
|
|
||||||
|
def like_num(text):
    """
    check if text resembles a number
    """
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    if text in _num_words:
        return True
    if text in _ordinal_words:
        return True
    return False
|
||||||
|
|
||||||
|
|
||||||
|
LEX_ATTRS = {
|
||||||
|
LIKE_NUM: like_num
|
||||||
|
}
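
Note on `like_num`: the checks run in order (plain digits after stripping `,` and `.`, then simple fractions, then the word lists), and the word lookups are case-sensitive. The sketch below restates that logic with a shortened word list so it runs without spaCy. `NUM_WORDS` and `ORDINAL_WORDS` here are illustrative subsets, not the full sets from the file.

```python
# Standalone re-statement of the like_num logic with shortened word lists,
# so it runs without spaCy installed. The sets are illustrative subsets.
NUM_WORDS = {"null", "eent", "zwee", "dräi", "honnert", "dausend"}
ORDINAL_WORDS = {"éischten", "zweeten", "drëtten"}


def like_num(text):
    """Return True if text looks like a number."""
    # Drop thousands and decimal separators before the digit check.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Accept simple fractions such as 3/4.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Exact, case-sensitive lookup in the word lists.
    return text in NUM_WORDS or text in ORDINAL_WORDS


results = {t: like_num(t) for t in ["10.000", "3/4", "zwee", "Zwee", "Mantel"]}
print(results)
```

Because the lookup is an exact match, a capitalised form such as "Zwee" is not recognised; lowercasing the text before the lookup would be one way to change that, if desired.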
|
20
spacy/lang/lb/norm_exceptions.py
Normal file
|
@ -0,0 +1,20 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
# TODO
|
||||||
|
# norm exceptions: find a way to deal with the zillions of spelling variants (vläicht = vlaicht, vleicht, viläicht, viläischt, etc.)
|
||||||
|
|
||||||
|
# here one could include the most common spelling mistakes
|
||||||
|
|
||||||
|
_exc = {
|
||||||
|
"datt": "dass",
|
||||||
|
"wgl.": "weg.",
|
||||||
|
"wgl.": "wegl.",
|
||||||
|
"vläicht": "viläicht"}
|
||||||
|
|
||||||
|
|
||||||
|
NORM_EXCEPTIONS = {}
|
||||||
|
|
||||||
|
for string, norm in _exc.items():
    NORM_EXCEPTIONS[string] = norm
    NORM_EXCEPTIONS[string.title()] = norm
|
25
spacy/lang/lb/punctuation.py
Normal file
|
@ -0,0 +1,25 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
|
||||||
|
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
||||||
|
|
||||||
|
|
||||||
|
_quotes = CONCAT_QUOTES.replace("'", "")
|
||||||
|
|
||||||
|
_infixes = (
|
||||||
|
LIST_ELLIPSES
|
||||||
|
+ LIST_ICONS
|
||||||
|
+ [
|
||||||
|
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
|
||||||
|
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
||||||
|
r'(?<=[{a}])[:;<>=](?=[{a}])'.format(a=ALPHA),
|
||||||
|
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||||
|
r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
|
||||||
|
r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
|
||||||
|
r"(?<=[0-9])-(?=[0-9])",
|
||||||
|
]
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
TOKENIZER_INFIXES = _infixes
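
Note on the infix patterns: each one uses a lookbehind and a lookahead, so the punctuation mark is matched only when it sits between the required character classes. The sketch below tries three of the patterns with the standard `re` module. The ASCII-only `ALPHA`, `ALPHA_LOWER` and `ALPHA_UPPER` values are simplified assumptions (spaCy's real classes cover far more of Unicode), and `split_on_infixes` is a hypothetical helper, not the tokenizer's actual algorithm.

```python
import re

# Simplified ASCII stand-ins for spaCy's character classes (assumption).
ALPHA_LOWER = "a-z"
ALPHA_UPPER = "A-Z"
ALPHA = "a-zA-Z"

infix_patterns = [
    r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
    r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
    r"(?<=[0-9])-(?=[0-9])",
]
infix_re = re.compile("|".join(infix_patterns))


def split_on_infixes(text):
    """Split text at every infix match, keeping the infix as its own piece."""
    pieces = []
    start = 0
    for match in infix_re.finditer(text):
        if match.start() > start:
            pieces.append(text[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(text):
        pieces.append(text[start:])
    return pieces


print(split_on_infixes("huet.Den"))
print(split_on_infixes("10-20"))
print(split_on_infixes("Mantel"))
```

Applied on its own, the first pattern would also split "z.B." after the "z", which is one reason abbreviations like that are listed as tokenizer exceptions.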
|
212
spacy/lang/lb/stop_words.py
Normal file
|
@ -0,0 +1,212 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
STOP_WORDS = set("""
|
||||||
|
a
|
||||||
|
à
|
||||||
|
äis
|
||||||
|
är
|
||||||
|
ärt
|
||||||
|
äert
|
||||||
|
ären
|
||||||
|
all
|
||||||
|
allem
|
||||||
|
alles
|
||||||
|
alleguer
|
||||||
|
als
|
||||||
|
also
|
||||||
|
am
|
||||||
|
an
|
||||||
|
anerefalls
|
||||||
|
ass
|
||||||
|
aus
|
||||||
|
awer
|
||||||
|
bei
|
||||||
|
beim
|
||||||
|
bis
|
||||||
|
bis
|
||||||
|
d'
|
||||||
|
dach
|
||||||
|
datt
|
||||||
|
däin
|
||||||
|
där
|
||||||
|
dat
|
||||||
|
de
|
||||||
|
dee
|
||||||
|
den
|
||||||
|
deel
|
||||||
|
deem
|
||||||
|
deen
|
||||||
|
deene
|
||||||
|
déi
|
||||||
|
den
|
||||||
|
deng
|
||||||
|
denger
|
||||||
|
dem
|
||||||
|
der
|
||||||
|
dësem
|
||||||
|
di
|
||||||
|
dir
|
||||||
|
do
|
||||||
|
da
|
||||||
|
dann
|
||||||
|
domat
|
||||||
|
dozou
|
||||||
|
drop
|
||||||
|
du
|
||||||
|
duerch
|
||||||
|
duerno
|
||||||
|
e
|
||||||
|
ee
|
||||||
|
em
|
||||||
|
een
|
||||||
|
eent
|
||||||
|
ë
|
||||||
|
en
|
||||||
|
ënner
|
||||||
|
ëm
|
||||||
|
ech
|
||||||
|
eis
|
||||||
|
eise
|
||||||
|
eisen
|
||||||
|
eiser
|
||||||
|
eises
|
||||||
|
eisereen
|
||||||
|
esou
|
||||||
|
een
|
||||||
|
eng
|
||||||
|
enger
|
||||||
|
engem
|
||||||
|
entweder
|
||||||
|
et
|
||||||
|
eréischt
|
||||||
|
falls
|
||||||
|
fir
|
||||||
|
géint
|
||||||
|
géif
|
||||||
|
gëtt
|
||||||
|
gët
|
||||||
|
geet
|
||||||
|
gi
|
||||||
|
ginn
|
||||||
|
gouf
|
||||||
|
gouff
|
||||||
|
goung
|
||||||
|
hat
|
||||||
|
haten
|
||||||
|
hatt
|
||||||
|
hätt
|
||||||
|
hei
|
||||||
|
hu
|
||||||
|
huet
|
||||||
|
hun
|
||||||
|
hunn
|
||||||
|
hiren
|
||||||
|
hien
|
||||||
|
hin
|
||||||
|
hier
|
||||||
|
hir
|
||||||
|
jidderen
|
||||||
|
jiddereen
|
||||||
|
jiddwereen
|
||||||
|
jiddereng
|
||||||
|
jiddwerengen
|
||||||
|
jo
|
||||||
|
ins
|
||||||
|
iech
|
||||||
|
iwwer
|
||||||
|
kann
|
||||||
|
kee
|
||||||
|
keen
|
||||||
|
kënne
|
||||||
|
kënnt
|
||||||
|
kéng
|
||||||
|
kéngen
|
||||||
|
kéngem
|
||||||
|
koum
|
||||||
|
kuckt
|
||||||
|
mam
|
||||||
|
mat
|
||||||
|
ma
|
||||||
|
mä
|
||||||
|
mech
|
||||||
|
méi
|
||||||
|
mécht
|
||||||
|
meng
|
||||||
|
menger
|
||||||
|
mer
|
||||||
|
mir
|
||||||
|
muss
|
||||||
|
nach
|
||||||
|
nämmlech
|
||||||
|
nämmelech
|
||||||
|
näischt
|
||||||
|
nawell
|
||||||
|
nëmme
|
||||||
|
nëmmen
|
||||||
|
net
|
||||||
|
nees
|
||||||
|
nee
|
||||||
|
no
|
||||||
|
nu
|
||||||
|
nom
|
||||||
|
och
|
||||||
|
oder
|
||||||
|
ons
|
||||||
|
onsen
|
||||||
|
onser
|
||||||
|
onsereen
|
||||||
|
onst
|
||||||
|
om
|
||||||
|
op
|
||||||
|
ouni
|
||||||
|
säi
|
||||||
|
säin
|
||||||
|
schonn
|
||||||
|
schonns
|
||||||
|
si
|
||||||
|
sid
|
||||||
|
sie
|
||||||
|
se
|
||||||
|
sech
|
||||||
|
seng
|
||||||
|
senge
|
||||||
|
sengem
|
||||||
|
senger
|
||||||
|
selwecht
|
||||||
|
selwer
|
||||||
|
sinn
|
||||||
|
sollten
|
||||||
|
souguer
|
||||||
|
sou
|
||||||
|
soss
|
||||||
|
sot
|
||||||
|
't
|
||||||
|
tëscht
|
||||||
|
u
|
||||||
|
un
|
||||||
|
um
|
||||||
|
virdrun
|
||||||
|
vu
|
||||||
|
vum
|
||||||
|
vun
|
||||||
|
wann
|
||||||
|
war
|
||||||
|
waren
|
||||||
|
was
|
||||||
|
wat
|
||||||
|
wëllt
|
||||||
|
weider
|
||||||
|
wéi
|
||||||
|
wéini
|
||||||
|
wéinst
|
||||||
|
wi
|
||||||
|
wollt
|
||||||
|
wou
|
||||||
|
wouhin
|
||||||
|
zanter
|
||||||
|
ze
|
||||||
|
zu
|
||||||
|
zum
|
||||||
|
zwar
|
||||||
|
""".split())
|
28
spacy/lang/lb/tag_map.py
Normal file
|
@ -0,0 +1,28 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ...symbols import POS, PUNCT, ADJ, CONJ, SCONJ, NUM, DET, ADV, ADP, X, VERB
|
||||||
|
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX
|
||||||
|
|
||||||
|
# TODO: tag map is still using POS tags from an internal training set.
|
||||||
|
# These POS tags have to be modified to match those from Universal Dependencies
|
||||||
|
|
||||||
|
TAG_MAP = {
|
||||||
|
"$": {POS: PUNCT},
|
||||||
|
"ADJ": {POS: ADJ},
|
||||||
|
"AV": {POS: ADV},
|
||||||
|
"APPR": {POS: ADP, "AdpType": "prep"},
|
||||||
|
"APPRART": {POS: ADP, "AdpType": "prep", "PronType": "art"},
|
||||||
|
"D": {POS: DET, "PronType": "art"},
|
||||||
|
"KO": {POS: CONJ},
|
||||||
|
"N": {POS: NOUN},
|
||||||
|
"P": {POS: ADV},
|
||||||
|
"TRUNC": {POS: X, "Hyph": "yes"},
|
||||||
|
"AUX": {POS: AUX},
|
||||||
|
"V": {POS: VERB},
|
||||||
|
"MV": {POS: VERB, "VerbType": "mod"},
|
||||||
|
"PTK": {POS: PART},
|
||||||
|
"INTER": {POS: PART},
|
||||||
|
"NUM": {POS: NUM},
|
||||||
|
"_SP": {POS: SPACE},
|
||||||
|
}
|
47
spacy/lang/lb/tokenizer_exceptions.py
Normal file
|
@ -0,0 +1,47 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA
|
||||||
|
from ..punctuation import TOKENIZER_PREFIXES
|
||||||
|
|
||||||
|
# TODO
|
||||||
|
# tokenize cliticised definite article "d'" as token of its own: d'Kanner > [d'] [Kanner]
|
||||||
|
# treat other apostrophes within words as part of the word: [op d'mannst], [fir d'éischt] (= exceptions)
|
||||||
|
|
||||||
|
# how to write the tokenisation exception for the articles d' / D' ? This one is not working.
|
||||||
|
_prefixes = [prefix for prefix in TOKENIZER_PREFIXES if prefix not in ["d'", "D'", "d’", "D’", r"\' "]]
|
||||||
|
|
||||||
|
|
||||||
|
_exc = {
|
||||||
|
"d'mannst": [
|
||||||
|
{ORTH: "d'", LEMMA: "d'"},
|
||||||
|
{ORTH: "mannst", LEMMA: "mann", NORM: "mann"}],
|
||||||
|
"d'éischt": [
|
||||||
|
{ORTH: "d'", LEMMA: "d'"},
|
||||||
|
{ORTH: "éischt", LEMMA: "éischt", NORM: "éischt"}]
|
||||||
|
}
|
||||||
|
|
||||||
|
# translate / delete what is not necessary
|
||||||
|
# what does PRON_LEMMA mean?
|
||||||
|
for exc_data in [
    {ORTH: "wgl.", LEMMA: "wann ech gelift", NORM: "wann ech gelift"},
    {ORTH: "M.", LEMMA: "Monsieur", NORM: "Monsieur"},
    {ORTH: "Mme.", LEMMA: "Madame", NORM: "Madame"},
    {ORTH: "Dr.", LEMMA: "Dokter", NORM: "Dokter"},
    {ORTH: "Tel.", LEMMA: "Telefon", NORM: "Telefon"},
    {ORTH: "asw.", LEMMA: "an sou weider", NORM: "an sou weider"},
    {ORTH: "etc.", LEMMA: "et cetera", NORM: "et cetera"},
    {ORTH: "bzw.", LEMMA: "bezéiungsweis", NORM: "bezéiungsweis"},
    {ORTH: "Jan.", LEMMA: "Januar", NORM: "Januar"}]:
    _exc[exc_data[ORTH]] = [exc_data]
|
||||||
|
|
||||||
|
|
||||||
|
# to be extended
|
||||||
|
for orth in [
    "z.B.", "Dipl.", "i.e.", "o.k.", "O.K.", "p.a.", "p.s.", "P.S.", "phil.",
    "q.e.d.", "R.I.P.", "rer.", "sen.", "ë.a.", "U.S.", "U.S.A."]:
    # "Dr." and "etc." are defined above with LEMMA and NORM; listing them here would overwrite those entries
    _exc[orth] = [{ORTH: orth}]
|
||||||
|
|
||||||
|
|
||||||
|
TOKENIZER_PREFIXES = _prefixes
|
||||||
|
TOKENIZER_EXCEPTIONS = _exc
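
Note on the exception format: each key maps to a list of token dicts, and the `ORTH` values are expected to join back to exactly the key string (as far as I know, spaCy checks this when the exceptions are merged). The sketch below is a standalone consistency check under that assumption. The string `"orth"` is a hypothetical stand-in for spaCy's `ORTH` symbol, and `mismatched_keys` is an illustrative helper, not part of spaCy.

```python
# "orth" is a hypothetical plain-string stand-in for spaCy's ORTH symbol.
ORTH = "orth"

exceptions = {
    "d'mannst": [{ORTH: "d'"}, {ORTH: "mannst"}],
    "z.B.": [{ORTH: "z.B."}],
}


def mismatched_keys(exc):
    """Return the keys whose token ORTH values do not join back to the key."""
    return [
        key
        for key, tokens in exc.items()
        if "".join(token[ORTH] for token in tokens) != key
    ]


bad = mismatched_keys(exceptions)
# A deliberately broken entry: "d'" + "mann" does not spell "d'mannst".
broken = mismatched_keys({"d'mannst": [{ORTH: "d'"}, {ORTH: "mann"}]})
print(bad, broken)
```

A check like this could run in a unit test so that a typo in an `ORTH` value is caught before the tokenizer is built.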
|
|
@ -134,7 +134,10 @@ def ko_tokenizer():
    pytest.importorskip("natto")
    return get_lang_class("ko").Defaults.create_tokenizer()


@pytest.fixture(scope="session")
def lb_tokenizer():
    return get_lang_class("lb").Defaults.create_tokenizer()


@pytest.fixture(scope="session")
def lt_tokenizer():
    return get_lang_class("lt").Defaults.create_tokenizer()
|
0
spacy/tests/lang/lb/__init__.py
Normal file
12
spacy/tests/lang/lb/test_exceptions.py
Normal file
|
@ -0,0 +1,12 @@
|
||||||
|
# coding: utf-8
|
||||||
|
# from __future__ import unicolb_literals
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("text", ["z.B.", "Jan."])
|
||||||
|
def test_lb_tokenizer_handles_abbr(lb_tokenizer, text):
    tokens = lb_tokenizer(text)
    assert len(tokens) == 1
|
||||||
|
|
26
spacy/tests/lang/lb/test_prefix_suffix_infix.py
Normal file
|
@ -0,0 +1,26 @@
|
||||||
|
# coding: utf-8
|
||||||
|
#from __future__ import unicolb_literals
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("text,length", [("z.B.", 1), ("zb.", 2), ("(z.B.", 2)])
|
||||||
|
def test_lb_tokenizer_splits_prefix_interact(lb_tokenizer, text, length):
    tokens = lb_tokenizer(text)
    assert len(tokens) == length
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("text", ["z.B.)"])
|
||||||
|
def test_lb_tokenizer_splits_suffix_interact(lb_tokenizer, text):
    tokens = lb_tokenizer(text)
    assert len(tokens) == 2
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("text", ["(z.B.)"])
|
||||||
|
def test_lb_tokenizer_splits_even_wrap_interact(lb_tokenizer, text):
    tokens = lb_tokenizer(text)
    assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
|
|
32
spacy/tests/lang/lb/test_text.py
Normal file
|
@ -0,0 +1,32 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
def test_lb_tokenizer_handles_long_text(lb_tokenizer):
    text = """Den Nordwand an d'Sonn
|
||||||
|
|
||||||
|
An der Zäit hunn sech den Nordwand an d’Sonn gestridden, wie vun hinnen zwee wuel méi staark wier, wéi e Wanderer, deen an ee waarme Mantel agepak war, iwwert de Wee koum. Si goufen sech eens, dass deejéinege fir de Stäerkste gëlle sollt, deen de Wanderer forcéiere géif, säi Mantel auszedoen.",
|
||||||
|
|
||||||
|
Den Nordwand huet mat aller Force geblosen, awer wat e méi geblosen huet, wat de Wanderer sech méi a säi Mantel agewéckelt huet. Um Enn huet den Nordwand säi Kampf opginn.
|
||||||
|
|
||||||
|
Dunn huet d’Sonn d’Loft mat hire frëndleche Strale gewiermt, a schonn no kuerzer Zäit huet de Wanderer säi Mantel ausgedoen.
|
||||||
|
|
||||||
|
Do huet den Nordwand missen zouginn, dass d’Sonn vun hinnen zwee de Stäerkste wier."""
|
||||||
|
|
||||||
|
    tokens = lb_tokenizer(text)
    assert len(tokens) == 143
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize(
|
||||||
|
"text,length",
|
||||||
|
[
|
||||||
|
("»Wat ass mat mir geschitt?«, huet hie geduecht.", 13),
|
||||||
|
("“Dëst fréi Opstoen”, denkt hien, “mécht ee ganz duercherneen. ", 15),
|
||||||
|
],
|
||||||
|
)
|
||||||
|
def test_lb_tokenizer_handles_examples(lb_tokenizer, text, length):
    tokens = lb_tokenizer(text)
    assert len(tokens) == length
|