mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-25 09:26:27 +03:00
Lithuanian language support (#3895)
* initial LT lang support
* Added more stopwords. Started setting up some basic test environment (not complete)
* Initial morph rules for LT lang
* Closes #1 Adds tokenizer exceptions for Lithuanian
* Closes #5 Punctuation rules. Closes #6 Lexical Attributes
* test: add native examples to basic tests
* feat: add tag map for lt lang
* fix: remove undefined tag attribute 'Definite'
* feat: add lemmatizer for lt lang
* refactor: add new instances to lt lang morph rules; use tags from tag map
* refactor: add morph rules to lt lang defaults
* refactor: only keep nouns, verbs, adverbs and adjectives in lt lang lemmatizer lookup
* refactor: add capitalized words to lt lang lemmatizer
* refactor: add more num words to lt lang lex attrs
* refactor: update lt lang stop word set
* refactor: add new instances to lt lang tokenizer exceptions
* refactor: remove comments from lt lang init file
* refactor: use function instead of lambda in lt lex lang getter
* refactor: remove conversion to dict in lt init when dict is already provided
* chore: rename lt 'test_basic' to 'test_text'
* feat: add more lt text tests
* feat: add lemmatizer tests
* refactor: remove unused imports, add newline to end of file
* chore: add contributor agreement
* chore: change 'en' to 'lt' in lt example description
* fix: add missing encoding info
* style: add newline to end of file
* refactor: use python2 compatible syntax
* style: reformat code using black
This commit is contained in:
parent
4f1dae1c6b
commit
61ce126d4c
106  .github/contributors/rokasramas.md  (vendored, new file)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an "x" on one of the applicable statements below. Please do NOT
mark both statements:

    * [ ] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [x] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                   |
|------------------------------- | ----------------------- |
| Name                           | Rokas Ramanauskas       |
| Company name (if applicable)   | TokenMill               |
| Title or role (if applicable)  | Software Engineer       |
| Date                           | 2019-07-02              |
| GitHub username                | rokasramas              |
| Website (optional)             | http://www.tokenmill.lt |
spacy/lang/lt/__init__.py
@@ -1,15 +1,37 @@
 # coding: utf8
 from __future__ import unicode_literals
 
+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
+from .lex_attrs import LEX_ATTRS
+from .tag_map import TAG_MAP
+from .lemmatizer import LOOKUP
+from .morph_rules import MORPH_RULES
+
+from ..tokenizer_exceptions import BASE_EXCEPTIONS
+from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG
+from ...attrs import LANG, NORM
+from ...util import update_exc, add_lookups
+
+
+def _return_lt(_):
+    return "lt"
 
 
 class LithuanianDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
-    lex_attr_getters[LANG] = lambda text: "lt"
+    lex_attr_getters[LANG] = _return_lt
+    lex_attr_getters[NORM] = add_lookups(
+        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
+    )
+    lex_attr_getters.update(LEX_ATTRS)
+
+    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
+    tag_map = TAG_MAP
+    morph_rules = MORPH_RULES
+    lemma_lookup = LOOKUP
 
 
 class Lithuanian(Language):
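The defaults class in this diff builds its tokenizer exceptions by merging the shared `BASE_EXCEPTIONS` with the Lithuanian-specific dict via `update_exc`. A minimal sketch of that merge order, as a simplified stand-in (the real `spacy.util.update_exc` also validates the token dicts; the sample dicts below are illustrative, not taken from the commit):

```python
# Simplified stand-in for spacy.util.update_exc: later dicts override
# earlier ones, so language-specific exceptions win over the base set.
def update_exc(base_exceptions, *addition_dicts):
    exc = dict(base_exceptions)    # start from the shared base rules
    for additions in addition_dicts:
        exc.update(additions)      # later entries override earlier ones
    return exc


# Tiny illustrative dicts in the same {orth: [token dicts]} shape.
BASE_EXCEPTIONS = {"etc.": [{"ORTH": "etc."}]}
LT_EXCEPTIONS = {
    "pvz.": [{"ORTH": "pvz."}],
    "etc.": [{"ORTH": "etc.", "NORM": "etc."}],  # overrides the base entry
}

merged = update_exc(BASE_EXCEPTIONS, LT_EXCEPTIONS)
```

Because the Lithuanian dict is passed last, its `"etc."` entry replaces the base one while `"pvz."` is simply added.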
22  spacy/lang/lt/examples.py  (new file)
@ -0,0 +1,22 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
|
"""
|
||||||
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
>>> from spacy.lang.lt.examples import sentences
|
||||||
|
>>> docs = nlp.pipe(sentences)
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
sentences = [
|
||||||
|
"Jaunikis pirmąją vestuvinę naktį iškeitė į areštinės gultą",
|
||||||
|
"Bepiločiai automobiliai išnaikins vairavimo mokyklas, autoservisus ir eismo nelaimes",
|
||||||
|
"Vilniuje galvojama uždrausti naudoti skėčius",
|
||||||
|
"Londonas yra didelis miestas Jungtinėje Karalystėje",
|
||||||
|
"Kur tu?",
|
||||||
|
"Kas yra Prancūzijos prezidentas?",
|
||||||
|
"Kokia yra Jungtinių Amerikos Valstijų sostinė?",
|
||||||
|
"Kada gimė Dalia Grybauskaitė?",
|
||||||
|
]
|
234227  spacy/lang/lt/lemmatizer.py  (new file; diff suppressed because it is too large)
1153  spacy/lang/lt/lex_attrs.py  (new file; diff suppressed because it is too large)
3075  spacy/lang/lt/morph_rules.py  (new file; diff suppressed because it is too large)
(one further new file; diff suppressed because it is too large)
4798  spacy/lang/lt/tag_map.py  (new file; diff suppressed because it is too large)
268  spacy/lang/lt/tokenizer_exceptions.py  (new file)
@@ -0,0 +1,268 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import ORTH

_exc = {}

for orth in [
    "G.",
    "J. E.",
    "J. Em.",
    "J.E.",
    "J.Em.",
    "K.",
    "N.",
    "V.",
    "Vt.",
    "a.",
    "a.k.",
    "a.s.",
    "adv.",
    "akad.",
    "aklg.",
    "akt.",
    "al.",
    "ang.",
    "angl.",
    "aps.",
    "apskr.",
    "apyg.",
    "arbat.",
    "asist.",
    "asm.",
    "asm.k.",
    "asmv.",
    "atk.",
    "atsak.",
    "atsisk.",
    "atsisk.sąsk.",
    "atv.",
    "aut.",
    "avd.",
    "b.k.",
    "baud.",
    "biol.",
    "bkl.",
    "bot.",
    "bt.",
    "buv.",
    "ch.",
    "chem.",
    "corp.",
    "d.",
    "dab.",
    "dail.",
    "dek.",
    "deš.",
    "dir.",
    "dirig.",
    "doc.",
    "dol.",
    "dr.",
    "drp.",
    "dvit.",
    "dėst.",
    "dš.",
    "dž.",
    "e.b.",
    "e.bankas",
    "e.p.",
    "e.parašas",
    "e.paštas",
    "e.v.",
    "e.valdžia",
    "egz.",
    "eil.",
    "ekon.",
    "el.",
    "el.bankas",
    "el.p.",
    "el.parašas",
    "el.paštas",
    "el.valdžia",
    "etc.",
    "ež.",
    "fak.",
    "faks.",
    "feat.",
    "filol.",
    "filos.",
    "g.",
    "gen.",
    "geol.",
    "gerb.",
    "gim.",
    "gr.",
    "gv.",
    "gyd.",
    "gyv.",
    "habil.",
    "inc.",
    "insp.",
    "inž.",
    "ir pan.",
    "ir t. t.",
    "isp.",
    "istor.",
    "it.",
    "just.",
    "k.",
    "k. a.",
    "k.a.",
    "kab.",
    "kand.",
    "kart.",
    "kat.",
    "ketv.",
    "kh.",
    "kl.",
    "kln.",
    "km.",
    "kn.",
    "koresp.",
    "kpt.",
    "kr.",
    "kt.",
    "kub.",
    "kun.",
    "kv.",
    "kyš.",
    "l. e. p.",
    "l.e.p.",
    "lenk.",
    "liet.",
    "lot.",
    "lt.",
    "ltd.",
    "ltn.",
    "m.",
    "m.e..",
    "m.m.",
    "mat.",
    "med.",
    "mgnt.",
    "mgr.",
    "min.",
    "mjr.",
    "ml.",
    "mln.",
    "mlrd.",
    "mob.",
    "mok.",
    "moksl.",
    "mokyt.",
    "mot.",
    "mr.",
    "mst.",
    "mstl.",
    "mėn.",
    "nkt.",
    "no.",
    "nr.",
    "ntk.",
    "nuotr.",
    "op.",
    "org.",
    "orig.",
    "p.",
    "p.d.",
    "p.m.e.",
    "p.s.",
    "pab.",
    "pan.",
    "past.",
    "pav.",
    "pavad.",
    "per.",
    "perd.",
    "pirm.",
    "pl.",
    "plg.",
    "plk.",
    "pr.",
    "pr.Kr.",
    "pranc.",
    "proc.",
    "prof.",
    "prom.",
    "prot.",
    "psl.",
    "pss.",
    "pvz.",
    "pšt.",
    "r.",
    "raj.",
    "red.",
    "rez.",
    "rež.",
    "rus.",
    "rš.",
    "s.",
    "sav.",
    "saviv.",
    "sek.",
    "sekr.",
    "sen.",
    "sh.",
    "sk.",
    "skg.",
    "skv.",
    "skyr.",
    "sp.",
    "spec.",
    "sr.",
    "st.",
    "str.",
    "stud.",
    "sąs.",
    "t.",
    "t. p.",
    "t. y.",
    "t.p.",
    "t.t.",
    "t.y.",
    "techn.",
    "tel.",
    "teol.",
    "th.",
    "tir.",
    "trit.",
    "trln.",
    "tšk.",
    "tūks.",
    "tūkst.",
    "up.",
    "upl.",
    "v.s.",
    "vad.",
    "val.",
    "valg.",
    "ved.",
    "vert.",
    "vet.",
    "vid.",
    "virš.",
    "vlsč.",
    "vnt.",
    "vok.",
    "vs.",
    "vtv.",
    "vv.",
    "vyr.",
    "vyresn.",
    "zool.",
    "Įn",
    "įl.",
    "š.m.",
    "šnek.",
    "šv.",
    "švč.",
    "ž.ū.",
    "žin.",
    "žml.",
    "žr.",
]:
    _exc[orth] = [{ORTH: orth}]

TOKENIZER_EXCEPTIONS = _exc
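Each entry above maps an abbreviation to a single token whose `ORTH` is the abbreviation itself, dot included, so the tokenizer never splits the period off. A toy illustration of the effect (this is not spaCy's actual tokenizer, just a hypothetical whitespace splitter with one period rule; `EXC` holds a three-item sample of the list above):

```python
# EXC is a tiny sample of the Lithuanian abbreviation list, not the full set.
EXC = {"pvz.", "dr.", "km."}


def toy_tokenize(text):
    tokens = []
    for word in text.split():
        if word in EXC:
            tokens.append(word)              # exception: keep the dot attached
        elif word.endswith("."):
            tokens.extend([word[:-1], "."])  # default rule: split a final period
        else:
            tokens.append(word)
    return tokens
```

`toy_tokenize("pvz. namas.")` keeps the abbreviation `pvz.` as one token while splitting the ordinary word `namas` from its sentence-final period.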
spacy/tests/conftest.py
@@ -124,6 +124,16 @@ def ja_tokenizer():
     return get_lang_class("ja").Defaults.create_tokenizer()
 
 
+@pytest.fixture(scope="session")
+def lt_tokenizer():
+    return get_lang_class("lt").Defaults.create_tokenizer()
+
+
+@pytest.fixture(scope="session")
+def lt_lemmatizer():
+    return get_lang_class("lt").Defaults.create_lemmatizer()
+
+
 @pytest.fixture(scope="session")
 def nb_tokenizer():
     return get_lang_class("nb").Defaults.create_tokenizer()
0  spacy/tests/lang/lt/__init__.py  (new file)
15  spacy/tests/lang/lt/test_lemmatizer.py  (new file)
@@ -0,0 +1,15 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize("tokens,lemmas", [
    (["Galime", "vadinti", "gerovės", "valstybe", ",", "turime", "išvystytą", "socialinę", "apsaugą", ",",
      "sveikatos", "apsaugą", "ir", "prieinamą", "švietimą", "."],
     ["galėti", "vadintas", "gerovė", "valstybė", ",", "turėti", "išvystytas", "socialinis",
      "apsauga", ",", "sveikata", "apsauga", "ir", "prieinamas", "švietimas", "."]),
    (["taip", ",", "uoliai", "tyrinėjau", "ir", "pasirinkau", "geriausią", "variantą", "."],
     ["taip", ",", "uolus", "tyrinėti", "ir", "pasirinkti", "geras", "variantas", "."])])
def test_lt_lemmatizer(lt_lemmatizer, tokens, lemmas):
    assert lemmas == [lt_lemmatizer.lookup(token) for token in tokens]
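The test above relies on lookup-lemmatizer behaviour: `lt_lemmatizer.lookup(token)` is, to a first approximation, a plain table lookup that falls back to the surface form when a token has no entry, which is why punctuation like `","` maps to itself. A minimal sketch under that assumption (the `LOOKUP` entries here are copied from the test data, not from the real 234k-line table in spacy/lang/lt/lemmatizer.py):

```python
# Three pairs sampled from the test data above; the shipped table is huge.
LOOKUP = {"Galime": "galėti", "gerovės": "gerovė", "turime": "turėti"}


def lookup(string):
    # Unknown strings (punctuation, unseen words) lemmatize to themselves.
    return LOOKUP.get(string, string)
```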
44  spacy/tests/lang/lt/test_text.py  (new file)
@@ -0,0 +1,44 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


def test_lt_tokenizer_handles_long_text(lt_tokenizer):
    text = """Tokios sausros kriterijus atitinka pirmadienį atlikti skaičiavimai, palyginus faktinį ir žemiausią
vidutinį daugiametį vandens lygį. Nustatyta, kad iš 48 šalies vandens matavimo stočių 28-iose stotyse vandens lygis
yra žemesnis arba lygus žemiausiam vidutiniam daugiamečiam šiltojo laikotarpio vandens lygiui."""
    tokens = lt_tokenizer(text.replace("\n", ""))
    assert len(tokens) == 42


@pytest.mark.parametrize('text,length', [
    ("177R Parodų rūmai–Ozo g. nuo vasario 18 d. bus skelbiamas interneto tinklalapyje.", 15),
    ("ISM universiteto doc. dr. Ieva Augutytė-Kvedaravičienė pastebi, kad tyrimais nustatyti elgesio pokyčiai.", 16)])
def test_lt_tokenizer_handles_punct_abbrev(lt_tokenizer, text, length):
    tokens = lt_tokenizer(text)
    assert len(tokens) == length


@pytest.mark.parametrize("text", ["km.", "pvz.", "biol."])
def test_lt_tokenizer_abbrev_exceptions(lt_tokenizer, text):
    tokens = lt_tokenizer(text)
    assert len(tokens) == 1


@pytest.mark.parametrize("text,match", [
    ("10", True),
    ("1", True),
    ("10,000", True),
    ("10,00", True),
    ("999.0", True),
    ("vienas", True),
    ("du", True),
    ("milijardas", True),
    ("šuo", False),
    (",", False),
    ("1/2", True)])
def test_lt_lex_attrs_like_number(lt_tokenizer, text, match):
    tokens = lt_tokenizer(text)
    assert len(tokens) == 1
    assert tokens[0].like_num == match
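The `like_num` cases above pin down the behaviour of the lex attrs added in this commit: digit strings with separators, simple fractions, and Lithuanian number words all count. A hedged sketch of how such an attribute is typically written (this is not the actual spacy/lang/lt/lex_attrs.py implementation, and `_num_words` is a three-word illustrative sample, not the module's full list):

```python
# _num_words is an illustrative sample; the real list is far longer.
_num_words = {"vienas", "du", "milijardas"}


def like_num(text):
    # Strip thousands/decimal separators so "10,000" and "999.0" count.
    stripped = text.replace(",", "").replace(".", "")
    if stripped.isdigit():
        return True
    # Accept simple fractions such as "1/2".
    if stripped.count("/") == 1:
        num, denom = stripped.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return stripped.lower() in _num_words
```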