mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-24 17:06:29 +03:00
Add Slovak language tools implementation (#4943)
* Add correct stopwords for Slovak language
* Add SNK Tags
* Disable formatting lint for TAGS
* Add example sentences for Slovak language
* Add Slovak numerals in base form
* Add lex_attrs to sk init
* Add contributor agreement
parent 9fa9d7f2cb
commit d4f4060bf3
106
.github/contributors/drndos.md
vendored
Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [ ] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [x] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Filip Bednárik       |
| Company name (if applicable)   | Ardevop, s. r. o.    |
| Title or role (if applicable)  | IT Consultant        |
| Date                           | 2020-01-26           |
| GitHub username                | drndos               |
| Website (optional)             | https://ardevop.sk   |
@@ -2,13 +2,18 @@
from __future__ import unicode_literals

from .stop_words import STOP_WORDS
from .tag_map import TAG_MAP
from .lex_attrs import LEX_ATTRS

from ...language import Language
from ...attrs import LANG


class SlovakDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "sk"
    tag_map = TAG_MAP
    stop_words = STOP_WORDS

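The `SlovakDefaults` hunk copies the inherited `lex_attr_getters` dict before updating it, which keeps the Slovak overrides from leaking into the shared base class. A minimal standalone sketch of that pattern, assuming plain string keys and stand-in classes (`BaseDefaults` and `SlovakLikeDefaults` are hypothetical names, not spaCy classes):

```python
class BaseDefaults:
    # Stand-in for Language.Defaults; spaCy uses attribute IDs as keys,
    # plain strings are used here only to keep the sketch self-contained.
    lex_attr_getters = {
        "lang": lambda text: "xx",
        "like_num": lambda text: text.isdigit(),
    }


class SlovakLikeDefaults(BaseDefaults):
    # Copy first, so the updates below touch only this class's dict.
    lex_attr_getters = dict(BaseDefaults.lex_attr_getters)
    lex_attr_getters.update(
        {"like_num": lambda text: text.isdigit() or text.lower() in {"jeden", "dva"}}
    )
    lex_attr_getters["lang"] = lambda text: "sk"


print(BaseDefaults.lex_attr_getters["lang"]("dva"))
print(SlovakLikeDefaults.lex_attr_getters["lang"]("dva"))
print(SlovakLikeDefaults.lex_attr_getters["like_num"]("dva"))
```

Without the `dict(...)` copy, `update` would mutate the base class's dict and change the getters seen by every other language that inherits from it.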
27
spacy/lang/sk/examples.py
Normal file
@@ -0,0 +1,27 @@
# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.sk.examples import sentences
>>> docs = nlp.pipe(sentences)
"""


sentences = [
    "Ardevop, s.r.o. je malá startup firma na území SR.",
    "Samojazdiace autá presúvajú poistnú zodpovednosť na výrobcov automobilov.",
    "Košice sú na východe.",
    "Bratislava je hlavné mesto Slovenskej republiky.",
    "Kde si?",
    "Kto je prezidentom Francúzska?",
    "Aké je hlavné mesto Slovenska?",
    "Kedy sa narodil Andrej Kiska?",
    "Včera som dostal 100€ na ruku.",
    "Dnes je nedeľa 26.1.2020.",
    "Narodil sa 15.4.1998 v Ružomberku.",
    "Niekto mi povedal, že 500 eur je veľa peňazí.",
    "Podaj mi ruku!",
]
62
spacy/lang/sk/lex_attrs.py
Normal file
@@ -0,0 +1,62 @@
# coding: utf8
from __future__ import unicode_literals

from ...attrs import LIKE_NUM

_num_words = [
    "nula",
    "jeden",
    "dva",
    "tri",
    "štyri",
    "päť",
    "šesť",
    "sedem",
    "osem",
    "deväť",
    "desať",
    "jedenásť",
    "dvanásť",
    "trinásť",
    "štrnásť",
    "pätnásť",
    "šestnásť",
    "sedemnásť",
    "osemnásť",
    "devätnásť",
    "dvadsať",
    "tridsať",
    "štyridsať",
    "päťdesiat",
    "šesťdesiat",
    "sedemdesiat",
    "osemdesiat",
    "deväťdesiat",
    "sto",
    "tisíc",
    "milión",
    "miliarda",
    "bilión",
    "biliarda",
    "trilión",
    "triliarda",
    "kvadrilión",
]


def like_num(text):
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if text.lower() in _num_words:
        return True
    return False


LEX_ATTRS = {LIKE_NUM: like_num}
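The `like_num` heuristic in `lex_attrs.py` can be tried without installing spaCy. The sketch below repeats the same logic as a standalone function, assuming a shortened numeral list (`NUM_WORDS_SAMPLE` is a hypothetical name; the real `_num_words` list is longer). It is meant only to show which token shapes the heuristic accepts.

```python
# Shortened, hypothetical sample of the base-form numerals from the diff.
NUM_WORDS_SAMPLE = ["nula", "jeden", "dva", "tri", "päť", "sto", "tisíc"]


def like_num(text):
    # Drop one leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands and decimal separators.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Accept simple fractions such as 3/4.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Accept spelled-out numerals, case-insensitively.
    if text.lower() in NUM_WORDS_SAMPLE:
        return True
    return False


for token in ["100", "1.000,50", "-7", "3/4", "Päť", "dvoch", "1/2/3", "auto"]:
    print(token, like_num(token))
```

Because the list holds base forms only, an inflected numeral such as `dvoch` is not recognised, and a token with more than one slash falls through to `False`.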
@@ -2,7 +2,7 @@
from __future__ import unicode_literals


# Source: https://github.com/stopwords-iso/stopwords-sk
# Source: https://github.com/Ardevop-sk/stopwords-sk

STOP_WORDS = set(
    """
@@ -10,17 +10,41 @@ a
aby
aj
ak
akej
akejže
ako
akom
akomže
akou
akouže
akože
aká
akáže
aké
akého
akéhože
akému
akémuže
akéže
akú
akúže
aký
akých
akýchže
akým
akými
akýmiže
akýmže
akýže
ale
alebo
and
ani
asi
avšak
až
ba
bez
bezo
bol
bola
boli
@@ -31,23 +55,32 @@ budeme
budete
budeš
budú
buï
buď
by
byť
cez
cezo
dnes
do
ešte
for
ho
hoci
i
iba
ich
im
inej
inom
iná
iné
iného
inému
iní
inú
iný
iných
iným
inými
ja
je
jeho
@@ -56,80 +89,185 @@ jemu
ju
k
kam
kamže
každou
každá
každé
každého
každému
každí
každú
každý
každých
každým
každými
kde
kedže
keï
kej
kejže
keď
keďže
kie
kieho
kiehože
kiemu
kiemuže
kieže
koho
kom
komu
kou
kouže
kto
ktorej
ktorou
ktorá
ktoré
ktorí
ktorú
ktorý
ktorých
ktorým
ktorými
ku
ká
káže
ké
kéže
kú
kúže
ký
kýho
kýhože
kým
kýmu
kýmuže
kýže
lebo
leda
ledaže
len
ma
majú
mal
mala
mali
mať
medzi
menej
mi
mna
mne
mnou
moja
moje
mojej
mojich
mojim
mojimi
mojou
moju
možno
mu
musia
musieť
musí
musím
musíme
musíte
musíš
my
má
mám
máme
máte
mòa
máš
môcť
môj
môjho
môže
môžem
môžeme
môžete
môžeš
môžu
mňa
na
nad
nado
najmä
nami
naša
naše
našej
naši
našich
našim
našimi
našou
ne
nech
neho
nej
nejakej
nejakom
nejakou
nejaká
nejaké
nejakého
nejakému
nejakú
nejaký
nejakých
nejakým
nejakými
nemu
než
nich
nie
niektorej
niektorom
niektorou
niektorá
niektoré
niektorého
niektorému
niektorú
niektorý
niektorých
niektorým
niektorými
nielen
niečo
nim
nimi
nič
ničoho
ničom
ničomu
ničím
no
nová
nové
noví
nový
nám
nás
náš
nášho
ním
o
od
odo
of
on
ona
oni
ono
ony
oň
oňho
po
pod
podo
podľa
pokiaľ
popod
popri
potom
poza
pre
pred
predo
@@ -137,42 +275,56 @@ preto
pretože
prečo
pri
prvá
prvé
prví
prvý
práve
pýta
s
sa
seba
sebe
sebou
sem
si
sme
so
som
späť
ste
svoj
svoja
svoje
svojho
svojich
svojim
svojimi
svojou
svoju
svojím
svojími
sú
ta
tak
takej
takejto
taká
takáto
také
takého
takéhoto
takému
takémuto
takéto
takí
takú
takúto
taký
takýto
takže
tam
te
teba
tebe
tebou
teda
tej
tejto
ten
tento
the
ti
tie
tieto
@@ -180,52 +332,97 @@ tiež
to
toho
tohoto
tohto
tom
tomto
tomu
tomuto
toto
tou
touto
tu
tvoj
tvojími
tvoja
tvoje
tvojej
tvojho
tvoji
tvojich
tvojim
tvojimi
tvojím
ty
tá
táto
tí
títo
tú
túto
tých
tým
tými
týmto
tě
u
už
v
vami
vaša
vaše
veï
vašej
vaši
vašich
vašim
vaším
veď
viac
vo
vy
vám
vás
váš
vášho
však
všetci
všetka
všetko
všetky
všetok
z
za
začo
začože
zo
a
áno
èi
èo
èí
òom
òou
òu
čej
či
čia
čie
čieho
čiemu
čiu
čo
čoho
čom
čomu
čou
čože
čí
čím
čími
ďalšia
ďalšie
ďalšieho
ďalšiemu
ďalšiu
ďalšom
ďalšou
ďalší
ďalších
ďalším
ďalšími
ňom
ňou
ňu
že
""".split()
)
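The stop-word file builds its set by splitting one multi-line string on whitespace, so each word sits on its own line and accidental duplicates collapse when the set is constructed. A minimal sketch of that idiom, assuming a tiny hypothetical excerpt (`SAMPLE_STOP_WORDS`) and a hypothetical helper `remove_stop_words` that is not part of spaCy:

```python
# Hypothetical excerpt; the list in stop_words.py is far longer.
# "a" is listed twice on purpose to show that the set removes duplicates.
SAMPLE_STOP_WORDS = set(
    """
a
aby
aj
buď
keď
že
a
""".split()
)


def remove_stop_words(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Return the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


print(len(SAMPLE_STOP_WORDS))
print(remove_stop_words(["Keď", "prší", "a", "fúka"]))
```

The mis-encoded entries visible in the hunks (for example `buï` next to `buď`) would never match correctly encoded text in a lookup like this, which is presumably why the commit replaces them.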
1467
spacy/lang/sk/tag_map.py
Normal file
File diff suppressed because it is too large