Updates to Swedish Language (#3164)
* Added the same punctuation rules as the Danish language.
* Added abbreviations, with capitalized variants for some, plus a few specific cases.
* Added a test for long texts in Swedish.
* Added morph rules, infixes and suffixes to __init__.py for Swedish.
* Added tests for prefixes, infixes and suffixes.
* Added tests for lemmas.
* Renamed files to follow convention.
* [sv] Removed ambiguous abbreviations.
* Added more tests for tokenizer exceptions.
* Added a test for the punctuation problem from issue #2578.
* Contributor agreement.
* Removed the faulty lemmatization of 'jag' ('I'), which was lemmatized to 'jaga' ('hunt').
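For illustration only (not part of the commit), a quick smoke test of the headline change; the expected output mirrors the tokenizer-exception tests added further down:

from spacy.lang.sv import Swedish

nlp = Swedish()
doc = nlp("Smörsåsen används bl.a. till fisk")
print([t.text for t in doc])
# expected: ['Smörsåsen', 'används', 'bl.a.', 'till', 'fisk']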
This commit is contained in: parent 9a5003d5c8, commit b892b446cc
.github/contributors/boena.md (new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                           | Entry                |
|-------------------------------- | -------------------- |
| Name                            | Björn Lennartsson    |
| Company name (if applicable)    | Uptrail AB           |
| Title or role (if applicable)   | CTO                  |
| Date                            | 2019-01-15           |
| GitHub username                 | boena                |
| Website (optional)              | www.uptrail.com      |

spacy/lang/sv/__init__.py

@@ -5,6 +5,7 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS
from .morph_rules import MORPH_RULES
from .lemmatizer import LEMMA_RULES, LOOKUP
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS

@@ -18,11 +19,13 @@ class SwedishDefaults(Language.Defaults):
    lex_attr_getters[LANG] = lambda text: 'sv'
    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    morph_rules = MORPH_RULES
    infixes = TOKENIZER_INFIXES
    suffixes = TOKENIZER_SUFFIXES
    stop_words = STOP_WORDS
    lemma_rules = LEMMA_RULES
    lemma_lookup = LOOKUP


class Swedish(Language):
    lang = 'sv'
    Defaults = SwedishDefaults
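A minimal sketch (not part of the diff) of how the Defaults above feed the tokenizer. It assumes spaCy 2.x, where Defaults.create_tokenizer() is available, and mirrors the comma-infix test added further down:

from spacy.lang.sv import Swedish

tokenizer = Swedish.Defaults.create_tokenizer()
print([t.text for t in tokenizer("Hej,Världen")])
# expected: ['Hej', ',', 'Världen'] via the new comma infix rule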

spacy/lang/sv/lemmatizer.py

@@ -233167,7 +233167,6 @@ LOOKUP = {
    "jades": "jade",
    "jaet": "ja",
    "jaets": "ja",
    "jag": "jaga",
    "jagad": "jaga",
    "jagade": "jaga",
    "jagades": "jaga",

spacy/lang/sv/punctuation.py (new file, 25 lines)

@@ -0,0 +1,25 @@
# coding: utf8
"""Punctuation rules adapted from Danish."""
from __future__ import unicode_literals

from ..char_classes import LIST_ELLIPSES, LIST_ICONS
from ..char_classes import QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
from ..punctuation import TOKENIZER_SUFFIXES


_quotes = QUOTES.replace("'", '')

_infixes = (LIST_ELLIPSES + LIST_ICONS +
            [r'(?<=[{}])\.(?=[{}])'.format(ALPHA_LOWER, ALPHA_UPPER),
             r'(?<=[{a}])[,!?](?=[{a}])'.format(a=ALPHA),
             r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA),
             r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
             r'(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])'.format(a=ALPHA, q=_quotes),
             r'(?<=[{a}])--(?=[{a}])'.format(a=ALPHA)])

_suffixes = [suffix for suffix in TOKENIZER_SUFFIXES
             if suffix not in ["'s", "'S", "’s", "’S", r"\'"]]
_suffixes += [r"(?<=[^sSxXzZ])\'"]


TOKENIZER_INFIXES = _infixes
TOKENIZER_SUFFIXES = _suffixes
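Illustrative only (not part of the diff): the suffix change above keeps a bare genitive apostrophe attached when it follows s/S/x/X/z/Z, which is what the new test_tokenizer_handles_no_punct cases below rely on:

from spacy.lang.sv import Swedish

nlp = Swedish()
print([t.text for t in nlp("Lars'")])
# expected: ["Lars'"] -- the trailing apostrophe is not split off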

spacy/lang/sv/tokenizer_exceptions.py

@@ -24,14 +24,15 @@ for verb_data in [
            dict(data),
            {ORTH: "u", LEMMA: PRON_LEMMA, NORM: "du"}]


# Abbreviations for weekdays "sön." (for "söndag" / "söner")
# are left out because they are ambiguous. The same is the case
# for abbreviations "jul." and "Jul." ("juli" / "jul").
for exc_data in [
    {ORTH: "jan.", LEMMA: "januari"},
    {ORTH: "febr.", LEMMA: "februari"},
    {ORTH: "feb.", LEMMA: "februari"},
    {ORTH: "apr.", LEMMA: "april"},
    {ORTH: "jun.", LEMMA: "juni"},
    {ORTH: "jul.", LEMMA: "juli"},
    {ORTH: "aug.", LEMMA: "augusti"},
    {ORTH: "sept.", LEMMA: "september"},
    {ORTH: "sep.", LEMMA: "september"},

@@ -44,13 +45,11 @@ for exc_data in [
    {ORTH: "tors.", LEMMA: "torsdag"},
    {ORTH: "fre.", LEMMA: "fredag"},
    {ORTH: "lör.", LEMMA: "lördag"},
    {ORTH: "sön.", LEMMA: "söndag"},
    {ORTH: "Jan.", LEMMA: "Januari"},
    {ORTH: "Febr.", LEMMA: "Februari"},
    {ORTH: "Feb.", LEMMA: "Februari"},
    {ORTH: "Apr.", LEMMA: "April"},
    {ORTH: "Jun.", LEMMA: "Juni"},
    {ORTH: "Jul.", LEMMA: "Juli"},
    {ORTH: "Aug.", LEMMA: "Augusti"},
    {ORTH: "Sept.", LEMMA: "September"},
    {ORTH: "Sep.", LEMMA: "September"},

@@ -63,25 +62,35 @@ for exc_data in [
    {ORTH: "Tors.", LEMMA: "Torsdag"},
    {ORTH: "Fre.", LEMMA: "Fredag"},
    {ORTH: "Lör.", LEMMA: "Lördag"},
    {ORTH: "Sön.", LEMMA: "Söndag"},
    {ORTH: "sthlm", LEMMA: "Stockholm"},
    {ORTH: "gbg", LEMMA: "Göteborg"}]:
    _exc[exc_data[ORTH]] = [exc_data]


# Specific case abbreviations only
for orth in ["AB", "Dr.", "H.M.", "H.K.H.", "m/s", "M/S", "Ph.d.", "S:t", "s:t"]:
    _exc[orth] = [{ORTH: orth}]


ABBREVIATIONS = [
    "ang", "anm", "bil", "bl.a", "d.v.s", "doc", "dvs", "e.d", "e.kr", "el",
    "eng", "etc", "exkl", "f", "f.d", "f.kr", "f.n", "f.ö", "fid", "fig",
    "forts", "fr.o.m", "förf", "inkl", "jur", "kap", "kl", "kor", "kr",
    "kungl", "lat", "m.a.o", "m.fl", "m.m", "max", "milj", "min", "mos",
    "mt", "o.d", "o.s.v", "obs", "osv", "p.g.a", "proc", "prof", "ref",
    "resp", "s.a.s", "s.k", "s.t", "sid", "s:t", "t.ex", "t.h", "t.o.m", "t.v",
    "tel", "ung", "vol", "äv", "övers"
    "ang", "anm", "bl.a", "d.v.s", "doc", "dvs", "e.d", "e.kr", "el.",
    "eng", "etc", "exkl", "ev", "f.", "f.d", "f.kr", "f.n", "f.ö", "fid", "fig",
    "forts", "fr.o.m", "förf", "inkl", "iofs", "jur.", "kap", "kl", "kor.", "kr",
    "kungl", "lat", "m.a.o", "m.fl", "m.m", "max", "milj", "min.", "mos",
    "mt", "mvh", "o.d", "o.s.v", "obs", "osv", "p.g.a", "proc", "prof", "ref",
    "resp", "s.a.s", "s.k", "s.t", "sid", "t.ex", "t.h", "t.o.m", "t.v",
    "tel", "ung.", "vol", "v.", "äv", "övers"
]
ABBREVIATIONS = [abbr + "." for abbr in ABBREVIATIONS] + ABBREVIATIONS

# Add a variant with trailing punctuation for each abbreviation. If the
# abbreviation already has trailing punctuation, skip it.
for abbr in ABBREVIATIONS:
    if not abbr.endswith("."):
        ABBREVIATIONS.append(abbr + ".")

for orth in ABBREVIATIONS:
    _exc[orth] = [{ORTH: orth}]
    capitalized = orth.capitalize()
    _exc[capitalized] = [{ORTH: capitalized}]

# Sentences ending in "i." (as in "... peka i."), "m." (as in "...än 2000 m."),
# should be tokenized as two separate tokens.
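For illustration (not part of the diff), the net effect of the exception handling above, mirroring the new tests: listed abbreviations and their capitalized variants stay single tokens, while the deliberately omitted ambiguous ones are split:

from spacy.lang.sv import Swedish

nlp = Swedish()
print(len(nlp("t.ex.")))  # expected: 1 -- listed abbreviation, kept whole
print(len(nlp("T.ex.")))  # expected: 1 -- capitalized variant added by the loop above
print(len(nlp("Jul.")))   # expected: 2 -- ambiguous ("juli" / "jul"), deliberately not an exception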

spacy/tests/lang/sv/test_exceptions.py (new file, 53 lines)

@@ -0,0 +1,53 @@
# coding: utf8
from __future__ import unicode_literals

import pytest


SV_TOKEN_EXCEPTION_TESTS = [
    ('Smörsåsen används bl.a. till fisk', ['Smörsåsen', 'används', 'bl.a.', 'till', 'fisk']),
    ('Jag kommer först kl. 13 p.g.a. diverse förseningar', ['Jag', 'kommer', 'först', 'kl.', '13', 'p.g.a.', 'diverse', 'förseningar']),
    ('Anders I. tycker om ord med i i.', ["Anders", "I.", "tycker", "om", "ord", "med", "i", "i", "."])
]


@pytest.mark.parametrize('text,expected_tokens', SV_TOKEN_EXCEPTION_TESTS)
def test_sv_tokenizer_handles_exception_cases(sv_tokenizer, text, expected_tokens):
    tokens = sv_tokenizer(text)
    token_list = [token.text for token in tokens if not token.is_space]
    assert expected_tokens == token_list


@pytest.mark.parametrize('text', ["driveru", "hajaru", "Serru", "Fixaru"])
def test_sv_tokenizer_handles_verb_exceptions(sv_tokenizer, text):
    tokens = sv_tokenizer(text)
    assert len(tokens) == 2
    assert tokens[1].text == "u"


@pytest.mark.parametrize('text',
                         ["bl.a", "m.a.o.", "Jan.", "Dec.", "kr.", "osv."])
def test_sv_tokenizer_handles_abbr(sv_tokenizer, text):
    tokens = sv_tokenizer(text)
    assert len(tokens) == 1


@pytest.mark.parametrize('text', ["Jul.", "jul.", "sön.", "Sön."])
def test_sv_tokenizer_handles_ambiguous_abbr(sv_tokenizer, text):
    tokens = sv_tokenizer(text)
    assert len(tokens) == 2


def test_sv_tokenizer_handles_exc_in_text(sv_tokenizer):
    text = "Det er bl.a. ikke meningen"
    tokens = sv_tokenizer(text)
    assert len(tokens) == 5
    assert tokens[2].text == "bl.a."


def test_sv_tokenizer_handles_custom_base_exc(sv_tokenizer):
    text = "Her er noget du kan kigge i."
    tokens = sv_tokenizer(text)
    assert len(tokens) == 8
    assert tokens[6].text == "i"
    assert tokens[7].text == "."

spacy/tests/lang/sv/test_lemmatizer.py (new file, 15 lines)

@@ -0,0 +1,15 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('string,lemma', [('DNA-profilernas', 'DNA-profil'),
                                          ('Elfenbenskustens', 'Elfenbenskusten'),
                                          ('abortmotståndarens', 'abortmotståndare'),
                                          ('kolesterols', 'kolesterol'),
                                          ('portionssnusernas', 'portionssnus'),
                                          ('åsyns', 'åsyn')])
def test_lemmatizer_lookup_assigns(sv_tokenizer, string, lemma):
    tokens = sv_tokenizer(string)
    assert tokens[0].lemma_ == lemma

spacy/tests/lang/sv/test_prefix_suffix_infix.py (new file, 37 lines)

@@ -0,0 +1,37 @@
# coding: utf-8
"""Test that tokenizer prefixes, suffixes and infixes are handled correctly."""
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('text', ["(under)"])
def test_tokenizer_splits_no_special(sv_tokenizer, text):
    tokens = sv_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["gitta'r", "Björn's", "Lars'"])
def test_tokenizer_handles_no_punct(sv_tokenizer, text):
    tokens = sv_tokenizer(text)
    assert len(tokens) == 1


@pytest.mark.parametrize('text', ["svart.Gul", "Hej.Världen"])
def test_tokenizer_splits_period_infix(sv_tokenizer, text):
    tokens = sv_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["Hej,Världen", "en,två"])
def test_tokenizer_splits_comma_infix(sv_tokenizer, text):
    tokens = sv_tokenizer(text)
    assert len(tokens) == 3
    assert tokens[0].text == text.split(",")[0]
    assert tokens[1].text == ","
    assert tokens[2].text == text.split(",")[1]


@pytest.mark.parametrize('text', ["svart...Gul", "svart...gul"])
def test_tokenizer_splits_ellipsis_infix(sv_tokenizer, text):
    tokens = sv_tokenizer(text)
    assert len(tokens) == 3

spacy/tests/lang/sv/test_text.py (new file, 21 lines)

@@ -0,0 +1,21 @@
# coding: utf-8
"""Test that longer and mixed texts are tokenized correctly."""

from __future__ import unicode_literals

import pytest


def test_sv_tokenizer_handles_long_text(sv_tokenizer):
    text = """Det var så härligt ute på landet. Det var sommar, majsen var gul, havren grön,
höet var uppställt i stackar nere vid den gröna ängen, och där gick storken på sina långa,
röda ben och snackade engelska, för det språket hade han lärt sig av sin mor.

Runt om åkrar och äng låg den stora skogen, och mitt i skogen fanns djupa sjöar; jo, det var verkligen trevligt ute på landet!"""
    tokens = sv_tokenizer(text)
    assert len(tokens) == 86


def test_sv_tokenizer_handles_trailing_dot_for_i_in_sentence(sv_tokenizer):
    text = "Provar att tokenisera en mening med ord i."
    tokens = sv_tokenizer(text)
    assert len(tokens) == 9

(deleted file, 25 lines; superseded by spacy/tests/lang/sv/test_exceptions.py above)

@@ -1,25 +0,0 @@
# coding: utf8
from __future__ import unicode_literals

import pytest


SV_TOKEN_EXCEPTION_TESTS = [
    ('Smörsåsen används bl.a. till fisk', ['Smörsåsen', 'används', 'bl.a.', 'till', 'fisk']),
    ('Jag kommer först kl. 13 p.g.a. diverse förseningar', ['Jag', 'kommer', 'först', 'kl.', '13', 'p.g.a.', 'diverse', 'förseningar']),
    ('Anders I. tycker om ord med i i.', ["Anders", "I.", "tycker", "om", "ord", "med", "i", "i", "."])
]


@pytest.mark.parametrize('text,expected_tokens', SV_TOKEN_EXCEPTION_TESTS)
def test_tokenizer_handles_exception_cases(sv_tokenizer, text, expected_tokens):
    tokens = sv_tokenizer(text)
    token_list = [token.text for token in tokens if not token.is_space]
    assert expected_tokens == token_list


@pytest.mark.parametrize('text', ["driveru", "hajaru", "Serru", "Fixaru"])
def test_tokenizer_handles_verb_exceptions(sv_tokenizer, text):
    tokens = sv_tokenizer(text)
    assert len(tokens) == 2
    assert tokens[1].text == "u"