Catalan Language Support (#2940)

* Catalan language support
* Adding Catalan to documentation

This commit is contained in:
parent 1844bc238a
commit 98fe1ab259
.github/contributors/mpuig.md (vendored, new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry      |
|--------------------------------|------------|
| Name                           | Marc Puig  |
| Company name (if applicable)   |            |
| Title or role (if applicable)  |            |
| Date                           | 2018-11-17 |
| GitHub username                | mpuig      |
| Website (optional)             |            |
spacy/lang/ca/__init__.py (new file, 64 lines)
@@ -0,0 +1,64 @@
# coding: utf8
from __future__ import unicode_literals

from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS

# uncomment if files are available
# from .norm_exceptions import NORM_EXCEPTIONS
# from .tag_map import TAG_MAP
# from .morph_rules import MORPH_RULES

# uncomment if lookup-based lemmatizer is available
from .lemmatizer import LOOKUP

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups

# Create a Language subclass
# Documentation: https://spacy.io/docs/usage/adding-languages

# This file should be placed in spacy/lang/ca (ISO code of the language).
# Before submitting a pull request, make sure to remove all comments from the
# language data files, and run at least the basic tokenizer tests. Simply add the
# language ID to the list of languages in spacy/tests/conftest.py to include it
# in the basic tokenizer sanity tests. You can optionally add a fixture for the
# language's tokenizer and add more specific tests. For more info, see the
# tests documentation: https://github.com/explosion/spaCy/tree/master/spacy/tests


class CatalanDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: 'ca'  # ISO code
    # add more norm exception dictionaries here
    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)

    # overwrite functions for lexical attributes
    lex_attr_getters.update(LEX_ATTRS)

    # add custom tokenizer exceptions to base exceptions
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)

    # add stop words
    stop_words = STOP_WORDS

    # if available: add tag map
    # tag_map = dict(TAG_MAP)

    # if available: add morph rules
    # morph_rules = dict(MORPH_RULES)

    lemma_lookup = LOOKUP


class Catalan(Language):
    lang = 'ca'  # ISO code
    Defaults = CatalanDefaults  # set Defaults to custom language defaults


# set default export – this allows the language class to be lazy-loaded
__all__ = ['Catalan']
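Taken together, the files in this commit are enough to stand up a bare Catalan pipeline. A minimal sketch of how the new class would be used, assuming a spaCy v2.x checkout that includes this commit (`Catalan()` gives a tokenizer-only pipeline; this driver code is illustrative, not part of the commit):

# coding: utf8
from __future__ import unicode_literals

from spacy.lang.ca import Catalan

nlp = Catalan()
doc = nlp("La Núria ha vingut aprox. a les 7 de la tarda.")
print([t.text for t in doc])                     # "aprox." is kept as one token
tok = [t for t in doc if t.text == "aprox."][0]
print(tok.lemma_)                                # "aproximadament", set by the tokenizer exception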
spacy/lang/ca/examples.py (new file, 22 lines)
@@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.ca.examples import sentences
>>> docs = nlp.pipe(sentences)
"""


sentences = [
    "Apple està buscant comprar una startup del Regne Unit per mil milions de dòlars",
    "Els cotxes autònoms deleguen la responsabilitat de l'assegurança als seus fabricants",
    "San Francisco analitza prohibir els robots de repartiment",
    "Londres és una gran ciutat del Regne Unit",
    "El gat menja peix",
    "Veig a l'home amb el telescopi",
    "L'Aranya menja mosques",
    "El pingüí incuba en el seu niu",
]
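The docstring above presupposes an existing `nlp` object; a short sketch of feeding the bundled examples through the pipeline (hypothetical driver code, not part of the commit):

# coding: utf8
from __future__ import unicode_literals

from spacy.lang.ca import Catalan
from spacy.lang.ca.examples import sentences

nlp = Catalan()
for doc in nlp.pipe(sentences):
    # tokenizer-only pipeline: each example comes back as a tokenized Doc
    print(len(doc), doc[0].text)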
spacy/lang/ca/lemmatizer.py (new file, 591,540 lines)
File diff suppressed because it is too large.
spacy/lang/ca/lex_attrs.py (new file, 43 lines)
@@ -0,0 +1,43 @@
# coding: utf8
from __future__ import unicode_literals

# import the symbols for the attrs you want to overwrite
from ...attrs import LIKE_NUM


# Overwriting functions for lexical attributes
# Documentation: https://spacy.io/docs/usage/adding-languages#lex-attrs
# Most of these functions, like is_lower or like_url, should be language-
# independent. Others, like like_num (which includes both digits and number
# words), require customisation.


# Example: check if token resembles a number

_num_words = ['zero', 'un', 'dos', 'tres', 'quatre', 'cinc', 'sis', 'set',
              'vuit', 'nou', 'deu', 'onze', 'dotze', 'tretze', 'catorze',
              'quinze', 'setze', 'disset', 'divuit', 'dinou', 'vint',
              'trenta', 'quaranta', 'cinquanta', 'seixanta', 'setanta', 'vuitanta', 'noranta',
              'cent', 'mil', 'milió', 'bilió', 'trilió', 'quatrilió',
              'gazilió', 'bazilió']


def like_num(text):
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    if text in _num_words:
        return True
    return False


# Create dictionary of functions to overwrite. The default lex_attr_getters are
# updated with this one, so only the functions defined here are overwritten.

LEX_ATTRS = {
    LIKE_NUM: like_num
}
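A few spot checks of `like_num`, mirroring the cases covered by `spacy/tests/lang/ca/test_text.py` further down; this is standalone illustration code, not part of the commit:

# coding: utf8
from __future__ import unicode_literals

from spacy.lang.ca.lex_attrs import like_num

assert like_num("10")        # plain digits
assert like_num("10,000")    # separators are stripped before the digit check
assert like_num("1/2")       # fractions: both sides must be digits
assert like_num("bilió")     # number words from _num_words
assert not like_num("gos")   # ordinary words are rejected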
spacy/lang/ca/stop_words.py (new file, 56 lines)
@@ -0,0 +1,56 @@
# encoding: utf8
from __future__ import unicode_literals


# Stop words

STOP_WORDS = set("""
a abans ací ah així això al aleshores algun alguna algunes alguns alhora allà allí allò
als altra altre altres amb ambdues ambdós anar ans apa aquell aquella aquelles aquells
aquest aquesta aquestes aquests aquí

baix bastant bé

cada cadascuna cadascunes cadascuns cadascú com consegueixo conseguim conseguir
consigueix consigueixen consigueixes contra

d'un d'una d'unes d'uns dalt de del dels des des de després dins dintre donat doncs durant

e eh el elles ells els em en encara ens entre era erem eren eres es esta estan estat
estava estaven estem esteu estic està estàvem estàveu et etc ets érem éreu és éssent

fa faig fan fas fem fer feu fi fins fora

gairebé

ha han has haver havia he hem heu hi ho

i igual iguals inclòs

ja jo

l'hi la les li li'n llarg llavors

m'he ma mal malgrat mateix mateixa mateixes mateixos me mentre meu meus meva
meves mode molt molta moltes molts mon mons més

n'he n'hi ne ni no nogensmenys només nosaltres nostra nostre nostres

o oh oi on

pas pel pels per per que perquè però poc poca pocs podem poden poder
podeu poques potser primer propi puc

qual quals quan quant que quelcom qui quin quina quines quins què

s'ha s'han sa sabem saben saber sabeu sap saps semblant semblants sense ser ses
seu seus seva seves si sobre sobretot soc solament sols som son sons sota sou sóc són

t'ha t'han t'he ta tal també tampoc tan tant tanta tantes te tene tenim tenir teniu
teu teus teva teves tinc ton tons tot tota totes tots

un una unes uns us últim ús

va vaig vam van vas veu vosaltres vostra vostre vostres

""".split())
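Since the set is built by splitting one big string on whitespace, every entry is a single token. A quick membership check (illustrative only, not part of the commit):

# coding: utf8
from __future__ import unicode_literals

from spacy.lang.ca.stop_words import STOP_WORDS

assert 'doncs' in STOP_WORDS      # function words are listed
assert 'perquè' in STOP_WORDS     # accented forms are kept as-is
assert 'gat' not in STOP_WORDS    # content words are not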
spacy/lang/ca/tag_map.py (new file, 36 lines)
@@ -0,0 +1,36 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
from ...symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ


# Add a tag map
# Documentation: https://spacy.io/docs/usage/adding-languages#tag-map
# Universal Dependencies: http://universaldependencies.org/u/pos/all.html
# The keys of the tag map should be strings in your tag set. The dictionary must
# have an entry POS whose value is one of the Universal Dependencies tags.
# Optionally, you can also include morphological features or other attributes.


TAG_MAP = {
    "ADV": {POS: ADV},
    "NOUN": {POS: NOUN},
    "ADP": {POS: ADP},
    "PRON": {POS: PRON},
    "SCONJ": {POS: SCONJ},
    "PROPN": {POS: PROPN},
    "DET": {POS: DET},
    "SYM": {POS: SYM},
    "INTJ": {POS: INTJ},
    "PUNCT": {POS: PUNCT},
    "NUM": {POS: NUM},
    "AUX": {POS: AUX},
    "X": {POS: X},
    "CONJ": {POS: CONJ},
    "CCONJ": {POS: CCONJ},
    "ADJ": {POS: ADJ},
    "VERB": {POS: VERB},
    "PART": {POS: PART},
    "SP": {POS: SPACE}
}
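Each entry just maps a tag string to a Universal Dependencies POS symbol. A one-line sanity check (illustrative only; note the tag map import stays commented out in `__init__.py` above, so nothing in the pipeline uses it yet):

# coding: utf8
from __future__ import unicode_literals

from spacy.symbols import POS, NOUN
from spacy.lang.ca.tag_map import TAG_MAP

assert TAG_MAP["NOUN"][POS] == NOUN   # every entry carries at least a POS key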
spacy/lang/ca/tokenizer_exceptions.py (new file, 51 lines)
@@ -0,0 +1,51 @@
# coding: utf8
from __future__ import unicode_literals

# import symbols – if you need to use more, add them here
from ...symbols import ORTH, LEMMA, TAG, NORM, ADP, DET


_exc = {}

for exc_data in [
    {ORTH: "aprox.", LEMMA: "aproximadament"},
    {ORTH: "pàg.", LEMMA: "pàgina"},
    {ORTH: "p.ex.", LEMMA: "per exemple"},
    {ORTH: "gen.", LEMMA: "gener"},
    {ORTH: "feb.", LEMMA: "febrer"},
    {ORTH: "abr.", LEMMA: "abril"},
    {ORTH: "jul.", LEMMA: "juliol"},
    {ORTH: "set.", LEMMA: "setembre"},
    {ORTH: "oct.", LEMMA: "octubre"},
    {ORTH: "nov.", LEMMA: "novembre"},
    {ORTH: "dec.", LEMMA: "desembre"},
    {ORTH: "Dr.", LEMMA: "doctor"},
    {ORTH: "Sr.", LEMMA: "senyor"},
    {ORTH: "Sra.", LEMMA: "senyora"},
    {ORTH: "Srta.", LEMMA: "senyoreta"},
    {ORTH: "núm", LEMMA: "número"},
    {ORTH: "St.", LEMMA: "sant"},
    {ORTH: "Sta.", LEMMA: "santa"}]:
    _exc[exc_data[ORTH]] = [exc_data]

# Times

_exc["12m."] = [
    {ORTH: "12"},
    {ORTH: "m.", LEMMA: "p.m."}]


for h in range(1, 12 + 1):
    for period in ["a.m.", "am"]:
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "a.m."}]
    for period in ["p.m.", "pm"]:
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "p.m."}]

# To keep things clean and readable, it's recommended to only declare the
# TOKENIZER_EXCEPTIONS at the bottom:

TOKENIZER_EXCEPTIONS = _exc
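These exceptions come into play once merged with `BASE_EXCEPTIONS` in `__init__.py`. A sketch using the same construction as the test fixture added to `conftest.py` below (illustrative driver code, not part of the commit):

# coding: utf8
from __future__ import unicode_literals

from spacy.util import get_lang_class

tokenizer = get_lang_class('ca').Defaults.create_tokenizer()
tokens = tokenizer("Han vingut aprox. a les 7pm.")
print([t.text for t in tokens])
# "aprox." survives as a single token; "7pm" splits into "7" + "pm" (LEMMA "p.m.")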
spacy/tests/conftest.py
@@ -14,7 +14,7 @@ from .. import util
 # These languages are used for generic tokenizer tests – only add a language
 # here if it's using spaCy's tokenizer (not a different library)
 # TODO: re-implement generic tokenizer tests
-_languages = ['bn', 'da', 'de', 'el', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
+_languages = ['bn', 'ca', 'da', 'de', 'el', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
               'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'ut', 'tt',
               'xx']
@@ -175,6 +175,10 @@ def ru_tokenizer():
     pymorphy = pytest.importorskip('pymorphy2')
     return util.get_lang_class('ru').Defaults.create_tokenizer()
 
+@pytest.fixture(scope='session')
+def ca_tokenizer():
+    return util.get_lang_class('ca').Defaults.create_tokenizer()
+
 
 @pytest.fixture
 def stringstore():
spacy/tests/lang/ca/__init__.py (new file, empty)
spacy/tests/lang/ca/test_exception.py (new file, 23 lines)
@@ -0,0 +1,23 @@
# coding: utf-8

from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('text,lemma', [("aprox.", "aproximadament"),
                                        ("pàg.", "pàgina"),
                                        ("p.ex.", "per exemple")
                                        ])
def test_ca_tokenizer_handles_abbr(ca_tokenizer, text, lemma):
    tokens = ca_tokenizer(text)
    assert len(tokens) == 1
    assert tokens[0].lemma_ == lemma


def test_ca_tokenizer_handles_exc_in_text(ca_tokenizer):
    text = "La Núria i el Pere han vingut aprox. a les 7 de la tarda."
    tokens = ca_tokenizer(text)
    assert len(tokens) == 15
    assert tokens[7].text == "aprox."
    assert tokens[7].lemma_ == "aproximadament"
spacy/tests/lang/ca/test_text.py (new file, 42 lines)
@@ -0,0 +1,42 @@
# coding: utf-8

"""Test that longer and mixed texts are tokenized correctly."""


from __future__ import unicode_literals

import pytest


def test_ca_tokenizer_handles_long_text(ca_tokenizer):
    text = """Una taula amb grans gerres de begudes i palles de coloraines com a reclam. Una carta
cridanera amb ofertes de tapes, paelles i sangria. Un cambrer amb un somriure que convida a
seure. La ubicació perfecta: el bell mig de la Rambla. Però és la una del migdia d’un dimecres
de tardor i no hi ha ningú assegut a la terrassa del local. El dia és rúfol, però no fa fred i
a la majoria de terrasses de la Rambla hi ha poca gent. La immensa majoria dels clients -tret
d’alguna excepció com al restaurant Núria- són turistes. I la immensa majoria tenen entre mans
una gerra de cervesa. Ens asseiem -fotògraf i periodista- en una terrassa buida."""

    tokens = ca_tokenizer(text)
    assert len(tokens) == 136


@pytest.mark.parametrize('text,length', [
    ("Perquè va anar-hi?", 6),
    ("“Ah no?”", 5),
    ("""Sí! "Anem", va contestar el Joan Carles""", 11),
    ("Van córrer aprox. 10km", 5),
    ("Llavors perqué...", 3)])
def test_ca_tokenizer_handles_cnts(ca_tokenizer, text, length):
    tokens = ca_tokenizer(text)
    assert len(tokens) == length


@pytest.mark.parametrize('text,match', [
    ('10', True), ('1', True), ('10,000', True), ('10,00', True),
    ('999.0', True), ('un', True), ('dos', True), ('bilió', True),
    ('gos', False), (',', False), ('1/2', True)])
def test_ca_lex_attrs_like_number(ca_tokenizer, text, match):
    tokens = ca_tokenizer(text)
    assert len(tokens) == 1
    assert tokens[0].like_num == match
(documentation language list; file path not shown in this view)
@@ -121,6 +121,7 @@
     "zh": "Chinese",
     "ja": "Japanese",
     "vi": "Vietnamese",
+    "ca": "Catalan",
     "xx": "Multi-language"
 },