Mirror of https://github.com/explosion/spaCy.git (synced 2024-12-24 00:46:28 +03:00)
Add Gujarati Language class with stop words (#5355)
* Add Gujarati Language class along with stop words
* Add test for gu tokenizer
parent 90c754024f
commit b2b7e1f37a
.github/contributors/punitvara.md (107 changes, vendored, normal file)
@@ -0,0 +1,107 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
      assignment is or becomes invalid, ineffective or unenforceable, you hereby
      grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
      royalty-free, unrestricted license to exercise all rights under those
      copyrights. This includes, at our option, the right to sublicense these same
      rights to third parties through multiple levels of sublicensees or other
      licensing arrangements;

    * you agree that each of us can do all things in relation to your
      contribution as if each of us were the sole owners, and if one of us makes
      a derivative work of your contribution, the one who makes the derivative
      work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
      against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
      exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
      consent of, pay or render an accounting to the other for any use or
      distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
      your contribution in whole or in part, alone or in combination with or
      included in any product, work or materials arising out of the project to
      which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
      multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
      authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
      third party's copyrights, trademarks, patents, or other intellectual
      property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
      other applicable export and import laws. You agree to notify us if you
      become aware of any circumstance which would make any of the foregoing
      representations inaccurate in any respect. We may publicly disclose your
      participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
      or entity, including my employer, has or will have rights with respect to my
      contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
      actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                    |
|------------------------------- | ------------------------ |
| Name                           | Punit Vara               |
| Company name (if applicable)   |                          |
| Title or role (if applicable)  |                          |
| Date                           | 2020-04-26               |
| GitHub username                | punitvara                |
| Website (optional)             | https://punitvara.com    |

spacy/lang/gu/__init__.py (18 changes, normal file)
@@ -0,0 +1,18 @@
# coding: utf8
from __future__ import unicode_literals

from .stop_words import STOP_WORDS

from ...language import Language


class GujaratiDefaults(Language.Defaults):
    stop_words = STOP_WORDS


class Gujarati(Language):
    lang = "gu"
    Defaults = GujaratiDefaults


__all__ = ["Gujarati"]

spacy/lang/gu/examples.py (22 changes, normal file)
@@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.gu.examples import sentences
>>> docs = nlp.pipe(sentences)
"""


sentences = [
    "લોકશાહી એ સરકારનું એક એવું તંત્ર છે જ્યાં નાગરિકો મત દ્વારા સત્તાનો ઉપયોગ કરે છે.",
    "તે ગુજરાત રાજ્યના ધરમપુર શહેરમાં આવેલું હતું",
    "કર્ણદેવ પહેલો સોલંકી વંશનો રાજા હતો",
    "તેજપાળને બે પત્ની હતી",
    "ગુજરાતમાં ભારતીય જનતા પક્ષનો ઉદય આ સમયગાળા દરમિયાન થયો",
    "આંદોલનકારીઓએ ચીમનભાઇ પટેલના રાજીનામાની માંગણી કરી.",
    "અહિયાં શું જોડાય છે?",
    "મંદિરનો પૂર્વાભિમુખ ભાગ નાના મંડપ સાથે થોડો લંબચોરસ આકારનો છે.",
]

spacy/lang/gu/stop_words.py (91 changes, normal file)
@@ -0,0 +1,91 @@
# coding: utf8
from __future__ import unicode_literals

STOP_WORDS = set(
    """
એમ
આ
એ
રહી
છે
છો
હતા
હતું
હતી
હોય
હતો
શકે
તે
તેના
તેનું
તેને
તેની
તેઓ
તેમને
તેમના
તેમણે
તેમનું
તેમાં
અને
અહીં
થી
થઈ
થાય
જે
ને
કે
ના
ની
નો
ને
નું
શું
માં
પણ
પર
જેવા
જેવું
જાય
જેમ
જેથી
માત્ર
માટે
પરથી
આવ્યું
એવી
આવી
રીતે
સુધી
થાય
થઈ
સાથે
લાગે
હોવા
છતાં
રહેલા
કરી
કરે
કેટલા
કોઈ
કેમ
કર્યો
કર્યુ
કરે
સૌથી
ત્યારબાદ
તથા
દ્વારા
જુઓ
જાઓ
જ્યારે
ત્યારે
શકો
નથી
હવે
અથવા
થતો
દર
એટલો
પરંતુ
""".split()
)

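The stop-word file builds its set by splitting a triple-quoted string on whitespace. The stdlib-only sketch below shows that pattern on a small subset of the words above. `STOP_WORDS_SAMPLE` and `remove_stop_words` are hypothetical names introduced for illustration and are not part of the commit, and the filtering uses plain `str.split` instead of spaCy's tokenizer.

```python
# Illustrative subset of the stop words above; "ને" is repeated on purpose,
# mirroring a repeat in the original list.
STOP_WORDS_SAMPLE = set(
    """
છે
હતો
ને
અને
ને
""".split()
)


def remove_stop_words(text, stop_words=STOP_WORDS_SAMPLE):
    """Drop stop words from a whitespace-split sentence (hypothetical helper)."""
    return [word for word in text.split() if word not in stop_words]


# The set keeps one copy of each distinct word, so the repeat collapses.
print(len(STOP_WORDS_SAMPLE))

# One of the example sentences from examples.py, filtered with the sample set.
print(remove_stop_words("કર્ણદેવ પહેલો સોલંકી વંશનો રાજા હતો"))
```

Because the container is a set, repeated entries in the word list are harmless: they are stored once and do not affect membership checks.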
@@ -103,6 +103,10 @@ def ga_tokenizer():
    return get_lang_class("ga").Defaults.create_tokenizer()


@pytest.fixture(scope="session")
def gu_tokenizer():
    return get_lang_class("gu").Defaults.create_tokenizer()

@pytest.fixture(scope="session")
def he_tokenizer():
    return get_lang_class("he").Defaults.create_tokenizer()

spacy/tests/lang/gu/test_text.py (20 changes, normal file)
@@ -0,0 +1,20 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest

def test_gu_tokenizer_handles_long_text(gu_tokenizer):
    text = """પશ્ચિમ ભારતમાં આવેલું ગુજરાત રાજ્ય જે વ્યક્તિઓની માતૃભૂમિ છે"""
    tokens = gu_tokenizer(text)
    assert len(tokens) == 9

@pytest.mark.parametrize(
    "text,length",
    [
        ("ગુજરાતીઓ ખાવાના શોખીન માનવામાં આવે છે", 6),
        ("ખેતરની ખેડ કરવામાં આવે છે.", 5),
    ],
)
def test_gu_tokenizer_handles_cnts(gu_tokenizer, text, length):
    tokens = gu_tokenizer(text)
    assert len(tokens) == length

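The expected lengths in these tests happen to match plain whitespace splitting, which the stdlib-only check below illustrates. It is a rough stand-in and not spaCy's tokenizer: `whitespace_token_count` is a hypothetical helper, and the match with the expected value of 5 for the sentence ending in a full stop only suggests that the tests expect the trailing full stop to stay attached to the last word.

```python
# Sentences and expected lengths copied from the tests above.
CASES = [
    ("પશ્ચિમ ભારતમાં આવેલું ગુજરાત રાજ્ય જે વ્યક્તિઓની માતૃભૂમિ છે", 9),
    ("ગુજરાતીઓ ખાવાના શોખીન માનવામાં આવે છે", 6),
    ("ખેતરની ખેડ કરવામાં આવે છે.", 5),
]


def whitespace_token_count(text):
    """Count whitespace-separated chunks (a stand-in, not spaCy's tokenizer)."""
    return len(text.split())


# Each expected length equals the number of whitespace-separated chunks.
for text, expected in CASES:
    assert whitespace_token_count(text) == expected
```

A check like this can help when choosing expected counts for a new language's first tokenizer tests, but the real assertions should still run against the `gu_tokenizer` fixture.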