mirror of
https://github.com/explosion/spaCy.git
Add Urdu Language Support (#2430)
* added Urdu language support.
* added Urdu language tests.
* modified conftest.py for Urdu language support.
* added spacy contributor agreement.
parent 14d9007efd
commit f33c703066
106
.github/contributors/mirfan899.md
vendored
Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Muhammad Irfan       |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  | AI & ML Developer    |
| Date                           | 2018-09-06           |
| GitHub username                | mirfan899            |
| Website (optional)             |                      |
30
spacy/lang/ur/__init__.py
Normal file
@@ -0,0 +1,30 @@
# coding: utf8
from __future__ import unicode_literals

from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from ..tag_map import TAG_MAP

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc


class UrduDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: 'ur'

    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    tag_map = TAG_MAP
    stop_words = STOP_WORDS


class Urdu(Language):
    lang = 'ur'
    Defaults = UrduDefaults


__all__ = ['Urdu']
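The `UrduDefaults` class copies `lex_attr_getters` with `dict(...)` before updating it. A minimal sketch of why that copy matters follows; `BaseDefaults`, `UrduLikeDefaults`, and the string keys are hypothetical stand-ins for spaCy's `Language.Defaults` and its attribute IDs, not the library's real classes.

```python
# Hypothetical stand-ins for spaCy's attribute IDs (the real ones are integer symbols).
LANG, LIKE_NUM = "lang", "like_num"


class BaseDefaults(object):
    # Shared getters, playing the role of Language.Defaults.lex_attr_getters.
    lex_attr_getters = {
        LANG: lambda text: "xx",
        LIKE_NUM: lambda text: text.isdigit(),
    }


class UrduLikeDefaults(BaseDefaults):
    # dict(...) makes a shallow copy, so the override below stays local to
    # this subclass and does not leak into BaseDefaults.
    lex_attr_getters = dict(BaseDefaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: "ur"


base_lang = BaseDefaults.lex_attr_getters[LANG]("دو")
urdu_lang = UrduLikeDefaults.lex_attr_getters[LANG]("دو")
print(base_lang, urdu_lang)
```

Without the copy, assigning `lex_attr_getters[LANG]` in the subclass body would mutate the dictionary shared with the base class and change the language code reported by every other language that inherits it.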
16
spacy/lang/ur/examples.py
Normal file
@@ -0,0 +1,16 @@
# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.ur.examples import sentences
>>> docs = nlp.pipe(sentences)
"""


sentences = [
    "اردو ہے جس کا نام ہم جانتے ہیں داغ",
    "سارے جہاں میں دھوم ہماری زباں کی ہے",
]
29113
spacy/lang/ur/lemmatizer.py
Normal file
File diff suppressed because it is too large
47
spacy/lang/ur/lex_attrs.py
Normal file
@@ -0,0 +1,47 @@
# coding: utf8
from __future__ import unicode_literals

from ...attrs import LIKE_NUM

# Source https://quizlet.com/4271889/1-100-urdu-number-wordsurdu-numerals-flash-cards/
# http://www.urduword.com/lessons.php?lesson=numbers
# https://en.wikibooks.org/wiki/Urdu/Vocabulary/Numbers
# https://www.urdu-english.com/lessons/beginner/numbers

_num_words = """ایک دو تین چار پانچ چھ سات آٹھ نو دس گیارہ بارہ تیرہ چودہ پندرہ سولہ سترہ
اٹهارا انیس بیس اکیس بائیس تئیس چوبیس پچیس چھببیس
ستایس اٹھائس انتيس تیس اکتیس بتیس تینتیس چونتیس پینتیس
چھتیس سینتیس ارتیس انتالیس چالیس اکتالیس بیالیس تیتالیس
چوالیس پیتالیس چھیالیس سینتالیس اڑتالیس انچالیس پچاس اکاون باون
تریپن چون پچپن چھپن ستاون اٹھاون انسٹھ ساثھ
اکسٹھ باسٹھ تریسٹھ چوسٹھ پیسٹھ چھیاسٹھ سڑسٹھ اڑسٹھ
انھتر ستر اکھتر بھتتر تیھتر چوھتر تچھتر چھیتر ستتر
اٹھتر انیاسی اسی اکیاسی بیاسی تیراسی چوراسی پچیاسی چھیاسی
سٹیاسی اٹھیاسی نواسی نوے اکانوے بانوے ترانوے
چورانوے پچانوے چھیانوے ستانوے اٹھانوے ننانوے سو
""".split()

# source https://www.google.com/intl/ur/inputtools/try/

_ordinal_words = """پہلا دوسرا تیسرا چوتھا پانچواں چھٹا ساتواں آٹھواں نواں دسواں گیارہواں بارہواں تیرھواں چودھواں
پندرھواں سولہواں سترھواں اٹھارواں انیسواں بسیواں
""".split()


def like_num(text):
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    if text in _num_words:
        return True
    if text in _ordinal_words:
        return True
    return False

LEX_ATTRS = {
    LIKE_NUM: like_num
}
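To make the behaviour of `like_num` concrete, here is a standalone sketch that repeats its logic with only a handful of the number words from the lists above. The trimmed word lists and the sample inputs are chosen for illustration and are not part of the commit.

```python
# Trimmed copies of the word lists; the real module holds the full lists.
_num_words = "ایک دو تین".split()
_ordinal_words = "پہلا دوسرا".split()


def like_num(text):
    # Drop thousands separators and decimal points before checking digits.
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    # Accept simple fractions such as 3/4.
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    # Fall back to the spelled-out cardinal and ordinal words.
    return text in _num_words or text in _ordinal_words


samples = ["10,000", "3/4", "دو", "دوسرا", "۱۲", "کتاب"]
results = {text: like_num(text) for text in samples}
for text in samples:
    print(text, results[text])
```

In Python 3, `str.isdigit()` is Unicode-aware, so a token written in Extended Arabic-Indic digits such as `۱۲` is also treated as number-like, while an ordinary word such as `کتاب` is not.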
515
spacy/lang/ur/stop_words.py
Normal file
|
@@ -0,0 +1,515 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
# Source: collected from different resources on the internet
|
||||
|
||||
STOP_WORDS = set("""
|
||||
ثھی
|
||||
خو
|
||||
گی
|
||||
اپٌے
|
||||
گئے
|
||||
ثہت
|
||||
طرف
|
||||
ہوبری
|
||||
پبئے
|
||||
اپٌب
|
||||
دوضری
|
||||
گیب
|
||||
کت
|
||||
گب
|
||||
ثھی
|
||||
ضے
|
||||
ہر
|
||||
پر
|
||||
اش
|
||||
دی
|
||||
گے
|
||||
لگیں
|
||||
ہے
|
||||
ثعذ
|
||||
ضکتے
|
||||
تھی
|
||||
اى
|
||||
دیب
|
||||
لئے
|
||||
والے
|
||||
یہ
|
||||
ثدبئے
|
||||
ضکتی
|
||||
تھب
|
||||
اًذر
|
||||
رریعے
|
||||
لگی
|
||||
ہوبرا
|
||||
ہوًے
|
||||
ثبہر
|
||||
ضکتب
|
||||
ًہیں
|
||||
تو
|
||||
اور
|
||||
رہب
|
||||
لگے
|
||||
ہوضکتب
|
||||
ہوں
|
||||
کب
|
||||
ہوبرے
|
||||
توبم
|
||||
کیب
|
||||
ایطے
|
||||
رہی
|
||||
هگر
|
||||
ہوضکتی
|
||||
ہیں
|
||||
کریں
|
||||
ہو
|
||||
تک
|
||||
کی
|
||||
ایک
|
||||
رہے
|
||||
هیں
|
||||
ہوضکتے
|
||||
کیطے
|
||||
ہوًب
|
||||
تت
|
||||
کہ
|
||||
ہوا
|
||||
آئے
|
||||
ضبت
|
||||
تھے
|
||||
کیوں
|
||||
ہو
|
||||
تب
|
||||
کے
|
||||
پھر
|
||||
ثغیر
|
||||
خبر
|
||||
ہے
|
||||
رکھ
|
||||
کی
|
||||
طب
|
||||
کوئی
|
||||
رریعے
|
||||
ثبرے
|
||||
خب
|
||||
اضطرذ
|
||||
ثلکہ
|
||||
خجکہ
|
||||
رکھ
|
||||
تب
|
||||
کی
|
||||
طرف
|
||||
ثراں
|
||||
خبر
|
||||
رریعہ
|
||||
اضکب
|
||||
ثٌذ
|
||||
خص
|
||||
کی
|
||||
لئے
|
||||
توہیں
|
||||
دوضرے
|
||||
کررہی
|
||||
اضکی
|
||||
ثیچ
|
||||
خوکہ
|
||||
رکھتی
|
||||
کیوًکہ
|
||||
دوًوں
|
||||
کر
|
||||
رہے
|
||||
خبر
|
||||
ہی
|
||||
ثرآں
|
||||
اضکے
|
||||
پچھلا
|
||||
خیطب
|
||||
رکھتے
|
||||
کے
|
||||
ثعذ
|
||||
تو
|
||||
ہی
|
||||
دورى
|
||||
کر
|
||||
یہبں
|
||||
آش
|
||||
تھوڑا
|
||||
چکے
|
||||
زکویہ
|
||||
دوضروں
|
||||
ضکب
|
||||
اوًچب
|
||||
ثٌب
|
||||
پل
|
||||
تھوڑی
|
||||
چلا
|
||||
خبهوظ
|
||||
دیتب
|
||||
ضکٌب
|
||||
اخبزت
|
||||
اوًچبئی
|
||||
ثٌبرہب
|
||||
پوچھب
|
||||
تھوڑے
|
||||
چلو
|
||||
ختن
|
||||
دیتی
|
||||
ضکی
|
||||
اچھب
|
||||
اوًچی
|
||||
ثٌبرہی
|
||||
پوچھتب
|
||||
تیي
|
||||
چلیں
|
||||
در
|
||||
دیتے
|
||||
ضکے
|
||||
اچھی
|
||||
اوًچے
|
||||
ثٌبرہے
|
||||
پوچھتی
|
||||
خبًب
|
||||
چلے
|
||||
درخبت
|
||||
دیر
|
||||
ضلطلہ
|
||||
اچھے
|
||||
اٹھبًب
|
||||
ثٌبًب
|
||||
پوچھتے
|
||||
خبًتب
|
||||
چھوٹب
|
||||
درخہ
|
||||
دیکھٌب
|
||||
ضوچ
|
||||
اختتبم
|
||||
اہن
|
||||
ثٌذ
|
||||
پوچھٌب
|
||||
خبًتی
|
||||
چھوٹوں
|
||||
درخے
|
||||
دیکھو
|
||||
ضوچب
|
||||
ادھر
|
||||
آئی
|
||||
ثٌذکرًب
|
||||
پوچھو
|
||||
خبًتے
|
||||
چھوٹی
|
||||
درزقیقت
|
||||
دیکھی
|
||||
ضوچتب
|
||||
ارد
|
||||
آئے
|
||||
ثٌذکرو
|
||||
پوچھوں
|
||||
خبًٌب
|
||||
چھوٹے
|
||||
درضت
|
||||
دیکھیں
|
||||
ضوچتی
|
||||
اردگرد
|
||||
آج
|
||||
ثٌذی
|
||||
پوچھیں
|
||||
خططرذ
|
||||
چھہ
|
||||
دش
|
||||
دیٌب
|
||||
ضوچتے
|
||||
ارکبى
|
||||
آخر
|
||||
ثڑا
|
||||
پورا
|
||||
خگہ
|
||||
چیسیں
|
||||
دفعہ
|
||||
دے
|
||||
ضوچٌب
|
||||
اضتعوبل
|
||||
آخر
|
||||
پہلا
|
||||
خگہوں
|
||||
زبصل
|
||||
دکھبئیں
|
||||
راضتوں
|
||||
ضوچو
|
||||
اضتعوبلات
|
||||
آدهی
|
||||
ثڑی
|
||||
پہلی
|
||||
خگہیں
|
||||
زبضر
|
||||
دکھبتب
|
||||
راضتہ
|
||||
ضوچی
|
||||
اغیب
|
||||
آًب
|
||||
ثڑے
|
||||
پہلےضی
|
||||
خلذی
|
||||
زبل
|
||||
دکھبتی
|
||||
راضتے
|
||||
ضوچیں
|
||||
اطراف
|
||||
آٹھ
|
||||
ثھر
|
||||
خٌبة
|
||||
زبل
|
||||
دکھبتے
|
||||
رکي
|
||||
ضیذھب
|
||||
افراد
|
||||
آیب
|
||||
ثھرا
|
||||
پہلے
|
||||
خواى
|
||||
زبلات
|
||||
دکھبًب
|
||||
رکھب
|
||||
ضیذھی
|
||||
اکثر
|
||||
ثب
|
||||
ہوا
|
||||
پیع
|
||||
خوًہی
|
||||
زبلیہ
|
||||
دکھبو
|
||||
رکھی
|
||||
ضیذھے
|
||||
اکٹھب
|
||||
ثھرپور
|
||||
تبزٍ
|
||||
خیطبکہ
|
||||
زصوں
|
||||
رکھے
|
||||
ضیکٌڈ
|
||||
اکٹھی
|
||||
ثبری
|
||||
ثہتر
|
||||
تر
|
||||
چبر
|
||||
زصہ
|
||||
دلچطپ
|
||||
زیبدٍ
|
||||
غبیذ
|
||||
اکٹھے
|
||||
ثبلا
|
||||
ثہتری
|
||||
ترتیت
|
||||
چبہب
|
||||
زصے
|
||||
دلچطپی
|
||||
ضبت
|
||||
غخص
|
||||
اکیلا
|
||||
ثبلترتیت
|
||||
ثہتریي
|
||||
تریي
|
||||
چبہٌب
|
||||
زقبئق
|
||||
دلچطپیبں
|
||||
ضبدٍ
|
||||
غذ
|
||||
اکیلی
|
||||
ثرش
|
||||
پبش
|
||||
تعذاد
|
||||
چبہے
|
||||
زقیتیں
|
||||
هٌبضت
|
||||
ضبرا
|
||||
غروع
|
||||
اکیلے
|
||||
ثغیر
|
||||
پبًب
|
||||
چکب
|
||||
زقیقت
|
||||
دو
|
||||
ضبرے
|
||||
غروعبت
|
||||
اگرچہ
|
||||
ثلٌذ
|
||||
پبًچ
|
||||
تن
|
||||
چکی
|
||||
زکن
|
||||
دور
|
||||
ضبل
|
||||
غے
|
||||
الگ
|
||||
پراًب
|
||||
تٌہب
|
||||
چکیں
|
||||
دوضرا
|
||||
ضبلوں
|
||||
صبف
|
||||
صسیر
|
||||
قجیلہ
|
||||
کوًطے
|
||||
لازهی
|
||||
هطئلے
|
||||
ًیب
|
||||
طریق
|
||||
کرتی
|
||||
کہتے
|
||||
صفر
|
||||
قطن
|
||||
کھولا
|
||||
لگتب
|
||||
هطبئل
|
||||
وار
|
||||
طریقوں
|
||||
کرتے
|
||||
کہٌب
|
||||
صورت
|
||||
کئی
|
||||
کھولٌب
|
||||
لگتی
|
||||
هطتعول
|
||||
وار
|
||||
طریقہ
|
||||
کرتے
|
||||
ہو
|
||||
کہٌب
|
||||
صورتسبل
|
||||
کئے
|
||||
کھولو
|
||||
لگتے
|
||||
هػتول
|
||||
ٹھیک
|
||||
طریقے
|
||||
کرًب
|
||||
کہو
|
||||
صورتوں
|
||||
کبفی
|
||||
هطلق
|
||||
ڈھوًڈا
|
||||
طور
|
||||
کرو
|
||||
کہوں
|
||||
صورتیں
|
||||
کبم
|
||||
کھولیں
|
||||
لگی
|
||||
هعلوم
|
||||
ڈھوًڈلیب
|
||||
طورپر
|
||||
کریں
|
||||
کہی
|
||||
ضرور
|
||||
کجھی
|
||||
کھولے
|
||||
لگے
|
||||
هکول
|
||||
ڈھوًڈًب
|
||||
ظبہر
|
||||
کرے
|
||||
کہیں
|
||||
ضرورت
|
||||
کرا
|
||||
کہب
|
||||
لوجب
|
||||
هلا
|
||||
ڈھوًڈو
|
||||
عذد
|
||||
کل
|
||||
کہیں
|
||||
کرتب
|
||||
کہتب
|
||||
لوجی
|
||||
هوکي
|
||||
ڈھوًڈی
|
||||
عظین
|
||||
کن
|
||||
کہے
|
||||
ضروری
|
||||
کرتبہوں
|
||||
کہتی
|
||||
لوجے
|
||||
هوکٌبت
|
||||
ڈھوًڈیں
|
||||
علاقوں
|
||||
کوتر
|
||||
کیے
|
||||
لوسبت
|
||||
هوکٌہ
|
||||
ہن
|
||||
لے
|
||||
ًبپطٌذ
|
||||
ہورہے
|
||||
علاقہ
|
||||
کورا
|
||||
کے
|
||||
رریعے
|
||||
لوسہ
|
||||
هڑا
|
||||
ہوئی
|
||||
هتعلق
|
||||
ًبگسیر
|
||||
ہوگئی
|
||||
علاقے
|
||||
کوروں
|
||||
گئی
|
||||
لو
|
||||
هڑًب
|
||||
ہوئے
|
||||
هسترم
|
||||
ًطجت
|
||||
ہو
|
||||
گئے
|
||||
علاوٍ
|
||||
کورٍ
|
||||
گرد
|
||||
لوگ
|
||||
هڑے
|
||||
ہوتی
|
||||
هسترهہ
|
||||
ًقطہ
|
||||
ہوگیب
|
||||
کورے
|
||||
گروپ
|
||||
لوگوں
|
||||
هہرثبى
|
||||
ہوتے
|
||||
هسطوش
|
||||
ًکبلٌب
|
||||
ہوًی
|
||||
عووهی
|
||||
کوطي
|
||||
گروٍ
|
||||
لڑکپي
|
||||
هیرا
|
||||
ہوچکب
|
||||
هختلف
|
||||
ًکتہ
|
||||
ہی
|
||||
فرد
|
||||
کوى
|
||||
گروہوں
|
||||
لی
|
||||
هیری
|
||||
ہوچکی
|
||||
هسیذ
|
||||
فی
|
||||
کوًطب
|
||||
گٌتی
|
||||
لیب
|
||||
هیرے
|
||||
ہوچکے
|
||||
هطئلہ
|
||||
ًوخواى
|
||||
یقیٌی
|
||||
قجل
|
||||
کوًطی
|
||||
لیٌب
|
||||
ًئی
|
||||
ہورہب
|
||||
لیں
|
||||
ًئے
|
||||
ہورہی
|
||||
ثبعث
|
||||
ضت
|
||||
""".split())
|
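The stop-word list is built with `set("""...""".split())`, so entries that appear more than once in the text (for example `کی` and `ہے` above) collapse into a single member. A small sketch with a hypothetical three-word sample, not the full list:

```python
# A tiny sample in the same shape as STOP_WORDS; repeated lines are intentional.
STOP_WORDS_SAMPLE = set("""
ہے
کی
اور
کی
ہے
""".split())

# Filter stop words out of a whitespace-split sentence (illustrative input).
tokens = "اردو ہماری زباں ہے اور".split()
content_tokens = [t for t in tokens if t not in STOP_WORDS_SAMPLE]

print(len(STOP_WORDS_SAMPLE))
print(content_tokens)
```

Because membership tests on a set are constant-time on average, duplicates in the source text cost nothing at lookup time; they only make the file longer than the resulting set.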
65
spacy/lang/ur/tag_map.py
Normal file
@@ -0,0 +1,65 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON

TAG_MAP = {
    ".": {POS: PUNCT, "PunctType": "peri"},
    ",": {POS: PUNCT, "PunctType": "comm"},
    "-LRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"},
    "-RRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"},
    "``": {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"},
    "\"\"": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
    "''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
    ":": {POS: PUNCT},
    "$": {POS: SYM, "Other": {"SymType": "currency"}},
    "#": {POS: SYM, "Other": {"SymType": "numbersign"}},
    "AFX": {POS: ADJ, "Hyph": "yes"},
    "CC": {POS: CCONJ, "ConjType": "coor"},
    "CD": {POS: NUM, "NumType": "card"},
    "DT": {POS: DET},
    "EX": {POS: ADV, "AdvType": "ex"},
    "FW": {POS: X, "Foreign": "yes"},
    "HYPH": {POS: PUNCT, "PunctType": "dash"},
    "IN": {POS: ADP},
    "JJ": {POS: ADJ, "Degree": "pos"},
    "JJR": {POS: ADJ, "Degree": "comp"},
    "JJS": {POS: ADJ, "Degree": "sup"},
    "LS": {POS: PUNCT, "NumType": "ord"},
    "MD": {POS: VERB, "VerbType": "mod"},
    "NIL": {POS: ""},
    "NN": {POS: NOUN, "Number": "sing"},
    "NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
    "NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
    "NNS": {POS: NOUN, "Number": "plur"},
    "PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
    "POS": {POS: PART, "Poss": "yes"},
    "PRP": {POS: PRON, "PronType": "prs"},
    "PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
    "RB": {POS: ADV, "Degree": "pos"},
    "RBR": {POS: ADV, "Degree": "comp"},
    "RBS": {POS: ADV, "Degree": "sup"},
    "RP": {POS: PART},
    "SP": {POS: SPACE},
    "SYM": {POS: SYM},
    "TO": {POS: PART, "PartType": "inf", "VerbForm": "inf"},
    "UH": {POS: INTJ},
    "VB": {POS: VERB, "VerbForm": "inf"},
    "VBD": {POS: VERB, "VerbForm": "fin", "Tense": "past"},
    "VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
    "VBN": {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"},
    "VBP": {POS: VERB, "VerbForm": "fin", "Tense": "pres"},
    "VBZ": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Number": "sing", "Person": 3},
    "WDT": {POS: ADJ, "PronType": "int|rel"},
    "WP": {POS: NOUN, "PronType": "int|rel"},
    "WP$": {POS: ADJ, "Poss": "yes", "PronType": "int|rel"},
    "WRB": {POS: ADV, "PronType": "int|rel"},
    "ADD": {POS: X},
    "NFP": {POS: PUNCT},
    "GW": {POS: X},
    "XX": {POS: X},
    "BES": {POS: VERB},
    "HVS": {POS: VERB},
    "_SP": {POS: SPACE},
}
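The tag map pairs each fine-grained tag with a coarse part of speech plus optional morphological features. The sketch below shows one way such a table could be queried; it uses plain strings in place of spaCy's symbol IDs, and the `coarse_pos` helper and its `"X"` fallback are assumptions made for illustration, not spaCy API.

```python
POS = "pos"  # stand-in for spacy.symbols.POS

# Three entries in the same shape as TAG_MAP above, with string values.
TAG_MAP_SAMPLE = {
    "NN": {POS: "NOUN", "Number": "sing"},
    "VBD": {POS: "VERB", "VerbForm": "fin", "Tense": "past"},
    "CD": {POS: "NUM", "NumType": "card"},
}


def coarse_pos(tag, tag_map, default="X"):
    """Return the coarse POS for a fine-grained tag, or `default` if unmapped."""
    return tag_map.get(tag, {}).get(POS, default)


tags = [coarse_pos(t, TAG_MAP_SAMPLE) for t in ["NN", "VBD", "CD", "ZZZ"]]
print(tags)
```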
22
spacy/lang/ur/tokenizer_exceptions.py
Normal file
@@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals

# import symbols – if you need to use more, add them here
from ...symbols import ORTH, LEMMA, TAG, NORM, ADP, DET

# Add tokenizer exceptions
# Documentation: https://spacy.io/docs/usage/adding-languages#tokenizer-exceptions
# Feel free to use custom logic to generate repetitive exceptions more efficiently.
# If an exception is split into more than one token, the ORTH values combined always
# need to match the original string.

# Exceptions should be added in the following format:

_exc = {

}

# To keep things clean and readable, it's recommended to only declare the
# TOKENIZER_EXCEPTIONS at the bottom:

TOKENIZER_EXCEPTIONS = _exc
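The comments in this file refer to an exception format, but `_exc` is left empty in the commit. As a rough sketch of that format, the example below uses an English contraction and string keys standing in for `spacy.symbols.ORTH` and `LEMMA`; the `orth_matches` helper is hypothetical and only checks the rule stated in the comments, namely that the combined `ORTH` values must equal the original string.

```python
ORTH, LEMMA = "orth", "lemma"  # stand-ins for spacy.symbols.ORTH / LEMMA

# Each key is the exact string to match; the value lists the tokens it splits into.
_exc = {
    "don't": [{ORTH: "do", LEMMA: "do"}, {ORTH: "n't", LEMMA: "not"}],
}


def orth_matches(exceptions):
    """Map each exception string to whether its ORTH pieces rebuild it exactly."""
    return {
        string: "".join(token[ORTH] for token in tokens) == string
        for string, tokens in exceptions.items()
    }


check = orth_matches(_exc)
broken = orth_matches({"ab": [{ORTH: "a"}]})
print(check, broken)
```

A check like this can catch entries whose pieces drop or add characters before they are merged into `TOKENIZER_EXCEPTIONS`.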
@@ -15,7 +15,7 @@ from .. import util
 # here if it's using spaCy's tokenizer (not a different library)
 # TODO: re-implement generic tokenizer tests
 _languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
-              'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'tt',
+              'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'ur', 'tt',
               'xx']

 _models = {'en': ['en_core_web_sm'],
@@ -162,6 +162,10 @@ def tt_tokenizer():
 def ar_tokenizer():
     return util.get_lang_class('ar').Defaults.create_tokenizer()

+@pytest.fixture
+def ur_tokenizer():
+    return util.get_lang_class('ur').Defaults.create_tokenizer()
+
 @pytest.fixture
 def ru_tokenizer():
     pymorphy = pytest.importorskip('pymorphy2')
0
spacy/tests/lang/ur/__init__.py
Normal file
26
spacy/tests/lang/ur/test_text.py
Normal file
@@ -0,0 +1,26 @@
# coding: utf-8

"""Test that longer and mixed texts are tokenized correctly."""


from __future__ import unicode_literals

import pytest


def test_tokenizer_handles_long_text(ur_tokenizer):
    text = """اصل میں رسوا ہونے کی ہمیں
کچھ عادت سی ہو گئی ہے اس لئے جگ ہنسائی کا ذکر نہیں کرتا،ہوا کچھ یوں کہ عرصہ چھ سال بعد ہمیں بھی خیال آیا
کہ ایک عدد ٹیلی ویژن ہی کیوں نہ خرید لیں ، سوچا ورلڈ کپ ہی دیکھیں گے۔اپنے پاکستان کے کھلاڑیوں کو دیکھ کر
ورلڈ کپ دیکھنے کا حوصلہ ہی نہ رہا تو اب یوں ہی ادھر اُدھر کے چینل گھمانے لگ پڑتے ہیں۔"""

    tokens = ur_tokenizer(text)
    assert len(tokens) == 77


@pytest.mark.parametrize('text,length', [
    ("تحریر باسط حبیب", 3),
    ("میرا پاکستان", 2)])
def test_tokenizer_handles_cnts(ur_tokenizer, text, length):
    tokens = ur_tokenizer(text)
    assert len(tokens) == length
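The two parametrized cases contain only whitespace boundaries, so even a naive splitter reproduces their expected counts. The sketch below illustrates this with a hypothetical `whitespace_tokenizer` stand-in; it is not spaCy's tokenizer and makes no claim about the 77-token long-text case, where punctuation is attached to words.

```python
def whitespace_tokenizer(text):
    """Hypothetical stand-in: split on runs of whitespace only."""
    return text.split()


# The same (text, expected length) pairs used in the parametrized test above.
cases = [
    ("تحریر باسط حبیب", 3),
    ("میرا پاکستان", 2),
]

lengths = [len(whitespace_tokenizer(text)) for text, _ in cases]
print(lengths)
```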