Add Urdu Language Support (#2430)

* added Urdu language support.
* added Urdu language tests.
* modified conftest.py for Urdu language support.
* added spaCy contributor agreement.
2025-07-25 23:49:46 +03:00 · 2018-06-22 14:14:03 +05:00 · 2018-06-22 14:14:03 +05:00 · f33c703066
commit f33c703066
parent 14d9007efd
11 changed files with 29945 additions and 1 deletions
--- a/.github/contributors/mirfan899.md
+++ b/.github/contributors/mirfan899.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your 
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                    |
+|------------------------------- | ------------------------ |
+| Name                           | Muhammad Irfan           |
+| Company name (if applicable)   |                          |
+| Title or role (if applicable)  | AI & ML Developer        |
+| Date                           | 2018-09-06               |
+| GitHub username                | mirfan899                |
+| Website (optional)             |                          |
--- a/spacy/lang/ur/__init__.py
+++ b/spacy/lang/ur/__init__.py
@@ -0,0 +1,30 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
+from .stop_words import STOP_WORDS
+from .lex_attrs import LEX_ATTRS
+from ..tag_map import TAG_MAP
+
+from ..tokenizer_exceptions import BASE_EXCEPTIONS
+from ...language import Language
+from ...attrs import LANG, NORM
+from ...util import update_exc
+
+
+class UrduDefaults(Language.Defaults):
+    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
+    lex_attr_getters.update(LEX_ATTRS)
+    lex_attr_getters[LANG] = lambda text: 'ur'
+
+    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
+    tag_map = TAG_MAP
+    stop_words = STOP_WORDS
+
+
+class Urdu(Language):
+    lang = 'ur'
+    Defaults = UrduDefaults
+
+
+__all__ = ['Urdu']
--- a/spacy/lang/ur/examples.py
+++ b/spacy/lang/ur/examples.py
@@ -0,0 +1,16 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+
+"""
+Example sentences to test spaCy and its language models.
+
+>>> from spacy.lang.ur.examples import sentences
+>>> docs = nlp.pipe(sentences)
+"""
+
+
+sentences = [
+    "اردو ہے جس کا نام ہم جانتے ہیں داغ",
+    "سارے جہاں میں دھوم ہماری زباں کی ہے",
+]
--- a/spacy/lang/ur/lemmatizer.py
+++ b/spacy/lang/ur/lemmatizer.py
--- a/spacy/lang/ur/lex_attrs.py
+++ b/spacy/lang/ur/lex_attrs.py
@@ -0,0 +1,47 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...attrs import LIKE_NUM
+
+# Source https://quizlet.com/4271889/1-100-urdu-number-wordsurdu-numerals-flash-cards/
+# http://www.urduword.com/lessons.php?lesson=numbers
+# https://en.wikibooks.org/wiki/Urdu/Vocabulary/Numbers
+# https://www.urdu-english.com/lessons/beginner/numbers
+
+_num_words = """ایک دو تین چار پانچ چھ سات آٹھ نو دس گیارہ بارہ تیرہ چودہ پندرہ سولہ سترہ
+ اٹهارا انیس بیس اکیس بائیس تئیس چوبیس پچیس چھببیس 
+ستایس اٹھائس انتيس تیس اکتیس بتیس تینتیس چونتیس پینتیس
+ چھتیس سینتیس ارتیس انتالیس چالیس اکتالیس بیالیس تیتالیس 
+چوالیس پیتالیس چھیالیس سینتالیس اڑتالیس انچالیس پچاس اکاون باون
+ تریپن چون پچپن چھپن ستاون اٹھاون انسٹھ ساثھ 
+اکسٹھ باسٹھ تریسٹھ چوسٹھ پیسٹھ چھیاسٹھ سڑسٹھ اڑسٹھ 
+انھتر ستر اکھتر بھتتر تیھتر چوھتر تچھتر چھیتر ستتر
+اٹھتر انیاسی اسی اکیاسی بیاسی تیراسی چوراسی پچیاسی چھیاسی
+ سٹیاسی اٹھیاسی نواسی نوے اکانوے بانوے ترانوے 
+چورانوے پچانوے چھیانوے ستانوے اٹھانوے ننانوے سو
+""".split()
+
+# source https://www.google.com/intl/ur/inputtools/try/
+
+_ordinal_words = """پہلا دوسرا تیسرا چوتھا پانچواں چھٹا ساتواں آٹھواں نواں دسواں گیارہواں بارہواں تیرھواں چودھواں
+ پندرھواں سولہواں سترھواں اٹھارواں انیسواں بسیواں 
+""".split()
+
+
+def like_num(text):
+    text = text.replace(',', '').replace('.', '')
+    if text.isdigit():
+        return True
+    if text.count('/') == 1:
+        num, denom = text.split('/')
+        if num.isdigit() and denom.isdigit():
+            return True
+    if text in _num_words:
+        return True
+    if text in _ordinal_words:
+        return True
+    return False
+
+LEX_ATTRS = {
+    LIKE_NUM: like_num
+}
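Review note: the checks in `like_num` can be exercised without spaCy. The sketch below is a minimal, standalone approximation. `NUM_WORDS` and `ORDINAL_WORDS` are illustrative stand-ins holding only a few entries copied from `_num_words` and `_ordinal_words`, and they are sets rather than lists, which is an assumption of this sketch and not what the diff does.

```python
# Stand-ins for a few entries of _num_words / _ordinal_words from the diff.
# The real lists are much longer; sets are used here only for fast lookup.
NUM_WORDS = {"ایک", "دو", "تین", "دس", "سو"}
ORDINAL_WORDS = {"پہلا", "دوسرا", "تیسرا"}


def like_num(text):
    """Return True if `text` looks like a number (same checks as the diff)."""
    # Drop ASCII thousands separators and decimal points before the digit test.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as "1/2": exactly one slash, digits on both sides.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Spelled-out cardinals and ordinals.
    return text in NUM_WORDS or text in ORDINAL_WORDS


for sample in ["10", "1,000", "1/2", "دس", "پہلا", "پاکستان"]:
    print(sample, like_num(sample))
```

Only the ASCII comma and full stop are stripped, so numbers written with other separators would need additional handling.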
--- a/spacy/lang/ur/stop_words.py
+++ b/spacy/lang/ur/stop_words.py
@@ -0,0 +1,515 @@
+# encoding: utf8
+from __future__ import unicode_literals
+
+# Source: collected from different resources on the internet
+
+STOP_WORDS = set("""
+ثھی
+ خو
+  گی
+   اپٌے
+    گئے
+     ثہت
+      طرف
+       ہوبری
+        پبئے
+         اپٌب
+          دوضری
+           گیب
+            کت
+             گب
+              ثھی
+               ضے
+                ہر
+پر
+اش
+ دی
+ گے
+لگیں
+ہے
+ثعذ
+ ضکتے
+  تھی
+   اى
+    دیب
+     لئے
+      والے
+       یہ
+        ثدبئے
+         ضکتی
+          تھب
+           اًذر
+            رریعے
+             لگی
+              ہوبرا
+               ہوًے
+                ثبہر
+                 ضکتب
+                  ًہیں
+                   تو
+                    اور
+رہب
+ لگے
+  ہوضکتب
+   ہوں
+    کب
+     ہوبرے
+      توبم
+       کیب
+        ایطے
+         رہی
+          هگر
+           ہوضکتی
+            ہیں
+             کریں
+              ہو
+               تک
+                کی
+                 ایک
+                  رہے
+                   هیں
+ ہوضکتے
+  کیطے
+   ہوًب
+    تت
+     کہ
+      ہوا
+       آئے
+        ضبت
+         تھے
+          کیوں
+           ہو
+            تب
+             کے
+              پھر
+               ثغیر
+                خبر
+                ہے
+                 رکھ
+                  کی
+                   طب
+                    کوئی
+                     رریعے
+ثبرے
+ خب
+  اضطرذ
+   ثلکہ
+    خجکہ
+     رکھ
+      تب
+       کی
+        طرف
+         ثراں
+          خبر
+            رریعہ
+ اضکب
+  ثٌذ
+   خص
+ کی
+  لئے
+ توہیں
+دوضرے
+ کررہی
+  اضکی
+   ثیچ
+    خوکہ
+     رکھتی
+      کیوًکہ
+       دوًوں
+        کر
+         رہے
+          خبر
+           ہی
+            ثرآں
+             اضکے
+              پچھلا
+               خیطب
+                رکھتے
+                 کے
+                  ثعذ
+                   تو
+                    ہی
+                     دورى
+کر
+ یہبں
+ آش
+  تھوڑا
+  چکے
+  زکویہ
+  دوضروں
+  ضکب
+  اوًچب
+  ثٌب
+  پل
+  تھوڑی
+  چلا
+  خبهوظ
+  دیتب
+  ضکٌب
+  اخبزت
+  اوًچبئی
+  ثٌبرہب
+پوچھب
+تھوڑے
+چلو
+ختن
+دیتی
+ضکی
+اچھب
+اوًچی
+ثٌبرہی
+پوچھتب
+تیي
+چلیں
+در
+دیتے
+ضکے
+اچھی
+اوًچے
+ثٌبرہے
+پوچھتی
+خبًب
+چلے
+درخبت
+دیر
+ضلطلہ
+اچھے
+اٹھبًب
+ثٌبًب
+پوچھتے
+خبًتب
+چھوٹب
+درخہ
+دیکھٌب
+ضوچ
+اختتبم
+اہن
+ثٌذ
+پوچھٌب
+خبًتی
+چھوٹوں
+درخے
+دیکھو
+ضوچب
+ادھر
+آئی
+ثٌذکرًب
+پوچھو
+خبًتے
+چھوٹی
+درزقیقت
+دیکھی
+ضوچتب
+ارد
+آئے
+ثٌذکرو
+پوچھوں
+خبًٌب
+چھوٹے
+درضت
+دیکھیں
+ضوچتی
+اردگرد
+آج
+ثٌذی
+پوچھیں
+خططرذ
+چھہ
+دش
+دیٌب
+ضوچتے
+ارکبى
+آخر
+ثڑا
+پورا
+خگہ
+چیسیں
+دفعہ
+دے
+ضوچٌب
+اضتعوبل
+آخر
+پہلا
+خگہوں
+زبصل
+دکھبئیں
+راضتوں
+ضوچو
+اضتعوبلات
+آدهی
+ثڑی
+پہلی
+خگہیں
+زبضر
+دکھبتب
+راضتہ
+ضوچی
+اغیب
+آًب
+ثڑے
+پہلےضی
+خلذی
+زبل
+دکھبتی
+راضتے
+ضوچیں
+اطراف
+آٹھ
+ثھر
+خٌبة
+زبل
+دکھبتے
+رکي
+ضیذھب
+افراد
+آیب
+ثھرا
+پہلے
+خواى
+زبلات
+دکھبًب
+رکھب
+ضیذھی
+اکثر
+ثب
+ہوا
+پیع
+خوًہی
+زبلیہ
+دکھبو
+رکھی
+ضیذھے
+اکٹھب
+ثھرپور
+تبزٍ
+خیطبکہ
+زصوں
+رکھے
+ضیکٌڈ
+اکٹھی
+ثبری
+ثہتر
+تر
+چبر
+زصہ
+دلچطپ
+زیبدٍ
+غبیذ
+اکٹھے
+ثبلا
+ثہتری
+ترتیت
+چبہب
+زصے
+دلچطپی
+ضبت
+غخص
+اکیلا
+ثبلترتیت
+ثہتریي
+تریي
+چبہٌب
+زقبئق
+دلچطپیبں
+ضبدٍ
+غذ
+اکیلی
+ثرش
+پبش
+تعذاد
+چبہے
+زقیتیں
+هٌبضت
+ضبرا
+غروع
+اکیلے
+ثغیر
+پبًب
+چکب
+زقیقت
+دو
+ضبرے
+غروعبت
+اگرچہ
+ثلٌذ
+پبًچ
+تن
+چکی
+زکن
+دور
+ضبل
+غے
+الگ
+پراًب
+تٌہب
+چکیں
+دوضرا
+ضبلوں
+صبف
+صسیر
+قجیلہ
+کوًطے
+لازهی
+هطئلے
+ًیب
+طریق
+کرتی
+کہتے
+صفر
+قطن
+کھولا
+لگتب
+هطبئل
+وار
+طریقوں
+کرتے
+کہٌب
+صورت
+کئی
+کھولٌب
+لگتی
+هطتعول
+وار
+طریقہ
+کرتے
+ہو
+کہٌب
+صورتسبل
+کئے
+کھولو
+لگتے
+هػتول
+ٹھیک
+طریقے
+کرًب
+کہو
+صورتوں
+کبفی
+هطلق
+ڈھوًڈا
+طور
+کرو
+کہوں
+صورتیں
+کبم
+کھولیں
+لگی
+هعلوم
+ڈھوًڈلیب
+طورپر
+کریں
+کہی
+ضرور
+کجھی
+کھولے
+لگے
+هکول
+ڈھوًڈًب
+ظبہر
+کرے
+کہیں
+ضرورت
+کرا
+کہب
+لوجب
+هلا
+ڈھوًڈو
+عذد
+کل
+کہیں
+کرتب
+کہتب
+لوجی
+هوکي
+ڈھوًڈی
+عظین
+کن
+کہے
+ضروری
+کرتبہوں
+کہتی
+لوجے
+هوکٌبت
+ڈھوًڈیں
+علاقوں
+کوتر
+کیے
+لوسبت
+هوکٌہ
+ہن
+لے
+ًبپطٌذ
+ہورہے
+علاقہ
+کورا
+کے
+رریعے
+لوسہ
+هڑا
+ہوئی
+هتعلق
+ًبگسیر
+ہوگئی
+علاقے
+کوروں
+گئی
+لو
+هڑًب
+ہوئے
+هسترم
+ًطجت
+ہو
+گئے
+علاوٍ
+کورٍ
+گرد
+لوگ
+هڑے
+ہوتی
+هسترهہ
+ًقطہ
+ہوگیب
+کورے
+گروپ
+لوگوں
+هہرثبى
+ہوتے
+هسطوش
+ًکبلٌب
+ہوًی
+عووهی
+کوطي
+گروٍ
+لڑکپي
+هیرا
+ہوچکب
+هختلف
+ًکتہ
+ہی
+فرد
+کوى
+گروہوں
+لی
+هیری
+ہوچکی
+هسیذ
+فی
+کوًطب
+گٌتی
+لیب
+هیرے
+ہوچکے
+هطئلہ
+ًوخواى
+یقیٌی
+قجل
+کوًطی
+لیٌب
+ًئی
+ہورہب
+لیں
+ًئے
+ہورہی
+ثبعث
+ضت
+""".split())
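Review note: the irregular indentation and the repeated entries in the `STOP_WORDS` block do not change the resulting set, because `str.split()` with no arguments splits on any run of whitespace and `set()` discards duplicates. A minimal sketch, using a handful of words copied from the list (the name `SAMPLE_STOP_WORDS` and the helper `is_stop` are hypothetical):

```python
# A few entries copied from the STOP_WORDS block, with the same kind of
# irregular indentation and one deliberate duplicate.
SAMPLE_STOP_WORDS = set("""
ہے
 پر
  کی
   کی
    تو
""".split())


def is_stop(token_text):
    """Exact-match lookup; the caller is expected to pass stripped text."""
    return token_text in SAMPLE_STOP_WORDS


print(len(SAMPLE_STOP_WORDS))  # 4: the duplicate entry collapses
print(is_stop("پر"))           # True
```

Lookups are exact string matches, so a word stored with a different character encoding than the input text will not match.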
--- a/spacy/lang/ur/tag_map.py
+++ b/spacy/lang/ur/tag_map.py
@@ -0,0 +1,65 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
+from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON
+
+TAG_MAP = {
+    ".":        {POS: PUNCT, "PunctType": "peri"},
+    ",":        {POS: PUNCT, "PunctType": "comm"},
+    "-LRB-":    {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"},
+    "-RRB-":    {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"},
+    "``":       {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"},
+    "\"\"":     {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
+    "''":       {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
+    ":":        {POS: PUNCT},
+    "$":        {POS: SYM, "Other": {"SymType": "currency"}},
+    "#":        {POS: SYM, "Other": {"SymType": "numbersign"}},
+    "AFX":      {POS: ADJ,  "Hyph": "yes"},
+    "CC":       {POS: CCONJ, "ConjType": "coor"},
+    "CD":       {POS: NUM, "NumType": "card"},
+    "DT":       {POS: DET},
+    "EX":       {POS: ADV, "AdvType": "ex"},
+    "FW":       {POS: X, "Foreign": "yes"},
+    "HYPH":     {POS: PUNCT, "PunctType": "dash"},
+    "IN":       {POS: ADP},
+    "JJ":       {POS: ADJ, "Degree": "pos"},
+    "JJR":      {POS: ADJ, "Degree": "comp"},
+    "JJS":      {POS: ADJ, "Degree": "sup"},
+    "LS":       {POS: PUNCT, "NumType": "ord"},
+    "MD":       {POS: VERB, "VerbType": "mod"},
+    "NIL":      {POS: ""},
+    "NN":       {POS: NOUN, "Number": "sing"},
+    "NNP":      {POS: PROPN, "NounType": "prop", "Number": "sing"},
+    "NNPS":     {POS: PROPN, "NounType": "prop", "Number": "plur"},
+    "NNS":      {POS: NOUN, "Number": "plur"},
+    "PDT":      {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
+    "POS":      {POS: PART, "Poss": "yes"},
+    "PRP":      {POS: PRON, "PronType": "prs"},
+    "PRP$":     {POS: ADJ, "PronType": "prs", "Poss": "yes"},
+    "RB":       {POS: ADV, "Degree": "pos"},
+    "RBR":      {POS: ADV, "Degree": "comp"},
+    "RBS":      {POS: ADV, "Degree": "sup"},
+    "RP":       {POS: PART},
+    "SP":       {POS: SPACE},
+    "SYM":      {POS: SYM},
+    "TO":       {POS: PART, "PartType": "inf", "VerbForm": "inf"},
+    "UH":       {POS: INTJ},
+    "VB":       {POS: VERB, "VerbForm": "inf"},
+    "VBD":      {POS: VERB, "VerbForm": "fin", "Tense": "past"},
+    "VBG":      {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
+    "VBN":      {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"},
+    "VBP":      {POS: VERB, "VerbForm": "fin", "Tense": "pres"},
+    "VBZ":      {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Number": "sing", "Person": 3},
+    "WDT":      {POS: ADJ, "PronType": "int|rel"},
+    "WP":       {POS: NOUN, "PronType": "int|rel"},
+    "WP$":      {POS: ADJ, "Poss": "yes", "PronType": "int|rel"},
+    "WRB":      {POS: ADV, "PronType": "int|rel"},
+    "ADD":      {POS: X},
+    "NFP":      {POS: PUNCT},
+    "GW":       {POS: X},
+    "XX":       {POS: X},
+    "BES":      {POS: VERB},
+    "HVS":      {POS: VERB},
+    "_SP":      {POS: SPACE},
+}
--- a/spacy/lang/ur/tokenizer_exceptions.py
+++ b/spacy/lang/ur/tokenizer_exceptions.py
@@ -0,0 +1,22 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+# import symbols – if you need to use more, add them here
+from ...symbols import ORTH, LEMMA, TAG, NORM, ADP, DET
+
+# Add tokenizer exceptions
+# Documentation: https://spacy.io/docs/usage/adding-languages#tokenizer-exceptions
+# Feel free to use custom logic to generate repetitive exceptions more efficiently.
+# If an exception is split into more than one token, the ORTH values combined always
+# need to match the original string.
+
+# Exceptions should be added in the following format:
+
+_exc = {
+
+}
+
+# To keep things clean and readable, it's recommended to only declare the
+# TOKENIZER_EXCEPTIONS at the bottom:
+
+TOKENIZER_EXCEPTIONS = _exc
--- a/spacy/tests/conftest.py
+++ b/spacy/tests/conftest.py
@@ -15,7 +15,7 @@ from .. import util
 # here if it's using spaCy's tokenizer (not a different library)
 # TODO: re-implement generic tokenizer tests
 _languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
-              'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'tt',
+              'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'ur', 'tt',
              'xx']

 _models = {'en': ['en_core_web_sm'],
@@ -162,6 +162,10 @@ def tt_tokenizer():
 def ar_tokenizer():
    return util.get_lang_class('ar').Defaults.create_tokenizer()

+@pytest.fixture
+def ur_tokenizer():
+    return util.get_lang_class('ur').Defaults.create_tokenizer()
+
 @pytest.fixture
 def ru_tokenizer():
    pymorphy = pytest.importorskip('pymorphy2')
--- a/spacy/tests/lang/ur/__init__.py
+++ b/spacy/tests/lang/ur/__init__.py
--- a/spacy/tests/lang/ur/test_text.py
+++ b/spacy/tests/lang/ur/test_text.py
@@ -0,0 +1,26 @@
+# coding: utf-8
+
+"""Test that longer and mixed texts are tokenized correctly."""
+
+
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_tokenizer_handles_long_text(ur_tokenizer):
+    text = """اصل میں رسوا ہونے کی ہمیں
+     کچھ عادت سی ہو گئی ہے اس لئے جگ ہنسائی کا ذکر نہیں کرتا،ہوا کچھ یوں کہ عرصہ چھ سال بعد ہمیں بھی خیال آیا
+     کہ ایک عدد ٹیلی ویژن ہی کیوں نہ خرید لیں ، سوچا ورلڈ کپ ہی دیکھیں گے۔اپنے پاکستان کے کھلاڑیوں کو دیکھ کر 
+    ورلڈ کپ دیکھنے کا حوصلہ ہی نہ رہا تو اب یوں ہی ادھر اُدھر کے چینل گھمانے لگ پڑتے ہیں۔"""
+
+    tokens = ur_tokenizer(text)
+    assert len(tokens) == 77
+
+
+@pytest.mark.parametrize('text,length', [
+    ("تحریر باسط حبیب", 3),
+    ("میرا پاکستان", 2)])
+def test_tokenizer_handles_cnts(ur_tokenizer, text, length):
+    tokens = ur_tokenizer(text)
+    assert len(tokens) == length