mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-11-04 01:48:04 +03:00 
			
		
		
		
	Add Arabic language (#2314)
* added support for Arabic lang * added Arabic language support * updated conftest
This commit is contained in:
		
							parent
							
								
									0e08e49e87
								
							
						
					
					
						commit
						00417794d3
					
				
							
								
								
									
										106
									
								
								.github/contributors/tzano.md
									
									
									
									
										vendored
									
									
										Normal file
									
								
							
							
						
						
									
										106
									
								
								.github/contributors/tzano.md
									
									
									
									
										vendored
									
									
										Normal file
									
								
							| 
						 | 
				
			
			@ -0,0 +1,106 @@
 | 
			
		|||
# spaCy contributor agreement
 | 
			
		||||
 | 
			
		||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
 | 
			
		||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
 | 
			
		||||
The SCA applies to any contribution that you make to any product or project
 | 
			
		||||
managed by us (the **"project"**), and sets out the intellectual property rights
 | 
			
		||||
you grant to us in the contributed materials. The term **"us"** shall mean
 | 
			
		||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
 | 
			
		||||
**"you"** shall mean the person or entity identified below.
 | 
			
		||||
 | 
			
		||||
If you agree to be bound by these terms, fill in the information requested
 | 
			
		||||
below and include the filled-in version with your first pull request, under the
 | 
			
		||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
 | 
			
		||||
should be your GitHub username, with the extension `.md`. For example, the user
 | 
			
		||||
example_user would create the file `.github/contributors/example_user.md`.
 | 
			
		||||
 | 
			
		||||
Read this agreement carefully before signing. These terms and conditions
 | 
			
		||||
constitute a binding legal agreement.
 | 
			
		||||
 | 
			
		||||
## Contributor Agreement
 | 
			
		||||
 | 
			
		||||
1. The term "contribution" or "contributed materials" means any source code,
 | 
			
		||||
object code, patch, tool, sample, graphic, specification, manual,
 | 
			
		||||
documentation, or any other material posted or submitted by you to the project.
 | 
			
		||||
 | 
			
		||||
2. With respect to any worldwide copyrights, or copyright applications and
 | 
			
		||||
registrations, in your contribution:
 | 
			
		||||
 | 
			
		||||
    * you hereby assign to us joint ownership, and to the extent that such
 | 
			
		||||
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
 | 
			
		||||
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
 | 
			
		||||
    royalty-free, unrestricted license to exercise all rights under those
 | 
			
		||||
    copyrights. This includes, at our option, the right to sublicense these same
 | 
			
		||||
    rights to third parties through multiple levels of sublicensees or other
 | 
			
		||||
    licensing arrangements;
 | 
			
		||||
 | 
			
		||||
    * you agree that each of us can do all things in relation to your
 | 
			
		||||
    contribution as if each of us were the sole owners, and if one of us makes
 | 
			
		||||
    a derivative work of your contribution, the one who makes the derivative
 | 
			
		||||
    work (or has it made will be the sole owner of that derivative work;
 | 
			
		||||
 | 
			
		||||
    * you agree that you will not assert any moral rights in your contribution
 | 
			
		||||
    against us, our licensees or transferees;
 | 
			
		||||
 | 
			
		||||
    * you agree that we may register a copyright in your contribution and
 | 
			
		||||
    exercise all ownership rights associated with it; and
 | 
			
		||||
 | 
			
		||||
    * you agree that neither of us has any duty to consult with, obtain the
 | 
			
		||||
    consent of, pay or render an accounting to the other for any use or
 | 
			
		||||
    distribution of your contribution.
 | 
			
		||||
 | 
			
		||||
3. With respect to any patents you own, or that you can license without payment
 | 
			
		||||
to any third party, you hereby grant to us a perpetual, irrevocable,
 | 
			
		||||
non-exclusive, worldwide, no-charge, royalty-free license to:
 | 
			
		||||
 | 
			
		||||
    * make, have made, use, sell, offer to sell, import, and otherwise transfer
 | 
			
		||||
    your contribution in whole or in part, alone or in combination with or
 | 
			
		||||
    included in any product, work or materials arising out of the project to
 | 
			
		||||
    which your contribution was submitted, and
 | 
			
		||||
 | 
			
		||||
    * at our option, to sublicense these same rights to third parties through
 | 
			
		||||
    multiple levels of sublicensees or other licensing arrangements.
 | 
			
		||||
 | 
			
		||||
4. Except as set out above, you keep all right, title, and interest in your
 | 
			
		||||
contribution. The rights that you grant to us under these terms are effective
 | 
			
		||||
on the date you first submitted a contribution to us, even if your submission
 | 
			
		||||
took place before the date you sign these terms.
 | 
			
		||||
 | 
			
		||||
5. You covenant, represent, warrant and agree that:
 | 
			
		||||
 | 
			
		||||
    * Each contribution that you submit is and shall be an original work of
 | 
			
		||||
    authorship and you can legally grant the rights set out in this SCA;
 | 
			
		||||
 | 
			
		||||
    * to the best of your knowledge, each contribution will not violate any
 | 
			
		||||
    third party's copyrights, trademarks, patents, or other intellectual
 | 
			
		||||
    property rights; and
 | 
			
		||||
 | 
			
		||||
    * each contribution shall be in compliance with U.S. export control laws and
 | 
			
		||||
    other applicable export and import laws. You agree to notify us if you
 | 
			
		||||
    become aware of any circumstance which would make any of the foregoing
 | 
			
		||||
    representations inaccurate in any respect. We may publicly disclose your 
 | 
			
		||||
    participation in the project, including the fact that you have signed the SCA.
 | 
			
		||||
 | 
			
		||||
6. This SCA is governed by the laws of the State of California and applicable
 | 
			
		||||
U.S. Federal law. Any choice of law rules will not apply.
 | 
			
		||||
 | 
			
		||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
 | 
			
		||||
mark both statements:
 | 
			
		||||
 | 
			
		||||
    * [x] I am signing on behalf of myself as an individual and no other person
 | 
			
		||||
    or entity, including my employer, has or will have rights with respect to my
 | 
			
		||||
    contributions.
 | 
			
		||||
 | 
			
		||||
    * [ ] I am signing on behalf of my employer or a legal entity and I have the
 | 
			
		||||
    actual authority to contractually bind that entity.
 | 
			
		||||
 | 
			
		||||
## Contributor Details
 | 
			
		||||
 | 
			
		||||
| Field                          | Entry                |
 | 
			
		||||
|------------------------------- | -------------------- |
 | 
			
		||||
| Name                           | Tahar Zanouda        |
 | 
			
		||||
| Company name (if applicable)   |                      |
 | 
			
		||||
| Title or role (if applicable)  |                      |
 | 
			
		||||
| Date                           | 09-05-2018           |
 | 
			
		||||
| GitHub username                | tzano                |
 | 
			
		||||
| Website (optional)             |                      |
 | 
			
		||||
							
								
								
									
										31
									
								
								spacy/lang/ar/__init__.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										31
									
								
								spacy/lang/ar/__init__.py
									
									
									
									
									
										Normal file
									
								
							| 
						 | 
				
			
			@ -0,0 +1,31 @@
 | 
			
		|||
# coding: utf8
 | 
			
		||||
from __future__ import unicode_literals
 | 
			
		||||
 | 
			
		||||
from .stop_words import STOP_WORDS
 | 
			
		||||
from .lex_attrs import LEX_ATTRS
 | 
			
		||||
from .punctuation import TOKENIZER_SUFFIXES
 | 
			
		||||
 | 
			
		||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 | 
			
		||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
 | 
			
		||||
from ..norm_exceptions import BASE_NORMS
 | 
			
		||||
from ...language import Language
 | 
			
		||||
from ...attrs import LANG, NORM
 | 
			
		||||
from ...util import update_exc, add_lookups
 | 
			
		||||
 | 
			
		||||
 | 
			
		||||
class ArabicDefaults(Language.Defaults):
 | 
			
		||||
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
 | 
			
		||||
    lex_attr_getters.update(LEX_ATTRS)
 | 
			
		||||
    lex_attr_getters[LANG] = lambda text: 'ar'
 | 
			
		||||
    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
 | 
			
		||||
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
 | 
			
		||||
    stop_words = STOP_WORDS
 | 
			
		||||
    suffixes = TOKENIZER_SUFFIXES
 | 
			
		||||
 | 
			
		||||
 | 
			
		||||
class Arabic(Language):
 | 
			
		||||
    lang = 'ar'
 | 
			
		||||
    Defaults = ArabicDefaults
 | 
			
		||||
 | 
			
		||||
 | 
			
		||||
__all__ = ['Arabic']
 | 
			
		||||
							
								
								
									
										20
									
								
								spacy/lang/ar/examples.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										20
									
								
								spacy/lang/ar/examples.py
									
									
									
									
									
										Normal file
									
								
							| 
						 | 
				
			
			@ -0,0 +1,20 @@
 | 
			
		|||
# coding: utf8
 | 
			
		||||
from __future__ import unicode_literals
 | 
			
		||||
 | 
			
		||||
"""
 | 
			
		||||
Example sentences to test spaCy and its language models.
 | 
			
		||||
 | 
			
		||||
>>> from spacy.lang.ar.examples import sentences
 | 
			
		||||
>>> docs = nlp.pipe(sentences)
 | 
			
		||||
"""
 | 
			
		||||
 | 
			
		||||
sentences = [
 | 
			
		||||
    "نال الكاتب خالد توفيق  جائزة الرواية العربية في معرض الشارقة الدولي للكتاب",
 | 
			
		||||
    "أين تقع دمشق ؟"
 | 
			
		||||
    "كيف حالك ؟",
 | 
			
		||||
    "هل يمكن ان نلتقي على الساعة الثانية عشرة ظهرا ؟",
 | 
			
		||||
    "ماهي أبرز التطورات السياسية، الأمنية والاجتماعية في العالم ؟",
 | 
			
		||||
    "هل بالإمكان أن نلتقي غدا؟",
 | 
			
		||||
    "هناك نحو 382 مليون شخص مصاب بداء السكَّري في العالم",
 | 
			
		||||
    "كشفت دراسة حديثة أن الخيل تقرأ تعبيرات الوجه وتستطيع أن تتذكر مشاعر الناس وعواطفهم"
 | 
			
		||||
]
 | 
			
		||||
							
								
								
									
										95
									
								
								spacy/lang/ar/lex_attrs.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										95
									
								
								spacy/lang/ar/lex_attrs.py
									
									
									
									
									
										Normal file
									
								
							| 
						 | 
				
			
			@ -0,0 +1,95 @@
 | 
			
		|||
# coding: utf8
 | 
			
		||||
from __future__ import unicode_literals
 | 
			
		||||
from ...attrs import LIKE_NUM
 | 
			
		||||
 | 
			
		||||
_num_words = set("""
 | 
			
		||||
صفر
 | 
			
		||||
واحد
 | 
			
		||||
إثنان
 | 
			
		||||
اثنان
 | 
			
		||||
ثلاثة
 | 
			
		||||
ثلاثه
 | 
			
		||||
أربعة
 | 
			
		||||
أربعه
 | 
			
		||||
خمسة
 | 
			
		||||
خمسه
 | 
			
		||||
ستة
 | 
			
		||||
سته
 | 
			
		||||
سبعة
 | 
			
		||||
سبعه
 | 
			
		||||
ثمانية
 | 
			
		||||
ثمانيه
 | 
			
		||||
تسعة
 | 
			
		||||
تسعه
 | 
			
		||||
ﻋﺸﺮﺓ
 | 
			
		||||
ﻋﺸﺮه
 | 
			
		||||
عشرون
 | 
			
		||||
عشرين
 | 
			
		||||
ثلاثون
 | 
			
		||||
ثلاثين
 | 
			
		||||
اربعون
 | 
			
		||||
اربعين
 | 
			
		||||
أربعون
 | 
			
		||||
أربعين
 | 
			
		||||
خمسون
 | 
			
		||||
خمسين
 | 
			
		||||
ستون
 | 
			
		||||
ستين
 | 
			
		||||
سبعون
 | 
			
		||||
سبعين
 | 
			
		||||
ثمانون
 | 
			
		||||
ثمانين
 | 
			
		||||
تسعون
 | 
			
		||||
تسعين
 | 
			
		||||
مائتين
 | 
			
		||||
مائتان
 | 
			
		||||
ثلاثمائة
 | 
			
		||||
خمسمائة
 | 
			
		||||
سبعمائة
 | 
			
		||||
الف
 | 
			
		||||
آلاف
 | 
			
		||||
ملايين
 | 
			
		||||
مليون
 | 
			
		||||
مليار
 | 
			
		||||
مليارات
 | 
			
		||||
""".split())
 | 
			
		||||
 | 
			
		||||
_ordinal_words = set("""
 | 
			
		||||
اول
 | 
			
		||||
أول
 | 
			
		||||
حاد
 | 
			
		||||
واحد
 | 
			
		||||
ثان
 | 
			
		||||
ثاني
 | 
			
		||||
ثالث
 | 
			
		||||
رابع
 | 
			
		||||
خامس
 | 
			
		||||
سادس
 | 
			
		||||
سابع
 | 
			
		||||
ثامن
 | 
			
		||||
تاسع
 | 
			
		||||
عاشر
 | 
			
		||||
""".split())
 | 
			
		||||
 | 
			
		||||
 | 
			
		||||
def like_num(text):
 | 
			
		||||
    """
 | 
			
		||||
    check if text resembles a number
 | 
			
		||||
    """
 | 
			
		||||
    text = text.replace(',', '').replace('.', '')
 | 
			
		||||
    if text.isdigit():
 | 
			
		||||
        return True
 | 
			
		||||
    if text.count('/') == 1:
 | 
			
		||||
        num, denom = text.split('/')
 | 
			
		||||
        if num.isdigit() and denom.isdigit():
 | 
			
		||||
            return True
 | 
			
		||||
    if text in _num_words:
 | 
			
		||||
        return True
 | 
			
		||||
    if text in _ordinal_words:
 | 
			
		||||
        return True
 | 
			
		||||
    return False
 | 
			
		||||
 | 
			
		||||
 | 
			
		||||
LEX_ATTRS = {
 | 
			
		||||
    LIKE_NUM: like_num
 | 
			
		||||
}
 | 
			
		||||
							
								
								
									
										15
									
								
								spacy/lang/ar/punctuation.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										15
									
								
								spacy/lang/ar/punctuation.py
									
									
									
									
									
										Normal file
									
								
							| 
						 | 
				
			
			@ -0,0 +1,15 @@
 | 
			
		|||
# coding: utf8
 | 
			
		||||
from __future__ import unicode_literals
 | 
			
		||||
 | 
			
		||||
from ..punctuation import TOKENIZER_INFIXES
 | 
			
		||||
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
 | 
			
		||||
from ..char_classes import QUOTES, UNITS, ALPHA, ALPHA_LOWER, ALPHA_UPPER
 | 
			
		||||
 | 
			
		||||
_suffixes = (LIST_PUNCT + LIST_ELLIPSES + LIST_QUOTES +
 | 
			
		||||
             [r'(?<=[0-9])\+',
 | 
			
		||||
              # Arabic is written from Right-To-Left
 | 
			
		||||
              r'(?<=[0-9])(?:{})'.format(CURRENCY),
 | 
			
		||||
              r'(?<=[0-9])(?:{})'.format(UNITS),
 | 
			
		||||
              r'(?<=[{au}][{au}])\.'.format(au=ALPHA_UPPER)])
 | 
			
		||||
 | 
			
		||||
TOKENIZER_SUFFIXES = _suffixes
 | 
			
		||||
							
								
								
									
										229
									
								
								spacy/lang/ar/stop_words.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										229
									
								
								spacy/lang/ar/stop_words.py
									
									
									
									
									
										Normal file
									
								
							| 
						 | 
				
			
			@ -0,0 +1,229 @@
 | 
			
		|||
# coding: utf8
 | 
			
		||||
from __future__ import unicode_literals
 | 
			
		||||
 | 
			
		||||
STOP_WORDS = set("""
 | 
			
		||||
من
 | 
			
		||||
نحو
 | 
			
		||||
لعل
 | 
			
		||||
بما
 | 
			
		||||
بين
 | 
			
		||||
وبين
 | 
			
		||||
ايضا
 | 
			
		||||
وبينما
 | 
			
		||||
تحت
 | 
			
		||||
مثلا
 | 
			
		||||
لدي
 | 
			
		||||
عنه
 | 
			
		||||
مع
 | 
			
		||||
هي
 | 
			
		||||
وهذا
 | 
			
		||||
واذا
 | 
			
		||||
هذان
 | 
			
		||||
انه
 | 
			
		||||
بينما
 | 
			
		||||
أمسى
 | 
			
		||||
وسوف
 | 
			
		||||
ولم
 | 
			
		||||
لذلك
 | 
			
		||||
إلى
 | 
			
		||||
منه
 | 
			
		||||
منها
 | 
			
		||||
كما
 | 
			
		||||
ظل
 | 
			
		||||
هنا
 | 
			
		||||
به
 | 
			
		||||
كذلك
 | 
			
		||||
اما
 | 
			
		||||
هما
 | 
			
		||||
بعد
 | 
			
		||||
بينهم
 | 
			
		||||
التي
 | 
			
		||||
أبو
 | 
			
		||||
اذا
 | 
			
		||||
بدلا
 | 
			
		||||
لها
 | 
			
		||||
أمام
 | 
			
		||||
يلي
 | 
			
		||||
حين
 | 
			
		||||
ضد
 | 
			
		||||
الذي
 | 
			
		||||
قد
 | 
			
		||||
صار
 | 
			
		||||
إذا
 | 
			
		||||
مابرح
 | 
			
		||||
قبل
 | 
			
		||||
كل
 | 
			
		||||
وليست
 | 
			
		||||
الذين
 | 
			
		||||
لهذا
 | 
			
		||||
وثي
 | 
			
		||||
انهم
 | 
			
		||||
باللتي
 | 
			
		||||
مافتئ
 | 
			
		||||
ولا
 | 
			
		||||
بهذه
 | 
			
		||||
بحيث
 | 
			
		||||
كيف
 | 
			
		||||
وله
 | 
			
		||||
علي
 | 
			
		||||
بات
 | 
			
		||||
لاسيما
 | 
			
		||||
حتى
 | 
			
		||||
وقد
 | 
			
		||||
و
 | 
			
		||||
أما
 | 
			
		||||
فيها
 | 
			
		||||
بهذا
 | 
			
		||||
لذا
 | 
			
		||||
حيث
 | 
			
		||||
لقد
 | 
			
		||||
إن
 | 
			
		||||
فإن
 | 
			
		||||
اول
 | 
			
		||||
ليت
 | 
			
		||||
فاللتي
 | 
			
		||||
ولقد
 | 
			
		||||
لسوف
 | 
			
		||||
هذه
 | 
			
		||||
ولماذا
 | 
			
		||||
معه
 | 
			
		||||
الحالي
 | 
			
		||||
بإن
 | 
			
		||||
حول
 | 
			
		||||
في
 | 
			
		||||
عليه
 | 
			
		||||
مايزال
 | 
			
		||||
ولعل
 | 
			
		||||
أنه
 | 
			
		||||
أضحى
 | 
			
		||||
اي
 | 
			
		||||
ستكون
 | 
			
		||||
لن
 | 
			
		||||
أن
 | 
			
		||||
ضمن
 | 
			
		||||
وعلى
 | 
			
		||||
امسى
 | 
			
		||||
الي
 | 
			
		||||
ذات
 | 
			
		||||
ولايزال
 | 
			
		||||
ذلك
 | 
			
		||||
فقد
 | 
			
		||||
هم
 | 
			
		||||
أي
 | 
			
		||||
عند
 | 
			
		||||
ابن
 | 
			
		||||
أو
 | 
			
		||||
فهو
 | 
			
		||||
فانه
 | 
			
		||||
سوف
 | 
			
		||||
ما
 | 
			
		||||
آل
 | 
			
		||||
كلا
 | 
			
		||||
عنها
 | 
			
		||||
وكذلك
 | 
			
		||||
ليست
 | 
			
		||||
لم
 | 
			
		||||
وأن
 | 
			
		||||
ماذا
 | 
			
		||||
لو
 | 
			
		||||
وهل
 | 
			
		||||
اللتي
 | 
			
		||||
ولذا
 | 
			
		||||
يمكن
 | 
			
		||||
فيه
 | 
			
		||||
الا
 | 
			
		||||
عليها
 | 
			
		||||
وبينهم
 | 
			
		||||
يوم
 | 
			
		||||
وبما
 | 
			
		||||
لما
 | 
			
		||||
فكان
 | 
			
		||||
اضحى
 | 
			
		||||
اصبح
 | 
			
		||||
لهم
 | 
			
		||||
بها
 | 
			
		||||
او
 | 
			
		||||
الذى
 | 
			
		||||
الى
 | 
			
		||||
إلي
 | 
			
		||||
قال
 | 
			
		||||
والتي
 | 
			
		||||
لازال
 | 
			
		||||
أصبح
 | 
			
		||||
ولهذا
 | 
			
		||||
مثل
 | 
			
		||||
وكانت
 | 
			
		||||
لكنه
 | 
			
		||||
بذلك
 | 
			
		||||
هذا
 | 
			
		||||
لماذا
 | 
			
		||||
قالت
 | 
			
		||||
فقط
 | 
			
		||||
لكن
 | 
			
		||||
مما
 | 
			
		||||
وكل
 | 
			
		||||
وان
 | 
			
		||||
وأبو
 | 
			
		||||
ومن
 | 
			
		||||
كان
 | 
			
		||||
مازال
 | 
			
		||||
هل
 | 
			
		||||
بينهن
 | 
			
		||||
هو
 | 
			
		||||
وما
 | 
			
		||||
على
 | 
			
		||||
وهو
 | 
			
		||||
لأن
 | 
			
		||||
واللتي
 | 
			
		||||
والذي
 | 
			
		||||
دون
 | 
			
		||||
عن
 | 
			
		||||
وايضا
 | 
			
		||||
هناك
 | 
			
		||||
بلا
 | 
			
		||||
جدا
 | 
			
		||||
ثم
 | 
			
		||||
منذ
 | 
			
		||||
اللذين
 | 
			
		||||
لايزال
 | 
			
		||||
بعض
 | 
			
		||||
مساء
 | 
			
		||||
تكون
 | 
			
		||||
فلا
 | 
			
		||||
بيننا
 | 
			
		||||
لا
 | 
			
		||||
ولكن
 | 
			
		||||
إذ
 | 
			
		||||
وأثناء
 | 
			
		||||
ليس
 | 
			
		||||
ومع
 | 
			
		||||
فيهم
 | 
			
		||||
ولسوف
 | 
			
		||||
بل
 | 
			
		||||
تلك
 | 
			
		||||
أحد
 | 
			
		||||
وهي
 | 
			
		||||
وكان
 | 
			
		||||
ومنها
 | 
			
		||||
وفي
 | 
			
		||||
ماانفك
 | 
			
		||||
اليوم
 | 
			
		||||
وماذا
 | 
			
		||||
هؤلاء
 | 
			
		||||
وليس
 | 
			
		||||
له
 | 
			
		||||
أثناء
 | 
			
		||||
بد
 | 
			
		||||
اليه
 | 
			
		||||
كأن
 | 
			
		||||
اليها
 | 
			
		||||
بتلك
 | 
			
		||||
يكون
 | 
			
		||||
ولما
 | 
			
		||||
هن
 | 
			
		||||
والى
 | 
			
		||||
كانت
 | 
			
		||||
وقبل
 | 
			
		||||
ان
 | 
			
		||||
لدى
 | 
			
		||||
""".split())
 | 
			
		||||
							
								
								
									
										47
									
								
								spacy/lang/ar/tokenizer_exceptions.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										47
									
								
								spacy/lang/ar/tokenizer_exceptions.py
									
									
									
									
									
										Normal file
									
								
							| 
						 | 
				
			
			@ -0,0 +1,47 @@
 | 
			
		|||
# coding: utf8
 | 
			
		||||
from __future__ import unicode_literals
 | 
			
		||||
 | 
			
		||||
from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA
 | 
			
		||||
import re
 | 
			
		||||
 | 
			
		||||
_exc = {}
 | 
			
		||||
 | 
			
		||||
# time
 | 
			
		||||
for exc_data in [
 | 
			
		||||
    {LEMMA: "قبل الميلاد", ORTH: "ق.م"},
 | 
			
		||||
    {LEMMA: "بعد الميلاد", ORTH: "ب. م"},
 | 
			
		||||
    {LEMMA: "ميلادي", ORTH: ".م"},
 | 
			
		||||
    {LEMMA: "هجري", ORTH: ".هـ"},
 | 
			
		||||
    {LEMMA: "توفي", ORTH: ".ت"}]:
 | 
			
		||||
    _exc[exc_data[ORTH]] = [exc_data]
 | 
			
		||||
 | 
			
		||||
# scientific abv.
 | 
			
		||||
for exc_data in [
 | 
			
		||||
    {LEMMA: "صلى الله عليه وسلم", ORTH: "صلعم"},
 | 
			
		||||
    {LEMMA: "الشارح", ORTH: "الشـ"},
 | 
			
		||||
    {LEMMA: "الظاهر", ORTH: "الظـ"},
 | 
			
		||||
    {LEMMA: "أيضًا", ORTH: "أيضـ"},
 | 
			
		||||
    {LEMMA: "إلى آخره", ORTH: "إلخ"},
 | 
			
		||||
    {LEMMA: "انتهى", ORTH: "اهـ"},
 | 
			
		||||
    {LEMMA: "حدّثنا", ORTH: "ثنا"},
 | 
			
		||||
    {LEMMA: "حدثني", ORTH: "ثنى"},
 | 
			
		||||
    {LEMMA: "أنبأنا", ORTH: "أنا"},
 | 
			
		||||
    {LEMMA: "أخبرنا", ORTH: "نا"},
 | 
			
		||||
    {LEMMA: "مصدر سابق", ORTH: "م. س"},
 | 
			
		||||
    {LEMMA: "مصدر نفسه", ORTH: "م. ن"}]:
 | 
			
		||||
    _exc[exc_data[ORTH]] = [exc_data]
 | 
			
		||||
 | 
			
		||||
# other abv.
 | 
			
		||||
for exc_data in [
 | 
			
		||||
    {LEMMA: "دكتور", ORTH: "د."},
 | 
			
		||||
    {LEMMA: "أستاذ دكتور", ORTH: "أ.د"},
 | 
			
		||||
    {LEMMA: "أستاذ", ORTH: "أ."},
 | 
			
		||||
    {LEMMA: "بروفيسور", ORTH: "ب."}]:
 | 
			
		||||
    _exc[exc_data[ORTH]] = [exc_data]
 | 
			
		||||
 | 
			
		||||
for exc_data in [
 | 
			
		||||
    {LEMMA: "تلفون", ORTH: "ت."},
 | 
			
		||||
    {LEMMA: "صندوق بريد", ORTH: "ص.ب"}]:
 | 
			
		||||
    _exc[exc_data[ORTH]] = [exc_data]
 | 
			
		||||
 | 
			
		||||
TOKENIZER_EXCEPTIONS = _exc
 | 
			
		||||
| 
						 | 
				
			
			@ -3,13 +3,11 @@ from __future__ import unicode_literals
 | 
			
		|||
 | 
			
		||||
import regex as re
 | 
			
		||||
 | 
			
		||||
 | 
			
		||||
re.DEFAULT_VERSION = re.VERSION1
 | 
			
		||||
merge_char_classes = lambda classes: '[{}]'.format('||'.join(classes))
 | 
			
		||||
split_chars = lambda char: list(char.strip().split(' '))
 | 
			
		||||
merge_chars = lambda char: char.strip().replace(' ', '|')
 | 
			
		||||
 | 
			
		||||
 | 
			
		||||
_bengali = r'[\p{L}&&\p{Bengali}]'
 | 
			
		||||
_hebrew = r'[\p{L}&&\p{Hebrew}]'
 | 
			
		||||
_latin_lower = r'[\p{Ll}&&\p{Latin}]'
 | 
			
		||||
| 
						 | 
				
			
			@ -27,11 +25,11 @@ ALPHA = merge_char_classes(_upper + _lower + _uncased)
 | 
			
		|||
ALPHA_LOWER = merge_char_classes(_lower + _uncased)
 | 
			
		||||
ALPHA_UPPER = merge_char_classes(_upper + _uncased)
 | 
			
		||||
 | 
			
		||||
 | 
			
		||||
_units = ('km km² km³ m m² m³ dm dm² dm³ cm cm² cm³ mm mm² mm³ ha µm nm yd in ft '
 | 
			
		||||
          'kg g mg µg t lb oz m/s km/h kmh mph hPa Pa mbar mb MB kb KB gb GB tb '
 | 
			
		||||
          'TB T G M K % км км² км³ м м² м³ дм дм² дм³ см см² см³ мм мм² мм³ нм '
 | 
			
		||||
          'кг г мг м/с км/ч кПа Па мбар Кб КБ кб Мб МБ мб Гб ГБ гб Тб ТБ тб')
 | 
			
		||||
          'кг г мг м/с км/ч кПа Па мбар Кб КБ кб Мб МБ мб Гб ГБ гб Тб ТБ тб'
 | 
			
		||||
          'كم كم² كم³ م م² م³ سم سم² سم³ مم مم² مم³ كم غرام جرام جم كغ ملغ كوب اكواب')
 | 
			
		||||
_currency = r'\$ £ € ¥ ฿ US\$ C\$ A\$ ₽ ﷼'
 | 
			
		||||
 | 
			
		||||
# These expressions contain various unicode variations, including characters
 | 
			
		||||
| 
						 | 
				
			
			@ -45,7 +43,6 @@ _hyphens = '- – — -- --- —— ~'
 | 
			
		|||
# Details: https://www.compart.com/en/unicode/category/So
 | 
			
		||||
_other_symbols = r'[\p{So}]'
 | 
			
		||||
 | 
			
		||||
 | 
			
		||||
UNITS = merge_chars(_units)
 | 
			
		||||
CURRENCY = merge_chars(_currency)
 | 
			
		||||
QUOTES = merge_chars(_quotes)
 | 
			
		||||
| 
						 | 
				
			
			
 | 
			
		|||
| 
						 | 
				
			
			@ -15,7 +15,9 @@ from .. import util
 | 
			
		|||
# here if it's using spaCy's tokenizer (not a different library)
 | 
			
		||||
# TODO: re-implement generic tokenizer tests
 | 
			
		||||
_languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
 | 
			
		||||
              'it', 'nb', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr', 'ar', 'xx']
 | 
			
		||||
              'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'xx']
 | 
			
		||||
 | 
			
		||||
_models = {'en': ['en_core_web_sm'],
 | 
			
		||||
           'de': ['de_core_news_md'],
 | 
			
		||||
           'fr': ['fr_core_news_sm'],
 | 
			
		||||
| 
						 | 
				
			
			@ -50,8 +52,8 @@ def RU(request):
 | 
			
		|||
 | 
			
		||||
#@pytest.fixture(params=_languages)
 | 
			
		||||
#def tokenizer(request):
 | 
			
		||||
    #lang = util.get_lang_class(request.param)
 | 
			
		||||
    #return lang.Defaults.create_tokenizer()
 | 
			
		||||
#lang = util.get_lang_class(request.param)
 | 
			
		||||
#return lang.Defaults.create_tokenizer()
 | 
			
		||||
 | 
			
		||||
 | 
			
		||||
@pytest.fixture
 | 
			
		||||
| 
						 | 
				
			
			@ -152,6 +154,9 @@ def th_tokenizer():
 | 
			
		|||
def tr_tokenizer():
 | 
			
		||||
    return util.get_lang_class('tr').Defaults.create_tokenizer()
 | 
			
		||||
 | 
			
		||||
@pytest.fixture
 | 
			
		||||
def ar_tokenizer():
 | 
			
		||||
    return util.get_lang_class('ar').Defaults.create_tokenizer()
 | 
			
		||||
 | 
			
		||||
@pytest.fixture
 | 
			
		||||
def ru_tokenizer():
 | 
			
		||||
| 
						 | 
				
			
			
 | 
			
		|||
							
								
								
									
										0
									
								
								spacy/tests/lang/ar/__init__.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										0
									
								
								spacy/tests/lang/ar/__init__.py
									
									
									
									
									
										Normal file
									
								
							
							
								
								
									
										26
									
								
								spacy/tests/lang/ar/test_exceptions.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										26
									
								
								spacy/tests/lang/ar/test_exceptions.py
									
									
									
									
									
										Normal file
									
								
							| 
						 | 
				
			
			@ -0,0 +1,26 @@
 | 
			
		|||
# coding: utf-8
 | 
			
		||||
from __future__ import unicode_literals
 | 
			
		||||
 | 
			
		||||
import pytest
 | 
			
		||||
 | 
			
		||||
 | 
			
		||||
@pytest.mark.parametrize('text',
 | 
			
		||||
                         ["ق.م", "إلخ", "ص.ب", "ت."])
 | 
			
		||||
def test_ar_tokenizer_handles_abbr(ar_tokenizer, text):
 | 
			
		||||
    tokens = ar_tokenizer(text)
 | 
			
		||||
    assert len(tokens) == 1
 | 
			
		||||
 | 
			
		||||
 | 
			
		||||
def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
 | 
			
		||||
    text = u"تعود الكتابة الهيروغليفية إلى سنة 3200 ق.م"
 | 
			
		||||
    tokens = ar_tokenizer(text)
 | 
			
		||||
    assert len(tokens) == 7
 | 
			
		||||
    assert tokens[6].text == "ق.م"
 | 
			
		||||
    assert tokens[6].lemma_ == "قبل الميلاد"
 | 
			
		||||
 | 
			
		||||
 | 
			
		||||
def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
 | 
			
		||||
    text = u"يبلغ طول مضيق طارق 14كم "
 | 
			
		||||
    tokens = ar_tokenizer(text)
 | 
			
		||||
    print([(tokens[i].text, tokens[i].suffix_) for i in range(len(tokens))])
 | 
			
		||||
    assert len(tokens) == 6
 | 
			
		||||
							
								
								
									
										13
									
								
								spacy/tests/lang/ar/test_text.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										13
									
								
								spacy/tests/lang/ar/test_text.py
									
									
									
									
									
										Normal file
									
								
							| 
						 | 
				
			
			@ -0,0 +1,13 @@
 | 
			
		|||
# coding: utf8
 | 
			
		||||
from __future__ import unicode_literals
 | 
			
		||||
 | 
			
		||||
 | 
			
		||||
def test_tokenizer_handles_long_text(ar_tokenizer):
 | 
			
		||||
    text = """نجيب محفوظ مؤلف و كاتب روائي عربي، يعد من أهم الأدباء العرب خلال القرن العشرين.
 | 
			
		||||
     ولد نجيب محفوظ في مدينة القاهرة، حيث ترعرع و تلقى تعليمه الجامعي في جامعتها،
 | 
			
		||||
      فتمكن من نيل شهادة في الفلسفة. ألف محفوظ على مدار حياته الكثير من الأعمال الأدبية، و في مقدمتها ثلاثيته الشهيرة.
 | 
			
		||||
      و قد نجح في الحصول على جائزة نوبل للآداب، ليكون بذلك العربي الوحيد الذي فاز بها."""
 | 
			
		||||
 | 
			
		||||
    tokens = ar_tokenizer(text)
 | 
			
		||||
    assert tokens[3].is_stop == True
 | 
			
		||||
    assert len(tokens) == 77
 | 
			
		||||
		Loading…
	
		Reference in New Issue
	
	Block a user