mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-12 10:16:27 +03:00
Add Arabic language (#2314)
* added support for Arabic lang
* added Arabic language support
* updated conftest
parent
0e08e49e87
commit
00417794d3
106
.github/contributors/tzano.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
||||||
|
# spaCy contributor agreement
|
||||||
|
|
||||||
|
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||||
|
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||||
|
The SCA applies to any contribution that you make to any product or project
|
||||||
|
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||||
|
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||||
|
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||||
|
**"you"** shall mean the person or entity identified below.
|
||||||
|
|
||||||
|
If you agree to be bound by these terms, fill in the information requested
|
||||||
|
below and include the filled-in version with your first pull request, under the
|
||||||
|
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||||
|
should be your GitHub username, with the extension `.md`. For example, the user
|
||||||
|
example_user would create the file `.github/contributors/example_user.md`.
|
||||||
|
|
||||||
|
Read this agreement carefully before signing. These terms and conditions
|
||||||
|
constitute a binding legal agreement.
|
||||||
|
|
||||||
|
## Contributor Agreement
|
||||||
|
|
||||||
|
1. The term "contribution" or "contributed materials" means any source code,
|
||||||
|
object code, patch, tool, sample, graphic, specification, manual,
|
||||||
|
documentation, or any other material posted or submitted by you to the project.
|
||||||
|
|
||||||
|
2. With respect to any worldwide copyrights, or copyright applications and
|
||||||
|
registrations, in your contribution:
|
||||||
|
|
||||||
|
* you hereby assign to us joint ownership, and to the extent that such
|
||||||
|
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||||
|
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||||
|
royalty-free, unrestricted license to exercise all rights under those
|
||||||
|
copyrights. This includes, at our option, the right to sublicense these same
|
||||||
|
rights to third parties through multiple levels of sublicensees or other
|
||||||
|
licensing arrangements;
|
||||||
|
|
||||||
|
* you agree that each of us can do all things in relation to your
|
||||||
|
contribution as if each of us were the sole owners, and if one of us makes
|
||||||
|
a derivative work of your contribution, the one who makes the derivative
|
||||||
|
work (or has it made) will be the sole owner of that derivative work;
|
||||||
|
|
||||||
|
* you agree that you will not assert any moral rights in your contribution
|
||||||
|
against us, our licensees or transferees;
|
||||||
|
|
||||||
|
* you agree that we may register a copyright in your contribution and
|
||||||
|
exercise all ownership rights associated with it; and
|
||||||
|
|
||||||
|
* you agree that neither of us has any duty to consult with, obtain the
|
||||||
|
consent of, pay or render an accounting to the other for any use or
|
||||||
|
distribution of your contribution.
|
||||||
|
|
||||||
|
3. With respect to any patents you own, or that you can license without payment
|
||||||
|
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||||
|
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||||
|
|
||||||
|
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||||
|
your contribution in whole or in part, alone or in combination with or
|
||||||
|
included in any product, work or materials arising out of the project to
|
||||||
|
which your contribution was submitted, and
|
||||||
|
|
||||||
|
* at our option, to sublicense these same rights to third parties through
|
||||||
|
multiple levels of sublicensees or other licensing arrangements.
|
||||||
|
|
||||||
|
4. Except as set out above, you keep all right, title, and interest in your
|
||||||
|
contribution. The rights that you grant to us under these terms are effective
|
||||||
|
on the date you first submitted a contribution to us, even if your submission
|
||||||
|
took place before the date you sign these terms.
|
||||||
|
|
||||||
|
5. You covenant, represent, warrant and agree that:
|
||||||
|
|
||||||
|
* Each contribution that you submit is and shall be an original work of
|
||||||
|
authorship and you can legally grant the rights set out in this SCA;
|
||||||
|
|
||||||
|
* to the best of your knowledge, each contribution will not violate any
|
||||||
|
third party's copyrights, trademarks, patents, or other intellectual
|
||||||
|
property rights; and
|
||||||
|
|
||||||
|
* each contribution shall be in compliance with U.S. export control laws and
|
||||||
|
other applicable export and import laws. You agree to notify us if you
|
||||||
|
become aware of any circumstance which would make any of the foregoing
|
||||||
|
representations inaccurate in any respect. We may publicly disclose your
|
||||||
|
participation in the project, including the fact that you have signed the SCA.
|
||||||
|
|
||||||
|
6. This SCA is governed by the laws of the State of California and applicable
|
||||||
|
U.S. Federal law. Any choice of law rules will not apply.
|
||||||
|
|
||||||
|
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||||
|
mark both statements:
|
||||||
|
|
||||||
|
* [x] I am signing on behalf of myself as an individual and no other person
|
||||||
|
or entity, including my employer, has or will have rights with respect to my
|
||||||
|
contributions.
|
||||||
|
|
||||||
|
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||||
|
actual authority to contractually bind that entity.
|
||||||
|
|
||||||
|
## Contributor Details
|
||||||
|
|
||||||
|
| Field | Entry |
|
||||||
|
|------------------------------- | -------------------- |
|
||||||
|
| Name | Tahar Zanouda |
|
||||||
|
| Company name (if applicable) | |
|
||||||
|
| Title or role (if applicable) | |
|
||||||
|
| Date | 09-05-2018 |
|
||||||
|
| GitHub username | tzano |
|
||||||
|
| Website (optional) | |
|
31
spacy/lang/ar/__init__.py
Normal file
|
@ -0,0 +1,31 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from .stop_words import STOP_WORDS
|
||||||
|
from .lex_attrs import LEX_ATTRS
|
||||||
|
from .punctuation import TOKENIZER_SUFFIXES
|
||||||
|
|
||||||
|
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||||
|
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||||
|
from ..norm_exceptions import BASE_NORMS
|
||||||
|
from ...language import Language
|
||||||
|
from ...attrs import LANG, NORM
|
||||||
|
from ...util import update_exc, add_lookups
|
||||||
|
|
||||||
|
|
||||||
|
class ArabicDefaults(Language.Defaults):
|
||||||
|
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||||
|
lex_attr_getters.update(LEX_ATTRS)
|
||||||
|
lex_attr_getters[LANG] = lambda text: 'ar'
|
||||||
|
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
|
||||||
|
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||||
|
stop_words = STOP_WORDS
|
||||||
|
suffixes = TOKENIZER_SUFFIXES
|
||||||
|
|
||||||
|
|
||||||
|
class Arabic(Language):
|
||||||
|
lang = 'ar'
|
||||||
|
Defaults = ArabicDefaults
|
||||||
|
|
||||||
|
|
||||||
|
__all__ = ['Arabic']
|
20
spacy/lang/ar/examples.py
Normal file
|
@ -0,0 +1,20 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
"""
|
||||||
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
>>> from spacy.lang.ar.examples import sentences
|
||||||
|
>>> docs = nlp.pipe(sentences)
|
||||||
|
"""
|
||||||
|
|
||||||
|
sentences = [
|
||||||
|
"نال الكاتب خالد توفيق جائزة الرواية العربية في معرض الشارقة الدولي للكتاب",
|
||||||
|
"أين تقع دمشق ؟",
|
||||||
|
"كيف حالك ؟",
|
||||||
|
"هل يمكن ان نلتقي على الساعة الثانية عشرة ظهرا ؟",
|
||||||
|
"ماهي أبرز التطورات السياسية، الأمنية والاجتماعية في العالم ؟",
|
||||||
|
"هل بالإمكان أن نلتقي غدا؟",
|
||||||
|
"هناك نحو 382 مليون شخص مصاب بداء السكَّري في العالم",
|
||||||
|
"كشفت دراسة حديثة أن الخيل تقرأ تعبيرات الوجه وتستطيع أن تتذكر مشاعر الناس وعواطفهم"
|
||||||
|
]
|
95
spacy/lang/ar/lex_attrs.py
Normal file
|
@ -0,0 +1,95 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
from ...attrs import LIKE_NUM
|
||||||
|
|
||||||
|
_num_words = set("""
|
||||||
|
صفر
|
||||||
|
واحد
|
||||||
|
إثنان
|
||||||
|
اثنان
|
||||||
|
ثلاثة
|
||||||
|
ثلاثه
|
||||||
|
أربعة
|
||||||
|
أربعه
|
||||||
|
خمسة
|
||||||
|
خمسه
|
||||||
|
ستة
|
||||||
|
سته
|
||||||
|
سبعة
|
||||||
|
سبعه
|
||||||
|
ثمانية
|
||||||
|
ثمانيه
|
||||||
|
تسعة
|
||||||
|
تسعه
|
||||||
|
عشرة
|
||||||
|
عشره
|
||||||
|
عشرون
|
||||||
|
عشرين
|
||||||
|
ثلاثون
|
||||||
|
ثلاثين
|
||||||
|
اربعون
|
||||||
|
اربعين
|
||||||
|
أربعون
|
||||||
|
أربعين
|
||||||
|
خمسون
|
||||||
|
خمسين
|
||||||
|
ستون
|
||||||
|
ستين
|
||||||
|
سبعون
|
||||||
|
سبعين
|
||||||
|
ثمانون
|
||||||
|
ثمانين
|
||||||
|
تسعون
|
||||||
|
تسعين
|
||||||
|
مائتين
|
||||||
|
مائتان
|
||||||
|
ثلاثمائة
|
||||||
|
خمسمائة
|
||||||
|
سبعمائة
|
||||||
|
الف
|
||||||
|
آلاف
|
||||||
|
ملايين
|
||||||
|
مليون
|
||||||
|
مليار
|
||||||
|
مليارات
|
||||||
|
""".split())
|
||||||
|
|
||||||
|
_ordinal_words = set("""
|
||||||
|
اول
|
||||||
|
أول
|
||||||
|
حاد
|
||||||
|
واحد
|
||||||
|
ثان
|
||||||
|
ثاني
|
||||||
|
ثالث
|
||||||
|
رابع
|
||||||
|
خامس
|
||||||
|
سادس
|
||||||
|
سابع
|
||||||
|
ثامن
|
||||||
|
تاسع
|
||||||
|
عاشر
|
||||||
|
""".split())
|
||||||
|
|
||||||
|
|
||||||
|
def like_num(text):
|
||||||
|
"""
|
||||||
|
Check if text resembles a number.
|
||||||
|
"""
|
||||||
|
text = text.replace(',', '').replace('.', '')
|
||||||
|
if text.isdigit():
|
||||||
|
return True
|
||||||
|
if text.count('/') == 1:
|
||||||
|
num, denom = text.split('/')
|
||||||
|
if num.isdigit() and denom.isdigit():
|
||||||
|
return True
|
||||||
|
if text in _num_words:
|
||||||
|
return True
|
||||||
|
if text in _ordinal_words:
|
||||||
|
return True
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
LEX_ATTRS = {
|
||||||
|
LIKE_NUM: like_num
|
||||||
|
}
|
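The `like_num` check above can be tried outside spaCy. What follows is a minimal standalone sketch rather than the module itself: `NUM_WORDS` is a hypothetical four-word subset of `_num_words`, and the function body mirrors the logic shown in the diff.

```python
# Standalone sketch of the like_num logic from lex_attrs.py.
# NUM_WORDS is a hypothetical subset used only for illustration.
NUM_WORDS = {"صفر", "واحد", "عشرون", "مليون"}


def like_num(text):
    """Return True if text looks like a number."""
    # Drop thousands separators and decimal points before testing digits.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as 3/4.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text in NUM_WORDS


for sample in ["382", "1,000", "3/4", "مليون", "كتاب"]:
    print(sample, like_num(sample))
```

Because `str.isdigit` accepts any Unicode decimal digit, Arabic-Indic digits such as `٣٨٢` pass the first check as well.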
15
spacy/lang/ar/punctuation.py
Normal file
|
@ -0,0 +1,15 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ..punctuation import TOKENIZER_INFIXES
|
||||||
|
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
|
||||||
|
from ..char_classes import QUOTES, UNITS, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
||||||
|
|
||||||
|
_suffixes = (LIST_PUNCT + LIST_ELLIPSES + LIST_QUOTES +
|
||||||
|
[r'(?<=[0-9])\+',
|
||||||
|
# Arabic is written from Right-To-Left
|
||||||
|
r'(?<=[0-9])(?:{})'.format(CURRENCY),
|
||||||
|
r'(?<=[0-9])(?:{})'.format(UNITS),
|
||||||
|
r'(?<=[{au}][{au}])\.'.format(au=ALPHA_UPPER)])
|
||||||
|
|
||||||
|
TOKENIZER_SUFFIXES = _suffixes
|
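The unit rule above uses a lookbehind so that a unit is split off only when it directly follows a digit. The sketch below shows the idea with the standard `re` module (`char_classes.py` imports the `regex` package instead). The `UNITS` alternation is a hypothetical three-unit subset, and `split_unit` is an illustrative helper, not a spaCy function.

```python
import re

# Hypothetical subset of the UNITS alternation, for illustration only.
UNITS = "كم|سم|م"

# A unit counts as a suffix only when the preceding character is a digit.
suffix_re = re.compile(r"(?<=[0-9])(?:{})$".format(UNITS))


def split_unit(token):
    """Split a trailing unit off a token, or return the token unchanged."""
    match = suffix_re.search(token)
    if match is None:
        return [token]
    return [token[:match.start()], match.group()]


print(split_unit("14كم"))
print(split_unit("مضيق"))
```

A word that merely ends in a unit string, with no digit before it, is left intact.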
229
spacy/lang/ar/stop_words.py
Normal file
|
@ -0,0 +1,229 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
STOP_WORDS = set("""
|
||||||
|
من
|
||||||
|
نحو
|
||||||
|
لعل
|
||||||
|
بما
|
||||||
|
بين
|
||||||
|
وبين
|
||||||
|
ايضا
|
||||||
|
وبينما
|
||||||
|
تحت
|
||||||
|
مثلا
|
||||||
|
لدي
|
||||||
|
عنه
|
||||||
|
مع
|
||||||
|
هي
|
||||||
|
وهذا
|
||||||
|
واذا
|
||||||
|
هذان
|
||||||
|
انه
|
||||||
|
بينما
|
||||||
|
أمسى
|
||||||
|
وسوف
|
||||||
|
ولم
|
||||||
|
لذلك
|
||||||
|
إلى
|
||||||
|
منه
|
||||||
|
منها
|
||||||
|
كما
|
||||||
|
ظل
|
||||||
|
هنا
|
||||||
|
به
|
||||||
|
كذلك
|
||||||
|
اما
|
||||||
|
هما
|
||||||
|
بعد
|
||||||
|
بينهم
|
||||||
|
التي
|
||||||
|
أبو
|
||||||
|
اذا
|
||||||
|
بدلا
|
||||||
|
لها
|
||||||
|
أمام
|
||||||
|
يلي
|
||||||
|
حين
|
||||||
|
ضد
|
||||||
|
الذي
|
||||||
|
قد
|
||||||
|
صار
|
||||||
|
إذا
|
||||||
|
مابرح
|
||||||
|
قبل
|
||||||
|
كل
|
||||||
|
وليست
|
||||||
|
الذين
|
||||||
|
لهذا
|
||||||
|
وثي
|
||||||
|
انهم
|
||||||
|
باللتي
|
||||||
|
مافتئ
|
||||||
|
ولا
|
||||||
|
بهذه
|
||||||
|
بحيث
|
||||||
|
كيف
|
||||||
|
وله
|
||||||
|
علي
|
||||||
|
بات
|
||||||
|
لاسيما
|
||||||
|
حتى
|
||||||
|
وقد
|
||||||
|
و
|
||||||
|
أما
|
||||||
|
فيها
|
||||||
|
بهذا
|
||||||
|
لذا
|
||||||
|
حيث
|
||||||
|
لقد
|
||||||
|
إن
|
||||||
|
فإن
|
||||||
|
اول
|
||||||
|
ليت
|
||||||
|
فاللتي
|
||||||
|
ولقد
|
||||||
|
لسوف
|
||||||
|
هذه
|
||||||
|
ولماذا
|
||||||
|
معه
|
||||||
|
الحالي
|
||||||
|
بإن
|
||||||
|
حول
|
||||||
|
في
|
||||||
|
عليه
|
||||||
|
مايزال
|
||||||
|
ولعل
|
||||||
|
أنه
|
||||||
|
أضحى
|
||||||
|
اي
|
||||||
|
ستكون
|
||||||
|
لن
|
||||||
|
أن
|
||||||
|
ضمن
|
||||||
|
وعلى
|
||||||
|
امسى
|
||||||
|
الي
|
||||||
|
ذات
|
||||||
|
ولايزال
|
||||||
|
ذلك
|
||||||
|
فقد
|
||||||
|
هم
|
||||||
|
أي
|
||||||
|
عند
|
||||||
|
ابن
|
||||||
|
أو
|
||||||
|
فهو
|
||||||
|
فانه
|
||||||
|
سوف
|
||||||
|
ما
|
||||||
|
آل
|
||||||
|
كلا
|
||||||
|
عنها
|
||||||
|
وكذلك
|
||||||
|
ليست
|
||||||
|
لم
|
||||||
|
وأن
|
||||||
|
ماذا
|
||||||
|
لو
|
||||||
|
وهل
|
||||||
|
اللتي
|
||||||
|
ولذا
|
||||||
|
يمكن
|
||||||
|
فيه
|
||||||
|
الا
|
||||||
|
عليها
|
||||||
|
وبينهم
|
||||||
|
يوم
|
||||||
|
وبما
|
||||||
|
لما
|
||||||
|
فكان
|
||||||
|
اضحى
|
||||||
|
اصبح
|
||||||
|
لهم
|
||||||
|
بها
|
||||||
|
او
|
||||||
|
الذى
|
||||||
|
الى
|
||||||
|
إلي
|
||||||
|
قال
|
||||||
|
والتي
|
||||||
|
لازال
|
||||||
|
أصبح
|
||||||
|
ولهذا
|
||||||
|
مثل
|
||||||
|
وكانت
|
||||||
|
لكنه
|
||||||
|
بذلك
|
||||||
|
هذا
|
||||||
|
لماذا
|
||||||
|
قالت
|
||||||
|
فقط
|
||||||
|
لكن
|
||||||
|
مما
|
||||||
|
وكل
|
||||||
|
وان
|
||||||
|
وأبو
|
||||||
|
ومن
|
||||||
|
كان
|
||||||
|
مازال
|
||||||
|
هل
|
||||||
|
بينهن
|
||||||
|
هو
|
||||||
|
وما
|
||||||
|
على
|
||||||
|
وهو
|
||||||
|
لأن
|
||||||
|
واللتي
|
||||||
|
والذي
|
||||||
|
دون
|
||||||
|
عن
|
||||||
|
وايضا
|
||||||
|
هناك
|
||||||
|
بلا
|
||||||
|
جدا
|
||||||
|
ثم
|
||||||
|
منذ
|
||||||
|
اللذين
|
||||||
|
لايزال
|
||||||
|
بعض
|
||||||
|
مساء
|
||||||
|
تكون
|
||||||
|
فلا
|
||||||
|
بيننا
|
||||||
|
لا
|
||||||
|
ولكن
|
||||||
|
إذ
|
||||||
|
وأثناء
|
||||||
|
ليس
|
||||||
|
ومع
|
||||||
|
فيهم
|
||||||
|
ولسوف
|
||||||
|
بل
|
||||||
|
تلك
|
||||||
|
أحد
|
||||||
|
وهي
|
||||||
|
وكان
|
||||||
|
ومنها
|
||||||
|
وفي
|
||||||
|
ماانفك
|
||||||
|
اليوم
|
||||||
|
وماذا
|
||||||
|
هؤلاء
|
||||||
|
وليس
|
||||||
|
له
|
||||||
|
أثناء
|
||||||
|
بد
|
||||||
|
اليه
|
||||||
|
كأن
|
||||||
|
اليها
|
||||||
|
بتلك
|
||||||
|
يكون
|
||||||
|
ولما
|
||||||
|
هن
|
||||||
|
والى
|
||||||
|
كانت
|
||||||
|
وقبل
|
||||||
|
ان
|
||||||
|
لدى
|
||||||
|
""".split())
|
47
spacy/lang/ar/tokenizer_exceptions.py
Normal file
|
@ -0,0 +1,47 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ...symbols import ORTH, LEMMA
|
||||||
|
|
||||||
|
|
||||||
|
_exc = {}
|
||||||
|
|
||||||
|
# time
|
||||||
|
for exc_data in [
|
||||||
|
{LEMMA: "قبل الميلاد", ORTH: "ق.م"},
|
||||||
|
{LEMMA: "بعد الميلاد", ORTH: "ب. م"},
|
||||||
|
{LEMMA: "ميلادي", ORTH: ".م"},
|
||||||
|
{LEMMA: "هجري", ORTH: ".هـ"},
|
||||||
|
{LEMMA: "توفي", ORTH: ".ت"}]:
|
||||||
|
_exc[exc_data[ORTH]] = [exc_data]
|
||||||
|
|
||||||
|
# scientific abbreviations
|
||||||
|
for exc_data in [
|
||||||
|
{LEMMA: "صلى الله عليه وسلم", ORTH: "صلعم"},
|
||||||
|
{LEMMA: "الشارح", ORTH: "الشـ"},
|
||||||
|
{LEMMA: "الظاهر", ORTH: "الظـ"},
|
||||||
|
{LEMMA: "أيضًا", ORTH: "أيضـ"},
|
||||||
|
{LEMMA: "إلى آخره", ORTH: "إلخ"},
|
||||||
|
{LEMMA: "انتهى", ORTH: "اهـ"},
|
||||||
|
{LEMMA: "حدّثنا", ORTH: "ثنا"},
|
||||||
|
{LEMMA: "حدثني", ORTH: "ثنى"},
|
||||||
|
{LEMMA: "أنبأنا", ORTH: "أنا"},
|
||||||
|
{LEMMA: "أخبرنا", ORTH: "نا"},
|
||||||
|
{LEMMA: "مصدر سابق", ORTH: "م. س"},
|
||||||
|
{LEMMA: "مصدر نفسه", ORTH: "م. ن"}]:
|
||||||
|
_exc[exc_data[ORTH]] = [exc_data]
|
||||||
|
|
||||||
|
# other abbreviations
|
||||||
|
for exc_data in [
|
||||||
|
{LEMMA: "دكتور", ORTH: "د."},
|
||||||
|
{LEMMA: "أستاذ دكتور", ORTH: "أ.د"},
|
||||||
|
{LEMMA: "أستاذ", ORTH: "أ."},
|
||||||
|
{LEMMA: "بروفيسور", ORTH: "ب."}]:
|
||||||
|
_exc[exc_data[ORTH]] = [exc_data]
|
||||||
|
|
||||||
|
for exc_data in [
|
||||||
|
{LEMMA: "تلفون", ORTH: "ت."},
|
||||||
|
{LEMMA: "صندوق بريد", ORTH: "ص.ب"}]:
|
||||||
|
_exc[exc_data[ORTH]] = [exc_data]
|
||||||
|
|
||||||
|
TOKENIZER_EXCEPTIONS = _exc
|
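Each loop above registers one special case keyed by its surface form (`ORTH`). The attribute dict is wrapped in a one-element list because a tokenizer exception may expand to several tokens. Here is a minimal sketch of the resulting structure, using plain strings as hypothetical stand-ins for spaCy's `ORTH` and `LEMMA` symbols:

```python
# Hypothetical stand-ins for the ORTH and LEMMA symbols from spacy.symbols.
ORTH, LEMMA = "ORTH", "LEMMA"

exceptions = {}
for exc_data in [
    {LEMMA: "قبل الميلاد", ORTH: "ق.م"},
    {LEMMA: "دكتور", ORTH: "د."},
    {LEMMA: "صندوق بريد", ORTH: "ص.ب"},
]:
    # Key by surface form; the value is a list with one dict per output token.
    exceptions[exc_data[ORTH]] = [exc_data]

print(exceptions["ق.م"][0][LEMMA])
```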
|
@ -3,13 +3,11 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
import regex as re
|
import regex as re
|
||||||
|
|
||||||
|
|
||||||
re.DEFAULT_VERSION = re.VERSION1
|
re.DEFAULT_VERSION = re.VERSION1
|
||||||
merge_char_classes = lambda classes: '[{}]'.format('||'.join(classes))
|
merge_char_classes = lambda classes: '[{}]'.format('||'.join(classes))
|
||||||
split_chars = lambda char: list(char.strip().split(' '))
|
split_chars = lambda char: list(char.strip().split(' '))
|
||||||
merge_chars = lambda char: char.strip().replace(' ', '|')
|
merge_chars = lambda char: char.strip().replace(' ', '|')
|
||||||
|
|
||||||
|
|
||||||
_bengali = r'[\p{L}&&\p{Bengali}]'
|
_bengali = r'[\p{L}&&\p{Bengali}]'
|
||||||
_hebrew = r'[\p{L}&&\p{Hebrew}]'
|
_hebrew = r'[\p{L}&&\p{Hebrew}]'
|
||||||
_latin_lower = r'[\p{Ll}&&\p{Latin}]'
|
_latin_lower = r'[\p{Ll}&&\p{Latin}]'
|
||||||
|
@ -27,11 +25,11 @@ ALPHA = merge_char_classes(_upper + _lower + _uncased)
|
||||||
ALPHA_LOWER = merge_char_classes(_lower + _uncased)
|
ALPHA_LOWER = merge_char_classes(_lower + _uncased)
|
||||||
ALPHA_UPPER = merge_char_classes(_upper + _uncased)
|
ALPHA_UPPER = merge_char_classes(_upper + _uncased)
|
||||||
|
|
||||||
|
|
||||||
_units = ('km km² km³ m m² m³ dm dm² dm³ cm cm² cm³ mm mm² mm³ ha µm nm yd in ft '
|
_units = ('km km² km³ m m² m³ dm dm² dm³ cm cm² cm³ mm mm² mm³ ha µm nm yd in ft '
|
||||||
'kg g mg µg t lb oz m/s km/h kmh mph hPa Pa mbar mb MB kb KB gb GB tb '
|
'kg g mg µg t lb oz m/s km/h kmh mph hPa Pa mbar mb MB kb KB gb GB tb '
|
||||||
'TB T G M K % км км² км³ м м² м³ дм дм² дм³ см см² см³ мм мм² мм³ нм '
|
'TB T G M K % км км² км³ м м² м³ дм дм² дм³ см см² см³ мм мм² мм³ нм '
|
||||||
'кг г мг м/с км/ч кПа Па мбар Кб КБ кб Мб МБ мб Гб ГБ гб Тб ТБ тб')
|
'кг г мг м/с км/ч кПа Па мбар Кб КБ кб Мб МБ мб Гб ГБ гб Тб ТБ тб '
|
||||||
|
'كم كم² كم³ م م² م³ سم سم² سم³ مم مم² مم³ كم غرام جرام جم كغ ملغ كوب اكواب')
|
||||||
_currency = r'\$ £ € ¥ ฿ US\$ C\$ A\$ ₽ ﷼'
|
_currency = r'\$ £ € ¥ ฿ US\$ C\$ A\$ ₽ ﷼'
|
||||||
|
|
||||||
# These expressions contain various unicode variations, including characters
|
# These expressions contain various unicode variations, including characters
|
||||||
|
@ -45,7 +43,6 @@ _hyphens = '- – — -- --- —— ~'
|
||||||
# Details: https://www.compart.com/en/unicode/category/So
|
# Details: https://www.compart.com/en/unicode/category/So
|
||||||
_other_symbols = r'[\p{So}]'
|
_other_symbols = r'[\p{So}]'
|
||||||
|
|
||||||
|
|
||||||
UNITS = merge_chars(_units)
|
UNITS = merge_chars(_units)
|
||||||
CURRENCY = merge_chars(_currency)
|
CURRENCY = merge_chars(_currency)
|
||||||
QUOTES = merge_chars(_quotes)
|
QUOTES = merge_chars(_quotes)
|
||||||
|
|
|
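`_units` is built from adjacent string literals, which Python joins with no separator. Every piece except the last therefore needs a trailing space, or the units on either side of the seam fuse into one. The sketch below shows the effect, reusing the `merge_chars` definition from the diff on made-up unit strings:

```python
# merge_chars as defined in char_classes.py: spaces become regex alternation.
merge_chars = lambda char: char.strip().replace(' ', '|')

# Adjacent literals are concatenated as-is, so the seam needs its own space.
with_space = ('kg g '
              'كم سم')
without_space = ('kg g'
                 'كم سم')

print(merge_chars(with_space))
print(merge_chars(without_space))
```

In the second case the last unit of the first literal and the first unit of the second end up as a single alternative, so neither matches on its own.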
@ -15,7 +15,9 @@ from .. import util
|
||||||
# here if it's using spaCy's tokenizer (not a different library)
|
# here if it's using spaCy's tokenizer (not a different library)
|
||||||
# TODO: re-implement generic tokenizer tests
|
# TODO: re-implement generic tokenizer tests
|
||||||
_languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
|
_languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
|
||||||
|
'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'xx']
|
||||||
|
|
||||||
_models = {'en': ['en_core_web_sm'],
|
_models = {'en': ['en_core_web_sm'],
|
||||||
'de': ['de_core_news_md'],
|
'de': ['de_core_news_md'],
|
||||||
'fr': ['fr_core_news_sm'],
|
'fr': ['fr_core_news_sm'],
|
||||||
|
@ -152,6 +154,9 @@ def th_tokenizer():
|
||||||
def tr_tokenizer():
|
def tr_tokenizer():
|
||||||
return util.get_lang_class('tr').Defaults.create_tokenizer()
|
return util.get_lang_class('tr').Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def ar_tokenizer():
|
||||||
|
return util.get_lang_class('ar').Defaults.create_tokenizer()
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def ru_tokenizer():
|
def ru_tokenizer():
|
||||||
|
|
0
spacy/tests/lang/ar/__init__.py
Normal file
26
spacy/tests/lang/ar/test_exceptions.py
Normal file
|
@ -0,0 +1,26 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text',
|
||||||
|
["ق.م", "إلخ", "ص.ب", "ت."])
|
||||||
|
def test_ar_tokenizer_handles_abbr(ar_tokenizer, text):
|
||||||
|
tokens = ar_tokenizer(text)
|
||||||
|
assert len(tokens) == 1
|
||||||
|
|
||||||
|
|
||||||
|
def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
|
||||||
|
text = u"تعود الكتابة الهيروغليفية إلى سنة 3200 ق.م"
|
||||||
|
tokens = ar_tokenizer(text)
|
||||||
|
assert len(tokens) == 7
|
||||||
|
assert tokens[6].text == "ق.م"
|
||||||
|
assert tokens[6].lemma_ == "قبل الميلاد"
|
||||||
|
|
||||||
|
|
||||||
|
def test_ar_tokenizer_handles_unit_suffix(ar_tokenizer):
|
||||||
|
text = u"يبلغ طول مضيق طارق 14كم "
|
||||||
|
tokens = ar_tokenizer(text)
|
||||||
|
print([(tokens[i].text, tokens[i].suffix_) for i in range(len(tokens))])
|
||||||
|
assert len(tokens) == 6
|
13
spacy/tests/lang/ar/test_text.py
Normal file
|
@ -0,0 +1,13 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
|
def test_tokenizer_handles_long_text(ar_tokenizer):
|
||||||
|
text = """نجيب محفوظ مؤلف و كاتب روائي عربي، يعد من أهم الأدباء العرب خلال القرن العشرين.
|
||||||
|
ولد نجيب محفوظ في مدينة القاهرة، حيث ترعرع و تلقى تعليمه الجامعي في جامعتها،
|
||||||
|
فتمكن من نيل شهادة في الفلسفة. ألف محفوظ على مدار حياته الكثير من الأعمال الأدبية، و في مقدمتها ثلاثيته الشهيرة.
|
||||||
|
و قد نجح في الحصول على جائزة نوبل للآداب، ليكون بذلك العربي الوحيد الذي فاز بها."""
|
||||||
|
|
||||||
|
tokens = ar_tokenizer(text)
|
||||||
|
assert tokens[3].is_stop
|
||||||
|
assert len(tokens) == 77
|