Mirror of https://github.com/explosion/spaCy.git (synced 2025-11-04 01:48:04 +03:00)

Merge pull request #6828 from explosion/master-tmp

Commit 5ed51c9dd2
.github/contributors/jganseman.md (new file, vendored, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Joachim Ganseman     |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 26/01/2021           |
| GitHub username                | jganseman            |
| Website (optional)             | www.ganseman.be      |
.github/contributors/jumasheff.md (new file, vendored, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Murat Jumashev       |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 25.01.2021           |
| GitHub username                | jumasheff            |
| Website (optional)             |                      |
spacy/lang/ky/__init__.py (new file, 20 lines)

@@ -0,0 +1,20 @@
```python
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_INFIXES
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from ...language import Language


class KyrgyzDefaults(Language.Defaults):
    tokenizer_exceptions = TOKENIZER_EXCEPTIONS
    infixes = TOKENIZER_INFIXES
    lex_attr_getters = LEX_ATTRS
    stop_words = STOP_WORDS


class Kyrgyz(Language):
    lang = "ky"
    Defaults = KyrgyzDefaults


__all__ = ["Kyrgyz"]
```
spacy/lang/ky/examples.py (new file, 16 lines)

@@ -0,0 +1,16 @@
```python
"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.ky.examples import sentences
>>> docs = nlp.pipe(sentences)
"""

sentences = [
    "Apple Улуу Британия стартабын $1 миллиардга сатып алууну көздөөдө.",
    "Автоном автомобилдерди камсыздоо жоопкерчилиги өндүрүүчүлөргө артылды.",
    "Сан-Франциско тротуар менен жүрүүчү робот-курьерлерге тыю салууну караштырууда.",
    "Лондон - Улуу Британияда жайгашкан ири шаар.",
    "Кайдасың?",
    "Франциянын президенти ким?",
    "Америка Кошмо Штаттарынын борбор калаасы кайсы шаар?",
    "Барак Обама качан төрөлгөн?",
]
```
spacy/lang/ky/lex_attrs.py (new file, 48 lines)

@@ -0,0 +1,48 @@
```python
from ...attrs import LIKE_NUM

_num_words = [
    "нөл",
    "ноль",
    "бир",
    "эки",
    "үч",
    "төрт",
    "беш",
    "алты",
    "жети",
    "сегиз",
    "тогуз",
    "он",
    "жыйырма",
    "отуз",
    "кырк",
    "элүү",
    "алтымыш",
    "жетмиш",
    "сексен",
    "токсон",
    "жүз",
    "миң",
    "миллион",
    "миллиард",
    "триллион",
    "триллиард",
]


def like_num(text):
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if text in _num_words:
        return True
    return False


LEX_ATTRS = {LIKE_NUM: like_num}
```
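The `like_num` logic above is self-contained apart from the `LIKE_NUM` symbol, so it can be exercised without spaCy. A minimal sketch with a truncated `_num_words` list:

```python
# Standalone copy of the like_num logic (number-word list truncated for brevity)
_num_words = ["нөл", "бир", "эки", "үч", "беш", "он", "жүз", "миң", "миллион"]


def like_num(text):
    # strip a leading sign, then drop digit separators
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # simple fractions like "1/2"
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text in _num_words


print(like_num("-10,5"))  # True: sign and comma are stripped first
print(like_num("1/2"))    # True: both halves of the fraction are digits
print(like_num("беш"))    # True: "five" is in the word list
print(like_num("сөз"))    # False: "word" is not number-like
```

Note that separators are stripped unconditionally, so strings like `"1,2,3"` also count as number-like; that is the intended, permissive behavior of the attribute.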
spacy/lang/ky/punctuation.py (new file, 21 lines)

@@ -0,0 +1,21 @@
```python
from ..char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, HYPHENS
from ..char_classes import LIST_ELLIPSES, LIST_ICONS

_hyphens_no_dash = HYPHENS.replace("-", "").strip("|").replace("||", "")
_infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
        r"(?<=[{a}])[,!?/()]+(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}{q}])[:<>=](?=[{a}])".format(a=ALPHA, q=CONCAT_QUOTES),
        r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}])([{q}\)\]\(\[])(?=[\-{a}])".format(a=ALPHA, q=CONCAT_QUOTES),
        r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=_hyphens_no_dash),
        r"(?<=[0-9])-(?=[{a}])".format(a=ALPHA),
        r"(?<=[0-9])-(?=[0-9])",
    ]
)

TOKENIZER_INFIXES = _infixes
```
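Each infix rule is a regex built from zero-width lookarounds, so the matched separator sits between two kept characters. As a rough illustration with plain `re` only (spaCy keeps a matched infix as its own token, whereas `re.split` here simply drops it), the digit-hyphen rule fires like this:

```python
import re

# The last infix rule above: a hyphen flanked by digits on both sides.
# The lookarounds are zero-width, so only the hyphen itself is matched.
digit_hyphen = re.compile(r"(?<=[0-9])-(?=[0-9])")

print(digit_hyphen.split("3-4"))        # ['3', '4']: the rule fires
print(digit_hyphen.split("кырк-элүү"))  # ['кырк-элүү']: no digits, no split
```

Note that a plain word-internal hyphen (`Бала-чака`) is deliberately not an infix here: `_hyphens_no_dash` removes `-` from the hyphen class, so hyphenated Kyrgyz compounds stay single tokens.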
spacy/lang/ky/stop_words.py (new file, 42 lines)

@@ -0,0 +1,42 @@
```python
STOP_WORDS = set(
    """
ага адам айтты айтымында айтып ал алар
алардын алган алуу алып анда андан аны
анын ар

бар басма баш башка башкы башчысы берген
биз билдирген билдирди бир биринчи бирок
бишкек болгон болот болсо болуп боюнча
буга бул

гана

да дагы деген деди деп

жана жатат жаткан жаңы же жогорку жок жол
жолу

кабыл калган кандай карата каршы катары
келген керек кийин кол кылмыш кыргыз
күнү көп

маалымат мамлекеттик мен менен миң
мурдагы мыйзам мындай мүмкүн

ошол ошондой

сүрөт сөз

тарабынан турган тууралуу

укук учурда

чейин чек

экенин эки эл эле эмес эми эч

үч үчүн

өз
""".split()
)
```
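The stop-word module relies on a single idiom: a triple-quoted block split on whitespace, which lets the words be grouped into readable alphabetical paragraphs while still collapsing to a flat set. A reduced, runnable copy (demo words only):

```python
# Reduced copy of the set("""...""".split()) idiom: blank lines and
# line breaks are all whitespace, so .split() flattens the paragraphs.
stop_words_demo = set(
    """
ага адам

бар биз
""".split()
)

print(len(stop_words_demo))      # 4
print("биз" in stop_words_demo)  # True
```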
spacy/lang/ky/tokenizer_exceptions.py (new file, 52 lines)

@@ -0,0 +1,52 @@
```python
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...symbols import ORTH, NORM
from ...util import update_exc

_exc = {}

_abbrev_exc = [
    # Weekday abbreviations
    {ORTH: "дүй", NORM: "дүйшөмбү"},
    {ORTH: "шей", NORM: "шейшемби"},
    {ORTH: "шар", NORM: "шаршемби"},
    {ORTH: "бей", NORM: "бейшемби"},
    {ORTH: "жум", NORM: "жума"},
    {ORTH: "ишм", NORM: "ишемби"},
    {ORTH: "жек", NORM: "жекшемби"},
    # Month abbreviations
    {ORTH: "янв", NORM: "январь"},
    {ORTH: "фев", NORM: "февраль"},
    {ORTH: "мар", NORM: "март"},
    {ORTH: "апр", NORM: "апрель"},
    {ORTH: "июн", NORM: "июнь"},
    {ORTH: "июл", NORM: "июль"},
    {ORTH: "авг", NORM: "август"},
    {ORTH: "сен", NORM: "сентябрь"},
    {ORTH: "окт", NORM: "октябрь"},
    {ORTH: "ноя", NORM: "ноябрь"},
    {ORTH: "дек", NORM: "декабрь"},
    # Number abbreviations
    {ORTH: "млрд", NORM: "миллиард"},
    {ORTH: "млн", NORM: "миллион"},
]

for abbr in _abbrev_exc:
    for orth in (abbr[ORTH], abbr[ORTH].capitalize(), abbr[ORTH].upper()):
        _exc[orth] = [{ORTH: orth, NORM: abbr[NORM]}]
        _exc[orth + "."] = [{ORTH: orth + ".", NORM: abbr[NORM]}]

for exc_data in [  # "etc." abbreviations
    {ORTH: "ж.б.у.с.", NORM: "жана башка ушул сыяктуу"},
    {ORTH: "ж.б.", NORM: "жана башка"},
    {ORTH: "ж.", NORM: "жыл"},
    {ORTH: "б.з.ч.", NORM: "биздин заманга чейин"},
    {ORTH: "б.з.", NORM: "биздин заман"},
    {ORTH: "кк.", NORM: "кылымдар"},
    {ORTH: "жж.", NORM: "жылдар"},
    {ORTH: "к.", NORM: "кылым"},
    {ORTH: "көч.", NORM: "көчөсү"},
    {ORTH: "м-н", NORM: "менен"},
    {ORTH: "б-ча", NORM: "боюнча"},
]:
    _exc[exc_data[ORTH]] = [exc_data]

TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
```
```diff
@@ -265,6 +265,11 @@ def tt_tokenizer():
     return get_lang_class("tt")().tokenizer


+@pytest.fixture(scope="session")
+def ky_tokenizer():
+    return get_lang_class("ky")().tokenizer
+
+
 @pytest.fixture(scope="session")
 def uk_tokenizer():
     pytest.importorskip("pymorphy2")
```
spacy/tests/lang/ky/__init__.py (new file, 0 lines)
spacy/tests/lang/ky/test_tokenizer.py (new file, 91 lines)

@@ -0,0 +1,91 @@
```python
# coding: utf8
from __future__ import unicode_literals

import pytest


INFIX_HYPHEN_TESTS = [
    ("Бала-чака жакшыбы?", "Бала-чака жакшыбы ?".split()),
    ("Кыз-келиндер кийими.", "Кыз-келиндер кийими .".split()),
]

PUNC_INSIDE_WORDS_TESTS = [
    (
        "Пассажир саны - 2,13 млн — киши/күнүнө (2010), 783,9 млн. киши/жылына.",
        "Пассажир саны - 2,13 млн — киши / күнүнө ( 2010 ) ,"
        " 783,9 млн. киши / жылына .".split(),
    ),
    ('То"кой', 'То " кой'.split()),
]

MIXED_ORDINAL_NUMS_TESTS = [
    ("Эртең 22-январь...", "Эртең 22 - январь ...".split())
]

ABBREV_TESTS = [
    ("Маселе б-ча эртең келет", "Маселе б-ча эртең келет".split()),
    ("Ахунбаев көч. турат.", "Ахунбаев көч. турат .".split()),
    ("«3-жылы (б.з.ч.) туулган", "« 3 - жылы ( б.з.ч. ) туулган".split()),
    ("Жүгөрү ж.б. дандар колдонулат", "Жүгөрү ж.б. дандар колдонулат".split()),
    ("3-4 кк. курулган.", "3 - 4 кк. курулган .".split()),
]

NAME_ABBREV_TESTS = [
    ("М.Жумаш", "М.Жумаш".split()),
    ("М.жумаш", "М.жумаш".split()),
    ("м.Жумаш", "м . Жумаш".split()),
    ("Жумаш М.Н.", "Жумаш М.Н.".split()),
    ("Жумаш.", "Жумаш .".split()),
]

TYPOS_IN_PUNC_TESTS = [
    ("«3-жылда , туулган", "« 3 - жылда , туулган".split()),
    ("«3-жылда,туулган", "« 3 - жылда , туулган".split()),
    ("«3-жылда,туулган.", "« 3 - жылда , туулган .".split()),
    ("Ал иштейт(качан?)", "Ал иштейт ( качан ? )".split()),
    ("Ал (качан?)иштейт", "Ал ( качан ?) иштейт".split()),  # "?)" => "?)" or "? )"
]

LONG_TEXTS_TESTS = [
    (
        "Алыскы өлкөлөргө аздыр-көптүр татаалыраак жүрүштөргө чыккандар "
        "азыраак: ал бир топ кымбат жана логистика маселесинин айынан "
        "кыйла татаал. Мисалы, январдагы майрамдарда Мароккого үчүнчү "
        "категориядагы маршрутка (100 чакырымдан кем эмес) барып "
        "келгенге аракет кылдык.",
        "Алыскы өлкөлөргө аздыр-көптүр татаалыраак жүрүштөргө чыккандар "
        "азыраак : ал бир топ кымбат жана логистика маселесинин айынан "
        "кыйла татаал . Мисалы , январдагы майрамдарда Мароккого үчүнчү "
        "категориядагы маршрутка ( 100 чакырымдан кем эмес ) барып "
        "келгенге аракет кылдык .".split(),
    )
]

TESTCASES = (
    INFIX_HYPHEN_TESTS
    + PUNC_INSIDE_WORDS_TESTS
    + MIXED_ORDINAL_NUMS_TESTS
    + ABBREV_TESTS
    + NAME_ABBREV_TESTS
    + LONG_TEXTS_TESTS
    + TYPOS_IN_PUNC_TESTS
)

NORM_TESTCASES = [
    (
        "ит, мышык ж.б.у.с. үй жаныбарлары.",
        ["ит", ",", "мышык", "жана башка ушул сыяктуу", "үй", "жаныбарлары", "."],
    )
]


@pytest.mark.parametrize("text,expected_tokens", TESTCASES)
def test_ky_tokenizer_handles_testcases(ky_tokenizer, text, expected_tokens):
    tokens = [token.text for token in ky_tokenizer(text) if not token.is_space]
    assert expected_tokens == tokens


@pytest.mark.parametrize("text,norms", NORM_TESTCASES)
def test_ky_tokenizer_handles_norm_exceptions(ky_tokenizer, text, norms):
    tokens = ky_tokenizer(text)
    assert [token.norm_ for token in tokens] == norms
```
				
			
			@ -155,7 +155,7 @@ The L2 norm of the lexeme's vector representation.
 | 
			
		|||
| `like_url`                                   | Does the lexeme resemble a URL? ~~bool~~                                                                                                                                                                                                                             |
 | 
			
		||||
| `like_num`                                   | Does the lexeme represent a number? e.g. "10.9", "10", "ten", etc. ~~bool~~                                                                                                                                                                                          |
 | 
			
		||||
| `like_email`                                 | Does the lexeme resemble an email address? ~~bool~~                                                                                                                                                                                                                  |
 | 
			
		||||
| `is_oov`                                     | Does the lexeme have a word vector? ~~bool~~                                                                                                                                                                                                                         |
 | 
			
		||||
| `is_oov`                                     | Is the lexeme out-of-vocabulary (i.e. does it not have a word vector)? ~~bool~~                                                                                                                                                                                      |
 | 
			
		||||
| `is_stop`                                    | Is the lexeme part of a "stop list"? ~~bool~~                                                                                                                                                                                                                        |
 | 
			
		||||
| `lang`                                       | Language of the parent vocabulary. ~~int~~                                                                                                                                                                                                                           |
 | 
			
		||||
| `lang_`                                      | Language of the parent vocabulary. ~~str~~                                                                                                                                                                                                                           |
 | 
			
		||||
| 
						 | 
				
			
			
 | 
			
		|||
@@ -472,7 +472,7 @@ The L2 norm of the token's vector representation.
 | `like_url`   | Does the token resemble a URL? ~~bool~~                                        |
 | `like_num`   | Does the token represent a number? e.g. "10.9", "10", "ten", etc. ~~bool~~     |
 | `like_email` | Does the token resemble an email address? ~~bool~~                             |
-| `is_oov`     | Does the token have a word vector? ~~bool~~                                    |
+| `is_oov`     | Is the token out-of-vocabulary (i.e. does it not have a word vector)? ~~bool~~ |
 | `is_stop`    | Is the token part of a "stop list"? ~~bool~~                                   |
 | `pos`        | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). ~~int~~ |
 | `pos_`       | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). ~~str~~ |
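The documentation fix above clarifies that `is_oov` is true exactly when a token has *no* word vector, not the other way around. A hedged illustration of that semantics, with a plain dict standing in for a real vector table (this is not spaCy's API, just the corrected meaning of the flag):

```python
# Illustration of the corrected `is_oov` semantics: a token is
# out-of-vocabulary exactly when no word vector is stored for it.
# `vectors` is a stand-in for a real vector table, not spaCy's API.
vectors = {
    "cat": [0.1, 0.7],
    "dog": [0.2, 0.6],
}


def is_oov(word: str) -> bool:
    """True if the word has no vector, i.e. it is out-of-vocabulary."""
    return word not in vectors
```

Under this definition `is_oov("cat")` is false and an unseen word like `is_oov("quokka")` is true, matching the new wording in both the `Lexeme` and `Token` tables.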
@@ -103,6 +103,12 @@
             "example": "이것은 문장입니다.",
             "has_examples": true
         },
+        {
+            "code": "ky",
+            "name": "Kyrgyz",
+            "example": "Адамга эң кыйыны — күн сайын адам болуу",
+            "has_examples": true
+        },
         { "code": "lb", "name": "Luxembourgish", "has_examples": true },
         {
             "code": "lij",
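The hunk above inserts the new Kyrgyz entry between Korean (`ko`) and Luxembourgish (`lb`), keeping the `languages` array sorted by ISO code. A small sketch of the kind of sanity check that guards this ordering; the JSON literal is a stand-in for the real meta file, not its actual contents:

```python
import json

# Sketch of a sanity check for the languages list edited above. The
# literal below is a trimmed stand-in for the real "languages" array.
meta = json.loads("""
{
  "languages": [
    { "code": "ko", "name": "Korean", "has_examples": true },
    { "code": "ky", "name": "Kyrgyz", "has_examples": true },
    { "code": "lb", "name": "Luxembourgish", "has_examples": true }
  ]
}
""")

codes = [lang["code"] for lang in meta["languages"]]
assert codes == sorted(codes), "language entries should stay sorted by code"
```

Running such a check in CI would catch an entry added out of order before it reaches the rendered language table.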
@@ -273,6 +279,9 @@
         { "id": "GPU GPL 3.0", "url": "https://www.gnu.org/licenses/gpl-3.0.en.html" },
         { "id": "LGPL", "url": "https://www.gnu.org/licenses/lgpl.html" },
         { "id": "MIT", "url": "https://opensource.org/licenses/MIT" },
-        { "id": "LGPL-LR", "url": "https://github.com/UniversalDependencies/UD_French-Sequoia/blob/master/LICENSE.txt" }
+        {
+            "id": "LGPL-LR",
+            "url": "https://github.com/UniversalDependencies/UD_French-Sequoia/blob/master/LICENSE.txt"
+        }
     ]
 }