mirror of https://github.com/explosion/spaCy.git (synced 2025-11-04 09:57:26 +03:00)

Merge pull request #6255 from explosion/master-tmp

commit db16059f9b
106  .github/contributors/Nuccy90.md  vendored  Normal file

@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Elena Fano           |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 2020-09-21           |
+| GitHub username                | Nuccy90              |
+| Website (optional)             |                      |

							
								
								
									
106  .github/contributors/rahul1990gupta.md  vendored  Normal file

@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           |  Rahul Gupta         |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           |  28 July 2020        |
+| GitHub username                |  rahul1990gupta      |
+| Website (optional)             |                      |

@@ -10,23 +10,26 @@ _stem_suffixes = [
     ["ाएगी", "ाएगा", "ाओगी", "ाओगे", "एंगी", "ेंगी", "एंगे", "ेंगे", "ूंगी", "ूंगा", "ातीं", "नाओं", "नाएं", "ताओं", "ताएं", "ियाँ", "ियों", "ियां"],
     ["ाएंगी", "ाएंगे", "ाऊंगी", "ाऊंगा", "ाइयाँ", "ाइयों", "ाइयां"]
 ]
-# fmt: on
 
-# reference 1:https://en.wikipedia.org/wiki/Indian_numbering_system
+# reference 1: https://en.wikipedia.org/wiki/Indian_numbering_system
 # reference 2: https://blogs.transparent.com/hindi/hindi-numbers-1-100/
+# reference 3: https://www.mindurhindi.com/basic-words-and-phrases-in-hindi/
 
-_num_words = [
+_one_to_ten = [
     "शून्य",
     "एक",
     "दो",
     "तीन",
     "चार",
-    "पांच",
+    "पांच", "पाँच",
     "छह",
     "सात",
     "आठ",
     "नौ",
     "दस",
+]
+
+_eleven_to_beyond = [
     "ग्यारह",
     "बारह",
     "तेरह",
@@ -37,13 +40,85 @@ _num_words = [
     "अठारह",
     "उन्नीस",
     "बीस",
+    "इकीस", "इक्कीस",
+    "बाईस",
+    "तेइस",
+    "चौबीस",
+    "पच्चीस",
+    "छब्बीस",
+    "सताइस", "सत्ताइस",
+    "अट्ठाइस",
+    "उनतीस",
     "तीस",
+    "इकतीस", "इकत्तीस",
+    "बतीस", "बत्तीस",
+    "तैंतीस",
+    "चौंतीस",
+    "पैंतीस",
+    "छतीस", "छत्तीस",
+    "सैंतीस",
+    "अड़तीस",
+    "उनतालीस", "उनत्तीस",
     "चालीस",
+    "इकतालीस",
+    "बयालीस",
+    "तैतालीस",
+    "चवालीस",
+    "पैंतालीस",
+    "छयालिस",
+    "सैंतालीस",
+    "अड़तालीस",
+    "उनचास",
     "पचास",
+    "इक्यावन",
+    "बावन",
+    "तिरपन", "तिरेपन",
+    "चौवन", "चउवन",
+    "पचपन",
+    "छप्पन",
+    "सतावन", "सत्तावन",
+    "अठावन",
+    "उनसठ",
     "साठ",
+    "इकसठ",
+    "बासठ",
+    "तिरसठ", "तिरेसठ",
+    "चौंसठ",
+    "पैंसठ",
+    "छियासठ",
+    "सड़सठ",
+    "अड़सठ",
+    "उनहत्तर",
     "सत्तर",
+    "इकहत्तर",
+    "बहत्तर",
+    "तिहत्तर",
+    "चौहत्तर",
+    "पचहत्तर",
+    "छिहत्तर",
+    "सतहत्तर",
+    "अठहत्तर",
+    "उन्नासी", "उन्यासी",
     "अस्सी",
+    "इक्यासी",
+    "बयासी",
+    "तिरासी",
+    "चौरासी",
+    "पचासी",
+    "छियासी",
+    "सतासी",
+    "अट्ठासी",
+    "नवासी",
     "नब्बे",
+    "इक्यानवे",
+    "बानवे",
+    "तिरानवे",
+    "चौरानवे",
+    "पचानवे",
+    "छियानवे",
+    "सतानवे",
+    "अट्ठानवे",
+    "निन्यानवे",
     "सौ",
     "हज़ार",
     "लाख",
@@ -52,6 +127,22 @@ _num_words = [
     "खरब",
 ]
 
+_num_words = _one_to_ten + _eleven_to_beyond
+
+_ordinal_words_one_to_ten = [
+    "प्रथम", "पहला",
+    "द्वितीय", "दूसरा",
+    "तृतीय", "तीसरा",
+    "चौथा",
+    "पांचवाँ",
+    "छठा",
+    "सातवाँ",
+    "आठवाँ",
+    "नौवाँ",
+    "दसवाँ",
+]
+_ordinal_suffix = "वाँ"
+# fmt: on
 
 def norm(string):
     # normalise base exceptions,  e.g. punctuation or currency symbols
@@ -64,7 +155,7 @@ def norm(string):
     for suffix_group in reversed(_stem_suffixes):
         length = len(suffix_group[0])
         if len(string) <= length:
-            break
+            continue
         for suffix in suffix_group:
             if string.endswith(suffix):
                 return string[:-length]
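
The `break` to `continue` change in the hunk above matters because the loop tries the longest suffixes first: a word that is too short for the longest group may still end in a shorter suffix. A minimal sketch of that loop follows, using a small made-up suffix table. `SUFFIX_GROUPS`, `strip_suffix` and the `skip_with_continue` flag are illustrative names and are not part of the module.

```python
# Hypothetical suffix groups ordered by suffix length (1, 2, then 3 characters),
# mirroring the shape of _stem_suffixes rather than its contents.
SUFFIX_GROUPS = [
    ["s"],
    ["ed", "ly"],
    ["ing", "est"],
]


def strip_suffix(word, skip_with_continue=True):
    """Strip the longest matching suffix, trying longer groups first."""
    for group in reversed(SUFFIX_GROUPS):
        length = len(group[0])
        if len(word) <= length:
            if skip_with_continue:
                continue  # too short for this group; shorter groups may still match
            break  # previous behaviour: give up on every remaining group
        for suffix in group:
            if word.endswith(suffix):
                return word[:-length]
    return word


print(strip_suffix("as"))                            # shorter groups are still tried
print(strip_suffix("as", skip_with_continue=False))  # loop stops at the first group
```

With `break`, any word no longer than the longest suffix is returned untouched, which is presumably why short Hindi forms were not being normalised before this change.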
@@ -74,7 +165,7 @@ def norm(string):
 def like_num(text):
     if text.startswith(("+", "-", "±", "~")):
         text = text[1:]
-    text = text.replace(", ", "").replace(".", "")
+    text = text.replace(",", "").replace(".", "")
     if text.isdigit():
         return True
     if text.count("/") == 1:
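
The one-character fix above (`", "` to `","`) is what lets grouped numerals such as `"१२,२६७"` reach the `isdigit()` check, since a single token carries no space after its separator. The sketch below isolates that step. `strip_separators` is an illustrative helper, not a function in the module.

```python
def strip_separators(text):
    """Drop thousands separators and decimal points before the digit check."""
    return text.replace(",", "").replace(".", "")


grouped = "१२,२६७"  # Devanagari digits with a comma separator

# Previous behaviour: only ", " (comma plus space) was removed, so the comma survives.
old_style = grouped.replace(", ", "").replace(".", "")

print(old_style.isdigit())                  # comma still present
print(strip_separators(grouped).isdigit())  # separators removed
```

`str.isdigit()` accepts Devanagari digits as well as ASCII ones, so no extra handling is needed once the separator is gone.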
@@ -83,6 +174,14 @@ def like_num(text):
             return True
     if text.lower() in _num_words:
         return True
+
+    # check ordinal numbers
+    # reference: http://www.englishkitab.com/Vocabulary/Numbers.html
+    if text in _ordinal_words_one_to_ten:
+        return True
+    if text.endswith(_ordinal_suffix):
+        if text[:-len(_ordinal_suffix)] in _eleven_to_beyond:
+            return True
     return False
 
 
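
The ordinal check added above has two parts: a lookup in a fixed list for the irregular forms, and a suffix rule for everything from eleven upwards. A self-contained sketch of the same idea follows, with tiny stand-in lists. `ORDINAL_WORDS`, `CARDINALS_ABOVE_TEN` and `is_ordinal` are illustrative names, and the lists are deliberately incomplete.

```python
# Stand-ins for the module's lists; the real ones are much longer.
ORDINAL_WORDS = ["पहला", "तृतीय"]
CARDINALS_ABOVE_TEN = ["उन्नीस", "छत्तीस"]
ORDINAL_SUFFIX = "वाँ"


def is_ordinal(text):
    """Return True for a listed ordinal, or a known cardinal plus the ordinal suffix."""
    if text in ORDINAL_WORDS:
        return True
    if text.endswith(ORDINAL_SUFFIX):
        stem = text[: -len(ORDINAL_SUFFIX)]
        return stem in CARDINALS_ABOVE_TEN
    return False


print(is_ordinal("पहला"))                     # irregular form from the list
print(is_ordinal("छत्तीस" + ORDINAL_SUFFIX))  # cardinal plus suffix
print(is_ordinal("छत्तीस"))                   # plain cardinal, not an ordinal
```

Slicing by `len(ORDINAL_SUFFIX)` rather than a hard-coded number keeps the rule correct if the suffix string is ever changed.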

@@ -19,4 +19,6 @@ sentences = [
     "தன்னாட்சி கார்கள் காப்பீட்டு பொறுப்பை உற்பத்தியாளரிடம் மாற்றுகின்றன",
     "நடைபாதை விநியோக ரோபோக்களை தடை செய்வதை சான் பிரான்சிஸ்கோ கருதுகிறது",
     "லண்டன் ஐக்கிய இராச்சியத்தில் ஒரு பெரிய நகரம்.",
+    "என்ன வேலை செய்கிறீர்கள்?",
+    "எந்த கல்லூரியில் படிக்கிறாய்?",
 ]

@@ -73,20 +73,16 @@ def like_num(text):
         num, denom = text.split("/")
         if num.isdigit() and denom.isdigit():
             return True
-
     text_lower = text.lower()
-
     # Check cardinal number
     if text_lower in _num_words:
         return True
-
     # Check ordinal number
     if text_lower in _ordinal_words:
         return True
     if text_lower.endswith(_ordinal_endings):
         if text_lower[:-3].isdigit() or text_lower[:-4].isdigit():
             return True
-
     return False
 
 

@@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 from ...symbols import NOUN, PROPN, PRON
 from ...errors import Errors
 

@@ -125,6 +125,11 @@ def he_tokenizer():
     return get_lang_class("he")().tokenizer
 
 
+@pytest.fixture(scope="session")
+def hi_tokenizer():
+    return get_lang_class("hi")().tokenizer
+
+
 @pytest.fixture(scope="session")
 def hr_tokenizer():
     return get_lang_class("hr")().tokenizer
@@ -240,11 +245,6 @@ def tr_tokenizer():
     return get_lang_class("tr")().tokenizer
 
 
-@pytest.fixture(scope="session")
-def tr_vocab():
-    return get_lang_class("tr").Defaults.create_vocab()
-
-
 @pytest.fixture(scope="session")
 def tt_tokenizer():
     return get_lang_class("tt")().tokenizer
@@ -297,11 +297,7 @@ def zh_tokenizer_pkuseg():
                 "segmenter": "pkuseg",
             }
         },
-        "initialize": {
-            "tokenizer": {
-                "pkuseg_model": "web",
-            }
-        },
+        "initialize": {"tokenizer": {"pkuseg_model": "web",}},
     }
     nlp = get_lang_class("zh").from_config(config)
     nlp.initialize()

							
								
								
									
0  spacy/tests/lang/hi/__init__.py  Normal file

41  spacy/tests/lang/hi/test_lex_attrs.py  Normal file

@@ -0,0 +1,41 @@
+import pytest
+from spacy.lang.hi.lex_attrs import norm, like_num
+
+
+def test_hi_tokenizer_handles_long_text(hi_tokenizer):
+    text = """
+ये कहानी 1900 के दशक की है। कौशल्या (स्मिता जयकर) को पता चलता है कि उसका
+छोटा बेटा, देवदास (शाहरुख खान) वापस घर आ रहा है। देवदास 10 साल पहले कानून की
+पढ़ाई करने के लिए इंग्लैंड गया था। उसके लौटने की खुशी में ये बात कौशल्या अपनी पड़ोस
+में रहने वाली सुमित्रा (किरण खेर) को भी बता देती है। इस खबर से वो भी खुश हो जाती है।
+"""
+    tokens = hi_tokenizer(text)
+    assert len(tokens) == 86
+
+
+@pytest.mark.parametrize(
+    "word,word_norm",
+    [
+        ("चलता", "चल"),
+        ("पढ़ाई", "पढ़"),
+        ("देती", "दे"),
+        ("जाती", "ज"),
+        ("मुस्कुराकर", "मुस्कुर"),
+    ],
+)
+def test_hi_norm(word, word_norm):
+    assert norm(word) == word_norm
+
+
+@pytest.mark.parametrize(
+    "word", ["१९८७", "1987", "१२,२६७", "उन्नीस", "पाँच", "नवासी", "५/१०"],
+)
+def test_hi_like_num(word):
+    assert like_num(word)
+
+
+@pytest.mark.parametrize(
+    "word", ["पहला", "तृतीय", "निन्यानवेवाँ", "उन्नीस", "तिहत्तरवाँ", "छत्तीसवाँ",],
+)
+def test_hi_like_num_ordinal_words(word):
+    assert like_num(word)

@@ -490,7 +490,7 @@ phrases, so that you can resolve overlaps and other conflicts in whatever way
 you prefer.
 
 | Argument  | Description                                                                                                                                       |
-| --------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
+| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `matcher` | The matcher instance. ~~Matcher~~                                                                                                                 |
 | `doc`     | The document the matcher was used on. ~~Doc~~                                                                                                     |
 | `i`       | Index of the current match (`matches[i]`). ~~int~~                                                                                                |
@@ -631,8 +631,8 @@ To get a quick overview of the results, you could collect all sentences
 containing a match and render them with the
 [displaCy visualizer](/usage/visualizers). In the callback function, you'll have
 access to the `start` and `end` of each match, as well as the parent `Doc`. This
-lets you determine the sentence containing the match, `doc[start:end].sent`,
-and calculate the start and end of the matched span within the sentence. Using
+lets you determine the sentence containing the match, `doc[start:end].sent`, and
+calculate the start and end of the matched span within the sentence. Using
 displaCy in ["manual" mode](/usage/visualizers#manual-usage) lets you pass in a
 list of dictionaries containing the text and entities to render.
 
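
The offset arithmetic described in the paragraph above can be sketched without spaCy itself. The example assumes character offsets are already known (in a real callback they would come from attributes such as `span.start_char`, `span.end_char` and `sent.start_char`). `to_manual_entry` is a hypothetical helper name, and the dictionary shape follows displaCy's manual entity format of `text` plus a list of `ents`.

```python
def to_manual_entry(sent_text, sent_start_char, span_start_char, span_end_char, label="MATCH"):
    """Build one dict for displaCy's manual mode, with offsets relative to the sentence."""
    ent = {
        "start": span_start_char - sent_start_char,
        "end": span_end_char - sent_start_char,
        "label": label,
    }
    return {"text": sent_text, "ents": [ent]}


# Plain strings stand in for a Doc, a sentence and a matched span.
doc_text = "I like cats. Dogs chase cats too."
sent_start = doc_text.index("Dogs")
sent_text = doc_text[sent_start:]
span_start = doc_text.index("chase")
span_end = span_start + len("chase")

entry = to_manual_entry(sent_text, sent_start, span_start, span_end)
start, end = entry["ents"][0]["start"], entry["ents"][0]["end"]
print(entry["text"][start:end])  # the matched text, sliced out of the sentence
```

A list of such entries is what would be handed to displaCy in manual mode. Subtracting the sentence's start offset is the step that keeps the highlight aligned once the sentence is rendered on its own.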