# spaCy/spacy/lang/hi/lex_attrs.py

from ..norm_exceptions import BASE_NORMS
from ...attrs import NORM, LIKE_NUM
# fmt: off
_stem_suffixes = [
["ो", "े", "ू", "ु", "ी", "ि", "ा"],
["कर", "ाओ", "िए", "ाई", "ाए", "ने", "नी", "ना", "ते", "ीं", "ती", "ता", "ाँ", "ां", "ों", "ें"],
["ाकर", "ाइए", "ाईं", "ाया", "ेगी", "ेगा", "ोगी", "ोगे", "ाने", "ाना", "ाते", "ाती", "ाता", "तीं", "ाओं", "ाएं", "ुओं", "ुएं", "ुआं"],
["ाएगी", "ाएगा", "ाओगी", "ाओगे", "एंगी", "ेंगी", "एंगे", "ेंगे", "ूंगी", "ूंगा", "ातीं", "नाओं", "नाएं", "ताओं", "ताएं", "ियाँ", "ियों", "ियां"],
["ाएंगी", "ाएंगे", "ाऊंगी", "ाऊंगा", "ाइयाँ", "ाइयों", "ाइयां"]
]
# reference 1: https://en.wikipedia.org/wiki/Indian_numbering_system
# reference 2: https://blogs.transparent.com/hindi/hindi-numbers-1-100/
# reference 3: https://www.mindurhindi.com/basic-words-and-phrases-in-hindi/
_one_to_ten = [
"शून्य",
"एक",
"दो",
"तीन",
"चार",
"पांच", "पाँच",
"छह",
"सात",
"आठ",
"नौ",
"दस",
]
_eleven_to_beyond = [
"ग्यारह",
"बारह",
"तेरह",
"चौदह",
"पंद्रह",
"सोलह",
"सत्रह",
"अठारह",
"उन्नीस",
"बीस",
"इकीस", "इक्कीस",
"बाईस",
"तेइस",
"चौबीस",
"पच्चीस",
"छब्बीस",
"सताइस", "सत्ताइस",
"अट्ठाइस",
"उनतीस",
"तीस",
"इकतीस", "इकत्तीस",
"बतीस", "बत्तीस",
"तैंतीस",
"चौंतीस",
"पैंतीस",
"छतीस", "छत्तीस",
"सैंतीस",
"अड़तीस",
"उनतालीस", "उनत्तीस",
"चालीस",
"इकतालीस",
"बयालीस",
"तैतालीस",
"चवालीस",
"पैंतालीस",
"छयालिस",
"सैंतालीस",
"अड़तालीस",
"उनचास",
"पचास",
"इक्यावन",
"बावन",
"तिरपन", "तिरेपन",
"चौवन", "चउवन",
"पचपन",
"छप्पन",
"सतावन", "सत्तावन",
"अठावन",
"उनसठ",
"साठ",
"इकसठ",
"बासठ",
"तिरसठ", "तिरेसठ",
"चौंसठ",
"पैंसठ",
"छियासठ",
"सड़सठ",
"अड़सठ",
"उनहत्तर",
"सत्तर",
"इकहत्तर",
"बहत्तर",
"तिहत्तर",
"चौहत्तर",
"पचहत्तर",
"छिहत्तर",
"सतहत्तर",
"अठहत्तर",
"उन्नासी", "उन्यासी",
"अस्सी",
"इक्यासी",
"बयासी",
"तिरासी",
"चौरासी",
"पचासी",
"छियासी",
"सतासी",
"अट्ठासी",
"नवासी",
"नब्बे",
"इक्यानवे",
"बानवे",
"तिरानवे",
"चौरानवे",
"पचानवे",
"छियानवे",
"सतानवे",
"अट्ठानवे",
"निन्यानवे",
"सौ",
"हज़ार",
"लाख",
"करोड़",
"अरब",
"खरब",
]
# Number words were added in #2629 (2018-08-08), following the Indian numbering system.
_num_words = _one_to_ten + _eleven_to_beyond
_ordinal_words_one_to_ten = [
"प्रथम", "पहला",
"द्वितीय", "दूसरा",
"तृतीय", "तीसरा",
"चौथा",
"पांचवाँ",
"छठा",
"सातवाँ",
"आठवाँ",
"नौवाँ",
"दसवाँ",
]
_ordinal_suffix = "वाँ"
# fmt: on


def norm(string):
    # normalise base exceptions, e.g. punctuation or currency symbols
    if string in BASE_NORMS:
        return BASE_NORMS[string]
    # set stem word as norm, if available, adapted from:
    # http://computing.open.ac.uk/Sites/EACLSouthAsia/Papers/p6-Ramanathan.pdf
    # http://research.variancia.com/hindi_stemmer/
    # https://github.com/taranjeet/hindi-tokenizer/blob/master/HindiTokenizer.py#L142
    # try the longest suffixes first; every suffix in a group has the same length
    for suffix_group in reversed(_stem_suffixes):
        length = len(suffix_group[0])
        if len(string) <= length:
            continue
        for suffix in suffix_group:
            if string.endswith(suffix):
                return string[:-length]
    return string
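
# A minimal, self-contained sketch of the longest-suffix-first stripping that
# norm() performs, for experimenting outside spaCy. SUFFIX_GROUPS is a
# hypothetical, trimmed-down table and strip_suffix is an illustrative name;
# neither is part of spaCy's API, and the BASE_NORMS lookup is left out.

```python
# Hypothetical subset of the suffix table, grouped by suffix length (1, 2, 3).
SUFFIX_GROUPS = [
    ["ो", "े", "ू", "ु", "ी", "ि", "ा"],
    ["कर", "ने", "ना", "ता", "ों", "ें"],
    ["ाकर", "ाने", "ाना", "ाता"],
]


def strip_suffix(word, groups=SUFFIX_GROUPS):
    """Return `word` with the longest matching suffix removed, if any."""
    # Walk the groups from longest to shortest so the longest match wins.
    for group in reversed(groups):
        length = len(group[0])
        # Never strip a suffix that would consume the whole word.
        if len(word) <= length:
            continue
        if word.endswith(tuple(group)):
            return word[:-length]
    return word


if __name__ == "__main__":
    for word in ["किताबें", "जाना", "घर"]:
        print(word, "->", strip_suffix(word))
```

# Lengths are counted in Unicode code points, so a vowel sign plus an
# anusvara (for example "ें") counts as a two-character suffix.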


def like_num(text):
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if text.lower() in _num_words:
        return True

    # check ordinal numbers
    # reference: http://www.englishkitab.com/Vocabulary/Numbers.html
    if text in _ordinal_words_one_to_ten:
        return True
    if text.endswith(_ordinal_suffix):
        if text[: -len(_ordinal_suffix)] in _eleven_to_beyond:
            return True
    return False


LEX_ATTRS = {NORM: norm, LIKE_NUM: like_num}
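
# A rough, standalone sketch of the checks like_num() applies, runnable
# without spaCy. NUM_WORDS and ORDINAL_WORDS are hypothetical small subsets of
# the lists above, and looks_like_number is an illustrative name. As a
# simplification, the ordinal-suffix check here uses the whole subset rather
# than only the words from eleven upwards.

```python
NUM_WORDS = {"एक", "दो", "तीन", "दस", "बीस", "सौ"}  # hypothetical subset
ORDINAL_WORDS = {"पहला", "दूसरा", "तीसरा"}  # hypothetical subset
ORDINAL_SUFFIX = "वाँ"


def looks_like_number(text):
    """Return True if `text` resembles a number, fraction, or number word."""
    # Drop a single leading sign, then thousands separators and decimal points.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    # str.isdigit() also accepts Devanagari digits.
    if text.isdigit():
        return True
    # Simple fractions such as 3/4.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if text in NUM_WORDS or text in ORDINAL_WORDS:
        return True
    # Ordinals built as number word + suffix.
    if text.endswith(ORDINAL_SUFFIX):
        return text[: -len(ORDINAL_SUFFIX)] in NUM_WORDS
    return False


if __name__ == "__main__":
    for token in ["1,00,000", "3/4", "-12.5", "बीस", "घर"]:
        print(token, looks_like_number(token))
```

# Because separators are removed before the digit check, Indian-style
# groupings such as 1,00,000 pass the same test as 100,000.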