mirror of https://github.com/explosion/spaCy.git, synced 2025-10-30 23:47:31 +03:00
			
		
		
		
	Tamil language support (#3154)
Add Tamil language support to spaCy
Description
This PR adds support for the Tamil language to spaCy.
    Added stop words, example sentences and numeric lexical attributes (like_num)
    <--Working on other language data-->
Types of change
Enhancement
Checklist
    [x] I have submitted the spaCy Contributor Agreement.
    [x] I ran the tests, and all new and existing tests passed.
    [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
			
			
parent f28a1c7271
commit d97661d18b
					
				
							
								
								
									
.github/contributors/Loghijiaha.md
									
								
							|  | @ -0,0 +1,106 @@ | ||||||
|  | # spaCy contributor agreement | ||||||
|  | 
 | ||||||
|  | This spaCy Contributor Agreement (**"SCA"**) is based on the | ||||||
|  | [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). | ||||||
|  | The SCA applies to any contribution that you make to any product or project | ||||||
|  | managed by us (the **"project"**), and sets out the intellectual property rights | ||||||
|  | you grant to us in the contributed materials. The term **"us"** shall mean | ||||||
|  | [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term | ||||||
|  | **"you"** shall mean the person or entity identified below. | ||||||
|  | 
 | ||||||
|  | If you agree to be bound by these terms, fill in the information requested | ||||||
|  | below and include the filled-in version with your first pull request, under the | ||||||
|  | folder [`.github/contributors/`](/.github/contributors/). The name of the file | ||||||
|  | should be your GitHub username, with the extension `.md`. For example, the user | ||||||
|  | example_user would create the file `.github/contributors/example_user.md`. | ||||||
|  | 
 | ||||||
|  | Read this agreement carefully before signing. These terms and conditions | ||||||
|  | constitute a binding legal agreement. | ||||||
|  | 
 | ||||||
|  | ## Contributor Agreement | ||||||
|  | 
 | ||||||
|  | 1. The term "contribution" or "contributed materials" means any source code, | ||||||
|  | object code, patch, tool, sample, graphic, specification, manual, | ||||||
|  | documentation, or any other material posted or submitted by you to the project. | ||||||
|  | 
 | ||||||
|  | 2. With respect to any worldwide copyrights, or copyright applications and | ||||||
|  | registrations, in your contribution: | ||||||
|  | 
 | ||||||
|  |     * you hereby assign to us joint ownership, and to the extent that such | ||||||
|  |     assignment is or becomes invalid, ineffective or unenforceable, you hereby | ||||||
|  |     grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, | ||||||
|  |     royalty-free, unrestricted license to exercise all rights under those | ||||||
|  |     copyrights. This includes, at our option, the right to sublicense these same | ||||||
|  |     rights to third parties through multiple levels of sublicensees or other | ||||||
|  |     licensing arrangements; | ||||||
|  | 
 | ||||||
|  |     * you agree that each of us can do all things in relation to your | ||||||
|  |     contribution as if each of us were the sole owners, and if one of us makes | ||||||
|  |     a derivative work of your contribution, the one who makes the derivative | ||||||
|  |     work (or has it made) will be the sole owner of that derivative work; | ||||||
|  | 
 | ||||||
|  |     * you agree that you will not assert any moral rights in your contribution | ||||||
|  |     against us, our licensees or transferees; | ||||||
|  | 
 | ||||||
|  |     * you agree that we may register a copyright in your contribution and | ||||||
|  |     exercise all ownership rights associated with it; and | ||||||
|  | 
 | ||||||
|  |     * you agree that neither of us has any duty to consult with, obtain the | ||||||
|  |     consent of, pay or render an accounting to the other for any use or | ||||||
|  |     distribution of your contribution. | ||||||
|  | 
 | ||||||
|  | 3. With respect to any patents you own, or that you can license without payment | ||||||
|  | to any third party, you hereby grant to us a perpetual, irrevocable, | ||||||
|  | non-exclusive, worldwide, no-charge, royalty-free license to: | ||||||
|  | 
 | ||||||
|  |     * make, have made, use, sell, offer to sell, import, and otherwise transfer | ||||||
|  |     your contribution in whole or in part, alone or in combination with or | ||||||
|  |     included in any product, work or materials arising out of the project to | ||||||
|  |     which your contribution was submitted, and | ||||||
|  | 
 | ||||||
|  |     * at our option, to sublicense these same rights to third parties through | ||||||
|  |     multiple levels of sublicensees or other licensing arrangements. | ||||||
|  | 
 | ||||||
|  | 4. Except as set out above, you keep all right, title, and interest in your | ||||||
|  | contribution. The rights that you grant to us under these terms are effective | ||||||
|  | on the date you first submitted a contribution to us, even if your submission | ||||||
|  | took place before the date you sign these terms. | ||||||
|  | 
 | ||||||
|  | 5. You covenant, represent, warrant and agree that: | ||||||
|  | 
 | ||||||
|  |     * Each contribution that you submit is and shall be an original work of | ||||||
|  |     authorship and you can legally grant the rights set out in this SCA; | ||||||
|  | 
 | ||||||
|  |     * to the best of your knowledge, each contribution will not violate any | ||||||
|  |     third party's copyrights, trademarks, patents, or other intellectual | ||||||
|  |     property rights; and | ||||||
|  | 
 | ||||||
|  |     * each contribution shall be in compliance with U.S. export control laws and | ||||||
|  |     other applicable export and import laws. You agree to notify us if you | ||||||
|  |     become aware of any circumstance which would make any of the foregoing | ||||||
|  |     representations inaccurate in any respect. We may publicly disclose your | ||||||
|  |     participation in the project, including the fact that you have signed the SCA. | ||||||
|  | 
 | ||||||
|  | 6. This SCA is governed by the laws of the State of California and applicable | ||||||
|  | U.S. Federal law. Any choice of law rules will not apply. | ||||||
|  | 
 | ||||||
|  | 7. Please place an “x” on one of the applicable statements below. Please do NOT | ||||||
|  | mark both statements: | ||||||
|  | 
 | ||||||
|  |     * [ x] I am signing on behalf of myself as an individual and no other person | ||||||
|  |     or entity, including my employer, has or will have rights with respect to my | ||||||
|  |     contributions. | ||||||
|  | 
 | ||||||
|  |     * [ x] I am signing on behalf of my employer or a legal entity and I have the | ||||||
|  |     actual authority to contractually bind that entity. | ||||||
|  | 
 | ||||||
|  | ## Contributor Details | ||||||
|  | 
 | ||||||
|  | | Field                          | Entry                | | ||||||
|  | |------------------------------- | -------------------- | | ||||||
|  | | Name                           | Loghi Perinpanayagam | | ||||||
|  | | Company name (if applicable)   |                      | | ||||||
|  | | Title or role (if applicable)  |   Student            | | ||||||
|  | | Date                           |   13 Jan, 2019       | | ||||||
|  | | GitHub username                |   loghijiaha         | | ||||||
|  | | Website (optional)             |                      | | ||||||
							
								
								
									
spacy/lang/ta/__init__.py
									
								
							|  | @ -0,0 +1,24 @@ | ||||||
|  | # coding: utf8 | ||||||
|  | from __future__ import unicode_literals | ||||||
|  | 
 | ||||||
|  | # import language-specific data | ||||||
|  | from .stop_words import STOP_WORDS | ||||||
|  | from .lex_attrs import LEX_ATTRS | ||||||
|  | 
 | ||||||
|  | from ..tokenizer_exceptions import BASE_EXCEPTIONS | ||||||
|  | from ...language import Language | ||||||
|  | from ...attrs import LANG | ||||||
|  | from ...util import update_exc | ||||||
|  | 
 | ||||||
|  | # create Defaults class in the module scope (necessary for pickling!) | ||||||
|  | class TamilDefaults(Language.Defaults): | ||||||
|  |     lex_attr_getters = dict(Language.Defaults.lex_attr_getters) | ||||||
|  |     lex_attr_getters[LANG] = lambda text: 'ta' # language ISO code | ||||||
|  | 
 | ||||||
|  |     # optional: replace flags with custom functions, e.g. like_num() | ||||||
|  |     lex_attr_getters.update(LEX_ATTRS) | ||||||
|  | 
 | ||||||
|  |     # expose the imported stop words on the defaults (otherwise the import is unused) | ||||||
|  |     stop_words = STOP_WORDS | ||||||
|  | 
 | ||||||
|  | # create actual Language class | ||||||
|  | class Tamil(Language): | ||||||
|  |     lang = 'ta' # language ISO code | ||||||
|  |     Defaults = TamilDefaults # override defaults | ||||||
|  | 
 | ||||||
|  | # set default export – this allows the language class to be lazy-loaded | ||||||
|  | __all__ = ['Tamil'] | ||||||
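The `dict(Language.Defaults.lex_attr_getters)` copy in `TamilDefaults` matters because the getters live in a class-level dictionary: updating it in place would also change the base defaults. The following is a minimal, spaCy-free sketch of that pattern. `BaseDefaults`, `TamilSketchDefaults` and the string keys are hypothetical stand-ins for illustration, not spaCy's actual classes or attribute IDs.

```python
# Hypothetical stand-ins for spaCy's attribute IDs (the real ones are integers).
LANG, LIKE_NUM = "lang", "like_num"


class BaseDefaults(object):
    # Shared, class-level mapping from attribute ID to getter function.
    lex_attr_getters = {
        LANG: lambda text: "xx",
        LIKE_NUM: lambda text: text.isdigit(),
    }


class TamilSketchDefaults(BaseDefaults):
    # Copy first, so the changes below do not leak into BaseDefaults.
    lex_attr_getters = dict(BaseDefaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: "ta"
    # Toy replacement getter: digits, or one hard-coded number word.
    lex_attr_getters.update({LIKE_NUM: lambda text: text.isdigit() or text == "பத்து"})


base_lang = BaseDefaults.lex_attr_getters[LANG]("பத்து")
tamil_lang = TamilSketchDefaults.lex_attr_getters[LANG]("பத்து")
base_like_num = BaseDefaults.lex_attr_getters[LIKE_NUM]("பத்து")
tamil_like_num = TamilSketchDefaults.lex_attr_getters[LIKE_NUM]("பத்து")
```

In this sketch the base class still reports `"xx"` and does not treat the Tamil word as a number, while the subclass overrides both getters.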
							
								
								
									
spacy/lang/ta/examples.py
									
								
							|  | @ -0,0 +1,21 @@ | ||||||
|  | # coding: utf8 | ||||||
|  | from __future__ import unicode_literals | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | """ | ||||||
|  | Example sentences to test spaCy and its language models. | ||||||
|  | 
 | ||||||
|  | >>> from spacy.lang.ta.examples import sentences | ||||||
|  | >>> docs = nlp.pipe(sentences) | ||||||
|  | """ | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | sentences = [ | ||||||
|  |     "கிறிஸ்துமஸ் மற்றும் இனிய புத்தாண்டு வாழ்த்துக்கள்", | ||||||
|  |     "எனக்கு என் குழந்தைப் பருவம் நினைவிருக்கிறது", | ||||||
|  |     "உங்கள் பெயர் என்ன?", | ||||||
|  |     "ஏறத்தாழ இலங்கைத் தமிழரில் மூன்றிலொரு பங்கினர் இலங்கையை விட்டு வெளியேறிப் பிற நாடுகளில் வாழ்கின்றனர்", | ||||||
|  |     "இந்த ஃபோனுடன் சுமார் ரூ.2,990 மதிப்புள்ள போட் ராக்கர்ஸ் நிறுவனத்தின் ஸ்போர்ட் புளூடூத் ஹெட்போன்ஸ்  இலவசமாக வழங்கப்படவுள்ளது.", | ||||||
|  |     "மட்டக்களப்பில் பல இடங்களில் வீட்டுத் திட்டங்களுக்கு இன்று அடிக்கல் நாட்டல்", | ||||||
|  |     "ஐ போன்க்கு முகத்தை வைத்து அன்லாக் செய்யும் முறை மற்றும்  விரலால் தொட்டு அன்லாக் செய்யும் முறையை வாட்ஸ் ஆப் நிறுவனம் இதற்கு முன் கண்டுபிடித்தது" | ||||||
|  | ] | ||||||
							
								
								
									
spacy/lang/ta/lex_attrs.py
									
								
							|  | @ -0,0 +1,44 @@ | ||||||
|  | # coding: utf8 | ||||||
|  | from __future__ import unicode_literals | ||||||
|  | from ...attrs import LIKE_NUM | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | _numeral_suffixes = {'பத்து': 'பது', 'ற்று': 'று', 'ரத்து': 'ரம்', 'சத்து': 'சம்'} | ||||||
|  | _num_words = ['பூச்சியம்', 'ஒரு', 'ஒன்று', 'இரண்டு', 'மூன்று', 'நான்கு', 'ஐந்து', 'ஆறு', 'ஏழு', | ||||||
|  |               'எட்டு', 'ஒன்பது', 'பத்து', 'பதினொன்று', 'பன்னிரண்டு', 'பதின்மூன்று', 'பதினான்கு', | ||||||
|  |               'பதினைந்து', 'பதினாறு', 'பதினேழு', 'பதினெட்டு', 'பத்தொன்பது', 'இருபது', | ||||||
|  |               'முப்பது', 'நாற்பது', 'ஐம்பது', 'அறுபது', 'எழுபது', 'எண்பது', 'தொண்ணூறு', | ||||||
|  |               'நூறு', 'இருநூறு', 'முன்னூறு', 'நாநூறு', 'ஐநூறு', 'அறுநூறு', 'எழுநூறு', 'எண்ணூறு', 'தொள்ளாயிரம்', | ||||||
|  |               'ஆயிரம்', 'ஒராயிரம்', 'லட்சம்', 'மில்லியன்', 'கோடி', 'பில்லியன்', 'டிரில்லியன்'] | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | # 20-89, 90-899, 900-99999 and above have different suffixes | ||||||
|  | def suffix_filter(text): | ||||||
|  |     # replace a combining numeral suffix with its base form, if one matches | ||||||
|  |     for num_suffix in _numeral_suffixes.keys(): | ||||||
|  |         length = len(num_suffix) | ||||||
|  |         if (len(text) < length): | ||||||
|  |             # too short for this suffix, but a shorter suffix may still match | ||||||
|  |             continue | ||||||
|  |         elif text.endswith(num_suffix): | ||||||
|  |             return text[:-length] + _numeral_suffixes[num_suffix] | ||||||
|  |     return text | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def like_num(text): | ||||||
|  |     text = text.replace(',', '').replace('.', '') | ||||||
|  |     if text.isdigit(): | ||||||
|  |         return True | ||||||
|  |     if text.count('/') == 1: | ||||||
|  |         num, denom = text.split('/') | ||||||
|  |         if num.isdigit() and denom.isdigit(): | ||||||
|  |             return True | ||||||
|  |     if text.lower() in _num_words: | ||||||
|  |         return True | ||||||
|  |     elif suffix_filter(text) in _num_words: | ||||||
|  |         return True | ||||||
|  | 
 | ||||||
|  |     return False | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | LEX_ATTRS = { | ||||||
|  |     LIKE_NUM: like_num | ||||||
|  | } | ||||||
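To illustrate what the suffix handling in `lex_attrs.py` is meant to achieve, here is a small standalone sketch. It copies the suffix table and a four-word subset of `_num_words` from the diff; `normalise_suffix` and `like_num_sketch` are illustrative names, not part of the commit, and the logic is a simplified approximation of `like_num` rather than the code spaCy runs.

```python
# Suffix table copied from the diff; the number words are only a small subset.
NUMERAL_SUFFIXES = {"பத்து": "பது", "ற்று": "று", "ரத்து": "ரம்", "சத்து": "சம்"}
NUM_WORDS = {"இருபது", "நூறு", "ஆயிரம்", "லட்சம்"}


def normalise_suffix(text):
    """Map a combining numeral form back to its base form, if a suffix matches."""
    for suffix, base in NUMERAL_SUFFIXES.items():
        if text.endswith(suffix):
            return text[: -len(suffix)] + base
    return text


def like_num_sketch(text):
    """Return True for digit strings, simple fractions and known number words."""
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text in NUM_WORDS or normalise_suffix(text) in NUM_WORDS


samples = ["இருபத்து", "நூற்று", "ஆயிரத்து", "2,990", "1/2", "பெயர்"]
results = {word: like_num_sketch(word) for word in samples}
```

Under these assumptions, the three combining forms reduce to base number words from the list, the digit string and the fraction are accepted directly, and the ordinary noun `பெயர்` is rejected.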
							
								
								
									
spacy/lang/ta/stop_words.py
									
								
							|  | @ -0,0 +1,133 @@ | ||||||
|  | # coding: utf8 | ||||||
|  | from __future__ import unicode_literals | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | # Stop words | ||||||
|  | 
 | ||||||
|  | STOP_WORDS = set(""" | ||||||
|  | ஒரு | ||||||
|  | என்று | ||||||
|  | மற்றும் | ||||||
|  | இந்த | ||||||
|  | இது | ||||||
|  | என்ற | ||||||
|  | கொண்டு | ||||||
|  | என்பது | ||||||
|  | பல | ||||||
|  | ஆகும் | ||||||
|  | அல்லது | ||||||
|  | அவர் | ||||||
|  | நான் | ||||||
|  | உள்ள | ||||||
|  | அந்த | ||||||
|  | இவர் | ||||||
|  | என | ||||||
|  | முதல் | ||||||
|  | என்ன | ||||||
|  | இருந்து | ||||||
|  | சில | ||||||
|  | என் | ||||||
|  | போன்ற | ||||||
|  | வேண்டும் | ||||||
|  | வந்து | ||||||
|  | இதன் | ||||||
|  | அது | ||||||
|  | அவன் | ||||||
|  | தான் | ||||||
|  | பலரும் | ||||||
|  | என்னும் | ||||||
|  | மேலும் | ||||||
|  | பின்னர் | ||||||
|  | கொண்ட | ||||||
|  | இருக்கும் | ||||||
|  | தனது | ||||||
|  | உள்ளது | ||||||
|  | போது | ||||||
|  | என்றும் | ||||||
|  | அதன் | ||||||
|  | தன் | ||||||
|  | பிறகு | ||||||
|  | அவர்கள் | ||||||
|  | வரை | ||||||
|  | அவள் | ||||||
|  | நீ | ||||||
|  | ஆகிய | ||||||
|  | இருந்தது | ||||||
|  | உள்ளன | ||||||
|  | வந்த | ||||||
|  | இருந்த | ||||||
|  | மிகவும் | ||||||
|  | இங்கு | ||||||
|  | மீது | ||||||
|  | ஓர் | ||||||
|  | இவை | ||||||
|  | இந்தக் | ||||||
|  | பற்றி | ||||||
|  | வரும் | ||||||
|  | வேறு | ||||||
|  | இரு | ||||||
|  | இதில் | ||||||
|  | போல் | ||||||
|  | இப்போது | ||||||
|  | அவரது | ||||||
|  | மட்டும் | ||||||
|  | இந்தப் | ||||||
|  | எனும் | ||||||
|  | மேல் | ||||||
|  | பின் | ||||||
|  | சேர்ந்த | ||||||
|  | ஆகியோர் | ||||||
|  | எனக்கு | ||||||
|  | இன்னும் | ||||||
|  | அந்தப் | ||||||
|  | அன்று | ||||||
|  | ஒரே | ||||||
|  | மிக | ||||||
|  | அங்கு | ||||||
|  | பல்வேறு | ||||||
|  | விட்டு | ||||||
|  | பெரும் | ||||||
|  | அதை | ||||||
|  | பற்றிய | ||||||
|  | உன் | ||||||
|  | அதிக | ||||||
|  | அந்தக் | ||||||
|  | பேர் | ||||||
|  | இதனால் | ||||||
|  | அவை | ||||||
|  | அதே | ||||||
|  | ஏன் | ||||||
|  | முறை | ||||||
|  | யார் | ||||||
|  | என்பதை | ||||||
|  | எல்லாம் | ||||||
|  | மட்டுமே | ||||||
|  | இங்கே | ||||||
|  | அங்கே | ||||||
|  | இடம் | ||||||
|  | இடத்தில் | ||||||
|  | அதில் | ||||||
|  | நாம் | ||||||
|  | அதற்கு | ||||||
|  | எனவே | ||||||
|  | பிற | ||||||
|  | சிறு | ||||||
|  | மற்ற | ||||||
|  | விட | ||||||
|  | எந்த | ||||||
|  | எனவும் | ||||||
|  | எனப்படும் | ||||||
|  | எனினும் | ||||||
|  | அடுத்த | ||||||
|  | இதனை | ||||||
|  | இதை | ||||||
|  | கொள்ள | ||||||
|  | இந்தத் | ||||||
|  | இதற்கு | ||||||
|  | அதனால் | ||||||
|  | தவிர | ||||||
|  | போல | ||||||
|  | வரையில் | ||||||
|  | சற்று | ||||||
|  | எனக் | ||||||
|  | """.split()) | ||||||
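The stop list above is built by splitting a triple-quoted string on whitespace and wrapping the result in a `set`, which gives constant-time membership checks. A minimal sketch of how such a set could be used to filter one of the example sentences follows. `STOP_WORDS_SAMPLE` is a four-word excerpt, and the plain `str.split()` tokenisation is an assumption made for illustration, not spaCy's tokenizer.

```python
# Four entries excerpted from the stop list in the diff, built the same way.
STOP_WORDS_SAMPLE = set("""
ஒரு
என்று
மற்றும்
இந்த
""".split())

# First example sentence from examples.py, split on whitespace only.
sentence = "கிறிஸ்துமஸ் மற்றும் இனிய புத்தாண்டு வாழ்த்துக்கள்"
tokens = sentence.split()

# Keep only the tokens that are not in the stop list.
content_words = [token for token in tokens if token not in STOP_WORDS_SAMPLE]
```

Here the conjunction `மற்றும்` is dropped and the remaining four words are kept.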