spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-18 08:01:58 +03:00

Author	SHA1	Message	Date
Andrew Ongko	81564cc4e8	Update Indonesian model (#2752 ) * adding e-KTP in tokenizer exceptions list * add exception token * removing lines containing spaces, as they won't matter since we use the .split() method in the end; added new tokens in exception * add tokenizer exceptions list * combining base_norms with norm_exceptions * adding norm_exception * fix double key in lemmatizer * remove unused import on punctuation.py * reformat stop_words to reduce number of lines, improve readability * updating tokenizer exception * implement is_currency for lang/id * adding orth_first_upper in tokenizer_exceptions * update the norm_exception list * remove bunch of abbreviations * adding contributors file	2018-09-14 12:30:32 +02:00
Filipe Caixeta	fe515085f3	Add words to portuguese language _num_words (#2759 ) * Add words to portuguese language _num_words * Add words to portuguese language _num_words	2018-09-14 12:30:16 +02:00
tyburam	476472d181	Lex _attrs for polish language (#2750 ) * Signed spaCy contributor agreement * Added polish version of english lex_attrs	2018-09-10 11:53:57 +02:00
Sainath Adapa	77139bc03c	Basic support for Telugu language (#2751 )	2018-09-10 11:53:18 +02:00
Aniruddha Adhikary	4530ddcc51	update bengali token rules for hyphen and digits (#2731 )	2018-09-05 21:49:00 +02:00
Ioannis Daras	fe94e696d3	Optimize Greek language support (#2658 )	2018-08-14 02:31:32 +02:00
Aashish Gangwani	6eebfc7bf4	Added numbers to ../lang/hi/lex_attrs.py (#2629 ) I have added numbers in the Hindi lex_attrs.py file according to the Indian numbering system (https://en.wikipedia.org/wiki/Indian_numbering_system) and here are their English translations: 'शून्य' => zero 'एक' => one 'दो' => two 'तीन' => three 'चार' => four 'पांच' => five 'छह' => six 'सात' => seven 'आठ' => eight 'नौ' => nine 'दस' => ten 'ग्यारह' => eleven 'बारह' => twelve 'तेरह' => thirteen 'चौदह' => fourteen 'पंद्रह' => fifteen 'सोलह' => sixteen 'सत्रह' => seventeen 'अठारह' => eighteen 'उन्नीस' => nineteen 'बीस' => twenty 'तीस' => thirty 'चालीस' => forty 'पचास' => fifty 'साठ' => sixty 'सत्तर' => seventy 'अस्सी' => eighty 'नब्बे' => ninety 'सौ' => hundred 'हज़ार' => thousand 'लाख' => hundred thousand 'करोड़' => ten million 'अरब' => billion 'खरब' => hundred billion ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-08-08 16:06:11 +02:00
Emil Stenström	3834f4146d	Add abbreviations from UD_Swedish-Talbanken (#2613 ) * Add abbreviations from UD_Swedish-Talbanken * Add contributor agreement.	2018-08-07 13:53:17 +02:00
Xiaoquan Kong	87fa847e6e	Fix Chinese language related bugs (#2634 )	2018-08-07 11:26:31 +02:00
Emil Stenström	1914c488d3	Swedish: Exceptions for single letter words ending sentence (#2615 ) * Exceptions for single letter words ending sentence Sentences ending in "i." (as in "... peka i."), "m." (as in "...än 2000 m."), should be tokenized as two separate tokens. * Add test	2018-08-05 14:14:30 +02:00
Dmitry Bruhanov	07d0cc9de7	Update examples.py (#2597 )	2018-07-25 22:20:24 +02:00
Matthew Honnibal	66983d8412	Port BenDerPan's Chinese changes to v2 (finally) (#2591 ) * add template files for Chinese * add template files for Chinese, and test directory .	2018-07-25 02:47:23 +02:00
ines	f2e3e039b7	Update French stop words (resolves #2540 )	2018-07-24 23:41:51 +02:00
Ines Montani	a43ad114c2	Fix typo [ci skip]	2018-07-24 18:45:40 +02:00
Dmitry Bruhanov	27160b1516	added some widespread written jargon & dialectisms (#2584 ) This jargon is not offensive but emotionally colored as funny due to its deviation from the norm for various reasons: imitating a dialect, deliberately wrong spelling emphasizing its low colloquial nature, obsolete form, foreign borrowing with native flections, etc. Dmitry Briukhanov, Linguist & Pythonist	2018-07-24 18:44:29 +02:00
katarkor	5ca853bee0	changed tag_map, morph_rules, lemmatizer for Norwegian (#2565 ) * changed tag_map, morph_rules, lemmatizer for Norwegian * Move unicode declaration up Hopefully fixes test failure on Python 2 * Update CONTRIBUTOR_AGREEMENT.md * Move unicode declarations Hopefully fixes test this time * Revert "Merge remote-tracking branch 'origin/patch-1'" This reverts commit `f5ccd5dd0d`, reversing changes made to `dd07e180ea`. * Update contributor agreement [ci skip]	2018-07-19 19:38:24 +02:00
Ioannis Daras	6ed18412d0	Greek language optimizations (#2558 ) * Greek language optimizations * Add encoding on files containing greek words * Add encoding on files containing greek words	2018-07-18 18:51:38 +02:00
Paul O'Leary McCann	61ef0739b8	Add Japanese stop words. (#2549 ) List created by taking the 2000 top words from a Wikipedia dump and removing everything that wasn't hiragana. Tried going through kanji words and deciding what to keep but there were too many obvious non-stopwords (東京 was in the top 500) and many other words where it wasn't clear if they should be included or not.	2018-07-17 10:12:48 +02:00
Tero K	f35980f865	Enhancement/lang fi examples (#2547 ) * Added a file with examples in finnish * added contributor agreement	2018-07-15 09:50:27 +02:00
Paul O'Leary McCann	1987f3f784	Add Japanese lemmas (#2543 ) This info was already available from Mecab, forgot to add it before.	2018-07-13 10:55:14 +02:00
Eleni170	6042723535	Add support for Greek language (#2535 ) * Add contributor agreement * Support for Greek language * Fix missing el_tokenizer	2018-07-10 13:48:38 +02:00
Stefan Schweter	3dfc7f86be	lemmatizer: correct lemma for Rang (#2537 ) ## Description This PR corrects the German lemma form for the word "Rang". Initially, the lemma form was "ringen", which is not correct, because it refers to the verb ("ringen") and not to the noun ("Rang"). ### Types of change The lemma form for "Rang" is corrected to "Rang", see also the [Duden](https://www.duden.de/rechtschreibung/Rang) entry. ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-07-10 13:11:19 +02:00
Duygu Altinok	00b9a58558	German lemmatizer additions (#2529 ) * lemma of was-> was * added new pairs issue @2486 * added article tests	2018-07-09 11:10:15 +02:00
Muhammad Irfan	f33c703066	Add Urdu Language Support (#2430 ) * added Urdu language support. * added Urdu language tests. * modified conftest.py for Urdu language support. * added spacy contributor agreement.	2018-06-22 11:14:03 +02:00
himkt	14d9007efd	fix wrong indexing (#2416 ) * fix wrong indexing * add agreement	2018-06-19 10:20:57 +02:00
Aliia E	428bae66b5	Add Tatar Language Support (#2444 ) * add Tatar lang support * add Tatar letters * add Tatar tests * sign contributor agreement * sign contributor agreement [x] * remove comments from Language class * remove all template comments	2018-06-19 10:17:53 +02:00
Nour Shalabi	a169b79092	Additions to Arabic stop words. (#2422 ) * Additions to Arabic stop words. * Create nourshalabi.md	2018-06-08 02:33:23 +02:00
Aristo Rinjuang	432ede04af	adding more words and rephrasing (#2351 ) * adding more words and rephrasing * adding a contributor * tokenizer bugs solved	2018-05-24 11:40:57 +02:00
Jani Monoses	ec62cadf4c	Updates to Romanian support (#2354 ) * Add back Romanian in conftest * Romanian lex_attr * More tokenizer exceptions for Romanian * Add tests for some Romanian tokenizer exceptions	2018-05-24 11:40:00 +02:00
Tahar Zanouda	00417794d3	Add Arabic language (#2314 ) * added support for Arabic lang * added Arabic language support * updated conftest	2018-05-15 00:27:19 +02:00
Jani Monoses	0e08e49e87	Lemmatizer ro (#2319 ) * Add Romanian lemmatizer lookup table. Adapted from http://www.lexiconista.com/datasets/lemmatization/ by replacing cedillas with commas (ș and ț). The original dataset is licensed under the Open Database License. * Fix one blatant issue in the Romanian lemmatizer * Romanian examples file * Add ro_tokenizer in conftest * Add Romanian lemmatizer test	2018-05-12 15:20:04 +02:00
Jani Monoses	42b34832e4	Update Romanian stopword list (#2316 ) * Contributor agreement for janimo * Update Romanian stopword list Include the correct spellings of all the words already in the repo that are using cedillas (ş and ţ) instead of commas (ș and ț). Add another unrelated spelling fix. See https://github.com/stopwords-iso/stopwords-ro/pull/1 and https://github.com/stopwords-iso/stopwords-ro/pull/2	2018-05-10 12:16:56 +02:00
Lucas Abbade	be7fdc59d1	Update lex_attrs.py (#2307 ) * Update lex_attrs.py Fixed spelling mistakes of some numbers (according to Brazilian Portuguese). * Update lex_attrs.py As requested, I've included the correct spelling for both Brazilian Portuguese and Portuguese Portuguese. I will advise however, that the two are separated in the future. Brazilian Portuguese is a very different language from the original one, although most of the writing is unified, the way people talk in both countries is radically different. Keeping both languages as one may lead to bigger issues in the future, especially when it comes to spell checking.	2018-05-09 20:49:31 +02:00
mauryaland	5368ba028a	Update stop_words.py for French language (#2310 ) * Add contraction forms of some common stopwords All the stopwords added contain the apostrophe " ' " or " ’ ". * Adds contributor agreement mauryaland * Update mauryaland.md	2018-05-09 12:04:38 +02:00
Paul O'Leary McCann	bd72fbf09c	Port Japanese mecab tokenizer from v1 (#2036 ) * Port Japanese mecab tokenizer from v1 This brings the Mecab-based Japanese tokenization introduced in #1246 to spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag information from Mecab is stored in a token extension. A tag map is also included. As a reminder, Mecab is required because Universal Dependencies are based on Unidic tags, and Janome doesn't support Unidic. Things to check: 1. Is this the right way to use a token extension? 2. What's the right way to implement a JapaneseTagger? The approach in #1246 relied on `tag_from_strings` which is just gone now. I guess the best thing is to just try training spaCy's default Tagger? -POLM * Add tagging/make_doc and tests	2018-05-03 18:38:26 +02:00
Robin Linderborg	1f9904ef12	fixes #2238 (#2241 ) * Remove erroneous lemma lookup år > åra in Swedish * Add contributors agreement * Add contrib agreement to correct directory * Revert change to CONTRIBUTOR_AGREEMENT	2018-04-28 14:55:22 +02:00
Robin Linderborg	d01f503b54	Remove incorrect lemma lookup gäng->gänga (#2252 ) * Remove incorrect lemma lookup gäng->gänga In modern Swedish, "gäng" is mostly associated with "gang" or "group of people". The removed lemma lookup lemmatized it to the verb "thread". * Add contrib agreement to correct directory * Revert change to CONTRIBUTOR_AGREEMENT	2018-04-28 14:54:41 +02:00
ines	686225eadd	Fix Spanish noun_chunks (resolves #2210 ) Make sure 'NP' label is added to StringStore and move noun_bounds helper into a closure to allow reusing label sets	2018-04-18 18:44:01 -04:00
Jens Dahl Møllerhøj	e5055e3cf6	Add Danish lemmatizer (#2184 ) * add danish lemmatizer * fill contributor agreement	2018-04-07 19:07:28 +02:00
Matthew Honnibal	21047bde52	Fix syntax error in italian lemmatizer	2018-04-03 23:13:22 +02:00
Viet Trung Tran	ea2af94cd9	Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer (#2155 ) * support for Vietnamese * Contributor Agreement for adding Vietnamese support on spaCy	2018-03-29 12:19:51 +02:00
ines	11c4735ccf	Fix issue in Italian lemmatizer data (resolves #2050 )	2018-03-27 23:55:22 +02:00
Ines Montani	68226109f4	Merge pull request #2142 from jimregan/polish-more-tokens more exceptions	2018-03-24 19:06:44 +01:00
Matthew Honnibal	0d3bf0d4eb	Merge branch 'master' of https://github.com/explosion/spaCy	2018-03-24 17:31:49 +01:00
dejanmarich	ccd1c04c63	Update stop_words.py Added more words	2018-03-24 17:31:24 +01:00
ines	f1446b0257	Port over Turkish changes	2018-03-24 17:31:07 +01:00
DuyguA	cd604878a4	quick typo fix	2018-03-24 17:26:35 +01:00
Jim O'Regan	efe037e8be	more exceptions	2018-03-24 00:05:27 +00:00
alldefector	f4e5904fc2	Fix Spanish noun_chunks failure caused by typo	2018-03-14 17:03:17 +01:00
Ines Montani	14e7e0f12a	Merge pull request #2000 from jimregan/polish-tag-map Polish tag map	2018-02-18 19:05:58 +01:00

324 Commits