spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-03-13 16:05:50 +03:00

Author	SHA1	Message	Date
mauryaland	36514b5762	Rule-based French Lemmatizer (#2818) ## Description Add a rule-based French Lemmatizer following the English one and the excellent PR for [Greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class. ### Types of change - The lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html); I used the XML version. - Add several files containing an exhaustive list of words for each part of speech - Add some lemma rules - Add POS that are not checked in the standard Lemmatizer, i.e. PRON, DET, ADV and AUX - Modify the Lemmatizer class to check in the lookup table as a last resort if POS not mentioned - Modify the lemmatize function to check in the lookup table as a last resort - Init files are updated so the model can support all the functionalities mentioned above - Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py ## Checklist - [X] I have submitted the spaCy Contributor Agreement. - [X] I ran the tests, and all new and existing tests passed. - [X] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-10-13 16:38:21 +02:00
Ines Montani	cb57b35bb8	Also include lowercase norm exceptions	2018-10-13 15:37:30 +02:00
JKhakpour	74a30d883c	Add Persian (Farsi) language support (#2797)	2018-10-13 15:31:49 +02:00
Marina Lysyuk	b76fe08308	Correcting lang/ru/examples.py (#2845) * Correct some grammatical inaccuracies in lang\ru\examples.py; filled Contributor Agreement * Correct some grammatical inaccuracies in lang\ru\examples.py * Move contributor agreement to separate file	2018-10-13 15:19:43 +02:00
Matthew Honnibal	67ddce68d8	Unskip test	2018-10-02 23:47:55 +02:00
Matthew Honnibal	4cf5ce2cc2	Revert "Remove problematic test" This reverts commit `bdebbef455`.	2018-10-02 23:47:24 +02:00
Matthew Honnibal	bdebbef455	Remove problematic test	2018-10-02 23:16:29 +02:00
Matthew Honnibal	6afc6ffe56	Skip seemingly problematic test	2018-10-02 22:33:40 +02:00
Matthew Honnibal	9e4079ddb2	Merge branch 'master' of https://github.com/explosion/spaCy	2018-10-02 19:44:43 +02:00
Matthew Honnibal	40f228c2f2	Set version to 2.0.13.dev3	2018-10-02 19:44:25 +02:00
Filipe Caixeta	6c498f9ff4	Update Portuguese Language (#2790) * Add words to Portuguese language _num_words * Add words to Portuguese language _num_words * Portuguese - Add/remove stopwords, fix tokenizer, add currency symbols * Extended punctuation and norm_exceptions in the Portuguese language	2018-09-29 09:51:45 +02:00
Matthew Honnibal	6430b1fe64	Restore encoding arg on msgpack-numpy	2018-09-27 15:58:21 +02:00
Matthew Honnibal	2ac69facc6	Fix Python 2 test failure	2018-09-27 15:34:16 +02:00
Matthew Honnibal	72778375fb	Merge branch 'master' of https://github.com/explosion/spaCy	2018-09-27 13:54:49 +02:00
Matthew Honnibal	96fe314d8d	Fix bug when too many entity types. Fixes #2800	2018-09-27 13:54:34 +02:00
Suraj Rajan	bbdc6456c6	Set up dependency tree pattern matching skeleton (#2732)	2018-09-27 13:27:18 +02:00
Matthew Honnibal	8809dc4514	Remove deprecated encoding argument to msgpack	2018-09-27 12:56:23 +02:00
Matthew Honnibal	bae6b3e2b3	Merge branch 'master' of https://github.com/explosion/spaCy	2018-09-27 12:50:31 +02:00
Ines Montani	71cdbeada7	Revert "Also include lowercase norm exceptions" This reverts commit `70f4e8adf3`.	2018-09-27 12:29:25 +02:00
darindf	8227566805	Fix error (#2802) * Fix error ValueError: cannot resize an array that references or is referenced by another array in this way. Use the resize function * Added spaCy Contributor Agreement	2018-09-26 21:31:03 +02:00
Ines Montani	5e0dfb34fa	Merge branch 'master' of https://github.com/explosion/spaCy	2018-09-26 11:13:58 +02:00
Ines Montani	70f4e8adf3	Also include lowercase norm exceptions	2018-09-25 12:22:02 +02:00
Keshan	9a016d17c2	Adding basic support for Sinhala language (#2788) * Adding Sinhala language package, stop words, examples and lex_attrs * Adding contributor agreement * Updating contributor agreement	2018-09-25 12:18:25 +02:00
Ines Montani	3c4e3ade30	Fix typo (closes #2784)	2018-09-21 10:45:11 +02:00
mauryaland	68b3c544d5	Adding French hyphenated first name (#2786)	2018-09-21 10:38:13 +02:00
Andrew Ongko	81564cc4e8	Update Indonesian model (#2752) * adding e-KTP in tokenizer exceptions list * add exception token * removing lines containing spaces, as they won't matter since we use the .split() method in the end; added new tokens in exception * add tokenizer exceptions list * combining base_norms with norm_exceptions * adding norm_exception * fix double key in lemmatizer * remove unused import on punctuation.py * reformat stop_words to reduce number of lines, improve readability * updating tokenizer exception * implement is_currency for lang/id * adding orth_first_upper in tokenizer_exceptions * update the norm_exception list * remove bunch of abbreviations * adding contributors file	2018-09-14 12:30:32 +02:00
Filipe Caixeta	fe515085f3	Add words to Portuguese language _num_words (#2759) * Add words to Portuguese language _num_words * Add words to Portuguese language _num_words	2018-09-14 12:30:16 +02:00
Grivaz	aeba99ab0d	Introduces a bulk merge function, in order to solve issue #653 (#2696) * Fix comment * Introduce bulk merge to increase performance on many span merges * Sign contributor agreement * Implement pull request suggestions	2018-09-10 16:41:42 +02:00
tyburam	476472d181	lex_attrs for Polish language (#2750) * Signed spaCy contributor agreement * Added Polish version of English lex_attrs	2018-09-10 11:53:57 +02:00
Sainath Adapa	77139bc03c	Basic support for Telugu language (#2751)	2018-09-10 11:53:18 +02:00
Maxim Kupfer	cebe50b5b8	Remove ')' for clarity (#2737) Sorry, don't mean to be nitpicky, I just noticed this when going through the CLI and thought it was a quick fix. That said, if this was intentional, then please let me know.	2018-09-10 11:31:49 +02:00
Piotr Żelasko	bdb2165bd1	Less norm computations in token similarity (#2730) * Less norm computations in token similarity * Contributor agreement	2018-09-05 21:50:23 +02:00
Aniruddha Adhikary	4530ddcc51	Update Bengali token rules for hyphen and digits (#2731)	2018-09-05 21:49:00 +02:00
Nathaniel J. Smith	26849874ad	When calling getoption() in conftest.py, pass a default option (#2709) * When calling getoption() in conftest.py, pass a default option This is necessary to allow testing an installed spacy by running: pytest --pyargs spacy * Add contributor agreement	2018-09-03 09:57:52 +02:00
Ines Montani	e9022f7b33	Remove docstrings for deprecated arguments (see #2703)	2018-08-26 14:23:13 +02:00
Ines Montani	559f4139e3	Add FAC to spacy.explain (resolves #2706)	2018-08-26 14:13:50 +02:00
Matthew Honnibal	13fa550b36	Merge branch 'master' of https://github.com/explosion/spaCy	2018-08-14 02:32:01 +02:00
Ioannis Daras	fe94e696d3	Optimize Greek language support (#2658)	2018-08-14 02:31:32 +02:00
Matthew Honnibal	85000ea13b	Increment version to 2.0.13.dev2	2018-08-10 00:43:55 +02:00
Matthew Honnibal	c4ac981e6d	Try again to filter warnings	2018-08-10 00:42:54 +02:00
Matthew Honnibal	ae7fc42a41	Increment version to v2.0.13.dev1	2018-08-10 00:14:31 +02:00
Matthew Honnibal	19f5046934	Undoing warning suppression, as it doesn't really work	2018-08-10 00:13:34 +02:00
Matthew Honnibal	3fb828352d	Set version to 2.0.13.dev0	2018-08-09 23:49:34 +02:00
Matthew Honnibal	1c0614ecd2	Catch numpy warning	2018-08-09 23:49:24 +02:00
Aashish Gangwani	6eebfc7bf4	Added numbers to ../lang/hi/lex_attrs.py (#2629) I have added numbers in the Hindi lex_attrs.py file according to the Indian numbering system (https://en.wikipedia.org/wiki/Indian_numbering_system), and here are their English translations: 'शून्य' => zero, 'एक' => one, 'दो' => two, 'तीन' => three, 'चार' => four, 'पांच' => five, 'छह' => six, 'सात' => seven, 'आठ' => eight, 'नौ' => nine, 'दस' => ten, 'ग्यारह' => eleven, 'बारह' => twelve, 'तेरह' => thirteen, 'चौदह' => fourteen, 'पंद्रह' => fifteen, 'सोलह' => sixteen, 'सत्रह' => seventeen, 'अठारह' => eighteen, 'उन्नीस' => nineteen, 'बीस' => twenty, 'तीस' => thirty, 'चालीस' => forty, 'पचास' => fifty, 'साठ' => sixty, 'सत्तर' => seventy, 'अस्सी' => eighty, 'नब्बे' => ninety, 'सौ' => hundred, 'हज़ार' => thousand, 'लाख' => hundred thousand, 'करोड़' => ten million, 'अरब' => billion, 'खरब' => hundred billion ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-08-08 16:06:11 +02:00
Emil Stenström	3834f4146d	Add abbreviations from UD_Swedish-Talbanken (#2613) * Add abbreviations from UD_Swedish-Talbanken * Add contributor agreement.	2018-08-07 13:53:17 +02:00
Ole Henrik Skogstrøm	0473add369	Feature/span ents (#2599) * Created Span.ents property * Add tests for span.ents * Add tests for start and end of sentence	2018-08-07 13:52:32 +02:00
Xiaoquan Kong	87fa847e6e	Fix Chinese language related bugs (#2634)	2018-08-07 11:26:31 +02:00
Xiaoquan Kong	f0c9652ed1	New Feature: display more detail when Error E067 (#2639) * Fix off-by-one error * Add verbose option * Update verbose option * Update documents for verbose option	2018-08-07 10:45:29 +02:00
Emil Stenström	1914c488d3	Swedish: Exceptions for single letter words ending sentence (#2615) * Exceptions for single letter words ending sentence Sentences ending in "i." (as in "... peka i."), "m." (as in "...än 2000 m."), should be tokenized as two separate tokens. * Add test	2018-08-05 14:14:30 +02:00

4977 Commits