spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-25 03:13:41 +03:00

Author	SHA1	Message	Date
Ines Montani	043e8186f3	Merge branch 'master' into develop	2019-02-17 17:51:17 +01:00
Marc Puig	51268e9f21	Typo error fixed (#3284)	2019-02-17 17:51:02 +01:00
Ines Montani	19a002bfd3	Merge branch 'master' into develop	2019-02-17 12:22:54 +01:00
Roshni Biswas	e26d923726	Update morph_rules.py (#3283)	2019-02-17 12:21:47 +01:00
Ines Montani	c31a9dabd5	💫 Add en/em dash to prefixes and suffixes (#3281) * Auto-format * Add en/em dash to prefixes and suffixes	2019-02-15 10:29:59 +01:00
Ines Montani	2e31921d0a	💫 Add base Language classes for more languages (#3276) * Add base classes for more languages * Add test for language class initialization Make sure language can be initialize – otherwise, it's difficult to catch serious errors in the test suite, because languages are lazy-loaded	2019-02-15 01:31:19 +11:00
Ines Montani	106d95b01a	Fix typo	2019-02-14 12:26:56 +01:00
Ines Montani	11d6b874db	Update stop_words.py	2019-02-14 12:25:19 +01:00
Ines Montani	4d2438f985	Tidy up and auto-format	2019-02-13 15:29:08 +01:00
Ines Montani	2f45bd94c0	Auto-formatting	2019-02-12 18:30:11 +01:00
Ines Montani	0184a95340	Merge branch 'master' into develop	2019-02-12 18:29:24 +01:00
Akhilesh	a78db10941	add kannada support (#3264) * add kannada support * add few more stop words * add support for Kannada Language	2019-02-12 18:28:39 +01:00
Ines Montani	25602c794c	Tidy up and fix small bugs and typos	2019-02-08 14:14:49 +01:00
Ines Montani	9e652afa4b	Merge branch 'master' into develop	2019-02-08 13:28:09 +01:00
Björn Lennartsson	647f0140c7	Fixed tag map for Swedish Talbanken (#3186)	2019-02-08 14:28:59 +11:00
Stanisław Giziński	1448ad100c	Improved polish tokenizer and stop words. (#2974) * Improved stop words list * Removed some wrong stop words form list * Improved stop words list * Removed some wrong stop words form list * Improved Polish Tokenizer (#38) * Add tests for polish tokenizer * Add polish tokenizer exceptions * Don't split any words containing hyphens * Fix test case with wrong model answer * Remove commented out line of code until better solution is found * Add source srx' license * Rename exception_list.py to match spaCy conventionality * Add a brief explanation of where the exception list comes from * Add newline after reach exception * Rename COPYING.txt to LICENSE * Delete old files * Add header to the license * Agreements signed * Stanisław Giziński agreement * Krzysztof Kowalczyk - signed agreement * Mateusz Olko agreement * Add DoomCoder's contributor agreement * Improve like number checking in polish lang * like num tests added * all from SI system added * Final licence and removed splitting exceptions * Added polish stop words to LEX_ATTRA * Add encoding info to pl tokenizer exceptions	2019-02-08 14:27:21 +11:00
Ines Montani	402d133c90	Add Ukrainian unicode	2019-02-07 21:11:58 +01:00
Ines Montani	e2d93e4852	Merge branch 'master' into develop	2019-02-07 21:10:08 +01:00
Ines Montani	2499da97e8	Format	2019-02-07 21:07:02 +01:00
Julia Makogon	b41d64825a	Ukrainian language added. Small fixes in Russian (#3241) * Classes for Ukrainian; small fix in Russian. * Contributor agreement	2019-02-07 21:05:11 +01:00
Ines Montani	77efee0295	Auto-format	2019-02-07 21:00:04 +01:00
Ines Montani	5d0b60999d	Merge branch 'master' into develop	2019-02-07 20:54:07 +01:00
Sofie	9745b0d523	Improve Italian & Urdu tokenization accuracy (#3228) ## Description 1. Added the same infix rule as in French (`d'une`, `j'ai`) for Italian (`c'è`, `l'ha`), bringing F-score on `it_isdt-ud-train.txt` from 96% to 99%. Added unit test to check this behaviour. 2. Added specific Urdu punctuation character as suffix, improving F-score on `ur_udtb-ud-train.txt` from 94% to 100%. Added unit test to check this behaviour. ### Types of change Enhancement of Italian & Urdu tokenization ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-02-04 22:39:25 +01:00
Sofie	a3efa3e8d9	Improve Catalan tokenization accuracy (#3225) * small hyphen clean up for French * catalan infix similar to french	2019-02-04 20:37:19 +11:00
Ines Montani	e00680a33a	Remove unused outdated file	2019-02-01 11:39:48 +01:00
Sofie	46dfe773e1	Replacing regex library with re to increase tokenization speed (#3218) * replace unicode categories with raw list of code points * simplifying ranges * fixing variable length quotes * removing redundant regular expression * small cleanup of regexp notations * quotes and alpha as ranges instead of alterations * removed most regexp dependencies and features * exponential backtracking - unit tests * rewrote expression with pathological backtracking * disabling double hyphen tests for now * test additional variants of repeating punctuation * remove regex and redundant backslashes from load_reddit script * small typo fixes * disable double punctuation test for russian * clean up old comments * format block code * final cleanup * naming consistency * french strings as unicode for python 2 support * french regular expression case insensitive	2019-02-01 18:05:22 +11:00
Amandine Périnet	d570e75dbb	Improving the French lookup dictionnary for ambiguous words (#3185) * modifying FR lookup to remove ambiguity and adding lookup vocab to FR files * modifying FR lookup to remove ambiguity and adding lookup vocab to FR files * updating the contributor agreement for amperinet	2019-01-31 23:53:45 +01:00
Amandine Périnet	b34bc9d2e9	add small fix for French lemmatizer (#3206)	2019-01-31 23:44:10 +01:00
Loghi	5ca8e2b269	Tamil (#3194) * Tamil language support stop wors, examples and numerical attribite supports added Contributor agreement signed * Create Loghijiaha.md Added contributor agreement * Update CONTRIBUTOR_AGREEMENT.md Adjusted contributor_agreement.md * Norm exceptions added	2019-01-27 06:02:04 +01:00
foufaster	8bd85fd9d5	Fix french lemmatization (#3180)	2019-01-27 06:01:30 +01:00
Björn Lennartsson	b892b446cc	Updates to Swedish Language (#3164) * Added the same punctuation rules as danish language. * Added abbreviations and also the possibility to have capitalized abbreviations on some. Added a few specific cases too * Added test for long texts in swedish * Added morph rules, infixes and suffixes to __init__.py for swedish * Added some tests for prefixes, infixes and suffixes * Added tests for lemma * Renamed files to follow convention * [sv] Removed ambigious abbreviations * Added more tests for tokenizer exceptions * Added test for problem with punctuation in issue #2578 * Contributor agreement * Removed faulty lemmatization of 'jag' ('I') as it was lemmatized to 'jaga' ('hunt')	2019-01-16 13:45:50 +01:00
Loghi	d97661d18b	Tamil language support (#3154) Tamil language support to spaCy Description Hereby, creating new PR to add support for Tamil language in spaCy added stop words, examples and numerical attributes <--Working on other language data--> Types of change Enhancement Checklist [ x] I have submitted the spaCy Contributor Agreement. [x ] I ran the tests, and all new and existing tests passed. [ x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-01-14 15:32:30 +01:00
Amandine Périnet	ee24e2534d	French lemmatization: adding lemmas for adverbs and irregular lemmas for function words (#3131) * adding adverbs and irregular cases for empty words * adding adverbs and irregular cases for empty words * adding adverbs and irregular cases for empty words * updating contributor agreement for amperinet	2019-01-10 15:41:15 +01:00
Kirill Bulygin	7b064542f7	Making `lang/th/test_tokenizer.py` pass by creating `ThaiTokenizer` (#3078)	2019-01-10 15:40:37 +01:00
Amandine Périnet	eef11a7a2c	French lemmatization: correcting wrong lemmas in the lookup dictionnary (#3104) * modifying French lookup that contained wrong lemmas * correcting wrong line breaks on hyphen * adding contributor agreement for amperinet@ * correcting a typo	2019-01-07 14:15:19 +01:00
Matthew Honnibal	ee4d06fb1b	Prevent exceptions from setting POS but not TAG. Closes #1773	2018-12-30 13:16:05 +01:00
Kirill Bulygin	b665a32b95	Enabling `tests/lang/ru/test_lemmatizer.py`, fixing a `unicode` issue (#3084) ## Description See #3079. Here I'm merging into `develop` instead of `master`. ### Types of change Bug fix. ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-12-30 12:10:26 +01:00
Jari Bakken	e172f2478e	Add three missing tags from the `nb` tag map (#3085) * Contributors agreement for jarib * Add tags from the UD/NORNE dataset that is missing in the nb tag map. Relates to #3082.	2018-12-27 14:48:40 +01:00
Özcan Kasal	b573ebca77	trilyon forgotten (#3083) * trilyon forgotten * contributor added	2018-12-27 14:44:23 +01:00
Ines Montani	77a47b2b20	Auto-format	2018-12-18 15:02:11 +01:00
Kirill Bulygin	2fb004832f	Fix the first `nlp` call for `ja` (closes #2901) (#3065) * Fix the first `nlp` call for `ja` (closes #2901) * Add unicode declaration, formatting and use relative import	2018-12-18 15:01:06 +01:00
Kirill Bulygin	10189d9092	Fix the first `nlp` call for `ja` (closes #2901) (#3065) * Fix the first `nlp` call for `ja` (closes #2901) * Add unicode declaration, formatting and use relative import	2018-12-18 14:53:50 +01:00
Ines Montani	ae880ef912	Tidy up merge conflict leftovers	2018-12-18 13:58:30 +01:00
Ines Montani	61d09c481b	Merge branch 'master' into develop	2018-12-18 13:48:10 +01:00
Brixjohn	52f3c95004	Added alpha support for Tagalog language (#3062) I have added alpha support for the Tagalog language from the Philippines. It is the basis for the country's national language Filipino. I have heavily based the format to the EN and ES languages. I have provided several words in the lemmatizer lookup table, added stop words from a source, translated numeric words to its Tagalog counterpart, added some tokenizer exceptions, and kept the tag map the same as the English language. While the alpha language passed the preliminary testing that you provided, I think it needs more data to be useful for most cases. * Added alpha support for Tagalog language * Edited contributor template * Included SCA; Reverted templates * Fixed SCA template * Fixed changes in SCA template	2018-12-18 13:08:38 +01:00
Amandine Périnet	361554f629	Lemmatization of Adjectives - French : adding rules and vocabulary (#3045) * modifying FR lemmatisation for Adjectives * adding contributor agreement for amperinet * correcting some errors in vocabulary files	2018-12-16 18:11:07 +01:00
Sofie	c6ad557cea	French regular expressions instead of extensive exceptions list (on develop) (#3046) (resolves #2679) * merge changes of PR 3023 into develop branch instead of master * further deletions from exception list according to PR 3023	2018-12-16 18:04:55 +01:00
Ines Montani	7bbdffd36e	Remove pre-set lemma for "cause" (resolves #2165)	2018-12-14 12:51:18 +01:00
Amandine Périnet	0b44ea23bd	Lemmatization of Nouns - French : adding rules and vocabulary (#2992) * modifying FR lemmatization for nouns * modifying FR lemmatization for nouns * adding contributor agreement for amperinet * adding rules for words with inclusive parentheses wrongly tokenized * adding contributor agreement for amperinet * adding a missing comma	2018-12-06 22:42:18 +01:00
Amandine Périnet	2457318b7a	Lemmatization of Verbs - French : adding rules and vocabulary (#3006) * updating rules and vocabulary for French lemmatization of verbs * updating the file with French auxiliary verb * updating rules and vocabulary for French lemmatization of verbs * adding contributor agreement for amperinet * adding rules for words with inclusive parentheses wrongly tokenized	2018-12-06 15:49:28 +01:00

408 Commits