spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-03-12 15:25:47 +03:00

Author	SHA1	Message	Date
AJ Rader	2f3648700c	Correction of default lemmatizer lookup in English (Issue #4104) (#4110) * pytest file for issue4104 established * edited default lookup english lemmatizer for spun; fixes issue 4102 * eliminated parameterization and sorted dictionary dependency in issue 4104 test * added contributor agreement	2019-08-15 11:39:10 +02:00
黎谢鹏	250a54414b	update lang/zh (#4103) * update lang/zh * update lang/zh	2019-08-12 10:37:48 +02:00
Pavle Vidanović	e1a935d71c	Stopwords for Serbian language. (#4078) * Serbian stopwords added. (cyrillic alphabet) * spaCy Contribution agreement included. * Test initialize updated	2019-08-05 10:22:27 +02:00
veer-bains	874bd8c8dd	Fixed syntax error in lang/ko when using python 2 (#4082) (closes #4068) * fixed syntax error in declaring variables with python 2.7 in spacy/lang/ko/__init__.py * fixed syntax error in declaring variables with python 2.7 in spacy/lang/ko/__init__.py * Update __init__.py * Create veer-bains.md * Update __init__.py fixed syntax errors in variable datatype assignment when calling spacy.blank("ko") with python 2.7	2019-08-05 10:19:32 +02:00
Muhammad Irfan	d1d30b0442	added missing punctuation following conventions. (#4066)	2019-08-04 13:41:18 +02:00
Bae Yong-Ju	05fbf5d976	Fix error when Korean text contains regexp special characters. (#4022)	2019-07-25 17:53:33 +02:00
Paul O'Leary McCann	c8949ce88a	Remove old comment (#4012) Norwegian used to borrow from French but that doesn't appear to have been true for a while now, so the comment that was here is no longer relevant.	2019-07-23 23:10:06 +02:00
BreakBB	3e370cf2ba	Add 'Prof.' to English tokenizer_exceptions	2019-07-19 10:00:45 +02:00
Søren Lind Kristiansen	26aee70d95	Make Danish tokenizer split on forward slash	2019-07-12 15:20:42 +02:00
Ines Montani	197cfd7ebc	Merge branch 'master' into pr/3948	2019-07-11 12:18:31 +02:00
Ines Montani	0b8406a05c	Tidy up and auto-format	2019-07-11 12:02:25 +02:00
yash	815f8d13dd	Fix default punctuation rules for hindi text (#3625 explosion)	2019-07-11 15:00:51 +05:30
cedar101	58f06e6180	Korean support (#3901) * start lang/ko * add test codes * using natto-py * add test_ko_tokenizer_full_tags() * spaCy contributor agreement * external dependency for ko * collections.namedtuple for python version < 3.5 * case fix * tuple unpacking * add jongseong(final consonant) * apply mecab option * Remove Pipfile for now Co-authored-by: Ines Montani <ines@ines.io>	2019-07-09 22:23:16 +02:00
Knut O. Hellan	a54f0cfc2b	Norwegian tweaks (#3894) * Norwegian fix Add support for alternative past tense verb form (vaska). * Norwegian months Add all Norwegian months to tokenizer exceptions. * More Norwegian abbreviations Add more Norwegian abbreviations to tokenizer_exceptions. * Contributor agreement khellan Add signed contributor agreement for khellan (Knut O. Hellan).	2019-07-08 10:28:47 +02:00
Rokas Ramanauskas	61ce126d4c	Lithuanian language support (#3895) * initial LT lang support * Added more stopwords. Started setting up some basic test environment (not complete) * Initial morph rules for LT lang * Closes #1 Adds tokenizer exceptions for Lithuanian * Closes #5 Punctuation rules. Closes #6 Lexical Attributes * test: add native examples to basic tests * feat: add tag map for lt lang * fix: remove undefined tag attribute 'Definite' * feat: add lemmatizer for lt lang * refactor: add new instances to lt lang morph rules; use tags from tag map * refactor: add morph rules to lt lang defaults * refactor: only keep nouns, verbs, adverbs and adjectives in lt lang lemmatizer lookup * refactor: add capitalized words to lt lang lemmatizer * refactor: add more num words to lt lang lex attrs * refactor: update lt lang stop word set * refactor: add new instances to lt lang tokenizer exceptions * refactor: remove comments from lt lang init file * refactor: use function instead of lambda in lt lex lang getter * refactor: remove conversion to dict in lt init when dict is already provided * chore: rename lt 'test_basic' to 'test_text' * feat: add more lt text tests * feat: add lemmatizer tests * refactor: remove unused imports, add newline to end of file * chore: add contributor agreement * chore: change 'en' to 'lt' in lt example description * fix: add missing encoding info * style: add newline to end of file * refactor: use python2 compatible syntax * style: reformat code using black	2019-07-08 10:25:22 +02:00
Ines Montani	4f1dae1c6b	Update languages and examples (see #1107)	2019-06-26 16:19:17 +02:00
Ines Montani	c833d9b314	Add "v.s." to English tokenizer exceptions (see #3868)	2019-06-20 17:48:45 +02:00
Azagh3l	5accfbb938	Update exemples.py (#3838) Added missing hyphen and accent.	2019-06-14 09:31:05 +02:00
Ines Montani	aae9034492	Tidy up [ci skip]	2019-06-12 13:38:23 +02:00
Azagh3l	eb3e4263ee	Update lex_attrs.py (#3835) Corrected typos, added french (from France) versions of some numbers.	2019-06-11 10:59:16 +02:00
Germán	86eb817b74	Overwrites default getter for like_num in Spanish by adding _num_words and like_num to lex_attrs.py (#3810) (closes #3803) * (#3803) Spanish like_num returning false for number-like token * (#3803) Spanish like_num now returning True for number-like token	2019-06-02 12:22:57 +02:00
Ujwal Narayan	ed7be3f64c	Update norm_exceptions.py (#3778) * Update norm_exceptions.py Extended the Currency set to include Franc, Indian Rupee, Bangladeshi Taka, Korean Won, Mexican Dollar, and Egyptian Pound * Fix formatting [ci skip]	2019-05-27 11:52:52 +02:00
estr4ng7d	604acb6ace	Marathi Language Support (#3767) * Adding Marathi language details and folder to it * Adding few changes and running tests * Adding few changes and running tests * Update __init__.py mh -> mr * Rename spacy/lang/mh/__init__.py to spacy/lang/mr/__init__.py * mh -> mr	2019-05-24 14:29:42 +02:00
Ujwal Narayan	4d550a3055	Enhancing Kannada language Resources (#3755) * Updated stop_words.py Added more stopwords * Create ujwal-narayan.md Enhancing Kannada language resources	2019-05-20 12:56:10 +02:00
Wannaphong Phatthiyaphaibun	5a14a13f64	fix thai bug (#3693) fix tokenize for pythainlp	2019-05-10 14:21:34 +02:00
Ines Montani	78cb807a9a	Auto-format [ci skip]	2019-05-06 16:58:29 +02:00
Dobita21	f95ecedd83	Add Thai lex_attrs (#3655) * test spaCy commit to git fri 04052019 10:54 * change Data format from my format to master format * ทัทั้งนี้ ---> ทั้งนี้ * delete stop_word translate from Eng * Adjust formatting and readability * add Thai norm_exception * Add Dobita21 SCA * editรึ : หรือ, * Update Dobita21.md * Auto-format * Integrate norms into language defaults * add acronym and some norm exception words * add lex_attrs * Add lexical attribute getters into the language defaults * fix LEX_ATTRS Co-authored-by: Donut <dobita21@gmail.com> Co-authored-by: Ines Montani <ines@ines.io>	2019-05-01 12:03:14 +02:00
BreakBB	8952004dfc	Update French example sents and add two German stop words (#3662) * Update french example sentences * Add 'anderem' and 'ihren' to German stop words	2019-05-01 12:01:35 +02:00
Dobita21	721e1fc86c	update norm_exceptions (#3627) * test spaCy commit to git fri 04052019 10:54 * change Data format from my format to master format * ทัทั้งนี้ ---> ทั้งนี้ * delete stop_word translate from Eng * Adjust formatting and readability * add Thai norm_exception * Add Dobita21 SCA * editรึ : หรือ, * Update Dobita21.md * Auto-format * Integrate norms into language defaults * add acronym and some norm exception words	2019-04-23 12:48:03 +02:00
Dobita21	189c90743c	Add Thai norm_exceptions (#3612) * test spaCy commit to git fri 04052019 10:54 * change Data format from my format to master format * ทัทั้งนี้ ---> ทั้งนี้ * delete stop_word translate from Eng * Adjust formatting and readability * add Thai norm_exception * Add Dobita21 SCA * editรึ : หรือ, * Update Dobita21.md * Auto-format * Integrate norms into language defaults	2019-04-20 12:16:03 +02:00
Omer Celik	531c0869b2	Added Turkish Lira symbol(₺) (#3576) Added Turkish Lira symbol(₺) https://en.wikipedia.org/wiki/Turkish_lira	2019-04-11 11:32:28 +02:00
Ines Montani	145c0b7e88	Tidy up and auto-format	2019-04-09 11:40:19 +02:00
Dobita21	8bf6967eb7	Update Thai stop words (#3545) * test spaCy commit to git fri 04052019 10:54 * change Data format from my format to master format * ทัทั้งนี้ ---> ทั้งนี้ * delete stop_word translate from Eng * Adjust formatting and readability	2019-04-05 12:06:38 +02:00
jeannefukumaru	f67d881b30	fix typos in tag_map flagged by `python -m debug-data` (#3542) Co-authored-by: Ines Montani <ines@ines.io>	2019-04-05 12:06:09 +02:00
Jeanne Choo	b6c9807431	Merge remote-tracking branch 'upstream/master'	2019-04-04 14:21:50 +08:00
Jeanne Choo	80e15af76c	fixed tag_map.py merge conflict	2019-04-04 14:18:27 +08:00
jeannefukumaru	876ce01567	updated tag map with missing tags	2019-04-03 23:09:11 +08:00
Ines Montani	4faf62d515	Merge pull request #3530 from svlandeg/fix/issue_3521 Allow English stopwords with any type of apostrophe	2019-04-03 14:14:03 +02:00
Yves Peirsman	951825532c	Improved Dutch language resources and Dutch lemmatization (#3409) * Improved Dutch language resources and Dutch lemmatization * Fix conftest * Update punctuation.py * Auto-format * Format and fix tests * Remove unused test file * Re-add deleted test * removed redundant infix regex pattern for ','; note: brackets + simple hyphen remains * Cleaner lemmatization files	2019-04-03 14:13:26 +02:00
svlandeg	4ff786e113	addressed all comments by Ines	2019-04-03 13:50:33 +02:00
Kamolsit Mongkolsrisawat	dcc67f3f51	Update Thai tokenizer_exception list (#3529) * add tokenizer_exceptions word (ก-น) from https://goo.gl/JpJ2qq * update tokenizer_exceptions word list * add contributor file	2019-04-03 09:13:36 +02:00
svlandeg	673c81bbb4	unicode string for python 2.7	2019-04-02 13:52:07 +02:00
svlandeg	eca9cc5417	fixing Issue #3521 by adding all hyphen variants for each stopword	2019-04-02 13:24:59 +02:00
jeannefukumaru	6cdb7b2e04	added tag_map for indonesian (#3515) * added tag_map for indonesian * changed tag map from .py to .txt to see if tests pass * added symbols import * added utf8 encoding flag * added missing SCONJ symbol * Auto-format * Remove unused imports * Make tag map available in Indonesian defaults	2019-04-01 12:27:48 +02:00
Ines Montani	c23e234d65	Auto-format	2019-04-01 12:11:27 +02:00
Ines Montani	0a0b1087b0	Make tag map available in Indonesian defaults	2019-04-01 11:46:51 +02:00
Ines Montani	5d9212c44c	Remove unused imports	2019-04-01 11:46:25 +02:00
Ines Montani	8d6b544632	Auto-format	2019-04-01 11:45:43 +02:00
jeannefukumaru	6567f27849	added missing SCONJ symbol	2019-04-01 17:02:53 +08:00
jeannefukumaru	082a0a2232	added utf8 encoding flag	2019-04-01 16:37:11 +08:00
jeannefukumaru	a741bed7a7	added symbols import	2019-04-01 16:21:06 +08:00
jeannefukumaru	745cf0c914	changed tag map from .py to .txt to see if tests pass	2019-04-01 07:04:50 +08:00
jeannefukumaru	3cc897102f	added tag_map for indonesian	2019-04-01 00:00:08 +08:00
Duygu Altinok	5a7bc6b39d	Fix/irreg adverbs extension (#3499) * extended list of irreg adverbs * added test to exceptions * fixed typo	2019-03-28 13:23:33 +01:00
Wannaphong Phatthiyaphaibun	297a051992	Update Thai tag map (#3480) * Update Thai tag map Update Thai tag map * Create wannaphongcom.md	2019-03-25 16:53:26 +01:00
Matthew Honnibal	c66bd61e88	Fix lemmas	2019-03-21 14:22:12 +01:00
Matthew Honnibal	04395ffa49	Bring English tag_map in line with UD Treebank I wrote a small script to read the UD English training data and check that our tag map and morph rules were resulting in the best POS map. This hadn't been done for some time, and there have been various changes to the UD schema since it has been done. After these changes we should see much better agreement between our POS assignments and the UD POS tags.	2019-03-21 13:53:44 +01:00
Mehdi Hamoumi	9211f30ee3	Tiny correction in french lookup dictionary (#3427)	2019-03-19 13:00:19 +01:00
Ines Montani	278e9d2eb0	Merge branch 'master' into feature/lemmatizer	2019-03-16 13:44:22 +01:00
Ines Montani	2912ddc9a6	Don't set extension attribute in Japanese (closes #3398)	2019-03-12 13:30:33 +01:00
Ines Montani	cdd418b93e	Auto-format [ci skip]	2019-03-11 17:10:50 +01:00
Matthew Honnibal	39a4741e26	Add support for vocab.writing_system property (#3390) * Add xfail test for vocab.writing_system * Add vocab.writing_system property * Set Language.Defaults.writing_system * Set default writing system * Remove xfail on test_vocab_writing_system	2019-03-11 15:23:20 +01:00
Ines Montani	ee4f312e89	Add writing_system to ArabicDefaults (experimental)	2019-03-11 14:22:23 +01:00
Ines Montani	ef80cfde6f	Fix pickling of Japanese (closes #3191)	2019-03-11 13:34:23 +01:00
Matthew Honnibal	5d25ee52fb	Fix English tag map	2019-03-11 01:06:02 +01:00
Matthew Honnibal	7503e1e505	Improve English tag map. Re #593, #3311	2019-03-10 23:50:00 +01:00
Matthew Honnibal	78aba46530	Update feature/lemmatizer from develop	2019-03-10 02:45:33 +01:00
Ines Montani	610fb306bd	Revert hyphens	2019-03-09 12:51:53 +01:00
Ines Montani	bbabb6aaae	Escape more hyphens	2019-03-09 12:41:05 +01:00
Ines Montani	b8db219850	Auto-format	2019-03-09 12:40:58 +01:00
Ines Montani	a145bfe627	Try escaping hyphens again	2019-03-09 03:06:50 +01:00
Ines Montani	b9c71fc0f0	Fix flags	2019-03-09 02:46:04 +01:00
Ines Montani	ae09b6a6cf	Try fixing unicode inconsistencies on Python 2	2019-03-09 02:37:50 +01:00
Ines Montani	d957d7a697	Auto-format	2019-03-09 02:37:41 +01:00
Ines Montani	65402c3d02	Revert "Experiment with escaping hyphens" This reverts commit `9b42e2d5dd`.	2019-03-09 02:13:00 +01:00
Ines Montani	9b42e2d5dd	Experiment with escaping hyphens	2019-03-09 02:05:26 +01:00
Matthew Honnibal	00cfadbf63	Fix obsolete data in English tokenizer exceptions	2019-03-07 21:58:16 +01:00
Matthew Honnibal	7afe56a360	Fix morphological features in en tag_map	2019-03-07 21:57:56 +01:00
Matthew Honnibal	3a667833d1	Fix morphological features in de tag_map	2019-03-07 21:57:43 +01:00
Matthew Honnibal	e585b50458	Fix features in English tag map	2019-03-07 18:32:09 +01:00
Matthew Honnibal	3993f41cc4	Update morphology branch from develop	2019-03-07 00:14:43 +01:00
Ines Montani	6bd34e9d54	Expose Japanese stop words (closes #3346)	2019-03-06 14:21:15 +01:00
Ines Montani	85deb96278	Fix whitespace	2019-03-06 14:20:34 +01:00
Ines Montani	23f6ebf0f3	Add missing " (closes #3343)	2019-02-27 16:37:03 +01:00
Ines Montani	48a2046d1c	Remove stray print statement (closes #3342)	2019-02-27 15:35:04 +01:00
Ines Montani	07d7c0a1af	Fix whitespace	2019-02-27 15:34:21 +01:00
Ines Montani	76ce8b2662	Merge branch 'master' into develop	2019-02-25 15:54:55 +01:00
Julia Makogon	f1c3108d52	Fixing pymorphy2 dependency issue (#3329) (closes #3327) * Classes for Ukrainian; small fix in Russian. * Contributor agreement * pymorphy2 initialization split for ru and uk (#3327) * stop-words fixed * Unit-tests updated	2019-02-25 15:48:17 +01:00
Ines Montani	2982f82934	Auto-format	2019-02-24 14:09:15 +01:00
Matthew Honnibal	c5f947f194	Fix regex deprecation warnings	2019-02-21 11:56:47 +01:00
Sofie	9a478b6db8	Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293) * splitting up latin unicode interval * removing hyphen as infix for French * adding failing test for issue 1235 * test for issue #3002 which now works * partial fix for issue #2070 * keep the hyphen as infix for French (as it was) * restore french expressions with hyphen as infix (as it was) * added succeeding unit test for Issue #2656 * Fix issue #2822 with custom Italian exception * Fix issue #2926 by allowing numbers right before infix / * remove duplicate * remove xfail for Issue #2179 fixed by Matt * adjust documentation and remove reference to regex lib	2019-02-20 22:10:13 +01:00
Ines Montani	3fdcdec6a0	Merge branch 'master' into develop	2019-02-18 10:03:32 +01:00
Roshni Biswas	e09f1347fa	updates for Bengali language (#3286) * Update morph_rules.py * contributor agreement for roshni-b * created example sentences	2019-02-18 10:02:28 +01:00
Ines Montani	043e8186f3	Merge branch 'master' into develop	2019-02-17 17:51:17 +01:00
Marc Puig	51268e9f21	Typo error fixed (#3284)	2019-02-17 17:51:02 +01:00
Ines Montani	19a002bfd3	Merge branch 'master' into develop	2019-02-17 12:22:54 +01:00
Roshni Biswas	e26d923726	Update morph_rules.py (#3283)	2019-02-17 12:21:47 +01:00
Ines Montani	c31a9dabd5	💫 Add en/em dash to prefixes and suffixes (#3281) * Auto-format * Add en/em dash to prefixes and suffixes	2019-02-15 10:29:59 +01:00
Ines Montani	2e31921d0a	💫 Add base Language classes for more languages (#3276) * Add base classes for more languages * Add test for language class initialization Make sure language can be initialized – otherwise, it's difficult to catch serious errors in the test suite, because languages are lazy-loaded	2019-02-15 01:31:19 +11:00
Ines Montani	106d95b01a	Fix typo	2019-02-14 12:26:56 +01:00

552 Commits