Commit Graph

439 Commits

Author SHA1 Message Date
Matthew Honnibal
c66bd61e88 Fix lemmas 2019-03-21 14:22:12 +01:00
Matthew Honnibal
04395ffa49 Bring English tag_map in line with UD Treebank
I wrote a small script to read the UD English training data and check
that our tag map and morph rules were resulting in the best POS map.
This hadn't been done for some time, and there have been various changes
to the UD schema since it has been done. After these changes we should
see much better agreement between our POS assignments and the UD POS
tags.
2019-03-21 13:53:44 +01:00
Mehdi Hamoumi
9211f30ee3 Tiny correction in french lookup dictionary (#3427) 2019-03-19 13:00:19 +01:00
Ines Montani
2912ddc9a6 Don't set extension attribute in Japanese (closes #3398) 2019-03-12 13:30:33 +01:00
Ines Montani
cdd418b93e Auto-format [ci skip] 2019-03-11 17:10:50 +01:00
Matthew Honnibal
39a4741e26 Add support for vocab.writing_system property (#3390)
* Add xfail test for vocab.writing_system

* Add vocab.writing_system property

* Set Language.Defaults.writing_system

* Set default writing system

* Remove xfail on test_vocab_writing_system
2019-03-11 15:23:20 +01:00
Ines Montani
ee4f312e89 Add writing_system to ArabicDefaults (experimental) 2019-03-11 14:22:23 +01:00
Ines Montani
ef80cfde6f Fix pickling of Japanese (closes #3191) 2019-03-11 13:34:23 +01:00
Matthew Honnibal
5d25ee52fb Fix English tag map 2019-03-11 01:06:02 +01:00
Matthew Honnibal
7503e1e505 Improve English tag map. Re #593, #3311 2019-03-10 23:50:00 +01:00
Ines Montani
610fb306bd Revert hyphens 2019-03-09 12:51:53 +01:00
Ines Montani
bbabb6aaae Escape more hyphens 2019-03-09 12:41:05 +01:00
Ines Montani
b8db219850 Auto-format 2019-03-09 12:40:58 +01:00
Ines Montani
a145bfe627 Try escaping hyphens again 2019-03-09 03:06:50 +01:00
Ines Montani
b9c71fc0f0 Fix flags 2019-03-09 02:46:04 +01:00
Ines Montani
ae09b6a6cf Try fixing unicode inconsistencies on Python 2 2019-03-09 02:37:50 +01:00
Ines Montani
d957d7a697 Auto-format 2019-03-09 02:37:41 +01:00
Ines Montani
65402c3d02 Revert "Experiment with escaping hyphens"
This reverts commit 9b42e2d5dd.
2019-03-09 02:13:00 +01:00
Ines Montani
9b42e2d5dd Experiment with escaping hyphens 2019-03-09 02:05:26 +01:00
Ines Montani
6bd34e9d54 Expose Japanese stop words (closes #3346) 2019-03-06 14:21:15 +01:00
Ines Montani
85deb96278 Fix whitespace 2019-03-06 14:20:34 +01:00
Ines Montani
23f6ebf0f3 Add missing " (closes #3343) 2019-02-27 16:37:03 +01:00
Ines Montani
48a2046d1c Remove stray print statement (closes #3342) 2019-02-27 15:35:04 +01:00
Ines Montani
07d7c0a1af Fix whitespace 2019-02-27 15:34:21 +01:00
Ines Montani
76ce8b2662 Merge branch 'master' into develop 2019-02-25 15:54:55 +01:00
Julia Makogon
f1c3108d52 Fixing pymorphy2 dependency issue (#3329) (closes #3327)
* Classes for Ukrainian; small fix in Russian.

* Contributor agreement

* pymorphy2 initialization split for ru and uk (#3327)

* stop-words fixed

* Unit-tests updated
2019-02-25 15:48:17 +01:00
Ines Montani
2982f82934 Auto-format 2019-02-24 14:09:15 +01:00
Matthew Honnibal
c5f947f194 Fix regex deprecation warnings 2019-02-21 11:56:47 +01:00
Sofie
9a478b6db8 Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293)
* splitting up latin unicode interval

* removing hyphen as infix for French

* adding failing test for issue 1235

* test for issue #3002 which now works

* partial fix for issue #2070

* keep the hyphen as infix for French (as it was)

* restore french expressions with hyphen as infix (as it was)

* added succeeding unit test for Issue #2656

* Fix issue #2822 with custom Italian exception

* Fix issue #2926 by allowing numbers right before infix /

* splitting up latin unicode interval

* removing hyphen as infix for French

* adding failing test for issue 1235

* test for issue #3002 which now works

* partial fix for issue #2070

* keep the hyphen as infix for French (as it was)

* restore french expressions with hyphen as infix (as it was)

* added succeeding unit test for Issue #2656

* Fix issue #2822 with custom Italian exception

* Fix issue #2926 by allowing numbers right before infix /

* remove duplicate

* remove xfail for Issue #2179 fixed by Matt

* adjust documentation and remove reference to regex lib
2019-02-20 22:10:13 +01:00
Ines Montani
3fdcdec6a0 Merge branch 'master' into develop 2019-02-18 10:03:32 +01:00
Roshni Biswas
e09f1347fa updates for Bengali language (#3286)
* Update morph_rules.py

* contributor agreement for roshni-b

* created example sentences
2019-02-18 10:02:28 +01:00
Ines Montani
043e8186f3 Merge branch 'master' into develop 2019-02-17 17:51:17 +01:00
Marc Puig
51268e9f21 Typo error fixed (#3284) 2019-02-17 17:51:02 +01:00
Ines Montani
19a002bfd3 Merge branch 'master' into develop 2019-02-17 12:22:54 +01:00
Roshni Biswas
e26d923726 Update morph_rules.py (#3283) 2019-02-17 12:21:47 +01:00
Ines Montani
c31a9dabd5 💫 Add en/em dash to prefixes and suffixes (#3281)
* Auto-format

* Add en/em dash to prefixes and suffixes
2019-02-15 10:29:59 +01:00
Ines Montani
2e31921d0a 💫 Add base Language classes for more languages (#3276)
* Add base classes for more languages

* Add test for language class initialization

Make sure language can be initialize – otherwise, it's difficult to catch serious errors in the test suite, because languages are lazy-loaded
2019-02-15 01:31:19 +11:00
Ines Montani
106d95b01a Fix typo 2019-02-14 12:26:56 +01:00
Ines Montani
11d6b874db
Update stop_words.py 2019-02-14 12:25:19 +01:00
Ines Montani
4d2438f985 Tidy up and auto-format 2019-02-13 15:29:08 +01:00
Ines Montani
2f45bd94c0 Auto-formatting 2019-02-12 18:30:11 +01:00
Ines Montani
0184a95340 Merge branch 'master' into develop 2019-02-12 18:29:24 +01:00
Akhilesh
a78db10941 add kannada support (#3264)
* add kannada support

* add few more stop words

* add support for Kannada Language
2019-02-12 18:28:39 +01:00
Ines Montani
25602c794c Tidy up and fix small bugs and typos 2019-02-08 14:14:49 +01:00
Ines Montani
9e652afa4b Merge branch 'master' into develop 2019-02-08 13:28:09 +01:00
Björn Lennartsson
647f0140c7 Fixed tag map for Swedish Talbanken (#3186) 2019-02-08 14:28:59 +11:00
Stanisław Giziński
1448ad100c Improved polish tokenizer and stop words. (#2974)
* Improved stop words list

* Removed some wrong stop words form list

* Improved stop words list

* Removed some wrong stop words form list

* Improved Polish Tokenizer (#38)

* Add tests for polish tokenizer

* Add polish tokenizer exceptions

* Don't split any words containing hyphens

* Fix test case with wrong model answer

* Remove commented out line of code until better solution is found

* Add source srx' license

* Rename exception_list.py to match spaCy conventionality

* Add a brief explanation of where the exception list comes from

* Add newline after reach exception

* Rename COPYING.txt to LICENSE

* Delete old files

* Add header to the license

* Agreements signed

* Stanisław Giziński agreement

* Krzysztof Kowalczyk - signed agreement

* Mateusz Olko agreement

* Add DoomCoder's contributor agreement

* Improve like number checking in polish lang


* like num tests added

* all from SI system added

* Final licence and removed splitting exceptions

* Added polish stop words to LEX_ATTRA

* Add encoding info to pl tokenizer exceptions
2019-02-08 14:27:21 +11:00
Ines Montani
402d133c90 Add Ukrainian unicode 2019-02-07 21:11:58 +01:00
Ines Montani
e2d93e4852 Merge branch 'master' into develop 2019-02-07 21:10:08 +01:00
Ines Montani
2499da97e8 Format 2019-02-07 21:07:02 +01:00