8 Commits

Author SHA1 Message Date
Vadim Mazaev
81314f8659 Fixed tokenizer: added char classes; added first lemmatizer and tokenizer tests 2017-11-21 22:23:59 +03:00
ines
e85e1d571b Update base punctuation 2017-10-14 14:59:23 +02:00
ines
09aed58140 Port over changes from #1333 and add comments 2017-10-14 12:52:59 +02:00
ines
5ee10379db Port over changes from #1340 2017-09-26 16:38:08 +02:00
ines
10d291f129 Port over change from #1351 2017-09-26 16:11:41 +02:00
Matthew Honnibal
cfc055734e Split % in units, for compatibility with corpus 2017-08-25 20:03:37 -05:00
ines
a8e58e04ef Add symbols class to punctuation rules to handle emoji (see #1088) 2017-05-27 17:57:10 +02:00
Currently doesn't work for Hungarian, because of conflicts with the custom punctuation rules. Also doesn't take multi-character emoji like 👩🏽‍💻 into account.
ines
604f299cf6 Add char classes to global language data 2017-05-08 23:59:33 +02:00