Commit Graph

10 Commits

Author SHA1 Message Date
Ali Zarezade
42349471bc
add ٪ as punctuation 2018-01-23 18:11:33 +03:30
Ali Zarezade
2bda582135
Add Persian character and symbols
Add Persian characters and the following:
- ٪ used instead of %
- ؟ used instead of ?
- ﷼ used instead of $
- ، used instead of ,
- ؛ used instead of ;
2018-01-23 13:20:36 +03:30
Vadim Mazaev
81314f8659 Fixed tokenizer: added char classes; added first lemmatizer and
tokenizer tests
2017-11-21 22:23:59 +03:00
ines
e85e1d571b Update base punctuation 2017-10-14 14:59:23 +02:00
ines
09aed58140 Port over changes from #1333 and add comments 2017-10-14 12:52:59 +02:00
ines
5ee10379db Port over changes from #1340 2017-09-26 16:38:08 +02:00
ines
10d291f129 Port over change from #1351 2017-09-26 16:11:41 +02:00
Matthew Honnibal
cfc055734e Split % in units, for compatibility with corpus 2017-08-25 20:03:37 -05:00
ines
a8e58e04ef Add symbols class to punctuation rules to handle emoji (see #1088)
Currently doesn't work for Hungarian, because of conflicts with the
custom punctuation rules. Also doesn't take multi-character emoji like
👩🏽‍💻 into account.
2017-05-27 17:57:10 +02:00
ines
604f299cf6 Add char classes to global language data 2017-05-08 23:59:33 +02:00