Commit Graph

21 Commits

Author SHA1 Message Date
Matthew Honnibal
fe442cac53 Fix #717: Set correct lemma for contracted verbs 2017-03-18 16:16:10 +01:00
Matthew Honnibal
8dbff4f5f4 Wire up English lemma and morph rules. 2017-03-15 09:23:22 -05:00
ines
ce9568af84 Move English time exceptions ("1a.m." etc.) and refactor 2017-03-12 13:58:22 +01:00
ines
6b30541774 Fix formatting 2017-03-12 13:58:22 +01:00
ines
66c1f194f9 Use consistent unicode declarations 2017-03-12 13:07:28 +01:00
ines
30ce2a6793 Exclude "shed" and "Shed" from tokenizer exceptions (see #847) 2017-02-18 14:10:44 +01:00
Ines Montani
209c37bbcf Exclude "shell" and "Shell" from English tokenizer exceptions (resolves #775) 2017-01-25 13:15:02 +01:00
Ines Montani
50878ef598 Exclude "were" and "Were" from tokenizer exceptions and add regression test (resolves #744) 2017-01-16 13:10:38 +01:00
Matthew Honnibal
fba67fa342 Fix Issue #736: Times were being tokenized with incorrect string values. 2017-01-12 11:21:01 +01:00
Ines Montani
0dec90e9f7 Use global abbreviation data languages and remove duplicates 2017-01-08 20:36:00 +01:00
Ines Montani
cab39c59c5 Add missing contractions to English tokenizer exceptions
Inspired by
https://github.com/kootenpv/contractions/blob/master/contractions/__init
__.py
2017-01-05 19:59:06 +01:00
Ines Montani
a23504fe07 Move abbreviations below other exceptions 2017-01-05 19:58:07 +01:00
Ines Montani
7d2cf934b9 Generate he/she/it correctly with 's instead of 've 2017-01-05 19:57:00 +01:00
Ines Montani
bc911322b3 Move ") to emoticons (see Tweebo challenge test) 2017-01-05 18:05:38 +01:00
Ines Montani
1d237664af Add lowercase lemma to tokenizer exceptions 2017-01-03 23:02:21 +01:00
Ines Montani
84a87951eb Fix typos 2017-01-03 18:27:43 +01:00
Ines Montani
35b39f53c3 Reorganise English tokenizer exceptions (as discussed in #718)
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
2017-01-03 18:26:09 +01:00
Ines Montani
461cbb99d8 Revert "Reorganise English tokenizer exceptions (as discussed in #718)"
This reverts commit b19cfcc144.
2017-01-03 18:21:29 +01:00
Ines Montani
b19cfcc144 Reorganise English tokenizer exceptions (as discussed in #718)
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
2017-01-03 18:17:57 +01:00
Ines Montani
78e63dc7d0 Update tokenizer exceptions for English 2016-12-21 18:06:34 +01:00
Ines Montani
704c7442e0 Break language data components into their own files 2016-12-18 15:36:53 +01:00