Matthew Honnibal
|
fba67fa342
|
Fix Issue #736: Times were being tokenized with incorrect string values.
|
2017-01-12 11:21:01 +01:00 |
|
Ines Montani
|
0dec90e9f7
|
Use global abbreviation data languages and remove duplicates
|
2017-01-08 20:36:00 +01:00 |
|
Ines Montani
|
cab39c59c5
|
Add missing contractions to English tokenizer exceptions
Inspired by
https://github.com/kootenpv/contractions/blob/master/contractions/__init__.py
|
2017-01-05 19:59:06 +01:00 |
|
Ines Montani
|
a23504fe07
|
Move abbreviations below other exceptions
|
2017-01-05 19:58:07 +01:00 |
|
Ines Montani
|
7d2cf934b9
|
Generate he/she/it correctly with 's instead of 've
|
2017-01-05 19:57:00 +01:00 |
|
Ines Montani
|
bc911322b3
|
Move ") to emoticons (see Tweebo challenge test)
|
2017-01-05 18:05:38 +01:00 |
|
Ines Montani
|
1d237664af
|
Add lowercase lemma to tokenizer exceptions
|
2017-01-03 23:02:21 +01:00 |
|
Ines Montani
|
84a87951eb
|
Fix typos
|
2017-01-03 18:27:43 +01:00 |
|
Ines Montani
|
35b39f53c3
|
Reorganise English tokenizer exceptions (as discussed in #718)
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
|
2017-01-03 18:26:09 +01:00 |
|
Ines Montani
|
461cbb99d8
|
Revert "Reorganise English tokenizer exceptions (as discussed in #718)"
This reverts commit b19cfcc144.
|
2017-01-03 18:21:29 +01:00 |
|
Ines Montani
|
b19cfcc144
|
Reorganise English tokenizer exceptions (as discussed in #718)
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
|
2017-01-03 18:17:57 +01:00 |
|
Ines Montani
|
78e63dc7d0
|
Update tokenizer exceptions for English
|
2016-12-21 18:06:34 +01:00 |
|
JM
|
70ff0639b5
|
Fixed missing vec_path declaration that was failing if 'add_vectors' was set
Added vec_path variable declaration to avoid accessing it before assignment in case 'add_vectors' is in overrides.
|
2016-12-20 18:21:05 +01:00 |
|
Matthew Honnibal
|
13a0b31279
|
Another tweak to GloVe path hackery.
|
2016-12-18 23:12:49 +01:00 |
|
Matthew Honnibal
|
2c6228565e
|
Fix vector loading re glove hack
|
2016-12-18 23:06:44 +01:00 |
|
Matthew Honnibal
|
618b50a064
|
Fix issue #684: GloVe vectors not loaded in spacy.en.English.
|
2016-12-18 22:46:31 +01:00 |
|
Matthew Honnibal
|
2ef9d53117
|
Untested fix for issue #684: GloVe vectors hack should be inserted in English, not in spacy.load.
|
2016-12-18 22:29:31 +01:00 |
|
Matthew Honnibal
|
7a98ee5e5a
|
Merge language data change
|
2016-12-18 17:03:52 +01:00 |
|
Ines Montani
|
b99d683a93
|
Fix formatting
|
2016-12-18 16:58:28 +01:00 |
|
Ines Montani
|
b11d8cd3db
|
Merge remote-tracking branch 'origin/organize-language-data' into organize-language-data
|
2016-12-18 16:57:12 +01:00 |
|
Ines Montani
|
2b2ea8ca11
|
Reorganise language data
|
2016-12-18 16:54:19 +01:00 |
|
Matthew Honnibal
|
44f4f008bd
|
Wire up lemmatizer rules for English
|
2016-12-18 15:50:09 +01:00 |
|
Ines Montani
|
1bff59a8db
|
Update English language data
|
2016-12-18 15:36:53 +01:00 |
|
Ines Montani
|
2eb163c5dd
|
Add lemma rules
|
2016-12-18 15:36:53 +01:00 |
|
Ines Montani
|
29ad8143d8
|
Add morph rules
|
2016-12-18 15:36:53 +01:00 |
|
Ines Montani
|
704c7442e0
|
Break language data components into their own files
|
2016-12-18 15:36:53 +01:00 |
|
Ines Montani
|
28326649f3
|
Fix typo
|
2016-12-18 13:30:03 +01:00 |
|
Matthew Honnibal
|
28d63ec58e
|
Restore missing '' character in tokenizer exceptions.
|
2016-12-18 05:34:51 +01:00 |
|
Ines Montani
|
a9421652c9
|
Remove duplicates in tag map
|
2016-12-17 22:44:31 +01:00 |
|
Ines Montani
|
577adad945
|
Fix formatting
|
2016-12-17 14:00:52 +01:00 |
|
Ines Montani
|
bb94e784dc
|
Fix typo
|
2016-12-17 13:59:30 +01:00 |
|
Ines Montani
|
a22322187f
|
Add missing lemmas to tokenizer exceptions (fixes #674)
|
2016-12-17 12:42:41 +01:00 |
|
Ines Montani
|
5445074cbd
|
Expand tokenizer exceptions with unicode apostrophe (fixes #685)
|
2016-12-17 12:34:08 +01:00 |
|
Ines Montani
|
e0a7b5c612
|
Fix formatting
|
2016-12-17 12:33:09 +01:00 |
|
Ines Montani
|
08162dce67
|
Move shared functions and constants to global language data
|
2016-12-17 12:32:48 +01:00 |
|
Ines Montani
|
6a60a61086
|
Move update_exc to global language data utils
|
2016-12-17 12:29:02 +01:00 |
|
Ines Montani
|
487ce1e20a
|
Add encoding declaration
|
2016-12-17 12:25:44 +01:00 |
|
Ines Montani
|
d8d50a0334
|
Add tokenizer exception for "gonna" (fixes #691)
|
2016-12-17 11:59:28 +01:00 |
|
Ines Montani
|
c69b77d8aa
|
Revert "Add exception for "gonna""
This reverts commit 280c03f67b.
|
2016-12-17 11:56:44 +01:00 |
|
Ines Montani
|
280c03f67b
|
Add exception for "gonna"
|
2016-12-17 11:54:59 +01:00 |
|
Ines Montani
|
c0c5f31950
|
Remove unused data and download script
|
2016-12-08 20:39:49 +01:00 |
|
Ines Montani
|
0c39654786
|
Remove unused import
|
2016-12-08 19:46:53 +01:00 |
|
Ines Montani
|
e47ee94761
|
Split punctuation into its own file
|
2016-12-08 19:46:43 +01:00 |
|
Ines Montani
|
311b30ab35
|
Reorganize exceptions for English and German
|
2016-12-08 13:58:32 +01:00 |
|
Ines Montani
|
877f09218b
|
Add more custom rules for abbreviations
|
2016-12-08 12:47:01 +01:00 |
|
Ines Montani
|
ec44bee321
|
Fix capitalization on morphological features
|
2016-12-08 12:00:54 +01:00 |
|
Ines Montani
|
ce979553df
|
Resolve conflict
|
2016-12-07 21:16:52 +01:00 |
|
Ines Montani
|
0d07d7fc80
|
Apply emoticon exceptions to tokenizer
|
2016-12-07 21:11:59 +01:00 |
|
Ines Montani
|
71f0f34cb3
|
Fix formatting
|
2016-12-07 21:11:29 +01:00 |
|
Ines Montani
|
1285c4ba93
|
Update English language data
|
2016-12-07 20:33:28 +01:00 |
|