97 Commits

Author SHA1 Message Date
Kit
9bc524982e Find lowercased forms of numeric words 2018-01-08 03:25:08 +01:00
ines
7e424a1804 Don't copy exception dicts if not necessary and tidy up 2017-10-31 21:05:29 +01:00
ines
8ce6f96180 Don't make copies of language data components 2017-10-11 15:34:55 +02:00
ines
417d45f5d0 Add lemmatizer data as variable on language data 2017-10-11 02:24:58 +02:00
Don't create lookup lemmatizer within Language class and just pass in the data so it can be set on Token creation
ines
0c2343d73a Tidy up language data 2017-10-11 02:22:49 +02:00
Jim Geovedi
f77443ab68 reworked 2017-08-20 13:43:21 +07:00
Jim Geovedi
b7d83f37c8 indonesian abbr. 2017-08-20 12:16:50 +07:00
Jim Geovedi
7193c47f0b direct lookup 2017-08-20 11:57:52 +07:00
Jim Geovedi
fdf802d505 added examples 2017-08-20 11:57:10 +07:00
Jim Geovedi
37f19f5ed2 added more currencies based on corpus data 2017-08-03 13:03:25 +07:00
Jim Geovedi
30fd068d42 hashtag prefix should be handled somewhere else 2017-08-03 13:03:02 +07:00
Jim Geovedi
ba07e23c87 added USD in currency rules 2017-08-02 22:42:47 +07:00
Jim Geovedi
bb08d696f9 added hashtag rule and fixed currency rules 2017-07-30 21:23:28 +07:00
Jim Geovedi
e9af79a803 added u-\d+ rules (sports team) 2017-07-30 21:23:01 +07:00
Jim Geovedi
e5adc26c72 simplified rules 2017-07-29 18:21:32 +07:00
Jim Geovedi
4d04898dea updated regexp 2017-07-29 17:44:57 +07:00
Jim Geovedi
7d96d477ea updated like_num 2017-07-29 17:44:46 +07:00
Jim Geovedi
3cca4ed798 added lex attrs rules 2017-07-29 17:22:21 +07:00
Jim Geovedi
8b814c63f1 more exceptions 2017-07-27 19:46:30 +07:00
Jim Geovedi
6c725e8dcf updated lemma 2017-07-27 19:46:21 +07:00
Jim Geovedi
547973b92a wip syntax iterators 2017-07-27 10:51:34 +07:00
Jim Geovedi
bbc75da38d enable syntax iterator and lemma lookup 2017-07-27 10:51:15 +07:00
Jim Geovedi
24a8c8bf28 added wip lemma dict 2017-07-26 21:39:54 +07:00
Jim Geovedi
63f14ba46b added hyphen-suffix rules 2017-07-26 19:28:57 +07:00
Jim Geovedi
f288964441 removed -el from suffix rules 2017-07-26 19:28:38 +07:00
Jim Geovedi
6eee7a7411 updated tokenizer exceptions 2017-07-26 19:13:47 +07:00
Jim Geovedi
edec51b1b1 update punctuation rules 2017-07-26 19:13:36 +07:00
Jim Geovedi
62443d495a enable token match 2017-07-26 19:13:14 +07:00
Jim Geovedi
c97f5ae0bb updated tokenizer exceptions 2017-07-26 19:12:52 +07:00
Jim Geovedi
73f6ac9d9b added hyhen 2017-07-24 15:56:31 +07:00
Jim Geovedi
68454c40bf added missing import 2017-07-24 14:12:34 +07:00
Jim Geovedi
eaf9cbd708 cursed of copy & paste 2017-07-24 14:11:51 +07:00
Jim Geovedi
7aad6718bc enable tokenizer exceptions 2017-07-24 14:11:10 +07:00
Jim Geovedi
ad56c9179a added tokenizer exceptions list 2017-07-24 14:10:16 +07:00
Jim Geovedi
c1f3fe99fe updated punctuation rules 2017-07-24 13:57:21 +07:00
Jim Geovedi
37fa2c8c80 punctution rules 2017-07-24 06:17:18 +07:00
Jim Geovedi
082e94ac1c added inflix rules 2017-07-24 06:17:07 +07:00
Jim Geovedi
0e590c711f added prefix & suffix rules 2017-07-23 23:46:40 +07:00
Jim Geovedi
d5fd32a572 added known currencies 2017-07-23 22:56:48 +07:00
Jim Geovedi
f6f15678fb added lex_attrs 2017-07-23 22:55:22 +07:00
Jim Geovedi
bed8162d00 added tokenizer_exceptions 2017-07-23 22:55:05 +07:00
Jim Geovedi
b80c35bc9a added norm_exceptions 2017-07-23 22:54:49 +07:00
Jim Geovedi
b5de329ea3 added norm_exceptions 2017-07-23 22:54:19 +07:00
Jim Geovedi
082e9ade46 fixed typo 2017-07-23 21:30:34 +07:00
Jim Geovedi
e2efeb186e added stopwords 2017-07-23 20:52:37 +07:00
Jim Geovedi
da98676839 use template 2017-07-23 20:51:31 +07:00
Jim Geovedi
c2b4dd7809 start working on Indonesian language 2017-07-23 20:50:56 +07:00