136 Commits

Author SHA1 Message Date
ines
8ce6f96180 Don't make copies of language data components 2017-10-11 15:34:55 +02:00
ines
417d45f5d0 Add lemmatizer data as variable on language data
Don't create lookup lemmatizer within Language class and just pass in
the data so it can be set on Token creation
2017-10-11 02:24:58 +02:00
ines
0c2343d73a Tidy up language data 2017-10-11 02:22:49 +02:00
Matthew Honnibal
8143618497 Set prefix length back to 1 2017-10-10 19:32:54 +02:00
Matthew Honnibal
dce8afb9cf Set prefix length to 3 2017-10-09 21:55:55 -05:00
Ines Montani
959c46eabe Merge pull request #1365 from wannaphongcom/develop
Add Thai language for spaCy v2
2017-09-26 23:43:05 +02:00
Wannaphong Phatthiyaphaibun
3d5046c499 fix import in th 2017-09-26 22:41:20 +07:00
Wannaphong Phatthiyaphaibun
a63f790b8c fix thai tag_map 2017-09-26 22:28:57 +07:00
Wannaphong Phatthiyaphaibun
2ea27d07f4 fix tokenizer_exceptions in thai 2017-09-26 22:14:47 +07:00
Wannaphong Phatthiyaphaibun
a2bf4cc7bf fix newline in file 2017-09-26 21:49:43 +07:00
ines
bb5c631402 Implement like_num getter for French (via #1161) 2017-09-26 16:47:45 +02:00
ines
15479b3bae Add comment to like_num re: future work 2017-09-26 16:43:28 +02:00
ines
adda08fe14 Implement like_num getter for Dutch (via #1177) 2017-09-26 16:39:15 +02:00
ines
5ee10379db Port over changes from #1340 2017-09-26 16:38:08 +02:00
Wannaphong Phatthiyaphaibun
5cba67146c add thai in spacy2 2017-09-26 21:36:27 +07:00
ines
10d291f129 Port over change from #1351 2017-09-26 16:11:41 +02:00
ines
ece30c28a8 Don't split hyphenated words in German
This way, the tokenizer matches the tokenization in German treebanks
2017-09-16 20:40:15 +02:00
Ines Montani
bd3da3d6fb Port over change from #1323 and tidy up 2017-09-14 19:23:13 +02:00
Matthew Honnibal
b29e6bff46 Improve lemmatization rule for am|VBP 2017-09-04 15:18:10 +02:00
Matthew Honnibal
2e28982e28 Merge pull request #1288 from geovedi/indonesian
Indonesian language support
2017-08-26 21:31:13 +02:00
Matthew Honnibal
cfc055734e Split % in units, for compatibility with corpus 2017-08-25 20:03:37 -05:00
Jim Geovedi
58d8078971 Merge remote-tracking branch 'upstream/develop' into indonesian 2017-08-25 09:21:49 +08:00
Matthew Honnibal
bb2541ffd3 Fix PROB attr for OOV words 2017-08-23 12:11:52 +02:00
ines
a68dc891ea Port over changes from #1281 2017-08-21 23:19:18 +02:00
Jim Geovedi
f77443ab68 reworked 2017-08-20 13:43:21 +07:00
Jim Geovedi
b7d83f37c8 indonesian abbr. 2017-08-20 12:16:50 +07:00
Jim Geovedi
7193c47f0b direct lookup 2017-08-20 11:57:52 +07:00
Jim Geovedi
fdf802d505 added examples 2017-08-20 11:57:10 +07:00
Jim Geovedi
fa544e6c9a Merge remote-tracking branch 'upstream/develop' into indonesian 2017-08-20 11:49:40 +07:00
ines
1fe5e1a4d1 Add language example sentences (see #1107)
da, de, en, es, fr, he, it, nb, pl, pt, sv
2017-08-19 12:22:29 +02:00
Jim Geovedi
37f19f5ed2 added more currencies based on corpus data 2017-08-03 13:03:25 +07:00
Jim Geovedi
30fd068d42 hashtag prefix should be handled somewhere else 2017-08-03 13:03:02 +07:00
Jim Geovedi
ba07e23c87 added USD in currency rules 2017-08-02 22:42:47 +07:00
Jim Geovedi
bb08d696f9 added hashtag rule and fixed currency rules 2017-07-30 21:23:28 +07:00
Jim Geovedi
e9af79a803 added u-\d+ rules (sports team) 2017-07-30 21:23:01 +07:00
Jim Geovedi
e5adc26c72 simplified rules 2017-07-29 18:21:32 +07:00
Jim Geovedi
4d04898dea updated regexp 2017-07-29 17:44:57 +07:00
Jim Geovedi
7d96d477ea updated like_num 2017-07-29 17:44:46 +07:00
Jim Geovedi
3cca4ed798 added lex attrs rules 2017-07-29 17:22:21 +07:00
Jim Geovedi
8b814c63f1 more exceptions 2017-07-27 19:46:30 +07:00
Jim Geovedi
6c725e8dcf updated lemma 2017-07-27 19:46:21 +07:00
Jim Geovedi
547973b92a wip syntax iterators 2017-07-27 10:51:34 +07:00
Jim Geovedi
bbc75da38d enable syntax iterator and lemma lookup 2017-07-27 10:51:15 +07:00
Jim Geovedi
24a8c8bf28 added wip lemma dict 2017-07-26 21:39:54 +07:00
Jim Geovedi
63f14ba46b added hyphen-suffix rules 2017-07-26 19:28:57 +07:00
Jim Geovedi
f288964441 removed -el from suffix rules 2017-07-26 19:28:38 +07:00
Jim Geovedi
6eee7a7411 updated tokenizer exceptions 2017-07-26 19:13:47 +07:00
Jim Geovedi
edec51b1b1 update punctuation rules 2017-07-26 19:13:36 +07:00
Jim Geovedi
62443d495a enable token match 2017-07-26 19:13:14 +07:00
Jim Geovedi
c97f5ae0bb updated tokenizer exceptions 2017-07-26 19:12:52 +07:00