| Author | Commit | Message | Date |
|---|---|---|---|
| Matthew Honnibal | 839de87ca9 | Make lambda func a named function, for pickling | 2017-10-17 18:21:20 +02:00 |
| Matthew Honnibal | 9ce7d6af87 | Make lex attr functions top-level functions, to promote pickling | 2017-10-17 18:19:18 +02:00 |
| Ines Montani | aab299c8ae | Merge pull request #1429 from vishnunekkanti/develop: fix syntax error in zh | 2017-10-17 14:45:02 +02:00 |
| ines | 485c4f6df5 | Add Hungarian examples (see #1107) | 2017-10-17 02:37:45 +02:00 |
| Vishnu Kumar Nekkanti | d3c54cf39a | fixed SyntaxError while checking for jieba | 2017-10-16 18:51:33 +05:30 |
| ines | 9d6c8eaa49 | Update base norm exceptions with more unicode characters, e.g. unicode variations of punctuation used in Chinese | 2017-10-14 14:58:52 +02:00 |
| ines | 38c756fd85 | Port over changes from #1287 | 2017-10-14 13:16:21 +02:00 |
| ines | 612224c10d | Port over changes from #1157 | 2017-10-14 13:11:39 +02:00 |
| ines | a4d974d97b | Port over URL pattern changes from #1411 | 2017-10-14 12:58:07 +02:00 |
| ines | 09aed58140 | Port over changes from #1333 and add comments | 2017-10-14 12:52:59 +02:00 |
| ines | 8ce6f96180 | Don't make copies of language data components | 2017-10-11 15:34:55 +02:00 |
| ines | 417d45f5d0 | Add lemmatizer data as variable on language data. Don't create lookup lemmatizer within Language class and just pass in the data so it can be set on Token creation | 2017-10-11 02:24:58 +02:00 |
| ines | 0c2343d73a | Tidy up language data | 2017-10-11 02:22:49 +02:00 |
| Matthew Honnibal | 8143618497 | Set prefix length back to 1 | 2017-10-10 19:32:54 +02:00 |
| Matthew Honnibal | dce8afb9cf | Set prefix length to 3 | 2017-10-09 21:55:55 -05:00 |
| Ines Montani | 959c46eabe | Merge pull request #1365 from wannaphongcom/develop: Add Thai language for spaCy v2 | 2017-09-26 23:43:05 +02:00 |
| Wannaphong Phatthiyaphaibun | 3d5046c499 | fix import in th | 2017-09-26 22:41:20 +07:00 |
| Wannaphong Phatthiyaphaibun | a63f790b8c | fix thai tag_map | 2017-09-26 22:28:57 +07:00 |
| Wannaphong Phatthiyaphaibun | 2ea27d07f4 | fix tokenizer_exceptions in thai | 2017-09-26 22:14:47 +07:00 |
| Wannaphong Phatthiyaphaibun | a2bf4cc7bf | fix newline in file | 2017-09-26 21:49:43 +07:00 |
| ines | bb5c631402 | Implement like_num getter for French (via #1161) | 2017-09-26 16:47:45 +02:00 |
| ines | 15479b3bae | Add comment to like_num re: future work | 2017-09-26 16:43:28 +02:00 |
| ines | adda08fe14 | Implement like_num getter for Dutch (via #1177) | 2017-09-26 16:39:15 +02:00 |
| ines | 5ee10379db | Port over changes from #1340 | 2017-09-26 16:38:08 +02:00 |
| Wannaphong Phatthiyaphaibun | 5cba67146c | add thai in spacy2 | 2017-09-26 21:36:27 +07:00 |
| ines | 10d291f129 | Port over change from #1351 | 2017-09-26 16:11:41 +02:00 |
| ines | ece30c28a8 | Don't split hyphenated words in German. This way, the tokenizer matches the tokenization in German treebanks | 2017-09-16 20:40:15 +02:00 |
| Ines Montani | bd3da3d6fb | Port over change from #1323 and tidy up | 2017-09-14 19:23:13 +02:00 |
| Matthew Honnibal | b29e6bff46 | Improve lemmatization rule for am\|VBP | 2017-09-04 15:18:10 +02:00 |
| Matthew Honnibal | 2e28982e28 | Merge pull request #1288 from geovedi/indonesian: Indonesian language support | 2017-08-26 21:31:13 +02:00 |
| Matthew Honnibal | cfc055734e | Split % in units, for compatibility with corpus | 2017-08-25 20:03:37 -05:00 |
| Jim Geovedi | 58d8078971 | Merge remote-tracking branch 'upstream/develop' into indonesian | 2017-08-25 09:21:49 +08:00 |
| Matthew Honnibal | bb2541ffd3 | Fix PROB attr for OOV words | 2017-08-23 12:11:52 +02:00 |
| ines | a68dc891ea | Port over changes from #1281 | 2017-08-21 23:19:18 +02:00 |
| Jim Geovedi | f77443ab68 | reworked | 2017-08-20 13:43:21 +07:00 |
| Jim Geovedi | b7d83f37c8 | indonesian abbr. | 2017-08-20 12:16:50 +07:00 |
| Jim Geovedi | 7193c47f0b | direct lookup | 2017-08-20 11:57:52 +07:00 |
| Jim Geovedi | fdf802d505 | added examples | 2017-08-20 11:57:10 +07:00 |
| Jim Geovedi | fa544e6c9a | Merge remote-tracking branch 'upstream/develop' into indonesian | 2017-08-20 11:49:40 +07:00 |
| ines | 1fe5e1a4d1 | Add language example sentences (see #1107): da, de, en, es, fr, he, it, nb, pl, pt, sv | 2017-08-19 12:22:29 +02:00 |
| Jim Geovedi | 37f19f5ed2 | added more currencies based on corpus data | 2017-08-03 13:03:25 +07:00 |
| Jim Geovedi | 30fd068d42 | hashtag prefix should be handled somewhere else | 2017-08-03 13:03:02 +07:00 |
| Jim Geovedi | ba07e23c87 | added USD in currency rules | 2017-08-02 22:42:47 +07:00 |
| Jim Geovedi | bb08d696f9 | added hashtag rule and fixed currency rules | 2017-07-30 21:23:28 +07:00 |
| Jim Geovedi | e9af79a803 | added u-\d+ rules (sports team) | 2017-07-30 21:23:01 +07:00 |
| Jim Geovedi | e5adc26c72 | simplified rules | 2017-07-29 18:21:32 +07:00 |
| Jim Geovedi | 4d04898dea | updated regexp | 2017-07-29 17:44:57 +07:00 |
| Jim Geovedi | 7d96d477ea | updated like_num | 2017-07-29 17:44:46 +07:00 |
| Jim Geovedi | 3cca4ed798 | added lex attrs rules | 2017-07-29 17:22:21 +07:00 |
| Jim Geovedi | 8b814c63f1 | more exceptions | 2017-07-27 19:46:30 +07:00 |
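
Commits 839de87ca9 and 9ce7d6af87 both replace lambdas with named, top-level functions "for pickling". The log does not show the code they changed, so the following is only a minimal sketch of the general Python behaviour that probably motivates them: `pickle` serializes a function by its importable name, which a lambda does not have. The names `like_num_lambda`, `like_num` and `can_pickle` are hypothetical and are not taken from the spaCy source.

```python
import pickle

# Hypothetical attribute getter written as a lambda. Its qualified name is
# "<lambda>", so pickle cannot look it up by name when serializing.
like_num_lambda = lambda text: text.isdigit()


def like_num(text):
    """Hypothetical top-level equivalent of the lambda above."""
    return text.isdigit()


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


if __name__ == "__main__":
    print("named function picklable:", can_pickle(like_num))
    print("lambda picklable:", can_pickle(like_num_lambda))
```

A function defined at module scope pickles as a reference (module name plus function name), so the unpickling side must be able to import the same module. That is presumably why the second commit specifies top-level functions rather than merely named ones.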