ines
|
15479b3bae
|
Add comment to like_num re: future work
|
2017-09-26 16:43:28 +02:00 |
|
ines
|
adda08fe14
|
Implement like_num getter for Dutch (via #1177)
|
2017-09-26 16:39:15 +02:00 |
|
ines
|
5ee10379db
|
Port over changes from #1340
|
2017-09-26 16:38:08 +02:00 |
|
Wannaphong Phatthiyaphaibun
|
5cba67146c
|
add thai in spacy2
|
2017-09-26 21:36:27 +07:00 |
|
ines
|
10d291f129
|
Port over change from #1351
|
2017-09-26 16:11:41 +02:00 |
|
ines
|
ece30c28a8
|
Don't split hyphenated words in German
This way, the tokenizer matches the tokenization in German treebanks
|
2017-09-16 20:40:15 +02:00 |
|
Ines Montani
|
bd3da3d6fb
|
Port over change from #1323 and tidy up
|
2017-09-14 19:23:13 +02:00 |
|
Jim O'Regan
|
9dfd301962
|
rearrange
|
2017-09-11 10:14:18 +01:00 |
|
Jim O'Regan
|
1ee75ae337
|
Merge remote-tracking branch 'origin/develop' into develop-irish
|
2017-09-11 08:40:11 +01:00 |
|
Matthew Honnibal
|
b29e6bff46
|
Improve lemmatization rule for am|VBP
|
2017-09-04 15:18:10 +02:00 |
|
Matthew Honnibal
|
2e28982e28
|
Merge pull request #1288 from geovedi/indonesian
Indonesian language support
|
2017-08-26 21:31:13 +02:00 |
|
Matthew Honnibal
|
cfc055734e
|
Split % in units, for compatibility with corpus
|
2017-08-25 20:03:37 -05:00 |
|
Jim Geovedi
|
58d8078971
|
Merge remote-tracking branch 'upstream/develop' into indonesian
|
2017-08-25 09:21:49 +08:00 |
|
Matthew Honnibal
|
bb2541ffd3
|
Fix PROB attr for OOV words
|
2017-08-23 12:11:52 +02:00 |
|
ines
|
a68dc891ea
|
Port over changes from #1281
|
2017-08-21 23:19:18 +02:00 |
|
Jim Geovedi
|
f77443ab68
|
reworked
|
2017-08-20 13:43:21 +07:00 |
|
Jim Geovedi
|
b7d83f37c8
|
indonesian abbr.
|
2017-08-20 12:16:50 +07:00 |
|
Jim Geovedi
|
7193c47f0b
|
direct lookup
|
2017-08-20 11:57:52 +07:00 |
|
Jim Geovedi
|
fdf802d505
|
added examples
|
2017-08-20 11:57:10 +07:00 |
|
Jim Geovedi
|
fa544e6c9a
|
Merge remote-tracking branch 'upstream/develop' into indonesian
|
2017-08-20 11:49:40 +07:00 |
|
ines
|
1fe5e1a4d1
|
Add language example sentences (see #1107)
da, de, en, es, fr, he, it, nb, pl, pt, sv
|
2017-08-19 12:22:29 +02:00 |
|
Jim O'Regan
|
c069b4acb5
|
fix in UD submitted; map either way
|
2017-08-08 19:22:14 +01:00 |
|
Jim O'Regan
|
76c22dec4d
|
UD Irish tag mapping
|
2017-08-08 19:04:52 +01:00 |
|
Jim O'Regan
|
95921d7d4c
|
Merge branch 'develop' into develop-irish
|
2017-08-08 17:21:27 +01:00 |
|
Jim Geovedi
|
37f19f5ed2
|
added more currencies based on corpus data
|
2017-08-03 13:03:25 +07:00 |
|
Jim Geovedi
|
30fd068d42
|
hashtag prefix should be handled somewhere else
|
2017-08-03 13:03:02 +07:00 |
|
Jim Geovedi
|
ba07e23c87
|
added USD in currency rules
|
2017-08-02 22:42:47 +07:00 |
|
Jim Geovedi
|
bb08d696f9
|
added hashtag rule and fixed currency rules
|
2017-07-30 21:23:28 +07:00 |
|
Jim Geovedi
|
e9af79a803
|
added u-\d+ rules (sports team)
|
2017-07-30 21:23:01 +07:00 |
|
Jim Geovedi
|
e5adc26c72
|
simplified rules
|
2017-07-29 18:21:32 +07:00 |
|
Jim Geovedi
|
4d04898dea
|
updated regexp
|
2017-07-29 17:44:57 +07:00 |
|
Jim Geovedi
|
7d96d477ea
|
updated like_num
|
2017-07-29 17:44:46 +07:00 |
|
Jim Geovedi
|
3cca4ed798
|
added lex attrs rules
|
2017-07-29 17:22:21 +07:00 |
|
Jim Geovedi
|
8b814c63f1
|
more exceptions
|
2017-07-27 19:46:30 +07:00 |
|
Jim Geovedi
|
6c725e8dcf
|
updated lemma
|
2017-07-27 19:46:21 +07:00 |
|
Jim Geovedi
|
547973b92a
|
wip syntax iterators
|
2017-07-27 10:51:34 +07:00 |
|
Jim Geovedi
|
bbc75da38d
|
enable syntax iterator and lemma lookup
|
2017-07-27 10:51:15 +07:00 |
|
Jim Geovedi
|
24a8c8bf28
|
added wip lemma dict
|
2017-07-26 21:39:54 +07:00 |
|
Jim Geovedi
|
63f14ba46b
|
added hyphen-suffix rules
|
2017-07-26 19:28:57 +07:00 |
|
Jim Geovedi
|
f288964441
|
removed -el from suffix rules
|
2017-07-26 19:28:38 +07:00 |
|
Jim Geovedi
|
6eee7a7411
|
updated tokenizer exceptions
|
2017-07-26 19:13:47 +07:00 |
|
Jim Geovedi
|
edec51b1b1
|
update punctuation rules
|
2017-07-26 19:13:36 +07:00 |
|
Jim Geovedi
|
62443d495a
|
enable token match
|
2017-07-26 19:13:14 +07:00 |
|
Jim Geovedi
|
c97f5ae0bb
|
updated tokenizer exceptions
|
2017-07-26 19:12:52 +07:00 |
|
Jim Geovedi
|
73f6ac9d9b
|
added hyphen
|
2017-07-24 15:56:31 +07:00 |
|
Jim Geovedi
|
68454c40bf
|
added missing import
|
2017-07-24 14:12:34 +07:00 |
|
Jim Geovedi
|
eaf9cbd708
|
curse of copy & paste
|
2017-07-24 14:11:51 +07:00 |
|
Jim Geovedi
|
7aad6718bc
|
enable tokenizer exceptions
|
2017-07-24 14:11:10 +07:00 |
|
Jim Geovedi
|
ad56c9179a
|
added tokenizer exceptions list
|
2017-07-24 14:10:16 +07:00 |
|
Jim Geovedi
|
c1f3fe99fe
|
updated punctuation rules
|
2017-07-24 13:57:21 +07:00 |
|
Jim Geovedi
|
37fa2c8c80
|
punctuation rules
|
2017-07-24 06:17:18 +07:00 |
|
Jim Geovedi
|
082e94ac1c
|
added infix rules
|
2017-07-24 06:17:07 +07:00 |
|
Jim Geovedi
|
d0ec484725
|
reverted
|
2017-07-24 06:16:29 +07:00 |
|
Jim Geovedi
|
0e590c711f
|
added prefix & suffix rules
|
2017-07-23 23:46:40 +07:00 |
|
Jim Geovedi
|
ba922e30e8
|
added ampere hour unit
|
2017-07-23 23:46:18 +07:00 |
|
Jim Geovedi
|
3b17eba27b
|
added frequency units
|
2017-07-23 23:10:52 +07:00 |
|
Jim Geovedi
|
d5fd32a572
|
added known currencies
|
2017-07-23 22:56:48 +07:00 |
|
Jim Geovedi
|
f6f15678fb
|
added lex_attrs
|
2017-07-23 22:55:22 +07:00 |
|
Jim Geovedi
|
bed8162d00
|
added tokenizer_exceptions
|
2017-07-23 22:55:05 +07:00 |
|
Jim Geovedi
|
b80c35bc9a
|
added norm_exceptions
|
2017-07-23 22:54:49 +07:00 |
|
Jim Geovedi
|
b5de329ea3
|
added norm_exceptions
|
2017-07-23 22:54:19 +07:00 |
|
Jim Geovedi
|
082e9ade46
|
fixed typo
|
2017-07-23 21:30:34 +07:00 |
|
Jim Geovedi
|
e2efeb186e
|
added stopwords
|
2017-07-23 20:52:37 +07:00 |
|
Jim Geovedi
|
da98676839
|
use template
|
2017-07-23 20:51:31 +07:00 |
|
Jim Geovedi
|
c2b4dd7809
|
start working on Indonesian language
|
2017-07-23 20:50:56 +07:00 |
|
mollerhoj
|
85144835da
|
Add Tag_map for Danish
|
2017-07-03 15:52:55 +02:00 |
|
mollerhoj
|
64c732918a
|
Add Morph_rules. (TODO: Not working?)
|
2017-07-03 15:52:55 +02:00 |
|
mollerhoj
|
3b2cb107a3
|
Add like_num functionality to Danish
|
2017-07-03 15:49:51 +02:00 |
|
mollerhoj
|
e8f40ceed8
|
Add short names of months to tokenizer_exceptions
|
2017-07-03 15:49:51 +02:00 |
|
mollerhoj
|
23025d3b05
|
Clean up a couple of strange English stopwords
|
2017-07-03 15:41:59 +02:00 |
|
mollerhoj
|
dc5be7d2f3
|
Cleanup list of Danish stopwords
|
2017-07-03 15:40:58 +02:00 |
|
Ines Montani
|
c91642efd5
|
Port over changes from #1168
|
2017-07-01 11:43:54 +02:00 |
|
Jim O'Regan
|
70f4d26c10
|
bounds checks
|
2017-06-28 10:59:46 +01:00 |
|
Jim O'Regan
|
1ba38b2036
|
some helpers; the Irish part of UD only has 2500 sentences so this will need source of morphology
|
2017-06-28 00:42:00 +01:00 |
|
Jim O'Regan
|
559e03605a
|
b'
|
2017-06-27 22:42:16 +01:00 |
|
Jim Regan
|
d81ceb0cd5
|
Merge branch 'develop' into polish
|
2017-06-26 22:42:27 +01:00 |
|
Jim O'Regan
|
2f84c73585
|
a start
|
2017-06-26 22:40:04 +01:00 |
|
Jim O'Regan
|
28d7f0a672
|
reference
|
2017-06-26 22:38:28 +01:00 |
|
Jim O'Regan
|
e12defdd9c
|
missed a couple
|
2017-06-26 22:24:14 +01:00 |
|
Jim O'Regan
|
c1e4e0f3bf
|
just now discovered that you can do multiwords
|
2017-06-26 22:19:39 +01:00 |
|
Jim O'Regan
|
5e5f94c1c0
|
fix dup
|
2017-06-26 21:57:00 +01:00 |
|
Jim O'Regan
|
a8dff9133e
|
add POS
|
2017-06-26 21:53:41 +01:00 |
|
Jim O'Regan
|
e9213f54de
|
missed one
|
2017-06-26 21:29:21 +01:00 |
|
Jim O'Regan
|
1eb7cc3017
|
attempt a port from #1147
|
2017-06-26 21:24:55 +01:00 |
|
Matthew Honnibal
|
91e52543ef
|
Merge pull request #1118 from Gregory-Howard/patch-2
Update _tokenizer_exceptions_list (adding cities)
|
2017-06-20 11:16:07 +02:00 |
|
Tpt
|
7745b3ae04
|
Adds noun chunks to French syntax iterators
|
2017-06-12 15:29:58 +02:00 |
|
Grégory Howard
|
cd974b32b7
|
Update _tokenizer_exceptions_list (adding cities)
|
2017-06-09 17:58:18 +02:00 |
|
Matthew Honnibal
|
55d0621532
|
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
|
2017-06-04 15:53:25 -05:00 |
|
Matthew Honnibal
|
e28f90b672
|
Fix syntax iterators
|
2017-06-04 15:51:50 -05:00 |
|
Ines Montani
|
112c5787eb
|
Merge pull request #1101 from oroszgy/hu_tokenizer_fix
More robust Hungarian tokenizer.
|
2017-06-04 22:37:51 +02:00 |
|
ines
|
9254a3dd78
|
Import and add Spanish syntax iterators
|
2017-06-04 21:42:15 +02:00 |
|
Matthew Honnibal
|
7ca215bc26
|
Resolve lex_attr_getters conflict
|
2017-06-03 16:12:01 -05:00 |
|
ines
|
4c643d74c5
|
Add norm exceptions to other Language classes
|
2017-06-03 22:29:21 +02:00 |
|
ines
|
fa7e576c57
|
Change order of exception dicts
|
2017-06-03 21:52:06 +02:00 |
|
Matthew Honnibal
|
3f5c85d8de
|
Reorder setting of lex attrs, to avoid clobbering
|
2017-06-03 14:47:55 -05:00 |
|
Matthew Honnibal
|
aeb7520133
|
Make norm use lower-case
|
2017-06-03 14:47:38 -05:00 |
|
Matthew Honnibal
|
de3954843e
|
Populate norm exceptions with lower-case
|
2017-06-03 14:47:12 -05:00 |
|
ines
|
e47eef5e03
|
Update German tokenizer exceptions and tests
|
2017-06-03 21:07:44 +02:00 |
|
ines
|
0d6fa8b241
|
Add German norm exceptions
|
2017-06-03 20:54:18 +02:00 |
|
ines
|
5bd311c77e
|
Fix update of norm exceptions
|
2017-06-03 20:54:09 +02:00 |
|