Matthew Honnibal
dce8afb9cf
Set prefix length to 3
2017-10-09 21:55:55 -05:00
Ines Montani
959c46eabe
Merge pull request #1365 from wannaphongcom/develop
Add Thai language for spaCy v2
2017-09-26 23:43:05 +02:00
Wannaphong Phatthiyaphaibun
3d5046c499
fix import in th
2017-09-26 22:41:20 +07:00
Wannaphong Phatthiyaphaibun
a63f790b8c
fix thai tag_map
2017-09-26 22:28:57 +07:00
Wannaphong Phatthiyaphaibun
2ea27d07f4
fix tokenizer_exceptions in thai
2017-09-26 22:14:47 +07:00
Wannaphong Phatthiyaphaibun
a2bf4cc7bf
fix newline in file
2017-09-26 21:49:43 +07:00
ines
bb5c631402
Implement like_num getter for French (via #1161)
2017-09-26 16:47:45 +02:00
ines
15479b3bae
Add comment to like_num re: future work
2017-09-26 16:43:28 +02:00
ines
adda08fe14
Implement like_num getter for Dutch (via #1177)
2017-09-26 16:39:15 +02:00
ines
5ee10379db
Port over changes from #1340
2017-09-26 16:38:08 +02:00
Wannaphong Phatthiyaphaibun
5cba67146c
add thai in spacy2
2017-09-26 21:36:27 +07:00
ines
10d291f129
Port over change from #1351
2017-09-26 16:11:41 +02:00
ines
ece30c28a8
Don't split hyphenated words in German
This way, the tokenizer matches the tokenization in German treebanks
2017-09-16 20:40:15 +02:00
Ines Montani
bd3da3d6fb
Port over change from #1323 and tidy up
2017-09-14 19:23:13 +02:00
Jim O'Regan
9dfd301962
rearrange
2017-09-11 10:14:18 +01:00
Jim O'Regan
1ee75ae337
Merge remote-tracking branch 'origin/develop' into develop-irish
2017-09-11 08:40:11 +01:00
Matthew Honnibal
b29e6bff46
Improve lemmatization rule for am|VBP
2017-09-04 15:18:10 +02:00
Matthew Honnibal
2e28982e28
Merge pull request #1288 from geovedi/indonesian
Indonesian language support
2017-08-26 21:31:13 +02:00
Matthew Honnibal
cfc055734e
Split % in units, for compatibility with corpus
2017-08-25 20:03:37 -05:00
Jim Geovedi
58d8078971
Merge remote-tracking branch 'upstream/develop' into indonesian
2017-08-25 09:21:49 +08:00
Matthew Honnibal
bb2541ffd3
Fix PROB attr for OOV words
2017-08-23 12:11:52 +02:00
ines
a68dc891ea
Port over changes from #1281
2017-08-21 23:19:18 +02:00
Jim Geovedi
f77443ab68
reworked
2017-08-20 13:43:21 +07:00
Jim Geovedi
b7d83f37c8
indonesian abbr.
2017-08-20 12:16:50 +07:00
Jim Geovedi
7193c47f0b
direct lookup
2017-08-20 11:57:52 +07:00
Jim Geovedi
fdf802d505
added examples
2017-08-20 11:57:10 +07:00
Jim Geovedi
fa544e6c9a
Merge remote-tracking branch 'upstream/develop' into indonesian
2017-08-20 11:49:40 +07:00
ines
1fe5e1a4d1
Add language example sentences (see #1107)
da, de, en, es, fr, he, it, nb, pl, pt, sv
2017-08-19 12:22:29 +02:00
Jim O'Regan
c069b4acb5
fix in UD submitted; map either way
2017-08-08 19:22:14 +01:00
Jim O'Regan
76c22dec4d
UD Irish tag mapping
2017-08-08 19:04:52 +01:00
Jim O'Regan
95921d7d4c
Merge branch 'develop' into develop-irish
2017-08-08 17:21:27 +01:00
Jim Geovedi
37f19f5ed2
added more currencies based on corpus data
2017-08-03 13:03:25 +07:00
Jim Geovedi
30fd068d42
hashtag prefix should be handled somewhere else
2017-08-03 13:03:02 +07:00
Jim Geovedi
ba07e23c87
added USD in currency rules
2017-08-02 22:42:47 +07:00
Jim Geovedi
bb08d696f9
added hashtag rule and fixed currency rules
2017-07-30 21:23:28 +07:00
Jim Geovedi
e9af79a803
added u-\d+ rules (sports team)
2017-07-30 21:23:01 +07:00
Jim Geovedi
e5adc26c72
simplified rules
2017-07-29 18:21:32 +07:00
Jim Geovedi
4d04898dea
updated regexp
2017-07-29 17:44:57 +07:00
Jim Geovedi
7d96d477ea
updated like_num
2017-07-29 17:44:46 +07:00
Jim Geovedi
3cca4ed798
added lex attrs rules
2017-07-29 17:22:21 +07:00
Jim Geovedi
8b814c63f1
more exceptions
2017-07-27 19:46:30 +07:00
Jim Geovedi
6c725e8dcf
updated lemma
2017-07-27 19:46:21 +07:00
Jim Geovedi
547973b92a
wip syntax iterators
2017-07-27 10:51:34 +07:00
Jim Geovedi
bbc75da38d
enable syntax iterator and lemma lookup
2017-07-27 10:51:15 +07:00
Jim Geovedi
24a8c8bf28
added wip lemma dict
2017-07-26 21:39:54 +07:00
Jim Geovedi
63f14ba46b
added hyphen-suffix rules
2017-07-26 19:28:57 +07:00
Jim Geovedi
f288964441
removed -el from suffix rules
2017-07-26 19:28:38 +07:00
Jim Geovedi
6eee7a7411
updated tokenizer exceptions
2017-07-26 19:13:47 +07:00
Jim Geovedi
edec51b1b1
update punctuation rules
2017-07-26 19:13:36 +07:00
Jim Geovedi
62443d495a
enable token match
2017-07-26 19:13:14 +07:00
Jim Geovedi
c97f5ae0bb
updated tokenizer exceptions
2017-07-26 19:12:52 +07:00
Jim Geovedi
73f6ac9d9b
added hyhen
2017-07-24 15:56:31 +07:00
Jim Geovedi
68454c40bf
added missing import
2017-07-24 14:12:34 +07:00
Jim Geovedi
eaf9cbd708
cursed of copy & paste
2017-07-24 14:11:51 +07:00
Jim Geovedi
7aad6718bc
enable tokenizer exceptions
2017-07-24 14:11:10 +07:00
Jim Geovedi
ad56c9179a
added tokenizer exceptions list
2017-07-24 14:10:16 +07:00
Jim Geovedi
c1f3fe99fe
updated punctuation rules
2017-07-24 13:57:21 +07:00
Jim Geovedi
37fa2c8c80
punctution rules
2017-07-24 06:17:18 +07:00
Jim Geovedi
082e94ac1c
added inflix rules
2017-07-24 06:17:07 +07:00
Jim Geovedi
d0ec484725
reverted
2017-07-24 06:16:29 +07:00
Jim Geovedi
0e590c711f
added prefix & suffix rules
2017-07-23 23:46:40 +07:00
Jim Geovedi
ba922e30e8
added ampere hour unit
2017-07-23 23:46:18 +07:00
Jim Geovedi
3b17eba27b
added frequency units
2017-07-23 23:10:52 +07:00
Jim Geovedi
d5fd32a572
added known currencies
2017-07-23 22:56:48 +07:00
Jim Geovedi
f6f15678fb
added lex_attrs
2017-07-23 22:55:22 +07:00
Jim Geovedi
bed8162d00
added tokenizer_exceptions
2017-07-23 22:55:05 +07:00
Jim Geovedi
b80c35bc9a
added norm_exceptions
2017-07-23 22:54:49 +07:00
Jim Geovedi
b5de329ea3
added norm_exceptions
2017-07-23 22:54:19 +07:00
Jim Geovedi
082e9ade46
fixed typo
2017-07-23 21:30:34 +07:00
Jim Geovedi
e2efeb186e
added stopwords
2017-07-23 20:52:37 +07:00
Jim Geovedi
da98676839
use template
2017-07-23 20:51:31 +07:00
Jim Geovedi
c2b4dd7809
start working on Indonesian language
2017-07-23 20:50:56 +07:00
mollerhoj
85144835da
Add Tag_map for Danish
2017-07-03 15:52:55 +02:00
mollerhoj
64c732918a
Add Morph_rules. (TODO: Not working?)
2017-07-03 15:52:55 +02:00
mollerhoj
3b2cb107a3
Add like_num functionality to Danish
2017-07-03 15:49:51 +02:00
mollerhoj
e8f40ceed8
Add short names of months to tokenizer_exceptions
2017-07-03 15:49:51 +02:00
mollerhoj
23025d3b05
Clean up a couple of strange English stopwords
2017-07-03 15:41:59 +02:00
mollerhoj
dc5be7d2f3
Cleanup list of Danish stopwords
2017-07-03 15:40:58 +02:00
Ines Montani
c91642efd5
Port over changes from #1168
2017-07-01 11:43:54 +02:00
Jim O'Regan
70f4d26c10
bounds checks
2017-06-28 10:59:46 +01:00
Jim O'Regan
1ba38b2036
some helpers; the Irish part of UD only has 2500 sentences so this will need source of morphology
2017-06-28 00:42:00 +01:00
Jim O'Regan
559e03605a
b'
2017-06-27 22:42:16 +01:00
Jim Regan
d81ceb0cd5
Merge branch 'develop' into polish
2017-06-26 22:42:27 +01:00
Jim O'Regan
2f84c73585
a start
2017-06-26 22:40:04 +01:00
Jim O'Regan
28d7f0a672
reference
2017-06-26 22:38:28 +01:00
Jim O'Regan
e12defdd9c
missed a couple
2017-06-26 22:24:14 +01:00
Jim O'Regan
c1e4e0f3bf
just now discovered that you can do multiwords
2017-06-26 22:19:39 +01:00
Jim O'Regan
5e5f94c1c0
fix dup
2017-06-26 21:57:00 +01:00
Jim O'Regan
a8dff9133e
add POS
2017-06-26 21:53:41 +01:00
Jim O'Regan
e9213f54de
missed one
2017-06-26 21:29:21 +01:00
Jim O'Regan
1eb7cc3017
attempt a port from #1147
2017-06-26 21:24:55 +01:00
Matthew Honnibal
91e52543ef
Merge pull request #1118 from Gregory-Howard/patch-2
Update _tokenizer_exceptions_list (adding cities)
2017-06-20 11:16:07 +02:00
Tpt
7745b3ae04
Adds noun chunks to French syntax iterators
2017-06-12 15:29:58 +02:00
Grégory Howard
cd974b32b7
Update _tokenizer_exceptions_list (adding cities)
2017-06-09 17:58:18 +02:00
Matthew Honnibal
55d0621532
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-04 15:53:25 -05:00
Matthew Honnibal
e28f90b672
Fix syntax iterators
2017-06-04 15:51:50 -05:00
Ines Montani
112c5787eb
Merge pull request #1101 from oroszgy/hu_tokenizer_fix
More robust Hungarian tokenizer.
2017-06-04 22:37:51 +02:00
ines
9254a3dd78
Import and add Spanish syntax iterators
2017-06-04 21:42:15 +02:00
Matthew Honnibal
7ca215bc26
Resolve lex_attr_getters conflict
2017-06-03 16:12:01 -05:00
ines
4c643d74c5
Add norm exceptions to other Language classes
2017-06-03 22:29:21 +02:00
ines
fa7e576c57
Change order of exception dicts
2017-06-03 21:52:06 +02:00
Matthew Honnibal
3f5c85d8de
Reorder setting of lex attrs, to avoid clobbering
2017-06-03 14:47:55 -05:00
Matthew Honnibal
aeb7520133
Make norm use lower-case
2017-06-03 14:47:38 -05:00
Matthew Honnibal
de3954843e
Populate norm exceptions with lower-case
2017-06-03 14:47:12 -05:00
ines
e47eef5e03
Update German tokenizer exceptions and tests
2017-06-03 21:07:44 +02:00
ines
0d6fa8b241
Add German norm exceptions
2017-06-03 20:54:18 +02:00
ines
5bd311c77e
Fix update of norm exceptions
2017-06-03 20:54:09 +02:00
ines
746653880c
Add English norm exceptions to lex_attrs
2017-06-03 20:27:28 +02:00
ines
095eeeb12f
Update English tokenizer exceptions and add norms
2017-06-03 20:27:16 +02:00
ines
e5d426406a
Add base norm exceptions
2017-06-03 20:27:05 +02:00
ines
2f1025a94c
Port over Spanish changes from #1096
2017-06-02 19:09:58 +02:00
Gyorgy Orosz
f0c3b09242
More robust Hungarian tokenizer.
2017-05-31 22:28:40 +02:00
Gyorgy Orosz
8c0b4b850e
Fixed emoji handling for Hungarian
2017-05-30 21:34:46 +02:00
ines
84189c1cab
Add 'xx' language ID for multi-language support
Allows models to specify their language ID as 'xx'.
2017-05-28 00:58:59 +02:00
ines
33e332e67c
Remove unused export
2017-05-28 00:57:59 +02:00
ines
a8e58e04ef
Add symbols class to punctuation rules to handle emoji (see #1088)
Currently doesn't work for Hungarian, because of conflicts with the
custom punctuation rules. Also doesn't take multi-character emoji like
👩🏽‍💻 into account.
2017-05-27 17:57:10 +02:00
Matthew Honnibal
5db89053aa
Merge docstrings
2017-05-21 13:46:23 -05:00
ines
924e8506de
Move Defaults subclass to module scope (necessary for pickling)
2017-05-20 19:02:27 +02:00
Matthew Honnibal
61fe55efba
Move EnglishDefaults class out of English
2017-05-20 02:18:19 -05:00
Matthew Honnibal
8815507f8e
Move SpanishDefaults out of Language class, for pickle
2017-05-18 04:28:51 -05:00
ines
1a05078c79
Add language-specific syntax iterators to en and de
2017-05-17 12:04:03 +02:00
Matthew Honnibal
4b9d69f428
Merge branch 'v2' into develop
* Move v2 parser into nn_parser.pyx
* New TokenVectorEncoder class in pipeline.pyx
* New spacy/_ml.py module
Currently the two parsers live side-by-side, until we figure out how to
organize them.
2017-05-14 01:10:23 +02:00
ines
a4a37a783e
Remove import from non-existing module
2017-05-13 16:00:09 +02:00
ines
c13b3fa052
Add LEX_ATTRS
2017-05-12 15:37:45 +02:00
ines
bca2ea9c72
Update Portuguese lexical attributes
2017-05-12 15:37:39 +02:00
ines
2f870123bf
Fix formatting
2017-05-12 15:37:20 +02:00
ines
ca65993d59
Add basic Polish Language class
2017-05-12 09:25:37 +02:00
ines
48177c4f92
Add missing tokenizer exceptions
2017-05-12 09:25:24 +02:00
ines
bb8be3d194
Add Danish language data
2017-05-10 21:15:12 +02:00
ines
a0b00624bb
Make sure like_email returns bool
2017-05-09 11:37:29 +02:00
ines
ea60932e1b
Fix formatting
2017-05-09 11:08:14 +02:00
ines
02d0ac5cab
Remove redundant function and fix formatting
2017-05-09 11:06:04 +02:00
ines
b5ca50607e
Reorganise entity rules
2017-05-09 01:37:10 +02:00
ines
12c3d5fbba
Fix formatting
2017-05-09 01:15:28 +02:00
ines
2829a024ef
Re-add basic like_num check to global lex_attrs
2017-05-09 01:15:23 +02:00
ines
88adeee548
Add English lex_attrs overrides
2017-05-09 01:09:52 +02:00
ines
8f3fbbb147
Fix typos
2017-05-09 01:09:37 +02:00
ines
2216e5f326
Reorganise lex_attrs and add dict
2017-05-09 00:57:54 +02:00
ines
e666f14d20
Add global lex_attrs
2017-05-09 00:41:53 +02:00
ines
41972c43fe
Use consistent regex imports
2017-05-09 00:34:31 +02:00
ines
9f0fd5963f
Reorganise Hungarian punctuation rules
2017-05-09 00:01:59 +02:00
ines
fc0d793360
Reorganise Bengali punctuation rules
2017-05-09 00:01:52 +02:00
ines
e895d1afd7
Reorganise French punctuation rules
2017-05-09 00:00:54 +02:00
ines
014bda0ae3
Reorganise global punctuation rules
2017-05-09 00:00:46 +02:00
ines
a91278cb32
Rename _URL_PATTERN to URL_PATTERN
2017-05-09 00:00:00 +02:00
ines
604f299cf6
Add char classes to global language data
2017-05-08 23:59:33 +02:00
ines
f6f5d78cb9
Fix formatting
2017-05-08 23:59:17 +02:00
ines
3c0f85de8e
Remove imports in /lang/__init__.py
2017-05-08 23:58:07 +02:00
ines
614aa09582
Tidy up Bengali tokenizer exceptions
2017-05-08 22:29:49 +02:00
ines
73b577cb01
Fix relative imports
2017-05-08 22:29:04 +02:00
ines
ae99990f63
Fix formatting
2017-05-08 22:23:48 +02:00
ines
f46ffe3e89
Move language data to /lang module
2017-05-08 20:00:40 +02:00