ines
66f8f9d4a0
Fix Japanese tokenizer
...
JapaneseTokenizer now returns a Doc, not individual words
2017-10-24 13:02:19 +02:00
Ines Montani
facf77e541
Merge branch 'develop' into support-danish
2017-10-24 11:53:19 +02:00
Matthew Honnibal
49895fbef6
Rename 'SP' special tag to '_SP'
...
Renaming the tag with an underscore lets us add it to the tag map
without worrying that we'll change the sequence of tags, which throws
off the tag-to-ID mapping. For instance, if we inserted a 'SP' tag,
the "VERB" tag is pushed to a different class ID, and the model is all
messed up.
2017-10-20 14:01:12 +02:00
Ines Montani
f0d577e460
Merge pull request #1425 from explosion/feature/hindi-tokenizer
...
💫 Basic Hindi tokenization support
2017-10-18 13:34:52 +02:00
Matthew Honnibal
839de87ca9
Make lambda func a named function, for pickling
2017-10-17 18:21:20 +02:00
Matthew Honnibal
9ce7d6af87
Make lex attr functions top-level functions, to promote pickling
2017-10-17 18:19:18 +02:00
Ines Montani
aab299c8ae
Merge pull request #1429 from vishnunekkanti/develop
...
fix syntax error in zh
2017-10-17 14:45:02 +02:00
ines
485c4f6df5
Add Hungarian examples (see #1107 )
2017-10-17 02:37:45 +02:00
Vishnu Kumar Nekkanti
d3c54cf39a
fixed SyntaxError while checking for jieba
2017-10-16 18:51:33 +05:30
ines
266e7180a7
Add Language class, stop words and basic stemmer that sets NORM
2017-10-14 14:59:52 +02:00
ines
e85e1d571b
Update base punctuation
2017-10-14 14:59:23 +02:00
ines
9d6c8eaa49
Update base norm exceptions with more unicode characters
...
e.g. unicode variations of punctuation used in Chinese
2017-10-14 14:58:52 +02:00
ines
38c756fd85
Port over changes from #1287
2017-10-14 13:16:21 +02:00
ines
612224c10d
Port over changes from #1157
2017-10-14 13:11:39 +02:00
ines
a4d974d97b
Port over URL pattern changes from #1411
2017-10-14 12:58:07 +02:00
ines
09aed58140
Port over changes from #1333 and add comments
2017-10-14 12:52:59 +02:00
ines
8ce6f96180
Don't make copies of language data components
2017-10-11 15:34:55 +02:00
ines
417d45f5d0
Add lemmatizer data as variable on language data
...
Don't create lookup lemmatizer within Language class and just pass in
the data so it can be set on Token creation
2017-10-11 02:24:58 +02:00
ines
0c2343d73a
Tidy up language data
2017-10-11 02:22:49 +02:00
Matthew Honnibal
8143618497
Set prefix length back to 1
2017-10-10 19:32:54 +02:00
Matthew Honnibal
dce8afb9cf
Set prefix length to 3
2017-10-09 21:55:55 -05:00
Ines Montani
959c46eabe
Merge pull request #1365 from wannaphongcom/develop
...
Add Thai language for spaCy v2
2017-09-26 23:43:05 +02:00
Wannaphong Phatthiyaphaibun
3d5046c499
fix import in th
2017-09-26 22:41:20 +07:00
Wannaphong Phatthiyaphaibun
a63f790b8c
fix thai tag_map
2017-09-26 22:28:57 +07:00
Wannaphong Phatthiyaphaibun
2ea27d07f4
fix tokenizer_exceptions in thai
2017-09-26 22:14:47 +07:00
Wannaphong Phatthiyaphaibun
a2bf4cc7bf
fix newline in file
2017-09-26 21:49:43 +07:00
ines
bb5c631402
Implement like_num getter for French (via #1161 )
2017-09-26 16:47:45 +02:00
ines
15479b3bae
Add comment to like_num re: future work
2017-09-26 16:43:28 +02:00
ines
adda08fe14
Implement like_num getter for Dutch (via #1177 )
2017-09-26 16:39:15 +02:00
ines
5ee10379db
Port over changes from #1340
2017-09-26 16:38:08 +02:00
Wannaphong Phatthiyaphaibun
5cba67146c
add thai in spacy2
2017-09-26 21:36:27 +07:00
ines
10d291f129
Port over change from #1351
2017-09-26 16:11:41 +02:00
ines
ece30c28a8
Don't split hyphenated words in German
...
This way, the tokenizer matches the tokenization in German treebanks
2017-09-16 20:40:15 +02:00
Ines Montani
bd3da3d6fb
Port over change from #1323 and tidy up
2017-09-14 19:23:13 +02:00
Jim O'Regan
9dfd301962
rearrange
2017-09-11 10:14:18 +01:00
Jim O'Regan
1ee75ae337
Merge remote-tracking branch 'origin/develop' into develop-irish
2017-09-11 08:40:11 +01:00
Matthew Honnibal
b29e6bff46
Improve lemmatization rule for am|VBP
2017-09-04 15:18:10 +02:00
Matthew Honnibal
2e28982e28
Merge pull request #1288 from geovedi/indonesian
...
Indonesian language support
2017-08-26 21:31:13 +02:00
Matthew Honnibal
cfc055734e
Split % in units, for compatibility with corpus
2017-08-25 20:03:37 -05:00
Jim Geovedi
58d8078971
Merge remote-tracking branch 'upstream/develop' into indonesian
2017-08-25 09:21:49 +08:00
Matthew Honnibal
bb2541ffd3
Fix PROB attr for OOV words
2017-08-23 12:11:52 +02:00
ines
a68dc891ea
Port over changes from #1281
2017-08-21 23:19:18 +02:00
Jim Geovedi
f77443ab68
reworked
2017-08-20 13:43:21 +07:00
Jim Geovedi
b7d83f37c8
indonesian abbr.
2017-08-20 12:16:50 +07:00
Jim Geovedi
7193c47f0b
direct lookup
2017-08-20 11:57:52 +07:00
Jim Geovedi
fdf802d505
added examples
2017-08-20 11:57:10 +07:00
Jim Geovedi
fa544e6c9a
Merge remote-tracking branch 'upstream/develop' into indonesian
2017-08-20 11:49:40 +07:00
ines
1fe5e1a4d1
Add language example sentences (see #1107 )
...
da, de, en, es, fr, he, it, nb, pl, pt, sv
2017-08-19 12:22:29 +02:00
Jim O'Regan
c069b4acb5
fix in UD submitted; map either way
2017-08-08 19:22:14 +01:00
Jim O'Regan
76c22dec4d
UD Irish tag mapping
2017-08-08 19:04:52 +01:00
Jim O'Regan
95921d7d4c
Merge branch 'develop' into develop-irish
2017-08-08 17:21:27 +01:00
Jim Geovedi
37f19f5ed2
added more currencies based on corpus data
2017-08-03 13:03:25 +07:00
Jim Geovedi
30fd068d42
hashtag prefix should be handled somewhere else
2017-08-03 13:03:02 +07:00
Jim Geovedi
ba07e23c87
added USD in currency rules
2017-08-02 22:42:47 +07:00
Jim Geovedi
bb08d696f9
added hashtag rule and fixed currency rules
2017-07-30 21:23:28 +07:00
Jim Geovedi
e9af79a803
added u-\d+ rules (sports team)
2017-07-30 21:23:01 +07:00
Jim Geovedi
e5adc26c72
simplified rules
2017-07-29 18:21:32 +07:00
Jim Geovedi
4d04898dea
updated regexp
2017-07-29 17:44:57 +07:00
Jim Geovedi
7d96d477ea
updated like_num
2017-07-29 17:44:46 +07:00
Jim Geovedi
3cca4ed798
added lex attrs rules
2017-07-29 17:22:21 +07:00
Jim Geovedi
8b814c63f1
more exceptions
2017-07-27 19:46:30 +07:00
Jim Geovedi
6c725e8dcf
updated lemma
2017-07-27 19:46:21 +07:00
Jim Geovedi
547973b92a
wip syntax iterators
2017-07-27 10:51:34 +07:00
Jim Geovedi
bbc75da38d
enable syntax iterator and lemma lookup
2017-07-27 10:51:15 +07:00
Jim Geovedi
24a8c8bf28
added wip lemma dict
2017-07-26 21:39:54 +07:00
Jim Geovedi
63f14ba46b
added hyphen-suffix rules
2017-07-26 19:28:57 +07:00
Jim Geovedi
f288964441
removed -el from suffix rules
2017-07-26 19:28:38 +07:00
Jim Geovedi
6eee7a7411
updated tokenizer exceptions
2017-07-26 19:13:47 +07:00
Jim Geovedi
edec51b1b1
update punctuation rules
2017-07-26 19:13:36 +07:00
Jim Geovedi
62443d495a
enable token match
2017-07-26 19:13:14 +07:00
Jim Geovedi
c97f5ae0bb
updated tokenizer exceptions
2017-07-26 19:12:52 +07:00
Jim Geovedi
73f6ac9d9b
added hyhen
2017-07-24 15:56:31 +07:00
Jim Geovedi
68454c40bf
added missing import
2017-07-24 14:12:34 +07:00
Jim Geovedi
eaf9cbd708
cursed of copy & paste
2017-07-24 14:11:51 +07:00
Jim Geovedi
7aad6718bc
enable tokenizer exceptions
2017-07-24 14:11:10 +07:00
Jim Geovedi
ad56c9179a
added tokenizer exceptions list
2017-07-24 14:10:16 +07:00
Jim Geovedi
c1f3fe99fe
updated punctuation rules
2017-07-24 13:57:21 +07:00
Jim Geovedi
37fa2c8c80
punctution rules
2017-07-24 06:17:18 +07:00
Jim Geovedi
082e94ac1c
added inflix rules
2017-07-24 06:17:07 +07:00
Jim Geovedi
d0ec484725
reverted
2017-07-24 06:16:29 +07:00
Jim Geovedi
0e590c711f
added prefix & suffix rules
2017-07-23 23:46:40 +07:00
Jim Geovedi
ba922e30e8
added ampere hour unit
2017-07-23 23:46:18 +07:00
Jim Geovedi
3b17eba27b
added frequency units
2017-07-23 23:10:52 +07:00
Jim Geovedi
d5fd32a572
added known currencies
2017-07-23 22:56:48 +07:00
Jim Geovedi
f6f15678fb
added lex_attrs
2017-07-23 22:55:22 +07:00
Jim Geovedi
bed8162d00
added tokenizer_exceptions
2017-07-23 22:55:05 +07:00
Jim Geovedi
b80c35bc9a
added norm_exceptions
2017-07-23 22:54:49 +07:00
Jim Geovedi
b5de329ea3
added norm_exceptions
2017-07-23 22:54:19 +07:00
Jim Geovedi
082e9ade46
fixed typo
2017-07-23 21:30:34 +07:00
Jim Geovedi
e2efeb186e
added stopwords
2017-07-23 20:52:37 +07:00
Jim Geovedi
da98676839
use template
2017-07-23 20:51:31 +07:00
Jim Geovedi
c2b4dd7809
start working on Indonesian language
2017-07-23 20:50:56 +07:00
mollerhoj
85144835da
Add Tag_map for Danish
2017-07-03 15:52:55 +02:00
mollerhoj
64c732918a
Add Morph_rules. (TODO: Not working?)
2017-07-03 15:52:55 +02:00
mollerhoj
3b2cb107a3
Add like_num functionality to Danish
2017-07-03 15:49:51 +02:00
mollerhoj
e8f40ceed8
Add short names of months to tokenizer_exceptions
2017-07-03 15:49:51 +02:00
mollerhoj
23025d3b05
Clean up a couple of strange English stopwords
2017-07-03 15:41:59 +02:00
mollerhoj
dc5be7d2f3
Cleanup list of Danish stopwords
2017-07-03 15:40:58 +02:00
Ines Montani
c91642efd5
Port over changes from #1168
2017-07-01 11:43:54 +02:00
Jim O'Regan
70f4d26c10
bounds checks
2017-06-28 10:59:46 +01:00
Jim O'Regan
1ba38b2036
some helpers; the Irish part of UD only has 2500 sentences so this will need source of morphology
2017-06-28 00:42:00 +01:00
Jim O'Regan
559e03605a
b'
2017-06-27 22:42:16 +01:00
Jim Regan
d81ceb0cd5
Merge branch 'develop' into polish
2017-06-26 22:42:27 +01:00
Jim O'Regan
2f84c73585
a start
2017-06-26 22:40:04 +01:00
Jim O'Regan
28d7f0a672
reference
2017-06-26 22:38:28 +01:00
Jim O'Regan
e12defdd9c
missed a couple
2017-06-26 22:24:14 +01:00
Jim O'Regan
c1e4e0f3bf
just now discovered that you can do multiwords
2017-06-26 22:19:39 +01:00
Jim O'Regan
5e5f94c1c0
fix dup
2017-06-26 21:57:00 +01:00
Jim O'Regan
a8dff9133e
add POS
2017-06-26 21:53:41 +01:00
Jim O'Regan
e9213f54de
missed one
2017-06-26 21:29:21 +01:00
Jim O'Regan
1eb7cc3017
attempt a port from #1147
2017-06-26 21:24:55 +01:00
Matthew Honnibal
91e52543ef
Merge pull request #1118 from Gregory-Howard/patch-2
...
Update _tokenizer_exceptions_list (adding cities)
2017-06-20 11:16:07 +02:00
Tpt
7745b3ae04
Adds noun chunks to French syntax iterators
2017-06-12 15:29:58 +02:00
Grégory Howard
cd974b32b7
Update _tokenizer_exceptions_list (adding cities)
2017-06-09 17:58:18 +02:00
Matthew Honnibal
55d0621532
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-04 15:53:25 -05:00
Matthew Honnibal
e28f90b672
Fix syntax iterators
2017-06-04 15:51:50 -05:00
Ines Montani
112c5787eb
Merge pull request #1101 from oroszgy/hu_tokenizer_fix
...
More robust Hungarian tokenizer.
2017-06-04 22:37:51 +02:00
ines
9254a3dd78
Import and add Spanish syntax iterators
2017-06-04 21:42:15 +02:00
Matthew Honnibal
7ca215bc26
Resolve lex_attr_getters conflict
2017-06-03 16:12:01 -05:00
ines
4c643d74c5
Add norm exceptions to other Language classes
2017-06-03 22:29:21 +02:00
ines
fa7e576c57
Change order of exception dicts
2017-06-03 21:52:06 +02:00
Matthew Honnibal
3f5c85d8de
Reorder setting of lex attrs, to avoid clobbering
2017-06-03 14:47:55 -05:00
Matthew Honnibal
aeb7520133
Make norm use lower-case
2017-06-03 14:47:38 -05:00
Matthew Honnibal
de3954843e
Populate norm exceptions with lower-case
2017-06-03 14:47:12 -05:00
ines
e47eef5e03
Update German tokenizer exceptions and tests
2017-06-03 21:07:44 +02:00
ines
0d6fa8b241
Add German norm exceptions
2017-06-03 20:54:18 +02:00
ines
5bd311c77e
Fix update of norm exceptions
2017-06-03 20:54:09 +02:00
ines
746653880c
Add English norm exceptions to lex_attrs
2017-06-03 20:27:28 +02:00
ines
095eeeb12f
Update English tokenizer exceptions and add norms
2017-06-03 20:27:16 +02:00
ines
e5d426406a
Add base norm exceptions
2017-06-03 20:27:05 +02:00
ines
2f1025a94c
Port over Spanish changes from #1096
2017-06-02 19:09:58 +02:00
Gyorgy Orosz
f0c3b09242
More robust Hungarian tokenizer.
2017-05-31 22:28:40 +02:00
Gyorgy Orosz
8c0b4b850e
Fixed emoji handling for Hungarian
2017-05-30 21:34:46 +02:00
ines
84189c1cab
Add 'xx' language ID for multi-language support
...
Allows models to specify their language ID as 'xx'.
2017-05-28 00:58:59 +02:00
ines
33e332e67c
Remove unused export
2017-05-28 00:57:59 +02:00
ines
a8e58e04ef
Add symbols class to punctuation rules to handle emoji (see #1088 )
...
Currently doesn't work for Hungarian, because of conflicts with the
custom punctuation rules. Also doesn't take multi-character emoji like
👩🏽💻 into account.
2017-05-27 17:57:10 +02:00
Matthew Honnibal
5db89053aa
Merge docstrings
2017-05-21 13:46:23 -05:00
ines
924e8506de
Move Defaults subclass to module scope (necessary for pickling)
2017-05-20 19:02:27 +02:00
Matthew Honnibal
61fe55efba
Move EnglishDefaults class out of English
2017-05-20 02:18:19 -05:00
Matthew Honnibal
8815507f8e
Move SpanishDefaults out of Language class, for pickle
2017-05-18 04:28:51 -05:00
ines
1a05078c79
Add language-specific syntax iterators to en and de
2017-05-17 12:04:03 +02:00
Matthew Honnibal
4b9d69f428
Merge branch 'v2' into develop
...
* Move v2 parser into nn_parser.pyx
* New TokenVectorEncoder class in pipeline.pyx
* New spacy/_ml.py module
Currently the two parsers live side-by-side, until we figure out how to
organize them.
2017-05-14 01:10:23 +02:00
ines
a4a37a783e
Remove import from non-existing module
2017-05-13 16:00:09 +02:00
ines
c13b3fa052
Add LEX_ATTRS
2017-05-12 15:37:45 +02:00
ines
bca2ea9c72
Update Portuguese lexical attributes
2017-05-12 15:37:39 +02:00
ines
2f870123bf
Fix formatting
2017-05-12 15:37:20 +02:00
ines
ca65993d59
Add basic Polish Language class
2017-05-12 09:25:37 +02:00
ines
48177c4f92
Add missing tokenizer exceptions
2017-05-12 09:25:24 +02:00
ines
bb8be3d194
Add Danish language data
2017-05-10 21:15:12 +02:00
ines
a0b00624bb
Make sure like_email returns bool
2017-05-09 11:37:29 +02:00
ines
ea60932e1b
Fix formatting
2017-05-09 11:08:14 +02:00
ines
02d0ac5cab
Remove redundant function and fix formatting
2017-05-09 11:06:04 +02:00
ines
b5ca50607e
Reorganise entity rules
2017-05-09 01:37:10 +02:00
ines
12c3d5fbba
Fix formatting
2017-05-09 01:15:28 +02:00
ines
2829a024ef
Re-add basic like_num check to global lex_attrs
2017-05-09 01:15:23 +02:00
ines
88adeee548
Add English lex_attrs overrides
2017-05-09 01:09:52 +02:00
ines
8f3fbbb147
Fix typos
2017-05-09 01:09:37 +02:00
ines
2216e5f326
Reorganise lex_attrs and add dict
2017-05-09 00:57:54 +02:00
ines
e666f14d20
Add global lex_attrs
2017-05-09 00:41:53 +02:00
ines
41972c43fe
Use consistent regex imports
2017-05-09 00:34:31 +02:00
ines
9f0fd5963f
Reorganise Hungarian punctuation rules
2017-05-09 00:01:59 +02:00
ines
fc0d793360
Reorganise Bengali punctuation rules
2017-05-09 00:01:52 +02:00
ines
e895d1afd7
Reorganise French punctuation rules
2017-05-09 00:00:54 +02:00
ines
014bda0ae3
Reorganise global punctuation rules
2017-05-09 00:00:46 +02:00
ines
a91278cb32
Rename _URL_PATTERN to URL_PATTERN
2017-05-09 00:00:00 +02:00
ines
604f299cf6
Add char classes to global language data
2017-05-08 23:59:33 +02:00
ines
f6f5d78cb9
Fix formatting
2017-05-08 23:59:17 +02:00
ines
3c0f85de8e
Remove imports in /lang/__init__.py
2017-05-08 23:58:07 +02:00
ines
614aa09582
Tidy up Bengali tokenizer exceptions
2017-05-08 22:29:49 +02:00
ines
73b577cb01
Fix relative imports
2017-05-08 22:29:04 +02:00
ines
ae99990f63
Fix formatting
2017-05-08 22:23:48 +02:00
ines
f46ffe3e89
Move language data to /lang module
2017-05-08 20:00:40 +02:00