2939 Commits

Author SHA1 Message Date
Matthew Honnibal
38af2941ca * Require Jieba Chinese tokenizer and tagger 2016-04-28 14:33:22 +02:00
Matthew Honnibal
1ede19c75a * Use tokens from Jieba library 2016-04-28 14:32:27 +02:00
Matthew Honnibal
3186379253 * Restore support for orths_and_spaces argument in Doc.__init__ 2016-04-28 14:32:06 +02:00
Matthew Honnibal
11bffaa1ab * Add test for regex locale in gold standard 2016-04-28 14:31:41 +02:00
Matthew Honnibal
7c37f45e9f * Fix unicode regex problem for non-English locales in gold standard 2016-04-28 14:31:14 +02:00
Matthew Honnibal
588026fe93 * Make very hacky modifications to parser training script, to get Chinese up and running. 2016-04-28 14:30:24 +02:00
Matthew Honnibal
b1cf2c16c3 * Fix scoring on train.py for Chinese 2016-04-27 10:25:41 +02:00
Matthew Honnibal
92dcfd798a Merge branch 'master' of ssh://github.com/spacy-io/spaCy into chinese 2016-04-25 22:23:12 +02:00
Matthew Honnibal
1cc4c613dc * Ignore char deps when scoring 2016-04-25 22:20:26 +02:00
Matthew Honnibal
e3de3f62cb * Add character tagger for Chinese 2016-04-25 22:20:01 +02:00
Matthew Honnibal
97b2bba249 * Merge updated/simplified Break approach 2016-04-25 19:44:42 +00:00
Matthew Honnibal
77609588b6 * Fix assignment of root label to words left as root implicitly, after parsing ends. 2016-04-25 19:41:59 +00:00
Matthew Honnibal
7c2d2deaa7 * Revise transition system so that the Break transition retains sole responsibility for setting sentence boundaries. Re Issue #322 2016-04-25 19:41:59 +00:00
Matthew Honnibal
feb65fcaa1 Merge pull request #346 from wbwseeker/sentbnd_bug
introduce sentence boundaries for additional root tokens
2016-04-25 20:31:27 +10:00
Wolfgang Seeker
1003e7ccec remove debug output from tests 2016-04-25 12:12:40 +02:00
Wolfgang Seeker
f57f843e85 fix bug in updating tree structure when introducing additional roots 2016-04-25 12:01:19 +02:00
Matthew Honnibal
b6ccd8d76a * Use Jieba tokenizer in Chinese class 2016-04-24 19:11:49 +02:00
Matthew Honnibal
9bfe20cac9 * Create tokenizer via default_tokenizer function 2016-04-24 19:11:49 +02:00
Matthew Honnibal
478a8d1829 * Register Chinese language in spacy/__init__.py 2016-04-24 18:45:16 +02:00
Matthew Honnibal
8569dbc2d0 * Add initial stuff for Chinese parsing 2016-04-24 18:44:24 +02:00
Wolfgang Seeker
b6477fc4f4 adjusted tests to Travis Setup 2016-04-21 17:15:10 +02:00
Wolfgang Seeker
736ffcb9a2 remove whitespace 2016-04-21 16:55:55 +02:00
Wolfgang Seeker
6c7301cc6d the parser now introduces sentence boundaries properly when predicting dependents with root labels 2016-04-21 16:50:53 +02:00
Wolfgang Seeker
12024b0b0a bugfix: introducing multiple roots now updates original head's properties
adjust tests to rely less on statistical model
2016-04-20 16:42:41 +02:00
Henning Peters
c356251f45 Merge branch 'master' of github.com:spacy-io/spaCy 2016-04-19 19:50:55 +02:00
Henning Peters
bb3238bcdd pin numpy to >=1.7, ship headers 2016-04-19 19:50:42 +02:00
Matthew Honnibal
67ce96c9c9 * Make patterns argument to Matcher class optional 2016-04-17 21:32:24 +02:00
Matthew Honnibal
8b4677d34d * Add missing keyword arguments to spacy.load() function 2016-04-17 21:31:50 +02:00
Matthew Honnibal
2add5206aa * Fix description of matcher test 2016-04-17 15:40:21 +02:00
Matthew Honnibal
2b419d5b8c * Update test for Issue #242 2016-04-17 15:34:23 +02:00
Matthew Honnibal
f12b043308 * Add test for Issue #242: Overlapping matches not well recognised. 2016-04-17 15:19:17 +02:00
Wolfgang Seeker
b98cc3266d bugfix: iterators now reset properly when called a second time 2016-04-15 17:49:16 +02:00
Wolfgang Seeker
e6945c4d0e bugfix: uppercase attr values before looking them up 2016-04-15 15:46:31 +02:00
Matthew Honnibal
c0909afe22 Merge pull request #312 from wbwseeker/space_head_bug
add restrictions to L-arc and R-arc to prevent space heads
2016-04-15 20:36:03 +10:00
Wolfgang Seeker
289b10f441 remove some comments 2016-04-14 15:37:51 +02:00
Matthew Honnibal
fe9299a118 * Fix long-standing issue with coarse-grained tags: proper nouns weren't receiving the PROPN tag, and personal pronouns weren't receiving the PRON tag. This should fix Issue #191, and also Issue #325, which reported that proper nouns were being lemmatized using the common noun policies. This lemmatization will be prevented if the universal tag is PROPN, not NOUN, as no lemmatization rules are loaded for the PROPN tag. 2016-04-14 12:46:43 +02:00
Matthew Honnibal
6f82065761 * Fix infixed commas in tokenizer, re Issue #326. Need to benchmark on empirical data, to make sure this doesn't break other cases. 2016-04-14 11:36:03 +02:00
Matthew Honnibal
0f957dd586 Merge branch 'master' of ssh://github.com/honnibal/spaCy 2016-04-14 10:37:56 +02:00
Matthew Honnibal
108aca0e50 * Make Matcher use attrs from the attrs.pyx file, rather than having an incomplete function doing the mapping. 2016-04-14 10:37:39 +02:00
Matthew Honnibal
61d20de35d * Fix language.py docstring 2016-04-14 10:36:57 +02:00
Wolfgang Seeker
d99a9cbce9 different handling of space tokens
space tokens are now always attached to the previous non-space token
there are two exceptions:
- leading space tokens are attached to the first following non-space token
- in input that consists exclusively of space tokens, the last space token is the head of all others.
2016-04-13 15:28:28 +02:00
Matthew Honnibal
04d0209be9 * Recognise multiple infixes in a token. 2016-04-13 18:38:26 +10:00
Henning Peters
a473d6e937 fix tests (use english model) 2016-04-12 16:41:57 +02:00
Henning Peters
f2d011c034 avoid polluting spacy namespace with lang classes 2016-04-12 16:31:16 +02:00
Henning Peters
ff690f76ba fix loading non-german models 2016-04-12 16:00:56 +02:00
Henning Peters
6215272786 remove ujson as default non-dev dependency (still works as fallback if installed), because ujson doesn't ship wheels 2016-04-12 11:28:07 +02:00
Henning Peters
5f699883dd make openmp on windows optional 2016-04-12 10:12:57 +02:00
Matthew Honnibal
6df3858dbc * Fix Issue #323: Incorrect semantics of Token.__str__ built-in. Add flag to allow users to switch the old semantics back on, to ease transition. 2016-04-12 13:17:59 +10:00
Wolfgang Seeker
d328e0b4a8 Merge branch 'master' into space_head_bug 2016-04-11 12:11:01 +02:00
Henning Peters
13a6899fc6 Merge pull request #329 from sjjpo2002/patch-1
Enable OpenMP compiler option for MSVC
2016-04-10 09:45:08 +02:00