Commit Graph

1494 Commits

Author SHA1 Message Date
Matthew Honnibal
a95974ad3f * Fix oov probability 2016-02-06 15:13:55 +01:00
Matthew Honnibal
af8514cb0c * Refine the way the is_parsed attribute is set by from_array 2016-02-06 14:44:35 +01:00
Matthew Honnibal
161b01d4c0 * Tweak usage example for multi-processing 2016-02-06 14:44:11 +01:00
Matthew Honnibal
7f24229f10 * Don't try to pickle the tokenizer 2016-02-06 14:09:05 +01:00
Matthew Honnibal
dcb401f3e1 * Remove broken Vocab pickling 2016-02-06 14:08:47 +01:00
Matthew Honnibal
e66d45bf66 * Restore previous patch to Span.root, as it seems it wasn't the cause of the problem. 2016-02-06 13:37:41 +01:00
Matthew Honnibal
4412a70dc5 * Initialize StateC._empty_token to 0, to avoid undefined behaviour. 2016-02-06 13:34:38 +01:00
Matthew Honnibal
1b41f868d2 * Check for errors in parser, and parallelise the left-over batch 2016-02-06 10:06:30 +01:00
Matthew Honnibal
031b00cb91 * Fix Span.root calculation 2016-02-05 20:12:09 +01:00
Matthew Honnibal
165ca28b80 * Set is_parsed flag in Parser.pipe 2016-02-05 19:51:44 +01:00
Matthew Honnibal
bdd579db0a * Set is_parsed flag in Parser.pipe 2016-02-05 19:50:11 +01:00
Matthew Honnibal
7119e77fb6 * Fix Matcher.pipe 2016-02-05 19:46:02 +01:00
Matthew Honnibal
1cf0100bf6 * Add test for multithreading 2016-02-05 19:38:22 +01:00
Matthew Honnibal
b04c9aad71 * Fix off-by-one in Parser.pipe 2016-02-05 19:37:50 +01:00
Matthew Honnibal
e5c447e237 * Questionable fix to problem in Span.root 2016-02-05 19:18:35 +01:00
Matthew Honnibal
1ef84a0557 * Merge master into rethinc2 2016-02-05 12:55:59 +01:00
Matthew Honnibal
4cf34fc170 Merge branch 'rethinc2' of ssh://github.com/honnibal/spaCy into rethinc2 2016-02-05 12:48:28 +01:00
Matthew Honnibal
249dccbe95 * Fix Language.pipe 2016-02-05 12:47:57 +01:00
Matthew Honnibal
c0e63feccc * xfail pickle tests 2016-02-05 12:46:58 +01:00
Matthew Honnibal
6aa92b70f1 * Fix merge problem in span 2016-02-05 12:46:11 +01:00
Matthew Honnibal
048dfe35aa * cimport cython.parallel 2016-02-05 12:20:42 +01:00
Matthew Honnibal
af58f273b3 * Fix spacy.language.pipe 2016-02-05 12:20:29 +01:00
Matthew Honnibal
8a13cebdcc * Update for modified thinc interface 2016-02-05 11:44:39 +01:00
Matthew Honnibal
48ce09687d * Skip pickling the vocab in the tests 2016-02-04 15:51:19 +01:00
Matthew Honnibal
419edfab50 * Use generic flags for the new attributes until they're added 2016-02-04 15:50:54 +01:00
Matthew Honnibal
c4017a06d9 * Add placeholders for the new flags in attrs and symbols 2016-02-04 15:49:45 +01:00
Matthew Honnibal
e5c96c969f * Wire up new attributes 2016-02-04 13:04:58 +01:00
Matthew Honnibal
9703ccc3de * Remove unused import 2016-02-04 13:04:33 +01:00
Matthew Honnibal
11810be33e * Add Python hooks for is_bracket/is_quote/is_left_punct/is_right_punct 2016-02-04 13:04:16 +01:00
Matthew Honnibal
fe611132f0 * Add stubs for is_bracket/is_quote/is_left_punct/is_right_punct functions 2016-02-04 13:03:04 +01:00
Matthew Honnibal
ee975d36d0 * Add stubs to test is_bracket/is_quote/is_left_punct/is_right_punct functions 2016-02-04 13:02:25 +01:00
Matthew Honnibal
f9e765cae7 * Add pipe() method to tokenizer 2016-02-03 02:32:37 +01:00
Matthew Honnibal
4cbad510ff * Fix calculation of head for spans with punctuation. 2016-02-03 02:32:21 +01:00
Matthew Honnibal
84b247ef83 * Add a .pipe method, that takes a stream of input, operates on it, and streams the output. Internally, the stream may be buffered, to allow multi-threading. 2016-02-03 02:10:58 +01:00
Matthew Honnibal
fcfc17a164 Merge branch 'master' into rethinc2 2016-02-02 23:05:34 +01:00
Matthew Honnibal
f204daf27b * Add error warning that a gold tag is unrecognised 2016-02-02 22:59:59 +01:00
Matthew Honnibal
99b8906100 * Accept punct_labels as an argument to the scorer 2016-02-02 22:59:06 +01:00
Matthew Honnibal
59123443e2 * Check for presence/absence of the different models in Language.end_training 2016-02-02 22:49:55 +01:00
Matthew Honnibal
9e9d4c8706 * Fix stupid error in Language.batch 2016-02-01 09:49:32 +01:00
Matthew Honnibal
e3db39dd21 * Fix compiler warning about signed/unsigned comparison 2016-02-01 09:08:07 +01:00
Matthew Honnibal
98fbdf2856 * Add Language.batch() method, to support multi-threaded jobs 2016-02-01 09:01:13 +01:00
Matthew Honnibal
b3802562d6 Merge branch 'rethinc2' of https://github.com/honnibal/spaCy into rethinc2 2016-02-01 08:59:24 +01:00
Matthew Honnibal
4b08a3fafd * Fix merge conflict 2016-02-01 08:58:18 +01:00
Matthew Honnibal
5188f6d9d8 * Fix parseC function 2016-02-01 08:48:48 +01:00
Matthew Honnibal
bcf8f7ba40 * Add a parse_batch method to Parser, that releases the GIL around a batch of documents. 2016-02-01 08:34:55 +01:00
Matthew Honnibal
d5579cd0d8 Merge branch 'rethinc2' of https://github.com/honnibal/spaCy into rethinc2 2016-02-01 03:08:49 +01:00
Matthew Honnibal
490ba65398 * Use openmp in parser 2016-02-01 03:08:42 +01:00
Matthew Honnibal
cb78d91ec5 * Fix ArcEager.set_valid 2016-02-01 03:07:37 +01:00
Matthew Honnibal
28e5ad62bc * Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents 2016-02-01 03:00:15 +01:00
Matthew Honnibal
a47f00901b * Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents 2016-02-01 02:58:14 +01:00
Matthew Honnibal
daaad66448 * Now fully proxied 2016-02-01 02:37:08 +01:00
Matthew Honnibal
7a0e3bb9c1 * Continue proxying. Some problem currently 2016-02-01 02:22:21 +01:00
Matthew Honnibal
2169bbb7ea * Shadow StateClass with StateC, to start proxying 2016-02-01 01:16:14 +01:00
Matthew Honnibal
2fa228458e * Add _state file, which StateClass will proxy to 2016-02-01 01:09:21 +01:00
Matthew Honnibal
6bb007d16e * Make set_parse nogil 2016-01-30 20:27:52 +01:00
Matthew Honnibal
9410e74c92 * Switch parser to use nogil functions 2016-01-30 20:27:07 +01:00
Matthew Honnibal
10877a7791 * Update for thinc 5.0, including changing cost from int to weight_t, and updating the tagger and parser 2016-01-30 14:31:36 +01:00
Matthew Honnibal
ea4ff94cde * Whitespace 2016-01-29 03:59:22 +01:00
Matthew Honnibal
b0718b6ee1 * Move to thinc 5.0 2016-01-29 03:58:55 +01:00
Matthew Honnibal
9721502c81 * Update version 2016-01-25 15:52:59 +01:00
Matthew Honnibal
907e8cf07d * Add u prefix to string in web example 2016-01-25 15:51:38 +01:00
Matthew Honnibal
eba03695ef * Comment out pickle tests 2016-01-25 15:51:13 +01:00
Matthew Honnibal
de94e6c525 * Mark pickle tests as xfail, due to temp files problem 2016-01-25 15:24:17 +01:00
Matthew Honnibal
87172a15c6 * Fix runtime error bug that arose from updated Span.root function. 2016-01-25 15:22:42 +01:00
Matthew Honnibal
2c8dd91785 * Fix first code example on the website 2016-01-23 18:09:19 +01:00
Matthew Honnibal
3af84cfd6e * Increment version 2016-01-21 17:49:27 +01:00
Henning Peters
65aeac24cb remove package version constraint 2016-01-21 17:40:51 +01:00
Matthew Honnibal
792c98a438 * Increment version for OSX-fixed release of v0.100 2016-01-21 00:23:04 +01:00
Matthew Honnibal
82d011ac43 * Fix test for whitespace 2016-01-19 20:38:26 +01:00
Matthew Honnibal
e89069dcae * Fix matcher test 2016-01-19 20:24:01 +01:00
Matthew Honnibal
63e3d4e27f * Add comment on Vocab.__reduce__ 2016-01-19 20:11:25 +01:00
Matthew Honnibal
e1282b7f2f * Require user-custom NER classes to work without adding the label. 2016-01-19 20:11:03 +01:00
Matthew Honnibal
84c5dfbfc3 * Clean up debugging python list 2016-01-19 20:10:32 +01:00
Matthew Honnibal
04d0686b26 * Make TransitionSystem.add_action idempotent, i.e. ignore duplicate added actions. 2016-01-19 20:10:04 +01:00
Matthew Honnibal
c4a89d56bd * Automatically register any entity types pre-set on the tokens, so that the NER works with user-given entity types. 2016-01-19 20:09:26 +01:00
Matthew Honnibal
f0f92793f6 * Add test for user NER classes in matcher blocking the NER model. Re Issue #178 and Issue #217 2016-01-19 19:23:16 +01:00
Matthew Honnibal
65c5bc4988 * Add add_label method, to allow users to register new entity types and dependency labels. 2016-01-19 19:11:02 +01:00
Matthew Honnibal
151aa0b0e2 * Allow users to add_label, in order to extend the entity recogniser to new classes. Does not by itself add a class to the model 2016-01-19 19:09:33 +01:00
Matthew Honnibal
c8e0011ebc * Add iterators to the NER and parser transition systems, to get the action types 2016-01-19 19:07:43 +01:00
Matthew Honnibal
515493c675 * Add xfail test for Issue #225: tokenization with non-whitespace delimiters 2016-01-19 13:20:14 +01:00
Matthew Honnibal
7abe653223 * Fix imports 2016-01-19 03:36:51 +01:00
Matthew Honnibal
590f38bdb2 * Add hacky solution to Issue #220. Currently specials.json only supports literal patterns, which doesn't allow us to pre-tag whitespace with the correct token, SP, as a rule. The data-driven approach should be easy but for some reason fails here. Adding a hard code in Morphology isn't a good solution, but we do want to fix the behaviour right away, and don't want to wait for an architecturally better solution. 2016-01-19 03:35:20 +01:00
Matthew Honnibal
445164d5b4 * Restore the LOCAL_DATA_DIR global in spacy/en/__init__.py, although this is now deprecated 2016-01-19 02:54:56 +01:00
Matthew Honnibal
04177debd0 * Unwind limit to sentence boundary detection that prevents it from inserting boundaries on whitespace. Replace it with a check for whitespace in StateClass.fast_forward, so that whitespace is LeftArced when it's on the stack. This should prevent the previous problem of whitespace-only sentences. Should fix Issue #184, but may cause further problems. Needs testing. 2016-01-19 02:54:15 +01:00
Matthew Honnibal
7893de3203 * Add test for Issue #184: Whitespace at sentence boundary causes sentence boundary error. 2016-01-18 23:04:38 +01:00
Matthew Honnibal
bba0a5e078 * Handle string paths in default_vocab, default_parser, default_entity in Language class 2016-01-18 22:37:24 +01:00
Matthew Honnibal
e825fd9554 * Make some of the website tests work without models 2016-01-18 18:14:44 +01:00
Matthew Honnibal
334c4b2b57 * Disprefer punctuation and spaces as heads of spans 2016-01-18 18:14:09 +01:00
Matthew Honnibal
bed36ab0ff * Fix import of HEAD attribute 2016-01-18 17:34:43 +01:00
Matthew Honnibal
28c659c1fe * Fix import for numpy 2016-01-18 17:25:04 +01:00
Matthew Honnibal
fc36bcf458 * Fix import for English 2016-01-18 17:14:40 +01:00
Matthew Honnibal
cc4c335e14 * Set heads for test_merge_tokens, to make the test run without models 2016-01-18 17:00:11 +01:00
Matthew Honnibal
c107da9738 * Bug fix to _count_words_to_root 2016-01-18 16:59:38 +01:00
Matthew Honnibal
f24833d607 * Fix merge for coordinations 2016-01-18 16:03:19 +01:00
Matthew Honnibal
14534958a9 * Fix bug in Span.root 2016-01-18 15:40:28 +01:00
Matthew Honnibal
714cbc03d5 * Add test for Issue #203: nested noun chunks. 2016-01-16 18:02:30 +01:00
Matthew Honnibal
4e2253170c * Move test for doc.merge to tokens_api file, to avoid name conflicts which upset pytest 2016-01-16 18:01:36 +01:00
Matthew Honnibal
34a157511f * Move test_merge_hang to test_tokens_api 2016-01-16 18:00:26 +01:00
Matthew Honnibal
fc8f26584a * Don't consider NPs connected to parse via conj relation as noun chunks. Change motivated by the nested noun chunks identified in Issue #203, but might be problematic. Also allow root NPs to be considered noun chunks. 2016-01-16 17:52:40 +01:00
Matthew Honnibal
4a16dbfeca * Add test for Issue #203: noun chunks should be flat, but sometimes are nested 2016-01-16 17:41:25 +01:00