Commit Graph

588 Commits

Author SHA1 Message Date
Matthew Honnibal
6640386b25 * Fix Issue #43: TAG attr not supported. Also add DEP attr, while I'm at it. Need better way of ensuring future changes don't break in similar way. 2015-04-07 06:00:57 +02:00
Matthew Honnibal
b64b2bd910 * Fix Issue #43: TAG attr not supported. Also add DEP attr, while I'm at it. Need better way of ensuring future changes don't break in similar way. 2015-04-07 06:00:30 +02:00
Matthew Honnibal
f9e510a893 * Whitespace 2015-04-07 04:53:59 +02:00
Matthew Honnibal
66c7ccf6cc * Fix Spans.orth_ 2015-04-07 04:53:40 +02:00
Matthew Honnibal
b8d34531c4 * Add support for units to English.__init__, by loading and applying regular expressions 2015-04-07 04:02:32 +02:00
Matthew Honnibal
0ea5af88b6 * Add multi-word expression RegexMatcher 2015-04-07 03:45:40 +02:00
Matthew Honnibal
2fee67cfa3 * Add regular expressions for English multi-word expressions 2015-04-07 03:45:18 +02:00
Matthew Honnibal
5a075ea3fc * Ensure NER moves are available for single-word tokens 2015-04-05 22:30:58 +02:00
Matthew Honnibal
a60a366b2c * Support 'punct' dep label in conll.pyx 2015-04-05 22:30:19 +02:00
Matthew Honnibal
021c972137 * Print parse if verbose in scorer 2015-04-05 22:29:30 +02:00
Matthew Honnibal
fbf19049cf * Add ent_type_ property 2015-03-31 02:01:29 +02:00
Matthew Honnibal
e70b87efeb * Add merge() method to Tokens, with fairly brittle/hacky implementation, but quite easy to test. Passing minimal tests. Still need to fix left/right deps in C data 2015-03-30 01:37:41 +02:00
Matthew Honnibal
557856e84c * Allow regular expressions to specify labels for merged spans 2015-03-27 17:40:52 +01:00
Matthew Honnibal
a3af6b7c3d * Left-Arc from Root, to allow non-monotonic reduce to compete with left-arc when the stack is not empty. 2015-03-27 17:39:16 +01:00
Matthew Honnibal
db5a43318c * Improve print_state debug printer 2015-03-27 17:29:58 +01:00
Matthew Honnibal
1705eccbbe * Remove whitespace 2015-03-27 15:22:39 +01:00
Matthew Honnibal
3feb52374c * Break apart a condition, for ease of debug printing 2015-03-27 15:21:38 +01:00
Matthew Honnibal
b32f581acb * Fix bug in ArcEager.get_labels 2015-03-27 15:21:06 +01:00
Matthew Honnibal
5f2a4ff36d * Fix spans.lemma_ 2015-03-26 16:45:38 +01:00
Matthew Honnibal
f4cc222ec3 * Fix NER scoring 2015-03-26 16:45:38 +01:00
Matthew Honnibal
1320bd19db * Move Span class to own file 2015-03-26 16:45:38 +01:00
Matthew Honnibal
6f47a667cf * Move Span class to own file 2015-03-26 16:45:38 +01:00
Matthew Honnibal
f02c39dfaf * Compare to is not None, for more robustness 2015-03-26 16:44:48 +01:00
Matthew Honnibal
8f68b864c4 * Move Span/Spans to separate files. Currently duplicates lots of Tokens functionality. Should probably be integrated into Tokens 2015-03-26 16:44:48 +01:00
Matthew Honnibal
e854ba0a13 * Remove support for force_gold flag from GreedyParser, since it's not so useful, and it's clutter 2015-03-26 16:44:47 +01:00
Matthew Honnibal
6a6085f8b9 * Clean up GreedyParser.train function a bit 2015-03-26 16:44:47 +01:00
Matthew Honnibal
b3157927e6 * Clean up unused feature templates 2015-03-26 16:44:47 +01:00
Matthew Honnibal
411bf377d4 * Remove dependency on ner_util module 2015-03-26 16:44:47 +01:00
Matthew Honnibal
01c892f583 * Add comment to fill_context 2015-03-26 16:44:47 +01:00
Matthew Honnibal
2741179aff * Important bug fix: Fill token N2w, which was being unfilled, after a bad edit while writing the NER features. 2015-03-26 16:44:47 +01:00
Matthew Honnibal
2b2dec95d3 * Add comment to set_parse 2015-03-26 16:44:47 +01:00
Matthew Honnibal
e770fade1e * Don't set dependency labels in set_parse, as this may be used by the Entity recogniser instead. Need to clean this method up... 2015-03-26 16:44:47 +01:00
Matthew Honnibal
71648205d9 * Add support for debug feature set. Just use unigrams for this. 2015-03-26 16:44:47 +01:00
Matthew Honnibal
3b70b304b2 * Add words to gold_tuples from gold conll file 2015-03-26 16:44:47 +01:00
Matthew Honnibal
2e12dec76e * Adjust scorer to account for tokenization mistakes 2015-03-26 16:44:47 +01:00
Matthew Honnibal
05d6065e2e * Add assertion 2015-03-26 16:44:46 +01:00
Matthew Honnibal
377e9b29b1 * Whitespace 2015-03-26 16:44:46 +01:00
Matthew Honnibal
670959f40c * Fix iteration order on Tokens.rights 2015-03-26 16:44:46 +01:00
Matthew Honnibal
231ce2dae5 * Assign ROOT label by default. May be papering over another bug. 2015-03-26 16:44:46 +01:00
Matthew Honnibal
9f4ad8fdfb * Assign root words the ROOT label via the Break transition. Something is still wrong here... 2015-03-26 16:44:46 +01:00
Matthew Honnibal
f729164c01 * Fix bug in label assignment: ensure null-label transitions receive the label 0 2015-03-26 16:44:46 +01:00
Matthew Honnibal
7237c805c7 * Load tag for specials.json token 2015-03-26 16:44:46 +01:00
Matthew Honnibal
567388e38d * Use values encoded by StringStore in POS tagging, rather than indices into a list of tags 2015-03-26 16:44:45 +01:00
Matthew Honnibal
3105c7f8ba * Don't pass label_ids dict to Tokens, since we now use the StringStore to manage string-to-int mapping for labels 2015-03-26 16:44:45 +01:00
Matthew Honnibal
801bf14f4f * Clean up handling of dep_strings and ent_strings, using StringStore to encode the label names. 2015-03-26 16:44:45 +01:00
Matthew Honnibal
31fad99518 * Use StringStore to encode label names, instead of label_ids 2015-03-26 16:44:45 +01:00
Matthew Honnibal
64db61bff1 * Add Span class to Python API 2015-03-26 16:44:45 +01:00
Matthew Honnibal
b9b695fb1b * Remove debug word list 2015-03-26 16:44:45 +01:00
Matthew Honnibal
f21ab2d7fb * Fix bug in ugly ent_strings hack on English class 2015-03-26 16:44:45 +01:00
Matthew Honnibal
1c843934be * Fix oracle bug in NER. Now getting 77% F on ontonotes 2015-03-26 16:44:44 +01:00
Matthew Honnibal
903f196b3f * Fix verbose printing for scorer 2015-03-26 16:44:44 +01:00
Matthew Honnibal
e181c051d5 * Improve features for NER 2015-03-26 16:44:44 +01:00
Matthew Honnibal
7ecb52c0ed * Add scorer script 2015-03-26 16:44:44 +01:00
Matthew Honnibal
8057a95f20 * NER seems to be working, scoring 69 F. Need to add decision-history features --- currently only use current word, 2 words context. Need refactoring. 2015-03-26 16:44:44 +01:00
Matthew Honnibal
ae235e07b9 * Refactoring working for parser, but now need to rig up features for NER, and then debug oracle etc. 2015-03-26 16:44:44 +01:00
Matthew Honnibal
b3eda03c9c * Tmp 2015-03-26 16:44:44 +01:00
Matthew Honnibal
220ce8bfed * Prepare English class for NER 2015-03-26 16:44:44 +01:00
Matthew Honnibal
f5830dc1c1 * Remove _transitions.pyx 2015-03-26 16:44:44 +01:00
Matthew Honnibal
6865c2fb4d * Fix assignment of dep strings in tokens.pyx 2015-03-26 16:44:43 +01:00
Matthew Honnibal
6b6bce9e7a * Fix label loading for transition system 2015-03-26 16:44:43 +01:00
Matthew Honnibal
5278c7504b * Hacks to conll.pyx. Should clean these up. 2015-03-26 16:44:43 +01:00
Matthew Honnibal
f321b2b2eb * Remove TODO comment 2015-03-26 16:44:43 +01:00
Matthew Honnibal
fdabd93bfb * Ensure high loss for invalid moves, and fix label reading for arc-eager 2015-03-26 16:44:43 +01:00
Matthew Honnibal
10ed738df2 * Tmp commit 2015-03-26 16:44:43 +01:00
Matthew Honnibal
4f83c9b3d5 * Make costs label-sensitive 2015-03-26 16:44:43 +01:00
Matthew Honnibal
179b7eb0a7 * Specify parser transition system in language 2015-03-26 16:44:43 +01:00
Matthew Honnibal
8c883cef58 * Refactored transition system code now compiling. Still need to hook up label oracle, and test 2015-03-26 16:44:43 +01:00
Matthew Honnibal
f0159ab4b6 * Add file to hold GoldParse class 2015-03-26 16:44:42 +01:00
Matthew Honnibal
8eadb984cb * Refactor arc_eager to use new TransitionSystem base class. Need to fix oracle 2015-03-26 16:44:42 +01:00
Matthew Honnibal
b063001596 * Add base TransitionSystem class. Still need to rethink how non-monotonic labelling will work for best_valid 2015-03-26 16:44:42 +01:00
Matthew Honnibal
01bc4d6815 * Add set_parse method, to assign parse to tokens in a less hacky way. 2015-03-26 16:44:42 +01:00
Matthew Honnibal
dc986dbc0b * Work on refactored parser, where TransitionSystem can be easily subclassed 2015-03-26 16:44:42 +01:00
Matthew Honnibal
1cc6329b18 * Add base class to do transitions 2015-03-26 16:44:42 +01:00
Matthew Honnibal
135756ac3d * Tmp commit of NER refactoring 2015-03-26 16:44:42 +01:00
Matthew Honnibal
23c1f6fc04 * Merge changes from stash 2015-03-26 16:44:41 +01:00
Matthew Honnibal
0ff078876a * Commit some work on ner.yx done on the plane 2015-03-26 16:44:41 +01:00
Matthew Honnibal
d81b7be6a2 * Merge train.py 2015-03-26 16:44:41 +01:00
Matthew Honnibal
2e3dc3dfe2 * Merge changes in tokens.pyx 2015-03-26 16:44:41 +01:00
Matthew Honnibal
8cc3524dc9 * Ws 2015-03-26 16:44:41 +01:00
Matthew Honnibal
3d0570685c * Add NER transition system 2015-03-26 16:44:41 +01:00
Matthew Honnibal
043b758cf4 * Resurrect old NER code. This version won't be the one that runs; we want to re-use the parser code. But for now this is a useful reference. 2015-03-26 16:44:41 +01:00
Matthew Honnibal
b139aa92ba * Start setting out how NER will be implemented in the data model 2015-03-26 16:44:41 +01:00
Matthew Honnibal
0962ffc095 * Fix issue #37: missing check_flag attribute from Token class 2015-03-26 15:06:26 +01:00
Matthew Honnibal
2e8d0e5d45 * Upd download script 2015-03-03 05:47:16 -05:00
Matthew Honnibal
dbe26f5793 * Add children and subtree methods to Token, which are generators to assist parse-tree navigation. 2015-03-03 04:18:41 -05:00
Matthew Honnibal
ea90d136e8 * Fix bug in labelled parsing, that caused an 8% drop in labelled accuracy. 2015-02-27 03:56:10 -05:00
Matthew Honnibal
caf046b220 * Hastily add method to apply tags from a list of strings, instead of predicting the tags. 2015-02-23 15:40:17 -05:00
Matthew Honnibal
cae077b583 * Work on fixing orphaned Token objects bug 2015-02-16 15:20:31 -05:00
Matthew Honnibal
7572e31f5e * Pass ownership of C data to Token instances if Tokens object is being garbage-collected, but Token instances are staying alive. 2015-02-11 18:05:06 -05:00
Matthew Honnibal
64645a1c2f * Improve docstring on English 2015-02-11 15:13:20 -05:00
Matthew Honnibal
594e50bd45 * Add option to download speech-parsing data set. 2015-02-11 14:20:29 -05:00
Matthew Honnibal
0b7e769211 * Add POS tags to support SWBD tag set 2015-02-11 14:08:28 -05:00
Matthew Honnibal
312b3a45f3 * Fix issue #19: Allow parsing/pos tagging of empty strings 2015-02-10 10:15:58 -05:00
Matthew Honnibal
2a0615104b * Upd download script 2015-02-09 10:22:59 -05:00
Matthew Honnibal
5c3513583d * Clear buffered python tokens when modifying the Tokens object. Need to clean this up, and modify via a method on Tokens. 2015-02-09 03:57:10 -05:00
Matthew Honnibal
be5536d239 * Fix Issue #22: PRP and PRP$ were mapped to NOUN. Should be PRON. 2015-02-08 18:36:18 -05:00
Matthew Honnibal
0492cee8b4 * Fix Issue #24: Lemmas are empty when the L field is missing for special-cased tokens 2015-02-08 18:30:30 -05:00
Matthew Honnibal
d229fbd228 * Give better error on out-of-bounds array access 2015-02-07 12:59:12 -05:00
Matthew Honnibal
ab8bb047d0 * Fix negative index for __getitem__ 2015-02-07 12:58:46 -05:00
Matthew Honnibal
44c7eafe44 * Fix download.py 2015-02-07 12:00:36 -05:00