Commit Graph

792 Commits

Author SHA1 Message Date
Matthew Honnibal
8c354c432b * Add ValueError condition to ner_tag reading 2015-04-10 04:59:59 +02:00
Matthew Honnibal
435cccf098 * Add read_conll03_file function to conll.pyx 2015-04-10 04:59:11 +02:00
Matthew Honnibal
99c9ecfc18 * Fix bug in prefix, suffix and word shape features in parser and NER 2015-04-10 03:53:33 +02:00
Matthew Honnibal
cff2b13fef * Fix Issue #44: Broken Token.string attribute when single word sentence 2015-04-07 06:08:25 +02:00
Matthew Honnibal
6640386b25 * Fix Issue #43: TAG attr not supported. Also add DEP attr, while I'm at it. Need better way of ensuring future changes don't break in similar way. 2015-04-07 06:00:57 +02:00
Matthew Honnibal
b64b2bd910 * Fix Issue #43: TAG attr not supported. Also add DEP attr, while I'm at it. Need better way of ensuring future changes don't break in similar way. 2015-04-07 06:00:30 +02:00
Matthew Honnibal
f9e510a893 * Whitespace 2015-04-07 04:53:59 +02:00
Matthew Honnibal
66c7ccf6cc * Fix Spans.orth_ 2015-04-07 04:53:40 +02:00
Matthew Honnibal
b8d34531c4 * Add support for units to English.__init__, by loading and applying regular expressions 2015-04-07 04:02:32 +02:00
Matthew Honnibal
0ea5af88b6 * Add multi-word expression RegexMatcher 2015-04-07 03:45:40 +02:00
Matthew Honnibal
2fee67cfa3 * Add regular expressions for English multi-word expressions 2015-04-07 03:45:18 +02:00
Matthew Honnibal
5a075ea3fc * Ensure NER moves are available for single-word tokens 2015-04-05 22:30:58 +02:00
Matthew Honnibal
a60a366b2c * Support 'punct' dep label in conll.pyx 2015-04-05 22:30:19 +02:00
Matthew Honnibal
021c972137 * Print parse if verbose in scorer 2015-04-05 22:29:30 +02:00
Matthew Honnibal
fbf19049cf * Add ent_type_ property 2015-03-31 02:01:29 +02:00
Matthew Honnibal
e70b87efeb * Add merge() method to Tokens, with fairly brittle/hacky implementation, but quite easy to test. Passing minimal tests. Still need to fix left/right deps in C data 2015-03-30 01:37:41 +02:00
Matthew Honnibal
557856e84c * Allow regular expressions to specify labels for merged spans 2015-03-27 17:40:52 +01:00
Matthew Honnibal
a3af6b7c3d * Left-Arc from Root, to allow non-monotonic reduce to compete with left-arc when the stack is not empty. 2015-03-27 17:39:16 +01:00
Matthew Honnibal
db5a43318c * Improve print_state debug printer 2015-03-27 17:29:58 +01:00
Matthew Honnibal
1705eccbbe * Remove whitespace 2015-03-27 15:22:39 +01:00
Matthew Honnibal
3feb52374c * Break apart a condition, for ease of debug printing 2015-03-27 15:21:38 +01:00
Matthew Honnibal
b32f581acb * Fix bug in ArcEager.get_labels 2015-03-27 15:21:06 +01:00
Matthew Honnibal
5f2a4ff36d * Fix spans.lemma_ 2015-03-26 16:45:38 +01:00
Matthew Honnibal
f4cc222ec3 * Fix NER scoring 2015-03-26 16:45:38 +01:00
Matthew Honnibal
1320bd19db * Move Span class to own file 2015-03-26 16:45:38 +01:00
Matthew Honnibal
6f47a667cf * Move Span class to own file 2015-03-26 16:45:38 +01:00
Matthew Honnibal
f02c39dfaf * Compare to is not None, for more robustness 2015-03-26 16:44:48 +01:00
Matthew Honnibal
8f68b864c4 * Move Span/Spans to separate files. Currently duplicates lots of Tokens functionality. Should probably be integrated into Tokens 2015-03-26 16:44:48 +01:00
Matthew Honnibal
e854ba0a13 * Remove support for force_gold flag from GreedyParser, since it's not so useful, and it's clutter 2015-03-26 16:44:47 +01:00
Matthew Honnibal
6a6085f8b9 * Clean up GreedyParser.train function a bit 2015-03-26 16:44:47 +01:00
Matthew Honnibal
b3157927e6 * Clean up unused feature templates 2015-03-26 16:44:47 +01:00
Matthew Honnibal
411bf377d4 * Remove dependency on ner_util module 2015-03-26 16:44:47 +01:00
Matthew Honnibal
01c892f583 * Add comment to fill_context 2015-03-26 16:44:47 +01:00
Matthew Honnibal
2741179aff * Important bug fix: Fill token N2w, which was being unfilled, after a bad edit while writing the NER features. 2015-03-26 16:44:47 +01:00
Matthew Honnibal
2b2dec95d3 * Add comment to set_parse 2015-03-26 16:44:47 +01:00
Matthew Honnibal
e770fade1e * Don't set dependency labels in set_parse, as this may be used by the Entity recogniser instead. Need to clean this method up... 2015-03-26 16:44:47 +01:00
Matthew Honnibal
71648205d9 * Add support for debug feature set. Just use unigrams for this. 2015-03-26 16:44:47 +01:00
Matthew Honnibal
3b70b304b2 * Add words to gold_tuples from gold conll file 2015-03-26 16:44:47 +01:00
Matthew Honnibal
2e12dec76e * Adjust scorer to account for tokenization mistakes 2015-03-26 16:44:47 +01:00
Matthew Honnibal
05d6065e2e * Add assertion 2015-03-26 16:44:46 +01:00
Matthew Honnibal
377e9b29b1 * Whitespace 2015-03-26 16:44:46 +01:00
Matthew Honnibal
670959f40c * Fix iteration order on Tokens.rights 2015-03-26 16:44:46 +01:00
Matthew Honnibal
231ce2dae5 * Assign ROOT label by default. May be papering over another bug. 2015-03-26 16:44:46 +01:00
Matthew Honnibal
9f4ad8fdfb * Assign root words the ROOT label via the Break transition. Something is still wrong here... 2015-03-26 16:44:46 +01:00
Matthew Honnibal
f729164c01 * Fix bug in label assignment: ensure null-label transitions receive the label 0 2015-03-26 16:44:46 +01:00
Matthew Honnibal
7237c805c7 * Load tag for specials.json token 2015-03-26 16:44:46 +01:00
Matthew Honnibal
567388e38d * Use values encoded by StringStore in POS tagging, rather than indices into a list of tags 2015-03-26 16:44:45 +01:00
Matthew Honnibal
3105c7f8ba * Don't pass label_ids dict to Tokens, since we now use the StringStore to manage string-to-int mapping for labels 2015-03-26 16:44:45 +01:00
Matthew Honnibal
801bf14f4f * Clean up handling of dep_strings and ent_strings, using StringStore to encode the label names. 2015-03-26 16:44:45 +01:00
Matthew Honnibal
31fad99518 * Use StringStore to encode label names, instead of label_ids 2015-03-26 16:44:45 +01:00
Matthew Honnibal
64db61bff1 * Add Span class to Python API 2015-03-26 16:44:45 +01:00
Matthew Honnibal
b9b695fb1b * Remove debug word list 2015-03-26 16:44:45 +01:00
Matthew Honnibal
f21ab2d7fb * Fix bug in ugly ent_strings hack on English class 2015-03-26 16:44:45 +01:00
Matthew Honnibal
1c843934be * Fix oracle bug in NER. Now getting 77% F on ontonotes 2015-03-26 16:44:44 +01:00
Matthew Honnibal
903f196b3f * Fix verbose printing for scorer 2015-03-26 16:44:44 +01:00
Matthew Honnibal
e181c051d5 * Improve features for NER 2015-03-26 16:44:44 +01:00
Matthew Honnibal
7ecb52c0ed * Add scorer script 2015-03-26 16:44:44 +01:00
Matthew Honnibal
8057a95f20 * NER seems to be working, scoring 69 F. Need to add decision-history features --- currently only use current word, 2 words context. Need refactoring. 2015-03-26 16:44:44 +01:00
Matthew Honnibal
ae235e07b9 * Refactoring working for parser, but now need to rig up features for NER, and then debug oracle etc. 2015-03-26 16:44:44 +01:00
Matthew Honnibal
b3eda03c9c * Tmp 2015-03-26 16:44:44 +01:00
Matthew Honnibal
220ce8bfed * Prepare English class for NER 2015-03-26 16:44:44 +01:00
Matthew Honnibal
f5830dc1c1 * Remove _transitions.pyx 2015-03-26 16:44:44 +01:00
Matthew Honnibal
6865c2fb4d * Fix assignment of dep strings in tokens.pyx 2015-03-26 16:44:43 +01:00
Matthew Honnibal
6b6bce9e7a * Fix label loading for transition system 2015-03-26 16:44:43 +01:00
Matthew Honnibal
5278c7504b * Hacks to conll.pyx. Should clean these up. 2015-03-26 16:44:43 +01:00
Matthew Honnibal
f321b2b2eb * Remove TODO comment 2015-03-26 16:44:43 +01:00
Matthew Honnibal
fdabd93bfb * Ensure high loss for invalid moves, and fix label reading for arc-eager 2015-03-26 16:44:43 +01:00
Matthew Honnibal
10ed738df2 * Tmp commit 2015-03-26 16:44:43 +01:00
Matthew Honnibal
4f83c9b3d5 * Make costs label-sensitive 2015-03-26 16:44:43 +01:00
Matthew Honnibal
179b7eb0a7 * Specify parser transition system in language 2015-03-26 16:44:43 +01:00
Matthew Honnibal
8c883cef58 * Refactored transition system code now compiling. Still need to hook up label oracle, and test 2015-03-26 16:44:43 +01:00
Matthew Honnibal
f0159ab4b6 * Add file to hold GoldParse class 2015-03-26 16:44:42 +01:00
Matthew Honnibal
8eadb984cb * Refactor arc_eager to use new TransitionSystem base class. Need to fix oracle 2015-03-26 16:44:42 +01:00
Matthew Honnibal
b063001596 * Add base TransitionSystem class. Still need to rethink how non-monotonic labelling will work for best_valid 2015-03-26 16:44:42 +01:00
Matthew Honnibal
01bc4d6815 * Add set_parse method, to assign parse to tokens in a less hacky way. 2015-03-26 16:44:42 +01:00
Matthew Honnibal
dc986dbc0b * Work on refactored parser, where TransitionSystem can be easily subclassed 2015-03-26 16:44:42 +01:00
Matthew Honnibal
1cc6329b18 * Add base class to do transitions 2015-03-26 16:44:42 +01:00
Matthew Honnibal
135756ac3d * Tmp commit of NER refactoring 2015-03-26 16:44:42 +01:00
Matthew Honnibal
23c1f6fc04 * Merge changes from stash 2015-03-26 16:44:41 +01:00
Matthew Honnibal
0ff078876a * Commit some work on ner.yx done on the plane 2015-03-26 16:44:41 +01:00
Matthew Honnibal
d81b7be6a2 * Merge train.py 2015-03-26 16:44:41 +01:00
Matthew Honnibal
2e3dc3dfe2 * Merge changes in tokens.pyx 2015-03-26 16:44:41 +01:00
Matthew Honnibal
8cc3524dc9 * Ws 2015-03-26 16:44:41 +01:00
Matthew Honnibal
3d0570685c * Add NER transition system 2015-03-26 16:44:41 +01:00
Matthew Honnibal
043b758cf4 * Resurrect old NER code. This version won't be the one that runs; we want to re-use the parser code. But for now this is a useful reference. 2015-03-26 16:44:41 +01:00
Matthew Honnibal
b139aa92ba * Start setting out how NER will be implemented in the data model 2015-03-26 16:44:41 +01:00
Matthew Honnibal
0962ffc095 * Fix issue #37: missing check_flag attribute from Token class 2015-03-26 15:06:26 +01:00
Matthew Honnibal
2e8d0e5d45 * Upd download script 2015-03-03 05:47:16 -05:00
Matthew Honnibal
dbe26f5793 * Add children and subtree methods to Token, which are generators to assist parse-tree navigation. 2015-03-03 04:18:41 -05:00
Matthew Honnibal
ea90d136e8 * Fix bug in labelled parsing, that caused an 8% drop in labelled accuracy. 2015-02-27 03:56:10 -05:00
Matthew Honnibal
caf046b220 * Hastily add method to apply tags from a list of strings, instead of predicting the tags. 2015-02-23 15:40:17 -05:00
Matthew Honnibal
cae077b583 * Work on fixing orphaned Token objects bug 2015-02-16 15:20:31 -05:00
Matthew Honnibal
7572e31f5e * Pass ownership of C data to Token instances if Tokens object is being garbage-collected, but Token instances are staying alive. 2015-02-11 18:05:06 -05:00
Matthew Honnibal
64645a1c2f * Improve docstring on English 2015-02-11 15:13:20 -05:00
Matthew Honnibal
594e50bd45 * Add option to download speech-parsing data set. 2015-02-11 14:20:29 -05:00
Matthew Honnibal
0b7e769211 * Add POS tags to support SWBD tag set 2015-02-11 14:08:28 -05:00
Matthew Honnibal
312b3a45f3 * Fix issue #19: Allow parsing/pos tagging of empty strings 2015-02-10 10:15:58 -05:00
Matthew Honnibal
2a0615104b * Upd download script 2015-02-09 10:22:59 -05:00
Matthew Honnibal
5c3513583d * Clear buffered python tokens when modifying the Tokens object. Need to clean this up, and modify via a method on Tokens. 2015-02-09 03:57:10 -05:00
Matthew Honnibal
be5536d239 * Fix Issue #22: PRP and PRP$ were mapped to NOUN. Should be PRON. 2015-02-08 18:36:18 -05:00
Matthew Honnibal
0492cee8b4 * Fix Issue #24: Lemmas are empty when the L field is missing for special-cased tokens 2015-02-08 18:30:30 -05:00
Matthew Honnibal
d229fbd228 * Give better error on out-of-bounds array access 2015-02-07 12:59:12 -05:00
Matthew Honnibal
ab8bb047d0 * Fix negative index for __getitem__ 2015-02-07 12:58:46 -05:00
Matthew Honnibal
44c7eafe44 * Fix download.py 2015-02-07 12:00:36 -05:00
Matthew Honnibal
6ca7f2eedc * Upd download script 2015-02-07 11:32:33 -05:00
Matthew Honnibal
f0e0588833 * Fill L2 norm attribute on LexemeC struct 2015-02-07 08:44:42 -05:00
Matthew Honnibal
75f9b7d6bf * Add L2 norm field to LexemeC struct 2015-02-07 08:43:17 -05:00
Matthew Honnibal
51b618d646 * Add a has_repvec property to Lexeme, and a check function to check flags 2015-02-07 08:42:44 -05:00
Matthew Honnibal
321b402739 * Store the l2 norm of the word's vector 2015-02-07 08:42:16 -05:00
Matthew Honnibal
c7d8644149 * Fix regression on 'prob' attr of Token. 2015-02-03 03:32:18 +11:00
Matthew Honnibal
c55a33d045 * Catch oracle errors 2015-02-02 23:02:04 +11:00
Matthew Honnibal
de772088e6 * Use parse tree for sbd in Tokens.sents 2015-02-02 12:17:32 +11:00
Matthew Honnibal
56c2ef2982 * Tweak POS features for web text 2015-02-02 11:59:36 +11:00
Matthew Honnibal
d68678a93e * Add Exception class, OracleError 2015-02-02 11:57:32 +11:00
Matthew Honnibal
a20fdbd8ee * Upd download script 2015-02-01 13:22:23 +11:00
Matthew Honnibal
76d9394cb4 * Fix vocab.pyx for Python3 2015-02-01 13:14:04 +11:00
Matthew Honnibal
63abdf154c * Hastily hack download file 2015-01-31 22:48:32 +11:00
Matthew Honnibal
7de00c5a79 * Try not holding a reference to Pool, since that seems to confuse the GC 2015-01-31 22:10:22 +11:00
Matthew Honnibal
ce3ae8b5d9 * Fix platform-specific lexicon bug. 2015-01-31 16:38:58 +11:00
Matthew Honnibal
a1ed574b7b * Fix default model path for English 2015-01-31 16:38:27 +11:00
Matthew Honnibal
018e0bfa24 * Bug fixes to parse navigation 2015-01-31 16:37:13 +11:00
Matthew Honnibal
e013555b25 * Add option to download script 2015-01-31 13:51:56 +11:00
Matthew Honnibal
08ca5c8970 * Add sent_end flag to TokenC struct 2015-01-31 13:44:16 +11:00
Matthew Honnibal
024cfd485c * Pass tag_strings as a tuple, to support new Tokens API 2015-01-31 13:43:37 +11:00
Matthew Honnibal
77d62d0179 * Large refactor of Token objects, making them much thinner. This is to support fast parse-tree navigation. 2015-01-31 13:42:58 +11:00
Matthew Honnibal
88170e6295 * Supply dep_strings as a tuple, for the changed API on Tokens 2015-01-31 13:42:09 +11:00
Matthew Honnibal
0981d68022 * Set a sent_end flag during parsing, for later use 2015-01-31 13:41:46 +11:00
Matthew Honnibal
251dbf24d7 * Fix uninitialised variable error 2015-01-30 20:46:34 +11:00
Matthew Honnibal
83a4df5a1a * Fix download script 2015-01-30 20:40:42 +11:00
Matthew Honnibal
6f9ebc2f34 * Fix download script 2015-01-30 20:33:19 +11:00
Matthew Honnibal
8b85d0bb8a * Only download small data if no data dir exists 2015-01-30 20:27:14 +11:00
Matthew Honnibal
1a7a1c2771 * Fix Issue #16: tokens recurse when printing 2015-01-30 19:47:50 +11:00
Matthew Honnibal
cb95ef6934 * Fix download script 2015-01-30 19:28:43 +11:00
Matthew Honnibal
e578bd37bd * Fix download script 2015-01-30 18:59:31 +11:00
Matthew Honnibal
df52014d12 * Fix download script 2015-01-30 18:36:24 +11:00
Matthew Honnibal
0f95712189 * Improve accuracy reporting during training 2015-01-30 18:05:06 +11:00
Matthew Honnibal
b68f563c2f * Fix Issue #14: Improve parsing API 2015-01-30 18:04:41 +11:00
Matthew Honnibal
998b607f65 * Upd download script, having it download all data if there's no data/ directory, allowing easier compilation from source 2015-01-30 18:04:01 +11:00
Matthew Honnibal
67d6e53a69 * Ensure parser and tagger function correctly when training from missing values, indicated by -1 2015-01-30 14:08:56 +11:00
Matthew Honnibal
4ff180db74 * Fix off-by-one error in commit 0a7fceb 2015-01-30 12:49:33 +11:00
Matthew Honnibal
0a7fcebdf7 * Fix Issue #12: Incorrect token.idx calculations for some punctuation, in the presence of token cache 2015-01-30 12:33:38 +11:00
Matthew Honnibal
ebf7d2fab1 * Use non-joint sbd, for more simplicity and fewer classes 2015-01-29 06:22:03 +11:00
Matthew Honnibal
d05c5bf141 * Remove comment 2015-01-29 05:19:27 +11:00
Matthew Honnibal
320b045daa * Oracle now consistent over gold standard derivation 2015-01-29 03:41:58 +11:00
Matthew Honnibal
f590382134 * Work on sbd 2015-01-29 03:18:29 +11:00
Matthew Honnibal
1884a7a0be * Attach comment with paper 2015-01-28 03:18:43 +11:00
Matthew Honnibal
a2d6b195db * Add messy Break transitions, carefully following the scheme of Dd Zhang et al (2013) 2015-01-28 03:09:45 +11:00
Matthew Honnibal
f9ee5d9934 * Build a python list of word strings, for debugging 2015-01-28 01:06:13 +11:00
Matthew Honnibal
d819101571 * Improve error message on oracle failure 2015-01-28 00:58:03 +11:00
Matthew Honnibal
e6c3d3471f * Tweak documentation for Tokens, and hide constructor as __cinit__ 2015-01-27 18:57:52 +11:00
Matthew Honnibal
c38c62d4a3 * Add docstring to English class 2015-01-27 02:45:21 +11:00
Matthew Honnibal
d4c99f7dec * Add attrs.pxd 2015-01-26 22:22:09 +11:00
Matthew Honnibal
d4a493855e * Fix error msg 2015-01-25 23:01:30 +11:00
Matthew Honnibal
7f87716cf7 * Fix download script 2015-01-25 23:01:10 +11:00
Matthew Honnibal
92fb9257dd * Add parts-of-speech file 2015-01-25 22:00:39 +11:00
Matthew Honnibal
c1c3dba4cb * Check whether vector files are present before trying to load them. 2015-01-25 18:16:48 +11:00
Matthew Honnibal
5049d4c2e6 * Add parts_of_speech.pyx 2015-01-25 16:32:26 +11:00
Matthew Honnibal
12b034e3ef * Move POS tag definitions to parts_of_speech.pxd 2015-01-25 16:31:07 +11:00
Matthew Honnibal
7431c133d8 * Add error if try to access head and not is_parsed 2015-01-25 15:33:54 +11:00
Matthew Honnibal
951d06c824 * Silently don't parse if data is not present 2015-01-25 14:47:38 +11:00
Matthew Honnibal
4e857ab7a6 * Fix bug in POS tagger feature 2015-01-25 02:20:15 +11:00
Matthew Honnibal
dd56e298e2 * Ensure tagging is applied if parse=True 2015-01-25 02:19:44 +11:00
Matthew Honnibal
94750819cd * Set parse=True by default --- i.e. parse unless told not to. 2015-01-25 01:28:28 +11:00
Matthew Honnibal
71b95202eb * Add docstring to StringStore 2015-01-24 20:49:15 +11:00
Matthew Honnibal
6d1c08dafd * Add docstring to Lexeme 2015-01-24 20:48:34 +11:00
Matthew Honnibal
a97bed9359 * Fix POS and dependency label tag names. Add parse and string navigation functions. 2015-01-24 17:29:04 +11:00
Matthew Honnibal
76cd024095 * Add whitespace property to Token 2015-01-24 07:41:21 +11:00
Matthew Honnibal
5fd72bc220 * Have 'string' refer to the whitespace-padded string 2015-01-24 07:32:38 +11:00
Matthew Honnibal
fda94271af * Rename NORM1 and NORM2 attrs to lower and norm 2015-01-24 06:17:03 +11:00
Matthew Honnibal
5ed8b2b98f * Rename sic to orth 2015-01-23 02:08:25 +11:00
Matthew Honnibal
a27b23cc8f * Have SBD return start/end indices 2015-01-22 22:24:44 +11:00
Matthew Honnibal
d460c28838 * Rename vec to repvec 2015-01-22 02:06:22 +11:00
Matthew Honnibal
8b9d913d97 * Rename vec to repvec 2015-01-22 02:05:58 +11:00
Matthew Honnibal
9cd0b6b3e9 * Various tweaks to Tokens class 2015-01-22 02:05:37 +11:00
Matthew Honnibal
5928d158ce * Pass the string to Tokens 2015-01-22 02:04:58 +11:00
Matthew Honnibal
45264e356b * Rename vec to repvec 2015-01-22 02:04:24 +11:00
Matthew Honnibal
5e63c606ad * Rename vec to repvec 2015-01-22 02:03:54 +11:00
Matthew Honnibal
56e6cf0672 * Add _string attr to Tokens object 2015-01-21 18:57:09 +11:00
Matthew Honnibal
d6ac60e91c * Bug fixes to sentences method, and improved vector transport for tokens 2015-01-21 18:56:32 +11:00
Matthew Honnibal
f2a229136c * Fix data_dir=None argument to English class 2015-01-21 18:27:31 +11:00
Matthew Honnibal
ef49b8c179 * Add stop-word flag 2015-01-21 18:22:31 +11:00
Matthew Honnibal
6646bfc5df * Add LOWER attr 2015-01-21 18:19:08 +11:00
Matthew Honnibal
f149259bf5 * Fix negative indices in tokens 2015-01-20 01:16:29 +11:00
Matthew Honnibal
b65b0c07bf * Messily hook up vector in tokens 2015-01-19 19:59:55 +11:00
Matthew Honnibal
8ff5b8bd84 * Add attribute for POS scheme 2015-01-17 17:33:16 +11:00
Matthew Honnibal
6c7e44140b * Work on word vectors, and other stuff 2015-01-17 16:21:17 +11:00
Matthew Honnibal
802867e96a * Revise interface to Token. Strings now have attribute names like norm1_ 2015-01-15 03:51:47 +11:00
Matthew Honnibal
7d3c40de7d * Tests passing after refactor. API has obvious warts, particularly in Token and Lexeme 2015-01-15 00:33:16 +11:00
Matthew Honnibal
0930892fc1 * Tmp. Working on refactor. Compiles, must hook up lexical feats. 2015-01-14 00:03:48 +11:00
Matthew Honnibal
46da3d74d2 * Tmp. Refactoring, introducing a Lexeme PyObject. 2015-01-12 11:23:44 +11:00
Matthew Honnibal
ce2edd6312 * Tmp commit. Refactoring to create a Python Lexeme class. 2015-01-12 10:26:22 +11:00
Matthew Honnibal
aacaf1a0f0 * Fix parser 2015-01-08 01:19:23 +11:00
Matthew Honnibal
9a21127bf7 * Fix parser, which was importing the wrong model 2015-01-08 00:10:15 +11:00
Matthew Honnibal
6a3e39cdd1 * Add typedefs.pyx 2015-01-06 04:51:40 +11:00
Matthew Honnibal
a58920cc5e * Import orth.word_shape as a C module 2015-01-06 03:18:22 +11:00
Matthew Honnibal
6b68f7ef75 * Finally get string types right for orth function 2015-01-06 03:17:39 +11:00
Matthew Honnibal
90c143bd85 * Fix orth import 2015-01-05 18:49:19 +11:00
Matthew Honnibal
7689dccd0f * Remove unused import 2015-01-05 18:48:48 +11:00
Matthew Honnibal
3f1944d688 * Make PyPy work 2015-01-05 17:54:38 +11:00
Matthew Honnibal
a510d9f677 * Another assertion removed 2015-01-05 13:01:40 +11:00
Matthew Honnibal
2856946a66 * Remove assertion that doesn't work on Python 3 2015-01-05 12:51:16 +11:00
Matthew Honnibal
94034f1112 * Fix encoding in lemmatization 2015-01-05 11:54:29 +11:00
Matthew Honnibal
b132b3caa6 * Fix unicode error in lemmatizer 2015-01-05 11:53:54 +11:00
Matthew Honnibal
477e7fbffe * Fix data reading for lemmatizer 2015-01-05 06:01:32 +11:00
Matthew Honnibal
58f75abaca * Fix unicode error in orth 2015-01-05 05:53:08 +11:00
Matthew Honnibal
4e085d5166 * Fix lemmatizer for Python3 2015-01-05 05:51:26 +11:00
Matthew Honnibal
ae7c811fd1 * Use Exception instead of StandardError 2015-01-04 01:22:12 +11:00
Matthew Honnibal
0e4c2ba036 * Fix loading of special morph words 2015-01-03 23:13:00 +11:00
Matthew Honnibal
f5d41028b5 * Move around data files for test release 2015-01-03 01:59:22 +11:00
Matthew Honnibal
a24321b63a * Add downloader 2015-01-02 21:44:41 +11:00
Matthew Honnibal
5d9a096e2f * Some minor clean-up after HastyModel 2014-12-31 19:46:04 +11:00
Matthew Honnibal
aafaf58cbe * Refactor _ml.Model, and finish implementing HastyModel. So far not worthwhile. 2014-12-31 19:40:59 +11:00
Matthew Honnibal
bcd038e7b6 * Implement HastyModel 2014-12-31 01:16:47 +11:00
Matthew Honnibal
1a075f77ff * Don't over-ride pre-loaded POS tags, if set by special-cases 2014-12-30 23:26:32 +11:00
Matthew Honnibal
785c7ba76a * Embed signature on attrs 2014-12-30 23:25:31 +11:00
Matthew Honnibal
30e5805656 * Lazy-load tagger and parser 2014-12-30 23:25:09 +11:00
Matthew Honnibal
9976aa976e * Messily fix morphology and POS tags on special tokens. 2014-12-30 23:24:37 +11:00
Matthew Honnibal
c1ef3febee * Embedsignature in tokens.pyx 2014-12-30 21:22:00 +11:00
Matthew Honnibal
aac5028b6e * Move tagger to _ml 2014-12-30 21:21:38 +11:00
Matthew Honnibal
1ffb0229ed * Import tokens in parser.pxd 2014-12-30 21:21:17 +11:00
Matthew Honnibal
bb0b00f819 * Repurpose the Tagger class as a generic Model, wrapping thinc's interface 2014-12-30 21:20:15 +11:00
Matthew Honnibal
fe2a5e0370 * Work on docstrings 2014-12-27 21:46:04 +11:00
Matthew Honnibal
bb80937544 * Upd docstrings 2014-12-27 18:45:16 +11:00
Matthew Honnibal
b8b65903fc * Tmp 2014-12-24 17:42:00 +11:00
Matthew Honnibal
ab61673edd * Fix api of array method 2014-12-23 15:18:48 +11:00
Matthew Honnibal
7708d0e24a * Move lemmatizer to en dir 2014-12-23 15:16:57 +11:00
Matthew Honnibal
98eb4c0426 * Fix path to parser model 2014-12-23 15:09:09 +11:00
Matthew Honnibal
b00bc01d8c * All tests now passing for reorg 2014-12-23 13:18:59 +11:00
Matthew Honnibal
73f200436f * Tests passing except for morphology/lemmatization stuff 2014-12-23 11:40:32 +11:00
Matthew Honnibal
cf8d26c3d2 * POS tagger training working after reorg 2014-12-22 08:54:47 +11:00
Matthew Honnibal
4c4aa2c5c9 * Work on train 2014-12-22 07:25:43 +11:00
Matthew Honnibal
61df50b598 * Add English-subclass POS tagger 2014-12-21 20:59:07 +11:00
Matthew Honnibal
9f3f07cab6 * Add attrs file for English 2014-12-21 11:29:11 +11:00
Matthew Honnibal
2a89d70429 * Add vocab.pyx to setup, and ensure we can import spacy.en.lang 2014-12-21 06:03:53 +11:00
Matthew Honnibal
b34a1325d3 * Everything compiling after reorg. About to start testing. 2014-12-21 05:42:23 +11:00
Matthew Honnibal
e1c1a4b868 * Tmp 2014-12-21 05:36:29 +11:00
Matthew Honnibal
d11c1edf8c * Import slice_unicode from strings.pyx 2014-12-20 07:56:26 +11:00
Matthew Honnibal
be1bdcbd85 * Move lang.pyx to tokenizer.pyx 2014-12-20 07:55:40 +11:00
Matthew Honnibal
89a1cc1a48 * Move murmurhash to .pxd in strings file 2014-12-20 07:41:08 +11:00
Matthew Honnibal
d5a942c4a4 * Rename lang.pyx to tokenizer.pyx 2014-12-20 07:30:39 +11:00
Matthew Honnibal
a60ae261ae * Move tokenizer to its own file, and refactor 2014-12-20 07:29:16 +11:00
Matthew Honnibal
867a4a000c * Export set_morph_from_dict function 2014-12-20 07:28:27 +11:00
Matthew Honnibal
4e30195c6d * Refactor morphology.pyx 2014-12-20 07:27:28 +11:00
Matthew Honnibal
4c6ce7ee84 * Update tokens.pyx as part of reorg 2014-12-20 07:03:26 +11:00
Matthew Honnibal
116f7f3bc1 * Rename Lexicon to Vocab, and move it to its own file 2014-12-20 06:54:03 +11:00
Matthew Honnibal
780cbd68b1 * Move all struct definitions to structs.pxd, to avoid circular dependencies 2014-12-20 06:51:33 +11:00
Matthew Honnibal
f6556d8e5d * Refactor, move Lexeme struct to structs.pxd 2014-12-20 06:51:03 +11:00
Matthew Honnibal
7d48bba6c4 * Move StringStore class to its own file 2014-12-20 06:42:01 +11:00
Matthew Honnibal
b066102d2d * Remove POS cache for now 2014-12-20 03:49:58 +11:00
Matthew Honnibal
ff252dd535 * Clean up 'guess_cache' idea, which didn't work well enough 2014-12-20 03:49:11 +11:00
Matthew Honnibal
9d3ca13909 * Start work on parse-tree iteration classes 2014-12-20 03:48:10 +11:00
Matthew Honnibal
bed680c632 * Remove commented-out features 2014-12-20 03:47:32 +11:00
Matthew Honnibal
3d178c03ae * Prune the features a bit 2014-12-20 02:46:14 +11:00
Matthew Honnibal
a0408e1758 * Working DecisionMemory class 2014-12-20 01:43:26 +11:00
Matthew Honnibal
7920ea72b4 * Working parser with the decision memory idea. Disabling that for now, for simplicity 2014-12-20 01:43:15 +11:00
Matthew Honnibal
a2f2a48da9 * Add some extra features 2014-12-20 01:42:24 +11:00
Matthew Honnibal
8fd9762d91 * Start laying out parse tree iteration methods 2014-12-20 01:42:09 +11:00
Matthew Honnibal
53b8bc1f3c * Work on implementing a trainable cache for the parser. So far, doesn't improve efficiency 2014-12-19 09:30:50 +11:00
Matthew Honnibal
033d6c9ac2 * Adapt POS tagger decision-memory for use in parser 2014-12-19 07:23:04 +11:00
Matthew Honnibal
809ddf7887 * Add index.pxd 2014-12-19 07:23:00 +11:00
Matthew Honnibal
1879abd16a * Set const-correctness for tagger 2014-12-18 20:41:52 +11:00
Matthew Honnibal
f72243b156 * Set const-correctness for Feature* array 2014-12-18 20:41:32 +11:00
Matthew Honnibal
6ab7e40590 * Add non-monotonic parsing with cost-sensitive update. 92.26 on Y&M set 2014-12-18 11:33:25 +11:00
Matthew Honnibal
7e0c692daf * Automatically push when the stack is empty 2014-12-18 09:16:10 +11:00
Matthew Honnibal
61142a8eff * Tweak features 2014-12-18 09:15:03 +11:00
Matthew Honnibal
8446ebfbbb * Work on parser. Up to 92 UAS on YM labels 2014-12-18 09:05:31 +11:00
Matthew Honnibal
55de747bfc * Remove .cpp files 2014-12-18 02:43:13 +11:00
Matthew Honnibal
4448a840f7 * Work on greedy parsing. Scoring about 91.2 2014-12-18 02:42:55 +11:00
Matthew Honnibal
87e9487d76 * Work on parser 2014-12-17 21:10:12 +11:00
Matthew Honnibal
9d7d97978d * Work on greedy parser 2014-12-17 21:09:29 +11:00
Matthew Honnibal
d524dd306a * Work on greedy parser 2014-12-17 03:19:43 +11:00
Matthew Honnibal
95ccea03b2 * Work on greedy parser 2014-12-16 22:46:55 +11:00
Matthew Honnibal
a432862fde * Add exception type to _arg_max_among in tagger 2014-12-16 09:44:19 +11:00
Matthew Honnibal
9e00798820 * Work on integrating a greedy dependency parser 2014-12-16 08:06:04 +11:00
Matthew Honnibal
792802b2b9 * POS tag memoisation working, with good speed-up 2014-12-12 14:33:51 +11:00
Matthew Honnibal
ca54d58638 * Merge setup.py 2014-12-10 15:21:27 +11:00
Matthew Honnibal
9959a64f7b * Working morphology and lemmatisation. POS tagging quite fast. 2014-12-10 08:09:32 +11:00
Matthew Honnibal
df3be14987 * Add pos_type features to POS tagger 2014-12-10 08:08:55 +11:00
Matthew Honnibal
42973c4b37 * Improve efficiency of tagger, and improve morphological processing 2014-12-10 01:02:04 +11:00
Matthew Honnibal
6b34a2f34b * Move morphological analysis into its own module, morphology.pyx 2014-12-09 21:16:17 +11:00
Matthew Honnibal
b962fe73d7 * Make suffixes file use full-power regex, so that we can handle periods properly 2014-12-09 19:04:27 +11:00
Matthew Honnibal
accdbe989b * Remove Tokens.extend method 2014-12-09 17:09:23 +11:00
Matthew Honnibal
495e1c7366 * Use fused type in Tokens.push_back, simplifying the use of the cache 2014-12-09 16:50:01 +11:00
Matthew Honnibal
302e09018b * Work on fixing special-cases, reading them in as JSON objects so that they can specify lemmas 2014-12-09 14:48:01 +11:00
Matthew Honnibal
99bbbb6feb * Work on morphological processing 2014-12-08 21:12:15 +11:00
Matthew Honnibal
7b68f911cf * Add WordNet lemmatizer 2014-12-08 01:39:13 +11:00
Matthew Honnibal
c20dd79748 * Fiddle with const correctness and comments 2014-12-08 00:03:55 +11:00
Matthew Honnibal
b031c7c430 * Remove language-general context module 2014-12-07 23:53:01 +11:00
Matthew Honnibal
ef4398b204 * Rearrange POS stuff, so that language-specific stuff can live in language-specific modules 2014-12-07 23:52:41 +11:00
Matthew Honnibal
327383e38a * Remove unused code in tagger.pyx 2014-12-07 22:16:17 +11:00
Matthew Honnibal
9f17467c2e * Fix EMPTY_TOKEN 2014-12-07 22:07:41 +11:00
Matthew Honnibal
3819a88e1b * Add support for tag dictionary, and fix error-code for predict method 2014-12-07 22:07:16 +11:00
Matthew Honnibal
f00afe12c4 * Load POS tagger in load() function if path exists 2014-12-07 22:05:57 +11:00
Matthew Honnibal
5fe5e6e66b * Move context functions to header, inlining them. 2014-12-07 21:59:04 +11:00
Matthew Honnibal
5caabec789 * Link in tagger, to work on integrating POS tagging 2014-12-07 15:29:41 +11:00
Matthew Honnibal
0c7aeb9de7 * Begin revising tagger, focussing on POS tagging 2014-12-07 15:29:04 +11:00
Matthew Honnibal
f5c4f2eb52 * Revise context, focussing on POS tagging for now 2014-12-07 15:28:22 +11:00
Matthew Honnibal
e27b912ef9 * Remove need for confusing _data pointer to be stored on Tokens 2014-12-05 16:31:30 +11:00
Matthew Honnibal
1c9253701d * Introduce a TokenC struct, to handle token indices, pos tags and sense tags 2014-12-05 15:56:14 +11:00
Matthew Honnibal
187372c7f3 * Allow the lexicon to create lexemes using an external memory pool, so that it can decide to make some lexemes temporary, rather than cached 2014-12-05 03:29:50 +11:00