Commit Graph

357 Commits

Author SHA1 Message Date
Matthew Honnibal
e1c1a4b868 * Tmp 2014-12-21 05:36:29 +11:00
Matthew Honnibal
d11c1edf8c * Import slice_unicode from strings.pyx 2014-12-20 07:56:26 +11:00
Matthew Honnibal
be1bdcbd85 * Move lang.pyx to tokenizer.pyx 2014-12-20 07:55:40 +11:00
Matthew Honnibal
89a1cc1a48 * Move murmurhash to .pxd in strings file 2014-12-20 07:41:08 +11:00
Matthew Honnibal
d5a942c4a4 * Rename lang.pyx to tokenizer.pyx 2014-12-20 07:30:39 +11:00
Matthew Honnibal
a60ae261ae * Move tokenizer to its own file, and refactor 2014-12-20 07:29:16 +11:00
Matthew Honnibal
867a4a000c * Export set_morph_from_dict function 2014-12-20 07:28:27 +11:00
Matthew Honnibal
4e30195c6d * Refactor morphology.pyx 2014-12-20 07:27:28 +11:00
Matthew Honnibal
4c6ce7ee84 * Update tokens.pyx as part of reorg 2014-12-20 07:03:26 +11:00
Matthew Honnibal
116f7f3bc1 * Rename Lexicon to Vocab, and move it to its own file 2014-12-20 06:54:03 +11:00
Matthew Honnibal
780cbd68b1 * Move all struct definitions to structs.pxd, to avoid circular dependencies 2014-12-20 06:51:33 +11:00
Matthew Honnibal
f6556d8e5d * Refactor, move Lexeme struct to structs.pxd 2014-12-20 06:51:03 +11:00
Matthew Honnibal
7d48bba6c4 * Move StringStore class to its own file 2014-12-20 06:42:01 +11:00
Matthew Honnibal
b066102d2d * Remove POS cache for now 2014-12-20 03:49:58 +11:00
Matthew Honnibal
ff252dd535 * Clean up 'guess_cache' idea, which didnt work well enough 2014-12-20 03:49:11 +11:00
Matthew Honnibal
9d3ca13909 * Start work on parse-tree iteration classes 2014-12-20 03:48:10 +11:00
Matthew Honnibal
bed680c632 * Remove commented-out features 2014-12-20 03:47:32 +11:00
Matthew Honnibal
3d178c03ae * Prune the features a bit 2014-12-20 02:46:14 +11:00
Matthew Honnibal
a0408e1758 * Working DecisionMemory class 2014-12-20 01:43:26 +11:00
Matthew Honnibal
7920ea72b4 * Working parser with the decision memory idea. Disabling that for now, for simplicity 2014-12-20 01:43:15 +11:00
Matthew Honnibal
a2f2a48da9 * Add some extra features 2014-12-20 01:42:24 +11:00
Matthew Honnibal
8fd9762d91 * Start laying out parse tree iteration methods 2014-12-20 01:42:09 +11:00
Matthew Honnibal
53b8bc1f3c * Work on implementing a trainable cache for the parser. So far, doesn't improve efficiency 2014-12-19 09:30:50 +11:00
Matthew Honnibal
033d6c9ac2 * Adapt POS tagger decision-memory for use in parser 2014-12-19 07:23:04 +11:00
Matthew Honnibal
809ddf7887 * Add index.pxd 2014-12-19 07:23:00 +11:00
Matthew Honnibal
1879abd16a * Set const-correctness for tagger 2014-12-18 20:41:52 +11:00
Matthew Honnibal
f72243b156 * Set const-correctness for Feature* array 2014-12-18 20:41:32 +11:00
Matthew Honnibal
6ab7e40590 * Add non-monotonic parsing with cost-sensitive update. 92.26 on Y&M set 2014-12-18 11:33:25 +11:00
Matthew Honnibal
7e0c692daf * Automatically push when the stack is empty 2014-12-18 09:16:10 +11:00
Matthew Honnibal
61142a8eff * Tweak features 2014-12-18 09:15:03 +11:00
Matthew Honnibal
8446ebfbbb * Work on parser. Up to 92 UAS on YM labels 2014-12-18 09:05:31 +11:00
Matthew Honnibal
55de747bfc * Remove .cpp files 2014-12-18 02:43:13 +11:00
Matthew Honnibal
4448a840f7 * Work on greedy parsing. Scoring about 91.2 2014-12-18 02:42:55 +11:00
Matthew Honnibal
87e9487d76 * Work on parser 2014-12-17 21:10:12 +11:00
Matthew Honnibal
9d7d97978d * Work on greedy parser 2014-12-17 21:09:29 +11:00
Matthew Honnibal
d524dd306a * Work on greedy parser 2014-12-17 03:19:43 +11:00
Matthew Honnibal
95ccea03b2 * Work on greedy parser 2014-12-16 22:46:55 +11:00
Matthew Honnibal
a432862fde * Add exception type to _arg_max_among in tagger 2014-12-16 09:44:19 +11:00
Matthew Honnibal
9e00798820 * Work on integrating a greedy dependency parser 2014-12-16 08:06:04 +11:00
Matthew Honnibal
792802b2b9 * POS tag memoisation working, with good speed-up 2014-12-12 14:33:51 +11:00
Matthew Honnibal
ca54d58638 * Merge setup.py 2014-12-10 15:21:27 +11:00
Matthew Honnibal
9959a64f7b * Working morphology and lemmatisation. POS tagging quite fast. 2014-12-10 08:09:32 +11:00
Matthew Honnibal
df3be14987 * Add pos_type features to POS tagger 2014-12-10 08:08:55 +11:00
Matthew Honnibal
42973c4b37 * Improve efficiency of tagger, and improve morphological processing 2014-12-10 01:02:04 +11:00
Matthew Honnibal
6b34a2f34b * Move morphological analysis into its own module, morphology.pyx 2014-12-09 21:16:17 +11:00
Matthew Honnibal
b962fe73d7 * Make suffixes file use full-power regex, so that we can handle periods properly 2014-12-09 19:04:27 +11:00
Matthew Honnibal
accdbe989b * Remove Tokens.extend method 2014-12-09 17:09:23 +11:00
Matthew Honnibal
495e1c7366 * Use fused type in Tokens.push_back, simplifying the use of the cache 2014-12-09 16:50:01 +11:00
Matthew Honnibal
302e09018b * Work on fixing special-cases, reading them in as JSON objects so that they can specify lemmas 2014-12-09 14:48:01 +11:00
Matthew Honnibal
99bbbb6feb * Work on morphological processing 2014-12-08 21:12:15 +11:00
Matthew Honnibal
7b68f911cf * Add WordNet lemmatizer 2014-12-08 01:39:13 +11:00
Matthew Honnibal
c20dd79748 * Fiddle with const correctness and comments 2014-12-08 00:03:55 +11:00
Matthew Honnibal
b031c7c430 * Remove language-general context module 2014-12-07 23:53:01 +11:00
Matthew Honnibal
ef4398b204 * Rearrange POS stuff, so that language-specific stuff can live in language-specific modules 2014-12-07 23:52:41 +11:00
Matthew Honnibal
327383e38a * Remove unused code in tagger.pyx 2014-12-07 22:16:17 +11:00
Matthew Honnibal
9f17467c2e * Fix EMPTY_TOKEN 2014-12-07 22:07:41 +11:00
Matthew Honnibal
3819a88e1b * Add support for tag dictionary, and fix error-code for predict method 2014-12-07 22:07:16 +11:00
Matthew Honnibal
f00afe12c4 * Load POS tagger in load() function if path exists 2014-12-07 22:05:57 +11:00
Matthew Honnibal
5fe5e6e66b * Move context functions to header, inlining them. 2014-12-07 21:59:04 +11:00
Matthew Honnibal
5caabec789 * Link in tagger, to work on integrating POS tagging 2014-12-07 15:29:41 +11:00
Matthew Honnibal
0c7aeb9de7 * Begin revising tagger, focussing on POS tagging 2014-12-07 15:29:04 +11:00
Matthew Honnibal
f5c4f2eb52 * Revise context, focussing on POS tagging for now 2014-12-07 15:28:22 +11:00
Matthew Honnibal
e27b912ef9 * Remove need for confusing _data pointer to be stored on Tokens 2014-12-05 16:31:30 +11:00
Matthew Honnibal
1c9253701d * Introduce a TokenC struct, to handle token indices, pos tags and sense tags 2014-12-05 15:56:14 +11:00
Matthew Honnibal
187372c7f3 * Allow the lexicon to create lexemes using an external memory pool, so that it can decide to make some lexemes temporary, rather than cached 2014-12-05 03:29:50 +11:00
Matthew Honnibal
75b8dfb348 * Remove upper_pc from lexeme.pyx 2014-12-04 22:14:34 +11:00
Matthew Honnibal
49f3780ff5 * Fiddle with lexeme attrs 2014-12-04 21:22:38 +11:00
Matthew Honnibal
564082e48e * Hack Token class to take lex.dense inplace of the old lex.norm. This needs to be fixed... 2014-12-04 20:51:29 +11:00
Matthew Honnibal
69bb022204 * Add as_array and count_by method 2014-12-04 20:46:55 +11:00
Matthew Honnibal
e1b1f45cc9 * Add STEM attribute to lexeme 2014-12-04 20:46:20 +11:00
Matthew Honnibal
d7952634ca * Make the string-store serve const pointers to Utf8Str 2014-12-03 16:01:47 +11:00
Matthew Honnibal
7e04c22f8f * const added to Lexicon interface. Seems to work. 2014-12-03 15:58:17 +11:00
Matthew Honnibal
d70d31aa45 * Introduce first attempt at const-ness 2014-12-03 15:44:25 +11:00
Matthew Honnibal
4560ada85b * Add typedef for attr_t. Change flag_t to flags_t 2014-12-03 11:06:31 +11:00
Matthew Honnibal
e600f7b327 * Move String struct stuff into the utf8string module, from spacy.lang 2014-12-03 11:06:00 +11:00
Matthew Honnibal
e170faf5b0 * Hack Tokens to work without tagger.pyx 2014-12-03 11:05:15 +11:00
Matthew Honnibal
b463a7eb86 * Make flag-setting a language-specific thing 2014-12-03 11:04:32 +11:00
Matthew Honnibal
71b009e323 * Fix bug in refactored StringStore.__getitem__ 2014-12-03 11:02:24 +11:00
Matthew Honnibal
14097311ae * Make StringStore.__getitem__ accept unicode-typed keys. 2014-12-03 01:33:20 +11:00
Matthew Honnibal
522bb0346e * Work on get_array method of Tokens 2014-12-02 23:48:05 +11:00
Matthew Honnibal
8c2938fe01 * Rename Lexicon._dict to Lexicon._map 2014-12-02 23:46:59 +11:00
Matthew Honnibal
33dfb4933c * Remove taggers from Language class. Work on doc strings 2014-11-26 19:53:55 +11:00
Matthew Honnibal
80baa2e3db * Work on beam parser 2014-11-20 19:49:33 +11:00
Matthew Honnibal
5c3016bac8 * Tmp commit of ner code 2014-11-14 18:27:47 +11:00
Matthew Honnibal
33c421bcf8 * More feature tweaks 2014-11-12 23:59:16 +11:00
Matthew Honnibal
41dedfb14e * Add label features for NER parsing 2014-11-12 23:55:10 +11:00
Matthew Honnibal
cf55b48ba6 * Switch to predict label on shift. Big increase in accuracy. 2014-11-12 23:50:12 +11:00
Matthew Honnibal
8f84e8a78b * Neaten oracle 2014-11-12 23:38:07 +11:00
Matthew Honnibal
7e0a9077dd * Add context files 2014-11-12 23:22:36 +11:00
Matthew Honnibal
3b0b902384 * IOB-style parsing working. Accuracy down from BILOU, form 87-88 to 85-86 2014-11-12 23:21:09 +11:00
Matthew Honnibal
e6bb8aa3a9 * Move moves to bilou_moves. Refactor context, returning to the simpler giant-enum style 2014-11-12 00:54:50 +11:00
Matthew Honnibal
c788633429 * Add tokens_from_list method to Language 2014-11-11 23:43:14 +11:00
Matthew Honnibal
95282d4993 * Use the dynamic oracle 'follow' strategy 2014-11-11 21:11:17 +11:00
Matthew Honnibal
5aaf7a024d * Move ner features to ner subdir 2014-11-11 21:09:03 +11:00
Matthew Honnibal
ff8989b63c * Use greedy NER parser 2014-11-11 21:08:35 +11:00
Matthew Honnibal
0d943ab358 * Fixed greedy NER parsing. With static oracle, replicates accuracy from tagger. 2014-11-11 17:17:54 +11:00
Matthew Honnibal
399239760b * Fix moves for new State struct 2014-11-10 22:16:05 +11:00
Matthew Honnibal
82247169f2 * Implement validation and oracle on pystate, for testing 2014-11-10 22:15:32 +11:00
Matthew Honnibal
3709ed9d6d * Add curr field to State, to handle entity being built 2014-11-10 22:14:36 +11:00
Matthew Honnibal
af9ed18cf1 * Bug fixes to NER 2014-11-10 17:39:23 +11:00