Commit Graph

107 Commits

Author SHA1 Message Date
Matthew Honnibal
38ca0c33f5 Merge branch 'neuralnet' into refactor
Mostly refactors parser, to use new thinc3.2 Example class.
Aim is to remove use of shared memory, so that we can parallelize
over documents easily.

Conflicts:
	setup.py
	spacy/syntax/parser.pxd
	spacy/syntax/parser.pyx
	spacy/syntax/stateclass.pyx
2015-07-14 14:13:47 +02:00
Matthew Honnibal
39c93116eb * Add get_freqs script 2015-07-14 02:31:32 +02:00
Matthew Honnibal
62cfcd76fe * Add supersense sets to lexemes, from WordNet. Look-up via lemmatization. 2015-07-01 18:48:59 +02:00
Matthew Honnibal
31b5e58aeb * Begin reorganizing neuralnet work 2015-06-30 14:26:53 +02:00
Matthew Honnibal
1135cfe50a * Tidy nn_train a bit 2015-06-29 16:45:14 +02:00
Matthew Honnibal
df8179ca4f * Add separate Param and AdadeltaParam classes. AdadeltaParam seems broken. 2015-06-29 16:39:16 +02:00
Matthew Honnibal
1dff04acb5 * Apply regularization to the softmax, not the bias 2015-06-29 11:45:38 +02:00
Matthew Honnibal
ca30fe1582 * Use He initialization trick 2015-06-29 10:56:02 +02:00
Matthew Honnibal
fc34e1b6e4 * Move Theano functions into nn_train.py script 2015-06-29 07:09:16 +02:00
Matthew Honnibal
fe7b24ecef * whitespace 2015-06-28 11:37:17 +02:00
Matthew Honnibal
7b8275fcc4 * Wire hyperparameters to script interface 2015-06-28 11:37:17 +02:00
Matthew Honnibal
897dd0dd0b * Merge changes, and adjust Example to use memoryview 2015-06-28 11:36:11 +02:00
Matthew Honnibal
ef97b90833 * Fix token scoring 2015-06-28 06:22:18 +02:00
Matthew Honnibal
34c0ef2ee8 * Don't compile the orig_arc_eager and tree_arc_eager modules used for the EMNLP paper 2015-06-23 05:38:17 +02:00
Matthew Honnibal
59e9f9153c * Remove projectivity constraint in train.py, but raise Exception if non-projective sentence is encountered, since we've told GoldParse to projectivize 2015-06-23 05:04:46 +02:00
Matthew Honnibal
839e5038b7 * Raise exception on non-projective input 2015-06-23 00:01:55 +02:00
Matthew Honnibal
4dad4058c3 * Uncomment NER training 2015-06-16 23:36:54 +02:00
Matthew Honnibal
5699585278 * Use tree_arc_eager system as baseline in experiments 2015-06-15 08:23:43 +02:00
Matthew Honnibal
4841f8ad5e * Set transition system early 2015-06-15 02:54:12 +02:00
Matthew Honnibal
bcfdf126a4 * Add toggle for OrigArcEager system 2015-06-14 20:28:14 +02:00
Matthew Honnibal
c500d72dc2 * Temporarily disable NER, and wire up the verbose flag during training 2015-06-14 17:45:31 +02:00
Matthew Honnibal
ac422492cf * Fix write_parses mode of bin/parser/train.py 2015-06-07 19:08:48 +02:00
Matthew Honnibal
4073533e28 * Upd munge_ewtb for the new json format 2015-06-06 02:10:33 +02:00
Matthew Honnibal
6a1341b29e * Add tb pre-process script 2015-06-06 01:59:44 +02:00
Matthew Honnibal
1736fc5a67 * Add more options to bin/parser/train 2015-06-05 23:49:26 +02:00
Matthew Honnibal
362f87dc3a * Update input corruption method to work with lists as well as strings 2015-06-05 19:33:32 +02:00
Matthew Honnibal
0aed9c9a33 * Fix train.py 2015-06-05 15:50:24 +02:00
Matthew Honnibal
8466600add * Clean up train.py, removing unused tag jackknifing code 2015-06-05 15:01:28 +02:00
Matthew Honnibal
e772b48dcd * Skip sentences of length 1 in training 2015-06-05 02:29:03 +02:00
Matthew Honnibal
e822df0867 * Fix bugs in new greedy/beam parser 2015-06-02 02:01:33 +02:00
Matthew Honnibal
70a7ad89ca * Removed unused imports from train.py 2015-06-02 00:59:09 +02:00
Matthew Honnibal
a3de20118e * Wire up beam-width command line argument 2015-06-02 00:54:12 +02:00
Matthew Honnibal
08044ea70c * Remove try/except around parser.train 2015-05-31 15:21:56 +02:00
Matthew Honnibal
c8a553fe91 * Fix cluster initialization 2015-05-31 15:21:28 +02:00
Matthew Honnibal
d7cc2338e7 * Fix bug in train.py 2015-05-31 06:49:06 +02:00
Matthew Honnibal
c037f80638 * Add case expansion to Brown clusters 2015-05-31 05:50:50 +02:00
Matthew Honnibal
5ab0f233a1 * Ensure words in Brown clusters make it into the vocab, even if they're not in our probs list 2015-05-31 05:46:16 +02:00
Matthew Honnibal
d42dda0372 * Shuffle docs before doing jackknife partition --- otherwise we'll not get the right genre mixes... 2015-05-31 01:25:02 +02:00
Matthew Honnibal
4d8d490547 * Exclude empty sentences in prepare_treebank 2015-05-31 01:12:46 +02:00
Matthew Honnibal
d512d20d81 * Allow parser to jackknife POS tags before training. 2015-05-31 01:11:11 +02:00
Matthew Honnibal
6bbdcc5db5 * Fix gold_preproc flag in train.py 2015-05-30 05:23:02 +02:00
Matthew Honnibal
76300bbb1b * Use updated JSON format, with sentences below paragraphs. Allows use of gold preprocessing flag. 2015-05-30 01:25:46 +02:00
Matthew Honnibal
2d11739f28 * Change data format of JSON corpus, putting sentences into lists with the paragraph 2015-05-30 01:25:00 +02:00
Matthew Honnibal
784e577f45 * Check NER length matches conll length in prepare_treebank 2015-05-29 03:54:06 +02:00
Matthew Honnibal
b76bbbd12c * Read json files recursively from a directory, instead of requiring a single .json file 2015-05-29 03:52:55 +02:00
Matthew Honnibal
ef67ef7a4c * Recomment in training in train.py 2015-05-28 22:40:26 +02:00
Matthew Honnibal
5eb64eeb11 * Print json treebank by genre, instead of by large file 2015-05-28 22:40:01 +02:00
Matthew Honnibal
f42dc1f7d8 * Fix evaluate method in train.py, to use sentences which don't have raw text 2015-05-28 16:30:23 +02:00
Matthew Honnibal
a7cee46fe9 * Update train.py, to support paragraphs where there's no raw_text 2015-05-27 19:14:02 +02:00
Matthew Honnibal
ef1333cf89 * Have prepare_treebank read train/dev/test IDs. 2015-05-27 17:35:05 +02:00