spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-12 03:31:17 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	e31ef9c7f6	Add some property vars for testing	2018-04-03 15:44:31 +02:00
Matthew Honnibal	4029dc2cc7	Fix feature-flagging of Split action	2018-04-03 15:43:50 +02:00
Matthew Honnibal	6cc79fc244	Fix state array length for split	2018-04-03 15:43:23 +02:00
Matthew Honnibal	7ff4d8967f	Add file for compile-time flags in parser	2018-04-03 15:42:51 +02:00
Matthew Honnibal	88de8fe323	Fix pre-processing of more complicated heads in ArcEager	2018-04-03 01:54:21 +02:00
Matthew Honnibal	aa5ecf7fd2	Update ArcEager for changes to GoldParse class	2018-04-02 23:53:13 +02:00
Matthew Honnibal	9c3612d40b	Draft ArcEager.preprocess_gold for fused tokens	2018-04-01 22:11:35 +02:00
Matthew Honnibal	5f68e491e1	Prepare ArcEager.preprocess_gold to handle subtokens	2018-04-01 18:31:33 +02:00
Matthew Honnibal	b8461e71b7	Prepare ArcEager.preprocess_gold to handle subtokens	2018-04-01 18:03:48 +02:00
Matthew Honnibal	2d929ffc5d	Handle list-valued GoldParse values	2018-04-01 17:42:33 +02:00
Matthew Honnibal	19ac03ce09	Go back to letting Break work with deeper stacks

It seems very appealing to restrict Break so that it only works when there's one word on the stack. Then we can pop that word, mark it as the root, and continue.

However, results are suggesting it's nice to be able to predict Break when the last word of the previous sentence is on the stack, and the first word of the next sentence is at the buffer. This does make sense! Consider that the last word is often a period or something --- a pretty huge clue. We otherwise have to go out of our way to get that feature in.

The really decisive thing is we have to handle upcoming sentence breaks anyway, because we need to conform to preset SBD constraints. So, we may as well let the parser predict the Break when it's at a stack/queue position that is most revealing.	2018-04-01 14:32:15 +02:00
Matthew Honnibal	ad70b91e1e	Comment	2018-04-01 13:47:16 +02:00
Matthew Honnibal	83ca2113a2	Constrain Break action to stack depth==1	2018-04-01 13:47:02 +02:00
Matthew Honnibal	dc7f879281	Set USE_SPLIT=False feature flag	2018-04-01 13:46:25 +02:00
Matthew Honnibal	a2f07ab57f	Start sketching out Split transition implementation	2018-04-01 13:45:41 +02:00
Matthew Honnibal	5da7945917	Allocate StateC.was_split	2018-04-01 13:44:42 +02:00
Matthew Honnibal	d8dec1134c	Simplify Break transition to require stack depth 1. Hopefully as accurate	2018-04-01 12:53:25 +02:00
Matthew Honnibal	a37188fe98	Don't drop preset actions on begin_training	2018-04-01 11:46:22 +02:00
Matthew Honnibal	e887b2330e	Rewrite oracle to not use fast-forward. Seems to work?	2018-04-01 10:43:11 +02:00
Matthew Honnibal	bc2a2c81c8	Add some methods to ArcEager that make testing easier	2018-04-01 10:41:28 +02:00
Matthew Honnibal	e5ad35787c	WIP on adding split-token actions to parser

This patch starts getting the StateC object ready to split tokens. The split function is implemented by pushing indices into the buffer that indicate an out-of-length token.

Still todo:
* Update the oracles
* Update GoldParseC
* Interpret the parse once it's complete
* Add retokenizer.split() method	2018-03-31 20:05:27 +02:00
Matthew Honnibal	ab93fdf5d1	Add more state Python variables, to make testing easier	2018-03-30 13:26:08 +02:00
Matthew Honnibal	e826b85cf0	Fix state.split() function	2018-03-30 13:25:28 +02:00
Matthew Honnibal	d399843576	WIP on split parsing	2018-03-28 01:44:05 +02:00
Matthew Honnibal	18da89e04c	Handle non-callable gold_tuples in parser begin_training	2018-03-27 21:08:41 +02:00
Matthew Honnibal	1f7229f40f	Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" This reverts commit `c9ba3d3c2d`, reversing changes made to `92c26a35d4`.	2018-03-27 19:23:02 +02:00
Matthew Honnibal	f57bfbccdc	Fix non-projective label filtering	2018-03-27 13:41:33 +02:00
Matthew Honnibal	d2118792e7	Merge changes from master	2018-03-27 13:38:41 +02:00
Matthew Honnibal	25280b7013	Try to make sum_state_features faster	2018-03-27 10:08:38 +00:00
Matthew Honnibal	987e1533a4	Use 8 features in parser	2018-03-27 10:08:12 +00:00
Matthew Honnibal	dd54511c4f	Pass data as a function in begin_training methods	2018-03-27 09:39:59 +00:00
Matthew Honnibal	d9ebd78e11	Change default sizes in parser	2018-03-26 17:22:18 +02:00
Matthew Honnibal	49fbe2dfee	Use thinc.openblas in spacy.syntax.nn_parser	2018-03-20 02:22:09 +01:00
Matthew Honnibal	bede11b67c	Improve label management in parser and NER (#2108)

This patch does a few smallish things that tighten up the training workflow a little, and allow memory use during training to be reduced by letting the GoldCorpus stream data properly.

Previously, the parser and entity recognizer read and saved labels as lists, with extra labels noted separately. Lists were used because ordering is very important, to ensure that the label-to-class mapping is stable. We now manage labels as nested dictionaries, first keyed by the action, and then keyed by the label. Values are frequencies.

The trick is, how do we save new labels? We need to make sure we iterate over these in the same order they're added. Otherwise, we'll get different class IDs, and the model's predictions won't make sense. To allow stable sorting, we map the new labels to negative values. If we have two new labels, they'll be noted as having "frequency" -1 and -2. The next new label will then have "frequency" -3. When we sort by (frequency, label), we then get a stable sort.

Storing frequencies then allows us to make the next nice improvement. Previously we had to iterate over the whole training set, to pre-process it for the deprojectivisation. This led to storing the whole training set in memory. This was most of the required memory during training. To prevent this, we now store the frequencies as we stream in the data, and deprojectivize as we go. Once we've built the frequencies, we can then apply a frequency cut-off when we decide how many classes to make.

Finally, to allow proper data streaming, we also have to have some way of shuffling the iterator. This is awkward if the training files have multiple documents in them. To solve this, the GoldCorpus class now writes the training data to disk in msgpack files, one per document. We can then shuffle the data by shuffling the paths.

This is a squash merge, as I made a lot of very small commits. Individual commit messages below.

* Simplify label management for TransitionSystem and its subclasses
* Fix serialization for new label handling format in parser
* Simplify and improve GoldCorpus class. Reduce memory use, write to temp dir
* Set actions in transition system
* Require thinc 6.11.1.dev4
* Fix error in parser init
* Add unicode declaration
* Fix unicode declaration
* Update textcat test
* Try to get model training on less memory
* Print json loc for now
* Try rapidjson to reduce memory use
* Remove rapidjson requirement
* Try rapidjson for reduced mem usage
* Handle None heads when projectivising
* Stream json docs
* Fix train script
* Handle projectivity in GoldParse
* Fix projectivity handling
* Add minibatch_by_words util from ud_train
* Minibatch by number of words in spacy.cli.train
* Move minibatch_by_words util to spacy.util
* Fix label handling
* More hacking at label management in parser
* Fix encoding in msgpack serialization in GoldParse
* Adjust batch sizes in parser training
* Fix minibatch_by_words
* Add merge_subtokens function to pipeline.pyx
* Register merge_subtokens factory
* Restore use of msgpack tmp directory
* Use minibatch-by-words in train
* Handle retokenization in scorer
* Change back-off approach for missing labels. Use 'dep' label
* Update NER for new label management
* Set NER tags for over-segmented words
* Fix label alignment in gold
* Fix label back-off for infrequent labels
* Fix int type in labels dict key
* Fix int type in labels dict key
* Update feature definition for 8 feature set
* Update ud-train script for new label stuff
* Fix json streamer
* Print the line number if conll eval fails
* Update children and sentence boundaries after deprojectivisation
* Export set_children_from_heads from doc.pxd
* Render parses during UD training
* Remove print statement
* Require thinc 6.11.1.dev6. Try adding wheel as install_requires
* Set different dev version, to flush pip cache
* Update thinc version
* Update GoldCorpus docs
* Remove print statements
* Fix formatting and links [ci skip]	2018-03-19 02:58:08 +01:00
Matthew Honnibal	307d6bf6d3	Fix parser for Thinc 6.11	2018-03-16 10:59:31 +01:00
Matthew Honnibal	9a389c4490	Fix parser for Thinc 6.11	2018-03-16 10:38:13 +01:00
Matthew Honnibal	648532d647	Don't assume blas methods are present	2018-03-16 02:48:20 +01:00
Matthew Honnibal	e101f10ef0	Fix header	2018-03-13 02:12:16 +01:00
Matthew Honnibal	d55620041b	Switch parser to gemm from thinc.openblas	2018-03-13 02:10:58 +01:00
Matthew Honnibal	4b72c38556	Fix dropout bug in beam parser	2018-03-10 23:16:40 +01:00
Matthew Honnibal	3d6487c734	Support dropout in beam parse	2018-03-10 22:41:55 +01:00
Matthew Honnibal	14f729c72a	Add subtok label to parser	2018-02-26 12:26:35 +01:00
Matthew Honnibal	7137ad8b0b	Make label filtering clearer for projectivisation	2018-02-26 12:02:01 +01:00
Matthew Honnibal	7b66ec896a	Revert "Revert "Improve parser oracle around sentence breaks."" This reverts commit `36e481c584`.	2018-02-26 10:57:37 +01:00
Matthew Honnibal	36e481c584	Revert "Improve parser oracle around sentence breaks." This reverts commit `50817dc9ad`.	2018-02-26 10:53:55 +01:00
Matthew Honnibal	50817dc9ad	Improve parser oracle around sentence breaks.	2018-02-22 19:22:26 +01:00
Matthew Honnibal	661873ee4c	Randomize the rebatch size in parser	2018-02-21 21:02:07 +01:00
Matthew Honnibal	a0ddb803fd	Make error when no label found more helpful	2018-02-21 16:00:59 +01:00
Matthew Honnibal	ea2fc5d45f	Improve length and freq cutoffs in parser	2018-02-21 16:00:38 +01:00
Matthew Honnibal	e5757d4bf0	Add labels property to parser	2018-02-21 16:00:00 +01:00


784 Commits
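
Note on commit bede11b67c: the label scheme described in that message (nested dictionaries keyed by action and then by label, with new labels given negative "frequencies" so their order stays stable) can be illustrated with a small sketch. This is not the code from the commit. The function names add_label and class_order, the action names, and the choice to sort in descending order are all assumptions made for illustration; the message only says that sorting is done by (frequency, label).

```python
def add_label(labels, action, label):
    """Record a new label under `action` with the next negative pseudo-frequency.

    `labels` maps action -> {label: frequency}. Hypothetical helper, not spaCy API.
    """
    freqs = labels.setdefault(action, {})
    if label in freqs:
        return False  # already known, keep its existing value
    lowest = min(freqs.values(), default=0)
    # The first new label gets -1, the next -2, and so on.
    freqs[label] = min(lowest, 0) - 1
    return True


def class_order(labels):
    """Return (action, label) pairs in a reproducible order.

    Assumption: sort descending by (frequency, label), so observed labels
    come first and new labels follow in the order they were added.
    """
    order = []
    for action in sorted(labels):
        pairs = sorted(
            ((freq, label) for label, freq in labels[action].items()),
            reverse=True,
        )
        order.extend((action, label) for _, label in pairs)
    return order


# Frequencies counted while streaming the data (made-up numbers).
labels = {"LEFT": {"nsubj": 120, "dobj": 80}, "RIGHT": {"prep": 50}}
add_label(labels, "LEFT", "zzz")  # stored as -1
add_label(labels, "LEFT", "aaa")  # stored as -2
print(class_order(labels))
```

Because each new label receives a value lower than every earlier one, adding a label only appends to the ordering within its action, so class IDs assigned from this order do not shift for existing labels.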