spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-14 13:47:13 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	3b7c108246	Pass tokvecs through as a list, instead of concatenated. Also fix padding	2017-05-20 13:23:32 -05:00
Matthew Honnibal	d52b65aec2	Revert "Move to contiguous buffer for token_ids and d_vectors" This reverts commit `3ff8c35a79`.	2017-05-20 11:26:23 -05:00
Matthew Honnibal	b272890a8c	Try to move parser to simpler PrecomputedAffine class. Currently broken -- maybe the previous change	2017-05-20 06:40:10 -05:00
Matthew Honnibal	3ff8c35a79	Move to contiguous buffer for token_ids and d_vectors	2017-05-20 04:17:30 -05:00
Matthew Honnibal	8b04b0af9f	Remove freqs from transition_system	2017-05-20 02:20:48 -05:00
Matthew Honnibal	a1ba20e2b1	Fix over-run on parse_batch	2017-05-19 18:57:30 -05:00
Matthew Honnibal	e84de028b5	Remove 'rebatch' op, and remove min-batch cap	2017-05-19 18:16:36 -05:00
Matthew Honnibal	c12ab47a56	Remove state argument in pipeline. Other changes	2017-05-19 13:26:36 -05:00
Matthew Honnibal	c2c825127a	Fix use_params and pipe methods	2017-05-18 08:30:59 -05:00
Matthew Honnibal	fc8d3a112c	Add util.env_opt support: Can set hyper params through environment variables.	2017-05-18 04:36:53 -05:00
Matthew Honnibal	d2626fdb45	Fix name error in nn parser	2017-05-18 04:31:01 -05:00
Matthew Honnibal	793430aa7a	Get spaCy train command working with neural network * Integrate models into pipeline * Add basic serialization (maybe incorrect) * Fix pickle on vocab	2017-05-17 12:04:50 +02:00
Matthew Honnibal	8cf097ca88	Redesign training to integrate NN components * Obsolete .parser, .entity etc names in favour of .pipeline * Components no longer create models on initialization * Models created by loading method (from_disk(), from_bytes() etc), or .begin_training() * Add .predict(), .set_annotations() methods in components * Pass state through pipeline, to allow components to share information more flexibly.	2017-05-16 16:17:30 +02:00
Matthew Honnibal	5211645af3	Get data flowing through pipeline. Needs redesign	2017-05-16 11:21:59 +02:00
Matthew Honnibal	a9edb3aa1d	Improve integration of NN parser, to support unified training API	2017-05-15 21:53:27 +02:00
Matthew Honnibal	4b9d69f428	Merge branch 'v2' into develop * Move v2 parser into nn_parser.pyx * New TokenVectorEncoder class in pipeline.pyx * New spacy/_ml.py module Currently the two parsers live side-by-side, until we figure out how to organize them.	2017-05-14 01:10:23 +02:00
Matthew Honnibal	5cac951a16	Move new parser to nn_parser.pyx, and restore old parser, to make tests pass.	2017-05-14 00:55:01 +02:00
Matthew Honnibal	f8c02b4341	Remove cupy imports from parser, so it can work on CPU	2017-05-14 00:37:53 +02:00
Matthew Honnibal	e6d71e1778	Small fixes to parser	2017-05-13 17:19:04 -05:00
Matthew Honnibal	188c0f6949	Clean up unused import	2017-05-13 17:18:27 -05:00
Matthew Honnibal	f85c8464f7	Draft support of regression loss in parser	2017-05-13 17:17:27 -05:00
Matthew Honnibal	827b5af697	Update draft of parser neural network model Model is good, but code is messy. Currently requires Chainer, which may cause the build to fail on machines without a GPU. Outline of the model: We first predict context-sensitive vectors for each word in the input: (embed_lower | embed_prefix | embed_suffix | embed_shape) >> Maxout(token_width) >> convolution ** 4 This convolutional layer is shared between the tagger and the parser. This prevents the parser from needing tag features. To boost the representation, we make a "super tag" with POS, morphology and dependency label. The tagger predicts this by adding a softmax layer onto the convolutional layer --- so, we're teaching the convolutional layer to give us a representation that's one affine transform from this informative lexical information. This is obviously good for the parser (which backprops to the convolutions too). The parser model makes a state vector by concatenating the vector representations for its context tokens. Current results suggest few context tokens works well. Maybe this is a bug. The current context tokens: * S0, S1, S2: Top three words on the stack * B0, B1: First two words of the buffer * S0L1, S0L2: Leftmost and second leftmost children of S0 * S0R1, S0R2: Rightmost and second rightmost children of S0 * S1L1, S1L2, S1R2, S1R, B0L1, B0L2: Likewise for S1 and B0 This makes the state vector quite long: 13T, where T is the token vector width (128 is working well). Fortunately, there's a way to structure the computation to save some expense (and make it more GPU friendly). The parser typically visits 2N states for a sentence of length N (although it may visit more, if it back-tracks with a non-monotonic transition). A naive implementation would require 2N (B, 13T) @ (13T, H) matrix multiplications for a batch of size B. We can instead perform one (BN, T) @ (T, 13*H) multiplication, to pre-compute the hidden weights for each positional feature wrt the words in the batch. (Note that our token vectors come from the CNN -- so we can't play this trick over the vocabulary. That's how Stanford's NN parser works --- and why its model is so big.) This pre-computation strategy allows a nice compromise between GPU-friendliness and implementation simplicity. The CNN and the wide lower layer are computed on the GPU, and then the precomputed hidden weights are moved to the CPU, before we start the transition-based parsing process. This makes a lot of things much easier. We don't have to worry about variable-length batch sizes, and we don't have to implement the dynamic oracle in CUDA to train. Currently the parser's loss function is multilabel log loss, as the dynamic oracle allows multiple states to be 0 cost. This is defined as: (exp(score) / Z) - (exp(score) / gZ) Where gZ is the sum of the scores assigned to gold classes. I'm very interested in regressing on the cost directly, but so far this isn't working well. Machinery is in place for beam-search, which has been working well for the linear model. Beam search should benefit greatly from the pre-computation trick.	2017-05-12 16:09:15 -05:00
Matthew Honnibal	b44f7e259c	Clean up unused parser code	2017-05-08 15:42:04 +02:00
Matthew Honnibal	17efb1c001	Change width	2017-05-08 08:40:13 -05:00
Matthew Honnibal	bef89ef23d	Mergery	2017-05-08 08:29:36 -05:00
Matthew Honnibal	50ddc9fc45	Fix infinite loop bug	2017-05-08 07:54:26 -05:00
Matthew Honnibal	a66a4a4d0f	Replace einsums	2017-05-08 14:46:50 +02:00
Matthew Honnibal	8d2eab74da	Use PretrainableMaxouts	2017-05-08 14:24:55 +02:00
Matthew Honnibal	2e2268a442	Precomputable hidden now working	2017-05-08 11:36:37 +02:00
Matthew Honnibal	10682d35ab	Get pre-computed version working	2017-05-08 00:38:35 +02:00
Matthew Honnibal	35458987e8	Checkpoint -- nearly finished reimpl	2017-05-07 23:05:01 +02:00
Matthew Honnibal	4441866f55	Checkpoint -- nearly finished reimpl	2017-05-07 22:47:06 +02:00
Matthew Honnibal	6782eedf9b	Tmp GPU code	2017-05-07 11:04:24 -05:00
Matthew Honnibal	e420e5a809	Tmp	2017-05-07 07:31:09 -05:00
Matthew Honnibal	700979fb3c	CPU/GPU compat	2017-05-07 04:01:11 +02:00
Matthew Honnibal	f99f5b75dc	working residual net	2017-05-07 03:57:26 +02:00
Matthew Honnibal	bdf2dba9fb	WIP on refactor, with hidden pre-computing	2017-05-07 02:02:43 +02:00
Matthew Honnibal	b439e04f8d	Learning smoothly	2017-05-06 20:38:12 +02:00
Matthew Honnibal	08bee76790	Learns things	2017-05-06 18:24:38 +02:00
Matthew Honnibal	bcf4cd0a5f	Learns things	2017-05-06 17:37:36 +02:00
Matthew Honnibal	8e48b58cd6	Gradients look correct	2017-05-06 16:47:15 +02:00
Matthew Honnibal	7e04260d38	Data running through, likely errors in model	2017-05-06 14:22:20 +02:00
Matthew Honnibal	ef4fa594aa	Draft of NN parser, to be tested	2017-05-05 19:20:39 +02:00
Matthew Honnibal	ccaf26206b	Pseudocode for parser	2017-05-04 12:17:59 +02:00
Matthew Honnibal	2da16adcc2	Add dropout option for parser and NER Dropout can now be specified in the `Parser.update()` method via the `drop` keyword argument, e.g. nlp.entity.update(doc, gold, drop=0.4) This will randomly drop 40% of features, and multiply the value of the others by 1. / 0.4. This may be useful for generalising from small data sets. This commit also patches the examples/training/train_new_entity_type.py example, to use dropout and fix the output (previously it did not output the learned entity).	2017-04-27 13:18:39 +02:00
Matthew Honnibal	d2436dc17b	Update fix for Issue #999	2017-04-23 18:14:37 +02:00
Matthew Honnibal	60703cede5	Ensure noun chunks can't be nested. Closes #955	2017-04-23 17:56:39 +02:00
Matthew Honnibal	4eef200bab	Persist the actions within spacy.parser.cfg	2017-04-20 17:02:44 +02:00
Matthew Honnibal	137b210bcf	Restore use of FTRL training	2017-04-16 18:02:42 +02:00
Matthew Honnibal	45464d065e	Remove print statement	2017-04-15 16:11:43 +02:00
Matthew Honnibal	c76cb8af35	Fix training for new labels	2017-04-15 16:11:26 +02:00
Matthew Honnibal	4884b2c113	Refix StepwiseState	2017-04-15 16:00:28 +02:00
Matthew Honnibal	1a98e48b8e	Fix StepwiseState	2017-04-15 13:35:01 +02:00
ines	0739ae7b76	Tidy up and fix formatting and imports	2017-04-15 13:05:15 +02:00
Matthew Honnibal	354458484c	WIP on add_label bug during NER training Currently when a new label is introduced to NER during training, it causes the labels to be read in in an unexpected order. This invalidates the model.	2017-04-14 23:52:17 +02:00
Matthew Honnibal	49e2de900e	Add costs property to StepwiseState, to show which moves are gold.	2017-04-10 11:37:04 +02:00
Matthew Honnibal	cc36c308f4	Fix noun_chunk rules around coordination Closes #693.	2017-04-07 17:06:40 +02:00
Matthew Honnibal	1bb7b4ca71	Add comment	2017-03-31 13:59:19 +02:00
Matthew Honnibal	47a3ef06a6	Unhack deprojectivization, moving it into pipeline Previously the deprojectivize() call was attached to the transition system, and only called for German. Instead it should be a separate process, called after the parser. This makes it available for any language. Closes #898.	2017-03-31 12:31:50 +02:00
Matthew Honnibal	a9b1f23c7d	Enable regression loss for parser	2017-03-26 09:26:30 -05:00
Matthew Honnibal	b487b8735a	Decrease beam density, and fix Python 3 problem in beam	2017-03-20 12:56:05 +01:00
Matthew Honnibal	c90dc7ac29	Clean up state initialisation in transition system	2017-03-16 11:59:11 -05:00
Matthew Honnibal	a46933a8fe	Clean up FTRL parsing stuff.	2017-03-16 11:58:20 -05:00
Matthew Honnibal	2611ac2a89	Fix scorer bug for NER, related to ambiguity between missing annotations and misaligned tokens	2017-03-16 09:38:28 -05:00
Matthew Honnibal	3d0833c3df	Fix off-by-1 in parse features fill_context	2017-03-15 19:55:35 -05:00
Matthew Honnibal	4ef68c413f	Approximate cost in Break transition, to speed things up a bit.	2017-03-15 16:40:27 -05:00
Matthew Honnibal	8543db8a5b	Use ftrl optimizer in parser	2017-03-15 11:56:37 -05:00
Matthew Honnibal	d719f8e77e	Use nogil in parser, and set L1 to 0.0 by default	2017-03-15 09:31:01 -05:00
Matthew Honnibal	c61c501406	Update beam-parser to allow parser to maintain nogil	2017-03-15 09:30:22 -05:00
Matthew Honnibal	c79b3129e3	Fix setting of empty lexeme in initial parse state	2017-03-15 09:26:53 -05:00
Matthew Honnibal	6c4108c073	Add header for beam parser	2017-03-11 12:45:12 -06:00
Matthew Honnibal	931feb3360	Allow beam parsing for NER	2017-03-11 11:12:01 -06:00
Matthew Honnibal	ca9c8c57c0	Add iteration argument to parser.update	2017-03-11 07:00:47 -06:00
Matthew Honnibal	d59c6926c1	I think this fixes the segfault	2017-03-11 06:58:34 -06:00
Matthew Honnibal	318b9e32ff	WIP on beam parser. Currently segfaults.	2017-03-11 06:19:52 -06:00
Matthew Honnibal	b0d80dc9ae	Update name of 'train' function in BeamParser	2017-03-10 14:35:43 -06:00
Matthew Honnibal	d11f1a4ddf	Record negative costs in non-monotonic arc eager oracle	2017-03-10 11:22:04 -06:00
Matthew Honnibal	ecf91a2dbb	Support beam parser	2017-03-10 11:21:21 -06:00
Matthew Honnibal	c62da02344	Use ftrl training, to learn compressed model.	2017-03-09 18:43:21 -06:00
Matthew Honnibal	40703988bc	Use FTRL training in parser	2017-03-08 01:38:51 +01:00
Roman Inflianskas	66e1109b53	Add support for Universal Dependencies v2.0	2017-03-03 13:17:34 +01:00
Matthew Honnibal	97a1286129	Revert changes to tagger and parser for thinc 6	2017-01-09 10:08:34 -06:00
Matthew Honnibal	af81ac8bb0	Use thinc 6.0	2016-12-29 11:58:42 +01:00
Matthew Honnibal	bc0a202c9c	Fix unicode problem in nonproj module	2016-11-25 17:29:17 -06:00
Matthew Honnibal	159e8c46e1	Merge old training fixes with newer state	2016-11-25 09:16:36 -06:00
Matthew Honnibal	39341598bb	Fix NER label calculation	2016-11-25 09:02:22 -06:00
Matthew Honnibal	ca773a1f53	Tweak arc_eager n_gold to deal with negative costs, and improve error message.	2016-11-25 09:01:52 -06:00
Matthew Honnibal	608d8f5421	Pass cfg through parser, and have is_valid default to 1, not 0 when resetting state	2016-11-25 09:00:21 -06:00
Matthew Honnibal	b8c4f5ea76	Allow German noun chunks to work on Span Update the German noun chunks iterator, so that it also works on Span objects.	2016-11-24 23:30:15 +11:00
Pokey Rule	3e3bda142d	Add noun_chunks to Span	2016-11-24 10:47:20 +00:00
Matthew Honnibal	b86f8af0c1	Fix doc strings	2016-11-01 12:25:36 +01:00
Matthew Honnibal	708ea22208	Infer types in transition_system.pyx	2016-10-27 18:08:13 +02:00
Matthew Honnibal	301f3cc898	Fix Issue #429. Add an initialize_state method to the named entity recogniser that adds missing entity types. This is a messy place to add this, because it's strange to have the method mutate state. A better home for this logic could be found.	2016-10-27 18:01:55 +02:00
Matthew Honnibal	03a520ec4f	Change signature of Parser.parseC, so that nr_class is read from the transition system. This allows the transition system to modify the number of actions in initialize_state.	2016-10-27 17:58:56 +02:00
Matthew Honnibal	a209b10579	Improve error message when oracle fails for non-projective trees, re Issue #571.	2016-10-24 20:31:30 +02:00
Matthew Honnibal	3e688e6d4b	Fix issue #514 -- serializer fails when new entity type has been added. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness.	2016-10-23 17:45:44 +02:00
Matthew Honnibal	59038f7efa	Restore support for prior data format -- specifically, the labels field of the config.	2016-10-17 00:53:26 +02:00
Matthew Honnibal	7887ab3b36	Fix default use of feature_templates in parser	2016-10-16 21:41:56 +02:00
Matthew Honnibal	f787cd29fe	Refactor the pipeline classes to make them more consistent, and remove the redundant blank() constructor.	2016-10-16 21:34:57 +02:00
Matthew Honnibal	274a4d4272	Fix queue Python property in StateClass	2016-10-16 17:04:41 +02:00
Matthew Honnibal	e8c8aa08ce	Make action_name optional in StepwiseState	2016-10-16 17:04:16 +02:00
Matthew Honnibal	4fc56d4a31	Rename 'labels' to 'actions' in parser options	2016-10-16 11:42:26 +02:00
Matthew Honnibal	3259a63779	Whitespace	2016-10-16 01:47:28 +02:00
Matthew Honnibal	d9ae2d68af	Load features by string-name for backwards compatibility.	2016-10-12 20:15:11 +02:00
Matthew Honnibal	3a03c668c3	Fix message in ParserStateError	2016-10-12 14:44:31 +02:00
Matthew Honnibal	6bf505e865	Fix error on ParserStateError	2016-10-12 14:35:55 +02:00
Matthew Honnibal	ea23b64cc8	Refactor training, with new spacy.train module. Defaults still a little awkward.	2016-10-09 12:24:24 +02:00
Matthew Honnibal	1d70db58aa	Revert "Changes to iterators.pyx for new StringStore scheme" This reverts commit `4f794b215a`.	2016-09-30 20:19:53 +02:00
Matthew Honnibal	9e09b39b9f	Revert "Changes to transition systems for new StringStore scheme" This reverts commit `0442e0ab1e`.	2016-09-30 20:11:49 +02:00
Matthew Honnibal	e3285f6f30	Revert "Fix report of ParserStateError" This reverts commit `78f19baafa`.	2016-09-30 20:11:33 +02:00
Matthew Honnibal	78f19baafa	Fix report of ParserStateError	2016-09-30 19:59:22 +02:00
Matthew Honnibal	0442e0ab1e	Changes to transition systems for new StringStore scheme	2016-09-30 19:58:51 +02:00
Matthew Honnibal	4f794b215a	Changes to iterators.pyx for new StringStore scheme	2016-09-30 19:57:49 +02:00
Matthew Honnibal	4cbf0d3bb6	Handle errors when no valid actions are available, pointing users to the issue tracker.	2016-09-27 19:19:53 +02:00
Matthew Honnibal	430473bd98	Raise errors when no actions are available, re Issue #429	2016-09-27 19:09:37 +02:00
Matthew Honnibal	8e7df3c4ca	Expect the parser data, if parser.load() is called.	2016-09-27 14:02:12 +02:00
Matthew Honnibal	a44763af0e	Fix Issue #469: Incorrectly cased root label in noun chunk iterator	2016-09-27 13:13:01 +02:00
Matthew Honnibal	e07b9665f7	Don't expect parser model	2016-09-26 18:09:33 +02:00
Matthew Honnibal	ee6fa106da	Fix parser features	2016-09-26 17:57:32 +02:00
Matthew Honnibal	e607e4b598	Fix parser loading	2016-09-26 17:51:11 +02:00
Matthew Honnibal	2debc4e0a2	Add .blank() method to Parser. Start housing default dep labels and entity types within the Defaults class.	2016-09-26 11:57:54 +02:00
Matthew Honnibal	fd65cf6cbb	Finish refactoring data loading	2016-09-24 20:26:17 +02:00
Matthew Honnibal	83e364188c	Mostly finished loading refactoring. Design is in place, but doesn't work yet.	2016-09-24 15:42:01 +02:00
Matthew Honnibal	60fdf4d5f1	Remove commented out debugging code	2016-09-24 01:17:18 +02:00
Matthew Honnibal	070af4af9d	Revert "* Working neural net, but features hacky. Switching to extractor." This reverts commit `7c2f1a673b`.	2016-09-21 12:26:14 +02:00
Matthew Honnibal	7c2f1a673b	* Working neural net, but features hacky. Switching to extractor.	2016-05-26 19:06:10 +02:00
Matthew Honnibal	13fad36e49	* Cosmetic change to english noun chunks iterator -- use enumerate instead of range loop	2016-05-20 10:11:05 +02:00
Wolfgang Seeker	7b78239436	add fix for German noun chunk iterator (issue #365)	2016-05-06 01:41:26 +02:00
Matthew Honnibal	bb94022975	* Fix Issue #365: Error introduced during noun phrase chunking, due to use of corrected PRON/PROPN/etc tags.	2016-05-06 00:21:05 +02:00
Wolfgang Seeker	dbf8f5f3ec	fix bug in StateC.set_break()	2016-05-05 15:15:34 +02:00
Wolfgang Seeker	3c44b5dc1a	call deprojectivization after parsing	2016-05-05 15:10:36 +02:00
Matthew Honnibal	472f576b82	* Deprojectivize German parses	2016-05-05 15:01:10 +02:00
Wolfgang Seeker	e4ea2bea01	fix whitespace	2016-05-04 07:40:38 +02:00
Wolfgang Seeker	5bf2fd1f78	make the code less cryptic	2016-05-03 17:19:05 +02:00
Wolfgang Seeker	a06fca9fdf	German noun chunk iterator now doesn't return tokens more than once	2016-05-03 16:58:59 +02:00
Wolfgang Seeker	7b246c13cb	reformulate noun chunk tests for English	2016-05-03 14:24:35 +02:00
Matthew Honnibal	1f1532142f	* Fix cost calculation on non-monotonic oracle	2016-05-03 00:21:08 +02:00
Matthew Honnibal	508fd1f6dc	* Refactor noun chunk iterators, so that they're simple functions. Install the iterator when the Doc is created, but allow users to write to the noun_chunk_iterator attribute. The iterator functions accept an object and yield (int start, int end, int label) triples.	2016-05-02 14:25:10 +02:00
Matthew Honnibal	77609588b6	* Fix assignment of root label to words left as root implicitly, after parsing ends.	2016-04-25 19:41:59 +00:00
Matthew Honnibal	7c2d2deaa7	* Revise transition system so that the Break transition retains sole responsibility for setting sentence boundaries. Re Issue #322	2016-04-25 19:41:59 +00:00
Wolfgang Seeker	12024b0b0a	bugfix: introducing multiple roots now updates original head's properties adjust tests to rely less on statistical model	2016-04-20 16:42:41 +02:00
Wolfgang Seeker	b98cc3266d	bugfix: iterators now reset properly when called a second time	2016-04-15 17:49:16 +02:00
Wolfgang Seeker	289b10f441	remove some comments	2016-04-14 15:37:51 +02:00
Wolfgang Seeker	d99a9cbce9	different handling of space tokens space tokens are now always attached to the previous non-space token there are two exceptions: leading space tokens are attached to the first following non-space token in input that consists exclusively of space tokens, the last space token is the head of all others.	2016-04-13 15:28:28 +02:00
Wolfgang Seeker	d328e0b4a8	Merge branch 'master' into space_head_bug	2016-04-11 12:11:01 +02:00
Wolfgang Seeker	80bea62842	bugfix in unit test	2016-04-08 16:46:44 +02:00
Wolfgang Seeker	1fe911cdb0	bugfix	2016-04-07 18:19:51 +02:00
Matthew Honnibal	872695759d	Merge pull request #306 from wbwseeker/german_noun_chunks add German noun chunk functionality	2016-04-08 00:54:24 +10:00
Wolfgang Seeker	7195b6742d	add restrictions to L-arc and R-arc to prevent space heads	2016-03-28 10:40:52 +02:00
Wolfgang Seeker	5e2e8e951a	add baseclass DocIterator for iterators over documents add classes for English and German noun chunks the respective iterators are set for the document when created by the parser as they depend on the annotation scheme of the parsing model	2016-03-16 15:53:35 +01:00
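
Note on commit 827b5af697: the pre-computation trick described in that message can be illustrated with a small NumPy sketch. This is not the spaCy or thinc implementation. The names `tokvecs`, `W`, `cache`, `naive_hidden` and `precomputed_hidden` are hypothetical, the sizes are toy values, and padding for missing context tokens is ignored. It only shows that one wide multiplication over all tokens gives the same hidden-layer input as concatenating 13 token vectors per state.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens = 6  # B*N in the commit message: every token in the batch, flattened
T = 4         # token vector width (toy value; the message mentions 128)
H = 3         # hidden layer width (toy value)
F = 13        # number of context-token slots, as stated in the message

tokvecs = rng.normal(size=(n_tokens, T))
# One (T, H) weight block per context slot.
W = rng.normal(size=(F, T, H))


def naive_hidden(token_ids):
    """Concatenate the F context vectors, then do one (13T,) @ (13T, H) product."""
    state_vector = np.concatenate([tokvecs[i] for i in token_ids])
    return state_vector @ W.reshape(F * T, H)


# Pre-computation: a single (n_tokens, T) @ (T, 13*H) multiplication.
W_wide = W.transpose(1, 0, 2).reshape(T, F * H)
cache = (tokvecs @ W_wide).reshape(n_tokens, F, H)


def precomputed_hidden(token_ids):
    """Sum the cached rows: slot f contributes cache[token in slot f, f]."""
    return sum(cache[tok, slot] for slot, tok in enumerate(token_ids))
```

With this layout each parser state costs 13 lookups and additions of length-H vectors instead of a fresh matrix product, which is why the cache can be computed once (for example on a GPU) and then consumed state by state.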


598 Commits
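
Note on commit 827b5af697: the multilabel log loss expression quoted in that message, (exp(score) / Z) - (exp(score) / gZ), can be sketched as a gradient over class scores. The sketch below assumes that gZ is the sum of exponentiated scores over the zero-cost (gold) classes, and that the second term applies only to gold classes; both are readings of the message, not confirmed details of the parser code. The function name `multilabel_log_loss_grad` is hypothetical.

```python
import math


def multilabel_log_loss_grad(scores, is_gold):
    """Gradient of -log(sum of gold probabilities) with respect to each score.

    scores:  list of floats, one per action/class.
    is_gold: list of bools, True where the oracle assigns zero cost.
    """
    # Subtract the max score for numerical stability; the ratios are unchanged.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    Z = sum(exps)                                     # normaliser over all classes
    gZ = sum(e for e, g in zip(exps, is_gold) if g)   # normaliser over gold classes
    if gZ == 0.0:
        raise ValueError("at least one class must be gold")
    # Gold classes get exp/Z - exp/gZ; all other classes get exp/Z.
    return [e / Z - (e / gZ if g else 0.0) for e, g in zip(exps, is_gold)]
```

Under these assumptions the gradient sums to zero, gold classes receive non-positive values, and with a single gold class it reduces to the usual softmax cross-entropy gradient.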