Matthew Honnibal
ca28590ddd
Use dep and ent multi-task objectives for parser
2017-09-26 08:13:52 -05:00
Matthew Honnibal
18a27c7579
Fix typo in tensorizer serialization
2017-09-26 06:45:14 -05:00
Matthew Honnibal
bf917225ab
Allow multi-task objectives during training
2017-09-26 05:42:52 -05:00
ines
d2d35b63b7
Fix formatting
2017-09-25 18:37:13 +02:00
Matthew Honnibal
8eb0b7b779
Add docstrings for Pipe API
2017-09-25 16:22:07 +02:00
Matthew Honnibal
39f390dba7
Add docstrings for Pipe API
2017-09-25 16:20:49 +02:00
Matthew Honnibal
4348c479fc
Merge pre-trained vectors and noshare patches
2017-09-22 20:07:28 -05:00
Matthew Honnibal
386c1a5bd8
Fix tagger training
2017-09-23 02:58:06 +02:00
Matthew Honnibal
05596159bf
Fix serialization when using pre-trained vectors
2017-09-22 15:33:27 -05:00
Matthew Honnibal
d9124f1aa3
Add link_vectors_to_models function
2017-09-22 09:38:22 -05:00
Matthew Honnibal
40a4873b70
Fix serialization of model options
2017-09-21 13:07:26 -05:00
Matthew Honnibal
20193371f5
Don't share CNN, to reduce complexities
2017-09-21 14:59:48 +02:00
Matthew Honnibal
24e85c2048
Pass values for CNN maxout pieces option
2017-09-20 19:16:12 -05:00
Matthew Honnibal
b36a38f63d
Fix serialization of pretrained_dims property
2017-09-19 23:42:27 +02:00
Matthew Honnibal
40837b275d
Fix tensorizer with pretrained vectors
2017-09-18 18:05:38 -05:00
Matthew Honnibal
84e637e2e6
Pass option for pretrained vectors in pipeline
2017-09-16 12:46:02 -05:00
Matthew Honnibal
7fdafcc4c4
Fix config loading in tagger
2017-09-04 16:38:49 +02:00
Matthew Honnibal
382ce566eb
Fix deserialization bug
2017-09-04 15:19:01 +02:00
Matthew Honnibal
9e378bdac5
Fix textcat serialization
2017-09-02 15:17:20 +02:00
Matthew Honnibal
a3b69bcb3d
Add low_data mode in textcat
2017-09-02 14:56:30 +02:00
Matthew Honnibal
5e6a9e7dcc
Add rule-based SBD
2017-09-02 12:53:38 +02:00
Matthew Honnibal
c1d3ff517a
Track loss in tagger
2017-08-20 14:42:23 +02:00
Matthew Honnibal
ec482580b5
Restore changes to pipeline.pyx from nn-beam-parser branch
2017-08-18 22:02:35 +02:00
Matthew Honnibal
426f84937f
Resolve conflicts when merging new beam parsing stuff
2017-08-18 13:38:32 -05:00
Matthew Honnibal
1cb2f15d65
Clean up unused predict_confidences function
2017-08-16 18:22:26 -05:00
Matthew Honnibal
52c180ecf5
Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit ea8de11ad5, reversing changes made to 08e443e083.
2017-08-14 13:00:23 +02:00
Matthew Honnibal
3e30712b62
Improve defaults
2017-08-12 19:24:17 -05:00
Matthew Honnibal
680043ebca
Improve efficiency of tagger.set_annotations for GPU
2017-08-12 08:54:21 -05:00
Matthew Honnibal
3cb8f06881
Fix NeuralLabeller
2017-08-06 14:15:14 +02:00
Matthew Honnibal
e9ab800e15
Fix tagging model
2017-08-06 01:50:08 +02:00
Matthew Honnibal
468c138ab3
WIP: Add fine-tuning logic to tagger model, re #1182
2017-08-06 01:13:23 +02:00
Matthew Honnibal
6780132821
Fix tagger loading
2017-07-25 19:41:11 +02:00
Matthew Honnibal
c4a81a47a4
Fix deserialization
2017-07-23 14:11:07 +02:00
Matthew Honnibal
4fe77bced2
Add cfg attr to pipeline components
2017-07-23 00:52:47 +02:00
Matthew Honnibal
a88a7deffe
Fix save/load of textcat config
2017-07-23 00:33:43 +02:00
Matthew Honnibal
b55714d5d1
Make gold_tuples arg optional in begin_training
2017-07-22 20:04:43 +02:00
Matthew Honnibal
b3a749610e
Fix name of TextCategorizer
2017-07-22 01:14:07 +02:00
Matthew Honnibal
a231b56d40
Add text-classification hook to pipeline
2017-07-20 00:18:15 +02:00
Matthew Honnibal
d59fa32df1
Add experimental SimilarityHook component
2017-06-05 15:40:03 +02:00
Matthew Honnibal
b3b5521625
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-04 20:17:18 -05:00
Matthew Honnibal
7b2ede783d
Add SP tag to tag map if missing
2017-06-04 20:16:30 -05:00
Matthew Honnibal
516798e9fc
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-06-05 01:35:21 +02:00
Matthew Honnibal
193bf913c0
Set is_tagged=True after tagging
2017-06-05 01:35:07 +02:00
Matthew Honnibal
b78cc318c3
Fix loading of morphology exceptions
2017-06-04 16:34:32 -05:00
Matthew Honnibal
3680c51b8f
Avoid clobbering preset POS tags
2017-06-04 15:52:42 -05:00
ines
1b593bbd6d
Fix encoding on tagger serialization
2017-06-02 17:29:21 +02:00
Matthew Honnibal
5f4d328e2c
Fix serialization of tag_map in NeuralTagger
2017-06-02 10:18:37 -05:00
Matthew Honnibal
307d615c5f
Fix serialization for tagger when tag_map has changed
2017-06-01 12:18:36 -05:00
ines
7a2380f617
Rename "nn_tagger" to "tagger"
2017-06-01 17:37:53 +02:00
Matthew Honnibal
5eae3b9a1e
Fix to/from disk in tagger
2017-06-01 04:55:49 -05:00
Matthew Honnibal
53d00a0371
Move weight serialization to Thinc
2017-06-01 03:04:36 -05:00
Matthew Honnibal
ae8010b526
Move weight serialization to Thinc
2017-06-01 02:56:12 -05:00
Matthew Honnibal
33e5ec737f
Fix to/from disk methods
2017-05-31 13:43:10 +02:00
Matthew Honnibal
293d1b425b
Serialize in consistent order
2017-05-29 17:53:06 -05:00
Matthew Honnibal
6522ea6c8b
More serialization fixes. Still broken
2017-05-29 13:23:47 -05:00
Matthew Honnibal
aa4c33914b
Work on serialization
2017-05-29 08:40:45 -05:00
Matthew Honnibal
ff26aa6c37
Work on to/from bytes/disk serialization methods
2017-05-29 11:45:45 +02:00
Matthew Honnibal
6b019b0540
Update to/from bytes methods
2017-05-29 10:14:20 +02:00
Matthew Honnibal
6dad4117ad
Work on serialization for models
2017-05-29 01:37:57 +02:00
Matthew Honnibal
8a24c60c1e
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-28 08:12:05 -05:00
Matthew Honnibal
bc97bc292c
Fix __call__ method
2017-05-28 08:11:58 -05:00
Matthew Honnibal
c1263a844b
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-05-27 18:32:57 -05:00
Matthew Honnibal
9e711c3476
Divide d_loss by batch size
2017-05-27 18:32:46 -05:00
Matthew Honnibal
34bbad8e0e
Add __reduce__ methods on parser subclasses. Fixes pickling.
2017-05-27 15:46:06 -05:00
Matthew Honnibal
467bbeadb8
Add hidden layers for tagger
2017-05-24 20:09:51 -05:00
Matthew Honnibal
5b67bcbee0
Increase default embed size to 7500
2017-05-23 15:20:16 -05:00
Matthew Honnibal
3959d778ac
Revert "Revert "WIP on improving parser efficiency""
This reverts commit 532afef4a8.
2017-05-23 03:06:53 -05:00
Matthew Honnibal
532afef4a8
Revert "WIP on improving parser efficiency"
This reverts commit bdaac7ab44.
2017-05-23 03:05:25 -05:00
Matthew Honnibal
bdaac7ab44
WIP on improving parser efficiency
2017-05-23 02:59:31 -05:00
Matthew Honnibal
a7ee63c0ac
Fix labeller loss for unseen labels
2017-05-22 10:41:20 -05:00
Matthew Honnibal
83ffd16474
Fix offset calculation for other negative values
2017-05-22 08:00:53 -05:00
Matthew Honnibal
b45b4aa392
PseudoProjectivity --> nonproj
2017-05-22 05:17:44 -05:00
Matthew Honnibal
8d1e64be69
Add experimental NeuralLabeller
2017-05-22 04:51:08 -05:00
Matthew Honnibal
9b1b0742fd
Fix prediction for tok2vec
2017-05-22 04:51:08 -05:00
Matthew Honnibal
5db89053aa
Merge docstrings
2017-05-21 13:46:23 -05:00
Matthew Honnibal
180e5afede
Fix tokvecs flattening in pipeline
2017-05-21 09:05:34 -05:00
ines
99b631617d
Reformat docstrings
2017-05-21 13:32:15 +02:00
ines
d82ae9a585
Change "function" to "callable" in docs
2017-05-21 13:17:40 +02:00
Matthew Honnibal
3b7c108246
Pass tokvecs through as a list, instead of concatenated. Also fix padding
2017-05-20 13:23:32 -05:00
Matthew Honnibal
d52b65aec2
Revert "Move to contiguous buffer for token_ids and d_vectors"
This reverts commit 3ff8c35a79.
2017-05-20 11:26:23 -05:00
Matthew Honnibal
3ff8c35a79
Move to contiguous buffer for token_ids and d_vectors
2017-05-20 04:17:30 -05:00
Matthew Honnibal
c12ab47a56
Remove state argument in pipeline. Other changes
2017-05-19 13:26:36 -05:00
ines
0fc05e54e4
Document TokenVectorEncoder
2017-05-19 00:00:02 +02:00
Matthew Honnibal
c2c825127a
Fix use_params and pipe methods
2017-05-18 08:30:59 -05:00
Matthew Honnibal
b460533827
Bug fixes to pipeline
2017-05-18 04:29:51 -05:00
Matthew Honnibal
692bd2a186
Bug fix to tagger: wasn't backpropagating to token vectors
2017-05-17 13:13:14 +02:00
Matthew Honnibal
793430aa7a
Get spaCy train command working with neural network
* Integrate models into pipeline
* Add basic serialization (maybe incorrect)
* Fix pickle on vocab
2017-05-17 12:04:50 +02:00
Matthew Honnibal
8cf097ca88
Redesign training to integrate NN components
* Obsolete .parser, .entity etc. names in favour of .pipeline
* Components no longer create models on initialization
* Models are created by loading methods (from_disk(), from_bytes() etc.) or by .begin_training()
* Add .predict() and .set_annotations() methods to components
* Pass state through the pipeline, to allow components to share information more flexibly (see the sketch after this entry).
2017-05-16 16:17:30 +02:00
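A minimal sketch of the component contract described in the entry above. The method names (begin_training, predict, set_annotations) follow the commit; the placeholder model, the score attribute, and everything else in the bodies are illustrative assumptions, not spaCy's actual implementation:

```python
class PipeComponent:
    """Illustrative pipeline component: no model is created on __init__;
    the model appears via from_disk()/from_bytes() or begin_training()."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.model = None  # created later, never here

    def begin_training(self, gold_tuples=None):
        # Placeholder model; a real component would construct a Thinc model.
        self.model = lambda docs: [0.0 for _ in docs]

    def predict(self, docs):
        # Run the model and return raw scores; no Doc mutation in this step.
        return self.model(docs)

    def set_annotations(self, docs, scores):
        # Write predictions onto the docs in a separate, explicit step.
        for doc, score in zip(docs, scores):
            doc.score = score  # illustrative target attribute

    def __call__(self, doc, state=None):
        # State is threaded through the pipeline so components can share
        # information more flexibly.
        scores = self.predict([doc])
        self.set_annotations([doc], scores)
        return doc
```

Splitting predict() from set_annotations() lets a trainer score a batch without mutating the docs, which is the point of the redesign.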
Matthew Honnibal
5211645af3
Get data flowing through pipeline. Needs redesign
2017-05-16 11:21:59 +02:00
Matthew Honnibal
a9edb3aa1d
Improve integration of NN parser, to support unified training API
2017-05-15 21:53:27 +02:00
Matthew Honnibal
4b9d69f428
Merge branch 'v2' into develop
* Move v2 parser into nn_parser.pyx
* New TokenVectorEncoder class in pipeline.pyx
* New spacy/_ml.py module
Currently the two parsers live side-by-side, until we figure out how to organize them.
2017-05-14 01:10:23 +02:00
Matthew Honnibal
5cac951a16
Move new parser to nn_parser.pyx, and restore old parser, to make tests pass.
2017-05-14 00:55:01 +02:00
Matthew Honnibal
613ba79e2e
Fiddle with sizings for parser
2017-05-13 17:20:23 -05:00
Matthew Honnibal
827b5af697
Update draft of parser neural network model
Model is good, but code is messy. Currently requires Chainer, which may cause the build to fail on machines without a GPU.
Outline of the model:
We first predict context-sensitive vectors for each word in the input:
(embed_lower | embed_prefix | embed_suffix | embed_shape)
>> Maxout(token_width)
>> convolution ** 4
This convolutional layer is shared between the tagger and the parser. This prevents the parser from needing tag features.
To boost the representation, we make a "super tag" with POS, morphology and dependency label. The tagger predicts this
by adding a softmax layer onto the convolutional layer --- so, we're teaching the convolutional layer to give us a
representation that's one affine transform from this informative lexical information. This is obviously good for the
parser (which backprops to the convolutions too).
The parser model makes a state vector by concatenating the vector representations of its context tokens. Current results suggest that few context tokens work well. Maybe this is a bug.
The current context tokens:
* S0, S1, S2: Top three words on the stack
* B0, B1: First two words of the buffer
* S0L1, S0L2: Leftmost and second leftmost children of S0
* S0R1, S0R2: Rightmost and second rightmost children of S0
* S1L1, S1L2, S1R1, S1R2, B0L1, B0L2: Likewise for S1 and B0
This makes the state vector quite long: 13*T, where T is the token vector width (128 is working well). Fortunately,
there's a way to structure the computation to save some expense (and make it more GPU friendly).
The parser typically visits 2*N states for a sentence of length N (although it may visit more, if it back-tracks
with a non-monotonic transition). A naive implementation would require 2*N (B, 13*T) @ (13*T, H) matrix multiplications
for a batch of size B. We can instead perform one (B*N, T) @ (T, 13*H) multiplication, to pre-compute the hidden
weights for each positional feature wrt the words in the batch. (Note that our token vectors come from the CNN
-- so we can't play this trick over the vocabulary. That's how Stanford's NN parser works --- and why its model
is so big.)
This pre-computation strategy allows a nice compromise between GPU-friendliness and implementation simplicity.
The CNN and the wide lower layer are computed on the GPU, and then the precomputed hidden weights are moved
to the CPU, before we start the transition-based parsing process. This makes a lot of things much easier.
We don't have to worry about variable-length batch sizes, and we don't have to implement the dynamic oracle
in CUDA to train.
Currently the parser's loss function is multilabel log loss, as the dynamic oracle allows multiple states to be 0 cost. The gradient of this loss is:
(exp(score) / Z) - (exp(score) / gZ)
where Z is the sum of the exponentiated scores over all classes, and gZ is the sum of the exponentiated scores assigned to the gold (zero-cost) classes. I'm very interested in regressing on the cost directly, but so far this isn't working well.
Machinery is in place for beam-search, which has been working well for the linear model. Beam search should benefit
greatly from the pre-computation trick.
2017-05-12 16:09:15 -05:00
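A minimal numpy sketch of the pre-computation trick and the loss gradient described in the entry above; all names, shapes, and sizes here are illustrative assumptions, not spaCy's actual code:

```python
import numpy as np

# Illustrative sizes: B*N tokens in the batch, token vectors of width T,
# F = 13 positional features, hidden width H.
BN, T, H, F = 160, 128, 64, 13

tokvecs = np.random.randn(BN, T)   # context-sensitive vectors from the CNN
W = np.random.randn(T, F * H)      # lower-layer weights

# Pre-compute once, GPU-friendly: a single (B*N, T) @ (T, 13*H) product.
# hidden[i, f] is token i's hidden-layer contribution when it fills feature f.
hidden = (tokvecs @ W).reshape(BN, F, H)

def state_vector(token_ids):
    # token_ids: the 13 token indices for S0, S1, S2, B0, B1, S0L1, etc.
    # Forming a state's hidden layer is now a cheap gather-and-sum on the
    # CPU, instead of a fresh (13*T) @ (13*T, H) multiplication per state.
    return sum(hidden[token_ids[f], f] for f in range(F))

def multilabel_logloss_grad(scores, is_gold):
    # Gradient from the entry above: (exp(score) / Z) - (exp(score) / gZ),
    # where gZ sums the exponentiated scores of the zero-cost (gold) classes.
    exp = np.exp(scores - scores.max())  # shift for numerical stability
    Z, gZ = exp.sum(), exp[is_gold].sum()
    grad = exp / Z
    grad[is_gold] -= exp[is_gold] / gZ
    return grad

state = state_vector(np.random.randint(0, BN, size=F))
assert state.shape == (H,)
```

The point of the reshape is that each of the 13 positional features gets its own slice of the pre-computed product, so each parse state only gathers and sums rows instead of multiplying matrices.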
Matthew Honnibal
b16ae75824
Remove serializer hacks from pipeline classes
2017-05-09 18:16:40 +02:00
Matthew Honnibal
bef89ef23d
Mergery
2017-05-08 08:29:36 -05:00
Matthew Honnibal
94e86ae00a
Predict tags with encoder
2017-05-08 07:53:45 -05:00
Matthew Honnibal
a66a4a4d0f
Replace einsums
2017-05-08 14:46:50 +02:00
Matthew Honnibal
6782eedf9b
Tmp GPU code
2017-05-07 11:04:24 -05:00
Matthew Honnibal
f99f5b75dc
Working residual net
2017-05-07 03:57:26 +02:00
Matthew Honnibal
7e04260d38
Data running through, likely errors in model
2017-05-06 14:22:20 +02:00
ines
d24589aa72
Clean up imports, unused code, whitespace, docstrings
2017-04-15 12:05:47 +02:00
ines
561f2a3eb4
Use consistent formatting for docstrings
2017-04-15 11:59:21 +02:00
Matthew Honnibal
354458484c
WIP on add_label bug during NER training
Currently, when a new label is introduced to NER during training, it causes the labels to be read in an unexpected order. This invalidates the model.
2017-04-14 23:52:17 +02:00
Matthew Honnibal
2f63806ddb
Update config when adding label. Re #910
2017-03-25 22:35:44 +01:00
Raphaël Bournhonesque
f332bf05be
Remove unused import statements
2017-03-21 21:08:54 +01:00
Matthew Honnibal
7769bc31e3
Add beam-search classes
2017-03-15 09:27:41 -05:00
Matthew Honnibal
fa23278ee3
Add classes for beam parser and beam NER
2017-03-11 12:45:37 -06:00
Matthew Honnibal
f77a5bb60a
Switch back to greedy parser
2017-03-11 11:11:30 -06:00
Matthew Honnibal
dcce9ca3f3
Use beam parser
2017-03-11 07:00:20 -06:00
Matthew Honnibal
b86f8af0c1
Fix doc strings
2016-11-01 12:25:36 +01:00
Matthew Honnibal
3e688e6d4b
Fix issue #514 -- serializer fails when a new entity type has been added. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness. (See the example after this entry.)
2016-10-23 17:45:44 +02:00
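A short example of the suggested mitigation, assuming the v1-era API (the "en" model shorthand, the nlp.entity attribute, and its add_label method; the label name is illustrative):

```python
import spacy

nlp = spacy.load("en")  # v1-era shorthand for the English model

# Add any new entity types immediately after loading, before serializing,
# to avoid the label-ordering brittleness described above.
nlp.entity.add_label("EVENT")
```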
Matthew Honnibal
f787cd29fe
Refactor the pipeline classes to make them more consistent, and remove the redundant blank() constructor.
2016-10-16 21:34:57 +02:00
Matthew Honnibal
4bb73b1a93
Fix parser labels in pipeline
2016-10-16 17:03:22 +02:00
Matthew Honnibal
a079677984
Fix omission of O action when creating blank entity recognizer
2016-10-16 11:43:25 +02:00
Matthew Honnibal
509b30834f
Add a pipeline module, to collect and wrap processes for annotation
2016-10-16 01:47:12 +02:00