182 Commits

Author SHA1 Message Date
Matthew Honnibal
e7a9174877 Add add_label methods to Tagger and TextCategorizer 2017-11-01 16:32:44 +01:00
ines
ba5e646219 Tidy up pipeline 2017-10-27 20:29:08 +02:00
Ines Montani
4033e70c71 Merge pull request #1461 from explosion/feature/disable-pipes
💫 Add Language.disable_pipes(), to temporarily edit pipeline and update code examples
2017-10-27 12:21:40 +02:00
ines
9e372913e0 Remove old 'SP' condition in tag map 2017-10-26 16:11:57 +02:00
Matthew Honnibal
a8abc47811 Rename BaseThincComponent --> Pipe 2017-10-26 12:40:40 +02:00
Matthew Honnibal
b0f3ea2200 Fix names of pipeline components
NeuralDependencyParser --> DependencyParser
NeuralEntityRecognizer --> EntityRecognizer
TokenVectorEncoder     --> Tensorizer
NeuralLabeller         --> MultitaskObjective
2017-10-26 12:38:23 +02:00
Matthew Honnibal
ed8da9b11f Add missing return statement in SentenceSegmenter 2017-10-17 15:32:56 +02:00
Matthew Honnibal
09d61ada5e Merge pull request #1396 from explosion/feature/pipeline-management
💫 Improve pipeline and factory management
2017-10-10 04:29:54 +02:00
Matthew Honnibal
8978212ee5 Patch serialization bug raised in #1105 2017-10-10 03:58:12 +02:00
Matthew Honnibal
0384f08218 Trigger nonproj.deprojectivize as a postprocess 2017-10-07 02:00:47 +02:00
Matthew Honnibal
563f46f026 Fix multi-label support for text classification
The TextCategorizer class is supposed to support multi-label
text classification, and allow training data to contain missing
values.

For this to work, the gradient of the loss should be 0 when labels
are missing. Instead, there was no way to actually denote "missing"
in the GoldParse class, and so the TextCategorizer class treated
the label set within gold.cats as complete.

To fix this, we change GoldParse.cats to be a dict instead of a list.
The GoldParse.cats dict should map labels to floats, with 1. denoting
'present' and 0. denoting 'absent'. Gradients are zeroed for categories
missing from the gold.cats dict. A nice bonus is that you can also set
values between 0 and 1 for partial membership. You can also set other
numeric values, if you're using a text classification model that uses an
appropriate loss function.

Unfortunately this is a breaking change, although the functionality
was only recently introduced and hasn't been properly documented
yet. I've updated the example script accordingly.
2017-10-05 18:43:02 -05:00
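
The behaviour described in commit 563f46f026 can be illustrated with a small standalone sketch. It is not the code from that commit: the function name `masked_textcat_gradient`, the example labels, and the use of a squared-error loss are all assumptions made for illustration. Only the idea comes from the commit message: `cats` is a dict of label to float, and labels missing from the dict contribute a zero gradient.

```python
def masked_textcat_gradient(scores, gold_cats, labels):
    """Return (gradient, loss) for one document.

    scores    -- predicted scores, one per entry in `labels`
    gold_cats -- dict mapping label -> float (1.0 present, 0.0 absent,
                 values in between for partial membership)
    labels    -- all labels known to the model, in score order

    Hypothetical helper: assumes a squared-error loss, which the commit
    message does not specify.
    """
    gradient = []
    loss = 0.0
    for label, score in zip(labels, scores):
        if label in gold_cats:
            # Annotated label: the gradient is the prediction error.
            diff = score - float(gold_cats[label])
        else:
            # Label missing from the gold dict: no information, so no gradient.
            diff = 0.0
        gradient.append(diff)
        loss += diff ** 2
    return gradient, loss


labels = ["POSITIVE", "NEGATIVE", "SPAM"]   # illustrative label set
scores = [0.9, 0.2, 0.6]                    # illustrative model output
gold_cats = {"POSITIVE": 1.0, "NEGATIVE": 0.0}  # "SPAM" is not annotated

gradient, loss = masked_textcat_gradient(scores, gold_cats, labels)
```

In this sketch the "SPAM" entry of `gradient` is exactly zero because the label is missing from `gold_cats`, whereas "NEGATIVE" is explicitly marked absent with 0.0 and still produces a non-zero gradient.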
Matthew Honnibal
5454b20cd7 Update thinc imports for 6.9 2017-10-03 20:07:17 +02:00
Matthew Honnibal
4a59f6358c Fix thinc imports 2017-10-03 19:21:26 +02:00
Matthew Honnibal
66c388ee01 Remove unhelpful multitask objectives 2017-09-27 11:44:16 -05:00
Matthew Honnibal
983201a83a Fix hard-coded vector width 2017-09-27 11:43:58 -05:00
Matthew Honnibal
defb68e94f Update feature/noshare with recent develop changes 2017-09-26 08:15:14 -05:00
Matthew Honnibal
ca28590ddd Use dep and ent multi-task objectives for parser 2017-09-26 08:13:52 -05:00
Matthew Honnibal
18a27c7579 Fix typo in tensorizer serialization 2017-09-26 06:45:14 -05:00
Matthew Honnibal
bf917225ab Allow multi-task objectives during training 2017-09-26 05:42:52 -05:00
ines
d2d35b63b7 Fix formatting 2017-09-25 18:37:13 +02:00
Matthew Honnibal
8eb0b7b779 Add docstrings for Pipe API 2017-09-25 16:22:07 +02:00
Matthew Honnibal
39f390dba7 Add docstrings for Pipe API 2017-09-25 16:20:49 +02:00
Matthew Honnibal
4348c479fc Merge pre-trained vectors and noshare patches 2017-09-22 20:07:28 -05:00
Matthew Honnibal
386c1a5bd8 Fix tagger training 2017-09-23 02:58:06 +02:00
Matthew Honnibal
05596159bf Fix serialization when pre-trained vectors 2017-09-22 15:33:27 -05:00
Matthew Honnibal
d9124f1aa3 Add link_vectors_to_models function 2017-09-22 09:38:22 -05:00
Matthew Honnibal
40a4873b70 Fix serialization of model options 2017-09-21 13:07:26 -05:00
Matthew Honnibal
20193371f5 Don't share CNN, to reduce complexities 2017-09-21 14:59:48 +02:00
Matthew Honnibal
24e85c2048 Pass values for CNN maxout pieces option 2017-09-20 19:16:12 -05:00
Matthew Honnibal
b36a38f63d Fix serialization of pretrained_dims property 2017-09-19 23:42:27 +02:00
Matthew Honnibal
40837b275d Fix tensorizer with pretrained vectors 2017-09-18 18:05:38 -05:00
Matthew Honnibal
84e637e2e6 Pass option for pretrained vectors in pipeline 2017-09-16 12:46:02 -05:00
Matthew Honnibal
7fdafcc4c4 Fix config loading in tagger 2017-09-04 16:38:49 +02:00
Matthew Honnibal
382ce566eb Fix deserialization bug 2017-09-04 15:19:01 +02:00
Matthew Honnibal
9e378bdac5 Fix textcat serialization 2017-09-02 15:17:20 +02:00
Matthew Honnibal
a3b69bcb3d Add low_data mode in textcat 2017-09-02 14:56:30 +02:00
Matthew Honnibal
5e6a9e7dcc Add rule-based SBD 2017-09-02 12:53:38 +02:00
Matthew Honnibal
c1d3ff517a Track loss in tagger 2017-08-20 14:42:23 +02:00
Matthew Honnibal
ec482580b5 Restore changes to pipeline.pyx from nn-beam-parser branch 2017-08-18 22:02:35 +02:00
Matthew Honnibal
426f84937f Resolve conflicts when merging new beam parsing stuff 2017-08-18 13:38:32 -05:00
Matthew Honnibal
1cb2f15d65 Clean up unused predict_confidences function 2017-08-16 18:22:26 -05:00
Matthew Honnibal
52c180ecf5 Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit ea8de11ad5, reversing
changes made to 08e443e083.
2017-08-14 13:00:23 +02:00
Matthew Honnibal
3e30712b62 Improve defaults 2017-08-12 19:24:17 -05:00
Matthew Honnibal
680043ebca Improve efficiency of tagger.set_annotations for GPU 2017-08-12 08:54:21 -05:00
Matthew Honnibal
3cb8f06881 Fix NeuralLabeller 2017-08-06 14:15:14 +02:00
Matthew Honnibal
e9ab800e15 Fix tagging model 2017-08-06 01:50:08 +02:00
Matthew Honnibal
468c138ab3 WIP: Add fine-tuning logic to tagger model, re #1182 2017-08-06 01:13:23 +02:00
Matthew Honnibal
6780132821 Fix tagger loading 2017-07-25 19:41:11 +02:00
Matthew Honnibal
c4a81a47a4 Fix deserialization 2017-07-23 14:11:07 +02:00
Matthew Honnibal
4fe77bced2 Add cfg attr to pipeline components 2017-07-23 00:52:47 +02:00
Matthew Honnibal
a88a7deffe Fix save/load of textcat config 2017-07-23 00:33:43 +02:00
Matthew Honnibal
b55714d5d1 Make gold_tuples arg optional in begin_training 2017-07-22 20:04:43 +02:00
Matthew Honnibal
b3a749610e Fix name of TextCategorizer 2017-07-22 01:14:07 +02:00
Matthew Honnibal
a231b56d40 Add text-classification hook to pipeline 2017-07-20 00:18:15 +02:00
Matthew Honnibal
d59fa32df1 Add experimental SimilarityHook component 2017-06-05 15:40:03 +02:00
Matthew Honnibal
b3b5521625 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-04 20:17:18 -05:00
Matthew Honnibal
7b2ede783d Add SP tag to tag map if missing 2017-06-04 20:16:30 -05:00
Matthew Honnibal
516798e9fc Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-05 01:35:21 +02:00
Matthew Honnibal
193bf913c0 Set is_tagged=True after tagging 2017-06-05 01:35:07 +02:00
Matthew Honnibal
b78cc318c3 Fix loading of morphology exceptions 2017-06-04 16:34:32 -05:00
Matthew Honnibal
3680c51b8f Avoid clobbering preset POS tags 2017-06-04 15:52:42 -05:00
ines
1b593bbd6d Fix encoding on tagger serialization 2017-06-02 17:29:21 +02:00
Matthew Honnibal
5f4d328e2c Fix serialization of tag_map in NeuralTagger 2017-06-02 10:18:37 -05:00
Matthew Honnibal
307d615c5f Fix serialization for tagger when tag_map has changed 2017-06-01 12:18:36 -05:00
ines
7a2380f617 Rename "nn_tagger" to "tagger" 2017-06-01 17:37:53 +02:00
Matthew Honnibal
5eae3b9a1e Fix to/from disk in tagger 2017-06-01 04:55:49 -05:00
Matthew Honnibal
53d00a0371 Move weight serialization to Thinc 2017-06-01 03:04:36 -05:00
Matthew Honnibal
ae8010b526 Move weight serialization to Thinc 2017-06-01 02:56:12 -05:00
Matthew Honnibal
33e5ec737f Fix to/from disk methods 2017-05-31 13:43:10 +02:00
Matthew Honnibal
293d1b425b Serialize in consistent order 2017-05-29 17:53:06 -05:00
Matthew Honnibal
6522ea6c8b More serialization fixes. Still broken 2017-05-29 13:23:47 -05:00
Matthew Honnibal
aa4c33914b Work on serialization 2017-05-29 08:40:45 -05:00
Matthew Honnibal
ff26aa6c37 Work on to/from bytes/disk serialization methods 2017-05-29 11:45:45 +02:00
Matthew Honnibal
6b019b0540 Update to/from bytes methods 2017-05-29 10:14:20 +02:00
Matthew Honnibal
6dad4117ad Work on serialization for models 2017-05-29 01:37:57 +02:00
Matthew Honnibal
8a24c60c1e Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-28 08:12:05 -05:00
Matthew Honnibal
bc97bc292c Fix __call__ method 2017-05-28 08:11:58 -05:00
Matthew Honnibal
c1263a844b Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-27 18:32:57 -05:00
Matthew Honnibal
9e711c3476 Divide d_loss by batch size 2017-05-27 18:32:46 -05:00
Matthew Honnibal
34bbad8e0e Add __reduce__ methods on parser subclasses. Fixes pickling. 2017-05-27 15:46:06 -05:00
Matthew Honnibal
467bbeadb8 Add hidden layers for tagger 2017-05-24 20:09:51 -05:00
Matthew Honnibal
5b67bcbee0 Increase default embed size to 7500 2017-05-23 15:20:16 -05:00
Matthew Honnibal
3959d778ac Revert "Revert "WIP on improving parser efficiency""
This reverts commit 532afef4a8.
2017-05-23 03:06:53 -05:00
Matthew Honnibal
532afef4a8 Revert "WIP on improving parser efficiency"
This reverts commit bdaac7ab44.
2017-05-23 03:05:25 -05:00
Matthew Honnibal
bdaac7ab44 WIP on improving parser efficiency 2017-05-23 02:59:31 -05:00
Matthew Honnibal
a7ee63c0ac Fix labeller loss for unseen labels 2017-05-22 10:41:20 -05:00
Matthew Honnibal
83ffd16474 Fix offset calculation for other negative values 2017-05-22 08:00:53 -05:00
Matthew Honnibal
b45b4aa392 PseudoProjectivity --> nonproj 2017-05-22 05:17:44 -05:00
Matthew Honnibal
8d1e64be69 Add experimental NeuralLabeller 2017-05-22 04:51:08 -05:00
Matthew Honnibal
9b1b0742fd Fix prediction for tok2vec 2017-05-22 04:51:08 -05:00
Matthew Honnibal
5db89053aa Merge docstrings 2017-05-21 13:46:23 -05:00
Matthew Honnibal
180e5afede Fix tokvecs flattening in pipeline 2017-05-21 09:05:34 -05:00
ines
99b631617d Reformat docstrings 2017-05-21 13:32:15 +02:00
ines
d82ae9a585 Change "function" to "callable" in docs 2017-05-21 13:17:40 +02:00
Matthew Honnibal
3b7c108246 Pass tokvecs through as a list, instead of concatenated. Also fix padding 2017-05-20 13:23:32 -05:00
Matthew Honnibal
d52b65aec2 Revert "Move to contiguous buffer for token_ids and d_vectors"
This reverts commit 3ff8c35a79.
2017-05-20 11:26:23 -05:00
Matthew Honnibal
3ff8c35a79 Move to contiguous buffer for token_ids and d_vectors 2017-05-20 04:17:30 -05:00
Matthew Honnibal
c12ab47a56 Remove state argument in pipeline. Other changes 2017-05-19 13:26:36 -05:00
ines
0fc05e54e4 Document TokenVectorEncoder 2017-05-19 00:00:02 +02:00
Matthew Honnibal
c2c825127a Fix use_params and pipe methods 2017-05-18 08:30:59 -05:00