Matthew Honnibal
3e7a96f99d
Improve pretrain textcat example
2018-11-03 17:44:12 +00:00
Matthew Honnibal
c87c50af62
Rename new example
2018-11-03 13:09:46 +00:00
Matthew Honnibal
8e8ccc0f92
Work on pretraining script
2018-11-03 12:53:25 +00:00
Matthew Honnibal
ad44982f01
Fix dropout in tensorizer, update comment
2018-11-03 12:46:58 +00:00
Matthew Honnibal
0127f10ba3
Improve train tensorizer script
2018-11-03 10:54:20 +00:00
Matthew Honnibal
ba365ae1c9
Normalize gradient by number of words in tensorizer
2018-11-03 10:53:22 +00:00
Matthew Honnibal
dac3f1b280
Improve Tensorizer
2018-11-03 10:52:50 +00:00
Matthew Honnibal
baf7feae68
Add tensorizer training example
2018-11-02 23:30:06 +00:00
Matthew Honnibal
2527ba68e5
Fix tensorizer
2018-11-02 23:29:54 +00:00
Suraj Rajan
0bf14082a4
Added more constructs for dependency tree matcher ( #2836 )
2018-10-29 23:21:39 +01:00
Matthew Honnibal
817e1fc5e5
Fix out-of-bounds access in NER training
...
The helper method state.B(1) gets the index of the first token of the
buffer, or -1 if no such token exists. Normally this is safe because we
pass this to functions like state.safe_get(), which returns an empty
token. Here we used it directly as an array index, which is not okay!
This error may have been the cause of out-of-bounds access errors during
training. Similar errors may still be around, so must be hunted down.
Hunting this one down took a long time...I printed out values across
training runs and diffed, looking for points of divergence between
runs, when no randomness should be allowed.
2018-10-27 01:12:50 +02:00
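The failure mode described in 817e1fc5e5 (a "-1 means missing" sentinel used directly as an array index) can be sketched in plain Python. This is only an illustration, not the parser's Cython code: `first_buffer_index` and `safe_get` below are hypothetical stand-ins named after the helpers mentioned in the commit message, and their signatures are assumptions.

```python
from typing import List, Optional


def first_buffer_index(buffer: List[int]) -> int:
    """Hypothetical stand-in for the state helper: return the index of the
    first token in the buffer, or -1 if the buffer is empty."""
    return buffer[0] if buffer else -1


def safe_get(tokens: List[str], i: int) -> Optional[str]:
    """Bounds-checked lookup: return None for any out-of-range index,
    including the -1 sentinel, instead of indexing blindly."""
    if 0 <= i < len(tokens):
        return tokens[i]
    return None


tokens = ["New", "York", "City"]
i = first_buffer_index([])  # empty buffer, so the sentinel -1 comes back

# Unsafe: in Python, tokens[-1] silently returns the last token; in C or
# Cython with bounds checks disabled, the same index reads out of bounds.
unsafe_value = tokens[i]

# Safe: the sentinel is caught before it is used as an index.
safe_value = safe_get(tokens, i)
```

Routing every sentinel-returning index through a checked accessor keeps the "missing" case explicit at the call site.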
Ines Montani
ea20b72c08
💫 Make like_num work for prefixed numbers ( #2808 )
...
* Only split + prefix if not numbers
* Make like_num work for prefixed numbers
* Add test for like_num
2018-10-01 10:49:14 +02:00
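As a rough illustration of what "like_num for prefixed numbers" in ea20b72c08 could mean, here is a simplified, standalone sketch. It is not the library's implementation: the set of accepted prefixes and the handling of separators and fractions are assumptions made for the example.

```python
def like_num(text: str) -> bool:
    """Return True if text looks like a number, allowing one leading
    sign-like prefix (the prefix set here is an assumption)."""
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands separators and decimal points.
    stripped = text.replace(",", "").replace(".", "")
    if stripped.isdigit():
        return True
    # Accept simple fractions such as "1/2".
    if stripped.count("/") == 1:
        num, denom = stripped.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return False
```

With this sketch, `like_num("+10")` and `like_num("-3.5")` are true, while a bare `"+"` is not treated as a number.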
Matthew Honnibal
b39810d692
Fix copy_reg compatibility on _serialize module
2018-09-28 15:23:14 +02:00
Matthew Honnibal
f82f8ba5dd
Fix serialization when empty parser model. Closes #2482
2018-09-28 15:18:52 +02:00
Matthew Honnibal
d5a6c63b62
Add regression test for #2482
2018-09-28 15:18:30 +02:00
Matthew Honnibal
e3e9fe18d4
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-09-28 14:27:35 +02:00
Matthew Honnibal
0323f5be0c
Fix _serialize module
2018-09-28 14:27:24 +02:00
Ines Montani
5d56eb70d7
Tidy up tests
2018-09-27 16:41:57 +02:00
Ines Montani
1f1bab9264
Remove unused import
2018-09-27 16:41:37 +02:00
Matthew Honnibal
b42c123e5d
Fix regression introduced by 1759abf1e
2018-09-25 11:08:58 +02:00
Matthew Honnibal
500898907b
Fix regression in parser.begin_training()
2018-09-25 11:08:31 +02:00
Matthew Honnibal
1759abf1e5
Fix bug in sentence starts for non-projective parses
...
The set_children_from_heads function assumed parse trees were
projective. However, non-projective parses may be passed in during
deserialization, or after deprojectivising. This caused incorrect
sentence boundaries to be set for non-projective parses. Close #2772 .
2018-09-19 14:50:06 +02:00
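For readers unfamiliar with the distinction in 1759abf1e5, a parse is non-projective when two dependency arcs cross. The sketch below is a minimal crossing-arcs check written for this log, not the code from the commit; it assumes `heads[i]` holds the absolute index of token `i`'s head and that a root token points to itself.

```python
from typing import List


def has_crossing_arcs(heads: List[int]) -> bool:
    """Return True if any two dependency arcs cross.

    Assumes heads[i] is the absolute index of token i's head and that a
    root token is its own head (so it contributes no arc).
    """
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if i != h]
    for a, b in arcs:
        for c, d in arcs:
            # Arc (c, d) starts inside (a, b) and ends outside it.
            if a < c < b < d:
                return True
    return False


projective = [1, 1, 1]         # tokens 0 and 2 attach to root token 1
non_projective = [2, 3, 2, 2]  # arcs 0-2 and 1-3 cross
```

Code that derives sentence boundaries from subtree spans can use a check like this to detect inputs where subtrees are not contiguous.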
Matthew Honnibal
48fd36bf05
Fix test for issue 2772
2018-09-19 14:47:27 +02:00
Matthew Honnibal
6cd920e088
Add xfail test for deprojectivization SBD bug
2018-09-19 14:00:31 +02:00
Matthew Honnibal
99a6011580
Avoid adding empty layer in model, to keep models backwards compatible
2018-09-14 22:51:58 +02:00
Matthew Honnibal
c046392317
Trigger on_data hooks in parser model
2018-09-14 20:51:21 +02:00
Matthew Honnibal
5afd98dff5
Add a stepping function, for changing batch sizes or learning rates
2018-09-14 18:37:16 +02:00
Matthew Honnibal
27c00f4f22
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-09-14 12:30:57 +02:00
Matthew Honnibal
f32b52e611
Fix bug that caused deprojectivisation to run multiple times
2018-09-14 12:12:54 +02:00
Matthew Honnibal
8f2a6367e9
Fix usage of PyTorch BiLSTM in ud_train
2018-09-13 22:54:59 +00:00
Matthew Honnibal
afeddfff26
Fix PyTorch BiLSTM
2018-09-13 22:54:34 +00:00
Matthew Honnibal
a26fe8e7bb
Small hack in Language.update to make torch work
2018-09-13 22:51:52 +00:00
Matthew Honnibal
445b81ce3f
Support bilstm_depth argument in ud-train
2018-09-13 19:30:22 +02:00
Matthew Honnibal
b43643a953
Support bilstm_depth option in parser
2018-09-13 19:29:49 +02:00
Matthew Honnibal
45032fe9e1
Support option of BiLSTM in Tok2Vec (requires pytorch)
2018-09-13 19:28:35 +02:00
Matthew Honnibal
3eb9f3e2b8
Fix defaults for ud-train
2018-09-13 18:05:48 +02:00
Matthew Honnibal
59cf533879
Improve ud-train script. Make config optional
2018-09-13 14:24:08 +02:00
Matthew Honnibal
3e3a309764
Fix tagger
2018-09-13 14:14:38 +02:00
Matthew Honnibal
da7650e84b
Fix maximum doc length in ud_train script
2018-09-13 14:10:25 +02:00
Matthew Honnibal
a95eea4c06
Fix multi-task objective for parser
2018-09-13 14:08:55 +02:00
Matthew Honnibal
21321cd6cf
Add tok2vec property to parser model
2018-09-13 14:08:43 +02:00
Matthew Honnibal
d6aa60139d
Fix tagger training on GPU
2018-09-13 14:05:37 +02:00
Matthew Honnibal
b2cb1fc67d
Merge matcher tests
2018-09-06 01:39:53 +02:00
Suraj Krishnan Rajan
356af7b0a1
Fix tests
2018-09-06 01:39:36 +02:00
Matthew Honnibal
4d2d7d5866
Fix new feature flags
2018-08-27 02:12:39 +02:00
Matthew Honnibal
598dbf1ce0
Fix character-based tokenization for Japanese
2018-08-27 01:51:38 +02:00
Matthew Honnibal
3763e20afc
Pass subword_features and conv_depth params
2018-08-27 01:51:15 +02:00
Matthew Honnibal
8051136d70
Support subword_features and conv_depth params in Tok2Vec
2018-08-27 01:50:48 +02:00
Matthew Honnibal
9c33d4d1df
Add more hyper-parameters to spacy ud-train
...
* subword_features: Controls whether subword features are used in the
word embeddings. True by default (specifically, prefix, suffix and word
shape). Should be set to False for languages like Chinese and Japanese.
* conv_depth: Depth of the convolutional layers. Defaults to 4.
2018-08-27 01:48:46 +02:00
Matthew Honnibal
51a9efbf3b
Add draft Binder class
2018-08-22 13:12:51 +02:00