Commit Graph

349 Commits

Author SHA1 Message Date
Matthew Honnibal
008e1ee1dd Update pretrain command 2018-11-29 12:36:43 +00:00
Matthew Honnibal
61e435610e
💫 Feature/improve pretraining (#2971)
* Improve spacy pretrain script

* Implement BERT-style 'masked language model' objective. Much better
results.

* Improve logging.

* Add length cap for documents, to avoid memory errors.

* Require thinc 7.0.0.dev1

* Require thinc 7.0.0.dev1

* Add argument for using pretrained vectors

* Fix defaults

* Fix syntax error

* Tweak pretraining script

* Fix data limits in spacy.gold

* Fix pretrain script
2018-11-28 18:04:58 +01:00
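Note on 61e435610e: the "BERT-style 'masked language model' objective" named in this commit can be illustrated with a minimal sketch. This is not the code from the spaCy pretrain script. The function name `mask_tokens`, the `[MASK]` placeholder, and the 15% selection rate with an 80/10/10 replacement split are assumptions taken from the original BERT recipe, used here only to show the idea: a fraction of tokens is hidden or corrupted, and the model is trained to recover the originals at those positions.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", vocab=None, rng=None):
    """Corrupt a token sequence for a masked-language-model objective.

    Returns (masked, targets). targets[i] holds the original token at every
    position that was selected for prediction, and None everywhere else.
    All names and default rates here are illustrative assumptions.
    """
    rng = rng or random.Random(0)
    vocab = vocab or list(tokens)
    masked, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            # This position is selected: the model must predict `token` here.
            targets.append(token)
            roll = rng.random()
            if roll < 0.8:
                masked.append(mask_token)         # replace with the mask symbol
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # replace with a random token
            else:
                masked.append(token)              # keep the original token
        else:
            # Not selected: the token passes through and carries no loss.
            targets.append(None)
            masked.append(token)
    return masked, targets


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    corrupted, gold = mask_tokens(words, rng=random.Random(42))
    print(corrupted)
    print(gold)
```

The loss would then be computed only at positions where the target is not None, which is what distinguishes this objective from predicting every token.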
Matthew Honnibal
ef0820827a
Update hyper-parameters after NER random search (#2972)
These experiments were completed a few weeks ago, but I didn't make the PR, pending model release.

    Token vector width: 128->96
    Hidden width: 128->64
    Embed size: 5000->2000
    Dropout: 0.2->0.1
    Updated optimizer defaults (unclear how important?)

This should improve speed, model size and load time, while keeping
similar or slightly better accuracy.

The tl;dr is we prefer to prevent over-fitting by reducing model size,
rather than using more dropout.
2018-11-27 18:49:52 +01:00
Ines Montani
b4581435f6 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-11-16 13:08:22 +01:00
Ines Montani
e2f75eb492 Fix message formatting 2018-11-16 13:08:20 +01:00
Matthew Honnibal
2874b8efd8 Fix tok2vec loading in spacy train 2018-11-15 23:34:54 +00:00
Matthew Honnibal
2ddd428834 Fix pretrain script 2018-11-15 23:34:35 +00:00
Matthew Honnibal
f8afaa0c1c Fix pretrain 2018-11-15 22:46:53 +00:00
Matthew Honnibal
6af6950e46 Fix pretrain 2018-11-15 22:45:36 +00:00
Matthew Honnibal
3e7b214e57 Make pretrain script work with stream from stdin 2018-11-15 22:44:07 +00:00
Matthew Honnibal
8fdb9bc278
💫 Add experimental ULMFit/BERT/Elmo-like pretraining (#2931)
* Add 'spacy pretrain' command

* Fix pretrain command for Python 2

* Fix pretrain command

* Fix pretrain command
2018-11-15 22:17:16 +01:00
Matthew Honnibal
8f2a6367e9 Fix usage of PyTorch BiLSTM in ud_train 2018-09-13 22:54:59 +00:00
Matthew Honnibal
445b81ce3f Support bilstm_depth argument in ud-train 2018-09-13 19:30:22 +02:00
Matthew Honnibal
3eb9f3e2b8 Fix defaults for ud-train 2018-09-13 18:05:48 +02:00
Matthew Honnibal
59cf533879 Improve ud-train script. Make config optional 2018-09-13 14:24:08 +02:00
Matthew Honnibal
da7650e84b Fix maximum doc length in ud_train script 2018-09-13 14:10:25 +02:00
Matthew Honnibal
4d2d7d5866 Fix new feature flags 2018-08-27 02:12:39 +02:00
Matthew Honnibal
9c33d4d1df Add more hyper-parameters to spacy ud-train
* subword_features: Controls whether subword features are used in the
word embeddings. True by default (specifically, prefix, suffix and word
shape). Should be set to False for languages like Chinese and Japanese.

* conv_depth: Depth of the convolutional layers. Defaults to 4.
2018-08-27 01:48:46 +02:00
Matthew Honnibal
595c893791 Expose noise_level option in train CLI 2018-08-16 00:41:44 +02:00
Matthew Honnibal
6ea981c839 Add converter for jsonl NER data 2018-08-14 14:04:32 +02:00
Matthew Honnibal
02c5c114d0 Fix usage of deprecated freqs.txt in init-model 2018-08-14 13:19:15 +02:00
Matthew Honnibal
4336397ecb Update develop from master 2018-08-14 03:04:28 +02:00
Xiaoquan Kong
f0c9652ed1 New Feature: display more detail when Error E067 (#2639)
* Fix off-by-one error

* Add verbose option

* Update verbose option

* Update documents for verbose option
2018-08-07 10:45:29 +02:00
Kaisa (Katarzyna) Korsak
e531a827db Changed conllu2json to be able to extract NER tags (#2594)
* extract ner tags from conllu file if available

* fixed a bug in regex
2018-07-25 22:21:31 +02:00
ines
d84b13e02c Merge branch 'master' into develop 2018-07-18 18:57:00 +02:00
Ole Henrik Skogstrøm
6e2930a4a2 Conll(u)-bio converter (#2525)
* Started simple conllxbiluo converter

* Fix missing BIO to BILUO conversion
2018-07-18 18:55:42 +02:00
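Note on 6e2930a4a2: the "BIO to BILUO conversion" mentioned in this commit can be sketched as follows. This is an illustrative version, not the converter added in the commit, and the function name `bio_to_biluo` is hypothetical. It assumes well-formed BIO input in which every entity starts with a `B-` tag. A tag becomes `U-` (unit) when a `B-` tag has no continuation, and `L-` (last) when an `I-` tag has no continuation.

```python
def bio_to_biluo(tags):
    """Convert a list of BIO tags to BILUO tags.

    Hypothetical helper for illustration; assumes well-formed BIO input.
    """
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # The entity continues only if the next tag is I- with the same label.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label
        if prefix == "B":
            biluo.append(("B-" if continues else "U-") + label)
        elif prefix == "I":
            biluo.append(("I-" if continues else "L-") + label)
        else:
            raise ValueError("Unexpected tag prefix in: " + tag)
    return biluo


if __name__ == "__main__":
    print(bio_to_biluo(["O", "B-PER", "I-PER", "I-PER", "O", "B-LOC"]))
```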
Matthew Honnibal
8ae1bec8bf Fix init_model 2018-07-05 14:02:06 +02:00
Matthew Honnibal
dee8bdb900 Fix init-model for npz vectors 2018-07-04 02:29:48 +02:00
Matthew Honnibal
59d655e8d0 Fix model init from jsonl 2018-07-04 01:30:40 +02:00
Matthew Honnibal
1e38bea6e9 Save vectors init 2018-07-03 23:55:04 +02:00
Matthew Honnibal
6692833887 Fix init_model 2018-07-03 23:24:11 +02:00
Matthew Honnibal
4a38a26cb5 Fix init_model 2018-07-03 22:57:11 +02:00
Matthew Honnibal
019d09e3c3 Fix init model 2018-07-03 22:16:44 +02:00
Matthew Honnibal
2543f8c93a Support .npz vectors in init-model command 2018-07-03 21:42:16 +02:00
Matthew Honnibal
86aad11939 Fix init_model arg 2018-07-03 17:00:42 +02:00
Matthew Honnibal
eff42d36e3 Fix init model command 2018-07-03 16:32:23 +02:00
Matthew Honnibal
6a89faf12e Add support for jsonl-formatted lexical attributes to init-model command. 2018-07-03 12:22:56 +02:00
Matthew Honnibal
c83fccfe2a Fix output of best model 2018-06-25 23:05:56 +02:00
Matthew Honnibal
69c900f003 Fix init-model if no vectors provided 2018-06-25 18:26:02 +02:00
Matthew Honnibal
664f89327a Fix init-model if no vectors provided 2018-06-25 17:58:45 +02:00
Matthew Honnibal
c4698f5712 Don't collate model unless training succeeds 2018-06-25 16:36:42 +02:00
Matthew Honnibal
24dfbb8a28 Fix model collation 2018-06-25 14:35:24 +02:00
Matthew Honnibal
62237755a4 Import shutil 2018-06-25 13:40:17 +02:00
Matthew Honnibal
a040fca99e Import json into cli.train 2018-06-25 11:50:37 +02:00
Matthew Honnibal
2c703d99c2 Fix collation of best models 2018-06-25 01:21:34 +02:00
Matthew Honnibal
2c80b7c013 Collate best model after training 2018-06-24 23:39:52 +02:00
ines
330c039106 Merge branch 'master' into develop 2018-05-26 18:30:52 +02:00
James Messinger
4515e96e90 Better formatting for spacy train CLI (#2357)
* Better formatting for `spacy train` CLI

Changed to use fixed-spaces rather than tabs to align table headers and data.

### Before:
```
Itn.    P.Loss  N.Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %
0       4618.857        2910.004        76.172  79.645  67.987  88.732  88.261  100.000 4436.9  6376.4
1       4671.972        3764.812        74.481  78.046  62.374  82.680  88.377  100.000 4672.2  6227.1
2       4742.756        3673.473        71.994  77.380  63.966  84.494  90.620  100.000 4298.0  5983.9
```

### After:
```
Itn.  Dep Loss  NER Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %  CPU WPS  GPU WPS
0     4618.857  2910.004  76.172  79.645  67.987  88.732  88.261  100.000  4436.9   6376.4
1     4671.972  3764.812  74.481  78.046  62.374  82.680  88.377  100.000  4672.2   6227.1
2     4742.756  3673.473  71.994  77.380  63.966  84.494  90.620  100.000  4298.0   5983.9
```

* Added contributor file
2018-05-25 13:08:45 +02:00
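Note on 4515e96e90: the switch from tabs to fixed-width, space-padded columns described in this commit can be sketched as below. This is an illustration of the technique, not the code merged in the PR. The helper name `format_table` and the two-space column gap are assumptions. Each cell is left-justified to the width of the longest entry in its column, so headers and data stay aligned regardless of terminal tab stops.

```python
def format_table(headers, rows, gap="  "):
    """Return table lines with every column padded to a fixed width.

    Hypothetical helper; the column gap is an assumed default.
    """
    table = [list(headers)] + [list(row) for row in rows]
    # Column width = length of the longest cell in that column.
    widths = [max(len(str(cell)) for cell in column) for column in zip(*table)]
    lines = []
    for row in table:
        cells = (str(cell).ljust(width) for cell, width in zip(row, widths))
        lines.append(gap.join(cells).rstrip())
    return lines


if __name__ == "__main__":
    headers = ["Itn.", "Dep Loss", "NER Loss", "UAS"]
    rows = [
        [0, "4618.857", "2910.004", "76.172"],
        [1, "4671.972", "3764.812", "74.481"],
    ]
    for line in format_table(headers, rows):
        print(line)
```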
Matthew Honnibal
ce458c2428 Fix spacy requirement constraint in package template 2018-05-22 20:50:46 +02:00
Matthew Honnibal
f3b4f6a4ec Merge setup.py 2018-05-20 23:21:00 +02:00