Matthew Honnibal
008e1ee1dd
Update pretrain command
2018-11-29 12:36:43 +00:00
Matthew Honnibal
61e435610e
💫 Feature/improve pretraining (#2971)
...
* Improve spacy pretrain script
* Implement BERT-style 'masked language model' objective. Much better
results.
* Improve logging.
* Add length cap for documents, to avoid memory errors.
* Require thinc 7.0.0.dev1
* Add argument for using pretrained vectors
* Fix defaults
* Fix syntax error
* Tweak pretraining script
* Fix data limits in spacy.gold
* Fix pretrain script
2018-11-28 18:04:58 +01:00
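The "BERT-style 'masked language model' objective" named in the commit above can be illustrated with a minimal sketch. This is not the code from `spacy pretrain`: the function name `mask_tokens`, the `[MASK]` placeholder, the 15% masking rate and the 80/10/10 replacement split are assumptions taken from the general BERT recipe, and the actual script may use different values.

```python
import random


def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, rng=None):
    """Corrupt a token sequence for a masked-language-model objective.

    Returns (corrupted, targets). targets[i] holds the original token at
    positions selected for prediction and None everywhere else.
    All names and default values here are illustrative assumptions.
    """
    rng = rng or random.Random()
    corrupted, targets = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            # Position not selected: keep the token, nothing to predict.
            corrupted.append(token)
            targets.append(None)
            continue
        targets.append(token)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(mask_token)         # replace with the mask symbol
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # replace with a random word
        else:
            corrupted.append(token)              # leave the token unchanged
    return corrupted, targets


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    print(mask_tokens(words, vocab=["cat", "tree", "runs"], rng=random.Random(0)))
```

The model is then trained to recover `targets` from `corrupted`, with the loss computed only at positions where the target is not `None`.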
Matthew Honnibal
ef0820827a
Update hyper-parameters after NER random search (#2972)
...
These experiments were completed a few weeks ago, but I didn't make the PR, pending model release.
Token vector width: 128->96
Hidden width: 128->64
Embed size: 5000->2000
Dropout: 0.2->0.1
Updated optimizer defaults (unclear how important?)
This should improve speed, model size and load time, while keeping
similar or slightly better accuracy.
The tl;dr is we prefer to prevent over-fitting by reducing model size,
rather than using more dropout.
2018-11-27 18:49:52 +01:00
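To make the size of the change in the commit above easier to see, the old and new values can be put side by side. The numbers are copied from the commit message; the dictionary keys and the helper `relative_change` are hypothetical names chosen for this sketch, not spaCy settings.

```python
# Values copied from the commit message; the key names are illustrative only.
OLD = {"token_vector_width": 128, "hidden_width": 128, "embed_size": 5000, "dropout": 0.2}
NEW = {"token_vector_width": 96, "hidden_width": 64, "embed_size": 2000, "dropout": 0.1}


def relative_change(old, new):
    """Return the fractional change for each hyper-parameter present in `old`."""
    return {name: (new[name] - old[name]) / old[name] for name in old}


if __name__ == "__main__":
    for name, change in relative_change(OLD, NEW).items():
        print(f"{name}: {OLD[name]} -> {NEW[name]} ({change:+.0%})")
```

Every value shrinks, which matches the stated preference for limiting over-fitting through a smaller model rather than more dropout.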
Ines Montani
b4581435f6
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-11-16 13:08:22 +01:00
Ines Montani
e2f75eb492
Fix message formatting
2018-11-16 13:08:20 +01:00
Matthew Honnibal
2874b8efd8
Fix tok2vec loading in spacy train
2018-11-15 23:34:54 +00:00
Matthew Honnibal
2ddd428834
Fix pretrain script
2018-11-15 23:34:35 +00:00
Matthew Honnibal
f8afaa0c1c
Fix pretrain
2018-11-15 22:46:53 +00:00
Matthew Honnibal
6af6950e46
Fix pretrain
2018-11-15 22:45:36 +00:00
Matthew Honnibal
3e7b214e57
Make pretrain script work with stream from stdin
2018-11-15 22:44:07 +00:00
Matthew Honnibal
8fdb9bc278
💫 Add experimental ULMFit/BERT/Elmo-like pretraining (#2931)
...
* Add 'spacy pretrain' command
* Fix pretrain command for Python 2
* Fix pretrain command
* Fix pretrain command
2018-11-15 22:17:16 +01:00
Matthew Honnibal
8f2a6367e9
Fix usage of PyTorch BiLSTM in ud_train
2018-09-13 22:54:59 +00:00
Matthew Honnibal
445b81ce3f
Support bilstm_depth argument in ud-train
2018-09-13 19:30:22 +02:00
Matthew Honnibal
3eb9f3e2b8
Fix defaults for ud-train
2018-09-13 18:05:48 +02:00
Matthew Honnibal
59cf533879
Improve ud-train script. Make config optional
2018-09-13 14:24:08 +02:00
Matthew Honnibal
da7650e84b
Fix maximum doc length in ud_train script
2018-09-13 14:10:25 +02:00
Matthew Honnibal
4d2d7d5866
Fix new feature flags
2018-08-27 02:12:39 +02:00
Matthew Honnibal
9c33d4d1df
Add more hyper-parameters to spacy ud-train
...
* subword_features: Controls whether subword features are used in the
word embeddings. True by default (specifically, prefix, suffix and word
shape). Should be set to False for languages like Chinese and Japanese.
* conv_depth: Depth of the convolutional layers. Defaults to 4.
2018-08-27 01:48:46 +02:00
Matthew Honnibal
595c893791
Expose noise_level option in train CLI
2018-08-16 00:41:44 +02:00
Matthew Honnibal
6ea981c839
Add converter for jsonl NER data
2018-08-14 14:04:32 +02:00
Matthew Honnibal
02c5c114d0
Fix usage of deprecated freqs.txt in init-model
2018-08-14 13:19:15 +02:00
Matthew Honnibal
4336397ecb
Update develop from master
2018-08-14 03:04:28 +02:00
Xiaoquan Kong
f0c9652ed1
New Feature: display more detail when Error E067 (#2639)
...
* Fix off-by-one error
* Add verbose option
* Update verbose option
* Update documents for verbose option
2018-08-07 10:45:29 +02:00
Kaisa (Katarzyna) Korsak
e531a827db
Changed conllu2json to be able to extract NER tags (#2594)
...
* extract ner tags from conllu file if available
* fixed a bug in regex
2018-07-25 22:21:31 +02:00
ines
d84b13e02c
Merge branch 'master' into develop
2018-07-18 18:57:00 +02:00
Ole Henrik Skogstrøm
6e2930a4a2
Conll(u)-bio converter (#2525)
...
* Started simple conllxbiluo converter
* Fix missing BIO to BILUO conversion
2018-07-18 18:55:42 +02:00
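The BIO to BILUO conversion mentioned in the commit above can be sketched in a few lines. This is an illustration of the tagging schemes, not the converter's own code: the function name `bio_to_biluo` is hypothetical, and the sketch assumes well-formed BIO input in which every `I-` tag continues an entity with the same label.

```python
def bio_to_biluo(tags):
    """Convert a well-formed BIO tag sequence to BILUO.

    B = begin, I = inside, L = last, U = unit (single-token entity), O = outside.
    """
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            biluo.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            biluo.append(("I-" if continues else "L-") + label)
    return biluo


if __name__ == "__main__":
    print(bio_to_biluo(["B-PER", "I-PER", "O", "B-LOC"]))
```

BILUO marks entity ends and single-token entities explicitly, which is why a plain BIO sequence has to be rewritten with lookahead to the next tag.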
Matthew Honnibal
8ae1bec8bf
Fix init_model
2018-07-05 14:02:06 +02:00
Matthew Honnibal
dee8bdb900
Fix init-model for npz vectors
2018-07-04 02:29:48 +02:00
Matthew Honnibal
59d655e8d0
Fix model init from jsonl
2018-07-04 01:30:40 +02:00
Matthew Honnibal
1e38bea6e9
Save vectors init
2018-07-03 23:55:04 +02:00
Matthew Honnibal
6692833887
Fix init_model
2018-07-03 23:24:11 +02:00
Matthew Honnibal
4a38a26cb5
Fix init_model
2018-07-03 22:57:11 +02:00
Matthew Honnibal
019d09e3c3
Fix init model
2018-07-03 22:16:44 +02:00
Matthew Honnibal
2543f8c93a
Support .npz vectors in init-model command
2018-07-03 21:42:16 +02:00
Matthew Honnibal
86aad11939
Fix init_model arg
2018-07-03 17:00:42 +02:00
Matthew Honnibal
eff42d36e3
Fix init model command
2018-07-03 16:32:23 +02:00
Matthew Honnibal
6a89faf12e
Add support for jsonl-formatted lexical attributes to init-model command.
2018-07-03 12:22:56 +02:00
Matthew Honnibal
c83fccfe2a
Fix output of best model
2018-06-25 23:05:56 +02:00
Matthew Honnibal
69c900f003
Fix init-model if no vectors provided
2018-06-25 18:26:02 +02:00
Matthew Honnibal
664f89327a
Fix init-model if no vectors provided
2018-06-25 17:58:45 +02:00
Matthew Honnibal
c4698f5712
Don't collate model unless training succeeds
2018-06-25 16:36:42 +02:00
Matthew Honnibal
24dfbb8a28
Fix model collation
2018-06-25 14:35:24 +02:00
Matthew Honnibal
62237755a4
Import shutil
2018-06-25 13:40:17 +02:00
Matthew Honnibal
a040fca99e
Import json into cli.train
2018-06-25 11:50:37 +02:00
Matthew Honnibal
2c703d99c2
Fix collation of best models
2018-06-25 01:21:34 +02:00
Matthew Honnibal
2c80b7c013
Collate best model after training
2018-06-24 23:39:52 +02:00
ines
330c039106
Merge branch 'master' into develop
2018-05-26 18:30:52 +02:00
James Messinger
4515e96e90
Better formatting for `spacy train` CLI (#2357)
...
* Better formatting for `spacy train` CLI
Changed to use fixed-spaces rather than tabs to align table headers and data.
### Before:
```
Itn. P.Loss N.Loss UAS NER P. NER R. NER F. Tag % Token %
0 4618.857 2910.004 76.172 79.645 67.987 88.732 88.261 100.000 4436.9 6376.4
1 4671.972 3764.812 74.481 78.046 62.374 82.680 88.377 100.000 4672.2 6227.1
2 4742.756 3673.473 71.994 77.380 63.966 84.494 90.620 100.000 4298.0 5983.9
```
### After:
```
Itn.  Dep Loss  NER Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %  CPU WPS  GPU WPS
0     4618.857  2910.004  76.172  79.645  67.987  88.732  88.261  100.000  4436.9   6376.4
1     4671.972  3764.812  74.481  78.046  62.374  82.680  88.377  100.000  4672.2   6227.1
2     4742.756  3673.473  71.994  77.380  63.966  84.494  90.620  100.000  4298.0   5983.9
```
* Added contributor file
2018-05-25 13:08:45 +02:00
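The idea behind the commit above, padding every cell to a fixed width instead of separating cells with tabs, can be sketched as follows. The helper `format_row` and the column width of 10 are assumptions made for illustration; the actual `spacy train` code may format its output differently.

```python
HEADERS = ["Itn.", "Dep Loss", "NER Loss", "UAS"]
ROWS = [
    [0, 4618.857, 2910.004, 76.172],
    [1, 4671.972, 3764.812, 74.481],
]


def format_row(cells, width=10):
    """Left-justify each cell in a fixed-width column (width is an assumption)."""
    parts = []
    for cell in cells:
        # Floats get three decimals, matching the precision shown in the log.
        text = f"{cell:.3f}" if isinstance(cell, float) else str(cell)
        parts.append(text.ljust(width))
    return "".join(parts).rstrip()


if __name__ == "__main__":
    print(format_row(HEADERS))
    for row in ROWS:
        print(format_row(row))
```

Because every column starts at a fixed character offset, headers and values stay aligned regardless of the terminal's tab width.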
Matthew Honnibal
ce458c2428
Fix spacy requirement constraint in package template
2018-05-22 20:50:46 +02:00
Matthew Honnibal
f3b4f6a4ec
Merge setup.py
2018-05-20 23:21:00 +02:00