spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-12-30 21:43:34 +03:00

Motoki Wu 8e2cef49f3 Add save after `--save-every` batches for `spacy pretrain` (#3510 )

When using `spacy pretrain`, the model is saved only after every epoch. But each epoch can be very large, since `pretrain` is used for language modeling tasks. So I added a `--save-every` option to the CLI to save after every `--save-every` batches.

## Description

To test, save this file as `sample_sents.jsonl`:

```
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
```

Then pass `--save-every 2` when pretraining:

```bash
spacy pretrain sample_sents.jsonl en_core_web_md here -nw 1 -bs 1 -i 10 --save-every 2
```

It should save the model to the `here/` folder after every 2 batches. Models saved during an epoch have `.temp` appended to the save name. At the end of training, you should see these files (`ls here/`):

```bash
config.json     model2.bin      model5.bin      model8.bin
log.jsonl       model2.temp.bin model5.temp.bin model8.temp.bin
model0.bin      model3.bin      model6.bin      model9.bin
model0.temp.bin model3.temp.bin model6.temp.bin model9.temp.bin
model1.bin      model4.bin      model7.bin
model1.temp.bin model4.temp.bin model7.temp.bin
```

### Types of change

This is a new feature for `spacy pretrain`.

🌵 Unfortunately, I haven't been able to test this because compiling from source is not working (cythonize error).

```
Processing matcher.pyx
[Errno 2] No such file or directory: '/Users/mwu/github/spaCy/spacy/matcher.pyx'
Traceback (most recent call last):
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 169, in <module>
    run(args.root)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 158, in run
    process(base, filename, db)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 124, in process
    preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp")
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 87, in preserve_cwd
    func(args)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 63, in process_pyx
    raise Exception("Cython failed")
Exception: Cython failed
Traceback (most recent call last):
  File "setup.py", line 276, in <module>
    setup_package()
  File "setup.py", line 209, in setup_package
    generate_cython(root, "spacy")
  File "setup.py", line 132, in generate_cython
    raise RuntimeError("Running cythonize failed")
RuntimeError: Running cythonize failed
```

Edit: Fixed! After deleting all `.cpp` files: `find spacy -name "*.cpp" | xargs rm`

## Checklist

- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

2019-04-22 14:10:16 +02:00
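
The naming scheme described in the commit message can be sketched roughly as follows. This is not the code from the PR: the function `checkpoint_names` and its parameters are hypothetical, and the exact batch at which a mid-epoch save is triggered is an assumption. It only illustrates why each epoch ends up with one `.temp` file (overwritten during the epoch) plus one final file.

```python
def checkpoint_names(n_epochs, n_batches, save_every):
    """Yield checkpoint file names in the order they would be written.

    Hypothetical helper for illustration; not part of spaCy.
    """
    for epoch in range(n_epochs):
        for batch_count in range(1, n_batches + 1):
            # Assumption: a mid-epoch save happens after every
            # `save_every` batches and reuses one temp name per epoch.
            if save_every and batch_count % save_every == 0:
                yield "model{}.temp.bin".format(epoch)
        # The regular save at the end of every epoch.
        yield "model{}.bin".format(epoch)


# 16 sentences with a batch size of 1 gives 16 batches per epoch (assumed),
# matching `-bs 1 -i 10 --save-every 2` from the test command.
names = sorted(set(checkpoint_names(n_epochs=10, n_batches=16, save_every=2)))
print(names)
```

With these inputs the distinct names are `model0.bin` through `model9.bin` and `model0.temp.bin` through `model9.temp.bin`, which matches the model files in the `ls here/` listing.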
annotation.md	Don't auto-slugify accordion links [ci skip]	2019-03-12 15:30:49 +01:00
cli.md	Add save after `--save-every` batches for `spacy pretrain` (#3510 )	2019-04-22 14:10:16 +02:00
cython-classes.md	💫 Update website (#3285 )	2019-02-17 19:31:19 +01:00
cython-structs.md	💫 Update website (#3285 )	2019-02-17 19:31:19 +01:00
cython.md	💫 Update website (#3285 )	2019-02-17 19:31:19 +01:00
dependencyparser.md	💫 Make serialization methods consistent (#3385 )	2019-03-10 19:16:45 +01:00
doc.md	Add Doc.lang and Doc.lang_	2019-03-11 14:21:40 +01:00
entityrecognizer.md	💫 Make serialization methods consistent (#3385 )	2019-03-10 19:16:45 +01:00
entityruler.md	Tidy up and improve docs and docstrings (#3370 )	2019-03-08 11:42:26 +01:00
goldcorpus.md	💫 Update website (#3285 )	2019-02-17 19:31:19 +01:00
goldparse.md	Auto-format [ci skip]	2019-02-27 12:07:35 +01:00
index.md	💫 Update website (#3285 )	2019-02-17 19:31:19 +01:00
language.md	💫 Allow passing of config parameters to specific pipeline components (#3386 )	2019-03-10 23:36:47 +01:00
lemmatizer.md	💫 Update website (#3285 )	2019-02-17 19:31:19 +01:00
lexeme.md	💫 Update website (#3285 )	2019-02-17 19:31:19 +01:00
matcher.md	Remove n_threads	2019-02-17 22:25:42 +01:00
phrasematcher.md	Remove n_threads	2019-02-17 22:25:42 +01:00
pipeline-functions.md	Tidy up and improve docs and docstrings (#3370 )	2019-03-08 11:42:26 +01:00
sentencizer.md	💫 Add better and serializable sentencizer (#3471 )	2019-03-23 15:45:02 +01:00
span.md	Update Span.__init__ docs (see #3445 ) [ci skip]	2019-03-20 17:24:17 +01:00
stringstore.md	💫 Make serialization methods consistent (#3385 )	2019-03-10 19:16:45 +01:00
tagger.md	💫 Make serialization methods consistent (#3385 )	2019-03-10 19:16:45 +01:00
textcategorizer.md	Fix formatting [ci skip]	2019-03-23 16:45:50 +01:00
token.md	Removes duplicate in table (#3550 )	2019-04-08 10:30:42 +02:00
tokenizer.md	DOC: Update tokenizer docs to include default value for batch_size in pipe (#3492 )	2019-03-28 12:48:02 +01:00
top-level.md	fix(util): fix decaying function output (#3495 )	2019-03-28 13:24:47 +01:00
vectors.md	Fix website docs for Vectors.from_glove (#3565 )	2019-04-10 15:23:27 +02:00
vocab.md	Document new API [ci skip]	2019-03-11 15:23:53 +01:00