mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-10-25 05:01:02 +03:00 
			
		
		
		
	Add save after --save-every batches for spacy pretrain (#3510)
				
					
				
`spacy pretrain` currently saves the model only at the end of each epoch. Because `pretrain` is used for language modeling tasks, a single epoch can be very large, so waiting for the end of an epoch can mean a long time between saves. This PR adds a `--save-every` CLI option that also saves the model after every `--save-every` batches.
## Description
To test, save the following as `sample_sents.jsonl`:
```
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
```
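The sample file can also be generated rather than pasted. The following is a minimal sketch using only the standard library; the helper name `write_sample_sents` and its parameters are hypothetical and are not part of spaCy.

```python
import json
from pathlib import Path


def write_sample_sents(path, n_lines=16, text="hello there."):
    """Write n_lines JSONL records of the form {"text": ...} to path.

    Hypothetical helper for reproducing the test input; not a spaCy API.
    """
    path = Path(path)
    with path.open("w", encoding="utf8") as file_:
        for _ in range(n_lines):
            # One JSON object per line, matching the sample shown above.
            file_.write(json.dumps({"text": text}) + "\n")
    return path
```

Calling `write_sample_sents("sample_sents.jsonl")` writes the same 16 records as the sample.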
Then pass `--save-every 2` when pretraining:
```bash
spacy pretrain sample_sents.jsonl en_core_web_md here -nw 1 -bs 1 -i 10 --save-every 2
```
This should save the model to the `here/` folder after every 2 batches. Models saved during an epoch have `.temp` inserted into the file name (for example, `model0.temp.bin`).
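To make the timing concrete, here is a small sketch of the save condition as it appears in the diff further down, `n_save_every and (batch_id % n_save_every == 0)`. The function name `temp_save_batches` is hypothetical and only mirrors that condition; it does not call spaCy.

```python
def temp_save_batches(n_batches, n_save_every=None):
    """Return the 0-based batch indices at which a ".temp" save would fire.

    Mirrors the condition `n_save_every and (batch_id % n_save_every == 0)`.
    Hypothetical helper for illustration only.
    """
    if not n_save_every:
        # Option not set (None or 0): no intermediate saves.
        return []
    return [
        batch_id
        for batch_id in range(n_batches)
        if batch_id % n_save_every == 0
    ]
```

Under this condition the first intermediate save fires on batch index 0, that is, right after the first batch. Because the file name depends only on the epoch, each intermediate save within an epoch overwrites the same `.temp` file.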
At the end of training, you should see these files (`ls here/`):
```bash
config.json     model2.bin      model5.bin      model8.bin
log.jsonl       model2.temp.bin model5.temp.bin model8.temp.bin
model0.bin      model3.bin      model6.bin      model9.bin
model0.temp.bin model3.temp.bin model6.temp.bin model9.temp.bin
model1.bin      model4.bin      model7.bin
model1.temp.bin model4.temp.bin model7.temp.bin
```
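The listing above follows from the naming pattern in the diff, `"model%d%s.bin" % (epoch, is_temp_str)`. The sketch below enumerates the expected file names so they can be compared against the output directory. The function name `expected_outputs` is hypothetical, and it assumes every epoch contains at least one batch, so that an intermediate save fires at least once per epoch.

```python
def expected_outputs(n_iter, n_save_every=None):
    """List the file names expected in the output directory after pretraining.

    Hypothetical helper; assumes at least one batch per epoch.
    """
    # config.json and log.jsonl appear in the listing alongside the models.
    names = {"config.json", "log.jsonl"}
    for epoch in range(n_iter):
        # End-of-epoch save.
        names.add("model%d.bin" % epoch)
        if n_save_every:
            # Intermediate save, overwritten within the same epoch.
            names.add("model%d.temp.bin" % epoch)
    return sorted(names)
```

For the command above (`-i 10` with saving every 2 batches), this yields the 22 names shown in the listing.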
### Types of change
This is a new feature to `spacy pretrain`.
🌵 **Unfortunately, I haven't been able to test this because compiling from source is not working (cythonize error).** 
```
Processing matcher.pyx
[Errno 2] No such file or directory: '/Users/mwu/github/spaCy/spacy/matcher.pyx'
Traceback (most recent call last):
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 169, in <module>
    run(args.root)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 158, in run
    process(base, filename, db)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 124, in process
    preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp")
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 87, in preserve_cwd
    func(*args)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 63, in process_pyx
    raise Exception("Cython failed")
Exception: Cython failed
Traceback (most recent call last):
  File "setup.py", line 276, in <module>
    setup_package()
  File "setup.py", line 209, in setup_package
    generate_cython(root, "spacy")
  File "setup.py", line 132, in generate_cython
    raise RuntimeError("Running cythonize failed")
RuntimeError: Running cythonize failed
```
Edit: Fixed by deleting all `.cpp` files before rebuilding: `find spacy -name "*.cpp" | xargs rm`
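For environments without `find` and `xargs`, the same cleanup can be sketched in Python. The helper name `remove_generated_cpp` is hypothetical. Like the shell command, it deletes every `.cpp` file under the given directory, so it assumes all of them are generated and safe to remove.

```python
from pathlib import Path


def remove_generated_cpp(root="spacy"):
    """Delete every *.cpp file under root and return the removed paths.

    Hypothetical helper equivalent to: find spacy -name "*.cpp" | xargs rm
    """
    removed = []
    # Materialize the matches first so deletion does not disturb the walk.
    for path in list(Path(root).rglob("*.cpp")):
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed
```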
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
			
			
This commit is contained in:
		
							parent
							
								
									189c90743c
								
							
						
					
					
						commit
						8e2cef49f3
					
				|  | @ -34,7 +34,8 @@ from .. import util | ||||||
|     max_length=("Max words per example.", "option", "xw", int), |     max_length=("Max words per example.", "option", "xw", int), | ||||||
|     min_length=("Min words per example.", "option", "nw", int), |     min_length=("Min words per example.", "option", "nw", int), | ||||||
|     seed=("Seed for random number generators", "option", "s", float), |     seed=("Seed for random number generators", "option", "s", float), | ||||||
|     nr_iter=("Number of iterations to pretrain", "option", "i", int), |     n_iter=("Number of iterations to pretrain", "option", "i", int), | ||||||
|  |     n_save_every=("Save model every X batches.", "option", "se", int), | ||||||
| ) | ) | ||||||
| def pretrain( | def pretrain( | ||||||
|     texts_loc, |     texts_loc, | ||||||
|  | @ -46,11 +47,12 @@ def pretrain( | ||||||
|     loss_func="cosine", |     loss_func="cosine", | ||||||
|     use_vectors=False, |     use_vectors=False, | ||||||
|     dropout=0.2, |     dropout=0.2, | ||||||
|     nr_iter=1000, |     n_iter=1000, | ||||||
|     batch_size=3000, |     batch_size=3000, | ||||||
|     max_length=500, |     max_length=500, | ||||||
|     min_length=5, |     min_length=5, | ||||||
|     seed=0, |     seed=0, | ||||||
|  |     n_save_every=None, | ||||||
| ): | ): | ||||||
|     """ |     """ | ||||||
|     Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components, |     Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components, | ||||||
|  | @ -115,9 +117,26 @@ def pretrain( | ||||||
|     msg.divider("Pre-training tok2vec layer") |     msg.divider("Pre-training tok2vec layer") | ||||||
|     row_settings = {"widths": (3, 10, 10, 6, 4), "aligns": ("r", "r", "r", "r", "r")} |     row_settings = {"widths": (3, 10, 10, 6, 4), "aligns": ("r", "r", "r", "r", "r")} | ||||||
|     msg.row(("#", "# Words", "Total Loss", "Loss", "w/s"), **row_settings) |     msg.row(("#", "# Words", "Total Loss", "Loss", "w/s"), **row_settings) | ||||||
|     for epoch in range(nr_iter): | 
 | ||||||
|         for batch in util.minibatch_by_words( |     def _save_model(epoch, is_temp=False): | ||||||
|             ((text, None) for text in texts), size=batch_size |         is_temp_str = ".temp" if is_temp else "" | ||||||
|  |         with model.use_params(optimizer.averages): | ||||||
|  |             with (output_dir / ("model%d%s.bin" % (epoch, is_temp_str))).open( | ||||||
|  |                 "wb" | ||||||
|  |             ) as file_: | ||||||
|  |                 file_.write(model.tok2vec.to_bytes()) | ||||||
|  |             log = { | ||||||
|  |                 "nr_word": tracker.nr_word, | ||||||
|  |                 "loss": tracker.loss, | ||||||
|  |                 "epoch_loss": tracker.epoch_loss, | ||||||
|  |                 "epoch": epoch, | ||||||
|  |             } | ||||||
|  |             with (output_dir / "log.jsonl").open("a") as file_: | ||||||
|  |                 file_.write(srsly.json_dumps(log) + "\n") | ||||||
|  | 
 | ||||||
|  |     for epoch in range(n_iter): | ||||||
|  |         for batch_id, batch in enumerate( | ||||||
|  |             util.minibatch_by_words(((text, None) for text in texts), size=batch_size) | ||||||
|         ): |         ): | ||||||
|             docs = make_docs( |             docs = make_docs( | ||||||
|                 nlp, |                 nlp, | ||||||
|  | @ -133,17 +152,9 @@ def pretrain( | ||||||
|                 msg.row(progress, **row_settings) |                 msg.row(progress, **row_settings) | ||||||
|                 if texts_loc == "-" and tracker.words_per_epoch[epoch] >= 10 ** 7: |                 if texts_loc == "-" and tracker.words_per_epoch[epoch] >= 10 ** 7: | ||||||
|                     break |                     break | ||||||
|         with model.use_params(optimizer.averages): |             if n_save_every and (batch_id % n_save_every == 0): | ||||||
|             with (output_dir / ("model%d.bin" % epoch)).open("wb") as file_: |                 _save_model(epoch, is_temp=True) | ||||||
|                 file_.write(model.tok2vec.to_bytes()) |         _save_model(epoch) | ||||||
|             log = { |  | ||||||
|                 "nr_word": tracker.nr_word, |  | ||||||
|                 "loss": tracker.loss, |  | ||||||
|                 "epoch_loss": tracker.epoch_loss, |  | ||||||
|                 "epoch": epoch, |  | ||||||
|             } |  | ||||||
|             with (output_dir / "log.jsonl").open("a") as file_: |  | ||||||
|                 file_.write(srsly.json_dumps(log) + "\n") |  | ||||||
|         tracker.epoch_loss = 0.0 |         tracker.epoch_loss = 0.0 | ||||||
|         if texts_loc != "-": |         if texts_loc != "-": | ||||||
|             # Reshuffle the texts if texts were loaded from a file |             # Reshuffle the texts if texts were loaded from a file | ||||||
|  |  | ||||||
|  | @ -285,6 +285,7 @@ improvement. | ||||||
| ```bash | ```bash | ||||||
| $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width] | $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width] | ||||||
| [--depth] [--embed-rows] [--dropout] [--seed] [--n-iter] [--use-vectors] | [--depth] [--embed-rows] [--dropout] [--seed] [--n-iter] [--use-vectors] | ||||||
|  | [--n-save_every] | ||||||
| ``` | ``` | ||||||
| 
 | 
 | ||||||
| | Argument               | Type       | Description                                                                                                                       | | | Argument               | Type       | Description                                                                                                                       | | ||||||
|  | @ -302,6 +303,7 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width] | ||||||
| | `--seed`, `-s`         | option     | Seed for random number generators.                                                                                                | | | `--seed`, `-s`         | option     | Seed for random number generators.                                                                                                | | ||||||
| | `--n-iter`, `-i`       | option     | Number of iterations to pretrain.                                                                                                 | | | `--n-iter`, `-i`       | option     | Number of iterations to pretrain.                                                                                                 | | ||||||
| | `--use-vectors`, `-uv` | flag       | Whether to use the static vectors as input features.                                                                              | | | `--use-vectors`, `-uv` | flag       | Whether to use the static vectors as input features.                                                                              | | ||||||
|  | | `--n-save_every`, `-se`  | option     | Save model every X batches.                                                                                                       | | ||||||
| | **CREATES**            | weights    | The pre-trained weights that can be used to initialize `spacy train`.                                                             | | | **CREATES**            | weights    | The pre-trained weights that can be used to initialize `spacy train`.                                                             | | ||||||
| 
 | 
 | ||||||
| ### JSONL format for raw text {#pretrain-jsonl} | ### JSONL format for raw text {#pretrain-jsonl} | ||||||
|  |  | ||||||