spaCy/website/docs/api
Motoki Wu 8e2cef49f3 Add save after --save-every batches for spacy pretrain (#3510)

When using `spacy pretrain`, the model is saved only at the end of each epoch. Because `pretrain` is used for language modeling tasks, a single epoch can be very large, so this PR adds a `--save-every` option to the CLI that also saves the model after every `--save-every` batches.

## Description

To test:

Save the following as `sample_sents.jsonl`:

```
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
```
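
The sample file can also be generated instead of pasted. This is a minimal sketch using only the standard library; the helper name `write_sample` is hypothetical and not part of spaCy.

```python
import json
from pathlib import Path


def write_sample(path, n_lines=16, text="hello there."):
    """Write `n_lines` identical JSONL records and return them as strings."""
    # One JSON object per line, matching the {"text": ...} format shown above.
    lines = [json.dumps({"text": text}) for _ in range(n_lines)]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf8")
    return lines


lines = write_sample("sample_sents.jsonl")
print(len(lines), lines[0])
```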

Then pass `--save-every 2` when pretraining:

```bash
spacy pretrain sample_sents.jsonl en_core_web_md here -nw 1 -bs 1 -i 10 --save-every 2
```

This should save the model to the `here/` folder after every 2 batches. Models saved partway through an epoch have `.temp` in the file name (for example, `model0.temp.bin`).

At the end of training, you should see these files (`ls here/`):

```bash
config.json     model2.bin      model5.bin      model8.bin
log.jsonl       model2.temp.bin model5.temp.bin model8.temp.bin
model0.bin      model3.bin      model6.bin      model9.bin
model0.temp.bin model3.temp.bin model6.temp.bin model9.temp.bin
model1.bin      model4.bin      model7.bin
model1.temp.bin model4.temp.bin model7.temp.bin
```
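
The naming pattern in this listing can be reproduced with a small simulation. This is only a sketch of the behavior described here, not the code in `spacy pretrain`: the helper `checkpoint_names` is hypothetical, and the exact save condition (`batch_id % save_every == 0`) is an assumption. It assumes 16 one-sentence batches per epoch (`-bs 1`) and 10 iterations (`-i 10`).

```python
def checkpoint_names(n_batches, n_iter, save_every):
    """Return checkpoint file names in the order they would be written.

    Hypothetical helper: the modulo condition is an assumption about when
    the mid-epoch save fires, not spaCy's actual implementation.
    """
    written = []
    for epoch in range(n_iter):
        for batch_id in range(1, n_batches + 1):
            # Mid-epoch save: the same ".temp" file is overwritten each time.
            if save_every and batch_id % save_every == 0:
                written.append(f"model{epoch}.temp.bin")
        # End-of-epoch save, which existed before this change.
        written.append(f"model{epoch}.bin")
    return written


names = checkpoint_names(n_batches=16, n_iter=10, save_every=2)
unique_files = sorted(set(names))
print(len(names), len(unique_files))
```

Under these assumptions there are many more writes than files, because each mid-epoch save overwrites the epoch's single `.temp` file. That is consistent with the listing showing exactly one `modelN.temp.bin` next to each `modelN.bin`.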

### Types of change

This is a new feature to `spacy pretrain`.

🌵 **Unfortunately, I haven't been able to test this because compiling from source is not working (cythonize error).** 

```
Processing matcher.pyx
[Errno 2] No such file or directory: '/Users/mwu/github/spaCy/spacy/matcher.pyx'
Traceback (most recent call last):
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 169, in <module>
    run(args.root)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 158, in run
    process(base, filename, db)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 124, in process
    preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp")
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 87, in preserve_cwd
    func(*args)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 63, in process_pyx
    raise Exception("Cython failed")
Exception: Cython failed
Traceback (most recent call last):
  File "setup.py", line 276, in <module>
    setup_package()
  File "setup.py", line 209, in setup_package
    generate_cython(root, "spacy")
  File "setup.py", line 132, in generate_cython
    raise RuntimeError("Running cythonize failed")
RuntimeError: Running cythonize failed
```

Edit: Fixed by deleting all `.cpp` files: `find spacy -name "*.cpp" | xargs rm`

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-04-22 14:10:16 +02:00
..
annotation.md Don't auto-slugify accordion links [ci skip] 2019-03-12 15:30:49 +01:00
cli.md Add save after --save-every batches for spacy pretrain (#3510) 2019-04-22 14:10:16 +02:00
cython-classes.md 💫 Update website (#3285) 2019-02-17 19:31:19 +01:00
cython-structs.md 💫 Update website (#3285) 2019-02-17 19:31:19 +01:00
cython.md 💫 Update website (#3285) 2019-02-17 19:31:19 +01:00
dependencyparser.md 💫 Make serialization methods consistent (#3385) 2019-03-10 19:16:45 +01:00
doc.md Add Doc.lang and Doc.lang_ 2019-03-11 14:21:40 +01:00
entityrecognizer.md 💫 Make serialization methods consistent (#3385) 2019-03-10 19:16:45 +01:00
entityruler.md Tidy up and improve docs and docstrings (#3370) 2019-03-08 11:42:26 +01:00
goldcorpus.md 💫 Update website (#3285) 2019-02-17 19:31:19 +01:00
goldparse.md Auto-format [ci skip] 2019-02-27 12:07:35 +01:00
index.md 💫 Update website (#3285) 2019-02-17 19:31:19 +01:00
language.md 💫 Allow passing of config parameters to specific pipeline components (#3386) 2019-03-10 23:36:47 +01:00
lemmatizer.md 💫 Update website (#3285) 2019-02-17 19:31:19 +01:00
lexeme.md 💫 Update website (#3285) 2019-02-17 19:31:19 +01:00
matcher.md Remove n_threads 2019-02-17 22:25:42 +01:00
phrasematcher.md Remove n_threads 2019-02-17 22:25:42 +01:00
pipeline-functions.md Tidy up and improve docs and docstrings (#3370) 2019-03-08 11:42:26 +01:00
sentencizer.md 💫 Add better and serializable sentencizer (#3471) 2019-03-23 15:45:02 +01:00
span.md Update Span.__init__ docs (see #3445) [ci skip] 2019-03-20 17:24:17 +01:00
stringstore.md 💫 Make serialization methods consistent (#3385) 2019-03-10 19:16:45 +01:00
tagger.md 💫 Make serialization methods consistent (#3385) 2019-03-10 19:16:45 +01:00
textcategorizer.md Fix formatting [ci skip] 2019-03-23 16:45:50 +01:00
token.md Removes duplicate in table (#3550) 2019-04-08 10:30:42 +02:00
tokenizer.md DOC: Update tokenizer docs to include default value for batch_size in pipe (#3492) 2019-03-28 12:48:02 +01:00
top-level.md fix(util): fix decaying function output (#3495) 2019-03-28 13:24:47 +01:00
vectors.md Fix website docs for Vectors.from_glove (#3565) 2019-04-10 15:23:27 +02:00
vocab.md Document new API [ci skip] 2019-03-11 15:23:53 +01:00