* test sPacy commit to git fri 04052019 10:54
* change Data format from my format to master format
* ทัทั้งนี้ ---> ทั้งนี้
* delete stop_word translate from Eng
* Adjust formatting and readability
* add Thai norm_exception
* Add Dobita21 SCA
* editรึ : หรือ,
* Update Dobita21.md
* Auto-format
* Integrate norms into language defaults
* add acronym and some norm exception words
<!--- Provide a general summary of your changes in the title. -->
When using `spacy pretrain`, the model is saved only after every epoch. But each epoch can be very big since `pretrain` is used for language modeling tasks. So I added a `--save-every` option in the CLI to save after every `--save-every` batches.
## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->
To test...
Save this file to `sample_sents.jsonl`
```
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
```
Then run `--save-every 2` when pretraining.
```bash
spacy pretrain sample_sents.jsonl en_core_web_md here -nw 1 -bs 1 -i 10 --save-every 2
```
And it should save the model to the `here/` folder after every 2 batches. The models that are saved during an epoch will have a `.temp` appended to the save name.
At the end the training, you should see these files (`ls here/`):
```bash
config.json model2.bin model5.bin model8.bin
log.jsonl model2.temp.bin model5.temp.bin model8.temp.bin
model0.bin model3.bin model6.bin model9.bin
model0.temp.bin model3.temp.bin model6.temp.bin model9.temp.bin
model1.bin model4.bin model7.bin
model1.temp.bin model4.temp.bin model7.temp.bin
```
### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->
This is a new feature to `spacy pretrain`.
🌵 **Unfortunately, I haven't been able to test this because compiling from source is not working (cythonize error).**
```
Processing matcher.pyx
[Errno 2] No such file or directory: '/Users/mwu/github/spaCy/spacy/matcher.pyx'
Traceback (most recent call last):
File "/Users/mwu/github/spaCy/bin/cythonize.py", line 169, in <module>
run(args.root)
File "/Users/mwu/github/spaCy/bin/cythonize.py", line 158, in run
process(base, filename, db)
File "/Users/mwu/github/spaCy/bin/cythonize.py", line 124, in process
preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp")
File "/Users/mwu/github/spaCy/bin/cythonize.py", line 87, in preserve_cwd
func(*args)
File "/Users/mwu/github/spaCy/bin/cythonize.py", line 63, in process_pyx
raise Exception("Cython failed")
Exception: Cython failed
Traceback (most recent call last):
File "setup.py", line 276, in <module>
setup_package()
File "setup.py", line 209, in setup_package
generate_cython(root, "spacy")
File "setup.py", line 132, in generate_cython
raise RuntimeError("Running cythonize failed")
RuntimeError: Running cythonize failed
```
Edit: Fixed! after deleting all `.cpp` files: `find spacy -name "*.cpp" | xargs rm`
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* test sPacy commit to git fri 04052019 10:54
* change Data format from my format to master format
* ทัทั้งนี้ ---> ทั้งนี้
* delete stop_word translate from Eng
* Adjust formatting and readability
* add Thai norm_exception
* Add Dobita21 SCA
* editรึ : หรือ,
* Update Dobita21.md
* Auto-format
* Integrate norms into language defaults
<!--- Provide a general summary of your changes in the title. -->
## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->
### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
If the Morphology class tries to lemmatize a word that's not in the
string store, it's forced to just return it as-is. While loading
exceptions, the class could hit a case where these strings weren't in
the string store yet. The resulting lemmas could then be cached, leading
to some words receiving upper-case lemmas. Closes#3551.
* Add early stopping
* Add return_score option to evaluate
* Fix missing str to path conversion
* Fix import + old python compatibility
* Fix bad beam_width setting during cpu evaluation in spacy train with gpu option turned on