These experiments were completed a few weeks ago, but I held off on making the PR pending the model release.
Token vector width: 128->96
Hidden width: 128->64
Embed size: 5000->2000
Dropout: 0.2->0.1
Updated optimizer defaults (unclear how important?)
This should improve speed, model size and load time, while keeping
similar or slightly better accuracy.
The tl;dr is we prefer to prevent over-fitting by reducing model size,
rather than using more dropout.
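
To make the size of each change easier to see, here is a small sketch that tabulates the values listed above and prints the relative change. The dictionary keys are illustrative names chosen for this example, not necessarily the actual setting names in the codebase.

```python
# Before/after values copied from the list above.
# The key names are illustrative, not the real setting names.
CHANGES = {
    "token_vector_width": (128, 96),
    "hidden_width": (128, 64),
    "embed_size": (5000, 2000),
    "dropout": (0.2, 0.1),
}


def relative_change(old, new):
    """Return the change from old to new as a fraction of old."""
    return (new - old) / old


for name, (old, new) in CHANGES.items():
    print(f"{name}: {old} -> {new} ({relative_change(old, new):+.0%})")
```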
* Auto-format tests with black
* Add flake8 config
* Tidy up and remove unused imports
* Fix redefinitions of test functions
* Replace orths_and_spaces with words and spaces
* Fix compatibility with pytest 4.0
* xfail test for now
The test was previously overwritten by the following test due to a naming conflict, so the failure wasn't reported
* Unfail passing test
* Only use fixture via arguments
Fixes pytest 4.0 compatibility
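
The naming conflict mentioned above is easy to reproduce in plain Python. The sketch below uses a toy class as a stand-in for a test module (the names are made up for illustration): when two functions share a name, the second silently replaces the first, so anything that collects tests from the namespace only ever sees one of them.

```python
class ToyTestModule:
    """Stand-in for a test module in which two tests share a name."""

    def test_tokenizer(self):
        return "first"

    def test_tokenizer(self):  # noqa: F811 (flake8 flags this redefinition)
        return "second"


# Only one entry survives in the namespace, and it is the later definition.
collected = [name for name in vars(ToyTestModule) if name.startswith("test_")]
surviving_result = ToyTestModule().test_tokenizer()
```

This is why a failing test can go unreported: if the first definition is the one that fails, it is never run.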
Our epic matrix multiplication odyssey is drawing to a close...
I've now finally got the Blis linear algebra routines in a self-contained Python package, with wheels for Windows, Linux and OSX. The only missing platform at the moment is Windows Python 2.7. The result is at https://github.com/explosion/cython-blis
Thinc v7.0.0 will make the change to Blis. I've put a Thinc v7.0.0.dev0 up on PyPI so that we can test these changes with the CI, and even get them out to spacy-nightly, before Thinc v7.0.0 is released. This PR also updates the other dependencies to be in line with the current versions master is using. I've also resolved the msgpack deprecation problems, and gotten spaCy and Thinc up to date with the latest Cython.
The point of switching to Blis is to have control of how our matrix multiplications are executed across platforms. When we were using numpy for this, a different library would be used on pip and conda, OSX would use Accelerate, etc. This would open up different bugs and performance problems, especially when multi-threading was introduced.
With the change to Blis, we now strictly single-thread the matrix multiplications. This will make it much easier to use multiprocessing to parallelise the runtime, since we won't have nested parallelism problems to deal with.
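
As a rough illustration of why single-threaded matrix multiplication helps here, consider the sketch below. It is not spaCy or Thinc code: `matmul`, `process_batch` and `process_in_parallel` are hypothetical names, and a pure-Python matrix product stands in for the Blis routine. The point is only that when each worker's multiplication stays on one thread, the number of busy cores is bounded by the number of worker processes.

```python
import multiprocessing


def matmul(a, b):
    """Multiply two matrices (lists of rows) on the calling thread only."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def process_batch(batch):
    """Stand-in for per-batch work: one single-threaded matrix product."""
    inputs, weights = batch
    return matmul(inputs, weights)


def process_in_parallel(batches, n_workers=2):
    """Spread batches over worker processes.

    Each worker runs its multiplications on a single thread, so the
    parallelism is not nested: at most n_workers cores are busy.
    """
    with multiprocessing.Pool(n_workers) as pool:
        return pool.map(process_batch, batches)
```

When trying this out, call `process_in_parallel` from under an `if __name__ == "__main__":` guard, since some platforms start worker processes by re-importing the main module.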
* Use blis
* Use -2 arg to Cython
* Update dependencies
* Fix requirements
* Update setup dependencies
* Fix requirement typo
* Fix msgpack errors
* Remove Python27 test from Appveyor, until Blis works there
* Auto-format setup.py
* Fix murmurhash version
* Allow matching non-orth attributes in PhraseMatcher (see #1971)
Usage: PhraseMatcher(nlp.vocab, attr='POS')
* Allow attr argument to be int
* Fix formatting
* Fix typo
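
To show what matching on a non-orth attribute means, here is a toy sketch in plain Python. It is not the PhraseMatcher implementation: `find_phrase_matches` is a hypothetical helper, and tokens are modelled as simple dictionaries. The pattern phrase shares no words with the document, but it matches once the comparison is done on POS tags instead of surface forms.

```python
def find_phrase_matches(doc, patterns, attr="ORTH"):
    """Return (start, end) spans where a pattern matches on the given attribute.

    doc and each pattern are lists of dicts mapping attribute names to values.
    """
    doc_values = [token[attr] for token in doc]
    matches = []
    for pattern in patterns:
        wanted = [token[attr] for token in pattern]
        n = len(wanted)
        if n == 0:
            continue
        for start in range(len(doc_values) - n + 1):
            if doc_values[start:start + n] == wanted:
                matches.append((start, start + n))
    return matches


doc = [
    {"ORTH": "I", "POS": "PRON"},
    {"ORTH": "like", "POS": "VERB"},
    {"ORTH": "Berlin", "POS": "PROPN"},
]
pattern = [{"ORTH": "visit", "POS": "VERB"}, {"ORTH": "Paris", "POS": "PROPN"}]

# Comparing surface forms: "visit Paris" does not occur in the document.
orth_matches = find_phrase_matches(doc, [pattern])
# Comparing POS tags: VERB followed by PROPN occurs at tokens 1..2.
pos_matches = find_phrase_matches(doc, [pattern], attr="POS")
```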
The helper method state.B(1) gets the index of the first token of the
buffer, or -1 if no such token exists. Normally this is safe because we
pass this to functions like state.safe_get(), which returns an empty
token. Here we used it directly as an array index, which is not okay!
This error may have been the cause of out-of-bounds access errors during
training. Similar errors may still be around, so must be hunted down.
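
The pattern behind this bug can be sketched in plain Python. This is a toy model, not the actual parser state: `buffer_index` and `safe_get` here are simplified stand-ins written for this example, and tokens are plain dictionaries. The sentinel -1 is harmless when it goes through a guarded accessor, but not when it is used as a raw index.

```python
EMPTY_TOKEN = {"text": "", "is_empty": True}


def buffer_index(buffer, i):
    """Return the i-th token index waiting in the buffer, or -1 if there is none."""
    return buffer[i] if 0 <= i < len(buffer) else -1


def safe_get(tokens, index):
    """Return tokens[index], or an empty token when index is out of range."""
    if index < 0 or index >= len(tokens):
        return EMPTY_TOKEN
    return tokens[index]


tokens = [
    {"text": "Hello", "is_empty": False},
    {"text": "world", "is_empty": False},
]
buffer = []  # nothing left to process

index = buffer_index(buffer, 0)    # sentinel: no such token
guarded = safe_get(tokens, index)  # the guard turns the sentinel into an empty token
unguarded = tokens[index]          # raw indexing silently picks the last token
```

In Python the unguarded access wraps around to the last element. With a raw C array, indexing with -1 reads memory before the start of the array, which is the kind of out-of-bounds access described above.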
Hunting this one down took a long time... I printed out values across
training runs and diffed, looking for points of divergence between
runs, when no randomness should be allowed.
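
The diffing approach can be sketched roughly as follows, assuming each run logs the same sequence of values with all randomness fixed. `first_divergence` is a hypothetical helper written for this example, not a tool from the codebase.

```python
from itertools import zip_longest


def first_divergence(run_a, run_b):
    """Return the index of the first position where two logs differ, or None."""
    missing = object()  # marks the end of the shorter log
    for i, (a, b) in enumerate(zip_longest(run_a, run_b, fillvalue=missing)):
        if a != b:
            return i
    return None
```

If two supposedly deterministic runs return anything other than `None`, the step at that index is the place to start looking.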