Follow-ups to the parser efficiency fix.
* Avoid introducing a new counter for the number of pushes
* Base cut on number of transitions, keeping it more even
* Reintroduce the randomization we had in v2.
The parser training makes use of a trick for long documents, where we
use the oracle to cut up the document into sections, so that we can have
batch items in the middle of a document. For instance, if we have one
document of 600 words, we might make 6 states, starting at words 0, 100,
200, 300, 400 and 500.
The problem is that in v3, I screwed this up and didn't stop parsing at
the end of each section! So instead of a batch of
[100, 100, 100, 100, 100, 100], we'd have a batch of
[600, 500, 400, 300, 200, 100]. Oops.
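As a rough illustration of the difference, here is a sketch only, not the
actual parser code. `segment_lengths` and its parameters are hypothetical
names, and it counts words rather than transitions to keep things simple:

```python
def segment_lengths(n_words, max_length, stop_at_cut=True):
    """Return how many words each state parses for one long document.

    Hypothetical helper for illustration; the real cut is made by the
    oracle during training.
    """
    # One state per section, starting at words 0, max_length, 2 * max_length, ...
    starts = range(0, n_words, max_length)
    if stop_at_cut:
        # Intended behaviour: each state stops where the next one begins.
        return [min(max_length, n_words - start) for start in starts]
    # Buggy behaviour: each state keeps parsing to the end of the document.
    return [n_words - start for start in starts]


intended = segment_lengths(600, 100)
buggy = segment_lengths(600, 100, stop_at_cut=False)
```

With the 600-word document above, `intended` is six sections of 100 words
each, while `buggy` reproduces the [600, 500, 400, 300, 200, 100] batch.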
The implementation here could probably be improved: it's annoying to
have this extra variable in the state. But this'll do.
This makes the v3 parser training 5-10 times faster, depending on document
lengths. This problem wasn't in v2.
A long time ago we went to some trouble to try to clean up "unused"
strings, to avoid the `StringStore` growing in long-running processes.
This never really worked reliably, and I think it was the wrong
approach. It's much better to let the user reload the `nlp` object as
necessary, now that the string encoding is stable (in v1, the string IDs
were sequential integers, which made reloading the `nlp` object really
annoying).
The extra book-keeping does make some performance difference, and the
feature is unused, so it's past time we killed it.
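To illustrate why stable string IDs make reloading painless, here is a
minimal sketch. It is not the `StringStore` implementation:
`SequentialStore` and `hash_id` are hypothetical names, and
`hashlib.blake2b` is only a stand-in for whichever hash is used in practice.

```python
import hashlib


class SequentialStore:
    """v1-style IDs (illustrative): each new string gets the next integer."""

    def __init__(self):
        self._ids = {}

    def add(self, string):
        # The ID depends on how many strings were added before this one.
        return self._ids.setdefault(string, len(self._ids))


def hash_id(string):
    """Stable 64-bit ID that depends only on the string's content."""
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


# Two processes that see the same strings in a different order.
first, second = SequentialStore(), SequentialStore()
for word in ["apple", "pear"]:
    first.add(word)
for word in ["pear", "apple"]:
    second.add(word)
```

With sequential IDs, `first` and `second` disagree about the ID of "apple",
so a fresh store cannot be swapped in without remapping. A content hash
gives the same ID in any process, so the store can simply be thrown away
and rebuilt.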
* Prevent Tagger model init with 0 labels
Raise an error before trying to initialize a tagger model with 0 labels.
* Add dummy tagger label for test
* Remove tagless tagger model initialization
* Fix error number after merge
* Add dummy tagger label to test
* Fix formatting
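
The label check amounts to a guard along these lines. This is a sketch
under assumptions: `check_labels`, the exception type and the message are
hypothetical, not the code or error number added in this change.

```python
def check_labels(labels):
    """Fail early instead of building a model with an empty output layer.

    Hypothetical helper for illustration only.
    """
    if len(labels) == 0:
        raise ValueError("Cannot initialize a tagger model with 0 labels.")
    return len(labels)


n_labels = check_labels(["DUMMY"])
```

This is also why the tests need a dummy label: a test that builds a tagger
without adding any label would hit the error.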
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>