* Updates/bugfixes for NER/IOB converters
* Converter formats `ner` and `iob` use autodetect to choose a converter if
possible
* `iob2json` is reverted to handle sentence-per-line data like
`word1|pos1|ent1 word2|pos2|ent2`
* Fix bug in `merge_sentences()` so the second sentence in each batch isn't
skipped
* `conll_ner2json` is made more general so it can handle more formats with
whitespace-separated columns
* Supports all formats where the first column is the token and the final
column is the IOB tag; if present, the second column is the POS tag
* As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O`
separates documents
* Add option for segmenting sentences (new flag `-s`)
* Parser-based sentence segmentation with a provided model, otherwise with
sentencizer (new option `-b` to specify model)
* Can group sentences into documents with `n_sents` as long as sentence
segmentation is available
* Only applies automatic segmentation when there are no existing delimiters
in the data
* Provide info about settings applied during conversion with warnings and
suggestions if settings conflict or might not be not optimal.
* Add tests for common formats
* Add '(default)' back to docs for -c auto
* Add document count back to output
* Revert changes to converter output message
* Use explicit tabs in convert CLI test data
* Adjust/add messages for n_sents=1 default
* Add sample NER data to training examples
* Update README
* Add links in docs to example NER data
* Define msg within converters
Filtering by orth and tag, create variants of training docs with
alternate orth variants, e.g., unicode quotes, dashes, and ellipses.
The variants can be single tokens (dashes) or paired tokens (quotes)
with left and right versions.
Currently restricted to only add variants to training documents without
raw text provided, where only gold.words needs to be modified.