* Fix on_match callback and remove empty patterns (#6312)
For the `DependencyMatcher`:
* Fix on_match callback so that it is called once per matched pattern
* Fix results so that patterns with empty match lists are not returned
* Add --prefer-binary for python 3.5
* Add version pins for pyrsistent
* Use backwards-compatible super()
* Try to fix tests on Travis (2.7)
* Fix naming conflict and formatting
* Update pkuseg version in Chinese tokenizer warnings
* Some changes for Armenian (#5616)
* Fixing numericals
* We need a Armenian question sign to make the sentence a question
* Update lex_attrs.py (#5608)
* Fix compat
* Update Armenian from v2.3.x
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Karen Hambardzumyan <mahnerak@gmail.com>
Co-authored-by: Marat M. Yavrumyan <myavrum@ysu.am>
* Limiting noun_chunks for specific langauges
* Limiting noun_chunks for specific languages
Contributor Agreement
* Addressing review comments
* Removed unused fixtures and imports
* Add fa_tokenizer in test suite
* Use fa_tokenizer in test
* Undo extraneous reformatting
Co-authored-by: adrianeboyd <adrianeboyd@gmail.com>
Check that row is within bounds for the vector data array when adding a
vector.
Don't add vectors with rank OOV_RANK in `init-model` (change is due to
shift from OOV as 0 to OOV as OOV_RANK).
Remove `TAG` value from Danish and Swedish tokenizer exceptions because
it may not be included in a tag map (and these settings are problematic
as tokenizer exceptions anyway).
Instead of treating `'d` in contractions like `I'd` as `would` in all
cases in the tokenizer exceptions, leave the tagging and lemmatization
up to later components.
* Initialize lower flag explicitly
* Handle whitespace words from GoldParse correctly when creating raw
text with orth variants
* Return the text with original casing if anything goes wrong
* `debug-data`: determine coverage of provided vectors
* `evaluate`: support `blank:lg` model to make it possible to just evaluate
tokenization
* `init-model`: add option to truncate vectors to N most frequent vectors
from word2vec file
* `train`:
* if training on GPU, only run evaluation/timing on CPU in the first
iteration
* if training is aborted, exit with a non-0 exit status
* simplify creation of KB by skipping dim reduction
* small fixes to train EL example script
* add KB creation and NEL training example scripts to example section
* update descriptions of example scripts in the documentation
* moving wiki_entity_linking folder from bin to projects
* remove test for wiki NEL functionality that is being moved
Reconstruction of the original PR #4697 by @MiniLau.
Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema
because the Matcher is only going to be able to support `IS_SENT_START`.