* Add helper function for reading in JSONL
* Add rule-based NER component
* Fix whitespace
* Add component to factories
* Add tests
* Add option to disable indent on json_dumps compat
Otherwise, reading JSONL back in line by line won't work
* Fix error code
* issue_2385 add tests for iob_to_biluo converter function
* issue_2385 fix and modify iob_to_biluo function to accept either iob or biluo tags in cli.converter
* issue_2385 add test to fix b char bug
* add contributor agreement
* fill contributor agreement
## Description
Fix for issue #2361 :
replace &, <, >, " with &amp; , &lt; , &gt; , &quot; in before rendering svg
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
(As discussed in the comments to #2361)
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Add logic to filter out warning IDs via environment variable
Usage: SPACY_WARNING_EXCLUDE=W001,W007
* Add warnings for empty vectors
* Add warning if no word vectors are used in .similarity methods
For example, if only tensors are available in small models – should hopefully clear up some confusion around this
* Capture warnings in tests
* Rename SPACY_WARNING_EXCLUDE to SPACY_WARNING_IGNORE
* Work on refactoring greedy parser
* Compile updated parser
* Fix refactored parser
* Update test
* Fix refactored parser
* Fix refactored parser
* Readd beam search after refactor
* Fix beam search after refactor
* Fix parser
* Fix beam parsing
* Support oracle segmentation in ud-train CLI command
* Avoid relying on final gold check in beam search
* Add a keyword argument sink to GoldParse
* Bug fixes to beam search after refactor
* Avoid importing fused token symbol in ud-run-test, untl that's added
* Avoid importing fused token symbol in ud-run-test, untl that's added
* Don't modify Token in global scope
* Fix error in beam gradient calculation
* Default to beam_update_prob 1
* Set a more aggressive threshold on the max violn update
* Disable some tests to figure out why CI fails
* Disable some tests to figure out why CI fails
* Add some diagnostics to travis.yml to try to figure out why build fails
* Tell Thinc to link against system blas on Travis
* Point thinc to libblas on Travis
* Try running sudo=true for travis
* Unhack travis.sh
* Restore beam_density argument for parser beam
* Require thinc 6.11.1.dev16
* Revert hacks to tests
* Revert hacks to travis.yml
* Update thinc requirement
* Fix parser model loading
* Fix size limits in training data
* Add missing name attribute for parser
* Fix appveyor for Windows
* Add Romanian lemmatizer lookup table.
Adapted from http://www.lexiconista.com/datasets/lemmatization/
by replacing cedillas with commas (ș and ț).
The original dataset is licensed under the Open Database License.
* Fix one blatant issue in the Romanian lemmatizer
* Romanian examples file
* Add ro_tokenizer in conftest
* Add Romanian lemmatizer test
* Port Japanese mecab tokenizer from v1
This brings the Mecab-based Japanese tokenization introduced in #1246 to
spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag
information from Mecab is stored in a token extension. A tag map is also
included.
As a reminder, Mecab is required because Universal Dependencies are
based on Unidic tags, and Janome doesn't support Unidic.
Things to check:
1. Is this the right way to use a token extension?
2. What's the right way to implement a JapaneseTagger? The approach in
#1246 relied on `tag_from_strings` which is just gone now. I guess the
best thing is to just try training spaCy's default Tagger?
-POLM
* Add tagging/make_doc and tests