* Check whether two entities overlap
- biluo_gold_biluo_overlap now throws an exception when the entities passed in overlap
- added unit test
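
A minimal sketch of the kind of overlap check described above. It assumes entities are given as `(start, end, label)` character offsets with exclusive ends; `spans_overlap` and `check_no_overlaps` are hypothetical names for illustration, not functions from the codebase.

```python
def spans_overlap(a, b):
    """Return True if two (start, end) spans share at least one position.

    End offsets are assumed to be exclusive.
    """
    return a[0] < b[1] and b[0] < a[1]


def check_no_overlaps(entities):
    """Raise ValueError if any two (start, end, label) entities overlap."""
    # After sorting by start offset, any overlap between non-empty spans
    # shows up between two neighbouring entries.
    ordered = sorted(entities, key=lambda e: (e[0], e[1]))
    for prev, cur in zip(ordered, ordered[1:]):
        if spans_overlap(prev[:2], cur[:2]):
            raise ValueError(f"Overlapping entities: {prev} and {cur}")
```

Spans that merely touch, such as `(0, 5)` and `(5, 8)`, are not treated as overlapping under the exclusive-end assumption.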
* SCA agreement
* pytest file for issue4104 established
* edited default lookup english lemmatizer for spun; fixes issue 4102
* eliminated parameterization and sorted dictionary dependency in issue 4104 test
* added contributor agreement
* document token ent_kb_id
* document span kb_id
* update pipeline documentation
* prior and context weights as bools instead
* entitylinker api documentation
* drop for both models
* finish entitylinker documentation
* small fixes
* documentation for KB
* candidate documentation
* links to api pages in code
* small fix
* frequency examples as counts for consistency
* consistent documentation about tensors returned by predict
* add entity linking to usage 101
* add entity linking infobox and KB section to 101
* entity-linking in linguistic features
* small typo corrections
* training example and docs for entity_linker
* predefined nlp and kb
* revert back to similarity encodings for simplicity (for now)
* set prior probabilities to 0 when excluded
* code clean up
* bugfix: deleting kb ID from tokens when entities were removed
* refactor train el example to use either model or vocab
* pretrain_kb example for example kb generation
* add to training docs for KB + EL example scripts
* small fixes
* error numbering
* ensure the language of vocab and nlp stay consistent across serialization
* equality with =
* avoid conflict in errors file
* add error 151
* final adjustments to the train scripts - consistency
* update of goldparse documentation
* small corrections
* push commit
* turn kb_creator into CLI script (wip)
* proper parameters for training entity vectors
* wikidata pipeline split up into two executable scripts
* remove context_width
* move wikidata scripts in bin directory, remove old dummy script
* refine KB script with logs and preprocessing options
* small edits
* small improvements to logging of EL CLI script
* Improve NER per type scoring
* include all gold labels in per type scoring, not only when recall > 0
* improve efficiency of per type scoring
* Create Scorer tests, initially with NER tests
* move regression test #3968 (per type NER scoring) to Scorer tests
* add new test for per type NER scoring with imperfect P/R/F and per
type P/R/F including a case where R == 0.0
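
A rough sketch of per-type precision/recall/F scoring that keeps every gold label in the result, including types with R == 0.0. The function name `per_type_prf` and the `(start, end, label)` tuple format are assumptions for illustration; this is not the Scorer implementation.

```python
def per_type_prf(gold, pred):
    """Compute (precision, recall, f-score) per entity label.

    gold and pred are sets of (start, end, label) tuples. Every label that
    appears in either set gets an entry, even if nothing was predicted for it.
    """
    labels = {e[2] for e in gold} | {e[2] for e in pred}
    scores = {}
    for label in sorted(labels):
        gold_l = {e for e in gold if e[2] == label}
        pred_l = {e for e in pred if e[2] == label}
        tp = len(gold_l & pred_l)  # exact span and label matches
        precision = tp / len(pred_l) if pred_l else 0.0
        recall = tp / len(gold_l) if gold_l else 0.0
        total = precision + recall
        f_score = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f_score)
    return scores
```

With one gold `ORG` entity and no `ORG` predictions, the `ORG` entry is still present with all three values at 0.0 rather than being omitted.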
* failing unit test for issue 3962
* attempt to fix Issue #3962
* create artificial unit test example
* using length instead of self.length
* sp
* reformat with black
* find better ancestor within span and use generic 'dep'
* attach to span.root if there is no appropriate ancestor
* comment span text
* clean up ancestor code
* reconstruct dep tree to keep same number of sentences
Expected an `entity_ruler.jsonl` file in the top-level model directory, that is, the path passed to `from_disk` by default (model path plus component name), but with the suffix ".jsonl".
* Preserve flags in EntityRuler
The EntityRuler (explosion/spaCy#3526) does not preserve
overwrite flags (or `ent_id_sep`) when serialized. This
commit adds support for serialization/deserialization preserving
overwrite and ent_id_sep flags.
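
A simplified sketch of the idea, using a hypothetical `TinyRuler` class and JSON rather than the real EntityRuler and its serialization format. The default values shown are assumptions; the point is only that the flags are written alongside the patterns and restored on load.

```python
import json


class TinyRuler:
    """Hypothetical stand-in for illustration, not spaCy's EntityRuler."""

    def __init__(self, overwrite=False, ent_id_sep="||"):
        self.overwrite = overwrite
        self.ent_id_sep = ent_id_sep
        self.patterns = []

    def to_bytes(self):
        # Store the flags next to the patterns so they survive a round trip.
        data = {
            "overwrite": self.overwrite,
            "ent_id_sep": self.ent_id_sep,
            "patterns": self.patterns,
        }
        return json.dumps(data).encode("utf8")

    def from_bytes(self, raw):
        data = json.loads(raw.decode("utf8"))
        self.patterns = data.get("patterns", [])
        # Fall back to defaults when loading data written without flags.
        self.overwrite = data.get("overwrite", False)
        self.ent_id_sep = data.get("ent_id_sep", "||")
        return self
```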
* add signed contributor agreement
* flake8 cleanup
mostly blank line issues.
* mark test from the issue as needing a model
The test from the issue needs some language model for serialization
but the test wasn't originally marked correctly.
* Adds `phrase_matcher_attr` to allow args to PhraseMatcher
This is an added arg to pass to the `PhraseMatcher`. For example,
this allows creation of a case-insensitive phrase matcher when the
`EntityRuler` is created. References explosion/spaCy#3822
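
To illustrate what matching on a different token attribute means, here is a small pure-Python sketch. `find_phrases` is a hypothetical helper, not the PhraseMatcher implementation; the `"ORTH"` and `"LOWER"` values stand in for exact-text and lowercased-text comparison.

```python
def find_phrases(tokens, phrases, attr="ORTH"):
    """Return (start, end) token spans where any phrase occurs.

    attr="ORTH" compares exact token text; attr="LOWER" compares
    lowercased text, which makes matching case-insensitive.
    """
    normalize = (lambda t: t.lower()) if attr == "LOWER" else (lambda t: t)
    keys = [normalize(t) for t in tokens]
    matches = []
    for phrase in phrases:
        target = [normalize(t) for t in phrase]
        n = len(target)
        if n == 0:
            continue  # skip empty phrases
        for i in range(len(keys) - n + 1):
            if keys[i:i + n] == target:
                matches.append((i, i + n))
    return matches
```

With tokens `["I", "like", "new", "york"]` and the phrase `["New", "York"]`, the default attribute finds nothing, while `attr="LOWER"` finds the span covering the last two tokens.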
* remove unneeded model loading
The model didn't need to be loaded, and I replaced it with
a change that doesn't require it (using existing fixtures).
* updated docstring for new argument
* updated docs to reflect new argument to the EntityRuler constructor
* change tempdir handling to be compatible with python 2.7
* return conflicted code to entityruler
Some code got cut out because of merge conflicts; this
restores that code for the phrase_matcher_attr.
* fixed typo in the code added back after conflicts
* flake8 compliance
When I deconflicted the branch there were some flake8 issues
introduced. This resolves the spacing problems.
* test changes: attempts to fix flaky test in python3.5
These tests seem to be a little flaky in 3.5, so I changed the check to avoid
the comparisons that seem to fail sometimes.