* Check whether two entities overlap
- biluo_gold_biluo_overlap now throw exception when entities passed in have overlaps
- added unit test
* SCA agreement
* pytest file for issue4104 established
* edited default lookup english lemmatizer for spun; fixes issue 4102
* eliminated parameterization and sorted dictionary dependnency in issue 4104 test
* added contributor agreement
* document token ent_kb_id
* document span kb_id
* update pipeline documentation
* prior and context weights as bool's instead
* entitylinker api documentation
* drop for both models
* finish entitylinker documentation
* small fixes
* documentation for KB
* candidate documentation
* links to api pages in code
* small fix
* frequency examples as counts for consistency
* consistent documentation about tensors returned by predict
* add entity linking to usage 101
* add entity linking infobox and KB section to 101
* entity-linking in linguistic features
* small typo corrections
* training example and docs for entity_linker
* predefined nlp and kb
* revert back to similarity encodings for simplicity (for now)
* set prior probabilities to 0 when excluded
* code clean up
* bugfix: deleting kb ID from tokens when entities were removed
* refactor train el example to use either model or vocab
* pretrain_kb example for example kb generation
* add to training docs for KB + EL example scripts
* small fixes
* error numbering
* ensure the language of vocab and nlp stay consistent across serialization
* equality with =
* avoid conflict in errors file
* add error 151
* final adjustements to the train scripts - consistency
* update of goldparse documentation
* small corrections
* push commit
* turn kb_creator into CLI script (wip)
* proper parameters for training entity vectors
* wikidata pipeline split up into two executable scripts
* remove context_width
* move wikidata scripts in bin directory, remove old dummy script
* refine KB script with logs and preprocessing options
* small edits
* small improvements to logging of EL CLI script
* Improve NER per type scoring
* include all gold labels in per type scoring, not only when recall > 0
* improve efficiency of per type scoring
* Create Scorer tests, initially with NER tests
* move regression test #3968 (per type NER scoring) to Scorer tests
* add new test for per type NER scoring with imperfect P/R/F and per
type P/R/F including a case where R == 0.0
* failing unit test for issue 3962
* attempt to fix Issue #3962
* create artificial unit test example
* using length instead of self.length
* sp
* reformat with black
* find better ancestor within span and use generic 'dep'
* attach to span.root if there is no appropriate ancestor
* comment span text
* clean up ancestor code
* reconstruct dep tree to keep same number of sentences
Expected an `entity_ruler.jsonl` file in the top-level model directory, so the path passed to from_disk by default (model path plus componentn name), but with the suffix ".jsonl".
* Perserve flags in EntityRuler
The EntityRuler (explosion/spaCy#3526) does not preserve
overwrite flags (or `ent_id_sep`) when serialized. This
commit adds support for serialization/deserialization preserving
overwrite and ent_id_sep flags.
* add signed contributor agreement
* flake8 cleanup
mostly blank line issues.
* mark test from the issue as needing a model
The test from the issue needs some language model for serialization
but the test wasn't originally marked correctly.
* Adds `phrase_matcher_attr` to allow args to PhraseMatcher
This is an added arg to pass to the `PhraseMatcher`. For example,
this allows creation of a case insensitive phrase matcher when the
`EntityRuler` is created. References explosion/spaCy#3822
* remove unneeded model loading
The model didn't need to be loaded, and I replaced it with
a change that doesn't require it (using existings fixtures)
* updated docstring for new argument
* updated docs to reflect new argument to the EntityRuler constructor
* change tempdir handling to be compatible with python 2.7
* return conflicted code to entityruler
Some stuff got cut out because of merge conflicts, this
returns that code for the phrase_matcher_attr.
* fixed typo in the code added back after conflicts
* flake8 compliance
When I deconflicted the branch there were some flake8 issues
introduced. This resolves the spacing problems.
* test changes: attempts to fix flaky test in python3.5
These tests seem to be alittle flaky in 3.5 so I changed the check to avoid
the comparisons that seem to be fail sometimes.
* Perserve flags in EntityRuler
The EntityRuler (explosion/spaCy#3526) does not preserve
overwrite flags (or `ent_id_sep`) when serialized. This
commit adds support for serialization/deserialization preserving
overwrite and ent_id_sep flags.
* add signed contributor agreement
* flake8 cleanup
mostly blank line issues.
* mark test from the issue as needing a model
The test from the issue needs some language model for serialization
but the test wasn't originally marked correctly.
* remove unneeded model loading
The model didn't need to be loaded, and I replaced it with
a change that doesn't require it (using existings fixtures)
* change tempdir handling to be compatible with python 2.7
* Adds code to handle item saved before this change.
This code chanes how the save files are handled and how the bytes
are stored as well. This code adds check to dispatch correctly
if it encounters bytes or files saved in the old format (and tests
for those cases).
* use util function for tempdir management
Updated after PR comments: this code now uses the make_tempdir function from util
instead of doing it by hand.