Commit Graph

18 Commits

Author SHA1 Message Date
Sofie Van Landeghem
7f5715a081
Various fixes to NEL functionality, Example class etc (#5460)
* setting KB in the EL constructor, similar to how the model is passed on

* removing wikipedia example files - moved to projects

* throw an error when nlp.update is called with 2 positional arguments

* rewriting the config logic in create pipe to accomodate for other objects (e.g. KB) in the config

* update config files with new parameters

* avoid training pipeline components that don't have a model (like sentencizer)

* various small fixes + UX improvements

* small fixes

* set thinc to 8.0.0a9 everywhere

* remove outdated comment
2020-05-20 11:41:12 +02:00
Sofie Van Landeghem
6ac9fc0619
Unit test for NEL functionality (#5114)
* empty begin_training for sentencizer

* overfitting unit test for entity linker

* fixed NEL IO by storing the entity_vector_length in the cfg
2020-03-06 14:42:23 +01:00
Ines Montani
db55577c45
Drop Python 2.7 and 3.5 (#4828)
* Remove unicode declarations

* Remove Python 3.5 and 2.7 from CI

* Don't require pathlib

* Replace compat helpers

* Remove OrderedDict

* Use f-strings

* Set Cython compiler language level

* Fix typo

* Re-add OrderedDict for Table

* Update setup.cfg

* Revert CONTRIBUTING.md

* Revert lookups.md

* Revert top-level.md

* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
Ines Montani
9a849fe54e Explicitly catch warning in test 2019-10-24 16:16:27 +02:00
Sofie Van Landeghem
2d249a9502 KB extensions and better parsing of WikiData (#4375)
* fix overflow error on windows

* more documentation & logging fixes

* md fix

* 3 different limit parameters to play with execution time

* bug fixes directory locations

* small fixes

* exclude dev test articles from prior probabilities stats

* small fixes

* filtering wikidata entities, removing numeric and meta items

* adding aliases from wikidata also to the KB

* fix adding WD aliases

* adding also new aliases to previously added entities

* fixing comma's

* small doc fixes

* adding subclassof filtering

* append alias functionality in KB

* prevent appending the same entity-alias pair

* fix for appending WD aliases

* remove date filter

* remove unnecessary import

* small corrections and reformatting

* remove WD aliases for now (too slow)

* removing numeric entities from training and evaluation

* small fixes

* shortcut during prediction if there is only one candidate

* add counts and fscore logging, remove FP NER from evaluation

* fix entity_linker.predict to take docs instead of single sentences

* remove enumeration sentences from the WP dataset

* entity_linker.update to process full doc instead of single sentence

* spelling corrections and dump locations in readme

* NLP IO fix

* reading KB is unnecessary at the end of the pipeline

* small logging fix

* remove empty files
2019-10-14 12:28:53 +02:00
Ines Montani
3d8fd4b461 Revert #4334 2019-09-29 17:32:12 +02:00
Ines Montani
c9cd516d96 Move tests out of package (#4334)
* Move tests out of package

* Fix typo
2019-09-28 18:05:00 +02:00
Sofie Van Landeghem
03ac29f437 Ensure that doc.ents preserves kb_id annotations (#4294)
* bugfix: ensure doc.ents preserves kb_id annotations

* fix backward compatibility

* additional test
2019-09-16 15:18:37 +02:00
Sofie Van Landeghem
0ba1b5eebc CLI scripts for entity linking (wikipedia & generic) (#4091)
* document token ent_kb_id

* document span kb_id

* update pipeline documentation

* prior and context weights as bool's instead

* entitylinker api documentation

* drop for both models

* finish entitylinker documentation

* small fixes

* documentation for KB

* candidate documentation

* links to api pages in code

* small fix

* frequency examples as counts for consistency

* consistent documentation about tensors returned by predict

* add entity linking to usage 101

* add entity linking infobox and KB section to 101

* entity-linking in linguistic features

* small typo corrections

* training example and docs for entity_linker

* predefined nlp and kb

* revert back to similarity encodings for simplicity (for now)

* set prior probabilities to 0 when excluded

* code clean up

* bugfix: deleting kb ID from tokens when entities were removed

* refactor train el example to use either model or vocab

* pretrain_kb example for example kb generation

* add to training docs for KB + EL example scripts

* small fixes

* error numbering

* ensure the language of vocab and nlp stay consistent across serialization

* equality with =

* avoid conflict in errors file

* add error 151

* final adjustements to the train scripts - consistency

* update of goldparse documentation

* small corrections

* push commit

* turn kb_creator into CLI script (wip)

* proper parameters for training entity vectors

* wikidata pipeline split up into two executable scripts

* remove context_width

* move wikidata scripts in bin directory, remove old dummy script

* refine KB script with logs and preprocessing options

* small edits

* small improvements to logging of EL CLI script
2019-08-13 15:38:59 +02:00
svlandeg
76184374e2 test corner cases 2019-07-22 13:39:32 +02:00
svlandeg
dae8a21282 rename entity frequency 2019-07-19 17:40:28 +02:00
svlandeg
d833d4c358 fixes in kb and gold 2019-07-17 17:18:26 +02:00
svlandeg
4086c6ff60 get vector functionality + unit test 2019-07-17 12:17:02 +02:00
svlandeg
1c80b85241 fix tests 2019-06-28 08:59:23 +02:00
svlandeg
8608685543 ensure Span.as_doc keeps the entity links + unit test 2019-06-25 15:28:51 +02:00
svlandeg
ddc73b11a9 fix unicode literals 2019-06-24 12:58:18 +02:00
svlandeg
5c723c32c3 entity vectors in the KB + serialization of them 2019-06-05 18:29:18 +02:00
svlandeg
9a7d534b1b enable nogil for cython functions in kb.pxd 2019-04-10 17:25:10 +02:00