Raphael Mitsch
9340eb8ad2
Introduce hierarchy for EL Candidate
objects ( #12341 )
...
* Convert Candidate from Cython to Python class.
* Format.
* Fix .entity_ typo in _add_activations() usage.
* Update spacy/kb/candidate.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update doc string of BaseCandidate.__init__().
* Update spacy/kb/candidate.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate.
* Adjust Candidate to support and mandate numerical entity IDs.
* Format.
* Fix docstring and docs.
* Update website/docs/api/kb.mdx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename alias -> mention.
* Refactor Candidate attribute names. Update docs and tests accordingly.
* Refacor Candidate attributes and their usage.
* Format.
* Fix mypy error.
* Update error code in line with v4 convention.
* Update spacy/kb/candidate.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Updated error code.
* Simplify interface for int/str representations.
* Update website/docs/api/kb.mdx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename 'alias' to 'mention'.
* Port Candidate and InMemoryCandidate to Cython.
* Remove redundant entry in setup.py.
* Add abstract class check.
* Drop storing mention.
* Update spacy/kb/candidate.pxd
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix entity_id refactoring problems in docstrings.
* Drop unused InMemoryCandidate._entity_hash.
* Update docstrings.
* Move attributes out of Candidate.
* Partially fix alias/mention terminology usage. Convert Candidate to interface.
* Remove prior_prob from supported properties in Candidate. Introduce KnowledgeBase.supports_prior_probs().
* Update docstrings related to prior_prob.
* Update alias/mention usage in doc(strings).
* Update spacy/ml/models/entity_linker.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/ml/models/entity_linker.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Mention -> alias renaming. Drop Candidate.mentions(). Drop InMemoryLookupKB.get_alias_candidates() from docs.
* Update docstrings.
* Fix InMemoryCandidate attribute names.
* Update spacy/kb/kb.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/ml/models/entity_linker.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update W401 test.
* Update spacy/errors.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/kb/kb.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Use Candidate output type for toy generators in the test suite to mimick best practices
* fix docs
* fix import
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-20 00:34:35 +01:00
Raphael Mitsch
6aa6b86d49
Make generation of empty KnowledgeBase
instances configurable in EntityLinker
( #12320 )
...
* Make empty_kb() configurable.
* Format.
* Update docs.
* Be more specific in KB serialization test.
* Update KB serialization tests. Update docs.
* Remove doc update for batched candidate generation.
* Fix serialization of subclassed KB in tests.
* Format.
* Update docstring.
* Update docstring.
* Switch from pickle to json for custom field serialization.
2023-03-01 16:02:55 +01:00
Raphael Mitsch
1f23c615d7
Refactor KB for easier customization ( #11268 )
...
* Add implementation of batching + backwards compatibility fixes. Tests indicate issue with batch disambiguation for custom singular entity lookups.
* Fix tests. Add distinction w.r.t. batch size.
* Remove redundant and add new comments.
* Adjust comments. Fix variable naming in EL prediction.
* Fix mypy errors.
* Remove KB entity type config option. Change return types of candidate retrieval functions to Iterable from Iterator. Fix various other issues.
* Update spacy/pipeline/entity_linker.py
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Update spacy/pipeline/entity_linker.py
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Update spacy/kb_base.pyx
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Update spacy/kb_base.pyx
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Update spacy/pipeline/entity_linker.py
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Add error messages to NotImplementedErrors. Remove redundant comment.
* Fix imports.
* Remove redundant comments.
* Rename KnowledgeBase to InMemoryLookupKB and BaseKnowledgeBase to KnowledgeBase.
* Fix tests.
* Update spacy/errors.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Move KB into subdirectory.
* Adjust imports after KB move to dedicated subdirectory.
* Fix config imports.
* Move Candidate + retrieval functions to separate module. Fix other, small issues.
* Fix docstrings and error message w.r.t. class names. Fix typing for candidate retrieval functions.
* Update spacy/kb/kb_in_memory.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/ml/models/entity_linker.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix typing.
* Change typing of mentions to be Span instead of Union[Span, str].
* Update docs.
* Update EntityLinker and _architecture docs.
* Update website/docs/api/entitylinker.md
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Adjust message for E1046.
* Re-add section for Candidate in kb.md, add reference to dedicated page.
* Update docs and docstrings.
* Re-add section + reference for KnowledgeBase.get_alias_candidates() in docs.
* Update spacy/kb/candidate.pyx
* Update spacy/kb/kb_in_memory.pyx
* Update spacy/pipeline/legacy/entity_linker.py
* Remove canididate.md. Remove mistakenly added config snippet in entity_linker.py.
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-08 10:38:07 +02:00
Adriane Boyd
4b0ed73ed4
Update flake8 version in reqs and CI
...
* Update some unneeded forward refs related to flake8 checks
2021-06-28 11:29:36 +02:00
svlandeg
d900c55061
consistently use registry as callable
2021-03-02 17:56:28 +01:00
svlandeg
eaf5c265cb
set_kb method for entity_linker
2020-10-08 10:34:01 +02:00
svlandeg
efedccea8d
fix tests
2020-10-07 15:29:52 +02:00
Ines Montani
5afe6447cd
registry.assets -> registry.misc
2020-09-03 17:31:14 +02:00
Sofie Van Landeghem
358cbb21e3
Define candidate generator in EL config ( #5876 )
...
* candidate generator as separate part of EL config
* update comment
* ent instead of str as input for candidate generation
* Span instead of str: correct type indication
* fix types
* unit test to create new candidate generator
* fix replace_pipe argument passing
* move error message, general cleanup
* add vocab back to KB constructor
* provide KB as callable from Vocab arg
* rename to kb_loader, fix KB serialization as part of the EL pipe
* fix typo
* reformatting
* cleanup
* fix comment
* fix wrongly duplicated code from merge conflict
* rename dump to to_disk
* from_disk instead of load_bulk
* update test after recent removal of set_morphology in tagger
* remove old doc
2020-08-18 16:10:36 +02:00
Sofie Van Landeghem
82347110f5
Default empty KB in EL component ( #5872 )
...
* EL field documentation
* documentation consistent with docs
* default empty KB, initialize vocab separately
* formatting
* add test for changing the default entity vector length
* update comment
2020-08-04 14:34:09 +02:00
Ines Montani
db55577c45
Drop Python 2.7 and 3.5 ( #4828 )
...
* Remove unicode declarations
* Remove Python 3.5 and 2.7 from CI
* Don't require pathlib
* Replace compat helpers
* Remove OrderedDict
* Use f-strings
* Set Cython compiler language level
* Fix typo
* Re-add OrderedDict for Table
* Update setup.cfg
* Revert CONTRIBUTING.md
* Revert lookups.md
* Revert top-level.md
* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
Ines Montani
0226b3bf0e
Fix test imports
2019-09-29 17:34:56 +02:00
Ines Montani
3d8fd4b461
Revert #4334
2019-09-29 17:32:12 +02:00
Ines Montani
c9cd516d96
Move tests out of package ( #4334 )
...
* Move tests out of package
* Fix typo
2019-09-28 18:05:00 +02:00
Ines Montani
f580302673
Tidy up and auto-format
2019-08-20 17:36:34 +02:00
Sofie Van Landeghem
0ba1b5eebc
CLI scripts for entity linking (wikipedia & generic) ( #4091 )
...
* document token ent_kb_id
* document span kb_id
* update pipeline documentation
* prior and context weights as bool's instead
* entitylinker api documentation
* drop for both models
* finish entitylinker documentation
* small fixes
* documentation for KB
* candidate documentation
* links to api pages in code
* small fix
* frequency examples as counts for consistency
* consistent documentation about tensors returned by predict
* add entity linking to usage 101
* add entity linking infobox and KB section to 101
* entity-linking in linguistic features
* small typo corrections
* training example and docs for entity_linker
* predefined nlp and kb
* revert back to similarity encodings for simplicity (for now)
* set prior probabilities to 0 when excluded
* code clean up
* bugfix: deleting kb ID from tokens when entities were removed
* refactor train el example to use either model or vocab
* pretrain_kb example for example kb generation
* add to training docs for KB + EL example scripts
* small fixes
* error numbering
* ensure the language of vocab and nlp stay consistent across serialization
* equality with =
* avoid conflict in errors file
* add error 151
* final adjustements to the train scripts - consistency
* update of goldparse documentation
* small corrections
* push commit
* turn kb_creator into CLI script (wip)
* proper parameters for training entity vectors
* wikidata pipeline split up into two executable scripts
* remove context_width
* move wikidata scripts in bin directory, remove old dummy script
* refine KB script with logs and preprocessing options
* small edits
* small improvements to logging of EL CLI script
2019-08-13 15:38:59 +02:00
svlandeg
dae8a21282
rename entity frequency
2019-07-19 17:40:28 +02:00
svlandeg
ddc73b11a9
fix unicode literals
2019-06-24 12:58:18 +02:00
svlandeg
b76a43bee4
unicode strings
2019-06-19 13:26:33 +02:00
svlandeg
0b0959b363
UTF8 encoding
2019-06-19 13:11:39 +02:00
svlandeg
5c723c32c3
entity vectors in the KB + serialization of them
2019-06-05 18:29:18 +02:00
svlandeg
19e8f339cb
deduce entity freq from WP corpus and serialize vocab in WP test
2019-04-29 17:37:29 +02:00
svlandeg
387263d618
simplify chains
2019-04-29 13:58:07 +02:00
svlandeg
54d0cea062
unit test for KB serialization
2019-04-24 23:52:34 +02:00