title | teaser | tag | source | new |
---|---|---|---|---|
KnowledgeBase | A storage class for entities and aliases of a specific knowledge base (ontology) | class | spacy/kb/kb.pyx | 2.2 |
The KnowledgeBase object is an abstract class providing a method to generate
Candidate objects, which are plausible external identifiers given a certain
textual mention. Each such Candidate holds information from the relevant KB
entities, such as its frequency in text and possible aliases. Each entity in the
knowledge base also has a pretrained entity vector of a fixed size.
Beyond that, KnowledgeBase classes have to implement a number of utility
functions called by the EntityLinker component.
This class was not abstract up to spaCy version 3.5. The KnowledgeBase
implementation up to that point is available as InMemoryLookupKB from 3.5
onwards.
KnowledgeBase.__init__
KnowledgeBase is an abstract class and cannot be instantiated. Its child
classes should call __init__() to set up some necessary attributes.
Example
from spacy.kb import KnowledgeBase
from spacy.vocab import Vocab

class FullyImplementedKB(KnowledgeBase):
    def __init__(self, vocab: Vocab, entity_vector_length: int):
        super().__init__(vocab, entity_vector_length)
        ...

vocab = nlp.vocab
kb = FullyImplementedKB(vocab=vocab, entity_vector_length=64)
Name | Description |
---|---|
vocab | The shared vocabulary. |
entity_vector_length | Length of the fixed-size entity vectors. |
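The pattern described above (a base class that refuses direct instantiation while its subclasses still call its `__init__()`) can be sketched in plain Python. This is only an illustration: `ToyKnowledgeBase` and `ToyDictKB` are hypothetical stand-ins that do not import spaCy, and the real base class may enforce this differently.

```python
class ToyKnowledgeBase:
    """Hypothetical stand-in for an abstract KB base class (not spaCy's code)."""

    def __init__(self, vocab, entity_vector_length: int):
        # Refuse direct instantiation, but let subclasses run this setup.
        if type(self) is ToyKnowledgeBase:
            raise TypeError("ToyKnowledgeBase is abstract; instantiate a subclass.")
        self.vocab = vocab
        self.entity_vector_length = entity_vector_length

    def get_candidates(self, mention):
        raise NotImplementedError("Subclasses must implement get_candidates().")


class ToyDictKB(ToyKnowledgeBase):
    """Hypothetical subclass backed by a plain dict of alias -> entity IDs."""

    def __init__(self, vocab, entity_vector_length: int):
        # The parent sets vocab and entity_vector_length.
        super().__init__(vocab, entity_vector_length)
        self.aliases = {}

    def get_candidates(self, mention):
        return self.aliases.get(mention, [])


kb = ToyDictKB(vocab={}, entity_vector_length=64)
```

Here `kb.entity_vector_length` is available because the subclass delegated to the parent constructor, while calling `ToyKnowledgeBase(...)` directly raises a `TypeError`.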
KnowledgeBase.entity_vector_length
The length of the fixed-size entity vectors in the knowledge base.
Name | Description |
---|---|
RETURNS | Length of the fixed-size entity vectors. |
KnowledgeBase.get_candidates
Given a certain textual mention as input, retrieve a list of candidate entities
of type Candidate.
Example
from spacy.lang.en import English

nlp = English()
doc = nlp("Douglas Adams wrote 'The Hitchhiker's Guide to the Galaxy'.")
candidates = kb.get_candidates(doc[0:2])
Name | Description |
---|---|
mention | The textual mention or alias. |
RETURNS | An iterable of relevant Candidate objects. |
KnowledgeBase.get_candidates_batch
Same as get_candidates(), but for an arbitrary number of mentions. The
EntityLinker component will call get_candidates_batch() instead of
get_candidates() if the config parameter candidates_batch_size is greater than
or equal to 1.
The default implementation of get_candidates_batch() executes get_candidates()
in a loop. If performance is a concern, we recommend implementing a more
efficient way to retrieve candidates for multiple mentions at once.
Example
from spacy.lang.en import English

nlp = English()
doc = nlp("Douglas Adams wrote 'The Hitchhiker's Guide to the Galaxy'.")
candidates = kb.get_candidates_batch((doc[0:2], doc[3:]))
Name | Description |
---|---|
mentions | The textual mentions or aliases. |
RETURNS | An iterable of iterables with relevant Candidate objects. |
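To make the difference between a loop-based batch lookup and a custom batched one concrete, here is a minimal sketch in plain Python. It does not use spaCy: `ToyBatchKB` and `DedupBatchKB` are hypothetical classes, mentions are plain strings rather than spans, and the deduplication strategy is just one possible optimization, not what any spaCy class does.

```python
from typing import Dict, Iterable, List


class ToyBatchKB:
    """Hypothetical in-memory KB keyed by mention text (not spaCy's API)."""

    def __init__(self, alias_to_entities: Dict[str, List[str]]):
        self.alias_to_entities = alias_to_entities
        self.lookups = 0  # counts single-mention lookups, for illustration only

    def get_candidates(self, mention: str) -> List[str]:
        self.lookups += 1
        return self.alias_to_entities.get(mention, [])

    def get_candidates_batch(self, mentions: Iterable[str]) -> List[List[str]]:
        # Loop-based behavior: one get_candidates() call per mention.
        return [self.get_candidates(m) for m in mentions]


class DedupBatchKB(ToyBatchKB):
    """Hypothetical override that looks up each distinct mention only once."""

    def get_candidates_batch(self, mentions: Iterable[str]) -> List[List[str]]:
        mentions = list(mentions)
        unique = {m: self.get_candidates(m) for m in set(mentions)}
        # Fan the cached results back out in the original mention order.
        return [unique[m] for m in mentions]


aliases = {
    "Douglas Adams": ["Q42"],
    "The Hitchhiker's Guide to the Galaxy": ["Q3107329"],
}
mentions = ["Douglas Adams", "The Hitchhiker's Guide to the Galaxy", "Douglas Adams"]

naive, dedup = ToyBatchKB(aliases), DedupBatchKB(aliases)
naive_result = naive.get_candidates_batch(mentions)
dedup_result = dedup.get_candidates_batch(mentions)
```

Both calls return the same nested result, but the loop-based version performs three lookups while the deduplicating override performs two. With a remote or database-backed KB, the batched method is where a single bulk query would go.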
KnowledgeBase.get_alias_candidates
This method is _not_ available from spaCy 3.5 onwards.
From spaCy 3.5 on, KnowledgeBase is an abstract class (with InMemoryLookupKB
being a drop-in replacement) to allow more flexibility in customizing knowledge
bases. Some of its methods were moved to InMemoryLookupKB during this
refactoring, one of those being get_alias_candidates(). This method is now
available as InMemoryLookupKB.get_alias_candidates().
Note: InMemoryLookupKB.get_candidates() defaults to
InMemoryLookupKB.get_alias_candidates().
KnowledgeBase.get_vector
Given a certain entity ID, retrieve its pretrained entity vector.
Example
vector = kb.get_vector("Q42")
Name | Description |
---|---|
entity | The entity ID. |
RETURNS | The entity vector. |
KnowledgeBase.get_vectors
Same as get_vector(), but for an arbitrary number of entity IDs.
The default implementation of get_vectors() executes get_vector() in a loop.
If performance is a concern, we recommend implementing a more efficient way to
retrieve vectors for multiple entities at once.
Example
vectors = kb.get_vectors(("Q42", "Q3107329"))
Name | Description |
---|---|
entities | The entity IDs. |
RETURNS | The entity vectors. |
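As a rough illustration of the relationship between get_vector(), get_vectors() and the fixed vector length, the following plain-Python sketch uses a hypothetical `ToyVectorKB` class. It does not import spaCy, and the `add_entity()` helper and the zero-vector fallback for unknown IDs are assumptions of this sketch.

```python
from typing import Dict, Iterable, List, Sequence


class ToyVectorKB:
    """Hypothetical vector store with fixed-size entity vectors (not spaCy's API)."""

    def __init__(self, entity_vector_length: int):
        self.entity_vector_length = entity_vector_length
        self._vectors: Dict[str, List[float]] = {}

    def add_entity(self, entity: str, vector: Sequence[float]) -> None:
        # Enforce the fixed vector size when an entity is stored.
        if len(vector) != self.entity_vector_length:
            raise ValueError(
                f"Expected length {self.entity_vector_length}, got {len(vector)}."
            )
        self._vectors[entity] = list(vector)

    def get_vector(self, entity: str) -> List[float]:
        # Assumption of this sketch: unknown IDs yield a zero vector.
        return self._vectors.get(entity, [0.0] * self.entity_vector_length)

    def get_vectors(self, entities: Iterable[str]) -> List[List[float]]:
        # Loop-based behavior: one get_vector() call per entity ID.
        return [self.get_vector(e) for e in entities]


vector_kb = ToyVectorKB(entity_vector_length=3)
vector_kb.add_entity("Q42", [1.0, 2.0, 3.0])
vectors = vector_kb.get_vectors(("Q42", "Q3107329"))
```

Every returned vector has the same length, whether or not the entity is known. A subclass backed by a matrix or an external store could replace the loop in `get_vectors()` with one bulk read.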
KnowledgeBase.to_disk
Save the current state of the knowledge base to a directory.
Example
kb.to_disk(path)
Name | Description |
---|---|
path | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or Path-like objects. |
exclude | List of components to exclude. |
KnowledgeBase.from_disk
Restore the state of the knowledge base from a given directory. Note that the
Vocab should also be the same as the one used to create the KB.
Example
from spacy.vocab import Vocab

vocab = Vocab().from_disk("/path/to/vocab")
kb = FullyImplementedKB(vocab=vocab, entity_vector_length=64)
kb.from_disk("/path/to/kb")
Name | Description |
---|---|
loc | A path to a directory. Paths may be either strings or Path-like objects. |
exclude | List of components to exclude. |
RETURNS | The modified KnowledgeBase object. |
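A custom knowledge base has to provide its own serialization. The sketch below shows one possible to_disk() and from_disk() round trip in plain Python, assuming a hypothetical `ToyPersistentKB` class that stores its state as JSON in a file named `kb.json`. The file name, the JSON format and the omission of the `exclude` argument are all assumptions of this sketch, not spaCy's on-disk format.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Union


class ToyPersistentKB:
    """Hypothetical KB with JSON persistence (not spaCy's serialization format)."""

    def __init__(self, entity_vector_length: int):
        self.entity_vector_length = entity_vector_length
        self.vectors: Dict[str, List[float]] = {}

    def to_disk(self, path: Union[str, Path]) -> None:
        path = Path(path)
        # Create the target directory if it doesn't exist yet.
        path.mkdir(parents=True, exist_ok=True)
        data = {
            "entity_vector_length": self.entity_vector_length,
            "vectors": self.vectors,
        }
        (path / "kb.json").write_text(json.dumps(data), encoding="utf8")

    def from_disk(self, path: Union[str, Path]) -> "ToyPersistentKB":
        data = json.loads((Path(path) / "kb.json").read_text(encoding="utf8"))
        # Guard against loading data created with a different vector size.
        if data["entity_vector_length"] != self.entity_vector_length:
            raise ValueError("Stored entity_vector_length does not match this KB.")
        self.vectors = data["vectors"]
        return self  # return the modified object so calls can be chained


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "nested" / "kb"  # does not exist before to_disk()
    original = ToyPersistentKB(entity_vector_length=3)
    original.vectors["Q42"] = [0.1, 0.2, 0.3]
    original.to_disk(target)
    restored = ToyPersistentKB(entity_vector_length=3).from_disk(str(target))
```

Returning `self` from `from_disk()` mirrors the RETURNS row above and allows construction and loading in one expression.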
Candidate
A Candidate object refers to a textual mention (alias) that may or may not be
resolved to a specific entity from a KnowledgeBase. This will be used as input
for the entity linking algorithm, which will disambiguate the various candidates
to the correct one. Each candidate (alias, entity) pair is assigned a prior
probability.
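To illustrate how prior probabilities on (alias, entity) pairs can feed into disambiguation, here is a deliberately naive sketch that picks the candidate with the highest prior. `ToyCandidate` and `pick_by_prior` are hypothetical, the entity ID `Q_OTHER` and the probabilities are made-up illustration values, and this baseline ignores the sentence context that an entity linking model would also use.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ToyCandidate:
    """Hypothetical stand-in for a candidate (alias, entity) pair."""

    entity_: str       # the entity's KB identifier
    alias_: str        # the textual mention or alias
    prior_prob: float  # prior probability of the alias referring to the entity


def pick_by_prior(candidates: Iterable[ToyCandidate]) -> Optional[ToyCandidate]:
    """Return the candidate with the highest prior, or None if there are none."""
    return max(candidates, key=lambda c: c.prior_prob, default=None)


# Made-up priors for illustration only.
candidates = [
    ToyCandidate(entity_="Q_OTHER", alias_="Douglas", prior_prob=0.2),
    ToyCandidate(entity_="Q42", alias_="Douglas", prior_prob=0.7),
]
best = pick_by_prior(candidates)
```

In this toy setup `best.entity_` is `"Q42"`. A mention with no candidates yields `None`, which a caller could treat as "no link".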
Candidate.__init__
Construct a Candidate object. Usually this constructor is not called directly,
but instead these objects are returned by the get_candidates method of the
entity_linker pipe.
Example
from spacy.kb import Candidate

candidate = Candidate(kb, entity_hash, entity_freq, entity_vector, alias_hash, prior_prob)
Name | Description |
---|---|
kb | The knowledge base that defined this candidate. |
entity_hash | The hash of the entity's KB ID. |
entity_freq | The entity frequency as recorded in the KB. |
alias_hash | The hash of the textual mention or alias. |
prior_prob | The prior probability of the alias referring to the entity. |
Candidate attributes
Name | Description |
---|---|
entity | The entity's unique KB identifier. |
entity_ | The entity's unique KB identifier. |
alias | The alias or textual mention. |
alias_ | The alias or textual mention. |
prior_prob | The prior probability of the alias referring to the entity. |
entity_freq | The frequency of the entity in a typical corpus. |
entity_vector | The pretrained vector of the entity. |