spaCy/website/docs/api/kb.md
Raphael Mitsch 1f23c615d7
Refactor KB for easier customization (#11268)

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-08 10:38:07 +02:00

| title | teaser | tag | source | new |
| --- | --- | --- | --- | --- |
| KnowledgeBase | A storage class for entities and aliases of a specific knowledge base (ontology) | class | spacy/kb/kb.pyx | 2.2 |

The KnowledgeBase object is an abstract class providing a method to generate Candidate objects, which are plausible external identifiers given a certain textual mention. Each such Candidate holds information from the relevant KB entities, such as its frequency in text and possible aliases. Each entity in the knowledge base also has a pretrained entity vector of a fixed size.

Beyond that, KnowledgeBase classes have to implement a number of utility functions called by the EntityLinker component.

Before spaCy 3.5, this class was not abstract. The KnowledgeBase implementation from those earlier versions is available as InMemoryLookupKB from 3.5 onwards.

KnowledgeBase.__init__

KnowledgeBase is an abstract class and cannot be instantiated. Its child classes should call __init__() to set up some necessary attributes.

Example

import spacy
from spacy.kb import KnowledgeBase
from spacy.vocab import Vocab

class FullyImplementedKB(KnowledgeBase):
    def __init__(self, vocab: Vocab, entity_vector_length: int):
        # Let the base class set up the shared attributes.
        super().__init__(vocab, entity_vector_length)
        ...

nlp = spacy.blank("en")
vocab = nlp.vocab
kb = FullyImplementedKB(vocab=vocab, entity_vector_length=64)

| Name | Description | Type |
| --- | --- | --- |
| vocab | The shared vocabulary. | Vocab |
| entity_vector_length | Length of the fixed-size entity vectors. | int |

KnowledgeBase.entity_vector_length

The length of the fixed-size entity vectors in the knowledge base.

| Name | Description | Type |
| --- | --- | --- |
| RETURNS | Length of the fixed-size entity vectors. | int |
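
To illustrate what "fixed-size" means in practice, here is a minimal sketch of a store that rejects vectors of the wrong length. `FixedLengthVectorStore` and its `add_entity()` method are hypothetical stand-ins written for this example. They are not spaCy classes, and a real `KnowledgeBase` subclass may enforce the constraint differently.

```python
from typing import Dict, List


class FixedLengthVectorStore:
    """Hypothetical stand-in for a KB's vector storage; not a spaCy class."""

    def __init__(self, entity_vector_length: int):
        self.entity_vector_length = entity_vector_length
        self._vectors: Dict[str, List[float]] = {}

    def add_entity(self, entity: str, entity_vector: List[float]) -> None:
        # Reject vectors whose length differs from the configured size.
        if len(entity_vector) != self.entity_vector_length:
            raise ValueError(
                f"Expected a vector of length {self.entity_vector_length}, "
                f"got {len(entity_vector)} for entity '{entity}'."
            )
        self._vectors[entity] = list(entity_vector)

    def get_vector(self, entity: str) -> List[float]:
        return self._vectors[entity]


store = FixedLengthVectorStore(entity_vector_length=3)
store.add_entity("Q42", [0.1, 0.2, 0.3])

# A vector of length 2 does not match entity_vector_length=3.
try:
    store.add_entity("Q3107329", [0.1, 0.2])
    rejected = False
except ValueError:
    rejected = True
```

Validating on insertion means every vector later returned by `get_vector()` can be assumed to have the same length.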

KnowledgeBase.get_candidates

Given a certain textual mention as input, retrieve an iterable of candidate entities of type Candidate.

Example

from spacy.lang.en import English
nlp = English()
doc = nlp("Douglas Adams wrote 'The Hitchhiker's Guide to the Galaxy'.")
candidates = kb.get_candidates(doc[0:2])

| Name | Description | Type |
| --- | --- | --- |
| mention | The textual mention or alias. | Span |
| RETURNS | An iterable of relevant Candidate objects. | Iterable[Candidate] |

KnowledgeBase.get_candidates_batch

Same as get_candidates(), but for an arbitrary number of mentions. The EntityLinker component calls get_candidates_batch() instead of get_candidates() if the config parameter candidates_batch_size is greater than or equal to 1.

The default implementation of get_candidates_batch() executes get_candidates() in a loop. We recommend implementing a more efficient way to retrieve candidates for multiple mentions at once, if performance is of concern to you.

Example

from spacy.lang.en import English
nlp = English()
doc = nlp("Douglas Adams wrote 'The Hitchhiker's Guide to the Galaxy'.")
candidates = kb.get_candidates_batch((doc[0:2], doc[3:]))

| Name | Description | Type |
| --- | --- | --- |
| mentions | The textual mentions or aliases. | Iterable[Span] |
| RETURNS | An iterable of iterables with relevant Candidate objects. | Iterable[Iterable[Candidate]] |
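
The recommendation to replace the looping default with something more efficient can be sketched as follows. This is a dependency-free illustration, not spaCy code: `LoopingKB` and `BatchedKB` are hypothetical stand-ins, mentions are plain strings rather than Span objects, candidates are plain entity IDs rather than Candidate objects, and the alias data is made up. The batched variant queries each distinct mention only once.

```python
from typing import Dict, Iterable, List


class LoopingKB:
    """Hypothetical stand-in for a KB; not a spaCy class."""

    def __init__(self, alias_to_entities: Dict[str, List[str]]):
        self.alias_to_entities = alias_to_entities
        self.lookups = 0  # counts how often the backing store is queried

    def get_candidates(self, mention: str) -> List[str]:
        self.lookups += 1
        return self.alias_to_entities.get(mention, [])

    def get_candidates_batch(self, mentions: Iterable[str]) -> List[List[str]]:
        # Mirrors the looping default described above: one call per mention.
        return [self.get_candidates(mention) for mention in mentions]


class BatchedKB(LoopingKB):
    def get_candidates_batch(self, mentions: Iterable[str]) -> List[List[str]]:
        # Query each distinct mention once, then fan the results back out
        # in the original order.
        mentions = list(mentions)
        cache = {m: self.get_candidates(m) for m in dict.fromkeys(mentions)}
        return [cache[m] for m in mentions]


# Made-up alias table for illustration only.
aliases = {"Douglas Adams": ["Q42"], "Hitchhiker's Guide": ["Q3107329"]}
mentions = ["Douglas Adams", "Hitchhiker's Guide", "Douglas Adams", "Unknown"]

looping, batched = LoopingKB(aliases), BatchedKB(aliases)
looping_result = looping.get_candidates_batch(mentions)
batched_result = batched.get_candidates_batch(mentions)
```

Both variants return the same candidates in the same order. The difference is the number of lookups, which matters most when each lookup is expensive, for example a database or network call.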

KnowledgeBase.get_alias_candidates

This method is _not_ available from spaCy 3.5 onwards.

From spaCy 3.5 onwards, KnowledgeBase is an abstract class (with InMemoryLookupKB as a drop-in replacement) to allow more flexibility in customizing knowledge bases. During this refactoring, some of its methods were moved to InMemoryLookupKB, including get_alias_candidates(), which is now available as InMemoryLookupKB.get_alias_candidates(). Note: InMemoryLookupKB.get_candidates() defaults to InMemoryLookupKB.get_alias_candidates().
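
For code that previously called `get_alias_candidates()` on a KnowledgeBase, the migration might look like the sketch below. It assumes a spaCy 3.x release from 3.5 onwards is installed. The entity frequency, vector and prior probability are made-up values for illustration.

```python
import spacy
from spacy.kb import InMemoryLookupKB  # assumes spaCy 3.x, version 3.5 or later

nlp = spacy.blank("en")
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=3)

# Illustrative data: the frequency, vector and prior are made up.
kb.add_entity(entity="Q42", freq=12, entity_vector=[1.0, 0.0, 0.0])
kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])

doc = nlp("Douglas Adams wrote 'The Hitchhiker's Guide to the Galaxy'.")

# Lookup by alias string, now a method of InMemoryLookupKB.
by_alias = kb.get_alias_candidates("Douglas Adams")
# Lookup by Span; as noted above, this defaults to the alias lookup.
by_span = list(kb.get_candidates(doc[0:2]))
```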

KnowledgeBase.get_vector

Given a certain entity ID, retrieve its pretrained entity vector.

Example

vector = kb.get_vector("Q42")

| Name | Description | Type |
| --- | --- | --- |
| entity | The entity ID. | str |
| RETURNS | The entity vector. | Iterable[float] |

KnowledgeBase.get_vectors

Same as get_vector(), but for an arbitrary number of entity IDs.

The default implementation of get_vectors() executes get_vector() in a loop. We recommend implementing a more efficient way to retrieve vectors for multiple entities at once, if performance is of concern to you.

Example

vectors = kb.get_vectors(("Q42", "Q3107329"))

| Name | Description | Type |
| --- | --- | --- |
| entities | The entity IDs. | Iterable[str] |
| RETURNS | The entity vectors. | Iterable[Iterable[float]] |

KnowledgeBase.to_disk

Save the current state of the knowledge base to a directory.

Example

kb.to_disk(path)

| Name | Description | Type |
| --- | --- | --- |
| path | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or Path-like objects. | Union[str, Path] |
| exclude | List of components to exclude. | Iterable[str] |

KnowledgeBase.from_disk

Restore the state of the knowledge base from a given directory. Note that the Vocab should also be the same as the one used to create the KB.

Example

from spacy.vocab import Vocab
vocab = Vocab().from_disk("/path/to/vocab")
kb = FullyImplementedKB(vocab=vocab, entity_vector_length=64)
kb.from_disk("/path/to/kb")

| Name | Description | Type |
| --- | --- | --- |
| loc | A path to a directory. Paths may be either strings or Path-like objects. | Union[str, Path] |
| exclude | List of components to exclude. | Iterable[str] |
| RETURNS | The modified KnowledgeBase object. | KnowledgeBase |
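
Because KnowledgeBase is abstract, a custom subclass has to provide its own serialization. The following is a minimal sketch of the to_disk/from_disk contract described above, under stated assumptions: `JsonBackedKB` is a hypothetical stand-in rather than a spaCy class, the JSON file layout is invented for this example, and `"vectors"` is an assumed component name for `exclude`.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List, Union


class JsonBackedKB:
    """Hypothetical stand-in for a custom KB; not a spaCy class."""

    def __init__(self, entity_vector_length: int):
        self.entity_vector_length = entity_vector_length
        self.vectors: Dict[str, List[float]] = {}

    def to_disk(self, path: Union[str, Path], exclude: Iterable[str] = tuple()) -> None:
        path = Path(path)
        path.mkdir(parents=True, exist_ok=True)  # create the directory if missing
        if "vectors" not in exclude:
            (path / "vectors.json").write_text(json.dumps(self.vectors), encoding="utf8")

    def from_disk(self, path: Union[str, Path], exclude: Iterable[str] = tuple()) -> "JsonBackedKB":
        path = Path(path)
        if "vectors" not in exclude:
            self.vectors = json.loads((path / "vectors.json").read_text(encoding="utf8"))
        return self  # the object is modified in place and returned


with tempfile.TemporaryDirectory() as tmp:
    kb_dir = Path(tmp) / "kb"  # does not exist yet; to_disk() creates it
    original = JsonBackedKB(entity_vector_length=3)
    original.vectors["Q42"] = [0.1, 0.2, 0.3]
    original.to_disk(kb_dir)

    restored = JsonBackedKB(entity_vector_length=3)
    returned = restored.from_disk(kb_dir)
```

Returning `self` from `from_disk()` matches the "modified KnowledgeBase object" return value in the table above and allows the call to be chained.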

Candidate

A Candidate object refers to a textual mention (alias) that may or may not be resolved to a specific entity from a KnowledgeBase. This will be used as input for the entity linking algorithm, which will disambiguate the various candidates to the correct one. Each candidate (alias, entity) pair is assigned a certain prior probability.
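
One common way to obtain such prior probabilities is to count how often each alias refers to each entity in annotated text. The sketch below assumes that approach. The observation data and the entity ID `Q_OTHER_ADAMS` are made up, and `estimate_priors()` is a hypothetical helper, not part of spaCy.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical annotated (alias, entity ID) pairs from a corpus.
observations: List[Tuple[str, str]] = [
    ("Adams", "Q42"),
    ("Adams", "Q42"),
    ("Adams", "Q42"),
    ("Adams", "Q_OTHER_ADAMS"),
    ("Douglas Adams", "Q42"),
]


def estimate_priors(pairs: List[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(entity | alias) as a relative frequency."""
    pair_counts = Counter(pairs)
    alias_counts = Counter(alias for alias, _ in pairs)
    priors: Dict[str, Dict[str, float]] = {}
    for (alias, entity), count in pair_counts.items():
        priors.setdefault(alias, {})[entity] = count / alias_counts[alias]
    return priors


priors = estimate_priors(observations)

# A naive baseline: pick the entity with the highest prior for an alias.
best_for_adams = max(priors["Adams"], key=priors["Adams"].get)
```

With this data, "Adams" refers to `Q42` in three of four observations, so its prior is 0.75. An entity linker can combine such priors with context to disambiguate, instead of always taking the most frequent entity.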

Candidate.__init__

Construct a Candidate object. Usually this constructor is not called directly, but instead these objects are returned by the get_candidates method of the entity_linker pipe.

Example

from spacy.kb import Candidate
candidate = Candidate(kb, entity_hash, entity_freq, entity_vector, alias_hash, prior_prob)

| Name | Description | Type |
| --- | --- | --- |
| kb | The knowledge base that defined this candidate. | KnowledgeBase |
| entity_hash | The hash of the entity's KB ID. | int |
| entity_freq | The entity frequency as recorded in the KB. | float |
| alias_hash | The hash of the textual mention or alias. | int |
| prior_prob | The prior probability of the alias referring to the entity. | float |

Candidate attributes

| Name | Description | Type |
| --- | --- | --- |
| entity | The hash of the entity's unique KB identifier. | int |
| entity_ | The entity's unique KB identifier. | str |
| alias | The hash of the alias or textual mention. | int |
| alias_ | The alias or textual mention. | str |
| prior_prob | The prior probability of the alias referring to the entity. | float |
| entity_freq | The frequency of the entity in a typical corpus. | float |
| entity_vector | The pretrained vector of the entity. | numpy.ndarray |