---
title: KnowledgeBase
teaser:
  A storage class for entities and aliases of a specific knowledge base
  (ontology)
tag: class
source: spacy/kb/kb.pyx
new: 2.2
---

The `KnowledgeBase` object is an abstract class providing a method to generate
[`Candidate`](/api/kb#candidate) objects, which are plausible external
identifiers given a certain textual mention. Each such `Candidate` holds
information from the relevant KB entity, such as its frequency in text and
possible aliases. Each entity in the knowledge base also has a pretrained entity
vector of a fixed size.

Beyond that, `KnowledgeBase` classes have to implement a number of utility
functions called by the [`EntityLinker`](/api/entitylinker) component.

<Infobox variant="warning">

Before spaCy version 3.5, this class was not abstract. The earlier
`KnowledgeBase` implementation is available as `InMemoryLookupKB` from 3.5
onwards.

</Infobox>

## KnowledgeBase.\_\_init\_\_ {#init tag="method"}

`KnowledgeBase` is an abstract class and cannot be instantiated. Its child
classes should call `__init__()` to set up some necessary attributes.

> #### Example
>
> ```python
> from spacy.kb import KnowledgeBase
> from spacy.vocab import Vocab
>
> class FullyImplementedKB(KnowledgeBase):
>     def __init__(self, vocab: Vocab, entity_vector_length: int):
>         super().__init__(vocab, entity_vector_length)
>         ...
>
> vocab = nlp.vocab
> kb = FullyImplementedKB(vocab=vocab, entity_vector_length=64)
> ```

| Name                   | Description                                      |
| ---------------------- | ------------------------------------------------ |
| `vocab`                | The shared vocabulary. ~~Vocab~~                 |
| `entity_vector_length` | Length of the fixed-size entity vectors. ~~int~~ |

## KnowledgeBase.entity_vector_length {#entity_vector_length tag="property"}

The length of the fixed-size entity vectors in the knowledge base.

| Name        | Description                                      |
| ----------- | ------------------------------------------------ |
| **RETURNS** | Length of the fixed-size entity vectors. ~~int~~ |

## KnowledgeBase.get_candidates {#get_candidates tag="method"}

Given a certain textual mention as input, retrieve a list of candidate entities
of type [`Candidate`](/api/kb#candidate).

> #### Example
>
> ```python
> from spacy.lang.en import English
> nlp = English()
> doc = nlp("Douglas Adams wrote 'The Hitchhiker's Guide to the Galaxy'.")
> candidates = kb.get_candidates(doc[0:2])
> ```

| Name        | Description                                                          |
| ----------- | -------------------------------------------------------------------- |
| `mention`   | The textual mention or alias. ~~Span~~                               |
| **RETURNS** | An iterable of relevant `Candidate` objects. ~~Iterable[Candidate]~~ |

## KnowledgeBase.get_candidates_batch {#get_candidates_batch tag="method"}

Same as [`get_candidates()`](/api/kb#get_candidates), but for an arbitrary
number of mentions. The [`EntityLinker`](/api/entitylinker) component will call
`get_candidates_batch()` instead of `get_candidates()` if the config parameter
`candidates_batch_size` is greater than or equal to 1.

The default implementation of `get_candidates_batch()` executes
`get_candidates()` in a loop. If performance is a concern, we recommend
implementing a more efficient way to retrieve candidates for multiple mentions
at once.

> #### Example
>
> ```python
> from spacy.lang.en import English
> nlp = English()
> doc = nlp("Douglas Adams wrote 'The Hitchhiker's Guide to the Galaxy'.")
> candidates = kb.get_candidates_batch((doc[0:2], doc[3:]))
> ```

| Name        | Description                                                                                   |
| ----------- | --------------------------------------------------------------------------------------------- |
| `mentions`  | The textual mentions or aliases. ~~Iterable[Span]~~                                           |
| **RETURNS** | An iterable of iterables with relevant `Candidate` objects. ~~Iterable[Iterable[Candidate]]~~ |

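
The difference between looping and batching can be sketched without spaCy. The
snippet below is only an illustration: `ToyKB`, its `lookups` counter and the
`get_candidates_loop()` helper are hypothetical names, and mentions are plain
strings rather than `Span` objects. It assumes that querying the backing store
is the expensive step, so the batched variant queries each distinct mention only
once.

```python
from typing import Dict, Iterable, List


class ToyKB:
    """Hypothetical stand-in for a knowledge base; not part of spaCy."""

    def __init__(self, alias_to_entities: Dict[str, List[str]]):
        self._alias_to_entities = alias_to_entities
        self.lookups = 0  # counts how often the backing store is queried

    def get_candidates(self, mention: str) -> List[str]:
        # One store query per call.
        self.lookups += 1
        return self._alias_to_entities.get(mention, [])

    def get_candidates_loop(self, mentions: Iterable[str]) -> List[List[str]]:
        # Mirrors the looping behavior described above: one call per mention.
        return [self.get_candidates(mention) for mention in mentions]

    def get_candidates_batch(self, mentions: Iterable[str]) -> List[List[str]]:
        # Query each distinct mention once, then fan the results back out
        # in the original order.
        mentions = list(mentions)
        cache = {mention: self.get_candidates(mention) for mention in set(mentions)}
        return [cache[mention] for mention in mentions]


toy_kb = ToyKB({"Douglas Adams": ["Q42"], "Hitchhiker's Guide": ["Q3107329"]})
mentions = ["Douglas Adams", "Hitchhiker's Guide", "Douglas Adams", "Unknown"]

looped = toy_kb.get_candidates_loop(mentions)
loop_lookups = toy_kb.lookups

toy_kb.lookups = 0
batched = toy_kb.get_candidates_batch(mentions)
batch_lookups = toy_kb.lookups
```

Both variants return the same candidates in the same order. The batched one
issues fewer store queries whenever mentions repeat. A real implementation
might instead send a single bulk query to a database or index.
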
## KnowledgeBase.get_alias_candidates {#get_alias_candidates tag="method"}

<Infobox variant="warning">
This method is _not_ available from spaCy 3.5 onwards.
</Infobox>

From spaCy 3.5 on, `KnowledgeBase` is an abstract class (with
[`InMemoryLookupKB`](/api/kb_in_memory) being a drop-in replacement) to allow
more flexibility in customizing knowledge bases. Some of its methods were moved
to [`InMemoryLookupKB`](/api/kb_in_memory) during this refactoring, one of those
being `get_alias_candidates()`. This method is now available as
[`InMemoryLookupKB.get_alias_candidates()`](/api/kb_in_memory#get_alias_candidates).
Note: [`InMemoryLookupKB.get_candidates()`](/api/kb_in_memory#get_candidates)
defaults to
[`InMemoryLookupKB.get_alias_candidates()`](/api/kb_in_memory#get_alias_candidates).

## KnowledgeBase.get_vector {#get_vector tag="method"}

Given a certain entity ID, retrieve its pretrained entity vector.

> #### Example
>
> ```python
> vector = kb.get_vector("Q42")
> ```

| Name        | Description                            |
| ----------- | -------------------------------------- |
| `entity`    | The entity ID. ~~str~~                 |
| **RETURNS** | The entity vector. ~~Iterable[float]~~ |

## KnowledgeBase.get_vectors {#get_vectors tag="method"}

Same as [`get_vector()`](/api/kb#get_vector), but for an arbitrary number of
entity IDs.

The default implementation of `get_vectors()` executes `get_vector()` in a loop.
If performance is a concern, we recommend implementing a more efficient way to
retrieve vectors for multiple entities at once.

> #### Example
>
> ```python
> vectors = kb.get_vectors(("Q42", "Q3107329"))
> ```

| Name        | Description                                       |
| ----------- | ------------------------------------------------- |
| `entities`  | The entity IDs. ~~Iterable[str]~~                 |
| **RETURNS** | The entity vectors. ~~Iterable[Iterable[float]]~~ |

## KnowledgeBase.to_disk {#to_disk tag="method"}

Save the current state of the knowledge base to a directory.

> #### Example
>
> ```python
> kb.to_disk(path)
> ```

| Name      | Description                                                                                                                                |
| --------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `path`    | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| `exclude` | List of components to exclude. ~~Iterable[str]~~                                                                                           |

## KnowledgeBase.from_disk {#from_disk tag="method"}

Restore the state of the knowledge base from a given directory. Note that the
[`Vocab`](/api/vocab) should also be the same as the one used to create the KB.

> #### Example
>
> ```python
> from spacy.vocab import Vocab
> vocab = Vocab().from_disk("/path/to/vocab")
> kb = FullyImplementedKB(vocab=vocab, entity_vector_length=64)
> kb.from_disk("/path/to/kb")
> ```

| Name        | Description                                                                                     |
| ----------- | ----------------------------------------------------------------------------------------------- |
| `loc`       | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| `exclude`   | List of components to exclude. ~~Iterable[str]~~                                                |
| **RETURNS** | The modified `KnowledgeBase` object. ~~KnowledgeBase~~                                          |

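
A save-and-restore round trip might look like the sketch below. It does not use
spaCy: `ToyDiskKB` and the `vectors.json` file layout are assumptions made for
illustration only and say nothing about how spaCy actually serializes a
knowledge base. The point is the calling pattern: the receiving object is first
constructed with compatible settings, then filled from disk, and `from_disk()`
returns the modified object.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Union


class ToyDiskKB:
    """Hypothetical stand-in; the JSON layout is an assumption, not spaCy's format."""

    def __init__(self, entity_vector_length: int):
        self.entity_vector_length = entity_vector_length
        self.vectors: Dict[str, List[float]] = {}

    def to_disk(self, path: Union[str, Path]) -> None:
        path = Path(path)
        path.mkdir(parents=True, exist_ok=True)  # create the directory if missing
        with (path / "vectors.json").open("w", encoding="utf8") as file_:
            json.dump(self.vectors, file_)

    def from_disk(self, path: Union[str, Path]) -> "ToyDiskKB":
        with (Path(path) / "vectors.json").open(encoding="utf8") as file_:
            vectors = json.load(file_)
        # Reject data that does not match the settings of the receiving object.
        for entity_id, vector in vectors.items():
            if len(vector) != self.entity_vector_length:
                raise ValueError(f"Unexpected vector length for {entity_id}")
        self.vectors = vectors
        return self  # the modified object


kb_out = ToyDiskKB(entity_vector_length=3)
kb_out.vectors["Q42"] = [0.1, 0.2, 0.3]

with tempfile.TemporaryDirectory() as tmp_dir:
    kb_dir = Path(tmp_dir) / "kb"  # does not exist yet; to_disk() creates it
    kb_out.to_disk(kb_dir)
    kb_in = ToyDiskKB(entity_vector_length=3).from_disk(kb_dir)
```

Constructing `ToyDiskKB` with a different `entity_vector_length` before calling
`from_disk()` would raise an error here, which loosely mirrors the note above
that the restored KB should be paired with the same `Vocab` it was created with.
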
## Candidate {#candidate tag="class"}

A `Candidate` object refers to a textual mention (alias) that may or may not be
resolved to a specific entity from a `KnowledgeBase`. This will be used as input
for the entity linking algorithm, which will disambiguate the various candidates
to the correct one. Each candidate `(alias, entity)` pair is assigned a certain
prior probability.

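
To make the notion of a prior probability concrete, the sketch below estimates
it from counts of observed `(alias, entity)` pairs. This is an assumption about
how such priors could be derived, not a description of how any particular
knowledge base computes them. The function name `estimate_prior_probs` and the
entity ID `"Q_other"` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (alias, entity ID)


def estimate_prior_probs(pairs: Iterable[Pair]) -> Dict[Pair, float]:
    """Estimate P(entity | alias) as count(alias, entity) / count(alias)."""
    pair_counts = Counter(pairs)
    alias_counts: Counter = Counter()
    for (alias, _entity), count in pair_counts.items():
        alias_counts[alias] += count
    return {
        (alias, entity): count / alias_counts[alias]
        for (alias, entity), count in pair_counts.items()
    }


# Hypothetical annotated mentions: "Adams" mostly refers to Q42.
observed = [
    ("Adams", "Q42"),
    ("Adams", "Q42"),
    ("Adams", "Q42"),
    ("Adams", "Q_other"),  # made-up ID for some other entity
    ("Douglas Adams", "Q42"),
]
priors = estimate_prior_probs(observed)
```

With these counts, the pair `("Adams", "Q42")` gets a prior of 0.75, while the
unambiguous alias `"Douglas Adams"` gets a prior of 1.0 for `Q42`.
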
### Candidate.\_\_init\_\_ {#candidate-init tag="method"}

Construct a `Candidate` object. Usually this constructor is not called directly,
but instead these objects are returned by the `get_candidates` method of the
[`entity_linker`](/api/entitylinker) pipe.

> #### Example
>
> ```python
> from spacy.kb import Candidate
> candidate = Candidate(kb, entity_hash, entity_freq, entity_vector, alias_hash, prior_prob)
> ```

| Name            | Description                                                              |
| --------------- | ------------------------------------------------------------------------ |
| `kb`            | The knowledge base that defined this candidate. ~~KnowledgeBase~~        |
| `entity_hash`   | The hash of the entity's KB ID. ~~int~~                                  |
| `entity_freq`   | The entity frequency as recorded in the KB. ~~float~~                    |
| `entity_vector` | The pretrained vector of the entity. ~~numpy.ndarray~~                   |
| `alias_hash`    | The hash of the textual mention or alias. ~~int~~                        |
| `prior_prob`    | The prior probability of the `alias` referring to the `entity`. ~~float~~ |

## Candidate attributes {#candidate-attributes}

| Name            | Description                                                               |
| --------------- | ------------------------------------------------------------------------- |
| `entity`        | The entity's unique KB identifier. ~~int~~                                |
| `entity_`       | The entity's unique KB identifier. ~~str~~                                |
| `alias`         | The alias or textual mention. ~~int~~                                     |
| `alias_`        | The alias or textual mention. ~~str~~                                     |
| `prior_prob`    | The prior probability of the `alias` referring to the `entity`. ~~float~~ |
| `entity_freq`   | The frequency of the entity in a typical corpus. ~~float~~                |
| `entity_vector` | The pretrained vector of the entity. ~~numpy.ndarray~~                    |