mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-06 07:16:29 +03:00
304b9331e6
* Convert Candidate from Cython to Python class. * Format. * Fix .entity_ typo in _add_activations() usage. * Change type for mentions to look up entity candidates for to SpanGroup from Iterable[Span]. * Update docs. * Update spacy/kb/candidate.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update doc string of BaseCandidate.__init__(). * Update spacy/kb/candidate.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate. * Adjust Candidate to support and mandate numerical entity IDs. * Format. * Fix docstring and docs. * Update website/docs/api/kb.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Rename alias -> mention. * Refactor Candidate attribute names. Update docs and tests accordingly. * Refacor Candidate attributes and their usage. * Format. * Fix mypy error. * Update error code in line with v4 convention. * Modify EL batching system. * Update leftover get_candidates() mention in docs. * Format docs. * Format. * Update spacy/kb/candidate.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Updated error code. * Simplify interface for int/str representations. * Update website/docs/api/kb.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Rename 'alias' to 'mention'. * Port Candidate and InMemoryCandidate to Cython. * Remove redundant entry in setup.py. * Add abstract class check. * Drop storing mention. * Update spacy/kb/candidate.pxd Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix entity_id refactoring problems in docstrings. * Drop unused InMemoryCandidate._entity_hash. * Update docstrings. * Move attributes out of Candidate. * Partially fix alias/mention terminology usage. Convert Candidate to interface. * Remove prior_prob from supported properties in Candidate. Introduce KnowledgeBase.supports_prior_probs(). 
* Update docstrings related to prior_prob. * Update alias/mention usage in doc(strings). * Update spacy/ml/models/entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/ml/models/entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Mention -> alias renaming. Drop Candidate.mentions(). Drop InMemoryLookupKB.get_alias_candidates() from docs. * Update docstrings. * Fix InMemoryCandidate attribute names. * Update spacy/kb/kb.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/ml/models/entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update W401 test. * Update spacy/errors.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/kb/kb.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Use Candidate output type for toy generators in the test suite to mimick best practices * fix docs * fix import * Fix merge leftovers. * Update spacy/kb/kb.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/kb/kb.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/kb.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/entitylinker.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/kb/kb_in_memory.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/inmemorylookupkb.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update get_candidates() docstring. * Reformat imports in entity_linker.py. * Drop valid_ent_idx_per_doc. * Update docs. * Format. * Simplify doc loop in predict(). * Remove E1044 comment. * Fix merge errors. * Format. * Format. * Format. * Fix merge error & tests. * Format. 
* Apply suggestions from code review Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * Use type alias. * isort. * isort. * Lint. * Add typedefs.pyx. * Fix typedef import. * Fix type aliases. * Format. * Update docstring and type usage. * Add info on get_candidates(), get_candidates_batched(). * Readd get_candidates info to v3 changelog. * Update website/docs/api/entitylinker.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update factory functions for backwards compatibility. * Format. * Ignore mypy error. * Fix mypy error. * Format. * Add test for multiple docs with multiple entities. --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> Co-authored-by: svlandeg <svlandeg@github.com>
196 lines
9.0 KiB
Plaintext
---
title: KnowledgeBase
teaser:
  A storage class for entities and aliases of a specific knowledge base
  (ontology)
tag: class
source: spacy/kb/kb.pyx
version: 2.2
---

The `KnowledgeBase` object is an abstract class providing a method to generate
[`Candidate`](/api/kb#candidate) objects, which are plausible external
identifiers given a certain textual mention. Each such `Candidate` holds
information from the relevant KB entity, such as its frequency in text and its
possible aliases. Each entity in the knowledge base also has a pretrained entity
vector of a fixed size.

Beyond that, `KnowledgeBase` classes have to implement a number of utility
functions called by the [`EntityLinker`](/api/entitylinker) component.

<Infobox variant="warning">

This class was not abstract up to spaCy version 3.5. The `KnowledgeBase`
implementation up to that point is available as
[`InMemoryLookupKB`](/api/inmemorylookupkb) from 3.5 onwards.

</Infobox>

## KnowledgeBase.\_\_init\_\_ {id="init",tag="method"}

`KnowledgeBase` is an abstract class and cannot be instantiated. Its child
classes should call `__init__()` to set up some necessary attributes.

> #### Example
>
> ```python
> from spacy.kb import KnowledgeBase
> from spacy.vocab import Vocab
>
> class FullyImplementedKB(KnowledgeBase):
>     def __init__(self, vocab: Vocab, entity_vector_length: int):
>         super().__init__(vocab, entity_vector_length)
>         ...
>
> # `nlp` is assumed to be an existing pipeline object.
> vocab = nlp.vocab
> kb = FullyImplementedKB(vocab=vocab, entity_vector_length=64)
> ```

| Name                   | Description                                      |
| ---------------------- | ------------------------------------------------ |
| `vocab`                | The shared vocabulary. ~~Vocab~~                 |
| `entity_vector_length` | Length of the fixed-size entity vectors. ~~int~~ |

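The instantiation rule described for `KnowledgeBase.__init__` can be illustrated
without spaCy. The following is a minimal sketch and not spaCy's actual
implementation: `AbstractKB` and `ToyKB` are hypothetical stand-ins, and the
type check in `__init__` is just one common way to keep a base class from being
instantiated while still letting subclasses call `super().__init__()`.

```python
class AbstractKB:
    """Hypothetical stand-in for an abstract knowledge base class."""

    def __init__(self, vocab, entity_vector_length: int):
        # Refuse direct instantiation, but let subclasses call this method.
        if type(self) is AbstractKB:
            raise TypeError("AbstractKB is abstract and cannot be instantiated.")
        self.vocab = vocab
        self.entity_vector_length = entity_vector_length


class ToyKB(AbstractKB):
    """Hypothetical subclass that relies on the base class for shared attributes."""

    def __init__(self, vocab, entity_vector_length: int):
        super().__init__(vocab, entity_vector_length)
        self.entities = {}  # subclass-specific storage


# A plain dict stands in for the shared vocabulary here.
kb = ToyKB(vocab={}, entity_vector_length=64)
print(kb.entity_vector_length)

try:
    AbstractKB(vocab={}, entity_vector_length=64)
    instantiated_base = True
except TypeError as err:
    instantiated_base = False
    print(f"TypeError: {err}")
```

The subclass receives `vocab` and `entity_vector_length` from the base class
initializer, while the base class itself raises an error when constructed
directly.
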
## KnowledgeBase.entity_vector_length {id="entity_vector_length",tag="property"}

The length of the fixed-size entity vectors in the knowledge base.

| Name        | Description                                      |
| ----------- | ------------------------------------------------ |
| **RETURNS** | Length of the fixed-size entity vectors. ~~int~~ |

## KnowledgeBase.get_candidates {id="get_candidates",tag="method"}

Given textual mentions for an arbitrary number of documents as input, retrieve a
list of candidate entities of type [`Candidate`](/api/kb#candidate) for each
mention. The [`EntityLinker`](/api/entitylinker) component passes a generator
that yields mentions as [`SpanGroup`](/api/spangroup)s, one per document. The
decision of how to batch candidate retrieval lookups over multiple documents is
left up to the implementation of `KnowledgeBase.get_candidates()`.

> #### Example
>
> ```python
> from spacy.lang.en import English
> from spacy.tokens import SpanGroup
> nlp = English()
> doc = nlp("Douglas Adams wrote 'The Hitchhiker's Guide to the Galaxy'.")
> candidates = kb.get_candidates([SpanGroup(doc, spans=[doc[0:2], doc[3:]])])
> ```

| Name | Description |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `mentions` | The textual mentions or aliases (one `SpanGroup` per `Doc` instance). ~~Iterator[SpanGroup]~~ |
| **RETURNS** | An iterator (per document) over iterables (per mention) of iterables (per candidate for this mention) with relevant `Candidate` objects. ~~Iterator[Iterable[Iterable[Candidate]]]~~ |

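The return value of `get_candidates()` is nested three levels deep: documents,
then mentions, then candidates. The sketch below shows how such a structure can
be consumed. It does not use spaCy: `toy_get_candidates`, `TOY_LOOKUP` and the
tuple-based candidates are hypothetical stand-ins, mentions are plain strings
rather than `SpanGroup`s, and the pairing of IDs, mentions and scores is purely
illustrative.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical stand-in: a candidate is an (entity_id, prior_prob) tuple.
ToyCandidate = Tuple[str, float]

# Illustrative lookup table; the scores are made up for this example.
TOY_LOOKUP: Dict[str, List[ToyCandidate]] = {
    "Douglas Adams": [("Q42", 0.9)],
    "The Hitchhiker's Guide to the Galaxy": [("Q3107329", 0.8)],
}


def toy_get_candidates(
    mentions: Iterable[List[str]],
) -> Iterator[List[List[ToyCandidate]]]:
    """Yield one list per document, holding one candidate list per mention."""
    for doc_mentions in mentions:
        yield [TOY_LOOKUP.get(mention, []) for mention in doc_mentions]


docs = [
    ["Douglas Adams", "The Hitchhiker's Guide to the Galaxy"],
    ["An unknown mention"],
]

flat = []
# Outer level: documents. Middle level: mentions. Inner level: candidates.
for doc_mentions, per_doc in zip(docs, toy_get_candidates(docs)):
    for mention, per_mention in zip(doc_mentions, per_doc):
        for entity_id, prior_prob in per_mention:
            flat.append((mention, entity_id, prior_prob))

print(flat)
```

A mention without any known candidate still produces an (empty) inner list, so
the positions of mentions and candidate lists stay aligned within each document.
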
## KnowledgeBase.get_vector {id="get_vector",tag="method"}

Given a certain entity ID, retrieve its pretrained entity vector.

> #### Example
>
> ```python
> vector = kb.get_vector("Q42")
> ```

| Name        | Description                            |
| ----------- | -------------------------------------- |
| `entity`    | The entity ID. ~~str~~                 |
| **RETURNS** | The entity vector. ~~Iterable[float]~~ |

## KnowledgeBase.get_vectors {id="get_vectors",tag="method"}

Same as [`get_vector()`](/api/kb#get_vector), but for an arbitrary number of
entity IDs.

The default implementation of `get_vectors()` executes `get_vector()` in a loop.
We recommend implementing a more efficient way to retrieve vectors for multiple
entities at once if performance is a concern.

> #### Example
>
> ```python
> vectors = kb.get_vectors(("Q42", "Q3107329"))
> ```

| Name        | Description                                        |
| ----------- | -------------------------------------------------- |
| `entities`  | The entity IDs. ~~Iterable[str]~~                  |
| **RETURNS** | The entity vectors. ~~Iterable[Iterable[float]]~~  |

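The difference between looping over `get_vector()` and retrieving all vectors at
once can be sketched as follows. This is an illustration under stated
assumptions, not spaCy code: `ToyVectorStore`, its dict-based storage and the
`lookups` counter (which pretends that every call is a round trip to some
backend) are all hypothetical.

```python
from typing import Dict, Iterable, List


class ToyVectorStore:
    """Hypothetical stand-in for a KB that keeps entity vectors in a dict."""

    def __init__(self, vectors: Dict[str, List[float]]):
        self._vectors = vectors
        self.lookups = 0  # counts pretend round trips to a backend

    def get_vector(self, entity: str) -> List[float]:
        self.lookups += 1
        return self._vectors[entity]

    def get_vectors_looped(self, entities: Iterable[str]) -> List[List[float]]:
        # One lookup per entity, like calling get_vector() in a loop.
        return [self.get_vector(entity) for entity in entities]

    def get_vectors_batched(self, entities: Iterable[str]) -> List[List[float]]:
        # A single pretend round trip that resolves all IDs at once.
        self.lookups += 1
        return [self._vectors[entity] for entity in entities]


store = ToyVectorStore({"Q42": [0.1, 0.2], "Q3107329": [0.3, 0.4]})

looped = store.get_vectors_looped(("Q42", "Q3107329"))
lookups_looped = store.lookups

store.lookups = 0
batched = store.get_vectors_batched(("Q42", "Q3107329"))
lookups_batched = store.lookups

print(lookups_looped, lookups_batched)
```

Both variants return the same vectors in the same order. The batched variant
only pays off when the underlying storage can actually answer several IDs in one
request, for example a database or a remote service.
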
## KnowledgeBase.to_disk {id="to_disk",tag="method"}

Save the current state of the knowledge base to a directory.

> #### Example
>
> ```python
> kb.to_disk(path)
> ```

| Name | Description |
| --------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| `exclude` | List of components to exclude. ~~Iterable[str]~~ |

## KnowledgeBase.from_disk {id="from_disk",tag="method"}

Restore the state of the knowledge base from a given directory. Note that the
[`Vocab`](/api/vocab) should also be the same as the one used to create the KB.

> #### Example
>
> ```python
> from spacy.vocab import Vocab
> vocab = Vocab().from_disk("/path/to/vocab")
> kb = FullyImplementedKB(vocab=vocab, entity_vector_length=64)
> kb.from_disk("/path/to/kb")
> ```

| Name | Description |
| ----------- | ----------------------------------------------------------------------------------------------- |
| `loc` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| `exclude` | List of components to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The modified `KnowledgeBase` object. ~~KnowledgeBase~~ |

## InMemoryCandidate {id="candidate",tag="class"}

An `InMemoryCandidate` object refers to a textual mention (alias) that may or
may not be resolved to a specific entity from a `KnowledgeBase`. It is used as
input for the entity linking algorithm, which disambiguates the various
candidates and selects the correct one. Each candidate `(alias, entity)` pair is
assigned a prior probability.

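As a rough illustration of how prior probabilities on `(alias, entity)` pairs
can be used, the sketch below ranks candidates for one alias by their prior.
`ToyCandidate` and `rank_by_prior` are hypothetical stand-ins, the entity ID
`Q_OTHER` is a placeholder, and the probabilities are made up. An actual entity
linker would also take the context of the mention into account.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyCandidate:
    """Hypothetical stand-in for one (alias, entity) pair and its prior."""

    alias: str
    entity: str
    prior_prob: float


def rank_by_prior(candidates: List[ToyCandidate]) -> List[ToyCandidate]:
    # Highest prior probability first; the input list is left unchanged.
    return sorted(candidates, key=lambda c: c.prior_prob, reverse=True)


candidates = [
    ToyCandidate(alias="Adams", entity="Q_OTHER", prior_prob=0.2),
    ToyCandidate(alias="Adams", entity="Q42", prior_prob=0.7),
]

ranked = rank_by_prior(candidates)
best = ranked[0]
print(best.entity, best.prior_prob)
```
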
### InMemoryCandidate.\_\_init\_\_ {id="candidate-init",tag="method"}

Construct an `InMemoryCandidate` object. Usually this constructor is not called
directly. Instead, these objects are returned by the `get_candidates` method of
the [`entity_linker`](/api/entitylinker) pipe.

> #### Example
>
> ```python
> from spacy.kb import InMemoryCandidate
> candidate = InMemoryCandidate(kb, entity_hash, entity_freq, entity_vector, alias_hash, prior_prob)
> ```

| Name          | Description                                                               |
| ------------- | ------------------------------------------------------------------------- |
| `kb`          | The knowledge base that defined this candidate. ~~KnowledgeBase~~         |
| `entity_hash` | The hash of the entity's KB ID. ~~int~~                                   |
| `entity_freq` | The entity frequency as recorded in the KB. ~~float~~                     |
| `alias_hash`  | The hash of the entity alias. ~~int~~                                     |
| `prior_prob`  | The prior probability of the `alias` referring to the `entity`. ~~float~~ |

## InMemoryCandidate attributes {id="candidate-attributes"}

| Name            | Description                                                               |
| --------------- | ------------------------------------------------------------------------- |
| `entity`        | The entity's unique KB identifier. ~~int~~                                |
| `entity_`       | The entity's unique KB identifier. ~~str~~                                |
| `alias`         | The alias or textual mention. ~~int~~                                     |
| `alias_`        | The alias or textual mention. ~~str~~                                     |
| `prior_prob`    | The prior probability of the `alias` referring to the `entity`. ~~float~~ |
| `entity_freq`   | The frequency of the entity in a typical corpus. ~~float~~                |
| `entity_vector` | The pretrained vector of the entity. ~~numpy.ndarray~~                    |