mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-10-31 07:57:35 +03:00 
			
		
		
		
	* Add implementation of batching + backwards compatibility fixes. Tests indicate issue with batch disambiguation for custom singular entity lookups. * Fix tests. Add distinction w.r.t. batch size. * Remove redundant and add new comments. * Adjust comments. Fix variable naming in EL prediction. * Fix mypy errors. * Remove KB entity type config option. Change return types of candidate retrieval functions to Iterable from Iterator. Fix various other issues. * Update spacy/pipeline/entity_linker.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update spacy/pipeline/entity_linker.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update spacy/kb_base.pyx Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update spacy/kb_base.pyx Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update spacy/pipeline/entity_linker.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Add error messages to NotImplementedErrors. Remove redundant comment. * Fix imports. * Remove redundant comments. * Rename KnowledgeBase to InMemoryLookupKB and BaseKnowledgeBase to KnowledgeBase. * Fix tests. * Update spacy/errors.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Move KB into subdirectory. * Adjust imports after KB move to dedicated subdirectory. * Fix config imports. * Move Candidate + retrieval functions to separate module. Fix other, small issues. * Fix docstrings and error message w.r.t. class names. Fix typing for candidate retrieval functions. * Update spacy/kb/kb_in_memory.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/ml/models/entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix typing. * Change typing of mentions to be Span instead of Union[Span, str]. * Update docs. * Update EntityLinker and _architecture docs. 
* Update website/docs/api/entitylinker.md Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Adjust message for E1046. * Re-add section for Candidate in kb.md, add reference to dedicated page. * Update docs and docstrings. * Re-add section + reference for KnowledgeBase.get_alias_candidates() in docs. * Update spacy/kb/candidate.pyx * Update spacy/kb/kb_in_memory.pyx * Update spacy/pipeline/legacy/entity_linker.py * Remove canididate.md. Remove mistakenly added config snippet in entity_linker.py. Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
		
			
				
	
	
		
			232 lines
		
	
	
		
			10 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
		
	
	
	
	
	
| ---
 | |
| title: KnowledgeBase
 | |
| teaser:
 | |
|   A storage class for entities and aliases of a specific knowledge base
 | |
|   (ontology)
 | |
| tag: class
 | |
| source: spacy/kb/kb.pyx
 | |
| new: 2.2
 | |
| ---
 | |
| 
 | |
| The `KnowledgeBase` object is an abstract class providing a method to generate
 | |
| [`Candidate`](/api/kb#candidate) objects, which are plausible external
 | |
| identifiers given a certain textual mention. Each such `Candidate` holds
 | |
| information from the relevant KB entity, such as its frequency in text and
 | |
| possible aliases. Each entity in the knowledge base also has a pretrained entity
 | |
| vector of a fixed size.
 | |
| 
 | |
| Beyond that, `KnowledgeBase` classes have to implement a number of utility
 | |
| functions called by the [`EntityLinker`](/api/entitylinker) component.
 | |
| 
 | |
| <Infobox variant="warning">
 | |
| 
 | |
| Before spaCy 3.5, this class was not abstract. The previous `KnowledgeBase`
 | |
| implementation is available as [`InMemoryLookupKB`](/api/kb_in_memory) from
 | |
| 3.5 onwards.
 | |
| 
 | |
| </Infobox>
 | |
| 
 | |
| ## KnowledgeBase.\_\_init\_\_ {#init tag="method"}
 | |
| 
 | |
| `KnowledgeBase` is an abstract class and cannot be instantiated. Its child
 | |
| classes should call `__init__()` to set up some necessary attributes.
 | |
| 
 | |
| > #### Example
 | |
| >
 | |
| > ```python
 | |
| > from spacy.kb import KnowledgeBase
 | |
| > from spacy.lang.en import English
 | |
| > from spacy.vocab import Vocab
 | |
| >
 | |
| > class FullyImplementedKB(KnowledgeBase):
 | |
| >     def __init__(self, vocab: Vocab, entity_vector_length: int):
 | |
| >         # Let the abstract base class set up the shared attributes.
 | |
| >         super().__init__(vocab, entity_vector_length)
 | |
| >         ...
 | |
| >
 | |
| > nlp = English()
 | |
| > vocab = nlp.vocab
 | |
| > kb = FullyImplementedKB(vocab=vocab, entity_vector_length=64)
 | |
| > ```
 | |
| 
 | |
| | Name                   | Description                                      |
 | |
| | ---------------------- | ------------------------------------------------ |
 | |
| | `vocab`                | The shared vocabulary. ~~Vocab~~                 |
 | |
| | `entity_vector_length` | Length of the fixed-size entity vectors. ~~int~~ |
 | |
| 
 | |
| ## KnowledgeBase.entity_vector_length {#entity_vector_length tag="property"}
 | |
| 
 | |
| The length of the fixed-size entity vectors in the knowledge base.
 | |
| 
 | |
| | Name        | Description                                      |
 | |
| | ----------- | ------------------------------------------------ |
 | |
| | **RETURNS** | Length of the fixed-size entity vectors. ~~int~~ |
 | |
| 
 | |
| ## KnowledgeBase.get_candidates {#get_candidates tag="method"}
 | |
| 
 | |
| Given a certain textual mention as input, retrieve a list of candidate entities
 | |
| of type [`Candidate`](/api/kb#candidate).
 | |
| 
 | |
| > #### Example
 | |
| >
 | |
| > ```python
 | |
| > from spacy.lang.en import English
 | |
| > nlp = English()
 | |
| > doc = nlp("Douglas Adams wrote 'The Hitchhiker's Guide to the Galaxy'.")
 | |
| > candidates = kb.get_candidates(doc[0:2])
 | |
| > ```
 | |
| 
 | |
| | Name        | Description                                                          |
 | |
| | ----------- | -------------------------------------------------------------------- |
 | |
| | `mention`   | The textual mention or alias. ~~Span~~                               |
 | |
| | **RETURNS** | An iterable of relevant `Candidate` objects. ~~Iterable[Candidate]~~ |
 | |
| 
 | |
| ## KnowledgeBase.get_candidates_batch {#get_candidates_batch tag="method"}
 | |
| 
 | |
| Same as [`get_candidates()`](/api/kb#get_candidates), but for an arbitrary
 | |
| number of mentions. The [`EntityLinker`](/api/entitylinker) component will call
 | |
| `get_candidates_batch()` instead of `get_candidates()` if the config parameter
 | |
| `candidates_batch_size` is greater than or equal to 1.
 | |
| 
 | |
| The default implementation of `get_candidates_batch()` executes
 | |
| `get_candidates()` in a loop. We recommend implementing a more efficient way to
 | |
| retrieve candidates for multiple mentions at once, if performance is of concern
 | |
| to you.
 | |
| 
 | |
| > #### Example
 | |
| >
 | |
| > ```python
 | |
| > from spacy.lang.en import English
 | |
| > nlp = English()
 | |
| > doc = nlp("Douglas Adams wrote 'The Hitchhiker's Guide to the Galaxy'.")
 | |
| > candidates = kb.get_candidates_batch((doc[0:2], doc[3:]))
 | |
| > ```
 | |
| 
 | |
| | Name        | Description                                                                                  |
 | |
| | ----------- | -------------------------------------------------------------------------------------------- |
 | |
| | `mentions`  | The textual mentions or aliases. ~~Iterable[Span]~~                                          |
 | |
| | **RETURNS** | An iterable of iterables of relevant `Candidate` objects. ~~Iterable[Iterable[Candidate]]~~  |
 | |
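The relationship between the two retrieval methods can be sketched in plain Python. This snippet does not use spaCy: `ToyKB`, `BatchedToyKB` and the alias table are hypothetical stand-ins, and mentions are plain strings rather than `Span` objects. It only illustrates the pattern described above, assuming a simple dictionary lookup: a default batch method that loops over the single-mention method, and a subclass that overrides it to resolve each distinct mention once.

```python
from typing import Dict, Iterable, List


class ToyKB:
    """Hypothetical stand-in for a KnowledgeBase subclass (not spaCy's API)."""

    def __init__(self, aliases: Dict[str, List[str]]):
        # Maps an alias to the entity IDs it may refer to.
        self.aliases = aliases

    def get_candidates(self, mention: str) -> Iterable[str]:
        return self.aliases.get(mention, [])

    def get_candidates_batch(self, mentions: Iterable[str]) -> Iterable[Iterable[str]]:
        # Default behaviour as described above: one get_candidates() call per mention.
        return [self.get_candidates(mention) for mention in mentions]


class BatchedToyKB(ToyKB):
    def get_candidates_batch(self, mentions: Iterable[str]) -> Iterable[Iterable[str]]:
        # Assumed optimisation: resolve each distinct mention once, then fan out.
        mentions = list(mentions)
        resolved = {m: self.aliases.get(m, []) for m in set(mentions)}
        return [resolved[m] for m in mentions]


aliases = {"Douglas Adams": ["Q42"], "Hitchhiker's Guide": ["Q3107329"]}
mentions = ["Douglas Adams", "Hitchhiker's Guide", "Douglas Adams"]
default_result = ToyKB(aliases).get_candidates_batch(mentions)
batched_result = BatchedToyKB(aliases).get_candidates_batch(mentions)
```

Both variants return one inner iterable per mention, in the order the mentions were passed in. Only the amount of lookup work differs.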
| 
 | |
| ## KnowledgeBase.get_alias_candidates {#get_alias_candidates tag="method"}
 | |
| 
 | |
| <Infobox variant="warning">
 | |
| This method is _not_ available from spaCy 3.5 onwards.
 | |
| </Infobox>
 | |
| 
 | |
| From spaCy 3.5 onwards, `KnowledgeBase` is an abstract class (with
 | |
| [`InMemoryLookupKB`](/api/kb_in_memory) being a drop-in replacement) to allow
 | |
| more flexibility in customizing knowledge bases. Some of its methods were moved
 | |
| to [`InMemoryLookupKB`](/api/kb_in_memory) during this refactoring, one of those
 | |
| being `get_alias_candidates()`. This method is now available as
 | |
| [`InMemoryLookupKB.get_alias_candidates()`](/api/kb_in_memory#get_alias_candidates).
 | |
| Note: [`InMemoryLookupKB.get_candidates()`](/api/kb_in_memory#get_candidates)
 | |
| defaults to
 | |
| [`InMemoryLookupKB.get_alias_candidates()`](/api/kb_in_memory#get_alias_candidates).
 | |
| 
 | |
| ## KnowledgeBase.get_vector {#get_vector tag="method"}
 | |
| 
 | |
| Given a certain entity ID, retrieve its pretrained entity vector.
 | |
| 
 | |
| > #### Example
 | |
| >
 | |
| > ```python
 | |
| > vector = kb.get_vector("Q42")
 | |
| > ```
 | |
| 
 | |
| | Name        | Description                            |
 | |
| | ----------- | -------------------------------------- |
 | |
| | `entity`    | The entity ID. ~~str~~                 |
 | |
| | **RETURNS** | The entity vector. ~~Iterable[float]~~ |
 | |
| 
 | |
| ## KnowledgeBase.get_vectors {#get_vectors tag="method"}
 | |
| 
 | |
| Same as [`get_vector()`](/api/kb#get_vector), but for an arbitrary number of
 | |
| entity IDs.
 | |
| 
 | |
| The default implementation of `get_vectors()` executes `get_vector()` in a loop.
 | |
| We recommend implementing a more efficient way to retrieve vectors for multiple
 | |
| entities at once, if performance is of concern to you.
 | |
| 
 | |
| > #### Example
 | |
| >
 | |
| > ```python
 | |
| > vectors = kb.get_vectors(("Q42", "Q3107329"))
 | |
| > ```
 | |
| 
 | |
| | Name        | Description                                               |
 | |
| | ----------- | --------------------------------------------------------- |
 | |
| | `entities`  | The entity IDs. ~~Iterable[str]~~                         |
 | |
| | **RETURNS** | The entity vectors. ~~Iterable[Iterable[float]]~~         |
 | |
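As a rough illustration of how the fixed vector length and the two vector getters fit together, here is a spaCy-free sketch. `ToyVectorStore` and its `add_entity()` helper are hypothetical names chosen for this example, and the vectors are made-up values. The sketch assumes vectors are stored as plain lists of floats.

```python
from typing import Dict, Iterable, List


class ToyVectorStore:
    """Hypothetical stand-in for the vector part of a KnowledgeBase subclass."""

    def __init__(self, entity_vector_length: int):
        self.entity_vector_length = entity_vector_length
        self._vectors: Dict[str, List[float]] = {}

    def add_entity(self, entity: str, vector: List[float]) -> None:
        # Enforce the fixed vector size the store was created with.
        if len(vector) != self.entity_vector_length:
            raise ValueError(
                f"Expected a vector of length {self.entity_vector_length}, "
                f"got {len(vector)}"
            )
        self._vectors[entity] = list(vector)

    def get_vector(self, entity: str) -> Iterable[float]:
        return self._vectors[entity]

    def get_vectors(self, entities: Iterable[str]) -> Iterable[Iterable[float]]:
        # Loop over get_vector(), mirroring the default behaviour described above.
        return [self.get_vector(entity) for entity in entities]


store = ToyVectorStore(entity_vector_length=3)
store.add_entity("Q42", [1.0, 0.0, 0.5])
store.add_entity("Q3107329", [0.2, 0.9, 0.1])
vectors = store.get_vectors(("Q42", "Q3107329"))
```

A real subclass backed by a database or a matrix would likely replace the loop in `get_vectors()` with a single bulk read.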
| 
 | |
| ## KnowledgeBase.to_disk {#to_disk tag="method"}
 | |
| 
 | |
| Save the current state of the knowledge base to a directory.
 | |
| 
 | |
| > #### Example
 | |
| >
 | |
| > ```python
 | |
| > kb.to_disk(path)
 | |
| > ```
 | |
| 
 | |
| | Name      | Description                                                                                                                                |
 | |
| | --------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
 | |
| | `path`    | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
 | |
| | `exclude` | List of components to exclude. ~~Iterable[str]~~                                                                                           |
 | |
| 
 | |
| ## KnowledgeBase.from_disk {#from_disk tag="method"}
 | |
| 
 | |
| Restore the state of the knowledge base from a given directory. Note that the
 | |
| [`Vocab`](/api/vocab) should also be the same as the one used to create the KB.
 | |
| 
 | |
| > #### Example
 | |
| >
 | |
| > ```python
 | |
| > from spacy.vocab import Vocab
 | |
| > vocab = Vocab().from_disk("/path/to/vocab")
 | |
| > kb = FullyImplementedKB(vocab=vocab, entity_vector_length=64)
 | |
| > kb.from_disk("/path/to/kb")
 | |
| > ```
 | |
| 
 | |
| | Name        | Description                                                                                     |
 | |
| | ----------- | ----------------------------------------------------------------------------------------------- |
 | |
| | `loc`       | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
 | |
| | `exclude`   | List of components to exclude. ~~Iterable[str]~~                                                |
 | |
| | **RETURNS** | The modified `KnowledgeBase` object. ~~KnowledgeBase~~                                          |
 | |
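Since `KnowledgeBase` is abstract, subclasses decide how their state is written and read. The following spaCy-free sketch shows one possible shape for such a pair of methods. `ToyDiskKB` and the file name `entities.json` are assumptions made for this example and say nothing about how spaCy's own implementations serialize their data.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Union


class ToyDiskKB:
    """Hypothetical stand-in showing one way a subclass could persist its state."""

    def __init__(self, entity_vector_length: int):
        self.entity_vector_length = entity_vector_length
        self.vectors: Dict[str, List[float]] = {}

    def to_disk(self, path: Union[str, Path]) -> None:
        path = Path(path)
        path.mkdir(parents=True, exist_ok=True)  # create the directory if missing
        # "entities.json" is an arbitrary file name chosen for this sketch.
        (path / "entities.json").write_text(json.dumps(self.vectors), encoding="utf8")

    def from_disk(self, path: Union[str, Path]) -> "ToyDiskKB":
        raw = (Path(path) / "entities.json").read_text(encoding="utf8")
        self.vectors = json.loads(raw)
        return self  # return the modified object, as in the table above


with tempfile.TemporaryDirectory() as tmp:
    kb_dir = Path(tmp) / "kb"  # does not exist yet
    original = ToyDiskKB(entity_vector_length=2)
    original.vectors["Q42"] = [0.1, 0.2]
    original.to_disk(kb_dir)
    restored = ToyDiskKB(entity_vector_length=2).from_disk(kb_dir)
```

As in the `from_disk` example above, the restored object is constructed with the same settings as the original before its state is loaded.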
| 
 | |
| ## Candidate {#candidate tag="class"}
 | |
| 
 | |
| A `Candidate` object refers to a textual mention (alias) that may or may not be
 | |
| resolved to a specific entity from a `KnowledgeBase`. This will be used as input
 | |
| for the entity linking algorithm which will disambiguate the various candidates
 | |
| to the correct one. Each candidate `(alias, entity)` pair is assigned a
 | |
| certain prior probability.
 | |
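To make the role of the prior probability concrete, here is a small spaCy-free sketch. `ToyCandidate`, `rank_by_prior()`, the entity ID `Q_OTHER` and the probabilities are all hypothetical. The sketch only shows that, absent any context, the candidates for one alias can be ordered by their prior.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyCandidate:
    """Hypothetical stand-in for Candidate, holding only the readable fields."""

    entity_: str
    alias_: str
    prior_prob: float


def rank_by_prior(candidates: List[ToyCandidate]) -> List[ToyCandidate]:
    # Highest prior probability first; a real linker would also use context.
    return sorted(candidates, key=lambda c: c.prior_prob, reverse=True)


candidates = [
    ToyCandidate(entity_="Q_OTHER", alias_="Adams", prior_prob=0.3),
    ToyCandidate(entity_="Q42", alias_="Adams", prior_prob=0.6),
]
best = rank_by_prior(candidates)[0]
```

Here `best` is the `Q42` candidate, because it has the larger prior for the alias `Adams`.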
| 
 | |
| ### Candidate.\_\_init\_\_ {#candidate-init tag="method"}
 | |
| 
 | |
| Construct a `Candidate` object. Usually this constructor is not called directly,
 | |
| but instead these objects are returned by the `get_candidates` method of the
 | |
| [`entity_linker`](/api/entitylinker) pipe.
 | |
| 
 | |
| > #### Example
 | |
| >
 | |
| > ```python
 | |
| > from spacy.kb import Candidate
 | |
| > candidate = Candidate(kb, entity_hash, entity_freq, entity_vector, alias_hash, prior_prob)
 | |
| > ```
 | |
| 
 | |
| | Name            | Description                                                               |
 | |
| | --------------- | ------------------------------------------------------------------------- |
 | |
| | `kb`            | The knowledge base that defined this candidate. ~~KnowledgeBase~~         |
 | |
| | `entity_hash`   | The hash of the entity's KB ID. ~~int~~                                   |
 | |
| | `entity_freq`   | The entity frequency as recorded in the KB. ~~float~~                     |
 | |
| | `entity_vector` | The pretrained vector of the entity. ~~numpy.ndarray~~                    |
 | |
| | `alias_hash`    | The hash of the textual mention or alias. ~~int~~                         |
 | |
| | `prior_prob`    | The prior probability of the `alias` referring to the `entity`. ~~float~~ |
 | |
| 
 | |
| ### Candidate attributes {#candidate-attributes}
 | |
| 
 | |
| | Name            | Description                                                               |
 | |
| | --------------- | ------------------------------------------------------------------------- |
 | |
| | `entity`        | The entity's unique KB identifier. ~~int~~                                |
 | |
| | `entity_`       | The entity's unique KB identifier. ~~str~~                                |
 | |
| | `alias`         | The alias or textual mention. ~~int~~                                     |
 | |
| | `alias_`        | The alias or textual mention. ~~str~~                                     |
 | |
| | `prior_prob`    | The prior probability of the `alias` referring to the `entity`. ~~float~~ |
 | |
| | `entity_freq`   | The frequency of the entity in a typical corpus. ~~float~~                |
 | |
| | `entity_vector` | The pretrained vector of the entity. ~~numpy.ndarray~~                    |
 |