spaCy/_architecture.md at e0168ccce940251351711ac0196d8560cb77547e

mirror of https://github.com/explosion/spaCy.git synced 2025-07-15 18:52:29 +03:00

Refactor KB for easier customization (#11268)

* Add implementation of batching + backwards compatibility fixes. Tests indicate issue with batch disambiguation for custom singular entity lookups.

* Fix tests. Add distinction w.r.t. batch size.

* Remove redundant and add new comments.

* Adjust comments. Fix variable naming in EL prediction.

* Fix mypy errors.

* Remove KB entity type config option. Change return types of candidate retrieval functions to Iterable from Iterator. Fix various other issues.

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/kb_base.pyx

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/kb_base.pyx

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Add error messages to NotImplementedErrors. Remove redundant comment.

* Fix imports.

* Remove redundant comments.

* Rename KnowledgeBase to InMemoryLookupKB and BaseKnowledgeBase to KnowledgeBase.

* Fix tests.

* Update spacy/errors.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move KB into subdirectory.

* Adjust imports after KB move to dedicated subdirectory.

* Fix config imports.

* Move Candidate + retrieval functions to separate module. Fix other, small issues.

* Fix docstrings and error message w.r.t. class names. Fix typing for candidate retrieval functions.

* Update spacy/kb/kb_in_memory.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix typing.

* Change typing of mentions to be Span instead of Union[Span, str].

* Update docs.

* Update EntityLinker and _architecture docs.

* Update website/docs/api/entitylinker.md

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Adjust message for E1046.

* Re-add section for Candidate in kb.md, add reference to dedicated page.

* Update docs and docstrings.

* Re-add section + reference for KnowledgeBase.get_alias_candidates() in docs.

* Update spacy/kb/candidate.pyx

* Update spacy/kb/kb_in_memory.pyx

* Update spacy/pipeline/legacy/entity_linker.py

* Remove canididate.md. Remove mistakenly added config snippet in entity_linker.py.

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

2022-09-08 10:38:07 +02:00


The central data structures in spaCy are the `Language` class, the `Vocab` and the `Doc` object. The `Language` class is used to process a text and turn it into a `Doc` object. It's typically stored as a variable called `nlp`. The `Doc` object owns the sequence of tokens and all their annotations. By centralizing strings, word vectors and lexical attributes in the `Vocab`, we avoid storing multiple copies of this data. This saves memory and ensures there's a single source of truth.
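As a minimal sketch of the relationship described above, the snippet below creates a blank English pipeline, processes a sentence, and checks that the resulting `Doc` shares the `Vocab` of the `nlp` object. It assumes spaCy v3 is installed; the example sentence is made up for illustration, and no trained pipeline is needed.

```python
import spacy

# A blank pipeline contains only the tokenizer, so no model download is needed.
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a startup")

# The Doc does not copy the vocabulary; it references the one owned by nlp.
shares_vocab = doc.vocab is nlp.vocab

# Strings are stored once in the shared StringStore and referenced by hash.
apple_hash = nlp.vocab.strings["Apple"]
apple_text = nlp.vocab.strings[apple_hash]

# A token stores the hash of its text, not the string itself.
first_token_uses_hash = doc[0].orth == apple_hash
```

Because every `Doc` produced by the same `nlp` object points at the same `Vocab`, the string "Apple" is stored once no matter how many documents contain it.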

Text annotations are also designed to allow a single source of truth: the `Doc` object owns the data, and `Span` and `Token` are views that point into it. The `Doc` object is constructed by the `Tokenizer`, and then modified in place by the components of the pipeline. The `Language` object coordinates these components. It takes raw text and sends it through the pipeline, returning an annotated document. It also orchestrates training and serialization.
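The "views" idea can be illustrated with a short sketch, assuming spaCy v3 and a blank English pipeline. The sentence and the tag value are arbitrary choices for the example. Writing an annotation through a `Token` changes the data held by the `Doc`, so a `Span` covering that token sees the same value.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Give it back")

span = doc[0:2]   # Span view over the first two tokens
token = doc[1]    # Token view over the second token

# Both views keep a reference to the Doc that owns the data.
span_points_to_doc = span.doc is doc
token_points_to_doc = token.doc is doc

# Setting an attribute through the Token writes into the Doc's data ...
token.tag_ = "PRP"

# ... so the same value is visible through the Span and through the Doc.
tag_via_span = span[1].tag_
tag_via_doc = doc[1].tag_
```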

Container objects

Name	Description
`Doc`	A container for accessing linguistic annotations.
`DocBin`	A collection of `Doc` objects for efficient binary serialization. Also used for training data.
`Example`	A collection of training annotations, containing two `Doc` objects: the reference data and the predictions.
`Language`	Processing class that turns text into `Doc` objects. Different languages implement their own subclasses of it. The variable is typically called `nlp`.
`Lexeme`	An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse, etc.
`Span`	A slice from a `Doc` object.
`SpanGroup`	A named collection of spans belonging to a `Doc`.
`Token`	An individual token — i.e. a word, punctuation symbol, whitespace, etc.
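To make the `DocBin` row more concrete, here is a small round-trip sketch, assuming spaCy v3. It serializes two `Doc` objects to bytes and restores them against the same `Vocab`. The texts are invented for the example.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
texts = ["This is a document.", "Here is another one."]

# Collect Doc objects in a DocBin and serialize them to a single bytes object.
doc_bin = DocBin()
for text in texts:
    doc_bin.add(nlp(text))
data = doc_bin.to_bytes()

# Restoring requires a Vocab, because the Doc objects only store hashes.
restored_docs = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
restored_texts = [doc.text for doc in restored_docs]
```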

Processing pipeline

The processing pipeline consists of one or more pipeline components that are called on the `Doc` in order. The tokenizer runs before the components. Pipeline components can be added using `Language.add_pipe`. They can contain a statistical model and trained weights, or only make rule-based modifications to the `Doc`. spaCy provides a range of built-in components for different language processing tasks and also allows adding custom components.
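The following sketch shows both kinds of component being added with `Language.add_pipe`, assuming spaCy v3. The built-in `sentencizer` is rule-based; `token_counter` is a hypothetical custom component written only for this example, and the `n_tokens` key in `user_data` is likewise an illustrative choice.

```python
import spacy
from spacy.language import Language


# Hypothetical custom component: a function that takes a Doc and returns it.
@Language.component("token_counter")
def token_counter(doc):
    # Store the token count in the Doc's free-form user_data dictionary.
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")               # built-in, rule-based
nlp.add_pipe("token_counter", last=True)  # custom, registered above

# The tokenizer runs first, then the components in the order listed here.
pipe_names = nlp.pipe_names

doc = nlp("This is a sentence. This is another one.")
n_sentences = len(list(doc.sents))
n_tokens = doc.user_data["n_tokens"]
```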

Name	Description
`AttributeRuler`	Set token attributes using matcher rules.
`DependencyParser`	Predict syntactic dependencies.
`EditTreeLemmatizer`	Predict base forms of words.
`EntityLinker`	Disambiguate named entities to nodes in a knowledge base.
`EntityRecognizer`	Predict named entities, e.g. persons or products.
`EntityRuler`	Add entity spans to the `Doc` using token-based rules or exact phrase matches.
`Lemmatizer`	Determine the base forms of words using rules and lookups.
`Morphologizer`	Predict morphological features and coarse-grained part-of-speech tags.
`SentenceRecognizer`	Predict sentence boundaries.
`Sentencizer`	Implement rule-based sentence boundary detection that doesn't require the dependency parse.
`Tagger`	Predict part-of-speech tags.
`TextCategorizer`	Predict categories or labels over the whole document.
`Tok2Vec`	Apply a "token-to-vector" model and set its outputs.
`Tokenizer`	Segment raw text and create `Doc` objects from the words.
`TrainablePipe`	Class that all trainable pipeline components inherit from.
`Transformer`	Use a transformer model and set its outputs.
Other functions	Automatically apply something to the `Doc`, e.g. to merge spans of tokens.
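As one example from the table above, the `EntityRuler` can be used without any trained weights. This is a minimal sketch assuming spaCy v3; the label, the pattern and the sentence are made up for illustration.

```python
import spacy

nlp = spacy.blank("en")

# add_pipe returns the component instance, so patterns can be added directly.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Explosion"}])

doc = nlp("Explosion makes spaCy")

# The ruler writes its matches to doc.ents, like a statistical recognizer would.
entities = [(ent.text, ent.label_) for ent in doc.ents]
```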

Matchers

Matchers help you find and extract information from `Doc` objects based on match patterns describing the sequences you're looking for. A matcher operates on a `Doc` and gives you access to the matched tokens in context.

Name	Description
`DependencyMatcher`	Match sequences of tokens based on dependency trees using Semgrex operators.
`Matcher`	Match sequences of tokens, based on pattern rules, similar to regular expressions.
`PhraseMatcher`	Match sequences of tokens based on phrases.
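A short sketch of the token-based `Matcher`, assuming spaCy v3. The rule name `HELLO_WORLD` and the pattern are illustrative: the pattern looks for "hello", an optional punctuation token, then "world", regardless of case.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Each dictionary describes one token; "OP": "?" makes that token optional.
pattern = [
    {"LOWER": "hello"},
    {"IS_PUNCT": True, "OP": "?"},
    {"LOWER": "world"},
]
matcher.add("HELLO_WORLD", [pattern])

doc = nlp("Hello, world! Hello world!")

# Each match is a (match_id, start, end) triple of token indices into the Doc.
matches = matcher(doc)
matched_texts = [doc[start:end].text for _, start, end in matches]
```

Because the matcher returns token offsets rather than strings, the matched tokens can be inspected in context as `Span` objects.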

Other classes

Name	Description
`Corpus`	Class for managing annotated corpora for training and evaluation data.
`KnowledgeBase`	Abstract base class for storage and retrieval of data for entity linking.
`InMemoryLookupKB`	Implementation of `KnowledgeBase` storing all data in memory.
`Candidate`	Object associating a textual mention with a specific entity contained in a `KnowledgeBase`.
`Lookups`	Container for convenient access to large lookup tables and dictionaries.
`MorphAnalysis`	A morphological analysis.
`Morphology`	Store morphological analyses and map them to and from hash values.
`Scorer`	Compute evaluation scores.
`StringStore`	Map strings to and from hash values.
`Vectors`	Container class for vector data keyed by string.
`Vocab`	The shared vocabulary that stores strings and gives you access to `Lexeme` objects.
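The relationship between `KnowledgeBase`, `InMemoryLookupKB` and `Candidate` can be sketched as follows. This assumes a spaCy v3 release that already includes the renamed classes from the commit above (3.5 or later); the entity ID, alias, frequency, vector and prior probability are invented values for illustration.

```python
import spacy
from spacy.kb import InMemoryLookupKB, KnowledgeBase

nlp = spacy.blank("en")

# InMemoryLookupKB is the concrete KnowledgeBase that keeps all data in memory.
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=3)
is_knowledge_base = isinstance(kb, KnowledgeBase)

# Register one entity with a frequency and a vector of the configured length.
kb.add_entity(entity="Q42", freq=10, entity_vector=[1.0, 0.0, 0.0])

# Link a textual alias to the entity with a prior probability.
kb.add_alias(alias="Douglas", entities=["Q42"], probabilities=[0.9])

# Each Candidate associates the alias with one entity in the knowledge base.
candidates = kb.get_alias_candidates("Douglas")
candidate_entities = [candidate.entity_ for candidate in candidates]
```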
