spaCy/_architecture.md at d093d6343b3cd8ab4814037e5e75bbff3177690b

mirror of https://github.com/explosion/spaCy.git synced 2024-11-11 20:28:20 +03:00

Sofie Van Landeghem d093d6343b

2020-10-08 21:33:49 +02:00


The central data structures in spaCy are the `Language` class, the `Vocab` and the `Doc` object. The `Language` class is used to process a text and turn it into a `Doc` object. It's typically stored as a variable called `nlp`. The `Doc` object owns the sequence of tokens and all their annotations. By centralizing strings, word vectors and lexical attributes in the `Vocab`, we avoid storing multiple copies of this data. This saves memory and ensures there's a single source of truth.

Text annotations are also designed to allow a single source of truth: the `Doc` object owns the data, and `Span` and `Token` are views that point into it. The `Doc` object is constructed by the `Tokenizer`, and then modified in place by the components of the pipeline. The `Language` object coordinates these components. It takes raw text and sends it through the pipeline, returning an annotated document. It also orchestrates training and serialization.
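
The relationships described above can be sketched in a few lines. This is a minimal illustration, not part of the original file: it assumes spaCy v3 is installed and uses `spacy.blank("en")`, a tokenizer-only pipeline, so that no trained model has to be downloaded. The example sentence is made up.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only, no trained components).
nlp = spacy.blank("en")
doc = nlp("Apple is looking at startups")

# Token and Span are views that point back into the Doc that owns the data.
token = doc[0]
span = doc[0:3]
assert token.doc is doc
assert span.doc is doc

# Strings are stored once in the shared Vocab and referenced by hash value.
apple_hash = nlp.vocab.strings["Apple"]
assert nlp.vocab.strings[apple_hash] == "Apple"
assert token.orth == apple_hash

# The Doc shares the Vocab of the Language object that created it.
assert doc.vocab is nlp.vocab

span_words = [t.text for t in span]
print(span_words)
```

Because `token` and `span` only reference the `Doc`, annotations that a pipeline component writes to the `Doc` are visible through both without copying.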

Container objects

| Name | Description |
| --- | --- |
| `Language` | Processing class that turns text into `Doc` objects. Different languages implement their own subclasses of it. The variable is typically called `nlp`. |
| `Doc` | A container for accessing linguistic annotations. |
| `Span` | A slice from a `Doc` object. |
| `Token` | An individual token, i.e. a word, punctuation symbol, whitespace, etc. |
| `Lexeme` | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse, etc. |
| `Example` | A collection of training annotations, containing two `Doc` objects: the reference data and the predictions. |
| `DocBin` | A collection of `Doc` objects for efficient binary serialization. Also used for training data. |
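
As a rough sketch of how the less familiar containers fit together, the snippet below builds a `Lexeme`, an `Example` and a `DocBin`. It assumes the spaCy v3 import paths (`spacy.training.Example`, `spacy.tokens.DocBin`); the sentence and the `GPE` label are invented for illustration.

```python
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

nlp = spacy.blank("en")

# A Lexeme is a context-free vocabulary entry: it has lexical attributes,
# but no part-of-speech tag or dependency label.
lexeme = nlp.vocab["coffee"]
assert lexeme.text == "coffee"
assert lexeme.is_alpha

# An Example holds two Doc objects: the prediction and the reference.
predicted = nlp.make_doc("I like London")
example = Example.from_dict(predicted, {"entities": [(7, 13, "GPE")]})
reference_ents = [(ent.text, ent.label_) for ent in example.reference.ents]

# A DocBin serializes a collection of Docs to a compact byte string.
doc_bin = DocBin()
doc_bin.add(example.reference)
data = doc_bin.to_bytes()
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
restored_ents = [(ent.text, ent.label_) for ent in restored[0].ents]
```

The round trip through `DocBin` needs the shared `Vocab` to rebuild the `Doc` objects, which is why `get_docs` takes `nlp.vocab` as an argument.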

Processing pipeline

The processing pipeline consists of one or more pipeline components that are called on the `Doc` in order. The tokenizer runs before the components. Pipeline components can be added using `Language.add_pipe`. They can contain a statistical model and trained weights, or only make rule-based modifications to the `Doc`. spaCy provides a range of built-in components for different language processing tasks and also allows adding custom components.

| Name | Description |
| --- | --- |
| `Tokenizer` | Segment raw text and create `Doc` objects from the words. |
| `Tok2Vec` | Apply a "token-to-vector" model and set its outputs. |
| `Transformer` | Use a transformer model and set its outputs. |
| `Lemmatizer` | Determine the base forms of words. |
| `Morphologizer` | Predict morphological features and coarse-grained part-of-speech tags. |
| `Tagger` | Predict part-of-speech tags. |
| `AttributeRuler` | Set token attributes using matcher rules. |
| `DependencyParser` | Predict syntactic dependencies. |
| `EntityRecognizer` | Predict named entities, e.g. persons or products. |
| `EntityRuler` | Add entity spans to the `Doc` using token-based rules or exact phrase matches. |
| `EntityLinker` | Disambiguate named entities to nodes in a knowledge base. |
| `TextCategorizer` | Predict categories or labels over the whole document. |
| `Sentencizer` | Implement rule-based sentence boundary detection that doesn't require the dependency parse. |
| `SentenceRecognizer` | Predict sentence boundaries. |
| Other functions | Function components that apply a transformation to the `Doc`, e.g. merging spans of tokens. |
| `Pipe` | Base class that pipeline components may inherit from. |
| `TrainablePipe` | Class that all trainable pipeline components inherit from. |
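
The following sketch shows one way to assemble a small rule-based pipeline with `Language.add_pipe`. It assumes the spaCy v3 API, where components are added by their registered string name. The `token_counter` component and the `n_tokens` key are hypothetical names introduced only for this example.

```python
import spacy
from spacy.language import Language


# Hypothetical custom component: a plain function that takes a Doc,
# modifies it in place and returns it.
@Language.component("token_counter")
def token_counter(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")

# Built-in rule-based component, added by its registered name.
nlp.add_pipe("sentencizer")
# Custom component, placed at the end of the pipeline.
nlp.add_pipe("token_counter", last=True)

# The tokenizer runs first, then each component in order.
doc = nlp("This is a sentence. This is another one.")
sentences = [sent.text for sent in doc.sents]
print(nlp.pipe_names)
print(sentences)
```

Neither component here has trained weights. A statistical component such as the `Tagger` is added the same way but has to be trained or loaded before it can make predictions.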

Matchers

Matchers help you find and extract information from `Doc` objects based on match patterns describing the sequences you're looking for. A matcher operates on a `Doc` and gives you access to the matched tokens in context.

| Name | Description |
| --- | --- |
| `Matcher` | Match sequences of tokens, based on pattern rules, similar to regular expressions. |
| `PhraseMatcher` | Match sequences of tokens based on phrases. |
| `DependencyMatcher` | Match sequences of tokens based on dependency trees using Semgrex operators. |
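
A short sketch of the two token-based matchers follows, assuming the spaCy v3 signatures in which `add` takes a key and a list of patterns. The pattern keys `HELLO_WORLD` and `PERSON` and the example text are made up. The `DependencyMatcher` is left out because it needs a dependency parse, which a blank pipeline does not produce.

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

nlp = spacy.blank("en")
doc = nlp("Hello, world! Barack Obama says hello world.")

# Matcher: one dict per token; "OP": "?" makes the punctuation optional.
matcher = Matcher(nlp.vocab)
pattern = [
    {"LOWER": "hello"},
    {"IS_PUNCT": True, "OP": "?"},
    {"LOWER": "world"},
]
matcher.add("HELLO_WORLD", [pattern])
# Each match is a (match_id, start, end) triple of token indices.
token_matches = [doc[start:end].text for _, start, end in matcher(doc)]

# PhraseMatcher: patterns are Doc objects holding the exact phrases.
phrase_matcher = PhraseMatcher(nlp.vocab)
phrase_matcher.add("PERSON", [nlp.make_doc("Barack Obama")])
phrase_matches = [doc[start:end].text for _, start, end in phrase_matcher(doc)]

print(token_matches)
print(phrase_matches)
```

Both matchers return token offsets rather than strings, so slicing the `Doc` gives a `Span` with the matched tokens in their original context.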

Other classes

| Name | Description |
| --- | --- |
| `Vocab` | The shared vocabulary that stores strings and gives you access to `Lexeme` objects. |
| `StringStore` | Map strings to and from hash values. |
| `Vectors` | Container class for vector data keyed by string. |
| `Lookups` | Container for convenient access to large lookup tables and dictionaries. |
| `Morphology` | Store morphological analyses and map them to and from hash values. |
| `MorphAnalysis` | A morphological analysis. |
| `KnowledgeBase` | Storage for entities and aliases of a knowledge base for entity linking. |
| `Scorer` | Compute evaluation scores. |
| `Corpus` | Class for managing annotated corpora for training and evaluation data. |
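
To make the string-to-hash mapping concrete, here is a minimal sketch using `StringStore` and `Lookups` directly. It assumes the spaCy v3 import paths `spacy.strings` and `spacy.lookups`. The `plurals` table and its contents are hypothetical.

```python
import spacy
from spacy.lookups import Lookups
from spacy.strings import StringStore

nlp = spacy.blank("en")

# StringStore maps strings to hash values and back again.
store = StringStore(["coffee"])
coffee_hash = store["coffee"]
assert store[coffee_hash] == "coffee"

# The hash depends only on the string, so the Vocab's own StringStore
# returns the same value once the string has been added to it.
vocab_hash = nlp.vocab.strings.add("coffee")
assert vocab_hash == coffee_hash

# Lookups groups named tables, each behaving like a dictionary.
lookups = Lookups()
lookups.add_table("plurals", {"mice": "mouse"})
table = lookups.get_table("plurals")
singular = table["mice"]
```

Since hashes are stable across stores, a hash only needs the matching `StringStore` to be turned back into text.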
