The central data structures in spaCy are the Language class, the Vocab and the
Doc object. The Language class is used to process a text and turn it into a Doc
object. It's typically stored as a variable called nlp. The Doc object owns the
sequence of tokens and all their annotations. By centralizing strings, word
vectors and lexical attributes in the Vocab, we avoid storing multiple copies of
this data. This saves memory and ensures there's a single source of truth.
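
To make the shared-vocabulary idea concrete, here is a minimal sketch, assuming a blank English pipeline created with spacy.blank (so no trained model is needed). The sample sentence and variable names are illustrative only.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")
doc = nlp("I love coffee")

# The Doc does not copy the vocabulary; it references the pipeline's Vocab
shares_vocab = doc.vocab is nlp.vocab

# Strings are stored once in the vocab's string store and referenced by hash
coffee_hash = nlp.vocab.strings["coffee"]
coffee_text = nlp.vocab.strings[coffee_hash]

# Tokens carry the hash, not their own copy of the string
token_hash = doc[2].orth

print(shares_vocab, coffee_text, token_hash == coffee_hash)
```

Looking a string up by hash only works once the string has been seen, which is why the lookup happens after the text has been processed.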
Text annotations are also designed to allow a single source of truth: the Doc
object owns the data, and Span and Token are views that point into it. The Doc
object is constructed by the Tokenizer, and then modified in place by the
components of the pipeline. The Language object coordinates these components.
It takes raw text and sends it through the pipeline, returning an annotated
document. It also orchestrates training and serialization.
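
The flow described above can be sketched step by step. This is an illustration under stated assumptions, not how you would normally call the pipeline: it uses a blank English pipeline with only the built-in sentencizer, and runs the tokenizer and the component by hand to show that the same Doc is modified in place.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Step 1: the tokenizer constructs the Doc from raw text
doc = nlp.make_doc("Hello world. This is spaCy.")
before = doc.has_annotation("SENT_START")

# Step 2: a pipeline component annotates that same Doc in place
sentencizer = nlp.get_pipe("sentencizer")
returned = sentencizer(doc)
after = doc.has_annotation("SENT_START")

# Span and Token are views: both point back at the Doc that owns the data
span = doc[0:2]
token = span[0]

print(before, after, returned is doc, span.doc is doc, token.doc is doc)
```

In everyday use, calling nlp(text) performs both steps and returns the annotated Doc.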
Container objects
Name | Description |
---|---|
Language | Processing class that turns text into Doc objects. Different languages implement their own subclasses of it. The variable is typically called nlp. |
Doc | A container for accessing linguistic annotations. |
Span | A slice from a Doc object. |
Token | An individual token, i.e. a word, punctuation symbol, whitespace, etc. |
Lexeme | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc. |
Example | A collection of training annotations, containing two Doc objects: the reference data and the predictions. |
DocBin | A collection of Doc objects for efficient binary serialization. Also used for training data. |
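
Two of the containers above, Lexeme and DocBin, can be tried without a trained model. The following is a small sketch assuming a blank English pipeline; the sample texts are made up for illustration.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# A Lexeme is a context-free vocabulary entry, looked up by its string
lexeme = nlp.vocab["coffee"]
lexeme_info = (lexeme.text, lexeme.is_alpha)

# A DocBin packs several Doc objects into one binary blob
doc_bin = DocBin()
for text in ["First text", "Second text"]:
    doc_bin.add(nlp(text))
data = doc_bin.to_bytes()

# Restoring requires a Vocab, because the Docs only store hashes
restored = DocBin().from_bytes(data)
docs = list(restored.get_docs(nlp.vocab))
restored_texts = [doc.text for doc in docs]

print(lexeme_info, restored_texts)
```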
Processing pipeline
The processing pipeline consists of one or more pipeline components that are
called on the Doc in order. The tokenizer runs before the components. Pipeline
components can be added using Language.add_pipe. They can contain a statistical
model and trained weights, or only make rule-based modifications to the Doc.
spaCy provides a range of built-in components for different language processing
tasks and also allows adding custom components.
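
As a rough illustration of adding components, the sketch below registers a custom rule-based component and adds it after a built-in one. The component name token_counter and the user_data key n_tokens are hypothetical names chosen for this example; they are not part of spaCy.

```python
import spacy
from spacy.language import Language


# Hypothetical custom component: records the token count on the Doc.
# A component receives a Doc, modifies it in place and returns it.
@Language.component("token_counter")
def token_counter(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")      # built-in, rule-based component
nlp.add_pipe("token_counter")    # custom component, appended last

doc = nlp("Hello world!")
pipe_names = nlp.pipe_names
n_tokens = doc.user_data["n_tokens"]

print(pipe_names, n_tokens)
```

Components run in the order listed in nlp.pipe_names, after the tokenizer has created the Doc.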
Name | Description |
---|---|
Tokenizer | Segment raw text and create Doc objects from the words. |
Tok2Vec | Apply a "token-to-vector" model and set its outputs. |
Transformer | Use a transformer model and set its outputs. |
Lemmatizer | Determine the base forms of words. |
Morphologizer | Predict morphological features and coarse-grained part-of-speech tags. |
Tagger | Predict part-of-speech tags. |
AttributeRuler | Set token attributes using matcher rules. |
DependencyParser | Predict syntactic dependencies. |
EntityRecognizer | Predict named entities, e.g. persons or products. |
EntityRuler | Add entity spans to the Doc using token-based rules or exact phrase matches. |
EntityLinker | Disambiguate named entities to nodes in a knowledge base. |
TextCategorizer | Predict categories or labels over the whole document. |
Sentencizer | Implement rule-based sentence boundary detection that doesn't require the dependency parse. |
SentenceRecognizer | Predict sentence boundaries. |
Other functions | Automatically apply something to the Doc, e.g. to merge spans of tokens. |
Pipe | Base class that all trainable pipeline components inherit from. |
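
Several of the components listed above are rule-based and work without trained weights. As one example, here is a minimal sketch of the EntityRuler, assuming a blank English pipeline and the entity_ruler factory name; the pattern and the sample sentence are invented for illustration.

```python
import spacy

nlp = spacy.blank("en")

# Add the rule-based entity component and give it one exact phrase pattern
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Explosion"}])

doc = nlp("Explosion develops spaCy")
entities = [(ent.text, ent.label_) for ent in doc.ents]

print(entities)
```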
Matchers
Matchers help you find and extract information from Doc objects based on match
patterns describing the sequences you're looking for. A matcher operates on a
Doc and gives you access to the matched tokens in context.
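
A minimal sketch of a token-based match, assuming a blank English pipeline. The rule name HelloWorld and the sample text are illustrative choices.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dict per token: "hello", then any punctuation, then "world"
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world! Hello world!")

# Each match is (match_id, start, end); slicing the Doc gives a Span in context
matches = matcher(doc)
matched_texts = [doc[start:end].text for match_id, start, end in matches]

print(matched_texts)
```

The second "Hello world" is not matched because the pattern requires a punctuation token between the two words.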
Name | Description |
---|---|
Matcher | Match sequences of tokens, based on pattern rules, similar to regular expressions. |
PhraseMatcher | Match sequences of tokens based on phrases. |
DependencyMatcher | Match sequences of tokens based on dependency trees using the Semgrex syntax. |
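
For comparison with token patterns, the PhraseMatcher takes Doc objects as patterns. The sketch below assumes a blank English pipeline; the label PERSON_NAMES and the name list are made up for this example.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

# Phrase patterns are Doc objects; make_doc only runs the tokenizer
names = ["Barack Obama", "Angela Merkel"]
patterns = [nlp.make_doc(name) for name in names]
matcher.add("PERSON_NAMES", patterns)

doc = nlp("Angela Merkel met Barack Obama")
found = sorted(doc[start:end].text for match_id, start, end in matcher(doc))

print(found)
```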
Other classes
Name | Description |
---|---|
Vocab | The shared vocabulary that stores strings and gives you access to Lexeme objects. |
StringStore | Map strings to and from hash values. |
Vectors | Container class for vector data keyed by string. |
Lookups | Container for convenient access to large lookup tables and dictionaries. |
Morphology | Assign linguistic features like lemmas, noun case, verb tense etc. based on the word and its part-of-speech tag. |
MorphAnalysis | A morphological analysis. |
KnowledgeBase | Storage for entities and aliases of a knowledge base for entity linking. |
Scorer | Compute evaluation scores. |
Corpus | Class for managing annotated corpora for training and evaluation data. |
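
Two of these helper classes, StringStore and Lookups, can be used on their own. The following is a small sketch; the table name plurals and its contents are hypothetical data for illustration.

```python
from spacy.lookups import Lookups
from spacy.strings import StringStore

# StringStore maps strings to hash values and back again
store = StringStore(["apple", "orange"])
apple_hash = store["apple"]
apple_text = store[apple_hash]
has_cherry = "cherry" in store

# Lookups holds named tables that behave like dictionaries
lookups = Lookups()
lookups.add_table("plurals", {"mice": "mouse"})
table = lookups.get_table("plurals")
singular = table["mice"]

print(apple_text, has_cherry, singular)
```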