//- 💫 DOCS > USAGE > SPACY 101 > ARCHITECTURE p | The central data structures in spaCy are the #[code Doc] and the | #[code Vocab]. The #[code Doc] object owns the | #[strong sequence of tokens] and all their annotations. The #[code Vocab] | object owns a set of #[strong look-up tables] that make common | information available across documents. By centralising strings, word | vectors and lexical attributes, we avoid storing multiple copies of this | data. This saves memory, and ensures there's a | #[strong single source of truth]. p | Text annotations are also designed to allow a single source of truth: the | #[code Doc] object owns the data, and #[code Span] and #[code Token] are | #[strong views that point into it]. The #[code Doc] object is constructed | by the #[code Tokenizer], and then #[strong modified in place] by the | components of the pipeline. The #[code Language] object coordinates these | components. It takes raw text and sends it through the pipeline, | returning an #[strong annotated document]. It also orchestrates training | and serialization. +graphic("/assets/img/architecture.svg") include ../../assets/img/architecture.svg +h(3, "architecture-containers") Container objects +table(["Name", "Description"]) +row +cell #[+api("doc") #[code Doc]] +cell A container for accessing linguistic annotations. +row +cell #[+api("span") #[code Span]] +cell A slice from a #[code Doc] object. +row +cell #[+api("token") #[code Token]] +cell | An individual token — i.e. a word, punctuation symbol, whitespace, | etc. +row +cell #[+api("lexeme") #[code Lexeme]] +cell | An entry in the vocabulary. It's a word type with no context, as | opposed to a word token. It therefore has no part-of-speech tag, | dependency parse etc. +h(3, "architecture-pipeline") Processing pipeline +table(["Name", "Description"]) +row +cell #[+api("language") #[code Language]] +cell | A text-processing pipeline. Usually you'll load this once per | process as #[code nlp] and pass the instance around your application. +row +cell #[+api("pipe") #[code Pipe]] +cell Base class for processing pipeline components. +row +cell #[+api("tensorizer") #[code Tensorizer]] +cell | Add tensors with position-sensitive meaning representations to | #[code Doc] objects. +row +cell #[+api("tagger") #[code Tagger]] +cell Annotate part-of-speech tags on #[code Doc] objects. +row +cell #[+api("dependencyparser") #[code DependencyParser]] +cell Annotate syntactic dependencies on #[code Doc] objects. +row +cell #[+api("entityrecognizer") #[code EntityRecognizer]] +cell | Annotate named entities, e.g. persons or products, on #[code Doc] | objects. +row +cell #[+api("textcategorizer") #[code TextCategorizer]] +cell Assigning categories or labels to #[code Doc] objects. +row +cell #[+api("tokenizer") #[code Tokenizer]] +cell | Segment text, and create #[code Doc] objects with the discovered | segment boundaries. +row +cell #[+api("lemmatizer") #[code Lemmatizer]] +cell | Determine the base forms of words. +row +cell #[code Morphology] +cell | Assign linguistic features like lemmas, noun case, verb tense etc. | based on the word and its part-of-speech tag. +row +cell #[+api("matcher") #[code Matcher]] +cell | Match sequences of tokens, based on pattern rules, similar to | regular expressions. +row +cell #[+api("phrasematcher") #[code PhraseMatcher]] +cell Match sequences of tokens based on phrases. +h(3, "architecture-other") Other classes +table(["Name", "Description"]) +row +cell #[+api("vocab") #[code Vocab]] +cell | A lookup table for the vocabulary that allows you to access | #[code Lexeme] objects. +row +cell #[+api("stringstore") #[code StringStore]] +cell Map strings to and from hash values. +row +cell #[+api("vectors") #[code Vectors]] +cell Container class for vector data keyed by string. +row +cell #[+api("goldparse") #[code GoldParse]] +cell Collection for training annotations. +row +cell #[+api("goldcorpus") #[code GoldCorpus]] +cell | An annotated corpus, using the JSON file format. Manages | annotations for tagging, dependency parsing and NER.