The central data structures in spaCy are the Doc and the Vocab. The Doc
object owns the sequence of tokens and all their annotations. The Vocab
object owns a set of look-up tables that make common information available
across documents. By centralizing strings, word vectors and lexical attributes,
we avoid storing multiple copies of this data. This saves memory, and ensures
there's a single source of truth.

Text annotations are also designed to allow a single source of truth: the Doc
object owns the data, and Span and Token are views that point into it.

The Doc object is constructed by the Tokenizer, and then modified in
place by the components of the pipeline. The Language object coordinates
these components. It takes raw text and sends it through the pipeline, returning
an annotated document. It also orchestrates training and serialization.
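
The two stages described above can be sketched by hand. This is a minimal illustration, assuming a blank English pipeline (tokenizer only, no trained model); `mark_first_token_as_org` is a hypothetical component written for this sketch and is not part of spaCy. In normal use the Language object performs both steps when you call `nlp(text)`.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: provides a Tokenizer and a Vocab, but no trained components.
nlp = spacy.blank("en")

# Stage 1: the Tokenizer turns raw text into a Doc.
doc = nlp.tokenizer("Apple makes phones")


# Stage 2: a component is a callable that receives the Doc, annotates it
# in place and returns it. This one is hypothetical and only labels the
# first token as an organization.
def mark_first_token_as_org(doc):
    doc.ents = [Span(doc, 0, 1, label="ORG")]
    return doc


processed = mark_first_token_as_org(doc)

# The component returned the same object it was given, now with annotations.
print(processed is doc)
print([(ent.text, ent.label_) for ent in doc.ents])
```

Because the component modifies the Doc rather than copying it, later components see the annotations set by earlier ones.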

Container objects
| Name | Description | 
| Doc | A container for accessing linguistic annotations. | 
| Span | A slice from a Doc object. | 
| Token | An individual token — i.e. a word, punctuation symbol, whitespace, etc. | 
| Lexeme | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc. | 
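
As a rough illustration of how the four containers relate, the sketch below assumes a blank English pipeline and an example sentence chosen for this purpose. It shows that Span and Token keep a reference to their parent Doc, and that two tokens with the same text share one context-free Lexeme.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no model download needed
doc = nlp("I like coffee and you like coffee")

# Token: one word in context; it knows its position in the Doc.
first_coffee, second_coffee = doc[2], doc[6]

# Span: a slice of the Doc. It is a view, so it refers back to its parent.
span = doc[1:3]

# Lexeme: the context-free vocabulary entry for the word type "coffee".
lexeme = nlp.vocab["coffee"]

print(span.text)                                # like coffee
print(span.doc is doc, first_coffee.doc is doc)
print(first_coffee.i, second_coffee.i)          # different positions
print(first_coffee.orth == second_coffee.orth == lexeme.orth)
```

The two tokens differ in position but resolve to the same lexeme, which is why context-dependent attributes such as part-of-speech tags live on the Token and not on the Lexeme.
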
Processing pipeline
| Name | Description | 
| Language | A text-processing pipeline. Usually you'll load this once per process as nlp and pass the instance around your application. | 
| Tokenizer | Segment text, and create Doc objects with the discovered segment boundaries. | 
| Lemmatizer | Determine the base forms of words. | 
| Morphology | Assign linguistic features like lemmas, noun case, verb tense etc. based on the word and its part-of-speech tag. | 
| Tagger | Annotate part-of-speech tags on Doc objects. | 
| DependencyParser | Annotate syntactic dependencies on Doc objects. | 
| EntityRecognizer | Annotate named entities, e.g. persons or products, on Doc objects. | 
| TextCategorizer | Assign categories or labels to Doc objects. | 
| Matcher | Match sequences of tokens, based on pattern rules, similar to regular expressions. | 
| PhraseMatcher | Match sequences of tokens based on phrases. | 
| EntityRuler | Add entity spans to the Doc using token-based rules or exact phrase matches. | 
| Sentencizer | Implement custom sentence boundary detection logic that doesn't require the dependency parse. | 
| Other functions | Automatically apply something to the Doc, e.g. to merge spans of tokens. | 
Other classes
| Name | Description | 
| Vocab | A lookup table for the vocabulary that allows you to access Lexeme objects. | 
| StringStore | Map strings to and from hash values. | 
| Vectors | Container class for vector data keyed by string. | 
| GoldParse | Collection for training annotations. | 
| GoldCorpus | An annotated corpus, using the JSON file format. Manages annotations for tagging, dependency parsing and NER. |
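
The shared Vocab and its StringStore can be inspected directly. The following is a small sketch, assuming a blank English pipeline and two example sentences picked for illustration; it checks that documents share one Vocab and that the StringStore maps a string to a hash and back.

```python
import spacy

nlp = spacy.blank("en")
doc1 = nlp("I like coffee")
doc2 = nlp("Coffee is hot")

# Every Doc created by this pipeline points at the same Vocab instance.
shares_vocab = doc1.vocab is nlp.vocab and doc2.vocab is nlp.vocab

# The StringStore maps in both directions: string -> hash and hash -> string.
strings = nlp.vocab.strings
coffee_hash = strings["coffee"]
coffee_text = strings[coffee_hash]

# Tokens store the hash rather than their own copy of the string.
token_uses_hash = doc1[2].orth == coffee_hash

print(shares_vocab, coffee_text, token_uses_hash)
```

Looking up a hash only works for strings the StringStore has already seen, which is the case here because the tokenizer added "coffee" while processing the first sentence.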