//- 💫 DOCS > USAGE > SPACY'S DATA MODEL include ../../_includes/_mixins p After reading this page, you should be able to: +list +item Understand how spaCy's Doc, Span, Token and Lexeme object work +item Start using spaCy's Cython API +item Use spaCy more efficiently +h(2, "design-considerations") Design considerations +h(3, "no-job-too-big") No job too big p | When writing spaCy, one of my mottos was #[em no job too big]. I wanted | to make sure that if Google or Facebook were founded tomorrow, spaCy | would be the obvious choice for them. I wanted spaCy to be the obvious | choice for web-scale NLP. This meant sweating about performance, because | for web-scale tasks, Moore's law can't save you. p | Most computational work gets less expensive over time. If you wrote a | program to solve fluid dynamics in 2008, and you ran it again in 2014, | you would expect it to be cheaper. For NLP, it often doesn't work out | that way. The problem is that we're writing programs where the task is | something like "Process all articles in the English Wikipedia". Sure, | compute prices dropped from $0.80 per hour to $0.20 per hour on AWS in | 2008-2014. But the size of Wikipedia grew from 3GB to 11GB. Maybe the | job is a #[em little] cheaper in 2014 — but not by much. +h(3, "annotation-layers") Multiple layers of annotation p | When I tell a certain sort of person that I'm a computational linguist, | this comic is often the first thing that comes to their mind: +image("http://i.imgur.com/n3DTzqx.png", 450) +image-caption © #[+a("http://xkcd.com") xkcd] p | I've thought a lot about what this comic is really trying to say. It's | probably not talking about #[em data models] — but in that sense at | least, it really rings true. p | You'll often need to model a document as a sequence of sentences. Other | times you'll need to model it as a sequence of words. Sometimes you'll | care about paragraphs, other times you won't. Sometimes you'll care | about extracting quotes, which can cross paragraph boundaries. A quote | can also occur within a sentence. When we consider sentence structure, | things get even more complicated and contradictory. We have syntactic | trees, sequences of entities, sequences of phrases, sub-word units, | multi-word units... p | Different applications are going to need to query different, | overlapping, and often contradictory views of the document. They're | often going to need to query them jointly. You need to be able to get | the syntactic head of a named entity, or the sentiment of a paragraph. +h(2, "solutions") Solutions +h(3) Fat types, thin tokens +h(3) Static model, dynamic views p | Different applications are going to need to query different, | overlapping, and often contradictory views of the document. For this | reason, I think it's a bad idea to have too much of the document | structure reflected in the data model. If you structure the data | according to the needs of one layer of annotation, you're going to need | to copy the data and transform it in order to use a different layer of | annotation. You'll soon have lots of copies, and no single source of | truth. +h(3) Never go full stand-off +h(3) Implementation +h(3) Cython 101 +h(3) #[code cdef class Doc] p | Let's start at the top. Here's the memory layout of the | #[+api("doc") #[code Doc]] class, minus irrelevant details: +code. from cymem.cymem cimport Pool from ..vocab cimport Vocab from ..structs cimport TokenC cdef class Doc: cdef Pool mem cdef Vocab vocab cdef TokenC* c cdef int length cdef int max_length p | So, our #[code Doc] class is a wrapper around a TokenC* array — that's | where the actual document content is stored. Here's the #[code TokenC] | struct, in its entirety: +h(3) #[code cdef struct TokenC] +code. cdef struct TokenC: const LexemeC* lex uint64_t morph univ_pos_t pos bint spacy int tag int idx int lemma int sense int head int dep bint sent_start uint32_t l_kids uint32_t r_kids uint32_t l_edge uint32_t r_edge int ent_iob int ent_type # TODO: Is there a better way to do this? Multiple sources of truth.. hash_t ent_id p | The token owns all of its linguistic annotations, and holds a const | pointer to a #[code LexemeC] struct. The #[code LexemeC] struct owns all | of the #[em vocabulary] data about the word — all the dictionary | definition stuff that we want to be shared by all instances of the type. | Here's the #[code LexemeC] struct, in its entirety: +h(3) #[code cdef struct LexemeC] +code. cdef struct LexemeC: int32_t id int32_t orth # Allows the string to be retrieved int32_t length # Length of the string uint64_t flags # These are the most useful parts. int32_t cluster # Distributional similarity cluster float prob # Probability float sentiment # Slot for sentiment int32_t lang int32_t lower # These string views made sense int32_t norm # when NLP meant linear models. int32_t shape # Now they're less relevant, and int32_t prefix # will probably be revised. int32_t suffix float* vector # <-- This was a design mistake, and will change. +h(2, "dynamic-views") Dynamic views +h(3) Text p | You might have noticed that in all of the structs above, there's not a | string to be found. The strings are all stored separately, in the | #[+api("stringstore") #[code StringStore]] class. The lexemes don't know | the strings — they only know their integer IDs. The document string is | never stored anywhere, either. Instead, it's reconstructed by iterating | over the tokens, which look up the #[code orth] attribute of their | underlying lexeme. Once we have the orth ID, we can fetch the string | from the vocabulary. Finally, each token knows whether a single | whitespace character (#[code ' ']) should be used to separate it from | the subsequent tokens. This allows us to preserve whitespace. +code. cdef print_text(Vocab vocab, const TokenC* tokens, int length): for i in range(length): word_string = vocab.strings[tokens.lex.orth] if tokens.lex.spacy: word_string += ' ' print(word_string) p | This is why you get whitespace tokens in spaCy — we need those tokens, | so that we can reconstruct the document string. I also think you should | have those tokens anyway. Most NLP libraries strip them, making it very | difficult to recover the paragraph information once you're at the token | level. You'll never have that sort of problem with spaCy — because | there's a single source of truth. +h(3) #[code cdef class Token] p When you do... +code. doc[i] p | ...you get back an instance of class #[code spacy.tokens.token.Token]. | This instance owns no data. Instead, it holds the information | #[code (doc, i)], and uses these to retrieve all information via the | parent container. +h(3) #[code cdef class Span] p When you do... +code. doc[i : j] p | ...you get back an instance of class #[code spacy.tokens.span.Span]. | #[code Span] instances are also returned by the #[code .sents], | #[code .ents] and #[code .noun_chunks] iterators of the #[code Doc] | object. A #[code Span] is a slice of tokens, with an optional label | attached. Its data model is: +code. cdef class Span: cdef readonly Doc doc cdef int start cdef int end cdef int start_char cdef int end_char cdef int label p | Once again, the #[code Span] owns almost no data. Instead, it refers | back to the parent #[code Doc] container. p | The #[code start] and #[code end] attributes refer to token positions, | while #[code start_char] and #[code end_char] record the character | positions of the span. By recording the character offsets, we can still | use the #[code Span] object if the tokenization of the document changes. +h(3) #[code cdef class Lexeme] p When you do... +code. vocab[u'the'] p | ...you get back an instance of class #[code spacy.lexeme.Lexeme]. The | #[code Lexeme]'s data model is: +code. cdef class Lexeme: cdef LexemeC* c cdef readonly Vocab vocab