Overview
========

What and Why
------------

spaCy is a lightning-fast, full-cream NLP tokenizer, tightly coupled to a
global vocabulary store.

Most tokenizers give you a sequence of strings. That's barbaric. Giving you
strings invites you to compute on every *token*, when what you should be
doing is computing on every *type*. Remember `Zipf's law
<https://en.wikipedia.org/wiki/Zipf%27s_law>`_: you'll see exponentially
fewer types than tokens.

Instead of strings, spaCy gives you Lexeme IDs, from which you can access an
excellent set of pre-computed orthographic and distributional features::

    >>> from spacy import en
    >>> apples, are, nt, oranges, dots = en.tokenize(u"Apples aren't oranges...")
    >>> en.is_lower(apples)
    False
    >>> en.prob_of(are) >= en.prob_of(oranges)
    True
    >>> en.can_tag(are, en.NOUN)
    False
    >>> en.is_often_titled(apples)
    False

Accessing these properties is essentially free: the Lexeme IDs are actually
memory addresses that point to structs, so the only cost is the Python
function-call overhead. If you call the accessor functions from Cython,
there is no overhead at all.

Benchmark
---------

Because it exploits Zipf's law, spaCy is much more efficient than
regular-expression based tokenizers. See the Algorithm and Implementation
Details section for an explanation of how this works.

+--------+-------+--------------+--------------+
| System | Time  | Words/second | Speed Factor |
+========+=======+==============+==============+
| NLTK   | 6m4s  | 89,000       | 1.00         |
+--------+-------+--------------+--------------+
| spaCy  | 9.5s  | 3,093,000    | 38.30        |
+--------+-------+--------------+--------------+

The comparison tokenizes 30 million words of the English Gigaword corpus, on
a MacBook Air. For context, calling ``string.split()`` on the same data
completes in about 5s.

Pros and Cons
-------------

Pros:

- All tokens come with indices into the original string
- Full unicode support
- Extensible to other languages
- Batch operations computed efficiently in Cython
- Cython API
- numpy interoperability

Cons:

- It's new (released September 2014)
- Security concerns, from memory management
- Higher memory usage (up to 1GB)
- More conceptually complicated
- Tokenization rules expressed in code, not as data
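
The type-versus-token distinction described above can be illustrated with a
toy sketch in plain Python. This is not spaCy's implementation, and the
function name is hypothetical: it just shows why caching a feature per
*type* is cheap when, per Zipf's law, a corpus contains far fewer types than
tokens::

    # Illustrative sketch only (not spaCy internals): an expensive per-word
    # feature is computed once per type and memoised, so repeated tokens
    # (the common case under Zipf's law) hit the cache instead of recomputing.
    from collections import Counter

    _feature_cache = {}

    def is_often_titled_sketch(word):
        """Hypothetical stand-in for a pre-computed lexical feature."""
        if word not in _feature_cache:
            # Imagine an expensive computation or table lookup here.
            _feature_cache[word] = word.istitle()
        return _feature_cache[word]

    tokens = u"the cat sat on the mat near the other cat".split()
    print(Counter(tokens))                              # few types, many tokens
    print([is_often_titled_sketch(t) for t in tokens])  # one computation per type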
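If you want to reproduce the ``string.split()`` baseline mentioned in the
benchmark on your own data, a minimal timing sketch looks like the
following. The file name is a placeholder, not part of spaCy, and the
numbers you get will depend on your hardware and corpus::

    # Minimal sketch: times whitespace splitting as a rough lower bound for
    # any tokenizer. "gigaword.txt" is a placeholder path.
    import io
    import time

    with io.open("gigaword.txt", encoding="utf8") as f:
        text = f.read()

    start = time.time()
    tokens = text.split()
    print("%d tokens in %.1fs" % (len(tokens), time.time() - start))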