From 4eb9c2b30fb0c16405a1789b90e9bf680916fc4f Mon Sep 17 00:00:00 2001
From: Matthew Honnibal
Date: Fri, 22 Aug 2014 03:38:05 +0200
Subject: [PATCH] * Add overview doc

---
 docs/guide/overview.rst | 78 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)
 create mode 100644 docs/guide/overview.rst

diff --git a/docs/guide/overview.rst b/docs/guide/overview.rst
new file mode 100644
index 000000000..cd7561be0
--- /dev/null
+++ b/docs/guide/overview.rst
@@ -0,0 +1,78 @@
+Overview
+========
+
+What and Why
+------------
+
+spaCy is a lightning-fast, full-cream NLP tokenizer, tightly coupled to a
+global vocabulary store.
+
+Most tokenizers give you a sequence of strings. That's barbaric.
+Giving you strings invites you to compute on every *token*, when what
+you should be doing is computing on every *type*. Remember
+`Zipf's law <http://en.wikipedia.org/wiki/Zipf's_law>`_: you'll see far
+fewer types than tokens, because a small number of very frequent types
+accounts for most of the running text.
+
+Instead of strings, spaCy gives you Lexeme IDs, from which you can access
+an excellent set of pre-computed orthographic and distributional features:
+
+::
+
+    >>> from spacy import en
+    >>> apples, are, nt, oranges, dots = en.tokenize(u"Apples aren't oranges...")
+    >>> en.is_lower(apples)
+    False
+    # Distributional features calculated from large corpora
+    # Smoothed unigram log probability
+    >>> en.prob_of(are) > en.prob_of(oranges)
+    True
+    # After POS tagging lots of text, is this word ever a noun?
+    >>> en.can_tag(are, en.NOUN)
+    False
+    # Is this word always title-cased?
+    >>> en.often_title(apples)
+    False
+
+Accessing these properties is essentially free: the Lexeme IDs are actually
+memory addresses that point to structs --- so the only cost is the Python
+function-call overhead. If you call the accessor functions from Cython,
+there's no overhead at all.
+
+Benchmark
+---------
+
+Because it exploits Zipf's law, spaCy is much more efficient than
+regular-expression-based tokenizers. See Algorithm and Implementation
+Details for a full explanation of how this works; a rough sketch follows
+the Pros and Cons below.
+
++--------+-------+--------------+--------------+
+| System | Time  | Words/second | Speed Factor |
++========+=======+==============+==============+
+| NLTK   | 6m4s  | 89,000       | 1.00         |
++--------+-------+--------------+--------------+
+| spaCy  | 9.5s  | 3,093,000    | 38.30        |
++--------+-------+--------------+--------------+
+
+The comparison was run on 30 million words from the English Gigaword
+corpus, on a MacBook Air. For context, calling string.split() on the same
+data completes in about 5s.
+
+Pros and Cons
+-------------
+
+Pros:
+
+- All tokens come with indices into the original string
+- Full unicode support
+- Extensible to other languages
+- Batch operations computed efficiently in Cython
+- Cython API
+- numpy interoperability
+
+Cons:
+
+- It's new (released September 2014)
+- Higher memory usage (up to 1GB)
+- More conceptually complicated
+- Tokenization rules expressed in code, not as data
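+
+How the Speed Trick Works
+-------------------------
+
+The sketch below makes the Zipf's law point from the Benchmark section
+concrete. It is a minimal illustration in plain Python, not spaCy's actual
+implementation, and the helper ``_split_chunk`` is a hypothetical stand-in
+for the real tokenization rules. The idea: run the expensive rules at most
+once per distinct whitespace-delimited chunk, and serve every repeat from
+a cache.
+
+::
+
+    import re
+
+    _cache = {}
+
+    def _split_chunk(chunk):
+        # Hypothetical stand-in for the real rules: peel punctuation off
+        # the edges of a chunk. (The real rules also handle contractions
+        # like "n't", special cases, etc.)
+        return re.findall(r"\w+|[^\w\s]+", chunk, re.UNICODE)
+
+    def tokenize(text):
+        tokens = []
+        for chunk in text.split():
+            if chunk not in _cache:
+                _cache[chunk] = _split_chunk(chunk)  # expensive; once per *type*
+            tokens.extend(_cache[chunk])             # cheap; once per *token*
+        return tokens
+
+Because running text is Zipf-distributed, the cache hit-rate climbs very
+quickly, so the amortized cost per token approaches a single dictionary
+lookup. That is where the gap in the benchmark table comes from.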
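+
+The same trick applies to the features themselves. Below is one way a
+type-keyed vocabulary store could look; the ``Lexicon`` class and its
+method names are hypothetical, not spaCy's API, and spaCy stores structs
+behind pointers where this sketch uses list indices. The point is that
+each feature is computed once per type and stored in a table, so the
+per-token access is a constant-time lookup:
+
+::
+
+    from math import log
+
+    class Lexicon(object):
+        """Hypothetical sketch of a type-keyed vocabulary store."""
+        def __init__(self, counts, total):
+            self._ids = {}       # string -> Lexeme ID
+            self._is_lower = []  # per-type feature tables, indexed by ID
+            self._log_prob = []
+            for word, count in counts.items():
+                self._ids[word] = len(self._is_lower)
+                self._is_lower.append(word.islower())  # computed once per type
+                self._log_prob.append(log(count) - log(total))
+
+        def id_of(self, word):
+            return self._ids[word]
+
+        def is_lower(self, lex_id):
+            return self._is_lower[lex_id]  # constant-time lookup per token
+
+        def prob_of(self, lex_id):
+            return self._log_prob[lex_id]
+
+For example::
+
+    >>> lexicon = Lexicon({u"are": 5000, u"oranges": 2}, total=1000000)
+    >>> lexicon.prob_of(lexicon.id_of(u"are")) > lexicon.prob_of(lexicon.id_of(u"oranges"))
+    True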