diff --git a/docs/source/guide/overview.rst b/docs/source/guide/overview.rst
index 59d0810d8..44d750490 100644
--- a/docs/source/guide/overview.rst
+++ b/docs/source/guide/overview.rst
@@ -1,71 +1,106 @@
-Overview
-========
+Don't Settle for a List of Strings
+==================================
 
-What and Why
-------------
-spaCy is a lightning-fast, full-cream NLP tokenizer and lexicon.
+   *"Other NLP tokenizers return lists of strings, which is downright
+   barbaric."* --- me
 
-Most tokenizers give you a sequence of strings. That's barbaric.
-Giving you strings invites you to compute on every *token*, when what
-you should be doing is computing on every *type*. Remember
-`Zipf's law <http://en.wikipedia.org/wiki/Zipf%27s_law>`_: you'll
-see exponentially fewer types than tokens.
-Instead of strings, spaCy gives you references to Lexeme objects, from which you
-can access an excellent set of pre-computed orthographic and distributional features:
+spaCy splits text into a list of lexical types, which come with a variety of
+features pre-computed. It's designed to **make the right thing easy**, where the right
+thing is:
 
-::
+* A global vocabulary store;
 
-    >>> from spacy import en
-    >>> apples, are, nt, oranges, dots = en.EN.tokenize(u"Apples aren't oranges...")
-    >>> are.prob >= oranges.prob
-    True
-    >>> apples.check_flag(en.IS_TITLE)
-    True
-    >>> apples.check_flag(en.OFT_TITLE)
-    False
-    >>> are.check_flag(en.CAN_NOUN)
-    False
+* Cached orthographic features;
 
-spaCy makes it easy to write very efficient NLP applications, because your feature
-functions have to do almost no work: almost every lexical property you'll want
-is pre-computed for you. See the tutorial for an example POS tagger.
+* Clever use of distributional data.
+
+Let's say you're writing an entity tagger for English. Case distinctions are an
+important feature here: you need to know whether the word you're tagging is
+upper-cased, lower-cased, title-cased, non-alphabetic, etc.
+The right thing is to call the string.isupper(), string.islower(), string.isalpha()
+etc. functions once for every *type* in your vocabulary, instead
+of once for every *token* in the text you're tagging.
+When you encounter a new word, you want to create a lexeme object, calculate its
+features, and save it.
 
-Benchmark
----------
+That's the *right* way to do it, so it's what spaCy does for you.
 
-The tokenizer itself is also very efficient:
+Other tokenizers give you a list of strings, which makes it really easy to do
+the wrong thing. And the wrong thing isn't just a little bit worse: it's
+**exponentially** worse, because of
+`Zipf's law <http://en.wikipedia.org/wiki/Zipf%27s_law>`_.
+
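+To make the difference concrete, here's a minimal sketch of that per-type
+strategy in pure Python. It's an illustration only, not spaCy's implementation
+(spaCy does the equivalent work in Cython, with a much richer feature set)::
+
+    class Lexeme(object):
+        """A bundle of features, computed once per vocabulary type."""
+        def __init__(self, string):
+            self.string = string
+            self.is_upper = string.isupper()
+            self.is_lower = string.islower()
+            self.is_title = string.istitle()
+            self.is_alpha = string.isalpha()
+
+    vocab = {}
+
+    def lookup(string):
+        # Compute the features only the first time we see a type.
+        if string not in vocab:
+            vocab[string] = Lexeme(string)
+        return vocab[string]
+
+    # The second "NASA" is a dictionary hit: no feature functions are called.
+    tokens = [lookup(word) for word in u"NASA says NASA will launch".split()]
+
+Every token then points to one of these shared objects, so your feature
+functions run once per vocabulary type, no matter how long the document is.
+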
+.. raw:: html
+
+    <!-- Interactive chart: calls to a feature function computed per token
+         vs. per type, over the Gigaword corpus. -->
+
+Over the Gigaword corpus, if you compute some feature on a per-token basis, you'll
+make **500x more calls** to that function than if you had computed it on a per-type
+basis.
+(Mouse-over a line to see its value at that point. And yes, it's a bit snarky
+to present the graph in a linear scale --- but it isn't misleading.)
+
+Zipf's Law also makes distributional information a really powerful source of
+type-based features. It's really handy to know where a word falls in the language's
+frequency distribution, especially compared to variants of the word. For instance,
+we might be processing a Twitter comment that contains the string "nasa". We have
+little hope of recognising this as an entity except by noting that the string "NASA"
+is much more common, and that both strings are quite rare.
+
+.. Each spaCy Lexeme comes with a rich, curated set of orthographic and
+.. distributional features. Different languages get a different set of features,
+.. to take into account different orthographic conventions and morphological
+.. complexity. It's also easy to define your own features.
+
+.. And, of course, we take care to get the details right. Indices into the original
+.. text are always easy to calculate, so it's easy to, say, mark entities with in-line
+.. mark-up. You'll also receive tokens for newlines, tabs and other non-space whitespace,
+.. making it easy to do paragraph and sentence recognition. And, of course, we deal
+.. smartly with all the random unicode whitespace and punctuation characters you might
+.. not have thought of.
+
+
+Benchmarks
+----------
+
+Here we ask two things:
+
+1. How fast is the spaCy tokenizer itself, relative to other tokenizers?
+
+2. How fast are applications using spaCy's pre-computed lexical features,
+   compared to applications that re-compute their features on every token?
 
 +--------+-------+--------------+--------------+
 | System | Time  | Words/second | Speed Factor |
 +--------+-------+--------------+--------------+
 | NLTK   | 6m4s  | 89,000       | 1.00         |
 +--------+-------+--------------+--------------+
-| spaCy  | 9.5s  | 3,093,000    | 38.30        |
+| spaCy  |       |              |              |
 +--------+-------+--------------+--------------+
 
-The comparison refers to 30 million words from the English Gigaword, on
-a Maxbook Air. For context, calling string.split() on the data completes in
-about 5s.
+
+spaCy uses more memory than a standard tokenizer, but is far more efficient. We
+compare against the NLTK tokenizer and the Penn Treebank's tokenizer.sed script.
+We also give the performance of Python's native string.split, for reference.
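+
+If you want to run this comparison yourself, the harness is only a few lines.
+The sketch below is indicative rather than exact: the corpus path is a
+placeholder, you should average over several runs, and the Penn Treebank
+tokenizer.sed script is easiest to time separately from the shell::
+
+    import io
+    from timeit import default_timer as timer
+
+    from spacy.en import EN
+    from nltk.tokenize import word_tokenize
+
+    # 'corpus.txt' is a placeholder; point this at a large plain-text file.
+    with io.open('corpus.txt', encoding='utf8') as file_:
+        text = file_.read()
+
+    for name, tokenize in [
+            ('str.split', text.split),
+            ('NLTK', lambda: word_tokenize(text)),
+            ('spaCy', lambda: EN.tokenize(text))]:
+        start = timer()
+        tokens = tokenize()
+        print('%s: %d tokens in %.1fs' % (name, len(tokens), timer() - start))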
+
 Pros and Cons
 -------------
 
 Pros:
 
-- All tokens come with indices into the original string
-- Full unicode support
-- Extensible to other languages
-- Batch operations computed efficiently in Cython
-- Cython API
-- numpy interoperability
+- Stuff
 
 Cons:
 
 - It's new (released September 2014)
-- Security concerns, from memory management
 - Higher memory usage (up to 1gb)
-- More conceptually complicated
-- Tokenization rules expressed in code, not as data
-
+- More complicated
diff --git a/docs/source/index.rst b/docs/source/index.rst
index ca7dbee40..6bced7b15 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -6,6 +6,40 @@
 spaCy NLP Tokenizer and Lexicon
 ================================
 
+spaCy splits a string of natural language into a list of references to lexical types:
+
+    >>> from spacy.en import EN
+    >>> tokens = EN.tokenize(u"Examples aren't easy, are they?")
+    >>> type(tokens[0])
+    spacy.word.Lexeme
+    >>> tokens[1] is tokens[5]
+    True
+
+Other tokenizers return lists of strings, which is
+`downright barbaric `__. If you get a list of strings,
+you have to write all the features yourself, and you'll probably compute them
+on a per-token basis, instead of a per-type basis. At scale, that's very
+inefficient.
+
+spaCy's tokens come with the following orthographic and distributional features
+pre-computed:
+
+* Orthographic flags, such as is_alpha, is_digit, is_punct, is_title, etc.;
+
+* Useful string transforms, such as canonical casing, word shape, ASCIIfied,
+  etc.;
+
+* Unigram log probability;
+
+* Brown cluster;
+
+* can_noun, can_verb etc. tag-dictionary flags;
+
+* oft_upper, oft_title etc. case-behaviour flags.
+
+The features are up-to-date with current NLP research, but you can replace or
+augment them if you need to.
+
 .. toctree::
    :maxdepth: 3
 
@@ -15,20 +49,24 @@
    api/index.rst
    modules/index.rst
 
-
-Source (GitHub)
----------------
-
-http://github.com/honnibal/spaCy
 
 License
--------
+=======
 
-Copyright Matthew Honnibal
++------------------+------+
+| Non-commercial   | $0   |
++------------------+------+
+| Trial commercial | $0   |
++------------------+------+
+| Full commercial  | $500 |
++------------------+------+
 
-Non-commercial use: $0
-Commercial trial use: $0
-Full commercial license: $500
+spaCy is non-free software. Its source is published, but the copyright is
+retained by the author (Matthew Honnibal). Licenses are currently under preparation.
+
+There is currently a gap between the output of academic NLP researchers and
+the needs of small software companies. I left academia to try to correct this.
+My idea is that non-commercial and trial commercial use should "feel" just like
+free software. But if you do use the code in a commercial product, a small
+fixed license fee will apply, in order to fund development.
-honnibal@gmail.com