* Tweak overview docs

Matthew Honnibal 2014-09-07 21:29:41 +02:00
parent 7dac9b9ccb
commit b8c4549ffe

@@ -4,8 +4,7 @@ Overview
What and Why What and Why
------------ ------------
spaCy is a lightning-fast, full-cream NLP tokenizer, tightly coupled to a spaCy is a lightning-fast, full-cream NLP tokenizer and lexicon.
global vocabulary store.
Most tokenizers give you a sequence of strings. That's barbaric. Most tokenizers give you a sequence of strings. That's barbaric.
Giving you strings invites you to compute on every *token*, when what Giving you strings invites you to compute on every *token*, when what
@@ -13,33 +12,30 @@ you should be doing is computing on every *type*. Remember
`Zipf's law <http://en.wikipedia.org/wiki/Zipf's_law>`_: you'll `Zipf's law <http://en.wikipedia.org/wiki/Zipf's_law>`_: you'll
see exponentially fewer types than tokens. see exponentially fewer types than tokens.
Instead of strings, spacy gives you Lexeme IDs, from which you can access Instead of strings, spaCy gives you references to Lexeme objects, from which you
an excellent set of pre-computed orthographic and distributional features: can access an excellent set of pre-computed orthographic and distributional features:
:: ::
>>> from spacy import en >>> from spacy import en
>>> apples, are, nt, oranges, dots = en.tokenize(u"Apples aren't oranges...") >>> apples, are, nt, oranges, dots = en.EN.tokenize(u"Apples aren't oranges...")
>>> en.is_lower(apples) >>> are.prob >= oranges.prob
False
>>> en.prob_of(are) >= en.prob_of(oranges)
True True
>>> en.can_tag(are, en.NOUN) >>> apples.check_flag(en.IS_TITLE)
True
>>> apples.check_flag(en.OFT_TITLE)
False False
>>> en.is_often_titled(apples) >>> are.check_flag(en.CAN_NOUN)
False False
Accessing these properties is essentially free: the Lexeme IDs are actually spaCy makes it easy to write very efficient NLP applications, because your feature
memory addresses that point to structs --- so the only cost is the Python functions have to do almost no work: almost every lexical property you'll want
function call overhead. If you call the accessor functions from Cython, is pre-computed for you. See the tutorial for an example POS tagger.
there's no overhead at all.
Benchmark Benchmark
--------- ---------
Because it exploits Zipf's law, spaCy is much more efficient than The tokenizer itself is also very efficient:
regular-expression based tokenizers. See Algorithm and Implementation Details
for an explanation of how this works.
+--------+-------+--------------+--------------+ +--------+-------+--------------+--------------+
| System | Time | Words/second | Speed Factor | | System | Time | Words/second | Speed Factor |