mirror of https://github.com/explosion/spaCy.git
synced 2025-10-30 23:47:31 +03:00

* Improve index docs

This commit is contained in:
parent e8dbac8a0c
commit e28b224b80
@@ -3,87 +3,184 @@
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

==============================
spaCy: Industrial-strength NLP
==============================

spaCy is a library for industrial-strength text processing in Python and Cython.
It is commercial open source software, with a dual (AGPL or commercial)
license.

If you're a small company doing NLP, spaCy might seem like a minor miracle.
It's by far the fastest NLP software available.  The full processing pipeline
completes in 7ms, including state-of-the-art part-of-speech tagging and
dependency parsing.  All strings are mapped to integer IDs, tokens are linked
to word vectors and other lexical resources, and a range of useful features
are pre-calculated and cached.

If none of that made any sense to you, here's the gist of it.  Computers don't
understand text.  This is unfortunate, because that's what the web almost
entirely consists of.  We want to recommend people text based on other text
they liked.  We want to shorten text to display it on a mobile screen.  We
want to aggregate it, link it, filter it, categorise it, generate it and
correct it.

spaCy provides a set of utility functions that help programmers build such
products.  It's an NLP engine, analogous to the 3D engines commonly licensed
for game development.

Example functionality
---------------------

Let's say you're developing a proofreading tool, or possibly an IDE for
writers.  You're convinced by Stephen King's advice that `adverbs are not your
friend <http://www.brainpickings.org/2013/03/13/stephen-king-on-adverbs/>`_, so
you want to **mark adverbs in red**.  We'll use one of the examples he finds
particularly egregious:

    >>> import spacy.en
    >>> from spacy.enums import ADVERB
    >>> # Load the pipeline, and call it with some text.
    >>> nlp = spacy.en.English()
    >>> tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
    ...              tag=True, parse=True)
    >>> output = ''
    >>> for tok in tokens:
    ...     # Token.string preserves whitespace, making it easy to
    ...     # reconstruct the original string.
    ...     output += tok.string.upper() if tok.is_pos(ADVERB) else tok.string
    >>> print(output)
    ‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’

Easy enough --- but the problem is that we've also highlighted "back", when
probably we only wanted to highlight "abjectly".  This is undoubtedly an
adverb, but it's not the sort of adverb King is talking about.  This is a
persistent problem when dealing with linguistic categories: the prototypical
examples, the ones which spring to your mind, are often not the most common
cases.

There are lots of ways we might refine our logic, depending on just what words
we want to flag.  The simplest way to filter out adverbs like "back" and "not"
is by word frequency: these words are much more common than the manner adverbs
the style guides are worried about.

The ``prob`` attribute of a ``Lexeme`` or ``Token`` object gives a log
probability estimate of the word, based on smoothed counts from a 3bn word
corpus:

    >>> nlp.vocab[u'back'].prob
    -7.403977394104004
    >>> nlp.vocab[u'not'].prob
    -5.407193660736084
    >>> nlp.vocab[u'quietly'].prob
    -11.07155704498291

So we can easily exclude the N most frequent words in English from our adverb
marker.  Let's try N=1000 for now:

    >>> import spacy.en
    >>> from spacy.enums import ADVERB
    >>> nlp = spacy.en.English()
    >>> # Find the log probability of the Nth most frequent word.
    >>> probs = sorted(lex.prob for lex in nlp.vocab)
    >>> is_adverb = lambda tok: tok.is_pos(ADVERB) and tok.prob < probs[-1000]
    >>> tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
    ...              tag=True, parse=True)
    >>> print(''.join(tok.string.upper() if is_adverb(tok) else tok.string
    ...               for tok in tokens))
    ‘Give it back,’ he pleaded ABJECTLY, ‘it’s mine.’

There are lots of ways to refine the logic, depending on just what words we
want to flag.  Let's define this narrowly, and only flag adverbs applied to
verbs of communication or perception:

    >>> from spacy.enums import VERB, WN_V_COMMUNICATION, WN_V_COGNITION
    >>> def is_say_verb(tok):
    ...     return tok.is_pos(VERB) and (tok.check_flag(WN_V_COMMUNICATION) or
    ...                                  tok.check_flag(WN_V_COGNITION))
    >>> print(''.join(tok.string.upper() if is_adverb(tok) and is_say_verb(tok.head)
    ...               else tok.string for tok in tokens))
    ‘Give it back,’ he pleaded ABJECTLY, ‘it’s mine.’

The two flags refer to the 45 top-level categories in the WordNet ontology.
spaCy stores membership in these categories as a bit set, because words can
have multiple senses.  We only need one 64-bit flag variable per word in the
vocabulary, so this useful data requires only 2.4mb of memory --- 8 bytes for
each of roughly 300,000 vocabulary entries.
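
To make the bit-set representation concrete, here is a minimal sketch in plain
Python.  The bit positions and helper names below are invented for
illustration; they are not spaCy's actual internals:

    # Each category gets one bit position in a 64-bit integer.
    # These positions are hypothetical, chosen only for this sketch.
    WN_V_COMMUNICATION = 12
    WN_V_COGNITION = 13

    def set_flag(flags, flag_id):
        # Switch on a single bit of the flag variable.
        return flags | (1 << flag_id)

    def check_flag(flags, flag_id):
        # Test a single bit: one AND and a compare, no set or dict lookup.
        return bool(flags & (1 << flag_id))

    # A verb with senses in both categories sets both bits:
    flags = 0
    flags = set_flag(flags, WN_V_COMMUNICATION)
    flags = set_flag(flags, WN_V_COGNITION)
    assert check_flag(flags, WN_V_COGNITION)
    assert not check_flag(flags, 20)  # some other, unset category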

spaCy packs all sorts of other goodies into its lexicon.  Words are mapped to
one of these rich lexical types immediately, during tokenization --- and
spaCy's tokenizer is *fast*.

Efficiency
----------

.. table:: Efficiency comparison. See `Benchmarks`_ for details.

  +--------------+---------------------------+--------------------------------+
  |              | Absolute (ms per doc)     | Relative (to spaCy)            |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | System       | Tokenize | Tag    | Parse | Tokenize | Tag     | Parse     |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | spaCy        | 0.2ms    | 1ms    | 7ms   | 1x       | 1x      | 1x        |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | CoreNLP      | 2ms      | 10ms   | 49ms  | 10x      | 10x     | 7x        |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | ZPar         | 1ms      | 8ms    | 850ms | 5x       | 8x      | 121x      |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | NLTK         | 4ms      | 443ms  | n/a   | 20x      | 443x    | n/a       |
  +--------------+----------+--------+-------+----------+---------+-----------+

Efficiency is a major concern for NLP applications.  It is very common to hear
people say that they cannot afford more detailed processing, because their
datasets are too large.  This is a bad position to be in.  If you can't apply
detailed processing, you generally have to cobble together various heuristics.
This normally takes a few iterations, and what you come up with will usually
be brittle and difficult to reason about.

spaCy's parser is faster than most taggers, and its tokenizer is fast enough
for truly web-scale processing.  And the tokenizer doesn't just give you a
list of strings.  A spaCy token is a pointer to a Lexeme struct, from which
you can access a wide range of pre-computed features.
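
For instance, reading a token's attributes is just a lookup into that struct.
The sketch below uses attribute names that appear elsewhere on this page
(``lemma``, ``pos``, ``prob``, ``cluster``); treat the exact values printed as
illustrative rather than definitive:

    >>> import spacy.en
    >>> nlp = spacy.en.English()
    >>> tokens = nlp(u'An example sentence', tag=True, parse=True)
    >>> for tok in tokens:
    ...     # The lemma ID, part-of-speech tag, log probability and Brown
    ...     # cluster are all pre-computed and cached on the Lexeme.
    ...     print(tok.lemma, tok.pos, tok.prob, bin(tok.cluster))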

.. I wrote spaCy because I think existing commercial NLP engines are crap.
  Alchemy API are a typical example.  Check out this part of their terms of
  service:
  publish or perform any benchmark or performance tests or analysis relating to
  the Service or the use thereof without express authorization from AlchemyAPI;

.. Did you get that? You're not allowed to evaluate how well their system works,
  unless you're granted a special exception.  Their system must be pretty
  terrible to motivate such an embarrassing restriction.
  They must know this makes them look bad, but they apparently believe allowing
  you to evaluate their product would make them look even worse!

.. spaCy is based on science, not alchemy.  It's open source, and I am happy to
  clarify any detail of the algorithms I've implemented.  It's evaluated
  against the current best published systems, following the standard
  methodologies.  These evaluations show that it performs extremely well.

Accuracy
--------

.. table:: Accuracy comparison, on the standard benchmark data from the Wall Street Journal. See `Benchmarks`_ for details.

  +--------------+--------------+----------------+
  | System       | POS acc. (%) | Parse acc. (%) |
  +--------------+--------------+----------------+
  | spaCy        | 97.2         | 92.4           |
  +--------------+--------------+----------------+
  | CoreNLP      | 96.9         | 92.2           |
  +--------------+--------------+----------------+
  | ZPar         | 97.3         | 92.9           |
  +--------------+--------------+----------------+
  | NLTK         | 94.3         | n/a            |
  +--------------+--------------+----------------+

spaCy is dual-licensed: you can either use it under the AGPL, or pay a
one-time fee of $5000 for a commercial license.  I think this is excellent
value: you'll find NLTK etc. much more expensive, because what you save on
license cost, you'll lose many times over in lost productivity.  $5000 does
not buy you much developer time.


.. toctree::
    :hidden:
    :maxdepth: 3

    license.rst
    quickstart.rst
    features.rst
    license_stories.rst
    api.rst