* Fix results table

Matthew Honnibal 2014-12-24 14:35:32 +11:00
parent a68ecc50fa
commit 75a6930ad9


@@ -9,52 +9,63 @@ spaCy: Text-processing for products
spaCy is a library for industrial-strength text processing in Python and Cython.
Its core values are efficiency, accuracy and minimalism: you get a fast pipeline of
state-of-the-art components, a nice API, and no clutter:
>>> from spacy.en import English
>>> nlp = English()
>>> tokens = nlp(u'An example sentence', tag=True, parse=True)
>>> for token in tokens:
...     print token.lemma, token.pos, token.shape, bin(token.cluster)
an DT Xx 0b111011110
example NN xxxx 0b111110001
sentence NN xxxx 0b1101111110010
spaCy is particularly good for feature extraction, because it pre-loads lexical
resources, maps strings to integer IDs, and supports output of numpy arrays:
>>> from spacy.en import attrs
>>> tokens.to_array((attrs.LEMMA, attrs.POS, attrs.SHAPE, attrs.CLUSTER))
array([[ 1265,    14,    76,   478],
       [ 1545,    24,   262,   497],
       [ 3385,    24,   262, 14309]])
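
You can also map the IDs back to strings, if you want to eyeball the features.
Here's a minimal sketch of the reverse trip, assuming the `nlp.strings` and
`nlp.tagger.tags` look-up tables; it prints one line per token, output omitted
here:

>>> feats = tokens.to_array((attrs.LEMMA, attrs.POS, attrs.SHAPE, attrs.CLUSTER))
>>> for lemma, pos, shape, cluster in feats:
...     print nlp.strings[lemma], nlp.tagger.tags[pos], nlp.strings[shape], cluster
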
spaCy also makes it easy to add in-line markup. Let's say you're convinced by
Stephen King's advice that `adverbs are not your friend <http://www.brainpickings.org/2013/03/13/stephen-king-on-adverbs/>`_, so you want to mark
them in red. We'll use one of the examples he finds particularly egregious:
>>> tokens = nlp(u"Give it back, he pleaded abjectly, it's mine.")
>>> red = lambda string: u'\033[91m{0}\033[0m'.format(string)
>>> red = lambda string: unicode(string).upper() # TODO -- make red work on website...
>>> print u''.join(red(t) if t.is_adverb else unicode(t) for t in tokens)
Give it BACK, he pleaded ABJECTLY, it's mine.
Easy --- except, "back" isn't the sort of word we're looking for, even though
it's undeniably an adverb. Let's refine the logic a little, and only
highlight adverbs that modify verbs:
>>> print u''.join(red(t) if t.is_adverb and t.head.is_verb else unicode(t) for t in tokens)
Give it back, he pleaded ABJECTLY, it's mine.
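
Once the predicate stops being a one-liner, it's tidier to pull the pattern
into a function. A small sketch under the same assumptions as the example
above; the `highlight` helper is hypothetical, not part of spaCy:

>>> def highlight(tokens, mark, predicate):
...     return u''.join(mark(t) if predicate(t) else unicode(t) for t in tokens)
>>> print highlight(tokens, red, lambda t: t.is_adverb and t.head.is_verb)
Give it back, he pleaded ABJECTLY, it's mine.
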
spaCy is also very efficient --- much more efficient than any other language
processing tool available. The table below compares the time to tokenize, POS
tag and parse a document (amortized over 100k samples). It also shows accuracy
on the standard evaluation, from the Wall Street Journal:
+----------+----------+---------+----------+----------+------------+
| System   | Tokenize | POS Tag | Parse    | POS Acc. | Parse Acc. |
+----------+----------+---------+----------+----------+------------+
| spaCy    | 0.37ms   | 0.98ms  | 10ms     | 97.3%    | 92.4%      |
+----------+----------+---------+----------+----------+------------+
| NLTK     | 6.2ms    | 443ms   | n/a      | 94.0%    | n/a        |
+----------+----------+---------+----------+----------+------------+
| CoreNLP  | 4.2ms    | 13ms    | todo     | 96.97%   | 92.2%      |
+----------+----------+---------+----------+----------+------------+
| ZPar     | n/a      | 15ms    | 850ms    | 97.3%    | 92.9%      |
+----------+----------+---------+----------+----------+------------+
(The CoreNLP results refer to their recently published shift-reduce neural
network parser.)
spaCy completes its whole pipeline faster than some of the other libraries can
tokenize the text. Its POS tag accuracy is as good as any system available.
For parsing, I chose an algorithm that sacrifices some accuracy in favour of
efficiency.
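
If you want to sanity-check the speed on your own hardware and text, a rough
harness is easy to write. This is a sketch, not the script behind the table
above; substitute real documents for the toy sample:

>>> import time
>>> texts = [u'An example sentence'] * 1000
>>> start = time.time()
>>> for text in texts:
...     doc = nlp(text, tag=True, parse=True)
>>> print '%.2fms per document' % (1000.0 * (time.time() - start) / len(texts))
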
I wrote spaCy so that startups and other small companies could take advantage
of the enormous progress being made by NLP academics. Academia is competitive,