Mirror of https://github.com/explosion/spaCy.git
Commit 69e3a07fa1 (parent 9f3f07cab6)

    More index.rst fiddling
@@ -8,60 +8,35 @@ spaCy: Industrial-strength NLP
 ================================
 
 spaCy is a library for industrial-strength text processing in Python and Cython.
-It features extremely efficient, up-to-date algorithms, and a rethink of how those
-algorithms should be accessed.
+Its core values are efficiency, accuracy and minimalism: you get a fast pipeline of
+state-of-the-art components, a nice API, and no clutter.
 
-A typical text-processing API looks something like this:
+spaCy is particularly good for feature extraction, because it pre-loads lexical
+resources, maps strings to integer IDs, and supports output of numpy arrays:
 
->>> import nltk
->>> nltk.pos_tag(nltk.word_tokenize('''Some string of language.'''))
-[('Some', 'DT'), ('string', 'VBG'), ('of', 'IN'), ('language', 'NN'), ('.', '.')]
+>>> from spacy.en import English
+>>> from spacy.en import attrs
+>>> nlp = English()
+>>> tokens = nlp(u'An example sentence', pos_tag=True, parse=True)
+>>> tokens.to_array((attrs.LEMMA, attrs.POS, attrs.SHAPE, attrs.CLUSTER))
 
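As a sketch of what the call above hands back, assuming the returned numpy array has one row per token and one column per requested attribute, in the order given (that layout is an assumption, not confirmed by the text):

>>> arr = tokens.to_array((attrs.LEMMA, attrs.POS, attrs.SHAPE, attrs.CLUSTER))
>>> arr.shape           # assumed layout: (number of tokens, number of requested attributes)
>>> pos_ids = arr[:, 1] # column order follows the attribute tuple, so this would be the POS IDs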
-This API often leaves you with a lot of busy-work. If you're doing some machine
-learning or information extraction, all the strings have to be mapped to integers,
-and you have to save and load the mapping at training and runtime. If you want
-to display mark-up based on the annotation, you have to realign the tokens to your
-original string.
+spaCy also makes it easy to add in-line mark-up. Let's say you want to mark all
+adverbs in red:
 
-I've been writing NLP systems for almost ten years now, so I've done these
-things dozens of times. When designing spaCy, I thought carefully about how to
-make the right thing easy.
+>>> from spacy.defs import ADVERB
+>>> color = lambda t: u'\033[91m' if t.pos == ADVERB else u'\033[0m'
+>>> print u''.join(color(t) + unicode(t) for t in tokens)
 
-We begin by initializing a global vocabulary store:
+Tokens.__iter__ produces a sequence of Token objects. The Token.__unicode__
+method --- invoked by unicode(t) --- pads each token with any whitespace that
+followed it. So, u''.join(unicode(t) for t in tokens) is guaranteed to restore
+the original string.
 
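A minimal sketch of that round-trip guarantee, reusing the nlp object from the earlier example (the input string is illustrative only):

>>> text = u'An example sentence'
>>> tokens = nlp(text, pos_tag=True, parse=True)
>>> u''.join(unicode(t) for t in tokens) == text
True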
->>> from spacy.en import EN
->>> EN.load()
+spaCy is also very efficient --- much more efficient than any other language
+processing tool available. The table below compares the time to tokenize, POS
+tag and parse 100m words of text; it also shows accuracy on the standard
+evaluation, from the Wall Street Journal:
 
-The vocabulary reads in a data file with all sorts of pre-computed lexical
-features. You can load anything you like here, but by default I give you:
-
-* String IDs for the word's string, its prefix, suffix and "shape";
-* Length (in unicode code-points);
-* A cluster ID, representing distributional similarity;
-* A cluster ID, representing its typical POS tag distribution;
-* Good-Turing smoothed unigram probability;
-* 64 boolean features, for assorted orthographic and distributional features.
-
-With so many features pre-computed, you usually don't have to do any string
-processing at all. You give spaCy your string, and tell it to give you either
-a numpy array, or a counts dictionary:
-
->>> from spacy.en import feature_names as fn
->>> tokens = EN.tokenize(u'''Some string of language.''')
->>> tokens.to_array((fn.WORD, fn.SUFFIX, fn.CLUSTER))
-...
->>> tokens.count_by(fn.WORD)
-
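A minimal sketch of how that counts dictionary might be used, assuming count_by returns a plain dict mapping integer word IDs to their frequencies in the text (that return type is an assumption; the names below are illustrative only):

>>> counts = tokens.count_by(fn.WORD)
>>> total = float(sum(counts.values()))
>>> relative_freqs = {word_id: n / total for word_id, n in counts.items()}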
-If you do need strings, you can simply iterate over the Tokens object:
-
->>> for token in tokens:
-...
-
-I mostly use this for debugging and testing.
-
-spaCy returns these rich Tokens objects much faster than most other tokenizers
-can give you a list of strings --- in fact, spaCy's POS tagger is *4 times
-faster* than CoreNLP's tokenizer:
-
 +----------+----------+---------------+----------+
 | System   | Tokenize | POS Tag       |          |
@@ -75,8 +50,16 @@ faster* than CoreNLP's tokenizer:
 | ZPar     |          | ~1,500s       |          |
 +----------+----------+---------------+----------+
 
+spaCy completes its whole pipeline faster than some of the other libraries can
+tokenize the text. Its POS tag accuracy is as good as any system available.
+For parsing, I chose an algorithm that sacrificed some accuracy, in favour of
+efficiency.
+
+I wrote spaCy so that startups and other small companies could take advantage
+of the enormous progress being made by NLP academics. Academia is competitive,
+and what you're competing to do is write papers --- so it's very hard to write
+software useful to non-academics. Seeing this gap, I resigned from my post-doc,
+and wrote spaCy.
+
 .. toctree::
     :hidden:
 