spaCy is a library for industrial-strength text processing in Python and Cython.
It features extremely efficient, up-to-date algorithms, and a rethink of how those
algorithms should be accessed.

A typical text-processing API looks something like this:

>>> import nltk
>>> nltk.pos_tag(nltk.word_tokenize('''Some string of language.'''))
[('Some', 'DT'), ('string', 'VBG'), ('of', 'IN'), ('language', 'NN'), ('.', '.')]

This API often leaves you with a lot of busy-work. If you're doing some machine
learning or information extraction, all the strings have to be mapped to integers,
and you have to save and load the mapping at training and runtime. If you want
to display mark-up based on the annotation, you have to realign the tokens to your
original string.

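Concretely, the bookkeeping tends to look something like this. This is a minimal
sketch, and the Vocab class and file format here are invented for illustration:

.. code-block:: python

    import json

    class Vocab(object):
        '''Sketch of the string-to-integer bookkeeping you end up writing.'''
        def __init__(self):
            self.strings = {}

        def id_of(self, string):
            # Assign IDs first-come, first-served. They must be stable
            # across runs, so the table has to be saved and reloaded.
            return self.strings.setdefault(string, len(self.strings))

        def save(self, loc):
            with open(loc, 'w') as file_:
                json.dump(self.strings, file_)

    vocab = Vocab()
    tagged = [('Some', 'DT'), ('string', 'VBG'), ('of', 'IN')]
    ids = [(vocab.id_of(word), vocab.id_of(tag)) for word, tag in tagged]
    vocab.save('vocab.json')  # ...and remember to load it again at runtime.
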
I've been writing NLP systems for almost ten years now, so I've done these
things dozens of times. When designing spaCy, I thought carefully about how to
make the right thing easy.

We begin by initializing a global vocabulary store:

>>> from spacy.en import EN
>>> EN.load()

The vocabulary reads in a data file with all sorts of pre-computed lexical
features. You can load anything you like here, but by default I give you:

* String IDs for the word's string, its prefix, suffix and "shape";
* Length (in unicode code points);
* A cluster ID, representing distributional similarity;
* A second cluster ID, representing the word's typical POS tag distribution;
* A Good-Turing smoothed unigram probability;
* 64 boolean features, for assorted orthographic and distributional flags
  (see the sketch below).

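The boolean features can be packed into a single 64-bit integer and tested with
bit masks. Here's a sketch of the idea, with invented flag names:

.. code-block:: python

    # Sketch: packing boolean lexical flags into one 64-bit integer.
    IS_ALPHA = 1 << 0
    IS_DIGIT = 1 << 1
    IS_UPPER = 1 << 2

    def flags_of(word):
        flags = 0
        flags |= IS_ALPHA * word.isalpha()
        flags |= IS_DIGIT * word.isdigit()
        flags |= IS_UPPER * word.isupper()
        return flags

    # Computed once per word type, then any flag is a single AND away.
    flags = flags_of('NLP')
    print(bool(flags & IS_UPPER))   # True
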
With so many features pre-computed, you usually don't have to do any string
processing at all. You give spaCy your string, and tell it to give you either
a numpy array, or a counts dictionary:

>>> from spacy.en import feature_names as fn
>>> tokens = EN.tokenize(u'''Some string of language.''')
>>> tokens.to_array((fn.WORD, fn.SUFFIX, fn.CLUSTER))
...
>>> tokens.count_by(fn.WORD)

If you do need strings, you can simply iterate over the Tokens object:

>>> for token in tokens:
...

I mostly use this for debugging and testing.

spaCy returns these rich Tokens objects much faster than most other tokenizers
can give you a list of strings --- in fact, spaCy's POS tagger is *4 times
faster* than CoreNLP's tokenizer:

+----------+----------+---------------+----------+
| System   | Tokenize | POS Tag       |          |
+==========+==========+===============+==========+
| ZPar     |          | ~1,500s       |          |
+----------+----------+---------------+----------+

spaCy is designed to **make the right thing easy**, where the right thing is to:

* **Use rich distributional and orthographic features**. Without these, your model
  will be very brittle and domain dependent.

* **Compute features per type, not per token**. Because of Zipf's law, you can
  expect this to be exponentially more efficient (see the sketch below).

* **Minimize string processing**, and instead compute with arrays of ID ints.

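Here's a minimal sketch of the per-type idea, with an invented stand-in for the
real feature extraction: compute features once per unique word type, and serve
repeated tokens from a cache:

.. code-block:: python

    def expensive_features(word):
        '''Stand-in for real per-type features (clusters, probabilities, etc.).'''
        return (len(word), word[-3:], word.lower())

    def featurize(tokens):
        features = []
        cache = {}
        for word in tokens:
            # Zipf's law: a handful of frequent types cover most tokens, so
            # the cache hit-rate is high and expensive_features() runs rarely.
            if word not in cache:
                cache[word] = expensive_features(word)
            features.append(cache[word])
        return features

    featurize('the cat sat on the mat near the door'.split())
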
Tokenization done right
=======================

Most tokenizers rely on complicated regular expressions. Often, they leave you
with no way to align the tokens back to the original string --- a vital feature
if you want to display some mark-up, such as spelling correction. The regular
expressions also interact, making it hard to accommodate special cases.

spaCy introduces a **novel tokenization algorithm** that's much faster and much
more flexible:

.. code-block:: python

    def _consume_prefix(chunk, prefixes):
        '''Split off a leading punctuation character, if it's listed.'''
        if chunk and chunk[0] in prefixes:
            return chunk[0], chunk[1:]
        return '', chunk

    def _consume_suffix(chunk, suffixes):
        '''Split off a trailing punctuation character, if it's listed.'''
        if chunk and chunk[-1] in suffixes:
            return chunk[-1], chunk[:-1]
        return '', chunk

    def tokenize(string, prefixes, suffixes, specials):
        '''Sketch of spaCy's tokenization algorithm.'''
        tokens = []
        cache = {}
        for chunk in string.split():
            # Because of Zipf's law, the cache serves the majority of "chunks".
            if chunk in cache:
                tokens.extend(cache[chunk])
                continue
            key = chunk
            subtokens = []
            trailing = []
            # Process a chunk by splitting off prefixes e.g. ( " { and
            # suffixes e.g. , . : --- after each split, check whether we're
            # left with a special case, e.g. contractions (can't, won't, etc),
            # emoticons, abbreviations, etc. This makes the tokenization easy
            # to update and customize.
            while chunk:
                if chunk in specials:
                    subtokens.extend(specials[chunk])
                    break
                prefix, chunk = _consume_prefix(chunk, prefixes)
                if prefix:
                    subtokens.append(prefix)
                    continue
                suffix, chunk = _consume_suffix(chunk, suffixes)
                if suffix:
                    # Suffixes must follow the core token, in string order.
                    trailing.append(suffix)
                    continue
                # No prefix, suffix or special case: emit the rest whole.
                subtokens.append(chunk)
                break
            subtokens.extend(reversed(trailing))
            tokens.extend(subtokens)
            cache[key] = subtokens
        return tokens

Your data is going to have its own quirks, so it's really useful to have
a tokenizer you can easily control. To see the limitations of the standard
regex-based approach, check out `CMU's recent work on tokenizing tweets
<http://www.ark.cs.cmu.edu/TweetNLP/>`_. Despite a lot of careful attention,
they can't handle all of their known emoticons correctly --- doing so would
interfere with the way they process other punctuation. This isn't a problem
for spaCy: we just add them all to the special tokenization rules.

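For instance, with the sketch above, an emoticon that would otherwise be mangled
by the punctuation rules is just one more entry in the specials table. The rule
tables here are toy ones, invented for illustration:

.. code-block:: python

    prefixes = {'(', '"', '{'}
    suffixes = {',', '.', ':', ')'}
    specials = {"can't": ['ca', "n't"],
                ':-)': [':-)']}    # keep the emoticon as one token

    tokenize('''I can't believe it :-)''', prefixes, suffixes, specials)
    # ['I', 'ca', "n't", 'believe', 'it', ':-)']
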
Comparison with NLTK
====================

`NLTK <http://nltk.org>`_ provides interfaces to a wide variety of NLP
tools and resources, and its own implementations of a few algorithms. It comes
with comprehensive documentation, and a book introducing concepts in NLP. For
these reasons, it's very widely known. However, if you're trying to make money
or do cutting-edge research, NLTK is not a good choice.

The `list of stuff in NLTK <http://www.nltk.org/py-modindex.html>`_ looks impressive,
but almost none of it is useful for real work. You're not going to make any money,
or do top research, by using the NLTK chat bots, theorem provers, toy CCG implementation,
etc. Most of NLTK is there to assist in the explanation of ideas in computational
linguistics, at roughly an undergraduate level. But it also claims to support
serious work, by wrapping external tools.

In a pretty well-known essay, Joel Spolsky discusses the pain of dealing with
`leaky abstractions <http://www.joelonsoftware.com/articles/LeakyAbstractions.html>`_.
An abstraction tells you not to care about implementation
details, but sometimes the implementation matters after all. When it
does, you have to waste time revising your assumptions.

NLTK's wrappers call external tools via subprocesses, and wrap this up so
that it looks like a native API. This abstraction leaks *a lot*. The system
calls impose far more overhead than a normal Python function call, which makes
the most natural way to program against the API infeasible.

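You don't even need NLTK to see the scale of the problem. Compare a normal
function call with an equivalent round-trip through a fresh process, which is
roughly what a wrapper that holds no persistent pipe does on every call. This
is a minimal sketch, and the absolute numbers will vary by machine:

.. code-block:: python

    import subprocess
    import timeit

    def native_call(text):
        '''A normal, in-process function call.'''
        return text.split()

    def subprocess_call(text):
        '''Spawn a fresh interpreter and round-trip the text through it.'''
        proc = subprocess.Popen(
            ['python', '-c', 'import sys; sys.stdout.write(sys.stdin.read())'],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        output, _ = proc.communicate(text.encode('utf8'))
        return output

    text = 'Some string of language.'
    print(timeit.timeit(lambda: native_call(text), number=100))
    print(timeit.timeit(lambda: subprocess_call(text), number=100))
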
Case study: POS tagging
-----------------------

Here's a quick comparison of the following POS taggers:

* **Stanford (CLI)**: The Stanford POS tagger, invoked once as a batch process
  from the command-line;
* **nltk.tag.stanford**: The Stanford tagger, invoked document-by-document via
  NLTK's wrapper;
* **nltk.pos_tag**: NLTK's own POS tagger, invoked document-by-document;
* **spacy.en.pos_tag**: spaCy's POS tagger, invoked document-by-document.

+-------------------+-------------+--------+
| System            | Speed (w/s) | % Acc. |
+===================+=============+========+
| spaCy             | 107,000     | 96.7   |
+-------------------+-------------+--------+
| Stanford (CLI)    | 8,000       | 96.7   |
+-------------------+-------------+--------+
| nltk.pos_tag      | 543         | 94.0   |
+-------------------+-------------+--------+
| nltk.tag.stanford | 209         | 96.7   |
+-------------------+-------------+--------+

Experimental details TODO. Three things are apparent from this comparison:

1. The native NLTK tagger, nltk.pos_tag, is both slow and inaccurate;

2. Calling the Stanford tagger document-by-document via NLTK is **40x** slower
   than invoking the model once as a batch process, via the command-line;

3. spaCy is over 10x faster than the Stanford tagger, even when called
   **sentence-by-sentence**.

The problem is that NLTK simply wraps the command-line
interfaces of these tools, so communication is via a subprocess. NLTK does not
even hold open a pipe for you --- the model is reloaded, again and again.

To use the wrapper effectively, you should batch up your text as much as possible.
This probably isn't how you would like to structure your pipeline, and you
might not be able to batch up much text at all, e.g. if serving a single
request means processing a single document.

Technically, NLTK does give you Python functions to access lots of different
systems --- but you can't use them as you would expect to use a normal Python
function. The abstraction leaks.

Here's the bottom line: the Stanford tools are written in Java, so using them
from Python sucks. You shouldn't settle for this. It's a problem that springs
purely from the tooling, rather than the domain.

Summary
-------

NLTK is a well-known Python library for NLP, but for the important bits, you
don't get actual Python modules. You get wrappers which shell out to external
tools, via subprocesses. This is not at all the same thing.

spaCy is implemented in Cython, just like numpy, scikit-learn, lxml and other
high-performance Python libraries. So you get a native Python API, but the
performance you expect from a program written in C.

.. toctree::