
.. spaCy documentation master file, created by
   sphinx-quickstart on Tue Aug 19 16:27:38 2014.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

spaCy NLP Tokenizer and Lexicon
================================

spaCy is an industrial-strength multi-language tokenizer, bristling with features
you never knew you wanted. You do want these features, though: your current
tokenizer has been doing it wrong.
Where other tokenizers give you a list of strings, spaCy gives you references
to rich lexical types, for easy, excellent and efficient feature extraction.

* **Easy**: Tokenizer returns a sequence of rich lexical types, with features
  pre-computed:

    >>> from spacy.en import EN
    >>> string = u'Hello, world. Here are two sentences.'
    >>> for w in EN.tokenize(string):
    ...     print w.sic, w.shape, w.cluster, w.oft_title, w.can_verb

  Check out the tutorial and API docs.

* **Excellent**: Distributional and orthographic features are crucial to robust
  NLP. Without them, models can only learn from tiny annotated training
  corpora.  Read more.

* **Efficient**: spaCy serves you rich lexical objects faster than most
  tokenizers can give you a list of strings.

+--------+-------+--------------+--------------+
| System | Time  | Words/second | Speed Factor |
+========+=======+==============+==============+
| NLTK   | 6m4s  | 89,000       | 1.00         |
+--------+-------+--------------+--------------+
| spaCy  | 9.5s  | 3,093,000    | 38.30        |
+--------+-------+--------------+--------------+
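
The idea of handing back references to lexical types, rather than a list of
strings, can be illustrated with a minimal sketch. This is not spaCy's
implementation: ``Lexeme``, ``Lexicon`` and ``word_shape`` are hypothetical
names invented for this example, whitespace splitting stands in for real
tokenization, and the shape scheme shown is only one plausible choice.

.. code-block:: python

    # Hypothetical sketch; none of these names belong to spaCy's API.

    def word_shape(string):
        """Map characters to X (upper), x (lower), d (digit); keep the rest."""
        shape = []
        for char in string:
            if char.isupper():
                shape.append('X')
            elif char.islower():
                shape.append('x')
            elif char.isdigit():
                shape.append('d')
            else:
                shape.append(char)
        return ''.join(shape)


    class Lexeme(object):
        """One object per distinct string, with features computed once."""

        def __init__(self, string):
            self.sic = string                # the string exactly as written
            self.shape = word_shape(string)  # orthographic feature
            self.is_title = string.istitle()


    class Lexicon(object):
        """Cache that maps each distinct string to its single Lexeme."""

        def __init__(self):
            self._lexemes = {}

        def lookup(self, string):
            lexeme = self._lexemes.get(string)
            if lexeme is None:
                # First sighting: pay the feature-computation cost once.
                lexeme = Lexeme(string)
                self._lexemes[string] = lexeme
            return lexeme

        def tokenize(self, text):
            # Assumption: whitespace splitting stands in for a real tokenizer.
            return [self.lookup(piece) for piece in text.split()]


    lexicon = Lexicon()
    tokens = lexicon.tokenize('Apple sells 42 apples and Apple buys pears')

Under these assumptions, both occurrences of ``Apple`` refer to the same
object, so reading ``tokens[0].shape`` is an attribute lookup rather than a
recomputation.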


.. toctree::
    :hidden:
    :maxdepth: 3

    what/index.rst
    why/index.rst
    how/index.rst