* Work on intro copy

2025-10-15 08:16:36 +03:00 · 2014-11-03 00:13:19 +11:00 · f1c3e17c80
commit f1c3e17c80
parent fa91506073
1 changed file with 19 additions and 27 deletions
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -6,36 +6,28 @@
 spaCy NLP Tokenizer and Lexicon
 ================================

-spaCy is an industrial-strength multi-language tokenizer, bristling with features
-you never knew you wanted. You do want these features though --- your current
-tokenizer has been doing it wrong.
-Where other tokenizers give you a list of strings, spaCy gives you references
-to rich lexical types, for easy, excellent and efficient feature extraction.
+spaCy is a library for industrial strength NLP in Python and Cython.  Its core
+values are efficiency, accuracy and minimalism.  

-* **Easy**: Tokenizer returns a sequence of rich lexical types, with features
-  pre-computed:
+* Efficiency: spaCy is 

-    >>> from spacy.en import EN
-    >>> for w in EN.tokenize(string):
-    ...   print w.sic, w.shape, w.cluster, w.oft_title, w.can_verb
-
-Check out the tutorial and API docs.
-
-* **Excellent**: Distributional and orthographic features are crucial to robust
-  NLP. Without them, models can only learn from tiny annotated training
-  corpora.  Read more.
-  
-* **Efficient**: spaCy serves you rich lexical objects faster than most
-  tokenizers can give you a list of strings.  
-
-+--------+-------+--------------+--------------+
-| System | Time	 | Words/second | Speed Factor |
-+--------+-------+--------------+--------------+
-| NLTK	 | 6m4s  | 89,000       | 1.00         |
-+--------+-------+--------------+--------------+
-| spaCy	 | 9.5s	 | 3,093,000	| 38.30        |
-+--------+-------+--------------+--------------+
+It does not attempt to be comprehensive,
+or to provide lavish syntactic sugar.  This isn't a library that covers 43 known
+algorithms to do X. You get 1 --- the best one --- with a simple, low-level interface. 
+For commercial users, the code is free but the data isn't.  For researchers, both
+are free and always will be.

+Comparison
+----------
+-------------+-------------+---+-----------+--------------+
+| POS taggers | Speed (w/s) | % Acc. (news) | % Acc. (web) |
+-------------+-------------+---------------+--------------+
+| spaCy       |             |               |              |
+-------------+-------------+---------------+--------------+
+| Stanford    | 16,000      |               |              |
+-------------+-------------+---------------+--------------+
+| NLTK        |             |               |              |
+-------------+-------------+---------------+--------------+


 .. toctree::