* Rework intro text

This commit is contained in:
Matthew Honnibal 2015-01-25 00:58:52 +11:00
parent 83a7e91f3c
commit 70d4a9dcc5


@@ -8,21 +8,42 @@ spaCy: Industrial-strength NLP
 ==============================
 
 spaCy is a new library for text processing in Python and Cython.
-Most commercial NLP development is based on obsolete
-technology. Over the last 3-5 years, the field has advanced dramatically, but
-only the tech giants have really been able to capitalize. The research is all
-public, but it's been too hard for small companies to read and apply it.
-Many end up relying on `NLTK`_, which is intended primarily as an educational
-resource.
-
-.. _NLTK: https://www.nltk.org/
+I wrote it because I think small companies are terrible at NLP. Or rather:
+small companies are using terrible NLP technology.
+
+To do great NLP, you have to know a little about linguistics, a lot
+about machine learning, and almost everything about the latest research.
+The people who fit this description seldom join small companies, and almost
+never start them. Most are broke --- they've just finished grad school.
+If they don't want to stay in academia, they join Google, IBM, etc.
+
+The net result is that outside of the tech giants, commercial NLP has changed
+little in the last ten years. In academia, it's changed entirely. Amazing
+improvements in quality. Orders of magnitude faster. But the
+academic code is always GPL, undocumented, unuseable, or all three. You could
+implement the ideas yourself, but the papers are hard to read, and training
+data is exorbitantly expensive. So what are you left with? NLTK?
+
+I used to think that the NLP community just needed to do more to communicate
+its findings to software engineers. So I wrote two blog posts, explaining
+`how to write a part-of-speech tagger`_ and `parser`_. Both were very well received,
+and there's been a bit of interest in `my research software`_ --- even though
+it's entirely undocumented, and mostly unuseable to anyone but me.
+
+.. _`my research software`: https://github.com/syllog1sm/redshift/tree/develop
+.. _`how to write a part-of-speech tagger`: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/
+.. _`parser`: https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/
+
+So six months ago I quit my post-doc, and I've been working day and night on
+spaCy since. I'm now pleased to announce an alpha release.
 
 If you're a small company doing NLP, I think spaCy will seem like a minor miracle.
-It's by far the fastest NLP software available. The full processing pipeline
-completes in 7ms per document, including accurate tagging and parsing. All strings
-are mapped to integer IDs, tokens are linked to embedded word representations,
-and a range of useful features are pre-calculated and cached.
+The full processing pipeline completes in 7ms per document, including accurate
+tagging and parsing. All strings are mapped to integer IDs, tokens are linked
+to embedded word representations, and a range of useful features are pre-calculated
+and cached.
 
 If none of that made any sense to you, here's the gist of it. Computers don't
 understand text. This is unfortunate, because that's what the web almost entirely
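The intro text above says that spaCy maps all strings to integer IDs so that later comparisons and feature lookups work on cheap integers rather than Python strings. The following is a minimal toy sketch of that string-interning idea; the `StringStore` class name and its methods are illustrative assumptions for this example, not spaCy's actual implementation or API.

```python
class StringStore:
    """Toy bidirectional string <-> integer ID mapping (string interning)."""

    def __init__(self):
        self._ids = {}      # string -> integer ID
        self._strings = []  # integer ID -> string

    def __call__(self, string):
        """Return the ID for `string`, interning it on first sight."""
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def decode(self, id_):
        """Recover the original string from its ID."""
        return self._strings[id_]


store = StringStore()
ids = [store(w) for w in "the cat sat on the mat".split()]
# ids == [0, 1, 2, 3, 0, 4]: both occurrences of "the" share ID 0,
# so an equality check or a dict lookup never touches the string again.
```

The payoff of this design is that downstream components (taggers, parsers, feature caches) can key everything on small integers, which is one reason the per-document pipeline cost quoted above can stay so low.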