* Update docs

Matthew Honnibal 2014-10-15 21:50:34 +11:00
parent 849de654e7
commit df110476d5
10 changed files with 147 additions and 131 deletions

File diff suppressed because one or more lines are too long

(image file added: 198 KiB)

@ -1,106 +1,71 @@
Don't Settle for a List of Strings
==================================
Overview
========
What and Why
------------
*"Other NLP tokenizers return lists of strings, which is downright
barbaric."* --- me
spaCy is a lightning-fast, full-cream NLP tokenizer and lexicon.
Most tokenizers give you a sequence of strings. That's barbaric.
Giving you strings invites you to compute on every *token*, when what
you should be doing is computing on every *type*. Remember
`Zipf's law <http://en.wikipedia.org/wiki/Zipf's_law>`_: you'll
see exponentially fewer types than tokens.
spaCy splits text into a list of lexical types, which come with a variety of
features pre-computed. It's designed to **make the right thing easy**, where the right
thing is:
* A global vocabulary store;
* Cached orthographic features;
* Clever use of distributional data.

Instead of strings, spaCy gives you references to Lexeme objects, from which you
can access an excellent set of pre-computed orthographic and distributional features::

    >>> from spacy import en
    >>> apples, are, nt, oranges, dots = en.EN.tokenize(u"Apples aren't oranges...")
    >>> are.prob >= oranges.prob
    True
    >>> apples.check_flag(en.IS_TITLE)
    True
    >>> apples.check_flag(en.OFT_TITLE)
    False
    >>> are.check_flag(en.CAN_NOUN)
    False
spaCy makes it easy to write very efficient NLP applications, because your feature
functions have to do almost no work: almost every lexical property you'll want
is pre-computed for you. See the tutorial for an example POS tagger.
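For instance, a feature function over spaCy tokens can often just read
attributes that are already there. A minimal sketch, using only the attribute
names shown elsewhere on this page (the exact feature set is up to you)::

    def lexical_features(word):
        # Every value here is a pre-computed Lexeme attribute: the feature
        # function does no string processing of its own.
        return (word.shape, word.cluster, word.oft_title, word.can_verb)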
Let's say you're writing an entity tagger for English. Case distinctions are an
important feature here: you need to know whether the word you're tagging is
upper-cased, lower-cased, title-cased, non-alphabetic, etc.
The right thing is to call string.isupper(), string.islower(), string.isalpha()
and so on once for every *type* in your vocabulary, instead of once for every
*token* in the text you're tagging.
When you encounter a new word, you want to create a lexeme object, calculate its
features, and save it.
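In plain Python, that per-type pattern looks something like the sketch below
(made-up names; this is the idea spaCy implements for you, not spaCy's
internals)::

    FEATURE_CACHE = {}

    def case_features(string):
        # Compute the case features once per *type*; every later token of the
        # same type costs only a dictionary lookup.
        if string not in FEATURE_CACHE:
            FEATURE_CACHE[string] = {
                'is_upper': string.isupper(),
                'is_lower': string.islower(),
                'is_alpha': string.isalpha(),
                'is_title': string.istitle(),
            }
        return FEATURE_CACHE[string]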
That's the *right* way to do it, so it's what spaCy does for you.
Other tokenizers give you a list of strings, which makes it really easy to do
the wrong thing. And the wrong thing isn't just a little bit worse: it's
**exponentially** worse, because of
`Zipf's law <http://en.wikipedia.org/wiki/Zipf's_law>`_.
.. raw:: html

    <center>
        <figure>
            <embed
                width="650em" height="auto"
                type="image/svg+xml" src="chart.svg"/>
        </figure>
    </center>
Over the Gigaword corpus, if you compute some feature on a per-token basis, you'll
make **500x more calls** to that function than if you had computed it on a per-type
basis.
(Mouse-over a line to see its value at that point. And yes, it's a bit snarky
to present the graph on a linear scale --- but it isn't misleading.)
Zipf's Law also makes distributional information a really powerful source of
type-based features. It's really handy to know where a word falls in the language's
frequency distribution, especially compared to variants of the word. For instance,
we might be processing a Twitter comment that contains the string "nasa". We have
little hope of recognising this as an entity except by noting that the string "NASA"
is much more common, and that both strings are quite rare.
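Using the API from the example above, that comparison is a couple of lookups
(a sketch; the exact probabilities depend on the lexicon data)::

    >>> from spacy import en
    >>> nasa_lower, nasa_upper = en.EN.tokenize(u"nasa NASA")
    >>> nasa_upper.prob > nasa_lower.prob
    True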
.. Each spaCy Lexeme comes with a rich, curated set of orthographic and
.. distributional features. Different languages get a different set of features,
.. to take into account different orthographic conventions and morphological
.. complexity. It's also easy to define your own features.
.. And, of course, we take care to get the details right. Indices into the original
.. text are always easy to calculate, so it's easy to, say, mark entities with in-line
.. mark-up. You'll also receive tokens for newlines, tabs and other non-space whitespace,
.. making it easy to do paragraph and sentence recognition. And, of course, we deal
.. smartly with all the random unicode whitespace and punctuation characters you might
.. not have thought of.
Benchmarks
----------
Here we ask two questions:
1. How fast is the spaCy tokenizer itself, relative to other tokenizers?
2. How fast are applications using spaCy's pre-computed lexical features,
compared to applications that re-compute their features on every token?
On the first question, the tokenizer itself is very efficient:
+--------+-------+--------------+--------------+
| System | Time  | Words/second | Speed Factor |
+--------+-------+--------------+--------------+
| NLTK   | 6m4s  | 89,000       | 1.00         |
+--------+-------+--------------+--------------+
| spaCy  | 9.5s  | 3,093,000    | 38.30        |
+--------+-------+--------------+--------------+
spaCy uses more memory than a standard tokenizer, but is far more efficient. We
compare against the NLTK tokenizer and the Penn Treebank's tokenizer.sed script.
We also give the performance of Python's native string.split, for reference.
The comparison refers to 30 million words from the English Gigaword, on
a MacBook Air. For context, calling string.split() on the data completes in
about 5s.
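Something like the following is enough to get a rough comparison on your own
machine (a sketch, not the script behind the numbers above; the corpus path is
a placeholder, and NLTK's punkt model must be installed)::

    import io
    import time

    import nltk
    from spacy.en import EN

    with io.open('gigaword_sample.txt', encoding='utf8') as f:  # placeholder path
        text = f.read()

    start = time.time()
    nltk.word_tokenize(text)
    print 'NLTK:  %.1fs' % (time.time() - start)

    start = time.time()
    EN.tokenize(text)
    print 'spaCy: %.1fs' % (time.time() - start)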
Pros and Cons
-------------
Pros:

- All tokens come with indices into the original string
- Full unicode support
- Extensible to other languages
- Batch operations computed efficiently in Cython
- Cython API
- numpy interoperability

Cons:

- It's new (released September 2014)
- Security concerns, from memory management
- Higher memory usage (up to 1 GB)
- More conceptually complicated
- Tokenization rules expressed in code, not as data

docs/source/how/index.rst Normal file

@ -0,0 +1,13 @@
How
===
Tutorial
--------
Installation
------------
API
---


@ -6,67 +6,42 @@
spaCy NLP Tokenizer and Lexicon
================================
spaCy splits a string of natural language into a list of references to lexical types.
spaCy is an industrial-strength multi-language tokenizer, bristling with features
you never knew you wanted. You do want these features though --- your current
tokenizer has been doing it wrong.
Where other tokenizers give you a list of strings, spaCy gives you references
to rich lexical types, for easy, excellent and efficient feature extraction.
* **Easy**: Tokenizer returns a sequence of rich lexical types, with features
  pre-computed:

  >>> from spacy.en import EN
  >>> tokens = EN.tokenize(u"Examples aren't easy, are they?")
  >>> type(tokens[0])
  spacy.word.Lexeme
  >>> tokens[1] is tokens[5]
  True
  >>> for w in EN.tokenize(string):
  ...     print w.sic, w.shape, w.cluster, w.oft_title, w.can_verb
Other tokenizers return lists of strings, which is
`downright barbaric <guide/overview.html>`__. If you get a list of strings,
you have to write all the features yourself, and you'll probably compute them
on a per-token basis, instead of a per-type basis. At scale, that's very
inefficient.
Check out the tutorial and API docs.
* **Excellent**: Distributional and orthographic features are crucial to robust
  NLP. Without them, models can only learn from tiny annotated training
  corpora. Read more.

* **Efficient**: spaCy serves you rich lexical objects faster than most
  tokenizers can give you a list of strings.

  +--------+-------+--------------+--------------+
  | System | Time  | Words/second | Speed Factor |
  +--------+-------+--------------+--------------+
  | NLTK   | 6m4s  | 89,000       | 1.00         |
  +--------+-------+--------------+--------------+
  | spaCy  | 9.5s  | 3,093,000    | 38.30        |
  +--------+-------+--------------+--------------+

spaCy's tokens come with the following orthographic and distributional features
pre-computed:

* Orthographic flags, such as is_alpha, is_digit, is_punct, is_title etc;
* Useful string transforms, such as canonical casing, word shape, ASCIIfied, etc;
* Unigram log probability;
* Brown cluster;
* can_noun, can_verb etc tag-dictionary;
* oft_upper, oft_title etc case-behaviour flags.

The features are up-to-date with current NLP research, but you can replace or
augment them if you need to.
.. toctree::
    :hidden:
    :maxdepth: 3

    guide/overview.rst
    guide/install.rst
    api/index.rst
    modules/index.rst
License
=======
+------------------+------+
| Non-commercial   | $0   |
+------------------+------+
| Trial commercial | $0   |
+------------------+------+
| Full commercial  | $500 |
+------------------+------+

spaCy is non-free software. Its source is published, but the copyright is
retained by the author (Matthew Honnibal). Licenses are currently under preparation.

There is currently a gap between the output of academic NLP researchers and
the needs of small software companies. I left academia to try to correct this.
My idea is that non-commercial and trial commercial use should "feel" just like
free software. But if you do use the code in a commercial product, a small
fixed license fee will apply, in order to fund development.
what/index.rst
why/index.rst
how/index.rst


@ -0,0 +1,31 @@
What
====
Overview
--------
Feature List
------------
License (for the code)
----------------------

+------------------+------+
| Non-commercial   | $0   |
+------------------+------+
| Trial commercial | $0   |
+------------------+------+
| Full commercial  | $500 |
+------------------+------+

spaCy is non-free software. Its source is published, but the copyright is
retained by the author (Matthew Honnibal). Licenses are currently under preparation.

There is currently a gap between the output of academic NLP researchers and
the needs of small software companies. I left academia to try to correct this.
My idea is that non-commercial and trial commercial use should "feel" just like
free software. But if you do use the code in a commercial product, a small
fixed license fee will apply, in order to fund development.
Pricing (for the data)
----------------------

docs/source/why/index.rst Normal file

@ -0,0 +1,28 @@
Why
===
Benchmarks
----------
Efficiency
----------
+--------+-------+--------------+--------------+
| System | Time  | Words/second | Speed Factor |
+--------+-------+--------------+--------------+
| NLTK   | 6m4s  | 89,000       | 1.00         |
+--------+-------+--------------+--------------+
| spaCy  | 9.5s  | 3,093,000    | 38.30        |
+--------+-------+--------------+--------------+
Accuracy
--------
The comparison refers to 30 million words from the English Gigaword, on
a MacBook Air. For context, calling string.split() on the data completes in
about 5s.
Pros and Cons
-------------