* Upd docs

Matthew Honnibal 2014-09-26 18:40:18 +02:00
parent 096ef2b199
commit 0c6402ab73
2 changed files with 128 additions and 55 deletions


@ -1,71 +1,106 @@
Don't Settle for a List of Strings
==================================

What and Why
------------

spaCy is a lightning-fast, full-cream NLP tokenizer and lexicon.

*"Other NLP tokenizers return lists of strings, which is downright
barbaric."* --- me

Most tokenizers give you a sequence of strings. That's barbaric.
Giving you strings invites you to compute on every *token*, when what
you should be doing is computing on every *type*. Remember
`Zipf's law <http://en.wikipedia.org/wiki/Zipf's_law>`_: you'll
see exponentially fewer types than tokens.

spaCy splits text into a list of lexical types, which come with a variety of
features pre-computed. It's designed to **make the right thing easy**, where the
right thing is:

* A global vocabulary store;

* Cached orthographic features;

* Clever use of distributional data.

Instead of strings, spaCy gives you references to Lexeme objects, from which you
can access an excellent set of pre-computed orthographic and distributional
features::

    >>> from spacy import en
    >>> apples, are, nt, oranges, dots = en.EN.tokenize(u"Apples aren't oranges...")
    >>> are.prob >= oranges.prob
    True
    >>> apples.check_flag(en.IS_TITLE)
    True
    >>> apples.check_flag(en.OFT_TITLE)
    False
    >>> are.check_flag(en.CAN_NOUN)
    False

spaCy makes it easy to write very efficient NLP applications, because your feature
functions have to do almost no work: almost every lexical property you'll want
is pre-computed for you. See the tutorial for an example POS tagger.

Let's say you're writing an entity tagger for English. Case distinctions are an
important feature here: you need to know whether the word you're tagging is
upper-cased, lower-cased, title-cased, non-alphabetic, etc.

The right thing is to call the string.isupper(), string.islower(), string.isalpha()
etc. functions once for every *type* in your vocabulary, instead of once for every
*token* in the text you're tagging. When you encounter a new word, you want to
create a lexeme object, calculate its features, and save it.

That's the *right* way to do it, so it's what spaCy does for you.
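
As a rough illustration of that pattern, here is a minimal sketch in plain
Python (not spaCy's actual implementation) that computes case features once
per distinct string and reuses them for every later token::

    # Minimal sketch: compute case features once per *type*, reuse per *token*.
    _feature_cache = {}

    def case_features(word):
        """Return (is_upper, is_lower, is_title, is_alpha), computed at most
        once per distinct string."""
        if word not in _feature_cache:
            _feature_cache[word] = (word.isupper(), word.islower(),
                                    word.istitle(), word.isalpha())
        return _feature_cache[word]

    tokens = u"The the THE the the The".split()
    features = [case_features(w) for w in tokens]  # only 3 distinct computations
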
Other tokenizers give you a list of strings, which makes it really easy to do
the wrong thing. And the wrong thing isn't just a little bit worse: it's
**exponentially** worse, because of
`Zipf's law <http://en.wikipedia.org/wiki/Zipf's_law>`_.

.. raw:: html

    <center>
        <figure>
            <embed
                width="650em" height="auto"
                type="image/svg+xml" src="chart.svg"/>
        </figure>
    </center>

Over the Gigaword corpus, if you compute some feature on a per-token basis, you'll
make **500x more calls** to that function than if you had computed it on a
per-type basis.

(Mouse-over a line to see its value at that point. And yes, it's a bit snarky
to present the graph in a linear scale --- but it isn't misleading.)
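
If you want to check that ratio on your own data, a rough sketch is below. It
uses crude whitespace splitting rather than spaCy's tokenizer, and the corpus
path is up to you::

    # Count how many per-token calls you would make versus per-type calls.
    from collections import Counter

    def token_vs_type_calls(path):
        counts = Counter()
        with open(path) as file_:
            for line in file_:
                counts.update(line.split())    # crude whitespace tokenization
        n_tokens = sum(counts.values())        # one feature call per token
        n_types = len(counts)                  # one feature call per type
        return n_tokens, n_types, n_tokens / float(n_types)
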
Zipf's Law also makes distributional information a really powerful source of
type-based features. It's really handy to know where a word falls in the language's
frequency distribution, especially compared to variants of the word. For instance,
we might be processing a Twitter comment that contains the string "nasa". We have
little hope of recognising this as an entity except by noting that the string "NASA"
is much more common, and that both strings are quite rare.
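
For illustration only (the frequency table and threshold below are invented,
and this is not spaCy's API), that kind of comparison might look roughly like
this::

    # Hypothetical unigram counts from some large corpus.
    word_freq = {u'nasa': 3, u'Nasa': 210, u'NASA': 4100}

    def looks_like_entity(word):
        """Guess that a word behaves like a proper noun if a cased variant of
        it is far more frequent than the form we actually saw."""
        variants = {word.lower(), word.upper(), word.title()} - {word}
        seen = word_freq.get(word, 0) + 1
        best_variant = max(word_freq.get(v, 0) + 1 for v in variants)
        return best_variant >= 10 * seen

    assert looks_like_entity(u'nasa')   # 'NASA' is much more common than 'nasa'
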
.. Each spaCy Lexeme comes with a rich, curated set of orthographic and
.. distributional features. Different languages get a different set of features,
.. to take into account different orthographic conventions and morphological
.. complexity. It's also easy to define your own features.
.. And, of course, we take care to get the details right. Indices into the original
.. text are always easy to calculate, so it's easy to, say, mark entities with in-line
.. mark-up. You'll also receive tokens for newlines, tabs and other non-space whitespace,
.. making it easy to do paragraph and sentence recognition. And, of course, we deal
.. smartly with all the random unicode whitespace and punctuation characters you might
.. not have thought of.

Benchmarks
----------

Here we ask two things:

1. How fast is the spaCy tokenizer itself, relative to other tokenizers?

2. How fast are applications using spaCy's pre-computed lexical features,
   compared to applications that re-compute their features on every token?

+--------+-------+--------------+--------------+
| System | Time  | Words/second | Speed Factor |
+========+=======+==============+==============+
| NLTK   | 6m4s  | 89,000       | 1.00         |
+--------+-------+--------------+--------------+
| spaCy  | 9.5s  | 3,093,000    | 38.30        |
+--------+-------+--------------+--------------+

The comparison refers to 30 million words from the English Gigaword, processed
on a MacBook Air. We compare against the NLTK tokenizer; for context, calling
Python's native string.split() on the same data completes in about 5s.

spaCy uses more memory than a standard tokenizer, but is far more efficient.
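
If you want a rough sense of how such a timing could be reproduced, a sketch is
below. It assumes a plain UTF-8 text file with one document per line, and it is
not the script used to produce the table above::

    import io
    import time

    from spacy.en import EN

    def words_per_second(path):
        with io.open(path, encoding='utf8') as file_:
            lines = file_.read().split(u'\n')
        start = time.time()
        n_tokens = 0
        for line in lines:
            if line:
                n_tokens += len(EN.tokenize(line))
        elapsed = time.time() - start
        return n_tokens / elapsed
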
Pros and Cons
-------------

Pros:

- All tokens come with indices into the original string
- Full unicode support
- Extensible to other languages
- Batch operations computed efficiently in Cython
- Cython API
- numpy interoperability

Cons:

- It's new (released September 2014)
- Security concerns, from memory management
- Higher memory usage (up to 1GB)
- More conceptually complicated
- Tokenization rules expressed in code, not as data


@ -6,6 +6,40 @@
spaCy NLP Tokenizer and Lexicon
================================
spaCy splits a string of natural language into a list of references to lexical
types::

    >>> from spacy.en import EN
    >>> tokens = EN.tokenize(u"Examples aren't easy, are they?")
    >>> type(tokens[0])
    spacy.word.Lexeme
    >>> tokens[1] is tokens[5]
    True

Other tokenizers return lists of strings, which is
`downright barbaric <guide/overview.html>`__. If you get a list of strings,
you have to write all the features yourself, and you'll probably compute them
on a per-token basis, instead of a per-type basis. At scale, that's very
inefficient.
spaCy's tokens come with the following orthographic and distributional features
pre-computed:

* Orthographic flags, such as is_alpha, is_digit, is_punct, is_title, etc.;
* Useful string transforms, such as canonical casing, word shape, ASCIIfied,
  etc.;
* Unigram log probability;
* Brown cluster;
* Tag-dictionary flags, such as can_noun, can_verb, etc.;
* Case-behaviour flags, such as oft_upper, oft_title, etc.

The features are up-to-date with current NLP research, but you can replace or
augment them if you need to.
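
For example, following the tokenizer and flag names used in the overview (the
exact values you get back will depend on the lexicon's data, so treat this as
illustrative rather than a captured session)::

    >>> from spacy import en
    >>> the, nasa, announcement = en.EN.tokenize(u"The NASA announcement")
    >>> the.prob >= nasa.prob                 # unigram log probability
    >>> the.check_flag(en.IS_TITLE)           # orthographic flag
    >>> nasa.check_flag(en.OFT_TITLE)         # case-behaviour flag
    >>> announcement.check_flag(en.CAN_NOUN)  # tag-dictionary flag
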
.. toctree::
    :maxdepth: 3
@ -15,20 +49,24 @@ spaCy NLP Tokenizer and Lexicon
    api/index.rst
    modules/index.rst
Source (GitHub)
----------------
http://github.com/honnibal/spaCy
License
=======

Copyright Matthew Honnibal

+------------------+------+
| Non-commercial   | $0   |
+------------------+------+
| Trial commercial | $0   |
+------------------+------+
| Full commercial  | $500 |
+------------------+------+

spaCy is non-free software. Its source is published, but the copyright is
retained by the author (Matthew Honnibal). Licenses are currently under preparation.
There is currently a gap between the output of academic NLP researchers and
the needs of small software companies. I left academia to try to correct this.
My idea is that non-commercial and trial commercial use should "feel" just like
free software. But if you do use the code in a commercial product, a small fixed
license fee will apply, in order to fund development.
honnibal@gmail.com