* Update docs

Matthew Honnibal 2014-10-15 21:50:34 +11:00
parent 849de654e7
commit df110476d5
10 changed files with 147 additions and 131 deletions

File diff suppressed because one or more lines are too long

(image file added: 198 KiB)

@ -1,106 +1,71 @@
Don't Settle for a List of Strings
==================================
Overview
========
What and Why
------------
*"Other NLP tokenizers return lists of strings, which is downright
barbaric."* --- me
spaCy is a lightning-fast, full-cream NLP tokenizer and lexicon.
Most tokenizers give you a sequence of strings. That's barbaric.
Giving you strings invites you to compute on every *token*, when what
you should be doing is computing on every *type*. Remember
`Zipf's law <http://en.wikipedia.org/wiki/Zipf's_law>`_: you'll
see exponentially fewer types than tokens.
spaCy splits text into a list of lexical types, which come with a variety of
features pre-computed. It's designed to **make the right thing easy**, where the right
thing is:
* A global vocabulary store;
* Cached orthographic features;
* Clever use of distributional data.

Instead of strings, spaCy gives you references to Lexeme objects, from which you
can access an excellent set of pre-computed orthographic and distributional features::

    >>> from spacy import en
    >>> apples, are, nt, oranges, dots = en.EN.tokenize(u"Apples aren't oranges...")
    >>> are.prob >= oranges.prob
    True
    >>> apples.check_flag(en.IS_TITLE)
    True
    >>> apples.check_flag(en.OFT_TITLE)
    False
    >>> are.check_flag(en.CAN_NOUN)
    False
spaCy makes it easy to write very efficient NLP applications, because your feature
functions have to do almost no work: almost every lexical property you'll want
is pre-computed for you. See the tutorial for an example POS tagger.
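For instance, a feature function over spaCy tokens can often just read
attributes that are already there. A minimal sketch, using only the attribute
names shown elsewhere on this page (the exact feature set is up to you)::

    def lexical_features(word):
        # Every value here is a pre-computed Lexeme attribute: the feature
        # function does no string processing of its own.
        return (word.shape, word.cluster, word.oft_title, word.can_verb)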
Let's say you're writing an entity tagger for English. Case distinctions are an
important feature here: you need to know whether the word you're tagging is
upper-cased, lower-cased, title-cased, non-alphabetic, etc.
The right thing is to call string.isupper(), string.islower(), string.isalpha()
and so on once for every *type* in your vocabulary, instead of once for every
*token* in the text you're tagging.
When you encounter a new word, you want to create a lexeme object, calculate its
features, and save it.
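In plain Python, that per-type pattern looks something like the sketch below
(made-up names; this is the idea spaCy implements for you, not spaCy's
internals)::

    FEATURE_CACHE = {}

    def case_features(string):
        # Compute the case features once per *type*; every later token of the
        # same type costs only a dictionary lookup.
        if string not in FEATURE_CACHE:
            FEATURE_CACHE[string] = {
                'is_upper': string.isupper(),
                'is_lower': string.islower(),
                'is_alpha': string.isalpha(),
                'is_title': string.istitle(),
            }
        return FEATURE_CACHE[string]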
That's the *right* way to do it, so it's what spaCy does for you.
Other tokenizers give you a list of strings, which makes it really easy to do
the wrong thing. And the wrong thing isn't just a little bit worse: it's
**exponentially** worse, because of
`Zipf's law <http://en.wikipedia.org/wiki/Zipf's_law>`_.
.. raw:: html

    <center>
        <figure>
            <embed
                width="650em" height="auto"
                type="image/svg+xml" src="chart.svg"/>
        </figure>
    </center>
Over the Gigaword corpus, if you compute some feature on a per-token basis, you'll
make **500x more calls** to that function than if you had computed it on a per-type
basis.
(Mouse-over a line to see its value at that point. And yes, it's a bit snarky
to present the graph on a linear scale --- but it isn't misleading.)
Zipf's Law also makes distributional information a really powerful source of
type-based features. It's really handy to know where a word falls in the language's
frequency distribution, especially compared to variants of the word. For instance,
we might be processing a Twitter comment that contains the string "nasa". We have
little hope of recognising this as an entity except by noting that the string "NASA"
is much more common, and that both strings are quite rare.
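Using the API from the example above, that comparison is a couple of lookups
(a sketch; the exact probabilities depend on the lexicon data)::

    >>> from spacy import en
    >>> nasa_lower, nasa_upper = en.EN.tokenize(u"nasa NASA")
    >>> nasa_upper.prob > nasa_lower.prob
    True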
.. Each spaCy Lexeme comes with a rich, curated set of orthographic and
.. distributional features. Different languages get a different set of features,
.. to take into account different orthographic conventions and morphological
.. complexity. It's also easy to define your own features.
.. And, of course, we take care to get the details right. Indices into the original
.. text are always easy to calculate, so it's easy to, say, mark entities with in-line
.. mark-up. You'll also receive tokens for newlines, tabs and other non-space whitespace,
.. making it easy to do paragraph and sentence recognition. And, of course, we deal
.. smartly with all the random unicode whitespace and punctuation characters you might
.. not have thought of.
Benchmarks
----------
Here we ask two questions:
1. How fast is the spaCy tokenizer itself, relative to other tokenizers?
2. How fast are applications using spaCy's pre-computed lexical features,
compared to applications that re-compute their features on every token?
On the first question, the tokenizer itself is very efficient:
+--------+-------+--------------+--------------+
| System | Time  | Words/second | Speed Factor |
+--------+-------+--------------+--------------+
| NLTK   | 6m4s  | 89,000       | 1.00         |
+--------+-------+--------------+--------------+
| spaCy  | 9.5s  | 3,093,000    | 38.30        |
+--------+-------+--------------+--------------+
spaCy uses more memory than a standard tokenizer, but is far more efficient. We
compare against the NLTK tokenizer and the Penn Treebank's tokenizer.sed script.
We also give the performance of Python's native string.split, for reference.
The comparison refers to 30 million words from the English Gigaword, on
a MacBook Air. For context, calling string.split() on the data completes in
about 5s.
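Something like the following is enough to get a rough comparison on your own
machine (a sketch, not the script behind the numbers above; the corpus path is
a placeholder, and NLTK's punkt model must be installed)::

    import io
    import time

    import nltk
    from spacy.en import EN

    with io.open('gigaword_sample.txt', encoding='utf8') as f:  # placeholder path
        text = f.read()

    start = time.time()
    nltk.word_tokenize(text)
    print 'NLTK:  %.1fs' % (time.time() - start)

    start = time.time()
    EN.tokenize(text)
    print 'spaCy: %.1fs' % (time.time() - start)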
Pros and Cons
-------------
Pros:

- All tokens come with indices into the original string
- Full unicode support
- Extensible to other languages
- Batch operations computed efficiently in Cython
- Cython API
- numpy interoperability

Cons:

- It's new (released September 2014)
- Security concerns, from memory management
- Higher memory usage (up to 1 GB)
- More conceptually complicated
- Tokenization rules expressed in code, not as data

docs/source/how/index.rst Normal file

@ -0,0 +1,13 @@
How
===
Tutorial
--------
Installation
------------
API
---


@ -6,67 +6,42 @@
spaCy NLP Tokenizer and Lexicon
================================
spaCy splits a string of natural language into a list of references to lexical types.
spaCy is an industrial-strength multi-language tokenizer, bristling with features
you never knew you wanted. You do want these features though --- your current
tokenizer has been doing it wrong.
Where other tokenizers give you a list of strings, spaCy gives you references
to rich lexical types, for easy, excellent and efficient feature extraction.
* **Easy**: Tokenizer returns a sequence of rich lexical types, with features
  pre-computed:

  >>> from spacy.en import EN
  >>> tokens = EN.tokenize(u"Examples aren't easy, are they?")
  >>> type(tokens[0])
  spacy.word.Lexeme
  >>> tokens[1] is tokens[5]
  True
  >>> for w in EN.tokenize(string):
  ...     print w.sic, w.shape, w.cluster, w.oft_title, w.can_verb
Other tokenizers return lists of strings, which is
`downright barbaric <guide/overview.html>`__. If you get a list of strings,
you have to write all the features yourself, and you'll probably compute them
on a per-token basis, instead of a per-type basis. At scale, that's very
inefficient.
Check out the tutorial and API docs.
* **Excellent**: Distributional and orthographic features are crucial to robust
  NLP. Without them, models can only learn from tiny annotated training
  corpora. Read more.

* **Efficient**: spaCy serves you rich lexical objects faster than most
  tokenizers can give you a list of strings.

  +--------+-------+--------------+--------------+
  | System | Time  | Words/second | Speed Factor |
  +--------+-------+--------------+--------------+
  | NLTK   | 6m4s  | 89,000       | 1.00         |
  +--------+-------+--------------+--------------+
  | spaCy  | 9.5s  | 3,093,000    | 38.30        |
  +--------+-------+--------------+--------------+

spaCy's tokens come with the following orthographic and distributional features
pre-computed:

* Orthographic flags, such as is_alpha, is_digit, is_punct, is_title etc;
* Useful string transforms, such as canonical casing, word shape, ASCIIfied, etc;
* Unigram log probability;
* Brown cluster;
* can_noun, can_verb etc tag-dictionary;
* oft_upper, oft_title etc case-behaviour flags.

The features are up-to-date with current NLP research, but you can replace or
augment them if you need to.
.. toctree::
    :hidden:
    :maxdepth: 3

    guide/overview.rst
    guide/install.rst
    api/index.rst
    modules/index.rst
License
=======
+------------------+------+
| Non-commercial   | $0   |
+------------------+------+
| Trial commercial | $0   |
+------------------+------+
| Full commercial  | $500 |
+------------------+------+

spaCy is non-free software. Its source is published, but the copyright is
retained by the author (Matthew Honnibal). Licenses are currently under preparation.

There is currently a gap between the output of academic NLP researchers and
the needs of small software companies. I left academia to try to correct this.
My idea is that non-commercial and trial commercial use should "feel" just like
free software. But if you do use the code in a commercial product, a small
fixed license fee will apply, in order to fund development.
what/index.rst
why/index.rst
how/index.rst


@ -0,0 +1,31 @@
What
====
Overview
--------
Feature List
------------
License (for the code)
----------------------

+------------------+------+
| Non-commercial   | $0   |
+------------------+------+
| Trial commercial | $0   |
+------------------+------+
| Full commercial  | $500 |
+------------------+------+

spaCy is non-free software. Its source is published, but the copyright is
retained by the author (Matthew Honnibal). Licenses are currently under preparation.

There is currently a gap between the output of academic NLP researchers and
the needs of small software companies. I left academia to try to correct this.
My idea is that non-commercial and trial commercial use should "feel" just like
free software. But if you do use the code in a commercial product, a small
fixed license fee will apply, in order to fund development.
Pricing (for the data)
----------------------

docs/source/why/index.rst Normal file

@ -0,0 +1,28 @@
Why
===
Benchmarks
----------
Efficiency
----------
+--------+-------+--------------+--------------+
| System | Time  | Words/second | Speed Factor |
+--------+-------+--------------+--------------+
| NLTK   | 6m4s  | 89,000       | 1.00         |
+--------+-------+--------------+--------------+
| spaCy  | 9.5s  | 3,093,000    | 38.30        |
+--------+-------+--------------+--------------+
Accuracy
--------
The comparison refers to 30 million words from the English Gigaword, on
a MacBook Air. For context, calling string.split() on the data completes in
about 5s.
Pros and Cons
-------------