* Impove index docs

This commit is contained in:
Matthew Honnibal 2015-01-16 07:08:35 +11:00
parent e8dbac8a0c
commit e28b224b80

View File

@ -3,87 +3,184 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
===================================
spaCy: Text-processing for products
===================================
==============================
spaCy: Industrial-strength NLP
==============================
spaCy is a library for industrial-strength text processing in Python and Cython.
Its core values are efficiency, accuracy and minimalism: you get a fast pipeline of
state-of-the-art components, a nice API, and no clutter:
It is commercial open source software, with a dual (AGPL or commercial)
license.
>>> from spacy.en import English
>>> nlp = English()
>>> tokens = nlp(u'An example sentence', tag=True, parse=True)
>>> for token in tokens:
... print token.lemma, token.pos, bin(token.cluster)
an DT Xx 0b111011110
example NN xxxx 0b111110001
sentence NN xxxx 0b1101111110010
If you're a small company doing NLP, spaCy might seem like a minor miracle.
It's by far the fastest NLP software available. The full processing pipeline
completes in 7ms, including state-of-the-art part-of-speech tagging and
dependency parsing. All strings are mapped to integer IDs, tokens
are linked to word vectors and other lexical resources, and a range of useful
features are pre-calculated and cached.
spaCy is particularly good for feature extraction, because it pre-loads lexical
resources, maps strings to integer IDs, and supports output of numpy arrays:
If none of that made any sense to you, here's the gist of it. Computers don't
understand text. This is unfortunate, because that's what the web almost entirely
consists of. We want to recommend people text based on other text they liked.
We want to shorten text to display it on a mobile screen. We want to aggregate
it, link it, filter it, categorise it, generate it and correct it.
>>> from spacy.en import attrs
>>> tokens.to_array((attrs.LEMMA, attrs.POS, attrs.SHAPE, attrs.CLUSTER))
array([[ 1265, 14, 76, 478],
[ 1545, 24, 262, 497],
[ 3385, 24, 262, 14309]])
spaCy provides a set of utility functions that help programmers build such
products. It's an NLP engine, analogous to the 3d engines commonly licensed
for game development.
spaCy also makes it easy to add in-line mark up. Let's say you're convinced by
Stephen King's advice that `adverbs are not your friend <http://www.brainpickings.org/2013/03/13/stephen-king-on-adverbs/>`_, so you want to mark
them in red. We'll use one of the examples he finds particularly egregious:
Example functionality
---------------------
>>> tokens = nlp(u"Give it back, he pleaded abjectly, its mine.")
>>> red = lambda string: u'\033[91m{0}\033[0m'.format(string)
>>> red = lambda string: unicode(string).upper() # TODO -- make red work on website...
>>> print u''.join(red(t) if t.is_adverb else unicode(t) for t in tokens)
Let's say you're developing a proofreading tool, or possibly an IDE for
writers. You're convinced by Stephen King's advice that `adverbs are not your
friend <http://www.brainpickings.org/2013/03/13/stephen-king-on-adverbs/>`_, so
you want to **mark adverbs in red**. We'll use one of the examples he finds
particularly egregious:
>>> import spacy.en
>>> from spacy.enums import ADVERB
>>> # Load the pipeline, and call it with some text.
>>> nlp = spacy.en.English()
>>> tokens = nlp("Give it back, he pleaded abjectly, its mine.",
tag=True, parse=True)
>>> output = ''
>>> for tok in tokens:
... # Token.string preserves whitespace, making it easy to
... # reconstruct the original string.
... output += tok.string.upper() if tok.is_pos(ADVERB) else tok.string
>>> print(output)
Give it BACK, he pleaded ABJECTLY, its mine.
Easy --- except, "back" isn't the sort of word we're looking for, even though
it's undeniably an adverb. Let's search refine the logic a little, and only
highlight adverbs that modify verbs:
Easy enough --- but the problem is that we've also highlighted "back", when probably
we only wanted to highlight "abjectly". This is undoubtedly an adverb, but it's
not the sort of adverb King is talking about. This is a persistent problem when
dealing with linguistic categories: the prototypical examples, the ones whic
spring to your mind, are often not the most common cases.
>>> print u''.join(red(t) if t.is_adverb and t.head.is_verb else unicode(t) for t in tokens)
There are lots of ways we might refine our logic, depending on just what words
we want to flag. The simplest way to filter out adverbs like "back" and "not"
is by word frequency: these words are much more common than the manner adverbs
the style guides are worried about.
The prob attribute of a Lexeme or Token object gives a log probability estimate
of the word, based on smoothed counts from a 3bn word corpus:
>>> nlp.vocab[u'back'].prob
-7.403977394104004
>>> nlp.vocab[u'not'].prob
-5.407193660736084
>>> nlp.vocab[u'quietly'].prob
-11.07155704498291
So we can easily exclude the N most frequent words in English from our adverb
marker. Let's try N=1000 for now:
>>> import spacy.en
>>> from spacy.enums import ADVERB
>>> nlp = spacy.en.English()
>>> # Find log probability of Nth most frequent word
>>> probs = [lex.prob for lex in nlp.vocab]
>>> is_adverb = lambda tok: tok.is_pos(ADVERB) and tok.prob < probs[-1000]
>>> tokens = nlp("Give it back, he pleaded abjectly, its mine.",
tag=True, parse=True)
>>> print(''.join(tok.string.upper() if is_adverb(tok) else tok.string))
Give it back, he pleaded ABJECTLY, its mine.
spaCy is also very efficient --- much more efficient than any other language
processing tools available. The table below compares the time to tokenize, POS
tag and parse a document (amortized over 100k samples). It also shows accuracy
on the standard evaluation, from the Wall Street Journal:
There are lots of ways to refine the logic, depending on just what words we
want to flag. Let's define this narrowly, and only flag adverbs applied to
verbs of communication or perception:
+----------+----------+---------+----------+----------+------------+
| System | Tokenize | POS Tag | Parse | POS Acc. | Parse Acc. |
+----------+----------+---------+----------+----------+------------+
| spaCy | 0.37ms | 0.98ms | 10ms | 97.3% | 92.4% |
+----------+----------+---------+----------+----------+------------+
| NLTK | 6.2ms | 443ms | n/a | 94.0% | n/a |
+----------+----------+---------+----------+----------+------------+
| CoreNLP | 4.2ms | 13ms | todo | 96.97% | 92.2% |
+----------+----------+---------+----------+----------+------------+
| ZPar | n/a | 15ms | 850ms | 97.3% | 92.9% |
+----------+----------+---------+----------+----------+------------+
>>> from spacy.enums import VERB, WN_V_COMMUNICATION, WN_V_COGNITION
>>> def is_say_verb(tok):
... return tok.is_pos(VERB) and (tok.check_flag(WN_V_COMMUNICATION) or
tok.check_flag(WN_V_COGNITION))
>>> print(''.join(tok.string.upper() if is_adverb(tok) and is_say_verb(tok.head)
else tok.string))
Give it back, he pleaded ABJECTLY, its mine.
(The CoreNLP results refer to their recently published shift-reduce neural
network parser.)
The two flags refer to the 45 top-level categories in the WordNet ontology.
spaCy stores membership in these categories as a bit set, because
words can have multiple senses. We only need one 64
bit flag variable per word in the vocabulary, so this useful data requires only
2.4mb of memory.
I wrote spaCy so that startups and other small companies could take advantage
of the enormous progress being made by NLP academics. Academia is competitive,
and what you're competing to do is write papers --- so it's very hard to write
software useful to non-academics. Seeing this gap, I resigned from my post-doc,
and wrote spaCy.
spaCy packs all sorts of other goodies into its lexicon.
Words are mapped to one these rich lexical types immediately, during
tokenization --- and spaCy's tokenizer is *fast*.
Efficiency
----------
.. table:: Efficiency comparison. See `Benchmarks`_ for details.
+--------------+---------------------------+--------------------------------+
| | Absolute (ms per doc) | Relative (to spaCy) |
+--------------+----------+--------+-------+----------+---------+-----------+
| System | Tokenize | Tag | Parse | Tokenize | Tag | Parse |
+--------------+----------+--------+-------+----------+---------+-----------+
| spaCy | 0.2ms | 1ms | 7ms | 1x | 1x | 1x |
+--------------+----------+--------+-------+----------+---------+-----------+
| CoreNLP | 2ms | 10ms | 49ms | 10x | 10x | 7x |
+--------------+----------+--------+-------+----------+---------+-----------+
| ZPar | 1ms | 8ms | 850ms | 5x | 8x | 121x |
+--------------+----------+--------+-------+----------+---------+-----------+
| NLTK | 4ms | 443ms | n/a | 20x | 443x | n/a |
+--------------+----------+--------+-------+----------+---------+-----------+
Efficiency is a major concern for NLP applications. It is very common to hear
people say that they cannot afford more detailed processing, because their
datasets are too large. This is a bad position to be in. If you can't apply
detailed processing, you generally have to cobble together various heuristics.
This normally takes a few iterations, and what you come up with will usually be
brittle and difficult to reason about.
spaCy's parser is faster than most taggers, and its tokenizer is fast enough
for truly web-scale processing. And the tokenizer doesn't just give you a list
of strings. A spaCy token is a pointer to a Lexeme struct, from which you can
access a wide range of pre-computed features.
.. I wrote spaCy because I think existing commercial NLP engines are crap.
Alchemy API are a typical example. Check out this part of their terms of
service:
publish or perform any benchmark or performance tests or analysis relating to
the Service or the use thereof without express authorization from AlchemyAPI;
.. Did you get that? You're not allowed to evaluate how well their system works,
unless you're granted a special exception. Their system must be pretty
terrible to motivate such an embarrassing restriction.
They must know this makes them look bad, but they apparently believe allowing
you to evaluate their product would make them look even worse!
.. spaCy is based on science, not alchemy. It's open source, and I am happy to
clarify any detail of the algorithms I've implemented.
It's evaluated against the current best published systems, following the standard
methodologies. These evaluations show that it performs extremely well.
Accuracy
--------
.. table:: Accuracy comparison, on the standard benchmark data from the Wall Street Journal. See `Benchmarks`_ for details.
+--------------+----------+------------+
| System | POS acc. | Parse acc. |
+--------------+----------+------------+
| spaCy | 97.2 | 92.4 |
+--------------+----------+------------+
| CoreNLP | 96.9 | 92.2 |
+--------------+----------+------------+
| ZPar | 97.3 | 92.9 |
+--------------+----------+------------+
| NLTK | 94.3 | n/a |
+--------------+----------+------------+
spaCy is dual-licensed: you can either use it under the GPL, or pay a one-time
fee of $5000 for a commercial license. I think this is excellent value:
you'll find NLTK etc much more expensive, because what you save on license
cost, you'll lose many times over in lost productivity. $5000 does not buy you
much developer time.
.. toctree::
:hidden:
:maxdepth: 3
license.rst
quickstart.rst
features.rst
license_stories.rst
api.rst