* Fix results table

Matthew Honnibal 2014-12-24 14:35:32 +11:00
parent a68ecc50fa
commit 75a6930ad9


@@ -9,52 +9,63 @@ spaCy: Text-processing for products
spaCy is a library for industrial-strength text processing in Python and Cython.
Its core values are efficiency, accuracy and minimalism: you get a fast pipeline of
state-of-the-art components, a nice API, and no clutter:
>>> from spacy.en import English
>>> nlp = English()
>>> tokens = nlp(u'An example sentence', tag=True, parse=True)
>>> for token in tokens:
...     print token.lemma, token.pos, token.shape, bin(token.cluster)
an DT Xx 0b111011110
example NN xxxx 0b111110001
sentence NN xxxx 0b1101111110010
spaCy is particularly good for feature extraction, because it pre-loads lexical
resources, maps strings to integer IDs, and supports output of numpy arrays:
>>> from spacy.en import attrs
>>> tokens.to_array((attrs.LEMMA, attrs.POS, attrs.SHAPE, attrs.CLUSTER))
array([[ 1265,    14,    76,   478],
       [ 1545,    24,   262,   497],
       [ 3385,    24,   262, 14309]])
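
You can also map the IDs back to strings, if you want to eyeball the features.
Here's a minimal sketch of the reverse trip, assuming the `nlp.strings` and
`nlp.tagger.tags` look-up tables; it prints one line per token, output omitted
here:

>>> feats = tokens.to_array((attrs.LEMMA, attrs.POS, attrs.SHAPE, attrs.CLUSTER))
>>> for lemma, pos, shape, cluster in feats:
...     print nlp.strings[lemma], nlp.tagger.tags[pos], nlp.strings[shape], cluster
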
spaCy also makes it easy to add in-line markup. Let's say you're convinced by
Stephen King's advice that `adverbs are not your friend <http://www.brainpickings.org/2013/03/13/stephen-king-on-adverbs/>`_, so you want to mark
them in red. We'll use one of the examples he finds particularly egregious:
>>> tokens = nlp(u"Give it back, he pleaded abjectly, it's mine.")
>>> red = lambda string: u'\033[91m{0}\033[0m'.format(string)
>>> red = lambda string: unicode(string).upper() # TODO -- make red work on website...
>>> print u''.join(red(t) if t.is_adverb else unicode(t) for t in tokens)
Give it BACK, he pleaded ABJECTLY, it's mine.
Easy --- except, "back" isn't the sort of word we're looking for, even though
it's undeniably an adverb. Let's refine the logic a little, and only
highlight adverbs that modify verbs:
>>> print u''.join(red(t) if t.is_adverb and t.head.is_verb else unicode(t) for t in tokens)
Give it back, he pleaded ABJECTLY, it's mine.
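
Once the predicate stops being a one-liner, it's tidier to pull the pattern
into a function. A small sketch under the same assumptions as the example
above; the `highlight` helper is hypothetical, not part of spaCy:

>>> def highlight(tokens, mark, predicate):
...     return u''.join(mark(t) if predicate(t) else unicode(t) for t in tokens)
>>> print highlight(tokens, red, lambda t: t.is_adverb and t.head.is_verb)
Give it back, he pleaded ABJECTLY, it's mine.
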
spaCy is also very efficient --- much more efficient than any other language
processing tool available. The table below compares the time to tokenize, POS
tag and parse a document (amortized over 100k samples). It also shows accuracy
on the standard evaluation, from the Wall Street Journal:
+----------+----------+---------+----------+----------+------------+
| System   | Tokenize | POS Tag | Parse    | POS Acc. | Parse Acc. |
+----------+----------+---------+----------+----------+------------+
| spaCy    | 0.37ms   | 0.98ms  | 10ms     | 97.3%    | 92.4%      |
+----------+----------+---------+----------+----------+------------+
| NLTK     | 6.2ms    | 443ms   | n/a      | 94.0%    | n/a        |
+----------+----------+---------+----------+----------+------------+
| CoreNLP  | 4.2ms    | 13ms    | todo     | 96.97%   | 92.2%      |
+----------+----------+---------+----------+----------+------------+
| ZPar     | n/a      | 15ms    | 850ms    | 97.3%    | 92.9%      |
+----------+----------+---------+----------+----------+------------+
(The CoreNLP results refer to their recently published shift-reduce neural
network parser.)
spaCy completes its whole pipeline faster than some of the other libraries can
tokenize the text. Its POS tag accuracy is as good as any system available.
For parsing, I chose an algorithm that sacrifices some accuracy in favour of
efficiency.
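
If you want to sanity-check the speed on your own hardware and text, a rough
harness is easy to write. This is a sketch, not the script behind the table
above; substitute real documents for the toy sample:

>>> import time
>>> texts = [u'An example sentence'] * 1000
>>> start = time.time()
>>> for text in texts:
...     doc = nlp(text, tag=True, parse=True)
>>> print '%.2fms per document' % (1000.0 * (time.time() - start) / len(texts))
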
I wrote spaCy so that startups and other small companies could take advantage
of the enormous progress being made by NLP academics. Academia is competitive,