.. spaCy documentation master file, created by
   sphinx-quickstart on Tue Aug 19 16:27:38 2014.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

==============================
spaCy: Industrial-strength NLP
==============================

.. _Issue Tracker: https://github.com/honnibal/spaCy/issues

**2015-07-08**: `Version 0.88 released`_

.. _Version 0.88 released: updates.html

`spaCy`_ is a new library for text processing in Python and Cython.
I wrote it because I think small companies are terrible at
natural language processing (NLP). Or rather:
small companies are using terrible NLP technology.

.. _spaCy: https://github.com/honnibal/spaCy/

To do great NLP, you have to know a little about linguistics, a lot
about machine learning, and almost everything about the latest research.
The people who fit this description seldom join small companies.
Most are broke --- they've just finished grad school.
If they don't want to stay in academia, they join Google, IBM, etc.

The net result is that outside of the tech giants, commercial NLP has changed
little in the last ten years. In academia, it's changed entirely. Amazing
improvements in quality. Orders of magnitude faster. But the
academic code is always GPL, undocumented, unusable, or all three. You could
implement the ideas yourself, but the papers are hard to read, and training
data is exorbitantly expensive. So what are you left with? A common answer is
NLTK, which was written primarily as an educational resource. Nothing past the
tokenizer is suitable for production use.

I used to think that the NLP community just needed to do more to communicate
its findings to software engineers. So I wrote two blog posts, explaining
`how to write a part-of-speech tagger`_ and `parser`_. Both were well received,
and there's been a bit of interest in `my research software`_ --- even though
it's entirely undocumented, and mostly unusable to anyone but me.

.. _`my research software`: https://github.com/syllog1sm/redshift/tree/develop

.. _`how to write a part-of-speech tagger`: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/

.. _`parser`: https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/

So six months ago I quit my post-doc, and I've been working day and night on
spaCy since. I'm now pleased to announce an alpha release.

If you're a small company doing NLP, I think spaCy will seem like a minor miracle.
It's by far the fastest NLP software ever released.
The full processing pipeline completes in 20ms per document, including accurate
tagging and parsing. All strings are mapped to integer IDs, tokens are linked
to embedded word representations, and a range of useful features are pre-calculated
and cached.

If none of that made any sense to you, here's the gist of it. Computers don't
understand text. This is unfortunate, because that's what the web almost entirely
consists of. We want to recommend people text based on other text they liked.
We want to shorten text to display it on a mobile screen. We want to aggregate
it, link it, filter it, categorise it, generate it and correct it.

spaCy provides a library of utility functions that help programmers build such
products. It's commercial open source software: you can either use it under
the AGPL, or you can `buy a commercial license`_ for a one-time fee.

.. _buy a commercial license: license.html

Example functionality
---------------------

Let's say you're developing a proofreading tool, or possibly an IDE for
writers. You're convinced by Stephen King's advice that `adverbs are not your
friend <http://www.brainpickings.org/2013/03/13/stephen-king-on-adverbs/>`_, so
you want to **highlight all adverbs**. We'll use one of the examples he finds
particularly egregious:

>>> import spacy.en
>>> from spacy.parts_of_speech import ADV
>>> # Load the pipeline, and call it with some text.
>>> nlp = spacy.en.English()
>>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’", tag=True, parse=False)
>>> print(u''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens))
‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’

Easy enough --- but the problem is that we've also highlighted "back".
While "back" is undoubtedly an adverb, we probably don't want to highlight it.
If what we're trying to do is flag dubious stylistic choices, we'll need to
refine our logic. It turns out only a certain type of adverb is of interest to
us.

There are lots of ways we might do this, depending on just what words
we want to flag. The simplest way to exclude adverbs like "back" and "not"
is by word frequency: these words are much more common than the prototypical
manner adverbs that the style guides are worried about.

The :py:attr:`Lexeme.prob` and :py:attr:`Token.prob` attributes give a
log probability estimate of the word:

>>> nlp.vocab[u'back'].prob
-7.403977394104004
>>> nlp.vocab[u'not'].prob
-5.407193660736084
>>> nlp.vocab[u'quietly'].prob
-11.07155704498291

(The probability estimate is based on counts from a 3 billion word corpus,
smoothed using the `Simple Good-Turing`_ method.)

.. _`Simple Good-Turing`: http://www.d.umn.edu/~tpederse/Courses/CS8761-FALL02/Code/sgt-gale.pdf

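To get a feel for these numbers, it can help to turn them back into relative frequencies. The sketch below assumes the values are natural-log probabilities, which the text above does not state, and the helper name ``per_million`` is made up for illustration.

```python
import math

# Values copied from the listing above.
# Treating them as natural logarithms is an assumption.
log_probs = {
    "back": -7.403977394104004,
    "not": -5.407193660736084,
    "quietly": -11.07155704498291,
}


def per_million(log_prob):
    """Hypothetical helper: expected occurrences per million words."""
    return math.exp(log_prob) * 1e6


# Print the words from most to least frequent.
for word, lp in sorted(log_probs.items(), key=lambda kv: kv[1], reverse=True):
    print("%-8s %10.1f per million" % (word, per_million(lp)))
```

Under that assumption, "back" and "not" come out hundreds of times more frequent than "quietly", which is the gap the frequency filter relies on.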
So we can easily exclude the N most frequent words in English from our adverb
marker. Let's try N=1000 for now:

>>> import spacy.en
>>> from spacy.parts_of_speech import ADV
>>> nlp = spacy.en.English()
>>> # Find log probability of Nth most frequent word
>>> probs = [lex.prob for lex in nlp.vocab]
>>> probs.sort()
>>> is_adverb = lambda tok: tok.pos == ADV and tok.prob < probs[-1000]
>>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’")
>>> print(u''.join(tok.string.upper() if is_adverb(tok) else tok.string for tok in tokens))
‘Give it back,’ he pleaded ABJECTLY, ‘it’s mine.’

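The thresholding trick above does not depend on spaCy. Here is a minimal sketch on a made-up lexicon (the words and log probabilities are invented, and ``rare_word_filter`` is a hypothetical helper) showing why ``probs[-N]`` after an ascending sort is the log probability of the Nth most frequent word.

```python
# Toy lexicon: word -> log probability (invented values, not spaCy's).
toy_lexicon = {
    "the": -3.0,
    "not": -5.4,
    "back": -7.4,
    "well": -7.9,
    "quietly": -11.1,
    "abjectly": -15.2,
}


def rare_word_filter(lexicon, n):
    """Return a predicate that is True for words rarer than the n-th most frequent one."""
    probs = sorted(lexicon.values())  # ascending, so the rarest words come first
    threshold = probs[-n]             # log probability of the n-th most frequent word
    return lambda word: lexicon[word] < threshold


# Exclude the 4 most frequent words in the toy lexicon.
is_rare = rare_word_filter(toy_lexicon, n=4)
print([word for word in toy_lexicon if is_rare(word)])
```

The word sitting exactly at the threshold is treated as frequent, because the comparison is strict. That matches the ``tok.prob < probs[-1000]`` test in the listing.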
There are lots of other ways we could refine the logic, depending on just what
words we want to flag. Let's say we wanted to only flag adverbs that modified words
similar to "pleaded". This is easy to do, as spaCy loads a vector-space
representation for every word (by default, the vectors produced by
`Levy and Goldberg (2014)`_). Naturally, the vector is provided as a numpy
array:

>>> pleaded = tokens[7]
>>> pleaded.repvec.shape
(300,)
>>> pleaded.repvec[:5]
array([ 0.04229792, 0.07459262, 0.00820188, -0.02181299, 0.07519238], dtype=float32)

.. _Levy and Goldberg (2014): https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/

We want to sort the words in our vocabulary by their similarity to "pleaded".
There are lots of ways to measure the similarity of two vectors. We'll use the
cosine metric:

>>> from numpy import dot
>>> from numpy.linalg import norm
>>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
>>> words = [w for w in nlp.vocab if w.has_repvec]
>>> words.sort(key=lambda w: cosine(w.repvec, pleaded.repvec))
>>> words.reverse()
>>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
1-20 pleaded, pled, plead, confessed, interceded, pleads, testified, conspired, motioned, demurred, countersued, remonstrated, begged, apologised, consented, acquiesced, petitioned, quarreled, appealed, pleading
>>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
50-60 counselled, bragged, backtracked, caucused, refiled, dueled, mused, dissented, yearned, confesses
>>> print('100-110', ', '.join(w.orth_ for w in words[100:110]))
100-110 cabled, ducked, sentenced, perjured, absconded, bargained, overstayed, clerked, confided, sympathizes
>>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
1000-1010 scorned, baled, righted, requested, swindled, posited, firebombed, slimed, deferred, sagged
>>> print('50000-50010', ', '.join(w.orth_ for w in words[50000:50010]))
50000-50010 fb, ford, systems, puck, anglers, ik, tabloid, dirty, rims, artists

As you can see, the similarity model that these vectors give us is excellent
--- we're still getting meaningful results at 1000 words, off a single
prototype! The only problem is that the list really contains two clusters of
words: one associated with the legal meaning of "pleaded", and one for the more
general sense. Sorting out these clusters is an area of active research.

A simple work-around is to average the vectors of several words, and use that
as our target:

>>> say_verbs = ['pleaded', 'confessed', 'remonstrated', 'begged', 'bragged', 'confided', 'requested']
>>> say_vector = sum(nlp.vocab[verb].repvec for verb in say_verbs) / len(say_verbs)
>>> words.sort(key=lambda w: cosine(w.repvec, say_vector))
>>> words.reverse()
>>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
1-20 bragged, remonstrated, enquired, demurred, sighed, mused, intimated, retorted, entreated, motioned, ranted, confided, countersued, gestured, implored, interceded, muttered, marvelled, bickered, despaired
>>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
50-60 flaunted, quarrelled, ingratiated, vouched, agonized, apologised, lunched, joked, chafed, schemed
>>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
1000-1010 hoarded, waded, ensnared, clamoring, abided, deploring, shriveled, endeared, rethought, berate

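The averaging-and-ranking steps above can be tried without loading any models. The following is a rough sketch in plain Python, with invented three-dimensional "embeddings" standing in for the real vectors. The ``centroid`` helper and the toy values are assumptions for illustration only.

```python
import math


def cosine(v1, v2):
    """Cosine similarity of two equal-length vectors (plain-Python stand-in for the numpy version)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Invented 3-d vectors; real word vectors have hundreds of dimensions.
toy_vectors = {
    "pleaded": [0.9, 0.1, 0.0],
    "begged": [0.8, 0.3, 0.1],
    "confided": [0.7, 0.4, 0.0],
    "rims": [0.0, 0.1, 0.9],
    "ford": [0.1, 0.0, 0.8],
}

# Average several prototypes, then rank every word by similarity to the average.
target = centroid([toy_vectors[w] for w in ("pleaded", "begged", "confided")])
ranked = sorted(toy_vectors, key=lambda w: cosine(toy_vectors[w], target), reverse=True)
print(ranked)
```

Sorting with ``reverse=True`` does the same job as the separate ``words.reverse()`` call in the listing. Averaging several prototypes pulls the target towards what they share, so a sense that only one prototype has carries less weight.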
These definitely look like words that King might scold a writer for attaching
adverbs to. Recall that our original adverb highlighting function looked like
this:

>>> import spacy.en
>>> from spacy.parts_of_speech import ADV
>>> # Load the pipeline, and call it with some text.
>>> nlp = spacy.en.English()
>>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
...              tag=True, parse=False)
>>> print(u''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens))
‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’

We wanted to refine the logic so that only adverbs modifying evocative verbs
of communication, like "pleaded", were highlighted. We've now built a vector that
represents that type of word, so now we can highlight adverbs based on
subtle logic, homing in on adverbs that seem the most stylistically
problematic, given our starting assumptions:

>>> import numpy
>>> from numpy import dot
>>> from numpy.linalg import norm
>>> import spacy.en
>>> from spacy.parts_of_speech import ADV, VERB
>>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
>>> def is_bad_adverb(token, target_verb, tol):
...     if token.pos != ADV:
...         return False
...     elif token.head.pos != VERB:
...         return False
...     elif cosine(token.head.repvec, target_verb) < tol:
...         return False
...     else:
...         return True

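To see how a check like this behaves, here is a self-contained sketch that runs the same three tests against stand-in objects rather than real spaCy tokens. ``FakeToken``, the string part-of-speech constants, the toy vectors and the ``tol`` value of 0.7 are all assumptions made for illustration.

```python
import math

# Stand-ins for the spacy.parts_of_speech constants (the real ones are not strings).
ADV, VERB = "ADV", "VERB"


class FakeToken(object):
    """Hypothetical stand-in exposing only the attributes the check reads."""

    def __init__(self, text, pos, repvec, head=None):
        self.text = text
        self.pos = pos
        self.repvec = repvec
        # In this sketch a token without an explicit head is its own head.
        self.head = head if head is not None else self


def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))


def is_bad_adverb(token, target_verb, tol):
    # Same three checks as the function above, written as one boolean expression.
    return (token.pos == ADV
            and token.head.pos == VERB
            and cosine(token.head.repvec, target_verb) >= tol)


say_vector = [0.8, 0.3, 0.0]  # invented "verbs of communication" vector
pleaded = FakeToken("pleaded", VERB, [0.9, 0.1, 0.0])
ran = FakeToken("ran", VERB, [0.0, 0.2, 0.9])
abjectly = FakeToken("abjectly", ADV, [0.1, 0.9, 0.1], head=pleaded)
quickly = FakeToken("quickly", ADV, [0.2, 0.8, 0.1], head=ran)

for tok in (abjectly, quickly, pleaded):
    print(tok.text, is_bad_adverb(tok, say_vector, tol=0.7))
```

Only "abjectly" is flagged here. "quickly" is an adverb, but its head verb is far from the target vector, and "pleaded" is not an adverb at all. A useful ``tol`` for real vectors would have to be found by experiment.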
This example was somewhat contrived --- and, truth be told, I've never really
bought the idea that adverbs were a grave stylistic sin. But hopefully it got
the message across: the state-of-the-art NLP technologies are very powerful.
spaCy gives you easy and efficient access to them, which lets you build all
sorts of useful products and features that were previously impossible.

Independent Evaluation
----------------------

.. table:: Independent evaluation by Yahoo! Labs and Emory
           University, to appear at ACL 2015. Higher is better.

    +----------------+------------+------------+------------+
    | System         | Language   | Accuracy   | Speed      |
    +----------------+------------+------------+------------+
    | spaCy v0.86    | Cython     | 91.9       | **13,963** |
    +----------------+------------+------------+------------+
    | ClearNLP       | Java       | 91.7       | 10,271     |
    +----------------+------------+------------+------------+
    | spaCy v0.84    | Cython     | 90.9       | 13,963     |
    +----------------+------------+------------+------------+
    | CoreNLP        | Java       | 89.6       | 8,602      |
    +----------------+------------+------------+------------+
    | MATE           | Java       | **92.5**   | 550        |
    +----------------+------------+------------+------------+
    | Turbo          | C++        | 92.4       | 349        |
    +----------------+------------+------------+------------+
    | Yara           | Java       | 92.3       | 340        |
    +----------------+------------+------------+------------+

Accuracy is % unlabelled arcs correct, speed is tokens per second.

Joel Tetreault and Amanda Stent (Yahoo! Labs) and Jin-ho Choi (Emory) performed
a detailed comparison of the best parsers available. All numbers above
are taken from the pre-print they kindly made available to me,
except for spaCy v0.86.

I'm particularly grateful to the authors for discussion of their results, which
led to the improvement in accuracy between v0.84 and v0.86. A tip from Jin-ho
(developer of ClearNLP) was particularly useful.

Detailed Speed Comparison
-------------------------

**Set up**: 100,000 plain-text documents were streamed from an SQLite3
database, and processed with an NLP library, to one of three levels of detail
--- tokenization, tagging, or parsing. The tasks are additive: to parse the
text you have to tokenize and tag it. The pre-processing was not subtracted
from the times --- I report the time required for the pipeline to complete.
I report mean times per document, in milliseconds.

**Hardware**: Intel i7-3770 (2012)

.. table:: Per-document processing times. Lower is better.

    +--------------+---------------------------+--------------------------------+
    |              | Absolute (ms per doc)     | Relative (to spaCy)            |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | System       | Tokenize | Tag    | Parse | Tokenize | Tag     | Parse     |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | spaCy        | 0.2ms    | 1ms    | 19ms  | 1x       | 1x      | 1x        |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | CoreNLP      | 2ms      | 10ms   | 49ms  | 10x      | 10x     | 2.6x      |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | ZPar         | 1ms      | 8ms    | 850ms | 5x       | 8x      | 44.7x     |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | NLTK         | 4ms      | 443ms  | n/a   | 20x      | 443x    | n/a       |
    +--------------+----------+--------+-------+----------+---------+-----------+

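The relative columns are just each system's time divided by spaCy's time for the same task. The short sketch below recomputes them from the absolute columns. The dictionary layout and the ``relative_to`` helper are illustrative choices, not part of any library.

```python
# Absolute per-document times in milliseconds, copied from the table above.
# None marks a level that has no measurement (shown as n/a in the table).
absolute_ms = {
    "spaCy": {"tokenize": 0.2, "tag": 1.0, "parse": 19.0},
    "CoreNLP": {"tokenize": 2.0, "tag": 10.0, "parse": 49.0},
    "ZPar": {"tokenize": 1.0, "tag": 8.0, "parse": 850.0},
    "NLTK": {"tokenize": 4.0, "tag": 443.0, "parse": None},
}


def relative_to(baseline, times):
    """Ratio of each system's time to the baseline system's time, per task."""
    base = times[baseline]
    return {
        system: {
            task: (None if ms is None else round(ms / base[task], 1))
            for task, ms in tasks.items()
        }
        for system, tasks in times.items()
    }


relative = relative_to("spaCy", absolute_ms)
for system, tasks in relative.items():
    print(system, tasks)
```

Rounded to one decimal place, the ratios agree with the table: for example 49 / 19 gives 2.6 and 850 / 19 gives 44.7.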
Efficiency is a major concern for NLP applications. It is very common to hear
people say that they cannot afford more detailed processing, because their
datasets are too large. This is a bad position to be in. If you can't apply
detailed processing, you generally have to cobble together various heuristics.
This normally takes a few iterations, and what you come up with will usually be
brittle and difficult to reason about.

spaCy's parser is faster than most taggers, and its tokenizer is fast enough
for any workload. And the tokenizer doesn't just give you a list
of strings. A spaCy token is a pointer to a Lexeme struct, from which you can
access a wide range of pre-computed features, including embedded word
representations.

.. I wrote spaCy because I think existing commercial NLP engines are crap.
   Alchemy API are a typical example. Check out this part of their terms of
   service:
   publish or perform any benchmark or performance tests or analysis relating to
   the Service or the use thereof without express authorization from AlchemyAPI;

.. Did you get that? You're not allowed to evaluate how well their system works,
   unless you're granted a special exception. Their system must be pretty
   terrible to motivate such an embarrassing restriction.
   They must know this makes them look bad, but they apparently believe allowing
   you to evaluate their product would make them look even worse!

.. spaCy is based on science, not alchemy. It's open source, and I am happy to
   clarify any detail of the algorithms I've implemented.
   It's evaluated against the current best published systems, following the standard
   methodologies. These evaluations show that it performs extremely well.

.. See `Benchmarks`_ for details.

.. toctree::
    :maxdepth: 4
    :hidden:

    quickstart.rst
    reference/index.rst
    license.rst
    updates.rst