2014-09-25 20:42:20 +04:00
|
|
|
|
.. spaCy documentation master file, created by
|
|
|
|
|
sphinx-quickstart on Tue Aug 19 16:27:38 2014.
|
|
|
|
|
You can adapt this file completely to your liking, but it should at least
|
|
|
|
|
contain the root `toctree` directive.
|
|
|
|
|
|
2015-01-15 23:08:35 +03:00
|
|
|
|
==============================
|
|
|
|
|
spaCy: Industrial-strength NLP
|
|
|
|
|
==============================
|
2014-09-25 20:42:20 +04:00
|
|
|
|
|
2015-01-23 15:11:16 +03:00
|
|
|
|
spaCy is a new library for text processing in Python and Cython.
|
2015-01-24 16:58:52 +03:00
|
|
|
|
I wrote it because I think small companies are terrible at NLP. Or rather:
|
|
|
|
|
small companies are using terrible NLP technology.
|
2015-01-15 23:08:35 +03:00
|
|
|
|
|
2015-01-24 16:58:52 +03:00
|
|
|
|
To do great NLP, you have to know a little about linguistics, a lot
|
|
|
|
|
about machine learning, and almost everything about the latest research.
|
2015-01-24 17:06:14 +03:00
|
|
|
|
The people who fit this description seldom join small companies.
|
|
|
|
|
Most are broke --- they've just finished grad school.
|
2015-01-24 16:58:52 +03:00
|
|
|
|
If they don't want to stay in academia, they join Google, IBM, etc.
|
2015-01-23 15:11:16 +03:00
|
|
|
|
|
2015-01-24 16:58:52 +03:00
|
|
|
|
The net result is that outside of the tech giants, commercial NLP has changed
|
|
|
|
|
little in the last ten years. In academia, it's changed entirely. Amazing
|
|
|
|
|
improvements in quality. Orders of magnitude faster. But the
|
|
|
|
|
academic code is always GPL, undocumented, unuseable, or all three. You could
|
|
|
|
|
implement the ideas yourself, but the papers are hard to read, and training
|
|
|
|
|
data is exorbitantly expensive. So what are you left with? NLTK?
|
|
|
|
|
|
|
|
|
|
I used to think that the NLP community just needed to do more to communicate
|
|
|
|
|
its findings to software engineers. So I wrote two blog posts, explaining
|
|
|
|
|
`how to write a part-of-speech tagger`_ and `parser`_. Both were very well received,
|
|
|
|
|
and there's been a bit of interest in `my research software`_ --- even though
|
|
|
|
|
it's entirely undocumented, and mostly unuseable to anyone but me.
|
|
|
|
|
|
|
|
|
|
.. _`my research software`: https://github.com/syllog1sm/redshift/tree/develop
|
|
|
|
|
|
|
|
|
|
.. _`how to write a part-of-speech tagger`: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/
|
|
|
|
|
|
|
|
|
|
.. _`parser`: https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/
|
|
|
|
|
|
|
|
|
|
So six months ago I quit my post-doc, and I've been working day and night on
|
|
|
|
|
spaCy since. I'm now pleased to announce an alpha release.
|
2015-01-23 15:11:16 +03:00
|
|
|
|
|
|
|
|
|
If you're a small company doing NLP, I think spaCy will seem like a minor miracle.
|
2015-01-24 17:06:14 +03:00
|
|
|
|
It's by far the fastest NLP software ever released.
|
2015-01-24 16:58:52 +03:00
|
|
|
|
The full processing pipeline completes in 7ms per document, including accurate
|
|
|
|
|
tagging and parsing. All strings are mapped to integer IDs, tokens are linked
|
|
|
|
|
to embedded word representations, and a range of useful features are pre-calculated
|
|
|
|
|
and cached.
|
2015-01-15 23:08:35 +03:00
|
|
|
|
|
|
|
|
|
If none of that made any sense to you, here's the gist of it. Computers don't
|
|
|
|
|
understand text. This is unfortunate, because that's what the web almost entirely
|
|
|
|
|
consists of. We want to recommend people text based on other text they liked.
|
|
|
|
|
We want to shorten text to display it on a mobile screen. We want to aggregate
|
|
|
|
|
it, link it, filter it, categorise it, generate it and correct it.
|
|
|
|
|
|
2015-01-23 15:11:16 +03:00
|
|
|
|
spaCy provides a library of utility functions that help programmers build such
|
|
|
|
|
products. It's commercial open source software: you can either use it under
|
|
|
|
|
the AGPL, or you can `buy a commercial license`_ for a one-time fee.
|
|
|
|
|
|
|
|
|
|
.. _buy a commercial license: license.rst
|
2015-01-15 23:08:35 +03:00
|
|
|
|
|
|
|
|
|
Example functionality
|
|
|
|
|
---------------------
|
|
|
|
|
|
|
|
|
|
Let's say you're developing a proofreading tool, or possibly an IDE for
|
|
|
|
|
writers. You're convinced by Stephen King's advice that `adverbs are not your
|
|
|
|
|
friend <http://www.brainpickings.org/2013/03/13/stephen-king-on-adverbs/>`_, so
|
2015-01-23 00:22:00 +03:00
|
|
|
|
you want to **highlight all adverbs**. We'll use one of the examples he finds
|
2015-01-15 23:08:35 +03:00
|
|
|
|
particularly egregious:
|
|
|
|
|
|
|
|
|
|
>>> import spacy.en
|
2015-01-23 00:22:00 +03:00
|
|
|
|
>>> from spacy.postags import ADVERB
|
2015-01-15 23:08:35 +03:00
|
|
|
|
>>> # Load the pipeline, and call it with some text.
|
|
|
|
|
>>> nlp = spacy.en.English()
|
|
|
|
|
>>> tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
|
|
|
|
|
tag=True, parse=True)
|
|
|
|
|
>>> output = ''
|
|
|
|
|
>>> for tok in tokens:
|
2015-01-23 00:22:00 +03:00
|
|
|
|
... output += tok.string.upper() if tok.pos == ADVERB else tok.string
|
|
|
|
|
... output += tok.whitespace
|
2015-01-15 23:08:35 +03:00
|
|
|
|
>>> print(output)
|
2014-12-24 06:35:32 +03:00
|
|
|
|
‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’
|
2014-12-09 08:08:01 +03:00
|
|
|
|
|
|
|
|
|
|
2015-01-15 23:08:35 +03:00
|
|
|
|
Easy enough --- but the problem is that we've also highlighted "back", when probably
|
2015-01-23 00:22:00 +03:00
|
|
|
|
we only wanted to highlight "abjectly". While "back" is undoubtedly an adverb,
|
|
|
|
|
we probably don't want to highlight it.
|
2015-01-15 23:08:35 +03:00
|
|
|
|
|
|
|
|
|
There are lots of ways we might refine our logic, depending on just what words
|
|
|
|
|
we want to flag. The simplest way to filter out adverbs like "back" and "not"
|
2015-01-23 00:22:00 +03:00
|
|
|
|
is by word frequency: these words are much more common than the prototypical
|
|
|
|
|
manner adverbs that the style guides are worried about.
|
2015-01-15 23:08:35 +03:00
|
|
|
|
|
|
|
|
|
The prob attribute of a Lexeme or Token object gives a log probability estimate
|
|
|
|
|
of the word, based on smoothed counts from a 3bn word corpus:
|
|
|
|
|
|
|
|
|
|
>>> nlp.vocab[u'back'].prob
|
|
|
|
|
-7.403977394104004
|
|
|
|
|
>>> nlp.vocab[u'not'].prob
|
|
|
|
|
-5.407193660736084
|
|
|
|
|
>>> nlp.vocab[u'quietly'].prob
|
|
|
|
|
-11.07155704498291
|
|
|
|
|
|
|
|
|
|
So we can easily exclude the N most frequent words in English from our adverb
|
|
|
|
|
marker. Let's try N=1000 for now:
|
|
|
|
|
|
|
|
|
|
>>> import spacy.en
|
2015-01-23 00:22:00 +03:00
|
|
|
|
>>> from spacy.postags import ADVERB
|
2015-01-15 23:08:35 +03:00
|
|
|
|
>>> nlp = spacy.en.English()
|
|
|
|
|
>>> # Find log probability of Nth most frequent word
|
|
|
|
|
>>> probs = [lex.prob for lex in nlp.vocab]
|
2015-01-23 00:22:00 +03:00
|
|
|
|
>>> is_adverb = lambda tok: tok.pos == ADVERB and tok.prob < probs[-1000]
|
2015-01-15 23:08:35 +03:00
|
|
|
|
>>> tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
|
|
|
|
|
tag=True, parse=True)
|
|
|
|
|
>>> print(''.join(tok.string.upper() if is_adverb(tok) else tok.string))
|
|
|
|
|
‘Give it back,’ he pleaded ABJECTLY, ‘it’s mine.’
|
|
|
|
|
|
2015-01-23 00:22:00 +03:00
|
|
|
|
There are lots of ways we could refine the logic, depending on just what words we
|
|
|
|
|
want to flag. Let's say we wanted to only flag adverbs that modified words
|
|
|
|
|
similar to "pleaded". This is easy to do, as spaCy loads a vector-space
|
|
|
|
|
representation for every word (by default, the vectors produced by
|
|
|
|
|
`Levy and Goldberg (2014)`_. Naturally, the vector is provided as a numpy
|
|
|
|
|
array:
|
|
|
|
|
|
|
|
|
|
>>> pleaded = tokens[8]
|
|
|
|
|
>>> pleaded.repvec.shape
|
|
|
|
|
(300,)
|
|
|
|
|
|
|
|
|
|
.. _Levy and Goldberg (2014): https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
|
|
|
|
|
|
|
|
|
|
We want to sort the words in our vocabulary by their similarity to "pleaded".
|
|
|
|
|
There are lots of ways to measure the similarity of two vectors. We'll use the
|
|
|
|
|
cosine metric:
|
|
|
|
|
|
|
|
|
|
>>> from numpy import dot
|
|
|
|
|
>>> from numpy.linalg import norm
|
|
|
|
|
>>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1), norm(v2))
|
|
|
|
|
>>> words = [w for w in nlp.vocab if w.is_lower and w.has_repvec]
|
|
|
|
|
>>> words.sort(key=lambda w: cosine(w, pleaded))
|
|
|
|
|
>>> words.reverse()
|
|
|
|
|
>>> print '1-20', ', '.join(w.orth_ for w in words[0:20])
|
|
|
|
|
1-20 pleaded, pled, plead, confessed, interceded, pleads, testified, conspired, motioned, demurred, countersued, remonstrated, begged, apologised, consented, acquiesced, petitioned, quarreled, appealed, pleading
|
|
|
|
|
>>> print '50-60', ', '.join(w.orth_ for w in words[50:60])
|
|
|
|
|
50-60 counselled, bragged, backtracked, caucused, refiled, dueled, mused, dissented, yearned, confesses
|
|
|
|
|
>>> print '100-110', ', '.join(w.orth_ for w in words[100:110])
|
|
|
|
|
cabled, ducked, sentenced, perjured, absconded, bargained, overstayed, clerked, confided, sympathizes
|
|
|
|
|
>>> print '1000-1010', ', '.join(w.orth_ for w in words[1000:1010])
|
|
|
|
|
scorned, baled, righted, requested, swindled, posited, firebombed, slimed, deferred, sagged
|
|
|
|
|
>>> print ', '.join(w.orth_ for w in words[50000:50010])
|
|
|
|
|
fb, ford, systems, puck, anglers, ik, tabloid, dirty, rims, artists
|
|
|
|
|
|
|
|
|
|
As you can see, the similarity model that these vectors give us is excellent
|
|
|
|
|
--- we're still getting meaningful results at 1000 words, off a single
|
|
|
|
|
prototype! The only problem is that the list really contains two clusters of
|
|
|
|
|
words: one associated with the legal meaning of "pleaded", and one for the more
|
|
|
|
|
general sense. Sorting out these clusters is an area of active research.
|
|
|
|
|
|
|
|
|
|
A simple work-around is to average the vectors of several words, and use that
|
|
|
|
|
as our target:
|
|
|
|
|
|
|
|
|
|
>>> say_verbs = [u'pleaded', u'confessed', u'remonstrated', u'begged',
|
|
|
|
|
u'bragged', u'confided', u'requested']
|
|
|
|
|
>>> say_vector = numpy.zeros(shape=(300,))
|
|
|
|
|
>>> for verb in say_verbs:
|
|
|
|
|
... say_vector += nlp.vocab[verb].repvec
|
|
|
|
|
>>> words.sort(key=lambda w: cosine(w.repvec, say_vector))
|
|
|
|
|
>>> words.reverse()
|
|
|
|
|
>>> print '1-20', ', '.join(w.orth_ for w in words[0:20])
|
|
|
|
|
1-20 bragged, remonstrated, enquired, demurred, sighed, mused, intimated, retorted, entreated, motioned, ranted, confided, countersued, gestured, implored, interceded, muttered, marvelled, bickered, despaired
|
|
|
|
|
50-60 flaunted, quarrelled, ingratiated, vouched, agonized, apologised, lunched, joked, chafed, schemed
|
|
|
|
|
>>> print '1000-1010', ', '.join(w.orth_ for w in words[1000:1010])
|
|
|
|
|
1000-1010 hoarded, waded, ensnared, clamoring, abided, deploring, shriveled, endeared, rethought, berate
|
|
|
|
|
|
|
|
|
|
These definitely look like words that King might scold a writer for attaching
|
|
|
|
|
adverbs to. Recall that our previous adverb highlighting function looked like
|
|
|
|
|
this:
|
2014-12-15 08:32:03 +03:00
|
|
|
|
|
2015-01-23 00:22:00 +03:00
|
|
|
|
>>> import spacy.en
|
|
|
|
|
>>> from spacy.postags import ADVERB
|
|
|
|
|
>>> # Load the pipeline, and call it with some text.
|
|
|
|
|
>>> nlp = spacy.en.English()
|
|
|
|
|
>>> tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
|
|
|
|
|
tag=True, parse=True)
|
|
|
|
|
>>> output = ''
|
|
|
|
|
>>> for tok in tokens:
|
|
|
|
|
... output += tok.string.upper() if tok.pos == ADVERB else tok.string
|
|
|
|
|
... output += tok.whitespace
|
|
|
|
|
>>> print(output)
|
|
|
|
|
‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’
|
|
|
|
|
|
|
|
|
|
We wanted to refine the logic so that only adverbs modifying evocative verbs
|
|
|
|
|
of communication, like "pleaded", were highlighted. We've now built a vector that
|
|
|
|
|
represents that type of word, so now we can highlight adverbs based on very
|
|
|
|
|
subtle logic, honing in on adverbs that seem the most stylistically
|
|
|
|
|
problematic, given our starting assumptions:
|
|
|
|
|
|
|
|
|
|
>>> import numpy
|
|
|
|
|
>>> from numpy import dot
|
|
|
|
|
>>> from numpy.linalg import norm
|
|
|
|
|
>>> import spacy.en
|
|
|
|
|
>>> from spacy.postags import ADVERB, VERB
|
|
|
|
|
>>> def is_bad_adverb(token, target_verb, tol):
|
|
|
|
|
... if token.pos != ADVERB
|
|
|
|
|
... return False
|
|
|
|
|
... elif toke.head.pos != VERB:
|
|
|
|
|
... return False
|
|
|
|
|
... elif cosine(token.head.repvec, target_verb) < tol:
|
|
|
|
|
... return False
|
|
|
|
|
... else:
|
|
|
|
|
... return True
|
2014-12-02 07:20:18 +03:00
|
|
|
|
|
2015-01-15 23:08:35 +03:00
|
|
|
|
|
2015-01-23 00:22:00 +03:00
|
|
|
|
This example was somewhat contrived --- and, truth be told, I've never really
|
|
|
|
|
bought the idea that adverbs were a grave stylistic sin. But hopefully it got
|
|
|
|
|
the message across: the state-of-the-art NLP technologies are very powerful.
|
|
|
|
|
spaCy gives you easy and efficient access to them, which lets you build all
|
|
|
|
|
sorts of use products and features that were previously impossible.
|
|
|
|
|
|
2015-01-15 23:08:35 +03:00
|
|
|
|
|
2015-01-24 17:25:27 +03:00
|
|
|
|
Speed Comparison
|
|
|
|
|
----------------
|
2015-01-15 23:08:35 +03:00
|
|
|
|
|
2015-01-24 17:25:27 +03:00
|
|
|
|
**Set up**: 100,000 plain-text documents were streamed from an SQLite3
|
|
|
|
|
database, and processed with an NLP library, to one of three levels of detail
|
|
|
|
|
--- tokenization, tagging, or parsing. The tasks are additive: to parse the
|
|
|
|
|
text you have to tokenize and tag it. The pre-processing was not subtracted
|
|
|
|
|
from the times --- I report the time required for the pipeline to complete.
|
|
|
|
|
I report mean times per document, in milliseconds.
|
|
|
|
|
|
|
|
|
|
**Hardware**: Intel i7-3770 (2012)
|
|
|
|
|
|
|
|
|
|
.. table:: Efficiency comparison. Lower is better.
|
2015-01-15 23:08:35 +03:00
|
|
|
|
|
|
|
|
|
+--------------+---------------------------+--------------------------------+
|
|
|
|
|
| | Absolute (ms per doc) | Relative (to spaCy) |
|
|
|
|
|
+--------------+----------+--------+-------+----------+---------+-----------+
|
|
|
|
|
| System | Tokenize | Tag | Parse | Tokenize | Tag | Parse |
|
|
|
|
|
+--------------+----------+--------+-------+----------+---------+-----------+
|
|
|
|
|
| spaCy | 0.2ms | 1ms | 7ms | 1x | 1x | 1x |
|
|
|
|
|
+--------------+----------+--------+-------+----------+---------+-----------+
|
|
|
|
|
| CoreNLP | 2ms | 10ms | 49ms | 10x | 10x | 7x |
|
|
|
|
|
+--------------+----------+--------+-------+----------+---------+-----------+
|
|
|
|
|
| ZPar | 1ms | 8ms | 850ms | 5x | 8x | 121x |
|
|
|
|
|
+--------------+----------+--------+-------+----------+---------+-----------+
|
|
|
|
|
| NLTK | 4ms | 443ms | n/a | 20x | 443x | n/a |
|
|
|
|
|
+--------------+----------+--------+-------+----------+---------+-----------+
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Efficiency is a major concern for NLP applications. It is very common to hear
|
|
|
|
|
people say that they cannot afford more detailed processing, because their
|
|
|
|
|
datasets are too large. This is a bad position to be in. If you can't apply
|
|
|
|
|
detailed processing, you generally have to cobble together various heuristics.
|
|
|
|
|
This normally takes a few iterations, and what you come up with will usually be
|
|
|
|
|
brittle and difficult to reason about.
|
|
|
|
|
|
|
|
|
|
spaCy's parser is faster than most taggers, and its tokenizer is fast enough
|
2015-01-23 00:22:00 +03:00
|
|
|
|
for any workload. And the tokenizer doesn't just give you a list
|
2015-01-15 23:08:35 +03:00
|
|
|
|
of strings. A spaCy token is a pointer to a Lexeme struct, from which you can
|
2015-01-23 00:22:00 +03:00
|
|
|
|
access a wide range of pre-computed features, including embedded word
|
|
|
|
|
representations.
|
2015-01-15 23:08:35 +03:00
|
|
|
|
|
|
|
|
|
.. I wrote spaCy because I think existing commercial NLP engines are crap.
|
|
|
|
|
Alchemy API are a typical example. Check out this part of their terms of
|
|
|
|
|
service:
|
|
|
|
|
publish or perform any benchmark or performance tests or analysis relating to
|
|
|
|
|
the Service or the use thereof without express authorization from AlchemyAPI;
|
|
|
|
|
|
|
|
|
|
.. Did you get that? You're not allowed to evaluate how well their system works,
|
|
|
|
|
unless you're granted a special exception. Their system must be pretty
|
|
|
|
|
terrible to motivate such an embarrassing restriction.
|
|
|
|
|
They must know this makes them look bad, but they apparently believe allowing
|
|
|
|
|
you to evaluate their product would make them look even worse!
|
|
|
|
|
|
|
|
|
|
.. spaCy is based on science, not alchemy. It's open source, and I am happy to
|
|
|
|
|
clarify any detail of the algorithms I've implemented.
|
|
|
|
|
It's evaluated against the current best published systems, following the standard
|
|
|
|
|
methodologies. These evaluations show that it performs extremely well.
|
|
|
|
|
|
2015-01-24 17:25:27 +03:00
|
|
|
|
Accuracy Comparison
|
|
|
|
|
-------------------
|
2015-01-15 23:08:35 +03:00
|
|
|
|
|
2015-01-23 00:22:00 +03:00
|
|
|
|
.. table:: Accuracy comparison, on the standard benchmark data from the Wall Street Journal.
|
|
|
|
|
|
|
|
|
|
.. See `Benchmarks`_ for details.
|
2015-01-15 23:08:35 +03:00
|
|
|
|
|
|
|
|
|
+--------------+----------+------------+
|
|
|
|
|
| System | POS acc. | Parse acc. |
|
|
|
|
|
+--------------+----------+------------+
|
|
|
|
|
| spaCy | 97.2 | 92.4 |
|
|
|
|
|
+--------------+----------+------------+
|
|
|
|
|
| CoreNLP | 96.9 | 92.2 |
|
|
|
|
|
+--------------+----------+------------+
|
|
|
|
|
| ZPar | 97.3 | 92.9 |
|
|
|
|
|
+--------------+----------+------------+
|
2015-01-23 00:22:00 +03:00
|
|
|
|
| Redshift | 97.3 | 93.5 |
|
|
|
|
|
+--------------+----------+------------+
|
2015-01-15 23:08:35 +03:00
|
|
|
|
| NLTK | 94.3 | n/a |
|
|
|
|
|
+--------------+----------+------------+
|
|
|
|
|
|
2015-01-23 00:22:00 +03:00
|
|
|
|
The table above compares spaCy to some of the current state-of-the-art systems,
|
|
|
|
|
on the standard evaluation from the Wall Street Journal, given gold-standard
|
|
|
|
|
sentence boundaries and tokenization. I'm in the process of completing a more
|
|
|
|
|
realistic evaluation on web text.
|
|
|
|
|
|
|
|
|
|
spaCy's parser offers a better speed/accuracy trade-off than any published
|
|
|
|
|
system: its accuracy is within 1% of the current state-of-the-art, and it's
|
|
|
|
|
seven times faster than the 2014 CoreNLP neural network parser, which is the
|
|
|
|
|
previous fastest parser that I'm aware of.
|
2014-12-23 07:17:56 +03:00
|
|
|
|
|
2014-12-30 13:20:34 +03:00
|
|
|
|
|
2014-09-25 20:42:20 +04:00
|
|
|
|
.. toctree::
|
|
|
|
|
:maxdepth: 3
|
2014-12-01 14:55:13 +03:00
|
|
|
|
|
2015-01-17 08:19:21 +03:00
|
|
|
|
index.rst
|
2015-01-15 23:08:35 +03:00
|
|
|
|
quickstart.rst
|
2014-12-30 13:20:34 +03:00
|
|
|
|
api.rst
|
2015-01-17 08:19:21 +03:00
|
|
|
|
howworks.rst
|
|
|
|
|
license.rst
|