mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-13 10:46:29 +03:00
* Improve example functionality, adding usage of word vectors
parent
5ed8b2b98f
commit
edd898947c
|
@ -7,16 +7,17 @@
|
|||
spaCy: Industrial-strength NLP
|
||||
==============================
|
||||
|
||||
spaCy is a library for industrial-strength text processing in Python and Cython.
|
||||
spaCy is a new library for industrial-strength text processing in Python and Cython.
|
||||
It is commercial open source software, with a dual (AGPL or commercial)
|
||||
license.
|
||||
|
||||
If you're a small company doing NLP, spaCy might seem like a minor miracle.
|
||||
I've been working on this full-time for the last six months, and am excited to
|
||||
announce its beta release.
|
||||
If you're a small company doing NLP, spaCy should seem like a minor miracle.
|
||||
It's by far the fastest NLP software available. The full processing pipeline
|
||||
completes in 7ms, including state-of-the-art part-of-speech tagging and
|
||||
dependency parsing. All strings are mapped to integer IDs, tokens
|
||||
are linked to word vectors and other lexical resources, and a range of useful
|
||||
features are pre-calculated and cached.
|
||||
completes in 7ms, including state-of-the-art tagging and parsing. All strings
|
||||
are mapped to integer IDs, tokens are linked to embedded word representations,
|
||||
and a range of useful features are pre-calculated and cached.
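The idea of mapping every string to an integer ID can be sketched in a few lines. This is only an illustration of the technique, not spaCy's implementation; the class name `ToyStringTable` and its methods are made up for the example.

```python
class ToyStringTable:
    """Hypothetical string-to-ID table, for illustration only."""

    def __init__(self):
        self._ids = {}       # string -> integer ID
        self._strings = []   # integer ID -> string

    def intern(self, string):
        """Return the ID for a string, assigning a new one on first sight."""
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def lookup(self, string_id):
        """Recover the original string from its ID."""
        return self._strings[string_id]


table = ToyStringTable()
ids = [table.intern(word) for word in "give it back , it 's mine".split()]
print(ids)
print(table.lookup(ids[2]))
```

Repeated words share one ID, so later stages can compare and hash small integers instead of strings, and features can be cached once per ID.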
|
||||
|
||||
If none of that made any sense to you, here's the gist of it. Computers don't
|
||||
understand text. This is unfortunate, because that's what the web almost entirely
|
||||
|
@ -34,34 +35,31 @@ Example functionality
|
|||
Let's say you're developing a proofreading tool, or possibly an IDE for
|
||||
writers. You're convinced by Stephen King's advice that `adverbs are not your
|
||||
friend <http://www.brainpickings.org/2013/03/13/stephen-king-on-adverbs/>`_, so
|
||||
you want to **mark adverbs in red**. We'll use one of the examples he finds
|
||||
you want to **highlight all adverbs**. We'll use one of the examples he finds
|
||||
particularly egregious:
|
||||
|
||||
>>> import spacy.en
|
||||
>>> from spacy.enums import ADVERB
|
||||
>>> from spacy.postags import ADVERB
|
||||
>>> # Load the pipeline, and call it with some text.
|
||||
>>> nlp = spacy.en.English()
|
||||
>>> tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
|
||||
tag=True, parse=True)
|
||||
>>> output = ''
|
||||
>>> for tok in tokens:
|
||||
...     # Token.string preserves whitespace, making it easy to
|
||||
...     # reconstruct the original string.
|
||||
...     output += tok.string.upper() if tok.is_pos(ADVERB) else tok.string
|
||||
...     output += tok.string.upper() if tok.pos == ADVERB else tok.string
|
||||
...     output += tok.whitespace
|
||||
>>> print(output)
|
||||
‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’
|
||||
|
||||
|
||||
Easy enough --- but the problem is that we've also highlighted "back", when probably
|
||||
we only wanted to highlight "abjectly". This is undoubtedly an adverb, but it's
|
||||
not the sort of adverb King is talking about. This is a persistent problem when
|
||||
dealing with linguistic categories: the prototypical examples, the ones which
|
||||
spring to your mind, are often not the most common cases.
|
||||
we only wanted to highlight "abjectly". While "back" is undoubtedly an adverb,
|
||||
we probably don't want to highlight it.
|
||||
|
||||
There are lots of ways we might refine our logic, depending on just what words
|
||||
we want to flag. The simplest way to filter out adverbs like "back" and "not"
|
||||
is by word frequency: these words are much more common than the manner adverbs
|
||||
the style guides are worried about.
|
||||
is by word frequency: these words are much more common than the prototypical
|
||||
manner adverbs that the style guides are worried about.
|
||||
|
||||
The prob attribute of a Lexeme or Token object gives a log probability estimate
|
||||
of the word, based on smoothed counts from a 3bn word corpus:
|
||||
|
@ -77,37 +75,117 @@ So we can easily exclude the N most frequent words in English from our adverb
|
|||
marker. Let's try N=1000 for now:
|
||||
|
||||
>>> import spacy.en
|
||||
>>> from spacy.enums import ADVERB
|
||||
>>> from spacy.postags import ADVERB
|
||||
>>> nlp = spacy.en.English()
|
||||
>>> # Find log probability of Nth most frequent word
|
||||
>>> probs = sorted(lex.prob for lex in nlp.vocab)
|
||||
>>> is_adverb = lambda tok: tok.is_pos(ADVERB) and tok.prob < probs[-1000]
|
||||
>>> is_adverb = lambda tok: tok.pos == ADVERB and tok.prob < probs[-1000]
|
||||
>>> tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
|
||||
tag=True, parse=True)
|
||||
>>> print(''.join(tok.string.upper() if is_adverb(tok) else tok.string for tok in tokens))
|
||||
‘Give it back,’ he pleaded ABJECTLY, ‘it’s mine.’
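The frequency cut-off can be tried out without loading the full vocabulary. The sketch below uses a tiny table of log probabilities whose values are invented for illustration (they are not spaCy data), and the helper names `nth_most_frequent_threshold` and `is_rare` are hypothetical.

```python
# Toy log-probability table; the numbers are made up for this example.
toy_log_probs = {
    "the": -3.5, "it": -4.8, "back": -7.2, "not": -5.9,
    "pleaded": -12.4, "abjectly": -16.1,
}


def nth_most_frequent_threshold(log_probs, n):
    """Return the log probability of the n-th most frequent word."""
    ranked = sorted(log_probs.values(), reverse=True)  # most frequent first
    return ranked[n - 1]


def is_rare(word, log_probs, threshold):
    """True if the word is less frequent than the threshold."""
    return log_probs[word] < threshold


threshold = nth_most_frequent_threshold(toy_log_probs, 4)
rare_words = [w for w in toy_log_probs if is_rare(w, toy_log_probs, threshold)]
print(threshold, rare_words)
```

The list has to be sorted before indexing into it; otherwise the N-th entry is just whichever word happens to be stored there.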
|
||||
|
||||
There are lots of ways to refine the logic, depending on just what words we
|
||||
want to flag. Let's define this narrowly, and only flag adverbs applied to
|
||||
verbs of communication or perception:
|
||||
There are lots of ways we could refine the logic, depending on just what words we
|
||||
want to flag. Let's say we wanted to only flag adverbs that modified words
|
||||
similar to "pleaded". This is easy to do, as spaCy loads a vector-space
|
||||
representation for every word (by default, the vectors produced by
|
||||
`Levy and Goldberg (2014)`_). Naturally, the vector is provided as a numpy
|
||||
array:
|
||||
|
||||
>>> from spacy.enums import VERB, WN_V_COMMUNICATION, WN_V_COGNITION
|
||||
>>> def is_say_verb(tok):
|
||||
...     return tok.is_pos(VERB) and (tok.check_flag(WN_V_COMMUNICATION) or
|
||||
...                                  tok.check_flag(WN_V_COGNITION))
|
||||
>>> print(''.join(tok.string.upper() if is_adverb(tok) and is_say_verb(tok.head)
|
||||
...               else tok.string for tok in tokens))
|
||||
‘Give it back,’ he pleaded ABJECTLY, ‘it’s mine.’
|
||||
>>> pleaded = tokens[8]
|
||||
>>> pleaded.repvec.shape
|
||||
(300,)
|
||||
|
||||
The two flags refer to the 45 top-level categories in the WordNet ontology.
|
||||
spaCy stores membership in these categories as a bit set, because
|
||||
words can have multiple senses. We only need one 64
|
||||
bit flag variable per word in the vocabulary, so this useful data requires only
|
||||
2.4mb of memory.
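A bit set of this kind can be sketched with plain integers. The flag indices below are hypothetical (the real IDs are not given here), and the vocabulary size of 300,000 is an assumption, inferred only because it is what 2.4mb at 8 bytes per word would imply.

```python
# Hypothetical category indices, chosen for illustration only.
WN_V_COMMUNICATION = 3
WN_V_COGNITION = 7


def set_flag(flags, flag_id):
    """Switch on one category bit."""
    return flags | (1 << flag_id)


def check_flag(flags, flag_id):
    """Test whether a category bit is switched on."""
    return bool(flags & (1 << flag_id))


# A word with two senses gets both bits set in the same integer.
flags = 0
flags = set_flag(flags, WN_V_COMMUNICATION)
flags = set_flag(flags, WN_V_COGNITION)

# Memory estimate: one 64-bit (8-byte) value per word.
assumed_vocab_size = 300000  # assumption, see above
memory_mb = assumed_vocab_size * 8 / 1e6
print(check_flag(flags, WN_V_COGNITION), memory_mb)
```

With 45 categories, every flag fits comfortably inside a single 64-bit value.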
|
||||
.. _Levy and Goldberg (2014): https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
|
||||
|
||||
We want to sort the words in our vocabulary by their similarity to "pleaded".
|
||||
There are lots of ways to measure the similarity of two vectors. We'll use the
|
||||
cosine metric:
|
||||
|
||||
>>> from numpy import dot
|
||||
>>> from numpy.linalg import norm
|
||||
>>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
|
||||
>>> words = [w for w in nlp.vocab if w.is_lower and w.has_repvec]
|
||||
>>> words.sort(key=lambda w: cosine(w.repvec, pleaded.repvec))
|
||||
>>> words.reverse()
|
||||
>>> print '1-20', ', '.join(w.orth_ for w in words[0:20])
|
||||
1-20 pleaded, pled, plead, confessed, interceded, pleads, testified, conspired, motioned, demurred, countersued, remonstrated, begged, apologised, consented, acquiesced, petitioned, quarreled, appealed, pleading
|
||||
>>> print '50-60', ', '.join(w.orth_ for w in words[50:60])
|
||||
50-60 counselled, bragged, backtracked, caucused, refiled, dueled, mused, dissented, yearned, confesses
|
||||
>>> print '100-110', ', '.join(w.orth_ for w in words[100:110])
|
||||
100-110 cabled, ducked, sentenced, perjured, absconded, bargained, overstayed, clerked, confided, sympathizes
|
||||
>>> print '1000-1010', ', '.join(w.orth_ for w in words[1000:1010])
|
||||
1000-1010 scorned, baled, righted, requested, swindled, posited, firebombed, slimed, deferred, sagged
|
||||
>>> print ', '.join(w.orth_ for w in words[50000:50010])
|
||||
fb, ford, systems, puck, anglers, ik, tabloid, dirty, rims, artists
|
||||
|
||||
As you can see, the similarity model that these vectors give us is excellent
|
||||
--- we're still getting meaningful results at 1000 words, off a single
|
||||
prototype! The only problem is that the list really contains two clusters of
|
||||
words: one associated with the legal meaning of "pleaded", and one for the more
|
||||
general sense. Sorting out these clusters is an area of active research.
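For reference, here is the cosine ranking as a standalone sketch. The three-dimensional vectors are invented for illustration (the real ones have 300 dimensions), and `toy_vectors` is a stand-in for the vocabulary, not a spaCy object.

```python
import numpy as np


def cosine(v1, v2):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))


# Hypothetical toy vectors, made up for this example.
toy_vectors = {
    "pleaded": np.array([1.0, 2.0, 0.0]),
    "begged": np.array([2.0, 3.0, 0.0]),    # nearly the same direction
    "artists": np.array([-2.0, 1.0, 0.0]),  # orthogonal to "pleaded"
    "sagged": np.array([1.0, 0.0, 1.0]),
}

target = toy_vectors["pleaded"]
ranked = sorted(toy_vectors,
                key=lambda w: cosine(toy_vectors[w], target),
                reverse=True)
print(ranked)
```

Cosine only compares direction, so a long vector and a short one pointing the same way score as identical.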
|
||||
|
||||
A simple work-around is to average the vectors of several words, and use that
|
||||
as our target:
|
||||
|
||||
>>> say_verbs = [u'pleaded', u'confessed', u'remonstrated', u'begged',
|
||||
u'bragged', u'confided', u'requested']
|
||||
>>> import numpy
|
||||
>>> say_vector = numpy.zeros(shape=(300,))
|
||||
>>> for verb in say_verbs:
|
||||
...     say_vector += nlp.vocab[verb].repvec
|
||||
>>> words.sort(key=lambda w: cosine(w.repvec, say_vector))
|
||||
>>> words.reverse()
|
||||
>>> print '1-20', ', '.join(w.orth_ for w in words[0:20])
|
||||
1-20 bragged, remonstrated, enquired, demurred, sighed, mused, intimated, retorted, entreated, motioned, ranted, confided, countersued, gestured, implored, interceded, muttered, marvelled, bickered, despaired
|
||||
>>> print '50-60', ', '.join(w.orth_ for w in words[50:60])
|
||||
50-60 flaunted, quarrelled, ingratiated, vouched, agonized, apologised, lunched, joked, chafed, schemed
|
||||
>>> print '1000-1010', ', '.join(w.orth_ for w in words[1000:1010])
|
||||
1000-1010 hoarded, waded, ensnared, clamoring, abided, deploring, shriveled, endeared, rethought, berate
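The averaging trick is easy to reproduce on toy data. In this sketch the vectors are invented and `toy_vectors` is a hypothetical stand-in for the vocabulary. It takes the mean rather than the sum, which makes no difference to the ranking because cosine ignores vector length.

```python
import numpy as np


def cosine(v1, v2):
    """Cosine similarity between two vectors."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))


# Hypothetical toy vectors, made up for this example.
toy_vectors = {
    "pleaded": np.array([1.0, 0.2, 0.0]),
    "confessed": np.array([0.8, 0.0, 0.3]),
    "begged": np.array([0.9, 0.3, 0.1]),
    "muttered": np.array([0.9, 0.1, 0.2]),
    "tabloid": np.array([0.0, 1.0, 0.9]),
}

# Average several prototype words into a single target vector.
prototypes = ["pleaded", "confessed", "begged"]
say_vector = np.mean([toy_vectors[w] for w in prototypes], axis=0)

# Rank the remaining words by similarity to the averaged target.
candidates = [w for w in toy_vectors if w not in prototypes]
ranked = sorted(candidates,
                key=lambda w: cosine(toy_vectors[w], say_vector),
                reverse=True)
print(ranked)
```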
|
||||
|
||||
These definitely look like words that King might scold a writer for attaching
|
||||
adverbs to. Recall that our previous adverb highlighting function looked like
|
||||
this:
|
||||
|
||||
>>> import spacy.en
|
||||
>>> from spacy.postags import ADVERB
|
||||
>>> # Load the pipeline, and call it with some text.
|
||||
>>> nlp = spacy.en.English()
|
||||
>>> tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
|
||||
tag=True, parse=True)
|
||||
>>> output = ''
|
||||
>>> for tok in tokens:
|
||||
...     output += tok.string.upper() if tok.pos == ADVERB else tok.string
|
||||
...     output += tok.whitespace
|
||||
>>> print(output)
|
||||
‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’
|
||||
|
||||
We wanted to refine the logic so that only adverbs modifying evocative verbs
|
||||
of communication, like "pleaded", were highlighted. We've now built a vector that
|
||||
represents that type of word, so now we can highlight adverbs based on very
|
||||
subtle logic, homing in on adverbs that seem the most stylistically
|
||||
problematic, given our starting assumptions:
|
||||
|
||||
>>> import numpy
|
||||
>>> from numpy import dot
|
||||
>>> from numpy.linalg import norm
|
||||
>>> import spacy.en
|
||||
>>> from spacy.postags import ADVERB, VERB
|
||||
>>> def is_bad_adverb(token, target_verb, tol):
|
||||
...     if token.pos != ADVERB:
|
||||
...         return False
|
||||
...     elif token.head.pos != VERB:
|
||||
...         return False
|
||||
...     elif cosine(token.head.repvec, target_verb) < tol:
|
||||
...         return False
|
||||
...     else:
|
||||
...         return True
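To see the function in action without the full pipeline, here is a self-contained sketch. The `Tok` tuple, the string constants and the two-dimensional vectors are all stand-ins invented for the example, and the tolerance of 0.8 is an arbitrary choice. The logic mirrors the function above, with an extra guard for tokens that have no head.

```python
import math
from collections import namedtuple

# Hypothetical stand-ins for spaCy's tag constants and Token class.
ADVERB, VERB = "ADV", "VERB"
Tok = namedtuple("Tok", ["text", "pos", "repvec", "head"])


def cosine(v1, v2):
    """Cosine similarity on plain Python lists."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norms


def is_bad_adverb(token, target_verb, tol):
    """Flag adverbs whose head is a verb close to the target vector."""
    if token.pos != ADVERB:
        return False
    if token.head is None or token.head.pos != VERB:
        return False
    return cosine(token.head.repvec, target_verb) >= tol


say_vector = [1.0, 0.1]  # made-up "verb of communication" direction
pleaded = Tok("pleaded", VERB, [0.9, 0.2], None)
give = Tok("Give", VERB, [0.1, 1.0], None)
abjectly = Tok("abjectly", ADVERB, [0.5, 0.5], pleaded)
back = Tok("back", ADVERB, [0.5, 0.5], give)

flagged = [t.text for t in (give, back, pleaded, abjectly)
           if is_bad_adverb(t, say_vector, 0.8)]
print(flagged)
```

Only "abjectly" is flagged here: "back" is an adverb too, but its head verb points in a different direction from the target.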
|
||||
|
||||
|
||||
This example was somewhat contrived --- and, truth be told, I've never really
|
||||
bought the idea that adverbs were a grave stylistic sin. But hopefully it got
|
||||
the message across: the state-of-the-art NLP technologies are very powerful.
|
||||
spaCy gives you easy and efficient access to them, which lets you build all
|
||||
sorts of useful products and features that were previously impossible.
|
||||
|
||||
spaCy packs all sorts of other goodies into its lexicon.
|
||||
Words are mapped to one of these rich lexical types immediately, during
|
||||
tokenization --- and spaCy's tokenizer is *fast*.
|
||||
|
||||
Efficiency
|
||||
----------
|
||||
|
@ -137,9 +215,10 @@ This normally takes a few iterations, and what you come up with will usually be
|
|||
brittle and difficult to reason about.
|
||||
|
||||
spaCy's parser is faster than most taggers, and its tokenizer is fast enough
|
||||
for truly web-scale processing. And the tokenizer doesn't just give you a list
|
||||
for any workload. And the tokenizer doesn't just give you a list
|
||||
of strings. A spaCy token is a pointer to a Lexeme struct, from which you can
|
||||
access a wide range of pre-computed features.
|
||||
access a wide range of pre-computed features, including embedded word
|
||||
representations.
|
||||
|
||||
.. I wrote spaCy because I think existing commercial NLP engines are crap.
|
||||
Alchemy API are a typical example. Check out this part of their terms of
|
||||
|
@ -161,7 +240,9 @@ access a wide range of pre-computed features.
|
|||
Accuracy
|
||||
--------
|
||||
|
||||
.. table:: Accuracy comparison, on the standard benchmark data from the Wall Street Journal. See `Benchmarks`_ for details.
|
||||
.. table:: Accuracy comparison, on the standard benchmark data from the Wall Street Journal.
|
||||
|
||||
.. See `Benchmarks`_ for details.
|
||||
|
||||
+--------------+----------+------------+
|
||||
| System | POS acc. | Parse acc. |
|
||||
|
@ -172,9 +253,20 @@ Accuracy
|
|||
+--------------+----------+------------+
|
||||
| ZPar | 97.3 | 92.9 |
|
||||
+--------------+----------+------------+
|
||||
| Redshift | 97.3 | 93.5 |
|
||||
+--------------+----------+------------+
|
||||
| NLTK | 94.3 | n/a |
|
||||
+--------------+----------+------------+
|
||||
|
||||
The table above compares spaCy to some of the current state-of-the-art systems,
|
||||
on the standard evaluation from the Wall Street Journal, given gold-standard
|
||||
sentence boundaries and tokenization. I'm in the process of completing a more
|
||||
realistic evaluation on web text.
|
||||
|
||||
spaCy's parser offers a better speed/accuracy trade-off than any published
|
||||
system: its accuracy is within 1% of the current state-of-the-art, and it's
|
||||
seven times faster than the 2014 CoreNLP neural network parser, which is the
|
||||
previous fastest parser that I'm aware of.
|
||||
|
||||
|
||||
.. toctree::
|
||||
|
|