From edd898947c87444850b210e07966b33e7082233d Mon Sep 17 00:00:00 2001
From: Matthew Honnibal
Date: Fri, 23 Jan 2015 08:22:00 +1100
Subject: [PATCH] * Improve example functionality, adding usage of word vectors

---
 docs/source/index.rst | 172 ++++++++++++++++++++++++++++++++----------
 1 file changed, 132 insertions(+), 40 deletions(-)

diff --git a/docs/source/index.rst b/docs/source/index.rst
index 74359ab0f..392d509bf 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -7,16 +7,17 @@
 spaCy: Industrial-strength NLP
 ==============================

-spaCy is a library for industrial-strength text processing in Python and Cython.
+spaCy is a new library for industrial-strength text processing in Python and Cython.
 It is commercial open source software, with a dual (AGPL or commercial)
 license.

-If you're a small company doing NLP, spaCy might seem like a minor miracle.
+I've been working on this full-time for the last six months, and am excited to
+announce its beta release.
+If you're a small company doing NLP, spaCy should seem like a minor miracle.
 It's by far the fastest NLP software available. The full processing pipeline
-completes in 7ms, including state-of-the-art part-of-speech tagging and
-dependency parsing. All strings are mapped to integer IDs, tokens
-are linked to word vectors and other lexical resources, and a range of useful
-features are pre-calculated and cached.
+completes in 7ms, including state-of-the-art tagging and parsing. All strings
+are mapped to integer IDs, tokens are linked to embedded word representations,
+and a range of useful features are pre-calculated and cached.

 If none of that made any sense to you, here's the gist of it. Computers don't
 understand text. This is unfortunate, because that's what the web almost entirely
@@ -34,34 +35,31 @@ Example functionality

 Let's say you're developing a proofreading tool, or possibly an IDE for writers.
 You're convinced by Stephen King's advice that `adverbs are not your friend `_, so
-you want to **mark adverbs in red**. We'll use one of the examples he finds
+you want to **highlight all adverbs**. We'll use one of the examples he finds
 particularly egregious:

     >>> import spacy.en
-    >>> from spacy.enums import ADVERB
+    >>> from spacy.postags import ADVERB
     >>> # Load the pipeline, and call it with some text.
     >>> nlp = spacy.en.English()
     >>> tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’", tag=True, parse=True)
     >>> output = ''
     >>> for tok in tokens:
-    ...     # Token.string preserves whitespace, making it easy to
-    ...     # reconstruct the original string.
-    ...     output += tok.string.upper() if tok.is_pos(ADVERB) else tok.string
+    ...     output += tok.string.upper() if tok.pos == ADVERB else tok.string
+    ...     output += tok.whitespace
     >>> print(output)
     ‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’


 Easy enough --- but the problem is that we've also highlighted "back", when probably
-we only wanted to highlight "abjectly". This is undoubtedly an adverb, but it's
-not the sort of adverb King is talking about. This is a persistent problem when
-dealing with linguistic categories: the prototypical examples, the ones whic
-spring to your mind, are often not the most common cases.
+we only wanted to highlight "abjectly". While "back" is undoubtedly an adverb,
+we probably don't want to highlight it.

 There are lots of ways we might refine our logic, depending on just what words
 we want to flag. The simplest way to filter out adverbs like "back" and "not"
-is by word frequency: these words are much more common than the manner adverbs
-the style guides are worried about.
+is by word frequency: these words are much more common than the prototypical
+manner adverbs that the style guides are worried about.

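To make the frequency idea concrete, here is a minimal, self-contained sketch in plain Python. It does not use spaCy: the word counts are invented for illustration, the helper names ``log_probs`` and ``frequency_cutoff`` are hypothetical, and the log probabilities are unsmoothed relative frequencies rather than the smoothed estimates this document describes. It only shows the thresholding step: find the log probability of the Nth most frequent word, then keep the words that fall below it.

```python
import math
from collections import Counter


def log_probs(counts):
    """Map each word to the natural log of its relative frequency."""
    total = float(sum(counts.values()))
    return {word: math.log(count / total) for word, count in counts.items()}


def frequency_cutoff(probs, n):
    """Return the log probability of the n-th most frequent word."""
    ranked = sorted(probs.values(), reverse=True)
    return ranked[n - 1]


# Toy counts, invented purely for illustration (not taken from any corpus).
counts = Counter({"the": 500, "not": 120, "back": 80, "quickly": 6, "abjectly": 1})
probs = log_probs(counts)

# Treat the 3 most frequent words as "too common to flag".
cutoff = frequency_cutoff(probs, 3)

# Keep only the words that are strictly rarer than the cutoff word.
rare_words = sorted(w for w, p in probs.items() if p < cutoff)
print(rare_words)
```

With real data, the same comparison would be applied to each token's log probability, with N chosen to suit the text being checked.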
 The prob attribute of a Lexeme or Token object gives a log probability
 estimate of the word, based on smoothed counts from a 3bn word corpus:
@@ -77,37 +75,117 @@ So we can easily exclude the N most frequent words in English from our adverb
 marker. Let's try N=1000 for now:

     >>> import spacy.en
-    >>> from spacy.enums import ADVERB
+    >>> from spacy.postags import ADVERB
     >>> nlp = spacy.en.English()
     >>> # Find log probability of Nth most frequent word
     >>> probs = [lex.prob for lex in nlp.vocab]
-    >>> is_adverb = lambda tok: tok.is_pos(ADVERB) and tok.prob < probs[-1000]
+    >>> is_adverb = lambda tok: tok.pos == ADVERB and tok.prob < probs[-1000]
     >>> tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’", tag=True, parse=True)
     >>> print(''.join(tok.string.upper() if is_adverb(tok) else tok.string))
     ‘Give it back,’ he pleaded ABJECTLY, ‘it’s mine.’

-There are lots of ways to refine the logic, depending on just what words we
-want to flag. Let's define this narrowly, and only flag adverbs applied to
-verbs of communication or perception:
+There are lots of ways we could refine the logic, depending on just what words we
+want to flag. Let's say we wanted to only flag adverbs that modified words
+similar to "pleaded". This is easy to do, as spaCy loads a vector-space
+representation for every word (by default, the vectors produced by
+`Levy and Goldberg (2014)`_). Naturally, the vector is provided as a numpy
+array:

-    >>> from spacy.enums import VERB, WN_V_COMMUNICATION, WN_V_COGNITION
-    >>> def is_say_verb(tok):
-    ...     return tok.is_pos(VERB) and (tok.check_flag(WN_V_COMMUNICATION) or
-                                         tok.check_flag(WN_V_COGNITION))
-    >>> print(''.join(tok.string.upper() if is_adverb(tok) and is_say_verb(tok.head)
-                                         else tok.string))
-    ‘Give it back,’ he pleaded ABJECTLY, ‘it’s mine.’
+    >>> pleaded = tokens[8]
+    >>> pleaded.repvec.shape
+    (300,)

-The two flags refer to the 45 top-level categories in the WordNet ontology.
-spaCy stores membership in these categories as a bit set, because
-words can have multiple senses. We only need one 64
-bit flag variable per word in the vocabulary, so this useful data requires only
-2.4mb of memory.
+.. _Levy and Goldberg (2014): https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
+
+We want to sort the words in our vocabulary by their similarity to "pleaded".
+There are lots of ways to measure the similarity of two vectors. We'll use the
+cosine metric:
+
+    >>> from numpy import dot
+    >>> from numpy.linalg import norm
+    >>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
+    >>> words = [w for w in nlp.vocab if w.is_lower and w.has_repvec]
+    >>> words.sort(key=lambda w: cosine(w.repvec, pleaded.repvec))
+    >>> words.reverse()
+    >>> print '1-20', ', '.join(w.orth_ for w in words[0:20])
+    1-20 pleaded, pled, plead, confessed, interceded, pleads, testified, conspired, motioned, demurred, countersued, remonstrated, begged, apologised, consented, acquiesced, petitioned, quarreled, appealed, pleading
+    >>> print '50-60', ', '.join(w.orth_ for w in words[50:60])
+    50-60 counselled, bragged, backtracked, caucused, refiled, dueled, mused, dissented, yearned, confesses
+    >>> print '100-110', ', '.join(w.orth_ for w in words[100:110])
+    100-110 cabled, ducked, sentenced, perjured, absconded, bargained, overstayed, clerked, confided, sympathizes
+    >>> print '1000-1010', ', '.join(w.orth_ for w in words[1000:1010])
+    1000-1010 scorned, baled, righted, requested, swindled, posited, firebombed, slimed, deferred, sagged
+    >>> print ', '.join(w.orth_ for w in words[50000:50010])
+    fb, ford, systems, puck, anglers, ik, tabloid, dirty, rims, artists
+
+As you can see, the similarity model that these vectors give us is excellent
+--- we're still getting meaningful results at 1000 words, off a single
+prototype!
+The only problem is that the list really contains two clusters of
+words: one associated with the legal meaning of "pleaded", and one for the more
+general sense. Sorting out these clusters is an area of active research.
+
+A simple work-around is to average the vectors of several words, and use that
+as our target:
+
+    >>> import numpy
+    >>> say_verbs = [u'pleaded', u'confessed', u'remonstrated', u'begged',
+    ...              u'bragged', u'confided', u'requested']
+    >>> say_vector = numpy.zeros(shape=(300,))
+    >>> for verb in say_verbs:
+    ...     say_vector += nlp.vocab[verb].repvec
+    >>> words.sort(key=lambda w: cosine(w.repvec, say_vector))
+    >>> words.reverse()
+    >>> print '1-20', ', '.join(w.orth_ for w in words[0:20])
+    1-20 bragged, remonstrated, enquired, demurred, sighed, mused, intimated, retorted, entreated, motioned, ranted, confided, countersued, gestured, implored, interceded, muttered, marvelled, bickered, despaired
+    >>> print '50-60', ', '.join(w.orth_ for w in words[50:60])
+    50-60 flaunted, quarrelled, ingratiated, vouched, agonized, apologised, lunched, joked, chafed, schemed
+    >>> print '1000-1010', ', '.join(w.orth_ for w in words[1000:1010])
+    1000-1010 hoarded, waded, ensnared, clamoring, abided, deploring, shriveled, endeared, rethought, berate
+
+These definitely look like words that King might scold a writer for attaching
+adverbs to. Recall that our previous adverb highlighting function looked like
+this:
+
+    >>> import spacy.en
+    >>> from spacy.postags import ADVERB
+    >>> # Load the pipeline, and call it with some text.
+    >>> nlp = spacy.en.English()
+    >>> tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
+    ...              tag=True, parse=True)
+    >>> output = ''
+    >>> for tok in tokens:
+    ...     output += tok.string.upper() if tok.pos == ADVERB else tok.string
+    ...     output += tok.whitespace
+    >>> print(output)
+    ‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’
+
+We wanted to refine the logic so that only adverbs modifying evocative verbs
+of communication, like "pleaded", were highlighted.
+We've now built a vector that
+represents that type of word, so now we can highlight adverbs based on very
+subtle logic, homing in on adverbs that seem the most stylistically
+problematic, given our starting assumptions:
+
+    >>> import numpy
+    >>> from numpy import dot
+    >>> from numpy.linalg import norm
+    >>> import spacy.en
+    >>> from spacy.postags import ADVERB, VERB
+    >>> def is_bad_adverb(token, target_verb, tol):
+    ...     if token.pos != ADVERB:
+    ...         return False
+    ...     elif token.head.pos != VERB:
+    ...         return False
+    ...     elif cosine(token.head.repvec, target_verb) < tol:
+    ...         return False
+    ...     else:
+    ...         return True
+
+
+This example was somewhat contrived --- and, truth be told, I've never really
+bought the idea that adverbs were a grave stylistic sin. But hopefully it got
+the message across: the state-of-the-art NLP technologies are very powerful.
+spaCy gives you easy and efficient access to them, which lets you build all
+sorts of useful products and features that were previously impossible.

-spaCy packs all sorts of other goodies into its lexicon.
-Words are mapped to one these rich lexical types immediately, during
-tokenization --- and spaCy's tokenizer is *fast*.

 Efficiency
 ----------
@@ -137,9 +215,10 @@ This normally takes a few iterations, and what you come up with will usually be
 brittle and difficult to reason about.

 spaCy's parser is faster than most taggers, and its tokenizer is fast enough
-for truly web-scale processing. And the tokenizer doesn't just give you a list
+for any workload. And the tokenizer doesn't just give you a list
 of strings. A spaCy token is a pointer to a Lexeme struct, from which you can
-access a wide range of pre-computed features.
+access a wide range of pre-computed features, including embedded word
+representations.

 .. I wrote spaCy because I think existing commercial NLP engines are crap.
    Alchemy API are a typical example.
    Check out this part of their terms of
@@ -161,7 +240,9 @@ access a wide range of pre-computed features.
 Accuracy
 --------

-.. table:: Accuracy comparison, on the standard benchmark data from the Wall Street Journal. See `Benchmarks`_ for details.
+.. table:: Accuracy comparison, on the standard benchmark data from the Wall Street Journal.
+
+.. See `Benchmarks`_ for details.

     +--------------+----------+------------+
     | System       | POS acc. | Parse acc. |
@@ -172,9 +253,20 @@ Accuracy
     +--------------+----------+------------+
     | ZPar         | 97.3     | 92.9       |
     +--------------+----------+------------+
+    | Redshift     | 97.3     | 93.5       |
+    +--------------+----------+------------+
     | NLTK         | 94.3     | n/a        |
     +--------------+----------+------------+

+The table above compares spaCy to some of the current state-of-the-art systems,
+on the standard evaluation from the Wall Street Journal, given gold-standard
+sentence boundaries and tokenization. I'm in the process of completing a more
+realistic evaluation on web text.
+
+spaCy's parser offers a better speed/accuracy trade-off than any published
+system: its accuracy is within 1% of the current state-of-the-art, and it's
+seven times faster than the 2014 CoreNLP neural network parser, which is the
+previous fastest parser that I'm aware of.

 .. toctree::