* Fixes to examples

Matthew Honnibal 2015-01-26 13:26:42 +11:00
parent 9d3afba255
commit 00974959a9

@ -152,11 +152,11 @@ cosine metric:
>>> print('50-60', ', '.join(w.orth_ for w in words[50:60])) >>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
50-60 counselled, bragged, backtracked, caucused, refiled, dueled, mused, dissented, yearned, confesses 50-60 counselled, bragged, backtracked, caucused, refiled, dueled, mused, dissented, yearned, confesses
>>> print('100-110', ', '.join(w.orth_ for w in words[100:110])) >>> print('100-110', ', '.join(w.orth_ for w in words[100:110]))
cabled, ducked, sentenced, perjured, absconded, bargained, overstayed, clerked, confided, sympathizes 100-110 cabled, ducked, sentenced, perjured, absconded, bargained, overstayed, clerked, confided, sympathizes
>>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010])) >>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
scorned, baled, righted, requested, swindled, posited, firebombed, slimed, deferred, sagged 1000-1010 scorned, baled, righted, requested, swindled, posited, firebombed, slimed, deferred, sagged
>>> print(', '.join(w.orth_ for w in words[50000:50010])) >>> print('50000-50010', ', '.join(w.orth_ for w in words[50000:50010]))
fb, ford, systems, puck, anglers, ik, tabloid, dirty, rims, artists 50000-50010, fb, ford, systems, puck, anglers, ik, tabloid, dirty, rims, artists
As you can see, the similarity model that these vectors give us is excellent As you can see, the similarity model that these vectors give us is excellent
--- we're still getting meaningful results at 1000 words, off a single --- we're still getting meaningful results at 1000 words, off a single
@ -164,14 +164,12 @@ prototype! The only problem is that the list really contains two clusters of
words: one associated with the legal meaning of "pleaded", and one for the more words: one associated with the legal meaning of "pleaded", and one for the more
general sense. Sorting out these clusters is an area of active research. general sense. Sorting out these clusters is an area of active research.
A simple work-around is to average the vectors of several words, and use that A simple work-around is to average the vectors of several words, and use that
as our target: as our target:
>>> say_verbs = [u'pleaded', u'confessed', u'remonstrated', u'begged', >>> say_verbs = ['pleaded', 'confessed', 'remonstrated', 'begged', 'bragged', 'confided', 'requested']
u'bragged', u'confided', u'requested'] >>> say_vector = sum(nlp.vocab[verb].repvec for verb in say_verbs) / len(say_verbs)
>>> say_vector = numpy.zeros(shape=(300,))
>>> for verb in say_verbs:
... say_vector += nlp.vocab[verb].repvec
>>> words.sort(key=lambda w: cosine(w.repvec, say_vector)) >>> words.sort(key=lambda w: cosine(w.repvec, say_vector))
>>> words.reverse() >>> words.reverse()
>>> print('1-20', ', '.join(w.orth_ for w in words[0:20])) >>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
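Note on the hunk above: the new one-liner divides the summed vectors by `len(say_verbs)`, while the old loop only accumulated the sum. Cosine similarity ignores positive scaling, so both versions should give the same ranking. The sketch below illustrates this with NumPy alone. The 3-d toy vectors and the `cosine` helper are assumptions made for illustration; they are not spaCy's `repvec` data or the `cosine` function used in the document.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors standing in for 300-d `repvec` arrays.
toy_vectors = {
    'pleaded':   np.array([0.9, 0.1, 0.0]),
    'confessed': np.array([0.8, 0.3, 0.1]),
    'begged':    np.array([0.7, 0.2, 0.2]),
    'ford':      np.array([0.0, 0.1, 0.9]),
}

say_verbs = ['pleaded', 'confessed', 'begged']

# Old style: running sum. New style: mean of the vectors.
summed = sum(toy_vectors[verb] for verb in say_verbs)
say_vector = summed / len(say_verbs)

# Rank every word by similarity to the averaged target, most similar first.
ranked = sorted(toy_vectors,
                key=lambda w: cosine(toy_vectors[w], say_vector),
                reverse=True)
print(ranked)
```

Using `sorted(..., reverse=True)` is equivalent to the `sort` followed by `reverse` in the diff, except for the relative order of tied items.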
@ -181,7 +179,7 @@ as our target:
1000-1010 hoarded, waded, ensnared, clamoring, abided, deploring, shriveled, endeared, rethought, berate 1000-1010 hoarded, waded, ensnared, clamoring, abided, deploring, shriveled, endeared, rethought, berate
These definitely look like words that King might scold a writer for attaching These definitely look like words that King might scold a writer for attaching
adverbs to. Recall that our previous adverb highlighting function looked like adverbs to. Recall that our original adverb highlighting function looked like
this: this:
>>> import spacy.en >>> import spacy.en
@ -189,14 +187,11 @@ this:
>>> # Load the pipeline, and call it with some text. >>> # Load the pipeline, and call it with some text.
>>> nlp = spacy.en.English() >>> nlp = spacy.en.English()
>>> tokens = nlp("Give it back, he pleaded abjectly, its mine.", >>> tokens = nlp("Give it back, he pleaded abjectly, its mine.",
tag=True, parse=True) tag=True, parse=False)
>>> output = '' >>> print(''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens))
>>> for tok in tokens:
... output += tok.string.upper() if tok.pos == ADVERB else tok.string
... output += tok.whitespace
>>> print(output)
Give it BACK, he pleaded ABJECTLY, its mine. Give it BACK, he pleaded ABJECTLY, its mine.
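Note on the hunk above: the rewritten example joins `tok.string` values directly and drops the separate `tok.whitespace` step, which only reproduces the spacing if each token's `string` already carries its trailing whitespace. That is an assumption about the API at this commit, not something the diff confirms. The sketch below mimics that behaviour with a hypothetical `Token` namedtuple and a placeholder `ADV` constant, so it runs without spaCy.

```python
from collections import namedtuple

# Hypothetical stand-in for a token; `string` is assumed to include
# any trailing whitespace, and `pos` is a plain string tag.
Token = namedtuple('Token', ['string', 'pos'])
ADV = 'ADV'  # placeholder for the part-of-speech constant in the diff

tokens = [
    Token('Give ', 'VERB'), Token('it ', 'PRON'), Token('back', ADV),
    Token(', ', 'PUNCT'), Token('he ', 'PRON'), Token('pleaded ', 'VERB'),
    Token('abjectly', ADV), Token(', ', 'PUNCT'), Token('its ', 'PRON'),
    Token('mine', 'PRON'), Token('.', 'PUNCT'),
]

# Upper-case adverbs, leave everything else untouched, then join.
highlighted = ''.join(
    tok.string.upper() if tok.pos == ADV else tok.string
    for tok in tokens
)
print(highlighted)  # Give it BACK, he pleaded ABJECTLY, its mine.
```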
We wanted to refine the logic so that only adverbs modifying evocative verbs We wanted to refine the logic so that only adverbs modifying evocative verbs
of communication, like "pleaded", were highlighted. We've now built a vector that of communication, like "pleaded", were highlighted. We've now built a vector that
represents that type of word, so now we can highlight adverbs based on very represents that type of word, so now we can highlight adverbs based on very