mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-26 01:04:34 +03:00
Remove old files
This commit is contained in:
parent
ed4d231bb7
commit
06f2374f98
@ -1,121 +0,0 @@
include ../../_includes/_mixins

p.u-text-large Let's say you're developing a proofreading tool, or possibly an IDE for writers. You're convinced by Stephen King's advice that #[a(href="http://www.brainpickings.org/2013/03/13/stephen-king-on-adverbs" target="_blank") adverbs are not your friend] so you want to #[strong highlight all adverbs]. We'll use one of the examples he finds particularly egregious:

+code.
    >>> import spacy.en
    >>> from spacy.parts_of_speech import ADV
    >>> # Load the pipeline, and call it with some text.
    >>> nlp = spacy.en.English()
    >>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’", tag=True, parse=False)
    >>> print(u''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens))
    ‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’

p Easy enough, but the problem is that we've also highlighted "back". While "back" is undoubtedly an adverb, we probably don't want to highlight it. If what we're trying to do is flag dubious stylistic choices, we'll need to refine our logic. It turns out only a certain type of adverb is of interest to us.

p There are lots of ways we might do this, depending on just what words we want to flag. The simplest way to exclude adverbs like "back" and "not" is by word frequency: these words are much more common than the prototypical manner adverbs that the style guides are worried about.

p The #[code Lexeme.prob] and #[code Token.prob] attributes give a log probability estimate of the word:

+code.
    >>> nlp.vocab[u'back'].prob
    -7.403977394104004
    >>> nlp.vocab[u'not'].prob
    -5.407193660736084
    >>> nlp.vocab[u'quietly'].prob
    -11.07155704498291

p (The probability estimate is based on counts from a 3 billion word corpus, smoothed using the Simple Good-Turing method.)

p So we can easily exclude the N most frequent words in English from our adverb marker. Let's try N=1000 for now:

+code.
    >>> import spacy.en
    >>> from spacy.parts_of_speech import ADV
    >>> nlp = spacy.en.English()
    >>> # Find log probability of Nth most frequent word
    >>> probs = [lex.prob for lex in nlp.vocab]
    >>> probs.sort()
    >>> is_adverb = lambda tok: tok.pos == ADV and tok.prob < probs[-1000]
    >>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’")
    >>> print(u''.join(tok.string.upper() if is_adverb(tok) else tok.string for tok in tokens))
    ‘Give it back,’ he pleaded ABJECTLY, ‘it’s mine.’

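p The cutoff logic can be tried without loading a model. The sketch below is only an illustration: the three log probabilities quoted earlier are reused, the other two values are invented, and the helper names #[code frequency_cutoff] and #[code is_rare] are hypothetical rather than part of spaCy.

```python
# Toy log probabilities. The first three are the values quoted in the text;
# the last two are invented for illustration.
log_probs = {
    "not": -5.407193660736084,
    "back": -7.403977394104004,
    "quietly": -11.07155704498291,
    "the": -3.5,        # invented
    "abjectly": -17.0,  # invented
}


def frequency_cutoff(log_probs, n):
    """Return the log probability of the n-th most frequent word."""
    ranked = sorted(log_probs.values())  # ascending, so the rarest word is first
    return ranked[-n]


def is_rare(word, log_probs, cutoff):
    """True if the word is strictly less probable than the cutoff."""
    return log_probs[word] < cutoff


# With n=3 the cutoff is the probability of "back", the third most frequent word.
cutoff = frequency_cutoff(log_probs, 3)
rare = sorted(word for word in log_probs if is_rare(word, log_probs, cutoff))
print(rare)
```

p Because the comparison is strict, the word that sits exactly at the cutoff is treated as frequent, which matches the #[code tok.prob < probs[-1000]] test used with the real vocabulary.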
p There are lots of other ways we could refine the logic, depending on just what words we want to flag. Let's say we wanted to only flag adverbs that modified words similar to "pleaded". This is easy to do, as spaCy loads a vector-space representation for every word (by default, the vectors produced by Levy and Goldberg (2014)). Naturally, the vector is provided as a numpy array:

+code.
    >>> pleaded = tokens[7]
    >>> pleaded.vector.shape
    (300,)
    >>> pleaded.vector[:5]
    array([ 0.04229792, 0.07459262, 0.00820188, -0.02181299, 0.07519238], dtype=float32)

p We want to sort the words in our vocabulary by their similarity to "pleaded". There are lots of ways to measure the similarity of two vectors. We'll use the cosine metric:

+code.
    >>> from numpy import dot
    >>> from numpy.linalg import norm

    >>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
    >>> words = [w for w in nlp.vocab if w.has_vector]
    >>> words.sort(key=lambda w: cosine(w.vector, pleaded.vector))
    >>> words.reverse()
    >>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
    1-20 pleaded, pled, plead, confessed, interceded, pleads, testified, conspired, motioned, demurred, countersued, remonstrated, begged, apologised, consented, acquiesced, petitioned, quarreled, appealed, pleading
    >>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
    50-60 counselled, bragged, backtracked, caucused, refiled, dueled, mused, dissented, yearned, confesses
    >>> print('100-110', ', '.join(w.orth_ for w in words[100:110]))
    100-110 cabled, ducked, sentenced, perjured, absconded, bargained, overstayed, clerked, confided, sympathizes
    >>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
    1000-1010 scorned, baled, righted, requested, swindled, posited, firebombed, slimed, deferred, sagged
    >>> print('50000-50010', ', '.join(w.orth_ for w in words[50000:50010]))
    50000-50010 fb, ford, systems, puck, anglers, ik, tabloid, dirty, rims, artists

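p For readers who want to see the ranking step on its own, here is a minimal sketch that uses plain Python lists instead of numpy arrays. The three-dimensional vectors are invented for illustration; real spaCy vectors have 300 dimensions, as shown earlier.

```python
import math


def cosine(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


# Invented toy vectors, standing in for the entries of nlp.vocab.
toy_vectors = {
    "pleaded": [0.9, 0.1, 0.0],
    "begged": [0.8, 0.2, 0.1],
    "testified": [0.6, 0.6, 0.0],
    "tabloid": [0.0, 0.1, 0.9],
}

target = toy_vectors["pleaded"]
# Sort from most similar to least similar, like sort() followed by reverse().
ranked = sorted(toy_vectors, key=lambda w: cosine(toy_vectors[w], target), reverse=True)
print(ranked)
```

p Cosine ignores vector length and compares direction only, so a word is always its own nearest neighbour, which is why "pleaded" heads the list.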
p As you can see, the similarity model that these vectors give us is excellent: we're still getting meaningful results at 1000 words, off a single prototype! The only problem is that the list really contains two clusters of words: one associated with the legal meaning of "pleaded", and one for the more general sense. Sorting out these clusters is an area of active research.

p A simple work-around is to average the vectors of several words, and use that as our target:

+code.
    >>> say_verbs = ['pleaded', 'confessed', 'remonstrated', 'begged', 'bragged', 'confided', 'requested']
    >>> say_vector = sum(nlp.vocab[verb].vector for verb in say_verbs) / len(say_verbs)
    >>> words.sort(key=lambda w: cosine(w.vector, say_vector))
    >>> words.reverse()
    >>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
    1-20 bragged, remonstrated, enquired, demurred, sighed, mused, intimated, retorted, entreated, motioned, ranted, confided, countersued, gestured, implored, interceded, muttered, marvelled, bickered, despaired
    >>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
    50-60 flaunted, quarrelled, ingratiated, vouched, agonized, apologised, lunched, joked, chafed, schemed
    >>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
    1000-1010 hoarded, waded, ensnared, clamoring, abided, deploring, shriveled, endeared, rethought, berate

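p Why averaging helps can be shown with a toy case. In the sketch below the two-dimensional vectors are invented: the first axis loosely stands for the legal sense and the second for the speech sense. The helper #[code mean_vector] is hypothetical and not a spaCy function.

```python
import math


def cosine(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))


def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Invented vectors: axis 0 ~ "legal" sense, axis 1 ~ "speech" sense.
pleaded = [0.7, 0.7]    # ambiguous between both senses
begged = [0.1, 0.9]
bragged = [0.0, 1.0]
testified = [0.9, 0.2]  # mostly legal
muttered = [0.1, 0.95]  # mostly speech

say_vector = mean_vector([pleaded, begged, bragged])

# With the single prototype, the legal word is the closer match ...
single = cosine(pleaded, testified) > cosine(pleaded, muttered)
# ... while the averaged target prefers the speech word.
averaged = cosine(say_vector, muttered) > cosine(say_vector, testified)
print(single, averaged)
```

p The extra prototypes pull the target away from the sense they don't share, so the unwanted cluster drops down the ranking.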
p These definitely look like words that King might scold a writer for attaching adverbs to. Recall that our original adverb highlighting function looked like this:

+code.
    >>> import spacy.en
    >>> from spacy.parts_of_speech import ADV
    >>> # Load the pipeline, and call it with some text.
    >>> nlp = spacy.en.English()
    >>> tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
    ...              tag=True, parse=False)
    >>> print(''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens))
    ‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’

p We wanted to refine the logic so that only adverbs modifying evocative verbs of communication, like "pleaded", were highlighted. We've now built a vector that represents that type of word, so we can highlight adverbs based on subtle logic, homing in on adverbs that seem the most stylistically problematic, given our starting assumptions:

+code.
    >>> import numpy
    >>> from numpy import dot
    >>> from numpy.linalg import norm
    >>> import spacy.en
    >>> from spacy.parts_of_speech import ADV, VERB
    >>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
    >>> def is_bad_adverb(token, target_verb, tol):
    ...     if token.pos != ADV:
    ...         return False
    ...     elif token.head.pos != VERB:
    ...         return False
    ...     elif cosine(token.head.vector, target_verb) < tol:
    ...         return False
    ...     else:
    ...         return True

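p To watch the three checks fire without loading a model, the sketch below swaps in stand-in objects. #[code FakeToken], the string tags and the two-dimensional vectors are all assumptions made for illustration; a real spaCy token exposes #[code pos], #[code head] and #[code vector] in the same way, but with integer tag IDs and numpy arrays.

```python
import math
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: only the attributes the checks read.
FakeToken = namedtuple("FakeToken", ["text", "pos", "vector", "head"])
ADV, VERB = "ADV", "VERB"  # stand-ins for the spacy.parts_of_speech constants


def cosine(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))


def is_bad_adverb(token, target_vector, tol):
    """Flag adverbs whose head is a verb similar enough to target_vector."""
    if token.pos != ADV:
        return False
    if token.head.pos != VERB:
        return False
    return cosine(token.head.vector, target_vector) >= tol


say_vector = [0.0, 1.0]  # invented "verb of communication" direction
pleaded = FakeToken("pleaded", VERB, [0.1, 0.9], None)
ran = FakeToken("ran", VERB, [1.0, 0.0], None)
abjectly = FakeToken("abjectly", ADV, [0.5, 0.5], pleaded)
quickly = FakeToken("quickly", ADV, [0.5, 0.5], ran)

flagged = [tok.text for tok in (pleaded, abjectly, quickly)
           if is_bad_adverb(tok, say_vector, 0.8)]
print(flagged)
```

p Only "abjectly" is flagged: "pleaded" is not an adverb, and "quickly" attaches to a verb that points away from the target vector. The tolerance of 0.8 is arbitrary and would need tuning against real vectors.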
p This example was somewhat contrived, and, truth be told, I've never really bought the idea that adverbs were a grave stylistic sin. But hopefully it got the message across: state-of-the-art NLP technologies are very powerful. spaCy gives you easy and efficient access to them, which lets you build all sorts of useful products and features that were previously impossible.

@ -1,82 +0,0 @@
include ../../_includes/_mixins

p.u-text-large This tutorial describes how to train new statistical models for spaCy's part-of-speech tagger, named entity recognizer and dependency parser.

p I'll start with some quick code examples that show how to train each model. I'll then provide a bit of background about the algorithms, and explain how the data and feature templates work.

+h(2, "train-pos-tagger") Training the part-of-speech tagger

+code('python', 'Simple Example').
    from spacy.vocab import Vocab
    from spacy.pipeline import Tagger
    from spacy.tokens import Doc

    vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
    tagger = Tagger(vocab)

    doc = Doc(vocab, words=['I', 'like', 'stuff'])
    tagger.update(doc, ['N', 'V', 'N'])

    tagger.model.end_training()

p #[+a("https://github.com/" + SOCIAL.github + "/spaCy/blob/master/examples/training/train_tagger.py") Full example]

+h(2, "train-entity") Training the named entity recognizer

+code('python', 'Simple Example').
    from spacy.vocab import Vocab
    from spacy.pipeline import EntityRecognizer
    from spacy.tokens import Doc
    from spacy.gold import GoldParse

    vocab = Vocab()
    entity = EntityRecognizer(vocab, entity_types=['PERSON', 'LOC'])

    doc = Doc(vocab, words=['Who', 'is', 'Shaka', 'Khan', '?'])
    entity.update(doc, GoldParse(doc, entities=['O', 'O', 'B-PERSON', 'L-PERSON', 'O']))

    entity.model.end_training()

p #[+a("https://github.com/" + SOCIAL.github + "/spaCy/blob/master/examples/training/train_ner.py") Full example]

+h(2, "train-parser") Training the dependency parser

+code('python', 'Simple Example').
    from spacy.vocab import Vocab
    from spacy.pipeline import DependencyParser
    from spacy.tokens import Doc
    from spacy.gold import GoldParse

    vocab = Vocab()
    parser = DependencyParser(vocab, labels=['nsubj', 'compound', 'dobj', 'punct'])

    doc = Doc(vocab, words=['Who', 'is', 'Shaka', 'Khan', '?'])
    parser.update(doc, GoldParse(doc, heads=[1, 1, 3, 1, 1], deps=['nsubj', 'ROOT', 'compound', 'dobj', 'punct']))

    parser.model.end_training()

p #[+a("https://github.com/" + SOCIAL.github + "/spaCy/blob/master/examples/training/train_parser.py") Full example]

+h(2, 'feature-templates') Customising the feature extraction

p spaCy currently uses linear models for the tagger, parser and entity recognizer, with weights learned using the #[+a("https://explosion.ai/blog/part-of-speech-pos-tagger-in-python") Averaged Perceptron algorithm].

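p As a rough illustration of the idea, and not of spaCy's or thinc's actual implementation, here is a toy multi-class averaged perceptron. The class and its feature strings are assumptions made for this sketch; the method name #[code end_training] simply mirrors the call used in the examples above. Averaging is done the naive way, by summing every weight after every step.

```python
class AveragedPerceptron(object):
    """Toy multi-class perceptron with weight averaging (illustrative only)."""

    def __init__(self, classes):
        self.classes = list(classes)
        self.weights = {}  # (feature, class) -> current weight
        self.totals = {}   # (feature, class) -> sum of the weight over all steps
        self.steps = 0

    def predict(self, features):
        def score(cls):
            return sum(self.weights.get((feat, cls), 0.0) for feat in features)
        # max() keeps the first class in self.classes when scores tie.
        return max(self.classes, key=score)

    def update(self, features, gold):
        guess = self.predict(features)
        if guess != gold:
            # Standard perceptron step: reward the gold class, penalise the guess.
            for feat in features:
                self.weights[(feat, gold)] = self.weights.get((feat, gold), 0.0) + 1.0
                self.weights[(feat, guess)] = self.weights.get((feat, guess), 0.0) - 1.0
        # Accumulate every weight after each step so it can be averaged later.
        self.steps += 1
        for key, value in self.weights.items():
            self.totals[key] = self.totals.get(key, 0.0) + value

    def end_training(self):
        """Replace each weight with its average over all training steps."""
        self.weights = {key: total / self.steps for key, total in self.totals.items()}


examples = [(["word=I"], "N"), (["word=like"], "V"), (["word=stuff"], "N")]
model = AveragedPerceptron(["N", "V"])
for _ in range(2):  # two passes over the toy data
    for features, gold in examples:
        model.update(features, gold)
model.end_training()

predictions = [model.predict(features) for features, _ in examples]
print(predictions)
```

p Averaging damps the influence of late updates, which tends to make the final weights less dependent on the order of the training examples. Practical implementations track the totals lazily instead of touching every weight at every step.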
p Because it's a linear model, it's important for accuracy to build conjunction features out of the atomic predictors. Let's say you have two atomic predictors asking, "What is the part-of-speech of the previous token?", and "What is the part-of-speech of the token before that?". These predictors will introduce a number of features, e.g. "Prev-pos=NN", "Prev-prev-pos=VBZ", etc. A conjunction template introduces features that combine the two, such as "Prev-pos=NN&Prev-prev-pos=VBZ".

p The feature extraction proceeds in two passes. In the first pass, we fill an array with the values of all of the atomic predictors. In the second pass, we iterate over the feature templates, and fill a small temporary array with the predictors that will be combined into a conjunction feature. Finally, we hash this array into a 64-bit integer, using the MurmurHash algorithm. You can see this at work in the #[+a("https://github.com/" + SOCIAL.github + "/thinc/blob/94dbe06fd3c8f24d86ab0f5c7984e52dbfcdc6cb/thinc/linear/features.pyx") thinc.linear.features] module.

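p The two passes can be sketched in pure Python. Everything here is an assumption made for illustration: the slot names, the #[code intern] helper, and in particular the hash. The standard library has no MurmurHash, so the sketch uses #[code hashlib.blake2b] with an 8-byte digest as a stand-in for a 64-bit hash, and it mixes the template index into the hash so that different templates over equal values don't collide.

```python
import hashlib
import struct

# Hypothetical atomic predictor slots (not spaCy's real constants).
P2_POS, P1_POS, W_ORTH = 0, 1, 2

# Each template is a tuple of slots; more than one slot makes a conjunction.
TEMPLATES = [(P1_POS,), (P2_POS,), (P1_POS, P2_POS), (W_ORTH,)]


def intern(strings, value):
    """Map a string to a stable integer ID; 0 is reserved for 'missing'."""
    return strings.setdefault(value, len(strings) + 1)


def fill_atoms(words, prev_tags, i, strings):
    """First pass: fill an array with the value of every atomic predictor."""
    atoms = [0, 0, 0]
    if i >= 2:
        atoms[P2_POS] = intern(strings, prev_tags[i - 2])
    if i >= 1:
        atoms[P1_POS] = intern(strings, prev_tags[i - 1])
    atoms[W_ORTH] = intern(strings, words[i])
    return atoms


def hash_feature(template_index, values):
    """Hash a template index plus its values into one 64-bit integer."""
    payload = struct.pack("<%dQ" % (len(values) + 1), template_index, *values)
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return struct.unpack("<Q", digest)[0]


def extract_features(atoms, templates):
    """Second pass: gather each template's atoms and hash them together."""
    features = []
    for template_index, template in enumerate(templates):
        values = [atoms[slot] for slot in template]
        features.append(hash_feature(template_index, values))
    return features


strings = {}
words = ["She", "quickly", "ran"]
# Features for "ran", given the tags already assigned to the two previous words.
features_a = extract_features(fill_atoms(words, ["PRP", "RB"], 2, strings), TEMPLATES)
# Same context except for the tag two positions back.
features_b = extract_features(fill_atoms(words, ["NN", "RB"], 2, strings), TEMPLATES)
print(len(features_a))
```

p Changing only the tag two positions back leaves the single-slot feature for the previous tag untouched, but changes both the single-slot feature for that position and the conjunction that includes it.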
p It's very easy to change the feature templates, to create novel combinations of the existing atomic predictors. There's currently no API available to add new atomic predictors, though. You'll have to create a subclass of the model, and write your own #[code set_featuresC] method.

p The feature templates are passed in using the #[code features] keyword argument to the constructors of the Tagger, DependencyParser and EntityRecognizer:

+code('python', 'custom tagger templates').
    from spacy.vocab import Vocab
    from spacy.pipeline import Tagger
    from spacy.tagger import P2_orth, P1_orth
    from spacy.tagger import P2_cluster, P1_cluster, W_orth, N1_orth, N2_orth

    vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
    tagger = Tagger(vocab, features=[(P2_orth, P2_cluster), (P1_orth, P1_cluster),
                                     (P2_orth,), (P1_orth,), (W_orth,),
                                     (N1_orth,), (N2_orth,)])

p Custom feature templates can be passed to the DependencyParser and EntityRecognizer as well, also using the #[code features] keyword argument of the constructor.