* Work on website

Matthew Honnibal 2015-08-14 22:03:19 +02:00
parent c9b19a9c00
commit 6cc9e7881b


@@ -0,0 +1,492 @@
extends ./template_post.jade
block body_block
- var urls = {}
- urls.share_twitter = "http://twitter.com/share?text=[ARTICLE HEADLINE]&url=[ARTICLE LINK]&via=honnibal"
article.post
header
h2 A good Part-of-Speech tagger in about 200 lines of Python
.subhead
| by
a(href="#" rel="author") Matthew Honnibal
| on
time(datetime='2013-09-11') October 11, 2013
p.
Up-to-date knowledge about natural language processing is mostly locked away
in academia. And academics are mostly pretty self-conscious when we write.
We're careful. We don't want to stick our necks out too much. But under-confident
recommendations suck, so here's how to write a good part-of-speech tagger.
p.
There are a tonne of “best known techniques” for POS tagging, and you should
ignore the others and just use Averaged Perceptron.
p.
You should use two tags of history, and features derived from the Brown word
clusters distributed here.
p.
If you only need the tagger to work on carefully edited text, you should
use case-sensitive features, but if you want a more robust tagger you
should avoid them because they'll make you over-fit to the conventions
of your training domain. Instead, features that ask “how frequently is
this word title-cased, in a large sample from the web?” work well. Then
you can lower-case your comparatively tiny training corpus.
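p.
For instance, a feature along these lines might bucket each word's title-case ratio,
computed from a big unlabelled sample. The count tables and bin boundaries below are
illustrative assumptions, just to make the idea concrete:
pre.language-python
code
| def titlecase_feature(word, titlecase_counts, total_counts):
|     '''Illustrative sketch: bucket how often this word appears title-cased
|     in a large web sample (both count tables are assumed to be precomputed).'''
|     key = word.lower()
|     n = total_counts.get(key, 0)
|     if n == 0:
|         return 'titlecase:unseen'
|     ratio = titlecase_counts.get(key, 0) / float(n)
|     # Coarse bins suit a linear model better than the raw ratio
|     if ratio > 0.9:
|         return 'titlecase:high'
|     if ratio > 0.5:
|         return 'titlecase:mid'
|     return 'titlecase:low'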
p.
For efficiency, you should figure out which frequent words in your training
data have unambiguous tags, so you don't have to do anything but output
their tags when they come up. About 50% of the words can be tagged that way.
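p.
Here's a minimal sketch of how such a table could be built from a tagged corpus.
The function name and thresholds are illustrative rather than taken from the tagger
described below, but the idea is just counting: keep a word only if it's frequent
and almost always gets the same tag.
pre.language-python
code
| from collections import defaultdict
|
| def build_tag_dictionary(tagged_sentences, freq_thresh=20, ambiguity_thresh=0.97):
|     '''Map frequent, (nearly) unambiguous words straight to their tag.'''
|     counts = defaultdict(lambda: defaultdict(int))
|     for words, tags in tagged_sentences:
|         for word, tag in zip(words, tags):
|             counts[word][tag] += 1
|     tagdict = {}
|     for word, tag_freqs in counts.items():
|         tag, mode = max(tag_freqs.items(), key=lambda item: item[1])
|         n = sum(tag_freqs.values())
|         # Only trust words we've seen often, with one overwhelmingly dominant tag
|         if n >= freq_thresh and (mode / float(n)) >= ambiguity_thresh:
|             tagdict[word] = tag
|     return tagdict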
p.
And unless you really, really can't do without an extra 0.1% of accuracy,
you probably shouldn't bother with any kind of search strategy; you should
just use a greedy model.
p.
If you do all that, you'll find your tagger easy to write and understand,
and an efficient Cython implementation will perform as follows on the standard
evaluation, 130,000 words of text from the Wall Street Journal:
table
thead
tr
th Tagger
th Accuracy
th Time (130k words)
tbody
tr
td CyGreedyAP
td 97.1%
td 4s
p.
The 4s includes initialisation time — the actual per-token speed is high
enough to be irrelevant; it won't be your bottleneck.
p.
It's tempting to look at 97% accuracy and say something similar about accuracy,
that it's high enough not to worry about. But that's not true. My parser is
about 1% more accurate if the input has hand-labelled POS tags, and the taggers
all perform much worse on out-of-domain data. Unfortunately accuracies have
been fairly flat for the last ten years. That's why my recommendation is to
just use a simple and fast tagger that's roughly as good.
p.
The thing is though, it's very common to see people using taggers that
aren't anywhere near that good! For an example of what a non-expert is
likely to use, these were the two taggers wrapped by TextBlob, a new Python
API that I think is quite neat:
table
thead
tr
th Tagger
th Accuracy
th Time (130k words)
tbody
tr
td NLTK
td 94.0%
td 3m56s
tr
td Pattern
td 93.5%
td 26s
p.
Both Pattern and NLTK are very robust and beautifully well documented, so
the appeal of using them is obvious. But Pattern's algorithms are pretty
crappy, and NLTK carries tremendous baggage around in its implementation
because of its massive framework and its double duty as a teaching tool.
p.
As a stand-alone tagger, my Cython implementation is needlessly complicated
– it was written for my parser. So today I wrote a 200-line version
of my recommended algorithm for TextBlob. It gets:
table
thead
tr
th Tagger
th Accuracy
th Time (130k words)
tbody
tr
td PyGreedyAP
td 96.8%
td 12s
p.
I traded some accuracy and a lot of efficiency to keep the implementation
simple. Here's a far-too-brief description of how it works.
h3 Averaged perceptron
p.
POS tagging is a “supervised learning problem”. You're given a table of data,
and you're told that the values in the last column will be missing during
run-time. You have to find correlations from the other columns to predict
that value.
p.
So for us, the missing column will be “part of speech at word i”. The predictor
columns (features) will be things like “part of speech at word i-1”, “last three
letters of word at i+1”, etc.
p.
First, here's what prediction looks like at run-time:
pre.language-python
code
| def predict(self, features):
|     '''Dot-product the features and current weights and return the best class.'''
|     scores = defaultdict(float)
|     for feat in features:
|         if feat not in self.weights:
|             continue
|         weights = self.weights[feat]
|         for clas, weight in weights.items():
|             scores[clas] += weight
|     # Do a secondary alphabetic sort, for stability
|     return max(self.classes, key=lambda clas: (scores[clas], clas))
p.
Earlier I described the learning problem as a table, with one of the columns
marked as missing-at-runtime. For NLP, our tables are always exceedingly
sparse. You have columns like “word i-1=Parliament”, which is almost always
0. So our “weight vectors” can pretty much never be implemented as vectors.
Map-types are good though — here we use dictionaries.
p.
The input data, features, is a set with a member for every non-zero “column”
in our “table” – every active feature. Usually this is actually a dictionary,
to let you set values for the features. But here all my features are binary
present-or-absent type deals.
p.
The weights data structure is a dictionary of dictionaries that ultimately
associates feature/class pairs with some weight. You want to structure it
this way, instead of the reverse, because of the way word frequencies are
distributed: most words are rare, but frequent words are very frequent.
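p.
Concretely, the outer keys are feature strings and each inner dictionary maps tags
(classes) to weights. The feature names below follow the '+'-joined format used by
the feature extractor later in the post; the weight values are invented, purely for
illustration:
pre.language-python
code
| # Illustrative only: a tiny slice of what self.weights might hold.
| # Outer keys are features, inner keys are POS tags (classes).
| weights = {
|     'i suffix+ing': {'VBG': 4.2, 'NN': 1.1, 'JJ': -0.3},
|     'i-1 tag+DT': {'NN': 2.7, 'JJ': 1.9, 'VBD': -2.0},
|     'i word+parliament': {'NNP': 3.1, 'NN': 0.4},
| }
| # Scoring a token: for each active feature, add its inner
| # dictionary's per-class weights into the running class scores.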
h3 Learning the weights
p.
Okay, so how do we get the values for the weights? We start with an empty
weights dictionary, and iteratively do the following:
ol
li Receive a new (features, POS-tag) pair
li Guess the value of the POS tag given the current “weights” for the features
li If the guess is wrong, add +1 to the weights associated with the correct class for these features, and -1 to the weights for the predicted class.
p.
It's one of the simplest learning algorithms. Whenever you make a mistake,
increment the weights for the correct class, and penalise the weights that
led to your false prediction. In code:
pre.language-python
code
| def train(self, nr_iter, examples):
|     for i in range(nr_iter):
|         for features, true_tag in examples:
|             guess = self.predict(features)
|             if guess != true_tag:
|                 for f in features:
|                     self.weights[f][true_tag] += 1
|                     self.weights[f][guess] -= 1
|         random.shuffle(examples)
p.
If you iterate over the same example this way, the weights for the correct
class would have to come out ahead, and you'd get the example right. If
you think about what happens with two examples, you should be able to
see that it will get them both right unless the features are identical.
In general the algorithm will converge so long as the examples are
linearly separable, although that doesn't matter for our purpose.
h3 Averaging the weights
p.
We need to do one more thing to make the perceptron algorithm competitive.
The problem with the algorithm so far is that if you train it twice on
slightly different sets of examples, you end up with really different models.
It doesn't generalise that smartly. And the problem is really in the later
iterations — if you let it run to convergence, it'll pay lots of attention
to the few examples it's getting wrong, and mutate its whole model around
them.
p.
So, what we're going to do is make the weights more "sticky" – give
the model less chance to ruin all its hard work in the later rounds. And
we're going to do that by returning the averaged weights, not the final
weights.
p.
I doubt there are many people who are convinced that's the most obvious
solution to the problem, but whatever. We're not here to innovate, and this
way is time tested on lots of problems. If you have another idea, run the
experiments and tell us what you find. Actually I'd love to see more work
on this, now that the averaged perceptron has become such a prominent learning
algorithm in NLP.
p.
Okay. So this averaging. How's that going to work? Note that we don't want
to just average after each outer-loop iteration. We want the average of all
the values — from the inner loop. So if we have 5,000 examples, and we train
for 10 iterations, we'll average across 50,000 values for each weight.
p.
Obviously we're not going to store all those intermediate values. Instead,
we'll track an accumulator for each weight, and divide it by the number of
iterations at the end. Again: we want the average weight assigned to a
feature/class pair during learning, so the key component we need is the total
weight it was assigned. But we also want to be careful about how we compute
that accumulator, too. On almost any instance, we're going to see a tiny
fraction of active feature/class pairs. All the other feature/class weights
won't change. So we shouldn't have to go back and add the unchanged value
to our accumulators anyway, like chumps.
p.
Since we're not chumps, we'll make the obvious improvement. We'll maintain
another dictionary that tracks how long each weight has gone unchanged. Now
when we do change a weight, we can do a fast-forwarded update to the accumulator,
for all those iterations where it lay unchanged.
p.
Here's what a weight update looks like now that we have to maintain the
totals and the time-stamps:
pre.language-python
code
| def update(self, truth, guess, features):
|     def upd_feat(c, f, v):
|         nr_iters_at_this_weight = self.i - self._timestamps[f][c]
|         self._totals[f][c] += nr_iters_at_this_weight * self.weights[f][c]
|         self.weights[f][c] += v
|         self._timestamps[f][c] = self.i
|     self.i += 1
|     for f in features:
|         upd_feat(truth, f, 1.0)
|         upd_feat(guess, f, -1.0)
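p.
The averaging step itself never appears in these snippets, so here's a sketch of
what it has to do, written against the same _totals and _timestamps structures as
the update above: fast-forward each accumulator one last time, then divide by the
number of updates seen (a reconstruction of the behaviour, not the literal code):
pre.language-python
code
| def average_weights(self):
|     '''Sketch: replace each weight with its average value over training.'''
|     for feat, weights in self.weights.items():
|         averaged = {}
|         for clas, weight in weights.items():
|             # Fast-forward the accumulator over the updates since this
|             # weight last changed, exactly as upd_feat does.
|             total = self._totals[feat][clas]
|             total += (self.i - self._timestamps[feat][clas]) * weight
|             avg = total / float(self.i)
|             if avg:
|                 averaged[clas] = avg
|         self.weights[feat] = averaged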
h3 Features and pre-processing
p.
The POS tagging literature has tonnes of intricate features sensitive to
case, punctuation, etc. They help on the standard test-set, which is from
Wall Street Journal articles from the 1980s, but I don't see how they'll
help us learn models that are useful on other text.
p.
To help us learn a more general model, we'll pre-process the data prior
to feature extraction, as follows (a code sketch of this step follows the list):
ul
li All words are lower cased;
li Digits in the range 1800-2100 are represented as !YEAR;
li Other digit strings are represented as !DIGITS;
li
| It would be better to have a module recognising dates, phone numbers,
| emails, hash-tags, etc., but that will have to be pushed back into the
| tokenization.
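p.
Here's a sketch of a _normalize helper implementing those rules; the training loop
further down calls self._normalize, but treat this as a reconstruction of the
described behaviour rather than the literal code:
pre.language-python
code
| def _normalize(self, word):
|     '''Apply the pre-processing rules listed above.'''
|     if word.isdigit():
|         if 1800 <= int(word) <= 2100:
|             return '!YEAR'
|         return '!DIGITS'
|     return word.lower()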
p.
I played around with the features a little, and this seems to be a reasonable
bang-for-buck configuration in terms of getting the development-data accuracy
to 97% (where it typically converges anyway), and having a smaller memory
foot-print:
pre.language-python
code
| def _get_features(self, i, word, context, prev, prev2):
|     '''Map tokens-in-contexts into a feature representation, implemented as a
|     set. If the features change, a new model must be trained.'''
|     def add(name, *args):
|         features.add('+'.join((name,) + tuple(args)))
|     # The training loop passes i as an index into the raw word list, while
|     # context is padded with two start-of-sentence placeholders, so shift
|     # the index to line up with the padded context.
|     i += 2
|     features = set()
|     add('bias')  # This acts sort of like a prior
|     add('i suffix', word[-3:])
|     add('i pref1', word[0])
|     add('i-1 tag', prev)
|     add('i-2 tag', prev2)
|     add('i tag+i-2 tag', prev, prev2)
|     add('i word', context[i])
|     add('i-1 tag+i word', prev, context[i])
|     add('i-1 word', context[i-1])
|     add('i-1 suffix', context[i-1][-3:])
|     add('i-2 word', context[i-2])
|     add('i+1 word', context[i+1])
|     add('i+1 suffix', context[i+1][-3:])
|     add('i+2 word', context[i+2])
|     return features
p.
I haven't added any features from external data, such as case frequency
statistics from the Google Web 1T corpus. I might add those later, but for
now I figured I'd keep things simple.
h3 What about search?
p.
The model I've recommended commits to its predictions on each word, and
moves on to the next one. Those predictions are then used as features for
the next word. There's a potential problem here, but it turns out it doesn't
matter much. It's easy to fix with beam-search, but I say it's not really
worth bothering. And it definitely doesn't matter enough to adopt a slow
and complicated algorithm like Conditional Random Fields.
p.
Here's the problem. The best indicator for the tag at position, say, 3 in
a sentence is the word at position 3. But the next-best indicators are the
tags at positions 2 and 4. So there's a chicken-and-egg problem: we want
the predictions for the surrounding words in hand before we commit to a
prediction for the current word. Here's an example where search might matter:
p.example.
Their management plan reforms worked
p.
Depending on just what you've learned from your training data, you can
imagine making a different decision if you started at the left and moved
right, conditioning on your previous decisions, than if you'd started at
the right and moved left.
p.
If that's not obvious to you, think about it this way: “worked” is almost
surely a verb, so if you tag “reforms” with that in hand, you'll have a
different idea of its tag than if you'd just come from “plan”, which you
might have regarded as either a noun or a verb.
p.
Search can only help you when you make a mistake. It can prevent that error
from throwing off your subsequent decisions, or sometimes your future choices
will correct the mistake. And that's why for POS tagging, search hardly matters!
Your model is so good straight-up that your past predictions are almost always
true. So you really need the planets to align for search to matter at all.
p.
And as we improve our taggers, search will matter less and less. Instead
of search, what we should be caring about is multi-tagging. If we let the
model be a bit uncertain, we can get over 99% accuracy assigning an average
of 1.05 tags per word (Vadas et al., ACL 2006). The averaged perceptron is
rubbish at multi-tagging though. That's its big weakness. You really want
a probability distribution for that.
p.
One caveat when doing greedy search, though. It's very important that your
training data model the fact that the history will be imperfect at run-time.
Otherwise, it will be way over-reliant on the tag-history features. Because
the Perceptron is iterative, this is very easy.
p.
Here's the training loop for the tagger:
pre.language-python
code
| def train(self, sentences, save_loc=None, nr_iter=5, quiet=False):
|     '''Train a model from sentences, and save it at save_loc. nr_iter
|     controls the number of Perceptron training iterations.'''
|     self._make_tagdict(sentences, quiet=quiet)
|     self.model.classes = self.classes
|     prev, prev2 = START
|     for iter_ in range(nr_iter):
|         c = 0; n = 0
|         for words, tags in sentences:
|             context = START + [self._normalize(w) for w in words] + END
|             for i, word in enumerate(words):
|                 guess = self.tagdict.get(word)
|                 if not guess:
|                     feats = self._get_features(
|                         i, word, context, prev, prev2)
|                     guess = self.model.predict(feats)
|                     self.model.update(tags[i], guess, feats)
|                 # Set the history features from the guesses, not the
|                 # true tags
|                 prev2 = prev; prev = guess
|                 c += guess == tags[i]; n += 1
|         random.shuffle(sentences)
|         if not quiet:
|             print("Iter %d: %d/%d=%.3f" % (iter_, c, n, _pc(c, n)))
|     self.model.average_weights()
|     # Pickle as a binary file
|     if save_loc is not None:
|         cPickle.dump((self.model.weights, self.tagdict, self.classes),
|                      open(save_loc, 'wb'), -1)
p.
Unlike the previous snippets, this one's literal – I tended to edit the
previous ones to simplify. So if they have bugs, hopefully that's why!
p.
At the time of writing, I'm just finishing up the implementation before I
submit a pull request to TextBlob. You can see the rest of the source here:
ul
li
a(href="https://github.com/sloria/textblob-aptagger/blob/master/textblob_aptagger/taggers.py") taggers.py
li
a(href="https://github.com/sloria/textblob-aptagger/blob/master/textblob_aptagger/_perceptron.py") _perceptron.py
h3 A final comparison…
p.
Over the years I've seen a lot of cynicism about the WSJ evaluation methodology.
The claim is that we've just been meticulously over-fitting our methods to this
data. Actually the evidence doesn't really bear this out. Mostly, if a technique
is clearly better on one evaluation, it improves others as well. Still, it's
very reasonable to want to know how these tools perform on other text. So I
ran the unchanged models over two other sections from the OntoNotes corpus:
table
thead
tr
th Tagger
th WSJ
th ABC
th Web
tbody
tr
td Pattern
td 93.5
td 90.7
td 88.1
tr
td NLTK
td 94.0
td 91.5
td 88.4
tr
td PyGreedyAP
td 96.8
td 94.8
td 91.8
p.
The ABC section is broadcast news, and Web is text from the web (blogs etc. — I
haven't looked at the data much).
p.
As you can see, the order of the systems is stable across the three comparisons,
and the advantage of our Averaged Perceptron tagger over the other two is real
enough. Actually the Pattern tagger does very poorly on out-of-domain text.
It mostly just looks up the words, so it's very domain dependent. I hadn't
realised it before, but it's obvious enough now that I think about it.
p.
We can improve our score greatly by training on some of the foreign data.
The technique described in this paper (Daumé III, 2007) is the first thing
I try when I have to do that.
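p.
That technique, from the paper "Frustratingly Easy Domain Adaptation", boils down
to feature augmentation: every feature keeps a shared copy and gains a domain-tagged
copy, so the model can learn which cues transfer across domains and which are
domain-specific. A sketch of the idea, assuming you know each training sentence's
domain (this helper is not part of the tagger above):
pre.language-python
code
| def augment(features, domain):
|     '''Daume III (2007)-style augmentation: keep the shared features and
|     add a copy of each one marked with the source domain.'''
|     return features | set('%s::%s' % (domain, f) for f in features)
|
| # e.g. augment({'bias', 'i suffix+ing'}, 'web')
| # -> {'bias', 'i suffix+ing', 'web::bias', 'web::i suffix+ing'}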
footer.meta(role='contentinfo')
a.button.button-twitter(href=urls.share_twitter, title='Share on Twitter', rel='nofollow') Share on Twitter
.discuss
a.button.button-hn(href='#', title='Discuss on Hacker News', rel='nofollow') Discuss on Hacker News
|
a.button.button-reddit(href='#', title='Discuss on Reddit', rel='nofollow') Discuss on Reddit