mirror of https://github.com/explosion/spaCy.git
synced 2025-03-26 12:54:12 +03:00

* Work on website

This commit is contained in:
parent c9b19a9c00
commit 6cc9e7881b

docs/redesign/blog_tagger.jade (new file, 492 lines)
@@ -0,0 +1,492 @@
extends ./template_post.jade

block body_block
  - var urls = {}
  - urls.share_twitter = "http://twitter.com/share?text=[ARTICLE HEADLINE]&url=[ARTICLE LINK]&via=honnibal"

  article.post
    header
      h2 A good Part-of-Speech tagger in about 200 lines of Python
      .subhead
        | by
        a(href="#" rel="author") Matthew Honnibal
        | on
        time(datetime='2013-09-11') September 11, 2013

    p.
      Up-to-date knowledge about natural language processing is mostly locked away
      in academia. And academics are mostly pretty self-conscious when we write.
      We’re careful. We don’t want to stick our necks out too much. But under-confident
      recommendations suck, so here’s how to write a good part-of-speech tagger.

    p.
      There are a tonne of “best known techniques” for POS tagging, and you should
      ignore the others and just use Averaged Perceptron.

    p.
      You should use two tags of history, and features derived from the Brown word
      clusters distributed here.

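    p.
      For illustration (the post doesn’t show this part), one common way to use Brown
      clusters is to add prefixes of a word’s bit-string cluster ID as extra features,
      so the model can back off from rare words to their clusters. The dict name
      clusters and the helper below are hypothetical, not code from the post:

    pre.language-python
      code
        | def add_cluster_features(features, word, clusters, prefix_lengths=(4, 6, 10)):
        |     '''Sketch: add Brown-cluster bit-string prefix features for word.
        |     clusters is an assumed dict mapping words to bit-strings like "1100101".'''
        |     bits = clusters.get(word)
        |     if bits is None:
        |         return
        |     for n in prefix_lengths:
        |         # Short prefixes give coarse clusters, longer ones fine-grained clusters
        |         features.add('cluster-%d %s' % (n, bits[:n]))
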
    p.
      If you only need the tagger to work on carefully edited text, you should
      use case-sensitive features, but if you want a more robust tagger you
      should avoid them because they’ll make you over-fit to the conventions
      of your training domain. Instead, features that ask “how frequently is
      this word title-cased, in a large sample from the web?” work well. Then
      you can lower-case your comparatively tiny training corpus.

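    p.
      As a rough sketch of that idea (mine, not the post’s): assuming you have
      pre-computed, from a large web sample, how often each word appears title-cased,
      you can bin that ratio into a handful of buckets and use the bucket as a
      feature. The table name title_case_ratio below is hypothetical.

    pre.language-python
      code
        | def case_feature(word, title_case_ratio):
        |     '''Bucket "how often is this word title-cased on the web?" into a feature
        |     string. title_case_ratio maps lower-cased words to a ratio in [0, 1].'''
        |     ratio = title_case_ratio.get(word.lower(), 0.0)
        |     # Coarse bins are enough; the exact thresholds are a judgement call
        |     if ratio >= 0.9:
        |         return 'titlecase:high'
        |     elif ratio >= 0.5:
        |         return 'titlecase:mid'
        |     return 'titlecase:low'
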
    p.
      For efficiency, you should figure out which frequent words in your training
      data have unambiguous tags, so you don’t have to do anything but output
      their tags when they come up. About 50% of the words can be tagged that way.

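    p.
      A minimal sketch of building that lookup table from tagged sentences. The
      frequency and ambiguity thresholds below are illustrative choices, not values
      taken from the post:

    pre.language-python
      code
        | from collections import defaultdict
        |
        | def make_tagdict(sentences, freq_thresh=20, ambiguity_thresh=0.97):
        |     '''Map frequent, (almost) unambiguous words directly to their tag.
        |     sentences is a list of (words, tags) pairs.'''
        |     counts = defaultdict(lambda: defaultdict(int))
        |     for words, tags in sentences:
        |         for word, tag in zip(words, tags):
        |             counts[word][tag] += 1
        |     tagdict = {}
        |     for word, tag_freqs in counts.items():
        |         tag, mode = max(tag_freqs.items(), key=lambda item: item[1])
        |         n = sum(tag_freqs.values())
        |         # Only trust words seen often, with one overwhelmingly common tag
        |         if n >= freq_thresh and float(mode) / n >= ambiguity_thresh:
        |             tagdict[word] = tag
        |     return tagdict
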
    p.
      And unless you really, really can’t do without an extra 0.1% of accuracy,
      you probably shouldn’t bother with any kind of search strategy; you should
      just use a greedy model.

    p.
      If you do all that, you’ll find your tagger easy to write and understand,
      and an efficient Cython implementation will perform as follows on the standard
      evaluation, 130,000 words of text from the Wall Street Journal:

    table
      thead
        tr
          th Tagger
          th Accuracy
          th Time (130k words)
      tbody
        tr
          td CyGreedyAP
          td 97.1%
          td 4s

    p.
      The 4s includes initialisation time — the actual per-token speed is high
      enough to be irrelevant; it won’t be your bottleneck.

    p.
      It’s tempting to look at 97% accuracy and say something similar (that it’s
      high enough not to worry about), but that’s not true. My parser is about 1%
      more accurate if the input has hand-labelled POS tags, and the taggers all
      perform much worse on out-of-domain data. Unfortunately, accuracies have been
      fairly flat for the last ten years. That’s why my recommendation is to just
      use a simple and fast tagger that’s roughly as good.

    p.
      The thing is though, it’s very common to see people using taggers that
      aren’t anywhere near that good! For an example of what a non-expert is
      likely to use, these were the two taggers wrapped by TextBlob, a new Python
      API that I think is quite neat:

    table
      thead
        tr
          th Tagger
          th Accuracy
          th Time (130k words)
      tbody
        tr
          td NLTK
          td 94.0%
          td 3m56s
        tr
          td Pattern
          td 93.5%
          td 26s

    p.
      Both Pattern and NLTK are very robust and beautifully documented, so
      the appeal of using them is obvious. But Pattern’s algorithms are pretty
      crappy, and NLTK carries tremendous baggage around in its implementation
      because of its massive framework and its double duty as a teaching tool.

    p.
      As a stand-alone tagger, my Cython implementation is needlessly complicated
      – it was written for my parser. So today I wrote a 200-line version
      of my recommended algorithm for TextBlob. It gets:

    table
      thead
        tr
          th Tagger
          th Accuracy
          th Time (130k words)
      tbody
        tr
          td PyGreedyAP
          td 96.8%
          td 12s

    p.
      I traded some accuracy and a lot of efficiency to keep the implementation
      simple. Here’s a far-too-brief description of how it works.

    h3 Averaged perceptron

    p.
      POS tagging is a “supervised learning problem”. You’re given a table of data,
      and you’re told that the values in the last column will be missing during
      run-time. You have to find correlations from the other columns to predict
      that value.

    p.
      So for us, the missing column will be “part of speech at word i”. The predictor
      columns (features) will be things like “part of speech at word i-1”, “last three
      letters of word at i+1”, etc.

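    p.
      Concretely, each training example boils down to a set of feature strings plus
      the tag we’re trying to predict. A toy example (illustrative only; the feature
      names mirror the extractor shown later):

    pre.language-python
      code
        | # Features for the word "tagger" in "I wrote a simple tagger today"
        | features = {
        |     'bias',
        |     'i word+tagger',
        |     'i suffix+ger',
        |     'i-1 tag+JJ',       # guessed tag of "simple"
        |     'i-2 tag+DT',       # guessed tag of "a"
        |     'i-1 word+simple',
        |     'i+1 word+today',
        | }
        | true_tag = 'NN'
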
    p.
      First, here’s what prediction looks like at run-time:

    pre.language-python
      code
        | def predict(self, features):
        |     '''Dot-product the features and current weights and return the best class.'''
        |     scores = defaultdict(float)
        |     for feat in features:
        |         if feat not in self.weights:
        |             continue
        |         weights = self.weights[feat]
        |         for clas, weight in weights.items():
        |             scores[clas] += weight
        |     # Do a secondary alphabetic sort, for stability
        |     return max(self.classes, key=lambda clas: (scores[clas], clas))

    p.
      Earlier I described the learning problem as a table, with one of the columns
      marked as missing-at-runtime. For NLP, our tables are always exceedingly
      sparse. You have columns like “word i-1=Parliament”, which is almost always
      0. So our “weight vectors” can pretty much never be implemented as vectors.
      Map-types are good though — here we use dictionaries.

    p.
      The input data, features, is a set with a member for every non-zero “column”
      in our “table” – every active feature. Usually this is actually a dictionary,
      to let you set values for the features. But here all my features are binary
      present-or-absent type deals.

    p.
      The weights data-structure is a dictionary of dictionaries that ultimately
      associates feature/class pairs with some weight. You want to structure it
      this way instead of the reverse because of the way word frequencies are
      distributed: most words are rare, frequent words are very frequent.

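    p.
      To make those data structures concrete, here’s a small self-contained sketch
      (not code from the post) of the same scoring logic over a toy weights table:

    pre.language-python
      code
        | from collections import defaultdict
        |
        | # weights: feature -> {class -> weight}
        | weights = {
        |     'i suffix+ing': {'VBG': 2.0, 'NN': 0.3},
        |     'i-1 tag+TO':   {'VB': 1.5, 'NN': -0.5},
        |     'bias':         {'NN': 0.1},
        | }
        | classes = {'NN', 'VB', 'VBG'}
        |
        | def predict(features):
        |     scores = defaultdict(float)
        |     for feat in features:
        |         for clas, weight in weights.get(feat, {}).items():
        |             scores[clas] += weight
        |     # Secondary alphabetic sort keeps ties deterministic
        |     return max(classes, key=lambda clas: (scores[clas], clas))
        |
        | print(predict({'bias', 'i suffix+ing'}))   # VBG
        | print(predict({'bias', 'i-1 tag+TO'}))     # VB
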
    h3 Learning the weights

    p.
      Okay, so how do we get the values for the weights? We start with an empty
      weights dictionary, and iteratively do the following:

    ol
      li Receive a new (features, POS-tag) pair
      li Guess the value of the POS tag given the current “weights” for the features
      li If the guess is wrong, add +1 to the weights associated with the correct class for these features, and -1 to the weights for the predicted class.

    p.
      It’s one of the simplest learning algorithms. Whenever you make a mistake,
      increment the weights for the correct class, and penalise the weights that
      led to your false prediction. In code:

    pre.language-python
      code
        | def train(self, nr_iter, examples):
        |     for i in range(nr_iter):
        |         for features, true_tag in examples:
        |             guess = self.predict(features)
        |             if guess != true_tag:
        |                 for f in features:
        |                     self.weights[f][true_tag] += 1
        |                     self.weights[f][guess] -= 1
        |         random.shuffle(examples)

    p.
      If you iterate over the same example this way, the weights for the correct
      class would have to come out ahead, and you’d get the example right. If
      you think about what happens with two examples, you should be able to
      see that it will get them both right unless the features are identical.
      In general the algorithm will converge so long as the examples are
      linearly separable, although that doesn’t matter for our purpose.

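    p.
      If you want to see that behaviour directly, here’s a tiny self-contained demo
      of the update rule on two toy examples (illustrative only; it assumes binary
      features and a dict-of-dicts weights table like the sketch above):

    pre.language-python
      code
        | from collections import defaultdict
        |
        | classes = {'NN', 'VB'}
        | weights = defaultdict(lambda: defaultdict(float))
        |
        | def predict(features):
        |     scores = defaultdict(float)
        |     for f in features:
        |         for clas, w in weights[f].items():
        |             scores[clas] += w
        |     return max(classes, key=lambda clas: (scores[clas], clas))
        |
        | examples = [
        |     ({'bias', 'i suffix+dog'}, 'NN'),
        |     ({'bias', 'i-1 tag+TO'}, 'VB'),
        | ]
        | for _ in range(3):
        |     for features, true_tag in examples:
        |         guess = predict(features)
        |         if guess != true_tag:
        |             for f in features:
        |                 weights[f][true_tag] += 1
        |                 weights[f][guess] -= 1
        |
        | print([predict(f) for f, _ in examples])   # ['NN', 'VB']
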
    h3 Averaging the weights

    p.
      We need to do one more thing to make the perceptron algorithm competitive.
      The problem with the algorithm so far is that if you train it twice on
      slightly different sets of examples, you end up with really different models.
      It doesn’t generalise that smartly. And the problem is really in the later
      iterations — if you let it run to convergence, it’ll pay lots of attention
      to the few examples it’s getting wrong, and mutate its whole model around
      them.

    p.
      So, what we’re going to do is make the weights more “sticky” – give
      the model less chance to ruin all its hard work in the later rounds. And
      we’re going to do that by returning the averaged weights, not the final
      weights.

    p.
      I doubt there are many people who are convinced that’s the most obvious
      solution to the problem, but whatever. We’re not here to innovate, and this
      way is time tested on lots of problems. If you have another idea, run the
      experiments and tell us what you find. Actually I’d love to see more work
      on this, now that the averaged perceptron has become such a prominent learning
      algorithm in NLP.

    p.
      Okay. So this averaging. How’s that going to work? Note that we don’t want
      to just average after each outer-loop iteration. We want the average of all
      the values — from the inner loop. So if we have 5,000 examples, and we train
      for 10 iterations, we’ll average across 50,000 values for each weight.

    p.
      Obviously we’re not going to store all those intermediate values. Instead,
      we’ll track an accumulator for each weight, and divide it by the number of
      iterations at the end. Again: we want the average weight assigned to a
      feature/class pair during learning, so the key component we need is the total
      weight it was assigned. But we also want to be careful about how we compute
      that accumulator. On almost any instance, we’re going to see a tiny
      fraction of active feature/class pairs. All the other feature/class weights
      won’t change. So we shouldn’t have to go back and add the unchanged value
      to our accumulators anyway, like chumps.

    p.
      Since we’re not chumps, we’ll make the obvious improvement. We’ll maintain
      another dictionary that tracks how long each weight has gone unchanged. Now
      when we do change a weight, we can do a fast-forwarded update to the accumulator,
      for all those iterations where it lay unchanged.

    p.
      Here’s what a weight update looks like now that we have to maintain the
      totals and the time-stamps:

    pre.language-python
      code
        | def update(self, truth, guess, features):
        |     def upd_feat(c, f, v):
        |         nr_iters_at_this_weight = self.i - self._timestamps[f][c]
        |         self._totals[f][c] += nr_iters_at_this_weight * self.weights[f][c]
        |         self.weights[f][c] += v
        |         self._timestamps[f][c] = self.i
        |
        |     self.i += 1
        |     for f in features:
        |         upd_feat(truth, f, 1.0)
        |         upd_feat(guess, f, -1.0)

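    p.
      The post doesn’t show the final averaging step, so here is a minimal sketch of
      what it has to do, assuming the _totals, _timestamps and update counter i
      maintained above (my reconstruction, not code from the post): fast-forward every
      accumulator one last time, then divide by the number of updates.

    pre.language-python
      code
        | def average_weights(self):
        |     '''Replace each weight with its average value over all updates (sketch).'''
        |     for feat, clas_weights in self.weights.items():
        |         new_weights = {}
        |         for clas, weight in clas_weights.items():
        |             # Credit the current weight for the steps since it last changed
        |             total = self._totals[feat][clas]
        |             total += (self.i - self._timestamps[feat][clas]) * weight
        |             averaged = round(total / float(self.i), 3)
        |             if averaged:
        |                 new_weights[clas] = averaged
        |         self.weights[feat] = new_weights
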
    h3 Features and pre-processing

    p.
      The POS tagging literature has tonnes of intricate features sensitive to
      case, punctuation, etc. They help on the standard test-set, which is from
      Wall Street Journal articles from the 1980s, but I don’t see how they’ll
      help us learn models that are useful on other text.

    p.
      To help us learn a more general model, we’ll pre-process the data prior
      to feature extraction, as follows (a rough sketch of this normalisation
      appears after the list):

    ul
      li All words are lower-cased;
      li Digits in the range 1800-2100 are represented as !YEAR;
      li Other digit strings are represented as !DIGITS;
      li
        | It would be better to have a module recognising dates, phone numbers,
        | emails, hash-tags, etc., but that will have to be pushed back into the
        | tokenization.

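    p.
      A minimal sketch of that normalisation as a standalone function (the real
      tagger implements it as a method; this is a reconstruction from the list
      above, not code copied from the post):

    pre.language-python
      code
        | def normalize(word):
        |     '''Normalisation sketch matching the list above.'''
        |     if word.isdigit() and len(word) == 4 and 1800 <= int(word) <= 2100:
        |         return '!YEAR'
        |     elif word[0].isdigit():
        |         return '!DIGITS'
        |     else:
        |         return word.lower()
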
    p.
      I played around with the features a little, and this seems to be a reasonable
      bang-for-buck configuration in terms of getting the development-data accuracy
      to 97% (where it typically converges anyway), and having a smaller memory
      foot-print:

    pre.language-python
      code
        | def _get_features(self, i, word, context, prev, prev2):
        |     '''Map tokens-in-contexts into a feature representation, implemented as a
        |     set. If the features change, a new model must be trained.'''
        |     def add(name, *args):
        |         features.add('+'.join((name,) + tuple(args)))
        |
        |     i += len(START)  # context is padded with two start/end markers (see the training loop below)
        |     features = set()
        |     add('bias') # This acts sort of like a prior
        |     add('i suffix', word[-3:])
        |     add('i pref1', word[0])
        |     add('i-1 tag', prev)
        |     add('i-2 tag', prev2)
        |     add('i tag+i-2 tag', prev, prev2)
        |     add('i word', context[i])
        |     add('i-1 tag+i word', prev, context[i])
        |     add('i-1 word', context[i-1])
        |     add('i-1 suffix', context[i-1][-3:])
        |     add('i-2 word', context[i-2])
        |     add('i+1 word', context[i+1])
        |     add('i+1 suffix', context[i+1][-3:])
        |     add('i+2 word', context[i+2])
        |     return features

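    p.
      One detail worth spelling out: the extractor indexes context[i-2] through
      context[i+2], so the sentence must be padded with two start and two end
      markers before feature extraction. The training loop below builds that padded
      context from constants called START and END, which aren’t shown in the post;
      the marker strings in this sketch are assumptions:

    pre.language-python
      code
        | # Assumed padding constants; the post's training loop refers to START and END
        | # but doesn't define them.
        | START = ['-START-', '-START2-']
        | END = ['-END-', '-END2-']
        |
        | words = ['Their', 'management', 'plan', 'reforms', 'worked']
        | context = START + [w.lower() for w in words] + END
        | # Sentence position i corresponds to context[i + len(START)], which is why
        | # _get_features shifts i before indexing.
        | print(context)
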
    p.
      I haven’t added any features from external data, such as case frequency
      statistics from the Google Web 1T corpus. I might add those later, but for
      now I figured I’d keep things simple.

    h3 What about search?

    p.
      The model I’ve recommended commits to its predictions on each word, and
      moves on to the next one. Those predictions are then used as features for
      the next word. There’s a potential problem here, but it turns out it doesn’t
      matter much. It’s easy to fix with beam-search, but I say it’s not really
      worth bothering. And it definitely doesn’t matter enough to adopt a slow
      and complicated algorithm like Conditional Random Fields.

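    p.
      For clarity, here’s roughly what that greedy run-time loop looks like. This is
      a sketch based on the training loop shown later, not the post’s own tagging
      method; tagdict, model, normalize, get_features, START and END are assumed to
      exist as described elsewhere in the post:

    pre.language-python
      code
        | def tag_greedy(words, tagdict, model, normalize, get_features, START, END):
        |     '''Greedy decoding sketch: commit to one tag per word, left to right,
        |     feeding each guess back in as history for the next word.'''
        |     prev, prev2 = START
        |     context = START + [normalize(w) for w in words] + END
        |     tags = []
        |     for i, word in enumerate(words):
        |         tag = tagdict.get(word)          # frequent unambiguous words: just look up
        |         if not tag:
        |             feats = get_features(i, word, context, prev, prev2)
        |             tag = model.predict(feats)   # otherwise, one greedy prediction
        |         tags.append(tag)
        |         prev2 = prev
        |         prev = tag
        |     return tags
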
    p.
      Here’s the problem. The best indicator for the tag at position, say, 3 in
      a sentence is the word at position 3. But the next-best indicators are the
      tags at positions 2 and 4. So there’s a chicken-and-egg problem: we want
      the predictions for the surrounding words in hand before we commit to a
      prediction for the current word. Here’s an example where search might matter:

    p.example.
      Their management plan reforms worked

    p.
      Depending on just what you’ve learned from your training data, you can
      imagine making a different decision if you started at the left and moved
      right, conditioning on your previous decisions, than if you’d started at
      the right and moved left.

    p.
      If that’s not obvious to you, think about it this way: “worked” is almost
      surely a verb, so if you tag “reforms” with that in hand, you’ll have a
      different idea of its tag than if you’d just come from “plan”, which you
      might have regarded as either a noun or a verb.

    p.
      Search can only help you when you make a mistake. It can prevent that error
      from throwing off your subsequent decisions, or sometimes your future choices
      will correct the mistake. And that’s why for POS tagging, search hardly matters!
      Your model is so good straight-up that your past predictions are almost always
      true. So you really need the planets to align for search to matter at all.

    p.
      And as we improve our taggers, search will matter less and less. Instead
      of search, what we should be caring about is multi-tagging. If we let the
      model be a bit uncertain, we can get over 99% accuracy assigning an average
      of 1.05 tags per word (Vadas et al., ACL 2006). The averaged perceptron is
      rubbish at multi-tagging though. That’s its big weakness. You really want
      a probability distribution for that.

    p.
      One caveat when doing greedy search, though. It’s very important that your
      training data model the fact that the history will be imperfect at run-time.
      Otherwise, it will be way over-reliant on the tag-history features. Because
      the Perceptron is iterative, this is very easy.

    p.
      Here’s the training loop for the tagger:

    pre.language-python
      code
        | def train(self, sentences, save_loc=None, nr_iter=5, quiet=False):
        |     '''Train a model from sentences, and save it at save_loc. nr_iter
        |     controls the number of Perceptron training iterations.'''
        |     self._make_tagdict(sentences, quiet=quiet)
        |     self.model.classes = self.classes
        |     prev, prev2 = START
        |     for iter_ in range(nr_iter):
        |         c = 0; n = 0
        |         for words, tags in sentences:
        |             context = START + [self._normalize(w) for w in words] + END
        |             for i, word in enumerate(words):
        |                 guess = self.tagdict.get(word)
        |                 if not guess:
        |                     feats = self._get_features(
        |                         i, word, context, prev, prev2)
        |                     guess = self.model.predict(feats)
        |                     self.model.update(tags[i], guess, feats)
        |                 # Set the history features from the guesses, not the
        |                 # true tags
        |                 prev2 = prev; prev = guess
        |                 c += guess == tags[i]; n += 1
        |         random.shuffle(sentences)
        |         if not quiet:
        |             print("Iter %d: %d/%d=%.3f" % (iter_, c, n, _pc(c, n)))
        |     self.model.average_weights()
        |     # Pickle as a binary file
        |     if save_loc is not None:
        |         cPickle.dump((self.model.weights, self.tagdict, self.classes),
        |                      open(save_loc, 'wb'), -1)

    p.
      Unlike the previous snippets, this one’s literal – I tended to edit the
      previous ones to simplify. So if they have bugs, hopefully that’s why!

    p.
      At the time of writing, I’m just finishing up the implementation before I
      submit a pull request to TextBlob. You can see the rest of the source here:

    ul
      li
        a(href="https://github.com/sloria/textblob-aptagger/blob/master/textblob_aptagger/taggers.py") taggers.py
      li
        a(href="https://github.com/sloria/textblob-aptagger/blob/master/textblob_aptagger/_perceptron.py") _perceptron.py

    h3 A final comparison…

    p.
      Over the years I’ve seen a lot of cynicism about the WSJ evaluation methodology.
      The claim is that we’ve just been meticulously over-fitting our methods to this
      data. Actually the evidence doesn’t really bear this out. Mostly, if a technique
      is clearly better on one evaluation, it improves others as well. Still, it’s
      very reasonable to want to know how these tools perform on other text. So I
      ran the unchanged models over two other sections from the OntoNotes corpus:

    table
      thead
        tr
          th Tagger
          th WSJ (% acc.)
          th ABC (% acc.)
          th Web (% acc.)
      tbody
        tr
          td Pattern
          td 93.5
          td 90.7
          td 88.1
        tr
          td NLTK
          td 94.0
          td 91.5
          td 88.4
        tr
          td PyGreedyAP
          td 96.8
          td 94.8
          td 91.8

    p.
      The ABC section is broadcast news, and Web is text from the web (blogs etc.;
      I haven’t looked at the data much).

    p.
      As you can see, the order of the systems is stable across the three comparisons,
      and the advantage of our Averaged Perceptron tagger over the other two is real
      enough. Actually the Pattern tagger does very poorly on out-of-domain text.
      It mostly just looks up the words, so it’s very domain dependent. I hadn’t
      realised it before, but it’s obvious enough now that I think about it.

    p.
      We can improve our score greatly by training on some of the foreign data.
      The technique described in this paper (Daume III, 2007) is the first thing
      I try when I have to do that.

    footer.meta(role='contentinfo')
      a.button.button-twitter(href=urls.share_twitter, title='Share on Twitter', rel='nofollow') Share on Twitter
      .discuss
        a.button.button-hn(href='#', title='Discuss on Hacker News', rel='nofollow') Discuss on Hacker News
        a.button.button-reddit(href='#', title='Discuss on Reddit', rel='nofollow') Discuss on Reddit