* Work on website
This commit is contained in:
  parent c9b19a9c00
  commit 6cc9e7881b

docs/redesign/blog_tagger.jade (new file, 492 lines)

@@ -0,0 +1,492 @@

extends ./template_post.jade

block body_block
  - var urls = {}
  - urls.share_twitter = "http://twitter.com/share?text=[ARTICLE HEADLINE]&url=[ARTICLE LINK]&via=honnibal"

  article.post
    header
      h2 A good Part-of-Speech tagger in about 200 lines of Python
      .subhead
        | by
        a(href="#" rel="author") Matthew Honnibal
        | on
        time(datetime='2013-09-11') October 11, 2013

    p.
      Up-to-date knowledge about natural language processing is mostly locked away
      in academia. And academics are mostly pretty self-conscious when we write.
      We’re careful. We don’t want to stick our necks out too much. But under-confident
      recommendations suck, so here’s how to write a good part-of-speech tagger.

    p.
      There are a tonne of “best known techniques” for POS tagging, and you should
      ignore the others and just use Averaged Perceptron.

    p.
      You should use two tags of history, and features derived from the Brown word
      clusters distributed here.

    p.
      If you only need the tagger to work on carefully edited text, you should
      use case-sensitive features, but if you want a more robust tagger you
      should avoid them because they’ll make you over-fit to the conventions
      of your training domain. Instead, features that ask “how frequently is
      this word title-cased, in a large sample from the web?” work well. Then
      you can lower-case your comparatively tiny training corpus.
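
    p.
      As a purely illustrative sketch of what such a case feature might look like
      (the word-count tables, the helper name and the binning are assumptions, not
      part of the tagger described below):

    pre.language-python
      code
        | def add_case_feature(features, word, titlecase_counts, total_counts):
        |     '''Sketch: encode how often `word` is title-cased in a large web
        |     sample, rather than relying on the casing of this one token.'''
        |     total = total_counts.get(word.lower(), 0)
        |     if total == 0:
        |         features.add('case unknown')
        |         return
        |     ratio = titlecase_counts.get(word.lower(), 0) / float(total)
        |     # Bin the ratio so the feature stays categorical
        |     features.add('titlecase freq=%d' % int(ratio * 10))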

    p.
      For efficiency, you should figure out which frequent words in your training
      data have unambiguous tags, so you don’t have to do anything but output
      their tags when they come up. About 50% of the words can be tagged that way.
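
    p.
      A sketch of how such a tag dictionary might be built is below. The frequency
      and ambiguity thresholds are illustrative assumptions, not values taken from
      this post:

    pre.language-python
      code
        | from collections import defaultdict
        |
        | def make_tagdict(sentences, freq_thresh=20, ambiguity_thresh=0.97):
        |     '''Sketch: remember a single tag for words that are frequent and
        |     almost always receive that tag in the training data.'''
        |     counts = defaultdict(lambda: defaultdict(int))
        |     for words, tags in sentences:
        |         for word, tag in zip(words, tags):
        |             counts[word][tag] += 1
        |     tagdict = {}
        |     for word, tag_freqs in counts.items():
        |         tag, mode = max(tag_freqs.items(), key=lambda item: item[1])
        |         n = sum(tag_freqs.values())
        |         # Only trust words that are frequent and nearly unambiguous
        |         if n >= freq_thresh and float(mode) / n >= ambiguity_thresh:
        |             tagdict[word] = tag
        |     return tagdict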

    p.
      And unless you really, really can’t do without an extra 0.1% of accuracy,
      you probably shouldn’t bother with any kind of search strategy: you should
      just use a greedy model.

    p.
      If you do all that, you’ll find your tagger easy to write and understand,
      and an efficient Cython implementation will perform as follows on the standard
      evaluation, 130,000 words of text from the Wall Street Journal:

    table
      thead
        tr
          th Tagger
          th Accuracy
          th Time (130k words)
      tbody
        tr
          td CyGreedyAP
          td 97.1%
          td 4s

    p.
      The 4s includes initialisation time — the actual per-token speed is high
      enough to be irrelevant; it won’t be your bottleneck.

    p.
      It’s tempting to look at 97% accuracy and say something similar (that it’s
      high enough not to worry about), but that’s not true. My parser is about 1%
      more accurate if the input has hand-labelled POS tags, and the taggers all
      perform much worse on out-of-domain data. Unfortunately accuracies have been
      fairly flat for the last ten years. That’s why my recommendation is to just
      use a simple and fast tagger that’s roughly as good.

    p.
      The thing is though, it’s very common to see people using taggers that
      aren’t anywhere near that good! For an example of what a non-expert is
      likely to use, these were the two taggers wrapped by TextBlob, a new Python
      API that I think is quite neat:

    table
      thead
        tr
          th Tagger
          th Accuracy
          th Time (130k words)
      tbody
        tr
          td NLTK
          td 94.0%
          td 3m56s
        tr
          td Pattern
          td 93.5%
          td 26s

    p.
      Both Pattern and NLTK are very robust and beautifully documented, so
      the appeal of using them is obvious. But Pattern’s algorithms are pretty
      crappy, and NLTK carries tremendous baggage around in its implementation
      because of its massive framework, and double-duty as a teaching tool.

    p.
      As a stand-alone tagger, my Cython implementation is needlessly complicated
      – it was written for my parser. So today I wrote a 200 line version
      of my recommended algorithm for TextBlob. It gets:

    table
      thead
        tr
          th Tagger
          th Accuracy
          th Time (130k words)
      tbody
        tr
          td PyGreedyAP
          td 96.8%
          td 12s

    p.
      I traded some accuracy and a lot of efficiency to keep the implementation
      simple. Here’s a far-too-brief description of how it works.

    h3 Averaged perceptron

    p.
      POS tagging is a “supervised learning problem”. You’re given a table of data,
      and you’re told that the values in the last column will be missing during
      run-time. You have to find correlations from the other columns to predict
      that value.

    p.
      So for us, the missing column will be “part of speech at word i”. The predictor
      columns (features) will be things like “part of speech at word i-1”, “last three
      letters of word at i+1”, etc.

    p.
      First, here’s what prediction looks like at run-time:

    pre.language-python
      code
        | def predict(self, features):
        |     '''Dot-product the features and current weights and return the best class.'''
        |     scores = defaultdict(float)
        |     for feat in features:
        |         if feat not in self.weights:
        |             continue
        |         weights = self.weights[feat]
        |         for clas, weight in weights.items():
        |             scores[clas] += weight
        |     # Do a secondary alphabetic sort, for stability
        |     return max(self.classes, key=lambda clas: (scores[clas], clas))

    p.
      Earlier I described the learning problem as a table, with one of the columns
      marked as missing-at-runtime. For NLP, our tables are always exceedingly
      sparse. You have columns like “word i-1=Parliament”, which is almost always
      0. So our “weight vectors” can pretty much never be implemented as vectors.
      Map-types are good though — here we use dictionaries.

    p.
      The input data, features, is a set with a member for every non-zero “column”
      in our “table” – every active feature. Usually this is actually a dictionary,
      to let you set values for the features. But here all my features are binary
      present-or-absent type deals.

    p.
      The weights data-structure is a dictionary of dictionaries that ultimately
      associates feature/class pairs with some weight. You want to structure it
      this way instead of the reverse because of the way word frequencies are
      distributed: most words are rare, frequent words are very frequent.
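
    p.
      As a concrete (purely illustrative) picture of the two structures just
      described, the feature set and the nested weights dictionary might look
      something like this; the feature names are made up for the example:

    pre.language-python
      code
        | from collections import defaultdict
        |
        | # Active features for one token: binary, present-or-absent
        | features = {'bias', 'i word+parliament', 'i-1 tag+DT', 'i suffix+ent'}
        |
        | # weights[feature][class] -> float. Features are the outer keys because
        | # most features are rare, so most inner dictionaries stay tiny.
        | weights = defaultdict(lambda: defaultdict(float))
        | weights['i word+parliament']['NNP'] += 1.0
        | weights['i-1 tag+DT']['NN'] += 1.0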

    h3 Learning the weights

    p.
      Okay, so how do we get the values for the weights? We start with an empty
      weights dictionary, and iteratively do the following:

    ol
      li Receive a new (features, POS-tag) pair.
      li Guess the value of the POS tag given the current “weights” for the features.
      li If the guess is wrong, add +1 to the weights associated with the correct class for these features, and -1 to the weights for the predicted class.

    p.
      It’s one of the simplest learning algorithms. Whenever you make a mistake,
      increment the weights for the correct class, and penalise the weights that
      led to your false prediction. In code:

    pre.language-python
      code
        | def train(self, nr_iter, examples):
        |     for i in range(nr_iter):
        |         for features, true_tag in examples:
        |             guess = self.predict(features)
        |             if guess != true_tag:
        |                 for f in features:
        |                     self.weights[f][true_tag] += 1
        |                     self.weights[f][guess] -= 1
        |         random.shuffle(examples)

    p.
      If you iterate over the same example this way, the weights for the correct
      class would have to come out ahead, and you’d get the example right. If
      you think about what happens with two examples, you should be able to
      see that it will get them both right unless the features are identical.
      In general the algorithm will converge so long as the examples are
      linearly separable, although that doesn’t matter for our purpose.

    h3 Averaging the weights

    p.
      We need to do one more thing to make the perceptron algorithm competitive.
      The problem with the algorithm so far is that if you train it twice on
      slightly different sets of examples, you end up with really different models.
      It doesn’t generalise that smartly. And the problem is really in the later
      iterations — if you let it run to convergence, it’ll pay lots of attention
      to the few examples it’s getting wrong, and mutate its whole model around
      them.

    p.
      So, what we’re going to do is make the weights more “sticky” – give
      the model less chance to ruin all its hard work in the later rounds. And
      we’re going to do that by returning the averaged weights, not the final
      weights.

    p.
      I doubt there are many people who are convinced that’s the most obvious
      solution to the problem, but whatever. We’re not here to innovate, and this
      way is time tested on lots of problems. If you have another idea, run the
      experiments and tell us what you find. Actually I’d love to see more work
      on this, now that the averaged perceptron has become such a prominent learning
      algorithm in NLP.

    p.
      Okay. So this averaging. How’s that going to work? Note that we don’t want
      to just average after each outer-loop iteration. We want the average of all
      the values — from the inner loop. So if we have 5,000 examples, and we train
      for 10 iterations, we’ll average across 50,000 values for each weight.

    p.
      Obviously we’re not going to store all those intermediate values. Instead,
      we’ll track an accumulator for each weight, and divide it by the number of
      iterations at the end. Again: we want the average weight assigned to a
      feature/class pair during learning, so the key component we need is the total
      weight it was assigned. But we also want to be careful about how we compute
      that accumulator. On almost any instance, we’re going to see a tiny
      fraction of active feature/class pairs. All the other feature/class weights
      won’t change. So we shouldn’t have to go back and add the unchanged value
      to our accumulators anyway, like chumps.

    p.
      Since we’re not chumps, we’ll make the obvious improvement. We’ll maintain
      another dictionary that tracks how long each weight has gone unchanged. Now
      when we do change a weight, we can do a fast-forwarded update to the accumulator,
      for all those iterations where it lay unchanged.

    p.
      Here’s what a weight update looks like now that we have to maintain the
      totals and the time-stamps:

    pre.language-python
      code
        | def update(self, truth, guess, features):
        |     def upd_feat(c, f, v):
        |         nr_iters_at_this_weight = self.i - self._timestamps[f][c]
        |         self._totals[f][c] += nr_iters_at_this_weight * self.weights[f][c]
        |         self.weights[f][c] += v
        |         self._timestamps[f][c] = self.i
        |
        |     self.i += 1
        |     for f in features:
        |         upd_feat(truth, f, 1.0)
        |         upd_feat(guess, f, -1.0)
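
    p.
      The final averaging step isn’t shown in this post’s snippets. A sketch of how
      it might look, given the _totals and _timestamps dictionaries maintained by
      update() above (the released code may differ in details):

    pre.language-python
      code
        | def average_weights(self):
        |     '''Sketch: fast-forward each accumulator to the current tick, then
        |     divide by the number of updates to get the averaged weight.'''
        |     for feat, weights in self.weights.items():
        |         averaged = {}
        |         for clas, weight in weights.items():
        |             total = self._totals[feat][clas]
        |             total += (self.i - self._timestamps[feat][clas]) * weight
        |             averaged[clas] = total / float(self.i)
        |         self.weights[feat] = averaged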

    h3 Features and pre-processing

    p.
      The POS tagging literature has tonnes of intricate features sensitive to
      case, punctuation, etc. They help on the standard test-set, which is from
      Wall Street Journal articles from the 1980s, but I don’t see how they’ll
      help us learn models that are useful on other text.

    p.
      To help us learn a more general model, we’ll pre-process the data prior
      to feature extraction, as follows (a sketch of this normalisation step is
      given after the list):

    ul
      li All words are lower-cased;
      li Digits in the range 1800-2100 are represented as !YEAR;
      li Other digit strings are represented as !DIGITS;
      li
        | It would be better to have a module recognising dates, phone numbers,
        | emails, hash-tags, etc. but that will have to be pushed back into the
        | tokenization.
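
    p.
      Concretely, the normalisation might look something like the sketch below;
      the exact rules in the released code may differ:

    pre.language-python
      code
        | def _normalize(self, word):
        |     '''Sketch of the pre-processing rules listed above.'''
        |     if word.isdigit() and len(word) == 4 and 1800 <= int(word) <= 2100:
        |         return '!YEAR'
        |     elif word.isdigit():
        |         return '!DIGITS'
        |     else:
        |         return word.lower()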

    p.
      I played around with the features a little, and this seems to be a reasonable
      bang-for-buck configuration in terms of getting the development-data accuracy
      to 97% (where it typically converges anyway), and having a smaller memory
      foot-print:

    pre.language-python
      code
        | def _get_features(self, i, word, context, prev, prev2):
        |     '''Map tokens-in-contexts into a feature representation, implemented as a
        |     set. If the features change, a new model must be trained.'''
        |     def add(name, *args):
        |         features.add('+'.join((name,) + tuple(args)))
        |
        |     features = set()
        |     add('bias') # This acts sort of like a prior
        |     add('i suffix', word[-3:])
        |     add('i pref1', word[0])
        |     add('i-1 tag', prev)
        |     add('i-2 tag', prev2)
        |     add('i tag+i-2 tag', prev, prev2)
        |     add('i word', context[i])
        |     add('i-1 tag+i word', prev, context[i])
        |     add('i-1 word', context[i-1])
        |     add('i-1 suffix', context[i-1][-3:])
        |     add('i-2 word', context[i-2])
        |     add('i+1 word', context[i+1])
        |     add('i+1 suffix', context[i+1][-3:])
        |     add('i+2 word', context[i+2])
        |     return features

    p.
      I haven’t added any features from external data, such as case frequency
      statistics from the Google Web 1T corpus. I might add those later, but for
      now I figured I’d keep things simple.

    h3 What about search?

    p.
      The model I’ve recommended commits to its predictions on each word, and
      moves on to the next one. Those predictions are then used as features for
      the next word. There’s a potential problem here, but it turns out it doesn’t
      matter much. It’s easy to fix with beam-search, but I say it’s not really
      worth bothering. And it definitely doesn’t matter enough to adopt a slow
      and complicated algorithm like Conditional Random Fields.
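
    p.
      For concreteness, that greedy loop might look like the sketch below, reusing
      the tag dictionary and helpers shown elsewhere in this post; the released
      taggers.py may differ in details such as tokenization:

    pre.language-python
      code
        | def tag(self, words):
        |     '''Sketch: commit to each prediction and feed it forward as history.'''
        |     prev, prev2 = START
        |     context = START + [self._normalize(w) for w in words] + END
        |     tags = []
        |     for i, word in enumerate(words):
        |         tag = self.tagdict.get(word)
        |         if not tag:
        |             features = self._get_features(i, word, context, prev, prev2)
        |             tag = self.model.predict(features)
        |         tags.append(tag)
        |         prev2 = prev; prev = tag
        |     return tags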

    p.
      Here’s the problem. The best indicator for the tag at position, say, 3 in
      a sentence is the word at position 3. But the next-best indicators are the
      tags at positions 2 and 4. So there’s a chicken-and-egg problem: we want
      the predictions for the surrounding words in hand before we commit to a
      prediction for the current word. Here’s an example where search might matter:

    p.example.
      Their management plan reforms worked

    p.
      Depending on just what you’ve learned from your training data, you can
      imagine making a different decision if you started at the left and moved
      right, conditioning on your previous decisions, than if you’d started at
      the right and moved left.

    p.
      If that’s not obvious to you, think about it this way: “worked” is almost
      surely a verb, so if you tag “reforms” with that in hand, you’ll have a
      different idea of its tag than if you’d just come from “plan”, which you
      might have regarded as either a noun or a verb.

    p.
      Search can only help you when you make a mistake. It can prevent that error
      from throwing off your subsequent decisions, or sometimes your future choices
      will correct the mistake. And that’s why for POS tagging, search hardly matters!
      Your model is so good straight-up that your past predictions are almost always
      true. So you really need the planets to align for search to matter at all.

    p.
      And as we improve our taggers, search will matter less and less. Instead
      of search, what we should be caring about is multi-tagging. If we let the
      model be a bit uncertain, we can get over 99% accuracy assigning an average
      of 1.05 tags per word (Vadas et al., ACL 2006). The averaged perceptron is
      rubbish at multi-tagging though. That’s its big weakness. You really want
      a probability distribution for that.

    p.
      One caveat when doing greedy search, though. It’s very important that your
      training data model the fact that the history will be imperfect at run-time.
      Otherwise, it will be way over-reliant on the tag-history features. Because
      the Perceptron is iterative, this is very easy.

    p.
      Here’s the training loop for the tagger:

    pre.language-python
      code
        | def train(self, sentences, save_loc=None, nr_iter=5, quiet=False):
        |     '''Train a model from sentences, and save it at save_loc. nr_iter
        |     controls the number of Perceptron training iterations.'''
        |     self._make_tagdict(sentences, quiet=quiet)
        |     self.model.classes = self.classes
        |     prev, prev2 = START
        |     for iter_ in range(nr_iter):
        |         c = 0; n = 0
        |         for words, tags in sentences:
        |             context = START + [self._normalize(w) for w in words] + END
        |             for i, word in enumerate(words):
        |                 guess = self.tagdict.get(word)
        |                 if not guess:
        |                     feats = self._get_features(
        |                         i, word, context, prev, prev2)
        |                     guess = self.model.predict(feats)
        |                     self.model.update(tags[i], guess, feats)
        |                 # Set the history features from the guesses, not the
        |                 # true tags
        |                 prev2 = prev; prev = guess
        |                 c += guess == tags[i]; n += 1
        |         random.shuffle(sentences)
        |         if not quiet:
        |             print("Iter %d: %d/%d=%.3f" % (iter_, c, n, _pc(c, n)))
        |     self.model.average_weights()
        |     # Pickle as a binary file
        |     if save_loc is not None:
        |         cPickle.dump((self.model.weights, self.tagdict, self.classes),
        |                      open(save_loc, 'wb'), -1)

    p.
      Unlike the previous snippets, this one’s literal – I tended to edit the
      previous ones to simplify. So if they have bugs, hopefully that’s why!

    p.
      At the time of writing, I’m just finishing up the implementation before I
      submit a pull request to TextBlob. You can see the rest of the source here:

    ul
      li
        a(href="https://github.com/sloria/textblob-aptagger/blob/master/textblob_aptagger/taggers.py") taggers.py
      li
        a(href="https://github.com/sloria/textblob-aptagger/blob/master/textblob_aptagger/_perceptron.py") _perceptron.py

    h3 A final comparison…

    p.
      Over the years I’ve seen a lot of cynicism about the WSJ evaluation methodology.
      The claim is that we’ve just been meticulously over-fitting our methods to this
      data. Actually the evidence doesn’t really bear this out. Mostly, if a technique
      is clearly better on one evaluation, it improves others as well. Still, it’s
      very reasonable to want to know how these tools perform on other text. So I
      ran the unchanged models over two other sections from the OntoNotes corpus:

    table
      thead
        tr
          th Tagger
          th WSJ
          th ABC
          th Web
      tbody
        tr
          td Pattern
          td 93.5
          td 90.7
          td 88.1
        tr
          td NLTK
          td 94.0
          td 91.5
          td 88.4
        tr
          td PyGreedyAP
          td 96.8
          td 94.8
          td 91.8

    p.
      The numbers are accuracies, as in the tables above. The ABC section is broadcast
      news, Web is text from the web (blogs etc. — I haven’t looked at the data much).

    p.
      As you can see, the order of the systems is stable across the three comparisons,
      and the advantage of our Averaged Perceptron tagger over the other two is real
      enough. Actually the Pattern tagger does very poorly on out-of-domain text.
      It mostly just looks up the words, so it’s very domain dependent. I hadn’t
      realised it before, but it’s obvious enough now that I think about it.

    p.
      We can improve our score greatly by training on some of the foreign data.
      The technique described by Daumé III (2007) is the first thing
      I try when I have to do that.
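
    p.
      That technique is feature augmentation: keep a shared copy of every feature
      plus a domain-specific copy, so the model can learn which cues transfer across
      domains and which are domain-particular. A rough sketch for binary feature
      sets like ours (the function is hypothetical, not from the released code):

    pre.language-python
      code
        | def augment_features(features, domain):
        |     '''Sketch of Daume-style feature augmentation: every feature gets a
        |     general copy plus a copy marked with the training domain.'''
        |     augmented = set(features)
        |     augmented.update('%s+domain=%s' % (f, domain) for f in features)
        |     return augmented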

    footer.meta(role='contentinfo')
      a.button.button-twitter(href=urls.share_twitter, title='Share on Twitter', rel='nofollow') Share on Twitter
      .discuss
        a.button.button-hn(href='#', title='Discuss on Hacker News', rel='nofollow') Discuss on Hacker News
        a.button.button-reddit(href='#', title='Discuss on Reddit', rel='nofollow') Discuss on Reddit