* Add parser post in jade

Matthew Honnibal 2015-08-13 14:40:53 +02:00
parent ba00c72505
commit 8a252d08f9

@ -0,0 +1,923 @@
-
var urls = {
'pos_post': 'https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/',
'google_ngrams': "http://googleresearch.blogspot.com.au/2013/05/syntactic-ngrams-over-time.html",
'implementation': 'https://gist.github.com/syllog1sm/10343947',
'redshift': 'http://github.com/syllog1sm/redshift',
'tasker': 'https://play.google.com/store/apps/details?id=net.dinglisch.android.taskerm',
'acl_anthology': 'http://aclweb.org/anthology/',
'share_twitter': 'http://twitter.com/share?text=[ARTICLE HEADLINE]&url=[ARTICLE LINK]&via=honnibal'
}
doctype html
html(lang='en')
head
meta(charset='utf-8')
title spaCy Blog
meta(name='description', content='')
meta(name='author', content='Matthew Honnibal')
link(rel='stylesheet', href='css/style.css')
//if lt IE 9
script(src='http://html5shiv.googlecode.com/svn/trunk/html5.js')
body#blog
header(role='banner')
h1.logo spaCy Blog
.slogan Blog
main#content(role='main')
article.post
header
h2 Parsing English with 500 lines of Python
.subhead
| by
a(href='#', rel='author') Matthew Honnibal
| on
time(datetime='2013-12-18') December 18, 2013
p
| A
a(href=urls.google_ngrams) syntactic parser
| describes a sentence’s grammatical structure, to help another
| application reason about it. Natural languages introduce many unexpected
| ambiguities, which our world-knowledge immediately filters out. A
| favourite example:
p.example They ate the pizza with anchovies
p
img(src='img/blog01.png', alt='Eat-with pizza-with ambiguity')
p
| A correct parse links “with” to “pizza”, while an incorrect parse
| links “with” to “eat”:
.displacy
iframe(src='displacy/anchovies_bad.html', height='275')
.displacy
iframe.displacy(src='displacy/anchovies_good.html', height='275')
a.view-displacy(href='#') View on displaCy
p
| The Natural Language Processing (NLP) community has made big progress
| in syntactic parsing over the last few years. It’s now possible for
| a tiny Python implementation to perform better than the widely-used
| Stanford PCFG parser.
p
strong Update!
| The Stanford CoreNLP library now includes a greedy transition-based
| dependency parser, similar to the one described in this post, but with
| an improved learning strategy. It is much faster and more accurate
| than this simple Python implementation.
table
thead
tr
th Parser
th Accuracy
th Speed (w/s)
th Language
th LOC
tbody
tr
td Stanford
td 89.6%
td 19
td Java
td
| > 50,000
sup
a(href='#note-1') [1]
tr
td
strong parser.py
td 89.8%
td 2,020
td Python
td
strong ~500
tr
td Redshift
td
strong 93.6%
td
strong 2,580
td Cython
td ~4,000
p
| The rest of the post sets up the problem, and then takes you through
a(href=urls.implementation) a concise implementation
| , prepared for this post. The first 200 lines of parser.py, the
| part-of-speech tagger and learner, are described
a(href=urls.pos_post) here
| . You should probably at least skim that
| post before reading this one, unless you’re very familiar with NLP
| research.
p
| The Cython system, Redshift, was written for my current research. I
| plan to improve it for general use in June, after my contract ends
| at Macquarie University. The current version is
a(href=urls.redshift) hosted on GitHub
| .
h3 Problem Description
p It’d be nice to type an instruction like this into your phone:
p.example
| Set volume to zero when I’m in a meeting, unless John’s school calls.
p
| And have it set the appropriate policy. On Android you can do this
| sort of thing with
a(href=urls.tasker) Tasker
| , but an NL interface would be much better. It’d be especially nice
| to receive a meaning representation you could edit, so you could see
| what it thinks you said, and correct it.
p
| There are lots of problems to solve to make that work, but some sort
| of syntactic representation is definitely necessary. We need to know that:
p.example
| Unless John’s school calls, when I’m in a meeting, set volume to zero
p is another way of phrasing the first instruction, while:
p.example
| Unless John’s school, call when I’m in a meeting
p means something completely different.
p
| A dependency parser returns a graph of word-word relationships,
| intended to make such reasoning easier. Our graphs will be trees –
| edges will be directed, and every node (word) will have exactly one
| incoming arc (one dependency, with its head), except one.
h4 Example usage
pre.language-python.
p.
The idea is that it should be slightly easier to reason from the parse,
than it was from the string. The parse-to-meaning mapping is hopefully
simpler than the string-to-meaning mapping.
p.
The most confusing thing about this problem area is that “correctness”
is defined by convention, that is, by annotation guidelines. If you haven’t
read the guidelines and you’re not a linguist, you can’t tell whether
the parse is “wrong” or “right”, which makes the whole task feel weird
and artificial.
p.
For instance, there’s a mistake in the parse above: “John’s school
calls” is structured wrongly, according to the Stanford annotation
guidelines. The structure of that part of the sentence is how the
annotators were instructed to parse an example like “John’s school
clothes”.
p
| It’s worth dwelling on this point a bit. We could, in theory, have
| written our guidelines so that the “correct” parses were reversed.
| There’s good reason to believe the parsing task would be harder if we
| reversed our convention, as it’d be less consistent with the rest of
| the grammar.
sup: a(href='#note-2') [2]
| But we could test that empirically, and we’d be pleased to gain an
| advantage by reversing the policy.
p
| We definitely do want that distinction in the guidelines: we don’t
| want both to receive the same structure, or our output will be less
| useful. The annotation guidelines strike a balance between what
| distinctions downstream applications will find useful, and what
| parsers will be able to predict easily.
h4 Projective trees
p
| There’s a particularly useful simplification that we can make, when
| deciding what we want the graph to look like: we can restrict the
| graph structures we’ll be dealing with. This doesn’t just give us a
| likely advantage in learnability; it can have deep algorithmic
| implications. We follow most work on English in constraining the
| dependency graphs to be
em projective trees
| :
ol
li Tree. Every word has exactly one head, except for the dummy ROOT symbol.
li
| Projective. Treat each dependency as the span between its two words,
| so that a1 < a2 and b1 < b2. For every pair of dependencies (a1, a2)
| and (b1, b2), if a1 < b1 < a2, then b2 <= a2. In other words,
| dependencies cannot “cross”. You can’t have a pair of dependencies
| that goes a1 b1 a2 b2, or b1 a1 b2 a2.
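p.
The “no crossing” condition can be checked directly on a heads array.
The sketch below is only an illustration, not part of parser.py: the
function name is_projective and the convention that a headless word
has head None are assumptions made for this example.

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads[i] is the index of the head of word i, or None if word i
    has no head (a convention assumed for this sketch).
    """
    # Represent every arc as an ordered (left, right) span.
    arcs = [(min(h, c), max(h, c))
            for c, h in enumerate(heads) if h is not None]
    for a1, a2 in arcs:
        for b1, b2 in arcs:
            # b starts strictly inside a and ends strictly outside it.
            if a1 < b1 < a2 < b2:
                return False
    return True


nested = [1, None, 1]          # arcs 0-1 and 1-2 only touch
crossing = [2, 3, None, 2]     # arcs 0-2 and 1-3 cross
print(is_projective(nested))
print(is_projective(crossing))
```

p.
The check is quadratic in the number of arcs, which is fine for a
sanity test on single sentences.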
p
| There’s a rich literature on parsing non-projective trees, and a
| smaller literature on parsing DAGs. But the parsing algorithm I’ll
| be explaining deals with projective trees.
h3 Greedy transition-based parsing
p
| Our parser takes as input a list of string tokens, and outputs a
| list of head indices, representing edges in the graph. If the
em i
| th member of heads is
em j
| , the dependency parse contains an edge (j, i). A transition-based
| parser is a finite-state transducer; it maps an array of N words
| onto an output array of N head indices:
table.center
tbody
tr
td
em start
td MSNBC
td reported
td that
td Facebook
td bought
td WhatsApp
td for
td $16bn
td
em root
tr
td 0
td 2
td 9
td 2
td 4
td 2
td 4
td 4
td 7
td 0
p
| The heads array denotes that the head of
em MSNBC
| is
em reported
| :
em MSNBC
| is word 1, and
em reported
| is word 2, and
code.language-python heads[1] == 2
| . You can already see why parsing a tree is handy — this data structure
| wouldn’t work if we had to output a DAG, where words may have multiple
| heads.
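p.
To make the encoding concrete, here is a small sketch that turns the
heads array from the table into (head, child) word pairs. The helper
name head_edges and the padding strings are assumptions for this
example; the first and last positions are treated as padding symbols
and skipped.

```python
words = ['<start>', 'MSNBC', 'reported', 'that', 'Facebook', 'bought',
         'WhatsApp', 'for', '$16bn', '<root>']
heads = [0, 2, 9, 2, 4, 2, 4, 4, 7, 0]


def head_edges(words, heads):
    """Yield (head_word, child_word) for every non-padding token."""
    for i in range(1, len(words) - 1):
        j = heads[i]
        # An edge (j, i) means word j is the head of word i.
        yield words[j], words[i]


arcs = list(head_edges(words, heads))
for head, child in arcs:
    print('%s -> %s' % (head, child))
```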
p
| Although
code.language-python heads
| can be represented as an array, we’d actually like to maintain some
| alternate ways to access the parse, to make it easy and efficient to
| extract features. Our
code.language-python Parse
| class looks like this:
pre.language-python
code
| class Parse(object):
|     def __init__(self, n):
|         self.n = n
|         self.heads = [None] * (n-1)
|         self.lefts = []
|         self.rights = []
|         for i in range(n+1):
|             self.lefts.append(DefaultList(0))
|             self.rights.append(DefaultList(0))
|
|     def add_arc(self, head, child):
|         self.heads[child] = head
|         if child < head:
|             self.lefts[head].append(child)
|         else:
|             self.rights[head].append(child)
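p.
The Parse class relies on a DefaultList helper that is not shown in
this post. The sketch below assumes it is a list that returns a default
value for out-of-range indices, which is what the feature extraction
needs when it asks for a child that does not exist. Treat the helper as
an assumption rather than the exact code in parser.py.

```python
class DefaultList(list):
    """A list that returns a default value for out-of-range indices."""
    def __init__(self, default=None):
        list.__init__(self)
        self.default = default

    def __getitem__(self, index):
        try:
            return list.__getitem__(self, index)
        except IndexError:
            return self.default


class Parse(object):
    def __init__(self, n):
        self.n = n
        self.heads = [None] * (n-1)
        self.lefts = []
        self.rights = []
        for i in range(n+1):
            self.lefts.append(DefaultList(0))
            self.rights.append(DefaultList(0))

    def add_arc(self, head, child):
        self.heads[child] = head
        if child < head:
            self.lefts[head].append(child)
        else:
            self.rights[head].append(child)


parse = Parse(5)
parse.add_arc(2, 1)   # word 1 becomes a left child of word 2
parse.add_arc(2, 3)   # word 3 becomes a right child of word 2
print(parse.heads)
print(parse.lefts[2], parse.rights[2])
```

p.
Asking for a second right child that is not there, as in
parse.rights[2][-2], falls back to the default instead of raising.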
p
| As well as the parse, we also have to keep track of where we’re up
| to in the sentence. We’ll do this with an index into the
code.language-python words
| array, and a stack, to which we’ll push words, before popping them
| once their head is set. So our state data structure is fundamentally:
ul
li An index, i, into the list of tokens;
li The dependencies added so far, in Parse;
li
| A stack, containing words that occurred before i, for which we’re
| yet to assign a head.
p Each step of the parsing process applies one of three actions to the state:
pre.language-python
code
| SHIFT = 0; RIGHT = 1; LEFT = 2
| MOVES = [SHIFT, RIGHT, LEFT]
|
| def transition(move, i, stack, parse):
|     if move == SHIFT:
|         stack.append(i)
|         return i + 1
|     elif move == RIGHT:
|         parse.add_arc(stack[-2], stack.pop())
|         return i
|     elif move == LEFT:
|         parse.add_arc(i, stack.pop())
|         return i
|     raise GrammarError(&quot;Unknown move: %d&quot; % move)
p
| The
code.language-python LEFT
| and
code.language-python RIGHT
| actions add dependencies and pop the stack, while
code.language-python SHIFT
| pushes the stack and advances i into the buffer.
p.
So, the parser starts with an empty stack, and a buffer index at 0, with
no dependencies recorded. It chooses one of the (valid) actions, and
applies it to the state. It continues choosing actions and applying
them until the stack is empty and the buffer index is at the end of
the input. (It’s hard to understand this sort of algorithm without
stepping through it. Try coming up with a sentence, drawing a projective
parse tree over it, and then try to reach the parse tree by choosing
the right sequence of transitions.)
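p.
As one worked example of stepping through the transitions, the sketch
below parses “They ate pizza”. It makes a few assumptions for
illustration: the sentence is padded with a start symbol at position 0
and a root symbol at the end, the initial state has the first real word
on the stack (as in the training loop later in the post), and heads is
a plain list rather than a Parse object.

```python
SHIFT = 0; RIGHT = 1; LEFT = 2


def transition(move, i, stack, heads):
    """Apply one move to the state; heads is a plain list in this sketch."""
    if move == SHIFT:
        stack.append(i)
        return i + 1
    elif move == RIGHT:
        child = stack.pop()
        heads[child] = stack[-1]   # the word below it on the stack is the head
        return i
    elif move == LEFT:
        child = stack.pop()
        heads[child] = i           # the first word of the buffer is the head
        return i
    raise ValueError("Unknown move: %d" % move)


words = ['<start>', 'They', 'ate', 'pizza', '<root>']
n = len(words)
heads = [None] * (n - 1)
i, stack = 2, [1]

# They <- ate, pizza <- ate, ate <- <root>
for move in (LEFT, SHIFT, SHIFT, RIGHT, LEFT):
    i = transition(move, i, stack, heads)
    print('move=%d i=%d stack=%s heads=%s' % (move, i, stack, heads))
```

p.
After the last move the stack is empty and only the root symbol is left
in the buffer, so parsing stops.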
p Here’s what the parsing loop looks like in code:
pre.language-python
code
| class Parser(object):
|     ...
|     def parse(self, words):
|         tags = self.tagger(words)
|         n = len(words)
|         idx = 1
|         stack = [0]
|         parse = Parse(n)
|         while stack or idx < n:
|             features = extract_features(words, tags, idx, n, stack, parse)
|             scores = self.model.score(features)
|             valid_moves = get_valid_moves(idx, n, len(stack))
|             next_move = max(valid_moves, key=lambda move: scores[move])
|             idx = transition(next_move, idx, stack, parse)
|         return tags, parse
|
| def get_valid_moves(i, n, stack_depth):
|     moves = []
|     if i < n:
|         moves.append(SHIFT)
|     if stack_depth >= 2:
|         moves.append(RIGHT)
|     if stack_depth >= 1:
|         moves.append(LEFT)
|     return moves
p.
We start by tagging the sentence, and initializing the state. We then
map the state to a set of features, which we score using a linear model.
We then find the best-scoring valid move, and apply it to the state.
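p.
The “best-scoring valid move” step matters because the highest-scoring
move overall may not be applicable in the current state. A minimal
sketch, with made-up scores and a hypothetical helper name
best_valid_move:

```python
SHIFT = 0; RIGHT = 1; LEFT = 2


def best_valid_move(scores, valid_moves):
    """Pick the highest-scoring move among those the state allows."""
    return max(valid_moves, key=lambda move: scores[move])


# Hypothetical scores: RIGHT scores highest, but it needs two words on
# the stack, so it is not always available.
scores = {SHIFT: 0.3, RIGHT: 1.7, LEFT: -0.4}
print(best_valid_move(scores, [SHIFT, RIGHT, LEFT]))
print(best_valid_move(scores, [SHIFT, LEFT]))
```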
p
| The model scoring works the same as it did in
a(href=urls.pos_post) the POS tagger
| . If you’re confused about the idea of extracting features and scoring
| them with a linear model, you should review that post. Here’s a reminder
| of how the model scoring works:
pre.language-python
code
| class Perceptron(object):
|     ...
|     def score(self, features):
|         all_weights = self.weights
|         scores = dict((clas, 0) for clas in self.classes)
|         for feat, value in features.items():
|             if value == 0:
|                 continue
|             if feat not in all_weights:
|                 continue
|             weights = all_weights[feat]
|             for clas, weight in weights.items():
|                 scores[clas] += value * weight
|         return scores
p.
It’s just summing the class-weights for each feature. This is often
expressed as a dot-product, but when you’re dealing with multiple
classes, that gets awkward, I find.
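p.
A toy run of the same summation may help. The function below is a
standalone version of the scoring loop, and the weights and feature
names are invented for the example.

```python
def score(features, weights, classes):
    """Sum the per-class weights of every active, known feature."""
    scores = dict((clas, 0.0) for clas in classes)
    for feat, value in features.items():
        if value == 0 or feat not in weights:
            continue
        for clas, weight in weights[feat].items():
            scores[clas] += value * weight
    return scores


# Hypothetical weights: feature -> {class: weight}
weights = {
    'bias': {'SHIFT': 0.5, 'LEFT': -0.5},
    't=DT': {'SHIFT': 1.0, 'RIGHT': -2.0},
}
features = {'bias': 1, 't=DT': 1, 't=unseen': 1}
scores = score(features, weights, ['SHIFT', 'RIGHT', 'LEFT'])
print(scores)
```

p.
Features the model has never seen, like t=unseen here, simply contribute
nothing.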
p.
The beam parser (RedShift) tracks multiple candidates, and only decides
on the best one at the very end. We’re going to trade away accuracy
in favour of efficiency and simplicity. We’ll only follow a single
analysis. Our search strategy will be entirely greedy, as it was with
the POS tagger. We’ll lock in our choices at every step.
p.
If you read the POS tagger post carefully, you might see the underlying
similarity. What we’ve done is map the parsing problem onto a
sequence-labelling problem, which we address using a “flat”, or unstructured,
learning algorithm (by doing greedy search).
h3 Features
p.
Feature extraction code is always pretty ugly. The features for the parser
refer to a few tokens from the context:
ul
li The first three words of the buffer (n0, n1, n2)
li The top three words of the stack (s0, s1, s2)
li The two leftmost children of s0 (s0b1, s0b2)
li The two rightmost children of s0 (s0f1, s0f2)
li The two leftmost children of n0 (n0b1, n0b2)
p.
For these 12 tokens, we refer to the word-form, the part-of-speech tag,
and the number of left and right children attached to the token.
p.
Because we’re using a linear model, we have our features refer to pairs
and triples of these atomic properties.
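p.
A linear model gives each feature its own weight and cannot learn that
two atomic properties matter only in combination, so the combination
has to be spelled out as a feature of its own. A small sketch of the
idea, with a hypothetical helper conjoin and example values:

```python
def conjoin(name, *values):
    """Build one feature string from several atomic properties."""
    return '%s %s' % (name, '/'.join(values))


# Hypothetical context: word and tag of s0, tag of n0.
Ws0, Ts0, Tn0 = 'ate', 'VBD', 'NN'

features = {}
for key in (conjoin('ts0', Ts0),                      # atomic
            conjoin('tn0', Tn0),                      # atomic
            conjoin('ts0-tn0', Ts0, Tn0),             # pair
            conjoin('ws0-ts0-tn0', Ws0, Ts0, Tn0)):   # triple
    features[key] = 1

print(sorted(features))
```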
pre.language-python
code
| def extract_features(words, tags, n0, n, stack, parse):
|     def get_stack_context(depth, stack, data):
|         if depth &gt;= 3:
|             return data[stack[-1]], data[stack[-2]], data[stack[-3]]
|         elif depth &gt;= 2:
|             return data[stack[-1]], data[stack[-2]], ''
|         elif depth == 1:
|             return data[stack[-1]], '', ''
|         else:
|             return '', '', ''
|
|     def get_buffer_context(i, n, data):
|         if i + 1 &gt;= n:
|             return data[i], '', ''
|         elif i + 2 &gt;= n:
|             return data[i], data[i + 1], ''
|         else:
|             return data[i], data[i + 1], data[i + 2]
|
|     def get_parse_context(word, deps, data):
|         if word == -1:
|             return 0, '', ''
|         deps = deps[word]
|         valency = len(deps)
|         if not valency:
|             return 0, '', ''
|         elif valency == 1:
|             return 1, data[deps[-1]], ''
|         else:
|             return valency, data[deps[-1]], data[deps[-2]]
|
|     features = {}
|     # Set up the context pieces --- the word, W, and tag, T, of:
|     # S0-2: Top three words on the stack
|     # N0-2: First three words of the buffer
|     # n0b1, n0b2: Two leftmost children of the first word of the buffer
|     # s0b1, s0b2: Two leftmost children of the top word of the stack
|     # s0f1, s0f2: Two rightmost children of the top word of the stack
|
|     depth = len(stack)
|     s0 = stack[-1] if depth else -1
|
|     Ws0, Ws1, Ws2 = get_stack_context(depth, stack, words)
|     Ts0, Ts1, Ts2 = get_stack_context(depth, stack, tags)
|
|     Wn0, Wn1, Wn2 = get_buffer_context(n0, n, words)
|     Tn0, Tn1, Tn2 = get_buffer_context(n0, n, tags)
|
|     Vn0b, Wn0b1, Wn0b2 = get_parse_context(n0, parse.lefts, words)
|     Vn0b, Tn0b1, Tn0b2 = get_parse_context(n0, parse.lefts, tags)
|
|     Vn0f, Wn0f1, Wn0f2 = get_parse_context(n0, parse.rights, words)
|     _, Tn0f1, Tn0f2 = get_parse_context(n0, parse.rights, tags)
|
|     Vs0b, Ws0b1, Ws0b2 = get_parse_context(s0, parse.lefts, words)
|     _, Ts0b1, Ts0b2 = get_parse_context(s0, parse.lefts, tags)
|
|     Vs0f, Ws0f1, Ws0f2 = get_parse_context(s0, parse.rights, words)
|     _, Ts0f1, Ts0f2 = get_parse_context(s0, parse.rights, tags)
|
|     # Cap numeric features at 5?
|     # String-distance
|     Ds0n0 = min((n0 - s0, 5)) if s0 != 0 else 0
|
|     features['bias'] = 1
|     # Add word and tag unigrams
|     for w in (Wn0, Wn1, Wn2, Ws0, Ws1, Ws2, Wn0b1, Wn0b2, Ws0b1, Ws0b2, Ws0f1, Ws0f2):
|         if w:
|             features['w=%s' % w] = 1
|     for t in (Tn0, Tn1, Tn2, Ts0, Ts1, Ts2, Tn0b1, Tn0b2, Ts0b1, Ts0b2, Ts0f1, Ts0f2):
|         if t:
|             features['t=%s' % t] = 1
|
|     # Add word/tag pairs
|     for i, (w, t) in enumerate(((Wn0, Tn0), (Wn1, Tn1), (Wn2, Tn2), (Ws0, Ts0))):
|         if w or t:
|             features['%d w=%s, t=%s' % (i, w, t)] = 1
|
|     # Add some bigrams
|     features['s0w=%s, n0w=%s' % (Ws0, Wn0)] = 1
|     features['wn0tn0-ws0 %s/%s %s' % (Wn0, Tn0, Ws0)] = 1
|     features['wn0tn0-ts0 %s/%s %s' % (Wn0, Tn0, Ts0)] = 1
|     features['ws0ts0-wn0 %s/%s %s' % (Ws0, Ts0, Wn0)] = 1
|     features['ws0-ts0 tn0 %s/%s %s' % (Ws0, Ts0, Tn0)] = 1
|     features['wt-wt %s/%s %s/%s' % (Ws0, Ts0, Wn0, Tn0)] = 1
|     features['tt s0=%s n0=%s' % (Ts0, Tn0)] = 1
|     features['tt n0=%s n1=%s' % (Tn0, Tn1)] = 1
|
|     # Add some tag trigrams
|     trigrams = ((Tn0, Tn1, Tn2), (Ts0, Tn0, Tn1), (Ts0, Ts1, Tn0),
|                 (Ts0, Ts0f1, Tn0), (Ts0, Ts0f1, Tn0), (Ts0, Tn0, Tn0b1),
|                 (Ts0, Ts0b1, Ts0b2), (Ts0, Ts0f1, Ts0f2), (Tn0, Tn0b1, Tn0b2),
|                 (Ts0, Ts1, Ts1))
|     for i, (t1, t2, t3) in enumerate(trigrams):
|         if t1 or t2 or t3:
|             features['ttt-%d %s %s %s' % (i, t1, t2, t3)] = 1
|
|     # Add some valency and distance features
|     vw = ((Ws0, Vs0f), (Ws0, Vs0b), (Wn0, Vn0b))
|     vt = ((Ts0, Vs0f), (Ts0, Vs0b), (Tn0, Vn0b))
|     d = ((Ws0, Ds0n0), (Wn0, Ds0n0), (Ts0, Ds0n0), (Tn0, Ds0n0),
|          ('t' + Tn0+Ts0, Ds0n0), ('w' + Wn0+Ws0, Ds0n0))
|     for i, (w_t, v_d) in enumerate(vw + vt + d):
|         if w_t or v_d:
|             features['val/d-%d %s %d' % (i, w_t, v_d)] = 1
|     return features
h3 Training
p.
Weights are learned using the same algorithm, averaged perceptron, that
we used for part-of-speech tagging. Its key strength is that it’s an
online learning algorithm: examples stream in one-by-one, we make our
prediction, check the actual answer, and adjust our beliefs (weights)
if we were wrong.
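p.
The “adjust our beliefs if we were wrong” step can be sketched as
below. This is a plain perceptron update, deliberately leaving out the
weight averaging that the real learner performs, and the function
signature is an assumption made for the example.

```python
def update(weights, truth, guess, features):
    """Plain (unaveraged) perceptron update.

    Returns True if the weights were changed.
    """
    if truth == guess:
        return False
    for feat, value in features.items():
        class_weights = weights.setdefault(feat, {})
        # Reward the correct class and penalise the wrongly predicted one.
        class_weights[truth] = class_weights.get(truth, 0.0) + value
        class_weights[guess] = class_weights.get(guess, 0.0) - value
    return True


weights = {}
features = {'bias': 1, 't=DT': 1}
changed = update(weights, 'LEFT', 'SHIFT', features)
unchanged = update(weights, 'LEFT', 'LEFT', features)
print(weights)
```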
p The training loop looks like this:
pre.language-python
code
| class Parser(object):
|     ...
|     def train_one(self, itn, words, gold_tags, gold_heads):
|         n = len(words)
|         i = 2; stack = [1]; parse = Parse(n)
|         tags = self.tagger.tag(words)
|         while stack or (i + 1) < n:
|             features = extract_features(words, tags, i, n, stack, parse)
|             scores = self.model.score(features)
|             valid_moves = get_valid_moves(i, n, len(stack))
|             guess = max(valid_moves, key=lambda move: scores[move])
|             gold_moves = get_gold_moves(i, n, stack, parse.heads, gold_heads)
|             best = max(gold_moves, key=lambda move: scores[move])
|             self.model.update(best, guess, features)
|             i = transition(guess, i, stack, parse)
|         # Return number correct
|         return len([i for i in range(n-1) if parse.heads[i] == gold_heads[i]])
p
| The most interesting part of the training process is in
code.language-python get_gold_moves
| . The performance of our parser is made possible by an advance by Goldberg
| and Nivre (2012), who showed that we’d been doing this wrong for years.
p
| In the POS-tagging post, I cautioned that during training you need to
| make sure you pass in the last two
em predicted
| tags as features for the current tag, not the last two
em gold
| tags. At test time you’ll only have the predicted tags, so if you
| base your features on the gold sequence during training, your training
| contexts won’t resemble your test-time contexts, so you’ll learn the
| wrong weights.
p
| In parsing, the problem was that we didn’t know
em how
| to pass in the predicted sequence! Training worked by taking the
| gold-standard tree, and finding a transition sequence that led to it.
| That is, you got back a sequence of moves, with the guarantee that if
| you followed those moves, you’d get the gold-standard dependencies.
p
| The problem is, we didn’t know how to define the “correct” move to
| teach a parser to make if it was in any state that
em wasn’t
| along that gold-standard sequence. Once the parser had made a mistake,
| we didn’t know how to train from that example.
p
| That was a big problem, because it meant that once the parser started
| making mistakes, it would end up in states unlike any in its training
| data &ndash; leading to yet more mistakes. The problem was specific
| to greedy parsers: once you use a beam, there’s a natural way to do
| structured prediction.
p
| The solution seems obvious once you know it, like all the best breakthroughs.
| What we do is define a function that asks “How many gold-standard
| dependencies can be recovered from this state?”. If you can define
| that function, then you can apply each move in turn, and ask, “How
| many gold-standard dependencies can be recovered from
em this
| state?”. If the action you applied allows
em fewer
| gold-standard dependencies to be reached, then it is sub-optimal.
p That’s a lot to take in.
p
| So we have this function
code.language-python Oracle(state)
| :
pre
code
| Oracle(state) = | gold_arcs ∩ reachable_arcs(state) |
p
| We also have a set of actions, each of which returns a new state.
| We want to know:
ul
li shift_cost = Oracle(state) - Oracle(shift(state))
li right_cost = Oracle(state) - Oracle(right(state))
li left_cost = Oracle(state) - Oracle(left(state))
p
| Now, at least one of those costs
em has
| to be zero. Oracle(state) is asking, “what’s the cost of the best
| path forward?”, and the first action of that best path has to be
| shift, right, or left.
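p.
With made-up numbers, the cost of each move is the drop in the number
of gold-standard arcs that are still reachable. The helper name
move_costs and the values are hypothetical; a real oracle would compute
them from the parser state.

```python
def move_costs(oracle_now, oracle_after):
    """Cost of each move = gold arcs reachable now - reachable afterwards.

    oracle_after maps a move name to Oracle(move(state)).
    """
    return dict((move, oracle_now - after)
                for move, after in oracle_after.items())


# Hypothetical state: 5 gold arcs are still reachable. Shifting keeps
# all 5 reachable, while right and left would lose 2 and 1 of them.
costs = move_costs(5, {'shift': 5, 'right': 3, 'left': 4})
zero_cost = sorted(move for move, cost in costs.items() if cost == 0)
print(costs)
print(zero_cost)
```

p.
The zero-cost moves are the ones the training loop treats as correct.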
p
| It turns out that we can derive Oracle fairly simply for many transition
| systems. The derivation for the transition system we’re using, Arc
| Hybrid, is in Goldberg and Nivre (2013).
p
| We’re going to implement the oracle as a function that returns the
| zero-cost moves, rather than implementing a function Oracle(state).
| This prevents us from doing a bunch of costly copy operations.
| Hopefully the reasoning in the code isn’t too hard to follow, but
| you can also consult Goldberg and Nivre’s papers if you’re confused
| and want to get to the bottom of this.
pre.language-python
code
| def get_gold_moves(n0, n, stack, heads, gold):
|     def deps_between(target, others, gold):
|         for word in others:
|             if gold[word] == target or gold[target] == word:
|                 return True
|         return False
|
|     valid = get_valid_moves(n0, n, len(stack))
|     if not stack or (SHIFT in valid and gold[n0] == stack[-1]):
|         return [SHIFT]
|     if gold[stack[-1]] == n0:
|         return [LEFT]
|     costly = set([m for m in MOVES if m not in valid])
|     # If the word behind s0 is its gold head, Left is incorrect
|     if len(stack) >= 2 and gold[stack[-1]] == stack[-2]:
|         costly.add(LEFT)
|     # If there are any dependencies between n0 and the stack,
|     # pushing n0 will lose them.
|     if SHIFT not in costly and deps_between(n0, stack, gold):
|         costly.add(SHIFT)
|     # If there are any dependencies between s0 and the buffer, popping
|     # s0 will lose them.
|     if deps_between(stack[-1], range(n0+1, n-1), gold):
|         costly.add(LEFT)
|         costly.add(RIGHT)
|     return [m for m in MOVES if m not in costly]
p
| Doing this “dynamic oracle” training procedure makes a big difference
| to accuracy — typically 1-2%, with no difference to the way the run-time
| works. The old “static oracle” greedy training procedure is fully
| obsolete; there’s no reason to do it that way any more.
h3 Conclusion
p
| I have the sense that language technologies, particularly those relating
| to grammar, are particularly mysterious. I can imagine having no idea
| what the program might even do.
p
| I think it therefore seems natural to people that the best solutions
| would be overwhelmingly complicated. A 200,000 line Java package
| feels appropriate.
p
| But, algorithmic code is usually short, when only a single algorithm
| is implemented. And when you only implement one algorithm, and you
| know exactly what you want to write before you write a line, you
| also don’t pay for any unnecessary abstractions, which can have a
| big performance impact.
h3 Notes
p
a(name='note-1')
| [1] I wasn’t really sure how to count the lines of code in the Stanford
| parser. Its jar file ships over 200k, but there are a lot of different
| models in it. It’s not important, but over 50k seems safe.
p
a(name='note-2')
| [2] For instance, how would you parse, “John’s school of music calls”?
| You want to make sure the phrase “John’s school” has a consistent
| structure in both “John’s school calls” and “John’s school of music
| calls”. Reasoning about the different “slots” you can put a phrase
| into is a key way we reason about what syntactic analyses look like.
| You can think of each phrase as having a different shaped connector,
| which you need to plug into different slots — which each phrase also
| has a certain number of, each of a different shape. We’re trying to
| figure out what connectors are where, so we can figure out how the
| sentences are put together.
h3 Idle speculation
p
| For a long time, incremental language processing algorithms were
| primarily of scientific interest. If you want to write a parser to
| test a theory about how the human sentence processor might work, well,
| that parser needs to build partial interpretations. There’s a wealth
| of evidence, including commonsense introspection, that establishes
| that we don’t buffer input and analyse it once the speaker has finished.
p
| But now algorithms with that neat scientific feature are winning!
| As best as I can tell, the secret to that success is to be:
ul
li Incremental. Earlier words constrain the search.
li
| Error-driven. Training involves a working hypothesis, which is
| updated as it makes mistakes.
p
| The links to human sentence processing seem tantalising. I look
| forward to seeing whether these engineering breakthroughs lead to
| any psycholinguistic advances.
h3 Bibliography
p
| The NLP literature is almost entirely open access. All of the relevant
| papers can be found
a(href=urls.acl_anthology, rel='nofollow') here
| .
p
| The parser I’ve described is an implementation of the dynamic-oracle
| Arc-Hybrid system here:
span.bib-item
| Goldberg, Yoav; Nivre, Joakim.
em Training Deterministic Parsers with Non-Deterministic Oracles
| . TACL 2013
p
| However, I wrote my own features for it. The arc-hybrid system was
| originally described here:
span.bib-item
| Kuhlmann, Marco; Gomez-Rodriguez, Carlos; Satta, Giorgio. Dynamic
| programming algorithms for transition-based dependency parsers. ACL 2011
p
| The dynamic oracle training method was first described here:
span.bib-item
| Goldberg, Yoav; Nivre, Joakim. A Dynamic Oracle for Arc-Eager
| Dependency Parsing. COLING 2012
p
| This work depended on a big break-through in accuracy for transition-based
| parsers, when beam-search was properly explored by Zhang and Clark.
| They have several papers, but the preferred citation is:
span.bib-item
| Zhang, Yue; Clark, Stephen. Syntactic Processing Using the Generalized
| Perceptron and Beam Search. Computational Linguistics 2011 (1)
p
| Another important paper was this little feature engineering paper,
| which further improved the accuracy:
span.bib-item
| Zhang, Yue; Nivre, Joakim. Transition-based Dependency Parsing with
| Rich Non-local Features. ACL 2011
p
| The generalised perceptron, which is the learning framework for these
| beam parsers, is from this paper:
span.bib-item
| Collins, Michael. Discriminative Training Methods for Hidden Markov
| Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002
h3 Experimental details
p
| The results at the start of the post refer to Section 22 of the Wall
| Street Journal corpus. The Stanford parser was run as follows:
pre.language-bash
code
| java -mx10000m -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
| -outputFormat "penn" edu/stanford/nlp/models/lexparser/englishFactored.ser.gz $*
p
| A small post-process was applied, to undo the fancy tokenisation
| Stanford adds for numbers, to make them match the PTB tokenisation:
pre.language-python
code
| """Stanford parser retokenises numbers. Split them."""
| import sys
| import re
|
| qp_re = re.compile('\xc2\xa0')
| for line in sys.stdin:
|     line = line.rstrip()
|     if qp_re.search(line):
|         line = line.replace('(CD', '(QP (CD', 1) + ')'
|         line = line.replace('\xc2\xa0', ') (CD ')
|     print line
p
| The resulting PTB-format files were then converted into dependencies
| using the Stanford converter:
pre.language-bash
code
| ./scripts/train.py -x zhang+stack -k 8 -p ~/data/stanford/train.conll ~/data/parsers/tmp
| ./scripts/parse.py ~/data/parsers/tmp ~/data/stanford/devi.txt /tmp/parse/
| ./scripts/evaluate.py /tmp/parse/parses ~/data/stanford/dev.conll
p
| I can’t easily read that anymore, but it should just convert every
| .mrg file in a folder to a CoNLL-format Stanford basic dependencies
| file, using the settings common in the dependency literature.
p
| I then converted the gold-standard trees from WSJ 22, for the evaluation.
| Accuracy scores refer to unlabelled attachment score (i.e. the head index)
| of all non-punctuation tokens.
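p.
For reference, unlabelled attachment score can be computed along these
lines. This is a sketch, not the evaluation script used for the numbers
above; in particular, treating a token as punctuation when it consists
only of ASCII punctuation characters is an assumption made for the
example.

```python
import string


def unlabelled_attachment_score(words, gold_heads, guess_heads):
    """Fraction of non-punctuation tokens whose predicted head is correct."""
    correct = total = 0
    for word, gold, guess in zip(words, gold_heads, guess_heads):
        # Assumption: punctuation = only ASCII punctuation characters.
        if all(ch in string.punctuation for ch in word):
            continue
        total += 1
        correct += int(gold == guess)
    return float(correct) / total if total else 0.0


words = ['They', 'ate', 'pizza', '.']
gold = [1, 4, 1, 1]
guess = [1, 4, 0, 2]     # wrong head for 'pizza'; '.' is ignored
uas = unlabelled_attachment_score(words, gold, guess)
print('%.3f' % uas)
```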
p
| To train parser.py, I fed the gold-standard PTB trees for WSJ 02-21
| into the same conversion script.
p
| In a nutshell: The Stanford model and parser.py are trained on the
| same set of sentences, and they each make their predictions on a
| held-out test set, for which we know the answers. Accuracy refers
| to how many of the words’ heads we got correct.
p
| Speeds were measured on a 2.4GHz Xeon. I ran the experiments on a
| server, to give the Stanford parser more memory. The parser.py system
| runs fine on my MacBook Air. I used PyPy for the parser.py experiments;
| CPython was about half as fast on an early benchmark.
p
| One of the reasons parser.py is so fast is that it does unlabelled
| parsing. Based on previous experiments, a labelled parser would likely
| be about 40x slower, and about 1% more accurate. Adapting the program
| to labelled parsing would be a good exercise for the reader, if you
| have access to the data.
p
| The result from the Redshift parser was produced from commit
code.language-python b6b624c9900f3bf
| , which was run as follows:
pre.language-python.
footer.meta(role='contentinfo')
a.button.button-twitter(href=urls.share_twitter, title='Share on Twitter', rel='nofollow') Share on Twitter
.discuss
a.button.button-hn(href='#', title='Discuss on Hacker News', rel='nofollow') Discuss on Hacker News
a.button.button-reddit(href='#', title='Discuss on Reddit', rel='nofollow') Discuss on Reddit
footer(role='contentinfo')
script(src='js/prism.js')