| Rightmost and second rightmost children of #[code S0], #[code S1],
| #[code S2], #[code B0] and #[code B1].
p
| This makes the state vector quite long: #[code 13*T], where #[code T] is
| the token vector width (128 is working well). Fortunately, there's a way
| to structure the computation to save some expense (and make it more
| GPU-friendly).
p
| The parser typically visits #[code 2*N] states for a sentence of length
| #[code N] (although it may visit more, if it back-tracks with a
| non-monotonic transition#[+fn(4)]). A naive implementation would require
| #[code 2*N (B, 13*T) @ (13*T, H)] matrix multiplications for a batch of
| size #[code B]. We can instead perform one #[code (B*N, T) @ (T, 13*H)]
| multiplication, to pre-compute the hidden weights for each positional
| feature with respect to the words in the batch. (Note that our token
| vectors come from the CNN — so we can't play this trick over the
| vocabulary. That's how Stanford's NN parser#[+fn(3)] works — and why its
| model is so big.)
p
| This pre-computation strategy allows a nice compromise between
| GPU-friendliness and implementation simplicity. The CNN and the wide
| lower layer are computed on the GPU, and then the precomputed hidden
| weights are moved to the CPU, before we start the transition-based
| parsing process. This makes a lot of things much easier. We don't have to
| worry about variable-length batch sizes, and we don't have to implement
| the dynamic oracle in CUDA to train.
p
| Currently the parser's loss function is multilabel log loss#[+fn(6)], as
| the dynamic oracle allows multiple states to be 0 cost. This is defined
| as follows, where #[code gZ] is the sum of the scores assigned to gold
| classes:
+code.
(exp(score) / Z) - (exp(score) / gZ)
+bibliography
+item
| #[+a("https://www.semanticscholar.org/paper/Simple-and-Accurate-Dependency-Parsing-Using-Bidir-Kiperwasser-Goldberg/3cf31ecb2724b5088783d7c96a5fc0d5604cbf41") Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations]
br
| Eliyahu Kiperwasser, Yoav Goldberg. (2016)
+item
| #[+a("https://www.semanticscholar.org/paper/A-Dynamic-Oracle-for-Arc-Eager-Dependency-Parsing-Goldberg-Nivre/22697256ec19ecc3e14fcfc63624a44cf9c22df4") A Dynamic Oracle for Arc-Eager Dependency Parsing]
br
| Yoav Goldberg, Joakim Nivre (2012)
+item
| #[+a("https://explosion.ai/blog/parsing-english-in-python") Parsing English in 500 Lines of Python]
br
| Matthew Honnibal (2013)
+item
| #[+a("https://www.semanticscholar.org/paper/Stack-propagation-Improved-Representation-Learning-Zhang-Weiss/0c133f79b23e8c680891d2e49a66f0e3d37f1466") Stack-propagation: Improved Representation Learning for Syntax]
br
| Yuan Zhang, David Weiss (2016)
+item
| #[+a("https://www.semanticscholar.org/paper/Deep-multi-task-learning-with-low-level-tasks-supe-S%C3%B8gaard-Goldberg/03ad06583c9721855ccd82c3d969a01360218d86") Deep multi-task learning with low level tasks supervised at lower layers]
br
| Anders Søgaard, Yoav Goldberg (2016)
+item
| #[+a("https://www.semanticscholar.org/paper/An-Improved-Non-monotonic-Transition-System-for-De-Honnibal-Johnson/4094cee47ade13b77b5ab4d2e6cb9dd2b8a2917c") An Improved Non-monotonic Transition System for Dependency Parsing]
br
| Matthew Honnibal, Mark Johnson (2015)
+item
| #[+a("http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf") A Fast and Accurate Dependency Parser using Neural Networks]
br
| Danqi Cheng, Christopher D. Manning (2014)
+item
| #[+a("https://www.semanticscholar.org/paper/Parsing-the-Wall-Street-Journal-using-a-Lexical-Fu-Riezler-King/0ad07862a91cd59b7eb5de38267e47725a62b8b2") Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques]