This procedure splits off tokens from the start and end of the string, at each
point checking whether the remaining string is in our special-cases table. If
it is, we stop splitting, and return the tokenization at that point.
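The loop below is a minimal Python sketch of that procedure, not spaCy's
actual (Cython) implementation; ``find_prefix``, ``find_suffix`` and
``special_cases`` stand in for the real rule tables::

    def tokenize(string, special_cases, find_prefix, find_suffix):
        # Split off prefix and suffix tokens, checking at each step whether
        # the remaining string has a special-case analysis.
        prefixes = []
        suffixes = []
        while string:
            if string in special_cases:
                # Stop splitting and substitute the special-case tokenization.
                return prefixes + special_cases[string] + list(reversed(suffixes))
            pre_len = find_prefix(string)
            if pre_len:
                prefixes.append(string[:pre_len])
                string = string[pre_len:]
                continue
            suf_len = find_suffix(string)
            if suf_len:
                suffixes.append(string[-suf_len:])
                string = string[:-suf_len]
                continue
            break
        return prefixes + ([string] if string else []) + list(reversed(suffixes))
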
The advantage of this design is that the prefixes, suffixes and special cases
can be declared separately, in easy-to-understand files. If a new entry is
added to the special cases, you can be sure that it won't have some unforeseen
consequence for a complicated regular-expression grammar.
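For illustration, the declarative data might look something like this (the
names and format here are hypothetical, not spaCy's actual rule files)::

    # Hypothetical rule declarations; spaCy's real files differ in format and scope.
    TOKENIZER_PREFIXES = ["(", '"', "'"]
    TOKENIZER_SUFFIXES = [")", '"', "'", "%"]
    SPECIAL_CASES = {
        "don't": ["do", "n't"],
        "U.S.": ["U.S."],   # adding an entry here can't break unrelated rules
    }
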
Coupling the Tokenizer and Lexicon
##################################
As mentioned above, the tokenizer is designed to support easy caching. If all
we were caching were the matched substrings, this would not be so advantageous.
Instead, what we do is create a struct which houses all of our lexical
features, and cache *that*. The tokens are then simply pointers to these rich
lexical types.
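A rough Python sketch of the idea (the real lexeme type lives in Cython and
carries many more features than shown here)::

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Lexeme:
        # A small, hypothetical subset of the features cached per vocabulary entry.
        orth: str
        lower: str
        is_alpha: bool
        cluster: int = 0      # e.g. a Brown cluster ID

    class Vocab:
        def __init__(self):
            self._cache = {}

        def get(self, string):
            # Compute the lexical features at most once per distinct string;
            # every token of that string then just points at the cached Lexeme.
            lex = self._cache.get(string)
            if lex is None:
                lex = Lexeme(orth=string, lower=string.lower(),
                             is_alpha=string.isalpha())
                self._cache[string] = lex
            return lex
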
In a sample of text, vocabulary size grows far more slowly than the word
count: the longer the text, the more often each distinct word has already been
seen. So any computation we can perform once per vocabulary entry, rather than
once per token, is very efficient.
Part-of-speech Tagger
---------------------
.. _how to write a good part of speech tagger: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/
In 2013, I wrote a blog post describing `how to write a good part of speech
tagger`_.
My recommendation then was to use greedy decoding with the averaged perceptron.
I think this is still the best approach, so it's what I implemented in spaCy.
The tutorial also recommends the use of Brown cluster features and case
normalization features, as these make the model more robust and
domain-independent. spaCy's tagger makes heavy use of these features.
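As a rough sketch (not spaCy's actual code), greedy decoding with a linear
model looks like the loop below; ``extract_features`` is a placeholder for the
feature templates, which would include the cluster and case-normalization
features mentioned above::

    def greedy_tag(words, weights, classes, extract_features):
        # Tag left to right, committing to the best-scoring tag at each
        # position and feeding it back in as a feature for the next decision.
        tags = []
        prev, prev2 = "-START-", "-START2-"
        for i, word in enumerate(words):
            features = extract_features(i, word, words, prev, prev2)
            scores = dict.fromkeys(classes, 0.0)
            for feat in features:
                for clas, weight in weights.get(feat, {}).items():
                    if clas in scores:
                        scores[clas] += weight
            tag = max(classes, key=lambda c: scores[c])
            tags.append(tag)
            prev2, prev = prev, tag
        return tags
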
Dependency Parser
-----------------
.. _2014 blog post: https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/
The parser uses the algorithm described in my `2014 blog post`_.
This algorithm, shift-reduce dependency parsing, is becoming widely adopted due
to its compelling speed/accuracy trade-off.
Some quick details about spaCy's take on this, for those who happen to know
these models well. I'll write up a better description shortly.
1. I use greedy decoding, not beam search;
2. I use the arc-eager transition system (sketched after this list);
3. I use the Goldberg and Nivre (2012) dynamic oracle;
4. I use the non-monotonic update from my CoNLL 2013 paper (Honnibal, Goldberg
   and Johnson 2013).
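For readers less familiar with the arc-eager system, here is a bare-bones
sketch of its four transitions over a (stack, buffer, arcs) state; the real
transition system handles labels, preconditions and other details omitted
here::

    def shift(stack, buffer, arcs):
        # Move the first word of the buffer onto the stack.
        stack.append(buffer.pop(0))

    def reduce_(stack, buffer, arcs):
        # Pop the stack; only legal once the popped word already has a head.
        stack.pop()

    def left_arc(stack, buffer, arcs, label="dep"):
        # Make the stack top a dependent of the first buffer word, then pop it.
        arcs.append((buffer[0], label, stack.pop()))

    def right_arc(stack, buffer, arcs, label="dep"):
        # Make the first buffer word a dependent of the stack top, then shift it.
        arcs.append((stack[-1], label, buffer[0]))
        stack.append(buffer.pop(0))
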
So far, this is exactly the configuration from the CoNLL 2013 paper, which
scored 91.0. So how have I gotten it to 92.4? The following tweaks:
1. I use Brown cluster features --- these help a lot;
2. I redesigned the feature set. I've long known that the Zhang and Nivre
   (2011) feature set was suboptimal, but a few features don't make a very
   compelling publication. Still, they're important.
3. When I do the dynamic oracle training, I also make the update
   cost-sensitive: if the oracle determines that the move the parser took has
   a cost of N, then the weights for the gold class are incremented by N, and
   the weights for the predicted class are decremented by N (see the sketch
   below). This only made a small (0.1-0.2%) difference.
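A minimal sketch of that cost-sensitive update, with averaged-perceptron
bookkeeping and feature extraction omitted (the names here are illustrative,
not spaCy's)::

    def cost_sensitive_update(weights, features, guess, gold, cost):
        # Scale the perceptron update by the cost the dynamic oracle assigns
        # to the predicted transition.
        if guess == gold or cost == 0:
            return
        for feat in features:
            feat_weights = weights.setdefault(feat, {})
            feat_weights[gold] = feat_weights.get(gold, 0.0) + cost
            feat_weights[guess] = feat_weights.get(guess, 0.0) - cost
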
Implementation
##############
I don't do anything algorithmically novel to improve the efficiency of the
parser. However, I was very careful in the implementation.
A greedy shift-reduce parser with a linear model boils down to the following