.. note:: tl;dr: I shipped the wrong parsing model with 0.3. That model expected input to be segmented into sentences. 0.4 ships the correct model, which uses some algorithmic tricks to minimize the impact of tokenization and sentence segmentation errors on the parser.
Most English parsing research is performed on text with perfect pre-processing:
one newline between every sentence, one space between every token.
It's always been done this way, and it's good. It's a useful idealisation,
because the pre-processing has few algorithmic implications.
But, for practical performance, this stuff can matter a lot.
Dridan and Oepen (2013) did a simple but rare thing: they actually ran a few
parsers on raw text. Even on the standard Wall Street Journal corpus,
where pre-processing tools are quite good, the quality of pre-processing
made a big difference:
+-------------+-------+----------+
| Preprocess  | BLLIP | Berkeley |
+=============+=======+==========+
| Gold        | 90.9  | 89.8     |
+-------------+-------+----------+
| Default     | 86.4  | 88.4     |
+-------------+-------+----------+
| Corrected   | 89.9  | 88.8     |
+-------------+-------+----------+
.. note:: spaCy is evaluated on unlabelled dependencies, whereas the accuracy figures above refer to phrase-structure trees. The two sets of accuracies are not comparable.
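For readers unfamiliar with the dependency metric: unlabelled attachment
score is just the fraction of tokens that receive the correct head. A
minimal sketch (the function and the data here are illustrative, not part
of spaCy):

.. code:: python

    def unlabelled_attachment_score(gold_heads, pred_heads):
        """Fraction of tokens whose predicted head index matches the gold head."""
        assert len(gold_heads) == len(pred_heads)
        correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
        return correct / len(gold_heads)

    # Heads are given as token indices; -1 marks the root.
    gold = [1, -1, 3, 1]
    pred = [1, -1, 1, 1]
    print(unlabelled_attachment_score(gold, pred))  # 0.75
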
In the standard experimental condition --- gold pre-processing --- the
BLLIP parser is better. But it turns out that it ships with lousy
pre-processing tools: when you evaluate the parsers on raw text, the BLLIP
parser falls way behind. To verify that this was due to the quality of the
pre-processing tools, and not some particular algorithmic sensitivity,
Dridan and Oepen ran both parsers with their own high-quality tokenizer and
sentence segmenter. This confirmed that with equal pre-processing, the
BLLIP parser is better.
The Dridan and Oepen paper really convinced me to take pre-processing seriously
in spaCy. In fact, spaCy started life as just a tokenizer --- hence the name.
The spaCy parser has a special trick up its sleeve. Because both the tagger
and parser run in linear time, it doesn't require that the input be divided
into sentences: parsing a whole paragraph costs no more than parsing its
sentences one by one, whereas a chart parser's super-linear runtime forces
it to work on short units. This is nice because it avoids error-cascades:
if you segment first, then the parser just has to live with whatever
decisions the segmenter made.
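To make the cascade concrete, here is a toy illustration. The naive
segmenter below is hypothetical --- spaCy does not work this way --- but it
shows how, once the segmenter splits on abbreviation full stops, a pipeline
parser never even sees the true sentences:

.. code:: python

    import re

    def naive_segment(text):
        # Hypothetical rule: a sentence ends at every full stop.
        return [s for s in re.split(r'(?<=\.)\s+', text) if s]

    text = "Mr. Smith visited the U.S. office. He left on Tuesday."
    print(naive_segment(text))
    # ['Mr.', 'Smith visited the U.S.', 'office.', 'He left on Tuesday.']
    # A parser running after this segmenter is stuck with these fragments;
    # parsing the raw text directly avoids committing to them.
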
But even though I designed the system with this consideration in mind,
I decided to present the initial results using the standard methodology,
with gold-standard inputs. And then I made a mistake.
Unfortunately, with all the other things I was doing before launch, I forgot
all about this problem. spaCy launched with a parsing model that expected the
input to be segmented into sentences, but with no sentence segmenter. This
caused a drop in parse accuracy of 4%!
Over the last five days, I've worked hard to correct this. I implemented the
modifications to the parsing algorithm I had planned, from Dongdong Zhang et al.
(2013), and trained and evaluated the parser on raw text, using the version of
the WSJ distributed by Read et al. (2012) and used in Dridan and Oepen's
experiments.
I'm pleased to say that on the WSJ at least, spaCy 0.4 performs almost exactly
as well on raw text as text with gold-standard tokenization and sentence
boundary detection.
I still need to evaluate this on web text, and I need to compare against the
Stanford CoreNLP and other parsers. I suspect that most other parsers will