It seems very appealing to restrict Break so that it only works when
there's one word on the stack. Then we can pop that word, mark it as the
root, and continue.
However, results suggest it's better to be able to predict Break when the
last word of the previous sentence is on the stack and the first word of
the next sentence is at the front of the buffer. This does make sense!
The last word is often a period or similar --- a strong clue that we'd
otherwise have to go out of our way to get in as a feature.
The really decisive thing is that we have to handle upcoming sentence
breaks anyway, because we need to conform to preset SBD constraints. So we
may as well let the parser predict Break when it's at the stack/buffer
position that is most revealing.
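To make that configuration concrete, here's a tiny sketch (illustrative
only, not spaCy's actual validity check) of when Break would be allowed;
sent_starts stands in for the preset SBD constraints:

    def break_is_valid(stack, buffer, sent_starts):
        # Break is most revealing when the previous sentence's last word is
        # on the stack and the next sentence's first word heads the buffer.
        if not stack or not buffer:
            return False
        # The front-of-buffer word must begin a new sentence according to
        # the preset sentence-boundary (SBD) constraints.
        return buffer[0] in sent_starts

    # "It rains . It pours", with a preset boundary before token 3:
    print(break_is_valid(stack=[2], buffer=[3, 4], sent_starts={0, 3}))  # True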
This patch starts getting the StateC object ready to split tokens. The
split function works by pushing indices onto the buffer that refer to
out-of-length tokens, i.e. positions past the original document length.
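Roughly, the idea looks like this (a minimal sketch with made-up names,
not the actual StateC code): the new subtokens get indices past the
original document length, and those indices are pushed onto the buffer.

    class ParseState:
        # Toy stand-in for StateC: `buffer` holds token indices still to
        # be processed, and `length` counts the tokens seen so far.
        def __init__(self, n_tokens):
            self.length = n_tokens
            self.buffer = list(range(n_tokens))

        def split(self, i, n_pieces):
            # Represent a split by creating out-of-length indices for the
            # extra pieces and queuing them right after token i.
            extra = [self.length + k for k in range(n_pieces - 1)]
            self.length += len(extra)
            pos = self.buffer.index(i) + 1
            self.buffer[pos:pos] = extra
            return extra

    state = ParseState(3)
    state.split(1, 2)        # split token 1 into two pieces
    print(state.buffer)      # [0, 1, 3, 2] -- index 3 is past the original length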
Still to do:
* Update the oracles
* Update GoldParseC
* Interpret the parse once it's complete
* Add retokenizer.split() method
If a Doc object had been previously parsed, it was possible for
invalid parses to be added. There were two problems:
1) The parse was only being partially erased.
2) The RightArc action was able to create a 1-cycle.
This patch fixes both errors, and avoids resetting the parse if one is
present. In theory this might allow a better parse to be predicted by
running the parser twice.
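For illustration, here's what the two problems look like on a toy heads
array, where heads[i] is the index of token i's head and the root points
at itself (a sketch, not the actual patch):

    def fully_erase_parse(heads):
        # Problem 1: the old parse must be erased completely, not partially.
        return [None] * len(heads)

    def has_one_cycle(heads, root):
        # Problem 2: a 1-cycle means a non-root token is its own head.
        return any(h == i and i != root for i, h in enumerate(heads))

    heads = [1, 1, 1, 2]                        # token 1 is the root
    print(has_one_cycle(heads, root=1))         # False
    print(has_one_cycle([1, 1, 2, 2], root=1))  # True: token 2 heads itself
    print(fully_erase_parse(heads))             # [None, None, None, None]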
Closes #1253.
Currently, when a new label is introduced to NER during training, the
labels end up being read back in an unexpected order. This invalidates
the model.
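To see why the order matters, consider a toy label-to-index mapping (a
sketch, not spaCy's NER internals): if class weights are indexed by label
position, reading the labels back in a different order misaligns every
weight row.

    def label_index(labels):
        return {label: i for i, label in enumerate(labels)}

    trained = ["PERSON", "ORG", "GPE"]           # order used during training
    reloaded = ["GPE", "ORG", "PERSON", "LOC"]   # new label shifts the order

    print(label_index(trained))    # {'PERSON': 0, 'ORG': 1, 'GPE': 2}
    print(label_index(reloaded))   # {'GPE': 0, 'ORG': 1, 'PERSON': 2, 'LOC': 3}
    # PERSON's weights, learned at index 0, would now be looked up at index 2.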
Previously the deprojectivize() call was attached to the transition
system and only invoked for German. Instead it should be a separate step,
called after the parser. This makes it available for any language.
Closes #898.
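The intended pipeline shape is roughly the following, with a hypothetical
deprojectivize(doc) helper standing in for the real post-processing step:

    def deprojectivize(doc):
        # Placeholder: recover non-projective arcs from projectivized labels.
        return doc

    def parse(doc, parser):
        doc = parser(doc)          # language-independent parsing step
        doc = deprojectivize(doc)  # separate post-process, no longer German-only
        return doc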