8571 Commits

Author SHA1 Message Date
Matthew Honnibal
f7beefe9c1 Update oracle tests for Split 2018-04-03 15:44:58 +02:00
Matthew Honnibal
e31ef9c7f6 Add some property vars for testing 2018-04-03 15:44:31 +02:00
Matthew Honnibal
4029dc2cc7 Fix feature-flagging of Split action 2018-04-03 15:43:50 +02:00
Matthew Honnibal
6cc79fc244 Fix state array length for split 2018-04-03 15:43:23 +02:00
Matthew Honnibal
7ff4d8967f Add file for compile-time flags in parser 2018-04-03 15:42:51 +02:00
Matthew Honnibal
9d7409ce71 Hack at Alignment.flatten for split tokens 2018-04-03 02:32:31 +02:00
Matthew Honnibal
e9d1e6d66b Fix head alignment for split tokens 2018-04-03 02:32:09 +02:00
Matthew Honnibal
6aded3d855 Handle complex tags in ud-train 2018-04-03 01:57:37 +02:00
Matthew Honnibal
6d42e0ad8e Fix handling of tuple-valued tags in Tagger 2018-04-03 01:55:06 +02:00
Matthew Honnibal
9c5c940441 Fix head alignment in GoldParse 2018-04-03 01:54:45 +02:00
Matthew Honnibal
88de8fe323 Fix pre-processing of more complicated heads in ArcEager 2018-04-03 01:54:21 +02:00
Matthew Honnibal
e6bacc26cb Update tests 2018-04-03 00:55:16 +02:00
Matthew Honnibal
06a5be9dfd Fix handling of heads for undersegmented tokens 2018-04-03 00:55:05 +02:00
Matthew Honnibal
aa5ecf7fd2 Update ArcEager for changes to GoldParse class 2018-04-02 23:53:13 +02:00
Matthew Honnibal
c9d314b7ba Bug fixes for alignment 2018-04-02 23:50:21 +02:00
Matthew Honnibal
c8ba54e052 Fix Alignment class for undersegmentation 2018-04-02 23:39:26 +02:00
Matthew Honnibal
e6641a11b1 Refactor alignment into its own class 2018-04-02 21:54:29 +02:00
Matthew Honnibal
9c3612d40b Draft ArcEager.preprocess_gold for fused tokens 2018-04-01 22:11:35 +02:00
Matthew Honnibal
fb9c3984b5 Add GoldParse.resize_arrays method 2018-04-01 22:10:53 +02:00
Matthew Honnibal
cb6988f2f4 Fix comment in GoldParse 2018-04-01 22:10:26 +02:00
Matthew Honnibal
00fa41a924 Handle list values in ud-train for tagger 2018-04-01 18:34:28 +02:00
Matthew Honnibal
5f68e491e1 Prepare ArcEager.preprocess_gold to handle subtokens 2018-04-01 18:31:33 +02:00
Matthew Honnibal
b8461e71b7 Prepare ArcEager.preprocess_gold to handle subtokens 2018-04-01 18:03:48 +02:00
Matthew Honnibal
2d929ffc5d Handle list-valued GoldParse values 2018-04-01 17:42:33 +02:00
Matthew Honnibal
29b77fd0eb Add tests for gold alignment and parser state 2018-04-01 17:26:37 +02:00
Matthew Honnibal
3d182fbc43 Represent fused tokens in GoldParse
Entries in GoldParse.{words, heads, tags, deps, ner} can now be lists
instead of single values, to handle getting the analysis for fused
tokens. For instance, let's say we have a token like "hows", while the
gold-standard has two tokens, ["how", "s"]. We need to store the gold
data for each of the two subtokens.

Example gold.words: [["how", "s"], "it", "going"]

Things get more complicated for heads, as we need to address particular
subtokens. Let's say the gold heads for ["how", "s", "it", "going"] are
[1, 1, 3, 1], i.e. the root "s" is itself a subtoken of the fused token.
The gold.heads list would be:

    [[(0, 1), (0, 1)], 2, (0, 1)]

The tuples indicate token 0, subtoken 1. A helper method
_flatten_fused_heads is available that unpacks the above to
[1, 1, 3, 1].
2018-04-01 17:18:18 +02:00
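The unpacking described in 3d182fbc43 can be sketched as follows. This is an illustrative reconstruction, not the code from the commit: the function name flatten_fused_heads and its two-argument signature are assumptions, as is the reading that a plain integer head indexes a whole (unsplit) token while a tuple is (token, subtoken).

```python
def flatten_fused_heads(words, heads):
    """Convert fused-token heads into flat, per-subtoken head indices.

    Hypothetical helper: a sketch of the idea only, not spaCy's
    _flatten_fused_heads implementation.
    """
    # offsets[i] is the flat index of the first subtoken of token i.
    offsets = []
    n = 0
    for word in words:
        offsets.append(n)
        n += len(word) if isinstance(word, list) else 1

    def resolve(head):
        # A tuple addresses (token, subtoken); an int addresses a whole token.
        if isinstance(head, tuple):
            token, subtoken = head
            return offsets[token] + subtoken
        return offsets[head]

    flat = []
    for head in heads:
        if isinstance(head, list):
            # Fused token: one head entry per subtoken.
            flat.extend(resolve(h) for h in head)
        else:
            flat.append(resolve(head))
    return flat


words = [["how", "s"], "it", "going"]
heads = [[(0, 1), (0, 1)], 2, (0, 1)]
flat_heads = flatten_fused_heads(words, heads)
print(flat_heads)
```

With the example from the commit message, the fused token contributes two flat positions, so "it" and "going" shift to flat indices 2 and 3.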
Matthew Honnibal
a64680c137 Add test for one-to-many alignment 2018-04-01 14:53:49 +02:00
Matthew Honnibal
19ac03ce09 Go back to letting Break work with deeper stacks
It seems very appealing to restrict Break so that it only works when
there's one word on the stack. Then we can pop that word, mark it as the
root, and continue.

However, results are suggesting it's nice to be able to predict Break
when the last word of the previous sentence is on the stack, and the
first word of the next sentence is at the buffer. This does make sense!
Consider that the last word is often a period or something --- a pretty
huge clue. We otherwise have to go out of our way to get that feature
in.

The really decisive thing is we have to handle upcoming sentence breaks
anyway, because we need to conform to preset SBD constraints. So, we may
as well let the parser predict the Break when it's at a stack/queue
position that is most revealing.
2018-04-01 14:32:15 +02:00
Matthew Honnibal
ad70b91e1e Comment 2018-04-01 13:47:16 +02:00
Matthew Honnibal
83ca2113a2 Constrain Break action to stack depth==1 2018-04-01 13:47:02 +02:00
Matthew Honnibal
dc7f879281 Set USE_SPLIT=False feature flag 2018-04-01 13:46:25 +02:00
Matthew Honnibal
a2f07ab57f Start sketching out Split transition implementation 2018-04-01 13:45:41 +02:00
Matthew Honnibal
5da7945917 Allocate StateC.was_split 2018-04-01 13:44:42 +02:00
Matthew Honnibal
728d9841c7 Allocate fused tokens array in GoldParseC 2018-04-01 13:43:56 +02:00
Matthew Honnibal
d8dec1134c Simplify Break transition to require stack depth 1. Hopefully as accurate 2018-04-01 12:53:25 +02:00
Matthew Honnibal
a37188fe98 Don't drop preset actions on begin_training 2018-04-01 11:46:22 +02:00
Matthew Honnibal
e887b2330e Rewrite oracle to not use fast-forward. Seems to work? 2018-04-01 10:43:11 +02:00
Matthew Honnibal
c5574f48c7 Add better arc-eager oracle tests 2018-04-01 10:41:52 +02:00
Matthew Honnibal
bc2a2c81c8 Add some methods to ArcEager that make testing easier 2018-04-01 10:41:28 +02:00
Matthew Honnibal
e5ad35787c WIP on adding split-token actions to parser
This patch starts getting the StateC object ready to split tokens. The
split function is implemented by pushing indices into the buffer that
indicate an out-of-length token.

Still todo:

* Update the oracles
* Update GoldParseC
* Interpret the parse once it's complete
* Add retokenizer.split() method
2018-03-31 20:05:27 +02:00
Matthew Honnibal
3e3af01681 Add notes for adding retokenize.split() 2018-03-31 19:32:37 +02:00
Matthew Honnibal
7325de449d Export set_children_from_heads C function from doc.pxd 2018-03-31 15:17:23 +02:00
Matthew Honnibal
168fa080b7 Add doc.retokenize() context manager
This patch takes a step towards #1487 by introducing the
doc.retokenize() context manager, to handle merging spans, and soon
splitting tokens.

The idea is to do merging and splitting like this:

    with doc.retokenize() as retokenizer:
        for start, end, label in matches:
            retokenizer.merge(doc[start : end], attrs={'ent_type': label})

The retokenizer accumulates the merge requests, and applies them
together at the end of the block. This will allow retokenization to be
more efficient, and much less error prone.

A retokenizer.split() function will then be added, to handle splitting a
single token into multiple tokens. These methods take `Span` and `Token`
objects; if the user wants to go directly from offsets, they can append
to the .merges and .splits lists on the retokenizer.

The doc.merge() method's behaviour remains unchanged, so this patch
should be 100% backwards compatible (modulo bugs). Internally,
doc.merge() fixes up the arguments (to handle the various deprecated styles),
opens the retokenizer, and makes the single merge.

We can later start issuing deprecation warnings on direct calls to doc.merge(),
to migrate people to the retokenize context manager.
2018-03-31 15:17:13 +02:00
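The accumulate-then-apply behaviour described in 168fa080b7 can be illustrated with a toy context manager over a plain list of strings. This is a sketch under stated assumptions, not spaCy's retokenizer: the class ToyRetokenizer, its offset-based merge(start, end) signature, and the assumption that merge ranges do not overlap are all hypothetical.

```python
class ToyRetokenizer:
    """Toy stand-in: collects merge requests and applies them on exit.

    Hypothetical class for illustration; it operates on a list of
    strings rather than a Doc, and assumes non-overlapping ranges.
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.merges = []  # accumulated (start, end) requests

    def merge(self, start, end):
        # Record the request only; nothing is modified yet.
        self.merges.append((start, end))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            return False  # leave tokens untouched if the block failed
        # Apply right-to-left so earlier offsets remain valid.
        for start, end in sorted(self.merges, reverse=True):
            self.tokens[start:end] = [" ".join(self.tokens[start:end])]
        return False


tokens = ["I", "like", "New", "York", "and", "San", "Francisco"]
with ToyRetokenizer(tokens) as retokenizer:
    retokenizer.merge(2, 4)
    retokenizer.merge(5, 7)
    inside_block = list(tokens)  # still unmerged at this point
print(tokens)
```

Deferring the work means every offset passed to merge() refers to the original tokenization, which is what makes batching several merges less error prone than applying them one at a time.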
Matthew Honnibal
2df926e819 Add placeholder fused_tokens to GoldC 2018-03-30 13:26:21 +02:00
Matthew Honnibal
ab93fdf5d1 Add more state Python variables, to make testing easier 2018-03-30 13:26:08 +02:00
Matthew Honnibal
e0375132bd Add state tests, esp. for split function 2018-03-30 13:25:46 +02:00
Matthew Honnibal
e826b85cf0 Fix state.split() function 2018-03-30 13:25:28 +02:00
Matthew Honnibal
d399843576 WIP on split parsing 2018-03-28 01:44:05 +02:00
Matthew Honnibal
de9fd091ac Fix #2014: token.pos_ not writeable 2018-03-27 21:21:11 +02:00
Matthew Honnibal
18da89e04c Handle non-callable gold_tuples in parser begin_training 2018-03-27 21:08:41 +02:00