8547 Commits

Author SHA1 Message Date
Matthew Honnibal
29b77fd0eb Add tests for gold alignment and parser state 2018-04-01 17:26:37 +02:00
Matthew Honnibal
3d182fbc43 Represent fused tokens in GoldParse
Entries in GoldParse.{words, heads, tags, deps, ner} can now be lists
instead of single values, to handle getting the analysis for fused
tokens. For instance, let's say we have a token like "hows", while the
gold-standard has two tokens, ["how", "s"]. We need to store the gold
data for each of the two subtokens.

Example gold.words: [["how", "s"], "it", "going"]

Things get more complicated for heads, as we need to address particular
subtokens. Let's say the gold heads for ["how", "s", "it", "going"] are
[1, 1, 3, 1], i.e. the root "s" is a subtoken of the fused token. The
gold.heads list would be:

    [[(0, 1), (0, 1)], 2, (0, 1)]

The tuples indicate token 0, subtoken 1. A helper method
_flatten_fused_heads is available that unpacks the above to
[1, 1, 3, 1].
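
As a rough illustration of the unpacking described above, here is a minimal Python sketch. The function name flatten_fused_heads and its logic are assumptions made for this example, not the actual _flatten_fused_heads helper. In particular, it assumes that a plain integer head pointing at a fused token refers to that token's first subtoken.

```python
def flatten_fused_heads(heads):
    """Hypothetical stand-in for the helper named in the commit message."""
    # First pass: map every (token, subtoken) address to its flat index.
    offsets = {}
    flat_i = 0
    for i, entry in enumerate(heads):
        n_sub = len(entry) if isinstance(entry, list) else 1
        for j in range(n_sub):
            offsets[(i, j)] = flat_i
            flat_i += 1

    def resolve(head):
        # A tuple addresses a subtoken; a plain int addresses a whole token.
        # Assumption: an int head resolves to subtoken 0 of that token.
        if isinstance(head, tuple):
            return offsets[head]
        return offsets[(head, 0)]

    # Second pass: rewrite each head as a flat index.
    flat = []
    for entry in heads:
        if isinstance(entry, list):
            flat.extend(resolve(h) for h in entry)
        else:
            flat.append(resolve(entry))
    return flat


heads = [[(0, 1), (0, 1)], 2, (0, 1)]
print(flatten_fused_heads(heads))  # [1, 1, 3, 1]
```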
2018-04-01 17:18:18 +02:00
Matthew Honnibal
a64680c137 Add test for one-to-many alignment 2018-04-01 14:53:49 +02:00
Matthew Honnibal
19ac03ce09 Go back to letting Break work with deeper stacks
It seems very appealing to restrict Break so that it only works when
there's one word on the stack. Then we can pop that word, mark it as the
root, and continue.

However, results are suggesting it's nice to be able to predict Break
when the last word of the previous sentence is on the stack, and the
first word of the next sentence is at the buffer. This does make sense!
Consider that the last word is often a period or something --- a pretty
huge clue. We otherwise have to go out of our way to get that feature
in.

The really decisive thing is we have to handle upcoming sentence breaks
anyway, because we need to conform to preset SBD constraints. So, we may
as well let the parser predict the Break when it's at a stack/queue
position that is most revealing.
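
The two policies discussed above can be contrasted with a toy validity check in Python. This is only a sketch: break_is_valid, its arguments and the require_depth_one flag are hypothetical names for this example, and the real ArcEager preconditions are not spelled out in this message.

```python
def break_is_valid(stack, buffer, require_depth_one=False):
    """Toy precondition for a Break action (illustrative only)."""
    # Assumption: Break needs a word on the stack and a word in the buffer.
    if not stack or not buffer:
        return False
    if require_depth_one:
        # Restricted policy: only allow Break with a single word on the stack.
        return len(stack) == 1
    # Permissive policy: allow Break with deeper stacks as well.
    return True


# Last word of the previous sentence (index 3) is on top of a deeper stack,
# and the first word of the next sentence (index 4) is at the buffer front.
stack, buffer = [0, 3], [4, 5]
print(break_is_valid(stack, buffer))                          # True
print(break_is_valid(stack, buffer, require_depth_one=True))  # False
```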
2018-04-01 14:32:15 +02:00
Matthew Honnibal
ad70b91e1e Comment 2018-04-01 13:47:16 +02:00
Matthew Honnibal
83ca2113a2 Constrain Break action to stack depth==1 2018-04-01 13:47:02 +02:00
Matthew Honnibal
dc7f879281 Set USE_SPLIT=False feature flag 2018-04-01 13:46:25 +02:00
Matthew Honnibal
a2f07ab57f Start sketching out Split transition implementation 2018-04-01 13:45:41 +02:00
Matthew Honnibal
5da7945917 Allocate StateC.was_split 2018-04-01 13:44:42 +02:00
Matthew Honnibal
728d9841c7 Allocate fused tokens array in GoldParseC 2018-04-01 13:43:56 +02:00
Matthew Honnibal
d8dec1134c Simplify Break transition to require stack depth 1. Hopefully as accurate 2018-04-01 12:53:25 +02:00
Matthew Honnibal
a37188fe98 Don't drop preset actions on begin_training 2018-04-01 11:46:22 +02:00
Matthew Honnibal
e887b2330e Rewrite oracle to not use fast-forward. Seems to work? 2018-04-01 10:43:11 +02:00
Matthew Honnibal
c5574f48c7 Add better arc-eager oracle tests 2018-04-01 10:41:52 +02:00
Matthew Honnibal
bc2a2c81c8 Add some methods to ArcEager that make testing easier 2018-04-01 10:41:28 +02:00
Matthew Honnibal
e5ad35787c WIP on adding split-token actions to parser
This patch starts getting the StateC object ready to split tokens. The
split function is implemented by pushing indices into the buffer that
indicate an out-of-length token.

Still todo:

* Update the oracles
* Update GoldParseC
* Interpret the parse once it's complete
* Add retokenizer.split() method
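
One possible reading of "indices that indicate an out-of-length token" is sketched below with a toy state in Python. ToyState, its fields and the split() signature are hypothetical and only illustrate the idea; the real StateC layout is not described in this message.

```python
class ToyState:
    """Toy parser state: the buffer holds token indices (illustrative only)."""

    def __init__(self, n_words):
        self.length = n_words                # number of original tokens
        self.buffer = list(range(n_words))   # buffer[0] is the next token
        self.n_split = 0                     # extra tokens created so far

    def split(self, n_new):
        """Split the token at the buffer front into 1 + n_new tokens.

        The new subtokens get indices >= self.length, so they can be
        recognised later as not belonging to the original token array.
        """
        new_ids = [self.length + self.n_split + k for k in range(n_new)]
        self.n_split += n_new
        # Place the new indices right after the token being split.
        self.buffer[1:1] = new_ids
        return new_ids

    def is_split_index(self, i):
        return i >= self.length


state = ToyState(3)
state.split(1)
print(state.buffer)  # [0, 3, 1, 2]
```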
2018-03-31 20:05:27 +02:00
Matthew Honnibal
3e3af01681 Add notes for adding retokenize.split() 2018-03-31 19:32:37 +02:00
Matthew Honnibal
7325de449d Export set_children_from_heads C function from doc.pxd 2018-03-31 15:17:23 +02:00
Matthew Honnibal
168fa080b7 Add doc.retokenize() context manager
This patch takes a step towards #1487 by introducing the
doc.retokenize() context manager, to handle merging spans, and soon
splitting tokens.

The idea is to do merging and splitting like this:

    with doc.retokenize() as retokenizer:
        for start, end, label in matches:
            retokenizer.merge(doc[start : end], attrs={'ent_type': label})

The retokenizer accumulates the merge requests, and applies them
together at the end of the block. This will allow retokenization to be
more efficient, and much less error prone.
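
The accumulate-then-apply pattern can be shown with a self-contained toy in Python that works on a plain list of words rather than a Doc. ToyRetokenizer and the retokenize() function below are hypothetical names for this sketch, not spaCy's implementation, and the sketch assumes the requested merges do not overlap.

```python
from contextlib import contextmanager


class ToyRetokenizer:
    """Collects merge requests and applies them in one pass (toy example)."""

    def __init__(self, words):
        self.words = words
        self.merges = []  # list of (start, end) offsets, end exclusive

    def merge(self, start, end):
        # Only record the request; nothing changes until apply() runs.
        self.merges.append((start, end))

    def apply(self):
        # Apply right to left so earlier offsets stay valid.
        for start, end in sorted(self.merges, reverse=True):
            self.words[start:end] = [" ".join(self.words[start:end])]


@contextmanager
def retokenize(words):
    retokenizer = ToyRetokenizer(words)
    yield retokenizer
    # Reached only if the with-block finished without raising.
    retokenizer.apply()


words = ["I", "like", "New", "York", "and", "San", "Francisco"]
with retokenize(words) as retokenizer:
    retokenizer.merge(2, 4)
    retokenizer.merge(5, 7)
print(words)  # ['I', 'like', 'New York', 'and', 'San Francisco']
```

Because the offsets are only interpreted when the block exits, each request can be written against the original tokenization, which is what makes batching less error prone than merging one span at a time.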

A retokenizer.split() function will then be added, to handle splitting a
single token into multiple tokens. These methods take `Span` and `Token`
objects; if the user wants to go directly from offsets, they can append
to the .merges and .splits lists on the retokenizer.

The doc.merge() method's behaviour remains unchanged, so this patch
should be 100% backwards compatible (modulo bugs). Internally,
doc.merge() fixes up the arguments (to handle the various deprecated styles),
opens the retokenizer, and makes the single merge.

We can later start raising deprecation warnings on direct calls to doc.merge(),
to migrate people to the retokenize context manager.
2018-03-31 15:17:13 +02:00
Matthew Honnibal
2df926e819 Add placeholder fused_tokens to GoldC 2018-03-30 13:26:21 +02:00
Matthew Honnibal
ab93fdf5d1 Add more state Python variables, to make testing easier 2018-03-30 13:26:08 +02:00
Matthew Honnibal
e0375132bd Add state tests, esp. for split function 2018-03-30 13:25:46 +02:00
Matthew Honnibal
e826b85cf0 Fix state.split() function 2018-03-30 13:25:28 +02:00
Matthew Honnibal
d399843576 WIP on split parsing 2018-03-28 01:44:05 +02:00
Matthew Honnibal
de9fd091ac Fix #2014: token.pos_ not writeable 2018-03-27 21:21:11 +02:00
Matthew Honnibal
18da89e04c Handle non-callable gold_tuples in parser begin_training 2018-03-27 21:08:41 +02:00
Matthew Honnibal
1f7229f40f Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit c9ba3d3c2d, reversing
changes made to 92c26a35d4.
2018-03-27 19:23:02 +02:00
Matthew Honnibal
8b7a74570f Revert "Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop""
This reverts commit f41e626844.
2018-03-27 19:22:52 +02:00
Matthew Honnibal
f41e626844 Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit c9ba3d3c2d, reversing
changes made to f57bfbccdc.
2018-03-27 19:22:25 +02:00
Matthew Honnibal
c9ba3d3c2d Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-03-27 18:59:08 +02:00
Matthew Honnibal
92c26a35d4 Update get_cuda_stream 2018-03-27 16:42:00 +00:00
Matthew Honnibal
f57bfbccdc Fix non-projective label filtering 2018-03-27 13:41:33 +02:00
Matthew Honnibal
d2118792e7 Merge changes from master 2018-03-27 13:38:41 +02:00
Matthew Honnibal
d4680e4d83 Merge branch 'master' of https://github.com/explosion/spaCy 2018-03-27 13:36:37 +02:00
Matthew Honnibal
63a267b34d Fix #2073: Token.set_extension not working 2018-03-27 13:36:20 +02:00
Ines Montani
284bbb1dd1 Merge pull request #2146 from justindujardin/tensorboard-standalone-example
Add example using TensorBoard standalone projector
2018-03-27 13:23:32 +02:00
Matthew Honnibal
25280b7013 Try to make sum_state_features faster 2018-03-27 10:08:38 +00:00
Matthew Honnibal
987e1533a4 Use 8 features in parser 2018-03-27 10:08:12 +00:00
Matthew Honnibal
8bbd26579c Support GPU in UD training script 2018-03-27 09:53:35 +00:00
Matthew Honnibal
dd54511c4f Pass data as a function in begin_training methods 2018-03-27 09:39:59 +00:00
Matthew Honnibal
d9ebd78e11 Change default sizes in parser 2018-03-26 17:22:18 +02:00
Matthew Honnibal
a3d0cb15d3 Fix ent_iob tags in doc.merge to avoid inconsistent sequences 2018-03-26 07:16:06 +02:00
Matthew Honnibal
7d4687162f Update doc.ents test 2018-03-26 07:14:35 +02:00
Matthew Honnibal
514d89a3ae Set missing label for non-specified entities when setting doc.ents 2018-03-26 07:14:16 +02:00
Matthew Honnibal
54d7a1c916 Improve error message when entity sequence is inconsistent 2018-03-26 07:13:34 +02:00
Justin DuJardin
4eeb178856 Add example using TensorBoard standalone projector
- The TensorBoard standalone projector expects a different set of files than the TensorFlow plugin.
2018-03-25 21:50:13 -07:00
Matthew Honnibal
938436455a Add test for ent_iob during span merge 2018-03-25 22:16:19 +02:00
Matthew Honnibal
8e08c378fe Fix entity IOB and tag in span merging 2018-03-25 22:16:01 +02:00
Matthew Honnibal
5430c43298 Set about to spacy-nightly 2018-03-25 19:30:14 +02:00
Matthew Honnibal
c059fcb0ba Update thinc requirement 2018-03-25 19:29:36 +02:00