spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-10-02 18:06:46 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	29b77fd0eb	Add tests for gold alignment and parser state	2018-04-01 17:26:37 +02:00
Matthew Honnibal	3d182fbc43	Represent fused tokens in GoldParse. Entries in GoldParse.{words, heads, tags, deps, ner} can now be lists instead of single values, to handle getting the analysis for fused tokens. For instance, let's say we have a token like "hows", while the gold-standard has two tokens, ["how", "s"]. We need to store the gold data for each of the two subtokens. Example gold.words: [["how", "s"], "it", "going"]. Things get more complicated for heads, as we need to address particular subtokens. Let's say the gold heads for ["how", "s", "it", "going"] is [1, 1, 3, 1], i.e. the root "s" is within the subtoken. The gold.heads list would be: [[(0, 1), (0, 1)], 2, (0, 1)]. The tuples indicate token 0, subtoken 1. A helper method _flatten_fused_heads is available that unpacks the above to [1, 1, 3, 1].	2018-04-01 17:18:18 +02:00
Matthew Honnibal	a64680c137	Add test for one-to-many alignment	2018-04-01 14:53:49 +02:00
Matthew Honnibal	19ac03ce09	Go back to letting Break work with deeper stacks. It seems very appealing to restrict Break so that it only works when there's one word on the stack. Then we can pop that word, mark it as the root, and continue. However, results are suggesting it's nice to be able to predict Break when the last word of the previous sentence is on the stack, and the first word of the next sentence is at the buffer. This does make sense! Consider that the last word is often a period or something --- a pretty huge clue. We otherwise have to go out of our way to get that feature in. The really decisive thing is we have to handle upcoming sentence breaks anyway, because we need to conform to preset SBD constraints. So, we may as well let the parser predict the Break when it's at a stack/queue position that is most revealing.	2018-04-01 14:32:15 +02:00
Matthew Honnibal	ad70b91e1e	Comment	2018-04-01 13:47:16 +02:00
Matthew Honnibal	83ca2113a2	Constrain Break action to stack depth==1	2018-04-01 13:47:02 +02:00
Matthew Honnibal	dc7f879281	Set USE_SPLIT=False feature flag	2018-04-01 13:46:25 +02:00
Matthew Honnibal	a2f07ab57f	Start sketching out Split transition implementation	2018-04-01 13:45:41 +02:00
Matthew Honnibal	5da7945917	Allocate StateC.was_split	2018-04-01 13:44:42 +02:00
Matthew Honnibal	728d9841c7	Allocate fused tokens array in GoldParseC	2018-04-01 13:43:56 +02:00
Matthew Honnibal	d8dec1134c	Simplify Break transition to require stack depth 1. Hopefully as accurate	2018-04-01 12:53:25 +02:00
Matthew Honnibal	a37188fe98	Don't drop preset actions on begin_training	2018-04-01 11:46:22 +02:00
Matthew Honnibal	e887b2330e	Rewrite oracle to not use fast-forward. Seems to work?	2018-04-01 10:43:11 +02:00
Matthew Honnibal	c5574f48c7	Add better arc-eager oracle tests	2018-04-01 10:41:52 +02:00
Matthew Honnibal	bc2a2c81c8	Add some methods to ArcEager that make testing easier	2018-04-01 10:41:28 +02:00
Matthew Honnibal	e5ad35787c	WIP on adding split-token actions to parser. This patch starts getting the StateC object ready to split tokens. The split function is implemented by pushing indices into the buffer that indicate an out-of-length token. Still to do: update the oracles; update GoldParseC; interpret the parse once it's complete; add retokenizer.split() method.	2018-03-31 20:05:27 +02:00
Matthew Honnibal	3e3af01681	Add notes for adding retokenize.split()	2018-03-31 19:32:37 +02:00
Matthew Honnibal	7325de449d	Export set_children_from_heads C function from doc.pxd	2018-03-31 15:17:23 +02:00
Matthew Honnibal	168fa080b7	Add doc.retokenize() context manager. This patch takes a step towards #1487 by introducing the doc.retokenize() context manager, to handle merging spans, and soon splitting tokens. The idea is to do merging and splitting like this: with doc.retokenize() as retokenizer: for start, end, label in matches: retokenizer.merge(doc[start : end], attrs={'ent_type': label}) The retokenizer accumulates the merge requests, and applies them together at the end of the block. This will allow retokenization to be more efficient, and much less error prone. A retokenizer.split() function will then be added, to handle splitting a single token into multiple tokens. These methods take `Span` and `Token` objects; if the user wants to go directly from offsets, they can append to the .merges and .splits lists on the retokenizer. The doc.merge() method's behaviour remains unchanged, so this patch should be 100% backwards compatible (modulo bugs). Internally, doc.merge() fixes up the arguments (to handle the various deprecated styles), opens the retokenizer, and makes the single merge. We can later start issuing deprecation warnings on direct calls to doc.merge(), to migrate people to the retokenize context manager.	2018-03-31 15:17:13 +02:00
Matthew Honnibal	2df926e819	Add placeholder fused_tokens to GoldC	2018-03-30 13:26:21 +02:00
Matthew Honnibal	ab93fdf5d1	Add more state Python variables, to make testing easier	2018-03-30 13:26:08 +02:00
Matthew Honnibal	e0375132bd	Add state tests, esp. for split function	2018-03-30 13:25:46 +02:00
Matthew Honnibal	e826b85cf0	Fix state.split() function	2018-03-30 13:25:28 +02:00
Matthew Honnibal	d399843576	WIP on split parsing	2018-03-28 01:44:05 +02:00
Matthew Honnibal	de9fd091ac	Fix #2014 : token.pos_ not writeable	2018-03-27 21:21:11 +02:00
Matthew Honnibal	18da89e04c	Handle non-callable gold_tuples in parser begin_training	2018-03-27 21:08:41 +02:00
Matthew Honnibal	1f7229f40f	Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" This reverts commit `c9ba3d3c2d`, reversing changes made to `92c26a35d4`.	2018-03-27 19:23:02 +02:00
Matthew Honnibal	8b7a74570f	Revert "Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"" This reverts commit `f41e626844`.	2018-03-27 19:22:52 +02:00
Matthew Honnibal	f41e626844	Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" This reverts commit `c9ba3d3c2d`, reversing changes made to `f57bfbccdc`.	2018-03-27 19:22:25 +02:00
Matthew Honnibal	c9ba3d3c2d	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-03-27 18:59:08 +02:00
Matthew Honnibal	92c26a35d4	Update get_cuda_stream	2018-03-27 16:42:00 +00:00
Matthew Honnibal	f57bfbccdc	Fix non-projective label filtering	2018-03-27 13:41:33 +02:00
Matthew Honnibal	d2118792e7	Merge changes from master	2018-03-27 13:38:41 +02:00
Matthew Honnibal	d4680e4d83	Merge branch 'master' of https://github.com/explosion/spaCy	2018-03-27 13:36:37 +02:00
Matthew Honnibal	63a267b34d	Fix #2073 : Token.set_extension not working	2018-03-27 13:36:20 +02:00
Ines Montani	284bbb1dd1	Merge pull request #2146 from justindujardin/tensorboard-standalone-example: Add example using TensorBoard standalone projector	2018-03-27 13:23:32 +02:00
Matthew Honnibal	25280b7013	Try to make sum_state_features faster	2018-03-27 10:08:38 +00:00
Matthew Honnibal	987e1533a4	Use 8 features in parser	2018-03-27 10:08:12 +00:00
Matthew Honnibal	8bbd26579c	Support GPU in UD training script	2018-03-27 09:53:35 +00:00
Matthew Honnibal	dd54511c4f	Pass data as a function in begin_training methods	2018-03-27 09:39:59 +00:00
Matthew Honnibal	d9ebd78e11	Change default sizes in parser	2018-03-26 17:22:18 +02:00
Matthew Honnibal	a3d0cb15d3	Fix ent_iob tags in doc.merge to avoid inconsistent sequences	2018-03-26 07:16:06 +02:00
Matthew Honnibal	7d4687162f	Update doc.ents test	2018-03-26 07:14:35 +02:00
Matthew Honnibal	514d89a3ae	Set missing label for non-specified entities when setting doc.ents	2018-03-26 07:14:16 +02:00
Matthew Honnibal	54d7a1c916	Improve error message when entity sequence is inconsistent	2018-03-26 07:13:34 +02:00
Justin DuJardin	4eeb178856	Add example using TensorBoard standalone projector - the tensorboard standalone project expects a different set of files than the plugin to TensorFlow.	2018-03-25 21:50:13 -07:00
Matthew Honnibal	938436455a	Add test for ent_iob during span merge	2018-03-25 22:16:19 +02:00
Matthew Honnibal	8e08c378fe	Fix entity IOB and tag in span merging	2018-03-25 22:16:01 +02:00
Matthew Honnibal	5430c43298	Set about to spacy-nightly	2018-03-25 19:30:14 +02:00
Matthew Honnibal	c059fcb0ba	Update thinc requirement	2018-03-25 19:29:36 +02:00
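
The fused-token head encoding described in commit 3d182fbc43 can be illustrated with a small standalone sketch. This is not spaCy's `_flatten_fused_heads`; the function name `flatten_fused_heads` and its internals are assumptions made for illustration, based only on the example in the commit message: an integer head points at a whole token in the fused sequence, and a `(token, subtoken)` tuple points inside a fused token.

```python
def flatten_fused_heads(heads):
    """Unpack fused-token head annotations into flat token indices.

    Hypothetical re-implementation for illustration only. Each entry of
    `heads` is either one head (ordinary token) or a list of heads (fused
    token, one per subtoken). A head is an int (index into the fused
    sequence) or a (token, subtoken) tuple.
    """
    # Record where each fused-level token starts in the flat sequence.
    offsets = []
    flat_length = 0
    for entry in heads:
        offsets.append(flat_length)
        flat_length += len(entry) if isinstance(entry, list) else 1

    def resolve(head):
        # Map a fused-level head reference to a flat token index.
        if isinstance(head, tuple):
            token, subtoken = head
            return offsets[token] + subtoken
        return offsets[head]

    flat = []
    for entry in heads:
        if isinstance(entry, list):
            flat.extend(resolve(head) for head in entry)
        else:
            flat.append(resolve(entry))
    return flat


# The example from the commit message: ["hows", "it", "going"] against
# gold tokens ["how", "s", "it", "going"].
gold_heads = [[(0, 1), (0, 1)], 2, (0, 1)]
flat_heads = flatten_fused_heads(gold_heads)
```

For the commit's example, `flat_heads` comes out as `[1, 1, 3, 1]`, matching the flat gold heads given in the message.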


8547 Commits
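
The accumulate-then-apply design described in commit 168fa080b7 (merge requests are collected inside the `with` block and applied together when it ends) can be sketched with a toy model over a plain list of strings. This does not use spaCy; `ToyRetokenizer` and the `retokenize` function below are hypothetical names, and the sketch assumes the requested ranges do not overlap.

```python
from contextlib import contextmanager


class ToyRetokenizer:
    """Collects merge requests; a stand-in for illustration, not spaCy's class."""

    def __init__(self):
        self.merges = []  # half-open (start, end) index ranges

    def merge(self, start, end):
        # Only record the request; nothing is modified yet.
        self.merges.append((start, end))


@contextmanager
def retokenize(tokens):
    retokenizer = ToyRetokenizer()
    yield retokenizer
    # Apply all merges at the end of the block. Working right to left
    # keeps the indices of the remaining requests valid.
    for start, end in sorted(retokenizer.merges, reverse=True):
        tokens[start:end] = [" ".join(tokens[start:end])]


tokens = ["I", "like", "New", "York", "and", "San", "Francisco"]
with retokenize(tokens) as retokenizer:
    retokenizer.merge(2, 4)
    retokenizer.merge(5, 7)
    # Still unmerged inside the block: requests are only queued here.
    length_inside_block = len(tokens)
```

Deferring the work means indices passed to `merge` always refer to the original tokenization, which is one plausible reading of why the commit calls the approach less error prone.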