Entries in GoldParse.{words, heads, tags, deps, ner} can now be lists
instead of single values, so that we can store the analysis of fused
tokens. For instance, say the tokenizer produces a single token
"hows", while the gold standard has two tokens, ["how", "s"]. We need
to store the gold data for each of the two subtokens.
Example gold.words: [["how", "s"], "it", "going"]
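For illustration, construction could look like this (a hedged sketch:
it assumes the list-valued entries land on the existing
GoldParse(doc, words=...) signature):

from spacy.lang.en import English
from spacy.gold import GoldParse

nlp = English()
doc = nlp("hows it going")
# The first entry is a list: the single token "hows" aligns to the
# two gold-standard subtokens "how" and "s".
gold = GoldParse(doc, words=[["how", "s"], "it", "going"])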
Things get more complicated for heads, as we need to address
particular subtokens. Let's say the gold heads for ["how", "s", "it",
"going"] are [1, 1, 3, 1], i.e. the root "s" is one of the subtokens.
The gold.heads list would be:
[[(0, 1), (0, 1)], 2, (0, 1)]
Each (0, 1) tuple addresses token 0, subtoken 1. A helper method
_flatten_fused_heads is available that unpacks the above to
[1, 1, 3, 1].
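The helper's implementation isn't shown here, but a standalone
pure-Python sketch of the unpacking logic could look like this
(illustrative only; the real method lives on GoldParse):

def _flatten_fused_heads(heads):
    # First pass: give each (token, subtoken) pair its position in
    # the flattened token sequence.
    flat_index = {}
    i = 0
    for t, entry in enumerate(heads):
        n_sub = len(entry) if isinstance(entry, list) else 1
        for s in range(n_sub):
            flat_index[(t, s)] = i
            i += 1
    # Second pass: resolve every head reference to a flat index.
    flat = []
    for entry in heads:
        subheads = entry if isinstance(entry, list) else [entry]
        for head in subheads:
            if isinstance(head, tuple):
                flat.append(flat_index[head])
            else:
                flat.append(flat_index[(head, 0)])
    return flat

assert _flatten_fused_heads([[(0, 1), (0, 1)], 2, (0, 1)]) == [1, 1, 3, 1]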
It seems very appealing to restrict Break so that it only works when
there's one word on the stack. Then we can pop that word, mark it as
the root, and continue.
However, results suggest it's useful to be able to predict Break when
the last word of the previous sentence is on the stack, and the first
word of the next sentence is at the front of the buffer. This does
make sense! Consider that the last word is often a period or similar
--- a pretty huge clue. We'd otherwise have to go out of our way to
get that feature in.
The really decisive point is that we have to handle upcoming sentence
breaks anyway, because we need to conform to preset sentence-boundary
(SBD) constraints. So we may as well let the parser predict Break
when it's at the stack/queue position that is most revealing.
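To make that concrete, the validity check for Break only needs the
two positions named above (a hedged sketch; the state methods here
are hypothetical, not the actual transition-system API):

def can_break(state, sent_starts):
    # Break is valid when the word on top of the stack ends one
    # sentence and the word at the front of the buffer starts the
    # next, i.e. a preset boundary falls between the two positions.
    if state.stack_is_empty() or state.buffer_is_empty():
        return False
    return bool(sent_starts[state.buffer_front()])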
This patch starts getting the StateC object ready to split tokens.
The split function works by pushing indices onto the buffer that
point past the end of the original token array; such out-of-length
indices denote subtokens.
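A hypothetical pure-Python model of that indexing trick (the real
StateC is a Cython class; the names here are illustrative):

class ParserState:
    def __init__(self, n_tokens):
        self.n_tokens = n_tokens  # length of the original token array
        self.buffer = list(range(n_tokens))
        self.n_subtokens = 0

    def split(self, token_index, n_subtokens):
        # Represent each new subtoken with an index past the end of
        # the original token array, inserted just after its parent.
        where = self.buffer.index(token_index)
        extra = [self.n_tokens + self.n_subtokens + k
                 for k in range(n_subtokens)]
        self.buffer[where + 1:where + 1] = extra
        self.n_subtokens += n_subtokens

    def is_subtoken(self, index):
        # Out-of-length indices denote subtokens created by split().
        return index >= self.n_tokens

Splitting token 0 of a three-token buffer once gives
buffer == [0, 3, 1, 2], with index 3 flagged as a subtoken.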
Still todo:
* Update the oracles
* Update GoldParseC
* Interpret the parse once it's complete
* Add retokenizer.split() method
This patch takes a step towards #1487 by introducing the
doc.retokenize() context manager, which handles merging spans and
will soon handle splitting tokens.
The idea is to do merging and splitting like this:
with doc.retokenize() as retokenizer:
    for start, end, label in matches:
        retokenizer.merge(doc[start:end], attrs={'ent_type': label})
The retokenizer accumulates the merge requests and applies them
together at the end of the block. This will allow retokenization to
be more efficient and much less error-prone.
A retokenizer.split() function will then be added, to handle splitting a
single token into multiple tokens. These methods take `Span` and `Token`
objects; if the user wants to go directly from offsets, they can append
to the .merges and .splits lists on the retokenizer.
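For instance (a hedged sketch, assuming .merges holds (span, attrs)
pairs; the internal format isn't pinned down in this patch):

with doc.retokenize() as retokenizer:
    # Normal route: pass a Span.
    retokenizer.merge(doc[0:2], attrs={'ent_type': 'ORG'})
    # Going directly from token offsets, without calling merge():
    retokenizer.merges.append((doc[3:5], {'ent_type': 'PERSON'}))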
The doc.merge() method's behaviour remains unchanged, so this patch
should be 100% backwards compatible (modulo bugs). Internally,
doc.merge() fixes up the arguments (to handle the various deprecated
call styles), opens the retokenizer, and makes the single merge.
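Sketched out (hedged: the argument fix-up is omitted, and the body is
an assumption about this patch's internals; char_span() is the
existing offset-to-Span helper):

def merge(self, start_idx, end_idx, **attributes):
    span = self.char_span(start_idx, end_idx)
    if span is None:
        return None
    with self.retokenize() as retokenizer:
        retokenizer.merge(span, attrs=attributes)
    # After the block, the span has collapsed into a single token.
    return self[span.start]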
We can later start issuing deprecation warnings on direct calls to
doc.merge(), to migrate people to the retokenize() context manager.