spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-12-02 07:45:41 +03:00

Author	SHA1	Message	Date
Paul O'Leary McCann	00d481dd12	Stack the mention scorer In the reference implementations, there's usually a function to build a ffnn of arbitrary depth, consisting of a stack of Linear >> Relu >> Dropout. In practice the depth is always 1 in coref-hoi, but in earlier iterations of the model, which are more similar to our model here (since we aren't using attention or even necessarily BERT), using a small depth like 2 was common. This hard-codes a stack of 2. In brief tests this allows similar performance to the unstacked version with much smaller embedding sizes. The depth of the stack could be made into a hyperparameter.	2021-08-09 18:04:42 +09:00
Paul O'Leary McCann	56803d3909	Change mention limit to match reference implementations This generally means fewer spans are considered, which makes individual steps in training faster but can make training take longer to find the good spans.	2021-08-08 19:55:52 +09:00
Paul O'Leary McCann	1d1679d431	Minor speedup This continue should be a break. The current form doesn't cause errors but using a break will be a bit faster.	2021-07-21 19:50:10 +09:00
Paul O'Leary McCann	8bd0474730	Run black	2021-07-18 20:20:22 +09:00
Paul O'Leary McCann	9b63cbb775	Add extract spans import	2021-07-15 18:16:53 +09:00
Paul O'Leary McCann	4a9dc00d86	Use relative indices for mentions Was using batch absolute indices to manage mentions, but extract_spans expects doc-relative ones.	2021-07-14 18:36:18 +09:00
Paul O'Leary McCann	f1796e4af7	Fix mention list bug There was an off-by-one error in how mentions are generated that would affect mentions at the end of a sentence. This was pretty nasty.	2021-07-14 18:19:00 +09:00
Paul O'Leary McCann	c25ec292a9	Cleanup	2021-07-10 22:42:55 +09:00
Paul O'Leary McCann	e00bd422d9	Fix span embeds Some of the lengths and backprop weren't right. Also various cleanup.	2021-07-10 21:38:53 +09:00
Paul O'Leary McCann	d7d317a1b5	Clean up span embedding code This is now cleaner and significantly faster. There's still some messy parts in the code (particularly variable names), will get to that later.	2021-07-10 19:59:08 +09:00
Paul O'Leary McCann	dc1f974d39	Merge branch 'master' into feature/coref	2021-07-10 18:10:40 +09:00
Paul O'Leary McCann	f34915c1e8	Use scatter_add to speed up span embed backprop This was the slowest part of the code, and using scatter_add here probably reduces the runtime by 50%.	2021-07-10 18:08:51 +09:00
Paul O'Leary McCann	d0b041aff4	Switch to using Thinc tuplify The tuplify code here was added to Thinc proper and that's been released, so no need to have it here any more.	2021-07-08 16:08:36 +09:00
Sofie Van Landeghem	e7d747e3ee	TransitionBasedParser.v1 to legacy (#8586) * TransitionBasedParser.v1 to legacy * register sublayers * bump spacy-legacy to 3.0.7	2021-07-06 15:26:45 +02:00
Paul O'Leary McCann	eb5820b593	Improve take_vecs implementation This pulls out references to needed bits so that other parts (the larger embeddings) can be freed before backprop.	2021-07-05 21:08:42 +09:00
Paul O'Leary McCann	13bef2ddb6	Add width prior feature Not necessary for convergence, but in coref-hoi this seems to add a few f1 points. Note that there are two width-related features in coref-hoi. This is a "prior" that is added to mention scores. The other width related feature is appended to the span embedding representation for other layers to reference.	2021-07-05 21:06:28 +09:00
Paul O'Leary McCann	8f66176b2d	Fix loss? This rewrites the loss to not use the Thinc crossentropy code at all. The main difference here is that the negative predictions are being masked out (= marginalized over), but negative gradient is still being reflected. I'm still not sure this is exactly right but models seem to train reliably now.	2021-07-05 18:17:10 +09:00
Paul O'Leary McCann	5db28ec2fd	Tweak mention limit calculation The calculation of this in the coref-hoi code is hard to follow. Based on comments and variable names it sounds like it's using the doc length, but it might actually be the number of mentions? Number of mentions should be much larger and seems more correct, but might want to revisit this.	2021-07-03 21:13:32 +09:00
Paul O'Leary McCann	251a5b43ac	Minor fix in crossing spans code I think this was technically incorrect but harmless. The reason the code here is different than the reference in coref-hoi is that the indices there are such that they get +1 at the end of processing, while the code here handles indices directly.	2021-07-03 18:41:46 +09:00
Paul O'Leary McCann	865caedebd	Remove XXX comment Comment wondered if there should be some subtraction to avoid double counting, but it probably doesn't matter because the diagonal is 0.	2021-07-03 18:40:38 +09:00
Paul O'Leary McCann	d74fa82c80	Fix axis handling in topk In practice this is only ever used with axis=1, so it wasn't causing issues, even though it was wrong.	2021-07-03 18:39:25 +09:00
Paul O'Leary McCann	f2e0e9dc28	Move placeholder handling into model code	2021-07-03 18:38:48 +09:00
Paul O'Leary McCann	3f66e18592	Clean up pw_prod loss This doesn't change the math but makes the transposes slightly easier to understand (maybe?).	2021-07-03 18:33:17 +09:00
Ines Montani	7f65902702	Merge pull request #8522 from adrianeboyd/chore/update-flake8 Update flake8 version in reqs and CI	2021-06-28 21:46:06 +10:00
Adriane Boyd	5eeb25f043	Tidy up code	2021-06-28 12:08:15 +02:00
Adriane Boyd	4b0ed73ed4	Update flake8 version in reqs and CI * Update some unneeded forward refs related to flake8 checks	2021-06-28 11:29:36 +02:00
Paul O'Leary McCann	4f377d8de8	Fix bug in crossing span detection	2021-06-28 18:20:33 +09:00
Paul O'Leary McCann	23344857b9	Remove unused function	2021-06-28 18:19:43 +09:00
Matthew Honnibal	f9946154d9	Add SpanCategorizer component (#6747 ) * Draft spancat model * Add spancat model * Add test for extract_spans * Add extract_spans layer * Upd extract_spans * Add spancat model * Add test for spancat model * Upd spancat model * Update spancat component * Upd spancat * Update spancat model * Add quick spancat test * Import SpanCategorizer * Fix SpanCategorizer component * Import SpanGroup * Fix span extraction * Fix import * Fix import * Upd model * Update spancat models * Add scoring, update defaults * Update and add docs * Fix type * Update spacy/ml/extract_spans.py * Auto-format and fix import * Fix comment * Fix type * Fix type * Update website/docs/api/spancategorizer.md * Fix comment Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Better defense Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix labels list Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/ml/extract_spans.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/pipeline/spancat.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Set annotations during update * Set annotations in spancat * fix imports in test * Update spacy/pipeline/spancat.py * replace MaxoutLogistic with LinearLogistic * fix config * various small fixes * remove set_annotations parameter in update * use our beloved tupley format with recent support for doc.spans * bugfix to allow renaming the default span_key (scores weren't showing up) * use different key in docs example * change defaults to better-working parameters from project (WIP) * register spacy.extract_spans.v1 for legacy purposes * Upd dev version so can build wheel * layers instead of architectures for smaller building blocks * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane 
Boyd <adrianeboyd@gmail.com> * Include additional scores from overrides in combined score weights * Parameterize spans key in scoring Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so that it's possible to evaluate multiple `spancat` components in the same pipeline. * Use the (intentionally very short) default spans key `sc` in the `SpanCategorizer` * Adjust the default score weights to include the default key * Adjust the scorer to use `spans_{spans_key}` as the prefix for the returned score * Revert addition of `attr_name` argument to `score_spans` and adjust the key in the `getter` instead. Note that for `spancat` components with a custom `span_key`, the score weights currently need to be modified manually in `[training.score_weights]` for them to be available during training. To suppress the default score weights `spans_sc_p/r/f` during training, set them to `null` in `[training.score_weights]`. * Update website/docs/api/scorer.md * Fix scorer for spans key containing underscore * Increment version * Add Spans to Evaluate CLI (#8439) * Add Spans to Evaluate CLI * Change to spans_key * Add spans per_type output Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Fix spancat GPU issues (#8455) * Fix GPU issues * Require thinc >=8.0.6 * Switch to glorot_uniform_init * Fix and test ngram suggester * Include final ngram in doc for all sizes * Fix ngrams for docs of the same length as ngram size * Handle batches of docs that result in no ngrams * Add tests Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Nirant <NirantK@users.noreply.github.com>	2021-06-24 12:35:27 +02:00
Paul O'Leary McCann	5c98c4c3b9	Probably fix pw prod backprop I think this change is correct, but intuition doesn't really help here...	2021-06-17 21:23:00 +09:00
Paul O'Leary McCann	ccf561112a	Remove old comments	2021-06-17 21:22:17 +09:00
Paul O'Leary McCann	a62121e3b4	Expose more hyperparameters	2021-06-17 21:21:46 +09:00
Paul O'Leary McCann	848fd102e7	Small fix	2021-06-17 21:19:38 +09:00
Paul O'Leary McCann	fce804a79f	Minor optimization	2021-06-17 21:10:46 +09:00
Paul O'Leary McCann	cb2364cf83	Fix type of mask The call here was creating a float64 array, which was turning many downstream scores into float64s. Later on these values were assigned to a float32 array in backprop, and numerical underflow caused things to go to zero. That's almost certainly not the only reason things go to zero, but it is incorrect.	2021-06-17 17:56:00 +09:00
Sofie Van Landeghem	e796aab4b3	Resizable textcat (#7862) * implement textcat resizing for TextCatCNN * resizing textcat in-place * simplify code * ensure predictions for old textcat labels remain the same after resizing (WIP) * fix for softmax * store softmax as attr * fix ensemble weight copy and cleanup * restructure slightly * adjust documentation, update tests and quickstart templates to use latest versions * extend unit test slightly * revert unnecessary edits * fix typo * ensemble architecture won't be resizable for now * use resizable layer (WIP) * revert using resizable layer * resizable container while avoid shape inference trouble * cleanup * ensure model continues training after resizing * use fill_b parameter * use fill_defaults * resize_layer callback * format * bump thinc to 8.0.4 * bump spacy-legacy to 3.0.6	2021-06-16 11:45:00 +02:00
Paul O'Leary McCann	8452d117ef	Fix typo, remove old comment	2021-06-13 19:42:55 +09:00
Paul O'Leary McCann	96be7e8858	Change topk to sort descending Shouldn't change correctness but is a little clearer	2021-06-13 19:42:24 +09:00
Paul O'Leary McCann	d71198ed36	Replace squeeze with flatten At a few points in the code it's normal to get a "2d" array where each row is a single entry. Calling squeeze will make that a proper 1d array... unless it's just one entry, in which case it turns into a 0d scalar. That's not what we want; flatten() provides the desired behavior.	2021-06-12 19:48:01 +09:00
Paul O'Leary McCann	e728b0e45d	Silence warning	2021-06-12 19:31:35 +09:00
Paul O'Leary McCann	7efbc721a1	Don't use is_sentenced	2021-06-12 19:29:27 +09:00
Paul O'Leary McCann	18444fccd9	Remove old comment	2021-06-04 17:56:08 +09:00
Paul O'Leary McCann	4a4ef72191	Clean up unused functions `make_clean_doc` is not needed and was removed. `logsumexp` may be needed if I misunderstood the loss calculation, so I left it in for now with a note.	2021-06-02 21:42:23 +09:00
Vito De Tullio	3672464e25	applying suggestion to avoid mypy errors (#8265) * applying suggestion to avoid mypy errors * sign contributor agreement	2021-06-02 19:25:30 +10:00
svlandeg	0aa1083ce8	avoid repetitive entities in the output	2021-05-28 16:52:51 +02:00
svlandeg	0f5c586e2f	add basic tests for debugging	2021-05-28 14:19:55 +02:00
svlandeg	391b512afd	fix types of fwd functions	2021-05-27 16:36:46 +02:00
svlandeg	04b55bf054	removing unused imports	2021-05-27 16:31:38 +02:00
svlandeg	910026582d	set versions to v1 instead of v0	2021-05-27 16:17:20 +02:00
Paul O'Leary McCann	d6389b133d	Don't use a generator for no reason	2021-05-24 19:06:15 +09:00
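
The squeeze-versus-flatten behavior described in commit d71198ed36 can be illustrated with a minimal NumPy sketch. This is not the code from the commit, which is not shown in this log; the helper names `to_1d_squeeze` and `to_1d_flatten` and the sample arrays are hypothetical, and the actual code may run on a different array backend.

```python
import numpy as np


def to_1d_squeeze(column):
    # Hypothetical helper: drops every axis of length 1.
    return np.asarray(column).squeeze()


def to_1d_flatten(column):
    # Hypothetical helper: always returns a 1d copy.
    return np.asarray(column).flatten()


# A "2d" array where each row is a single entry.
many = np.array([[0.1], [0.7], [0.2]], dtype="float32")
# The problem case: the same layout with only one row.
one = np.array([[0.5]], dtype="float32")

print(to_1d_squeeze(many).shape)  # (3,)
print(to_1d_flatten(many).shape)  # (3,)
print(to_1d_squeeze(one).shape)   # () -- a 0d scalar array
print(to_1d_flatten(one).shape)   # (1,) -- still a 1d array
```

With several rows both calls agree; with a single row, `squeeze()` also removes the row axis, so code that later iterates over or indexes the result would fail, while `flatten()` keeps one dimension.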

168 Commits