spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-11-01 08:27:44 +03:00

Author	SHA1	Message	Date
Paul O'Leary McCann	00d481dd12	Stack the mention scorer. In the reference implementations, there's usually a function to build a ffnn of arbitrary depth, consisting of a stack of Linear >> Relu >> Dropout. In practice the depth is always 1 in coref-hoi, but in earlier iterations of the model, which are more similar to our model here (since we aren't using attention or even necessarily BERT), using a small depth like 2 was common. This hard-codes a stack of 2. In brief tests this allows similar performance to the unstacked version with much smaller embedding sizes. The depth of the stack could be made into a hyperparameter.	2021-08-09 18:04:42 +09:00
Paul O'Leary McCann	56803d3909	Change mention limit to match reference implementations. This generally means fewer spans are considered, which makes individual steps in training faster but can make training take longer to find the good spans.	2021-08-08 19:55:52 +09:00
Paul O'Leary McCann	1d1679d431	Minor speedup. This continue should be a break. The current form doesn't cause errors but using a break will be a bit faster.	2021-07-21 19:50:10 +09:00
Paul O'Leary McCann	8bd0474730	Run black	2021-07-18 20:20:22 +09:00
Paul O'Leary McCann	9b63cbb775	Add extract spans import	2021-07-15 18:16:53 +09:00
Paul O'Leary McCann	4a9dc00d86	Use relative indices for mentions. Was using batch absolute indices to manage mentions, but extract_spans expects doc-relative ones.	2021-07-14 18:36:18 +09:00
Paul O'Leary McCann	f1796e4af7	Fix mention list bug. There was an off-by-one error in how mentions are generated that would affect mentions at the end of a sentence. This was pretty nasty.	2021-07-14 18:19:00 +09:00
Adriane Boyd	f9fd2889b7	Use 0-vector for OOV lexemes (#8639)	2021-07-13 14:48:12 +10:00
Paul O'Leary McCann	c25ec292a9	Cleanup	2021-07-10 22:42:55 +09:00
Paul O'Leary McCann	e00bd422d9	Fix span embeds. Some of the lengths and backprop weren't right. Also various cleanup.	2021-07-10 21:38:53 +09:00
Paul O'Leary McCann	d7d317a1b5	Clean up span embedding code. This is now cleaner and significantly faster. There's still some messy parts in the code (particularly variable names), will get to that later.	2021-07-10 19:59:08 +09:00
Paul O'Leary McCann	dc1f974d39	Merge branch 'master' into feature/coref	2021-07-10 18:10:40 +09:00
Paul O'Leary McCann	f34915c1e8	Use scatter_add to speed up span embed backprop. This was the slowest part of the code, and using scatter_add here probably reduces the runtime by 50%.	2021-07-10 18:08:51 +09:00
Paul O'Leary McCann	d0b041aff4	Switch to using Thinc tuplify. The tuplify code here was added to Thinc proper and that's been released, so no need to have it here any more.	2021-07-08 16:08:36 +09:00
Sofie Van Landeghem	e7d747e3ee	TransitionBasedParser.v1 to legacy (#8586) * TransitionBasedParser.v1 to legacy * register sublayers * bump spacy-legacy to 3.0.7	2021-07-06 15:26:45 +02:00
Paul O'Leary McCann	eb5820b593	Improve take_vecs implementation. This pulls out references to needed bits so that other parts (the larger embeddings) can be freed before backprop.	2021-07-05 21:08:42 +09:00
Paul O'Leary McCann	13bef2ddb6	Add width prior feature. Not necessary for convergence, but in coref-hoi this seems to add a few f1 points. Note that there are two width-related features in coref-hoi. This is a "prior" that is added to mention scores. The other width-related feature is appended to the span embedding representation for other layers to reference.	2021-07-05 21:06:28 +09:00
Paul O'Leary McCann	8f66176b2d	Fix loss? This rewrites the loss to not use the Thinc crossentropy code at all. The main difference here is that the negative predictions are being masked out (= marginalized over), but negative gradient is still being reflected. I'm still not sure this is exactly right but models seem to train reliably now.	2021-07-05 18:17:10 +09:00
Paul O'Leary McCann	5db28ec2fd	Tweak mention limit calculation. The calculation of this in the coref-hoi code is hard to follow. Based on comments and variable names it sounds like it's using the doc length, but it might actually be the number of mentions? Number of mentions should be much larger and seems more correct, but might want to revisit this.	2021-07-03 21:13:32 +09:00
Paul O'Leary McCann	251a5b43ac	Minor fix in crossing spans code. I think this was technically incorrect but harmless. The reason the code here is different than the reference in coref-hoi is that the indices there are such that they get +1 at the end of processing, while the code here handles indices directly.	2021-07-03 18:41:46 +09:00
Paul O'Leary McCann	865caedebd	Remove XXX comment. Comment wondered if there should be some subtraction to avoid double counting, but it probably doesn't matter because the diagonal is 0.	2021-07-03 18:40:38 +09:00
Paul O'Leary McCann	d74fa82c80	Fix axis handling in topk. In practice this is only ever used with axis=1, so it wasn't causing issues, even though it was wrong.	2021-07-03 18:39:25 +09:00
Paul O'Leary McCann	f2e0e9dc28	Move placeholder handling into model code	2021-07-03 18:38:48 +09:00
Paul O'Leary McCann	3f66e18592	Clean up pw_prod loss. This doesn't change the math but makes the transposes slightly easier to understand (maybe?).	2021-07-03 18:33:17 +09:00
Ines Montani	7f65902702	Merge pull request #8522 from adrianeboyd/chore/update-flake8: Update flake8 version in reqs and CI	2021-06-28 21:46:06 +10:00
Adriane Boyd	5eeb25f043	Tidy up code	2021-06-28 12:08:15 +02:00
Adriane Boyd	4b0ed73ed4	Update flake8 version in reqs and CI * Update some unneeded forward refs related to flake8 checks	2021-06-28 11:29:36 +02:00
Paul O'Leary McCann	4f377d8de8	Fix bug in crossing span detection	2021-06-28 18:20:33 +09:00
Paul O'Leary McCann	23344857b9	Remove unused function	2021-06-28 18:19:43 +09:00
Matthew Honnibal	f9946154d9	Add SpanCategorizer component (#6747) * Draft spancat model * Add spancat model * Add test for extract_spans * Add extract_spans layer * Upd extract_spans * Add spancat model * Add test for spancat model * Upd spancat model * Update spancat component * Upd spancat * Update spancat model * Add quick spancat test * Import SpanCategorizer * Fix SpanCategorizer component * Import SpanGroup * Fix span extraction * Fix import * Fix import * Upd model * Update spancat models * Add scoring, update defaults * Update and add docs * Fix type * Update spacy/ml/extract_spans.py * Auto-format and fix import * Fix comment * Fix type * Fix type * Update website/docs/api/spancategorizer.md * Fix comment Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Better defense Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix labels list Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/ml/extract_spans.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/pipeline/spancat.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Set annotations during update * Set annotations in spancat * fix imports in test * Update spacy/pipeline/spancat.py * replace MaxoutLogistic with LinearLogistic * fix config * various small fixes * remove set_annotations parameter in update * use our beloved tupley format with recent support for doc.spans * bugfix to allow renaming the default span_key (scores weren't showing up) * use different key in docs example * change defaults to better-working parameters from project (WIP) * register spacy.extract_spans.v1 for legacy purposes * Upd dev version so can build wheel * layers instead of architectures for smaller building blocks * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Include additional scores from overrides in combined score weights * Parameterize spans key in scoring Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so that it's possible to evaluate multiple `spancat` components in the same pipeline. * Use the (intentionally very short) default spans key `sc` in the `SpanCategorizer` * Adjust the default score weights to include the default key * Adjust the scorer to use `spans_{spans_key}` as the prefix for the returned score * Revert addition of `attr_name` argument to `score_spans` and adjust the key in the `getter` instead. Note that for `spancat` components with a custom `span_key`, the score weights currently need to be modified manually in `[training.score_weights]` for them to be available during training. To suppress the default score weights `spans_sc_p/r/f` during training, set them to `null` in `[training.score_weights]`. * Update website/docs/api/scorer.md * Fix scorer for spans key containing underscore * Increment version * Add Spans to Evaluate CLI (#8439) * Add Spans to Evaluate CLI * Change to spans_key * Add spans per_type output Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Fix spancat GPU issues (#8455) * Fix GPU issues * Require thinc >=8.0.6 * Switch to glorot_uniform_init * Fix and test ngram suggester * Include final ngram in doc for all sizes * Fix ngrams for docs of the same length as ngram size * Handle batches of docs that result in no ngrams * Add tests Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Nirant <NirantK@users.noreply.github.com>	2021-06-24 12:35:27 +02:00
Paul O'Leary McCann	5c98c4c3b9	Probably fix pw prod backprop. I think this change is correct, but intuition doesn't really help here...	2021-06-17 21:23:00 +09:00
Paul O'Leary McCann	ccf561112a	Remove old comments	2021-06-17 21:22:17 +09:00
Paul O'Leary McCann	a62121e3b4	Expose more hyperparameters	2021-06-17 21:21:46 +09:00
Paul O'Leary McCann	848fd102e7	Small fix	2021-06-17 21:19:38 +09:00
Paul O'Leary McCann	fce804a79f	Minor optimization	2021-06-17 21:10:46 +09:00
Paul O'Leary McCann	cb2364cf83	Fix type of mask. The call here was creating a float64 array, which was turning many downstream scores into float64s. Later on these values were assigned to a float32 array in backprop, and numerical underflow caused things to go to zero. That's almost certainly not the only reason things go to zero, but it is incorrect.	2021-06-17 17:56:00 +09:00
Sofie Van Landeghem	e796aab4b3	Resizable textcat (#7862) * implement textcat resizing for TextCatCNN * resizing textcat in-place * simplify code * ensure predictions for old textcat labels remain the same after resizing (WIP) * fix for softmax * store softmax as attr * fix ensemble weight copy and cleanup * restructure slightly * adjust documentation, update tests and quickstart templates to use latest versions * extend unit test slightly * revert unnecessary edits * fix typo * ensemble architecture won't be resizable for now * use resizable layer (WIP) * revert using resizable layer * resizable container while avoid shape inference trouble * cleanup * ensure model continues training after resizing * use fill_b parameter * use fill_defaults * resize_layer callback * format * bump thinc to 8.0.4 * bump spacy-legacy to 3.0.6	2021-06-16 11:45:00 +02:00
Paul O'Leary McCann	8452d117ef	Fix typo, remove old comment	2021-06-13 19:42:55 +09:00
Paul O'Leary McCann	96be7e8858	Change topk to sort descending. Shouldn't change correctness but is a little clearer	2021-06-13 19:42:24 +09:00
Paul O'Leary McCann	d71198ed36	Replace squeeze with flatten. At a few points in the code it's normal to get a "2d" array where each row is a single entry. Calling squeeze will make that a proper 1d array... unless it's just one entry, in which case it turns into a 0d scalar. That's not what we want; flatten() provides the desired behavior.	2021-06-12 19:48:01 +09:00
Paul O'Leary McCann	e728b0e45d	Silence warning	2021-06-12 19:31:35 +09:00
Paul O'Leary McCann	7efbc721a1	Don't use is_sentenced	2021-06-12 19:29:27 +09:00
Paul O'Leary McCann	18444fccd9	Remove old comment	2021-06-04 17:56:08 +09:00
Paul O'Leary McCann	4a4ef72191	Clean up unused functions. `make_clean_doc` is not needed and was removed. `logsumexp` may be needed if I misunderstood the loss calculation, so I left it in for now with a note.	2021-06-02 21:42:23 +09:00
Vito De Tullio	3672464e25	applying suggestion to avoid mypy errors (#8265) * applying suggestion to avoid mypy errors * sign contributor agreement	2021-06-02 19:25:30 +10:00
svlandeg	0aa1083ce8	avoid repetitive entities in the output	2021-05-28 16:52:51 +02:00
svlandeg	0f5c586e2f	add basic tests for debugging	2021-05-28 14:19:55 +02:00
svlandeg	391b512afd	fix types of fwd functions	2021-05-27 16:36:46 +02:00
svlandeg	04b55bf054	removing unused imports	2021-05-27 16:31:38 +02:00
svlandeg	910026582d	set versions to v1 instead of v0	2021-05-27 16:17:20 +02:00
Paul O'Leary McCann	d6389b133d	Don't use a generator for no reason	2021-05-24 19:06:15 +09:00
Paul O'Leary McCann	d6fd5fe1c0	Minor cleanup	2021-05-24 14:56:43 +09:00
Paul O'Leary McCann	ff3fed06cf	Catch a stray reference	2021-05-20 21:30:46 +09:00
Paul O'Leary McCann	8c5df622d8	Help out python gc in coref backprop	2021-05-20 16:40:55 +09:00
Paul O'Leary McCann	fa92daf052	Break pairwise operations into pseudolayers. This makes their scope tighter and more contained, and has the nice side effect that fewer things need to be passed around for backprop.	2021-05-20 15:59:51 +09:00
Paul O'Leary McCann	0620820857	Deal with generators in tuplify	2021-05-18 19:55:52 +09:00
Paul O'Leary McCann	a7d9c8156d	Make get_sentence_map work with init. When sentences are not available, just treat the whole doc as one sentence. A reasonable general fallback, but important due to the init call, where upstream components aren't run.	2021-05-18 19:54:54 +09:00
Paul O'Leary McCann	883c137b26	Add basic tuplify init	2021-05-18 19:53:59 +09:00
Paul O'Leary McCann	051715506e	Fiddle with get_mentions definition. Ended up not making a difference, but oh well.	2021-05-18 19:53:33 +09:00
Paul O'Leary McCann	e303628205	Attempt to use registry correctly	2021-05-17 14:52:48 +09:00
Paul O'Leary McCann	91b111467b	Minor fixes	2021-05-17 14:52:30 +09:00
Paul O'Leary McCann	7c42a8c90a	Migrate coref code. This includes the coref code that was being tested separately, modified to work in spaCy. It hasn't been tested yet and presumably still needs fixes. In particular, the evaluation code is currently omitted. It's unclear at the moment whether we want to use a complex scorer similar to the official one, or a simpler scorer using more modern evaluation methods.	2021-05-15 21:36:10 +09:00
Paul O'Leary McCann	3608b7b3f9	Merge branch 'master' into feature/coref	2021-05-15 20:05:17 +09:00
Sofie Van Landeghem	e9037d8fc0	make EntityLinker robust for nO=None (#7930)	2021-05-06 18:14:47 +10:00
Adriane Boyd	d2bdaa7823	Replace negative rows with 0 in StaticVectors (#7674) * Replace negative rows with 0 in StaticVectors Replace negative row indices with 0-vectors in `StaticVectors`. * Increase versions related to StaticVectors * Increase versions of all architectures and layers related to `StaticVectors` * Improve efficiency of 0-vector operations Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5 * Update config defaults to new versions * Update docs	2021-04-22 18:04:15 +10:00
Sofie Van Landeghem	cd70c3cb79	Fixing pretrain (#7342) * initialize NLP with train corpus * add more pretraining tests * more tests * function to fetch tok2vec layer for pretraining * clarify parameter name * test different objectives * formatting * fix check for static vectors when using vectors objective * clarify docs * logger statement * fix init_tok2vec and proc.initialize order * test training after pretraining * add init_config tests for pretraining * pop pretraining block to avoid config validation errors * custom errors	2021-03-09 14:01:13 +11:00
Sofie Van Landeghem	e0c45c669a	Native coref component (#7243) * initial coref_er pipe * matcher more flexible * base coref component without actual model * initial setup of coref_er.score * rename to include_label * preliminary score_clusters method * apply scoring in coref component * IO fix * return None loss for now * rename to CoreferenceResolver * some preliminary unit tests * use registry as callable	2021-03-03 13:50:14 +01:00
svlandeg	d900c55061	consistently use registry as callable	2021-03-02 17:56:28 +01:00
René Octavio Queiroz Dias	59271e887a	fix: TransformerListener with TextCatEnsemble (#6951) * bug: Regression test Issue #6946 * fix: Fix issue #6946 * chore: Remove regression test	2021-02-06 13:44:51 +01:00
Matthew Honnibal	ffc371350a	Avoid assuming encode.get_dim('nO') is set in tok2vec (#6800)	2021-01-24 14:37:33 +11:00
Sofie Van Landeghem	c8761b0e6e	rewrite Maxout layer as separate layers to avoid shape inference trouble (#6760)	2021-01-19 07:37:17 +08:00
Adriane Boyd	26c34ab8b0	Fix parser resizing for cupy (#6758)	2021-01-18 20:43:15 +01:00
Matthew Honnibal	c2a18e4fa3	Update textcat ensemble model	2021-01-19 02:53:02 +11:00
Ines Montani	a203e3dbb8	Support spacy-legacy via the registry	2021-01-15 21:42:40 +11:00
Ines Montani	b0b743597c	Tidy up and auto-format	2021-01-15 11:57:36 +11:00
Sofie Van Landeghem	75d9019343	Fix types of Tok2Vec encoding architectures (#6442) * fix TorchBiLSTMEncoder documentation * ensure the types of the encoding Tok2vec layers are correct * update references from v1 to v2 for the new architectures	2021-01-07 16:39:27 +11:00
Sofie Van Landeghem	3983bc6b1e	Fix Transformer width in TextCatEnsemble (#6431) * add convenience method to determine tok2vec width in a model * fix transformer tok2vec dimensions in TextCatEnsemble architecture * init function should not be nested to avoid pickle issues	2021-01-06 12:44:04 +01:00
Ines Montani	991669c934	Tidy up and auto-format	2021-01-05 13:41:53 +11:00
Sofie Van Landeghem	282a3b49ea	Fix parser resizing when there is no upper layer (#6460) * allow resizing of the parser model even when upper=False * update from spacy.TransitionBasedParser.v1 to v2 * bugfix	2020-12-18 18:56:57 +08:00
Sofie Van Landeghem	f98a04434a	pretrain architectures (#6451) * define new architectures for the pretraining objective * add loss function as attr of the model * cleanup * cleanup * shorten name * fix typo * remove unused error	2020-12-08 14:41:03 +08:00
Sofie Van Landeghem	a0c899a0ff	Fix textcat + transformer architecture (#6371) * add pooling to textcat TransformerListener * maybe_get_dim in case it's null	2020-11-10 20:14:47 +08:00
Sofie Van Landeghem	75a202ce65	TextCat updates and fixes (#6263) * small fix in example imports * throw error when train_corpus or dev_corpus is not a string * small fix in custom logger example * limit macro_auc to labels with 2 annotations * fix typo * also create parents of output_dir if need be * update documentation of textcat scores * refactor TextCatEnsemble * fix tests for new AUC definition * bump to 3.0.0a42 * update docs * rename to spacy.TextCatEnsemble.v2 * spacy.TextCatEnsemble.v1 in legacy * cleanup * small fix * update to 3.0.0rc2 * fix import that got lost in merge * cursed IDE * fix two typos	2020-10-18 14:50:41 +02:00
svlandeg	40276fd3be	update NEL docs after latest refactor	2020-10-12 11:41:27 +02:00
svlandeg	08cb085f6c	Merge remote-tracking branch 'upstream/develop' into fix/various	2020-10-09 17:01:27 +02:00
svlandeg	040c7c0541	fix get_dim calls in build_simple_cnn_text_classifier	2020-10-09 15:40:58 +02:00
svlandeg	853edace37	fix MultiHashEmbed example in documentation	2020-10-09 14:11:06 +02:00
Adriane Boyd	39aabf50ab	Also rename to include_static_vectors in CharEmbed	2020-10-09 11:54:48 +02:00
Ines Montani	1a554bdcb1	Update docs and docstring [ci skip]	2020-10-05 21:55:27 +02:00
Ines Montani	9614e53b02	Tidy up and auto-format	2020-10-05 21:55:18 +02:00
Matthew Honnibal	e50047f1c5	Check lengths match	2020-10-05 20:02:45 +02:00
Matthew Honnibal	cdd2b79b6d	Remove deprecated MultiHashEmbed	2020-10-05 19:58:18 +02:00
Matthew Honnibal	6dcc4a0ba6	Simplify MultiHashEmbed signature	2020-10-05 19:57:45 +02:00
Matthew Honnibal	eb9ba61517	Format	2020-10-05 15:29:49 +02:00
Matthew Honnibal	8ec79ad3fa	Allow configuration of MultiHashEmbed features. Update arguments to MultiHashEmbed layer so that the attributes can be controlled. A kind of tricky scheme is used to allow optional specification of the rows. I think it's an okay balance between flexibility and convenience.	2020-10-05 15:22:00 +02:00
Ines Montani	bcd52e5486	Tidy up errors and warnings	2020-10-04 11:16:31 +02:00
Ines Montani	3bc3c05fcc	Tidy up and auto-format	2020-10-03 17:20:18 +02:00
svlandeg	02247cccaf	Merge remote-tracking branch 'upstream/develop' into feature/small-fixes	2020-10-02 20:48:11 +02:00
Matthew Honnibal	6965cdf16d	Fix comment	2020-10-02 17:26:21 +02:00
Matthew Honnibal	75a1569908	Merge	2020-10-01 23:07:53 +02:00
Matthew Honnibal	300e5a9928	Avoid relying on NORM in default v3 models (#6176) * Allow CharacterEmbed to specify feature * Default to LOWER in character embed * Update tok2vec * Use LOWER, not NORM	2020-10-01 23:05:55 +02:00
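
Several of the coref commits listed above concern how candidate mentions are enumerated: treating the whole doc as one sentence when no sentence boundaries are available, and leaving the inner loop with a break instead of a continue. The following is a minimal pure-Python sketch of that idea, not the code from the repository. The names `sentence_map` and `candidate_spans`, their signatures, the `early_exit` flag, and the `checks` counter are all hypothetical and exist only for illustration.

```python
from typing import List, Optional, Tuple


def sentence_map(n_tokens: int, sent_starts: Optional[List[int]] = None) -> List[int]:
    """Map each token index to a sentence index (hypothetical helper).

    If no sentence starts are known, fall back to treating the whole
    doc as a single sentence.
    """
    if not sent_starts:
        return [0] * n_tokens
    starts = set(sent_starts)
    out: List[int] = []
    sent_id = -1
    for i in range(n_tokens):
        # Open a new sentence at each known start (and at token 0).
        if i in starts or sent_id < 0:
            sent_id += 1
        out.append(sent_id)
    return out


def candidate_spans(
    sent_map: List[int], max_width: int, early_exit: bool = True
) -> Tuple[List[Tuple[int, int]], int]:
    """Enumerate (start, end) spans, end exclusive, within one sentence.

    Returns the spans and the number of boundary checks performed, so
    the two loop styles can be compared.
    """
    spans: List[Tuple[int, int]] = []
    checks = 0
    n = len(sent_map)
    for start in range(n):
        for end in range(start, min(start + max_width, n)):
            checks += 1
            if sent_map[end] != sent_map[start]:
                if early_exit:
                    # Sentence ids never decrease, so no later end can match.
                    break
                continue
            spans.append((start, end + 1))
    return spans, checks


if __name__ == "__main__":
    smap = sentence_map(5, [0, 3])
    print(smap)
    print(candidate_spans(smap, max_width=3, early_exit=True))
    print(candidate_spans(smap, max_width=3, early_exit=False))
```

Both loop styles yield the same spans. The early exit only skips comparisons that can no longer succeed, because once an end token falls in a later sentence, every later end token does too.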

219 Commits