spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-11-03 01:17:52 +03:00

Author	SHA1	Message	Date
Paul O'Leary McCann	5c98c4c3b9	Probably fix pw prod backprop I think this change is correct, but intuition doesn't really help here...	2021-06-17 21:23:00 +09:00
Paul O'Leary McCann	ccf561112a	Remove old comments	2021-06-17 21:22:17 +09:00
Paul O'Leary McCann	a62121e3b4	Expose more hyperparameters	2021-06-17 21:21:46 +09:00
Paul O'Leary McCann	848fd102e7	Small fix	2021-06-17 21:19:38 +09:00
Paul O'Leary McCann	fce804a79f	Minor optimization	2021-06-17 21:10:46 +09:00
Paul O'Leary McCann	cb2364cf83	Fix type of mask The call here was creating a float64 array, which was turning many downstream scores into float64s. Later on these values were assigned to a float32 array in backprop, and numerical underflow caused things to go to zero. That's almost certainly not the only reason things go to zero, but it is incorrect.	2021-06-17 17:56:00 +09:00
Paul O'Leary McCann	8452d117ef	Fix typo, remove old comment	2021-06-13 19:42:55 +09:00
Paul O'Leary McCann	96be7e8858	Change topk to sort descending Shouldn't change correctness but is a little clearer	2021-06-13 19:42:24 +09:00
Paul O'Leary McCann	d71198ed36	Replace squeeze with flatten At a few points in the code it's normal to get a "2d" array where each row is a single entry. Calling squeeze will make that a proper 1d array... unless it's just one entry, in which case it turns into a 0d scalar. That's not what we want; flatten() provides the desired behavior.	2021-06-12 19:48:01 +09:00
Paul O'Leary McCann	e728b0e45d	Silence warning	2021-06-12 19:31:35 +09:00
Paul O'Leary McCann	7efbc721a1	Don't use is_sentenced	2021-06-12 19:29:27 +09:00
Paul O'Leary McCann	67d9ebc922	Transpose before calculating loss	2021-06-04 17:56:08 +09:00
Paul O'Leary McCann	18444fccd9	Remove old comment	2021-06-04 17:56:08 +09:00
Paul O'Leary McCann	4a4ef72191	Clean up unused functions `make_clean_doc` is not needed and was removed. `logsumexp` may be needed if I misunderstood the loss calculation, so I left it in for now with a note.	2021-06-02 21:42:23 +09:00
svlandeg	0aa1083ce8	avoid repetitive entities in the output	2021-05-28 16:52:51 +02:00
svlandeg	0d81bce9cc	add failing test for too short a sentence	2021-05-28 15:10:35 +02:00
svlandeg	0f5c586e2f	add basic tests for debugging	2021-05-28 14:19:55 +02:00
svlandeg	391b512afd	fix types of fwd functions	2021-05-27 16:36:46 +02:00
svlandeg	04b55bf054	removing unused imports	2021-05-27 16:31:38 +02:00
svlandeg	910026582d	set versions to v1 instead of v0	2021-05-27 16:17:20 +02:00
svlandeg	2e3c0e2256	delete outdated tests	2021-05-27 13:54:31 +02:00
svlandeg	ba2e491cc4	Merge remote-tracking branch 'upstream/master' into feature/coref	2021-05-27 13:50:32 +02:00
Sofie Van Landeghem	3c58c0323f	fix docs (#8200)	2021-05-27 10:48:59 +02:00
Sofie Van Landeghem	290bd6ed39	ensure tolerance is properly passed on (#8158)	2021-05-27 18:10:28 +10:00
Paul O'Leary McCann	0c553ecd4e	Fix docs (fix #8189)	2021-05-24 19:47:30 +09:00
Paul O'Leary McCann	a484245f35	Remove references to coref_er	2021-05-24 19:08:45 +09:00
Paul O'Leary McCann	d6389b133d	Don't use a generator for no reason	2021-05-24 19:06:15 +09:00
Paul O'Leary McCann	d6fd5fe1c0	Minor cleanup	2021-05-24 14:56:43 +09:00
Paul O'Leary McCann	0942a0b51b	Remove coref_er.py The intent of this was that it would be a component pipeline that used entities as input, but that's now covered by the get_mentions function as a pipeline arg.	2021-05-21 18:20:25 +09:00
Paul O'Leary McCann	f6652c9252	Add new coref scoring This is closer to the traditional evaluation method. That uses an average of three scores, this is just using the bcubed metric for now (nothing special about bcubed, just picked one). The scoring implementation comes from the coval project. It relies on scipy, which is one issue, and is rather involved, which is another. Besides being comparable with traditional evaluations, this scoring is relatively fast.	2021-05-21 15:56:40 +09:00
Paul O'Leary McCann	e1b4a85bb9	Fix loss The loss was being returned as a single element array, which caused training to die when it attempted to turn it into JSON.	2021-05-21 15:46:50 +09:00
Paul O'Leary McCann	ff3fed06cf	Catch a stray reference	2021-05-20 21:30:46 +09:00
Sofie Van Landeghem	202943bc8c	KB & NEL to/from bytes (#8113) * unit test for pickling KB * add pickling test for NEL * KB to_bytes and from_bytes * NEL to_bytes and from_bytes * xfail pickle tests for now * fix docs * cleanup	2021-05-20 18:11:30 +10:00
Paul O'Leary McCann	8c5df622d8	Help out python gc in coref backprop	2021-05-20 16:40:55 +09:00
Paul O'Leary McCann	fa92daf052	Break pairwise operations into pseudolayers This makes their scope tighter and more contained, and has the nice side effect that fewer things need to be passed around for backprop.	2021-05-20 15:59:51 +09:00
Adriane Boyd	f6128c06b0	Disable GPU CI tests (#8143)	2021-05-19 12:00:07 +02:00
Paul O'Leary McCann	d22acee4f7	Fix backprop Training seems to actually run now!	2021-05-18 20:09:27 +09:00
Paul O'Leary McCann	2486b8ad4d	Fix pipeline intialize	2021-05-18 19:56:27 +09:00
Paul O'Leary McCann	0620820857	Deal with generators in tuplify	2021-05-18 19:55:52 +09:00
Paul O'Leary McCann	a7d9c8156d	Make get_sentence_map work with init When sentences are not available, just treat the whole doc as one sentence. A reasonable general fallback, but important due to the init call, where upstream components aren't run.	2021-05-18 19:54:54 +09:00
Paul O'Leary McCann	883c137b26	Add basic tuplify init	2021-05-18 19:53:59 +09:00
Paul O'Leary McCann	051715506e	Fiddle with get_mentions definition Ended up not making a difference, but oh well.	2021-05-18 19:53:33 +09:00
Adriane Boyd	06324e5a5e	Update pydantic requirements (#8127) Update pydantic requirements following https://github.com/explosion/thinc/pull/499	2021-05-18 11:35:50 +02:00
Paul O'Leary McCann	a33d29441a	Merge remote-tracking branch 'upstream/develop' into feature/coref	2021-05-18 17:00:17 +09:00
Adriane Boyd	6baab565eb	Minor updates to quickstart settings/instructions (#7965) * Minor updates to quickstart settings/instructions * set default value of textcat exclusive to `false` until the default checkbox behavior is updated * add the `morphologizer` to the list of components * add a note that v3.0.6+ is required * Switch to warning above quickstart * Undo changes to textcat default in quickstart Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-05-17 16:55:22 +02:00
Adriane Boyd	2c545c4c5b	Fix offsets in Span.get_lca_matrix (#8116) * Fix range in Span.get_lca_matrix Fix the adjusted token index / lca matrix index ranges for `_get_lca_matrix` for spans. * The range for `k` should correspond to the adjusted indices in `lca_matrix` with the `start` indexed at `0` * Update test for v3.x	2021-05-17 16:54:23 +02:00
Sofie Van Landeghem	0dffc5d9e2	Custom warning if the doc_bin is too large (#8069) * custom warning if the doc_bin is too large * cleanup * Update spacy/errors.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * fix numbering * fixing numbering once more * fixing this seems to be pretty hard Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-05-17 15:48:40 +02:00
Adriane Boyd	b120fb3511	Handle errors while multiprocessing (#8004) * Handle errors while multiprocessing Handle errors while multiprocessing without hanging. * Return the traceback for errors raised while processing a batch, which can be handled by the top-level error handler * Allow for shortened batches due to custom error handlers that ignore errors and skip documents * Define custom components at a higher level * Also move up custom error handler * Use simpler component for test * Switch error type * Adjust test * Only call top-level error handler for exceptions * Register custom test components within tests Use global functions (so they can be pickled) but register the components only within the individual tests.	2021-05-17 13:28:39 +02:00
Adriane Boyd	8a2602051c	Update debug data for textcat (#8066) * Check for unsupported cats values * Only show labels if train/dev mismatched * Don't show label counts (only counting positive labels seems odd) * Use warnings for mismatched train/dev labels	2021-05-17 13:27:04 +02:00
Adriane Boyd	1d59fdbd39	Update Vietnamese tokenizer (#8099) * Adapt tokenization methods from `pyvi` to preserve text encoding and whitespace * Add serialization support similar to Chinese and Japanese Note: as for Chinese and Japanese, some settings are duplicated in `config.cfg` and `tokenizer/cfg`.	2021-05-17 18:16:20 +10:00
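
Two of the fixes above (d71198ed36, "Replace squeeze with flatten", and cb2364cf83, "Fix type of mask") describe array behaviour that is easy to reproduce in isolation. The following is a minimal sketch using plain NumPy, not the code from those commits. The array names and values are made up for illustration, and the assumption that the mask came from a default-dtype constructor such as `np.zeros` is a guess; the commit message only says that "the call" produced a float64 array.

```python
import numpy as np

# --- squeeze vs. flatten (cf. d71198ed36) ---
# A "2d" array where each row holds a single entry.
two_rows = np.array([[0.1], [0.2]], dtype=np.float32)
one_row = np.array([[0.3]], dtype=np.float32)

# squeeze() removes every length-1 axis, so a single-row input
# collapses all the way down to a 0d scalar array.
squeezed_two = two_rows.squeeze()   # 1d, as intended
squeezed_one = one_row.squeeze()    # 0d, not what the caller wants

# flatten() always returns a 1d array, whatever the row count.
flat_two = two_rows.flatten()
flat_one = one_row.flatten()

# --- float64 mask promoting float32 scores (cf. cb2364cf83) ---
# Hypothetical scores and mask; only the dtypes matter here.
scores = np.ones(4, dtype=np.float32)

# NumPy constructors default to float64 when no dtype is given.
mask_default = np.zeros(4)
mask_float32 = np.zeros(4, dtype=np.float32)

# Combining a float32 array with a float64 array yields float64.
promoted = scores + mask_default
preserved = scores + mask_float32

print(squeezed_one.shape, flat_one.shape)
print(promoted.dtype, preserved.dtype)
```

Passing the dtype explicitly when building the mask keeps downstream values in float32, which is the kind of change the commit message describes.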

14560 Commits