spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-12-03 08:14:20 +03:00

Author	SHA1	Message	Date
svlandeg	015050f42c	Merge branch 'master' into feature/coref	2022-05-25 13:01:56 +02:00
Paul O'Leary McCann	6087da9675	Suggestions from code review, cleanup, typing	2022-05-25 19:11:48 +09:00
Richard Hudson	32954c3bcb	Fix issues for Mypy 0.950 and Pydantic 1.9.0 (#10786) * Make changes to typing * Correction * Format with black * Corrections based on review * Bumped Thinc dependency version * Bumped blis requirement * Correction for older Python versions * Update spacy/ml/models/textcat.py Co-authored-by: Daniël de Kok <me@github.danieldk.eu> * Corrections based on review feedback * Readd deleted docstring line Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2022-05-25 09:33:54 +02:00
Paul O'Leary McCann	6be09bbd07	Fix Entity Linker with tokenization mismatches (fix #9575) (#10457) * Add failing test * Partial fix for issue This kind of works. The issue with token length mismatches is gone. The problem is that when you get empty lists of encodings to compare, it fails because the sizes are not the same, even though they're both zero: (0, 3) vs (0,). Not sure why that happens... * Short circuit on empties * Remove spurious check The check here isn't needed now the the short circuit is fixed. * Update spacy/tests/pipeline/test_entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Use "eg", not "example" Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-05-23 20:42:26 +02:00
kadarakos	1dc3894447	new parameters	2022-05-17 15:36:32 +00:00
kadarakos	403fb95d56	merge	2022-05-17 06:56:34 +00:00
Paul O'Leary McCann	2e8f0e9168	Rename coref params	2022-05-16 16:50:10 +09:00
kadarakos	b7ac4b33e2	fixing arguments	2022-05-11 14:59:59 +00:00
kadarakos	7cf6bcca0e	merge misery	2022-05-10 17:19:16 +00:00
Paul O'Leary McCann	33f4f90ff0	Formatting	2022-05-10 19:09:52 +09:00
Paul O'Leary McCann	f852c5cea4	Split span predictor component into its own file This runs. The imports in both of the split files could probably use a close check to remove extras.	2022-05-10 18:53:45 +09:00
Raphael Mitsch	f5390e278a	Refactor error messages to remove hardcoded strings (#10729) * Use custom error msg instead of hardcoded string: replaced remaining hardcoded error message strings. * Use custom error msg instead of hardcoded string: fixing faulty Errors import.	2022-05-02 13:38:46 +02:00
Paul O'Leary McCann	683f470852	Merge branch 'master' into feature/coref	2022-04-18 18:39:08 +09:00
Paul O'Leary McCann	afd255c0ed	Undo multiply by 100 This was mistaken, not sure why my score seemed to be off before.	2022-04-14 18:42:09 +09:00
Paul O'Leary McCann	08729e0fbd	Remove end adjustment The difference in environments was due to a change in Thinc, the code here is fine.	2022-04-14 18:31:30 +09:00
Paul O'Leary McCann	8181d4570c	Multiply accuracy by 100 This seems to match with the scorer expectations better	2022-04-14 15:56:38 +09:00
Paul O'Leary McCann	e8af02700f	Remove all coref scoring exept LEA This is necessary because one of the three old methods relied on scipy for some complex problem solving. LEA is generally better for evaluations. The downside is that this means evaluations aren't comparable with many papers, but canonical scoring can be supported using external eval scripts or other methods.	2022-04-13 21:02:18 +09:00
Paul O'Leary McCann	2300f4df3d	Fix span score logging	2022-04-13 20:37:06 +09:00
Paul O'Leary McCann	d470fa03c1	Adjust end indices It's not clear if this is technically correct or not but it won't run without it for me.	2022-04-13 20:19:21 +09:00
kadarakos	b53113e3b8	Preparing span predictor for predicting from gold (#10547) Note this is squashed because rebasing had conflicts. * remove unnecessary .device * span predictor debug start * gearing up SpanPredictor for gold-heads * merge SpanPredictor attributes * remove useless extra prefix and device from spanpredictor * make sure predicted and reference keeps aligned * handle empty head_ids * handle empty clusters * addressing suggestions by @polm * nicer restore * fix score overwriting bug * prepare for aligned heads-spans training * span accuracy score * update with eg.predited as other components * add backprop callback to spanpredictor * report start- and end-accuracies separately * fixing scorer Co-authored-by: Kádár Ákos <akos@onyx.uvt.nl>	2022-04-13 19:42:49 +09:00
Kádár Ákos	6aedd98d02	fixing scorer	2022-04-11 16:10:14 +02:00
Kádár Ákos	7a239f2ec7	report start- and end-accuracies separately	2022-04-08 14:57:19 +02:00
Kádár Ákos	3ba913109d	update with eg.predited as other components	2022-04-07 13:20:12 +02:00
Kádár Ákos	ef141ad399	span accuracy score	2022-04-04 18:10:09 +02:00
Kádár Ákos	a1d0219903	prepare for aligned heads-spans training	2022-04-04 15:26:15 +02:00
Daniël de Kok	c90dd6f265	Alignment: use a simplified ragged type for performance (#10319) * Alignment: use a simplified ragged type for performance This introduces the AlignmentArray type, which is a simplified version of Ragged that performs better on the simple(r) indexing performed for alignment. * AlignmentArray: raise an error when using unsupported index * AlignmentArray: move error messages to Errors * AlignmentArray: remove simlified ... with simplifications * AlignmentArray: fix typo that broke a[n:n] indexing	2022-04-01 09:02:06 +02:00
Kádár Ákos	63a41ba50a	fix score overwriting bug	2022-03-30 17:28:20 +02:00
Kádár Ákos	7ff99a3acc	nicer restore	2022-03-28 18:16:41 +02:00
Kádár Ákos	06d680b269	addressing suggestions by @polm	2022-03-28 14:31:51 +02:00
Kádár Ákos	e4b4b67ef6	handle empty clusters	2022-03-28 11:29:00 +02:00
Adriane Boyd	85778dfcf4	Add edit tree lemmatizer (#10231) * Add edit tree lemmatizer Co-authored-by: Daniël de Kok <me@danieldk.eu> * Hide edit tree lemmatizer labels * Use relative imports * Switch to single quotes in error message * Type annotation fixes Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Reformat edit_tree_lemmatizer with black * EditTreeLemmatizer.predict: take Iterable Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Validate edit trees during deserialization This change also changes the serialized representation. Rather than mirroring the deep C structure, we use a simple flat union of the match and substitution node types. * Move edit_trees to _edit_tree_internals * Fix invalid edit tree format error message * edit_tree_lemmatizer: remove outdated TODO comment * Rename factory name to trainable_lemmatizer * Ignore type instead of casting truths to List[Union[Ints1d, Floats2d, List[int], List[str]]] for thinc v8.0.14 * Switch to Tagger.v2 * Add documentation for EditTreeLemmatizer * docs: Fix 3.2 -> 3.3 somewhere * trainable_lemmatizer documentation fixes * docs: EditTreeLemmatizer is in edit_tree_lemmatizer.py Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Daniël de Kok <me@github.danieldk.eu> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-28 11:13:50 +02:00
Kádár Ákos	7304604edd	make sure predicted and reference keeps aligned	2022-03-25 18:29:33 +01:00
Kádár Ákos	83ac0477c8	remove useless extra prefix and device from spanpredictor	2022-03-24 16:44:50 +01:00
Kádár Ákos	706b2e6f25	gearing up SpanPredictor for gold-heads	2022-03-24 16:06:20 +01:00
Kádár Ákos	1eaf8fb0cf	span predictor debug start	2022-03-23 11:24:27 +01:00
Paul O'Leary McCann	2190cbc0e6	Add progress on SpanPredictor component This isn't working. There is a CUDA error in the torch code during initialization and it's not clear why.	2022-03-19 19:39:49 +09:00
Paul O'Leary McCann	a098849112	Add fake batching The way fake batching works is that the pipeline component calls the model repeatedly in a loop internally. It feels like this should break something, but it worked in testing. Another issue is that this changes the signature of some of the pipeline functions, though I don't think that's an issue. Tested with batch size of 2, so more testing is needed, but this is a start.	2022-03-18 19:46:58 +09:00
Paul O'Leary McCann	1a79d18796	Formatting	2022-03-16 20:10:47 +09:00
Paul O'Leary McCann	6855df0e66	Skeleton for span predictor component This should be moved into its own file, but for now just stubbing out the methods.	2022-03-16 20:09:33 +09:00
Paul O'Leary McCann	7811a1194b	Change architecture	2022-03-16 14:57:15 +09:00
Daniël de Kok	e5debc68e4	Tagger: use unnormalized probabilities for inference (#10197) * Tagger: use unnormalized probabilities for inference Using unnormalized softmax avoids use of the relatively expensive exp function, which can significantly speed up non-transformer models (e.g. I got a speedup of 27% on a German tagging + parsing pipeline). * Add spacy.Tagger.v2 with configurable normalization Normalization of probabilities is disabled by default to improve performance. * Update documentation, models, and tests to spacy.Tagger.v2 * Move Tagger.v1 to spacy-legacy * docs/architectures: run prettier * Unnormalized softmax is now a Softmax_v2 option * Require thinc 8.0.14 and spacy-legacy 3.0.9	2022-03-15 14:15:31 +01:00
Paul O'Leary McCann	55039a66ad	Remove old default config	2022-03-15 19:53:09 +09:00
Paul O'Leary McCann	17d017a177	Remove span2head This doesn't work as a component because it needs to modify gold data, so instead it's a conversion script (in another repo).	2022-03-15 19:52:20 +09:00
Paul O'Leary McCann	0522a43116	Make span2head component	2022-03-15 19:19:15 +09:00
Edward	2eef47dd26	Save span candidates produced by spancat suggesters (#10413) * Add save_candidates attribute * Change spancat api * Add unit test * reimplement method to produce a list of doc * Add method to docs * Add new version tag * Add intended use to docstring * prettier formatting	2022-03-14 16:46:58 +01:00
Paul O'Leary McCann	dfec6993d6	Training works now	2022-03-14 19:27:23 +09:00
Paul O'Leary McCann	8eadf3781b	Training runs now Evaluation needs fixing, and code still needs cleanup.	2022-03-14 19:02:17 +09:00
Paul O'Leary McCann	d22a002641	Forward/backward pass works Evaluate does not work - predict hasn't been updated	2022-03-14 17:26:27 +09:00
Paul O'Leary McCann	91acc3ea75	Fix entity linker batching (#9669) * Partial fix of entity linker batching * Add import * Better name * Add `use_gold_ents` option, docs * Change to v2, create stub v1, update docs etc. * Fix error type Honestly no idea what the right type to use here is. ConfigValidationError seems wrong. Maybe a NotImplementedError? * Make mypy happy * Add hacky fix for init issue * Add legacy pipeline entity linker * Fix references to class name * Add __init__.py for legacy * Attempted fix for loss issue * Remove placeholder V1 * formatting * slightly more interesting train data * Handle batches with no usable examples This adds a test for batches that have docs but not entities, and a check in the component that detects such cases and skips the update step as thought the batch were empty. * Remove todo about data verification Check for empty data was moved further up so this should be OK now - the case in question shouldn't be possible. * Fix gradient calculation The model doesn't know which entities are not in the kb, so it generates embeddings for the context of all of them. However, the loss does know which entities aren't in the kb, and it ignores them, as there's no sensible gradient. This has the issue that the gradient will not be calculated for some of the input embeddings, which causes a dimension mismatch in backprop. That should have caused a clear error, but with numpyops it was causing nans to happen, which is another problem that should be addressed separately. This commit changes the loss to give a zero gradient for entities not in the kb. * add failing test for v1 EL legacy architecture * Add nasty but simple working check for legacy arch * Clarify why init hack works the way it does * Clarify use_gold_ents use case * Fix use gold ents related handling * Add tests for no gold ents and fix other tests * Use aligned ents function (not working) This doesn't actually work because the "aligned" ents are gold-only. But if I have a different function that returns the intersection, then this will work as desired. * Use proper matching ent check This changes the process when gold ents are not used so that the intersection of ents in the pred and gold is used. * Move get_matching_ents to Example * Use model attribute to check for legacy arch * Rename flag * bump spacy-legacy to lower 3.0.9 Co-authored-by: svlandeg <svlandeg@github.com>	2022-03-04 09:17:36 +01:00
kadarakos	249b97184d	Bugfixes and test for rehearse (#10347) * fixing argument order for rehearse * rehearse test for ner and tagger * rehearse bugfix * added test for parser * test for multilabel textcat * rehearse fix * remove debug line * Update spacy/tests/training/test_rehearse.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/tests/training/test_rehearse.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Kádár Ákos <akos@onyx.uvt.nl> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-02-23 16:10:05 +01:00

580 Commits