spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-12-06 17:54:21 +03:00

Lj Miranda 53687b5bca Add spancat_singlelabel pipeline for multiclass and non-overlapping span labelling tasks (#11365)	2023-03-09 10:33:16 +01:00

* [wip] Update
* [wip] Update
* Add initial port
* [wip] Update
* Fix all imports
* Add spancat_exclusive to pipeline
* [WIP] Update
* [ci skip] Add breakpoint for debugging
* Use spacy.SpanCategorizer.v1 as default archi
* Update spacy/pipeline/spancat_exclusive.py
  Co-authored-by: kadarakos <kadar.akos@gmail.com>
* [ci skip] Small updates
* Use Softmax v2 directly from thinc
* Cache the label map
* Fix mypy errors
  However, I ignored line 370 because it opened up a bunch of type errors that might be trickier to solve and might lead to a more complicated codebase.
* avoid multiplication with 1.0
  Co-authored-by: kadarakos <kadar.akos@gmail.com>
* Update spacy/pipeline/spancat_exclusive.py
  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update component versions to v2
* Add scorer to docstring
* Add _n_labels property to SpanCategorizer
  Instead of using len(self.labels) in initialize() I am using a private property self._n_labels. This achieves implementation parity and allows me to delete the whole initialize() method for spancat_exclusive (since it's now the same with spancat).
* Inherit from SpanCat instead of TrainablePipe
  This commit changes the inheritance structure of Exclusive_Spancat, now it's inheriting from SpanCategorizer than TrainablePipe. This allows me to remove duplicate methods that are already present in the parent function.
* Revert documentation link to spancat
* Fix init call for exclusive spancat
* Update spacy/pipeline/spancat_exclusive.py
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Import Suggester from spancat
* Include zero_init.v1 for spancat
* Implement _allow_extra_label to use _n_labels
  To ensure that spancat / spancat_exclusive cannot be resized after initialization, I inherited the _allow_extra_label() method from spacy/pipeline/trainable_pipe.pyx and used self._n_labels instead of len(self.labels) for checking. I think that changing it locally is a better solution rather than forcing each class that inherits TrainablePipe to use the self._n_labels attribute. Also note that I turned-off black formatting in this block of code because it reads better without the overhang.
* Extend existing tests to spancat_exclusive
  In this commit, I extended the existing tests for spancat to include spancat_exclusive. I parametrized the test functions with 'name' (similar var name with textcat and textcat_multilabel) for each applicable test.
  TODO: Add overfitting tests for spancat_exclusive
* Update documentation for spancat
* Turn on formatting for allow_extra_label
* Remove initializers in default config
* Use DEFAULT_EXCL_SPANCAT_MODEL
  I also renamed spancat_exclusive_default_config into spancat_excl_default_config because black does some not pretty formatting changes.
* Update documentation
  Update grammar and usage
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Clarify docstring for Exclusive_SpanCategorizer
* Remove mypy ignore and typecast labels to list
* Fix documentation API
* Use a single variable for tests
* Update defaults for number of rows
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Put back initializers in spancat config
  Whenever I remove model.scorer.init_w and model.scorer.init_b, I encounter an error in the test: SystemError: <method '__getitem__' of 'dict' objects> returned a result with an error set. My Thinc version is 8.1.5, but I can't seem to check what's causing the error.
* Update spancat_exclusive docstring
* Remove init_W and init_B parameters
  This commit is expected to fail until the new Thinc release.
* Require thinc>=8.1.6 for serializable Softmax defaults
* Handle zero suggestions to make tests pass
  I'm not sure if this is the most elegant solution. But what should happen is that the _make_span_group function MUST return an empty SpanGroup if there are no suggestions. The error happens when the 'scores' variable is empty. We cannot get the 'predicted' and other downstream vars.
* Better approach for handling zero suggestions
* Update website/docs/api/spancategorizer.md
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spancategorizer headers
* Apply suggestions from code review
  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add default value in negative_weight in docs
* Add default value in allow_overlap in docs
* Update how spancat_exclusive is constructed
  In this commit, I added the following:
  - Put the default values of negative_weight and allow_overlap in the default_config dictionary.
  - Rename make_spancat -> make_exclusive_spancat
* Run prettier on spancategorizer.mdx
* Change exactly one -> at most one
* Add suggester documentation in Exclusive_SpanCategorizer
* Add suggester to spancat docstrings
* merge multilabel and singlelabel spancat
* rename spancat_exclusive to singlelable
* wire up different make_spangroups for single and multilabel
* black
* black
* add docstrings
* more docstring and fix negative_label
* don't rely on default arguments
* black
* remove spancat exclusive
* replace single_label with add_negative_label and adjust inference
* mypy
* logical bug in configuration check
* add spans.attrs[scores]
* single label make_spangroup test
* bugfix
* black
* tests for make_span_group with negative labels
* refactor make_span_group
* black
* Update spacy/tests/pipeline/test_spancat.py
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* remove duplicate declaration
* Update spacy/pipeline/spancat.py
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* raise error instead of just print
* make label mapper private
* update docs
* run prettier
* Update website/docs/api/spancategorizer.mdx
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update website/docs/api/spancategorizer.mdx
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/pipeline/spancat.py
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/pipeline/spancat.py
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/pipeline/spancat.py
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/pipeline/spancat.py
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* don't keep recomputing self._label_map for each span
* typo in docs
* Intervals to private and document 'name' param
* Update spacy/pipeline/spancat.py
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/pipeline/spancat.py
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* add Tag to new features
* replace tags
* revert
* revert
* revert
* revert
* Update website/docs/api/spancategorizer.mdx
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update website/docs/api/spancategorizer.mdx
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* prettier
* Fix merge
* Update website/docs/api/spancategorizer.mdx
* remove references to 'single_label'
* remove old paragraph
* Add spancat_singlelabel to config template
* Format
* Extend init config tests

---------

Co-authored-by: kadarakos <kadar.akos@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
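The commit message above describes single-label span prediction: each suggested span gets at most one label, a negative label can mean "no span here", overlaps can be disallowed, and zero suggestions must yield an empty result. The following is a minimal sketch of that selection logic in plain Python, not spaCy's actual implementation. The function name `pick_single_labels`, its parameters, and the tuple-based return type are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Span = Tuple[int, int]  # (start_token, end_token), end exclusive
Labelled = Tuple[int, int, str, float]  # (start, end, label, score)


def pick_single_labels(
    spans: Sequence[Span],
    scores: Sequence[Sequence[float]],
    labels: Sequence[str],
    negative_index: Optional[int] = None,  # hypothetical: column of the "no label" class
    allow_overlap: bool = True,
) -> List[Labelled]:
    """Assign at most one label per candidate span (illustrative only)."""
    # No suggestions: return an empty result instead of failing on empty scores.
    if not spans:
        return []

    candidates: List[Labelled] = []
    for (start, end), row in zip(spans, scores):
        # Take the highest-scoring class for this span.
        best = max(range(len(row)), key=row.__getitem__)
        # If the negative class wins, the span gets no label at all.
        if best == negative_index:
            continue
        candidates.append((start, end, labels[best], row[best]))

    # Prefer confident predictions when resolving overlaps.
    candidates.sort(key=lambda c: c[3], reverse=True)
    if allow_overlap:
        return candidates

    kept: List[Labelled] = []
    for cand in candidates:
        # Keep a span only if it does not intersect any already-kept span.
        if all(cand[1] <= k[0] or cand[0] >= k[1] for k in kept):
            kept.append(cand)
    return kept


if __name__ == "__main__":
    spans = [(0, 2), (1, 3), (4, 5)]
    scores = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.1, 0.2, 0.7]]
    print(pick_single_labels(spans, scores, ["PER", "ORG"], negative_index=2,
                             allow_overlap=False))
```

In this sketch the negative class is assumed to occupy one column of the score matrix, so `labels` has one entry fewer than each score row. Sorting by score before the overlap check means the most confident span wins when two candidates intersect.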
_edit_tree_internals	Refactor error messages to remove hardcoded strings (#10729)	2022-05-02 13:38:46 +02:00
_parser_internals	account for NER labels with a hyphen in the name (#10960)	2022-06-17 20:02:37 +01:00
legacy	Refactor KB for easier customization (#11268)	2022-09-08 10:38:07 +02:00
__init__.py	Add SpanRuler component (#9880)	2022-06-02 13:12:53 +02:00
attributeruler.py	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1	2021-10-26 11:53:50 +02:00
dep_parser.pyx	account for NER labels with a hyphen in the name (#10960)	2022-06-17 20:02:37 +01:00
edit_tree_lemmatizer.py	Fix speed problem with `top_k>1` on CPU in edit tree lemmatizer (#12017)	2023-01-20 19:34:11 +01:00
entity_linker.py	Make generation of empty `KnowledgeBase` instances configurable in `EntityLinker` (#12320)	2023-03-01 17:33:31 +01:00
entityruler.py	Enable fuzzy text matching in Matcher (#11359)	2023-01-10 10:36:17 +01:00
functions.py	Add doc_cleaner component (#9659)	2021-11-23 15:33:33 +01:00
lemmatizer.py	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1	2021-10-26 11:53:50 +02:00
morphologizer.pyx	Tagger: use unnormalized probabilities for inference (#10197)	2022-03-15 14:15:31 +01:00
multitask.pyx	Replace negative rows with 0 in StaticVectors (#7674)	2021-04-22 18:04:15 +10:00
ner.pyx	account for NER labels with a hyphen in the name (#10960)	2022-06-17 20:02:37 +01:00
pipe.pxd	TrainablePipe (#6213)	2020-10-08 21:33:49 +02:00
pipe.pyi	Add Pipe.hide_labels to omit labels from pipeline meta (#10175)	2022-02-05 17:59:24 +01:00
pipe.pyx	Fix config validation failures caused by NVTX pipeline wrappers (#11460)	2022-09-12 14:55:41 +02:00
sentencizer.pyx	Add overwrite settings for more components (#9050)	2021-09-30 15:35:55 +02:00
senter.pyx	Tagger: use unnormalized probabilities for inference (#10197)	2022-03-15 14:15:31 +01:00
span_ruler.py	Enable fuzzy text matching in Matcher (#11359)	2023-01-10 10:36:17 +01:00
spancat.py	Add spancat_singlelabel pipeline for multiclass and non-overlapping span labelling tasks (#11365)	2023-03-09 10:33:16 +01:00
tagger.pyx	Tagger: use unnormalized probabilities for inference (#10197)	2022-03-15 14:15:31 +01:00
textcat_multilabel.py	Improve score_cats for use with multiple textcat components (#11820)	2023-01-09 11:43:48 +01:00
textcat.py	Rename modified textcat scorer to v2 (#11971)	2022-12-29 14:01:08 +01:00
tok2vec.py	Prevent tok2vec to broadcast to listeners when predicting (#11385)	2022-09-12 15:36:48 +02:00
trainable_pipe.pxd	Refactor scoring methods to use registered functions (#8766)	2021-08-10 15:13:39 +02:00
trainable_pipe.pyx	Fix config validation failures caused by NVTX pipeline wrappers (#11460)	2022-09-12 14:55:41 +02:00
transition_parser.pxd	Parser: use C saxpy/sgemm provided by the Ops implementation (#10773)	2022-05-27 11:20:52 +02:00
transition_parser.pyx	precompute_hiddens/Parser: do not look up CPU ops (3.4) (#11069)	2022-07-05 10:53:42 +02:00