spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-11 12:18:04 +03:00

Author	SHA1	Message	Date
Adriane Boyd	8548d4d16e	Merge remote-tracking branch 'upstream/master' into update-v4-from-master-1	2023-01-27 08:29:09 +01:00
Richard Hudson	f9e020dd67	Fix speed problem with `top_k>1` on CPU in edit tree lemmatizer (#12017 ) * Refactor _scores2guesses * Handle arrays on GPU * Convert argmax result to raw integer Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * Use NumpyOps() to copy data to CPU Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * Changes based on review comments * Use different _scores2guesses depending on tree_k * Add tests for corner cases * Add empty line for consistency * Improve naming Co-authored-by: Daniël de Kok <me@github.danieldk.eu> * Improve naming Co-authored-by: Daniël de Kok <me@github.danieldk.eu> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2023-01-20 19:34:11 +01:00
Daniël de Kok	5e297aa20e	Add `TrainablePipe.{distill,get_teacher_student_loss}` (#12016 ) * Add `TrainablePipe.{distill,get_teacher_student_loss}` This change adds two methods: - `TrainablePipe::distill` which performs a training step of a student pipe on a teacher pipe, giving a batch of `Doc`s. - `TrainablePipe::get_teacher_student_loss` computes the loss of a student relative to the teacher. The `distill` or `get_teacher_student_loss` methods are also implemented in the tagger, edit tree lemmatizer, and parser pipes, to enable distillation in those pipes and as an example for other pipes. * Fix stray `Beam` import * Fix incorrect import * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * TrainablePipe.distill: use `Iterable[Example]` * Add Pipe.is_distillable method * Add `validate_distillation_examples` This first calls `validate_examples` and then checks that the student/teacher tokens are the same. * Update distill documentation * Add distill documentation for all pipes that support distillation * Fix incorrect identifier * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Add comment to explain `is_distillable` Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-01-16 10:25:53 +01:00
Daniël de Kok	dda7331da3	Handle missing annotations in the edit tree lemmatizer (#12098 ) The losses/gradients of missing annotations were not correctly masked out. Fix this and check the masking in the partial data test.	2023-01-12 12:13:55 +01:00
Daniël de Kok	207565a788	Merge remote-tracking branch 'upstream/master' into chore/v4-merge-master-20221222	2022-12-22 10:08:54 +01:00
Daniël de Kok	f9308aae13	Fix v4 branch to build against Thinc v9 (#11921 ) * Move `thinc.extra.search` to `spacy.pipeline._parser_internals` Backport of: https://github.com/explosion/spaCy/pull/11317 Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * Replace references to `thinc.backends.linalg` with `CBlas` Backport of: https://github.com/explosion/spaCy/pull/11292 Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * Use cross entropy from `thinc.legacy` * Require thinc>=9.0.0.dev0,<9.1.0 Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>	2022-12-17 14:32:19 +01:00
Daniël de Kok	27fac7df2e	EditTreeLemmatizer: correctly add strings when initializing from labels (#11934 ) Strings in replacement nodes where not added to the `StringStore` when `EditTreeLemmatizer` was initialized from a set of labels. The corresponding test did not capture this because it added the strings through the examples that were passed to the initialization. This change fixes both this bug in the initialization as the 'shadowing' of the bug in the test.	2022-12-07 13:53:41 +09:00
Adriane Boyd	103b24fb25	Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master	2022-10-21 09:13:32 +02:00
github-actions[bot]	ceb62352bf	Auto-format code with black (#11649 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-10-14 18:04:55 +09:00
Sofie Van Landeghem	29649589fc	remove dtype (#11615 )	2022-10-11 15:25:05 +02:00
Sofie Van Landeghem	ef74f8f5e4	Fix mypy error in edittree lemmatizer (#11612 ) * cleanup imports * try limiting Thinc to previous release * remove Model specification * fix code and revert Thinc constraint	2022-10-11 14:15:22 +02:00
Daniël de Kok	efdbb722c5	Store activations in `Doc`s when `save_activations` is enabled (#11002 ) * Store activations in Doc when `store_activations` is enabled This change adds the new `activations` attribute to `Doc`. This attribute can be used by trainable pipes to store their activations, probabilities, and guesses for downstream users. As an example, this change modifies the `tagger` and `senter` pipes to add an `store_activations` option. When this option is enabled, the probabilities and guesses are stored in `set_annotations`. * Change type of `store_activations` to `Union[bool, List[str]]` When the value is: - A bool: all activations are stored when set to `True`. - A List[str]: the activations named in the list are stored * Formatting fixes in Tagger * Support store_activations in spancat and morphologizer * Make Doc.activations type visible to MyPy * textcat/textcat_multilabel: add store_activations option * trainable_lemmatizer/entity_linker: add store_activations option * parser/ner: do not currently support returning activations * Extend tagger and senter tests So that they, like the other tests, also check that we get no activations if no activations were requested. * Document `Doc.activations` and `store_activations` in the relevant pipes * Start errors/warnings at higher numbers to avoid merge conflicts Between the master and v4 branches. * Add `store_activations` to docstrings. * Replace store_activations setter by set_store_activations method Setters that take a different type than what the getter returns are still problematic for MyPy. Replace the setter by a method, so that type inference works everywhere. * Use dict comprehension suggested by @svlandeg * Revert "Use dict comprehension suggested by @svlandeg" This reverts commit `6e7b958f70`. * EntityLinker: add type annotations to _add_activations * _store_activations: make kwarg-only, remove doc_scores_lens arg * set_annotations: add type annotations * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * TextCat.predict: return dict * Make the `TrainablePipe.store_activations` property a bool This means that we can also bring back `store_activations` setter. * Remove `TrainablePipe.activations` We do not need to enumerate the activations anymore since `store_activations` is `bool`. * Add type annotations for activations in predict/set_annotations * Rename `TrainablePipe.store_activations` to `save_activations` * Error E1400 is not used anymore This error was used when activations were still `Union[bool, List[str]]`. * Change wording in API docs after store -> save change * docs: tag (save_)activations as new in spaCy 4.0 * Fix copied line in morphologizer activations test * Don't train in any test_save_activations test * Rename activations - "probs" -> "probabilities" - "guesses" -> "label_ids", except in the edit tree lemmatizer, where "guesses" -> "tree_ids". * Remove unused W400 warning. This warning was used when we still allowed the user to specify which activations to save. * Formatting fixes Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Replace "kb_ids" by a constant * spancat: replace a cast by an assertion * Fix EOF spacing * Fix comments in test_save_activations tests * Do not set RNG seed in activation saving tests * Revert "spancat: replace a cast by an assertion" This reverts commit `0bd5730d16`. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-09-13 09:51:12 +02:00
Richard Hudson	32954c3bcb	Fix issues for Mypy 0.950 and Pydantic 1.9.0 (#10786 ) * Make changes to typing * Correction * Format with black * Corrections based on review * Bumped Thinc dependency version * Bumped blis requirement * Correction for older Python versions * Update spacy/ml/models/textcat.py Co-authored-by: Daniël de Kok <me@github.danieldk.eu> * Corrections based on review feedback * Readd deleted docstring line Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2022-05-25 09:33:54 +02:00
Adriane Boyd	85778dfcf4	Add edit tree lemmatizer (#10231 ) * Add edit tree lemmatizer Co-authored-by: Daniël de Kok <me@danieldk.eu> * Hide edit tree lemmatizer labels * Use relative imports * Switch to single quotes in error message * Type annotation fixes Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Reformat edit_tree_lemmatizer with black * EditTreeLemmatizer.predict: take Iterable Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Validate edit trees during deserialization This change also changes the serialized representation. Rather than mirroring the deep C structure, we use a simple flat union of the match and substitution node types. * Move edit_trees to _edit_tree_internals * Fix invalid edit tree format error message * edit_tree_lemmatizer: remove outdated TODO comment * Rename factory name to trainable_lemmatizer * Ignore type instead of casting truths to List[Union[Ints1d, Floats2d, List[int], List[str]]] for thinc v8.0.14 * Switch to Tagger.v2 * Add documentation for EditTreeLemmatizer * docs: Fix 3.2 -> 3.3 somewhere * trainable_lemmatizer documentation fixes * docs: EditTreeLemmatizer is in edit_tree_lemmatizer.py Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Daniël de Kok <me@github.danieldk.eu> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-28 11:13:50 +02:00

14 Commits