spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-08 17:51:16 +03:00

Author	SHA1	Message	Date
Adriane Boyd	357fdd4871	Load exceptions last in Tokenizer.from_bytes (#12553 ) In `Tokenizer.from_bytes`, the exceptions should be loaded last so that they are only processed once as part of loading the model. The exceptions are tokenized as phrase matcher patterns in the background and the internal tokenization needs to be synced with all the remaining tokenizer settings. If the exceptions are not loaded last, there are speed regressions for `Tokenizer.from_bytes/disk` vs. `Tokenizer.add_special_case` as the caches are reloaded more than necessary during deserialization.	2023-05-12 09:55:22 +02:00
Sofie Van Landeghem	7bf1db87ad	fix typo (#12543 )	2023-05-12 09:55:22 +02:00
TAN Long	b0e5aed5ed	perf(REL_OP): Replace some token.children with token.rights or token.lefts (#12528 ) Co-authored-by: Tan Long <tanloong@foxmail.com>	2023-05-12 09:55:22 +02:00
TAN Long	6be67db59f	docs(REL_OP): modify docs for REL_OPs to match Semgrex's update on CoreNLP v4.5.2 (#12531 ) Co-authored-by: Tan Long <tanloong@foxmail.com>	2023-05-12 09:55:22 +02:00
andyjessen	18a2a88a95	Add category to spaCy project (#12506 ) ScispaCy fits within biomedical domain. Consider adding this category.	2023-05-12 09:55:22 +02:00
Adriane Boyd	aea4a96f92	Set version to v3.5.2 (#12508 )	2023-04-06 17:30:39 +02:00
Adriane Boyd	e4bbdf7b50	Merge pull request #12494 from adrianeboyd/backport/v3.5.2-1 Backports for v3.5.2	2023-04-06 16:18:59 +02:00
Madeesh Kannan	f66d55fe5b	`Docs`: Fix rule-based matching example that expands named entities (#12495 )	2023-04-06 11:48:04 +02:00
Edward	9fbb8ee912	Add more information to custom code docs (#12491 ) * Add info to sections * Update website/docs/usage/training.mdx --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-04-06 11:48:04 +02:00
Will Frey	314a7cea73	Fix invalid ConsoleLogger.v3 example config (#12498 ) Replace `progress_bar = "all_steps"` with `progress_bar = "eval"`, which is consistent with the default behavior for `spacy.ConsoleLogger.v1` and `spacy.ConsoleLogger.v2`.	2023-04-06 11:48:04 +02:00
Edward	2fbd080a03	Add model-last saving mechanism to pretraining (#12459 ) * Adjust pretrain command * change naming and add finally block * Add unit test * Add unit test assertions * Update spacy/training/pretrain.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * change finally block * Add to docs * Update website/docs/usage/embeddings-transformers.mdx * Add flag to skip saving model-last --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-04-03 15:28:52 +02:00
Adriane Boyd	bbf232e355	Add Span.kb_id/Span.id strings to Doc/DocBin serialization if set (#12493 ) * Add Span.kb_id/Span.id strings to Doc/DocBin serialization if set * Format	2023-04-03 15:28:52 +02:00
Adriane Boyd	0ec4dc5c29	Remove redundant strings.add for Doc.char_span (#12429 )	2023-04-03 15:28:52 +02:00
Adriane Boyd	a5406a6c45	Allow cupy 12.0 for extras (#12490 )	2023-04-03 15:28:52 +02:00
Adriane Boyd	57ee1212de	Fix pickle for ngram suggester (#12486 )	2023-04-03 15:28:52 +02:00
Ye Lei (叶磊)	b228875600	Allow passing a Span to displacy.parse_deps (#12477 ) * Allow passing a Span to displacy.parse_deps * Update docstring Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update API docs --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-04-03 15:28:52 +02:00
Raphael Mitsch	8d064872ff	Fix Span.sents for edge case of Span being the only Span in the last sentence of a Doc. (#12484 )	2023-04-03 15:28:52 +02:00
kadarakos	26da226a39	Fix spancat-singlelabel score (#12469 ) * debug argmax sort and add span scores * add missing tests for spanscores	2023-04-03 15:28:52 +02:00
Edward	888332dfb2	Add info to stringstore and vocab (#12471 )	2023-04-03 15:28:52 +02:00
Adriane Boyd	1b4a67bc54	Restrict github workflows to explosion (#12470 )	2023-04-03 15:28:52 +02:00
sloev / Johannes Valbjørn	79dcef17f7	add spacy_onnx_sentiment_english to universe (#12422 ) * add spacy_onnx_sentiment_english to universe * rename to sentimental-onix * fix comma json error * fix typo * typo fix Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * mention need to download model before example works Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-04-03 15:28:52 +02:00
Prajakta Darade	0ecbeff1a6	corrected example code (#12466 )	2023-04-03 15:28:52 +02:00
kadarakos	4380d750f9	add explanation about overwriting behaviour (#12464 ) * add explanation about overwriting behaviour * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * format --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-04-03 15:28:52 +02:00
Adriane Boyd	2953e7b7ce	Support floret for PretrainVectors (#12435 ) * Support floret for PretrainVectors * Format	2023-04-03 15:28:52 +02:00
Ines Montani	d2d9e9e139	Add user survey alert to the top (#12452 ) * Add user survey alert to the top * Shorter --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-04-03 15:28:52 +02:00
Adriane Boyd	f1a42b6fcc	CI: Separate spacy universe validation into a separate workflow (#12440 ) * Separate spacy universe validation into a separate workflow * Fix new workflow name	2023-04-03 15:28:52 +02:00
Adriane Boyd	f9c0220ea5	CI: Switch PR back to paths-ignore (#12438 ) Switch PR tests back to paths-ignore but include changes to `.github` for all PRs rather than trying to figure out complicated includes+excludes. Changes to `.github` are relatively rare and should not be a huge burden for the CI.	2023-04-03 15:28:52 +02:00
Adriane Boyd	6183906a0b	Remove autoblack workflow (#12437 ) Now that all PRs have `black` formatting validation, we no longer need the autoblack workflow.	2023-04-03 15:28:52 +02:00
Raphael Mitsch	bd0768c05c	Fix EL failure with sentence-crossing entities (#12398 ) * Add test reproducing EL failure in sentence-crossing entities. * Format. * Draft fix. * Format. * Fix case for len(ent.sents) == 1. * Format. * Format. * Format. * Fix mypy error. * Merge EL sentence crossing tests. * Remove unneeded sentencizer component. * Fix or ignore mypy issues in test. * Simplify ent.sents handling. * Format. Update assert in ent.sents handling. * Small rewrite --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-04-03 15:28:52 +02:00
Adriane Boyd	be644caa13	Fix --verbose for spacy find-threshold (#12418 )	2023-04-03 15:28:52 +02:00
Adriane Boyd	7880da952b	CI: Add all paths before excluding patterns (#12419 )	2023-04-03 15:28:52 +02:00
Raphael Mitsch	545218a7d9	Fix sentence indexing bug in `Span.sents` (#12405 ) * Add test for partial sentences in ent.sents. * Removed unneeded import. * Format. Simplify code.	2023-04-03 15:28:52 +02:00
Adriane Boyd	d00e58d1ac	CI: Move CLI tests to ubuntu for speed (#12409 )	2023-04-03 15:28:52 +02:00
Adriane Boyd	9ca67dc539	Fix thinc-apple-ops test to run for python 3.11 (#12408 )	2023-04-03 15:28:52 +02:00
Adriane Boyd	ed83cafe46	CI: Move universe validation to validate job (#12406 ) * CI: Move universe validation to validate job * Fix indentation * Update step name	2023-04-03 15:28:52 +02:00
Adriane Boyd	9da333cbfa	Add GHA for CI tests (#12403 ) * Add GHA for CI tests * Reorder paths	2023-04-03 15:28:52 +02:00
Adriane Boyd	8153bd573f	Merge pull request #12395 from adrianeboyd/backport/v3.5.1-2 Skip project clone tests if git is not available (#12394)	2023-03-09 17:45:32 +01:00
Adriane Boyd	83056bb44c	Skip project clone tests if git is not available (#12394 )	2023-03-09 16:42:33 +01:00
Adriane Boyd	03b320b3bd	Set version to v3.5.1 (#12393 )	2023-03-09 12:40:28 +01:00
Adriane Boyd	c2810575c0	Merge pull request #12351 from adrianeboyd/backport/v3.5.1-1 Backports for v3.5.1	2023-03-09 11:29:51 +01:00
Lj Miranda	53687b5bca	Add spancat_singlelabel pipeline for multiclass and non-overlapping span labelling tasks (#11365 ) * [wip] Update * [wip] Update * Add initial port * [wip] Update * Fix all imports * Add spancat_exclusive to pipeline * [WIP] Update * [ci skip] Add breakpoint for debugging * Use spacy.SpanCategorizer.v1 as default archi * Update spacy/pipeline/spancat_exclusive.py Co-authored-by: kadarakos <kadar.akos@gmail.com> * [ci skip] Small updates * Use Softmax v2 directly from thinc * Cache the label map * Fix mypy errors However, I ignored line 370 because it opened up a bunch of type errors that might be trickier to solve and might lead to a more complicated codebase. * avoid multiplication with 1.0 Co-authored-by: kadarakos <kadar.akos@gmail.com> * Update spacy/pipeline/spancat_exclusive.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update component versions to v2 * Add scorer to docstring * Add _n_labels property to SpanCategorizer Instead of using len(self.labels) in initialize() I am using a private property self._n_labels. This achieves implementation parity and allows me to delete the whole initialize() method for spancat_exclusive (since it's now the same with spancat). * Inherit from SpanCat instead of TrainablePipe This commit changes the inheritance structure of Exclusive_Spancat, now it's inheriting from SpanCategorizer than TrainablePipe. This allows me to remove duplicate methods that are already present in the parent function. 
* Revert documentation link to spancat * Fix init call for exclusive spancat * Update spacy/pipeline/spancat_exclusive.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Import Suggester from spancat * Include zero_init.v1 for spancat * Implement _allow_extra_label to use _n_labels To ensure that spancat / spancat_exclusive cannot be resized after initialization, I inherited the _allow_extra_label() method from spacy/pipeline/trainable_pipe.pyx and used self._n_labels instead of len(self.labels) for checking. I think that changing it locally is a better solution rather than forcing each class that inherits TrainablePipe to use the self._n_labels attribute. Also note that I turned-off black formatting in this block of code because it reads better without the overhang. * Extend existing tests to spancat_exclusive In this commit, I extended the existing tests for spancat to include spancat_exclusive. I parametrized the test functions with 'name' (similar var name with textcat and textcat_multilabel) for each applicable test. TODO: Add overfitting tests for spancat_exclusive * Update documentation for spancat * Turn on formatting for allow_extra_label * Remove initializers in default config * Use DEFAULT_EXCL_SPANCAT_MODEL I also renamed spancat_exclusive_default_config into spancat_excl_default_config because black does some not pretty formatting changes. * Update documentation Update grammar and usage Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Clarify docstring for Exclusive_SpanCategorizer * Remove mypy ignore and typecast labels to list * Fix documentation API * Use a single variable for tests * Update defaults for number of rows Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Put back initializers in spancat config Whenever I remove model.scorer.init_w and model.scorer.init_b, I encounter an error in the test: SystemError: <method '__getitem__' of 'dict' objects> returned a result with an error set. 
My Thinc version is 8.1.5, but I can't seem to check what's causing the error. * Update spancat_exclusive docstring * Remove init_W and init_B parameters This commit is expected to fail until the new Thinc release. * Require thinc>=8.1.6 for serializable Softmax defaults * Handle zero suggestions to make tests pass I'm not sure if this is the most elegant solution. But what should happen is that the _make_span_group function MUST return an empty SpanGroup if there are no suggestions. The error happens when the 'scores' variable is empty. We cannot get the 'predicted' and other downstream vars. * Better approach for handling zero suggestions * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spancategorizer headers * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Add default value in negative_weight in docs * Add default value in allow_overlap in docs * Update how spancat_exclusive is constructed In this commit, I added the following: - Put the default values of negative_weight and allow_overlap in the default_config dictionary. 
- Rename make_spancat -> make_exclusive_spancat * Run prettier on spancategorizer.mdx * Change exactly one -> at most one * Add suggester documentation in Exclusive_SpanCategorizer * Add suggester to spancat docstrings * merge multilabel and singlelabel spancat * rename spancat_exclusive to singlelable * wire up different make_spangroups for single and multilabel * black * black * add docstrings * more docstring and fix negative_label * don't rely on default arguments * black * remove spancat exclusive * replace single_label with add_negative_label and adjust inference * mypy * logical bug in configuration check * add spans.attrs[scores] * single label make_spangroup test * bugfix * black * tests for make_span_group with negative labels * refactor make_span_group * black * Update spacy/tests/pipeline/test_spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * remove duplicate declaration * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * raise error instead of just print * make label mapper private * update docs * run prettier * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * don't keep recomputing self._label_map for each span * typo in docs * Intervals to private and document 'name' param * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * add Tag to new features * replace tags * revert * revert * 
revert * revert * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * prettier * Fix merge * Update website/docs/api/spancategorizer.mdx * remove references to 'single_label' * remove old paragraph * Add spancat_singlelabel to config template * Format * Extend init config tests --------- Co-authored-by: kadarakos <kadar.akos@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-03-09 10:33:16 +01:00
Victoria	5398e9f276	Add links in website and readme for survey (#12385 )	2023-03-09 10:33:08 +01:00
Marcus Blättermann	69ca6eb041	Make sure to run Python setup before NPM dev mode (#12384 )	2023-03-09 10:33:00 +01:00
Paul O'Leary McCann	cbd85c9608	Change GPU efficient textcat to use CNN, not BOW in generated configs (#11900 ) * Change GPU efficient textcat to use CNN, not BOW If you generate a config with a textcat component using GPU (transformers), the default option (efficiency) uses a BOW architecture, which does not use tok2vec features. While that can make sense as part of a larger pipeline, in the case of just a transformer and a textcat, that means the transformer is doing a lot of work for no purpose. This changes it so that the CNN architecture is used instead. It could also be changed to be the same as the accuracy config, which uses the ensemble architecture. * Add the transformer when using a textcat with GPU * Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928) * Switch ubuntu-latest to ubuntu-20.04 in main tests * Only use 20.04 for 3.6 * Require thinc v8.1.7 * Require thinc v8.1.8 * Break up longer expression --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-03-09 10:32:51 +01:00
Sofie Van Landeghem	a1fc4ed962	fix types (#12365 )	2023-03-09 10:32:33 +01:00
Adriane Boyd	6177c87539	Raise error for non-default vectors with PretrainVectors (#12366 )	2023-03-09 10:32:22 +01:00
Adriane Boyd	a86ec1b2b1	Update to use absolute imports in tests (#12372 )	2023-03-09 10:32:12 +01:00
Adriane Boyd	e381efd936	Partially work around pending deprecation of pkg_resources (#12368 ) * Handle deprecation of pkg_resources * Replace `pkg_resources` with `importlib_metadata` for `spacy info --url` * Remove requirements check from `spacy project` given the lack of alternatives * Fix installed model URL method and CI test * Fix types/handling, simplify catch-all return * Move imports instead of disabling requirements check * Format * Reenable test with ignored deprecation warning * Fix except * Fix return	2023-03-09 10:32:01 +01:00
Raphael Mitsch	6f1632b3e9	Make generation of empty `KnowledgeBase` instances configurable in `EntityLinker` (#12320 ) * Make empty_kb() configurable. * Format. * Update docs. * Be more specific in KB serialization test. * Update KB serialization tests. Update docs. * Remove doc update for batched candidate generation. * Fix serialization of subclassed KB in tests. * Format. * Update docstring. * Update docstring. * Switch from pickle to json for custom field serialization.	2023-03-01 17:33:31 +01:00
kadarakos	e325de3ff8	Displacy doc fix (#12352 ) * more details for color setting * more details for color setting * prettier	2023-03-01 17:33:31 +01:00

15885 Commits