spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-02-20 21:40:35 +03:00

Author	SHA1	Message	Date
Patrick J. Burns	ab4ba04c32	Update LatinDefaults for lang 'la' (#12538 ) * Add noun chunking to la syntax iterators * Expand list of numeral, ordinal words * Expand abbreviations in la tokenizer_exceptions * Add example sents * Update spacy/lang/la/syntax_iterators.py Reorganize la syntax iterators Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Minor updates based on review * fix call --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-04-20 16:55:40 +02:00
Adriane Boyd	b60b027927	Add default option to MorphAnalysis.get (#12545 ) * Add default to MorphAnalysis.get Similar to `dict`, allow a `default` option for `MorphAnalysis.get` for the user to provide a default return value if the field is not found. The default return value remains `[]`, which is not the same as `dict.get`, but is already established as this method's default return value with the return type `List[str]`. However the new `default` option does not enforce that the user-provided default is actually `List[str]`. * Restore test case	2023-04-20 14:06:32 +02:00
Adriane Boyd	dc0a1a9808	Load exceptions last in Tokenizer.from_bytes (#12553 ) In `Tokenizer.from_bytes`, the exceptions should be loaded last so that they are only processed once as part of loading the model. The exceptions are tokenized as phrase matcher patterns in the background and the internal tokenization needs to be synced with all the remaining tokenizer settings. If the exceptions are not loaded last, there are speed regressions for `Tokenizer.from_bytes/disk` vs. `Tokenizer.add_special_case` as the caches are reloaded more than necessary during deserialization.	2023-04-20 11:30:34 +02:00
Sofie Van Landeghem	8e6a3d58d8	fix typo (#12543 )	2023-04-19 10:59:33 +02:00
TAN Long	923d24e885	perf(REL_OP): Replace some token.children with token.rights or token.lefts (#12528 ) Co-authored-by: Tan Long <tanloong@foxmail.com>	2023-04-17 13:16:34 +02:00
Edward	de32011e4c	Add model-last saving mechanism to pretraining (#12459 ) * Adjust pretrain command * chane naming and add finally block * Add unit test * Add unit test assertions * Update spacy/training/pretrain.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * change finally block * Add to docs * Update website/docs/usage/embeddings-transformers.mdx * Add flag to skip saving model-last --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-04-03 15:24:03 +02:00
Adriane Boyd	4a1ec332de	Add Span.kb_id/Span.id strings to Doc/DocBin serialization if set (#12493 ) * Add Span.kb_id/Span.id strings to Doc/DocBin serialization if set * Format	2023-04-03 15:11:12 +02:00
Adriane Boyd	4538ceb507	Remove redundant strings.add for Doc.char_span (#12429 )	2023-04-03 11:38:56 +02:00
Adriane Boyd	69e20ce03d	Fix pickle for ngram suggester (#12486 )	2023-03-31 13:43:51 +02:00
Adriane Boyd	140d53649d	Convert values to numpy for label smoothing tests (#12472 )	2023-03-31 13:41:41 +02:00
Ye Lei (叶磊)	ce258670b7	Allow passing a Span to displacy.parse_deps (#12477 ) * Allow passing a Span to displacy.parse_deps * Update docstring Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update API docs --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-03-31 09:44:01 +02:00
Raphael Mitsch	d85df9d577	Fix Span.sents for edge case of Span being the only Span in the last sentence of a Doc. (#12484 )	2023-03-29 18:54:47 +02:00
kadarakos	372a90885e	Fix spancat-singlelabel score (#12469 ) * debug argmax sort and add span scores * add missing tests for spanscores	2023-03-29 08:38:11 +02:00
Adriane Boyd	fac457a509	Support floret for PretrainVectors (#12435 ) * Support floret for PretrainVectors * Format	2023-03-24 16:28:51 +01:00
Adriane Boyd	d0bd3f5ee4	Update Serbian tokenization for UD Serbian SET (#12442 )	2023-03-24 16:26:40 +01:00
Vinit Ravishankar	28de85737f	Tagger label smoothing (#12293 ) * add label smoothing * use True/False instead of floats * add entropy to debug data * formatting * docs * change test to check difference in distributions * Update website/docs/api/tagger.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/tagger.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * bool -> float * update docs * fix seed * black * update tests to use label_smoothing = 0.0 * set default to 0.0, update quickstart * Update spacy/pipeline/tagger.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * update morphologizer, tagger test * fix morph docs * add url to docs --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-03-22 12:17:56 +01:00
Raphael Mitsch	96b61d0671	Fix EL failure with sentence-crossing entities (#12398 ) * Add test reproducing EL failure in sentence-crossing entities. * Format. * Draft fix. * Format. * Fix case for len(ent.sents) == 1. * Format. * Format. * Format. * Fix mypy error. * Merge EL sentence crossing tests. * Remove unneeded sentencizer component. * Fix or ignore mypy issues in test. * Simplify ent.sents handling. * Format. Update assert in ent.sents handling. * Small rewrite --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-03-14 22:02:49 +01:00
Adriane Boyd	2ce9a220db	Fix --verbose for spacy find-threshold (#12418 )	2023-03-14 17:16:49 +01:00
Raphael Mitsch	e8cab4625c	Fix sentence indexing bug in `Span.sents` (#12405 ) * Add test for partial sentences in ent.sents. * Removed unneeded import. * Format. Simplify code.	2023-03-14 10:21:53 +01:00
Adriane Boyd	f27bce67fd	Skip project clone tests if git is not available (#12394 )	2023-03-09 16:41:21 +01:00
Lj Miranda	913d74f509	Add spancat_singlelabel pipeline for multiclass and non-overlapping span labelling tasks (#11365 ) * [wip] Update * [wip] Update * Add initial port * [wip] Update * Fix all imports * Add spancat_exclusive to pipeline * [WIP] Update * [ci skip] Add breakpoint for debugging * Use spacy.SpanCategorizer.v1 as default archi * Update spacy/pipeline/spancat_exclusive.py Co-authored-by: kadarakos <kadar.akos@gmail.com> * [ci skip] Small updates * Use Softmax v2 directly from thinc * Cache the label map * Fix mypy errors However, I ignored line 370 because it opened up a bunch of type errors that might be trickier to solve and might lead to a more complicated codebase. * avoid multiplication with 1.0 Co-authored-by: kadarakos <kadar.akos@gmail.com> * Update spacy/pipeline/spancat_exclusive.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update component versions to v2 * Add scorer to docstring * Add _n_labels property to SpanCategorizer Instead of using len(self.labels) in initialize() I am using a private property self._n_labels. This achieves implementation parity and allows me to delete the whole initialize() method for spancat_exclusive (since it's now the same with spancat). * Inherit from SpanCat instead of TrainablePipe This commit changes the inheritance structure of Exclusive_Spancat, now it's inheriting from SpanCategorizer than TrainablePipe. This allows me to remove duplicate methods that are already present in the parent function. * Revert documentation link to spancat * Fix init call for exclusive spancat * Update spacy/pipeline/spancat_exclusive.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Import Suggester from spancat * Include zero_init.v1 for spancat * Implement _allow_extra_label to use _n_labels To ensure that spancat / spancat_exclusive cannot be resized after initialization, I inherited the _allow_extra_label() method from spacy/pipeline/trainable_pipe.pyx and used self._n_labels instead of len(self.labels) for checking. I think that changing it locally is a better solution rather than forcing each class that inherits TrainablePipe to use the self._n_labels attribute. Also note that I turned-off black formatting in this block of code because it reads better without the overhang. * Extend existing tests to spancat_exclusive In this commit, I extended the existing tests for spancat to include spancat_exclusive. I parametrized the test functions with 'name' (similar var name with textcat and textcat_multilabel) for each applicable test. TODO: Add overfitting tests for spancat_exclusive * Update documentation for spancat * Turn on formatting for allow_extra_label * Remove initializers in default config * Use DEFAULT_EXCL_SPANCAT_MODEL I also renamed spancat_exclusive_default_config into spancat_excl_default_config because black does some not pretty formatting changes. * Update documentation Update grammar and usage Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Clarify docstring for Exclusive_SpanCategorizer * Remove mypy ignore and typecast labels to list * Fix documentation API * Use a single variable for tests * Update defaults for number of rows Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Put back initializers in spancat config Whenever I remove model.scorer.init_w and model.scorer.init_b, I encounter an error in the test: SystemError: <method '__getitem__' of 'dict' objects> returned a result with an error set. My Thinc version is 8.1.5, but I can't seem to check what's causing the error. * Update spancat_exclusive docstring * Remove init_W and init_B parameters This commit is expected to fail until the new Thinc release. * Require thinc>=8.1.6 for serializable Softmax defaults * Handle zero suggestions to make tests pass I'm not sure if this is the most elegant solution. But what should happen is that the _make_span_group function MUST return an empty SpanGroup if there are no suggestions. The error happens when the 'scores' variable is empty. We cannot get the 'predicted' and other downstream vars. * Better approach for handling zero suggestions * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spancategorizer headers * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Add default value in negative_weight in docs * Add default value in allow_overlap in docs * Update how spancat_exclusive is constructed In this commit, I added the following: - Put the default values of negative_weight and allow_overlap in the default_config dictionary. - Rename make_spancat -> make_exclusive_spancat * Run prettier on spancategorizer.mdx * Change exactly one -> at most one * Add suggester documentation in Exclusive_SpanCategorizer * Add suggester to spancat docstrings * merge multilabel and singlelabel spancat * rename spancat_exclusive to singlelable * wire up different make_spangroups for single and multilabel * black * black * add docstrings * more docstring and fix negative_label * don't rely on default arguments * black * remove spancat exclusive * replace single_label with add_negative_label and adjust inference * mypy * logical bug in configuration check * add spans.attrs[scores] * single label make_spangroup test * bugfix * black * tests for make_span_group with negative labels * refactor make_span_group * black * Update spacy/tests/pipeline/test_spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * remove duplicate declaration * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * raise error instead of just print * make label mapper private * update docs * run prettier * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * don't keep recomputing self._label_map for each span * typo in docs * Intervals to private and document 'name' param * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * add Tag to new features * replace tags * revert * revert * revert * revert * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * prettier * Fix merge * Update website/docs/api/spancategorizer.mdx * remove references to 'single_label' * remove old paragraph * Add spancat_singlelabel to config template * Format * Extend init config tests --------- Co-authored-by: kadarakos <kadar.akos@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-03-09 10:30:59 +01:00
Paul O'Leary McCann	e656189ec3	Change GPU efficient textcat to use CNN, not BOW in generated configs (#11900 ) * Change GPU efficient textcat to use CNN, not BOW If you generate a config with a textcat component using GPU (transformers), the defaut option (efficiency) uses a BOW architecture, which does not use tok2vec features. While that can make sense as part of a larger pipeline, in the case of just a transformer and a textcat, that means the transformer is doing a lot of work for no purpose. This changes it so that the CNN architecture is used instead. It could also be changed to be the same as the accuracy config, which uses the ensemble architecture. * Add the transformer when using a textcat with GPU * Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928) * Switch ubuntu-latest to ubuntu-20.04 in main tests * Only use 20.04 for 3.6 * Require thinc v8.1.7 * Require thinc v8.1.8 * Break up longer expression --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-03-07 17:47:45 +01:00
Sofie Van Landeghem	3bf4539e31	fix types (#12365 )	2023-03-07 13:29:08 +01:00
Adriane Boyd	260cb9c6fe	Raise error for non-default vectors with PretrainVectors (#12366 )	2023-03-06 18:06:31 +01:00
Adriane Boyd	5ecb3babed	Update to use absolute imports in tests (#12372 )	2023-03-06 17:30:17 +01:00
Adriane Boyd	0bbc620dd8	Partially work around pending deprecation of pkg_resources (#12368 ) * Handle deprecation of pkg_resources * Replace `pkg_resources` with `importlib_metadata` for `spacy info --url` * Remove requirements check from `spacy project` given the lack of alternatives * Fix installed model URL method and CI test * Fix types/handling, simplify catch-all return * Move imports instead of disabling requirements check * Format * Reenable test with ignored deprecation warning * Fix except * Fix return	2023-03-06 14:48:57 +01:00
Raphael Mitsch	6aa6b86d49	Make generation of empty `KnowledgeBase` instances configurable in `EntityLinker` (#12320 ) * Make empty_kb() configurable. * Format. * Update docs. * Be more specific in KB serialization test. * Update KB serialization tests. Update docs. * Remove doc update for batched candidate generation. * Fix serialization of subclassed KB in tests. * Format. * Update docstring. * Update docstring. * Switch from pickle to json for custom field serialization.	2023-03-01 16:02:55 +01:00
Sofie Van Landeghem	74cae47bf6	rely on is_empty property instead of __len__ (#12347 )	2023-03-01 12:06:07 +01:00
Adriane Boyd	8f058e39bd	Fix error message for displacy auto_select_port (#12343 )	2023-02-28 16:36:03 +01:00
TAN Long	071667376a	Add new REL_OPs: `>+`, `>-`, `<+`, and `<-` (#12334 ) * Add immediate left/right child/parent dependency relations * Add tests for new REL_OPs: `>+`, `>-`, `<+`, and `<-`. --------- Co-authored-by: Tan Long <tanloong@foxmail.com>	2023-02-28 14:36:33 +01:00
lise-brinck	e2de188cf1	Bugfix/swedish tokenizer (#12315 ) * add unittest for explosion#12311 * create punctuation.py for swedish * removed : from infixes in swedish punctuation.py * allow : as infix if succeeding char is uppercase	2023-02-27 10:53:45 +01:00
Kevin Humphreys	acdd993071	Matcher performance fix for extension predicates: use shared key function (#12272 ) * standardize predicate key format * single key function * Make optional args in key function keyword-only --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-02-27 08:35:08 +01:00
Paul O'Leary McCann	1e8bac99f3	Add tests for projects to master (#12303 ) * Add tests for projects to master * Fix git clone related issues on Windows * Add stat import	2023-02-23 10:22:57 +01:00
Adriane Boyd	80bc140533	Add grc to langs with lexeme norms in spacy-lookups-data (#12287 )	2023-02-16 17:57:02 +01:00
Edward	61b8454137	Adjust return type of `registry.find` (#12227 ) * Fix registry find return type * add dot * Add type ignore for mypy * update black formatting version * add mypy ignore to package cli * mypy type fix (for real) * Update find description in spacy/util.py Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> * adjust mypy directive --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-02-15 12:32:53 +01:00
Sofie Van Landeghem	79ef6cf0f9	Have logging calls use string formatting types (#12215 ) * change logging call for spacy.LookupsDataLoader.v1 * substitutions in language and _util * various more substitutions * add string formatting guidelines to contribution guidelines	2023-02-02 11:15:22 +01:00
Raphael Mitsch	02af17a5c8	Remove flaky assertions. (#12210 )	2023-01-31 16:52:06 +01:00
Adriane Boyd	606273f7e4	Normalize whitespace in evaluate CLI output test (#12157 ) * Normalize whitespace in evaluate CLI output test Depending on terminal settings, lines may be padded to the screen width so the comparison is too strict with only the command string replacement. * Move to test util method * Change to normalization method	2023-01-27 16:13:34 +01:00
Adriane Boyd	5f8a398bb9	Add span_id to Span.char_span, update Doc/Span.char_span docs (#12196 ) * Add span_id to Span.char_span, update Doc/Span.char_span docs `Span.char_span(id=)` should be removed in the future. * Also use Union[int, str] in Doc docstring	2023-01-27 15:09:17 +01:00
Simon Gurcke	774c10fa39	Add alignment_mode argument to Span.char_span() (#12145 ) * Add alignment_mode argument to Span.char_span() * Update website * Update spacy/tokens/span.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-01-27 11:43:40 +01:00
Peter Baumgartner	c68e6b8a96	`trainable_lemmatizer` in `debug data` (#11419 ) * WIP * rm ipython embeds * rm total * WIP * cleanup * cleanup + reword * rm component function * remove migration support form * fix reference dataset for dev data * additional fixes - set approach to identifying unique trees - adjust line length on messages - add logic for detecting docs without annotations * use 0 instead of none for no annotation * partial annotation support * initial tests for _compile_gold lemma attributes Using the example data from the edit tree lemmatizer tests for: - lemmatizer_trees - partial_lemma_annotations - n_low_cardinality_lemmas - no_lemma_annotations * adds output test for cli app * switch msg level * rm unclear uniqueness check * Revert "rm unclear uniqueness check" This reverts commit `6ea2b3524b`. * remove good message on uniqueness * formatting * use en_vocab fixture * clarify data set source in messages * remove unnecessary import Co-authored-by: svlandeg <svlandeg@github.com>	2023-01-26 17:36:50 +01:00
Daniël de Kok	8d69874afb	Add `spacy.PlainTextCorpusReader.v1` (#12122 ) * Add `spacy.PlainTextCorpusReader.v1` This is a corpus reader that reads plain text corpora with the following format: - UTF-8 encoding - One line per document. - Blank lines are ignored. It is useful for applications where we deal with very large corpora, such as distillation, and don't want to deal with the space overhead of serialized formats. Additionally, many large corpora already use such a text format, keeping the necessary preprocessing to a minimum. * Update spacy/training/corpus.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * docs: add version to `PlainTextCorpus` * Add docstring to registry function * Add plain text corpus tests * Only strip newline/carriage return * Add return type _string_to_tmp_file helper * Use a temporary directory in place of file name Different OS auto delete/sharing semantics are just wonky. * This will be new in 3.5.1 (rather than 4) * Test improvements from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-01-26 11:33:22 +01:00
Raphael Mitsch	950fceceb6	Make test_cli_find_threshold() more robust. (#12148 )	2023-01-23 14:42:33 +01:00
Richard Hudson	f9e020dd67	Fix speed problem with `top_k>1` on CPU in edit tree lemmatizer (#12017 ) * Refactor _scores2guesses * Handle arrays on GPU * Convert argmax result to raw integer Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * Use NumpyOps() to copy data to CPU Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * Changes based on review comments * Use different _scores2guesses depending on tree_k * Add tests for corner cases * Add empty line for consistency * Improve naming Co-authored-by: Daniël de Kok <me@github.danieldk.eu> * Improve naming Co-authored-by: Daniël de Kok <me@github.danieldk.eu> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2023-01-20 19:34:11 +01:00
Adriane Boyd	1e993d3b03	Merge pull request #12121 from adrianeboyd/chore/v3.5.0-2 Revert "Temporarily skip tests that require models/compat"	2023-01-19 15:59:30 +01:00
Adriane Boyd	3b8918e166	API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar (#12128 ) * API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar * adjust to mdx * linkout to InMemoryLookupKB at first occurrence in kb.mdx * fix links to docs * revert Azure trigger setting (I'll make a separate PR) Co-authored-by: svlandeg <svlandeg@github.com>	2023-01-19 13:29:17 +01:00
Adriane Boyd	dc0f527039	Revert "Temporarily skip tests that require models/compat" This reverts commit `378db0eb1e`.	2023-01-18 12:54:56 +01:00
Adriane Boyd	794cea6907	Fix comments and examples for levenshtein_compare (#12113 )	2023-01-18 08:02:33 +01:00
github-actions[bot]	9ef7d26032	Auto-format code with black (#12100 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2023-01-13 10:12:10 +01:00
Daniël de Kok	dda7331da3	Handle missing annotations in the edit tree lemmatizer (#12098 ) The losses/gradients of missing annotations were not correctly masked out. Fix this and check the masking in the partial data test.	2023-01-12 12:13:55 +01:00

1 2 3 4 5 ...

9281 Commits