spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-17 07:31:59 +03:00

Author	SHA1	Message	Date
Sofie Van Landeghem	d65e3c31a6	use system-independent commands (#12693 )	2023-06-08 11:43:36 +02:00
Adriane Boyd	0f9d2b01fb	Set version v3.6.0.dev1 (#12703 )	2023-06-07 16:23:14 +02:00
kadarakos	c003aac29a	SpanFinder into spaCy from experimental (#12507 ) * span finder integrated into spacy from experimental * black * isort * black * default spankey constant * black * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * rename * rename * max_length and min_length as Optional[int] and strict checking * black * mypy fix for integer type infinity * revert line order * implement all comparison operators for inf int * avoid two for loops over all docs by not precomputing * interleave thresholding with span creation * black * revert to not interleaving (realized it's faster) * black * Update spacy/errors.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * update docstring * enforce that the gold and predicted documents have the same text * new error for ensuring reference and predicted texts are the same * remove todo * adjust test * black * handle misaligned tokenization * return correct variable * failing overfit test * only use a single spans_key like in spancat * black * remove debug lines * typo * remove comment * remove near duplicate redundant method * use the 'spans_key' variable name everywhere * Update spacy/pipeline/span_finder.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * flaky test fix suggestion, hand set bias terms * only test suggester and test result exhaustively * make it clear that the span_finder_suggester is more general (not specific to span_finder) * Update spacy/tests/pipeline/test_span_finder.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Apply suggestions from code review * remove question comment * move preset_spans_suggester test to spancat tests * Add docs and unify default configs for spancat and span finder * Add `allow_overlap=True` to span finder scorer * Fix offset bug in set_annotations * Ignore labels in span finder scorer * Format * Add span_finder to quickstart template * Move settings to self.cfg, store min/max unset as None * Remove debugging * Update docstrings and docs * Update spacy/pipeline/span_finder.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix imports --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-06-07 15:52:28 +02:00
Basile Dura	c3c064ace4	fix: `InitializableComponent` type hints (#12692 ) * fix: InitializableComponent type hints * fix: avoid circular dependency * style: clean imports in language.py * style: use relative imports Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * fix: apply black --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-06-02 14:29:52 +02:00
Adriane Boyd	c4112a1da3	Require that all SpanGroup spans are from the current doc (#12569 ) * Require that all SpanGroup spans are from the current doc The restriction on only adding spans from the current doc was already implemented for all operations except for `SpanGroup.__init__`. Initialize copied spans for `SpanGroup.copy` with `Doc.char_span` in order to validate the character offsets and to make it possible to copy spans between documents with differing tokenization. Currently there is no validation that the document texts are identical, but the span char offsets must be valid spans in the target doc, which prevents you from ending up with completely invalid spans. * Undo change in test_beam_overfitting_IO	2023-06-01 19:19:17 +02:00
Adriane Boyd	c936db2faf	Address numpy 1.25 deprecations in test suite (#12684 ) * Address upcoming numpy v1.25 deprecations in test suite * Temporarily test most recent numpy prerelease in CI * Revert "Temporarily test most recent numpy prerelease in CI" This reverts commit `d75a66e55e`.	2023-05-31 17:23:07 +02:00
Basile Dura	6ea4155487	feat: add comparison operators in `span.pyi` (#12652 ) * feat: add comparison operators in span.pyi remove Cython-specific `__richcmp__` * fix: comparison operators should be defined for any other object	2023-05-23 08:50:37 +02:00
Basile Dura	95fd46b1dd	feat: add type hinting on SpanGroup.__iter__ (#12642 )	2023-05-17 14:20:00 +02:00
Sani	873c16a4df	Malay language support (#12602 ) * add malay lang * fix token len * black format * reformat conftest malay * remove exceptions not exist in dbp * format code	2023-05-17 12:45:21 +02:00
Adriane Boyd	3637148c4d	Add scorer option to return per-component scores (#12540 ) * Add scorer option to return per-component scores Add `per_component` option to `Language.evaluate` and `Scorer.score` to return scores keyed by `tokenizer` (hard-coded) or by component name. Add option to `evaluate` CLI to score by component. Per-component scores can only be saved to JSON. * Update help text and messages	2023-05-12 15:36:54 +02:00
Adriane Boyd	b5af0fe836	Revert "Use Latin normalization for Serbian attrs (#12608 )" (#12621 ) This reverts commit `6f314f99c4`. We are reverting this until we can support this normalization more consistently across vectors, training corpora, and lemmatizer data.	2023-05-11 11:54:16 +02:00
Adriane Boyd	1279b464bb	In initialize only calculate current vectors hash if needed (#12607 )	2023-05-08 16:51:58 +02:00
Adriane Boyd	6f314f99c4	Use Latin normalization for Serbian attrs (#12608 ) * Use Latin normalization for Serbian attrs Use Latin normalization for Serbian `NORM`, `PREFIX`, and `SUFFIX`. * Update NORMs in tokenizer exceptions and related tests * Add tests for all custom lex attrs * Remove unused imports	2023-05-08 12:33:56 +02:00
Adriane Boyd	fbd12eb4a4	Set version to v3.6.0.dev0	2023-05-08 09:10:35 +02:00
Adriane Boyd	dbc71ecd44	Remove #egg from download URLs (#12567 ) The current URLs will become invalid in pip 25.0. According to the pip docs, the egg= URLs are currently only needed for editable VCS installs.	2023-05-04 17:13:12 +02:00
Lj Miranda	298e6036b7	Add spans in spacy benchmark (#12575 ) * Add spans in spacy benchmark The current implementation of spaCy benchmark accuracy / spacy evaluate doesn't include the "spans" type, so calling the command doesn't render the HTML displaCy file needed. This PR attempts to fix that by creating a new parameter for "spans" and calling the appropriate displaCy value. * Reformat file with black * Add tests for evaluate * Fix spans -> span for displacy style * Update test to check render instead * Update source so mypy passes * Add parser information to avoid warnings	2023-04-28 14:32:52 +02:00
kadarakos	34d1164b0e	Spancat speed improvement (#12577 ) * avoid nesting then flattening * mypy fix * Apply suggestions from code review * Add type for indices * Run full matrix for mypy * Add back modified type: ignore * Revert "Run full matrix for mypy" This reverts commit `e218873d04`. --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-04-27 15:27:13 +02:00
Patrick J. Burns	ab4ba04c32	Update LatinDefaults for lang 'la' (#12538 ) * Add noun chunking to la syntax iterators * Expand list of numeral, ordinal words * Expand abbreviations in la tokenizer_exceptions * Add example sents * Update spacy/lang/la/syntax_iterators.py Reorganize la syntax iterators Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Minor updates based on review * fix call --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-04-20 16:55:40 +02:00
Adriane Boyd	b60b027927	Add default option to MorphAnalysis.get (#12545 ) * Add default to MorphAnalysis.get Similar to `dict`, allow a `default` option for `MorphAnalysis.get` for the user to provide a default return value if the field is not found. The default return value remains `[]`, which is not the same as `dict.get`, but is already established as this method's default return value with the return type `List[str]`. However the new `default` option does not enforce that the user-provided default is actually `List[str]`. * Restore test case	2023-04-20 14:06:32 +02:00
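The `default` behaviour described in b60b027927 can be illustrated with a small stand-in class. This is only a sketch: `MorphSketch` and its feature-string parsing are hypothetical and are not spaCy's `MorphAnalysis` implementation. It mirrors the two points the commit message makes: a missing field still returns `[]` unless the caller supplies a default, and the supplied default is returned as given, without checking that it is a `List[str]`.

```python
from typing import Dict, List, Optional


class MorphSketch:
    """Hypothetical stand-in for a morphological analysis (not spaCy's class)."""

    def __init__(self, feats: str) -> None:
        # Parse a UD-style feature string such as "Case=Nom|Number=Sing".
        self._fields: Dict[str, List[str]] = {}
        for pair in filter(None, feats.split("|")):
            field, _, values = pair.partition("=")
            self._fields[field] = values.split(",")

    def get(self, field: str, default: Optional[List[str]] = None) -> List[str]:
        # A known field returns its values.
        if field in self._fields:
            return self._fields[field]
        # Unlike dict.get, the fallback without an explicit default is [].
        return [] if default is None else default


morph = MorphSketch("Case=Nom|Number=Sing")
print(morph.get("Case"))
print(morph.get("Tense"))
print(morph.get("Tense", ["Unknown"]))
```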
Adriane Boyd	dc0a1a9808	Load exceptions last in Tokenizer.from_bytes (#12553 ) In `Tokenizer.from_bytes`, the exceptions should be loaded last so that they are only processed once as part of loading the model. The exceptions are tokenized as phrase matcher patterns in the background and the internal tokenization needs to be synced with all the remaining tokenizer settings. If the exceptions are not loaded last, there are speed regressions for `Tokenizer.from_bytes/disk` vs. `Tokenizer.add_special_case` as the caches are reloaded more than necessary during deserialization.	2023-04-20 11:30:34 +02:00
Sofie Van Landeghem	8e6a3d58d8	fix typo (#12543 )	2023-04-19 10:59:33 +02:00
TAN Long	923d24e885	perf(REL_OP): Replace some token.children with token.rights or token.lefts (#12528 ) Co-authored-by: Tan Long <tanloong@foxmail.com>	2023-04-17 13:16:34 +02:00
Edward	de32011e4c	Add model-last saving mechanism to pretraining (#12459 ) * Adjust pretrain command * change naming and add finally block * Add unit test * Add unit test assertions * Update spacy/training/pretrain.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * change finally block * Add to docs * Update website/docs/usage/embeddings-transformers.mdx * Add flag to skip saving model-last --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-04-03 15:24:03 +02:00
Adriane Boyd	4a1ec332de	Add Span.kb_id/Span.id strings to Doc/DocBin serialization if set (#12493 ) * Add Span.kb_id/Span.id strings to Doc/DocBin serialization if set * Format	2023-04-03 15:11:12 +02:00
Adriane Boyd	4538ceb507	Remove redundant strings.add for Doc.char_span (#12429 )	2023-04-03 11:38:56 +02:00
Adriane Boyd	69e20ce03d	Fix pickle for ngram suggester (#12486 )	2023-03-31 13:43:51 +02:00
Adriane Boyd	140d53649d	Convert values to numpy for label smoothing tests (#12472 )	2023-03-31 13:41:41 +02:00
Ye Lei (叶磊)	ce258670b7	Allow passing a Span to displacy.parse_deps (#12477 ) * Allow passing a Span to displacy.parse_deps * Update docstring Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update API docs --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-03-31 09:44:01 +02:00
Raphael Mitsch	d85df9d577	Fix Span.sents for edge case of Span being the only Span in the last sentence of a Doc. (#12484 )	2023-03-29 18:54:47 +02:00
kadarakos	372a90885e	Fix spancat-singlelabel score (#12469 ) * debug argmax sort and add span scores * add missing tests for spanscores	2023-03-29 08:38:11 +02:00
Adriane Boyd	fac457a509	Support floret for PretrainVectors (#12435 ) * Support floret for PretrainVectors * Format	2023-03-24 16:28:51 +01:00
Adriane Boyd	d0bd3f5ee4	Update Serbian tokenization for UD Serbian SET (#12442 )	2023-03-24 16:26:40 +01:00
Vinit Ravishankar	28de85737f	Tagger label smoothing (#12293 ) * add label smoothing * use True/False instead of floats * add entropy to debug data * formatting * docs * change test to check difference in distributions * Update website/docs/api/tagger.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/tagger.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * bool -> float * update docs * fix seed * black * update tests to use label_smoothing = 0.0 * set default to 0.0, update quickstart * Update spacy/pipeline/tagger.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * update morphologizer, tagger test * fix morph docs * add url to docs --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-03-22 12:17:56 +01:00
Raphael Mitsch	96b61d0671	Fix EL failure with sentence-crossing entities (#12398 ) * Add test reproducing EL failure in sentence-crossing entities. * Format. * Draft fix. * Format. * Fix case for len(ent.sents) == 1. * Format. * Format. * Format. * Fix mypy error. * Merge EL sentence crossing tests. * Remove unneeded sentencizer component. * Fix or ignore mypy issues in test. * Simplify ent.sents handling. * Format. Update assert in ent.sents handling. * Small rewrite --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-03-14 22:02:49 +01:00
Adriane Boyd	2ce9a220db	Fix --verbose for spacy find-threshold (#12418 )	2023-03-14 17:16:49 +01:00
Raphael Mitsch	e8cab4625c	Fix sentence indexing bug in `Span.sents` (#12405 ) * Add test for partial sentences in ent.sents. * Removed unneeded import. * Format. Simplify code.	2023-03-14 10:21:53 +01:00
Adriane Boyd	f27bce67fd	Skip project clone tests if git is not available (#12394 )	2023-03-09 16:41:21 +01:00
Lj Miranda	913d74f509	Add spancat_singlelabel pipeline for multiclass and non-overlapping span labelling tasks (#11365 ) * [wip] Update * [wip] Update * Add initial port * [wip] Update * Fix all imports * Add spancat_exclusive to pipeline * [WIP] Update * [ci skip] Add breakpoint for debugging * Use spacy.SpanCategorizer.v1 as default archi * Update spacy/pipeline/spancat_exclusive.py Co-authored-by: kadarakos <kadar.akos@gmail.com> * [ci skip] Small updates * Use Softmax v2 directly from thinc * Cache the label map * Fix mypy errors However, I ignored line 370 because it opened up a bunch of type errors that might be trickier to solve and might lead to a more complicated codebase. * avoid multiplication with 1.0 Co-authored-by: kadarakos <kadar.akos@gmail.com> * Update spacy/pipeline/spancat_exclusive.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update component versions to v2 * Add scorer to docstring * Add _n_labels property to SpanCategorizer Instead of using len(self.labels) in initialize() I am using a private property self._n_labels. This achieves implementation parity and allows me to delete the whole initialize() method for spancat_exclusive (since it's now the same with spancat). * Inherit from SpanCat instead of TrainablePipe This commit changes the inheritance structure of Exclusive_Spancat, now it's inheriting from SpanCategorizer than TrainablePipe. This allows me to remove duplicate methods that are already present in the parent function. 
* Revert documentation link to spancat * Fix init call for exclusive spancat * Update spacy/pipeline/spancat_exclusive.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Import Suggester from spancat * Include zero_init.v1 for spancat * Implement _allow_extra_label to use _n_labels To ensure that spancat / spancat_exclusive cannot be resized after initialization, I inherited the _allow_extra_label() method from spacy/pipeline/trainable_pipe.pyx and used self._n_labels instead of len(self.labels) for checking. I think that changing it locally is a better solution rather than forcing each class that inherits TrainablePipe to use the self._n_labels attribute. Also note that I turned-off black formatting in this block of code because it reads better without the overhang. * Extend existing tests to spancat_exclusive In this commit, I extended the existing tests for spancat to include spancat_exclusive. I parametrized the test functions with 'name' (similar var name with textcat and textcat_multilabel) for each applicable test. TODO: Add overfitting tests for spancat_exclusive * Update documentation for spancat * Turn on formatting for allow_extra_label * Remove initializers in default config * Use DEFAULT_EXCL_SPANCAT_MODEL I also renamed spancat_exclusive_default_config into spancat_excl_default_config because black does some not pretty formatting changes. * Update documentation Update grammar and usage Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Clarify docstring for Exclusive_SpanCategorizer * Remove mypy ignore and typecast labels to list * Fix documentation API * Use a single variable for tests * Update defaults for number of rows Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Put back initializers in spancat config Whenever I remove model.scorer.init_w and model.scorer.init_b, I encounter an error in the test: SystemError: <method '__getitem__' of 'dict' objects> returned a result with an error set. 
My Thinc version is 8.1.5, but I can't seem to check what's causing the error. * Update spancat_exclusive docstring * Remove init_W and init_B parameters This commit is expected to fail until the new Thinc release. * Require thinc>=8.1.6 for serializable Softmax defaults * Handle zero suggestions to make tests pass I'm not sure if this is the most elegant solution. But what should happen is that the _make_span_group function MUST return an empty SpanGroup if there are no suggestions. The error happens when the 'scores' variable is empty. We cannot get the 'predicted' and other downstream vars. * Better approach for handling zero suggestions * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spancategorizer headers * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Add default value in negative_weight in docs * Add default value in allow_overlap in docs * Update how spancat_exclusive is constructed In this commit, I added the following: - Put the default values of negative_weight and allow_overlap in the default_config dictionary. 
- Rename make_spancat -> make_exclusive_spancat * Run prettier on spancategorizer.mdx * Change exactly one -> at most one * Add suggester documentation in Exclusive_SpanCategorizer * Add suggester to spancat docstrings * merge multilabel and singlelabel spancat * rename spancat_exclusive to singlelable * wire up different make_spangroups for single and multilabel * black * black * add docstrings * more docstring and fix negative_label * don't rely on default arguments * black * remove spancat exclusive * replace single_label with add_negative_label and adjust inference * mypy * logical bug in configuration check * add spans.attrs[scores] * single label make_spangroup test * bugfix * black * tests for make_span_group with negative labels * refactor make_span_group * black * Update spacy/tests/pipeline/test_spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * remove duplicate declaration * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * raise error instead of just print * make label mapper private * update docs * run prettier * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * don't keep recomputing self._label_map for each span * typo in docs * Intervals to private and document 'name' param * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * add Tag to new features * replace tags * revert * revert * 
revert * revert * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * prettier * Fix merge * Update website/docs/api/spancategorizer.mdx * remove references to 'single_label' * remove old paragraph * Add spancat_singlelabel to config template * Format * Extend init config tests --------- Co-authored-by: kadarakos <kadar.akos@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-03-09 10:30:59 +01:00
Paul O'Leary McCann	e656189ec3	Change GPU efficient textcat to use CNN, not BOW in generated configs (#11900 ) * Change GPU efficient textcat to use CNN, not BOW If you generate a config with a textcat component using GPU (transformers), the default option (efficiency) uses a BOW architecture, which does not use tok2vec features. While that can make sense as part of a larger pipeline, in the case of just a transformer and a textcat, that means the transformer is doing a lot of work for no purpose. This changes it so that the CNN architecture is used instead. It could also be changed to be the same as the accuracy config, which uses the ensemble architecture. * Add the transformer when using a textcat with GPU * Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928) * Switch ubuntu-latest to ubuntu-20.04 in main tests * Only use 20.04 for 3.6 * Require thinc v8.1.7 * Require thinc v8.1.8 * Break up longer expression --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-03-07 17:47:45 +01:00
Sofie Van Landeghem	3bf4539e31	fix types (#12365 )	2023-03-07 13:29:08 +01:00
Adriane Boyd	260cb9c6fe	Raise error for non-default vectors with PretrainVectors (#12366 )	2023-03-06 18:06:31 +01:00
Adriane Boyd	5ecb3babed	Update to use absolute imports in tests (#12372 )	2023-03-06 17:30:17 +01:00
Adriane Boyd	0bbc620dd8	Partially work around pending deprecation of pkg_resources (#12368 ) * Handle deprecation of pkg_resources * Replace `pkg_resources` with `importlib_metadata` for `spacy info --url` * Remove requirements check from `spacy project` given the lack of alternatives * Fix installed model URL method and CI test * Fix types/handling, simplify catch-all return * Move imports instead of disabling requirements check * Format * Reenable test with ignored deprecation warning * Fix except * Fix return	2023-03-06 14:48:57 +01:00
Raphael Mitsch	6aa6b86d49	Make generation of empty `KnowledgeBase` instances configurable in `EntityLinker` (#12320 ) * Make empty_kb() configurable. * Format. * Update docs. * Be more specific in KB serialization test. * Update KB serialization tests. Update docs. * Remove doc update for batched candidate generation. * Fix serialization of subclassed KB in tests. * Format. * Update docstring. * Update docstring. * Switch from pickle to json for custom field serialization.	2023-03-01 16:02:55 +01:00
Sofie Van Landeghem	74cae47bf6	rely on is_empty property instead of __len__ (#12347 )	2023-03-01 12:06:07 +01:00
Adriane Boyd	8f058e39bd	Fix error message for displacy auto_select_port (#12343 )	2023-02-28 16:36:03 +01:00
TAN Long	071667376a	Add new REL_OPs: `>+`, `>-`, `<+`, and `<-` (#12334 ) * Add immediate left/right child/parent dependency relations * Add tests for new REL_OPs: `>+`, `>-`, `<+`, and `<-`. --------- Co-authored-by: Tan Long <tanloong@foxmail.com>	2023-02-28 14:36:33 +01:00
lise-brinck	e2de188cf1	Bugfix/swedish tokenizer (#12315 ) * add unittest for explosion#12311 * create punctuation.py for swedish * removed : from infixes in swedish punctuation.py * allow : as infix if succeeding char is uppercase	2023-02-27 10:53:45 +01:00
Kevin Humphreys	acdd993071	Matcher performance fix for extension predicates: use shared key function (#12272 ) * standardize predicate key format * single key function * Make optional args in key function keyword-only --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-02-27 08:35:08 +01:00
Paul O'Leary McCann	1e8bac99f3	Add tests for projects to master (#12303 ) * Add tests for projects to master * Fix git clone related issues on Windows * Add stat import	2023-02-23 10:22:57 +01:00
Adriane Boyd	80bc140533	Add grc to langs with lexeme norms in spacy-lookups-data (#12287 )	2023-02-16 17:57:02 +01:00
Edward	61b8454137	Adjust return type of `registry.find` (#12227 ) * Fix registry find return type * add dot * Add type ignore for mypy * update black formatting version * add mypy ignore to package cli * mypy type fix (for real) * Update find description in spacy/util.py Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> * adjust mypy directive --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-02-15 12:32:53 +01:00
Sofie Van Landeghem	79ef6cf0f9	Have logging calls use string formatting types (#12215 ) * change logging call for spacy.LookupsDataLoader.v1 * substitutions in language and _util * various more substitutions * add string formatting guidelines to contribution guidelines	2023-02-02 11:15:22 +01:00
Raphael Mitsch	02af17a5c8	Remove flaky assertions. (#12210 )	2023-01-31 16:52:06 +01:00
Adriane Boyd	606273f7e4	Normalize whitespace in evaluate CLI output test (#12157 ) * Normalize whitespace in evaluate CLI output test Depending on terminal settings, lines may be padded to the screen width so the comparison is too strict with only the command string replacement. * Move to test util method * Change to normalization method	2023-01-27 16:13:34 +01:00
Adriane Boyd	5f8a398bb9	Add span_id to Span.char_span, update Doc/Span.char_span docs (#12196 ) * Add span_id to Span.char_span, update Doc/Span.char_span docs `Span.char_span(id=)` should be removed in the future. * Also use Union[int, str] in Doc docstring	2023-01-27 15:09:17 +01:00
Simon Gurcke	774c10fa39	Add alignment_mode argument to Span.char_span() (#12145 ) * Add alignment_mode argument to Span.char_span() * Update website * Update spacy/tokens/span.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-01-27 11:43:40 +01:00
Peter Baumgartner	c68e6b8a96	`trainable_lemmatizer` in `debug data` (#11419 ) * WIP * rm ipython embeds * rm total * WIP * cleanup * cleanup + reword * rm component function * remove migration support form * fix reference dataset for dev data * additional fixes - set approach to identifying unique trees - adjust line length on messages - add logic for detecting docs without annotations * use 0 instead of none for no annotation * partial annotation support * initial tests for _compile_gold lemma attributes Using the example data from the edit tree lemmatizer tests for: - lemmatizer_trees - partial_lemma_annotations - n_low_cardinality_lemmas - no_lemma_annotations * adds output test for cli app * switch msg level * rm unclear uniqueness check * Revert "rm unclear uniqueness check" This reverts commit `6ea2b3524b`. * remove good message on uniqueness * formatting * use en_vocab fixture * clarify data set source in messages * remove unnecessary import Co-authored-by: svlandeg <svlandeg@github.com>	2023-01-26 17:36:50 +01:00
Daniël de Kok	8d69874afb	Add `spacy.PlainTextCorpusReader.v1` (#12122 ) * Add `spacy.PlainTextCorpusReader.v1` This is a corpus reader that reads plain text corpora with the following format: - UTF-8 encoding - One line per document. - Blank lines are ignored. It is useful for applications where we deal with very large corpora, such as distillation, and don't want to deal with the space overhead of serialized formats. Additionally, many large corpora already use such a text format, keeping the necessary preprocessing to a minimum. * Update spacy/training/corpus.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * docs: add version to `PlainTextCorpus` * Add docstring to registry function * Add plain text corpus tests * Only strip newline/carriage return * Add return type _string_to_tmp_file helper * Use a temporary directory in place of file name Different OS auto delete/sharing semantics are just wonky. * This will be new in 3.5.1 (rather than 4) * Test improvements from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-01-26 11:33:22 +01:00
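The corpus format described in 8d69874afb (UTF-8, one document per line, blank lines ignored, only the newline/carriage return stripped) can be sketched with the standard library alone. The function name `read_plain_text_corpus` is hypothetical and this is not the code of `spacy.PlainTextCorpusReader.v1`. It also assumes that "blank" means empty once the line ending is removed, so whitespace-only lines are kept.

```python
from pathlib import Path
from typing import Iterator


def read_plain_text_corpus(path: Path) -> Iterator[str]:
    """Yield one document text per non-blank line of a UTF-8 file."""
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            # Only strip the line ending; other whitespace is preserved.
            text = line.rstrip("\r\n")
            # Assumption: a line counts as blank if nothing is left after
            # removing the line ending.
            if text:
                yield text
```

Because the function is a generator, a very large corpus is streamed line by line rather than loaded into memory, which matches the motivation given in the commit message.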
Raphael Mitsch	950fceceb6	Make test_cli_find_threshold() more robust. (#12148 )	2023-01-23 14:42:33 +01:00
Richard Hudson	f9e020dd67	Fix speed problem with `top_k>1` on CPU in edit tree lemmatizer (#12017 ) * Refactor _scores2guesses * Handle arrays on GPU * Convert argmax result to raw integer Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * Use NumpyOps() to copy data to CPU Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * Changes based on review comments * Use different _scores2guesses depending on tree_k * Add tests for corner cases * Add empty line for consistency * Improve naming Co-authored-by: Daniël de Kok <me@github.danieldk.eu> * Improve naming Co-authored-by: Daniël de Kok <me@github.danieldk.eu> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2023-01-20 19:34:11 +01:00
Adriane Boyd	1e993d3b03	Merge pull request #12121 from adrianeboyd/chore/v3.5.0-2 Revert "Temporarily skip tests that require models/compat"	2023-01-19 15:59:30 +01:00
Adriane Boyd	3b8918e166	API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar (#12128 ) * API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar * adjust to mdx * linkout to InMemoryLookupKB at first occurrence in kb.mdx * fix links to docs * revert Azure trigger setting (I'll make a separate PR) Co-authored-by: svlandeg <svlandeg@github.com>	2023-01-19 13:29:17 +01:00
Adriane Boyd	dc0f527039	Revert "Temporarily skip tests that require models/compat" This reverts commit `378db0eb1e`.	2023-01-18 12:54:56 +01:00
Adriane Boyd	794cea6907	Fix comments and examples for levenshtein_compare (#12113 )	2023-01-18 08:02:33 +01:00
github-actions[bot]	9ef7d26032	Auto-format code with black (#12100 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2023-01-13 10:12:10 +01:00
Daniël de Kok	dda7331da3	Handle missing annotations in the edit tree lemmatizer (#12098 ) The losses/gradients of missing annotations were not correctly masked out. Fix this and check the masking in the partial data test.	2023-01-12 12:13:55 +01:00
Daniël de Kok	319eb508b5	Add a `spacy benchmark speed` subcommand (#11902 ) * Add a `spacy evaluate speed` subcommand This subcommand reports the mean batch performance of a model on a data set with a 95% confidence interval. For reliability, it first performs some warmup rounds. Then it will measure performance on batches with randomly shuffled documents. To avoid having too many spaCy commands, `speed` is a subcommand of `evaluate` and accuracy evaluation is moved to its own `evaluate accuracy` subcommand. * Fix import cycle * Restore `spacy evaluate`, make `spacy benchmark speed` an alias * Add documentation for `spacy benchmark` * CREATES -> PRINTS * WPS -> words/s * Disable formatting of benchmark speed arguments * Fail with an error message when trying to speed bench empty corpus * Make it clearer that `benchmark accuracy` is a replacement for `evaluate` * Fix docstring webpage reference * tests: check `evaluate` output against `benchmark accuracy`	2023-01-12 11:55:21 +01:00
Paul O'Leary McCann	8e558095a1	Clean up displacy port-related error messages, docs (#12089 ) * Clean up displacy port-related error messages, docs There were some issues in the error messages and docs in #11948. 1. the error messages didn't specify the port argument to displacy.serve correctly 2. the docs didn't mark the auto select argument as new This addresses those issues. * Update website/docs/api/top-level.md Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> * Apply prettier Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-01-12 14:54:09 +09:00
Adriane Boyd	9e0322de1a	Restore v2 token_acc score implementation (#12073 ) In the v3 scorer refactoring, `token_acc` was implemented incorrectly. It should use `precision` instead of `fscore` for the measure of correctly aligned tokens / number of predicted tokens. Fix the docs to reflect that the measure uses the number of predicted tokens rather than the number of gold tokens.	2023-01-11 08:01:47 +01:00
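The measure described in 9e0322de1a, correctly aligned tokens divided by the number of predicted tokens, is a precision. A minimal sketch follows, assuming tokens are compared by their character offsets in the shared text; the helper names are hypothetical and this is not the code of spaCy's `Scorer`.

```python
from typing import List, Set, Tuple


def char_spans(tokens: List[str], text: str) -> Set[Tuple[int, int]]:
    """Map each token to its (start, end) character offsets in text."""
    spans = set()
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)
        end = start + len(tok)
        spans.add((start, end))
        pos = end
    return spans


def token_precision(pred: List[str], gold: List[str], text: str) -> float:
    """Correctly aligned tokens / number of predicted tokens."""
    pred_spans = char_spans(pred, text)
    gold_spans = char_spans(gold, text)
    if not pred_spans:
        return 0.0
    return len(pred_spans & gold_spans) / len(pred_spans)


text = "New York-based firm"
gold = ["New", "York", "-", "based", "firm"]
pred = ["New", "York-based", "firm"]
# Two of the three predicted tokens align with gold tokens.
print(token_precision(pred, gold, text))
```

In this example the denominator is 3 (predicted tokens), not 5 (gold tokens), which is the distinction the commit's docs fix refers to.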
Kevin Humphreys	19650ebb52	Enable fuzzy text matching in Matcher (#11359 ) * enable fuzzy matching * add fuzzy param to EntityMatcher * include rapidfuzz_capi not yet used * fix type * add FUZZY predicate * add fuzzy attribute list * fix type properly * tidying * remove unnecessary dependency * handle fuzzy sets * simplify fuzzy sets * case fix * switch to FUZZYn predicates use Levenshtein distance. remove fuzzy param. remove rapidfuzz_capi. * revert changes added for fuzzy param * switch to polyleven (Python package) * enable fuzzy matching * add fuzzy param to EntityMatcher * include rapidfuzz_capi not yet used * fix type * add FUZZY predicate * add fuzzy attribute list * fix type properly * tidying * remove unnecessary dependency * handle fuzzy sets * simplify fuzzy sets * case fix * switch to FUZZYn predicates use Levenshtein distance. remove fuzzy param. remove rapidfuzz_capi. * revert changes added for fuzzy param * switch to polyleven (Python package) * fuzzy match only on oov tokens * remove polyleven * exclude whitespace tokens * don't allow more edits than characters * fix min distance * reinstate FUZZY operator with length-based distance function * handle sets inside regex operator * remove is_oov check * attempt build fix no mypy failure locally * re-attempt build fix * don't overwrite fuzzy param value * move fuzzy_match to its own Python module to allow patching * move fuzzy_match back inside Matcher simplify logic and add tests * Format tests * Parametrize fuzzyn tests * Parametrize and merge fuzzy+set tests * Format * Move fuzzy_match to a standalone method * Change regex kwarg type to bool * Add types for fuzzy_match - Refactor variable names - Add test for symmetrical behavior * Parametrize fuzzyn+set tests * Minor refactoring for fuzz/fuzzy * Make fuzzy_match a Matcher kwarg * Update type for _default_fuzzy_match * don't overwrite function param * Rename to fuzzy_compare * Update fuzzy_compare default argument declarations * allow fuzzy_compare 
override from EntityRuler * define new Matcher keyword arg * fix type definition * Implement fuzzy_compare config option for EntityRuler and SpanRuler * Rename _default_fuzzy_compare to fuzzy_compare, remove from reexported objects * Use simpler fuzzy_compare algorithm * Update types * Increase minimum to 2 in fuzzy_compare to allow one transposition * Fix predicate keys and matching for SetPredicate with FUZZY and REGEX * Add FUZZY6..9 * Add initial docs * Increase default fuzzy to rounded 30% of pattern length * Update docs for fuzzy_compare in components * Update EntityRuler and SpanRuler API docs * Rename EntityRuler and SpanRuler setting to matcher_fuzzy_compare To have naming similar to `phrase_matcher_attr`, rename the `fuzzy_compare` setting for `EntityRuler` and `SpanRuler` to `matcher_fuzzy_compare`. Organize next to `phrase_matcher_attr` in docs. * Fix schema aliases Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix typo Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Add FUZZY6-9 operators and update tests * Parameterize test over greedy Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix type for fuzzy_compare to remove Optional * Rename to spacy.levenshtein_compare.v1, move to spacy.matcher.levenshtein * Update docs following levenshtein_compare renaming Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-01-10 10:36:17 +01:00
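The length-based edit budget mentioned in the fuzzy matching commit above (a minimum of 2, a default of a rounded 30% of the pattern length) can be sketched in plain Python. This is not spaCy's implementation: the function names `levenshtein`, `default_max_edits` and `fuzzy_equal` are hypothetical, and combining the two rules with `max` is an assumption made for illustration.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance (insert/delete/substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def default_max_edits(pattern: str) -> int:
    """Assumed budget: rounded 30% of the pattern length, but at least 2."""
    return max(2, round(0.3 * len(pattern)))


def fuzzy_equal(text: str, pattern: str, max_edits: Optional[int] = None) -> bool:
    """Hypothetical comparison: exact match, or within the edit budget."""
    if text == pattern:
        return True
    if max_edits is None:
        max_edits = default_max_edits(pattern)
    return levenshtein(text, pattern) <= max_edits
```

A swapped pair of adjacent characters, such as "teh" for "the", costs two substitutions under plain Levenshtein distance, which is consistent with the commit's note that a minimum of 2 allows one transposition.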
Zhangrp	eb8bb35c13	improve ux for displacy when the serve port is in use (#11948 ) * check port in use and add itself * check port in use and add itself * Auto switch to nearest available port. * Use bind to check port instead of connect_ex. * Reformat. * Add auto_select_port argument. * update docs for displacy.serve * Update spacy/errors.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update website/docs/api/top-level.md Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update spacy/errors.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Add test using multiprocessing * fix argument name * Increase sleep times Want to rule this out as a cause of test failure * Don't terminate a process that isn't alive * Refactor port finding logic This moves all the port logic into its own util function, which can be tested without having to background a server directly. * Use with for the server This ensures the server is closed correctly. * Pass in the host when checking port availability * Shorten argument name * Update error codes following merge * Add types for arguments, specify docstrings. * Add typing for arguments with default value. * Update docstring to match spaCy format. * Update docstring to match spaCy format. * Fix docs Arg name changed from `auto_select_port` to just `auto_select`. * Revert "Fix docs" This reverts commit `356966fe84`. Co-authored-by: zhiiw <1302593554@qq.com> Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-01-10 15:52:57 +09:00
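The port handling described in the displaCy commit above (check availability with `bind`, then switch to the nearest free port) can be sketched with the standard `socket` module. The function names and the upward search order are assumptions for illustration; this is not the utility function added to spaCy.

```python
import socket


def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding to (host, port) fails.

    Any OSError counts as "in use" here, which also covers ports the
    current user is not permitted to bind.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return True
    return False


def find_available_port(start: int, host: str = "127.0.0.1") -> int:
    """Return the first bindable port at or above `start` (assumed search order)."""
    for port in range(start, 65536):
        if not is_port_in_use(port, host):
            return port
    raise OSError(f"No available port found at or above {start} on {host}")
```

Binding rather than connecting checks whether the server itself could claim the port, which is the question that matters before starting it.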
Sofie Van Landeghem	6d03b04901	Improve score_cats for use with multiple textcat components (#11820 ) * add test for running evaluate on an nlp pipeline with two distinct textcat components * cleanup * merge dicts instead of overwrite * don't add more labels to the given set * Revert "merge dicts instead of overwrite" This reverts commit `89bee0ed77`. * Switch tests to separate scorer keys rather than merged dicts * Revert unrelated edits * Switch textcat scorers to v2 * formatting Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-01-09 11:43:48 +01:00
Sofie Van Landeghem	7f6c638c3a	fix processing of "auto" in convert (#12050 ) * fix processing of "auto" in walk_directory * add check for None * move AUTO check to convert and fix verification of args * add specific CLI test with CliRunner * cleanup * more cleanup * update docstring	2023-01-05 10:21:00 +01:00
Paul O'Leary McCann	dbd829f0ed	Fix inconsistency in displaCy docs about page option (#12047 ) * Fix inconsistency in displaCy docs about page option The `page` option, which wraps the output SVG in HTML, is true by default for `serve` but not for `render`. The `render` docs were wrong though, so this updates them. * Update the same statement in more docs A few renderers used the same language	2023-01-04 12:51:40 +09:00
github-actions[bot]	abb0ab109d	Auto-format code with black (#12035 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2023-01-02 11:59:57 +01:00
Adriane Boyd	ef9e504eac	Rename modified textcat scorer to v2 (#11971 ) As a follow-up to #11696, rename the modified scorer to v2 and move the v1 scorer to `spacy-legacy`.	2022-12-29 14:01:08 +01:00
kadarakos	933b54ac79	typo fix (#11995 )	2022-12-26 13:26:35 +01:00
Madeesh Kannan	aa2b471a6e	New console logger with expanded progress tracking (#11972 ) * Add `ConsoleLogger.v3` This addition expands the progress bar feature to count up the training/distillation steps to either the next evaluation pass or the maximum number of steps. * Rename progress bar types * Add defaults to docs Minor fixes * Move comment * Minor punctuation fixes * Explicitly check for `None` when validating progress bar type Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2022-12-23 15:21:44 +01:00
github-actions[bot]	90896504a5	Auto-format code with black (#12019 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-12-23 12:44:07 +01:00
Raphael Mitsch	eef3d950b4	Fix `SpanGroup` and `Span` typing (#12009 ) * Correct Span.label, Span.kb_id types. Fix SpanGroup.__iter__(). * Extend test. * Rename test. Fix typo. * Add comment. * Fix types for Span.label, Span.kb_id, Span.char_span(). * Update spacy/tests/doc/test_span_group.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update docs. * Fix typo. * Update spacy/tokens/span_group.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-12-21 18:54:27 +01:00
kadarakos	c223cd7a86	Add apply CLI (#11376 ) * annotate cli first try * add batch-size and n_process * rename to apply * typing fix * handle file suffixes * walk directories * support jsonl * typing fix * remove debug * make suffix optional for walk * revert unrelated * don't warn but raise * better error message * minor touch up * Update spacy/tests/test_cli.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/cli/apply.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/cli/apply.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * update tests and bugfix * add force_overwrite * typo * fix adding .spacy suffix * Update spacy/cli/apply.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/cli/apply.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/cli/apply.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * store user data and rename cmd arg * include test for user attr * rename cmd arg * better help message * documentation * prettier * black * link fix * Update spacy/cli/apply.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update website/docs/api/cli.md Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update website/docs/api/cli.md Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update website/docs/api/cli.md Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * addressing reviews * dont quit but warn * prettier Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2022-12-20 17:11:33 +01:00
Jos Polfliet	18ffe5bbd6	Update stop_words.py (#11997 ) fix typo in "aangaande"	2022-12-19 16:17:49 +01:00
Adriane Boyd	0591e67265	Cast to uint64 for all array-based doc representations (#11933 ) * Convert all individual values explicitly to uint64 for array-based doc representations * Temporarily test with latest numpy v1.24.0rc * Remove unnecessary conversion from attr_t * Reduce number of individual casts * Convert specifically from int32 to uint64 * Revert "Temporarily test with latest numpy v1.24.0rc" This reverts commit `eb0e3c5006`. * Also use int32 in tests	2022-12-12 08:45:35 +01:00
github-actions[bot]	f22fc7a113	Auto-format code with black (#11955 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-12-09 10:15:52 +01:00
Paul O'Leary McCann	5c3a60e8f4	Add in errors used in the beam code that were removed at some point (#11935 ) I don't think there's any way to use the beam code at the moment, but as long as it's around the errors it refers to should also be present.	2022-12-07 15:52:35 +01:00
Daniël de Kok	27fac7df2e	EditTreeLemmatizer: correctly add strings when initializing from labels (#11934 ) Strings in replacement nodes were not added to the `StringStore` when `EditTreeLemmatizer` was initialized from a set of labels. The corresponding test did not capture this because it added the strings through the examples that were passed to the initialization. This change fixes both the bug in the initialization and the 'shadowing' of the bug in the test.	2022-12-07 13:53:41 +09:00
Zhangrp	23085ffef4	Fix interpolation in directory names, see #11235 . (#11914 )	2022-12-06 17:42:12 +09:00
Adriane Boyd	8afa8b5a7b	Refactor kwargs in CLI msg for future wasabi compatibility (#11918 ) Necessary for mypy with wasabi v1+.	2022-12-05 10:00:00 +01:00
github-actions[bot]	df0cb4b77b	Auto-format code with black (#11913 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-12-02 14:49:12 +01:00
Paul O'Leary McCann	f9d17a644b	Config generation fails for GPU without transformers (#11899 ) If you don't have spacy-transformers installed, but try to use `init config` with the GPU flag, you'll get an error. The issue is that the `use_transformers` flag in the config is conflated with the GPU flag, and then there's an attempt to access transformers config info that may not exist. There may be a better way to do this, but this stops the error.	2022-12-02 10:17:11 +01:00
Adriane Boyd	445c670a2d	Fix spancat for zero suggestions (#11860 ) * Add test for spancat predict with zero suggestions * Fix spancat for zero suggestions * Undo changes to extract_spans * Use .sum() as in update	2022-12-02 09:33:52 +01:00
Adriane Boyd	6f9d630f7e	Replace Pipe type with Callable in Language (#11803 ) * Replace Pipe type with Callable in Language * Use Callable[[Doc], Doc] in the docstrings	2022-11-29 13:20:08 +01:00
Paul O'Leary McCann	f1e0243450	Remove macro auc per type from textcat defaults (#11887 ) This appears to have been added by mistake and never used. Removing it does not break validation.	2022-11-29 11:50:23 +01:00
Adriane Boyd	e0d43557b7	Merge pull request #11871 from adrianeboyd/chore/v3.5.0 Prepare for v3.5.0	2022-11-29 11:41:32 +01:00
Adriane Boyd	1ebe7db07c	Support local filesystem remotes for projects (#11762 ) * Support local filesystem remotes for projects * Fix support for local filesystem remotes for projects * Use `FluidPath` instead of `Pathy` to support both filesystem and remote paths * Create missing parent directories if required for local filesystem * Add a more general `_file_exists` method to support both `Pathy`, `Path`, and `smart_open`-compatible URLs * Add explicit `smart_open` dependency starting with support for `compression` flag * Update `pathy` dependency to exclude older versions that aren't compatible with required `smart_open` version * Update docs to refer to `Pathy` instead of `smart_open` for project remotes (technically you can still push to any `smart_open`-compatible path but you can't pull from them) * Add tests for local filesystem remotes * Update pathy for general BlobStat sorting * Add import * Remove _file_exists since only Pathy remotes are supported * Format CLI docs * Clean up merge	2022-11-29 11:40:58 +01:00
Paul O'Leary McCann	f54bfb56c9	Don't throw an error if using displacy on an unset span key (#11845 ) * Don't throw an error if using displacy on an unset span key * List available keys in W117	2022-11-28 10:01:09 +01:00
Adriane Boyd	681ec20914	Add smart_open requirement, update deprecated options (#11864 ) * Switch from deprecated `ignore_ext` to `compression` * Add upload/download test for local files	2022-11-25 13:00:57 +01:00
Adriane Boyd	32396e0bda	Set version to v3.5.0	2022-11-25 12:05:25 +01:00
Adriane Boyd	378db0eb1e	Temporarily skip tests that require models/compat	2022-11-25 12:05:25 +01:00
Raphael Mitsch	c0fd8a2e71	find-threshold: CLI command for multi-label classifier threshold tuning (#11280 ) * Add foundation for find-threshold CLI functionality. * Finish first draft for find-threshold. * Add tests. * Revert adjusted import statements. * Fix mypy errors. * Fix imports. * Harmonize arguments with spacy evaluate command. * Generalize component and threshold handling. Harmonize arguments with 'spacy evaluate' CLI. * Fix Spancat test. * Add beta parameter to Scorer and PRFScore. * Make beta a component scorer setting. * Remove beta. * Update nlp.config (workaround). * Reload pipeline on threshold change. Adjust tests. Remove confection reference. * Remove assumption of component being a Pipe object or having a .cfg attribute. * Adjust test output and reference values. * Remove beta references. Delete universe.json. * Reverting unnecessary changes. Removing unused default values. Renaming variables in find-cli tests. * Update spacy/cli/find_threshold.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Remove adding labels in tests. * Remove unused error * Undo changes to PRFScorer * Change default value for n_trials. Log table iteratively. * Add warnings for pointless applications of find_threshold(). * Fix imports. * Adjust type check of TextCategorizer to exclude subclasses. * Change check of if there's only one unique value in scores. * Update spacy/cli/find_threshold.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Incorporate feedback. * Fix test issue. Update docstring. * Update docs & docstring. * Update spacy/tests/test_cli.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add examples to docs. Rename _nlp to nlp in tests. 
* Update spacy/cli/find_threshold.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/cli/find_threshold.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-11-25 11:44:55 +01:00
Adriane Boyd	30d31fd335	Update Russian and Ukrainian lemmatizers (#11811 ) * pymorphy2 issues #11620, #11626, #11625: - #11620: pymorphy2_lookup - #11626: handle multiple forms pointing to the same normal form + handling empty POS tag - #11625: matching DET that are labelled as PRON by pymorphy2 * Move lemmatizer algorithm changes back into RussianLemmatizer * Fix uk pymorphy3_lookup mode init * Move and update tests for ru/uk lookup lemmatizer modes * Fix typo * Remove traces of previous behavior for uninflected POS * Refactor to private generic-looking pymorphy methods * Remove xfailed uk lemmatizer cases * Update spacy/lang/ru/lemmatizer.py Co-authored-by: Richard Hudson <richard@explosion.ai> Co-authored-by: Dmytro S Lituiev <d.lituiev@gmail.com> Co-authored-by: Richard Hudson <richard@explosion.ai>	2022-11-25 11:12:46 +01:00
Adriane Boyd	8f062b849c	Fix Matcher cython profile=True header (#11867 )	2022-11-24 16:03:42 +01:00
Madeesh Kannan	5ea14af32b	Add `training.before_update` callback (#11739 ) * Add `training.before_update` callback This callback can be used to implement training paradigms like gradual (un)freezing of components (e.g: the Transformer) after a certain number of training steps to mitigate catastrophic forgetting during fine-tuning. * Fix type annotation, default config value * Generalize arguments passed to the callback * Update schema * Pass `epoch` to callback, rename `current_step` to `step` * Add test * Simplify test * Replace config string with `spacy.blank` * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Cleanup imports Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-11-23 17:54:58 +01:00
Paul O'Leary McCann	f1ddac187d	Remove unused error object (#11837 )	2022-11-23 10:51:31 +01:00
Marco Edward Gorelli	f0d8309a28	fix comparison of constants (#11834 ) Co-authored-by: MarcoGorelli <>	2022-11-21 08:12:03 +01:00
github-actions[bot]	89bfd06fbd	Auto-format code with black (#11826 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-11-18 18:24:13 +09:00
Adriane Boyd	a83463c5e0	Add transformer recommendation for ca (#11819 ) Model recommendation from @cayorodriguez.	2022-11-18 08:15:27 +01:00
Paul O'Leary McCann	75bb7ad541	Check textcat values for validity (#11763 ) * Check textcat values for validity * Fix error numbers * Clean up vals reference * Check category value validity through training The _validate_categories is called in update, which for multilabel is inherited from the single label component. * Formatting	2022-11-17 10:25:01 +01:00
Paul O'Leary McCann	c0c54e44bc	Add equality definition for vectors (#11806 ) * Add equality definition for vectors This re-uses the check from sourcing components. * Use the equality check * Format Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-11-16 09:44:42 +01:00
Sofie Van Landeghem	caa9efad59	prevent rewriting an already raw URL (#11810 )	2022-11-15 14:15:00 +01:00
Denis Bezykornov	7e684ad691	Update russian tokenizer exceptions (#11753 ) * Fix typos, add couple of new abbreviations, remove nonbreaking spaces * Remove space from abbreviation Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-11-15 11:37:25 +01:00
github-actions[bot]	188a7d00eb	Auto-format code with black (#11792 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-11-11 09:58:31 +01:00
Adriane Boyd	03eebe9d1c	Update warning, add tests for project requirements check (#11777 ) * Update warning, add tests for project requirements check * Make warning more general for differences between PEP 508 and pip * Add tests for _check_requirements * Parameterize test	2022-11-09 10:59:28 +01:00
Raphael Mitsch	20bbbe3e44	Revert disable/disabled merging behavior (#11745 ) * Merge disable with disabled. Adjust warnings, errors and tests. * Replace any() with set operation. * Update spacy/tests/pipeline/test_pipe_methods.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update docs. * Remove reference to config entry nlp.enabled from docs. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-11-08 14:58:10 +01:00
Adriane Boyd	e116395f89	Add fallback in requirements check, only check once (#11735 ) * Add fallback in requirements check, only check once * Rename to skip_requirements_check * Update spacy/cli/project/run.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2022-11-07 14:46:08 +01:00
Adriane Boyd	e91b47a226	Check for unsafe paths in tarfile.extractall (CVE-2007-4559) (#11746 ) * Adding tarfile member sanitization to extractall() * Format * Simplify and add error message * Fix import * Add comment about CVE Co-authored-by: TrellixVulnTeam <charles.mcfarland@trellix.com>	2022-11-07 10:43:34 +01:00
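The kind of member sanitization referred to in the CVE-2007-4559 commit above can be sketched as follows. The helper names `is_within_directory` and `safe_extract` are illustrative and the code is not copied from spaCy; it only checks that each member's path resolves inside the target directory and makes no attempt to validate symlink or hardlink targets.

```python
import os
import tarfile


def is_within_directory(directory: str, target: str) -> bool:
    """Return True if `target` resolves to a path inside `directory`."""
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extract(tar: tarfile.TarFile, path: str) -> None:
    """Extract `tar` into `path`, refusing members that would escape it."""
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ValueError(f"Unsafe path in tar archive: {member.name}")
    tar.extractall(path)
```

A member named `../evil.txt` joins to a path outside the target directory and is rejected before anything is written. Recent Python versions also provide extraction filters in `tarfile` that cover this and related cases.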
Adriane Boyd	ea326cf47d	Fix types for Span.id and Span.id_ (#11744 )	2022-11-07 08:11:13 +01:00
github-actions[bot]	bbf64cfc43	Auto-format code with black (#11749 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-11-04 11:17:43 +01:00
Adriane Boyd	40e1000db0	Restore Doc attr getter values in Doc.to_json (#11700 )	2022-11-03 11:49:08 +01:00
Paul O'Leary McCann	db56600536	Fix default parameters for load functions (fix #11706 ) (#11713 ) * Fix default parameters for load functions Some load functions used SimpleFrozenList() directly instead of the _DEFAULT_EMPTY_PIPES parameter. That mostly worked as intended, but the changes in #11459 check for equality using identity, not value, so a warning is incorrectly raised sometimes, as in #11706. This change just has all the load functions use the singleton value instead. * Add test that there are no warnings on module-based load This will succeed due to changes in this branch, but local tests with the latest release failed as intended. * Try reverting commit and see if CI changes There is an error in CI that is probably unrelated. Revert "Fix default parameters for load functions" This reverts commit `dc46b35687`. * Revert "Try reverting commit and see if CI changes" This reverts commit `2514ed07ef`. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-11-03 10:52:59 +01:00
Adriane Boyd	420b1d854b	Update textcat scorer threshold behavior (#11696 ) * Update textcat scorer threshold behavior For `textcat` (with exclusive classes) the scorer should always use a threshold of 0.0 because there should be one predicted label per doc and the numeric score for that particular label should not matter. * Rename to test_textcat_multilabel_threshold * Remove all uses of threshold for multi_label=False * Update Scorer.score_cats API docs * Add tests for score_cats with thresholds * Update textcat API docs * Fix types * Convert threshold back to float * Fix threshold type in docstring * Improve formatting in Scorer API docs	2022-11-02 15:35:04 +01:00
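The reasoning in the textcat threshold commit above can be sketched as a small decision function. `predicted_labels` is a hypothetical helper, not part of spaCy's `Scorer`, and the use of `>=` for the multilabel comparison is an assumption: with exclusive classes the highest-scoring label is chosen regardless of its numeric score, while the threshold only applies in the multilabel case.

```python
from typing import Dict, Set


def predicted_labels(
    scores: Dict[str, float], multi_label: bool, threshold: float = 0.5
) -> Set[str]:
    """Hypothetical helper mapping category scores to predicted labels."""
    if multi_label:
        # Multilabel: every label at or above the threshold is predicted.
        return {label for label, score in scores.items() if score >= threshold}
    # Exclusive classes: exactly one label per doc, the one with the top score.
    return {max(scores, key=scores.get)}


# Assumed example scores in which no label reaches 0.5.
example_scores = {"SPORTS": 0.4, "POLITICS": 0.35, "TECH": 0.25}
exclusive = predicted_labels(example_scores, multi_label=False)
multilabel = predicted_labels(example_scores, multi_label=True)
```

In the exclusive case a non-zero threshold would discard the only prediction when all scores are low, which is the situation the commit avoids by fixing the threshold at 0.0.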
Paul O'Leary McCann	d61e742960	Handle Docs with no entities in EntityLinker (#11640 ) * Handle docs with no entities If a whole batch contains no entities it won't make it to the model, but it's possible for individual Docs to have no entities. Before this commit, those Docs would cause an error when attempting to concatenate arrays because the dimensions didn't match. It turns out the process of preparing the Ragged at the end of the span maker forward was a little different from list2ragged, which just uses the flatten function directly. Letting list2ragged do the conversion avoids the dimension issue. This did not come up before because in NEL demo projects it's typical for data with no entities to be discarded before it reaches the NEL component. This includes a simple direct test that shows the issue and checks it's resolved. It doesn't check if there are any downstream changes, so a more complete test could be added. A full run was tested by adding an example with no entities to the Emerson sample project. * Add a blank instance to default training data in tests Rather than adding a specific test, since not failing on instances with no entities is basic functionality, it makes sense to add it to the default set. * Fix without modifying architecture If the architecture is modified this would have to be a new version, but this change isn't big enough to merit that.	2022-10-28 10:25:34 +02:00
Adriane Boyd	865691d169	Adjust default attrs for textcat configs (#11698 )	2022-10-26 08:43:00 +02:00
Adriane Boyd	88d35450dc	Rename test helper method with non-test_ name (#11701 )	2022-10-25 14:53:18 +02:00
github-actions[bot]	84d9cb6b38	Auto-format code with black (#11687 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-10-21 11:54:17 +02:00
Adriane Boyd	7e56701057	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5	2022-10-20 13:38:49 +02:00
Adriane Boyd	3d0e895363	Set version to v3.4.2 (#11672 )	2022-10-19 17:33:55 +02:00
Edward	d66ccb8eb0	Fix multiple entries per custom extension in doc json (#11551 ) * Fix multiple extensions and character offset * Rename token_start/end to start/end * Refactor Doc.from_json based on review * Iterate over user_data items * Only add non-empty underscore entries Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-10-19 15:52:47 +02:00
Paul O'Leary McCann	858565a567	Fix issues with DVC commands (#11592 ) * Fix flag handling in dvc Prior to this commit, if a flag (--verbose or --quiet) was passed to DVC, it would be added to the end of the generated dvc command line. This would result in the flag being interpreted as part of the actual command to run, rather than as an argument to dvc. This would result in command lines like: spacy project run preprocess --verbose That would fail with an error that there's no such directory as `--verbose`. This change puts the flags at the front of the dvc command so that they are interpreted correctly. It removes the `run_dvc_commands` function, which had been reduced to just a for loop and wasn't used elsewhere. A separate problem is that there's no way to specify the quiet behaviour to dvc from the command line, though it's unclear if that's a bug. * Add dvc quiet flag to docs * Handle case in DVC where no commands are appropriate If you only have commands with no deps or outputs (admittedly unlikely), you get a weird error about the dvc file not existing. This gives explicit output instead. * Add support for quiet flag * Fix command execution Commands are strings now because they're joined further up.	2022-10-18 15:11:39 +09:00
Sofie Van Landeghem	2ce6aadda2	update default configs to recent versions (#11618 )	2022-10-17 12:10:03 +02:00
github-actions[bot]	ceb62352bf	Auto-format code with black (#11649 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-10-14 18:04:55 +09:00
Adriane Boyd	6b5a3e7219	Extend to pydantic v1.10 (#11635 ) * Update types in `spacy.schemas` for updated pydantic+mypy	2022-10-14 08:16:49 +02:00
Sofie Van Landeghem	4d869fcc11	Small fixes to docstrings (#11610 ) * add missing scorer arg to docstring * fix class names in textcat_multilabel * add missing scorer to docstrings	2022-10-12 15:17:40 +02:00
Adriane Boyd	fe06e037bc	Fix init for pymorphy2_lookup lemmatizer mode (#11631 )	2022-10-12 12:18:39 +02:00
Sofie Van Landeghem	29649589fc	remove dtype (#11615 )	2022-10-11 15:25:05 +02:00
Sofie Van Landeghem	ef74f8f5e4	Fix mypy error in edittree lemmatizer (#11612 ) * cleanup imports * try limiting Thinc to previous release * remove Model specification * fix code and revert Thinc constraint	2022-10-11 14:15:22 +02:00
svlandeg	9c8cdb403e	Merge branch 'master_copy' into develop_copy	2022-09-30 15:40:26 +02:00
Sofie Van Landeghem	bcda8bc1e7	update mypy to latest version (#11546 ) * update mypy and disable it for python 3.6 * ignoring mypy's type redefinition error	2022-09-29 14:24:40 +02:00
Adriane Boyd	6d7630c5d3	Allow overriding spacy_version in spacy package meta (#11552 )	2022-09-29 10:44:06 +02:00
Peter Baumgartner	e794d4ae39	`debug data` Spancat Table Improvements (#11504 ) * update * fix format function * pull out _format_number * format with black	2022-09-28 17:16:05 +02:00
Raphael Mitsch	aea16719be	Simplify and clarify enable/disable behavior of spacy.load() (#11459 ) * Change enable/disable behavior so that arguments take precedence over config options. Extend error message on conflict. Add warning message in case of overwriting config option with arguments. * Fix tests in test_serialize_pipeline.py to reflect changes to handling of enable/disable. * Fix type issue. * Move comment. * Move comment. * Issue UserWarning instead of printing wasabi message. Adjust test. * Added pytest.warns(UserWarning) for expected warning to fix tests. * Update warning message. * Move type handling out of fetch_pipes_status(). * Add global variable for default value. Use id() to determine whether used values are default value. * Fix default value for disable. * Rename DEFAULT_PIPE_STATUS to _DEFAULT_EMPTY_PIPES.	2022-09-27 14:22:36 +02:00
Jacobo Myerston	3e8bc1272f	add punctuation to grc (#11426 ) * add punctuation to grc Add support for special editorial punctuation that is common in ancient Greek texts. Ancient Greek texts, as found in digital and print form, have been largely edited by scholars. Restorations and improvements are normally marked with special characters that need to be handled properly by the tokenizer. * add unit tests * simplify regex * move generic quotes to char classes * rename unit test * fix regex Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: svlandeg <svlandeg@github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-09-27 11:38:56 +02:00
Adriane Boyd	877671e09a	Preserve missing entity annotation in augmenters (#11540 ) Preserve both `-` and `O` annotation in augmenters rather than relying on `Example.to_dict`'s default support for one option outside of labeled entity spans. This is intended as a temporary workaround for augmenters for v3.4.x. The behavior of `Example` and related IOB utils could be improved in the general case for v3.5.	2022-09-27 10:16:51 +02:00
Richard Hudson	6f692a06d5	Remove side effects from Doc.__init__() (#11506 ) * Remove side effects from Doc.__init__() * Changes based on review comment * Readd test * Change interface of Doc.__init__() * Simplify test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update doc.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-09-26 15:58:21 +02:00
Raphael Mitsch	af9b01ef97	Add dependency check to project step runs (#11226 ) * Add dependency check to project step running. * Fix dependency mismatch warning. * Remove newline. * Add types-setuptools to setup.cfg. * Move types-setuptools to test requirements. Move warnings into _validate_requirements(). Handle file reading in project_run(). * Remove newline formatting for output of package conflicts. * Show full version conflict message instead of just package name. * Update spacy/cli/project/run.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Fix typo. * Re-add rephrasing of message for conflicting packages. Remove requirements path redundancy. * Update spacy/cli/project/run.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/cli/project/run.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Print unified message for requirement conflicts and missing requirements. * Update spacy/cli/project/run.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Fix warning message. * Print conflict/missing messages individually. * Print conflict/missing messages individually. * Add check_requirements setting in project.yml to disable requirements check. * Update website/docs/usage/projects.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/usage/projects.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update description of project.yml structure in projects.md. * Update website/docs/usage/projects.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Prettify projects docs. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-09-16 16:54:31 +02:00
github-actions[bot]	279358be63	Auto-format code with black (#11513 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-09-16 11:50:19 +02:00
Sofie Van Landeghem	0509f90874	add dot (#11500 )	2022-09-15 17:29:42 +02:00
Adriane Boyd	7c98245c0c	Add levenshtein from polyleven (#11418 ) Add a simple levenshtein distance function using the implementation from the polyleven library as `spacy.matcher.levenshtein`.	2022-09-14 17:05:22 +02:00
Sofie Van Landeghem	cc10a27c59	Prevent tok2vec from broadcasting to listeners when predicting (#11385 ) * replicate bug with tok2vec in annotating components * add overfitting test with a frozen tok2vec * remove broadcast from predict and check doc.tensor instead * remove broadcast * proper error * slight rephrase of documentation	2022-09-12 15:36:48 +02:00

9398 Commits