spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-10 10:41:14 +03:00

Author	SHA1	Message	Date
Adriane Boyd	f94168a41e	Backport bugfixes from v3.1.0 to v3.0 (#8739) * Fix scoring normalization (#7629) * fix scoring normalization * score weights by total sum instead of per component * cleanup * more cleanup * Use a context manager when reading model (fix #7036) (#8244) * Fix other open calls without context managers (#8245) * Don't add duplicate patterns all the time in EntityRuler (fix #8216) (#8246) * Don't add duplicate patterns (fix #8216) * Refactor EntityRuler init This simplifies the EntityRuler init code. This is helpful as prep for allowing the EntityRuler to reset itself. * Make EntityRuler.clear reset matchers Includes a new test for this. * Tidy PhraseMatcher instantiation Since the attr can be None safely now, the guard if is no longer required here. Also renamed the `_validate` attr. Maybe it's not needed? * Fix NER test * Add test to make sure patterns aren't increasing * Move test to regression tests * Exclude generated .cpp files from package (#8271) * Fix non-deterministic deduplication in Greek lemmatizer (#8421) * Fix setting empty entities in Example.from_dict (#8426) * Filter W036 for entity ruler, etc. (#8424) * Preserve paths.vectors/initialize.vectors setting in quickstart template * Various fixes for spans in Docs.from_docs (#8487) * Fix spans offsets if a doc ends in a single space and no space is inserted * Also include spans key in merged doc for empty spans lists * Fix duplicate spacy package CLI opts (#8551) Use `-c` for `--code` and not additionally for `--create-meta`, in line with the docs. * Raise an error for textcat with <2 labels (#8584) * Raise an error for textcat with <2 labels Raise an error if initializing a `textcat` component without at least two labels. * Add similar note to docs * Update positive_label description in API docs * Add Macedonian models to website (#8637) * Fix Azerbaijani init, extend lang init tests (#8656) * Extend langs in initialize tests * Fix az init * Fix ru/uk lemmatizer mp with spawn (#8657) Use an instance variable instead of a class variable for the morphological analyzer so that multiprocessing with spawn is possible. * Use 0-vector for OOV lexemes (#8639) * Set version to v3.0.7 Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2021-07-19 09:20:40 +02:00
Adriane Boyd	bb97e7bf8a	Update validate CLI to fix compat and ignore warnings (#8423)	2021-07-14 23:28:08 +02:00
Adriane Boyd	8a2602051c	Update debug data for textcat (#8066) * Check for unsupported cats values * Only show labels if train/dev mismatched * Don't show label counts (only counting positive labels seems odd) * Use warnings for mismatched train/dev labels	2021-05-17 13:27:04 +02:00
Sofie Van Landeghem	02a6a5fea0	Fix 'debug model' for transformers + generalize (#7973) * add overrides to docs * fix debug model with transformer * assume training data is set in config	2021-05-06 18:43:32 +10:00
Paul O'Leary McCann	8007d5c814	Check if the resume path points to a directory (#7919) This came up in #7878, but if --resume-path is a directory then loading the weights will fail. On Linux this will give a straightforward error message, but on Windows it gives "Permission Denied", which is confusing.	2021-04-28 09:17:15 +02:00
Paul O'Leary McCann	de6b5ed14d	Fix percent unk display in debug data (#7886) * Fix percent unk display This was showing (ratio %), so 10% would show as 0.10%. Fix by multiplying ratio by 100. Might want to add a warning if this is over a threshold. * Only show whole-integer percents	2021-04-27 09:16:35 +02:00
Adriane Boyd	d2bdaa7823	Replace negative rows with 0 in StaticVectors (#7674) * Replace negative rows with 0 in StaticVectors Replace negative row indices with 0-vectors in `StaticVectors`. * Increase versions related to StaticVectors * Increase versions of all architectures and layers related to `StaticVectors` * Improve efficiency of 0-vector operations Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5 * Update config defaults to new versions * Update docs	2021-04-22 18:04:15 +10:00
Sofie Van Landeghem	c786e98e56	assemble CLI command (#7783) * assemble CLI command * ensure assemble runs even without training section * cleanup	2021-04-19 18:39:11 +10:00
Bram Vanroy	ed561cf428	Terminology: deprecated vs obsolete (#7621) * Terminology: deprecated vs obsolete Typically, deprecated is used for functionality that is bound to become unavailable but that can still be used. Obsolete is used for features that have been removed. In E941, I think what is meant is "obsolete" since loading a model by a shortcut simply does not work anymore (and throws an error). This is different from downloading a model with a shortcut, which is deprecated but still works. In light of this, perhaps all other error codes should be checked as well. * clarify that the link command is removed and not just deprecated Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-04-12 14:37:00 +02:00
Adriane Boyd	73a8c0f992	Update debug data further for v3 (#7602) * Update debug data further for v3 * Remove new/existing label distinction (new labels are not immediately distinguishable because the pipeline is already initialized) * Warn on missing labels in training data for all components except parser * Separate textcat and textcat_multilabel sections * Add section for morphologizer * Reword missing label warnings	2021-04-09 11:53:42 +02:00
Adriane Boyd	03e9e7b567	Add --code option to init fill-config	2021-03-12 10:03:57 +01:00
Adriane Boyd	ce6317231f	Add --code to spacy debug CLI	2021-03-12 09:51:26 +01:00
Sofie Van Landeghem	932887b950	textcat scoring fix and multi_label docs (#6974) * add multi-label textcat to menu * add infobox on textcat API * add info to v3 migration guide * small edits * further fixes in doc strings * add infobox to textcat architectures * add textcat_multilabel to overview of built-in components * spelling * fix unrelated warn msg * Add textcat_multilabel to quickstart [ci skip] * remove separate documentation page for multilabel_textcategorizer * small edits * positive label clarification * avoid duplicating information in self.cfg and fix textcat.score * fix multilabel textcat too * revert threshold to storage in cfg * revert threshold stuff for multi-textcat Co-authored-by: Ines Montani <ines@ines.io>	2021-03-09 23:04:22 +11:00
Ines Montani	ea555b03e0	Merge pull request #7255 from adrianeboyd/bugfix/extraneous-tok2vec Omit unused tok2vec/transformer components	2021-03-03 23:15:06 +11:00
Adriane Boyd	8a4200d4e9	Omit unused tok2vec/transformer components Omit unused tok2vec/transformer components in quickstart template.	2021-03-02 15:53:30 +01:00
Adriane Boyd	fb98862337	Add hint for --gpu-id to CLI device info (#7234) * Add hint for --gpu-id to CLI device info If the user has `cupy` and an available GPU, add a hint about using `--gpu-id 0` to the CLI output. * Undo change to original CPU message	2021-03-03 01:11:18 +11:00
Adriane Boyd	ee7bb0b393	Fix formatting in bg/bn quickstart recs	2021-02-26 17:08:37 +01:00
Adriane Boyd	30e1a89aeb	Fix displacy output in evaluate CLI (#7122) Now that `nlp.evaluate()` does not modify the examples, rerun the pipeline on the (limited) texts in order to provide the predicted annotation in the displacy output option.	2021-02-19 23:01:20 +11:00
Adriane Boyd	4188beda87	Fix conll converter option (#7071) Map `conll` to the NER converter, not the `CoNLL-U` converter.	2021-02-18 10:22:41 +01:00
Ines Montani	1e3a326e53	Change Dutch transformer recommendation [ci skip] https://github.com/explosion/spaCy/discussions/6529#discussioncomment-366620	2021-02-14 15:30:16 +11:00
Ines Montani	f4f46b617f	Preserve sourced components in fill-config (fixes #7055) (#7058)	2021-02-14 14:02:14 +11:00
Adriane Boyd	0ee2ae86bf	Update trf quickstart recommendations Add/update trf recommendations for Bengali, Hindi, Sinhala, and Tamil based on #7044.	2021-02-12 15:55:17 +01:00
Ines Montani	26bf642afd	Fix issue #7019: Handle None scores in evaluate printer (#7026)	2021-02-11 16:45:23 +11:00
Ines Montani	c08b3f294c	Support env vars and CLI overrides for project.yml	2021-02-10 13:45:27 +11:00
svlandeg	f852af2acf	add capture arg	2021-02-02 19:47:12 +01:00
Sofie Van Landeghem	f319d2765f	Add capture argument to project_run (#6878) * add capture argument to project_run and run_commands * git bump to 3.0.1 * Set version to 3.0.1.dev0 Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2021-02-02 10:11:15 +08:00
Ines Montani	a59f3fcf5d	Make wheel the default format and update docs [ci skip]	2021-02-01 23:18:43 +11:00
Ines Montani	b9573e9e22	Fix pip args	2021-02-01 23:15:00 +11:00
Ines Montani	b46073234a	Fix default clone branch and error handling [ci skip]	2021-02-01 22:29:04 +11:00
Adriane Boyd	35a863cd27	Remove nlp.tokenizer from quickstart template Remove `nlp.tokenizer` from quickstart template so that the default language-specific tokenizer settings are filled instead.	2021-02-01 11:20:12 +01:00
Ines Montani	f058cbd751	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2021-01-30 21:03:25 +11:00
Ines Montani	3435b894df	Remove nightly reference from auto docs [ci skip]	2021-01-30 20:12:08 +11:00
Ines Montani	d0c3775712	Replace links to nightly docs [ci skip]	2021-01-30 20:09:38 +11:00
Ines Montani	b26a3daa9a	Merge pull request #6860 from explosion/feature/package-wheel	2021-01-30 14:17:01 +11:00
Ines Montani	2332c4280b	Update and use unified --build option	2021-01-30 13:11:36 +11:00
Ines Montani	e6accb3a9e	Tidy up and auto-format	2021-01-30 12:52:33 +11:00
Ines Montani	30765674d0	Merge branch 'master' into develop	2021-01-30 12:20:28 +11:00
Ines Montani	2609ba4e89	Support building wheel in spacy package	2021-01-30 11:54:02 +11:00
Pamphile ROY	41ee75ac6d	Remove --no-cache-dir when downloading models When `--no-cache-dir` is present, it prevents caching to properly function. If the user still wants to do this, there is the possibility to pass options with `user_pip_args`. But you should not enforce options like these. In my case this is preventing some docker build (using buildkit caching) to have proper caching of models.	2021-01-29 15:37:44 +01:00
Ines Montani	78d6ff4dd4	Update quickstart recommendations	2021-01-28 11:14:49 +11:00
Ines Montani	ec5f55aa5b	Update config generation defaults and transformers (#6832)	2021-01-27 23:56:33 +11:00
Ines Montani	c0926c9088	WIP: Various small training changes (#6818) * Allow output_path to be None during training * Fix cat scoring (?) * Improve error message for weighted None score * Improve messages So we can call this in other places etc. * Fix output path check * Use latest wasabi * Revert "Improve error message for weighted None score" This reverts commit `7059926763`. * Exclude None scores from final score by default It's otherwise very difficult to keep track of the score weights if we modify a config programmatically, source components etc. * Update warnings and use logger.warning	2021-01-26 14:51:52 +11:00
Adriane Boyd	0f2de39efb	Fix types for exclude args in info CLI (#6808)	2021-01-25 20:00:22 +08:00
KeshavG-lb	0a86d833d7	Spacy Cli info method causing backward compatibility issues (#6793) * Spacy Cli info method causing backward compatibility issues #6791 fix backward compatibility by setting default value to exclude in info method. * setting empty list as default argument is dangerous. so setting default to None and then setting it to empty list, if None. Reference: https://nikos7am.com/posts/mutable-default-arguments/	2021-01-23 11:21:43 +01:00
Ines Montani	b0b743597c	Tidy up and auto-format	2021-01-15 11:57:36 +11:00
Ines Montani	e8a97a2bd6	Merge pull request #6720 from adrianeboyd/feature/improved-init-training-config-validation	2021-01-15 11:45:24 +11:00
Adriane Boyd	5fb8b7037a	Expand initialize/training config validation Validate both `[initialize]` and `[training]` in `debug data` and `nlp.initialize()` with separate config validation error blocks that indicate which block of the config is being validated.	2021-01-12 17:17:00 +01:00
svlandeg	1abeca90a6	refer to _parser_internals.nonproj.DELIMITER	2021-01-07 18:58:13 +01:00
Sofie Van Landeghem	75d9019343	Fix types of Tok2Vec encoding architectures (#6442) * fix TorchBiLSTMEncoder documentation * ensure the types of the encoding Tok2vec layers are correct * update references from v1 to v2 for the new architectures	2021-01-07 16:39:27 +11:00
Sofie Van Landeghem	afc5714d32	multi-label textcat component (#6474) * multi-label textcat component * formatting * fix comment * cleanup * fix from #6481 * random edit to push the tests * add explicit error when textcat is called with multi-label gold data * fix error nr * small fix	2021-01-06 13:07:14 +11:00
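
Note on commit 0a86d833d7 (#6793): the message says an empty list is a dangerous default argument and that the default was changed to None. The sketch below illustrates that general Python pitfall and the None-sentinel pattern. It is not the actual spaCy code: the function names and the "labels" value are hypothetical, chosen only for illustration.

```python
from typing import List, Optional


def info_shared_default(exclude: List[str] = []) -> List[str]:
    # Hypothetical function showing the pitfall: the default list is created
    # once, when the function is defined, and reused by every call that
    # omits `exclude`.
    exclude.append("labels")
    return exclude


def info_none_default(exclude: Optional[List[str]] = None) -> List[str]:
    # Hypothetical function showing the fix described in the commit message:
    # default to None, then create a fresh list inside the call.
    if exclude is None:
        exclude = []
    exclude.append("labels")
    return exclude


first = info_shared_default()
second = info_shared_default()
print(second)           # ['labels', 'labels']
print(first is second)  # True: both names refer to the one shared default

print(info_none_default())  # ['labels']
print(info_none_default())  # ['labels']
```

With the None sentinel, state no longer leaks between calls, and callers that omit the argument keep working, which matches the backward-compatibility concern raised in the commit.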

1098 Commits