spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-26 18:06:29 +03:00

Author	SHA1	Message	Date
Adriane Boyd	1e9b4b55ee	Pass overrides to subcommands in workflows (#9059 ) * Pass overrides to subcommands in workflows * Add missing docstring	2021-08-30 09:23:54 +02:00
Ines Montani	4cd052e81d	Include component factories in third-party dependencies resolver (#9009 ) * Include component factories in third-party dependencies resolver * Increment catalogue and update test	2021-08-25 14:58:01 +02:00
Ines Montani	d94ddd5686	Auto-detect package dependencies in spacy package (#8948 ) * Auto-detect package dependencies in spacy package * Add simple get_third_party_dependencies test * Import packages_distributions explicitly * Inline packages_distributions * Fix docstring [ci skip] * Relax catalogue requirement * Move importlib_metadata to spacy.compat with note * Include license information [ci skip]	2021-08-17 14:05:13 +02:00
Adriane Boyd	8448c7dbc5	Update da trf recommendation (#8921 ) Update the da trf recommendation to the same model used in the pretrained pipelines.	2021-08-12 13:54:02 +02:00
github-actions[bot]	56d4d87aeb	Auto-format code with black (#8895 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-08-06 13:38:06 +02:00
Kabir Khan	1dfffe5fb4	No output info message in train (#8885 ) * Add info message that no output directory was provided in train * Update train.py * Fix logging	2021-08-05 09:21:22 +02:00
Nick Sorros	0485cdefcc	Add logger debug for project push and pull (#8860 ) * Add logger debug for project push and pull * Sign contributor agreement	2021-08-02 18:13:53 +02:00
Paul O'Leary McCann	284b530c63	Respect the no_skip value Seems like the logic for this was just left out. See #8796.	2021-07-24 15:31:17 +09:00
Adriane Boyd	6bbc2b1956	Reload train corpus in debug data after initialize (#8776 )	2021-07-21 22:38:40 +02:00
Edward	8233359225	Fix preservation of spacy package meta (#8663 ) * update package meta with existing_meta and nlp_meta * Add spaCy contributor agreement * Added more info when creating readme	2021-07-12 11:18:52 +02:00
Sofie Van Landeghem	733e8ceea9	fix spancat initialize with labels (#8620 )	2021-07-06 19:08:25 +02:00
Sofie Van Landeghem	608fc1d623	avoid msg var implicitness (#8619 ) * avoid msg var implicitness * rename local msg * Add CI tests for debug data and train * Adjust debug data CLI test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-07-06 19:08:08 +02:00
Sofie Van Landeghem	b9f59118bf	Fix silent evaluation (#8581 ) * fix silentness * sneak in docs typo fix * pass silent boolean instead	2021-07-06 14:16:19 +02:00
Ines Montani	327f83573a	Move scores per type handling into util function (#8590 )	2021-07-06 13:02:37 +02:00
Adriane Boyd	2b8c679a3d	Fix duplicate spacy package CLI opts (#8551 ) Use `-c` for `--code` and not additionally for `--create-meta`, in line with the docs.	2021-06-30 11:23:26 +02:00
Adriane Boyd	86d01e9229	Tidy up with flake8: imports, comparisons, etc.	2021-06-28 12:08:15 +02:00
Adriane Boyd	5eeb25f043	Tidy up code	2021-06-28 12:08:15 +02:00
Santiago Castro	ee63b2b199	Fix typo in `train_cli` docstring	2021-06-25 22:45:03 -07:00
Matthew Honnibal	f9946154d9	Add SpanCategorizer component (#6747 ) * Draft spancat model * Add spancat model * Add test for extract_spans * Add extract_spans layer * Upd extract_spans * Add spancat model * Add test for spancat model * Upd spancat model * Update spancat component * Upd spancat * Update spancat model * Add quick spancat test * Import SpanCategorizer * Fix SpanCategorizer component * Import SpanGroup * Fix span extraction * Fix import * Fix import * Upd model * Update spancat models * Add scoring, update defaults * Update and add docs * Fix type * Update spacy/ml/extract_spans.py * Auto-format and fix import * Fix comment * Fix type * Fix type * Update website/docs/api/spancategorizer.md * Fix comment Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Better defense Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix labels list Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/ml/extract_spans.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/pipeline/spancat.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Set annotations during update * Set annotations in spancat * fix imports in test * Update spacy/pipeline/spancat.py * replace MaxoutLogistic with LinearLogistic * fix config * various small fixes * remove set_annotations parameter in update * use our beloved tupley format with recent support for doc.spans * bugfix to allow renaming the default span_key (scores weren't showing up) * use different key in docs example * change defaults to better-working parameters from project (WIP) * register spacy.extract_spans.v1 for legacy purposes * Upd dev version so can build wheel * layers instead of architectures for smaller building blocks * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Include additional scores from overrides in combined score weights * Parameterize spans key in scoring Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so that it's possible to evaluate multiple `spancat` components in the same pipeline. * Use the (intentionally very short) default spans key `sc` in the `SpanCategorizer` * Adjust the default score weights to include the default key * Adjust the scorer to use `spans_{spans_key}` as the prefix for the returned score * Revert addition of `attr_name` argument to `score_spans` and adjust the key in the `getter` instead. Note that for `spancat` components with a custom `span_key`, the score weights currently need to be modified manually in `[training.score_weights]` for them to be available during training. To suppress the default score weights `spans_sc_p/r/f` during training, set them to `null` in `[training.score_weights]`. * Update website/docs/api/scorer.md * Fix scorer for spans key containing underscore * Increment version * Add Spans to Evaluate CLI (#8439) * Add Spans to Evaluate CLI * Change to spans_key * Add spans per_type output Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Fix spancat GPU issues (#8455) * Fix GPU issues * Require thinc >=8.0.6 * Switch to glorot_uniform_init * Fix and test ngram suggester * Include final ngram in doc for all sizes * Fix ngrams for docs of the same length as ngram size * Handle batches of docs that result in no ngrams * Add tests Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Nirant <NirantK@users.noreply.github.com>	2021-06-24 12:35:27 +02:00
Ines Montani	fb9b389f52	Merge pull request #8486 from adrianeboyd/bugfix/template-paths-vectors Preserve paths.vectors/initialize.vectors setting in quickstart template	2021-06-24 13:12:18 +10:00
Ines Montani	3982be14e8	Improve fallbacks	2021-06-24 11:55:50 +10:00
Adriane Boyd	5aa099505f	Preserve paths.vectors/initialize.vectors setting in quickstart template	2021-06-23 11:07:14 +02:00
Ines Montani	cdcbd1023a	Auto-generate README in spacy package	2021-06-22 12:06:25 +10:00
Adriane Boyd	9fde258053	Use minor version for compatibility check (#8403 ) * Use minor version for compatibility check * Use minor version of compatibility table * Soften warning message about incompatible models * Add test for presence of current version in compatibility table * Add test for download compatibility table * Use minor version of lower pin in error message if possible * Fall back to spacy_git_version if available * Fix unknown version string	2021-06-21 09:39:22 +02:00
Adriane Boyd	83fd04dee5	Update package CLI handling of README and LICENSE (#8422 ) * Copy rather than move files to top-level of package * Add all files to `MANIFEST.in` (primarily for older versions of pip) * Include the `README.md` contents as `long_description` in the setup	2021-06-18 15:48:53 +02:00
Sofie Van Landeghem	e796aab4b3	Resizable textcat (#7862 ) * implement textcat resizing for TextCatCNN * resizing textcat in-place * simplify code * ensure predictions for old textcat labels remain the same after resizing (WIP) * fix for softmax * store softmax as attr * fix ensemble weight copy and cleanup * restructure slightly * adjust documentation, update tests and quickstart templates to use latest versions * extend unit test slightly * revert unnecessary edits * fix typo * ensemble architecture won't be resizable for now * use resizable layer (WIP) * revert using resizable layer * resizable container while avoid shape inference trouble * cleanup * ensure model continues training after resizing * use fill_b parameter * use fill_defaults * resize_layer callback * format * bump thinc to 8.0.4 * bump spacy-legacy to 3.0.6	2021-06-16 11:45:00 +02:00
Adriane Boyd	5646fcbe46	Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1	2021-06-15 15:05:17 +02:00
Adriane Boyd	d9be9e6cf9	Move README.md and LICENSES_SOURCES in package (#8297 ) In addition to `LICENSE`, move the files `README.md` and `LICENSES_SOURCES` to the top directory in `spacy package` if present in the model directory.	2021-06-11 10:20:24 +02:00
Paul O'Leary McCann	d54631f68b	Fix other open calls without context managers (#8245 )	2021-05-31 19:04:29 +10:00
Adriane Boyd	cd6bd91c3a	Switch default train corpus max_length to 0 in quickstart (#8142 ) The behavior of `spacy.Corpus.v1` is unexpected enough for `max_length != 0` that `0` is a better default for users creating a new config with the quickstart. If not, documents are skipped, sometimes the entire corpus is skipped, and sometimes documents are (quite unexpectedly for your average user) split into sentences.	2021-05-20 14:48:09 +02:00
Adriane Boyd	8a2602051c	Update debug data for textcat (#8066 ) * Check for unsupported cats values * Only show labels if train/dev mismatched * Don't show label counts (only counting positive labels seems odd) * Use warnings for mismatched train/dev labels	2021-05-17 13:27:04 +02:00
Sofie Van Landeghem	02a6a5fea0	Fix 'debug model' for transformers + generalize (#7973 ) * add overrides to docs * fix debug model with transformer * assume training data is set in config	2021-05-06 18:43:32 +10:00
Paul O'Leary McCann	8007d5c814	Check if the resume path points to a directory (#7919 ) This came up in #7878, but if --resume-path is a directory then loading the weights will fail. On Linux this will give a straightforward error message, but on Windows it gives "Permission Denied", which is confusing.	2021-04-28 09:17:15 +02:00
Paul O'Leary McCann	de6b5ed14d	Fix percent unk display in debug data (#7886 ) * Fix percent unk display This was showing (ratio %), so 10% would show as 0.10%. Fix by multiplying ratio by 100. Might want to add a warning if this is over a threshold. * Only show whole-integer percents	2021-04-27 09:16:35 +02:00
Sofie Van Landeghem	95e3cf576b	Optionally append lang for packaged model name (#7417 ) * Add empty lines at the end of Python files * Only prepend the lang code if it's not there already * Update spacy/cli/package.py * fix whitespace stripping	2021-04-26 16:53:21 +02:00
Adriane Boyd	d2bdaa7823	Replace negative rows with 0 in StaticVectors (#7674 ) * Replace negative rows with 0 in StaticVectors Replace negative row indices with 0-vectors in `StaticVectors`. * Increase versions related to StaticVectors * Increase versions of all architectures and layers related to `StaticVectors` * Improve efficiency of 0-vector operations Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5 * Update config defaults to new versions * Update docs	2021-04-22 18:04:15 +10:00
Sofie Van Landeghem	c786e98e56	assemble CLI command (#7783 ) * assemble CLI command * ensure assemble runs even without training section * cleanup	2021-04-19 18:39:11 +10:00
Bram Vanroy	ed561cf428	Terminology: deprecated vs obsolete (#7621 ) * Terminology: deprecated vs obsolete Typically, deprecated is used for functionality that is bound to become unavailable but that can still be used. Obsolete is used for features that have been removed. In E941, I think what is meant is "obsolete" since loading a model by a shortcut simply does not work anymore (and throws an error). This is different from downloading a model with a shortcut, which is deprecated but still works. In light of this, perhaps all other error codes should be checked as well. * clarify that the link command is removed and not just deprecated Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-04-12 14:37:00 +02:00
Adriane Boyd	73a8c0f992	Update debug data further for v3 (#7602 ) * Update debug data further for v3 * Remove new/existing label distinction (new labels are not immediately distinguishable because the pipeline is already initialized) * Warn on missing labels in training data for all components except parser * Separate textcat and textcat_multilabel sections * Add section for morphologizer * Reword missing label warnings	2021-04-09 11:53:42 +02:00
Adriane Boyd	03e9e7b567	Add --code option to init fill-config	2021-03-12 10:03:57 +01:00
Adriane Boyd	ce6317231f	Add --code to spacy debug CLI	2021-03-12 09:51:26 +01:00
Sofie Van Landeghem	932887b950	textcat scoring fix and multi_label docs (#6974 ) * add multi-label textcat to menu * add infobox on textcat API * add info to v3 migration guide * small edits * further fixes in doc strings * add infobox to textcat architectures * add textcat_multilabel to overview of built-in components * spelling * fix unrelated warn msg * Add textcat_multilabel to quickstart [ci skip] * remove separate documentation page for multilabel_textcategorizer * small edits * positive label clarification * avoid duplicating information in self.cfg and fix textcat.score * fix multilabel textcat too * revert threshold to storage in cfg * revert threshold stuff for multi-textcat Co-authored-by: Ines Montani <ines@ines.io>	2021-03-09 23:04:22 +11:00
Ines Montani	ea555b03e0	Merge pull request #7255 from adrianeboyd/bugfix/extraneous-tok2vec Omit unused tok2vec/transformer components	2021-03-03 23:15:06 +11:00
Adriane Boyd	8a4200d4e9	Omit unused tok2vec/transformer components Omit unused tok2vec/transformer components in quickstart template.	2021-03-02 15:53:30 +01:00
Adriane Boyd	fb98862337	Add hint for --gpu-id to CLI device info (#7234 ) * Add hint for --gpu-id to CLI device info If the user has `cupy` and an available GPU, add a hint about using `--gpu-id 0` to the CLI output. * Undo change to original CPU message	2021-03-03 01:11:18 +11:00
Adriane Boyd	ee7bb0b393	Fix formatting in bg/bn quickstart recs	2021-02-26 17:08:37 +01:00
Adriane Boyd	30e1a89aeb	Fix displacy output in evaluate CLI (#7122 ) Now that `nlp.evaluate()` does not modify the examples, rerun the pipeline on the (limited) texts in order to provide the predicted annotation in the displacy output option.	2021-02-19 23:01:20 +11:00
Adriane Boyd	4188beda87	Fix conll converter option (#7071 ) Map `conll` to the NER converter, not the `CoNLL-U` converter.	2021-02-18 10:22:41 +01:00
Ines Montani	1e3a326e53	Change Dutch transformer recommendation [ci skip] https://github.com/explosion/spaCy/discussions/6529#discussioncomment-366620	2021-02-14 15:30:16 +11:00
Ines Montani	f4f46b617f	Preserve sourced components in fill-config (fixes #7055 ) (#7058 )	2021-02-14 14:02:14 +11:00
Adriane Boyd	0ee2ae86bf	Update trf quickstart recommendations Add/update trf recommendations for Bengali, Hindi, Sinhala, and Tamil based on #7044.	2021-02-12 15:55:17 +01:00
Ines Montani	26bf642afd	Fix issue #7019 : Handle None scores in evaluate printer (#7026 )	2021-02-11 16:45:23 +11:00
Ines Montani	c08b3f294c	Support env vars and CLI overrides for project.yml	2021-02-10 13:45:27 +11:00
svlandeg	f852af2acf	add capture arg	2021-02-02 19:47:12 +01:00
Sofie Van Landeghem	f319d2765f	Add capture argument to project_run (#6878 ) * add capture argument to project_run and run_commands * git bump to 3.0.1 * Set version to 3.0.1.dev0 Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2021-02-02 10:11:15 +08:00
Ines Montani	a59f3fcf5d	Make wheel the default format and update docs [ci skip]	2021-02-01 23:18:43 +11:00
Ines Montani	b9573e9e22	Fix pip args	2021-02-01 23:15:00 +11:00
Ines Montani	b46073234a	Fix default clone branch and error handling [ci skip]	2021-02-01 22:29:04 +11:00
Adriane Boyd	35a863cd27	Remove nlp.tokenizer from quickstart template Remove `nlp.tokenizer` from quickstart template so that the default language-specific tokenizer settings are filled instead.	2021-02-01 11:20:12 +01:00
Ines Montani	f058cbd751	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2021-01-30 21:03:25 +11:00
Ines Montani	3435b894df	Remove nightly reference from auto docs [ci skip]	2021-01-30 20:12:08 +11:00
Ines Montani	d0c3775712	Replace links to nightly docs [ci skip]	2021-01-30 20:09:38 +11:00
Ines Montani	b26a3daa9a	Merge pull request #6860 from explosion/feature/package-wheel	2021-01-30 14:17:01 +11:00
Ines Montani	2332c4280b	Update and use unified --build option	2021-01-30 13:11:36 +11:00
Ines Montani	e6accb3a9e	Tidy up and auto-format	2021-01-30 12:52:33 +11:00
Ines Montani	30765674d0	Merge branch 'master' into develop	2021-01-30 12:20:28 +11:00
Ines Montani	2609ba4e89	Support building wheel in spacy package	2021-01-30 11:54:02 +11:00
Pamphile ROY	41ee75ac6d	Remove --no-cache-dir when downloading models When `--no-cache-dir` is present, it prevents caching to properly function. If the user still wants to do this, there is the possibility to pass options with `user_pip_args`. But you should not enforce options like these. In my case this is preventing some docker build (using buildkit caching) to have proper caching of models.	2021-01-29 15:37:44 +01:00
Ines Montani	78d6ff4dd4	Update quickstart recommendations	2021-01-28 11:14:49 +11:00
Ines Montani	ec5f55aa5b	Update config generation defaults and transformers (#6832 )	2021-01-27 23:56:33 +11:00
Ines Montani	c0926c9088	WIP: Various small training changes (#6818 ) * Allow output_path to be None during training * Fix cat scoring (?) * Improve error message for weighted None score * Improve messages So we can call this in other places etc. * Fix output path check * Use latest wasabi * Revert "Improve error message for weighted None score" This reverts commit `7059926763`. * Exclude None scores from final score by default It's otherwise very difficult to keep track of the score weights if we modify a config programmatically, source components etc. * Update warnings and use logger.warning	2021-01-26 14:51:52 +11:00
Adriane Boyd	0f2de39efb	Fix types for exclude args in info CLI (#6808 )	2021-01-25 20:00:22 +08:00
KeshavG-lb	0a86d833d7	Spacy Cli info method causing backward compatibility issues (#6793 ) * Spacy Cli info method causing backward compatibility issues #6791 fix backward compatibility by setting default value to exclude in info method. * setting empty list as default argument is dangerous. so setting default to None and then setting it to an empty list, if None. Reference : https://nikos7am.com/posts/mutable-default-arguments/	2021-01-23 11:21:43 +01:00
Ines Montani	b0b743597c	Tidy up and auto-format	2021-01-15 11:57:36 +11:00
Ines Montani	e8a97a2bd6	Merge pull request #6720 from adrianeboyd/feature/improved-init-training-config-validation	2021-01-15 11:45:24 +11:00
Adriane Boyd	5fb8b7037a	Expand initialize/training config validation Validate both `[initialize]` and `[training]` in `debug data` and `nlp.initialize()` with separate config validation error blocks that indicate which block of the config is being validated.	2021-01-12 17:17:00 +01:00
svlandeg	1abeca90a6	refer to _parser_internals.nonproj.DELIMITER	2021-01-07 18:58:13 +01:00
Sofie Van Landeghem	75d9019343	Fix types of Tok2Vec encoding architectures (#6442 ) * fix TorchBiLSTMEncoder documentation * ensure the types of the encoding Tok2vec layers are correct * update references from v1 to v2 for the new architectures	2021-01-07 16:39:27 +11:00
Sofie Van Landeghem	afc5714d32	multi-label textcat component (#6474 ) * multi-label textcat component * formatting * fix comment * cleanup * fix from #6481 * random edit to push the tests * add explicit error when textcat is called with multi-label gold data * fix error nr * small fix	2021-01-06 13:07:14 +11:00
Ines Montani	6f83abb971	Merge pull request #6647 from svlandeg/feature/init_config_overwrite	2021-01-05 14:59:04 +11:00
Ines Montani	81f018fb67	Merge pull request #6671 from explosion/chore/tidy-autoformat Tidy up and auto-format	2021-01-05 14:45:31 +11:00
Ines Montani	a9e845426f	Use --force for consistency and add docs	2021-01-05 13:49:59 +11:00
Ines Montani	991669c934	Tidy up and auto-format	2021-01-05 13:41:53 +11:00
svlandeg	712a78b74a	add simple unit test	2020-12-30 12:35:26 +01:00
svlandeg	4347e6d39b	fixes for CLI info command	2020-12-30 12:05:58 +01:00
svlandeg	62b4fe118f	prevent overwriting existing config file	2020-12-29 15:40:22 +01:00
Tim Gates	292c1d6a73	docs: fix simple typo, speficied -> specified (#6611 ) There is a small typo in spacy/cli/info.py. Should read `specified` rather than `speficied`.	2020-12-22 09:14:10 +01:00
Sofie Van Landeghem	282a3b49ea	Fix parser resizing when there is no upper layer (#6460 ) * allow resizing of the parser model even when upper=False * update from spacy.TransitionBasedParser.v1 to v2 * bugfix	2020-12-18 18:56:57 +08:00
Ines Montani	3f90bffa27	Merge pull request #6571 from adrianeboyd/bugfix/debug-data-missing-vectors Fix alignment and vector checks in debug data	2020-12-17 10:10:47 +11:00
Adriane Boyd	1ddf2f39c7	Switch converters to generator functions (#6547 ) * Switch converters to generator functions To reduce the memory usage when converting large corpora, refactor the convert methods to be generator functions. * Update tests	2020-12-15 16:47:16 +08:00
Adriane Boyd	20e18cc246	Fix alignment and vector checks in debug data * Update token alignment check to use Example alignment * Update missing vector check further related to changes in v3	2020-12-15 09:43:14 +01:00
Ines Montani	513c4e332a	Include custom code via spacy package command (#6531 )	2020-12-10 20:36:46 +08:00
Ines Montani	2a6043fabb	Merge pull request #6530 from explosion/feature/init-config-cpu-gpu	2020-12-10 09:38:46 +11:00
Ines Montani	9d32e839d3	Merge branch 'develop' into feature/init-config-cpu-gpu	2020-12-10 08:50:53 +11:00
Adriane Boyd	fa8fa474a3	Add nlp.batch_size setting Add a default `batch_size` setting for `Language.pipe` and `Language.evaluate` as `nlp.batch_size`.	2020-12-09 09:13:26 +01:00
Ines Montani	758ad6c3cd	Make CPU the default for init config	2020-12-09 11:00:51 +11:00
Ines Montani	5d605d539d	Remove output_file from init_config helper	2020-12-09 10:57:55 +11:00
svlandeg	8f8a7f1733	returning config in init_config	2020-12-08 17:37:20 +01:00
Ines Montani	6c7a930ee8	Fix variable	2020-12-08 20:44:59 +11:00
Ines Montani	94a5a9814f	Update argument handling and documentation	2020-12-08 20:41:18 +11:00


1177 Commits