spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-07-14 18:22:27 +03:00

Author	SHA1	Message	Date
Paul O'Leary McCann	b4e457d9fe	Accept multiple code files in all CLI commands (#12101) * Add support for multiple code files to all relevant commands Prior to this, only the package command supported multiple code files. * Update docs * Add debug data test, plus generic fixtures One tricky thing here: it's tempting to create the config by creating a pipeline in code, but that requires declaring the custom components here. However the CliRunner appears to be run in the same process or otherwise have access to our registry, so it works even without any code arguments. So it's necessary to avoid declaring the components in the tests. * Add debug config test and restructure The code argument imports the provided file. If it adds item to the registry, that affects global state, which CliRunner doesn't isolate. Since there's no standard way to remove things from the registry, this instead uses subprocess.run to run commands. * Use a more generic, parametrized test * Add output arg for assemble and pretrain Assemble and pretrain require an output argument. This commit adds assemble testing, but not pretrain, as that requires an actual trainable component, which is not currently in the test config. * Add evaluate test and some cleanup * Mark tests as slow * Revert argument name change * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Format API CLI docs * isort * Fix imports in tests * isort * Undo changes to package CLI help * Fix python executable and lang code in test * Fix executable in another test --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-08-01 15:24:02 +02:00
Daniël de Kok	2468742cb8	isort all the things	2023-06-26 11:41:03 +02:00
Paul O'Leary McCann	2fd8d616e7	Add docs section for spacy.cli.train.train (#9545) * Add section for spacy.cli.train.train * Add link from training page to train function * Ensure path in train helper * Update docs Co-authored-by: Ines Montani <ines@ines.io>	2021-10-29 10:36:34 +02:00
Ines Montani	5003a9c3c7	Move core training logic in CLI into standalone function (#9398)	2021-10-11 10:56:14 +02:00
Kabir Khan	1dfffe5fb4	No output info message in train (#8885) * Add info message that no output directory was provided in train * Update train.py * Fix logging	2021-08-05 09:21:22 +02:00
Santiago Castro	ee63b2b199	Fix typo in `train_cli` docstring	2021-06-25 22:45:03 -07:00
Ines Montani	d0c3775712	Replace links to nightly docs [ci skip]	2021-01-30 20:09:38 +11:00
Ines Montani	d25b1606d6	Allow reading config from sdtin in spacy train	2020-12-08 18:01:40 +11:00
Sofie Van Landeghem	75a202ce65	TextCat updates and fixes (#6263) * small fix in example imports * throw error when train_corpus or dev_corpus is not a string * small fix in custom logger example * limit macro_auc to labels with 2 annotations * fix typo * also create parents of output_dir if need be * update documentation of textcat scores * refactor TextCatEnsemble * fix tests for new AUC definition * bump to 3.0.0a42 * update docs * rename to spacy.TextCatEnsemble.v2 * spacy.TextCatEnsemble.v1 in legacy * cleanup * small fix * update to 3.0.0rc2 * fix import that got lost in merge * cursed IDE * fix two typos	2020-10-18 14:50:41 +02:00
Matthew Honnibal	db419f6b2f	Improve control of training progress and logging (#6184) * Make logging and progress easier to control * Update docs * Cleanup errors * Fix ConfigValidationError * Pass stdout/stderr, not wasabi.Printer * Fix type * Upd logging example * Fix logger example * Fix type	2020-10-03 14:57:46 +02:00
Ines Montani	44160cd52f	Tidy up [ci skip]	2020-10-01 10:41:19 +02:00
Ines Montani	a5debb356d	Tidy up and adjust logging [ci skip]	2020-09-30 01:22:08 +02:00
Ines Montani	56a2f778c4	Add logging [ci skip]	2020-09-30 01:08:55 +02:00
Ines Montani	0a1ee109db	Remove init form path	2020-09-29 22:53:18 +02:00
Ines Montani	0250bcf6a3	Show validation error during init	2020-09-29 22:29:09 +02:00
Matthew Honnibal	e70a00fa76	Remove unnecessary warning from train	2020-09-29 16:47:54 +02:00
Matthew Honnibal	3f0d61232d	Remove outdated arg from train	2020-09-29 16:47:44 +02:00
Ines Montani	a139fe672b	Fix typos and refactor CLI logging	2020-09-28 21:17:10 +02:00
Ines Montani	822ea4ef61	Refactor CLI	2020-09-28 15:09:59 +02:00
Ines Montani	c22ecc66bb	Don't support init path for now	2020-09-28 12:46:28 +02:00
Ines Montani	a5f2cc0509	Tidy up and remove raw text (rehearsal) for now	2020-09-28 12:30:13 +02:00
Ines Montani	e44a7519cd	Update CLI and add [initialize] block	2020-09-28 11:56:14 +02:00
Ines Montani	2fdb7285a0	Update CLI	2020-09-28 11:06:07 +02:00
Ines Montani	553bfea641	Fix commands	2020-09-28 10:53:17 +02:00
Matthew Honnibal	b886f53c31	init-pipeline runs (maybe doesnt work)	2020-09-28 03:42:47 +02:00
Matthew Honnibal	ed2aff2db3	Remove unused train code	2020-09-28 03:12:31 +02:00
Matthew Honnibal	3a0a3b8db6	Dont hard-code for 'corpora' name	2020-09-28 03:06:33 +02:00
Matthew Honnibal	a3e1791c9c	Upd train	2020-09-28 01:08:30 +02:00
Matthew Honnibal	b5556093e2	Start updating train script	2020-09-27 23:59:44 +02:00
Ines Montani	7e938ed63e	Update config resolution to use new Thinc	2020-09-27 22:21:31 +02:00
Matthew Honnibal	39b178999c	Tmp notes	2020-09-27 20:13:38 +02:00
Ines Montani	b4486d747d	Merge branch 'develop' into fix/train-config-interpolation	2020-09-26 15:32:14 +02:00
Ines Montani	b2d07de786	Construct nlp from uninterpolated config before training	2020-09-26 15:16:59 +02:00
Ines Montani	ca3c997062	Improve CLI config validation with latest Thinc	2020-09-26 13:13:57 +02:00
Sofie Van Landeghem	009ba14aaf	Fix pretraining in train script (#6143) * update pretraining API in train CLI * bump thinc to 8.0.0a35 * bump to 3.0.0a26 * doc fixes * small doc fix	2020-09-25 15:47:10 +02:00
Ines Montani	be56c0994b	Add [training.before_to_disk] callback	2020-09-24 12:40:25 +02:00
Ines Montani	f69fea8b25	Improve error handling around non-number scores	2020-09-24 11:29:07 +02:00
Ines Montani	ae51f580c1	Fix handling of score_weights	2020-09-24 10:27:33 +02:00
svlandeg	6435458d51	simplify expression	2020-09-23 12:12:38 +02:00
svlandeg	20b0ec5dcf	avoid logging performance of frozen components	2020-09-23 10:37:12 +02:00
Ines Montani	e863b3dc14	Merge pull request #6092 from adrianeboyd/bugfix/load-vocab-lookups-2	2020-09-19 12:33:38 +02:00
Sofie Van Landeghem	39872de1f6	Introducing the gpu_allocator (#6091) * rename 'use_pytorch_for_gpu_memory' to 'gpu_allocator' * --code instead of --code-path * update documentation * avoid querying the "system" section directly * add explanation of gpu_allocator to TF/PyTorch section in docs * fix typo * fix typo 2 * use set_gpu_allocator from thinc 8.0.0a34 * default null instead of empty string	2020-09-19 01:17:02 +02:00
Adriane Boyd	eed4b785f5	Load vocab lookups tables at beginning of training Similar to how vectors are handled, move the vocab lookups to be loaded at the start of training rather than when the vocab is initialized, since the vocab doesn't have access to the full config when it's created. The option moves from `nlp.load_vocab_data` to `training.lookups`. Typically these tables will come from `spacy-lookups-data`, but any `Lookups` object can be provided. The loading from `spacy-lookups-data` is now strict, so configs for each language should specify the exact tables required. This also makes it easier to control whether the larger clusters and probs tables are included. To load `lexeme_norm` from `spacy-lookups-data`: ``` [training.lookups] @misc = "spacy.LoadLookupsData.v1" lang = ${nlp.lang} tables = ["lexeme_norm"] ```	2020-09-18 15:59:16 +02:00
svlandeg	427dbecdd6	cleanup and formatting	2020-09-17 11:48:04 +02:00
svlandeg	0c35885751	generalize corpora, dot notation for dev and train corpus	2020-09-17 11:38:59 +02:00
svlandeg	51fa929f47	rewrite train_corpus to corpus.train in config	2020-09-15 21:58:04 +02:00
Sofie Van Landeghem	3216a33149	positive_label config for textcat (#6062) * hook up positive_label in textcat * unit tests * documentation * formatting * tests * fix typo * move verify_config to after begin_training * revert accidential commit	2020-09-14 17:08:00 +02:00
Sofie Van Landeghem	8e7557656f	Renaming gold & annotation_setter (#6042) * version bump to 3.0.0a16 * rename "gold" folder to "training" * rename 'annotation_setter' to 'set_extra_annotations' * formatting	2020-09-09 10:31:03 +02:00
Matthew Honnibal	ba5f4c9b32	Add words and seconds to train info	2020-09-08 15:24:47 +02:00
Ines Montani	ab1bb421ed	Update docs links in codebase	2020-09-04 12:58:50 +02:00

319 Commits