spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-28 02:46:35 +03:00

Author	SHA1	Message	Date
Daniël de Kok	81beaea70e	Merge remote-tracking branch 'upstream/master' into maintenance/v4-merge-master-20240119	2024-01-19 12:34:29 +01:00
Daniël de Kok	da7ad97519	Update `TextCatBOW` to use the fixed `SparseLinear` layer (#13149 ) * Update `TextCatBOW` to use the fixed `SparseLinear` layer A while ago, we fixed the `SparseLinear` layer to use all available parameters: https://github.com/explosion/thinc/pull/754 This change updates `TextCatBOW` to `v3` which uses the new `SparseLinear_v2` layer. This results in a sizeable improvement on a text categorization task that was tested. While at it, this `spacy.TextCatBOW.v3` also adds the `length_exponent` option to make it possible to change the hidden size. Ideally, we'd just have an option called `length`. But the way that `TextCatBOW` uses hashes results in a non-uniform distribution of parameters when the length is not a power of two. * Replace TexCatBOW `length_exponent` parameter by `length` We now round up the length to the next power of two if it isn't a power of two. * Remove some tests for TextCatBOW.v2 * Fix missing import	2023-11-29 09:11:54 +01:00
Adriane Boyd	ff9ddb6a07	Unskip python 3.12 remote tests (#13110 )	2023-11-06 11:59:45 +01:00
Adriane Boyd	406794a081	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.7-1	2023-09-28 15:09:06 +02:00
Adriane Boyd	36d4767aca	Skip project remotes test for python 3.12 (#12980 ) `weasel` (using `cloudpathlib`) does not currently support remote paths for python 3.12.	2023-09-13 13:16:05 +02:00
Paul O'Leary McCann	b4e457d9fe	Accept multiple code files in all CLI commands (#12101 ) * Add support for multiple code files to all relevant commands Prior to this, only the package command supported multiple code files. * Update docs * Add debug data test, plus generic fixtures One tricky thing here: it's tempting to create the config by creating a pipeline in code, but that requires declaring the custom components here. However the CliRunner appears to be run in the same process or otherwise have access to our registry, so it works even without any code arguments. So it's necessary to avoid declaring the components in the tests. * Add debug config test and restructure The code argument imports the provided file. If it adds item to the registry, that affects global state, which CliRunner doesn't isolate. Since there's no standard way to remove things from the registry, this instead uses subprocess.run to run commands. * Use a more generic, parametrized test * Add output arg for assemble and pretrain Assemble and pretrain require an output argument. This commit adds assemble testing, but not pretrain, as that requires an actual trainable component, which is not currently in the test config. * Add evaluate test and some cleanup * Mark tests as slow * Revert argument name change * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Format API CLI docs * isort * Fix imports in tests * isort * Undo changes to package CLI help * Fix python executable and lang code in test * Fix executable in another test --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-08-01 15:24:02 +02:00
Peter Baumgartner	a0a195688f	Tests for CLI app - `init config` generates `train`-able config (#12173 ) * remove migration support form * initial test commit * add fixture * add combo test * pull out parameter example data * fix formatting on examples * remove unused import * remove unncessary fmt:off instructions * only set logger level if verbose flag is explicitly set --------- Co-authored-by: svlandeg <svlandeg@github.com>	2023-07-31 14:45:04 +02:00
Victoria	49055ed7c8	Add cli for finding locations of registered func (#12757 ) * Add cli for finding locations of registered func * fixes: naming and typing * isort * update naming * remove to find-function * remove file:// bit * use registry name if given and exit gracefully if a registry was not found * clean up failure msg * specify registry_name options * mypy fixes * return location for internal usage * add documentation * more mypy fixes * clean up example * add section to menu * add tests --------- Co-authored-by: svlandeg <svlandeg@github.com>	2023-07-31 09:39:00 +02:00
Daniël de Kok	2468742cb8	isort all the things	2023-06-26 11:41:03 +02:00
Daniël de Kok	e2b70df012	Configure isort to use the Black profile, recursively isort the `spacy` module (#12721 ) * Use isort with Black profile * isort all the things * Fix import cycles as a result of import sorting * Add DOCBIN_ALL_ATTRS type definition * Add isort to requirements * Remove isort from build dependencies check * Typo	2023-06-14 17:48:41 +02:00
Sofie Van Landeghem	d65e3c31a6	use system-independent commands (#12693 )	2023-06-08 11:43:36 +02:00
Adriane Boyd	f27bce67fd	Skip project clone tests if git is not available (#12394 )	2023-03-09 16:41:21 +01:00
Paul O'Leary McCann	1e8bac99f3	Add tests for projects to master (#12303 ) * Add tests for projects to master * Fix git clone related issues on Windows * Add stat import	2023-02-23 10:22:57 +01:00
Adriane Boyd	606273f7e4	Normalize whitespace in evaluate CLI output test (#12157 ) * Normalize whitespace in evaluate CLI output test Depending on terminal settings, lines may be padded to the screen width so the comparison is too strict with only the command string replacement. * Move to test util method * Change to normalization method	2023-01-27 16:13:34 +01:00
Peter Baumgartner	c68e6b8a96	`trainable_lemmatizer` in `debug data` (#11419 ) * WIP * rm ipython embeds * rm total * WIP * cleanup * cleanup + reword * rm component function * remove migration support form * fix reference dataset for dev data * additional fixes - set approach to identifying unique trees - adjust line length on messages - add logic for detecting docs without annotations * use 0 instead of none for no annotation * partial annotation support * initial tests for _compile_gold lemma attributes Using the example data from the edit tree lemmatizer tests for: - lemmatizer_trees - partial_lemma_annotations - n_low_cardinality_lemmas - no_lemma_annotations * adds output test for cli app * switch msg level * rm unclear uniqueness check * Revert "rm unclear uniqueness check" This reverts commit `6ea2b3524b`. * remove good message on uniqueness * formatting * use en_vocab fixture * clarify data set source in messages * remove unnecessary import Co-authored-by: svlandeg <svlandeg@github.com>	2023-01-26 17:36:50 +01:00
Daniël de Kok	319eb508b5	Add a `spacy benchmark speed` subcommand (#11902 ) * Add a `spacy evaluate speed` subcommand This subcommand reports the mean batch performance of a model on a data set with a 95% confidence interval. For reliability, it first performs some warmup rounds. Then it will measure performance on batches with randomly shuffled documents. To avoid having too many spaCy commands, `speed` is a subcommand of `evaluate` and accuracy evaluation is moved to its own `evaluate accuracy` subcommand. * Fix import cycle * Restore `spacy evaluate`, make `spacy benchmark speed` an alias * Add documentation for `spacy benchmark` * CREATES -> PRINTS * WPS -> words/s * Disable formatting of benchmark speed arguments * Fail with an error message when trying to speed bench empty corpus * Make it clearer that `benchmark accuracy` is a replacement for `evaluate` * Fix docstring webpage reference * tests: check `evaluate` output against `benchmark accuracy`	2023-01-12 11:55:21 +01:00
Sofie Van Landeghem	7f6c638c3a	fix processing of "auto" in convert (#12050 ) * fix processing of "auto" in walk_directory * add check for None * move AUTO check to convert and fix verification of args * add specific CLI test with CliRunner * cleanup * more cleanup * update docstring	2023-01-05 10:21:00 +01:00

17 Commits