spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-10-02 18:06:46 +03:00

Author	SHA1	Message	Date
Daniël de Kok	afac7fb650	test_find_available_port: use port 5001 (#13255) macOS now uses port 5000 for the AirPlay receiver functionality, so this test will always fail on a macOS desktop (unless AirPlay receiver functionality is disabled like in CI).	2024-01-23 20:11:16 +01:00
Daniël de Kok	5a2ad4af4b	Merge remote-tracking branch 'upstream/master' into patch-1	2024-01-23 19:53:20 +01:00
Daniël de Kok	128197a5fc	Properly clean up pipe multiprocessing workers (#13259) Before this change, the workers of pipe call with n_process != 1 were stopped by calling `terminate` on the processes. However, terminating a process can leave queues, pipes, and other concurrent data structures in an invalid state. With this change, we stop using terminate and take the following approach instead: * When all documents are processed, the parent process puts a sentinel in the queue of each worker. * The parent process then calls `join` on each worker process to let them finish up gracefully. * Worker processes break from the queue processing loop when the sentinel is encountered, so that they exit. We need special handling when one of the workers encounters an error and the error handler is set to raise an exception. In this case, we cannot rely on the sentinel to finish all workers -- the queue is a FIFO queue and there may be other work queued up before the sentinel. We use the following approach to handle error scenarios: * The parent puts the end-of-work sentinel in the queue of each worker. * The parent closes the reading-end of the channel of each worker. * Then: - If the worker was waiting for work, it will encounter the sentinel and break from the processing loop. - If the worker was processing a batch, it will attempt to write results to the channel. This will fail because the channel was closed by the parent and the worker will break from the processing loop.	2024-01-23 18:33:04 +01:00
Raphael Mitsch	3b3b5cdc63	Merge pull request #13253 from explosion/chore/sync-master-with-llm_main Sync `master` with `docs/llm_main`	2024-01-19 16:50:43 +01:00
Raphael Mitsch	575c405ae3	Fix LLM docs on task factories.	2024-01-19 16:48:54 +01:00
Raphael Mitsch	256468c414	Merge branch 'docs/llm_main' into chore/sync-master-with-llm_main # Conflicts: # website/docs/api/large-language-models.mdx	2024-01-19 16:34:35 +01:00
Raphael Mitsch	91c24c0285	Merge pull request #13251 from explosion/docs/llm_develop Sync `docs/llm_main` with `docs/llm_develop`	2024-01-19 12:56:38 +01:00
Daniël de Kok	81beaea70e	Merge remote-tracking branch 'upstream/master' into maintenance/v4-merge-master-20240119	2024-01-19 12:34:29 +01:00
Daniël de Kok	2891e27421	Merge pull request #13191 from explosion/maintenance/revert-parser-refactor Revert the parser refactor	2024-01-18 17:06:41 +01:00
Daniël de Kok	9972333ef9	Temporarily xfail local remote storage test	2024-01-17 10:20:40 +01:00
maurice	c608baeecc	Fix typo in method name	2024-01-16 21:54:54 +01:00
Daniël de Kok	7351f6bbeb	Update thinc dependency to 9.0.0.dev4	2024-01-16 15:56:09 +01:00
Raphael Mitsch	0062c22c35	Updated docs w.r.t. infinite doc length changes (#13214) * Updated docs w.r.t. infinite doc length. * Fix typo. * Fix typos. * Fix table formatting. * Update formatting. --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2024-01-05 14:20:58 +01:00
Daniël de Kok	e2a3952de5	Add spacy.TextCatParametricAttention.v1 (#13201) * Add spacy.TextCatParametricAttention.v1 This layer is a simplification of the ensemble classifier that only uses parametric attention. We have found empirically that with a sufficient amount of training data, using the ensemble classifier with BoW does not provide significant improvement in classifier accuracy. However, plugging in a BoW classifier does reduce GPU training and inference performance substantially, since it uses a GPU-only kernel. * Fix merge fallout	2024-01-02 10:03:06 +01:00
Daniël de Kok	7718886fa3	TransitionBasedParser.v2 in run example output Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-12-21 11:14:35 +01:00
Daniël de Kok	7ebba86402	Add TextCatReduce.v1 (#13181) * Add TextCatReduce.v1 This is a textcat classifier that pools the vectors generated by a tok2vec implementation and then applies a classifier to the pooled representation. Three reductions are supported for pooling: first, max, and mean. When multiple reductions are enabled, the reductions are concatenated before providing them to the classification layer. This model is a generalization of the TextCatCNN model, which only supports mean reductions and is a bit of a misnomer, because it can also be used with transformers. This change also reimplements TextCatCNN.v2 using the new TextCatReduce.v1 layer. * Doc fixes Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fully specify `TextCatCNN` <-> `TextCatReduce` equivalence * Move TextCatCNN docs to legacy, in prep for moving to spacy-legacy * Add back a test for TextCatCNN.v2 * Replace TextCatCNN in pipe configurations and templates * Add an infobox to the `TextCatReduce` section with a `TextCatCNN` anchor * Add last reduction (`use_reduce_last`) * Remove non-working TextCatCNN Netlify redirect * Revert layer changes for the quickstart * Revert one more quickstart change * Remove unused import * Fix docstring * Fix setting name in error message --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-12-21 11:00:06 +01:00
Daniël de Kok	532225b955	Fix parser distillation test seed The test would sometimes fail. Rather than increasing test by increasing training iterations, use a known-good seed.	2023-12-21 10:06:28 +01:00
Daniël de Kok	7b689bde44	No need for `Literal` compat, since we only support >= 3.8	2023-12-21 09:47:38 +01:00
Daniël de Kok	57203fa0fc	Fix `TransitionBasedParser` version in transformer embeddings docs	2023-12-19 09:28:20 +01:00
Daniël de Kok	5e8bafa5bb	Bring back W401	2023-12-18 20:17:24 +01:00
Daniël de Kok	9b36729cbd	Fix Cython lints	2023-12-18 20:02:15 +01:00
Steven Crowther	764be103bc	update README to include links to GPU processing, LLM's, and the spaCy blog. (#13197) * Update README.md to include links for GPU processing, LLM, and spaCy's blog. * Create ojo4f3.md * corrected README to most current version with links to GPU processing, LLM's, and the spaCy blog. * Delete .github/contributors/ojo4f3.md * changed LLM icon Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Apply suggestions from code review --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-12-18 09:49:07 +01:00
Sofie Van Landeghem	56fc3bc0f3	Type documentation fixes for Doc (#13187) * correct char_span output type - can be None * unify type of exclude parameter * black * further fixes to from_dict and to_dict * formatting	2023-12-18 09:00:47 +01:00
Ines Montani	7df328fbfe	Update README.md [ci skip]	2023-12-12 10:19:57 +01:00
Raphael Mitsch	d56ee65ddf	Document `spacy-llm`'s `TranslationTask` (#13183) * Describe translation task. * Fix references to examples and template. * Format.	2023-12-11 17:41:04 +01:00
Raphael Mitsch	e79a9c5acd	Document `spacy-llm`'s `RawTask` (#13180) * Add section on RawTask. * Fix API docs. * Update website/docs/api/large-language-models.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-12-11 17:14:12 +01:00
Ines Montani	8cfccdd2f8	Update links [ci skip]	2023-12-11 15:51:43 +01:00
Ines Montani	f78b91c03b	Update links [ci skip]	2023-12-11 15:51:01 +01:00
Daniël de Kok	42fe4edfd7	Add distillation tests with max cut size And fix endless loop when the max cut size is 0 or 1.	2023-12-08 20:38:01 +01:00
Daniël de Kok	e2591cda36	isort	2023-12-08 20:24:09 +01:00
Daniël de Kok	e5ec45cb7e	Revert "Merge the parser refactor into `v4` (#10940)" This reverts commit `a183db3cef`.	2023-12-08 20:23:08 +01:00
Daniël de Kok	05803cfe76	Revert "Reimplement distillation with oracle cut size (#12214)" This reverts commit `e27c60a702`.	2023-12-08 14:38:05 +01:00
Raphael Mitsch	9fcd2bfa08	Add info on endpoint arg. (#13169)	2023-12-05 12:46:29 +01:00
Raphael Mitsch	a25a3b996b	Merge pull request #13173 from explosion/docs/llm_main Sync `llm_develop` with `llm_main`	2023-12-04 16:46:21 +01:00
Raphael Mitsch	55ed2b4e82	Add documentation for EL task (#12988) * Add documentation for EL task. * Fix EL factory name. * Add llm_entity_linker_mentio. * Apply suggestions from code review Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * Update EL task docs. * Update EL task docs. * Update EL task docs. * Update EL task docs. * Update EL task docs. * Update EL task docs. * Update EL task docs. * Update EL task docs. * Update EL task docs. * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Incorporate feedback. * Format. * Fix link to KB data. --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>	2023-12-04 15:23:28 +01:00
Adriane Boyd	e467573550	Docs: update trf_data examples and pipeline design info (#13164)	2023-12-04 15:15:54 +01:00
Raphael Mitsch	0e43fca036	Add Claude-2.1 mention. (#13167)	2023-12-01 16:48:35 +01:00
Daniël de Kok	da7ad97519	Update `TextCatBOW` to use the fixed `SparseLinear` layer (#13149) * Update `TextCatBOW` to use the fixed `SparseLinear` layer A while ago, we fixed the `SparseLinear` layer to use all available parameters: https://github.com/explosion/thinc/pull/754 This change updates `TextCatBOW` to `v3` which uses the new `SparseLinear_v2` layer. This results in a sizeable improvement on a text categorization task that was tested. While at it, this `spacy.TextCatBOW.v3` also adds the `length_exponent` option to make it possible to change the hidden size. Ideally, we'd just have an option called `length`. But the way that `TextCatBOW` uses hashes results in a non-uniform distribution of parameters when the length is not a power of two. * Replace TextCatBOW `length_exponent` parameter by `length` We now round up the length to the next power of two if it isn't a power of two. * Remove some tests for TextCatBOW.v2 * Fix missing import	2023-11-29 09:11:54 +01:00
Ines Montani	bf7c2ea99a	Add merch link [ci skip]	2023-11-22 12:55:00 +01:00
Ines Montani	8f69e56a5a	Add swag [ci skip]	2023-11-20 14:42:01 +01:00
Lise	b6e022381d	Feature/nn and fo language extensions (#13116) * add language extensions for norwegian nynorsk and faroese * update docstring for nn/examples.py * use relative imports * add fo and nn tokenizers to pytest fixtures * add unittests for fo and nn and fix bug in nn * remove module docstring from fo/__init__.py * add comments about example sentences' origin * add license information to faroese data credit * format unittests using black * add __init__ files to test/lang/nn and tests/lang/fo * fix import order and use relative imports in fo/__init__.py and nn/__init__.py * Make the tests a bit more compact * Add fo and nn to website languages * Add note about jul. * Add "jul." as exception --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-11-20 07:49:59 +01:00
ajbond	9f2ce6bb00	Add Redfield NLP Nodes to the spaCy Universe (#13133)	2023-11-17 09:48:02 +01:00
Madeesh Kannan	bd2c17e206	Warn about reloading dependencies after downloading models (#13081) * Update the "Missing factory" error message This accounts for model installations that took place during the current Python session. * Add a note about Jupyter notebooks * Move error to `spacy.cli.download` Add extra message for Jupyter sessions * Add additional note for interactive sessions * Remove note about `spacy-transformers` from error message * `isort` * Improve checks for colab (also helps displacy) * Update warning messages * Improve flow for multiple checks --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-11-10 08:05:07 +01:00
Raphael Mitsch	b2e831d966	LLM docs: OpenAI model update (#13119) * Update supported OpenAI models. * Update with new GPT-3.5 and GPT-4 versions. * Add links to OpenAI model docs.	2023-11-08 17:55:16 +01:00
Adriane Boyd	513bbd5fa3	Add preferred use of build for package CLI (#13109) Build with `build` if available. Warn and fall back to previous `setup.py`-based builds if `build` build fails.	2023-11-08 17:35:24 +01:00
Ridge Kimani	2b8da84717	feat: add extra lexical attributes (#13106) Co-authored-by: Ridge Kimani <ridgekimani@gmail.com>	2023-11-08 17:29:11 +01:00
Adriane Boyd	0c25725359	Update Tokenizer.explain for special cases with whitespace (#13086) * Update Tokenizer.explain for special cases with whitespace Update `Tokenizer.explain` to skip special case matches if the exact text has not been matched due to intervening whitespace. Enable fuzzy `Tokenizer.explain` tests with additional whitespace normalization. * Add unit test for special cases with whitespace, xfail fuzzy tests again	2023-11-06 17:29:59 +01:00
Adriane Boyd	ff9ddb6a07	Unskip python 3.12 remote tests (#13110)	2023-11-06 11:59:45 +01:00
Adriane Boyd	c096c5c0c9	Update for numpy 2.0 deprecations (#13103) - Replace `np.trapz` with vendored `trapezoid` from scipy - Replace `np.float_` with `np.float64`	2023-11-06 08:47:53 +01:00
Adriane Boyd	92f1d0a195	CI: Switch to stable python 3.12 and limit 3.11 runs (#13104)	2023-11-03 15:46:03 +01:00
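
Note on commit 128197a5fc: the sentinel-based shutdown it describes can be sketched roughly as follows. This is not spaCy's code. To keep the sketch small and portable it swaps in threads and `queue.Queue` for the worker processes and channels the commit talks about, covers only the normal (no-error) path, and uses hypothetical names throughout (`SENTINEL`, `worker`, `process_batches`, and upper-casing as a stand-in for real work).

```python
import queue
import threading

# Hypothetical end-of-work marker; any unique object works.
SENTINEL = object()


def worker(work_queue, results):
    """Process batches until the sentinel arrives, then exit the loop."""
    while True:
        item = work_queue.get()
        if item is SENTINEL:
            break  # graceful exit: nothing is terminated mid-operation
        index, batch = item
        # Stand-in for real work on the batch.
        results.put((index, [text.upper() for text in batch]))


def process_batches(batches, n_workers=2):
    """Distribute batches, then stop workers with a sentinel plus join."""
    results = queue.Queue()
    work_queues = [queue.Queue() for _ in range(n_workers)]
    workers = [
        threading.Thread(target=worker, args=(q, results)) for q in work_queues
    ]
    for w in workers:
        w.start()

    # Round-robin the batches over the per-worker queues.
    for index, batch in enumerate(batches):
        work_queues[index % n_workers].put((index, batch))

    # All work is queued: put one sentinel in the queue of each worker ...
    for q in work_queues:
        q.put(SENTINEL)
    # ... and join each worker so it can finish what is still queued.
    for w in workers:
        w.join()

    # Every worker has exited, so the results queue is complete.
    collected = []
    while not results.empty():
        collected.append(results.get())
    return [batch for _, batch in sorted(collected)]
```

Because each queue is FIFO, a worker only sees the sentinel after all of its earlier batches, which is why the commit needs the additional channel-closing step for the error case.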


16293 Commits