spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-15 06:32:01 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	acb44f8e73	Fix meta writing for numpy conversion	2024-10-02 01:10:04 +02:00
Matthew Honnibal	75d097155d	Replace numpy floats in update and evaluate	2024-10-02 01:06:23 +02:00
Matthew Honnibal	5d0d2de955	Support 'memory zones' for user memory management Add a context manager nlp.memory_zone(), which will begin memory_zone() blocks on the vocab, string store, and potentially other components. Once the memory_zone() block expires, spaCy will free any shared resources that were allocated for the text-processing that occurred within the memory_zone. If you create Doc objects within a memory zone, it's invalid to access them once the memory zone is expired. The purpose of this is that spaCy creates and stores Lexeme objects in the Vocab that can be shared between multiple Doc objects. It also interns strings. Normally, spaCy can't know when all Doc objects using a Lexeme are out-of-scope, so new Lexemes accumulate in the vocab, causing memory pressure. Memory zones solve this problem by telling spaCy "okay none of the documents allocated within this block will be accessed again". This lets spaCy free all new Lexeme objects and other data that were created during the block. The mechanism is general, so memory_zone() context managers can be added to other components that could benefit from them, e.g. pipeline components. I experimented with adding memory zone support to the tokenizer as well, for its cache. However, this seems unnecessarily complicated. It makes more sense to just stick a limit on the cache size. This lets spaCy benefit from the efficiency advantage of the cache better, because we can maintain a (bounded) cache even if only small batches of documents are being processed.	2024-09-08 13:06:54 +02:00
Matthew Honnibal	a559cde432	Update about	2024-09-07 00:47:09 +02:00
Matthew Honnibal	b4e60e3151	Fix dump meta	2024-09-07 00:46:48 +02:00
Matthew Honnibal	ae6910b09b	Bump version	2024-09-06 22:23:41 +02:00
Matthew Honnibal	3bc5846e83	Fix serialization for uk trf model	2024-09-06 22:23:25 +02:00
Matthew Honnibal	2a37f97365	Increment version	2024-09-04 14:31:07 +02:00
Matthew Honnibal	3ee1b2bd1f	Fix Spanish lemmatizer	2024-09-04 14:29:34 +02:00
Matthew Honnibal	6f7590bbf1	Revert "Fix apparent bug in Spanish lemmatizer. Not sure why this emerges in v4 not in v3" This reverts commit `64b22be76e`.	2024-09-04 14:26:39 +02:00
Matthew Honnibal	64b22be76e	Fix apparent bug in Spanish lemmatizer. Not sure why this emerges in v4 not in v3	2024-09-04 14:22:13 +02:00
Matthew Honnibal	4eec3bfad1	Bump version	2024-09-02 13:16:15 +02:00
Matthew Honnibal	b9ecb15439	Bump version	2024-09-02 12:36:28 +02:00
Matthew Honnibal	a5ba7e4716	Bump dev version	2024-09-02 10:10:43 +02:00
Matthew Honnibal	304a8539e9	Bump dev version	2024-09-02 01:45:38 +02:00
Matthew Honnibal	f4c8fdfaad	Update cli.package for removed spacy.vectors.name attr	2024-09-01 16:43:49 +02:00
svlandeg	e32a394ff0	fix the fix for textcat init functionality	2024-05-14 18:45:51 +02:00
svlandeg	5992e927b9	fix textcat init functionality	2024-05-14 18:38:11 +02:00
svlandeg	c27679f210	Merge branch 'master' into feat/update_v4	2024-05-14 17:42:48 +02:00
Alex Strick van Linschoten	045cd43c3f	Fix typos in docs (#13466 ) * fix typos * prettier formatting --------- Co-authored-by: svlandeg <svlandeg@github.com>	2024-04-29 11:10:17 +02:00
Sofie Van Landeghem	287deee02c	remove empty file (#13458 )	2024-04-26 10:04:16 +02:00
Daniël de Kok	f5918d4353	Update to Thinc 9.0.0 and set version to 4.0.0.dev3 (#13448 ) * Update to Thinc 9.0.0 and set version to 4.0.0.dev3 * Set minimum Python version to 3.9	2024-04-22 09:40:55 +02:00
Daniël de Kok	5bd141013b	Remove `apple` from extras (#13439 ) Account for merging of `thinc-apple-ops` into `thinc`.	2024-04-17 13:43:27 +02:00
Sofie Van Landeghem	2e2334632b	Fix use_gold_ents behaviour for EntityLinker (#13400 ) * fix type annotation in docs * only restore entities after loss calculation * restore entities of sample in initialization * rename overfitting function * fix EL scorer * Relax test * fix formatting * Update spacy/pipeline/entity_linker.py Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> * rename to _ensure_ents * further rename * allow for scorer to be None --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2024-04-16 12:00:22 +02:00
Joe Schiff	2e96797696	Convert properties to decorator syntax (#13390 )	2024-04-16 11:51:14 +02:00
Daniël de Kok	fbc14aea45	Add distill subcommand (#13431 ) * Add distill subcommand This subcommand distills a student model from a teacher model. * Fixes from Sofie Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Type and doc fixes * Wording * distill: document missing `-o` * Wording * Small fix --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2024-04-11 19:33:46 +02:00
Raphael Mitsch	304b9331e6	Modify EL batching to doc-wise streaming approach (#12367 ) * Convert Candidate from Cython to Python class. * Format. * Fix .entity_ typo in _add_activations() usage. * Change type for mentions to look up entity candidates for to SpanGroup from Iterable[Span]. * Update docs. * Update spacy/kb/candidate.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update doc string of BaseCandidate.__init__(). * Update spacy/kb/candidate.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate. * Adjust Candidate to support and mandate numerical entity IDs. * Format. * Fix docstring and docs. * Update website/docs/api/kb.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Rename alias -> mention. * Refactor Candidate attribute names. Update docs and tests accordingly. * Refacor Candidate attributes and their usage. * Format. * Fix mypy error. * Update error code in line with v4 convention. * Modify EL batching system. * Update leftover get_candidates() mention in docs. * Format docs. * Format. * Update spacy/kb/candidate.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Updated error code. * Simplify interface for int/str representations. * Update website/docs/api/kb.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Rename 'alias' to 'mention'. * Port Candidate and InMemoryCandidate to Cython. * Remove redundant entry in setup.py. * Add abstract class check. * Drop storing mention. * Update spacy/kb/candidate.pxd Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix entity_id refactoring problems in docstrings. * Drop unused InMemoryCandidate._entity_hash. * Update docstrings. * Move attributes out of Candidate. * Partially fix alias/mention terminology usage. Convert Candidate to interface. 
* Remove prior_prob from supported properties in Candidate. Introduce KnowledgeBase.supports_prior_probs(). * Update docstrings related to prior_prob. * Update alias/mention usage in doc(strings). * Update spacy/ml/models/entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/ml/models/entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Mention -> alias renaming. Drop Candidate.mentions(). Drop InMemoryLookupKB.get_alias_candidates() from docs. * Update docstrings. * Fix InMemoryCandidate attribute names. * Update spacy/kb/kb.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/ml/models/entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update W401 test. * Update spacy/errors.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/kb/kb.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Use Candidate output type for toy generators in the test suite to mimick best practices * fix docs * fix import * Fix merge leftovers. * Update spacy/kb/kb.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/kb/kb.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/kb.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/entitylinker.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/kb/kb_in_memory.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/inmemorylookupkb.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update get_candidates() docstring. * Reformat imports in entity_linker.py. * Drop valid_ent_idx_per_doc. * Update docs. * Format. * Simplify doc loop in predict(). * Remove E1044 comment. 
* Fix merge errors. * Format. * Format. * Format. * Fix merge error & tests. * Format. * Apply suggestions from code review Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * Use type alias. * isort. * isort. * Lint. * Add typedefs.pyx. * Fix typedef import. * Fix type aliases. * Format. * Update docstring and type usage. * Add info on get_candidates(), get_candidates_batched(). * Readd get_candidates info to v3 changelog. * Update website/docs/api/entitylinker.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update factory functions for backwards compatibility. * Format. * Ignore mypy error. * Fix mypy error. * Format. * Add test for multiple docs with multiple entities. --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> Co-authored-by: svlandeg <svlandeg@github.com>	2024-04-09 11:39:18 +02:00
Matthew Honnibal	0518c36f04	Sanitize direct download (#13313 ) The 'direct' option in 'spacy download' is supposed to only download from our model releases repository. However, users were able to pass in a relative path, allowing download from arbitrary repositories. This meant that a service that sourced strings from user input and which used the direct option would allow users to install arbitrary packages.	2024-02-20 13:17:51 +01:00
Daniël de Kok	bff8725f4b	Set version to 3.7.4 (#13327 )	2024-02-14 14:46:28 +01:00
Daniël de Kok	fdfdbcd9f4	Make `Language.pipe` workers exit cleanly (#13321 ) Also warn when any worker exited with a non-zero exit code and modify test to ensure that workers exit cleanly by default.	2024-02-12 14:39:38 +01:00
Adriane Boyd	afb22ad491	Remove debug data normalization for span analysis (#13203 ) * Remove debug data normalization for span analysis As a result of this normalization, `debug data` could show a user tokens that do not exist in their data. * Update spacy/cli/debug_data.py --------- Co-authored-by: svlandeg <svlandeg@github.com>	2024-02-06 14:14:55 +01:00
Daniël de Kok	e1249d3722	Test if closing explicitly solves recursive lock issues (#13304 )	2024-02-05 10:07:03 +01:00
Daniël de Kok	1052cba9f3	Merge pull request #13299 from danieldk/copy/master Sync main with latest changes from master (v3)	2024-02-04 15:40:55 +01:00
Daniël de Kok	40422ff904	Set version to 3.7.3 (#13301 )	2024-02-02 13:51:26 +01:00
Daniël de Kok	2dbb332cea	`TextCatParametricAttention.v1`: set key transform dimensions (#13249 ) * TextCatParametricAttention.v1: set key transform dimensions This is necessary for tok2vec implementations that initialize lazily (e.g. curated transformers). * Add lazily-initialized tok2vec to simulate transformers Add a lazily-initialized tok2vec to the tests and test the current textcat models with it. Fix some additional issues found using this test. * isort * Add `test.` prefix to `LazyInitTok2Vec.v1`	2024-02-02 13:01:59 +01:00
Daniël de Kok	2d4067d021	Test if closing explicitly solves recursive lock issues	2024-02-02 11:39:07 +01:00
Daniël de Kok	68d7841df5	Extension serialization attr tests: add teardown (#13284 ) The doc/token extension serialization tests add extensions that are not serializable with pickle. This didn't cause issues before due to the implicit run order of tests. However, test ordering has changed with pytest 8.0.0, leading to failed tests in test_language. Update the fixtures in the extension serialization tests to do proper teardown and remove the extensions.	2024-01-29 13:51:56 +01:00
Eliana Vornov	00e938a7c3	add custom code support to CLI speed benchmark (#13247 ) * add custom code support to CLI speed benchmark * sort imports * better copying for warmup docs	2024-01-26 13:29:22 +01:00
Daniël de Kok	ce9ea9629f	Set version to v4.0.0.dev2 (#13269 )	2024-01-25 12:54:23 +01:00
Daniël de Kok	9e97c730be	Fix up requirements test To account for build dependencies being removed from `setup.cfg`.	2024-01-24 17:18:49 +01:00
Daniël de Kok	e722284ff4	Construct TextCatEnsemble.v2 using helper function	2024-01-24 14:59:01 +01:00
Daniël de Kok	ce4ea5ffa7	Py_UNICODE is not compatible with 3.12	2024-01-24 13:08:56 +01:00
Daniël de Kok	c621e251b8	Typing fixes	2024-01-24 12:20:01 +01:00
Daniël de Kok	82ef6783a8	Merge remote-tracking branch 'upstream/master' into maintenance/v4-merge-master-20240119	2024-01-24 09:09:01 +01:00
Daniël de Kok	a8894a8946	Merge pull request #13240 from mauricesvp/patch-1 Fix typo in method name	2024-01-23 20:49:21 +01:00
Daniël de Kok	afac7fb650	test_find_available_port: use port 5001 (#13255 ) macOS now uses port 5000 for the AirPlay receiver functionality, so this test will always fail on a macOS desktop (unless AirPlay receiver functionality is disabled like in CI).	2024-01-23 20:11:16 +01:00
Daniël de Kok	5a2ad4af4b	Merge remote-tracking branch 'upstream/master' into patch-1	2024-01-23 19:53:20 +01:00
Daniël de Kok	128197a5fc	Properly clean up pipe multiprocessing workers (#13259 ) Before this change, the workers of pipe call with n_process != 1 were stopped by calling `terminate` on the processes. However, terminating a process can leave queues, pipes, and other concurrent data structures in an invalid state. With this change, we stop using terminate and take the following approach instead: * When all the documents are processed, the parent process puts a sentinel in the queue of each worker. * The parent process then calls `join` on each worker process to let them finish up gracefully. * Worker processes break from the queue processing loop when the sentinel is encountered, so that they exit. We need special handling when one of the workers encounters an error and the error handler is set to raise an exception. In this case, we cannot rely on the sentinel to finish all workers -- the queue is a FIFO queue and there may be other work queued up before the sentinel. We use the following approach to handle error scenarios: * The parent puts the end-of-work sentinel in the queue of each worker. * The parent closes the reading-end of the channel of each worker. * Then: - If the worker was waiting for work, it will encounter the sentinel and break from the processing loop. - If the worker was processing a batch, it will attempt to write results to the channel. This will fail because the channel was closed by the parent and the worker will break from the processing loop.	2024-01-23 18:33:04 +01:00
Daniël de Kok	81beaea70e	Merge remote-tracking branch 'upstream/master' into maintenance/v4-merge-master-20240119	2024-01-19 12:34:29 +01:00
Daniël de Kok	9972333ef9	Temporarily xfail local remote storage test	2024-01-17 10:20:40 +01:00
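
Commit 128197a5fc in the log above describes stopping workers with an end-of-work sentinel and `join` rather than `terminate`. The following is a minimal sketch of that non-error path only, and it makes two loud assumptions: it uses threads and `queue.Queue` in place of spaCy's worker processes and channels, and every name in it (`_SENTINEL`, `worker`, `run`) is hypothetical rather than taken from spaCy's code.

```python
import queue
import threading

# Hypothetical end-of-work marker; spaCy's actual sentinel is not shown in the log.
_SENTINEL = None


def worker(inbox, outbox):
    """Process items until the sentinel arrives, then leave the loop and exit."""
    while True:
        item = inbox.get()
        if item is _SENTINEL:
            break
        outbox.put(item.upper())  # stand-in for real text processing


def run(texts, n_workers=2):
    """Distribute texts, then shut the workers down with sentinels and join."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    outbox = queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(inbox, outbox)) for inbox in inboxes
    ]
    for w in workers:
        w.start()

    # Round-robin the work over the per-worker queues.
    for i, text in enumerate(texts):
        inboxes[i % n_workers].put(text)

    # One sentinel per worker queue. The queues are FIFO, so each worker
    # finishes its pending items before it sees the sentinel.
    for inbox in inboxes:
        inbox.put(_SENTINEL)

    # Wait for every worker to finish on its own instead of killing it.
    for w in workers:
        w.join()

    results = []
    while not outbox.empty():
        results.append(outbox.get())
    return sorted(results), [w.is_alive() for w in workers]
```

Calling `run(["a", "b", "c"])` returns the processed items together with a liveness flag per worker. The error path from the commit message (closing the reading end of each worker's channel) is deliberately left out, because `queue.Queue` has no equivalent of a closed pipe.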

9530 Commits
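
Commit 5d0d2de955 in the log above introduces memory zones: data interned while a `memory_zone()` block is active is freed when the block ends, while data that existed beforehand survives. The toy below illustrates only that idea for a string store. It is not spaCy's implementation, and `ToyStringStore` and all of its methods are hypothetical names invented for this sketch.

```python
from contextlib import contextmanager
from itertools import count


class ToyStringStore:
    """Interns strings; entries first added inside a memory zone are freed on exit."""

    def __init__(self):
        self._ids = {}          # text -> integer ID
        self._next_id = count()
        self._transient = None  # texts added in the active zone, if any

    def add(self, text):
        """Return the ID for text, interning it if it is new."""
        if text not in self._ids:
            self._ids[text] = next(self._next_id)
            if self._transient is not None:
                self._transient.add(text)
        return self._ids[text]

    def __contains__(self, text):
        return text in self._ids

    def __len__(self):
        return len(self._ids)

    @contextmanager
    def memory_zone(self):
        """Track new entries for the duration of the block, then drop them."""
        if self._transient is not None:
            raise RuntimeError("memory zones cannot be nested in this sketch")
        self._transient = set()
        try:
            yield self
        finally:
            for text in self._transient:
                del self._ids[text]
            self._transient = None


store = ToyStringStore()
store.add("apple")                # interned outside any zone: permanent
with store.memory_zone():
    store.add("apple")            # already known, so not marked transient
    store.add("banana")           # new inside the zone: transient
    size_inside = len(store)
size_after = len(store)
```

After the block, `"banana"` is gone and `"apple"` remains, which mirrors the commit's warning that objects created inside a zone must not be accessed once the zone has expired.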