Commit Graph

9530 Commits

Author SHA1 Message Date
Matthew Honnibal
acb44f8e73 Fix meta writing for numpy conversion 2024-10-02 01:10:04 +02:00
Matthew Honnibal
75d097155d Replace numpy floats in update and evaluate 2024-10-02 01:06:23 +02:00
Matthew Honnibal
5d0d2de955 Support 'memory zones' for user memory management
Add a context manager nlp.memory_zone(), which will begin
memory_zone() blocks on the vocab, string store, and potentially
other components.

Once the memory_zone() block expires, spaCy will free any shared
resources that were allocated for the text-processing that occurred
within the memory_zone. If you create Doc objects within a memory
zone, it's invalid to access them once the memory zone is expired.

The purpose of this is that spaCy creates and stores Lexeme objects
in the Vocab that can be shared between multiple Doc objects. It also
interns strings. Normally, spaCy can't know when all Doc objects using
a Lexeme are out-of-scope, so new Lexemes accumulate in the vocab,
causing memory pressure.

Memory zones solve this problem by telling spaCy "okay, none of the
documents allocated within this block will be accessed again". This
lets spaCy free all new Lexeme objects and other data that were
created during the block.

The mechanism is general, so memory_zone() context managers can be
added to other components that could benefit from them, e.g. pipeline
components.

I experimented with adding memory zone support to the tokenizer as well,
for its cache. However, this seems unnecessarily complicated. It makes
more sense to just stick a limit on the cache size. This lets spaCy
benefit from the efficiency advantage of the cache better, because
we can maintain a (bounded) cache even if only small batches of
documents are being processed.
2024-09-08 13:06:54 +02:00
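The memory zone idea described in 5d0d2de955 can be illustrated with a toy, dependency-free sketch. This is not spaCy's implementation: `ToyStringStore` and its attributes are hypothetical names invented for illustration, and the sketch only mimics the behaviour the commit message describes, where entries interned inside the block are freed when the block exits.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, Optional, Set


class ToyStringStore:
    """Toy interning table (hypothetical; NOT spaCy's StringStore)."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._next_id = 0
        # Holds strings added inside a zone; None when no zone is active.
        self._transient: Optional[Set[str]] = None

    def add(self, text: str) -> int:
        """Intern `text` and return its id."""
        if text not in self._ids:
            self._ids[text] = self._next_id
            self._next_id += 1
            if self._transient is not None:
                # Remember that this entry only lives as long as the zone.
                self._transient.add(text)
        return self._ids[text]

    def __len__(self) -> int:
        return len(self._ids)

    def __contains__(self, text: str) -> bool:
        return text in self._ids

    @contextmanager
    def memory_zone(self) -> Iterator["ToyStringStore"]:
        """Free everything interned inside the `with` block on exit."""
        if self._transient is not None:
            raise RuntimeError("nested zones are not supported in this sketch")
        self._transient = set()
        try:
            yield self
        finally:
            for text in self._transient:
                del self._ids[text]
            self._transient = None


store = ToyStringStore()
store.add("the")  # added outside any zone, so it persists

with store.memory_zone():
    store.add("the")      # already known, not tracked as transient
    store.add("zyzzyva")  # new inside the zone, freed on exit
    inside = len(store)

after = len(store)
```

Anything that still refers to `"zyzzyva"` after the block would be pointing at freed data, which mirrors the commit's warning that Doc objects created inside a memory zone must not be accessed once the zone has expired.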
Matthew Honnibal
a559cde432 Update about 2024-09-07 00:47:09 +02:00
Matthew Honnibal
b4e60e3151 Fix dump meta 2024-09-07 00:46:48 +02:00
Matthew Honnibal
ae6910b09b Bump version 2024-09-06 22:23:41 +02:00
Matthew Honnibal
3bc5846e83 Fix serialization for uk trf model 2024-09-06 22:23:25 +02:00
Matthew Honnibal
2a37f97365 Increment version 2024-09-04 14:31:07 +02:00
Matthew Honnibal
3ee1b2bd1f Fix Spanish lemmatizer 2024-09-04 14:29:34 +02:00
Matthew Honnibal
6f7590bbf1 Revert "Fix apparent bug in Spanish lemmatizer. Not sure why this emerges in v4 not in v3"
This reverts commit 64b22be76e.
2024-09-04 14:26:39 +02:00
Matthew Honnibal
64b22be76e Fix apparent bug in Spanish lemmatizer. Not sure why this emerges in v4 not in v3 2024-09-04 14:22:13 +02:00
Matthew Honnibal
4eec3bfad1 Bump version 2024-09-02 13:16:15 +02:00
Matthew Honnibal
b9ecb15439 Bump version 2024-09-02 12:36:28 +02:00
Matthew Honnibal
a5ba7e4716 Bump dev version 2024-09-02 10:10:43 +02:00
Matthew Honnibal
304a8539e9 Bump dev version 2024-09-02 01:45:38 +02:00
Matthew Honnibal
f4c8fdfaad Update cli.package for removed spacy.vectors.name attr 2024-09-01 16:43:49 +02:00
svlandeg
e32a394ff0 fix the fix for textcat init functionality 2024-05-14 18:45:51 +02:00
svlandeg
5992e927b9 fix textcat init functionality 2024-05-14 18:38:11 +02:00
svlandeg
c27679f210 Merge branch 'master' into feat/update_v4 2024-05-14 17:42:48 +02:00
Alex Strick van Linschoten
045cd43c3f
Fix typos in docs (#13466)
* fix typos

* prettier formatting

---------

Co-authored-by: svlandeg <svlandeg@github.com>
2024-04-29 11:10:17 +02:00
Sofie Van Landeghem
287deee02c
remove empty file (#13458) 2024-04-26 10:04:16 +02:00
Daniël de Kok
f5918d4353
Update to Thinc 9.0.0 and set version to 4.0.0.dev3 (#13448)
* Update to Thinc 9.0.0 and set version to 4.0.0.dev3

* Set minimum Python version to 3.9
2024-04-22 09:40:55 +02:00
Daniël de Kok
5bd141013b
Remove apple from extras (#13439)
Account for merging of `thinc-apple-ops` into `thinc`.
2024-04-17 13:43:27 +02:00
Sofie Van Landeghem
2e2334632b
Fix use_gold_ents behaviour for EntityLinker (#13400)
* fix type annotation in docs

* only restore entities after loss calculation

* restore entities of sample in initialization

* rename overfitting function

* fix EL scorer

* Relax test

* fix formatting

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* rename to _ensure_ents

* further rename

* allow for scorer to be None

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2024-04-16 12:00:22 +02:00
Joe Schiff
2e96797696
Convert properties to decorator syntax (#13390) 2024-04-16 11:51:14 +02:00
Daniël de Kok
fbc14aea45
Add distill subcommand (#13431)
* Add distill subcommand

This subcommand distills a student model from a teacher model.

* Fixes from Sofie

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Type and doc fixes

* Wording

* distill: document missing `-o`

* Wording

* Small fix

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2024-04-11 19:33:46 +02:00
Raphael Mitsch
304b9331e6
Modify EL batching to doc-wise streaming approach (#12367)
* Convert Candidate from Cython to Python class.

* Format.

* Fix .entity_ typo in _add_activations() usage.

* Change the type of the mentions for which entity candidates are looked up from Iterable[Span] to SpanGroup.

* Update docs.

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update doc string of BaseCandidate.__init__().

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate.

* Adjust Candidate to support and mandate numerical entity IDs.

* Format.

* Fix docstring and docs.

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename alias -> mention.

* Refactor Candidate attribute names. Update docs and tests accordingly.

* Refactor Candidate attributes and their usage.

* Format.

* Fix mypy error.

* Update error code in line with v4 convention.

* Modify EL batching system.

* Update leftover get_candidates() mention in docs.

* Format docs.

* Format.

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Updated error code.

* Simplify interface for int/str representations.

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename 'alias' to 'mention'.

* Port Candidate and InMemoryCandidate to Cython.

* Remove redundant entry in setup.py.

* Add abstract class check.

* Drop storing mention.

* Update spacy/kb/candidate.pxd

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix entity_id refactoring problems in docstrings.

* Drop unused InMemoryCandidate._entity_hash.

* Update docstrings.

* Move attributes out of Candidate.

* Partially fix alias/mention terminology usage. Convert Candidate to interface.

* Remove prior_prob from supported properties in Candidate. Introduce KnowledgeBase.supports_prior_probs().

* Update docstrings related to prior_prob.

* Update alias/mention usage in doc(strings).

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Mention -> alias renaming. Drop Candidate.mentions(). Drop InMemoryLookupKB.get_alias_candidates() from docs.

* Update docstrings.

* Fix InMemoryCandidate attribute names.

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update W401 test.

* Update spacy/errors.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Use Candidate output type for toy generators in the test suite to mimic best practices

* fix docs

* fix import

* Fix merge leftovers.

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/entitylinker.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/kb/kb_in_memory.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/inmemorylookupkb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update get_candidates() docstring.

* Reformat imports in entity_linker.py.

* Drop valid_ent_idx_per_doc.

* Update docs.

* Format.

* Simplify doc loop in predict().

* Remove E1044 comment.

* Fix merge errors.

* Format.

* Format.

* Format.

* Fix merge error & tests.

* Format.

* Apply suggestions from code review

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Use type alias.

* isort.

* isort.

* Lint.

* Add typedefs.pyx.

* Fix typedef import.

* Fix type aliases.

* Format.

* Update docstring and type usage.

* Add info on get_candidates(), get_candidates_batched().

* Readd get_candidates info to v3 changelog.

* Update website/docs/api/entitylinker.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update factory functions for backwards compatibility.

* Format.

* Ignore mypy error.

* Fix mypy error.

* Format.

* Add test for multiple docs with multiple entities.

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2024-04-09 11:39:18 +02:00
Matthew Honnibal
0518c36f04
Sanitize direct download (#13313)
The 'direct' option in 'spacy download' is supposed to only download from our model releases repository. However, users were able to pass in a relative path, allowing download from arbitrary repositories. This meant that a service that sourced strings from user input and which used the direct option would allow users to install arbitrary packages.
2024-02-20 13:17:51 +01:00
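The class of problem described in 0518c36f04 (a user-supplied name that smuggles in a relative path) can be guarded against with a check along these lines. This is only a sketch under stated assumptions: `is_safe_direct_name` is a hypothetical helper, and the allowed character set is an assumption made for the example, not the rule spaCy actually applies.

```python
import re

# Assumption for this sketch: a plain name consists only of letters, digits,
# underscores, hyphens and dots. Slashes, backslashes, whitespace and URL
# punctuation are rejected outright.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def is_safe_direct_name(name: str) -> bool:
    """Return True if `name` is a single path component with no traversal."""
    if _SAFE_NAME.fullmatch(name) is None:
        return False
    # Reject "..", which could climb out of the intended base location.
    if ".." in name:
        return False
    return True


ok = is_safe_direct_name("en_core_web_sm-3.7.0")
traversal = is_safe_direct_name("../../someone/else/pkg-1.0")
empty = is_safe_direct_name("")
```

Validating against an allow-list before building the download URL keeps the check independent of how the URL is later joined or normalized.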
Daniël de Kok
bff8725f4b
Set version to 3.7.4 (#13327) 2024-02-14 14:46:28 +01:00
Daniël de Kok
fdfdbcd9f4
Make Language.pipe workers exit cleanly (#13321)
Also warn when any worker exited with a non-zero exit code and modify
test to ensure that workers exit cleanly by default.
2024-02-12 14:39:38 +01:00
Adriane Boyd
afb22ad491
Remove debug data normalization for span analysis (#13203)
* Remove debug data normalization for span analysis

As a result of this normalization, `debug data` could show a user tokens
that do not exist in their data.

* Update spacy/cli/debug_data.py

---------

Co-authored-by: svlandeg <svlandeg@github.com>
2024-02-06 14:14:55 +01:00
Daniël de Kok
e1249d3722
Test if closing explicitly solves recursive lock issues (#13304) 2024-02-05 10:07:03 +01:00
Daniël de Kok
1052cba9f3
Merge pull request #13299 from danieldk/copy/master
Sync main with latest changes from master (v3)
2024-02-04 15:40:55 +01:00
Daniël de Kok
40422ff904
Set version to 3.7.3 (#13301) 2024-02-02 13:51:26 +01:00
Daniël de Kok
2dbb332cea
TextCatParametricAttention.v1: set key transform dimensions (#13249)
* TextCatParametricAttention.v1: set key transform dimensions

This is necessary for tok2vec implementations that initialize
lazily (e.g. curated transformers).

* Add lazily-initialized tok2vec to simulate transformers

Add a lazily-initialized tok2vec to the tests and test the current
textcat models with it.

Fix some additional issues found using this test.

* isort

* Add `test.` prefix to `LazyInitTok2Vec.v1`
2024-02-02 13:01:59 +01:00
Daniël de Kok
2d4067d021 Test if closing explicitly solves recursive lock issues 2024-02-02 11:39:07 +01:00
Daniël de Kok
68d7841df5
Extension serialization attr tests: add teardown (#13284)
The doc/token extension serialization tests add extensions that are not
serializable with pickle. This didn't cause issues before due to the
implicit run order of tests. However, test ordering has changed with
pytest 8.0.0, leading to failed tests in test_language.

Update the fixtures in the extension serialization tests to do proper
teardown and remove the extensions.
2024-01-29 13:51:56 +01:00
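The setup/teardown pattern that 68d7841df5 refers to can be sketched without pytest or spaCy. Everything here is a stand-in: `EXTENSIONS` is a hypothetical global registry playing the role of the doc/token extension registry, and a context manager plays the role of a fixture that yields and then cleans up.

```python
from contextlib import contextmanager
from typing import Dict, Iterator

# Hypothetical global registry standing in for registered extensions.
EXTENSIONS: Dict[str, object] = {}


@contextmanager
def temporary_extension(name: str, default: object = None) -> Iterator[None]:
    """Register an extension for the duration of the block, then remove it."""
    EXTENSIONS[name] = default  # setup
    try:
        yield
    finally:
        EXTENSIONS.pop(name, None)  # teardown runs even if the body raises


def check_uses_extension() -> bool:
    # Registers a value that pickle cannot serialize (a lambda).
    with temporary_extension("unpicklable", default=lambda: None):
        return "unpicklable" in EXTENSIONS


def check_expects_clean_registry() -> bool:
    # Only passes if earlier checks cleaned up after themselves.
    return len(EXTENSIONS) == 0


first = check_uses_extension()
second = check_expects_clean_registry()
```

Because the cleanup happens in the `finally` clause, the second check no longer depends on the order in which the two checks run.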
Eliana Vornov
00e938a7c3
add custom code support to CLI speed benchmark (#13247)
* add custom code support to CLI speed benchmark

* sort imports

* better copying for warmup docs
2024-01-26 13:29:22 +01:00
Daniël de Kok
ce9ea9629f
Set version to v4.0.0.dev2 (#13269) 2024-01-25 12:54:23 +01:00
Daniël de Kok
9e97c730be Fix up requirements test
To account for build dependencies being removed from `setup.cfg`.
2024-01-24 17:18:49 +01:00
Daniël de Kok
e722284ff4 Construct TextCatEnsemble.v2 using helper function 2024-01-24 14:59:01 +01:00
Daniël de Kok
ce4ea5ffa7 Py_UNICODE is not compatible with 3.12 2024-01-24 13:08:56 +01:00
Daniël de Kok
c621e251b8 Typing fixes 2024-01-24 12:20:01 +01:00
Daniël de Kok
82ef6783a8 Merge remote-tracking branch 'upstream/master' into maintenance/v4-merge-master-20240119 2024-01-24 09:09:01 +01:00
Daniël de Kok
a8894a8946
Merge pull request #13240 from mauricesvp/patch-1
Fix typo in method name
2024-01-23 20:49:21 +01:00
Daniël de Kok
afac7fb650
test_find_available_port: use port 5001 (#13255)
macOS now uses port 5000 for the AirPlay receiver functionality, so this
test will always fail on a macOS desktop (unless AirPlay receiver
functionality is disabled like in CI).
2024-01-23 20:11:16 +01:00
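Why an occupied port makes such a test fail can be seen from a small probe written with the standard library. This is a sketch, not the function exercised by `test_find_available_port`: the name `find_available_port`, its signature and the `max_tries` parameter here are assumptions made for illustration.

```python
import socket


def find_available_port(start: int, host: str = "127.0.0.1", max_tries: int = 50) -> int:
    """Return the first port >= start on which a TCP socket can be bound."""
    stop = min(start + max_tries, 65536)  # never probe beyond the valid range
    for port in range(start, stop):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is taken (or not permitted), try the next one
            return port
    raise OSError(f"no free port found in range {start}-{stop - 1}")


# Occupy a port chosen by the OS, then ask for a free port starting there.
busy = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
busy.bind(("127.0.0.1", 0))
busy.listen(1)
busy_port = busy.getsockname()[1]

free_port = find_available_port(busy_port)
busy.close()
```

A test that asserts the starting port itself is returned only passes when nothing else is listening on it, which is the situation the commit describes for port 5000 on macOS. The probe is also inherently racy, since another process may claim the port after the probing socket is closed.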
Daniël de Kok
5a2ad4af4b Merge remote-tracking branch 'upstream/master' into patch-1 2024-01-23 19:53:20 +01:00
Daniël de Kok
128197a5fc
Properly clean up pipe multiprocessing workers (#13259)
Before this change, the workers of pipe call with n_process != 1 were
stopped by calling `terminate` on the processes. However, terminating a
process can leave queues, pipes, and other concurrent data structures in
an invalid state.

With this change, we stop using terminate and take the following approach
instead:

* When all the documents are processed, the parent process puts a
  sentinel in the queue of each worker.
* The parent process then calls `join` on each worker process to
  let them finish up gracefully.
* Worker processes break from the queue processing loop when the
  sentinel is encountered, so that they exit.

We need special handling when one of the workers encounters an error and
the error handler is set to raise an exception. In this case, we cannot
rely on the sentinel to finish all workers -- the queue is a FIFO queue
and there may be other work queued up before the sentinel. We use the
following approach to handle error scenarios:

* The parent puts the end-of-work sentinel in the queue of each worker.
* The parent closes the reading-end of the channel of each worker.
* Then:
  - If the worker was waiting for work, it will encounter the sentinel
    and break from the processing loop.
  - If the worker was processing a batch, it will attempt to write
    results to the channel. This will fail because the channel was
    closed by the parent and the worker will break from the processing
    loop.
2024-01-23 18:33:04 +01:00
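The normal-path shutdown described in 128197a5fc (put a sentinel in each worker's queue, then `join`) can be sketched as follows. To keep the example short and runnable anywhere, it uses threads and `queue.Queue` as a stand-in for worker processes and multiprocessing queues; the function names are hypothetical, and the error path that closes the result channel is not modelled.

```python
import queue
import threading
from typing import List

_SENTINEL = None  # end-of-work marker; real work items are never None


def worker(inbox: "queue.Queue", results: List[str], lock: threading.Lock) -> None:
    """Process items until the sentinel is seen, then exit the loop."""
    while True:
        item = inbox.get()
        if item is _SENTINEL:
            break  # graceful exit, nothing is terminated from outside
        with lock:
            results.append(item.upper())


def run(texts: List[str], n_workers: int = 2) -> List[str]:
    results: List[str] = []
    lock = threading.Lock()
    inboxes = [queue.Queue() for _ in range(n_workers)]
    workers = [
        threading.Thread(target=worker, args=(inbox, results, lock))
        for inbox in inboxes
    ]
    for w in workers:
        w.start()
    # Distribute the work round-robin over the per-worker queues.
    for i, text in enumerate(texts):
        inboxes[i % n_workers].put(text)
    # The queues are FIFO, so each sentinel is seen only after all earlier work.
    for inbox in inboxes:
        inbox.put(_SENTINEL)
    # Wait for every worker to finish instead of terminating it.
    for w in workers:
        w.join()
    return results


processed = run(["a", "b", "c"])
```

The FIFO ordering that makes this safe in the normal case is also why the commit needs separate handling for errors: a sentinel queued behind pending work cannot stop a worker immediately.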
Daniël de Kok
81beaea70e Merge remote-tracking branch 'upstream/master' into maintenance/v4-merge-master-20240119 2024-01-19 12:34:29 +01:00
Daniël de Kok
9972333ef9 Temporarily xfail local remote storage test 2024-01-17 10:20:40 +01:00