Commit Graph

16314 Commits

Author SHA1 Message Date
Matthew Honnibal
acb44f8e73 Fix meta writing for numpy conversion 2024-10-02 01:10:04 +02:00
Matthew Honnibal
75d097155d Replace numpy floats in update and evaluate 2024-10-02 01:06:23 +02:00
Matthew Honnibal
5d0d2de955 Support 'memory zones' for user memory management
Add a context manage nlp.memory_zone(), which will begin
memory_zone() blocks on the vocab, string store, and potentially
other components.

Once the memory_zone() block expires, spaCy will free any shared
resources that were allocated for the text-processing that occurred
within the memory_zone. If you create Doc objects within a memory
zone, it's invalid to access them once the memory zone is expired.

The purpose of this is that spaCy creates and stores Lexeme objects
in the Vocab that can be shared between multiple Doc objects. It also
interns strings. Normally, spaCy can't know when all Doc objects using
a Lexeme are out-of-scope, so new Lexemes accumulate in the vocab,
causing memory pressure.

Memory zones solve this problem by telling spaCy "okay none of the
documents allocated within this block will be accessed again". This
lets spaCy free all new Lexeme objects and other data that were
created during the block.

The mechanism is general, so memory_zone() context managers can be
added to other components that could benefit from them, e.g. pipeline
components.

I experimented with adding memory zone support to the tokenizer as well,
for its cache. However, this seems unnecessarily complicated. It makes
more sense to just stick a limit on the cache size. This lets spaCy
benefit from the efficiency advantage of the cache better, because
we can maintain a (bounded) cache even if only small batches of
documents are being processed.
2024-09-08 13:06:54 +02:00
Matthew Honnibal
a559cde432 Update about 2024-09-07 00:47:09 +02:00
Matthew Honnibal
b4e60e3151 Fix dump meta 2024-09-07 00:46:48 +02:00
Matthew Honnibal
ae6910b09b Bump version 2024-09-06 22:23:41 +02:00
Matthew Honnibal
3bc5846e83 Fix serialization for uk trf model 2024-09-06 22:23:25 +02:00
Matthew Honnibal
2a37f97365 Increment version 2024-09-04 14:31:07 +02:00
Matthew Honnibal
3ee1b2bd1f Fix Spanish lemmatizer 2024-09-04 14:29:34 +02:00
Matthew Honnibal
6f7590bbf1 Revert "Fix apparent bug in Spanish lemmatizer. Not sure why this emerges in v4 not in v3"
This reverts commit 64b22be76e.
2024-09-04 14:26:39 +02:00
Matthew Honnibal
64b22be76e Fix apparent bug in Spanish lemmatizer. Not sure why this emerges in v4 not in v3 2024-09-04 14:22:13 +02:00
Matthew Honnibal
4eec3bfad1 Bump version 2024-09-02 13:16:15 +02:00
Matthew Honnibal
9e7421d45f Relax cupy-cuda pins to allow numpy v2 2024-09-02 13:15:53 +02:00
Matthew Honnibal
b9ecb15439 Bump version 2024-09-02 12:36:28 +02:00
Matthew Honnibal
3ccec6af7a Update thinc pin 2024-09-02 12:35:56 +02:00
Matthew Honnibal
a5ba7e4716 Bump dev version 2024-09-02 10:10:43 +02:00
Matthew Honnibal
d558e79823 Pin numpy to v2 2024-09-02 10:10:14 +02:00
Matthew Honnibal
77abf0828a Pin numpy to v2 2024-09-02 10:09:55 +02:00
Matthew Honnibal
304a8539e9 Bump dev version 2024-09-02 01:45:38 +02:00
Matthew Honnibal
f4c8fdfaad Update cli.package for removed spacy.vectors.name attr 2024-09-01 16:43:49 +02:00
Sofie Van Landeghem
818fdb537e
Merge pull request #13490 from svlandeg/feat/update_v4
Update v4 branch with latest from master
2024-05-14 22:41:17 +02:00
svlandeg
e32a394ff0 fix the fix for textcat init functionality 2024-05-14 18:45:51 +02:00
svlandeg
5992e927b9 fix textcat init functionality 2024-05-14 18:38:11 +02:00
svlandeg
c27679f210 Merge branch 'master' into feat/update_v4 2024-05-14 17:42:48 +02:00
Sofie Van Landeghem
c195ca4f9c
fix docs for MorphAnalysis.__contains__ (#13433) 2024-05-02 16:46:41 +02:00
Sofie Van Landeghem
d3a232f773
Update LICENSE to include 2024 (#13472) 2024-04-30 09:17:59 +02:00
Sofie Van Landeghem
ecd85d2618
Update Typer pin and GH actions (#13471)
* update gh actions

* pin typer upperbound to 1.0.0
2024-04-29 13:28:46 +02:00
Alex Strick van Linschoten
045cd43c3f
Fix typos in docs (#13466)
* fix typos

* prettier formatting

---------

Co-authored-by: svlandeg <svlandeg@github.com>
2024-04-29 11:10:17 +02:00
Sofie Van Landeghem
74836524e3
Bump to v5 (#13470) 2024-04-29 10:36:31 +02:00
Sofie Van Landeghem
6d6c10ab9c
Fix CI (#13469)
* Remove hardcoded architecture setting

* update classifiers to include Python 3.12
2024-04-29 10:18:07 +02:00
Sofie Van Landeghem
287deee02c
remove empty file (#13458) 2024-04-26 10:04:16 +02:00
Daniël de Kok
b2ca7253d2
Document TrainablePipe.save_activations (#13452)
* Document `TrainablePipe.save_activations`

* Fully qualified links

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* prettier

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2024-04-23 09:21:23 +02:00
Daniël de Kok
f5918d4353
Update to Thinc 9.0.0 and set version to 4.0.0.dev3 (#13448)
* Update to Thinc 9.0.0 and set version to 4.0.0.dev3

* Set minimum Python version to 3.9
2024-04-22 09:40:55 +02:00
Daniël de Kok
5bd141013b
Remove apple from extras (#13439)
Account for merging of `thinc-apple-ops` into `thinc`.
2024-04-17 13:43:27 +02:00
Daniël de Kok
8696861c8c
Update spacy-curated-transformers docs for spaCy 4 (#13440)
- Update model constructors to v2 and add `dtype` argument.
- Update to `PyTorchCheckpointLoader` to `v2`.
- Add `transformer_discriminative.v1`.
2024-04-16 12:06:58 +02:00
Sofie Van Landeghem
2e2334632b
Fix use_gold_ents behaviour for EntityLinker (#13400)
* fix type annotation in docs

* only restore entities after loss calculation

* restore entities of sample in initialization

* rename overfitting function

* fix EL scorer

* Relax test

* fix formatting

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* rename to _ensure_ents

* further rename

* allow for scorer to be None

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2024-04-16 12:00:22 +02:00
Joe Schiff
2e96797696
Convert properties to decorator syntax (#13390) 2024-04-16 11:51:14 +02:00
Daniël de Kok
fbc14aea45
Add distill subcommand (#13431)
* Add distill subcommand

This subcommand distills a student model from a teacher model.

* Fixes from Sofie

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Type and doc fixes

* Wording

* distill: document missing `-o`

* Wording

* Small fix

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2024-04-11 19:33:46 +02:00
Raphael Mitsch
304b9331e6
Modify EL batching to doc-wise streaming approach (#12367)
* Convert Candidate from Cython to Python class.

* Format.

* Fix .entity_ typo in _add_activations() usage.

* Change type for mentions to look up entity candidates for to SpanGroup from Iterable[Span].

* Update docs.

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update doc string of BaseCandidate.__init__().

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate.

* Adjust Candidate to support and mandate numerical entity IDs.

* Format.

* Fix docstring and docs.

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename alias -> mention.

* Refactor Candidate attribute names. Update docs and tests accordingly.

* Refacor Candidate attributes and their usage.

* Format.

* Fix mypy error.

* Update error code in line with v4 convention.

* Modify EL batching system.

* Update leftover get_candidates() mention in docs.

* Format docs.

* Format.

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Updated error code.

* Simplify interface for int/str representations.

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename 'alias' to 'mention'.

* Port Candidate and InMemoryCandidate to Cython.

* Remove redundant entry in setup.py.

* Add abstract class check.

* Drop storing mention.

* Update spacy/kb/candidate.pxd

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix entity_id refactoring problems in docstrings.

* Drop unused InMemoryCandidate._entity_hash.

* Update docstrings.

* Move attributes out of Candidate.

* Partially fix alias/mention terminology usage. Convert Candidate to interface.

* Remove prior_prob from supported properties in Candidate. Introduce KnowledgeBase.supports_prior_probs().

* Update docstrings related to prior_prob.

* Update alias/mention usage in doc(strings).

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Mention -> alias renaming. Drop Candidate.mentions(). Drop InMemoryLookupKB.get_alias_candidates() from docs.

* Update docstrings.

* Fix InMemoryCandidate attribute names.

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update W401 test.

* Update spacy/errors.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Use Candidate output type for toy generators in the test suite to mimick best practices

* fix docs

* fix import

* Fix merge leftovers.

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/entitylinker.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/kb/kb_in_memory.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/inmemorylookupkb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update get_candidates() docstring.

* Reformat imports in entity_linker.py.

* Drop valid_ent_idx_per_doc.

* Update docs.

* Format.

* Simplify doc loop in predict().

* Remove E1044 comment.

* Fix merge errors.

* Format.

* Format.

* Format.

* Fix merge error & tests.

* Format.

* Apply suggestions from code review

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Use type alias.

* isort.

* isort.

* Lint.

* Add typedefs.pyx.

* Fix typedef import.

* Fix type aliases.

* Format.

* Update docstring and type usage.

* Add info on get_candidates(), get_candidates_batched().

* Readd get_candidates info to v3 changelog.

* Update website/docs/api/entitylinker.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update factory functions for backwards compatibility.

* Format.

* Ignore mypy error.

* Fix mypy error.

* Format.

* Add test for multiple docs with multiple entities.

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2024-04-09 11:39:18 +02:00
Sofie Van Landeghem
f5e85fa05a
allow weasel 0.4.x (#13409) 2024-04-04 12:55:08 +02:00
Yaseen
21aea59001
Update code.module.sass to make code title sticky (#13379) 2024-03-26 12:15:25 +01:00
Sofie Van Landeghem
4dc5fe5469
Renamed main branch back to v4 for now (#13395)
* Update gputests.yml

* Update slowtests.yml
2024-03-26 09:53:07 +01:00
Ines Montani
1252370f69 Move DocSearch key to env var [ci skip] 2024-03-25 10:17:57 +01:00
Sofie Van Landeghem
d410d95b52
remove smart_open requirement as it's taken care of via Weasel (#13391) 2024-03-22 18:21:20 +01:00
Matthew Honnibal
0518c36f04
Sanitize direct download (#13313)
The 'direct' option in 'spacy download' is supposed to only download from our model releases repository. However, users were able to pass in a relative path, allowing download from arbitrary repositories. This meant that a service that sourced strings from user input and which used the direct option would allow users to install arbitrary packages.
2024-02-20 13:17:51 +01:00
Daniël de Kok
bff8725f4b
Set version to 3.7.4 (#13327) 2024-02-14 14:46:28 +01:00
Daniël de Kok
fdfdbcd9f4
Make Language.pipe workers exit cleanly (#13321)
Also warn when any worker exited with a non-zero exit code and modify
test to ensure that workers exit cleanly by default.
2024-02-12 14:39:38 +01:00
Daniël de Kok
14bd9d89a3
Update example that shows model in requirments (#13302)
See #13293.
2024-02-11 19:46:43 +01:00
Adriane Boyd
afb22ad491
Remove debug data normalization for span analysis (#13203)
* Remove debug data normalization for span analysis

As a result of this normalization, `debug data` could show a user tokens
that do not exist in their data.

* Update spacy/cli/debug_data.py

---------

Co-authored-by: svlandeg <svlandeg@github.com>
2024-02-06 14:14:55 +01:00
Daniël de Kok
e1249d3722
Test if closing explicitly solves recursive lock issues (#13304) 2024-02-05 10:07:03 +01:00