Matthew Honnibal
acb44f8e73
Fix meta writing for numpy conversion
2024-10-02 01:10:04 +02:00
Matthew Honnibal
75d097155d
Replace numpy floats in update and evaluate
2024-10-02 01:06:23 +02:00
Matthew Honnibal
5d0d2de955
Support 'memory zones' for user memory management
...
Add a context manager nlp.memory_zone(), which will begin
memory_zone() blocks on the vocab, string store, and potentially
other components.
Once the memory_zone() block expires, spaCy will free any shared
resources that were allocated for the text-processing that occurred
within the memory_zone. If you create Doc objects within a memory
zone, it's invalid to access them once the memory zone is expired.
The purpose of this is that spaCy creates and stores Lexeme objects
in the Vocab that can be shared between multiple Doc objects. It also
interns strings. Normally, spaCy can't know when all Doc objects using
a Lexeme are out-of-scope, so new Lexemes accumulate in the vocab,
causing memory pressure.
Memory zones solve this problem by telling spaCy "okay none of the
documents allocated within this block will be accessed again". This
lets spaCy free all new Lexeme objects and other data that were
created during the block.
The mechanism is general, so memory_zone() context managers can be
added to other components that could benefit from them, e.g. pipeline
components.
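
As an illustration of the idea described above, here is a minimal toy sketch. It is not spaCy's implementation: the class ToyStringStore and its methods are hypothetical names invented for this example. It only shows the general pattern of recording what was allocated inside a zone and freeing it when the zone exits.

```python
from contextlib import contextmanager


class ToyStringStore:
    """Toy interning table (hypothetical; NOT spaCy's StringStore)."""

    def __init__(self):
        self._ids = {}          # string -> integer ID
        self._next_id = 0
        self._transient = None  # strings added inside the active zone, or None

    def add(self, text):
        """Intern a string and return its ID."""
        if text not in self._ids:
            self._ids[text] = self._next_id
            self._next_id += 1
            if self._transient is not None:
                # Remember that this entry only lives as long as the zone.
                self._transient.add(text)
        return self._ids[text]

    def __contains__(self, text):
        return text in self._ids

    def __len__(self):
        return len(self._ids)

    @contextmanager
    def memory_zone(self):
        """Free every string that was first added inside the block."""
        if self._transient is not None:
            raise RuntimeError("nested zones are not supported in this sketch")
        self._transient = set()
        try:
            yield self
        finally:
            for text in self._transient:
                del self._ids[text]
            self._transient = None


store = ToyStringStore()
store.add("the")                 # added outside any zone: permanent

with store.memory_zone():
    store.add("the")             # already known, stays permanent
    store.add("zebra")           # new, so it is tied to the zone
    size_inside = len(store)

size_after = len(store)
```

Inside the block the store holds two entries; once the block exits, only the entry that existed beforehand remains. Anything that still referred to "zebra" after the block would be pointing at freed data, which is why accessing Doc objects from an expired zone is invalid.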
I experimented with adding memory zone support to the tokenizer as well,
for its cache. However, this seems unnecessarily complicated. It makes
more sense to just stick a limit on the cache size. This lets spaCy
benefit from the efficiency advantage of the cache better, because
we can maintain a (bounded) cache even if only small batches of
documents are being processed.
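
A size-limited cache of the kind mentioned above could look roughly like the following sketch. The commit message does not say how the limit is enforced, so the class name BoundedCache and the evict-the-oldest-entry policy are assumptions made for illustration only.

```python
class BoundedCache:
    """Dictionary-backed cache holding at most `max_size` entries.

    Hypothetical sketch: eviction of the oldest inserted entry is an
    assumed policy, not necessarily what the tokenizer does.
    """

    def __init__(self, max_size):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        if key not in self._data and len(self._data) >= self.max_size:
            # Dicts preserve insertion order, so the first key is the oldest.
            oldest = next(iter(self._data))
            del self._data[oldest]
        self._data[key] = value

    def __len__(self):
        return len(self._data)


cache = BoundedCache(max_size=2)
cache.set("don't", ["do", "n't"])
cache.set("U.S.", ["U.S."])
cache.set("e.g.", ["e.g."])   # exceeds the limit, so "don't" is evicted
```

Because the bound is independent of batch size, the cache stays useful across many small batches while its memory use stays capped.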
2024-09-08 13:06:54 +02:00
Matthew Honnibal
a559cde432
Update about
2024-09-07 00:47:09 +02:00
Matthew Honnibal
b4e60e3151
Fix dump meta
2024-09-07 00:46:48 +02:00
Matthew Honnibal
ae6910b09b
Bump version
2024-09-06 22:23:41 +02:00
Matthew Honnibal
3bc5846e83
Fix serialization for uk trf model
2024-09-06 22:23:25 +02:00
Matthew Honnibal
2a37f97365
Increment version
2024-09-04 14:31:07 +02:00
Matthew Honnibal
3ee1b2bd1f
Fix Spanish lemmatizer
2024-09-04 14:29:34 +02:00
Matthew Honnibal
6f7590bbf1
Revert "Fix apparent bug in Spanish lemmatizer. Not sure why this emerges in v4 not in v3"
...
This reverts commit 64b22be76e.
2024-09-04 14:26:39 +02:00
Matthew Honnibal
64b22be76e
Fix apparent bug in Spanish lemmatizer. Not sure why this emerges in v4 not in v3
2024-09-04 14:22:13 +02:00
Matthew Honnibal
4eec3bfad1
Bump version
2024-09-02 13:16:15 +02:00
Matthew Honnibal
9e7421d45f
Relax cupy-cuda pins to allow numpy v2
2024-09-02 13:15:53 +02:00
Matthew Honnibal
b9ecb15439
Bump version
2024-09-02 12:36:28 +02:00
Matthew Honnibal
3ccec6af7a
Update thinc pin
2024-09-02 12:35:56 +02:00
Matthew Honnibal
a5ba7e4716
Bump dev version
2024-09-02 10:10:43 +02:00
Matthew Honnibal
d558e79823
Pin numpy to v2
2024-09-02 10:10:14 +02:00
Matthew Honnibal
77abf0828a
Pin numpy to v2
2024-09-02 10:09:55 +02:00
Matthew Honnibal
304a8539e9
Bump dev version
2024-09-02 01:45:38 +02:00
Matthew Honnibal
f4c8fdfaad
Update cli.package for removed spacy.vectors.name attr
2024-09-01 16:43:49 +02:00
Sofie Van Landeghem
818fdb537e
Merge pull request #13490 from svlandeg/feat/update_v4
...
Update v4 branch with latest from master
2024-05-14 22:41:17 +02:00
svlandeg
e32a394ff0
fix the fix for textcat init functionality
2024-05-14 18:45:51 +02:00
svlandeg
5992e927b9
fix textcat init functionality
2024-05-14 18:38:11 +02:00
svlandeg
c27679f210
Merge branch 'master' into feat/update_v4
2024-05-14 17:42:48 +02:00
Sofie Van Landeghem
c195ca4f9c
fix docs for MorphAnalysis.__contains__ ( #13433 )
2024-05-02 16:46:41 +02:00
Sofie Van Landeghem
d3a232f773
Update LICENSE to include 2024 ( #13472 )
2024-04-30 09:17:59 +02:00
Sofie Van Landeghem
ecd85d2618
Update Typer pin and GH actions ( #13471 )
...
* update gh actions
* pin typer upperbound to 1.0.0
2024-04-29 13:28:46 +02:00
Alex Strick van Linschoten
045cd43c3f
Fix typos in docs ( #13466 )
...
* fix typos
* prettier formatting
---------
Co-authored-by: svlandeg <svlandeg@github.com>
2024-04-29 11:10:17 +02:00
Sofie Van Landeghem
74836524e3
Bump to v5 ( #13470 )
2024-04-29 10:36:31 +02:00
Sofie Van Landeghem
6d6c10ab9c
Fix CI ( #13469 )
...
* Remove hardcoded architecture setting
* update classifiers to include Python 3.12
2024-04-29 10:18:07 +02:00
Sofie Van Landeghem
287deee02c
remove empty file ( #13458 )
2024-04-26 10:04:16 +02:00
Daniël de Kok
b2ca7253d2
Document TrainablePipe.save_activations ( #13452 )
...
* Document `TrainablePipe.save_activations`
* Fully qualified links
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* prettier
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2024-04-23 09:21:23 +02:00
Daniël de Kok
f5918d4353
Update to Thinc 9.0.0 and set version to 4.0.0.dev3 ( #13448 )
...
* Update to Thinc 9.0.0 and set version to 4.0.0.dev3
* Set minimum Python version to 3.9
2024-04-22 09:40:55 +02:00
Daniël de Kok
5bd141013b
Remove apple from extras ( #13439 )
...
Account for merging of `thinc-apple-ops` into `thinc`.
2024-04-17 13:43:27 +02:00
Daniël de Kok
8696861c8c
Update spacy-curated-transformers docs for spaCy 4 ( #13440 )
...
- Update model constructors to v2 and add `dtype` argument.
- Update `PyTorchCheckpointLoader` to `v2`.
- Add `transformer_discriminative.v1`.
2024-04-16 12:06:58 +02:00
Sofie Van Landeghem
2e2334632b
Fix use_gold_ents behaviour for EntityLinker ( #13400 )
...
* fix type annotation in docs
* only restore entities after loss calculation
* restore entities of sample in initialization
* rename overfitting function
* fix EL scorer
* Relax test
* fix formatting
* Update spacy/pipeline/entity_linker.py
Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
* rename to _ensure_ents
* further rename
* allow for scorer to be None
---------
Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2024-04-16 12:00:22 +02:00
Joe Schiff
2e96797696
Convert properties to decorator syntax ( #13390 )
2024-04-16 11:51:14 +02:00
Daniël de Kok
fbc14aea45
Add distill subcommand ( #13431 )
...
* Add distill subcommand
This subcommand distills a student model from a teacher model.
* Fixes from Sofie
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Type and doc fixes
* Wording
* distill: document missing `-o`
* Wording
* Small fix
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2024-04-11 19:33:46 +02:00
Raphael Mitsch
304b9331e6
Modify EL batching to doc-wise streaming approach ( #12367 )
...
* Convert Candidate from Cython to Python class.
* Format.
* Fix .entity_ typo in _add_activations() usage.
* Change the type of the mentions used for entity candidate lookup from Iterable[Span] to SpanGroup.
* Update docs.
* Update spacy/kb/candidate.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update doc string of BaseCandidate.__init__().
* Update spacy/kb/candidate.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate.
* Adjust Candidate to support and mandate numerical entity IDs.
* Format.
* Fix docstring and docs.
* Update website/docs/api/kb.mdx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename alias -> mention.
* Refactor Candidate attribute names. Update docs and tests accordingly.
* Refactor Candidate attributes and their usage.
* Format.
* Fix mypy error.
* Update error code in line with v4 convention.
* Modify EL batching system.
* Update leftover get_candidates() mention in docs.
* Format docs.
* Format.
* Update spacy/kb/candidate.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Updated error code.
* Simplify interface for int/str representations.
* Update website/docs/api/kb.mdx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename 'alias' to 'mention'.
* Port Candidate and InMemoryCandidate to Cython.
* Remove redundant entry in setup.py.
* Add abstract class check.
* Drop storing mention.
* Update spacy/kb/candidate.pxd
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix entity_id refactoring problems in docstrings.
* Drop unused InMemoryCandidate._entity_hash.
* Update docstrings.
* Move attributes out of Candidate.
* Partially fix alias/mention terminology usage. Convert Candidate to interface.
* Remove prior_prob from supported properties in Candidate. Introduce KnowledgeBase.supports_prior_probs().
* Update docstrings related to prior_prob.
* Update alias/mention usage in doc(strings).
* Update spacy/ml/models/entity_linker.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/ml/models/entity_linker.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Mention -> alias renaming. Drop Candidate.mentions(). Drop InMemoryLookupKB.get_alias_candidates() from docs.
* Update docstrings.
* Fix InMemoryCandidate attribute names.
* Update spacy/kb/kb.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/ml/models/entity_linker.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update W401 test.
* Update spacy/errors.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/kb/kb.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Use Candidate output type for toy generators in the test suite to mimic best practices
* fix docs
* fix import
* Fix merge leftovers.
* Update spacy/kb/kb.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/kb/kb.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/docs/api/kb.mdx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/docs/api/entitylinker.mdx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/kb/kb_in_memory.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/docs/api/inmemorylookupkb.mdx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update get_candidates() docstring.
* Reformat imports in entity_linker.py.
* Drop valid_ent_idx_per_doc.
* Update docs.
* Format.
* Simplify doc loop in predict().
* Remove E1044 comment.
* Fix merge errors.
* Format.
* Format.
* Format.
* Fix merge error & tests.
* Format.
* Apply suggestions from code review
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Use type alias.
* isort.
* isort.
* Lint.
* Add typedefs.pyx.
* Fix typedef import.
* Fix type aliases.
* Format.
* Update docstring and type usage.
* Add info on get_candidates(), get_candidates_batched().
* Readd get_candidates info to v3 changelog.
* Update website/docs/api/entitylinker.mdx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update factory functions for backwards compatibility.
* Format.
* Ignore mypy error.
* Fix mypy error.
* Format.
* Add test for multiple docs with multiple entities.
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2024-04-09 11:39:18 +02:00
Sofie Van Landeghem
f5e85fa05a
allow weasel 0.4.x ( #13409 )
2024-04-04 12:55:08 +02:00
Yaseen
21aea59001
Update code.module.sass to make code title sticky ( #13379 )
2024-03-26 12:15:25 +01:00
Sofie Van Landeghem
4dc5fe5469
Renamed main branch back to v4 for now ( #13395 )
...
* Update gputests.yml
* Update slowtests.yml
2024-03-26 09:53:07 +01:00
Ines Montani
1252370f69
Move DocSearch key to env var [ci skip]
2024-03-25 10:17:57 +01:00
Sofie Van Landeghem
d410d95b52
remove smart_open requirement as it's taken care of via Weasel ( #13391 )
2024-03-22 18:21:20 +01:00
Matthew Honnibal
0518c36f04
Sanitize direct download ( #13313 )
...
The 'direct' option in 'spacy download' is supposed to only download from our model releases repository. However, users were able to pass in a relative path, allowing download from arbitrary repositories. This meant that a service that sourced strings from user input and which used the direct option would allow users to install arbitrary packages.
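
One general way to guard against this class of problem is to resolve the user-supplied name against the trusted base URL and reject any result that escapes it. The sketch below is illustrative only: the base URL and the function name build_direct_url are hypothetical, and it is not claimed to be the code that this commit changed.

```python
from urllib.parse import urljoin

# Hypothetical base URL, used purely for illustration.
RELEASES_BASE = "https://example.invalid/models/releases/download/"


def build_direct_url(filename, base=RELEASES_BASE):
    """Join `filename` onto `base`, refusing anything that leaves `base`."""
    url = urljoin(base, filename)
    # Relative segments ("../"), absolute URLs and scheme-relative
    # names ("//host/...") all resolve to something outside `base`.
    if not url.startswith(base):
        raise ValueError(f"Refusing to download from outside {base!r}: {filename!r}")
    return url


ok_url = build_direct_url("pkg-1.0/pkg-1.0-py3-none-any.whl")

try:
    build_direct_url("../../other/repo/pkg-1.0-py3-none-any.whl")
    rejected = False
except ValueError:
    rejected = True
```

Checking the resolved URL, rather than scanning the input for suspicious substrings, means the check sees the same URL that would actually be requested.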
2024-02-20 13:17:51 +01:00
Daniël de Kok
bff8725f4b
Set version to 3.7.4 ( #13327 )
2024-02-14 14:46:28 +01:00
Daniël de Kok
fdfdbcd9f4
Make Language.pipe workers exit cleanly ( #13321 )
...
Also warn when any worker exited with a non-zero exit code and modify
test to ensure that workers exit cleanly by default.
2024-02-12 14:39:38 +01:00
Daniël de Kok
14bd9d89a3
Update example that shows model in requirements ( #13302 )
...
See #13293 .
2024-02-11 19:46:43 +01:00
Adriane Boyd
afb22ad491
Remove debug data normalization for span analysis ( #13203 )
...
* Remove debug data normalization for span analysis
As a result of this normalization, `debug data` could show a user tokens
that do not exist in their data.
* Update spacy/cli/debug_data.py
---------
Co-authored-by: svlandeg <svlandeg@github.com>
2024-02-06 14:14:55 +01:00
Daniël de Kok
e1249d3722
Test if closing explicitly solves recursive lock issues ( #13304 )
2024-02-05 10:07:03 +01:00