Commit Graph

9357 Commits

Author SHA1 Message Date
Raphael Mitsch
1f2685029f Reverse erroneous changes during merge. 2023-03-20 09:30:39 +01:00
Raphael Mitsch
1620a04d46 Merge branch 'v4' into refactor/span-group-for-mentions
# Conflicts:
#	spacy/errors.py
#	spacy/kb/__init__.py
#	spacy/kb/candidate.pxd
#	spacy/kb/candidate.pyx
#	spacy/kb/kb.pyx
#	spacy/kb/kb_in_memory.pyx
#	spacy/ml/models/entity_linker.py
#	spacy/pipeline/entity_linker.py
#	spacy/tests/pipeline/test_entity_linker.py
#	spacy/tests/serialize/test_serialize_kb.py
#	website/docs/api/inmemorylookupkb.mdx
#	website/docs/api/kb.mdx
2023-03-20 09:29:52 +01:00
Raphael Mitsch
9340eb8ad2
Introduce hierarchy for EL Candidate objects (#12341)
* Convert Candidate from Cython to Python class.

* Format.

* Fix .entity_ typo in _add_activations() usage.

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update doc string of BaseCandidate.__init__().

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate.

* Adjust Candidate to support and mandate numerical entity IDs.

* Format.

* Fix docstring and docs.

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename alias -> mention.

* Refactor Candidate attribute names. Update docs and tests accordingly.

* Refactor Candidate attributes and their usage.

* Format.

* Fix mypy error.

* Update error code in line with v4 convention.

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Updated error code.

* Simplify interface for int/str representations.

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename 'alias' to 'mention'.

* Port Candidate and InMemoryCandidate to Cython.

* Remove redundant entry in setup.py.

* Add abstract class check.

* Drop storing mention.

* Update spacy/kb/candidate.pxd

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix entity_id refactoring problems in docstrings.

* Drop unused InMemoryCandidate._entity_hash.

* Update docstrings.

* Move attributes out of Candidate.

* Partially fix alias/mention terminology usage. Convert Candidate to interface.

* Remove prior_prob from supported properties in Candidate. Introduce KnowledgeBase.supports_prior_probs().

* Update docstrings related to prior_prob.

* Update alias/mention usage in doc(strings).

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Mention -> alias renaming. Drop Candidate.mentions(). Drop InMemoryLookupKB.get_alias_candidates() from docs.

* Update docstrings.

* Fix InMemoryCandidate attribute names.

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update W401 test.

* Update spacy/errors.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Use Candidate output type for toy generators in the test suite to mimic best practices

* fix docs

* fix import

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-20 00:34:35 +01:00
Adriane Boyd
6ae7618418
Clean up Vocab constructor (#12290)
* Clean up Vocab constructor

* Change effective type of `strings` from `Iterable[str]` to `Optional[StringStore]`
  * Don't automatically add strings to vocab
* Change default values to `None`
* Remove `**deprecated_kwargs`

* Format
2023-03-19 23:41:20 +01:00
Madeesh Kannan
520279ff7c
Tok2Vec: Add distill method (#12108)
* `Tok2Vec`: Add `distill` method

* `Tok2Vec`: Refactor `update`

* Add `Tok2Vec.distill` test

* Update `distill` signature to accept `Example`s instead of separate teacher and student docs

* Add docs

* Remove docstring

* Update test

* Remove `update` calls from test

* Update `Tok2Vec.distill` docstring
2023-03-09 09:37:19 +01:00
Raphael Mitsch
41b3a0d932
Drop support for EntityLinker_v1. (#12377) 2023-03-07 13:10:45 +01:00
Raphael Mitsch
86703da8b7 Merge branch 'refactor/el-candidates' into refactor/span-group-for-mentions
# Conflicts:
#	spacy/pipeline/entity_linker.py
2023-03-07 09:10:10 +01:00
Raphael Mitsch
8dbb74c9c0 Merge branch 'v4' into refactor/el-candidates 2023-03-07 09:06:51 +01:00
Raphael Mitsch
749e446ee3 Merge branch 'master' into sync/master-into-v4
# Conflicts:
#	.github/azure-steps.yml
2023-03-06 16:27:56 +01:00
Adriane Boyd
0bbc620dd8
Partially work around pending deprecation of pkg_resources (#12368)
* Handle deprecation of pkg_resources

* Replace `pkg_resources` with `importlib_metadata` for `spacy info --url`
* Remove requirements check from `spacy project` given the lack of
alternatives

* Fix installed model URL method and CI test

* Fix types/handling, simplify catch-all return

* Move imports instead of disabling requirements check

* Format

* Reenable test with ignored deprecation warning

* Fix except

* Fix return
2023-03-06 14:48:57 +01:00
Raphael Mitsch
2ac586fdb5 Update error code in line with v4 convention. 2023-03-05 14:43:32 +01:00
Raphael Mitsch
670e1ca7c5 Fix mypy error. 2023-03-05 14:33:32 +01:00
Raphael Mitsch
5f40b3e523 Format. 2023-03-05 14:14:16 +01:00
Raphael Mitsch
38dce966e5 Refactor Candidate attributes and their usage. 2023-03-05 13:49:13 +01:00
Raphael Mitsch
94e57d0ed5 Refactor Candidate attribute names. Update docs and tests accordingly. 2023-03-03 11:08:17 +01:00
Raphael Mitsch
46fe069f87 Rename alias -> mention. 2023-03-03 10:29:53 +01:00
Raphael Mitsch
3beda2b23a Merge branch 'refactor/el-candidates' into refactor/span-group-for-mentions
# Conflicts:
#	spacy/ml/models/entity_linker.py
#	website/docs/api/inmemorylookupkb.mdx
2023-03-03 08:32:38 +01:00
Raphael Mitsch
1ea31552be Merge branch 'master' into sync/master-into-v4
# Conflicts:
#	requirements.txt
#	spacy/pipeline/entity_linker.py
#	spacy/util.py
#	website/docs/api/entitylinker.mdx
2023-03-02 16:24:15 +01:00
Raphael Mitsch
6aa6b86d49
Make generation of empty KnowledgeBase instances configurable in EntityLinker (#12320)
* Make empty_kb() configurable.

* Format.

* Update docs.

* Be more specific in KB serialization test.

* Update KB serialization tests. Update docs.

* Remove doc update for batched candidate generation.

* Fix serialization of subclassed KB in tests.

* Format.

* Update docstring.

* Update docstring.

* Switch from pickle to json for custom field serialization.
2023-03-01 16:02:55 +01:00
Adriane Boyd
da75896ef5
Return Tuple[Span] for all Doc/Span attrs that provide spans (#12288)
* Return Tuple[Span] for all Doc/Span attrs that provide spans

* Update Span types
2023-03-01 16:00:02 +01:00
Raphael Mitsch
9bd498cdae Fix docstring and docs. 2023-03-01 15:09:24 +01:00
Raphael Mitsch
257bca3959 Format. 2023-03-01 14:54:03 +01:00
Raphael Mitsch
fa390618c8 Adjust Candidate to support and mandate numerical entity IDs. 2023-03-01 14:50:58 +01:00
Raphael Mitsch
49abf4fb3a Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate. 2023-03-01 14:27:50 +01:00
Raphael Mitsch
417e8fea8b
Update spacy/kb/candidate.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-01 13:51:33 +01:00
Raphael Mitsch
21fa22de08 Merge branch 'refactor/el-candidates' of github.com:rmitsch/spaCy into refactor/el-candidates 2023-03-01 13:48:46 +01:00
Raphael Mitsch
3da0712582 Update doc string of BaseCandidate.__init__(). 2023-03-01 13:15:38 +01:00
Raphael Mitsch
0680958476
Update spacy/kb/candidate.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-01 12:42:08 +01:00
Sofie Van Landeghem
74cae47bf6
rely on is_empty property instead of __len__ (#12347) 2023-03-01 12:06:07 +01:00
Adriane Boyd
8f058e39bd
Fix error message for displacy auto_select_port (#12343) 2023-02-28 16:36:03 +01:00
Raphael Mitsch
8596fb8b88 Change type of the mentions used to look up entity candidates from Iterable[Span] to SpanGroup. 2023-02-28 15:28:05 +01:00
TAN Long
071667376a
Add new REL_OPs: >+, >-, <+, and <- (#12334)
* Add immediate left/right child/parent dependency relations

* Add tests for new REL_OPs: `>+`, `>-`, `<+`, and `<-`.

---------

Co-authored-by: Tan Long <tanloong@foxmail.com>
2023-02-28 14:36:33 +01:00
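The four operators added in this commit can be read as "immediate" variants of the existing child/parent relations. The following is a minimal sketch in plain Python, not spaCy's implementation: it assumes `>+`/`>-` mean "B is the immediately following/preceding child of A" and `<+`/`<-` mean "B is the immediately following/preceding parent of A". The `heads` list, `is_parent` and `holds` are hypothetical names used only for illustration.

```python
def is_parent(heads, parent, child):
    """True if `parent` is the head of `child` (the root is its own head)."""
    return parent != child and heads[child] == parent


# Assumed semantics, with a and b as token indices:
#   a >+ b : b is a child of a, directly to its right
#   a >- b : b is a child of a, directly to its left
#   a <+ b : b is the parent of a, directly to its right
#   a <- b : b is the parent of a, directly to its left
REL_OPS = {
    ">+": lambda heads, a, b: is_parent(heads, a, b) and a == b - 1,
    ">-": lambda heads, a, b: is_parent(heads, a, b) and a == b + 1,
    "<+": lambda heads, a, b: is_parent(heads, b, a) and a == b - 1,
    "<-": lambda heads, a, b: is_parent(heads, b, a) and a == b + 1,
}


def holds(op, heads, a, b):
    """Check whether relation `op` holds between tokens a and b."""
    return REL_OPS[op](heads, a, b)


# Toy parse of "the big cat sat down": heads[i] is the head index of token i.
words = ["the", "big", "cat", "sat", "down"]
heads = [2, 2, 3, 3, 3]

print(holds(">-", heads, 2, 1))  # cat >- big: adjacent left child
print(holds(">-", heads, 2, 0))  # cat >- the: a child, but not adjacent
print(holds(">+", heads, 3, 4))  # sat >+ down: adjacent right child
print(holds("<+", heads, 2, 3))  # cat <+ sat: adjacent right parent
```

In the real library these operators would be used as the `REL_OP` value of a `DependencyMatcher` pattern; the sketch only spells out the index arithmetic.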
Raphael Mitsch
a97ef65b33 Fix .entity_ typo in _add_activations() usage. 2023-02-28 14:22:27 +01:00
Raphael Mitsch
5a9d8ba73c Format. 2023-02-28 13:56:13 +01:00
Raphael Mitsch
cd98ab4e95 Convert Candidate from Cython to Python class. 2023-02-28 13:49:52 +01:00
lise-brinck
e2de188cf1
Bugfix/swedish tokenizer (#12315)
* add unittest for explosion#12311

* create punctuation.py for swedish

* removed : from infixes in swedish punctuation.py

* allow : as infix if succeeding char is uppercase
2023-02-27 10:53:45 +01:00
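The last bullet of this commit (split on `:` only when the next character is uppercase) can be illustrated with a small regular expression. This is a sketch under assumptions: the pattern and the `split_colon_infix` helper are hypothetical and are not the rules in spaCy's Swedish `punctuation.py`; the uppercase class is limited to `A-Z` plus `Å`, `Ä`, `Ö`.

```python
import re

# Hypothetical infix rule: a colon counts as an infix only when the
# character right after it is an uppercase letter.
COLON_INFIX = re.compile(r"(:)(?=[A-ZÅÄÖ])")


def split_colon_infix(text):
    """Split `text` around colons that are followed by an uppercase letter."""
    return [piece for piece in COLON_INFIX.split(text) if piece]


print(split_colon_infix("EU:s"))        # colon + lowercase: kept as one token
print(split_colon_infix("slut:Nästa"))  # colon + uppercase: split into three
```

Keeping the colon inside forms such as `EU:s` matters for Swedish, where a colon joins an abbreviation to its inflectional ending.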
Kevin Humphreys
acdd993071
Matcher performance fix for extension predicates: use shared key function (#12272)
* standardize predicate key format

* single key function

* Make optional args in key function keyword-only

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-02-27 08:35:08 +01:00
Adriane Boyd
df4c069a13
Remove backoff from .vector to .tensor (#12292) 2023-02-23 11:36:50 +01:00
Paul O'Leary McCann
1e8bac99f3
Add tests for projects to master (#12303)
* Add tests for projects to master

* Fix git clone related issues on Windows

* Add stat import
2023-02-23 10:22:57 +01:00
Daniël de Kok
e27c60a702
Reimplement distillation with oracle cut size (#12214)
* Improve the correctness of _parse_patch

* If there are no more actions, do not attempt to make further
  transitions, even if not all states are final.
* Assert that the number of actions for a step is the same as
  the number of states.

* Reimplement distillation with oracle cut size

The code for distillation with an oracle cut size was not reimplemented
after the parser refactor. We did not notice, because we did not have
tests for this functionality. This change brings back the functionality
and adds this to the parser tests.

* Rename states2actions to _states_to_actions for consistency

* Test distillation max cuts in NER

* Mark parser/NER tests as slow

* Typo

* Fix invariant in _states_diff_to_actions

* Rename _init_batch -> _init_batch_from_teacher

* Ninja edit the ninja edit

* Check that we raise an exception when we pass the incorrect number of actions

* Remove unnecessary get

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Write out condition more explicitly

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2023-02-21 15:47:18 +01:00
Adriane Boyd
80bc140533
Add grc to langs with lexeme norms in spacy-lookups-data (#12287) 2023-02-16 17:57:02 +01:00
Paul O'Leary McCann
dd3f138830
Use tempfile.TemporaryDirectory (#12285) 2023-02-16 11:08:55 +01:00
Adriane Boyd
b95123060a
Make Span.char_span optional args keyword-only (#12257)
* Make Span.char_span optional args keyword-only

* Make kb_id and following kw-only

* Format
2023-02-15 12:34:33 +01:00
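Making optional arguments keyword-only, as in the commit above, is done in Python with a bare `*` in the signature. The function below is a hypothetical stand-in, not the real `Span.char_span` signature; the parameter names and defaults are assumptions chosen to mirror the "kb_id and following" wording.

```python
def char_span_sketch(start_idx, end_idx, label="", *, kb_id="", alignment_mode="strict"):
    """Hypothetical signature: everything after the bare `*` must be named."""
    return {
        "start_idx": start_idx,
        "end_idx": end_idx,
        "label": label,
        "kb_id": kb_id,
        "alignment_mode": alignment_mode,
    }


# Passing kb_id by name works.
print(char_span_sketch(0, 5, "ORG", kb_id="Q1")["kb_id"])

# Passing it positionally is rejected by the interpreter.
try:
    char_span_sketch(0, 5, "ORG", "Q1")
except TypeError as err:
    print("TypeError:", err)
```

The benefit is that later parameters can be added or reordered without silently changing the meaning of existing positional calls.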
Edward
61b8454137
Adjust return type of registry.find (#12227)
* Fix registry find return type

* add dot

* Add type ignore for mypy

* update black formatting version

* add mypy ignore to package cli

* mypy type fix (for real)

* Update find description in spacy/util.py

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* adjust mypy directive

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-02-15 12:32:53 +01:00
Adriane Boyd
cbc2ae933e
Remove unused Span.char_span(id=) (#12250) 2023-02-08 14:46:07 +01:00
Adriane Boyd
cf85b81f34
Remove names for vectors (#12243)
* Remove names for vectors

Named vectors are basically a carry-over from v2 and aren't used for
anything.

* Format
2023-02-08 14:37:42 +01:00
Adriane Boyd
5089efa2d0
Use the same tuple in Span cmp and hash (#12251) 2023-02-08 14:28:34 +01:00
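The idea named in this commit title, deriving comparison and hashing from one shared tuple, can be sketched as follows. `ToySpan` and its fields are hypothetical and do not reflect which attributes spaCy's `Span` actually uses; the point is only that `__eq__`, `__lt__` and `__hash__` all read the same key, so objects that compare equal always hash equal.

```python
from functools import total_ordering


@total_ordering
class ToySpan:
    """Hypothetical span-like object; the fields are illustrative only."""

    def __init__(self, doc_id, start, end, label=""):
        self.doc_id = doc_id
        self.start = start
        self.end = end
        self.label = label

    def _key(self):
        # Single source of truth for both comparison and hashing.
        return (self.doc_id, self.start, self.end, self.label)

    def __eq__(self, other):
        if not isinstance(other, ToySpan):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, ToySpan):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):
        return hash(self._key())


a = ToySpan(1, 0, 2, "ORG")
b = ToySpan(1, 0, 2, "ORG")
c = ToySpan(1, 3, 4, "ORG")
print(a == b, hash(a) == hash(b), len({a, b, c}))
```

If the two methods used different tuples, two spans could compare equal yet land in different hash buckets, which breaks sets and dict keys.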
Daniël de Kok
eec5ccd72f
Language.update: ensure that tok2vec gets updated (#12136)
* `Language.update`: ensure that tok2vec gets updated

The components in a pipeline can be updated independently. However,
tok2vec implementations are an exception to this, since they depend on
listeners for their gradients. The update method of a tok2vec
implementation computes the tok2vec forward and passes this along with a
backprop function to the listeners. This backprop function accumulates
gradients for all the listeners. There are two ways in which the
accumulated gradients can be used to update the tok2vec weights:

1. Call the `finish_update` method of tok2vec *after* the `update`
   method is called on all of the pipes that use a tok2vec listener.
2. Pass an optimizer to the `update` method of tok2vec. In this
   case, tok2vec will give the last listener a special backprop
   function that calls `finish_update` on the tok2vec.

Unfortunately, `Language.update` did neither of these. Instead, it
immediately called `finish_update` on every pipe after `update`. As a
result, the tok2vec weights are updated when no gradients have been
accumulated from listeners yet. And the gradients of the listeners are
only used in the next call to `Language.update` (when `finish_update` is
called on tok2vec again).

This change fixes this issue by passing the optimizer to the `update`
method of trainable pipes, leading to use of the second strategy
outlined above.

The main updating loop in `Language.update` is also simplified by using
the `TrainableComponent` protocol consistently.

* Train loop: `sgd` is `Optional[Optimizer]`, do not pass false

* Language.update: call pipe finish_update after all pipe updates

This does correct and fast updates if multiple components update the
same parameters.

* Add comment why we moved `finish_update` to a separate loop
2023-02-03 15:22:25 +01:00
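The ordering problem described in this commit message can be reproduced with a toy model. This is a minimal sketch, assuming a shared encoder that receives its gradient from downstream listeners; `SharedEncoder`, `Listener` and the two loop functions are hypothetical names, not spaCy classes. It illustrates the "call `finish_update` after all pipe updates" bullet rather than the optimizer-passing strategy.

```python
class SharedEncoder:
    """Toy stand-in for a tok2vec-style pipe (hypothetical, not spaCy's class)."""

    def __init__(self):
        self.weight = 1.0
        self.grad = 0.0
        self.steps = []  # gradient consumed by each finish_update call

    def update(self):
        # Forward pass only: the gradient arrives later from the listeners.
        pass

    def backprop(self, d_weight):
        # Listeners call this to accumulate their gradient contributions.
        self.grad += d_weight

    def finish_update(self, lr=0.1):
        self.steps.append(self.grad)
        self.weight -= lr * self.grad
        self.grad = 0.0


class Listener:
    """Toy downstream pipe that sends a fixed gradient back to the encoder."""

    def __init__(self, encoder, d_weight):
        self.encoder = encoder
        self.d_weight = d_weight

    def update(self):
        self.encoder.backprop(self.d_weight)

    def finish_update(self, lr=0.1):
        pass  # no parameters of its own in this sketch


def make_pipeline():
    encoder = SharedEncoder()
    return encoder, [encoder, Listener(encoder, 0.5), Listener(encoder, 0.25)]


def update_interleaved(pipes):
    # Old behaviour described above: finish each pipe right after its update.
    for pipe in pipes:
        pipe.update()
        pipe.finish_update()


def update_then_finish(pipes):
    # Behaviour after the change: all updates first, then all finish_update calls.
    for pipe in pipes:
        pipe.update()
    for pipe in pipes:
        pipe.finish_update()


encoder, pipes = make_pipeline()
update_interleaved(pipes)
print(encoder.steps, encoder.grad)  # [0.0] 0.75: the step used no gradient

encoder, pipes = make_pipeline()
update_then_finish(pipes)
print(encoder.steps, encoder.grad)  # [0.75] 0.0: the step used both listeners
```

In the interleaved loop the encoder's weight step runs before any listener has reported a gradient, so the accumulated 0.75 is only applied on the next call. Moving `finish_update` into a second loop applies it in the same step.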
Sofie Van Landeghem
c47ec5b5c6
Merge pull request #12218 from adrianeboyd/chore/update-v4-from-master-7
Update v4 from master
2023-02-03 12:04:20 +01:00
Paul O'Leary McCann
89f974d4f5
Cleanup/remove backwards compat overwrite settings (#11888)
* Remove backwards-compatible overwrite from Entity Linker

This also adds a docstring about overwrite, since it wasn't present.

* Fix docstring

* Remove backward compat settings in Morphologizer

This also needed a docstring added.

For this component it's less clear what the right overwrite settings
are.

* Remove backward compat from sentencizer

This was simple

* Remove backward compat from senter

Another simple one

* Remove backward compat setting from tagger

* Add docstrings

* Update spacy/pipeline/morphologizer.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update docs

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-02-02 14:13:38 +01:00